2026-02-21T08:04:43.3517266Z Current runner version: '2.331.0' 2026-02-21T08:04:43.3521070Z Runner name: 'dgxb200-03-1008' 2026-02-21T08:04:43.3521830Z Runner group name: 'default' 2026-02-21T08:04:43.3522401Z Machine name: '28f7d8a42320' 2026-02-21T08:04:43.3524130Z ##[group]GITHUB_TOKEN Permissions 2026-02-21T08:04:43.3525671Z Contents: read 2026-02-21T08:04:43.3526100Z Metadata: read 2026-02-21T08:04:43.3526471Z ##[endgroup] 2026-02-21T08:04:43.3527943Z Secret source: Actions 2026-02-21T08:04:43.3528456Z Prepare workflow directory 2026-02-21T08:04:43.3890010Z Prepare all required actions 2026-02-21T08:04:43.3917886Z Getting action download info 2026-02-21T08:04:43.8586486Z Download action repository 'actions/checkout@v6' (SHA:de0fac2e4500dabe0009e67214ff5f5447ce83dd) 2026-02-21T08:04:44.2199812Z Download action repository 'actions/setup-python@v6' (SHA:a309ff8b426b58ec0e2a45f0f869d46889d02405) 2026-02-21T08:04:44.6153382Z Download action repository 'astral-sh/setup-uv@v7' (SHA:eac588ad8def6316056a12d4907a9d4d84ff7a3b) 2026-02-21T08:04:45.1522727Z Download action repository 'pytorch/test-infra@main' (SHA:bb8f04ff3961233c844fde6533c7c6c5f0857909) 2026-02-21T08:04:45.7968844Z Download action repository 'actions/upload-artifact@v6' (SHA:b7c566a772e6b6bfb58ed0dc250532a479d7789f) 2026-02-21T08:04:46.2493549Z Getting action download info 2026-02-21T08:04:46.4437376Z Uses: pytorch/helion/.github/workflows/benchmark.yml@refs/heads/main (874a7d0cadab18218a84ad3579d329dc95c51820) 2026-02-21T08:04:46.4440019Z ##[group] Inputs 2026-02-21T08:04:46.4440299Z runner: linux.dgx.b200 2026-02-21T08:04:46.4440665Z python-version: 3.12 2026-02-21T08:04:46.4440942Z image: nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:04:46.4441262Z runtime-version: cu130 2026-02-21T08:04:46.4441641Z container-options: --gpus all 2026-02-21T08:04:46.4441934Z alias: b200 2026-02-21T08:04:46.4442185Z kernels: int4_gemm 2026-02-21T08:04:46.4442471Z env-vars: 2026-02-21T08:04:46.4442731Z custom-args: 
2026-02-21T08:04:46.4443173Z run_h100: true 2026-02-21T08:04:46.4443478Z run_b200: true 2026-02-21T08:04:46.4443708Z run_mi325x: true 2026-02-21T08:04:46.4443995Z ##[endgroup] 2026-02-21T08:04:46.4444342Z Complete job name: run-b200 (int4_gemm) / benchmark-cu130-int4_gemm-py3.12-b200 2026-02-21T08:04:46.4674043Z ##[group]Checking docker version 2026-02-21T08:04:46.4683175Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}' 2026-02-21T08:04:46.4848154Z '1.53' 2026-02-21T08:04:46.4859416Z Docker daemon API version: '1.53' 2026-02-21T08:04:46.4859823Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}' 2026-02-21T08:04:46.4994597Z '1.52' 2026-02-21T08:04:46.5006976Z Docker client API version: '1.52' 2026-02-21T08:04:46.5010395Z ##[endgroup] 2026-02-21T08:04:46.5011972Z ##[group]Clean up resources from previous jobs 2026-02-21T08:04:46.5014932Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=e3c87a" 2026-02-21T08:04:46.5122477Z ##[command]/usr/bin/docker network prune --force --filter "label=e3c87a" 2026-02-21T08:04:46.5217010Z ##[endgroup] 2026-02-21T08:04:46.5217308Z ##[group]Create local container network 2026-02-21T08:04:46.5223540Z ##[command]/usr/bin/docker network create --label e3c87a github_network_4610037d786e4ad48d234196e1acb7ad 2026-02-21T08:04:46.5549812Z 67b233e525ca70c21bce48c4fb4b3efb68f31d2d69dd0ae9620aa0e700706577 2026-02-21T08:04:46.5567186Z ##[endgroup] 2026-02-21T08:04:46.5584528Z ##[group]Starting job container 2026-02-21T08:04:46.5599063Z ##[command]/usr/bin/docker pull nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:04:47.3285924Z 13.0.1-devel-ubuntu24.04: Pulling from nvidia/cuda 2026-02-21T08:04:47.5977815Z 1cd98a0b9132: Pulling fs layer 2026-02-21T08:04:47.5980025Z 76249c7cd503: Pulling fs layer 2026-02-21T08:04:47.5980315Z c20926c42231: Pulling fs layer 2026-02-21T08:04:47.5980764Z eea924c2c8fb: Pulling fs layer 2026-02-21T08:04:47.5981039Z afcf80b42416: Pulling fs layer 
2026-02-21T08:04:47.5981252Z 8fb7ecb711ef: Pulling fs layer 2026-02-21T08:04:47.5982021Z 401d11fb2a09: Pulling fs layer 2026-02-21T08:04:47.5986655Z e93dd1223ff5: Pulling fs layer 2026-02-21T08:04:47.5989179Z d7913b78456a: Pulling fs layer 2026-02-21T08:04:47.5989509Z ab7341a40ee7: Pulling fs layer 2026-02-21T08:04:47.5989817Z c03b8ec8dd33: Pulling fs layer 2026-02-21T08:04:47.7721014Z 1cd98a0b9132: Download complete 2026-02-21T08:04:47.8716382Z c20926c42231: Download complete 2026-02-21T08:04:47.8727975Z c03b8ec8dd33: Download complete 2026-02-21T08:04:47.9727857Z afcf80b42416: Download complete 2026-02-21T08:04:47.9733076Z 401d11fb2a09: Download complete 2026-02-21T08:04:47.9740943Z d7913b78456a: Download complete 2026-02-21T08:04:47.9745763Z 8fb7ecb711ef: Download complete 2026-02-21T08:04:50.1724725Z ab7341a40ee7: Download complete 2026-02-21T08:04:50.2718731Z 76249c7cd503: Download complete 2026-02-21T08:04:51.7759415Z 76249c7cd503: Pull complete 2026-02-21T08:04:53.2729308Z 401d11fb2a09: Pull complete 2026-02-21T08:04:58.3753405Z ab7341a40ee7: Pull complete 2026-02-21T08:04:58.4720324Z c03b8ec8dd33: Pull complete 2026-02-21T08:04:58.4735120Z d7913b78456a: Pull complete 2026-02-21T08:05:17.6715835Z eea924c2c8fb: Download complete 2026-02-21T08:05:27.8718726Z e93dd1223ff5: Download complete 2026-02-21T08:05:53.3732876Z eea924c2c8fb: Pull complete 2026-02-21T08:05:54.0727211Z c20926c42231: Pull complete 2026-02-21T08:05:54.0731985Z afcf80b42416: Pull complete 2026-02-21T08:05:54.0739610Z 8fb7ecb711ef: Pull complete 2026-02-21T08:06:31.5723028Z 1cd98a0b9132: Pull complete 2026-02-21T08:06:31.5724920Z e93dd1223ff5: Pull complete 2026-02-21T08:06:31.5725252Z Digest: sha256:7d2f6a8c2071d911524f95061a0db363e24d27aa51ec831fcccf9e76eb72bc92 2026-02-21T08:06:31.5725684Z Status: Downloaded newer image for nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:06:31.5731241Z docker.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:06:31.5799923Z ##[command]/usr/bin/docker 
create --name 34d6440bddee4f2886b9a62f2ba8293b_nvidiacuda1301develubuntu2404_e0c1dd --label e3c87a --workdir /__w/helion/helion --network github_network_4610037d786e4ad48d234196e1acb7ad --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/frank/_work":"/__w" -v "/home/frank/externals":"/__e":ro -v "/home/frank/_work/_temp":"/__w/_temp" -v "/home/frank/_work/_actions":"/__w/_actions" -v "/home/frank/_work/_tool":"/__w/_tool" -v "/home/frank/_work/_temp/_github_home":"/github/home" -v "/home/frank/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" nvidia/cuda:13.0.1-devel-ubuntu24.04 "-f" "/dev/null" 2026-02-21T08:06:31.6323556Z 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c 2026-02-21T08:06:31.6344801Z ##[command]/usr/bin/docker start 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c 2026-02-21T08:06:31.9248955Z 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c 2026-02-21T08:06:31.9265399Z ##[command]/usr/bin/docker ps --all --filter id=21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c --filter status=running --no-trunc --format "{{.ID}} {{.Status}}" 2026-02-21T08:06:31.9407336Z 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c Up Less than a second 2026-02-21T08:06:31.9423770Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c 2026-02-21T08:06:31.9514154Z HOME=/github/home 2026-02-21T08:06:31.9515686Z GITHUB_ACTIONS=true 2026-02-21T08:06:31.9516055Z CI=true 2026-02-21T08:06:31.9516529Z PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:06:31.9516966Z NVARCH=x86_64 2026-02-21T08:06:31.9521711Z NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 
brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 
brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576 2026-02-21T08:06:31.9527120Z NV_CUDA_CUDART_VERSION=13.0.88-1 2026-02-21T08:06:31.9527363Z CUDA_VERSION=13.0.1 2026-02-21T08:06:31.9527699Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:31.9528074Z NVIDIA_VISIBLE_DEVICES=all 2026-02-21T08:06:31.9528358Z NVIDIA_DRIVER_CAPABILITIES=compute,utility 2026-02-21T08:06:31.9528607Z NV_CUDA_LIB_VERSION=13.0.1-1 2026-02-21T08:06:31.9528991Z NV_NVTX_VERSION=13.0.85-1 2026-02-21T08:06:31.9529221Z NV_LIBNPP_VERSION=13.0.1.2-1 2026-02-21T08:06:31.9529473Z NV_LIBNPP_PACKAGE=libnpp-13-0=13.0.1.2-1 2026-02-21T08:06:31.9529798Z NV_LIBCUSPARSE_VERSION=12.6.3.3-1 2026-02-21T08:06:31.9530058Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-13-0 2026-02-21T08:06:31.9530328Z NV_LIBCUBLAS_VERSION=13.0.2.14-1 2026-02-21T08:06:31.9530623Z NV_LIBCUBLAS_PACKAGE=libcublas-13-0=13.0.2.14-1 2026-02-21T08:06:31.9530966Z NV_LIBNCCL_PACKAGE_NAME=libnccl2 2026-02-21T08:06:31.9531203Z NV_LIBNCCL_PACKAGE_VERSION=2.28.3-1 2026-02-21T08:06:31.9531499Z NCCL_VERSION=2.28.3-1 2026-02-21T08:06:31.9531822Z NV_LIBNCCL_PACKAGE=libnccl2=2.28.3-1+cuda13.0 2026-02-21T08:06:31.9532069Z NVIDIA_PRODUCT_NAME=CUDA 2026-02-21T08:06:31.9532368Z NV_CUDA_CUDART_DEV_VERSION=13.0.88-1 2026-02-21T08:06:31.9532608Z NV_NVML_DEV_VERSION=13.0.87-1 2026-02-21T08:06:31.9532883Z NV_LIBCUSPARSE_DEV_VERSION=12.6.3.3-1 2026-02-21T08:06:31.9533148Z NV_LIBNPP_DEV_VERSION=13.0.1.2-1 2026-02-21T08:06:31.9533472Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-13-0=13.0.1.2-1 2026-02-21T08:06:31.9533744Z NV_LIBCUBLAS_DEV_VERSION=13.0.2.14-1 2026-02-21T08:06:31.9534036Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-13-0 2026-02-21T08:06:31.9534409Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-13-0=13.0.2.14-1 2026-02-21T08:06:31.9534696Z NV_CUDA_NSIGHT_COMPUTE_VERSION=13.0.1-1 2026-02-21T08:06:31.9535066Z NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-13-0=13.0.1-1 
2026-02-21T08:06:31.9535407Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev 2026-02-21T08:06:31.9535695Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.28.3-1 2026-02-21T08:06:31.9535943Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.28.3-1+cuda13.0 2026-02-21T08:06:31.9536285Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs 2026-02-21T08:06:31.9541750Z ##[endgroup] 2026-02-21T08:06:31.9548841Z ##[group]Waiting for all services to be ready 2026-02-21T08:06:31.9549924Z ##[endgroup] 2026-02-21T08:06:31.9681990Z ##[group]Run echo "Detected NVIDIA image" 2026-02-21T08:06:31.9682401Z echo "Detected NVIDIA image" 2026-02-21T08:06:31.9682747Z nvidia-smi || echo "nvidia-smi not found" 2026-02-21T08:06:31.9684950Z shell: bash -l {0} 2026-02-21T08:06:31.9685326Z env: 2026-02-21T08:06:31.9685558Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:31.9685796Z ##[endgroup] 2026-02-21T08:06:32.0287983Z Detected NVIDIA image 2026-02-21T08:06:32.0439071Z Sat Feb 21 08:06:32 2026 2026-02-21T08:06:32.0439455Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:06:32.0440053Z | NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 | 2026-02-21T08:06:32.0440495Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T08:06:32.0440917Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2026-02-21T08:06:32.0441503Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2026-02-21T08:06:32.0442058Z | | | MIG M. 
| 2026-02-21T08:06:32.0442412Z |=========================================+========================+======================| 2026-02-21T08:06:32.0719178Z | 0 NVIDIA B200 Off | 00000000:C3:00.0 Off | 0 | 2026-02-21T08:06:32.0722540Z | N/A 30C P0 139W / 750W | 0MiB / 183359MiB | 0% Default | 2026-02-21T08:06:32.0726935Z | | | Disabled | 2026-02-21T08:06:32.0728614Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T08:06:32.0729028Z 2026-02-21T08:06:32.0729250Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:06:32.0729798Z | Processes: | 2026-02-21T08:06:32.0730238Z | GPU GI CI PID Type Process name GPU Memory | 2026-02-21T08:06:32.0730629Z | ID ID Usage | 2026-02-21T08:06:32.0730946Z |=========================================================================================| 2026-02-21T08:06:32.0950633Z | No running processes found | 2026-02-21T08:06:32.0952582Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:06:32.1357706Z ##[group]Run set -x 2026-02-21T08:06:32.1357963Z set -x 2026-02-21T08:06:32.1358164Z apt-get update 2026-02-21T08:06:32.1358393Z apt-get install -y git 2026-02-21T08:06:32.1358707Z shell: bash -l {0} 2026-02-21T08:06:32.1358891Z env: 2026-02-21T08:06:32.1359110Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:32.1359330Z ##[endgroup] 2026-02-21T08:06:32.1850863Z + apt-get update 2026-02-21T08:06:32.2438634Z Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 InRelease [1581 B] 2026-02-21T08:06:32.3539085Z Get:2 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB] 2026-02-21T08:06:32.3634550Z Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 Packages [1218 kB] 2026-02-21T08:06:32.5538206Z Get:4 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB] 2026-02-21T08:06:32.7353912Z Get:5 
http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB] 2026-02-21T08:06:32.8292044Z Get:6 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB] 2026-02-21T08:06:32.9823842Z Get:7 http://archive.ubuntu.com/ubuntu noble/restricted amd64 Packages [117 kB] 2026-02-21T08:06:32.9830364Z Get:8 http://archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages [331 kB] 2026-02-21T08:06:33.1248003Z Get:9 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1808 kB] 2026-02-21T08:06:33.3223667Z Get:10 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages [19.3 MB] 2026-02-21T08:06:33.3405530Z Get:11 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [1857 kB] 2026-02-21T08:06:33.6868136Z Get:12 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [2240 kB] 2026-02-21T08:06:33.7338550Z Get:13 http://archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Packages [38.1 kB] 2026-02-21T08:06:33.7340423Z Get:14 http://archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages [3381 kB] 2026-02-21T08:06:33.7795603Z Get:15 http://archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [2016 kB] 2026-02-21T08:06:33.8102172Z Get:16 http://archive.ubuntu.com/ubuntu noble-backports/main amd64 Packages [49.5 kB] 2026-02-21T08:06:33.8107445Z Get:17 http://archive.ubuntu.com/ubuntu noble-backports/universe amd64 Packages [34.6 kB] 2026-02-21T08:06:34.2236492Z Get:18 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1207 kB] 2026-02-21T08:06:34.2955640Z Get:19 http://security.ubuntu.com/ubuntu noble-security/multiverse amd64 Packages [34.8 kB] 2026-02-21T08:06:34.2974926Z Get:20 http://security.ubuntu.com/ubuntu noble-security/restricted amd64 Packages [3196 kB] 2026-02-21T08:06:34.6424443Z Fetched 37.5 MB in 2s (15.4 MB/s) 2026-02-21T08:06:35.2489334Z Reading package lists... 
2026-02-21T08:06:35.2587781Z W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details. 2026-02-21T08:06:35.2595886Z + apt-get install -y git 2026-02-21T08:06:35.8749218Z Reading package lists... 2026-02-21T08:06:35.9918956Z Building dependency tree... 2026-02-21T08:06:35.9923525Z Reading state information... 2026-02-21T08:06:36.1295263Z The following additional packages will be installed: 2026-02-21T08:06:36.1299720Z git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 libcurl3t64-gnutls 2026-02-21T08:06:36.1301285Z libedit2 liberror-perl libexpat1 libfido2-1 libgssapi-krb5-2 libk5crypto3 2026-02-21T08:06:36.1302138Z libkeyutils1 libkrb5-3 libkrb5support0 libnghttp2-14 libpsl5t64 librtmp1 2026-02-21T08:06:36.1302653Z libssh-4 libx11-6 libx11-data libxau6 libxcb1 libxdmcp6 libxext6 libxmuu1 2026-02-21T08:06:36.1305083Z openssh-client publicsuffix xauth 2026-02-21T08:06:36.1305492Z Suggested packages: 2026-02-21T08:06:36.1305939Z gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui 2026-02-21T08:06:36.1306849Z gitk gitweb git-cvs git-mediawiki git-svn krb5-doc krb5-user keychain 2026-02-21T08:06:36.1307226Z libpam-ssh monkeysphere ssh-askpass 2026-02-21T08:06:36.1714586Z The following NEW packages will be installed: 2026-02-21T08:06:36.1718549Z git git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 2026-02-21T08:06:36.1719081Z libcurl3t64-gnutls libedit2 liberror-perl libexpat1 libfido2-1 2026-02-21T08:06:36.1719511Z libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3 libkrb5support0 2026-02-21T08:06:36.1719893Z libnghttp2-14 libpsl5t64 librtmp1 libssh-4 libx11-6 libx11-data libxau6 2026-02-21T08:06:36.1720332Z libxcb1 libxdmcp6 libxext6 libxmuu1 openssh-client publicsuffix xauth 2026-02-21T08:06:36.4654250Z 0 upgraded, 31 newly installed, 0 to remove and 86 not upgraded. 
2026-02-21T08:06:36.4656077Z Need to get 8886 kB of archives. 2026-02-21T08:06:36.4656423Z After this operation, 38.0 MB of additional disk space will be used. 2026-02-21T08:06:36.4656937Z Get:1 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 krb5-locales all 1.20.1-6ubuntu2.6 [14.8 kB] 2026-02-21T08:06:36.7459252Z Get:2 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 less amd64 590-2ubuntu2.1 [142 kB] 2026-02-21T08:06:37.1251080Z Get:3 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libbsd0 amd64 0.12.1-1build1.1 [41.2 kB] 2026-02-21T08:06:37.1811403Z Get:4 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libexpat1 amd64 2.6.1-2ubuntu0.4 [88.2 kB] 2026-02-21T08:06:37.2578550Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5support0 amd64 1.20.1-6ubuntu2.6 [34.4 kB] 2026-02-21T08:06:37.2814708Z Get:6 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libk5crypto3 amd64 1.20.1-6ubuntu2.6 [82.0 kB] 2026-02-21T08:06:37.3253256Z Get:7 http://archive.ubuntu.com/ubuntu noble/main amd64 libkeyutils1 amd64 1.6.3-3build1 [9490 B] 2026-02-21T08:06:37.3299833Z Get:8 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5-3 amd64 1.20.1-6ubuntu2.6 [348 kB] 2026-02-21T08:06:37.4548727Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libgssapi-krb5-2 amd64 1.20.1-6ubuntu2.6 [143 kB] 2026-02-21T08:06:37.5046763Z Get:10 http://archive.ubuntu.com/ubuntu noble/main amd64 libcbor0.10 amd64 0.10.2-1.2ubuntu2 [25.8 kB] 2026-02-21T08:06:37.5080840Z Get:11 http://archive.ubuntu.com/ubuntu noble/main amd64 libedit2 amd64 3.1-20230828-1build1 [97.6 kB] 2026-02-21T08:06:37.5361177Z Get:12 http://archive.ubuntu.com/ubuntu noble/main amd64 libfido2-1 amd64 1.14.0-1build3 [83.5 kB] 2026-02-21T08:06:37.5475722Z Get:13 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libnghttp2-14 amd64 1.59.0-1ubuntu0.2 [74.3 kB] 2026-02-21T08:06:37.5593626Z Get:14 http://archive.ubuntu.com/ubuntu noble/main 
amd64 libpsl5t64 amd64 0.21.2-1.1build1 [57.1 kB] 2026-02-21T08:06:37.5691472Z Get:15 http://archive.ubuntu.com/ubuntu noble/main amd64 libxau6 amd64 1:1.0.9-1build6 [7160 B] 2026-02-21T08:06:37.5709177Z Get:16 http://archive.ubuntu.com/ubuntu noble/main amd64 libxdmcp6 amd64 1:1.1.3-0ubuntu6 [10.3 kB] 2026-02-21T08:06:37.5728817Z Get:17 http://archive.ubuntu.com/ubuntu noble/main amd64 libxcb1 amd64 1.15-1ubuntu2 [47.7 kB] 2026-02-21T08:06:37.5935545Z Get:18 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-data all 2:1.8.7-1build1 [115 kB] 2026-02-21T08:06:37.6598678Z Get:19 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-6 amd64 2:1.8.7-1build1 [650 kB] 2026-02-21T08:06:37.7304094Z Get:20 http://archive.ubuntu.com/ubuntu noble/main amd64 libxext6 amd64 2:1.3.4-1build2 [30.4 kB] 2026-02-21T08:06:37.7397580Z Get:21 http://archive.ubuntu.com/ubuntu noble/main amd64 libxmuu1 amd64 2:1.1.3-3build2 [8958 B] 2026-02-21T08:06:37.7409787Z Get:22 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 openssh-client amd64 1:9.6p1-3ubuntu13.14 [906 kB] 2026-02-21T08:06:37.8088933Z Get:23 http://archive.ubuntu.com/ubuntu noble/main amd64 publicsuffix all 20231001.0357-0.1 [129 kB] 2026-02-21T08:06:37.8193120Z Get:24 http://archive.ubuntu.com/ubuntu noble/main amd64 xauth amd64 1:1.1.2-1build1 [25.6 kB] 2026-02-21T08:06:37.8211135Z Get:25 http://archive.ubuntu.com/ubuntu noble/main amd64 libbrotli1 amd64 1.1.0-2build2 [331 kB] 2026-02-21T08:06:37.8479274Z Get:26 http://archive.ubuntu.com/ubuntu noble/main amd64 librtmp1 amd64 2.4+20151223.gitfa8646d.1-2build7 [56.3 kB] 2026-02-21T08:06:37.8515110Z Get:27 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libssh-4 amd64 0.10.6-2ubuntu0.3 [190 kB] 2026-02-21T08:06:37.8672171Z Get:28 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libcurl3t64-gnutls amd64 8.5.0-2ubuntu10.6 [333 kB] 2026-02-21T08:06:37.9019572Z Get:29 http://archive.ubuntu.com/ubuntu noble/main amd64 liberror-perl all 0.17029-2 
[25.6 kB] 2026-02-21T08:06:37.9032561Z Get:30 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git-man all 1:2.43.0-1ubuntu7.3 [1100 kB] 2026-02-21T08:06:37.9496458Z Get:31 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git amd64 1:2.43.0-1ubuntu7.3 [3680 kB] 2026-02-21T08:06:38.1683983Z debconf: delaying package configuration, since apt-utils is not installed 2026-02-21T08:06:38.1909416Z Fetched 8886 kB in 2s (4707 kB/s) 2026-02-21T08:06:38.2107690Z Selecting previously unselected package krb5-locales. 2026-02-21T08:06:38.2124298Z (Reading database ... 2026-02-21T08:06:38.2125872Z (Reading database ... 5% 2026-02-21T08:06:38.2126526Z (Reading database ... 10% 2026-02-21T08:06:38.2126756Z (Reading database ... 15% 2026-02-21T08:06:38.2127094Z (Reading database ... 20% 2026-02-21T08:06:38.2127314Z (Reading database ... 25% 2026-02-21T08:06:38.2127551Z (Reading database ... 30% 2026-02-21T08:06:38.2127807Z (Reading database ... 35% 2026-02-21T08:06:38.2128045Z (Reading database ... 40% 2026-02-21T08:06:38.2128277Z (Reading database ... 45% 2026-02-21T08:06:38.2128486Z (Reading database ... 50% 2026-02-21T08:06:38.2128751Z (Reading database ... 55% 2026-02-21T08:06:38.2128955Z (Reading database ... 60% 2026-02-21T08:06:38.2129291Z (Reading database ... 65% 2026-02-21T08:06:38.2135724Z (Reading database ... 70% 2026-02-21T08:06:38.2137930Z (Reading database ... 75% 2026-02-21T08:06:38.2138285Z (Reading database ... 80% 2026-02-21T08:06:38.2144018Z (Reading database ... 85% 2026-02-21T08:06:38.2154024Z (Reading database ... 90% 2026-02-21T08:06:38.2160663Z (Reading database ... 95% 2026-02-21T08:06:38.2160915Z (Reading database ... 100% 2026-02-21T08:06:38.2161263Z (Reading database ... 15507 files and directories currently installed.) 2026-02-21T08:06:38.2161746Z Preparing to unpack .../00-krb5-locales_1.20.1-6ubuntu2.6_all.deb ... 2026-02-21T08:06:38.2195701Z Unpacking krb5-locales (1.20.1-6ubuntu2.6) ... 
2026-02-21T08:06:38.2386485Z Selecting previously unselected package less. 2026-02-21T08:06:38.2396231Z Preparing to unpack .../01-less_590-2ubuntu2.1_amd64.deb ... 2026-02-21T08:06:38.2422393Z Unpacking less (590-2ubuntu2.1) ... 2026-02-21T08:06:38.2647227Z Selecting previously unselected package libbsd0:amd64. 2026-02-21T08:06:38.2656516Z Preparing to unpack .../02-libbsd0_0.12.1-1build1.1_amd64.deb ... 2026-02-21T08:06:38.2694466Z Unpacking libbsd0:amd64 (0.12.1-1build1.1) ... 2026-02-21T08:06:38.2913204Z Selecting previously unselected package libexpat1:amd64. 2026-02-21T08:06:38.2917386Z Preparing to unpack .../03-libexpat1_2.6.1-2ubuntu0.4_amd64.deb ... 2026-02-21T08:06:38.2938495Z Unpacking libexpat1:amd64 (2.6.1-2ubuntu0.4) ... 2026-02-21T08:06:38.3148820Z Selecting previously unselected package libkrb5support0:amd64. 2026-02-21T08:06:38.3158563Z Preparing to unpack .../04-libkrb5support0_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:38.3174613Z Unpacking libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:38.3388747Z Selecting previously unselected package libk5crypto3:amd64. 2026-02-21T08:06:38.3397305Z Preparing to unpack .../05-libk5crypto3_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:38.3421806Z Unpacking libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:38.3639853Z Selecting previously unselected package libkeyutils1:amd64. 2026-02-21T08:06:38.3647268Z Preparing to unpack .../06-libkeyutils1_1.6.3-3build1_amd64.deb ... 2026-02-21T08:06:38.3671766Z Unpacking libkeyutils1:amd64 (1.6.3-3build1) ... 2026-02-21T08:06:38.3864444Z Selecting previously unselected package libkrb5-3:amd64. 2026-02-21T08:06:38.3872852Z Preparing to unpack .../07-libkrb5-3_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:38.3896352Z Unpacking libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:38.4144876Z Selecting previously unselected package libgssapi-krb5-2:amd64. 
2026-02-21T08:06:38.4150860Z Preparing to unpack .../08-libgssapi-krb5-2_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:38.4173889Z Unpacking libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:38.4389436Z Selecting previously unselected package libcbor0.10:amd64. 2026-02-21T08:06:38.4398150Z Preparing to unpack .../09-libcbor0.10_0.10.2-1.2ubuntu2_amd64.deb ... 2026-02-21T08:06:38.4425092Z Unpacking libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:38.4624131Z Selecting previously unselected package libedit2:amd64. 2026-02-21T08:06:38.4630765Z Preparing to unpack .../10-libedit2_3.1-20230828-1build1_amd64.deb ... 2026-02-21T08:06:38.4656199Z Unpacking libedit2:amd64 (3.1-20230828-1build1) ... 2026-02-21T08:06:38.4873963Z Selecting previously unselected package libfido2-1:amd64. 2026-02-21T08:06:38.4880998Z Preparing to unpack .../11-libfido2-1_1.14.0-1build3_amd64.deb ... 2026-02-21T08:06:38.4907993Z Unpacking libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:38.5107988Z Selecting previously unselected package libnghttp2-14:amd64. 2026-02-21T08:06:38.5114021Z Preparing to unpack .../12-libnghttp2-14_1.59.0-1ubuntu0.2_amd64.deb ... 2026-02-21T08:06:38.5135941Z Unpacking libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ... 2026-02-21T08:06:38.5340109Z Selecting previously unselected package libpsl5t64:amd64. 2026-02-21T08:06:38.5347501Z Preparing to unpack .../13-libpsl5t64_0.21.2-1.1build1_amd64.deb ... 2026-02-21T08:06:38.5370371Z Unpacking libpsl5t64:amd64 (0.21.2-1.1build1) ... 2026-02-21T08:06:38.5565681Z Selecting previously unselected package libxau6:amd64. 2026-02-21T08:06:38.5570802Z Preparing to unpack .../14-libxau6_1%3a1.0.9-1build6_amd64.deb ... 2026-02-21T08:06:38.5593671Z Unpacking libxau6:amd64 (1:1.0.9-1build6) ... 2026-02-21T08:06:38.5774474Z Selecting previously unselected package libxdmcp6:amd64. 2026-02-21T08:06:38.5780178Z Preparing to unpack .../15-libxdmcp6_1%3a1.1.3-0ubuntu6_amd64.deb ... 
2026-02-21T08:06:38.5806167Z Unpacking libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ... 2026-02-21T08:06:38.6010175Z Selecting previously unselected package libxcb1:amd64. 2026-02-21T08:06:38.6018304Z Preparing to unpack .../16-libxcb1_1.15-1ubuntu2_amd64.deb ... 2026-02-21T08:06:38.6035239Z Unpacking libxcb1:amd64 (1.15-1ubuntu2) ... 2026-02-21T08:06:38.6219842Z Selecting previously unselected package libx11-data. 2026-02-21T08:06:38.6227941Z Preparing to unpack .../17-libx11-data_2%3a1.8.7-1build1_all.deb ... 2026-02-21T08:06:38.6247467Z Unpacking libx11-data (2:1.8.7-1build1) ... 2026-02-21T08:06:38.6596030Z Selecting previously unselected package libx11-6:amd64. 2026-02-21T08:06:38.6604584Z Preparing to unpack .../18-libx11-6_2%3a1.8.7-1build1_amd64.deb ... 2026-02-21T08:06:38.6623565Z Unpacking libx11-6:amd64 (2:1.8.7-1build1) ... 2026-02-21T08:06:38.6865833Z Selecting previously unselected package libxext6:amd64. 2026-02-21T08:06:38.6875261Z Preparing to unpack .../19-libxext6_2%3a1.3.4-1build2_amd64.deb ... 2026-02-21T08:06:38.6889087Z Unpacking libxext6:amd64 (2:1.3.4-1build2) ... 2026-02-21T08:06:38.7088440Z Selecting previously unselected package libxmuu1:amd64. 2026-02-21T08:06:38.7096935Z Preparing to unpack .../20-libxmuu1_2%3a1.1.3-3build2_amd64.deb ... 2026-02-21T08:06:38.7118295Z Unpacking libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T08:06:38.7380075Z Selecting previously unselected package openssh-client. 2026-02-21T08:06:38.7387341Z Preparing to unpack .../21-openssh-client_1%3a9.6p1-3ubuntu13.14_amd64.deb ... 2026-02-21T08:06:38.7450224Z Unpacking openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:38.7764231Z Selecting previously unselected package publicsuffix. 2026-02-21T08:06:38.7770485Z Preparing to unpack .../22-publicsuffix_20231001.0357-0.1_all.deb ... 2026-02-21T08:06:38.7794480Z Unpacking publicsuffix (20231001.0357-0.1) ... 2026-02-21T08:06:38.7989116Z Selecting previously unselected package xauth. 
2026-02-21T08:06:38.7996320Z Preparing to unpack .../23-xauth_1%3a1.1.2-1build1_amd64.deb ... 2026-02-21T08:06:38.8026055Z Unpacking xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:38.8252171Z Selecting previously unselected package libbrotli1:amd64. 2026-02-21T08:06:38.8260151Z Preparing to unpack .../24-libbrotli1_1.1.0-2build2_amd64.deb ... 2026-02-21T08:06:38.8272956Z Unpacking libbrotli1:amd64 (1.1.0-2build2) ... 2026-02-21T08:06:38.8519973Z Selecting previously unselected package librtmp1:amd64. 2026-02-21T08:06:38.8526816Z Preparing to unpack .../25-librtmp1_2.4+20151223.gitfa8646d.1-2build7_amd64.deb ... 2026-02-21T08:06:38.8552695Z Unpacking librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ... 2026-02-21T08:06:38.8762642Z Selecting previously unselected package libssh-4:amd64. 2026-02-21T08:06:38.8771459Z Preparing to unpack .../26-libssh-4_0.10.6-2ubuntu0.3_amd64.deb ... 2026-02-21T08:06:38.8793595Z Unpacking libssh-4:amd64 (0.10.6-2ubuntu0.3) ... 2026-02-21T08:06:38.9049435Z Selecting previously unselected package libcurl3t64-gnutls:amd64. 2026-02-21T08:06:38.9059278Z Preparing to unpack .../27-libcurl3t64-gnutls_8.5.0-2ubuntu10.6_amd64.deb ... 2026-02-21T08:06:38.9080371Z Unpacking libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:38.9293032Z Selecting previously unselected package liberror-perl. 2026-02-21T08:06:38.9300520Z Preparing to unpack .../28-liberror-perl_0.17029-2_all.deb ... 2026-02-21T08:06:38.9328833Z Unpacking liberror-perl (0.17029-2) ... 2026-02-21T08:06:38.9526565Z Selecting previously unselected package git-man. 2026-02-21T08:06:38.9535944Z Preparing to unpack .../29-git-man_1%3a2.43.0-1ubuntu7.3_all.deb ... 2026-02-21T08:06:38.9555003Z Unpacking git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:38.9855330Z Selecting previously unselected package git. 2026-02-21T08:06:38.9863215Z Preparing to unpack .../30-git_1%3a2.43.0-1ubuntu7.3_amd64.deb ... 2026-02-21T08:06:38.9936424Z Unpacking git (1:2.43.0-1ubuntu7.3) ... 
2026-02-21T08:06:39.1062252Z Setting up libexpat1:amd64 (2.6.1-2ubuntu0.4) ... 2026-02-21T08:06:39.1139744Z Setting up libxau6:amd64 (1:1.0.9-1build6) ... 2026-02-21T08:06:39.1195986Z Setting up libkeyutils1:amd64 (1.6.3-3build1) ... 2026-02-21T08:06:39.1253942Z Setting up libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:39.1305683Z Setting up libbrotli1:amd64 (1.1.0-2build2) ... 2026-02-21T08:06:39.1356709Z Setting up libpsl5t64:amd64 (0.21.2-1.1build1) ... 2026-02-21T08:06:39.1426543Z Setting up libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ... 2026-02-21T08:06:39.1497079Z Setting up less (590-2ubuntu2.1) ... 2026-02-21T08:06:39.1658417Z Setting up krb5-locales (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:39.1713006Z Setting up libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:39.1782085Z Setting up liberror-perl (0.17029-2) ... 2026-02-21T08:06:39.1846935Z Setting up libx11-data (2:1.8.7-1build1) ... 2026-02-21T08:06:39.1915294Z Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ... 2026-02-21T08:06:39.1984474Z Setting up libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:39.2058814Z Setting up git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:39.2137236Z Setting up libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:39.2215077Z Setting up libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:39.2272852Z Setting up libbsd0:amd64 (0.12.1-1build1.1) ... 2026-02-21T08:06:39.2331267Z Setting up publicsuffix (20231001.0357-0.1) ... 2026-02-21T08:06:39.2406107Z Setting up libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ... 2026-02-21T08:06:39.2488362Z Setting up libxcb1:amd64 (1.15-1ubuntu2) ... 2026-02-21T08:06:39.2563712Z Setting up libedit2:amd64 (3.1-20230828-1build1) ... 2026-02-21T08:06:39.2629763Z Setting up libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:39.2712275Z Setting up libssh-4:amd64 (0.10.6-2ubuntu0.3) ... 2026-02-21T08:06:39.2774924Z Setting up libx11-6:amd64 (2:1.8.7-1build1) ... 
2026-02-21T08:06:39.2848129Z Setting up libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T08:06:39.2911731Z Setting up openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:39.3512080Z Setting up libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:39.3551356Z Setting up libxext6:amd64 (2:1.3.4-1build2) ... 2026-02-21T08:06:39.3595651Z Setting up git (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:39.3719196Z Setting up xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:39.3797537Z Processing triggers for libc-bin (2.39-0ubuntu8.5) ... 2026-02-21T08:06:39.4219803Z ##[group]Run actions/checkout@v6 2026-02-21T08:06:39.4220102Z with: 2026-02-21T08:06:39.4220371Z repository: pytorch/helion 2026-02-21T08:06:39.4220762Z token: *** 2026-02-21T08:06:39.4221008Z ssh-strict: true 2026-02-21T08:06:39.4221222Z ssh-user: git 2026-02-21T08:06:39.4221466Z persist-credentials: true 2026-02-21T08:06:39.4221793Z clean: true 2026-02-21T08:06:39.4222024Z sparse-checkout-cone-mode: true 2026-02-21T08:06:39.4222293Z fetch-depth: 1 2026-02-21T08:06:39.4222473Z fetch-tags: false 2026-02-21T08:06:39.4222742Z show-progress: true 2026-02-21T08:06:39.4223081Z lfs: false 2026-02-21T08:06:39.4223286Z submodules: false 2026-02-21T08:06:39.4223535Z set-safe-directory: true 2026-02-21T08:06:39.4223776Z env: 2026-02-21T08:06:39.4223968Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:39.4224242Z ##[endgroup] 2026-02-21T08:06:39.4257305Z ##[command]/usr/bin/docker exec 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:39.6064502Z Syncing repository: pytorch/helion 2026-02-21T08:06:39.6065464Z ##[group]Getting Git version info 2026-02-21T08:06:39.6065737Z Working directory is '/__w/helion/helion' 2026-02-21T08:06:39.6066188Z [command]/usr/bin/git version 2026-02-21T08:06:39.6074869Z git version 2.43.0 2026-02-21T08:06:39.6093657Z ##[endgroup] 2026-02-21T08:06:39.6105720Z Temporarily overriding 
HOME='/__w/_temp/029eb213-88a8-46a7-a3d7-2261b0e696d4' before making global git config changes 2026-02-21T08:06:39.6107385Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T08:06:39.6107851Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T08:06:39.6146192Z Deleting the contents of '/__w/helion/helion' 2026-02-21T08:06:39.6150548Z ##[group]Initializing the repository 2026-02-21T08:06:39.6152024Z [command]/usr/bin/git init /__w/helion/helion 2026-02-21T08:06:39.6177524Z hint: Using 'master' as the name for the initial branch. This default branch name 2026-02-21T08:06:39.6177984Z hint: is subject to change. To configure the initial branch name to use in all 2026-02-21T08:06:39.6178394Z hint: of your new repositories, which will suppress this warning, call: 2026-02-21T08:06:39.6178675Z hint: 2026-02-21T08:06:39.6178949Z hint: git config --global init.defaultBranch 2026-02-21T08:06:39.6179202Z hint: 2026-02-21T08:06:39.6179461Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2026-02-21T08:06:39.6179863Z hint: 'development'. 
The just-created branch can be renamed via this command: 2026-02-21T08:06:39.6180179Z hint: 2026-02-21T08:06:39.6180375Z hint: git branch -m 2026-02-21T08:06:39.6184735Z Initialized empty Git repository in /__w/helion/helion/.git/ 2026-02-21T08:06:39.6198940Z [command]/usr/bin/git remote add origin https://github.com/pytorch/helion 2026-02-21T08:06:39.6218189Z ##[endgroup] 2026-02-21T08:06:39.6218679Z ##[group]Disabling automatic garbage collection 2026-02-21T08:06:39.6219028Z [command]/usr/bin/git config --local gc.auto 0 2026-02-21T08:06:39.6243870Z ##[endgroup] 2026-02-21T08:06:39.6244214Z ##[group]Setting up auth 2026-02-21T08:06:39.6247393Z Removing SSH command configuration 2026-02-21T08:06:39.6250420Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T08:06:39.6275664Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2026-02-21T08:06:39.6502004Z Removing HTTP extra header 2026-02-21T08:06:39.6506635Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T08:06:39.6530864Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T08:06:39.6756518Z Removing includeIf entries pointing to credentials config files 2026-02-21T08:06:39.6756983Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T08:06:39.6783769Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T08:06:39.7021201Z [command]/usr/bin/git config --file /__w/_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config http.https://github.com/.extraheader 
AUTHORIZATION: basic *** 2026-02-21T08:06:39.7064168Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T08:06:39.7092577Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T08:06:39.7117157Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T08:06:39.7142431Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T08:06:39.7171402Z ##[endgroup] 2026-02-21T08:06:39.7171974Z ##[group]Fetching the repository 2026-02-21T08:06:39.7176174Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +874a7d0cadab18218a84ad3579d329dc95c51820:refs/remotes/origin/main 2026-02-21T08:06:40.1869718Z From https://github.com/pytorch/helion 2026-02-21T08:06:40.1870154Z * [new ref] 874a7d0cadab18218a84ad3579d329dc95c51820 -> origin/main 2026-02-21T08:06:40.1900304Z [command]/usr/bin/git branch --list --remote origin/main 2026-02-21T08:06:40.1919614Z origin/main 2026-02-21T08:06:40.1925295Z [command]/usr/bin/git rev-parse refs/remotes/origin/main 2026-02-21T08:06:40.1942870Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:40.1946155Z ##[endgroup] 2026-02-21T08:06:40.1946512Z ##[group]Determining the checkout info 2026-02-21T08:06:40.1946974Z ##[endgroup] 2026-02-21T08:06:40.1949418Z [command]/usr/bin/git sparse-checkout disable 2026-02-21T08:06:40.1980181Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2026-02-21T08:06:40.2004988Z ##[group]Checking out the ref 2026-02-21T08:06:40.2005362Z [command]/usr/bin/git 
checkout --progress --force -B main refs/remotes/origin/main 2026-02-21T08:06:40.2260899Z Switched to a new branch 'main' 2026-02-21T08:06:40.2265457Z branch 'main' set up to track 'origin/main'. 2026-02-21T08:06:40.2270023Z ##[endgroup] 2026-02-21T08:06:40.2294186Z [command]/usr/bin/git log -1 --format=%H 2026-02-21T08:06:40.2313308Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:40.2451401Z ##[group]Run actions/setup-python@v6 2026-02-21T08:06:40.2451765Z with: 2026-02-21T08:06:40.2452008Z python-version: 3.12 2026-02-21T08:06:40.2452225Z check-latest: false 2026-02-21T08:06:40.2452596Z token: *** 2026-02-21T08:06:40.2452806Z update-environment: true 2026-02-21T08:06:40.2453213Z allow-prereleases: false 2026-02-21T08:06:40.2453440Z freethreaded: false 2026-02-21T08:06:40.2453662Z env: 2026-02-21T08:06:40.2453896Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:40.2454130Z ##[endgroup] 2026-02-21T08:06:40.2457686Z ##[command]/usr/bin/docker exec 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:40.4409883Z ##[group]Installed versions 2026-02-21T08:06:40.4415066Z Version 3.12 was not found in the local cache 2026-02-21T08:06:41.2178795Z Version 3.12 is available for downloading 2026-02-21T08:06:41.2179396Z Download from "https://github.com/actions/python-versions/releases/download/3.12.12-18393146713/python-3.12.12-linux-24.04-x64.tar.gz" 2026-02-21T08:06:42.0395185Z Extract downloaded archive 2026-02-21T08:06:42.0504384Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/914d35d8-9fbd-4426-95cf-f7b5b2685d1c -f /__w/_temp/8b81532b-9f9e-4b76-8740-c2d19969f97f 2026-02-21T08:06:43.8096471Z Execute installation script 2026-02-21T08:06:43.8206591Z Check if Python hostedtoolcache folder exist... 2026-02-21T08:06:43.8211036Z Creating Python hostedtoolcache folder... 
2026-02-21T08:06:43.8212293Z Create Python 3.12.12 folder 2026-02-21T08:06:43.8225837Z Copy Python binaries to hostedtoolcache folder 2026-02-21T08:06:44.0976270Z Create additional symlinks (Required for the UsePythonVersion Azure Pipelines task and the setup-python GitHub Action) 2026-02-21T08:06:44.1012705Z Upgrading pip... 2026-02-21T08:06:45.4305143Z Looking in links: /tmp/tmpcqa_yw6a 2026-02-21T08:06:45.4310166Z Requirement already satisfied: pip in /__w/_tool/Python/3.12.12/x64/lib/python3.12/site-packages (25.0.1) 2026-02-21T08:06:45.4349820Z ##[error]WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2026-02-21T08:06:46.0029300Z ##[error]WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag. 
2026-02-21T08:06:46.1600030Z Collecting pip 2026-02-21T08:06:46.1977289Z Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB) 2026-02-21T08:06:46.2060995Z Downloading pip-26.0.1-py3-none-any.whl (1.8 MB) 2026-02-21T08:06:46.2910606Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 25.0 MB/s eta 0:00:00 2026-02-21T08:06:46.3001940Z Installing collected packages: pip 2026-02-21T08:06:46.3003400Z Attempting uninstall: pip 2026-02-21T08:06:46.3015027Z Found existing installation: pip 25.0.1 2026-02-21T08:06:46.3183106Z Uninstalling pip-25.0.1: 2026-02-21T08:06:46.3217556Z Successfully uninstalled pip-25.0.1 2026-02-21T08:06:46.9398821Z Successfully installed pip-26.0.1 2026-02-21T08:06:46.9825595Z Create complete file 2026-02-21T08:06:46.9856898Z Successfully set up CPython (3.12.12) 2026-02-21T08:06:46.9860852Z ##[endgroup] 2026-02-21T08:06:47.0043119Z ##[group]Run astral-sh/setup-uv@v7 2026-02-21T08:06:47.0043363Z with: 2026-02-21T08:06:47.0043611Z activate-environment: false 2026-02-21T08:06:47.0043891Z working-directory: /home/frank/_work/helion/helion 2026-02-21T08:06:47.0044306Z github-token: *** 2026-02-21T08:06:47.0044493Z enable-cache: auto 2026-02-21T08:06:47.0045014Z cache-dependency-glob: **/*requirements*.txt **/*requirements*.in **/*constraints*.txt **/*constraints*.in **/pyproject.toml **/uv.lock **/*.py.lock 2026-02-21T08:06:47.0045517Z restore-cache: true 2026-02-21T08:06:47.0045718Z save-cache: true 2026-02-21T08:06:47.0045977Z prune-cache: true 2026-02-21T08:06:47.0046175Z cache-python: false 2026-02-21T08:06:47.0046422Z ignore-nothing-to-cache: false 2026-02-21T08:06:47.0046689Z ignore-empty-workdir: false 2026-02-21T08:06:47.0046956Z add-problem-matchers: true 2026-02-21T08:06:47.0047212Z resolution-strategy: highest 2026-02-21T08:06:47.0047456Z env: 2026-02-21T08:06:47.0047738Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:47.0048026Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:47.0048378Z PKG_CONFIG_PATH: 
/__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:47.0048685Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:47.0049000Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:47.0049287Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:47.0049745Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:47.0050179Z ##[endgroup] 2026-02-21T08:06:47.0056286Z ##[command]/usr/bin/docker exec 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:47.2255458Z (node:802) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T08:06:47.2256308Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T08:06:47.2322865Z Trying to find version for uv in: /__w/helion/helion/uv.toml 2026-02-21T08:06:47.2324501Z Could not find file: /__w/helion/helion/uv.toml 2026-02-21T08:06:47.2324970Z Trying to find version for uv in: /__w/helion/helion/pyproject.toml 2026-02-21T08:06:47.2333276Z Could not determine uv version from uv.toml or pyproject.toml. Falling back to latest. 2026-02-21T08:06:47.2342189Z Getting latest version from GitHub API... 2026-02-21T08:06:47.4721421Z manifest-file not provided, reading from local file. 2026-02-21T08:06:47.4759834Z manifest-file does not contain version 0.10.4, arch x86_64, platform unknown-linux-gnu. Falling back to GitHub releases. 2026-02-21T08:06:47.4760853Z Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.10.4/uv-x86_64-unknown-linux-gnu.tar.gz" ... 
2026-02-21T08:06:47.7974058Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/81db1e63-6d95-439d-b325-f4f7d5af2335 -f /__w/_temp/e773eaee-97a3-4e94-a789-c6f164673d47 2026-02-21T08:06:48.1751298Z Added /github/home/.local/bin to the path 2026-02-21T08:06:48.1755393Z Added /__w/_tool/uv/0.10.4/x86_64 to the path 2026-02-21T08:06:48.1758727Z Set UV_PYTHON_INSTALL_DIR to /github/home/.local/share/uv/python 2026-02-21T08:06:48.1759731Z Added /github/home/.local/share/uv/python to the path 2026-02-21T08:06:48.1764279Z Successfully installed uv version 0.10.4 2026-02-21T08:06:48.3421128Z ##[group]Run uv venv --python 3.12 2026-02-21T08:06:48.3421445Z uv venv --python 3.12 2026-02-21T08:06:48.3421889Z shell: bash -l {0} 2026-02-21T08:06:48.3422122Z env: 2026-02-21T08:06:48.3422334Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:48.3422661Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.3423012Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:48.3423329Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.3423669Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.3423953Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.3424426Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:48.3424938Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:48.3425219Z ##[endgroup] 2026-02-21T08:06:48.4528127Z Using CPython 3.12.12 interpreter at: /__w/_tool/Python/3.12.12/x64/bin/python3.12 2026-02-21T08:06:48.4528809Z Creating virtual environment at: .venv 2026-02-21T08:06:48.4529104Z Activate with: source .venv/bin/activate 2026-02-21T08:06:48.4596910Z ##[group]Run source .venv/bin/activate 2026-02-21T08:06:48.4597205Z source .venv/bin/activate 2026-02-21T08:06:48.4597554Z uv pip install -U "torch==2.9.*" --index-url 
https://download.pytorch.org/whl/cu130 2026-02-21T08:06:48.4598033Z shell: bash -l {0} 2026-02-21T08:06:48.4598220Z env: 2026-02-21T08:06:48.4598420Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:48.4598735Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.4599025Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:48.4599324Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.4599629Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.4599923Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.4600337Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:48.4600876Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:48.4601137Z ##[endgroup] 2026-02-21T08:06:49.1562803Z Resolved 26 packages in 595ms 2026-02-21T08:06:49.1649268Z Downloading triton (162.6MiB) 2026-02-21T08:06:49.1694684Z Downloading networkx (2.0MiB) 2026-02-21T08:06:49.1719900Z Downloading nvidia-cuda-cupti (10.2MiB) 2026-02-21T08:06:49.1725116Z Downloading nvidia-cufft (204.2MiB) 2026-02-21T08:06:49.1726442Z Downloading torch (584.2MiB) 2026-02-21T08:06:49.1736742Z Downloading sympy (6.0MiB) 2026-02-21T08:06:49.1882630Z Downloading nvidia-cuda-runtime (2.1MiB) 2026-02-21T08:06:49.1885719Z Downloading nvidia-cusolver (184.5MiB) 2026-02-21T08:06:49.1887290Z Downloading nvidia-curand (56.8MiB) 2026-02-21T08:06:49.1887616Z Downloading nvidia-cufile (1.2MiB) 2026-02-21T08:06:49.1908724Z Downloading nvidia-nvjitlink (38.8MiB) 2026-02-21T08:06:49.1951498Z Downloading nvidia-cudnn-cu13 (332.4MiB) 2026-02-21T08:06:49.2000019Z Downloading nvidia-cusparse (133.8MiB) 2026-02-21T08:06:49.2147596Z Downloading nvidia-nvshmem-cu13 (57.6MiB) 2026-02-21T08:06:49.2248786Z Downloading nvidia-cuda-nvrtc (86.0MiB) 2026-02-21T08:06:49.2348142Z Downloading nvidia-cusparselt-cu13 (162.0MiB) 2026-02-21T08:06:49.2433868Z 
Downloading nvidia-nccl-cu13 (184.9MiB) 2026-02-21T08:06:49.2582860Z Downloading nvidia-cublas (400.0MiB) 2026-02-21T08:06:49.4636624Z Downloaded nvidia-cufile 2026-02-21T08:06:49.6918072Z Downloaded nvidia-cuda-runtime 2026-02-21T08:06:50.2474637Z Downloaded networkx 2026-02-21T08:06:50.9092082Z Downloaded nvidia-cuda-cupti 2026-02-21T08:06:51.8881916Z Downloaded sympy 2026-02-21T08:06:52.1249178Z Downloaded triton 2026-02-21T08:06:53.3446645Z Downloaded nvidia-nvjitlink 2026-02-21T08:06:54.5347690Z Downloaded nvidia-curand 2026-02-21T08:06:54.9860987Z Downloaded nvidia-nvshmem-cu13 2026-02-21T08:06:55.6998613Z Downloaded nvidia-cuda-nvrtc 2026-02-21T08:06:56.5477336Z Downloaded nvidia-cufft 2026-02-21T08:06:57.4146564Z Downloaded nvidia-cusparse 2026-02-21T08:06:57.4748227Z Downloaded nvidia-cusolver 2026-02-21T08:06:58.0527015Z Downloaded nvidia-cusparselt-cu13 2026-02-21T08:06:58.8424867Z Downloaded nvidia-nccl-cu13 2026-02-21T08:06:59.6870622Z Downloaded nvidia-cudnn-cu13 2026-02-21T08:07:00.9483476Z Downloaded nvidia-cublas 2026-02-21T08:07:03.1103224Z Downloaded torch 2026-02-21T08:07:03.1105149Z Prepared 26 packages in 13.95s 2026-02-21T08:07:03.1142781Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:03.1145103Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:03.1145775Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:07:04.9839172Z Installed 26 packages in 1.87s 2026-02-21T08:07:04.9841007Z + filelock==3.20.0 2026-02-21T08:07:04.9841261Z + fsspec==2025.12.0 2026-02-21T08:07:04.9841684Z + jinja2==3.1.6 2026-02-21T08:07:04.9841905Z + markupsafe==3.0.2 2026-02-21T08:07:04.9842515Z + mpmath==1.3.0 2026-02-21T08:07:04.9842713Z + networkx==3.6.1 2026-02-21T08:07:04.9843000Z + nvidia-cublas==13.0.0.19 2026-02-21T08:07:04.9843244Z + nvidia-cuda-cupti==13.0.48 2026-02-21T08:07:04.9844669Z + nvidia-cuda-nvrtc==13.0.48 2026-02-21T08:07:04.9844941Z + nvidia-cuda-runtime==13.0.48 2026-02-21T08:07:04.9845206Z + nvidia-cudnn-cu13==9.13.0.50 2026-02-21T08:07:04.9845464Z + nvidia-cufft==12.0.0.15 2026-02-21T08:07:04.9845779Z + nvidia-cufile==1.15.0.42 2026-02-21T08:07:04.9847830Z + nvidia-curand==10.4.0.35 2026-02-21T08:07:04.9848228Z + nvidia-cusolver==12.0.3.29 2026-02-21T08:07:04.9848661Z + nvidia-cusparse==12.6.2.49 2026-02-21T08:07:04.9849076Z + nvidia-cusparselt-cu13==0.8.0 2026-02-21T08:07:04.9849360Z + nvidia-nccl-cu13==2.27.7 2026-02-21T08:07:04.9849659Z + nvidia-nvjitlink==13.0.39 2026-02-21T08:07:04.9849953Z + nvidia-nvshmem-cu13==3.3.24 2026-02-21T08:07:04.9850258Z + nvidia-nvtx==13.0.39 2026-02-21T08:07:04.9850489Z + setuptools==70.2.0 2026-02-21T08:07:04.9850853Z + sympy==1.14.0 2026-02-21T08:07:04.9851077Z + torch==2.9.1+cu130 2026-02-21T08:07:04.9851336Z + triton==3.5.1 2026-02-21T08:07:04.9851780Z + typing-extensions==4.15.0 2026-02-21T08:07:04.9967113Z ##[group]Run source .venv/bin/activate 2026-02-21T08:07:04.9967468Z source .venv/bin/activate 2026-02-21T08:07:04.9967862Z SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0" uv pip install -e .'[dev]' 2026-02-21T08:07:04.9968242Z python -c "import helion; print(helion.__name__)" 2026-02-21T08:07:04.9968762Z shell: bash -l {0} 2026-02-21T08:07:04.9968970Z env: 2026-02-21T08:07:04.9969223Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:07:04.9969721Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:04.9970050Z 
PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:07:04.9970416Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:04.9970714Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:04.9971053Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:04.9971662Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:07:04.9972130Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:07:04.9972452Z ##[endgroup] 2026-02-21T08:07:07.6357496Z Resolved 30 packages in 2.53s 2026-02-21T08:07:07.6367556Z Building helion @ file:///__w/helion/helion 2026-02-21T08:07:07.6472111Z Downloading numpy (15.8MiB) 2026-02-21T08:07:07.6480637Z Downloading scikit-learn (8.5MiB) 2026-02-21T08:07:07.6519910Z Downloading pygments (1.2MiB) 2026-02-21T08:07:07.6524772Z Downloading scipy (33.4MiB) 2026-02-21T08:07:07.6559550Z Downloading virtualenv (5.6MiB) 2026-02-21T08:07:07.7774639Z Built helion @ file:///__w/helion/helion 2026-02-21T08:07:07.8584588Z Downloaded pygments 2026-02-21T08:07:07.8901122Z Downloaded virtualenv 2026-02-21T08:07:08.3707606Z Downloaded scikit-learn 2026-02-21T08:07:08.3779086Z Downloaded numpy 2026-02-21T08:07:08.7698547Z Downloaded scipy 2026-02-21T08:07:10.6097408Z Prepared 27 packages in 2.97s 2026-02-21T08:07:10.6102293Z Uninstalled 1 package in 0.47ms 2026-02-21T08:07:10.6103812Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:10.6104500Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:10.6105058Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:07:10.6893484Z Installed 29 packages in 79ms 2026-02-21T08:07:10.6895807Z + cfgv==3.5.0 2026-02-21T08:07:10.6896084Z + distlib==0.4.0 2026-02-21T08:07:10.6896387Z + expecttest==0.3.0 2026-02-21T08:07:10.6896834Z + filecheck==1.0.3 2026-02-21T08:07:10.6897079Z - filelock==3.20.0 2026-02-21T08:07:10.6897346Z + filelock==3.24.3 2026-02-21T08:07:10.6897588Z + helion==0.0.0 (from file:///__w/helion/helion) 2026-02-21T08:07:10.6897899Z + hypothesis==6.151.9 2026-02-21T08:07:10.6898100Z + identify==2.6.16 2026-02-21T08:07:10.6898333Z + iniconfig==2.3.0 2026-02-21T08:07:10.6898549Z + joblib==1.5.3 2026-02-21T08:07:10.6898786Z + markdown-it-py==4.0.0 2026-02-21T08:07:10.6899046Z + mdurl==0.1.2 2026-02-21T08:07:10.6899245Z + nodeenv==1.10.0 2026-02-21T08:07:10.6899469Z + numpy==2.4.2 2026-02-21T08:07:10.6899652Z + packaging==26.0 2026-02-21T08:07:10.6899916Z + platformdirs==4.9.2 2026-02-21T08:07:10.6900111Z + pluggy==1.6.0 2026-02-21T08:07:10.6900323Z + pre-commit==4.5.1 2026-02-21T08:07:10.6900541Z + psutil==7.2.2 2026-02-21T08:07:10.6904145Z + pygments==2.19.2 2026-02-21T08:07:10.6906923Z + pytest==9.0.2 2026-02-21T08:07:10.6907355Z + pytest-timeout==2.4.0 2026-02-21T08:07:10.6907594Z + pyyaml==6.0.3 2026-02-21T08:07:10.6907824Z + rich==14.3.3 2026-02-21T08:07:10.6908071Z + scikit-learn==1.8.0 2026-02-21T08:07:10.6908344Z + scipy==1.17.0 2026-02-21T08:07:10.6908563Z + sortedcontainers==2.4.0 2026-02-21T08:07:10.6908845Z + threadpoolctl==3.6.0 2026-02-21T08:07:10.6909058Z + virtualenv==20.38.0 2026-02-21T08:07:22.3116230Z helion 2026-02-21T08:07:23.1013530Z ##[group]Run set -x 2026-02-21T08:07:23.1013977Z set -x 2026-02-21T08:07:23.1014392Z source .venv/bin/activate 2026-02-21T08:07:23.1014719Z uv pip install pip 2026-02-21T08:07:23.1015009Z uv pip install quack-kernels --no-deps 2026-02-21T08:07:23.1015386Z mkdir -p benchmarks/ && pushd benchmarks/ 2026-02-21T08:07:23.1015774Z git clone https://github.com/pytorch-labs/tritonbench/ 
2026-02-21T08:07:23.1016171Z pushd tritonbench/ 2026-02-21T08:07:23.1016725Z git submodule update --init --recursive 2026-02-21T08:07:23.1017119Z uv pip install -r requirements.txt 2026-02-21T08:07:23.1017473Z python install.py --liger 2026-02-21T08:07:23.1017749Z uv pip install -e . --no-deps 2026-02-21T08:07:23.1018087Z popd 2026-02-21T08:07:23.1018329Z popd 2026-02-21T08:07:23.1018741Z shell: bash -l {0} 2026-02-21T08:07:23.1018960Z env: 2026-02-21T08:07:23.1019269Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:07:23.1019543Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:23.1019887Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:07:23.1020184Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:23.1020503Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:23.1020814Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:23.1021225Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:07:23.1021781Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:07:23.1022061Z ##[endgroup] 2026-02-21T08:07:23.2131281Z + source .venv/bin/activate 2026-02-21T08:07:23.2131865Z ++ '[' -z '' ']' 2026-02-21T08:07:23.2132122Z ++ '[' -n x ']' 2026-02-21T08:07:23.2132355Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T08:07:23.2132739Z ++ '[' .venv/bin/activate = /__w/_temp/903163e4-2372-4912-b5bf-1e7f88af9f67.sh ']' 2026-02-21T08:07:23.2133084Z ++ deactivate nondestructive 2026-02-21T08:07:23.2133350Z ++ unset -f pydoc 2026-02-21T08:07:23.2133530Z ++ '[' -z '' ']' 2026-02-21T08:07:23.2133783Z ++ '[' -z '' ']' 2026-02-21T08:07:23.2133974Z ++ hash -r 2026-02-21T08:07:23.2134168Z ++ '[' -z '' ']' 2026-02-21T08:07:23.2134429Z ++ unset VIRTUAL_ENV 2026-02-21T08:07:23.2134648Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T08:07:23.2134904Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T08:07:23.2135192Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T08:07:23.2135477Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T08:07:23.2135697Z ++ '[' linux-gnu = msys ']' 2026-02-21T08:07:23.2136052Z ++ export VIRTUAL_ENV 2026-02-21T08:07:23.2136333Z ++ '[' -z '' ']' 2026-02-21T08:07:23.2136570Z ++ unset SCRIPT_PATH 2026-02-21T08:07:23.2137345Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:07:23.2138580Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:07:23.2139280Z ++ export PATH 2026-02-21T08:07:23.2139539Z ++ '[' xhelion '!=' x ']' 2026-02-21T08:07:23.2139755Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T08:07:23.2140015Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T08:07:23.2140277Z ++ '[' -z '' ']' 2026-02-21T08:07:23.2140457Z ++ '[' -z '' ']' 2026-02-21T08:07:23.2140673Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T08:07:23.2140881Z ++ PS1='(helion) ' 2026-02-21T08:07:23.2141111Z ++ export PS1 2026-02-21T08:07:23.2141293Z ++ alias pydoc 2026-02-21T08:07:23.2141514Z ++ true 2026-02-21T08:07:23.2141756Z ++ hash -r 2026-02-21T08:07:23.2142260Z + uv pip install pip 2026-02-21T08:07:23.4234239Z Resolved 1 package in 201ms 2026-02-21T08:07:23.4305470Z Downloading pip (1.7MiB) 2026-02-21T08:07:23.7597965Z Downloaded pip 2026-02-21T08:07:24.0419320Z Prepared 1 package in 618ms 2026-02-21T08:07:24.0457587Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 
2026-02-21T08:07:24.0458148Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:24.0458716Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:07:24.1577393Z Installed 1 package in 115ms 2026-02-21T08:07:24.1581405Z + pip==26.0.1 2026-02-21T08:07:24.1601381Z + uv pip install quack-kernels --no-deps 2026-02-21T08:07:24.6427584Z Resolved 1 package in 474ms 2026-02-21T08:07:24.8050668Z Prepared 1 package in 162ms 2026-02-21T08:07:24.8087774Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:24.8088361Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:24.8088972Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:07:24.8136414Z Installed 1 package in 8ms 2026-02-21T08:07:24.8136716Z + quack-kernels==0.2.10 2026-02-21T08:07:24.8158209Z + mkdir -p benchmarks/ 2026-02-21T08:07:24.8169005Z + pushd benchmarks/ 2026-02-21T08:07:24.8169291Z + git clone https://github.com/pytorch-labs/tritonbench/ 2026-02-21T08:07:24.8170094Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:07:24.8180259Z Cloning into 'tritonbench'... 
2026-02-21T08:07:26.3569248Z + pushd tritonbench/ 2026-02-21T08:07:26.3569738Z /__w/helion/helion/benchmarks/tritonbench /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:07:26.3572006Z + git submodule update --init --recursive 2026-02-21T08:07:26.4297690Z Submodule 'submodules/ThunderKittens' (https://github.com/HazyResearch/ThunderKittens.git) registered for path 'submodules/ThunderKittens' 2026-02-21T08:07:26.4355589Z Submodule 'submodules/aiter' (https://github.com/ROCm/aiter.git) registered for path 'submodules/aiter' 2026-02-21T08:07:26.4359507Z Submodule 'submodules/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/cutlass' 2026-02-21T08:07:26.4578246Z Submodule 'submodules/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/flash-attention' 2026-02-21T08:07:26.5706481Z Submodule 'submodules/generative-recommenders' (https://github.com/facebookresearch/generative-recommenders.git) registered for path 'submodules/generative-recommenders' 2026-02-21T08:07:26.7562969Z Submodule 'submodules/xformers' (https://github.com/facebookresearch/xformers.git) registered for path 'submodules/xformers' 2026-02-21T08:07:26.7586087Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/ThunderKittens'... 2026-02-21T08:07:32.0912697Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter'... 2026-02-21T08:07:46.7699247Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/cutlass'... 2026-02-21T08:07:51.9114127Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention'... 2026-02-21T08:07:55.9358735Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders'... 2026-02-21T08:07:59.8747427Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers'... 
2026-02-21T08:08:04.5661957Z Submodule path 'submodules/ThunderKittens': checked out '25f7568450b412a1984a4f619fb28373df06fa1b' 2026-02-21T08:08:05.5408700Z Submodule path 'submodules/aiter': checked out '1f5b378dcc9d9b0bcd9456c8c767b7424a5e8190' 2026-02-21T08:08:05.5957658Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/aiter/3rdparty/composable_kernel' 2026-02-21T08:08:05.5979190Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter/3rdparty/composable_kernel'... 2026-02-21T08:08:12.6426157Z Submodule path 'submodules/aiter/3rdparty/composable_kernel': checked out 'e31a7a4f29b371c32ea9daf9211b6ae1fed2fa40' 2026-02-21T08:08:13.2003284Z Submodule path 'submodules/cutlass': checked out 'ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e' 2026-02-21T08:08:13.2661893Z Submodule path 'submodules/flash-attention': checked out '43375aab2893018dfb7950db1cfa623c14946ad6' 2026-02-21T08:08:13.2681978Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/flash-attention/csrc/composable_kernel' 2026-02-21T08:08:13.2683640Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/flash-attention/csrc/cutlass' 2026-02-21T08:08:13.2720828Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/composable_kernel'... 2026-02-21T08:08:17.2481152Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/cutlass'... 
2026-02-21T08:08:21.1810775Z Submodule path 'submodules/flash-attention/csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb' 2026-02-21T08:08:21.6789206Z Submodule path 'submodules/flash-attention/csrc/cutlass': checked out 'b1d6e2c9b334dfa811e4183dfbd02419249e4b52' 2026-02-21T08:08:21.7042001Z Submodule path 'submodules/generative-recommenders': checked out '88512dbd71b053226bc4ef8ec1630e3db53e55e5' 2026-02-21T08:08:21.7061097Z Submodule 'generative_recommenders/ops/cpp/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass' 2026-02-21T08:08:21.7086670Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'... 2026-02-21T08:08:25.7105101Z Submodule path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b' 2026-02-21T08:08:25.7698495Z Submodule path 'submodules/xformers': checked out '8fc8ec5a4d6498ff81c0c418b89bbaf133ae3a44' 2026-02-21T08:08:25.7717046Z Submodule 'third_party/composable_kernel_tiled' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/composable_kernel_tiled' 2026-02-21T08:08:25.7722214Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/cutlass' 2026-02-21T08:08:25.7726306Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/xformers/third_party/flash-attention' 2026-02-21T08:08:25.7747823Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/composable_kernel_tiled'... 2026-02-21T08:08:29.6314378Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/cutlass'... 
2026-02-21T08:08:33.2619154Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention'... 2026-02-21T08:08:34.2657041Z Submodule path 'submodules/xformers/third_party/composable_kernel_tiled': checked out '4f54fa30583704f34da2ac50372d524cae6bad7d' 2026-02-21T08:08:34.7136119Z Submodule path 'submodules/xformers/third_party/cutlass': checked out 'e9627ce55b42fd2599f58cd4396da9380954def0' 2026-02-21T08:08:34.7642915Z Submodule path 'submodules/xformers/third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2026-02-21T08:08:34.7662093Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel' 2026-02-21T08:08:34.7662923Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/cutlass' 2026-02-21T08:08:34.7688819Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/composable_kernel'... 2026-02-21T08:08:38.6496487Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/cutlass'... 
2026-02-21T08:08:42.2760469Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2026-02-21T08:08:42.7094472Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2026-02-21T08:08:42.7130529Z + uv pip install -r requirements.txt 2026-02-21T08:08:42.7207395Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:42.9115468Z Resolved 30 packages in 189ms 2026-02-21T08:08:42.9257385Z Downloading fonttools (4.7MiB) 2026-02-21T08:08:42.9257684Z Downloading kiwisolver (1.4MiB) 2026-02-21T08:08:42.9259228Z Downloading pillow (6.7MiB) 2026-02-21T08:08:42.9259547Z Downloading matplotlib (8.3MiB) 2026-02-21T08:08:42.9262578Z Downloading hf-xet (3.2MiB) 2026-02-21T08:08:42.9262880Z Downloading tokenizers (3.0MiB) 2026-02-21T08:08:42.9263133Z Downloading transformers (10.3MiB) 2026-02-21T08:08:43.0697218Z Downloaded kiwisolver 2026-02-21T08:08:43.1318699Z Downloaded tokenizers 2026-02-21T08:08:43.1405116Z Downloaded hf-xet 2026-02-21T08:08:43.2871896Z Downloaded pillow 2026-02-21T08:08:43.3130826Z Downloaded fonttools 2026-02-21T08:08:43.4070786Z Downloaded matplotlib 2026-02-21T08:08:44.2458854Z Downloaded transformers 2026-02-21T08:08:44.2464893Z Prepared 23 packages in 1.33s 2026-02-21T08:08:44.2493380Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:44.2493903Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:08:44.2494434Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:08:44.3181492Z Installed 23 packages in 71ms 2026-02-21T08:08:44.3186644Z + certifi==2026.1.4 2026-02-21T08:08:44.3191275Z + charset-normalizer==3.4.4 2026-02-21T08:08:44.3193292Z + contourpy==1.3.3 2026-02-21T08:08:44.3193505Z + cycler==0.12.1 2026-02-21T08:08:44.3193682Z + fonttools==4.61.1 2026-02-21T08:08:44.3193876Z + hf-xet==1.2.0 2026-02-21T08:08:44.3194066Z + huggingface-hub==0.36.2 2026-02-21T08:08:44.3194244Z + idna==3.11 2026-02-21T08:08:44.3194387Z + kiwisolver==1.4.9 2026-02-21T08:08:44.3194561Z + matplotlib==3.10.8 2026-02-21T08:08:44.3194723Z + nvidia-ml-py==13.590.48 2026-02-21T08:08:44.3194893Z + pillow==12.1.1 2026-02-21T08:08:44.3195047Z + pyparsing==3.3.2 2026-02-21T08:08:44.3195205Z + python-dateutil==2.9.0.post0 2026-02-21T08:08:44.3195382Z + regex==2026.2.19 2026-02-21T08:08:44.3195521Z + requests==2.32.5 2026-02-21T08:08:44.3195674Z + safetensors==0.7.0 2026-02-21T08:08:44.3195820Z + six==1.17.0 2026-02-21T08:08:44.3195964Z + tabulate==0.9.0 2026-02-21T08:08:44.3196117Z + tokenizers==0.21.4 2026-02-21T08:08:44.3196270Z + tqdm==4.67.3 2026-02-21T08:08:44.3196414Z + transformers==4.53.0 2026-02-21T08:08:44.3196574Z + urllib3==2.6.3 2026-02-21T08:08:44.3271289Z + python install.py --liger 2026-02-21T08:08:48.0367978Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:48.0394556Z Audited 6 packages in 3ms 2026-02-21T08:08:48.2076769Z INFO:__main__:[tritonbench] installing liger-kernels... 2026-02-21T08:08:48.2137341Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:48.3505380Z Resolved 1 package in 135ms 2026-02-21T08:08:48.4927729Z Prepared 1 package in 142ms 2026-02-21T08:08:48.4956136Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:48.4956649Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 
2026-02-21T08:08:48.4957457Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:08:48.5544848Z Installed 1 package in 61ms 2026-02-21T08:08:48.5549286Z + liger-kernel-nightly==0.7.0.dev20260219183429 2026-02-21T08:08:48.5570950Z INFO:__main__:[tritonbench] installation complete! 2026-02-21T08:08:49.0038486Z + uv pip install -e . --no-deps 2026-02-21T08:08:49.0477101Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:49.0513942Z Resolved 1 package in 2ms 2026-02-21T08:08:49.0995435Z Building tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:08:50.1314568Z Built tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:08:50.1774666Z Prepared 1 package in 1.12s 2026-02-21T08:08:50.1779598Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:50.1783599Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:08:50.1785269Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:08:50.1846403Z Installed 1 package in 6ms 2026-02-21T08:08:50.1848551Z + tritonbench==0.0.1 (from file:///__w/helion/helion/benchmarks/tritonbench) 2026-02-21T08:08:50.1908145Z + popd 2026-02-21T08:08:50.1908373Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:08:50.1908604Z /__w/helion/helion 2026-02-21T08:08:50.1908772Z + popd 2026-02-21T08:08:50.1955532Z ##[group]Run rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:08:50.1955880Z rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:08:50.1956135Z  2026-02-21T08:08:50.1956307Z source .venv/bin/activate 2026-02-21T08:08:50.1956487Z  2026-02-21T08:08:50.1956664Z TEST_REPORTS_DIR=$(pwd)/test/test-reports 2026-02-21T08:08:50.1956892Z mkdir -p "$TEST_REPORTS_DIR" 2026-02-21T08:08:50.1957097Z echo "$TEST_REPORTS_DIR" 2026-02-21T08:08:50.1957271Z  2026-02-21T08:08:50.1957420Z KERNEL_LIST="int4_gemm" 2026-02-21T08:08:50.1957627Z for kernel in ${KERNEL_LIST//,/ }; do 2026-02-21T08:08:50.1957862Z  echo "==========================================" 2026-02-21T08:08:50.1958130Z  echo "Running benchmark for kernel: $kernel" 2026-02-21T08:08:50.1958374Z  echo "==========================================" 2026-02-21T08:08:50.1958581Z  2026-02-21T08:08:50.1958813Z  # Get available implementations and baseline for this kernel 2026-02-21T08:08:50.1959249Z  KERNEL_INFO=$(python benchmarks/run.py --list-impls-for-benchmark-ci --op $kernel | grep "^$kernel:") 2026-02-21T08:08:50.1959684Z  IMPLS=$(echo "$KERNEL_INFO" | sed -n 's/.*impls=\([^ ]*\).*/\1/p') 2026-02-21T08:08:50.1960024Z  BASELINE=$(echo "$KERNEL_INFO" | sed -n 's/.*baseline=\([^ ]*\).*/\1/p') 2026-02-21T08:08:50.1960284Z  2026-02-21T08:08:50.1960433Z  if [[ -z "$IMPLS" ]]; then 2026-02-21T08:08:50.1960716Z  echo "Warning: No implementations found for kernel $kernel, skipping..." 
2026-02-21T08:08:50.1961016Z  continue 2026-02-21T08:08:50.1961170Z  fi 2026-02-21T08:08:50.1961332Z  if [[ -z "$BASELINE" ]]; then 2026-02-21T08:08:50.1961656Z  echo "Warning: No baseline found for kernel $kernel, skipping..." 2026-02-21T08:08:50.1961919Z  continue 2026-02-21T08:08:50.1962087Z  fi 2026-02-21T08:08:50.1962249Z  echo "Using baseline: $BASELINE" 2026-02-21T08:08:50.1962513Z  echo "Available implementations for $kernel: $IMPLS" 2026-02-21T08:08:50.1962746Z  2026-02-21T08:08:50.1962929Z  # Do autotuning but do not record the results 2026-02-21T08:08:50.1963159Z  python benchmarks/run.py \ 2026-02-21T08:08:50.1963367Z  --op $kernel \ 2026-02-21T08:08:50.1963581Z  --metrics speedup,accuracy \ 2026-02-21T08:08:50.1963819Z  --latency-measure-mode triton_do_bench \ 2026-02-21T08:08:50.1964050Z  --cudagraph \ 2026-02-21T08:08:50.1964234Z  --only $IMPLS \ 2026-02-21T08:08:50.1964456Z  --only-match-mode prefix-with-baseline \ 2026-02-21T08:08:50.1964690Z  --baseline $BASELINE \ 2026-02-21T08:08:50.1964895Z  --atol 1e-2 \ 2026-02-21T08:08:50.1965094Z  --rtol 1e-2 \ 2026-02-21T08:08:50.1965450Z  --input-sample-mode equally-spaced-k \ 2026-02-21T08:08:50.1965693Z  --keep-going \ 2026-02-21T08:08:50.1965868Z   2026-02-21T08:08:50.1966026Z  2026-02-21T08:08:50.1966171Z  # Relax the GPU 2026-02-21T08:08:50.1966353Z  sleep 2m 2026-02-21T08:08:50.1966501Z  2026-02-21T08:08:50.1966673Z  # Run again with cache and record results 2026-02-21T08:08:50.1966996Z  HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py \ 2026-02-21T08:08:50.1967301Z  --op $kernel \ 2026-02-21T08:08:50.1967498Z  --metrics speedup,accuracy \ 2026-02-21T08:08:50.1967727Z  --latency-measure-mode triton_do_bench \ 2026-02-21T08:08:50.1967955Z  --cudagraph \ 2026-02-21T08:08:50.1968126Z  --only $IMPLS \ 2026-02-21T08:08:50.1968445Z  --only-match-mode prefix-with-baseline \ 2026-02-21T08:08:50.1968681Z  --baseline $BASELINE \ 2026-02-21T08:08:50.1968878Z  --atol 1e-2 \ 
2026-02-21T08:08:50.1969054Z  --rtol 1e-2 \ 2026-02-21T08:08:50.1969254Z  --input-sample-mode equally-spaced-k \ 2026-02-21T08:08:50.1969517Z  --output "$TEST_REPORTS_DIR/helionbench.json" \ 2026-02-21T08:08:50.1969762Z  --append-to-output \ 2026-02-21T08:08:50.1969962Z  --keep-going \ 2026-02-21T08:08:50.1970130Z   2026-02-21T08:08:50.1970279Z  2026-02-21T08:08:50.1970475Z  echo "✅ Completed benchmark for kernel: $kernel" 2026-02-21T08:08:50.1970696Z done 2026-02-21T08:08:50.1970841Z  2026-02-21T08:08:50.1971030Z if [[ ! -s "$TEST_REPORTS_DIR/helionbench.json" ]]; then 2026-02-21T08:08:50.1971313Z  echo "❌ helionbench.json is missing or empty" 2026-02-21T08:08:50.1971526Z  exit 1 2026-02-21T08:08:50.1971731Z fi 2026-02-21T08:08:50.1971905Z cat "$TEST_REPORTS_DIR/helionbench.json" 2026-02-21T08:08:50.1972256Z shell: bash -l {0} 2026-02-21T08:08:50.1972424Z env: 2026-02-21T08:08:50.1972576Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:08:50.1972792Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:50.1973058Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:08:50.1973324Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:50.1973552Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:50.1973793Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:50.1974198Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:08:50.1974606Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:08:50.1974844Z ##[endgroup] 2026-02-21T08:08:50.3126903Z /__w/helion/helion/test/test-reports 2026-02-21T08:08:50.3127284Z ========================================== 2026-02-21T08:08:50.3127698Z Running benchmark for kernel: int4_gemm 2026-02-21T08:08:50.3127963Z ========================================== 2026-02-21T08:08:54.7063557Z Using baseline: preprocessed_eager_int4_gemm 2026-02-21T08:08:54.7066969Z Available 
implementations for int4_gemm: helion_int4_gemm_tritonbench,preprocessed_torch_compile_int4_gemm,preprocessed_triton_int4_gemm 2026-02-21T08:08:59.2331961Z Applying custom args for int4_gemm: {'num_inputs': 10} 2026-02-21T08:08:59.2720680Z Running int4_gemm benchmark with Helion implementation... 2026-02-21T08:08:59.2725026Z 2026-02-21T08:08:59.8992331Z Equally-spaced-k mode: Selected 10 equally spaced inputs (total available: 32) 2026-02-21T08:08:59.8997049Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 3, 7, 10, 14, 17, 21, 24, 28, 31] 2026-02-21T08:08:59.9003893Z 2026-02-21T08:08:59.9011442Z 0%| | 0/10 [00:00 {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:09:22.4019507Z %cst = arith.constant dense<0> : tensor<64x2x2048xi8> 2026-02-21T08:09:22.4019759Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:09:22.4019947Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:22.4020165Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:09:22.4020394Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:09:22.4020833Z %cst_2 = arith.constant dense<4> : tensor<64x2048xi8> 2026-02-21T08:09:22.4021068Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:09:22.4021297Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<1x2048xf32> 2026-02-21T08:09:22.4021726Z %cst_4 = arith.constant dense<1280> : tensor<2048xi32> 2026-02-21T08:09:22.4021950Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:09:22.4022149Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:09:22.4022329Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:09:22.4022517Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:22.4022702Z %c1280_i32 = arith.constant 1280 : i32 2026-02-21T08:09:22.4022909Z %c1280_i64 = arith.constant 1280 : i64 2026-02-21T08:09:22.4023114Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:22.4023434Z %0 = tt.make_tensor_descriptor 
%arg1, [%c4096_i32, %c1280_i32], [%c1280_i64, %c1_i64] : , > 2026-02-21T08:09:22.4023770Z %1 = tt.get_program_id x : i32 2026-02-21T08:09:22.4023955Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:09:22.4024149Z %3 = arith.minsi %2, %c1_i32 : i32 2026-02-21T08:09:22.4024350Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:09:22.4024571Z %4 = arith.divsi %arg3, %c32_i32 : i32 2026-02-21T08:09:22.4024764Z %5 = arith.muli %4, %c32_i32 : i32 2026-02-21T08:09:22.4024950Z %6 = arith.subi %c1_i32, %5 : i32 2026-02-21T08:09:22.4025136Z %7 = arith.minsi %6, %c32_i32 : i32 2026-02-21T08:09:22.4025320Z %8 = arith.remsi %arg3, %c32_i32 : i32 2026-02-21T08:09:22.4025508Z %9 = arith.divsi %8, %7 : i32 2026-02-21T08:09:22.4025695Z %10 = arith.muli %9, %c2048_i32 : i32 2026-02-21T08:09:22.4026274Z %11 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> 2026-02-21T08:09:22.4026543Z %12 = tt.splat %10 : i32 -> tensor<2048xi32> 2026-02-21T08:09:22.4026748Z %13 = arith.addi %12, %11 : tensor<2048xi32> 2026-02-21T08:09:22.4026971Z %14 = arith.cmpi slt, %13, %cst_4 : tensor<2048xi32> 2026-02-21T08:09:22.4027181Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T08:09:22.4027380Z %c192_i32 = arith.constant 192 : i32 2026-02-21T08:09:22.4027721Z %15 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg5 = %cst_3) -> (tensor<1x2048xf32>) : i32 { 2026-02-21T08:09:22.4028044Z %55 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:09:22.4028290Z %56 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:09:22.4028546Z %57 = tt.splat %55 : i32 -> tensor<128xi32> 2026-02-21T08:09:22.4028754Z %58 = arith.addi %57, %56 : tensor<128xi32> 2026-02-21T08:09:22.4029151Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:09:22.4029463Z %60 = tt.splat %arg0 : !tt.ptr -> tensor<1x128x!tt.ptr> 2026-02-21T08:09:22.4029765Z %61 = tt.addptr %60, %59 : tensor<1x128x!tt.ptr>, tensor<1x128xi32> 
2026-02-21T08:09:22.4030030Z %62 = tt.load %61 : tensor<1x128x!tt.ptr> 2026-02-21T08:09:22.4030303Z %63 = arith.extf %62 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T08:09:22.4030648Z %64 = tt.descriptor_load %0[%arg4, %10] : !tt.tensordesc> -> tensor<64x2048xi8> 2026-02-21T08:09:22.4030969Z %65 = arith.shli %64, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4031208Z %66 = arith.shrsi %65, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4031428Z %67 = arith.shrsi %64, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4031726Z %68 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:09:22.4032018Z %69 = tt.expand_dims %68 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:09:22.4032338Z %70 = tt.expand_dims %69 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:09:22.4032670Z %71 = tt.expand_dims %66 {axis = 1 : i32} : tensor<64x2048xi8> -> tensor<64x1x2048xi8> 2026-02-21T08:09:22.4033000Z %72 = tt.expand_dims %67 {axis = 1 : i32} : tensor<64x2048xi8> -> tensor<64x1x2048xi8> 2026-02-21T08:09:22.4033293Z %73 = arith.cmpi eq, %70, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:09:22.4033546Z %74 = tt.broadcast %73 : tensor<1x2x1xi1> -> tensor<64x2x2048xi1> 2026-02-21T08:09:22.4033834Z %75 = tt.broadcast %71 : tensor<64x1x2048xi8> -> tensor<64x2x2048xi8> 2026-02-21T08:09:22.4034141Z %76 = arith.select %74, %75, %cst : tensor<64x2x2048xi1>, tensor<64x2x2048xi8> 2026-02-21T08:09:22.4034417Z %77 = arith.cmpi eq, %70, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:09:22.4034686Z %78 = tt.broadcast %72 : tensor<64x1x2048xi8> -> tensor<64x2x2048xi8> 2026-02-21T08:09:22.4034967Z %79 = tt.broadcast %77 : tensor<1x2x1xi1> -> tensor<64x2x2048xi1> 2026-02-21T08:09:22.4035262Z %80 = arith.select %79, %78, %76 : tensor<64x2x2048xi1>, tensor<64x2x2048xi8> 2026-02-21T08:09:22.4035552Z %81 = tt.reshape %80 : tensor<64x2x2048xi8> -> tensor<128x2048xi8> 2026-02-21T08:09:22.4035841Z %82 = arith.sitofp %81 : tensor<128x2048xi8> to 
tensor<128x2048xf32> 2026-02-21T08:09:22.4036153Z %83 = tt.expand_dims %63 {axis = 2 : i32} : tensor<1x128xf32> -> tensor<1x128x1xf32> 2026-02-21T08:09:22.4036487Z %84 = tt.expand_dims %82 {axis = 0 : i32} : tensor<128x2048xf32> -> tensor<1x128x2048xf32> 2026-02-21T08:09:22.4036820Z %85 = tt.broadcast %83 : tensor<1x128x1xf32> -> tensor<1x128x2048xf32> 2026-02-21T08:09:22.4037080Z %86 = arith.mulf %85, %84 : tensor<1x128x2048xf32> 2026-02-21T08:09:22.4037303Z %87 = "tt.reduce"(%86) <{axis = 1 : i32}> ({ 2026-02-21T08:09:22.4037575Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:22.4037763Z %161 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:22.4037963Z tt.reduce.return %161 : f32 2026-02-21T08:09:22.4038167Z }) : (tensor<1x128x2048xf32>) -> tensor<1x2048xf32> 2026-02-21T08:09:22.4038395Z %88 = arith.addf %arg5, %87 : tensor<1x2048xf32> 2026-02-21T08:09:22.4038598Z %c1_i32_5 = arith.constant 1 : i32 2026-02-21T08:09:22.4038799Z %89 = arith.muli %c64_i32, %c1_i32_5 : i32 2026-02-21T08:09:22.4038990Z %90 = arith.addi %arg4, %89 : i32 2026-02-21T08:09:22.4039179Z %91 = arith.muli %90, %c2_i32 : i32 2026-02-21T08:09:22.4039422Z %92 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:09:22.4039669Z %93 = tt.splat %91 : i32 -> tensor<128xi32> 2026-02-21T08:09:22.4039886Z %94 = arith.addi %93, %92 : tensor<128xi32> 2026-02-21T08:09:22.4040204Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:09:22.4040538Z %96 = tt.splat %arg0 : !tt.ptr -> tensor<1x128x!tt.ptr> 2026-02-21T08:09:22.4040834Z %97 = tt.addptr %96, %95 : tensor<1x128x!tt.ptr>, tensor<1x128xi32> 2026-02-21T08:09:22.4041111Z %98 = tt.load %97 : tensor<1x128x!tt.ptr> 2026-02-21T08:09:22.4041371Z %99 = arith.extf %98 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T08:09:22.4041745Z %100 = tt.descriptor_load %0[%90, %10] : !tt.tensordesc> -> tensor<64x2048xi8> 2026-02-21T08:09:22.4042082Z %101 = arith.shli %100, %cst_2 : 
tensor<64x2048xi8> 2026-02-21T08:09:22.4042327Z %102 = arith.shrsi %101, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4042580Z %103 = arith.shrsi %100, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4042850Z %104 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:09:22.4043160Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:09:22.4043503Z %106 = tt.expand_dims %105 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:09:22.4043850Z %107 = tt.expand_dims %102 {axis = 1 : i32} : tensor<64x2048xi8> -> tensor<64x1x2048xi8> 2026-02-21T08:09:22.4044212Z %108 = tt.expand_dims %103 {axis = 1 : i32} : tensor<64x2048xi8> -> tensor<64x1x2048xi8> 2026-02-21T08:09:22.4044514Z %109 = arith.cmpi eq, %106, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:09:22.4044794Z %110 = tt.broadcast %109 : tensor<1x2x1xi1> -> tensor<64x2x2048xi1> 2026-02-21T08:09:22.4045104Z %111 = tt.broadcast %107 : tensor<64x1x2048xi8> -> tensor<64x2x2048xi8> 2026-02-21T08:09:22.4045427Z %112 = arith.select %110, %111, %cst : tensor<64x2x2048xi1>, tensor<64x2x2048xi8> 2026-02-21T08:09:22.4045732Z %113 = arith.cmpi eq, %106, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:09:22.4046011Z %114 = tt.broadcast %108 : tensor<64x1x2048xi8> -> tensor<64x2x2048xi8> 2026-02-21T08:09:22.4046320Z %115 = tt.broadcast %113 : tensor<1x2x1xi1> -> tensor<64x2x2048xi1> 2026-02-21T08:09:22.4046633Z %116 = arith.select %115, %114, %112 : tensor<64x2x2048xi1>, tensor<64x2x2048xi8> 2026-02-21T08:09:22.4046947Z %117 = tt.reshape %116 : tensor<64x2x2048xi8> -> tensor<128x2048xi8> 2026-02-21T08:09:22.4047253Z %118 = arith.sitofp %117 : tensor<128x2048xi8> to tensor<128x2048xf32> 2026-02-21T08:09:22.4047581Z %119 = tt.expand_dims %99 {axis = 2 : i32} : tensor<1x128xf32> -> tensor<1x128x1xf32> 2026-02-21T08:09:22.4047949Z %120 = tt.expand_dims %118 {axis = 0 : i32} : tensor<128x2048xf32> -> tensor<1x128x2048xf32> 2026-02-21T08:09:22.4048296Z %121 = 
tt.broadcast %119 : tensor<1x128x1xf32> -> tensor<1x128x2048xf32> 2026-02-21T08:09:22.4048584Z %122 = arith.mulf %121, %120 : tensor<1x128x2048xf32> 2026-02-21T08:09:22.4048831Z %123 = "tt.reduce"(%122) <{axis = 1 : i32}> ({ 2026-02-21T08:09:22.4049043Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:22.4049343Z %161 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:22.4049542Z tt.reduce.return %161 : f32 2026-02-21T08:09:22.4049752Z }) : (tensor<1x128x2048xf32>) -> tensor<1x2048xf32> 2026-02-21T08:09:22.4049974Z %124 = arith.addf %88, %123 : tensor<1x2048xf32> 2026-02-21T08:09:22.4050185Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T08:09:22.4050387Z %125 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T08:09:22.4050586Z %126 = arith.addi %arg4, %125 : i32 2026-02-21T08:09:22.4050790Z %127 = arith.muli %126, %c2_i32 : i32 2026-02-21T08:09:22.4051034Z %128 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:09:22.4051302Z %129 = tt.splat %127 : i32 -> tensor<128xi32> 2026-02-21T08:09:22.4051514Z %130 = arith.addi %129, %128 : tensor<128xi32> 2026-02-21T08:09:22.4051905Z %131 = tt.expand_dims %130 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:09:22.4052223Z %132 = tt.splat %arg0 : !tt.ptr -> tensor<1x128x!tt.ptr> 2026-02-21T08:09:22.4052520Z %133 = tt.addptr %132, %131 : tensor<1x128x!tt.ptr>, tensor<1x128xi32> 2026-02-21T08:09:22.4052802Z %134 = tt.load %133 : tensor<1x128x!tt.ptr> 2026-02-21T08:09:22.4053049Z %135 = arith.extf %134 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T08:09:22.4053393Z %136 = tt.descriptor_load %0[%126, %10] : !tt.tensordesc> -> tensor<64x2048xi8> 2026-02-21T08:09:22.4053734Z %137 = arith.shli %136, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4053979Z %138 = arith.shrsi %137, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4054208Z %139 = arith.shrsi %136, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4054473Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : 
tensor<2xi32> 2026-02-21T08:09:22.4054770Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:09:22.4055117Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:09:22.4055457Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<64x2048xi8> -> tensor<64x1x2048xi8> 2026-02-21T08:09:22.4055802Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<64x2048xi8> -> tensor<64x1x2048xi8> 2026-02-21T08:09:22.4056101Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:09:22.4056360Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<64x2x2048xi1> 2026-02-21T08:09:22.4056652Z %147 = tt.broadcast %143 : tensor<64x1x2048xi8> -> tensor<64x2x2048xi8> 2026-02-21T08:09:22.4056957Z %148 = arith.select %146, %147, %cst : tensor<64x2x2048xi1>, tensor<64x2x2048xi8> 2026-02-21T08:09:22.4057284Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:09:22.4057554Z %150 = tt.broadcast %144 : tensor<64x1x2048xi8> -> tensor<64x2x2048xi8> 2026-02-21T08:09:22.4057841Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<64x2x2048xi1> 2026-02-21T08:09:22.4058139Z %152 = arith.select %151, %150, %148 : tensor<64x2x2048xi1>, tensor<64x2x2048xi8> 2026-02-21T08:09:22.4058436Z %153 = tt.reshape %152 : tensor<64x2x2048xi8> -> tensor<128x2048xi8> 2026-02-21T08:09:22.4058726Z %154 = arith.sitofp %153 : tensor<128x2048xi8> to tensor<128x2048xf32> 2026-02-21T08:09:22.4059049Z %155 = tt.expand_dims %135 {axis = 2 : i32} : tensor<1x128xf32> -> tensor<1x128x1xf32> 2026-02-21T08:09:22.4059396Z %156 = tt.expand_dims %154 {axis = 0 : i32} : tensor<128x2048xf32> -> tensor<1x128x2048xf32> 2026-02-21T08:09:22.4059739Z %157 = tt.broadcast %155 : tensor<1x128x1xf32> -> tensor<1x128x2048xf32> 2026-02-21T08:09:22.4060010Z %158 = arith.mulf %157, %156 : tensor<1x128x2048xf32> 2026-02-21T08:09:22.4060240Z %159 = "tt.reduce"(%158) <{axis = 1 : i32}> ({ 2026-02-21T08:09:22.4060438Z ^bb0(%arg6: f32, 
%arg7: f32): 2026-02-21T08:09:22.4060680Z %161 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:22.4060877Z tt.reduce.return %161 : f32 2026-02-21T08:09:22.4061076Z }) : (tensor<1x128x2048xf32>) -> tensor<1x2048xf32> 2026-02-21T08:09:22.4061332Z %160 = arith.addf %124, %159 : tensor<1x2048xf32> 2026-02-21T08:09:22.4061587Z scf.yield %160 : tensor<1x2048xf32> 2026-02-21T08:09:22.4061776Z } {tt.num_stages = 1 : i32} 2026-02-21T08:09:22.4061972Z %16 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T08:09:22.4062217Z %17 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:09:22.4062481Z %18 = tt.splat %16 : i32 -> tensor<128xi32> 2026-02-21T08:09:22.4062689Z %19 = arith.addi %18, %17 : tensor<128xi32> 2026-02-21T08:09:22.4062962Z %20 = tt.expand_dims %19 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:09:22.4063320Z %21 = tt.splat %arg0 : !tt.ptr -> tensor<1x128x!tt.ptr> 2026-02-21T08:09:22.4063604Z %22 = tt.addptr %21, %20 : tensor<1x128x!tt.ptr>, tensor<1x128xi32> 2026-02-21T08:09:22.4063865Z %23 = tt.load %22 : tensor<1x128x!tt.ptr> 2026-02-21T08:09:22.4064101Z %24 = arith.extf %23 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T08:09:22.4064442Z %25 = tt.descriptor_load %0[%c4032_i32, %10] : !tt.tensordesc> -> tensor<64x2048xi8> 2026-02-21T08:09:22.4064763Z %26 = arith.shli %25, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4064989Z %27 = arith.shrsi %26, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4065213Z %28 = arith.shrsi %25, %cst_2 : tensor<64x2048xi8> 2026-02-21T08:09:22.4065454Z %29 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:09:22.4065748Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:09:22.4066050Z %31 = tt.expand_dims %30 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:09:22.4066377Z %32 = tt.expand_dims %27 {axis = 1 : i32} : tensor<64x2048xi8> -> tensor<64x1x2048xi8> 
2026-02-21T08:09:22.4066699Z %33 = tt.expand_dims %28 {axis = 1 : i32} : tensor<64x2048xi8> -> tensor<64x1x2048xi8> 2026-02-21T08:09:22.4066983Z %34 = arith.cmpi eq, %31, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:09:22.4067241Z %35 = tt.broadcast %34 : tensor<1x2x1xi1> -> tensor<64x2x2048xi1> 2026-02-21T08:09:22.4067516Z %36 = tt.broadcast %32 : tensor<64x1x2048xi8> -> tensor<64x2x2048xi8> 2026-02-21T08:09:22.4067813Z %37 = arith.select %35, %36, %cst : tensor<64x2x2048xi1>, tensor<64x2x2048xi8> 2026-02-21T08:09:22.4068082Z %38 = arith.cmpi eq, %31, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:09:22.4068337Z %39 = tt.broadcast %33 : tensor<64x1x2048xi8> -> tensor<64x2x2048xi8> 2026-02-21T08:09:22.4068613Z %40 = tt.broadcast %38 : tensor<1x2x1xi1> -> tensor<64x2x2048xi1> 2026-02-21T08:09:22.4068889Z %41 = arith.select %40, %39, %37 : tensor<64x2x2048xi1>, tensor<64x2x2048xi8> 2026-02-21T08:09:22.4069177Z %42 = tt.reshape %41 : tensor<64x2x2048xi8> -> tensor<128x2048xi8> 2026-02-21T08:09:22.4069447Z %43 = arith.sitofp %42 : tensor<128x2048xi8> to tensor<128x2048xf32> 2026-02-21T08:09:22.4069747Z %44 = tt.expand_dims %24 {axis = 2 : i32} : tensor<1x128xf32> -> tensor<1x128x1xf32> 2026-02-21T08:09:22.4070074Z %45 = tt.expand_dims %43 {axis = 0 : i32} : tensor<128x2048xf32> -> tensor<1x128x2048xf32> 2026-02-21T08:09:22.4070388Z %46 = tt.broadcast %44 : tensor<1x128x1xf32> -> tensor<1x128x2048xf32> 2026-02-21T08:09:22.4070647Z %47 = arith.mulf %46, %45 : tensor<1x128x2048xf32> 2026-02-21T08:09:22.4070854Z %48 = "tt.reduce"(%47) <{axis = 1 : i32}> ({ 2026-02-21T08:09:22.4071052Z ^bb0(%arg4: f32, %arg5: f32): 2026-02-21T08:09:22.4071230Z %55 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:09:22.4071423Z tt.reduce.return %55 : f32 2026-02-21T08:09:22.4071660Z }) : (tensor<1x128x2048xf32>) -> tensor<1x2048xf32> 2026-02-21T08:09:22.4071932Z %49 = arith.addf %15, %48 : tensor<1x2048xf32> 2026-02-21T08:09:22.4072174Z %50 = arith.truncf %49 : tensor<1x2048xf32> to tensor<1x2048xbf16> 
2026-02-21T08:09:22.4072461Z %51 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2048xi32> -> tensor<1x2048xi32> 2026-02-21T08:09:22.4072768Z %52 = tt.splat %arg2 : !tt.ptr -> tensor<1x2048x!tt.ptr> 2026-02-21T08:09:22.4073049Z %53 = tt.addptr %52, %51 : tensor<1x2048x!tt.ptr>, tensor<1x2048xi32> 2026-02-21T08:09:22.4073353Z %54 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1> 2026-02-21T08:09:22.4073628Z tt.store %53, %50, %54 : tensor<1x2048x!tt.ptr> 2026-02-21T08:09:22.4073867Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:09:22.4074069Z tt.return 2026-02-21T08:09:22.4074199Z } 2026-02-21T08:09:22.4074332Z } 2026-02-21T08:09:22.4074402Z 2026-02-21T08:09:22.4074504Z {-# 2026-02-21T08:09:22.4074652Z external_resources: { 2026-02-21T08:09:22.4074812Z mlir_reproducer: { 2026-02-21T08:09:22.4079168Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:22.4083712Z disable_threading: false, 2026-02-21T08:09:22.4083891Z verify_each: true 2026-02-21T08:09:22.4084051Z } 2026-02-21T08:09:22.4084187Z } 2026-02-21T08:09:22.4084307Z #-} 2026-02-21T08:09:22.4084763Z /tmp/torchinductor_root/ut/cutkqsg2mejks7w3shpbcrxpvoytecqjyrx5r73pa7zsk5xb34wd.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:09:22.4086037Z /tmp/torchinductor_root/ut/cutkqsg2mejks7w3shpbcrxpvoytecqjyrx5r73pa7zsk5xb34wd.py:19:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:22.4087068Z [12s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:09:22.4088322Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 1, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=16, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 4], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:09:22.4089517Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:22.4089787Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:09:23.0709382Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:09:23.0714935Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:09:23.0716629Z %cst = arith.constant dense<0> : tensor<64x2x128xi8> 2026-02-21T08:09:23.0716936Z %c10_i32 = arith.constant 10 : i32 2026-02-21T08:09:23.0717161Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:09:23.0717645Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:23.0717908Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:09:23.0718144Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:09:23.0718407Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:09:23.0718669Z %cst_2 = arith.constant dense<4> : tensor<64x128xi8> 2026-02-21T08:09:23.0718961Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<1x128xf32> 2026-02-21T08:09:23.0719229Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:09:23.0719444Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:09:23.0719657Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:23.0719875Z %c1280_i32 = arith.constant 1280 : i32 2026-02-21T08:09:23.0720082Z %c1280_i64 = arith.constant 
1280 : i64 2026-02-21T08:09:23.0720289Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:23.0720667Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c1280_i32], [%c1280_i64, %c1_i64] : , > 2026-02-21T08:09:23.0721065Z %1 = tt.get_program_id x : i32 2026-02-21T08:09:23.0721271Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:09:23.0721474Z %3 = arith.minsi %2, %c10_i32 : i32 2026-02-21T08:09:23.0721845Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:09:23.0722061Z %4 = arith.divsi %arg3, %c2_i32 : i32 2026-02-21T08:09:23.0722258Z %5 = arith.muli %4, %c2_i32 : i32 2026-02-21T08:09:23.0722437Z %6 = arith.subi %c10_i32, %5 : i32 2026-02-21T08:09:23.0722627Z %7 = arith.minsi %6, %c2_i32 : i32 2026-02-21T08:09:23.0722812Z %8 = arith.remsi %arg3, %c2_i32 : i32 2026-02-21T08:09:23.0723003Z %9 = arith.remsi %8, %7 : i32 2026-02-21T08:09:23.0723176Z %10 = arith.addi %5, %9 : i32 2026-02-21T08:09:23.0723358Z %11 = arith.muli %10, %c128_i32 : i32 2026-02-21T08:09:23.0723601Z %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:09:23.0723859Z %13 = tt.splat %11 : i32 -> tensor<128xi32> 2026-02-21T08:09:23.0724072Z %14 = arith.addi %13, %12 : tensor<128xi32> 2026-02-21T08:09:23.0724269Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T08:09:23.0724466Z %c192_i32 = arith.constant 192 : i32 2026-02-21T08:09:23.0724788Z %15 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg5 = %cst_3) -> (tensor<1x128xf32>) : i32 { 2026-02-21T08:09:23.0725168Z %53 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:09:23.0725372Z %54 = tt.splat %53 : i32 -> tensor<128xi32> 2026-02-21T08:09:23.0725576Z %55 = arith.addi %54, %12 : tensor<128xi32> 2026-02-21T08:09:23.0725845Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:09:23.0726157Z %57 = tt.splat %arg0 : !tt.ptr -> tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0726453Z %58 = tt.addptr %57, %56 : tensor<1x128x!tt.ptr>, tensor<1x128xi32> 
2026-02-21T08:09:23.0726763Z %59 = tt.load %58 evictionPolicy = evict_first : tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0727197Z %60 = arith.extf %59 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T08:09:23.0727536Z %61 = tt.descriptor_load %0[%arg4, %11] : !tt.tensordesc> -> tensor<64x128xi8> 2026-02-21T08:09:23.0727848Z %62 = arith.shli %61, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0728081Z %63 = arith.shrsi %62, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0728312Z %64 = arith.shrsi %61, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0728559Z %65 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:09:23.0728856Z %66 = tt.expand_dims %65 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:09:23.0729160Z %67 = tt.expand_dims %66 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:09:23.0729489Z %68 = tt.expand_dims %63 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T08:09:23.0729886Z %69 = tt.expand_dims %64 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T08:09:23.0730176Z %70 = arith.cmpi eq, %67, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:09:23.0730436Z %71 = tt.broadcast %70 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T08:09:23.0730710Z %72 = tt.broadcast %68 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T08:09:23.0731010Z %73 = arith.select %71, %72, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T08:09:23.0731275Z %74 = arith.cmpi eq, %67, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:09:23.0731530Z %75 = tt.broadcast %69 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T08:09:23.0731859Z %76 = tt.broadcast %74 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T08:09:23.0732138Z %77 = arith.select %76, %75, %73 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T08:09:23.0732421Z %78 = tt.reshape %77 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T08:09:23.0732693Z %79 = arith.sitofp %78 : 
tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T08:09:23.0733006Z %80 = tt.expand_dims %60 {axis = 2 : i32} : tensor<1x128xf32> -> tensor<1x128x1xf32> 2026-02-21T08:09:23.0733352Z %81 = tt.expand_dims %79 {axis = 0 : i32} : tensor<128x128xf32> -> tensor<1x128x128xf32> 2026-02-21T08:09:23.0733669Z %82 = tt.broadcast %80 : tensor<1x128x1xf32> -> tensor<1x128x128xf32> 2026-02-21T08:09:23.0733938Z %83 = arith.mulf %82, %81 : tensor<1x128x128xf32> 2026-02-21T08:09:23.0734153Z %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({ 2026-02-21T08:09:23.0734359Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:23.0734552Z %156 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:23.0734756Z tt.reduce.return %156 : f32 2026-02-21T08:09:23.0734957Z }) : (tensor<1x128x128xf32>) -> tensor<1x128xf32> 2026-02-21T08:09:23.0735185Z %85 = arith.addf %arg5, %84 : tensor<1x128xf32> 2026-02-21T08:09:23.0735398Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:09:23.0735595Z %86 = arith.muli %c64_i32, %c1_i32_4 : i32 2026-02-21T08:09:23.0735795Z %87 = arith.addi %arg4, %86 : i32 2026-02-21T08:09:23.0735979Z %88 = arith.muli %87, %c2_i32 : i32 2026-02-21T08:09:23.0736177Z %89 = tt.splat %88 : i32 -> tensor<128xi32> 2026-02-21T08:09:23.0736374Z %90 = arith.addi %89, %12 : tensor<128xi32> 2026-02-21T08:09:23.0736630Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:09:23.0736932Z %92 = tt.splat %arg0 : !tt.ptr -> tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0737217Z %93 = tt.addptr %92, %91 : tensor<1x128x!tt.ptr>, tensor<1x128xi32> 2026-02-21T08:09:23.0737524Z %94 = tt.load %93 evictionPolicy = evict_first : tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0737819Z %95 = arith.extf %94 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T08:09:23.0738145Z %96 = tt.descriptor_load %0[%87, %11] : !tt.tensordesc> -> tensor<64x128xi8> 2026-02-21T08:09:23.0738503Z %97 = arith.shli %96, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0738717Z %98 = arith.shrsi %97, 
%cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0738939Z %99 = arith.shrsi %96, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0739185Z %100 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:09:23.0739487Z %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:09:23.0739803Z %102 = tt.expand_dims %101 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:09:23.0740136Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T08:09:23.0740464Z %104 = tt.expand_dims %99 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T08:09:23.0740803Z %105 = arith.cmpi eq, %102, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:09:23.0741069Z %106 = tt.broadcast %105 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T08:09:23.0741347Z %107 = tt.broadcast %103 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T08:09:23.0741687Z %108 = arith.select %106, %107, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T08:09:23.0741969Z %109 = arith.cmpi eq, %102, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:09:23.0742238Z %110 = tt.broadcast %104 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T08:09:23.0742530Z %111 = tt.broadcast %109 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T08:09:23.0742820Z %112 = arith.select %111, %110, %108 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T08:09:23.0743128Z %113 = tt.reshape %112 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T08:09:23.0743411Z %114 = arith.sitofp %113 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T08:09:23.0743738Z %115 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x128xf32> -> tensor<1x128x1xf32> 2026-02-21T08:09:23.0744091Z %116 = tt.expand_dims %114 {axis = 0 : i32} : tensor<128x128xf32> -> tensor<1x128x128xf32> 2026-02-21T08:09:23.0744419Z %117 = tt.broadcast %115 : tensor<1x128x1xf32> -> tensor<1x128x128xf32> 2026-02-21T08:09:23.0744698Z %118 = 
arith.mulf %117, %116 : tensor<1x128x128xf32> 2026-02-21T08:09:23.0744925Z %119 = "tt.reduce"(%118) <{axis = 1 : i32}> ({ 2026-02-21T08:09:23.0745145Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:23.0745339Z %156 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:23.0745543Z tt.reduce.return %156 : f32 2026-02-21T08:09:23.0745748Z }) : (tensor<1x128x128xf32>) -> tensor<1x128xf32> 2026-02-21T08:09:23.0745967Z %120 = arith.addf %85, %119 : tensor<1x128xf32> 2026-02-21T08:09:23.0746177Z %c2_i32_5 = arith.constant 2 : i32 2026-02-21T08:09:23.0746369Z %121 = arith.muli %c64_i32, %c2_i32_5 : i32 2026-02-21T08:09:23.0746575Z %122 = arith.addi %arg4, %121 : i32 2026-02-21T08:09:23.0746767Z %123 = arith.muli %122, %c2_i32 : i32 2026-02-21T08:09:23.0747010Z %124 = tt.splat %123 : i32 -> tensor<128xi32> 2026-02-21T08:09:23.0747224Z %125 = arith.addi %124, %12 : tensor<128xi32> 2026-02-21T08:09:23.0747483Z %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:09:23.0747797Z %127 = tt.splat %arg0 : !tt.ptr -> tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0748092Z %128 = tt.addptr %127, %126 : tensor<1x128x!tt.ptr>, tensor<1x128xi32> 2026-02-21T08:09:23.0748422Z %129 = tt.load %128 evictionPolicy = evict_first : tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0748725Z %130 = arith.extf %129 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T08:09:23.0749066Z %131 = tt.descriptor_load %0[%122, %11] : !tt.tensordesc> -> tensor<64x128xi8> 2026-02-21T08:09:23.0749383Z %132 = arith.shli %131, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0749661Z %133 = arith.shrsi %132, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0749895Z %134 = arith.shrsi %131, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0750145Z %135 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:09:23.0750443Z %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:09:23.0750757Z %137 = tt.expand_dims %136 {axis = 2 : 
i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:09:23.0751094Z %138 = tt.expand_dims %133 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T08:09:23.0751425Z %139 = tt.expand_dims %134 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T08:09:23.0751741Z %140 = arith.cmpi eq, %137, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:09:23.0752001Z %141 = tt.broadcast %140 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T08:09:23.0752370Z %142 = tt.broadcast %138 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T08:09:23.0752676Z %143 = arith.select %141, %142, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T08:09:23.0752958Z %144 = arith.cmpi eq, %137, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:09:23.0753216Z %145 = tt.broadcast %139 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T08:09:23.0753498Z %146 = tt.broadcast %144 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T08:09:23.0753782Z %147 = arith.select %146, %145, %143 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T08:09:23.0754088Z %148 = tt.reshape %147 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T08:09:23.0754386Z %149 = arith.sitofp %148 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T08:09:23.0754706Z %150 = tt.expand_dims %130 {axis = 2 : i32} : tensor<1x128xf32> -> tensor<1x128x1xf32> 2026-02-21T08:09:23.0755073Z %151 = tt.expand_dims %149 {axis = 0 : i32} : tensor<128x128xf32> -> tensor<1x128x128xf32> 2026-02-21T08:09:23.0755409Z %152 = tt.broadcast %150 : tensor<1x128x1xf32> -> tensor<1x128x128xf32> 2026-02-21T08:09:23.0755690Z %153 = arith.mulf %152, %151 : tensor<1x128x128xf32> 2026-02-21T08:09:23.0755915Z %154 = "tt.reduce"(%153) <{axis = 1 : i32}> ({ 2026-02-21T08:09:23.0756129Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:23.0756334Z %156 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:23.0756535Z tt.reduce.return %156 : f32 2026-02-21T08:09:23.0756755Z }) : (tensor<1x128x128xf32>) -> 
tensor<1x128xf32> 2026-02-21T08:09:23.0756984Z %155 = arith.addf %120, %154 : tensor<1x128xf32> 2026-02-21T08:09:23.0757209Z scf.yield %155 : tensor<1x128xf32> 2026-02-21T08:09:23.0757462Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:09:23.0757736Z %16 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T08:09:23.0757951Z %17 = tt.splat %16 : i32 -> tensor<128xi32> 2026-02-21T08:09:23.0758160Z %18 = arith.addi %17, %12 : tensor<128xi32> 2026-02-21T08:09:23.0758426Z %19 = tt.expand_dims %18 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:09:23.0758734Z %20 = tt.splat %arg0 : !tt.ptr -> tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0759032Z %21 = tt.addptr %20, %19 : tensor<1x128x!tt.ptr>, tensor<1x128xi32> 2026-02-21T08:09:23.0759350Z %22 = tt.load %21 evictionPolicy = evict_first : tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0759662Z %23 = arith.extf %22 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T08:09:23.0760018Z %24 = tt.descriptor_load %0[%c4032_i32, %11] : !tt.tensordesc> -> tensor<64x128xi8> 2026-02-21T08:09:23.0760344Z %25 = arith.shli %24, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0760577Z %26 = arith.shrsi %25, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0760802Z %27 = arith.shrsi %24, %cst_2 : tensor<64x128xi8> 2026-02-21T08:09:23.0761120Z %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:09:23.0761420Z %29 = tt.expand_dims %28 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:09:23.0761794Z %30 = tt.expand_dims %29 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:09:23.0762115Z %31 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T08:09:23.0762433Z %32 = tt.expand_dims %27 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T08:09:23.0762721Z %33 = arith.cmpi eq, %30, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:09:23.0762974Z %34 = tt.broadcast %33 : 
tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T08:09:23.0763247Z %35 = tt.broadcast %31 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T08:09:23.0763542Z %36 = arith.select %34, %35, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T08:09:23.0763861Z %37 = arith.cmpi eq, %30, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:09:23.0764120Z %38 = tt.broadcast %32 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T08:09:23.0764387Z %39 = tt.broadcast %37 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T08:09:23.0764667Z %40 = arith.select %39, %38, %36 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T08:09:23.0764949Z %41 = tt.reshape %40 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T08:09:23.0765215Z %42 = arith.sitofp %41 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T08:09:23.0765519Z %43 = tt.expand_dims %23 {axis = 2 : i32} : tensor<1x128xf32> -> tensor<1x128x1xf32> 2026-02-21T08:09:23.0765847Z %44 = tt.expand_dims %42 {axis = 0 : i32} : tensor<128x128xf32> -> tensor<1x128x128xf32> 2026-02-21T08:09:23.0766161Z %45 = tt.broadcast %43 : tensor<1x128x1xf32> -> tensor<1x128x128xf32> 2026-02-21T08:09:23.0766423Z %46 = arith.mulf %45, %44 : tensor<1x128x128xf32> 2026-02-21T08:09:23.0766631Z %47 = "tt.reduce"(%46) <{axis = 1 : i32}> ({ 2026-02-21T08:09:23.0766831Z ^bb0(%arg4: f32, %arg5: f32): 2026-02-21T08:09:23.0767013Z %53 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:09:23.0767209Z tt.reduce.return %53 : f32 2026-02-21T08:09:23.0767404Z }) : (tensor<1x128x128xf32>) -> tensor<1x128xf32> 2026-02-21T08:09:23.0767620Z %48 = arith.addf %15, %47 : tensor<1x128xf32> 2026-02-21T08:09:23.0767852Z %49 = arith.truncf %48 : tensor<1x128xf32> to tensor<1x128xbf16> 2026-02-21T08:09:23.0768142Z %50 = tt.expand_dims %14 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:09:23.0768447Z %51 = tt.splat %arg2 : !tt.ptr -> tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0768725Z %52 = tt.addptr %51, %50 : tensor<1x128x!tt.ptr>, 
tensor<1x128xi32> 2026-02-21T08:09:23.0768990Z tt.store %52, %49 : tensor<1x128x!tt.ptr> 2026-02-21T08:09:23.0769211Z } {tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:09:23.0769416Z tt.return 2026-02-21T08:09:23.0769547Z } 2026-02-21T08:09:23.0769674Z } 2026-02-21T08:09:23.0769745Z 2026-02-21T08:09:23.0769804Z {-# 2026-02-21T08:09:23.0769946Z external_resources: { 2026-02-21T08:09:23.0770112Z mlir_reproducer: { 2026-02-21T08:09:23.0774574Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, 
triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:23.0779155Z disable_threading: false, 2026-02-21T08:09:23.0779334Z verify_each: true 2026-02-21T08:09:23.0779479Z } 2026-02-21T08:09:23.0779611Z } 2026-02-21T08:09:23.0779729Z #-} 2026-02-21T08:09:23.0780165Z /tmp/torchinductor_root/r5/cr55hmqs6vtq3unt5s2oasrliuzukh2cnqo4dz6vuu6wghbtcosb.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:09:23.0781363Z /tmp/torchinductor_root/r5/cr55hmqs6vtq3unt5s2oasrliuzukh2cnqo4dz6vuu6wghbtcosb.py:19:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:23.0782367Z [13s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:09:23.0783571Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 1, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'first'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=32, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[2, 2], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:09:23.0784644Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:23.0784903Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:09:26.8291651Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.9 configs/s 2026-02-21T08:09:26.8303179Z [17s] Adaptive compile timeout: 30s (90% percentile=1.2s, bounds=[30.0s, 60s]) 2026-02-21T08:09:27.2388659Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2429.7 configs/s 2026-02-21T08:09:27.2769919Z [17s] Initial random population of 100, 5 starting points: 2026-02-21T08:09:27.2774833Z error=5 2026-02-21T08:09:27.2776820Z ok=95 2026-02-21T08:09:27.2777034Z min=0.0236 2026-02-21T08:09:27.2781194Z mid=0.1526 2026-02-21T08:09:27.2785859Z max=2.3398 2026-02-21T08:09:27.2787397Z best={'block_sizes': [1024, 1, 16], 2026-02-21T08:09:27.2787713Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:09:27.2792717Z 'l2_groupings': [4], 2026-02-21T08:09:27.2794887Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:09:27.2795127Z 'loop_orders': [[0, 1]], 2026-02-21T08:09:27.2795336Z 'num_stages': 5, 2026-02-21T08:09:27.2795488Z 'num_warps': 32, 2026-02-21T08:09:27.2798617Z 'pid_type': 'flat', 2026-02-21T08:09:27.2802499Z 'range_flattens': [None, None], 2026-02-21T08:09:27.2805741Z 'range_multi_buffers': [None, None], 2026-02-21T08:09:27.2810185Z 'range_num_stages': [0, 3], 2026-02-21T08:09:27.2813955Z 'range_unroll_factors': [0, 1], 2026-02-21T08:09:27.2814237Z 'range_warp_specializes': [None, True]} 2026-02-21T08:09:27.2818421Z [17s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:09:28.0800486Z [18s] Generation 1 starting: 66 neighbors, 5 active search path(s) 2026-02-21T08:09:31.0777815Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 23.0 configs/s 2026-02-21T08:09:35.3602115Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.7 configs/s 2026-02-21T08:09:37.7508099Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 425.4 2026-02-21T08:09:37.7512044Z configs/s 2026-02-21T08:09:37.9321857Z [28s] Generation 1 complete: 
2026-02-21T08:09:37.9326783Z error=1 2026-02-21T08:09:37.9331141Z ok=71 2026-02-21T08:09:37.9332624Z min=0.0174 2026-02-21T08:09:37.9332831Z mid=0.0256 2026-02-21T08:09:37.9333007Z max=0.0666 2026-02-21T08:09:37.9333202Z best={'block_sizes': [2048, 1, 16], 2026-02-21T08:09:37.9333467Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:09:37.9334044Z 'l2_groupings': [4], 2026-02-21T08:09:37.9334261Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:09:37.9334465Z 'loop_orders': [[1, 0]], 2026-02-21T08:09:37.9334638Z 'num_stages': 1, 2026-02-21T08:09:37.9334787Z 'num_warps': 32, 2026-02-21T08:09:37.9334942Z 'pid_type': 'flat', 2026-02-21T08:09:37.9335102Z 'range_flattens': [None, None], 2026-02-21T08:09:37.9335290Z 'range_multi_buffers': [None, False], 2026-02-21T08:09:37.9335481Z 'range_num_stages': [0, 2], 2026-02-21T08:09:37.9335656Z 'range_unroll_factors': [0, 0], 2026-02-21T08:09:37.9339701Z 'range_warp_specializes': [None, False]} 2026-02-21T08:09:37.9344240Z [28s] Fitting surrogate: 172 points, 172 targets 2026-02-21T08:09:38.7616382Z [29s] Generation 2 starting: 60 neighbors, 5 active search path(s) 2026-02-21T08:09:41.4476978Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 24.0 configs/s 2026-02-21T08:09:45.5953918Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 15.8 configs/s 2026-02-21T08:09:48.7407400Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 323.9 2026-02-21T08:09:48.7411738Z configs/s 2026-02-21T08:09:48.9818684Z [39s] Generation 2 complete: 2026-02-21T08:09:48.9822987Z ok=66 2026-02-21T08:09:48.9824449Z min=0.0174 2026-02-21T08:09:48.9824643Z mid=0.0256 2026-02-21T08:09:48.9824793Z max=0.1729 2026-02-21T08:09:48.9824980Z best={'block_sizes': [2048, 1, 16], 2026-02-21T08:09:48.9825239Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:09:48.9825482Z 'l2_groupings': [2], 2026-02-21T08:09:48.9825667Z 'load_eviction_policies': ['first', ''], 
2026-02-21T08:09:48.9825873Z 'loop_orders': [[1, 0]], 2026-02-21T08:09:48.9826026Z 'num_stages': 1, 2026-02-21T08:09:48.9826173Z 'num_warps': 8, 2026-02-21T08:09:48.9826314Z 'pid_type': 'flat', 2026-02-21T08:09:48.9826479Z 'range_flattens': [None, None], 2026-02-21T08:09:48.9826685Z 'range_multi_buffers': [None, False], 2026-02-21T08:09:48.9826884Z 'range_num_stages': [0, 2], 2026-02-21T08:09:48.9827056Z 'range_unroll_factors': [0, 0], 2026-02-21T08:09:48.9827236Z 'range_warp_specializes': [None, False]} 2026-02-21T08:09:48.9832383Z [39s] Fitting surrogate: 238 points, 238 targets 2026-02-21T08:09:49.9690363Z [40s] Generation 3 starting: 67 neighbors, 5 active search path(s) 2026-02-21T08:09:57.2754030Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 5.4 configs/s 2026-02-21T08:10:01.5265334Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 16.1 configs/s 2026-02-21T08:10:02.5869480Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 956.1 2026-02-21T08:10:02.5869857Z configs/s 2026-02-21T08:10:02.6723082Z [53s] Generation 3 complete: 2026-02-21T08:10:02.6726935Z error=1 2026-02-21T08:10:02.6731219Z ok=72 2026-02-21T08:10:02.6733199Z min=0.0174 2026-02-21T08:10:02.6733423Z mid=0.0236 2026-02-21T08:10:02.6733926Z max=0.1598 2026-02-21T08:10:02.6734091Z best={'block_sizes': [2048, 1, 16], 2026-02-21T08:10:02.6734412Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:02.6738598Z 'l2_groupings': [2], 2026-02-21T08:10:02.6743637Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:10:02.6745377Z 'loop_orders': [[1, 0]], 2026-02-21T08:10:02.6745576Z 'num_stages': 1, 2026-02-21T08:10:02.6745735Z 'num_warps': 16, 2026-02-21T08:10:02.6745881Z 'pid_type': 'flat', 2026-02-21T08:10:02.6746053Z 'range_flattens': [None, None], 2026-02-21T08:10:02.6746236Z 'range_multi_buffers': [None, False], 2026-02-21T08:10:02.6746523Z 'range_num_stages': [0, 3], 2026-02-21T08:10:02.6746738Z 'range_unroll_factors': 
[0, 0], 2026-02-21T08:10:02.6746950Z 'range_warp_specializes': [None, False]} 2026-02-21T08:10:02.6747185Z [53s] Fitting surrogate: 311 points, 311 targets 2026-02-21T08:10:03.4676768Z [53s] Generation 4 starting: 53 neighbors, 4 active search path(s) 2026-02-21T08:10:10.0959136Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 4.9 configs/s 2026-02-21T08:10:13.3018778Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 17.1 configs/s 2026-02-21T08:10:15.7288318Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 452.1 2026-02-21T08:10:15.7291865Z configs/s 2026-02-21T08:10:15.9112355Z [66s] Generation 4 complete: 2026-02-21T08:10:15.9117258Z error=2 2026-02-21T08:10:15.9121865Z ok=55 2026-02-21T08:10:15.9125645Z min=0.0155 2026-02-21T08:10:15.9127195Z mid=0.0215 2026-02-21T08:10:15.9127398Z max=0.1812 2026-02-21T08:10:15.9131766Z best={'block_sizes': [2048, 1, 16], 2026-02-21T08:10:15.9135623Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:15.9139294Z 'l2_groupings': [2], 2026-02-21T08:10:15.9139592Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:10:15.9139823Z 'loop_orders': [[1, 0]], 2026-02-21T08:10:15.9140043Z 'maxnreg': 128, 2026-02-21T08:10:15.9140655Z 'num_sm_multiplier': 2, 2026-02-21T08:10:15.9144988Z 'num_stages': 1, 2026-02-21T08:10:15.9149277Z 'num_warps': 16, 2026-02-21T08:10:15.9150769Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:15.9151007Z 'range_flattens': [None, None], 2026-02-21T08:10:15.9151215Z 'range_multi_buffers': [False, False], 2026-02-21T08:10:15.9151420Z 'range_num_stages': [4, 3], 2026-02-21T08:10:15.9151681Z 'range_unroll_factors': [0, 0], 2026-02-21T08:10:15.9151873Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:15.9152164Z [66s] Fitting surrogate: 368 points, 368 targets 2026-02-21T08:10:16.6642211Z [67s] Generation 5 starting: 46 neighbors, 3 active search path(s) 2026-02-21T08:10:18.8959185Z Generation 5: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 31.4 configs/s 2026-02-21T08:10:21.6243727Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 17.5 configs/s 2026-02-21T08:10:23.6517432Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 502.2 2026-02-21T08:10:23.6522191Z configs/s 2026-02-21T08:10:23.8071107Z [74s] Generation 5 complete: 2026-02-21T08:10:23.8076974Z error=3 2026-02-21T08:10:23.8078351Z ok=46 2026-02-21T08:10:23.8078516Z min=0.0155 2026-02-21T08:10:23.8078655Z mid=0.0215 2026-02-21T08:10:23.8078779Z max=0.0379 2026-02-21T08:10:23.8078928Z best={'block_sizes': [2048, 1, 16], 2026-02-21T08:10:23.8079165Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:23.8079392Z 'l2_groupings': [2], 2026-02-21T08:10:23.8079564Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:10:23.8079778Z 'loop_orders': [[1, 0]], 2026-02-21T08:10:23.8079940Z 'maxnreg': 128, 2026-02-21T08:10:23.8080083Z 'num_sm_multiplier': 2, 2026-02-21T08:10:23.8080242Z 'num_stages': 1, 2026-02-21T08:10:23.8080388Z 'num_warps': 16, 2026-02-21T08:10:23.8080540Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:23.8080760Z 'range_flattens': [None, None], 2026-02-21T08:10:23.8081244Z 'range_multi_buffers': [False, False], 2026-02-21T08:10:23.8081435Z 'range_num_stages': [4, 3], 2026-02-21T08:10:23.8081794Z 'range_unroll_factors': [0, 0], 2026-02-21T08:10:23.8081984Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:23.8087578Z [74s] Fitting surrogate: 417 points, 417 targets 2026-02-21T08:10:24.3019134Z [74s] Generation 6 starting: 27 neighbors, 2 active search path(s) 2026-02-21T08:10:25.6712222Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 30.8 configs/s 2026-02-21T08:10:27.0865498Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 19.8 configs/s 2026-02-21T08:10:28.3466833Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 805.9 2026-02-21T08:10:28.3467574Z configs/s 
2026-02-21T08:10:28.4561370Z [78s] Generation 6 complete: 2026-02-21T08:10:28.4565993Z error=5 2026-02-21T08:10:28.4570852Z ok=24 2026-02-21T08:10:28.4572155Z min=0.0155 2026-02-21T08:10:28.4572320Z mid=0.0174 2026-02-21T08:10:28.4572459Z max=0.0400 2026-02-21T08:10:28.4572612Z best={'block_sizes': [2048, 1, 16], 2026-02-21T08:10:28.4572858Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:28.4573089Z 'l2_groupings': [2], 2026-02-21T08:10:28.4573262Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:10:28.4573474Z 'loop_orders': [[1, 0]], 2026-02-21T08:10:28.4573628Z 'maxnreg': 128, 2026-02-21T08:10:28.4573785Z 'num_sm_multiplier': 4, 2026-02-21T08:10:28.4573941Z 'num_stages': 1, 2026-02-21T08:10:28.4574088Z 'num_warps': 16, 2026-02-21T08:10:28.4574245Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:28.4574447Z 'range_flattens': [None, None], 2026-02-21T08:10:28.4574635Z 'range_multi_buffers': [False, False], 2026-02-21T08:10:28.4574835Z 'range_num_stages': [4, 3], 2026-02-21T08:10:28.4575008Z 'range_unroll_factors': [0, 0], 2026-02-21T08:10:28.4575203Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:28.4575498Z [78s] Fitting surrogate: 446 points, 446 targets 2026-02-21T08:10:28.9367100Z [79s] Generation 7 starting: 30 neighbors, 2 active search path(s) 2026-02-21T08:10:37.1491738Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 2.9 configs/s 2026-02-21T08:10:38.7183765Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 19.8 configs/s 2026-02-21T08:10:39.6218315Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1121.2 2026-02-21T08:10:39.6219978Z configs/s 2026-02-21T08:10:39.7006534Z [90s] Generation 7 complete: 2026-02-21T08:10:39.7010894Z error=6 2026-02-21T08:10:39.7015222Z ok=26 2026-02-21T08:10:39.7016783Z min=0.0155 2026-02-21T08:10:39.7016942Z mid=0.0215 2026-02-21T08:10:39.7017075Z max=0.1136 2026-02-21T08:10:39.7017215Z best={'block_sizes': [2048, 1, 16], 
2026-02-21T08:10:39.7017496Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:39.7018107Z 'l2_groupings': [2], 2026-02-21T08:10:39.7018284Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:10:39.7018476Z 'loop_orders': [[1, 0]], 2026-02-21T08:10:39.7018639Z 'maxnreg': 128, 2026-02-21T08:10:39.7018784Z 'num_sm_multiplier': 4, 2026-02-21T08:10:39.7019027Z 'num_stages': 1, 2026-02-21T08:10:39.7019238Z 'num_warps': 16, 2026-02-21T08:10:39.7019398Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:39.7019598Z 'range_flattens': [None, None], 2026-02-21T08:10:39.7019777Z 'range_multi_buffers': [False, False], 2026-02-21T08:10:39.7019968Z 'range_num_stages': [4, 3], 2026-02-21T08:10:39.7020135Z 'range_unroll_factors': [0, 0], 2026-02-21T08:10:39.7020322Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:39.7023325Z [90s] Fitting surrogate: 478 points, 478 targets 2026-02-21T08:10:39.9889844Z [90s] Generation 8 starting: 13 neighbors, 1 active search path(s) 2026-02-21T08:10:47.2571071Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 1.9 configs/s 2026-02-21T08:10:47.8304091Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 14/14 27.6 configs/s 2026-02-21T08:10:48.0041775Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 5575.2 2026-02-21T08:10:48.0043438Z configs/s 2026-02-21T08:10:48.0279621Z [98s] Generation 8 complete: 2026-02-21T08:10:48.0283956Z error=5 2026-02-21T08:10:48.0285587Z ok=10 2026-02-21T08:10:48.0285804Z min=0.0154 2026-02-21T08:10:48.0285938Z mid=0.0276 2026-02-21T08:10:48.0289898Z max=0.2161 2026-02-21T08:10:48.0294576Z best={'block_sizes': [2048, 1, 16], 2026-02-21T08:10:48.0296015Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:48.0296268Z 'l2_groupings': [2], 2026-02-21T08:10:48.0296452Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:10:48.0296650Z 'loop_orders': [[1, 0]], 2026-02-21T08:10:48.0296821Z 'maxnreg': 128, 
2026-02-21T08:10:48.0297261Z 'num_sm_multiplier': 4, 2026-02-21T08:10:48.0297468Z 'num_stages': 1, 2026-02-21T08:10:48.0297615Z 'num_warps': 16, 2026-02-21T08:10:48.0297780Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:48.0297971Z 'range_flattens': [None, None], 2026-02-21T08:10:48.0298164Z 'range_multi_buffers': [False, False], 2026-02-21T08:10:48.0298356Z 'range_num_stages': [4, 3], 2026-02-21T08:10:48.0298520Z 'range_unroll_factors': [0, 0], 2026-02-21T08:10:48.0298702Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:48.0298914Z [98s] Fitting surrogate: 493 points, 493 targets 2026-02-21T08:10:48.3772234Z [98s] Generation 9 starting: 15 neighbors, 1 active search path(s) 2026-02-21T08:10:51.3432360Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 6.0 configs/s 2026-02-21T08:10:52.0956779Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 21.5 configs/s 2026-02-21T08:10:52.3956829Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3304.9 2026-02-21T08:10:52.3961020Z configs/s 2026-02-21T08:10:52.4276486Z [102s] Generation 9 complete: 2026-02-21T08:10:52.4280807Z error=4 2026-02-21T08:10:52.4285238Z ok=13 2026-02-21T08:10:52.4287270Z min=0.0154 2026-02-21T08:10:52.4287443Z mid=0.0256 2026-02-21T08:10:52.4287583Z max=0.0953 2026-02-21T08:10:52.4287728Z best={'block_sizes': [2048, 1, 16], 2026-02-21T08:10:52.4287983Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:52.4288216Z 'l2_groupings': [2], 2026-02-21T08:10:52.4288388Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:10:52.4288598Z 'loop_orders': [[1, 0]], 2026-02-21T08:10:52.4288754Z 'maxnreg': 128, 2026-02-21T08:10:52.4288908Z 'num_sm_multiplier': 4, 2026-02-21T08:10:52.4289065Z 'num_stages': 1, 2026-02-21T08:10:52.4289213Z 'num_warps': 16, 2026-02-21T08:10:52.4289371Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:52.4289575Z 'range_flattens': [None, None], 2026-02-21T08:10:52.4289775Z 
'range_multi_buffers': [False, False], 2026-02-21T08:10:52.4290293Z 'range_num_stages': [4, 3], 2026-02-21T08:10:52.4290475Z 'range_unroll_factors': [0, 0], 2026-02-21T08:10:52.4290658Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:52.4290923Z [102s] Fitting surrogate: 510 points, 510 targets 2026-02-21T08:10:52.7027687Z [103s] Autotuning complete in 103.2s after searching 493 configs. 2026-02-21T08:10:52.7028088Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:10:52.7029388Z @helion.kernel(config=helion.Config(block_sizes=[2048, 1, 16], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=128, num_sm_multiplier=4, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[4, 3], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:10:52.7030607Z 2026-02-21T08:10:52.7030876Z [103s] Code of selected kernel: /tmp/torchinductor_root/bh/cbh7xsb5pswtjqgct7qao35fe5fz72zfvoomgygod5c7diih6cq5.py 2026-02-21T08:10:53.7930537Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T08:10:53.7934817Z x_val 2026-02-21T08:10:53.7939228Z ------------------ 2026-02-21T08:10:53.7940801Z (1, 1, 1280, 8192) 2026-02-21T08:10:53.7940971Z 2026-02-21T08:10:53.7946306Z 10%|█ | 1/10 [01:53<17:05, 113.89s/it]WARNING:tritonbench.utils.triton_op:Running input ID 3: 2026-02-21T08:10:53.7950275Z x_val 2026-02-21T08:10:53.7954720Z ------------------ 2026-02-21T08:10:53.7954970Z (1, 1, 8192, 3584) 2026-02-21T08:10:53.7955324Z INFO:tritonbench.utils.triton_op:Took 0.21ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:10:54.8748019Z INFO:tritonbench.utils.triton_op:Took 2.77ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:10:59.4137497Z 
INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:11:00.8518722Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:11:00.8522541Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:11:00.8525789Z 'dtype': 'torch.bfloat16', 2026-02-21T08:11:00.8530216Z 'shape': (1, 1, 3584), 2026-02-21T08:11:00.8532207Z 'stride': (3584, 3584, 1)}, 2026-02-21T08:11:00.8532448Z { 'device': 'cuda:0', 2026-02-21T08:11:00.8532662Z 'dtype': 'torch.int32', 2026-02-21T08:11:00.8532876Z 'shape': (3584, 8192), 2026-02-21T08:11:00.8533081Z 'stride': (8192, 1)}), 2026-02-21T08:11:00.8533266Z 'kwargs': {}} 2026-02-21T08:11:00.8534263Z INFO:tritonbench.utils.triton_op:Took 2.01ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:11:01.1387260Z [0s] Autotune random seed: 2134813318 2026-02-21T08:11:01.2419699Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:11:33.8350084Z [32s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:11:33.8368538Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T08:11:39.5477939Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.7 configs/s 2026-02-21T08:11:39.5489147Z [38s] Adaptive compile timeout: 30s (90% percentile=1.5s, bounds=[30.0s, 30s]) 2026-02-21T08:11:39.8195928Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 3616.1 configs/s 2026-02-21T08:11:39.8503776Z [38s] Initial random population of 100, 5 starting points: 2026-02-21T08:11:39.8508079Z error=4 
2026-02-21T08:11:39.8509740Z timeout=1 2026-02-21T08:11:39.8510315Z ok=95 2026-02-21T08:11:39.8510533Z min=0.0216 2026-02-21T08:11:39.8510746Z mid=0.2038 2026-02-21T08:11:39.8510925Z max=6.4931 2026-02-21T08:11:39.8516211Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:11:39.8519394Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:11:39.8522722Z 'l2_groupings': [64], 2026-02-21T08:11:39.8526523Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:11:39.8526769Z 'loop_orders': [[1, 0]], 2026-02-21T08:11:39.8526943Z 'num_stages': 8, 2026-02-21T08:11:39.8527096Z 'num_warps': 4, 2026-02-21T08:11:39.8527243Z 'pid_type': 'flat', 2026-02-21T08:11:39.8527415Z 'range_flattens': [None, None], 2026-02-21T08:11:39.8527600Z 'range_multi_buffers': [None, None], 2026-02-21T08:11:39.8527794Z 'range_num_stages': [0, 4], 2026-02-21T08:11:39.8527968Z 'range_unroll_factors': [0, 0], 2026-02-21T08:11:39.8528157Z 'range_warp_specializes': [None, False]} 2026-02-21T08:11:39.8528704Z [38s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:11:40.8703449Z [39s] Generation 1 starting: 76 neighbors, 5 active search path(s) 2026-02-21T08:11:44.0451245Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 18.5 configs/s 2026-02-21T08:11:48.7935973Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 17.0 configs/s 2026-02-21T08:11:51.7379567Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 371.7 2026-02-21T08:11:51.7379943Z configs/s 2026-02-21T08:11:51.9392532Z [50s] Generation 1 complete: 2026-02-21T08:11:51.9396913Z ok=82 2026-02-21T08:11:51.9401267Z min=0.0216 2026-02-21T08:11:51.9404539Z mid=0.0338 2026-02-21T08:11:51.9408879Z max=0.0851 2026-02-21T08:11:51.9413263Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:11:51.9417196Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:11:51.9417541Z 'l2_groupings': [64], 2026-02-21T08:11:51.9417779Z 'load_eviction_policies': ['last', ''], 
2026-02-21T08:11:51.9418024Z 'loop_orders': [[1, 0]], 2026-02-21T08:11:51.9421993Z 'num_stages': 8, 2026-02-21T08:11:51.9426470Z 'num_warps': 4, 2026-02-21T08:11:51.9428015Z 'pid_type': 'flat', 2026-02-21T08:11:51.9428223Z 'range_flattens': [None, None], 2026-02-21T08:11:51.9428415Z 'range_multi_buffers': [None, None], 2026-02-21T08:11:51.9428613Z 'range_num_stages': [0, 4], 2026-02-21T08:11:51.9428784Z 'range_unroll_factors': [0, 0], 2026-02-21T08:11:51.9428982Z 'range_warp_specializes': [None, False]} 2026-02-21T08:11:51.9429216Z [50s] Fitting surrogate: 182 points, 182 targets 2026-02-21T08:11:52.9253993Z [51s] Generation 2 starting: 72 neighbors, 5 active search path(s) 2026-02-21T08:12:02.2363759Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 3.2 configs/s 2026-02-21T08:12:06.7960225Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.8 configs/s 2026-02-21T08:12:10.5477883Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 271.3 2026-02-21T08:12:10.5482079Z configs/s 2026-02-21T08:12:10.8090801Z [69s] Generation 2 complete: 2026-02-21T08:12:10.8095186Z ok=77 2026-02-21T08:12:10.8099033Z min=0.0216 2026-02-21T08:12:10.8103288Z mid=0.0277 2026-02-21T08:12:10.8107303Z max=0.2755 2026-02-21T08:12:10.8109363Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:12:10.8109640Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:12:10.8109871Z 'l2_groupings': [64], 2026-02-21T08:12:10.8110054Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:12:10.8110245Z 'loop_orders': [[1, 0]], 2026-02-21T08:12:10.8110416Z 'num_stages': 8, 2026-02-21T08:12:10.8110560Z 'num_warps': 4, 2026-02-21T08:12:10.8110710Z 'pid_type': 'flat', 2026-02-21T08:12:10.8110872Z 'range_flattens': [None, None], 2026-02-21T08:12:10.8111054Z 'range_multi_buffers': [None, None], 2026-02-21T08:12:10.8111243Z 'range_num_stages': [0, 4], 2026-02-21T08:12:10.8111425Z 'range_unroll_factors': [0, 0], 2026-02-21T08:12:10.8111996Z 
'range_warp_specializes': [None, False]} 2026-02-21T08:12:10.8112213Z [69s] Fitting surrogate: 259 points, 259 targets 2026-02-21T08:12:11.9904301Z [70s] Generation 3 starting: 71 neighbors, 5 active search path(s) 2026-02-21T08:12:23.3877812Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 2.5 configs/s 2026-02-21T08:12:27.7663263Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 16.8 configs/s 2026-02-21T08:12:30.8896656Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 326.0 2026-02-21T08:12:30.8900860Z configs/s 2026-02-21T08:12:31.1136967Z [89s] Generation 3 complete: 2026-02-21T08:12:31.1140795Z ok=76 2026-02-21T08:12:31.1142762Z min=0.0215 2026-02-21T08:12:31.1142944Z mid=0.0277 2026-02-21T08:12:31.1143076Z max=0.5048 2026-02-21T08:12:31.1143240Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:12:31.1143845Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:12:31.1144108Z 'l2_groupings': [64], 2026-02-21T08:12:31.1144289Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:12:31.1144478Z 'loop_orders': [[1, 0]], 2026-02-21T08:12:31.1144641Z 'num_stages': 8, 2026-02-21T08:12:31.1144783Z 'num_warps': 4, 2026-02-21T08:12:31.1144931Z 'pid_type': 'flat', 2026-02-21T08:12:31.1145087Z 'range_flattens': [None, None], 2026-02-21T08:12:31.1145273Z 'range_multi_buffers': [None, False], 2026-02-21T08:12:31.1145463Z 'range_num_stages': [0, 4], 2026-02-21T08:12:31.1145628Z 'range_unroll_factors': [0, 0], 2026-02-21T08:12:31.1145815Z 'range_warp_specializes': [None, False]} 2026-02-21T08:12:31.1153967Z [89s] Fitting surrogate: 335 points, 335 targets 2026-02-21T08:12:31.8869356Z [90s] Generation 4 starting: 57 neighbors, 4 active search path(s) 2026-02-21T08:12:40.2488869Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 2.2 configs/s 2026-02-21T08:12:43.8155070Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.7 configs/s 2026-02-21T08:12:46.1657668Z 
Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 433.2 2026-02-21T08:12:46.1659362Z configs/s 2026-02-21T08:12:46.3442061Z [105s] Generation 4 complete: 2026-02-21T08:12:46.3446213Z ok=61 2026-02-21T08:12:46.3451934Z min=0.0215 2026-02-21T08:12:46.3453983Z mid=0.0236 2026-02-21T08:12:46.3454178Z max=0.5336 2026-02-21T08:12:46.3459574Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:12:46.3463515Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:12:46.3467276Z 'l2_groupings': [4], 2026-02-21T08:12:46.3467575Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:12:46.3467797Z 'loop_orders': [[1, 0]], 2026-02-21T08:12:46.3472981Z 'num_stages': 3, 2026-02-21T08:12:46.3476855Z 'num_warps': 8, 2026-02-21T08:12:46.3480966Z 'pid_type': 'flat', 2026-02-21T08:12:46.3485389Z 'range_flattens': [None, False], 2026-02-21T08:12:46.3489686Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:46.3493969Z 'range_num_stages': [0, 3], 2026-02-21T08:12:46.3497338Z 'range_unroll_factors': [0, 1], 2026-02-21T08:12:46.3501509Z 'range_warp_specializes': [None, False]} 2026-02-21T08:12:46.3501908Z [105s] Fitting surrogate: 396 points, 396 targets 2026-02-21T08:12:47.0588095Z [105s] Generation 5 starting: 46 neighbors, 3 active search path(s) 2026-02-21T08:12:54.3985235Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 3.6 configs/s 2026-02-21T08:12:57.2700695Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 17.0 configs/s 2026-02-21T08:12:59.3603736Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 535.7 2026-02-21T08:12:59.3607638Z configs/s 2026-02-21T08:12:59.5045289Z [118s] Generation 5 complete: 2026-02-21T08:12:59.5049660Z ok=49 2026-02-21T08:12:59.5051210Z min=0.0215 2026-02-21T08:12:59.5051928Z mid=0.0256 2026-02-21T08:12:59.5052068Z max=0.1588 2026-02-21T08:12:59.5052216Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:12:59.5052478Z 'indexing': ['pointer', 'pointer', 
'tensor_descriptor'], 2026-02-21T08:12:59.5052719Z 'l2_groupings': [4], 2026-02-21T08:12:59.5052895Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:12:59.5053101Z 'loop_orders': [[1, 0]], 2026-02-21T08:12:59.5053264Z 'num_stages': 3, 2026-02-21T08:12:59.5053418Z 'num_warps': 8, 2026-02-21T08:12:59.5053563Z 'pid_type': 'flat', 2026-02-21T08:12:59.5053733Z 'range_flattens': [None, False], 2026-02-21T08:12:59.5053919Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:59.5054117Z 'range_num_stages': [0, 3], 2026-02-21T08:12:59.5054291Z 'range_unroll_factors': [0, 1], 2026-02-21T08:12:59.5054485Z 'range_warp_specializes': [None, False]} 2026-02-21T08:12:59.5057647Z [118s] Fitting surrogate: 445 points, 445 targets 2026-02-21T08:13:00.1006012Z [118s] Generation 6 starting: 37 neighbors, 3 active search path(s) 2026-02-21T08:13:03.1658013Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 15.9 configs/s 2026-02-21T08:13:05.4505536Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 38/38 17.0 configs/s 2026-02-21T08:13:06.6996491Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 811.1 2026-02-21T08:13:06.6996953Z configs/s 2026-02-21T08:13:06.7928980Z [125s] Generation 6 complete: 2026-02-21T08:13:06.7930750Z ok=40 2026-02-21T08:13:06.7930959Z min=0.0215 2026-02-21T08:13:06.7931122Z mid=0.0297 2026-02-21T08:13:06.7931280Z max=0.1444 2026-02-21T08:13:06.7931453Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:13:06.7931920Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:06.7932146Z 'l2_groupings': [4], 2026-02-21T08:13:06.7932330Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:13:06.7932523Z 'loop_orders': [[1, 0]], 2026-02-21T08:13:06.7932741Z 'num_stages': 3, 2026-02-21T08:13:06.7932898Z 'num_warps': 8, 2026-02-21T08:13:06.7933048Z 'pid_type': 'flat', 2026-02-21T08:13:06.7933211Z 'range_flattens': [None, False], 2026-02-21T08:13:06.7933391Z 'range_multi_buffers': [None, True], 
2026-02-21T08:13:06.7933578Z 'range_num_stages': [0, 3], 2026-02-21T08:13:06.7933743Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:06.7933931Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:06.7946378Z [125s] Fitting surrogate: 485 points, 485 targets 2026-02-21T08:13:07.2552003Z [126s] Generation 7 starting: 27 neighbors, 2 active search path(s) 2026-02-21T08:13:12.0668405Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 4.3 configs/s 2026-02-21T08:13:13.7649531Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 28/28 16.9 configs/s 2026-02-21T08:13:14.6215930Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1178.6 2026-02-21T08:13:14.6219660Z configs/s 2026-02-21T08:13:14.6921788Z [133s] Generation 7 complete: 2026-02-21T08:13:14.6923108Z ok=30 2026-02-21T08:13:14.6923317Z min=0.0215 2026-02-21T08:13:14.6928605Z mid=0.0297 2026-02-21T08:13:14.6930107Z max=0.2490 2026-02-21T08:13:14.6930297Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:13:14.6930542Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:14.6930774Z 'l2_groupings': [4], 2026-02-21T08:13:14.6930941Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:13:14.6931136Z 'loop_orders': [[1, 0]], 2026-02-21T08:13:14.6931293Z 'num_stages': 3, 2026-02-21T08:13:14.6931433Z 'num_warps': 8, 2026-02-21T08:13:14.6931661Z 'pid_type': 'flat', 2026-02-21T08:13:14.6931824Z 'range_flattens': [None, False], 2026-02-21T08:13:14.6932010Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:14.6932196Z 'range_num_stages': [0, 3], 2026-02-21T08:13:14.6932372Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:14.6932840Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:14.6941960Z [133s] Fitting surrogate: 515 points, 515 targets 2026-02-21T08:13:15.0919701Z [133s] Generation 8 starting: 14 neighbors, 1 active search path(s) 2026-02-21T08:13:17.9696688Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 5.3 
configs/s 2026-02-21T08:13:18.8729431Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 15/15 17.5 configs/s 2026-02-21T08:13:19.1356004Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3722.9 2026-02-21T08:13:19.1360223Z configs/s 2026-02-21T08:13:19.1661034Z [137s] Generation 8 complete: 2026-02-21T08:13:19.1665465Z ok=16 2026-02-21T08:13:19.1668571Z min=0.0215 2026-02-21T08:13:19.1673177Z mid=0.0338 2026-02-21T08:13:19.1674983Z max=0.1772 2026-02-21T08:13:19.1675221Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:13:19.1680514Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:19.1684471Z 'l2_groupings': [4], 2026-02-21T08:13:19.1688860Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:13:19.1690731Z 'loop_orders': [[1, 0]], 2026-02-21T08:13:19.1690985Z 'num_stages': 3, 2026-02-21T08:13:19.1695701Z 'num_warps': 8, 2026-02-21T08:13:19.1700614Z 'pid_type': 'flat', 2026-02-21T08:13:19.1702108Z 'range_flattens': [None, False], 2026-02-21T08:13:19.1702406Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:19.1706683Z 'range_num_stages': [0, 3], 2026-02-21T08:13:19.1706935Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:19.1711687Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:19.1715793Z [137s] Fitting surrogate: 531 points, 531 targets 2026-02-21T08:13:19.5762258Z [138s] Generation 9 starting: 15 neighbors, 1 active search path(s) 2026-02-21T08:13:26.0801762Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.0 configs/s 2026-02-21T08:13:27.0446958Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 17.4 configs/s 2026-02-21T08:13:27.3076969Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3721.2 2026-02-21T08:13:27.3080639Z configs/s 2026-02-21T08:13:27.3375005Z [146s] Generation 9 complete: 2026-02-21T08:13:27.3376270Z ok=17 2026-02-21T08:13:27.3376446Z min=0.0215 2026-02-21T08:13:27.3376576Z mid=0.0338 2026-02-21T08:13:27.3376707Z 
max=0.2714 2026-02-21T08:13:27.3376847Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:13:27.3377093Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:27.3377313Z 'l2_groupings': [4], 2026-02-21T08:13:27.3377488Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:13:27.3377683Z 'loop_orders': [[1, 0]], 2026-02-21T08:13:27.3377835Z 'num_stages': 3, 2026-02-21T08:13:27.3377984Z 'num_warps': 8, 2026-02-21T08:13:27.3378122Z 'pid_type': 'flat', 2026-02-21T08:13:27.3378288Z 'range_flattens': [None, False], 2026-02-21T08:13:27.3378489Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:27.3379009Z 'range_num_stages': [0, 3], 2026-02-21T08:13:27.3379181Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:27.3379372Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:27.3393681Z [146s] Fitting surrogate: 548 points, 548 targets 2026-02-21T08:13:27.7908373Z [146s] Generation 10 starting: 16 neighbors, 1 active search path(s) 2026-02-21T08:13:30.5622032Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 5.8 configs/s 2026-02-21T08:13:31.7846913Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 14.3 configs/s 2026-02-21T08:13:31.7854690Z [150s] Generation 10 complete: 2026-02-21T08:13:31.7856016Z ok=18 2026-02-21T08:13:31.7856194Z min=0.0215 2026-02-21T08:13:31.7856327Z mid=0.0318 2026-02-21T08:13:31.7856461Z max=0.1096 2026-02-21T08:13:31.7856603Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:13:31.7856850Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:31.7857079Z 'l2_groupings': [4], 2026-02-21T08:13:31.7857642Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:13:31.7857869Z 'loop_orders': [[1, 0]], 2026-02-21T08:13:31.7858027Z 'num_stages': 3, 2026-02-21T08:13:31.7858179Z 'num_warps': 8, 2026-02-21T08:13:31.7858321Z 'pid_type': 'flat', 2026-02-21T08:13:31.7858485Z 'range_flattens': [None, False], 2026-02-21T08:13:31.7858666Z 'range_multi_buffers': [None, True], 
2026-02-21T08:13:31.7858854Z 'range_num_stages': [0, 3], 2026-02-21T08:13:31.7859021Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:31.7859208Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:31.7865745Z [150s] Fitting surrogate: 566 points, 566 targets 2026-02-21T08:13:32.1886959Z [150s] Generation 11 starting: 13 neighbors, 1 active search path(s) 2026-02-21T08:13:38.6209936Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2.1 configs/s 2026-02-21T08:13:39.4681285Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.5 configs/s 2026-02-21T08:13:39.6613751Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 5010.9 2026-02-21T08:13:39.6617835Z configs/s 2026-02-21T08:13:39.6869697Z [158s] Generation 11 complete: 2026-02-21T08:13:39.6874067Z ok=15 2026-02-21T08:13:39.6878362Z min=0.0215 2026-02-21T08:13:39.6883393Z mid=0.0338 2026-02-21T08:13:39.6887987Z max=0.1608 2026-02-21T08:13:39.6892926Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:13:39.6896542Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:39.6896872Z 'l2_groupings': [4], 2026-02-21T08:13:39.6897082Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:13:39.6897308Z 'loop_orders': [[1, 0]], 2026-02-21T08:13:39.6897821Z 'num_stages': 3, 2026-02-21T08:13:39.6898012Z 'num_warps': 8, 2026-02-21T08:13:39.6898181Z 'pid_type': 'flat', 2026-02-21T08:13:39.6898349Z 'range_flattens': [None, False], 2026-02-21T08:13:39.6898550Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:39.6898764Z 'range_num_stages': [0, 3], 2026-02-21T08:13:39.6898955Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:39.6899150Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:39.6899363Z [158s] Fitting surrogate: 581 points, 581 targets 2026-02-21T08:13:40.1049674Z [158s] Generation 12 starting: 15 neighbors, 1 active search path(s) 2026-02-21T08:13:42.7953688Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 6.8 
configs/s 2026-02-21T08:13:43.7447534Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.7 configs/s 2026-02-21T08:13:44.1421327Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2500.0 2026-02-21T08:13:44.1426198Z configs/s 2026-02-21T08:13:44.1811284Z [162s] Generation 12 complete: 2026-02-21T08:13:44.1815500Z ok=17 2026-02-21T08:13:44.1818614Z min=0.0215 2026-02-21T08:13:44.1821915Z mid=0.0338 2026-02-21T08:13:44.1823724Z max=0.1833 2026-02-21T08:13:44.1823931Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:13:44.1824541Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:44.1824763Z 'l2_groupings': [4], 2026-02-21T08:13:44.1824942Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:13:44.1825130Z 'loop_orders': [[1, 0]], 2026-02-21T08:13:44.1825289Z 'num_stages': 3, 2026-02-21T08:13:44.1825434Z 'num_warps': 8, 2026-02-21T08:13:44.1825573Z 'pid_type': 'flat', 2026-02-21T08:13:44.1825737Z 'range_flattens': [None, False], 2026-02-21T08:13:44.1825917Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:44.1826110Z 'range_num_stages': [0, 3], 2026-02-21T08:13:44.1826279Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:44.1826468Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:44.1830243Z [162s] Fitting surrogate: 598 points, 598 targets 2026-02-21T08:13:44.6078313Z [163s] Generation 13 starting: 15 neighbors, 1 active search path(s) 2026-02-21T08:13:50.2215703Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.8 configs/s 2026-02-21T08:13:51.0070124Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 22.0 configs/s 2026-02-21T08:13:51.1284170Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 7767.8 2026-02-21T08:13:51.1288075Z configs/s 2026-02-21T08:13:51.1492609Z [169s] Generation 13 complete: 2026-02-21T08:13:51.1493998Z error=3 2026-02-21T08:13:51.1494167Z ok=14 2026-02-21T08:13:51.1494300Z min=0.0215 2026-02-21T08:13:51.1494443Z 
mid=0.0440 2026-02-21T08:13:51.1494577Z max=0.2980 2026-02-21T08:13:51.1494723Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:13:51.1494981Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:51.1495244Z 'l2_groupings': [4], 2026-02-21T08:13:51.1495428Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:13:51.1495633Z 'loop_orders': [[1, 0]], 2026-02-21T08:13:51.1495796Z 'num_stages': 3, 2026-02-21T08:13:51.1495953Z 'num_warps': 8, 2026-02-21T08:13:51.1496132Z 'pid_type': 'flat', 2026-02-21T08:13:51.1496306Z 'range_flattens': [None, False], 2026-02-21T08:13:51.1496494Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:51.1496699Z 'range_num_stages': [0, 3], 2026-02-21T08:13:51.1496875Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:51.1497084Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:51.1514325Z [169s] Fitting surrogate: 615 points, 615 targets 2026-02-21T08:13:51.5758713Z [170s] Generation 14 starting: 13 neighbors, 1 active search path(s) 2026-02-21T08:13:58.8336897Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 1.7 configs/s 2026-02-21T08:13:59.5083158Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 22.7 configs/s 2026-02-21T08:13:59.7694444Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3759.9 2026-02-21T08:13:59.7698651Z configs/s 2026-02-21T08:13:59.7993426Z [178s] Generation 14 complete: 2026-02-21T08:13:59.7997739Z error=3 2026-02-21T08:13:59.8001893Z ok=12 2026-02-21T08:13:59.8006154Z min=0.0215 2026-02-21T08:13:59.8010865Z mid=0.0338 2026-02-21T08:13:59.8011056Z max=0.2406 2026-02-21T08:13:59.8011210Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:13:59.8011464Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:59.8011777Z 'l2_groupings': [4], 2026-02-21T08:13:59.8011957Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:13:59.8012161Z 'loop_orders': [[1, 0]], 2026-02-21T08:13:59.8012324Z 'num_stages': 3, 
2026-02-21T08:13:59.8012480Z 'num_warps': 8, 2026-02-21T08:13:59.8012627Z 'pid_type': 'flat', 2026-02-21T08:13:59.8012798Z 'range_flattens': [None, False], 2026-02-21T08:13:59.8012988Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:59.8013189Z 'range_num_stages': [0, 3], 2026-02-21T08:13:59.8013362Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:59.8013562Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:59.8013871Z [178s] Fitting surrogate: 630 points, 630 targets 2026-02-21T08:14:00.0794051Z [178s] Autotuning complete in 178.8s after searching 609 configs. 2026-02-21T08:14:00.0794462Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:14:00.0795491Z @helion.kernel(config=helion.Config(block_sizes=[256, 1, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:14:00.0796371Z 2026-02-21T08:14:00.0796620Z [178s] Code of selected kernel: /tmp/torchinductor_root/c2/cc2ukcdrw7joiedt3el5qe3stlrxexhpsgjw3ysrzx7dlyjftqtn.py 2026-02-21T08:14:01.0118249Z WARNING:tritonbench.utils.triton_op:Completed input ID 3: 2026-02-21T08:14:01.0122602Z x_val 2026-02-21T08:14:01.0124600Z ------------------ 2026-02-21T08:14:01.0124878Z (1, 1, 8192, 3584) 2026-02-21T08:14:01.0129231Z 2026-02-21T08:14:01.0134319Z 20%|██ | 2/10 [05:01<20:56, 157.03s/it]WARNING:tritonbench.utils.triton_op:Running input ID 7: 2026-02-21T08:14:01.0138402Z x_val 2026-02-21T08:14:01.0140088Z ------------------ 2026-02-21T08:14:01.0140324Z (4, 1, 8192, 3584) 2026-02-21T08:14:01.0145771Z INFO:tritonbench.utils.triton_op:Took 0.19ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:14:02.1026133Z INFO:tritonbench.utils.triton_op:Took 2.69ms to 
get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:14:02.8720363Z Autotune Choices Stats: 2026-02-21T08:14:02.8723004Z {"num_choices": 18, "num_triton_choices": 17, "best_kernel": "mm", "best_time": 0.01849599927663803, "best_triton_pos": 1, "best_triton_time": 0.019328000023961067, "best_triton_kernel": "triton_mm_8", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4"} 2026-02-21T08:14:02.8728532Z AUTOTUNE mm(4x3584, 3584x8192) 2026-02-21T08:14:02.8732453Z strides: [3584, 1], [8192, 1] 2026-02-21T08:14:02.8734661Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:14:02.8734943Z mm 0.0185 ms 100.0% 2026-02-21T08:14:02.8739795Z triton_mm_8 0.0193 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4 2026-02-21T08:14:02.8744327Z triton_mm_4 0.0204 ms 90.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2 2026-02-21T08:14:02.8745115Z triton_mm_12 0.0224 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4 2026-02-21T08:14:02.8745817Z triton_mm_16 0.0244 ms 75.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2026-02-21T08:14:02.8750857Z triton_mm_2 0.0317 ms 58.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4 2026-02-21T08:14:02.8752805Z triton_mm_3 0.0326 ms 56.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=2 2026-02-21T08:14:02.8753625Z triton_mm_7 0.0336 ms 55.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:14:02.8756103Z triton_mm_11 0.0378 ms 48.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:14:02.8756833Z triton_mm_1 0.0387 ms 47.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=2 2026-02-21T08:14:02.8757426Z SingleProcess AUTOTUNE benchmarking takes 0.2573 seconds and 0.2917 seconds precompiling for 18 choices 2026-02-21T08:14:05.1727661Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:14:07.5128214Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:14:07.5132567Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:14:07.5134000Z 'dtype': 'torch.bfloat16', 2026-02-21T08:14:07.5134234Z 'shape': (4, 1, 3584), 2026-02-21T08:14:07.5134436Z 'stride': (3584, 3584, 1)}, 2026-02-21T08:14:07.5134634Z { 'device': 'cuda:0', 2026-02-21T08:14:07.5134822Z 'dtype': 'torch.int32', 2026-02-21T08:14:07.5135022Z 'shape': (3584, 8192), 2026-02-21T08:14:07.5135193Z 'stride': (8192, 1)}), 2026-02-21T08:14:07.5135671Z 'kwargs': {}} 2026-02-21T08:14:07.5151769Z INFO:tritonbench.utils.triton_op:Took 2.55ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:14:07.7669002Z [0s] Autotune random seed: 2134813318 2026-02-21T08:14:08.0169828Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:14:40.5493266Z [32s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'last'], 
loop_orders=[[0, 1]], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:14:40.5509002Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T08:14:46.3700365Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.3 configs/s 2026-02-21T08:14:46.3708239Z [38s] Adaptive compile timeout: 30s (90% percentile=1.8s, bounds=[30.0s, 30s]) 2026-02-21T08:14:47.2310752Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1151.5 configs/s 2026-02-21T08:14:47.2910291Z [39s] Initial random population of 100, 5 starting points: 2026-02-21T08:14:47.2914540Z error=4 2026-02-21T08:14:47.2916198Z timeout=1 2026-02-21T08:14:47.2916350Z ok=95 2026-02-21T08:14:47.2916489Z min=0.0502 2026-02-21T08:14:47.2916616Z mid=0.2366 2026-02-21T08:14:47.2916748Z max=17.3279 2026-02-21T08:14:47.2916901Z best={'block_sizes': [256, 1, 256], 2026-02-21T08:14:47.2917142Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:14:47.2917371Z 'l2_groupings': [4], 2026-02-21T08:14:47.2917537Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:14:47.2917733Z 'loop_orders': [[1, 0]], 2026-02-21T08:14:47.2917887Z 'num_stages': 3, 2026-02-21T08:14:47.2918032Z 'num_warps': 8, 2026-02-21T08:14:47.2918175Z 'pid_type': 'flat', 2026-02-21T08:14:47.2918372Z 'range_flattens': [None, False], 2026-02-21T08:14:47.2918955Z 'range_multi_buffers': [None, False], 2026-02-21T08:14:47.2919153Z 'range_num_stages': [0, 2], 2026-02-21T08:14:47.2919350Z 'range_unroll_factors': [0, 0], 2026-02-21T08:14:47.2919537Z 'range_warp_specializes': [None, None]} 2026-02-21T08:14:47.2926183Z [39s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:14:48.7691948Z [40s] Generation 1 starting: 92 neighbors, 5 active search path(s) 2026-02-21T08:15:08.6079365Z Generation 1: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 1.7 configs/s 2026-02-21T08:15:14.2302671Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 17.2 configs/s 2026-02-21T08:15:17.4510720Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 313.5 2026-02-21T08:15:17.4512417Z configs/s 2026-02-21T08:15:17.6256914Z [69s] Generation 1 complete: 2026-02-21T08:15:17.6261283Z error=1 2026-02-21T08:15:17.6262876Z ok=97 2026-02-21T08:15:17.6263083Z min=0.0441 2026-02-21T08:15:17.6263221Z mid=0.0790 2026-02-21T08:15:17.6263342Z max=1.4091 2026-02-21T08:15:17.6263487Z best={'block_sizes': [256, 1, 256], 2026-02-21T08:15:17.6263725Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:15:17.6263949Z 'l2_groupings': [4], 2026-02-21T08:15:17.6264113Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:15:17.6264308Z 'loop_orders': [[1, 0]], 2026-02-21T08:15:17.6264459Z 'num_stages': 4, 2026-02-21T08:15:17.6264604Z 'num_warps': 8, 2026-02-21T08:15:17.6264743Z 'pid_type': 'flat', 2026-02-21T08:15:17.6264903Z 'range_flattens': [None, False], 2026-02-21T08:15:17.6265091Z 'range_multi_buffers': [None, False], 2026-02-21T08:15:17.6265275Z 'range_num_stages': [0, 3], 2026-02-21T08:15:17.6265449Z 'range_unroll_factors': [0, 0], 2026-02-21T08:15:17.6265627Z 'range_warp_specializes': [None, None]} 2026-02-21T08:15:17.6275056Z [69s] Fitting surrogate: 198 points, 198 targets 2026-02-21T08:15:18.7961764Z [70s] Generation 2 starting: 88 neighbors, 5 active search path(s) 2026-02-21T08:15:50.9866566Z [102s] Timeout after 30s compiling Config(block_sizes=[256, 1, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'first'], loop_orders=[[1, 0]], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:15:50.9883262Z Generation 2: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 0.7 configs/s 2026-02-21T08:15:53.0186329Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:15:53.0192169Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:15:53.0193735Z %cst = arith.constant dense<0> : tensor<256x2x32xi8> 2026-02-21T08:15:53.0194021Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:15:53.0194230Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:15:53.0194426Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:15:53.0194627Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:15:53.0194817Z %c4736_i32 = arith.constant 4736 : i32 2026-02-21T08:15:53.0195083Z %cst_0 = arith.constant dense<8192> : tensor<2x1xi32> 2026-02-21T08:15:53.0195334Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:15:53.0195580Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:15:53.0195814Z %cst_3 = arith.constant dense<4> : tensor<256x32xi8> 2026-02-21T08:15:53.0196058Z %cst_4 = arith.constant dense<3584> : tensor<2x1xi32> 2026-02-21T08:15:53.0196320Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<2x32xf32> 2026-02-21T08:15:53.0196565Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:15:53.0196771Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:15:53.0197266Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:15:53.0197466Z %c1792_i32 = arith.constant 1792 : i32 2026-02-21T08:15:53.0197660Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:15:53.0197855Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:15:53.0198043Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:15:53.0198381Z %0 = tt.make_tensor_descriptor %arg1, [%c1792_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:15:53.0198724Z %1 = tt.get_program_id x : i32 2026-02-21T08:15:53.0198949Z scf.for %arg3 = %1 to %c512_i32 step %c4736_i32 : i32 { 
2026-02-21T08:15:53.0199192Z %2 = arith.divsi %arg3, %c16_i32 : i32 2026-02-21T08:15:53.0199390Z %3 = arith.muli %2, %c8_i32 : i32 2026-02-21T08:15:53.0199588Z %4 = arith.subi %c256_i32, %3 : i32 2026-02-21T08:15:53.0199776Z %5 = arith.minsi %4, %c8_i32 : i32 2026-02-21T08:15:53.0199976Z %6 = arith.remsi %arg3, %c16_i32 : i32 2026-02-21T08:15:53.0200284Z %7 = arith.remsi %6, %5 : i32 2026-02-21T08:15:53.0200485Z %8 = arith.addi %3, %7 : i32 2026-02-21T08:15:53.0200669Z %9 = arith.divsi %6, %5 : i32 2026-02-21T08:15:53.0200853Z %10 = arith.muli %8, %c32_i32 : i32 2026-02-21T08:15:53.0201106Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:15:53.0201377Z %12 = tt.splat %10 : i32 -> tensor<32xi32> 2026-02-21T08:15:53.0201666Z %13 = arith.addi %12, %11 : tensor<32xi32> 2026-02-21T08:15:53.0201865Z %14 = arith.muli %9, %c2_i32 : i32 2026-02-21T08:15:53.0202116Z %15 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:15:53.0202364Z %16 = tt.splat %14 : i32 -> tensor<2xi32> 2026-02-21T08:15:53.0202560Z %17 = arith.addi %16, %15 : tensor<2xi32> 2026-02-21T08:15:53.0202762Z %c1536_i32 = arith.constant 1536 : i32 2026-02-21T08:15:53.0202952Z %c512_i32_6 = arith.constant 512 : i32 2026-02-21T08:15:53.0203296Z %18 = scf.for %arg4 = %c0_i32 to %c1536_i32 step %c512_i32_6 iter_args(%arg5 = %cst_5) -> (tensor<2x32xf32>) : i32 { 2026-02-21T08:15:53.0203632Z %67 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:15:53.0203878Z %68 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:15:53.0204191Z %69 = tt.splat %67 : i32 -> tensor<512xi32> 2026-02-21T08:15:53.0204404Z %70 = arith.addi %69, %68 : tensor<512xi32> 2026-02-21T08:15:53.0204663Z %71 = tt.expand_dims %17 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:15:53.0204939Z %72 = arith.muli %71, %cst_4 : tensor<2x1xi32> 2026-02-21T08:15:53.0205200Z %73 = tt.expand_dims %70 {axis = 0 : i32} : tensor<512xi32> -> 
tensor<1x512xi32> 2026-02-21T08:15:53.0205504Z %74 = tt.broadcast %72 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:15:53.0205773Z %75 = tt.broadcast %73 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:15:53.0206015Z %76 = arith.addi %74, %75 : tensor<2x512xi32> 2026-02-21T08:15:53.0206278Z %77 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:15:53.0206567Z %78 = tt.addptr %77, %76 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:15:53.0206890Z %79 = tt.load %78 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:15:53.0207205Z %80 = arith.extf %79 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:15:53.0207536Z %81 = tt.descriptor_load %0[%arg4, %10] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:15:53.0207856Z %82 = arith.shli %81, %cst_3 : tensor<256x32xi8> 2026-02-21T08:15:53.0208073Z %83 = arith.shrsi %82, %cst_3 : tensor<256x32xi8> 2026-02-21T08:15:53.0208298Z %84 = arith.shrsi %81, %cst_3 : tensor<256x32xi8> 2026-02-21T08:15:53.0208554Z %85 = tt.expand_dims %15 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:15:53.0208867Z %86 = tt.expand_dims %85 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:15:53.0209281Z %87 = tt.expand_dims %83 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:15:53.0209603Z %88 = tt.expand_dims %84 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:15:53.0209890Z %89 = arith.cmpi eq, %86, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:15:53.0210137Z %90 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:15:53.0210418Z %91 = tt.broadcast %87 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:15:53.0210709Z %92 = arith.select %90, %91, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:15:53.0211043Z %93 = arith.cmpi eq, %86, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:15:53.0211298Z %94 = tt.broadcast %88 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 
2026-02-21T08:15:53.0211697Z %95 = tt.broadcast %93 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:15:53.0211986Z %96 = arith.select %95, %94, %92 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:15:53.0212270Z %97 = tt.reshape %96 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:15:53.0212541Z %98 = arith.sitofp %97 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:15:53.0212844Z %99 = tt.expand_dims %80 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:15:53.0213179Z %100 = tt.expand_dims %98 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:15:53.0213498Z %101 = tt.broadcast %99 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:15:53.0213787Z %102 = tt.broadcast %100 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:15:53.0214058Z %103 = arith.mulf %101, %102 : tensor<2x512x32xf32> 2026-02-21T08:15:53.0214277Z %104 = "tt.reduce"(%103) <{axis = 1 : i32}> ({ 2026-02-21T08:15:53.0214484Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:53.0214686Z %147 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:53.0214879Z tt.reduce.return %147 : f32 2026-02-21T08:15:53.0215085Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:15:53.0215299Z %105 = arith.addf %arg5, %104 : tensor<2x32xf32> 2026-02-21T08:15:53.0215510Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:15:53.0215701Z %106 = arith.muli %c256_i32, %c1_i32 : i32 2026-02-21T08:15:53.0215900Z %107 = arith.addi %arg4, %106 : i32 2026-02-21T08:15:53.0216096Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T08:15:53.0216330Z %109 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:15:53.0216590Z %110 = tt.splat %108 : i32 -> tensor<512xi32> 2026-02-21T08:15:53.0216799Z %111 = arith.addi %110, %109 : tensor<512xi32> 2026-02-21T08:15:53.0217058Z %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:15:53.0217326Z %113 = arith.muli %112, %cst_4 : 
tensor<2x1xi32> 2026-02-21T08:15:53.0217596Z %114 = tt.expand_dims %111 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:15:53.0217895Z %115 = tt.broadcast %113 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:15:53.0218159Z %116 = tt.broadcast %114 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:15:53.0218408Z %117 = arith.addi %115, %116 : tensor<2x512xi32> 2026-02-21T08:15:53.0218650Z %118 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:15:53.0218949Z %119 = tt.addptr %118, %117 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:15:53.0219264Z %120 = tt.load %119 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:15:53.0219576Z %121 = arith.extf %120 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:15:53.0219911Z %122 = tt.descriptor_load %0[%107, %10] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:15:53.0220280Z %123 = arith.shli %122, %cst_3 : tensor<256x32xi8> 2026-02-21T08:15:53.0220517Z %124 = arith.shrsi %123, %cst_3 : tensor<256x32xi8> 2026-02-21T08:15:53.0220743Z %125 = arith.shrsi %122, %cst_3 : tensor<256x32xi8> 2026-02-21T08:15:53.0221009Z %126 = tt.expand_dims %15 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:15:53.0221336Z %127 = tt.expand_dims %126 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:15:53.0221694Z %128 = tt.expand_dims %124 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:15:53.0222032Z %129 = tt.expand_dims %125 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:15:53.0222320Z %130 = arith.cmpi eq, %127, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:15:53.0222578Z %131 = tt.broadcast %130 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:15:53.0222911Z %132 = tt.broadcast %128 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:15:53.0223220Z %133 = arith.select %131, %132, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:15:53.0223509Z %134 
= arith.cmpi eq, %127, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:15:53.0223765Z %135 = tt.broadcast %129 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:15:53.0224050Z %136 = tt.broadcast %134 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:15:53.0224340Z %137 = arith.select %136, %135, %133 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:15:53.0224640Z %138 = tt.reshape %137 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:15:53.0224919Z %139 = arith.sitofp %138 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:15:53.0225216Z %140 = tt.expand_dims %121 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:15:53.0225557Z %141 = tt.expand_dims %139 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:15:53.0225873Z %142 = tt.broadcast %140 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:15:53.0226170Z %143 = tt.broadcast %141 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:15:53.0226430Z %144 = arith.mulf %142, %143 : tensor<2x512x32xf32> 2026-02-21T08:15:53.0226651Z %145 = "tt.reduce"(%144) <{axis = 1 : i32}> ({ 2026-02-21T08:15:53.0226852Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:53.0227038Z %147 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:53.0227236Z tt.reduce.return %147 : f32 2026-02-21T08:15:53.0227432Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:15:53.0227647Z %146 = arith.addf %105, %145 : tensor<2x32xf32> 2026-02-21T08:15:53.0227843Z scf.yield %146 : tensor<2x32xf32> 2026-02-21T08:15:53.0228033Z } {tt.num_stages = 1 : i32} 2026-02-21T08:15:53.0228221Z %19 = arith.muli %c1536_i32, %c2_i32 : i32 2026-02-21T08:15:53.0228461Z %20 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:15:53.0228715Z %21 = tt.splat %19 : i32 -> tensor<512xi32> 2026-02-21T08:15:53.0228916Z %22 = arith.addi %21, %20 : tensor<512xi32> 2026-02-21T08:15:53.0229169Z %23 = tt.expand_dims %17 {axis = 1 : i32} : tensor<2xi32> 
-> tensor<2x1xi32> 2026-02-21T08:15:53.0229425Z %24 = arith.muli %23, %cst_4 : tensor<2x1xi32> 2026-02-21T08:15:53.0229683Z %25 = tt.expand_dims %22 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:15:53.0229975Z %26 = tt.broadcast %24 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:15:53.0230232Z %27 = tt.broadcast %25 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:15:53.0230471Z %28 = arith.addi %26, %27 : tensor<2x512xi32> 2026-02-21T08:15:53.0230711Z %29 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:15:53.0231001Z %30 = tt.addptr %29, %28 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:15:53.0231384Z %31 = tt.load %30 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:15:53.0231708Z %32 = arith.extf %31 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:15:53.0232046Z %33 = tt.descriptor_load %0[%c1536_i32, %10] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:15:53.0232386Z %34 = arith.shli %33, %cst_3 : tensor<256x32xi8> 2026-02-21T08:15:53.0232665Z %35 = arith.shrsi %34, %cst_3 : tensor<256x32xi8> 2026-02-21T08:15:53.0232883Z %36 = arith.shrsi %33, %cst_3 : tensor<256x32xi8> 2026-02-21T08:15:53.0233145Z %37 = tt.expand_dims %15 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:15:53.0233449Z %38 = tt.expand_dims %37 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:15:53.0233780Z %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:15:53.0234171Z %40 = tt.expand_dims %36 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:15:53.0234460Z %41 = arith.cmpi eq, %38, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:15:53.0234728Z %42 = tt.broadcast %41 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:15:53.0234998Z %43 = tt.broadcast %39 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:15:53.0235297Z %44 = arith.select %42, %43, %cst : tensor<256x2x32xi1>, 
tensor<256x2x32xi8> 2026-02-21T08:15:53.0235569Z %45 = arith.cmpi eq, %38, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:15:53.0235814Z %46 = tt.broadcast %40 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:15:53.0236101Z %47 = tt.broadcast %45 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:15:53.0236384Z %48 = arith.select %47, %46, %44 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:15:53.0236674Z %49 = tt.reshape %48 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:15:53.0236942Z %50 = arith.sitofp %49 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:15:53.0237253Z %51 = tt.expand_dims %32 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:15:53.0237598Z %52 = tt.expand_dims %50 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:15:53.0237915Z %53 = tt.broadcast %51 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:15:53.0238212Z %54 = tt.broadcast %52 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:15:53.0238470Z %55 = arith.mulf %53, %54 : tensor<2x512x32xf32> 2026-02-21T08:15:53.0238697Z %56 = "tt.reduce"(%55) <{axis = 1 : i32}> ({ 2026-02-21T08:15:53.0238906Z ^bb0(%arg4: f32, %arg5: f32): 2026-02-21T08:15:53.0239096Z %67 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:15:53.0239303Z tt.reduce.return %67 : f32 2026-02-21T08:15:53.0239506Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:15:53.0239730Z %57 = arith.addf %18, %56 : tensor<2x32xf32> 2026-02-21T08:15:53.0239986Z %58 = arith.truncf %57 : tensor<2x32xf32> to tensor<2x32xbf16> 2026-02-21T08:15:53.0240279Z %59 = tt.expand_dims %17 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:15:53.0240553Z %60 = arith.muli %59, %cst_0 : tensor<2x1xi32> 2026-02-21T08:15:53.0240814Z %61 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:15:53.0241111Z %62 = tt.broadcast %60 : tensor<2x1xi32> -> tensor<2x32xi32> 2026-02-21T08:15:53.0241371Z %63 
= tt.broadcast %61 : tensor<1x32xi32> -> tensor<2x32xi32> 2026-02-21T08:15:53.0241641Z %64 = arith.addi %62, %63 : tensor<2x32xi32> 2026-02-21T08:15:53.0241892Z %65 = tt.splat %arg2 : !tt.ptr -> tensor<2x32x!tt.ptr> 2026-02-21T08:15:53.0242191Z %66 = tt.addptr %65, %64 : tensor<2x32x!tt.ptr>, tensor<2x32xi32> 2026-02-21T08:15:53.0242464Z tt.store %66, %58 : tensor<2x32x!tt.ptr> 2026-02-21T08:15:53.0242677Z } {tt.warp_specialize} 2026-02-21T08:15:53.0242911Z tt.return 2026-02-21T08:15:53.0243046Z } 2026-02-21T08:15:53.0243183Z } 2026-02-21T08:15:53.0243255Z 2026-02-21T08:15:53.0243308Z {-# 2026-02-21T08:15:53.0243453Z external_resources: { 2026-02-21T08:15:53.0243626Z mlir_reproducer: { 2026-02-21T08:15:53.0248078Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, 
tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:15:53.0252579Z disable_threading: false, 2026-02-21T08:15:53.0252759Z verify_each: true 2026-02-21T08:15:53.0252906Z } 2026-02-21T08:15:53.0253036Z } 2026-02-21T08:15:53.0253152Z #-} 2026-02-21T08:15:53.0253590Z /tmp/torchinductor_root/sm/csmasjr742vxosizcspwlga4kol56kc3cotymnf7fzyupyh67u4n.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:15:53.0254803Z /tmp/torchinductor_root/sm/csmasjr742vxosizcspwlga4kol56kc3cotymnf7fzyupyh67u4n.py:19:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:15:53.0255799Z [105s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:15:53.0257009Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 2, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=128, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[0, 3], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:15:53.0258096Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:15:53.0258358Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:15:56.1799209Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 17.7 configs/s 2026-02-21T08:15:59.8069071Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 279.0 2026-02-21T08:15:59.8073886Z configs/s 2026-02-21T08:15:59.9988074Z [111s] Generation 2 complete: 2026-02-21T08:15:59.9989989Z error=3 2026-02-21T08:15:59.9990496Z timeout=1 2026-02-21T08:15:59.9990646Z ok=89 2026-02-21T08:15:59.9990804Z min=0.0401 2026-02-21T08:15:59.9990955Z mid=0.0686 2026-02-21T08:15:59.9991109Z max=1.0061 2026-02-21T08:15:59.9991285Z best={'block_sizes': [256, 4, 8], 2026-02-21T08:15:59.9991709Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:15:59.9992012Z 'l2_groupings': [8], 2026-02-21T08:15:59.9992204Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:15:59.9992435Z 'loop_orders': [[0, 1]], 2026-02-21T08:15:59.9992607Z 'num_stages': 6, 2026-02-21T08:15:59.9992793Z 'num_warps': 4, 2026-02-21T08:15:59.9992969Z 'pid_type': 'flat', 2026-02-21T08:15:59.9993147Z 'range_flattens': [None, True], 2026-02-21T08:15:59.9993341Z 'range_multi_buffers': [None, None], 2026-02-21T08:15:59.9993546Z 'range_num_stages': [0, 0], 2026-02-21T08:15:59.9993722Z 'range_unroll_factors': [0, 2], 2026-02-21T08:15:59.9994060Z 
'range_warp_specializes': [None, None]} 2026-02-21T08:16:00.0005749Z [111s] Fitting surrogate: 291 points, 291 targets 2026-02-21T08:16:00.9760411Z [112s] Generation 3 starting: 80 neighbors, 5 active search path(s) 2026-02-21T08:16:33.9646607Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 0.8 configs/s 2026-02-21T08:16:38.8454042Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 17.2 configs/s 2026-02-21T08:16:42.7047228Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 280.5 2026-02-21T08:16:42.7047687Z configs/s 2026-02-21T08:16:42.9105989Z [154s] Generation 3 complete: 2026-02-21T08:16:42.9109162Z error=1 2026-02-21T08:16:42.9113657Z ok=84 2026-02-21T08:16:42.9114976Z min=0.0400 2026-02-21T08:16:42.9115178Z mid=0.0625 2026-02-21T08:16:42.9115313Z max=1.2913 2026-02-21T08:16:42.9115475Z best={'block_sizes': [256, 4, 8], 2026-02-21T08:16:42.9115794Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:16:42.9116102Z 'l2_groupings': [8], 2026-02-21T08:16:42.9116280Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:16:42.9116494Z 'loop_orders': [[0, 1]], 2026-02-21T08:16:42.9116659Z 'num_stages': 6, 2026-02-21T08:16:42.9116802Z 'num_warps': 4, 2026-02-21T08:16:42.9116955Z 'pid_type': 'flat', 2026-02-21T08:16:42.9117115Z 'range_flattens': [None, True], 2026-02-21T08:16:42.9117309Z 'range_multi_buffers': [None, None], 2026-02-21T08:16:42.9117496Z 'range_num_stages': [0, 0], 2026-02-21T08:16:42.9117673Z 'range_unroll_factors': [0, 2], 2026-02-21T08:16:42.9117853Z 'range_warp_specializes': [None, None]} 2026-02-21T08:16:42.9124829Z [154s] Fitting surrogate: 376 points, 376 targets 2026-02-21T08:16:43.9457524Z [155s] Generation 4 starting: 78 neighbors, 5 active search path(s) 2026-02-21T08:16:58.2289694Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 2.3 configs/s 2026-02-21T08:16:59.9588808Z module attributes {ttg.maxnreg = 64 : i32} { 
2026-02-21T08:16:59.9592511Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:16:59.9594260Z %cst = arith.constant dense<0> : tensor<256x2x32xi8> 2026-02-21T08:16:59.9594524Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:16:59.9594719Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:16:59.9594914Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:16:59.9595096Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:16:59.9595287Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:16:59.9595493Z %cst_0 = arith.constant dense<8192> : tensor<2x1xi32> 2026-02-21T08:16:59.9595738Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:16:59.9595975Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:16:59.9596201Z %cst_3 = arith.constant dense<4> : tensor<256x32xi8> 2026-02-21T08:16:59.9596770Z %cst_4 = arith.constant dense<3584> : tensor<2x1xi32> 2026-02-21T08:16:59.9597023Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<2x32xf32> 2026-02-21T08:16:59.9597262Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:16:59.9597442Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:16:59.9597630Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:16:59.9597815Z %c1792_i32 = arith.constant 1792 : i32 2026-02-21T08:16:59.9598003Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:16:59.9598194Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:16:59.9598374Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:16:59.9598703Z %0 = tt.make_tensor_descriptor %arg1, [%c1792_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <i8>, <tensor<256x32xi8>> 2026-02-21T08:16:59.9599035Z %1 = tt.get_program_id x : i32 2026-02-21T08:16:59.9599231Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:16:59.9599409Z %3 = arith.minsi %2, %c512_i32 : i32 2026-02-21T08:16:59.9599723Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 {
2026-02-21T08:16:59.9599958Z %4 = arith.divsi %arg3, %c16_i32 : i32 2026-02-21T08:16:59.9600148Z %5 = arith.muli %4, %c8_i32 : i32 2026-02-21T08:16:59.9600340Z %6 = arith.subi %c256_i32, %5 : i32 2026-02-21T08:16:59.9600519Z %7 = arith.minsi %6, %c8_i32 : i32 2026-02-21T08:16:59.9600708Z %8 = arith.remsi %arg3, %c16_i32 : i32 2026-02-21T08:16:59.9600890Z %9 = arith.remsi %8, %7 : i32 2026-02-21T08:16:59.9601072Z %10 = arith.addi %5, %9 : i32 2026-02-21T08:16:59.9601242Z %11 = arith.divsi %8, %7 : i32 2026-02-21T08:16:59.9601460Z %12 = arith.muli %10, %c32_i32 : i32 2026-02-21T08:16:59.9601864Z %13 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:16:59.9602124Z %14 = tt.splat %12 : i32 -> tensor<32xi32> 2026-02-21T08:16:59.9602334Z %15 = arith.addi %14, %13 : tensor<32xi32> 2026-02-21T08:16:59.9602535Z %16 = arith.muli %11, %c2_i32 : i32 2026-02-21T08:16:59.9602767Z %17 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:16:59.9603020Z %18 = tt.splat %16 : i32 -> tensor<2xi32> 2026-02-21T08:16:59.9603218Z %19 = arith.addi %18, %17 : tensor<2xi32> 2026-02-21T08:16:59.9603420Z %c1536_i32 = arith.constant 1536 : i32 2026-02-21T08:16:59.9603610Z %c512_i32_6 = arith.constant 512 : i32 2026-02-21T08:16:59.9603938Z %20 = scf.for %arg4 = %c0_i32 to %c1536_i32 step %c512_i32_6 iter_args(%arg5 = %cst_5) -> (tensor<2x32xf32>) : i32 { 2026-02-21T08:16:59.9604273Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:16:59.9604508Z %70 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:16:59.9604768Z %71 = tt.splat %69 : i32 -> tensor<512xi32> 2026-02-21T08:16:59.9604973Z %72 = arith.addi %71, %70 : tensor<512xi32> 2026-02-21T08:16:59.9605293Z %73 = tt.expand_dims %19 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:16:59.9605623Z %74 = arith.muli %73, %cst_4 : tensor<2x1xi32> 2026-02-21T08:16:59.9605894Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<512xi32> -> 
tensor<1x512xi32> 2026-02-21T08:16:59.9606198Z %76 = tt.broadcast %74 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:16:59.9606468Z %77 = tt.broadcast %75 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:16:59.9606714Z %78 = arith.addi %76, %77 : tensor<2x512xi32> 2026-02-21T08:16:59.9606960Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:16:59.9607256Z %80 = tt.addptr %79, %78 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:16:59.9607574Z %81 = tt.load %80 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:16:59.9607875Z %82 = arith.extf %81 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:16:59.9608212Z %83 = tt.descriptor_load %0[%arg4, %12] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:16:59.9608609Z %84 = arith.shli %83, %cst_3 : tensor<256x32xi8> 2026-02-21T08:16:59.9608837Z %85 = arith.shrsi %84, %cst_3 : tensor<256x32xi8> 2026-02-21T08:16:59.9609057Z %86 = arith.shrsi %83, %cst_3 : tensor<256x32xi8> 2026-02-21T08:16:59.9609324Z %87 = tt.expand_dims %17 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:16:59.9609641Z %88 = tt.expand_dims %87 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:16:59.9609964Z %89 = tt.expand_dims %85 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:16:59.9610297Z %90 = tt.expand_dims %86 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:16:59.9610583Z %91 = arith.cmpi eq, %88, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:16:59.9610844Z %92 = tt.broadcast %91 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:16:59.9611191Z %93 = tt.broadcast %89 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:16:59.9611495Z %94 = arith.select %92, %93, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:16:59.9611823Z %95 = arith.cmpi eq, %88, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:16:59.9612083Z %96 = tt.broadcast %90 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 
2026-02-21T08:16:59.9612370Z %97 = tt.broadcast %95 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:16:59.9612652Z %98 = arith.select %97, %96, %94 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:16:59.9612939Z %99 = tt.reshape %98 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:16:59.9613214Z %100 = arith.sitofp %99 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:16:59.9613517Z %101 = tt.expand_dims %82 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:16:59.9613872Z %102 = tt.expand_dims %100 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:16:59.9614200Z %103 = tt.broadcast %101 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:16:59.9614505Z %104 = tt.broadcast %102 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:16:59.9614773Z %105 = arith.mulf %103, %104 : tensor<2x512x32xf32> 2026-02-21T08:16:59.9615004Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T08:16:59.9615213Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:16:59.9615404Z %149 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:16:59.9615606Z tt.reduce.return %149 : f32 2026-02-21T08:16:59.9615806Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:16:59.9616029Z %107 = arith.addf %arg5, %106 : tensor<2x32xf32> 2026-02-21T08:16:59.9616234Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T08:16:59.9616440Z %108 = arith.muli %c256_i32, %c1_i32_7 : i32 2026-02-21T08:16:59.9616643Z %109 = arith.addi %arg4, %108 : i32 2026-02-21T08:16:59.9616834Z %110 = arith.muli %109, %c2_i32 : i32 2026-02-21T08:16:59.9617081Z %111 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:16:59.9617340Z %112 = tt.splat %110 : i32 -> tensor<512xi32> 2026-02-21T08:16:59.9617558Z %113 = arith.addi %112, %111 : tensor<512xi32> 2026-02-21T08:16:59.9617810Z %114 = tt.expand_dims %19 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:16:59.9618085Z %115 = arith.muli %114, 
%cst_4 : tensor<2x1xi32> 2026-02-21T08:16:59.9618357Z %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:16:59.9618655Z %117 = tt.broadcast %115 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:16:59.9618926Z %118 = tt.broadcast %116 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:16:59.9619168Z %119 = arith.addi %117, %118 : tensor<2x512xi32> 2026-02-21T08:16:59.9619422Z %120 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:16:59.9619808Z %121 = tt.addptr %120, %119 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:16:59.9620134Z %122 = tt.load %121 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:16:59.9620444Z %123 = arith.extf %122 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:16:59.9620772Z %124 = tt.descriptor_load %0[%109, %12] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:16:59.9621088Z %125 = arith.shli %124, %cst_3 : tensor<256x32xi8> 2026-02-21T08:16:59.9621313Z %126 = arith.shrsi %125, %cst_3 : tensor<256x32xi8> 2026-02-21T08:16:59.9621655Z %127 = arith.shrsi %124, %cst_3 : tensor<256x32xi8> 2026-02-21T08:16:59.9621989Z %128 = tt.expand_dims %17 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:16:59.9622381Z %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:16:59.9622823Z %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:16:59.9623230Z %131 = tt.expand_dims %127 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:16:59.9623554Z %132 = arith.cmpi eq, %129, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:16:59.9623863Z %133 = tt.broadcast %132 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:16:59.9624165Z %134 = tt.broadcast %130 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:16:59.9624513Z %135 = arith.select %133, %134, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 
2026-02-21T08:16:59.9624858Z %136 = arith.cmpi eq, %129, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:16:59.9625151Z %137 = tt.broadcast %131 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:16:59.9625499Z %138 = tt.broadcast %136 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:16:59.9625825Z %139 = arith.select %138, %137, %135 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:16:59.9626180Z %140 = tt.reshape %139 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:16:59.9626562Z %141 = arith.sitofp %140 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:16:59.9626925Z %142 = tt.expand_dims %123 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:16:59.9627339Z %143 = tt.expand_dims %141 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:16:59.9627699Z %144 = tt.broadcast %142 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:16:59.9628066Z %145 = tt.broadcast %143 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:16:59.9628372Z %146 = arith.mulf %144, %145 : tensor<2x512x32xf32> 2026-02-21T08:16:59.9628664Z %147 = "tt.reduce"(%146) <{axis = 1 : i32}> ({ 2026-02-21T08:16:59.9628916Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:16:59.9629129Z %149 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:16:59.9629371Z tt.reduce.return %149 : f32 2026-02-21T08:16:59.9629607Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:16:59.9629876Z %148 = arith.addf %107, %147 : tensor<2x32xf32> 2026-02-21T08:16:59.9630103Z scf.yield %148 : tensor<2x32xf32> 2026-02-21T08:16:59.9630424Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:16:59.9630752Z %21 = arith.muli %c1536_i32, %c2_i32 : i32 2026-02-21T08:16:59.9631034Z %22 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:16:59.9631356Z %23 = tt.splat %21 : i32 -> tensor<512xi32> 2026-02-21T08:16:59.9631648Z %24 = arith.addi %23, %22 : tensor<512xi32> 
2026-02-21T08:16:59.9631963Z %25 = tt.expand_dims %19 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:16:59.9632246Z %26 = arith.muli %25, %cst_4 : tensor<2x1xi32> 2026-02-21T08:16:59.9632554Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:16:59.9632972Z %28 = tt.broadcast %26 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:16:59.9633248Z %29 = tt.broadcast %27 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:16:59.9633542Z %30 = arith.addi %28, %29 : tensor<2x512xi32> 2026-02-21T08:16:59.9633838Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:16:59.9634204Z %32 = tt.addptr %31, %30 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:16:59.9634557Z %33 = tt.load %32 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:16:59.9634904Z %34 = arith.extf %33 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:16:59.9635326Z %35 = tt.descriptor_load %0[%c1536_i32, %12] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:16:59.9635696Z %36 = arith.shli %35, %cst_3 : tensor<256x32xi8> 2026-02-21T08:16:59.9636050Z %37 = arith.shrsi %36, %cst_3 : tensor<256x32xi8> 2026-02-21T08:16:59.9636317Z %38 = arith.shrsi %35, %cst_3 : tensor<256x32xi8> 2026-02-21T08:16:59.9636650Z %39 = tt.expand_dims %17 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:16:59.9637005Z %40 = tt.expand_dims %39 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:16:59.9637398Z %41 = tt.expand_dims %37 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:16:59.9637792Z %42 = tt.expand_dims %38 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:16:59.9638117Z %43 = arith.cmpi eq, %40, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:16:59.9638416Z %44 = tt.broadcast %43 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:16:59.9638751Z %45 = tt.broadcast %41 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 
2026-02-21T08:16:59.9639063Z %46 = arith.select %44, %45, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:16:59.9639406Z %47 = arith.cmpi eq, %40, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:16:59.9639706Z %48 = tt.broadcast %42 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:16:59.9640057Z %49 = tt.broadcast %47 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:16:59.9640385Z %50 = arith.select %49, %48, %46 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:16:59.9640720Z %51 = tt.reshape %50 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:16:59.9641053Z %52 = arith.sitofp %51 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:16:59.9641399Z %53 = tt.expand_dims %34 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:16:59.9641835Z %54 = tt.expand_dims %52 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:16:59.9642187Z %55 = tt.broadcast %53 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:16:59.9642540Z %56 = tt.broadcast %54 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:16:59.9642863Z %57 = arith.mulf %55, %56 : tensor<2x512x32xf32> 2026-02-21T08:16:59.9643087Z %58 = "tt.reduce"(%57) <{axis = 1 : i32}> ({ 2026-02-21T08:16:59.9643316Z ^bb0(%arg4: f32, %arg5: f32): 2026-02-21T08:16:59.9643526Z %69 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:16:59.9643770Z tt.reduce.return %69 : f32 2026-02-21T08:16:59.9644011Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:16:59.9644290Z %59 = arith.addf %20, %58 : tensor<2x32xf32> 2026-02-21T08:16:59.9644565Z %60 = arith.truncf %59 : tensor<2x32xf32> to tensor<2x32xbf16> 2026-02-21T08:16:59.9644914Z %61 = tt.expand_dims %19 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:16:59.9645244Z %62 = arith.muli %61, %cst_0 : tensor<2x1xi32> 2026-02-21T08:16:59.9645536Z %63 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:16:59.9645889Z %64 = 
tt.broadcast %62 : tensor<2x1xi32> -> tensor<2x32xi32> 2026-02-21T08:16:59.9646184Z %65 = tt.broadcast %63 : tensor<1x32xi32> -> tensor<2x32xi32> 2026-02-21T08:16:59.9646537Z %66 = arith.addi %64, %65 : tensor<2x32xi32> 2026-02-21T08:16:59.9646844Z %67 = tt.splat %arg2 : !tt.ptr -> tensor<2x32x!tt.ptr> 2026-02-21T08:16:59.9647154Z %68 = tt.addptr %67, %66 : tensor<2x32x!tt.ptr>, tensor<2x32xi32> 2026-02-21T08:16:59.9647440Z tt.store %68, %60 : tensor<2x32x!tt.ptr> 2026-02-21T08:16:59.9647666Z } {tt.warp_specialize} 2026-02-21T08:16:59.9647868Z tt.return 2026-02-21T08:16:59.9648020Z } 2026-02-21T08:16:59.9648169Z } 2026-02-21T08:16:59.9648248Z 2026-02-21T08:16:59.9648309Z {-# 2026-02-21T08:16:59.9648505Z external_resources: { 2026-02-21T08:16:59.9648695Z mlir_reproducer: { 2026-02-21T08:16:59.9653229Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:16:59.9657759Z disable_threading: false, 2026-02-21T08:16:59.9657937Z verify_each: true 2026-02-21T08:16:59.9658093Z } 2026-02-21T08:16:59.9658270Z } 2026-02-21T08:16:59.9658459Z #-} 2026-02-21T08:16:59.9658926Z /tmp/torchinductor_root/4c/c4c76vrwxwsssweq5dhju5lwe7g56z6eskd5ae3qt7in3o7utdaz.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:16:59.9660131Z /tmp/torchinductor_root/4c/c4c76vrwxwsssweq5dhju5lwe7g56z6eskd5ae3qt7in3o7utdaz.py:19:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:16:59.9661115Z [171s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:16:59.9662318Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 2, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:16:59.9663380Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:16:59.9663649Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:17:00.7556807Z module { 2026-02-21T08:17:00.7562559Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:17:00.7566770Z %cst = arith.constant dense<0> : tensor<256x2x32xi8> 2026-02-21T08:17:00.7571264Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:17:00.7576549Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:17:00.7580476Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:17:00.7584980Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:17:00.7586560Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:17:00.7586919Z %cst_0 = arith.constant dense<8192> : tensor<2x1xi32> 2026-02-21T08:17:00.7589743Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:17:00.7590351Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:17:00.7590657Z %cst_3 = arith.constant dense<4> : tensor<256x32xi8> 2026-02-21T08:17:00.7590947Z %cst_4 = arith.constant dense<3584> : tensor<2x1xi32> 2026-02-21T08:17:00.7591217Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<2x32xf32> 2026-02-21T08:17:00.7591459Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:17:00.7591717Z %c2_i32 = arith.constant 2 : 
i32 2026-02-21T08:17:00.7591937Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:17:00.7592196Z %c1792_i32 = arith.constant 1792 : i32 2026-02-21T08:17:00.7592436Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:17:00.7592673Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:17:00.7592900Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:17:00.7593295Z %0 = tt.make_tensor_descriptor %arg1, [%c1792_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:17:00.7593710Z %1 = tt.get_program_id x : i32 2026-02-21T08:17:00.7593937Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:17:00.7594190Z %3 = arith.minsi %2, %c512_i32 : i32 2026-02-21T08:17:00.7594495Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:17:00.7594741Z %4 = arith.divsi %arg3, %c2048_i32 : i32 2026-02-21T08:17:00.7595006Z %5 = arith.muli %4, %c8_i32 : i32 2026-02-21T08:17:00.7595217Z %6 = arith.subi %c2_i32, %5 : i32 2026-02-21T08:17:00.7595469Z %7 = arith.minsi %6, %c8_i32 : i32 2026-02-21T08:17:00.7595732Z %8 = arith.remsi %arg3, %c2048_i32 : i32 2026-02-21T08:17:00.7595961Z %9 = arith.remsi %8, %7 : i32 2026-02-21T08:17:00.7596197Z %10 = arith.addi %5, %9 : i32 2026-02-21T08:17:00.7596402Z %11 = arith.divsi %8, %7 : i32 2026-02-21T08:17:00.7596646Z %12 = arith.muli %10, %c2_i32 : i32 2026-02-21T08:17:00.7596917Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:17:00.7597221Z %14 = tt.splat %12 : i32 -> tensor<2xi32> 2026-02-21T08:17:00.7597465Z %15 = arith.addi %14, %13 : tensor<2xi32> 2026-02-21T08:17:00.7597729Z %16 = arith.muli %11, %c32_i32 : i32 2026-02-21T08:17:00.7597995Z %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:17:00.7598236Z %18 = tt.splat %16 : i32 -> tensor<32xi32> 2026-02-21T08:17:00.7598435Z %19 = arith.addi %18, %17 : tensor<32xi32> 2026-02-21T08:17:00.7598627Z %c1536_i32 = arith.constant 1536 : i32 2026-02-21T08:17:00.7598816Z %c512_i32_6 = arith.constant 512 : i32 
2026-02-21T08:17:00.7599138Z %20 = scf.for %arg4 = %c0_i32 to %c1536_i32 step %c512_i32_6 iter_args(%arg5 = %cst_5) -> (tensor<2x32xf32>) : i32 { 2026-02-21T08:17:00.7599468Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:17:00.7599712Z %70 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:17:00.7599963Z %71 = tt.splat %69 : i32 -> tensor<512xi32> 2026-02-21T08:17:00.7600174Z %72 = arith.addi %71, %70 : tensor<512xi32> 2026-02-21T08:17:00.7600527Z %73 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:17:00.7600805Z %74 = arith.muli %73, %cst_4 : tensor<2x1xi32> 2026-02-21T08:17:00.7601066Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:17:00.7601367Z %76 = tt.broadcast %74 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:17:00.7601691Z %77 = tt.broadcast %75 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:17:00.7601932Z %78 = arith.addi %76, %77 : tensor<2x512xi32> 2026-02-21T08:17:00.7602188Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:17:00.7602473Z %80 = tt.addptr %79, %78 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:17:00.7602791Z %81 = tt.load %80 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:17:00.7603151Z %82 = arith.extf %81 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:17:00.7603477Z %83 = tt.descriptor_load %0[%arg4, %16] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:17:00.7603791Z %84 = arith.shli %83, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:00.7604011Z %85 = arith.shrsi %84, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:00.7604234Z %86 = arith.shrsi %83, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:00.7604489Z %87 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:17:00.7604804Z %88 = tt.expand_dims %87 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:17:00.7605129Z %89 = tt.expand_dims %85 
{axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:00.7605446Z %90 = tt.expand_dims %86 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:00.7605731Z %91 = arith.cmpi eq, %88, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:17:00.7605976Z %92 = tt.broadcast %91 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:00.7606255Z %93 = tt.broadcast %89 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:00.7606547Z %94 = arith.select %92, %93, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:00.7606816Z %95 = arith.cmpi eq, %88, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:17:00.7607074Z %96 = tt.broadcast %90 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:00.7607340Z %97 = tt.broadcast %95 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:00.7607622Z %98 = arith.select %97, %96, %94 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:00.7607896Z %99 = tt.reshape %98 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:17:00.7608169Z %100 = arith.sitofp %99 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:17:00.7608475Z %101 = tt.expand_dims %82 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:17:00.7608814Z %102 = tt.expand_dims %100 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:17:00.7609139Z %103 = tt.broadcast %101 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:00.7609428Z %104 = tt.broadcast %102 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:00.7609695Z %105 = arith.mulf %103, %104 : tensor<2x512x32xf32> 2026-02-21T08:17:00.7609919Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T08:17:00.7610133Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:17:00.7610334Z %149 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:17:00.7610533Z tt.reduce.return %149 : f32 2026-02-21T08:17:00.7610796Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:17:00.7611098Z 
%107 = arith.addf %arg5, %106 : tensor<2x32xf32> 2026-02-21T08:17:00.7611357Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T08:17:00.7611662Z %108 = arith.muli %c256_i32, %c1_i32_7 : i32 2026-02-21T08:17:00.7612001Z %109 = arith.addi %arg4, %108 : i32 2026-02-21T08:17:00.7612268Z %110 = arith.muli %109, %c2_i32 : i32 2026-02-21T08:17:00.7612603Z %111 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:17:00.7612908Z %112 = tt.splat %110 : i32 -> tensor<512xi32> 2026-02-21T08:17:00.7613209Z %113 = arith.addi %112, %111 : tensor<512xi32> 2026-02-21T08:17:00.7613507Z %114 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:17:00.7613857Z %115 = arith.muli %114, %cst_4 : tensor<2x1xi32> 2026-02-21T08:17:00.7614178Z %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:17:00.7614556Z %117 = tt.broadcast %115 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:17:00.7614908Z %118 = tt.broadcast %116 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:17:00.7615249Z %119 = arith.addi %117, %118 : tensor<2x512xi32> 2026-02-21T08:17:00.7615584Z %120 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:17:00.7615925Z %121 = tt.addptr %120, %119 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:17:00.7616331Z %122 = tt.load %121 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:17:00.7616708Z %123 = arith.extf %122 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:17:00.7617097Z %124 = tt.descriptor_load %0[%109, %16] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:17:00.7617487Z %125 = arith.shli %124, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:00.7617771Z %126 = arith.shrsi %125, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:00.7618083Z %127 = arith.shrsi %124, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:00.7618373Z %128 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 
2026-02-21T08:17:00.7618762Z %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:17:00.7619153Z %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:00.7619527Z %131 = tt.expand_dims %127 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:00.7619873Z %132 = arith.cmpi eq, %129, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:17:00.7620172Z %133 = tt.broadcast %132 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:00.7620520Z %134 = tt.broadcast %130 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:00.7620883Z %135 = arith.select %133, %134, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:00.7621208Z %136 = arith.cmpi eq, %129, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:17:00.7621576Z %137 = tt.broadcast %131 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:00.7621895Z %138 = tt.broadcast %136 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:00.7622256Z %139 = arith.select %138, %137, %135 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:00.7622591Z %140 = tt.reshape %139 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:17:00.7622932Z %141 = arith.sitofp %140 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:17:00.7623305Z %142 = tt.expand_dims %123 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:17:00.7623682Z %143 = tt.expand_dims %141 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:17:00.7624065Z %144 = tt.broadcast %142 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:00.7624400Z %145 = tt.broadcast %143 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:00.7624736Z %146 = arith.mulf %144, %145 : tensor<2x512x32xf32> 2026-02-21T08:17:00.7625021Z %147 = "tt.reduce"(%146) <{axis = 1 : i32}> ({ 2026-02-21T08:17:00.7625266Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:17:00.7625585Z %149 = 
arith.addf %arg6, %arg7 : f32 2026-02-21T08:17:00.7625823Z tt.reduce.return %149 : f32 2026-02-21T08:17:00.7626091Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:17:00.7626337Z %148 = arith.addf %107, %147 : tensor<2x32xf32> 2026-02-21T08:17:00.7626599Z scf.yield %148 : tensor<2x32xf32> 2026-02-21T08:17:00.7626826Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:17:00.7627089Z %21 = arith.muli %c1536_i32, %c2_i32 : i32 2026-02-21T08:17:00.7627389Z %22 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:17:00.7627680Z %23 = tt.splat %21 : i32 -> tensor<512xi32> 2026-02-21T08:17:00.7627921Z %24 = arith.addi %23, %22 : tensor<512xi32> 2026-02-21T08:17:00.7628199Z %25 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:17:00.7628564Z %26 = arith.muli %25, %cst_4 : tensor<2x1xi32> 2026-02-21T08:17:00.7628863Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:17:00.7629209Z %28 = tt.broadcast %26 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:17:00.7629529Z %29 = tt.broadcast %27 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:17:00.7629797Z %30 = arith.addi %28, %29 : tensor<2x512xi32> 2026-02-21T08:17:00.7630090Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:17:00.7630413Z %32 = tt.addptr %31, %30 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:17:00.7630789Z %33 = tt.load %32 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:17:00.7631115Z %34 = arith.extf %33 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:17:00.7631512Z %35 = tt.descriptor_load %0[%c1536_i32, %16] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:17:00.7631923Z %36 = arith.shli %35, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:00.7632163Z %37 = arith.shrsi %36, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:00.7632449Z %38 = arith.shrsi %35, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:00.7632731Z %39 
= tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:17:00.7633109Z %40 = tt.expand_dims %39 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:17:00.7633463Z %41 = tt.expand_dims %37 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:00.7633809Z %42 = tt.expand_dims %38 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:00.7634141Z %43 = arith.cmpi eq, %40, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:17:00.7634424Z %44 = tt.broadcast %43 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:00.7634750Z %45 = tt.broadcast %41 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:00.7635075Z %46 = arith.select %44, %45, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:00.7635387Z %47 = arith.cmpi eq, %40, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:17:00.7635677Z %48 = tt.broadcast %42 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:00.7635979Z %49 = tt.broadcast %47 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:00.7636320Z %50 = arith.select %49, %48, %46 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:00.7636619Z %51 = tt.reshape %50 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:17:00.7636939Z %52 = arith.sitofp %51 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:17:00.7637275Z %53 = tt.expand_dims %34 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:17:00.7637617Z %54 = tt.expand_dims %52 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:17:00.7637987Z %55 = tt.broadcast %53 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:00.7638302Z %56 = tt.broadcast %54 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:00.7638681Z %57 = arith.mulf %55, %56 : tensor<2x512x32xf32> 2026-02-21T08:17:00.7638919Z %58 = "tt.reduce"(%57) <{axis = 1 : i32}> ({ 2026-02-21T08:17:00.7639159Z ^bb0(%arg4: f32, %arg5: f32): 
2026-02-21T08:17:00.7639402Z %69 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:17:00.7639636Z tt.reduce.return %69 : f32 2026-02-21T08:17:00.7639895Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:17:00.7640145Z %59 = arith.addf %20, %58 : tensor<2x32xf32> 2026-02-21T08:17:00.7640427Z %60 = arith.truncf %59 : tensor<2x32xf32> to tensor<2x32xbf16> 2026-02-21T08:17:00.7640724Z %61 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:17:00.7641041Z %62 = arith.muli %61, %cst_0 : tensor<2x1xi32> 2026-02-21T08:17:00.7641359Z %63 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:17:00.7641770Z %64 = tt.broadcast %62 : tensor<2x1xi32> -> tensor<2x32xi32> 2026-02-21T08:17:00.7642090Z %65 = tt.broadcast %63 : tensor<1x32xi32> -> tensor<2x32xi32> 2026-02-21T08:17:00.7642357Z %66 = arith.addi %64, %65 : tensor<2x32xi32> 2026-02-21T08:17:00.7642651Z %67 = tt.splat %arg2 : !tt.ptr -> tensor<2x32x!tt.ptr> 2026-02-21T08:17:00.7642972Z %68 = tt.addptr %67, %66 : tensor<2x32x!tt.ptr>, tensor<2x32xi32> 2026-02-21T08:17:00.7643295Z tt.store %68, %60 : tensor<2x32x!tt.ptr> 2026-02-21T08:17:00.7643570Z } {tt.warp_specialize} 2026-02-21T08:17:00.7643750Z tt.return 2026-02-21T08:17:00.7643933Z } 2026-02-21T08:17:00.7644086Z } 2026-02-21T08:17:00.7644175Z 2026-02-21T08:17:00.7644275Z {-# 2026-02-21T08:17:00.7644439Z external_resources: { 2026-02-21T08:17:00.7644669Z mlir_reproducer: { 2026-02-21T08:17:00.7649106Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, 
tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:17:00.7653612Z disable_threading: false, 2026-02-21T08:17:00.7653828Z verify_each: true 2026-02-21T08:17:00.7654036Z } 2026-02-21T08:17:00.7654222Z } 2026-02-21T08:17:00.7654375Z #-} 2026-02-21T08:17:00.7654873Z /tmp/torchinductor_root/7c/c7cbdxiogk6x2dotklarxpdjstjj3ws324v7vegsgqfovy2bfdwd.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:17:00.7656251Z /tmp/torchinductor_root/7c/c7cbdxiogk6x2dotklarxpdjstjj3ws324v7vegsgqfovy2bfdwd.py:19:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, 
please share the reproducer above with Triton project.` 2026-02-21T08:17:00.7657359Z [172s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:17:00.7658654Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 2, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_sm_multiplier=32, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:17:00.7659830Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:17:00.7660158Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:17:03.0260031Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 16.8 configs/s 2026-02-21T08:17:05.2377879Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 456.7 2026-02-21T08:17:05.2379328Z configs/s 2026-02-21T08:17:05.3657652Z [177s] Generation 4 complete: 2026-02-21T08:17:05.3658991Z error=2 2026-02-21T08:17:05.3659188Z ok=81 2026-02-21T08:17:05.3659393Z min=0.0380 2026-02-21T08:17:05.3659553Z mid=0.0810 2026-02-21T08:17:05.3659682Z max=2.9410 2026-02-21T08:17:05.3659902Z best={'block_sizes': [128, 2, 64], 2026-02-21T08:17:05.3664765Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:17:05.3668873Z 'l2_groupings': [64], 2026-02-21T08:17:05.3672385Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:17:05.3676901Z 'loop_orders': [[0, 1]], 2026-02-21T08:17:05.3677250Z 'num_stages': 8, 2026-02-21T08:17:05.3677456Z 'num_warps': 4, 2026-02-21T08:17:05.3677672Z 'pid_type': 'flat', 2026-02-21T08:17:05.3677881Z 'range_flattens': [None, True], 2026-02-21T08:17:05.3678085Z 'range_multi_buffers': [None, None], 2026-02-21T08:17:05.3678325Z 'range_num_stages': [0, 
4], 2026-02-21T08:17:05.3678536Z 'range_unroll_factors': [0, 0], 2026-02-21T08:17:05.3678779Z 'range_warp_specializes': [None, None]} 2026-02-21T08:17:05.3679003Z [177s] Fitting surrogate: 459 points, 459 targets 2026-02-21T08:17:06.4541334Z [178s] Generation 5 starting: 85 neighbors, 5 active search path(s) 2026-02-21T08:17:40.4506340Z [212s] Timeout after 30s compiling Config(block_sizes=[256, 4, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['', 'first'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=8, num_stages=6, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[0, 2], range_unroll_factors=[4, 3], range_warp_specializes=[None, None]) 2026-02-21T08:17:40.4523317Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.8 configs/s 2026-02-21T08:17:42.5439577Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:17:42.5444226Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:17:42.5446058Z %cst = arith.constant dense<0> : tensor<256x2x32xi8> 2026-02-21T08:17:42.5446438Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:17:42.5446733Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:17:42.5447045Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:17:42.5447315Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:17:42.5447618Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:17:42.5447949Z %cst_0 = arith.constant dense<8192> : tensor<2x1xi32> 2026-02-21T08:17:42.5448592Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:17:42.5448870Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:17:42.5449220Z %cst_3 = arith.constant dense<4> : tensor<256x32xi8> 2026-02-21T08:17:42.5449621Z %cst_4 = 
arith.constant dense<3584> : tensor<2x1xi32> 2026-02-21T08:17:42.5449986Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<2x32xf32> 2026-02-21T08:17:42.5450357Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:17:42.5450625Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:17:42.5450981Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:17:42.5451251Z %c1792_i32 = arith.constant 1792 : i32 2026-02-21T08:17:42.5451494Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:17:42.5452060Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:17:42.5452346Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:17:42.5452871Z %0 = tt.make_tensor_descriptor %arg1, [%c1792_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:17:42.5453254Z %1 = tt.get_program_id x : i32 2026-02-21T08:17:42.5453480Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:17:42.5453754Z %3 = arith.minsi %2, %c512_i32 : i32 2026-02-21T08:17:42.5454024Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:17:42.5454279Z %4 = arith.divsi %arg3, %c2048_i32 : i32 2026-02-21T08:17:42.5454542Z %5 = arith.muli %4, %c8_i32 : i32 2026-02-21T08:17:42.5454752Z %6 = arith.subi %c2_i32, %5 : i32 2026-02-21T08:17:42.5454997Z %7 = arith.minsi %6, %c8_i32 : i32 2026-02-21T08:17:42.5455224Z %8 = arith.remsi %arg3, %c2048_i32 : i32 2026-02-21T08:17:42.5455479Z %9 = arith.remsi %8, %7 : i32 2026-02-21T08:17:42.5455723Z %10 = arith.addi %5, %9 : i32 2026-02-21T08:17:42.5455934Z %11 = arith.divsi %8, %7 : i32 2026-02-21T08:17:42.5456184Z %12 = arith.muli %10, %c2_i32 : i32 2026-02-21T08:17:42.5456454Z %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:17:42.5456770Z %14 = tt.splat %12 : i32 -> tensor<2xi32> 2026-02-21T08:17:42.5457015Z %15 = arith.addi %14, %13 : tensor<2xi32> 2026-02-21T08:17:42.5457285Z %16 = arith.muli %11, %c32_i32 : i32 2026-02-21T08:17:42.5457572Z %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:17:42.5457819Z %18 = tt.splat 
%16 : i32 -> tensor<32xi32> 2026-02-21T08:17:42.5458024Z %19 = arith.addi %18, %17 : tensor<32xi32> 2026-02-21T08:17:42.5458218Z %c1536_i32 = arith.constant 1536 : i32 2026-02-21T08:17:42.5458412Z %c512_i32_6 = arith.constant 512 : i32 2026-02-21T08:17:42.5458735Z %20 = scf.for %arg4 = %c0_i32 to %c1536_i32 step %c512_i32_6 iter_args(%arg5 = %cst_5) -> (tensor<2x32xf32>) : i32 { 2026-02-21T08:17:42.5459068Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:17:42.5459327Z %70 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:17:42.5459628Z %71 = tt.splat %69 : i32 -> tensor<512xi32> 2026-02-21T08:17:42.5459877Z %72 = arith.addi %71, %70 : tensor<512xi32> 2026-02-21T08:17:42.5460130Z %73 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:17:42.5460437Z %74 = arith.muli %73, %cst_4 : tensor<2x1xi32> 2026-02-21T08:17:42.5460741Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:17:42.5461108Z %76 = tt.broadcast %74 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:17:42.5461431Z %77 = tt.broadcast %75 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:17:42.5461756Z %78 = arith.addi %76, %77 : tensor<2x512xi32> 2026-02-21T08:17:42.5462051Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:17:42.5462372Z %80 = tt.addptr %79, %78 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:17:42.5462845Z %81 = tt.load %80 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:17:42.5463174Z %82 = arith.extf %81 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:17:42.5463569Z %83 = tt.descriptor_load %0[%arg4, %16] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:17:42.5463947Z %84 = arith.shli %83, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:42.5464211Z %85 = arith.shrsi %84, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:42.5464476Z %86 = arith.shrsi %83, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:42.5464768Z %87 
= tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:17:42.5465145Z %88 = tt.expand_dims %87 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:17:42.5465522Z %89 = tt.expand_dims %85 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:42.5465948Z %90 = tt.expand_dims %86 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:42.5466314Z %91 = arith.cmpi eq, %88, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:17:42.5466606Z %92 = tt.broadcast %91 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:42.5466950Z %93 = tt.broadcast %89 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:42.5467277Z %94 = arith.select %92, %93, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:42.5467615Z %95 = arith.cmpi eq, %88, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:17:42.5467939Z %96 = tt.broadcast %90 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:42.5468245Z %97 = tt.broadcast %95 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:42.5468581Z %98 = arith.select %97, %96, %94 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:42.5468902Z %99 = tt.reshape %98 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:17:42.5469244Z %100 = arith.sitofp %99 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:17:42.5469585Z %101 = tt.expand_dims %82 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:17:42.5469994Z %102 = tt.expand_dims %100 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:17:42.5470386Z %103 = tt.broadcast %101 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:42.5470706Z %104 = tt.broadcast %102 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:42.5471039Z %105 = arith.mulf %103, %104 : tensor<2x512x32xf32> 2026-02-21T08:17:42.5471306Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T08:17:42.5471645Z ^bb0(%arg6: f32, %arg7: 
f32): 2026-02-21T08:17:42.5471921Z %149 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:17:42.5472167Z tt.reduce.return %149 : f32 2026-02-21T08:17:42.5472458Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:17:42.5472737Z %107 = arith.addf %arg5, %106 : tensor<2x32xf32> 2026-02-21T08:17:42.5473032Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T08:17:42.5473289Z %108 = arith.muli %c256_i32, %c1_i32_7 : i32 2026-02-21T08:17:42.5473568Z %109 = arith.addi %arg4, %108 : i32 2026-02-21T08:17:42.5473808Z %110 = arith.muli %109, %c2_i32 : i32 2026-02-21T08:17:42.5474136Z %111 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:17:42.5474482Z %112 = tt.splat %110 : i32 -> tensor<512xi32> 2026-02-21T08:17:42.5474748Z %113 = arith.addi %112, %111 : tensor<512xi32> 2026-02-21T08:17:42.5475091Z %114 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:17:42.5475415Z %115 = arith.muli %114, %cst_4 : tensor<2x1xi32> 2026-02-21T08:17:42.5475769Z %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:17:42.5476159Z %117 = tt.broadcast %115 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:17:42.5476540Z %118 = tt.broadcast %116 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:17:42.5476865Z %119 = arith.addi %117, %118 : tensor<2x512xi32> 2026-02-21T08:17:42.5477164Z %120 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:17:42.5477551Z %121 = tt.addptr %120, %119 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:17:42.5477930Z %122 = tt.load %121 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:17:42.5478321Z %123 = arith.extf %122 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:17:42.5478728Z %124 = tt.descriptor_load %0[%109, %16] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:17:42.5479098Z %125 = arith.shli %124, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:42.5479398Z %126 = arith.shrsi %125, 
%cst_3 : tensor<256x32xi8> 2026-02-21T08:17:42.5479720Z %127 = arith.shrsi %124, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:42.5480055Z %128 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:17:42.5480416Z %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:17:42.5480812Z %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:42.5481214Z %131 = tt.expand_dims %127 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:42.5481598Z %132 = arith.cmpi eq, %129, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:17:42.5481921Z %133 = tt.broadcast %132 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:42.5482241Z %134 = tt.broadcast %130 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:42.5482611Z %135 = arith.select %133, %134, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:42.5482963Z %136 = arith.cmpi eq, %129, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:17:42.5483267Z %137 = tt.broadcast %131 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:42.5483613Z %138 = tt.broadcast %136 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:42.5483944Z %139 = arith.select %138, %137, %135 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:42.5484305Z %140 = tt.reshape %139 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:17:42.5484631Z %141 = arith.sitofp %140 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:17:42.5485007Z %142 = tt.expand_dims %123 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:17:42.5485424Z %143 = tt.expand_dims %141 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:17:42.5485788Z %144 = tt.broadcast %142 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:42.5486154Z %145 = tt.broadcast %143 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:42.5486463Z %146 = 
arith.mulf %144, %145 : tensor<2x512x32xf32> 2026-02-21T08:17:42.5486737Z %147 = "tt.reduce"(%146) <{axis = 1 : i32}> ({ 2026-02-21T08:17:42.5487006Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:17:42.5487237Z %149 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:17:42.5487500Z tt.reduce.return %149 : f32 2026-02-21T08:17:42.5487744Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:17:42.5488028Z %148 = arith.addf %107, %147 : tensor<2x32xf32> 2026-02-21T08:17:42.5488270Z scf.yield %148 : tensor<2x32xf32> 2026-02-21T08:17:42.5488540Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:17:42.5488782Z %21 = arith.muli %c1536_i32, %c2_i32 : i32 2026-02-21T08:17:42.5489085Z %22 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:17:42.5489408Z %23 = tt.splat %21 : i32 -> tensor<512xi32> 2026-02-21T08:17:42.5489657Z %24 = arith.addi %23, %22 : tensor<512xi32> 2026-02-21T08:17:42.5490063Z %25 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:17:42.5490367Z %26 = arith.muli %25, %cst_4 : tensor<2x1xi32> 2026-02-21T08:17:42.5490699Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:17:42.5491021Z %28 = tt.broadcast %26 : tensor<2x1xi32> -> tensor<2x512xi32> 2026-02-21T08:17:42.5491355Z %29 = tt.broadcast %27 : tensor<1x512xi32> -> tensor<2x512xi32> 2026-02-21T08:17:42.5491708Z %30 = arith.addi %28, %29 : tensor<2x512xi32> 2026-02-21T08:17:42.5491991Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<2x512x!tt.ptr> 2026-02-21T08:17:42.5492345Z %32 = tt.addptr %31, %30 : tensor<2x512x!tt.ptr>, tensor<2x512xi32> 2026-02-21T08:17:42.5492691Z %33 = tt.load %32 evictionPolicy = evict_first : tensor<2x512x!tt.ptr> 2026-02-21T08:17:42.5493118Z %34 = arith.extf %33 : tensor<2x512xbf16> to tensor<2x512xf32> 2026-02-21T08:17:42.5493527Z %35 = tt.descriptor_load %0[%c1536_i32, %16] : !tt.tensordesc> -> tensor<256x32xi8> 2026-02-21T08:17:42.5493884Z %36 = arith.shli %35, 
%cst_3 : tensor<256x32xi8> 2026-02-21T08:17:42.5494172Z %37 = arith.shrsi %36, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:42.5494431Z %38 = arith.shrsi %35, %cst_3 : tensor<256x32xi8> 2026-02-21T08:17:42.5494750Z %39 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:17:42.5495094Z %40 = tt.expand_dims %39 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:17:42.5495482Z %41 = tt.expand_dims %37 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:42.5495889Z %42 = tt.expand_dims %38 {axis = 1 : i32} : tensor<256x32xi8> -> tensor<256x1x32xi8> 2026-02-21T08:17:42.5496207Z %43 = arith.cmpi eq, %40, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:17:42.5496521Z %44 = tt.broadcast %43 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:42.5496861Z %45 = tt.broadcast %41 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:42.5497170Z %46 = arith.select %44, %45, %cst : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:42.5497505Z %47 = arith.cmpi eq, %40, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:17:42.5497791Z %48 = tt.broadcast %42 : tensor<256x1x32xi8> -> tensor<256x2x32xi8> 2026-02-21T08:17:42.5498109Z %49 = tt.broadcast %47 : tensor<1x2x1xi1> -> tensor<256x2x32xi1> 2026-02-21T08:17:42.5498421Z %50 = arith.select %49, %48, %46 : tensor<256x2x32xi1>, tensor<256x2x32xi8> 2026-02-21T08:17:42.5498762Z %51 = tt.reshape %50 : tensor<256x2x32xi8> -> tensor<512x32xi8> 2026-02-21T08:17:42.5499078Z %52 = arith.sitofp %51 : tensor<512x32xi8> to tensor<512x32xf32> 2026-02-21T08:17:42.5499406Z %53 = tt.expand_dims %34 {axis = 2 : i32} : tensor<2x512xf32> -> tensor<2x512x1xf32> 2026-02-21T08:17:42.5499799Z %54 = tt.expand_dims %52 {axis = 0 : i32} : tensor<512x32xf32> -> tensor<1x512x32xf32> 2026-02-21T08:17:42.5500150Z %55 = tt.broadcast %53 : tensor<2x512x1xf32> -> tensor<2x512x32xf32> 2026-02-21T08:17:42.5500492Z %56 = tt.broadcast %54 : tensor<1x512x32xf32> -> tensor<2x512x32xf32> 
2026-02-21T08:17:42.5500810Z %57 = arith.mulf %55, %56 : tensor<2x512x32xf32> 2026-02-21T08:17:42.5501058Z %58 = "tt.reduce"(%57) <{axis = 1 : i32}> ({ 2026-02-21T08:17:42.5501321Z ^bb0(%arg4: f32, %arg5: f32): 2026-02-21T08:17:42.5501591Z %69 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:17:42.5501854Z tt.reduce.return %69 : f32 2026-02-21T08:17:42.5502090Z }) : (tensor<2x512x32xf32>) -> tensor<2x32xf32> 2026-02-21T08:17:42.5502365Z %59 = arith.addf %20, %58 : tensor<2x32xf32> 2026-02-21T08:17:42.5502637Z %60 = arith.truncf %59 : tensor<2x32xf32> to tensor<2x32xbf16> 2026-02-21T08:17:42.5502981Z %61 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2xi32> -> tensor<2x1xi32> 2026-02-21T08:17:42.5503306Z %62 = arith.muli %61, %cst_0 : tensor<2x1xi32> 2026-02-21T08:17:42.5503658Z %63 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:17:42.5504016Z %64 = tt.broadcast %62 : tensor<2x1xi32> -> tensor<2x32xi32> 2026-02-21T08:17:42.5504311Z %65 = tt.broadcast %63 : tensor<1x32xi32> -> tensor<2x32xi32> 2026-02-21T08:17:42.5504613Z %66 = arith.addi %64, %65 : tensor<2x32xi32> 2026-02-21T08:17:42.5504896Z %67 = tt.splat %arg2 : !tt.ptr -> tensor<2x32x!tt.ptr> 2026-02-21T08:17:42.5505231Z %68 = tt.addptr %67, %66 : tensor<2x32x!tt.ptr>, tensor<2x32xi32> 2026-02-21T08:17:42.5505553Z tt.store %68, %60 : tensor<2x32x!tt.ptr> 2026-02-21T08:17:42.5505806Z } {tt.warp_specialize} 2026-02-21T08:17:42.5506037Z tt.return 2026-02-21T08:17:42.5506212Z } 2026-02-21T08:17:42.5506406Z } 2026-02-21T08:17:42.5506499Z 2026-02-21T08:17:42.5506570Z {-# 2026-02-21T08:17:42.5506827Z external_resources: { 2026-02-21T08:17:42.5507032Z mlir_reproducer: { 2026-02-21T08:17:42.5511471Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, 
tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:17:42.5516055Z disable_threading: false, 2026-02-21T08:17:42.5516309Z verify_each: true 2026-02-21T08:17:42.5516512Z } 2026-02-21T08:17:42.5516707Z } 2026-02-21T08:17:42.5516873Z #-} 2026-02-21T08:17:42.5517393Z /tmp/torchinductor_root/c5/cc55y7ljikyixgnen56l3wzmfz5hcf3m6um2kf2hr47a2pghnpnv.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:17:42.5518700Z /tmp/torchinductor_root/c5/cc55y7ljikyixgnen56l3wzmfz5hcf3m6um2kf2hr47a2pghnpnv.py:19:0: note: 
Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:17:42.5519810Z [214s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:17:42.5521119Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 2, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=32, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:17:42.5522387Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:17:42.5522700Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:17:45.5703011Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 17.1 configs/s 2026-02-21T08:17:49.2240274Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 277.1 2026-02-21T08:17:49.2244169Z configs/s 2026-02-21T08:17:49.4372628Z [221s] Generation 5 complete: 2026-02-21T08:17:49.4376944Z error=1 2026-02-21T08:17:49.4378210Z timeout=1 2026-02-21T08:17:49.4378385Z ok=88 2026-02-21T08:17:49.4378608Z min=0.0380 2026-02-21T08:17:49.4378820Z mid=0.0585 2026-02-21T08:17:49.4378980Z max=0.9399 2026-02-21T08:17:49.4379533Z best={'block_sizes': [128, 2, 64], 2026-02-21T08:17:49.4379817Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:17:49.4380045Z 'l2_groupings': [32], 2026-02-21T08:17:49.4380259Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:17:49.4380477Z 'loop_orders': [[0, 1]], 2026-02-21T08:17:49.4380650Z 'num_stages': 8, 2026-02-21T08:17:49.4380827Z 'num_warps': 4, 2026-02-21T08:17:49.4381007Z 'pid_type': 'xyz', 2026-02-21T08:17:49.4381173Z 'range_flattens': [None, True], 2026-02-21T08:17:49.4381376Z 'range_multi_buffers': [None, None], 2026-02-21T08:17:49.4381811Z 'range_num_stages': [0, 4], 2026-02-21T08:17:49.4382059Z 'range_unroll_factors': [0, 0], 2026-02-21T08:17:49.4382292Z 'range_warp_specializes': [None, None]} 2026-02-21T08:17:49.4399109Z [221s] Fitting surrogate: 549 points, 549 targets 2026-02-21T08:17:50.5138371Z [222s] Generation 6 starting: 72 neighbors, 4 active search path(s) 2026-02-21T08:17:56.7283240Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 6.8 configs/s 2026-02-21T08:18:01.2540581Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.5 configs/s 2026-02-21T08:18:04.3456693Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 327.2 2026-02-21T08:18:04.3460778Z configs/s 2026-02-21T08:18:04.5268715Z [236s] Generation 6 complete: 2026-02-21T08:18:04.5270730Z error=1 2026-02-21T08:18:04.5271003Z ok=75 
2026-02-21T08:18:04.5271160Z min=0.0380 2026-02-21T08:18:04.5271321Z mid=0.0584 2026-02-21T08:18:04.5271500Z max=0.6073 2026-02-21T08:18:04.5271925Z best={'block_sizes': [128, 2, 64], 2026-02-21T08:18:04.5272187Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:18:04.5272520Z 'l2_groupings': [32], 2026-02-21T08:18:04.5272729Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:18:04.5273010Z 'loop_orders': [[1, 0]], 2026-02-21T08:18:04.5273196Z 'num_stages': 8, 2026-02-21T08:18:04.5273429Z 'num_warps': 4, 2026-02-21T08:18:04.5273676Z 'pid_type': 'xyz', 2026-02-21T08:18:04.5273895Z 'range_flattens': [None, True], 2026-02-21T08:18:04.5274155Z 'range_multi_buffers': [None, None], 2026-02-21T08:18:04.5274356Z 'range_num_stages': [0, 4], 2026-02-21T08:18:04.5274532Z 'range_unroll_factors': [0, 0], 2026-02-21T08:18:04.5274715Z 'range_warp_specializes': [None, None]} 2026-02-21T08:18:04.5287657Z [236s] Fitting surrogate: 625 points, 625 targets 2026-02-21T08:18:05.1644388Z [237s] Generation 7 starting: 34 neighbors, 2 active search path(s) 2026-02-21T08:18:30.6999813Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 0.8 configs/s 2026-02-21T08:18:32.7748050Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 17.2 configs/s 2026-02-21T08:18:33.7329457Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1044.9 2026-02-21T08:18:33.7333748Z configs/s 2026-02-21T08:18:33.7998284Z [265s] Generation 7 complete: 2026-02-21T08:18:33.8002630Z ok=37 2026-02-21T08:18:33.8003933Z min=0.0380 2026-02-21T08:18:33.8004180Z mid=0.0585 2026-02-21T08:18:33.8004355Z max=1.2973 2026-02-21T08:18:33.8004570Z best={'block_sizes': [128, 2, 64], 2026-02-21T08:18:33.8004833Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:18:33.8005086Z 'l2_groupings': [32], 2026-02-21T08:18:33.8005405Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:18:33.8009455Z 'loop_orders': [[1, 0]], 2026-02-21T08:18:33.8012658Z 
'num_stages': 8, 2026-02-21T08:18:33.8017760Z 'num_warps': 4, 2026-02-21T08:18:33.8019389Z 'pid_type': 'xyz', 2026-02-21T08:18:33.8019673Z 'range_flattens': [None, True], 2026-02-21T08:18:33.8022457Z 'range_multi_buffers': [None, None], 2026-02-21T08:18:33.8022762Z 'range_num_stages': [0, 4], 2026-02-21T08:18:33.8023015Z 'range_unroll_factors': [0, 0], 2026-02-21T08:18:33.8023248Z 'range_warp_specializes': [None, None]} 2026-02-21T08:18:33.8023907Z [265s] Fitting surrogate: 662 points, 662 targets 2026-02-21T08:18:34.4591619Z [266s] Generation 8 starting: 31 neighbors, 2 active search path(s) 2026-02-21T08:18:40.0117540Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 4.1 configs/s 2026-02-21T08:18:41.9008416Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 17.3 configs/s 2026-02-21T08:18:43.1583524Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 798.9 2026-02-21T08:18:43.1587364Z configs/s 2026-02-21T08:18:43.2441129Z [275s] Generation 8 complete: 2026-02-21T08:18:43.2442878Z ok=34 2026-02-21T08:18:43.2443058Z min=0.0380 2026-02-21T08:18:43.2443274Z mid=0.0687 2026-02-21T08:18:43.2443442Z max=0.2851 2026-02-21T08:18:43.2443655Z best={'block_sizes': [128, 2, 64], 2026-02-21T08:18:43.2443910Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:18:43.2444192Z 'l2_groupings': [32], 2026-02-21T08:18:43.2444450Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:18:43.2444737Z 'loop_orders': [[1, 0]], 2026-02-21T08:18:43.2444971Z 'num_stages': 8, 2026-02-21T08:18:43.2445157Z 'num_warps': 4, 2026-02-21T08:18:43.2445360Z 'pid_type': 'xyz', 2026-02-21T08:18:43.2445561Z 'range_flattens': [None, True], 2026-02-21T08:18:43.2445811Z 'range_multi_buffers': [None, None], 2026-02-21T08:18:43.2446044Z 'range_num_stages': [0, 4], 2026-02-21T08:18:43.2446281Z 'range_unroll_factors': [0, 0], 2026-02-21T08:18:43.2446509Z 'range_warp_specializes': [None, None]} 2026-02-21T08:18:43.2466138Z [275s] Fitting 
surrogate: 696 points, 696 targets 2026-02-21T08:18:43.7453107Z [275s] Generation 9 starting: 20 neighbors, 1 active search path(s) 2026-02-21T08:18:50.2196131Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 2.4 configs/s 2026-02-21T08:18:51.4035544Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.6 configs/s 2026-02-21T08:18:52.4267566Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 979.5 2026-02-21T08:18:52.4269148Z configs/s 2026-02-21T08:18:52.4968911Z [284s] Generation 9 complete: 2026-02-21T08:18:52.4972773Z ok=22 2026-02-21T08:18:52.4977059Z min=0.0380 2026-02-21T08:18:52.4981512Z mid=0.0543 2026-02-21T08:18:52.4983015Z max=1.0487 2026-02-21T08:18:52.4983297Z best={'block_sizes': [128, 2, 64], 2026-02-21T08:18:52.4983617Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:18:52.4983869Z 'l2_groupings': [32], 2026-02-21T08:18:52.4984131Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:18:52.4984380Z 'loop_orders': [[1, 0]], 2026-02-21T08:18:52.4984608Z 'num_stages': 8, 2026-02-21T08:18:52.4984768Z 'num_warps': 4, 2026-02-21T08:18:52.4984966Z 'pid_type': 'xyz', 2026-02-21T08:18:52.4985151Z 'range_flattens': [None, True], 2026-02-21T08:18:52.4985389Z 'range_multi_buffers': [None, None], 2026-02-21T08:18:52.4985605Z 'range_num_stages': [0, 4], 2026-02-21T08:18:52.4986146Z 'range_unroll_factors': [0, 0], 2026-02-21T08:18:52.4986438Z 'range_warp_specializes': [None, None]} 2026-02-21T08:18:52.4992869Z [284s] Fitting surrogate: 718 points, 718 targets 2026-02-21T08:18:52.7873315Z [284s] Autotuning complete in 284.8s after searching 695 configs. 
2026-02-21T08:18:52.7877768Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:18:52.7878865Z @helion.kernel(config=helion.Config(block_sizes=[128, 2, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_stages=8, num_warps=4, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T08:18:52.7879832Z 2026-02-21T08:18:52.7880111Z [284s] Code of selected kernel: /tmp/torchinductor_root/gd/cgd6p7yrupjowc5plllwjmi75ib6wpvxp3akyli6oj7f4rpf3npg.py 2026-02-21T08:18:53.8305906Z WARNING:tritonbench.utils.triton_op:Completed input ID 7: 2026-02-21T08:18:53.8310205Z x_val 2026-02-21T08:18:53.8314604Z ------------------ 2026-02-21T08:18:53.8315945Z (4, 1, 8192, 3584) 2026-02-21T08:18:53.8316102Z 2026-02-21T08:18:53.8320220Z 30%|███ | 3/10 [09:53<25:33, 219.03s/it]WARNING:tritonbench.utils.triton_op:Running input ID 10: 2026-02-21T08:18:53.8320607Z x_val 2026-02-21T08:18:53.8320789Z ------------------- 2026-02-21T08:18:53.8321011Z (16, 1, 7168, 8192) 2026-02-21T08:18:53.8325482Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:18:54.8805456Z INFO:tritonbench.utils.triton_op:Took 2.77ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:18:55.5809227Z Autotune Choices Stats: 2026-02-21T08:18:55.5810325Z {"num_choices": 18, "num_triton_choices": 17, "best_kernel": "mm", "best_time": 0.028736000880599022, "best_triton_pos": 1, "best_triton_time": 0.03670400008559227, "best_triton_kernel": "triton_mm_25", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4"} 2026-02-21T08:18:55.5828051Z AUTOTUNE mm(16x8192, 
8192x7168) 2026-02-21T08:18:55.5832395Z strides: [8192, 1], [7168, 1] 2026-02-21T08:18:55.5834207Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:18:55.5834476Z mm 0.0287 ms 100.0% 2026-02-21T08:18:55.5834984Z triton_mm_25 0.0367 ms 78.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4 2026-02-21T08:18:55.5835716Z triton_mm_21 0.0368 ms 78.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2 2026-02-21T08:18:55.5836486Z triton_mm_29 0.0408 ms 70.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4 2026-02-21T08:18:55.5837506Z triton_mm_33 0.0470 ms 61.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2026-02-21T08:18:55.5838207Z triton_mm_20 0.0643 ms 44.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2 2026-02-21T08:18:55.5838931Z triton_mm_19 0.0653 ms 44.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4 2026-02-21T08:18:55.5839641Z triton_mm_24 0.0707 ms 40.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:18:55.5840429Z triton_mm_28 0.0777 ms 37.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:18:55.5841157Z triton_mm_31 0.0798 ms 36.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4 2026-02-21T08:18:55.5841847Z SingleProcess AUTOTUNE benchmarking takes 0.3524 seconds and 0.1736 seconds precompiling for 18 choices 2026-02-21T08:18:58.5516836Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:19:00.9748857Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:19:00.9752480Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:19:00.9753756Z 'dtype': 'torch.bfloat16', 2026-02-21T08:19:00.9754031Z 'shape': (16, 1, 8192), 2026-02-21T08:19:00.9754292Z 'stride': (8192, 8192, 1)}, 2026-02-21T08:19:00.9754554Z { 'device': 'cuda:0', 2026-02-21T08:19:00.9754782Z 'dtype': 'torch.int32', 2026-02-21T08:19:00.9755046Z 'shape': (8192, 7168), 2026-02-21T08:19:00.9755230Z 'stride': (7168, 1)}), 2026-02-21T08:19:00.9755404Z 'kwargs': {}} 2026-02-21T08:19:00.9764907Z INFO:tritonbench.utils.triton_op:Took 1.82ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:19:01.2297259Z [0s] Autotune random seed: 2134813318 2026-02-21T08:19:01.3466832Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:19:20.8374386Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.4 configs/s 2026-02-21T08:19:25.0063611Z [23s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:19:25.0069930Z Tensor-likes are not close! 
2026-02-21T08:19:25.0073250Z 2026-02-21T08:19:25.0075082Z Mismatched elements: 114603 / 114688 (99.9%) 2026-02-21T08:19:25.0075424Z Greatest absolute difference: 6912.0 at index (9, 3668) (up to 0.01 allowed) 2026-02-21T08:19:25.0075837Z Greatest relative difference: 129024.0 at index (1, 159) (up to 0.01 allowed) 2026-02-21T08:19:25.0076183Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:19:25.0076394Z 2026-02-21T08:19:25.6630234Z [24s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 16, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=8, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[1, 2], range_warp_specializes=[None, None]) 2026-02-21T08:19:25.6631855Z Tensor-likes are not close! 2026-02-21T08:19:25.6636056Z 2026-02-21T08:19:25.6640608Z Mismatched elements: 114422 / 114688 (99.8%) 2026-02-21T08:19:25.6645115Z Greatest absolute difference: 5312.0 at index (10, 4604) (up to 0.01 allowed) 2026-02-21T08:19:25.6649582Z Greatest relative difference: 280576.0 at index (2, 1421) (up to 0.01 allowed) 2026-02-21T08:19:25.6650015Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:19:25.6650235Z 2026-02-21T08:19:28.9612228Z [27s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 128], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=128, num_stages=4, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[1, 2], range_unroll_factors=[2, 2], range_warp_specializes=[None, None]) 2026-02-21T08:19:28.9613414Z Tensor-likes are not close! 
2026-02-21T08:19:28.9613534Z 2026-02-21T08:19:28.9613617Z Mismatched elements: 114688 / 114688 (100.0%) 2026-02-21T08:19:28.9613910Z Greatest absolute difference: 5440.0 at index (15, 680) (up to 0.01 allowed) 2026-02-21T08:19:28.9614250Z Greatest relative difference: 3.046875 at index (2, 1421) (up to 0.01 allowed) 2026-02-21T08:19:28.9614564Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:19:28.9614728Z 2026-02-21T08:19:29.5716405Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.5 configs/s 2026-02-21T08:19:29.5726850Z [28s] Adaptive compile timeout: 30s (90% percentile=2.7s, bounds=[30.0s, 30s]) 2026-02-21T08:19:30.0350130Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1965.0 configs/s 2026-02-21T08:19:30.0945750Z [28s] Initial random population of 100, 5 starting points: 2026-02-21T08:19:30.0950055Z error=6 2026-02-21T08:19:30.0955182Z ok=94 2026-02-21T08:19:30.0955447Z min=0.1937 2026-02-21T08:19:30.0955640Z mid=1.5513 2026-02-21T08:19:30.0955887Z max=94.5326 2026-02-21T08:19:30.0956089Z best={'block_sizes': [256, 2, 256], 2026-02-21T08:19:30.0956366Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:19:30.0956629Z 'l2_groupings': [4], 2026-02-21T08:19:30.0956807Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:19:30.0957005Z 'loop_orders': [[1, 0]], 2026-02-21T08:19:30.0957190Z 'num_stages': 3, 2026-02-21T08:19:30.0957338Z 'num_warps': 8, 2026-02-21T08:19:30.0957477Z 'pid_type': 'flat', 2026-02-21T08:19:30.0957640Z 'range_flattens': [None, False], 2026-02-21T08:19:30.0957820Z 'range_multi_buffers': [None, False], 2026-02-21T08:19:30.0958008Z 'range_num_stages': [0, 2], 2026-02-21T08:19:30.0958173Z 'range_unroll_factors': [0, 0], 2026-02-21T08:19:30.0958387Z 'range_warp_specializes': [None, None]} 2026-02-21T08:19:30.0962412Z [28s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:19:31.2913040Z [29s] Generation 1 starting: 92 neighbors, 5 active search path(s) 
2026-02-21T08:20:05.4089248Z [64s] Timeout after 30s compiling Config(block_sizes=[64, 2, 256], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[1], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=32, num_stages=8, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[1, 2], range_unroll_factors=[3, 3], range_warp_specializes=[None, False]) 2026-02-21T08:20:05.4101889Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 1.0 configs/s 2026-02-21T08:20:11.0548311Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 17.0 configs/s 2026-02-21T08:20:12.7307142Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 586.7 2026-02-21T08:20:12.7307580Z configs/s 2026-02-21T08:20:12.8079297Z [71s] Generation 1 complete: 2026-02-21T08:20:12.8083726Z error=5 2026-02-21T08:20:12.8088094Z timeout=1 2026-02-21T08:20:12.8092457Z ok=92 2026-02-21T08:20:12.8096377Z min=0.1342 2026-02-21T08:20:12.8100287Z mid=0.4495 2026-02-21T08:20:12.8104606Z max=5.9801 2026-02-21T08:20:12.8105899Z best={'block_sizes': [64, 16, 16], 2026-02-21T08:20:12.8106228Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:20:12.8106487Z 'l2_groupings': [1], 2026-02-21T08:20:12.8106731Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:20:12.8106954Z 'loop_orders': [[0, 1]], 2026-02-21T08:20:12.8107134Z 'num_stages': 1, 2026-02-21T08:20:12.8107324Z 'num_warps': 1, 2026-02-21T08:20:12.8107478Z 'pid_type': 'flat', 2026-02-21T08:20:12.8107674Z 'range_flattens': [None, False], 2026-02-21T08:20:12.8107875Z 'range_multi_buffers': [None, None], 2026-02-21T08:20:12.8108128Z 'range_num_stages': [0, 0], 2026-02-21T08:20:12.8108336Z 'range_unroll_factors': [0, 0], 2026-02-21T08:20:12.8108585Z 'range_warp_specializes': [None, None]} 2026-02-21T08:20:12.8109191Z [71s] Fitting surrogate: 198 points, 198 targets 
2026-02-21T08:20:13.9440794Z [72s] Generation 2 starting: 87 neighbors, 5 active search path(s) 2026-02-21T08:20:47.6900977Z [106s] Timeout after 30s compiling Config(block_sizes=[256, 4, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:20:47.6922237Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 0.6 configs/s 2026-02-21T08:20:53.0801211Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 17.0 configs/s 2026-02-21T08:20:53.3315204Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3665.4 2026-02-21T08:20:53.3319274Z configs/s 2026-02-21T08:20:53.3672063Z [112s] Generation 2 complete: 2026-02-21T08:20:53.3676364Z error=1 2026-02-21T08:20:53.3677812Z timeout=1 2026-02-21T08:20:53.3678036Z ok=91 2026-02-21T08:20:53.3678167Z min=0.0810 2026-02-21T08:20:53.3678326Z mid=0.3369 2026-02-21T08:20:53.3678477Z max=4.4236 2026-02-21T08:20:53.3678665Z best={'block_sizes': [128, 16, 64], 2026-02-21T08:20:53.3678959Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:20:53.3679209Z 'l2_groupings': [4], 2026-02-21T08:20:53.3679439Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:20:53.3679650Z 'loop_orders': [[0, 1]], 2026-02-21T08:20:53.3679862Z 'num_stages': 8, 2026-02-21T08:20:53.3680047Z 'num_warps': 4, 2026-02-21T08:20:53.3680239Z 'pid_type': 'flat', 2026-02-21T08:20:53.3680428Z 'range_flattens': [None, True], 2026-02-21T08:20:53.3680682Z 'range_multi_buffers': [None, None], 2026-02-21T08:20:53.3680905Z 'range_num_stages': [0, 4], 2026-02-21T08:20:53.3681073Z 'range_unroll_factors': [0, 0], 2026-02-21T08:20:53.3681277Z 'range_warp_specializes': [None, None]} 2026-02-21T08:20:53.3691078Z [112s] Fitting surrogate: 291 points, 
291 targets 2026-02-21T08:20:54.4508969Z [113s] Generation 3 starting: 78 neighbors, 5 active search path(s) 2026-02-21T08:21:16.3327286Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 0.8 configs/s 2026-02-21T08:21:20.8620807Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 18.1 configs/s 2026-02-21T08:21:23.2559382Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 417.8 2026-02-21T08:21:23.2563571Z configs/s 2026-02-21T08:21:23.3609118Z [142s] Generation 3 complete: 2026-02-21T08:21:23.3610994Z error=6 2026-02-21T08:21:23.3611189Z ok=77 2026-02-21T08:21:23.3615414Z min=0.0810 2026-02-21T08:21:23.3619846Z mid=0.3104 2026-02-21T08:21:23.3621631Z max=8.7481 2026-02-21T08:21:23.3621819Z best={'block_sizes': [128, 16, 64], 2026-02-21T08:21:23.3622133Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:21:23.3622706Z 'l2_groupings': [4], 2026-02-21T08:21:23.3622929Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:21:23.3623132Z 'loop_orders': [[0, 1]], 2026-02-21T08:21:23.3623309Z 'num_stages': 8, 2026-02-21T08:21:23.3623450Z 'num_warps': 4, 2026-02-21T08:21:23.3623597Z 'pid_type': 'flat', 2026-02-21T08:21:23.3623779Z 'range_flattens': [None, True], 2026-02-21T08:21:23.3623964Z 'range_multi_buffers': [None, None], 2026-02-21T08:21:23.3624160Z 'range_num_stages': [0, 4], 2026-02-21T08:21:23.3624327Z 'range_unroll_factors': [0, 0], 2026-02-21T08:21:23.3624536Z 'range_warp_specializes': [None, None]} 2026-02-21T08:21:23.3629399Z [142s] Fitting surrogate: 374 points, 374 targets 2026-02-21T08:21:24.1859416Z [142s] Generation 4 starting: 59 neighbors, 4 active search path(s) 2026-02-21T08:21:28.6979340Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 7.4 configs/s 2026-02-21T08:21:32.1276545Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 17.7 configs/s 2026-02-21T08:21:34.5578259Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 464.6 
2026-02-21T08:21:34.5582090Z configs/s 2026-02-21T08:21:34.6555667Z [153s] Generation 4 complete: 2026-02-21T08:21:34.6557008Z error=2 2026-02-21T08:21:34.6557208Z ok=61 2026-02-21T08:21:34.6557356Z min=0.0789 2026-02-21T08:21:34.6557491Z mid=0.3104 2026-02-21T08:21:34.6557632Z max=3.1949 2026-02-21T08:21:34.6557778Z best={'block_sizes': [128, 16, 32], 2026-02-21T08:21:34.6558063Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:21:34.6558308Z 'l2_groupings': [4], 2026-02-21T08:21:34.6558550Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:21:34.6558787Z 'loop_orders': [[0, 1]], 2026-02-21T08:21:34.6558974Z 'num_stages': 8, 2026-02-21T08:21:34.6559121Z 'num_warps': 4, 2026-02-21T08:21:34.6559264Z 'pid_type': 'flat', 2026-02-21T08:21:34.6559466Z 'range_flattens': [None, True], 2026-02-21T08:21:34.6559662Z 'range_multi_buffers': [None, None], 2026-02-21T08:21:34.6559852Z 'range_num_stages': [0, 4], 2026-02-21T08:21:34.6560018Z 'range_unroll_factors': [0, 0], 2026-02-21T08:21:34.6560206Z 'range_warp_specializes': [None, None]} 2026-02-21T08:21:34.6571200Z [153s] Fitting surrogate: 437 points, 437 targets 2026-02-21T08:21:35.5447801Z [154s] Generation 5 starting: 57 neighbors, 4 active search path(s) 2026-02-21T08:22:08.0028896Z [186s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:22:08.1840453Z [186s] Timeout after 30s compiling Config(block_sizes=[512, 4, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], 
range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:22:08.2968335Z [186s] Timeout after 30s compiling Config(block_sizes=[512, 16, 8], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:22:08.4955683Z [187s] Timeout after 30s compiling Config(block_sizes=[512, 16, 8], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:22:08.4972243Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58/58 0.6 configs/s 2026-02-21T08:22:11.8560167Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 58/58 17.5 configs/s 2026-02-21T08:22:13.4818525Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 613.9 2026-02-21T08:22:13.4819458Z configs/s 2026-02-21T08:22:13.5640952Z [192s] Generation 5 complete: 2026-02-21T08:22:13.5642892Z timeout=4 2026-02-21T08:22:13.5643046Z ok=57 2026-02-21T08:22:13.5643185Z min=0.0768 2026-02-21T08:22:13.5643316Z mid=0.2714 2026-02-21T08:22:13.5643451Z max=6.2515 2026-02-21T08:22:13.5643596Z best={'block_sizes': [128, 16, 32], 2026-02-21T08:22:13.5643830Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:22:13.5644034Z 'l2_groupings': [4], 2026-02-21T08:22:13.5644508Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:22:13.5644743Z 'loop_orders': [[0, 1]], 2026-02-21T08:22:13.5644909Z 'num_stages': 8, 2026-02-21T08:22:13.5645067Z 'num_warps': 4, 2026-02-21T08:22:13.5645214Z 'pid_type': 
'flat', 2026-02-21T08:22:13.5645384Z 'range_flattens': [None, True], 2026-02-21T08:22:13.5645559Z 'range_multi_buffers': [None, None], 2026-02-21T08:22:13.5645750Z 'range_num_stages': [0, 4], 2026-02-21T08:22:13.5645915Z 'range_unroll_factors': [0, 0], 2026-02-21T08:22:13.5646102Z 'range_warp_specializes': [None, None]} 2026-02-21T08:22:13.5660794Z [192s] Fitting surrogate: 498 points, 498 targets 2026-02-21T08:22:14.0723440Z [192s] Generation 6 starting: 29 neighbors, 2 active search path(s) 2026-02-21T08:22:35.5277551Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 0.8 configs/s 2026-02-21T08:22:37.4869309Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 15.6 configs/s 2026-02-21T08:22:38.5578969Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 922.7 2026-02-21T08:22:38.5583053Z configs/s 2026-02-21T08:22:38.6224116Z [217s] Generation 6 complete: 2026-02-21T08:22:38.6228475Z error=1 2026-02-21T08:22:38.6233387Z ok=31 2026-02-21T08:22:38.6234784Z min=0.0789 2026-02-21T08:22:38.6235016Z mid=0.2736 2026-02-21T08:22:38.6235153Z max=7.6002 2026-02-21T08:22:38.6235295Z best={'block_sizes': [128, 16, 32], 2026-02-21T08:22:38.6235526Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:22:38.6235731Z 'l2_groupings': [4], 2026-02-21T08:22:38.6235922Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:22:38.6236114Z 'loop_orders': [[0, 1]], 2026-02-21T08:22:38.6236302Z 'num_stages': 8, 2026-02-21T08:22:38.6236444Z 'num_warps': 4, 2026-02-21T08:22:38.6236592Z 'pid_type': 'flat', 2026-02-21T08:22:38.6236747Z 'range_flattens': [None, True], 2026-02-21T08:22:38.6236934Z 'range_multi_buffers': [None, None], 2026-02-21T08:22:38.6237140Z 'range_num_stages': [0, 4], 2026-02-21T08:22:38.6237611Z 'range_unroll_factors': [0, 0], 2026-02-21T08:22:38.6237819Z 'range_warp_specializes': [None, None]} 2026-02-21T08:22:38.6244577Z [217s] Fitting surrogate: 530 points, 530 targets 2026-02-21T08:22:39.2086179Z [217s] 
Generation 7 starting: 29 neighbors, 2 active search path(s) 2026-02-21T08:22:42.6759291Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 6.1 configs/s 2026-02-21T08:22:44.3831980Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 18.1 configs/s 2026-02-21T08:22:45.4207501Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 952.2 2026-02-21T08:22:45.4210247Z configs/s 2026-02-21T08:22:45.4827893Z [224s] Generation 7 complete: 2026-02-21T08:22:45.4832269Z error=1 2026-02-21T08:22:45.4833570Z ok=31 2026-02-21T08:22:45.4833728Z min=0.0768 2026-02-21T08:22:45.4833871Z mid=0.2735 2026-02-21T08:22:45.4834060Z max=1.3867 2026-02-21T08:22:45.4834530Z best={'block_sizes': [128, 16, 32], 2026-02-21T08:22:45.4834809Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:22:45.4835013Z 'l2_groupings': [4], 2026-02-21T08:22:45.4835192Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:22:45.4835408Z 'loop_orders': [[0, 1]], 2026-02-21T08:22:45.4835572Z 'num_stages': 8, 2026-02-21T08:22:45.4835714Z 'num_warps': 4, 2026-02-21T08:22:45.4835863Z 'pid_type': 'flat', 2026-02-21T08:22:45.4836019Z 'range_flattens': [None, True], 2026-02-21T08:22:45.4836202Z 'range_multi_buffers': [None, None], 2026-02-21T08:22:45.4836385Z 'range_num_stages': [0, 4], 2026-02-21T08:22:45.4836560Z 'range_unroll_factors': [0, 0], 2026-02-21T08:22:45.4836741Z 'range_warp_specializes': [None, None]} 2026-02-21T08:22:45.4847867Z [224s] Fitting surrogate: 562 points, 562 targets 2026-02-21T08:22:45.9450158Z [224s] Generation 8 starting: 17 neighbors, 1 active search path(s) 2026-02-21T08:23:16.8388443Z [255s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 8], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], 
range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:23:16.8407095Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 0.4 configs/s 2026-02-21T08:23:17.9796839Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 18/18 16.4 configs/s 2026-02-21T08:23:17.9801359Z [256s] Generation 8 complete: 2026-02-21T08:23:17.9803129Z timeout=1 2026-02-21T08:23:17.9803323Z ok=18 2026-02-21T08:23:17.9803469Z min=0.0768 2026-02-21T08:23:17.9803607Z mid=0.5376 2026-02-21T08:23:17.9803827Z max=5.6288 2026-02-21T08:23:17.9809405Z best={'block_sizes': [128, 16, 32], 2026-02-21T08:23:17.9813989Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:23:17.9815322Z 'l2_groupings': [4], 2026-02-21T08:23:17.9815628Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:23:17.9815862Z 'loop_orders': [[0, 1]], 2026-02-21T08:23:17.9816045Z 'num_stages': 8, 2026-02-21T08:23:17.9816205Z 'num_warps': 4, 2026-02-21T08:23:17.9816384Z 'pid_type': 'flat', 2026-02-21T08:23:17.9816552Z 'range_flattens': [None, True], 2026-02-21T08:23:17.9816777Z 'range_multi_buffers': [None, None], 2026-02-21T08:23:17.9816972Z 'range_num_stages': [0, 4], 2026-02-21T08:23:17.9817174Z 'range_unroll_factors': [0, 0], 2026-02-21T08:23:17.9817367Z 'range_warp_specializes': [None, None]} 2026-02-21T08:23:17.9818806Z [256s] Fitting surrogate: 581 points, 581 targets 2026-02-21T08:23:18.2692296Z [256s] Autotuning complete in 256.9s after searching 556 configs. 
2026-02-21T08:23:18.2695711Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:23:18.2698333Z @helion.kernel(config=helion.Config(block_sizes=[128, 16, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T08:23:18.2699591Z 2026-02-21T08:23:18.2699861Z [256s] Code of selected kernel: /tmp/torchinductor_root/gq/cgqzlclg3urloebhabbptqrqlgbjuxisqf2nhoezmm4xzbzxn3nh.py 2026-02-21T08:23:19.4228220Z WARNING:tritonbench.utils.triton_op:Completed input ID 10: 2026-02-21T08:23:19.4229894Z x_val 2026-02-21T08:23:19.4230076Z ------------------- 2026-02-21T08:23:19.4230266Z (16, 1, 7168, 8192) 2026-02-21T08:23:19.4230375Z 2026-02-21T08:23:19.4243178Z 40%|████ | 4/10 [14:19<23:44, 237.41s/it]WARNING:tritonbench.utils.triton_op:Running input ID 14: 2026-02-21T08:23:19.4243542Z x_val 2026-02-21T08:23:19.4243700Z ------------------- 2026-02-21T08:23:19.4243870Z (64, 1, 7168, 8192) 2026-02-21T08:23:19.4247518Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:23:20.4742461Z INFO:tritonbench.utils.triton_op:Took 2.88ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:23:21.2377411Z Autotune Choices Stats: 2026-02-21T08:23:21.2378468Z {"num_choices": 18, "num_triton_choices": 17, "best_kernel": "mm", "best_time": 0.029600000008940697, "best_triton_pos": 1, "best_triton_time": 0.036768000572919846, "best_triton_kernel": "triton_mm_42", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4"} 2026-02-21T08:23:21.2386253Z AUTOTUNE mm(64x8192, 
8192x7168) 2026-02-21T08:23:21.2386499Z strides: [8192, 1], [7168, 1] 2026-02-21T08:23:21.2386743Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:23:21.2386980Z mm 0.0296 ms 100.0% 2026-02-21T08:23:21.2387432Z triton_mm_42 0.0368 ms 80.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4 2026-02-21T08:23:21.2388204Z triton_mm_46 0.0428 ms 69.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4 2026-02-21T08:23:21.2388935Z triton_mm_50 0.0472 ms 62.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2026-02-21T08:23:21.2389624Z triton_mm_38 0.0543 ms 54.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4 2026-02-21T08:23:21.2390295Z triton_mm_41 0.0715 ms 41.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2026-02-21T08:23:21.2390967Z triton_mm_45 0.0718 ms 41.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:23:21.2391734Z triton_mm_37 0.0736 ms 40.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2026-02-21T08:23:21.2392430Z triton_mm_36 0.0766 ms 38.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2026-02-21T08:23:21.2393101Z triton_mm_48 0.0840 ms 35.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2026-02-21T08:23:21.2393667Z SingleProcess AUTOTUNE benchmarking takes 0.3600 seconds and 0.2466 seconds precompiling for 18 choices 2026-02-21T08:23:22.5075341Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:23:23.8020720Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:23:23.8022566Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:23:23.8022804Z 'dtype': 'torch.bfloat16', 2026-02-21T08:23:23.8023004Z 'shape': (64, 1, 8192), 2026-02-21T08:23:23.8023197Z 'stride': (8192, 8192, 1)}, 2026-02-21T08:23:23.8023380Z { 'device': 'cuda:0', 2026-02-21T08:23:23.8023559Z 'dtype': 'torch.int32', 2026-02-21T08:23:23.8023733Z 'shape': (8192, 7168), 2026-02-21T08:23:23.8023909Z 'stride': (7168, 1)}), 2026-02-21T08:23:23.8024078Z 'kwargs': {}} 2026-02-21T08:23:23.8049101Z INFO:tritonbench.utils.triton_op:Took 3.04ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:23:24.0575577Z [0s] Autotune random seed: 2134813318 2026-02-21T08:23:24.1827013Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:23:42.5276300Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.9 configs/s 2026-02-21T08:23:43.3375848Z [19s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 256], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=32, num_stages=8, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:23:43.3381053Z Tensor-likes are not close! 
2026-02-21T08:23:43.3385925Z 2026-02-21T08:23:43.3388129Z Mismatched elements: 458394 / 458752 (99.9%) 2026-02-21T08:23:43.3388432Z Greatest absolute difference: 7104.0 at index (23, 169) (up to 0.01 allowed) 2026-02-21T08:23:43.3388825Z Greatest relative difference: 626688.0 at index (27, 1188) (up to 0.01 allowed) 2026-02-21T08:23:43.3389159Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:23:43.3390288Z 2026-02-21T08:23:43.6275773Z [19s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 32, 16], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=7, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[3, 1], range_warp_specializes=[None, False]) 2026-02-21T08:23:43.6277324Z Tensor-likes are not close! 2026-02-21T08:23:43.6277460Z 2026-02-21T08:23:43.6277596Z Mismatched elements: 458384 / 458752 (99.9%) 2026-02-21T08:23:43.6282666Z Greatest absolute difference: 7136.0 at index (29, 5263) (up to 0.01 allowed) 2026-02-21T08:23:43.6286468Z Greatest relative difference: 1032192.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T08:23:43.6290296Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:23:43.6294429Z 2026-02-21T08:23:47.1077621Z [22s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 16, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:23:47.1079166Z Tensor-likes are not close! 
2026-02-21T08:23:47.1079301Z 2026-02-21T08:23:47.1079386Z Mismatched elements: 457790 / 458752 (99.8%) 2026-02-21T08:23:47.1079666Z Greatest absolute difference: 5696.0 at index (42, 5979) (up to 0.01 allowed) 2026-02-21T08:23:47.1080054Z Greatest relative difference: 929792.0 at index (41, 3764) (up to 0.01 allowed) 2026-02-21T08:23:47.1080387Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:23:47.1080555Z 2026-02-21T08:23:52.3779371Z [28s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 32, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[2, 3], range_unroll_factors=[1, 0], range_warp_specializes=[False, True]) 2026-02-21T08:23:52.3780969Z Tensor-likes are not close! 2026-02-21T08:23:52.3781086Z 2026-02-21T08:23:52.3781266Z Mismatched elements: 458225 / 458752 (99.9%) 2026-02-21T08:23:52.3781796Z Greatest absolute difference: 7616.0 at index (55, 6163) (up to 0.01 allowed) 2026-02-21T08:23:52.3782187Z Greatest relative difference: 473088.0 at index (45, 4711) (up to 0.01 allowed) 2026-02-21T08:23:52.3782601Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:23:52.3782784Z 2026-02-21T08:23:55.0865338Z [30s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 1024], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]) 2026-02-21T08:23:55.0866744Z Tensor-likes are not close! 
2026-02-21T08:23:55.0866892Z 2026-02-21T08:23:55.0867038Z Mismatched elements: 458068 / 458752 (99.9%) 2026-02-21T08:23:55.0867365Z Greatest absolute difference: 7136.0 at index (29, 5399) (up to 0.01 allowed) 2026-02-21T08:23:55.0867819Z Greatest relative difference: 2867200.0 at index (41, 3764) (up to 0.01 allowed) 2026-02-21T08:23:55.0868186Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:23:55.0868406Z 2026-02-21T08:23:56.2355274Z [32s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 64, 16], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[8], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[1, 3], range_warp_specializes=[None, None]) 2026-02-21T08:23:56.2356669Z Tensor-likes are not close! 2026-02-21T08:23:56.2356821Z 2026-02-21T08:23:56.2356982Z Mismatched elements: 457739 / 458752 (99.8%) 2026-02-21T08:23:56.2357447Z Greatest absolute difference: 5792.0 at index (54, 3514) (up to 0.01 allowed) 2026-02-21T08:23:56.2357904Z Greatest relative difference: 716800.0 at index (41, 3764) (up to 0.01 allowed) 2026-02-21T08:23:56.2358289Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:23:56.2358536Z 2026-02-21T08:23:56.9897020Z [32s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 128], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=128, num_stages=4, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[1, 2], range_unroll_factors=[2, 2], range_warp_specializes=[None, None]) 2026-02-21T08:23:56.9898278Z Tensor-likes are not close! 
2026-02-21T08:23:56.9898564Z 2026-02-21T08:23:56.9898699Z Mismatched elements: 458750 / 458752 (100.0%) 2026-02-21T08:23:56.9899163Z Greatest absolute difference: 5696.0 at index (29, 5255) (up to 0.01 allowed) 2026-02-21T08:23:56.9899641Z Greatest relative difference: 3.078125 at index (19, 992) (up to 0.01 allowed) 2026-02-21T08:23:56.9900235Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:23:56.9900476Z 2026-02-21T08:23:57.9066128Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 6.5 configs/s 2026-02-21T08:23:57.9076815Z [33s] Adaptive compile timeout: 30s (90% percentile=3.4s, bounds=[30.0s, 30s]) 2026-02-21T08:23:58.3004353Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━ 583/583 1236.2 configs/s 2026-02-21T08:23:58.3870468Z [34s] Initial random population of 100, 5 starting points: 2026-02-21T08:23:58.3874857Z error=11 2026-02-21T08:23:58.3876412Z ok=89 2026-02-21T08:23:58.3876690Z min=0.3308 2026-02-21T08:23:58.3876964Z mid=5.0535 2026-02-21T08:23:58.3877206Z max=374.9427 2026-02-21T08:23:58.3877415Z best={'block_sizes': [16, 64, 32], 2026-02-21T08:23:58.3877857Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:23:58.3878130Z 'l2_groupings': [1], 2026-02-21T08:23:58.3878386Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:23:58.3878688Z 'loop_orders': [[0, 1]], 2026-02-21T08:23:58.3878906Z 'maxnreg': 32, 2026-02-21T08:23:58.3879124Z 'num_sm_multiplier': 8, 2026-02-21T08:23:58.3879316Z 'num_stages': 6, 2026-02-21T08:23:58.3879572Z 'num_warps': 16, 2026-02-21T08:23:58.3879770Z 'pid_type': 'persistent_blocked', 2026-02-21T08:23:58.3880235Z 'range_flattens': [True, None], 2026-02-21T08:23:58.3880510Z 'range_multi_buffers': [None, True], 2026-02-21T08:23:58.3880800Z 'range_num_stages': [0, 0], 2026-02-21T08:23:58.3881047Z 'range_unroll_factors': [0, 1], 2026-02-21T08:23:58.3881325Z 'range_warp_specializes': [True, None]} 2026-02-21T08:23:58.3886652Z [34s] Fitting surrogate: 100 points, 
100 targets 2026-02-21T08:23:59.7095213Z [35s] Generation 1 starting: 100 neighbors, 5 active search path(s) 2026-02-21T08:24:18.3249756Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 104/104 1.1 configs/s 2026-02-21T08:24:18.4501644Z [54s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=8, num_stages=6, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]) 2026-02-21T08:24:18.4502992Z Tensor-likes are not close! 2026-02-21T08:24:18.4503220Z 2026-02-21T08:24:18.4503490Z Mismatched elements: 455802 / 458752 (99.4%) 2026-02-21T08:24:18.4503846Z Greatest absolute difference: 1792.0 at index (19, 6877) (up to 0.01 allowed) 2026-02-21T08:24:18.4507070Z Greatest relative difference: 380928.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T08:24:18.4507632Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:24:18.4507874Z 2026-02-21T08:24:19.0296580Z [54s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:24:19.0296934Z 2026-02-21T08:24:19.0299484Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=8, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:24:19.0300818Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T08:24:19.0301262Z 2026-02-21T08:24:19.0305856Z `ptxas` stderr: 2026-02-21T08:24:19.0306163Z ================================================================ 2026-02-21T08:24:19.0307147Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 67 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:24:19.0307907Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:24:19.0308218Z Internal Triton PTX codegen error 2026-02-21T08:24:19.0308671Z `ptxas` stderr: 2026-02-21T08:24:19.0309235Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 67 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T08:24:19.0310069Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:24:19.0310392Z 2026-02-21T08:24:19.0310895Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpx_4i7n7a.ptx -o /tmp/tmpx_4i7n7a.ptx.o 2026-02-21T08:24:19.0311445Z 2026-02-21T08:24:19.0311474Z 2026-02-21T08:24:19.0311650Z // 2026-02-21T08:24:19.0311877Z // Generated by LLVM NVPTX Back-End 2026-02-21T08:24:19.0312132Z // 2026-02-21T08:24:19.0312291Z 2026-02-21T08:24:19.0312430Z .version 8.7 2026-02-21T08:24:19.0312622Z .target sm_100a 2026-02-21T08:24:19.0312867Z .address_size 64 2026-02-21T08:24:19.0312999Z 2026-02-21T08:24:19.0313221Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T08:24:19.0313603Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T08:24:19.0313899Z // @_helion_matmul_bf16_int4 2026-02-21T08:24:19.0314333Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T08:24:19.0314668Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T08:24:19.0315009Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T08:24:19.0315391Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T08:24:19.0315750Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T08:24:19.0316104Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T08:24:19.0316398Z ) 2026-02-21T08:24:19.0316608Z .reqntid 512 2026-02-21T08:24:19.0316802Z .maxnreg 32 2026-02-21T08:24:19.0317008Z { 2026-02-21T08:24:19.0317222Z .reg .pred %p<71>; 2026-02-21T08:24:19.0317423Z .reg .b16 %rs<121>; 2026-02-21T08:24:19.0317662Z .reg .b32 %r<494>; 2026-02-21T08:24:19.0317855Z .reg .b64 %rd<99>; 2026-02-21T08:24:19.0318072Z $L__func_begin0: 2026-02-21T08:24:19.0318163Z 2026-02-21T08:24:19.0318257Z // %bb.0: 2026-02-21T08:24:19.0318596Z .loc 1 14 0 // 
cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:14 2026-02-21T08:24:19.0318965Z mov.u32 %r1, %tid.x; 2026-02-21T08:24:19.0319164Z shr.u32 %r2, %r1, 5; 2026-02-21T08:24:19.0319427Z shfl.sync.idx.b32 %r3, %r2, 0, 31, -1; 2026-02-21T08:24:19.0319660Z setp.lt.u32 %p9, %r3, 8; 2026-02-21T08:24:19.0319900Z @%p9 bra $L__BB0_16; 2026-02-21T08:24:19.0320115Z bra.uni $L__BB0_1; 2026-02-21T08:24:19.0320330Z $L__BB0_16: 2026-02-21T08:24:19.0320516Z setmaxnreg.inc.sync.aligned.u32 40; 2026-02-21T08:24:19.0320827Z setp.lt.u32 %p41, %r1, 32; 2026-02-21T08:24:19.0321073Z mov.b32 %r321, global_smem; 2026-02-21T08:24:19.0321265Z // begin inline asm 2026-02-21T08:24:19.0321679Z @%p41 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r321], 256; 2026-02-21T08:24:19.0321984Z // end inline asm 2026-02-21T08:24:19.0322192Z bar.sync 0, 256; 2026-02-21T08:24:19.0322445Z ld.shared.b32 %r450, [global_smem]; 2026-02-21T08:24:19.0322703Z bar.sync 0, 256; 2026-02-21T08:24:19.0322896Z // begin inline asm 2026-02-21T08:24:19.0323210Z @%p41 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T08:24:19.0323514Z // end inline asm 2026-02-21T08:24:19.0323819Z .loc 1 20 30 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:20:30 2026-02-21T08:24:19.0324252Z mov.u32 %r94, %ctaid.x; 2026-02-21T08:24:19.0324568Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0324934Z max.u32 %r347, %r94, 111; 2026-02-21T08:24:19.0325379Z 2026-02-21T08:24:19.0325812Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpx_4i7n7a.ptx -o /tmp/tmpx_4i7n7a.ptx.o 2026-02-21T08:24:19.0326266Z 2026-02-21T08:24:19.0326402Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:24:19.0326753Z shl.b32 %r348, %r347, 8; 2026-02-21T08:24:19.0327001Z sub.s32 %r100, 28672, %r348; 2026-02-21T08:24:19.0327304Z $L__tmp0: 2026-02-21T08:24:19.0327709Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0328114Z shfl.sync.idx.b32 %r349, %r2, 0, 31, -1; 2026-02-21T08:24:19.0328367Z shl.b32 %r350, %r349, 21; 2026-02-21T08:24:19.0328610Z and.b32 %r351, %r350, 6291456; 2026-02-21T08:24:19.0328863Z add.s32 %r352, %r351, %r450; 2026-02-21T08:24:19.0329073Z shl.b32 %r353, %r349, 3; 2026-02-21T08:24:19.0329323Z and.b32 %r354, %r353, 32; 2026-02-21T08:24:19.0329562Z add.s32 %r322, %r352, %r354; 2026-02-21T08:24:19.0329769Z mov.pred %p43, -1; 2026-02-21T08:24:19.0330002Z mov.b32 %r492, 0; 2026-02-21T08:24:19.0330197Z // begin inline asm 2026-02-21T08:24:19.0330658Z @%p43 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r322 + 0], 16, {%r492, %r492, %r492, %r492, %r492, %r492, %r492, %r492, %r492, %r492, %r492, %r492, %r492, %r492, %r492, %r492}; 2026-02-21T08:24:19.0331230Z // end inline asm 2026-02-21T08:24:19.0331422Z // begin inline asm 2026-02-21T08:24:19.0331697Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:24:19.0331916Z // end inline asm 2026-02-21T08:24:19.0332154Z bar.sync 0, 256; 2026-02-21T08:24:19.0332336Z $L__tmp1: 2026-02-21T08:24:19.0332665Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0333033Z setp.eq.b32 %p59, %r1, 0; 2026-02-21T08:24:19.0333279Z add.s32 %r339, %r321, 36864; 2026-02-21T08:24:19.0333525Z // begin inline asm 2026-02-21T08:24:19.0333762Z @%p59 mbarrier.init.shared::cta.b64 [%r339], 1; 2026-02-21T08:24:19.0334031Z // end inline asm 2026-02-21T08:24:19.0334205Z bar.sync 0, 256; 2026-02-21T08:24:19.0334466Z add.s32 %r340, %r321, 36872; 2026-02-21T08:24:19.0334681Z // begin inline asm 2026-02-21T08:24:19.0334937Z @%p59 mbarrier.init.shared::cta.b64 [%r340], 1; 2026-02-21T08:24:19.0335217Z // end inline 
asm 2026-02-21T08:24:19.0335447Z add.s32 %r341, %r321, 36880; 2026-02-21T08:24:19.0335689Z // begin inline asm 2026-02-21T08:24:19.0335940Z @%p59 mbarrier.init.shared::cta.b64 [%r341], 1; 2026-02-21T08:24:19.0336218Z // end inline asm 2026-02-21T08:24:19.0336421Z bar.sync 0, 256; 2026-02-21T08:24:19.0336664Z add.s32 %r342, %r321, 36888; 2026-02-21T08:24:19.0336873Z // begin inline asm 2026-02-21T08:24:19.0337137Z @%p59 mbarrier.init.shared::cta.b64 [%r342], 1; 2026-02-21T08:24:19.0337430Z // end inline asm 2026-02-21T08:24:19.0337616Z $L__tmp2: 2026-02-21T08:24:19.0338010Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0338410Z bar.sync 0, 256; 2026-02-21T08:24:19.0338642Z // begin inline asm 2026-02-21T08:24:19.0338886Z @%p59 mbarrier.arrive.shared::cta.b64 _, [%r341]; 2026-02-21T08:24:19.0339177Z // end inline asm 2026-02-21T08:24:19.0339380Z bar.sync 0, 256; 2026-02-21T08:24:19.0339612Z // begin inline asm 2026-02-21T08:24:19.0339877Z @%p59 mbarrier.arrive.shared::cta.b64 _, [%r342]; 2026-02-21T08:24:19.0340142Z // end inline asm 2026-02-21T08:24:19.0340371Z $L__tmp3: 2026-02-21T08:24:19.0340659Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0341103Z st.shared.v2.b32 [global_smem+36896], {0, 50397698}; 2026-02-21T08:24:19.0341386Z st.shared.b32 [global_smem], %r450; 2026-02-21T08:24:19.0341694Z st.shared.b32 [global_smem+8], %r450; 2026-02-21T08:24:19.0342007Z barrier.sync 1; 2026-02-21T08:24:19.0342204Z barrier.sync 1; 2026-02-21T08:24:19.0342415Z setp.lt.s32 %p50, %r100, 1; 2026-02-21T08:24:19.0342661Z mov.b32 %r493, %r492; 2026-02-21T08:24:19.0342905Z @%p50 bra $L__BB0_21; 2026-02-21T08:24:19.0343126Z // %bb.17: // %.lr.ph72 2026-02-21T08:24:19.0343531Z .loc 1 0 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:0:74 2026-02-21T08:24:19.0343936Z ld.param.b64 %rd7, [_helion_matmul_bf16_int4_param_2]; 
2026-02-21T08:24:19.0344268Z and.b32 %r95, %r1, 7; 2026-02-21T08:24:19.0344527Z shl.b32 %r96, %r95, 3; 2026-02-21T08:24:19.0344729Z and.b32 %r97, %r1, 248; 2026-02-21T08:24:19.0344969Z bfe.u32 %r346, %r1, 3, 5; 2026-02-21T08:24:19.0345182Z mul.lo.s32 %r98, %r346, 7168; 2026-02-21T08:24:19.0345442Z add.s32 %r99, %r98, 229376; 2026-02-21T08:24:19.0345764Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0346140Z add.s32 %r489, %r94, -1; 2026-02-21T08:24:19.0346387Z shl.b32 %r357, %r95, 10; 2026-02-21T08:24:19.0346588Z shl.b32 %r358, %r1, 4; 2026-02-21T08:24:19.0346834Z and.b32 %r359, %r358, 240; 2026-02-21T08:24:19.0347052Z shl.b32 %r360, %r1, 3; 2026-02-21T08:24:19.0347286Z and.b32 %r361, %r360, 768; 2026-02-21T08:24:19.0347484Z shl.b32 %r362, %r1, 1; 2026-02-21T08:24:19.0347755Z and.b32 %r363, %r362, 32; 2026-02-21T08:24:19.0347958Z shr.u32 %r364, %r1, 1; 2026-02-21T08:24:19.0348274Z and.b32 %r365, %r364, 64; 2026-02-21T08:24:19.0348553Z or.b32 %r366, %r359, %r361; 2026-02-21T08:24:19.0348773Z or.b32 %r367, %r363, %r365; 2026-02-21T08:24:19.0349005Z xor.b32 %r368, %r366, %r367; 2026-02-21T08:24:19.0349243Z or.b32 %r369, %r368, %r357; 2026-02-21T08:24:19.0349489Z add.s32 %r371, %r321, 28672; 2026-02-21T08:24:19.0349693Z add.s32 %r103, %r371, %r369; 2026-02-21T08:24:19.0349944Z xor.b32 %r372, %r369, 16; 2026-02-21T08:24:19.0350183Z add.s32 %r104, %r371, %r372; 2026-02-21T08:24:19.0350385Z shl.b32 %r373, %r1, 7; 2026-02-21T08:24:19.0350628Z and.b32 %r374, %r373, 7168; 2026-02-21T08:24:19.0350840Z shl.b32 %r375, %r95, 4; 2026-02-21T08:24:19.0351108Z shl.b32 %r376, %r97, 1; 2026-02-21T08:24:19.0351344Z or.b32 %r377, %r374, %r375; 2026-02-21T08:24:19.0351626Z xor.b32 %r378, %r377, %r376; 2026-02-21T08:24:19.0351832Z add.s32 %r105, %r371, %r378; 2026-02-21T08:24:19.0352081Z mov.b32 %r490, -1; 2026-02-21T08:24:19.0352316Z mov.b32 %r493, 0; 2026-02-21T08:24:19.0352515Z mov.b32 %r492, %r493; 2026-02-21T08:24:19.0352768Z 
mov.b32 %r488, %r493; 2026-02-21T08:24:19.0352960Z mov.b32 %r491, %r493; 2026-02-21T08:24:19.0353187Z bra.uni $L__BB0_18; 2026-02-21T08:24:19.0353436Z $L__BB0_20: // in Loop: Header=BB0_18 Depth=1 2026-02-21T08:24:19.0353866Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0354217Z selp.b32 %r488, %r381, %r488, %p52; 2026-02-21T08:24:19.0354479Z selp.b32 %r489, %r380, %r489, %p52; 2026-02-21T08:24:19.0354760Z setp.eq.b32 %p55, %r490, 255; 2026-02-21T08:24:19.0354967Z $L__tmp4: 2026-02-21T08:24:19.0355346Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0355742Z add.s32 %r441, %r492, 1; 2026-02-21T08:24:19.0355983Z setp.eq.b32 %p56, %r441, 2; 2026-02-21T08:24:19.0356181Z selp.b32 %r442, 0, %r441, %p56; 2026-02-21T08:24:19.0356478Z selp.b32 %r492, %r442, %r492, %p55; 2026-02-21T08:24:19.0356740Z and.pred %p57, %p55, %p56; 2026-02-21T08:24:19.0356950Z selp.b32 %r443, 1, 0, %p57; 2026-02-21T08:24:19.0357218Z xor.b32 %r493, %r493, %r443; 2026-02-21T08:24:19.0357421Z $L__tmp5: 2026-02-21T08:24:19.0357735Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0358100Z add.s32 %r491, %r491, 1; 2026-02-21T08:24:19.0358338Z setp.lt.s32 %p58, %r491, %r100; 2026-02-21T08:24:19.0358581Z @%p58 bra $L__BB0_18; 2026-02-21T08:24:19.0358813Z bra.uni $L__BB0_21; 2026-02-21T08:24:19.0359094Z $L__BB0_18: // =>This Inner Loop Header: Depth=1 2026-02-21T08:24:19.0359372Z add.s32 %r379, %r490, 1; 2026-02-21T08:24:19.0359624Z setp.eq.b32 %p51, %r490, 255; 2026-02-21T08:24:19.0359845Z selp.b32 %r490, 0, %r379, %p51; 2026-02-21T08:24:19.0360103Z setp.eq.b32 %p52, %r490, 0; 2026-02-21T08:24:19.0360316Z add.s32 %r380, %r489, 1; 2026-02-21T08:24:19.0360561Z shl.b32 %r381, %r380, 6; 2026-02-21T08:24:19.0360894Z setp.ne.b32 %p53, %r490, 255; 2026-02-21T08:24:19.0361107Z @%p53 bra $L__BB0_20; 2026-02-21T08:24:19.0361405Z // %bb.19: 
// in Loop: Header=BB0_18 Depth=1 2026-02-21T08:24:19.0361700Z shl.b32 %r410, %r492, 3; 2026-02-21T08:24:19.0361939Z add.s32 %r412, %r321, %r410; 2026-02-21T08:24:19.0362164Z add.s32 %r382, %r412, 36864; 2026-02-21T08:24:19.0362394Z $L__tmp6: 2026-02-21T08:24:19.0362770Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0363175Z shl.b32 %r413, %r492, 6; 2026-02-21T08:24:19.0363406Z bar.sync 0, 256; 2026-02-21T08:24:19.0363581Z // begin inline asm 2026-02-21T08:24:19.0363843Z 2026-02-21T08:24:19.0364010Z { 2026-02-21T08:24:19.0364206Z .reg .pred complete; 2026-02-21T08:24:19.0364443Z waitLoop: 2026-02-21T08:24:19.0364779Z mbarrier.try_wait.parity.shared.b64 complete, [%r382], %r493; 2026-02-21T08:24:19.0365087Z @!complete bra.uni waitLoop; 2026-02-21T08:24:19.0365335Z } 2026-02-21T08:24:19.0365426Z 2026-02-21T08:24:19.0365539Z // end inline asm 2026-02-21T08:24:19.0365725Z $L__tmp7: 2026-02-21T08:24:19.0366075Z .loc 1 30 32 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:30:32 2026-02-21T08:24:19.0366418Z or.b32 %r414, %r488, %r96; 2026-02-21T08:24:19.0366774Z .loc 1 85 50 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:85:50 2026-02-21T08:24:19.0367149Z add.s32 %r415, %r414, %r98; 2026-02-21T08:24:19.0367358Z add.s32 %r416, %r99, %r414; 2026-02-21T08:24:19.0367712Z .loc 1 85 22 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:85:22 2026-02-21T08:24:19.0368053Z mad.wide.s32 %rd65, %r415, 2, %rd7; 2026-02-21T08:24:19.0368323Z mad.wide.s32 %rd66, %r416, 2, %rd7; 2026-02-21T08:24:19.0368549Z $L__tmp8: 2026-02-21T08:24:19.0368923Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0369343Z add.s32 %r400, %r322, %r413; 2026-02-21T08:24:19.0369557Z // begin inline asm 2026-02-21T08:24:19.0370010Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r384, %r385, %r386, %r387, %r388, %r389, %r390, %r391, 
%r392, %r393, %r394, %r395, %r396, %r397, %r398, %r399}, [%r400 + 0], 16; 2026-02-21T08:24:19.0370457Z // end inline asm 2026-02-21T08:24:19.0370685Z // begin inline asm 2026-02-21T08:24:19.0370878Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:24:19.0371151Z // end inline asm 2026-02-21T08:24:19.0371381Z cvt.u64.u32 %rd67, %r384; 2026-02-21T08:24:19.0371607Z cvt.u64.u32 %rd68, %r385; 2026-02-21T08:24:19.0371869Z shl.b64 %rd69, %rd68, 32; 2026-02-21T08:24:19.0372085Z or.b64 %rd70, %rd67, %rd69; 2026-02-21T08:24:19.0372304Z $L__tmp9: 2026-02-21T08:24:19.0372620Z .loc 1 84 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:84:28 2026-02-21T08:24:19.0373002Z mov.b64 {%r417, %r418}, %rd70; 2026-02-21T08:24:19.0373252Z cvt.rn.bf16x2.f32 %r419, %r418, %r417; 2026-02-21T08:24:19.0373500Z $L__tmp10: 2026-02-21T08:24:19.0373876Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0374248Z cvt.u64.u32 %rd71, %r386; 2026-02-21T08:24:19.0374503Z cvt.u64.u32 %rd72, %r387; 2026-02-21T08:24:19.0374713Z shl.b64 %rd73, %rd72, 32; 2026-02-21T08:24:19.0374951Z or.b64 %rd74, %rd71, %rd73; 2026-02-21T08:24:19.0375217Z $L__tmp11: 2026-02-21T08:24:19.0375505Z .loc 1 84 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:84:28 2026-02-21T08:24:19.0375872Z mov.b64 {%r420, %r421}, %rd74; 2026-02-21T08:24:19.0376106Z cvt.rn.bf16x2.f32 %r422, %r421, %r420; 2026-02-21T08:24:19.0376381Z $L__tmp12: 2026-02-21T08:24:19.0376713Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0377195Z cvt.u64.u32 %rd75, %r388; 2026-02-21T08:24:19.0377446Z cvt.u64.u32 %rd76, %r389; 2026-02-21T08:24:19.0377650Z shl.b64 %rd77, %rd76, 32; 2026-02-21T08:24:19.0377900Z or.b64 %rd78, %rd75, %rd77; 2026-02-21T08:24:19.0378112Z $L__tmp13: 2026-02-21T08:24:19.0378429Z .loc 1 84 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:84:28 
2026-02-21T08:24:19.0378852Z mov.b64 {%r423, %r424}, %rd78; 2026-02-21T08:24:19.0379139Z cvt.rn.bf16x2.f32 %r425, %r424, %r423; 2026-02-21T08:24:19.0379410Z $L__tmp14: 2026-02-21T08:24:19.0379781Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0380196Z cvt.u64.u32 %rd79, %r390; 2026-02-21T08:24:19.0380405Z cvt.u64.u32 %rd80, %r391; 2026-02-21T08:24:19.0380640Z shl.b64 %rd81, %rd80, 32; 2026-02-21T08:24:19.0380852Z or.b64 %rd82, %rd79, %rd81; 2026-02-21T08:24:19.0381093Z $L__tmp15: 2026-02-21T08:24:19.0381424Z .loc 1 84 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:84:28 2026-02-21T08:24:19.0381871Z mov.b64 {%r426, %r427}, %rd82; 2026-02-21T08:24:19.0382139Z cvt.rn.bf16x2.f32 %r428, %r427, %r426; 2026-02-21T08:24:19.0382346Z $L__tmp16: 2026-02-21T08:24:19.0382742Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0383135Z cvt.u64.u32 %rd83, %r392; 2026-02-21T08:24:19.0383357Z cvt.u64.u32 %rd84, %r393; 2026-02-21T08:24:19.0383587Z shl.b64 %rd85, %rd84, 32; 2026-02-21T08:24:19.0383822Z or.b64 %rd86, %rd83, %rd85; 2026-02-21T08:24:19.0384036Z $L__tmp17: 2026-02-21T08:24:19.0384346Z .loc 1 84 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:84:28 2026-02-21T08:24:19.0384718Z mov.b64 {%r429, %r430}, %rd86; 2026-02-21T08:24:19.0384939Z cvt.rn.bf16x2.f32 %r431, %r430, %r429; 2026-02-21T08:24:19.0385208Z $L__tmp18: 2026-02-21T08:24:19.0385543Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0385949Z cvt.u64.u32 %rd87, %r394; 2026-02-21T08:24:19.0386207Z cvt.u64.u32 %rd88, %r395; 2026-02-21T08:24:19.0386407Z shl.b64 %rd89, %rd88, 32; 2026-02-21T08:24:19.0386640Z or.b64 %rd90, %rd87, %rd89; 2026-02-21T08:24:19.0386839Z $L__tmp19: 2026-02-21T08:24:19.0387168Z .loc 1 84 28 // 
cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:84:28 2026-02-21T08:24:19.0387494Z mov.b64 {%r432, %r433}, %rd90; 2026-02-21T08:24:19.0387752Z cvt.rn.bf16x2.f32 %r434, %r433, %r432; 2026-02-21T08:24:19.0388013Z $L__tmp20: 2026-02-21T08:24:19.0388345Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0388754Z cvt.u64.u32 %rd91, %r396; 2026-02-21T08:24:19.0388968Z cvt.u64.u32 %rd92, %r397; 2026-02-21T08:24:19.0389207Z shl.b64 %rd93, %rd92, 32; 2026-02-21T08:24:19.0389404Z or.b64 %rd94, %rd91, %rd93; 2026-02-21T08:24:19.0389660Z $L__tmp21: 2026-02-21T08:24:19.0389978Z .loc 1 84 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:84:28 2026-02-21T08:24:19.0390305Z mov.b64 {%r435, %r436}, %rd94; 2026-02-21T08:24:19.0390577Z cvt.rn.bf16x2.f32 %r437, %r436, %r435; 2026-02-21T08:24:19.0390791Z $L__tmp22: 2026-02-21T08:24:19.0391142Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0391569Z cvt.u64.u32 %rd95, %r398; 2026-02-21T08:24:19.0391799Z cvt.u64.u32 %rd96, %r399; 2026-02-21T08:24:19.0392026Z shl.b64 %rd97, %rd96, 32; 2026-02-21T08:24:19.0392256Z or.b64 %rd98, %rd95, %rd97; 2026-02-21T08:24:19.0392488Z $L__tmp23: 2026-02-21T08:24:19.0392780Z .loc 1 84 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:84:28 2026-02-21T08:24:19.0393161Z mov.b64 {%r438, %r439}, %rd98; 2026-02-21T08:24:19.0393468Z cvt.rn.bf16x2.f32 %r440, %r439, %r438; 2026-02-21T08:24:19.0393830Z .loc 1 85 81 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:85:81 2026-02-21T08:24:19.0394244Z st.shared.v4.b32 [%r103], {%r419, %r422, %r425, %r428}; 2026-02-21T08:24:19.0394538Z st.shared.v4.b32 [%r104], {%r431, %r434, %r437, %r440}; 2026-02-21T08:24:19.0394821Z bar.sync 0, 256; 2026-02-21T08:24:19.0395052Z ld.shared.v4.b32 {%r405, %r406, %r407, %r408}, [%r105+512]; 2026-02-21T08:24:19.0395396Z ld.shared.v4.b32 {%r401, 
%r402, %r403, %r404}, [%r105]; 2026-02-21T08:24:19.0395645Z // begin inline asm 2026-02-21T08:24:19.0395911Z st.global.v4.b32 [ %rd65 + 0 ], { %r401, %r402, %r403, %r404 }; 2026-02-21T08:24:19.0396219Z // end inline asm 2026-02-21T08:24:19.0396414Z // begin inline asm 2026-02-21T08:24:19.0396684Z st.global.v4.b32 [ %rd66 + 0 ], { %r405, %r406, %r407, %r408 }; 2026-02-21T08:24:19.0397007Z // end inline asm 2026-02-21T08:24:19.0397336Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0397658Z add.s32 %r409, %r412, 36880; 2026-02-21T08:24:19.0397933Z $L__tmp24: 2026-02-21T08:24:19.0398298Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0398655Z bar.sync 0, 256; 2026-02-21T08:24:19.0398909Z // begin inline asm 2026-02-21T08:24:19.0399138Z @%p59 mbarrier.arrive.shared::cta.b64 _, [%r409]; 2026-02-21T08:24:19.0399407Z // end inline asm 2026-02-21T08:24:19.0399618Z bra.uni $L__BB0_20; 2026-02-21T08:24:19.0399838Z $L__tmp25: 2026-02-21T08:24:19.0400061Z $L__BB0_1: // %.preheader.preheader 2026-02-21T08:24:19.0400393Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:24:19.0400712Z ld.param.b64 %rd6, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T08:24:19.0401016Z ld.param.b64 %rd5, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T08:24:19.0401317Z mov.b32 %r121, global_smem; 2026-02-21T08:24:19.0401531Z add.s32 %r122, %r121, %r3; 2026-02-21T08:24:19.0401815Z mov.u32 %r6, %ctaid.x; 2026-02-21T08:24:19.0402018Z max.u32 %r144, %r6, 111; 2026-02-21T08:24:19.0402263Z shl.b32 %r145, %r144, 8; 2026-02-21T08:24:19.0402503Z bra.uni $L__BB0_2; 2026-02-21T08:24:19.0402739Z $L__BB0_14: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:24:19.0403157Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0403494Z barrier.sync 1; 2026-02-21T08:24:19.0403713Z barrier.sync 1; 2026-02-21T08:24:19.0403938Z $L__BB0_2: // %.preheader 
2026-02-21T08:24:19.0404250Z // =>This Loop Header: Depth=1 2026-02-21T08:24:19.0404559Z // Child Loop BB0_7 Depth 2 2026-02-21T08:24:19.0404927Z .loc 1 14 0 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:14 2026-02-21T08:24:19.0405325Z setmaxnreg.dec.sync.aligned.u32 24; 2026-02-21T08:24:19.0405546Z barrier.sync 1; 2026-02-21T08:24:19.0405799Z ld.shared.b8 %r120, [%r122+36888]; 2026-02-21T08:24:19.0406034Z setp.gt.u32 %p10, %r120, 4; 2026-02-21T08:24:19.0406265Z @%p10 bra $L__BB0_4; 2026-02-21T08:24:19.0406553Z // %bb.3: // %.preheader 2026-02-21T08:24:19.0406820Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:24:19.0407108Z $L_brx_0: .branchtargets 2026-02-21T08:24:19.0407338Z $L__BB0_5, 2026-02-21T08:24:19.0407554Z $L__BB0_13, 2026-02-21T08:24:19.0407731Z $L__BB0_14, 2026-02-21T08:24:19.0407960Z $L__BB0_15, 2026-02-21T08:24:19.0408129Z $L__BB0_22; 2026-02-21T08:24:19.0408337Z brx.idx %r120, $L_brx_0; 2026-02-21T08:24:19.0408629Z $L__BB0_5: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:24:19.0408999Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0409442Z ld.shared.b32 %r4, [global_smem]; 2026-02-21T08:24:19.0409693Z ld.shared.b32 %r5, [global_smem+8]; 2026-02-21T08:24:19.0409957Z barrier.sync 1; 2026-02-21T08:24:19.0410147Z sub.s32 %r8, 28672, %r145; 2026-02-21T08:24:19.0410505Z .loc 1 44 38 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:44:38 2026-02-21T08:24:19.0410878Z shl.b32 %r147, %r1, 3; 2026-02-21T08:24:19.0411081Z and.b32 %r13, %r147, 24; 2026-02-21T08:24:19.0411426Z .loc 1 45 53 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:53 2026-02-21T08:24:19.0411797Z shl.b32 %r148, %r1, 11; 2026-02-21T08:24:19.0412037Z and.b32 %r14, %r148, 253952; 2026-02-21T08:24:19.0412348Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0412735Z setp.lt.s32 %p11, %r8, 1; 2026-02-21T08:24:19.0413047Z setp.gt.s32 %p12, %r8, 0; 
2026-02-21T08:24:19.0413348Z .loc 1 45 60 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:60 2026-02-21T08:24:19.0413734Z or.b32 %r149, %r14, %r13; 2026-02-21T08:24:19.0414050Z .loc 1 45 32 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:32 2026-02-21T08:24:19.0414405Z mad.wide.u32 %rd8, %r149, 2, %rd5; 2026-02-21T08:24:19.0414704Z add.s64 %rd9, %rd8, 524288; 2026-02-21T08:24:19.0415029Z .loc 1 45 80 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:80 2026-02-21T08:24:19.0415383Z shl.b32 %r17, %r1, 4; 2026-02-21T08:24:19.0415613Z and.b32 %r18, %r17, 2032; 2026-02-21T08:24:19.0415856Z add.s32 %r123, %r121, %r18; 2026-02-21T08:24:19.0416073Z selp.b32 %r124, 16, 0, %p12; 2026-02-21T08:24:19.0416327Z // begin inline asm 2026-02-21T08:24:19.0416590Z cp.async.cg.shared.global [ %r123 + 0 ], [ %rd8 + 0 ], 0x10, %r124; 2026-02-21T08:24:19.0416899Z // end inline asm 2026-02-21T08:24:19.0417135Z add.s32 %r125, %r123, 2048; 2026-02-21T08:24:19.0417343Z // begin inline asm 2026-02-21T08:24:19.0417620Z cp.async.cg.shared.global [ %r125 + 0 ], [ %rd9 + 0 ], 0x10, %r124; 2026-02-21T08:24:19.0417892Z // end inline asm 2026-02-21T08:24:19.0418134Z cp.async.commit_group; 2026-02-21T08:24:19.0418455Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0418844Z setp.gt.s32 %p13, %r8, 1; 2026-02-21T08:24:19.0419220Z .loc 1 45 32 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:32 2026-02-21T08:24:19.0419565Z add.s64 %rd10, %rd8, 64; 2026-02-21T08:24:19.0419810Z or.b32 %r150, %r149, 32; 2026-02-21T08:24:19.0420047Z mad.wide.u32 %rd18, %r150, 2, %rd5; 2026-02-21T08:24:19.0420313Z add.s64 %rd11, %rd18, 524288; 2026-02-21T08:24:19.0420631Z .loc 1 45 80 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:80 2026-02-21T08:24:19.0421044Z bar.sync 2, 128; 2026-02-21T08:24:19.0421269Z add.s32 %r127, %r123, 4096; 2026-02-21T08:24:19.0421476Z selp.b32 %r128, 16, 0, %p13; 
2026-02-21T08:24:19.0421782Z // begin inline asm 2026-02-21T08:24:19.0422045Z cp.async.cg.shared.global [ %r127 + 0 ], [ %rd10 + 0 ], 0x10, %r128; 2026-02-21T08:24:19.0422354Z // end inline asm 2026-02-21T08:24:19.0422581Z add.s32 %r129, %r123, 6144; 2026-02-21T08:24:19.0422822Z // begin inline asm 2026-02-21T08:24:19.0423088Z cp.async.cg.shared.global [ %r129 + 0 ], [ %rd11 + 0 ], 0x10, %r128; 2026-02-21T08:24:19.0423418Z // end inline asm 2026-02-21T08:24:19.0423652Z cp.async.commit_group; 2026-02-21T08:24:19.0423985Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0424377Z setp.gt.s32 %p14, %r8, 2; 2026-02-21T08:24:19.0424700Z .loc 1 45 32 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:32 2026-02-21T08:24:19.0425086Z add.s64 %rd12, %rd8, 128; 2026-02-21T08:24:19.0425402Z or.b32 %r151, %r149, 64; 2026-02-21T08:24:19.0425618Z mad.wide.u32 %rd19, %r151, 2, %rd5; 2026-02-21T08:24:19.0425891Z add.s64 %rd13, %rd19, 524288; 2026-02-21T08:24:19.0426224Z .loc 1 45 80 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:80 2026-02-21T08:24:19.0426620Z bar.sync 2, 128; 2026-02-21T08:24:19.0426810Z add.s32 %r131, %r123, 8192; 2026-02-21T08:24:19.0427050Z selp.b32 %r132, 16, 0, %p14; 2026-02-21T08:24:19.0427266Z // begin inline asm 2026-02-21T08:24:19.0427557Z cp.async.cg.shared.global [ %r131 + 0 ], [ %rd12 + 0 ], 0x10, %r132; 2026-02-21T08:24:19.0427860Z // end inline asm 2026-02-21T08:24:19.0428066Z add.s32 %r133, %r123, 10240; 2026-02-21T08:24:19.0428304Z // begin inline asm 2026-02-21T08:24:19.0428538Z cp.async.cg.shared.global [ %r133 + 0 ], [ %rd13 + 0 ], 0x10, %r132; 2026-02-21T08:24:19.0428868Z // end inline asm 2026-02-21T08:24:19.0429117Z cp.async.commit_group; 2026-02-21T08:24:19.0429446Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0429844Z setp.gt.s32 %p15, %r8, 3; 2026-02-21T08:24:19.0430161Z .loc 1 45 32 // 
cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:32 2026-02-21T08:24:19.0430510Z add.s64 %rd14, %rd8, 192; 2026-02-21T08:24:19.0430744Z or.b32 %r152, %r149, 96; 2026-02-21T08:24:19.0430986Z mad.wide.u32 %rd20, %r152, 2, %rd5; 2026-02-21T08:24:19.0431211Z add.s64 %rd15, %rd20, 524288; 2026-02-21T08:24:19.0431612Z .loc 1 45 80 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:80 2026-02-21T08:24:19.0431973Z bar.sync 2, 128; 2026-02-21T08:24:19.0432164Z add.s32 %r135, %r123, 12288; 2026-02-21T08:24:19.0432424Z selp.b32 %r136, 16, 0, %p15; 2026-02-21T08:24:19.0432630Z // begin inline asm 2026-02-21T08:24:19.0432921Z cp.async.cg.shared.global [ %r135 + 0 ], [ %rd14 + 0 ], 0x10, %r136; 2026-02-21T08:24:19.0433197Z // end inline asm 2026-02-21T08:24:19.0433428Z add.s32 %r137, %r123, 14336; 2026-02-21T08:24:19.0433670Z // begin inline asm 2026-02-21T08:24:19.0433940Z cp.async.cg.shared.global [ %r137 + 0 ], [ %rd15 + 0 ], 0x10, %r136; 2026-02-21T08:24:19.0434258Z // end inline asm 2026-02-21T08:24:19.0434458Z cp.async.commit_group; 2026-02-21T08:24:19.0434800Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0435150Z setp.gt.s32 %p16, %r8, 4; 2026-02-21T08:24:19.0435511Z .loc 1 45 32 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:32 2026-02-21T08:24:19.0436087Z add.s64 %rd16, %rd8, 256; 2026-02-21T08:24:19.0510579Z or.b32 %r153, %r149, 128; 2026-02-21T08:24:19.0510799Z mad.wide.u32 %rd21, %r153, 2, %rd5; 2026-02-21T08:24:19.0511012Z add.s64 %rd17, %rd21, 524288; 2026-02-21T08:24:19.0511334Z .loc 1 45 80 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:80 2026-02-21T08:24:19.0511700Z bar.sync 2, 128; 2026-02-21T08:24:19.0511877Z add.s32 %r139, %r123, 16384; 2026-02-21T08:24:19.0512058Z selp.b32 %r140, 16, 0, %p16; 2026-02-21T08:24:19.0512242Z // begin inline asm 2026-02-21T08:24:19.0512467Z cp.async.cg.shared.global [ %r139 + 0 ], [ %rd16 + 0 ], 0x10, %r140; 
2026-02-21T08:24:19.0512723Z // end inline asm 2026-02-21T08:24:19.0512877Z add.s32 %r141, %r123, 18432; 2026-02-21T08:24:19.0513051Z // begin inline asm 2026-02-21T08:24:19.0513264Z cp.async.cg.shared.global [ %r141 + 0 ], [ %rd17 + 0 ], 0x10, %r140; 2026-02-21T08:24:19.0513510Z // end inline asm 2026-02-21T08:24:19.0513666Z cp.async.commit_group; 2026-02-21T08:24:19.0513955Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0514269Z @%p11 bra $L__BB0_12; 2026-02-21T08:24:19.0514447Z // %bb.6: // %.lr.ph 2026-02-21T08:24:19.0514698Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:24:19.0515220Z .loc 1 0 0 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:0 2026-02-21T08:24:19.0515547Z add.s32 %r7, %r145, -28416; 2026-02-21T08:24:19.0515720Z add.s32 %r146, %r1, -256; 2026-02-21T08:24:19.0515891Z shr.u32 %r9, %r146, 5; 2026-02-21T08:24:19.0516067Z shr.u32 %r10, %r1, 4; 2026-02-21T08:24:19.0516224Z bfe.u32 %r11, %r1, 4, 3; 2026-02-21T08:24:19.0516389Z and.b32 %r12, %r1, 15; 2026-02-21T08:24:19.0516545Z or.b32 %r15, %r14, 262144; 2026-02-21T08:24:19.0516717Z and.b32 %r16, %r1, 64; 2026-02-21T08:24:19.0516979Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0517278Z add.s32 %r477, %r6, -1; 2026-02-21T08:24:19.0517433Z sub.s32 %r20, 251, %r7; 2026-02-21T08:24:19.0517589Z shl.b32 %r164, %r12, 6; 2026-02-21T08:24:19.0517745Z shl.b32 %r165, %r1, 5; 2026-02-21T08:24:19.0517894Z and.b32 %r166, %r165, 3072; 2026-02-21T08:24:19.0518137Z shl.b32 %r167, %r1, 1; 2026-02-21T08:24:19.0518291Z and.b32 %r168, %r167, 32; 2026-02-21T08:24:19.0518459Z add.s32 %r170, %r121, %r164; 2026-02-21T08:24:19.0518618Z add.s32 %r171, %r170, %r166; 2026-02-21T08:24:19.0518786Z add.s32 %r21, %r171, %r168; 2026-02-21T08:24:19.0518944Z shl.b32 %r172, %r1, 7; 2026-02-21T08:24:19.0519108Z and.b32 %r173, %r172, 768; 2026-02-21T08:24:19.0519273Z and.b32 %r174, %r167, 60; 2026-02-21T08:24:19.0519443Z 
and.b32 %r175, %r1, 1; 2026-02-21T08:24:19.0519605Z neg.s32 %r176, %r175; 2026-02-21T08:24:19.0519760Z and.b32 %r177, %r176, 192; 2026-02-21T08:24:19.0519932Z and.b32 %r178, %r1, 32; 2026-02-21T08:24:19.0520085Z shr.u32 %r179, %r178, 4; 2026-02-21T08:24:19.0520261Z or.b32 %r180, %r173, %r179; 2026-02-21T08:24:19.0520417Z or.b32 %r181, %r177, %r174; 2026-02-21T08:24:19.0520581Z xor.b32 %r182, %r181, %r16; 2026-02-21T08:24:19.0520734Z or.b32 %r183, %r180, %r182; 2026-02-21T08:24:19.0520901Z add.s32 %r184, %r121, 20480; 2026-02-21T08:24:19.0521061Z add.s32 %r22, %r184, %r183; 2026-02-21T08:24:19.0521227Z xor.b32 %r185, %r183, 4; 2026-02-21T08:24:19.0521391Z add.s32 %r23, %r184, %r185; 2026-02-21T08:24:19.0521573Z xor.b32 %r186, %r183, 8; 2026-02-21T08:24:19.0521740Z add.s32 %r24, %r184, %r186; 2026-02-21T08:24:19.0521897Z xor.b32 %r187, %r183, 12; 2026-02-21T08:24:19.0522065Z add.s32 %r25, %r184, %r187; 2026-02-21T08:24:19.0522220Z shr.u32 %r188, %r1, 1; 2026-02-21T08:24:19.0522380Z and.b32 %r189, %r188, 12; 2026-02-21T08:24:19.0522535Z shl.b32 %r190, %r175, 5; 2026-02-21T08:24:19.0522698Z and.b32 %r191, %r1, 2; 2026-02-21T08:24:19.0522850Z and.b32 %r192, %r17, 64; 2026-02-21T08:24:19.0523008Z bfe.u32 %r193, %r1, 5, 1; 2026-02-21T08:24:19.0523171Z or.b32 %r194, %r190, %r191; 2026-02-21T08:24:19.0523329Z or.b32 %r195, %r192, %r193; 2026-02-21T08:24:19.0523496Z or.b32 %r196, %r194, %r195; 2026-02-21T08:24:19.0523653Z or.b32 %r197, %r196, %r189; 2026-02-21T08:24:19.0523820Z add.s32 %r26, %r184, %r197; 2026-02-21T08:24:19.0523976Z xor.b32 %r198, %r197, 64; 2026-02-21T08:24:19.0524138Z add.s32 %r27, %r184, %r198; 2026-02-21T08:24:19.0524291Z xor.b32 %r199, %r197, 4; 2026-02-21T08:24:19.0524451Z add.s32 %r28, %r184, %r199; 2026-02-21T08:24:19.0524606Z xor.b32 %r200, %r197, 68; 2026-02-21T08:24:19.0524767Z add.s32 %r29, %r184, %r200; 2026-02-21T08:24:19.0524928Z xor.b32 %r201, %r197, 8; 2026-02-21T08:24:19.0525079Z add.s32 %r30, %r184, %r201; 2026-02-21T08:24:19.0525243Z 
xor.b32 %r202, %r197, 72; 2026-02-21T08:24:19.0525397Z add.s32 %r31, %r184, %r202; 2026-02-21T08:24:19.0525559Z xor.b32 %r203, %r197, 12; 2026-02-21T08:24:19.0525707Z add.s32 %r32, %r184, %r203; 2026-02-21T08:24:19.0525869Z xor.b32 %r204, %r197, 76; 2026-02-21T08:24:19.0526020Z add.s32 %r33, %r184, %r204; 2026-02-21T08:24:19.0526183Z and.b32 %r205, %r172, 8064; 2026-02-21T08:24:19.0526345Z and.b32 %r206, %r17, 112; 2026-02-21T08:24:19.0526499Z shr.u32 %r207, %r16, 4; 2026-02-21T08:24:19.0526660Z or.b32 %r208, %r205, %r206; 2026-02-21T08:24:19.0526884Z or.b32 %r209, %r208, %r207; 2026-02-21T08:24:19.0527048Z add.s32 %r34, %r184, %r209; 2026-02-21T08:24:19.0527201Z xor.b32 %r210, %r209, 16; 2026-02-21T08:24:19.0527363Z add.s32 %r35, %r184, %r210; 2026-02-21T08:24:19.0527515Z xor.b32 %r211, %r209, 32; 2026-02-21T08:24:19.0527675Z add.s32 %r36, %r184, %r211; 2026-02-21T08:24:19.0527831Z xor.b32 %r212, %r209, 48; 2026-02-21T08:24:19.0527991Z add.s32 %r37, %r184, %r212; 2026-02-21T08:24:19.0528155Z xor.b32 %r213, %r209, 64; 2026-02-21T08:24:19.0528305Z add.s32 %r38, %r184, %r213; 2026-02-21T08:24:19.0528473Z xor.b32 %r214, %r209, 80; 2026-02-21T08:24:19.0528624Z add.s32 %r39, %r184, %r214; 2026-02-21T08:24:19.0528788Z xor.b32 %r215, %r209, 96; 2026-02-21T08:24:19.0528935Z add.s32 %r40, %r184, %r215; 2026-02-21T08:24:19.0529099Z xor.b32 %r216, %r209, 112; 2026-02-21T08:24:19.0529254Z add.s32 %r41, %r184, %r216; 2026-02-21T08:24:19.0529418Z add.s32 %r283, %r5, 128; 2026-02-21T08:24:19.0529645Z bfe.u32 %r217, %r184, 4, 14; 2026-02-21T08:24:19.0529820Z cvt.u64.u32 %rd22, %r217; 2026-02-21T08:24:19.0529997Z or.b64 %rd58, %rd22, 4611686293338849280; 2026-02-21T08:24:19.0530183Z add.s32 %r218, %r121, 20512; 2026-02-21T08:24:19.0530351Z bfe.u32 %r219, %r218, 4, 14; 2026-02-21T08:24:19.0530508Z cvt.u64.u32 %rd23, %r219; 2026-02-21T08:24:19.0530681Z or.b64 %rd59, %rd23, 4611686293338849280; 2026-02-21T08:24:19.0530859Z add.s32 %r220, %r121, 20544; 2026-02-21T08:24:19.0531028Z 
bfe.u32 %r221, %r220, 4, 14; 2026-02-21T08:24:19.0531185Z cvt.u64.u32 %rd24, %r221; 2026-02-21T08:24:19.0531357Z or.b64 %rd60, %rd24, 4611686293338849280; 2026-02-21T08:24:19.0532012Z add.s32 %r222, %r121, 20576; 2026-02-21T08:24:19.0532178Z bfe.u32 %r223, %r222, 4, 14; 2026-02-21T08:24:19.0532346Z cvt.u64.u32 %rd25, %r223; 2026-02-21T08:24:19.0532506Z or.b64 %rd61, %rd25, 4611686293338849280; 2026-02-21T08:24:19.0532688Z or.b32 %r43, %r10, 56; 2026-02-21T08:24:19.0532840Z mov.b32 %r463, 4; 2026-02-21T08:24:19.0532992Z mov.b32 %r462, -1; 2026-02-21T08:24:19.0533145Z mov.pred %p69, -1; 2026-02-21T08:24:19.0533300Z mov.pred %p65, 0; 2026-02-21T08:24:19.0533441Z mov.b32 %r460, 16; 2026-02-21T08:24:19.0533585Z mov.b32 %r459, 32; 2026-02-21T08:24:19.0533729Z mov.b32 %r458, 48; 2026-02-21T08:24:19.0533865Z mov.b32 %r457, 64; 2026-02-21T08:24:19.0534008Z mov.b32 %r456, 0; 2026-02-21T08:24:19.0534142Z mov.b32 %r455, 1; 2026-02-21T08:24:19.0534287Z mov.b32 %r454, 2; 2026-02-21T08:24:19.0534419Z mov.b32 %r453, 3; 2026-02-21T08:24:19.0534564Z mov.b32 %r461, %r456; 2026-02-21T08:24:19.0534713Z mov.pred %p66, %p65; 2026-02-21T08:24:19.0534871Z mov.pred %p67, %p65; 2026-02-21T08:24:19.0535016Z mov.pred %p68, %p65; 2026-02-21T08:24:19.0535169Z mov.b32 %r478, %r456; 2026-02-21T08:24:19.0535313Z mov.b32 %r479, %r456; 2026-02-21T08:24:19.0535478Z mov.b32 %r480, %r456; 2026-02-21T08:24:19.0535624Z mov.b32 %r481, %r456; 2026-02-21T08:24:19.0535765Z mov.b32 %r482, %r456; 2026-02-21T08:24:19.0535911Z mov.b32 %r483, %r456; 2026-02-21T08:24:19.0536048Z mov.b32 %r484, %r456; 2026-02-21T08:24:19.0536197Z mov.b32 %r485, %r456; 2026-02-21T08:24:19.0536334Z mov.b32 %r472, %r456; 2026-02-21T08:24:19.0536481Z mov.b32 %r473, %r456; 2026-02-21T08:24:19.0536624Z mov.pred %p70, %p65; 2026-02-21T08:24:19.0536774Z mov.b32 %r475, %r463; 2026-02-21T08:24:19.0536914Z mov.b32 %r476, %r456; 2026-02-21T08:24:19.0537063Z bra.uni $L__BB0_7; 2026-02-21T08:24:19.0537254Z $L__BB0_9: // in Loop: Header=BB0_7 
Depth=2 2026-02-21T08:24:19.0537583Z .loc 1 68 38 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:68:38 2026-02-21T08:24:19.0537883Z setp.eq.b32 %p21, %r16, 0; 2026-02-21T08:24:19.0538149Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0538442Z add.s32 %r242, %r462, 1; 2026-02-21T08:24:19.0538601Z setp.gt.s32 %p22, %r242, 4; 2026-02-21T08:24:19.0538859Z selp.b32 %r462, 0, %r242, %p22; 2026-02-21T08:24:19.0539140Z .loc 1 38 35 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:38:35 2026-02-21T08:24:19.0539420Z add.s32 %r243, %r461, %r12; 2026-02-21T08:24:19.0539683Z .loc 1 45 80 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:80 2026-02-21T08:24:19.0539973Z cp.async.wait_group 4; 2026-02-21T08:24:19.0540124Z bar.sync 2, 128; 2026-02-21T08:24:19.0540267Z shl.b32 %r244, %r462, 12; 2026-02-21T08:24:19.0540522Z .loc 1 49 32 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:49:32 2026-02-21T08:24:19.0540810Z add.s32 %r245, %r21, %r244; 2026-02-21T08:24:19.0541007Z ld.shared.v4.b32 {%r246, %r247, %r248, %r249}, [%r245]; 2026-02-21T08:24:19.0541213Z mov.b32 {%rs9, %rs10}, %r249; 2026-02-21T08:24:19.0541384Z mov.b32 {%rs11, %rs12}, %r248; 2026-02-21T08:24:19.0541582Z mov.b32 {%rs13, %rs14}, %r247; 2026-02-21T08:24:19.0541811Z mov.b32 {%rs15, %rs16}, %r246; 2026-02-21T08:24:19.0542014Z ld.shared.v4.b32 {%r250, %r251, %r252, %r253}, [%r245+16]; 2026-02-21T08:24:19.0542230Z mov.b32 {%rs17, %rs18}, %r253; 2026-02-21T08:24:19.0542388Z mov.b32 {%rs19, %rs20}, %r252; 2026-02-21T08:24:19.0542549Z mov.b32 {%rs21, %rs22}, %r251; 2026-02-21T08:24:19.0542714Z mov.b32 {%rs23, %rs24}, %r250; 2026-02-21T08:24:19.0542874Z cvt.f32.bf16 %r226, %rs15; 2026-02-21T08:24:19.0543043Z cvt.f32.bf16 %r227, %rs16; 2026-02-21T08:24:19.0543195Z cvt.f32.bf16 %r228, %rs13; 2026-02-21T08:24:19.0543354Z cvt.f32.bf16 %r229, %rs14; 2026-02-21T08:24:19.0543504Z cvt.f32.bf16 %r230, %rs11; 2026-02-21T08:24:19.0543660Z 
cvt.f32.bf16 %r231, %rs12; 2026-02-21T08:24:19.0543811Z cvt.f32.bf16 %r232, %rs9; 2026-02-21T08:24:19.0543965Z cvt.f32.bf16 %r233, %rs10; 2026-02-21T08:24:19.0544112Z cvt.f32.bf16 %r234, %rs23; 2026-02-21T08:24:19.0544264Z cvt.f32.bf16 %r235, %rs24; 2026-02-21T08:24:19.0544418Z cvt.f32.bf16 %r236, %rs21; 2026-02-21T08:24:19.0544566Z cvt.f32.bf16 %r237, %rs22; 2026-02-21T08:24:19.0544721Z cvt.f32.bf16 %r238, %rs19; 2026-02-21T08:24:19.0544871Z cvt.f32.bf16 %r239, %rs20; 2026-02-21T08:24:19.0545027Z cvt.f32.bf16 %r240, %rs17; 2026-02-21T08:24:19.0545174Z cvt.f32.bf16 %r241, %rs18; 2026-02-21T08:24:19.0545438Z .loc 1 51 55 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:51:55 2026-02-21T08:24:19.0545720Z mul.lo.s32 %r254, %r243, 7168; 2026-02-21T08:24:19.0545992Z .loc 1 51 62 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:51:62 2026-02-21T08:24:19.0546280Z add.s32 %r255, %r478, %r254; 2026-02-21T08:24:19.0546438Z add.s32 %r256, %r479, %r254; 2026-02-21T08:24:19.0546598Z add.s32 %r257, %r480, %r254; 2026-02-21T08:24:19.0546750Z add.s32 %r258, %r481, %r254; 2026-02-21T08:24:19.0546908Z add.s32 %r259, %r482, %r254; 2026-02-21T08:24:19.0547063Z add.s32 %r260, %r483, %r254; 2026-02-21T08:24:19.0547213Z add.s32 %r261, %r484, %r254; 2026-02-21T08:24:19.0547372Z add.s32 %r262, %r485, %r254; 2026-02-21T08:24:19.0547630Z .loc 1 51 34 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:51:34 2026-02-21T08:24:19.0547908Z cvt.s64.s32 %rd50, %r255; 2026-02-21T08:24:19.0548070Z add.s64 %rd27, %rd6, %rd50; 2026-02-21T08:24:19.0548225Z cvt.s64.s32 %rd51, %r256; 2026-02-21T08:24:19.0548383Z add.s64 %rd30, %rd6, %rd51; 2026-02-21T08:24:19.0548534Z cvt.s64.s32 %rd52, %r257; 2026-02-21T08:24:19.0548689Z add.s64 %rd33, %rd6, %rd52; 2026-02-21T08:24:19.0548839Z cvt.s64.s32 %rd53, %r258; 2026-02-21T08:24:19.0548992Z add.s64 %rd36, %rd6, %rd53; 2026-02-21T08:24:19.0549146Z cvt.s64.s32 %rd54, %r259; 2026-02-21T08:24:19.0549293Z add.s64 %rd39, %rd6, %rd54; 
2026-02-21T08:24:19.0549456Z cvt.s64.s32 %rd55, %r260; 2026-02-21T08:24:19.0549609Z add.s64 %rd42, %rd6, %rd55; 2026-02-21T08:24:19.0549774Z cvt.s64.s32 %rd56, %r261; 2026-02-21T08:24:19.0549926Z add.s64 %rd45, %rd6, %rd56; 2026-02-21T08:24:19.0550090Z cvt.s64.s32 %rd57, %r262; 2026-02-21T08:24:19.0550303Z add.s64 %rd48, %rd6, %rd57; 2026-02-21T08:24:19.0550577Z .loc 1 51 87 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:51:87 2026-02-21T08:24:19.0550872Z // begin inline asm 2026-02-21T08:24:19.0551021Z mov.u64 %rd26, 0x0; 2026-02-21T08:24:19.0551243Z createpolicy.fractional.L2::evict_first.b64 %rd26, 1.0; 2026-02-21T08:24:19.0551463Z // end inline asm 2026-02-21T08:24:19.0551636Z // begin inline asm 2026-02-21T08:24:19.0551779Z mov.u16 %rs1, 0x0; 2026-02-21T08:24:19.0552005Z ld.global.L1::evict_first.L2::cache_hint.b8 { %rs1 }, [ %rd27 + 0 ], %rd26; 2026-02-21T08:24:19.0552252Z // end inline asm 2026-02-21T08:24:19.0552398Z // begin inline asm 2026-02-21T08:24:19.0552541Z mov.u64 %rd29, 0x0; 2026-02-21T08:24:19.0552739Z createpolicy.fractional.L2::evict_first.b64 %rd29, 1.0; 2026-02-21T08:24:19.0552961Z // end inline asm 2026-02-21T08:24:19.0553100Z // begin inline asm 2026-02-21T08:24:19.0553316Z mov.u16 %rs2, 0x0; 2026-02-21T08:24:19.0553531Z ld.global.L1::evict_first.L2::cache_hint.b8 { %rs2 }, [ %rd30 + 0 ], %rd29; 2026-02-21T08:24:19.0553782Z // end inline asm 2026-02-21T08:24:19.0553921Z // begin inline asm 2026-02-21T08:24:19.0554071Z mov.u64 %rd32, 0x0; 2026-02-21T08:24:19.0554259Z createpolicy.fractional.L2::evict_first.b64 %rd32, 1.0; 2026-02-21T08:24:19.0554481Z // end inline asm 2026-02-21T08:24:19.0554628Z // begin inline asm 2026-02-21T08:24:19.0554768Z mov.u16 %rs3, 0x0; 2026-02-21T08:24:19.0555052Z ld.global.L1::evict_first.L2::cache_hint.b8 { %rs3 }, [ %rd33 + 0 ], %rd32; 2026-02-21T08:24:19.0555296Z // end inline asm 2026-02-21T08:24:19.0555444Z // begin inline asm 2026-02-21T08:24:19.0555585Z mov.u64 %rd35, 0x0; 
2026-02-21T08:24:19.0555784Z createpolicy.fractional.L2::evict_first.b64 %rd35, 1.0; 2026-02-21T08:24:19.0556002Z // end inline asm 2026-02-21T08:24:19.0556149Z // begin inline asm 2026-02-21T08:24:19.0556299Z mov.u16 %rs4, 0x0; 2026-02-21T08:24:19.0556512Z ld.global.L1::evict_first.L2::cache_hint.b8 { %rs4 }, [ %rd36 + 0 ], %rd35; 2026-02-21T08:24:19.0556770Z // end inline asm 2026-02-21T08:24:19.0556909Z // begin inline asm 2026-02-21T08:24:19.0557054Z mov.u64 %rd38, 0x0; 2026-02-21T08:24:19.0557242Z createpolicy.fractional.L2::evict_first.b64 %rd38, 1.0; 2026-02-21T08:24:19.0557471Z // end inline asm 2026-02-21T08:24:19.0557603Z // begin inline asm 2026-02-21T08:24:19.0557744Z mov.u16 %rs5, 0x0; 2026-02-21T08:24:19.0557961Z ld.global.L1::evict_first.L2::cache_hint.b8 { %rs5 }, [ %rd39 + 0 ], %rd38; 2026-02-21T08:24:19.0558192Z // end inline asm 2026-02-21T08:24:19.0558330Z // begin inline asm 2026-02-21T08:24:19.0558462Z mov.u64 %rd41, 0x0; 2026-02-21T08:24:19.0558646Z createpolicy.fractional.L2::evict_first.b64 %rd41, 1.0; 2026-02-21T08:24:19.0558849Z // end inline asm 2026-02-21T08:24:19.0558987Z // begin inline asm 2026-02-21T08:24:19.0559119Z mov.u16 %rs6, 0x0; 2026-02-21T08:24:19.0559323Z ld.global.L1::evict_first.L2::cache_hint.b8 { %rs6 }, [ %rd42 + 0 ], %rd41; 2026-02-21T08:24:19.0559560Z // end inline asm 2026-02-21T08:24:19.0559691Z // begin inline asm 2026-02-21T08:24:19.0559829Z mov.u64 %rd44, 0x0; 2026-02-21T08:24:19.0560011Z createpolicy.fractional.L2::evict_first.b64 %rd44, 1.0; 2026-02-21T08:24:19.0560219Z // end inline asm 2026-02-21T08:24:19.0560349Z // begin inline asm 2026-02-21T08:24:19.0560488Z mov.u16 %rs7, 0x0; 2026-02-21T08:24:19.0560683Z ld.global.L1::evict_first.L2::cache_hint.b8 { %rs7 }, [ %rd45 + 0 ], %rd44; 2026-02-21T08:24:19.0560917Z // end inline asm 2026-02-21T08:24:19.0561052Z // begin inline asm 2026-02-21T08:24:19.0561185Z mov.u64 %rd47, 0x0; 2026-02-21T08:24:19.0561370Z createpolicy.fractional.L2::evict_first.b64 %rd47, 
1.0; 2026-02-21T08:24:19.0561613Z // end inline asm 2026-02-21T08:24:19.0561758Z // begin inline asm 2026-02-21T08:24:19.0561897Z mov.u16 %rs8, 0x0; 2026-02-21T08:24:19.0562106Z ld.global.L1::evict_first.L2::cache_hint.b8 { %rs8 }, [ %rd48 + 0 ], %rd47; 2026-02-21T08:24:19.0562347Z // end inline asm 2026-02-21T08:24:19.0562657Z .loc 1 54 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:54:28 2026-02-21T08:24:19.0562949Z cvt.s16.s8 %rs25, %rs5; 2026-02-21T08:24:19.0563105Z cvt.s16.s8 %rs26, %rs1; 2026-02-21T08:24:19.0563366Z .loc 1 56 25 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:56:25 2026-02-21T08:24:19.0563651Z shl.b16 %rs27, %rs1, 12; 2026-02-21T08:24:19.0563814Z shr.s16 %rs28, %rs27, 12; 2026-02-21T08:24:19.0563966Z shl.b16 %rs29, %rs5, 12; 2026-02-21T08:24:19.0564128Z shr.s16 %rs30, %rs29, 12; 2026-02-21T08:24:19.0564383Z .loc 1 54 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:54:28 2026-02-21T08:24:19.0564671Z cvt.s16.s8 %rs31, %rs6; 2026-02-21T08:24:19.0564824Z cvt.s16.s8 %rs32, %rs2; 2026-02-21T08:24:19.0565075Z .loc 1 56 25 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:56:25 2026-02-21T08:24:19.0565409Z shl.b16 %rs33, %rs2, 12; 2026-02-21T08:24:19.0565565Z shr.s16 %rs34, %rs33, 12; 2026-02-21T08:24:19.0565725Z shl.b16 %rs35, %rs6, 12; 2026-02-21T08:24:19.0565872Z shr.s16 %rs36, %rs35, 12; 2026-02-21T08:24:19.0566137Z .loc 1 54 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:54:28 2026-02-21T08:24:19.0566418Z cvt.s16.s8 %rs37, %rs7; 2026-02-21T08:24:19.0566563Z cvt.s16.s8 %rs38, %rs3; 2026-02-21T08:24:19.0566821Z .loc 1 56 25 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:56:25 2026-02-21T08:24:19.0567097Z shl.b16 %rs39, %rs3, 12; 2026-02-21T08:24:19.0567252Z shr.s16 %rs40, %rs39, 12; 2026-02-21T08:24:19.0567399Z shl.b16 %rs41, %rs7, 12; 2026-02-21T08:24:19.0567550Z shr.s16 %rs42, %rs41, 12; 2026-02-21T08:24:19.0567797Z .loc 1 54 28 // 
cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:54:28 2026-02-21T08:24:19.0568078Z cvt.s16.s8 %rs43, %rs8; 2026-02-21T08:24:19.0568232Z cvt.s16.s8 %rs44, %rs4; 2026-02-21T08:24:19.0568485Z .loc 1 56 25 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:56:25 2026-02-21T08:24:19.0568771Z shl.b16 %rs45, %rs4, 12; 2026-02-21T08:24:19.0568947Z shr.s16 %rs46, %rs45, 12; 2026-02-21T08:24:19.0569103Z shl.b16 %rs47, %rs8, 12; 2026-02-21T08:24:19.0569250Z shr.s16 %rs48, %rs47, 12; 2026-02-21T08:24:19.0569511Z .loc 1 59 28 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:59:28 2026-02-21T08:24:19.0569784Z shr.s16 %rs49, %rs26, 4; 2026-02-21T08:24:19.0569935Z shr.s16 %rs50, %rs25, 4; 2026-02-21T08:24:19.0570084Z shr.s16 %rs51, %rs32, 4; 2026-02-21T08:24:19.0570227Z shr.s16 %rs52, %rs31, 4; 2026-02-21T08:24:19.0570376Z shr.s16 %rs53, %rs38, 4; 2026-02-21T08:24:19.0570519Z shr.s16 %rs54, %rs37, 4; 2026-02-21T08:24:19.0570671Z shr.s16 %rs55, %rs44, 4; 2026-02-21T08:24:19.0570814Z shr.s16 %rs56, %rs43, 4; 2026-02-21T08:24:19.0571069Z .loc 1 63 45 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:63:45 2026-02-21T08:24:19.0571360Z st.shared.v2.b8 [%r22], {%rs28, %rs30}; 2026-02-21T08:24:19.0571830Z st.shared.v2.b8 [%r23], {%rs34, %rs36}; 2026-02-21T08:24:19.0572026Z st.shared.v2.b8 [%r24], {%rs40, %rs42}; 2026-02-21T08:24:19.0572207Z st.shared.v2.b8 [%r25], {%rs46, %rs48}; 2026-02-21T08:24:19.0572386Z bar.sync 2, 128; 2026-02-21T08:24:19.0572530Z ld.shared.b8 %rs57, [%r26]; 2026-02-21T08:24:19.0572707Z ld.shared.b8 %rs58, [%r26+16]; 2026-02-21T08:24:19.0572878Z ld.shared.b8 %rs59, [%r27+128]; 2026-02-21T08:24:19.0573059Z ld.shared.b8 %rs60, [%r27+144]; 2026-02-21T08:24:19.0573223Z ld.shared.b8 %rs61, [%r28+256]; 2026-02-21T08:24:19.0573395Z ld.shared.b8 %rs62, [%r28+272]; 2026-02-21T08:24:19.0573565Z ld.shared.b8 %rs63, [%r29+384]; 2026-02-21T08:24:19.0573726Z ld.shared.b8 %rs64, [%r29+400]; 2026-02-21T08:24:19.0573892Z ld.shared.b8 %rs65, 
[%r30+512]; 2026-02-21T08:24:19.0574054Z ld.shared.b8 %rs66, [%r30+528]; 2026-02-21T08:24:19.0574276Z ld.shared.b8 %rs67, [%r31+640]; 2026-02-21T08:24:19.0574433Z ld.shared.b8 %rs68, [%r31+656]; 2026-02-21T08:24:19.0574597Z ld.shared.b8 %rs69, [%r32+768]; 2026-02-21T08:24:19.0574753Z ld.shared.b8 %rs70, [%r32+784]; 2026-02-21T08:24:19.0574918Z ld.shared.b8 %rs71, [%r33+896]; 2026-02-21T08:24:19.0575073Z ld.shared.b8 %rs72, [%r33+912]; 2026-02-21T08:24:19.0575346Z .loc 1 64 45 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:64:45 2026-02-21T08:24:19.0575636Z bar.sync 2, 128; 2026-02-21T08:24:19.0575786Z st.shared.v2.b8 [%r22], {%rs49, %rs50}; 2026-02-21T08:24:19.0575980Z st.shared.v2.b8 [%r23], {%rs51, %rs52}; 2026-02-21T08:24:19.0576159Z st.shared.v2.b8 [%r24], {%rs53, %rs54}; 2026-02-21T08:24:19.0576347Z st.shared.v2.b8 [%r25], {%rs55, %rs56}; 2026-02-21T08:24:19.0576514Z bar.sync 2, 128; 2026-02-21T08:24:19.0576659Z ld.shared.b8 %rs73, [%r26]; 2026-02-21T08:24:19.0576828Z ld.shared.b8 %rs74, [%r26+16]; 2026-02-21T08:24:19.0577043Z ld.shared.b8 %rs75, [%r27+128]; 2026-02-21T08:24:19.0577215Z ld.shared.b8 %rs76, [%r27+144]; 2026-02-21T08:24:19.0577376Z ld.shared.b8 %rs77, [%r28+256]; 2026-02-21T08:24:19.0577541Z ld.shared.b8 %rs78, [%r28+272]; 2026-02-21T08:24:19.0577696Z ld.shared.b8 %rs79, [%r29+384]; 2026-02-21T08:24:19.0577857Z ld.shared.b8 %rs80, [%r29+400]; 2026-02-21T08:24:19.0578012Z ld.shared.b8 %rs81, [%r30+512]; 2026-02-21T08:24:19.0578174Z ld.shared.b8 %rs82, [%r30+528]; 2026-02-21T08:24:19.0578330Z ld.shared.b8 %rs83, [%r31+640]; 2026-02-21T08:24:19.0578499Z ld.shared.b8 %rs84, [%r31+656]; 2026-02-21T08:24:19.0578662Z ld.shared.b8 %rs85, [%r32+768]; 2026-02-21T08:24:19.0578824Z ld.shared.b8 %rs86, [%r32+784]; 2026-02-21T08:24:19.0578991Z ld.shared.b8 %rs87, [%r33+896]; 2026-02-21T08:24:19.0579148Z ld.shared.b8 %rs88, [%r33+912]; 2026-02-21T08:24:19.0579418Z .loc 1 69 58 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:69:58 
2026-02-21T08:24:19.0579707Z selp.b16 %rs89, %rs57, %rs73, %p21; 2026-02-21T08:24:19.0579886Z cvt.s16.s8 %rs90, %rs89; 2026-02-21T08:24:19.0580042Z selp.b16 %rs91, %rs59, %rs75, %p21; 2026-02-21T08:24:19.0580214Z cvt.s16.s8 %rs92, %rs91; 2026-02-21T08:24:19.0580373Z selp.b16 %rs93, %rs61, %rs77, %p21; 2026-02-21T08:24:19.0580537Z cvt.s16.s8 %rs94, %rs93; 2026-02-21T08:24:19.0580696Z selp.b16 %rs95, %rs63, %rs79, %p21; 2026-02-21T08:24:19.0580858Z cvt.s16.s8 %rs96, %rs95; 2026-02-21T08:24:19.0581013Z selp.b16 %rs97, %rs65, %rs81, %p21; 2026-02-21T08:24:19.0581177Z cvt.s16.s8 %rs98, %rs97; 2026-02-21T08:24:19.0581335Z selp.b16 %rs99, %rs67, %rs83, %p21; 2026-02-21T08:24:19.0581497Z cvt.s16.s8 %rs100, %rs99; 2026-02-21T08:24:19.0581695Z selp.b16 %rs101, %rs69, %rs85, %p21; 2026-02-21T08:24:19.0581875Z cvt.s16.s8 %rs102, %rs101; 2026-02-21T08:24:19.0582036Z selp.b16 %rs103, %rs71, %rs87, %p21; 2026-02-21T08:24:19.0582213Z cvt.s16.s8 %rs104, %rs103; 2026-02-21T08:24:19.0582371Z selp.b16 %rs105, %rs58, %rs74, %p21; 2026-02-21T08:24:19.0582549Z cvt.s16.s8 %rs106, %rs105; 2026-02-21T08:24:19.0582706Z selp.b16 %rs107, %rs60, %rs76, %p21; 2026-02-21T08:24:19.0582878Z cvt.s16.s8 %rs108, %rs107; 2026-02-21T08:24:19.0583035Z selp.b16 %rs109, %rs62, %rs78, %p21; 2026-02-21T08:24:19.0583211Z cvt.s16.s8 %rs110, %rs109; 2026-02-21T08:24:19.0583366Z selp.b16 %rs111, %rs64, %rs80, %p21; 2026-02-21T08:24:19.0583540Z cvt.s16.s8 %rs112, %rs111; 2026-02-21T08:24:19.0583702Z selp.b16 %rs113, %rs66, %rs82, %p21; 2026-02-21T08:24:19.0583867Z cvt.s16.s8 %rs114, %rs113; 2026-02-21T08:24:19.0584037Z selp.b16 %rs115, %rs68, %rs84, %p21; 2026-02-21T08:24:19.0584202Z cvt.s16.s8 %rs116, %rs115; 2026-02-21T08:24:19.0584363Z selp.b16 %rs117, %rs70, %rs86, %p21; 2026-02-21T08:24:19.0584527Z cvt.s16.s8 %rs118, %rs117; 2026-02-21T08:24:19.0584685Z selp.b16 %rs119, %rs72, %rs88, %p21; 2026-02-21T08:24:19.0584849Z cvt.s16.s8 %rs120, %rs119; 2026-02-21T08:24:19.0585120Z .loc 1 74 32 // 
cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:74:32 2026-02-21T08:24:19.0585474Z cvt.rn.f32.s16 %r263, %rs90; 2026-02-21T08:24:19.0585636Z cvt.rn.f32.s16 %r264, %rs92; 2026-02-21T08:24:19.0585802Z cvt.rn.f32.s16 %r265, %rs94; 2026-02-21T08:24:19.0585954Z cvt.rn.f32.s16 %r266, %rs96; 2026-02-21T08:24:19.0586112Z cvt.rn.f32.s16 %r267, %rs98; 2026-02-21T08:24:19.0586268Z cvt.rn.f32.s16 %r268, %rs100; 2026-02-21T08:24:19.0586435Z cvt.rn.f32.s16 %r269, %rs102; 2026-02-21T08:24:19.0586591Z cvt.rn.f32.s16 %r270, %rs104; 2026-02-21T08:24:19.0586752Z cvt.rn.f32.s16 %r271, %rs106; 2026-02-21T08:24:19.0586910Z cvt.rn.f32.s16 %r272, %rs108; 2026-02-21T08:24:19.0587061Z cvt.rn.f32.s16 %r273, %rs110; 2026-02-21T08:24:19.0587218Z cvt.rn.f32.s16 %r274, %rs112; 2026-02-21T08:24:19.0587367Z cvt.rn.f32.s16 %r275, %rs114; 2026-02-21T08:24:19.0587524Z cvt.rn.f32.s16 %r276, %rs116; 2026-02-21T08:24:19.0587678Z cvt.rn.f32.s16 %r277, %rs118; 2026-02-21T08:24:19.0587881Z cvt.rn.f32.s16 %r278, %rs120; 2026-02-21T08:24:19.0588037Z bar.sync 2, 128; 2026-02-21T08:24:19.0588185Z st.shared.b32 [%r34], %r263; 2026-02-21T08:24:19.0588344Z st.shared.b32 [%r34+8], %r264; 2026-02-21T08:24:19.0588512Z st.shared.b32 [%r35], %r265; 2026-02-21T08:24:19.0588676Z st.shared.b32 [%r35+8], %r266; 2026-02-21T08:24:19.0588837Z st.shared.b32 [%r36], %r267; 2026-02-21T08:24:19.0588997Z st.shared.b32 [%r36+8], %r268; 2026-02-21T08:24:19.0589156Z st.shared.b32 [%r37], %r269; 2026-02-21T08:24:19.0589318Z st.shared.b32 [%r37+8], %r270; 2026-02-21T08:24:19.0589477Z st.shared.b32 [%r38], %r271; 2026-02-21T08:24:19.0589640Z st.shared.b32 [%r38+8], %r272; 2026-02-21T08:24:19.0589797Z st.shared.b32 [%r39], %r273; 2026-02-21T08:24:19.0589959Z st.shared.b32 [%r39+8], %r274; 2026-02-21T08:24:19.0590123Z st.shared.b32 [%r40], %r275; 2026-02-21T08:24:19.0590278Z st.shared.b32 [%r40+8], %r276; 2026-02-21T08:24:19.0590444Z st.shared.b32 [%r41], %r277; 2026-02-21T08:24:19.0590602Z st.shared.b32 [%r41+8], %r278; 
2026-02-21T08:24:19.0590766Z $L__tmp26: 2026-02-21T08:24:19.0591059Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0591414Z shfl.sync.idx.b32 %r279, %r9, 0, 31, -1; 2026-02-21T08:24:19.0591742Z shl.b32 %r280, %r279, 21; 2026-02-21T08:24:19.0591913Z and.b32 %r281, %r280, 6291456; 2026-02-21T08:24:19.0592086Z add.s32 %r225, %r281, %r283; 2026-02-21T08:24:19.0592249Z mov.pred %p26, -1; 2026-02-21T08:24:19.0592410Z // begin inline asm 2026-02-21T08:24:19.0592811Z @%p26 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r225 + 0], 16, {%r226, %r227, %r228, %r229, %r230, %r231, %r232, %r233, %r234, %r235, %r236, %r237, %r238, %r239, %r240, %r241}; 2026-02-21T08:24:19.0593234Z // end inline asm 2026-02-21T08:24:19.0593381Z // begin inline asm 2026-02-21T08:24:19.0593552Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:24:19.0593719Z // end inline asm 2026-02-21T08:24:19.0593867Z bar.sync 2, 128; 2026-02-21T08:24:19.0594016Z // begin inline asm 2026-02-21T08:24:19.0594177Z fence.proxy.async.shared::cta; 2026-02-21T08:24:19.0594351Z // end inline asm 2026-02-21T08:24:19.0594486Z bar.sync 2, 128; 2026-02-21T08:24:19.0594640Z setp.ne.b32 %p23, %r279, 0; 2026-02-21T08:24:19.0594807Z setp.eq.b32 %p64, %r456, 255; 2026-02-21T08:24:19.0594976Z @%p23 bra $L__BB0_11; 2026-02-21T08:24:19.0595131Z bra.uni $L__BB0_10; 2026-02-21T08:24:19.0595286Z $L__tmp27: 2026-02-21T08:24:19.0595466Z $L__BB0_11: // in Loop: Header=BB0_7 Depth=2 2026-02-21T08:24:19.0595815Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0596123Z setp.lt.s32 %p35, %r476, %r20; 2026-02-21T08:24:19.0596289Z $L__tmp28: 2026-02-21T08:24:19.0596593Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0596935Z add.s32 %r306, %r473, 1; 2026-02-21T08:24:19.0597165Z setp.eq.b32 %p36, %r306, 2; 2026-02-21T08:24:19.0597334Z selp.b32 %r307, 0, 
%r306, %p36; 2026-02-21T08:24:19.0597517Z selp.b32 %r473, %r307, %r473, %p64; 2026-02-21T08:24:19.0597707Z and.pred %p37, %p64, %p36; 2026-02-21T08:24:19.0597874Z selp.b32 %r308, 1, 0, %p37; 2026-02-21T08:24:19.0598045Z xor.b32 %r472, %r472, %r308; 2026-02-21T08:24:19.0598200Z $L__tmp29: 2026-02-21T08:24:19.0598449Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0598745Z shl.b32 %r309, %r473, 3; 2026-02-21T08:24:19.0598908Z add.s32 %r311, %r121, %r309; 2026-02-21T08:24:19.0599066Z add.s32 %r300, %r311, 36880; 2026-02-21T08:24:19.0599225Z $L__tmp30: 2026-02-21T08:24:19.0599525Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0599859Z // begin inline asm 2026-02-21T08:24:19.0600018Z 2026-02-21T08:24:19.0600131Z { 2026-02-21T08:24:19.0600324Z @!%p64 bra.uni skipWait; 2026-02-21T08:24:19.0600484Z .reg .pred complete; 2026-02-21T08:24:19.0600635Z waitLoop: 2026-02-21T08:24:19.0600818Z mbarrier.try_wait.parity.shared.b64 complete, [%r300], %r472; 2026-02-21T08:24:19.0601060Z @!complete bra.uni waitLoop; 2026-02-21T08:24:19.0601213Z skipWait: 2026-02-21T08:24:19.0601339Z } 2026-02-21T08:24:19.0601407Z 2026-02-21T08:24:19.0601470Z // end inline asm 2026-02-21T08:24:19.0601633Z $L__tmp31: 2026-02-21T08:24:19.0601872Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0602155Z add.s32 %r312, %r458, 16; 2026-02-21T08:24:19.0602313Z add.s32 %r313, %r463, 1; 2026-02-21T08:24:19.0602466Z setp.gt.s32 %p38, %r313, 4; 2026-02-21T08:24:19.0602628Z selp.b32 %r463, 0, %r313, %p38; 2026-02-21T08:24:19.0602788Z add.s32 %r314, %r475, 1; 2026-02-21T08:24:19.0602948Z setp.eq.b32 %p39, %r475, 255; 2026-02-21T08:24:19.0603120Z selp.b32 %r90, 0, %r314, %p39; 2026-02-21T08:24:19.0603289Z setp.eq.b32 %p65, %r90, 0; 2026-02-21T08:24:19.0603460Z selp.b32 %r457, 0, %r312, %p65; 2026-02-21T08:24:19.0603727Z .loc 1 42 22 // 
cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:42:22 2026-02-21T08:24:19.0604013Z shl.b32 %r315, %r457, 1; 2026-02-21T08:24:19.0604264Z .loc 1 44 25 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:44:25 2026-02-21T08:24:19.0604552Z add.s32 %r316, %r315, %r13; 2026-02-21T08:24:19.0604808Z .loc 1 45 60 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:60 2026-02-21T08:24:19.0605094Z add.s32 %r317, %r316, %r14; 2026-02-21T08:24:19.0605251Z add.s32 %r318, %r316, %r15; 2026-02-21T08:24:19.0605507Z .loc 1 45 32 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:32 2026-02-21T08:24:19.0605796Z mad.wide.s32 %rd63, %r317, 2, %rd5; 2026-02-21T08:24:19.0605969Z mad.wide.s32 %rd64, %r318, 2, %rd5; 2026-02-21T08:24:19.0606246Z .loc 1 45 80 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:45:80 2026-02-21T08:24:19.0606522Z shl.b32 %r319, %r463, 12; 2026-02-21T08:24:19.0606680Z add.s32 %r320, %r121, %r319; 2026-02-21T08:24:19.0606835Z add.s32 %r302, %r320, %r18; 2026-02-21T08:24:19.0606987Z selp.b32 %r303, 16, 0, %p35; 2026-02-21T08:24:19.0607147Z // begin inline asm 2026-02-21T08:24:19.0607349Z cp.async.cg.shared.global [ %r302 + 0 ], [ %rd63 + 0 ], 0x10, %r303; 2026-02-21T08:24:19.0607576Z // end inline asm 2026-02-21T08:24:19.0607710Z add.s32 %r304, %r302, 2048; 2026-02-21T08:24:19.0607865Z // begin inline asm 2026-02-21T08:24:19.0608061Z cp.async.cg.shared.global [ %r304 + 0 ], [ %rd64 + 0 ], 0x10, %r303; 2026-02-21T08:24:19.0608287Z // end inline asm 2026-02-21T08:24:19.0608431Z cp.async.commit_group; 2026-02-21T08:24:19.0608683Z .loc 1 0 0 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:0 2026-02-21T08:24:19.0608970Z setp.ne.b32 %p70, %r456, 255; 2026-02-21T08:24:19.0609292Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0609583Z add.s32 %r476, %r476, 1; 2026-02-21T08:24:19.0609743Z setp.lt.s32 %p40, %r476, %r8; 2026-02-21T08:24:19.0609913Z mov.b32 %r453, 
%r475; 2026-02-21T08:24:19.0610063Z mov.b32 %r456, %r46; 2026-02-21T08:24:19.0610218Z mov.b32 %r461, %r51; 2026-02-21T08:24:19.0610376Z mov.pred %p69, %p4; 2026-02-21T08:24:19.0610526Z mov.b32 %r475, %r90; 2026-02-21T08:24:19.0610678Z @%p40 bra $L__BB0_7; 2026-02-21T08:24:19.0610827Z bra.uni $L__BB0_12; 2026-02-21T08:24:19.0611023Z $L__BB0_7: // Parent Loop BB0_2 Depth=1 2026-02-21T08:24:19.0611278Z // => This Inner Loop Header: Depth=2 2026-02-21T08:24:19.0611641Z .loc 1 0 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:0:74 2026-02-21T08:24:19.0611983Z mov.pred %p4, %p68; 2026-02-21T08:24:19.0612140Z mov.pred %p68, %p67; 2026-02-21T08:24:19.0612297Z mov.pred %p67, %p66; 2026-02-21T08:24:19.0612442Z mov.pred %p66, %p65; 2026-02-21T08:24:19.0612593Z mov.b32 %r51, %r460; 2026-02-21T08:24:19.0612732Z mov.b32 %r460, %r459; 2026-02-21T08:24:19.0612882Z mov.b32 %r459, %r458; 2026-02-21T08:24:19.0613021Z mov.b32 %r458, %r457; 2026-02-21T08:24:19.0613167Z mov.b32 %r46, %r455; 2026-02-21T08:24:19.0613304Z mov.b32 %r455, %r454; 2026-02-21T08:24:19.0613451Z mov.b32 %r454, %r453; 2026-02-21T08:24:19.0613703Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0613997Z not.pred %p19, %p69; 2026-02-21T08:24:19.0614146Z @%p19 bra $L__BB0_9; 2026-02-21T08:24:19.0614324Z // %bb.8: // in Loop: Header=BB0_7 Depth=2 2026-02-21T08:24:19.0614541Z add.s32 %r477, %r477, 1; 2026-02-21T08:24:19.0614801Z .loc 1 29 27 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:29:27 2026-02-21T08:24:19.0615086Z shl.b32 %r224, %r477, 6; 2026-02-21T08:24:19.0615338Z .loc 1 30 32 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:30:32 2026-02-21T08:24:19.0615625Z or.b32 %r478, %r224, %r11; 2026-02-21T08:24:19.0615785Z or.b32 %r479, %r478, 8; 2026-02-21T08:24:19.0615931Z or.b32 %r480, %r478, 16; 2026-02-21T08:24:19.0616084Z or.b32 %r481, %r478, 24; 2026-02-21T08:24:19.0616225Z or.b32 %r482, %r478, 32; 
2026-02-21T08:24:19.0616372Z or.b32 %r483, %r478, 40; 2026-02-21T08:24:19.0616513Z or.b32 %r484, %r478, 48; 2026-02-21T08:24:19.0616664Z or.b32 %r485, %r224, %r43; 2026-02-21T08:24:19.0616814Z bra.uni $L__BB0_9; 2026-02-21T08:24:19.0617004Z $L__BB0_10: // in Loop: Header=BB0_7 Depth=2 2026-02-21T08:24:19.0617321Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0617379Z shl.b32 %r294, %r473, 3; 2026-02-21T08:24:19.0617444Z add.s32 %r296, %r121, %r294; 2026-02-21T08:24:19.0617504Z add.s32 %r297, %r296, 36864; 2026-02-21T08:24:19.0617563Z $L__tmp32: 2026-02-21T08:24:19.0617782Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 2026-02-21T08:24:19.0617838Z shl.b32 %r298, %r473, 6; 2026-02-21T08:24:19.0617904Z add.s32 %r282, %r298, %r4; 2026-02-21T08:24:19.0617967Z elect.sync %r299|%p25, -1; 2026-02-21T08:24:19.0618023Z mov.b32 %r284, 68159760; 2026-02-21T08:24:19.0618078Z // begin inline asm 2026-02-21T08:24:19.0618244Z @%p25 tcgen05.mma.cta_group::1.kind::tf32 [ %r282 + 0 ], [ %r283 + 0 ], %rd58, %r284, %p70; 2026-02-21T08:24:19.0618300Z // end inline asm 2026-02-21T08:24:19.0618356Z // begin inline asm 2026-02-21T08:24:19.0618512Z @%p25 tcgen05.mma.cta_group::1.kind::tf32 [ %r282 + 0 ], [ %r283 + 8 ], %rd59, %r284, %p26; 2026-02-21T08:24:19.0618567Z // end inline asm 2026-02-21T08:24:19.0618624Z // begin inline asm 2026-02-21T08:24:19.0618828Z @%p25 tcgen05.mma.cta_group::1.kind::tf32 [ %r282 + 0 ], [ %r283 + 16 ], %rd60, %r284, %p26; 2026-02-21T08:24:19.0618883Z // end inline asm 2026-02-21T08:24:19.0618938Z // begin inline asm 2026-02-21T08:24:19.0619090Z @%p25 tcgen05.mma.cta_group::1.kind::tf32 [ %r282 + 0 ], [ %r283 + 24 ], %rd61, %r284, %p26; 2026-02-21T08:24:19.0619144Z // end inline asm 2026-02-21T08:24:19.0619207Z and.pred %p32, %p64, %p25; 2026-02-21T08:24:19.0619269Z cvt.u64.u32 %rd62, %r297; 2026-02-21T08:24:19.0619332Z // begin inline asm 
2026-02-21T08:24:19.0619457Z @%p32 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd62]; 2026-02-21T08:24:19.0619511Z // end inline asm 2026-02-21T08:24:19.0619577Z bra.uni $L__BB0_11; 2026-02-21T08:24:19.0619631Z $L__tmp33: 2026-02-21T08:24:19.0619729Z $L__BB0_15: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:24:19.0619948Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0620010Z barrier.sync 1; 2026-02-21T08:24:19.0620067Z barrier.sync 1; 2026-02-21T08:24:19.0620121Z bra.uni $L__BB0_2; 2026-02-21T08:24:19.0620208Z $L__BB0_12: // %._crit_edge 2026-02-21T08:24:19.0620298Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:24:19.0620362Z cp.async.wait_group 0; 2026-02-21T08:24:19.0620426Z bar.sync 2, 128; 2026-02-21T08:24:19.0620481Z barrier.sync 1; 2026-02-21T08:24:19.0620536Z bra.uni $L__BB0_2; 2026-02-21T08:24:19.0620629Z $L__BB0_13: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:24:19.0620800Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0620855Z barrier.sync 1; 2026-02-21T08:24:19.0620910Z barrier.sync 1; 2026-02-21T08:24:19.0620972Z bra.uni $L__BB0_2; 2026-02-21T08:24:19.0621065Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:24:19.0621229Z .loc 1 14 0 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:14 2026-02-21T08:24:19.0621290Z barrier.sync 1; 2026-02-21T08:24:19.0621344Z barrier.sync 1; 2026-02-21T08:24:19.0621398Z bra.uni $L__BB0_2; 2026-02-21T08:24:19.0621481Z $L__BB0_21: // %._crit_edge73 2026-02-21T08:24:19.0621688Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0621743Z barrier.sync 1; 2026-02-21T08:24:19.0621800Z shl.b32 %r451, %r492, 3; 2026-02-21T08:24:19.0621866Z add.s32 %r444, %r341, %r451; 2026-02-21T08:24:19.0621924Z $L__tmp34: 2026-02-21T08:24:19.0622136Z .loc 2 291 36 // standard.py:291:36 @[ cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:81:40 ] 
2026-02-21T08:24:19.0622198Z bar.sync 0, 256; 2026-02-21T08:24:19.0622253Z // begin inline asm 2026-02-21T08:24:19.0622304Z 2026-02-21T08:24:19.0622353Z { 2026-02-21T08:24:19.0622426Z .reg .pred complete; 2026-02-21T08:24:19.0622483Z waitLoop: 2026-02-21T08:24:19.0622603Z mbarrier.try_wait.parity.shared.b64 complete, [%r444], %r493; 2026-02-21T08:24:19.0622682Z @!complete bra.uni waitLoop; 2026-02-21T08:24:19.0622732Z } 2026-02-21T08:24:19.0622737Z 2026-02-21T08:24:19.0622792Z // end inline asm 2026-02-21T08:24:19.0622844Z $L__tmp35: 2026-02-21T08:24:19.0623013Z .loc 1 22 74 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:74 2026-02-21T08:24:19.0623069Z bar.sync 0, 256; 2026-02-21T08:24:19.0623124Z // begin inline asm 2026-02-21T08:24:19.0623218Z @%p59 mbarrier.inval.shared::cta.b64 [%r341]; 2026-02-21T08:24:19.0623273Z // end inline asm 2026-02-21T08:24:19.0623327Z bar.sync 0, 256; 2026-02-21T08:24:19.0623382Z // begin inline asm 2026-02-21T08:24:19.0623470Z @%p59 mbarrier.inval.shared::cta.b64 [%r342]; 2026-02-21T08:24:19.0623527Z // end inline asm 2026-02-21T08:24:19.0623583Z // begin inline asm 2026-02-21T08:24:19.0623671Z @%p59 mbarrier.inval.shared::cta.b64 [%r339]; 2026-02-21T08:24:19.0623787Z // end inline asm 2026-02-21T08:24:19.0623843Z bar.sync 0, 256; 2026-02-21T08:24:19.0623904Z // begin inline asm 2026-02-21T08:24:19.0623982Z @%p59 mbarrier.inval.shared::cta.b64 [%r340]; 2026-02-21T08:24:19.0624037Z // end inline asm 2026-02-21T08:24:19.0624199Z .loc 1 22 4 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:22:4 2026-02-21T08:24:19.0624261Z bar.sync 0, 256; 2026-02-21T08:24:19.0624315Z // begin inline asm 2026-02-21T08:24:19.0624430Z @%p41 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r450, 256; 2026-02-21T08:24:19.0624493Z // end inline asm 2026-02-21T08:24:19.0624594Z st.shared.v2.b32 [global_smem+36896], {67372036, 67372036}; 2026-02-21T08:24:19.0624651Z barrier.sync 1; 2026-02-21T08:24:19.0624737Z $L__BB0_22: // %common.ret 
2026-02-21T08:24:19.0624954Z .loc 1 0 0 // cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py:0 2026-02-21T08:24:19.0625013Z ret; 2026-02-21T08:24:19.0625067Z $L__tmp36: 2026-02-21T08:24:19.0625130Z $L__func_end0: 2026-02-21T08:24:19.0625210Z // -- End function 2026-02-21T08:24:19.0625259Z } 2026-02-21T08:24:19.0625465Z .file 1 "/tmp/torchinductor_root/bv/cbv6xj47tyn3soopzqdxfkkabth5fz2b6fsijdyt2ufesfwkhecn.py" 2026-02-21T08:24:19.0625638Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T08:24:19.0625700Z .section .debug_abbrev 2026-02-21T08:24:19.0625750Z { 2026-02-21T08:24:19.0625844Z .b8 1 // Abbreviation Code 2026-02-21T08:24:19.0625928Z .b8 17 // DW_TAG_compile_unit 2026-02-21T08:24:19.0626006Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:24:19.0626095Z .b8 37 // DW_AT_producer 2026-02-21T08:24:19.0626174Z .b8 8 // DW_FORM_string 2026-02-21T08:24:19.0626253Z .b8 19 // DW_AT_language 2026-02-21T08:24:19.0626337Z .b8 5 // DW_FORM_data2 2026-02-21T08:24:19.0626411Z .b8 3 // DW_AT_name 2026-02-21T08:24:19.0626484Z .b8 8 // DW_FORM_string 2026-02-21T08:24:19.0626565Z .b8 16 // DW_AT_stmt_list 2026-02-21T08:24:19.0626639Z .b8 6 // DW_FORM_data4 2026-02-21T08:24:19.0626712Z .b8 27 // DW_AT_comp_dir 2026-02-21T08:24:19.0626784Z .b8 8 // DW_FORM_string 2026-02-21T08:24:19.0626861Z .b8 0 // EOM(1) 2026-02-21T08:24:19.0626930Z .b8 0 // EOM(2) 2026-02-21T08:24:19.0627012Z .b8 2 // Abbreviation Code 2026-02-21T08:24:19.0627102Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:24:19.0627176Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:24:19.0627248Z .b8 3 // DW_AT_name 2026-02-21T08:24:19.0627325Z .b8 8 // DW_FORM_string 2026-02-21T08:24:19.0627400Z .b8 32 // DW_AT_inline 2026-02-21T08:24:19.0627473Z .b8 11 // DW_FORM_data1 2026-02-21T08:24:19.0627541Z .b8 0 // EOM(1) 2026-02-21T08:24:19.0627611Z .b8 0 // EOM(2) 2026-02-21T08:24:19.0627690Z .b8 3 // Abbreviation Code 2026-02-21T08:24:19.0627769Z .b8 46 // DW_TAG_subprogram 
2026-02-21T08:24:19.0627854Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:24:19.0627929Z .b8 17 // DW_AT_low_pc 2026-02-21T08:24:19.0628002Z .b8 1 // DW_FORM_addr 2026-02-21T08:24:19.0628141Z .b8 18 // DW_AT_high_pc 2026-02-21T08:24:19.0628213Z .b8 1 // DW_FORM_addr 2026-02-21T08:24:19.0628298Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:24:19.0628368Z .b8 19 // DW_FORM_ref4 2026-02-21T08:24:19.0628442Z .b8 0 // EOM(1) 2026-02-21T08:24:19.0628511Z .b8 0 // EOM(2) 2026-02-21T08:24:19.0628588Z .b8 4 // Abbreviation Code 2026-02-21T08:24:19.0628687Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T08:24:19.0628761Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:24:19.0628844Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:24:19.0628920Z .b8 19 // DW_FORM_ref4 2026-02-21T08:24:19.0629031Z .b8 17 // DW_AT_low_pc 2026-02-21T08:24:19.0629106Z .b8 1 // DW_FORM_addr 2026-02-21T08:24:19.0629190Z .b8 18 // DW_AT_high_pc 2026-02-21T08:24:19.0629261Z .b8 1 // DW_FORM_addr 2026-02-21T08:24:19.0629338Z .b8 88 // DW_AT_call_file 2026-02-21T08:24:19.0629413Z .b8 11 // DW_FORM_data1 2026-02-21T08:24:19.0629497Z .b8 89 // DW_AT_call_line 2026-02-21T08:24:19.0629572Z .b8 11 // DW_FORM_data1 2026-02-21T08:24:19.0629648Z .b8 87 // DW_AT_call_column 2026-02-21T08:24:19.0629729Z .b8 11 // DW_FORM_data1 2026-02-21T08:24:19.0629795Z .b8 0 // EOM(1) 2026-02-21T08:24:19.0629861Z .b8 0 // EOM(2) 2026-02-21T08:24:19.0629935Z .b8 0 // EOM(3) 2026-02-21T08:24:19.0629988Z } 2026-02-21T08:24:19.0630048Z .section .debug_info 2026-02-21T08:24:19.0630098Z { 2026-02-21T08:24:19.0630188Z .b32 178 // Length of Unit 2026-02-21T08:24:19.0630272Z .b8 2 // DWARF version number 2026-02-21T08:24:19.0630323Z .b8 0 2026-02-21T08:24:19.0630444Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T08:24:19.0630530Z .b8 8 // Address Size (in bytes) 2026-02-21T08:24:19.0630630Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T08:24:19.0630716Z .b8 116 // DW_AT_producer 2026-02-21T08:24:19.0630770Z .b8 114 2026-02-21T08:24:19.0630823Z .b8 105 2026-02-21T08:24:19.0630874Z .b8 116 2026-02-21T08:24:19.0630931Z .b8 111 2026-02-21T08:24:19.0630982Z .b8 110 2026-02-21T08:24:19.0631033Z .b8 0 2026-02-21T08:24:19.0631113Z .b8 2 // DW_AT_language 2026-02-21T08:24:19.0631165Z .b8 0 2026-02-21T08:24:19.0631239Z .b8 99 // DW_AT_name 2026-02-21T08:24:19.0631291Z .b8 98 2026-02-21T08:24:19.0631348Z .b8 118 2026-02-21T08:24:19.0631399Z .b8 54 2026-02-21T08:24:19.0631450Z .b8 120 2026-02-21T08:24:19.0631501Z .b8 106 2026-02-21T08:24:19.0631595Z .b8 52 2026-02-21T08:24:19.0631647Z .b8 55 2026-02-21T08:24:19.0631698Z .b8 116 2026-02-21T08:24:19.0631753Z .b8 121 2026-02-21T08:24:19.0631805Z .b8 110 2026-02-21T08:24:19.0631856Z .b8 51 2026-02-21T08:24:19.0631906Z .b8 115 2026-02-21T08:24:19.0631964Z .b8 111 2026-02-21T08:24:19.0632014Z .b8 111 2026-02-21T08:24:19.0632064Z .b8 112 2026-02-21T08:24:19.0632119Z .b8 122 2026-02-21T08:24:19.0632168Z .b8 113 2026-02-21T08:24:19.0632218Z .b8 100 2026-02-21T08:24:19.0632267Z .b8 120 2026-02-21T08:24:19.0632325Z .b8 102 2026-02-21T08:24:19.0632376Z .b8 107 2026-02-21T08:24:19.0632426Z .b8 107 2026-02-21T08:24:19.0632484Z .b8 97 2026-02-21T08:24:19.0632536Z .b8 98 2026-02-21T08:24:19.0632638Z .b8 116 2026-02-21T08:24:19.0632687Z .b8 104 2026-02-21T08:24:19.0632744Z .b8 53 2026-02-21T08:24:19.0632793Z .b8 102 2026-02-21T08:24:19.0632842Z .b8 122 2026-02-21T08:24:19.0632891Z .b8 50 2026-02-21T08:24:19.0632947Z .b8 98 2026-02-21T08:24:19.0632996Z .b8 54 2026-02-21T08:24:19.0633046Z .b8 102 2026-02-21T08:24:19.0633101Z .b8 115 2026-02-21T08:24:19.0633152Z .b8 105 2026-02-21T08:24:19.0633201Z .b8 106 2026-02-21T08:24:19.0633251Z .b8 100 2026-02-21T08:24:19.0633307Z .b8 121 2026-02-21T08:24:19.0633357Z .b8 116 
2026-02-21T08:24:19.0633407Z .b8 50 2026-02-21T08:24:19.0633462Z .b8 117 2026-02-21T08:24:19.0633512Z .b8 102 2026-02-21T08:24:19.0633562Z .b8 101 2026-02-21T08:24:19.0633611Z .b8 115 2026-02-21T08:24:19.0633667Z .b8 102 2026-02-21T08:24:19.0633716Z .b8 119 2026-02-21T08:24:19.0633767Z .b8 107 2026-02-21T08:24:19.0633818Z .b8 104 2026-02-21T08:24:19.0633875Z .b8 101 2026-02-21T08:24:19.0633926Z .b8 99 2026-02-21T08:24:19.0633976Z .b8 110 2026-02-21T08:24:19.0634083Z .b8 46 2026-02-21T08:24:19.0634140Z .b8 112 2026-02-21T08:24:19.0634193Z .b8 121 2026-02-21T08:24:19.0634245Z .b8 0 2026-02-21T08:24:19.0634345Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T08:24:19.0634423Z .b8 47 // DW_AT_comp_dir 2026-02-21T08:24:19.0634477Z .b8 116 2026-02-21T08:24:19.0634536Z .b8 109 2026-02-21T08:24:19.0634588Z .b8 112 2026-02-21T08:24:19.0634639Z .b8 47 2026-02-21T08:24:19.0634691Z .b8 116 2026-02-21T08:24:19.0634750Z .b8 111 2026-02-21T08:24:19.0634802Z .b8 114 2026-02-21T08:24:19.0634854Z .b8 99 2026-02-21T08:24:19.0634911Z .b8 104 2026-02-21T08:24:19.0634963Z .b8 105 2026-02-21T08:24:19.0635014Z .b8 110 2026-02-21T08:24:19.0635065Z .b8 100 2026-02-21T08:24:19.0635124Z .b8 117 2026-02-21T08:24:19.0635177Z .b8 99 2026-02-21T08:24:19.0635229Z .b8 116 2026-02-21T08:24:19.0635290Z .b8 111 2026-02-21T08:24:19.0635343Z .b8 114 2026-02-21T08:24:19.0635394Z .b8 95 2026-02-21T08:24:19.0635446Z .b8 114 2026-02-21T08:24:19.0635507Z .b8 111 2026-02-21T08:24:19.0635560Z .b8 111 2026-02-21T08:24:19.0635612Z .b8 116 2026-02-21T08:24:19.0635666Z .b8 47 2026-02-21T08:24:19.0635725Z .b8 98 2026-02-21T08:24:19.0635777Z .b8 118 2026-02-21T08:24:19.0635828Z .b8 0 2026-02-21T08:24:19.0635935Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T08:24:19.0636010Z .b8 95 // DW_AT_name 2026-02-21T08:24:19.0636062Z .b8 104 2026-02-21T08:24:19.0636113Z .b8 101 2026-02-21T08:24:19.0636172Z .b8 108 2026-02-21T08:24:19.0636223Z .b8 105 2026-02-21T08:24:19.0636275Z .b8 111 
2026-02-21T08:24:19.0636332Z .b8 110 2026-02-21T08:24:19.0636384Z .b8 95 2026-02-21T08:24:19.0636435Z .b8 109 2026-02-21T08:24:19.0636486Z .b8 97 2026-02-21T08:24:19.0636545Z .b8 116 2026-02-21T08:24:19.0636597Z .b8 109 2026-02-21T08:24:19.0636648Z .b8 117 2026-02-21T08:24:19.0636705Z .b8 108 2026-02-21T08:24:19.0636756Z .b8 95 2026-02-21T08:24:19.0636807Z .b8 98 2026-02-21T08:24:19.0636858Z .b8 102 2026-02-21T08:24:19.0636924Z .b8 49 2026-02-21T08:24:19.0636975Z .b8 54 2026-02-21T08:24:19.0637028Z .b8 95 2026-02-21T08:24:19.0637085Z .b8 105 2026-02-21T08:24:19.0637138Z .b8 110 2026-02-21T08:24:19.0637189Z .b8 116 2026-02-21T08:24:19.0637240Z .b8 52 2026-02-21T08:24:19.0637297Z .b8 0 2026-02-21T08:24:19.0637372Z .b8 1 // DW_AT_inline 2026-02-21T08:24:19.0637471Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T08:24:19.0637566Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T08:24:19.0637655Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T08:24:19.0637747Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:24:19.0637863Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T08:24:19.0637957Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:24:19.0638041Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T08:24:19.0638171Z .b64 $L__tmp35 // DW_AT_high_pc 2026-02-21T08:24:19.0638258Z .b8 1 // DW_AT_call_file 2026-02-21T08:24:19.0638337Z .b8 81 // DW_AT_call_line 2026-02-21T08:24:19.0638419Z .b8 40 // DW_AT_call_column 2026-02-21T08:24:19.0638506Z .b8 0 // End Of Children Mark 2026-02-21T08:24:19.0638587Z .b8 0 // End Of Children Mark 2026-02-21T08:24:19.0638637Z } 2026-02-21T08:24:19.0638704Z .section .debug_macinfo { } 2026-02-21T08:24:19.0638712Z 2026-02-21T08:24:19.0638789Z ================================================================ 2026-02-21T08:24:19.0638894Z please share the reproducer above with Triton project. 
2026-02-21T08:24:19.0639928Z [54s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=8, num_stages=6, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]) 2026-02-21T08:24:19.0640002Z Tensor-likes are not close! 2026-02-21T08:24:19.0640007Z 2026-02-21T08:24:19.0640088Z Mismatched elements: 455802 / 458752 (99.4%) 2026-02-21T08:24:19.0640233Z Greatest absolute difference: 1792.0 at index (19, 6877) (up to 0.01 allowed) 2026-02-21T08:24:19.0640375Z Greatest relative difference: 380928.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T08:24:19.0640483Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:24:19.0640487Z 2026-02-21T08:24:19.3949620Z [55s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:24:19.3949884Z 2026-02-21T08:24:19.3954813Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['first', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=8, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:24:19.3955798Z 2026-02-21T08:24:19.3955960Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T08:24:19.3956194Z `ptxas` stderr: 2026-02-21T08:24:19.3956631Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 186 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T08:24:19.3957116Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:24:19.3957264Z 2026-02-21T08:24:19.3957655Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp6xpvw7du.ptx -o /tmp/tmp6xpvw7du.ptx.o 2026-02-21T08:24:19.3958094Z 2026-02-21T08:24:19.3958220Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:24:19.3958404Z 2026-02-21T08:24:19.3958483Z ================================================================ 2026-02-21T08:24:19.3958683Z Internal Triton PTX codegen error 2026-02-21T08:24:19.3958848Z `ptxas` stderr: 2026-02-21T08:24:19.3959259Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 186 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:24:19.3959737Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:24:19.3959883Z 2026-02-21T08:24:19.3960249Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp6xpvw7du.ptx -o /tmp/tmp6xpvw7du.ptx.o 2026-02-21T08:24:19.3960687Z 2026-02-21T08:24:19.3960891Z 2026-02-21T08:24:19.3960955Z // 2026-02-21T08:24:19.3961123Z // Generated by LLVM NVPTX Back-End 2026-02-21T08:24:19.3961302Z // 2026-02-21T08:24:19.3961373Z 2026-02-21T08:24:19.3961437Z .version 8.7 2026-02-21T08:24:19.3961646Z .target sm_100a 2026-02-21T08:24:19.3961792Z .address_size 64 2026-02-21T08:24:19.3961875Z 2026-02-21T08:24:19.3962021Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T08:24:19.3962312Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T08:24:19.3962538Z // @_helion_matmul_bf16_int4 2026-02-21T08:24:19.3962753Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T08:24:19.3963004Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 
2026-02-21T08:24:19.3963283Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T08:24:19.3963563Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T08:24:19.3963904Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T08:24:19.3964193Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T08:24:19.3964423Z ) 2026-02-21T08:24:19.3964544Z .reqntid 256 2026-02-21T08:24:19.3964683Z .maxnreg 32 2026-02-21T08:24:19.3964804Z { 2026-02-21T08:24:19.3964936Z .reg .pred %p<38>; 2026-02-21T08:24:19.3965082Z .reg .b16 %rs<97>; 2026-02-21T08:24:19.3965225Z .reg .b32 %r<310>; 2026-02-21T08:24:19.3965359Z .reg .b64 %rd<95>; 2026-02-21T08:24:19.3965501Z $L__func_begin0: 2026-02-21T08:24:19.3965582Z 2026-02-21T08:24:19.3965634Z // %bb.0: 2026-02-21T08:24:19.3965879Z .loc 1 14 0 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:14 2026-02-21T08:24:19.3966168Z mov.u32 %r1, %tid.x; 2026-02-21T08:24:19.3966325Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T08:24:19.3966498Z mov.b32 %r41, global_smem; 2026-02-21T08:24:19.3966655Z // begin inline asm 2026-02-21T08:24:19.3966899Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r41], 128; 2026-02-21T08:24:19.3967148Z // end inline asm 2026-02-21T08:24:19.3967291Z bar.sync 0; 2026-02-21T08:24:19.3967433Z ld.shared.b32 %r304, [global_smem]; 2026-02-21T08:24:19.3967612Z bar.sync 0; 2026-02-21T08:24:19.3967756Z // begin inline asm 2026-02-21T08:24:19.3967972Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T08:24:19.3968349Z // end inline asm 2026-02-21T08:24:19.3968597Z .loc 1 20 30 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:20:30 2026-02-21T08:24:19.3968890Z mov.u32 %r3, %ctaid.x; 2026-02-21T08:24:19.3969146Z .loc 1 22 52 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:22:52 2026-02-21T08:24:19.3969451Z setp.gt.u32 %p3, %r3, 111; 2026-02-21T08:24:19.3969619Z @%p3 bra 
$L__BB0_8; 2026-02-21T08:24:19.3969781Z // %bb.1: // %.lr.ph 2026-02-21T08:24:19.3970085Z .loc 1 0 52 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:0:52 2026-02-21T08:24:19.3970407Z ld.param.b64 %rd20, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T08:24:19.3970666Z ld.param.b64 %rd19, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T08:24:19.3970909Z ld.param.b64 %rd18, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T08:24:19.3971122Z and.b32 %r4, %r1, 64; 2026-02-21T08:24:19.3971375Z .loc 1 68 38 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:68:38 2026-02-21T08:24:19.3971731Z setp.eq.b32 %p8, %r4, 0; 2026-02-21T08:24:19.3971999Z .loc 1 28 45 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:28:45 2026-02-21T08:24:19.3972285Z shr.u32 %r90, %r1, 5; 2026-02-21T08:24:19.3972539Z .loc 1 38 48 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:38:48 2026-02-21T08:24:19.3972818Z shr.u32 %r91, %r1, 4; 2026-02-21T08:24:19.3972976Z bfe.u32 %r92, %r1, 4, 4; 2026-02-21T08:24:19.3973232Z .loc 1 28 45 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:28:45 2026-02-21T08:24:19.3973603Z and.b32 %r93, %r1, 7; 2026-02-21T08:24:19.3973756Z and.b32 %r94, %r1, 15; 2026-02-21T08:24:19.3974016Z .loc 1 44 38 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:44:38 2026-02-21T08:24:19.3974297Z shl.b32 %r95, %r1, 3; 2026-02-21T08:24:19.3974442Z and.b32 %r5, %r95, 24; 2026-02-21T08:24:19.3974698Z .loc 1 28 45 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:28:45 2026-02-21T08:24:19.3974976Z and.b32 %r96, %r1, 248; 2026-02-21T08:24:19.3975136Z bfe.u32 %r97, %r1, 3, 5; 2026-02-21T08:24:19.3975293Z shl.b32 %r98, %r1, 11; 2026-02-21T08:24:19.3975442Z and.b32 %r6, %r98, 516096; 2026-02-21T08:24:19.3975611Z setp.eq.b32 %p35, %r1, 0; 2026-02-21T08:24:19.3975763Z or.b32 %r99, %r6, %r5; 2026-02-21T08:24:19.3975928Z mad.wide.u32 %rd21, %r99, 2, %rd18; 2026-02-21T08:24:19.3976155Z and.b32 %r100, %r1, 255; 2026-02-21T08:24:19.3976316Z 
shl.b32 %r7, %r100, 4; 2026-02-21T08:24:19.3976463Z add.s32 %r189, %r41, %r7; 2026-02-21T08:24:19.3976623Z add.s32 %r175, %r304, 64; 2026-02-21T08:24:19.3976774Z shl.b32 %r10, %r100, 2; 2026-02-21T08:24:19.3976932Z add.s32 %r102, %r41, 28672; 2026-02-21T08:24:19.3977098Z add.s32 %r191, %r102, %r10; 2026-02-21T08:24:19.3977254Z shl.b32 %r103, %r94, 6; 2026-02-21T08:24:19.3977411Z and.b32 %r104, %r1, 96; 2026-02-21T08:24:19.3977558Z shl.b32 %r105, %r104, 5; 2026-02-21T08:24:19.3977719Z and.b32 %r106, %r1, 16; 2026-02-21T08:24:19.3977864Z and.b32 %r107, %r1, 128; 2026-02-21T08:24:19.3978021Z shr.u32 %r108, %r107, 2; 2026-02-21T08:24:19.3978172Z or.b32 %r109, %r103, %r105; 2026-02-21T08:24:19.3978332Z or.b32 %r110, %r106, %r108; 2026-02-21T08:24:19.3978487Z or.b32 %r12, %r109, %r110; 2026-02-21T08:24:19.3978654Z and.b32 %r111, %r1, 63; 2026-02-21T08:24:19.3978807Z shr.u32 %r112, %r107, 1; 2026-02-21T08:24:19.3978957Z or.b32 %r13, %r112, %r111; 2026-02-21T08:24:19.3979119Z shl.b32 %r113, %r111, 7; 2026-02-21T08:24:19.3979266Z shl.b32 %r114, %r93, 4; 2026-02-21T08:24:19.3979417Z and.b32 %r115, %r91, 12; 2026-02-21T08:24:19.3979565Z or.b32 %r116, %r113, %r115; 2026-02-21T08:24:19.3979727Z or.b32 %r117, %r116, %r114; 2026-02-21T08:24:19.3979883Z add.s32 %r118, %r41, 20480; 2026-02-21T08:24:19.3980048Z add.s32 %r14, %r118, %r117; 2026-02-21T08:24:19.3980207Z xor.b32 %r119, %r117, 16; 2026-02-21T08:24:19.3980370Z add.s32 %r15, %r118, %r119; 2026-02-21T08:24:19.3980538Z xor.b32 %r120, %r117, 32; 2026-02-21T08:24:19.3980693Z add.s32 %r16, %r118, %r120; 2026-02-21T08:24:19.3980860Z xor.b32 %r121, %r117, 48; 2026-02-21T08:24:19.3981014Z add.s32 %r17, %r118, %r121; 2026-02-21T08:24:19.3981180Z xor.b32 %r122, %r117, 64; 2026-02-21T08:24:19.3981333Z add.s32 %r18, %r118, %r122; 2026-02-21T08:24:19.3981497Z xor.b32 %r123, %r117, 80; 2026-02-21T08:24:19.3981684Z add.s32 %r19, %r118, %r123; 2026-02-21T08:24:19.3981852Z xor.b32 %r124, %r117, 96; 2026-02-21T08:24:19.3982012Z add.s32 
%r20, %r118, %r124; 2026-02-21T08:24:19.3982178Z xor.b32 %r125, %r117, 112; 2026-02-21T08:24:19.3982340Z add.s32 %r21, %r118, %r125; 2026-02-21T08:24:19.3982502Z bfe.u32 %r126, %r118, 4, 14; 2026-02-21T08:24:19.3982674Z cvt.u64.u32 %rd31, %r126; 2026-02-21T08:24:19.3982842Z or.b64 %rd38, %rd31, 4611686293338849280; 2026-02-21T08:24:19.3983037Z add.s32 %r127, %r41, 20512; 2026-02-21T08:24:19.3983199Z bfe.u32 %r128, %r127, 4, 14; 2026-02-21T08:24:19.3983369Z cvt.u64.u32 %rd32, %r128; 2026-02-21T08:24:19.3983537Z or.b64 %rd39, %rd32, 4611686293338849280; 2026-02-21T08:24:19.3983728Z add.s32 %r129, %r41, 20544; 2026-02-21T08:24:19.3983896Z bfe.u32 %r130, %r129, 4, 14; 2026-02-21T08:24:19.3984057Z cvt.u64.u32 %rd33, %r130; 2026-02-21T08:24:19.3984231Z or.b64 %rd40, %rd33, 4611686293338849280; 2026-02-21T08:24:19.3984411Z add.s32 %r131, %r41, 20576; 2026-02-21T08:24:19.3984580Z bfe.u32 %r132, %r131, 4, 14; 2026-02-21T08:24:19.3984746Z cvt.u64.u32 %rd34, %r132; 2026-02-21T08:24:19.3984981Z or.b64 %rd41, %rd34, 4611686293338849280; 2026-02-21T08:24:19.3985164Z add.s64 %rd43, %rd21, 320; 2026-02-21T08:24:19.3985337Z mul.lo.s32 %r133, %r97, 7168; 2026-02-21T08:24:19.3985501Z shl.b32 %r134, %r93, 10; 2026-02-21T08:24:19.3985665Z shl.b32 %r135, %r94, 4; 2026-02-21T08:24:19.3985831Z shl.b32 %r136, %r104, 3; 2026-02-21T08:24:19.3985990Z shl.b32 %r137, %r106, 1; 2026-02-21T08:24:19.3986159Z or.b32 %r138, %r135, %r136; 2026-02-21T08:24:19.3986326Z or.b32 %r139, %r137, %r112; 2026-02-21T08:24:19.3986500Z xor.b32 %r140, %r138, %r139; 2026-02-21T08:24:19.3986665Z or.b32 %r141, %r140, %r134; 2026-02-21T08:24:19.3986841Z xor.b32 %r142, %r141, 16; 2026-02-21T08:24:19.3986994Z shl.b32 %r143, %r1, 7; 2026-02-21T08:24:19.3987155Z and.b32 %r144, %r143, 7168; 2026-02-21T08:24:19.3987318Z shl.b32 %r145, %r96, 1; 2026-02-21T08:24:19.3987470Z or.b32 %r146, %r144, %r114; 2026-02-21T08:24:19.3987693Z xor.b32 %r147, %r146, %r145; 2026-02-21T08:24:19.3987960Z .loc 1 22 52 // 
cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:22:52 2026-02-21T08:24:19.3988246Z cvt.u64.u32 %rd6, %r92; 2026-02-21T08:24:19.3988409Z mad.wide.u32 %rd35, %r92, 7168, %rd19; 2026-02-21T08:24:19.3988593Z add.s32 %r148, %r41, %r12; 2026-02-21T08:24:19.3988747Z or.b32 %r149, %r91, 48; 2026-02-21T08:24:19.3988915Z mad.wide.u32 %rd36, %r149, 7168, %rd19; 2026-02-21T08:24:19.3989207Z .loc 1 28 45 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:28:45 2026-02-21T08:24:19.3989487Z shl.b32 %r150, %r93, 3; 2026-02-21T08:24:19.3989647Z shl.b32 %r25, %r94, 2; 2026-02-21T08:24:19.3989797Z add.s32 %r151, %r102, %r13; 2026-02-21T08:24:19.3989960Z add.s32 %r152, %r41, %r10; 2026-02-21T08:24:19.3990114Z add.s32 %r79, %r152, 32768; 2026-02-21T08:24:19.3990275Z add.s32 %r77, %r189, 16384; 2026-02-21T08:24:19.3990427Z add.s64 %rd29, %rd21, 256; 2026-02-21T08:24:19.3990589Z add.s32 %r75, %r152, 31744; 2026-02-21T08:24:19.3990741Z add.s32 %r73, %r189, 12288; 2026-02-21T08:24:19.3990899Z add.s64 %rd27, %rd21, 192; 2026-02-21T08:24:19.3991054Z add.s32 %r71, %r152, 30720; 2026-02-21T08:24:19.3991202Z add.s32 %r69, %r189, 8192; 2026-02-21T08:24:19.3991360Z add.s64 %rd25, %rd21, 128; 2026-02-21T08:24:19.3991509Z add.s32 %r67, %r152, 29696; 2026-02-21T08:24:19.3991692Z add.s32 %r65, %r189, 4096; 2026-02-21T08:24:19.3991845Z add.s64 %rd23, %rd21, 64; 2026-02-21T08:24:19.3992112Z .loc 1 27 27 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:27:27 2026-02-21T08:24:19.3992390Z shl.b32 %r26, %r3, 6; 2026-02-21T08:24:19.3992650Z .loc 1 28 32 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:28:32 2026-02-21T08:24:19.3992937Z or.b32 %r153, %r26, %r25; 2026-02-21T08:24:19.3993091Z cvt.u64.u32 %rd37, %r153; 2026-02-21T08:24:19.3993250Z or.b32 %r154, %r26, %r150; 2026-02-21T08:24:19.3993401Z $L__tmp0: 2026-02-21T08:24:19.3993703Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.3994052Z 
shfl.sync.idx.b32 %r27, %r90, 0, 31, -1; 2026-02-21T08:24:19.3994246Z shl.b32 %r155, %r27, 21; 2026-02-21T08:24:19.3994401Z and.b32 %r156, %r155, 6291456; 2026-02-21T08:24:19.3994571Z add.s32 %r157, %r156, %r304; 2026-02-21T08:24:19.3994736Z and.b32 %r158, %r27, 4; 2026-02-21T08:24:19.3994887Z shl.b32 %r159, %r158, 3; 2026-02-21T08:24:19.3995048Z add.s32 %r270, %r157, %r159; 2026-02-21T08:24:19.3995206Z mov.pred %p12, -1; 2026-02-21T08:24:19.3995363Z mov.b32 %r305, 0; 2026-02-21T08:24:19.3995507Z // begin inline asm 2026-02-21T08:24:19.3995905Z @%p12 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r270 + 0], 16, {%r305, %r305, %r305, %r305, %r305, %r305, %r305, %r305, %r305, %r305, %r305, %r305, %r305, %r305, %r305, %r305}; 2026-02-21T08:24:19.3996312Z // end inline asm 2026-02-21T08:24:19.3996458Z // begin inline asm 2026-02-21T08:24:19.3996681Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:24:19.3996845Z // end inline asm 2026-02-21T08:24:19.3996988Z bar.sync 0; 2026-02-21T08:24:19.3997115Z $L__tmp1: 2026-02-21T08:24:19.3997359Z .loc 1 37 79 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:37:79 2026-02-21T08:24:19.3997639Z add.s32 %r306, %r41, 33792; 2026-02-21T08:24:19.3997802Z // begin inline asm 2026-02-21T08:24:19.3997973Z @%p35 mbarrier.init.shared::cta.b64 [%r306], 1; 2026-02-21T08:24:19.3998170Z // end inline asm 2026-02-21T08:24:19.3998312Z bar.sync 0; 2026-02-21T08:24:19.3998446Z add.s32 %r60, %r41, 33800; 2026-02-21T08:24:19.3998606Z // begin inline asm 2026-02-21T08:24:19.3998772Z @%p35 mbarrier.init.shared::cta.b64 [%r60], 1; 2026-02-21T08:24:19.3998966Z // end inline asm 2026-02-21T08:24:19.3999099Z mov.b32 %r62, 16; 2026-02-21T08:24:19.3999351Z .loc 1 45 80 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:45:80 2026-02-21T08:24:19.3999699Z // begin inline asm 2026-02-21T08:24:19.3999912Z cp.async.cg.shared.global [ %r189 + 0 ], [ %rd21 + 0 ], 0x10, %r62; 2026-02-21T08:24:19.4000145Z // end inline asm 2026-02-21T08:24:19.4000286Z 
cp.async.commit_group; 2026-02-21T08:24:19.4000545Z .loc 1 51 34 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:34 2026-02-21T08:24:19.4000826Z add.s64 %rd22, %rd35, %rd37; 2026-02-21T08:24:19.4000986Z mov.b32 %r64, 4; 2026-02-21T08:24:19.4001219Z .loc 1 51 87 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:87 2026-02-21T08:24:19.4001503Z // begin inline asm 2026-02-21T08:24:19.4001746Z cp.async.ca.shared.global [ %r191 + 0 ], [ %rd22 + 0 ], 0x4, %r64; 2026-02-21T08:24:19.4001968Z // end inline asm 2026-02-21T08:24:19.4002117Z cp.async.commit_group; 2026-02-21T08:24:19.4002373Z .loc 1 45 80 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:45:80 2026-02-21T08:24:19.4002661Z bar.sync 0; 2026-02-21T08:24:19.4002795Z // begin inline asm 2026-02-21T08:24:19.4002998Z cp.async.cg.shared.global [ %r65 + 0 ], [ %rd23 + 0 ], 0x10, %r62; 2026-02-21T08:24:19.4003221Z // end inline asm 2026-02-21T08:24:19.4003369Z cp.async.commit_group; 2026-02-21T08:24:19.4003625Z .loc 1 51 34 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:34 2026-02-21T08:24:19.4003911Z add.s64 %rd24, %rd22, 114688; 2026-02-21T08:24:19.4004178Z .loc 1 51 87 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:87 2026-02-21T08:24:19.4004454Z // begin inline asm 2026-02-21T08:24:19.4004658Z cp.async.ca.shared.global [ %r67 + 0 ], [ %rd24 + 0 ], 0x4, %r64; 2026-02-21T08:24:19.4004876Z // end inline asm 2026-02-21T08:24:19.4005022Z cp.async.commit_group; 2026-02-21T08:24:19.4005274Z .loc 1 45 80 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:45:80 2026-02-21T08:24:19.4005555Z bar.sync 0; 2026-02-21T08:24:19.4005695Z // begin inline asm 2026-02-21T08:24:19.4005893Z cp.async.cg.shared.global [ %r69 + 0 ], [ %rd25 + 0 ], 0x10, %r62; 2026-02-21T08:24:19.4006125Z // end inline asm 2026-02-21T08:24:19.4006269Z cp.async.commit_group; 2026-02-21T08:24:19.4006539Z .loc 1 51 34 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:34 
2026-02-21T08:24:19.4006823Z add.s64 %rd26, %rd22, 229376; 2026-02-21T08:24:19.4007097Z .loc 1 51 87 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:87 2026-02-21T08:24:19.4007380Z // begin inline asm 2026-02-21T08:24:19.4007575Z cp.async.ca.shared.global [ %r71 + 0 ], [ %rd26 + 0 ], 0x4, %r64; 2026-02-21T08:24:19.4007801Z // end inline asm 2026-02-21T08:24:19.4007941Z cp.async.commit_group; 2026-02-21T08:24:19.4008201Z .loc 1 45 80 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:45:80 2026-02-21T08:24:19.4008478Z bar.sync 0; 2026-02-21T08:24:19.4008615Z // begin inline asm 2026-02-21T08:24:19.4008862Z cp.async.cg.shared.global [ %r73 + 0 ], [ %rd27 + 0 ], 0x10, %r62; 2026-02-21T08:24:19.4009089Z // end inline asm 2026-02-21T08:24:19.4009237Z cp.async.commit_group; 2026-02-21T08:24:19.4009487Z .loc 1 51 34 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:34 2026-02-21T08:24:19.4009779Z add.s64 %rd28, %rd36, %rd37; 2026-02-21T08:24:19.4010037Z .loc 1 51 87 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:87 2026-02-21T08:24:19.4010315Z // begin inline asm 2026-02-21T08:24:19.4010507Z cp.async.ca.shared.global [ %r75 + 0 ], [ %rd28 + 0 ], 0x4, %r64; 2026-02-21T08:24:19.4010730Z // end inline asm 2026-02-21T08:24:19.4010869Z cp.async.commit_group; 2026-02-21T08:24:19.4011125Z .loc 1 45 80 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:45:80 2026-02-21T08:24:19.4011406Z bar.sync 0; 2026-02-21T08:24:19.4011576Z // begin inline asm 2026-02-21T08:24:19.4011842Z cp.async.cg.shared.global [ %r77 + 0 ], [ %rd29 + 0 ], 0x10, %r62; 2026-02-21T08:24:19.4012063Z // end inline asm 2026-02-21T08:24:19.4012209Z cp.async.commit_group; 2026-02-21T08:24:19.4012453Z .loc 1 51 34 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:34 2026-02-21T08:24:19.4012739Z add.s64 %rd30, %rd22, 458752; 2026-02-21T08:24:19.4013003Z .loc 1 51 87 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:87 
2026-02-21T08:24:19.4013277Z // begin inline asm 2026-02-21T08:24:19.4013475Z cp.async.ca.shared.global [ %r79 + 0 ], [ %rd30 + 0 ], 0x4, %r64; 2026-02-21T08:24:19.4013690Z // end inline asm 2026-02-21T08:24:19.4013835Z cp.async.commit_group; 2026-02-21T08:24:19.4014081Z .loc 1 45 80 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:45:80 2026-02-21T08:24:19.4014371Z cp.async.wait_group 8; 2026-02-21T08:24:19.4014519Z bar.sync 0; 2026-02-21T08:24:19.4014700Z ld.shared.v4.b32 {%r160, %r161, %r162, %r163}, [%r148]; 2026-02-21T08:24:19.4014915Z mov.b32 {%rs1, %rs2}, %r163; 2026-02-21T08:24:19.4015075Z mov.b32 {%rs3, %rs4}, %r162; 2026-02-21T08:24:19.4015239Z mov.b32 {%rs5, %rs6}, %r161; 2026-02-21T08:24:19.4015392Z mov.b32 {%rs7, %rs8}, %r160; 2026-02-21T08:24:19.4015655Z .loc 1 49 32 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:49:32 2026-02-21T08:24:19.4015933Z cvt.f32.bf16 %r82, %rs7; 2026-02-21T08:24:19.4016097Z cvt.f32.bf16 %r83, %rs8; 2026-02-21T08:24:19.4016250Z cvt.f32.bf16 %r84, %rs5; 2026-02-21T08:24:19.4016407Z cvt.f32.bf16 %r85, %rs6; 2026-02-21T08:24:19.4016562Z cvt.f32.bf16 %r86, %rs3; 2026-02-21T08:24:19.4016711Z cvt.f32.bf16 %r87, %rs4; 2026-02-21T08:24:19.4016870Z cvt.f32.bf16 %r88, %rs1; 2026-02-21T08:24:19.4017016Z cvt.f32.bf16 %r89, %rs2; 2026-02-21T08:24:19.4017168Z $L__tmp2: 2026-02-21T08:24:19.4017454Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4017792Z add.s32 %r164, %r156, %r175; 2026-02-21T08:24:19.4017947Z shl.b32 %r165, %r158, 2; 2026-02-21T08:24:19.4018100Z add.s32 %r201, %r164, %r165; 2026-02-21T08:24:19.4018258Z // begin inline asm 2026-02-21T08:24:19.4018535Z @%p12 tcgen05.st.sync.aligned.16x32bx2.x8.b32 [%r201 + 0], 8, {%r82, %r83, %r84, %r85, %r86, %r87, %r88, %r89}; 2026-02-21T08:24:19.4018847Z // end inline asm 2026-02-21T08:24:19.4018986Z // begin inline asm 2026-02-21T08:24:19.4019147Z tcgen05.wait::st.sync.aligned; 
2026-02-21T08:24:19.4019309Z // end inline asm 2026-02-21T08:24:19.4019448Z bar.sync 0; 2026-02-21T08:24:19.4019575Z $L__tmp3: 2026-02-21T08:24:19.4019811Z .loc 1 51 87 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:87 2026-02-21T08:24:19.4020103Z ld.shared.b8 %rs9, [%r151]; 2026-02-21T08:24:19.4020271Z ld.shared.b8 %rs10, [%r151+128]; 2026-02-21T08:24:19.4020452Z ld.shared.b8 %rs11, [%r151+256]; 2026-02-21T08:24:19.4020623Z ld.shared.b8 %rs12, [%r151+384]; 2026-02-21T08:24:19.4020854Z ld.shared.b8 %rs13, [%r151+512]; 2026-02-21T08:24:19.4021020Z ld.shared.b8 %rs14, [%r151+640]; 2026-02-21T08:24:19.4021190Z ld.shared.b8 %rs15, [%r151+768]; 2026-02-21T08:24:19.4021354Z ld.shared.b8 %rs16, [%r151+896]; 2026-02-21T08:24:19.4021657Z .loc 1 54 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:54:28 2026-02-21T08:24:19.4021945Z shl.b16 %rs17, %rs9, 4; 2026-02-21T08:24:19.4022100Z shl.b16 %rs18, %rs10, 4; 2026-02-21T08:24:19.4022263Z shl.b16 %rs19, %rs11, 4; 2026-02-21T08:24:19.4022412Z shl.b16 %rs20, %rs12, 4; 2026-02-21T08:24:19.4022570Z shl.b16 %rs21, %rs13, 4; 2026-02-21T08:24:19.4022716Z shl.b16 %rs22, %rs14, 4; 2026-02-21T08:24:19.4022867Z shl.b16 %rs23, %rs15, 4; 2026-02-21T08:24:19.4023012Z shl.b16 %rs24, %rs16, 4; 2026-02-21T08:24:19.4023267Z .loc 1 69 58 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:69:58 2026-02-21T08:24:19.4023609Z selp.b16 %rs25, %rs17, %rs9, %p8; 2026-02-21T08:24:19.4023784Z cvt.s16.s8 %rs26, %rs25; 2026-02-21T08:24:19.4023942Z shr.s16 %rs27, %rs26, 4; 2026-02-21T08:24:19.4024097Z selp.b16 %rs28, %rs18, %rs10, %p8; 2026-02-21T08:24:19.4024280Z cvt.s16.s8 %rs29, %rs28; 2026-02-21T08:24:19.4024433Z shr.s16 %rs30, %rs29, 4; 2026-02-21T08:24:19.4024601Z selp.b16 %rs31, %rs19, %rs11, %p8; 2026-02-21T08:24:19.4024774Z cvt.s16.s8 %rs32, %rs31; 2026-02-21T08:24:19.4024935Z shr.s16 %rs33, %rs32, 4; 2026-02-21T08:24:19.4025094Z selp.b16 %rs34, %rs20, %rs12, %p8; 2026-02-21T08:24:19.4025274Z cvt.s16.s8 %rs35, 
%rs34; 2026-02-21T08:24:19.4025434Z shr.s16 %rs36, %rs35, 4; 2026-02-21T08:24:19.4025591Z selp.b16 %rs37, %rs21, %rs13, %p8; 2026-02-21T08:24:19.4025773Z cvt.s16.s8 %rs38, %rs37; 2026-02-21T08:24:19.4025926Z shr.s16 %rs39, %rs38, 4; 2026-02-21T08:24:19.4026095Z selp.b16 %rs40, %rs22, %rs14, %p8; 2026-02-21T08:24:19.4026267Z cvt.s16.s8 %rs41, %rs40; 2026-02-21T08:24:19.4026435Z shr.s16 %rs42, %rs41, 4; 2026-02-21T08:24:19.4026599Z selp.b16 %rs43, %rs23, %rs15, %p8; 2026-02-21T08:24:19.4026777Z cvt.s16.s8 %rs44, %rs43; 2026-02-21T08:24:19.4026931Z shr.s16 %rs45, %rs44, 4; 2026-02-21T08:24:19.4027094Z selp.b16 %rs46, %rs24, %rs16, %p8; 2026-02-21T08:24:19.4027269Z cvt.s16.s8 %rs47, %rs46; 2026-02-21T08:24:19.4027419Z shr.s16 %rs48, %rs47, 4; 2026-02-21T08:24:19.4027686Z .loc 1 74 32 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:74:32 2026-02-21T08:24:19.4027976Z cvt.rn.f32.s16 %r166, %rs27; 2026-02-21T08:24:19.4028155Z cvt.rn.f32.s16 %r167, %rs30; 2026-02-21T08:24:19.4028316Z cvt.rn.f32.s16 %r168, %rs33; 2026-02-21T08:24:19.4028484Z cvt.rn.f32.s16 %r169, %rs36; 2026-02-21T08:24:19.4028643Z cvt.rn.f32.s16 %r170, %rs39; 2026-02-21T08:24:19.4028809Z cvt.rn.f32.s16 %r171, %rs42; 2026-02-21T08:24:19.4028977Z cvt.rn.f32.s16 %r172, %rs45; 2026-02-21T08:24:19.4029135Z cvt.rn.f32.s16 %r173, %rs48; 2026-02-21T08:24:19.4029307Z st.shared.b32 [%r14], %r166; 2026-02-21T08:24:19.4029471Z st.shared.b32 [%r15], %r167; 2026-02-21T08:24:19.4029639Z st.shared.b32 [%r16], %r168; 2026-02-21T08:24:19.4029801Z st.shared.b32 [%r17], %r169; 2026-02-21T08:24:19.4029969Z st.shared.b32 [%r18], %r170; 2026-02-21T08:24:19.4030129Z st.shared.b32 [%r19], %r171; 2026-02-21T08:24:19.4030298Z st.shared.b32 [%r20], %r172; 2026-02-21T08:24:19.4030467Z st.shared.b32 [%r21], %r173; 2026-02-21T08:24:19.4030621Z $L__tmp4: 2026-02-21T08:24:19.4030922Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4031263Z // begin inline 
asm 2026-02-21T08:24:19.4031436Z fence.proxy.async.shared::cta; 2026-02-21T08:24:19.4031645Z // end inline asm 2026-02-21T08:24:19.4031794Z bar.sync 0; 2026-02-21T08:24:19.4031937Z setp.ne.b32 %p9, %r27, 0; 2026-02-21T08:24:19.4032110Z @%p9 bra $L__BB0_3; 2026-02-21T08:24:19.4032266Z // %bb.2: 2026-02-21T08:24:19.4032411Z elect.sync %r186|%p11, -1; 2026-02-21T08:24:19.4032648Z mov.b32 %r176, 68159760; 2026-02-21T08:24:19.4032799Z mov.pred %p10, 0; 2026-02-21T08:24:19.4032950Z // begin inline asm 2026-02-21T08:24:19.4033199Z @%p11 tcgen05.mma.cta_group::1.kind::tf32 [ %r304 + 0 ], [ %r175 + 0 ], %rd38, %r176, %p10; 2026-02-21T08:24:19.4033478Z // end inline asm 2026-02-21T08:24:19.4033616Z // begin inline asm 2026-02-21T08:24:19.4033857Z @%p11 tcgen05.mma.cta_group::1.kind::tf32 [ %r304 + 0 ], [ %r175 + 8 ], %rd39, %r176, %p12; 2026-02-21T08:24:19.4034129Z // end inline asm 2026-02-21T08:24:19.4034268Z // begin inline asm 2026-02-21T08:24:19.4034505Z @%p11 tcgen05.mma.cta_group::1.kind::tf32 [ %r304 + 0 ], [ %r175 + 16 ], %rd40, %r176, %p12; 2026-02-21T08:24:19.4034766Z // end inline asm 2026-02-21T08:24:19.4034912Z // begin inline asm 2026-02-21T08:24:19.4035175Z @%p11 tcgen05.mma.cta_group::1.kind::tf32 [ %r304 + 0 ], [ %r175 + 24 ], %rd41, %r176, %p12; 2026-02-21T08:24:19.4035445Z // end inline asm 2026-02-21T08:24:19.4035650Z add.s32 %r188, %r41, 33792; 2026-02-21T08:24:19.4035827Z cvt.u64.u32 %rd42, %r188; 2026-02-21T08:24:19.4035994Z // begin inline asm 2026-02-21T08:24:19.4036209Z @%p11 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd42]; 2026-02-21T08:24:19.4036452Z // end inline asm 2026-02-21T08:24:19.4036583Z $L__tmp5: 2026-02-21T08:24:19.4036714Z $L__BB0_3: 2026-02-21T08:24:19.4036875Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:24:19.4037083Z add.s32 %r22, %r41, %r141; 2026-02-21T08:24:19.4037241Z add.s32 %r23, %r41, %r142; 2026-02-21T08:24:19.4037403Z add.s32 %r24, %r41, %r147; 2026-02-21T08:24:19.4037575Z mad.wide.u32 %rd7, %r133, 2, 
%rd20; 2026-02-21T08:24:19.4037749Z cvt.u64.u32 %rd8, %r154; 2026-02-21T08:24:19.4038018Z .loc 1 45 80 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:45:80 2026-02-21T08:24:19.4038297Z // begin inline asm 2026-02-21T08:24:19.4038509Z cp.async.cg.shared.global [ %r189 + 0 ], [ %rd43 + 0 ], 0x10, %r62; 2026-02-21T08:24:19.4038739Z // end inline asm 2026-02-21T08:24:19.4038891Z cp.async.commit_group; 2026-02-21T08:24:19.4039149Z .loc 1 51 34 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:34 2026-02-21T08:24:19.4039442Z add.s64 %rd44, %rd22, 573440; 2026-02-21T08:24:19.4039712Z .loc 1 51 87 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:87 2026-02-21T08:24:19.4039989Z // begin inline asm 2026-02-21T08:24:19.4040195Z cp.async.ca.shared.global [ %r191 + 0 ], [ %rd44 + 0 ], 0x4, %r64; 2026-02-21T08:24:19.4040415Z // end inline asm 2026-02-21T08:24:19.4040562Z cp.async.commit_group; 2026-02-21T08:24:19.4040820Z .loc 1 37 79 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:37:79 2026-02-21T08:24:19.4041109Z add.s32 %r196, %r6, %r5; 2026-02-21T08:24:19.4041272Z mad.wide.u32 %rd46, %r196, 2, %rd18; 2026-02-21T08:24:19.4041445Z add.s64 %rd93, %rd46, 384; 2026-02-21T08:24:19.4041629Z add.s32 %r197, %r26, %r25; 2026-02-21T08:24:19.4041787Z cvt.u64.u32 %rd47, %r197; 2026-02-21T08:24:19.4041954Z mad.lo.s64 %rd48, %rd6, 7168, %rd47; 2026-02-21T08:24:19.4042126Z add.s64 %rd49, %rd48, %rd19; 2026-02-21T08:24:19.4042293Z add.s64 %rd92, %rd49, 688128; 2026-02-21T08:24:19.4042449Z mov.b32 %r308, 1; 2026-02-21T08:24:19.4042594Z mov.b64 %rd94, -16; 2026-02-21T08:24:19.4042735Z mov.b32 %r307, %r305; 2026-02-21T08:24:19.4042887Z mov.b32 %r309, %r305; 2026-02-21T08:24:19.4043035Z bra.uni $L__BB0_4; 2026-02-21T08:24:19.4043224Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T08:24:19.4043550Z .loc 1 37 79 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:37:79 2026-02-21T08:24:19.4043828Z add.s64 %rd94, %rd94, 16; 
2026-02-21T08:24:19.4043998Z setp.lt.u64 %p32, %rd94, 4000; 2026-02-21T08:24:19.4044159Z $L__tmp6: 2026-02-21T08:24:19.4044449Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4044837Z add.s32 %r248, %r308, 1; 2026-02-21T08:24:19.4044995Z setp.gt.s32 %p33, %r248, 1; 2026-02-21T08:24:19.4045170Z selp.b32 %r308, 0, %r248, %p33; 2026-02-21T08:24:19.4045340Z selp.b32 %r249, 1, 0, %p33; 2026-02-21T08:24:19.4045506Z xor.b32 %r309, %r251, %r249; 2026-02-21T08:24:19.4045656Z $L__tmp7: 2026-02-21T08:24:19.4045903Z .loc 1 45 80 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:45:80 2026-02-21T08:24:19.4046188Z add.s32 %r244, %r36, %r7; 2026-02-21T08:24:19.4046357Z selp.b32 %r245, 16, 0, %p32; 2026-02-21T08:24:19.4046513Z // begin inline asm 2026-02-21T08:24:19.4046723Z cp.async.cg.shared.global [ %r244 + 0 ], [ %rd93 + 0 ], 0x10, %r245; 2026-02-21T08:24:19.4046954Z // end inline asm 2026-02-21T08:24:19.4047095Z cp.async.commit_group; 2026-02-21T08:24:19.4047406Z .loc 1 51 87 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:87 2026-02-21T08:24:19.4047692Z add.s32 %r246, %r37, %r10; 2026-02-21T08:24:19.4047854Z selp.b32 %r247, 4, 0, %p32; 2026-02-21T08:24:19.4048007Z // begin inline asm 2026-02-21T08:24:19.4048228Z cp.async.ca.shared.global [ %r246 + 0 ], [ %rd92 + 0 ], 0x4, %r247; 2026-02-21T08:24:19.4048459Z // end inline asm 2026-02-21T08:24:19.4048599Z cp.async.commit_group; 2026-02-21T08:24:19.4048861Z .loc 1 37 79 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:37:79 2026-02-21T08:24:19.4049146Z add.s64 %rd93, %rd93, 64; 2026-02-21T08:24:19.4049311Z add.s64 %rd92, %rd92, 114688; 2026-02-21T08:24:19.4049475Z setp.lt.u64 %p34, %rd94, 4064; 2026-02-21T08:24:19.4049646Z mov.b32 %r305, %r251; 2026-02-21T08:24:19.4049790Z @%p34 bra $L__BB0_4; 2026-02-21T08:24:19.4049944Z bra.uni $L__BB0_7; 2026-02-21T08:24:19.4050130Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 
2026-02-21T08:24:19.4050464Z .loc 1 0 79 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:0:79 2026-02-21T08:24:19.4050754Z mov.b32 %r251, %r309; 2026-02-21T08:24:19.4051004Z .loc 1 37 79 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:37:79 2026-02-21T08:24:19.4051294Z add.s32 %r210, %r307, 1; 2026-02-21T08:24:19.4051448Z setp.gt.s32 %p22, %r210, 4; 2026-02-21T08:24:19.4051668Z selp.b32 %r307, 0, %r210, %p22; 2026-02-21T08:24:19.4051942Z .loc 1 45 80 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:45:80 2026-02-21T08:24:19.4052233Z cp.async.wait_group 8; 2026-02-21T08:24:19.4052392Z bar.sync 0; 2026-02-21T08:24:19.4052526Z shl.b32 %r211, %r307, 12; 2026-02-21T08:24:19.4052686Z add.s32 %r36, %r41, %r211; 2026-02-21T08:24:19.4052841Z add.s32 %r213, %r36, %r12; 2026-02-21T08:24:19.4053035Z ld.shared.v4.b32 {%r214, %r215, %r216, %r217}, [%r213]; 2026-02-21T08:24:19.4053240Z mov.b32 {%rs49, %rs50}, %r217; 2026-02-21T08:24:19.4053412Z mov.b32 {%rs51, %rs52}, %r216; 2026-02-21T08:24:19.4053574Z mov.b32 {%rs53, %rs54}, %r215; 2026-02-21T08:24:19.4053739Z mov.b32 {%rs55, %rs56}, %r214; 2026-02-21T08:24:19.4054005Z .loc 1 49 32 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:49:32 2026-02-21T08:24:19.4054294Z cvt.f32.bf16 %r202, %rs55; 2026-02-21T08:24:19.4054460Z cvt.f32.bf16 %r203, %rs56; 2026-02-21T08:24:19.4054612Z cvt.f32.bf16 %r204, %rs53; 2026-02-21T08:24:19.4054770Z cvt.f32.bf16 %r205, %rs54; 2026-02-21T08:24:19.4054921Z cvt.f32.bf16 %r206, %rs51; 2026-02-21T08:24:19.4055078Z cvt.f32.bf16 %r207, %rs52; 2026-02-21T08:24:19.4055227Z cvt.f32.bf16 %r208, %rs49; 2026-02-21T08:24:19.4055384Z cvt.f32.bf16 %r209, %rs50; 2026-02-21T08:24:19.4055528Z $L__tmp8: 2026-02-21T08:24:19.4055821Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4056156Z // begin inline asm 2026-02-21T08:24:19.4056297Z 2026-02-21T08:24:19.4056422Z { 2026-02-21T08:24:19.4056597Z .reg 
.pred complete; 2026-02-21T08:24:19.4056750Z waitLoop: 2026-02-21T08:24:19.4056937Z mbarrier.try_wait.parity.shared.b64 complete, [%r306], %r305; 2026-02-21T08:24:19.4057179Z @!complete bra.uni waitLoop; 2026-02-21T08:24:19.4057332Z } 2026-02-21T08:24:19.4057336Z 2026-02-21T08:24:19.4057398Z // end inline asm 2026-02-21T08:24:19.4057459Z mov.pred %p23, -1; 2026-02-21T08:24:19.4057515Z // begin inline asm 2026-02-21T08:24:19.4057735Z @%p23 tcgen05.st.sync.aligned.16x32bx2.x8.b32 [%r201 + 0], 8, {%r202, %r203, %r204, %r205, %r206, %r207, %r208, %r209}; 2026-02-21T08:24:19.4057790Z // end inline asm 2026-02-21T08:24:19.4057847Z // begin inline asm 2026-02-21T08:24:19.4057926Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:24:19.4057980Z // end inline asm 2026-02-21T08:24:19.4058035Z bar.sync 0; 2026-02-21T08:24:19.4058088Z $L__tmp9: 2026-02-21T08:24:19.4058328Z .loc 1 51 87 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:51:87 2026-02-21T08:24:19.4058394Z shl.b32 %r218, %r307, 10; 2026-02-21T08:24:19.4058455Z add.s32 %r219, %r41, %r218; 2026-02-21T08:24:19.4058523Z add.s32 %r37, %r219, 28672; 2026-02-21T08:24:19.4058582Z add.s32 %r220, %r37, %r13; 2026-02-21T08:24:19.4058645Z ld.shared.b8 %rs57, [%r220]; 2026-02-21T08:24:19.4058710Z ld.shared.b8 %rs58, [%r220+128]; 2026-02-21T08:24:19.4058782Z ld.shared.b8 %rs59, [%r220+256]; 2026-02-21T08:24:19.4058842Z ld.shared.b8 %rs60, [%r220+384]; 2026-02-21T08:24:19.4058903Z ld.shared.b8 %rs61, [%r220+512]; 2026-02-21T08:24:19.4058972Z ld.shared.b8 %rs62, [%r220+640]; 2026-02-21T08:24:19.4059033Z ld.shared.b8 %rs63, [%r220+768]; 2026-02-21T08:24:19.4059093Z ld.shared.b8 %rs64, [%r220+896]; 2026-02-21T08:24:19.4059266Z .loc 1 54 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:54:28 2026-02-21T08:24:19.4059328Z shl.b16 %rs65, %rs57, 4; 2026-02-21T08:24:19.4059387Z shl.b16 %rs66, %rs58, 4; 2026-02-21T08:24:19.4059446Z shl.b16 %rs67, %rs59, 4; 2026-02-21T08:24:19.4059515Z shl.b16 %rs68, %rs60, 4; 
2026-02-21T08:24:19.4059573Z shl.b16 %rs69, %rs61, 4; 2026-02-21T08:24:19.4059630Z shl.b16 %rs70, %rs62, 4; 2026-02-21T08:24:19.4059694Z shl.b16 %rs71, %rs63, 4; 2026-02-21T08:24:19.4059751Z shl.b16 %rs72, %rs64, 4; 2026-02-21T08:24:19.4059918Z .loc 1 69 58 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:69:58 2026-02-21T08:24:19.4059985Z selp.b16 %rs73, %rs65, %rs57, %p8; 2026-02-21T08:24:19.4060052Z cvt.s16.s8 %rs74, %rs73; 2026-02-21T08:24:19.4060110Z shr.s16 %rs75, %rs74, 4; 2026-02-21T08:24:19.4060174Z selp.b16 %rs76, %rs66, %rs58, %p8; 2026-02-21T08:24:19.4060238Z cvt.s16.s8 %rs77, %rs76; 2026-02-21T08:24:19.4060294Z shr.s16 %rs78, %rs77, 4; 2026-02-21T08:24:19.4060356Z selp.b16 %rs79, %rs67, %rs59, %p8; 2026-02-21T08:24:19.4060413Z cvt.s16.s8 %rs80, %rs79; 2026-02-21T08:24:19.4060475Z shr.s16 %rs81, %rs80, 4; 2026-02-21T08:24:19.4060539Z selp.b16 %rs82, %rs68, %rs60, %p8; 2026-02-21T08:24:19.4060599Z cvt.s16.s8 %rs83, %rs82; 2026-02-21T08:24:19.4060663Z shr.s16 %rs84, %rs83, 4; 2026-02-21T08:24:19.4060723Z selp.b16 %rs85, %rs69, %rs61, %p8; 2026-02-21T08:24:19.4060780Z cvt.s16.s8 %rs86, %rs85; 2026-02-21T08:24:19.4060842Z shr.s16 %rs87, %rs86, 4; 2026-02-21T08:24:19.4060903Z selp.b16 %rs88, %rs70, %rs62, %p8; 2026-02-21T08:24:19.4060960Z cvt.s16.s8 %rs89, %rs88; 2026-02-21T08:24:19.4061016Z shr.s16 %rs90, %rs89, 4; 2026-02-21T08:24:19.4061085Z selp.b16 %rs91, %rs71, %rs63, %p8; 2026-02-21T08:24:19.4061143Z cvt.s16.s8 %rs92, %rs91; 2026-02-21T08:24:19.4061198Z shr.s16 %rs93, %rs92, 4; 2026-02-21T08:24:19.4061267Z selp.b16 %rs94, %rs72, %rs64, %p8; 2026-02-21T08:24:19.4061323Z cvt.s16.s8 %rs95, %rs94; 2026-02-21T08:24:19.4061379Z shr.s16 %rs96, %rs95, 4; 2026-02-21T08:24:19.4061581Z .loc 1 74 32 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:74:32 2026-02-21T08:24:19.4061656Z cvt.rn.f32.s16 %r221, %rs75; 2026-02-21T08:24:19.4061767Z cvt.rn.f32.s16 %r222, %rs78; 2026-02-21T08:24:19.4061827Z cvt.rn.f32.s16 %r223, %rs81; 
2026-02-21T08:24:19.4061895Z cvt.rn.f32.s16 %r224, %rs84; 2026-02-21T08:24:19.4061954Z cvt.rn.f32.s16 %r225, %rs87; 2026-02-21T08:24:19.4062013Z cvt.rn.f32.s16 %r226, %rs90; 2026-02-21T08:24:19.4062072Z cvt.rn.f32.s16 %r227, %rs93; 2026-02-21T08:24:19.4062139Z cvt.rn.f32.s16 %r228, %rs96; 2026-02-21T08:24:19.4062199Z st.shared.b32 [%r14], %r221; 2026-02-21T08:24:19.4062258Z st.shared.b32 [%r15], %r222; 2026-02-21T08:24:19.4062325Z st.shared.b32 [%r16], %r223; 2026-02-21T08:24:19.4062383Z st.shared.b32 [%r17], %r224; 2026-02-21T08:24:19.4062444Z st.shared.b32 [%r18], %r225; 2026-02-21T08:24:19.4062511Z st.shared.b32 [%r19], %r226; 2026-02-21T08:24:19.4062571Z st.shared.b32 [%r20], %r227; 2026-02-21T08:24:19.4062630Z st.shared.b32 [%r21], %r228; 2026-02-21T08:24:19.4062841Z .loc 1 37 79 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:37:79 2026-02-21T08:24:19.4062912Z shl.b32 %r229, %r308, 3; 2026-02-21T08:24:19.4062970Z add.s32 %r230, %r41, %r229; 2026-02-21T08:24:19.4063029Z add.s32 %r306, %r230, 33792; 2026-02-21T08:24:19.4063089Z $L__tmp10: 2026-02-21T08:24:19.4063303Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4063361Z // begin inline asm 2026-02-21T08:24:19.4063445Z fence.proxy.async.shared::cta; 2026-02-21T08:24:19.4063501Z // end inline asm 2026-02-21T08:24:19.4063557Z bar.sync 0; 2026-02-21T08:24:19.4063617Z @%p9 bra $L__BB0_6; 2026-02-21T08:24:19.4063727Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T08:24:19.4063794Z elect.sync %r243|%p24, -1; 2026-02-21T08:24:19.4063852Z mov.b32 %r233, 68159760; 2026-02-21T08:24:19.4063916Z // begin inline asm 2026-02-21T08:24:19.4064076Z @%p24 tcgen05.mma.cta_group::1.kind::tf32 [ %r304 + 0 ], [ %r175 + 0 ], %rd38, %r233, %p23; 2026-02-21T08:24:19.4064134Z // end inline asm 2026-02-21T08:24:19.4064191Z // begin inline asm 2026-02-21T08:24:19.4064350Z @%p24 tcgen05.mma.cta_group::1.kind::tf32 [ %r304 + 0 ], [ %r175 + 8 ], 
%rd39, %r233, %p23; 2026-02-21T08:24:19.4064405Z // end inline asm 2026-02-21T08:24:19.4064460Z // begin inline asm 2026-02-21T08:24:19.4064616Z @%p24 tcgen05.mma.cta_group::1.kind::tf32 [ %r304 + 0 ], [ %r175 + 16 ], %rd40, %r233, %p23; 2026-02-21T08:24:19.4064670Z // end inline asm 2026-02-21T08:24:19.4064726Z // begin inline asm 2026-02-21T08:24:19.4064876Z @%p24 tcgen05.mma.cta_group::1.kind::tf32 [ %r304 + 0 ], [ %r175 + 24 ], %rd41, %r233, %p23; 2026-02-21T08:24:19.4064932Z // end inline asm 2026-02-21T08:24:19.4064992Z cvt.u64.u32 %rd54, %r306; 2026-02-21T08:24:19.4065048Z // begin inline asm 2026-02-21T08:24:19.4065183Z @%p24 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd54]; 2026-02-21T08:24:19.4065237Z // end inline asm 2026-02-21T08:24:19.4065294Z bra.uni $L__BB0_6; 2026-02-21T08:24:19.4065399Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T08:24:19.4065612Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4065667Z // begin inline asm 2026-02-21T08:24:19.4065724Z 2026-02-21T08:24:19.4065776Z { 2026-02-21T08:24:19.4065837Z .reg .pred complete; 2026-02-21T08:24:19.4065892Z waitLoop: 2026-02-21T08:24:19.4066019Z mbarrier.try_wait.parity.shared.b64 complete, [%r306], %r251; 2026-02-21T08:24:19.4066084Z @!complete bra.uni waitLoop; 2026-02-21T08:24:19.4066134Z } 2026-02-21T08:24:19.4066139Z 2026-02-21T08:24:19.4066201Z // end inline asm 2026-02-21T08:24:19.4066254Z $L__tmp11: 2026-02-21T08:24:19.4066420Z .loc 1 37 79 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:37:79 2026-02-21T08:24:19.4066490Z cp.async.wait_group 0; 2026-02-21T08:24:19.4066545Z bar.sync 0; 2026-02-21T08:24:19.4066606Z add.s32 %r252, %r41, 33792; 2026-02-21T08:24:19.4066704Z // begin inline asm 2026-02-21T08:24:19.4066797Z @%p35 mbarrier.inval.shared::cta.b64 [%r252]; 2026-02-21T08:24:19.4066852Z // end inline asm 2026-02-21T08:24:19.4066906Z bar.sync 0; 2026-02-21T08:24:19.4066970Z // begin 
inline asm 2026-02-21T08:24:19.4067052Z @%p35 mbarrier.inval.shared::cta.b64 [%r60]; 2026-02-21T08:24:19.4067107Z // end inline asm 2026-02-21T08:24:19.4067271Z .loc 1 85 22 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:85:22 2026-02-21T08:24:19.4067339Z shl.b64 %rd59, %rd8, 1; 2026-02-21T08:24:19.4067400Z add.s64 %rd57, %rd7, %rd59; 2026-02-21T08:24:19.4067462Z add.s64 %rd58, %rd57, 458752; 2026-02-21T08:24:19.4067523Z $L__tmp12: 2026-02-21T08:24:19.4067746Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4067804Z // begin inline asm 2026-02-21T08:24:19.4068163Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r254, %r255, %r256, %r257, %r258, %r259, %r260, %r261, %r262, %r263, %r264, %r265, %r266, %r267, %r268, %r269}, [%r270 + 0], 16; 2026-02-21T08:24:19.4068225Z // end inline asm 2026-02-21T08:24:19.4068282Z // begin inline asm 2026-02-21T08:24:19.4068355Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:24:19.4068419Z // end inline asm 2026-02-21T08:24:19.4068481Z cvt.u64.u32 %rd60, %r254; 2026-02-21T08:24:19.4068543Z cvt.u64.u32 %rd61, %r255; 2026-02-21T08:24:19.4068611Z shl.b64 %rd62, %rd61, 32; 2026-02-21T08:24:19.4068673Z or.b64 %rd63, %rd60, %rd62; 2026-02-21T08:24:19.4068727Z $L__tmp13: 2026-02-21T08:24:19.4068906Z .loc 1 84 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:84:28 2026-02-21T08:24:19.4068970Z mov.b64 {%r280, %r281}, %rd63; 2026-02-21T08:24:19.4069043Z cvt.rn.bf16x2.f32 %r282, %r281, %r280; 2026-02-21T08:24:19.4069097Z $L__tmp14: 2026-02-21T08:24:19.4069324Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4069389Z cvt.u64.u32 %rd64, %r256; 2026-02-21T08:24:19.4069449Z cvt.u64.u32 %rd65, %r257; 2026-02-21T08:24:19.4069517Z shl.b64 %rd66, %rd65, 32; 2026-02-21T08:24:19.4069578Z or.b64 %rd67, %rd64, %rd66; 2026-02-21T08:24:19.4069631Z $L__tmp15: 2026-02-21T08:24:19.4069802Z .loc 1 
84 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:84:28 2026-02-21T08:24:19.4069873Z mov.b64 {%r283, %r284}, %rd67; 2026-02-21T08:24:19.4069945Z cvt.rn.bf16x2.f32 %r285, %r284, %r283; 2026-02-21T08:24:19.4070000Z $L__tmp16: 2026-02-21T08:24:19.4070228Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4070289Z cvt.u64.u32 %rd68, %r258; 2026-02-21T08:24:19.4070349Z cvt.u64.u32 %rd69, %r259; 2026-02-21T08:24:19.4070415Z shl.b64 %rd70, %rd69, 32; 2026-02-21T08:24:19.4070476Z or.b64 %rd71, %rd68, %rd70; 2026-02-21T08:24:19.4070532Z $L__tmp17: 2026-02-21T08:24:19.4070706Z .loc 1 84 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:84:28 2026-02-21T08:24:19.4070777Z mov.b64 {%r286, %r287}, %rd71; 2026-02-21T08:24:19.4070847Z cvt.rn.bf16x2.f32 %r288, %r287, %r286; 2026-02-21T08:24:19.4070902Z $L__tmp18: 2026-02-21T08:24:19.4071124Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4071185Z cvt.u64.u32 %rd72, %r260; 2026-02-21T08:24:19.4071245Z cvt.u64.u32 %rd73, %r261; 2026-02-21T08:24:19.4071312Z shl.b64 %rd74, %rd73, 32; 2026-02-21T08:24:19.4071373Z or.b64 %rd75, %rd72, %rd74; 2026-02-21T08:24:19.4071426Z $L__tmp19: 2026-02-21T08:24:19.4071630Z .loc 1 84 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:84:28 2026-02-21T08:24:19.4071702Z mov.b64 {%r289, %r290}, %rd75; 2026-02-21T08:24:19.4071775Z cvt.rn.bf16x2.f32 %r291, %r290, %r289; 2026-02-21T08:24:19.4071888Z $L__tmp20: 2026-02-21T08:24:19.4072112Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4072174Z cvt.u64.u32 %rd76, %r262; 2026-02-21T08:24:19.4072236Z cvt.u64.u32 %rd77, %r263; 2026-02-21T08:24:19.4072305Z shl.b64 %rd78, %rd77, 32; 2026-02-21T08:24:19.4072367Z or.b64 %rd79, %rd76, %rd78; 2026-02-21T08:24:19.4072423Z $L__tmp21: 2026-02-21T08:24:19.4072594Z 
.loc 1 84 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:84:28 2026-02-21T08:24:19.4072665Z mov.b64 {%r292, %r293}, %rd79; 2026-02-21T08:24:19.4072733Z cvt.rn.bf16x2.f32 %r294, %r293, %r292; 2026-02-21T08:24:19.4072788Z $L__tmp22: 2026-02-21T08:24:19.4073006Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4073068Z cvt.u64.u32 %rd80, %r264; 2026-02-21T08:24:19.4073180Z cvt.u64.u32 %rd81, %r265; 2026-02-21T08:24:19.4073245Z shl.b64 %rd82, %rd81, 32; 2026-02-21T08:24:19.4073312Z or.b64 %rd83, %rd80, %rd82; 2026-02-21T08:24:19.4073366Z $L__tmp23: 2026-02-21T08:24:19.4073538Z .loc 1 84 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:84:28 2026-02-21T08:24:19.4073609Z mov.b64 {%r295, %r296}, %rd83; 2026-02-21T08:24:19.4073678Z cvt.rn.bf16x2.f32 %r297, %r296, %r295; 2026-02-21T08:24:19.4073732Z $L__tmp24: 2026-02-21T08:24:19.4073959Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4074019Z cvt.u64.u32 %rd84, %r266; 2026-02-21T08:24:19.4074079Z cvt.u64.u32 %rd85, %r267; 2026-02-21T08:24:19.4074138Z shl.b64 %rd86, %rd85, 32; 2026-02-21T08:24:19.4074206Z or.b64 %rd87, %rd84, %rd86; 2026-02-21T08:24:19.4074260Z $L__tmp25: 2026-02-21T08:24:19.4074435Z .loc 1 84 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:84:28 2026-02-21T08:24:19.4074506Z mov.b64 {%r298, %r299}, %rd87; 2026-02-21T08:24:19.4074574Z cvt.rn.bf16x2.f32 %r300, %r299, %r298; 2026-02-21T08:24:19.4074629Z $L__tmp26: 2026-02-21T08:24:19.4074849Z .loc 2 291 36 // standard.py:291:36 @[ cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:81:40 ] 2026-02-21T08:24:19.4074909Z cvt.u64.u32 %rd88, %r268; 2026-02-21T08:24:19.4074968Z cvt.u64.u32 %rd89, %r269; 2026-02-21T08:24:19.4075028Z shl.b64 %rd90, %rd89, 32; 2026-02-21T08:24:19.4075096Z or.b64 %rd91, %rd88, %rd90; 2026-02-21T08:24:19.4075151Z $L__tmp27: 
2026-02-21T08:24:19.4075320Z .loc 1 84 28 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:84:28 2026-02-21T08:24:19.4075390Z mov.b64 {%r301, %r302}, %rd91; 2026-02-21T08:24:19.4075459Z cvt.rn.bf16x2.f32 %r303, %r302, %r301; 2026-02-21T08:24:19.4075633Z .loc 1 85 81 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:85:81 2026-02-21T08:24:19.4075737Z st.shared.v4.b32 [%r22], {%r282, %r285, %r288, %r291}; 2026-02-21T08:24:19.4075828Z st.shared.v4.b32 [%r23], {%r294, %r297, %r300, %r303}; 2026-02-21T08:24:19.4075884Z bar.sync 0; 2026-02-21T08:24:19.4075982Z ld.shared.v4.b32 {%r275, %r276, %r277, %r278}, [%r24+512]; 2026-02-21T08:24:19.4076077Z ld.shared.v4.b32 {%r271, %r272, %r273, %r274}, [%r24]; 2026-02-21T08:24:19.4076136Z // begin inline asm 2026-02-21T08:24:19.4076238Z st.global.v4.b32 [ %rd57 + 0 ], { %r271, %r272, %r273, %r274 }; 2026-02-21T08:24:19.4076301Z // end inline asm 2026-02-21T08:24:19.4076357Z // begin inline asm 2026-02-21T08:24:19.4076455Z st.global.v4.b32 [ %rd58 + 0 ], { %r275, %r276, %r277, %r278 }; 2026-02-21T08:24:19.4076510Z // end inline asm 2026-02-21T08:24:19.4076599Z $L__BB0_8: // %._crit_edge 2026-02-21T08:24:19.4076759Z .loc 1 22 4 // cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py:22:4 2026-02-21T08:24:19.4076813Z bar.sync 0; 2026-02-21T08:24:19.4076878Z // begin inline asm 2026-02-21T08:24:19.4077054Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r304, 128; 2026-02-21T08:24:19.4077109Z // end inline asm 2026-02-21T08:24:19.4077168Z ret; 2026-02-21T08:24:19.4077220Z $L__tmp28: 2026-02-21T08:24:19.4077274Z $L__func_end0: 2026-02-21T08:24:19.4077357Z // -- End function 2026-02-21T08:24:19.4077415Z } 2026-02-21T08:24:19.4077613Z .file 1 "/tmp/torchinductor_root/u6/cu634mrcalkal53weolcumq72et4dsmjmrsvb76gf3652tuivxak.py" 2026-02-21T08:24:19.4077787Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T08:24:19.4077855Z .section .debug_abbrev 
2026-02-21T08:24:19.4077907Z { 2026-02-21T08:24:19.4077995Z .b8 1 // Abbreviation Code 2026-02-21T08:24:19.4078087Z .b8 17 // DW_TAG_compile_unit 2026-02-21T08:24:19.4078166Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:24:19.4078289Z .b8 37 // DW_AT_producer 2026-02-21T08:24:19.4078368Z .b8 8 // DW_FORM_string 2026-02-21T08:24:19.4078450Z .b8 19 // DW_AT_language 2026-02-21T08:24:19.4078524Z .b8 5 // DW_FORM_data2 2026-02-21T08:24:19.4078598Z .b8 3 // DW_AT_name 2026-02-21T08:24:19.4078680Z .b8 8 // DW_FORM_string 2026-02-21T08:24:19.4078757Z .b8 16 // DW_AT_stmt_list 2026-02-21T08:24:19.4078830Z .b8 6 // DW_FORM_data4 2026-02-21T08:24:19.4078912Z .b8 27 // DW_AT_comp_dir 2026-02-21T08:24:19.4078984Z .b8 8 // DW_FORM_string 2026-02-21T08:24:19.4079056Z .b8 0 // EOM(1) 2026-02-21T08:24:19.4079125Z .b8 0 // EOM(2) 2026-02-21T08:24:19.4079217Z .b8 2 // Abbreviation Code 2026-02-21T08:24:19.4079296Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:24:19.4079369Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:24:19.4079451Z .b8 3 // DW_AT_name 2026-02-21T08:24:19.4079522Z .b8 8 // DW_FORM_string 2026-02-21T08:24:19.4079596Z .b8 32 // DW_AT_inline 2026-02-21T08:24:19.4079677Z .b8 11 // DW_FORM_data1 2026-02-21T08:24:19.4079746Z .b8 0 // EOM(1) 2026-02-21T08:24:19.4079812Z .b8 0 // EOM(2) 2026-02-21T08:24:19.4079890Z .b8 3 // Abbreviation Code 2026-02-21T08:24:19.4079975Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:24:19.4080053Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:24:19.4080129Z .b8 17 // DW_AT_low_pc 2026-02-21T08:24:19.4080211Z .b8 1 // DW_FORM_addr 2026-02-21T08:24:19.4080287Z .b8 18 // DW_AT_high_pc 2026-02-21T08:24:19.4080359Z .b8 1 // DW_FORM_addr 2026-02-21T08:24:19.4080452Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:24:19.4080523Z .b8 19 // DW_FORM_ref4 2026-02-21T08:24:19.4080591Z .b8 0 // EOM(1) 2026-02-21T08:24:19.4080666Z .b8 0 // EOM(2) 2026-02-21T08:24:19.4080747Z .b8 4 // Abbreviation Code 2026-02-21T08:24:19.4080838Z .b8 29 // DW_TAG_inlined_subroutine 
2026-02-21T08:24:19.4080914Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:24:19.4081012Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:24:19.4081122Z .b8 19 // DW_FORM_ref4 2026-02-21T08:24:19.4081193Z .b8 17 // DW_AT_low_pc 2026-02-21T08:24:19.4081271Z .b8 1 // DW_FORM_addr 2026-02-21T08:24:19.4081347Z .b8 18 // DW_AT_high_pc 2026-02-21T08:24:19.4081417Z .b8 1 // DW_FORM_addr 2026-02-21T08:24:19.4081500Z .b8 88 // DW_AT_call_file 2026-02-21T08:24:19.4081599Z .b8 11 // DW_FORM_data1 2026-02-21T08:24:19.4081675Z .b8 89 // DW_AT_call_line 2026-02-21T08:24:19.4081747Z .b8 11 // DW_FORM_data1 2026-02-21T08:24:19.4081832Z .b8 87 // DW_AT_call_column 2026-02-21T08:24:19.4081906Z .b8 11 // DW_FORM_data1 2026-02-21T08:24:19.4082025Z .b8 0 // EOM(1) 2026-02-21T08:24:19.4082104Z .b8 0 // EOM(2) 2026-02-21T08:24:19.4082171Z .b8 0 // EOM(3) 2026-02-21T08:24:19.4082222Z } 2026-02-21T08:24:19.4082290Z .section .debug_info 2026-02-21T08:24:19.4082340Z { 2026-02-21T08:24:19.4082423Z .b32 178 // Length of Unit 2026-02-21T08:24:19.4082508Z .b8 2 // DWARF version number 2026-02-21T08:24:19.4082569Z .b8 0 2026-02-21T08:24:19.4082687Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T08:24:19.4082774Z .b8 8 // Address Size (in bytes) 2026-02-21T08:24:19.4082884Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T08:24:19.4082965Z .b8 116 // DW_AT_producer 2026-02-21T08:24:19.4083019Z .b8 114 2026-02-21T08:24:19.4083080Z .b8 105 2026-02-21T08:24:19.4083134Z .b8 116 2026-02-21T08:24:19.4083186Z .b8 111 2026-02-21T08:24:19.4083238Z .b8 110 2026-02-21T08:24:19.4083295Z .b8 0 2026-02-21T08:24:19.4083370Z .b8 2 // DW_AT_language 2026-02-21T08:24:19.4083421Z .b8 0 2026-02-21T08:24:19.4083501Z .b8 99 // DW_AT_name 2026-02-21T08:24:19.4083552Z .b8 117 2026-02-21T08:24:19.4083603Z .b8 54 2026-02-21T08:24:19.4083653Z .b8 51 2026-02-21T08:24:19.4083711Z .b8 52 2026-02-21T08:24:19.4083763Z .b8 109 2026-02-21T08:24:19.4083812Z .b8 114 2026-02-21T08:24:19.4083863Z .b8 99 2026-02-21T08:24:19.4083921Z .b8 97 2026-02-21T08:24:19.4083972Z .b8 108 2026-02-21T08:24:19.4084021Z .b8 107 2026-02-21T08:24:19.4084077Z .b8 97 2026-02-21T08:24:19.4084128Z .b8 108 2026-02-21T08:24:19.4084176Z .b8 53 2026-02-21T08:24:19.4084225Z .b8 51 2026-02-21T08:24:19.4084282Z .b8 119 2026-02-21T08:24:19.4084332Z .b8 101 2026-02-21T08:24:19.4084382Z .b8 111 2026-02-21T08:24:19.4084438Z .b8 108 2026-02-21T08:24:19.4084487Z .b8 99 2026-02-21T08:24:19.4084539Z .b8 117 2026-02-21T08:24:19.4084591Z .b8 109 2026-02-21T08:24:19.4084647Z .b8 113 2026-02-21T08:24:19.4084695Z .b8 55 2026-02-21T08:24:19.4084744Z .b8 50 2026-02-21T08:24:19.4084794Z .b8 101 2026-02-21T08:24:19.4084851Z .b8 116 2026-02-21T08:24:19.4084901Z .b8 52 2026-02-21T08:24:19.4084950Z .b8 100 2026-02-21T08:24:19.4085006Z .b8 115 2026-02-21T08:24:19.4085056Z .b8 109 2026-02-21T08:24:19.4085107Z .b8 106 2026-02-21T08:24:19.4085157Z .b8 109 2026-02-21T08:24:19.4085214Z .b8 114 2026-02-21T08:24:19.4085264Z .b8 115 2026-02-21T08:24:19.4085313Z .b8 118 2026-02-21T08:24:19.4085370Z .b8 98 2026-02-21T08:24:19.4085420Z .b8 55 2026-02-21T08:24:19.4085470Z .b8 54 2026-02-21T08:24:19.4085519Z .b8 103 
2026-02-21T08:24:19.4085578Z .b8 102 2026-02-21T08:24:19.4085628Z .b8 51 2026-02-21T08:24:19.4085677Z .b8 54 2026-02-21T08:24:19.4085734Z .b8 53 2026-02-21T08:24:19.4085783Z .b8 50 2026-02-21T08:24:19.4085834Z .b8 116 2026-02-21T08:24:19.4085884Z .b8 117 2026-02-21T08:24:19.4085941Z .b8 105 2026-02-21T08:24:19.4085993Z .b8 118 2026-02-21T08:24:19.4086094Z .b8 120 2026-02-21T08:24:19.4086143Z .b8 97 2026-02-21T08:24:19.4086201Z .b8 107 2026-02-21T08:24:19.4086250Z .b8 46 2026-02-21T08:24:19.4086301Z .b8 112 2026-02-21T08:24:19.4086359Z .b8 121 2026-02-21T08:24:19.4086408Z .b8 0 2026-02-21T08:24:19.4086499Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T08:24:19.4086576Z .b8 47 // DW_AT_comp_dir 2026-02-21T08:24:19.4086634Z .b8 116 2026-02-21T08:24:19.4086684Z .b8 109 2026-02-21T08:24:19.4086736Z .b8 112 2026-02-21T08:24:19.4086796Z .b8 47 2026-02-21T08:24:19.4086847Z .b8 116 2026-02-21T08:24:19.4086899Z .b8 111 2026-02-21T08:24:19.4086950Z .b8 114 2026-02-21T08:24:19.4087011Z .b8 99 2026-02-21T08:24:19.4087063Z .b8 104 2026-02-21T08:24:19.4087113Z .b8 105 2026-02-21T08:24:19.4087171Z .b8 110 2026-02-21T08:24:19.4087221Z .b8 100 2026-02-21T08:24:19.4087272Z .b8 117 2026-02-21T08:24:19.4087322Z .b8 99 2026-02-21T08:24:19.4087380Z .b8 116 2026-02-21T08:24:19.4087474Z .b8 111 2026-02-21T08:24:19.4087527Z .b8 114 2026-02-21T08:24:19.4087577Z .b8 95 2026-02-21T08:24:19.4087634Z .b8 114 2026-02-21T08:24:19.4087684Z .b8 111 2026-02-21T08:24:19.4087735Z .b8 111 2026-02-21T08:24:19.4087791Z .b8 116 2026-02-21T08:24:19.4087842Z .b8 47 2026-02-21T08:24:19.4087891Z .b8 117 2026-02-21T08:24:19.4087941Z .b8 54 2026-02-21T08:24:19.4087999Z .b8 0 2026-02-21T08:24:19.4088099Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T08:24:19.4088172Z .b8 95 // DW_AT_name 2026-02-21T08:24:19.4088229Z .b8 104 2026-02-21T08:24:19.4088280Z .b8 101 2026-02-21T08:24:19.4088330Z .b8 108 2026-02-21T08:24:19.4088380Z .b8 105 2026-02-21T08:24:19.4088438Z .b8 111 2026-02-21T08:24:19.4088488Z 
.b8 110 2026-02-21T08:24:19.4088537Z .b8 95 2026-02-21T08:24:19.4088594Z .b8 109 2026-02-21T08:24:19.4088644Z .b8 97 2026-02-21T08:24:19.4088693Z .b8 116 2026-02-21T08:24:19.4088742Z .b8 109 2026-02-21T08:24:19.4088802Z .b8 117 2026-02-21T08:24:19.4088854Z .b8 108 2026-02-21T08:24:19.4088903Z .b8 95 2026-02-21T08:24:19.4088960Z .b8 98 2026-02-21T08:24:19.4089010Z .b8 102 2026-02-21T08:24:19.4089060Z .b8 49 2026-02-21T08:24:19.4089110Z .b8 54 2026-02-21T08:24:19.4089167Z .b8 95 2026-02-21T08:24:19.4089217Z .b8 105 2026-02-21T08:24:19.4089268Z .b8 110 2026-02-21T08:24:19.4089318Z .b8 116 2026-02-21T08:24:19.4089374Z .b8 52 2026-02-21T08:24:19.4089424Z .b8 0 2026-02-21T08:24:19.4089499Z .b8 1 // DW_AT_inline 2026-02-21T08:24:19.4089603Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T08:24:19.4089690Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T08:24:19.4089777Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T08:24:19.4089871Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:24:19.4089983Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T08:24:19.4090074Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:24:19.4090155Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T08:24:19.4090246Z .b64 $L__tmp27 // DW_AT_high_pc 2026-02-21T08:24:19.4090322Z .b8 1 // DW_AT_call_file 2026-02-21T08:24:19.4090398Z .b8 81 // DW_AT_call_line 2026-02-21T08:24:19.4090484Z .b8 40 // DW_AT_call_column 2026-02-21T08:24:19.4090565Z .b8 0 // End Of Children Mark 2026-02-21T08:24:19.4090646Z .b8 0 // End Of Children Mark 2026-02-21T08:24:19.4090702Z } 2026-02-21T08:24:19.4090767Z .section .debug_macinfo { } 2026-02-21T08:24:19.4090771Z 2026-02-21T08:24:19.4090847Z ================================================================ 2026-02-21T08:24:19.4090959Z please share the reproducer above with Triton project. 
2026-02-21T08:24:20.2883335Z [56s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=64, num_stages=6, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[4, 0], range_warp_specializes=[False, None]) 2026-02-21T08:24:20.2884708Z Tensor-likes are not close! 2026-02-21T08:24:20.2889316Z 2026-02-21T08:24:20.2893880Z Mismatched elements: 455802 / 458752 (99.4%) 2026-02-21T08:24:20.2896043Z Greatest absolute difference: 1792.0 at index (19, 6877) (up to 0.01 allowed) 2026-02-21T08:24:20.2896437Z Greatest relative difference: 380928.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T08:24:20.2896753Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:24:20.2896937Z 2026-02-21T08:24:20.5319522Z [56s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=128, num_stages=6, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 0], range_unroll_factors=[4, 0], range_warp_specializes=[False, None]) 2026-02-21T08:24:20.5320704Z Tensor-likes are not close! 2026-02-21T08:24:20.5325760Z 2026-02-21T08:24:20.5325965Z Mismatched elements: 456598 / 458752 (99.5%) 2026-02-21T08:24:20.5326304Z Greatest absolute difference: 3248.0 at index (36, 5618) (up to 0.01 allowed) 2026-02-21T08:24:20.5331200Z Greatest relative difference: 342016.0 at index (41, 3764) (up to 0.01 allowed) 2026-02-21T08:24:20.5335538Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:24:20.5337443Z 2026-02-21T08:24:20.8309895Z [56s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=64, num_stages=6, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[4, 0], range_warp_specializes=[False, None]) 2026-02-21T08:24:20.8311080Z Tensor-likes are not close! 2026-02-21T08:24:20.8311227Z 2026-02-21T08:24:20.8311441Z Mismatched elements: 457643 / 458752 (99.8%) 2026-02-21T08:24:20.8311882Z Greatest absolute difference: 4160.0 at index (29, 1901) (up to 0.01 allowed) 2026-02-21T08:24:20.8316220Z Greatest relative difference: 761856.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T08:24:20.8317678Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:24:20.8317906Z 2026-02-21T08:24:21.2127000Z [57s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 32], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]) 2026-02-21T08:24:21.2128132Z Tensor-likes are not close! 2026-02-21T08:24:21.2132646Z 2026-02-21T08:24:21.2136160Z Mismatched elements: 456598 / 458752 (99.5%) 2026-02-21T08:24:21.2139967Z Greatest absolute difference: 3248.0 at index (36, 5618) (up to 0.01 allowed) 2026-02-21T08:24:21.2140414Z Greatest relative difference: 342016.0 at index (41, 3764) (up to 0.01 allowed) 2026-02-21T08:24:21.2140762Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:24:21.2145617Z 2026-02-21T08:24:21.4101424Z [57s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]) 2026-02-21T08:24:21.4102816Z Tensor-likes are not close! 2026-02-21T08:24:21.4102942Z 2026-02-21T08:24:21.4103023Z Mismatched elements: 455802 / 458752 (99.4%) 2026-02-21T08:24:21.4103295Z Greatest absolute difference: 1792.0 at index (19, 6877) (up to 0.01 allowed) 2026-02-21T08:24:21.4103627Z Greatest relative difference: 380928.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T08:24:21.4103934Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:24:21.4104095Z 2026-02-21T08:24:24.8772698Z 2026-02-21T08:24:24.8778899Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 104/104 16.0 configs/s 2026-02-21T08:24:25.1731418Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3099.0 2026-02-21T08:24:25.1733143Z configs/s 2026-02-21T08:24:25.2135771Z [61s] Generation 1 complete: 2026-02-21T08:24:25.2141080Z error=13 2026-02-21T08:24:25.2142936Z ok=93 2026-02-21T08:24:25.2143150Z min=0.1117 2026-02-21T08:24:25.2148039Z mid=0.5878 2026-02-21T08:24:25.2150013Z max=17.2447 2026-02-21T08:24:25.2150196Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:24:25.2150433Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:24:25.2150643Z 'l2_groupings': [1], 2026-02-21T08:24:25.2150836Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:24:25.2151036Z 'loop_orders': [[0, 1]], 2026-02-21T08:24:25.2151198Z 'num_stages': 6, 2026-02-21T08:24:25.2151336Z 'num_warps': 8, 2026-02-21T08:24:25.2151484Z 'pid_type': 
'flat', 2026-02-21T08:24:25.2151707Z 'range_flattens': [None, None], 2026-02-21T08:24:25.2151925Z 'range_multi_buffers': [None, True], 2026-02-21T08:24:25.2152133Z 'range_num_stages': [0, 0], 2026-02-21T08:24:25.2152303Z 'range_unroll_factors': [0, 1], 2026-02-21T08:24:25.2152497Z 'range_warp_specializes': [None, True]} 2026-02-21T08:24:25.2156371Z [61s] Fitting surrogate: 206 points, 206 targets 2026-02-21T08:24:26.4637156Z [62s] Generation 2 starting: 92 neighbors, 5 active search path(s) 2026-02-21T08:24:39.0735881Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 2.3 configs/s 2026-02-21T08:24:45.4521894Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 15.0 configs/s 2026-02-21T08:24:49.2123759Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 265.6 2026-02-21T08:24:49.2128462Z configs/s 2026-02-21T08:24:49.3419732Z [85s] Generation 2 complete: 2026-02-21T08:24:49.3424064Z ok=98 2026-02-21T08:24:49.3425793Z min=0.1138 2026-02-21T08:24:49.3425980Z mid=0.3082 2026-02-21T08:24:49.3426176Z max=10.5903 2026-02-21T08:24:49.3426323Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:24:49.3430707Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:24:49.3435077Z 'l2_groupings': [1], 2026-02-21T08:24:49.3437069Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:24:49.3437311Z 'loop_orders': [[0, 1]], 2026-02-21T08:24:49.3437474Z 'num_stages': 5, 2026-02-21T08:24:49.3437624Z 'num_warps': 4, 2026-02-21T08:24:49.3437768Z 'pid_type': 'flat', 2026-02-21T08:24:49.3437989Z 'range_flattens': [None, None], 2026-02-21T08:24:49.3438184Z 'range_multi_buffers': [None, True], 2026-02-21T08:24:49.3438368Z 'range_num_stages': [0, 0], 2026-02-21T08:24:49.3438545Z 'range_unroll_factors': [0, 1], 2026-02-21T08:24:49.3438726Z 'range_warp_specializes': [None, True]} 2026-02-21T08:24:49.3439017Z [85s] Fitting surrogate: 304 points, 304 targets 2026-02-21T08:24:50.5398964Z [86s] Generation 3 starting: 90 neighbors, 5 
active search path(s) 2026-02-21T08:25:03.9317212Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 1.8 configs/s 2026-02-21T08:25:10.3143479Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 93/93 14.6 configs/s 2026-02-21T08:25:18.0192564Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 130.2 2026-02-21T08:25:18.0194352Z configs/s 2026-02-21T08:25:18.2478531Z [114s] Generation 3 complete: 2026-02-21T08:25:18.2482856Z ok=95 2026-02-21T08:25:18.2484369Z min=0.1135 2026-02-21T08:25:18.2484592Z mid=0.1783 2026-02-21T08:25:18.2489406Z max=43.4350 2026-02-21T08:25:18.2493148Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:25:18.2494579Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:25:18.2494812Z 'l2_groupings': [1], 2026-02-21T08:25:18.2495005Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:25:18.2495205Z 'loop_orders': [[0, 1]], 2026-02-21T08:25:18.2495373Z 'num_stages': 5, 2026-02-21T08:25:18.2495826Z 'num_warps': 4, 2026-02-21T08:25:18.2496004Z 'pid_type': 'flat', 2026-02-21T08:25:18.2496168Z 'range_flattens': [None, None], 2026-02-21T08:25:18.2496346Z 'range_multi_buffers': [None, True], 2026-02-21T08:25:18.2496535Z 'range_num_stages': [0, 0], 2026-02-21T08:25:18.2496700Z 'range_unroll_factors': [0, 1], 2026-02-21T08:25:18.2496885Z 'range_warp_specializes': [None, True]} 2026-02-21T08:25:18.2502228Z [114s] Fitting surrogate: 399 points, 399 targets 2026-02-21T08:25:19.2560982Z [115s] Generation 4 starting: 70 neighbors, 4 active search path(s) 2026-02-21T08:25:31.5125846Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 1.7 configs/s 2026-02-21T08:25:32.4239469Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T08:25:32.4241155Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:25:32.4241495Z ^ 2026-02-21T08:25:32.4242205Z 
/tmp/torchinductor_root/rw/crwoiwwlfl2j3cwmgddgfhle73qse6spgwm26iq5djgvhtraqgil.py:91:40: note: called from 2026-02-21T08:25:32.4242738Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T08:25:32.4246834Z ^ 2026-02-21T08:25:32.4248866Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T08:25:32.4249374Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:25:32.4249639Z ^ 2026-02-21T08:25:32.4249915Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T08:25:32.4254860Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:25:32.4255556Z %cst = arith.constant dense<0> : tensor<32x2x64xi8> 2026-02-21T08:25:32.4255849Z %c224_i32 = arith.constant 224 : i32 2026-02-21T08:25:32.4256050Z %c112_i32 = arith.constant 112 : i32 2026-02-21T08:25:32.4256233Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:25:32.4256424Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:25:32.4256606Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:25:32.4256797Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:25:32.4256984Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:25:32.4257194Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:25:32.4257433Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:25:32.4257662Z %cst_2 = arith.constant dense<4> : tensor<32x64xi8> 2026-02-21T08:25:32.4257903Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T08:25:32.4258138Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T08:25:32.4258399Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T08:25:32.4258989Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:25:32.4259179Z %c64_i32 = arith.constant 64 : i32 
2026-02-21T08:25:32.4259370Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T08:25:32.4259554Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T08:25:32.4259744Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:25:32.4260062Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T08:25:32.4260388Z %1 = tt.get_program_id x : i32 2026-02-21T08:25:32.4260598Z scf.for %arg3 = %1 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T08:25:32.4260830Z %2 = arith.divsi %arg3, %c224_i32 : i32 2026-02-21T08:25:32.4261024Z %3 = arith.muli %2, %c2_i32 : i32 2026-02-21T08:25:32.4261200Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T08:25:32.4261384Z %5 = arith.minsi %4, %c2_i32 : i32 2026-02-21T08:25:32.4261651Z %6 = arith.remsi %arg3, %c224_i32 : i32 2026-02-21T08:25:32.4261991Z %7 = arith.remsi %6, %5 : i32 2026-02-21T08:25:32.4262176Z %8 = arith.addi %3, %7 : i32 2026-02-21T08:25:32.4262358Z %9 = arith.divsi %6, %5 : i32 2026-02-21T08:25:32.4262537Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T08:25:32.4262786Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:25:32.4263054Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T08:25:32.4263258Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T08:25:32.4263460Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T08:25:32.4263652Z %15 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T08:25:32.4263858Z %16 = arith.addi %15, %11 : tensor<64xi32> 2026-02-21T08:25:32.4264055Z %c64_i32_6 = arith.constant 64 : i32 2026-02-21T08:25:32.4264392Z %17 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_6 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T08:25:32.4264774Z %19 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:25:32.4265040Z %20 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T08:25:32.4265268Z %21 = arith.addi %20, %19 : tensor<32xi32> 2026-02-21T08:25:32.4265472Z %22 = arith.muli %arg4, 
%c2_i32 : i32 2026-02-21T08:25:32.4265682Z %23 = tt.splat %22 : i32 -> tensor<64xi32> 2026-02-21T08:25:32.4265885Z %24 = arith.addi %23, %11 : tensor<64xi32> 2026-02-21T08:25:32.4266157Z %25 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T08:25:32.4266441Z %26 = arith.muli %25, %cst_4 : tensor<64x1xi32> 2026-02-21T08:25:32.4266703Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:25:32.4267004Z %28 = tt.broadcast %26 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T08:25:32.4267271Z %29 = tt.broadcast %27 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T08:25:32.4267519Z %30 = arith.addi %28, %29 : tensor<64x64xi32> 2026-02-21T08:25:32.4267769Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T08:25:32.4268067Z %32 = tt.addptr %31, %30 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T08:25:32.4268380Z %33 = tt.load %32 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T08:25:32.4268679Z %34 = arith.extf %33 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T08:25:32.4268968Z %35 = tt.expand_dims %21 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:25:32.4269235Z %36 = arith.muli %35, %cst_3 : tensor<32x1xi32> 2026-02-21T08:25:32.4269502Z %37 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:25:32.4269796Z %38 = tt.broadcast %36 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T08:25:32.4270057Z %39 = tt.broadcast %37 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T08:25:32.4270301Z %40 = arith.addi %38, %39 : tensor<32x64xi32> 2026-02-21T08:25:32.4270615Z %41 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T08:25:32.4270893Z %42 = tt.addptr %41, %40 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T08:25:32.4271187Z %43 = tt.load %42 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T08:25:32.4271454Z %44 = arith.shli %43, %cst_2 : tensor<32x64xi8> 
2026-02-21T08:25:32.4271720Z %45 = arith.shrsi %44, %cst_2 : tensor<32x64xi8> 2026-02-21T08:25:32.4271957Z %46 = arith.shrsi %43, %cst_2 : tensor<32x64xi8> 2026-02-21T08:25:32.4272205Z %47 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:25:32.4272501Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:25:32.4272808Z %49 = tt.expand_dims %48 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:25:32.4273198Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T08:25:32.4273574Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T08:25:32.4273860Z %52 = arith.cmpi eq, %49, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:25:32.4274116Z %53 = tt.broadcast %52 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T08:25:32.4274383Z %54 = tt.broadcast %50 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T08:25:32.4274675Z %55 = arith.select %53, %54, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T08:25:32.4274944Z %56 = arith.cmpi eq, %49, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:25:32.4275197Z %57 = tt.broadcast %51 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T08:25:32.4275460Z %58 = tt.broadcast %56 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T08:25:32.4275740Z %59 = arith.select %58, %57, %55 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T08:25:32.4276019Z %60 = tt.reshape %59 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T08:25:32.4276277Z %61 = arith.sitofp %60 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T08:25:32.4276637Z %62 = tt.dot %34, %61, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T08:25:32.4276957Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T08:25:32.4277167Z %63 = arith.muli %c32_i32, %c1_i32_7 : i32 2026-02-21T08:25:32.4277368Z %64 = arith.addi %arg4, %63 : i32 2026-02-21T08:25:32.4277613Z %65 = 
tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:25:32.4277880Z %66 = tt.splat %64 : i32 -> tensor<32xi32> 2026-02-21T08:25:32.4278087Z %67 = arith.addi %66, %65 : tensor<32xi32> 2026-02-21T08:25:32.4278295Z %68 = arith.muli %64, %c2_i32 : i32 2026-02-21T08:25:32.4278496Z %69 = tt.splat %68 : i32 -> tensor<64xi32> 2026-02-21T08:25:32.4278708Z %70 = arith.addi %69, %11 : tensor<64xi32> 2026-02-21T08:25:32.4278974Z %71 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T08:25:32.4279262Z %72 = arith.muli %71, %cst_4 : tensor<64x1xi32> 2026-02-21T08:25:32.4279535Z %73 = tt.expand_dims %70 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:25:32.4279831Z %74 = tt.broadcast %72 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T08:25:32.4280104Z %75 = tt.broadcast %73 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T08:25:32.4280350Z %76 = arith.addi %74, %75 : tensor<64x64xi32> 2026-02-21T08:25:32.4280608Z %77 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T08:25:32.4280903Z %78 = tt.addptr %77, %76 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T08:25:32.4281222Z %79 = tt.load %78 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T08:25:32.4281528Z %80 = arith.extf %79 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T08:25:32.4281852Z %81 = tt.expand_dims %67 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:25:32.4282192Z %82 = arith.muli %81, %cst_3 : tensor<32x1xi32> 2026-02-21T08:25:32.4282453Z %83 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:25:32.4282753Z %84 = tt.broadcast %82 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T08:25:32.4283024Z %85 = tt.broadcast %83 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T08:25:32.4283266Z %86 = arith.addi %84, %85 : tensor<32x64xi32> 2026-02-21T08:25:32.4283519Z %87 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 
2026-02-21T08:25:32.4283800Z %88 = tt.addptr %87, %86 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T08:25:32.4284112Z %89 = tt.load %88 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T08:25:32.4284391Z %90 = arith.shli %89, %cst_2 : tensor<32x64xi8> 2026-02-21T08:25:32.4284666Z %91 = arith.shrsi %90, %cst_2 : tensor<32x64xi8> 2026-02-21T08:25:32.4284891Z %92 = arith.shrsi %89, %cst_2 : tensor<32x64xi8> 2026-02-21T08:25:32.4285132Z %93 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:25:32.4285426Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:25:32.4285728Z %95 = tt.expand_dims %94 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:25:32.4286047Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T08:25:32.4286357Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T08:25:32.4286641Z %98 = arith.cmpi eq, %95, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:25:32.4286893Z %99 = tt.broadcast %98 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T08:25:32.4287159Z %100 = tt.broadcast %96 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T08:25:32.4287457Z %101 = arith.select %99, %100, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T08:25:32.4287729Z %102 = arith.cmpi eq, %95, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:25:32.4287984Z %103 = tt.broadcast %97 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T08:25:32.4288259Z %104 = tt.broadcast %102 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T08:25:32.4288543Z %105 = arith.select %104, %103, %101 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T08:25:32.4288831Z %106 = tt.reshape %105 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T08:25:32.4289093Z %107 = arith.sitofp %106 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T08:25:32.4289447Z %108 = tt.dot %80, %107, %62, inputPrecision = tf32 : 
tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T08:25:32.4289762Z scf.yield %108 : tensor<64x64xf32> 2026-02-21T08:25:32.4289943Z } 2026-02-21T08:25:32.4290126Z %18 = arith.truncf %17 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T08:25:32.4290447Z tt.descriptor_store %0[%10, %14], %18 : !tt.tensordesc>, tensor<64x64xbf16> 2026-02-21T08:25:32.4290770Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:25:32.4290971Z tt.return 2026-02-21T08:25:32.4291106Z } 2026-02-21T08:25:32.4291229Z } 2026-02-21T08:25:32.4291306Z 2026-02-21T08:25:32.4291358Z {-# 2026-02-21T08:25:32.4291488Z external_resources: { 2026-02-21T08:25:32.4291690Z mlir_reproducer: { 2026-02-21T08:25:32.4296238Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, 
tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:25:32.4300763Z disable_threading: false, 2026-02-21T08:25:32.4300948Z verify_each: true 2026-02-21T08:25:32.4301097Z } 2026-02-21T08:25:32.4301233Z } 2026-02-21T08:25:32.4301352Z #-} 2026-02-21T08:25:32.4301831Z /tmp/torchinductor_root/rw/crwoiwwlfl2j3cwmgddgfhle73qse6spgwm26iq5djgvhtraqgil.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:25:32.4302895Z /tmp/torchinductor_root/rw/crwoiwwlfl2j3cwmgddgfhle73qse6spgwm26iq5djgvhtraqgil.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:25:32.4303730Z [128s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:25:32.4304918Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:25:32.4306012Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:25:32.4306275Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:25:35.9544041Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.1 configs/s 2026-02-21T08:25:42.4269805Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 154.9 2026-02-21T08:25:42.4274004Z configs/s 2026-02-21T08:25:42.6370289Z [138s] Generation 4 complete: 2026-02-21T08:25:42.6375332Z error=5 2026-02-21T08:25:42.6376881Z ok=70 2026-02-21T08:25:42.6377088Z min=0.1137 2026-02-21T08:25:42.6377284Z mid=0.1649 2026-02-21T08:25:42.6382597Z max=3.3751 2026-02-21T08:25:42.6384300Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:25:42.6384598Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:25:42.6390596Z 'l2_groupings': [1], 2026-02-21T08:25:42.6392597Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:25:42.6392890Z 'loop_orders': [[0, 1]], 2026-02-21T08:25:42.6395782Z 'num_stages': 5, 2026-02-21T08:25:42.6395983Z 'num_warps': 4, 2026-02-21T08:25:42.6396143Z 'pid_type': 'flat', 2026-02-21T08:25:42.6396327Z 'range_flattens': [None, None], 2026-02-21T08:25:42.6396519Z 'range_multi_buffers': [None, True], 2026-02-21T08:25:42.6396722Z 'range_num_stages': [0, 0], 2026-02-21T08:25:42.6397209Z 'range_unroll_factors': [0, 1], 2026-02-21T08:25:42.6397398Z 'range_warp_specializes': [None, True]} 2026-02-21T08:25:42.6397633Z [138s] 
Fitting surrogate: 474 points, 474 targets 2026-02-21T08:25:43.6413207Z [139s] Generation 5 starting: 70 neighbors, 4 active search path(s) 2026-02-21T08:25:55.9124716Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 1.4 configs/s 2026-02-21T08:25:57.3392460Z 2026-02-21T08:25:57.3395591Z 2026-02-21T08:25:57.3400169Z ================================================================ 2026-02-21T08:25:57.3404654Z Internal Triton PTX codegen error 2026-02-21T08:25:57.3406548Z `ptxas` stderr: 2026-02-21T08:25:57.3407049Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 325 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:25:57.3407562Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:25:57.3407718Z 2026-02-21T08:25:57.3408444Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp80jkuebr.ptx -o /tmp/tmp80jkuebr.ptx.o 2026-02-21T08:25:57.3408913Z 2026-02-21T08:25:57.3408918Z 2026-02-21T08:25:57.3408977Z // 2026-02-21T08:25:57.3409131Z // Generated by LLVM NVPTX Back-End 2026-02-21T08:25:57.3409311Z // 2026-02-21T08:25:57.3409379Z 2026-02-21T08:25:57.3409437Z .version 8.7 2026-02-21T08:25:57.3409586Z .target sm_100a 2026-02-21T08:25:57.3409723Z .address_size 64 2026-02-21T08:25:57.3409814Z 2026-02-21T08:25:57.3409977Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T08:25:57.3410268Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T08:25:57.3410506Z // @_helion_matmul_bf16_int4 2026-02-21T08:25:57.3410741Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T08:25:57.3411005Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T08:25:57.3411317Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T08:25:57.3411670Z .param .u64 .ptr .global .align 1 
_helion_matmul_bf16_int4_param_2, 2026-02-21T08:25:57.3411957Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T08:25:57.3412540Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T08:25:57.3412761Z ) 2026-02-21T08:25:57.3412887Z .reqntid 128 2026-02-21T08:25:57.3413018Z .maxnreg 32 2026-02-21T08:25:57.3413147Z { 2026-02-21T08:25:57.3413274Z .reg .pred %p<132>; 2026-02-21T08:25:57.3413429Z .reg .b16 %rs<769>; 2026-02-21T08:25:57.3413570Z .reg .b32 %r<874>; 2026-02-21T08:25:57.3413718Z .reg .b64 %rd<220>; 2026-02-21T08:25:57.3413858Z $L__func_begin0: 2026-02-21T08:25:57.3413949Z 2026-02-21T08:25:57.3414064Z // %bb.0: 2026-02-21T08:25:57.3414308Z .loc 1 19 0 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:19 2026-02-21T08:25:57.3414610Z mov.u32 %r1, %tid.x; 2026-02-21T08:25:57.3414815Z ld.param.b64 %rd36, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T08:25:57.3415029Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T08:25:57.3415203Z mov.b32 %r49, global_smem; 2026-02-21T08:25:57.3415363Z // begin inline asm 2026-02-21T08:25:57.3415615Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r49], 256; 2026-02-21T08:25:57.3415861Z // end inline asm 2026-02-21T08:25:57.3416045Z ld.param.b64 %rd53, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T08:25:57.3416255Z bar.sync 0; 2026-02-21T08:25:57.3416397Z ld.shared.b32 %r867, [global_smem]; 2026-02-21T08:25:57.3416572Z bar.sync 0; 2026-02-21T08:25:57.3416702Z // begin inline asm 2026-02-21T08:25:57.3416911Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T08:25:57.3417135Z // end inline asm 2026-02-21T08:25:57.3417393Z .loc 1 21 67 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:21:67 2026-02-21T08:25:57.3417684Z mov.u32 %r3, %ctaid.x; 2026-02-21T08:25:57.3417994Z mov.u32 %r58, %ctaid.y; 2026-02-21T08:25:57.3418160Z mov.u32 %r59, %ctaid.z; 2026-02-21T08:25:57.3418311Z mov.u32 %r60, %nctaid.x; 2026-02-21T08:25:57.3418471Z 
mov.u32 %r61, %nctaid.y; 2026-02-21T08:25:57.3418629Z mad.lo.s32 %r62, %r59, %r61, %r58; 2026-02-21T08:25:57.3418813Z mad.lo.s32 %r63, %r62, %r60, %r3; 2026-02-21T08:25:57.3418982Z shl.b32 %r64, %r63, 7; 2026-02-21T08:25:57.3419139Z cvt.s64.s32 %rd54, %r64; 2026-02-21T08:25:57.3419294Z add.s64 %rd50, %rd53, %rd54; 2026-02-21T08:25:57.3419459Z shl.b32 %r65, %r1, 2; 2026-02-21T08:25:57.3419609Z add.s32 %r50, %r49, %r65; 2026-02-21T08:25:57.3419770Z mov.b32 %r67, 0; 2026-02-21T08:25:57.3419917Z // begin inline asm 2026-02-21T08:25:57.3420072Z @%p1 st.shared.b32 [ %r50 + 0 ], %r67; 2026-02-21T08:25:57.3420254Z // end inline asm 2026-02-21T08:25:57.3420413Z bar.warp.sync -1; 2026-02-21T08:25:57.3420573Z setp.eq.b32 %p127, %r1, 0; 2026-02-21T08:25:57.3420736Z cvt.u64.u32 %rd35, %r49; 2026-02-21T08:25:57.3420963Z // begin inline asm 2026-02-21T08:25:57.3421228Z @%p127 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd35 + 0 ], %rd36; 2026-02-21T08:25:57.3421520Z // end inline asm 2026-02-21T08:25:57.3421723Z // begin inline asm 2026-02-21T08:25:57.3421957Z @%p127 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x1; 2026-02-21T08:25:57.3422220Z // end inline asm 2026-02-21T08:25:57.3422357Z mov.b32 %r52, 64; 2026-02-21T08:25:57.3422502Z // begin inline asm 2026-02-21T08:25:57.3422738Z @%p127 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x0, %r52; 2026-02-21T08:25:57.3423013Z // end inline asm 2026-02-21T08:25:57.3423147Z // begin inline asm 2026-02-21T08:25:57.3423390Z @%p127 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x1, %r52; 2026-02-21T08:25:57.3423660Z // end inline asm 2026-02-21T08:25:57.3423795Z mov.b32 %r54, 7168; 2026-02-21T08:25:57.3423948Z // begin inline asm 2026-02-21T08:25:57.3424195Z @%p127 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x0, %r54; 2026-02-21T08:25:57.3424476Z // end inline asm 2026-02-21T08:25:57.3424607Z mov.b32 %r55, 4096; 
2026-02-21T08:25:57.3424753Z // begin inline asm 2026-02-21T08:25:57.3424991Z @%p127 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x1, %r55; 2026-02-21T08:25:57.3425267Z // end inline asm 2026-02-21T08:25:57.3425405Z mov.b64 %rd43, 7168; 2026-02-21T08:25:57.3425546Z // begin inline asm 2026-02-21T08:25:57.3425803Z @%p127 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd35 + 0 ], 0x0, %rd43; 2026-02-21T08:25:57.3426089Z // end inline asm 2026-02-21T08:25:57.3426233Z mov.b32 %r56, 1; 2026-02-21T08:25:57.3426369Z // begin inline asm 2026-02-21T08:25:57.3426637Z @%p127 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x0, %r56; 2026-02-21T08:25:57.3426924Z // end inline asm 2026-02-21T08:25:57.3427056Z // begin inline asm 2026-02-21T08:25:57.3427317Z @%p127 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x1, %r56; 2026-02-21T08:25:57.3427600Z // end inline asm 2026-02-21T08:25:57.3427739Z // begin inline asm 2026-02-21T08:25:57.3427967Z @%p127 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x0; 2026-02-21T08:25:57.3428234Z // end inline asm 2026-02-21T08:25:57.3428367Z // begin inline asm 2026-02-21T08:25:57.3428625Z @%p127 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x0; 2026-02-21T08:25:57.3428917Z // end inline asm 2026-02-21T08:25:57.3429049Z // begin inline asm 2026-02-21T08:25:57.3429295Z @%p127 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x2; 2026-02-21T08:25:57.3429566Z // end inline asm 2026-02-21T08:25:57.3429713Z // begin inline asm 2026-02-21T08:25:57.3429940Z @%p127 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd35 + 0 ], 0x0; 2026-02-21T08:25:57.3430204Z // end inline asm 2026-02-21T08:25:57.3430345Z // begin inline asm 2026-02-21T08:25:57.3430752Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd50 + 0 ], [ %rd35 + 0 
], 0x80; 2026-02-21T08:25:57.3431121Z // end inline asm 2026-02-21T08:25:57.3431253Z // begin inline asm 2026-02-21T08:25:57.3431466Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd50 + 0 ], 0x80; 2026-02-21T08:25:57.3431741Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T08:25:57.3431939Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T08:25:57.3432151Z // end inline asm 2026-02-21T08:25:57.3432283Z bar.sync 0; 2026-02-21T08:25:57.3432432Z cvta.global.u64 %rd117, %rd50; 2026-02-21T08:25:57.3432723Z .loc 1 29 106 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:29:106 2026-02-21T08:25:57.3433028Z setp.gt.u32 %p21, %r3, 111; 2026-02-21T08:25:57.3433193Z @%p21 bra $L__BB0_8; 2026-02-21T08:25:57.3433369Z // %bb.1: // %.lr.ph 2026-02-21T08:25:57.3433740Z .loc 1 0 106 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:0:106 2026-02-21T08:25:57.3434072Z ld.param.b64 %rd34, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T08:25:57.3434326Z ld.param.b64 %rd33, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T08:25:57.3434531Z and.b32 %r4, %r1, 64; 2026-02-21T08:25:57.3434796Z .loc 1 74 38 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:74:38 2026-02-21T08:25:57.3435088Z setp.eq.b32 %p36, %r4, 0; 2026-02-21T08:25:57.3435359Z .loc 1 35 45 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:35:45 2026-02-21T08:25:57.3435649Z shr.u32 %r216, %r1, 5; 2026-02-21T08:25:57.3435901Z .loc 1 91 32 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:91:32 2026-02-21T08:25:57.3436195Z bfe.u32 %r217, %r1, 3, 4; 2026-02-21T08:25:57.3436451Z .loc 1 91 43 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:91:43 2026-02-21T08:25:57.3436746Z mul.lo.s32 %r218, %r217, 7168; 2026-02-21T08:25:57.3437015Z .loc 1 51 53 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:51:53 2026-02-21T08:25:57.3437302Z shl.b32 %r219, %r1, 9; 2026-02-21T08:25:57.3437464Z and.b32 %r220, %r219, 57344; 2026-02-21T08:25:57.3437724Z .loc 1 50 38 // 
cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:50:38 2026-02-21T08:25:57.3438013Z and.b32 %r221, %r1, 15; 2026-02-21T08:25:57.3438163Z shl.b32 %r222, %r221, 3; 2026-02-21T08:25:57.3438422Z .loc 1 35 45 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:35:45 2026-02-21T08:25:57.3438699Z and.b32 %r223, %r1, 7; 2026-02-21T08:25:57.3438854Z shl.b32 %r224, %r1, 4; 2026-02-21T08:25:57.3439008Z and.b32 %r5, %r224, 2032; 2026-02-21T08:25:57.3439162Z add.s32 %r226, %r49, 32768; 2026-02-21T08:25:57.3439328Z add.s32 %r456, %r226, %r5; 2026-02-21T08:25:57.3439483Z add.s32 %r458, %r456, 2048; 2026-02-21T08:25:57.3439646Z add.s32 %r460, %r456, 4096; 2026-02-21T08:25:57.3439800Z add.s32 %r462, %r456, 6144; 2026-02-21T08:25:57.3439955Z add.s32 %r464, %r456, 8192; 2026-02-21T08:25:57.3440106Z add.s32 %r466, %r456, 10240; 2026-02-21T08:25:57.3440268Z add.s32 %r468, %r456, 12288; 2026-02-21T08:25:57.3440422Z or.b32 %r13, %r224, 14336; 2026-02-21T08:25:57.3440583Z add.s32 %r470, %r226, %r13; 2026-02-21T08:25:57.3440745Z add.s32 %r406, %r867, 64; 2026-02-21T08:25:57.3440896Z shl.b32 %r227, %r221, 8; 2026-02-21T08:25:57.3441058Z and.b32 %r228, %r1, 96; 2026-02-21T08:25:57.3441206Z shl.b32 %r229, %r228, 7; 2026-02-21T08:25:57.3441364Z and.b32 %r230, %r1, 16; 2026-02-21T08:25:57.3441511Z shl.b32 %r231, %r230, 3; 2026-02-21T08:25:57.3441708Z or.b32 %r232, %r227, %r229; 2026-02-21T08:25:57.3441864Z or.b32 %r16, %r232, %r231; 2026-02-21T08:25:57.3442028Z and.b32 %r17, %r1, 63; 2026-02-21T08:25:57.3442179Z shl.b32 %r233, %r17, 7; 2026-02-21T08:25:57.3442339Z shl.b32 %r234, %r223, 4; 2026-02-21T08:25:57.3442506Z shr.u32 %r235, %r4, 4; 2026-02-21T08:25:57.3442739Z or.b32 %r236, %r234, %r235; 2026-02-21T08:25:57.3442901Z or.b32 %r237, %r236, %r233; 2026-02-21T08:25:57.3443055Z add.s32 %r18, %r49, %r237; 2026-02-21T08:25:57.3443219Z xor.b32 %r238, %r237, 16; 2026-02-21T08:25:57.3443372Z add.s32 %r19, %r49, %r238; 2026-02-21T08:25:57.3443537Z xor.b32 %r239, %r237, 32; 
2026-02-21T08:25:57.3443694Z add.s32 %r20, %r49, %r239; 2026-02-21T08:25:57.3443857Z xor.b32 %r240, %r237, 48; 2026-02-21T08:25:57.3444008Z add.s32 %r21, %r49, %r240; 2026-02-21T08:25:57.3444169Z xor.b32 %r241, %r237, 64; 2026-02-21T08:25:57.3444329Z add.s32 %r22, %r49, %r241; 2026-02-21T08:25:57.3444480Z xor.b32 %r242, %r237, 80; 2026-02-21T08:25:57.3444638Z add.s32 %r23, %r49, %r242; 2026-02-21T08:25:57.3444790Z xor.b32 %r243, %r237, 96; 2026-02-21T08:25:57.3444948Z add.s32 %r24, %r49, %r243; 2026-02-21T08:25:57.3445098Z xor.b32 %r244, %r237, 112; 2026-02-21T08:25:57.3445309Z add.s32 %r25, %r49, %r244; 2026-02-21T08:25:57.3445469Z bfe.u32 %r245, %r49, 4, 14; 2026-02-21T08:25:57.3445633Z cvt.u64.u32 %rd73, %r245; 2026-02-21T08:25:57.3445799Z or.b64 %rd92, %rd73, 4611686293338849280; 2026-02-21T08:25:57.3445984Z add.s32 %r246, %r49, 32; 2026-02-21T08:25:57.3446142Z bfe.u32 %r247, %r246, 4, 14; 2026-02-21T08:25:57.3446297Z cvt.u64.u32 %rd74, %r247; 2026-02-21T08:25:57.3446464Z or.b64 %rd93, %rd74, 4611686293338849280; 2026-02-21T08:25:57.3446638Z add.s32 %r248, %r49, 64; 2026-02-21T08:25:57.3446794Z bfe.u32 %r249, %r248, 4, 14; 2026-02-21T08:25:57.3446948Z cvt.u64.u32 %rd75, %r249; 2026-02-21T08:25:57.3447112Z or.b64 %rd94, %rd75, 4611686293338849280; 2026-02-21T08:25:57.3447285Z add.s32 %r250, %r49, 96; 2026-02-21T08:25:57.3447440Z bfe.u32 %r251, %r250, 4, 14; 2026-02-21T08:25:57.3447598Z cvt.u64.u32 %rd76, %r251; 2026-02-21T08:25:57.3447753Z or.b64 %rd95, %rd76, 4611686293338849280; 2026-02-21T08:25:57.3447930Z add.s32 %r252, %r49, 8192; 2026-02-21T08:25:57.3448081Z bfe.u32 %r253, %r252, 4, 14; 2026-02-21T08:25:57.3448245Z cvt.u64.u32 %rd77, %r253; 2026-02-21T08:25:57.3448402Z or.b64 %rd96, %rd77, 4611686293338849280; 2026-02-21T08:25:57.3448577Z add.s32 %r254, %r49, 8224; 2026-02-21T08:25:57.3448725Z bfe.u32 %r255, %r254, 4, 14; 2026-02-21T08:25:57.3448885Z cvt.u64.u32 %rd78, %r255; 2026-02-21T08:25:57.3449041Z or.b64 %rd97, %rd78, 4611686293338849280; 
2026-02-21T08:25:57.3449218Z add.s32 %r256, %r49, 8256; 2026-02-21T08:25:57.3449375Z bfe.u32 %r257, %r256, 4, 14; 2026-02-21T08:25:57.3449527Z cvt.u64.u32 %rd79, %r257; 2026-02-21T08:25:57.3449690Z or.b64 %rd98, %rd79, 4611686293338849280; 2026-02-21T08:25:57.3449859Z add.s32 %r258, %r49, 8288; 2026-02-21T08:25:57.3450018Z bfe.u32 %r259, %r258, 4, 14; 2026-02-21T08:25:57.3450179Z cvt.u64.u32 %rd80, %r259; 2026-02-21T08:25:57.3450354Z or.b64 %rd99, %rd80, 4611686293338849280; 2026-02-21T08:25:57.3450535Z add.s32 %r260, %r49, 16384; 2026-02-21T08:25:57.3450707Z bfe.u32 %r261, %r260, 4, 14; 2026-02-21T08:25:57.3450883Z cvt.u64.u32 %rd81, %r261; 2026-02-21T08:25:57.3451053Z or.b64 %rd100, %rd81, 4611686293338849280; 2026-02-21T08:25:57.3451244Z add.s32 %r262, %r49, 16416; 2026-02-21T08:25:57.3451402Z bfe.u32 %r263, %r262, 4, 14; 2026-02-21T08:25:57.3451609Z cvt.u64.u32 %rd82, %r263; 2026-02-21T08:25:57.3451780Z or.b64 %rd101, %rd82, 4611686293338849280; 2026-02-21T08:25:57.3451973Z add.s32 %r264, %r49, 16448; 2026-02-21T08:25:57.3452133Z bfe.u32 %r265, %r264, 4, 14; 2026-02-21T08:25:57.3452303Z cvt.u64.u32 %rd83, %r265; 2026-02-21T08:25:57.3452470Z or.b64 %rd102, %rd83, 4611686293338849280; 2026-02-21T08:25:57.3452661Z add.s32 %r266, %r49, 16480; 2026-02-21T08:25:57.3452828Z bfe.u32 %r267, %r266, 4, 14; 2026-02-21T08:25:57.3452987Z cvt.u64.u32 %rd84, %r267; 2026-02-21T08:25:57.3453158Z or.b64 %rd103, %rd84, 4611686293338849280; 2026-02-21T08:25:57.3453339Z add.s32 %r268, %r49, 24576; 2026-02-21T08:25:57.3453506Z bfe.u32 %r269, %r268, 4, 14; 2026-02-21T08:25:57.3453671Z cvt.u64.u32 %rd85, %r269; 2026-02-21T08:25:57.3453945Z or.b64 %rd104, %rd85, 4611686293338849280; 2026-02-21T08:25:57.3454126Z add.s32 %r270, %r49, 24608; 2026-02-21T08:25:57.3454293Z bfe.u32 %r271, %r270, 4, 14; 2026-02-21T08:25:57.3454463Z cvt.u64.u32 %rd86, %r271; 2026-02-21T08:25:57.3454633Z or.b64 %rd105, %rd86, 4611686293338849280; 2026-02-21T08:25:57.3454823Z add.s32 %r272, %r49, 24640; 
2026-02-21T08:25:57.3454983Z bfe.u32 %r273, %r272, 4, 14; 2026-02-21T08:25:57.3455152Z cvt.u64.u32 %rd87, %r273; 2026-02-21T08:25:57.3455317Z or.b64 %rd106, %rd87, 4611686293338849280; 2026-02-21T08:25:57.3455507Z add.s32 %r274, %r49, 24672; 2026-02-21T08:25:57.3455664Z bfe.u32 %r275, %r274, 4, 14; 2026-02-21T08:25:57.3455831Z cvt.u64.u32 %rd88, %r275; 2026-02-21T08:25:57.3455995Z or.b64 %rd107, %rd88, 4611686293338849280; 2026-02-21T08:25:57.3456181Z or.b32 %r276, %r220, %r222; 2026-02-21T08:25:57.3456357Z mad.wide.u32 %rd55, %r276, 2, %rd33; 2026-02-21T08:25:57.3456590Z add.s64 %rd109, %rd55, 512; 2026-02-21T08:25:57.3456763Z or.b32 %r277, %r276, 256; 2026-02-21T08:25:57.3456928Z mad.wide.u32 %rd89, %r277, 2, %rd33; 2026-02-21T08:25:57.3457118Z add.s64 %rd110, %rd89, 131072; 2026-02-21T08:25:57.3457290Z add.s64 %rd111, %rd89, 262144; 2026-02-21T08:25:57.3457465Z add.s64 %rd112, %rd89, 393216; 2026-02-21T08:25:57.3457630Z add.s64 %rd113, %rd89, 524288; 2026-02-21T08:25:57.3457801Z add.s64 %rd114, %rd89, 655360; 2026-02-21T08:25:57.3457979Z add.s64 %rd115, %rd89, 786432; 2026-02-21T08:25:57.3458133Z add.s64 %rd116, %rd89, 917504; 2026-02-21T08:25:57.3458296Z and.b32 %r278, %r224, 176; 2026-02-21T08:25:57.3458448Z shl.b32 %r279, %r228, 3; 2026-02-21T08:25:57.3458609Z bfe.s32 %r280, %r1, 2, 1; 2026-02-21T08:25:57.3458757Z and.b32 %r281, %r280, 1088; 2026-02-21T08:25:57.3458915Z shl.b32 %r282, %r230, 2; 2026-02-21T08:25:57.3459068Z xor.b32 %r283, %r281, %r282; 2026-02-21T08:25:57.3459232Z add.s32 %r284, %r49, %r278; 2026-02-21T08:25:57.3459391Z add.s32 %r285, %r284, %r279; 2026-02-21T08:25:57.3459556Z shl.b32 %r286, %r1, 5; 2026-02-21T08:25:57.3459711Z and.b32 %r287, %r286, 1792; 2026-02-21T08:25:57.3459860Z shl.b32 %r288, %r1, 3; 2026-02-21T08:25:57.3460015Z and.b32 %r289, %r288, 48; 2026-02-21T08:25:57.3460163Z shl.b32 %r290, %r228, 1; 2026-02-21T08:25:57.3460318Z shl.b32 %r291, %r1, 6; 2026-02-21T08:25:57.3460464Z and.b32 %r292, %r291, 64; 
2026-02-21T08:25:57.3460620Z xor.b32 %r293, %r290, %r292; 2026-02-21T08:25:57.3460772Z add.s32 %r294, %r49, %r287; 2026-02-21T08:25:57.3460930Z add.s32 %r295, %r294, %r289; 2026-02-21T08:25:57.3461091Z mad.wide.u32 %rd90, %r220, 2, %rd33; 2026-02-21T08:25:57.3461268Z add.s32 %r296, %r226, %r16; 2026-02-21T08:25:57.3461427Z shl.b32 %r297, %r223, 3; 2026-02-21T08:25:57.3461611Z xor.b32 %r28, %r17, 48; 2026-02-21T08:25:57.3461769Z add.s32 %r121, %r49, 65536; 2026-02-21T08:25:57.3461920Z add.s32 %r298, %r121, %r28; 2026-02-21T08:25:57.3462078Z xor.b32 %r29, %r17, 32; 2026-02-21T08:25:57.3462227Z add.s32 %r299, %r121, %r29; 2026-02-21T08:25:57.3462391Z xor.b32 %r30, %r17, 16; 2026-02-21T08:25:57.3462536Z add.s32 %r300, %r121, %r30; 2026-02-21T08:25:57.3462694Z add.s32 %r301, %r121, %r17; 2026-02-21T08:25:57.3462851Z add.s32 %r302, %r49, 49152; 2026-02-21T08:25:57.3462998Z add.s32 %r139, %r302, %r13; 2026-02-21T08:25:57.3463158Z add.s32 %r125, %r302, %r5; 2026-02-21T08:25:57.3463309Z add.s32 %r137, %r125, 12288; 2026-02-21T08:25:57.3463469Z add.s32 %r135, %r125, 10240; 2026-02-21T08:25:57.3463620Z add.s32 %r133, %r125, 8192; 2026-02-21T08:25:57.3463775Z add.s32 %r131, %r125, 6144; 2026-02-21T08:25:57.3463922Z add.s32 %r129, %r125, 4096; 2026-02-21T08:25:57.3464076Z add.s32 %r127, %r125, 2048; 2026-02-21T08:25:57.3464225Z or.b32 %r303, %r222, 128; 2026-02-21T08:25:57.3464389Z mad.wide.u32 %rd91, %r303, 2, %rd90; 2026-02-21T08:25:57.3464568Z add.s64 %rd71, %rd91, 917504; 2026-02-21T08:25:57.3464727Z add.s64 %rd70, %rd91, 786432; 2026-02-21T08:25:57.3464893Z add.s64 %rd69, %rd91, 655360; 2026-02-21T08:25:57.3465108Z add.s64 %rd68, %rd91, 524288; 2026-02-21T08:25:57.3465270Z add.s64 %rd67, %rd91, 393216; 2026-02-21T08:25:57.3465423Z add.s64 %rd66, %rd91, 262144; 2026-02-21T08:25:57.3465583Z add.s64 %rd65, %rd91, 131072; 2026-02-21T08:25:57.3465739Z add.s64 %rd64, %rd55, 256; 2026-02-21T08:25:57.3465897Z add.s64 %rd62, %rd55, 917504; 2026-02-21T08:25:57.3466049Z add.s64 
%rd61, %rd55, 786432; 2026-02-21T08:25:57.3466208Z add.s64 %rd60, %rd55, 655360; 2026-02-21T08:25:57.3466369Z add.s64 %rd59, %rd55, 524288; 2026-02-21T08:25:57.3466521Z add.s64 %rd58, %rd55, 393216; 2026-02-21T08:25:57.3466681Z add.s64 %rd57, %rd55, 262144; 2026-02-21T08:25:57.3466832Z add.s64 %rd56, %rd55, 131072; 2026-02-21T08:25:57.3467109Z .loc 1 36 27 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:36:27 2026-02-21T08:25:57.3467400Z shl.b32 %r474, %r3, 6; 2026-02-21T08:25:57.3467731Z .loc 1 37 32 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:37:32 2026-02-21T08:25:57.3468026Z or.b32 %r304, %r474, %r297; 2026-02-21T08:25:57.3468178Z $L__tmp0: 2026-02-21T08:25:57.3468474Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3468830Z shfl.sync.idx.b32 %r32, %r216, 0, 31, -1; 2026-02-21T08:25:57.3469023Z shl.b32 %r305, %r32, 21; 2026-02-21T08:25:57.3469177Z and.b32 %r306, %r305, 6291456; 2026-02-21T08:25:57.3469343Z add.s32 %r781, %r306, %r867; 2026-02-21T08:25:57.3469500Z mov.pred %p43, -1; 2026-02-21T08:25:57.3469654Z // begin inline asm 2026-02-21T08:25:57.3470013Z @%p43 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r781 + 0], 32, {%r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67}; 2026-02-21T08:25:57.3470386Z // end inline asm 2026-02-21T08:25:57.3470533Z // begin inline asm 2026-02-21T08:25:57.3470871Z @%p43 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r781 + 16], 32, {%r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67, %r67}; 2026-02-21T08:25:57.3471247Z // end inline asm 2026-02-21T08:25:57.3471382Z // begin inline asm 2026-02-21T08:25:57.3471578Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:25:57.3471747Z // end inline asm 2026-02-21T08:25:57.3471881Z bar.sync 0; 2026-02-21T08:25:57.3472019Z $L__tmp1: 2026-02-21T08:25:57.3472263Z .loc 1 44 71 // 
cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3472556Z add.s32 %r869, %r49, 73728; 2026-02-21T08:25:57.3472710Z // begin inline asm 2026-02-21T08:25:57.3472888Z @%p127 mbarrier.init.shared::cta.b64 [%r869], 1; 2026-02-21T08:25:57.3473082Z // end inline asm 2026-02-21T08:25:57.3473224Z bar.sync 0; 2026-02-21T08:25:57.3473358Z add.s32 %r101, %r49, 73736; 2026-02-21T08:25:57.3473521Z // begin inline asm 2026-02-21T08:25:57.3473700Z @%p127 mbarrier.init.shared::cta.b64 [%r101], 1; 2026-02-21T08:25:57.3473896Z // end inline asm 2026-02-21T08:25:57.3474043Z add.s32 %r472, %r49, 73744; 2026-02-21T08:25:57.3474195Z // begin inline asm 2026-02-21T08:25:57.3474363Z @%p127 mbarrier.init.shared::cta.b64 [%r472], 1; 2026-02-21T08:25:57.3474546Z // end inline asm 2026-02-21T08:25:57.3474685Z bar.sync 0; 2026-02-21T08:25:57.3474814Z add.s32 %r103, %r49, 73752; 2026-02-21T08:25:57.3474971Z // begin inline asm 2026-02-21T08:25:57.3475138Z @%p127 mbarrier.init.shared::cta.b64 [%r103], 1; 2026-02-21T08:25:57.3475324Z // end inline asm 2026-02-21T08:25:57.3475464Z mov.b32 %r457, 16; 2026-02-21T08:25:57.3475711Z .loc 1 51 80 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:51:80 2026-02-21T08:25:57.3475998Z // begin inline asm 2026-02-21T08:25:57.3476201Z cp.async.cg.shared.global [ %r456 + 0 ], [ %rd55 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3476439Z // end inline asm 2026-02-21T08:25:57.3476573Z // begin inline asm 2026-02-21T08:25:57.3476780Z cp.async.cg.shared.global [ %r458 + 0 ], [ %rd56 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3477065Z // end inline asm 2026-02-21T08:25:57.3477201Z // begin inline asm 2026-02-21T08:25:57.3477405Z cp.async.cg.shared.global [ %r460 + 0 ], [ %rd57 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3477626Z // end inline asm 2026-02-21T08:25:57.3477775Z // begin inline asm 2026-02-21T08:25:57.3477965Z cp.async.cg.shared.global [ %r462 + 0 ], [ %rd58 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3478192Z // end inline asm 
2026-02-21T08:25:57.3478326Z // begin inline asm 2026-02-21T08:25:57.3478523Z cp.async.cg.shared.global [ %r464 + 0 ], [ %rd59 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3478748Z // end inline asm 2026-02-21T08:25:57.3478881Z // begin inline asm 2026-02-21T08:25:57.3479079Z cp.async.cg.shared.global [ %r466 + 0 ], [ %rd60 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3479299Z // end inline asm 2026-02-21T08:25:57.3479441Z // begin inline asm 2026-02-21T08:25:57.3479631Z cp.async.cg.shared.global [ %r468 + 0 ], [ %rd61 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3479907Z // end inline asm 2026-02-21T08:25:57.3480043Z // begin inline asm 2026-02-21T08:25:57.3480240Z cp.async.cg.shared.global [ %r470 + 0 ], [ %rd62 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3480463Z // end inline asm 2026-02-21T08:25:57.3480605Z cp.async.commit_group; 2026-02-21T08:25:57.3480870Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3481147Z bar.sync 0; 2026-02-21T08:25:57.3481284Z // begin inline asm 2026-02-21T08:25:57.3481477Z @%p127 mbarrier.arrive.expect_tx.shared.b64 _, [%r472], 4096; 2026-02-21T08:25:57.3481737Z // end inline asm 2026-02-21T08:25:57.3481976Z .loc 1 57 33 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:57:33 2026-02-21T08:25:57.3482259Z bar.sync 0; 2026-02-21T08:25:57.3482409Z elect.sync %r307|%p38, -1; 2026-02-21T08:25:57.3482577Z and.pred %p29, %p1, %p38; 2026-02-21T08:25:57.3482737Z // begin inline asm 2026-02-21T08:25:57.3483067Z @%p29 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r121], [%rd117, {%r474, %r67}], [%r472]; 2026-02-21T08:25:57.3483425Z // end inline asm 2026-02-21T08:25:57.3483667Z .loc 1 51 80 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:51:80 2026-02-21T08:25:57.3483954Z // begin inline asm 2026-02-21T08:25:57.3484157Z cp.async.cg.shared.global [ %r125 + 0 ], [ %rd64 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3484378Z // end inline asm 
2026-02-21T08:25:57.3484521Z // begin inline asm 2026-02-21T08:25:57.3484714Z cp.async.cg.shared.global [ %r127 + 0 ], [ %rd65 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3484939Z // end inline asm 2026-02-21T08:25:57.3485072Z // begin inline asm 2026-02-21T08:25:57.3485268Z cp.async.cg.shared.global [ %r129 + 0 ], [ %rd66 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3485481Z // end inline asm 2026-02-21T08:25:57.3485621Z // begin inline asm 2026-02-21T08:25:57.3485810Z cp.async.cg.shared.global [ %r131 + 0 ], [ %rd67 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3486036Z // end inline asm 2026-02-21T08:25:57.3486181Z // begin inline asm 2026-02-21T08:25:57.3486369Z cp.async.cg.shared.global [ %r133 + 0 ], [ %rd68 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3486592Z // end inline asm 2026-02-21T08:25:57.3486725Z // begin inline asm 2026-02-21T08:25:57.3486921Z cp.async.cg.shared.global [ %r135 + 0 ], [ %rd69 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3487138Z // end inline asm 2026-02-21T08:25:57.3487282Z // begin inline asm 2026-02-21T08:25:57.3487477Z cp.async.cg.shared.global [ %r137 + 0 ], [ %rd70 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3487709Z // end inline asm 2026-02-21T08:25:57.3487856Z // begin inline asm 2026-02-21T08:25:57.3488045Z cp.async.cg.shared.global [ %r139 + 0 ], [ %rd71 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3488272Z // end inline asm 2026-02-21T08:25:57.3488408Z cp.async.commit_group; 2026-02-21T08:25:57.3488676Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3489020Z bar.sync 0; 2026-02-21T08:25:57.3489157Z // begin inline asm 2026-02-21T08:25:57.3489349Z @%p127 mbarrier.arrive.expect_tx.shared.b64 _, [%r103], 4096; 2026-02-21T08:25:57.3489577Z // end inline asm 2026-02-21T08:25:57.3489827Z .loc 1 57 33 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:57:33 2026-02-21T08:25:57.3490111Z bar.sync 0; 2026-02-21T08:25:57.3490258Z elect.sync %r308|%p39, -1; 2026-02-21T08:25:57.3490425Z and.pred %p31, %p1, 
%p39; 2026-02-21T08:25:57.3490591Z add.s32 %r142, %r49, 69632; 2026-02-21T08:25:57.3490745Z // begin inline asm 2026-02-21T08:25:57.3491072Z @%p31 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r142], [%rd117, {%r474, %r52}], [%r103]; 2026-02-21T08:25:57.3491423Z // end inline asm 2026-02-21T08:25:57.3491704Z .loc 1 51 80 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:51:80 2026-02-21T08:25:57.3492082Z cp.async.wait_group 1; 2026-02-21T08:25:57.3492236Z bar.sync 0; 2026-02-21T08:25:57.3492411Z ld.shared.v4.b32 {%r309, %r310, %r311, %r312}, [%r296]; 2026-02-21T08:25:57.3492621Z mov.b32 {%rs1, %rs2}, %r312; 2026-02-21T08:25:57.3492794Z mov.b32 {%rs3, %rs4}, %r311; 2026-02-21T08:25:57.3492952Z mov.b32 {%rs5, %rs6}, %r310; 2026-02-21T08:25:57.3493115Z mov.b32 {%rs7, %rs8}, %r309; 2026-02-21T08:25:57.3493307Z ld.shared.v4.b32 {%r313, %r314, %r315, %r316}, [%r296+16]; 2026-02-21T08:25:57.3493513Z mov.b32 {%rs9, %rs10}, %r316; 2026-02-21T08:25:57.3493690Z mov.b32 {%rs11, %rs12}, %r315; 2026-02-21T08:25:57.3493862Z mov.b32 {%rs13, %rs14}, %r314; 2026-02-21T08:25:57.3494039Z mov.b32 {%rs15, %rs16}, %r313; 2026-02-21T08:25:57.3494241Z ld.shared.v4.b32 {%r317, %r318, %r319, %r320}, [%r296+32]; 2026-02-21T08:25:57.3494464Z mov.b32 {%rs17, %rs18}, %r320; 2026-02-21T08:25:57.3494629Z mov.b32 {%rs19, %rs20}, %r319; 2026-02-21T08:25:57.3494800Z mov.b32 {%rs21, %rs22}, %r318; 2026-02-21T08:25:57.3494972Z mov.b32 {%rs23, %rs24}, %r317; 2026-02-21T08:25:57.3495169Z ld.shared.v4.b32 {%r321, %r322, %r323, %r324}, [%r296+48]; 2026-02-21T08:25:57.3495387Z mov.b32 {%rs25, %rs26}, %r324; 2026-02-21T08:25:57.3495549Z mov.b32 {%rs27, %rs28}, %r323; 2026-02-21T08:25:57.3495720Z mov.b32 {%rs29, %rs30}, %r322; 2026-02-21T08:25:57.3495882Z mov.b32 {%rs31, %rs32}, %r321; 2026-02-21T08:25:57.3496083Z ld.shared.v4.b32 {%r325, %r326, %r327, %r328}, [%r296+64]; 2026-02-21T08:25:57.3496292Z mov.b32 {%rs33, %rs34}, %r328; 2026-02-21T08:25:57.3496461Z mov.b32 
{%rs35, %rs36}, %r327; 2026-02-21T08:25:57.3496623Z mov.b32 {%rs37, %rs38}, %r326; 2026-02-21T08:25:57.3496793Z mov.b32 {%rs39, %rs40}, %r325; 2026-02-21T08:25:57.3497000Z ld.shared.v4.b32 {%r329, %r330, %r331, %r332}, [%r296+80]; 2026-02-21T08:25:57.3497211Z mov.b32 {%rs41, %rs42}, %r332; 2026-02-21T08:25:57.3497400Z mov.b32 {%rs43, %rs44}, %r331; 2026-02-21T08:25:57.3497564Z mov.b32 {%rs45, %rs46}, %r330; 2026-02-21T08:25:57.3497740Z mov.b32 {%rs47, %rs48}, %r329; 2026-02-21T08:25:57.3497942Z ld.shared.v4.b32 {%r333, %r334, %r335, %r336}, [%r296+96]; 2026-02-21T08:25:57.3498159Z mov.b32 {%rs49, %rs50}, %r336; 2026-02-21T08:25:57.3498322Z mov.b32 {%rs51, %rs52}, %r335; 2026-02-21T08:25:57.3498493Z mov.b32 {%rs53, %rs54}, %r334; 2026-02-21T08:25:57.3498662Z mov.b32 {%rs55, %rs56}, %r333; 2026-02-21T08:25:57.3498863Z ld.shared.v4.b32 {%r337, %r338, %r339, %r340}, [%r296+112]; 2026-02-21T08:25:57.3499083Z mov.b32 {%rs57, %rs58}, %r340; 2026-02-21T08:25:57.3499243Z mov.b32 {%rs59, %rs60}, %r339; 2026-02-21T08:25:57.3499411Z mov.b32 {%rs61, %rs62}, %r338; 2026-02-21T08:25:57.3499573Z mov.b32 {%rs63, %rs64}, %r337; 2026-02-21T08:25:57.3499858Z .loc 1 55 32 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:55:32 2026-02-21T08:25:57.3500169Z cvt.f32.bf16 %r147, %rs7; 2026-02-21T08:25:57.3500335Z cvt.f32.bf16 %r148, %rs8; 2026-02-21T08:25:57.3500503Z cvt.f32.bf16 %r149, %rs5; 2026-02-21T08:25:57.3500662Z cvt.f32.bf16 %r150, %rs6; 2026-02-21T08:25:57.3500886Z cvt.f32.bf16 %r151, %rs3; 2026-02-21T08:25:57.3501041Z cvt.f32.bf16 %r152, %rs4; 2026-02-21T08:25:57.3501203Z cvt.f32.bf16 %r153, %rs1; 2026-02-21T08:25:57.3501357Z cvt.f32.bf16 %r154, %rs2; 2026-02-21T08:25:57.3501522Z cvt.f32.bf16 %r155, %rs15; 2026-02-21T08:25:57.3501719Z cvt.f32.bf16 %r156, %rs16; 2026-02-21T08:25:57.3501891Z cvt.f32.bf16 %r157, %rs13; 2026-02-21T08:25:57.3502065Z cvt.f32.bf16 %r158, %rs14; 2026-02-21T08:25:57.3502214Z cvt.f32.bf16 %r159, %rs11; 2026-02-21T08:25:57.3502369Z cvt.f32.bf16 
%r160, %rs12; 2026-02-21T08:25:57.3502519Z cvt.f32.bf16 %r161, %rs9; 2026-02-21T08:25:57.3502673Z cvt.f32.bf16 %r162, %rs10; 2026-02-21T08:25:57.3502822Z cvt.f32.bf16 %r164, %rs23; 2026-02-21T08:25:57.3502976Z cvt.f32.bf16 %r165, %rs24; 2026-02-21T08:25:57.3503124Z cvt.f32.bf16 %r166, %rs21; 2026-02-21T08:25:57.3503278Z cvt.f32.bf16 %r167, %rs22; 2026-02-21T08:25:57.3503477Z cvt.f32.bf16 %r168, %rs19; 2026-02-21T08:25:57.3503639Z cvt.f32.bf16 %r169, %rs20; 2026-02-21T08:25:57.3503794Z cvt.f32.bf16 %r170, %rs17; 2026-02-21T08:25:57.3503943Z cvt.f32.bf16 %r171, %rs18; 2026-02-21T08:25:57.3504099Z cvt.f32.bf16 %r172, %rs31; 2026-02-21T08:25:57.3504247Z cvt.f32.bf16 %r173, %rs32; 2026-02-21T08:25:57.3504402Z cvt.f32.bf16 %r174, %rs29; 2026-02-21T08:25:57.3504548Z cvt.f32.bf16 %r175, %rs30; 2026-02-21T08:25:57.3504703Z cvt.f32.bf16 %r176, %rs27; 2026-02-21T08:25:57.3504853Z cvt.f32.bf16 %r177, %rs28; 2026-02-21T08:25:57.3505011Z cvt.f32.bf16 %r178, %rs25; 2026-02-21T08:25:57.3505159Z cvt.f32.bf16 %r179, %rs26; 2026-02-21T08:25:57.3505316Z cvt.f32.bf16 %r181, %rs39; 2026-02-21T08:25:57.3505472Z cvt.f32.bf16 %r182, %rs40; 2026-02-21T08:25:57.3505621Z cvt.f32.bf16 %r183, %rs37; 2026-02-21T08:25:57.3505780Z cvt.f32.bf16 %r184, %rs38; 2026-02-21T08:25:57.3505929Z cvt.f32.bf16 %r185, %rs35; 2026-02-21T08:25:57.3506090Z cvt.f32.bf16 %r186, %rs36; 2026-02-21T08:25:57.3506246Z cvt.f32.bf16 %r187, %rs33; 2026-02-21T08:25:57.3506415Z cvt.f32.bf16 %r188, %rs34; 2026-02-21T08:25:57.3506564Z cvt.f32.bf16 %r189, %rs47; 2026-02-21T08:25:57.3506719Z cvt.f32.bf16 %r190, %rs48; 2026-02-21T08:25:57.3506869Z cvt.f32.bf16 %r191, %rs45; 2026-02-21T08:25:57.3507024Z cvt.f32.bf16 %r192, %rs46; 2026-02-21T08:25:57.3507179Z cvt.f32.bf16 %r193, %rs43; 2026-02-21T08:25:57.3507327Z cvt.f32.bf16 %r194, %rs44; 2026-02-21T08:25:57.3507484Z cvt.f32.bf16 %r195, %rs41; 2026-02-21T08:25:57.3507632Z cvt.f32.bf16 %r196, %rs42; 2026-02-21T08:25:57.3507789Z cvt.f32.bf16 %r198, %rs55; 
2026-02-21T08:25:57.3507937Z cvt.f32.bf16 %r199, %rs56; 2026-02-21T08:25:57.3508093Z cvt.f32.bf16 %r200, %rs53; 2026-02-21T08:25:57.3508241Z cvt.f32.bf16 %r201, %rs54; 2026-02-21T08:25:57.3508397Z cvt.f32.bf16 %r202, %rs51; 2026-02-21T08:25:57.3508545Z cvt.f32.bf16 %r203, %rs52; 2026-02-21T08:25:57.3508702Z cvt.f32.bf16 %r204, %rs49; 2026-02-21T08:25:57.3508857Z cvt.f32.bf16 %r205, %rs50; 2026-02-21T08:25:57.3509008Z cvt.f32.bf16 %r206, %rs63; 2026-02-21T08:25:57.3509168Z cvt.f32.bf16 %r207, %rs64; 2026-02-21T08:25:57.3509316Z cvt.f32.bf16 %r208, %rs61; 2026-02-21T08:25:57.3509475Z cvt.f32.bf16 %r209, %rs62; 2026-02-21T08:25:57.3509624Z cvt.f32.bf16 %r210, %rs59; 2026-02-21T08:25:57.3509778Z cvt.f32.bf16 %r211, %rs60; 2026-02-21T08:25:57.3509926Z cvt.f32.bf16 %r212, %rs57; 2026-02-21T08:25:57.3510081Z cvt.f32.bf16 %r213, %rs58; 2026-02-21T08:25:57.3510223Z $L__tmp2: 2026-02-21T08:25:57.3510522Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3510865Z add.s32 %r146, %r306, %r406; 2026-02-21T08:25:57.3511017Z // begin inline asm 2026-02-21T08:25:57.3511403Z @%p43 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r146 + 0], 64, {%r147, %r148, %r149, %r150, %r151, %r152, %r153, %r154, %r155, %r156, %r157, %r158, %r159, %r160, %r161, %r162}; 2026-02-21T08:25:57.3511832Z // end inline asm 2026-02-21T08:25:57.3511979Z // begin inline asm 2026-02-21T08:25:57.3512399Z @%p43 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r146 + 16], 64, {%r164, %r165, %r166, %r167, %r168, %r169, %r170, %r171, %r172, %r173, %r174, %r175, %r176, %r177, %r178, %r179}; 2026-02-21T08:25:57.3512801Z // end inline asm 2026-02-21T08:25:57.3512941Z // begin inline asm 2026-02-21T08:25:57.3513296Z @%p43 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r146 + 32], 64, {%r181, %r182, %r183, %r184, %r185, %r186, %r187, %r188, %r189, %r190, %r191, %r192, %r193, %r194, %r195, %r196}; 2026-02-21T08:25:57.3513694Z // end inline asm 
2026-02-21T08:25:57.3513832Z // begin inline asm 2026-02-21T08:25:57.3514200Z @%p43 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r146 + 48], 64, {%r198, %r199, %r200, %r201, %r202, %r203, %r204, %r205, %r206, %r207, %r208, %r209, %r210, %r211, %r212, %r213}; 2026-02-21T08:25:57.3514591Z // end inline asm 2026-02-21T08:25:57.3514725Z // begin inline asm 2026-02-21T08:25:57.3514887Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:25:57.3515101Z // end inline asm 2026-02-21T08:25:57.3515244Z bar.sync 0; 2026-02-21T08:25:57.3515373Z $L__tmp3: 2026-02-21T08:25:57.3515620Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3515907Z // begin inline asm 2026-02-21T08:25:57.3516050Z 2026-02-21T08:25:57.3516178Z { 2026-02-21T08:25:57.3516307Z .reg .pred complete; 2026-02-21T08:25:57.3516463Z waitLoop: 2026-02-21T08:25:57.3516650Z mbarrier.try_wait.parity.shared.b64 complete, [%r472], %r67; 2026-02-21T08:25:57.3516892Z @!complete bra.uni waitLoop; 2026-02-21T08:25:57.3517044Z } 2026-02-21T08:25:57.3517116Z 2026-02-21T08:25:57.3517172Z // end inline asm 2026-02-21T08:25:57.3517418Z .loc 1 57 33 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:57:33 2026-02-21T08:25:57.3517720Z ld.shared.b8 %rs65, [%r301]; 2026-02-21T08:25:57.3517894Z ld.shared.b8 %rs66, [%r301+64]; 2026-02-21T08:25:57.3518071Z ld.shared.b8 %rs67, [%r301+512]; 2026-02-21T08:25:57.3518250Z ld.shared.b8 %rs68, [%r301+576]; 2026-02-21T08:25:57.3518422Z ld.shared.b8 %rs69, [%r301+1024]; 2026-02-21T08:25:57.3518601Z ld.shared.b8 %rs70, [%r301+1088]; 2026-02-21T08:25:57.3518770Z ld.shared.b8 %rs71, [%r301+1536]; 2026-02-21T08:25:57.3518941Z ld.shared.b8 %rs72, [%r301+1600]; 2026-02-21T08:25:57.3519107Z ld.shared.b8 %rs73, [%r301+2048]; 2026-02-21T08:25:57.3519277Z ld.shared.b8 %rs74, [%r301+2112]; 2026-02-21T08:25:57.3519442Z ld.shared.b8 %rs75, [%r301+2560]; 2026-02-21T08:25:57.3519614Z ld.shared.b8 %rs76, [%r301+2624]; 2026-02-21T08:25:57.3519781Z ld.shared.b8 
%rs77, [%r301+3072]; 2026-02-21T08:25:57.3519942Z ld.shared.b8 %rs78, [%r301+3136]; 2026-02-21T08:25:57.3520111Z ld.shared.b8 %rs79, [%r301+3584]; 2026-02-21T08:25:57.3520271Z ld.shared.b8 %rs80, [%r301+3648]; 2026-02-21T08:25:57.3520439Z ld.shared.b8 %rs81, [%r300+128]; 2026-02-21T08:25:57.3520604Z ld.shared.b8 %rs82, [%r300+192]; 2026-02-21T08:25:57.3520775Z ld.shared.b8 %rs83, [%r300+640]; 2026-02-21T08:25:57.3520940Z ld.shared.b8 %rs84, [%r300+704]; 2026-02-21T08:25:57.3521110Z ld.shared.b8 %rs85, [%r300+1152]; 2026-02-21T08:25:57.3521280Z ld.shared.b8 %rs86, [%r300+1216]; 2026-02-21T08:25:57.3521441Z ld.shared.b8 %rs87, [%r300+1664]; 2026-02-21T08:25:57.3521631Z ld.shared.b8 %rs88, [%r300+1728]; 2026-02-21T08:25:57.3521792Z ld.shared.b8 %rs89, [%r300+2176]; 2026-02-21T08:25:57.3521961Z ld.shared.b8 %rs90, [%r300+2240]; 2026-02-21T08:25:57.3522124Z ld.shared.b8 %rs91, [%r300+2688]; 2026-02-21T08:25:57.3522292Z ld.shared.b8 %rs92, [%r300+2752]; 2026-02-21T08:25:57.3522453Z ld.shared.b8 %rs93, [%r300+3200]; 2026-02-21T08:25:57.3522622Z ld.shared.b8 %rs94, [%r300+3264]; 2026-02-21T08:25:57.3522790Z ld.shared.b8 %rs95, [%r300+3712]; 2026-02-21T08:25:57.3522950Z ld.shared.b8 %rs96, [%r300+3776]; 2026-02-21T08:25:57.3523120Z ld.shared.b8 %rs97, [%r299+256]; 2026-02-21T08:25:57.3523284Z ld.shared.b8 %rs98, [%r299+320]; 2026-02-21T08:25:57.3523456Z ld.shared.b8 %rs99, [%r299+768]; 2026-02-21T08:25:57.3523683Z ld.shared.b8 %rs100, [%r299+832]; 2026-02-21T08:25:57.3523857Z ld.shared.b8 %rs101, [%r299+1280]; 2026-02-21T08:25:57.3524031Z ld.shared.b8 %rs102, [%r299+1344]; 2026-02-21T08:25:57.3524206Z ld.shared.b8 %rs103, [%r299+1792]; 2026-02-21T08:25:57.3524372Z ld.shared.b8 %rs104, [%r299+1856]; 2026-02-21T08:25:57.3524547Z ld.shared.b8 %rs105, [%r299+2304]; 2026-02-21T08:25:57.3524721Z ld.shared.b8 %rs106, [%r299+2368]; 2026-02-21T08:25:57.3524890Z ld.shared.b8 %rs107, [%r299+2816]; 2026-02-21T08:25:57.3525068Z ld.shared.b8 %rs108, [%r299+2880]; 
2026-02-21T08:25:57.3525238Z ld.shared.b8 %rs109, [%r299+3328]; 2026-02-21T08:25:57.3525420Z ld.shared.b8 %rs110, [%r299+3392]; 2026-02-21T08:25:57.3525592Z ld.shared.b8 %rs111, [%r299+3840]; 2026-02-21T08:25:57.3525766Z ld.shared.b8 %rs112, [%r299+3904]; 2026-02-21T08:25:57.3525932Z ld.shared.b8 %rs113, [%r298+384]; 2026-02-21T08:25:57.3526157Z ld.shared.b8 %rs114, [%r298+448]; 2026-02-21T08:25:57.3526336Z ld.shared.b8 %rs115, [%r298+896]; 2026-02-21T08:25:57.3526501Z ld.shared.b8 %rs116, [%r298+960]; 2026-02-21T08:25:57.3526672Z ld.shared.b8 %rs117, [%r298+1408]; 2026-02-21T08:25:57.3526836Z ld.shared.b8 %rs118, [%r298+1472]; 2026-02-21T08:25:57.3527007Z ld.shared.b8 %rs119, [%r298+1920]; 2026-02-21T08:25:57.3527172Z ld.shared.b8 %rs120, [%r298+1984]; 2026-02-21T08:25:57.3527344Z ld.shared.b8 %rs121, [%r298+2432]; 2026-02-21T08:25:57.3527506Z ld.shared.b8 %rs122, [%r298+2496]; 2026-02-21T08:25:57.3527679Z ld.shared.b8 %rs123, [%r298+2944]; 2026-02-21T08:25:57.3527849Z ld.shared.b8 %rs124, [%r298+3008]; 2026-02-21T08:25:57.3528012Z ld.shared.b8 %rs125, [%r298+3456]; 2026-02-21T08:25:57.3528181Z ld.shared.b8 %rs126, [%r298+3520]; 2026-02-21T08:25:57.3528345Z ld.shared.b8 %rs127, [%r298+3968]; 2026-02-21T08:25:57.3528516Z ld.shared.b8 %rs128, [%r298+4032]; 2026-02-21T08:25:57.3528795Z .loc 1 60 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:60:28 2026-02-21T08:25:57.3529088Z shl.b16 %rs129, %rs65, 4; 2026-02-21T08:25:57.3529246Z shl.b16 %rs130, %rs66, 4; 2026-02-21T08:25:57.3529403Z shl.b16 %rs131, %rs81, 4; 2026-02-21T08:25:57.3529557Z shl.b16 %rs132, %rs82, 4; 2026-02-21T08:25:57.3529706Z shl.b16 %rs133, %rs97, 4; 2026-02-21T08:25:57.3529857Z shl.b16 %rs134, %rs98, 4; 2026-02-21T08:25:57.3530007Z shl.b16 %rs135, %rs113, 4; 2026-02-21T08:25:57.3530168Z shl.b16 %rs136, %rs114, 4; 2026-02-21T08:25:57.3530320Z shl.b16 %rs137, %rs67, 4; 2026-02-21T08:25:57.3530472Z shl.b16 %rs138, %rs68, 4; 2026-02-21T08:25:57.3530617Z shl.b16 %rs139, %rs83, 4; 
2026-02-21T08:25:57.3530769Z shl.b16 %rs140, %rs84, 4; 2026-02-21T08:25:57.3530915Z shl.b16 %rs141, %rs99, 4; 2026-02-21T08:25:57.3531068Z shl.b16 %rs142, %rs100, 4; 2026-02-21T08:25:57.3531226Z shl.b16 %rs143, %rs115, 4; 2026-02-21T08:25:57.3531379Z shl.b16 %rs144, %rs116, 4; 2026-02-21T08:25:57.3531556Z shl.b16 %rs145, %rs69, 4; 2026-02-21T08:25:57.3531707Z shl.b16 %rs146, %rs70, 4; 2026-02-21T08:25:57.3531865Z shl.b16 %rs147, %rs85, 4; 2026-02-21T08:25:57.3532012Z shl.b16 %rs148, %rs86, 4; 2026-02-21T08:25:57.3532167Z shl.b16 %rs149, %rs101, 4; 2026-02-21T08:25:57.3532318Z shl.b16 %rs150, %rs102, 4; 2026-02-21T08:25:57.3532479Z shl.b16 %rs151, %rs117, 4; 2026-02-21T08:25:57.3532630Z shl.b16 %rs152, %rs118, 4; 2026-02-21T08:25:57.3532790Z shl.b16 %rs153, %rs71, 4; 2026-02-21T08:25:57.3532946Z shl.b16 %rs154, %rs72, 4; 2026-02-21T08:25:57.3533094Z shl.b16 %rs155, %rs87, 4; 2026-02-21T08:25:57.3533254Z shl.b16 %rs156, %rs88, 4; 2026-02-21T08:25:57.3533406Z shl.b16 %rs157, %rs103, 4; 2026-02-21T08:25:57.3533570Z shl.b16 %rs158, %rs104, 4; 2026-02-21T08:25:57.3533720Z shl.b16 %rs159, %rs119, 4; 2026-02-21T08:25:57.3533877Z shl.b16 %rs160, %rs120, 4; 2026-02-21T08:25:57.3534030Z shl.b16 %rs161, %rs73, 4; 2026-02-21T08:25:57.3534184Z shl.b16 %rs162, %rs74, 4; 2026-02-21T08:25:57.3534331Z shl.b16 %rs163, %rs89, 4; 2026-02-21T08:25:57.3534487Z shl.b16 %rs164, %rs90, 4; 2026-02-21T08:25:57.3534698Z shl.b16 %rs165, %rs105, 4; 2026-02-21T08:25:57.3534850Z shl.b16 %rs166, %rs106, 4; 2026-02-21T08:25:57.3535007Z shl.b16 %rs167, %rs121, 4; 2026-02-21T08:25:57.3535157Z shl.b16 %rs168, %rs122, 4; 2026-02-21T08:25:57.3535314Z shl.b16 %rs169, %rs75, 4; 2026-02-21T08:25:57.3535463Z shl.b16 %rs170, %rs76, 4; 2026-02-21T08:25:57.3535615Z shl.b16 %rs171, %rs91, 4; 2026-02-21T08:25:57.3535760Z shl.b16 %rs172, %rs92, 4; 2026-02-21T08:25:57.3535916Z shl.b16 %rs173, %rs107, 4; 2026-02-21T08:25:57.3536069Z shl.b16 %rs174, %rs108, 4; 2026-02-21T08:25:57.3536227Z shl.b16 %rs175, %rs123, 4; 
2026-02-21T08:25:57.3536386Z shl.b16 %rs176, %rs124, 4; 2026-02-21T08:25:57.3536538Z shl.b16 %rs177, %rs77, 4; 2026-02-21T08:25:57.3536696Z shl.b16 %rs178, %rs78, 4; 2026-02-21T08:25:57.3536843Z shl.b16 %rs179, %rs93, 4; 2026-02-21T08:25:57.3536999Z shl.b16 %rs180, %rs94, 4; 2026-02-21T08:25:57.3537249Z shl.b16 %rs181, %rs109, 4; 2026-02-21T08:25:57.3537423Z shl.b16 %rs182, %rs110, 4; 2026-02-21T08:25:57.3537579Z shl.b16 %rs183, %rs125, 4; 2026-02-21T08:25:57.3537742Z shl.b16 %rs184, %rs126, 4; 2026-02-21T08:25:57.3537898Z shl.b16 %rs185, %rs79, 4; 2026-02-21T08:25:57.3538058Z shl.b16 %rs186, %rs80, 4; 2026-02-21T08:25:57.3538218Z shl.b16 %rs187, %rs95, 4; 2026-02-21T08:25:57.3538370Z shl.b16 %rs188, %rs96, 4; 2026-02-21T08:25:57.3538533Z shl.b16 %rs189, %rs111, 4; 2026-02-21T08:25:57.3538692Z shl.b16 %rs190, %rs112, 4; 2026-02-21T08:25:57.3538858Z shl.b16 %rs191, %rs127, 4; 2026-02-21T08:25:57.3539015Z shl.b16 %rs192, %rs128, 4; 2026-02-21T08:25:57.3539303Z .loc 1 75 58 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:75:58 2026-02-21T08:25:57.3539616Z selp.b16 %rs193, %rs129, %rs65, %p36; 2026-02-21T08:25:57.3539810Z cvt.s16.s8 %rs194, %rs193; 2026-02-21T08:25:57.3539975Z shr.s16 %rs195, %rs194, 4; 2026-02-21T08:25:57.3540139Z selp.b16 %rs196, %rs130, %rs66, %p36; 2026-02-21T08:25:57.3540334Z cvt.s16.s8 %rs197, %rs196; 2026-02-21T08:25:57.3540492Z shr.s16 %rs198, %rs197, 4; 2026-02-21T08:25:57.3540664Z selp.b16 %rs199, %rs131, %rs81, %p36; 2026-02-21T08:25:57.3540849Z cvt.s16.s8 %rs200, %rs199; 2026-02-21T08:25:57.3541016Z shr.s16 %rs201, %rs200, 4; 2026-02-21T08:25:57.3541179Z selp.b16 %rs202, %rs132, %rs82, %p36; 2026-02-21T08:25:57.3541367Z cvt.s16.s8 %rs203, %rs202; 2026-02-21T08:25:57.3541564Z shr.s16 %rs204, %rs203, 4; 2026-02-21T08:25:57.3541737Z selp.b16 %rs205, %rs133, %rs97, %p36; 2026-02-21T08:25:57.3541932Z cvt.s16.s8 %rs206, %rs205; 2026-02-21T08:25:57.3542105Z shr.s16 %rs207, %rs206, 4; 2026-02-21T08:25:57.3542285Z selp.b16 %rs208, %rs134, 
%rs98, %p36; 2026-02-21T08:25:57.3542468Z cvt.s16.s8 %rs209, %rs208; 2026-02-21T08:25:57.3542640Z shr.s16 %rs210, %rs209, 4; 2026-02-21T08:25:57.3542816Z selp.b16 %rs211, %rs135, %rs113, %p36; 2026-02-21T08:25:57.3543011Z cvt.s16.s8 %rs212, %rs211; 2026-02-21T08:25:57.3543180Z shr.s16 %rs213, %rs212, 4; 2026-02-21T08:25:57.3543368Z selp.b16 %rs214, %rs136, %rs114, %p36; 2026-02-21T08:25:57.3543565Z cvt.s16.s8 %rs215, %rs214; 2026-02-21T08:25:57.3543730Z shr.s16 %rs216, %rs215, 4; 2026-02-21T08:25:57.3543908Z selp.b16 %rs217, %rs137, %rs67, %p36; 2026-02-21T08:25:57.3544089Z cvt.s16.s8 %rs218, %rs217; 2026-02-21T08:25:57.3544264Z shr.s16 %rs219, %rs218, 4; 2026-02-21T08:25:57.3544434Z selp.b16 %rs220, %rs138, %rs68, %p36; 2026-02-21T08:25:57.3544622Z cvt.s16.s8 %rs221, %rs220; 2026-02-21T08:25:57.3544787Z shr.s16 %rs222, %rs221, 4; 2026-02-21T08:25:57.3544965Z selp.b16 %rs223, %rs139, %rs83, %p36; 2026-02-21T08:25:57.3545147Z cvt.s16.s8 %rs224, %rs223; 2026-02-21T08:25:57.3545323Z shr.s16 %rs225, %rs224, 4; 2026-02-21T08:25:57.3545490Z selp.b16 %rs226, %rs140, %rs84, %p36; 2026-02-21T08:25:57.3545664Z cvt.s16.s8 %rs227, %rs226; 2026-02-21T08:25:57.3545828Z shr.s16 %rs228, %rs227, 4; 2026-02-21T08:25:57.3545988Z selp.b16 %rs229, %rs141, %rs99, %p36; 2026-02-21T08:25:57.3546170Z cvt.s16.s8 %rs230, %rs229; 2026-02-21T08:25:57.3546382Z shr.s16 %rs231, %rs230, 4; 2026-02-21T08:25:57.3546549Z selp.b16 %rs232, %rs142, %rs100, %p36; 2026-02-21T08:25:57.3546720Z cvt.s16.s8 %rs233, %rs232; 2026-02-21T08:25:57.3546877Z shr.s16 %rs234, %rs233, 4; 2026-02-21T08:25:57.3547045Z selp.b16 %rs235, %rs143, %rs115, %p36; 2026-02-21T08:25:57.3547216Z cvt.s16.s8 %rs236, %rs235; 2026-02-21T08:25:57.3547373Z shr.s16 %rs237, %rs236, 4; 2026-02-21T08:25:57.3547536Z selp.b16 %rs238, %rs144, %rs116, %p36; 2026-02-21T08:25:57.3547716Z cvt.s16.s8 %rs239, %rs238; 2026-02-21T08:25:57.3547868Z shr.s16 %rs240, %rs239, 4; 2026-02-21T08:25:57.3548035Z selp.b16 %rs241, %rs145, %rs69, %p36; 
2026-02-21T08:25:57.3548205Z cvt.s16.s8 %rs242, %rs241; 2026-02-21T08:25:57.3548367Z shr.s16 %rs243, %rs242, 4; 2026-02-21T08:25:57.3548524Z selp.b16 %rs244, %rs146, %rs70, %p36; 2026-02-21T08:25:57.3548700Z cvt.s16.s8 %rs245, %rs244; 2026-02-21T08:25:57.3548907Z shr.s16 %rs246, %rs245, 4; 2026-02-21T08:25:57.3549071Z selp.b16 %rs247, %rs147, %rs85, %p36; 2026-02-21T08:25:57.3549249Z cvt.s16.s8 %rs248, %rs247; 2026-02-21T08:25:57.3549400Z shr.s16 %rs249, %rs248, 4; 2026-02-21T08:25:57.3549562Z selp.b16 %rs250, %rs148, %rs86, %p36; 2026-02-21T08:25:57.3549730Z cvt.s16.s8 %rs251, %rs250; 2026-02-21T08:25:57.3549888Z shr.s16 %rs252, %rs251, 4; 2026-02-21T08:25:57.3550048Z selp.b16 %rs253, %rs149, %rs101, %p36; 2026-02-21T08:25:57.3550231Z cvt.s16.s8 %rs254, %rs253; 2026-02-21T08:25:57.3550393Z shr.s16 %rs255, %rs254, 4; 2026-02-21T08:25:57.3550555Z selp.b16 %rs256, %rs150, %rs102, %p36; 2026-02-21T08:25:57.3550738Z cvt.s16.s8 %rs257, %rs256; 2026-02-21T08:25:57.3550890Z shr.s16 %rs258, %rs257, 4; 2026-02-21T08:25:57.3551056Z selp.b16 %rs259, %rs151, %rs117, %p36; 2026-02-21T08:25:57.3551226Z cvt.s16.s8 %rs260, %rs259; 2026-02-21T08:25:57.3551380Z shr.s16 %rs261, %rs260, 4; 2026-02-21T08:25:57.3551576Z selp.b16 %rs262, %rs152, %rs118, %p36; 2026-02-21T08:25:57.3551757Z cvt.s16.s8 %rs263, %rs262; 2026-02-21T08:25:57.3551910Z shr.s16 %rs264, %rs263, 4; 2026-02-21T08:25:57.3552076Z selp.b16 %rs265, %rs153, %rs71, %p36; 2026-02-21T08:25:57.3552250Z cvt.s16.s8 %rs266, %rs265; 2026-02-21T08:25:57.3552400Z shr.s16 %rs267, %rs266, 4; 2026-02-21T08:25:57.3552564Z selp.b16 %rs268, %rs154, %rs72, %p36; 2026-02-21T08:25:57.3552733Z cvt.s16.s8 %rs269, %rs268; 2026-02-21T08:25:57.3552891Z shr.s16 %rs270, %rs269, 4; 2026-02-21T08:25:57.3553048Z selp.b16 %rs271, %rs155, %rs87, %p36; 2026-02-21T08:25:57.3553223Z cvt.s16.s8 %rs272, %rs271; 2026-02-21T08:25:57.3553376Z shr.s16 %rs273, %rs272, 4; 2026-02-21T08:25:57.3553540Z selp.b16 %rs274, %rs156, %rs88, %p36; 
2026-02-21T08:25:57.3553715Z cvt.s16.s8 %rs275, %rs274; 2026-02-21T08:25:57.3553865Z shr.s16 %rs276, %rs275, 4; 2026-02-21T08:25:57.3554028Z selp.b16 %rs277, %rs157, %rs103, %p36; 2026-02-21T08:25:57.3554197Z cvt.s16.s8 %rs278, %rs277; 2026-02-21T08:25:57.3554353Z shr.s16 %rs279, %rs278, 4; 2026-02-21T08:25:57.3554516Z selp.b16 %rs280, %rs158, %rs104, %p36; 2026-02-21T08:25:57.3554692Z cvt.s16.s8 %rs281, %rs280; 2026-02-21T08:25:57.3554842Z shr.s16 %rs282, %rs281, 4; 2026-02-21T08:25:57.3555004Z selp.b16 %rs283, %rs159, %rs119, %p36; 2026-02-21T08:25:57.3555173Z cvt.s16.s8 %rs284, %rs283; 2026-02-21T08:25:57.3555328Z shr.s16 %rs285, %rs284, 4; 2026-02-21T08:25:57.3555490Z selp.b16 %rs286, %rs160, %rs120, %p36; 2026-02-21T08:25:57.3555656Z cvt.s16.s8 %rs287, %rs286; 2026-02-21T08:25:57.3555812Z shr.s16 %rs288, %rs287, 4; 2026-02-21T08:25:57.3555966Z selp.b16 %rs289, %rs161, %rs73, %p36; 2026-02-21T08:25:57.3556140Z cvt.s16.s8 %rs290, %rs289; 2026-02-21T08:25:57.3556289Z shr.s16 %rs291, %rs290, 4; 2026-02-21T08:25:57.3556453Z selp.b16 %rs292, %rs162, %rs74, %p36; 2026-02-21T08:25:57.3556620Z cvt.s16.s8 %rs293, %rs292; 2026-02-21T08:25:57.3556780Z shr.s16 %rs294, %rs293, 4; 2026-02-21T08:25:57.3556934Z selp.b16 %rs295, %rs163, %rs89, %p36; 2026-02-21T08:25:57.3557111Z cvt.s16.s8 %rs296, %rs295; 2026-02-21T08:25:57.3557326Z shr.s16 %rs297, %rs296, 4; 2026-02-21T08:25:57.3557482Z selp.b16 %rs298, %rs164, %rs90, %p36; 2026-02-21T08:25:57.3557658Z cvt.s16.s8 %rs299, %rs298; 2026-02-21T08:25:57.3557806Z shr.s16 %rs300, %rs299, 4; 2026-02-21T08:25:57.3557970Z selp.b16 %rs301, %rs165, %rs105, %p36; 2026-02-21T08:25:57.3558143Z cvt.s16.s8 %rs302, %rs301; 2026-02-21T08:25:57.3558301Z shr.s16 %rs303, %rs302, 4; 2026-02-21T08:25:57.3558463Z selp.b16 %rs304, %rs166, %rs106, %p36; 2026-02-21T08:25:57.3558645Z cvt.s16.s8 %rs305, %rs304; 2026-02-21T08:25:57.3558808Z shr.s16 %rs306, %rs305, 4; 2026-02-21T08:25:57.3558970Z selp.b16 %rs307, %rs167, %rs121, %p36; 
2026-02-21T08:25:57.3559148Z cvt.s16.s8 %rs308, %rs307; 2026-02-21T08:25:57.3559300Z shr.s16 %rs309, %rs308, 4; 2026-02-21T08:25:57.3559467Z selp.b16 %rs310, %rs168, %rs122, %p36; 2026-02-21T08:25:57.3559636Z cvt.s16.s8 %rs311, %rs310; 2026-02-21T08:25:57.3559793Z shr.s16 %rs312, %rs311, 4; 2026-02-21T08:25:57.3559995Z selp.b16 %rs313, %rs169, %rs75, %p36; 2026-02-21T08:25:57.3560177Z cvt.s16.s8 %rs314, %rs313; 2026-02-21T08:25:57.3560326Z shr.s16 %rs315, %rs314, 4; 2026-02-21T08:25:57.3560492Z selp.b16 %rs316, %rs170, %rs76, %p36; 2026-02-21T08:25:57.3560666Z cvt.s16.s8 %rs317, %rs316; 2026-02-21T08:25:57.3560817Z shr.s16 %rs318, %rs317, 4; 2026-02-21T08:25:57.3560979Z selp.b16 %rs319, %rs171, %rs91, %p36; 2026-02-21T08:25:57.3561146Z cvt.s16.s8 %rs320, %rs319; 2026-02-21T08:25:57.3561303Z shr.s16 %rs321, %rs320, 4; 2026-02-21T08:25:57.3561459Z selp.b16 %rs322, %rs172, %rs92, %p36; 2026-02-21T08:25:57.3561663Z cvt.s16.s8 %rs323, %rs322; 2026-02-21T08:25:57.3561816Z shr.s16 %rs324, %rs323, 4; 2026-02-21T08:25:57.3561982Z selp.b16 %rs325, %rs173, %rs107, %p36; 2026-02-21T08:25:57.3562161Z cvt.s16.s8 %rs326, %rs325; 2026-02-21T08:25:57.3562315Z shr.s16 %rs327, %rs326, 4; 2026-02-21T08:25:57.3562482Z selp.b16 %rs328, %rs174, %rs108, %p36; 2026-02-21T08:25:57.3562656Z cvt.s16.s8 %rs329, %rs328; 2026-02-21T08:25:57.3562821Z shr.s16 %rs330, %rs329, 4; 2026-02-21T08:25:57.3562979Z selp.b16 %rs331, %rs175, %rs123, %p36; 2026-02-21T08:25:57.3563159Z cvt.s16.s8 %rs332, %rs331; 2026-02-21T08:25:57.3563309Z shr.s16 %rs333, %rs332, 4; 2026-02-21T08:25:57.3563477Z selp.b16 %rs334, %rs176, %rs124, %p36; 2026-02-21T08:25:57.3563650Z cvt.s16.s8 %rs335, %rs334; 2026-02-21T08:25:57.3563812Z shr.s16 %rs336, %rs335, 4; 2026-02-21T08:25:57.3563974Z selp.b16 %rs337, %rs177, %rs77, %p36; 2026-02-21T08:25:57.3564139Z cvt.s16.s8 %rs338, %rs337; 2026-02-21T08:25:57.3564297Z shr.s16 %rs339, %rs338, 4; 2026-02-21T08:25:57.3564452Z selp.b16 %rs340, %rs178, %rs78, %p36; 
2026-02-21T08:25:57.3564625Z cvt.s16.s8 %rs341, %rs340; 2026-02-21T08:25:57.3564774Z shr.s16 %rs342, %rs341, 4; 2026-02-21T08:25:57.3564938Z selp.b16 %rs343, %rs179, %rs93, %p36; 2026-02-21T08:25:57.3565106Z cvt.s16.s8 %rs344, %rs343; 2026-02-21T08:25:57.3565266Z shr.s16 %rs345, %rs344, 4; 2026-02-21T08:25:57.3565435Z selp.b16 %rs346, %rs180, %rs94, %p36; 2026-02-21T08:25:57.3565609Z cvt.s16.s8 %rs347, %rs346; 2026-02-21T08:25:57.3565767Z shr.s16 %rs348, %rs347, 4; 2026-02-21T08:25:57.3565924Z selp.b16 %rs349, %rs181, %rs109, %p36; 2026-02-21T08:25:57.3566104Z cvt.s16.s8 %rs350, %rs349; 2026-02-21T08:25:57.3566253Z shr.s16 %rs351, %rs350, 4; 2026-02-21T08:25:57.3566418Z selp.b16 %rs352, %rs182, %rs110, %p36; 2026-02-21T08:25:57.3566588Z cvt.s16.s8 %rs353, %rs352; 2026-02-21T08:25:57.3566750Z shr.s16 %rs354, %rs353, 4; 2026-02-21T08:25:57.3566911Z selp.b16 %rs355, %rs183, %rs125, %p36; 2026-02-21T08:25:57.3567092Z cvt.s16.s8 %rs356, %rs355; 2026-02-21T08:25:57.3567255Z shr.s16 %rs357, %rs356, 4; 2026-02-21T08:25:57.3567413Z selp.b16 %rs358, %rs184, %rs126, %p36; 2026-02-21T08:25:57.3567590Z cvt.s16.s8 %rs359, %rs358; 2026-02-21T08:25:57.3567740Z shr.s16 %rs360, %rs359, 4; 2026-02-21T08:25:57.3567904Z selp.b16 %rs361, %rs185, %rs79, %p36; 2026-02-21T08:25:57.3568073Z cvt.s16.s8 %rs362, %rs361; 2026-02-21T08:25:57.3568289Z shr.s16 %rs363, %rs362, 4; 2026-02-21T08:25:57.3568443Z selp.b16 %rs364, %rs186, %rs80, %p36; 2026-02-21T08:25:57.3568619Z cvt.s16.s8 %rs365, %rs364; 2026-02-21T08:25:57.3568768Z shr.s16 %rs366, %rs365, 4; 2026-02-21T08:25:57.3568930Z selp.b16 %rs367, %rs187, %rs95, %p36; 2026-02-21T08:25:57.3569101Z cvt.s16.s8 %rs368, %rs367; 2026-02-21T08:25:57.3569253Z shr.s16 %rs369, %rs368, 4; 2026-02-21T08:25:57.3569415Z selp.b16 %rs370, %rs188, %rs96, %p36; 2026-02-21T08:25:57.3569581Z cvt.s16.s8 %rs371, %rs370; 2026-02-21T08:25:57.3569737Z shr.s16 %rs372, %rs371, 4; 2026-02-21T08:25:57.3569897Z selp.b16 %rs373, %rs189, %rs111, %p36; 
2026-02-21T08:25:57.3570076Z cvt.s16.s8 %rs374, %rs373; 2026-02-21T08:25:57.3570227Z shr.s16 %rs375, %rs374, 4; 2026-02-21T08:25:57.3570393Z selp.b16 %rs376, %rs190, %rs112, %p36; 2026-02-21T08:25:57.3570571Z cvt.s16.s8 %rs377, %rs376; 2026-02-21T08:25:57.3570721Z shr.s16 %rs378, %rs377, 4; 2026-02-21T08:25:57.3570933Z selp.b16 %rs379, %rs191, %rs127, %p36; 2026-02-21T08:25:57.3571107Z cvt.s16.s8 %rs380, %rs379; 2026-02-21T08:25:57.3571265Z shr.s16 %rs381, %rs380, 4; 2026-02-21T08:25:57.3571423Z selp.b16 %rs382, %rs192, %rs128, %p36; 2026-02-21T08:25:57.3571630Z cvt.s16.s8 %rs383, %rs382; 2026-02-21T08:25:57.3571782Z shr.s16 %rs384, %rs383, 4; 2026-02-21T08:25:57.3572064Z .loc 1 80 32 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:80:32 2026-02-21T08:25:57.3572365Z cvt.rn.f32.s16 %r341, %rs195; 2026-02-21T08:25:57.3572530Z cvt.rn.f32.s16 %r342, %rs198; 2026-02-21T08:25:57.3572697Z cvt.rn.f32.s16 %r343, %rs201; 2026-02-21T08:25:57.3572856Z cvt.rn.f32.s16 %r344, %rs204; 2026-02-21T08:25:57.3573018Z cvt.rn.f32.s16 %r345, %rs207; 2026-02-21T08:25:57.3573171Z cvt.rn.f32.s16 %r346, %rs210; 2026-02-21T08:25:57.3573332Z cvt.rn.f32.s16 %r347, %rs213; 2026-02-21T08:25:57.3573486Z cvt.rn.f32.s16 %r348, %rs216; 2026-02-21T08:25:57.3573652Z cvt.rn.f32.s16 %r349, %rs219; 2026-02-21T08:25:57.3573809Z cvt.rn.f32.s16 %r350, %rs222; 2026-02-21T08:25:57.3573971Z cvt.rn.f32.s16 %r351, %rs225; 2026-02-21T08:25:57.3574133Z cvt.rn.f32.s16 %r352, %rs228; 2026-02-21T08:25:57.3574287Z cvt.rn.f32.s16 %r353, %rs231; 2026-02-21T08:25:57.3574449Z cvt.rn.f32.s16 %r354, %rs234; 2026-02-21T08:25:57.3574601Z cvt.rn.f32.s16 %r355, %rs237; 2026-02-21T08:25:57.3574763Z cvt.rn.f32.s16 %r356, %rs240; 2026-02-21T08:25:57.3574917Z cvt.rn.f32.s16 %r357, %rs243; 2026-02-21T08:25:57.3575080Z cvt.rn.f32.s16 %r358, %rs246; 2026-02-21T08:25:57.3575233Z cvt.rn.f32.s16 %r359, %rs249; 2026-02-21T08:25:57.3575400Z cvt.rn.f32.s16 %r360, %rs252; 2026-02-21T08:25:57.3575565Z cvt.rn.f32.s16 %r361, %rs255; 
2026-02-21T08:25:57.3575730Z cvt.rn.f32.s16 %r362, %rs258; 2026-02-21T08:25:57.3575892Z cvt.rn.f32.s16 %r363, %rs261; 2026-02-21T08:25:57.3576046Z cvt.rn.f32.s16 %r364, %rs264; 2026-02-21T08:25:57.3576206Z cvt.rn.f32.s16 %r365, %rs267; 2026-02-21T08:25:57.3576360Z cvt.rn.f32.s16 %r366, %rs270; 2026-02-21T08:25:57.3576523Z cvt.rn.f32.s16 %r367, %rs273; 2026-02-21T08:25:57.3576677Z cvt.rn.f32.s16 %r368, %rs276; 2026-02-21T08:25:57.3576838Z cvt.rn.f32.s16 %r369, %rs279; 2026-02-21T08:25:57.3576993Z cvt.rn.f32.s16 %r370, %rs282; 2026-02-21T08:25:57.3577165Z cvt.rn.f32.s16 %r371, %rs285; 2026-02-21T08:25:57.3577324Z cvt.rn.f32.s16 %r372, %rs288; 2026-02-21T08:25:57.3577478Z cvt.rn.f32.s16 %r373, %rs291; 2026-02-21T08:25:57.3577637Z cvt.rn.f32.s16 %r374, %rs294; 2026-02-21T08:25:57.3577788Z cvt.rn.f32.s16 %r375, %rs297; 2026-02-21T08:25:57.3577950Z cvt.rn.f32.s16 %r376, %rs300; 2026-02-21T08:25:57.3578102Z cvt.rn.f32.s16 %r377, %rs303; 2026-02-21T08:25:57.3578263Z cvt.rn.f32.s16 %r378, %rs306; 2026-02-21T08:25:57.3578414Z cvt.rn.f32.s16 %r379, %rs309; 2026-02-21T08:25:57.3578575Z cvt.rn.f32.s16 %r380, %rs312; 2026-02-21T08:25:57.3578736Z cvt.rn.f32.s16 %r381, %rs315; 2026-02-21T08:25:57.3578891Z cvt.rn.f32.s16 %r382, %rs318; 2026-02-21T08:25:57.3579053Z cvt.rn.f32.s16 %r383, %rs321; 2026-02-21T08:25:57.3579297Z cvt.rn.f32.s16 %r384, %rs324; 2026-02-21T08:25:57.3579466Z cvt.rn.f32.s16 %r385, %rs327; 2026-02-21T08:25:57.3579623Z cvt.rn.f32.s16 %r386, %rs330; 2026-02-21T08:25:57.3579790Z cvt.rn.f32.s16 %r387, %rs333; 2026-02-21T08:25:57.3579949Z cvt.rn.f32.s16 %r388, %rs336; 2026-02-21T08:25:57.3580112Z cvt.rn.f32.s16 %r389, %rs339; 2026-02-21T08:25:57.3580271Z cvt.rn.f32.s16 %r390, %rs342; 2026-02-21T08:25:57.3580440Z cvt.rn.f32.s16 %r391, %rs345; 2026-02-21T08:25:57.3580607Z cvt.rn.f32.s16 %r392, %rs348; 2026-02-21T08:25:57.3580768Z cvt.rn.f32.s16 %r393, %rs351; 2026-02-21T08:25:57.3580935Z cvt.rn.f32.s16 %r394, %rs354; 2026-02-21T08:25:57.3581093Z cvt.rn.f32.s16 %r395, 
%rs357; 2026-02-21T08:25:57.3581259Z cvt.rn.f32.s16 %r396, %rs360; 2026-02-21T08:25:57.3581416Z cvt.rn.f32.s16 %r397, %rs363; 2026-02-21T08:25:57.3581612Z cvt.rn.f32.s16 %r398, %rs366; 2026-02-21T08:25:57.3581774Z cvt.rn.f32.s16 %r399, %rs369; 2026-02-21T08:25:57.3582009Z cvt.rn.f32.s16 %r400, %rs372; 2026-02-21T08:25:57.3582177Z cvt.rn.f32.s16 %r401, %rs375; 2026-02-21T08:25:57.3582348Z cvt.rn.f32.s16 %r402, %rs378; 2026-02-21T08:25:57.3582518Z cvt.rn.f32.s16 %r403, %rs381; 2026-02-21T08:25:57.3582681Z cvt.rn.f32.s16 %r404, %rs384; 2026-02-21T08:25:57.3582859Z st.shared.b32 [%r18], %r341; 2026-02-21T08:25:57.3583037Z st.shared.b32 [%r18+8], %r342; 2026-02-21T08:25:57.3583223Z st.shared.b32 [%r18+8192], %r357; 2026-02-21T08:25:57.3583406Z st.shared.b32 [%r18+8200], %r358; 2026-02-21T08:25:57.3583599Z st.shared.b32 [%r18+16384], %r373; 2026-02-21T08:25:57.3583785Z st.shared.b32 [%r18+16392], %r374; 2026-02-21T08:25:57.3583976Z st.shared.b32 [%r18+24576], %r389; 2026-02-21T08:25:57.3584166Z st.shared.b32 [%r18+24584], %r390; 2026-02-21T08:25:57.3584343Z st.shared.b32 [%r19], %r343; 2026-02-21T08:25:57.3584522Z st.shared.b32 [%r19+8], %r344; 2026-02-21T08:25:57.3584694Z st.shared.b32 [%r19+8192], %r359; 2026-02-21T08:25:57.3584878Z st.shared.b32 [%r19+8200], %r360; 2026-02-21T08:25:57.3585057Z st.shared.b32 [%r19+16384], %r375; 2026-02-21T08:25:57.3585238Z st.shared.b32 [%r19+16392], %r376; 2026-02-21T08:25:57.3585408Z st.shared.b32 [%r19+24576], %r391; 2026-02-21T08:25:57.3585589Z st.shared.b32 [%r19+24584], %r392; 2026-02-21T08:25:57.3585773Z st.shared.b32 [%r20], %r345; 2026-02-21T08:25:57.3585944Z st.shared.b32 [%r20+8], %r346; 2026-02-21T08:25:57.3586124Z st.shared.b32 [%r20+8192], %r361; 2026-02-21T08:25:57.3586299Z st.shared.b32 [%r20+8200], %r362; 2026-02-21T08:25:57.3586481Z st.shared.b32 [%r20+16384], %r377; 2026-02-21T08:25:57.3586655Z st.shared.b32 [%r20+16392], %r378; 2026-02-21T08:25:57.3586834Z st.shared.b32 [%r20+24576], %r393; 
2026-02-21T08:25:57.3587008Z st.shared.b32 [%r20+24584], %r394; 2026-02-21T08:25:57.3587198Z st.shared.b32 [%r21], %r347; 2026-02-21T08:25:57.3587363Z st.shared.b32 [%r21+8], %r348; 2026-02-21T08:25:57.3587534Z st.shared.b32 [%r21+8192], %r363; 2026-02-21T08:25:57.3587710Z st.shared.b32 [%r21+8200], %r364; 2026-02-21T08:25:57.3587879Z st.shared.b32 [%r21+16384], %r379; 2026-02-21T08:25:57.3588053Z st.shared.b32 [%r21+16392], %r380; 2026-02-21T08:25:57.3588220Z st.shared.b32 [%r21+24576], %r395; 2026-02-21T08:25:57.3588391Z st.shared.b32 [%r21+24584], %r396; 2026-02-21T08:25:57.3588556Z st.shared.b32 [%r22], %r349; 2026-02-21T08:25:57.3588723Z st.shared.b32 [%r22+8], %r350; 2026-02-21T08:25:57.3588884Z st.shared.b32 [%r22+8192], %r365; 2026-02-21T08:25:57.3589055Z st.shared.b32 [%r22+8200], %r366; 2026-02-21T08:25:57.3589226Z st.shared.b32 [%r22+16384], %r381; 2026-02-21T08:25:57.3589390Z st.shared.b32 [%r22+16392], %r382; 2026-02-21T08:25:57.3589563Z st.shared.b32 [%r22+24576], %r397; 2026-02-21T08:25:57.3589728Z st.shared.b32 [%r22+24584], %r398; 2026-02-21T08:25:57.3589900Z st.shared.b32 [%r23], %r351; 2026-02-21T08:25:57.3590058Z st.shared.b32 [%r23+8], %r352; 2026-02-21T08:25:57.3590227Z st.shared.b32 [%r23+8192], %r367; 2026-02-21T08:25:57.3590396Z st.shared.b32 [%r23+8200], %r368; 2026-02-21T08:25:57.3590626Z st.shared.b32 [%r23+16384], %r383; 2026-02-21T08:25:57.3590797Z st.shared.b32 [%r23+16392], %r384; 2026-02-21T08:25:57.3590970Z st.shared.b32 [%r23+24576], %r399; 2026-02-21T08:25:57.3591142Z st.shared.b32 [%r23+24584], %r400; 2026-02-21T08:25:57.3591307Z st.shared.b32 [%r24], %r353; 2026-02-21T08:25:57.3591475Z st.shared.b32 [%r24+8], %r354; 2026-02-21T08:25:57.3591666Z st.shared.b32 [%r24+8192], %r369; 2026-02-21T08:25:57.3591839Z st.shared.b32 [%r24+8200], %r370; 2026-02-21T08:25:57.3592003Z st.shared.b32 [%r24+16384], %r385; 2026-02-21T08:25:57.3592178Z st.shared.b32 [%r24+16392], %r386; 2026-02-21T08:25:57.3592347Z st.shared.b32 [%r24+24576], %r401; 
2026-02-21T08:25:57.3592523Z st.shared.b32 [%r24+24584], %r402; 2026-02-21T08:25:57.3592702Z st.shared.b32 [%r25], %r355; 2026-02-21T08:25:57.3592865Z st.shared.b32 [%r25+8], %r356; 2026-02-21T08:25:57.3593036Z st.shared.b32 [%r25+8192], %r371; 2026-02-21T08:25:57.3593252Z st.shared.b32 [%r25+8200], %r372; 2026-02-21T08:25:57.3593428Z st.shared.b32 [%r25+16384], %r387; 2026-02-21T08:25:57.3593594Z st.shared.b32 [%r25+16392], %r388; 2026-02-21T08:25:57.3593767Z st.shared.b32 [%r25+24576], %r403; 2026-02-21T08:25:57.3593933Z st.shared.b32 [%r25+24584], %r404; 2026-02-21T08:25:57.3594100Z $L__tmp4: 2026-02-21T08:25:57.3594332Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3594399Z // begin inline asm 2026-02-21T08:25:57.3594474Z fence.proxy.async.shared::cta; 2026-02-21T08:25:57.3594531Z // end inline asm 2026-02-21T08:25:57.3594585Z bar.sync 0; 2026-02-21T08:25:57.3594656Z setp.ne.b32 %p40, %r32, 0; 2026-02-21T08:25:57.3594715Z @%p40 bra $L__BB0_3; 2026-02-21T08:25:57.3594766Z // %bb.2: 2026-02-21T08:25:57.3594838Z elect.sync %r453|%p42, -1; 2026-02-21T08:25:57.3594899Z mov.b32 %r407, 68159760; 2026-02-21T08:25:57.3594960Z mov.pred %p41, 0; 2026-02-21T08:25:57.3595020Z // begin inline asm 2026-02-21T08:25:57.3595189Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 0 ], %rd92, %r407, %p41; 2026-02-21T08:25:57.3595246Z // end inline asm 2026-02-21T08:25:57.3595301Z // begin inline asm 2026-02-21T08:25:57.3595458Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 8 ], %rd93, %r407, %p43; 2026-02-21T08:25:57.3595514Z // end inline asm 2026-02-21T08:25:57.3595569Z // begin inline asm 2026-02-21T08:25:57.3595725Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 16 ], %rd94, %r407, %p43; 2026-02-21T08:25:57.3595780Z // end inline asm 2026-02-21T08:25:57.3595834Z // begin inline asm 2026-02-21T08:25:57.3595985Z @%p42 
tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 24 ], %rd95, %r407, %p43; 2026-02-21T08:25:57.3596039Z // end inline asm 2026-02-21T08:25:57.3596095Z // begin inline asm 2026-02-21T08:25:57.3596235Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 32 ], %rd96, %r407, %p43; 2026-02-21T08:25:57.3596300Z // end inline asm 2026-02-21T08:25:57.3596355Z // begin inline asm 2026-02-21T08:25:57.3596494Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 40 ], %rd97, %r407, %p43; 2026-02-21T08:25:57.3596556Z // end inline asm 2026-02-21T08:25:57.3596611Z // begin inline asm 2026-02-21T08:25:57.3596751Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 48 ], %rd98, %r407, %p43; 2026-02-21T08:25:57.3596813Z // end inline asm 2026-02-21T08:25:57.3596868Z // begin inline asm 2026-02-21T08:25:57.3597009Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 56 ], %rd99, %r407, %p43; 2026-02-21T08:25:57.3597070Z // end inline asm 2026-02-21T08:25:57.3597125Z // begin inline asm 2026-02-21T08:25:57.3597266Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 64 ], %rd100, %r407, %p43; 2026-02-21T08:25:57.3597320Z // end inline asm 2026-02-21T08:25:57.3597382Z // begin inline asm 2026-02-21T08:25:57.3597573Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 72 ], %rd101, %r407, %p43; 2026-02-21T08:25:57.3597628Z // end inline asm 2026-02-21T08:25:57.3597690Z // begin inline asm 2026-02-21T08:25:57.3597833Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 80 ], %rd102, %r407, %p43; 2026-02-21T08:25:57.3597887Z // end inline asm 2026-02-21T08:25:57.3597948Z // begin inline asm 2026-02-21T08:25:57.3598088Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 88 ], %rd103, %r407, %p43; 2026-02-21T08:25:57.3598143Z // end inline asm 2026-02-21T08:25:57.3598197Z // begin inline asm 2026-02-21T08:25:57.3598347Z @%p42 
tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 96 ], %rd104, %r407, %p43; 2026-02-21T08:25:57.3598400Z // end inline asm 2026-02-21T08:25:57.3598455Z // begin inline asm 2026-02-21T08:25:57.3598609Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 104 ], %rd105, %r407, %p43; 2026-02-21T08:25:57.3598700Z // end inline asm 2026-02-21T08:25:57.3598758Z // begin inline asm 2026-02-21T08:25:57.3598911Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 112 ], %rd106, %r407, %p43; 2026-02-21T08:25:57.3598965Z // end inline asm 2026-02-21T08:25:57.3599020Z // begin inline asm 2026-02-21T08:25:57.3599172Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 120 ], %rd107, %r407, %p43; 2026-02-21T08:25:57.3599225Z // end inline asm 2026-02-21T08:25:57.3599288Z add.s32 %r455, %r49, 73728; 2026-02-21T08:25:57.3599349Z cvt.u64.u32 %rd108, %r455; 2026-02-21T08:25:57.3599414Z // begin inline asm 2026-02-21T08:25:57.3599543Z @%p42 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd108]; 2026-02-21T08:25:57.3599598Z // end inline asm 2026-02-21T08:25:57.3599659Z $L__tmp5: 2026-02-21T08:25:57.3599712Z $L__BB0_3: 2026-02-21T08:25:57.3599805Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:25:57.3599865Z add.s32 %r26, %r285, %r283; 2026-02-21T08:25:57.3599937Z add.s32 %r786, %r295, %r293; 2026-02-21T08:25:57.3600009Z mad.wide.u32 %rd26, %r218, 2, %rd34; 2026-02-21T08:25:57.3600070Z cvt.u64.u32 %rd27, %r304; 2026-02-21T08:25:57.3600255Z .loc 1 51 80 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:51:80 2026-02-21T08:25:57.3600312Z // begin inline asm 2026-02-21T08:25:57.3600433Z cp.async.cg.shared.global [ %r456 + 0 ], [ %rd109 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3600496Z // end inline asm 2026-02-21T08:25:57.3600552Z // begin inline asm 2026-02-21T08:25:57.3600669Z cp.async.cg.shared.global [ %r458 + 0 ], [ %rd110 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3600722Z // end inline asm 
2026-02-21T08:25:57.3600789Z // begin inline asm 2026-02-21T08:25:57.3600905Z cp.async.cg.shared.global [ %r460 + 0 ], [ %rd111 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3600960Z // end inline asm 2026-02-21T08:25:57.3601026Z // begin inline asm 2026-02-21T08:25:57.3601142Z cp.async.cg.shared.global [ %r462 + 0 ], [ %rd112 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3601201Z // end inline asm 2026-02-21T08:25:57.3601257Z // begin inline asm 2026-02-21T08:25:57.3601376Z cp.async.cg.shared.global [ %r464 + 0 ], [ %rd113 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3601431Z // end inline asm 2026-02-21T08:25:57.3601486Z // begin inline asm 2026-02-21T08:25:57.3601639Z cp.async.cg.shared.global [ %r466 + 0 ], [ %rd114 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3601695Z // end inline asm 2026-02-21T08:25:57.3601750Z // begin inline asm 2026-02-21T08:25:57.3601864Z cp.async.cg.shared.global [ %r468 + 0 ], [ %rd115 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3601917Z // end inline asm 2026-02-21T08:25:57.3601972Z // begin inline asm 2026-02-21T08:25:57.3602080Z cp.async.cg.shared.global [ %r470 + 0 ], [ %rd116 + 0 ], 0x10, %r457; 2026-02-21T08:25:57.3602141Z // end inline asm 2026-02-21T08:25:57.3602205Z cp.async.commit_group; 2026-02-21T08:25:57.3602378Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3602489Z // begin inline asm 2026-02-21T08:25:57.3602604Z @%p127 mbarrier.arrive.expect_tx.shared.b64 _, [%r472], 4096; 2026-02-21T08:25:57.3602660Z // end inline asm 2026-02-21T08:25:57.3602836Z .loc 1 57 33 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:57:33 2026-02-21T08:25:57.3602890Z bar.sync 0; 2026-02-21T08:25:57.3602957Z elect.sync %r481|%p77, -1; 2026-02-21T08:25:57.3603020Z and.pred %p75, %p1, %p77; 2026-02-21T08:25:57.3603086Z mov.b32 %r475, 128; 2026-02-21T08:25:57.3603141Z // begin inline asm 2026-02-21T08:25:57.3603383Z @%p75 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r121], 
[%rd117, {%r474, %r475}], [%r472]; 2026-02-21T08:25:57.3603450Z // end inline asm 2026-02-21T08:25:57.3603616Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3603677Z bfe.u32 %r482, %r1, 4, 3; 2026-02-21T08:25:57.3603798Z mul.wide.u32 %rd119, %r482, 16384; 2026-02-21T08:25:57.3603868Z mul.wide.u32 %rd120, %r221, 16; 2026-02-21T08:25:57.3603930Z or.b64 %rd121, %rd119, %rd120; 2026-02-21T08:25:57.3603991Z add.s64 %rd122, %rd121, %rd33; 2026-02-21T08:25:57.3604063Z add.s64 %rd218, %rd122, 918272; 2026-02-21T08:25:57.3604118Z mov.b32 %r872, 1; 2026-02-21T08:25:57.3604175Z mov.b32 %r868, 0; 2026-02-21T08:25:57.3604238Z mov.b64 %rd219, 0; 2026-02-21T08:25:57.3604294Z mov.b32 %r870, %r868; 2026-02-21T08:25:57.3604351Z mov.b32 %r871, %r868; 2026-02-21T08:25:57.3604406Z mov.b32 %r873, %r868; 2026-02-21T08:25:57.3604469Z bra.uni $L__BB0_4; 2026-02-21T08:25:57.3604572Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T08:25:57.3604742Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3604816Z setp.lt.u64 %p120, %rd219, 3904; 2026-02-21T08:25:57.3604868Z $L__tmp6: 2026-02-21T08:25:57.3605086Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3605156Z add.s32 %r738, %r872, 1; 2026-02-21T08:25:57.3605219Z setp.gt.s32 %p123, %r738, 1; 2026-02-21T08:25:57.3605282Z selp.b32 %r872, 0, %r738, %p123; 2026-02-21T08:25:57.3605342Z selp.b32 %r739, 1, 0, %p123; 2026-02-21T08:25:57.3605407Z xor.b32 %r48, %r873, %r739; 2026-02-21T08:25:57.3605460Z $L__tmp7: 2026-02-21T08:25:57.3605626Z .loc 1 51 32 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:51:32 2026-02-21T08:25:57.3605700Z add.s64 %rd140, %rd218, -917504; 2026-02-21T08:25:57.3605763Z add.s64 %rd141, %rd218, -786432; 2026-02-21T08:25:57.3605823Z add.s64 %rd142, %rd218, -655360; 2026-02-21T08:25:57.3605881Z add.s64 %rd143, %rd218, -524288; 
2026-02-21T08:25:57.3605948Z add.s64 %rd144, %rd218, -393216; 2026-02-21T08:25:57.3606007Z add.s64 %rd145, %rd218, -262144; 2026-02-21T08:25:57.3606068Z add.s64 %rd146, %rd218, -131072; 2026-02-21T08:25:57.3606240Z .loc 1 51 80 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:51:80 2026-02-21T08:25:57.3606300Z add.s32 %r717, %r42, %r5; 2026-02-21T08:25:57.3606361Z selp.b32 %r718, 16, 0, %p120; 2026-02-21T08:25:57.3606423Z // begin inline asm 2026-02-21T08:25:57.3606537Z cp.async.cg.shared.global [ %r717 + 0 ], [ %rd140 + 0 ], 0x10, %r718; 2026-02-21T08:25:57.3606591Z // end inline asm 2026-02-21T08:25:57.3606649Z add.s32 %r719, %r717, 2048; 2026-02-21T08:25:57.3606713Z // begin inline asm 2026-02-21T08:25:57.3606824Z cp.async.cg.shared.global [ %r719 + 0 ], [ %rd141 + 0 ], 0x10, %r718; 2026-02-21T08:25:57.3606878Z // end inline asm 2026-02-21T08:25:57.3606942Z add.s32 %r721, %r717, 4096; 2026-02-21T08:25:57.3606999Z // begin inline asm 2026-02-21T08:25:57.3607109Z cp.async.cg.shared.global [ %r721 + 0 ], [ %rd142 + 0 ], 0x10, %r718; 2026-02-21T08:25:57.3607163Z // end inline asm 2026-02-21T08:25:57.3607231Z add.s32 %r723, %r717, 6144; 2026-02-21T08:25:57.3607356Z // begin inline asm 2026-02-21T08:25:57.3607468Z cp.async.cg.shared.global [ %r723 + 0 ], [ %rd143 + 0 ], 0x10, %r718; 2026-02-21T08:25:57.3607530Z // end inline asm 2026-02-21T08:25:57.3607588Z add.s32 %r725, %r717, 8192; 2026-02-21T08:25:57.3607643Z // begin inline asm 2026-02-21T08:25:57.3607761Z cp.async.cg.shared.global [ %r725 + 0 ], [ %rd144 + 0 ], 0x10, %r718; 2026-02-21T08:25:57.3607815Z // end inline asm 2026-02-21T08:25:57.3607875Z add.s32 %r727, %r717, 10240; 2026-02-21T08:25:57.3607932Z // begin inline asm 2026-02-21T08:25:57.3608050Z cp.async.cg.shared.global [ %r727 + 0 ], [ %rd145 + 0 ], 0x10, %r718; 2026-02-21T08:25:57.3608104Z // end inline asm 2026-02-21T08:25:57.3608161Z add.s32 %r729, %r717, 12288; 2026-02-21T08:25:57.3608226Z // begin inline asm 2026-02-21T08:25:57.3608337Z 
cp.async.cg.shared.global [ %r729 + 0 ], [ %rd146 + 0 ], 0x10, %r718; 2026-02-21T08:25:57.3608392Z // end inline asm 2026-02-21T08:25:57.3608491Z add.s32 %r731, %r42, %r13; 2026-02-21T08:25:57.3608560Z // begin inline asm 2026-02-21T08:25:57.3608670Z cp.async.cg.shared.global [ %r731 + 0 ], [ %rd218 + 0 ], 0x10, %r718; 2026-02-21T08:25:57.3608727Z // end inline asm 2026-02-21T08:25:57.3608801Z cp.async.commit_group; 2026-02-21T08:25:57.3608977Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3609045Z and.pred %p118, %p127, %p120; 2026-02-21T08:25:57.3609113Z // begin inline asm 2026-02-21T08:25:57.3609228Z @%p118 mbarrier.arrive.expect_tx.shared.b64 _, [%r733], 4096; 2026-02-21T08:25:57.3609293Z // end inline asm 2026-02-21T08:25:57.3609464Z .loc 1 57 33 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:57:33 2026-02-21T08:25:57.3609529Z bar.sync 0; 2026-02-21T08:25:57.3609594Z elect.sync %r740|%p124, -1; 2026-02-21T08:25:57.3609659Z and.pred %p125, %p120, %p124; 2026-02-21T08:25:57.3609729Z and.pred %p119, %p1, %p125; 2026-02-21T08:25:57.3609792Z cvt.u32.u64 %r741, %rd219; 2026-02-21T08:25:57.3609855Z add.s32 %r736, %r741, 192; 2026-02-21T08:25:57.3609910Z // begin inline asm 2026-02-21T08:25:57.3610166Z @%p119 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r734], [%rd117, {%r474, %r736}], [%r733]; 2026-02-21T08:25:57.3610220Z // end inline asm 2026-02-21T08:25:57.3610394Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3610461Z add.s64 %rd218, %rd218, 256; 2026-02-21T08:25:57.3610527Z setp.lt.u64 %p126, %rd219, 3968; 2026-02-21T08:25:57.3610587Z add.s64 %rd219, %rd219, 64; 2026-02-21T08:25:57.3610652Z mov.b32 %r868, %r873; 2026-02-21T08:25:57.3610709Z mov.b32 %r873, %r48; 2026-02-21T08:25:57.3610768Z @%p126 bra $L__BB0_4; 2026-02-21T08:25:57.3610824Z bra.uni $L__BB0_7; 2026-02-21T08:25:57.3610936Z $L__BB0_4: // =>This Inner 
Loop Header: Depth=1 2026-02-21T08:25:57.3610997Z add.s32 %r556, %r871, 1; 2026-02-21T08:25:57.3611064Z setp.gt.s32 %p84, %r556, 1; 2026-02-21T08:25:57.3611134Z selp.b32 %r871, 0, %r556, %p84; 2026-02-21T08:25:57.3611305Z .loc 1 51 80 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:51:80 2026-02-21T08:25:57.3611369Z cp.async.wait_group 1; 2026-02-21T08:25:57.3611431Z bar.sync 0; 2026-02-21T08:25:57.3611489Z shl.b32 %r557, %r871, 14; 2026-02-21T08:25:57.3611574Z add.s32 %r559, %r49, %r557; 2026-02-21T08:25:57.3611632Z add.s32 %r42, %r559, 32768; 2026-02-21T08:25:57.3611698Z add.s32 %r560, %r42, %r16; 2026-02-21T08:25:57.3611795Z ld.shared.v4.b32 {%r561, %r562, %r563, %r564}, [%r560]; 2026-02-21T08:25:57.3611858Z mov.b32 {%rs385, %rs386}, %r564; 2026-02-21T08:25:57.3611928Z mov.b32 {%rs387, %rs388}, %r563; 2026-02-21T08:25:57.3611987Z mov.b32 {%rs389, %rs390}, %r562; 2026-02-21T08:25:57.3612045Z mov.b32 {%rs391, %rs392}, %r561; 2026-02-21T08:25:57.3612146Z ld.shared.v4.b32 {%r565, %r566, %r567, %r568}, [%r560+16]; 2026-02-21T08:25:57.3612265Z mov.b32 {%rs393, %rs394}, %r568; 2026-02-21T08:25:57.3612323Z mov.b32 {%rs395, %rs396}, %r567; 2026-02-21T08:25:57.3612380Z mov.b32 {%rs397, %rs398}, %r566; 2026-02-21T08:25:57.3612446Z mov.b32 {%rs399, %rs400}, %r565; 2026-02-21T08:25:57.3612540Z ld.shared.v4.b32 {%r569, %r570, %r571, %r572}, [%r560+32]; 2026-02-21T08:25:57.3612600Z mov.b32 {%rs401, %rs402}, %r572; 2026-02-21T08:25:57.3612665Z mov.b32 {%rs403, %rs404}, %r571; 2026-02-21T08:25:57.3612724Z mov.b32 {%rs405, %rs406}, %r570; 2026-02-21T08:25:57.3612781Z mov.b32 {%rs407, %rs408}, %r569; 2026-02-21T08:25:57.3612873Z ld.shared.v4.b32 {%r573, %r574, %r575, %r576}, [%r560+48]; 2026-02-21T08:25:57.3612941Z mov.b32 {%rs409, %rs410}, %r576; 2026-02-21T08:25:57.3612999Z mov.b32 {%rs411, %rs412}, %r575; 2026-02-21T08:25:57.3613059Z mov.b32 {%rs413, %rs414}, %r574; 2026-02-21T08:25:57.3613122Z mov.b32 {%rs415, %rs416}, %r573; 2026-02-21T08:25:57.3613260Z 
ld.shared.v4.b32 {%r577, %r578, %r579, %r580}, [%r560+64]; 2026-02-21T08:25:57.3613324Z mov.b32 {%rs417, %rs418}, %r580; 2026-02-21T08:25:57.3613381Z mov.b32 {%rs419, %rs420}, %r579; 2026-02-21T08:25:57.3613447Z mov.b32 {%rs421, %rs422}, %r578; 2026-02-21T08:25:57.3613505Z mov.b32 {%rs423, %rs424}, %r577; 2026-02-21T08:25:57.3613593Z ld.shared.v4.b32 {%r581, %r582, %r583, %r584}, [%r560+80]; 2026-02-21T08:25:57.3613658Z mov.b32 {%rs425, %rs426}, %r584; 2026-02-21T08:25:57.3613715Z mov.b32 {%rs427, %rs428}, %r583; 2026-02-21T08:25:57.3613772Z mov.b32 {%rs429, %rs430}, %r582; 2026-02-21T08:25:57.3613835Z mov.b32 {%rs431, %rs432}, %r581; 2026-02-21T08:25:57.3613923Z ld.shared.v4.b32 {%r585, %r586, %r587, %r588}, [%r560+96]; 2026-02-21T08:25:57.3613980Z mov.b32 {%rs433, %rs434}, %r588; 2026-02-21T08:25:57.3614037Z mov.b32 {%rs435, %rs436}, %r587; 2026-02-21T08:25:57.3614100Z mov.b32 {%rs437, %rs438}, %r586; 2026-02-21T08:25:57.3614158Z mov.b32 {%rs439, %rs440}, %r585; 2026-02-21T08:25:57.3614255Z ld.shared.v4.b32 {%r589, %r590, %r591, %r592}, [%r560+112]; 2026-02-21T08:25:57.3614323Z mov.b32 {%rs441, %rs442}, %r592; 2026-02-21T08:25:57.3614383Z mov.b32 {%rs443, %rs444}, %r591; 2026-02-21T08:25:57.3614440Z mov.b32 {%rs445, %rs446}, %r590; 2026-02-21T08:25:57.3614505Z mov.b32 {%rs447, %rs448}, %r589; 2026-02-21T08:25:57.3614675Z .loc 1 55 32 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:55:32 2026-02-21T08:25:57.3614735Z cvt.f32.bf16 %r487, %rs391; 2026-02-21T08:25:57.3614795Z cvt.f32.bf16 %r488, %rs392; 2026-02-21T08:25:57.3614862Z cvt.f32.bf16 %r489, %rs389; 2026-02-21T08:25:57.3614920Z cvt.f32.bf16 %r490, %rs390; 2026-02-21T08:25:57.3614977Z cvt.f32.bf16 %r491, %rs387; 2026-02-21T08:25:57.3615041Z cvt.f32.bf16 %r492, %rs388; 2026-02-21T08:25:57.3615098Z cvt.f32.bf16 %r493, %rs385; 2026-02-21T08:25:57.3615155Z cvt.f32.bf16 %r494, %rs386; 2026-02-21T08:25:57.3615214Z cvt.f32.bf16 %r495, %rs399; 2026-02-21T08:25:57.3615278Z cvt.f32.bf16 %r496, %rs400; 
2026-02-21T08:25:57.3615337Z cvt.f32.bf16 %r497, %rs397; 2026-02-21T08:25:57.3615395Z cvt.f32.bf16 %r498, %rs398; 2026-02-21T08:25:57.3615461Z cvt.f32.bf16 %r499, %rs395; 2026-02-21T08:25:57.3615518Z cvt.f32.bf16 %r500, %rs396; 2026-02-21T08:25:57.3615574Z cvt.f32.bf16 %r501, %rs393; 2026-02-21T08:25:57.3615631Z cvt.f32.bf16 %r502, %rs394; 2026-02-21T08:25:57.3615696Z cvt.f32.bf16 %r504, %rs407; 2026-02-21T08:25:57.3615752Z cvt.f32.bf16 %r505, %rs408; 2026-02-21T08:25:57.3615810Z cvt.f32.bf16 %r506, %rs405; 2026-02-21T08:25:57.3615875Z cvt.f32.bf16 %r507, %rs406; 2026-02-21T08:25:57.3615931Z cvt.f32.bf16 %r508, %rs403; 2026-02-21T08:25:57.3615987Z cvt.f32.bf16 %r509, %rs404; 2026-02-21T08:25:57.3616054Z cvt.f32.bf16 %r510, %rs401; 2026-02-21T08:25:57.3616113Z cvt.f32.bf16 %r511, %rs402; 2026-02-21T08:25:57.3616171Z cvt.f32.bf16 %r512, %rs415; 2026-02-21T08:25:57.3616229Z cvt.f32.bf16 %r513, %rs416; 2026-02-21T08:25:57.3616297Z cvt.f32.bf16 %r514, %rs413; 2026-02-21T08:25:57.3616358Z cvt.f32.bf16 %r515, %rs414; 2026-02-21T08:25:57.3616457Z cvt.f32.bf16 %r516, %rs411; 2026-02-21T08:25:57.3616522Z cvt.f32.bf16 %r517, %rs412; 2026-02-21T08:25:57.3616580Z cvt.f32.bf16 %r518, %rs409; 2026-02-21T08:25:57.3616638Z cvt.f32.bf16 %r519, %rs410; 2026-02-21T08:25:57.3616694Z cvt.f32.bf16 %r521, %rs423; 2026-02-21T08:25:57.3616761Z cvt.f32.bf16 %r522, %rs424; 2026-02-21T08:25:57.3616817Z cvt.f32.bf16 %r523, %rs421; 2026-02-21T08:25:57.3616873Z cvt.f32.bf16 %r524, %rs422; 2026-02-21T08:25:57.3616937Z cvt.f32.bf16 %r525, %rs419; 2026-02-21T08:25:57.3616993Z cvt.f32.bf16 %r526, %rs420; 2026-02-21T08:25:57.3617051Z cvt.f32.bf16 %r527, %rs417; 2026-02-21T08:25:57.3617109Z cvt.f32.bf16 %r528, %rs418; 2026-02-21T08:25:57.3617174Z cvt.f32.bf16 %r529, %rs431; 2026-02-21T08:25:57.3617231Z cvt.f32.bf16 %r530, %rs432; 2026-02-21T08:25:57.3617288Z cvt.f32.bf16 %r531, %rs429; 2026-02-21T08:25:57.3617352Z cvt.f32.bf16 %r532, %rs430; 2026-02-21T08:25:57.3617449Z cvt.f32.bf16 %r533, %rs427; 
2026-02-21T08:25:57.3617510Z cvt.f32.bf16 %r534, %rs428; 2026-02-21T08:25:57.3617567Z cvt.f32.bf16 %r535, %rs425; 2026-02-21T08:25:57.3617630Z cvt.f32.bf16 %r536, %rs426; 2026-02-21T08:25:57.3617686Z cvt.f32.bf16 %r538, %rs439; 2026-02-21T08:25:57.3617742Z cvt.f32.bf16 %r539, %rs440; 2026-02-21T08:25:57.3617806Z cvt.f32.bf16 %r540, %rs437; 2026-02-21T08:25:57.3617864Z cvt.f32.bf16 %r541, %rs438; 2026-02-21T08:25:57.3617920Z cvt.f32.bf16 %r542, %rs435; 2026-02-21T08:25:57.3617984Z cvt.f32.bf16 %r543, %rs436; 2026-02-21T08:25:57.3618041Z cvt.f32.bf16 %r544, %rs433; 2026-02-21T08:25:57.3618098Z cvt.f32.bf16 %r545, %rs434; 2026-02-21T08:25:57.3618154Z cvt.f32.bf16 %r546, %rs447; 2026-02-21T08:25:57.3618218Z cvt.f32.bf16 %r547, %rs448; 2026-02-21T08:25:57.3618274Z cvt.f32.bf16 %r548, %rs445; 2026-02-21T08:25:57.3618331Z cvt.f32.bf16 %r549, %rs446; 2026-02-21T08:25:57.3618396Z cvt.f32.bf16 %r550, %rs443; 2026-02-21T08:25:57.3618451Z cvt.f32.bf16 %r551, %rs444; 2026-02-21T08:25:57.3618510Z cvt.f32.bf16 %r552, %rs441; 2026-02-21T08:25:57.3618568Z cvt.f32.bf16 %r553, %rs442; 2026-02-21T08:25:57.3618628Z $L__tmp8: 2026-02-21T08:25:57.3618845Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3618902Z // begin inline asm 2026-02-21T08:25:57.3618962Z 2026-02-21T08:25:57.3619012Z { 2026-02-21T08:25:57.3619073Z .reg .pred complete; 2026-02-21T08:25:57.3619128Z waitLoop: 2026-02-21T08:25:57.3619255Z mbarrier.try_wait.parity.shared.b64 complete, [%r869], %r868; 2026-02-21T08:25:57.3619319Z @!complete bra.uni waitLoop; 2026-02-21T08:25:57.3619369Z } 2026-02-21T08:25:57.3619373Z 2026-02-21T08:25:57.3619434Z // end inline asm 2026-02-21T08:25:57.3619486Z $L__tmp9: 2026-02-21T08:25:57.3619650Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3619717Z selp.b32 %r593, 1, 0, %p84; 2026-02-21T08:25:57.3619778Z xor.b32 %r870, %r870, %r593; 2026-02-21T08:25:57.3619840Z 
mov.pred %p85, -1; 2026-02-21T08:25:57.3619894Z $L__tmp10: 2026-02-21T08:25:57.3620115Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3620171Z // begin inline asm 2026-02-21T08:25:57.3620467Z @%p85 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r146 + 0], 64, {%r487, %r488, %r489, %r490, %r491, %r492, %r493, %r494, %r495, %r496, %r497, %r498, %r499, %r500, %r501, %r502}; 2026-02-21T08:25:57.3620529Z // end inline asm 2026-02-21T08:25:57.3620584Z // begin inline asm 2026-02-21T08:25:57.3620866Z @%p85 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r146 + 16], 64, {%r504, %r505, %r506, %r507, %r508, %r509, %r510, %r511, %r512, %r513, %r514, %r515, %r516, %r517, %r518, %r519}; 2026-02-21T08:25:57.3620928Z // end inline asm 2026-02-21T08:25:57.3620984Z // begin inline asm 2026-02-21T08:25:57.3621262Z @%p85 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r146 + 32], 64, {%r521, %r522, %r523, %r524, %r525, %r526, %r527, %r528, %r529, %r530, %r531, %r532, %r533, %r534, %r535, %r536}; 2026-02-21T08:25:57.3621365Z // end inline asm 2026-02-21T08:25:57.3621420Z // begin inline asm 2026-02-21T08:25:57.3621733Z @%p85 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r146 + 48], 64, {%r538, %r539, %r540, %r541, %r542, %r543, %r544, %r545, %r546, %r547, %r548, %r549, %r550, %r551, %r552, %r553}; 2026-02-21T08:25:57.3621789Z // end inline asm 2026-02-21T08:25:57.3621853Z // begin inline asm 2026-02-21T08:25:57.3621924Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:25:57.3621977Z // end inline asm 2026-02-21T08:25:57.3622040Z bar.sync 0; 2026-02-21T08:25:57.3622094Z $L__tmp11: 2026-02-21T08:25:57.3622262Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3622328Z shl.b32 %r594, %r871, 3; 2026-02-21T08:25:57.3622386Z add.s32 %r595, %r49, %r594; 2026-02-21T08:25:57.3622531Z add.s32 %r733, %r595, 73744; 2026-02-21T08:25:57.3622591Z // begin inline asm 2026-02-21T08:25:57.3622648Z 
2026-02-21T08:25:57.3622699Z { 2026-02-21T08:25:57.3622759Z .reg .pred complete; 2026-02-21T08:25:57.3622820Z waitLoop: 2026-02-21T08:25:57.3622937Z mbarrier.try_wait.parity.shared.b64 complete, [%r733], %r870; 2026-02-21T08:25:57.3623001Z @!complete bra.uni waitLoop; 2026-02-21T08:25:57.3623050Z } 2026-02-21T08:25:57.3623054Z 2026-02-21T08:25:57.3623117Z // end inline asm 2026-02-21T08:25:57.3623305Z .loc 1 57 33 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:57:33 2026-02-21T08:25:57.3623367Z shl.b32 %r596, %r871, 12; 2026-02-21T08:25:57.3623435Z add.s32 %r597, %r49, %r596; 2026-02-21T08:25:57.3623497Z add.s32 %r734, %r597, 65536; 2026-02-21T08:25:57.3623556Z add.s32 %r598, %r734, %r17; 2026-02-21T08:25:57.3623630Z ld.shared.b8 %rs449, [%r598]; 2026-02-21T08:25:57.3623696Z ld.shared.b8 %rs450, [%r598+64]; 2026-02-21T08:25:57.3623770Z ld.shared.b8 %rs451, [%r598+512]; 2026-02-21T08:25:57.3623840Z ld.shared.b8 %rs452, [%r598+576]; 2026-02-21T08:25:57.3623919Z ld.shared.b8 %rs453, [%r598+1024]; 2026-02-21T08:25:57.3623986Z ld.shared.b8 %rs454, [%r598+1088]; 2026-02-21T08:25:57.3624051Z ld.shared.b8 %rs455, [%r598+1536]; 2026-02-21T08:25:57.3624127Z ld.shared.b8 %rs456, [%r598+1600]; 2026-02-21T08:25:57.3624192Z ld.shared.b8 %rs457, [%r598+2048]; 2026-02-21T08:25:57.3624258Z ld.shared.b8 %rs458, [%r598+2112]; 2026-02-21T08:25:57.3624323Z ld.shared.b8 %rs459, [%r598+2560]; 2026-02-21T08:25:57.3624399Z ld.shared.b8 %rs460, [%r598+2624]; 2026-02-21T08:25:57.3624464Z ld.shared.b8 %rs461, [%r598+3072]; 2026-02-21T08:25:57.3624527Z ld.shared.b8 %rs462, [%r598+3136]; 2026-02-21T08:25:57.3624596Z ld.shared.b8 %rs463, [%r598+3584]; 2026-02-21T08:25:57.3624657Z ld.shared.b8 %rs464, [%r598+3648]; 2026-02-21T08:25:57.3624717Z add.s32 %r599, %r734, %r30; 2026-02-21T08:25:57.3624781Z ld.shared.b8 %rs465, [%r599+128]; 2026-02-21T08:25:57.3624852Z ld.shared.b8 %rs466, [%r599+192]; 2026-02-21T08:25:57.3624916Z ld.shared.b8 %rs467, [%r599+640]; 2026-02-21T08:25:57.3624981Z 
ld.shared.b8 %rs468, [%r599+704]; 2026-02-21T08:25:57.3625050Z ld.shared.b8 %rs469, [%r599+1152]; 2026-02-21T08:25:57.3625111Z ld.shared.b8 %rs470, [%r599+1216]; 2026-02-21T08:25:57.3625173Z ld.shared.b8 %rs471, [%r599+1664]; 2026-02-21T08:25:57.3625242Z ld.shared.b8 %rs472, [%r599+1728]; 2026-02-21T08:25:57.3625303Z ld.shared.b8 %rs473, [%r599+2176]; 2026-02-21T08:25:57.3625366Z ld.shared.b8 %rs474, [%r599+2240]; 2026-02-21T08:25:57.3625428Z ld.shared.b8 %rs475, [%r599+2688]; 2026-02-21T08:25:57.3625498Z ld.shared.b8 %rs476, [%r599+2752]; 2026-02-21T08:25:57.3625560Z ld.shared.b8 %rs477, [%r599+3200]; 2026-02-21T08:25:57.3625622Z ld.shared.b8 %rs478, [%r599+3264]; 2026-02-21T08:25:57.3625690Z ld.shared.b8 %rs479, [%r599+3712]; 2026-02-21T08:25:57.3625753Z ld.shared.b8 %rs480, [%r599+3776]; 2026-02-21T08:25:57.3625813Z add.s32 %r600, %r734, %r29; 2026-02-21T08:25:57.3625879Z ld.shared.b8 %rs481, [%r600+256]; 2026-02-21T08:25:57.3626007Z ld.shared.b8 %rs482, [%r600+320]; 2026-02-21T08:25:57.3626070Z ld.shared.b8 %rs483, [%r600+768]; 2026-02-21T08:25:57.3626133Z ld.shared.b8 %rs484, [%r600+832]; 2026-02-21T08:25:57.3626205Z ld.shared.b8 %rs485, [%r600+1280]; 2026-02-21T08:25:57.3626267Z ld.shared.b8 %rs486, [%r600+1344]; 2026-02-21T08:25:57.3626329Z ld.shared.b8 %rs487, [%r600+1792]; 2026-02-21T08:25:57.3626398Z ld.shared.b8 %rs488, [%r600+1856]; 2026-02-21T08:25:57.3626460Z ld.shared.b8 %rs489, [%r600+2304]; 2026-02-21T08:25:57.3626522Z ld.shared.b8 %rs490, [%r600+2368]; 2026-02-21T08:25:57.3626584Z ld.shared.b8 %rs491, [%r600+2816]; 2026-02-21T08:25:57.3626654Z ld.shared.b8 %rs492, [%r600+2880]; 2026-02-21T08:25:57.3626717Z ld.shared.b8 %rs493, [%r600+3328]; 2026-02-21T08:25:57.3626780Z ld.shared.b8 %rs494, [%r600+3392]; 2026-02-21T08:25:57.3626851Z ld.shared.b8 %rs495, [%r600+3840]; 2026-02-21T08:25:57.3626954Z ld.shared.b8 %rs496, [%r600+3904]; 2026-02-21T08:25:57.3627021Z add.s32 %r601, %r734, %r28; 2026-02-21T08:25:57.3627084Z ld.shared.b8 %rs497, [%r601+384]; 
2026-02-21T08:25:57.3627154Z ld.shared.b8 %rs498, [%r601+448]; 2026-02-21T08:25:57.3627218Z ld.shared.b8 %rs499, [%r601+896]; 2026-02-21T08:25:57.3627280Z ld.shared.b8 %rs500, [%r601+960]; 2026-02-21T08:25:57.3627349Z ld.shared.b8 %rs501, [%r601+1408]; 2026-02-21T08:25:57.3627410Z ld.shared.b8 %rs502, [%r601+1472]; 2026-02-21T08:25:57.3627471Z ld.shared.b8 %rs503, [%r601+1920]; 2026-02-21T08:25:57.3627533Z ld.shared.b8 %rs504, [%r601+1984]; 2026-02-21T08:25:57.3627602Z ld.shared.b8 %rs505, [%r601+2432]; 2026-02-21T08:25:57.3627664Z ld.shared.b8 %rs506, [%r601+2496]; 2026-02-21T08:25:57.3627725Z ld.shared.b8 %rs507, [%r601+2944]; 2026-02-21T08:25:57.3627796Z ld.shared.b8 %rs508, [%r601+3008]; 2026-02-21T08:25:57.3627857Z ld.shared.b8 %rs509, [%r601+3456]; 2026-02-21T08:25:57.3627920Z ld.shared.b8 %rs510, [%r601+3520]; 2026-02-21T08:25:57.3627990Z ld.shared.b8 %rs511, [%r601+3968]; 2026-02-21T08:25:57.3628055Z ld.shared.b8 %rs512, [%r601+4032]; 2026-02-21T08:25:57.3628232Z .loc 1 60 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:60:28 2026-02-21T08:25:57.3628295Z shl.b16 %rs513, %rs449, 4; 2026-02-21T08:25:57.3628365Z shl.b16 %rs514, %rs450, 4; 2026-02-21T08:25:57.3628425Z shl.b16 %rs515, %rs465, 4; 2026-02-21T08:25:57.3628485Z shl.b16 %rs516, %rs466, 4; 2026-02-21T08:25:57.3628549Z shl.b16 %rs517, %rs481, 4; 2026-02-21T08:25:57.3628608Z shl.b16 %rs518, %rs482, 4; 2026-02-21T08:25:57.3628668Z shl.b16 %rs519, %rs497, 4; 2026-02-21T08:25:57.3628728Z shl.b16 %rs520, %rs498, 4; 2026-02-21T08:25:57.3628793Z shl.b16 %rs521, %rs451, 4; 2026-02-21T08:25:57.3628852Z shl.b16 %rs522, %rs452, 4; 2026-02-21T08:25:57.3628911Z shl.b16 %rs523, %rs467, 4; 2026-02-21T08:25:57.3628979Z shl.b16 %rs524, %rs468, 4; 2026-02-21T08:25:57.3629038Z shl.b16 %rs525, %rs483, 4; 2026-02-21T08:25:57.3629100Z shl.b16 %rs526, %rs484, 4; 2026-02-21T08:25:57.3629163Z shl.b16 %rs527, %rs499, 4; 2026-02-21T08:25:57.3629228Z shl.b16 %rs528, %rs500, 4; 2026-02-21T08:25:57.3629287Z shl.b16 
%rs529, %rs453, 4; 2026-02-21T08:25:57.3629346Z shl.b16 %rs530, %rs454, 4; 2026-02-21T08:25:57.3629413Z shl.b16 %rs531, %rs469, 4; 2026-02-21T08:25:57.3629471Z shl.b16 %rs532, %rs470, 4; 2026-02-21T08:25:57.3629530Z shl.b16 %rs533, %rs485, 4; 2026-02-21T08:25:57.3629596Z shl.b16 %rs534, %rs486, 4; 2026-02-21T08:25:57.3629654Z shl.b16 %rs535, %rs501, 4; 2026-02-21T08:25:57.3629714Z shl.b16 %rs536, %rs502, 4; 2026-02-21T08:25:57.3629775Z shl.b16 %rs537, %rs455, 4; 2026-02-21T08:25:57.3629842Z shl.b16 %rs538, %rs456, 4; 2026-02-21T08:25:57.3629901Z shl.b16 %rs539, %rs471, 4; 2026-02-21T08:25:57.3629961Z shl.b16 %rs540, %rs472, 4; 2026-02-21T08:25:57.3630029Z shl.b16 %rs541, %rs487, 4; 2026-02-21T08:25:57.3630090Z shl.b16 %rs542, %rs488, 4; 2026-02-21T08:25:57.3630149Z shl.b16 %rs543, %rs503, 4; 2026-02-21T08:25:57.3630213Z shl.b16 %rs544, %rs504, 4; 2026-02-21T08:25:57.3630326Z shl.b16 %rs545, %rs457, 4; 2026-02-21T08:25:57.3630388Z shl.b16 %rs546, %rs458, 4; 2026-02-21T08:25:57.3630449Z shl.b16 %rs547, %rs473, 4; 2026-02-21T08:25:57.3630517Z shl.b16 %rs548, %rs474, 4; 2026-02-21T08:25:57.3630578Z shl.b16 %rs549, %rs489, 4; 2026-02-21T08:25:57.3630639Z shl.b16 %rs550, %rs490, 4; 2026-02-21T08:25:57.3630700Z shl.b16 %rs551, %rs505, 4; 2026-02-21T08:25:57.3630769Z shl.b16 %rs552, %rs506, 4; 2026-02-21T08:25:57.3630831Z shl.b16 %rs553, %rs459, 4; 2026-02-21T08:25:57.3630903Z shl.b16 %rs554, %rs460, 4; 2026-02-21T08:25:57.3630967Z shl.b16 %rs555, %rs475, 4; 2026-02-21T08:25:57.3631025Z shl.b16 %rs556, %rs476, 4; 2026-02-21T08:25:57.3631081Z shl.b16 %rs557, %rs491, 4; 2026-02-21T08:25:57.3631138Z shl.b16 %rs558, %rs492, 4; 2026-02-21T08:25:57.3631203Z shl.b16 %rs559, %rs507, 4; 2026-02-21T08:25:57.3631259Z shl.b16 %rs560, %rs508, 4; 2026-02-21T08:25:57.3631317Z shl.b16 %rs561, %rs461, 4; 2026-02-21T08:25:57.3631428Z shl.b16 %rs562, %rs462, 4; 2026-02-21T08:25:57.3631486Z shl.b16 %rs563, %rs477, 4; 2026-02-21T08:25:57.3631573Z shl.b16 %rs564, %rs478, 4; 
2026-02-21T08:25:57.3631632Z shl.b16 %rs565, %rs493, 4; 2026-02-21T08:25:57.3631698Z shl.b16 %rs566, %rs494, 4; 2026-02-21T08:25:57.3631754Z shl.b16 %rs567, %rs509, 4; 2026-02-21T08:25:57.3631812Z shl.b16 %rs568, %rs510, 4; 2026-02-21T08:25:57.3631879Z shl.b16 %rs569, %rs463, 4; 2026-02-21T08:25:57.3631935Z shl.b16 %rs570, %rs464, 4; 2026-02-21T08:25:57.3631992Z shl.b16 %rs571, %rs479, 4; 2026-02-21T08:25:57.3632048Z shl.b16 %rs572, %rs480, 4; 2026-02-21T08:25:57.3632112Z shl.b16 %rs573, %rs495, 4; 2026-02-21T08:25:57.3632169Z shl.b16 %rs574, %rs496, 4; 2026-02-21T08:25:57.3632226Z shl.b16 %rs575, %rs511, 4; 2026-02-21T08:25:57.3632290Z shl.b16 %rs576, %rs512, 4; 2026-02-21T08:25:57.3632459Z .loc 1 75 58 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:75:58 2026-02-21T08:25:57.3632531Z selp.b16 %rs577, %rs513, %rs449, %p36; 2026-02-21T08:25:57.3632600Z cvt.s16.s8 %rs578, %rs577; 2026-02-21T08:25:57.3632656Z shr.s16 %rs579, %rs578, 4; 2026-02-21T08:25:57.3632725Z selp.b16 %rs580, %rs514, %rs450, %p36; 2026-02-21T08:25:57.3632783Z cvt.s16.s8 %rs581, %rs580; 2026-02-21T08:25:57.3632847Z shr.s16 %rs582, %rs581, 4; 2026-02-21T08:25:57.3632913Z selp.b16 %rs583, %rs515, %rs465, %p36; 2026-02-21T08:25:57.3632971Z cvt.s16.s8 %rs584, %rs583; 2026-02-21T08:25:57.3633036Z shr.s16 %rs585, %rs584, 4; 2026-02-21T08:25:57.3633102Z selp.b16 %rs586, %rs516, %rs466, %p36; 2026-02-21T08:25:57.3633159Z cvt.s16.s8 %rs587, %rs586; 2026-02-21T08:25:57.3633216Z shr.s16 %rs588, %rs587, 4; 2026-02-21T08:25:57.3633289Z selp.b16 %rs589, %rs517, %rs481, %p36; 2026-02-21T08:25:57.3633346Z cvt.s16.s8 %rs590, %rs589; 2026-02-21T08:25:57.3633403Z shr.s16 %rs591, %rs590, 4; 2026-02-21T08:25:57.3633475Z selp.b16 %rs592, %rs518, %rs482, %p36; 2026-02-21T08:25:57.3633537Z cvt.s16.s8 %rs593, %rs592; 2026-02-21T08:25:57.3633596Z shr.s16 %rs594, %rs593, 4; 2026-02-21T08:25:57.3633660Z selp.b16 %rs595, %rs519, %rs497, %p36; 2026-02-21T08:25:57.3633725Z cvt.s16.s8 %rs596, %rs595; 
2026-02-21T08:25:57.3633781Z shr.s16 %rs597, %rs596, 4; 2026-02-21T08:25:57.3633844Z selp.b16 %rs598, %rs520, %rs498, %p36; 2026-02-21T08:25:57.3633908Z cvt.s16.s8 %rs599, %rs598; 2026-02-21T08:25:57.3633965Z shr.s16 %rs600, %rs599, 4; 2026-02-21T08:25:57.3634028Z selp.b16 %rs601, %rs521, %rs451, %p36; 2026-02-21T08:25:57.3634090Z cvt.s16.s8 %rs602, %rs601; 2026-02-21T08:25:57.3634146Z shr.s16 %rs603, %rs602, 4; 2026-02-21T08:25:57.3634208Z selp.b16 %rs604, %rs522, %rs452, %p36; 2026-02-21T08:25:57.3634264Z cvt.s16.s8 %rs605, %rs604; 2026-02-21T08:25:57.3634326Z shr.s16 %rs606, %rs605, 4; 2026-02-21T08:25:57.3634389Z selp.b16 %rs607, %rs523, %rs467, %p36; 2026-02-21T08:25:57.3634445Z cvt.s16.s8 %rs608, %rs607; 2026-02-21T08:25:57.3634508Z shr.s16 %rs609, %rs608, 4; 2026-02-21T08:25:57.3634573Z selp.b16 %rs610, %rs524, %rs468, %p36; 2026-02-21T08:25:57.3634689Z cvt.s16.s8 %rs611, %rs610; 2026-02-21T08:25:57.3634745Z shr.s16 %rs612, %rs611, 4; 2026-02-21T08:25:57.3634816Z selp.b16 %rs613, %rs525, %rs483, %p36; 2026-02-21T08:25:57.3634874Z cvt.s16.s8 %rs614, %rs613; 2026-02-21T08:25:57.3634931Z shr.s16 %rs615, %rs614, 4; 2026-02-21T08:25:57.3634999Z selp.b16 %rs616, %rs526, %rs484, %p36; 2026-02-21T08:25:57.3635055Z cvt.s16.s8 %rs617, %rs616; 2026-02-21T08:25:57.3635111Z shr.s16 %rs618, %rs617, 4; 2026-02-21T08:25:57.3635174Z selp.b16 %rs619, %rs527, %rs499, %p36; 2026-02-21T08:25:57.3635237Z cvt.s16.s8 %rs620, %rs619; 2026-02-21T08:25:57.3635294Z shr.s16 %rs621, %rs620, 4; 2026-02-21T08:25:57.3635357Z selp.b16 %rs622, %rs528, %rs500, %p36; 2026-02-21T08:25:57.3635421Z cvt.s16.s8 %rs623, %rs622; 2026-02-21T08:25:57.3635478Z shr.s16 %rs624, %rs623, 4; 2026-02-21T08:25:57.3635540Z selp.b16 %rs625, %rs529, %rs453, %p36; 2026-02-21T08:25:57.3635646Z cvt.s16.s8 %rs626, %rs625; 2026-02-21T08:25:57.3635715Z shr.s16 %rs627, %rs626, 4; 2026-02-21T08:25:57.3635778Z selp.b16 %rs628, %rs530, %rs454, %p36; 2026-02-21T08:25:57.3635836Z cvt.s16.s8 %rs629, %rs628; 
2026-02-21T08:25:57.3635901Z shr.s16 %rs630, %rs629, 4; 2026-02-21T08:25:57.3635965Z selp.b16 %rs631, %rs531, %rs469, %p36; 2026-02-21T08:25:57.3636024Z cvt.s16.s8 %rs632, %rs631; 2026-02-21T08:25:57.3636088Z shr.s16 %rs633, %rs632, 4; 2026-02-21T08:25:57.3636151Z selp.b16 %rs634, %rs532, %rs470, %p36; 2026-02-21T08:25:57.3636208Z cvt.s16.s8 %rs635, %rs634; 2026-02-21T08:25:57.3636264Z shr.s16 %rs636, %rs635, 4; 2026-02-21T08:25:57.3636337Z selp.b16 %rs637, %rs533, %rs485, %p36; 2026-02-21T08:25:57.3636394Z cvt.s16.s8 %rs638, %rs637; 2026-02-21T08:25:57.3636450Z shr.s16 %rs639, %rs638, 4; 2026-02-21T08:25:57.3636521Z selp.b16 %rs640, %rs534, %rs486, %p36; 2026-02-21T08:25:57.3636579Z cvt.s16.s8 %rs641, %rs640; 2026-02-21T08:25:57.3636635Z shr.s16 %rs642, %rs641, 4; 2026-02-21T08:25:57.3636700Z selp.b16 %rs643, %rs535, %rs501, %p36; 2026-02-21T08:25:57.3636767Z cvt.s16.s8 %rs644, %rs643; 2026-02-21T08:25:57.3636823Z shr.s16 %rs645, %rs644, 4; 2026-02-21T08:25:57.3636888Z selp.b16 %rs646, %rs536, %rs502, %p36; 2026-02-21T08:25:57.3636954Z cvt.s16.s8 %rs647, %rs646; 2026-02-21T08:25:57.3637012Z shr.s16 %rs648, %rs647, 4; 2026-02-21T08:25:57.3637075Z selp.b16 %rs649, %rs537, %rs455, %p36; 2026-02-21T08:25:57.3637134Z cvt.s16.s8 %rs650, %rs649; 2026-02-21T08:25:57.3637211Z shr.s16 %rs651, %rs650, 4; 2026-02-21T08:25:57.3637274Z selp.b16 %rs652, %rs538, %rs456, %p36; 2026-02-21T08:25:57.3637331Z cvt.s16.s8 %rs653, %rs652; 2026-02-21T08:25:57.3637397Z shr.s16 %rs654, %rs653, 4; 2026-02-21T08:25:57.3637460Z selp.b16 %rs655, %rs539, %rs471, %p36; 2026-02-21T08:25:57.3637518Z cvt.s16.s8 %rs656, %rs655; 2026-02-21T08:25:57.3637582Z shr.s16 %rs657, %rs656, 4; 2026-02-21T08:25:57.3637647Z selp.b16 %rs658, %rs540, %rs472, %p36; 2026-02-21T08:25:57.3637705Z cvt.s16.s8 %rs659, %rs658; 2026-02-21T08:25:57.3637764Z shr.s16 %rs660, %rs659, 4; 2026-02-21T08:25:57.3637834Z selp.b16 %rs661, %rs541, %rs487, %p36; 2026-02-21T08:25:57.3637892Z cvt.s16.s8 %rs662, %rs661; 
2026-02-21T08:25:57.3637949Z shr.s16 %rs663, %rs662, 4; 2026-02-21T08:25:57.3638016Z selp.b16 %rs664, %rs542, %rs488, %p36; 2026-02-21T08:25:57.3638072Z cvt.s16.s8 %rs665, %rs664; 2026-02-21T08:25:57.3638128Z shr.s16 %rs666, %rs665, 4; 2026-02-21T08:25:57.3638192Z selp.b16 %rs667, %rs543, %rs503, %p36; 2026-02-21T08:25:57.3638256Z cvt.s16.s8 %rs668, %rs667; 2026-02-21T08:25:57.3638313Z shr.s16 %rs669, %rs668, 4; 2026-02-21T08:25:57.3638377Z selp.b16 %rs670, %rs544, %rs504, %p36; 2026-02-21T08:25:57.3638442Z cvt.s16.s8 %rs671, %rs670; 2026-02-21T08:25:57.3638500Z shr.s16 %rs672, %rs671, 4; 2026-02-21T08:25:57.3638564Z selp.b16 %rs673, %rs545, %rs457, %p36; 2026-02-21T08:25:57.3638620Z cvt.s16.s8 %rs674, %rs673; 2026-02-21T08:25:57.3638685Z shr.s16 %rs675, %rs674, 4; 2026-02-21T08:25:57.3638751Z selp.b16 %rs676, %rs546, %rs458, %p36; 2026-02-21T08:25:57.3638851Z cvt.s16.s8 %rs677, %rs676; 2026-02-21T08:25:57.3638916Z shr.s16 %rs678, %rs677, 4; 2026-02-21T08:25:57.3638980Z selp.b16 %rs679, %rs547, %rs473, %p36; 2026-02-21T08:25:57.3639037Z cvt.s16.s8 %rs680, %rs679; 2026-02-21T08:25:57.3639102Z shr.s16 %rs681, %rs680, 4; 2026-02-21T08:25:57.3639165Z selp.b16 %rs682, %rs548, %rs474, %p36; 2026-02-21T08:25:57.3639223Z cvt.s16.s8 %rs683, %rs682; 2026-02-21T08:25:57.3639280Z shr.s16 %rs684, %rs683, 4; 2026-02-21T08:25:57.3639350Z selp.b16 %rs685, %rs549, %rs489, %p36; 2026-02-21T08:25:57.3639408Z cvt.s16.s8 %rs686, %rs685; 2026-02-21T08:25:57.3639465Z shr.s16 %rs687, %rs686, 4; 2026-02-21T08:25:57.3639535Z selp.b16 %rs688, %rs550, %rs490, %p36; 2026-02-21T08:25:57.3639592Z cvt.s16.s8 %rs689, %rs688; 2026-02-21T08:25:57.3639649Z shr.s16 %rs690, %rs689, 4; 2026-02-21T08:25:57.3639711Z selp.b16 %rs691, %rs551, %rs505, %p36; 2026-02-21T08:25:57.3639831Z cvt.s16.s8 %rs692, %rs691; 2026-02-21T08:25:57.3639891Z shr.s16 %rs693, %rs692, 4; 2026-02-21T08:25:57.3639955Z selp.b16 %rs694, %rs552, %rs506, %p36; 2026-02-21T08:25:57.3640019Z cvt.s16.s8 %rs695, %rs694; 
2026-02-21T08:25:57.3640076Z shr.s16 %rs696, %rs695, 4; 2026-02-21T08:25:57.3640138Z selp.b16 %rs697, %rs553, %rs459, %p36; 2026-02-21T08:25:57.3640196Z cvt.s16.s8 %rs698, %rs697; 2026-02-21T08:25:57.3640259Z shr.s16 %rs699, %rs698, 4; 2026-02-21T08:25:57.3640321Z selp.b16 %rs700, %rs554, %rs460, %p36; 2026-02-21T08:25:57.3640377Z cvt.s16.s8 %rs701, %rs700; 2026-02-21T08:25:57.3640439Z shr.s16 %rs702, %rs701, 4; 2026-02-21T08:25:57.3640502Z selp.b16 %rs703, %rs555, %rs475, %p36; 2026-02-21T08:25:57.3640558Z cvt.s16.s8 %rs704, %rs703; 2026-02-21T08:25:57.3640615Z shr.s16 %rs705, %rs704, 4; 2026-02-21T08:25:57.3640684Z selp.b16 %rs706, %rs556, %rs476, %p36; 2026-02-21T08:25:57.3640739Z cvt.s16.s8 %rs707, %rs706; 2026-02-21T08:25:57.3640795Z shr.s16 %rs708, %rs707, 4; 2026-02-21T08:25:57.3640866Z selp.b16 %rs709, %rs557, %rs491, %p36; 2026-02-21T08:25:57.3640927Z cvt.s16.s8 %rs710, %rs709; 2026-02-21T08:25:57.3640983Z shr.s16 %rs711, %rs710, 4; 2026-02-21T08:25:57.3641054Z selp.b16 %rs712, %rs558, %rs492, %p36; 2026-02-21T08:25:57.3641110Z cvt.s16.s8 %rs713, %rs712; 2026-02-21T08:25:57.3641165Z shr.s16 %rs714, %rs713, 4; 2026-02-21T08:25:57.3641227Z selp.b16 %rs715, %rs559, %rs507, %p36; 2026-02-21T08:25:57.3641291Z cvt.s16.s8 %rs716, %rs715; 2026-02-21T08:25:57.3641347Z shr.s16 %rs717, %rs716, 4; 2026-02-21T08:25:57.3641410Z selp.b16 %rs718, %rs560, %rs508, %p36; 2026-02-21T08:25:57.3641474Z cvt.s16.s8 %rs719, %rs718; 2026-02-21T08:25:57.3641531Z shr.s16 %rs720, %rs719, 4; 2026-02-21T08:25:57.3641623Z selp.b16 %rs721, %rs561, %rs461, %p36; 2026-02-21T08:25:57.3641680Z cvt.s16.s8 %rs722, %rs721; 2026-02-21T08:25:57.3641746Z shr.s16 %rs723, %rs722, 4; 2026-02-21T08:25:57.3641808Z selp.b16 %rs724, %rs562, %rs462, %p36; 2026-02-21T08:25:57.3641868Z cvt.s16.s8 %rs725, %rs724; 2026-02-21T08:25:57.3641934Z shr.s16 %rs726, %rs725, 4; 2026-02-21T08:25:57.3641997Z selp.b16 %rs727, %rs563, %rs477, %p36; 2026-02-21T08:25:57.3642054Z cvt.s16.s8 %rs728, %rs727; 
2026-02-21T08:25:57.3642113Z shr.s16 %rs729, %rs728, 4; 2026-02-21T08:25:57.3642183Z selp.b16 %rs730, %rs564, %rs478, %p36; 2026-02-21T08:25:57.3642241Z cvt.s16.s8 %rs731, %rs730; 2026-02-21T08:25:57.3642299Z shr.s16 %rs732, %rs731, 4; 2026-02-21T08:25:57.3642371Z selp.b16 %rs733, %rs565, %rs493, %p36; 2026-02-21T08:25:57.3642428Z cvt.s16.s8 %rs734, %rs733; 2026-02-21T08:25:57.3642484Z shr.s16 %rs735, %rs734, 4; 2026-02-21T08:25:57.3642547Z selp.b16 %rs736, %rs566, %rs494, %p36; 2026-02-21T08:25:57.3642615Z cvt.s16.s8 %rs737, %rs736; 2026-02-21T08:25:57.3642673Z shr.s16 %rs738, %rs737, 4; 2026-02-21T08:25:57.3642737Z selp.b16 %rs739, %rs567, %rs509, %p36; 2026-02-21T08:25:57.3642804Z cvt.s16.s8 %rs740, %rs739; 2026-02-21T08:25:57.3642862Z shr.s16 %rs741, %rs740, 4; 2026-02-21T08:25:57.3642929Z selp.b16 %rs742, %rs568, %rs510, %p36; 2026-02-21T08:25:57.3643043Z cvt.s16.s8 %rs743, %rs742; 2026-02-21T08:25:57.3643102Z shr.s16 %rs744, %rs743, 4; 2026-02-21T08:25:57.3643166Z selp.b16 %rs745, %rs569, %rs463, %p36; 2026-02-21T08:25:57.3643224Z cvt.s16.s8 %rs746, %rs745; 2026-02-21T08:25:57.3643291Z shr.s16 %rs747, %rs746, 4; 2026-02-21T08:25:57.3643355Z selp.b16 %rs748, %rs570, %rs464, %p36; 2026-02-21T08:25:57.3643415Z cvt.s16.s8 %rs749, %rs748; 2026-02-21T08:25:57.3643482Z shr.s16 %rs750, %rs749, 4; 2026-02-21T08:25:57.3643545Z selp.b16 %rs751, %rs571, %rs479, %p36; 2026-02-21T08:25:57.3643602Z cvt.s16.s8 %rs752, %rs751; 2026-02-21T08:25:57.3643660Z shr.s16 %rs753, %rs752, 4; 2026-02-21T08:25:57.3643731Z selp.b16 %rs754, %rs572, %rs480, %p36; 2026-02-21T08:25:57.3643788Z cvt.s16.s8 %rs755, %rs754; 2026-02-21T08:25:57.3643845Z shr.s16 %rs756, %rs755, 4; 2026-02-21T08:25:57.3643915Z selp.b16 %rs757, %rs573, %rs495, %p36; 2026-02-21T08:25:57.3644021Z cvt.s16.s8 %rs758, %rs757; 2026-02-21T08:25:57.3644081Z shr.s16 %rs759, %rs758, 4; 2026-02-21T08:25:57.3644145Z selp.b16 %rs760, %rs574, %rs496, %p36; 2026-02-21T08:25:57.3644209Z cvt.s16.s8 %rs761, %rs760; 
2026-02-21T08:25:57.3644266Z shr.s16 %rs762, %rs761, 4; 2026-02-21T08:25:57.3644329Z selp.b16 %rs763, %rs575, %rs511, %p36; 2026-02-21T08:25:57.3644394Z cvt.s16.s8 %rs764, %rs763; 2026-02-21T08:25:57.3644450Z shr.s16 %rs765, %rs764, 4; 2026-02-21T08:25:57.3644512Z selp.b16 %rs766, %rs576, %rs512, %p36; 2026-02-21T08:25:57.3644577Z cvt.s16.s8 %rs767, %rs766; 2026-02-21T08:25:57.3644634Z shr.s16 %rs768, %rs767, 4; 2026-02-21T08:25:57.3644804Z .loc 1 80 32 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:80:32 2026-02-21T08:25:57.3644865Z cvt.rn.f32.s16 %r602, %rs579; 2026-02-21T08:25:57.3644933Z cvt.rn.f32.s16 %r603, %rs582; 2026-02-21T08:25:57.3644994Z cvt.rn.f32.s16 %r604, %rs585; 2026-02-21T08:25:57.3645052Z cvt.rn.f32.s16 %r605, %rs588; 2026-02-21T08:25:57.3645119Z cvt.rn.f32.s16 %r606, %rs591; 2026-02-21T08:25:57.3645180Z cvt.rn.f32.s16 %r607, %rs594; 2026-02-21T08:25:57.3645238Z cvt.rn.f32.s16 %r608, %rs597; 2026-02-21T08:25:57.3645294Z cvt.rn.f32.s16 %r609, %rs600; 2026-02-21T08:25:57.3645360Z cvt.rn.f32.s16 %r610, %rs603; 2026-02-21T08:25:57.3645418Z cvt.rn.f32.s16 %r611, %rs606; 2026-02-21T08:25:57.3645476Z cvt.rn.f32.s16 %r612, %rs609; 2026-02-21T08:25:57.3645540Z cvt.rn.f32.s16 %r613, %rs612; 2026-02-21T08:25:57.3645597Z cvt.rn.f32.s16 %r614, %rs615; 2026-02-21T08:25:57.3645654Z cvt.rn.f32.s16 %r615, %rs618; 2026-02-21T08:25:57.3645711Z cvt.rn.f32.s16 %r616, %rs621; 2026-02-21T08:25:57.3645775Z cvt.rn.f32.s16 %r617, %rs624; 2026-02-21T08:25:57.3645832Z cvt.rn.f32.s16 %r618, %rs627; 2026-02-21T08:25:57.3645889Z cvt.rn.f32.s16 %r619, %rs630; 2026-02-21T08:25:57.3645954Z cvt.rn.f32.s16 %r620, %rs633; 2026-02-21T08:25:57.3646010Z cvt.rn.f32.s16 %r621, %rs636; 2026-02-21T08:25:57.3646066Z cvt.rn.f32.s16 %r622, %rs639; 2026-02-21T08:25:57.3646131Z cvt.rn.f32.s16 %r623, %rs642; 2026-02-21T08:25:57.3646189Z cvt.rn.f32.s16 %r624, %rs645; 2026-02-21T08:25:57.3646246Z cvt.rn.f32.s16 %r625, %rs648; 2026-02-21T08:25:57.3646304Z cvt.rn.f32.s16 %r626, %rs651; 
2026-02-21T08:25:57.3646369Z cvt.rn.f32.s16 %r627, %rs654; 2026-02-21T08:25:57.3646425Z cvt.rn.f32.s16 %r628, %rs657; 2026-02-21T08:25:57.3646482Z cvt.rn.f32.s16 %r629, %rs660; 2026-02-21T08:25:57.3646545Z cvt.rn.f32.s16 %r630, %rs663; 2026-02-21T08:25:57.3646601Z cvt.rn.f32.s16 %r631, %rs666; 2026-02-21T08:25:57.3646657Z cvt.rn.f32.s16 %r632, %rs669; 2026-02-21T08:25:57.3646714Z cvt.rn.f32.s16 %r633, %rs672; 2026-02-21T08:25:57.3646778Z cvt.rn.f32.s16 %r634, %rs675; 2026-02-21T08:25:57.3646834Z cvt.rn.f32.s16 %r635, %rs678; 2026-02-21T08:25:57.3646891Z cvt.rn.f32.s16 %r636, %rs681; 2026-02-21T08:25:57.3646953Z cvt.rn.f32.s16 %r637, %rs684; 2026-02-21T08:25:57.3647010Z cvt.rn.f32.s16 %r638, %rs687; 2026-02-21T08:25:57.3647067Z cvt.rn.f32.s16 %r639, %rs690; 2026-02-21T08:25:57.3647125Z cvt.rn.f32.s16 %r640, %rs693; 2026-02-21T08:25:57.3647254Z cvt.rn.f32.s16 %r641, %rs696; 2026-02-21T08:25:57.3647312Z cvt.rn.f32.s16 %r642, %rs699; 2026-02-21T08:25:57.3647368Z cvt.rn.f32.s16 %r643, %rs702; 2026-02-21T08:25:57.3647433Z cvt.rn.f32.s16 %r644, %rs705; 2026-02-21T08:25:57.3647490Z cvt.rn.f32.s16 %r645, %rs708; 2026-02-21T08:25:57.3647549Z cvt.rn.f32.s16 %r646, %rs711; 2026-02-21T08:25:57.3647605Z cvt.rn.f32.s16 %r647, %rs714; 2026-02-21T08:25:57.3647670Z cvt.rn.f32.s16 %r648, %rs717; 2026-02-21T08:25:57.3647727Z cvt.rn.f32.s16 %r649, %rs720; 2026-02-21T08:25:57.3647784Z cvt.rn.f32.s16 %r650, %rs723; 2026-02-21T08:25:57.3647849Z cvt.rn.f32.s16 %r651, %rs726; 2026-02-21T08:25:57.3647907Z cvt.rn.f32.s16 %r652, %rs729; 2026-02-21T08:25:57.3647965Z cvt.rn.f32.s16 %r653, %rs732; 2026-02-21T08:25:57.3648029Z cvt.rn.f32.s16 %r654, %rs735; 2026-02-21T08:25:57.3648086Z cvt.rn.f32.s16 %r655, %rs738; 2026-02-21T08:25:57.3648185Z cvt.rn.f32.s16 %r656, %rs741; 2026-02-21T08:25:57.3648247Z cvt.rn.f32.s16 %r657, %rs744; 2026-02-21T08:25:57.3648312Z cvt.rn.f32.s16 %r658, %rs747; 2026-02-21T08:25:57.3648371Z cvt.rn.f32.s16 %r659, %rs750; 2026-02-21T08:25:57.3648429Z cvt.rn.f32.s16 %r660, 
%rs753; 2026-02-21T08:25:57.3648494Z cvt.rn.f32.s16 %r661, %rs756; 2026-02-21T08:25:57.3648553Z cvt.rn.f32.s16 %r662, %rs759; 2026-02-21T08:25:57.3648611Z cvt.rn.f32.s16 %r663, %rs762; 2026-02-21T08:25:57.3648669Z cvt.rn.f32.s16 %r664, %rs765; 2026-02-21T08:25:57.3648737Z cvt.rn.f32.s16 %r665, %rs768; 2026-02-21T08:25:57.3648799Z st.shared.b32 [%r18], %r602; 2026-02-21T08:25:57.3648862Z st.shared.b32 [%r18+8], %r603; 2026-02-21T08:25:57.3648933Z st.shared.b32 [%r18+8192], %r618; 2026-02-21T08:25:57.3648995Z st.shared.b32 [%r18+8200], %r619; 2026-02-21T08:25:57.3649058Z st.shared.b32 [%r18+16384], %r634; 2026-02-21T08:25:57.3649122Z st.shared.b32 [%r18+16392], %r635; 2026-02-21T08:25:57.3649190Z st.shared.b32 [%r18+24576], %r650; 2026-02-21T08:25:57.3649252Z st.shared.b32 [%r18+24584], %r651; 2026-02-21T08:25:57.3649315Z st.shared.b32 [%r19], %r604; 2026-02-21T08:25:57.3649387Z st.shared.b32 [%r19+8], %r605; 2026-02-21T08:25:57.3649448Z st.shared.b32 [%r19+8192], %r620; 2026-02-21T08:25:57.3649511Z st.shared.b32 [%r19+8200], %r621; 2026-02-21T08:25:57.3649580Z st.shared.b32 [%r19+16384], %r636; 2026-02-21T08:25:57.3649642Z st.shared.b32 [%r19+16392], %r637; 2026-02-21T08:25:57.3649703Z st.shared.b32 [%r19+24576], %r652; 2026-02-21T08:25:57.3649764Z st.shared.b32 [%r19+24584], %r653; 2026-02-21T08:25:57.3649832Z st.shared.b32 [%r20], %r606; 2026-02-21T08:25:57.3649894Z st.shared.b32 [%r20+8], %r607; 2026-02-21T08:25:57.3649954Z st.shared.b32 [%r20+8192], %r622; 2026-02-21T08:25:57.3650023Z st.shared.b32 [%r20+8200], %r623; 2026-02-21T08:25:57.3650084Z st.shared.b32 [%r20+16384], %r638; 2026-02-21T08:25:57.3650143Z st.shared.b32 [%r20+16392], %r639; 2026-02-21T08:25:57.3650202Z st.shared.b32 [%r20+24576], %r654; 2026-02-21T08:25:57.3650271Z st.shared.b32 [%r20+24584], %r655; 2026-02-21T08:25:57.3650333Z st.shared.b32 [%r21], %r608; 2026-02-21T08:25:57.3650394Z st.shared.b32 [%r21+8], %r609; 2026-02-21T08:25:57.3650461Z st.shared.b32 [%r21+8192], %r624; 
2026-02-21T08:25:57.3650521Z st.shared.b32 [%r21+8200], %r625; 2026-02-21T08:25:57.3650580Z st.shared.b32 [%r21+16384], %r640; 2026-02-21T08:25:57.3650640Z st.shared.b32 [%r21+16392], %r641; 2026-02-21T08:25:57.3650706Z st.shared.b32 [%r21+24576], %r656; 2026-02-21T08:25:57.3650766Z st.shared.b32 [%r21+24584], %r657; 2026-02-21T08:25:57.3650827Z st.shared.b32 [%r22], %r610; 2026-02-21T08:25:57.3650894Z st.shared.b32 [%r22+8], %r611; 2026-02-21T08:25:57.3650954Z st.shared.b32 [%r22+8192], %r626; 2026-02-21T08:25:57.3651013Z st.shared.b32 [%r22+8200], %r627; 2026-02-21T08:25:57.3651080Z st.shared.b32 [%r22+16384], %r642; 2026-02-21T08:25:57.3651139Z st.shared.b32 [%r22+16392], %r643; 2026-02-21T08:25:57.3651199Z st.shared.b32 [%r22+24576], %r658; 2026-02-21T08:25:57.3651300Z st.shared.b32 [%r22+24584], %r659; 2026-02-21T08:25:57.3651368Z st.shared.b32 [%r23], %r612; 2026-02-21T08:25:57.3651429Z st.shared.b32 [%r23+8], %r613; 2026-02-21T08:25:57.3651489Z st.shared.b32 [%r23+8192], %r628; 2026-02-21T08:25:57.3651582Z st.shared.b32 [%r23+8200], %r629; 2026-02-21T08:25:57.3651642Z st.shared.b32 [%r23+16384], %r644; 2026-02-21T08:25:57.3651702Z st.shared.b32 [%r23+16392], %r645; 2026-02-21T08:25:57.3651762Z st.shared.b32 [%r23+24576], %r660; 2026-02-21T08:25:57.3651828Z st.shared.b32 [%r23+24584], %r661; 2026-02-21T08:25:57.3651889Z st.shared.b32 [%r24], %r614; 2026-02-21T08:25:57.3651950Z st.shared.b32 [%r24+8], %r615; 2026-02-21T08:25:57.3652019Z st.shared.b32 [%r24+8192], %r630; 2026-02-21T08:25:57.3652079Z st.shared.b32 [%r24+8200], %r631; 2026-02-21T08:25:57.3652140Z st.shared.b32 [%r24+16384], %r646; 2026-02-21T08:25:57.3652207Z st.shared.b32 [%r24+16392], %r647; 2026-02-21T08:25:57.3652318Z st.shared.b32 [%r24+24576], %r662; 2026-02-21T08:25:57.3652381Z st.shared.b32 [%r24+24584], %r663; 2026-02-21T08:25:57.3652441Z st.shared.b32 [%r25], %r616; 2026-02-21T08:25:57.3652511Z st.shared.b32 [%r25+8], %r617; 2026-02-21T08:25:57.3652572Z st.shared.b32 [%r25+8192], %r632; 
2026-02-21T08:25:57.3652634Z st.shared.b32 [%r25+8200], %r633; 2026-02-21T08:25:57.3652703Z st.shared.b32 [%r25+16384], %r648; 2026-02-21T08:25:57.3652762Z st.shared.b32 [%r25+16392], %r649; 2026-02-21T08:25:57.3652821Z st.shared.b32 [%r25+24576], %r664; 2026-02-21T08:25:57.3652880Z st.shared.b32 [%r25+24584], %r665; 2026-02-21T08:25:57.3653062Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3653122Z shl.b32 %r666, %r872, 3; 2026-02-21T08:25:57.3653181Z add.s32 %r667, %r49, %r666; 2026-02-21T08:25:57.3653246Z add.s32 %r869, %r667, 73728; 2026-02-21T08:25:57.3653300Z $L__tmp12: 2026-02-21T08:25:57.3653527Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3653593Z // begin inline asm 2026-02-21T08:25:57.3653666Z fence.proxy.async.shared::cta; 2026-02-21T08:25:57.3653722Z // end inline asm 2026-02-21T08:25:57.3653776Z bar.sync 0; 2026-02-21T08:25:57.3653841Z @%p40 bra $L__BB0_6; 2026-02-21T08:25:57.3653941Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T08:25:57.3654008Z elect.sync %r716|%p86, -1; 2026-02-21T08:25:57.3654073Z mov.b32 %r670, 68159760; 2026-02-21T08:25:57.3654129Z // begin inline asm 2026-02-21T08:25:57.3654283Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 0 ], %rd92, %r670, %p85; 2026-02-21T08:25:57.3654338Z // end inline asm 2026-02-21T08:25:57.3654401Z // begin inline asm 2026-02-21T08:25:57.3654549Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 8 ], %rd93, %r670, %p85; 2026-02-21T08:25:57.3654603Z // end inline asm 2026-02-21T08:25:57.3654669Z // begin inline asm 2026-02-21T08:25:57.3654818Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 16 ], %rd94, %r670, %p85; 2026-02-21T08:25:57.3654872Z // end inline asm 2026-02-21T08:25:57.3654934Z // begin inline asm 2026-02-21T08:25:57.3655077Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 24 ], 
%rd95, %r670, %p85; 2026-02-21T08:25:57.3655132Z // end inline asm 2026-02-21T08:25:57.3655195Z // begin inline asm 2026-02-21T08:25:57.3655336Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 32 ], %rd96, %r670, %p85; 2026-02-21T08:25:57.3655392Z // end inline asm 2026-02-21T08:25:57.3655447Z // begin inline asm 2026-02-21T08:25:57.3655594Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 40 ], %rd97, %r670, %p85; 2026-02-21T08:25:57.3655649Z // end inline asm 2026-02-21T08:25:57.3655704Z // begin inline asm 2026-02-21T08:25:57.3655851Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 48 ], %rd98, %r670, %p85; 2026-02-21T08:25:57.3655907Z // end inline asm 2026-02-21T08:25:57.3656035Z // begin inline asm 2026-02-21T08:25:57.3656186Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 56 ], %rd99, %r670, %p85; 2026-02-21T08:25:57.3656241Z // end inline asm 2026-02-21T08:25:57.3656296Z // begin inline asm 2026-02-21T08:25:57.3656446Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 64 ], %rd100, %r670, %p85; 2026-02-21T08:25:57.3656502Z // end inline asm 2026-02-21T08:25:57.3656558Z // begin inline asm 2026-02-21T08:25:57.3656703Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 72 ], %rd101, %r670, %p85; 2026-02-21T08:25:57.3656768Z // end inline asm 2026-02-21T08:25:57.3656826Z // begin inline asm 2026-02-21T08:25:57.3656969Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 80 ], %rd102, %r670, %p85; 2026-02-21T08:25:57.3657033Z // end inline asm 2026-02-21T08:25:57.3657090Z // begin inline asm 2026-02-21T08:25:57.3657269Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 88 ], %rd103, %r670, %p85; 2026-02-21T08:25:57.3657338Z // end inline asm 2026-02-21T08:25:57.3657395Z // begin inline asm 2026-02-21T08:25:57.3657535Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 96 ], %rd104, %r670, %p85; 
2026-02-21T08:25:57.3657591Z // end inline asm 2026-02-21T08:25:57.3657654Z // begin inline asm 2026-02-21T08:25:57.3657802Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 104 ], %rd105, %r670, %p85; 2026-02-21T08:25:57.3657856Z // end inline asm 2026-02-21T08:25:57.3657920Z // begin inline asm 2026-02-21T08:25:57.3658063Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 112 ], %rd106, %r670, %p85; 2026-02-21T08:25:57.3658117Z // end inline asm 2026-02-21T08:25:57.3658181Z // begin inline asm 2026-02-21T08:25:57.3658325Z @%p86 tcgen05.mma.cta_group::1.kind::tf32 [ %r867 + 0 ], [ %r406 + 120 ], %rd107, %r670, %p85; 2026-02-21T08:25:57.3658378Z // end inline asm 2026-02-21T08:25:57.3658448Z cvt.u64.u32 %rd139, %r869; 2026-02-21T08:25:57.3658507Z // begin inline asm 2026-02-21T08:25:57.3658633Z @%p86 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd139]; 2026-02-21T08:25:57.3658687Z // end inline asm 2026-02-21T08:25:57.3658750Z bra.uni $L__BB0_6; 2026-02-21T08:25:57.3658846Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T08:25:57.3658939Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:25:57.3659001Z mov.b32 %r743, 1; 2026-02-21T08:25:57.3659221Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3659278Z // begin inline asm 2026-02-21T08:25:57.3659335Z 2026-02-21T08:25:57.3659384Z { 2026-02-21T08:25:57.3659445Z .reg .pred complete; 2026-02-21T08:25:57.3659500Z waitLoop: 2026-02-21T08:25:57.3659626Z mbarrier.try_wait.parity.shared.b64 complete, [%r869], %r743; 2026-02-21T08:25:57.3659694Z @!complete bra.uni waitLoop; 2026-02-21T08:25:57.3659746Z } 2026-02-21T08:25:57.3659750Z 2026-02-21T08:25:57.3659812Z // end inline asm 2026-02-21T08:25:57.3659864Z $L__tmp13: 2026-02-21T08:25:57.3660035Z .loc 1 44 71 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:44:71 2026-02-21T08:25:57.3660099Z cp.async.wait_group 0; 2026-02-21T08:25:57.3660162Z 
bar.sync 0; 2026-02-21T08:25:57.3660217Z // begin inline asm 2026-02-21T08:25:57.3660306Z @%p127 mbarrier.inval.shared::cta.b64 [%r472]; 2026-02-21T08:25:57.3660573Z [153s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:25:57.3661506Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 64], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:25:57.3661730Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T08:25:57.3661790Z `ptxas` stderr: 2026-02-21T08:25:57.3662139Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 325 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:25:57.3662231Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:25:57.3662242Z 2026-02-21T08:25:57.3662630Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp80jkuebr.ptx -o /tmp/tmp80jkuebr.ptx.o 2026-02-21T08:25:57.3662635Z 2026-02-21T08:25:57.3662763Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:25:57.3662827Z // end inline asm 2026-02-21T08:25:57.3662883Z bar.sync 0; 2026-02-21T08:25:57.3662989Z // begin inline asm 2026-02-21T08:25:57.3663082Z @%p127 mbarrier.inval.shared::cta.b64 [%r103]; 2026-02-21T08:25:57.3663150Z // end inline asm 2026-02-21T08:25:57.3663212Z add.s32 %r746, %r49, 73728; 2026-02-21T08:25:57.3663269Z // begin inline asm 2026-02-21T08:25:57.3663362Z @%p127 mbarrier.inval.shared::cta.b64 [%r746]; 2026-02-21T08:25:57.3663420Z // end inline asm 2026-02-21T08:25:57.3663474Z bar.sync 0; 2026-02-21T08:25:57.3663531Z // begin inline asm 2026-02-21T08:25:57.3663621Z @%p127 mbarrier.inval.shared::cta.b64 [%r101]; 2026-02-21T08:25:57.3663676Z // end inline asm 2026-02-21T08:25:57.3663852Z .loc 1 91 22 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:91:22 2026-02-21T08:25:57.3663920Z shl.b64 %rd153, %rd27, 1; 2026-02-21T08:25:57.3663983Z add.s64 %rd149, %rd26, %rd153; 2026-02-21T08:25:57.3664047Z add.s64 %rd150, %rd149, 229376; 2026-02-21T08:25:57.3664116Z add.s64 %rd151, %rd149, 458752; 2026-02-21T08:25:57.3664177Z add.s64 %rd152, %rd149, 688128; 2026-02-21T08:25:57.3664232Z $L__tmp14: 2026-02-21T08:25:57.3664457Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3664522Z // begin inline asm 2026-02-21T08:25:57.3664805Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r748, %r749, %r750, %r751, %r752, %r753, %r754, %r755, %r756, %r757, %r758, %r759, %r760, %r761, %r762, %r763}, [%r781 + 0], 32; 2026-02-21T08:25:57.3664860Z // end inline asm 2026-02-21T08:25:57.3664925Z // begin inline asm 2026-02-21T08:25:57.3665203Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r765, %r766, %r767, %r768, %r769, %r770, %r771, %r772, %r773, %r774, %r775, %r776, %r777, %r778, %r779, %r780}, [%r781 + 16], 32; 2026-02-21T08:25:57.3665258Z // end inline asm 2026-02-21T08:25:57.3665319Z // begin inline asm 2026-02-21T08:25:57.3665389Z tcgen05.wait::ld.sync.aligned; 
2026-02-21T08:25:57.3665442Z // end inline asm 2026-02-21T08:25:57.3665507Z cvt.u64.u32 %rd154, %r748; 2026-02-21T08:25:57.3665576Z cvt.u64.u32 %rd155, %r749; 2026-02-21T08:25:57.3665637Z shl.b64 %rd156, %rd155, 32; 2026-02-21T08:25:57.3665699Z or.b64 %rd157, %rd154, %rd156; 2026-02-21T08:25:57.3665760Z $L__tmp15: 2026-02-21T08:25:57.3665931Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3665993Z mov.b64 {%r819, %r820}, %rd157; 2026-02-21T08:25:57.3666070Z cvt.rn.bf16x2.f32 %r821, %r820, %r819; 2026-02-21T08:25:57.3666122Z $L__tmp16: 2026-02-21T08:25:57.3666342Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3666402Z cvt.u64.u32 %rd158, %r750; 2026-02-21T08:25:57.3666467Z cvt.u64.u32 %rd159, %r751; 2026-02-21T08:25:57.3666528Z shl.b64 %rd160, %rd159, 32; 2026-02-21T08:25:57.3666615Z or.b64 %rd161, %rd158, %rd160; 2026-02-21T08:25:57.3666676Z $L__tmp17: 2026-02-21T08:25:57.3666859Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3666975Z mov.b64 {%r822, %r823}, %rd161; 2026-02-21T08:25:57.3667056Z cvt.rn.bf16x2.f32 %r824, %r823, %r822; 2026-02-21T08:25:57.3667113Z $L__tmp18: 2026-02-21T08:25:57.3667344Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3667407Z cvt.u64.u32 %rd162, %r752; 2026-02-21T08:25:57.3667488Z cvt.u64.u32 %rd163, %r753; 2026-02-21T08:25:57.3667551Z shl.b64 %rd164, %rd163, 32; 2026-02-21T08:25:57.3667613Z or.b64 %rd165, %rd162, %rd164; 2026-02-21T08:25:57.3667674Z $L__tmp19: 2026-02-21T08:25:57.3667850Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3667912Z mov.b64 {%r825, %r826}, %rd165; 2026-02-21T08:25:57.3667982Z cvt.rn.bf16x2.f32 %r827, %r826, %r825; 2026-02-21T08:25:57.3668043Z $L__tmp20: 2026-02-21T08:25:57.3668309Z .loc 2 291 36 // 
standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3668375Z cvt.u64.u32 %rd166, %r754; 2026-02-21T08:25:57.3668444Z cvt.u64.u32 %rd167, %r755; 2026-02-21T08:25:57.3668507Z shl.b64 %rd168, %rd167, 32; 2026-02-21T08:25:57.3668572Z or.b64 %rd169, %rd166, %rd168; 2026-02-21T08:25:57.3668639Z $L__tmp21: 2026-02-21T08:25:57.3668819Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3668881Z mov.b64 {%r828, %r829}, %rd169; 2026-02-21T08:25:57.3668952Z cvt.rn.bf16x2.f32 %r830, %r829, %r828; 2026-02-21T08:25:57.3669013Z $L__tmp22: 2026-02-21T08:25:57.3669239Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3669300Z cvt.u64.u32 %rd170, %r756; 2026-02-21T08:25:57.3669367Z cvt.u64.u32 %rd171, %r757; 2026-02-21T08:25:57.3669432Z shl.b64 %rd172, %rd171, 32; 2026-02-21T08:25:57.3669498Z or.b64 %rd173, %rd170, %rd172; 2026-02-21T08:25:57.3669558Z $L__tmp23: 2026-02-21T08:25:57.3669732Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3669795Z mov.b64 {%r831, %r832}, %rd173; 2026-02-21T08:25:57.3669864Z cvt.rn.bf16x2.f32 %r833, %r832, %r831; 2026-02-21T08:25:57.3669927Z $L__tmp24: 2026-02-21T08:25:57.3670155Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3670216Z cvt.u64.u32 %rd174, %r758; 2026-02-21T08:25:57.3670284Z cvt.u64.u32 %rd175, %r759; 2026-02-21T08:25:57.3670345Z shl.b64 %rd176, %rd175, 32; 2026-02-21T08:25:57.3670406Z or.b64 %rd177, %rd174, %rd176; 2026-02-21T08:25:57.3670460Z $L__tmp25: 2026-02-21T08:25:57.3670644Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3670711Z mov.b64 {%r834, %r835}, %rd177; 2026-02-21T08:25:57.3670783Z cvt.rn.bf16x2.f32 %r836, %r835, %r834; 2026-02-21T08:25:57.3670843Z $L__tmp26: 
2026-02-21T08:25:57.3671065Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3671127Z cvt.u64.u32 %rd178, %r760; 2026-02-21T08:25:57.3671195Z cvt.u64.u32 %rd179, %r761; 2026-02-21T08:25:57.3671256Z shl.b64 %rd180, %rd179, 32; 2026-02-21T08:25:57.3671317Z or.b64 %rd181, %rd178, %rd180; 2026-02-21T08:25:57.3671371Z $L__tmp27: 2026-02-21T08:25:57.3671592Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3671657Z mov.b64 {%r837, %r838}, %rd181; 2026-02-21T08:25:57.3671728Z cvt.rn.bf16x2.f32 %r839, %r838, %r837; 2026-02-21T08:25:57.3671791Z $L__tmp28: 2026-02-21T08:25:57.3672013Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3672126Z cvt.u64.u32 %rd182, %r762; 2026-02-21T08:25:57.3672194Z cvt.u64.u32 %rd183, %r763; 2026-02-21T08:25:57.3672255Z shl.b64 %rd184, %rd183, 32; 2026-02-21T08:25:57.3672316Z or.b64 %rd185, %rd182, %rd184; 2026-02-21T08:25:57.3672371Z $L__tmp29: 2026-02-21T08:25:57.3672553Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3672614Z mov.b64 {%r840, %r841}, %rd185; 2026-02-21T08:25:57.3672682Z cvt.rn.bf16x2.f32 %r842, %r841, %r840; 2026-02-21T08:25:57.3672742Z $L__tmp30: 2026-02-21T08:25:57.3672968Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3673029Z cvt.u64.u32 %rd186, %r765; 2026-02-21T08:25:57.3673097Z cvt.u64.u32 %rd187, %r766; 2026-02-21T08:25:57.3673160Z shl.b64 %rd188, %rd187, 32; 2026-02-21T08:25:57.3673222Z or.b64 %rd189, %rd186, %rd188; 2026-02-21T08:25:57.3673277Z $L__tmp31: 2026-02-21T08:25:57.3673511Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3673577Z mov.b64 {%r843, %r844}, %rd189; 2026-02-21T08:25:57.3673647Z cvt.rn.bf16x2.f32 %r845, %r844, %r843; 
2026-02-21T08:25:57.3673708Z $L__tmp32: 2026-02-21T08:25:57.3673931Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3673993Z cvt.u64.u32 %rd190, %r767; 2026-02-21T08:25:57.3674060Z cvt.u64.u32 %rd191, %r768; 2026-02-21T08:25:57.3674121Z shl.b64 %rd192, %rd191, 32; 2026-02-21T08:25:57.3674182Z or.b64 %rd193, %rd190, %rd192; 2026-02-21T08:25:57.3674236Z $L__tmp33: 2026-02-21T08:25:57.3674429Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3674489Z mov.b64 {%r846, %r847}, %rd193; 2026-02-21T08:25:57.3674555Z cvt.rn.bf16x2.f32 %r848, %r847, %r846; 2026-02-21T08:25:57.3674616Z $L__tmp34: 2026-02-21T08:25:57.3674832Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3674890Z cvt.u64.u32 %rd194, %r769; 2026-02-21T08:25:57.3674948Z cvt.u64.u32 %rd195, %r770; 2026-02-21T08:25:57.3675014Z shl.b64 %rd196, %rd195, 32; 2026-02-21T08:25:57.3675073Z or.b64 %rd197, %rd194, %rd196; 2026-02-21T08:25:57.3675123Z $L__tmp35: 2026-02-21T08:25:57.3675299Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3675358Z mov.b64 {%r849, %r850}, %rd197; 2026-02-21T08:25:57.3675424Z cvt.rn.bf16x2.f32 %r851, %r850, %r849; 2026-02-21T08:25:57.3675483Z $L__tmp36: 2026-02-21T08:25:57.3675698Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3675758Z cvt.u64.u32 %rd198, %r771; 2026-02-21T08:25:57.3675815Z cvt.u64.u32 %rd199, %r772; 2026-02-21T08:25:57.3675889Z shl.b64 %rd200, %rd199, 32; 2026-02-21T08:25:57.3675948Z or.b64 %rd201, %rd198, %rd200; 2026-02-21T08:25:57.3676001Z $L__tmp37: 2026-02-21T08:25:57.3676174Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3676236Z mov.b64 {%r852, %r853}, %rd201; 2026-02-21T08:25:57.3676303Z 
cvt.rn.bf16x2.f32 %r854, %r853, %r852; 2026-02-21T08:25:57.3676364Z $L__tmp38: 2026-02-21T08:25:57.3676576Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3676636Z cvt.u64.u32 %rd202, %r773; 2026-02-21T08:25:57.3676693Z cvt.u64.u32 %rd203, %r774; 2026-02-21T08:25:57.3676759Z shl.b64 %rd204, %rd203, 32; 2026-02-21T08:25:57.3676818Z or.b64 %rd205, %rd202, %rd204; 2026-02-21T08:25:57.3676869Z $L__tmp39: 2026-02-21T08:25:57.3677046Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3677164Z mov.b64 {%r855, %r856}, %rd205; 2026-02-21T08:25:57.3677230Z cvt.rn.bf16x2.f32 %r857, %r856, %r855; 2026-02-21T08:25:57.3677282Z $L__tmp40: 2026-02-21T08:25:57.3677498Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3677557Z cvt.u64.u32 %rd206, %r775; 2026-02-21T08:25:57.3677615Z cvt.u64.u32 %rd207, %r776; 2026-02-21T08:25:57.3677680Z shl.b64 %rd208, %rd207, 32; 2026-02-21T08:25:57.3677737Z or.b64 %rd209, %rd206, %rd208; 2026-02-21T08:25:57.3677789Z $L__tmp41: 2026-02-21T08:25:57.3677965Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3678025Z mov.b64 {%r858, %r859}, %rd209; 2026-02-21T08:25:57.3678090Z cvt.rn.bf16x2.f32 %r860, %r859, %r858; 2026-02-21T08:25:57.3678141Z $L__tmp42: 2026-02-21T08:25:57.3678401Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3678462Z cvt.u64.u32 %rd210, %r777; 2026-02-21T08:25:57.3678520Z cvt.u64.u32 %rd211, %r778; 2026-02-21T08:25:57.3678588Z shl.b64 %rd212, %rd211, 32; 2026-02-21T08:25:57.3678646Z or.b64 %rd213, %rd210, %rd212; 2026-02-21T08:25:57.3678699Z $L__tmp43: 2026-02-21T08:25:57.3678874Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3678932Z mov.b64 {%r861, %r862}, 
%rd213; 2026-02-21T08:25:57.3678998Z cvt.rn.bf16x2.f32 %r863, %r862, %r861; 2026-02-21T08:25:57.3679050Z $L__tmp44: 2026-02-21T08:25:57.3679269Z .loc 2 291 36 // standard.py:291:36 @[ cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:87:40 ] 2026-02-21T08:25:57.3679327Z cvt.u64.u32 %rd214, %r779; 2026-02-21T08:25:57.3679384Z cvt.u64.u32 %rd215, %r780; 2026-02-21T08:25:57.3679451Z shl.b64 %rd216, %rd215, 32; 2026-02-21T08:25:57.3679509Z or.b64 %rd217, %rd214, %rd216; 2026-02-21T08:25:57.3679566Z $L__tmp45: 2026-02-21T08:25:57.3679747Z .loc 1 90 28 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:90:28 2026-02-21T08:25:57.3679807Z mov.b64 {%r864, %r865}, %rd217; 2026-02-21T08:25:57.3679872Z cvt.rn.bf16x2.f32 %r866, %r865, %r864; 2026-02-21T08:25:57.3680037Z .loc 1 91 81 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:91:81 2026-02-21T08:25:57.3680140Z st.shared.v4.b32 [%r26], {%r821, %r833, %r845, %r857}; 2026-02-21T08:25:57.3680195Z bar.sync 0; 2026-02-21T08:25:57.3680253Z // begin inline asm 2026-02-21T08:25:57.3680410Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r802, %r806, %r810, %r814}, [%r786]; 2026-02-21T08:25:57.3680466Z // end inline asm 2026-02-21T08:25:57.3680520Z bar.sync 0; 2026-02-21T08:25:57.3680611Z st.shared.v4.b32 [%r26], {%r824, %r836, %r848, %r860}; 2026-02-21T08:25:57.3680672Z bar.sync 0; 2026-02-21T08:25:57.3680728Z // begin inline asm 2026-02-21T08:25:57.3680876Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r803, %r807, %r811, %r815}, [%r786]; 2026-02-21T08:25:57.3680941Z // end inline asm 2026-02-21T08:25:57.3680994Z bar.sync 0; 2026-02-21T08:25:57.3681080Z st.shared.v4.b32 [%r26], {%r827, %r839, %r851, %r863}; 2026-02-21T08:25:57.3681138Z bar.sync 0; 2026-02-21T08:25:57.3681194Z // begin inline asm 2026-02-21T08:25:57.3681340Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r804, %r808, %r812, %r816}, [%r786]; 2026-02-21T08:25:57.3681395Z // end inline asm 2026-02-21T08:25:57.3681454Z bar.sync 0; 
2026-02-21T08:25:57.3681567Z st.shared.v4.b32 [%r26], {%r830, %r842, %r854, %r866}; 2026-02-21T08:25:57.3681621Z bar.sync 0; 2026-02-21T08:25:57.3681683Z // begin inline asm 2026-02-21T08:25:57.3681823Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r805, %r809, %r813, %r817}, [%r786]; 2026-02-21T08:25:57.3681877Z // end inline asm 2026-02-21T08:25:57.3681932Z // begin inline asm 2026-02-21T08:25:57.3682048Z st.global.v4.b32 [ %rd149 + 0 ], { %r802, %r803, %r804, %r805 }; 2026-02-21T08:25:57.3682156Z // end inline asm 2026-02-21T08:25:57.3682212Z // begin inline asm 2026-02-21T08:25:57.3682320Z st.global.v4.b32 [ %rd150 + 0 ], { %r806, %r807, %r808, %r809 }; 2026-02-21T08:25:57.3682376Z // end inline asm 2026-02-21T08:25:57.3682431Z // begin inline asm 2026-02-21T08:25:57.3682535Z st.global.v4.b32 [ %rd151 + 0 ], { %r810, %r811, %r812, %r813 }; 2026-02-21T08:25:57.3682590Z // end inline asm 2026-02-21T08:25:57.3682645Z // begin inline asm 2026-02-21T08:25:57.3682742Z st.global.v4.b32 [ %rd152 + 0 ], { %r814, %r815, %r816, %r817 }; 2026-02-21T08:25:57.3682805Z // end inline asm 2026-02-21T08:25:57.3682885Z $L__BB0_8: // %._crit_edge 2026-02-21T08:25:57.3683057Z .loc 1 29 4 // cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py:29:4 2026-02-21T08:25:57.3683119Z bar.sync 0; 2026-02-21T08:25:57.3683175Z // begin inline asm 2026-02-21T08:25:57.3683348Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r867, 256; 2026-02-21T08:25:57.3683407Z // end inline asm 2026-02-21T08:25:57.3683468Z ret; 2026-02-21T08:25:57.3683522Z $L__tmp46: 2026-02-21T08:25:57.3683578Z $L__func_end0: 2026-02-21T08:25:57.3683668Z // -- End function 2026-02-21T08:25:57.3683720Z } 2026-02-21T08:25:57.3683924Z .file 1 "/tmp/torchinductor_root/br/cbr4ay4htslvvet6ndzhl2w7gdogwwp4vzpjozu6xuvugqjvqpi2.py" 2026-02-21T08:25:57.3684109Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T08:25:57.3684171Z .section .debug_abbrev 
2026-02-21T08:25:57.3684222Z { 2026-02-21T08:25:57.3684310Z .b8 1 // Abbreviation Code 2026-02-21T08:25:57.3684405Z .b8 17 // DW_TAG_compile_unit 2026-02-21T08:25:57.3684486Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:25:57.3684567Z .b8 37 // DW_AT_producer 2026-02-21T08:25:57.3684658Z .b8 8 // DW_FORM_string 2026-02-21T08:25:57.3684736Z .b8 19 // DW_AT_language 2026-02-21T08:25:57.3684812Z .b8 5 // DW_FORM_data2 2026-02-21T08:25:57.3684897Z .b8 3 // DW_AT_name 2026-02-21T08:25:57.3684973Z .b8 8 // DW_FORM_string 2026-02-21T08:25:57.3685053Z .b8 16 // DW_AT_stmt_list 2026-02-21T08:25:57.3685127Z .b8 6 // DW_FORM_data4 2026-02-21T08:25:57.3685209Z .b8 27 // DW_AT_comp_dir 2026-02-21T08:25:57.3685283Z .b8 8 // DW_FORM_string 2026-02-21T08:25:57.3685354Z .b8 0 // EOM(1) 2026-02-21T08:25:57.3685431Z .b8 0 // EOM(2) 2026-02-21T08:25:57.3685512Z .b8 2 // Abbreviation Code 2026-02-21T08:25:57.3685594Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:25:57.3685676Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:25:57.3685747Z .b8 3 // DW_AT_name 2026-02-21T08:25:57.3685821Z .b8 8 // DW_FORM_string 2026-02-21T08:25:57.3685895Z .b8 32 // DW_AT_inline 2026-02-21T08:25:57.3685977Z .b8 11 // DW_FORM_data1 2026-02-21T08:25:57.3686044Z .b8 0 // EOM(1) 2026-02-21T08:25:57.3686112Z .b8 0 // EOM(2) 2026-02-21T08:25:57.3686198Z .b8 3 // Abbreviation Code 2026-02-21T08:25:57.3686277Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:25:57.3686352Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:25:57.3686434Z .b8 17 // DW_AT_low_pc 2026-02-21T08:25:57.3686508Z .b8 1 // DW_FORM_addr 2026-02-21T08:25:57.3686632Z .b8 18 // DW_AT_high_pc 2026-02-21T08:25:57.3686715Z .b8 1 // DW_FORM_addr 2026-02-21T08:25:57.3686802Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:25:57.3686875Z .b8 19 // DW_FORM_ref4 2026-02-21T08:25:57.3686943Z .b8 0 // EOM(1) 2026-02-21T08:25:57.3687019Z .b8 0 // EOM(2) 2026-02-21T08:25:57.3687101Z .b8 4 // Abbreviation Code 2026-02-21T08:25:57.3687192Z .b8 29 // DW_TAG_inlined_subroutine 
2026-02-21T08:25:57.3687274Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:25:57.3687359Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:25:57.3687433Z .b8 19 // DW_FORM_ref4 2026-02-21T08:25:57.3687558Z .b8 17 // DW_AT_low_pc 2026-02-21T08:25:57.3687633Z .b8 1 // DW_FORM_addr 2026-02-21T08:25:57.3687712Z .b8 18 // DW_AT_high_pc 2026-02-21T08:25:57.3687784Z .b8 1 // DW_FORM_addr 2026-02-21T08:25:57.3687869Z .b8 88 // DW_AT_call_file 2026-02-21T08:25:57.3687943Z .b8 11 // DW_FORM_data1 2026-02-21T08:25:57.3688018Z .b8 89 // DW_AT_call_line 2026-02-21T08:25:57.3688100Z .b8 11 // DW_FORM_data1 2026-02-21T08:25:57.3688177Z .b8 87 // DW_AT_call_column 2026-02-21T08:25:57.3688249Z .b8 11 // DW_FORM_data1 2026-02-21T08:25:57.3688328Z .b8 0 // EOM(1) 2026-02-21T08:25:57.3688395Z .b8 0 // EOM(2) 2026-02-21T08:25:57.3688464Z .b8 0 // EOM(3) 2026-02-21T08:25:57.3688518Z } 2026-02-21T08:25:57.3688585Z .section .debug_info 2026-02-21T08:25:57.3688635Z { 2026-02-21T08:25:57.3688719Z .b32 178 // Length of Unit 2026-02-21T08:25:57.3688810Z .b8 2 // DWARF version number 2026-02-21T08:25:57.3688863Z .b8 0 2026-02-21T08:25:57.3688978Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T08:25:57.3689071Z .b8 8 // Address Size (in bytes) 2026-02-21T08:25:57.3689171Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T08:25:57.3689251Z .b8 116 // DW_AT_producer 2026-02-21T08:25:57.3689305Z .b8 114 2026-02-21T08:25:57.3689366Z .b8 105 2026-02-21T08:25:57.3689418Z .b8 116 2026-02-21T08:25:57.3689468Z .b8 111 2026-02-21T08:25:57.3689523Z .b8 110 2026-02-21T08:25:57.3689574Z .b8 0 2026-02-21T08:25:57.3689649Z .b8 2 // DW_AT_language 2026-02-21T08:25:57.3689701Z .b8 0 2026-02-21T08:25:57.3689783Z .b8 99 // DW_AT_name 2026-02-21T08:25:57.3689833Z .b8 98 2026-02-21T08:25:57.3689883Z .b8 114 2026-02-21T08:25:57.3689940Z .b8 52 2026-02-21T08:25:57.3689989Z .b8 97 2026-02-21T08:25:57.3690038Z .b8 121 2026-02-21T08:25:57.3690086Z .b8 52 2026-02-21T08:25:57.3690144Z .b8 104 2026-02-21T08:25:57.3690193Z .b8 116 2026-02-21T08:25:57.3690244Z .b8 115 2026-02-21T08:25:57.3690301Z .b8 108 2026-02-21T08:25:57.3690351Z .b8 118 2026-02-21T08:25:57.3690400Z .b8 118 2026-02-21T08:25:57.3690450Z .b8 101 2026-02-21T08:25:57.3690506Z .b8 116 2026-02-21T08:25:57.3690556Z .b8 54 2026-02-21T08:25:57.3690606Z .b8 110 2026-02-21T08:25:57.3690656Z .b8 100 2026-02-21T08:25:57.3690714Z .b8 122 2026-02-21T08:25:57.3690764Z .b8 104 2026-02-21T08:25:57.3690814Z .b8 108 2026-02-21T08:25:57.3690870Z .b8 50 2026-02-21T08:25:57.3690922Z .b8 119 2026-02-21T08:25:57.3690973Z .b8 55 2026-02-21T08:25:57.3691066Z .b8 103 2026-02-21T08:25:57.3691122Z .b8 100 2026-02-21T08:25:57.3691172Z .b8 111 2026-02-21T08:25:57.3691222Z .b8 103 2026-02-21T08:25:57.3691279Z .b8 119 2026-02-21T08:25:57.3691330Z .b8 119 2026-02-21T08:25:57.3691381Z .b8 112 2026-02-21T08:25:57.3691431Z .b8 52 2026-02-21T08:25:57.3691493Z .b8 118 2026-02-21T08:25:57.3691577Z .b8 122 2026-02-21T08:25:57.3691630Z .b8 112 2026-02-21T08:25:57.3691683Z .b8 106 2026-02-21T08:25:57.3691745Z .b8 111 2026-02-21T08:25:57.3691798Z .b8 122 2026-02-21T08:25:57.3691849Z .b8 117 2026-02-21T08:25:57.3691907Z .b8 54 
2026-02-21T08:25:57.3691957Z .b8 120 2026-02-21T08:25:57.3692007Z .b8 117 2026-02-21T08:25:57.3692057Z .b8 118 2026-02-21T08:25:57.3692114Z .b8 117 2026-02-21T08:25:57.3692164Z .b8 103 2026-02-21T08:25:57.3692213Z .b8 113 2026-02-21T08:25:57.3692271Z .b8 106 2026-02-21T08:25:57.3692321Z .b8 118 2026-02-21T08:25:57.3692370Z .b8 113 2026-02-21T08:25:57.3692421Z .b8 112 2026-02-21T08:25:57.3692478Z .b8 105 2026-02-21T08:25:57.3692593Z .b8 50 2026-02-21T08:25:57.3692647Z .b8 46 2026-02-21T08:25:57.3692704Z .b8 112 2026-02-21T08:25:57.3692754Z .b8 121 2026-02-21T08:25:57.3692804Z .b8 0 2026-02-21T08:25:57.3692896Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T08:25:57.3692979Z .b8 47 // DW_AT_comp_dir 2026-02-21T08:25:57.3693030Z .b8 116 2026-02-21T08:25:57.3693080Z .b8 109 2026-02-21T08:25:57.3693136Z .b8 112 2026-02-21T08:25:57.3693186Z .b8 47 2026-02-21T08:25:57.3693236Z .b8 116 2026-02-21T08:25:57.3693285Z .b8 111 2026-02-21T08:25:57.3693342Z .b8 114 2026-02-21T08:25:57.3693391Z .b8 99 2026-02-21T08:25:57.3693441Z .b8 104 2026-02-21T08:25:57.3693490Z .b8 105 2026-02-21T08:25:57.3693548Z .b8 110 2026-02-21T08:25:57.3693598Z .b8 100 2026-02-21T08:25:57.3693648Z .b8 117 2026-02-21T08:25:57.3693704Z .b8 99 2026-02-21T08:25:57.3693755Z .b8 116 2026-02-21T08:25:57.3693805Z .b8 111 2026-02-21T08:25:57.3693856Z .b8 114 2026-02-21T08:25:57.3693916Z .b8 95 2026-02-21T08:25:57.3693969Z .b8 114 2026-02-21T08:25:57.3694022Z .b8 111 2026-02-21T08:25:57.3694078Z .b8 111 2026-02-21T08:25:57.3694129Z .b8 116 2026-02-21T08:25:57.3694177Z .b8 47 2026-02-21T08:25:57.3694227Z .b8 98 2026-02-21T08:25:57.3694285Z .b8 114 2026-02-21T08:25:57.3694333Z .b8 0 2026-02-21T08:25:57.3694430Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T08:25:57.3694509Z .b8 95 // DW_AT_name 2026-02-21T08:25:57.3694559Z .b8 104 2026-02-21T08:25:57.3694608Z .b8 101 2026-02-21T08:25:57.3694657Z .b8 108 2026-02-21T08:25:57.3694713Z .b8 105 2026-02-21T08:25:57.3694763Z .b8 111 
2026-02-21T08:25:57.3694813Z .b8 110 2026-02-21T08:25:57.3694870Z .b8 95 2026-02-21T08:25:57.3694919Z .b8 109 2026-02-21T08:25:57.3694969Z .b8 97 2026-02-21T08:25:57.3695018Z .b8 116 2026-02-21T08:25:57.3695074Z .b8 109 2026-02-21T08:25:57.3695123Z .b8 117 2026-02-21T08:25:57.3695172Z .b8 108 2026-02-21T08:25:57.3695221Z .b8 95 2026-02-21T08:25:57.3695278Z .b8 98 2026-02-21T08:25:57.3695334Z .b8 102 2026-02-21T08:25:57.3695383Z .b8 49 2026-02-21T08:25:57.3695438Z .b8 54 2026-02-21T08:25:57.3695487Z .b8 95 2026-02-21T08:25:57.3695536Z .b8 105 2026-02-21T08:25:57.3695586Z .b8 110 2026-02-21T08:25:57.3695644Z .b8 116 2026-02-21T08:25:57.3695694Z .b8 52 2026-02-21T08:25:57.3695742Z .b8 0 2026-02-21T08:25:57.3695822Z .b8 1 // DW_AT_inline 2026-02-21T08:25:57.3695917Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T08:25:57.3696002Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T08:25:57.3696090Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T08:25:57.3696187Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:25:57.3696298Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T08:25:57.3696386Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:25:57.3696478Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T08:25:57.3696617Z .b64 $L__tmp45 // DW_AT_high_pc 2026-02-21T08:25:57.3696695Z .b8 1 // DW_AT_call_file 2026-02-21T08:25:57.3696779Z .b8 87 // DW_AT_call_line 2026-02-21T08:25:57.3696858Z .b8 40 // DW_AT_call_column 2026-02-21T08:25:57.3696942Z .b8 0 // End Of Children Mark 2026-02-21T08:25:57.3697030Z .b8 0 // End Of Children Mark 2026-02-21T08:25:57.3697080Z } 2026-02-21T08:25:57.3697147Z .section .debug_macinfo { } 2026-02-21T08:25:57.3697151Z 2026-02-21T08:25:57.3697228Z ================================================================ 2026-02-21T08:25:57.3697338Z please share the reproducer above with Triton project. 
2026-02-21T08:26:00.4264117Z 2026-02-21T08:26:00.4266361Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 15.9 configs/s 2026-02-21T08:26:06.9904901Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 152.9 2026-02-21T08:26:06.9908959Z configs/s 2026-02-21T08:26:07.1956649Z [163s] Generation 5 complete: 2026-02-21T08:26:07.1960478Z error=3 2026-02-21T08:26:07.1960743Z ok=72 2026-02-21T08:26:07.1960949Z min=0.1136 2026-02-21T08:26:07.1961136Z mid=0.1638 2026-02-21T08:26:07.1961297Z max=5.8521 2026-02-21T08:26:07.1961473Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:26:07.1961787Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:26:07.1962026Z 'l2_groupings': [1], 2026-02-21T08:26:07.1962246Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:26:07.1962487Z 'loop_orders': [[0, 1]], 2026-02-21T08:26:07.1962666Z 'num_stages': 5, 2026-02-21T08:26:07.1962841Z 'num_warps': 4, 2026-02-21T08:26:07.1963007Z 'pid_type': 'flat', 2026-02-21T08:26:07.1963176Z 'range_flattens': [None, None], 2026-02-21T08:26:07.1963399Z 'range_multi_buffers': [None, True], 2026-02-21T08:26:07.1963619Z 'range_num_stages': [0, 0], 2026-02-21T08:26:07.1963811Z 'range_unroll_factors': [0, 1], 2026-02-21T08:26:07.1964025Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:07.1978991Z [163s] Fitting surrogate: 549 points, 549 targets 2026-02-21T08:26:08.2032687Z [164s] Generation 6 starting: 62 neighbors, 4 active search path(s) 2026-02-21T08:26:14.4090427Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 4.4 configs/s 2026-02-21T08:26:18.1164631Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 17.5 configs/s 2026-02-21T08:26:22.8346151Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 212.0 2026-02-21T08:26:22.8350023Z configs/s 2026-02-21T08:26:22.9918052Z [178s] Generation 6 complete: 2026-02-21T08:26:22.9922484Z error=4 2026-02-21T08:26:22.9924103Z ok=63 2026-02-21T08:26:22.9924310Z 
min=0.1136 2026-02-21T08:26:22.9924802Z mid=0.2182 2026-02-21T08:26:22.9924940Z max=4.8574 2026-02-21T08:26:22.9925090Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:26:22.9925310Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:26:22.9925523Z 'l2_groupings': [1], 2026-02-21T08:26:22.9925699Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:26:22.9925913Z 'loop_orders': [[0, 1]], 2026-02-21T08:26:22.9926069Z 'num_stages': 5, 2026-02-21T08:26:22.9926217Z 'num_warps': 4, 2026-02-21T08:26:22.9926359Z 'pid_type': 'flat', 2026-02-21T08:26:22.9926522Z 'range_flattens': [None, None], 2026-02-21T08:26:22.9926701Z 'range_multi_buffers': [None, True], 2026-02-21T08:26:22.9926892Z 'range_num_stages': [0, 0], 2026-02-21T08:26:22.9927078Z 'range_unroll_factors': [0, 1], 2026-02-21T08:26:22.9927255Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:22.9940638Z [178s] Fitting surrogate: 616 points, 616 targets 2026-02-21T08:26:24.0248770Z [179s] Generation 7 starting: 65 neighbors, 4 active search path(s) 2026-02-21T08:26:33.4588643Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 1.6 configs/s 2026-02-21T08:26:37.1492669Z [192s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:26:37.1493787Z Tensor-likes are not close! 
2026-02-21T08:26:37.1498648Z 2026-02-21T08:26:37.1500759Z Mismatched elements: 457235 / 458752 (99.7%) 2026-02-21T08:26:37.1501081Z Greatest absolute difference: 3104.0 at index (36, 2205) (up to 0.01 allowed) 2026-02-21T08:26:37.1501439Z Greatest relative difference: 397312.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T08:26:37.1501851Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:26:37.1502038Z 2026-02-21T08:26:37.4577976Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 17.2 configs/s 2026-02-21T08:26:41.9616791Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 222.2 2026-02-21T08:26:41.9618494Z configs/s 2026-02-21T08:26:42.1176930Z [197s] Generation 7 complete: 2026-02-21T08:26:42.1181352Z error=3 2026-02-21T08:26:42.1185222Z ok=67 2026-02-21T08:26:42.1186697Z min=0.1128 2026-02-21T08:26:42.1186866Z mid=0.2828 2026-02-21T08:26:42.1186996Z max=13.2895 2026-02-21T08:26:42.1187150Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:26:42.1187388Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:26:42.1187613Z 'l2_groupings': [2], 2026-02-21T08:26:42.1187784Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:26:42.1187969Z 'loop_orders': [[0, 1]], 2026-02-21T08:26:42.1188127Z 'num_stages': 6, 2026-02-21T08:26:42.1188288Z 'num_warps': 1, 2026-02-21T08:26:42.1188444Z 'pid_type': 'flat', 2026-02-21T08:26:42.1188598Z 'range_flattens': [None, None], 2026-02-21T08:26:42.1188780Z 'range_multi_buffers': [None, True], 2026-02-21T08:26:42.1188959Z 'range_num_stages': [0, 0], 2026-02-21T08:26:42.1189131Z 'range_unroll_factors': [0, 1], 2026-02-21T08:26:42.1189307Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:42.1205984Z [197s] Fitting surrogate: 686 points, 686 targets 2026-02-21T08:26:43.1366311Z [198s] Generation 8 starting: 65 neighbors, 4 active search path(s) 2026-02-21T08:26:47.3956282Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 28.6 
configs/s 2026-02-21T08:26:48.0944761Z [203s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]) 2026-02-21T08:26:48.0946187Z Tensor-likes are not close! 2026-02-21T08:26:48.0950618Z 2026-02-21T08:26:48.0955213Z Mismatched elements: 457355 / 458752 (99.7%) 2026-02-21T08:26:48.0958562Z Greatest absolute difference: 3168.0 at index (36, 3559) (up to 0.01 allowed) 2026-02-21T08:26:48.0962253Z Greatest relative difference: 397312.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T08:26:48.0966153Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:26:48.0966429Z 2026-02-21T08:26:51.0814777Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 18.7 configs/s 2026-02-21T08:26:57.1087044Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 177.0 2026-02-21T08:26:57.1088482Z configs/s 2026-02-21T08:26:57.3023412Z [213s] Generation 8 complete: 2026-02-21T08:26:57.3028190Z error=5 2026-02-21T08:26:57.3029677Z ok=64 2026-02-21T08:26:57.3029821Z min=0.1126 2026-02-21T08:26:57.3029950Z mid=0.1608 2026-02-21T08:26:57.3030079Z max=0.7424 2026-02-21T08:26:57.3030220Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:26:57.3030467Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:26:57.3030686Z 'l2_groupings': [2], 2026-02-21T08:26:57.3030915Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:26:57.3031107Z 'loop_orders': [[0, 1]], 2026-02-21T08:26:57.3031258Z 'num_stages': 6, 2026-02-21T08:26:57.3031404Z 'num_warps': 1, 2026-02-21T08:26:57.3031602Z 'pid_type': 'flat', 2026-02-21T08:26:57.3031767Z 'range_flattens': [None, False], 
2026-02-21T08:26:57.3031947Z 'range_multi_buffers': [None, True], 2026-02-21T08:26:57.3032132Z 'range_num_stages': [0, 0], 2026-02-21T08:26:57.3032298Z 'range_unroll_factors': [0, 1], 2026-02-21T08:26:57.3032496Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:57.3050236Z [213s] Fitting surrogate: 755 points, 755 targets 2026-02-21T08:26:58.3192744Z [214s] Generation 9 starting: 62 neighbors, 4 active search path(s) 2026-02-21T08:27:03.3024728Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 12.6 configs/s 2026-02-21T08:27:06.7603929Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 19.1 configs/s 2026-02-21T08:27:13.0435955Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 159.7 2026-02-21T08:27:13.0437579Z configs/s 2026-02-21T08:27:13.2474021Z [229s] Generation 9 complete: 2026-02-21T08:27:13.2475789Z error=7 2026-02-21T08:27:13.2475961Z ok=60 2026-02-21T08:27:13.2476095Z min=0.1117 2026-02-21T08:27:13.2476238Z mid=0.1381 2026-02-21T08:27:13.2476364Z max=0.6687 2026-02-21T08:27:13.2476521Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:27:13.2476756Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:27:13.2476981Z 'l2_groupings': [2], 2026-02-21T08:27:13.2477191Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:27:13.2477409Z 'loop_orders': [[0, 1]], 2026-02-21T08:27:13.2477582Z 'num_stages': 6, 2026-02-21T08:27:13.2477732Z 'num_warps': 2, 2026-02-21T08:27:13.2477889Z 'pid_type': 'flat', 2026-02-21T08:27:13.2478054Z 'range_flattens': [None, False], 2026-02-21T08:27:13.2478247Z 'range_multi_buffers': [None, True], 2026-02-21T08:27:13.2478437Z 'range_num_stages': [0, 0], 2026-02-21T08:27:13.2478622Z 'range_unroll_factors': [0, 1], 2026-02-21T08:27:13.2478810Z 'range_warp_specializes': [None, True]} 2026-02-21T08:27:13.2501933Z [229s] Fitting surrogate: 822 points, 822 targets 2026-02-21T08:27:14.2777026Z [230s] Generation 10 starting: 66 neighbors, 4 active search path(s) 
2026-02-21T08:27:20.4207859Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 4.6 configs/s 2026-02-21T08:27:23.8672106Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 68/68 20.0 configs/s 2026-02-21T08:27:30.6552160Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 156.2 2026-02-21T08:27:30.6552969Z configs/s 2026-02-21T08:27:30.8713526Z [246s] Generation 10 complete: 2026-02-21T08:27:30.8715069Z error=12 2026-02-21T08:27:30.8715259Z ok=59 2026-02-21T08:27:30.8715422Z min=0.1135 2026-02-21T08:27:30.8715581Z mid=0.1279 2026-02-21T08:27:30.8715725Z max=8.4992 2026-02-21T08:27:30.8715906Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:27:30.8716159Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:27:30.8716389Z 'l2_groupings': [2], 2026-02-21T08:27:30.8716589Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:27:30.8716790Z 'loop_orders': [[0, 1]], 2026-02-21T08:27:30.8716979Z 'num_stages': 6, 2026-02-21T08:27:30.8717144Z 'num_warps': 2, 2026-02-21T08:27:30.8717338Z 'pid_type': 'flat', 2026-02-21T08:27:30.8717519Z 'range_flattens': [None, True], 2026-02-21T08:27:30.8717733Z 'range_multi_buffers': [None, False], 2026-02-21T08:27:30.8718346Z 'range_num_stages': [0, 0], 2026-02-21T08:27:30.8718573Z 'range_unroll_factors': [0, 1], 2026-02-21T08:27:30.8718798Z 'range_warp_specializes': [None, True]} 2026-02-21T08:27:30.8741695Z [246s] Fitting surrogate: 893 points, 893 targets 2026-02-21T08:27:31.7537226Z [247s] Generation 11 starting: 50 neighbors, 3 active search path(s) 2026-02-21T08:27:35.0466867Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 25.3 configs/s 2026-02-21T08:27:37.5646137Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 51/51 20.7 configs/s 2026-02-21T08:27:41.6063355Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 248.0 2026-02-21T08:27:41.6068036Z configs/s 2026-02-21T08:27:41.7511234Z [257s] Generation 11 complete: 
2026-02-21T08:27:41.7514975Z error=10 2026-02-21T08:27:41.7518867Z ok=43 2026-02-21T08:27:41.7523266Z min=0.1117 2026-02-21T08:27:41.7526558Z mid=0.1351 2026-02-21T08:27:41.7529809Z max=0.6440 2026-02-21T08:27:41.7533945Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:27:41.7538462Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:27:41.7540363Z 'l2_groupings': [2], 2026-02-21T08:27:41.7540585Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:27:41.7540789Z 'loop_orders': [[0, 1]], 2026-02-21T08:27:41.7540964Z 'num_stages': 6, 2026-02-21T08:27:41.7541120Z 'num_warps': 2, 2026-02-21T08:27:41.7541269Z 'pid_type': 'flat', 2026-02-21T08:27:41.7541440Z 'range_flattens': [None, True], 2026-02-21T08:27:41.7541708Z 'range_multi_buffers': [None, True], 2026-02-21T08:27:41.7546164Z 'range_num_stages': [0, 0], 2026-02-21T08:27:41.7551011Z 'range_unroll_factors': [0, 1], 2026-02-21T08:27:41.7554171Z 'range_warp_specializes': [None, True]} 2026-02-21T08:27:41.7562619Z [257s] Fitting surrogate: 946 points, 946 targets 2026-02-21T08:27:42.3719509Z [258s] Generation 12 starting: 29 neighbors, 2 active search path(s) 2026-02-21T08:27:44.5092466Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 24.3 configs/s 2026-02-21T08:27:46.2034926Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 30/30 18.2 configs/s 2026-02-21T08:27:49.7738800Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 280.1 2026-02-21T08:27:49.7742146Z configs/s 2026-02-21T08:27:49.8995303Z [265s] Generation 12 complete: 2026-02-21T08:27:49.8996850Z error=1 2026-02-21T08:27:49.8997026Z ok=30 2026-02-21T08:27:49.8997197Z min=0.1126 2026-02-21T08:27:49.8997358Z mid=0.1484 2026-02-21T08:27:49.8997514Z max=3.5350 2026-02-21T08:27:49.8997677Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:27:49.8997918Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:27:49.8998136Z 'l2_groupings': [2], 2026-02-21T08:27:49.8998321Z 'load_eviction_policies': ['', 
'last'], 2026-02-21T08:27:49.8998534Z 'loop_orders': [[0, 1]], 2026-02-21T08:27:49.8998716Z 'num_stages': 6, 2026-02-21T08:27:49.8998901Z 'num_warps': 2, 2026-02-21T08:27:49.8999397Z 'pid_type': 'flat', 2026-02-21T08:27:49.8999601Z 'range_flattens': [None, True], 2026-02-21T08:27:49.8999794Z 'range_multi_buffers': [None, True], 2026-02-21T08:27:49.9000002Z 'range_num_stages': [0, 0], 2026-02-21T08:27:49.9000196Z 'range_unroll_factors': [0, 0], 2026-02-21T08:27:49.9000398Z 'range_warp_specializes': [None, True]} 2026-02-21T08:27:49.9022815Z [265s] Fitting surrogate: 977 points, 977 targets 2026-02-21T08:27:50.4691907Z [266s] Generation 13 starting: 30 neighbors, 2 active search path(s) 2026-02-21T08:27:53.6642325Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 5.4 configs/s 2026-02-21T08:27:55.4688970Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 17.6 configs/s 2026-02-21T08:27:57.8774548Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 414.0 2026-02-21T08:27:57.8778784Z configs/s 2026-02-21T08:27:57.9779268Z [273s] Generation 13 complete: 2026-02-21T08:27:57.9783274Z ok=32 2026-02-21T08:27:57.9784730Z min=0.1127 2026-02-21T08:27:57.9784907Z mid=0.2244 2026-02-21T08:27:57.9785040Z max=1.7684 2026-02-21T08:27:57.9785204Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:27:57.9785434Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:27:57.9785651Z 'l2_groupings': [2], 2026-02-21T08:27:57.9785826Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:27:57.9786014Z 'loop_orders': [[0, 1]], 2026-02-21T08:27:57.9786178Z 'num_stages': 6, 2026-02-21T08:27:57.9786323Z 'num_warps': 1, 2026-02-21T08:27:57.9786474Z 'pid_type': 'flat', 2026-02-21T08:27:57.9786634Z 'range_flattens': [None, True], 2026-02-21T08:27:57.9786820Z 'range_multi_buffers': [None, None], 2026-02-21T08:27:57.9787006Z 'range_num_stages': [0, 0], 2026-02-21T08:27:57.9787183Z 'range_unroll_factors': [0, 0], 2026-02-21T08:27:57.9787365Z 
'range_warp_specializes': [None, True]} 2026-02-21T08:27:57.9810696Z [273s] Fitting surrogate: 1009 points, 1009 targets 2026-02-21T08:27:58.4049390Z [274s] Generation 14 starting: 15 neighbors, 1 active search path(s) 2026-02-21T08:28:00.9487217Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 5.1 configs/s 2026-02-21T08:28:01.9065172Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.5 configs/s 2026-02-21T08:28:02.2236985Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2880.7 2026-02-21T08:28:02.2241257Z configs/s 2026-02-21T08:28:02.2666708Z [278s] Generation 14 complete: 2026-02-21T08:28:02.2668553Z ok=17 2026-02-21T08:28:02.2668725Z min=0.1136 2026-02-21T08:28:02.2668858Z mid=0.3063 2026-02-21T08:28:02.2668991Z max=3.5360 2026-02-21T08:28:02.2669135Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:28:02.2669365Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:28:02.2669584Z 'l2_groupings': [2], 2026-02-21T08:28:02.2669788Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:28:02.2670314Z 'loop_orders': [[0, 1]], 2026-02-21T08:28:02.2670487Z 'num_stages': 6, 2026-02-21T08:28:02.2670630Z 'num_warps': 1, 2026-02-21T08:28:02.2670783Z 'pid_type': 'flat', 2026-02-21T08:28:02.2670952Z 'range_flattens': [None, True], 2026-02-21T08:28:02.2671133Z 'range_multi_buffers': [None, None], 2026-02-21T08:28:02.2671333Z 'range_num_stages': [0, 0], 2026-02-21T08:28:02.2671498Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:02.2671753Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:02.2699235Z [278s] Fitting surrogate: 1026 points, 1026 targets 2026-02-21T08:28:02.7253307Z [278s] Generation 15 starting: 17 neighbors, 1 active search path(s) 2026-02-21T08:28:04.1059891Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 28.3 configs/s 2026-02-21T08:28:05.0945551Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 19.2 configs/s 2026-02-21T08:28:05.9830193Z 
Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1092.0 2026-02-21T08:28:05.9831801Z configs/s 2026-02-21T08:28:06.0404798Z [281s] Generation 15 complete: 2026-02-21T08:28:06.0409150Z error=1 2026-02-21T08:28:06.0413469Z ok=18 2026-02-21T08:28:06.0417838Z min=0.1135 2026-02-21T08:28:06.0419405Z mid=0.2478 2026-02-21T08:28:06.0419575Z max=0.7614 2026-02-21T08:28:06.0419719Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:28:06.0419948Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:28:06.0420152Z 'l2_groupings': [2], 2026-02-21T08:28:06.0420327Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:28:06.0420512Z 'loop_orders': [[0, 1]], 2026-02-21T08:28:06.0420674Z 'num_stages': 6, 2026-02-21T08:28:06.0420821Z 'num_warps': 1, 2026-02-21T08:28:06.0420961Z 'pid_type': 'flat', 2026-02-21T08:28:06.0421122Z 'range_flattens': [None, True], 2026-02-21T08:28:06.0421298Z 'range_multi_buffers': [None, None], 2026-02-21T08:28:06.0421498Z 'range_num_stages': [0, 0], 2026-02-21T08:28:06.0421836Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:06.0422025Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:06.0436571Z [281s] Fitting surrogate: 1045 points, 1045 targets 2026-02-21T08:28:06.4942430Z [282s] Generation 16 starting: 15 neighbors, 1 active search path(s) 2026-02-21T08:28:09.9086482Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 3.2 configs/s 2026-02-21T08:28:10.8589473Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.7 configs/s 2026-02-21T08:28:11.3706039Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1848.8 2026-02-21T08:28:11.3710403Z configs/s 2026-02-21T08:28:11.4190715Z [287s] Generation 16 complete: 2026-02-21T08:28:11.4191030Z ok=17 2026-02-21T08:28:11.4191238Z min=0.1136 2026-02-21T08:28:11.4191395Z mid=0.2284 2026-02-21T08:28:11.4191622Z max=1.6434 2026-02-21T08:28:11.4192245Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:28:11.4192565Z 'indexing': 
['pointer', 'pointer', 'pointer'], 2026-02-21T08:28:11.4192782Z 'l2_groupings': [2], 2026-02-21T08:28:11.4192960Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:28:11.4193166Z 'loop_orders': [[0, 1]], 2026-02-21T08:28:11.4193334Z 'num_stages': 6, 2026-02-21T08:28:11.4193484Z 'num_warps': 1, 2026-02-21T08:28:11.4193764Z 'pid_type': 'flat', 2026-02-21T08:28:11.4211777Z 'range_flattens': [None, True], 2026-02-21T08:28:11.4211998Z 'range_multi_buffers': [None, None], 2026-02-21T08:28:11.4212190Z 'range_num_stages': [0, 0], 2026-02-21T08:28:11.4212373Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:11.4212556Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:11.4222790Z [287s] Fitting surrogate: 1062 points, 1062 targets 2026-02-21T08:28:11.8908433Z [287s] Generation 17 starting: 16 neighbors, 1 active search path(s) 2026-02-21T08:28:13.5029927Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 14.4 configs/s 2026-02-21T08:28:14.2683171Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 22.6 configs/s 2026-02-21T08:28:14.9659575Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1378.2 2026-02-21T08:28:14.9661075Z configs/s 2026-02-21T08:28:15.0188099Z [290s] Generation 17 complete: 2026-02-21T08:28:15.0189944Z error=5 2026-02-21T08:28:15.0190100Z ok=13 2026-02-21T08:28:15.0190239Z min=0.1136 2026-02-21T08:28:15.0190370Z mid=0.2508 2026-02-21T08:28:15.0190506Z max=0.5514 2026-02-21T08:28:15.0190651Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:28:15.0190884Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:28:15.0191090Z 'l2_groupings': [2], 2026-02-21T08:28:15.0191268Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:28:15.0191459Z 'loop_orders': [[0, 1]], 2026-02-21T08:28:15.0191703Z 'num_stages': 6, 2026-02-21T08:28:15.0191851Z 'num_warps': 1, 2026-02-21T08:28:15.0192309Z 'pid_type': 'flat', 2026-02-21T08:28:15.0192499Z 'range_flattens': [None, True], 2026-02-21T08:28:15.0192686Z 
'range_multi_buffers': [None, None], 2026-02-21T08:28:15.0192886Z 'range_num_stages': [0, 0], 2026-02-21T08:28:15.0193056Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:15.0193248Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:15.0219351Z [290s] Fitting surrogate: 1080 points, 1080 targets 2026-02-21T08:28:15.8909711Z [291s] Generation 18 starting: 17 neighbors, 1 active search path(s) 2026-02-21T08:28:17.7054926Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 6.7 configs/s 2026-02-21T08:28:18.5060378Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 22.9 configs/s 2026-02-21T08:28:18.5066516Z [294s] Generation 18 complete: 2026-02-21T08:28:18.5068620Z error=6 2026-02-21T08:28:18.5068792Z ok=13 2026-02-21T08:28:18.5068928Z min=0.1136 2026-02-21T08:28:18.5069076Z mid=0.2775 2026-02-21T08:28:18.5069214Z max=0.5846 2026-02-21T08:28:18.5069391Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:28:18.5069647Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:28:18.5069858Z 'l2_groupings': [2], 2026-02-21T08:28:18.5070038Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:28:18.5070228Z 'loop_orders': [[0, 1]], 2026-02-21T08:28:18.5070394Z 'num_stages': 6, 2026-02-21T08:28:18.5070537Z 'num_warps': 1, 2026-02-21T08:28:18.5070692Z 'pid_type': 'flat', 2026-02-21T08:28:18.5070852Z 'range_flattens': [None, True], 2026-02-21T08:28:18.5071044Z 'range_multi_buffers': [None, None], 2026-02-21T08:28:18.5071236Z 'range_num_stages': [0, 0], 2026-02-21T08:28:18.5071407Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:18.5072381Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:18.5099374Z [294s] Fitting surrogate: 1099 points, 1099 targets 2026-02-21T08:28:18.9842315Z [294s] Generation 19 starting: 17 neighbors, 1 active search path(s) 2026-02-21T08:28:20.7687761Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 12.7 configs/s 2026-02-21T08:28:21.5938403Z Generation 19: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━ 18/18 22.1 configs/s 2026-02-21T08:28:22.2734439Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1413.7 2026-02-21T08:28:22.2738686Z configs/s 2026-02-21T08:28:22.3253393Z [298s] Generation 19 complete: 2026-02-21T08:28:22.3257746Z error=5 2026-02-21T08:28:22.3262035Z ok=14 2026-02-21T08:28:22.3263660Z min=0.1136 2026-02-21T08:28:22.3263869Z mid=0.1934 2026-02-21T08:28:22.3268081Z max=0.4085 2026-02-21T08:28:22.3272524Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:28:22.3275800Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:28:22.3279612Z 'l2_groupings': [2], 2026-02-21T08:28:22.3279899Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:28:22.3280135Z 'loop_orders': [[0, 1]], 2026-02-21T08:28:22.3280319Z 'num_stages': 6, 2026-02-21T08:28:22.3280508Z 'num_warps': 1, 2026-02-21T08:28:22.3281092Z 'pid_type': 'flat', 2026-02-21T08:28:22.3281926Z 'range_flattens': [None, True], 2026-02-21T08:28:22.3282137Z 'range_multi_buffers': [None, None], 2026-02-21T08:28:22.3282338Z 'range_num_stages': [0, 0], 2026-02-21T08:28:22.3282525Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:22.3282719Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:22.3288901Z [298s] Fitting surrogate: 1118 points, 1118 targets 2026-02-21T08:28:22.7834474Z [298s] Generation 20 starting: 17 neighbors, 1 active search path(s) 2026-02-21T08:28:24.1367828Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 30.7 configs/s 2026-02-21T08:28:25.0600688Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 20.7 configs/s 2026-02-21T08:28:25.5407899Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1968.5 2026-02-21T08:28:25.5411801Z configs/s 2026-02-21T08:28:25.5866093Z [301s] Generation 20 complete: 2026-02-21T08:28:25.5870863Z error=3 2026-02-21T08:28:25.5874760Z ok=16 2026-02-21T08:28:25.5876083Z min=0.1135 2026-02-21T08:28:25.5876256Z mid=0.2877 2026-02-21T08:28:25.5876387Z max=1.0239 
2026-02-21T08:28:25.5876544Z best={'block_sizes': [16, 64, 64], 2026-02-21T08:28:25.5876779Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:28:25.5876987Z 'l2_groupings': [2], 2026-02-21T08:28:25.5877164Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:28:25.5877354Z 'loop_orders': [[0, 1]], 2026-02-21T08:28:25.5877521Z 'num_stages': 6, 2026-02-21T08:28:25.5877666Z 'num_warps': 1, 2026-02-21T08:28:25.5877820Z 'pid_type': 'flat', 2026-02-21T08:28:25.5877982Z 'range_flattens': [None, True], 2026-02-21T08:28:25.5878171Z 'range_multi_buffers': [None, None], 2026-02-21T08:28:25.5878355Z 'range_num_stages': [0, 0], 2026-02-21T08:28:25.5878531Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:25.5878718Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:25.5901247Z [301s] Fitting surrogate: 1137 points, 1137 targets 2026-02-21T08:28:25.8946577Z [301s] Autotuning complete in 301.7s after searching 1101 configs. 2026-02-21T08:28:25.8950291Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:28:25.8952623Z @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:28:25.8953582Z 2026-02-21T08:28:25.8953840Z [301s] Code of selected kernel: /tmp/torchinductor_root/zd/czd6qic2ibdcw3h23m3d5gn7d5rhptjcsh5nwvnttjv44qnjwotu.py 2026-02-21T08:28:26.9767361Z WARNING:tritonbench.utils.triton_op:Completed input ID 14: 2026-02-21T08:28:26.9771219Z x_val 2026-02-21T08:28:26.9774326Z ------------------- 2026-02-21T08:28:26.9778818Z (64, 1, 7168, 8192) 2026-02-21T08:28:26.9783213Z 2026-02-21T08:28:26.9785307Z 50%|█████ | 5/10 [19:27<21:53, 
262.71s/it]WARNING:tritonbench.utils.triton_op:Running input ID 17: 2026-02-21T08:28:26.9785678Z x_val 2026-02-21T08:28:26.9790525Z --------------------- 2026-02-21T08:28:26.9792091Z (1, 4096, 8192, 1024) 2026-02-21T08:28:26.9792516Z INFO:tritonbench.utils.triton_op:Took 0.19ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:28:27.9378429Z INFO:tritonbench.utils.triton_op:Took 2.48ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:28:28.8845809Z Autotune Choices Stats: 2026-02-21T08:28:28.8846952Z {"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.04819199815392494, "best_triton_pos": 1, "best_triton_time": 0.07993599772453308, "best_triton_kernel": "triton_mm_68", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4"} 2026-02-21T08:28:28.8862128Z AUTOTUNE mm(4096x1024, 1024x8192) 2026-02-21T08:28:28.8866705Z strides: [1024, 1], [8192, 1] 2026-02-21T08:28:28.8868584Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:28:28.8868832Z mm 0.0482 ms 100.0% 2026-02-21T08:28:28.8869304Z triton_mm_68 0.0799 ms 60.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:28:28.8870079Z triton_mm_67 0.0880 ms 54.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:28:28.8870817Z triton_mm_60 0.1074 ms 44.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:28:28.8872032Z triton_mm_69 0.1084 ms 44.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2026-02-21T08:28:28.8872748Z triton_mm_62 0.1105 ms 43.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:28:28.8873422Z triton_mm_61 0.1125 ms 42.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2026-02-21T08:28:28.8874091Z triton_mm_65 0.1186 ms 40.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2026-02-21T08:28:28.8874771Z triton_mm_64 0.1208 ms 39.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:28:28.8875448Z triton_mm_58 0.1351 ms 35.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2026-02-21T08:28:28.8876021Z SingleProcess AUTOTUNE benchmarking takes 0.5221 seconds and 0.2417 seconds precompiling for 20 choices 2026-02-21T08:28:31.6690622Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:28:32.8404479Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:28:32.8407758Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:28:32.8411876Z 'dtype': 'torch.bfloat16', 2026-02-21T08:28:32.8415773Z 'shape': (1, 4096, 1024), 2026-02-21T08:28:32.8420027Z 'stride': (4194304, 1024, 1)}, 2026-02-21T08:28:32.8424043Z { 'device': 'cuda:0', 2026-02-21T08:28:32.8428559Z 'dtype': 'torch.int32', 2026-02-21T08:28:32.8428851Z 'shape': (1024, 8192), 2026-02-21T08:28:32.8429068Z 'stride': (8192, 1)}), 2026-02-21T08:28:32.8429269Z 'kwargs': {}} 2026-02-21T08:28:32.8455921Z INFO:tritonbench.utils.triton_op:Took 5.36ms to get benchmark function for 
helion_int4_gemm_tritonbench 2026-02-21T08:28:33.1294377Z [0s] Autotune random seed: 2134813318 2026-02-21T08:28:33.2577186Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:28:52.5727714Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.0 configs/s 2026-02-21T08:28:52.6703920Z [19s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 16, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[64], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=32, num_stages=7, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, False], range_num_stages=[1, 1], range_unroll_factors=[4, 1], range_warp_specializes=[None, None]) 2026-02-21T08:28:52.6705195Z Tensor-likes are not close! 2026-02-21T08:28:52.6710290Z 2026-02-21T08:28:52.6714866Z Mismatched elements: 33338520 / 33554432 (99.4%) 2026-02-21T08:28:52.6716804Z Greatest absolute difference: 776.0 at index (1687, 6957) (up to 0.01 allowed) 2026-02-21T08:28:52.6717286Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:28:52.6717659Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:28:52.6717858Z 2026-02-21T08:28:55.1978222Z [21s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 128, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=32, num_stages=8, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:28:55.1979335Z Tensor-likes are not close! 
2026-02-21T08:28:55.1979762Z 2026-02-21T08:28:55.1979891Z Mismatched elements: 33528093 / 33554432 (99.9%) 2026-02-21T08:28:55.1980176Z Greatest absolute difference: 3456.0 at index (190, 5946) (up to 0.01 allowed) 2026-02-21T08:28:55.1980523Z Greatest relative difference: inf at index (3980, 6366) (up to 0.01 allowed) 2026-02-21T08:28:55.1980819Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:28:55.1980992Z 2026-02-21T08:28:56.3901386Z [23s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 64, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=7, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[3, 1], range_warp_specializes=[None, False]) 2026-02-21T08:28:56.3902893Z Tensor-likes are not close! 2026-02-21T08:28:56.3904908Z 2026-02-21T08:28:56.3905120Z Mismatched elements: 33484848 / 33554432 (99.8%) 2026-02-21T08:28:56.3905456Z Greatest absolute difference: 2464.0 at index (289, 7045) (up to 0.01 allowed) 2026-02-21T08:28:56.3905817Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:28:56.3906119Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:28:56.3906283Z 2026-02-21T08:29:17.1333085Z [43s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 128, 32], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:29:17.1334182Z Tensor-likes are not close! 
2026-02-21T08:29:17.1334304Z 2026-02-21T08:29:17.1334396Z Mismatched elements: 33509402 / 33554432 (99.9%) 2026-02-21T08:29:17.1334749Z Greatest absolute difference: 3488.0 at index (190, 5938) (up to 0.01 allowed) 2026-02-21T08:29:17.1335176Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:29:17.1335532Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:29:17.1335736Z 2026-02-21T08:29:37.1767191Z [63s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:29:37.1767533Z 2026-02-21T08:29:37.1768597Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=8, num_stages=7, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[2, 3], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:29:37.1769767Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T08:29:37.1770240Z 2026-02-21T08:29:37.1770359Z `ptxas` stderr: 2026-02-21T08:29:37.1770831Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 564 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:29:37.1771334Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:29:37.1771482Z 2026-02-21T08:29:37.1772042Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp8tiywaix.ptx -o /tmp/tmp8tiywaix.ptx.o 2026-02-21T08:29:37.1772469Z 2026-02-21T08:29:37.1772596Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:29:37.1772895Z ================================================================ 2026-02-21T08:29:37.1774667Z Internal Triton PTX codegen error 2026-02-21T08:29:37.1774901Z `ptxas` stderr: 2026-02-21T08:29:37.1775610Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 564 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:29:37.1776148Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:29:37.1776317Z 2026-02-21T08:29:37.1776734Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp8tiywaix.ptx -o /tmp/tmp8tiywaix.ptx.o 2026-02-21T08:29:37.1777189Z 2026-02-21T08:29:37.1777192Z 2026-02-21T08:29:37.1777256Z // 2026-02-21T08:29:37.1777404Z // Generated by LLVM NVPTX Back-End 2026-02-21T08:29:37.1777594Z // 2026-02-21T08:29:37.1777666Z 2026-02-21T08:29:37.1777725Z .version 8.7 2026-02-21T08:29:37.1777876Z .target sm_100a 2026-02-21T08:29:37.1778019Z .address_size 64 2026-02-21T08:29:37.1778117Z 2026-02-21T08:29:37.1778265Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T08:29:37.1778562Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T08:29:37.1778784Z // @_helion_matmul_bf16_int4 2026-02-21T08:29:37.1779007Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T08:29:37.1779248Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T08:29:37.1779531Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T08:29:37.1779799Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T08:29:37.1780079Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T08:29:37.1780359Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T08:29:37.1780571Z ) 2026-02-21T08:29:37.1780904Z .reqntid 128 
2026-02-21T08:29:37.1781032Z .maxnreg 32 2026-02-21T08:29:37.1781159Z { 2026-02-21T08:29:37.1781279Z .reg .pred %p<93>; 2026-02-21T08:29:37.1781432Z .reg .b16 %rs<296>; 2026-02-21T08:29:37.1781667Z .reg .b32 %r<620>; 2026-02-21T08:29:37.1781816Z .reg .b64 %rd<158>; 2026-02-21T08:29:37.1781965Z $L__func_begin0: 2026-02-21T08:29:37.1782061Z 2026-02-21T08:29:37.1782114Z // %bb.0: 2026-02-21T08:29:37.1782365Z .loc 1 14 0 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:14 2026-02-21T08:29:37.1782652Z mov.u32 %r1, %tid.x; 2026-02-21T08:29:37.1782815Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T08:29:37.1782980Z mov.b32 %r72, global_smem; 2026-02-21T08:29:37.1783147Z // begin inline asm 2026-02-21T08:29:37.1783392Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r72], 256; 2026-02-21T08:29:37.1783646Z // end inline asm 2026-02-21T08:29:37.1783788Z bar.sync 0; 2026-02-21T08:29:37.1783932Z ld.shared.b32 %r612, [global_smem]; 2026-02-21T08:29:37.1784109Z bar.sync 0; 2026-02-21T08:29:37.1784240Z // begin inline asm 2026-02-21T08:29:37.1784451Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T08:29:37.1784677Z // end inline asm 2026-02-21T08:29:37.1784943Z .loc 1 20 30 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:20:30 2026-02-21T08:29:37.1785332Z mov.u32 %r73, %ctaid.x; 2026-02-21T08:29:37.1785604Z .loc 1 20 35 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:20:35 2026-02-21T08:29:37.1785902Z mul.lo.s32 %r613, %r73, 28; 2026-02-21T08:29:37.1786171Z .loc 1 21 37 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:21:37 2026-02-21T08:29:37.1786467Z add.s32 %r74, %r613, 28; 2026-02-21T08:29:37.1786731Z .loc 1 21 49 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:21:49 2026-02-21T08:29:37.1787024Z min.s32 %r4, %r74, 32768; 2026-02-21T08:29:37.1787293Z .loc 1 22 120 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:22:120 2026-02-21T08:29:37.1787603Z setp.ge.s32 %p3, %r613, %r4; 
2026-02-21T08:29:37.1787783Z @%p3 bra $L__BB0_9; 2026-02-21T08:29:37.1787948Z // %bb.1: // %.lr.ph 2026-02-21T08:29:37.1788315Z .loc 1 0 120 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:0:120 2026-02-21T08:29:37.1788640Z ld.param.b64 %rd32, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T08:29:37.1788896Z ld.param.b64 %rd31, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T08:29:37.1789136Z ld.param.b64 %rd30, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T08:29:37.1789344Z shr.u32 %r5, %r1, 5; 2026-02-21T08:29:37.1789498Z and.b32 %r6, %r1, 112; 2026-02-21T08:29:37.1789650Z bfe.u32 %r7, %r1, 4, 3; 2026-02-21T08:29:37.1789802Z shr.u32 %r8, %r1, 1; 2026-02-21T08:29:37.1789945Z bfe.u32 %r9, %r1, 1, 6; 2026-02-21T08:29:37.1790098Z shl.b32 %r75, %r1, 3; 2026-02-21T08:29:37.1790246Z and.b32 %r10, %r75, 8; 2026-02-21T08:29:37.1790401Z and.b32 %r11, %r1, 15; 2026-02-21T08:29:37.1790547Z shl.b32 %r12, %r11, 3; 2026-02-21T08:29:37.1790698Z and.b32 %r13, %r1, 16; 2026-02-21T08:29:37.1790841Z and.b32 %r76, %r1, 127; 2026-02-21T08:29:37.1790992Z shl.b32 %r14, %r76, 4; 2026-02-21T08:29:37.1791142Z add.s32 %r358, %r72, %r14; 2026-02-21T08:29:37.1791309Z add.s32 %r360, %r358, 2048; 2026-02-21T08:29:37.1791472Z add.s32 %r362, %r358, 4096; 2026-02-21T08:29:37.1791664Z add.s32 %r364, %r358, 6144; 2026-02-21T08:29:37.1791821Z add.s32 %r366, %r358, 8192; 2026-02-21T08:29:37.1791976Z add.s32 %r368, %r358, 10240; 2026-02-21T08:29:37.1792139Z add.s32 %r370, %r358, 12288; 2026-02-21T08:29:37.1792292Z add.s32 %r372, %r358, 14336; 2026-02-21T08:29:37.1792452Z add.s32 %r308, %r612, 16; 2026-02-21T08:29:37.1792607Z shl.b32 %r24, %r9, 13; 2026-02-21T08:29:37.1792763Z shl.b32 %r25, %r76, 3; 2026-02-21T08:29:37.1792918Z add.s32 %r78, %r72, 40960; 2026-02-21T08:29:37.1793073Z add.s32 %r374, %r78, %r25; 2026-02-21T08:29:37.1793233Z or.b32 %r27, %r12, 128; 2026-02-21T08:29:37.1793384Z add.s32 %r159, %r358, 16384; 2026-02-21T08:29:37.1793544Z add.s32 %r161, %r358, 18432; 
2026-02-21T08:29:37.1793695Z add.s32 %r163, %r358, 20480; 2026-02-21T08:29:37.1793855Z add.s32 %r165, %r358, 22528; 2026-02-21T08:29:37.1794006Z add.s32 %r167, %r358, 24576; 2026-02-21T08:29:37.1794163Z add.s32 %r169, %r358, 26624; 2026-02-21T08:29:37.1794312Z add.s32 %r171, %r358, 28672; 2026-02-21T08:29:37.1794470Z add.s32 %r173, %r358, 30720; 2026-02-21T08:29:37.1794630Z add.s32 %r79, %r72, %r25; 2026-02-21T08:29:37.1794781Z add.s32 %r175, %r79, 41984; 2026-02-21T08:29:37.1794943Z shl.b32 %r80, %r11, 8; 2026-02-21T08:29:37.1795091Z and.b32 %r81, %r1, 96; 2026-02-21T08:29:37.1795246Z shl.b32 %r82, %r81, 7; 2026-02-21T08:29:37.1795391Z shl.b32 %r83, %r13, 3; 2026-02-21T08:29:37.1795543Z or.b32 %r84, %r80, %r82; 2026-02-21T08:29:37.1795694Z or.b32 %r37, %r84, %r83; 2026-02-21T08:29:37.1795851Z add.s32 %r38, %r72, %r37; 2026-02-21T08:29:37.1796002Z shr.u32 %r85, %r81, 1; 2026-02-21T08:29:37.1796152Z or.b32 %r39, %r85, %r11; 2026-02-21T08:29:37.1796307Z add.s32 %r40, %r78, %r39; 2026-02-21T08:29:37.1796456Z shl.b32 %r86, %r11, 7; 2026-02-21T08:29:37.1796608Z shl.b32 %r87, %r1, 4; 2026-02-21T08:29:37.1796809Z and.b32 %r88, %r87, 112; 2026-02-21T08:29:37.1796962Z shr.u32 %r89, %r6, 2; 2026-02-21T08:29:37.1797105Z or.b32 %r90, %r86, %r88; 2026-02-21T08:29:37.1797257Z xor.b32 %r91, %r90, %r89; 2026-02-21T08:29:37.1797406Z add.s32 %r92, %r72, 32768; 2026-02-21T08:29:37.1797565Z add.s32 %r41, %r92, %r91; 2026-02-21T08:29:37.1797714Z xor.b32 %r93, %r91, 32; 2026-02-21T08:29:37.1797867Z add.s32 %r42, %r92, %r93; 2026-02-21T08:29:37.1798025Z xor.b32 %r94, %r91, 64; 2026-02-21T08:29:37.1798172Z add.s32 %r43, %r92, %r94; 2026-02-21T08:29:37.1798331Z xor.b32 %r95, %r91, 96; 2026-02-21T08:29:37.1798477Z add.s32 %r44, %r92, %r95; 2026-02-21T08:29:37.1798635Z bfe.u32 %r96, %r92, 4, 14; 2026-02-21T08:29:37.1798787Z cvt.u64.u32 %rd33, %r96; 2026-02-21T08:29:37.1798953Z or.b64 %rd80, %rd33, 4611686293313683456; 2026-02-21T08:29:37.1799130Z add.s32 %r97, %r72, 32800; 
2026-02-21T08:29:37.1799290Z bfe.u32 %r98, %r97, 4, 14; 2026-02-21T08:29:37.1799491Z cvt.u64.u32 %rd34, %r98; 2026-02-21T08:29:37.1799662Z or.b64 %rd81, %rd34, 4611686293313683456; 2026-02-21T08:29:37.1799844Z add.s32 %r99, %r72, 32832; 2026-02-21T08:29:37.1799996Z bfe.u32 %r100, %r99, 4, 14; 2026-02-21T08:29:37.1800156Z cvt.u64.u32 %rd35, %r100; 2026-02-21T08:29:37.1800315Z or.b64 %rd82, %rd35, 4611686293313683456; 2026-02-21T08:29:37.1800498Z add.s32 %r101, %r72, 32864; 2026-02-21T08:29:37.1800653Z bfe.u32 %r102, %r101, 4, 14; 2026-02-21T08:29:37.1800812Z cvt.u64.u32 %rd36, %r102; 2026-02-21T08:29:37.1800967Z or.b64 %rd83, %rd36, 4611686293313683456; 2026-02-21T08:29:37.1801147Z add.s32 %r103, %r72, 34816; 2026-02-21T08:29:37.1801307Z bfe.u32 %r104, %r103, 4, 14; 2026-02-21T08:29:37.1801461Z cvt.u64.u32 %rd37, %r104; 2026-02-21T08:29:37.1801682Z or.b64 %rd84, %rd37, 4611686293313683456; 2026-02-21T08:29:37.1801855Z add.s32 %r105, %r72, 34848; 2026-02-21T08:29:37.1802015Z bfe.u32 %r106, %r105, 4, 14; 2026-02-21T08:29:37.1802167Z cvt.u64.u32 %rd38, %r106; 2026-02-21T08:29:37.1802353Z or.b64 %rd85, %rd38, 4611686293313683456; 2026-02-21T08:29:37.1802527Z add.s32 %r107, %r72, 34880; 2026-02-21T08:29:37.1802685Z bfe.u32 %r108, %r107, 4, 14; 2026-02-21T08:29:37.1802839Z cvt.u64.u32 %rd39, %r108; 2026-02-21T08:29:37.1803008Z or.b64 %rd86, %rd39, 4611686293313683456; 2026-02-21T08:29:37.1803195Z add.s32 %r109, %r72, 34912; 2026-02-21T08:29:37.1803352Z bfe.u32 %r110, %r109, 4, 14; 2026-02-21T08:29:37.1803537Z cvt.u64.u32 %rd40, %r110; 2026-02-21T08:29:37.1803695Z or.b64 %rd87, %rd40, 4611686293313683456; 2026-02-21T08:29:37.1803874Z add.s32 %r111, %r72, 36864; 2026-02-21T08:29:37.1804026Z bfe.u32 %r112, %r111, 4, 14; 2026-02-21T08:29:37.1804185Z cvt.u64.u32 %rd41, %r112; 2026-02-21T08:29:37.1804344Z or.b64 %rd88, %rd41, 4611686293313683456; 2026-02-21T08:29:37.1804522Z add.s32 %r113, %r72, 36896; 2026-02-21T08:29:37.1804679Z bfe.u32 %r114, %r113, 4, 14; 
2026-02-21T08:29:37.1804831Z cvt.u64.u32 %rd42, %r114; 2026-02-21T08:29:37.1804999Z or.b64 %rd89, %rd42, 4611686293313683456; 2026-02-21T08:29:37.1805173Z add.s32 %r115, %r72, 36928; 2026-02-21T08:29:37.1805334Z bfe.u32 %r116, %r115, 4, 14; 2026-02-21T08:29:37.1805487Z cvt.u64.u32 %rd43, %r116; 2026-02-21T08:29:37.1805649Z or.b64 %rd90, %rd43, 4611686293313683456; 2026-02-21T08:29:37.1805820Z add.s32 %r117, %r72, 36960; 2026-02-21T08:29:37.1805978Z bfe.u32 %r118, %r117, 4, 14; 2026-02-21T08:29:37.1806130Z cvt.u64.u32 %rd44, %r118; 2026-02-21T08:29:37.1806291Z or.b64 %rd91, %rd44, 4611686293313683456; 2026-02-21T08:29:37.1806469Z add.s32 %r119, %r72, 38912; 2026-02-21T08:29:37.1806621Z bfe.u32 %r120, %r119, 4, 14; 2026-02-21T08:29:37.1806781Z cvt.u64.u32 %rd45, %r120; 2026-02-21T08:29:37.1806937Z or.b64 %rd92, %rd45, 4611686293313683456; 2026-02-21T08:29:37.1807112Z add.s32 %r121, %r72, 38944; 2026-02-21T08:29:37.1807261Z bfe.u32 %r122, %r121, 4, 14; 2026-02-21T08:29:37.1807420Z cvt.u64.u32 %rd46, %r122; 2026-02-21T08:29:37.1807580Z or.b64 %rd93, %rd46, 4611686293313683456; 2026-02-21T08:29:37.1807814Z add.s32 %r123, %r72, 38976; 2026-02-21T08:29:37.1807973Z bfe.u32 %r124, %r123, 4, 14; 2026-02-21T08:29:37.1808125Z cvt.u64.u32 %rd47, %r124; 2026-02-21T08:29:37.1808290Z or.b64 %rd94, %rd47, 4611686293313683456; 2026-02-21T08:29:37.1808462Z add.s32 %r125, %r72, 39008; 2026-02-21T08:29:37.1808621Z bfe.u32 %r126, %r125, 4, 14; 2026-02-21T08:29:37.1808771Z cvt.u64.u32 %rd48, %r126; 2026-02-21T08:29:37.1808933Z or.b64 %rd95, %rd48, 4611686293313683456; 2026-02-21T08:29:37.1809107Z or.b32 %r45, %r12, 256; 2026-02-21T08:29:37.1809263Z and.b32 %r127, %r8, 15; 2026-02-21T08:29:37.1809414Z and.b32 %r128, %r87, 16; 2026-02-21T08:29:37.1809571Z or.b32 %r46, %r127, %r128; 2026-02-21T08:29:37.1809851Z .loc 1 22 120 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:22:120 2026-02-21T08:29:37.1810151Z mad.wide.u32 %rd49, %r11, 16, %rd30; 2026-02-21T08:29:37.1810333Z add.s64 
%rd17, %rd49, 99072; 2026-02-21T08:29:37.1810650Z .loc 1 43 70 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:43:70 2026-02-21T08:29:37.1810945Z or.b32 %r47, %r12, 57728; 2026-02-21T08:29:37.1811205Z .loc 1 22 120 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:22:120 2026-02-21T08:29:37.1811506Z mad.wide.u32 %rd50, %r9, 8192, %rd31; 2026-02-21T08:29:37.1811733Z add.s64 %rd18, %rd50, 1572864; 2026-02-21T08:29:37.1811901Z setp.eq.b32 %p89, %r1, 0; 2026-02-21T08:29:37.1812075Z setp.eq.b32 %p11, %r13, 0; 2026-02-21T08:29:37.1812233Z bra.uni $L__BB0_2; 2026-02-21T08:29:37.1812434Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:29:37.1812760Z .loc 1 0 120 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:0:120 2026-02-21T08:29:37.1813057Z mov.b32 %r581, 1; 2026-02-21T08:29:37.1813195Z $L__tmp0: 2026-02-21T08:29:37.1813498Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1813877Z // begin inline asm 2026-02-21T08:29:37.1814019Z 2026-02-21T08:29:37.1814146Z { 2026-02-21T08:29:37.1814269Z .reg .pred complete; 2026-02-21T08:29:37.1814430Z waitLoop: 2026-02-21T08:29:37.1814628Z mbarrier.try_wait.parity.shared.b64 complete, [%r616], %r581; 2026-02-21T08:29:37.1814885Z @!complete bra.uni waitLoop; 2026-02-21T08:29:37.1815045Z } 2026-02-21T08:29:37.1815121Z 2026-02-21T08:29:37.1815180Z // end inline asm 2026-02-21T08:29:37.1815329Z $L__tmp1: 2026-02-21T08:29:37.1815578Z .loc 1 43 70 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:43:70 2026-02-21T08:29:37.1815884Z cp.async.wait_group 0; 2026-02-21T08:29:37.1816041Z bar.sync 0; 2026-02-21T08:29:37.1816188Z add.s32 %r582, %r72, 43008; 2026-02-21T08:29:37.1816351Z // begin inline asm 2026-02-21T08:29:37.1816540Z @%p89 mbarrier.inval.shared::cta.b64 [%r582]; 2026-02-21T08:29:37.1816739Z // end inline asm 2026-02-21T08:29:37.1816892Z bar.sync 0; 2026-02-21T08:29:37.1817039Z // begin inline asm 
2026-02-21T08:29:37.1817213Z @%p89 mbarrier.inval.shared::cta.b64 [%r140]; 2026-02-21T08:29:37.1817418Z // end inline asm 2026-02-21T08:29:37.1817682Z .loc 1 91 43 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:91:43 2026-02-21T08:29:37.1817987Z shl.b32 %r598, %r51, 13; 2026-02-21T08:29:37.1818258Z .loc 1 91 50 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:91:50 2026-02-21T08:29:37.1818557Z add.s32 %r599, %r598, %r53; 2026-02-21T08:29:37.1818833Z .loc 1 91 22 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:91:22 2026-02-21T08:29:37.1819139Z mad.wide.s32 %rd138, %r599, 2, %rd32; 2026-02-21T08:29:37.1819323Z $L__tmp2: 2026-02-21T08:29:37.1819619Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1819966Z // begin inline asm 2026-02-21T08:29:37.1820350Z tcgen05.ld.sync.aligned.16x32bx2.x8.b32 {%r584, %r585, %r586, %r587, %r588, %r589, %r590, %r591}, [%r592 + 0], 8; 2026-02-21T08:29:37.1820678Z // end inline asm 2026-02-21T08:29:37.1820819Z // begin inline asm 2026-02-21T08:29:37.1820985Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:29:37.1821163Z // end inline asm 2026-02-21T08:29:37.1821309Z cvt.u64.u32 %rd139, %r584; 2026-02-21T08:29:37.1821481Z cvt.u64.u32 %rd140, %r585; 2026-02-21T08:29:37.1821681Z shl.b64 %rd141, %rd140, 32; 2026-02-21T08:29:37.1821858Z or.b64 %rd142, %rd139, %rd141; 2026-02-21T08:29:37.1822021Z $L__tmp3: 2026-02-21T08:29:37.1822274Z .loc 1 90 28 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:90:28 2026-02-21T08:29:37.1822578Z mov.b64 {%r600, %r601}, %rd142; 2026-02-21T08:29:37.1822772Z cvt.rn.bf16x2.f32 %r602, %r601, %r600; 2026-02-21T08:29:37.1822964Z $L__tmp4: 2026-02-21T08:29:37.1823325Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1823691Z cvt.u64.u32 %rd143, %r586; 2026-02-21T08:29:37.1823852Z cvt.u64.u32 %rd144, %r587; 
2026-02-21T08:29:37.1824017Z shl.b64 %rd145, %rd144, 32; 2026-02-21T08:29:37.1824181Z or.b64 %rd146, %rd143, %rd145; 2026-02-21T08:29:37.1824347Z $L__tmp5: 2026-02-21T08:29:37.1824583Z .loc 1 90 28 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:90:28 2026-02-21T08:29:37.1824881Z mov.b64 {%r603, %r604}, %rd146; 2026-02-21T08:29:37.1825067Z cvt.rn.bf16x2.f32 %r605, %r604, %r603; 2026-02-21T08:29:37.1825239Z $L__tmp6: 2026-02-21T08:29:37.1825533Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1825859Z cvt.u64.u32 %rd147, %r588; 2026-02-21T08:29:37.1826027Z cvt.u64.u32 %rd148, %r589; 2026-02-21T08:29:37.1826185Z shl.b64 %rd149, %rd148, 32; 2026-02-21T08:29:37.1826358Z or.b64 %rd150, %rd147, %rd149; 2026-02-21T08:29:37.1826517Z $L__tmp7: 2026-02-21T08:29:37.1826760Z .loc 1 90 28 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:90:28 2026-02-21T08:29:37.1827054Z mov.b64 {%r606, %r607}, %rd150; 2026-02-21T08:29:37.1827229Z cvt.rn.bf16x2.f32 %r608, %r607, %r606; 2026-02-21T08:29:37.1827410Z $L__tmp8: 2026-02-21T08:29:37.1827691Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1828027Z cvt.u64.u32 %rd151, %r590; 2026-02-21T08:29:37.1828185Z cvt.u64.u32 %rd152, %r591; 2026-02-21T08:29:37.1828347Z shl.b64 %rd153, %rd152, 32; 2026-02-21T08:29:37.1828513Z or.b64 %rd154, %rd151, %rd153; 2026-02-21T08:29:37.1828670Z $L__tmp9: 2026-02-21T08:29:37.1828912Z .loc 1 90 28 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:90:28 2026-02-21T08:29:37.1829196Z mov.b64 {%r609, %r610}, %rd154; 2026-02-21T08:29:37.1829379Z cvt.rn.bf16x2.f32 %r611, %r610, %r609; 2026-02-21T08:29:37.1829665Z .loc 1 91 81 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:91:81 2026-02-21T08:29:37.1829977Z shfl.sync.idx.b32 %r593, %r602, %r46, 31, -1; 2026-02-21T08:29:37.1830190Z shfl.sync.idx.b32 %r594, %r605, %r46, 31, -1; 
2026-02-21T08:29:37.1830407Z shfl.sync.idx.b32 %r595, %r608, %r46, 31, -1; 2026-02-21T08:29:37.1830612Z shfl.sync.idx.b32 %r596, %r611, %r46, 31, -1; 2026-02-21T08:29:37.1830797Z // begin inline asm 2026-02-21T08:29:37.1830998Z st.global.v4.b32 [ %rd138 + 0 ], { %r593, %r594, %r595, %r596 }; 2026-02-21T08:29:37.1831215Z // end inline asm 2026-02-21T08:29:37.1831476Z .loc 1 22 120 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:22:120 2026-02-21T08:29:37.1831790Z add.s32 %r613, %r613, 1; 2026-02-21T08:29:37.1831959Z setp.ne.b32 %p91, %r613, %r4; 2026-02-21T08:29:37.1832129Z @%p91 bra $L__BB0_2; 2026-02-21T08:29:37.1832273Z bra.uni $L__BB0_9; 2026-02-21T08:29:37.1832466Z $L__BB0_2: // =>This Loop Header: Depth=1 2026-02-21T08:29:37.1832757Z // Child Loop BB0_5 Depth 2 2026-02-21T08:29:37.1833068Z .loc 1 28 35 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:28:35 2026-02-21T08:29:37.1833350Z shr.s32 %r245, %r613, 31; 2026-02-21T08:29:37.1833515Z shr.u32 %r246, %r245, 18; 2026-02-21T08:29:37.1833668Z add.s32 %r247, %r613, %r246; 2026-02-21T08:29:37.1833832Z shr.s32 %r248, %r247, 14; 2026-02-21T08:29:37.1834102Z .loc 1 31 45 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:31:45 2026-02-21T08:29:37.1834386Z and.b32 %r249, %r247, 49152; 2026-02-21T08:29:37.1834558Z sub.s32 %r250, %r613, %r249; 2026-02-21T08:29:37.1834814Z .loc 1 31 64 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:31:64 2026-02-21T08:29:37.1835105Z cvt.u16.u32 %rs1, %r250; 2026-02-21T08:29:37.1835408Z .loc 1 32 51 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:32:51 2026-02-21T08:29:37.1835698Z shr.s16 %rs2, %rs1, 15; 2026-02-21T08:29:37.1835857Z shr.u16 %rs3, %rs2, 11; 2026-02-21T08:29:37.1836005Z add.s16 %rs4, %rs1, %rs3; 2026-02-21T08:29:37.1836164Z shr.s16 %rs5, %rs4, 5; 2026-02-21T08:29:37.1836412Z .loc 1 31 64 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:31:64 2026-02-21T08:29:37.1836703Z and.b16 %rs6, %rs4, -32; 
2026-02-21T08:29:37.1836854Z sub.s16 %rs7, %rs1, %rs6; 2026-02-21T08:29:37.1837117Z .loc 1 33 27 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:33:27 2026-02-21T08:29:37.1837395Z shl.b32 %r49, %r248, 11; 2026-02-21T08:29:37.1837555Z mul.wide.s16 %r50, %rs7, 64; 2026-02-21T08:29:37.1837721Z add.s32 %r251, %r50, %r49; 2026-02-21T08:29:37.1837980Z .loc 1 34 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:34:32 2026-02-21T08:29:37.1838266Z or.b32 %r252, %r251, %r7; 2026-02-21T08:29:37.1838522Z .loc 1 35 27 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:35:27 2026-02-21T08:29:37.1838812Z mul.wide.s16 %r52, %rs5, 16; 2026-02-21T08:29:37.1839069Z .loc 1 36 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:36:32 2026-02-21T08:29:37.1839351Z or.b32 %r53, %r52, %r10; 2026-02-21T08:29:37.1839606Z .loc 1 51 53 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:53 2026-02-21T08:29:37.1839881Z shl.b32 %r253, %r252, 10; 2026-02-21T08:29:37.1840033Z $L__tmp10: 2026-02-21T08:29:37.1840315Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1840659Z shfl.sync.idx.b32 %r54, %r5, 0, 31, -1; 2026-02-21T08:29:37.1840838Z shl.b32 %r254, %r54, 21; 2026-02-21T08:29:37.1840996Z and.b32 %r255, %r254, 6291456; 2026-02-21T08:29:37.1841164Z add.s32 %r592, %r255, %r612; 2026-02-21T08:29:37.1841320Z mov.pred %p15, -1; 2026-02-21T08:29:37.1841467Z mov.b32 %r615, 0; 2026-02-21T08:29:37.1841629Z // begin inline asm 2026-02-21T08:29:37.1841932Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x8.b32 [%r592 + 0], 8, {%r615, %r615, %r615, %r615, %r615, %r615, %r615, %r615}; 2026-02-21T08:29:37.1842251Z // end inline asm 2026-02-21T08:29:37.1842399Z // begin inline asm 2026-02-21T08:29:37.1842559Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:29:37.1842734Z // end inline asm 2026-02-21T08:29:37.1842872Z bar.sync 0; 2026-02-21T08:29:37.1843013Z $L__tmp11: 2026-02-21T08:29:37.1843266Z 
.loc 1 43 70 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:43:70 2026-02-21T08:29:37.1843553Z bar.sync 0; 2026-02-21T08:29:37.1843701Z add.s32 %r616, %r72, 43008; 2026-02-21T08:29:37.1843860Z // begin inline asm 2026-02-21T08:29:37.1844039Z @%p89 mbarrier.init.shared::cta.b64 [%r616], 1; 2026-02-21T08:29:37.1844238Z // end inline asm 2026-02-21T08:29:37.1844433Z bar.sync 0; 2026-02-21T08:29:37.1844567Z add.s32 %r140, %r72, 43016; 2026-02-21T08:29:37.1844730Z // begin inline asm 2026-02-21T08:29:37.1844906Z @%p89 mbarrier.init.shared::cta.b64 [%r140], 1; 2026-02-21T08:29:37.1845097Z // end inline asm 2026-02-21T08:29:37.1845361Z .loc 1 51 60 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:60 2026-02-21T08:29:37.1845643Z or.b32 %r257, %r253, %r12; 2026-02-21T08:29:37.1845914Z .loc 1 51 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:32 2026-02-21T08:29:37.1846205Z mad.wide.s32 %rd51, %r257, 2, %rd30; 2026-02-21T08:29:37.1846392Z cvt.u64.u32 %rd69, %r12; 2026-02-21T08:29:37.1846548Z cvt.s64.s32 %rd19, %r253; 2026-02-21T08:29:37.1846710Z or.b64 %rd70, %rd19, %rd69; 2026-02-21T08:29:37.1846872Z shl.b64 %rd71, %rd70, 1; 2026-02-21T08:29:37.1847025Z add.s64 %rd20, %rd30, %rd71; 2026-02-21T08:29:37.1847190Z add.s64 %rd52, %rd20, 16384; 2026-02-21T08:29:37.1847394Z add.s64 %rd53, %rd20, 32768; 2026-02-21T08:29:37.1847557Z add.s64 %rd54, %rd20, 49152; 2026-02-21T08:29:37.1847710Z add.s64 %rd55, %rd20, 65536; 2026-02-21T08:29:37.1847870Z add.s64 %rd56, %rd20, 81920; 2026-02-21T08:29:37.1848022Z add.s64 %rd57, %rd20, 98304; 2026-02-21T08:29:37.1848186Z add.s64 %rd58, %rd20, 114688; 2026-02-21T08:29:37.1848351Z mov.b32 %r359, 16; 2026-02-21T08:29:37.1848599Z .loc 1 51 80 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:80 2026-02-21T08:29:37.1848889Z // begin inline asm 2026-02-21T08:29:37.1849095Z cp.async.cg.shared.global [ %r358 + 0 ], [ %rd51 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1849331Z // end inline asm 
2026-02-21T08:29:37.1849465Z // begin inline asm 2026-02-21T08:29:37.1849674Z cp.async.cg.shared.global [ %r360 + 0 ], [ %rd52 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1849893Z // end inline asm 2026-02-21T08:29:37.1850035Z // begin inline asm 2026-02-21T08:29:37.1850235Z cp.async.cg.shared.global [ %r362 + 0 ], [ %rd53 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1850454Z // end inline asm 2026-02-21T08:29:37.1850594Z // begin inline asm 2026-02-21T08:29:37.1850785Z cp.async.cg.shared.global [ %r364 + 0 ], [ %rd54 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1851008Z // end inline asm 2026-02-21T08:29:37.1851142Z // begin inline asm 2026-02-21T08:29:37.1851337Z cp.async.cg.shared.global [ %r366 + 0 ], [ %rd55 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1851582Z // end inline asm 2026-02-21T08:29:37.1851724Z // begin inline asm 2026-02-21T08:29:37.1851918Z cp.async.cg.shared.global [ %r368 + 0 ], [ %rd56 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1852136Z // end inline asm 2026-02-21T08:29:37.1852276Z // begin inline asm 2026-02-21T08:29:37.1852465Z cp.async.cg.shared.global [ %r370 + 0 ], [ %rd57 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1852689Z // end inline asm 2026-02-21T08:29:37.1852821Z // begin inline asm 2026-02-21T08:29:37.1853020Z cp.async.cg.shared.global [ %r372 + 0 ], [ %rd58 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1853237Z // end inline asm 2026-02-21T08:29:37.1853386Z cp.async.commit_group; 2026-02-21T08:29:37.1853647Z .loc 1 57 62 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:62 2026-02-21T08:29:37.1853943Z add.s32 %r258, %r53, %r24; 2026-02-21T08:29:37.1854217Z .loc 1 57 34 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:34 2026-02-21T08:29:37.1854505Z cvt.s64.s32 %rd72, %r258; 2026-02-21T08:29:37.1854673Z add.s64 %rd59, %rd31, %rd72; 2026-02-21T08:29:37.1854828Z mov.b32 %r158, 8; 2026-02-21T08:29:37.1855079Z .loc 1 57 87 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:87 2026-02-21T08:29:37.1855356Z // begin inline asm 
2026-02-21T08:29:37.1855561Z cp.async.ca.shared.global [ %r374 + 0 ], [ %rd59 + 0 ], 0x8, %r158; 2026-02-21T08:29:37.1855789Z // end inline asm 2026-02-21T08:29:37.1855930Z cp.async.commit_group; 2026-02-21T08:29:37.1856193Z .loc 1 51 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:32 2026-02-21T08:29:37.1856520Z add.s64 %rd60, %rd20, 256; 2026-02-21T08:29:37.1856695Z cvt.u64.u32 %rd73, %r27; 2026-02-21T08:29:37.1856862Z or.b64 %rd74, %rd19, %rd73; 2026-02-21T08:29:37.1857036Z shl.b64 %rd75, %rd74, 1; 2026-02-21T08:29:37.1857202Z add.s64 %rd76, %rd30, %rd75; 2026-02-21T08:29:37.1857381Z add.s64 %rd61, %rd76, 16384; 2026-02-21T08:29:37.1857553Z add.s64 %rd62, %rd76, 32768; 2026-02-21T08:29:37.1857719Z add.s64 %rd63, %rd76, 49152; 2026-02-21T08:29:37.1857893Z add.s64 %rd64, %rd76, 65536; 2026-02-21T08:29:37.1858056Z add.s64 %rd65, %rd76, 81920; 2026-02-21T08:29:37.1858231Z add.s64 %rd66, %rd76, 98304; 2026-02-21T08:29:37.1858400Z add.s64 %rd67, %rd76, 114688; 2026-02-21T08:29:37.1858690Z .loc 1 51 80 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:80 2026-02-21T08:29:37.1858983Z bar.sync 0; 2026-02-21T08:29:37.1859209Z // begin inline asm 2026-02-21T08:29:37.1859428Z cp.async.cg.shared.global [ %r159 + 0 ], [ %rd60 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1859661Z // end inline asm 2026-02-21T08:29:37.1859808Z // begin inline asm 2026-02-21T08:29:37.1860010Z cp.async.cg.shared.global [ %r161 + 0 ], [ %rd61 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1860246Z // end inline asm 2026-02-21T08:29:37.1860386Z // begin inline asm 2026-02-21T08:29:37.1860591Z cp.async.cg.shared.global [ %r163 + 0 ], [ %rd62 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1860815Z // end inline asm 2026-02-21T08:29:37.1860964Z // begin inline asm 2026-02-21T08:29:37.1861172Z cp.async.cg.shared.global [ %r165 + 0 ], [ %rd63 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1861397Z // end inline asm 2026-02-21T08:29:37.1861576Z // begin inline asm 2026-02-21T08:29:37.1861777Z 
cp.async.cg.shared.global [ %r167 + 0 ], [ %rd64 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1862011Z // end inline asm 2026-02-21T08:29:37.1862150Z // begin inline asm 2026-02-21T08:29:37.1862361Z cp.async.cg.shared.global [ %r169 + 0 ], [ %rd65 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1862590Z // end inline asm 2026-02-21T08:29:37.1862738Z // begin inline asm 2026-02-21T08:29:37.1862938Z cp.async.cg.shared.global [ %r171 + 0 ], [ %rd66 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1863171Z // end inline asm 2026-02-21T08:29:37.1863319Z // begin inline asm 2026-02-21T08:29:37.1863519Z cp.async.cg.shared.global [ %r173 + 0 ], [ %rd67 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1863754Z // end inline asm 2026-02-21T08:29:37.1863901Z cp.async.commit_group; 2026-02-21T08:29:37.1864177Z .loc 1 57 34 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:34 2026-02-21T08:29:37.1864470Z cvt.s64.s32 %rd77, %r53; 2026-02-21T08:29:37.1864641Z cvt.u64.u32 %rd78, %r24; 2026-02-21T08:29:37.1864806Z add.s64 %rd79, %rd78, %rd77; 2026-02-21T08:29:37.1864976Z add.s64 %rd21, %rd31, %rd79; 2026-02-21T08:29:37.1865147Z add.s64 %rd68, %rd21, 524288; 2026-02-21T08:29:37.1865419Z .loc 1 57 87 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:87 2026-02-21T08:29:37.1865706Z // begin inline asm 2026-02-21T08:29:37.1865900Z cp.async.ca.shared.global [ %r175 + 0 ], [ %rd68 + 0 ], 0x8, %r158; 2026-02-21T08:29:37.1866124Z // end inline asm 2026-02-21T08:29:37.1866263Z cp.async.commit_group; 2026-02-21T08:29:37.1866525Z .loc 1 51 80 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:80 2026-02-21T08:29:37.1866819Z cp.async.wait_group 2; 2026-02-21T08:29:37.1866970Z bar.sync 0; 2026-02-21T08:29:37.1867144Z ld.shared.v4.b32 {%r259, %r260, %r261, %r262}, [%r38]; 2026-02-21T08:29:37.1867348Z mov.b32 {%rs8, %rs9}, %r262; 2026-02-21T08:29:37.1867515Z mov.b32 {%rs10, %rs11}, %r261; 2026-02-21T08:29:37.1867682Z mov.b32 {%rs12, %rs13}, %r260; 2026-02-21T08:29:37.1867847Z mov.b32 {%rs14, 
%rs15}, %r259; 2026-02-21T08:29:37.1868040Z ld.shared.v4.b32 {%r263, %r264, %r265, %r266}, [%r38+16]; 2026-02-21T08:29:37.1868255Z mov.b32 {%rs16, %rs17}, %r266; 2026-02-21T08:29:37.1868474Z mov.b32 {%rs18, %rs19}, %r265; 2026-02-21T08:29:37.1868629Z mov.b32 {%rs20, %rs21}, %r264; 2026-02-21T08:29:37.1868789Z mov.b32 {%rs22, %rs23}, %r263; 2026-02-21T08:29:37.1868977Z ld.shared.v4.b32 {%r267, %r268, %r269, %r270}, [%r38+32]; 2026-02-21T08:29:37.1869186Z mov.b32 {%rs24, %rs25}, %r270; 2026-02-21T08:29:37.1869341Z mov.b32 {%rs26, %rs27}, %r269; 2026-02-21T08:29:37.1869501Z mov.b32 {%rs28, %rs29}, %r268; 2026-02-21T08:29:37.1869653Z mov.b32 {%rs30, %rs31}, %r267; 2026-02-21T08:29:37.1869844Z ld.shared.v4.b32 {%r271, %r272, %r273, %r274}, [%r38+48]; 2026-02-21T08:29:37.1870042Z mov.b32 {%rs32, %rs33}, %r274; 2026-02-21T08:29:37.1870201Z mov.b32 {%rs34, %rs35}, %r273; 2026-02-21T08:29:37.1870360Z mov.b32 {%rs36, %rs37}, %r272; 2026-02-21T08:29:37.1870512Z mov.b32 {%rs38, %rs39}, %r271; 2026-02-21T08:29:37.1870703Z ld.shared.v4.b32 {%r275, %r276, %r277, %r278}, [%r38+64]; 2026-02-21T08:29:37.1870953Z mov.b32 {%rs40, %rs41}, %r278; 2026-02-21T08:29:37.1871119Z mov.b32 {%rs42, %rs43}, %r277; 2026-02-21T08:29:37.1871272Z mov.b32 {%rs44, %rs45}, %r276; 2026-02-21T08:29:37.1871431Z mov.b32 {%rs46, %rs47}, %r275; 2026-02-21T08:29:37.1871648Z ld.shared.v4.b32 {%r279, %r280, %r281, %r282}, [%r38+80]; 2026-02-21T08:29:37.1871853Z mov.b32 {%rs48, %rs49}, %r282; 2026-02-21T08:29:37.1872016Z mov.b32 {%rs50, %rs51}, %r281; 2026-02-21T08:29:37.1872172Z mov.b32 {%rs52, %rs53}, %r280; 2026-02-21T08:29:37.1872334Z mov.b32 {%rs54, %rs55}, %r279; 2026-02-21T08:29:37.1872518Z ld.shared.v4.b32 {%r283, %r284, %r285, %r286}, [%r38+96]; 2026-02-21T08:29:37.1872721Z mov.b32 {%rs56, %rs57}, %r286; 2026-02-21T08:29:37.1872874Z mov.b32 {%rs58, %rs59}, %r285; 2026-02-21T08:29:37.1873034Z mov.b32 {%rs60, %rs61}, %r284; 2026-02-21T08:29:37.1873187Z mov.b32 {%rs62, %rs63}, %r283; 
2026-02-21T08:29:37.1873383Z ld.shared.v4.b32 {%r287, %r288, %r289, %r290}, [%r38+112]; 2026-02-21T08:29:37.1873599Z mov.b32 {%rs64, %rs65}, %r290; 2026-02-21T08:29:37.1873759Z mov.b32 {%rs66, %rs67}, %r289; 2026-02-21T08:29:37.1873923Z mov.b32 {%rs68, %rs69}, %r288; 2026-02-21T08:29:37.1874078Z mov.b32 {%rs70, %rs71}, %r287; 2026-02-21T08:29:37.1874349Z .loc 1 55 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:55:32 2026-02-21T08:29:37.1874641Z cvt.f32.bf16 %r178, %rs14; 2026-02-21T08:29:37.1874810Z cvt.f32.bf16 %r179, %rs15; 2026-02-21T08:29:37.1874962Z cvt.f32.bf16 %r180, %rs12; 2026-02-21T08:29:37.1875120Z cvt.f32.bf16 %r181, %rs13; 2026-02-21T08:29:37.1875279Z cvt.f32.bf16 %r182, %rs10; 2026-02-21T08:29:37.1875427Z cvt.f32.bf16 %r183, %rs11; 2026-02-21T08:29:37.1875584Z cvt.f32.bf16 %r184, %rs8; 2026-02-21T08:29:37.1875735Z cvt.f32.bf16 %r185, %rs9; 2026-02-21T08:29:37.1875891Z cvt.f32.bf16 %r186, %rs22; 2026-02-21T08:29:37.1876039Z cvt.f32.bf16 %r187, %rs23; 2026-02-21T08:29:37.1876197Z cvt.f32.bf16 %r188, %rs20; 2026-02-21T08:29:37.1876345Z cvt.f32.bf16 %r189, %rs21; 2026-02-21T08:29:37.1876504Z cvt.f32.bf16 %r190, %rs18; 2026-02-21T08:29:37.1876656Z cvt.f32.bf16 %r191, %rs19; 2026-02-21T08:29:37.1876814Z cvt.f32.bf16 %r192, %rs16; 2026-02-21T08:29:37.1876970Z cvt.f32.bf16 %r193, %rs17; 2026-02-21T08:29:37.1877119Z cvt.f32.bf16 %r195, %rs30; 2026-02-21T08:29:37.1877273Z cvt.f32.bf16 %r196, %rs31; 2026-02-21T08:29:37.1877423Z cvt.f32.bf16 %r197, %rs28; 2026-02-21T08:29:37.1877578Z cvt.f32.bf16 %r198, %rs29; 2026-02-21T08:29:37.1877727Z cvt.f32.bf16 %r199, %rs26; 2026-02-21T08:29:37.1877882Z cvt.f32.bf16 %r200, %rs27; 2026-02-21T08:29:37.1878029Z cvt.f32.bf16 %r201, %rs24; 2026-02-21T08:29:37.1878184Z cvt.f32.bf16 %r202, %rs25; 2026-02-21T08:29:37.1878332Z cvt.f32.bf16 %r203, %rs38; 2026-02-21T08:29:37.1878486Z cvt.f32.bf16 %r204, %rs39; 2026-02-21T08:29:37.1878640Z cvt.f32.bf16 %r205, %rs36; 2026-02-21T08:29:37.1878789Z cvt.f32.bf16 %r206, 
%rs37; 2026-02-21T08:29:37.1878943Z cvt.f32.bf16 %r207, %rs34; 2026-02-21T08:29:37.1879093Z cvt.f32.bf16 %r208, %rs35; 2026-02-21T08:29:37.1879302Z cvt.f32.bf16 %r209, %rs32; 2026-02-21T08:29:37.1879451Z cvt.f32.bf16 %r210, %rs33; 2026-02-21T08:29:37.1879606Z cvt.f32.bf16 %r212, %rs46; 2026-02-21T08:29:37.1879753Z cvt.f32.bf16 %r213, %rs47; 2026-02-21T08:29:37.1879906Z cvt.f32.bf16 %r214, %rs44; 2026-02-21T08:29:37.1880055Z cvt.f32.bf16 %r215, %rs45; 2026-02-21T08:29:37.1880210Z cvt.f32.bf16 %r216, %rs42; 2026-02-21T08:29:37.1880365Z cvt.f32.bf16 %r217, %rs43; 2026-02-21T08:29:37.1880514Z cvt.f32.bf16 %r218, %rs40; 2026-02-21T08:29:37.1880670Z cvt.f32.bf16 %r219, %rs41; 2026-02-21T08:29:37.1880817Z cvt.f32.bf16 %r220, %rs54; 2026-02-21T08:29:37.1880972Z cvt.f32.bf16 %r221, %rs55; 2026-02-21T08:29:37.1881121Z cvt.f32.bf16 %r222, %rs52; 2026-02-21T08:29:37.1881277Z cvt.f32.bf16 %r223, %rs53; 2026-02-21T08:29:37.1881423Z cvt.f32.bf16 %r224, %rs50; 2026-02-21T08:29:37.1881607Z cvt.f32.bf16 %r225, %rs51; 2026-02-21T08:29:37.1881760Z cvt.f32.bf16 %r226, %rs48; 2026-02-21T08:29:37.1881971Z cvt.f32.bf16 %r227, %rs49; 2026-02-21T08:29:37.1882136Z cvt.f32.bf16 %r229, %rs62; 2026-02-21T08:29:37.1882294Z cvt.f32.bf16 %r230, %rs63; 2026-02-21T08:29:37.1882461Z cvt.f32.bf16 %r231, %rs60; 2026-02-21T08:29:37.1882627Z cvt.f32.bf16 %r232, %rs61; 2026-02-21T08:29:37.1882796Z cvt.f32.bf16 %r233, %rs58; 2026-02-21T08:29:37.1882952Z cvt.f32.bf16 %r234, %rs59; 2026-02-21T08:29:37.1883115Z cvt.f32.bf16 %r235, %rs56; 2026-02-21T08:29:37.1883272Z cvt.f32.bf16 %r236, %rs57; 2026-02-21T08:29:37.1883435Z cvt.f32.bf16 %r237, %rs70; 2026-02-21T08:29:37.1883588Z cvt.f32.bf16 %r238, %rs71; 2026-02-21T08:29:37.1883751Z cvt.f32.bf16 %r239, %rs68; 2026-02-21T08:29:37.1883910Z cvt.f32.bf16 %r240, %rs69; 2026-02-21T08:29:37.1884063Z cvt.f32.bf16 %r241, %rs66; 2026-02-21T08:29:37.1884224Z cvt.f32.bf16 %r242, %rs67; 2026-02-21T08:29:37.1884377Z cvt.f32.bf16 %r243, %rs64; 2026-02-21T08:29:37.1884538Z 
cvt.f32.bf16 %r244, %rs65; 2026-02-21T08:29:37.1884686Z $L__tmp12: 2026-02-21T08:29:37.1884990Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1885330Z add.s32 %r177, %r255, %r308; 2026-02-21T08:29:37.1885497Z // begin inline asm 2026-02-21T08:29:37.1885886Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r177 + 0], 64, {%r178, %r179, %r180, %r181, %r182, %r183, %r184, %r185, %r186, %r187, %r188, %r189, %r190, %r191, %r192, %r193}; 2026-02-21T08:29:37.1886281Z // end inline asm 2026-02-21T08:29:37.1886428Z // begin inline asm 2026-02-21T08:29:37.1886797Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r177 + 16], 64, {%r195, %r196, %r197, %r198, %r199, %r200, %r201, %r202, %r203, %r204, %r205, %r206, %r207, %r208, %r209, %r210}; 2026-02-21T08:29:37.1887202Z // end inline asm 2026-02-21T08:29:37.1887349Z // begin inline asm 2026-02-21T08:29:37.1887712Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r177 + 32], 64, {%r212, %r213, %r214, %r215, %r216, %r217, %r218, %r219, %r220, %r221, %r222, %r223, %r224, %r225, %r226, %r227}; 2026-02-21T08:29:37.1888116Z // end inline asm 2026-02-21T08:29:37.1888251Z // begin inline asm 2026-02-21T08:29:37.1888614Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r177 + 48], 64, {%r229, %r230, %r231, %r232, %r233, %r234, %r235, %r236, %r237, %r238, %r239, %r240, %r241, %r242, %r243, %r244}; 2026-02-21T08:29:37.1888998Z // end inline asm 2026-02-21T08:29:37.1889142Z // begin inline asm 2026-02-21T08:29:37.1889303Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:29:37.1889469Z // end inline asm 2026-02-21T08:29:37.1889610Z bar.sync 0; 2026-02-21T08:29:37.1889743Z $L__tmp13: 2026-02-21T08:29:37.1889994Z .loc 1 57 87 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:87 2026-02-21T08:29:37.1890286Z ld.shared.b8 %rs72, [%r40]; 2026-02-21T08:29:37.1890464Z ld.shared.b8 %rs73, [%r40+64]; 2026-02-21T08:29:37.1890635Z ld.shared.b8 %rs74, [%r40+128]; 
2026-02-21T08:29:37.1890817Z ld.shared.b8 %rs75, [%r40+192]; 2026-02-21T08:29:37.1890991Z ld.shared.b8 %rs76, [%r40+256]; 2026-02-21T08:29:37.1891223Z ld.shared.b8 %rs77, [%r40+320]; 2026-02-21T08:29:37.1891390Z ld.shared.b8 %rs78, [%r40+384]; 2026-02-21T08:29:37.1891578Z ld.shared.b8 %rs79, [%r40+448]; 2026-02-21T08:29:37.1891749Z ld.shared.b8 %rs80, [%r40+512]; 2026-02-21T08:29:37.1891912Z ld.shared.b8 %rs81, [%r40+576]; 2026-02-21T08:29:37.1892080Z ld.shared.b8 %rs82, [%r40+640]; 2026-02-21T08:29:37.1892242Z ld.shared.b8 %rs83, [%r40+704]; 2026-02-21T08:29:37.1892415Z ld.shared.b8 %rs84, [%r40+768]; 2026-02-21T08:29:37.1892585Z ld.shared.b8 %rs85, [%r40+832]; 2026-02-21T08:29:37.1892760Z ld.shared.b8 %rs86, [%r40+896]; 2026-02-21T08:29:37.1892938Z ld.shared.b8 %rs87, [%r40+960]; 2026-02-21T08:29:37.1893211Z .loc 1 60 28 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:60:28 2026-02-21T08:29:37.1893510Z shl.b16 %rs88, %rs72, 4; 2026-02-21T08:29:37.1893664Z shl.b16 %rs89, %rs73, 4; 2026-02-21T08:29:37.1893880Z shl.b16 %rs90, %rs74, 4; 2026-02-21T08:29:37.1894035Z shl.b16 %rs91, %rs75, 4; 2026-02-21T08:29:37.1894189Z shl.b16 %rs92, %rs76, 4; 2026-02-21T08:29:37.1894335Z shl.b16 %rs93, %rs77, 4; 2026-02-21T08:29:37.1894488Z shl.b16 %rs94, %rs78, 4; 2026-02-21T08:29:37.1894641Z shl.b16 %rs95, %rs79, 4; 2026-02-21T08:29:37.1894786Z shl.b16 %rs96, %rs80, 4; 2026-02-21T08:29:37.1894940Z shl.b16 %rs97, %rs81, 4; 2026-02-21T08:29:37.1895085Z shl.b16 %rs98, %rs82, 4; 2026-02-21T08:29:37.1895237Z shl.b16 %rs99, %rs83, 4; 2026-02-21T08:29:37.1895385Z shl.b16 %rs100, %rs84, 4; 2026-02-21T08:29:37.1895544Z shl.b16 %rs101, %rs85, 4; 2026-02-21T08:29:37.1895696Z shl.b16 %rs102, %rs86, 4; 2026-02-21T08:29:37.1895858Z shl.b16 %rs103, %rs87, 4; 2026-02-21T08:29:37.1896115Z .loc 1 75 58 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:75:58 2026-02-21T08:29:37.1896415Z selp.b16 %rs104, %rs88, %rs72, %p11; 2026-02-21T08:29:37.1896603Z cvt.s16.s8 %rs105, %rs104; 
2026-02-21T08:29:37.1896764Z shr.s16 %rs106, %rs105, 4; 2026-02-21T08:29:37.1896937Z selp.b16 %rs107, %rs89, %rs73, %p11; 2026-02-21T08:29:37.1897106Z cvt.s16.s8 %rs108, %rs107; 2026-02-21T08:29:37.1897267Z shr.s16 %rs109, %rs108, 4; 2026-02-21T08:29:37.1897423Z selp.b16 %rs110, %rs90, %rs74, %p11; 2026-02-21T08:29:37.1897598Z cvt.s16.s8 %rs111, %rs110; 2026-02-21T08:29:37.1897746Z shr.s16 %rs112, %rs111, 4; 2026-02-21T08:29:37.1897910Z selp.b16 %rs113, %rs91, %rs75, %p11; 2026-02-21T08:29:37.1898082Z cvt.s16.s8 %rs114, %rs113; 2026-02-21T08:29:37.1898231Z shr.s16 %rs115, %rs114, 4; 2026-02-21T08:29:37.1898398Z selp.b16 %rs116, %rs92, %rs76, %p11; 2026-02-21T08:29:37.1898562Z cvt.s16.s8 %rs117, %rs116; 2026-02-21T08:29:37.1898718Z shr.s16 %rs118, %rs117, 4; 2026-02-21T08:29:37.1898875Z selp.b16 %rs119, %rs93, %rs77, %p11; 2026-02-21T08:29:37.1899046Z cvt.s16.s8 %rs120, %rs119; 2026-02-21T08:29:37.1899194Z shr.s16 %rs121, %rs120, 4; 2026-02-21T08:29:37.1899357Z selp.b16 %rs122, %rs94, %rs78, %p11; 2026-02-21T08:29:37.1899526Z cvt.s16.s8 %rs123, %rs122; 2026-02-21T08:29:37.1899683Z shr.s16 %rs124, %rs123, 4; 2026-02-21T08:29:37.1899845Z selp.b16 %rs125, %rs95, %rs79, %p11; 2026-02-21T08:29:37.1900030Z cvt.s16.s8 %rs126, %rs125; 2026-02-21T08:29:37.1900194Z shr.s16 %rs127, %rs126, 4; 2026-02-21T08:29:37.1900359Z selp.b16 %rs128, %rs96, %rs80, %p11; 2026-02-21T08:29:37.1900540Z cvt.s16.s8 %rs129, %rs128; 2026-02-21T08:29:37.1900695Z shr.s16 %rs130, %rs129, 4; 2026-02-21T08:29:37.1900865Z selp.b16 %rs131, %rs97, %rs81, %p11; 2026-02-21T08:29:37.1901038Z cvt.s16.s8 %rs132, %rs131; 2026-02-21T08:29:37.1901206Z shr.s16 %rs133, %rs132, 4; 2026-02-21T08:29:37.1901377Z selp.b16 %rs134, %rs98, %rs82, %p11; 2026-02-21T08:29:37.1901582Z cvt.s16.s8 %rs135, %rs134; 2026-02-21T08:29:37.1901758Z shr.s16 %rs136, %rs135, 4; 2026-02-21T08:29:37.1901921Z selp.b16 %rs137, %rs99, %rs83, %p11; 2026-02-21T08:29:37.1902106Z cvt.s16.s8 %rs138, %rs137; 2026-02-21T08:29:37.1902265Z shr.s16 
%rs139, %rs138, 4; 2026-02-21T08:29:37.1902506Z selp.b16 %rs140, %rs100, %rs84, %p11; 2026-02-21T08:29:37.1902690Z cvt.s16.s8 %rs141, %rs140; 2026-02-21T08:29:37.1902865Z shr.s16 %rs142, %rs141, 4; 2026-02-21T08:29:37.1903034Z selp.b16 %rs143, %rs101, %rs85, %p11; 2026-02-21T08:29:37.1903225Z cvt.s16.s8 %rs144, %rs143; 2026-02-21T08:29:37.1903388Z shr.s16 %rs145, %rs144, 4; 2026-02-21T08:29:37.1903552Z selp.b16 %rs146, %rs102, %rs86, %p11; 2026-02-21T08:29:37.1903738Z cvt.s16.s8 %rs147, %rs146; 2026-02-21T08:29:37.1903894Z shr.s16 %rs148, %rs147, 4; 2026-02-21T08:29:37.1904064Z selp.b16 %rs149, %rs103, %rs87, %p11; 2026-02-21T08:29:37.1904239Z cvt.s16.s8 %rs150, %rs149; 2026-02-21T08:29:37.1904402Z shr.s16 %rs151, %rs150, 4; 2026-02-21T08:29:37.1904670Z .loc 1 80 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:80:32 2026-02-21T08:29:37.1904974Z cvt.rn.f32.s16 %r291, %rs106; 2026-02-21T08:29:37.1905150Z cvt.rn.f32.s16 %r292, %rs109; 2026-02-21T08:29:37.1905399Z cvt.rn.f32.s16 %r293, %rs112; 2026-02-21T08:29:37.1905578Z cvt.rn.f32.s16 %r294, %rs115; 2026-02-21T08:29:37.1905742Z cvt.rn.f32.s16 %r295, %rs118; 2026-02-21T08:29:37.1905911Z cvt.rn.f32.s16 %r296, %rs121; 2026-02-21T08:29:37.1906072Z cvt.rn.f32.s16 %r297, %rs124; 2026-02-21T08:29:37.1906240Z cvt.rn.f32.s16 %r298, %rs127; 2026-02-21T08:29:37.1906401Z cvt.rn.f32.s16 %r299, %rs130; 2026-02-21T08:29:37.1906572Z cvt.rn.f32.s16 %r300, %rs133; 2026-02-21T08:29:37.1906741Z cvt.rn.f32.s16 %r301, %rs136; 2026-02-21T08:29:37.1906901Z cvt.rn.f32.s16 %r302, %rs139; 2026-02-21T08:29:37.1907072Z cvt.rn.f32.s16 %r303, %rs142; 2026-02-21T08:29:37.1907234Z cvt.rn.f32.s16 %r304, %rs145; 2026-02-21T08:29:37.1907401Z cvt.rn.f32.s16 %r305, %rs148; 2026-02-21T08:29:37.1907562Z cvt.rn.f32.s16 %r306, %rs151; 2026-02-21T08:29:37.1907745Z st.shared.b32 [%r41], %r291; 2026-02-21T08:29:37.1907910Z st.shared.b32 [%r41+2048], %r295; 2026-02-21T08:29:37.1908093Z st.shared.b32 [%r41+4096], %r299; 2026-02-21T08:29:37.1908268Z 
st.shared.b32 [%r41+6144], %r303; 2026-02-21T08:29:37.1908443Z st.shared.b32 [%r42], %r292; 2026-02-21T08:29:37.1908608Z st.shared.b32 [%r42+2048], %r296; 2026-02-21T08:29:37.1908774Z st.shared.b32 [%r42+4096], %r300; 2026-02-21T08:29:37.1908949Z st.shared.b32 [%r42+6144], %r304; 2026-02-21T08:29:37.1909115Z st.shared.b32 [%r43], %r293; 2026-02-21T08:29:37.1909281Z st.shared.b32 [%r43+2048], %r297; 2026-02-21T08:29:37.1909445Z st.shared.b32 [%r43+4096], %r301; 2026-02-21T08:29:37.1909613Z st.shared.b32 [%r43+6144], %r305; 2026-02-21T08:29:37.1909774Z st.shared.b32 [%r44], %r294; 2026-02-21T08:29:37.1909940Z st.shared.b32 [%r44+2048], %r298; 2026-02-21T08:29:37.1910111Z st.shared.b32 [%r44+4096], %r302; 2026-02-21T08:29:37.1910279Z st.shared.b32 [%r44+6144], %r306; 2026-02-21T08:29:37.1910449Z $L__tmp14: 2026-02-21T08:29:37.1910748Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1911093Z // begin inline asm 2026-02-21T08:29:37.1911255Z fence.proxy.async.shared::cta; 2026-02-21T08:29:37.1911429Z // end inline asm 2026-02-21T08:29:37.1911599Z bar.sync 0; 2026-02-21T08:29:37.1911746Z setp.ne.b32 %p12, %r54, 0; 2026-02-21T08:29:37.1911912Z @%p12 bra $L__BB0_4; 2026-02-21T08:29:37.1912099Z // %bb.3: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:29:37.1912327Z elect.sync %r355|%p14, -1; 2026-02-21T08:29:37.1912487Z mov.b32 %r309, 67373328; 2026-02-21T08:29:37.1912647Z mov.pred %p13, 0; 2026-02-21T08:29:37.1912786Z // begin inline asm 2026-02-21T08:29:37.1913034Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 0 ], %rd80, %r309, %p13; 2026-02-21T08:29:37.1913299Z // end inline asm 2026-02-21T08:29:37.1913444Z // begin inline asm 2026-02-21T08:29:37.1913685Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 8 ], %rd81, %r309, %p15; 2026-02-21T08:29:37.1913945Z // end inline asm 2026-02-21T08:29:37.1914145Z // begin inline asm 2026-02-21T08:29:37.1914372Z @%p14 
tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 16 ], %rd82, %r309, %p15; 2026-02-21T08:29:37.1914638Z // end inline asm 2026-02-21T08:29:37.1914772Z // begin inline asm 2026-02-21T08:29:37.1914998Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 24 ], %rd83, %r309, %p15; 2026-02-21T08:29:37.1915257Z // end inline asm 2026-02-21T08:29:37.1915389Z // begin inline asm 2026-02-21T08:29:37.1915618Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 32 ], %rd84, %r309, %p15; 2026-02-21T08:29:37.1915869Z // end inline asm 2026-02-21T08:29:37.1916007Z // begin inline asm 2026-02-21T08:29:37.1916227Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 40 ], %rd85, %r309, %p15; 2026-02-21T08:29:37.1916488Z // end inline asm 2026-02-21T08:29:37.1916622Z // begin inline asm 2026-02-21T08:29:37.1916901Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 48 ], %rd86, %r309, %p15; 2026-02-21T08:29:37.1917165Z // end inline asm 2026-02-21T08:29:37.1917295Z // begin inline asm 2026-02-21T08:29:37.1917520Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 56 ], %rd87, %r309, %p15; 2026-02-21T08:29:37.1917774Z // end inline asm 2026-02-21T08:29:37.1917916Z // begin inline asm 2026-02-21T08:29:37.1918135Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 64 ], %rd88, %r309, %p15; 2026-02-21T08:29:37.1918398Z // end inline asm 2026-02-21T08:29:37.1918536Z // begin inline asm 2026-02-21T08:29:37.1918757Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 72 ], %rd89, %r309, %p15; 2026-02-21T08:29:37.1919018Z // end inline asm 2026-02-21T08:29:37.1919151Z // begin inline asm 2026-02-21T08:29:37.1919379Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 80 ], %rd90, %r309, %p15; 2026-02-21T08:29:37.1919635Z // end inline asm 2026-02-21T08:29:37.1919777Z // begin inline asm 2026-02-21T08:29:37.1920012Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ 
%r612 + 0 ], [ %r308 + 88 ], %rd91, %r309, %p15; 2026-02-21T08:29:37.1920269Z // end inline asm 2026-02-21T08:29:37.1920410Z // begin inline asm 2026-02-21T08:29:37.1920633Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 96 ], %rd92, %r309, %p15; 2026-02-21T08:29:37.1920903Z // end inline asm 2026-02-21T08:29:37.1921036Z // begin inline asm 2026-02-21T08:29:37.1921272Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 104 ], %rd93, %r309, %p15; 2026-02-21T08:29:37.1921567Z // end inline asm 2026-02-21T08:29:37.1921704Z // begin inline asm 2026-02-21T08:29:37.1921939Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 112 ], %rd94, %r309, %p15; 2026-02-21T08:29:37.1922195Z // end inline asm 2026-02-21T08:29:37.1922336Z // begin inline asm 2026-02-21T08:29:37.1922559Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 120 ], %rd95, %r309, %p15; 2026-02-21T08:29:37.1922823Z // end inline asm 2026-02-21T08:29:37.1922965Z add.s32 %r357, %r72, 43008; 2026-02-21T08:29:37.1923132Z cvt.u64.u32 %rd96, %r357; 2026-02-21T08:29:37.1923293Z // begin inline asm 2026-02-21T08:29:37.1923497Z @%p14 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd96]; 2026-02-21T08:29:37.1923734Z // end inline asm 2026-02-21T08:29:37.1923865Z $L__tmp15: 2026-02-21T08:29:37.1924045Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:29:37.1924362Z .loc 1 0 0 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:0 2026-02-21T08:29:37.1924650Z or.b32 %r51, %r251, %r9; 2026-02-21T08:29:37.1924922Z .loc 1 51 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:32 2026-02-21T08:29:37.1925211Z add.s64 %rd97, %rd20, 512; 2026-02-21T08:29:37.1925377Z cvt.u64.u32 %rd107, %r45; 2026-02-21T08:29:37.1925536Z add.s64 %rd108, %rd19, %rd107; 2026-02-21T08:29:37.1925708Z shl.b64 %rd109, %rd108, 1; 2026-02-21T08:29:37.1925914Z add.s64 %rd110, %rd30, %rd109; 2026-02-21T08:29:37.1926083Z add.s64 %rd98, %rd110, 16384; 
2026-02-21T08:29:37.1926241Z add.s64 %rd99, %rd110, 32768; 2026-02-21T08:29:37.1926404Z add.s64 %rd100, %rd110, 49152; 2026-02-21T08:29:37.1926567Z add.s64 %rd101, %rd110, 65536; 2026-02-21T08:29:37.1926724Z add.s64 %rd102, %rd110, 81920; 2026-02-21T08:29:37.1926883Z add.s64 %rd103, %rd110, 98304; 2026-02-21T08:29:37.1927041Z add.s64 %rd104, %rd110, 114688; 2026-02-21T08:29:37.1927314Z .loc 1 51 80 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:80 2026-02-21T08:29:37.1927590Z // begin inline asm 2026-02-21T08:29:37.1927802Z cp.async.cg.shared.global [ %r358 + 0 ], [ %rd97 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1928028Z // end inline asm 2026-02-21T08:29:37.1928171Z // begin inline asm 2026-02-21T08:29:37.1928374Z cp.async.cg.shared.global [ %r360 + 0 ], [ %rd98 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1928642Z // end inline asm 2026-02-21T08:29:37.1928790Z // begin inline asm 2026-02-21T08:29:37.1928984Z cp.async.cg.shared.global [ %r362 + 0 ], [ %rd99 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1929212Z // end inline asm 2026-02-21T08:29:37.1929348Z // begin inline asm 2026-02-21T08:29:37.1929552Z cp.async.cg.shared.global [ %r364 + 0 ], [ %rd100 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1929774Z // end inline asm 2026-02-21T08:29:37.1929912Z // begin inline asm 2026-02-21T08:29:37.1930111Z cp.async.cg.shared.global [ %r366 + 0 ], [ %rd101 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1930334Z // end inline asm 2026-02-21T08:29:37.1930478Z // begin inline asm 2026-02-21T08:29:37.1930673Z cp.async.cg.shared.global [ %r368 + 0 ], [ %rd102 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1930907Z // end inline asm 2026-02-21T08:29:37.1931040Z // begin inline asm 2026-02-21T08:29:37.1931239Z cp.async.cg.shared.global [ %r370 + 0 ], [ %rd103 + 0 ], 0x10, %r359; 2026-02-21T08:29:37.1931459Z // end inline asm 2026-02-21T08:29:37.1931629Z // begin inline asm 2026-02-21T08:29:37.1931825Z cp.async.cg.shared.global [ %r372 + 0 ], [ %rd104 + 0 ], 0x10, %r359; 
2026-02-21T08:29:37.1932053Z // end inline asm 2026-02-21T08:29:37.1932200Z cp.async.commit_group; 2026-02-21T08:29:37.1932460Z .loc 1 57 34 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:34 2026-02-21T08:29:37.1932757Z add.s64 %rd105, %rd21, 1048576; 2026-02-21T08:29:37.1933027Z .loc 1 57 87 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:87 2026-02-21T08:29:37.1933309Z // begin inline asm 2026-02-21T08:29:37.1933504Z cp.async.ca.shared.global [ %r374 + 0 ], [ %rd105 + 0 ], 0x8, %r158; 2026-02-21T08:29:37.1933733Z // end inline asm 2026-02-21T08:29:37.1933882Z cp.async.commit_group; 2026-02-21T08:29:37.1934141Z .loc 1 43 70 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:43:70 2026-02-21T08:29:37.1934438Z add.s32 %r379, %r7, %r49; 2026-02-21T08:29:37.1934599Z add.s32 %r380, %r379, %r50; 2026-02-21T08:29:37.1934769Z shl.b32 %r381, %r380, 10; 2026-02-21T08:29:37.1934928Z mad.wide.s32 %rd156, %r381, 2, %rd17; 2026-02-21T08:29:37.1935109Z add.s32 %r614, %r47, %r381; 2026-02-21T08:29:37.1935268Z add.s32 %r382, %r10, %r52; 2026-02-21T08:29:37.1935430Z cvt.s64.s32 %rd111, %r382; 2026-02-21T08:29:37.1935594Z add.s64 %rd155, %rd18, %rd111; 2026-02-21T08:29:37.1935750Z mov.b32 %r618, 1; 2026-02-21T08:29:37.1935895Z mov.b64 %rd157, -64; 2026-02-21T08:29:37.1936038Z mov.b32 %r617, %r615; 2026-02-21T08:29:37.1936187Z mov.b32 %r619, %r615; 2026-02-21T08:29:37.1936327Z bra.uni $L__BB0_5; 2026-02-21T08:29:37.1936514Z $L__BB0_7: // in Loop: Header=BB0_5 Depth=2 2026-02-21T08:29:37.1936833Z .loc 1 43 70 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:43:70 2026-02-21T08:29:37.1937127Z add.s64 %rd157, %rd157, 64; 2026-02-21T08:29:37.1937296Z setp.lt.u64 %p86, %rd157, 320; 2026-02-21T08:29:37.1937458Z $L__tmp16: 2026-02-21T08:29:37.1937807Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1938135Z add.s32 %r578, %r618, 1; 2026-02-21T08:29:37.1938298Z 
setp.gt.s32 %p87, %r578, 1; 2026-02-21T08:29:37.1938460Z selp.b32 %r618, 0, %r578, %p87; 2026-02-21T08:29:37.1938634Z selp.b32 %r579, 1, 0, %p87; 2026-02-21T08:29:37.1938787Z xor.b32 %r69, %r619, %r579; 2026-02-21T08:29:37.1938940Z $L__tmp17: 2026-02-21T08:29:37.1939178Z .loc 1 51 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:32 2026-02-21T08:29:37.1939470Z add.s64 %rd129, %rd156, -98304; 2026-02-21T08:29:37.1939654Z add.s64 %rd130, %rd156, -81920; 2026-02-21T08:29:37.1939814Z add.s64 %rd131, %rd156, -65536; 2026-02-21T08:29:37.1939981Z add.s64 %rd132, %rd156, -49152; 2026-02-21T08:29:37.1940141Z add.s64 %rd133, %rd156, -32768; 2026-02-21T08:29:37.1940357Z add.s64 %rd134, %rd156, -16384; 2026-02-21T08:29:37.1940531Z mad.wide.s32 %rd136, %r614, 2, %rd30; 2026-02-21T08:29:37.1940828Z .loc 1 51 80 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:80 2026-02-21T08:29:37.1941124Z add.s32 %r560, %r65, %r14; 2026-02-21T08:29:37.1941285Z selp.b32 %r561, 16, 0, %p86; 2026-02-21T08:29:37.1941451Z // begin inline asm 2026-02-21T08:29:37.1941682Z cp.async.cg.shared.global [ %r560 + 0 ], [ %rd129 + 0 ], 0x10, %r561; 2026-02-21T08:29:37.1941913Z // end inline asm 2026-02-21T08:29:37.1942052Z add.s32 %r562, %r560, 2048; 2026-02-21T08:29:37.1942213Z // begin inline asm 2026-02-21T08:29:37.1942414Z cp.async.cg.shared.global [ %r562 + 0 ], [ %rd130 + 0 ], 0x10, %r561; 2026-02-21T08:29:37.1942643Z // end inline asm 2026-02-21T08:29:37.1942788Z add.s32 %r564, %r560, 4096; 2026-02-21T08:29:37.1942939Z // begin inline asm 2026-02-21T08:29:37.1943139Z cp.async.cg.shared.global [ %r564 + 0 ], [ %rd131 + 0 ], 0x10, %r561; 2026-02-21T08:29:37.1943362Z // end inline asm 2026-02-21T08:29:37.1943508Z add.s32 %r566, %r560, 6144; 2026-02-21T08:29:37.1943661Z // begin inline asm 2026-02-21T08:29:37.1943861Z cp.async.cg.shared.global [ %r566 + 0 ], [ %rd132 + 0 ], 0x10, %r561; 2026-02-21T08:29:37.1944099Z // end inline asm 2026-02-21T08:29:37.1944251Z add.s32 %r568, 
%r560, 8192; 2026-02-21T08:29:37.1944409Z // begin inline asm 2026-02-21T08:29:37.1944619Z cp.async.cg.shared.global [ %r568 + 0 ], [ %rd133 + 0 ], 0x10, %r561; 2026-02-21T08:29:37.1944855Z // end inline asm 2026-02-21T08:29:37.1945000Z add.s32 %r570, %r560, 10240; 2026-02-21T08:29:37.1945171Z // begin inline asm 2026-02-21T08:29:37.1945375Z cp.async.cg.shared.global [ %r570 + 0 ], [ %rd134 + 0 ], 0x10, %r561; 2026-02-21T08:29:37.1945612Z // end inline asm 2026-02-21T08:29:37.1945755Z add.s32 %r572, %r560, 12288; 2026-02-21T08:29:37.1945922Z // begin inline asm 2026-02-21T08:29:37.1946121Z cp.async.cg.shared.global [ %r572 + 0 ], [ %rd156 + 0 ], 0x10, %r561; 2026-02-21T08:29:37.1946359Z // end inline asm 2026-02-21T08:29:37.1946510Z add.s32 %r574, %r560, 14336; 2026-02-21T08:29:37.1946668Z // begin inline asm 2026-02-21T08:29:37.1946877Z cp.async.cg.shared.global [ %r574 + 0 ], [ %rd136 + 0 ], 0x10, %r561; 2026-02-21T08:29:37.1947104Z // end inline asm 2026-02-21T08:29:37.1947258Z cp.async.commit_group; 2026-02-21T08:29:37.1947525Z .loc 1 57 87 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:87 2026-02-21T08:29:37.1947828Z add.s32 %r576, %r66, %r25; 2026-02-21T08:29:37.1947994Z selp.b32 %r577, 8, 0, %p86; 2026-02-21T08:29:37.1948162Z // begin inline asm 2026-02-21T08:29:37.1948371Z cp.async.ca.shared.global [ %r576 + 0 ], [ %rd155 + 0 ], 0x8, %r577; 2026-02-21T08:29:37.1948604Z // end inline asm 2026-02-21T08:29:37.1948759Z cp.async.commit_group; 2026-02-21T08:29:37.1949030Z .loc 1 43 70 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:43:70 2026-02-21T08:29:37.1949335Z add.s64 %rd156, %rd156, 256; 2026-02-21T08:29:37.1949586Z add.s32 %r614, %r614, 128; 2026-02-21T08:29:37.1949759Z add.s64 %rd155, %rd155, 524288; 2026-02-21T08:29:37.1949935Z setp.lt.u64 %p88, %rd157, 384; 2026-02-21T08:29:37.1950113Z mov.b32 %r615, %r619; 2026-02-21T08:29:37.1950271Z mov.b32 %r619, %r69; 2026-02-21T08:29:37.1950426Z @%p88 bra $L__BB0_5; 
2026-02-21T08:29:37.1950588Z bra.uni $L__BB0_8; 2026-02-21T08:29:37.1950781Z $L__BB0_5: // Parent Loop BB0_2 Depth=1 2026-02-21T08:29:37.1951083Z // => This Inner Loop Header: Depth=2 2026-02-21T08:29:37.1951298Z add.s32 %r454, %r617, 1; 2026-02-21T08:29:37.1951472Z setp.gt.s32 %p52, %r454, 1; 2026-02-21T08:29:37.1951664Z selp.b32 %r617, 0, %r454, %p52; 2026-02-21T08:29:37.1951944Z .loc 1 51 80 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:51:80 2026-02-21T08:29:37.1952242Z cp.async.wait_group 2; 2026-02-21T08:29:37.1952392Z bar.sync 0; 2026-02-21T08:29:37.1952581Z shl.b32 %r455, %r617, 14; 2026-02-21T08:29:37.1952739Z add.s32 %r65, %r72, %r455; 2026-02-21T08:29:37.1952904Z add.s32 %r457, %r65, %r37; 2026-02-21T08:29:37.1953093Z ld.shared.v4.b32 {%r458, %r459, %r460, %r461}, [%r457]; 2026-02-21T08:29:37.1953312Z mov.b32 {%rs152, %rs153}, %r461; 2026-02-21T08:29:37.1953484Z mov.b32 {%rs154, %rs155}, %r460; 2026-02-21T08:29:37.1953660Z mov.b32 {%rs156, %rs157}, %r459; 2026-02-21T08:29:37.1953830Z mov.b32 {%rs158, %rs159}, %r458; 2026-02-21T08:29:37.1954027Z ld.shared.v4.b32 {%r462, %r463, %r464, %r465}, [%r457+16]; 2026-02-21T08:29:37.1954242Z mov.b32 {%rs160, %rs161}, %r465; 2026-02-21T08:29:37.1954403Z mov.b32 {%rs162, %rs163}, %r464; 2026-02-21T08:29:37.1954567Z mov.b32 {%rs164, %rs165}, %r463; 2026-02-21T08:29:37.1954727Z mov.b32 {%rs166, %rs167}, %r462; 2026-02-21T08:29:37.1954927Z ld.shared.v4.b32 {%r466, %r467, %r468, %r469}, [%r457+32]; 2026-02-21T08:29:37.1955129Z mov.b32 {%rs168, %rs169}, %r469; 2026-02-21T08:29:37.1955298Z mov.b32 {%rs170, %rs171}, %r468; 2026-02-21T08:29:37.1955466Z mov.b32 {%rs172, %rs173}, %r467; 2026-02-21T08:29:37.1955624Z mov.b32 {%rs174, %rs175}, %r466; 2026-02-21T08:29:37.1955822Z ld.shared.v4.b32 {%r470, %r471, %r472, %r473}, [%r457+48]; 2026-02-21T08:29:37.1956024Z mov.b32 {%rs176, %rs177}, %r473; 2026-02-21T08:29:37.1956189Z mov.b32 {%rs178, %rs179}, %r472; 2026-02-21T08:29:37.1956346Z mov.b32 {%rs180, %rs181}, 
%r471; 2026-02-21T08:29:37.1956511Z mov.b32 {%rs182, %rs183}, %r470; 2026-02-21T08:29:37.1956701Z ld.shared.v4.b32 {%r474, %r475, %r476, %r477}, [%r457+64]; 2026-02-21T08:29:37.1956909Z mov.b32 {%rs184, %rs185}, %r477; 2026-02-21T08:29:37.1957072Z mov.b32 {%rs186, %rs187}, %r476; 2026-02-21T08:29:37.1957229Z mov.b32 {%rs188, %rs189}, %r475; 2026-02-21T08:29:37.1957395Z mov.b32 {%rs190, %rs191}, %r474; 2026-02-21T08:29:37.1957584Z ld.shared.v4.b32 {%r478, %r479, %r480, %r481}, [%r457+80]; 2026-02-21T08:29:37.1957790Z mov.b32 {%rs192, %rs193}, %r481; 2026-02-21T08:29:37.1957950Z mov.b32 {%rs194, %rs195}, %r480; 2026-02-21T08:29:37.1958116Z mov.b32 {%rs196, %rs197}, %r479; 2026-02-21T08:29:37.1958273Z mov.b32 {%rs198, %rs199}, %r478; 2026-02-21T08:29:37.1958472Z ld.shared.v4.b32 {%r482, %r483, %r484, %r485}, [%r457+96]; 2026-02-21T08:29:37.1958679Z mov.b32 {%rs200, %rs201}, %r485; 2026-02-21T08:29:37.1958836Z mov.b32 {%rs202, %rs203}, %r484; 2026-02-21T08:29:37.1959002Z mov.b32 {%rs204, %rs205}, %r483; 2026-02-21T08:29:37.1959160Z mov.b32 {%rs206, %rs207}, %r482; 2026-02-21T08:29:37.1959361Z ld.shared.v4.b32 {%r486, %r487, %r488, %r489}, [%r457+112]; 2026-02-21T08:29:37.1959566Z mov.b32 {%rs208, %rs209}, %r489; 2026-02-21T08:29:37.1959734Z mov.b32 {%rs210, %rs211}, %r488; 2026-02-21T08:29:37.1959893Z mov.b32 {%rs212, %rs213}, %r487; 2026-02-21T08:29:37.1960068Z mov.b32 {%rs214, %rs215}, %r486; 2026-02-21T08:29:37.1960340Z .loc 1 55 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:55:32 2026-02-21T08:29:37.1960636Z cvt.f32.bf16 %r387, %rs158; 2026-02-21T08:29:37.1960858Z cvt.f32.bf16 %r388, %rs159; 2026-02-21T08:29:37.1961014Z cvt.f32.bf16 %r389, %rs156; 2026-02-21T08:29:37.1961174Z cvt.f32.bf16 %r390, %rs157; 2026-02-21T08:29:37.1961327Z cvt.f32.bf16 %r391, %rs154; 2026-02-21T08:29:37.1961484Z cvt.f32.bf16 %r392, %rs155; 2026-02-21T08:29:37.1961657Z cvt.f32.bf16 %r393, %rs152; 2026-02-21T08:29:37.1961817Z cvt.f32.bf16 %r394, %rs153; 
2026-02-21T08:29:37.1961976Z cvt.f32.bf16 %r395, %rs166; 2026-02-21T08:29:37.1962129Z cvt.f32.bf16 %r396, %rs167; 2026-02-21T08:29:37.1962289Z cvt.f32.bf16 %r397, %rs164; 2026-02-21T08:29:37.1962439Z cvt.f32.bf16 %r398, %rs165; 2026-02-21T08:29:37.1962597Z cvt.f32.bf16 %r399, %rs162; 2026-02-21T08:29:37.1962748Z cvt.f32.bf16 %r400, %rs163; 2026-02-21T08:29:37.1962905Z cvt.f32.bf16 %r401, %rs160; 2026-02-21T08:29:37.1963057Z cvt.f32.bf16 %r402, %rs161; 2026-02-21T08:29:37.1963217Z cvt.f32.bf16 %r404, %rs174; 2026-02-21T08:29:37.1963421Z cvt.f32.bf16 %r405, %rs175; 2026-02-21T08:29:37.1963584Z cvt.f32.bf16 %r406, %rs172; 2026-02-21T08:29:37.1963741Z cvt.f32.bf16 %r407, %rs173; 2026-02-21T08:29:37.1963890Z cvt.f32.bf16 %r408, %rs170; 2026-02-21T08:29:37.1964047Z cvt.f32.bf16 %r409, %rs171; 2026-02-21T08:29:37.1964194Z cvt.f32.bf16 %r410, %rs168; 2026-02-21T08:29:37.1964349Z cvt.f32.bf16 %r411, %rs169; 2026-02-21T08:29:37.1964497Z cvt.f32.bf16 %r412, %rs182; 2026-02-21T08:29:37.1964654Z cvt.f32.bf16 %r413, %rs183; 2026-02-21T08:29:37.1964802Z cvt.f32.bf16 %r414, %rs180; 2026-02-21T08:29:37.1964957Z cvt.f32.bf16 %r415, %rs181; 2026-02-21T08:29:37.1965111Z cvt.f32.bf16 %r416, %rs178; 2026-02-21T08:29:37.1965260Z cvt.f32.bf16 %r417, %rs179; 2026-02-21T08:29:37.1965414Z cvt.f32.bf16 %r418, %rs176; 2026-02-21T08:29:37.1965563Z cvt.f32.bf16 %r419, %rs177; 2026-02-21T08:29:37.1965720Z cvt.f32.bf16 %r421, %rs190; 2026-02-21T08:29:37.1965868Z cvt.f32.bf16 %r422, %rs191; 2026-02-21T08:29:37.1966025Z cvt.f32.bf16 %r423, %rs188; 2026-02-21T08:29:37.1966176Z cvt.f32.bf16 %r424, %rs189; 2026-02-21T08:29:37.1966336Z cvt.f32.bf16 %r425, %rs186; 2026-02-21T08:29:37.1966484Z cvt.f32.bf16 %r426, %rs187; 2026-02-21T08:29:37.1966641Z cvt.f32.bf16 %r427, %rs184; 2026-02-21T08:29:37.1966797Z cvt.f32.bf16 %r428, %rs185; 2026-02-21T08:29:37.1966947Z cvt.f32.bf16 %r429, %rs198; 2026-02-21T08:29:37.1967104Z cvt.f32.bf16 %r430, %rs199; 2026-02-21T08:29:37.1967253Z cvt.f32.bf16 %r431, %rs196; 
2026-02-21T08:29:37.1967410Z cvt.f32.bf16 %r432, %rs197; 2026-02-21T08:29:37.1967560Z cvt.f32.bf16 %r433, %rs194; 2026-02-21T08:29:37.1967721Z cvt.f32.bf16 %r434, %rs195; 2026-02-21T08:29:37.1967873Z cvt.f32.bf16 %r435, %rs192; 2026-02-21T08:29:37.1968033Z cvt.f32.bf16 %r436, %rs193; 2026-02-21T08:29:37.1968186Z cvt.f32.bf16 %r438, %rs206; 2026-02-21T08:29:37.1968343Z cvt.f32.bf16 %r439, %rs207; 2026-02-21T08:29:37.1968499Z cvt.f32.bf16 %r440, %rs204; 2026-02-21T08:29:37.1968649Z cvt.f32.bf16 %r441, %rs205; 2026-02-21T08:29:37.1968807Z cvt.f32.bf16 %r442, %rs202; 2026-02-21T08:29:37.1968960Z cvt.f32.bf16 %r443, %rs203; 2026-02-21T08:29:37.1969118Z cvt.f32.bf16 %r444, %rs200; 2026-02-21T08:29:37.1969270Z cvt.f32.bf16 %r445, %rs201; 2026-02-21T08:29:37.1969427Z cvt.f32.bf16 %r446, %rs214; 2026-02-21T08:29:37.1969575Z cvt.f32.bf16 %r447, %rs215; 2026-02-21T08:29:37.1969733Z cvt.f32.bf16 %r448, %rs212; 2026-02-21T08:29:37.1969891Z cvt.f32.bf16 %r449, %rs213; 2026-02-21T08:29:37.1970041Z cvt.f32.bf16 %r450, %rs210; 2026-02-21T08:29:37.1970197Z cvt.f32.bf16 %r451, %rs211; 2026-02-21T08:29:37.1970347Z cvt.f32.bf16 %r452, %rs208; 2026-02-21T08:29:37.1970503Z cvt.f32.bf16 %r453, %rs209; 2026-02-21T08:29:37.1970649Z $L__tmp18: 2026-02-21T08:29:37.1970945Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1971275Z // begin inline asm 2026-02-21T08:29:37.1971419Z 2026-02-21T08:29:37.1971471Z { 2026-02-21T08:29:37.1971563Z .reg .pred complete; 2026-02-21T08:29:37.1971686Z waitLoop: 2026-02-21T08:29:37.1971805Z mbarrier.try_wait.parity.shared.b64 complete, [%r616], %r615; 2026-02-21T08:29:37.1971872Z @!complete bra.uni waitLoop; 2026-02-21T08:29:37.1971929Z } 2026-02-21T08:29:37.1971934Z 2026-02-21T08:29:37.1971990Z // end inline asm 2026-02-21T08:29:37.1972050Z mov.pred %p53, -1; 2026-02-21T08:29:37.1972114Z // begin inline asm 2026-02-21T08:29:37.1972408Z @%p53 tcgen05.st.sync.aligned.16x32bx2.x16.b32 
[%r177 + 0], 64, {%r387, %r388, %r389, %r390, %r391, %r392, %r393, %r394, %r395, %r396, %r397, %r398, %r399, %r400, %r401, %r402}; 2026-02-21T08:29:37.1972466Z // end inline asm 2026-02-21T08:29:37.1972523Z // begin inline asm 2026-02-21T08:29:37.1972814Z @%p53 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r177 + 16], 64, {%r404, %r405, %r406, %r407, %r408, %r409, %r410, %r411, %r412, %r413, %r414, %r415, %r416, %r417, %r418, %r419}; 2026-02-21T08:29:37.1972872Z // end inline asm 2026-02-21T08:29:37.1972985Z // begin inline asm 2026-02-21T08:29:37.1973272Z @%p53 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r177 + 32], 64, {%r421, %r422, %r423, %r424, %r425, %r426, %r427, %r428, %r429, %r430, %r431, %r432, %r433, %r434, %r435, %r436}; 2026-02-21T08:29:37.1973327Z // end inline asm 2026-02-21T08:29:37.1973382Z // begin inline asm 2026-02-21T08:29:37.1973661Z @%p53 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r177 + 48], 64, {%r438, %r439, %r440, %r441, %r442, %r443, %r444, %r445, %r446, %r447, %r448, %r449, %r450, %r451, %r452, %r453}; 2026-02-21T08:29:37.1973715Z // end inline asm 2026-02-21T08:29:37.1973770Z // begin inline asm 2026-02-21T08:29:37.1973850Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:29:37.1973905Z // end inline asm 2026-02-21T08:29:37.1973959Z bar.sync 0; 2026-02-21T08:29:37.1974011Z $L__tmp19: 2026-02-21T08:29:37.1974190Z .loc 1 57 87 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:57:87 2026-02-21T08:29:37.1974250Z shl.b32 %r490, %r617, 10; 2026-02-21T08:29:37.1974311Z add.s32 %r491, %r72, %r490; 2026-02-21T08:29:37.1974379Z add.s32 %r66, %r491, 40960; 2026-02-21T08:29:37.1974438Z add.s32 %r492, %r66, %r39; 2026-02-21T08:29:37.1974502Z ld.shared.b8 %rs216, [%r492]; 2026-02-21T08:29:37.1974566Z ld.shared.b8 %rs217, [%r492+64]; 2026-02-21T08:29:37.1974639Z ld.shared.b8 %rs218, [%r492+128]; 2026-02-21T08:29:37.1974702Z ld.shared.b8 %rs219, [%r492+192]; 2026-02-21T08:29:37.1974764Z ld.shared.b8 %rs220, [%r492+256]; 2026-02-21T08:29:37.1974831Z 
ld.shared.b8 %rs221, [%r492+320]; 2026-02-21T08:29:37.1974890Z ld.shared.b8 %rs222, [%r492+384]; 2026-02-21T08:29:37.1974949Z ld.shared.b8 %rs223, [%r492+448]; 2026-02-21T08:29:37.1975015Z ld.shared.b8 %rs224, [%r492+512]; 2026-02-21T08:29:37.1975074Z ld.shared.b8 %rs225, [%r492+576]; 2026-02-21T08:29:37.1975133Z ld.shared.b8 %rs226, [%r492+640]; 2026-02-21T08:29:37.1975192Z ld.shared.b8 %rs227, [%r492+704]; 2026-02-21T08:29:37.1975258Z ld.shared.b8 %rs228, [%r492+768]; 2026-02-21T08:29:37.1975320Z ld.shared.b8 %rs229, [%r492+832]; 2026-02-21T08:29:37.1975382Z ld.shared.b8 %rs230, [%r492+896]; 2026-02-21T08:29:37.1975449Z ld.shared.b8 %rs231, [%r492+960]; 2026-02-21T08:29:37.1975616Z .loc 1 60 28 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:60:28 2026-02-21T08:29:37.1975677Z shl.b16 %rs232, %rs216, 4; 2026-02-21T08:29:37.1975743Z shl.b16 %rs233, %rs217, 4; 2026-02-21T08:29:37.1975801Z shl.b16 %rs234, %rs218, 4; 2026-02-21T08:29:37.1975859Z shl.b16 %rs235, %rs219, 4; 2026-02-21T08:29:37.1975917Z shl.b16 %rs236, %rs220, 4; 2026-02-21T08:29:37.1975982Z shl.b16 %rs237, %rs221, 4; 2026-02-21T08:29:37.1976038Z shl.b16 %rs238, %rs222, 4; 2026-02-21T08:29:37.1976097Z shl.b16 %rs239, %rs223, 4; 2026-02-21T08:29:37.1976162Z shl.b16 %rs240, %rs224, 4; 2026-02-21T08:29:37.1976219Z shl.b16 %rs241, %rs225, 4; 2026-02-21T08:29:37.1976277Z shl.b16 %rs242, %rs226, 4; 2026-02-21T08:29:37.1976334Z shl.b16 %rs243, %rs227, 4; 2026-02-21T08:29:37.1976402Z shl.b16 %rs244, %rs228, 4; 2026-02-21T08:29:37.1976508Z shl.b16 %rs245, %rs229, 4; 2026-02-21T08:29:37.1976568Z shl.b16 %rs246, %rs230, 4; 2026-02-21T08:29:37.1976632Z shl.b16 %rs247, %rs231, 4; 2026-02-21T08:29:37.1976799Z .loc 1 75 58 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:75:58 2026-02-21T08:29:37.1976868Z selp.b16 %rs248, %rs232, %rs216, %p11; 2026-02-21T08:29:37.1976926Z cvt.s16.s8 %rs249, %rs248; 2026-02-21T08:29:37.1976990Z shr.s16 %rs250, %rs249, 4; 2026-02-21T08:29:37.1977059Z selp.b16 %rs251, 
%rs233, %rs217, %p11; 2026-02-21T08:29:37.1977116Z cvt.s16.s8 %rs252, %rs251; 2026-02-21T08:29:37.1977179Z shr.s16 %rs253, %rs252, 4; 2026-02-21T08:29:37.1977245Z selp.b16 %rs254, %rs234, %rs218, %p11; 2026-02-21T08:29:37.1977301Z cvt.s16.s8 %rs255, %rs254; 2026-02-21T08:29:37.1977357Z shr.s16 %rs256, %rs255, 4; 2026-02-21T08:29:37.1977428Z selp.b16 %rs257, %rs235, %rs219, %p11; 2026-02-21T08:29:37.1977485Z cvt.s16.s8 %rs258, %rs257; 2026-02-21T08:29:37.1977597Z shr.s16 %rs259, %rs258, 4; 2026-02-21T08:29:37.1977671Z selp.b16 %rs260, %rs236, %rs220, %p11; 2026-02-21T08:29:37.1977730Z cvt.s16.s8 %rs261, %rs260; 2026-02-21T08:29:37.1977787Z shr.s16 %rs262, %rs261, 4; 2026-02-21T08:29:37.1977856Z selp.b16 %rs263, %rs237, %rs221, %p11; 2026-02-21T08:29:37.1977914Z cvt.s16.s8 %rs264, %rs263; 2026-02-21T08:29:37.1977970Z shr.s16 %rs265, %rs264, 4; 2026-02-21T08:29:37.1978033Z selp.b16 %rs266, %rs238, %rs222, %p11; 2026-02-21T08:29:37.1978097Z cvt.s16.s8 %rs267, %rs266; 2026-02-21T08:29:37.1978154Z shr.s16 %rs268, %rs267, 4; 2026-02-21T08:29:37.1978218Z selp.b16 %rs269, %rs239, %rs223, %p11; 2026-02-21T08:29:37.1978280Z cvt.s16.s8 %rs270, %rs269; 2026-02-21T08:29:37.1978336Z shr.s16 %rs271, %rs270, 4; 2026-02-21T08:29:37.1978398Z selp.b16 %rs272, %rs240, %rs224, %p11; 2026-02-21T08:29:37.1978454Z cvt.s16.s8 %rs273, %rs272; 2026-02-21T08:29:37.1978519Z shr.s16 %rs274, %rs273, 4; 2026-02-21T08:29:37.1978585Z selp.b16 %rs275, %rs241, %rs225, %p11; 2026-02-21T08:29:37.1978644Z cvt.s16.s8 %rs276, %rs275; 2026-02-21T08:29:37.1978708Z shr.s16 %rs277, %rs276, 4; 2026-02-21T08:29:37.1978771Z selp.b16 %rs278, %rs242, %rs226, %p11; 2026-02-21T08:29:37.1978829Z cvt.s16.s8 %rs279, %rs278; 2026-02-21T08:29:37.1978886Z shr.s16 %rs280, %rs279, 4; 2026-02-21T08:29:37.1978957Z selp.b16 %rs281, %rs243, %rs227, %p11; 2026-02-21T08:29:37.1979015Z cvt.s16.s8 %rs282, %rs281; 2026-02-21T08:29:37.1979072Z shr.s16 %rs283, %rs282, 4; 2026-02-21T08:29:37.1979144Z selp.b16 %rs284, %rs244, %rs228, 
%p11; 2026-02-21T08:29:37.1979202Z cvt.s16.s8 %rs285, %rs284; 2026-02-21T08:29:37.1979258Z shr.s16 %rs286, %rs285, 4; 2026-02-21T08:29:37.1979321Z selp.b16 %rs287, %rs245, %rs229, %p11; 2026-02-21T08:29:37.1979384Z cvt.s16.s8 %rs288, %rs287; 2026-02-21T08:29:37.1979440Z shr.s16 %rs289, %rs288, 4; 2026-02-21T08:29:37.1979502Z selp.b16 %rs290, %rs246, %rs230, %p11; 2026-02-21T08:29:37.1979566Z cvt.s16.s8 %rs291, %rs290; 2026-02-21T08:29:37.1979625Z shr.s16 %rs292, %rs291, 4; 2026-02-21T08:29:37.1979691Z selp.b16 %rs293, %rs247, %rs231, %p11; 2026-02-21T08:29:37.1979754Z cvt.s16.s8 %rs294, %rs293; 2026-02-21T08:29:37.1979810Z shr.s16 %rs295, %rs294, 4; 2026-02-21T08:29:37.1979973Z .loc 1 80 32 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:80:32 2026-02-21T08:29:37.1980036Z cvt.rn.f32.s16 %r493, %rs250; 2026-02-21T08:29:37.1980102Z cvt.rn.f32.s16 %r494, %rs253; 2026-02-21T08:29:37.1980161Z cvt.rn.f32.s16 %r495, %rs256; 2026-02-21T08:29:37.1980219Z cvt.rn.f32.s16 %r496, %rs259; 2026-02-21T08:29:37.1980282Z cvt.rn.f32.s16 %r497, %rs262; 2026-02-21T08:29:37.1980339Z cvt.rn.f32.s16 %r498, %rs265; 2026-02-21T08:29:37.1980397Z cvt.rn.f32.s16 %r499, %rs268; 2026-02-21T08:29:37.1980453Z cvt.rn.f32.s16 %r500, %rs271; 2026-02-21T08:29:37.1980518Z cvt.rn.f32.s16 %r501, %rs274; 2026-02-21T08:29:37.1980574Z cvt.rn.f32.s16 %r502, %rs277; 2026-02-21T08:29:37.1980634Z cvt.rn.f32.s16 %r503, %rs280; 2026-02-21T08:29:37.1980783Z cvt.rn.f32.s16 %r504, %rs283; 2026-02-21T08:29:37.1980841Z cvt.rn.f32.s16 %r505, %rs286; 2026-02-21T08:29:37.1980899Z cvt.rn.f32.s16 %r506, %rs289; 2026-02-21T08:29:37.1980965Z cvt.rn.f32.s16 %r507, %rs292; 2026-02-21T08:29:37.1981023Z cvt.rn.f32.s16 %r508, %rs295; 2026-02-21T08:29:37.1981086Z st.shared.b32 [%r41], %r493; 2026-02-21T08:29:37.1981150Z st.shared.b32 [%r41+2048], %r497; 2026-02-21T08:29:37.1981220Z st.shared.b32 [%r41+4096], %r501; 2026-02-21T08:29:37.1981280Z st.shared.b32 [%r41+6144], %r505; 2026-02-21T08:29:37.1981342Z st.shared.b32 
[%r42], %r494; 2026-02-21T08:29:37.1981412Z st.shared.b32 [%r42+2048], %r498; 2026-02-21T08:29:37.1981473Z st.shared.b32 [%r42+4096], %r502; 2026-02-21T08:29:37.1981558Z st.shared.b32 [%r42+6144], %r506; 2026-02-21T08:29:37.1981621Z st.shared.b32 [%r43], %r495; 2026-02-21T08:29:37.1981688Z st.shared.b32 [%r43+2048], %r499; 2026-02-21T08:29:37.1981748Z st.shared.b32 [%r43+4096], %r503; 2026-02-21T08:29:37.1981859Z st.shared.b32 [%r43+6144], %r507; 2026-02-21T08:29:37.1981930Z st.shared.b32 [%r44], %r496; 2026-02-21T08:29:37.1981990Z st.shared.b32 [%r44+2048], %r500; 2026-02-21T08:29:37.1982049Z st.shared.b32 [%r44+4096], %r504; 2026-02-21T08:29:37.1982108Z st.shared.b32 [%r44+6144], %r508; 2026-02-21T08:29:37.1982281Z .loc 1 43 70 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:43:70 2026-02-21T08:29:37.1982340Z shl.b32 %r509, %r618, 3; 2026-02-21T08:29:37.1982399Z add.s32 %r510, %r72, %r509; 2026-02-21T08:29:37.1982467Z add.s32 %r616, %r510, 43008; 2026-02-21T08:29:37.1982520Z $L__tmp20: 2026-02-21T08:29:37.1982739Z .loc 2 291 36 // standard.py:291:36 @[ cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:87:40 ] 2026-02-21T08:29:37.1982804Z // begin inline asm 2026-02-21T08:29:37.1982878Z fence.proxy.async.shared::cta; 2026-02-21T08:29:37.1982934Z // end inline asm 2026-02-21T08:29:37.1982990Z bar.sync 0; 2026-02-21T08:29:37.1983064Z @%p12 bra $L__BB0_7; 2026-02-21T08:29:37.1983165Z // %bb.6: // in Loop: Header=BB0_5 Depth=2 2026-02-21T08:29:37.1983231Z elect.sync %r559|%p54, -1; 2026-02-21T08:29:37.1983304Z mov.b32 %r513, 67373328; 2026-02-21T08:29:37.1983361Z // begin inline asm 2026-02-21T08:29:37.1983515Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 0 ], %rd80, %r513, %p53; 2026-02-21T08:29:37.1983579Z // end inline asm 2026-02-21T08:29:37.1983635Z // begin inline asm 2026-02-21T08:29:37.1983780Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 8 ], %rd81, %r513, %p53; 2026-02-21T08:29:37.1983835Z // end 
inline asm 2026-02-21T08:29:37.1983898Z // begin inline asm 2026-02-21T08:29:37.1984043Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 16 ], %rd82, %r513, %p53; 2026-02-21T08:29:37.1984096Z // end inline asm 2026-02-21T08:29:37.1984159Z // begin inline asm 2026-02-21T08:29:37.1984305Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 24 ], %rd83, %r513, %p53; 2026-02-21T08:29:37.1984361Z // end inline asm 2026-02-21T08:29:37.1984424Z // begin inline asm 2026-02-21T08:29:37.1984563Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 32 ], %rd84, %r513, %p53; 2026-02-21T08:29:37.1984617Z // end inline asm 2026-02-21T08:29:37.1984673Z // begin inline asm 2026-02-21T08:29:37.1984821Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 40 ], %rd85, %r513, %p53; 2026-02-21T08:29:37.1984875Z // end inline asm 2026-02-21T08:29:37.1984931Z // begin inline asm 2026-02-21T08:29:37.1985079Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 48 ], %rd86, %r513, %p53; 2026-02-21T08:29:37.1985133Z // end inline asm 2026-02-21T08:29:37.1985189Z // begin inline asm 2026-02-21T08:29:37.1985333Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 56 ], %rd87, %r513, %p53; 2026-02-21T08:29:37.1985389Z // end inline asm 2026-02-21T08:29:37.1985448Z // begin inline asm 2026-02-21T08:29:37.1985639Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 64 ], %rd88, %r513, %p53; 2026-02-21T08:29:37.1985694Z // end inline asm 2026-02-21T08:29:37.1985750Z // begin inline asm 2026-02-21T08:29:37.1985890Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 72 ], %rd89, %r513, %p53; 2026-02-21T08:29:37.1985952Z // end inline asm 2026-02-21T08:29:37.1986008Z // begin inline asm 2026-02-21T08:29:37.1986147Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 80 ], %rd90, %r513, %p53; 2026-02-21T08:29:37.1986210Z // end inline asm 
2026-02-21T08:29:37.1986266Z // begin inline asm 2026-02-21T08:29:37.1986407Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 88 ], %rd91, %r513, %p53; 2026-02-21T08:29:37.1986472Z // end inline asm 2026-02-21T08:29:37.1986530Z // begin inline asm 2026-02-21T08:29:37.1986721Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 96 ], %rd92, %r513, %p53; 2026-02-21T08:29:37.1986790Z // end inline asm 2026-02-21T08:29:37.1986847Z // begin inline asm 2026-02-21T08:29:37.1986998Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 104 ], %rd93, %r513, %p53; 2026-02-21T08:29:37.1987054Z // end inline asm 2026-02-21T08:29:37.1987120Z // begin inline asm 2026-02-21T08:29:37.1987267Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 112 ], %rd94, %r513, %p53; 2026-02-21T08:29:37.1987324Z // end inline asm 2026-02-21T08:29:37.1987388Z // begin inline asm 2026-02-21T08:29:37.1987534Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r612 + 0 ], [ %r308 + 120 ], %rd95, %r513, %p53; 2026-02-21T08:29:37.1987589Z // end inline asm 2026-02-21T08:29:37.1987658Z cvt.u64.u32 %rd128, %r616; 2026-02-21T08:29:37.1987716Z // begin inline asm 2026-02-21T08:29:37.1987849Z @%p54 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd128]; 2026-02-21T08:29:37.1987906Z // end inline asm 2026-02-21T08:29:37.1987972Z bra.uni $L__BB0_7; 2026-02-21T08:29:37.1988028Z $L__tmp21: 2026-02-21T08:29:37.1988113Z $L__BB0_9: // %._crit_edge 2026-02-21T08:29:37.1988294Z .loc 1 22 4 // cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py:22:4 2026-02-21T08:29:37.1988352Z bar.sync 0; 2026-02-21T08:29:37.1988409Z // begin inline asm 2026-02-21T08:29:37.1988536Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r612, 256; 2026-02-21T08:29:37.1988592Z // end inline asm 2026-02-21T08:29:37.1988646Z ret; 2026-02-21T08:29:37.1988700Z $L__tmp22: 2026-02-21T08:29:37.1988764Z $L__func_end0: 2026-02-21T08:29:37.1988851Z // -- End function 
2026-02-21T08:29:37.1988904Z } 2026-02-21T08:29:37.1989117Z .file 1 "/tmp/torchinductor_root/oo/cooxcy47h2qmkggrzkv7bsoq7rv6rskhbaqirvgafyqvsb3no43u.py" 2026-02-21T08:29:37.1989295Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T08:29:37.1989361Z .section .debug_abbrev 2026-02-21T08:29:37.1989415Z { 2026-02-21T08:29:37.1989513Z .b8 1 // Abbreviation Code 2026-02-21T08:29:37.1989603Z .b8 17 // DW_TAG_compile_unit 2026-02-21T08:29:37.1989685Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:29:37.1989776Z .b8 37 // DW_AT_producer 2026-02-21T08:29:37.1989855Z .b8 8 // DW_FORM_string 2026-02-21T08:29:37.1989930Z .b8 19 // DW_AT_language 2026-02-21T08:29:37.1990017Z .b8 5 // DW_FORM_data2 2026-02-21T08:29:37.1990096Z .b8 3 // DW_AT_name 2026-02-21T08:29:37.1990172Z .b8 8 // DW_FORM_string 2026-02-21T08:29:37.1990257Z .b8 16 // DW_AT_stmt_list 2026-02-21T08:29:37.1990336Z .b8 6 // DW_FORM_data4 2026-02-21T08:29:37.1990415Z .b8 27 // DW_AT_comp_dir 2026-02-21T08:29:37.1990543Z .b8 8 // DW_FORM_string 2026-02-21T08:29:37.1990623Z .b8 0 // EOM(1) 2026-02-21T08:29:37.1990694Z .b8 0 // EOM(2) 2026-02-21T08:29:37.1990780Z .b8 2 // Abbreviation Code 2026-02-21T08:29:37.1990873Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:29:37.1990950Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:29:37.1991026Z .b8 3 // DW_AT_name 2026-02-21T08:29:37.1991110Z .b8 8 // DW_FORM_string 2026-02-21T08:29:37.1991189Z .b8 32 // DW_AT_inline 2026-02-21T08:29:37.1991267Z .b8 11 // DW_FORM_data1 2026-02-21T08:29:37.1991339Z .b8 0 // EOM(1) 2026-02-21T08:29:37.1991461Z .b8 0 // EOM(2) 2026-02-21T08:29:37.1991591Z .b8 3 // Abbreviation Code 2026-02-21T08:29:37.1991680Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:29:37.1991772Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:29:37.1991852Z .b8 17 // DW_AT_low_pc 2026-02-21T08:29:37.1991936Z .b8 1 // DW_FORM_addr 2026-02-21T08:29:37.1992024Z .b8 18 // DW_AT_high_pc 2026-02-21T08:29:37.1992099Z .b8 1 // DW_FORM_addr 
2026-02-21T08:29:37.1992188Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:29:37.1992265Z .b8 19 // DW_FORM_ref4 2026-02-21T08:29:37.1992342Z .b8 0 // EOM(1) 2026-02-21T08:29:37.1992412Z .b8 0 // EOM(2) 2026-02-21T08:29:37.1992499Z .b8 4 // Abbreviation Code 2026-02-21T08:29:37.1992603Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T08:29:37.1992680Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:29:37.1992768Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:29:37.1992852Z .b8 19 // DW_FORM_ref4 2026-02-21T08:29:37.1992931Z .b8 17 // DW_AT_low_pc 2026-02-21T08:29:37.1993006Z .b8 1 // DW_FORM_addr 2026-02-21T08:29:37.1993096Z .b8 18 // DW_AT_high_pc 2026-02-21T08:29:37.1993169Z .b8 1 // DW_FORM_addr 2026-02-21T08:29:37.1993252Z .b8 88 // DW_AT_call_file 2026-02-21T08:29:37.1993329Z .b8 11 // DW_FORM_data1 2026-02-21T08:29:37.1993418Z .b8 89 // DW_AT_call_line 2026-02-21T08:29:37.1993496Z .b8 11 // DW_FORM_data1 2026-02-21T08:29:37.1993578Z .b8 87 // DW_AT_call_column 2026-02-21T08:29:37.1993661Z .b8 11 // DW_FORM_data1 2026-02-21T08:29:37.1993733Z .b8 0 // EOM(1) 2026-02-21T08:29:37.1993802Z .b8 0 // EOM(2) 2026-02-21T08:29:37.1993879Z .b8 0 // EOM(3) 2026-02-21T08:29:37.1993932Z } 2026-02-21T08:29:37.1993996Z .section .debug_info 2026-02-21T08:29:37.1994048Z { 2026-02-21T08:29:37.1994142Z .b32 178 // Length of Unit 2026-02-21T08:29:37.1994230Z .b8 2 // DWARF version number 2026-02-21T08:29:37.1994284Z .b8 0 2026-02-21T08:29:37.1994413Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T08:29:37.1994507Z .b8 8 // Address Size (in bytes) 2026-02-21T08:29:37.1994677Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T08:29:37.1994765Z .b8 116 // DW_AT_producer 2026-02-21T08:29:37.1994820Z .b8 114 2026-02-21T08:29:37.1994874Z .b8 105 2026-02-21T08:29:37.1994926Z .b8 116 2026-02-21T08:29:37.1994984Z .b8 111 2026-02-21T08:29:37.1995035Z .b8 110 2026-02-21T08:29:37.1995085Z .b8 0 2026-02-21T08:29:37.1995159Z .b8 2 // DW_AT_language 2026-02-21T08:29:37.1995218Z .b8 0 2026-02-21T08:29:37.1995291Z .b8 99 // DW_AT_name 2026-02-21T08:29:37.1995343Z .b8 111 2026-02-21T08:29:37.1995402Z .b8 111 2026-02-21T08:29:37.1995453Z .b8 120 2026-02-21T08:29:37.1995504Z .b8 99 2026-02-21T08:29:37.1995554Z .b8 121 2026-02-21T08:29:37.1995610Z .b8 52 2026-02-21T08:29:37.1995661Z .b8 55 2026-02-21T08:29:37.1995710Z .b8 104 2026-02-21T08:29:37.1995766Z .b8 50 2026-02-21T08:29:37.1995817Z .b8 113 2026-02-21T08:29:37.1995931Z .b8 109 2026-02-21T08:29:37.1995984Z .b8 107 2026-02-21T08:29:37.1996042Z .b8 103 2026-02-21T08:29:37.1996092Z .b8 103 2026-02-21T08:29:37.1996143Z .b8 114 2026-02-21T08:29:37.1996199Z .b8 122 2026-02-21T08:29:37.1996250Z .b8 107 2026-02-21T08:29:37.1996301Z .b8 118 2026-02-21T08:29:37.1996350Z .b8 55 2026-02-21T08:29:37.1996407Z .b8 98 2026-02-21T08:29:37.1996457Z .b8 115 2026-02-21T08:29:37.1996507Z .b8 111 2026-02-21T08:29:37.1996564Z .b8 113 2026-02-21T08:29:37.1996614Z .b8 55 2026-02-21T08:29:37.1996664Z .b8 114 2026-02-21T08:29:37.1996713Z .b8 118 2026-02-21T08:29:37.1996770Z .b8 54 2026-02-21T08:29:37.1996820Z .b8 114 2026-02-21T08:29:37.1996869Z .b8 115 2026-02-21T08:29:37.1996919Z .b8 107 2026-02-21T08:29:37.1996975Z .b8 104 2026-02-21T08:29:37.1997024Z .b8 98 2026-02-21T08:29:37.1997076Z .b8 97 2026-02-21T08:29:37.1997133Z .b8 113 2026-02-21T08:29:37.1997183Z .b8 105 2026-02-21T08:29:37.1997233Z .b8 114 2026-02-21T08:29:37.1997283Z .b8 118 2026-02-21T08:29:37.1997342Z .b8 103 2026-02-21T08:29:37.1997393Z .b8 97 
2026-02-21T08:29:37.1997445Z .b8 102 2026-02-21T08:29:37.1997504Z .b8 121 2026-02-21T08:29:37.1997554Z .b8 113 2026-02-21T08:29:37.1997604Z .b8 118 2026-02-21T08:29:37.1997654Z .b8 115 2026-02-21T08:29:37.1997711Z .b8 98 2026-02-21T08:29:37.1997761Z .b8 51 2026-02-21T08:29:37.1997812Z .b8 110 2026-02-21T08:29:37.1997864Z .b8 111 2026-02-21T08:29:37.1997922Z .b8 52 2026-02-21T08:29:37.1997973Z .b8 51 2026-02-21T08:29:37.1998024Z .b8 117 2026-02-21T08:29:37.1998083Z .b8 46 2026-02-21T08:29:37.1998134Z .b8 112 2026-02-21T08:29:37.1998186Z .b8 121 2026-02-21T08:29:37.1998237Z .b8 0 2026-02-21T08:29:37.1998340Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T08:29:37.1998414Z .b8 47 // DW_AT_comp_dir 2026-02-21T08:29:37.1998465Z .b8 116 2026-02-21T08:29:37.1998522Z .b8 109 2026-02-21T08:29:37.1998572Z .b8 112 2026-02-21T08:29:37.1998620Z .b8 47 2026-02-21T08:29:37.1998670Z .b8 116 2026-02-21T08:29:37.1998729Z .b8 111 2026-02-21T08:29:37.1998780Z .b8 114 2026-02-21T08:29:37.1998833Z .b8 99 2026-02-21T08:29:37.1998890Z .b8 104 2026-02-21T08:29:37.1998939Z .b8 105 2026-02-21T08:29:37.1998990Z .b8 110 2026-02-21T08:29:37.1999040Z .b8 100 2026-02-21T08:29:37.1999099Z .b8 117 2026-02-21T08:29:37.1999149Z .b8 99 2026-02-21T08:29:37.1999199Z .b8 116 2026-02-21T08:29:37.1999255Z .b8 111 2026-02-21T08:29:37.1999307Z .b8 114 2026-02-21T08:29:37.1999357Z .b8 95 2026-02-21T08:29:37.1999407Z .b8 114 2026-02-21T08:29:37.1999465Z .b8 111 2026-02-21T08:29:37.1999516Z .b8 111 2026-02-21T08:29:37.1999567Z .b8 116 2026-02-21T08:29:37.1999616Z .b8 47 2026-02-21T08:29:37.1999675Z .b8 111 2026-02-21T08:29:37.1999724Z .b8 111 2026-02-21T08:29:37.1999775Z .b8 0 2026-02-21T08:29:37.1999879Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T08:29:37.1999951Z .b8 95 // DW_AT_name 2026-02-21T08:29:37.2000002Z .b8 104 2026-02-21T08:29:37.2000052Z .b8 101 2026-02-21T08:29:37.2000111Z .b8 108 2026-02-21T08:29:37.2000164Z .b8 105 2026-02-21T08:29:37.2000259Z .b8 111 2026-02-21T08:29:37.2000316Z 
.b8 110 2026-02-21T08:29:37.2000367Z .b8 95 2026-02-21T08:29:37.2000418Z .b8 109 2026-02-21T08:29:37.2000468Z .b8 97 2026-02-21T08:29:37.2000526Z .b8 116 2026-02-21T08:29:37.2000575Z .b8 109 2026-02-21T08:29:37.2000625Z .b8 117 2026-02-21T08:29:37.2000682Z .b8 108 2026-02-21T08:29:37.2000732Z .b8 95 2026-02-21T08:29:37.2000781Z .b8 98 2026-02-21T08:29:37.2000831Z .b8 102 2026-02-21T08:29:37.2000888Z .b8 49 2026-02-21T08:29:37.2000938Z .b8 54 2026-02-21T08:29:37.2000988Z .b8 95 2026-02-21T08:29:37.2001038Z .b8 105 2026-02-21T08:29:37.2001094Z .b8 110 2026-02-21T08:29:37.2001144Z .b8 116 2026-02-21T08:29:37.2001193Z .b8 52 2026-02-21T08:29:37.2001250Z .b8 0 2026-02-21T08:29:37.2001323Z .b8 1 // DW_AT_inline 2026-02-21T08:29:37.2001418Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T08:29:37.2001591Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T08:29:37.2001686Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T08:29:37.2001775Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:29:37.2001886Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T08:29:37.2001980Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:29:37.2002059Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T08:29:37.2002140Z .b64 $L__tmp21 // DW_AT_high_pc 2026-02-21T08:29:37.2002222Z .b8 1 // DW_AT_call_file 2026-02-21T08:29:37.2002297Z .b8 87 // DW_AT_call_line 2026-02-21T08:29:37.2002375Z .b8 40 // DW_AT_call_column 2026-02-21T08:29:37.2002462Z .b8 0 // End Of Children Mark 2026-02-21T08:29:37.2002545Z .b8 0 // End Of Children Mark 2026-02-21T08:29:37.2002597Z } 2026-02-21T08:29:37.2002661Z .section .debug_macinfo { } 2026-02-21T08:29:37.2002672Z 2026-02-21T08:29:37.2002746Z ================================================================ 2026-02-21T08:29:37.2002847Z please share the reproducer above with Triton project. 
2026-02-21T08:29:38.6835058Z [65s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=1, num_stages=7, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 0], range_unroll_factors=[2, 0], range_warp_specializes=[None, None]) 2026-02-21T08:29:38.6836312Z Tensor-likes are not close! 2026-02-21T08:29:38.6839340Z 2026-02-21T08:29:38.6839564Z Mismatched elements: 33450159 / 33554432 (99.7%) 2026-02-21T08:29:38.6839943Z Greatest absolute difference: 1408.0 at index (289, 7045) (up to 0.01 allowed) 2026-02-21T08:29:38.6844361Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:29:38.6848711Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:29:38.6850171Z 2026-02-21T08:29:38.7283838Z [65s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 512, 1024], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_stages=7, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]) 2026-02-21T08:29:38.7284920Z Tensor-likes are not close! 2026-02-21T08:29:38.7285039Z 2026-02-21T08:29:38.7285136Z Mismatched elements: 33504323 / 33554432 (99.9%) 2026-02-21T08:29:38.7285414Z Greatest absolute difference: 3072.0 at index (311, 1307) (up to 0.01 allowed) 2026-02-21T08:29:38.7285770Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:29:38.7286290Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:29:38.7286463Z 2026-02-21T08:29:39.0812361Z [65s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 64, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=1, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, False]) 2026-02-21T08:29:39.0813551Z Tensor-likes are not close! 2026-02-21T08:29:39.0818731Z 2026-02-21T08:29:39.0823321Z Mismatched elements: 33528021 / 33554432 (99.9%) 2026-02-21T08:29:39.0827792Z Greatest absolute difference: 3040.0 at index (2844, 4360) (up to 0.01 allowed) 2026-02-21T08:29:39.0829507Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:29:39.0829952Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:29:39.0833713Z 2026-02-21T08:29:42.7935746Z [69s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 128, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=1, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, False], range_num_stages=[4, 1], range_unroll_factors=[3, 0], range_warp_specializes=[None, None]) 2026-02-21T08:29:42.7937703Z Tensor-likes are not close! 
2026-02-21T08:29:42.7937870Z 2026-02-21T08:29:42.7942721Z Mismatched elements: 33553778 / 33554432 (100.0%) 2026-02-21T08:29:42.7944401Z Greatest absolute difference: 2560.0 at index (190, 5930) (up to 0.01 allowed) 2026-02-21T08:29:42.7944833Z Greatest relative difference: 3.03125 at index (44, 6176) (up to 0.01 allowed) 2026-02-21T08:29:42.7949985Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:29:42.7953868Z 2026-02-21T08:29:43.5882833Z [70s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 16], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=4, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T08:29:43.5884052Z Tensor-likes are not close! 2026-02-21T08:29:43.5884182Z 2026-02-21T08:29:43.5884281Z Mismatched elements: 33504803 / 33554432 (99.9%) 2026-02-21T08:29:43.5884567Z Greatest absolute difference: 3120.0 at index (3156, 2791) (up to 0.01 allowed) 2026-02-21T08:29:43.5884962Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:29:43.5885295Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:29:43.5885483Z 2026-02-21T08:29:45.5285977Z [72s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 1024], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:29:45.5287140Z Tensor-likes are not close! 
2026-02-21T08:29:45.5287296Z 2026-02-21T08:29:45.5292945Z Mismatched elements: 33508818 / 33554432 (99.9%) 2026-02-21T08:29:45.5295046Z Greatest absolute difference: 3088.0 at index (1680, 5688) (up to 0.01 allowed) 2026-02-21T08:29:45.5295432Z Greatest relative difference: inf at index (3980, 6366) (up to 0.01 allowed) 2026-02-21T08:29:45.5295766Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:29:45.5296239Z 2026-02-21T08:29:45.5876533Z [72s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T08:29:45.5877662Z Tensor-likes are not close! 2026-02-21T08:29:45.5877807Z 2026-02-21T08:29:45.5877916Z Mismatched elements: 33553778 / 33554432 (100.0%) 2026-02-21T08:29:45.5878221Z Greatest absolute difference: 2560.0 at index (190, 5930) (up to 0.01 allowed) 2026-02-21T08:29:45.5878581Z Greatest relative difference: 3.03125 at index (44, 6176) (up to 0.01 allowed) 2026-02-21T08:29:45.5878905Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:29:45.5879321Z 2026-02-21T08:29:45.7186935Z [72s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 64, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], num_sm_multiplier=1, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]) 2026-02-21T08:29:45.7188015Z Tensor-likes are not close! 
2026-02-21T08:29:45.7188165Z 2026-02-21T08:29:45.7188332Z Mismatched elements: 33482531 / 33554432 (99.8%) 2026-02-21T08:29:45.7188641Z Greatest absolute difference: 2480.0 at index (2394, 3946) (up to 0.01 allowed) 2026-02-21T08:29:45.7193615Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:29:45.7198036Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:29:45.7202181Z 2026-02-21T08:29:46.4398591Z [73s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 64, 16], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=128, num_stages=6, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[2, 2], range_unroll_factors=[0, 1], range_warp_specializes=[False, True]) 2026-02-21T08:29:46.4399693Z Tensor-likes are not close! 2026-02-21T08:29:46.4399822Z 2026-02-21T08:29:46.4399909Z Mismatched elements: 33476523 / 33554432 (99.8%) 2026-02-21T08:29:46.4400200Z Greatest absolute difference: 2448.0 at index (2394, 3930) (up to 0.01 allowed) 2026-02-21T08:29:46.4400545Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:29:46.4400862Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:29:46.4401031Z 2026-02-21T08:29:57.8391605Z [84s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['first', 'first'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=8, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[1, 3], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T08:29:57.8392708Z Tensor-likes are not close! 2026-02-21T08:29:57.8392838Z 2026-02-21T08:29:57.8392924Z Mismatched elements: 33481897 / 33554432 (99.8%) 2026-02-21T08:29:57.8393202Z Greatest absolute difference: 2432.0 at index (1013, 348) (up to 0.01 allowed) 2026-02-21T08:29:57.8393544Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:29:57.8393861Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:29:57.8394025Z 2026-02-21T08:29:58.6638980Z [85s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 256, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]) 2026-02-21T08:29:58.6640411Z Tensor-likes are not close! 2026-02-21T08:29:58.6640543Z 2026-02-21T08:29:58.6640643Z Mismatched elements: 33504064 / 33554432 (99.8%) 2026-02-21T08:29:58.6640951Z Greatest absolute difference: 3168.0 at index (1866, 4858) (up to 0.01 allowed) 2026-02-21T08:29:58.6641337Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:29:58.6641900Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:29:58.6642112Z 2026-02-21T08:30:02.4736795Z [89s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=16, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, False], range_num_stages=[4, 0], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]) 2026-02-21T08:30:02.4738128Z Tensor-likes are not close! 2026-02-21T08:30:02.4739913Z 2026-02-21T08:30:02.4740155Z Mismatched elements: 33553778 / 33554432 (100.0%) 2026-02-21T08:30:02.4740526Z Greatest absolute difference: 2560.0 at index (190, 5930) (up to 0.01 allowed) 2026-02-21T08:30:02.4740874Z Greatest relative difference: 3.03125 at index (44, 6176) (up to 0.01 allowed) 2026-02-21T08:30:02.4741214Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:30:02.4741381Z 2026-02-21T08:30:02.7473380Z [89s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=128, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 0], range_warp_specializes=[None, True]) 2026-02-21T08:30:02.7474572Z Tensor-likes are not close! 
2026-02-21T08:30:02.7480047Z 2026-02-21T08:30:02.7481971Z Mismatched elements: 33528093 / 33554432 (99.9%) 2026-02-21T08:30:02.7486303Z Greatest absolute difference: 3456.0 at index (190, 5946) (up to 0.01 allowed) 2026-02-21T08:30:02.7491268Z Greatest relative difference: inf at index (3980, 6366) (up to 0.01 allowed) 2026-02-21T08:30:02.7492662Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:30:02.7492910Z 2026-02-21T08:30:03.3963334Z [90s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[8], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[1, 3], range_warp_specializes=[None, None]) 2026-02-21T08:30:03.3964411Z Tensor-likes are not close! 2026-02-21T08:30:03.3964532Z 2026-02-21T08:30:03.3964618Z Mismatched elements: 33527996 / 33554432 (99.9%) 2026-02-21T08:30:03.3964904Z Greatest absolute difference: 3280.0 at index (2110, 1608) (up to 0.01 allowed) 2026-02-21T08:30:03.3965239Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:30:03.3965545Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:30:03.3965709Z 2026-02-21T08:30:04.8895115Z [91s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[1], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[1, 0], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]) 2026-02-21T08:30:04.8896570Z Tensor-likes are not close! 
2026-02-21T08:30:04.8896702Z 2026-02-21T08:30:04.8896809Z Mismatched elements: 33528286 / 33554432 (99.9%) 2026-02-21T08:30:04.8897113Z Greatest absolute difference: 3168.0 at index (2110, 1576) (up to 0.01 allowed) 2026-02-21T08:30:04.8897512Z Greatest relative difference: 90177536.0 at index (1456, 6433) (up to 0.01 allowed) 2026-02-21T08:30:04.8897858Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:30:04.8898069Z 2026-02-21T08:30:14.7667857Z [101s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 512], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:30:14.7669015Z Tensor-likes are not close! 2026-02-21T08:30:14.7669165Z 2026-02-21T08:30:14.7669314Z Mismatched elements: 33444620 / 33554432 (99.7%) 2026-02-21T08:30:14.7669614Z Greatest absolute difference: 1312.0 at index (1150, 3790) (up to 0.01 allowed) 2026-02-21T08:30:14.7669980Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:30:14.7671648Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:30:14.7671819Z 2026-02-21T08:30:18.2320279Z 2026-02-21T08:30:18.2325203Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 1.2 configs/s 2026-02-21T08:30:18.2330947Z [104s] Adaptive compile timeout: 30s (90% percentile=4.5s, bounds=[30.0s, 30s]) 2026-02-21T08:30:18.5088808Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 111/111 168.1 configs/s 2026-02-21T08:30:18.9279541Z [105s] Initial random population of 100, 5 starting points: 2026-02-21T08:30:18.9281371Z error=28 2026-02-21T08:30:18.9281702Z ok=72 2026-02-21T08:30:18.9281843Z min=1.7869 2026-02-21T08:30:18.9281981Z mid=39.8162 2026-02-21T08:30:18.9282118Z max=2271.6775 2026-02-21T08:30:18.9282272Z best={'block_sizes': [8, 16, 256], 2026-02-21T08:30:18.9282520Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:30:18.9282739Z 'l2_groupings': [8], 2026-02-21T08:30:18.9282919Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:30:18.9283114Z 'loop_orders': [[1, 0]], 2026-02-21T08:30:18.9283274Z 'maxnreg': 256, 2026-02-21T08:30:18.9283418Z 'num_sm_multiplier': 32, 2026-02-21T08:30:18.9283581Z 'num_stages': 6, 2026-02-21T08:30:18.9283719Z 'num_warps': 8, 2026-02-21T08:30:18.9283879Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:30:18.9284075Z 'range_flattens': [False, None], 2026-02-21T08:30:18.9284281Z 'range_multi_buffers': [True, None], 2026-02-21T08:30:18.9284469Z 'range_num_stages': [3, 0], 2026-02-21T08:30:18.9284635Z 'range_unroll_factors': [0, 0], 2026-02-21T08:30:18.9284824Z 'range_warp_specializes': [True, None]} 2026-02-21T08:30:18.9297661Z [105s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:30:20.2796480Z [107s] Generation 1 starting: 99 neighbors, 5 active search path(s) 2026-02-21T08:30:34.3501386Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 1.3 configs/s 2026-02-21T08:30:36.6295638Z [123s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 16, 256], indexing=['pointer', 
'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=32, num_stages=6, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[3, 0], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:30:36.6297325Z Tensor-likes are not close! 2026-02-21T08:30:36.6301915Z 2026-02-21T08:30:36.6306201Z Mismatched elements: 33338520 / 33554432 (99.4%) 2026-02-21T08:30:36.6311172Z Greatest absolute difference: 776.0 at index (1687, 6957) (up to 0.01 allowed) 2026-02-21T08:30:36.6315459Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:30:36.6319746Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:30:36.6323916Z 2026-02-21T08:30:36.7758232Z [123s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[1, 4], range_warp_specializes=[False, False]) 2026-02-21T08:30:36.7759433Z Tensor-likes are not close! 2026-02-21T08:30:36.7764203Z 2026-02-21T08:30:36.7764419Z Mismatched elements: 33448405 / 33554432 (99.7%) 2026-02-21T08:30:36.7764808Z Greatest absolute difference: 1560.0 at index (1672, 2372) (up to 0.01 allowed) 2026-02-21T08:30:36.7765191Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:30:36.7769069Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:30:36.7772664Z 2026-02-21T08:30:37.1436746Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T08:30:37.1437732Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:30:37.1437993Z ^ 2026-02-21T08:30:37.1438396Z /tmp/torchinductor_root/zr/czrpdx5n2k6c6amnmsaq56vcdglskc252my5k5lpksx2yxyy3r5m.py:86:36: note: called from 2026-02-21T08:30:37.1438863Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T08:30:37.1439071Z ^ 2026-02-21T08:30:37.1439479Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T08:30:37.1439985Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:30:37.1440247Z ^ 2026-02-21T08:30:37.1440416Z module { 2026-02-21T08:30:37.1440947Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:30:37.1441642Z %cst = arith.constant dense<0> : tensor<8x2x16xi8> 2026-02-21T08:30:37.1441863Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:30:37.1442064Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:30:37.1442246Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:37.1442454Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:30:37.1442696Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:30:37.1442919Z %cst_2 = arith.constant dense<4> : tensor<8x16xi8> 2026-02-21T08:30:37.1443155Z %cst_3 = arith.constant dense<8192> : tensor<8x1xi32> 2026-02-21T08:30:37.1443392Z %cst_4 = arith.constant dense<1024> : tensor<512x1xi32> 2026-02-21T08:30:37.1443611Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:30:37.1443834Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<512x16xf32> 
2026-02-21T08:30:37.1444070Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:30:37.1444255Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:30:37.1444434Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:30:37.1444615Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:37.1444797Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:30:37.1444983Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:30:37.1445162Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:37.1445725Z %0 = tt.make_tensor_descriptor %arg2, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:30:37.1446063Z %1 = tt.get_program_id x : i32 2026-02-21T08:30:37.1446252Z %2 = arith.divsi %1, %c2048_i32 : i32 2026-02-21T08:30:37.1446453Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:30:37.1446635Z %4 = arith.subi %c8_i32, %3 : i32 2026-02-21T08:30:37.1446822Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T08:30:37.1447002Z %6 = arith.remsi %1, %c2048_i32 : i32 2026-02-21T08:30:37.1447195Z %7 = arith.remsi %6, %5 : i32 2026-02-21T08:30:37.1447375Z %8 = arith.addi %3, %7 : i32 2026-02-21T08:30:37.1447562Z %9 = arith.divsi %6, %5 : i32 2026-02-21T08:30:37.1447750Z %10 = arith.muli %8, %c512_i32 : i32 2026-02-21T08:30:37.1447989Z %11 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:30:37.1448324Z %12 = tt.splat %10 : i32 -> tensor<512xi32> 2026-02-21T08:30:37.1448534Z %13 = arith.addi %12, %11 : tensor<512xi32> 2026-02-21T08:30:37.1448741Z %14 = arith.muli %9, %c16_i32 : i32 2026-02-21T08:30:37.1448971Z %15 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:30:37.1449234Z %16 = tt.splat %14 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.1449434Z %17 = arith.addi %16, %15 : tensor<16xi32> 2026-02-21T08:30:37.1449631Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:30:37.1449950Z %18 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c32_i32 iter_args(%arg4 = %cst_5) -> (tensor<512x16xf32>) : i32 { 2026-02-21T08:30:37.1450318Z 
%20 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:37.1450570Z %21 = tt.splat %arg3 : i32 -> tensor<8xi32> 2026-02-21T08:30:37.1450773Z %22 = arith.addi %21, %20 : tensor<8xi32> 2026-02-21T08:30:37.1450975Z %23 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:30:37.1451169Z %24 = tt.splat %23 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.1451375Z %25 = arith.addi %24, %15 : tensor<16xi32> 2026-02-21T08:30:37.1451681Z %26 = tt.expand_dims %13 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:30:37.1451959Z %27 = arith.muli %26, %cst_4 : tensor<512x1xi32> 2026-02-21T08:30:37.1452222Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.1452515Z %29 = tt.broadcast %27 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:30:37.1452789Z %30 = tt.broadcast %28 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:30:37.1453035Z %31 = arith.addi %29, %30 : tensor<512x16xi32> 2026-02-21T08:30:37.1453285Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:30:37.1453582Z %33 = tt.addptr %32, %31 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:30:37.1453895Z %34 = tt.load %33 evictionPolicy = evict_last : tensor<512x16x!tt.ptr> 2026-02-21T08:30:37.1454205Z %35 = arith.extf %34 : tensor<512x16xbf16> to tensor<512x16xf32> 2026-02-21T08:30:37.1454489Z %36 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:30:37.1454757Z %37 = arith.muli %36, %cst_3 : tensor<8x1xi32> 2026-02-21T08:30:37.1455017Z %38 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.1455300Z %39 = tt.broadcast %37 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.1455560Z %40 = tt.broadcast %38 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.1455788Z %41 = arith.addi %39, %40 : tensor<8x16xi32> 2026-02-21T08:30:37.1456026Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 
2026-02-21T08:30:37.1456291Z %43 = tt.addptr %42, %41 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:30:37.1456585Z %44 = tt.load %43 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.1456854Z %45 = arith.shli %44, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1457124Z %46 = arith.shrsi %45, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1457341Z %47 = arith.shrsi %44, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1457577Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:30:37.1457905Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:30:37.1458225Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:30:37.1458545Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.1458872Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.1459161Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:30:37.1459414Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.1459773Z %55 = tt.broadcast %51 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.1460066Z %56 = arith.select %54, %55, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.1460350Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:30:37.1460605Z %58 = tt.broadcast %52 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.1460892Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.1461184Z %60 = arith.select %59, %58, %56 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.1461469Z %61 = tt.reshape %60 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:30:37.1461790Z %62 = arith.sitofp %61 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:30:37.1462169Z %63 = tt.dot %35, %62, %arg4, inputPrecision = tf32 : tensor<512x16xf32> * tensor<16x16xf32> 
-> tensor<512x16xf32> 2026-02-21T08:30:37.1462522Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:37.1462722Z %64 = arith.muli %c8_i32, %c1_i32 : i32 2026-02-21T08:30:37.1462930Z %65 = arith.addi %arg3, %64 : i32 2026-02-21T08:30:37.1463169Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:37.1463417Z %67 = tt.splat %65 : i32 -> tensor<8xi32> 2026-02-21T08:30:37.1463649Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T08:30:37.1463851Z %69 = arith.muli %65, %c2_i32 : i32 2026-02-21T08:30:37.1464054Z %70 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.1464261Z %71 = arith.addi %70, %15 : tensor<16xi32> 2026-02-21T08:30:37.1464529Z %72 = tt.expand_dims %13 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:30:37.1464812Z %73 = arith.muli %72, %cst_4 : tensor<512x1xi32> 2026-02-21T08:30:37.1465071Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.1465374Z %75 = tt.broadcast %73 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:30:37.1465657Z %76 = tt.broadcast %74 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:30:37.1465895Z %77 = arith.addi %75, %76 : tensor<512x16xi32> 2026-02-21T08:30:37.1466168Z %78 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:30:37.1466450Z %79 = tt.addptr %78, %77 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:30:37.1466757Z %80 = tt.load %79 evictionPolicy = evict_last : tensor<512x16x!tt.ptr> 2026-02-21T08:30:37.1467044Z %81 = arith.extf %80 : tensor<512x16xbf16> to tensor<512x16xf32> 2026-02-21T08:30:37.1467327Z %82 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:30:37.1467586Z %83 = arith.muli %82, %cst_3 : tensor<8x1xi32> 2026-02-21T08:30:37.1467832Z %84 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.1468118Z %85 = tt.broadcast %83 : tensor<8x1xi32> -> tensor<8x16xi32> 
2026-02-21T08:30:37.1468371Z %86 = tt.broadcast %84 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.1468664Z %87 = arith.addi %85, %86 : tensor<8x16xi32> 2026-02-21T08:30:37.1468893Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.1469164Z %89 = tt.addptr %88, %87 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:30:37.1469455Z %90 = tt.load %89 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.1469713Z %91 = arith.shli %90, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1469929Z %92 = arith.shrsi %91, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1470135Z %93 = arith.shrsi %90, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1470377Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:30:37.1470659Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:30:37.1470962Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:30:37.1471323Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.1471657Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.1471938Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:30:37.1472183Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.1472453Z %101 = tt.broadcast %97 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.1472737Z %102 = arith.select %100, %101, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.1473007Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:30:37.1473258Z %104 = tt.broadcast %98 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.1473516Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.1473795Z %106 = arith.select %105, %104, %102 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.1474072Z %107 = 
tt.reshape %106 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:30:37.1474341Z %108 = arith.sitofp %107 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:30:37.1474707Z %109 = tt.dot %81, %108, %63, inputPrecision = tf32 : tensor<512x16xf32> * tensor<16x16xf32> -> tensor<512x16xf32> 2026-02-21T08:30:37.1475032Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T08:30:37.1475231Z %110 = arith.muli %c8_i32, %c2_i32_6 : i32 2026-02-21T08:30:37.1475424Z %111 = arith.addi %arg3, %110 : i32 2026-02-21T08:30:37.1475659Z %112 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:37.1475904Z %113 = tt.splat %111 : i32 -> tensor<8xi32> 2026-02-21T08:30:37.1476113Z %114 = arith.addi %113, %112 : tensor<8xi32> 2026-02-21T08:30:37.1476315Z %115 = arith.muli %111, %c2_i32 : i32 2026-02-21T08:30:37.1476508Z %116 = tt.splat %115 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.1476718Z %117 = arith.addi %116, %15 : tensor<16xi32> 2026-02-21T08:30:37.1476976Z %118 = tt.expand_dims %13 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:30:37.1477258Z %119 = arith.muli %118, %cst_4 : tensor<512x1xi32> 2026-02-21T08:30:37.1477521Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.1477827Z %121 = tt.broadcast %119 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:30:37.1478106Z %122 = tt.broadcast %120 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:30:37.1478351Z %123 = arith.addi %121, %122 : tensor<512x16xi32> 2026-02-21T08:30:37.1478606Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:30:37.1478899Z %125 = tt.addptr %124, %123 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:30:37.1479226Z %126 = tt.load %125 evictionPolicy = evict_last : tensor<512x16x!tt.ptr> 2026-02-21T08:30:37.1479540Z %127 = arith.extf %126 : tensor<512x16xbf16> to tensor<512x16xf32> 2026-02-21T08:30:37.1479885Z %128 = tt.expand_dims %114 {axis = 1 : i32} : 
tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:30:37.1480157Z %129 = arith.muli %128, %cst_3 : tensor<8x1xi32> 2026-02-21T08:30:37.1480412Z %130 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.1480706Z %131 = tt.broadcast %129 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.1480964Z %132 = tt.broadcast %130 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.1481206Z %133 = arith.addi %131, %132 : tensor<8x16xi32> 2026-02-21T08:30:37.1481445Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.1481767Z %135 = tt.addptr %134, %133 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:30:37.1482069Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.1482386Z %137 = arith.shli %136, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1482611Z %138 = arith.shrsi %137, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1482825Z %139 = arith.shrsi %136, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1483074Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:30:37.1483368Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:30:37.1483679Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:30:37.1484003Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.1484316Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.1484602Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:30:37.1484858Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.1485126Z %147 = tt.broadcast %143 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.1485413Z %148 = arith.select %146, %147, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.1485681Z %149 = arith.cmpi eq, %142, %cst_0 : 
tensor<1x2x1xi32> 2026-02-21T08:30:37.1485935Z %150 = tt.broadcast %144 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.1486197Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.1486474Z %152 = arith.select %151, %150, %148 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.1486756Z %153 = tt.reshape %152 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:30:37.1487007Z %154 = arith.sitofp %153 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:30:37.1487369Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<512x16xf32> * tensor<16x16xf32> -> tensor<512x16xf32> 2026-02-21T08:30:37.1487686Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:30:37.1487879Z %156 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T08:30:37.1488066Z %157 = arith.addi %arg3, %156 : i32 2026-02-21T08:30:37.1488299Z %158 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:37.1488544Z %159 = tt.splat %157 : i32 -> tensor<8xi32> 2026-02-21T08:30:37.1488760Z %160 = arith.addi %159, %158 : tensor<8xi32> 2026-02-21T08:30:37.1488960Z %161 = arith.muli %157, %c2_i32 : i32 2026-02-21T08:30:37.1489157Z %162 = tt.splat %161 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.1489354Z %163 = arith.addi %162, %15 : tensor<16xi32> 2026-02-21T08:30:37.1489609Z %164 = tt.expand_dims %13 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:30:37.1489886Z %165 = arith.muli %164, %cst_4 : tensor<512x1xi32> 2026-02-21T08:30:37.1490151Z %166 = tt.expand_dims %163 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.1490439Z %167 = tt.broadcast %165 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:30:37.1490717Z %168 = tt.broadcast %166 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:30:37.1491022Z %169 = arith.addi %167, %168 : tensor<512x16xi32> 2026-02-21T08:30:37.1491266Z %170 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:30:37.1491607Z %171 = 
tt.addptr %170, %169 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:30:37.1491923Z %172 = tt.load %171 evictionPolicy = evict_last : tensor<512x16x!tt.ptr> 2026-02-21T08:30:37.1492230Z %173 = arith.extf %172 : tensor<512x16xbf16> to tensor<512x16xf32> 2026-02-21T08:30:37.1492519Z %174 = tt.expand_dims %160 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:30:37.1492793Z %175 = arith.muli %174, %cst_3 : tensor<8x1xi32> 2026-02-21T08:30:37.1493057Z %176 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.1493346Z %177 = tt.broadcast %175 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.1493663Z %178 = tt.broadcast %176 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.1493902Z %179 = arith.addi %177, %178 : tensor<8x16xi32> 2026-02-21T08:30:37.1494151Z %180 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.1494436Z %181 = tt.addptr %180, %179 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:30:37.1494736Z %182 = tt.load %181 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.1495008Z %183 = arith.shli %182, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1495218Z %184 = arith.shrsi %183, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1495437Z %185 = arith.shrsi %182, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.1495675Z %186 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:30:37.1495973Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:30:37.1496292Z %188 = tt.expand_dims %187 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:30:37.1496610Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.1496927Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.1497203Z %191 = arith.cmpi eq, %188, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:30:37.1497460Z 
%192 = tt.broadcast %191 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.1497721Z %193 = tt.broadcast %189 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.1498009Z %194 = arith.select %192, %193, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.1498285Z %195 = arith.cmpi eq, %188, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:30:37.1498535Z %196 = tt.broadcast %190 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.1498806Z %197 = tt.broadcast %195 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.1499080Z %198 = arith.select %197, %196, %194 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.1499361Z %199 = tt.reshape %198 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:30:37.1499621Z %200 = arith.sitofp %199 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:30:37.1500001Z %201 = tt.dot %173, %200, %155, inputPrecision = tf32 : tensor<512x16xf32> * tensor<16x16xf32> -> tensor<512x16xf32> 2026-02-21T08:30:37.1500324Z scf.yield %201 : tensor<512x16xf32> 2026-02-21T08:30:37.1500500Z } 2026-02-21T08:30:37.1500689Z %19 = arith.truncf %18 : tensor<512x16xf32> to tensor<512x16xbf16> 2026-02-21T08:30:37.1501021Z tt.descriptor_store %0[%10, %14], %19 : !tt.tensordesc>, tensor<512x16xbf16> 2026-02-21T08:30:37.1501311Z tt.return 2026-02-21T08:30:37.1501451Z } 2026-02-21T08:30:37.1501616Z } 2026-02-21T08:30:37.1501690Z 2026-02-21T08:30:37.1501751Z {-# 2026-02-21T08:30:37.1501889Z external_resources: { 2026-02-21T08:30:37.1502064Z mlir_reproducer: { 2026-02-21T08:30:37.1506782Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:37.1511341Z disable_threading: false, 2026-02-21T08:30:37.1511520Z verify_each: true 2026-02-21T08:30:37.1511701Z } 2026-02-21T08:30:37.1511828Z } 2026-02-21T08:30:37.1511944Z #-} 2026-02-21T08:30:37.1512399Z /tmp/torchinductor_root/zr/czrpdx5n2k6c6amnmsaq56vcdglskc252my5k5lpksx2yxyy3r5m.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:37.1513478Z /tmp/torchinductor_root/zr/czrpdx5n2k6c6amnmsaq56vcdglskc252my5k5lpksx2yxyy3r5m.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, 
please share the reproducer above with Triton project.` 2026-02-21T08:30:37.1514327Z [123s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:30:37.1515461Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:30:37.1516483Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:37.1516742Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:30:37.6795965Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T08:30:37.6800982Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:30:37.6803075Z ^ 2026-02-21T08:30:37.6803532Z /tmp/torchinductor_root/k4/ck4eb35rz4hbihsnfqq3ns2lrhnu3xbaoiih7ipny25bxucb3n42.py:94:40: note: called from 2026-02-21T08:30:37.6805639Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T08:30:37.6805932Z ^ 2026-02-21T08:30:37.6811008Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T08:30:37.6813109Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:30:37.6813704Z ^ 2026-02-21T08:30:37.6818567Z module { 2026-02-21T08:30:37.6823561Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:30:37.6824624Z %cst = arith.constant dense<0> : 
tensor<8x2x16xi8> 2026-02-21T08:30:37.6824896Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:30:37.6825096Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:30:37.6825293Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:30:37.6825475Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:37.6825651Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:37.6825853Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:30:37.6826303Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:30:37.6826542Z %cst_2 = arith.constant dense<4> : tensor<8x16xi8> 2026-02-21T08:30:37.6826782Z %cst_3 = arith.constant dense<8192> : tensor<8x1xi32> 2026-02-21T08:30:37.6827029Z %cst_4 = arith.constant dense<1024> : tensor<2048x1xi32> 2026-02-21T08:30:37.6827241Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:30:37.6827467Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<2048x16xf32> 2026-02-21T08:30:37.6827701Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:30:37.6827893Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:30:37.6828075Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:30:37.6828257Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:37.6828436Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:30:37.6828620Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:30:37.6828801Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:37.6829124Z %0 = tt.make_tensor_descriptor %arg2, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:30:37.6829461Z %1 = tt.get_program_id x : i32 2026-02-21T08:30:37.6829641Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:30:37.6829826Z %3 = arith.minsi %2, %c1024_i32 : i32 2026-02-21T08:30:37.6830028Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:30:37.6830247Z %4 = arith.divsi %arg3, %c2048_i32 : i32 2026-02-21T08:30:37.6830445Z %5 = arith.muli %4, %c4_i32 : i32 2026-02-21T08:30:37.6830640Z %6 = arith.subi %c2_i32, %5 : i32 2026-02-21T08:30:37.6830822Z %7 = arith.minsi %6, 
%c4_i32 : i32 2026-02-21T08:30:37.6831003Z %8 = arith.remsi %arg3, %c2048_i32 : i32 2026-02-21T08:30:37.6831192Z %9 = arith.remsi %8, %7 : i32 2026-02-21T08:30:37.6831361Z %10 = arith.addi %5, %9 : i32 2026-02-21T08:30:37.6831623Z %11 = arith.divsi %8, %7 : i32 2026-02-21T08:30:37.6831811Z %12 = arith.muli %10, %c2048_i32 : i32 2026-02-21T08:30:37.6832066Z %13 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> 2026-02-21T08:30:37.6832341Z %14 = tt.splat %12 : i32 -> tensor<2048xi32> 2026-02-21T08:30:37.6832547Z %15 = arith.addi %14, %13 : tensor<2048xi32> 2026-02-21T08:30:37.6832747Z %16 = arith.muli %11, %c16_i32 : i32 2026-02-21T08:30:37.6832974Z %17 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:30:37.6833227Z %18 = tt.splat %16 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.6833424Z %19 = arith.addi %18, %17 : tensor<16xi32> 2026-02-21T08:30:37.6833626Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:30:37.6833952Z %20 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c32_i32 iter_args(%arg5 = %cst_5) -> (tensor<2048x16xf32>) : i32 { 2026-02-21T08:30:37.6834313Z %22 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:37.6834566Z %23 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T08:30:37.6834772Z %24 = arith.addi %23, %22 : tensor<8xi32> 2026-02-21T08:30:37.6835100Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:30:37.6835295Z %26 = tt.splat %25 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.6835500Z %27 = arith.addi %26, %17 : tensor<16xi32> 2026-02-21T08:30:37.6835770Z %28 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32> 2026-02-21T08:30:37.6836046Z %29 = arith.muli %28, %cst_4 : tensor<2048x1xi32> 2026-02-21T08:30:37.6836317Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.6836614Z %31 = tt.broadcast %29 : tensor<2048x1xi32> -> tensor<2048x16xi32> 2026-02-21T08:30:37.6836890Z %32 = tt.broadcast 
%30 : tensor<1x16xi32> -> tensor<2048x16xi32> 2026-02-21T08:30:37.6837129Z %33 = arith.addi %31, %32 : tensor<2048x16xi32> 2026-02-21T08:30:37.6837384Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<2048x16x!tt.ptr> 2026-02-21T08:30:37.6837740Z %35 = tt.addptr %34, %33 : tensor<2048x16x!tt.ptr>, tensor<2048x16xi32> 2026-02-21T08:30:37.6838056Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<2048x16x!tt.ptr> 2026-02-21T08:30:37.6838357Z %37 = arith.extf %36 : tensor<2048x16xbf16> to tensor<2048x16xf32> 2026-02-21T08:30:37.6838645Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:30:37.6838917Z %39 = arith.muli %38, %cst_3 : tensor<8x1xi32> 2026-02-21T08:30:37.6839171Z %40 = tt.expand_dims %19 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.6839451Z %41 = tt.broadcast %39 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.6839715Z %42 = tt.broadcast %40 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.6839947Z %43 = arith.addi %41, %42 : tensor<8x16xi32> 2026-02-21T08:30:37.6840185Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.6840454Z %45 = tt.addptr %44, %43 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:30:37.6840754Z %46 = tt.load %45 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.6841021Z %47 = arith.shli %46, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6841234Z %48 = arith.shrsi %47, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6841448Z %49 = arith.shrsi %46, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6841734Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:30:37.6842030Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:30:37.6842335Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:30:37.6842658Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 
2026-02-21T08:30:37.6842974Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.6843249Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:30:37.6843504Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.6843761Z %57 = tt.broadcast %53 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.6844044Z %58 = arith.select %56, %57, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.6844319Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:30:37.6844562Z %60 = tt.broadcast %54 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.6844827Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.6845090Z %62 = arith.select %61, %60, %58 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.6845360Z %63 = tt.reshape %62 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:30:37.6845607Z %64 = arith.sitofp %63 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:30:37.6845972Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<2048x16xf32> * tensor<16x16xf32> -> tensor<2048x16xf32> 2026-02-21T08:30:37.6846364Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T08:30:37.6846562Z %66 = arith.muli %c8_i32, %c1_i32_6 : i32 2026-02-21T08:30:37.6846764Z %67 = arith.addi %arg4, %66 : i32 2026-02-21T08:30:37.6846987Z %68 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:37.6847235Z %69 = tt.splat %67 : i32 -> tensor<8xi32> 2026-02-21T08:30:37.6847434Z %70 = arith.addi %69, %68 : tensor<8xi32> 2026-02-21T08:30:37.6847630Z %71 = arith.muli %67, %c2_i32 : i32 2026-02-21T08:30:37.6847827Z %72 = tt.splat %71 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.6848023Z %73 = arith.addi %72, %17 : tensor<16xi32> 2026-02-21T08:30:37.6848286Z %74 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32> 2026-02-21T08:30:37.6848653Z %75 = arith.muli %74, %cst_4 : 
tensor<2048x1xi32> 2026-02-21T08:30:37.6848914Z %76 = tt.expand_dims %73 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.6849215Z %77 = tt.broadcast %75 : tensor<2048x1xi32> -> tensor<2048x16xi32> 2026-02-21T08:30:37.6849487Z %78 = tt.broadcast %76 : tensor<1x16xi32> -> tensor<2048x16xi32> 2026-02-21T08:30:37.6849723Z %79 = arith.addi %77, %78 : tensor<2048x16xi32> 2026-02-21T08:30:37.6849974Z %80 = tt.splat %arg0 : !tt.ptr -> tensor<2048x16x!tt.ptr> 2026-02-21T08:30:37.6850262Z %81 = tt.addptr %80, %79 : tensor<2048x16x!tt.ptr>, tensor<2048x16xi32> 2026-02-21T08:30:37.6850575Z %82 = tt.load %81 evictionPolicy = evict_last : tensor<2048x16x!tt.ptr> 2026-02-21T08:30:37.6850871Z %83 = arith.extf %82 : tensor<2048x16xbf16> to tensor<2048x16xf32> 2026-02-21T08:30:37.6851150Z %84 = tt.expand_dims %70 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:30:37.6851417Z %85 = arith.muli %84, %cst_3 : tensor<8x1xi32> 2026-02-21T08:30:37.6851699Z %86 = tt.expand_dims %19 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.6851987Z %87 = tt.broadcast %85 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.6852241Z %88 = tt.broadcast %86 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.6852477Z %89 = arith.addi %87, %88 : tensor<8x16xi32> 2026-02-21T08:30:37.6852710Z %90 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.6852974Z %91 = tt.addptr %90, %89 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:30:37.6853267Z %92 = tt.load %91 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.6853520Z %93 = arith.shli %92, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6853735Z %94 = arith.shrsi %93, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6853942Z %95 = arith.shrsi %92, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6854191Z %96 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:30:37.6854486Z %97 = tt.expand_dims %96 {axis = 0 : i32} 
: tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:30:37.6854788Z %98 = tt.expand_dims %97 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:30:37.6855105Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.6855415Z %100 = tt.expand_dims %95 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.6855710Z %101 = arith.cmpi eq, %98, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:30:37.6855976Z %102 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.6856245Z %103 = tt.broadcast %99 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.6856532Z %104 = arith.select %102, %103, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.6856804Z %105 = arith.cmpi eq, %98, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:30:37.6857112Z %106 = tt.broadcast %100 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.6857379Z %107 = tt.broadcast %105 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.6857659Z %108 = arith.select %107, %106, %104 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.6857943Z %109 = tt.reshape %108 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:30:37.6858202Z %110 = arith.sitofp %109 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:30:37.6858567Z %111 = tt.dot %83, %110, %65, inputPrecision = tf32 : tensor<2048x16xf32> * tensor<16x16xf32> -> tensor<2048x16xf32> 2026-02-21T08:30:37.6858893Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T08:30:37.6859097Z %112 = arith.muli %c8_i32, %c2_i32_7 : i32 2026-02-21T08:30:37.6859290Z %113 = arith.addi %arg4, %112 : i32 2026-02-21T08:30:37.6859599Z %114 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:37.6859855Z %115 = tt.splat %113 : i32 -> tensor<8xi32> 2026-02-21T08:30:37.6860060Z %116 = arith.addi %115, %114 : tensor<8xi32> 2026-02-21T08:30:37.6860261Z %117 = arith.muli %113, %c2_i32 : i32 2026-02-21T08:30:37.6860456Z %118 
= tt.splat %117 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.6860662Z %119 = arith.addi %118, %17 : tensor<16xi32> 2026-02-21T08:30:37.6860919Z %120 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32> 2026-02-21T08:30:37.6861207Z %121 = arith.muli %120, %cst_4 : tensor<2048x1xi32> 2026-02-21T08:30:37.6861482Z %122 = tt.expand_dims %119 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.6861826Z %123 = tt.broadcast %121 : tensor<2048x1xi32> -> tensor<2048x16xi32> 2026-02-21T08:30:37.6862123Z %124 = tt.broadcast %122 : tensor<1x16xi32> -> tensor<2048x16xi32> 2026-02-21T08:30:37.6862379Z %125 = arith.addi %123, %124 : tensor<2048x16xi32> 2026-02-21T08:30:37.6862646Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<2048x16x!tt.ptr> 2026-02-21T08:30:37.6862952Z %127 = tt.addptr %126, %125 : tensor<2048x16x!tt.ptr>, tensor<2048x16xi32> 2026-02-21T08:30:37.6863285Z %128 = tt.load %127 evictionPolicy = evict_last : tensor<2048x16x!tt.ptr> 2026-02-21T08:30:37.6863599Z %129 = arith.extf %128 : tensor<2048x16xbf16> to tensor<2048x16xf32> 2026-02-21T08:30:37.6863891Z %130 = tt.expand_dims %116 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:30:37.6864171Z %131 = arith.muli %130, %cst_3 : tensor<8x1xi32> 2026-02-21T08:30:37.6864429Z %132 = tt.expand_dims %19 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.6864727Z %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.6864997Z %134 = tt.broadcast %132 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.6865243Z %135 = arith.addi %133, %134 : tensor<8x16xi32> 2026-02-21T08:30:37.6865501Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.6865791Z %137 = tt.addptr %136, %135 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:30:37.6866114Z %138 = tt.load %137 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.6866396Z %139 = arith.shli %138, %cst_2 : 
tensor<8x16xi8> 2026-02-21T08:30:37.6866633Z %140 = arith.shrsi %139, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6866871Z %141 = arith.shrsi %138, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6867127Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:30:37.6867442Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:30:37.6867779Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:30:37.6868130Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.6868525Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.6868823Z %147 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:30:37.6869095Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.6869382Z %149 = tt.broadcast %145 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.6869693Z %150 = arith.select %148, %149, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.6869982Z %151 = arith.cmpi eq, %144, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:30:37.6870251Z %152 = tt.broadcast %146 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.6870534Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.6870824Z %154 = arith.select %153, %152, %150 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.6871196Z %155 = tt.reshape %154 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:30:37.6871473Z %156 = arith.sitofp %155 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:30:37.6871901Z %157 = tt.dot %129, %156, %111, inputPrecision = tf32 : tensor<2048x16xf32> * tensor<16x16xf32> -> tensor<2048x16xf32> 2026-02-21T08:30:37.6872254Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:30:37.6872454Z %158 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T08:30:37.6872666Z %159 = arith.addi %arg4, %158 : 
i32 2026-02-21T08:30:37.6872903Z %160 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:37.6873178Z %161 = tt.splat %159 : i32 -> tensor<8xi32> 2026-02-21T08:30:37.6873384Z %162 = arith.addi %161, %160 : tensor<8xi32> 2026-02-21T08:30:37.6873591Z %163 = arith.muli %159, %c2_i32 : i32 2026-02-21T08:30:37.6873785Z %164 = tt.splat %163 : i32 -> tensor<16xi32> 2026-02-21T08:30:37.6873997Z %165 = arith.addi %164, %17 : tensor<16xi32> 2026-02-21T08:30:37.6874266Z %166 = tt.expand_dims %15 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32> 2026-02-21T08:30:37.6874548Z %167 = arith.muli %166, %cst_4 : tensor<2048x1xi32> 2026-02-21T08:30:37.6874826Z %168 = tt.expand_dims %165 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.6875124Z %169 = tt.broadcast %167 : tensor<2048x1xi32> -> tensor<2048x16xi32> 2026-02-21T08:30:37.6875409Z %170 = tt.broadcast %168 : tensor<1x16xi32> -> tensor<2048x16xi32> 2026-02-21T08:30:37.6875666Z %171 = arith.addi %169, %170 : tensor<2048x16xi32> 2026-02-21T08:30:37.6875916Z %172 = tt.splat %arg0 : !tt.ptr -> tensor<2048x16x!tt.ptr> 2026-02-21T08:30:37.6876220Z %173 = tt.addptr %172, %171 : tensor<2048x16x!tt.ptr>, tensor<2048x16xi32> 2026-02-21T08:30:37.6876538Z %174 = tt.load %173 evictionPolicy = evict_last : tensor<2048x16x!tt.ptr> 2026-02-21T08:30:37.6876850Z %175 = arith.extf %174 : tensor<2048x16xbf16> to tensor<2048x16xf32> 2026-02-21T08:30:37.6877140Z %176 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:30:37.6877412Z %177 = arith.muli %176, %cst_3 : tensor<8x1xi32> 2026-02-21T08:30:37.6877671Z %178 = tt.expand_dims %19 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:30:37.6877962Z %179 = tt.broadcast %177 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.6878229Z %180 = tt.broadcast %178 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:30:37.6878467Z %181 = arith.addi %179, %180 : tensor<8x16xi32> 
2026-02-21T08:30:37.6878710Z %182 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.6878991Z %183 = tt.addptr %182, %181 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:30:37.6879283Z %184 = tt.load %183 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:30:37.6879558Z %185 = arith.shli %184, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6879822Z %186 = arith.shrsi %185, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6880048Z %187 = arith.shrsi %184, %cst_2 : tensor<8x16xi8> 2026-02-21T08:30:37.6880290Z %188 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:30:37.6880584Z %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:30:37.6880905Z %190 = tt.expand_dims %189 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:30:37.6881224Z %191 = tt.expand_dims %186 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.6881586Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:30:37.6881869Z %193 = arith.cmpi eq, %190, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:30:37.6882186Z %194 = tt.broadcast %193 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.6882460Z %195 = tt.broadcast %191 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.6882754Z %196 = arith.select %194, %195, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.6883036Z %197 = arith.cmpi eq, %190, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:30:37.6883291Z %198 = tt.broadcast %192 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:30:37.6883575Z %199 = tt.broadcast %197 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:30:37.6883851Z %200 = arith.select %199, %198, %196 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:30:37.6884137Z %201 = tt.reshape %200 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:30:37.6884402Z %202 = arith.sitofp %201 : tensor<16x16xi8> to 
tensor<16x16xf32> 2026-02-21T08:30:37.6884764Z %203 = tt.dot %175, %202, %157, inputPrecision = tf32 : tensor<2048x16xf32> * tensor<16x16xf32> -> tensor<2048x16xf32> 2026-02-21T08:30:37.6885101Z scf.yield %203 : tensor<2048x16xf32> 2026-02-21T08:30:37.6885279Z } 2026-02-21T08:30:37.6885468Z %21 = arith.truncf %20 : tensor<2048x16xf32> to tensor<2048x16xbf16> 2026-02-21T08:30:37.6885801Z tt.descriptor_store %0[%12, %16], %21 : !tt.tensordesc>, tensor<2048x16xbf16> 2026-02-21T08:30:37.6886132Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32} 2026-02-21T08:30:37.6886352Z tt.return 2026-02-21T08:30:37.6886480Z } 2026-02-21T08:30:37.6886607Z } 2026-02-21T08:30:37.6886678Z 2026-02-21T08:30:37.6886729Z {-# 2026-02-21T08:30:37.6886866Z external_resources: { 2026-02-21T08:30:37.6887025Z mlir_reproducer: { 2026-02-21T08:30:37.6891415Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:37.6895968Z disable_threading: false, 2026-02-21T08:30:37.6896148Z verify_each: true 2026-02-21T08:30:37.6896291Z } 2026-02-21T08:30:37.6896418Z } 2026-02-21T08:30:37.6896538Z #-} 2026-02-21T08:30:37.6896966Z /tmp/torchinductor_root/k4/ck4eb35rz4hbihsnfqq3ns2lrhnu3xbaoiih7ipny25bxucb3n42.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:37.6898042Z /tmp/torchinductor_root/k4/ck4eb35rz4hbihsnfqq3ns2lrhnu3xbaoiih7ipny25bxucb3n42.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:37.6898855Z [124s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:30:37.6900014Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 2048, 16], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 0], range_unroll_factors=[1, 4], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:30:37.6901059Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:37.6901312Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:30:43.1425815Z [129s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=6, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:30:43.1426993Z Tensor-likes are not close! 2026-02-21T08:30:43.1427193Z 2026-02-21T08:30:43.1427403Z Mismatched elements: 33467923 / 33554432 (99.7%) 2026-02-21T08:30:43.1427754Z Greatest absolute difference: 1744.0 at index (1608, 2372) (up to 0.01 allowed) 2026-02-21T08:30:43.1432789Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:30:43.1437771Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:30:43.1442641Z 2026-02-21T08:30:46.8636276Z 2026-02-21T08:30:46.8638524Z 2026-02-21T08:30:46.8638784Z ================================================================ 2026-02-21T08:30:46.8639103Z Internal Triton PTX codegen error 2026-02-21T08:30:46.8643556Z `ptxas` stderr: 2026-02-21T08:30:46.8645320Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 248 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:30:46.8645837Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:30:46.8645987Z 2026-02-21T08:30:46.8646370Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp29456nj1.ptx -o /tmp/tmp29456nj1.ptx.o 2026-02-21T08:30:46.8646801Z 2026-02-21T08:30:46.8646805Z 2026-02-21T08:30:46.8646865Z // 2026-02-21T08:30:46.8647012Z // Generated by LLVM NVPTX Back-End 2026-02-21T08:30:46.8647195Z // 2026-02-21T08:30:46.8647263Z 2026-02-21T08:30:46.8647330Z .version 8.7 2026-02-21T08:30:46.8647469Z .target sm_100a 2026-02-21T08:30:46.8647610Z .address_size 64 2026-02-21T08:30:46.8647692Z 2026-02-21T08:30:46.8647851Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T08:30:46.8648424Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T08:30:46.8648644Z // @_helion_matmul_bf16_int4 2026-02-21T08:30:46.8648885Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T08:30:46.8649140Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T08:30:46.8649435Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T08:30:46.8649716Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T08:30:46.8649988Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T08:30:46.8650271Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 
2026-02-21T08:30:46.8650487Z ) 2026-02-21T08:30:46.8650614Z .reqntid 256 2026-02-21T08:30:46.8650742Z .maxnreg 32 2026-02-21T08:30:46.8650870Z { 2026-02-21T08:30:46.8650996Z .reg .pred %p<248>; 2026-02-21T08:30:46.8651248Z .reg .b16 %rs<1633>; 2026-02-21T08:30:46.8651413Z .reg .b32 %r<2583>; 2026-02-21T08:30:46.8652042Z .reg .b64 %rd<616>; 2026-02-21T08:30:46.8652323Z .loc 1 14 0 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:14:0 2026-02-21T08:30:46.8652614Z $L__func_begin0: 2026-02-21T08:30:46.8652875Z .loc 1 14 0 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:14:0 2026-02-21T08:30:46.8653107Z 2026-02-21T08:30:46.8653164Z // %bb.0: 2026-02-21T08:30:46.8653354Z ld.param.b64 %rd49, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T08:30:46.8653623Z ld.param.b64 %rd48, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T08:30:46.8653974Z ld.param.b64 %rd47, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T08:30:46.8654180Z $L__tmp0: 2026-02-21T08:30:46.8654415Z .loc 1 14 0 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:14 2026-02-21T08:30:46.8654702Z mov.u32 %r1, %tid.x; 2026-02-21T08:30:46.8654860Z setp.lt.u32 %p3, %r1, 32; 2026-02-21T08:30:46.8655036Z mov.b32 %r265, global_smem; 2026-02-21T08:30:46.8655198Z // begin inline asm 2026-02-21T08:30:46.8655445Z @%p3 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r265], 256; 2026-02-21T08:30:46.8655697Z // end inline asm 2026-02-21T08:30:46.8655836Z bar.sync 0; 2026-02-21T08:30:46.8655988Z ld.shared.b32 %r2508, [global_smem]; 2026-02-21T08:30:46.8656159Z bar.sync 0; 2026-02-21T08:30:46.8656299Z // begin inline asm 2026-02-21T08:30:46.8656504Z @%p3 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T08:30:46.8656733Z // end inline asm 2026-02-21T08:30:46.8656994Z .loc 1 19 46 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:46 2026-02-21T08:30:46.8657282Z mov.u32 %r2523, %ctaid.x; 2026-02-21T08:30:46.8657553Z .loc 1 0 0 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:0 2026-02-21T08:30:46.8657833Z sub.s32 %r266, 6463, %r2523; 2026-02-21T08:30:46.8658121Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.8658420Z shr.u32 %r267, %r266, 6; 2026-02-21T08:30:46.8658591Z mul.hi.u32 %r268, %r267, 116080198; 2026-02-21T08:30:46.8658766Z and.b32 %r269, %r268, 2097148; 2026-02-21T08:30:46.8658942Z mad.lo.s32 %r2580, %r269, 2368, %r2523; 2026-02-21T08:30:46.8659231Z .loc 1 31 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:31:45 2026-02-21T08:30:46.8659515Z shr.u32 %r5, %r1, 5; 2026-02-21T08:30:46.8659667Z and.b32 %r270, %r1, 7; 2026-02-21T08:30:46.8659820Z shl.b32 %r6, %r270, 4; 2026-02-21T08:30:46.8659972Z and.b32 %r7, %r1, 15; 2026-02-21T08:30:46.8660116Z shl.b32 %r8, %r7, 3; 2026-02-21T08:30:46.8660374Z .loc 1 33 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:33:45 2026-02-21T08:30:46.8660672Z bfe.u32 %r10, %r1, 3, 5; 2026-02-21T08:30:46.8660825Z or.b32 %r11, %r10, 32; 2026-02-21T08:30:46.8661064Z shr.u32 %r271, %r1, 4; 2026-02-21T08:30:46.8661214Z bfe.u32 %r12, %r1, 4, 4; 2026-02-21T08:30:46.8661371Z or.b32 %r13, %r12, 16; 2026-02-21T08:30:46.8661515Z or.b32 %r14, %r12, 32; 2026-02-21T08:30:46.8661717Z or.b32 %r15, %r271, 48; 2026-02-21T08:30:46.8661870Z shl.b32 %r16, %r270, 3; 2026-02-21T08:30:46.8662141Z .loc 1 65 38 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:65:38 2026-02-21T08:30:46.8662430Z and.b32 %r17, %r1, 128; 2026-02-21T08:30:46.8662704Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.8663014Z setp.lt.s32 %p5, %r2523, %r2580; 2026-02-21T08:30:46.8663181Z shl.b32 %r2509, %r1, 4; 2026-02-21T08:30:46.8663343Z add.s32 %r2332, %r2508, 128; 2026-02-21T08:30:46.8663505Z or.b32 %r2553, %r16, 64; 2026-02-21T08:30:46.8663671Z or.b32 %r2552, %r16, 128; 2026-02-21T08:30:46.8663828Z shl.b32 %r2513, %r7, 7; 2026-02-21T08:30:46.8664042Z 
and.b32 %r2514, %r1, 96; 2026-02-21T08:30:46.8664200Z shl.b32 %r2515, %r1, 1; 2026-02-21T08:30:46.8664353Z shr.u32 %r2516, %r17, 1; 2026-02-21T08:30:46.8664510Z shr.u32 %r2517, %r17, 5; 2026-02-21T08:30:46.8664658Z shl.b32 %r2518, %r1, 9; 2026-02-21T08:30:46.8664812Z shl.b32 %r2519, %r7, 4; 2026-02-21T08:30:46.8664956Z shl.b32 %r2520, %r1, 5; 2026-02-21T08:30:46.8665109Z and.b32 %r2521, %r1, 224; 2026-02-21T08:30:46.8665259Z shl.b32 %r2522, %r1, 2; 2026-02-21T08:30:46.8665418Z setp.eq.b32 %p204, %r1, 0; 2026-02-21T08:30:46.8665583Z setp.eq.b32 %p246, %r17, 0; 2026-02-21T08:30:46.8665752Z @%p5 bra $L__BB0_2; 2026-02-21T08:30:46.8665896Z bra.uni $L__BB0_1; 2026-02-21T08:30:46.8666066Z $L__BB0_2: // %.lr.ph 2026-02-21T08:30:46.8666376Z .loc 1 0 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:0:112 2026-02-21T08:30:46.8666684Z shr.u32 %r9, %r1, 3; 2026-02-21T08:30:46.8666841Z and.b32 %r2555, %r2509, 4080; 2026-02-21T08:30:46.8667005Z add.s32 %r275, %r265, 32768; 2026-02-21T08:30:46.8667191Z add.s32 %r1731, %r275, %r2555; 2026-02-21T08:30:46.8667361Z or.b32 %r2554, %r2555, 4096; 2026-02-21T08:30:46.8667534Z add.s32 %r1733, %r1731, 4096; 2026-02-21T08:30:46.8667699Z shl.b32 %r27, %r10, 13; 2026-02-21T08:30:46.8667863Z add.s32 %r276, %r265, 57344; 2026-02-21T08:30:46.8668053Z add.s32 %r1735, %r276, %r2555; 2026-02-21T08:30:46.8668231Z add.s32 %r277, %r265, %r2555; 2026-02-21T08:30:46.8668413Z add.s32 %r1555, %r277, 40960; 2026-02-21T08:30:46.8668582Z add.s32 %r1557, %r277, 45056; 2026-02-21T08:30:46.8668759Z or.b32 %r32, %r27, 262144; 2026-02-21T08:30:46.8668930Z add.s32 %r1559, %r277, 61440; 2026-02-21T08:30:46.8669110Z add.s32 %r1561, %r277, 49152; 2026-02-21T08:30:46.8669277Z add.s32 %r1563, %r277, 53248; 2026-02-21T08:30:46.8669453Z or.b32 %r37, %r27, 524288; 2026-02-21T08:30:46.8669626Z add.s32 %r1565, %r277, 65536; 2026-02-21T08:30:46.8669788Z shl.b32 %r280, %r2514, 6; 2026-02-21T08:30:46.8669963Z and.b32 %r282, %r2515, 32; 
2026-02-21T08:30:46.8670131Z or.b32 %r284, %r2513, %r280; 2026-02-21T08:30:46.8670305Z or.b32 %r285, %r282, %r2516; 2026-02-21T08:30:46.8670466Z or.b32 %r39, %r284, %r285; 2026-02-21T08:30:46.8670632Z add.s32 %r40, %r275, %r39; 2026-02-21T08:30:46.8670789Z and.b32 %r41, %r1, 127; 2026-02-21T08:30:46.8670955Z add.s32 %r42, %r276, %r41; 2026-02-21T08:30:46.8671121Z or.b32 %r43, %r1, 896; 2026-02-21T08:30:46.8671276Z add.s32 %r44, %r276, %r43; 2026-02-21T08:30:46.8671442Z or.b32 %r45, %r1, 1920; 2026-02-21T08:30:46.8671636Z add.s32 %r46, %r276, %r45; 2026-02-21T08:30:46.8671801Z or.b32 %r47, %r1, 2944; 2026-02-21T08:30:46.8671952Z add.s32 %r48, %r276, %r47; 2026-02-21T08:30:46.8672114Z or.b32 %r49, %r1, 3968; 2026-02-21T08:30:46.8672264Z add.s32 %r50, %r276, %r49; 2026-02-21T08:30:46.8672428Z shl.b32 %r286, %r41, 7; 2026-02-21T08:30:46.8672636Z or.b32 %r288, %r2517, %r286; 2026-02-21T08:30:46.8672807Z or.b32 %r289, %r288, %r6; 2026-02-21T08:30:46.8672990Z add.s32 %r51, %r265, %r289; 2026-02-21T08:30:46.8673235Z xor.b32 %r290, %r289, 16; 2026-02-21T08:30:46.8673412Z add.s32 %r52, %r265, %r290; 2026-02-21T08:30:46.8673578Z xor.b32 %r291, %r289, 32; 2026-02-21T08:30:46.8673753Z add.s32 %r53, %r265, %r291; 2026-02-21T08:30:46.8673917Z xor.b32 %r292, %r289, 48; 2026-02-21T08:30:46.8674089Z add.s32 %r54, %r265, %r292; 2026-02-21T08:30:46.8674268Z xor.b32 %r293, %r289, 64; 2026-02-21T08:30:46.8674434Z add.s32 %r55, %r265, %r293; 2026-02-21T08:30:46.8674611Z xor.b32 %r294, %r289, 80; 2026-02-21T08:30:46.8674770Z add.s32 %r56, %r265, %r294; 2026-02-21T08:30:46.8674944Z xor.b32 %r295, %r289, 96; 2026-02-21T08:30:46.8675103Z add.s32 %r57, %r265, %r295; 2026-02-21T08:30:46.8675277Z xor.b32 %r296, %r289, 112; 2026-02-21T08:30:46.8675440Z add.s32 %r58, %r265, %r296; 2026-02-21T08:30:46.8675621Z bfe.u32 %r297, %r265, 4, 14; 2026-02-21T08:30:46.8675804Z cvt.u64.u32 %rd50, %r297; 2026-02-21T08:30:46.8676066Z or.b64 %rd416, %rd50, 4611686293372403712; 2026-02-21T08:30:46.8676270Z 
add.s32 %r298, %r265, 32; 2026-02-21T08:30:46.8676428Z bfe.u32 %r299, %r298, 4, 14; 2026-02-21T08:30:46.8676595Z cvt.u64.u32 %rd51, %r299; 2026-02-21T08:30:46.8676766Z or.b64 %rd417, %rd51, 4611686293372403712; 2026-02-21T08:30:46.8676960Z add.s32 %r300, %r265, 64; 2026-02-21T08:30:46.8677136Z bfe.u32 %r301, %r300, 4, 14; 2026-02-21T08:30:46.8677304Z cvt.u64.u32 %rd52, %r301; 2026-02-21T08:30:46.8677472Z or.b64 %rd418, %rd52, 4611686293372403712; 2026-02-21T08:30:46.8677649Z add.s32 %r302, %r265, 96; 2026-02-21T08:30:46.8677810Z bfe.u32 %r303, %r302, 4, 14; 2026-02-21T08:30:46.8677964Z cvt.u64.u32 %rd53, %r303; 2026-02-21T08:30:46.8678131Z or.b64 %rd419, %rd53, 4611686293372403712; 2026-02-21T08:30:46.8678310Z add.s32 %r304, %r265, 16384; 2026-02-21T08:30:46.8678480Z bfe.u32 %r305, %r304, 4, 14; 2026-02-21T08:30:46.8678637Z cvt.u64.u32 %rd54, %r305; 2026-02-21T08:30:46.8678818Z or.b64 %rd420, %rd54, 4611686293372403712; 2026-02-21T08:30:46.8679007Z add.s32 %r306, %r265, 16416; 2026-02-21T08:30:46.8679167Z bfe.u32 %r307, %r306, 4, 14; 2026-02-21T08:30:46.8679334Z cvt.u64.u32 %rd55, %r307; 2026-02-21T08:30:46.8679496Z or.b64 %rd421, %rd55, 4611686293372403712; 2026-02-21T08:30:46.8679684Z add.s32 %r308, %r265, 16448; 2026-02-21T08:30:46.8679839Z bfe.u32 %r309, %r308, 4, 14; 2026-02-21T08:30:46.8680003Z cvt.u64.u32 %rd56, %r309; 2026-02-21T08:30:46.8680163Z or.b64 %rd422, %rd56, 4611686293372403712; 2026-02-21T08:30:46.8680352Z add.s32 %r310, %r265, 16480; 2026-02-21T08:30:46.8680506Z bfe.u32 %r311, %r310, 4, 14; 2026-02-21T08:30:46.8680667Z cvt.u64.u32 %rd57, %r311; 2026-02-21T08:30:46.8680840Z or.b64 %rd423, %rd57, 4611686293372403712; 2026-02-21T08:30:46.8681014Z shl.b32 %r312, %r9, 13; 2026-02-21T08:30:46.8681171Z or.b32 %r59, %r312, 786432; 2026-02-21T08:30:46.8681325Z and.b32 %r314, %r2518, 3072; 2026-02-21T08:30:46.8681485Z shl.b32 %r316, %r2514, 3; 2026-02-21T08:30:46.8681672Z or.b32 %r317, %r2519, %r316; 2026-02-21T08:30:46.8681833Z xor.b32 %r318, %r317, 
%r285; 2026-02-21T08:30:46.8681984Z add.s32 %r319, %r265, %r314; 2026-02-21T08:30:46.8682148Z add.s32 %r60, %r319, %r318; 2026-02-21T08:30:46.8682310Z and.b32 %r321, %r2520, 3936; 2026-02-21T08:30:46.8682462Z and.b32 %r324, %r2522, 16; 2026-02-21T08:30:46.8682625Z xor.b32 %r325, %r321, %r2521; 2026-02-21T08:30:46.8682782Z add.s32 %r326, %r265, %r324; 2026-02-21T08:30:46.8682941Z add.s32 %r659, %r326, %r325; 2026-02-21T08:30:46.8683212Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8683515Z or.b32 %r327, %r27, %r6; 2026-02-21T08:30:46.8683668Z or.b32 %r62, %r327, 1048576; 2026-02-21T08:30:46.8683951Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.8684265Z mad.wide.u32 %rd58, %r270, 16, %rd47; 2026-02-21T08:30:46.8684443Z add.s64 %rd9, %rd58, 512; 2026-02-21T08:30:46.8684611Z shl.b32 %r63, %r10, 10; 2026-02-21T08:30:46.8684929Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8685222Z or.b32 %r329, %r63, %r16; 2026-02-21T08:30:46.8685374Z or.b32 %r64, %r329, 33024; 2026-02-21T08:30:46.8685533Z bra.uni $L__BB0_3; 2026-02-21T08:30:46.8685721Z $L__BB0_27: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.8686049Z .loc 1 0 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:0:92 2026-02-21T08:30:46.8686336Z mov.b32 %r1851, 1; 2026-02-21T08:30:46.8686475Z $L__tmp1: 2026-02-21T08:30:46.8686768Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8687099Z // begin inline asm 2026-02-21T08:30:46.8687242Z 2026-02-21T08:30:46.8687358Z { 2026-02-21T08:30:46.8687493Z .reg .pred complete; 2026-02-21T08:30:46.8687637Z waitLoop: 2026-02-21T08:30:46.8687892Z mbarrier.try_wait.parity.shared.b64 complete, [%r2548], %r1851; 2026-02-21T08:30:46.8688141Z @!complete bra.uni waitLoop; 2026-02-21T08:30:46.8688294Z } 2026-02-21T08:30:46.8688358Z 
2026-02-21T08:30:46.8688423Z // end inline asm 2026-02-21T08:30:46.8688558Z $L__tmp2: 2026-02-21T08:30:46.8688805Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8689097Z cp.async.wait_group 0; 2026-02-21T08:30:46.8689255Z bar.sync 0; 2026-02-21T08:30:46.8689390Z add.s32 %r1852, %r265, 69632; 2026-02-21T08:30:46.8689553Z // begin inline asm 2026-02-21T08:30:46.8689732Z @%p204 mbarrier.inval.shared::cta.b64 [%r1852]; 2026-02-21T08:30:46.8689923Z // end inline asm 2026-02-21T08:30:46.8690063Z bar.sync 0; 2026-02-21T08:30:46.8690190Z // begin inline asm 2026-02-21T08:30:46.8690366Z @%p204 mbarrier.inval.shared::cta.b64 [%r1442]; 2026-02-21T08:30:46.8690552Z // end inline asm 2026-02-21T08:30:46.8690811Z .loc 1 88 43 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:43 2026-02-21T08:30:46.8691097Z shl.b32 %r1925, %r148, 13; 2026-02-21T08:30:46.8691261Z shl.b32 %r1926, %r149, 13; 2026-02-21T08:30:46.8691419Z shl.b32 %r1927, %r150, 13; 2026-02-21T08:30:46.8691601Z shl.b32 %r1928, %r151, 13; 2026-02-21T08:30:46.8691867Z .loc 1 88 50 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:50 2026-02-21T08:30:46.8692156Z add.s32 %r1929, %r1925, %r147; 2026-02-21T08:30:46.8692326Z add.s32 %r1930, %r1926, %r147; 2026-02-21T08:30:46.8692483Z add.s32 %r1931, %r1927, %r147; 2026-02-21T08:30:46.8692648Z add.s32 %r1932, %r1928, %r147; 2026-02-21T08:30:46.8692915Z .loc 1 88 22 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:22 2026-02-21T08:30:46.8693220Z mad.wide.s32 %rd443, %r1929, 2, %rd49; 2026-02-21T08:30:46.8693411Z mad.wide.s32 %rd444, %r1930, 2, %rd49; 2026-02-21T08:30:46.8693593Z mad.wide.s32 %rd445, %r1931, 2, %rd49; 2026-02-21T08:30:46.8693782Z mad.wide.s32 %rd446, %r1932, 2, %rd49; 2026-02-21T08:30:46.8693951Z $L__tmp3: 2026-02-21T08:30:46.8694244Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8694576Z // begin 
inline asm 2026-02-21T08:30:46.8694986Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1854, %r1855, %r1856, %r1857, %r1858, %r1859, %r1860, %r1861, %r1862, %r1863, %r1864, %r1865, %r1866, %r1867, %r1868, %r1869}, [%r1887 + 0], 32; 2026-02-21T08:30:46.8695414Z // end inline asm 2026-02-21T08:30:46.8695559Z // begin inline asm 2026-02-21T08:30:46.8695958Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1871, %r1872, %r1873, %r1874, %r1875, %r1876, %r1877, %r1878, %r1879, %r1880, %r1881, %r1882, %r1883, %r1884, %r1885, %r1886}, [%r1887 + 16], 32; 2026-02-21T08:30:46.8696375Z // end inline asm 2026-02-21T08:30:46.8696518Z // begin inline asm 2026-02-21T08:30:46.8696671Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:46.8696901Z // end inline asm 2026-02-21T08:30:46.8697050Z cvt.u64.u32 %rd447, %r1854; 2026-02-21T08:30:46.8697212Z cvt.u64.u32 %rd448, %r1855; 2026-02-21T08:30:46.8697377Z shl.b64 %rd449, %rd448, 32; 2026-02-21T08:30:46.8697533Z or.b64 %rd450, %rd447, %rd449; 2026-02-21T08:30:46.8697697Z $L__tmp4: 2026-02-21T08:30:46.8697936Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8698235Z mov.b64 {%r1933, %r1934}, %rd450; 2026-02-21T08:30:46.8698417Z cvt.rn.bf16x2.f32 %r1935, %r1934, %r1933; 2026-02-21T08:30:46.8698602Z $L__tmp5: 2026-02-21T08:30:46.8698894Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8699227Z cvt.u64.u32 %rd451, %r1856; 2026-02-21T08:30:46.8699391Z cvt.u64.u32 %rd452, %r1857; 2026-02-21T08:30:46.8699545Z shl.b64 %rd453, %rd452, 32; 2026-02-21T08:30:46.8699758Z or.b64 %rd454, %rd451, %rd453; 2026-02-21T08:30:46.8699914Z $L__tmp6: 2026-02-21T08:30:46.8700154Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8700439Z mov.b64 {%r1936, %r1937}, %rd454; 2026-02-21T08:30:46.8700624Z cvt.rn.bf16x2.f32 %r1938, %r1937, %r1936; 2026-02-21T08:30:46.8700803Z $L__tmp7: 
2026-02-21T08:30:46.8701080Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8701414Z cvt.u64.u32 %rd455, %r1858; 2026-02-21T08:30:46.8701595Z cvt.u64.u32 %rd456, %r1859; 2026-02-21T08:30:46.8701756Z shl.b64 %rd457, %rd456, 32; 2026-02-21T08:30:46.8701908Z or.b64 %rd458, %rd455, %rd457; 2026-02-21T08:30:46.8702070Z $L__tmp8: 2026-02-21T08:30:46.8702304Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8702604Z mov.b64 {%r1939, %r1940}, %rd458; 2026-02-21T08:30:46.8702792Z cvt.rn.bf16x2.f32 %r1941, %r1940, %r1939; 2026-02-21T08:30:46.8702967Z $L__tmp9: 2026-02-21T08:30:46.8703253Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8703580Z cvt.u64.u32 %rd459, %r1860; 2026-02-21T08:30:46.8703744Z cvt.u64.u32 %rd460, %r1861; 2026-02-21T08:30:46.8703898Z shl.b64 %rd461, %rd460, 32; 2026-02-21T08:30:46.8704065Z or.b64 %rd462, %rd459, %rd461; 2026-02-21T08:30:46.8704227Z $L__tmp10: 2026-02-21T08:30:46.8704463Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8704760Z mov.b64 {%r1942, %r1943}, %rd462; 2026-02-21T08:30:46.8704938Z cvt.rn.bf16x2.f32 %r1944, %r1943, %r1942; 2026-02-21T08:30:46.8705120Z $L__tmp11: 2026-02-21T08:30:46.8705422Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8705761Z cvt.u64.u32 %rd463, %r1862; 2026-02-21T08:30:46.8705919Z cvt.u64.u32 %rd464, %r1863; 2026-02-21T08:30:46.8706080Z shl.b64 %rd465, %rd464, 32; 2026-02-21T08:30:46.8706242Z or.b64 %rd466, %rd463, %rd465; 2026-02-21T08:30:46.8706398Z $L__tmp12: 2026-02-21T08:30:46.8706639Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8706926Z mov.b64 {%r1945, %r1946}, %rd466; 2026-02-21T08:30:46.8707110Z cvt.rn.bf16x2.f32 %r1947, 
%r1946, %r1945; 2026-02-21T08:30:46.8707278Z $L__tmp13: 2026-02-21T08:30:46.8707566Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8707892Z cvt.u64.u32 %rd467, %r1864; 2026-02-21T08:30:46.8708053Z cvt.u64.u32 %rd468, %r1865; 2026-02-21T08:30:46.8708212Z shl.b64 %rd469, %rd468, 32; 2026-02-21T08:30:46.8708366Z or.b64 %rd470, %rd467, %rd469; 2026-02-21T08:30:46.8708526Z $L__tmp14: 2026-02-21T08:30:46.8708764Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8709114Z mov.b64 {%r1948, %r1949}, %rd470; 2026-02-21T08:30:46.8709289Z cvt.rn.bf16x2.f32 %r1950, %r1949, %r1948; 2026-02-21T08:30:46.8709468Z $L__tmp15: 2026-02-21T08:30:46.8709752Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8710116Z cvt.u64.u32 %rd471, %r1866; 2026-02-21T08:30:46.8710277Z cvt.u64.u32 %rd472, %r1867; 2026-02-21T08:30:46.8710431Z shl.b64 %rd473, %rd472, 32; 2026-02-21T08:30:46.8710606Z or.b64 %rd474, %rd471, %rd473; 2026-02-21T08:30:46.8710768Z $L__tmp16: 2026-02-21T08:30:46.8711024Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8711331Z mov.b64 {%r1951, %r1952}, %rd474; 2026-02-21T08:30:46.8711523Z cvt.rn.bf16x2.f32 %r1953, %r1952, %r1951; 2026-02-21T08:30:46.8711735Z $L__tmp17: 2026-02-21T08:30:46.8712087Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8712443Z cvt.u64.u32 %rd475, %r1868; 2026-02-21T08:30:46.8712605Z cvt.u64.u32 %rd476, %r1869; 2026-02-21T08:30:46.8712771Z shl.b64 %rd477, %rd476, 32; 2026-02-21T08:30:46.8712931Z or.b64 %rd478, %rd475, %rd477; 2026-02-21T08:30:46.8713099Z $L__tmp18: 2026-02-21T08:30:46.8713346Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8713658Z mov.b64 {%r1954, %r1955}, %rd478; 
2026-02-21T08:30:46.8713849Z cvt.rn.bf16x2.f32 %r1956, %r1955, %r1954; 2026-02-21T08:30:46.8714029Z $L__tmp19: 2026-02-21T08:30:46.8714329Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8714669Z cvt.u64.u32 %rd479, %r1871; 2026-02-21T08:30:46.8714837Z cvt.u64.u32 %rd480, %r1872; 2026-02-21T08:30:46.8714998Z shl.b64 %rd481, %rd480, 32; 2026-02-21T08:30:46.8715173Z or.b64 %rd482, %rd479, %rd481; 2026-02-21T08:30:46.8715331Z $L__tmp20: 2026-02-21T08:30:46.8715583Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8715891Z mov.b64 {%r1957, %r1958}, %rd482; 2026-02-21T08:30:46.8716075Z cvt.rn.bf16x2.f32 %r1959, %r1958, %r1957; 2026-02-21T08:30:46.8716264Z $L__tmp21: 2026-02-21T08:30:46.8716556Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8716905Z cvt.u64.u32 %rd483, %r1873; 2026-02-21T08:30:46.8717064Z cvt.u64.u32 %rd484, %r1874; 2026-02-21T08:30:46.8717230Z shl.b64 %rd485, %rd484, 32; 2026-02-21T08:30:46.8717391Z or.b64 %rd486, %rd483, %rd485; 2026-02-21T08:30:46.8717557Z $L__tmp22: 2026-02-21T08:30:46.8717809Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8718107Z mov.b64 {%r1960, %r1961}, %rd486; 2026-02-21T08:30:46.8718299Z cvt.rn.bf16x2.f32 %r1962, %r1961, %r1960; 2026-02-21T08:30:46.8718480Z $L__tmp23: 2026-02-21T08:30:46.8718776Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8719111Z cvt.u64.u32 %rd487, %r1875; 2026-02-21T08:30:46.8719269Z cvt.u64.u32 %rd488, %r1876; 2026-02-21T08:30:46.8719429Z shl.b64 %rd489, %rd488, 32; 2026-02-21T08:30:46.8719582Z or.b64 %rd490, %rd487, %rd489; 2026-02-21T08:30:46.8719740Z $L__tmp24: 2026-02-21T08:30:46.8719975Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 
2026-02-21T08:30:46.8720270Z mov.b64 {%r1963, %r1964}, %rd490; 2026-02-21T08:30:46.8720444Z cvt.rn.bf16x2.f32 %r1965, %r1964, %r1963; 2026-02-21T08:30:46.8720621Z $L__tmp25: 2026-02-21T08:30:46.8720905Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8721292Z cvt.u64.u32 %rd491, %r1877; 2026-02-21T08:30:46.8721451Z cvt.u64.u32 %rd492, %r1878; 2026-02-21T08:30:46.8721642Z shl.b64 %rd493, %rd492, 32; 2026-02-21T08:30:46.8721806Z or.b64 %rd494, %rd491, %rd493; 2026-02-21T08:30:46.8721958Z $L__tmp26: 2026-02-21T08:30:46.8722200Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8722486Z mov.b64 {%r1966, %r1967}, %rd494; 2026-02-21T08:30:46.8722669Z cvt.rn.bf16x2.f32 %r1968, %r1967, %r1966; 2026-02-21T08:30:46.8722840Z $L__tmp27: 2026-02-21T08:30:46.8723132Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8723468Z cvt.u64.u32 %rd495, %r1879; 2026-02-21T08:30:46.8723621Z cvt.u64.u32 %rd496, %r1880; 2026-02-21T08:30:46.8723781Z shl.b64 %rd497, %rd496, 32; 2026-02-21T08:30:46.8724012Z or.b64 %rd498, %rd495, %rd497; 2026-02-21T08:30:46.8724180Z $L__tmp28: 2026-02-21T08:30:46.8724417Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8724713Z mov.b64 {%r1969, %r1970}, %rd498; 2026-02-21T08:30:46.8724896Z cvt.rn.bf16x2.f32 %r1971, %r1970, %r1969; 2026-02-21T08:30:46.8725065Z $L__tmp29: 2026-02-21T08:30:46.8725352Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8725682Z cvt.u64.u32 %rd499, %r1881; 2026-02-21T08:30:46.8725844Z cvt.u64.u32 %rd500, %r1882; 2026-02-21T08:30:46.8725999Z shl.b64 %rd501, %rd500, 32; 2026-02-21T08:30:46.8726163Z or.b64 %rd502, %rd499, %rd501; 2026-02-21T08:30:46.8726319Z $L__tmp30: 2026-02-21T08:30:46.8726566Z .loc 1 87 28 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8726864Z mov.b64 {%r1972, %r1973}, %rd502; 2026-02-21T08:30:46.8727047Z cvt.rn.bf16x2.f32 %r1974, %r1973, %r1972; 2026-02-21T08:30:46.8727234Z $L__tmp31: 2026-02-21T08:30:46.8727515Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8727849Z cvt.u64.u32 %rd503, %r1883; 2026-02-21T08:30:46.8728002Z cvt.u64.u32 %rd504, %r1884; 2026-02-21T08:30:46.8728160Z shl.b64 %rd505, %rd504, 32; 2026-02-21T08:30:46.8728313Z or.b64 %rd506, %rd503, %rd505; 2026-02-21T08:30:46.8728470Z $L__tmp32: 2026-02-21T08:30:46.8728715Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8729006Z mov.b64 {%r1975, %r1976}, %rd506; 2026-02-21T08:30:46.8729189Z cvt.rn.bf16x2.f32 %r1977, %r1976, %r1975; 2026-02-21T08:30:46.8729360Z $L__tmp33: 2026-02-21T08:30:46.8729646Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8729977Z cvt.u64.u32 %rd507, %r1885; 2026-02-21T08:30:46.8730142Z cvt.u64.u32 %rd508, %r1886; 2026-02-21T08:30:46.8730305Z shl.b64 %rd509, %rd508, 32; 2026-02-21T08:30:46.8730461Z or.b64 %rd510, %rd507, %rd509; 2026-02-21T08:30:46.8730623Z $L__tmp34: 2026-02-21T08:30:46.8730861Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8731157Z mov.b64 {%r1978, %r1979}, %rd510; 2026-02-21T08:30:46.8731334Z cvt.rn.bf16x2.f32 %r1980, %r1979, %r1978; 2026-02-21T08:30:46.8731660Z .loc 1 88 81 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:81 2026-02-21T08:30:46.8731985Z st.shared.v4.b32 [%r60], {%r1935, %r1947, %r1959, %r1971}; 2026-02-21T08:30:46.8732196Z bar.sync 0; 2026-02-21T08:30:46.8732337Z // begin inline asm 2026-02-21T08:30:46.8732579Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1888, %r1889, %r1890, %r1891}, [%r659]; 2026-02-21T08:30:46.8732853Z // end 
inline asm 2026-02-21T08:30:46.8732991Z bar.sync 0; 2026-02-21T08:30:46.8733225Z st.shared.v4.b32 [%r60], {%r1938, %r1950, %r1962, %r1974}; 2026-02-21T08:30:46.8733433Z bar.sync 0; 2026-02-21T08:30:46.8733579Z // begin inline asm 2026-02-21T08:30:46.8733819Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1893, %r1894, %r1895, %r1896}, [%r659]; 2026-02-21T08:30:46.8734098Z // end inline asm 2026-02-21T08:30:46.8734243Z bar.sync 0; 2026-02-21T08:30:46.8734416Z st.shared.v4.b32 [%r60], {%r1941, %r1953, %r1965, %r1977}; 2026-02-21T08:30:46.8734625Z bar.sync 0; 2026-02-21T08:30:46.8734758Z // begin inline asm 2026-02-21T08:30:46.8735001Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1898, %r1899, %r1900, %r1901}, [%r659]; 2026-02-21T08:30:46.8735264Z // end inline asm 2026-02-21T08:30:46.8735408Z bar.sync 0; 2026-02-21T08:30:46.8735577Z st.shared.v4.b32 [%r60], {%r1944, %r1956, %r1968, %r1980}; 2026-02-21T08:30:46.8735786Z bar.sync 0; 2026-02-21T08:30:46.8735918Z // begin inline asm 2026-02-21T08:30:46.8736220Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1903, %r1904, %r1905, %r1906}, [%r659]; 2026-02-21T08:30:46.8736493Z // end inline asm 2026-02-21T08:30:46.8736629Z // begin inline asm 2026-02-21T08:30:46.8736832Z st.global.v4.b32 [ %rd443 + 0 ], { %r1888, %r1893, %r1898, %r1903 }; 2026-02-21T08:30:46.8737047Z // end inline asm 2026-02-21T08:30:46.8737190Z // begin inline asm 2026-02-21T08:30:46.8737380Z st.global.v4.b32 [ %rd444 + 0 ], { %r1889, %r1894, %r1899, %r1904 }; 2026-02-21T08:30:46.8737606Z // end inline asm 2026-02-21T08:30:46.8737743Z // begin inline asm 2026-02-21T08:30:46.8737939Z st.global.v4.b32 [ %rd445 + 0 ], { %r1890, %r1895, %r1900, %r1905 }; 2026-02-21T08:30:46.8738166Z // end inline asm 2026-02-21T08:30:46.8738306Z // begin inline asm 2026-02-21T08:30:46.8738505Z st.global.v4.b32 [ %rd446 + 0 ], { %r1891, %r1896, %r1901, %r1906 }; 2026-02-21T08:30:46.8738713Z // end inline asm 2026-02-21T08:30:46.8738979Z .loc 1 19 112 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.8739281Z add.s32 %r2523, %r2523, 9472; 2026-02-21T08:30:46.8739460Z setp.lt.s32 %p206, %r2523, %r2580; 2026-02-21T08:30:46.8739640Z @%p206 bra $L__BB0_3; 2026-02-21T08:30:46.8739793Z bra.uni $L__BB0_28; 2026-02-21T08:30:46.8739988Z $L__BB0_3: // =>This Loop Header: Depth=1 2026-02-21T08:30:46.8740223Z // Child Loop BB0_6 Depth 2 2026-02-21T08:30:46.8740457Z // Child Loop BB0_12 Depth 2 2026-02-21T08:30:46.8740683Z // Child Loop BB0_18 Depth 2 2026-02-21T08:30:46.8740908Z // Child Loop BB0_24 Depth 2 2026-02-21T08:30:46.8741223Z .loc 1 25 35 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:25:35 2026-02-21T08:30:46.8741520Z shr.s32 %r401, %r2523, 31; 2026-02-21T08:30:46.8741721Z shr.u32 %r402, %r401, 23; 2026-02-21T08:30:46.8741880Z add.s32 %r403, %r2523, %r402; 2026-02-21T08:30:46.8742052Z shr.s32 %r404, %r403, 9; 2026-02-21T08:30:46.8742313Z .loc 1 26 33 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:26:33 2026-02-21T08:30:46.8742606Z shl.b32 %r405, %r404, 3; 2026-02-21T08:30:46.8742863Z .loc 1 27 39 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:39 2026-02-21T08:30:46.8743149Z sub.s32 %r406, 64, %r405; 2026-02-21T08:30:46.8743412Z .loc 1 27 52 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:52 2026-02-21T08:30:46.8743694Z min.s32 %r407, %r406, 8; 2026-02-21T08:30:46.8743957Z .loc 1 28 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:45 2026-02-21T08:30:46.8744244Z and.b32 %r408, %r403, -512; 2026-02-21T08:30:46.8744416Z sub.s32 %r409, %r2523, %r408; 2026-02-21T08:30:46.8744683Z .loc 1 29 51 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:29:51 2026-02-21T08:30:46.8745040Z div.s32 %r66, %r409, %r407; 2026-02-21T08:30:46.8745311Z .loc 1 28 64 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:64 2026-02-21T08:30:46.8745598Z mul.lo.s32 %r410, %r66, %r407; 2026-02-21T08:30:46.8745771Z sub.s32 
%r411, %r409, %r410; 2026-02-21T08:30:46.8746037Z .loc 1 28 30 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:30 2026-02-21T08:30:46.8746325Z add.s32 %r412, %r411, %r405; 2026-02-21T08:30:46.8746587Z .loc 1 30 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:30:27 2026-02-21T08:30:46.8746876Z shl.b32 %r67, %r412, 7; 2026-02-21T08:30:46.8747141Z .loc 1 31 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:31:32 2026-02-21T08:30:46.8747426Z or.b32 %r68, %r67, %r6; 2026-02-21T08:30:46.8747692Z .loc 1 32 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:32:27 2026-02-21T08:30:46.8748036Z shl.b32 %r413, %r66, 6; 2026-02-21T08:30:46.8748300Z .loc 1 33 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:33:32 2026-02-21T08:30:46.8748585Z or.b32 %r414, %r413, %r10; 2026-02-21T08:30:46.8748751Z or.b32 %r415, %r413, %r11; 2026-02-21T08:30:46.8749018Z .loc 1 48 53 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:53 2026-02-21T08:30:46.8749301Z shl.b32 %r416, %r414, 10; 2026-02-21T08:30:46.8749468Z shl.b32 %r417, %r415, 10; 2026-02-21T08:30:46.8749615Z $L__tmp35: 2026-02-21T08:30:46.8749914Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8750263Z shfl.sync.idx.b32 %r74, %r5, 0, 31, -1; 2026-02-21T08:30:46.8750466Z shl.b32 %r418, %r74, 21; 2026-02-21T08:30:46.8750638Z and.b32 %r419, %r418, 6291456; 2026-02-21T08:30:46.8750799Z add.s32 %r420, %r419, %r2508; 2026-02-21T08:30:46.8750965Z and.b32 %r421, %r74, 4; 2026-02-21T08:30:46.8751117Z shl.b32 %r422, %r421, 4; 2026-02-21T08:30:46.8751276Z add.s32 %r1887, %r420, %r422; 2026-02-21T08:30:46.8751436Z mov.pred %p15, -1; 2026-02-21T08:30:46.8751618Z mov.b32 %r2526, 0; 2026-02-21T08:30:46.8751760Z // begin inline asm 2026-02-21T08:30:46.8752177Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1887 + 0], 32, {%r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, 
%r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526}; 2026-02-21T08:30:46.8752615Z // end inline asm 2026-02-21T08:30:46.8752752Z // begin inline asm 2026-02-21T08:30:46.8753153Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1887 + 16], 32, {%r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526, %r2526}; 2026-02-21T08:30:46.8753580Z // end inline asm 2026-02-21T08:30:46.8753721Z // begin inline asm 2026-02-21T08:30:46.8753877Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.8754048Z // end inline asm 2026-02-21T08:30:46.8754187Z bar.sync 0; 2026-02-21T08:30:46.8754327Z $L__tmp36: 2026-02-21T08:30:46.8754579Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8754869Z add.s32 %r2527, %r265, 69632; 2026-02-21T08:30:46.8755037Z // begin inline asm 2026-02-21T08:30:46.8755214Z @%p204 mbarrier.init.shared::cta.b64 [%r2527], 1; 2026-02-21T08:30:46.8755424Z // end inline asm 2026-02-21T08:30:46.8755562Z bar.sync 0; 2026-02-21T08:30:46.8755706Z add.s32 %r1442, %r265, 69640; 2026-02-21T08:30:46.8755866Z // begin inline asm 2026-02-21T08:30:46.8756043Z @%p204 mbarrier.init.shared::cta.b64 [%r1442], 1; 2026-02-21T08:30:46.8756249Z // end inline asm 2026-02-21T08:30:46.8756509Z .loc 1 48 60 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:60 2026-02-21T08:30:46.8756818Z or.b32 %r424, %r416, %r16; 2026-02-21T08:30:46.8756980Z or.b32 %r425, %r417, %r16; 2026-02-21T08:30:46.8757267Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8757631Z mad.wide.s32 %rd59, %r424, 2, %rd47; 2026-02-21T08:30:46.8757836Z mad.wide.s32 %rd60, %r425, 2, %rd47; 2026-02-21T08:30:46.8758026Z mov.b32 %r499, 16; 2026-02-21T08:30:46.8758295Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8758604Z // begin inline asm 2026-02-21T08:30:46.8758821Z cp.async.cg.shared.global [ 
%r1731 + 0 ], [ %rd59 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8759069Z // end inline asm 2026-02-21T08:30:46.8759214Z // begin inline asm 2026-02-21T08:30:46.8759430Z cp.async.cg.shared.global [ %r1733 + 0 ], [ %rd60 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8759667Z // end inline asm 2026-02-21T08:30:46.8759832Z cp.async.commit_group; 2026-02-21T08:30:46.8760121Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.8760476Z add.s32 %r426, %r68, %r27; 2026-02-21T08:30:46.8760774Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8761082Z cvt.s64.s32 %rd68, %r426; 2026-02-21T08:30:46.8761259Z add.s64 %rd61, %rd48, %rd68; 2026-02-21T08:30:46.8761574Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8761883Z // begin inline asm 2026-02-21T08:30:46.8762097Z cp.async.cg.shared.global [ %r1735 + 0 ], [ %rd61 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8762330Z // end inline asm 2026-02-21T08:30:46.8762487Z cp.async.commit_group; 2026-02-21T08:30:46.8762761Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8763070Z cvt.s64.s32 %rd69, %r416; 2026-02-21T08:30:46.8763233Z cvt.u64.u32 %rd10, %r16; 2026-02-21T08:30:46.8763404Z or.b64 %rd70, %rd69, %rd10; 2026-02-21T08:30:46.8763577Z shl.b64 %rd71, %rd70, 1; 2026-02-21T08:30:46.8763742Z add.s64 %rd11, %rd47, %rd71; 2026-02-21T08:30:46.8763909Z add.s64 %rd62, %rd11, 128; 2026-02-21T08:30:46.8764064Z cvt.s64.s32 %rd72, %r417; 2026-02-21T08:30:46.8764226Z or.b64 %rd73, %rd72, %rd10; 2026-02-21T08:30:46.8764381Z shl.b64 %rd74, %rd73, 1; 2026-02-21T08:30:46.8764542Z add.s64 %rd12, %rd47, %rd74; 2026-02-21T08:30:46.8764698Z add.s64 %rd63, %rd12, 128; 2026-02-21T08:30:46.8764972Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8765252Z bar.sync 0; 2026-02-21T08:30:46.8765393Z // begin inline asm 
2026-02-21T08:30:46.8765590Z cp.async.cg.shared.global [ %r1555 + 0 ], [ %rd62 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8765818Z // end inline asm 2026-02-21T08:30:46.8765957Z // begin inline asm 2026-02-21T08:30:46.8766151Z cp.async.cg.shared.global [ %r1557 + 0 ], [ %rd63 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8766379Z // end inline asm 2026-02-21T08:30:46.8766520Z cp.async.commit_group; 2026-02-21T08:30:46.8766788Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.8767073Z add.s32 %r427, %r68, %r32; 2026-02-21T08:30:46.8767342Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8767632Z cvt.s64.s32 %rd75, %r427; 2026-02-21T08:30:46.8767785Z add.s64 %rd64, %rd48, %rd75; 2026-02-21T08:30:46.8768053Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8768332Z // begin inline asm 2026-02-21T08:30:46.8768534Z cp.async.cg.shared.global [ %r1559 + 0 ], [ %rd64 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8768755Z // end inline asm 2026-02-21T08:30:46.8768903Z cp.async.commit_group; 2026-02-21T08:30:46.8769161Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8769452Z add.s64 %rd65, %rd11, 256; 2026-02-21T08:30:46.8769690Z add.s64 %rd66, %rd12, 256; 2026-02-21T08:30:46.8769951Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8770240Z bar.sync 0; 2026-02-21T08:30:46.8770372Z // begin inline asm 2026-02-21T08:30:46.8770575Z cp.async.cg.shared.global [ %r1561 + 0 ], [ %rd65 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8770797Z // end inline asm 2026-02-21T08:30:46.8770939Z // begin inline asm 2026-02-21T08:30:46.8771129Z cp.async.cg.shared.global [ %r1563 + 0 ], [ %rd66 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8771358Z // end inline asm 2026-02-21T08:30:46.8771509Z cp.async.commit_group; 2026-02-21T08:30:46.8771802Z .loc 1 54 62 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.8772098Z add.s32 %r428, %r68, %r37; 2026-02-21T08:30:46.8772357Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8772699Z cvt.s64.s32 %rd76, %r428; 2026-02-21T08:30:46.8772859Z add.s64 %rd67, %rd48, %rd76; 2026-02-21T08:30:46.8773128Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8773410Z // begin inline asm 2026-02-21T08:30:46.8773604Z cp.async.cg.shared.global [ %r1565 + 0 ], [ %rd67 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8773832Z // end inline asm 2026-02-21T08:30:46.8773971Z cp.async.commit_group; 2026-02-21T08:30:46.8774230Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8774510Z cp.async.wait_group 4; 2026-02-21T08:30:46.8774668Z bar.sync 0; 2026-02-21T08:30:46.8774835Z ld.shared.v4.b32 {%r429, %r430, %r431, %r432}, [%r40]; 2026-02-21T08:30:46.8775043Z mov.b32 {%rs1, %rs2}, %r432; 2026-02-21T08:30:46.8778776Z [133s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:30:46.8779961Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, None], range_num_stages=[4, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T08:30:46.8781115Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T08:30:46.8781358Z `ptxas` stderr: 2026-02-21T08:30:46.8781808Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 248 in function _helion_matmul_bf16_int4. 
Try to compile with register target of 34 or higher. 2026-02-21T08:30:46.8782290Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:30:46.8782435Z 2026-02-21T08:30:46.8782824Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp29456nj1.ptx -o /tmp/tmp29456nj1.ptx.o 2026-02-21T08:30:46.8783246Z 2026-02-21T08:30:46.8783369Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:30:46.8783612Z mov.b32 {%rs3, %rs4}, %r431; 2026-02-21T08:30:46.8783781Z mov.b32 {%rs5, %rs6}, %r430; 2026-02-21T08:30:46.8783936Z mov.b32 {%rs7, %rs8}, %r429; 2026-02-21T08:30:46.8784136Z ld.shared.v4.b32 {%r433, %r434, %r435, %r436}, [%r40+16]; 2026-02-21T08:30:46.8784347Z mov.b32 {%rs9, %rs10}, %r436; 2026-02-21T08:30:46.8784517Z mov.b32 {%rs11, %rs12}, %r435; 2026-02-21T08:30:46.8784678Z mov.b32 {%rs13, %rs14}, %r434; 2026-02-21T08:30:46.8784843Z mov.b32 {%rs15, %rs16}, %r433; 2026-02-21T08:30:46.8785113Z .loc 1 52 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:52:32 2026-02-21T08:30:46.8785410Z cvt.f32.bf16 %r385, %rs7; 2026-02-21T08:30:46.8785572Z cvt.f32.bf16 %r386, %rs8; 2026-02-21T08:30:46.8785727Z cvt.f32.bf16 %r387, %rs5; 2026-02-21T08:30:46.8785975Z cvt.f32.bf16 %r388, %rs6; 2026-02-21T08:30:46.8786124Z cvt.f32.bf16 %r389, %rs3; 2026-02-21T08:30:46.8786282Z cvt.f32.bf16 %r390, %rs4; 2026-02-21T08:30:46.8786433Z cvt.f32.bf16 %r391, %rs1; 2026-02-21T08:30:46.8786587Z cvt.f32.bf16 %r392, %rs2; 2026-02-21T08:30:46.8786739Z cvt.f32.bf16 %r393, %rs15; 2026-02-21T08:30:46.8786902Z cvt.f32.bf16 %r394, %rs16; 2026-02-21T08:30:46.8787056Z cvt.f32.bf16 %r395, %rs13; 2026-02-21T08:30:46.8787219Z cvt.f32.bf16 %r396, %rs14; 2026-02-21T08:30:46.8787379Z cvt.f32.bf16 %r397, %rs11; 2026-02-21T08:30:46.8787532Z cvt.f32.bf16 %r398, %rs12; 2026-02-21T08:30:46.8787696Z cvt.f32.bf16 %r399, %rs9; 2026-02-21T08:30:46.8787847Z cvt.f32.bf16 %r400, %rs10; 
2026-02-21T08:30:46.8788007Z $L__tmp37: 2026-02-21T08:30:46.8788295Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8788691Z add.s32 %r437, %r419, %r2332; 2026-02-21T08:30:46.8788856Z shl.b32 %r438, %r421, 3; 2026-02-21T08:30:46.8789016Z add.s32 %r1567, %r437, %r438; 2026-02-21T08:30:46.8789178Z // begin inline asm 2026-02-21T08:30:46.8789556Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1567 + 0], 16, {%r385, %r386, %r387, %r388, %r389, %r390, %r391, %r392, %r393, %r394, %r395, %r396, %r397, %r398, %r399, %r400}; 2026-02-21T08:30:46.8789964Z // end inline asm 2026-02-21T08:30:46.8790101Z // begin inline asm 2026-02-21T08:30:46.8790263Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.8790425Z // end inline asm 2026-02-21T08:30:46.8790568Z bar.sync 0; 2026-02-21T08:30:46.8790698Z $L__tmp38: 2026-02-21T08:30:46.8790953Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8791253Z ld.shared.b8 %rs17, [%r42]; 2026-02-21T08:30:46.8791424Z ld.shared.b8 %rs18, [%r42+128]; 2026-02-21T08:30:46.8791638Z ld.shared.b8 %rs19, [%r42+256]; 2026-02-21T08:30:46.8791809Z ld.shared.b8 %rs20, [%r42+384]; 2026-02-21T08:30:46.8791981Z ld.shared.b8 %rs21, [%r42+512]; 2026-02-21T08:30:46.8792142Z ld.shared.b8 %rs22, [%r42+640]; 2026-02-21T08:30:46.8792310Z ld.shared.b8 %rs23, [%r42+768]; 2026-02-21T08:30:46.8792472Z ld.shared.b8 %rs24, [%r44]; 2026-02-21T08:30:46.8792645Z ld.shared.b8 %rs25, [%r42+1024]; 2026-02-21T08:30:46.8792820Z ld.shared.b8 %rs26, [%r42+1152]; 2026-02-21T08:30:46.8792989Z ld.shared.b8 %rs27, [%r42+1280]; 2026-02-21T08:30:46.8793159Z ld.shared.b8 %rs28, [%r42+1408]; 2026-02-21T08:30:46.8793319Z ld.shared.b8 %rs29, [%r42+1536]; 2026-02-21T08:30:46.8793486Z ld.shared.b8 %rs30, [%r42+1664]; 2026-02-21T08:30:46.8793646Z ld.shared.b8 %rs31, [%r42+1792]; 2026-02-21T08:30:46.8793812Z ld.shared.b8 %rs32, [%r46]; 2026-02-21T08:30:46.8793968Z 
ld.shared.b8 %rs33, [%r42+2048]; 2026-02-21T08:30:46.8794135Z ld.shared.b8 %rs34, [%r42+2176]; 2026-02-21T08:30:46.8794300Z ld.shared.b8 %rs35, [%r42+2304]; 2026-02-21T08:30:46.8794463Z ld.shared.b8 %rs36, [%r42+2432]; 2026-02-21T08:30:46.8794635Z ld.shared.b8 %rs37, [%r42+2560]; 2026-02-21T08:30:46.8794796Z ld.shared.b8 %rs38, [%r42+2688]; 2026-02-21T08:30:46.8794963Z ld.shared.b8 %rs39, [%r42+2816]; 2026-02-21T08:30:46.8795121Z ld.shared.b8 %rs40, [%r48]; 2026-02-21T08:30:46.8795288Z ld.shared.b8 %rs41, [%r42+3072]; 2026-02-21T08:30:46.8795448Z ld.shared.b8 %rs42, [%r42+3200]; 2026-02-21T08:30:46.8795616Z ld.shared.b8 %rs43, [%r42+3328]; 2026-02-21T08:30:46.8795776Z ld.shared.b8 %rs44, [%r42+3456]; 2026-02-21T08:30:46.8795941Z ld.shared.b8 %rs45, [%r42+3584]; 2026-02-21T08:30:46.8796111Z ld.shared.b8 %rs46, [%r42+3712]; 2026-02-21T08:30:46.8796270Z ld.shared.b8 %rs47, [%r42+3840]; 2026-02-21T08:30:46.8796439Z ld.shared.b8 %rs48, [%r50]; 2026-02-21T08:30:46.8796724Z .loc 1 57 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:57:28 2026-02-21T08:30:46.8797016Z shl.b16 %rs49, %rs17, 4; 2026-02-21T08:30:46.8797184Z shl.b16 %rs50, %rs18, 4; 2026-02-21T08:30:46.8797354Z shl.b16 %rs51, %rs19, 4; 2026-02-21T08:30:46.8797568Z shl.b16 %rs52, %rs20, 4; 2026-02-21T08:30:46.8797728Z shl.b16 %rs53, %rs21, 4; 2026-02-21T08:30:46.8797886Z shl.b16 %rs54, %rs22, 4; 2026-02-21T08:30:46.8798039Z shl.b16 %rs55, %rs23, 4; 2026-02-21T08:30:46.8798200Z shl.b16 %rs56, %rs24, 4; 2026-02-21T08:30:46.8798354Z shl.b16 %rs57, %rs25, 4; 2026-02-21T08:30:46.8798514Z shl.b16 %rs58, %rs26, 4; 2026-02-21T08:30:46.8798669Z shl.b16 %rs59, %rs27, 4; 2026-02-21T08:30:46.8798831Z shl.b16 %rs60, %rs28, 4; 2026-02-21T08:30:46.8798985Z shl.b16 %rs61, %rs29, 4; 2026-02-21T08:30:46.8799145Z shl.b16 %rs62, %rs30, 4; 2026-02-21T08:30:46.8799298Z shl.b16 %rs63, %rs31, 4; 2026-02-21T08:30:46.8799456Z shl.b16 %rs64, %rs32, 4; 2026-02-21T08:30:46.8799616Z shl.b16 %rs65, %rs33, 4; 
2026-02-21T08:30:46.8799768Z shl.b16 %rs66, %rs34, 4; 2026-02-21T08:30:46.8799928Z shl.b16 %rs67, %rs35, 4; 2026-02-21T08:30:46.8800080Z shl.b16 %rs68, %rs36, 4; 2026-02-21T08:30:46.8800300Z shl.b16 %rs69, %rs37, 4; 2026-02-21T08:30:46.8800457Z shl.b16 %rs70, %rs38, 4; 2026-02-21T08:30:46.8800616Z shl.b16 %rs71, %rs39, 4; 2026-02-21T08:30:46.8800768Z shl.b16 %rs72, %rs40, 4; 2026-02-21T08:30:46.8800928Z shl.b16 %rs73, %rs41, 4; 2026-02-21T08:30:46.8801079Z shl.b16 %rs74, %rs42, 4; 2026-02-21T08:30:46.8801236Z shl.b16 %rs75, %rs43, 4; 2026-02-21T08:30:46.8801394Z shl.b16 %rs76, %rs44, 4; 2026-02-21T08:30:46.8801585Z shl.b16 %rs77, %rs45, 4; 2026-02-21T08:30:46.8801750Z shl.b16 %rs78, %rs46, 4; 2026-02-21T08:30:46.8801905Z shl.b16 %rs79, %rs47, 4; 2026-02-21T08:30:46.8802065Z shl.b16 %rs80, %rs48, 4; 2026-02-21T08:30:46.8802336Z .loc 1 72 58 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:72:58 2026-02-21T08:30:46.8802652Z selp.b16 %rs81, %rs49, %rs17, %p246; 2026-02-21T08:30:46.8802835Z cvt.s16.s8 %rs82, %rs81; 2026-02-21T08:30:46.8802998Z shr.s16 %rs83, %rs82, 4; 2026-02-21T08:30:46.8803172Z selp.b16 %rs84, %rs50, %rs18, %p246; 2026-02-21T08:30:46.8803356Z cvt.s16.s8 %rs85, %rs84; 2026-02-21T08:30:46.8803518Z shr.s16 %rs86, %rs85, 4; 2026-02-21T08:30:46.8803678Z selp.b16 %rs87, %rs51, %rs19, %p246; 2026-02-21T08:30:46.8803865Z cvt.s16.s8 %rs88, %rs87; 2026-02-21T08:30:46.8804021Z shr.s16 %rs89, %rs88, 4; 2026-02-21T08:30:46.8804193Z selp.b16 %rs90, %rs52, %rs20, %p246; 2026-02-21T08:30:46.8804369Z cvt.s16.s8 %rs91, %rs90; 2026-02-21T08:30:46.8804532Z shr.s16 %rs92, %rs91, 4; 2026-02-21T08:30:46.8804693Z selp.b16 %rs93, %rs53, %rs21, %p246; 2026-02-21T08:30:46.8804886Z cvt.s16.s8 %rs94, %rs93; 2026-02-21T08:30:46.8805046Z shr.s16 %rs95, %rs94, 4; 2026-02-21T08:30:46.8805201Z selp.b16 %rs96, %rs54, %rs22, %p246; 2026-02-21T08:30:46.8805383Z cvt.s16.s8 %rs97, %rs96; 2026-02-21T08:30:46.8805532Z shr.s16 %rs98, %rs97, 4; 2026-02-21T08:30:46.8805693Z selp.b16 
%rs99, %rs55, %rs23, %p246; 2026-02-21T08:30:46.8805864Z cvt.s16.s8 %rs100, %rs99; 2026-02-21T08:30:46.8806030Z shr.s16 %rs101, %rs100, 4; 2026-02-21T08:30:46.8806200Z selp.b16 %rs102, %rs56, %rs24, %p246; 2026-02-21T08:30:46.8806389Z cvt.s16.s8 %rs103, %rs102; 2026-02-21T08:30:46.8806552Z shr.s16 %rs104, %rs103, 4; 2026-02-21T08:30:46.8806716Z selp.b16 %rs105, %rs57, %rs25, %p246; 2026-02-21T08:30:46.8806896Z cvt.s16.s8 %rs106, %rs105; 2026-02-21T08:30:46.8807051Z shr.s16 %rs107, %rs106, 4; 2026-02-21T08:30:46.8807218Z selp.b16 %rs108, %rs58, %rs26, %p246; 2026-02-21T08:30:46.8807390Z cvt.s16.s8 %rs109, %rs108; 2026-02-21T08:30:46.8807549Z shr.s16 %rs110, %rs109, 4; 2026-02-21T08:30:46.8807706Z selp.b16 %rs111, %rs59, %rs27, %p246; 2026-02-21T08:30:46.8807881Z cvt.s16.s8 %rs112, %rs111; 2026-02-21T08:30:46.8808034Z shr.s16 %rs113, %rs112, 4; 2026-02-21T08:30:46.8808201Z selp.b16 %rs114, %rs60, %rs28, %p246; 2026-02-21T08:30:46.8808377Z cvt.s16.s8 %rs115, %rs114; 2026-02-21T08:30:46.8808530Z shr.s16 %rs116, %rs115, 4; 2026-02-21T08:30:46.8808696Z selp.b16 %rs117, %rs61, %rs29, %p246; 2026-02-21T08:30:46.8808868Z cvt.s16.s8 %rs118, %rs117; 2026-02-21T08:30:46.8809086Z shr.s16 %rs119, %rs118, 4; 2026-02-21T08:30:46.8809242Z selp.b16 %rs120, %rs62, %rs30, %p246; 2026-02-21T08:30:46.8809414Z cvt.s16.s8 %rs121, %rs120; 2026-02-21T08:30:46.8809565Z shr.s16 %rs122, %rs121, 4; 2026-02-21T08:30:46.8809728Z selp.b16 %rs123, %rs63, %rs31, %p246; 2026-02-21T08:30:46.8809902Z cvt.s16.s8 %rs124, %rs123; 2026-02-21T08:30:46.8810052Z shr.s16 %rs125, %rs124, 4; 2026-02-21T08:30:46.8810213Z selp.b16 %rs126, %rs64, %rs32, %p246; 2026-02-21T08:30:46.8810380Z cvt.s16.s8 %rs127, %rs126; 2026-02-21T08:30:46.8810537Z shr.s16 %rs128, %rs127, 4; 2026-02-21T08:30:46.8810692Z selp.b16 %rs129, %rs65, %rs33, %p246; 2026-02-21T08:30:46.8810866Z cvt.s16.s8 %rs130, %rs129; 2026-02-21T08:30:46.8811015Z shr.s16 %rs131, %rs130, 4; 2026-02-21T08:30:46.8811178Z selp.b16 %rs132, %rs66, %rs34, %p246; 
2026-02-21T08:30:46.8811347Z cvt.s16.s8 %rs133, %rs132; 2026-02-21T08:30:46.8811505Z shr.s16 %rs134, %rs133, 4; 2026-02-21T08:30:46.8811780Z selp.b16 %rs135, %rs67, %rs35, %p246; 2026-02-21T08:30:46.8811955Z cvt.s16.s8 %rs136, %rs135; 2026-02-21T08:30:46.8812117Z shr.s16 %rs137, %rs136, 4; 2026-02-21T08:30:46.8812274Z selp.b16 %rs138, %rs68, %rs36, %p246; 2026-02-21T08:30:46.8812448Z cvt.s16.s8 %rs139, %rs138; 2026-02-21T08:30:46.8812597Z shr.s16 %rs140, %rs139, 4; 2026-02-21T08:30:46.8812763Z selp.b16 %rs141, %rs69, %rs37, %p246; 2026-02-21T08:30:46.8812932Z cvt.s16.s8 %rs142, %rs141; 2026-02-21T08:30:46.8813091Z shr.s16 %rs143, %rs142, 4; 2026-02-21T08:30:46.8813250Z selp.b16 %rs144, %rs70, %rs38, %p246; 2026-02-21T08:30:46.8813431Z cvt.s16.s8 %rs145, %rs144; 2026-02-21T08:30:46.8813594Z shr.s16 %rs146, %rs145, 4; 2026-02-21T08:30:46.8813759Z selp.b16 %rs147, %rs71, %rs39, %p246; 2026-02-21T08:30:46.8813943Z cvt.s16.s8 %rs148, %rs147; 2026-02-21T08:30:46.8814099Z shr.s16 %rs149, %rs148, 4; 2026-02-21T08:30:46.8814265Z selp.b16 %rs150, %rs72, %rs40, %p246; 2026-02-21T08:30:46.8814436Z cvt.s16.s8 %rs151, %rs150; 2026-02-21T08:30:46.8814598Z shr.s16 %rs152, %rs151, 4; 2026-02-21T08:30:46.8814755Z selp.b16 %rs153, %rs73, %rs41, %p246; 2026-02-21T08:30:46.8814931Z cvt.s16.s8 %rs154, %rs153; 2026-02-21T08:30:46.8815090Z shr.s16 %rs155, %rs154, 4; 2026-02-21T08:30:46.8815246Z selp.b16 %rs156, %rs74, %rs42, %p246; 2026-02-21T08:30:46.8815419Z cvt.s16.s8 %rs157, %rs156; 2026-02-21T08:30:46.8815568Z shr.s16 %rs158, %rs157, 4; 2026-02-21T08:30:46.8815733Z selp.b16 %rs159, %rs75, %rs43, %p246; 2026-02-21T08:30:46.8815901Z cvt.s16.s8 %rs160, %rs159; 2026-02-21T08:30:46.8816057Z shr.s16 %rs161, %rs160, 4; 2026-02-21T08:30:46.8816213Z selp.b16 %rs162, %rs76, %rs44, %p246; 2026-02-21T08:30:46.8816386Z cvt.s16.s8 %rs163, %rs162; 2026-02-21T08:30:46.8816535Z shr.s16 %rs164, %rs163, 4; 2026-02-21T08:30:46.8816700Z selp.b16 %rs165, %rs77, %rs45, %p246; 2026-02-21T08:30:46.8816874Z 
cvt.s16.s8 %rs166, %rs165; 2026-02-21T08:30:46.8817024Z shr.s16 %rs167, %rs166, 4; 2026-02-21T08:30:46.8817192Z selp.b16 %rs168, %rs78, %rs46, %p246; 2026-02-21T08:30:46.8817360Z cvt.s16.s8 %rs169, %rs168; 2026-02-21T08:30:46.8817516Z shr.s16 %rs170, %rs169, 4; 2026-02-21T08:30:46.8817671Z selp.b16 %rs171, %rs79, %rs47, %p246; 2026-02-21T08:30:46.8817845Z cvt.s16.s8 %rs172, %rs171; 2026-02-21T08:30:46.8817992Z shr.s16 %rs173, %rs172, 4; 2026-02-21T08:30:46.8818154Z selp.b16 %rs174, %rs80, %rs48, %p246; 2026-02-21T08:30:46.8818327Z cvt.s16.s8 %rs175, %rs174; 2026-02-21T08:30:46.8818476Z shr.s16 %rs176, %rs175, 4; 2026-02-21T08:30:46.8818748Z .loc 1 77 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:77:32 2026-02-21T08:30:46.8819037Z cvt.rn.f32.s16 %r439, %rs83; 2026-02-21T08:30:46.8819204Z cvt.rn.f32.s16 %r440, %rs86; 2026-02-21T08:30:46.8819360Z cvt.rn.f32.s16 %r441, %rs89; 2026-02-21T08:30:46.8819520Z cvt.rn.f32.s16 %r442, %rs92; 2026-02-21T08:30:46.8819673Z cvt.rn.f32.s16 %r443, %rs95; 2026-02-21T08:30:46.8819831Z cvt.rn.f32.s16 %r444, %rs98; 2026-02-21T08:30:46.8819994Z cvt.rn.f32.s16 %r445, %rs101; 2026-02-21T08:30:46.8820212Z cvt.rn.f32.s16 %r446, %rs104; 2026-02-21T08:30:46.8820377Z cvt.rn.f32.s16 %r447, %rs107; 2026-02-21T08:30:46.8820534Z cvt.rn.f32.s16 %r448, %rs110; 2026-02-21T08:30:46.8820699Z cvt.rn.f32.s16 %r449, %rs113; 2026-02-21T08:30:46.8820853Z cvt.rn.f32.s16 %r450, %rs116; 2026-02-21T08:30:46.8821015Z cvt.rn.f32.s16 %r451, %rs119; 2026-02-21T08:30:46.8821167Z cvt.rn.f32.s16 %r452, %rs122; 2026-02-21T08:30:46.8821327Z cvt.rn.f32.s16 %r453, %rs125; 2026-02-21T08:30:46.8821481Z cvt.rn.f32.s16 %r454, %rs128; 2026-02-21T08:30:46.8821689Z cvt.rn.f32.s16 %r455, %rs131; 2026-02-21T08:30:46.8821853Z cvt.rn.f32.s16 %r456, %rs134; 2026-02-21T08:30:46.8822010Z cvt.rn.f32.s16 %r457, %rs137; 2026-02-21T08:30:46.8822180Z cvt.rn.f32.s16 %r458, %rs140; 2026-02-21T08:30:46.8822335Z cvt.rn.f32.s16 %r459, %rs143; 2026-02-21T08:30:46.8822496Z 
cvt.rn.f32.s16 %r460, %rs146; 2026-02-21T08:30:46.8822703Z cvt.rn.f32.s16 %r461, %rs149; 2026-02-21T08:30:46.8822879Z cvt.rn.f32.s16 %r462, %rs152; 2026-02-21T08:30:46.8823038Z cvt.rn.f32.s16 %r463, %rs155; 2026-02-21T08:30:46.8823200Z cvt.rn.f32.s16 %r464, %rs158; 2026-02-21T08:30:46.8823358Z cvt.rn.f32.s16 %r465, %rs161; 2026-02-21T08:30:46.8823511Z cvt.rn.f32.s16 %r466, %rs164; 2026-02-21T08:30:46.8823670Z cvt.rn.f32.s16 %r467, %rs167; 2026-02-21T08:30:46.8823822Z cvt.rn.f32.s16 %r468, %rs170; 2026-02-21T08:30:46.8823982Z cvt.rn.f32.s16 %r469, %rs173; 2026-02-21T08:30:46.8824135Z cvt.rn.f32.s16 %r470, %rs176; 2026-02-21T08:30:46.8824299Z st.shared.b32 [%r51], %r439; 2026-02-21T08:30:46.8824461Z st.shared.b32 [%r51+8], %r440; 2026-02-21T08:30:46.8824639Z st.shared.b32 [%r51+16384], %r455; 2026-02-21T08:30:46.8824815Z st.shared.b32 [%r51+16392], %r456; 2026-02-21T08:30:46.8824995Z st.shared.b32 [%r52], %r441; 2026-02-21T08:30:46.8825162Z st.shared.b32 [%r52+8], %r442; 2026-02-21T08:30:46.8825326Z st.shared.b32 [%r52+16384], %r457; 2026-02-21T08:30:46.8825503Z st.shared.b32 [%r52+16392], %r458; 2026-02-21T08:30:46.8825671Z st.shared.b32 [%r53], %r443; 2026-02-21T08:30:46.8825837Z st.shared.b32 [%r53+8], %r444; 2026-02-21T08:30:46.8825997Z st.shared.b32 [%r53+16384], %r459; 2026-02-21T08:30:46.8826169Z st.shared.b32 [%r53+16392], %r460; 2026-02-21T08:30:46.8826333Z st.shared.b32 [%r54], %r445; 2026-02-21T08:30:46.8826494Z st.shared.b32 [%r54+8], %r446; 2026-02-21T08:30:46.8826659Z st.shared.b32 [%r54+16384], %r461; 2026-02-21T08:30:46.8826824Z st.shared.b32 [%r54+16392], %r462; 2026-02-21T08:30:46.8826997Z st.shared.b32 [%r55], %r447; 2026-02-21T08:30:46.8827151Z st.shared.b32 [%r55+8], %r448; 2026-02-21T08:30:46.8827318Z st.shared.b32 [%r55+16384], %r463; 2026-02-21T08:30:46.8827482Z st.shared.b32 [%r55+16392], %r464; 2026-02-21T08:30:46.8827651Z st.shared.b32 [%r56], %r449; 2026-02-21T08:30:46.8827806Z st.shared.b32 [%r56+8], %r450; 2026-02-21T08:30:46.8827974Z 
st.shared.b32 [%r56+16384], %r465; 2026-02-21T08:30:46.8828143Z st.shared.b32 [%r56+16392], %r466; 2026-02-21T08:30:46.8828317Z st.shared.b32 [%r57], %r451; 2026-02-21T08:30:46.8828480Z st.shared.b32 [%r57+8], %r452; 2026-02-21T08:30:46.8828638Z st.shared.b32 [%r57+16384], %r467; 2026-02-21T08:30:46.8828809Z st.shared.b32 [%r57+16392], %r468; 2026-02-21T08:30:46.8828974Z st.shared.b32 [%r58], %r453; 2026-02-21T08:30:46.8829136Z st.shared.b32 [%r58+8], %r454; 2026-02-21T08:30:46.8829295Z st.shared.b32 [%r58+16384], %r469; 2026-02-21T08:30:46.8829467Z st.shared.b32 [%r58+16392], %r470; 2026-02-21T08:30:46.8829627Z $L__tmp39: 2026-02-21T08:30:46.8829931Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8830278Z // begin inline asm 2026-02-21T08:30:46.8830442Z fence.proxy.async.shared::cta; 2026-02-21T08:30:46.8830616Z // end inline asm 2026-02-21T08:30:46.8830759Z bar.sync 0; 2026-02-21T08:30:46.8830904Z setp.ne.b32 %p12, %r74, 0; 2026-02-21T08:30:46.8831064Z @%p12 bra $L__BB0_5; 2026-02-21T08:30:46.8831318Z // %bb.4: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.8831577Z elect.sync %r495|%p14, -1; 2026-02-21T08:30:46.8831746Z mov.b32 %r473, 69208336; 2026-02-21T08:30:46.8831905Z mov.pred %p13, 0; 2026-02-21T08:30:46.8832047Z // begin inline asm 2026-02-21T08:30:46.8832302Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 0 ], %rd416, %r473, %p13; 2026-02-21T08:30:46.8832572Z // end inline asm 2026-02-21T08:30:46.8832716Z // begin inline asm 2026-02-21T08:30:46.8832951Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 8 ], %rd417, %r473, %p15; 2026-02-21T08:30:46.8833222Z // end inline asm 2026-02-21T08:30:46.8833359Z // begin inline asm 2026-02-21T08:30:46.8833600Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 16 ], %rd418, %r473, %p15; 2026-02-21T08:30:46.8833869Z // end inline asm 2026-02-21T08:30:46.8834006Z // begin 
inline asm 2026-02-21T08:30:46.8834300Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 24 ], %rd419, %r473, %p15; 2026-02-21T08:30:46.8834562Z // end inline asm 2026-02-21T08:30:46.8834706Z // begin inline asm 2026-02-21T08:30:46.8834930Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 32 ], %rd420, %r473, %p15; 2026-02-21T08:30:46.8835194Z // end inline asm 2026-02-21T08:30:46.8835337Z // begin inline asm 2026-02-21T08:30:46.8835561Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 40 ], %rd421, %r473, %p15; 2026-02-21T08:30:46.8835823Z // end inline asm 2026-02-21T08:30:46.8835955Z // begin inline asm 2026-02-21T08:30:46.8836186Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 48 ], %rd422, %r473, %p15; 2026-02-21T08:30:46.8836438Z // end inline asm 2026-02-21T08:30:46.8836575Z // begin inline asm 2026-02-21T08:30:46.8836806Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 56 ], %rd423, %r473, %p15; 2026-02-21T08:30:46.8837064Z // end inline asm 2026-02-21T08:30:46.8837212Z add.s32 %r497, %r265, 69632; 2026-02-21T08:30:46.8837370Z cvt.u64.u32 %rd85, %r497; 2026-02-21T08:30:46.8837529Z // begin inline asm 2026-02-21T08:30:46.8837737Z @%p14 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd85]; 2026-02-21T08:30:46.8837973Z // end inline asm 2026-02-21T08:30:46.8838106Z $L__tmp40: 2026-02-21T08:30:46.8838283Z $L__BB0_5: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.8838617Z .loc 1 0 0 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:0 2026-02-21T08:30:46.8838894Z or.b32 %r69, %r67, %r8; 2026-02-21T08:30:46.8839055Z or.b32 %r70, %r413, %r12; 2026-02-21T08:30:46.8839209Z or.b32 %r71, %r413, %r13; 2026-02-21T08:30:46.8839367Z or.b32 %r72, %r413, %r14; 2026-02-21T08:30:46.8839514Z or.b32 %r73, %r413, %r15; 2026-02-21T08:30:46.8839788Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8840107Z 
add.s64 %rd86, %rd11, 384; 2026-02-21T08:30:46.8840265Z add.s64 %rd87, %rd12, 384; 2026-02-21T08:30:46.8840541Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8840838Z // begin inline asm 2026-02-21T08:30:46.8841069Z cp.async.cg.shared.global [ %r1731 + 0 ], [ %rd86 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8841308Z // end inline asm 2026-02-21T08:30:46.8841460Z // begin inline asm 2026-02-21T08:30:46.8841701Z cp.async.cg.shared.global [ %r1733 + 0 ], [ %rd87 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8841940Z // end inline asm 2026-02-21T08:30:46.8842098Z cp.async.commit_group; 2026-02-21T08:30:46.8842377Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.8842695Z add.s32 %r507, %r68, %r59; 2026-02-21T08:30:46.8842978Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8843346Z cvt.s64.s32 %rd90, %r507; 2026-02-21T08:30:46.8843509Z add.s64 %rd88, %rd48, %rd90; 2026-02-21T08:30:46.8843794Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8844099Z // begin inline asm 2026-02-21T08:30:46.8844308Z cp.async.cg.shared.global [ %r1735 + 0 ], [ %rd88 + 0 ], 0x10, %r499; 2026-02-21T08:30:46.8844546Z // end inline asm 2026-02-21T08:30:46.8844694Z cp.async.commit_group; 2026-02-21T08:30:46.8844977Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8845278Z add.s32 %r2525, %r62, %r67; 2026-02-21T08:30:46.8845456Z shl.b32 %r508, %r66, 16; 2026-02-21T08:30:46.8845619Z or.b32 %r509, %r63, %r508; 2026-02-21T08:30:46.8845800Z mad.wide.s32 %rd608, %r509, 2, %rd9; 2026-02-21T08:30:46.8845986Z or.b32 %r2524, %r64, %r508; 2026-02-21T08:30:46.8846145Z mov.b32 %r2529, 1; 2026-02-21T08:30:46.8846351Z mov.b64 %rd609, -32; 2026-02-21T08:30:46.8846510Z mov.b32 %r2528, %r2526; 2026-02-21T08:30:46.8846674Z mov.b32 %r2530, %r2526; 2026-02-21T08:30:46.8846828Z 
bra.uni $L__BB0_6; 2026-02-21T08:30:46.8847027Z $L__BB0_8: // in Loop: Header=BB0_6 Depth=2 2026-02-21T08:30:46.8847368Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8847681Z add.s64 %rd609, %rd609, 32; 2026-02-21T08:30:46.8847860Z setp.lt.u64 %p51, %rd609, 384; 2026-02-21T08:30:46.8848023Z $L__tmp41: 2026-02-21T08:30:46.8848341Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8848746Z add.s32 %r615, %r2529, 1; 2026-02-21T08:30:46.8848910Z setp.gt.s32 %p52, %r615, 1; 2026-02-21T08:30:46.8849075Z selp.b32 %r2529, 0, %r615, %p52; 2026-02-21T08:30:46.8849249Z selp.b32 %r616, 1, 0, %p52; 2026-02-21T08:30:46.8849408Z xor.b32 %r91, %r2530, %r616; 2026-02-21T08:30:46.8849568Z $L__tmp42: 2026-02-21T08:30:46.8849815Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8850111Z mad.wide.s32 %rd101, %r2524, 2, %rd47; 2026-02-21T08:30:46.8850406Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8850693Z add.s32 %r609, %r87, %r2555; 2026-02-21T08:30:46.8850859Z selp.b32 %r610, 16, 0, %p51; 2026-02-21T08:30:46.8851012Z // begin inline asm 2026-02-21T08:30:46.8851220Z cp.async.cg.shared.global [ %r609 + 0 ], [ %rd608 + 0 ], 0x10, %r610; 2026-02-21T08:30:46.8851443Z // end inline asm 2026-02-21T08:30:46.8851619Z add.s32 %r611, %r609, 4096; 2026-02-21T08:30:46.8851783Z // begin inline asm 2026-02-21T08:30:46.8851983Z cp.async.cg.shared.global [ %r611 + 0 ], [ %rd101 + 0 ], 0x10, %r610; 2026-02-21T08:30:46.8852216Z // end inline asm 2026-02-21T08:30:46.8852362Z cp.async.commit_group; 2026-02-21T08:30:46.8852636Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8852927Z cvt.s64.s32 %rd103, %r2525; 2026-02-21T08:30:46.8853097Z add.s64 %rd102, %rd48, %rd103; 2026-02-21T08:30:46.8853366Z .loc 1 54 87 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8853659Z add.s32 %r613, %r88, %r2555; 2026-02-21T08:30:46.8853820Z // begin inline asm 2026-02-21T08:30:46.8854016Z cp.async.cg.shared.global [ %r613 + 0 ], [ %rd102 + 0 ], 0x10, %r610; 2026-02-21T08:30:46.8854245Z // end inline asm 2026-02-21T08:30:46.8854386Z cp.async.commit_group; 2026-02-21T08:30:46.8854655Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8854950Z add.s32 %r2525, %r2525, 262144; 2026-02-21T08:30:46.8855124Z add.s64 %rd608, %rd608, 128; 2026-02-21T08:30:46.8855288Z add.s32 %r2524, %r2524, 64; 2026-02-21T08:30:46.8855453Z setp.lt.u64 %p53, %rd609, 448; 2026-02-21T08:30:46.8855713Z mov.b32 %r2526, %r2530; 2026-02-21T08:30:46.8855862Z mov.b32 %r2530, %r91; 2026-02-21T08:30:46.8856017Z @%p53 bra $L__BB0_6; 2026-02-21T08:30:46.8856161Z bra.uni $L__BB0_9; 2026-02-21T08:30:46.8856347Z $L__BB0_6: // Parent Loop BB0_3 Depth=1 2026-02-21T08:30:46.8856593Z // => This Inner Loop Header: Depth=2 2026-02-21T08:30:46.8856812Z add.s32 %r530, %r2528, 1; 2026-02-21T08:30:46.8856971Z setp.gt.s32 %p33, %r530, 2; 2026-02-21T08:30:46.8857145Z selp.b32 %r2528, 0, %r530, %p33; 2026-02-21T08:30:46.8857426Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8857719Z cp.async.wait_group 4; 2026-02-21T08:30:46.8857877Z bar.sync 0; 2026-02-21T08:30:46.8858011Z shl.b32 %r531, %r2528, 12; 2026-02-21T08:30:46.8858172Z shl.b32 %r532, %r2528, 13; 2026-02-21T08:30:46.8858378Z add.s32 %r534, %r265, %r532; 2026-02-21T08:30:46.8858544Z add.s32 %r87, %r534, 32768; 2026-02-21T08:30:46.8858697Z add.s32 %r535, %r87, %r39; 2026-02-21T08:30:46.8858895Z ld.shared.v4.b32 {%r536, %r537, %r538, %r539}, [%r535]; 2026-02-21T08:30:46.8859108Z mov.b32 {%rs177, %rs178}, %r539; 2026-02-21T08:30:46.8859276Z mov.b32 {%rs179, %rs180}, %r538; 2026-02-21T08:30:46.8859447Z mov.b32 {%rs181, %rs182}, %r537; 
2026-02-21T08:30:46.8859609Z mov.b32 {%rs183, %rs184}, %r536; 2026-02-21T08:30:46.8859712Z ld.shared.v4.b32 {%r540, %r541, %r542, %r543}, [%r535+16]; 2026-02-21T08:30:46.8859773Z mov.b32 {%rs185, %rs186}, %r543; 2026-02-21T08:30:46.8859833Z mov.b32 {%rs187, %rs188}, %r542; 2026-02-21T08:30:46.8859891Z mov.b32 {%rs189, %rs190}, %r541; 2026-02-21T08:30:46.8859957Z mov.b32 {%rs191, %rs192}, %r540; 2026-02-21T08:30:46.8860127Z .loc 1 52 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:52:32 2026-02-21T08:30:46.8860191Z cvt.f32.bf16 %r514, %rs183; 2026-02-21T08:30:46.8860263Z cvt.f32.bf16 %r515, %rs184; 2026-02-21T08:30:46.8860324Z cvt.f32.bf16 %r516, %rs181; 2026-02-21T08:30:46.8860382Z cvt.f32.bf16 %r517, %rs182; 2026-02-21T08:30:46.8860447Z cvt.f32.bf16 %r518, %rs179; 2026-02-21T08:30:46.8860503Z cvt.f32.bf16 %r519, %rs180; 2026-02-21T08:30:46.8860560Z cvt.f32.bf16 %r520, %rs177; 2026-02-21T08:30:46.8860617Z cvt.f32.bf16 %r521, %rs178; 2026-02-21T08:30:46.8860683Z cvt.f32.bf16 %r522, %rs191; 2026-02-21T08:30:46.8860741Z cvt.f32.bf16 %r523, %rs192; 2026-02-21T08:30:46.8860797Z cvt.f32.bf16 %r524, %rs189; 2026-02-21T08:30:46.8860862Z cvt.f32.bf16 %r525, %rs190; 2026-02-21T08:30:46.8860920Z cvt.f32.bf16 %r526, %rs187; 2026-02-21T08:30:46.8860977Z cvt.f32.bf16 %r527, %rs188; 2026-02-21T08:30:46.8861035Z cvt.f32.bf16 %r528, %rs185; 2026-02-21T08:30:46.8861102Z cvt.f32.bf16 %r529, %rs186; 2026-02-21T08:30:46.8861156Z $L__tmp43: 2026-02-21T08:30:46.8861380Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8861452Z // begin inline asm 2026-02-21T08:30:46.8861507Z 2026-02-21T08:30:46.8861590Z { 2026-02-21T08:30:46.8861658Z .reg .pred complete; 2026-02-21T08:30:46.8861722Z waitLoop: 2026-02-21T08:30:46.8861848Z mbarrier.try_wait.parity.shared.b64 complete, [%r2527], %r2526; 2026-02-21T08:30:46.8861913Z @!complete bra.uni waitLoop; 2026-02-21T08:30:46.8861970Z } 2026-02-21T08:30:46.8861974Z 
2026-02-21T08:30:46.8862031Z // end inline asm 2026-02-21T08:30:46.8862094Z mov.pred %p34, -1; 2026-02-21T08:30:46.8862160Z // begin inline asm 2026-02-21T08:30:46.8862458Z @%p34 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1567 + 0], 16, {%r514, %r515, %r516, %r517, %r518, %r519, %r520, %r521, %r522, %r523, %r524, %r525, %r526, %r527, %r528, %r529}; 2026-02-21T08:30:46.8862516Z // end inline asm 2026-02-21T08:30:46.8862572Z // begin inline asm 2026-02-21T08:30:46.8862652Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.8862709Z // end inline asm 2026-02-21T08:30:46.8862765Z bar.sync 0; 2026-02-21T08:30:46.8862876Z $L__tmp44: 2026-02-21T08:30:46.8863046Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8863108Z add.s32 %r544, %r265, %r531; 2026-02-21T08:30:46.8863167Z add.s32 %r88, %r544, 57344; 2026-02-21T08:30:46.8863234Z add.s32 %r545, %r88, %r41; 2026-02-21T08:30:46.8863300Z ld.shared.b8 %rs193, [%r545]; 2026-02-21T08:30:46.8863366Z ld.shared.b8 %rs194, [%r545+128]; 2026-02-21T08:30:46.8863437Z ld.shared.b8 %rs195, [%r545+256]; 2026-02-21T08:30:46.8863499Z ld.shared.b8 %rs196, [%r545+384]; 2026-02-21T08:30:46.8863559Z ld.shared.b8 %rs197, [%r545+512]; 2026-02-21T08:30:46.8863627Z ld.shared.b8 %rs198, [%r545+640]; 2026-02-21T08:30:46.8863688Z ld.shared.b8 %rs199, [%r545+768]; 2026-02-21T08:30:46.8863746Z add.s32 %r546, %r88, %r43; 2026-02-21T08:30:46.8863807Z ld.shared.b8 %rs200, [%r546]; 2026-02-21T08:30:46.8863879Z ld.shared.b8 %rs201, [%r545+1024]; 2026-02-21T08:30:46.8863988Z ld.shared.b8 %rs202, [%r545+1152]; 2026-02-21T08:30:46.8864052Z ld.shared.b8 %rs203, [%r545+1280]; 2026-02-21T08:30:46.8864121Z ld.shared.b8 %rs204, [%r545+1408]; 2026-02-21T08:30:46.8864181Z ld.shared.b8 %rs205, [%r545+1536]; 2026-02-21T08:30:46.8864240Z ld.shared.b8 %rs206, [%r545+1664]; 2026-02-21T08:30:46.8864300Z ld.shared.b8 %rs207, [%r545+1792]; 2026-02-21T08:30:46.8864366Z add.s32 %r547, %r88, %r45; 2026-02-21T08:30:46.8864427Z 
ld.shared.b8 %rs208, [%r547]; 2026-02-21T08:30:46.8864488Z ld.shared.b8 %rs209, [%r545+2048]; 2026-02-21T08:30:46.8864555Z ld.shared.b8 %rs210, [%r545+2176]; 2026-02-21T08:30:46.8864614Z ld.shared.b8 %rs211, [%r545+2304]; 2026-02-21T08:30:46.8864674Z ld.shared.b8 %rs212, [%r545+2432]; 2026-02-21T08:30:46.8864733Z ld.shared.b8 %rs213, [%r545+2560]; 2026-02-21T08:30:46.8864801Z ld.shared.b8 %rs214, [%r545+2688]; 2026-02-21T08:30:46.8864861Z ld.shared.b8 %rs215, [%r545+2816]; 2026-02-21T08:30:46.8864922Z add.s32 %r548, %r88, %r47; 2026-02-21T08:30:46.8864993Z ld.shared.b8 %rs216, [%r548]; 2026-02-21T08:30:46.8865053Z ld.shared.b8 %rs217, [%r545+3072]; 2026-02-21T08:30:46.8865114Z ld.shared.b8 %rs218, [%r545+3200]; 2026-02-21T08:30:46.8865181Z ld.shared.b8 %rs219, [%r545+3328]; 2026-02-21T08:30:46.8865241Z ld.shared.b8 %rs220, [%r545+3456]; 2026-02-21T08:30:46.8865300Z ld.shared.b8 %rs221, [%r545+3584]; 2026-02-21T08:30:46.8865359Z ld.shared.b8 %rs222, [%r545+3712]; 2026-02-21T08:30:46.8865426Z ld.shared.b8 %rs223, [%r545+3840]; 2026-02-21T08:30:46.8865485Z add.s32 %r549, %r88, %r49; 2026-02-21T08:30:46.8865546Z ld.shared.b8 %rs224, [%r549]; 2026-02-21T08:30:46.8865725Z .loc 1 57 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:57:28 2026-02-21T08:30:46.8865785Z shl.b16 %rs225, %rs193, 4; 2026-02-21T08:30:46.8865843Z shl.b16 %rs226, %rs194, 4; 2026-02-21T08:30:46.8865899Z shl.b16 %rs227, %rs195, 4; 2026-02-21T08:30:46.8865964Z shl.b16 %rs228, %rs196, 4; 2026-02-21T08:30:46.8866023Z shl.b16 %rs229, %rs197, 4; 2026-02-21T08:30:46.8866083Z shl.b16 %rs230, %rs198, 4; 2026-02-21T08:30:46.8866147Z shl.b16 %rs231, %rs199, 4; 2026-02-21T08:30:46.8866204Z shl.b16 %rs232, %rs200, 4; 2026-02-21T08:30:46.8866260Z shl.b16 %rs233, %rs201, 4; 2026-02-21T08:30:46.8866325Z shl.b16 %rs234, %rs202, 4; 2026-02-21T08:30:46.8866383Z shl.b16 %rs235, %rs203, 4; 2026-02-21T08:30:46.8866439Z shl.b16 %rs236, %rs204, 4; 2026-02-21T08:30:46.8866496Z shl.b16 %rs237, %rs205, 4; 
2026-02-21T08:30:46.8866559Z shl.b16 %rs238, %rs206, 4; 2026-02-21T08:30:46.8866616Z shl.b16 %rs239, %rs207, 4; 2026-02-21T08:30:46.8866672Z shl.b16 %rs240, %rs208, 4; 2026-02-21T08:30:46.8866735Z shl.b16 %rs241, %rs209, 4; 2026-02-21T08:30:46.8866793Z shl.b16 %rs242, %rs210, 4; 2026-02-21T08:30:46.8866849Z shl.b16 %rs243, %rs211, 4; 2026-02-21T08:30:46.8866906Z shl.b16 %rs244, %rs212, 4; 2026-02-21T08:30:46.8866970Z shl.b16 %rs245, %rs213, 4; 2026-02-21T08:30:46.8867029Z shl.b16 %rs246, %rs214, 4; 2026-02-21T08:30:46.8867130Z shl.b16 %rs247, %rs215, 4; 2026-02-21T08:30:46.8867194Z shl.b16 %rs248, %rs216, 4; 2026-02-21T08:30:46.8867251Z shl.b16 %rs249, %rs217, 4; 2026-02-21T08:30:46.8867308Z shl.b16 %rs250, %rs218, 4; 2026-02-21T08:30:46.8867365Z shl.b16 %rs251, %rs219, 4; 2026-02-21T08:30:46.8867429Z shl.b16 %rs252, %rs220, 4; 2026-02-21T08:30:46.8867486Z shl.b16 %rs253, %rs221, 4; 2026-02-21T08:30:46.8867543Z shl.b16 %rs254, %rs222, 4; 2026-02-21T08:30:46.8867608Z shl.b16 %rs255, %rs223, 4; 2026-02-21T08:30:46.8867666Z shl.b16 %rs256, %rs224, 4; 2026-02-21T08:30:46.8867835Z .loc 1 72 58 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:72:58 2026-02-21T08:30:46.8867905Z selp.b16 %rs257, %rs225, %rs193, %p246; 2026-02-21T08:30:46.8867972Z cvt.s16.s8 %rs258, %rs257; 2026-02-21T08:30:46.8868029Z shr.s16 %rs259, %rs258, 4; 2026-02-21T08:30:46.8868098Z selp.b16 %rs260, %rs226, %rs194, %p246; 2026-02-21T08:30:46.8868208Z cvt.s16.s8 %rs261, %rs260; 2026-02-21T08:30:46.8868272Z shr.s16 %rs262, %rs261, 4; 2026-02-21T08:30:46.8868340Z selp.b16 %rs263, %rs227, %rs195, %p246; 2026-02-21T08:30:46.8868416Z cvt.s16.s8 %rs264, %rs263; 2026-02-21T08:30:46.8868474Z shr.s16 %rs265, %rs264, 4; 2026-02-21T08:30:46.8868539Z selp.b16 %rs266, %rs228, %rs196, %p246; 2026-02-21T08:30:46.8868599Z cvt.s16.s8 %rs267, %rs266; 2026-02-21T08:30:46.8868666Z shr.s16 %rs268, %rs267, 4; 2026-02-21T08:30:46.8868732Z selp.b16 %rs269, %rs229, %rs197, %p246; 2026-02-21T08:30:46.8868791Z 
cvt.s16.s8 %rs270, %rs269; 2026-02-21T08:30:46.8868855Z shr.s16 %rs271, %rs270, 4; 2026-02-21T08:30:46.8868918Z selp.b16 %rs272, %rs230, %rs198, %p246; 2026-02-21T08:30:46.8868975Z cvt.s16.s8 %rs273, %rs272; 2026-02-21T08:30:46.8869031Z shr.s16 %rs274, %rs273, 4; 2026-02-21T08:30:46.8869106Z selp.b16 %rs275, %rs231, %rs199, %p246; 2026-02-21T08:30:46.8869163Z cvt.s16.s8 %rs276, %rs275; 2026-02-21T08:30:46.8869220Z shr.s16 %rs277, %rs276, 4; 2026-02-21T08:30:46.8869294Z selp.b16 %rs278, %rs232, %rs200, %p246; 2026-02-21T08:30:46.8869356Z cvt.s16.s8 %rs279, %rs278; 2026-02-21T08:30:46.8869413Z shr.s16 %rs280, %rs279, 4; 2026-02-21T08:30:46.8869477Z selp.b16 %rs281, %rs233, %rs201, %p246; 2026-02-21T08:30:46.8869543Z cvt.s16.s8 %rs282, %rs281; 2026-02-21T08:30:46.8869599Z shr.s16 %rs283, %rs282, 4; 2026-02-21T08:30:46.8869663Z selp.b16 %rs284, %rs234, %rs202, %p246; 2026-02-21T08:30:46.8869728Z cvt.s16.s8 %rs285, %rs284; 2026-02-21T08:30:46.8869786Z shr.s16 %rs286, %rs285, 4; 2026-02-21T08:30:46.8869851Z selp.b16 %rs287, %rs235, %rs203, %p246; 2026-02-21T08:30:46.8869916Z cvt.s16.s8 %rs288, %rs287; 2026-02-21T08:30:46.8869973Z shr.s16 %rs289, %rs288, 4; 2026-02-21T08:30:46.8870036Z selp.b16 %rs290, %rs236, %rs204, %p246; 2026-02-21T08:30:46.8870095Z cvt.s16.s8 %rs291, %rs290; 2026-02-21T08:30:46.8870161Z shr.s16 %rs292, %rs291, 4; 2026-02-21T08:30:46.8870226Z selp.b16 %rs293, %rs237, %rs205, %p246; 2026-02-21T08:30:46.8870286Z cvt.s16.s8 %rs294, %rs293; 2026-02-21T08:30:46.8870351Z shr.s16 %rs295, %rs294, 4; 2026-02-21T08:30:46.8870414Z selp.b16 %rs296, %rs238, %rs206, %p246; 2026-02-21T08:30:46.8870472Z cvt.s16.s8 %rs297, %rs296; 2026-02-21T08:30:46.8870528Z shr.s16 %rs298, %rs297, 4; 2026-02-21T08:30:46.8870601Z selp.b16 %rs299, %rs239, %rs207, %p246; 2026-02-21T08:30:46.8870658Z cvt.s16.s8 %rs300, %rs299; 2026-02-21T08:30:46.8870715Z shr.s16 %rs301, %rs300, 4; 2026-02-21T08:30:46.8870784Z selp.b16 %rs302, %rs240, %rs208, %p246; 2026-02-21T08:30:46.8870842Z 
cvt.s16.s8 %rs303, %rs302; 2026-02-21T08:30:46.8870898Z shr.s16 %rs304, %rs303, 4; 2026-02-21T08:30:46.8870961Z selp.b16 %rs305, %rs241, %rs209, %p246; 2026-02-21T08:30:46.8871026Z cvt.s16.s8 %rs306, %rs305; 2026-02-21T08:30:46.8871083Z shr.s16 %rs307, %rs306, 4; 2026-02-21T08:30:46.8871147Z selp.b16 %rs308, %rs242, %rs210, %p246; 2026-02-21T08:30:46.8871212Z cvt.s16.s8 %rs309, %rs308; 2026-02-21T08:30:46.8871268Z shr.s16 %rs310, %rs309, 4; 2026-02-21T08:30:46.8871334Z selp.b16 %rs311, %rs243, %rs211, %p246; 2026-02-21T08:30:46.8871440Z cvt.s16.s8 %rs312, %rs311; 2026-02-21T08:30:46.8871497Z shr.s16 %rs313, %rs312, 4; 2026-02-21T08:30:46.8871591Z selp.b16 %rs314, %rs244, %rs212, %p246; 2026-02-21T08:30:46.8871651Z cvt.s16.s8 %rs315, %rs314; 2026-02-21T08:30:46.8871714Z shr.s16 %rs316, %rs315, 4; 2026-02-21T08:30:46.8871779Z selp.b16 %rs317, %rs245, %rs213, %p246; 2026-02-21T08:30:46.8871837Z cvt.s16.s8 %rs318, %rs317; 2026-02-21T08:30:46.8871900Z shr.s16 %rs319, %rs318, 4; 2026-02-21T08:30:46.8871964Z selp.b16 %rs320, %rs246, %rs214, %p246; 2026-02-21T08:30:46.8872022Z cvt.s16.s8 %rs321, %rs320; 2026-02-21T08:30:46.8872079Z shr.s16 %rs322, %rs321, 4; 2026-02-21T08:30:46.8872148Z selp.b16 %rs323, %rs247, %rs215, %p246; 2026-02-21T08:30:46.8872205Z cvt.s16.s8 %rs324, %rs323; 2026-02-21T08:30:46.8872261Z shr.s16 %rs325, %rs324, 4; 2026-02-21T08:30:46.8872329Z selp.b16 %rs326, %rs248, %rs216, %p246; 2026-02-21T08:30:46.8872440Z cvt.s16.s8 %rs327, %rs326; 2026-02-21T08:30:46.8872501Z shr.s16 %rs328, %rs327, 4; 2026-02-21T08:30:46.8872566Z selp.b16 %rs329, %rs249, %rs217, %p246; 2026-02-21T08:30:46.8872632Z cvt.s16.s8 %rs330, %rs329; 2026-02-21T08:30:46.8872690Z shr.s16 %rs331, %rs330, 4; 2026-02-21T08:30:46.8872754Z selp.b16 %rs332, %rs250, %rs218, %p246; 2026-02-21T08:30:46.8872820Z cvt.s16.s8 %rs333, %rs332; 2026-02-21T08:30:46.8872878Z shr.s16 %rs334, %rs333, 4; 2026-02-21T08:30:46.8872941Z selp.b16 %rs335, %rs251, %rs219, %p246; 2026-02-21T08:30:46.8872998Z 
cvt.s16.s8 %rs336, %rs335; 2026-02-21T08:30:46.8873063Z shr.s16 %rs337, %rs336, 4; 2026-02-21T08:30:46.8873126Z selp.b16 %rs338, %rs252, %rs220, %p246; 2026-02-21T08:30:46.8873184Z cvt.s16.s8 %rs339, %rs338; 2026-02-21T08:30:46.8873247Z shr.s16 %rs340, %rs339, 4; 2026-02-21T08:30:46.8873312Z selp.b16 %rs341, %rs253, %rs221, %p246; 2026-02-21T08:30:46.8873370Z cvt.s16.s8 %rs342, %rs341; 2026-02-21T08:30:46.8873436Z shr.s16 %rs343, %rs342, 4; 2026-02-21T08:30:46.8873503Z selp.b16 %rs344, %rs254, %rs222, %p246; 2026-02-21T08:30:46.8873561Z cvt.s16.s8 %rs345, %rs344; 2026-02-21T08:30:46.8873617Z shr.s16 %rs346, %rs345, 4; 2026-02-21T08:30:46.8873691Z selp.b16 %rs347, %rs255, %rs223, %p246; 2026-02-21T08:30:46.8873749Z cvt.s16.s8 %rs348, %rs347; 2026-02-21T08:30:46.8873807Z shr.s16 %rs349, %rs348, 4; 2026-02-21T08:30:46.8873881Z selp.b16 %rs350, %rs256, %rs224, %p246; 2026-02-21T08:30:46.8873939Z cvt.s16.s8 %rs351, %rs350; 2026-02-21T08:30:46.8873996Z shr.s16 %rs352, %rs351, 4; 2026-02-21T08:30:46.8874171Z .loc 1 77 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:77:32 2026-02-21T08:30:46.8874241Z cvt.rn.f32.s16 %r550, %rs259; 2026-02-21T08:30:46.8874303Z cvt.rn.f32.s16 %r551, %rs262; 2026-02-21T08:30:46.8874363Z cvt.rn.f32.s16 %r552, %rs265; 2026-02-21T08:30:46.8874429Z cvt.rn.f32.s16 %r553, %rs268; 2026-02-21T08:30:46.8874488Z cvt.rn.f32.s16 %r554, %rs271; 2026-02-21T08:30:46.8874552Z cvt.rn.f32.s16 %r555, %rs274; 2026-02-21T08:30:46.8874622Z cvt.rn.f32.s16 %r556, %rs277; 2026-02-21T08:30:46.8874683Z cvt.rn.f32.s16 %r557, %rs280; 2026-02-21T08:30:46.8874742Z cvt.rn.f32.s16 %r558, %rs283; 2026-02-21T08:30:46.8874802Z cvt.rn.f32.s16 %r559, %rs286; 2026-02-21T08:30:46.8874870Z cvt.rn.f32.s16 %r560, %rs289; 2026-02-21T08:30:46.8874928Z cvt.rn.f32.s16 %r561, %rs292; 2026-02-21T08:30:46.8874986Z cvt.rn.f32.s16 %r562, %rs295; 2026-02-21T08:30:46.8875051Z cvt.rn.f32.s16 %r563, %rs298; 2026-02-21T08:30:46.8875108Z cvt.rn.f32.s16 %r564, %rs301; 
2026-02-21T08:30:46.8875167Z cvt.rn.f32.s16 %r565, %rs304; 2026-02-21T08:30:46.8875225Z cvt.rn.f32.s16 %r566, %rs307; 2026-02-21T08:30:46.8875292Z cvt.rn.f32.s16 %r567, %rs310; 2026-02-21T08:30:46.8875351Z cvt.rn.f32.s16 %r568, %rs313; 2026-02-21T08:30:46.8875408Z cvt.rn.f32.s16 %r569, %rs316; 2026-02-21T08:30:46.8875474Z cvt.rn.f32.s16 %r570, %rs319; 2026-02-21T08:30:46.8875532Z cvt.rn.f32.s16 %r571, %rs322; 2026-02-21T08:30:46.8875592Z cvt.rn.f32.s16 %r572, %rs325; 2026-02-21T08:30:46.8875717Z cvt.rn.f32.s16 %r573, %rs328; 2026-02-21T08:30:46.8875783Z cvt.rn.f32.s16 %r574, %rs331; 2026-02-21T08:30:46.8875842Z cvt.rn.f32.s16 %r575, %rs334; 2026-02-21T08:30:46.8875898Z cvt.rn.f32.s16 %r576, %rs337; 2026-02-21T08:30:46.8875963Z cvt.rn.f32.s16 %r577, %rs340; 2026-02-21T08:30:46.8876019Z cvt.rn.f32.s16 %r578, %rs343; 2026-02-21T08:30:46.8876077Z cvt.rn.f32.s16 %r579, %rs346; 2026-02-21T08:30:46.8876137Z cvt.rn.f32.s16 %r580, %rs349; 2026-02-21T08:30:46.8876202Z cvt.rn.f32.s16 %r581, %rs352; 2026-02-21T08:30:46.8876264Z st.shared.b32 [%r51], %r550; 2026-02-21T08:30:46.8876328Z st.shared.b32 [%r51+8], %r551; 2026-02-21T08:30:46.8876397Z st.shared.b32 [%r51+16384], %r566; 2026-02-21T08:30:46.8876461Z st.shared.b32 [%r51+16392], %r567; 2026-02-21T08:30:46.8876522Z st.shared.b32 [%r52], %r552; 2026-02-21T08:30:46.8876589Z st.shared.b32 [%r52+8], %r553; 2026-02-21T08:30:46.8876692Z st.shared.b32 [%r52+16384], %r568; 2026-02-21T08:30:46.8876758Z st.shared.b32 [%r52+16392], %r569; 2026-02-21T08:30:46.8876820Z st.shared.b32 [%r53], %r554; 2026-02-21T08:30:46.8876889Z st.shared.b32 [%r53+8], %r555; 2026-02-21T08:30:46.8876950Z st.shared.b32 [%r53+16384], %r570; 2026-02-21T08:30:46.8877011Z st.shared.b32 [%r53+16392], %r571; 2026-02-21T08:30:46.8877079Z st.shared.b32 [%r54], %r556; 2026-02-21T08:30:46.8877139Z st.shared.b32 [%r54+8], %r557; 2026-02-21T08:30:46.8877199Z st.shared.b32 [%r54+16384], %r572; 2026-02-21T08:30:46.8877259Z st.shared.b32 [%r54+16392], %r573; 
2026-02-21T08:30:46.8877327Z st.shared.b32 [%r55], %r558; 2026-02-21T08:30:46.8877387Z st.shared.b32 [%r55+8], %r559; 2026-02-21T08:30:46.8877445Z st.shared.b32 [%r55+16384], %r574; 2026-02-21T08:30:46.8877514Z st.shared.b32 [%r55+16392], %r575; 2026-02-21T08:30:46.8877574Z st.shared.b32 [%r56], %r560; 2026-02-21T08:30:46.8877633Z st.shared.b32 [%r56+8], %r561; 2026-02-21T08:30:46.8877695Z st.shared.b32 [%r56+16384], %r576; 2026-02-21T08:30:46.8877762Z st.shared.b32 [%r56+16392], %r577; 2026-02-21T08:30:46.8877822Z st.shared.b32 [%r57], %r562; 2026-02-21T08:30:46.8877882Z st.shared.b32 [%r57+8], %r563; 2026-02-21T08:30:46.8877950Z st.shared.b32 [%r57+16384], %r578; 2026-02-21T08:30:46.8878008Z st.shared.b32 [%r57+16392], %r579; 2026-02-21T08:30:46.8878065Z st.shared.b32 [%r58], %r564; 2026-02-21T08:30:46.8878132Z st.shared.b32 [%r58+8], %r565; 2026-02-21T08:30:46.8878192Z st.shared.b32 [%r58+16384], %r580; 2026-02-21T08:30:46.8878252Z st.shared.b32 [%r58+16392], %r581; 2026-02-21T08:30:46.8878428Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8878496Z shl.b32 %r582, %r2529, 3; 2026-02-21T08:30:46.8878554Z add.s32 %r583, %r265, %r582; 2026-02-21T08:30:46.8878612Z add.s32 %r2527, %r583, 69632; 2026-02-21T08:30:46.8878674Z $L__tmp45: 2026-02-21T08:30:46.8878899Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8878959Z // begin inline asm 2026-02-21T08:30:46.8879039Z fence.proxy.async.shared::cta; 2026-02-21T08:30:46.8879094Z // end inline asm 2026-02-21T08:30:46.8879150Z bar.sync 0; 2026-02-21T08:30:46.8879209Z @%p12 bra $L__BB0_8; 2026-02-21T08:30:46.8879313Z // %bb.7: // in Loop: Header=BB0_6 Depth=2 2026-02-21T08:30:46.8879379Z elect.sync %r608|%p35, -1; 2026-02-21T08:30:46.8879439Z mov.b32 %r586, 69208336; 2026-02-21T08:30:46.8879500Z // begin inline asm 2026-02-21T08:30:46.8879660Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], 
[ %r2332 + 0 ], %rd416, %r586, %p34; 2026-02-21T08:30:46.8879716Z // end inline asm 2026-02-21T08:30:46.8879772Z // begin inline asm 2026-02-21T08:30:46.8879930Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 8 ], %rd417, %r586, %p34; 2026-02-21T08:30:46.8879986Z // end inline asm 2026-02-21T08:30:46.8880045Z // begin inline asm 2026-02-21T08:30:46.8880246Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 16 ], %rd418, %r586, %p34; 2026-02-21T08:30:46.8880302Z // end inline asm 2026-02-21T08:30:46.8880359Z // begin inline asm 2026-02-21T08:30:46.8880516Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 24 ], %rd419, %r586, %p34; 2026-02-21T08:30:46.8880571Z // end inline asm 2026-02-21T08:30:46.8880628Z // begin inline asm 2026-02-21T08:30:46.8880780Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 32 ], %rd420, %r586, %p34; 2026-02-21T08:30:46.8880835Z // end inline asm 2026-02-21T08:30:46.8880890Z // begin inline asm 2026-02-21T08:30:46.8881031Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 40 ], %rd421, %r586, %p34; 2026-02-21T08:30:46.8881096Z // end inline asm 2026-02-21T08:30:46.8881152Z // begin inline asm 2026-02-21T08:30:46.8881336Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 48 ], %rd422, %r586, %p34; 2026-02-21T08:30:46.8881404Z // end inline asm 2026-02-21T08:30:46.8881461Z // begin inline asm 2026-02-21T08:30:46.8881635Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 56 ], %rd423, %r586, %p34; 2026-02-21T08:30:46.8881702Z // end inline asm 2026-02-21T08:30:46.8881764Z cvt.u64.u32 %rd99, %r2527; 2026-02-21T08:30:46.8881821Z // begin inline asm 2026-02-21T08:30:46.8881956Z @%p35 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd99]; 2026-02-21T08:30:46.8882013Z // end inline asm 2026-02-21T08:30:46.8882072Z bra.uni $L__BB0_8; 2026-02-21T08:30:46.8882170Z $L__BB0_9: // in Loop: Header=BB0_3 
Depth=1 2026-02-21T08:30:46.8882272Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:30:46.8882328Z mov.b32 %r2536, 1; 2026-02-21T08:30:46.8882547Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8882612Z // begin inline asm 2026-02-21T08:30:46.8882666Z 2026-02-21T08:30:46.8882717Z { 2026-02-21T08:30:46.8882779Z .reg .pred complete; 2026-02-21T08:30:46.8882841Z waitLoop: 2026-02-21T08:30:46.8882964Z mbarrier.try_wait.parity.shared.b64 complete, [%r2527], %r2536; 2026-02-21T08:30:46.8883034Z @!complete bra.uni waitLoop; 2026-02-21T08:30:46.8883101Z } 2026-02-21T08:30:46.8883104Z 2026-02-21T08:30:46.8883162Z // end inline asm 2026-02-21T08:30:46.8883218Z $L__tmp46: 2026-02-21T08:30:46.8883399Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8883464Z cp.async.wait_group 0; 2026-02-21T08:30:46.8883520Z bar.sync 0; 2026-02-21T08:30:46.8883580Z add.s32 %r2534, %r265, 69632; 2026-02-21T08:30:46.8883647Z // begin inline asm 2026-02-21T08:30:46.8883736Z @%p204 mbarrier.inval.shared::cta.b64 [%r2534]; 2026-02-21T08:30:46.8883792Z // end inline asm 2026-02-21T08:30:46.8883853Z bar.sync 0; 2026-02-21T08:30:46.8883909Z // begin inline asm 2026-02-21T08:30:46.8883995Z @%p204 mbarrier.inval.shared::cta.b64 [%r1442]; 2026-02-21T08:30:46.8884052Z // end inline asm 2026-02-21T08:30:46.8884226Z .loc 1 88 43 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:43 2026-02-21T08:30:46.8884287Z shl.b32 %r763, %r70, 13; 2026-02-21T08:30:46.8884345Z shl.b32 %r764, %r71, 13; 2026-02-21T08:30:46.8884410Z shl.b32 %r765, %r72, 13; 2026-02-21T08:30:46.8884467Z shl.b32 %r766, %r73, 13; 2026-02-21T08:30:46.8884633Z .loc 1 88 50 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:50 2026-02-21T08:30:46.8884702Z add.s32 %r767, %r763, %r69; 2026-02-21T08:30:46.8884762Z add.s32 %r768, %r764, %r69; 2026-02-21T08:30:46.8884823Z add.s32 %r769, %r765, %r69; 
2026-02-21T08:30:46.8884883Z add.s32 %r770, %r766, %r69; 2026-02-21T08:30:46.8885064Z .loc 1 88 22 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:22 2026-02-21T08:30:46.8885139Z mad.wide.s32 %rd104, %r767, 2, %rd49; 2026-02-21T08:30:46.8885265Z mad.wide.s32 %rd105, %r768, 2, %rd49; 2026-02-21T08:30:46.8885341Z mad.wide.s32 %rd106, %r769, 2, %rd49; 2026-02-21T08:30:46.8885406Z mad.wide.s32 %rd107, %r770, 2, %rd49; 2026-02-21T08:30:46.8885463Z $L__tmp47: 2026-02-21T08:30:46.8885697Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8885758Z // begin inline asm 2026-02-21T08:30:46.8886051Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r621, %r622, %r623, %r624, %r625, %r626, %r627, %r628, %r629, %r630, %r631, %r632, %r633, %r634, %r635, %r636}, [%r1887 + 0], 32; 2026-02-21T08:30:46.8886109Z // end inline asm 2026-02-21T08:30:46.8886175Z // begin inline asm 2026-02-21T08:30:46.8886460Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r638, %r639, %r640, %r641, %r642, %r643, %r644, %r645, %r646, %r647, %r648, %r649, %r650, %r651, %r652, %r653}, [%r1887 + 16], 32; 2026-02-21T08:30:46.8886569Z // end inline asm 2026-02-21T08:30:46.8886639Z // begin inline asm 2026-02-21T08:30:46.8886711Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:46.8886768Z // end inline asm 2026-02-21T08:30:46.8886839Z cvt.u64.u32 %rd117, %r621; 2026-02-21T08:30:46.8886900Z cvt.u64.u32 %rd118, %r622; 2026-02-21T08:30:46.8886962Z shl.b64 %rd119, %rd118, 32; 2026-02-21T08:30:46.8887026Z or.b64 %rd120, %rd117, %rd119; 2026-02-21T08:30:46.8887087Z $L__tmp48: 2026-02-21T08:30:46.8887266Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8887331Z mov.b64 {%r771, %r772}, %rd120; 2026-02-21T08:30:46.8887411Z cvt.rn.bf16x2.f32 %r773, %r772, %r771; 2026-02-21T08:30:46.8887465Z $L__tmp49: 2026-02-21T08:30:46.8887692Z .loc 2 291 36 // standard.py:291:36 @[ 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8887761Z cvt.u64.u32 %rd121, %r623; 2026-02-21T08:30:46.8887822Z cvt.u64.u32 %rd122, %r624; 2026-02-21T08:30:46.8887886Z shl.b64 %rd123, %rd122, 32; 2026-02-21T08:30:46.8887953Z or.b64 %rd124, %rd121, %rd123; 2026-02-21T08:30:46.8888014Z $L__tmp50: 2026-02-21T08:30:46.8888190Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8888255Z mov.b64 {%r774, %r775}, %rd124; 2026-02-21T08:30:46.8888333Z cvt.rn.bf16x2.f32 %r776, %r775, %r774; 2026-02-21T08:30:46.8888387Z $L__tmp51: 2026-02-21T08:30:46.8888612Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8888680Z cvt.u64.u32 %rd125, %r625; 2026-02-21T08:30:46.8888740Z cvt.u64.u32 %rd126, %r626; 2026-02-21T08:30:46.8888802Z shl.b64 %rd127, %rd126, 32; 2026-02-21T08:30:46.8888863Z or.b64 %rd128, %rd125, %rd127; 2026-02-21T08:30:46.8888925Z $L__tmp52: 2026-02-21T08:30:46.8889107Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8889175Z mov.b64 {%r777, %r778}, %rd128; 2026-02-21T08:30:46.8889254Z cvt.rn.bf16x2.f32 %r779, %r778, %r777; 2026-02-21T08:30:46.8889308Z $L__tmp53: 2026-02-21T08:30:46.8889533Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8889596Z cvt.u64.u32 %rd129, %r627; 2026-02-21T08:30:46.8889664Z cvt.u64.u32 %rd130, %r628; 2026-02-21T08:30:46.8889727Z shl.b64 %rd131, %rd130, 32; 2026-02-21T08:30:46.8889789Z or.b64 %rd132, %rd129, %rd131; 2026-02-21T08:30:46.8889851Z $L__tmp54: 2026-02-21T08:30:46.8890025Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8890088Z mov.b64 {%r780, %r781}, %rd132; 2026-02-21T08:30:46.8890164Z cvt.rn.bf16x2.f32 %r782, %r781, %r780; 2026-02-21T08:30:46.8890219Z $L__tmp55: 2026-02-21T08:30:46.8890450Z .loc 2 
291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8890555Z cvt.u64.u32 %rd133, %r629; 2026-02-21T08:30:46.8890624Z cvt.u64.u32 %rd134, %r630; 2026-02-21T08:30:46.8890685Z shl.b64 %rd135, %rd134, 32; 2026-02-21T08:30:46.8890749Z or.b64 %rd136, %rd133, %rd135; 2026-02-21T08:30:46.8890813Z $L__tmp56: 2026-02-21T08:30:46.8890990Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8891055Z mov.b64 {%r783, %r784}, %rd136; 2026-02-21T08:30:46.8891135Z cvt.rn.bf16x2.f32 %r785, %r784, %r783; 2026-02-21T08:30:46.8891192Z $L__tmp57: 2026-02-21T08:30:46.8891418Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8891480Z cvt.u64.u32 %rd137, %r631; 2026-02-21T08:30:46.8891585Z cvt.u64.u32 %rd138, %r632; 2026-02-21T08:30:46.8891650Z shl.b64 %rd139, %rd138, 32; 2026-02-21T08:30:46.8891714Z or.b64 %rd140, %rd137, %rd139; 2026-02-21T08:30:46.8891840Z $L__tmp58: 2026-02-21T08:30:46.8892021Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8892086Z mov.b64 {%r786, %r787}, %rd140; 2026-02-21T08:30:46.8892156Z cvt.rn.bf16x2.f32 %r788, %r787, %r786; 2026-02-21T08:30:46.8892217Z $L__tmp59: 2026-02-21T08:30:46.8892438Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8892499Z cvt.u64.u32 %rd141, %r633; 2026-02-21T08:30:46.8892567Z cvt.u64.u32 %rd142, %r634; 2026-02-21T08:30:46.8892629Z shl.b64 %rd143, %rd142, 32; 2026-02-21T08:30:46.8892690Z or.b64 %rd144, %rd141, %rd143; 2026-02-21T08:30:46.8892751Z $L__tmp60: 2026-02-21T08:30:46.8892951Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8893024Z mov.b64 {%r789, %r790}, %rd144; 2026-02-21T08:30:46.8893092Z cvt.rn.bf16x2.f32 %r791, %r790, %r789; 2026-02-21T08:30:46.8893154Z $L__tmp61: 
2026-02-21T08:30:46.8893363Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8893422Z cvt.u64.u32 %rd145, %r635; 2026-02-21T08:30:46.8893489Z cvt.u64.u32 %rd146, %r636; 2026-02-21T08:30:46.8893548Z shl.b64 %rd147, %rd146, 32; 2026-02-21T08:30:46.8893606Z or.b64 %rd148, %rd145, %rd147; 2026-02-21T08:30:46.8893665Z $L__tmp62: 2026-02-21T08:30:46.8893834Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8893894Z mov.b64 {%r792, %r793}, %rd148; 2026-02-21T08:30:46.8893960Z cvt.rn.bf16x2.f32 %r794, %r793, %r792; 2026-02-21T08:30:46.8894021Z $L__tmp63: 2026-02-21T08:30:46.8894231Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8894291Z cvt.u64.u32 %rd149, %r638; 2026-02-21T08:30:46.8894358Z cvt.u64.u32 %rd150, %r639; 2026-02-21T08:30:46.8894419Z shl.b64 %rd151, %rd150, 32; 2026-02-21T08:30:46.8894478Z or.b64 %rd152, %rd149, %rd151; 2026-02-21T08:30:46.8894530Z $L__tmp64: 2026-02-21T08:30:46.8894704Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8894763Z mov.b64 {%r795, %r796}, %rd152; 2026-02-21T08:30:46.8894828Z cvt.rn.bf16x2.f32 %r797, %r796, %r795; 2026-02-21T08:30:46.8894887Z $L__tmp65: 2026-02-21T08:30:46.8895093Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8895151Z cvt.u64.u32 %rd153, %r640; 2026-02-21T08:30:46.8895214Z cvt.u64.u32 %rd154, %r641; 2026-02-21T08:30:46.8895272Z shl.b64 %rd155, %rd154, 32; 2026-02-21T08:30:46.8895331Z or.b64 %rd156, %rd153, %rd155; 2026-02-21T08:30:46.8895382Z $L__tmp66: 2026-02-21T08:30:46.8895560Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8895674Z mov.b64 {%r798, %r799}, %rd156; 2026-02-21T08:30:46.8895740Z cvt.rn.bf16x2.f32 %r800, %r799, %r798; 
2026-02-21T08:30:46.8895798Z $L__tmp67: 2026-02-21T08:30:46.8896014Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8896072Z cvt.u64.u32 %rd157, %r642; 2026-02-21T08:30:46.8896135Z cvt.u64.u32 %rd158, %r643; 2026-02-21T08:30:46.8896193Z shl.b64 %rd159, %rd158, 32; 2026-02-21T08:30:46.8896251Z or.b64 %rd160, %rd157, %rd159; 2026-02-21T08:30:46.8896302Z $L__tmp68: 2026-02-21T08:30:46.8896475Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8896534Z mov.b64 {%r801, %r802}, %rd160; 2026-02-21T08:30:46.8896599Z cvt.rn.bf16x2.f32 %r803, %r802, %r801; 2026-02-21T08:30:46.8896657Z $L__tmp69: 2026-02-21T08:30:46.8896907Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8896969Z cvt.u64.u32 %rd161, %r644; 2026-02-21T08:30:46.8897035Z cvt.u64.u32 %rd162, %r645; 2026-02-21T08:30:46.8897093Z shl.b64 %rd163, %rd162, 32; 2026-02-21T08:30:46.8897152Z or.b64 %rd164, %rd161, %rd163; 2026-02-21T08:30:46.8897202Z $L__tmp70: 2026-02-21T08:30:46.8897378Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8897437Z mov.b64 {%r804, %r805}, %rd164; 2026-02-21T08:30:46.8897503Z cvt.rn.bf16x2.f32 %r806, %r805, %r804; 2026-02-21T08:30:46.8897562Z $L__tmp71: 2026-02-21T08:30:46.8897769Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8897827Z cvt.u64.u32 %rd165, %r646; 2026-02-21T08:30:46.8897892Z cvt.u64.u32 %rd166, %r647; 2026-02-21T08:30:46.8897950Z shl.b64 %rd167, %rd166, 32; 2026-02-21T08:30:46.8898011Z or.b64 %rd168, %rd165, %rd167; 2026-02-21T08:30:46.8898063Z $L__tmp72: 2026-02-21T08:30:46.8898240Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8898301Z mov.b64 {%r807, %r808}, %rd168; 2026-02-21T08:30:46.8898367Z 
cvt.rn.bf16x2.f32 %r809, %r808, %r807; 2026-02-21T08:30:46.8898425Z $L__tmp73: 2026-02-21T08:30:46.8898638Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8898696Z cvt.u64.u32 %rd169, %r648; 2026-02-21T08:30:46.8898756Z cvt.u64.u32 %rd170, %r649; 2026-02-21T08:30:46.8898823Z shl.b64 %rd171, %rd170, 32; 2026-02-21T08:30:46.8898881Z or.b64 %rd172, %rd169, %rd171; 2026-02-21T08:30:46.8898935Z $L__tmp74: 2026-02-21T08:30:46.8899115Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8899174Z mov.b64 {%r810, %r811}, %rd172; 2026-02-21T08:30:46.8899242Z cvt.rn.bf16x2.f32 %r812, %r811, %r810; 2026-02-21T08:30:46.8899305Z $L__tmp75: 2026-02-21T08:30:46.8899519Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8899579Z cvt.u64.u32 %rd173, %r650; 2026-02-21T08:30:46.8899637Z cvt.u64.u32 %rd174, %r651; 2026-02-21T08:30:46.8899706Z shl.b64 %rd175, %rd174, 32; 2026-02-21T08:30:46.8899767Z or.b64 %rd176, %rd173, %rd175; 2026-02-21T08:30:46.8899821Z $L__tmp76: 2026-02-21T08:30:46.8900000Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8900067Z mov.b64 {%r813, %r814}, %rd176; 2026-02-21T08:30:46.8900134Z cvt.rn.bf16x2.f32 %r815, %r814, %r813; 2026-02-21T08:30:46.8900193Z $L__tmp77: 2026-02-21T08:30:46.8900406Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8900467Z cvt.u64.u32 %rd177, %r652; 2026-02-21T08:30:46.8900573Z cvt.u64.u32 %rd178, %r653; 2026-02-21T08:30:46.8900641Z shl.b64 %rd179, %rd178, 32; 2026-02-21T08:30:46.8900699Z or.b64 %rd180, %rd177, %rd179; 2026-02-21T08:30:46.8900753Z $L__tmp78: 2026-02-21T08:30:46.8900930Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8900990Z mov.b64 {%r816, %r817}, 
%rd180; 2026-02-21T08:30:46.8901055Z cvt.rn.bf16x2.f32 %r818, %r817, %r816; 2026-02-21T08:30:46.8901227Z .loc 1 88 81 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:81 2026-02-21T08:30:46.8901322Z st.shared.v4.b32 [%r60], {%r773, %r785, %r797, %r809}; 2026-02-21T08:30:46.8901379Z bar.sync 0; 2026-02-21T08:30:46.8901438Z // begin inline asm 2026-02-21T08:30:46.8901626Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r655, %r656, %r657, %r658}, [%r659]; 2026-02-21T08:30:46.8901685Z // end inline asm 2026-02-21T08:30:46.8901741Z bar.sync 0; 2026-02-21T08:30:46.8901904Z st.shared.v4.b32 [%r60], {%r776, %r788, %r800, %r812}; 2026-02-21T08:30:46.8901960Z bar.sync 0; 2026-02-21T08:30:46.8902019Z // begin inline asm 2026-02-21T08:30:46.8902163Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r660, %r661, %r662, %r663}, [%r659]; 2026-02-21T08:30:46.8902227Z // end inline asm 2026-02-21T08:30:46.8902281Z bar.sync 0; 2026-02-21T08:30:46.8902367Z st.shared.v4.b32 [%r60], {%r779, %r791, %r803, %r815}; 2026-02-21T08:30:46.8902430Z bar.sync 0; 2026-02-21T08:30:46.8902488Z // begin inline asm 2026-02-21T08:30:46.8902628Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r665, %r666, %r667, %r668}, [%r659]; 2026-02-21T08:30:46.8902690Z // end inline asm 2026-02-21T08:30:46.8902742Z bar.sync 0; 2026-02-21T08:30:46.8902828Z st.shared.v4.b32 [%r60], {%r782, %r794, %r806, %r818}; 2026-02-21T08:30:46.8902882Z bar.sync 0; 2026-02-21T08:30:46.8902953Z // begin inline asm 2026-02-21T08:30:46.8903093Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r670, %r671, %r672, %r673}, [%r659]; 2026-02-21T08:30:46.8903152Z // end inline asm 2026-02-21T08:30:46.8903216Z // begin inline asm 2026-02-21T08:30:46.8903321Z st.global.v4.b32 [ %rd104 + 0 ], { %r655, %r660, %r665, %r670 }; 2026-02-21T08:30:46.8903377Z // end inline asm 2026-02-21T08:30:46.8903433Z // begin inline asm 2026-02-21T08:30:46.8903541Z st.global.v4.b32 [ %rd105 + 0 ], { %r656, %r661, %r666, %r671 }; 2026-02-21T08:30:46.8903597Z // end 
inline asm 2026-02-21T08:30:46.8903652Z // begin inline asm 2026-02-21T08:30:46.8903754Z st.global.v4.b32 [ %rd106 + 0 ], { %r657, %r662, %r667, %r672 }; 2026-02-21T08:30:46.8903809Z // end inline asm 2026-02-21T08:30:46.8903866Z // begin inline asm 2026-02-21T08:30:46.8903961Z st.global.v4.b32 [ %rd107 + 0 ], { %r658, %r663, %r668, %r673 }; 2026-02-21T08:30:46.8904023Z // end inline asm 2026-02-21T08:30:46.8904201Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.8904265Z add.s32 %r819, %r2523, 2368; 2026-02-21T08:30:46.8904441Z .loc 1 25 35 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:25:35 2026-02-21T08:30:46.8904501Z shr.s32 %r820, %r819, 31; 2026-02-21T08:30:46.8904560Z shr.u32 %r821, %r820, 23; 2026-02-21T08:30:46.8904628Z add.s32 %r822, %r819, %r821; 2026-02-21T08:30:46.8904688Z shr.s32 %r823, %r822, 9; 2026-02-21T08:30:46.8904852Z .loc 1 26 33 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:26:33 2026-02-21T08:30:46.8904910Z shl.b32 %r824, %r823, 3; 2026-02-21T08:30:46.8905083Z .loc 1 27 39 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:39 2026-02-21T08:30:46.8905141Z sub.s32 %r825, 64, %r824; 2026-02-21T08:30:46.8905304Z .loc 1 27 52 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:52 2026-02-21T08:30:46.8905372Z min.s32 %r826, %r825, 8; 2026-02-21T08:30:46.8905538Z .loc 1 28 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:45 2026-02-21T08:30:46.8905656Z and.b32 %r827, %r822, -512; 2026-02-21T08:30:46.8905723Z sub.s32 %r828, %r819, %r827; 2026-02-21T08:30:46.8905892Z .loc 1 29 51 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:29:51 2026-02-21T08:30:46.8905951Z div.s32 %r94, %r828, %r826; 2026-02-21T08:30:46.8906119Z .loc 1 28 64 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:64 2026-02-21T08:30:46.8906181Z mul.lo.s32 %r829, %r94, %r826; 2026-02-21T08:30:46.8906239Z sub.s32 %r830, %r828, %r829; 
2026-02-21T08:30:46.8906406Z .loc 1 28 30 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:30 2026-02-21T08:30:46.8906471Z add.s32 %r831, %r830, %r824; 2026-02-21T08:30:46.8906638Z .loc 1 30 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:30:27 2026-02-21T08:30:46.8906697Z shl.b32 %r95, %r831, 7; 2026-02-21T08:30:46.8906910Z .loc 1 31 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:31:32 2026-02-21T08:30:46.8906974Z or.b32 %r96, %r95, %r6; 2026-02-21T08:30:46.8907142Z .loc 1 32 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:32:27 2026-02-21T08:30:46.8907208Z shl.b32 %r832, %r94, 6; 2026-02-21T08:30:46.8907377Z .loc 1 33 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:33:32 2026-02-21T08:30:46.8907436Z or.b32 %r833, %r832, %r10; 2026-02-21T08:30:46.8907501Z or.b32 %r834, %r832, %r11; 2026-02-21T08:30:46.8907670Z .loc 1 48 53 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:53 2026-02-21T08:30:46.8907728Z shl.b32 %r835, %r833, 10; 2026-02-21T08:30:46.8907787Z shl.b32 %r836, %r834, 10; 2026-02-21T08:30:46.8907856Z mov.pred %p65, -1; 2026-02-21T08:30:46.8907912Z mov.b32 %r2533, 0; 2026-02-21T08:30:46.8907966Z $L__tmp79: 2026-02-21T08:30:46.8908197Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8908257Z // begin inline asm 2026-02-21T08:30:46.8908580Z @%p65 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1887 + 0], 32, {%r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533}; 2026-02-21T08:30:46.8908644Z // end inline asm 2026-02-21T08:30:46.8908702Z // begin inline asm 2026-02-21T08:30:46.8909017Z @%p65 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1887 + 16], 32, {%r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533, %r2533}; 2026-02-21T08:30:46.8909073Z // end inline asm 
2026-02-21T08:30:46.8909138Z // begin inline asm 2026-02-21T08:30:46.8909213Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.8909268Z // end inline asm 2026-02-21T08:30:46.8909332Z bar.sync 0; 2026-02-21T08:30:46.8909387Z $L__tmp80: 2026-02-21T08:30:46.8909564Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8909631Z // begin inline asm 2026-02-21T08:30:46.8909723Z @%p204 mbarrier.init.shared::cta.b64 [%r2534], 1; 2026-02-21T08:30:46.8909778Z // end inline asm 2026-02-21T08:30:46.8909833Z bar.sync 0; 2026-02-21T08:30:46.8909898Z // begin inline asm 2026-02-21T08:30:46.8909988Z @%p204 mbarrier.init.shared::cta.b64 [%r1442], 1; 2026-02-21T08:30:46.8910043Z // end inline asm 2026-02-21T08:30:46.8910221Z .loc 1 48 60 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:60 2026-02-21T08:30:46.8910283Z or.b32 %r837, %r835, %r16; 2026-02-21T08:30:46.8910343Z or.b32 %r838, %r836, %r16; 2026-02-21T08:30:46.8910520Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8910588Z mad.wide.s32 %rd108, %r837, 2, %rd47; 2026-02-21T08:30:46.8910655Z mad.wide.s32 %rd109, %r838, 2, %rd47; 2026-02-21T08:30:46.8910755Z mov.b32 %r910, 16; 2026-02-21T08:30:46.8910929Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8910986Z // begin inline asm 2026-02-21T08:30:46.8911108Z cp.async.cg.shared.global [ %r1731 + 0 ], [ %rd108 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8911172Z // end inline asm 2026-02-21T08:30:46.8911227Z // begin inline asm 2026-02-21T08:30:46.8911344Z cp.async.cg.shared.global [ %r1733 + 0 ], [ %rd109 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8911399Z // end inline asm 2026-02-21T08:30:46.8911472Z cp.async.commit_group; 2026-02-21T08:30:46.8911675Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.8911737Z add.s32 %r839, %r96, %r27; 2026-02-21T08:30:46.8911915Z .loc 1 54 34 
// crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8911978Z cvt.s64.s32 %rd181, %r839; 2026-02-21T08:30:46.8912108Z add.s64 %rd110, %rd48, %rd181; 2026-02-21T08:30:46.8912285Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8912342Z // begin inline asm 2026-02-21T08:30:46.8912459Z cp.async.cg.shared.global [ %r1735 + 0 ], [ %rd110 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8912523Z // end inline asm 2026-02-21T08:30:46.8912586Z cp.async.commit_group; 2026-02-21T08:30:46.8912754Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8912817Z cvt.s64.s32 %rd182, %r835; 2026-02-21T08:30:46.8912889Z or.b64 %rd183, %rd182, %rd10; 2026-02-21T08:30:46.8912950Z shl.b64 %rd184, %rd183, 1; 2026-02-21T08:30:46.8913013Z add.s64 %rd18, %rd47, %rd184; 2026-02-21T08:30:46.8913083Z add.s64 %rd111, %rd18, 128; 2026-02-21T08:30:46.8913143Z cvt.s64.s32 %rd185, %r836; 2026-02-21T08:30:46.8913201Z or.b64 %rd186, %rd185, %rd10; 2026-02-21T08:30:46.8913260Z shl.b64 %rd187, %rd186, 1; 2026-02-21T08:30:46.8913330Z add.s64 %rd19, %rd47, %rd187; 2026-02-21T08:30:46.8913391Z add.s64 %rd112, %rd19, 128; 2026-02-21T08:30:46.8913557Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8913620Z bar.sync 0; 2026-02-21T08:30:46.8913678Z // begin inline asm 2026-02-21T08:30:46.8913793Z cp.async.cg.shared.global [ %r1555 + 0 ], [ %rd111 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8913856Z // end inline asm 2026-02-21T08:30:46.8913912Z // begin inline asm 2026-02-21T08:30:46.8914024Z cp.async.cg.shared.global [ %r1557 + 0 ], [ %rd112 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8914080Z // end inline asm 2026-02-21T08:30:46.8914150Z cp.async.commit_group; 2026-02-21T08:30:46.8914316Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.8914376Z add.s32 %r840, %r96, %r32; 
2026-02-21T08:30:46.8914556Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8914618Z cvt.s64.s32 %rd188, %r840; 2026-02-21T08:30:46.8914679Z add.s64 %rd113, %rd48, %rd188; 2026-02-21T08:30:46.8914846Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8914910Z // begin inline asm 2026-02-21T08:30:46.8915021Z cp.async.cg.shared.global [ %r1559 + 0 ], [ %rd113 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8915076Z // end inline asm 2026-02-21T08:30:46.8915146Z cp.async.commit_group; 2026-02-21T08:30:46.8915312Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8915372Z add.s64 %rd114, %rd18, 256; 2026-02-21T08:30:46.8915438Z add.s64 %rd115, %rd19, 256; 2026-02-21T08:30:46.8915605Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8915659Z bar.sync 0; 2026-02-21T08:30:46.8915718Z // begin inline asm 2026-02-21T08:30:46.8915892Z cp.async.cg.shared.global [ %r1561 + 0 ], [ %rd114 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8915947Z // end inline asm 2026-02-21T08:30:46.8916003Z // begin inline asm 2026-02-21T08:30:46.8916124Z cp.async.cg.shared.global [ %r1563 + 0 ], [ %rd115 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8916178Z // end inline asm 2026-02-21T08:30:46.8916241Z cp.async.commit_group; 2026-02-21T08:30:46.8916415Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.8916474Z add.s32 %r841, %r96, %r37; 2026-02-21T08:30:46.8916642Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8916702Z cvt.s64.s32 %rd189, %r841; 2026-02-21T08:30:46.8916771Z add.s64 %rd116, %rd48, %rd189; 2026-02-21T08:30:46.8916935Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8917037Z // begin inline asm 2026-02-21T08:30:46.8917161Z cp.async.cg.shared.global [ %r1565 + 
0 ], [ %rd116 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8917217Z // end inline asm 2026-02-21T08:30:46.8917280Z cp.async.commit_group; 2026-02-21T08:30:46.8917457Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8917522Z cp.async.wait_group 4; 2026-02-21T08:30:46.8917577Z bar.sync 0; 2026-02-21T08:30:46.8917668Z ld.shared.v4.b32 {%r842, %r843, %r844, %r845}, [%r40]; 2026-02-21T08:30:46.8917744Z mov.b32 {%rs353, %rs354}, %r845; 2026-02-21T08:30:46.8917807Z mov.b32 {%rs355, %rs356}, %r844; 2026-02-21T08:30:46.8917868Z mov.b32 {%rs357, %rs358}, %r843; 2026-02-21T08:30:46.8917935Z mov.b32 {%rs359, %rs360}, %r842; 2026-02-21T08:30:46.8918033Z ld.shared.v4.b32 {%r846, %r847, %r848, %r849}, [%r40+16]; 2026-02-21T08:30:46.8918094Z mov.b32 {%rs361, %rs362}, %r849; 2026-02-21T08:30:46.8918168Z mov.b32 {%rs363, %rs364}, %r848; 2026-02-21T08:30:46.8918230Z mov.b32 {%rs365, %rs366}, %r847; 2026-02-21T08:30:46.8918292Z mov.b32 {%rs367, %rs368}, %r846; 2026-02-21T08:30:46.8918466Z .loc 1 52 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:52:32 2026-02-21T08:30:46.8918541Z cvt.f32.bf16 %r746, %rs359; 2026-02-21T08:30:46.8918602Z cvt.f32.bf16 %r747, %rs360; 2026-02-21T08:30:46.8918662Z cvt.f32.bf16 %r748, %rs357; 2026-02-21T08:30:46.8918728Z cvt.f32.bf16 %r749, %rs358; 2026-02-21T08:30:46.8918787Z cvt.f32.bf16 %r750, %rs355; 2026-02-21T08:30:46.8918845Z cvt.f32.bf16 %r751, %rs356; 2026-02-21T08:30:46.8918903Z cvt.f32.bf16 %r752, %rs353; 2026-02-21T08:30:46.8918968Z cvt.f32.bf16 %r753, %rs354; 2026-02-21T08:30:46.8919026Z cvt.f32.bf16 %r754, %rs367; 2026-02-21T08:30:46.8919083Z cvt.f32.bf16 %r755, %rs368; 2026-02-21T08:30:46.8919149Z cvt.f32.bf16 %r756, %rs365; 2026-02-21T08:30:46.8919206Z cvt.f32.bf16 %r757, %rs366; 2026-02-21T08:30:46.8919263Z cvt.f32.bf16 %r758, %rs363; 2026-02-21T08:30:46.8919322Z cvt.f32.bf16 %r759, %rs364; 2026-02-21T08:30:46.8919388Z cvt.f32.bf16 %r760, %rs361; 2026-02-21T08:30:46.8919445Z 
cvt.f32.bf16 %r761, %rs362; 2026-02-21T08:30:46.8919497Z $L__tmp81: 2026-02-21T08:30:46.8919724Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8919781Z // begin inline asm 2026-02-21T08:30:46.8920073Z @%p65 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1567 + 0], 16, {%r746, %r747, %r748, %r749, %r750, %r751, %r752, %r753, %r754, %r755, %r756, %r757, %r758, %r759, %r760, %r761}; 2026-02-21T08:30:46.8920137Z // end inline asm 2026-02-21T08:30:46.8920193Z // begin inline asm 2026-02-21T08:30:46.8920264Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.8920327Z // end inline asm 2026-02-21T08:30:46.8920382Z bar.sync 0; 2026-02-21T08:30:46.8920435Z $L__tmp82: 2026-02-21T08:30:46.8920607Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8920729Z ld.shared.b8 %rs369, [%r42]; 2026-02-21T08:30:46.8920794Z ld.shared.b8 %rs370, [%r42+128]; 2026-02-21T08:30:46.8920856Z ld.shared.b8 %rs371, [%r42+256]; 2026-02-21T08:30:46.8920924Z ld.shared.b8 %rs372, [%r42+384]; 2026-02-21T08:30:46.8920985Z ld.shared.b8 %rs373, [%r42+512]; 2026-02-21T08:30:46.8921045Z ld.shared.b8 %rs374, [%r42+640]; 2026-02-21T08:30:46.8921104Z ld.shared.b8 %rs375, [%r42+768]; 2026-02-21T08:30:46.8921175Z ld.shared.b8 %rs376, [%r44]; 2026-02-21T08:30:46.8921239Z ld.shared.b8 %rs377, [%r42+1024]; 2026-02-21T08:30:46.8921304Z ld.shared.b8 %rs378, [%r42+1152]; 2026-02-21T08:30:46.8921374Z ld.shared.b8 %rs379, [%r42+1280]; 2026-02-21T08:30:46.8921434Z ld.shared.b8 %rs380, [%r42+1408]; 2026-02-21T08:30:46.8921496Z ld.shared.b8 %rs381, [%r42+1536]; 2026-02-21T08:30:46.8921585Z ld.shared.b8 %rs382, [%r42+1664]; 2026-02-21T08:30:46.8921655Z ld.shared.b8 %rs383, [%r42+1792]; 2026-02-21T08:30:46.8921763Z ld.shared.b8 %rs384, [%r46]; 2026-02-21T08:30:46.8921829Z ld.shared.b8 %rs385, [%r42+2048]; 2026-02-21T08:30:46.8921897Z ld.shared.b8 %rs386, [%r42+2176]; 2026-02-21T08:30:46.8921955Z 
ld.shared.b8 %rs387, [%r42+2304]; 2026-02-21T08:30:46.8922015Z ld.shared.b8 %rs388, [%r42+2432]; 2026-02-21T08:30:46.8922082Z ld.shared.b8 %rs389, [%r42+2560]; 2026-02-21T08:30:46.8922141Z ld.shared.b8 %rs390, [%r42+2688]; 2026-02-21T08:30:46.8922200Z ld.shared.b8 %rs391, [%r42+2816]; 2026-02-21T08:30:46.8922261Z ld.shared.b8 %rs392, [%r48]; 2026-02-21T08:30:46.8922328Z ld.shared.b8 %rs393, [%r42+3072]; 2026-02-21T08:30:46.8922387Z ld.shared.b8 %rs394, [%r42+3200]; 2026-02-21T08:30:46.8922447Z ld.shared.b8 %rs395, [%r42+3328]; 2026-02-21T08:30:46.8922513Z ld.shared.b8 %rs396, [%r42+3456]; 2026-02-21T08:30:46.8922572Z ld.shared.b8 %rs397, [%r42+3584]; 2026-02-21T08:30:46.8922631Z ld.shared.b8 %rs398, [%r42+3712]; 2026-02-21T08:30:46.8922690Z ld.shared.b8 %rs399, [%r42+3840]; 2026-02-21T08:30:46.8922759Z ld.shared.b8 %rs400, [%r50]; 2026-02-21T08:30:46.8922933Z .loc 1 57 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:57:28 2026-02-21T08:30:46.8922993Z shl.b16 %rs401, %rs369, 4; 2026-02-21T08:30:46.8923060Z shl.b16 %rs402, %rs370, 4; 2026-02-21T08:30:46.8923119Z shl.b16 %rs403, %rs371, 4; 2026-02-21T08:30:46.8923176Z shl.b16 %rs404, %rs372, 4; 2026-02-21T08:30:46.8923241Z shl.b16 %rs405, %rs373, 4; 2026-02-21T08:30:46.8923296Z shl.b16 %rs406, %rs374, 4; 2026-02-21T08:30:46.8923352Z shl.b16 %rs407, %rs375, 4; 2026-02-21T08:30:46.8923409Z shl.b16 %rs408, %rs376, 4; 2026-02-21T08:30:46.8923472Z shl.b16 %rs409, %rs377, 4; 2026-02-21T08:30:46.8923529Z shl.b16 %rs410, %rs378, 4; 2026-02-21T08:30:46.8923586Z shl.b16 %rs411, %rs379, 4; 2026-02-21T08:30:46.8923649Z shl.b16 %rs412, %rs380, 4; 2026-02-21T08:30:46.8923705Z shl.b16 %rs413, %rs381, 4; 2026-02-21T08:30:46.8923763Z shl.b16 %rs414, %rs382, 4; 2026-02-21T08:30:46.8923819Z shl.b16 %rs415, %rs383, 4; 2026-02-21T08:30:46.8923885Z shl.b16 %rs416, %rs384, 4; 2026-02-21T08:30:46.8923945Z shl.b16 %rs417, %rs385, 4; 2026-02-21T08:30:46.8924002Z shl.b16 %rs418, %rs386, 4; 2026-02-21T08:30:46.8924066Z shl.b16 
%rs419, %rs387, 4; 2026-02-21T08:30:46.8924123Z shl.b16 %rs420, %rs388, 4; 2026-02-21T08:30:46.8924179Z shl.b16 %rs421, %rs389, 4; 2026-02-21T08:30:46.8924236Z shl.b16 %rs422, %rs390, 4; 2026-02-21T08:30:46.8924301Z shl.b16 %rs423, %rs391, 4; 2026-02-21T08:30:46.8924359Z shl.b16 %rs424, %rs392, 4; 2026-02-21T08:30:46.8924415Z shl.b16 %rs425, %rs393, 4; 2026-02-21T08:30:46.8924480Z shl.b16 %rs426, %rs394, 4; 2026-02-21T08:30:46.8924536Z shl.b16 %rs427, %rs395, 4; 2026-02-21T08:30:46.8924594Z shl.b16 %rs428, %rs396, 4; 2026-02-21T08:30:46.8924650Z shl.b16 %rs429, %rs397, 4; 2026-02-21T08:30:46.8924715Z shl.b16 %rs430, %rs398, 4; 2026-02-21T08:30:46.8924772Z shl.b16 %rs431, %rs399, 4; 2026-02-21T08:30:46.8924829Z shl.b16 %rs432, %rs400, 4; 2026-02-21T08:30:46.8925011Z .loc 1 72 58 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:72:58 2026-02-21T08:30:46.8925132Z selp.b16 %rs433, %rs401, %rs369, %p246; 2026-02-21T08:30:46.8925192Z cvt.s16.s8 %rs434, %rs433; 2026-02-21T08:30:46.8925258Z shr.s16 %rs435, %rs434, 4; 2026-02-21T08:30:46.8925328Z selp.b16 %rs436, %rs402, %rs370, %p246; 2026-02-21T08:30:46.8925388Z cvt.s16.s8 %rs437, %rs436; 2026-02-21T08:30:46.8925447Z shr.s16 %rs438, %rs437, 4; 2026-02-21T08:30:46.8925525Z selp.b16 %rs439, %rs403, %rs371, %p246; 2026-02-21T08:30:46.8925584Z cvt.s16.s8 %rs440, %rs439; 2026-02-21T08:30:46.8925642Z shr.s16 %rs441, %rs440, 4; 2026-02-21T08:30:46.8925714Z selp.b16 %rs442, %rs404, %rs372, %p246; 2026-02-21T08:30:46.8925772Z cvt.s16.s8 %rs443, %rs442; 2026-02-21T08:30:46.8925829Z shr.s16 %rs444, %rs443, 4; 2026-02-21T08:30:46.8925893Z selp.b16 %rs445, %rs405, %rs373, %p246; 2026-02-21T08:30:46.8925958Z cvt.s16.s8 %rs446, %rs445; 2026-02-21T08:30:46.8926015Z shr.s16 %rs447, %rs446, 4; 2026-02-21T08:30:46.8926123Z selp.b16 %rs448, %rs406, %rs374, %p246; 2026-02-21T08:30:46.8926190Z cvt.s16.s8 %rs449, %rs448; 2026-02-21T08:30:46.8926249Z shr.s16 %rs450, %rs449, 4; 2026-02-21T08:30:46.8926315Z selp.b16 %rs451, %rs407, 
%rs375, %p246; 2026-02-21T08:30:46.8926373Z cvt.s16.s8 %rs452, %rs451; 2026-02-21T08:30:46.8926438Z shr.s16 %rs453, %rs452, 4; 2026-02-21T08:30:46.8926502Z selp.b16 %rs454, %rs408, %rs376, %p246; 2026-02-21T08:30:46.8926560Z cvt.s16.s8 %rs455, %rs454; 2026-02-21T08:30:46.8926637Z shr.s16 %rs456, %rs455, 4; 2026-02-21T08:30:46.8926706Z selp.b16 %rs457, %rs409, %rs377, %p246; 2026-02-21T08:30:46.8926767Z cvt.s16.s8 %rs458, %rs457; 2026-02-21T08:30:46.8926835Z shr.s16 %rs459, %rs458, 4; 2026-02-21T08:30:46.8926903Z selp.b16 %rs460, %rs410, %rs378, %p246; 2026-02-21T08:30:46.8926965Z cvt.s16.s8 %rs461, %rs460; 2026-02-21T08:30:46.8927026Z shr.s16 %rs462, %rs461, 4; 2026-02-21T08:30:46.8927102Z selp.b16 %rs463, %rs411, %rs379, %p246; 2026-02-21T08:30:46.8927165Z cvt.s16.s8 %rs464, %rs463; 2026-02-21T08:30:46.8927227Z shr.s16 %rs465, %rs464, 4; 2026-02-21T08:30:46.8927302Z selp.b16 %rs466, %rs412, %rs380, %p246; 2026-02-21T08:30:46.8927363Z cvt.s16.s8 %rs467, %rs466; 2026-02-21T08:30:46.8927422Z shr.s16 %rs468, %rs467, 4; 2026-02-21T08:30:46.8927490Z selp.b16 %rs469, %rs413, %rs381, %p246; 2026-02-21T08:30:46.8927558Z cvt.s16.s8 %rs470, %rs469; 2026-02-21T08:30:46.8927617Z shr.s16 %rs471, %rs470, 4; 2026-02-21T08:30:46.8927685Z selp.b16 %rs472, %rs414, %rs382, %p246; 2026-02-21T08:30:46.8927751Z cvt.s16.s8 %rs473, %rs472; 2026-02-21T08:30:46.8927811Z shr.s16 %rs474, %rs473, 4; 2026-02-21T08:30:46.8927878Z selp.b16 %rs475, %rs415, %rs383, %p246; 2026-02-21T08:30:46.8927939Z cvt.s16.s8 %rs476, %rs475; 2026-02-21T08:30:46.8928005Z shr.s16 %rs477, %rs476, 4; 2026-02-21T08:30:46.8928074Z selp.b16 %rs478, %rs416, %rs384, %p246; 2026-02-21T08:30:46.8928133Z cvt.s16.s8 %rs479, %rs478; 2026-02-21T08:30:46.8928203Z shr.s16 %rs480, %rs479, 4; 2026-02-21T08:30:46.8928273Z selp.b16 %rs481, %rs417, %rs385, %p246; 2026-02-21T08:30:46.8928333Z cvt.s16.s8 %rs482, %rs481; 2026-02-21T08:30:46.8928392Z shr.s16 %rs483, %rs482, 4; 2026-02-21T08:30:46.8928467Z selp.b16 %rs484, %rs418, %rs386, 
%p246; 2026-02-21T08:30:46.8928527Z cvt.s16.s8 %rs485, %rs484; 2026-02-21T08:30:46.8928586Z shr.s16 %rs486, %rs485, 4; 2026-02-21T08:30:46.8928662Z selp.b16 %rs487, %rs419, %rs387, %p246; 2026-02-21T08:30:46.8928722Z cvt.s16.s8 %rs488, %rs487; 2026-02-21T08:30:46.8928782Z shr.s16 %rs489, %rs488, 4; 2026-02-21T08:30:46.8928856Z selp.b16 %rs490, %rs420, %rs388, %p246; 2026-02-21T08:30:46.8928916Z cvt.s16.s8 %rs491, %rs490; 2026-02-21T08:30:46.8928975Z shr.s16 %rs492, %rs491, 4; 2026-02-21T08:30:46.8929041Z selp.b16 %rs493, %rs421, %rs389, %p246; 2026-02-21T08:30:46.8929109Z cvt.s16.s8 %rs494, %rs493; 2026-02-21T08:30:46.8929168Z shr.s16 %rs495, %rs494, 4; 2026-02-21T08:30:46.8929235Z selp.b16 %rs496, %rs422, %rs390, %p246; 2026-02-21T08:30:46.8929305Z cvt.s16.s8 %rs497, %rs496; 2026-02-21T08:30:46.8929435Z shr.s16 %rs498, %rs497, 4; 2026-02-21T08:30:46.8929503Z selp.b16 %rs499, %rs423, %rs391, %p246; 2026-02-21T08:30:46.8929563Z cvt.s16.s8 %rs500, %rs499; 2026-02-21T08:30:46.8929631Z shr.s16 %rs501, %rs500, 4; 2026-02-21T08:30:46.8929697Z selp.b16 %rs502, %rs424, %rs392, %p246; 2026-02-21T08:30:46.8929759Z cvt.s16.s8 %rs503, %rs502; 2026-02-21T08:30:46.8929827Z shr.s16 %rs504, %rs503, 4; 2026-02-21T08:30:46.8929895Z selp.b16 %rs505, %rs425, %rs393, %p246; 2026-02-21T08:30:46.8929956Z cvt.s16.s8 %rs506, %rs505; 2026-02-21T08:30:46.8930015Z shr.s16 %rs507, %rs506, 4; 2026-02-21T08:30:46.8930090Z selp.b16 %rs508, %rs426, %rs394, %p246; 2026-02-21T08:30:46.8930151Z cvt.s16.s8 %rs509, %rs508; 2026-02-21T08:30:46.8930211Z shr.s16 %rs510, %rs509, 4; 2026-02-21T08:30:46.8930285Z selp.b16 %rs511, %rs427, %rs395, %p246; 2026-02-21T08:30:46.8930345Z cvt.s16.s8 %rs512, %rs511; 2026-02-21T08:30:46.8930445Z shr.s16 %rs513, %rs512, 4; 2026-02-21T08:30:46.8930526Z selp.b16 %rs514, %rs428, %rs396, %p246; 2026-02-21T08:30:46.8930587Z cvt.s16.s8 %rs515, %rs514; 2026-02-21T08:30:46.8930648Z shr.s16 %rs516, %rs515, 4; 2026-02-21T08:30:46.8930717Z selp.b16 %rs517, %rs429, %rs397, %p246; 
2026-02-21T08:30:46.8930785Z cvt.s16.s8 %rs518, %rs517; 2026-02-21T08:30:46.8930846Z shr.s16 %rs519, %rs518, 4; 2026-02-21T08:30:46.8930913Z selp.b16 %rs520, %rs430, %rs398, %p246; 2026-02-21T08:30:46.8930981Z cvt.s16.s8 %rs521, %rs520; 2026-02-21T08:30:46.8931042Z shr.s16 %rs522, %rs521, 4; 2026-02-21T08:30:46.8931110Z selp.b16 %rs523, %rs431, %rs399, %p246; 2026-02-21T08:30:46.8931169Z cvt.s16.s8 %rs524, %rs523; 2026-02-21T08:30:46.8931238Z shr.s16 %rs525, %rs524, 4; 2026-02-21T08:30:46.8931306Z selp.b16 %rs526, %rs432, %rs400, %p246; 2026-02-21T08:30:46.8931366Z cvt.s16.s8 %rs527, %rs526; 2026-02-21T08:30:46.8931435Z shr.s16 %rs528, %rs527, 4; 2026-02-21T08:30:46.8931654Z .loc 1 77 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:77:32 2026-02-21T08:30:46.8931725Z cvt.rn.f32.s16 %r850, %rs435; 2026-02-21T08:30:46.8931797Z cvt.rn.f32.s16 %r851, %rs438; 2026-02-21T08:30:46.8931861Z cvt.rn.f32.s16 %r852, %rs441; 2026-02-21T08:30:46.8931925Z cvt.rn.f32.s16 %r853, %rs444; 2026-02-21T08:30:46.8931986Z cvt.rn.f32.s16 %r854, %rs447; 2026-02-21T08:30:46.8932058Z cvt.rn.f32.s16 %r855, %rs450; 2026-02-21T08:30:46.8932122Z cvt.rn.f32.s16 %r856, %rs453; 2026-02-21T08:30:46.8932189Z cvt.rn.f32.s16 %r857, %rs456; 2026-02-21T08:30:46.8932257Z cvt.rn.f32.s16 %r858, %rs459; 2026-02-21T08:30:46.8932319Z cvt.rn.f32.s16 %r859, %rs462; 2026-02-21T08:30:46.8932380Z cvt.rn.f32.s16 %r860, %rs465; 2026-02-21T08:30:46.8932441Z cvt.rn.f32.s16 %r861, %rs468; 2026-02-21T08:30:46.8932509Z cvt.rn.f32.s16 %r862, %rs471; 2026-02-21T08:30:46.8932570Z cvt.rn.f32.s16 %r863, %rs474; 2026-02-21T08:30:46.8932633Z cvt.rn.f32.s16 %r864, %rs477; 2026-02-21T08:30:46.8932711Z cvt.rn.f32.s16 %r865, %rs480; 2026-02-21T08:30:46.8932776Z cvt.rn.f32.s16 %r866, %rs483; 2026-02-21T08:30:46.8932840Z cvt.rn.f32.s16 %r867, %rs486; 2026-02-21T08:30:46.8932903Z cvt.rn.f32.s16 %r868, %rs489; 2026-02-21T08:30:46.8932976Z cvt.rn.f32.s16 %r869, %rs492; 2026-02-21T08:30:46.8933038Z cvt.rn.f32.s16 %r870, 
%rs495; 2026-02-21T08:30:46.8933098Z cvt.rn.f32.s16 %r871, %rs498; 2026-02-21T08:30:46.8933166Z cvt.rn.f32.s16 %r872, %rs501; 2026-02-21T08:30:46.8933226Z cvt.rn.f32.s16 %r873, %rs504; 2026-02-21T08:30:46.8933287Z cvt.rn.f32.s16 %r874, %rs507; 2026-02-21T08:30:46.8933355Z cvt.rn.f32.s16 %r875, %rs510; 2026-02-21T08:30:46.8933416Z cvt.rn.f32.s16 %r876, %rs513; 2026-02-21T08:30:46.8933475Z cvt.rn.f32.s16 %r877, %rs516; 2026-02-21T08:30:46.8933536Z cvt.rn.f32.s16 %r878, %rs519; 2026-02-21T08:30:46.8933604Z cvt.rn.f32.s16 %r879, %rs522; 2026-02-21T08:30:46.8933665Z cvt.rn.f32.s16 %r880, %rs525; 2026-02-21T08:30:46.8933725Z cvt.rn.f32.s16 %r881, %rs528; 2026-02-21T08:30:46.8933799Z st.shared.b32 [%r51], %r850; 2026-02-21T08:30:46.8933926Z st.shared.b32 [%r51+8], %r851; 2026-02-21T08:30:46.8933994Z st.shared.b32 [%r51+16384], %r866; 2026-02-21T08:30:46.8934060Z st.shared.b32 [%r51+16392], %r867; 2026-02-21T08:30:46.8934132Z st.shared.b32 [%r52], %r852; 2026-02-21T08:30:46.8934197Z st.shared.b32 [%r52+8], %r853; 2026-02-21T08:30:46.8934262Z st.shared.b32 [%r52+16384], %r868; 2026-02-21T08:30:46.8934334Z st.shared.b32 [%r52+16392], %r869; 2026-02-21T08:30:46.8934409Z st.shared.b32 [%r53], %r854; 2026-02-21T08:30:46.8934469Z st.shared.b32 [%r53+8], %r855; 2026-02-21T08:30:46.8934529Z st.shared.b32 [%r53+16384], %r870; 2026-02-21T08:30:46.8934598Z st.shared.b32 [%r53+16392], %r871; 2026-02-21T08:30:46.8934656Z st.shared.b32 [%r54], %r856; 2026-02-21T08:30:46.8934716Z st.shared.b32 [%r54+8], %r857; 2026-02-21T08:30:46.8934784Z st.shared.b32 [%r54+16384], %r872; 2026-02-21T08:30:46.8934844Z st.shared.b32 [%r54+16392], %r873; 2026-02-21T08:30:46.8934954Z st.shared.b32 [%r55], %r858; 2026-02-21T08:30:46.8935018Z st.shared.b32 [%r55+8], %r859; 2026-02-21T08:30:46.8935084Z st.shared.b32 [%r55+16384], %r874; 2026-02-21T08:30:46.8935144Z st.shared.b32 [%r55+16392], %r875; 2026-02-21T08:30:46.8935203Z st.shared.b32 [%r56], %r860; 2026-02-21T08:30:46.8935270Z st.shared.b32 [%r56+8], 
%r861; 2026-02-21T08:30:46.8935328Z st.shared.b32 [%r56+16384], %r876; 2026-02-21T08:30:46.8935387Z st.shared.b32 [%r56+16392], %r877; 2026-02-21T08:30:46.8935454Z st.shared.b32 [%r57], %r862; 2026-02-21T08:30:46.8935513Z st.shared.b32 [%r57+8], %r863; 2026-02-21T08:30:46.8935573Z st.shared.b32 [%r57+16384], %r878; 2026-02-21T08:30:46.8935632Z st.shared.b32 [%r57+16392], %r879; 2026-02-21T08:30:46.8935699Z st.shared.b32 [%r58], %r864; 2026-02-21T08:30:46.8935759Z st.shared.b32 [%r58+8], %r865; 2026-02-21T08:30:46.8935819Z st.shared.b32 [%r58+16384], %r880; 2026-02-21T08:30:46.8935884Z st.shared.b32 [%r58+16392], %r881; 2026-02-21T08:30:46.8935939Z $L__tmp83: 2026-02-21T08:30:46.8936162Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8936222Z // begin inline asm 2026-02-21T08:30:46.8936303Z fence.proxy.async.shared::cta; 2026-02-21T08:30:46.8936359Z // end inline asm 2026-02-21T08:30:46.8936414Z bar.sync 0; 2026-02-21T08:30:46.8936483Z @%p12 bra $L__BB0_11; 2026-02-21T08:30:46.8936583Z // %bb.10: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.8936650Z elect.sync %r906|%p64, -1; 2026-02-21T08:30:46.8936717Z mov.b32 %r884, 69208336; 2026-02-21T08:30:46.8936777Z mov.pred %p63, 0; 2026-02-21T08:30:46.8936836Z // begin inline asm 2026-02-21T08:30:46.8936994Z @%p64 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 0 ], %rd416, %r884, %p63; 2026-02-21T08:30:46.8937057Z // end inline asm 2026-02-21T08:30:46.8937113Z // begin inline asm 2026-02-21T08:30:46.8937264Z @%p64 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 8 ], %rd417, %r884, %p65; 2026-02-21T08:30:46.8937330Z // end inline asm 2026-02-21T08:30:46.8937385Z // begin inline asm 2026-02-21T08:30:46.8937533Z @%p64 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 16 ], %rd418, %r884, %p65; 2026-02-21T08:30:46.8937597Z // end inline asm 2026-02-21T08:30:46.8937652Z // begin inline asm 
2026-02-21T08:30:46.8937797Z @%p64 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 24 ], %rd419, %r884, %p65; 2026-02-21T08:30:46.8937852Z // end inline asm 2026-02-21T08:30:46.8937916Z // begin inline asm 2026-02-21T08:30:46.8938057Z @%p64 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 32 ], %rd420, %r884, %p65; 2026-02-21T08:30:46.8938112Z // end inline asm 2026-02-21T08:30:46.8938176Z // begin inline asm 2026-02-21T08:30:46.8938317Z @%p64 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 40 ], %rd421, %r884, %p65; 2026-02-21T08:30:46.8938371Z // end inline asm 2026-02-21T08:30:46.8938438Z // begin inline asm 2026-02-21T08:30:46.8938626Z @%p64 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 48 ], %rd422, %r884, %p65; 2026-02-21T08:30:46.8938679Z // end inline asm 2026-02-21T08:30:46.8938742Z // begin inline asm 2026-02-21T08:30:46.8938882Z @%p64 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 56 ], %rd423, %r884, %p65; 2026-02-21T08:30:46.8938937Z // end inline asm 2026-02-21T08:30:46.8938999Z add.s32 %r908, %r265, 69632; 2026-02-21T08:30:46.8939072Z cvt.u64.u32 %rd198, %r908; 2026-02-21T08:30:46.8939129Z // begin inline asm 2026-02-21T08:30:46.8939257Z @%p64 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd198]; 2026-02-21T08:30:46.8939323Z // end inline asm 2026-02-21T08:30:46.8939378Z $L__tmp84: 2026-02-21T08:30:46.8939481Z $L__BB0_11: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.8939658Z .loc 1 0 0 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:0 2026-02-21T08:30:46.8939763Z or.b32 %r97, %r95, %r8; 2026-02-21T08:30:46.8939829Z or.b32 %r98, %r832, %r12; 2026-02-21T08:30:46.8939889Z or.b32 %r99, %r832, %r13; 2026-02-21T08:30:46.8939957Z or.b32 %r100, %r832, %r14; 2026-02-21T08:30:46.8940013Z or.b32 %r101, %r832, %r15; 2026-02-21T08:30:46.8940188Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8940259Z add.s64 
%rd199, %rd18, 384; 2026-02-21T08:30:46.8940320Z add.s64 %rd200, %rd19, 384; 2026-02-21T08:30:46.8940491Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8940554Z // begin inline asm 2026-02-21T08:30:46.8940675Z cp.async.cg.shared.global [ %r1731 + 0 ], [ %rd199 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8940730Z // end inline asm 2026-02-21T08:30:46.8940786Z // begin inline asm 2026-02-21T08:30:46.8940912Z cp.async.cg.shared.global [ %r1733 + 0 ], [ %rd200 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8940970Z // end inline asm 2026-02-21T08:30:46.8941037Z cp.async.commit_group; 2026-02-21T08:30:46.8941216Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.8941276Z add.s32 %r918, %r96, %r59; 2026-02-21T08:30:46.8941449Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8941517Z cvt.s64.s32 %rd203, %r918; 2026-02-21T08:30:46.8941617Z add.s64 %rd201, %rd48, %rd203; 2026-02-21T08:30:46.8941782Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8941840Z // begin inline asm 2026-02-21T08:30:46.8941962Z cp.async.cg.shared.global [ %r1735 + 0 ], [ %rd201 + 0 ], 0x10, %r910; 2026-02-21T08:30:46.8942018Z // end inline asm 2026-02-21T08:30:46.8942083Z cp.async.commit_group; 2026-02-21T08:30:46.8942255Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8942323Z add.s32 %r2532, %r62, %r95; 2026-02-21T08:30:46.8942383Z shl.b32 %r919, %r94, 16; 2026-02-21T08:30:46.8942442Z or.b32 %r920, %r63, %r919; 2026-02-21T08:30:46.8942517Z mad.wide.s32 %rd610, %r920, 2, %rd9; 2026-02-21T08:30:46.8942578Z or.b32 %r2531, %r64, %r919; 2026-02-21T08:30:46.8942636Z mov.b64 %rd611, -32; 2026-02-21T08:30:46.8942701Z mov.b32 %r2535, %r2533; 2026-02-21T08:30:46.8942759Z mov.b32 %r2537, %r2533; 2026-02-21T08:30:46.8942817Z bra.uni $L__BB0_12; 2026-02-21T08:30:46.8942925Z 
$L__BB0_14: // in Loop: Header=BB0_12 Depth=2 2026-02-21T08:30:46.8943093Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8943153Z add.s64 %rd611, %rd611, 32; 2026-02-21T08:30:46.8943219Z setp.lt.u64 %p101, %rd611, 384; 2026-02-21T08:30:46.8943281Z $L__tmp85: 2026-02-21T08:30:46.8943507Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8943624Z add.s32 %r1026, %r2536, 1; 2026-02-21T08:30:46.8943696Z setp.gt.s32 %p102, %r1026, 1; 2026-02-21T08:30:46.8943761Z selp.b32 %r2536, 0, %r1026, %p102; 2026-02-21T08:30:46.8943823Z selp.b32 %r1027, 1, 0, %p102; 2026-02-21T08:30:46.8943890Z xor.b32 %r116, %r2537, %r1027; 2026-02-21T08:30:46.8943943Z $L__tmp86: 2026-02-21T08:30:46.8944110Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.8944180Z mad.wide.s32 %rd214, %r2531, 2, %rd47; 2026-02-21T08:30:46.8944350Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8944413Z add.s32 %r1020, %r112, %r2555; 2026-02-21T08:30:46.8944474Z selp.b32 %r1021, 16, 0, %p101; 2026-02-21T08:30:46.8944537Z // begin inline asm 2026-02-21T08:30:46.8944657Z cp.async.cg.shared.global [ %r1020 + 0 ], [ %rd610 + 0 ], 0x10, %r1021; 2026-02-21T08:30:46.8944761Z // end inline asm 2026-02-21T08:30:46.8944825Z add.s32 %r1022, %r1020, 4096; 2026-02-21T08:30:46.8944889Z // begin inline asm 2026-02-21T08:30:46.8945008Z cp.async.cg.shared.global [ %r1022 + 0 ], [ %rd214 + 0 ], 0x10, %r1021; 2026-02-21T08:30:46.8945064Z // end inline asm 2026-02-21T08:30:46.8945134Z cp.async.commit_group; 2026-02-21T08:30:46.8945304Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.8945366Z cvt.s64.s32 %rd216, %r2532; 2026-02-21T08:30:46.8945433Z add.s64 %rd215, %rd48, %rd216; 2026-02-21T08:30:46.8945605Z .loc 1 54 87 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8945664Z add.s32 %r1024, %r113, %r2555; 2026-02-21T08:30:46.8945720Z // begin inline asm 2026-02-21T08:30:46.8945842Z cp.async.cg.shared.global [ %r1024 + 0 ], [ %rd215 + 0 ], 0x10, %r1021; 2026-02-21T08:30:46.8945896Z // end inline asm 2026-02-21T08:30:46.8945961Z cp.async.commit_group; 2026-02-21T08:30:46.8946142Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8946203Z add.s32 %r2532, %r2532, 262144; 2026-02-21T08:30:46.8946263Z add.s64 %rd610, %rd610, 128; 2026-02-21T08:30:46.8946328Z add.s32 %r2531, %r2531, 64; 2026-02-21T08:30:46.8946392Z setp.lt.u64 %p103, %rd611, 448; 2026-02-21T08:30:46.8946450Z mov.b32 %r2533, %r2537; 2026-02-21T08:30:46.8946507Z mov.b32 %r2537, %r116; 2026-02-21T08:30:46.8946576Z @%p103 bra $L__BB0_12; 2026-02-21T08:30:46.8946635Z bra.uni $L__BB0_15; 2026-02-21T08:30:46.8946735Z $L__BB0_12: // Parent Loop BB0_3 Depth=1 2026-02-21T08:30:46.8946837Z // => This Inner Loop Header: Depth=2 2026-02-21T08:30:46.8946895Z add.s32 %r941, %r2535, 1; 2026-02-21T08:30:46.8946957Z setp.gt.s32 %p83, %r941, 2; 2026-02-21T08:30:46.8947022Z selp.b32 %r2535, 0, %r941, %p83; 2026-02-21T08:30:46.8947200Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.8947268Z cp.async.wait_group 4; 2026-02-21T08:30:46.8947322Z bar.sync 0; 2026-02-21T08:30:46.8947390Z shl.b32 %r942, %r2535, 12; 2026-02-21T08:30:46.8947448Z shl.b32 %r943, %r2535, 13; 2026-02-21T08:30:46.8947509Z add.s32 %r945, %r265, %r943; 2026-02-21T08:30:46.8947577Z add.s32 %r112, %r945, 32768; 2026-02-21T08:30:46.8947636Z add.s32 %r946, %r112, %r39; 2026-02-21T08:30:46.8947733Z ld.shared.v4.b32 {%r947, %r948, %r949, %r950}, [%r946]; 2026-02-21T08:30:46.8947800Z mov.b32 {%rs529, %rs530}, %r950; 2026-02-21T08:30:46.8947874Z mov.b32 {%rs531, %rs532}, %r949; 2026-02-21T08:30:46.8947936Z mov.b32 {%rs533, %rs534}, %r948; 
2026-02-21T08:30:46.8947996Z mov.b32 {%rs535, %rs536}, %r947; 2026-02-21T08:30:46.8948104Z ld.shared.v4.b32 {%r951, %r952, %r953, %r954}, [%r946+16]; 2026-02-21T08:30:46.8948166Z mov.b32 {%rs537, %rs538}, %r954; 2026-02-21T08:30:46.8948227Z mov.b32 {%rs539, %rs540}, %r953; 2026-02-21T08:30:46.8948344Z mov.b32 {%rs541, %rs542}, %r952; 2026-02-21T08:30:46.8948411Z mov.b32 {%rs543, %rs544}, %r951; 2026-02-21T08:30:46.8948579Z .loc 1 52 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:52:32 2026-02-21T08:30:46.8948641Z cvt.f32.bf16 %r925, %rs535; 2026-02-21T08:30:46.8948709Z cvt.f32.bf16 %r926, %rs536; 2026-02-21T08:30:46.8948767Z cvt.f32.bf16 %r927, %rs533; 2026-02-21T08:30:46.8948826Z cvt.f32.bf16 %r928, %rs534; 2026-02-21T08:30:46.8948891Z cvt.f32.bf16 %r929, %rs531; 2026-02-21T08:30:46.8948948Z cvt.f32.bf16 %r930, %rs532; 2026-02-21T08:30:46.8949005Z cvt.f32.bf16 %r931, %rs529; 2026-02-21T08:30:46.8949062Z cvt.f32.bf16 %r932, %rs530; 2026-02-21T08:30:46.8949129Z cvt.f32.bf16 %r933, %rs543; 2026-02-21T08:30:46.8949186Z cvt.f32.bf16 %r934, %rs544; 2026-02-21T08:30:46.8949244Z cvt.f32.bf16 %r935, %rs541; 2026-02-21T08:30:46.8949310Z cvt.f32.bf16 %r936, %rs542; 2026-02-21T08:30:46.8949407Z cvt.f32.bf16 %r937, %rs539; 2026-02-21T08:30:46.8949470Z cvt.f32.bf16 %r938, %rs540; 2026-02-21T08:30:46.8949528Z cvt.f32.bf16 %r939, %rs537; 2026-02-21T08:30:46.8949595Z cvt.f32.bf16 %r940, %rs538; 2026-02-21T08:30:46.8949649Z $L__tmp87: 2026-02-21T08:30:46.8949871Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8949939Z // begin inline asm 2026-02-21T08:30:46.8949993Z 2026-02-21T08:30:46.8950044Z { 2026-02-21T08:30:46.8950114Z .reg .pred complete; 2026-02-21T08:30:46.8950168Z waitLoop: 2026-02-21T08:30:46.8950291Z mbarrier.try_wait.parity.shared.b64 complete, [%r2534], %r2533; 2026-02-21T08:30:46.8950356Z @!complete bra.uni waitLoop; 2026-02-21T08:30:46.8950415Z } 2026-02-21T08:30:46.8950419Z 
2026-02-21T08:30:46.8950474Z // end inline asm 2026-02-21T08:30:46.8950536Z mov.pred %p84, -1; 2026-02-21T08:30:46.8950600Z // begin inline asm 2026-02-21T08:30:46.8950894Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1567 + 0], 16, {%r925, %r926, %r927, %r928, %r929, %r930, %r931, %r932, %r933, %r934, %r935, %r936, %r937, %r938, %r939, %r940}; 2026-02-21T08:30:46.8950952Z // end inline asm 2026-02-21T08:30:46.8951014Z // begin inline asm 2026-02-21T08:30:46.8951084Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.8951139Z // end inline asm 2026-02-21T08:30:46.8951194Z bar.sync 0; 2026-02-21T08:30:46.8951254Z $L__tmp88: 2026-02-21T08:30:46.8951425Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.8951486Z add.s32 %r955, %r265, %r942; 2026-02-21T08:30:46.8951586Z add.s32 %r113, %r955, 57344; 2026-02-21T08:30:46.8951648Z add.s32 %r956, %r113, %r41; 2026-02-21T08:30:46.8951712Z ld.shared.b8 %rs545, [%r956]; 2026-02-21T08:30:46.8951778Z ld.shared.b8 %rs546, [%r956+128]; 2026-02-21T08:30:46.8951851Z ld.shared.b8 %rs547, [%r956+256]; 2026-02-21T08:30:46.8951912Z ld.shared.b8 %rs548, [%r956+384]; 2026-02-21T08:30:46.8951975Z ld.shared.b8 %rs549, [%r956+512]; 2026-02-21T08:30:46.8952044Z ld.shared.b8 %rs550, [%r956+640]; 2026-02-21T08:30:46.8952105Z ld.shared.b8 %rs551, [%r956+768]; 2026-02-21T08:30:46.8952163Z add.s32 %r957, %r113, %r43; 2026-02-21T08:30:46.8952224Z ld.shared.b8 %rs552, [%r957]; 2026-02-21T08:30:46.8952296Z ld.shared.b8 %rs553, [%r956+1024]; 2026-02-21T08:30:46.8952359Z ld.shared.b8 %rs554, [%r956+1152]; 2026-02-21T08:30:46.8952419Z ld.shared.b8 %rs555, [%r956+1280]; 2026-02-21T08:30:46.8952487Z ld.shared.b8 %rs556, [%r956+1408]; 2026-02-21T08:30:46.8952547Z ld.shared.b8 %rs557, [%r956+1536]; 2026-02-21T08:30:46.8952608Z ld.shared.b8 %rs558, [%r956+1664]; 2026-02-21T08:30:46.8952676Z ld.shared.b8 %rs559, [%r956+1792]; 2026-02-21T08:30:46.8952733Z add.s32 %r958, %r113, %r45; 
2026-02-21T08:30:46.8952793Z ld.shared.b8 %rs560, [%r958]; 2026-02-21T08:30:46.8952854Z ld.shared.b8 %rs561, [%r956+2048]; 2026-02-21T08:30:46.8952923Z ld.shared.b8 %rs562, [%r956+2176]; 2026-02-21T08:30:46.8952986Z ld.shared.b8 %rs563, [%r956+2304]; 2026-02-21T08:30:46.8953098Z ld.shared.b8 %rs564, [%r956+2432]; 2026-02-21T08:30:46.8953167Z ld.shared.b8 %rs565, [%r956+2560]; 2026-02-21T08:30:46.8953227Z ld.shared.b8 %rs566, [%r956+2688]; 2026-02-21T08:30:46.8953288Z ld.shared.b8 %rs567, [%r956+2816]; 2026-02-21T08:30:46.8953346Z add.s32 %r959, %r113, %r47; 2026-02-21T08:30:46.8953412Z ld.shared.b8 %rs568, [%r959]; 2026-02-21T08:30:46.8953472Z ld.shared.b8 %rs569, [%r956+3072]; 2026-02-21T08:30:46.8953533Z ld.shared.b8 %rs570, [%r956+3200]; 2026-02-21T08:30:46.8953599Z ld.shared.b8 %rs571, [%r956+3328]; 2026-02-21T08:30:46.8953659Z ld.shared.b8 %rs572, [%r956+3456]; 2026-02-21T08:30:46.8953718Z ld.shared.b8 %rs573, [%r956+3584]; 2026-02-21T08:30:46.8953777Z ld.shared.b8 %rs574, [%r956+3712]; 2026-02-21T08:30:46.8953846Z ld.shared.b8 %rs575, [%r956+3840]; 2026-02-21T08:30:46.8953904Z add.s32 %r960, %r113, %r49; 2026-02-21T08:30:46.8954014Z ld.shared.b8 %rs576, [%r960]; 2026-02-21T08:30:46.8954194Z .loc 1 57 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:57:28 2026-02-21T08:30:46.8954258Z shl.b16 %rs577, %rs545, 4; 2026-02-21T08:30:46.8954318Z shl.b16 %rs578, %rs546, 4; 2026-02-21T08:30:46.8954382Z shl.b16 %rs579, %rs547, 4; 2026-02-21T08:30:46.8954441Z shl.b16 %rs580, %rs548, 4; 2026-02-21T08:30:46.8954499Z shl.b16 %rs581, %rs549, 4; 2026-02-21T08:30:46.8954556Z shl.b16 %rs582, %rs550, 4; 2026-02-21T08:30:46.8954623Z shl.b16 %rs583, %rs551, 4; 2026-02-21T08:30:46.8954680Z shl.b16 %rs584, %rs552, 4; 2026-02-21T08:30:46.8954737Z shl.b16 %rs585, %rs553, 4; 2026-02-21T08:30:46.8954803Z shl.b16 %rs586, %rs554, 4; 2026-02-21T08:30:46.8954862Z shl.b16 %rs587, %rs555, 4; 2026-02-21T08:30:46.8954920Z shl.b16 %rs588, %rs556, 4; 2026-02-21T08:30:46.8955025Z 
shl.b16 %rs589, %rs557, 4; 2026-02-21T08:30:46.8955093Z shl.b16 %rs590, %rs558, 4; 2026-02-21T08:30:46.8955152Z shl.b16 %rs591, %rs559, 4; 2026-02-21T08:30:46.8955213Z shl.b16 %rs592, %rs560, 4; 2026-02-21T08:30:46.8955282Z shl.b16 %rs593, %rs561, 4; 2026-02-21T08:30:46.8955339Z shl.b16 %rs594, %rs562, 4; 2026-02-21T08:30:46.8955396Z shl.b16 %rs595, %rs563, 4; 2026-02-21T08:30:46.8955454Z shl.b16 %rs596, %rs564, 4; 2026-02-21T08:30:46.8955519Z shl.b16 %rs597, %rs565, 4; 2026-02-21T08:30:46.8955576Z shl.b16 %rs598, %rs566, 4; 2026-02-21T08:30:46.8955633Z shl.b16 %rs599, %rs567, 4; 2026-02-21T08:30:46.8955699Z shl.b16 %rs600, %rs568, 4; 2026-02-21T08:30:46.8955756Z shl.b16 %rs601, %rs569, 4; 2026-02-21T08:30:46.8955812Z shl.b16 %rs602, %rs570, 4; 2026-02-21T08:30:46.8955871Z shl.b16 %rs603, %rs571, 4; 2026-02-21T08:30:46.8955935Z shl.b16 %rs604, %rs572, 4; 2026-02-21T08:30:46.8955992Z shl.b16 %rs605, %rs573, 4; 2026-02-21T08:30:46.8956050Z shl.b16 %rs606, %rs574, 4; 2026-02-21T08:30:46.8956116Z shl.b16 %rs607, %rs575, 4; 2026-02-21T08:30:46.8956174Z shl.b16 %rs608, %rs576, 4; 2026-02-21T08:30:46.8956350Z .loc 1 72 58 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:72:58 2026-02-21T08:30:46.8956430Z selp.b16 %rs609, %rs577, %rs545, %p246; 2026-02-21T08:30:46.8956490Z cvt.s16.s8 %rs610, %rs609; 2026-02-21T08:30:46.8956547Z shr.s16 %rs611, %rs610, 4; 2026-02-21T08:30:46.8956614Z selp.b16 %rs612, %rs578, %rs546, %p246; 2026-02-21T08:30:46.8956681Z cvt.s16.s8 %rs613, %rs612; 2026-02-21T08:30:46.8956738Z shr.s16 %rs614, %rs613, 4; 2026-02-21T08:30:46.8956805Z selp.b16 %rs615, %rs579, %rs547, %p246; 2026-02-21T08:30:46.8956870Z cvt.s16.s8 %rs616, %rs615; 2026-02-21T08:30:46.8956927Z shr.s16 %rs617, %rs616, 4; 2026-02-21T08:30:46.8956992Z selp.b16 %rs618, %rs580, %rs548, %p246; 2026-02-21T08:30:46.8957050Z cvt.s16.s8 %rs619, %rs618; 2026-02-21T08:30:46.8957115Z shr.s16 %rs620, %rs619, 4; 2026-02-21T08:30:46.8957179Z selp.b16 %rs621, %rs581, %rs549, %p246; 
2026-02-21T08:30:46.8957237Z cvt.s16.s8 %rs622, %rs621; 2026-02-21T08:30:46.8957301Z shr.s16 %rs623, %rs622, 4; 2026-02-21T08:30:46.8957368Z selp.b16 %rs624, %rs582, %rs550, %p246; 2026-02-21T08:30:46.8957471Z cvt.s16.s8 %rs625, %rs624; 2026-02-21T08:30:46.8957535Z shr.s16 %rs626, %rs625, 4; 2026-02-21T08:30:46.8957599Z selp.b16 %rs627, %rs583, %rs551, %p246; 2026-02-21T08:30:46.8957658Z cvt.s16.s8 %rs628, %rs627; 2026-02-21T08:30:46.8957714Z shr.s16 %rs629, %rs628, 4; 2026-02-21T08:30:46.8957785Z selp.b16 %rs630, %rs584, %rs552, %p246; 2026-02-21T08:30:46.8957844Z cvt.s16.s8 %rs631, %rs630; 2026-02-21T08:30:46.8957900Z shr.s16 %rs632, %rs631, 4; 2026-02-21T08:30:46.8957972Z selp.b16 %rs633, %rs585, %rs553, %p246; 2026-02-21T08:30:46.8958030Z cvt.s16.s8 %rs634, %rs633; 2026-02-21T08:30:46.8958087Z shr.s16 %rs635, %rs634, 4; 2026-02-21T08:30:46.8958150Z selp.b16 %rs636, %rs586, %rs554, %p246; 2026-02-21T08:30:46.8958216Z cvt.s16.s8 %rs637, %rs636; 2026-02-21T08:30:46.8958271Z shr.s16 %rs638, %rs637, 4; 2026-02-21T08:30:46.8958335Z selp.b16 %rs639, %rs587, %rs555, %p246; 2026-02-21T08:30:46.8958444Z cvt.s16.s8 %rs640, %rs639; 2026-02-21T08:30:46.8958505Z shr.s16 %rs641, %rs640, 4; 2026-02-21T08:30:46.8958570Z selp.b16 %rs642, %rs588, %rs556, %p246; 2026-02-21T08:30:46.8958627Z cvt.s16.s8 %rs643, %rs642; 2026-02-21T08:30:46.8958692Z shr.s16 %rs644, %rs643, 4; 2026-02-21T08:30:46.8958756Z selp.b16 %rs645, %rs589, %rs557, %p246; 2026-02-21T08:30:46.8958813Z cvt.s16.s8 %rs646, %rs645; 2026-02-21T08:30:46.8958877Z shr.s16 %rs647, %rs646, 4; 2026-02-21T08:30:46.8958941Z selp.b16 %rs648, %rs590, %rs558, %p246; 2026-02-21T08:30:46.8958998Z cvt.s16.s8 %rs649, %rs648; 2026-02-21T08:30:46.8959055Z shr.s16 %rs650, %rs649, 4; 2026-02-21T08:30:46.8959126Z selp.b16 %rs651, %rs591, %rs559, %p246; 2026-02-21T08:30:46.8959183Z cvt.s16.s8 %rs652, %rs651; 2026-02-21T08:30:46.8959241Z shr.s16 %rs653, %rs652, 4; 2026-02-21T08:30:46.8959313Z selp.b16 %rs654, %rs592, %rs560, %p246; 
2026-02-21T08:30:46.8959370Z cvt.s16.s8 %rs655, %rs654; 2026-02-21T08:30:46.8959427Z shr.s16 %rs656, %rs655, 4; 2026-02-21T08:30:46.8959500Z selp.b16 %rs657, %rs593, %rs561, %p246; 2026-02-21T08:30:46.8959559Z cvt.s16.s8 %rs658, %rs657; 2026-02-21T08:30:46.8959617Z shr.s16 %rs659, %rs658, 4; 2026-02-21T08:30:46.8959682Z selp.b16 %rs660, %rs594, %rs562, %p246; 2026-02-21T08:30:46.8959746Z cvt.s16.s8 %rs661, %rs660; 2026-02-21T08:30:46.8959804Z shr.s16 %rs662, %rs661, 4; 2026-02-21T08:30:46.8959868Z selp.b16 %rs663, %rs595, %rs563, %p246; 2026-02-21T08:30:46.8959933Z cvt.s16.s8 %rs664, %rs663; 2026-02-21T08:30:46.8959990Z shr.s16 %rs665, %rs664, 4; 2026-02-21T08:30:46.8960054Z selp.b16 %rs666, %rs596, %rs564, %p246; 2026-02-21T08:30:46.8960112Z cvt.s16.s8 %rs667, %rs666; 2026-02-21T08:30:46.8960178Z shr.s16 %rs668, %rs667, 4; 2026-02-21T08:30:46.8960243Z selp.b16 %rs669, %rs597, %rs565, %p246; 2026-02-21T08:30:46.8960301Z cvt.s16.s8 %rs670, %rs669; 2026-02-21T08:30:46.8960368Z shr.s16 %rs671, %rs670, 4; 2026-02-21T08:30:46.8960431Z selp.b16 %rs672, %rs598, %rs566, %p246; 2026-02-21T08:30:46.8960492Z cvt.s16.s8 %rs673, %rs672; 2026-02-21T08:30:46.8960552Z shr.s16 %rs674, %rs673, 4; 2026-02-21T08:30:46.8960626Z selp.b16 %rs675, %rs599, %rs567, %p246; 2026-02-21T08:30:46.8960683Z cvt.s16.s8 %rs676, %rs675; 2026-02-21T08:30:46.8960741Z shr.s16 %rs677, %rs676, 4; 2026-02-21T08:30:46.8960817Z selp.b16 %rs678, %rs600, %rs568, %p246; 2026-02-21T08:30:46.8960875Z cvt.s16.s8 %rs679, %rs678; 2026-02-21T08:30:46.8960931Z shr.s16 %rs680, %rs679, 4; 2026-02-21T08:30:46.8961002Z selp.b16 %rs681, %rs601, %rs569, %p246; 2026-02-21T08:30:46.8961060Z cvt.s16.s8 %rs682, %rs681; 2026-02-21T08:30:46.8961118Z shr.s16 %rs683, %rs682, 4; 2026-02-21T08:30:46.8961182Z selp.b16 %rs684, %rs602, %rs570, %p246; 2026-02-21T08:30:46.8961247Z cvt.s16.s8 %rs685, %rs684; 2026-02-21T08:30:46.8961305Z shr.s16 %rs686, %rs685, 4; 2026-02-21T08:30:46.8961372Z selp.b16 %rs687, %rs603, %rs571, %p246; 
2026-02-21T08:30:46.8961440Z cvt.s16.s8 %rs688, %rs687; 2026-02-21T08:30:46.8961502Z shr.s16 %rs689, %rs688, 4; 2026-02-21T08:30:46.8961660Z selp.b16 %rs690, %rs604, %rs572, %p246; 2026-02-21T08:30:46.8976804Z cvt.s16.s8 %rs691, %rs690; 2026-02-21T08:30:46.8976931Z shr.s16 %rs692, %rs691, 4; 2026-02-21T08:30:46.8977029Z selp.b16 %rs693, %rs605, %rs573, %p246; 2026-02-21T08:30:46.8977100Z cvt.s16.s8 %rs694, %rs693; 2026-02-21T08:30:46.8977167Z shr.s16 %rs695, %rs694, 4; 2026-02-21T08:30:46.8977259Z selp.b16 %rs696, %rs606, %rs574, %p246; 2026-02-21T08:30:46.8977324Z cvt.s16.s8 %rs697, %rs696; 2026-02-21T08:30:46.8977387Z shr.s16 %rs698, %rs697, 4; 2026-02-21T08:30:46.8977463Z selp.b16 %rs699, %rs607, %rs575, %p246; 2026-02-21T08:30:46.8977536Z cvt.s16.s8 %rs700, %rs699; 2026-02-21T08:30:46.8977599Z shr.s16 %rs701, %rs700, 4; 2026-02-21T08:30:46.8977671Z selp.b16 %rs702, %rs608, %rs576, %p246; 2026-02-21T08:30:46.8977743Z cvt.s16.s8 %rs703, %rs702; 2026-02-21T08:30:46.8977807Z shr.s16 %rs704, %rs703, 4; 2026-02-21T08:30:46.8978137Z .loc 1 77 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:77:32 2026-02-21T08:30:46.8978234Z cvt.rn.f32.s16 %r961, %rs611; 2026-02-21T08:30:46.8978303Z cvt.rn.f32.s16 %r962, %rs614; 2026-02-21T08:30:46.8978369Z cvt.rn.f32.s16 %r963, %rs617; 2026-02-21T08:30:46.8978434Z cvt.rn.f32.s16 %r964, %rs620; 2026-02-21T08:30:46.8978515Z cvt.rn.f32.s16 %r965, %rs623; 2026-02-21T08:30:46.8978579Z cvt.rn.f32.s16 %r966, %rs626; 2026-02-21T08:30:46.8978639Z cvt.rn.f32.s16 %r967, %rs629; 2026-02-21T08:30:46.8978710Z cvt.rn.f32.s16 %r968, %rs632; 2026-02-21T08:30:46.8978775Z cvt.rn.f32.s16 %r969, %rs635; 2026-02-21T08:30:46.8978835Z cvt.rn.f32.s16 %r970, %rs638; 2026-02-21T08:30:46.8978897Z cvt.rn.f32.s16 %r971, %rs641; 2026-02-21T08:30:46.8978969Z cvt.rn.f32.s16 %r972, %rs644; 2026-02-21T08:30:46.8979029Z cvt.rn.f32.s16 %r973, %rs647; 2026-02-21T08:30:46.8979092Z cvt.rn.f32.s16 %r974, %rs650; 2026-02-21T08:30:46.8979162Z cvt.rn.f32.s16 
%r975, %rs653; 2026-02-21T08:30:46.8979224Z cvt.rn.f32.s16 %r976, %rs656; 2026-02-21T08:30:46.8979297Z cvt.rn.f32.s16 %r977, %rs659; 2026-02-21T08:30:46.8979359Z cvt.rn.f32.s16 %r978, %rs662; 2026-02-21T08:30:46.8979431Z cvt.rn.f32.s16 %r979, %rs665; 2026-02-21T08:30:46.8979493Z cvt.rn.f32.s16 %r980, %rs668; 2026-02-21T08:30:46.8979554Z cvt.rn.f32.s16 %r981, %rs671; 2026-02-21T08:30:46.8979625Z cvt.rn.f32.s16 %r982, %rs674; 2026-02-21T08:30:46.8979684Z cvt.rn.f32.s16 %r983, %rs677; 2026-02-21T08:30:46.8979744Z cvt.rn.f32.s16 %r984, %rs680; 2026-02-21T08:30:46.8979814Z cvt.rn.f32.s16 %r985, %rs683; 2026-02-21T08:30:46.8979874Z cvt.rn.f32.s16 %r986, %rs686; 2026-02-21T08:30:46.8979933Z cvt.rn.f32.s16 %r987, %rs689; 2026-02-21T08:30:46.8979993Z cvt.rn.f32.s16 %r988, %rs692; 2026-02-21T08:30:46.8980061Z cvt.rn.f32.s16 %r989, %rs695; 2026-02-21T08:30:46.8980119Z cvt.rn.f32.s16 %r990, %rs698; 2026-02-21T08:30:46.8980179Z cvt.rn.f32.s16 %r991, %rs701; 2026-02-21T08:30:46.8980248Z cvt.rn.f32.s16 %r992, %rs704; 2026-02-21T08:30:46.8980315Z st.shared.b32 [%r51], %r961; 2026-02-21T08:30:46.8980386Z st.shared.b32 [%r51+8], %r962; 2026-02-21T08:30:46.8980454Z st.shared.b32 [%r51+16384], %r977; 2026-02-21T08:30:46.8980528Z st.shared.b32 [%r51+16392], %r978; 2026-02-21T08:30:46.8980592Z st.shared.b32 [%r52], %r963; 2026-02-21T08:30:46.8980653Z st.shared.b32 [%r52+8], %r964; 2026-02-21T08:30:46.8980723Z st.shared.b32 [%r52+16384], %r979; 2026-02-21T08:30:46.8980785Z st.shared.b32 [%r52+16392], %r980; 2026-02-21T08:30:46.8980847Z st.shared.b32 [%r53], %r965; 2026-02-21T08:30:46.8980910Z st.shared.b32 [%r53+8], %r966; 2026-02-21T08:30:46.8980982Z st.shared.b32 [%r53+16384], %r981; 2026-02-21T08:30:46.8981043Z st.shared.b32 [%r53+16392], %r982; 2026-02-21T08:30:46.8981107Z st.shared.b32 [%r54], %r967; 2026-02-21T08:30:46.8981177Z st.shared.b32 [%r54+8], %r968; 2026-02-21T08:30:46.8981241Z st.shared.b32 [%r54+16384], %r983; 2026-02-21T08:30:46.8981303Z st.shared.b32 [%r54+16392], 
%r984; 2026-02-21T08:30:46.8981368Z st.shared.b32 [%r55], %r969; 2026-02-21T08:30:46.8981516Z st.shared.b32 [%r55+8], %r970; 2026-02-21T08:30:46.8981617Z st.shared.b32 [%r55+16384], %r985; 2026-02-21T08:30:46.8981682Z st.shared.b32 [%r55+16392], %r986; 2026-02-21T08:30:46.8981755Z st.shared.b32 [%r56], %r971; 2026-02-21T08:30:46.8981822Z st.shared.b32 [%r56+8], %r972; 2026-02-21T08:30:46.8981887Z st.shared.b32 [%r56+16384], %r987; 2026-02-21T08:30:46.8981962Z st.shared.b32 [%r56+16392], %r988; 2026-02-21T08:30:46.8982025Z st.shared.b32 [%r57], %r973; 2026-02-21T08:30:46.8982089Z st.shared.b32 [%r57+8], %r974; 2026-02-21T08:30:46.8982154Z st.shared.b32 [%r57+16384], %r989; 2026-02-21T08:30:46.8982229Z st.shared.b32 [%r57+16392], %r990; 2026-02-21T08:30:46.8982291Z st.shared.b32 [%r58], %r975; 2026-02-21T08:30:46.8982356Z st.shared.b32 [%r58+8], %r976; 2026-02-21T08:30:46.8982433Z st.shared.b32 [%r58+16384], %r991; 2026-02-21T08:30:46.8982498Z st.shared.b32 [%r58+16392], %r992; 2026-02-21T08:30:46.8982739Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8982808Z shl.b32 %r993, %r2536, 3; 2026-02-21T08:30:46.8982885Z add.s32 %r994, %r265, %r993; 2026-02-21T08:30:46.8982951Z add.s32 %r2534, %r994, 69632; 2026-02-21T08:30:46.8983014Z $L__tmp89: 2026-02-21T08:30:46.8983267Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8983331Z // begin inline asm 2026-02-21T08:30:46.8983412Z fence.proxy.async.shared::cta; 2026-02-21T08:30:46.8983480Z // end inline asm 2026-02-21T08:30:46.8983538Z bar.sync 0; 2026-02-21T08:30:46.8983601Z @%p12 bra $L__BB0_14; 2026-02-21T08:30:46.8983706Z // %bb.13: // in Loop: Header=BB0_12 Depth=2 2026-02-21T08:30:46.8983789Z elect.sync %r1019|%p85, -1; 2026-02-21T08:30:46.8983852Z mov.b32 %r997, 69208336; 2026-02-21T08:30:46.8983913Z // begin inline asm 2026-02-21T08:30:46.8984096Z @%p85 tcgen05.mma.cta_group::1.kind::tf32 [ 
%r2508 + 0 ], [ %r2332 + 0 ], %rd416, %r997, %p84; 2026-02-21T08:30:46.8984156Z // end inline asm 2026-02-21T08:30:46.8984216Z // begin inline asm 2026-02-21T08:30:46.8984382Z @%p85 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 8 ], %rd417, %r997, %p84; 2026-02-21T08:30:46.8984440Z // end inline asm 2026-02-21T08:30:46.8984499Z // begin inline asm 2026-02-21T08:30:46.8984653Z @%p85 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 16 ], %rd418, %r997, %p84; 2026-02-21T08:30:46.8984719Z // end inline asm 2026-02-21T08:30:46.8984779Z // begin inline asm 2026-02-21T08:30:46.8984926Z @%p85 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 24 ], %rd419, %r997, %p84; 2026-02-21T08:30:46.8984993Z // end inline asm 2026-02-21T08:30:46.8985052Z // begin inline asm 2026-02-21T08:30:46.8985196Z @%p85 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 32 ], %rd420, %r997, %p84; 2026-02-21T08:30:46.8985262Z // end inline asm 2026-02-21T08:30:46.8985323Z // begin inline asm 2026-02-21T08:30:46.8985472Z @%p85 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 40 ], %rd421, %r997, %p84; 2026-02-21T08:30:46.8985538Z // end inline asm 2026-02-21T08:30:46.8985596Z // begin inline asm 2026-02-21T08:30:46.8985738Z @%p85 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 48 ], %rd422, %r997, %p84; 2026-02-21T08:30:46.8985795Z // end inline asm 2026-02-21T08:30:46.8985864Z // begin inline asm 2026-02-21T08:30:46.8986007Z @%p85 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 56 ], %rd423, %r997, %p84; 2026-02-21T08:30:46.8986064Z // end inline asm 2026-02-21T08:30:46.8986138Z cvt.u64.u32 %rd212, %r2534; 2026-02-21T08:30:46.8986195Z // begin inline asm 2026-02-21T08:30:46.8986327Z @%p85 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd212]; 2026-02-21T08:30:46.8986394Z // end inline asm 2026-02-21T08:30:46.8986454Z bra.uni $L__BB0_14; 2026-02-21T08:30:46.8986561Z $L__BB0_15: // in Loop: 
Header=BB0_3 Depth=1 2026-02-21T08:30:46.8986737Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:30:46.8986806Z mov.b32 %r2543, 1; 2026-02-21T08:30:46.8987034Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8987092Z // begin inline asm 2026-02-21T08:30:46.8987155Z 2026-02-21T08:30:46.8987207Z { 2026-02-21T08:30:46.8987272Z .reg .pred complete; 2026-02-21T08:30:46.8987331Z waitLoop: 2026-02-21T08:30:46.8987468Z mbarrier.try_wait.parity.shared.b64 complete, [%r2534], %r2543; 2026-02-21T08:30:46.8987536Z @!complete bra.uni waitLoop; 2026-02-21T08:30:46.8987589Z } 2026-02-21T08:30:46.8987594Z 2026-02-21T08:30:46.8987662Z // end inline asm 2026-02-21T08:30:46.8987716Z $L__tmp90: 2026-02-21T08:30:46.8987889Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.8988009Z cp.async.wait_group 0; 2026-02-21T08:30:46.8988071Z bar.sync 0; 2026-02-21T08:30:46.8988135Z add.s32 %r2541, %r265, 69632; 2026-02-21T08:30:46.8988192Z // begin inline asm 2026-02-21T08:30:46.8988294Z @%p204 mbarrier.inval.shared::cta.b64 [%r2541]; 2026-02-21T08:30:46.8988352Z // end inline asm 2026-02-21T08:30:46.8988409Z bar.sync 0; 2026-02-21T08:30:46.8988476Z // begin inline asm 2026-02-21T08:30:46.8988563Z @%p204 mbarrier.inval.shared::cta.b64 [%r1442]; 2026-02-21T08:30:46.8988619Z // end inline asm 2026-02-21T08:30:46.8988799Z .loc 1 88 43 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:43 2026-02-21T08:30:46.8988861Z shl.b32 %r1174, %r98, 13; 2026-02-21T08:30:46.8988920Z shl.b32 %r1175, %r99, 13; 2026-02-21T08:30:46.8988982Z shl.b32 %r1176, %r100, 13; 2026-02-21T08:30:46.8989051Z shl.b32 %r1177, %r101, 13; 2026-02-21T08:30:46.8989221Z .loc 1 88 50 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:50 2026-02-21T08:30:46.8989285Z add.s32 %r1178, %r1174, %r97; 2026-02-21T08:30:46.8989356Z add.s32 %r1179, %r1175, %r97; 2026-02-21T08:30:46.8989415Z add.s32 
%r1180, %r1176, %r97; 2026-02-21T08:30:46.8989475Z add.s32 %r1181, %r1177, %r97; 2026-02-21T08:30:46.8989644Z .loc 1 88 22 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:22 2026-02-21T08:30:46.8989724Z mad.wide.s32 %rd217, %r1178, 2, %rd49; 2026-02-21T08:30:46.8989792Z mad.wide.s32 %rd218, %r1179, 2, %rd49; 2026-02-21T08:30:46.8989856Z mad.wide.s32 %rd219, %r1180, 2, %rd49; 2026-02-21T08:30:46.8989931Z mad.wide.s32 %rd220, %r1181, 2, %rd49; 2026-02-21T08:30:46.8989986Z $L__tmp91: 2026-02-21T08:30:46.8990207Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8990274Z // begin inline asm 2026-02-21T08:30:46.8990596Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1032, %r1033, %r1034, %r1035, %r1036, %r1037, %r1038, %r1039, %r1040, %r1041, %r1042, %r1043, %r1044, %r1045, %r1046, %r1047}, [%r1887 + 0], 32; 2026-02-21T08:30:46.8990655Z // end inline asm 2026-02-21T08:30:46.8990721Z // begin inline asm 2026-02-21T08:30:46.8991029Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1049, %r1050, %r1051, %r1052, %r1053, %r1054, %r1055, %r1056, %r1057, %r1058, %r1059, %r1060, %r1061, %r1062, %r1063, %r1064}, [%r1887 + 16], 32; 2026-02-21T08:30:46.8991086Z // end inline asm 2026-02-21T08:30:46.8991146Z // begin inline asm 2026-02-21T08:30:46.8991225Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:46.8991281Z // end inline asm 2026-02-21T08:30:46.8991345Z cvt.u64.u32 %rd230, %r1032; 2026-02-21T08:30:46.8991416Z cvt.u64.u32 %rd231, %r1033; 2026-02-21T08:30:46.8991476Z shl.b64 %rd232, %rd231, 32; 2026-02-21T08:30:46.8991595Z or.b64 %rd233, %rd230, %rd232; 2026-02-21T08:30:46.8991653Z $L__tmp92: 2026-02-21T08:30:46.8991842Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8991913Z mov.b64 {%r1182, %r1183}, %rd233; 2026-02-21T08:30:46.8992079Z cvt.rn.bf16x2.f32 %r1184, %r1183, %r1182; 2026-02-21T08:30:46.8992145Z $L__tmp93: 2026-02-21T08:30:46.8992368Z 
.loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8992431Z cvt.u64.u32 %rd234, %r1034; 2026-02-21T08:30:46.8992507Z cvt.u64.u32 %rd235, %r1035; 2026-02-21T08:30:46.8992570Z shl.b64 %rd236, %rd235, 32; 2026-02-21T08:30:46.8992638Z or.b64 %rd237, %rd234, %rd236; 2026-02-21T08:30:46.8992696Z $L__tmp94: 2026-02-21T08:30:46.8992879Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8992943Z mov.b64 {%r1185, %r1186}, %rd237; 2026-02-21T08:30:46.8993015Z cvt.rn.bf16x2.f32 %r1187, %r1186, %r1185; 2026-02-21T08:30:46.8993082Z $L__tmp95: 2026-02-21T08:30:46.8993350Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8993416Z cvt.u64.u32 %rd238, %r1036; 2026-02-21T08:30:46.8993491Z cvt.u64.u32 %rd239, %r1037; 2026-02-21T08:30:46.8993553Z shl.b64 %rd240, %rd239, 32; 2026-02-21T08:30:46.8993618Z or.b64 %rd241, %rd238, %rd240; 2026-02-21T08:30:46.8993672Z $L__tmp96: 2026-02-21T08:30:46.8993858Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8993923Z mov.b64 {%r1188, %r1189}, %rd241; 2026-02-21T08:30:46.8993994Z cvt.rn.bf16x2.f32 %r1190, %r1189, %r1188; 2026-02-21T08:30:46.8994058Z $L__tmp97: 2026-02-21T08:30:46.8994274Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8994334Z cvt.u64.u32 %rd242, %r1038; 2026-02-21T08:30:46.8994403Z cvt.u64.u32 %rd243, %r1039; 2026-02-21T08:30:46.8994463Z shl.b64 %rd244, %rd243, 32; 2026-02-21T08:30:46.8994525Z or.b64 %rd245, %rd242, %rd244; 2026-02-21T08:30:46.8994585Z $L__tmp98: 2026-02-21T08:30:46.8994769Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8994832Z mov.b64 {%r1191, %r1192}, %rd245; 2026-02-21T08:30:46.8994904Z cvt.rn.bf16x2.f32 %r1193, %r1192, %r1191; 
2026-02-21T08:30:46.8994968Z $L__tmp99: 2026-02-21T08:30:46.8995183Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8995244Z cvt.u64.u32 %rd246, %r1040; 2026-02-21T08:30:46.8995313Z cvt.u64.u32 %rd247, %r1041; 2026-02-21T08:30:46.8995371Z shl.b64 %rd248, %rd247, 32; 2026-02-21T08:30:46.8995436Z or.b64 %rd249, %rd246, %rd248; 2026-02-21T08:30:46.8995491Z $L__tmp100: 2026-02-21T08:30:46.8995670Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8995733Z mov.b64 {%r1194, %r1195}, %rd249; 2026-02-21T08:30:46.8995805Z cvt.rn.bf16x2.f32 %r1196, %r1195, %r1194; 2026-02-21T08:30:46.8995873Z $L__tmp101: 2026-02-21T08:30:46.8996088Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8996148Z cvt.u64.u32 %rd250, %r1042; 2026-02-21T08:30:46.8996218Z cvt.u64.u32 %rd251, %r1043; 2026-02-21T08:30:46.8996279Z shl.b64 %rd252, %rd251, 32; 2026-02-21T08:30:46.8996341Z or.b64 %rd253, %rd250, %rd252; 2026-02-21T08:30:46.8996397Z $L__tmp102: 2026-02-21T08:30:46.8996574Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8996634Z mov.b64 {%r1197, %r1198}, %rd253; 2026-02-21T08:30:46.8996704Z cvt.rn.bf16x2.f32 %r1199, %r1198, %r1197; 2026-02-21T08:30:46.8996766Z $L__tmp103: 2026-02-21T08:30:46.8996980Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8997043Z cvt.u64.u32 %rd254, %r1044; 2026-02-21T08:30:46.8997150Z cvt.u64.u32 %rd255, %r1045; 2026-02-21T08:30:46.8997219Z shl.b64 %rd256, %rd255, 32; 2026-02-21T08:30:46.8997280Z or.b64 %rd257, %rd254, %rd256; 2026-02-21T08:30:46.8997335Z $L__tmp104: 2026-02-21T08:30:46.8997508Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8997572Z mov.b64 {%r1200, %r1201}, %rd257; 
2026-02-21T08:30:46.8997640Z cvt.rn.bf16x2.f32 %r1202, %r1201, %r1200; 2026-02-21T08:30:46.8997700Z $L__tmp105: 2026-02-21T08:30:46.8997913Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8997971Z cvt.u64.u32 %rd258, %r1046; 2026-02-21T08:30:46.8998031Z cvt.u64.u32 %rd259, %r1047; 2026-02-21T08:30:46.8998099Z shl.b64 %rd260, %rd259, 32; 2026-02-21T08:30:46.8998162Z or.b64 %rd261, %rd258, %rd260; 2026-02-21T08:30:46.8998216Z $L__tmp106: 2026-02-21T08:30:46.8998442Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8998507Z mov.b64 {%r1203, %r1204}, %rd261; 2026-02-21T08:30:46.8998577Z cvt.rn.bf16x2.f32 %r1205, %r1204, %r1203; 2026-02-21T08:30:46.8998642Z $L__tmp107: 2026-02-21T08:30:46.8998853Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8998914Z cvt.u64.u32 %rd262, %r1049; 2026-02-21T08:30:46.8998972Z cvt.u64.u32 %rd263, %r1050; 2026-02-21T08:30:46.8999040Z shl.b64 %rd264, %rd263, 32; 2026-02-21T08:30:46.8999103Z or.b64 %rd265, %rd262, %rd264; 2026-02-21T08:30:46.8999156Z $L__tmp108: 2026-02-21T08:30:46.8999336Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.8999400Z mov.b64 {%r1206, %r1207}, %rd265; 2026-02-21T08:30:46.8999468Z cvt.rn.bf16x2.f32 %r1208, %r1207, %r1206; 2026-02-21T08:30:46.8999531Z $L__tmp109: 2026-02-21T08:30:46.8999746Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.8999809Z cvt.u64.u32 %rd266, %r1051; 2026-02-21T08:30:46.8999870Z cvt.u64.u32 %rd267, %r1052; 2026-02-21T08:30:46.8999940Z shl.b64 %rd268, %rd267, 32; 2026-02-21T08:30:46.9000002Z or.b64 %rd269, %rd266, %rd268; 2026-02-21T08:30:46.9000057Z $L__tmp110: 2026-02-21T08:30:46.9000240Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 
2026-02-21T08:30:46.9000302Z mov.b64 {%r1209, %r1210}, %rd269; 2026-02-21T08:30:46.9000372Z cvt.rn.bf16x2.f32 %r1211, %r1210, %r1209; 2026-02-21T08:30:46.9000426Z $L__tmp111: 2026-02-21T08:30:46.9000650Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9000711Z cvt.u64.u32 %rd270, %r1053; 2026-02-21T08:30:46.9000770Z cvt.u64.u32 %rd271, %r1054; 2026-02-21T08:30:46.9000843Z shl.b64 %rd272, %rd271, 32; 2026-02-21T08:30:46.9000908Z or.b64 %rd273, %rd270, %rd272; 2026-02-21T08:30:46.9000962Z $L__tmp112: 2026-02-21T08:30:46.9001146Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9001211Z mov.b64 {%r1212, %r1213}, %rd273; 2026-02-21T08:30:46.9001279Z cvt.rn.bf16x2.f32 %r1214, %r1213, %r1212; 2026-02-21T08:30:46.9001332Z $L__tmp113: 2026-02-21T08:30:46.9001584Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9001645Z cvt.u64.u32 %rd274, %r1055; 2026-02-21T08:30:46.9001706Z cvt.u64.u32 %rd275, %r1056; 2026-02-21T08:30:46.9001775Z shl.b64 %rd276, %rd275, 32; 2026-02-21T08:30:46.9001836Z or.b64 %rd277, %rd274, %rd276; 2026-02-21T08:30:46.9001891Z $L__tmp114: 2026-02-21T08:30:46.9002073Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9002192Z mov.b64 {%r1215, %r1216}, %rd277; 2026-02-21T08:30:46.9002263Z cvt.rn.bf16x2.f32 %r1217, %r1216, %r1215; 2026-02-21T08:30:46.9002316Z $L__tmp115: 2026-02-21T08:30:46.9002539Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9002599Z cvt.u64.u32 %rd278, %r1057; 2026-02-21T08:30:46.9002659Z cvt.u64.u32 %rd279, %r1058; 2026-02-21T08:30:46.9002731Z shl.b64 %rd280, %rd279, 32; 2026-02-21T08:30:46.9002793Z or.b64 %rd281, %rd278, %rd280; 2026-02-21T08:30:46.9002846Z $L__tmp116: 2026-02-21T08:30:46.9003026Z .loc 1 87 28 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9003089Z mov.b64 {%r1218, %r1219}, %rd281; 2026-02-21T08:30:46.9003158Z cvt.rn.bf16x2.f32 %r1220, %r1219, %r1218; 2026-02-21T08:30:46.9003213Z $L__tmp117: 2026-02-21T08:30:46.9003492Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9003557Z cvt.u64.u32 %rd282, %r1059; 2026-02-21T08:30:46.9003618Z cvt.u64.u32 %rd283, %r1060; 2026-02-21T08:30:46.9003690Z shl.b64 %rd284, %rd283, 32; 2026-02-21T08:30:46.9003751Z or.b64 %rd285, %rd282, %rd284; 2026-02-21T08:30:46.9003808Z $L__tmp118: 2026-02-21T08:30:46.9003991Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9004051Z mov.b64 {%r1221, %r1222}, %rd285; 2026-02-21T08:30:46.9004124Z cvt.rn.bf16x2.f32 %r1223, %r1222, %r1221; 2026-02-21T08:30:46.9004176Z $L__tmp119: 2026-02-21T08:30:46.9004403Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9004465Z cvt.u64.u32 %rd286, %r1061; 2026-02-21T08:30:46.9004529Z cvt.u64.u32 %rd287, %r1062; 2026-02-21T08:30:46.9004590Z shl.b64 %rd288, %rd287, 32; 2026-02-21T08:30:46.9004652Z or.b64 %rd289, %rd286, %rd288; 2026-02-21T08:30:46.9004716Z $L__tmp120: 2026-02-21T08:30:46.9004881Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9004940Z mov.b64 {%r1224, %r1225}, %rd289; 2026-02-21T08:30:46.9005016Z cvt.rn.bf16x2.f32 %r1226, %r1225, %r1224; 2026-02-21T08:30:46.9005067Z $L__tmp121: 2026-02-21T08:30:46.9005277Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9005336Z cvt.u64.u32 %rd290, %r1063; 2026-02-21T08:30:46.9005400Z cvt.u64.u32 %rd291, %r1064; 2026-02-21T08:30:46.9005457Z shl.b64 %rd292, %rd291, 32; 2026-02-21T08:30:46.9005516Z or.b64 %rd293, %rd290, %rd292; 
2026-02-21T08:30:46.9005576Z $L__tmp122: 2026-02-21T08:30:46.9005746Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9005806Z mov.b64 {%r1227, %r1228}, %rd293; 2026-02-21T08:30:46.9005883Z cvt.rn.bf16x2.f32 %r1229, %r1228, %r1227; 2026-02-21T08:30:46.9006053Z .loc 1 88 81 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:81 2026-02-21T08:30:46.9006153Z st.shared.v4.b32 [%r60], {%r1184, %r1196, %r1208, %r1220}; 2026-02-21T08:30:46.9006207Z bar.sync 0; 2026-02-21T08:30:46.9006271Z // begin inline asm 2026-02-21T08:30:46.9006428Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1066, %r1067, %r1068, %r1069}, [%r659]; 2026-02-21T08:30:46.9006484Z // end inline asm 2026-02-21T08:30:46.9006545Z bar.sync 0; 2026-02-21T08:30:46.9006641Z st.shared.v4.b32 [%r60], {%r1187, %r1199, %r1211, %r1223}; 2026-02-21T08:30:46.9006697Z bar.sync 0; 2026-02-21T08:30:46.9006760Z // begin inline asm 2026-02-21T08:30:46.9006910Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1071, %r1072, %r1073, %r1074}, [%r659]; 2026-02-21T08:30:46.9006965Z // end inline asm 2026-02-21T08:30:46.9007020Z bar.sync 0; 2026-02-21T08:30:46.9007124Z st.shared.v4.b32 [%r60], {%r1190, %r1202, %r1214, %r1226}; 2026-02-21T08:30:46.9007223Z bar.sync 0; 2026-02-21T08:30:46.9007280Z // begin inline asm 2026-02-21T08:30:46.9007433Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1076, %r1077, %r1078, %r1079}, [%r659]; 2026-02-21T08:30:46.9007491Z // end inline asm 2026-02-21T08:30:46.9007544Z bar.sync 0; 2026-02-21T08:30:46.9007636Z st.shared.v4.b32 [%r60], {%r1193, %r1205, %r1217, %r1229}; 2026-02-21T08:30:46.9007698Z bar.sync 0; 2026-02-21T08:30:46.9007755Z // begin inline asm 2026-02-21T08:30:46.9007897Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1081, %r1082, %r1083, %r1084}, [%r659]; 2026-02-21T08:30:46.9007959Z // end inline asm 2026-02-21T08:30:46.9008016Z // begin inline asm 2026-02-21T08:30:46.9008124Z st.global.v4.b32 [ %rd217 + 0 ], { %r1066, %r1071, 
%r1076, %r1081 }; 2026-02-21T08:30:46.9008186Z // end inline asm 2026-02-21T08:30:46.9008242Z // begin inline asm 2026-02-21T08:30:46.9008346Z st.global.v4.b32 [ %rd218 + 0 ], { %r1067, %r1072, %r1077, %r1082 }; 2026-02-21T08:30:46.9008456Z // end inline asm 2026-02-21T08:30:46.9008524Z // begin inline asm 2026-02-21T08:30:46.9008625Z st.global.v4.b32 [ %rd219 + 0 ], { %r1068, %r1073, %r1078, %r1083 }; 2026-02-21T08:30:46.9008682Z // end inline asm 2026-02-21T08:30:46.9008747Z // begin inline asm 2026-02-21T08:30:46.9008845Z st.global.v4.b32 [ %rd220 + 0 ], { %r1069, %r1074, %r1079, %r1084 }; 2026-02-21T08:30:46.9008901Z // end inline asm 2026-02-21T08:30:46.9009084Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9009155Z add.s32 %r1230, %r2523, 4736; 2026-02-21T08:30:46.9009324Z .loc 1 25 35 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:25:35 2026-02-21T08:30:46.9009386Z shr.s32 %r1231, %r1230, 31; 2026-02-21T08:30:46.9009455Z shr.u32 %r1232, %r1231, 23; 2026-02-21T08:30:46.9009520Z add.s32 %r1233, %r1230, %r1232; 2026-02-21T08:30:46.9009581Z shr.s32 %r1234, %r1233, 9; 2026-02-21T08:30:46.9009761Z .loc 1 26 33 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:26:33 2026-02-21T08:30:46.9009823Z shl.b32 %r1235, %r1234, 3; 2026-02-21T08:30:46.9009990Z .loc 1 27 39 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:39 2026-02-21T08:30:46.9010051Z sub.s32 %r1236, 64, %r1235; 2026-02-21T08:30:46.9010222Z .loc 1 27 52 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:52 2026-02-21T08:30:46.9010280Z min.s32 %r1237, %r1236, 8; 2026-02-21T08:30:46.9010444Z .loc 1 28 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:45 2026-02-21T08:30:46.9010516Z and.b32 %r1238, %r1233, -512; 2026-02-21T08:30:46.9010579Z sub.s32 %r1239, %r1230, %r1238; 2026-02-21T08:30:46.9010742Z .loc 1 29 51 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:29:51 
2026-02-21T08:30:46.9010812Z div.s32 %r119, %r1239, %r1237; 2026-02-21T08:30:46.9010975Z .loc 1 28 64 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:64 2026-02-21T08:30:46.9011044Z mul.lo.s32 %r1240, %r119, %r1237; 2026-02-21T08:30:46.9011110Z sub.s32 %r1241, %r1239, %r1240; 2026-02-21T08:30:46.9011275Z .loc 1 28 30 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:30 2026-02-21T08:30:46.9011335Z add.s32 %r1242, %r1241, %r1235; 2026-02-21T08:30:46.9011499Z .loc 1 30 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:30:27 2026-02-21T08:30:46.9011598Z shl.b32 %r120, %r1242, 7; 2026-02-21T08:30:46.9011786Z .loc 1 31 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:31:32 2026-02-21T08:30:46.9011845Z or.b32 %r121, %r120, %r6; 2026-02-21T08:30:46.9012012Z .loc 1 32 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:32:27 2026-02-21T08:30:46.9012077Z shl.b32 %r1243, %r119, 6; 2026-02-21T08:30:46.9012246Z .loc 1 33 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:33:32 2026-02-21T08:30:46.9012358Z or.b32 %r1244, %r1243, %r10; 2026-02-21T08:30:46.9012424Z or.b32 %r1245, %r1243, %r11; 2026-02-21T08:30:46.9012592Z .loc 1 48 53 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:53 2026-02-21T08:30:46.9012650Z shl.b32 %r1246, %r1244, 10; 2026-02-21T08:30:46.9012709Z shl.b32 %r1247, %r1245, 10; 2026-02-21T08:30:46.9012778Z mov.pred %p115, -1; 2026-02-21T08:30:46.9012835Z mov.b32 %r2540, 0; 2026-02-21T08:30:46.9012887Z $L__tmp123: 2026-02-21T08:30:46.9013114Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9013171Z // begin inline asm 2026-02-21T08:30:46.9013581Z @%p115 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1887 + 0], 32, {%r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540}; 2026-02-21T08:30:46.9013649Z // end inline asm 
2026-02-21T08:30:46.9013707Z // begin inline asm 2026-02-21T08:30:46.9014020Z @%p115 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1887 + 16], 32, {%r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540, %r2540}; 2026-02-21T08:30:46.9014076Z // end inline asm 2026-02-21T08:30:46.9014152Z // begin inline asm 2026-02-21T08:30:46.9014226Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.9014283Z // end inline asm 2026-02-21T08:30:46.9014347Z bar.sync 0; 2026-02-21T08:30:46.9014402Z $L__tmp124: 2026-02-21T08:30:46.9014576Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9014642Z // begin inline asm 2026-02-21T08:30:46.9014737Z @%p204 mbarrier.init.shared::cta.b64 [%r2541], 1; 2026-02-21T08:30:46.9014796Z // end inline asm 2026-02-21T08:30:46.9014852Z bar.sync 0; 2026-02-21T08:30:46.9014921Z // begin inline asm 2026-02-21T08:30:46.9015015Z @%p204 mbarrier.init.shared::cta.b64 [%r1442], 1; 2026-02-21T08:30:46.9015073Z // end inline asm 2026-02-21T08:30:46.9015253Z .loc 1 48 60 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:60 2026-02-21T08:30:46.9015317Z or.b32 %r1248, %r1246, %r16; 2026-02-21T08:30:46.9015378Z or.b32 %r1249, %r1247, %r16; 2026-02-21T08:30:46.9015556Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9015630Z mad.wide.s32 %rd221, %r1248, 2, %rd47; 2026-02-21T08:30:46.9015701Z mad.wide.s32 %rd222, %r1249, 2, %rd47; 2026-02-21T08:30:46.9015760Z mov.b32 %r1321, 16; 2026-02-21T08:30:46.9015940Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9016001Z // begin inline asm 2026-02-21T08:30:46.9016133Z cp.async.cg.shared.global [ %r1731 + 0 ], [ %rd221 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9016201Z // end inline asm 2026-02-21T08:30:46.9016261Z // begin inline asm 2026-02-21T08:30:46.9016387Z cp.async.cg.shared.global [ %r1733 + 0 ], [ 
%rd222 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9016443Z // end inline asm 2026-02-21T08:30:46.9016515Z cp.async.commit_group; 2026-02-21T08:30:46.9016689Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.9016752Z add.s32 %r1250, %r121, %r27; 2026-02-21T08:30:46.9016929Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9016991Z cvt.s64.s32 %rd294, %r1250; 2026-02-21T08:30:46.9017056Z add.s64 %rd223, %rd48, %rd294; 2026-02-21T08:30:46.9017232Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9017290Z // begin inline asm 2026-02-21T08:30:46.9017412Z cp.async.cg.shared.global [ %r1735 + 0 ], [ %rd223 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9017477Z // end inline asm 2026-02-21T08:30:46.9017585Z cp.async.commit_group; 2026-02-21T08:30:46.9017758Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9017820Z cvt.s64.s32 %rd295, %r1246; 2026-02-21T08:30:46.9017890Z or.b64 %rd296, %rd295, %rd10; 2026-02-21T08:30:46.9017953Z shl.b64 %rd297, %rd296, 1; 2026-02-21T08:30:46.9018017Z add.s64 %rd25, %rd47, %rd297; 2026-02-21T08:30:46.9018084Z add.s64 %rd224, %rd25, 128; 2026-02-21T08:30:46.9018143Z cvt.s64.s32 %rd298, %r1247; 2026-02-21T08:30:46.9018205Z or.b64 %rd299, %rd298, %rd10; 2026-02-21T08:30:46.9018268Z shl.b64 %rd300, %rd299, 1; 2026-02-21T08:30:46.9018336Z add.s64 %rd26, %rd47, %rd300; 2026-02-21T08:30:46.9018397Z add.s64 %rd225, %rd26, 128; 2026-02-21T08:30:46.9018571Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9018636Z bar.sync 0; 2026-02-21T08:30:46.9018734Z // begin inline asm 2026-02-21T08:30:46.9018857Z cp.async.cg.shared.global [ %r1555 + 0 ], [ %rd224 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9018921Z // end inline asm 2026-02-21T08:30:46.9018980Z // begin inline asm 2026-02-21T08:30:46.9019100Z 
cp.async.cg.shared.global [ %r1557 + 0 ], [ %rd225 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9019158Z // end inline asm 2026-02-21T08:30:46.9019232Z cp.async.commit_group; 2026-02-21T08:30:46.9019408Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.9019472Z add.s32 %r1251, %r121, %r32; 2026-02-21T08:30:46.9019651Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9019712Z cvt.s64.s32 %rd301, %r1251; 2026-02-21T08:30:46.9019777Z add.s64 %rd226, %rd48, %rd301; 2026-02-21T08:30:46.9019960Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9020023Z // begin inline asm 2026-02-21T08:30:46.9020146Z cp.async.cg.shared.global [ %r1559 + 0 ], [ %rd226 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9020206Z // end inline asm 2026-02-21T08:30:46.9020277Z cp.async.commit_group; 2026-02-21T08:30:46.9020449Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9020512Z add.s64 %rd227, %rd25, 256; 2026-02-21T08:30:46.9020581Z add.s64 %rd228, %rd26, 256; 2026-02-21T08:30:46.9020757Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9020814Z bar.sync 0; 2026-02-21T08:30:46.9020883Z // begin inline asm 2026-02-21T08:30:46.9021003Z cp.async.cg.shared.global [ %r1561 + 0 ], [ %rd227 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9021063Z // end inline asm 2026-02-21T08:30:46.9021123Z // begin inline asm 2026-02-21T08:30:46.9021257Z cp.async.cg.shared.global [ %r1563 + 0 ], [ %rd228 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9021318Z // end inline asm 2026-02-21T08:30:46.9021385Z cp.async.commit_group; 2026-02-21T08:30:46.9021612Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.9021678Z add.s32 %r1252, %r121, %r37; 2026-02-21T08:30:46.9021852Z .loc 1 54 34 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9021914Z cvt.s64.s32 %rd302, %r1252; 2026-02-21T08:30:46.9021987Z add.s64 %rd229, %rd48, %rd302; 2026-02-21T08:30:46.9022166Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9022227Z // begin inline asm 2026-02-21T08:30:46.9022356Z cp.async.cg.shared.global [ %r1565 + 0 ], [ %rd229 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9022415Z // end inline asm 2026-02-21T08:30:46.9022481Z cp.async.commit_group; 2026-02-21T08:30:46.9022668Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9022790Z cp.async.wait_group 4; 2026-02-21T08:30:46.9022847Z bar.sync 0; 2026-02-21T08:30:46.9022951Z ld.shared.v4.b32 {%r1253, %r1254, %r1255, %r1256}, [%r40]; 2026-02-21T08:30:46.9023027Z mov.b32 {%rs705, %rs706}, %r1256; 2026-02-21T08:30:46.9023093Z mov.b32 {%rs707, %rs708}, %r1255; 2026-02-21T08:30:46.9023156Z mov.b32 {%rs709, %rs710}, %r1254; 2026-02-21T08:30:46.9023227Z mov.b32 {%rs711, %rs712}, %r1253; 2026-02-21T08:30:46.9023336Z ld.shared.v4.b32 {%r1257, %r1258, %r1259, %r1260}, [%r40+16]; 2026-02-21T08:30:46.9023399Z mov.b32 {%rs713, %rs714}, %r1260; 2026-02-21T08:30:46.9023468Z mov.b32 {%rs715, %rs716}, %r1259; 2026-02-21T08:30:46.9023529Z mov.b32 {%rs717, %rs718}, %r1258; 2026-02-21T08:30:46.9023600Z mov.b32 {%rs719, %rs720}, %r1257; 2026-02-21T08:30:46.9023770Z .loc 1 52 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:52:32 2026-02-21T08:30:46.9023841Z cvt.f32.bf16 %r1157, %rs711; 2026-02-21T08:30:46.9023954Z cvt.f32.bf16 %r1158, %rs712; 2026-02-21T08:30:46.9024019Z cvt.f32.bf16 %r1159, %rs709; 2026-02-21T08:30:46.9024086Z cvt.f32.bf16 %r1160, %rs710; 2026-02-21T08:30:46.9024144Z cvt.f32.bf16 %r1161, %rs707; 2026-02-21T08:30:46.9024200Z cvt.f32.bf16 %r1162, %rs708; 2026-02-21T08:30:46.9024258Z cvt.f32.bf16 %r1163, %rs705; 2026-02-21T08:30:46.9024323Z cvt.f32.bf16 %r1164, %rs706; 
2026-02-21T08:30:46.9024380Z cvt.f32.bf16 %r1165, %rs719; 2026-02-21T08:30:46.9024438Z cvt.f32.bf16 %r1166, %rs720; 2026-02-21T08:30:46.9024503Z cvt.f32.bf16 %r1167, %rs717; 2026-02-21T08:30:46.9024561Z cvt.f32.bf16 %r1168, %rs718; 2026-02-21T08:30:46.9024619Z cvt.f32.bf16 %r1169, %rs715; 2026-02-21T08:30:46.9024684Z cvt.f32.bf16 %r1170, %rs716; 2026-02-21T08:30:46.9024740Z cvt.f32.bf16 %r1171, %rs713; 2026-02-21T08:30:46.9024798Z cvt.f32.bf16 %r1172, %rs714; 2026-02-21T08:30:46.9024849Z $L__tmp125: 2026-02-21T08:30:46.9025074Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9025133Z // begin inline asm 2026-02-21T08:30:46.9025458Z @%p115 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1567 + 0], 16, {%r1157, %r1158, %r1159, %r1160, %r1161, %r1162, %r1163, %r1164, %r1165, %r1166, %r1167, %r1168, %r1169, %r1170, %r1171, %r1172}; 2026-02-21T08:30:46.9025520Z // end inline asm 2026-02-21T08:30:46.9025576Z // begin inline asm 2026-02-21T08:30:46.9025644Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.9025706Z // end inline asm 2026-02-21T08:30:46.9025759Z bar.sync 0; 2026-02-21T08:30:46.9025809Z $L__tmp126: 2026-02-21T08:30:46.9025975Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9026043Z ld.shared.b8 %rs721, [%r42]; 2026-02-21T08:30:46.9026108Z ld.shared.b8 %rs722, [%r42+128]; 2026-02-21T08:30:46.9026172Z ld.shared.b8 %rs723, [%r42+256]; 2026-02-21T08:30:46.9026239Z ld.shared.b8 %rs724, [%r42+384]; 2026-02-21T08:30:46.9026301Z ld.shared.b8 %rs725, [%r42+512]; 2026-02-21T08:30:46.9026362Z ld.shared.b8 %rs726, [%r42+640]; 2026-02-21T08:30:46.9026420Z ld.shared.b8 %rs727, [%r42+768]; 2026-02-21T08:30:46.9026486Z ld.shared.b8 %rs728, [%r44]; 2026-02-21T08:30:46.9026547Z ld.shared.b8 %rs729, [%r42+1024]; 2026-02-21T08:30:46.9026609Z ld.shared.b8 %rs730, [%r42+1152]; 2026-02-21T08:30:46.9026676Z ld.shared.b8 %rs731, [%r42+1280]; 
2026-02-21T08:30:46.9026737Z ld.shared.b8 %rs732, [%r42+1408]; 2026-02-21T08:30:46.9026796Z ld.shared.b8 %rs733, [%r42+1536]; 2026-02-21T08:30:46.9026862Z ld.shared.b8 %rs734, [%r42+1664]; 2026-02-21T08:30:46.9026922Z ld.shared.b8 %rs735, [%r42+1792]; 2026-02-21T08:30:46.9026982Z ld.shared.b8 %rs736, [%r46]; 2026-02-21T08:30:46.9027041Z ld.shared.b8 %rs737, [%r42+2048]; 2026-02-21T08:30:46.9027108Z ld.shared.b8 %rs738, [%r42+2176]; 2026-02-21T08:30:46.9027167Z ld.shared.b8 %rs739, [%r42+2304]; 2026-02-21T08:30:46.9027225Z ld.shared.b8 %rs740, [%r42+2432]; 2026-02-21T08:30:46.9027292Z ld.shared.b8 %rs741, [%r42+2560]; 2026-02-21T08:30:46.9027406Z ld.shared.b8 %rs742, [%r42+2688]; 2026-02-21T08:30:46.9027466Z ld.shared.b8 %rs743, [%r42+2816]; 2026-02-21T08:30:46.9027526Z ld.shared.b8 %rs744, [%r48]; 2026-02-21T08:30:46.9027593Z ld.shared.b8 %rs745, [%r42+3072]; 2026-02-21T08:30:46.9027653Z ld.shared.b8 %rs746, [%r42+3200]; 2026-02-21T08:30:46.9027713Z ld.shared.b8 %rs747, [%r42+3328]; 2026-02-21T08:30:46.9027780Z ld.shared.b8 %rs748, [%r42+3456]; 2026-02-21T08:30:46.9027841Z ld.shared.b8 %rs749, [%r42+3584]; 2026-02-21T08:30:46.9027899Z ld.shared.b8 %rs750, [%r42+3712]; 2026-02-21T08:30:46.9027959Z ld.shared.b8 %rs751, [%r42+3840]; 2026-02-21T08:30:46.9028026Z ld.shared.b8 %rs752, [%r50]; 2026-02-21T08:30:46.9028197Z .loc 1 57 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:57:28 2026-02-21T08:30:46.9028259Z shl.b16 %rs753, %rs721, 4; 2026-02-21T08:30:46.9028329Z shl.b16 %rs754, %rs722, 4; 2026-02-21T08:30:46.9028445Z shl.b16 %rs755, %rs723, 4; 2026-02-21T08:30:46.9028508Z shl.b16 %rs756, %rs724, 4; 2026-02-21T08:30:46.9028571Z shl.b16 %rs757, %rs725, 4; 2026-02-21T08:30:46.9028628Z shl.b16 %rs758, %rs726, 4; 2026-02-21T08:30:46.9028685Z shl.b16 %rs759, %rs727, 4; 2026-02-21T08:30:46.9028744Z shl.b16 %rs760, %rs728, 4; 2026-02-21T08:30:46.9028812Z shl.b16 %rs761, %rs729, 4; 2026-02-21T08:30:46.9028870Z shl.b16 %rs762, %rs730, 4; 
2026-02-21T08:30:46.9028930Z shl.b16 %rs763, %rs731, 4; 2026-02-21T08:30:46.9028998Z shl.b16 %rs764, %rs732, 4; 2026-02-21T08:30:46.9029057Z shl.b16 %rs765, %rs733, 4; 2026-02-21T08:30:46.9029115Z shl.b16 %rs766, %rs734, 4; 2026-02-21T08:30:46.9029173Z shl.b16 %rs767, %rs735, 4; 2026-02-21T08:30:46.9029236Z shl.b16 %rs768, %rs736, 4; 2026-02-21T08:30:46.9029293Z shl.b16 %rs769, %rs737, 4; 2026-02-21T08:30:46.9029351Z shl.b16 %rs770, %rs738, 4; 2026-02-21T08:30:46.9029414Z shl.b16 %rs771, %rs739, 4; 2026-02-21T08:30:46.9029474Z shl.b16 %rs772, %rs740, 4; 2026-02-21T08:30:46.9029533Z shl.b16 %rs773, %rs741, 4; 2026-02-21T08:30:46.9029590Z shl.b16 %rs774, %rs742, 4; 2026-02-21T08:30:46.9029655Z shl.b16 %rs775, %rs743, 4; 2026-02-21T08:30:46.9029712Z shl.b16 %rs776, %rs744, 4; 2026-02-21T08:30:46.9029769Z shl.b16 %rs777, %rs745, 4; 2026-02-21T08:30:46.9029832Z shl.b16 %rs778, %rs746, 4; 2026-02-21T08:30:46.9029888Z shl.b16 %rs779, %rs747, 4; 2026-02-21T08:30:46.9029944Z shl.b16 %rs780, %rs748, 4; 2026-02-21T08:30:46.9030003Z shl.b16 %rs781, %rs749, 4; 2026-02-21T08:30:46.9030067Z shl.b16 %rs782, %rs750, 4; 2026-02-21T08:30:46.9030125Z shl.b16 %rs783, %rs751, 4; 2026-02-21T08:30:46.9030183Z shl.b16 %rs784, %rs752, 4; 2026-02-21T08:30:46.9030358Z .loc 1 72 58 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:72:58 2026-02-21T08:30:46.9030430Z selp.b16 %rs785, %rs753, %rs721, %p246; 2026-02-21T08:30:46.9030490Z cvt.s16.s8 %rs786, %rs785; 2026-02-21T08:30:46.9030555Z shr.s16 %rs787, %rs786, 4; 2026-02-21T08:30:46.9030626Z selp.b16 %rs788, %rs754, %rs722, %p246; 2026-02-21T08:30:46.9030687Z cvt.s16.s8 %rs789, %rs788; 2026-02-21T08:30:46.9030743Z shr.s16 %rs790, %rs789, 4; 2026-02-21T08:30:46.9030816Z selp.b16 %rs791, %rs755, %rs723, %p246; 2026-02-21T08:30:46.9030875Z cvt.s16.s8 %rs792, %rs791; 2026-02-21T08:30:46.9030931Z shr.s16 %rs793, %rs792, 4; 2026-02-21T08:30:46.9031003Z selp.b16 %rs794, %rs756, %rs724, %p246; 2026-02-21T08:30:46.9031062Z cvt.s16.s8 %rs795, 
%rs794; 2026-02-21T08:30:46.9031119Z shr.s16 %rs796, %rs795, 4; 2026-02-21T08:30:46.9031185Z selp.b16 %rs797, %rs757, %rs725, %p246; 2026-02-21T08:30:46.9031250Z cvt.s16.s8 %rs798, %rs797; 2026-02-21T08:30:46.9031308Z shr.s16 %rs799, %rs798, 4; 2026-02-21T08:30:46.9031373Z selp.b16 %rs800, %rs758, %rs726, %p246; 2026-02-21T08:30:46.9031438Z cvt.s16.s8 %rs801, %rs800; 2026-02-21T08:30:46.9031495Z shr.s16 %rs802, %rs801, 4; 2026-02-21T08:30:46.9031588Z selp.b16 %rs803, %rs759, %rs727, %p246; 2026-02-21T08:30:46.9031658Z cvt.s16.s8 %rs804, %rs803; 2026-02-21T08:30:46.9031771Z shr.s16 %rs805, %rs804, 4; 2026-02-21T08:30:46.9031834Z selp.b16 %rs806, %rs760, %rs728, %p246; 2026-02-21T08:30:46.9031892Z cvt.s16.s8 %rs807, %rs806; 2026-02-21T08:30:46.9031956Z shr.s16 %rs808, %rs807, 4; 2026-02-21T08:30:46.9032019Z selp.b16 %rs809, %rs761, %rs729, %p246; 2026-02-21T08:30:46.9032077Z cvt.s16.s8 %rs810, %rs809; 2026-02-21T08:30:46.9032140Z shr.s16 %rs811, %rs810, 4; 2026-02-21T08:30:46.9032203Z selp.b16 %rs812, %rs762, %rs730, %p246; 2026-02-21T08:30:46.9032262Z cvt.s16.s8 %rs813, %rs812; 2026-02-21T08:30:46.9032321Z shr.s16 %rs814, %rs813, 4; 2026-02-21T08:30:46.9032393Z selp.b16 %rs815, %rs763, %rs731, %p246; 2026-02-21T08:30:46.9032450Z cvt.s16.s8 %rs816, %rs815; 2026-02-21T08:30:46.9032509Z shr.s16 %rs817, %rs816, 4; 2026-02-21T08:30:46.9032584Z selp.b16 %rs818, %rs764, %rs732, %p246; 2026-02-21T08:30:46.9032646Z cvt.s16.s8 %rs819, %rs818; 2026-02-21T08:30:46.9032752Z shr.s16 %rs820, %rs819, 4; 2026-02-21T08:30:46.9032820Z selp.b16 %rs821, %rs765, %rs733, %p246; 2026-02-21T08:30:46.9032884Z cvt.s16.s8 %rs822, %rs821; 2026-02-21T08:30:46.9032941Z shr.s16 %rs823, %rs822, 4; 2026-02-21T08:30:46.9033003Z selp.b16 %rs824, %rs766, %rs734, %p246; 2026-02-21T08:30:46.9033068Z cvt.s16.s8 %rs825, %rs824; 2026-02-21T08:30:46.9033125Z shr.s16 %rs826, %rs825, 4; 2026-02-21T08:30:46.9033188Z selp.b16 %rs827, %rs767, %rs735, %p246; 2026-02-21T08:30:46.9033244Z cvt.s16.s8 %rs828, %rs827; 
2026-02-21T08:30:46.9033309Z shr.s16 %rs829, %rs828, 4; 2026-02-21T08:30:46.9033372Z selp.b16 %rs830, %rs768, %rs736, %p246; 2026-02-21T08:30:46.9033427Z cvt.s16.s8 %rs831, %rs830; 2026-02-21T08:30:46.9033491Z shr.s16 %rs832, %rs831, 4; 2026-02-21T08:30:46.9033554Z selp.b16 %rs833, %rs769, %rs737, %p246; 2026-02-21T08:30:46.9033611Z cvt.s16.s8 %rs834, %rs833; 2026-02-21T08:30:46.9033676Z shr.s16 %rs835, %rs834, 4; 2026-02-21T08:30:46.9033740Z selp.b16 %rs836, %rs770, %rs738, %p246; 2026-02-21T08:30:46.9033800Z cvt.s16.s8 %rs837, %rs836; 2026-02-21T08:30:46.9033858Z shr.s16 %rs838, %rs837, 4; 2026-02-21T08:30:46.9033929Z selp.b16 %rs839, %rs771, %rs739, %p246; 2026-02-21T08:30:46.9033985Z cvt.s16.s8 %rs840, %rs839; 2026-02-21T08:30:46.9034043Z shr.s16 %rs841, %rs840, 4; 2026-02-21T08:30:46.9034113Z selp.b16 %rs842, %rs772, %rs740, %p246; 2026-02-21T08:30:46.9034169Z cvt.s16.s8 %rs843, %rs842; 2026-02-21T08:30:46.9034225Z shr.s16 %rs844, %rs843, 4; 2026-02-21T08:30:46.9034289Z selp.b16 %rs845, %rs773, %rs741, %p246; 2026-02-21T08:30:46.9034354Z cvt.s16.s8 %rs846, %rs845; 2026-02-21T08:30:46.9034410Z shr.s16 %rs847, %rs846, 4; 2026-02-21T08:30:46.9034473Z selp.b16 %rs848, %rs774, %rs742, %p246; 2026-02-21T08:30:46.9034538Z cvt.s16.s8 %rs849, %rs848; 2026-02-21T08:30:46.9034594Z shr.s16 %rs850, %rs849, 4; 2026-02-21T08:30:46.9034660Z selp.b16 %rs851, %rs775, %rs743, %p246; 2026-02-21T08:30:46.9034718Z cvt.s16.s8 %rs852, %rs851; 2026-02-21T08:30:46.9034787Z shr.s16 %rs853, %rs852, 4; 2026-02-21T08:30:46.9034853Z selp.b16 %rs854, %rs776, %rs744, %p246; 2026-02-21T08:30:46.9034910Z cvt.s16.s8 %rs855, %rs854; 2026-02-21T08:30:46.9034976Z shr.s16 %rs856, %rs855, 4; 2026-02-21T08:30:46.9035040Z selp.b16 %rs857, %rs777, %rs745, %p246; 2026-02-21T08:30:46.9035099Z cvt.s16.s8 %rs858, %rs857; 2026-02-21T08:30:46.9035165Z shr.s16 %rs859, %rs858, 4; 2026-02-21T08:30:46.9035230Z selp.b16 %rs860, %rs778, %rs746, %p246; 2026-02-21T08:30:46.9035289Z cvt.s16.s8 %rs861, %rs860; 
2026-02-21T08:30:46.9035347Z shr.s16 %rs862, %rs861, 4; 2026-02-21T08:30:46.9035425Z selp.b16 %rs863, %rs779, %rs747, %p246; 2026-02-21T08:30:46.9035484Z cvt.s16.s8 %rs864, %rs863; 2026-02-21T08:30:46.9035542Z shr.s16 %rs865, %rs864, 4; 2026-02-21T08:30:46.9035611Z selp.b16 %rs866, %rs780, %rs748, %p246; 2026-02-21T08:30:46.9035669Z cvt.s16.s8 %rs867, %rs866; 2026-02-21T08:30:46.9035725Z shr.s16 %rs868, %rs867, 4; 2026-02-21T08:30:46.9035788Z selp.b16 %rs869, %rs781, %rs749, %p246; 2026-02-21T08:30:46.9035854Z cvt.s16.s8 %rs870, %rs869; 2026-02-21T08:30:46.9035953Z shr.s16 %rs871, %rs870, 4; 2026-02-21T08:30:46.9036018Z selp.b16 %rs872, %rs782, %rs750, %p246; 2026-02-21T08:30:46.9036082Z cvt.s16.s8 %rs873, %rs872; 2026-02-21T08:30:46.9036138Z shr.s16 %rs874, %rs873, 4; 2026-02-21T08:30:46.9036200Z selp.b16 %rs875, %rs783, %rs751, %p246; 2026-02-21T08:30:46.9036258Z cvt.s16.s8 %rs876, %rs875; 2026-02-21T08:30:46.9036321Z shr.s16 %rs877, %rs876, 4; 2026-02-21T08:30:46.9036384Z selp.b16 %rs878, %rs784, %rs752, %p246; 2026-02-21T08:30:46.9036440Z cvt.s16.s8 %rs879, %rs878; 2026-02-21T08:30:46.9036503Z shr.s16 %rs880, %rs879, 4; 2026-02-21T08:30:46.9036671Z .loc 1 77 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:77:32 2026-02-21T08:30:46.9036734Z cvt.rn.f32.s16 %r1261, %rs787; 2026-02-21T08:30:46.9036802Z cvt.rn.f32.s16 %r1262, %rs790; 2026-02-21T08:30:46.9036863Z cvt.rn.f32.s16 %r1263, %rs793; 2026-02-21T08:30:46.9036965Z cvt.rn.f32.s16 %r1264, %rs796; 2026-02-21T08:30:46.9037026Z cvt.rn.f32.s16 %r1265, %rs799; 2026-02-21T08:30:46.9037093Z cvt.rn.f32.s16 %r1266, %rs802; 2026-02-21T08:30:46.9037152Z cvt.rn.f32.s16 %r1267, %rs805; 2026-02-21T08:30:46.9037211Z cvt.rn.f32.s16 %r1268, %rs808; 2026-02-21T08:30:46.9037274Z cvt.rn.f32.s16 %r1269, %rs811; 2026-02-21T08:30:46.9037332Z cvt.rn.f32.s16 %r1270, %rs814; 2026-02-21T08:30:46.9037390Z cvt.rn.f32.s16 %r1271, %rs817; 2026-02-21T08:30:46.9037448Z cvt.rn.f32.s16 %r1272, %rs820; 2026-02-21T08:30:46.9037515Z 
cvt.rn.f32.s16 %r1273, %rs823; 2026-02-21T08:30:46.9037571Z cvt.rn.f32.s16 %r1274, %rs826; 2026-02-21T08:30:46.9037630Z cvt.rn.f32.s16 %r1275, %rs829; 2026-02-21T08:30:46.9037695Z cvt.rn.f32.s16 %r1276, %rs832; 2026-02-21T08:30:46.9037751Z cvt.rn.f32.s16 %r1277, %rs835; 2026-02-21T08:30:46.9037810Z cvt.rn.f32.s16 %r1278, %rs838; 2026-02-21T08:30:46.9037869Z cvt.rn.f32.s16 %r1279, %rs841; 2026-02-21T08:30:46.9037935Z cvt.rn.f32.s16 %r1280, %rs844; 2026-02-21T08:30:46.9037998Z cvt.rn.f32.s16 %r1281, %rs847; 2026-02-21T08:30:46.9038056Z cvt.rn.f32.s16 %r1282, %rs850; 2026-02-21T08:30:46.9038121Z cvt.rn.f32.s16 %r1283, %rs853; 2026-02-21T08:30:46.9038177Z cvt.rn.f32.s16 %r1284, %rs856; 2026-02-21T08:30:46.9038236Z cvt.rn.f32.s16 %r1285, %rs859; 2026-02-21T08:30:46.9038299Z cvt.rn.f32.s16 %r1286, %rs862; 2026-02-21T08:30:46.9038356Z cvt.rn.f32.s16 %r1287, %rs865; 2026-02-21T08:30:46.9038413Z cvt.rn.f32.s16 %r1288, %rs868; 2026-02-21T08:30:46.9038472Z cvt.rn.f32.s16 %r1289, %rs871; 2026-02-21T08:30:46.9038536Z cvt.rn.f32.s16 %r1290, %rs874; 2026-02-21T08:30:46.9038592Z cvt.rn.f32.s16 %r1291, %rs877; 2026-02-21T08:30:46.9038649Z cvt.rn.f32.s16 %r1292, %rs880; 2026-02-21T08:30:46.9038716Z st.shared.b32 [%r51], %r1261; 2026-02-21T08:30:46.9038780Z st.shared.b32 [%r51+8], %r1262; 2026-02-21T08:30:46.9038848Z st.shared.b32 [%r51+16384], %r1277; 2026-02-21T08:30:46.9038912Z st.shared.b32 [%r51+16392], %r1278; 2026-02-21T08:30:46.9038979Z st.shared.b32 [%r52], %r1263; 2026-02-21T08:30:46.9039043Z st.shared.b32 [%r52+8], %r1264; 2026-02-21T08:30:46.9039106Z st.shared.b32 [%r52+16384], %r1279; 2026-02-21T08:30:46.9039173Z st.shared.b32 [%r52+16392], %r1280; 2026-02-21T08:30:46.9039231Z st.shared.b32 [%r53], %r1265; 2026-02-21T08:30:46.9039293Z st.shared.b32 [%r53+8], %r1266; 2026-02-21T08:30:46.9039353Z st.shared.b32 [%r53+16384], %r1281; 2026-02-21T08:30:46.9039420Z st.shared.b32 [%r53+16392], %r1282; 2026-02-21T08:30:46.9039479Z st.shared.b32 [%r54], %r1267; 
2026-02-21T08:30:46.9039537Z st.shared.b32 [%r54+8], %r1268; 2026-02-21T08:30:46.9039605Z st.shared.b32 [%r54+16384], %r1283; 2026-02-21T08:30:46.9039664Z st.shared.b32 [%r54+16392], %r1284; 2026-02-21T08:30:46.9039723Z st.shared.b32 [%r55], %r1269; 2026-02-21T08:30:46.9039789Z st.shared.b32 [%r55+8], %r1270; 2026-02-21T08:30:46.9039850Z st.shared.b32 [%r55+16384], %r1285; 2026-02-21T08:30:46.9039909Z st.shared.b32 [%r55+16392], %r1286; 2026-02-21T08:30:46.9039972Z st.shared.b32 [%r56], %r1271; 2026-02-21T08:30:46.9040082Z st.shared.b32 [%r56+8], %r1272; 2026-02-21T08:30:46.9040144Z st.shared.b32 [%r56+16384], %r1287; 2026-02-21T08:30:46.9040205Z st.shared.b32 [%r56+16392], %r1288; 2026-02-21T08:30:46.9040271Z st.shared.b32 [%r57], %r1273; 2026-02-21T08:30:46.9040332Z st.shared.b32 [%r57+8], %r1274; 2026-02-21T08:30:46.9040393Z st.shared.b32 [%r57+16384], %r1289; 2026-02-21T08:30:46.9040453Z st.shared.b32 [%r57+16392], %r1290; 2026-02-21T08:30:46.9040519Z st.shared.b32 [%r58], %r1275; 2026-02-21T08:30:46.9040579Z st.shared.b32 [%r58+8], %r1276; 2026-02-21T08:30:46.9040639Z st.shared.b32 [%r58+16384], %r1291; 2026-02-21T08:30:46.9040708Z st.shared.b32 [%r58+16392], %r1292; 2026-02-21T08:30:46.9040761Z $L__tmp127: 2026-02-21T08:30:46.9040980Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9041044Z // begin inline asm 2026-02-21T08:30:46.9041163Z fence.proxy.async.shared::cta; 2026-02-21T08:30:46.9041222Z // end inline asm 2026-02-21T08:30:46.9041275Z bar.sync 0; 2026-02-21T08:30:46.9041341Z @%p12 bra $L__BB0_17; 2026-02-21T08:30:46.9041440Z // %bb.16: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.9041506Z elect.sync %r1317|%p114, -1; 2026-02-21T08:30:46.9041604Z mov.b32 %r1295, 69208336; 2026-02-21T08:30:46.9041665Z mov.pred %p113, 0; 2026-02-21T08:30:46.9041722Z // begin inline asm 2026-02-21T08:30:46.9041893Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 0 ], 
%rd416, %r1295, %p113; 2026-02-21T08:30:46.9041950Z // end inline asm 2026-02-21T08:30:46.9042007Z // begin inline asm 2026-02-21T08:30:46.9042158Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 8 ], %rd417, %r1295, %p115; 2026-02-21T08:30:46.9042230Z // end inline asm 2026-02-21T08:30:46.9042289Z // begin inline asm 2026-02-21T08:30:46.9042445Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 16 ], %rd418, %r1295, %p115; 2026-02-21T08:30:46.9042520Z // end inline asm 2026-02-21T08:30:46.9042580Z // begin inline asm 2026-02-21T08:30:46.9042733Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 24 ], %rd419, %r1295, %p115; 2026-02-21T08:30:46.9042804Z // end inline asm 2026-02-21T08:30:46.9042863Z // begin inline asm 2026-02-21T08:30:46.9043016Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 32 ], %rd420, %r1295, %p115; 2026-02-21T08:30:46.9043086Z // end inline asm 2026-02-21T08:30:46.9043143Z // begin inline asm 2026-02-21T08:30:46.9043288Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 40 ], %rd421, %r1295, %p115; 2026-02-21T08:30:46.9043344Z // end inline asm 2026-02-21T08:30:46.9043406Z // begin inline asm 2026-02-21T08:30:46.9043550Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 48 ], %rd422, %r1295, %p115; 2026-02-21T08:30:46.9043605Z // end inline asm 2026-02-21T08:30:46.9043669Z // begin inline asm 2026-02-21T08:30:46.9043817Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 56 ], %rd423, %r1295, %p115; 2026-02-21T08:30:46.9043873Z // end inline asm 2026-02-21T08:30:46.9043939Z add.s32 %r1319, %r265, 69632; 2026-02-21T08:30:46.9044001Z cvt.u64.u32 %rd311, %r1319; 2026-02-21T08:30:46.9044058Z // begin inline asm 2026-02-21T08:30:46.9044185Z @%p114 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd311]; 2026-02-21T08:30:46.9044249Z // end inline asm 2026-02-21T08:30:46.9044301Z $L__tmp128: 
2026-02-21T08:30:46.9044402Z $L__BB0_17: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.9044574Z .loc 1 0 0 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:0 2026-02-21T08:30:46.9044635Z or.b32 %r122, %r120, %r8; 2026-02-21T08:30:46.9044695Z or.b32 %r123, %r1243, %r12; 2026-02-21T08:30:46.9044759Z or.b32 %r124, %r1243, %r13; 2026-02-21T08:30:46.9044818Z or.b32 %r125, %r1243, %r14; 2026-02-21T08:30:46.9044941Z or.b32 %r126, %r1243, %r15; 2026-02-21T08:30:46.9045110Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9045178Z add.s64 %rd312, %rd25, 384; 2026-02-21T08:30:46.9045236Z add.s64 %rd313, %rd26, 384; 2026-02-21T08:30:46.9045406Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9045472Z // begin inline asm 2026-02-21T08:30:46.9045594Z cp.async.cg.shared.global [ %r1731 + 0 ], [ %rd312 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9045651Z // end inline asm 2026-02-21T08:30:46.9045714Z // begin inline asm 2026-02-21T08:30:46.9045833Z cp.async.cg.shared.global [ %r1733 + 0 ], [ %rd313 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9045889Z // end inline asm 2026-02-21T08:30:46.9045952Z cp.async.commit_group; 2026-02-21T08:30:46.9046173Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.9046240Z add.s32 %r1329, %r121, %r59; 2026-02-21T08:30:46.9046410Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9046480Z cvt.s64.s32 %rd316, %r1329; 2026-02-21T08:30:46.9046542Z add.s64 %rd314, %rd48, %rd316; 2026-02-21T08:30:46.9046710Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9046774Z // begin inline asm 2026-02-21T08:30:46.9046888Z cp.async.cg.shared.global [ %r1735 + 0 ], [ %rd314 + 0 ], 0x10, %r1321; 2026-02-21T08:30:46.9046942Z // end inline asm 2026-02-21T08:30:46.9047004Z cp.async.commit_group; 
2026-02-21T08:30:46.9047176Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9047236Z add.s32 %r2539, %r62, %r120; 2026-02-21T08:30:46.9047295Z shl.b32 %r1330, %r119, 16; 2026-02-21T08:30:46.9047362Z or.b32 %r1331, %r63, %r1330; 2026-02-21T08:30:46.9047432Z mad.wide.s32 %rd612, %r1331, 2, %rd9; 2026-02-21T08:30:46.9047491Z or.b32 %r2538, %r64, %r1330; 2026-02-21T08:30:46.9047550Z mov.b64 %rd613, -32; 2026-02-21T08:30:46.9047616Z mov.b32 %r2542, %r2540; 2026-02-21T08:30:46.9047673Z mov.b32 %r2544, %r2540; 2026-02-21T08:30:46.9047730Z bra.uni $L__BB0_18; 2026-02-21T08:30:46.9047839Z $L__BB0_20: // in Loop: Header=BB0_18 Depth=2 2026-02-21T08:30:46.9048009Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9048068Z add.s64 %rd613, %rd613, 32; 2026-02-21T08:30:46.9048141Z setp.lt.u64 %p151, %rd613, 384; 2026-02-21T08:30:46.9048193Z $L__tmp129: 2026-02-21T08:30:46.9048413Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9048471Z add.s32 %r1437, %r2543, 1; 2026-02-21T08:30:46.9048543Z setp.gt.s32 %p152, %r1437, 1; 2026-02-21T08:30:46.9048611Z selp.b32 %r2543, 0, %r1437, %p152; 2026-02-21T08:30:46.9048672Z selp.b32 %r1438, 1, 0, %p152; 2026-02-21T08:30:46.9048739Z xor.b32 %r141, %r2544, %r1438; 2026-02-21T08:30:46.9048791Z $L__tmp130: 2026-02-21T08:30:46.9048957Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9049034Z mad.wide.s32 %rd327, %r2538, 2, %rd47; 2026-02-21T08:30:46.9049199Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9049259Z add.s32 %r1431, %r137, %r2555; 2026-02-21T08:30:46.9049320Z selp.b32 %r1432, 16, 0, %p151; 2026-02-21T08:30:46.9049385Z // begin inline asm 2026-02-21T08:30:46.9049501Z cp.async.cg.shared.global [ %r1431 + 0 ], [ %rd612 + 0 ], 0x10, %r1432; 
2026-02-21T08:30:46.9049556Z // end inline asm 2026-02-21T08:30:46.9049624Z add.s32 %r1433, %r1431, 4096; 2026-02-21T08:30:46.9049681Z // begin inline asm 2026-02-21T08:30:46.9049796Z cp.async.cg.shared.global [ %r1433 + 0 ], [ %rd327 + 0 ], 0x10, %r1432; 2026-02-21T08:30:46.9049898Z // end inline asm 2026-02-21T08:30:46.9049962Z cp.async.commit_group; 2026-02-21T08:30:46.9050126Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9050187Z cvt.s64.s32 %rd329, %r2539; 2026-02-21T08:30:46.9050255Z add.s64 %rd328, %rd48, %rd329; 2026-02-21T08:30:46.9050415Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9050476Z add.s32 %r1435, %r138, %r2555; 2026-02-21T08:30:46.9050541Z // begin inline asm 2026-02-21T08:30:46.9050657Z cp.async.cg.shared.global [ %r1435 + 0 ], [ %rd328 + 0 ], 0x10, %r1432; 2026-02-21T08:30:46.9050712Z // end inline asm 2026-02-21T08:30:46.9050775Z cp.async.commit_group; 2026-02-21T08:30:46.9050949Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9051058Z add.s32 %r2539, %r2539, 262144; 2026-02-21T08:30:46.9051123Z add.s64 %rd612, %rd612, 128; 2026-02-21T08:30:46.9051189Z add.s32 %r2538, %r2538, 64; 2026-02-21T08:30:46.9051251Z setp.lt.u64 %p153, %rd613, 448; 2026-02-21T08:30:46.9051308Z mov.b32 %r2540, %r2544; 2026-02-21T08:30:46.9051372Z mov.b32 %r2544, %r141; 2026-02-21T08:30:46.9051431Z @%p153 bra $L__BB0_18; 2026-02-21T08:30:46.9051489Z bra.uni $L__BB0_21; 2026-02-21T08:30:46.9051621Z $L__BB0_18: // Parent Loop BB0_3 Depth=1 2026-02-21T08:30:46.9051725Z // => This Inner Loop Header: Depth=2 2026-02-21T08:30:46.9051786Z add.s32 %r1352, %r2542, 1; 2026-02-21T08:30:46.9051848Z setp.gt.s32 %p133, %r1352, 2; 2026-02-21T08:30:46.9051918Z selp.b32 %r2542, 0, %r1352, %p133; 2026-02-21T08:30:46.9052085Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9052152Z 
cp.async.wait_group 4; 2026-02-21T08:30:46.9052218Z bar.sync 0; 2026-02-21T08:30:46.9052276Z shl.b32 %r1353, %r2542, 12; 2026-02-21T08:30:46.9052332Z shl.b32 %r1354, %r2542, 13; 2026-02-21T08:30:46.9052392Z add.s32 %r1356, %r265, %r1354; 2026-02-21T08:30:46.9052460Z add.s32 %r137, %r1356, 32768; 2026-02-21T08:30:46.9052520Z add.s32 %r1357, %r137, %r39; 2026-02-21T08:30:46.9052623Z ld.shared.v4.b32 {%r1358, %r1359, %r1360, %r1361}, [%r1357]; 2026-02-21T08:30:46.9052695Z mov.b32 {%rs881, %rs882}, %r1361; 2026-02-21T08:30:46.9052757Z mov.b32 {%rs883, %rs884}, %r1360; 2026-02-21T08:30:46.9052815Z mov.b32 {%rs885, %rs886}, %r1359; 2026-02-21T08:30:46.9052873Z mov.b32 {%rs887, %rs888}, %r1358; 2026-02-21T08:30:46.9052984Z ld.shared.v4.b32 {%r1362, %r1363, %r1364, %r1365}, [%r1357+16]; 2026-02-21T08:30:46.9053043Z mov.b32 {%rs889, %rs890}, %r1365; 2026-02-21T08:30:46.9053102Z mov.b32 {%rs891, %rs892}, %r1364; 2026-02-21T08:30:46.9053169Z mov.b32 {%rs893, %rs894}, %r1363; 2026-02-21T08:30:46.9053231Z mov.b32 {%rs895, %rs896}, %r1362; 2026-02-21T08:30:46.9053400Z .loc 1 52 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:52:32 2026-02-21T08:30:46.9053471Z cvt.f32.bf16 %r1336, %rs887; 2026-02-21T08:30:46.9053532Z cvt.f32.bf16 %r1337, %rs888; 2026-02-21T08:30:46.9053593Z cvt.f32.bf16 %r1338, %rs885; 2026-02-21T08:30:46.9053651Z cvt.f32.bf16 %r1339, %rs886; 2026-02-21T08:30:46.9053715Z cvt.f32.bf16 %r1340, %rs883; 2026-02-21T08:30:46.9053774Z cvt.f32.bf16 %r1341, %rs884; 2026-02-21T08:30:46.9053831Z cvt.f32.bf16 %r1342, %rs881; 2026-02-21T08:30:46.9053896Z cvt.f32.bf16 %r1343, %rs882; 2026-02-21T08:30:46.9053954Z cvt.f32.bf16 %r1344, %rs895; 2026-02-21T08:30:46.9054010Z cvt.f32.bf16 %r1345, %rs896; 2026-02-21T08:30:46.9054067Z cvt.f32.bf16 %r1346, %rs893; 2026-02-21T08:30:46.9054132Z cvt.f32.bf16 %r1347, %rs894; 2026-02-21T08:30:46.9054189Z cvt.f32.bf16 %r1348, %rs891; 2026-02-21T08:30:46.9054246Z cvt.f32.bf16 %r1349, %rs892; 2026-02-21T08:30:46.9054312Z 
cvt.f32.bf16 %r1350, %rs889; 2026-02-21T08:30:46.9054421Z cvt.f32.bf16 %r1351, %rs890; 2026-02-21T08:30:46.9054472Z $L__tmp131: 2026-02-21T08:30:46.9054690Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9054754Z // begin inline asm 2026-02-21T08:30:46.9054804Z 2026-02-21T08:30:46.9054854Z { 2026-02-21T08:30:46.9054924Z .reg .pred complete; 2026-02-21T08:30:46.9054979Z waitLoop: 2026-02-21T08:30:46.9055101Z mbarrier.try_wait.parity.shared.b64 complete, [%r2541], %r2540; 2026-02-21T08:30:46.9055172Z @!complete bra.uni waitLoop; 2026-02-21T08:30:46.9055221Z } 2026-02-21T08:30:46.9055226Z 2026-02-21T08:30:46.9055281Z // end inline asm 2026-02-21T08:30:46.9055342Z mov.pred %p134, -1; 2026-02-21T08:30:46.9055406Z // begin inline asm 2026-02-21T08:30:46.9055784Z @%p134 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1567 + 0], 16, {%r1336, %r1337, %r1338, %r1339, %r1340, %r1341, %r1342, %r1343, %r1344, %r1345, %r1346, %r1347, %r1348, %r1349, %r1350, %r1351}; 2026-02-21T08:30:46.9055843Z // end inline asm 2026-02-21T08:30:46.9055908Z // begin inline asm 2026-02-21T08:30:46.9055975Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.9056030Z // end inline asm 2026-02-21T08:30:46.9056091Z bar.sync 0; 2026-02-21T08:30:46.9056145Z $L__tmp132: 2026-02-21T08:30:46.9056314Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9056373Z add.s32 %r1366, %r265, %r1353; 2026-02-21T08:30:46.9056441Z add.s32 %r138, %r1366, 57344; 2026-02-21T08:30:46.9056498Z add.s32 %r1367, %r138, %r41; 2026-02-21T08:30:46.9056560Z ld.shared.b8 %rs897, [%r1367]; 2026-02-21T08:30:46.9056632Z ld.shared.b8 %rs898, [%r1367+128]; 2026-02-21T08:30:46.9056696Z ld.shared.b8 %rs899, [%r1367+256]; 2026-02-21T08:30:46.9056759Z ld.shared.b8 %rs900, [%r1367+384]; 2026-02-21T08:30:46.9056819Z ld.shared.b8 %rs901, [%r1367+512]; 2026-02-21T08:30:46.9056889Z ld.shared.b8 %rs902, [%r1367+640]; 
2026-02-21T08:30:46.9056952Z ld.shared.b8 %rs903, [%r1367+768]; 2026-02-21T08:30:46.9057010Z add.s32 %r1368, %r138, %r43; 2026-02-21T08:30:46.9057092Z ld.shared.b8 %rs904, [%r1368]; 2026-02-21T08:30:46.9057159Z ld.shared.b8 %rs905, [%r1367+1024]; 2026-02-21T08:30:46.9057226Z ld.shared.b8 %rs906, [%r1367+1152]; 2026-02-21T08:30:46.9057292Z ld.shared.b8 %rs907, [%r1367+1280]; 2026-02-21T08:30:46.9057364Z ld.shared.b8 %rs908, [%r1367+1408]; 2026-02-21T08:30:46.9057428Z ld.shared.b8 %rs909, [%r1367+1536]; 2026-02-21T08:30:46.9057493Z ld.shared.b8 %rs910, [%r1367+1664]; 2026-02-21T08:30:46.9057566Z ld.shared.b8 %rs911, [%r1367+1792]; 2026-02-21T08:30:46.9057627Z add.s32 %r1369, %r138, %r45; 2026-02-21T08:30:46.9057690Z ld.shared.b8 %rs912, [%r1369]; 2026-02-21T08:30:46.9057761Z ld.shared.b8 %rs913, [%r1367+2048]; 2026-02-21T08:30:46.9057824Z ld.shared.b8 %rs914, [%r1367+2176]; 2026-02-21T08:30:46.9057887Z ld.shared.b8 %rs915, [%r1367+2304]; 2026-02-21T08:30:46.9057955Z ld.shared.b8 %rs916, [%r1367+2432]; 2026-02-21T08:30:46.9058032Z ld.shared.b8 %rs917, [%r1367+2560]; 2026-02-21T08:30:46.9058096Z ld.shared.b8 %rs918, [%r1367+2688]; 2026-02-21T08:30:46.9058162Z ld.shared.b8 %rs919, [%r1367+2816]; 2026-02-21T08:30:46.9058234Z add.s32 %r1370, %r138, %r47; 2026-02-21T08:30:46.9058301Z ld.shared.b8 %rs920, [%r1370]; 2026-02-21T08:30:46.9058366Z ld.shared.b8 %rs921, [%r1367+3072]; 2026-02-21T08:30:46.9058430Z ld.shared.b8 %rs922, [%r1367+3200]; 2026-02-21T08:30:46.9058501Z ld.shared.b8 %rs923, [%r1367+3328]; 2026-02-21T08:30:46.9058564Z ld.shared.b8 %rs924, [%r1367+3456]; 2026-02-21T08:30:46.9058627Z ld.shared.b8 %rs925, [%r1367+3584]; 2026-02-21T08:30:46.9058699Z ld.shared.b8 %rs926, [%r1367+3712]; 2026-02-21T08:30:46.9058762Z ld.shared.b8 %rs927, [%r1367+3840]; 2026-02-21T08:30:46.9058823Z add.s32 %r1371, %r138, %r49; 2026-02-21T08:30:46.9058893Z ld.shared.b8 %rs928, [%r1371]; 2026-02-21T08:30:46.9059068Z .loc 1 57 28 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:57:28 2026-02-21T08:30:46.9059174Z shl.b16 %rs929, %rs897, 4; 2026-02-21T08:30:46.9059239Z shl.b16 %rs930, %rs898, 4; 2026-02-21T08:30:46.9059310Z shl.b16 %rs931, %rs899, 4; 2026-02-21T08:30:46.9059372Z shl.b16 %rs932, %rs900, 4; 2026-02-21T08:30:46.9059432Z shl.b16 %rs933, %rs901, 4; 2026-02-21T08:30:46.9059498Z shl.b16 %rs934, %rs902, 4; 2026-02-21T08:30:46.9059559Z shl.b16 %rs935, %rs903, 4; 2026-02-21T08:30:46.9059620Z shl.b16 %rs936, %rs904, 4; 2026-02-21T08:30:46.9059680Z shl.b16 %rs937, %rs905, 4; 2026-02-21T08:30:46.9059747Z shl.b16 %rs938, %rs906, 4; 2026-02-21T08:30:46.9059806Z shl.b16 %rs939, %rs907, 4; 2026-02-21T08:30:46.9059867Z shl.b16 %rs940, %rs908, 4; 2026-02-21T08:30:46.9059934Z shl.b16 %rs941, %rs909, 4; 2026-02-21T08:30:46.9059995Z shl.b16 %rs942, %rs910, 4; 2026-02-21T08:30:46.9060055Z shl.b16 %rs943, %rs911, 4; 2026-02-21T08:30:46.9060115Z shl.b16 %rs944, %rs912, 4; 2026-02-21T08:30:46.9060220Z shl.b16 %rs945, %rs913, 4; 2026-02-21T08:30:46.9060284Z shl.b16 %rs946, %rs914, 4; 2026-02-21T08:30:46.9060344Z shl.b16 %rs947, %rs915, 4; 2026-02-21T08:30:46.9060410Z shl.b16 %rs948, %rs916, 4; 2026-02-21T08:30:46.9060470Z shl.b16 %rs949, %rs917, 4; 2026-02-21T08:30:46.9060530Z shl.b16 %rs950, %rs918, 4; 2026-02-21T08:30:46.9060597Z shl.b16 %rs951, %rs919, 4; 2026-02-21T08:30:46.9060657Z shl.b16 %rs952, %rs920, 4; 2026-02-21T08:30:46.9060718Z shl.b16 %rs953, %rs921, 4; 2026-02-21T08:30:46.9060777Z shl.b16 %rs954, %rs922, 4; 2026-02-21T08:30:46.9060843Z shl.b16 %rs955, %rs923, 4; 2026-02-21T08:30:46.9060903Z shl.b16 %rs956, %rs924, 4; 2026-02-21T08:30:46.9060963Z shl.b16 %rs957, %rs925, 4; 2026-02-21T08:30:46.9061030Z shl.b16 %rs958, %rs926, 4; 2026-02-21T08:30:46.9061089Z shl.b16 %rs959, %rs927, 4; 2026-02-21T08:30:46.9061151Z shl.b16 %rs960, %rs928, 4; 2026-02-21T08:30:46.9061335Z .loc 1 72 58 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:72:58 2026-02-21T08:30:46.9061419Z selp.b16 
%rs961, %rs929, %rs897, %p246; 2026-02-21T08:30:46.9061482Z cvt.s16.s8 %rs962, %rs961; 2026-02-21T08:30:46.9061579Z shr.s16 %rs963, %rs962, 4; 2026-02-21T08:30:46.9061659Z selp.b16 %rs964, %rs930, %rs898, %p246; 2026-02-21T08:30:46.9061721Z cvt.s16.s8 %rs965, %rs964; 2026-02-21T08:30:46.9061782Z shr.s16 %rs966, %rs965, 4; 2026-02-21T08:30:46.9061851Z selp.b16 %rs967, %rs931, %rs899, %p246; 2026-02-21T08:30:46.9061919Z cvt.s16.s8 %rs968, %rs967; 2026-02-21T08:30:46.9061980Z shr.s16 %rs969, %rs968, 4; 2026-02-21T08:30:46.9062047Z selp.b16 %rs970, %rs932, %rs900, %p246; 2026-02-21T08:30:46.9062116Z cvt.s16.s8 %rs971, %rs970; 2026-02-21T08:30:46.9062176Z shr.s16 %rs972, %rs971, 4; 2026-02-21T08:30:46.9062243Z selp.b16 %rs973, %rs933, %rs901, %p246; 2026-02-21T08:30:46.9062311Z cvt.s16.s8 %rs974, %rs973; 2026-02-21T08:30:46.9062371Z shr.s16 %rs975, %rs974, 4; 2026-02-21T08:30:46.9062441Z selp.b16 %rs976, %rs934, %rs902, %p246; 2026-02-21T08:30:46.9062506Z cvt.s16.s8 %rs977, %rs976; 2026-02-21T08:30:46.9062573Z shr.s16 %rs978, %rs977, 4; 2026-02-21T08:30:46.9062640Z selp.b16 %rs979, %rs935, %rs903, %p246; 2026-02-21T08:30:46.9062701Z cvt.s16.s8 %rs980, %rs979; 2026-02-21T08:30:46.9062769Z shr.s16 %rs981, %rs980, 4; 2026-02-21T08:30:46.9062838Z selp.b16 %rs982, %rs936, %rs904, %p246; 2026-02-21T08:30:46.9062898Z cvt.s16.s8 %rs983, %rs982; 2026-02-21T08:30:46.9062960Z shr.s16 %rs984, %rs983, 4; 2026-02-21T08:30:46.9063036Z selp.b16 %rs985, %rs937, %rs905, %p246; 2026-02-21T08:30:46.9063098Z cvt.s16.s8 %rs986, %rs985; 2026-02-21T08:30:46.9063159Z shr.s16 %rs987, %rs986, 4; 2026-02-21T08:30:46.9063236Z selp.b16 %rs988, %rs938, %rs906, %p246; 2026-02-21T08:30:46.9063297Z cvt.s16.s8 %rs989, %rs988; 2026-02-21T08:30:46.9063360Z shr.s16 %rs990, %rs989, 4; 2026-02-21T08:30:46.9063429Z selp.b16 %rs991, %rs939, %rs907, %p246; 2026-02-21T08:30:46.9063496Z cvt.s16.s8 %rs992, %rs991; 2026-02-21T08:30:46.9063558Z shr.s16 %rs993, %rs992, 4; 2026-02-21T08:30:46.9063694Z selp.b16 %rs994, 
%rs940, %rs908, %p246; 2026-02-21T08:30:46.9063763Z cvt.s16.s8 %rs995, %rs994; 2026-02-21T08:30:46.9063823Z shr.s16 %rs996, %rs995, 4; 2026-02-21T08:30:46.9063891Z selp.b16 %rs997, %rs941, %rs909, %p246; 2026-02-21T08:30:46.9063959Z cvt.s16.s8 %rs998, %rs997; 2026-02-21T08:30:46.9064021Z shr.s16 %rs999, %rs998, 4; 2026-02-21T08:30:46.9064095Z selp.b16 %rs1000, %rs942, %rs910, %p246; 2026-02-21T08:30:46.9064158Z cvt.s16.s8 %rs1001, %rs1000; 2026-02-21T08:30:46.9064227Z shr.s16 %rs1002, %rs1001, 4; 2026-02-21T08:30:46.9064300Z selp.b16 %rs1003, %rs943, %rs911, %p246; 2026-02-21T08:30:46.9064362Z cvt.s16.s8 %rs1004, %rs1003; 2026-02-21T08:30:46.9064430Z shr.s16 %rs1005, %rs1004, 4; 2026-02-21T08:30:46.9064500Z selp.b16 %rs1006, %rs944, %rs912, %p246; 2026-02-21T08:30:46.9064561Z cvt.s16.s8 %rs1007, %rs1006; 2026-02-21T08:30:46.9064620Z shr.s16 %rs1008, %rs1007, 4; 2026-02-21T08:30:46.9064747Z selp.b16 %rs1009, %rs945, %rs913, %p246; 2026-02-21T08:30:46.9064812Z cvt.s16.s8 %rs1010, %rs1009; 2026-02-21T08:30:46.9064874Z shr.s16 %rs1011, %rs1010, 4; 2026-02-21T08:30:46.9064958Z selp.b16 %rs1012, %rs946, %rs914, %p246; 2026-02-21T08:30:46.9065022Z cvt.s16.s8 %rs1013, %rs1012; 2026-02-21T08:30:46.9065084Z shr.s16 %rs1014, %rs1013, 4; 2026-02-21T08:30:46.9065165Z selp.b16 %rs1015, %rs947, %rs915, %p246; 2026-02-21T08:30:46.9065236Z cvt.s16.s8 %rs1016, %rs1015; 2026-02-21T08:30:46.9065293Z shr.s16 %rs1017, %rs1016, 4; 2026-02-21T08:30:46.9065359Z selp.b16 %rs1018, %rs948, %rs916, %p246; 2026-02-21T08:30:46.9065426Z cvt.s16.s8 %rs1019, %rs1018; 2026-02-21T08:30:46.9065482Z shr.s16 %rs1020, %rs1019, 4; 2026-02-21T08:30:46.9065548Z selp.b16 %rs1021, %rs949, %rs917, %p246; 2026-02-21T08:30:46.9065614Z cvt.s16.s8 %rs1022, %rs1021; 2026-02-21T08:30:46.9065671Z shr.s16 %rs1023, %rs1022, 4; 2026-02-21T08:30:46.9065736Z selp.b16 %rs1024, %rs950, %rs918, %p246; 2026-02-21T08:30:46.9065795Z cvt.s16.s8 %rs1025, %rs1024; 2026-02-21T08:30:46.9065863Z shr.s16 %rs1026, %rs1025, 4; 
2026-02-21T08:30:46.9065929Z selp.b16 %rs1027, %rs951, %rs919, %p246; 2026-02-21T08:30:46.9065986Z cvt.s16.s8 %rs1028, %rs1027; 2026-02-21T08:30:46.9066049Z shr.s16 %rs1029, %rs1028, 4; 2026-02-21T08:30:46.9066115Z selp.b16 %rs1030, %rs952, %rs920, %p246; 2026-02-21T08:30:46.9066173Z cvt.s16.s8 %rs1031, %rs1030; 2026-02-21T08:30:46.9066230Z shr.s16 %rs1032, %rs1031, 4; 2026-02-21T08:30:46.9066303Z selp.b16 %rs1033, %rs953, %rs921, %p246; 2026-02-21T08:30:46.9066360Z cvt.s16.s8 %rs1034, %rs1033; 2026-02-21T08:30:46.9066417Z shr.s16 %rs1035, %rs1034, 4; 2026-02-21T08:30:46.9066490Z selp.b16 %rs1036, %rs954, %rs922, %p246; 2026-02-21T08:30:46.9066549Z cvt.s16.s8 %rs1037, %rs1036; 2026-02-21T08:30:46.9066606Z shr.s16 %rs1038, %rs1037, 4; 2026-02-21T08:30:46.9066678Z selp.b16 %rs1039, %rs955, %rs923, %p246; 2026-02-21T08:30:46.9066738Z cvt.s16.s8 %rs1040, %rs1039; 2026-02-21T08:30:46.9066796Z shr.s16 %rs1041, %rs1040, 4; 2026-02-21T08:30:46.9066866Z selp.b16 %rs1042, %rs956, %rs924, %p246; 2026-02-21T08:30:46.9066931Z cvt.s16.s8 %rs1043, %rs1042; 2026-02-21T08:30:46.9066989Z shr.s16 %rs1044, %rs1043, 4; 2026-02-21T08:30:46.9067054Z selp.b16 %rs1045, %rs957, %rs925, %p246; 2026-02-21T08:30:46.9067118Z cvt.s16.s8 %rs1046, %rs1045; 2026-02-21T08:30:46.9067175Z shr.s16 %rs1047, %rs1046, 4; 2026-02-21T08:30:46.9067242Z selp.b16 %rs1048, %rs958, %rs926, %p246; 2026-02-21T08:30:46.9067300Z cvt.s16.s8 %rs1049, %rs1048; 2026-02-21T08:30:46.9067365Z shr.s16 %rs1050, %rs1049, 4; 2026-02-21T08:30:46.9067429Z selp.b16 %rs1051, %rs959, %rs927, %p246; 2026-02-21T08:30:46.9067487Z cvt.s16.s8 %rs1052, %rs1051; 2026-02-21T08:30:46.9067553Z shr.s16 %rs1053, %rs1052, 4; 2026-02-21T08:30:46.9067619Z selp.b16 %rs1054, %rs960, %rs928, %p246; 2026-02-21T08:30:46.9067677Z cvt.s16.s8 %rs1055, %rs1054; 2026-02-21T08:30:46.9067733Z shr.s16 %rs1056, %rs1055, 4; 2026-02-21T08:30:46.9067909Z .loc 1 77 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:77:32 2026-02-21T08:30:46.9068015Z 
cvt.rn.f32.s16 %r1372, %rs963; 2026-02-21T08:30:46.9068077Z cvt.rn.f32.s16 %r1373, %rs966; 2026-02-21T08:30:46.9068144Z cvt.rn.f32.s16 %r1374, %rs969; 2026-02-21T08:30:46.9068204Z cvt.rn.f32.s16 %r1375, %rs972; 2026-02-21T08:30:46.9068262Z cvt.rn.f32.s16 %r1376, %rs975; 2026-02-21T08:30:46.9068327Z cvt.rn.f32.s16 %r1377, %rs978; 2026-02-21T08:30:46.9068385Z cvt.rn.f32.s16 %r1378, %rs981; 2026-02-21T08:30:46.9068445Z cvt.rn.f32.s16 %r1379, %rs984; 2026-02-21T08:30:46.9068505Z cvt.rn.f32.s16 %r1380, %rs987; 2026-02-21T08:30:46.9068573Z cvt.rn.f32.s16 %r1381, %rs990; 2026-02-21T08:30:46.9068631Z cvt.rn.f32.s16 %r1382, %rs993; 2026-02-21T08:30:46.9068688Z cvt.rn.f32.s16 %r1383, %rs996; 2026-02-21T08:30:46.9068752Z cvt.rn.f32.s16 %r1384, %rs999; 2026-02-21T08:30:46.9068814Z cvt.rn.f32.s16 %r1385, %rs1002; 2026-02-21T08:30:46.9068873Z cvt.rn.f32.s16 %r1386, %rs1005; 2026-02-21T08:30:46.9068973Z cvt.rn.f32.s16 %r1387, %rs1008; 2026-02-21T08:30:46.9069040Z cvt.rn.f32.s16 %r1388, %rs1011; 2026-02-21T08:30:46.9069098Z cvt.rn.f32.s16 %r1389, %rs1014; 2026-02-21T08:30:46.9069156Z cvt.rn.f32.s16 %r1390, %rs1017; 2026-02-21T08:30:46.9069221Z cvt.rn.f32.s16 %r1391, %rs1020; 2026-02-21T08:30:46.9069280Z cvt.rn.f32.s16 %r1392, %rs1023; 2026-02-21T08:30:46.9069338Z cvt.rn.f32.s16 %r1393, %rs1026; 2026-02-21T08:30:46.9069403Z cvt.rn.f32.s16 %r1394, %rs1029; 2026-02-21T08:30:46.9069461Z cvt.rn.f32.s16 %r1395, %rs1032; 2026-02-21T08:30:46.9069519Z cvt.rn.f32.s16 %r1396, %rs1035; 2026-02-21T08:30:46.9069576Z cvt.rn.f32.s16 %r1397, %rs1038; 2026-02-21T08:30:46.9069641Z cvt.rn.f32.s16 %r1398, %rs1041; 2026-02-21T08:30:46.9069700Z cvt.rn.f32.s16 %r1399, %rs1044; 2026-02-21T08:30:46.9069759Z cvt.rn.f32.s16 %r1400, %rs1047; 2026-02-21T08:30:46.9069824Z cvt.rn.f32.s16 %r1401, %rs1050; 2026-02-21T08:30:46.9069882Z cvt.rn.f32.s16 %r1402, %rs1053; 2026-02-21T08:30:46.9069942Z cvt.rn.f32.s16 %r1403, %rs1056; 2026-02-21T08:30:46.9070006Z st.shared.b32 [%r51], %r1372; 2026-02-21T08:30:46.9070075Z 
st.shared.b32 [%r51+8], %r1373; 2026-02-21T08:30:46.9070139Z st.shared.b32 [%r51+16384], %r1388; 2026-02-21T08:30:46.9070203Z st.shared.b32 [%r51+16392], %r1389; 2026-02-21T08:30:46.9070271Z st.shared.b32 [%r52], %r1374; 2026-02-21T08:30:46.9070331Z st.shared.b32 [%r52+8], %r1375; 2026-02-21T08:30:46.9070393Z st.shared.b32 [%r52+16384], %r1390; 2026-02-21T08:30:46.9070456Z st.shared.b32 [%r52+16392], %r1391; 2026-02-21T08:30:46.9070524Z st.shared.b32 [%r53], %r1376; 2026-02-21T08:30:46.9070584Z st.shared.b32 [%r53+8], %r1377; 2026-02-21T08:30:46.9070644Z st.shared.b32 [%r53+16384], %r1392; 2026-02-21T08:30:46.9070712Z st.shared.b32 [%r53+16392], %r1393; 2026-02-21T08:30:46.9070772Z st.shared.b32 [%r54], %r1378; 2026-02-21T08:30:46.9070833Z st.shared.b32 [%r54+8], %r1379; 2026-02-21T08:30:46.9070903Z st.shared.b32 [%r54+16384], %r1394; 2026-02-21T08:30:46.9070966Z st.shared.b32 [%r54+16392], %r1395; 2026-02-21T08:30:46.9071029Z st.shared.b32 [%r55], %r1380; 2026-02-21T08:30:46.9071090Z st.shared.b32 [%r55+8], %r1381; 2026-02-21T08:30:46.9071161Z st.shared.b32 [%r55+16384], %r1396; 2026-02-21T08:30:46.9071222Z st.shared.b32 [%r55+16392], %r1397; 2026-02-21T08:30:46.9071283Z st.shared.b32 [%r56], %r1382; 2026-02-21T08:30:46.9071351Z st.shared.b32 [%r56+8], %r1383; 2026-02-21T08:30:46.9071414Z st.shared.b32 [%r56+16384], %r1398; 2026-02-21T08:30:46.9071475Z st.shared.b32 [%r56+16392], %r1399; 2026-02-21T08:30:46.9071564Z st.shared.b32 [%r57], %r1384; 2026-02-21T08:30:46.9071642Z st.shared.b32 [%r57+8], %r1385; 2026-02-21T08:30:46.9071706Z st.shared.b32 [%r57+16384], %r1400; 2026-02-21T08:30:46.9071771Z st.shared.b32 [%r57+16392], %r1401; 2026-02-21T08:30:46.9071842Z st.shared.b32 [%r58], %r1386; 2026-02-21T08:30:46.9071906Z st.shared.b32 [%r58+8], %r1387; 2026-02-21T08:30:46.9071970Z st.shared.b32 [%r58+16384], %r1402; 2026-02-21T08:30:46.9072043Z st.shared.b32 [%r58+16392], %r1403; 2026-02-21T08:30:46.9072276Z .loc 1 40 92 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9072341Z shl.b32 %r1404, %r2543, 3; 2026-02-21T08:30:46.9072404Z add.s32 %r1405, %r265, %r1404; 2026-02-21T08:30:46.9072478Z add.s32 %r2541, %r1405, 69632; 2026-02-21T08:30:46.9072534Z $L__tmp133: 2026-02-21T08:30:46.9072759Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9072826Z // begin inline asm 2026-02-21T08:30:46.9072900Z fence.proxy.async.shared::cta; 2026-02-21T08:30:46.9072956Z // end inline asm 2026-02-21T08:30:46.9073018Z bar.sync 0; 2026-02-21T08:30:46.9073078Z @%p12 bra $L__BB0_20; 2026-02-21T08:30:46.9073177Z // %bb.19: // in Loop: Header=BB0_18 Depth=2 2026-02-21T08:30:46.9073242Z elect.sync %r1430|%p135, -1; 2026-02-21T08:30:46.9073364Z mov.b32 %r1408, 69208336; 2026-02-21T08:30:46.9073430Z // begin inline asm 2026-02-21T08:30:46.9073590Z @%p135 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 0 ], %rd416, %r1408, %p134; 2026-02-21T08:30:46.9073653Z // end inline asm 2026-02-21T08:30:46.9073709Z // begin inline asm 2026-02-21T08:30:46.9073864Z @%p135 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 8 ], %rd417, %r1408, %p134; 2026-02-21T08:30:46.9073926Z // end inline asm 2026-02-21T08:30:46.9073982Z // begin inline asm 2026-02-21T08:30:46.9074134Z @%p135 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 16 ], %rd418, %r1408, %p134; 2026-02-21T08:30:46.9074190Z // end inline asm 2026-02-21T08:30:46.9074254Z // begin inline asm 2026-02-21T08:30:46.9074403Z @%p135 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 24 ], %rd419, %r1408, %p134; 2026-02-21T08:30:46.9074458Z // end inline asm 2026-02-21T08:30:46.9074523Z // begin inline asm 2026-02-21T08:30:46.9074673Z @%p135 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 32 ], %rd420, %r1408, %p134; 2026-02-21T08:30:46.9074729Z // end inline asm 2026-02-21T08:30:46.9074791Z // 
begin inline asm 2026-02-21T08:30:46.9074938Z @%p135 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 40 ], %rd421, %r1408, %p134; 2026-02-21T08:30:46.9074993Z // end inline asm 2026-02-21T08:30:46.9075057Z // begin inline asm 2026-02-21T08:30:46.9075201Z @%p135 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 48 ], %rd422, %r1408, %p134; 2026-02-21T08:30:46.9075255Z // end inline asm 2026-02-21T08:30:46.9075310Z // begin inline asm 2026-02-21T08:30:46.9075461Z @%p135 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 56 ], %rd423, %r1408, %p134; 2026-02-21T08:30:46.9075516Z // end inline asm 2026-02-21T08:30:46.9075578Z cvt.u64.u32 %rd325, %r2541; 2026-02-21T08:30:46.9075640Z // begin inline asm 2026-02-21T08:30:46.9075769Z @%p135 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd325]; 2026-02-21T08:30:46.9075830Z // end inline asm 2026-02-21T08:30:46.9075895Z bra.uni $L__BB0_20; 2026-02-21T08:30:46.9075996Z $L__BB0_21: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.9076088Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:30:46.9076143Z mov.b32 %r2550, 1; 2026-02-21T08:30:46.9076370Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9076426Z // begin inline asm 2026-02-21T08:30:46.9076476Z 2026-02-21T08:30:46.9076533Z { 2026-02-21T08:30:46.9076592Z .reg .pred complete; 2026-02-21T08:30:46.9076647Z waitLoop: 2026-02-21T08:30:46.9076771Z mbarrier.try_wait.parity.shared.b64 complete, [%r2541], %r2550; 2026-02-21T08:30:46.9076845Z @!complete bra.uni waitLoop; 2026-02-21T08:30:46.9076894Z } 2026-02-21T08:30:46.9076898Z 2026-02-21T08:30:46.9076952Z // end inline asm 2026-02-21T08:30:46.9077013Z $L__tmp134: 2026-02-21T08:30:46.9077187Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9077297Z cp.async.wait_group 0; 2026-02-21T08:30:46.9077358Z bar.sync 0; 2026-02-21T08:30:46.9077418Z add.s32 %r2548, 
%r265, 69632; 2026-02-21T08:30:46.9077475Z // begin inline asm 2026-02-21T08:30:46.9077563Z @%p204 mbarrier.inval.shared::cta.b64 [%r2548]; 2026-02-21T08:30:46.9077626Z // end inline asm 2026-02-21T08:30:46.9077680Z bar.sync 0; 2026-02-21T08:30:46.9077735Z // begin inline asm 2026-02-21T08:30:46.9077827Z @%p204 mbarrier.inval.shared::cta.b64 [%r1442]; 2026-02-21T08:30:46.9077881Z // end inline asm 2026-02-21T08:30:46.9078052Z .loc 1 88 43 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:43 2026-02-21T08:30:46.9078120Z shl.b32 %r1585, %r123, 13; 2026-02-21T08:30:46.9078178Z shl.b32 %r1586, %r124, 13; 2026-02-21T08:30:46.9078235Z shl.b32 %r1587, %r125, 13; 2026-02-21T08:30:46.9078359Z shl.b32 %r1588, %r126, 13; 2026-02-21T08:30:46.9078537Z .loc 1 88 50 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:50 2026-02-21T08:30:46.9078598Z add.s32 %r1589, %r1585, %r122; 2026-02-21T08:30:46.9078657Z add.s32 %r1590, %r1586, %r122; 2026-02-21T08:30:46.9078721Z add.s32 %r1591, %r1587, %r122; 2026-02-21T08:30:46.9078780Z add.s32 %r1592, %r1588, %r122; 2026-02-21T08:30:46.9078943Z .loc 1 88 22 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:22 2026-02-21T08:30:46.9079014Z mad.wide.s32 %rd330, %r1589, 2, %rd49; 2026-02-21T08:30:46.9079088Z mad.wide.s32 %rd331, %r1590, 2, %rd49; 2026-02-21T08:30:46.9079154Z mad.wide.s32 %rd332, %r1591, 2, %rd49; 2026-02-21T08:30:46.9079218Z mad.wide.s32 %rd333, %r1592, 2, %rd49; 2026-02-21T08:30:46.9079280Z $L__tmp135: 2026-02-21T08:30:46.9079495Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9079557Z // begin inline asm 2026-02-21T08:30:46.9079881Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1443, %r1444, %r1445, %r1446, %r1447, %r1448, %r1449, %r1450, %r1451, %r1452, %r1453, %r1454, %r1455, %r1456, %r1457, %r1458}, [%r1887 + 0], 32; 2026-02-21T08:30:46.9079939Z // end inline asm 2026-02-21T08:30:46.9079998Z // begin inline asm 
2026-02-21T08:30:46.9080313Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1460, %r1461, %r1462, %r1463, %r1464, %r1465, %r1466, %r1467, %r1468, %r1469, %r1470, %r1471, %r1472, %r1473, %r1474, %r1475}, [%r1887 + 16], 32; 2026-02-21T08:30:46.9080372Z // end inline asm 2026-02-21T08:30:46.9080430Z // begin inline asm 2026-02-21T08:30:46.9080500Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:46.9080573Z // end inline asm 2026-02-21T08:30:46.9080638Z cvt.u64.u32 %rd343, %r1443; 2026-02-21T08:30:46.9080701Z cvt.u64.u32 %rd344, %r1444; 2026-02-21T08:30:46.9080775Z shl.b64 %rd345, %rd344, 32; 2026-02-21T08:30:46.9080835Z or.b64 %rd346, %rd343, %rd345; 2026-02-21T08:30:46.9080890Z $L__tmp136: 2026-02-21T08:30:46.9081065Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9081136Z mov.b64 {%r1593, %r1594}, %rd346; 2026-02-21T08:30:46.9081207Z cvt.rn.bf16x2.f32 %r1595, %r1594, %r1593; 2026-02-21T08:30:46.9081260Z $L__tmp137: 2026-02-21T08:30:46.9081483Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9081585Z cvt.u64.u32 %rd347, %r1445; 2026-02-21T08:30:46.9081647Z cvt.u64.u32 %rd348, %r1446; 2026-02-21T08:30:46.9081713Z shl.b64 %rd349, %rd348, 32; 2026-02-21T08:30:46.9081773Z or.b64 %rd350, %rd347, %rd349; 2026-02-21T08:30:46.9081826Z $L__tmp138: 2026-02-21T08:30:46.9081995Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9082066Z mov.b64 {%r1596, %r1597}, %rd350; 2026-02-21T08:30:46.9082140Z cvt.rn.bf16x2.f32 %r1598, %r1597, %r1596; 2026-02-21T08:30:46.9082248Z $L__tmp139: 2026-02-21T08:30:46.9082471Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9082531Z cvt.u64.u32 %rd351, %r1447; 2026-02-21T08:30:46.9082588Z cvt.u64.u32 %rd352, %r1448; 2026-02-21T08:30:46.9082655Z shl.b64 %rd353, %rd352, 32; 2026-02-21T08:30:46.9082716Z 
or.b64 %rd354, %rd351, %rd353; 2026-02-21T08:30:46.9082770Z $L__tmp140: 2026-02-21T08:30:46.9082941Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9083012Z mov.b64 {%r1599, %r1600}, %rd354; 2026-02-21T08:30:46.9083082Z cvt.rn.bf16x2.f32 %r1601, %r1600, %r1599; 2026-02-21T08:30:46.9083134Z $L__tmp141: 2026-02-21T08:30:46.9083360Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9083420Z cvt.u64.u32 %rd355, %r1449; 2026-02-21T08:30:46.9083526Z cvt.u64.u32 %rd356, %r1450; 2026-02-21T08:30:46.9083595Z shl.b64 %rd357, %rd356, 32; 2026-02-21T08:30:46.9083654Z or.b64 %rd358, %rd355, %rd357; 2026-02-21T08:30:46.9083706Z $L__tmp142: 2026-02-21T08:30:46.9083871Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9083940Z mov.b64 {%r1602, %r1603}, %rd358; 2026-02-21T08:30:46.9084009Z cvt.rn.bf16x2.f32 %r1604, %r1603, %r1602; 2026-02-21T08:30:46.9084062Z $L__tmp143: 2026-02-21T08:30:46.9084278Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9084338Z cvt.u64.u32 %rd359, %r1451; 2026-02-21T08:30:46.9084396Z cvt.u64.u32 %rd360, %r1452; 2026-02-21T08:30:46.9084463Z shl.b64 %rd361, %rd360, 32; 2026-02-21T08:30:46.9084523Z or.b64 %rd362, %rd359, %rd361; 2026-02-21T08:30:46.9084577Z $L__tmp144: 2026-02-21T08:30:46.9084748Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9084817Z mov.b64 {%r1605, %r1606}, %rd362; 2026-02-21T08:30:46.9084885Z cvt.rn.bf16x2.f32 %r1607, %r1606, %r1605; 2026-02-21T08:30:46.9084937Z $L__tmp145: 2026-02-21T08:30:46.9085150Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9085209Z cvt.u64.u32 %rd363, %r1453; 2026-02-21T08:30:46.9085266Z cvt.u64.u32 %rd364, %r1454; 
2026-02-21T08:30:46.9085331Z shl.b64 %rd365, %rd364, 32; 2026-02-21T08:30:46.9085390Z or.b64 %rd366, %rd363, %rd365; 2026-02-21T08:30:46.9085442Z $L__tmp146: 2026-02-21T08:30:46.9085602Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9085671Z mov.b64 {%r1608, %r1609}, %rd366; 2026-02-21T08:30:46.9085736Z cvt.rn.bf16x2.f32 %r1610, %r1609, %r1608; 2026-02-21T08:30:46.9085788Z $L__tmp147: 2026-02-21T08:30:46.9086001Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9086061Z cvt.u64.u32 %rd367, %r1455; 2026-02-21T08:30:46.9086119Z cvt.u64.u32 %rd368, %r1456; 2026-02-21T08:30:46.9086174Z shl.b64 %rd369, %rd368, 32; 2026-02-21T08:30:46.9086240Z or.b64 %rd370, %rd367, %rd369; 2026-02-21T08:30:46.9086292Z $L__tmp148: 2026-02-21T08:30:46.9086457Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9086523Z mov.b64 {%r1611, %r1612}, %rd370; 2026-02-21T08:30:46.9086590Z cvt.rn.bf16x2.f32 %r1613, %r1612, %r1611; 2026-02-21T08:30:46.9086641Z $L__tmp149: 2026-02-21T08:30:46.9086856Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9086913Z cvt.u64.u32 %rd371, %r1457; 2026-02-21T08:30:46.9086969Z cvt.u64.u32 %rd372, %r1458; 2026-02-21T08:30:46.9087030Z shl.b64 %rd373, %rd372, 32; 2026-02-21T08:30:46.9087138Z or.b64 %rd374, %rd371, %rd373; 2026-02-21T08:30:46.9087190Z $L__tmp150: 2026-02-21T08:30:46.9087356Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9087424Z mov.b64 {%r1614, %r1615}, %rd374; 2026-02-21T08:30:46.9087490Z cvt.rn.bf16x2.f32 %r1616, %r1615, %r1614; 2026-02-21T08:30:46.9087543Z $L__tmp151: 2026-02-21T08:30:46.9087760Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9087817Z cvt.u64.u32 %rd375, 
%r1460; 2026-02-21T08:30:46.9087874Z cvt.u64.u32 %rd376, %r1461; 2026-02-21T08:30:46.9087931Z shl.b64 %rd377, %rd376, 32; 2026-02-21T08:30:46.9087998Z or.b64 %rd378, %rd375, %rd377; 2026-02-21T08:30:46.9088049Z $L__tmp152: 2026-02-21T08:30:46.9088216Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9088323Z mov.b64 {%r1617, %r1618}, %rd378; 2026-02-21T08:30:46.9088394Z cvt.rn.bf16x2.f32 %r1619, %r1618, %r1617; 2026-02-21T08:30:46.9088448Z $L__tmp153: 2026-02-21T08:30:46.9088672Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9088732Z cvt.u64.u32 %rd379, %r1462; 2026-02-21T08:30:46.9088791Z cvt.u64.u32 %rd380, %r1463; 2026-02-21T08:30:46.9088850Z shl.b64 %rd381, %rd380, 32; 2026-02-21T08:30:46.9088919Z or.b64 %rd382, %rd379, %rd381; 2026-02-21T08:30:46.9088973Z $L__tmp154: 2026-02-21T08:30:46.9089143Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9089213Z mov.b64 {%r1620, %r1621}, %rd382; 2026-02-21T08:30:46.9089280Z cvt.rn.bf16x2.f32 %r1622, %r1621, %r1620; 2026-02-21T08:30:46.9089333Z $L__tmp155: 2026-02-21T08:30:46.9089543Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9089611Z cvt.u64.u32 %rd383, %r1464; 2026-02-21T08:30:46.9089668Z cvt.u64.u32 %rd384, %r1465; 2026-02-21T08:30:46.9089726Z shl.b64 %rd385, %rd384, 32; 2026-02-21T08:30:46.9089793Z or.b64 %rd386, %rd383, %rd385; 2026-02-21T08:30:46.9089844Z $L__tmp156: 2026-02-21T08:30:46.9090010Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9090077Z mov.b64 {%r1623, %r1624}, %rd386; 2026-02-21T08:30:46.9090144Z cvt.rn.bf16x2.f32 %r1625, %r1624, %r1623; 2026-02-21T08:30:46.9090197Z $L__tmp157: 2026-02-21T08:30:46.9090406Z .loc 2 291 36 // standard.py:291:36 @[ 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9090471Z cvt.u64.u32 %rd387, %r1466; 2026-02-21T08:30:46.9090528Z cvt.u64.u32 %rd388, %r1467; 2026-02-21T08:30:46.9090587Z shl.b64 %rd389, %rd388, 32; 2026-02-21T08:30:46.9090656Z or.b64 %rd390, %rd387, %rd389; 2026-02-21T08:30:46.9090710Z $L__tmp158: 2026-02-21T08:30:46.9090878Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9090944Z mov.b64 {%r1626, %r1627}, %rd390; 2026-02-21T08:30:46.9091011Z cvt.rn.bf16x2.f32 %r1628, %r1627, %r1626; 2026-02-21T08:30:46.9091063Z $L__tmp159: 2026-02-21T08:30:46.9091279Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9091343Z cvt.u64.u32 %rd391, %r1468; 2026-02-21T08:30:46.9091400Z cvt.u64.u32 %rd392, %r1469; 2026-02-21T08:30:46.9091457Z shl.b64 %rd393, %rd392, 32; 2026-02-21T08:30:46.9091524Z or.b64 %rd394, %rd391, %rd393; 2026-02-21T08:30:46.9091605Z $L__tmp160: 2026-02-21T08:30:46.9091775Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9091843Z mov.b64 {%r1629, %r1630}, %rd394; 2026-02-21T08:30:46.9091913Z cvt.rn.bf16x2.f32 %r1631, %r1630, %r1629; 2026-02-21T08:30:46.9092015Z $L__tmp161: 2026-02-21T08:30:46.9092224Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9092291Z cvt.u64.u32 %rd395, %r1470; 2026-02-21T08:30:46.9092348Z cvt.u64.u32 %rd396, %r1471; 2026-02-21T08:30:46.9092405Z shl.b64 %rd397, %rd396, 32; 2026-02-21T08:30:46.9092471Z or.b64 %rd398, %rd395, %rd397; 2026-02-21T08:30:46.9092522Z $L__tmp162: 2026-02-21T08:30:46.9092689Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9092756Z mov.b64 {%r1632, %r1633}, %rd398; 2026-02-21T08:30:46.9092823Z cvt.rn.bf16x2.f32 %r1634, %r1633, %r1632; 2026-02-21T08:30:46.9092875Z $L__tmp163: 
2026-02-21T08:30:46.9093085Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9093206Z cvt.u64.u32 %rd399, %r1472; 2026-02-21T08:30:46.9093268Z cvt.u64.u32 %rd400, %r1473; 2026-02-21T08:30:46.9093327Z shl.b64 %rd401, %rd400, 32; 2026-02-21T08:30:46.9093392Z or.b64 %rd402, %rd399, %rd401; 2026-02-21T08:30:46.9093443Z $L__tmp164: 2026-02-21T08:30:46.9093609Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9093671Z mov.b64 {%r1635, %r1636}, %rd402; 2026-02-21T08:30:46.9093744Z cvt.rn.bf16x2.f32 %r1637, %r1636, %r1635; 2026-02-21T08:30:46.9093797Z $L__tmp165: 2026-02-21T08:30:46.9094007Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9094072Z cvt.u64.u32 %rd403, %r1474; 2026-02-21T08:30:46.9094128Z cvt.u64.u32 %rd404, %r1475; 2026-02-21T08:30:46.9094186Z shl.b64 %rd405, %rd404, 32; 2026-02-21T08:30:46.9094250Z or.b64 %rd406, %rd403, %rd405; 2026-02-21T08:30:46.9094302Z $L__tmp166: 2026-02-21T08:30:46.9094473Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9094536Z mov.b64 {%r1638, %r1639}, %rd406; 2026-02-21T08:30:46.9094611Z cvt.rn.bf16x2.f32 %r1640, %r1639, %r1638; 2026-02-21T08:30:46.9094777Z .loc 1 88 81 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:81 2026-02-21T08:30:46.9094877Z st.shared.v4.b32 [%r60], {%r1595, %r1607, %r1619, %r1631}; 2026-02-21T08:30:46.9094940Z bar.sync 0; 2026-02-21T08:30:46.9094996Z // begin inline asm 2026-02-21T08:30:46.9095153Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1477, %r1478, %r1479, %r1480}, [%r659]; 2026-02-21T08:30:46.9095218Z // end inline asm 2026-02-21T08:30:46.9095272Z bar.sync 0; 2026-02-21T08:30:46.9095369Z st.shared.v4.b32 [%r60], {%r1598, %r1610, %r1622, %r1634}; 2026-02-21T08:30:46.9095422Z bar.sync 0; 2026-02-21T08:30:46.9095486Z // begin inline asm 
2026-02-21T08:30:46.9095641Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1482, %r1483, %r1484, %r1485}, [%r659]; 2026-02-21T08:30:46.9095701Z // end inline asm 2026-02-21T08:30:46.9095762Z bar.sync 0; 2026-02-21T08:30:46.9095854Z st.shared.v4.b32 [%r60], {%r1601, %r1613, %r1625, %r1637}; 2026-02-21T08:30:46.9095909Z bar.sync 0; 2026-02-21T08:30:46.9095964Z // begin inline asm 2026-02-21T08:30:46.9096120Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1487, %r1488, %r1489, %r1490}, [%r659]; 2026-02-21T08:30:46.9096176Z // end inline asm 2026-02-21T08:30:46.9096230Z bar.sync 0; 2026-02-21T08:30:46.9096331Z st.shared.v4.b32 [%r60], {%r1604, %r1616, %r1628, %r1640}; 2026-02-21T08:30:46.9096386Z bar.sync 0; 2026-02-21T08:30:46.9096442Z // begin inline asm 2026-02-21T08:30:46.9096594Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1492, %r1493, %r1494, %r1495}, [%r659]; 2026-02-21T08:30:46.9096650Z // end inline asm 2026-02-21T08:30:46.9096705Z // begin inline asm 2026-02-21T08:30:46.9096812Z st.global.v4.b32 [ %rd330 + 0 ], { %r1477, %r1482, %r1487, %r1492 }; 2026-02-21T08:30:46.9096877Z // end inline asm 2026-02-21T08:30:46.9096976Z // begin inline asm 2026-02-21T08:30:46.9097082Z st.global.v4.b32 [ %rd331 + 0 ], { %r1478, %r1483, %r1488, %r1493 }; 2026-02-21T08:30:46.9097145Z // end inline asm 2026-02-21T08:30:46.9097202Z // begin inline asm 2026-02-21T08:30:46.9097301Z st.global.v4.b32 [ %rd332 + 0 ], { %r1479, %r1484, %r1489, %r1494 }; 2026-02-21T08:30:46.9097357Z // end inline asm 2026-02-21T08:30:46.9097423Z // begin inline asm 2026-02-21T08:30:46.9097523Z st.global.v4.b32 [ %rd333 + 0 ], { %r1480, %r1485, %r1490, %r1495 }; 2026-02-21T08:30:46.9097579Z // end inline asm 2026-02-21T08:30:46.9097768Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9097829Z add.s32 %r1641, %r2523, 7104; 2026-02-21T08:30:46.9097999Z .loc 1 25 35 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:25:35 2026-02-21T08:30:46.9098067Z 
shr.s32 %r1642, %r1641, 31; 2026-02-21T08:30:46.9098188Z shr.u32 %r1643, %r1642, 23; 2026-02-21T08:30:46.9098255Z add.s32 %r1644, %r1641, %r1643; 2026-02-21T08:30:46.9098314Z shr.s32 %r1645, %r1644, 9; 2026-02-21T08:30:46.9098485Z .loc 1 26 33 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:26:33 2026-02-21T08:30:46.9098544Z shl.b32 %r1646, %r1645, 3; 2026-02-21T08:30:46.9098713Z .loc 1 27 39 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:39 2026-02-21T08:30:46.9098779Z sub.s32 %r1647, 64, %r1646; 2026-02-21T08:30:46.9098946Z .loc 1 27 52 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:52 2026-02-21T08:30:46.9099004Z min.s32 %r1648, %r1647, 8; 2026-02-21T08:30:46.9099176Z .loc 1 28 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:45 2026-02-21T08:30:46.9099240Z and.b32 %r1649, %r1644, -512; 2026-02-21T08:30:46.9099302Z sub.s32 %r1650, %r1641, %r1649; 2026-02-21T08:30:46.9099474Z .loc 1 29 51 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:29:51 2026-02-21T08:30:46.9099536Z div.s32 %r144, %r1650, %r1648; 2026-02-21T08:30:46.9099700Z .loc 1 28 64 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:64 2026-02-21T08:30:46.9099763Z mul.lo.s32 %r1651, %r144, %r1648; 2026-02-21T08:30:46.9099831Z sub.s32 %r1652, %r1650, %r1651; 2026-02-21T08:30:46.9099997Z .loc 1 28 30 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:30 2026-02-21T08:30:46.9100058Z add.s32 %r1653, %r1652, %r1646; 2026-02-21T08:30:46.9100229Z .loc 1 30 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:30:27 2026-02-21T08:30:46.9100291Z shl.b32 %r145, %r1653, 7; 2026-02-21T08:30:46.9100501Z .loc 1 31 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:31:32 2026-02-21T08:30:46.9100571Z or.b32 %r146, %r145, %r6; 2026-02-21T08:30:46.9100736Z .loc 1 32 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:32:27 2026-02-21T08:30:46.9100798Z shl.b32 %r1654, %r144, 6; 2026-02-21T08:30:46.9100969Z 
.loc 1 33 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:33:32 2026-02-21T08:30:46.9101028Z or.b32 %r1655, %r1654, %r10; 2026-02-21T08:30:46.9101099Z or.b32 %r1656, %r1654, %r11; 2026-02-21T08:30:46.9101270Z .loc 1 48 53 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:53 2026-02-21T08:30:46.9101338Z shl.b32 %r1657, %r1655, 10; 2026-02-21T08:30:46.9101398Z shl.b32 %r1658, %r1656, 10; 2026-02-21T08:30:46.9101461Z mov.pred %p165, -1; 2026-02-21T08:30:46.9101525Z mov.b32 %r2547, 0; 2026-02-21T08:30:46.9101607Z $L__tmp167: 2026-02-21T08:30:46.9101832Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9101891Z // begin inline asm 2026-02-21T08:30:46.9102238Z @%p165 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1887 + 0], 32, {%r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547}; 2026-02-21T08:30:46.9102353Z // end inline asm 2026-02-21T08:30:46.9102414Z // begin inline asm 2026-02-21T08:30:46.9102751Z @%p165 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1887 + 16], 32, {%r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547, %r2547}; 2026-02-21T08:30:46.9102809Z // end inline asm 2026-02-21T08:30:46.9102867Z // begin inline asm 2026-02-21T08:30:46.9102948Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.9103004Z // end inline asm 2026-02-21T08:30:46.9103061Z bar.sync 0; 2026-02-21T08:30:46.9103121Z $L__tmp168: 2026-02-21T08:30:46.9103299Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9103358Z // begin inline asm 2026-02-21T08:30:46.9103496Z @%p204 mbarrier.init.shared::cta.b64 [%r2548], 1; 2026-02-21T08:30:46.9103565Z // end inline asm 2026-02-21T08:30:46.9103621Z bar.sync 0; 2026-02-21T08:30:46.9103679Z // begin inline asm 2026-02-21T08:30:46.9103778Z @%p204 
mbarrier.init.shared::cta.b64 [%r1442], 1; 2026-02-21T08:30:46.9103835Z // end inline asm 2026-02-21T08:30:46.9104013Z .loc 1 48 60 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:60 2026-02-21T08:30:46.9104081Z or.b32 %r1659, %r1657, %r16; 2026-02-21T08:30:46.9104142Z or.b32 %r1660, %r1658, %r16; 2026-02-21T08:30:46.9104315Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9104387Z mad.wide.s32 %rd334, %r1659, 2, %rd47; 2026-02-21T08:30:46.9104466Z mad.wide.s32 %rd335, %r1660, 2, %rd47; 2026-02-21T08:30:46.9104527Z mov.b32 %r1732, 16; 2026-02-21T08:30:46.9104704Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9104771Z // begin inline asm 2026-02-21T08:30:46.9104901Z cp.async.cg.shared.global [ %r1731 + 0 ], [ %rd334 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9104959Z // end inline asm 2026-02-21T08:30:46.9105022Z // begin inline asm 2026-02-21T08:30:46.9105150Z cp.async.cg.shared.global [ %r1733 + 0 ], [ %rd335 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9105207Z // end inline asm 2026-02-21T08:30:46.9105272Z cp.async.commit_group; 2026-02-21T08:30:46.9105455Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.9105518Z add.s32 %r1661, %r146, %r27; 2026-02-21T08:30:46.9105690Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9105759Z cvt.s64.s32 %rd407, %r1661; 2026-02-21T08:30:46.9105825Z add.s64 %rd336, %rd48, %rd407; 2026-02-21T08:30:46.9106001Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9106070Z // begin inline asm 2026-02-21T08:30:46.9106191Z cp.async.cg.shared.global [ %r1735 + 0 ], [ %rd336 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9106249Z // end inline asm 2026-02-21T08:30:46.9106315Z cp.async.commit_group; 2026-02-21T08:30:46.9106496Z .loc 1 48 32 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9106559Z cvt.s64.s32 %rd408, %r1657; 2026-02-21T08:30:46.9106624Z or.b64 %rd409, %rd408, %rd10; 2026-02-21T08:30:46.9106695Z shl.b64 %rd410, %rd409, 1; 2026-02-21T08:30:46.9106758Z add.s64 %rd32, %rd47, %rd410; 2026-02-21T08:30:46.9106819Z add.s64 %rd337, %rd32, 128; 2026-02-21T08:30:46.9106880Z cvt.s64.s32 %rd411, %r1658; 2026-02-21T08:30:46.9106949Z or.b64 %rd412, %rd411, %rd10; 2026-02-21T08:30:46.9107012Z shl.b64 %rd413, %rd412, 1; 2026-02-21T08:30:46.9107075Z add.s64 %rd33, %rd47, %rd413; 2026-02-21T08:30:46.9107147Z add.s64 %rd338, %rd33, 128; 2026-02-21T08:30:46.9107326Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9107427Z bar.sync 0; 2026-02-21T08:30:46.9107496Z // begin inline asm 2026-02-21T08:30:46.9107624Z cp.async.cg.shared.global [ %r1555 + 0 ], [ %rd337 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9107683Z // end inline asm 2026-02-21T08:30:46.9107743Z // begin inline asm 2026-02-21T08:30:46.9107868Z cp.async.cg.shared.global [ %r1557 + 0 ], [ %rd338 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9107926Z // end inline asm 2026-02-21T08:30:46.9107993Z cp.async.commit_group; 2026-02-21T08:30:46.9108172Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.9108234Z add.s32 %r1662, %r146, %r32; 2026-02-21T08:30:46.9108403Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9108471Z cvt.s64.s32 %rd414, %r1662; 2026-02-21T08:30:46.9108573Z add.s64 %rd339, %rd48, %rd414; 2026-02-21T08:30:46.9108754Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9108814Z // begin inline asm 2026-02-21T08:30:46.9108946Z cp.async.cg.shared.global [ %r1559 + 0 ], [ %rd339 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9109005Z // end inline asm 2026-02-21T08:30:46.9109081Z cp.async.commit_group; 
2026-02-21T08:30:46.9109256Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9109314Z add.s64 %rd340, %rd32, 256; 2026-02-21T08:30:46.9109372Z add.s64 %rd341, %rd33, 256; 2026-02-21T08:30:46.9109541Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9109595Z bar.sync 0; 2026-02-21T08:30:46.9109651Z // begin inline asm 2026-02-21T08:30:46.9109765Z cp.async.cg.shared.global [ %r1561 + 0 ], [ %rd340 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9109830Z // end inline asm 2026-02-21T08:30:46.9109888Z // begin inline asm 2026-02-21T08:30:46.9110001Z cp.async.cg.shared.global [ %r1563 + 0 ], [ %rd341 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9110064Z // end inline asm 2026-02-21T08:30:46.9110125Z cp.async.commit_group; 2026-02-21T08:30:46.9110291Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.9110351Z add.s32 %r1663, %r146, %r37; 2026-02-21T08:30:46.9110522Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9110579Z cvt.s64.s32 %rd415, %r1663; 2026-02-21T08:30:46.9110639Z add.s64 %rd342, %rd48, %rd415; 2026-02-21T08:30:46.9110812Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9110868Z // begin inline asm 2026-02-21T08:30:46.9110984Z cp.async.cg.shared.global [ %r1565 + 0 ], [ %rd342 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9111050Z // end inline asm 2026-02-21T08:30:46.9111114Z cp.async.commit_group; 2026-02-21T08:30:46.9111281Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9111353Z cp.async.wait_group 4; 2026-02-21T08:30:46.9111406Z bar.sync 0; 2026-02-21T08:30:46.9111503Z ld.shared.v4.b32 {%r1664, %r1665, %r1666, %r1667}, [%r40]; 2026-02-21T08:30:46.9111598Z mov.b32 {%rs1057, %rs1058}, %r1667; 2026-02-21T08:30:46.9111672Z mov.b32 {%rs1059, %rs1060}, %r1666; 
2026-02-21T08:30:46.9111733Z mov.b32 {%rs1061, %rs1062}, %r1665; 2026-02-21T08:30:46.9111792Z mov.b32 {%rs1063, %rs1064}, %r1664; 2026-02-21T08:30:46.9111901Z ld.shared.v4.b32 {%r1668, %r1669, %r1670, %r1671}, [%r40+16]; 2026-02-21T08:30:46.9111963Z mov.b32 {%rs1065, %rs1066}, %r1671; 2026-02-21T08:30:46.9112022Z mov.b32 {%rs1067, %rs1068}, %r1670; 2026-02-21T08:30:46.9112083Z mov.b32 {%rs1069, %rs1070}, %r1669; 2026-02-21T08:30:46.9112148Z mov.b32 {%rs1071, %rs1072}, %r1668; 2026-02-21T08:30:46.9112317Z .loc 1 52 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:52:32 2026-02-21T08:30:46.9112432Z cvt.f32.bf16 %r1568, %rs1063; 2026-02-21T08:30:46.9112500Z cvt.f32.bf16 %r1569, %rs1064; 2026-02-21T08:30:46.9112559Z cvt.f32.bf16 %r1570, %rs1061; 2026-02-21T08:30:46.9112617Z cvt.f32.bf16 %r1571, %rs1062; 2026-02-21T08:30:46.9112682Z cvt.f32.bf16 %r1572, %rs1059; 2026-02-21T08:30:46.9112740Z cvt.f32.bf16 %r1573, %rs1060; 2026-02-21T08:30:46.9112797Z cvt.f32.bf16 %r1574, %rs1057; 2026-02-21T08:30:46.9112857Z cvt.f32.bf16 %r1575, %rs1058; 2026-02-21T08:30:46.9112924Z cvt.f32.bf16 %r1576, %rs1071; 2026-02-21T08:30:46.9112984Z cvt.f32.bf16 %r1577, %rs1072; 2026-02-21T08:30:46.9113042Z cvt.f32.bf16 %r1578, %rs1069; 2026-02-21T08:30:46.9113105Z cvt.f32.bf16 %r1579, %rs1070; 2026-02-21T08:30:46.9113162Z cvt.f32.bf16 %r1580, %rs1067; 2026-02-21T08:30:46.9113218Z cvt.f32.bf16 %r1581, %rs1068; 2026-02-21T08:30:46.9113319Z cvt.f32.bf16 %r1582, %rs1065; 2026-02-21T08:30:46.9113387Z cvt.f32.bf16 %r1583, %rs1066; 2026-02-21T08:30:46.9113441Z $L__tmp169: 2026-02-21T08:30:46.9113657Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9113724Z // begin inline asm 2026-02-21T08:30:46.9114041Z @%p165 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1567 + 0], 16, {%r1568, %r1569, %r1570, %r1571, %r1572, %r1573, %r1574, %r1575, %r1576, %r1577, %r1578, %r1579, %r1580, %r1581, %r1582, %r1583}; 
2026-02-21T08:30:46.9114097Z // end inline asm 2026-02-21T08:30:46.9114161Z // begin inline asm 2026-02-21T08:30:46.9114229Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.9114284Z // end inline asm 2026-02-21T08:30:46.9114338Z bar.sync 0; 2026-02-21T08:30:46.9114399Z $L__tmp170: 2026-02-21T08:30:46.9114568Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9114631Z ld.shared.b8 %rs1073, [%r42]; 2026-02-21T08:30:46.9114706Z ld.shared.b8 %rs1074, [%r42+128]; 2026-02-21T08:30:46.9114769Z ld.shared.b8 %rs1075, [%r42+256]; 2026-02-21T08:30:46.9114832Z ld.shared.b8 %rs1076, [%r42+384]; 2026-02-21T08:30:46.9114900Z ld.shared.b8 %rs1077, [%r42+512]; 2026-02-21T08:30:46.9114961Z ld.shared.b8 %rs1078, [%r42+640]; 2026-02-21T08:30:46.9115021Z ld.shared.b8 %rs1079, [%r42+768]; 2026-02-21T08:30:46.9115082Z ld.shared.b8 %rs1080, [%r44]; 2026-02-21T08:30:46.9115154Z ld.shared.b8 %rs1081, [%r42+1024]; 2026-02-21T08:30:46.9115218Z ld.shared.b8 %rs1082, [%r42+1152]; 2026-02-21T08:30:46.9115279Z ld.shared.b8 %rs1083, [%r42+1280]; 2026-02-21T08:30:46.9115346Z ld.shared.b8 %rs1084, [%r42+1408]; 2026-02-21T08:30:46.9115408Z ld.shared.b8 %rs1085, [%r42+1536]; 2026-02-21T08:30:46.9115468Z ld.shared.b8 %rs1086, [%r42+1664]; 2026-02-21T08:30:46.9115529Z ld.shared.b8 %rs1087, [%r42+1792]; 2026-02-21T08:30:46.9115599Z ld.shared.b8 %rs1088, [%r46]; 2026-02-21T08:30:46.9115663Z ld.shared.b8 %rs1089, [%r42+2048]; 2026-02-21T08:30:46.9115730Z ld.shared.b8 %rs1090, [%r42+2176]; 2026-02-21T08:30:46.9115799Z ld.shared.b8 %rs1091, [%r42+2304]; 2026-02-21T08:30:46.9115862Z ld.shared.b8 %rs1092, [%r42+2432]; 2026-02-21T08:30:46.9115924Z ld.shared.b8 %rs1093, [%r42+2560]; 2026-02-21T08:30:46.9115983Z ld.shared.b8 %rs1094, [%r42+2688]; 2026-02-21T08:30:46.9116050Z ld.shared.b8 %rs1095, [%r42+2816]; 2026-02-21T08:30:46.9116111Z ld.shared.b8 %rs1096, [%r48]; 2026-02-21T08:30:46.9116171Z ld.shared.b8 %rs1097, [%r42+3072]; 2026-02-21T08:30:46.9116237Z 
ld.shared.b8 %rs1098, [%r42+3200]; 2026-02-21T08:30:46.9116296Z ld.shared.b8 %rs1099, [%r42+3328]; 2026-02-21T08:30:46.9116355Z ld.shared.b8 %rs1100, [%r42+3456]; 2026-02-21T08:30:46.9116424Z ld.shared.b8 %rs1101, [%r42+3584]; 2026-02-21T08:30:46.9116484Z ld.shared.b8 %rs1102, [%r42+3712]; 2026-02-21T08:30:46.9116544Z ld.shared.b8 %rs1103, [%r42+3840]; 2026-02-21T08:30:46.9116603Z ld.shared.b8 %rs1104, [%r50]; 2026-02-21T08:30:46.9116782Z .loc 1 57 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:57:28 2026-02-21T08:30:46.9116901Z shl.b16 %rs1105, %rs1073, 4; 2026-02-21T08:30:46.9116963Z shl.b16 %rs1106, %rs1074, 4; 2026-02-21T08:30:46.9117029Z shl.b16 %rs1107, %rs1075, 4; 2026-02-21T08:30:46.9117088Z shl.b16 %rs1108, %rs1076, 4; 2026-02-21T08:30:46.9117145Z shl.b16 %rs1109, %rs1077, 4; 2026-02-21T08:30:46.9117201Z shl.b16 %rs1110, %rs1078, 4; 2026-02-21T08:30:46.9117265Z shl.b16 %rs1111, %rs1079, 4; 2026-02-21T08:30:46.9117323Z shl.b16 %rs1112, %rs1080, 4; 2026-02-21T08:30:46.9117380Z shl.b16 %rs1113, %rs1081, 4; 2026-02-21T08:30:46.9117444Z shl.b16 %rs1114, %rs1082, 4; 2026-02-21T08:30:46.9117502Z shl.b16 %rs1115, %rs1083, 4; 2026-02-21T08:30:46.9117559Z shl.b16 %rs1116, %rs1084, 4; 2026-02-21T08:30:46.9117626Z shl.b16 %rs1117, %rs1085, 4; 2026-02-21T08:30:46.9117683Z shl.b16 %rs1118, %rs1086, 4; 2026-02-21T08:30:46.9117740Z shl.b16 %rs1119, %rs1087, 4; 2026-02-21T08:30:46.9117842Z shl.b16 %rs1120, %rs1088, 4; 2026-02-21T08:30:46.9117913Z shl.b16 %rs1121, %rs1089, 4; 2026-02-21T08:30:46.9117973Z shl.b16 %rs1122, %rs1090, 4; 2026-02-21T08:30:46.9118032Z shl.b16 %rs1123, %rs1091, 4; 2026-02-21T08:30:46.9118097Z shl.b16 %rs1124, %rs1092, 4; 2026-02-21T08:30:46.9118155Z shl.b16 %rs1125, %rs1093, 4; 2026-02-21T08:30:46.9118212Z shl.b16 %rs1126, %rs1094, 4; 2026-02-21T08:30:46.9118271Z shl.b16 %rs1127, %rs1095, 4; 2026-02-21T08:30:46.9118337Z shl.b16 %rs1128, %rs1096, 4; 2026-02-21T08:30:46.9118395Z shl.b16 %rs1129, %rs1097, 4; 2026-02-21T08:30:46.9118452Z 
shl.b16 %rs1130, %rs1098, 4; 2026-02-21T08:30:46.9118517Z shl.b16 %rs1131, %rs1099, 4; 2026-02-21T08:30:46.9118575Z shl.b16 %rs1132, %rs1100, 4; 2026-02-21T08:30:46.9118633Z shl.b16 %rs1133, %rs1101, 4; 2026-02-21T08:30:46.9118692Z shl.b16 %rs1134, %rs1102, 4; 2026-02-21T08:30:46.9118758Z shl.b16 %rs1135, %rs1103, 4; 2026-02-21T08:30:46.9118815Z shl.b16 %rs1136, %rs1104, 4; 2026-02-21T08:30:46.9118985Z .loc 1 72 58 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:72:58 2026-02-21T08:30:46.9119072Z selp.b16 %rs1137, %rs1105, %rs1073, %p246; 2026-02-21T08:30:46.9119130Z cvt.s16.s8 %rs1138, %rs1137; 2026-02-21T08:30:46.9119187Z shr.s16 %rs1139, %rs1138, 4; 2026-02-21T08:30:46.9119267Z selp.b16 %rs1140, %rs1106, %rs1074, %p246; 2026-02-21T08:30:46.9119327Z cvt.s16.s8 %rs1141, %rs1140; 2026-02-21T08:30:46.9119383Z shr.s16 %rs1142, %rs1141, 4; 2026-02-21T08:30:46.9119454Z selp.b16 %rs1143, %rs1107, %rs1075, %p246; 2026-02-21T08:30:46.9119519Z cvt.s16.s8 %rs1144, %rs1143; 2026-02-21T08:30:46.9119575Z shr.s16 %rs1145, %rs1144, 4; 2026-02-21T08:30:46.9119645Z selp.b16 %rs1146, %rs1108, %rs1076, %p246; 2026-02-21T08:30:46.9119709Z cvt.s16.s8 %rs1147, %rs1146; 2026-02-21T08:30:46.9119766Z shr.s16 %rs1148, %rs1147, 4; 2026-02-21T08:30:46.9119836Z selp.b16 %rs1149, %rs1109, %rs1077, %p246; 2026-02-21T08:30:46.9119894Z cvt.s16.s8 %rs1150, %rs1149; 2026-02-21T08:30:46.9119959Z shr.s16 %rs1151, %rs1150, 4; 2026-02-21T08:30:46.9120030Z selp.b16 %rs1152, %rs1110, %rs1078, %p246; 2026-02-21T08:30:46.9120086Z cvt.s16.s8 %rs1153, %rs1152; 2026-02-21T08:30:46.9120151Z shr.s16 %rs1154, %rs1153, 4; 2026-02-21T08:30:46.9120218Z selp.b16 %rs1155, %rs1111, %rs1079, %p246; 2026-02-21T08:30:46.9120276Z cvt.s16.s8 %rs1156, %rs1155; 2026-02-21T08:30:46.9120341Z shr.s16 %rs1157, %rs1156, 4; 2026-02-21T08:30:46.9120409Z selp.b16 %rs1158, %rs1112, %rs1080, %p246; 2026-02-21T08:30:46.9120466Z cvt.s16.s8 %rs1159, %rs1158; 2026-02-21T08:30:46.9120523Z shr.s16 %rs1160, %rs1159, 4; 
2026-02-21T08:30:46.9120597Z selp.b16 %rs1161, %rs1113, %rs1081, %p246; 2026-02-21T08:30:46.9120655Z cvt.s16.s8 %rs1162, %rs1161; 2026-02-21T08:30:46.9120712Z shr.s16 %rs1163, %rs1162, 4; 2026-02-21T08:30:46.9120786Z selp.b16 %rs1164, %rs1114, %rs1082, %p246; 2026-02-21T08:30:46.9120844Z cvt.s16.s8 %rs1165, %rs1164; 2026-02-21T08:30:46.9120904Z shr.s16 %rs1166, %rs1165, 4; 2026-02-21T08:30:46.9121012Z selp.b16 %rs1167, %rs1115, %rs1083, %p246; 2026-02-21T08:30:46.9121079Z cvt.s16.s8 %rs1168, %rs1167; 2026-02-21T08:30:46.9121137Z shr.s16 %rs1169, %rs1168, 4; 2026-02-21T08:30:46.9121206Z selp.b16 %rs1170, %rs1116, %rs1084, %p246; 2026-02-21T08:30:46.9121272Z cvt.s16.s8 %rs1171, %rs1170; 2026-02-21T08:30:46.9121328Z shr.s16 %rs1172, %rs1171, 4; 2026-02-21T08:30:46.9121398Z selp.b16 %rs1173, %rs1117, %rs1085, %p246; 2026-02-21T08:30:46.9121457Z cvt.s16.s8 %rs1174, %rs1173; 2026-02-21T08:30:46.9121522Z shr.s16 %rs1175, %rs1174, 4; 2026-02-21T08:30:46.9121623Z selp.b16 %rs1176, %rs1118, %rs1086, %p246; 2026-02-21T08:30:46.9121684Z cvt.s16.s8 %rs1177, %rs1176; 2026-02-21T08:30:46.9121749Z shr.s16 %rs1178, %rs1177, 4; 2026-02-21T08:30:46.9121818Z selp.b16 %rs1179, %rs1119, %rs1087, %p246; 2026-02-21T08:30:46.9121877Z cvt.s16.s8 %rs1180, %rs1179; 2026-02-21T08:30:46.9121941Z shr.s16 %rs1181, %rs1180, 4; 2026-02-21T08:30:46.9122059Z selp.b16 %rs1182, %rs1120, %rs1088, %p246; 2026-02-21T08:30:46.9122120Z cvt.s16.s8 %rs1183, %rs1182; 2026-02-21T08:30:46.9122180Z shr.s16 %rs1184, %rs1183, 4; 2026-02-21T08:30:46.9122257Z selp.b16 %rs1185, %rs1121, %rs1089, %p246; 2026-02-21T08:30:46.9122317Z cvt.s16.s8 %rs1186, %rs1185; 2026-02-21T08:30:46.9122377Z shr.s16 %rs1187, %rs1186, 4; 2026-02-21T08:30:46.9122454Z selp.b16 %rs1188, %rs1122, %rs1090, %p246; 2026-02-21T08:30:46.9122516Z cvt.s16.s8 %rs1189, %rs1188; 2026-02-21T08:30:46.9122576Z shr.s16 %rs1190, %rs1189, 4; 2026-02-21T08:30:46.9122646Z selp.b16 %rs1191, %rs1123, %rs1091, %p246; 2026-02-21T08:30:46.9122720Z cvt.s16.s8 
%rs1192, %rs1191; 2026-02-21T08:30:46.9122779Z shr.s16 %rs1193, %rs1192, 4; 2026-02-21T08:30:46.9122845Z selp.b16 %rs1194, %rs1124, %rs1092, %p246; 2026-02-21T08:30:46.9122910Z cvt.s16.s8 %rs1195, %rs1194; 2026-02-21T08:30:46.9122967Z shr.s16 %rs1196, %rs1195, 4; 2026-02-21T08:30:46.9123034Z selp.b16 %rs1197, %rs1125, %rs1093, %p246; 2026-02-21T08:30:46.9123101Z cvt.s16.s8 %rs1198, %rs1197; 2026-02-21T08:30:46.9123160Z shr.s16 %rs1199, %rs1198, 4; 2026-02-21T08:30:46.9123226Z selp.b16 %rs1200, %rs1126, %rs1094, %p246; 2026-02-21T08:30:46.9123284Z cvt.s16.s8 %rs1201, %rs1200; 2026-02-21T08:30:46.9123348Z shr.s16 %rs1202, %rs1201, 4; 2026-02-21T08:30:46.9123415Z selp.b16 %rs1203, %rs1127, %rs1095, %p246; 2026-02-21T08:30:46.9123473Z cvt.s16.s8 %rs1204, %rs1203; 2026-02-21T08:30:46.9123537Z shr.s16 %rs1205, %rs1204, 4; 2026-02-21T08:30:46.9123605Z selp.b16 %rs1206, %rs1128, %rs1096, %p246; 2026-02-21T08:30:46.9123663Z cvt.s16.s8 %rs1207, %rs1206; 2026-02-21T08:30:46.9123720Z shr.s16 %rs1208, %rs1207, 4; 2026-02-21T08:30:46.9123796Z selp.b16 %rs1209, %rs1129, %rs1097, %p246; 2026-02-21T08:30:46.9123854Z cvt.s16.s8 %rs1210, %rs1209; 2026-02-21T08:30:46.9123912Z shr.s16 %rs1211, %rs1210, 4; 2026-02-21T08:30:46.9123987Z selp.b16 %rs1212, %rs1130, %rs1098, %p246; 2026-02-21T08:30:46.9124043Z cvt.s16.s8 %rs1213, %rs1212; 2026-02-21T08:30:46.9124104Z shr.s16 %rs1214, %rs1213, 4; 2026-02-21T08:30:46.9124176Z selp.b16 %rs1215, %rs1131, %rs1099, %p246; 2026-02-21T08:30:46.9124240Z cvt.s16.s8 %rs1216, %rs1215; 2026-02-21T08:30:46.9124297Z shr.s16 %rs1217, %rs1216, 4; 2026-02-21T08:30:46.9124364Z selp.b16 %rs1218, %rs1132, %rs1100, %p246; 2026-02-21T08:30:46.9124429Z cvt.s16.s8 %rs1219, %rs1218; 2026-02-21T08:30:46.9124487Z shr.s16 %rs1220, %rs1219, 4; 2026-02-21T08:30:46.9124554Z selp.b16 %rs1221, %rs1133, %rs1101, %p246; 2026-02-21T08:30:46.9124620Z cvt.s16.s8 %rs1222, %rs1221; 2026-02-21T08:30:46.9124678Z shr.s16 %rs1223, %rs1222, 4; 2026-02-21T08:30:46.9124745Z selp.b16 
%rs1224, %rs1134, %rs1102, %p246; 2026-02-21T08:30:46.9124802Z cvt.s16.s8 %rs1225, %rs1224; 2026-02-21T08:30:46.9124867Z shr.s16 %rs1226, %rs1225, 4; 2026-02-21T08:30:46.9124933Z selp.b16 %rs1227, %rs1135, %rs1103, %p246; 2026-02-21T08:30:46.9124992Z cvt.s16.s8 %rs1228, %rs1227; 2026-02-21T08:30:46.9125058Z shr.s16 %rs1229, %rs1228, 4; 2026-02-21T08:30:46.9125126Z selp.b16 %rs1230, %rs1136, %rs1104, %p246; 2026-02-21T08:30:46.9125230Z cvt.s16.s8 %rs1231, %rs1230; 2026-02-21T08:30:46.9125288Z shr.s16 %rs1232, %rs1231, 4; 2026-02-21T08:30:46.9125462Z .loc 1 77 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:77:32 2026-02-21T08:30:46.9125525Z cvt.rn.f32.s16 %r1672, %rs1139; 2026-02-21T08:30:46.9125587Z cvt.rn.f32.s16 %r1673, %rs1142; 2026-02-21T08:30:46.9125655Z cvt.rn.f32.s16 %r1674, %rs1145; 2026-02-21T08:30:46.9125715Z cvt.rn.f32.s16 %r1675, %rs1148; 2026-02-21T08:30:46.9125773Z cvt.rn.f32.s16 %r1676, %rs1151; 2026-02-21T08:30:46.9125837Z cvt.rn.f32.s16 %r1677, %rs1154; 2026-02-21T08:30:46.9125895Z cvt.rn.f32.s16 %r1678, %rs1157; 2026-02-21T08:30:46.9125952Z cvt.rn.f32.s16 %r1679, %rs1160; 2026-02-21T08:30:46.9126008Z cvt.rn.f32.s16 %r1680, %rs1163; 2026-02-21T08:30:46.9126074Z cvt.rn.f32.s16 %r1681, %rs1166; 2026-02-21T08:30:46.9126130Z cvt.rn.f32.s16 %r1682, %rs1169; 2026-02-21T08:30:46.9126231Z cvt.rn.f32.s16 %r1683, %rs1172; 2026-02-21T08:30:46.9126301Z cvt.rn.f32.s16 %r1684, %rs1175; 2026-02-21T08:30:46.9126359Z cvt.rn.f32.s16 %r1685, %rs1178; 2026-02-21T08:30:46.9126417Z cvt.rn.f32.s16 %r1686, %rs1181; 2026-02-21T08:30:46.9126475Z cvt.rn.f32.s16 %r1687, %rs1184; 2026-02-21T08:30:46.9126540Z cvt.rn.f32.s16 %r1688, %rs1187; 2026-02-21T08:30:46.9126599Z cvt.rn.f32.s16 %r1689, %rs1190; 2026-02-21T08:30:46.9126656Z cvt.rn.f32.s16 %r1690, %rs1193; 2026-02-21T08:30:46.9126722Z cvt.rn.f32.s16 %r1691, %rs1196; 2026-02-21T08:30:46.9126779Z cvt.rn.f32.s16 %r1692, %rs1199; 2026-02-21T08:30:46.9126836Z cvt.rn.f32.s16 %r1693, %rs1202; 
2026-02-21T08:30:46.9126903Z cvt.rn.f32.s16 %r1694, %rs1205; 2026-02-21T08:30:46.9126962Z cvt.rn.f32.s16 %r1695, %rs1208; 2026-02-21T08:30:46.9127022Z cvt.rn.f32.s16 %r1696, %rs1211; 2026-02-21T08:30:46.9127080Z cvt.rn.f32.s16 %r1697, %rs1214; 2026-02-21T08:30:46.9127146Z cvt.rn.f32.s16 %r1698, %rs1217; 2026-02-21T08:30:46.9127207Z cvt.rn.f32.s16 %r1699, %rs1220; 2026-02-21T08:30:46.9127270Z cvt.rn.f32.s16 %r1700, %rs1223; 2026-02-21T08:30:46.9127336Z cvt.rn.f32.s16 %r1701, %rs1226; 2026-02-21T08:30:46.9127395Z cvt.rn.f32.s16 %r1702, %rs1229; 2026-02-21T08:30:46.9127454Z cvt.rn.f32.s16 %r1703, %rs1232; 2026-02-21T08:30:46.9127515Z st.shared.b32 [%r51], %r1672; 2026-02-21T08:30:46.9127582Z st.shared.b32 [%r51+8], %r1673; 2026-02-21T08:30:46.9127648Z st.shared.b32 [%r51+16384], %r1688; 2026-02-21T08:30:46.9127711Z st.shared.b32 [%r51+16392], %r1689; 2026-02-21T08:30:46.9127782Z st.shared.b32 [%r52], %r1674; 2026-02-21T08:30:46.9127842Z st.shared.b32 [%r52+8], %r1675; 2026-02-21T08:30:46.9127905Z st.shared.b32 [%r52+16384], %r1690; 2026-02-21T08:30:46.9127967Z st.shared.b32 [%r52+16392], %r1691; 2026-02-21T08:30:46.9128035Z st.shared.b32 [%r53], %r1676; 2026-02-21T08:30:46.9128096Z st.shared.b32 [%r53+8], %r1677; 2026-02-21T08:30:46.9128159Z st.shared.b32 [%r53+16384], %r1692; 2026-02-21T08:30:46.9128232Z st.shared.b32 [%r53+16392], %r1693; 2026-02-21T08:30:46.9128295Z st.shared.b32 [%r54], %r1678; 2026-02-21T08:30:46.9128356Z st.shared.b32 [%r54+8], %r1679; 2026-02-21T08:30:46.9128424Z st.shared.b32 [%r54+16384], %r1694; 2026-02-21T08:30:46.9128487Z st.shared.b32 [%r54+16392], %r1695; 2026-02-21T08:30:46.9128548Z st.shared.b32 [%r55], %r1680; 2026-02-21T08:30:46.9128608Z st.shared.b32 [%r55+8], %r1681; 2026-02-21T08:30:46.9128681Z st.shared.b32 [%r55+16384], %r1696; 2026-02-21T08:30:46.9128742Z st.shared.b32 [%r55+16392], %r1697; 2026-02-21T08:30:46.9128803Z st.shared.b32 [%r56], %r1682; 2026-02-21T08:30:46.9128872Z st.shared.b32 [%r56+8], %r1683; 
2026-02-21T08:30:46.9128934Z st.shared.b32 [%r56+16384], %r1698; 2026-02-21T08:30:46.9128997Z st.shared.b32 [%r56+16392], %r1699; 2026-02-21T08:30:46.9129058Z st.shared.b32 [%r57], %r1684; 2026-02-21T08:30:46.9129126Z st.shared.b32 [%r57+8], %r1685; 2026-02-21T08:30:46.9129187Z st.shared.b32 [%r57+16384], %r1700; 2026-02-21T08:30:46.9129251Z st.shared.b32 [%r57+16392], %r1701; 2026-02-21T08:30:46.9129371Z st.shared.b32 [%r58], %r1686; 2026-02-21T08:30:46.9129433Z st.shared.b32 [%r58+8], %r1687; 2026-02-21T08:30:46.9129495Z st.shared.b32 [%r58+16384], %r1702; 2026-02-21T08:30:46.9129564Z st.shared.b32 [%r58+16392], %r1703; 2026-02-21T08:30:46.9129618Z $L__tmp171: 2026-02-21T08:30:46.9129842Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9129901Z // begin inline asm 2026-02-21T08:30:46.9129982Z fence.proxy.async.shared::cta; 2026-02-21T08:30:46.9130039Z // end inline asm 2026-02-21T08:30:46.9130093Z bar.sync 0; 2026-02-21T08:30:46.9130161Z @%p12 bra $L__BB0_23; 2026-02-21T08:30:46.9130260Z // %bb.22: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.9130326Z elect.sync %r1728|%p164, -1; 2026-02-21T08:30:46.9130386Z mov.b32 %r1706, 69208336; 2026-02-21T08:30:46.9130452Z mov.pred %p163, 0; 2026-02-21T08:30:46.9130553Z // begin inline asm 2026-02-21T08:30:46.9130721Z @%p164 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 0 ], %rd416, %r1706, %p163; 2026-02-21T08:30:46.9130784Z // end inline asm 2026-02-21T08:30:46.9130841Z // begin inline asm 2026-02-21T08:30:46.9130997Z @%p164 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 8 ], %rd417, %r1706, %p165; 2026-02-21T08:30:46.9131060Z // end inline asm 2026-02-21T08:30:46.9131116Z // begin inline asm 2026-02-21T08:30:46.9131270Z @%p164 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 16 ], %rd418, %r1706, %p165; 2026-02-21T08:30:46.9131334Z // end inline asm 2026-02-21T08:30:46.9131389Z // begin inline 
asm 2026-02-21T08:30:46.9131572Z @%p164 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 24 ], %rd419, %r1706, %p165; 2026-02-21T08:30:46.9131629Z // end inline asm 2026-02-21T08:30:46.9131695Z // begin inline asm 2026-02-21T08:30:46.9131845Z @%p164 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 32 ], %rd420, %r1706, %p165; 2026-02-21T08:30:46.9131904Z // end inline asm 2026-02-21T08:30:46.9131967Z // begin inline asm 2026-02-21T08:30:46.9132112Z @%p164 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 40 ], %rd421, %r1706, %p165; 2026-02-21T08:30:46.9132167Z // end inline asm 2026-02-21T08:30:46.9132229Z // begin inline asm 2026-02-21T08:30:46.9132372Z @%p164 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 48 ], %rd422, %r1706, %p165; 2026-02-21T08:30:46.9132427Z // end inline asm 2026-02-21T08:30:46.9132490Z // begin inline asm 2026-02-21T08:30:46.9132634Z @%p164 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 56 ], %rd423, %r1706, %p165; 2026-02-21T08:30:46.9132690Z // end inline asm 2026-02-21T08:30:46.9132750Z add.s32 %r1730, %r265, 69632; 2026-02-21T08:30:46.9132818Z cvt.u64.u32 %rd424, %r1730; 2026-02-21T08:30:46.9132874Z // begin inline asm 2026-02-21T08:30:46.9133003Z @%p164 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd424]; 2026-02-21T08:30:46.9133068Z // end inline asm 2026-02-21T08:30:46.9133121Z $L__tmp172: 2026-02-21T08:30:46.9133221Z $L__BB0_23: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:46.9133396Z .loc 1 0 0 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:0 2026-02-21T08:30:46.9133457Z or.b32 %r147, %r145, %r8; 2026-02-21T08:30:46.9133518Z or.b32 %r148, %r1654, %r12; 2026-02-21T08:30:46.9133578Z or.b32 %r149, %r1654, %r13; 2026-02-21T08:30:46.9133642Z or.b32 %r150, %r1654, %r14; 2026-02-21T08:30:46.9133700Z or.b32 %r151, %r1654, %r15; 2026-02-21T08:30:46.9133868Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 
2026-02-21T08:30:46.9133934Z add.s64 %rd425, %rd32, 384; 2026-02-21T08:30:46.9133990Z add.s64 %rd426, %rd33, 384; 2026-02-21T08:30:46.9134154Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9134282Z // begin inline asm 2026-02-21T08:30:46.9134403Z cp.async.cg.shared.global [ %r1731 + 0 ], [ %rd425 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9134459Z // end inline asm 2026-02-21T08:30:46.9134516Z // begin inline asm 2026-02-21T08:30:46.9134640Z cp.async.cg.shared.global [ %r1733 + 0 ], [ %rd426 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9134695Z // end inline asm 2026-02-21T08:30:46.9134758Z cp.async.commit_group; 2026-02-21T08:30:46.9134936Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.9134997Z add.s32 %r1740, %r146, %r59; 2026-02-21T08:30:46.9135163Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9135221Z cvt.s64.s32 %rd429, %r1740; 2026-02-21T08:30:46.9135292Z add.s64 %rd427, %rd48, %rd429; 2026-02-21T08:30:46.9135509Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9135569Z // begin inline asm 2026-02-21T08:30:46.9135691Z cp.async.cg.shared.global [ %r1735 + 0 ], [ %rd427 + 0 ], 0x10, %r1732; 2026-02-21T08:30:46.9135747Z // end inline asm 2026-02-21T08:30:46.9135809Z cp.async.commit_group; 2026-02-21T08:30:46.9135981Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9136041Z add.s32 %r2546, %r62, %r145; 2026-02-21T08:30:46.9136100Z shl.b32 %r1741, %r144, 16; 2026-02-21T08:30:46.9136158Z or.b32 %r1742, %r63, %r1741; 2026-02-21T08:30:46.9136234Z mad.wide.s32 %rd614, %r1742, 2, %rd9; 2026-02-21T08:30:46.9136292Z or.b32 %r2545, %r64, %r1741; 2026-02-21T08:30:46.9136350Z mov.b64 %rd615, -32; 2026-02-21T08:30:46.9136418Z mov.b32 %r2549, %r2547; 2026-02-21T08:30:46.9136477Z mov.b32 %r2551, %r2547; 
2026-02-21T08:30:46.9136536Z bra.uni $L__BB0_24; 2026-02-21T08:30:46.9136642Z $L__BB0_26: // in Loop: Header=BB0_24 Depth=2 2026-02-21T08:30:46.9136814Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9136876Z add.s64 %rd615, %rd615, 32; 2026-02-21T08:30:46.9136942Z setp.lt.u64 %p201, %rd615, 384; 2026-02-21T08:30:46.9137005Z $L__tmp173: 2026-02-21T08:30:46.9137226Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9137287Z add.s32 %r1848, %r2550, 1; 2026-02-21T08:30:46.9137361Z setp.gt.s32 %p202, %r1848, 1; 2026-02-21T08:30:46.9137426Z selp.b32 %r2550, 0, %r1848, %p202; 2026-02-21T08:30:46.9137491Z selp.b32 %r1849, 1, 0, %p202; 2026-02-21T08:30:46.9137559Z xor.b32 %r166, %r2551, %r1849; 2026-02-21T08:30:46.9137612Z $L__tmp174: 2026-02-21T08:30:46.9137780Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9137849Z mad.wide.s32 %rd440, %r2545, 2, %rd47; 2026-02-21T08:30:46.9138028Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9138090Z add.s32 %r1842, %r162, %r2555; 2026-02-21T08:30:46.9138151Z selp.b32 %r1843, 16, 0, %p201; 2026-02-21T08:30:46.9138213Z // begin inline asm 2026-02-21T08:30:46.9138326Z cp.async.cg.shared.global [ %r1842 + 0 ], [ %rd614 + 0 ], 0x10, %r1843; 2026-02-21T08:30:46.9138381Z // end inline asm 2026-02-21T08:30:46.9138439Z add.s32 %r1844, %r1842, 4096; 2026-02-21T08:30:46.9138502Z // begin inline asm 2026-02-21T08:30:46.9138617Z cp.async.cg.shared.global [ %r1844 + 0 ], [ %rd440 + 0 ], 0x10, %r1843; 2026-02-21T08:30:46.9138672Z // end inline asm 2026-02-21T08:30:46.9138743Z cp.async.commit_group; 2026-02-21T08:30:46.9138908Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9138967Z cvt.s64.s32 %rd442, %r2546; 2026-02-21T08:30:46.9139034Z add.s64 %rd441, %rd48, %rd442; 
2026-02-21T08:30:46.9139206Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9139308Z add.s32 %r1846, %r163, %r2555; 2026-02-21T08:30:46.9139364Z // begin inline asm 2026-02-21T08:30:46.9139485Z cp.async.cg.shared.global [ %r1846 + 0 ], [ %rd441 + 0 ], 0x10, %r1843; 2026-02-21T08:30:46.9139540Z // end inline asm 2026-02-21T08:30:46.9139602Z cp.async.commit_group; 2026-02-21T08:30:46.9139772Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9139834Z add.s32 %r2546, %r2546, 262144; 2026-02-21T08:30:46.9139893Z add.s64 %rd614, %rd614, 128; 2026-02-21T08:30:46.9139958Z add.s32 %r2545, %r2545, 64; 2026-02-21T08:30:46.9140021Z setp.lt.u64 %p203, %rd615, 448; 2026-02-21T08:30:46.9140079Z mov.b32 %r2547, %r2551; 2026-02-21T08:30:46.9140136Z mov.b32 %r2551, %r166; 2026-02-21T08:30:46.9140203Z @%p203 bra $L__BB0_24; 2026-02-21T08:30:46.9140259Z bra.uni $L__BB0_27; 2026-02-21T08:30:46.9140421Z $L__BB0_24: // Parent Loop BB0_3 Depth=1 2026-02-21T08:30:46.9140527Z // => This Inner Loop Header: Depth=2 2026-02-21T08:30:46.9140588Z add.s32 %r1763, %r2549, 1; 2026-02-21T08:30:46.9140651Z setp.gt.s32 %p183, %r1763, 2; 2026-02-21T08:30:46.9140715Z selp.b32 %r2549, 0, %r1763, %p183; 2026-02-21T08:30:46.9140889Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9140954Z cp.async.wait_group 4; 2026-02-21T08:30:46.9141008Z bar.sync 0; 2026-02-21T08:30:46.9141076Z shl.b32 %r1764, %r2549, 12; 2026-02-21T08:30:46.9141133Z shl.b32 %r1765, %r2549, 13; 2026-02-21T08:30:46.9141191Z add.s32 %r1767, %r265, %r1765; 2026-02-21T08:30:46.9141256Z add.s32 %r162, %r1767, 32768; 2026-02-21T08:30:46.9141315Z add.s32 %r1768, %r162, %r39; 2026-02-21T08:30:46.9141419Z ld.shared.v4.b32 {%r1769, %r1770, %r1771, %r1772}, [%r1768]; 2026-02-21T08:30:46.9141484Z mov.b32 {%rs1233, %rs1234}, %r1772; 2026-02-21T08:30:46.9141603Z mov.b32 {%rs1235, %rs1236}, %r1771; 
2026-02-21T08:30:46.9141668Z mov.b32 {%rs1237, %rs1238}, %r1770; 2026-02-21T08:30:46.9141728Z mov.b32 {%rs1239, %rs1240}, %r1769; 2026-02-21T08:30:46.9141837Z ld.shared.v4.b32 {%r1773, %r1774, %r1775, %r1776}, [%r1768+16]; 2026-02-21T08:30:46.9141898Z mov.b32 {%rs1241, %rs1242}, %r1776; 2026-02-21T08:30:46.9141956Z mov.b32 {%rs1243, %rs1244}, %r1775; 2026-02-21T08:30:46.9142022Z mov.b32 {%rs1245, %rs1246}, %r1774; 2026-02-21T08:30:46.9142081Z mov.b32 {%rs1247, %rs1248}, %r1773; 2026-02-21T08:30:46.9142249Z .loc 1 52 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:52:32 2026-02-21T08:30:46.9142310Z cvt.f32.bf16 %r1747, %rs1239; 2026-02-21T08:30:46.9142375Z cvt.f32.bf16 %r1748, %rs1240; 2026-02-21T08:30:46.9142433Z cvt.f32.bf16 %r1749, %rs1237; 2026-02-21T08:30:46.9142491Z cvt.f32.bf16 %r1750, %rs1238; 2026-02-21T08:30:46.9142557Z cvt.f32.bf16 %r1751, %rs1235; 2026-02-21T08:30:46.9142616Z cvt.f32.bf16 %r1752, %rs1236; 2026-02-21T08:30:46.9142673Z cvt.f32.bf16 %r1753, %rs1233; 2026-02-21T08:30:46.9142729Z cvt.f32.bf16 %r1754, %rs1234; 2026-02-21T08:30:46.9142793Z cvt.f32.bf16 %r1755, %rs1247; 2026-02-21T08:30:46.9142851Z cvt.f32.bf16 %r1756, %rs1248; 2026-02-21T08:30:46.9142907Z cvt.f32.bf16 %r1757, %rs1245; 2026-02-21T08:30:46.9142972Z cvt.f32.bf16 %r1758, %rs1246; 2026-02-21T08:30:46.9143030Z cvt.f32.bf16 %r1759, %rs1243; 2026-02-21T08:30:46.9143087Z cvt.f32.bf16 %r1760, %rs1244; 2026-02-21T08:30:46.9143144Z cvt.f32.bf16 %r1761, %rs1241; 2026-02-21T08:30:46.9143208Z cvt.f32.bf16 %r1762, %rs1242; 2026-02-21T08:30:46.9143261Z $L__tmp175: 2026-02-21T08:30:46.9143475Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9143542Z // begin inline asm 2026-02-21T08:30:46.9143593Z 2026-02-21T08:30:46.9143643Z { 2026-02-21T08:30:46.9143713Z .reg .pred complete; 2026-02-21T08:30:46.9143821Z waitLoop: 2026-02-21T08:30:46.9143943Z mbarrier.try_wait.parity.shared.b64 complete, [%r2548], %r2547; 
2026-02-21T08:30:46.9144009Z @!complete bra.uni waitLoop; 2026-02-21T08:30:46.9144067Z } 2026-02-21T08:30:46.9144071Z 2026-02-21T08:30:46.9144126Z // end inline asm 2026-02-21T08:30:46.9144188Z mov.pred %p184, -1; 2026-02-21T08:30:46.9144252Z // begin inline asm 2026-02-21T08:30:46.9144565Z @%p184 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1567 + 0], 16, {%r1747, %r1748, %r1749, %r1750, %r1751, %r1752, %r1753, %r1754, %r1755, %r1756, %r1757, %r1758, %r1759, %r1760, %r1761, %r1762}; 2026-02-21T08:30:46.9144621Z // end inline asm 2026-02-21T08:30:46.9144685Z // begin inline asm 2026-02-21T08:30:46.9144752Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.9144808Z // end inline asm 2026-02-21T08:30:46.9144862Z bar.sync 0; 2026-02-21T08:30:46.9144925Z $L__tmp176: 2026-02-21T08:30:46.9145139Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9145204Z add.s32 %r1777, %r265, %r1764; 2026-02-21T08:30:46.9145273Z add.s32 %r163, %r1777, 57344; 2026-02-21T08:30:46.9145334Z add.s32 %r1778, %r163, %r41; 2026-02-21T08:30:46.9145400Z ld.shared.b8 %rs1249, [%r1778]; 2026-02-21T08:30:46.9145466Z ld.shared.b8 %rs1250, [%r1778+128]; 2026-02-21T08:30:46.9145538Z ld.shared.b8 %rs1251, [%r1778+256]; 2026-02-21T08:30:46.9145599Z ld.shared.b8 %rs1252, [%r1778+384]; 2026-02-21T08:30:46.9145672Z ld.shared.b8 %rs1253, [%r1778+512]; 2026-02-21T08:30:46.9145743Z ld.shared.b8 %rs1254, [%r1778+640]; 2026-02-21T08:30:46.9145807Z ld.shared.b8 %rs1255, [%r1778+768]; 2026-02-21T08:30:46.9145868Z add.s32 %r1779, %r163, %r43; 2026-02-21T08:30:46.9145939Z ld.shared.b8 %rs1256, [%r1779]; 2026-02-21T08:30:46.9146007Z ld.shared.b8 %rs1257, [%r1778+1024]; 2026-02-21T08:30:46.9146073Z ld.shared.b8 %rs1258, [%r1778+1152]; 2026-02-21T08:30:46.9146140Z ld.shared.b8 %rs1259, [%r1778+1280]; 2026-02-21T08:30:46.9146216Z ld.shared.b8 %rs1260, [%r1778+1408]; 2026-02-21T08:30:46.9146279Z ld.shared.b8 %rs1261, [%r1778+1536]; 2026-02-21T08:30:46.9146342Z ld.shared.b8 
%rs1262, [%r1778+1664]; 2026-02-21T08:30:46.9146411Z ld.shared.b8 %rs1263, [%r1778+1792]; 2026-02-21T08:30:46.9146474Z add.s32 %r1780, %r163, %r45; 2026-02-21T08:30:46.9146537Z ld.shared.b8 %rs1264, [%r1780]; 2026-02-21T08:30:46.9146602Z ld.shared.b8 %rs1265, [%r1778+2048]; 2026-02-21T08:30:46.9146674Z ld.shared.b8 %rs1266, [%r1778+2176]; 2026-02-21T08:30:46.9146738Z ld.shared.b8 %rs1267, [%r1778+2304]; 2026-02-21T08:30:46.9146802Z ld.shared.b8 %rs1268, [%r1778+2432]; 2026-02-21T08:30:46.9146872Z ld.shared.b8 %rs1269, [%r1778+2560]; 2026-02-21T08:30:46.9146936Z ld.shared.b8 %rs1270, [%r1778+2688]; 2026-02-21T08:30:46.9146999Z ld.shared.b8 %rs1271, [%r1778+2816]; 2026-02-21T08:30:46.9147067Z add.s32 %r1781, %r163, %r47; 2026-02-21T08:30:46.9147131Z ld.shared.b8 %rs1272, [%r1781]; 2026-02-21T08:30:46.9147198Z ld.shared.b8 %rs1273, [%r1778+3072]; 2026-02-21T08:30:46.9147263Z ld.shared.b8 %rs1274, [%r1778+3200]; 2026-02-21T08:30:46.9147335Z ld.shared.b8 %rs1275, [%r1778+3328]; 2026-02-21T08:30:46.9147397Z ld.shared.b8 %rs1276, [%r1778+3456]; 2026-02-21T08:30:46.9147461Z ld.shared.b8 %rs1277, [%r1778+3584]; 2026-02-21T08:30:46.9147533Z ld.shared.b8 %rs1278, [%r1778+3712]; 2026-02-21T08:30:46.9147596Z ld.shared.b8 %rs1279, [%r1778+3840]; 2026-02-21T08:30:46.9147657Z add.s32 %r1782, %r163, %r49; 2026-02-21T08:30:46.9147721Z ld.shared.b8 %rs1280, [%r1782]; 2026-02-21T08:30:46.9147904Z .loc 1 57 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:57:28 2026-02-21T08:30:46.9147965Z shl.b16 %rs1281, %rs1249, 4; 2026-02-21T08:30:46.9148027Z shl.b16 %rs1282, %rs1250, 4; 2026-02-21T08:30:46.9148096Z shl.b16 %rs1283, %rs1251, 4; 2026-02-21T08:30:46.9148156Z shl.b16 %rs1284, %rs1252, 4; 2026-02-21T08:30:46.9148216Z shl.b16 %rs1285, %rs1253, 4; 2026-02-21T08:30:46.9148285Z shl.b16 %rs1286, %rs1254, 4; 2026-02-21T08:30:46.9148388Z shl.b16 %rs1287, %rs1255, 4; 2026-02-21T08:30:46.9148449Z shl.b16 %rs1288, %rs1256, 4; 2026-02-21T08:30:46.9148509Z shl.b16 %rs1289, %rs1257, 4; 
2026-02-21T08:30:46.9148575Z shl.b16 %rs1290, %rs1258, 4; 2026-02-21T08:30:46.9148635Z shl.b16 %rs1291, %rs1259, 4; 2026-02-21T08:30:46.9148695Z shl.b16 %rs1292, %rs1260, 4; 2026-02-21T08:30:46.9148761Z shl.b16 %rs1293, %rs1261, 4; 2026-02-21T08:30:46.9148820Z shl.b16 %rs1294, %rs1262, 4; 2026-02-21T08:30:46.9148881Z shl.b16 %rs1295, %rs1263, 4; 2026-02-21T08:30:46.9148940Z shl.b16 %rs1296, %rs1264, 4; 2026-02-21T08:30:46.9149008Z shl.b16 %rs1297, %rs1265, 4; 2026-02-21T08:30:46.9149066Z shl.b16 %rs1298, %rs1266, 4; 2026-02-21T08:30:46.9149126Z shl.b16 %rs1299, %rs1267, 4; 2026-02-21T08:30:46.9149190Z shl.b16 %rs1300, %rs1268, 4; 2026-02-21T08:30:46.9149249Z shl.b16 %rs1301, %rs1269, 4; 2026-02-21T08:30:46.9149306Z shl.b16 %rs1302, %rs1270, 4; 2026-02-21T08:30:46.9149422Z shl.b16 %rs1303, %rs1271, 4; 2026-02-21T08:30:46.9149492Z shl.b16 %rs1304, %rs1272, 4; 2026-02-21T08:30:46.9149551Z shl.b16 %rs1305, %rs1273, 4; 2026-02-21T08:30:46.9149610Z shl.b16 %rs1306, %rs1274, 4; 2026-02-21T08:30:46.9149676Z shl.b16 %rs1307, %rs1275, 4; 2026-02-21T08:30:46.9149734Z shl.b16 %rs1308, %rs1276, 4; 2026-02-21T08:30:46.9149794Z shl.b16 %rs1309, %rs1277, 4; 2026-02-21T08:30:46.9149861Z shl.b16 %rs1310, %rs1278, 4; 2026-02-21T08:30:46.9149920Z shl.b16 %rs1311, %rs1279, 4; 2026-02-21T08:30:46.9149979Z shl.b16 %rs1312, %rs1280, 4; 2026-02-21T08:30:46.9150154Z .loc 1 72 58 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:72:58 2026-02-21T08:30:46.9150239Z selp.b16 %rs1313, %rs1281, %rs1249, %p246; 2026-02-21T08:30:46.9150300Z cvt.s16.s8 %rs1314, %rs1313; 2026-02-21T08:30:46.9150360Z shr.s16 %rs1315, %rs1314, 4; 2026-02-21T08:30:46.9150443Z selp.b16 %rs1316, %rs1282, %rs1250, %p246; 2026-02-21T08:30:46.9150507Z cvt.s16.s8 %rs1317, %rs1316; 2026-02-21T08:30:46.9150570Z shr.s16 %rs1318, %rs1317, 4; 2026-02-21T08:30:46.9150643Z selp.b16 %rs1319, %rs1283, %rs1251, %p246; 2026-02-21T08:30:46.9150713Z cvt.s16.s8 %rs1320, %rs1319; 2026-02-21T08:30:46.9150772Z shr.s16 %rs1321, 
%rs1320, 4; 2026-02-21T08:30:46.9150844Z selp.b16 %rs1322, %rs1284, %rs1252, %p246; 2026-02-21T08:30:46.9150912Z cvt.s16.s8 %rs1323, %rs1322; 2026-02-21T08:30:46.9150973Z shr.s16 %rs1324, %rs1323, 4; 2026-02-21T08:30:46.9151044Z selp.b16 %rs1325, %rs1285, %rs1253, %p246; 2026-02-21T08:30:46.9151113Z cvt.s16.s8 %rs1326, %rs1325; 2026-02-21T08:30:46.9151174Z shr.s16 %rs1327, %rs1326, 4; 2026-02-21T08:30:46.9151246Z selp.b16 %rs1328, %rs1286, %rs1254, %p246; 2026-02-21T08:30:46.9151309Z cvt.s16.s8 %rs1329, %rs1328; 2026-02-21T08:30:46.9151378Z shr.s16 %rs1330, %rs1329, 4; 2026-02-21T08:30:46.9151450Z selp.b16 %rs1331, %rs1287, %rs1255, %p246; 2026-02-21T08:30:46.9151512Z cvt.s16.s8 %rs1332, %rs1331; 2026-02-21T08:30:46.9151627Z shr.s16 %rs1333, %rs1332, 4; 2026-02-21T08:30:46.9151702Z selp.b16 %rs1334, %rs1288, %rs1256, %p246; 2026-02-21T08:30:46.9151763Z cvt.s16.s8 %rs1335, %rs1334; 2026-02-21T08:30:46.9151826Z shr.s16 %rs1336, %rs1335, 4; 2026-02-21T08:30:46.9151909Z selp.b16 %rs1337, %rs1289, %rs1257, %p246; 2026-02-21T08:30:46.9151971Z cvt.s16.s8 %rs1338, %rs1337; 2026-02-21T08:30:46.9152032Z shr.s16 %rs1339, %rs1338, 4; 2026-02-21T08:30:46.9152115Z selp.b16 %rs1340, %rs1290, %rs1258, %p246; 2026-02-21T08:30:46.9152177Z cvt.s16.s8 %rs1341, %rs1340; 2026-02-21T08:30:46.9152239Z shr.s16 %rs1342, %rs1341, 4; 2026-02-21T08:30:46.9152311Z selp.b16 %rs1343, %rs1291, %rs1259, %p246; 2026-02-21T08:30:46.9152380Z cvt.s16.s8 %rs1344, %rs1343; 2026-02-21T08:30:46.9152441Z shr.s16 %rs1345, %rs1344, 4; 2026-02-21T08:30:46.9152513Z selp.b16 %rs1346, %rs1292, %rs1260, %p246; 2026-02-21T08:30:46.9152582Z cvt.s16.s8 %rs1347, %rs1346; 2026-02-21T08:30:46.9152642Z shr.s16 %rs1348, %rs1347, 4; 2026-02-21T08:30:46.9152714Z selp.b16 %rs1349, %rs1293, %rs1261, %p246; 2026-02-21T08:30:46.9152845Z cvt.s16.s8 %rs1350, %rs1349; 2026-02-21T08:30:46.9152908Z shr.s16 %rs1351, %rs1350, 4; 2026-02-21T08:30:46.9152980Z selp.b16 %rs1352, %rs1294, %rs1262, %p246; 2026-02-21T08:30:46.9153042Z 
cvt.s16.s8 %rs1353, %rs1352; 2026-02-21T08:30:46.9153109Z shr.s16 %rs1354, %rs1353, 4; 2026-02-21T08:30:46.9153181Z selp.b16 %rs1355, %rs1295, %rs1263, %p246; 2026-02-21T08:30:46.9153242Z cvt.s16.s8 %rs1356, %rs1355; 2026-02-21T08:30:46.9153309Z shr.s16 %rs1357, %rs1356, 4; 2026-02-21T08:30:46.9153382Z selp.b16 %rs1358, %rs1296, %rs1264, %p246; 2026-02-21T08:30:46.9153443Z cvt.s16.s8 %rs1359, %rs1358; 2026-02-21T08:30:46.9153503Z shr.s16 %rs1360, %rs1359, 4; 2026-02-21T08:30:46.9153580Z selp.b16 %rs1361, %rs1297, %rs1265, %p246; 2026-02-21T08:30:46.9153640Z cvt.s16.s8 %rs1362, %rs1361; 2026-02-21T08:30:46.9153700Z shr.s16 %rs1363, %rs1362, 4; 2026-02-21T08:30:46.9153778Z selp.b16 %rs1364, %rs1298, %rs1266, %p246; 2026-02-21T08:30:46.9153888Z cvt.s16.s8 %rs1365, %rs1364; 2026-02-21T08:30:46.9153953Z shr.s16 %rs1366, %rs1365, 4; 2026-02-21T08:30:46.9154033Z selp.b16 %rs1367, %rs1299, %rs1267, %p246; 2026-02-21T08:30:46.9154096Z cvt.s16.s8 %rs1368, %rs1367; 2026-02-21T08:30:46.9154159Z shr.s16 %rs1369, %rs1368, 4; 2026-02-21T08:30:46.9154232Z selp.b16 %rs1370, %rs1300, %rs1268, %p246; 2026-02-21T08:30:46.9154310Z cvt.s16.s8 %rs1371, %rs1370; 2026-02-21T08:30:46.9154366Z shr.s16 %rs1372, %rs1371, 4; 2026-02-21T08:30:46.9154435Z selp.b16 %rs1373, %rs1301, %rs1269, %p246; 2026-02-21T08:30:46.9154499Z cvt.s16.s8 %rs1374, %rs1373; 2026-02-21T08:30:46.9154556Z shr.s16 %rs1375, %rs1374, 4; 2026-02-21T08:30:46.9154622Z selp.b16 %rs1376, %rs1302, %rs1270, %p246; 2026-02-21T08:30:46.9154681Z cvt.s16.s8 %rs1377, %rs1376; 2026-02-21T08:30:46.9154746Z shr.s16 %rs1378, %rs1377, 4; 2026-02-21T08:30:46.9154815Z selp.b16 %rs1379, %rs1303, %rs1271, %p246; 2026-02-21T08:30:46.9154874Z cvt.s16.s8 %rs1380, %rs1379; 2026-02-21T08:30:46.9154941Z shr.s16 %rs1381, %rs1380, 4; 2026-02-21T08:30:46.9155010Z selp.b16 %rs1382, %rs1304, %rs1272, %p246; 2026-02-21T08:30:46.9155067Z cvt.s16.s8 %rs1383, %rs1382; 2026-02-21T08:30:46.9155125Z shr.s16 %rs1384, %rs1383, 4; 2026-02-21T08:30:46.9155199Z 
selp.b16 %rs1385, %rs1305, %rs1273, %p246; 2026-02-21T08:30:46.9155256Z cvt.s16.s8 %rs1386, %rs1385; 2026-02-21T08:30:46.9155313Z shr.s16 %rs1387, %rs1386, 4; 2026-02-21T08:30:46.9155385Z selp.b16 %rs1388, %rs1306, %rs1274, %p246; 2026-02-21T08:30:46.9155443Z cvt.s16.s8 %rs1389, %rs1388; 2026-02-21T08:30:46.9155499Z shr.s16 %rs1390, %rs1389, 4; 2026-02-21T08:30:46.9155572Z selp.b16 %rs1391, %rs1307, %rs1275, %p246; 2026-02-21T08:30:46.9155628Z cvt.s16.s8 %rs1392, %rs1391; 2026-02-21T08:30:46.9155684Z shr.s16 %rs1393, %rs1392, 4; 2026-02-21T08:30:46.9155751Z selp.b16 %rs1394, %rs1308, %rs1276, %p246; 2026-02-21T08:30:46.9155815Z cvt.s16.s8 %rs1395, %rs1394; 2026-02-21T08:30:46.9155872Z shr.s16 %rs1396, %rs1395, 4; 2026-02-21T08:30:46.9155941Z selp.b16 %rs1397, %rs1309, %rs1277, %p246; 2026-02-21T08:30:46.9156007Z cvt.s16.s8 %rs1398, %rs1397; 2026-02-21T08:30:46.9156064Z shr.s16 %rs1399, %rs1398, 4; 2026-02-21T08:30:46.9156129Z selp.b16 %rs1400, %rs1310, %rs1278, %p246; 2026-02-21T08:30:46.9156187Z cvt.s16.s8 %rs1401, %rs1400; 2026-02-21T08:30:46.9156251Z shr.s16 %rs1402, %rs1401, 4; 2026-02-21T08:30:46.9156318Z selp.b16 %rs1403, %rs1311, %rs1279, %p246; 2026-02-21T08:30:46.9156374Z cvt.s16.s8 %rs1404, %rs1403; 2026-02-21T08:30:46.9156438Z shr.s16 %rs1405, %rs1404, 4; 2026-02-21T08:30:46.9156506Z selp.b16 %rs1406, %rs1312, %rs1280, %p246; 2026-02-21T08:30:46.9156563Z cvt.s16.s8 %rs1407, %rs1406; 2026-02-21T08:30:46.9156628Z shr.s16 %rs1408, %rs1407, 4; 2026-02-21T08:30:46.9156802Z .loc 1 77 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:77:32 2026-02-21T08:30:46.9156867Z cvt.rn.f32.s16 %r1783, %rs1315; 2026-02-21T08:30:46.9156932Z cvt.rn.f32.s16 %r1784, %rs1318; 2026-02-21T08:30:46.9157044Z cvt.rn.f32.s16 %r1785, %rs1321; 2026-02-21T08:30:46.9157104Z cvt.rn.f32.s16 %r1786, %rs1324; 2026-02-21T08:30:46.9157163Z cvt.rn.f32.s16 %r1787, %rs1327; 2026-02-21T08:30:46.9157229Z cvt.rn.f32.s16 %r1788, %rs1330; 2026-02-21T08:30:46.9157288Z cvt.rn.f32.s16 %r1789, 
%rs1333; 2026-02-21T08:30:46.9157346Z cvt.rn.f32.s16 %r1790, %rs1336; 2026-02-21T08:30:46.9157405Z cvt.rn.f32.s16 %r1791, %rs1339; 2026-02-21T08:30:46.9157471Z cvt.rn.f32.s16 %r1792, %rs1342; 2026-02-21T08:30:46.9157529Z cvt.rn.f32.s16 %r1793, %rs1345; 2026-02-21T08:30:46.9157590Z cvt.rn.f32.s16 %r1794, %rs1348; 2026-02-21T08:30:46.9157655Z cvt.rn.f32.s16 %r1795, %rs1351; 2026-02-21T08:30:46.9157713Z cvt.rn.f32.s16 %r1796, %rs1354; 2026-02-21T08:30:46.9157770Z cvt.rn.f32.s16 %r1797, %rs1357; 2026-02-21T08:30:46.9157837Z cvt.rn.f32.s16 %r1798, %rs1360; 2026-02-21T08:30:46.9157895Z cvt.rn.f32.s16 %r1799, %rs1363; 2026-02-21T08:30:46.9157955Z cvt.rn.f32.s16 %r1800, %rs1366; 2026-02-21T08:30:46.9158052Z cvt.rn.f32.s16 %r1801, %rs1369; 2026-02-21T08:30:46.9158126Z cvt.rn.f32.s16 %r1802, %rs1372; 2026-02-21T08:30:46.9158187Z cvt.rn.f32.s16 %r1803, %rs1375; 2026-02-21T08:30:46.9158246Z cvt.rn.f32.s16 %r1804, %rs1378; 2026-02-21T08:30:46.9158312Z cvt.rn.f32.s16 %r1805, %rs1381; 2026-02-21T08:30:46.9158372Z cvt.rn.f32.s16 %r1806, %rs1384; 2026-02-21T08:30:46.9158431Z cvt.rn.f32.s16 %r1807, %rs1387; 2026-02-21T08:30:46.9158492Z cvt.rn.f32.s16 %r1808, %rs1390; 2026-02-21T08:30:46.9158561Z cvt.rn.f32.s16 %r1809, %rs1393; 2026-02-21T08:30:46.9158622Z cvt.rn.f32.s16 %r1810, %rs1396; 2026-02-21T08:30:46.9158684Z cvt.rn.f32.s16 %r1811, %rs1399; 2026-02-21T08:30:46.9158754Z cvt.rn.f32.s16 %r1812, %rs1402; 2026-02-21T08:30:46.9158816Z cvt.rn.f32.s16 %r1813, %rs1405; 2026-02-21T08:30:46.9158878Z cvt.rn.f32.s16 %r1814, %rs1408; 2026-02-21T08:30:46.9158941Z st.shared.b32 [%r51], %r1783; 2026-02-21T08:30:46.9159011Z st.shared.b32 [%r51+8], %r1784; 2026-02-21T08:30:46.9159078Z st.shared.b32 [%r51+16384], %r1799; 2026-02-21T08:30:46.9159145Z st.shared.b32 [%r51+16392], %r1800; 2026-02-21T08:30:46.9159213Z st.shared.b32 [%r52], %r1785; 2026-02-21T08:30:46.9159274Z st.shared.b32 [%r52+8], %r1786; 2026-02-21T08:30:46.9159336Z st.shared.b32 [%r52+16384], %r1801; 2026-02-21T08:30:46.9159405Z 
st.shared.b32 [%r52+16392], %r1802; 2026-02-21T08:30:46.9159467Z st.shared.b32 [%r53], %r1787; 2026-02-21T08:30:46.9159527Z st.shared.b32 [%r53+8], %r1788; 2026-02-21T08:30:46.9159588Z st.shared.b32 [%r53+16384], %r1803; 2026-02-21T08:30:46.9159657Z st.shared.b32 [%r53+16392], %r1804; 2026-02-21T08:30:46.9159717Z st.shared.b32 [%r54], %r1789; 2026-02-21T08:30:46.9159778Z st.shared.b32 [%r54+8], %r1790; 2026-02-21T08:30:46.9159847Z st.shared.b32 [%r54+16384], %r1805; 2026-02-21T08:30:46.9159908Z st.shared.b32 [%r54+16392], %r1806; 2026-02-21T08:30:46.9159969Z st.shared.b32 [%r55], %r1791; 2026-02-21T08:30:46.9160029Z st.shared.b32 [%r55+8], %r1792; 2026-02-21T08:30:46.9160100Z st.shared.b32 [%r55+16384], %r1807; 2026-02-21T08:30:46.9160165Z st.shared.b32 [%r55+16392], %r1808; 2026-02-21T08:30:46.9160224Z st.shared.b32 [%r56], %r1793; 2026-02-21T08:30:46.9160292Z st.shared.b32 [%r56+8], %r1794; 2026-02-21T08:30:46.9160354Z st.shared.b32 [%r56+16384], %r1809; 2026-02-21T08:30:46.9160416Z st.shared.b32 [%r56+16392], %r1810; 2026-02-21T08:30:46.9160483Z st.shared.b32 [%r57], %r1795; 2026-02-21T08:30:46.9160544Z st.shared.b32 [%r57+8], %r1796; 2026-02-21T08:30:46.9160605Z st.shared.b32 [%r57+16384], %r1811; 2026-02-21T08:30:46.9160665Z st.shared.b32 [%r57+16392], %r1812; 2026-02-21T08:30:46.9160732Z st.shared.b32 [%r58], %r1797; 2026-02-21T08:30:46.9160792Z st.shared.b32 [%r58+8], %r1798; 2026-02-21T08:30:46.9160856Z st.shared.b32 [%r58+16384], %r1813; 2026-02-21T08:30:46.9160925Z st.shared.b32 [%r58+16392], %r1814; 2026-02-21T08:30:46.9161096Z .loc 1 40 92 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:40:92 2026-02-21T08:30:46.9161159Z shl.b32 %r1815, %r2550, 3; 2026-02-21T08:30:46.9161262Z add.s32 %r1816, %r265, %r1815; 2026-02-21T08:30:46.9161332Z add.s32 %r2548, %r1816, 69632; 2026-02-21T08:30:46.9161384Z $L__tmp177: 2026-02-21T08:30:46.9161634Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 
2026-02-21T08:30:46.9161703Z // begin inline asm 2026-02-21T08:30:46.9161776Z fence.proxy.async.shared::cta; 2026-02-21T08:30:46.9161831Z // end inline asm 2026-02-21T08:30:46.9161890Z bar.sync 0; 2026-02-21T08:30:46.9161949Z @%p12 bra $L__BB0_26; 2026-02-21T08:30:46.9162049Z // %bb.25: // in Loop: Header=BB0_24 Depth=2 2026-02-21T08:30:46.9162114Z elect.sync %r1841|%p185, -1; 2026-02-21T08:30:46.9162180Z mov.b32 %r1819, 69208336; 2026-02-21T08:30:46.9162237Z // begin inline asm 2026-02-21T08:30:46.9162448Z @%p185 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 0 ], %rd416, %r1819, %p184; 2026-02-21T08:30:46.9162518Z // end inline asm 2026-02-21T08:30:46.9162575Z // begin inline asm 2026-02-21T08:30:46.9162732Z @%p185 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 8 ], %rd417, %r1819, %p184; 2026-02-21T08:30:46.9162793Z // end inline asm 2026-02-21T08:30:46.9162849Z // begin inline asm 2026-02-21T08:30:46.9163001Z @%p185 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 16 ], %rd418, %r1819, %p184; 2026-02-21T08:30:46.9163055Z // end inline asm 2026-02-21T08:30:46.9163118Z // begin inline asm 2026-02-21T08:30:46.9163269Z @%p185 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 24 ], %rd419, %r1819, %p184; 2026-02-21T08:30:46.9163323Z // end inline asm 2026-02-21T08:30:46.9163385Z // begin inline asm 2026-02-21T08:30:46.9163534Z @%p185 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 32 ], %rd420, %r1819, %p184; 2026-02-21T08:30:46.9163589Z // end inline asm 2026-02-21T08:30:46.9163651Z // begin inline asm 2026-02-21T08:30:46.9163797Z @%p185 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 40 ], %rd421, %r1819, %p184; 2026-02-21T08:30:46.9163854Z // end inline asm 2026-02-21T08:30:46.9163916Z // begin inline asm 2026-02-21T08:30:46.9164064Z @%p185 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 48 ], %rd422, %r1819, %p184; 2026-02-21T08:30:46.9164118Z // end inline 
asm 2026-02-21T08:30:46.9164174Z // begin inline asm 2026-02-21T08:30:46.9164328Z @%p185 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 56 ], %rd423, %r1819, %p184; 2026-02-21T08:30:46.9164384Z // end inline asm 2026-02-21T08:30:46.9164446Z cvt.u64.u32 %rd438, %r2548; 2026-02-21T08:30:46.9164511Z // begin inline asm 2026-02-21T08:30:46.9164639Z @%p185 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd438]; 2026-02-21T08:30:46.9164694Z // end inline asm 2026-02-21T08:30:46.9164759Z bra.uni $L__BB0_26; 2026-02-21T08:30:46.9164812Z $L__tmp178: 2026-02-21T08:30:46.9164912Z $L__BB0_1: // %.._crit_edge_crit_edge 2026-02-21T08:30:46.9165087Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9165157Z and.b32 %r2555, %r2509, 4080; 2026-02-21T08:30:46.9165217Z or.b32 %r2554, %r2555, 4096; 2026-02-21T08:30:46.9165300Z $L__BB0_28: // %._crit_edge 2026-02-21T08:30:46.9165488Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9165549Z sub.s32 %r2027, 4096, %r2580; 2026-02-21T08:30:46.9165620Z mul.hi.s32 %r2028, %r2027, -580400985; 2026-02-21T08:30:46.9165688Z add.s32 %r2029, %r2028, %r2027; 2026-02-21T08:30:46.9165748Z shr.u32 %r2030, %r2029, 31; 2026-02-21T08:30:46.9165806Z shr.s32 %r2031, %r2029, 11; 2026-02-21T08:30:46.9165866Z add.s32 %r174, %r2031, %r2030; 2026-02-21T08:30:46.9165937Z mul.lo.s32 %r2032, %r174, 2368; 2026-02-21T08:30:46.9166002Z setp.ne.b32 %p209, %r2027, %r2032; 2026-02-21T08:30:46.9166068Z setp.gt.s32 %p210, %r2027, -1; 2026-02-21T08:30:46.9166194Z and.pred %p211, %p210, %p209; 2026-02-21T08:30:46.9166256Z selp.b32 %r175, 1, 0, %p211; 2026-02-21T08:30:46.9166318Z add.s32 %r176, %r174, %r175; 2026-02-21T08:30:46.9166373Z $L__tmp179: 2026-02-21T08:30:46.9166600Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9166677Z shfl.sync.idx.b32 %r177, %r5, 0, 31, -1; 
2026-02-21T08:30:46.9166744Z shl.b32 %r2033, %r177, 21; 2026-02-21T08:30:46.9166812Z and.b32 %r178, %r2033, 6291456; 2026-02-21T08:30:46.9166871Z add.s32 %r2034, %r178, %r2508; 2026-02-21T08:30:46.9166931Z and.b32 %r179, %r177, 4; 2026-02-21T08:30:46.9166997Z shl.b32 %r2035, %r179, 4; 2026-02-21T08:30:46.9167057Z add.s32 %r1981, %r2034, %r2035; 2026-02-21T08:30:46.9167117Z mov.pred %p207, -1; 2026-02-21T08:30:46.9167172Z mov.b32 %r2561, 0; 2026-02-21T08:30:46.9167235Z // begin inline asm 2026-02-21T08:30:46.9167600Z @%p207 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1981 + 0], 32, {%r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561}; 2026-02-21T08:30:46.9167662Z // end inline asm 2026-02-21T08:30:46.9167726Z // begin inline asm 2026-02-21T08:30:46.9168042Z @%p207 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1981 + 16], 32, {%r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561, %r2561}; 2026-02-21T08:30:46.9168097Z // end inline asm 2026-02-21T08:30:46.9168159Z // begin inline asm 2026-02-21T08:30:46.9168226Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.9168280Z // end inline asm 2026-02-21T08:30:46.9168335Z bar.sync 0; 2026-02-21T08:30:46.9168395Z $L__tmp180: 2026-02-21T08:30:46.9168574Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9168636Z setp.lt.s32 %p212, %r176, 1; 2026-02-21T08:30:46.9168711Z setp.gt.s32 %p213, %r176, 0; 2026-02-21T08:30:46.9168881Z .loc 1 25 35 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:25:35 2026-02-21T08:30:46.9168942Z shr.s32 %r2036, %r2580, 31; 2026-02-21T08:30:46.9169006Z shr.u32 %r2037, %r2036, 23; 2026-02-21T08:30:46.9169066Z add.s32 %r2038, %r2580, %r2037; 2026-02-21T08:30:46.9169125Z shr.s32 %r2039, %r2038, 9; 2026-02-21T08:30:46.9169293Z .loc 1 26 33 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:26:33 
2026-02-21T08:30:46.9169359Z shl.b32 %r181, %r2039, 3; 2026-02-21T08:30:46.9169520Z .loc 1 27 39 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:39 2026-02-21T08:30:46.9169580Z sub.s32 %r2040, 64, %r181; 2026-02-21T08:30:46.9169752Z .loc 1 27 52 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:52 2026-02-21T08:30:46.9169810Z min.s32 %r183, %r2040, 8; 2026-02-21T08:30:46.9169978Z .loc 1 28 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:45 2026-02-21T08:30:46.9170049Z and.b32 %r2041, %r2038, -512; 2026-02-21T08:30:46.9170108Z sub.s32 %r182, %r2580, %r2041; 2026-02-21T08:30:46.9170271Z .loc 1 29 51 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:29:51 2026-02-21T08:30:46.9170338Z div.s32 %r184, %r182, %r183; 2026-02-21T08:30:46.9170502Z .loc 1 32 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:32:27 2026-02-21T08:30:46.9170560Z shl.b32 %r2556, %r184, 6; 2026-02-21T08:30:46.9170721Z .loc 1 33 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:33:32 2026-02-21T08:30:46.9170788Z or.b32 %r2581, %r2556, %r10; 2026-02-21T08:30:46.9170847Z or.b32 %r2582, %r2556, %r11; 2026-02-21T08:30:46.9171006Z .loc 1 48 53 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:53 2026-02-21T08:30:46.9171072Z shl.b32 %r2042, %r2581, 10; 2026-02-21T08:30:46.9171172Z shl.b32 %r2043, %r2582, 10; 2026-02-21T08:30:46.9171340Z .loc 1 48 60 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:60 2026-02-21T08:30:46.9171406Z or.b32 %r2044, %r2042, %r16; 2026-02-21T08:30:46.9171464Z or.b32 %r2045, %r2043, %r16; 2026-02-21T08:30:46.9171671Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9171743Z mad.wide.s32 %rd511, %r2044, 2, %rd47; 2026-02-21T08:30:46.9171817Z mad.wide.s32 %rd512, %r2045, 2, %rd47; 2026-02-21T08:30:46.9171984Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9172044Z add.s32 %r2047, 
%r265, 32768; 2026-02-21T08:30:46.9172113Z add.s32 %r2015, %r2047, %r2555; 2026-02-21T08:30:46.9172175Z selp.b32 %r2016, 16, 0, %p213; 2026-02-21T08:30:46.9172231Z // begin inline asm 2026-02-21T08:30:46.9172414Z cp.async.cg.shared.global [ %r2015 + 0 ], [ %rd511 + 0 ], 0x10, %r2016; 2026-02-21T08:30:46.9172473Z // end inline asm 2026-02-21T08:30:46.9172532Z add.s32 %r2017, %r2047, %r2554; 2026-02-21T08:30:46.9172589Z // begin inline asm 2026-02-21T08:30:46.9172715Z cp.async.cg.shared.global [ %r2017 + 0 ], [ %rd512 + 0 ], 0x10, %r2016; 2026-02-21T08:30:46.9172772Z // end inline asm 2026-02-21T08:30:46.9172836Z cp.async.commit_group; 2026-02-21T08:30:46.9173012Z .loc 1 48 60 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:60 2026-02-21T08:30:46.9173072Z or.b32 %r2048, %r2042, %r2553; 2026-02-21T08:30:46.9173130Z or.b32 %r2049, %r2043, %r2553; 2026-02-21T08:30:46.9173303Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9173371Z mad.wide.s32 %rd513, %r2048, 2, %rd47; 2026-02-21T08:30:46.9173436Z mad.wide.s32 %rd514, %r2049, 2, %rd47; 2026-02-21T08:30:46.9173602Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9173669Z bar.sync 0; 2026-02-21T08:30:46.9173729Z add.s32 %r2050, %r265, 40960; 2026-02-21T08:30:46.9173788Z add.s32 %r2019, %r2050, %r2555; 2026-02-21T08:30:46.9173850Z // begin inline asm 2026-02-21T08:30:46.9173968Z cp.async.cg.shared.global [ %r2019 + 0 ], [ %rd513 + 0 ], 0x10, %r2016; 2026-02-21T08:30:46.9174023Z // end inline asm 2026-02-21T08:30:46.9174081Z add.s32 %r2021, %r2050, %r2554; 2026-02-21T08:30:46.9174145Z // begin inline asm 2026-02-21T08:30:46.9174259Z cp.async.cg.shared.global [ %r2021 + 0 ], [ %rd514 + 0 ], 0x10, %r2016; 2026-02-21T08:30:46.9174315Z // end inline asm 2026-02-21T08:30:46.9174385Z cp.async.commit_group; 2026-02-21T08:30:46.9174551Z .loc 1 48 60 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:60 
2026-02-21T08:30:46.9174610Z or.b32 %r2051, %r2042, %r2552; 2026-02-21T08:30:46.9174676Z or.b32 %r2052, %r2043, %r2552; 2026-02-21T08:30:46.9174841Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9174910Z mad.wide.s32 %rd515, %r2051, 2, %rd47; 2026-02-21T08:30:46.9174975Z mad.wide.s32 %rd516, %r2052, 2, %rd47; 2026-02-21T08:30:46.9175149Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9175205Z bar.sync 0; 2026-02-21T08:30:46.9175264Z add.s32 %r2053, %r265, 49152; 2026-02-21T08:30:46.9175332Z add.s32 %r2023, %r2053, %r2555; 2026-02-21T08:30:46.9175391Z // begin inline asm 2026-02-21T08:30:46.9175508Z cp.async.cg.shared.global [ %r2023 + 0 ], [ %rd515 + 0 ], 0x10, %r2016; 2026-02-21T08:30:46.9175572Z // end inline asm 2026-02-21T08:30:46.9175632Z add.s32 %r2025, %r2053, %r2554; 2026-02-21T08:30:46.9175690Z // begin inline asm 2026-02-21T08:30:46.9175807Z cp.async.cg.shared.global [ %r2025 + 0 ], [ %rd516 + 0 ], 0x10, %r2016; 2026-02-21T08:30:46.9175876Z // end inline asm 2026-02-21T08:30:46.9175941Z cp.async.commit_group; 2026-02-21T08:30:46.9176164Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9176235Z @%p212 bra $L__BB0_37; 2026-02-21T08:30:46.9176317Z // %bb.29: // %.lr.ph487 2026-02-21T08:30:46.9176376Z shl.b32 %r2060, %r176, 4; 2026-02-21T08:30:46.9176548Z .loc 1 28 64 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:64 2026-02-21T08:30:46.9176612Z mul.lo.s32 %r2061, %r184, %r183; 2026-02-21T08:30:46.9176671Z sub.s32 %r2062, %r182, %r2061; 2026-02-21T08:30:46.9176831Z .loc 1 28 30 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:30 2026-02-21T08:30:46.9176899Z add.s32 %r2063, %r2062, %r181; 2026-02-21T08:30:46.9177059Z .loc 1 30 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:30:27 2026-02-21T08:30:46.9177117Z shl.b32 %r2558, %r2063, 7; 2026-02-21T08:30:46.9177327Z 
.loc 1 31 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:31:32 2026-02-21T08:30:46.9177390Z or.b32 %r2562, %r2558, %r6; 2026-02-21T08:30:46.9177450Z add.s32 %r190, %r2060, -3; 2026-02-21T08:30:46.9177515Z shl.b32 %r2066, %r2514, 6; 2026-02-21T08:30:46.9177573Z and.b32 %r2068, %r2515, 32; 2026-02-21T08:30:46.9177631Z or.b32 %r2070, %r2068, %r2516; 2026-02-21T08:30:46.9177806Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9177873Z add.s32 %r2072, %r265, %r2513; 2026-02-21T08:30:46.9177932Z add.s32 %r2073, %r2072, %r2066; 2026-02-21T08:30:46.9177989Z add.s32 %r2074, %r2073, %r2070; 2026-02-21T08:30:46.9178055Z add.s32 %r191, %r2074, 32768; 2026-02-21T08:30:46.9178115Z and.b32 %r2075, %r1, 24; 2026-02-21T08:30:46.9178175Z shl.b32 %r2076, %r2075, 5; 2026-02-21T08:30:46.9178233Z shr.u32 %r2077, %r2075, 1; 2026-02-21T08:30:46.9178300Z bfe.u32 %r2078, %r1, 5, 2; 2026-02-21T08:30:46.9178360Z or.b32 %r2079, %r2077, %r2078; 2026-02-21T08:30:46.9178419Z or.b32 %r2080, %r2076, %r6; 2026-02-21T08:30:46.9178485Z or.b32 %r2081, %r2079, %r2080; 2026-02-21T08:30:46.9178543Z or.b32 %r2082, %r2081, %r17; 2026-02-21T08:30:46.9178602Z add.s32 %r192, %r265, %r2082; 2026-02-21T08:30:46.9178668Z xor.b32 %r2083, %r2082, 4; 2026-02-21T08:30:46.9178727Z add.s32 %r193, %r265, %r2083; 2026-02-21T08:30:46.9178783Z xor.b32 %r2084, %r2082, 8; 2026-02-21T08:30:46.9178841Z add.s32 %r194, %r265, %r2084; 2026-02-21T08:30:46.9178907Z xor.b32 %r2085, %r2082, 12; 2026-02-21T08:30:46.9178964Z add.s32 %r195, %r265, %r2085; 2026-02-21T08:30:46.9179022Z xor.b32 %r2086, %r2082, 32; 2026-02-21T08:30:46.9179087Z add.s32 %r196, %r265, %r2086; 2026-02-21T08:30:46.9179143Z xor.b32 %r2087, %r2082, 36; 2026-02-21T08:30:46.9179200Z add.s32 %r197, %r265, %r2087; 2026-02-21T08:30:46.9179257Z xor.b32 %r2088, %r2082, 40; 2026-02-21T08:30:46.9179321Z add.s32 %r198, %r265, %r2088; 2026-02-21T08:30:46.9179380Z xor.b32 %r2089, %r2082, 44; 
2026-02-21T08:30:46.9179439Z add.s32 %r199, %r265, %r2089; 2026-02-21T08:30:46.9179503Z xor.b32 %r2090, %r2082, 64; 2026-02-21T08:30:46.9179560Z add.s32 %r200, %r265, %r2090; 2026-02-21T08:30:46.9179615Z xor.b32 %r2091, %r2082, 68; 2026-02-21T08:30:46.9179672Z add.s32 %r201, %r265, %r2091; 2026-02-21T08:30:46.9179734Z xor.b32 %r2092, %r2082, 72; 2026-02-21T08:30:46.9179790Z add.s32 %r202, %r265, %r2092; 2026-02-21T08:30:46.9179845Z xor.b32 %r2093, %r2082, 76; 2026-02-21T08:30:46.9179909Z add.s32 %r203, %r265, %r2093; 2026-02-21T08:30:46.9179965Z xor.b32 %r2094, %r2082, 96; 2026-02-21T08:30:46.9180021Z add.s32 %r204, %r265, %r2094; 2026-02-21T08:30:46.9180079Z xor.b32 %r2095, %r2082, 100; 2026-02-21T08:30:46.9180142Z add.s32 %r205, %r265, %r2095; 2026-02-21T08:30:46.9180199Z xor.b32 %r2096, %r2082, 104; 2026-02-21T08:30:46.9180254Z add.s32 %r206, %r265, %r2096; 2026-02-21T08:30:46.9180317Z xor.b32 %r2097, %r2082, 108; 2026-02-21T08:30:46.9180376Z add.s32 %r207, %r265, %r2097; 2026-02-21T08:30:46.9180478Z and.b32 %r2098, %r1, 12; 2026-02-21T08:30:46.9180543Z and.b32 %r2100, %r2522, 12; 2026-02-21T08:30:46.9180600Z and.b32 %r2101, %r1, 112; 2026-02-21T08:30:46.9180661Z mul.lo.s32 %r2102, %r2098, 264; 2026-02-21T08:30:46.9180719Z or.b32 %r2103, %r2100, %r2101; 2026-02-21T08:30:46.9180784Z xor.b32 %r2104, %r2102, %r2103; 2026-02-21T08:30:46.9180841Z add.s32 %r208, %r265, %r2104; 2026-02-21T08:30:46.9180899Z xor.b32 %r2105, %r2104, 260; 2026-02-21T08:30:46.9180962Z add.s32 %r209, %r265, %r2105; 2026-02-21T08:30:46.9181020Z xor.b32 %r2106, %r2104, 520; 2026-02-21T08:30:46.9181077Z add.s32 %r210, %r265, %r2106; 2026-02-21T08:30:46.9181135Z xor.b32 %r2107, %r2104, 780; 2026-02-21T08:30:46.9181199Z add.s32 %r211, %r265, %r2107; 2026-02-21T08:30:46.9181258Z shl.b32 %r2108, %r1, 7; 2026-02-21T08:30:46.9181316Z and.b32 %r2109, %r2108, 16256; 2026-02-21T08:30:46.9181382Z or.b32 %r2111, %r2517, %r2109; 2026-02-21T08:30:46.9181478Z or.b32 %r2112, %r2111, %r6; 
2026-02-21T08:30:46.9181575Z add.s32 %r212, %r265, %r2112; 2026-02-21T08:30:46.9181635Z xor.b32 %r2113, %r2112, 16; 2026-02-21T08:30:46.9181701Z add.s32 %r213, %r265, %r2113; 2026-02-21T08:30:46.9181757Z xor.b32 %r2114, %r2112, 32; 2026-02-21T08:30:46.9181815Z add.s32 %r214, %r265, %r2114; 2026-02-21T08:30:46.9181879Z xor.b32 %r2115, %r2112, 48; 2026-02-21T08:30:46.9181936Z add.s32 %r215, %r265, %r2115; 2026-02-21T08:30:46.9181992Z xor.b32 %r2116, %r2112, 64; 2026-02-21T08:30:46.9182056Z add.s32 %r216, %r265, %r2116; 2026-02-21T08:30:46.9182113Z xor.b32 %r2117, %r2112, 80; 2026-02-21T08:30:46.9182170Z add.s32 %r217, %r265, %r2117; 2026-02-21T08:30:46.9182226Z xor.b32 %r2118, %r2112, 96; 2026-02-21T08:30:46.9182292Z add.s32 %r218, %r265, %r2118; 2026-02-21T08:30:46.9182351Z xor.b32 %r2119, %r2112, 112; 2026-02-21T08:30:46.9182409Z add.s32 %r219, %r265, %r2119; 2026-02-21T08:30:46.9182478Z add.s32 %r2120, %r178, %r2332; 2026-02-21T08:30:46.9182541Z shl.b32 %r2121, %r179, 3; 2026-02-21T08:30:46.9182603Z add.s32 %r2170, %r2120, %r2121; 2026-02-21T08:30:46.9182662Z bfe.u32 %r2122, %r265, 4, 14; 2026-02-21T08:30:46.9182732Z cvt.u64.u32 %rd517, %r2122; 2026-02-21T08:30:46.9182807Z or.b64 %rd529, %rd517, 4611686293372403712; 2026-02-21T08:30:46.9182870Z add.s32 %r2123, %r265, 32; 2026-02-21T08:30:46.9182943Z bfe.u32 %r2124, %r2123, 4, 14; 2026-02-21T08:30:46.9183006Z cvt.u64.u32 %rd518, %r2124; 2026-02-21T08:30:46.9183081Z or.b64 %rd530, %rd518, 4611686293372403712; 2026-02-21T08:30:46.9183143Z add.s32 %r2125, %r265, 64; 2026-02-21T08:30:46.9183211Z bfe.u32 %r2126, %r2125, 4, 14; 2026-02-21T08:30:46.9183270Z cvt.u64.u32 %rd519, %r2126; 2026-02-21T08:30:46.9183337Z or.b64 %rd531, %rd519, 4611686293372403712; 2026-02-21T08:30:46.9183405Z add.s32 %r2127, %r265, 96; 2026-02-21T08:30:46.9183462Z bfe.u32 %r2128, %r2127, 4, 14; 2026-02-21T08:30:46.9183520Z cvt.u64.u32 %rd520, %r2128; 2026-02-21T08:30:46.9183595Z or.b64 %rd532, %rd520, 4611686293372403712; 
2026-02-21T08:30:46.9183654Z add.s32 %r2129, %r265, 16384; 2026-02-21T08:30:46.9183712Z bfe.u32 %r2130, %r2129, 4, 14; 2026-02-21T08:30:46.9183770Z cvt.u64.u32 %rd521, %r2130; 2026-02-21T08:30:46.9183844Z or.b64 %rd533, %rd521, 4611686293372403712; 2026-02-21T08:30:46.9183902Z add.s32 %r2131, %r265, 16416; 2026-02-21T08:30:46.9183960Z bfe.u32 %r2132, %r2131, 4, 14; 2026-02-21T08:30:46.9184025Z cvt.u64.u32 %rd522, %r2132; 2026-02-21T08:30:46.9184092Z or.b64 %rd534, %rd522, 4611686293372403712; 2026-02-21T08:30:46.9184149Z add.s32 %r2133, %r265, 16448; 2026-02-21T08:30:46.9184207Z bfe.u32 %r2134, %r2133, 4, 14; 2026-02-21T08:30:46.9184275Z cvt.u64.u32 %rd523, %r2134; 2026-02-21T08:30:46.9184340Z or.b64 %rd535, %rd523, 4611686293372403712; 2026-02-21T08:30:46.9184397Z add.s32 %r2135, %r265, 16480; 2026-02-21T08:30:46.9184462Z bfe.u32 %r2136, %r2135, 4, 14; 2026-02-21T08:30:46.9184520Z cvt.u64.u32 %rd524, %r2136; 2026-02-21T08:30:46.9184588Z or.b64 %rd536, %rd524, 4611686293372403712; 2026-02-21T08:30:46.9184710Z and.b32 %r2138, %r2518, 3072; 2026-02-21T08:30:46.9184775Z shl.b32 %r2140, %r2514, 3; 2026-02-21T08:30:46.9184833Z or.b32 %r2141, %r2519, %r2140; 2026-02-21T08:30:46.9184890Z xor.b32 %r2142, %r2141, %r2070; 2026-02-21T08:30:46.9184953Z add.s32 %r2143, %r265, %r2138; 2026-02-21T08:30:46.9185009Z add.s32 %r222, %r2143, %r2142; 2026-02-21T08:30:46.9185064Z and.b32 %r2145, %r2520, 3936; 2026-02-21T08:30:46.9185125Z and.b32 %r2147, %r2522, 16; 2026-02-21T08:30:46.9185182Z xor.b32 %r2148, %r2145, %r2521; 2026-02-21T08:30:46.9185238Z add.s32 %r2149, %r265, %r2147; 2026-02-21T08:30:46.9185296Z add.s32 %r2415, %r2149, %r2148; 2026-02-21T08:30:46.9185361Z shl.b32 %r2150, %r174, 4; 2026-02-21T08:30:46.9185418Z shl.b32 %r2151, %r175, 4; 2026-02-21T08:30:46.9185475Z add.s32 %r226, %r2150, %r2151; 2026-02-21T08:30:46.9185543Z mov.pred %p247, 0; 2026-02-21T08:30:46.9185597Z mov.b32 %r2568, 2; 2026-02-21T08:30:46.9185705Z mov.b32 %r2567, -1; 2026-02-21T08:30:46.9185767Z 
mov.b32 %r2565, 32; 2026-02-21T08:30:46.9185831Z mov.b32 %r2564, 64; 2026-02-21T08:30:46.9185887Z mov.b32 %r2560, 1; 2026-02-21T08:30:46.9185946Z mov.b32 %r2557, %r2556; 2026-02-21T08:30:46.9186009Z mov.b32 %r2559, %r2558; 2026-02-21T08:30:46.9186065Z mov.b32 %r2563, %r2562; 2026-02-21T08:30:46.9186121Z mov.b32 %r2566, %r2561; 2026-02-21T08:30:46.9186175Z mov.b32 %r2569, %r2556; 2026-02-21T08:30:46.9186237Z mov.b32 %r2570, %r2562; 2026-02-21T08:30:46.9186291Z mov.b32 %r2571, %r2558; 2026-02-21T08:30:46.9186348Z mov.b32 %r2573, %r2568; 2026-02-21T08:30:46.9186409Z mov.b32 %r2574, %r2561; 2026-02-21T08:30:46.9186465Z mov.b32 %r2577, %r2571; 2026-02-21T08:30:46.9186520Z mov.b32 %r2578, %r2570; 2026-02-21T08:30:46.9186575Z mov.b32 %r2579, %r2569; 2026-02-21T08:30:46.9186639Z bra.uni $L__BB0_30; 2026-02-21T08:30:46.9186743Z $L__BB0_36: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:46.9186920Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9186989Z add.s32 %r2574, %r2574, 1; 2026-02-21T08:30:46.9187055Z setp.ne.b32 %p243, %r226, %r2574; 2026-02-21T08:30:46.9187111Z mov.b32 %r2556, %r2569; 2026-02-21T08:30:46.9187175Z mov.b32 %r2557, %r227; 2026-02-21T08:30:46.9187230Z mov.b32 %r2558, %r2571; 2026-02-21T08:30:46.9187286Z mov.b32 %r2559, %r229; 2026-02-21T08:30:46.9187342Z mov.b32 %r2560, %r2573; 2026-02-21T08:30:46.9187406Z mov.b32 %r2561, %r231; 2026-02-21T08:30:46.9187464Z mov.b32 %r2562, %r2570; 2026-02-21T08:30:46.9187519Z mov.b32 %r2563, %r233; 2026-02-21T08:30:46.9187584Z mov.b32 %r2566, %r236; 2026-02-21T08:30:46.9187639Z mov.b32 %r2569, %r2579; 2026-02-21T08:30:46.9187695Z mov.b32 %r2570, %r2578; 2026-02-21T08:30:46.9187751Z mov.b32 %r2571, %r2577; 2026-02-21T08:30:46.9187815Z mov.b32 %r2573, %r248; 2026-02-21T08:30:46.9187873Z @%p243 bra $L__BB0_30; 2026-02-21T08:30:46.9187933Z bra.uni $L__BB0_37; 2026-02-21T08:30:46.9188045Z $L__BB0_30: // =>This Inner Loop Header: Depth=1 2026-02-21T08:30:46.9188218Z 
.loc 1 0 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:0:112 2026-02-21T08:30:46.9188276Z mov.b32 %r236, %r2565; 2026-02-21T08:30:46.9188331Z mov.b32 %r2565, %r2564; 2026-02-21T08:30:46.9188395Z mov.b32 %r233, %r2562; 2026-02-21T08:30:46.9188450Z mov.b32 %r231, %r2560; 2026-02-21T08:30:46.9188527Z mov.b32 %r229, %r2558; 2026-02-21T08:30:46.9188593Z mov.b32 %r227, %r2556; 2026-02-21T08:30:46.9188774Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9188837Z add.s32 %r2152, %r2573, 1; 2026-02-21T08:30:46.9188911Z setp.eq.b32 %p215, %r2573, 15; 2026-02-21T08:30:46.9188980Z selp.b32 %r248, 0, %r2152, %p215; 2026-02-21T08:30:46.9189046Z setp.ne.b32 %p216, %r248, 0; 2026-02-21T08:30:46.9189107Z @%p216 bra $L__BB0_32; 2026-02-21T08:30:46.9189220Z // %bb.31: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:46.9189331Z add.s32 %r2580, %r2580, 2368; 2026-02-21T08:30:46.9189517Z .loc 1 25 35 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:25:35 2026-02-21T08:30:46.9189591Z shr.s32 %r2153, %r2580, 31; 2026-02-21T08:30:46.9189653Z shr.u32 %r2154, %r2153, 23; 2026-02-21T08:30:46.9189716Z add.s32 %r2155, %r2580, %r2154; 2026-02-21T08:30:46.9189776Z shr.s32 %r2156, %r2155, 9; 2026-02-21T08:30:46.9189961Z .loc 1 26 33 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:26:33 2026-02-21T08:30:46.9190021Z shl.b32 %r2157, %r2156, 3; 2026-02-21T08:30:46.9190195Z .loc 1 27 39 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:39 2026-02-21T08:30:46.9190264Z sub.s32 %r2158, 64, %r2157; 2026-02-21T08:30:46.9190486Z .loc 1 27 52 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:27:52 2026-02-21T08:30:46.9190549Z min.s32 %r2159, %r2158, 8; 2026-02-21T08:30:46.9190732Z .loc 1 28 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:45 2026-02-21T08:30:46.9190795Z and.b32 %r2160, %r2155, -512; 2026-02-21T08:30:46.9190858Z sub.s32 %r2161, %r2580, %r2160; 
2026-02-21T08:30:46.9191040Z .loc 1 29 51 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:29:51 2026-02-21T08:30:46.9191103Z div.s32 %r2162, %r2161, %r2159; 2026-02-21T08:30:46.9191274Z .loc 1 28 64 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:64 2026-02-21T08:30:46.9191341Z mul.lo.s32 %r2163, %r2162, %r2159; 2026-02-21T08:30:46.9191409Z sub.s32 %r2164, %r2161, %r2163; 2026-02-21T08:30:46.9191624Z .loc 1 28 30 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:28:30 2026-02-21T08:30:46.9191686Z add.s32 %r2165, %r2164, %r2157; 2026-02-21T08:30:46.9191870Z .loc 1 30 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:30:27 2026-02-21T08:30:46.9191936Z shl.b32 %r2577, %r2165, 7; 2026-02-21T08:30:46.9192111Z .loc 1 31 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:31:32 2026-02-21T08:30:46.9192182Z or.b32 %r2578, %r2577, %r6; 2026-02-21T08:30:46.9192356Z .loc 1 32 27 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:32:27 2026-02-21T08:30:46.9192416Z shl.b32 %r2579, %r2162, 6; 2026-02-21T08:30:46.9192594Z .loc 1 33 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:33:32 2026-02-21T08:30:46.9192657Z or.b32 %r2581, %r2579, %r10; 2026-02-21T08:30:46.9192721Z or.b32 %r2582, %r2579, %r11; 2026-02-21T08:30:46.9192822Z $L__BB0_32: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:46.9193014Z .loc 1 0 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:0:32 2026-02-21T08:30:46.9193085Z setp.ne.b32 %p219, %r177, 0; 2026-02-21T08:30:46.9193267Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9193337Z add.s32 %r2188, %r2567, 1; 2026-02-21T08:30:46.9193402Z setp.gt.s32 %p221, %r2188, 2; 2026-02-21T08:30:46.9193470Z selp.b32 %r2567, 0, %r2188, %p221; 2026-02-21T08:30:46.9193655Z .loc 1 41 35 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:41:35 2026-02-21T08:30:46.9193716Z add.s32 %r2189, %r2566, %r10; 
2026-02-21T08:30:46.9193888Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9193963Z cp.async.wait_group 2; 2026-02-21T08:30:46.9194020Z bar.sync 0; 2026-02-21T08:30:46.9194079Z shl.b32 %r2190, %r2567, 13; 2026-02-21T08:30:46.9194141Z add.s32 %r2191, %r191, %r2190; 2026-02-21T08:30:46.9194255Z ld.shared.v4.b32 {%r2192, %r2193, %r2194, %r2195}, [%r2191]; 2026-02-21T08:30:46.9194326Z mov.b32 {%rs1409, %rs1410}, %r2195; 2026-02-21T08:30:46.9194447Z mov.b32 {%rs1411, %rs1412}, %r2194; 2026-02-21T08:30:46.9194519Z mov.b32 {%rs1413, %rs1414}, %r2193; 2026-02-21T08:30:46.9194583Z mov.b32 {%rs1415, %rs1416}, %r2192; 2026-02-21T08:30:46.9194693Z ld.shared.v4.b32 {%r2196, %r2197, %r2198, %r2199}, [%r2191+16]; 2026-02-21T08:30:46.9194757Z mov.b32 {%rs1417, %rs1418}, %r2199; 2026-02-21T08:30:46.9194825Z mov.b32 {%rs1419, %rs1420}, %r2198; 2026-02-21T08:30:46.9194887Z mov.b32 {%rs1421, %rs1422}, %r2197; 2026-02-21T08:30:46.9194950Z mov.b32 {%rs1423, %rs1424}, %r2196; 2026-02-21T08:30:46.9195130Z .loc 1 52 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:52:32 2026-02-21T08:30:46.9195195Z cvt.f32.bf16 %r2171, %rs1415; 2026-02-21T08:30:46.9195257Z cvt.f32.bf16 %r2172, %rs1416; 2026-02-21T08:30:46.9195326Z cvt.f32.bf16 %r2173, %rs1413; 2026-02-21T08:30:46.9195386Z cvt.f32.bf16 %r2174, %rs1414; 2026-02-21T08:30:46.9195493Z cvt.f32.bf16 %r2175, %rs1411; 2026-02-21T08:30:46.9195556Z cvt.f32.bf16 %r2176, %rs1412; 2026-02-21T08:30:46.9195625Z cvt.f32.bf16 %r2177, %rs1409; 2026-02-21T08:30:46.9195683Z cvt.f32.bf16 %r2178, %rs1410; 2026-02-21T08:30:46.9195742Z cvt.f32.bf16 %r2179, %rs1423; 2026-02-21T08:30:46.9195809Z cvt.f32.bf16 %r2180, %rs1424; 2026-02-21T08:30:46.9195877Z cvt.f32.bf16 %r2181, %rs1421; 2026-02-21T08:30:46.9195933Z cvt.f32.bf16 %r2182, %rs1422; 2026-02-21T08:30:46.9195991Z cvt.f32.bf16 %r2183, %rs1419; 2026-02-21T08:30:46.9196056Z cvt.f32.bf16 %r2184, %rs1420; 2026-02-21T08:30:46.9196113Z cvt.f32.bf16 
%r2185, %rs1417; 2026-02-21T08:30:46.9196168Z cvt.f32.bf16 %r2186, %rs1418; 2026-02-21T08:30:46.9196338Z .loc 1 54 55 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:55 2026-02-21T08:30:46.9196396Z shl.b32 %r2200, %r2189, 13; 2026-02-21T08:30:46.9196562Z .loc 1 54 62 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:62 2026-02-21T08:30:46.9196631Z add.s32 %r2201, %r2563, %r2200; 2026-02-21T08:30:46.9196795Z .loc 1 54 34 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:34 2026-02-21T08:30:46.9196855Z cvt.s64.s32 %rd528, %r2201; 2026-02-21T08:30:46.9196916Z add.s64 %rd526, %rd48, %rd528; 2026-02-21T08:30:46.9197088Z .loc 1 54 87 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:54:87 2026-02-21T08:30:46.9197146Z // begin inline asm 2026-02-21T08:30:46.9197205Z mov.u64 %rd525, 0x0; 2026-02-21T08:30:46.9197328Z createpolicy.fractional.L2::evict_first.b64 %rd525, 1.0; 2026-02-21T08:30:46.9197385Z // end inline asm 2026-02-21T08:30:46.9197447Z // begin inline asm 2026-02-21T08:30:46.9197509Z mov.u32 %r2166, 0x0; 2026-02-21T08:30:46.9197565Z mov.u32 %r2167, 0x0; 2026-02-21T08:30:46.9197621Z mov.u32 %r2168, 0x0; 2026-02-21T08:30:46.9197676Z mov.u32 %r2169, 0x0; 2026-02-21T08:30:46.9197870Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2166, %r2167, %r2168, %r2169 }, [ %rd526 + 0 ], %rd525; 2026-02-21T08:30:46.9197928Z // end inline asm 2026-02-21T08:30:46.9197994Z prmt.b32 %r2202, %r2166, 0, 0x8880U; 2026-02-21T08:30:46.9198064Z cvt.u16.u32 %rs1425, %r2202; 2026-02-21T08:30:46.9198129Z prmt.b32 %r2203, %r2166, 0, 0x7770U; 2026-02-21T08:30:46.9198188Z cvt.u16.u32 %rs1426, %r2203; 2026-02-21T08:30:46.9198259Z prmt.b32 %r2204, %r2166, 0, 0x9991U; 2026-02-21T08:30:46.9198317Z cvt.u16.u32 %rs1427, %r2204; 2026-02-21T08:30:46.9198380Z prmt.b32 %r2205, %r2166, 0, 0x7771U; 2026-02-21T08:30:46.9198439Z cvt.u16.u32 %rs1428, %r2205; 2026-02-21T08:30:46.9198516Z prmt.b32 %r2206, %r2166, 0, 0xaaa2U; 2026-02-21T08:30:46.9198573Z 
cvt.u16.u32 %rs1429, %r2206; 2026-02-21T08:30:46.9198634Z prmt.b32 %r2207, %r2166, 0, 0x7772U; 2026-02-21T08:30:46.9198702Z cvt.u16.u32 %rs1430, %r2207; 2026-02-21T08:30:46.9198762Z prmt.b32 %r2208, %r2166, 0, 0xbbb3U; 2026-02-21T08:30:46.9198818Z cvt.u16.u32 %rs1431, %r2208; 2026-02-21T08:30:46.9198881Z prmt.b32 %r2209, %r2166, 0, 0x7773U; 2026-02-21T08:30:46.9198987Z cvt.u16.u32 %rs1432, %r2209; 2026-02-21T08:30:46.9199047Z prmt.b32 %r2210, %r2167, 0, 0x8880U; 2026-02-21T08:30:46.9199108Z cvt.u16.u32 %rs1433, %r2210; 2026-02-21T08:30:46.9199175Z prmt.b32 %r2211, %r2167, 0, 0x7770U; 2026-02-21T08:30:46.9199233Z cvt.u16.u32 %rs1434, %r2211; 2026-02-21T08:30:46.9199293Z prmt.b32 %r2212, %r2167, 0, 0x9991U; 2026-02-21T08:30:46.9199352Z cvt.u16.u32 %rs1435, %r2212; 2026-02-21T08:30:46.9199418Z prmt.b32 %r2213, %r2167, 0, 0x7771U; 2026-02-21T08:30:46.9199477Z cvt.u16.u32 %rs1436, %r2213; 2026-02-21T08:30:46.9199537Z prmt.b32 %r2214, %r2167, 0, 0xaaa2U; 2026-02-21T08:30:46.9199603Z cvt.u16.u32 %rs1437, %r2214; 2026-02-21T08:30:46.9199662Z prmt.b32 %r2215, %r2167, 0, 0x7772U; 2026-02-21T08:30:46.9199720Z cvt.u16.u32 %rs1438, %r2215; 2026-02-21T08:30:46.9199787Z prmt.b32 %r2216, %r2167, 0, 0xbbb3U; 2026-02-21T08:30:46.9199845Z cvt.u16.u32 %rs1439, %r2216; 2026-02-21T08:30:46.9199949Z prmt.b32 %r2217, %r2167, 0, 0x7773U; 2026-02-21T08:30:46.9200010Z cvt.u16.u32 %rs1440, %r2217; 2026-02-21T08:30:46.9200077Z prmt.b32 %r2218, %r2168, 0, 0x8880U; 2026-02-21T08:30:46.9200134Z cvt.u16.u32 %rs1441, %r2218; 2026-02-21T08:30:46.9200193Z prmt.b32 %r2219, %r2168, 0, 0x7770U; 2026-02-21T08:30:46.9200259Z cvt.u16.u32 %rs1442, %r2219; 2026-02-21T08:30:46.9200319Z prmt.b32 %r2220, %r2168, 0, 0x9991U; 2026-02-21T08:30:46.9200377Z cvt.u16.u32 %rs1443, %r2220; 2026-02-21T08:30:46.9200437Z prmt.b32 %r2221, %r2168, 0, 0x7771U; 2026-02-21T08:30:46.9200503Z cvt.u16.u32 %rs1444, %r2221; 2026-02-21T08:30:46.9200562Z prmt.b32 %r2222, %r2168, 0, 0xaaa2U; 2026-02-21T08:30:46.9200619Z cvt.u16.u32 
%rs1445, %r2222; 2026-02-21T08:30:46.9200687Z prmt.b32 %r2223, %r2168, 0, 0x7772U; 2026-02-21T08:30:46.9200743Z cvt.u16.u32 %rs1446, %r2223; 2026-02-21T08:30:46.9200803Z prmt.b32 %r2224, %r2168, 0, 0xbbb3U; 2026-02-21T08:30:46.9200867Z cvt.u16.u32 %rs1447, %r2224; 2026-02-21T08:30:46.9200927Z prmt.b32 %r2225, %r2168, 0, 0x7773U; 2026-02-21T08:30:46.9200987Z cvt.u16.u32 %rs1448, %r2225; 2026-02-21T08:30:46.9201047Z prmt.b32 %r2226, %r2169, 0, 0x8880U; 2026-02-21T08:30:46.9201112Z cvt.u16.u32 %rs1449, %r2226; 2026-02-21T08:30:46.9201171Z prmt.b32 %r2227, %r2169, 0, 0x7770U; 2026-02-21T08:30:46.9201228Z cvt.u16.u32 %rs1450, %r2227; 2026-02-21T08:30:46.9201292Z prmt.b32 %r2228, %r2169, 0, 0x9991U; 2026-02-21T08:30:46.9201349Z cvt.u16.u32 %rs1451, %r2228; 2026-02-21T08:30:46.9201408Z prmt.b32 %r2229, %r2169, 0, 0x7771U; 2026-02-21T08:30:46.9201465Z cvt.u16.u32 %rs1452, %r2229; 2026-02-21T08:30:46.9201531Z prmt.b32 %r2230, %r2169, 0, 0xaaa2U; 2026-02-21T08:30:46.9201647Z cvt.u16.u32 %rs1453, %r2230; 2026-02-21T08:30:46.9201708Z prmt.b32 %r2231, %r2169, 0, 0x7772U; 2026-02-21T08:30:46.9201773Z cvt.u16.u32 %rs1454, %r2231; 2026-02-21T08:30:46.9201833Z prmt.b32 %r2232, %r2169, 0, 0xbbb3U; 2026-02-21T08:30:46.9201890Z cvt.u16.u32 %rs1455, %r2232; 2026-02-21T08:30:46.9201951Z prmt.b32 %r2233, %r2169, 0, 0x7773U; 2026-02-21T08:30:46.9202016Z cvt.u16.u32 %rs1456, %r2233; 2026-02-21T08:30:46.9202188Z .loc 1 59 25 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:59:25 2026-02-21T08:30:46.9202248Z shl.b16 %rs1457, %rs1426, 12; 2026-02-21T08:30:46.9202315Z shr.s16 %rs1458, %rs1457, 12; 2026-02-21T08:30:46.9202372Z shl.b16 %rs1459, %rs1428, 12; 2026-02-21T08:30:46.9202429Z shr.s16 %rs1460, %rs1459, 12; 2026-02-21T08:30:46.9202491Z shl.b16 %rs1461, %rs1430, 12; 2026-02-21T08:30:46.9202546Z shr.s16 %rs1462, %rs1461, 12; 2026-02-21T08:30:46.9202603Z shl.b16 %rs1463, %rs1432, 12; 2026-02-21T08:30:46.9202658Z shr.s16 %rs1464, %rs1463, 12; 2026-02-21T08:30:46.9202721Z shl.b16 
%rs1465, %rs1434, 12; 2026-02-21T08:30:46.9202778Z shr.s16 %rs1466, %rs1465, 12; 2026-02-21T08:30:46.9202834Z shl.b16 %rs1467, %rs1436, 12; 2026-02-21T08:30:46.9202898Z shr.s16 %rs1468, %rs1467, 12; 2026-02-21T08:30:46.9202955Z shl.b16 %rs1469, %rs1438, 12; 2026-02-21T08:30:46.9203014Z shr.s16 %rs1470, %rs1469, 12; 2026-02-21T08:30:46.9203147Z shl.b16 %rs1471, %rs1440, 12; 2026-02-21T08:30:46.9203211Z shr.s16 %rs1472, %rs1471, 12; 2026-02-21T08:30:46.9203267Z shl.b16 %rs1473, %rs1442, 12; 2026-02-21T08:30:46.9203324Z shr.s16 %rs1474, %rs1473, 12; 2026-02-21T08:30:46.9203389Z shl.b16 %rs1475, %rs1444, 12; 2026-02-21T08:30:46.9203447Z shr.s16 %rs1476, %rs1475, 12; 2026-02-21T08:30:46.9203504Z shl.b16 %rs1477, %rs1446, 12; 2026-02-21T08:30:46.9203568Z shr.s16 %rs1478, %rs1477, 12; 2026-02-21T08:30:46.9203624Z shl.b16 %rs1479, %rs1448, 12; 2026-02-21T08:30:46.9203681Z shr.s16 %rs1480, %rs1479, 12; 2026-02-21T08:30:46.9203738Z shl.b16 %rs1481, %rs1450, 12; 2026-02-21T08:30:46.9203802Z shr.s16 %rs1482, %rs1481, 12; 2026-02-21T08:30:46.9203859Z shl.b16 %rs1483, %rs1452, 12; 2026-02-21T08:30:46.9203916Z shr.s16 %rs1484, %rs1483, 12; 2026-02-21T08:30:46.9203982Z shl.b16 %rs1485, %rs1454, 12; 2026-02-21T08:30:46.9204090Z shr.s16 %rs1486, %rs1485, 12; 2026-02-21T08:30:46.9204153Z shl.b16 %rs1487, %rs1456, 12; 2026-02-21T08:30:46.9204211Z shr.s16 %rs1488, %rs1487, 12; 2026-02-21T08:30:46.9204392Z .loc 1 62 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:62:28 2026-02-21T08:30:46.9204452Z shr.u16 %rs1489, %rs1425, 4; 2026-02-21T08:30:46.9204511Z shr.u16 %rs1490, %rs1427, 4; 2026-02-21T08:30:46.9204578Z shr.u16 %rs1491, %rs1429, 4; 2026-02-21T08:30:46.9204636Z shr.u16 %rs1492, %rs1431, 4; 2026-02-21T08:30:46.9204695Z shr.u16 %rs1493, %rs1433, 4; 2026-02-21T08:30:46.9204753Z shr.u16 %rs1494, %rs1435, 4; 2026-02-21T08:30:46.9204822Z shr.u16 %rs1495, %rs1437, 4; 2026-02-21T08:30:46.9204880Z shr.u16 %rs1496, %rs1439, 4; 2026-02-21T08:30:46.9204938Z shr.u16 %rs1497, 
%rs1441, 4; 2026-02-21T08:30:46.9205007Z shr.u16 %rs1498, %rs1443, 4; 2026-02-21T08:30:46.9205064Z shr.u16 %rs1499, %rs1445, 4; 2026-02-21T08:30:46.9205121Z shr.u16 %rs1500, %rs1447, 4; 2026-02-21T08:30:46.9205188Z shr.u16 %rs1501, %rs1449, 4; 2026-02-21T08:30:46.9205248Z shr.u16 %rs1502, %rs1451, 4; 2026-02-21T08:30:46.9205303Z shr.u16 %rs1503, %rs1453, 4; 2026-02-21T08:30:46.9205360Z shr.u16 %rs1504, %rs1455, 4; 2026-02-21T08:30:46.9205537Z .loc 1 66 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:66:45 2026-02-21T08:30:46.9205601Z st.shared.b8 [%r192], %rs1458; 2026-02-21T08:30:46.9205664Z st.shared.b8 [%r193], %rs1460; 2026-02-21T08:30:46.9205732Z st.shared.b8 [%r194], %rs1462; 2026-02-21T08:30:46.9205792Z st.shared.b8 [%r195], %rs1464; 2026-02-21T08:30:46.9205857Z st.shared.b8 [%r196+1024], %rs1466; 2026-02-21T08:30:46.9205922Z st.shared.b8 [%r197+1024], %rs1468; 2026-02-21T08:30:46.9205992Z st.shared.b8 [%r198+1024], %rs1470; 2026-02-21T08:30:46.9206053Z st.shared.b8 [%r199+1024], %rs1472; 2026-02-21T08:30:46.9206115Z st.shared.b8 [%r200+2048], %rs1474; 2026-02-21T08:30:46.9206184Z st.shared.b8 [%r201+2048], %rs1476; 2026-02-21T08:30:46.9206246Z st.shared.b8 [%r202+2048], %rs1478; 2026-02-21T08:30:46.9206311Z st.shared.b8 [%r203+2048], %rs1480; 2026-02-21T08:30:46.9206379Z st.shared.b8 [%r204+3072], %rs1482; 2026-02-21T08:30:46.9206441Z st.shared.b8 [%r205+3072], %rs1484; 2026-02-21T08:30:46.9206501Z st.shared.b8 [%r206+3072], %rs1486; 2026-02-21T08:30:46.9206563Z st.shared.b8 [%r207+3072], %rs1488; 2026-02-21T08:30:46.9206626Z bar.sync 0; 2026-02-21T08:30:46.9206690Z ld.shared.b32 %r2234, [%r208]; 2026-02-21T08:30:46.9206750Z cvt.u16.u32 %rs1505, %r2234; 2026-02-21T08:30:46.9206819Z prmt.b32 %r2235, %r2234, 0, 0x7771U; 2026-02-21T08:30:46.9206879Z cvt.u16.u32 %rs1506, %r2235; 2026-02-21T08:30:46.9206940Z prmt.b32 %r2236, %r2234, 0, 0x7772U; 2026-02-21T08:30:46.9206999Z cvt.u16.u32 %rs1507, %r2236; 2026-02-21T08:30:46.9207067Z prmt.b32 %r2237, 
%r2234, 0, 0x7773U; 2026-02-21T08:30:46.9207124Z cvt.u16.u32 %rs1508, %r2237; 2026-02-21T08:30:46.9207189Z ld.shared.b32 %r2238, [%r208+128]; 2026-02-21T08:30:46.9207259Z cvt.u16.u32 %rs1509, %r2238; 2026-02-21T08:30:46.9207362Z prmt.b32 %r2239, %r2238, 0, 0x7771U; 2026-02-21T08:30:46.9207420Z cvt.u16.u32 %rs1510, %r2239; 2026-02-21T08:30:46.9207479Z prmt.b32 %r2240, %r2238, 0, 0x7772U; 2026-02-21T08:30:46.9207544Z cvt.u16.u32 %rs1511, %r2240; 2026-02-21T08:30:46.9207605Z prmt.b32 %r2241, %r2238, 0, 0x7773U; 2026-02-21T08:30:46.9207661Z cvt.u16.u32 %rs1512, %r2241; 2026-02-21T08:30:46.9207730Z ld.shared.b32 %r2242, [%r209]; 2026-02-21T08:30:46.9207788Z cvt.u16.u32 %rs1513, %r2242; 2026-02-21T08:30:46.9207849Z prmt.b32 %r2243, %r2242, 0, 0x7771U; 2026-02-21T08:30:46.9207913Z cvt.u16.u32 %rs1514, %r2243; 2026-02-21T08:30:46.9207973Z prmt.b32 %r2244, %r2242, 0, 0x7772U; 2026-02-21T08:30:46.9208030Z cvt.u16.u32 %rs1515, %r2244; 2026-02-21T08:30:46.9208090Z prmt.b32 %r2245, %r2242, 0, 0x7773U; 2026-02-21T08:30:46.9208153Z cvt.u16.u32 %rs1516, %r2245; 2026-02-21T08:30:46.9208216Z ld.shared.b32 %r2246, [%r209+128]; 2026-02-21T08:30:46.9208271Z cvt.u16.u32 %rs1517, %r2246; 2026-02-21T08:30:46.9208378Z prmt.b32 %r2247, %r2246, 0, 0x7771U; 2026-02-21T08:30:46.9208438Z cvt.u16.u32 %rs1518, %r2247; 2026-02-21T08:30:46.9208499Z prmt.b32 %r2248, %r2246, 0, 0x7772U; 2026-02-21T08:30:46.9208558Z cvt.u16.u32 %rs1519, %r2248; 2026-02-21T08:30:46.9208625Z prmt.b32 %r2249, %r2246, 0, 0x7773U; 2026-02-21T08:30:46.9208682Z cvt.u16.u32 %rs1520, %r2249; 2026-02-21T08:30:46.9208744Z ld.shared.b32 %r2250, [%r210]; 2026-02-21T08:30:46.9208806Z cvt.u16.u32 %rs1521, %r2250; 2026-02-21T08:30:46.9208866Z prmt.b32 %r2251, %r2250, 0, 0x7771U; 2026-02-21T08:30:46.9208925Z cvt.u16.u32 %rs1522, %r2251; 2026-02-21T08:30:46.9208990Z prmt.b32 %r2252, %r2250, 0, 0x7772U; 2026-02-21T08:30:46.9209047Z cvt.u16.u32 %rs1523, %r2252; 2026-02-21T08:30:46.9209105Z prmt.b32 %r2253, %r2250, 0, 0x7773U; 
2026-02-21T08:30:46.9209164Z cvt.u16.u32 %rs1524, %r2253; 2026-02-21T08:30:46.9209232Z ld.shared.b32 %r2254, [%r210+128]; 2026-02-21T08:30:46.9209289Z cvt.u16.u32 %rs1525, %r2254; 2026-02-21T08:30:46.9209350Z prmt.b32 %r2255, %r2254, 0, 0x7771U; 2026-02-21T08:30:46.9209416Z cvt.u16.u32 %rs1526, %r2255; 2026-02-21T08:30:46.9209477Z prmt.b32 %r2256, %r2254, 0, 0x7772U; 2026-02-21T08:30:46.9209534Z cvt.u16.u32 %rs1527, %r2256; 2026-02-21T08:30:46.9209594Z prmt.b32 %r2257, %r2254, 0, 0x7773U; 2026-02-21T08:30:46.9209659Z cvt.u16.u32 %rs1528, %r2257; 2026-02-21T08:30:46.9209720Z ld.shared.b32 %r2258, [%r211]; 2026-02-21T08:30:46.9209779Z cvt.u16.u32 %rs1529, %r2258; 2026-02-21T08:30:46.9209845Z prmt.b32 %r2259, %r2258, 0, 0x7771U; 2026-02-21T08:30:46.9209903Z cvt.u16.u32 %rs1530, %r2259; 2026-02-21T08:30:46.9209963Z prmt.b32 %r2260, %r2258, 0, 0x7772U; 2026-02-21T08:30:46.9210022Z cvt.u16.u32 %rs1531, %r2260; 2026-02-21T08:30:46.9210090Z prmt.b32 %r2261, %r2258, 0, 0x7773U; 2026-02-21T08:30:46.9210148Z cvt.u16.u32 %rs1532, %r2261; 2026-02-21T08:30:46.9210209Z ld.shared.b32 %r2262, [%r211+128]; 2026-02-21T08:30:46.9210274Z cvt.u16.u32 %rs1533, %r2262; 2026-02-21T08:30:46.9210337Z prmt.b32 %r2263, %r2262, 0, 0x7771U; 2026-02-21T08:30:46.9210398Z cvt.u16.u32 %rs1534, %r2263; 2026-02-21T08:30:46.9210466Z prmt.b32 %r2264, %r2262, 0, 0x7772U; 2026-02-21T08:30:46.9210523Z cvt.u16.u32 %rs1535, %r2264; 2026-02-21T08:30:46.9210583Z prmt.b32 %r2265, %r2262, 0, 0x7773U; 2026-02-21T08:30:46.9210642Z cvt.u16.u32 %rs1536, %r2265; 2026-02-21T08:30:46.9210823Z .loc 1 67 45 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:67:45 2026-02-21T08:30:46.9210879Z bar.sync 0; 2026-02-21T08:30:46.9210942Z st.shared.b8 [%r192], %rs1489; 2026-02-21T08:30:46.9211010Z st.shared.b8 [%r193], %rs1490; 2026-02-21T08:30:46.9211070Z st.shared.b8 [%r194], %rs1491; 2026-02-21T08:30:46.9211130Z st.shared.b8 [%r195], %rs1492; 2026-02-21T08:30:46.9211192Z st.shared.b8 [%r196+1024], %rs1493; 
2026-02-21T08:30:46.9211264Z st.shared.b8 [%r197+1024], %rs1494; 2026-02-21T08:30:46.9211327Z st.shared.b8 [%r198+1024], %rs1495; 2026-02-21T08:30:46.9211392Z st.shared.b8 [%r199+1024], %rs1496; 2026-02-21T08:30:46.9211511Z st.shared.b8 [%r200+2048], %rs1497; 2026-02-21T08:30:46.9211606Z st.shared.b8 [%r201+2048], %rs1498; 2026-02-21T08:30:46.9211670Z st.shared.b8 [%r202+2048], %rs1499; 2026-02-21T08:30:46.9211737Z st.shared.b8 [%r203+2048], %rs1500; 2026-02-21T08:30:46.9211797Z st.shared.b8 [%r204+3072], %rs1501; 2026-02-21T08:30:46.9211858Z st.shared.b8 [%r205+3072], %rs1502; 2026-02-21T08:30:46.9211919Z st.shared.b8 [%r206+3072], %rs1503; 2026-02-21T08:30:46.9211989Z st.shared.b8 [%r207+3072], %rs1504; 2026-02-21T08:30:46.9212044Z bar.sync 0; 2026-02-21T08:30:46.9212105Z ld.shared.b32 %r2266, [%r208]; 2026-02-21T08:30:46.9212170Z cvt.u16.u32 %rs1537, %r2266; 2026-02-21T08:30:46.9212231Z prmt.b32 %r2267, %r2266, 0, 0x7771U; 2026-02-21T08:30:46.9212290Z cvt.u16.u32 %rs1538, %r2267; 2026-02-21T08:30:46.9212351Z prmt.b32 %r2268, %r2266, 0, 0x7772U; 2026-02-21T08:30:46.9212417Z cvt.u16.u32 %rs1539, %r2268; 2026-02-21T08:30:46.9212529Z prmt.b32 %r2269, %r2266, 0, 0x7773U; 2026-02-21T08:30:46.9212593Z cvt.u16.u32 %rs1540, %r2269; 2026-02-21T08:30:46.9212665Z ld.shared.b32 %r2270, [%r208+128]; 2026-02-21T08:30:46.9212723Z cvt.u16.u32 %rs1541, %r2270; 2026-02-21T08:30:46.9212783Z prmt.b32 %r2271, %r2270, 0, 0x7771U; 2026-02-21T08:30:46.9212850Z cvt.u16.u32 %rs1542, %r2271; 2026-02-21T08:30:46.9212911Z prmt.b32 %r2272, %r2270, 0, 0x7772U; 2026-02-21T08:30:46.9212970Z cvt.u16.u32 %rs1543, %r2272; 2026-02-21T08:30:46.9213030Z prmt.b32 %r2273, %r2270, 0, 0x7773U; 2026-02-21T08:30:46.9213096Z cvt.u16.u32 %rs1544, %r2273; 2026-02-21T08:30:46.9213157Z ld.shared.b32 %r2274, [%r209]; 2026-02-21T08:30:46.9213214Z cvt.u16.u32 %rs1545, %r2274; 2026-02-21T08:30:46.9213282Z prmt.b32 %r2275, %r2274, 0, 0x7771U; 2026-02-21T08:30:46.9213339Z cvt.u16.u32 %rs1546, %r2275; 
2026-02-21T08:30:46.9213400Z prmt.b32 %r2276, %r2274, 0, 0x7772U; 2026-02-21T08:30:46.9213456Z cvt.u16.u32 %rs1547, %r2276; 2026-02-21T08:30:46.9213527Z prmt.b32 %r2277, %r2274, 0, 0x7773U; 2026-02-21T08:30:46.9213588Z cvt.u16.u32 %rs1548, %r2277; 2026-02-21T08:30:46.9213650Z ld.shared.b32 %r2278, [%r209+128]; 2026-02-21T08:30:46.9213713Z cvt.u16.u32 %rs1549, %r2278; 2026-02-21T08:30:46.9213773Z prmt.b32 %r2279, %r2278, 0, 0x7771U; 2026-02-21T08:30:46.9213831Z cvt.u16.u32 %rs1550, %r2279; 2026-02-21T08:30:46.9213891Z prmt.b32 %r2280, %r2278, 0, 0x7772U; 2026-02-21T08:30:46.9213959Z cvt.u16.u32 %rs1551, %r2280; 2026-02-21T08:30:46.9214019Z prmt.b32 %r2281, %r2278, 0, 0x7773U; 2026-02-21T08:30:46.9214076Z cvt.u16.u32 %rs1552, %r2281; 2026-02-21T08:30:46.9214146Z ld.shared.b32 %r2282, [%r210]; 2026-02-21T08:30:46.9214204Z cvt.u16.u32 %rs1553, %r2282; 2026-02-21T08:30:46.9214265Z prmt.b32 %r2283, %r2282, 0, 0x7771U; 2026-02-21T08:30:46.9214330Z cvt.u16.u32 %rs1554, %r2283; 2026-02-21T08:30:46.9214389Z prmt.b32 %r2284, %r2282, 0, 0x7772U; 2026-02-21T08:30:46.9214447Z cvt.u16.u32 %rs1555, %r2284; 2026-02-21T08:30:46.9214509Z prmt.b32 %r2285, %r2282, 0, 0x7773U; 2026-02-21T08:30:46.9214577Z cvt.u16.u32 %rs1556, %r2285; 2026-02-21T08:30:46.9214639Z ld.shared.b32 %r2286, [%r210+128]; 2026-02-21T08:30:46.9214696Z cvt.u16.u32 %rs1557, %r2286; 2026-02-21T08:30:46.9214762Z prmt.b32 %r2287, %r2286, 0, 0x7771U; 2026-02-21T08:30:46.9214819Z cvt.u16.u32 %rs1558, %r2287; 2026-02-21T08:30:46.9214880Z prmt.b32 %r2288, %r2286, 0, 0x7772U; 2026-02-21T08:30:46.9214938Z cvt.u16.u32 %rs1559, %r2288; 2026-02-21T08:30:46.9215004Z prmt.b32 %r2289, %r2286, 0, 0x7773U; 2026-02-21T08:30:46.9215060Z cvt.u16.u32 %rs1560, %r2289; 2026-02-21T08:30:46.9215122Z ld.shared.b32 %r2290, [%r211]; 2026-02-21T08:30:46.9215187Z cvt.u16.u32 %rs1561, %r2290; 2026-02-21T08:30:46.9215247Z prmt.b32 %r2291, %r2290, 0, 0x7771U; 2026-02-21T08:30:46.9215305Z cvt.u16.u32 %rs1562, %r2291; 2026-02-21T08:30:46.9215365Z 
prmt.b32 %r2292, %r2290, 0, 0x7772U; 2026-02-21T08:30:46.9215431Z cvt.u16.u32 %rs1563, %r2292; 2026-02-21T08:30:46.9215493Z prmt.b32 %r2293, %r2290, 0, 0x7773U; 2026-02-21T08:30:46.9215600Z cvt.u16.u32 %rs1564, %r2293; 2026-02-21T08:30:46.9215670Z ld.shared.b32 %r2294, [%r211+128]; 2026-02-21T08:30:46.9215728Z cvt.u16.u32 %rs1565, %r2294; 2026-02-21T08:30:46.9215788Z prmt.b32 %r2295, %r2294, 0, 0x7771U; 2026-02-21T08:30:46.9215851Z cvt.u16.u32 %rs1566, %r2295; 2026-02-21T08:30:46.9215911Z prmt.b32 %r2296, %r2294, 0, 0x7772U; 2026-02-21T08:30:46.9215967Z cvt.u16.u32 %rs1567, %r2296; 2026-02-21T08:30:46.9216027Z prmt.b32 %r2297, %r2294, 0, 0x7773U; 2026-02-21T08:30:46.9216092Z cvt.u16.u32 %rs1568, %r2297; 2026-02-21T08:30:46.9216259Z .loc 1 72 58 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:72:58 2026-02-21T08:30:46.9216334Z selp.b16 %rs1569, %rs1505, %rs1537, %p246; 2026-02-21T08:30:46.9216400Z cvt.s16.s8 %rs1570, %rs1569; 2026-02-21T08:30:46.9216474Z selp.b16 %rs1571, %rs1513, %rs1545, %p246; 2026-02-21T08:30:46.9216533Z cvt.s16.s8 %rs1572, %rs1571; 2026-02-21T08:30:46.9216654Z selp.b16 %rs1573, %rs1521, %rs1553, %p246; 2026-02-21T08:30:46.9216724Z cvt.s16.s8 %rs1574, %rs1573; 2026-02-21T08:30:46.9216793Z selp.b16 %rs1575, %rs1529, %rs1561, %p246; 2026-02-21T08:30:46.9216853Z cvt.s16.s8 %rs1576, %rs1575; 2026-02-21T08:30:46.9216930Z selp.b16 %rs1577, %rs1506, %rs1538, %p246; 2026-02-21T08:30:46.9216988Z cvt.s16.s8 %rs1578, %rs1577; 2026-02-21T08:30:46.9217056Z selp.b16 %rs1579, %rs1514, %rs1546, %p246; 2026-02-21T08:30:46.9217123Z cvt.s16.s8 %rs1580, %rs1579; 2026-02-21T08:30:46.9217190Z selp.b16 %rs1581, %rs1522, %rs1554, %p246; 2026-02-21T08:30:46.9217249Z cvt.s16.s8 %rs1582, %rs1581; 2026-02-21T08:30:46.9217316Z selp.b16 %rs1583, %rs1530, %rs1562, %p246; 2026-02-21T08:30:46.9217382Z cvt.s16.s8 %rs1584, %rs1583; 2026-02-21T08:30:46.9217450Z selp.b16 %rs1585, %rs1507, %rs1539, %p246; 2026-02-21T08:30:46.9217509Z cvt.s16.s8 %rs1586, %rs1585; 
2026-02-21T08:30:46.9217582Z selp.b16 %rs1587, %rs1515, %rs1547, %p246; 2026-02-21T08:30:46.9217645Z cvt.s16.s8 %rs1588, %rs1587; 2026-02-21T08:30:46.9217716Z selp.b16 %rs1589, %rs1523, %rs1555, %p246; 2026-02-21T08:30:46.9217775Z cvt.s16.s8 %rs1590, %rs1589; 2026-02-21T08:30:46.9217854Z selp.b16 %rs1591, %rs1531, %rs1563, %p246; 2026-02-21T08:30:46.9217914Z cvt.s16.s8 %rs1592, %rs1591; 2026-02-21T08:30:46.9217982Z selp.b16 %rs1593, %rs1508, %rs1540, %p246; 2026-02-21T08:30:46.9218051Z cvt.s16.s8 %rs1594, %rs1593; 2026-02-21T08:30:46.9218119Z selp.b16 %rs1595, %rs1516, %rs1548, %p246; 2026-02-21T08:30:46.9218178Z cvt.s16.s8 %rs1596, %rs1595; 2026-02-21T08:30:46.9218260Z selp.b16 %rs1597, %rs1524, %rs1556, %p246; 2026-02-21T08:30:46.9218322Z cvt.s16.s8 %rs1598, %rs1597; 2026-02-21T08:30:46.9218393Z selp.b16 %rs1599, %rs1532, %rs1564, %p246; 2026-02-21T08:30:46.9218454Z cvt.s16.s8 %rs1600, %rs1599; 2026-02-21T08:30:46.9218538Z selp.b16 %rs1601, %rs1509, %rs1541, %p246; 2026-02-21T08:30:46.9218597Z cvt.s16.s8 %rs1602, %rs1601; 2026-02-21T08:30:46.9218664Z selp.b16 %rs1603, %rs1517, %rs1549, %p246; 2026-02-21T08:30:46.9218730Z cvt.s16.s8 %rs1604, %rs1603; 2026-02-21T08:30:46.9218800Z selp.b16 %rs1605, %rs1525, %rs1557, %p246; 2026-02-21T08:30:46.9218857Z cvt.s16.s8 %rs1606, %rs1605; 2026-02-21T08:30:46.9218923Z selp.b16 %rs1607, %rs1533, %rs1565, %p246; 2026-02-21T08:30:46.9218989Z cvt.s16.s8 %rs1608, %rs1607; 2026-02-21T08:30:46.9219056Z selp.b16 %rs1609, %rs1510, %rs1542, %p246; 2026-02-21T08:30:46.9219114Z cvt.s16.s8 %rs1610, %rs1609; 2026-02-21T08:30:46.9219188Z selp.b16 %rs1611, %rs1518, %rs1550, %p246; 2026-02-21T08:30:46.9219245Z cvt.s16.s8 %rs1612, %rs1611; 2026-02-21T08:30:46.9219313Z selp.b16 %rs1613, %rs1526, %rs1558, %p246; 2026-02-21T08:30:46.9219377Z cvt.s16.s8 %rs1614, %rs1613; 2026-02-21T08:30:46.9219444Z selp.b16 %rs1615, %rs1534, %rs1566, %p246; 2026-02-21T08:30:46.9219501Z cvt.s16.s8 %rs1616, %rs1615; 2026-02-21T08:30:46.9219568Z selp.b16 %rs1617, 
%rs1511, %rs1543, %p246; 2026-02-21T08:30:46.9219633Z cvt.s16.s8 %rs1618, %rs1617; 2026-02-21T08:30:46.9219703Z selp.b16 %rs1619, %rs1519, %rs1551, %p246; 2026-02-21T08:30:46.9219804Z cvt.s16.s8 %rs1620, %rs1619; 2026-02-21T08:30:46.9219876Z selp.b16 %rs1621, %rs1527, %rs1559, %p246; 2026-02-21T08:30:46.9219934Z cvt.s16.s8 %rs1622, %rs1621; 2026-02-21T08:30:46.9220000Z selp.b16 %rs1623, %rs1535, %rs1567, %p246; 2026-02-21T08:30:46.9220059Z cvt.s16.s8 %rs1624, %rs1623; 2026-02-21T08:30:46.9220134Z selp.b16 %rs1625, %rs1512, %rs1544, %p246; 2026-02-21T08:30:46.9220193Z cvt.s16.s8 %rs1626, %rs1625; 2026-02-21T08:30:46.9220260Z selp.b16 %rs1627, %rs1520, %rs1552, %p246; 2026-02-21T08:30:46.9220324Z cvt.s16.s8 %rs1628, %rs1627; 2026-02-21T08:30:46.9220390Z selp.b16 %rs1629, %rs1528, %rs1560, %p246; 2026-02-21T08:30:46.9220448Z cvt.s16.s8 %rs1630, %rs1629; 2026-02-21T08:30:46.9220522Z selp.b16 %rs1631, %rs1536, %rs1568, %p246; 2026-02-21T08:30:46.9220581Z cvt.s16.s8 %rs1632, %rs1631; 2026-02-21T08:30:46.9220799Z .loc 1 77 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:77:32 2026-02-21T08:30:46.9220867Z cvt.rn.f32.s16 %r2298, %rs1570; 2026-02-21T08:30:46.9220936Z cvt.rn.f32.s16 %r2299, %rs1572; 2026-02-21T08:30:46.9220997Z cvt.rn.f32.s16 %r2300, %rs1574; 2026-02-21T08:30:46.9221056Z cvt.rn.f32.s16 %r2301, %rs1576; 2026-02-21T08:30:46.9221120Z cvt.rn.f32.s16 %r2302, %rs1578; 2026-02-21T08:30:46.9221178Z cvt.rn.f32.s16 %r2303, %rs1580; 2026-02-21T08:30:46.9221237Z cvt.rn.f32.s16 %r2304, %rs1582; 2026-02-21T08:30:46.9221295Z cvt.rn.f32.s16 %r2305, %rs1584; 2026-02-21T08:30:46.9221361Z cvt.rn.f32.s16 %r2306, %rs1586; 2026-02-21T08:30:46.9221420Z cvt.rn.f32.s16 %r2307, %rs1588; 2026-02-21T08:30:46.9221479Z cvt.rn.f32.s16 %r2308, %rs1590; 2026-02-21T08:30:46.9221590Z cvt.rn.f32.s16 %r2309, %rs1592; 2026-02-21T08:30:46.9221651Z cvt.rn.f32.s16 %r2310, %rs1594; 2026-02-21T08:30:46.9221709Z cvt.rn.f32.s16 %r2311, %rs1596; 2026-02-21T08:30:46.9221773Z cvt.rn.f32.s16 
%r2312, %rs1598; 2026-02-21T08:30:46.9221831Z cvt.rn.f32.s16 %r2313, %rs1600; 2026-02-21T08:30:46.9221894Z cvt.rn.f32.s16 %r2314, %rs1602; 2026-02-21T08:30:46.9221953Z cvt.rn.f32.s16 %r2315, %rs1604; 2026-02-21T08:30:46.9222017Z cvt.rn.f32.s16 %r2316, %rs1606; 2026-02-21T08:30:46.9222075Z cvt.rn.f32.s16 %r2317, %rs1608; 2026-02-21T08:30:46.9222133Z cvt.rn.f32.s16 %r2318, %rs1610; 2026-02-21T08:30:46.9222200Z cvt.rn.f32.s16 %r2319, %rs1612; 2026-02-21T08:30:46.9222257Z cvt.rn.f32.s16 %r2320, %rs1614; 2026-02-21T08:30:46.9222314Z cvt.rn.f32.s16 %r2321, %rs1616; 2026-02-21T08:30:46.9222372Z cvt.rn.f32.s16 %r2322, %rs1618; 2026-02-21T08:30:46.9222437Z cvt.rn.f32.s16 %r2323, %rs1620; 2026-02-21T08:30:46.9222496Z cvt.rn.f32.s16 %r2324, %rs1622; 2026-02-21T08:30:46.9222556Z cvt.rn.f32.s16 %r2325, %rs1624; 2026-02-21T08:30:46.9222620Z cvt.rn.f32.s16 %r2326, %rs1626; 2026-02-21T08:30:46.9222679Z cvt.rn.f32.s16 %r2327, %rs1628; 2026-02-21T08:30:46.9222739Z cvt.rn.f32.s16 %r2328, %rs1630; 2026-02-21T08:30:46.9222807Z cvt.rn.f32.s16 %r2329, %rs1632; 2026-02-21T08:30:46.9222863Z bar.sync 0; 2026-02-21T08:30:46.9222929Z st.shared.b32 [%r212], %r2298; 2026-02-21T08:30:46.9222993Z st.shared.b32 [%r212+8], %r2299; 2026-02-21T08:30:46.9223067Z st.shared.b32 [%r212+16384], %r2314; 2026-02-21T08:30:46.9223130Z st.shared.b32 [%r212+16392], %r2315; 2026-02-21T08:30:46.9223190Z st.shared.b32 [%r213], %r2300; 2026-02-21T08:30:46.9223257Z st.shared.b32 [%r213+8], %r2301; 2026-02-21T08:30:46.9223319Z st.shared.b32 [%r213+16384], %r2316; 2026-02-21T08:30:46.9223381Z st.shared.b32 [%r213+16392], %r2317; 2026-02-21T08:30:46.9223441Z st.shared.b32 [%r214], %r2302; 2026-02-21T08:30:46.9223509Z st.shared.b32 [%r214+8], %r2303; 2026-02-21T08:30:46.9223570Z st.shared.b32 [%r214+16384], %r2318; 2026-02-21T08:30:46.9223631Z st.shared.b32 [%r214+16392], %r2319; 2026-02-21T08:30:46.9223701Z st.shared.b32 [%r215], %r2304; 2026-02-21T08:30:46.9223761Z st.shared.b32 [%r215+8], %r2305; 
2026-02-21T08:30:46.9223824Z st.shared.b32 [%r215+16384], %r2320; 2026-02-21T08:30:46.9223887Z st.shared.b32 [%r215+16392], %r2321; 2026-02-21T08:30:46.9224007Z st.shared.b32 [%r216], %r2306; 2026-02-21T08:30:46.9224069Z st.shared.b32 [%r216+8], %r2307; 2026-02-21T08:30:46.9224131Z st.shared.b32 [%r216+16384], %r2322; 2026-02-21T08:30:46.9224199Z st.shared.b32 [%r216+16392], %r2323; 2026-02-21T08:30:46.9224261Z st.shared.b32 [%r217], %r2308; 2026-02-21T08:30:46.9224323Z st.shared.b32 [%r217+8], %r2309; 2026-02-21T08:30:46.9224397Z st.shared.b32 [%r217+16384], %r2324; 2026-02-21T08:30:46.9224459Z st.shared.b32 [%r217+16392], %r2325; 2026-02-21T08:30:46.9224521Z st.shared.b32 [%r218], %r2310; 2026-02-21T08:30:46.9224587Z st.shared.b32 [%r218+8], %r2311; 2026-02-21T08:30:46.9224653Z st.shared.b32 [%r218+16384], %r2326; 2026-02-21T08:30:46.9224711Z st.shared.b32 [%r218+16392], %r2327; 2026-02-21T08:30:46.9224768Z st.shared.b32 [%r219], %r2312; 2026-02-21T08:30:46.9224832Z st.shared.b32 [%r219+8], %r2313; 2026-02-21T08:30:46.9224940Z st.shared.b32 [%r219+16384], %r2328; 2026-02-21T08:30:46.9225000Z st.shared.b32 [%r219+16392], %r2329; 2026-02-21T08:30:46.9225059Z mov.pred %p224, -1; 2026-02-21T08:30:46.9225113Z $L__tmp181: 2026-02-21T08:30:46.9225338Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9225392Z // begin inline asm 2026-02-21T08:30:46.9225718Z @%p224 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2170 + 0], 16, {%r2171, %r2172, %r2173, %r2174, %r2175, %r2176, %r2177, %r2178, %r2179, %r2180, %r2181, %r2182, %r2183, %r2184, %r2185, %r2186}; 2026-02-21T08:30:46.9225771Z // end inline asm 2026-02-21T08:30:46.9225828Z // begin inline asm 2026-02-21T08:30:46.9225907Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:46.9225963Z // end inline asm 2026-02-21T08:30:46.9226014Z bar.sync 0; 2026-02-21T08:30:46.9226067Z // begin inline asm 2026-02-21T08:30:46.9226140Z fence.proxy.async.shared::cta; 
2026-02-21T08:30:46.9226191Z // end inline asm 2026-02-21T08:30:46.9226252Z add.s32 %r2358, %r265, 57344; 2026-02-21T08:30:46.9226313Z // begin inline asm 2026-02-21T08:30:46.9226404Z @%p204 mbarrier.init.shared::cta.b64 [%r2358], 1; 2026-02-21T08:30:46.9226456Z // end inline asm 2026-02-21T08:30:46.9226515Z bar.sync 0; 2026-02-21T08:30:46.9226571Z @%p219 bra $L__BB0_34; 2026-02-21T08:30:46.9226668Z // %bb.33: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:46.9226729Z elect.sync %r2355|%p223, -1; 2026-02-21T08:30:46.9226794Z mov.b32 %r2333, 69208336; 2026-02-21T08:30:46.9226847Z // begin inline asm 2026-02-21T08:30:46.9227006Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 0 ], %rd529, %r2333, %p247; 2026-02-21T08:30:46.9227065Z // end inline asm 2026-02-21T08:30:46.9227118Z // begin inline asm 2026-02-21T08:30:46.9227271Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 8 ], %rd530, %r2333, %p224; 2026-02-21T08:30:46.9227326Z // end inline asm 2026-02-21T08:30:46.9227385Z // begin inline asm 2026-02-21T08:30:46.9227542Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 16 ], %rd531, %r2333, %p224; 2026-02-21T08:30:46.9227594Z // end inline asm 2026-02-21T08:30:46.9227655Z // begin inline asm 2026-02-21T08:30:46.9227803Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 24 ], %rd532, %r2333, %p224; 2026-02-21T08:30:46.9227858Z // end inline asm 2026-02-21T08:30:46.9227914Z // begin inline asm 2026-02-21T08:30:46.9228059Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 32 ], %rd533, %r2333, %p224; 2026-02-21T08:30:46.9228114Z // end inline asm 2026-02-21T08:30:46.9228176Z // begin inline asm 2026-02-21T08:30:46.9228319Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 40 ], %rd534, %r2333, %p224; 2026-02-21T08:30:46.9228370Z // end inline asm 2026-02-21T08:30:46.9228429Z // begin inline asm 2026-02-21T08:30:46.9228575Z @%p223 
tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 48 ], %rd535, %r2333, %p224; 2026-02-21T08:30:46.9228672Z // end inline asm 2026-02-21T08:30:46.9228725Z // begin inline asm 2026-02-21T08:30:46.9228876Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r2508 + 0 ], [ %r2332 + 56 ], %rd536, %r2333, %p224; 2026-02-21T08:30:46.9228927Z // end inline asm 2026-02-21T08:30:46.9228988Z cvt.u64.u32 %rd537, %r2358; 2026-02-21T08:30:46.9229045Z // begin inline asm 2026-02-21T08:30:46.9229170Z @%p223 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd537]; 2026-02-21T08:30:46.9229222Z // end inline asm 2026-02-21T08:30:46.9229279Z $L__tmp182: 2026-02-21T08:30:46.9229375Z $L__BB0_34: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:46.9229545Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9229609Z setp.eq.b32 %p240, %r248, 0; 2026-02-21T08:30:46.9229676Z setp.lt.s32 %p241, %r2574, %r190; 2026-02-21T08:30:46.9229730Z mov.b32 %r2359, 0; 2026-02-21T08:30:46.9229820Z $L__tmp183: 2026-02-21T08:30:46.9230045Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9230099Z // begin inline asm 2026-02-21T08:30:46.9230149Z 2026-02-21T08:30:46.9230198Z { 2026-02-21T08:30:46.9230263Z .reg .pred complete; 2026-02-21T08:30:46.9230314Z waitLoop: 2026-02-21T08:30:46.9230432Z mbarrier.try_wait.parity.shared.b64 complete, [%r2358], %r2359; 2026-02-21T08:30:46.9230502Z @!complete bra.uni waitLoop; 2026-02-21T08:30:46.9230548Z } 2026-02-21T08:30:46.9230552Z 2026-02-21T08:30:46.9230603Z // end inline asm 2026-02-21T08:30:46.9230656Z bar.sync 0; 2026-02-21T08:30:46.9230708Z // begin inline asm 2026-02-21T08:30:46.9230791Z @%p204 mbarrier.inval.shared::cta.b64 [%r2358]; 2026-02-21T08:30:46.9230843Z // end inline asm 2026-02-21T08:30:46.9230904Z $L__tmp184: 2026-02-21T08:30:46.9231076Z .loc 1 19 112 // 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9231136Z add.s32 %r2366, %r2565, 32; 2026-02-21T08:30:46.9231202Z add.s32 %r2367, %r2568, 1; 2026-02-21T08:30:46.9231262Z setp.gt.s32 %p242, %r2367, 2; 2026-02-21T08:30:46.9231327Z selp.b32 %r2568, 0, %r2367, %p242; 2026-02-21T08:30:46.9231387Z selp.b32 %r2564, 0, %r2366, %p240; 2026-02-21T08:30:46.9231614Z .loc 1 45 22 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:45:22 2026-02-21T08:30:46.9231674Z shl.b32 %r2368, %r2564, 1; 2026-02-21T08:30:46.9231849Z .loc 1 47 25 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:47:25 2026-02-21T08:30:46.9231916Z add.s32 %r2369, %r2368, %r16; 2026-02-21T08:30:46.9232093Z .loc 1 48 53 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:53 2026-02-21T08:30:46.9232154Z shl.b32 %r2370, %r2581, 10; 2026-02-21T08:30:46.9232216Z shl.b32 %r2371, %r2582, 10; 2026-02-21T08:30:46.9232395Z .loc 1 48 60 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:60 2026-02-21T08:30:46.9232459Z add.s32 %r2372, %r2370, %r2369; 2026-02-21T08:30:46.9232528Z add.s32 %r2373, %r2371, %r2369; 2026-02-21T08:30:46.9232705Z .loc 1 48 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:32 2026-02-21T08:30:46.9232775Z mad.wide.s32 %rd538, %r2372, 2, %rd47; 2026-02-21T08:30:46.9232847Z mad.wide.s32 %rd539, %r2373, 2, %rd47; 2026-02-21T08:30:46.9233027Z .loc 1 48 80 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:48:80 2026-02-21T08:30:46.9233084Z shl.b32 %r2374, %r2568, 13; 2026-02-21T08:30:46.9233143Z add.s32 %r2375, %r265, %r2374; 2026-02-21T08:30:46.9233209Z add.s32 %r2376, %r2375, 32768; 2026-02-21T08:30:46.9233267Z add.s32 %r2361, %r2376, %r2555; 2026-02-21T08:30:46.9233328Z selp.b32 %r2362, 16, 0, %p241; 2026-02-21T08:30:46.9233391Z // begin inline asm 2026-02-21T08:30:46.9233526Z cp.async.cg.shared.global [ %r2361 + 0 ], [ %rd538 + 0 ], 0x10, %r2362; 2026-02-21T08:30:46.9233632Z // end inline asm 
2026-02-21T08:30:46.9233695Z add.s32 %r2363, %r2376, %r2554; 2026-02-21T08:30:46.9233763Z // begin inline asm 2026-02-21T08:30:46.9233889Z cp.async.cg.shared.global [ %r2363 + 0 ], [ %rd539 + 0 ], 0x10, %r2362; 2026-02-21T08:30:46.9233946Z // end inline asm 2026-02-21T08:30:46.9234019Z cp.async.commit_group; 2026-02-21T08:30:46.9234197Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9234263Z setp.ne.b32 %p247, %r2561, 15; 2026-02-21T08:30:46.9234326Z @%p247 bra $L__BB0_36; 2026-02-21T08:30:46.9234433Z // %bb.35: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:46.9234605Z .loc 1 31 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:31:32 2026-02-21T08:30:46.9234668Z add.s32 %r2447, %r2559, %r8; 2026-02-21T08:30:46.9234910Z .loc 1 33 32 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:33:32 2026-02-21T08:30:46.9234977Z add.s32 %r2448, %r2557, %r12; 2026-02-21T08:30:46.9235039Z add.s32 %r2449, %r2557, %r13; 2026-02-21T08:30:46.9235106Z add.s32 %r2450, %r2557, %r14; 2026-02-21T08:30:46.9235167Z add.s32 %r2451, %r2557, %r15; 2026-02-21T08:30:46.9235347Z .loc 1 88 43 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:43 2026-02-21T08:30:46.9235416Z shl.b32 %r2452, %r2448, 13; 2026-02-21T08:30:46.9235476Z shl.b32 %r2453, %r2449, 13; 2026-02-21T08:30:46.9235535Z shl.b32 %r2454, %r2450, 13; 2026-02-21T08:30:46.9235594Z shl.b32 %r2455, %r2451, 13; 2026-02-21T08:30:46.9235779Z .loc 1 88 50 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:50 2026-02-21T08:30:46.9235842Z add.s32 %r2456, %r2452, %r2447; 2026-02-21T08:30:46.9235903Z add.s32 %r2457, %r2453, %r2447; 2026-02-21T08:30:46.9235972Z add.s32 %r2458, %r2454, %r2447; 2026-02-21T08:30:46.9236034Z add.s32 %r2459, %r2455, %r2447; 2026-02-21T08:30:46.9236218Z .loc 1 88 22 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:22 2026-02-21T08:30:46.9236296Z mad.wide.s32 %rd540, %r2456, 2, %rd49; 
2026-02-21T08:30:46.9236364Z mad.wide.s32 %rd541, %r2457, 2, %rd49; 2026-02-21T08:30:46.9236431Z mad.wide.s32 %rd542, %r2458, 2, %rd49; 2026-02-21T08:30:46.9236497Z mad.wide.s32 %rd543, %r2459, 2, %rd49; 2026-02-21T08:30:46.9236560Z $L__tmp185: 2026-02-21T08:30:46.9236794Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9236853Z // begin inline asm 2026-02-21T08:30:46.9237195Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2377, %r2378, %r2379, %r2380, %r2381, %r2382, %r2383, %r2384, %r2385, %r2386, %r2387, %r2388, %r2389, %r2390, %r2391, %r2392}, [%r1981 + 0], 32; 2026-02-21T08:30:46.9237255Z // end inline asm 2026-02-21T08:30:46.9237314Z // begin inline asm 2026-02-21T08:30:46.9237648Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2394, %r2395, %r2396, %r2397, %r2398, %r2399, %r2400, %r2401, %r2402, %r2403, %r2404, %r2405, %r2406, %r2407, %r2408, %r2409}, [%r1981 + 16], 32; 2026-02-21T08:30:46.9237708Z // end inline asm 2026-02-21T08:30:46.9237767Z // begin inline asm 2026-02-21T08:30:46.9237839Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:46.9237904Z // end inline asm 2026-02-21T08:30:46.9237966Z cvt.u64.u32 %rd544, %r2377; 2026-02-21T08:30:46.9238026Z cvt.u64.u32 %rd545, %r2378; 2026-02-21T08:30:46.9238094Z shl.b64 %rd546, %rd545, 32; 2026-02-21T08:30:46.9238157Z or.b64 %rd547, %rd544, %rd546; 2026-02-21T08:30:46.9238213Z $L__tmp186: 2026-02-21T08:30:46.9238400Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9238465Z mov.b64 {%r2460, %r2461}, %rd547; 2026-02-21T08:30:46.9238541Z cvt.rn.bf16x2.f32 %r2462, %r2461, %r2460; 2026-02-21T08:30:46.9238594Z $L__tmp187: 2026-02-21T08:30:46.9238836Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9238947Z cvt.u64.u32 %rd548, %r2379; 2026-02-21T08:30:46.9239007Z cvt.u64.u32 %rd549, %r2380; 2026-02-21T08:30:46.9239075Z shl.b64 
%rd550, %rd549, 32; 2026-02-21T08:30:46.9239137Z or.b64 %rd551, %rd548, %rd550; 2026-02-21T08:30:46.9239191Z $L__tmp188: 2026-02-21T08:30:46.9239370Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9239448Z mov.b64 {%r2463, %r2464}, %rd551; 2026-02-21T08:30:46.9239521Z cvt.rn.bf16x2.f32 %r2465, %r2464, %r2463; 2026-02-21T08:30:46.9239571Z $L__tmp189: 2026-02-21T08:30:46.9239791Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9239849Z cvt.u64.u32 %rd552, %r2381; 2026-02-21T08:30:46.9239908Z cvt.u64.u32 %rd553, %r2382; 2026-02-21T08:30:46.9240009Z shl.b64 %rd554, %rd553, 32; 2026-02-21T08:30:46.9240073Z or.b64 %rd555, %rd552, %rd554; 2026-02-21T08:30:46.9240125Z $L__tmp190: 2026-02-21T08:30:46.9240292Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9240360Z mov.b64 {%r2466, %r2467}, %rd555; 2026-02-21T08:30:46.9240431Z cvt.rn.bf16x2.f32 %r2468, %r2467, %r2466; 2026-02-21T08:30:46.9240483Z $L__tmp191: 2026-02-21T08:30:46.9240701Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9240759Z cvt.u64.u32 %rd556, %r2383; 2026-02-21T08:30:46.9240815Z cvt.u64.u32 %rd557, %r2384; 2026-02-21T08:30:46.9240879Z shl.b64 %rd558, %rd557, 32; 2026-02-21T08:30:46.9240938Z or.b64 %rd559, %rd556, %rd558; 2026-02-21T08:30:46.9240990Z $L__tmp192: 2026-02-21T08:30:46.9241158Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9241228Z mov.b64 {%r2469, %r2470}, %rd559; 2026-02-21T08:30:46.9241300Z cvt.rn.bf16x2.f32 %r2471, %r2470, %r2469; 2026-02-21T08:30:46.9241353Z $L__tmp193: 2026-02-21T08:30:46.9241609Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9241670Z cvt.u64.u32 %rd560, %r2385; 2026-02-21T08:30:46.9241728Z 
cvt.u64.u32 %rd561, %r2386; 2026-02-21T08:30:46.9241795Z shl.b64 %rd562, %rd561, 32; 2026-02-21T08:30:46.9241858Z or.b64 %rd563, %rd560, %rd562; 2026-02-21T08:30:46.9241909Z $L__tmp194: 2026-02-21T08:30:46.9242076Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9242148Z mov.b64 {%r2472, %r2473}, %rd563; 2026-02-21T08:30:46.9242220Z cvt.rn.bf16x2.f32 %r2474, %r2473, %r2472; 2026-02-21T08:30:46.9242272Z $L__tmp195: 2026-02-21T08:30:46.9242493Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9242556Z cvt.u64.u32 %rd564, %r2387; 2026-02-21T08:30:46.9242614Z cvt.u64.u32 %rd565, %r2388; 2026-02-21T08:30:46.9242680Z shl.b64 %rd566, %rd565, 32; 2026-02-21T08:30:46.9242743Z or.b64 %rd567, %rd564, %rd566; 2026-02-21T08:30:46.9242794Z $L__tmp196: 2026-02-21T08:30:46.9242960Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9243029Z mov.b64 {%r2475, %r2476}, %rd567; 2026-02-21T08:30:46.9243098Z cvt.rn.bf16x2.f32 %r2477, %r2476, %r2475; 2026-02-21T08:30:46.9243150Z $L__tmp197: 2026-02-21T08:30:46.9243365Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9243423Z cvt.u64.u32 %rd568, %r2389; 2026-02-21T08:30:46.9243481Z cvt.u64.u32 %rd569, %r2390; 2026-02-21T08:30:46.9243540Z shl.b64 %rd570, %rd569, 32; 2026-02-21T08:30:46.9243608Z or.b64 %rd571, %rd568, %rd570; 2026-02-21T08:30:46.9243715Z $L__tmp198: 2026-02-21T08:30:46.9243881Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9243949Z mov.b64 {%r2478, %r2479}, %rd571; 2026-02-21T08:30:46.9244017Z cvt.rn.bf16x2.f32 %r2480, %r2479, %r2478; 2026-02-21T08:30:46.9244069Z $L__tmp199: 2026-02-21T08:30:46.9244282Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 
2026-02-21T08:30:46.9244342Z cvt.u64.u32 %rd572, %r2391; 2026-02-21T08:30:46.9244399Z cvt.u64.u32 %rd573, %r2392; 2026-02-21T08:30:46.9244457Z shl.b64 %rd574, %rd573, 32; 2026-02-21T08:30:46.9244523Z or.b64 %rd575, %rd572, %rd574; 2026-02-21T08:30:46.9244574Z $L__tmp200: 2026-02-21T08:30:46.9244740Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9244809Z mov.b64 {%r2481, %r2482}, %rd575; 2026-02-21T08:30:46.9244927Z cvt.rn.bf16x2.f32 %r2483, %r2482, %r2481; 2026-02-21T08:30:46.9244982Z $L__tmp201: 2026-02-21T08:30:46.9245198Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9245256Z cvt.u64.u32 %rd576, %r2394; 2026-02-21T08:30:46.9245312Z cvt.u64.u32 %rd577, %r2395; 2026-02-21T08:30:46.9245370Z shl.b64 %rd578, %rd577, 32; 2026-02-21T08:30:46.9245439Z or.b64 %rd579, %rd576, %rd578; 2026-02-21T08:30:46.9245492Z $L__tmp202: 2026-02-21T08:30:46.9245659Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9245730Z mov.b64 {%r2484, %r2485}, %rd579; 2026-02-21T08:30:46.9245798Z cvt.rn.bf16x2.f32 %r2486, %r2485, %r2484; 2026-02-21T08:30:46.9245851Z $L__tmp203: 2026-02-21T08:30:46.9246072Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9246132Z cvt.u64.u32 %rd580, %r2396; 2026-02-21T08:30:46.9246191Z cvt.u64.u32 %rd581, %r2397; 2026-02-21T08:30:46.9246249Z shl.b64 %rd582, %rd581, 32; 2026-02-21T08:30:46.9246315Z or.b64 %rd583, %rd580, %rd582; 2026-02-21T08:30:46.9246366Z $L__tmp204: 2026-02-21T08:30:46.9246534Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9246600Z mov.b64 {%r2487, %r2488}, %rd583; 2026-02-21T08:30:46.9246667Z cvt.rn.bf16x2.f32 %r2489, %r2488, %r2487; 2026-02-21T08:30:46.9246718Z $L__tmp205: 2026-02-21T08:30:46.9246934Z .loc 2 291 36 // standard.py:291:36 @[ 
crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9246991Z cvt.u64.u32 %rd584, %r2398; 2026-02-21T08:30:46.9247047Z cvt.u64.u32 %rd585, %r2399; 2026-02-21T08:30:46.9247104Z shl.b64 %rd586, %rd585, 32; 2026-02-21T08:30:46.9247169Z or.b64 %rd587, %rd584, %rd586; 2026-02-21T08:30:46.9247221Z $L__tmp206: 2026-02-21T08:30:46.9247387Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9247456Z mov.b64 {%r2490, %r2491}, %rd587; 2026-02-21T08:30:46.9247523Z cvt.rn.bf16x2.f32 %r2492, %r2491, %r2490; 2026-02-21T08:30:46.9247575Z $L__tmp207: 2026-02-21T08:30:46.9247783Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9247847Z cvt.u64.u32 %rd588, %r2400; 2026-02-21T08:30:46.9247904Z cvt.u64.u32 %rd589, %r2401; 2026-02-21T08:30:46.9247962Z shl.b64 %rd590, %rd589, 32; 2026-02-21T08:30:46.9248029Z or.b64 %rd591, %rd588, %rd590; 2026-02-21T08:30:46.9248080Z $L__tmp208: 2026-02-21T08:30:46.9248245Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9248309Z mov.b64 {%r2493, %r2494}, %rd591; 2026-02-21T08:30:46.9248378Z cvt.rn.bf16x2.f32 %r2495, %r2494, %r2493; 2026-02-21T08:30:46.9248433Z $L__tmp209: 2026-02-21T08:30:46.9248687Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9248754Z cvt.u64.u32 %rd592, %r2402; 2026-02-21T08:30:46.9248810Z cvt.u64.u32 %rd593, %r2403; 2026-02-21T08:30:46.9248868Z shl.b64 %rd594, %rd593, 32; 2026-02-21T08:30:46.9248935Z or.b64 %rd595, %rd592, %rd594; 2026-02-21T08:30:46.9248987Z $L__tmp210: 2026-02-21T08:30:46.9249155Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9249222Z mov.b64 {%r2496, %r2497}, %rd595; 2026-02-21T08:30:46.9249290Z cvt.rn.bf16x2.f32 %r2498, %r2497, %r2496; 2026-02-21T08:30:46.9249343Z $L__tmp211: 
2026-02-21T08:30:46.9249555Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9249620Z cvt.u64.u32 %rd596, %r2404; 2026-02-21T08:30:46.9249676Z cvt.u64.u32 %rd597, %r2405; 2026-02-21T08:30:46.9249780Z shl.b64 %rd598, %rd597, 32; 2026-02-21T08:30:46.9249850Z or.b64 %rd599, %rd596, %rd598; 2026-02-21T08:30:46.9249902Z $L__tmp212: 2026-02-21T08:30:46.9250068Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9250136Z mov.b64 {%r2499, %r2500}, %rd599; 2026-02-21T08:30:46.9250204Z cvt.rn.bf16x2.f32 %r2501, %r2500, %r2499; 2026-02-21T08:30:46.9250256Z $L__tmp213: 2026-02-21T08:30:46.9250467Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9250535Z cvt.u64.u32 %rd600, %r2406; 2026-02-21T08:30:46.9250593Z cvt.u64.u32 %rd601, %r2407; 2026-02-21T08:30:46.9250651Z shl.b64 %rd602, %rd601, 32; 2026-02-21T08:30:46.9250719Z or.b64 %rd603, %rd600, %rd602; 2026-02-21T08:30:46.9250772Z $L__tmp214: 2026-02-21T08:30:46.9250941Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9251012Z mov.b64 {%r2502, %r2503}, %rd603; 2026-02-21T08:30:46.9251080Z cvt.rn.bf16x2.f32 %r2504, %r2503, %r2502; 2026-02-21T08:30:46.9251131Z $L__tmp215: 2026-02-21T08:30:46.9251339Z .loc 2 291 36 // standard.py:291:36 @[ crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:84:40 ] 2026-02-21T08:30:46.9251406Z cvt.u64.u32 %rd604, %r2408; 2026-02-21T08:30:46.9251462Z cvt.u64.u32 %rd605, %r2409; 2026-02-21T08:30:46.9251520Z shl.b64 %rd606, %rd605, 32; 2026-02-21T08:30:46.9251617Z or.b64 %rd607, %rd604, %rd606; 2026-02-21T08:30:46.9251670Z $L__tmp216: 2026-02-21T08:30:46.9251840Z .loc 1 87 28 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:87:28 2026-02-21T08:30:46.9251900Z mov.b64 {%r2505, %r2506}, %rd607; 2026-02-21T08:30:46.9251977Z cvt.rn.bf16x2.f32 
%r2507, %r2506, %r2505; 2026-02-21T08:30:46.9252145Z .loc 1 88 81 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:88:81 2026-02-21T08:30:46.9252249Z st.shared.v4.b32 [%r222], {%r2462, %r2474, %r2486, %r2498}; 2026-02-21T08:30:46.9252313Z bar.sync 0; 2026-02-21T08:30:46.9252372Z // begin inline asm 2026-02-21T08:30:46.9252532Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2431, %r2435, %r2439, %r2443}, [%r2415]; 2026-02-21T08:30:46.9252595Z // end inline asm 2026-02-21T08:30:46.9252649Z bar.sync 0; 2026-02-21T08:30:46.9252747Z st.shared.v4.b32 [%r222], {%r2465, %r2477, %r2489, %r2501}; 2026-02-21T08:30:46.9252802Z bar.sync 0; 2026-02-21T08:30:46.9252868Z // begin inline asm 2026-02-21T08:30:46.9253019Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2432, %r2436, %r2440, %r2444}, [%r2415]; 2026-02-21T08:30:46.9253075Z // end inline asm 2026-02-21T08:30:46.9253136Z bar.sync 0; 2026-02-21T08:30:46.9253231Z st.shared.v4.b32 [%r222], {%r2468, %r2480, %r2492, %r2504}; 2026-02-21T08:30:46.9253286Z bar.sync 0; 2026-02-21T08:30:46.9253353Z // begin inline asm 2026-02-21T08:30:46.9253505Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2433, %r2437, %r2441, %r2445}, [%r2415]; 2026-02-21T08:30:46.9253622Z // end inline asm 2026-02-21T08:30:46.9253676Z bar.sync 0; 2026-02-21T08:30:46.9253777Z st.shared.v4.b32 [%r222], {%r2471, %r2483, %r2495, %r2507}; 2026-02-21T08:30:46.9253830Z bar.sync 0; 2026-02-21T08:30:46.9253887Z // begin inline asm 2026-02-21T08:30:46.9254042Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2434, %r2438, %r2442, %r2446}, [%r2415]; 2026-02-21T08:30:46.9254099Z // end inline asm 2026-02-21T08:30:46.9254154Z // begin inline asm 2026-02-21T08:30:46.9254264Z st.global.v4.b32 [ %rd540 + 0 ], { %r2431, %r2432, %r2433, %r2434 }; 2026-02-21T08:30:46.9254328Z // end inline asm 2026-02-21T08:30:46.9254383Z // begin inline asm 2026-02-21T08:30:46.9254489Z st.global.v4.b32 [ %rd541 + 0 ], { %r2435, %r2436, %r2437, %r2438 }; 2026-02-21T08:30:46.9254552Z // end inline 
asm 2026-02-21T08:30:46.9254608Z // begin inline asm 2026-02-21T08:30:46.9254760Z st.global.v4.b32 [ %rd542 + 0 ], { %r2439, %r2440, %r2441, %r2442 }; 2026-02-21T08:30:46.9254827Z // end inline asm 2026-02-21T08:30:46.9254883Z // begin inline asm 2026-02-21T08:30:46.9254983Z st.global.v4.b32 [ %rd543 + 0 ], { %r2443, %r2444, %r2445, %r2446 }; 2026-02-21T08:30:46.9255037Z // end inline asm 2026-02-21T08:30:46.9255102Z bra.uni $L__BB0_36; 2026-02-21T08:30:46.9255191Z $L__BB0_37: // %._crit_edge488 2026-02-21T08:30:46.9255365Z .loc 1 19 112 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:112 2026-02-21T08:30:46.9255435Z cp.async.wait_group 0; 2026-02-21T08:30:46.9255490Z bar.sync 0; 2026-02-21T08:30:46.9255657Z .loc 1 19 4 // crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py:19:4 2026-02-21T08:30:46.9255712Z bar.sync 0; 2026-02-21T08:30:46.9255772Z // begin inline asm 2026-02-21T08:30:46.9255889Z @%p3 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r2508, 256; 2026-02-21T08:30:46.9255944Z // end inline asm 2026-02-21T08:30:46.9256005Z ret; 2026-02-21T08:30:46.9256059Z $L__tmp217: 2026-02-21T08:30:46.9256114Z $L__func_end0: 2026-02-21T08:30:46.9256204Z // -- End function 2026-02-21T08:30:46.9256254Z } 2026-02-21T08:30:46.9256464Z .file 1 "/tmp/torchinductor_root/rh/crhkdxpcwkcqlrnzcqdqoyf2zi7yaydwelbxpylnl2ekiq2dpfsn.py" 2026-02-21T08:30:46.9256637Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T08:30:46.9256707Z .section .debug_abbrev 2026-02-21T08:30:46.9256758Z { 2026-02-21T08:30:46.9256845Z .b8 1 // Abbreviation Code 2026-02-21T08:30:46.9256938Z .b8 17 // DW_TAG_compile_unit 2026-02-21T08:30:46.9257016Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:30:46.9257095Z .b8 37 // DW_AT_producer 2026-02-21T08:30:46.9257178Z .b8 8 // DW_FORM_string 2026-02-21T08:30:46.9257254Z .b8 19 // DW_AT_language 2026-02-21T08:30:46.9257334Z .b8 5 // DW_FORM_data2 2026-02-21T08:30:46.9257408Z .b8 3 // DW_AT_name 
2026-02-21T08:30:46.9257488Z .b8 8 // DW_FORM_string 2026-02-21T08:30:46.9257563Z .b8 16 // DW_AT_stmt_list 2026-02-21T08:30:46.9257638Z .b8 6 // DW_FORM_data4 2026-02-21T08:30:46.9257722Z .b8 27 // DW_AT_comp_dir 2026-02-21T08:30:46.9257793Z .b8 8 // DW_FORM_string 2026-02-21T08:30:46.9257864Z .b8 0 // EOM(1) 2026-02-21T08:30:46.9257941Z .b8 0 // EOM(2) 2026-02-21T08:30:46.9258026Z .b8 2 // Abbreviation Code 2026-02-21T08:30:46.9258107Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:30:46.9258182Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:30:46.9258306Z .b8 3 // DW_AT_name 2026-02-21T08:30:46.9258380Z .b8 8 // DW_FORM_string 2026-02-21T08:30:46.9258455Z .b8 32 // DW_AT_inline 2026-02-21T08:30:46.9258539Z .b8 11 // DW_FORM_data1 2026-02-21T08:30:46.9258609Z .b8 0 // EOM(1) 2026-02-21T08:30:46.9258674Z .b8 0 // EOM(2) 2026-02-21T08:30:46.9258761Z .b8 3 // Abbreviation Code 2026-02-21T08:30:46.9258841Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:30:46.9258918Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:30:46.9258998Z .b8 17 // DW_AT_low_pc 2026-02-21T08:30:46.9259080Z .b8 1 // DW_FORM_addr 2026-02-21T08:30:46.9259199Z .b8 18 // DW_AT_high_pc 2026-02-21T08:30:46.9259277Z .b8 1 // DW_FORM_addr 2026-02-21T08:30:46.9259378Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:30:46.9259449Z .b8 19 // DW_FORM_ref4 2026-02-21T08:30:46.9259515Z .b8 0 // EOM(1) 2026-02-21T08:30:46.9259587Z .b8 0 // EOM(2) 2026-02-21T08:30:46.9259667Z .b8 4 // Abbreviation Code 2026-02-21T08:30:46.9259756Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T08:30:46.9259835Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:30:46.9259917Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:30:46.9259987Z .b8 19 // DW_FORM_ref4 2026-02-21T08:30:46.9260058Z .b8 17 // DW_AT_low_pc 2026-02-21T08:30:46.9260137Z .b8 1 // DW_FORM_addr 2026-02-21T08:30:46.9260216Z .b8 18 // DW_AT_high_pc 2026-02-21T08:30:46.9260285Z .b8 1 // DW_FORM_addr 2026-02-21T08:30:46.9260366Z .b8 88 // DW_AT_call_file 2026-02-21T08:30:46.9260440Z .b8 11 // DW_FORM_data1 
2026-02-21T08:30:46.9260515Z .b8 89 // DW_AT_call_line 2026-02-21T08:30:46.9260593Z .b8 11 // DW_FORM_data1 2026-02-21T08:30:46.9260670Z .b8 87 // DW_AT_call_column 2026-02-21T08:30:46.9260741Z .b8 11 // DW_FORM_data1 2026-02-21T08:30:46.9260806Z .b8 0 // EOM(1) 2026-02-21T08:30:46.9260881Z .b8 0 // EOM(2) 2026-02-21T08:30:46.9260946Z .b8 0 // EOM(3) 2026-02-21T08:30:46.9260998Z } 2026-02-21T08:30:46.9261067Z .section .debug_info 2026-02-21T08:30:46.9261117Z { 2026-02-21T08:30:46.9261200Z .b32 178 // Length of Unit 2026-02-21T08:30:46.9261288Z .b8 2 // DWARF version number 2026-02-21T08:30:46.9261341Z .b8 0 2026-02-21T08:30:46.9261454Z .b32 .debug_abbrev // Offset Into Abbrev. Section 2026-02-21T08:30:46.9261569Z .b8 8 // Address Size (in bytes) 2026-02-21T08:30:46.9261677Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T08:30:46.9261757Z .b8 116 // DW_AT_producer 2026-02-21T08:30:46.9261810Z .b8 114 2026-02-21T08:30:46.9261868Z .b8 105 2026-02-21T08:30:46.9261921Z .b8 116 2026-02-21T08:30:46.9261971Z .b8 111 2026-02-21T08:30:46.9262022Z .b8 110 2026-02-21T08:30:46.9262077Z .b8 0 2026-02-21T08:30:46.9262151Z .b8 2 // DW_AT_language 2026-02-21T08:30:46.9262203Z .b8 0 2026-02-21T08:30:46.9262334Z .b8 99 // DW_AT_name 2026-02-21T08:30:46.9262387Z .b8 114 2026-02-21T08:30:46.9262437Z .b8 104 2026-02-21T08:30:46.9262487Z .b8 107 2026-02-21T08:30:46.9262546Z .b8 100 2026-02-21T08:30:46.9262597Z .b8 120 2026-02-21T08:30:46.9262647Z .b8 112 2026-02-21T08:30:46.9262705Z .b8 99 2026-02-21T08:30:46.9262756Z .b8 119 2026-02-21T08:30:46.9262805Z .b8 107 2026-02-21T08:30:46.9262855Z .b8 99 2026-02-21T08:30:46.9262912Z .b8 113 2026-02-21T08:30:46.9262960Z .b8 108 2026-02-21T08:30:46.9263008Z .b8 114 2026-02-21T08:30:46.9263064Z .b8 110 2026-02-21T08:30:46.9263112Z .b8 122 2026-02-21T08:30:46.9263163Z .b8 99 2026-02-21T08:30:46.9263211Z .b8 113 2026-02-21T08:30:46.9263265Z .b8 100 2026-02-21T08:30:46.9263314Z .b8 113 2026-02-21T08:30:46.9263362Z .b8 111 
2026-02-21T08:30:46.9263411Z .b8 121 2026-02-21T08:30:46.9263468Z .b8 102 2026-02-21T08:30:46.9263517Z .b8 50 2026-02-21T08:30:46.9263565Z .b8 122 2026-02-21T08:30:46.9263669Z .b8 105 2026-02-21T08:30:46.9263723Z .b8 55 2026-02-21T08:30:46.9263775Z .b8 121 2026-02-21T08:30:46.9263823Z .b8 97 2026-02-21T08:30:46.9263878Z .b8 121 2026-02-21T08:30:46.9263928Z .b8 100 2026-02-21T08:30:46.9263978Z .b8 119 2026-02-21T08:30:46.9264033Z .b8 101 2026-02-21T08:30:46.9264080Z .b8 108 2026-02-21T08:30:46.9264129Z .b8 98 2026-02-21T08:30:46.9264177Z .b8 120 2026-02-21T08:30:46.9264232Z .b8 112 2026-02-21T08:30:46.9264282Z .b8 121 2026-02-21T08:30:46.9264330Z .b8 108 2026-02-21T08:30:46.9264379Z .b8 110 2026-02-21T08:30:46.9264434Z .b8 108 2026-02-21T08:30:46.9264483Z .b8 50 2026-02-21T08:30:46.9264531Z .b8 101 2026-02-21T08:30:46.9264586Z .b8 107 2026-02-21T08:30:46.9264634Z .b8 105 2026-02-21T08:30:46.9264683Z .b8 113 2026-02-21T08:30:46.9264731Z .b8 50 2026-02-21T08:30:46.9264784Z .b8 100 2026-02-21T08:30:46.9264831Z .b8 112 2026-02-21T08:30:46.9264877Z .b8 102 2026-02-21T08:30:46.9264926Z .b8 115 2026-02-21T08:30:46.9264972Z .b8 110 2026-02-21T08:30:46.9265018Z .b8 46 2026-02-21T08:30:46.9265067Z .b8 112 2026-02-21T08:30:46.9265119Z .b8 121 2026-02-21T08:30:46.9265169Z .b8 0 2026-02-21T08:30:46.9265258Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T08:30:46.9265342Z .b8 47 // DW_AT_comp_dir 2026-02-21T08:30:46.9265396Z .b8 116 2026-02-21T08:30:46.9265448Z .b8 109 2026-02-21T08:30:46.9265499Z .b8 112 2026-02-21T08:30:46.9265557Z .b8 47 2026-02-21T08:30:46.9265607Z .b8 116 2026-02-21T08:30:46.9265656Z .b8 111 2026-02-21T08:30:46.9265714Z .b8 114 2026-02-21T08:30:46.9265766Z .b8 99 2026-02-21T08:30:46.9265816Z .b8 104 2026-02-21T08:30:46.9265867Z .b8 105 2026-02-21T08:30:46.9265923Z .b8 110 2026-02-21T08:30:46.9265973Z .b8 100 2026-02-21T08:30:46.9266024Z .b8 117 2026-02-21T08:30:46.9266074Z .b8 99 2026-02-21T08:30:46.9266133Z .b8 116 2026-02-21T08:30:46.9266184Z .b8 111 
2026-02-21T08:30:46.9266234Z .b8 114 2026-02-21T08:30:46.9266290Z .b8 95 2026-02-21T08:30:46.9266341Z .b8 114 2026-02-21T08:30:46.9266391Z .b8 111 2026-02-21T08:30:46.9266444Z .b8 111 2026-02-21T08:30:46.9266501Z .b8 116 2026-02-21T08:30:46.9266552Z .b8 47 2026-02-21T08:30:46.9266598Z .b8 114 2026-02-21T08:30:46.9266649Z .b8 104 2026-02-21T08:30:46.9266698Z .b8 0 2026-02-21T08:30:46.9266793Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T08:30:46.9266862Z .b8 95 // DW_AT_name 2026-02-21T08:30:46.9266917Z .b8 104 2026-02-21T08:30:46.9266962Z .b8 101 2026-02-21T08:30:46.9267009Z .b8 108 2026-02-21T08:30:46.9267062Z .b8 105 2026-02-21T08:30:46.9267112Z .b8 111 2026-02-21T08:30:46.9267158Z .b8 110 2026-02-21T08:30:46.9267205Z .b8 95 2026-02-21T08:30:46.9267257Z .b8 109 2026-02-21T08:30:46.9267306Z .b8 97 2026-02-21T08:30:46.9267355Z .b8 116 2026-02-21T08:30:46.9267411Z .b8 109 2026-02-21T08:30:46.9267461Z .b8 117 2026-02-21T08:30:46.9267511Z .b8 108 2026-02-21T08:30:46.9267562Z .b8 95 2026-02-21T08:30:46.9267620Z .b8 98 2026-02-21T08:30:46.9267670Z .b8 102 2026-02-21T08:30:46.9267719Z .b8 49 2026-02-21T08:30:46.9267771Z .b8 54 2026-02-21T08:30:46.9267871Z .b8 95 2026-02-21T08:30:46.9267921Z .b8 105 2026-02-21T08:30:46.9267971Z .b8 110 2026-02-21T08:30:46.9268026Z .b8 116 2026-02-21T08:30:46.9268075Z .b8 52 2026-02-21T08:30:46.9268125Z .b8 0 2026-02-21T08:30:46.9268196Z .b8 1 // DW_AT_inline 2026-02-21T08:30:46.9268298Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T08:30:46.9268385Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T08:30:46.9268471Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T08:30:46.9268565Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:30:46.9268676Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T08:30:46.9268763Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:30:46.9268849Z .b64 $L__tmp1 // DW_AT_low_pc 2026-02-21T08:30:46.9268973Z .b64 $L__tmp216 // DW_AT_high_pc 2026-02-21T08:30:46.9269053Z .b8 1 // 
DW_AT_call_file 2026-02-21T08:30:46.9269130Z .b8 84 // DW_AT_call_line 2026-02-21T08:30:46.9269215Z .b8 40 // DW_AT_call_column 2026-02-21T08:30:46.9269296Z .b8 0 // End Of Children Mark 2026-02-21T08:30:46.9269377Z .b8 0 // End Of Children Mark 2026-02-21T08:30:46.9269435Z } 2026-02-21T08:30:46.9269502Z .section .debug_macinfo { } 2026-02-21T08:30:46.9269506Z 2026-02-21T08:30:46.9269581Z ================================================================ 2026-02-21T08:30:46.9269692Z please share the reproducer above with Triton project. 2026-02-21T08:30:48.4981959Z 2026-02-21T08:30:48.4983915Z 2026-02-21T08:30:48.4983922Z 2026-02-21T08:30:48.4984135Z ================================================================ 2026-02-21T08:30:48.4984417Z Internal Triton PTX codegen error 2026-02-21T08:30:48.4984616Z `ptxas` stderr: 2026-02-21T08:30:48.4985057Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 262 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T08:30:48.4985556Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:30:48.4985705Z 2026-02-21T08:30:48.4986098Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpxryanb1o.ptx -o /tmp/tmpxryanb1o.ptx.o 2026-02-21T08:30:48.4986545Z 2026-02-21T08:30:48.4986549Z 2026-02-21T08:30:48.4986603Z // 2026-02-21T08:30:48.4986755Z // Generated by LLVM NVPTX Back-End 2026-02-21T08:30:48.4986924Z // 2026-02-21T08:30:48.4986993Z 2026-02-21T08:30:48.4987058Z .version 8.7 2026-02-21T08:30:48.4987193Z .target sm_100a 2026-02-21T08:30:48.4987335Z .address_size 64 2026-02-21T08:30:48.4987417Z 2026-02-21T08:30:48.4987566Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T08:30:48.4987852Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T08:30:48.4988072Z // @_helion_matmul_bf16_int4 2026-02-21T08:30:48.4988283Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T08:30:48.4988531Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T08:30:48.4988809Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T08:30:48.4989083Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T08:30:48.4989357Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T08:30:48.4989649Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T08:30:48.4989879Z ) 2026-02-21T08:30:48.4990007Z .reqntid 128 2026-02-21T08:30:48.4990158Z .maxnreg 32 2026-02-21T08:30:48.4990291Z { 2026-02-21T08:30:48.4990434Z .reg .pred %p<257>; 2026-02-21T08:30:48.4990589Z .reg .b16 %rs<1793>; 2026-02-21T08:30:48.4991001Z .reg .b32 %r<4075>; 2026-02-21T08:30:48.4991144Z .reg .b64 %rd<1104>; 2026-02-21T08:30:48.4991425Z .loc 1 14 0 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:14:0 2026-02-21T08:30:48.4991835Z $L__func_begin0: 
2026-02-21T08:30:48.4992081Z .loc 1 14 0 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:14:0 2026-02-21T08:30:48.4992305Z 2026-02-21T08:30:48.4992365Z // %bb.0: 2026-02-21T08:30:48.4992539Z ld.param.b64 %rd94, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T08:30:48.4992789Z ld.param.b64 %rd93, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T08:30:48.4993027Z ld.param.b64 %rd92, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T08:30:48.4993224Z $L__tmp0: 2026-02-21T08:30:48.4993456Z .loc 1 14 0 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:14 2026-02-21T08:30:48.4993733Z mov.u32 %r1, %tid.x; 2026-02-21T08:30:48.4994174Z setp.lt.u32 %p3, %r1, 32; 2026-02-21T08:30:48.4994339Z mov.b32 %r294, global_smem; 2026-02-21T08:30:48.4994502Z // begin inline asm 2026-02-21T08:30:48.4994730Z @%p3 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r294], 256; 2026-02-21T08:30:48.4994972Z // end inline asm 2026-02-21T08:30:48.4995103Z bar.sync 0; 2026-02-21T08:30:48.4995247Z ld.shared.b32 %r4008, [global_smem]; 2026-02-21T08:30:48.4995417Z bar.sync 0; 2026-02-21T08:30:48.4995543Z // begin inline asm 2026-02-21T08:30:48.4995748Z @%p3 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T08:30:48.4995965Z // end inline asm 2026-02-21T08:30:48.4996219Z .loc 1 19 46 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:46 2026-02-21T08:30:48.4996502Z mov.u32 %r4018, %ctaid.x; 2026-02-21T08:30:48.4996760Z .loc 1 0 0 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0 2026-02-21T08:30:48.4997043Z sub.s32 %r295, 6463, %r4018; 2026-02-21T08:30:48.4997320Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.4997625Z shr.u32 %r296, %r295, 6; 2026-02-21T08:30:48.4997784Z mul.hi.u32 %r297, %r296, 116080198; 2026-02-21T08:30:48.4997964Z and.b32 %r298, %r297, 2097148; 2026-02-21T08:30:48.4998133Z mad.lo.s32 %r4074, %r298, 2368, %r4018; 2026-02-21T08:30:48.4998435Z .loc 1 31 45 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:31:45 2026-02-21T08:30:48.4998732Z shr.u32 %r5, %r1, 5; 2026-02-21T08:30:48.4998885Z and.b32 %r299, %r1, 7; 2026-02-21T08:30:48.4999064Z shl.b32 %r6, %r299, 4; 2026-02-21T08:30:48.4999233Z and.b32 %r7, %r1, 15; 2026-02-21T08:30:48.4999399Z shl.b32 %r8, %r7, 3; 2026-02-21T08:30:48.4999651Z .loc 1 33 45 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:33:45 2026-02-21T08:30:48.4999945Z bfe.u32 %r9, %r1, 3, 4; 2026-02-21T08:30:48.5000094Z or.b32 %r10, %r9, 16; 2026-02-21T08:30:48.5000247Z or.b32 %r11, %r9, 32; 2026-02-21T08:30:48.5000398Z or.b32 %r12, %r9, 48; 2026-02-21T08:30:48.5000543Z bfe.u32 %r14, %r1, 4, 3; 2026-02-21T08:30:48.5000697Z or.b32 %r15, %r14, 8; 2026-02-21T08:30:48.5000838Z or.b32 %r16, %r14, 16; 2026-02-21T08:30:48.5000991Z or.b32 %r17, %r14, 24; 2026-02-21T08:30:48.5001135Z or.b32 %r18, %r14, 32; 2026-02-21T08:30:48.5001284Z or.b32 %r19, %r14, 40; 2026-02-21T08:30:48.5001425Z or.b32 %r20, %r14, 48; 2026-02-21T08:30:48.5001615Z or.b32 %r21, %r14, 56; 2026-02-21T08:30:48.5001766Z shl.b32 %r22, %r299, 3; 2026-02-21T08:30:48.5002041Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5002341Z setp.lt.s32 %p5, %r4018, %r4074; 2026-02-21T08:30:48.5002515Z add.s32 %r3694, %r4008, 128; 2026-02-21T08:30:48.5002682Z or.b32 %r4043, %r22, 64; 2026-02-21T08:30:48.5002840Z shl.b32 %r4012, %r7, 7; 2026-02-21T08:30:48.5003007Z and.b32 %r4013, %r1, 96; 2026-02-21T08:30:48.5003184Z shl.b32 %r4014, %r1, 2; 2026-02-21T08:30:48.5003435Z shl.b32 %r4015, %r1, 9; 2026-02-21T08:30:48.5003598Z shl.b32 %r4016, %r7, 4; 2026-02-21T08:30:48.5003772Z shl.b32 %r4017, %r1, 5; 2026-02-21T08:30:48.5003949Z setp.eq.b32 %p10, %r1, 0; 2026-02-21T08:30:48.5004121Z and.b32 %r4048, %r1, 127; 2026-02-21T08:30:48.5004292Z shl.b32 %r4047, %r4048, 4; 2026-02-21T08:30:48.5004454Z @%p5 bra $L__BB0_2; 2026-02-21T08:30:48.5004616Z bra.uni $L__BB0_1; 2026-02-21T08:30:48.5004786Z 
$L__BB0_2: // %.lr.ph 2026-02-21T08:30:48.5005117Z .loc 1 0 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0:112 2026-02-21T08:30:48.5005423Z add.s32 %r301, %r294, 32768; 2026-02-21T08:30:48.5005605Z add.s32 %r2763, %r301, %r4047; 2026-02-21T08:30:48.5005780Z or.b32 %r4046, %r4047, 2048; 2026-02-21T08:30:48.5005943Z add.s32 %r2765, %r2763, 2048; 2026-02-21T08:30:48.5006115Z or.b32 %r4045, %r4047, 4096; 2026-02-21T08:30:48.5006334Z add.s32 %r2767, %r2763, 4096; 2026-02-21T08:30:48.5006512Z or.b32 %r4044, %r4047, 6144; 2026-02-21T08:30:48.5006671Z add.s32 %r2769, %r2763, 6144; 2026-02-21T08:30:48.5006840Z shl.b32 %r39, %r9, 13; 2026-02-21T08:30:48.5006996Z shl.b32 %r40, %r10, 13; 2026-02-21T08:30:48.5007160Z add.s32 %r302, %r294, 49152; 2026-02-21T08:30:48.5007327Z add.s32 %r2771, %r302, %r4047; 2026-02-21T08:30:48.5007515Z add.s32 %r2773, %r2771, 2048; 2026-02-21T08:30:48.5007695Z add.s32 %r303, %r294, %r4047; 2026-02-21T08:30:48.5007863Z add.s32 %r2468, %r303, 40960; 2026-02-21T08:30:48.5008039Z add.s32 %r2470, %r303, 43008; 2026-02-21T08:30:48.5008204Z add.s32 %r2472, %r303, 45056; 2026-02-21T08:30:48.5008372Z add.s32 %r2474, %r303, 47104; 2026-02-21T08:30:48.5008531Z or.b32 %r48, %r39, 262144; 2026-02-21T08:30:48.5008696Z or.b32 %r49, %r39, 393216; 2026-02-21T08:30:48.5008855Z add.s32 %r2476, %r303, 53248; 2026-02-21T08:30:48.5009032Z add.s32 %r2478, %r303, 55296; 2026-02-21T08:30:48.5009210Z shl.b32 %r306, %r4013, 6; 2026-02-21T08:30:48.5009381Z and.b32 %r308, %r4014, 64; 2026-02-21T08:30:48.5009565Z or.b32 %r309, %r4012, %r306; 2026-02-21T08:30:48.5009734Z or.b32 %r52, %r309, %r308; 2026-02-21T08:30:48.5009917Z add.s32 %r53, %r301, %r52; 2026-02-21T08:30:48.5010078Z add.s32 %r54, %r302, %r4048; 2026-02-21T08:30:48.5010246Z or.b32 %r55, %r1, 896; 2026-02-21T08:30:48.5010401Z add.s32 %r56, %r302, %r55; 2026-02-21T08:30:48.5010568Z or.b32 %r57, %r1, 1920; 2026-02-21T08:30:48.5010727Z add.s32 %r58, %r302, %r57; 2026-02-21T08:30:48.5010885Z or.b32 
%r59, %r1, 2944; 2026-02-21T08:30:48.5011045Z add.s32 %r60, %r302, %r59; 2026-02-21T08:30:48.5011202Z or.b32 %r61, %r1, 3968; 2026-02-21T08:30:48.5011362Z add.s32 %r62, %r302, %r61; 2026-02-21T08:30:48.5011518Z shl.b32 %r310, %r4048, 7; 2026-02-21T08:30:48.5011714Z or.b32 %r311, %r310, %r6; 2026-02-21T08:30:48.5011877Z add.s32 %r63, %r294, %r311; 2026-02-21T08:30:48.5012048Z xor.b32 %r312, %r311, 16; 2026-02-21T08:30:48.5012208Z add.s32 %r64, %r294, %r312; 2026-02-21T08:30:48.5012381Z xor.b32 %r313, %r311, 32; 2026-02-21T08:30:48.5012544Z add.s32 %r65, %r294, %r313; 2026-02-21T08:30:48.5012705Z xor.b32 %r314, %r311, 48; 2026-02-21T08:30:48.5012869Z add.s32 %r66, %r294, %r314; 2026-02-21T08:30:48.5013027Z xor.b32 %r315, %r311, 64; 2026-02-21T08:30:48.5013201Z add.s32 %r67, %r294, %r315; 2026-02-21T08:30:48.5013360Z xor.b32 %r316, %r311, 80; 2026-02-21T08:30:48.5013525Z add.s32 %r68, %r294, %r316; 2026-02-21T08:30:48.5013683Z xor.b32 %r317, %r311, 96; 2026-02-21T08:30:48.5013847Z add.s32 %r69, %r294, %r317; 2026-02-21T08:30:48.5014013Z xor.b32 %r318, %r311, 112; 2026-02-21T08:30:48.5014169Z add.s32 %r70, %r294, %r318; 2026-02-21T08:30:48.5014338Z bfe.u32 %r319, %r294, 4, 14; 2026-02-21T08:30:48.5014496Z cvt.u64.u32 %rd95, %r319; 2026-02-21T08:30:48.5014667Z or.b64 %rd737, %rd95, 4611686293372403712; 2026-02-21T08:30:48.5014844Z add.s32 %r320, %r294, 32; 2026-02-21T08:30:48.5015005Z bfe.u32 %r321, %r320, 4, 14; 2026-02-21T08:30:48.5015222Z cvt.u64.u32 %rd96, %r321; 2026-02-21T08:30:48.5015392Z or.b64 %rd738, %rd96, 4611686293372403712; 2026-02-21T08:30:48.5015576Z add.s32 %r322, %r294, 64; 2026-02-21T08:30:48.5015727Z bfe.u32 %r323, %r322, 4, 14; 2026-02-21T08:30:48.5015890Z cvt.u64.u32 %rd97, %r323; 2026-02-21T08:30:48.5016055Z or.b64 %rd739, %rd97, 4611686293372403712; 2026-02-21T08:30:48.5016247Z add.s32 %r324, %r294, 96; 2026-02-21T08:30:48.5016401Z bfe.u32 %r325, %r324, 4, 14; 2026-02-21T08:30:48.5016568Z cvt.u64.u32 %rd98, %r325; 2026-02-21T08:30:48.5016734Z 
or.b64 %rd740, %rd98, 4611686293372403712; 2026-02-21T08:30:48.5016928Z add.s32 %r326, %r294, 16384; 2026-02-21T08:30:48.5017082Z bfe.u32 %r327, %r326, 4, 14; 2026-02-21T08:30:48.5017240Z cvt.u64.u32 %rd99, %r327; 2026-02-21T08:30:48.5017403Z or.b64 %rd741, %rd99, 4611686293372403712; 2026-02-21T08:30:48.5017579Z add.s32 %r328, %r294, 16416; 2026-02-21T08:30:48.5017738Z bfe.u32 %r329, %r328, 4, 14; 2026-02-21T08:30:48.5017942Z cvt.u64.u32 %rd100, %r329; 2026-02-21T08:30:48.5018121Z or.b64 %rd742, %rd100, 4611686293372403712; 2026-02-21T08:30:48.5018299Z add.s32 %r330, %r294, 16448; 2026-02-21T08:30:48.5018459Z bfe.u32 %r331, %r330, 4, 14; 2026-02-21T08:30:48.5018612Z cvt.u64.u32 %rd101, %r331; 2026-02-21T08:30:48.5018785Z or.b64 %rd743, %rd101, 4611686293372403712; 2026-02-21T08:30:48.5018969Z add.s32 %r332, %r294, 16480; 2026-02-21T08:30:48.5019120Z bfe.u32 %r333, %r332, 4, 14; 2026-02-21T08:30:48.5019279Z cvt.u64.u32 %rd102, %r333; 2026-02-21T08:30:48.5019440Z or.b64 %rd744, %rd102, 4611686293372403712; 2026-02-21T08:30:48.5019622Z or.b32 %r71, %r39, 524288; 2026-02-21T08:30:48.5019771Z or.b32 %r72, %r39, 655360; 2026-02-21T08:30:48.5019928Z and.b32 %r335, %r4015, 3072; 2026-02-21T08:30:48.5020080Z shl.b32 %r337, %r4013, 3; 2026-02-21T08:30:48.5020234Z or.b32 %r338, %r4016, %r337; 2026-02-21T08:30:48.5020388Z xor.b32 %r339, %r338, %r308; 2026-02-21T08:30:48.5020543Z or.b32 %r340, %r339, %r335; 2026-02-21T08:30:48.5020702Z add.s32 %r73, %r294, %r340; 2026-02-21T08:30:48.5020853Z xor.b32 %r341, %r340, 32; 2026-02-21T08:30:48.5021008Z add.s32 %r74, %r294, %r341; 2026-02-21T08:30:48.5021158Z and.b32 %r343, %r4017, 3168; 2026-02-21T08:30:48.5021316Z shl.b32 %r344, %r1, 4; 2026-02-21T08:30:48.5021464Z and.b32 %r345, %r344, 384; 2026-02-21T08:30:48.5021728Z and.b32 %r346, %r4014, 16; 2026-02-21T08:30:48.5021879Z or.b32 %r347, %r343, %r345; 2026-02-21T08:30:48.5022039Z xor.b32 %r348, %r347, %r4013; 2026-02-21T08:30:48.5022204Z add.s32 %r349, %r294, %r346; 
2026-02-21T08:30:48.5022356Z add.s32 %r888, %r349, %r348; 2026-02-21T08:30:48.5022517Z add.s32 %r893, %r888, 512; 2026-02-21T08:30:48.5022797Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5023106Z mul.wide.u32 %rd9, %r299, 16; 2026-02-21T08:30:48.5023270Z add.s64 %rd10, %rd92, 384; 2026-02-21T08:30:48.5023431Z shl.b32 %r79, %r9, 10; 2026-02-21T08:30:48.5023695Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5023992Z or.b32 %r351, %r79, %r22; 2026-02-21T08:30:48.5024149Z or.b32 %r80, %r351, 49344; 2026-02-21T08:30:48.5024301Z bra.uni $L__BB0_3; 2026-02-21T08:30:48.5024503Z $L__BB0_27: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5024830Z .loc 1 0 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0:70 2026-02-21T08:30:48.5025132Z mov.b32 %r2958, 1; 2026-02-21T08:30:48.5025275Z $L__tmp1: 2026-02-21T08:30:48.5025579Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5025921Z // begin inline asm 2026-02-21T08:30:48.5026056Z 2026-02-21T08:30:48.5026178Z { 2026-02-21T08:30:48.5026301Z .reg .pred complete; 2026-02-21T08:30:48.5026451Z waitLoop: 2026-02-21T08:30:48.5026644Z mbarrier.try_wait.parity.shared.b64 complete, [%r4039], %r2958; 2026-02-21T08:30:48.5026966Z @!complete bra.uni waitLoop; 2026-02-21T08:30:48.5027122Z } 2026-02-21T08:30:48.5027197Z 2026-02-21T08:30:48.5027257Z // end inline asm 2026-02-21T08:30:48.5027397Z $L__tmp2: 2026-02-21T08:30:48.5027646Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5027948Z cp.async.wait_group 0; 2026-02-21T08:30:48.5028103Z bar.sync 0; 2026-02-21T08:30:48.5028252Z add.s32 %r2959, %r294, 57344; 2026-02-21T08:30:48.5028415Z // begin inline asm 2026-02-21T08:30:48.5028603Z @%p10 mbarrier.inval.shared::cta.b64 [%r2959]; 2026-02-21T08:30:48.5028797Z // end inline asm 
2026-02-21T08:30:48.5028941Z bar.sync 0; 2026-02-21T08:30:48.5029075Z // begin inline asm 2026-02-21T08:30:48.5029252Z @%p10 mbarrier.inval.shared::cta.b64 [%r2245]; 2026-02-21T08:30:48.5029449Z // end inline asm 2026-02-21T08:30:48.5029777Z .loc 1 88 43 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:43 2026-02-21T08:30:48.5030071Z shl.b32 %r3102, %r172, 13; 2026-02-21T08:30:48.5030225Z shl.b32 %r3103, %r173, 13; 2026-02-21T08:30:48.5030382Z shl.b32 %r3104, %r174, 13; 2026-02-21T08:30:48.5030529Z shl.b32 %r3105, %r175, 13; 2026-02-21T08:30:48.5030686Z shl.b32 %r3106, %r176, 13; 2026-02-21T08:30:48.5030835Z shl.b32 %r3107, %r177, 13; 2026-02-21T08:30:48.5030990Z shl.b32 %r3108, %r178, 13; 2026-02-21T08:30:48.5031139Z shl.b32 %r3109, %r179, 13; 2026-02-21T08:30:48.5031406Z .loc 1 88 50 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:50 2026-02-21T08:30:48.5031732Z add.s32 %r3110, %r3102, %r170; 2026-02-21T08:30:48.5031895Z add.s32 %r3111, %r3103, %r170; 2026-02-21T08:30:48.5032063Z add.s32 %r3112, %r3104, %r170; 2026-02-21T08:30:48.5032223Z add.s32 %r3113, %r3105, %r170; 2026-02-21T08:30:48.5032386Z add.s32 %r3114, %r3106, %r170; 2026-02-21T08:30:48.5032543Z add.s32 %r3115, %r3107, %r170; 2026-02-21T08:30:48.5032709Z add.s32 %r3116, %r3108, %r170; 2026-02-21T08:30:48.5032867Z add.s32 %r3117, %r3109, %r170; 2026-02-21T08:30:48.5033143Z .loc 1 88 22 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:22 2026-02-21T08:30:48.5033446Z mad.wide.s32 %rd775, %r3110, 2, %rd94; 2026-02-21T08:30:48.5033632Z mad.wide.s32 %rd776, %r3111, 2, %rd94; 2026-02-21T08:30:48.5033822Z mad.wide.s32 %rd777, %r3112, 2, %rd94; 2026-02-21T08:30:48.5033999Z mad.wide.s32 %rd778, %r3113, 2, %rd94; 2026-02-21T08:30:48.5034182Z mad.wide.s32 %rd779, %r3114, 2, %rd94; 2026-02-21T08:30:48.5034359Z mad.wide.s32 %rd780, %r3115, 2, %rd94; 2026-02-21T08:30:48.5034544Z mad.wide.s32 %rd781, %r3116, 2, %rd94; 2026-02-21T08:30:48.5034725Z mad.wide.s32 %rd782, %r3117, 2, 
%rd94; 2026-02-21T08:30:48.5034888Z $L__tmp3: 2026-02-21T08:30:48.5035186Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5035520Z // begin inline asm 2026-02-21T08:30:48.5035930Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2961, %r2962, %r2963, %r2964, %r2965, %r2966, %r2967, %r2968, %r2969, %r2970, %r2971, %r2972, %r2973, %r2974, %r2975, %r2976}, [%r3028 + 0], 64; 2026-02-21T08:30:48.5036352Z // end inline asm 2026-02-21T08:30:48.5036495Z // begin inline asm 2026-02-21T08:30:48.5036893Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2978, %r2979, %r2980, %r2981, %r2982, %r2983, %r2984, %r2985, %r2986, %r2987, %r2988, %r2989, %r2990, %r2991, %r2992, %r2993}, [%r3028 + 16], 64; 2026-02-21T08:30:48.5037307Z // end inline asm 2026-02-21T08:30:48.5037451Z // begin inline asm 2026-02-21T08:30:48.5037835Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2995, %r2996, %r2997, %r2998, %r2999, %r3000, %r3001, %r3002, %r3003, %r3004, %r3005, %r3006, %r3007, %r3008, %r3009, %r3010}, [%r3028 + 32], 64; 2026-02-21T08:30:48.5038256Z // end inline asm 2026-02-21T08:30:48.5038389Z // begin inline asm 2026-02-21T08:30:48.5038779Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3012, %r3013, %r3014, %r3015, %r3016, %r3017, %r3018, %r3019, %r3020, %r3021, %r3022, %r3023, %r3024, %r3025, %r3026, %r3027}, [%r3028 + 48], 64; 2026-02-21T08:30:48.5039252Z // end inline asm 2026-02-21T08:30:48.5039390Z // begin inline asm 2026-02-21T08:30:48.5039553Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:48.5039719Z // end inline asm 2026-02-21T08:30:48.5039870Z cvt.u64.u32 %rd783, %r2961; 2026-02-21T08:30:48.5040036Z cvt.u64.u32 %rd784, %r2962; 2026-02-21T08:30:48.5040203Z shl.b64 %rd785, %rd784, 32; 2026-02-21T08:30:48.5040364Z or.b64 %rd786, %rd783, %rd785; 2026-02-21T08:30:48.5040529Z $L__tmp4: 2026-02-21T08:30:48.5040780Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5041077Z 
mov.b64 {%r3118, %r3119}, %rd786; 2026-02-21T08:30:48.5041272Z cvt.rn.bf16x2.f32 %r3120, %r3119, %r3118; 2026-02-21T08:30:48.5041453Z $L__tmp5: 2026-02-21T08:30:48.5041843Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5042179Z cvt.u64.u32 %rd787, %r2963; 2026-02-21T08:30:48.5042341Z cvt.u64.u32 %rd788, %r2964; 2026-02-21T08:30:48.5042499Z shl.b64 %rd789, %rd788, 32; 2026-02-21T08:30:48.5042657Z or.b64 %rd790, %rd787, %rd789; 2026-02-21T08:30:48.5042817Z $L__tmp6: 2026-02-21T08:30:48.5043052Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5043349Z mov.b64 {%r3121, %r3122}, %rd790; 2026-02-21T08:30:48.5043530Z cvt.rn.bf16x2.f32 %r3123, %r3122, %r3121; 2026-02-21T08:30:48.5043714Z $L__tmp7: 2026-02-21T08:30:48.5043991Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5044325Z cvt.u64.u32 %rd791, %r2965; 2026-02-21T08:30:48.5044488Z cvt.u64.u32 %rd792, %r2966; 2026-02-21T08:30:48.5044641Z shl.b64 %rd793, %rd792, 32; 2026-02-21T08:30:48.5044809Z or.b64 %rd794, %rd791, %rd793; 2026-02-21T08:30:48.5044969Z $L__tmp8: 2026-02-21T08:30:48.5045208Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5045508Z mov.b64 {%r3124, %r3125}, %rd794; 2026-02-21T08:30:48.5045702Z cvt.rn.bf16x2.f32 %r3126, %r3125, %r3124; 2026-02-21T08:30:48.5045885Z $L__tmp9: 2026-02-21T08:30:48.5046187Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5046543Z cvt.u64.u32 %rd795, %r2967; 2026-02-21T08:30:48.5046710Z cvt.u64.u32 %rd796, %r2968; 2026-02-21T08:30:48.5046878Z shl.b64 %rd797, %rd796, 32; 2026-02-21T08:30:48.5047040Z or.b64 %rd798, %rd795, %rd797; 2026-02-21T08:30:48.5047209Z $L__tmp10: 2026-02-21T08:30:48.5047455Z .loc 1 87 28 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5047763Z mov.b64 {%r3127, %r3128}, %rd798; 2026-02-21T08:30:48.5047955Z cvt.rn.bf16x2.f32 %r3129, %r3128, %r3127; 2026-02-21T08:30:48.5048135Z $L__tmp11: 2026-02-21T08:30:48.5048431Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5048772Z cvt.u64.u32 %rd799, %r2969; 2026-02-21T08:30:48.5048941Z cvt.u64.u32 %rd800, %r2970; 2026-02-21T08:30:48.5049099Z shl.b64 %rd801, %rd800, 32; 2026-02-21T08:30:48.5049268Z or.b64 %rd802, %rd799, %rd801; 2026-02-21T08:30:48.5049428Z $L__tmp12: 2026-02-21T08:30:48.5049685Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5049993Z mov.b64 {%r3130, %r3131}, %rd802; 2026-02-21T08:30:48.5050175Z cvt.rn.bf16x2.f32 %r3132, %r3131, %r3130; 2026-02-21T08:30:48.5050359Z $L__tmp13: 2026-02-21T08:30:48.5050651Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5051057Z cvt.u64.u32 %rd803, %r2971; 2026-02-21T08:30:48.5051224Z cvt.u64.u32 %rd804, %r2972; 2026-02-21T08:30:48.5051395Z shl.b64 %rd805, %rd804, 32; 2026-02-21T08:30:48.5051586Z or.b64 %rd806, %rd803, %rd805; 2026-02-21T08:30:48.5051755Z $L__tmp14: 2026-02-21T08:30:48.5052013Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5052317Z mov.b64 {%r3133, %r3134}, %rd806; 2026-02-21T08:30:48.5052507Z cvt.rn.bf16x2.f32 %r3135, %r3134, %r3133; 2026-02-21T08:30:48.5052687Z $L__tmp15: 2026-02-21T08:30:48.5052990Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5053338Z cvt.u64.u32 %rd807, %r2973; 2026-02-21T08:30:48.5053511Z cvt.u64.u32 %rd808, %r2974; 2026-02-21T08:30:48.5053672Z shl.b64 %rd809, %rd808, 32; 2026-02-21T08:30:48.5053824Z or.b64 %rd810, %rd807, %rd809; 2026-02-21T08:30:48.5054032Z 
$L__tmp16: 2026-02-21T08:30:48.5054266Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5054555Z mov.b64 {%r3136, %r3137}, %rd810; 2026-02-21T08:30:48.5054730Z cvt.rn.bf16x2.f32 %r3138, %r3137, %r3136; 2026-02-21T08:30:48.5054909Z $L__tmp17: 2026-02-21T08:30:48.5055187Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5055514Z cvt.u64.u32 %rd811, %r2975; 2026-02-21T08:30:48.5055672Z cvt.u64.u32 %rd812, %r2976; 2026-02-21T08:30:48.5055823Z shl.b64 %rd813, %rd812, 32; 2026-02-21T08:30:48.5055981Z or.b64 %rd814, %rd811, %rd813; 2026-02-21T08:30:48.5056131Z $L__tmp18: 2026-02-21T08:30:48.5056370Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5056651Z mov.b64 {%r3139, %r3140}, %rd814; 2026-02-21T08:30:48.5056833Z cvt.rn.bf16x2.f32 %r3141, %r3140, %r3139; 2026-02-21T08:30:48.5057007Z $L__tmp19: 2026-02-21T08:30:48.5057300Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5057632Z cvt.u64.u32 %rd815, %r2978; 2026-02-21T08:30:48.5057788Z cvt.u64.u32 %rd816, %r2979; 2026-02-21T08:30:48.5057947Z shl.b64 %rd817, %rd816, 32; 2026-02-21T08:30:48.5058100Z or.b64 %rd818, %rd815, %rd817; 2026-02-21T08:30:48.5058258Z $L__tmp20: 2026-02-21T08:30:48.5058492Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5058782Z mov.b64 {%r3142, %r3143}, %rd818; 2026-02-21T08:30:48.5058956Z cvt.rn.bf16x2.f32 %r3144, %r3143, %r3142; 2026-02-21T08:30:48.5059132Z $L__tmp21: 2026-02-21T08:30:48.5059418Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5059739Z cvt.u64.u32 %rd819, %r2980; 2026-02-21T08:30:48.5059902Z cvt.u64.u32 %rd820, %r2981; 2026-02-21T08:30:48.5060057Z shl.b64 %rd821, %rd820, 32; 2026-02-21T08:30:48.5060216Z or.b64 
%rd822, %rd819, %rd821; 2026-02-21T08:30:48.5060366Z $L__tmp22: 2026-02-21T08:30:48.5060609Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5060899Z mov.b64 {%r3145, %r3146}, %rd822; 2026-02-21T08:30:48.5061074Z cvt.rn.bf16x2.f32 %r3147, %r3146, %r3145; 2026-02-21T08:30:48.5061252Z $L__tmp23: 2026-02-21T08:30:48.5061531Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5061890Z cvt.u64.u32 %rd823, %r2982; 2026-02-21T08:30:48.5062045Z cvt.u64.u32 %rd824, %r2983; 2026-02-21T08:30:48.5062198Z shl.b64 %rd825, %rd824, 32; 2026-02-21T08:30:48.5062350Z or.b64 %rd826, %rd823, %rd825; 2026-02-21T08:30:48.5062508Z $L__tmp24: 2026-02-21T08:30:48.5062755Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5063096Z mov.b64 {%r3148, %r3149}, %rd826; 2026-02-21T08:30:48.5063279Z cvt.rn.bf16x2.f32 %r3150, %r3149, %r3148; 2026-02-21T08:30:48.5063447Z $L__tmp25: 2026-02-21T08:30:48.5063729Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5064048Z cvt.u64.u32 %rd827, %r2984; 2026-02-21T08:30:48.5064210Z cvt.u64.u32 %rd828, %r2985; 2026-02-21T08:30:48.5064360Z shl.b64 %rd829, %rd828, 32; 2026-02-21T08:30:48.5064517Z or.b64 %rd830, %rd827, %rd829; 2026-02-21T08:30:48.5064676Z $L__tmp26: 2026-02-21T08:30:48.5064909Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5065204Z mov.b64 {%r3151, %r3152}, %rd830; 2026-02-21T08:30:48.5065378Z cvt.rn.bf16x2.f32 %r3153, %r3152, %r3151; 2026-02-21T08:30:48.5065558Z $L__tmp27: 2026-02-21T08:30:48.5065893Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5066233Z cvt.u64.u32 %rd831, %r2986; 2026-02-21T08:30:48.5066397Z cvt.u64.u32 %rd832, %r2987; 2026-02-21T08:30:48.5066548Z shl.b64 
%rd833, %rd832, 32; 2026-02-21T08:30:48.5066710Z or.b64 %rd834, %rd831, %rd833; 2026-02-21T08:30:48.5066863Z $L__tmp28: 2026-02-21T08:30:48.5067106Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5067398Z mov.b64 {%r3154, %r3155}, %rd834; 2026-02-21T08:30:48.5067579Z cvt.rn.bf16x2.f32 %r3156, %r3155, %r3154; 2026-02-21T08:30:48.5067752Z $L__tmp29: 2026-02-21T08:30:48.5068044Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5068386Z cvt.u64.u32 %rd835, %r2988; 2026-02-21T08:30:48.5068548Z cvt.u64.u32 %rd836, %r2989; 2026-02-21T08:30:48.5068707Z shl.b64 %rd837, %rd836, 32; 2026-02-21T08:30:48.5068868Z or.b64 %rd838, %rd835, %rd837; 2026-02-21T08:30:48.5069031Z $L__tmp30: 2026-02-21T08:30:48.5069266Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5069560Z mov.b64 {%r3157, %r3158}, %rd838; 2026-02-21T08:30:48.5069731Z cvt.rn.bf16x2.f32 %r3159, %r3158, %r3157; 2026-02-21T08:30:48.5069911Z $L__tmp31: 2026-02-21T08:30:48.5070194Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5070518Z cvt.u64.u32 %rd839, %r2990; 2026-02-21T08:30:48.5070680Z cvt.u64.u32 %rd840, %r2991; 2026-02-21T08:30:48.5070832Z shl.b64 %rd841, %rd840, 32; 2026-02-21T08:30:48.5070993Z or.b64 %rd842, %rd839, %rd841; 2026-02-21T08:30:48.5071146Z $L__tmp32: 2026-02-21T08:30:48.5071390Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5071705Z mov.b64 {%r3160, %r3161}, %rd842; 2026-02-21T08:30:48.5071888Z cvt.rn.bf16x2.f32 %r3162, %r3161, %r3160; 2026-02-21T08:30:48.5072068Z $L__tmp33: 2026-02-21T08:30:48.5072351Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5072684Z cvt.u64.u32 %rd843, %r2992; 2026-02-21T08:30:48.5072839Z cvt.u64.u32 
%rd844, %r2993; 2026-02-21T08:30:48.5073000Z shl.b64 %rd845, %rd844, 32; 2026-02-21T08:30:48.5073154Z or.b64 %rd846, %rd843, %rd845; 2026-02-21T08:30:48.5073312Z $L__tmp34: 2026-02-21T08:30:48.5073552Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5073838Z mov.b64 {%r3163, %r3164}, %rd846; 2026-02-21T08:30:48.5074019Z cvt.rn.bf16x2.f32 %r3165, %r3164, %r3163; 2026-02-21T08:30:48.5074187Z $L__tmp35: 2026-02-21T08:30:48.5074474Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5074845Z cvt.u64.u32 %rd847, %r2995; 2026-02-21T08:30:48.5075010Z cvt.u64.u32 %rd848, %r2996; 2026-02-21T08:30:48.5075164Z shl.b64 %rd849, %rd848, 32; 2026-02-21T08:30:48.5075328Z or.b64 %rd850, %rd847, %rd849; 2026-02-21T08:30:48.5075490Z $L__tmp36: 2026-02-21T08:30:48.5075729Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5076026Z mov.b64 {%r3166, %r3167}, %rd850; 2026-02-21T08:30:48.5076203Z cvt.rn.bf16x2.f32 %r3168, %r3167, %r3166; 2026-02-21T08:30:48.5076384Z $L__tmp37: 2026-02-21T08:30:48.5076664Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5076997Z cvt.u64.u32 %rd851, %r2997; 2026-02-21T08:30:48.5077161Z cvt.u64.u32 %rd852, %r2998; 2026-02-21T08:30:48.5077317Z shl.b64 %rd853, %rd852, 32; 2026-02-21T08:30:48.5077481Z or.b64 %rd854, %rd851, %rd853; 2026-02-21T08:30:48.5077711Z $L__tmp38: 2026-02-21T08:30:48.5077959Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5078242Z mov.b64 {%r3169, %r3170}, %rd854; 2026-02-21T08:30:48.5078425Z cvt.rn.bf16x2.f32 %r3171, %r3170, %r3169; 2026-02-21T08:30:48.5078599Z $L__tmp39: 2026-02-21T08:30:48.5078887Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5079225Z cvt.u64.u32 
%rd855, %r2999; 2026-02-21T08:30:48.5079380Z cvt.u64.u32 %rd856, %r3000; 2026-02-21T08:30:48.5079544Z shl.b64 %rd857, %rd856, 32; 2026-02-21T08:30:48.5079704Z or.b64 %rd858, %rd855, %rd857; 2026-02-21T08:30:48.5079874Z $L__tmp40: 2026-02-21T08:30:48.5080116Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5080419Z mov.b64 {%r3172, %r3173}, %rd858; 2026-02-21T08:30:48.5080598Z cvt.rn.bf16x2.f32 %r3174, %r3173, %r3172; 2026-02-21T08:30:48.5080781Z $L__tmp41: 2026-02-21T08:30:48.5081063Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5081383Z cvt.u64.u32 %rd859, %r3001; 2026-02-21T08:30:48.5081576Z cvt.u64.u32 %rd860, %r3002; 2026-02-21T08:30:48.5081730Z shl.b64 %rd861, %rd860, 32; 2026-02-21T08:30:48.5081893Z or.b64 %rd862, %rd859, %rd861; 2026-02-21T08:30:48.5082046Z $L__tmp42: 2026-02-21T08:30:48.5082290Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5082579Z mov.b64 {%r3175, %r3176}, %rd862; 2026-02-21T08:30:48.5082761Z cvt.rn.bf16x2.f32 %r3177, %r3176, %r3175; 2026-02-21T08:30:48.5082939Z $L__tmp43: 2026-02-21T08:30:48.5083216Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5083548Z cvt.u64.u32 %rd863, %r3003; 2026-02-21T08:30:48.5083704Z cvt.u64.u32 %rd864, %r3004; 2026-02-21T08:30:48.5083866Z shl.b64 %rd865, %rd864, 32; 2026-02-21T08:30:48.5084019Z or.b64 %rd866, %rd863, %rd865; 2026-02-21T08:30:48.5084176Z $L__tmp44: 2026-02-21T08:30:48.5084410Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5084695Z mov.b64 {%r3178, %r3179}, %rd866; 2026-02-21T08:30:48.5084871Z cvt.rn.bf16x2.f32 %r3180, %r3179, %r3178; 2026-02-21T08:30:48.5085041Z $L__tmp45: 2026-02-21T08:30:48.5085321Z .loc 2 291 36 // standard.py:291:36 @[ 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5085643Z cvt.u64.u32 %rd867, %r3005; 2026-02-21T08:30:48.5085801Z cvt.u64.u32 %rd868, %r3006; 2026-02-21T08:30:48.5085952Z shl.b64 %rd869, %rd868, 32; 2026-02-21T08:30:48.5086111Z or.b64 %rd870, %rd867, %rd869; 2026-02-21T08:30:48.5086267Z $L__tmp46: 2026-02-21T08:30:48.5086500Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5086847Z mov.b64 {%r3181, %r3182}, %rd870; 2026-02-21T08:30:48.5087019Z cvt.rn.bf16x2.f32 %r3183, %r3182, %r3181; 2026-02-21T08:30:48.5087197Z $L__tmp47: 2026-02-21T08:30:48.5087484Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5087832Z cvt.u64.u32 %rd871, %r3007; 2026-02-21T08:30:48.5087995Z cvt.u64.u32 %rd872, %r3008; 2026-02-21T08:30:48.5088162Z shl.b64 %rd873, %rd872, 32; 2026-02-21T08:30:48.5088330Z or.b64 %rd874, %rd871, %rd873; 2026-02-21T08:30:48.5088491Z $L__tmp48: 2026-02-21T08:30:48.5088748Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5089047Z mov.b64 {%r3184, %r3185}, %rd874; 2026-02-21T08:30:48.5089239Z cvt.rn.bf16x2.f32 %r3186, %r3185, %r3184; 2026-02-21T08:30:48.5089418Z $L__tmp49: 2026-02-21T08:30:48.5089772Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5090128Z cvt.u64.u32 %rd875, %r3009; 2026-02-21T08:30:48.5090292Z cvt.u64.u32 %rd876, %r3010; 2026-02-21T08:30:48.5090458Z shl.b64 %rd877, %rd876, 32; 2026-02-21T08:30:48.5090618Z or.b64 %rd878, %rd875, %rd877; 2026-02-21T08:30:48.5090789Z $L__tmp50: 2026-02-21T08:30:48.5091036Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5091341Z mov.b64 {%r3187, %r3188}, %rd878; 2026-02-21T08:30:48.5091520Z cvt.rn.bf16x2.f32 %r3189, %r3188, %r3187; 2026-02-21T08:30:48.5091732Z $L__tmp51: 
2026-02-21T08:30:48.5092040Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5092382Z cvt.u64.u32 %rd879, %r3012; 2026-02-21T08:30:48.5092554Z cvt.u64.u32 %rd880, %r3013; 2026-02-21T08:30:48.5092714Z shl.b64 %rd881, %rd880, 32; 2026-02-21T08:30:48.5092889Z or.b64 %rd882, %rd879, %rd881; 2026-02-21T08:30:48.5093049Z $L__tmp52: 2026-02-21T08:30:48.5093307Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5093608Z mov.b64 {%r3190, %r3191}, %rd882; 2026-02-21T08:30:48.5093797Z cvt.rn.bf16x2.f32 %r3192, %r3191, %r3190; 2026-02-21T08:30:48.5093987Z $L__tmp53: 2026-02-21T08:30:48.5094289Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5094644Z cvt.u64.u32 %rd883, %r3014; 2026-02-21T08:30:48.5094806Z cvt.u64.u32 %rd884, %r3015; 2026-02-21T08:30:48.5094972Z shl.b64 %rd885, %rd884, 32; 2026-02-21T08:30:48.5095132Z or.b64 %rd886, %rd883, %rd885; 2026-02-21T08:30:48.5095297Z $L__tmp54: 2026-02-21T08:30:48.5095547Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5095855Z mov.b64 {%r3193, %r3194}, %rd886; 2026-02-21T08:30:48.5096035Z cvt.rn.bf16x2.f32 %r3195, %r3194, %r3193; 2026-02-21T08:30:48.5096205Z $L__tmp55: 2026-02-21T08:30:48.5096492Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5096820Z cvt.u64.u32 %rd887, %r3016; 2026-02-21T08:30:48.5096978Z cvt.u64.u32 %rd888, %r3017; 2026-02-21T08:30:48.5097128Z shl.b64 %rd889, %rd888, 32; 2026-02-21T08:30:48.5097288Z or.b64 %rd890, %rd887, %rd889; 2026-02-21T08:30:48.5097438Z $L__tmp56: 2026-02-21T08:30:48.5097680Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5097969Z mov.b64 {%r3196, %r3197}, %rd890; 2026-02-21T08:30:48.5098143Z cvt.rn.bf16x2.f32 %r3198, 
%r3197, %r3196; 2026-02-21T08:30:48.5098320Z $L__tmp57: 2026-02-21T08:30:48.5098601Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5098981Z cvt.u64.u32 %rd891, %r3018; 2026-02-21T08:30:48.5099133Z cvt.u64.u32 %rd892, %r3019; 2026-02-21T08:30:48.5099292Z shl.b64 %rd893, %rd892, 32; 2026-02-21T08:30:48.5099452Z or.b64 %rd894, %rd891, %rd893; 2026-02-21T08:30:48.5099604Z $L__tmp58: 2026-02-21T08:30:48.5099842Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5100129Z mov.b64 {%r3199, %r3200}, %rd894; 2026-02-21T08:30:48.5100308Z cvt.rn.bf16x2.f32 %r3201, %r3200, %r3199; 2026-02-21T08:30:48.5100480Z $L__tmp59: 2026-02-21T08:30:48.5100769Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5101094Z cvt.u64.u32 %rd895, %r3020; 2026-02-21T08:30:48.5101257Z cvt.u64.u32 %rd896, %r3021; 2026-02-21T08:30:48.5101416Z shl.b64 %rd897, %rd896, 32; 2026-02-21T08:30:48.5101597Z or.b64 %rd898, %rd895, %rd897; 2026-02-21T08:30:48.5101830Z $L__tmp60: 2026-02-21T08:30:48.5102067Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5102357Z mov.b64 {%r3202, %r3203}, %rd898; 2026-02-21T08:30:48.5102529Z cvt.rn.bf16x2.f32 %r3204, %r3203, %r3202; 2026-02-21T08:30:48.5102708Z $L__tmp61: 2026-02-21T08:30:48.5102988Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5103320Z cvt.u64.u32 %rd899, %r3022; 2026-02-21T08:30:48.5103484Z cvt.u64.u32 %rd900, %r3023; 2026-02-21T08:30:48.5103639Z shl.b64 %rd901, %rd900, 32; 2026-02-21T08:30:48.5103804Z or.b64 %rd902, %rd899, %rd901; 2026-02-21T08:30:48.5103957Z $L__tmp62: 2026-02-21T08:30:48.5104203Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5104486Z mov.b64 {%r3205, %r3206}, %rd902; 
2026-02-21T08:30:48.5104670Z cvt.rn.bf16x2.f32 %r3207, %r3206, %r3205; 2026-02-21T08:30:48.5104848Z $L__tmp63: 2026-02-21T08:30:48.5105124Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5105452Z cvt.u64.u32 %rd903, %r3024; 2026-02-21T08:30:48.5105607Z cvt.u64.u32 %rd904, %r3025; 2026-02-21T08:30:48.5105765Z shl.b64 %rd905, %rd904, 32; 2026-02-21T08:30:48.5105919Z or.b64 %rd906, %rd903, %rd905; 2026-02-21T08:30:48.5106078Z $L__tmp64: 2026-02-21T08:30:48.5106314Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5106601Z mov.b64 {%r3208, %r3209}, %rd906; 2026-02-21T08:30:48.5106779Z cvt.rn.bf16x2.f32 %r3210, %r3209, %r3208; 2026-02-21T08:30:48.5106947Z $L__tmp65: 2026-02-21T08:30:48.5107229Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5107546Z cvt.u64.u32 %rd907, %r3026; 2026-02-21T08:30:48.5107706Z cvt.u64.u32 %rd908, %r3027; 2026-02-21T08:30:48.5107860Z shl.b64 %rd909, %rd908, 32; 2026-02-21T08:30:48.5108018Z or.b64 %rd910, %rd907, %rd909; 2026-02-21T08:30:48.5108169Z $L__tmp66: 2026-02-21T08:30:48.5108409Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5108700Z mov.b64 {%r3211, %r3212}, %rd910; 2026-02-21T08:30:48.5108874Z cvt.rn.bf16x2.f32 %r3213, %r3212, %r3211; 2026-02-21T08:30:48.5109158Z .loc 1 88 81 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:81 2026-02-21T08:30:48.5109476Z st.shared.v4.b32 [%r73], {%r3120, %r3132, %r3144, %r3156}; 2026-02-21T08:30:48.5109729Z st.shared.v4.b32 [%r74], {%r3168, %r3180, %r3192, %r3204}; 2026-02-21T08:30:48.5109929Z bar.sync 0; 2026-02-21T08:30:48.5110073Z // begin inline asm 2026-02-21T08:30:48.5110321Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3029, %r3030, %r3031, %r3032}, [%r888]; 2026-02-21T08:30:48.5110596Z // end inline asm 
2026-02-21T08:30:48.5110801Z // begin inline asm 2026-02-21T08:30:48.5111039Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3034, %r3035, %r3036, %r3037}, [%r893]; 2026-02-21T08:30:48.5111313Z // end inline asm 2026-02-21T08:30:48.5111455Z bar.sync 0; 2026-02-21T08:30:48.5111673Z st.shared.v4.b32 [%r73], {%r3123, %r3135, %r3147, %r3159}; 2026-02-21T08:30:48.5111920Z st.shared.v4.b32 [%r74], {%r3171, %r3183, %r3195, %r3207}; 2026-02-21T08:30:48.5112127Z bar.sync 0; 2026-02-21T08:30:48.5112267Z // begin inline asm 2026-02-21T08:30:48.5112500Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3039, %r3040, %r3041, %r3042}, [%r888]; 2026-02-21T08:30:48.5112767Z // end inline asm 2026-02-21T08:30:48.5112903Z // begin inline asm 2026-02-21T08:30:48.5113141Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3044, %r3045, %r3046, %r3047}, [%r893]; 2026-02-21T08:30:48.5113400Z // end inline asm 2026-02-21T08:30:48.5113540Z bar.sync 0; 2026-02-21T08:30:48.5113757Z st.shared.v4.b32 [%r73], {%r3126, %r3138, %r3150, %r3162}; 2026-02-21T08:30:48.5114003Z st.shared.v4.b32 [%r74], {%r3174, %r3186, %r3198, %r3210}; 2026-02-21T08:30:48.5114209Z bar.sync 0; 2026-02-21T08:30:48.5114337Z // begin inline asm 2026-02-21T08:30:48.5114574Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3049, %r3050, %r3051, %r3052}, [%r888]; 2026-02-21T08:30:48.5114833Z // end inline asm 2026-02-21T08:30:48.5114976Z // begin inline asm 2026-02-21T08:30:48.5115200Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3054, %r3055, %r3056, %r3057}, [%r893]; 2026-02-21T08:30:48.5115488Z // end inline asm 2026-02-21T08:30:48.5115619Z bar.sync 0; 2026-02-21T08:30:48.5115789Z st.shared.v4.b32 [%r73], {%r3129, %r3141, %r3153, %r3165}; 2026-02-21T08:30:48.5116031Z st.shared.v4.b32 [%r74], {%r3177, %r3189, %r3201, %r3213}; 2026-02-21T08:30:48.5116225Z bar.sync 0; 2026-02-21T08:30:48.5116361Z // begin inline asm 2026-02-21T08:30:48.5116589Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3059, %r3060, %r3061, %r3062}, [%r888]; 
2026-02-21T08:30:48.5116859Z // end inline asm 2026-02-21T08:30:48.5116995Z // begin inline asm 2026-02-21T08:30:48.5117226Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3064, %r3065, %r3066, %r3067}, [%r893]; 2026-02-21T08:30:48.5117485Z // end inline asm 2026-02-21T08:30:48.5117623Z // begin inline asm 2026-02-21T08:30:48.5117821Z st.global.v4.b32 [ %rd775 + 0 ], { %r3029, %r3039, %r3049, %r3059 }; 2026-02-21T08:30:48.5118036Z // end inline asm 2026-02-21T08:30:48.5118176Z // begin inline asm 2026-02-21T08:30:48.5118361Z st.global.v4.b32 [ %rd776 + 0 ], { %r3030, %r3040, %r3050, %r3060 }; 2026-02-21T08:30:48.5118579Z // end inline asm 2026-02-21T08:30:48.5118712Z // begin inline asm 2026-02-21T08:30:48.5118901Z st.global.v4.b32 [ %rd777 + 0 ], { %r3031, %r3041, %r3051, %r3061 }; 2026-02-21T08:30:48.5119109Z // end inline asm 2026-02-21T08:30:48.5119248Z // begin inline asm 2026-02-21T08:30:48.5119432Z st.global.v4.b32 [ %rd778 + 0 ], { %r3032, %r3042, %r3052, %r3062 }; 2026-02-21T08:30:48.5119641Z // end inline asm 2026-02-21T08:30:48.5119784Z // begin inline asm 2026-02-21T08:30:48.5119966Z st.global.v4.b32 [ %rd779 + 0 ], { %r3034, %r3044, %r3054, %r3064 }; 2026-02-21T08:30:48.5120181Z // end inline asm 2026-02-21T08:30:48.5120313Z // begin inline asm 2026-02-21T08:30:48.5120500Z st.global.v4.b32 [ %rd780 + 0 ], { %r3035, %r3045, %r3055, %r3065 }; 2026-02-21T08:30:48.5120704Z // end inline asm 2026-02-21T08:30:48.5120846Z // begin inline asm 2026-02-21T08:30:48.5121026Z st.global.v4.b32 [ %rd781 + 0 ], { %r3036, %r3046, %r3056, %r3066 }; 2026-02-21T08:30:48.5121244Z // end inline asm 2026-02-21T08:30:48.5121386Z // begin inline asm 2026-02-21T08:30:48.5121600Z st.global.v4.b32 [ %rd782 + 0 ], { %r3037, %r3047, %r3057, %r3067 }; 2026-02-21T08:30:48.5121821Z // end inline asm 2026-02-21T08:30:48.5122074Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5122381Z add.s32 %r4018, %r4018, 9472; 
2026-02-21T08:30:48.5122552Z setp.lt.s32 %p214, %r4018, %r4074; 2026-02-21T08:30:48.5122816Z @%p214 bra $L__BB0_3; 2026-02-21T08:30:48.5122975Z bra.uni $L__BB0_28; 2026-02-21T08:30:48.5123162Z $L__BB0_3: // =>This Loop Header: Depth=1 2026-02-21T08:30:48.5123409Z // Child Loop BB0_6 Depth 2 2026-02-21T08:30:48.5123640Z // Child Loop BB0_12 Depth 2 2026-02-21T08:30:48.5123875Z // Child Loop BB0_18 Depth 2 2026-02-21T08:30:48.5124099Z // Child Loop BB0_24 Depth 2 2026-02-21T08:30:48.5124413Z .loc 1 25 35 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:25:35 2026-02-21T08:30:48.5124718Z shr.s32 %r480, %r4018, 31; 2026-02-21T08:30:48.5124878Z shr.u32 %r481, %r480, 23; 2026-02-21T08:30:48.5125047Z add.s32 %r482, %r4018, %r481; 2026-02-21T08:30:48.5125208Z shr.s32 %r483, %r482, 9; 2026-02-21T08:30:48.5125536Z .loc 1 26 33 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:26:33 2026-02-21T08:30:48.5125822Z shl.b32 %r484, %r483, 3; 2026-02-21T08:30:48.5126086Z .loc 1 27 39 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:39 2026-02-21T08:30:48.5126370Z sub.s32 %r485, 64, %r484; 2026-02-21T08:30:48.5126634Z .loc 1 27 52 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:52 2026-02-21T08:30:48.5126922Z min.s32 %r486, %r485, 8; 2026-02-21T08:30:48.5127176Z .loc 1 28 45 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:45 2026-02-21T08:30:48.5127469Z and.b32 %r487, %r482, -512; 2026-02-21T08:30:48.5127629Z sub.s32 %r488, %r4018, %r487; 2026-02-21T08:30:48.5127898Z .loc 1 29 51 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:29:51 2026-02-21T08:30:48.5128173Z div.s32 %r82, %r488, %r486; 2026-02-21T08:30:48.5128443Z .loc 1 28 64 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:64 2026-02-21T08:30:48.5128735Z mul.lo.s32 %r489, %r82, %r486; 2026-02-21T08:30:48.5128897Z sub.s32 %r490, %r488, %r489; 2026-02-21T08:30:48.5129164Z .loc 1 28 30 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:30 
2026-02-21T08:30:48.5129466Z add.s32 %r491, %r490, %r484; 2026-02-21T08:30:48.5129748Z .loc 1 30 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:30:27 2026-02-21T08:30:48.5130041Z shl.b32 %r83, %r491, 7; 2026-02-21T08:30:48.5130315Z .loc 1 31 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:31:32 2026-02-21T08:30:48.5130614Z or.b32 %r84, %r83, %r6; 2026-02-21T08:30:48.5130884Z .loc 1 32 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:32:27 2026-02-21T08:30:48.5131183Z shl.b32 %r86, %r82, 6; 2026-02-21T08:30:48.5131452Z .loc 1 33 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:33:32 2026-02-21T08:30:48.5131791Z or.b32 %r492, %r86, %r9; 2026-02-21T08:30:48.5131951Z or.b32 %r493, %r86, %r10; 2026-02-21T08:30:48.5132121Z or.b32 %r494, %r86, %r11; 2026-02-21T08:30:48.5132287Z or.b32 %r495, %r86, %r12; 2026-02-21T08:30:48.5132561Z .loc 1 48 53 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:53 2026-02-21T08:30:48.5132868Z shl.b32 %r496, %r492, 10; 2026-02-21T08:30:48.5133024Z shl.b32 %r497, %r493, 10; 2026-02-21T08:30:48.5133189Z shl.b32 %r498, %r494, 10; 2026-02-21T08:30:48.5133347Z shl.b32 %r499, %r495, 10; 2026-02-21T08:30:48.5133513Z $L__tmp67: 2026-02-21T08:30:48.5133827Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5134212Z shfl.sync.idx.b32 %r95, %r5, 0, 31, -1; 2026-02-21T08:30:48.5134414Z shl.b32 %r500, %r95, 21; 2026-02-21T08:30:48.5134579Z and.b32 %r501, %r500, 6291456; 2026-02-21T08:30:48.5134815Z add.s32 %r3028, %r501, %r4008; 2026-02-21T08:30:48.5134983Z mov.pred %p17, -1; 2026-02-21T08:30:48.5135140Z mov.b32 %r4020, 0; 2026-02-21T08:30:48.5135288Z // begin inline asm 2026-02-21T08:30:48.5135718Z @%p17 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 0], 64, {%r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020}; 
2026-02-21T08:30:48.5136172Z // end inline asm 2026-02-21T08:30:48.5136312Z // begin inline asm 2026-02-21T08:30:48.5136731Z @%p17 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 16], 64, {%r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020}; 2026-02-21T08:30:48.5137169Z // end inline asm 2026-02-21T08:30:48.5137312Z // begin inline asm 2026-02-21T08:30:48.5137743Z @%p17 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 32], 64, {%r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020}; 2026-02-21T08:30:48.5138167Z // end inline asm 2026-02-21T08:30:48.5138312Z // begin inline asm 2026-02-21T08:30:48.5138695Z @%p17 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 48], 64, {%r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020, %r4020}; 2026-02-21T08:30:48.5139118Z // end inline asm 2026-02-21T08:30:48.5139251Z // begin inline asm 2026-02-21T08:30:48.5139409Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5139573Z // end inline asm 2026-02-21T08:30:48.5139715Z bar.sync 0; 2026-02-21T08:30:48.5139848Z $L__tmp68: 2026-02-21T08:30:48.5140089Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5140385Z add.s32 %r4021, %r294, 57344; 2026-02-21T08:30:48.5140540Z // begin inline asm 2026-02-21T08:30:48.5140717Z @%p10 mbarrier.init.shared::cta.b64 [%r4021], 1; 2026-02-21T08:30:48.5140911Z // end inline asm 2026-02-21T08:30:48.5141050Z bar.sync 0; 2026-02-21T08:30:48.5141184Z add.s32 %r2245, %r294, 57352; 2026-02-21T08:30:48.5141344Z // begin inline asm 2026-02-21T08:30:48.5141509Z @%p10 mbarrier.init.shared::cta.b64 [%r2245], 1; 2026-02-21T08:30:48.5141746Z // end inline asm 2026-02-21T08:30:48.5142002Z .loc 1 48 60 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:60 2026-02-21T08:30:48.5142293Z 
or.b32 %r503, %r496, %r22; 2026-02-21T08:30:48.5142460Z or.b32 %r504, %r497, %r22; 2026-02-21T08:30:48.5142614Z or.b32 %r505, %r498, %r22; 2026-02-21T08:30:48.5142771Z or.b32 %r506, %r499, %r22; 2026-02-21T08:30:48.5143030Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5143329Z mad.wide.s32 %rd103, %r503, 2, %rd92; 2026-02-21T08:30:48.5143520Z mad.wide.s32 %rd104, %r504, 2, %rd92; 2026-02-21T08:30:48.5143701Z mad.wide.s32 %rd105, %r505, 2, %rd92; 2026-02-21T08:30:48.5143885Z mad.wide.s32 %rd106, %r506, 2, %rd92; 2026-02-21T08:30:48.5144051Z mov.b32 %r619, 16; 2026-02-21T08:30:48.5144306Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5144590Z // begin inline asm 2026-02-21T08:30:48.5144805Z cp.async.cg.shared.global [ %r2763 + 0 ], [ %rd103 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5145039Z // end inline asm 2026-02-21T08:30:48.5145187Z // begin inline asm 2026-02-21T08:30:48.5145395Z cp.async.cg.shared.global [ %r2765 + 0 ], [ %rd104 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5145619Z // end inline asm 2026-02-21T08:30:48.5145759Z // begin inline asm 2026-02-21T08:30:48.5145950Z cp.async.cg.shared.global [ %r2767 + 0 ], [ %rd105 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5146177Z // end inline asm 2026-02-21T08:30:48.5146310Z // begin inline asm 2026-02-21T08:30:48.5146509Z cp.async.cg.shared.global [ %r2769 + 0 ], [ %rd106 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5146732Z // end inline asm 2026-02-21T08:30:48.5146948Z cp.async.commit_group; 2026-02-21T08:30:48.5147215Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5147498Z add.s32 %r507, %r84, %r39; 2026-02-21T08:30:48.5147662Z add.s32 %r508, %r84, %r40; 2026-02-21T08:30:48.5147924Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5148212Z cvt.s64.s32 %rd115, %r507; 2026-02-21T08:30:48.5148372Z add.s64 %rd107, %rd93, 
%rd115; 2026-02-21T08:30:48.5148545Z cvt.s64.s32 %rd116, %r508; 2026-02-21T08:30:48.5148701Z add.s64 %rd108, %rd93, %rd116; 2026-02-21T08:30:48.5148976Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5149261Z // begin inline asm 2026-02-21T08:30:48.5149461Z cp.async.cg.shared.global [ %r2771 + 0 ], [ %rd107 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5149744Z // end inline asm 2026-02-21T08:30:48.5149885Z // begin inline asm 2026-02-21T08:30:48.5150087Z cp.async.cg.shared.global [ %r2773 + 0 ], [ %rd108 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5150303Z // end inline asm 2026-02-21T08:30:48.5150451Z cp.async.commit_group; 2026-02-21T08:30:48.5150711Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5151000Z cvt.s64.s32 %rd117, %r496; 2026-02-21T08:30:48.5151164Z cvt.u64.u32 %rd11, %r22; 2026-02-21T08:30:48.5151318Z or.b64 %rd118, %rd117, %rd11; 2026-02-21T08:30:48.5151485Z shl.b64 %rd119, %rd118, 1; 2026-02-21T08:30:48.5151678Z add.s64 %rd12, %rd92, %rd119; 2026-02-21T08:30:48.5151846Z add.s64 %rd109, %rd12, 128; 2026-02-21T08:30:48.5152008Z cvt.s64.s32 %rd120, %r497; 2026-02-21T08:30:48.5152170Z or.b64 %rd121, %rd120, %rd11; 2026-02-21T08:30:48.5152328Z shl.b64 %rd122, %rd121, 1; 2026-02-21T08:30:48.5152489Z add.s64 %rd13, %rd92, %rd122; 2026-02-21T08:30:48.5152657Z add.s64 %rd110, %rd13, 128; 2026-02-21T08:30:48.5152818Z cvt.s64.s32 %rd123, %r498; 2026-02-21T08:30:48.5152978Z or.b64 %rd124, %rd123, %rd11; 2026-02-21T08:30:48.5153135Z shl.b64 %rd125, %rd124, 1; 2026-02-21T08:30:48.5153294Z add.s64 %rd14, %rd92, %rd125; 2026-02-21T08:30:48.5153453Z add.s64 %rd111, %rd14, 128; 2026-02-21T08:30:48.5153618Z cvt.s64.s32 %rd126, %r499; 2026-02-21T08:30:48.5153772Z or.b64 %rd127, %rd126, %rd11; 2026-02-21T08:30:48.5153935Z shl.b64 %rd128, %rd127, 1; 2026-02-21T08:30:48.5154090Z add.s64 %rd15, %rd92, %rd128; 2026-02-21T08:30:48.5154255Z add.s64 %rd112, %rd15, 128; 
2026-02-21T08:30:48.5154528Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5154809Z bar.sync 0; 2026-02-21T08:30:48.5154961Z // begin inline asm 2026-02-21T08:30:48.5155163Z cp.async.cg.shared.global [ %r2468 + 0 ], [ %rd109 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5155396Z // end inline asm 2026-02-21T08:30:48.5155534Z // begin inline asm 2026-02-21T08:30:48.5155747Z cp.async.cg.shared.global [ %r2470 + 0 ], [ %rd110 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5155976Z // end inline asm 2026-02-21T08:30:48.5156111Z // begin inline asm 2026-02-21T08:30:48.5156315Z cp.async.cg.shared.global [ %r2472 + 0 ], [ %rd111 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5156537Z // end inline asm 2026-02-21T08:30:48.5156678Z // begin inline asm 2026-02-21T08:30:48.5156871Z cp.async.cg.shared.global [ %r2474 + 0 ], [ %rd112 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5157101Z // end inline asm 2026-02-21T08:30:48.5157242Z cp.async.commit_group; 2026-02-21T08:30:48.5157508Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5157797Z add.s32 %r509, %r84, %r48; 2026-02-21T08:30:48.5157953Z add.s32 %r510, %r84, %r49; 2026-02-21T08:30:48.5158223Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5158513Z cvt.s64.s32 %rd129, %r509; 2026-02-21T08:30:48.5158735Z add.s64 %rd113, %rd93, %rd129; 2026-02-21T08:30:48.5158898Z cvt.s64.s32 %rd130, %r510; 2026-02-21T08:30:48.5159061Z add.s64 %rd114, %rd93, %rd130; 2026-02-21T08:30:48.5159328Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5159616Z // begin inline asm 2026-02-21T08:30:48.5159823Z cp.async.cg.shared.global [ %r2476 + 0 ], [ %rd113 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5160047Z // end inline asm 2026-02-21T08:30:48.5160189Z // begin inline asm 2026-02-21T08:30:48.5160386Z cp.async.cg.shared.global [ %r2478 + 0 ], [ %rd114 + 0 ], 0x10, %r619; 
2026-02-21T08:30:48.5160615Z // end inline asm 2026-02-21T08:30:48.5160757Z cp.async.commit_group; 2026-02-21T08:30:48.5161030Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5161322Z cp.async.wait_group 2; 2026-02-21T08:30:48.5161530Z bar.sync 0; 2026-02-21T08:30:48.5161738Z ld.shared.v4.b32 {%r511, %r512, %r513, %r514}, [%r53]; 2026-02-21T08:30:48.5161944Z mov.b32 {%rs1, %rs2}, %r514; 2026-02-21T08:30:48.5162113Z mov.b32 {%rs3, %rs4}, %r513; 2026-02-21T08:30:48.5162270Z mov.b32 {%rs5, %rs6}, %r512; 2026-02-21T08:30:48.5162432Z mov.b32 {%rs7, %rs8}, %r511; 2026-02-21T08:30:48.5162623Z ld.shared.v4.b32 {%r515, %r516, %r517, %r518}, [%r53+16]; 2026-02-21T08:30:48.5162837Z mov.b32 {%rs9, %rs10}, %r518; 2026-02-21T08:30:48.5162998Z mov.b32 {%rs11, %rs12}, %r517; 2026-02-21T08:30:48.5163169Z mov.b32 {%rs13, %rs14}, %r516; 2026-02-21T08:30:48.5163338Z mov.b32 {%rs15, %rs16}, %r515; 2026-02-21T08:30:48.5163529Z ld.shared.v4.b32 {%r519, %r520, %r521, %r522}, [%r53+32]; 2026-02-21T08:30:48.5163742Z mov.b32 {%rs17, %rs18}, %r522; 2026-02-21T08:30:48.5163899Z mov.b32 {%rs19, %rs20}, %r521; 2026-02-21T08:30:48.5164063Z mov.b32 {%rs21, %rs22}, %r520; 2026-02-21T08:30:48.5164220Z mov.b32 {%rs23, %rs24}, %r519; 2026-02-21T08:30:48.5164416Z ld.shared.v4.b32 {%r523, %r524, %r525, %r526}, [%r53+48]; 2026-02-21T08:30:48.5164622Z mov.b32 {%rs25, %rs26}, %r526; 2026-02-21T08:30:48.5164792Z mov.b32 {%rs27, %rs28}, %r525; 2026-02-21T08:30:48.5164961Z mov.b32 {%rs29, %rs30}, %r524; 2026-02-21T08:30:48.5165118Z mov.b32 {%rs31, %rs32}, %r523; 2026-02-21T08:30:48.5165393Z .loc 1 52 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:52:32 2026-02-21T08:30:48.5165680Z cvt.f32.bf16 %r447, %rs7; 2026-02-21T08:30:48.5165843Z cvt.f32.bf16 %r448, %rs8; 2026-02-21T08:30:48.5165996Z cvt.f32.bf16 %r449, %rs5; 2026-02-21T08:30:48.5166153Z cvt.f32.bf16 %r450, %rs6; 2026-02-21T08:30:48.5166301Z cvt.f32.bf16 %r451, %rs3; 
2026-02-21T08:30:48.5166458Z cvt.f32.bf16 %r452, %rs4; 2026-02-21T08:30:48.5166608Z cvt.f32.bf16 %r453, %rs1; 2026-02-21T08:30:48.5166765Z cvt.f32.bf16 %r454, %rs2; 2026-02-21T08:30:48.5166922Z cvt.f32.bf16 %r455, %rs15; 2026-02-21T08:30:48.5167076Z cvt.f32.bf16 %r456, %rs16; 2026-02-21T08:30:48.5167237Z cvt.f32.bf16 %r457, %rs13; 2026-02-21T08:30:48.5167391Z cvt.f32.bf16 %r458, %rs14; 2026-02-21T08:30:48.5167549Z cvt.f32.bf16 %r459, %rs11; 2026-02-21T08:30:48.5167697Z cvt.f32.bf16 %r460, %rs12; 2026-02-21T08:30:48.5167856Z cvt.f32.bf16 %r461, %rs9; 2026-02-21T08:30:48.5168005Z cvt.f32.bf16 %r462, %rs10; 2026-02-21T08:30:48.5168162Z cvt.f32.bf16 %r464, %rs23; 2026-02-21T08:30:48.5168312Z cvt.f32.bf16 %r465, %rs24; 2026-02-21T08:30:48.5168472Z cvt.f32.bf16 %r466, %rs21; 2026-02-21T08:30:48.5168629Z cvt.f32.bf16 %r467, %rs22; 2026-02-21T08:30:48.5168778Z cvt.f32.bf16 %r468, %rs19; 2026-02-21T08:30:48.5168932Z cvt.f32.bf16 %r469, %rs20; 2026-02-21T08:30:48.5169080Z cvt.f32.bf16 %r470, %rs17; 2026-02-21T08:30:48.5169235Z cvt.f32.bf16 %r471, %rs18; 2026-02-21T08:30:48.5169383Z cvt.f32.bf16 %r472, %rs31; 2026-02-21T08:30:48.5169538Z cvt.f32.bf16 %r473, %rs32; 2026-02-21T08:30:48.5169685Z cvt.f32.bf16 %r474, %rs29; 2026-02-21T08:30:48.5169840Z cvt.f32.bf16 %r475, %rs30; 2026-02-21T08:30:48.5170075Z cvt.f32.bf16 %r476, %rs27; 2026-02-21T08:30:48.5170224Z cvt.f32.bf16 %r477, %rs28; 2026-02-21T08:30:48.5170379Z cvt.f32.bf16 %r478, %rs25; 2026-02-21T08:30:48.5170529Z cvt.f32.bf16 %r479, %rs26; 2026-02-21T08:30:48.5170682Z $L__tmp69: 2026-02-21T08:30:48.5170975Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5171316Z add.s32 %r2497, %r501, %r3694; 2026-02-21T08:30:48.5171475Z // begin inline asm 2026-02-21T08:30:48.5171888Z @%p17 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 0], 32, {%r447, %r448, %r449, %r450, %r451, %r452, %r453, %r454, %r455, %r456, %r457, %r458, %r459, %r460, %r461, %r462}; 
2026-02-21T08:30:48.5172313Z // end inline asm 2026-02-21T08:30:48.5172465Z // begin inline asm 2026-02-21T08:30:48.5172919Z @%p17 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 16], 32, {%r464, %r465, %r466, %r467, %r468, %r469, %r470, %r471, %r472, %r473, %r474, %r475, %r476, %r477, %r478, %r479}; 2026-02-21T08:30:48.5173331Z // end inline asm 2026-02-21T08:30:48.5173482Z // begin inline asm 2026-02-21T08:30:48.5173641Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5173823Z // end inline asm 2026-02-21T08:30:48.5173973Z bar.sync 0; 2026-02-21T08:30:48.5174112Z $L__tmp70: 2026-02-21T08:30:48.5174377Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5174684Z ld.shared.s8 %rs33, [%r54]; 2026-02-21T08:30:48.5174976Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5175273Z shl.b16 %rs34, %rs33, 4; 2026-02-21T08:30:48.5175554Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5175860Z ld.shared.s8 %rs35, [%r54+128]; 2026-02-21T08:30:48.5176157Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5176466Z shl.b16 %rs36, %rs35, 4; 2026-02-21T08:30:48.5176733Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5177037Z ld.shared.s8 %rs37, [%r54+256]; 2026-02-21T08:30:48.5177323Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5177623Z shl.b16 %rs38, %rs37, 4; 2026-02-21T08:30:48.5177897Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5178197Z ld.shared.s8 %rs39, [%r54+384]; 2026-02-21T08:30:48.5178491Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5178783Z shl.b16 %rs40, %rs39, 4; 2026-02-21T08:30:48.5179055Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5179356Z ld.shared.s8 %rs41, [%r54+512]; 2026-02-21T08:30:48.5179652Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5179938Z shl.b16 %rs42, %rs41, 4; 2026-02-21T08:30:48.5180190Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5180481Z ld.shared.s8 %rs43, [%r54+640]; 2026-02-21T08:30:48.5180753Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5181041Z shl.b16 %rs44, %rs43, 4; 2026-02-21T08:30:48.5181296Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5181629Z ld.shared.s8 %rs45, [%r54+768]; 2026-02-21T08:30:48.5181909Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5182190Z shl.b16 %rs46, %rs45, 4; 2026-02-21T08:30:48.5182454Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5182786Z ld.shared.s8 %rs47, [%r56]; 2026-02-21T08:30:48.5183054Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5183333Z shl.b16 %rs48, %rs47, 4; 2026-02-21T08:30:48.5183596Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5183889Z ld.shared.s8 %rs49, [%r54+1024]; 2026-02-21T08:30:48.5184159Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5184441Z shl.b16 %rs50, %rs49, 4; 2026-02-21T08:30:48.5184696Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5184987Z ld.shared.s8 %rs51, [%r54+1152]; 2026-02-21T08:30:48.5185303Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5185590Z shl.b16 %rs52, %rs51, 4; 2026-02-21T08:30:48.5185849Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5186128Z ld.shared.s8 %rs53, [%r54+1280]; 2026-02-21T08:30:48.5186401Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5186681Z shl.b16 %rs54, %rs53, 4; 2026-02-21T08:30:48.5186941Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5187230Z ld.shared.s8 %rs55, [%r54+1408]; 2026-02-21T08:30:48.5187495Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5187779Z shl.b16 %rs56, %rs55, 4; 2026-02-21T08:30:48.5188031Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5188327Z ld.shared.s8 %rs57, [%r54+1536]; 2026-02-21T08:30:48.5188601Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5188889Z shl.b16 %rs58, %rs57, 4; 2026-02-21T08:30:48.5189150Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5189431Z ld.shared.s8 %rs59, [%r54+1664]; 2026-02-21T08:30:48.5189704Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5189975Z shl.b16 %rs60, %rs59, 4; 2026-02-21T08:30:48.5190234Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5190511Z ld.shared.s8 %rs61, [%r54+1792]; 2026-02-21T08:30:48.5190785Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5191062Z shl.b16 %rs62, %rs61, 4; 2026-02-21T08:30:48.5191318Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5191640Z ld.shared.s8 %rs63, [%r58]; 2026-02-21T08:30:48.5191911Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5192201Z shl.b16 %rs64, %rs63, 4; 2026-02-21T08:30:48.5192454Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5192750Z ld.shared.s8 %rs65, [%r54+2048]; 2026-02-21T08:30:48.5193024Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5193302Z shl.b16 %rs66, %rs65, 4; 2026-02-21T08:30:48.5193573Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5193859Z ld.shared.s8 %rs67, [%r54+2176]; 2026-02-21T08:30:48.5194139Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5194466Z shl.b16 %rs68, %rs67, 4; 2026-02-21T08:30:48.5194727Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5195018Z ld.shared.s8 %rs69, [%r54+2304]; 2026-02-21T08:30:48.5195288Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5195575Z shl.b16 %rs70, %rs69, 4; 2026-02-21T08:30:48.5195826Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5196116Z ld.shared.s8 %rs71, [%r54+2432]; 2026-02-21T08:30:48.5196378Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5196665Z shl.b16 %rs72, %rs71, 4; 2026-02-21T08:30:48.5196929Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5197209Z ld.shared.s8 %rs73, [%r54+2560]; 2026-02-21T08:30:48.5197533Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5197809Z shl.b16 %rs74, %rs73, 4; 2026-02-21T08:30:48.5198067Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5198352Z ld.shared.s8 %rs75, [%r54+2688]; 2026-02-21T08:30:48.5198618Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5198901Z shl.b16 %rs76, %rs75, 4; 2026-02-21T08:30:48.5199155Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5199440Z ld.shared.s8 %rs77, [%r54+2816]; 2026-02-21T08:30:48.5199705Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5199990Z shl.b16 %rs78, %rs77, 4; 2026-02-21T08:30:48.5200254Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5200537Z ld.shared.s8 %rs79, [%r60]; 2026-02-21T08:30:48.5200810Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5201091Z shl.b16 %rs80, %rs79, 4; 2026-02-21T08:30:48.5201352Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5201674Z ld.shared.s8 %rs81, [%r54+3072]; 2026-02-21T08:30:48.5201953Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5202248Z shl.b16 %rs82, %rs81, 4; 2026-02-21T08:30:48.5202515Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5202811Z ld.shared.s8 %rs83, [%r54+3200]; 2026-02-21T08:30:48.5203080Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5203372Z shl.b16 %rs84, %rs83, 4; 2026-02-21T08:30:48.5203629Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5203926Z ld.shared.s8 %rs85, [%r54+3328]; 2026-02-21T08:30:48.5204205Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5204487Z shl.b16 %rs86, %rs85, 4; 2026-02-21T08:30:48.5204748Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5205036Z ld.shared.s8 %rs87, [%r54+3456]; 2026-02-21T08:30:48.5205315Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5205595Z shl.b16 %rs88, %rs87, 4; 2026-02-21T08:30:48.5205858Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5206149Z ld.shared.s8 %rs89, [%r54+3584]; 2026-02-21T08:30:48.5206424Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5206768Z shl.b16 %rs90, %rs89, 4; 2026-02-21T08:30:48.5207024Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5207321Z ld.shared.s8 %rs91, [%r54+3712]; 2026-02-21T08:30:48.5207594Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5207878Z shl.b16 %rs92, %rs91, 4; 2026-02-21T08:30:48.5208046Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5208117Z ld.shared.s8 %rs93, [%r54+3840]; 2026-02-21T08:30:48.5208282Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5208342Z shl.b16 %rs94, %rs93, 4; 2026-02-21T08:30:48.5208583Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5208649Z ld.shared.s8 %rs95, [%r62]; 2026-02-21T08:30:48.5208812Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5208877Z shl.b16 %rs96, %rs95, 4; 2026-02-21T08:30:48.5209038Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5209098Z cvt.s16.s8 %rs97, %rs34; 2026-02-21T08:30:48.5209156Z shr.s16 %rs98, %rs97, 4; 2026-02-21T08:30:48.5209221Z cvt.s16.s8 %rs99, %rs36; 2026-02-21T08:30:48.5209281Z shr.s16 %rs100, %rs99, 4; 2026-02-21T08:30:48.5209339Z shr.s16 %rs101, %rs33, 4; 2026-02-21T08:30:48.5209402Z shr.s16 %rs102, %rs35, 4; 2026-02-21T08:30:48.5209565Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5209629Z cvt.rn.f32.s16 %r527, %rs102; 2026-02-21T08:30:48.5209696Z cvt.rn.f32.s16 %r528, %rs101; 2026-02-21T08:30:48.5209758Z cvt.rn.f32.s16 %r529, %rs100; 
2026-02-21T08:30:48.5209819Z cvt.rn.f32.s16 %r530, %rs98; 2026-02-21T08:30:48.5209981Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5210049Z cvt.s16.s8 %rs103, %rs38; 2026-02-21T08:30:48.5210112Z shr.s16 %rs104, %rs103, 4; 2026-02-21T08:30:48.5210170Z cvt.s16.s8 %rs105, %rs40; 2026-02-21T08:30:48.5210238Z shr.s16 %rs106, %rs105, 4; 2026-02-21T08:30:48.5210294Z shr.s16 %rs107, %rs37, 4; 2026-02-21T08:30:48.5210350Z shr.s16 %rs108, %rs39, 4; 2026-02-21T08:30:48.5210515Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5210584Z cvt.rn.f32.s16 %r531, %rs108; 2026-02-21T08:30:48.5210643Z cvt.rn.f32.s16 %r532, %rs107; 2026-02-21T08:30:48.5210703Z cvt.rn.f32.s16 %r533, %rs106; 2026-02-21T08:30:48.5210769Z cvt.rn.f32.s16 %r534, %rs104; 2026-02-21T08:30:48.5210935Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5210997Z cvt.s16.s8 %rs109, %rs42; 2026-02-21T08:30:48.5211063Z shr.s16 %rs110, %rs109, 4; 2026-02-21T08:30:48.5211121Z cvt.s16.s8 %rs111, %rs44; 2026-02-21T08:30:48.5211178Z shr.s16 %rs112, %rs111, 4; 2026-02-21T08:30:48.5211235Z shr.s16 %rs113, %rs41, 4; 2026-02-21T08:30:48.5211299Z shr.s16 %rs114, %rs43, 4; 2026-02-21T08:30:48.5211466Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5211526Z cvt.rn.f32.s16 %r535, %rs114; 2026-02-21T08:30:48.5211625Z cvt.rn.f32.s16 %r536, %rs113; 2026-02-21T08:30:48.5211685Z cvt.rn.f32.s16 %r537, %rs112; 2026-02-21T08:30:48.5211744Z cvt.rn.f32.s16 %r538, %rs110; 2026-02-21T08:30:48.5211916Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5211976Z cvt.s16.s8 %rs115, %rs46; 2026-02-21T08:30:48.5212039Z shr.s16 %rs116, %rs115, 4; 2026-02-21T08:30:48.5212156Z cvt.s16.s8 %rs117, %rs48; 2026-02-21T08:30:48.5212223Z shr.s16 %rs118, %rs117, 4; 2026-02-21T08:30:48.5212282Z shr.s16 %rs119, 
%rs45, 4; 2026-02-21T08:30:48.5212338Z shr.s16 %rs120, %rs47, 4; 2026-02-21T08:30:48.5212516Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5212578Z cvt.rn.f32.s16 %r539, %rs120; 2026-02-21T08:30:48.5212636Z cvt.rn.f32.s16 %r540, %rs119; 2026-02-21T08:30:48.5212697Z cvt.rn.f32.s16 %r541, %rs118; 2026-02-21T08:30:48.5212769Z cvt.rn.f32.s16 %r542, %rs116; 2026-02-21T08:30:48.5212932Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5212994Z cvt.s16.s8 %rs121, %rs50; 2026-02-21T08:30:48.5213065Z shr.s16 %rs122, %rs121, 4; 2026-02-21T08:30:48.5213128Z cvt.s16.s8 %rs123, %rs52; 2026-02-21T08:30:48.5213191Z shr.s16 %rs124, %rs123, 4; 2026-02-21T08:30:48.5213318Z shr.s16 %rs125, %rs49, 4; 2026-02-21T08:30:48.5213381Z shr.s16 %rs126, %rs51, 4; 2026-02-21T08:30:48.5213541Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5213601Z cvt.rn.f32.s16 %r543, %rs126; 2026-02-21T08:30:48.5213668Z cvt.rn.f32.s16 %r544, %rs125; 2026-02-21T08:30:48.5213726Z cvt.rn.f32.s16 %r545, %rs124; 2026-02-21T08:30:48.5213784Z cvt.rn.f32.s16 %r546, %rs122; 2026-02-21T08:30:48.5213953Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5214011Z cvt.s16.s8 %rs127, %rs54; 2026-02-21T08:30:48.5214069Z shr.s16 %rs128, %rs127, 4; 2026-02-21T08:30:48.5214127Z cvt.s16.s8 %rs129, %rs56; 2026-02-21T08:30:48.5214192Z shr.s16 %rs130, %rs129, 4; 2026-02-21T08:30:48.5214248Z shr.s16 %rs131, %rs53, 4; 2026-02-21T08:30:48.5214305Z shr.s16 %rs132, %rs55, 4; 2026-02-21T08:30:48.5214478Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5214540Z cvt.rn.f32.s16 %r547, %rs132; 2026-02-21T08:30:48.5214597Z cvt.rn.f32.s16 %r548, %rs131; 2026-02-21T08:30:48.5214664Z cvt.rn.f32.s16 %r549, %rs130; 2026-02-21T08:30:48.5214721Z cvt.rn.f32.s16 %r550, %rs128; 
2026-02-21T08:30:48.5214881Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5214940Z cvt.s16.s8 %rs133, %rs58; 2026-02-21T08:30:48.5215007Z shr.s16 %rs134, %rs133, 4; 2026-02-21T08:30:48.5215064Z cvt.s16.s8 %rs135, %rs60; 2026-02-21T08:30:48.5215123Z shr.s16 %rs136, %rs135, 4; 2026-02-21T08:30:48.5215187Z shr.s16 %rs137, %rs57, 4; 2026-02-21T08:30:48.5215243Z shr.s16 %rs138, %rs59, 4; 2026-02-21T08:30:48.5215432Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5215497Z cvt.rn.f32.s16 %r551, %rs138; 2026-02-21T08:30:48.5215554Z cvt.rn.f32.s16 %r552, %rs137; 2026-02-21T08:30:48.5215614Z cvt.rn.f32.s16 %r553, %rs136; 2026-02-21T08:30:48.5215674Z cvt.rn.f32.s16 %r554, %rs134; 2026-02-21T08:30:48.5215844Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5215902Z cvt.s16.s8 %rs139, %rs62; 2026-02-21T08:30:48.5215973Z shr.s16 %rs140, %rs139, 4; 2026-02-21T08:30:48.5216041Z cvt.s16.s8 %rs141, %rs64; 2026-02-21T08:30:48.5216100Z shr.s16 %rs142, %rs141, 4; 2026-02-21T08:30:48.5216161Z shr.s16 %rs143, %rs61, 4; 2026-02-21T08:30:48.5216220Z shr.s16 %rs144, %rs63, 4; 2026-02-21T08:30:48.5216398Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5216460Z cvt.rn.f32.s16 %r555, %rs144; 2026-02-21T08:30:48.5216521Z cvt.rn.f32.s16 %r556, %rs143; 2026-02-21T08:30:48.5216591Z cvt.rn.f32.s16 %r557, %rs142; 2026-02-21T08:30:48.5216652Z cvt.rn.f32.s16 %r558, %rs140; 2026-02-21T08:30:48.5216823Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5216937Z cvt.s16.s8 %rs145, %rs66; 2026-02-21T08:30:48.5216998Z shr.s16 %rs146, %rs145, 4; 2026-02-21T08:30:48.5217058Z cvt.s16.s8 %rs147, %rs68; 2026-02-21T08:30:48.5217118Z shr.s16 %rs148, %rs147, 4; 2026-02-21T08:30:48.5217185Z shr.s16 %rs149, %rs65, 4; 2026-02-21T08:30:48.5217241Z shr.s16 %rs150, %rs67, 
4; 2026-02-21T08:30:48.5217413Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5217482Z cvt.rn.f32.s16 %r559, %rs150; 2026-02-21T08:30:48.5217543Z cvt.rn.f32.s16 %r560, %rs149; 2026-02-21T08:30:48.5217604Z cvt.rn.f32.s16 %r561, %rs148; 2026-02-21T08:30:48.5217662Z cvt.rn.f32.s16 %r562, %rs146; 2026-02-21T08:30:48.5217842Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5217902Z cvt.s16.s8 %rs151, %rs70; 2026-02-21T08:30:48.5218004Z shr.s16 %rs152, %rs151, 4; 2026-02-21T08:30:48.5218076Z cvt.s16.s8 %rs153, %rs72; 2026-02-21T08:30:48.5218137Z shr.s16 %rs154, %rs153, 4; 2026-02-21T08:30:48.5218196Z shr.s16 %rs155, %rs69, 4; 2026-02-21T08:30:48.5218262Z shr.s16 %rs156, %rs71, 4; 2026-02-21T08:30:48.5218433Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5218495Z cvt.rn.f32.s16 %r563, %rs156; 2026-02-21T08:30:48.5218556Z cvt.rn.f32.s16 %r564, %rs155; 2026-02-21T08:30:48.5218624Z cvt.rn.f32.s16 %r565, %rs154; 2026-02-21T08:30:48.5218686Z cvt.rn.f32.s16 %r566, %rs152; 2026-02-21T08:30:48.5218857Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5218925Z cvt.s16.s8 %rs157, %rs74; 2026-02-21T08:30:48.5218985Z shr.s16 %rs158, %rs157, 4; 2026-02-21T08:30:48.5219044Z cvt.s16.s8 %rs159, %rs76; 2026-02-21T08:30:48.5219111Z shr.s16 %rs160, %rs159, 4; 2026-02-21T08:30:48.5219173Z shr.s16 %rs161, %rs73, 4; 2026-02-21T08:30:48.5219236Z shr.s16 %rs162, %rs75, 4; 2026-02-21T08:30:48.5219406Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5219475Z cvt.rn.f32.s16 %r567, %rs162; 2026-02-21T08:30:48.5219536Z cvt.rn.f32.s16 %r568, %rs161; 2026-02-21T08:30:48.5219597Z cvt.rn.f32.s16 %r569, %rs160; 2026-02-21T08:30:48.5219664Z cvt.rn.f32.s16 %r570, %rs158; 2026-02-21T08:30:48.5219834Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5219895Z cvt.s16.s8 %rs163, %rs78; 2026-02-21T08:30:48.5219956Z shr.s16 %rs164, %rs163, 4; 2026-02-21T08:30:48.5220023Z cvt.s16.s8 %rs165, %rs80; 2026-02-21T08:30:48.5220084Z shr.s16 %rs166, %rs165, 4; 2026-02-21T08:30:48.5220143Z shr.s16 %rs167, %rs77, 4; 2026-02-21T08:30:48.5220208Z shr.s16 %rs168, %rs79, 4; 2026-02-21T08:30:48.5220380Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5220445Z cvt.rn.f32.s16 %r571, %rs168; 2026-02-21T08:30:48.5220515Z cvt.rn.f32.s16 %r572, %rs167; 2026-02-21T08:30:48.5220576Z cvt.rn.f32.s16 %r573, %rs166; 2026-02-21T08:30:48.5220636Z cvt.rn.f32.s16 %r574, %rs164; 2026-02-21T08:30:48.5220807Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5220879Z cvt.s16.s8 %rs169, %rs82; 2026-02-21T08:30:48.5220942Z shr.s16 %rs170, %rs169, 4; 2026-02-21T08:30:48.5221002Z cvt.s16.s8 %rs171, %rs84; 2026-02-21T08:30:48.5221071Z shr.s16 %rs172, %rs171, 4; 2026-02-21T08:30:48.5221135Z shr.s16 %rs173, %rs81, 4; 2026-02-21T08:30:48.5221198Z shr.s16 %rs174, %rs83, 4; 2026-02-21T08:30:48.5221369Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5221439Z cvt.rn.f32.s16 %r575, %rs174; 2026-02-21T08:30:48.5221500Z cvt.rn.f32.s16 %r576, %rs173; 2026-02-21T08:30:48.5221596Z cvt.rn.f32.s16 %r577, %rs172; 2026-02-21T08:30:48.5221726Z cvt.rn.f32.s16 %r578, %rs170; 2026-02-21T08:30:48.5221906Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5221968Z cvt.s16.s8 %rs175, %rs86; 2026-02-21T08:30:48.5222035Z shr.s16 %rs176, %rs175, 4; 2026-02-21T08:30:48.5222096Z cvt.s16.s8 %rs177, %rs88; 2026-02-21T08:30:48.5222158Z shr.s16 %rs178, %rs177, 4; 2026-02-21T08:30:48.5222217Z shr.s16 %rs179, %rs85, 4; 2026-02-21T08:30:48.5222285Z shr.s16 %rs180, %rs87, 4; 2026-02-21T08:30:48.5222460Z .loc 1 77 32 
// cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5222523Z cvt.rn.f32.s16 %r579, %rs180; 2026-02-21T08:30:48.5222590Z cvt.rn.f32.s16 %r580, %rs179; 2026-02-21T08:30:48.5222653Z cvt.rn.f32.s16 %r581, %rs178; 2026-02-21T08:30:48.5222714Z cvt.rn.f32.s16 %r582, %rs176; 2026-02-21T08:30:48.5222947Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5223012Z cvt.s16.s8 %rs181, %rs90; 2026-02-21T08:30:48.5223072Z shr.s16 %rs182, %rs181, 4; 2026-02-21T08:30:48.5223133Z cvt.s16.s8 %rs183, %rs92; 2026-02-21T08:30:48.5223202Z shr.s16 %rs184, %rs183, 4; 2026-02-21T08:30:48.5223263Z shr.s16 %rs185, %rs89, 4; 2026-02-21T08:30:48.5223325Z shr.s16 %rs186, %rs91, 4; 2026-02-21T08:30:48.5223509Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5223571Z cvt.rn.f32.s16 %r583, %rs186; 2026-02-21T08:30:48.5223632Z cvt.rn.f32.s16 %r584, %rs185; 2026-02-21T08:30:48.5223692Z cvt.rn.f32.s16 %r585, %rs184; 2026-02-21T08:30:48.5223761Z cvt.rn.f32.s16 %r586, %rs182; 2026-02-21T08:30:48.5223945Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5224004Z cvt.s16.s8 %rs187, %rs94; 2026-02-21T08:30:48.5224074Z shr.s16 %rs188, %rs187, 4; 2026-02-21T08:30:48.5224135Z cvt.s16.s8 %rs189, %rs96; 2026-02-21T08:30:48.5224192Z shr.s16 %rs190, %rs189, 4; 2026-02-21T08:30:48.5224257Z shr.s16 %rs191, %rs93, 4; 2026-02-21T08:30:48.5224314Z shr.s16 %rs192, %rs95, 4; 2026-02-21T08:30:48.5224481Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5224539Z cvt.rn.f32.s16 %r587, %rs192; 2026-02-21T08:30:48.5224606Z cvt.rn.f32.s16 %r588, %rs191; 2026-02-21T08:30:48.5224664Z cvt.rn.f32.s16 %r589, %rs190; 2026-02-21T08:30:48.5224721Z cvt.rn.f32.s16 %r590, %rs188; 2026-02-21T08:30:48.5224823Z st.shared.v4.b32 [%r63], {%r530, %r528, %r529, %r527}; 2026-02-21T08:30:48.5224924Z st.shared.v4.b32 
[%r63+16384], {%r562, %r560, %r561, %r559}; 2026-02-21T08:30:48.5225012Z st.shared.v4.b32 [%r64], {%r534, %r532, %r533, %r531}; 2026-02-21T08:30:48.5225113Z st.shared.v4.b32 [%r64+16384], {%r566, %r564, %r565, %r563}; 2026-02-21T08:30:48.5225200Z st.shared.v4.b32 [%r65], {%r538, %r536, %r537, %r535}; 2026-02-21T08:30:48.5225295Z st.shared.v4.b32 [%r65+16384], {%r570, %r568, %r569, %r567}; 2026-02-21T08:30:48.5225379Z st.shared.v4.b32 [%r66], {%r542, %r540, %r541, %r539}; 2026-02-21T08:30:48.5225477Z st.shared.v4.b32 [%r66+16384], {%r574, %r572, %r573, %r571}; 2026-02-21T08:30:48.5225560Z st.shared.v4.b32 [%r67], {%r546, %r544, %r545, %r543}; 2026-02-21T08:30:48.5225651Z st.shared.v4.b32 [%r67+16384], {%r578, %r576, %r577, %r575}; 2026-02-21T08:30:48.5225742Z st.shared.v4.b32 [%r68], {%r550, %r548, %r549, %r547}; 2026-02-21T08:30:48.5225830Z st.shared.v4.b32 [%r68+16384], {%r582, %r580, %r581, %r579}; 2026-02-21T08:30:48.5225912Z st.shared.v4.b32 [%r69], {%r554, %r552, %r553, %r551}; 2026-02-21T08:30:48.5226006Z st.shared.v4.b32 [%r69+16384], {%r586, %r584, %r585, %r583}; 2026-02-21T08:30:48.5226088Z st.shared.v4.b32 [%r70], {%r558, %r556, %r557, %r555}; 2026-02-21T08:30:48.5226177Z st.shared.v4.b32 [%r70+16384], {%r590, %r588, %r589, %r587}; 2026-02-21T08:30:48.5226231Z $L__tmp71: 2026-02-21T08:30:48.5226506Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5226565Z // begin inline asm 2026-02-21T08:30:48.5226640Z fence.proxy.async.shared::cta; 2026-02-21T08:30:48.5226702Z // end inline asm 2026-02-21T08:30:48.5226758Z bar.sync 0; 2026-02-21T08:30:48.5226821Z setp.ne.b32 %p14, %r95, 0; 2026-02-21T08:30:48.5226887Z @%p14 bra $L__BB0_5; 2026-02-21T08:30:48.5226988Z // %bb.4: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5227053Z elect.sync %r615|%p16, -1; 2026-02-21T08:30:48.5227112Z mov.b32 %r593, 69208336; 2026-02-21T08:30:48.5227179Z mov.pred %p15, 0; 2026-02-21T08:30:48.5227236Z // 
begin inline asm 2026-02-21T08:30:48.5227398Z @%p16 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 0 ], %rd737, %r593, %p15; 2026-02-21T08:30:48.5227463Z // end inline asm 2026-02-21T08:30:48.5227592Z // begin inline asm 2026-02-21T08:30:48.5227747Z @%p16 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 8 ], %rd738, %r593, %p17; 2026-02-21T08:30:48.5227812Z // end inline asm 2026-02-21T08:30:48.5227868Z // begin inline asm 2026-02-21T08:30:48.5228018Z @%p16 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 16 ], %rd739, %r593, %p17; 2026-02-21T08:30:48.5228073Z // end inline asm 2026-02-21T08:30:48.5228139Z // begin inline asm 2026-02-21T08:30:48.5228284Z @%p16 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 24 ], %rd740, %r593, %p17; 2026-02-21T08:30:48.5228339Z // end inline asm 2026-02-21T08:30:48.5228404Z // begin inline asm 2026-02-21T08:30:48.5228547Z @%p16 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 32 ], %rd741, %r593, %p17; 2026-02-21T08:30:48.5228602Z // end inline asm 2026-02-21T08:30:48.5228667Z // begin inline asm 2026-02-21T08:30:48.5228812Z @%p16 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 40 ], %rd742, %r593, %p17; 2026-02-21T08:30:48.5228871Z // end inline asm 2026-02-21T08:30:48.5228935Z // begin inline asm 2026-02-21T08:30:48.5229077Z @%p16 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 48 ], %rd743, %r593, %p17; 2026-02-21T08:30:48.5229132Z // end inline asm 2026-02-21T08:30:48.5229188Z // begin inline asm 2026-02-21T08:30:48.5229342Z @%p16 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 56 ], %rd744, %r593, %p17; 2026-02-21T08:30:48.5229400Z // end inline asm 2026-02-21T08:30:48.5229464Z add.s32 %r617, %r294, 57344; 2026-02-21T08:30:48.5229536Z cvt.u64.u32 %rd139, %r617; 2026-02-21T08:30:48.5229594Z // begin inline asm 2026-02-21T08:30:48.5229722Z @%p16 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd139]; 
2026-02-21T08:30:48.5229791Z // end inline asm 2026-02-21T08:30:48.5229847Z $L__tmp72: 2026-02-21T08:30:48.5229946Z $L__BB0_5: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5230118Z .loc 1 0 0 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0 2026-02-21T08:30:48.5230190Z or.b32 %r85, %r83, %r8; 2026-02-21T08:30:48.5230249Z or.b32 %r87, %r86, %r14; 2026-02-21T08:30:48.5230307Z or.b32 %r88, %r86, %r15; 2026-02-21T08:30:48.5230370Z or.b32 %r89, %r86, %r16; 2026-02-21T08:30:48.5230426Z or.b32 %r90, %r86, %r17; 2026-02-21T08:30:48.5230482Z or.b32 %r91, %r86, %r18; 2026-02-21T08:30:48.5230536Z or.b32 %r92, %r86, %r19; 2026-02-21T08:30:48.5230599Z or.b32 %r93, %r86, %r20; 2026-02-21T08:30:48.5230655Z or.b32 %r94, %r86, %r21; 2026-02-21T08:30:48.5230829Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5230899Z add.s64 %rd140, %rd12, 256; 2026-02-21T08:30:48.5230959Z add.s64 %rd141, %rd13, 256; 2026-02-21T08:30:48.5231017Z add.s64 %rd142, %rd14, 256; 2026-02-21T08:30:48.5231083Z add.s64 %rd143, %rd15, 256; 2026-02-21T08:30:48.5231255Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5231353Z // begin inline asm 2026-02-21T08:30:48.5231476Z cp.async.cg.shared.global [ %r2763 + 0 ], [ %rd140 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5231565Z // end inline asm 2026-02-21T08:30:48.5231625Z // begin inline asm 2026-02-21T08:30:48.5231743Z cp.async.cg.shared.global [ %r2765 + 0 ], [ %rd141 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5231806Z // end inline asm 2026-02-21T08:30:48.5231863Z // begin inline asm 2026-02-21T08:30:48.5231977Z cp.async.cg.shared.global [ %r2767 + 0 ], [ %rd142 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5232039Z // end inline asm 2026-02-21T08:30:48.5232096Z // begin inline asm 2026-02-21T08:30:48.5232208Z cp.async.cg.shared.global [ %r2769 + 0 ], [ %rd143 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5232262Z // end inline asm 
2026-02-21T08:30:48.5232334Z cp.async.commit_group; 2026-02-21T08:30:48.5232555Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5232621Z add.s32 %r633, %r84, %r71; 2026-02-21T08:30:48.5232689Z add.s32 %r634, %r84, %r72; 2026-02-21T08:30:48.5232857Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5232918Z cvt.s64.s32 %rd147, %r633; 2026-02-21T08:30:48.5232983Z add.s64 %rd144, %rd93, %rd147; 2026-02-21T08:30:48.5233050Z cvt.s64.s32 %rd148, %r634; 2026-02-21T08:30:48.5233110Z add.s64 %rd145, %rd93, %rd148; 2026-02-21T08:30:48.5233276Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5233339Z // begin inline asm 2026-02-21T08:30:48.5233452Z cp.async.cg.shared.global [ %r2771 + 0 ], [ %rd144 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5233505Z // end inline asm 2026-02-21T08:30:48.5233569Z // begin inline asm 2026-02-21T08:30:48.5233680Z cp.async.cg.shared.global [ %r2773 + 0 ], [ %rd145 + 0 ], 0x10, %r619; 2026-02-21T08:30:48.5233736Z // end inline asm 2026-02-21T08:30:48.5233802Z cp.async.commit_group; 2026-02-21T08:30:48.5233974Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5234033Z add.s32 %r635, %r39, %r83; 2026-02-21T08:30:48.5234093Z cvt.u64.u32 %rd1087, %r635; 2026-02-21T08:30:48.5234161Z add.s32 %r636, %r11, %r86; 2026-02-21T08:30:48.5234221Z shl.b32 %r637, %r636, 10; 2026-02-21T08:30:48.5234291Z mad.wide.s32 %rd1086, %r637, 2, %rd10; 2026-02-21T08:30:48.5234348Z add.s32 %r638, %r10, %r86; 2026-02-21T08:30:48.5234414Z shl.b32 %r639, %r638, 10; 2026-02-21T08:30:48.5234481Z mad.wide.s32 %rd1085, %r639, 2, %rd10; 2026-02-21T08:30:48.5234540Z shl.b32 %r640, %r82, 16; 2026-02-21T08:30:48.5234604Z or.b32 %r641, %r79, %r640; 2026-02-21T08:30:48.5234668Z mad.wide.s32 %rd1084, %r641, 2, %rd10; 2026-02-21T08:30:48.5234726Z or.b32 %r4019, %r80, %r640; 
2026-02-21T08:30:48.5234787Z mov.b32 %r4023, 1; 2026-02-21T08:30:48.5234845Z mov.b64 %rd1088, -32; 2026-02-21T08:30:48.5234909Z mov.b32 %r4022, %r4020; 2026-02-21T08:30:48.5234967Z mov.b32 %r4024, %r4020; 2026-02-21T08:30:48.5235051Z bra.uni $L__BB0_6; 2026-02-21T08:30:48.5235147Z $L__BB0_8: // in Loop: Header=BB0_6 Depth=2 2026-02-21T08:30:48.5235312Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5235379Z add.s64 %rd1088, %rd1088, 32; 2026-02-21T08:30:48.5235446Z setp.lt.u64 %p53, %rd1088, 416; 2026-02-21T08:30:48.5235500Z $L__tmp73: 2026-02-21T08:30:48.5235719Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5235785Z add.s32 %r810, %r4023, 1; 2026-02-21T08:30:48.5235848Z setp.gt.s32 %p54, %r810, 1; 2026-02-21T08:30:48.5235914Z selp.b32 %r4023, 0, %r810, %p54; 2026-02-21T08:30:48.5235982Z selp.b32 %r811, 1, 0, %p54; 2026-02-21T08:30:48.5236043Z xor.b32 %r110, %r4024, %r811; 2026-02-21T08:30:48.5236098Z $L__tmp74: 2026-02-21T08:30:48.5236321Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5236384Z add.s64 %rd158, %rd1084, %rd9; 2026-02-21T08:30:48.5236446Z add.s64 %rd159, %rd1085, %rd9; 2026-02-21T08:30:48.5236505Z add.s64 %rd160, %rd1086, %rd9; 2026-02-21T08:30:48.5236581Z mad.wide.s32 %rd161, %r4019, 2, %rd92; 2026-02-21T08:30:48.5236745Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5236806Z add.s32 %r798, %r106, %r4047; 2026-02-21T08:30:48.5236877Z selp.b32 %r799, 16, 0, %p53; 2026-02-21T08:30:48.5236933Z // begin inline asm 2026-02-21T08:30:48.5237051Z cp.async.cg.shared.global [ %r798 + 0 ], [ %rd158 + 0 ], 0x10, %r799; 2026-02-21T08:30:48.5237114Z // end inline asm 2026-02-21T08:30:48.5237173Z add.s32 %r800, %r798, 2048; 2026-02-21T08:30:48.5237230Z // begin inline asm 2026-02-21T08:30:48.5237385Z cp.async.cg.shared.global [ %r800 + 
0 ], [ %rd159 + 0 ], 0x10, %r799; 2026-02-21T08:30:48.5237454Z // end inline asm 2026-02-21T08:30:48.5237514Z add.s32 %r802, %r798, 4096; 2026-02-21T08:30:48.5237572Z // begin inline asm 2026-02-21T08:30:48.5237692Z cp.async.cg.shared.global [ %r802 + 0 ], [ %rd160 + 0 ], 0x10, %r799; 2026-02-21T08:30:48.5237749Z // end inline asm 2026-02-21T08:30:48.5237808Z add.s32 %r804, %r798, 6144; 2026-02-21T08:30:48.5237866Z // begin inline asm 2026-02-21T08:30:48.5237992Z cp.async.cg.shared.global [ %r804 + 0 ], [ %rd161 + 0 ], 0x10, %r799; 2026-02-21T08:30:48.5238049Z // end inline asm 2026-02-21T08:30:48.5238112Z cp.async.commit_group; 2026-02-21T08:30:48.5238284Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5238347Z add.s64 %rd164, %rd9, %rd1087; 2026-02-21T08:30:48.5238411Z add.s64 %rd165, %rd164, 786432; 2026-02-21T08:30:48.5238583Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5238649Z add.s64 %rd166, %rd164, 917504; 2026-02-21T08:30:48.5238709Z cvt.s64.s32 %rd167, %rd165; 2026-02-21T08:30:48.5238770Z add.s64 %rd162, %rd93, %rd167; 2026-02-21T08:30:48.5238837Z cvt.s64.s32 %rd168, %rd166; 2026-02-21T08:30:48.5238898Z add.s64 %rd163, %rd93, %rd168; 2026-02-21T08:30:48.5239061Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5239130Z mul.lo.s32 %r111, %r4048, 15; 2026-02-21T08:30:48.5239192Z add.s32 %r806, %r107, %r111; 2026-02-21T08:30:48.5239249Z // begin inline asm 2026-02-21T08:30:48.5239367Z cp.async.cg.shared.global [ %r806 + 0 ], [ %rd162 + 0 ], 0x10, %r799; 2026-02-21T08:30:48.5239423Z // end inline asm 2026-02-21T08:30:48.5239482Z add.s32 %r808, %r806, 2048; 2026-02-21T08:30:48.5239539Z // begin inline asm 2026-02-21T08:30:48.5239657Z cp.async.cg.shared.global [ %r808 + 0 ], [ %rd163 + 0 ], 0x10, %r799; 2026-02-21T08:30:48.5239715Z // end inline asm 2026-02-21T08:30:48.5239781Z cp.async.commit_group; 
2026-02-21T08:30:48.5239954Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5240020Z add.s64 %rd1087, %rd1087, 262144; 2026-02-21T08:30:48.5240080Z add.s64 %rd1086, %rd1086, 128; 2026-02-21T08:30:48.5240140Z add.s64 %rd1085, %rd1085, 128; 2026-02-21T08:30:48.5240206Z add.s64 %rd1084, %rd1084, 128; 2026-02-21T08:30:48.5240265Z add.s32 %r4019, %r4019, 64; 2026-02-21T08:30:48.5240329Z setp.lt.u64 %p55, %rd1088, 448; 2026-02-21T08:30:48.5240396Z mov.b32 %r4020, %r4024; 2026-02-21T08:30:48.5240454Z mov.b32 %r4024, %r110; 2026-02-21T08:30:48.5240513Z @%p55 bra $L__BB0_6; 2026-02-21T08:30:48.5240570Z bra.uni $L__BB0_9; 2026-02-21T08:30:48.5240673Z $L__BB0_6: // Parent Loop BB0_3 Depth=1 2026-02-21T08:30:48.5240768Z // => This Inner Loop Header: Depth=2 2026-02-21T08:30:48.5240831Z add.s32 %r679, %r4022, 1; 2026-02-21T08:30:48.5240939Z setp.gt.s32 %p35, %r679, 1; 2026-02-21T08:30:48.5241004Z selp.b32 %r4022, 0, %r679, %p35; 2026-02-21T08:30:48.5241169Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5241241Z cp.async.wait_group 2; 2026-02-21T08:30:48.5241296Z bar.sync 0; 2026-02-21T08:30:48.5241357Z shl.b32 %r680, %r4022, 12; 2026-02-21T08:30:48.5241416Z shl.b32 %r681, %r4022, 13; 2026-02-21T08:30:48.5241483Z add.s32 %r683, %r294, %r681; 2026-02-21T08:30:48.5241575Z add.s32 %r106, %r683, 32768; 2026-02-21T08:30:48.5241636Z add.s32 %r684, %r106, %r52; 2026-02-21T08:30:48.5241739Z ld.shared.v4.b32 {%r685, %r686, %r687, %r688}, [%r684]; 2026-02-21T08:30:48.5241801Z mov.b32 {%rs193, %rs194}, %r688; 2026-02-21T08:30:48.5241862Z mov.b32 {%rs195, %rs196}, %r687; 2026-02-21T08:30:48.5241927Z mov.b32 {%rs197, %rs198}, %r686; 2026-02-21T08:30:48.5241985Z mov.b32 {%rs199, %rs200}, %r685; 2026-02-21T08:30:48.5242137Z ld.shared.v4.b32 {%r689, %r690, %r691, %r692}, [%r684+16]; 2026-02-21T08:30:48.5242198Z mov.b32 {%rs201, %rs202}, %r692; 2026-02-21T08:30:48.5242265Z mov.b32 {%rs203, 
%rs204}, %r691; 2026-02-21T08:30:48.5242325Z mov.b32 {%rs205, %rs206}, %r690; 2026-02-21T08:30:48.5242383Z mov.b32 {%rs207, %rs208}, %r689; 2026-02-21T08:30:48.5242485Z ld.shared.v4.b32 {%r693, %r694, %r695, %r696}, [%r684+32]; 2026-02-21T08:30:48.5242545Z mov.b32 {%rs209, %rs210}, %r696; 2026-02-21T08:30:48.5242602Z mov.b32 {%rs211, %rs212}, %r695; 2026-02-21T08:30:48.5242660Z mov.b32 {%rs213, %rs214}, %r694; 2026-02-21T08:30:48.5242725Z mov.b32 {%rs215, %rs216}, %r693; 2026-02-21T08:30:48.5242816Z ld.shared.v4.b32 {%r697, %r698, %r699, %r700}, [%r684+48]; 2026-02-21T08:30:48.5242874Z mov.b32 {%rs217, %rs218}, %r700; 2026-02-21T08:30:48.5242939Z mov.b32 {%rs219, %rs220}, %r699; 2026-02-21T08:30:48.5242997Z mov.b32 {%rs221, %rs222}, %r698; 2026-02-21T08:30:48.5243059Z mov.b32 {%rs223, %rs224}, %r697; 2026-02-21T08:30:48.5243238Z .loc 1 52 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:52:32 2026-02-21T08:30:48.5243302Z cvt.f32.bf16 %r646, %rs199; 2026-02-21T08:30:48.5243362Z cvt.f32.bf16 %r647, %rs200; 2026-02-21T08:30:48.5243420Z cvt.f32.bf16 %r648, %rs197; 2026-02-21T08:30:48.5243486Z cvt.f32.bf16 %r649, %rs198; 2026-02-21T08:30:48.5243543Z cvt.f32.bf16 %r650, %rs195; 2026-02-21T08:30:48.5243600Z cvt.f32.bf16 %r651, %rs196; 2026-02-21T08:30:48.5243664Z cvt.f32.bf16 %r652, %rs193; 2026-02-21T08:30:48.5243721Z cvt.f32.bf16 %r653, %rs194; 2026-02-21T08:30:48.5243779Z cvt.f32.bf16 %r654, %rs207; 2026-02-21T08:30:48.5243836Z cvt.f32.bf16 %r655, %rs208; 2026-02-21T08:30:48.5243903Z cvt.f32.bf16 %r656, %rs205; 2026-02-21T08:30:48.5243959Z cvt.f32.bf16 %r657, %rs206; 2026-02-21T08:30:48.5244015Z cvt.f32.bf16 %r658, %rs203; 2026-02-21T08:30:48.5244081Z cvt.f32.bf16 %r659, %rs204; 2026-02-21T08:30:48.5244140Z cvt.f32.bf16 %r660, %rs201; 2026-02-21T08:30:48.5244200Z cvt.f32.bf16 %r661, %rs202; 2026-02-21T08:30:48.5244265Z cvt.f32.bf16 %r663, %rs215; 2026-02-21T08:30:48.5244324Z cvt.f32.bf16 %r664, %rs216; 2026-02-21T08:30:48.5244383Z cvt.f32.bf16 %r665, 
%rs213; 2026-02-21T08:30:48.5244439Z cvt.f32.bf16 %r666, %rs214; 2026-02-21T08:30:48.5244504Z cvt.f32.bf16 %r667, %rs211; 2026-02-21T08:30:48.5244562Z cvt.f32.bf16 %r668, %rs212; 2026-02-21T08:30:48.5244618Z cvt.f32.bf16 %r669, %rs209; 2026-02-21T08:30:48.5244683Z cvt.f32.bf16 %r670, %rs210; 2026-02-21T08:30:48.5244743Z cvt.f32.bf16 %r671, %rs223; 2026-02-21T08:30:48.5244801Z cvt.f32.bf16 %r672, %rs224; 2026-02-21T08:30:48.5244859Z cvt.f32.bf16 %r673, %rs221; 2026-02-21T08:30:48.5244927Z cvt.f32.bf16 %r674, %rs222; 2026-02-21T08:30:48.5244985Z cvt.f32.bf16 %r675, %rs219; 2026-02-21T08:30:48.5245044Z cvt.f32.bf16 %r676, %rs220; 2026-02-21T08:30:48.5245111Z cvt.f32.bf16 %r677, %rs217; 2026-02-21T08:30:48.5245168Z cvt.f32.bf16 %r678, %rs218; 2026-02-21T08:30:48.5245222Z $L__tmp75: 2026-02-21T08:30:48.5245516Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5245582Z // begin inline asm 2026-02-21T08:30:48.5245634Z 2026-02-21T08:30:48.5245683Z { 2026-02-21T08:30:48.5245755Z .reg .pred complete; 2026-02-21T08:30:48.5245809Z waitLoop: 2026-02-21T08:30:48.5245935Z mbarrier.try_wait.parity.shared.b64 complete, [%r4021], %r4020; 2026-02-21T08:30:48.5246001Z @!complete bra.uni waitLoop; 2026-02-21T08:30:48.5246058Z } 2026-02-21T08:30:48.5246063Z 2026-02-21T08:30:48.5246118Z // end inline asm 2026-02-21T08:30:48.5246178Z mov.pred %p36, -1; 2026-02-21T08:30:48.5246242Z // begin inline asm 2026-02-21T08:30:48.5246538Z @%p36 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 0], 32, {%r646, %r647, %r648, %r649, %r650, %r651, %r652, %r653, %r654, %r655, %r656, %r657, %r658, %r659, %r660, %r661}; 2026-02-21T08:30:48.5246594Z // end inline asm 2026-02-21T08:30:48.5246701Z // begin inline asm 2026-02-21T08:30:48.5246992Z @%p36 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 16], 32, {%r663, %r664, %r665, %r666, %r667, %r668, %r669, %r670, %r671, %r672, %r673, %r674, %r675, %r676, %r677, %r678}; 
2026-02-21T08:30:48.5247048Z // end inline asm 2026-02-21T08:30:48.5247113Z // begin inline asm 2026-02-21T08:30:48.5247186Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5247242Z // end inline asm 2026-02-21T08:30:48.5247297Z bar.sync 0; 2026-02-21T08:30:48.5247359Z $L__tmp76: 2026-02-21T08:30:48.5247530Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5247592Z add.s32 %r701, %r294, %r680; 2026-02-21T08:30:48.5247660Z add.s32 %r702, %r701, 49152; 2026-02-21T08:30:48.5247722Z add.s32 %r107, %r702, %r4048; 2026-02-21T08:30:48.5247780Z add.s32 %r703, %r702, %r55; 2026-02-21T08:30:48.5247837Z add.s32 %r704, %r702, %r57; 2026-02-21T08:30:48.5247902Z add.s32 %r705, %r702, %r59; 2026-02-21T08:30:48.5247962Z add.s32 %r706, %r702, %r61; 2026-02-21T08:30:48.5248133Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5248204Z ld.shared.s8 %rs225, [%r107]; 2026-02-21T08:30:48.5248372Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5248434Z shl.b16 %rs226, %rs225, 4; 2026-02-21T08:30:48.5248605Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5248673Z ld.shared.s8 %rs227, [%r107+128]; 2026-02-21T08:30:48.5248838Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5248908Z shl.b16 %rs228, %rs227, 4; 2026-02-21T08:30:48.5249072Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5249137Z ld.shared.s8 %rs229, [%r107+256]; 2026-02-21T08:30:48.5249302Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5249370Z shl.b16 %rs230, %rs229, 4; 2026-02-21T08:30:48.5249534Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5249597Z ld.shared.s8 %rs231, [%r107+384]; 2026-02-21T08:30:48.5249768Z .loc 1 57 
28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5249827Z shl.b16 %rs232, %rs231, 4; 2026-02-21T08:30:48.5249987Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5250056Z ld.shared.s8 %rs233, [%r107+512]; 2026-02-21T08:30:48.5250219Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5250279Z shl.b16 %rs234, %rs233, 4; 2026-02-21T08:30:48.5250451Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5250558Z ld.shared.s8 %rs235, [%r107+640]; 2026-02-21T08:30:48.5250724Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5250783Z shl.b16 %rs236, %rs235, 4; 2026-02-21T08:30:48.5250953Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5251014Z ld.shared.s8 %rs237, [%r107+768]; 2026-02-21T08:30:48.5251177Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5251245Z shl.b16 %rs238, %rs237, 4; 2026-02-21T08:30:48.5251405Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5251466Z ld.shared.s8 %rs239, [%r703]; 2026-02-21T08:30:48.5251718Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5251781Z shl.b16 %rs240, %rs239, 4; 2026-02-21T08:30:48.5251943Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5252015Z ld.shared.s8 %rs241, [%r107+1024]; 2026-02-21T08:30:48.5252181Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5252240Z shl.b16 %rs242, %rs241, 4; 2026-02-21T08:30:48.5252405Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5252478Z ld.shared.s8 %rs243, [%r107+1152]; 2026-02-21T08:30:48.5252644Z .loc 1 57 28 
// cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5252703Z shl.b16 %rs244, %rs243, 4; 2026-02-21T08:30:48.5252875Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5252941Z ld.shared.s8 %rs245, [%r107+1280]; 2026-02-21T08:30:48.5254198Z [135s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:30:48.5255155Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[4, 0], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T08:30:48.5255282Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T08:30:48.5255341Z `ptxas` stderr: 2026-02-21T08:30:48.5255695Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 262 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:30:48.5255789Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:30:48.5255795Z 2026-02-21T08:30:48.5256186Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpxryanb1o.ptx -o /tmp/tmpxryanb1o.ptx.o 2026-02-21T08:30:48.5256191Z 2026-02-21T08:30:48.5256318Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:30:48.5256490Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5256557Z shl.b16 %rs246, %rs245, 4; 2026-02-21T08:30:48.5256728Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5256793Z ld.shared.s8 %rs247, [%r107+1408]; 2026-02-21T08:30:48.5256960Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5257029Z shl.b16 %rs248, %rs247, 4; 2026-02-21T08:30:48.5257198Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5257317Z ld.shared.s8 %rs249, [%r107+1536]; 2026-02-21T08:30:48.5257493Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5257553Z shl.b16 %rs250, %rs249, 4; 2026-02-21T08:30:48.5257723Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5257794Z ld.shared.s8 %rs251, [%r107+1664]; 2026-02-21T08:30:48.5257960Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5258018Z shl.b16 %rs252, %rs251, 4; 2026-02-21T08:30:48.5258193Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5258257Z ld.shared.s8 %rs253, [%r107+1792]; 2026-02-21T08:30:48.5258474Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5258537Z shl.b16 %rs254, %rs253, 4; 2026-02-21T08:30:48.5258709Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5258772Z ld.shared.s8 %rs255, [%r704]; 2026-02-21T08:30:48.5258935Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5259000Z shl.b16 %rs256, %rs255, 4; 2026-02-21T08:30:48.5259162Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5259225Z ld.shared.s8 %rs257, [%r107+2048]; 
2026-02-21T08:30:48.5259394Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5259452Z shl.b16 %rs258, %rs257, 4; 2026-02-21T08:30:48.5259639Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5259714Z ld.shared.s8 %rs259, [%r107+2176]; 2026-02-21T08:30:48.5259895Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5259957Z shl.b16 %rs260, %rs259, 4; 2026-02-21T08:30:48.5260129Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5260202Z ld.shared.s8 %rs261, [%r107+2304]; 2026-02-21T08:30:48.5260371Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5260432Z shl.b16 %rs262, %rs261, 4; 2026-02-21T08:30:48.5260612Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5260675Z ld.shared.s8 %rs263, [%r107+2432]; 2026-02-21T08:30:48.5260851Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5260920Z shl.b16 %rs264, %rs263, 4; 2026-02-21T08:30:48.5261099Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5261166Z ld.shared.s8 %rs265, [%r107+2560]; 2026-02-21T08:30:48.5261344Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5261404Z shl.b16 %rs266, %rs265, 4; 2026-02-21T08:30:48.5261602Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5261671Z ld.shared.s8 %rs267, [%r107+2688]; 2026-02-21T08:30:48.5261849Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5261911Z shl.b16 %rs268, %rs267, 4; 2026-02-21T08:30:48.5262084Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5262158Z ld.shared.s8 %rs269, 
[%r107+2816]; 2026-02-21T08:30:48.5262334Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5262453Z shl.b16 %rs270, %rs269, 4; 2026-02-21T08:30:48.5262637Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5262702Z ld.shared.s8 %rs271, [%r705]; 2026-02-21T08:30:48.5262878Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5262946Z shl.b16 %rs272, %rs271, 4; 2026-02-21T08:30:48.5263115Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5263179Z ld.shared.s8 %rs273, [%r107+3072]; 2026-02-21T08:30:48.5263354Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5263423Z shl.b16 %rs274, %rs273, 4; 2026-02-21T08:30:48.5263642Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5263711Z ld.shared.s8 %rs275, [%r107+3200]; 2026-02-21T08:30:48.5263886Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5263947Z shl.b16 %rs276, %rs275, 4; 2026-02-21T08:30:48.5264115Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5264188Z ld.shared.s8 %rs277, [%r107+3328]; 2026-02-21T08:30:48.5264359Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5264421Z shl.b16 %rs278, %rs277, 4; 2026-02-21T08:30:48.5264596Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5264660Z ld.shared.s8 %rs279, [%r107+3456]; 2026-02-21T08:30:48.5264827Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5264890Z shl.b16 %rs280, %rs279, 4; 2026-02-21T08:30:48.5265070Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5265134Z ld.shared.s8 %rs281, 
[%r107+3584]; 2026-02-21T08:30:48.5265304Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5265374Z shl.b16 %rs282, %rs281, 4; 2026-02-21T08:30:48.5265543Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5265608Z ld.shared.s8 %rs283, [%r107+3712]; 2026-02-21T08:30:48.5265787Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5265848Z shl.b16 %rs284, %rs283, 4; 2026-02-21T08:30:48.5266018Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5266090Z ld.shared.s8 %rs285, [%r107+3840]; 2026-02-21T08:30:48.5266263Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5266328Z shl.b16 %rs286, %rs285, 4; 2026-02-21T08:30:48.5266498Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5266568Z ld.shared.s8 %rs287, [%r706]; 2026-02-21T08:30:48.5266737Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5266798Z shl.b16 %rs288, %rs287, 4; 2026-02-21T08:30:48.5266974Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5267036Z cvt.s16.s8 %rs289, %rs226; 2026-02-21T08:30:48.5267096Z shr.s16 %rs290, %rs289, 4; 2026-02-21T08:30:48.5267164Z cvt.s16.s8 %rs291, %rs228; 2026-02-21T08:30:48.5267224Z shr.s16 %rs292, %rs291, 4; 2026-02-21T08:30:48.5267283Z shr.s16 %rs293, %rs225, 4; 2026-02-21T08:30:48.5267344Z shr.s16 %rs294, %rs227, 4; 2026-02-21T08:30:48.5267589Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5267652Z cvt.rn.f32.s16 %r707, %rs294; 2026-02-21T08:30:48.5267714Z cvt.rn.f32.s16 %r708, %rs293; 2026-02-21T08:30:48.5267782Z cvt.rn.f32.s16 %r709, %rs292; 2026-02-21T08:30:48.5267841Z cvt.rn.f32.s16 %r710, %rs290; 2026-02-21T08:30:48.5268005Z .loc 1 59 
25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5268072Z cvt.s16.s8 %rs295, %rs230; 2026-02-21T08:30:48.5268130Z shr.s16 %rs296, %rs295, 4; 2026-02-21T08:30:48.5268188Z cvt.s16.s8 %rs297, %rs232; 2026-02-21T08:30:48.5268247Z shr.s16 %rs298, %rs297, 4; 2026-02-21T08:30:48.5268315Z shr.s16 %rs299, %rs229, 4; 2026-02-21T08:30:48.5268377Z shr.s16 %rs300, %rs231, 4; 2026-02-21T08:30:48.5268544Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5268655Z cvt.rn.f32.s16 %r711, %rs300; 2026-02-21T08:30:48.5268722Z cvt.rn.f32.s16 %r712, %rs299; 2026-02-21T08:30:48.5268781Z cvt.rn.f32.s16 %r713, %rs298; 2026-02-21T08:30:48.5268847Z cvt.rn.f32.s16 %r714, %rs296; 2026-02-21T08:30:48.5269008Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5269066Z cvt.s16.s8 %rs301, %rs234; 2026-02-21T08:30:48.5269123Z shr.s16 %rs302, %rs301, 4; 2026-02-21T08:30:48.5269190Z cvt.s16.s8 %rs303, %rs236; 2026-02-21T08:30:48.5269247Z shr.s16 %rs304, %rs303, 4; 2026-02-21T08:30:48.5269305Z shr.s16 %rs305, %rs233, 4; 2026-02-21T08:30:48.5269369Z shr.s16 %rs306, %rs235, 4; 2026-02-21T08:30:48.5269531Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5269590Z cvt.rn.f32.s16 %r715, %rs306; 2026-02-21T08:30:48.5269648Z cvt.rn.f32.s16 %r716, %rs305; 2026-02-21T08:30:48.5269713Z cvt.rn.f32.s16 %r717, %rs304; 2026-02-21T08:30:48.5269774Z cvt.rn.f32.s16 %r718, %rs302; 2026-02-21T08:30:48.5269944Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5270010Z cvt.s16.s8 %rs307, %rs238; 2026-02-21T08:30:48.5270068Z shr.s16 %rs308, %rs307, 4; 2026-02-21T08:30:48.5270125Z cvt.s16.s8 %rs309, %rs240; 2026-02-21T08:30:48.5270190Z shr.s16 %rs310, %rs309, 4; 2026-02-21T08:30:48.5270248Z shr.s16 %rs311, %rs237, 4; 2026-02-21T08:30:48.5270304Z shr.s16 %rs312, %rs239, 4; 
2026-02-21T08:30:48.5270468Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5270536Z cvt.rn.f32.s16 %r719, %rs312; 2026-02-21T08:30:48.5270594Z cvt.rn.f32.s16 %r720, %rs311; 2026-02-21T08:30:48.5270653Z cvt.rn.f32.s16 %r721, %rs310; 2026-02-21T08:30:48.5270719Z cvt.rn.f32.s16 %r722, %rs308; 2026-02-21T08:30:48.5270886Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5270947Z cvt.s16.s8 %rs313, %rs242; 2026-02-21T08:30:48.5271013Z shr.s16 %rs314, %rs313, 4; 2026-02-21T08:30:48.5271070Z cvt.s16.s8 %rs315, %rs244; 2026-02-21T08:30:48.5271127Z shr.s16 %rs316, %rs315, 4; 2026-02-21T08:30:48.5271184Z shr.s16 %rs317, %rs241, 4; 2026-02-21T08:30:48.5271250Z shr.s16 %rs318, %rs243, 4; 2026-02-21T08:30:48.5271412Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5271472Z cvt.rn.f32.s16 %r723, %rs318; 2026-02-21T08:30:48.5271565Z cvt.rn.f32.s16 %r724, %rs317; 2026-02-21T08:30:48.5271627Z cvt.rn.f32.s16 %r725, %rs316; 2026-02-21T08:30:48.5271685Z cvt.rn.f32.s16 %r726, %rs314; 2026-02-21T08:30:48.5271847Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5271913Z cvt.s16.s8 %rs319, %rs246; 2026-02-21T08:30:48.5271970Z shr.s16 %rs320, %rs319, 4; 2026-02-21T08:30:48.5272030Z cvt.s16.s8 %rs321, %rs248; 2026-02-21T08:30:48.5272150Z shr.s16 %rs322, %rs321, 4; 2026-02-21T08:30:48.5272208Z shr.s16 %rs323, %rs245, 4; 2026-02-21T08:30:48.5272264Z shr.s16 %rs324, %rs247, 4; 2026-02-21T08:30:48.5272435Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5272494Z cvt.rn.f32.s16 %r727, %rs324; 2026-02-21T08:30:48.5272553Z cvt.rn.f32.s16 %r728, %rs323; 2026-02-21T08:30:48.5272611Z cvt.rn.f32.s16 %r729, %rs322; 2026-02-21T08:30:48.5272676Z cvt.rn.f32.s16 %r730, %rs320; 2026-02-21T08:30:48.5272838Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5272896Z cvt.s16.s8 %rs325, %rs250; 2026-02-21T08:30:48.5272959Z shr.s16 %rs326, %rs325, 4; 2026-02-21T08:30:48.5273016Z cvt.s16.s8 %rs327, %rs252; 2026-02-21T08:30:48.5273073Z shr.s16 %rs328, %rs327, 4; 2026-02-21T08:30:48.5273130Z shr.s16 %rs329, %rs249, 4; 2026-02-21T08:30:48.5273243Z shr.s16 %rs330, %rs251, 4; 2026-02-21T08:30:48.5273409Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5273468Z cvt.rn.f32.s16 %r731, %rs330; 2026-02-21T08:30:48.5273536Z cvt.rn.f32.s16 %r732, %rs329; 2026-02-21T08:30:48.5273595Z cvt.rn.f32.s16 %r733, %rs328; 2026-02-21T08:30:48.5273654Z cvt.rn.f32.s16 %r734, %rs326; 2026-02-21T08:30:48.5273828Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5273886Z cvt.s16.s8 %rs331, %rs254; 2026-02-21T08:30:48.5273944Z shr.s16 %rs332, %rs331, 4; 2026-02-21T08:30:48.5274001Z cvt.s16.s8 %rs333, %rs256; 2026-02-21T08:30:48.5274066Z shr.s16 %rs334, %rs333, 4; 2026-02-21T08:30:48.5274124Z shr.s16 %rs335, %rs253, 4; 2026-02-21T08:30:48.5274181Z shr.s16 %rs336, %rs255, 4; 2026-02-21T08:30:48.5274352Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5274414Z cvt.rn.f32.s16 %r735, %rs336; 2026-02-21T08:30:48.5274476Z cvt.rn.f32.s16 %r736, %rs335; 2026-02-21T08:30:48.5274541Z cvt.rn.f32.s16 %r737, %rs334; 2026-02-21T08:30:48.5274599Z cvt.rn.f32.s16 %r738, %rs332; 2026-02-21T08:30:48.5274766Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5274825Z cvt.s16.s8 %rs337, %rs258; 2026-02-21T08:30:48.5274890Z shr.s16 %rs338, %rs337, 4; 2026-02-21T08:30:48.5274948Z cvt.s16.s8 %rs339, %rs260; 2026-02-21T08:30:48.5275007Z shr.s16 %rs340, %rs339, 4; 2026-02-21T08:30:48.5275071Z shr.s16 %rs341, %rs257, 4; 2026-02-21T08:30:48.5275128Z shr.s16 %rs342, %rs259, 4; 2026-02-21T08:30:48.5275293Z 
.loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5275352Z cvt.rn.f32.s16 %r739, %rs342; 2026-02-21T08:30:48.5275418Z cvt.rn.f32.s16 %r740, %rs341; 2026-02-21T08:30:48.5275480Z cvt.rn.f32.s16 %r741, %rs340; 2026-02-21T08:30:48.5275543Z cvt.rn.f32.s16 %r742, %rs338; 2026-02-21T08:30:48.5275715Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5275776Z cvt.s16.s8 %rs343, %rs262; 2026-02-21T08:30:48.5275833Z shr.s16 %rs344, %rs343, 4; 2026-02-21T08:30:48.5275899Z cvt.s16.s8 %rs345, %rs264; 2026-02-21T08:30:48.5275967Z shr.s16 %rs346, %rs345, 4; 2026-02-21T08:30:48.5276025Z shr.s16 %rs347, %rs261, 4; 2026-02-21T08:30:48.5276082Z shr.s16 %rs348, %rs263, 4; 2026-02-21T08:30:48.5276255Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5276314Z cvt.rn.f32.s16 %r743, %rs348; 2026-02-21T08:30:48.5276373Z cvt.rn.f32.s16 %r744, %rs347; 2026-02-21T08:30:48.5276438Z cvt.rn.f32.s16 %r745, %rs346; 2026-02-21T08:30:48.5276496Z cvt.rn.f32.s16 %r746, %rs344; 2026-02-21T08:30:48.5276664Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5276765Z cvt.s16.s8 %rs349, %rs266; 2026-02-21T08:30:48.5276830Z shr.s16 %rs350, %rs349, 4; 2026-02-21T08:30:48.5276887Z cvt.s16.s8 %rs351, %rs268; 2026-02-21T08:30:48.5276944Z shr.s16 %rs352, %rs351, 4; 2026-02-21T08:30:48.5277008Z shr.s16 %rs353, %rs265, 4; 2026-02-21T08:30:48.5277065Z shr.s16 %rs354, %rs267, 4; 2026-02-21T08:30:48.5277227Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5277294Z cvt.rn.f32.s16 %r747, %rs354; 2026-02-21T08:30:48.5277352Z cvt.rn.f32.s16 %r748, %rs353; 2026-02-21T08:30:48.5277412Z cvt.rn.f32.s16 %r749, %rs352; 2026-02-21T08:30:48.5277468Z cvt.rn.f32.s16 %r750, %rs350; 2026-02-21T08:30:48.5277641Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5277699Z cvt.s16.s8 %rs355, %rs270; 2026-02-21T08:30:48.5277757Z shr.s16 %rs356, %rs355, 4; 2026-02-21T08:30:48.5277860Z cvt.s16.s8 %rs357, %rs272; 2026-02-21T08:30:48.5277922Z shr.s16 %rs358, %rs357, 4; 2026-02-21T08:30:48.5277980Z shr.s16 %rs359, %rs269, 4; 2026-02-21T08:30:48.5278037Z shr.s16 %rs360, %rs271, 4; 2026-02-21T08:30:48.5278212Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5278271Z cvt.rn.f32.s16 %r751, %rs360; 2026-02-21T08:30:48.5278329Z cvt.rn.f32.s16 %r752, %rs359; 2026-02-21T08:30:48.5278396Z cvt.rn.f32.s16 %r753, %rs358; 2026-02-21T08:30:48.5278455Z cvt.rn.f32.s16 %r754, %rs356; 2026-02-21T08:30:48.5278620Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5278688Z cvt.s16.s8 %rs361, %rs274; 2026-02-21T08:30:48.5278745Z shr.s16 %rs362, %rs361, 4; 2026-02-21T08:30:48.5278804Z cvt.s16.s8 %rs363, %rs276; 2026-02-21T08:30:48.5278860Z shr.s16 %rs364, %rs363, 4; 2026-02-21T08:30:48.5278924Z shr.s16 %rs365, %rs273, 4; 2026-02-21T08:30:48.5278982Z shr.s16 %rs366, %rs275, 4; 2026-02-21T08:30:48.5279151Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5279218Z cvt.rn.f32.s16 %r755, %rs366; 2026-02-21T08:30:48.5279276Z cvt.rn.f32.s16 %r756, %rs365; 2026-02-21T08:30:48.5279334Z cvt.rn.f32.s16 %r757, %rs364; 2026-02-21T08:30:48.5279398Z cvt.rn.f32.s16 %r758, %rs362; 2026-02-21T08:30:48.5279565Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5279623Z cvt.s16.s8 %rs367, %rs278; 2026-02-21T08:30:48.5279680Z shr.s16 %rs368, %rs367, 4; 2026-02-21T08:30:48.5279744Z cvt.s16.s8 %rs369, %rs280; 2026-02-21T08:30:48.5279800Z shr.s16 %rs370, %rs369, 4; 2026-02-21T08:30:48.5279858Z shr.s16 %rs371, %rs277, 4; 2026-02-21T08:30:48.5279922Z shr.s16 %rs372, %rs279, 4; 2026-02-21T08:30:48.5280090Z .loc 1 77 32 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5280154Z cvt.rn.f32.s16 %r759, %rs372; 2026-02-21T08:30:48.5280212Z cvt.rn.f32.s16 %r760, %rs371; 2026-02-21T08:30:48.5280277Z cvt.rn.f32.s16 %r761, %rs370; 2026-02-21T08:30:48.5280334Z cvt.rn.f32.s16 %r762, %rs368; 2026-02-21T08:30:48.5280500Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5280566Z cvt.s16.s8 %rs373, %rs282; 2026-02-21T08:30:48.5280622Z shr.s16 %rs374, %rs373, 4; 2026-02-21T08:30:48.5280679Z cvt.s16.s8 %rs375, %rs284; 2026-02-21T08:30:48.5280742Z shr.s16 %rs376, %rs375, 4; 2026-02-21T08:30:48.5280800Z shr.s16 %rs377, %rs281, 4; 2026-02-21T08:30:48.5280855Z shr.s16 %rs378, %rs283, 4; 2026-02-21T08:30:48.5281021Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5281089Z cvt.rn.f32.s16 %r763, %rs378; 2026-02-21T08:30:48.5281147Z cvt.rn.f32.s16 %r764, %rs377; 2026-02-21T08:30:48.5281208Z cvt.rn.f32.s16 %r765, %rs376; 2026-02-21T08:30:48.5281311Z cvt.rn.f32.s16 %r766, %rs374; 2026-02-21T08:30:48.5281475Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5281569Z cvt.s16.s8 %rs379, %rs286; 2026-02-21T08:30:48.5281637Z shr.s16 %rs380, %rs379, 4; 2026-02-21T08:30:48.5281694Z cvt.s16.s8 %rs381, %rs288; 2026-02-21T08:30:48.5281752Z shr.s16 %rs382, %rs381, 4; 2026-02-21T08:30:48.5281808Z shr.s16 %rs383, %rs285, 4; 2026-02-21T08:30:48.5281875Z shr.s16 %rs384, %rs287, 4; 2026-02-21T08:30:48.5282041Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5282101Z cvt.rn.f32.s16 %r767, %rs384; 2026-02-21T08:30:48.5282165Z cvt.rn.f32.s16 %r768, %rs383; 2026-02-21T08:30:48.5282224Z cvt.rn.f32.s16 %r769, %rs382; 2026-02-21T08:30:48.5282281Z cvt.rn.f32.s16 %r770, %rs380; 2026-02-21T08:30:48.5282379Z st.shared.v4.b32 [%r63], {%r710, %r708, %r709, %r707}; 2026-02-21T08:30:48.5282558Z st.shared.v4.b32 
[%r63+16384], {%r742, %r740, %r741, %r739}; 2026-02-21T08:30:48.5282653Z st.shared.v4.b32 [%r64], {%r714, %r712, %r713, %r711}; 2026-02-21T08:30:48.5282751Z st.shared.v4.b32 [%r64+16384], {%r746, %r744, %r745, %r743}; 2026-02-21T08:30:48.5282846Z st.shared.v4.b32 [%r65], {%r718, %r716, %r717, %r715}; 2026-02-21T08:30:48.5282942Z st.shared.v4.b32 [%r65+16384], {%r750, %r748, %r749, %r747}; 2026-02-21T08:30:48.5283029Z st.shared.v4.b32 [%r66], {%r722, %r720, %r721, %r719}; 2026-02-21T08:30:48.5283129Z st.shared.v4.b32 [%r66+16384], {%r754, %r752, %r753, %r751}; 2026-02-21T08:30:48.5283213Z st.shared.v4.b32 [%r67], {%r726, %r724, %r725, %r723}; 2026-02-21T08:30:48.5283307Z st.shared.v4.b32 [%r67+16384], {%r758, %r756, %r757, %r755}; 2026-02-21T08:30:48.5283391Z st.shared.v4.b32 [%r68], {%r730, %r728, %r729, %r727}; 2026-02-21T08:30:48.5283489Z st.shared.v4.b32 [%r68+16384], {%r762, %r760, %r761, %r759}; 2026-02-21T08:30:48.5283577Z st.shared.v4.b32 [%r69], {%r734, %r732, %r733, %r731}; 2026-02-21T08:30:48.5283672Z st.shared.v4.b32 [%r69+16384], {%r766, %r764, %r765, %r763}; 2026-02-21T08:30:48.5283768Z st.shared.v4.b32 [%r70], {%r738, %r736, %r737, %r735}; 2026-02-21T08:30:48.5283862Z st.shared.v4.b32 [%r70+16384], {%r770, %r768, %r769, %r767}; 2026-02-21T08:30:48.5284027Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5284098Z shl.b32 %r771, %r4023, 3; 2026-02-21T08:30:48.5284163Z add.s32 %r772, %r294, %r771; 2026-02-21T08:30:48.5284224Z add.s32 %r4021, %r772, 57344; 2026-02-21T08:30:48.5284287Z $L__tmp77: 2026-02-21T08:30:48.5284506Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5284571Z // begin inline asm 2026-02-21T08:30:48.5284646Z fence.proxy.async.shared::cta; 2026-02-21T08:30:48.5284710Z // end inline asm 2026-02-21T08:30:48.5284764Z bar.sync 0; 2026-02-21T08:30:48.5284824Z @%p14 bra $L__BB0_8; 2026-02-21T08:30:48.5284932Z // %bb.7: 
// in Loop: Header=BB0_6 Depth=2 2026-02-21T08:30:48.5284997Z elect.sync %r797|%p37, -1; 2026-02-21T08:30:48.5285056Z mov.b32 %r775, 69208336; 2026-02-21T08:30:48.5285112Z // begin inline asm 2026-02-21T08:30:48.5285277Z @%p37 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 0 ], %rd737, %r775, %p36; 2026-02-21T08:30:48.5285333Z // end inline asm 2026-02-21T08:30:48.5285390Z // begin inline asm 2026-02-21T08:30:48.5285549Z @%p37 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 8 ], %rd738, %r775, %p36; 2026-02-21T08:30:48.5285604Z // end inline asm 2026-02-21T08:30:48.5285660Z // begin inline asm 2026-02-21T08:30:48.5285814Z @%p37 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 16 ], %rd739, %r775, %p36; 2026-02-21T08:30:48.5285869Z // end inline asm 2026-02-21T08:30:48.5285925Z // begin inline asm 2026-02-21T08:30:48.5286077Z @%p37 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 24 ], %rd740, %r775, %p36; 2026-02-21T08:30:48.5286186Z // end inline asm 2026-02-21T08:30:48.5286243Z // begin inline asm 2026-02-21T08:30:48.5286387Z @%p37 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 32 ], %rd741, %r775, %p36; 2026-02-21T08:30:48.5286450Z // end inline asm 2026-02-21T08:30:48.5286506Z // begin inline asm 2026-02-21T08:30:48.5286650Z @%p37 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 40 ], %rd742, %r775, %p36; 2026-02-21T08:30:48.5286714Z // end inline asm 2026-02-21T08:30:48.5286771Z // begin inline asm 2026-02-21T08:30:48.5286913Z @%p37 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 48 ], %rd743, %r775, %p36; 2026-02-21T08:30:48.5286976Z // end inline asm 2026-02-21T08:30:48.5287031Z // begin inline asm 2026-02-21T08:30:48.5287173Z @%p37 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 56 ], %rd744, %r775, %p36; 2026-02-21T08:30:48.5287276Z // end inline asm 2026-02-21T08:30:48.5287343Z cvt.u64.u32 %rd157, %r4021; 2026-02-21T08:30:48.5287399Z // 
begin inline asm 2026-02-21T08:30:48.5287522Z @%p37 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd157]; 2026-02-21T08:30:48.5287587Z // end inline asm 2026-02-21T08:30:48.5287644Z bra.uni $L__BB0_8; 2026-02-21T08:30:48.5287742Z $L__BB0_9: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5287841Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:30:48.5287898Z mov.b32 %r4029, 1; 2026-02-21T08:30:48.5288117Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5288182Z // begin inline asm 2026-02-21T08:30:48.5288234Z 2026-02-21T08:30:48.5288284Z { 2026-02-21T08:30:48.5288348Z .reg .pred complete; 2026-02-21T08:30:48.5288410Z waitLoop: 2026-02-21T08:30:48.5288531Z mbarrier.try_wait.parity.shared.b64 complete, [%r4021], %r4029; 2026-02-21T08:30:48.5288599Z @!complete bra.uni waitLoop; 2026-02-21T08:30:48.5288659Z } 2026-02-21T08:30:48.5288662Z 2026-02-21T08:30:48.5288719Z // end inline asm 2026-02-21T08:30:48.5288773Z $L__tmp78: 2026-02-21T08:30:48.5288944Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5289016Z cp.async.wait_group 0; 2026-02-21T08:30:48.5289072Z bar.sync 0; 2026-02-21T08:30:48.5289133Z add.s32 %r4027, %r294, 57344; 2026-02-21T08:30:48.5289199Z // begin inline asm 2026-02-21T08:30:48.5289288Z @%p10 mbarrier.inval.shared::cta.b64 [%r4027]; 2026-02-21T08:30:48.5289344Z // end inline asm 2026-02-21T08:30:48.5289397Z bar.sync 0; 2026-02-21T08:30:48.5289461Z // begin inline asm 2026-02-21T08:30:48.5289545Z @%p10 mbarrier.inval.shared::cta.b64 [%r2245]; 2026-02-21T08:30:48.5289599Z // end inline asm 2026-02-21T08:30:48.5289771Z .loc 1 88 43 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:43 2026-02-21T08:30:48.5289834Z shl.b32 %r1085, %r87, 13; 2026-02-21T08:30:48.5289893Z shl.b32 %r1086, %r88, 13; 2026-02-21T08:30:48.5289958Z shl.b32 %r1087, %r89, 13; 2026-02-21T08:30:48.5290016Z shl.b32 %r1088, %r90, 13; 
2026-02-21T08:30:48.5290071Z shl.b32 %r1089, %r91, 13; 2026-02-21T08:30:48.5290128Z shl.b32 %r1090, %r92, 13; 2026-02-21T08:30:48.5290192Z shl.b32 %r1091, %r93, 13; 2026-02-21T08:30:48.5290248Z shl.b32 %r1092, %r94, 13; 2026-02-21T08:30:48.5290416Z .loc 1 88 50 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:50 2026-02-21T08:30:48.5290482Z add.s32 %r1093, %r1085, %r85; 2026-02-21T08:30:48.5290541Z add.s32 %r1094, %r1086, %r85; 2026-02-21T08:30:48.5290598Z add.s32 %r1095, %r1087, %r85; 2026-02-21T08:30:48.5290655Z add.s32 %r1096, %r1088, %r85; 2026-02-21T08:30:48.5290720Z add.s32 %r1097, %r1089, %r85; 2026-02-21T08:30:48.5290776Z add.s32 %r1098, %r1090, %r85; 2026-02-21T08:30:48.5290833Z add.s32 %r1099, %r1091, %r85; 2026-02-21T08:30:48.5290900Z add.s32 %r1100, %r1092, %r85; 2026-02-21T08:30:48.5291111Z .loc 1 88 22 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:22 2026-02-21T08:30:48.5291182Z mad.wide.s32 %rd169, %r1093, 2, %rd94; 2026-02-21T08:30:48.5291259Z mad.wide.s32 %rd170, %r1094, 2, %rd94; 2026-02-21T08:30:48.5291327Z mad.wide.s32 %rd171, %r1095, 2, %rd94; 2026-02-21T08:30:48.5291391Z mad.wide.s32 %rd172, %r1096, 2, %rd94; 2026-02-21T08:30:48.5291455Z mad.wide.s32 %rd173, %r1097, 2, %rd94; 2026-02-21T08:30:48.5291527Z mad.wide.s32 %rd174, %r1098, 2, %rd94; 2026-02-21T08:30:48.5291629Z mad.wide.s32 %rd175, %r1099, 2, %rd94; 2026-02-21T08:30:48.5291693Z mad.wide.s32 %rd176, %r1100, 2, %rd94; 2026-02-21T08:30:48.5291754Z $L__tmp79: 2026-02-21T08:30:48.5291971Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5292030Z // begin inline asm 2026-02-21T08:30:48.5292368Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r816, %r817, %r818, %r819, %r820, %r821, %r822, %r823, %r824, %r825, %r826, %r827, %r828, %r829, %r830, %r831}, [%r3028 + 0], 64; 2026-02-21T08:30:48.5292431Z // end inline asm 2026-02-21T08:30:48.5292491Z // begin inline asm 2026-02-21T08:30:48.5292768Z 
tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r833, %r834, %r835, %r836, %r837, %r838, %r839, %r840, %r841, %r842, %r843, %r844, %r845, %r846, %r847, %r848}, [%r3028 + 16], 64; 2026-02-21T08:30:48.5292833Z // end inline asm 2026-02-21T08:30:48.5292889Z // begin inline asm 2026-02-21T08:30:48.5293161Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r850, %r851, %r852, %r853, %r854, %r855, %r856, %r857, %r858, %r859, %r860, %r861, %r862, %r863, %r864, %r865}, [%r3028 + 32], 64; 2026-02-21T08:30:48.5293225Z // end inline asm 2026-02-21T08:30:48.5293281Z // begin inline asm 2026-02-21T08:30:48.5293556Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r867, %r868, %r869, %r870, %r871, %r872, %r873, %r874, %r875, %r876, %r877, %r878, %r879, %r880, %r881, %r882}, [%r3028 + 48], 64; 2026-02-21T08:30:48.5293629Z // end inline asm 2026-02-21T08:30:48.5293688Z // begin inline asm 2026-02-21T08:30:48.5293760Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:48.5293828Z // end inline asm 2026-02-21T08:30:48.5293892Z cvt.u64.u32 %rd189, %r816; 2026-02-21T08:30:48.5293951Z cvt.u64.u32 %rd190, %r817; 2026-02-21T08:30:48.5294014Z shl.b64 %rd191, %rd190, 32; 2026-02-21T08:30:48.5294084Z or.b64 %rd192, %rd189, %rd191; 2026-02-21T08:30:48.5294137Z $L__tmp80: 2026-02-21T08:30:48.5294307Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5294378Z mov.b64 {%r1101, %r1102}, %rd192; 2026-02-21T08:30:48.5294451Z cvt.rn.bf16x2.f32 %r1103, %r1102, %r1101; 2026-02-21T08:30:48.5294504Z $L__tmp81: 2026-02-21T08:30:48.5294721Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5294792Z cvt.u64.u32 %rd193, %r818; 2026-02-21T08:30:48.5294852Z cvt.u64.u32 %rd194, %r819; 2026-02-21T08:30:48.5294912Z shl.b64 %rd195, %rd194, 32; 2026-02-21T08:30:48.5294982Z or.b64 %rd196, %rd193, %rd195; 2026-02-21T08:30:48.5295035Z $L__tmp82: 2026-02-21T08:30:48.5295202Z .loc 1 87 28 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5295272Z mov.b64 {%r1104, %r1105}, %rd196; 2026-02-21T08:30:48.5295345Z cvt.rn.bf16x2.f32 %r1106, %r1105, %r1104; 2026-02-21T08:30:48.5295397Z $L__tmp83: 2026-02-21T08:30:48.5295610Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5295677Z cvt.u64.u32 %rd197, %r820; 2026-02-21T08:30:48.5295735Z cvt.u64.u32 %rd198, %r821; 2026-02-21T08:30:48.5295793Z shl.b64 %rd199, %rd198, 32; 2026-02-21T08:30:48.5295863Z or.b64 %rd200, %rd197, %rd199; 2026-02-21T08:30:48.5295914Z $L__tmp84: 2026-02-21T08:30:48.5296080Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5296198Z mov.b64 {%r1107, %r1108}, %rd200; 2026-02-21T08:30:48.5296268Z cvt.rn.bf16x2.f32 %r1109, %r1108, %r1107; 2026-02-21T08:30:48.5296320Z $L__tmp85: 2026-02-21T08:30:48.5296530Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5296598Z cvt.u64.u32 %rd201, %r822; 2026-02-21T08:30:48.5296656Z cvt.u64.u32 %rd202, %r823; 2026-02-21T08:30:48.5296714Z shl.b64 %rd203, %rd202, 32; 2026-02-21T08:30:48.5296781Z or.b64 %rd204, %rd201, %rd203; 2026-02-21T08:30:48.5296832Z $L__tmp86: 2026-02-21T08:30:48.5296998Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5297065Z mov.b64 {%r1110, %r1111}, %rd204; 2026-02-21T08:30:48.5297133Z cvt.rn.bf16x2.f32 %r1112, %r1111, %r1110; 2026-02-21T08:30:48.5297185Z $L__tmp87: 2026-02-21T08:30:48.5297434Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5297505Z cvt.u64.u32 %rd205, %r824; 2026-02-21T08:30:48.5297564Z cvt.u64.u32 %rd206, %r825; 2026-02-21T08:30:48.5297623Z shl.b64 %rd207, %rd206, 32; 2026-02-21T08:30:48.5297691Z or.b64 %rd208, %rd205, %rd207; 2026-02-21T08:30:48.5297743Z 
$L__tmp88: 2026-02-21T08:30:48.5297911Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5297972Z mov.b64 {%r1113, %r1114}, %rd208; 2026-02-21T08:30:48.5298048Z cvt.rn.bf16x2.f32 %r1115, %r1114, %r1113; 2026-02-21T08:30:48.5298100Z $L__tmp89: 2026-02-21T08:30:48.5298307Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5298375Z cvt.u64.u32 %rd209, %r826; 2026-02-21T08:30:48.5298432Z cvt.u64.u32 %rd210, %r827; 2026-02-21T08:30:48.5298492Z shl.b64 %rd211, %rd210, 32; 2026-02-21T08:30:48.5298562Z or.b64 %rd212, %rd209, %rd211; 2026-02-21T08:30:48.5298615Z $L__tmp90: 2026-02-21T08:30:48.5298783Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5298843Z mov.b64 {%r1116, %r1117}, %rd212; 2026-02-21T08:30:48.5298921Z cvt.rn.bf16x2.f32 %r1118, %r1117, %r1116; 2026-02-21T08:30:48.5298973Z $L__tmp91: 2026-02-21T08:30:48.5299182Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5299248Z cvt.u64.u32 %rd213, %r828; 2026-02-21T08:30:48.5299305Z cvt.u64.u32 %rd214, %r829; 2026-02-21T08:30:48.5299364Z shl.b64 %rd215, %rd214, 32; 2026-02-21T08:30:48.5299429Z or.b64 %rd216, %rd213, %rd215; 2026-02-21T08:30:48.5299481Z $L__tmp92: 2026-02-21T08:30:48.5299647Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5299710Z mov.b64 {%r1119, %r1120}, %rd216; 2026-02-21T08:30:48.5299787Z cvt.rn.bf16x2.f32 %r1121, %r1120, %r1119; 2026-02-21T08:30:48.5299840Z $L__tmp93: 2026-02-21T08:30:48.5300050Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5300117Z cvt.u64.u32 %rd217, %r830; 2026-02-21T08:30:48.5300175Z cvt.u64.u32 %rd218, %r831; 2026-02-21T08:30:48.5300233Z shl.b64 %rd219, %rd218, 32; 2026-02-21T08:30:48.5300300Z or.b64 
%rd220, %rd217, %rd219; 2026-02-21T08:30:48.5300353Z $L__tmp94: 2026-02-21T08:30:48.5300522Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5300582Z mov.b64 {%r1122, %r1123}, %rd220; 2026-02-21T08:30:48.5300660Z cvt.rn.bf16x2.f32 %r1124, %r1123, %r1122; 2026-02-21T08:30:48.5300714Z $L__tmp95: 2026-02-21T08:30:48.5300929Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5301037Z cvt.u64.u32 %rd221, %r833; 2026-02-21T08:30:48.5301096Z cvt.u64.u32 %rd222, %r834; 2026-02-21T08:30:48.5301155Z shl.b64 %rd223, %rd222, 32; 2026-02-21T08:30:48.5301216Z or.b64 %rd224, %rd221, %rd223; 2026-02-21T08:30:48.5301277Z $L__tmp96: 2026-02-21T08:30:48.5301442Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5301504Z mov.b64 {%r1125, %r1126}, %rd224; 2026-02-21T08:30:48.5301609Z cvt.rn.bf16x2.f32 %r1127, %r1126, %r1125; 2026-02-21T08:30:48.5301665Z $L__tmp97: 2026-02-21T08:30:48.5301876Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5301942Z cvt.u64.u32 %rd225, %r835; 2026-02-21T08:30:48.5302000Z cvt.u64.u32 %rd226, %r836; 2026-02-21T08:30:48.5302058Z shl.b64 %rd227, %rd226, 32; 2026-02-21T08:30:48.5302119Z or.b64 %rd228, %rd225, %rd227; 2026-02-21T08:30:48.5302250Z $L__tmp98: 2026-02-21T08:30:48.5302423Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5302484Z mov.b64 {%r1128, %r1129}, %rd228; 2026-02-21T08:30:48.5302561Z cvt.rn.bf16x2.f32 %r1130, %r1129, %r1128; 2026-02-21T08:30:48.5302614Z $L__tmp99: 2026-02-21T08:30:48.5302824Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5302892Z cvt.u64.u32 %rd229, %r837; 2026-02-21T08:30:48.5302954Z cvt.u64.u32 %rd230, %r838; 2026-02-21T08:30:48.5303014Z shl.b64 %rd231, 
%rd230, 32; 2026-02-21T08:30:48.5303074Z or.b64 %rd232, %rd229, %rd231; 2026-02-21T08:30:48.5303135Z $L__tmp100: 2026-02-21T08:30:48.5303303Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5303365Z mov.b64 {%r1131, %r1132}, %rd232; 2026-02-21T08:30:48.5303445Z cvt.rn.bf16x2.f32 %r1133, %r1132, %r1131; 2026-02-21T08:30:48.5303502Z $L__tmp101: 2026-02-21T08:30:48.5303716Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5303796Z cvt.u64.u32 %rd233, %r839; 2026-02-21T08:30:48.5303859Z cvt.u64.u32 %rd234, %r840; 2026-02-21T08:30:48.5303922Z shl.b64 %rd235, %rd234, 32; 2026-02-21T08:30:48.5303985Z or.b64 %rd236, %rd233, %rd235; 2026-02-21T08:30:48.5304049Z $L__tmp102: 2026-02-21T08:30:48.5304228Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5304291Z mov.b64 {%r1134, %r1135}, %rd236; 2026-02-21T08:30:48.5304369Z cvt.rn.bf16x2.f32 %r1136, %r1135, %r1134; 2026-02-21T08:30:48.5304426Z $L__tmp103: 2026-02-21T08:30:48.5304646Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5304718Z cvt.u64.u32 %rd237, %r841; 2026-02-21T08:30:48.5304783Z cvt.u64.u32 %rd238, %r842; 2026-02-21T08:30:48.5304843Z shl.b64 %rd239, %rd238, 32; 2026-02-21T08:30:48.5304905Z or.b64 %rd240, %rd237, %rd239; 2026-02-21T08:30:48.5304969Z $L__tmp104: 2026-02-21T08:30:48.5305144Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5305207Z mov.b64 {%r1137, %r1138}, %rd240; 2026-02-21T08:30:48.5305285Z cvt.rn.bf16x2.f32 %r1139, %r1138, %r1137; 2026-02-21T08:30:48.5305341Z $L__tmp105: 2026-02-21T08:30:48.5305563Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5305625Z cvt.u64.u32 %rd241, %r843; 2026-02-21T08:30:48.5305692Z cvt.u64.u32 %rd242, 
%r844; 2026-02-21T08:30:48.5305753Z shl.b64 %rd243, %rd242, 32; 2026-02-21T08:30:48.5305815Z or.b64 %rd244, %rd241, %rd243; 2026-02-21T08:30:48.5305877Z $L__tmp106: 2026-02-21T08:30:48.5306056Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5306176Z mov.b64 {%r1140, %r1141}, %rd244; 2026-02-21T08:30:48.5306254Z cvt.rn.bf16x2.f32 %r1142, %r1141, %r1140; 2026-02-21T08:30:48.5306310Z $L__tmp107: 2026-02-21T08:30:48.5306531Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5306591Z cvt.u64.u32 %rd245, %r845; 2026-02-21T08:30:48.5306659Z cvt.u64.u32 %rd246, %r846; 2026-02-21T08:30:48.5306719Z shl.b64 %rd247, %rd246, 32; 2026-02-21T08:30:48.5306781Z or.b64 %rd248, %rd245, %rd247; 2026-02-21T08:30:48.5306842Z $L__tmp108: 2026-02-21T08:30:48.5307017Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5307079Z mov.b64 {%r1143, %r1144}, %rd248; 2026-02-21T08:30:48.5307156Z cvt.rn.bf16x2.f32 %r1145, %r1144, %r1143; 2026-02-21T08:30:48.5307211Z $L__tmp109: 2026-02-21T08:30:48.5307474Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5307541Z cvt.u64.u32 %rd249, %r847; 2026-02-21T08:30:48.5307609Z cvt.u64.u32 %rd250, %r848; 2026-02-21T08:30:48.5307669Z shl.b64 %rd251, %rd250, 32; 2026-02-21T08:30:48.5307730Z or.b64 %rd252, %rd249, %rd251; 2026-02-21T08:30:48.5307790Z $L__tmp110: 2026-02-21T08:30:48.5307966Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5308030Z mov.b64 {%r1146, %r1147}, %rd252; 2026-02-21T08:30:48.5308100Z cvt.rn.bf16x2.f32 %r1148, %r1147, %r1146; 2026-02-21T08:30:48.5308162Z $L__tmp111: 2026-02-21T08:30:48.5308383Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5308446Z cvt.u64.u32 %rd253, 
%r850; 2026-02-21T08:30:48.5308514Z cvt.u64.u32 %rd254, %r851; 2026-02-21T08:30:48.5308578Z shl.b64 %rd255, %rd254, 32; 2026-02-21T08:30:48.5308642Z or.b64 %rd256, %rd253, %rd255; 2026-02-21T08:30:48.5308704Z $L__tmp112: 2026-02-21T08:30:48.5308878Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5308941Z mov.b64 {%r1149, %r1150}, %rd256; 2026-02-21T08:30:48.5309012Z cvt.rn.bf16x2.f32 %r1151, %r1150, %r1149; 2026-02-21T08:30:48.5309075Z $L__tmp113: 2026-02-21T08:30:48.5309292Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5309353Z cvt.u64.u32 %rd257, %r852; 2026-02-21T08:30:48.5309421Z cvt.u64.u32 %rd258, %r853; 2026-02-21T08:30:48.5309481Z shl.b64 %rd259, %rd258, 32; 2026-02-21T08:30:48.5309543Z or.b64 %rd260, %rd257, %rd259; 2026-02-21T08:30:48.5309606Z $L__tmp114: 2026-02-21T08:30:48.5309784Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5309850Z mov.b64 {%r1152, %r1153}, %rd260; 2026-02-21T08:30:48.5309923Z cvt.rn.bf16x2.f32 %r1154, %r1153, %r1152; 2026-02-21T08:30:48.5309987Z $L__tmp115: 2026-02-21T08:30:48.5310204Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5310266Z cvt.u64.u32 %rd261, %r854; 2026-02-21T08:30:48.5310337Z cvt.u64.u32 %rd262, %r855; 2026-02-21T08:30:48.5310399Z shl.b64 %rd263, %rd262, 32; 2026-02-21T08:30:48.5310463Z or.b64 %rd264, %rd261, %rd263; 2026-02-21T08:30:48.5310526Z $L__tmp116: 2026-02-21T08:30:48.5310711Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5310773Z mov.b64 {%r1155, %r1156}, %rd264; 2026-02-21T08:30:48.5310844Z cvt.rn.bf16x2.f32 %r1157, %r1156, %r1155; 2026-02-21T08:30:48.5310907Z $L__tmp117: 2026-02-21T08:30:48.5311130Z .loc 2 291 36 // standard.py:291:36 @[ 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5311235Z cvt.u64.u32 %rd265, %r856; 2026-02-21T08:30:48.5311304Z cvt.u64.u32 %rd266, %r857; 2026-02-21T08:30:48.5311364Z shl.b64 %rd267, %rd266, 32; 2026-02-21T08:30:48.5311426Z or.b64 %rd268, %rd265, %rd267; 2026-02-21T08:30:48.5311480Z $L__tmp118: 2026-02-21T08:30:48.5311707Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5311772Z mov.b64 {%r1158, %r1159}, %rd268; 2026-02-21T08:30:48.5311844Z cvt.rn.bf16x2.f32 %r1160, %r1159, %r1158; 2026-02-21T08:30:48.5311908Z $L__tmp119: 2026-02-21T08:30:48.5312130Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5312198Z cvt.u64.u32 %rd269, %r858; 2026-02-21T08:30:48.5312262Z cvt.u64.u32 %rd270, %r859; 2026-02-21T08:30:48.5312322Z shl.b64 %rd271, %rd270, 32; 2026-02-21T08:30:48.5312382Z or.b64 %rd272, %rd269, %rd271; 2026-02-21T08:30:48.5312488Z $L__tmp120: 2026-02-21T08:30:48.5312666Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5312727Z mov.b64 {%r1161, %r1162}, %rd272; 2026-02-21T08:30:48.5312794Z cvt.rn.bf16x2.f32 %r1163, %r1162, %r1161; 2026-02-21T08:30:48.5312856Z $L__tmp121: 2026-02-21T08:30:48.5313064Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5313124Z cvt.u64.u32 %rd273, %r860; 2026-02-21T08:30:48.5313188Z cvt.u64.u32 %rd274, %r861; 2026-02-21T08:30:48.5313247Z shl.b64 %rd275, %rd274, 32; 2026-02-21T08:30:48.5313305Z or.b64 %rd276, %rd273, %rd275; 2026-02-21T08:30:48.5313358Z $L__tmp122: 2026-02-21T08:30:48.5313534Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5313595Z mov.b64 {%r1164, %r1165}, %rd276; 2026-02-21T08:30:48.5313664Z cvt.rn.bf16x2.f32 %r1166, %r1165, %r1164; 2026-02-21T08:30:48.5313727Z $L__tmp123: 
2026-02-21T08:30:48.5313935Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5313995Z cvt.u64.u32 %rd277, %r862; 2026-02-21T08:30:48.5314061Z cvt.u64.u32 %rd278, %r863; 2026-02-21T08:30:48.5314119Z shl.b64 %rd279, %rd278, 32; 2026-02-21T08:30:48.5314178Z or.b64 %rd280, %rd277, %rd279; 2026-02-21T08:30:48.5314230Z $L__tmp124: 2026-02-21T08:30:48.5314405Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5314465Z mov.b64 {%r1167, %r1168}, %rd280; 2026-02-21T08:30:48.5314532Z cvt.rn.bf16x2.f32 %r1169, %r1168, %r1167; 2026-02-21T08:30:48.5314591Z $L__tmp125: 2026-02-21T08:30:48.5314796Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5314857Z cvt.u64.u32 %rd281, %r864; 2026-02-21T08:30:48.5314922Z cvt.u64.u32 %rd282, %r865; 2026-02-21T08:30:48.5314980Z shl.b64 %rd283, %rd282, 32; 2026-02-21T08:30:48.5315040Z or.b64 %rd284, %rd281, %rd283; 2026-02-21T08:30:48.5315092Z $L__tmp126: 2026-02-21T08:30:48.5315267Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5315326Z mov.b64 {%r1170, %r1171}, %rd284; 2026-02-21T08:30:48.5315393Z cvt.rn.bf16x2.f32 %r1172, %r1171, %r1170; 2026-02-21T08:30:48.5315451Z $L__tmp127: 2026-02-21T08:30:48.5315659Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5315718Z cvt.u64.u32 %rd285, %r867; 2026-02-21T08:30:48.5315776Z cvt.u64.u32 %rd286, %r868; 2026-02-21T08:30:48.5315841Z shl.b64 %rd287, %rd286, 32; 2026-02-21T08:30:48.5315899Z or.b64 %rd288, %rd285, %rd287; 2026-02-21T08:30:48.5315950Z $L__tmp128: 2026-02-21T08:30:48.5316124Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5316237Z mov.b64 {%r1173, %r1174}, %rd288; 2026-02-21T08:30:48.5316306Z cvt.rn.bf16x2.f32 %r1175, 
%r1174, %r1173; 2026-02-21T08:30:48.5316364Z $L__tmp129: 2026-02-21T08:30:48.5316572Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5316630Z cvt.u64.u32 %rd289, %r869; 2026-02-21T08:30:48.5316689Z cvt.u64.u32 %rd290, %r870; 2026-02-21T08:30:48.5316754Z shl.b64 %rd291, %rd290, 32; 2026-02-21T08:30:48.5316812Z or.b64 %rd292, %rd289, %rd291; 2026-02-21T08:30:48.5316864Z $L__tmp130: 2026-02-21T08:30:48.5317038Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5317097Z mov.b64 {%r1176, %r1177}, %rd292; 2026-02-21T08:30:48.5317165Z cvt.rn.bf16x2.f32 %r1178, %r1177, %r1176; 2026-02-21T08:30:48.5317224Z $L__tmp131: 2026-02-21T08:30:48.5317514Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5317578Z cvt.u64.u32 %rd293, %r871; 2026-02-21T08:30:48.5317637Z cvt.u64.u32 %rd294, %r872; 2026-02-21T08:30:48.5317704Z shl.b64 %rd295, %rd294, 32; 2026-02-21T08:30:48.5317764Z or.b64 %rd296, %rd293, %rd295; 2026-02-21T08:30:48.5317817Z $L__tmp132: 2026-02-21T08:30:48.5317993Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5318053Z mov.b64 {%r1179, %r1180}, %rd296; 2026-02-21T08:30:48.5318123Z cvt.rn.bf16x2.f32 %r1181, %r1180, %r1179; 2026-02-21T08:30:48.5318176Z $L__tmp133: 2026-02-21T08:30:48.5318397Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5318457Z cvt.u64.u32 %rd297, %r873; 2026-02-21T08:30:48.5318515Z cvt.u64.u32 %rd298, %r874; 2026-02-21T08:30:48.5318588Z shl.b64 %rd299, %rd298, 32; 2026-02-21T08:30:48.5318650Z or.b64 %rd300, %rd297, %rd299; 2026-02-21T08:30:48.5318704Z $L__tmp134: 2026-02-21T08:30:48.5318884Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5318945Z mov.b64 {%r1182, %r1183}, %rd300; 
2026-02-21T08:30:48.5319013Z cvt.rn.bf16x2.f32 %r1184, %r1183, %r1182; 2026-02-21T08:30:48.5319067Z $L__tmp135: 2026-02-21T08:30:48.5319292Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5319352Z cvt.u64.u32 %rd301, %r875; 2026-02-21T08:30:48.5319412Z cvt.u64.u32 %rd302, %r876; 2026-02-21T08:30:48.5319481Z shl.b64 %rd303, %rd302, 32; 2026-02-21T08:30:48.5319543Z or.b64 %rd304, %rd301, %rd303; 2026-02-21T08:30:48.5319598Z $L__tmp136: 2026-02-21T08:30:48.5319774Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5319836Z mov.b64 {%r1185, %r1186}, %rd304; 2026-02-21T08:30:48.5319906Z cvt.rn.bf16x2.f32 %r1187, %r1186, %r1185; 2026-02-21T08:30:48.5319958Z $L__tmp137: 2026-02-21T08:30:48.5320180Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5320239Z cvt.u64.u32 %rd305, %r877; 2026-02-21T08:30:48.5320297Z cvt.u64.u32 %rd306, %r878; 2026-02-21T08:30:48.5320364Z shl.b64 %rd307, %rd306, 32; 2026-02-21T08:30:48.5320424Z or.b64 %rd308, %rd305, %rd307; 2026-02-21T08:30:48.5320475Z $L__tmp138: 2026-02-21T08:30:48.5320651Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5320711Z mov.b64 {%r1188, %r1189}, %rd308; 2026-02-21T08:30:48.5320778Z cvt.rn.bf16x2.f32 %r1190, %r1189, %r1188; 2026-02-21T08:30:48.5320831Z $L__tmp139: 2026-02-21T08:30:48.5321053Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5321171Z cvt.u64.u32 %rd309, %r879; 2026-02-21T08:30:48.5321229Z cvt.u64.u32 %rd310, %r880; 2026-02-21T08:30:48.5321295Z shl.b64 %rd311, %rd310, 32; 2026-02-21T08:30:48.5321354Z or.b64 %rd312, %rd309, %rd311; 2026-02-21T08:30:48.5321406Z $L__tmp140: 2026-02-21T08:30:48.5321598Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 
2026-02-21T08:30:48.5321667Z mov.b64 {%r1191, %r1192}, %rd312; 2026-02-21T08:30:48.5321735Z cvt.rn.bf16x2.f32 %r1193, %r1192, %r1191; 2026-02-21T08:30:48.5321787Z $L__tmp141: 2026-02-21T08:30:48.5322006Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5322063Z cvt.u64.u32 %rd313, %r881; 2026-02-21T08:30:48.5322121Z cvt.u64.u32 %rd314, %r882; 2026-02-21T08:30:48.5322188Z shl.b64 %rd315, %rd314, 32; 2026-02-21T08:30:48.5322247Z or.b64 %rd316, %rd313, %rd315; 2026-02-21T08:30:48.5322350Z $L__tmp142: 2026-02-21T08:30:48.5322515Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5322583Z mov.b64 {%r1194, %r1195}, %rd316; 2026-02-21T08:30:48.5322651Z cvt.rn.bf16x2.f32 %r1196, %r1195, %r1194; 2026-02-21T08:30:48.5322811Z .loc 1 88 81 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:81 2026-02-21T08:30:48.5322918Z st.shared.v4.b32 [%r73], {%r1103, %r1115, %r1127, %r1139}; 2026-02-21T08:30:48.5323014Z st.shared.v4.b32 [%r74], {%r1151, %r1163, %r1175, %r1187}; 2026-02-21T08:30:48.5323069Z bar.sync 0; 2026-02-21T08:30:48.5323136Z // begin inline asm 2026-02-21T08:30:48.5323286Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r884, %r885, %r886, %r887}, [%r888]; 2026-02-21T08:30:48.5323342Z // end inline asm 2026-02-21T08:30:48.5323401Z // begin inline asm 2026-02-21T08:30:48.5323557Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r889, %r890, %r891, %r892}, [%r893]; 2026-02-21T08:30:48.5323615Z // end inline asm 2026-02-21T08:30:48.5323671Z bar.sync 0; 2026-02-21T08:30:48.5323772Z st.shared.v4.b32 [%r73], {%r1106, %r1118, %r1130, %r1142}; 2026-02-21T08:30:48.5323864Z st.shared.v4.b32 [%r74], {%r1154, %r1166, %r1178, %r1190}; 2026-02-21T08:30:48.5323919Z bar.sync 0; 2026-02-21T08:30:48.5323983Z // begin inline asm 2026-02-21T08:30:48.5324125Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r894, %r895, %r896, %r897}, [%r888]; 2026-02-21T08:30:48.5324181Z // 
end inline asm 2026-02-21T08:30:48.5324237Z // begin inline asm 2026-02-21T08:30:48.5324383Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r899, %r900, %r901, %r902}, [%r893]; 2026-02-21T08:30:48.5324439Z // end inline asm 2026-02-21T08:30:48.5324493Z bar.sync 0; 2026-02-21T08:30:48.5324590Z st.shared.v4.b32 [%r73], {%r1109, %r1121, %r1133, %r1145}; 2026-02-21T08:30:48.5324679Z st.shared.v4.b32 [%r74], {%r1157, %r1169, %r1181, %r1193}; 2026-02-21T08:30:48.5324734Z bar.sync 0; 2026-02-21T08:30:48.5324791Z // begin inline asm 2026-02-21T08:30:48.5324938Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r904, %r905, %r906, %r907}, [%r888]; 2026-02-21T08:30:48.5324994Z // end inline asm 2026-02-21T08:30:48.5325049Z // begin inline asm 2026-02-21T08:30:48.5325193Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r909, %r910, %r911, %r912}, [%r893]; 2026-02-21T08:30:48.5325249Z // end inline asm 2026-02-21T08:30:48.5325303Z bar.sync 0; 2026-02-21T08:30:48.5325400Z st.shared.v4.b32 [%r73], {%r1112, %r1124, %r1136, %r1148}; 2026-02-21T08:30:48.5325490Z st.shared.v4.b32 [%r74], {%r1160, %r1172, %r1184, %r1196}; 2026-02-21T08:30:48.5325545Z bar.sync 0; 2026-02-21T08:30:48.5325602Z // begin inline asm 2026-02-21T08:30:48.5325748Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r914, %r915, %r916, %r917}, [%r888]; 2026-02-21T08:30:48.5325804Z // end inline asm 2026-02-21T08:30:48.5325860Z // begin inline asm 2026-02-21T08:30:48.5326009Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r919, %r920, %r921, %r922}, [%r893]; 2026-02-21T08:30:48.5326112Z // end inline asm 2026-02-21T08:30:48.5326169Z // begin inline asm 2026-02-21T08:30:48.5326275Z st.global.v4.b32 [ %rd169 + 0 ], { %r884, %r894, %r904, %r914 }; 2026-02-21T08:30:48.5326339Z // end inline asm 2026-02-21T08:30:48.5326396Z // begin inline asm 2026-02-21T08:30:48.5326499Z st.global.v4.b32 [ %rd170 + 0 ], { %r885, %r895, %r905, %r915 }; 2026-02-21T08:30:48.5326562Z // end inline asm 2026-02-21T08:30:48.5326619Z // begin inline asm 
2026-02-21T08:30:48.5326715Z st.global.v4.b32 [ %rd171 + 0 ], { %r886, %r896, %r906, %r916 }; 2026-02-21T08:30:48.5326778Z // end inline asm 2026-02-21T08:30:48.5326834Z // begin inline asm 2026-02-21T08:30:48.5326930Z st.global.v4.b32 [ %rd172 + 0 ], { %r887, %r897, %r907, %r917 }; 2026-02-21T08:30:48.5326984Z // end inline asm 2026-02-21T08:30:48.5327049Z // begin inline asm 2026-02-21T08:30:48.5327144Z st.global.v4.b32 [ %rd173 + 0 ], { %r889, %r899, %r909, %r919 }; 2026-02-21T08:30:48.5327198Z // end inline asm 2026-02-21T08:30:48.5327304Z // begin inline asm 2026-02-21T08:30:48.5327400Z st.global.v4.b32 [ %rd174 + 0 ], { %r890, %r900, %r910, %r920 }; 2026-02-21T08:30:48.5327457Z // end inline asm 2026-02-21T08:30:48.5327513Z // begin inline asm 2026-02-21T08:30:48.5327615Z st.global.v4.b32 [ %rd175 + 0 ], { %r891, %r901, %r911, %r921 }; 2026-02-21T08:30:48.5327671Z // end inline asm 2026-02-21T08:30:48.5327728Z // begin inline asm 2026-02-21T08:30:48.5327832Z st.global.v4.b32 [ %rd176 + 0 ], { %r892, %r902, %r912, %r922 }; 2026-02-21T08:30:48.5327888Z // end inline asm 2026-02-21T08:30:48.5328071Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5328142Z add.s32 %r1197, %r4018, 2368; 2026-02-21T08:30:48.5328317Z .loc 1 25 35 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:25:35 2026-02-21T08:30:48.5328378Z shr.s32 %r1198, %r1197, 31; 2026-02-21T08:30:48.5328436Z shr.u32 %r1199, %r1198, 23; 2026-02-21T08:30:48.5328509Z add.s32 %r1200, %r1197, %r1199; 2026-02-21T08:30:48.5328571Z shr.s32 %r1201, %r1200, 9; 2026-02-21T08:30:48.5328738Z .loc 1 26 33 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:26:33 2026-02-21T08:30:48.5328804Z shl.b32 %r1202, %r1201, 3; 2026-02-21T08:30:48.5328969Z .loc 1 27 39 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:39 2026-02-21T08:30:48.5329027Z sub.s32 %r1203, 64, %r1202; 2026-02-21T08:30:48.5329201Z .loc 1 27 52 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:52 2026-02-21T08:30:48.5329259Z min.s32 %r1204, %r1203, 8; 2026-02-21T08:30:48.5329423Z .loc 1 28 45 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:45 2026-02-21T08:30:48.5329486Z and.b32 %r1205, %r1200, -512; 2026-02-21T08:30:48.5329555Z sub.s32 %r1206, %r1197, %r1205; 2026-02-21T08:30:48.5329722Z .loc 1 29 51 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:29:51 2026-02-21T08:30:48.5329786Z div.s32 %r113, %r1206, %r1204; 2026-02-21T08:30:48.5329954Z .loc 1 28 64 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:64 2026-02-21T08:30:48.5330018Z mul.lo.s32 %r1207, %r113, %r1204; 2026-02-21T08:30:48.5330079Z sub.s32 %r1208, %r1206, %r1207; 2026-02-21T08:30:48.5330249Z .loc 1 28 30 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:30 2026-02-21T08:30:48.5330310Z add.s32 %r1209, %r1208, %r1202; 2026-02-21T08:30:48.5330473Z .loc 1 30 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:30:27 2026-02-21T08:30:48.5330541Z shl.b32 %r114, %r1209, 7; 2026-02-21T08:30:48.5330702Z .loc 1 31 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:31:32 2026-02-21T08:30:48.5330761Z or.b32 %r115, %r114, %r6; 2026-02-21T08:30:48.5330925Z .loc 1 32 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:32:27 2026-02-21T08:30:48.5331035Z shl.b32 %r117, %r113, 6; 2026-02-21T08:30:48.5331200Z .loc 1 33 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:33:32 2026-02-21T08:30:48.5331260Z or.b32 %r1210, %r117, %r9; 2026-02-21T08:30:48.5331327Z or.b32 %r1211, %r117, %r10; 2026-02-21T08:30:48.5331385Z or.b32 %r1212, %r117, %r11; 2026-02-21T08:30:48.5331442Z or.b32 %r1213, %r117, %r12; 2026-02-21T08:30:48.5331658Z .loc 1 48 53 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:53 2026-02-21T08:30:48.5331718Z shl.b32 %r1214, %r1210, 10; 2026-02-21T08:30:48.5331776Z shl.b32 %r1215, %r1211, 10; 2026-02-21T08:30:48.5331833Z shl.b32 %r1216, 
%r1212, 10; 2026-02-21T08:30:48.5331898Z shl.b32 %r1217, %r1213, 10; 2026-02-21T08:30:48.5331959Z mov.pred %p69, -1; 2026-02-21T08:30:48.5332015Z mov.b32 %r4026, 0; 2026-02-21T08:30:48.5332076Z $L__tmp143: 2026-02-21T08:30:48.5332344Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5332406Z // begin inline asm 2026-02-21T08:30:48.5332735Z @%p69 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 0], 64, {%r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026}; 2026-02-21T08:30:48.5332791Z // end inline asm 2026-02-21T08:30:48.5332846Z // begin inline asm 2026-02-21T08:30:48.5333156Z @%p69 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 16], 64, {%r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026}; 2026-02-21T08:30:48.5333220Z // end inline asm 2026-02-21T08:30:48.5333276Z // begin inline asm 2026-02-21T08:30:48.5333583Z @%p69 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 32], 64, {%r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026}; 2026-02-21T08:30:48.5333652Z // end inline asm 2026-02-21T08:30:48.5333709Z // begin inline asm 2026-02-21T08:30:48.5334015Z @%p69 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 48], 64, {%r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026, %r4026}; 2026-02-21T08:30:48.5334077Z // end inline asm 2026-02-21T08:30:48.5334131Z // begin inline asm 2026-02-21T08:30:48.5334203Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5334263Z // end inline asm 2026-02-21T08:30:48.5334316Z bar.sync 0; 2026-02-21T08:30:48.5334370Z $L__tmp144: 2026-02-21T08:30:48.5334539Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5334604Z // 
begin inline asm 2026-02-21T08:30:48.5334694Z @%p10 mbarrier.init.shared::cta.b64 [%r4027], 1; 2026-02-21T08:30:48.5334748Z // end inline asm 2026-02-21T08:30:48.5334808Z bar.sync 0; 2026-02-21T08:30:48.5334865Z // begin inline asm 2026-02-21T08:30:48.5334954Z @%p10 mbarrier.init.shared::cta.b64 [%r2245], 1; 2026-02-21T08:30:48.5335014Z // end inline asm 2026-02-21T08:30:48.5335183Z .loc 1 48 60 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:60 2026-02-21T08:30:48.5335245Z or.b32 %r1218, %r1214, %r22; 2026-02-21T08:30:48.5335306Z or.b32 %r1219, %r1215, %r22; 2026-02-21T08:30:48.5335370Z or.b32 %r1220, %r1216, %r22; 2026-02-21T08:30:48.5335428Z or.b32 %r1221, %r1217, %r22; 2026-02-21T08:30:48.5335594Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5335673Z mad.wide.s32 %rd177, %r1218, 2, %rd92; 2026-02-21T08:30:48.5335741Z mad.wide.s32 %rd178, %r1219, 2, %rd92; 2026-02-21T08:30:48.5335807Z mad.wide.s32 %rd179, %r1220, 2, %rd92; 2026-02-21T08:30:48.5335873Z mad.wide.s32 %rd180, %r1221, 2, %rd92; 2026-02-21T08:30:48.5335938Z mov.b32 %r1334, 16; 2026-02-21T08:30:48.5336110Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5336214Z // begin inline asm 2026-02-21T08:30:48.5336346Z cp.async.cg.shared.global [ %r2763 + 0 ], [ %rd177 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5336401Z // end inline asm 2026-02-21T08:30:48.5336457Z // begin inline asm 2026-02-21T08:30:48.5336586Z cp.async.cg.shared.global [ %r2765 + 0 ], [ %rd178 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5336641Z // end inline asm 2026-02-21T08:30:48.5336696Z // begin inline asm 2026-02-21T08:30:48.5336812Z cp.async.cg.shared.global [ %r2767 + 0 ], [ %rd179 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5336876Z // end inline asm 2026-02-21T08:30:48.5336932Z // begin inline asm 2026-02-21T08:30:48.5337045Z cp.async.cg.shared.global [ %r2769 + 0 ], [ %rd180 + 0 ], 0x10, %r1334; 
2026-02-21T08:30:48.5337108Z // end inline asm 2026-02-21T08:30:48.5337171Z cp.async.commit_group; 2026-02-21T08:30:48.5337379Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5337452Z add.s32 %r1222, %r115, %r39; 2026-02-21T08:30:48.5337511Z add.s32 %r1223, %r115, %r40; 2026-02-21T08:30:48.5337680Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5337744Z cvt.s64.s32 %rd317, %r1222; 2026-02-21T08:30:48.5337819Z add.s64 %rd181, %rd93, %rd317; 2026-02-21T08:30:48.5337879Z cvt.s64.s32 %rd318, %r1223; 2026-02-21T08:30:48.5337940Z add.s64 %rd182, %rd93, %rd318; 2026-02-21T08:30:48.5338115Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5338172Z // begin inline asm 2026-02-21T08:30:48.5338288Z cp.async.cg.shared.global [ %r2771 + 0 ], [ %rd181 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5338352Z // end inline asm 2026-02-21T08:30:48.5338409Z // begin inline asm 2026-02-21T08:30:48.5338522Z cp.async.cg.shared.global [ %r2773 + 0 ], [ %rd182 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5338579Z // end inline asm 2026-02-21T08:30:48.5338653Z cp.async.commit_group; 2026-02-21T08:30:48.5338818Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5338877Z cvt.s64.s32 %rd319, %r1214; 2026-02-21T08:30:48.5338945Z or.b64 %rd320, %rd319, %rd11; 2026-02-21T08:30:48.5339005Z shl.b64 %rd321, %rd320, 1; 2026-02-21T08:30:48.5339065Z add.s64 %rd30, %rd92, %rd321; 2026-02-21T08:30:48.5339123Z add.s64 %rd183, %rd30, 128; 2026-02-21T08:30:48.5339189Z cvt.s64.s32 %rd322, %r1215; 2026-02-21T08:30:48.5339248Z or.b64 %rd323, %rd322, %rd11; 2026-02-21T08:30:48.5339305Z shl.b64 %rd324, %rd323, 1; 2026-02-21T08:30:48.5339372Z add.s64 %rd31, %rd92, %rd324; 2026-02-21T08:30:48.5339430Z add.s64 %rd184, %rd31, 128; 2026-02-21T08:30:48.5339489Z cvt.s64.s32 %rd325, %r1216; 2026-02-21T08:30:48.5339547Z or.b64 %rd326, 
%rd325, %rd11; 2026-02-21T08:30:48.5339612Z shl.b64 %rd327, %rd326, 1; 2026-02-21T08:30:48.5339673Z add.s64 %rd32, %rd92, %rd327; 2026-02-21T08:30:48.5339734Z add.s64 %rd185, %rd32, 128; 2026-02-21T08:30:48.5339798Z cvt.s64.s32 %rd328, %r1217; 2026-02-21T08:30:48.5339857Z or.b64 %rd329, %rd328, %rd11; 2026-02-21T08:30:48.5339916Z shl.b64 %rd330, %rd329, 1; 2026-02-21T08:30:48.5339982Z add.s64 %rd33, %rd92, %rd330; 2026-02-21T08:30:48.5340039Z add.s64 %rd186, %rd33, 128; 2026-02-21T08:30:48.5340208Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5340261Z bar.sync 0; 2026-02-21T08:30:48.5340326Z // begin inline asm 2026-02-21T08:30:48.5340439Z cp.async.cg.shared.global [ %r2468 + 0 ], [ %rd183 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5340494Z // end inline asm 2026-02-21T08:30:48.5340556Z // begin inline asm 2026-02-21T08:30:48.5340670Z cp.async.cg.shared.global [ %r2470 + 0 ], [ %rd184 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5340726Z // end inline asm 2026-02-21T08:30:48.5340781Z // begin inline asm 2026-02-21T08:30:48.5340959Z cp.async.cg.shared.global [ %r2472 + 0 ], [ %rd185 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5341016Z // end inline asm 2026-02-21T08:30:48.5341071Z // begin inline asm 2026-02-21T08:30:48.5341191Z cp.async.cg.shared.global [ %r2474 + 0 ], [ %rd186 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5341247Z // end inline asm 2026-02-21T08:30:48.5341310Z cp.async.commit_group; 2026-02-21T08:30:48.5341485Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5341585Z add.s32 %r1224, %r115, %r48; 2026-02-21T08:30:48.5341647Z add.s32 %r1225, %r115, %r49; 2026-02-21T08:30:48.5341816Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5341882Z cvt.s64.s32 %rd331, %r1224; 2026-02-21T08:30:48.5341943Z add.s64 %rd187, %rd93, %rd331; 2026-02-21T08:30:48.5342002Z cvt.s64.s32 %rd332, %r1225; 2026-02-21T08:30:48.5342132Z 
add.s64 %rd188, %rd93, %rd332; 2026-02-21T08:30:48.5342300Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5342357Z // begin inline asm 2026-02-21T08:30:48.5342480Z cp.async.cg.shared.global [ %r2476 + 0 ], [ %rd187 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5342535Z // end inline asm 2026-02-21T08:30:48.5342590Z // begin inline asm 2026-02-21T08:30:48.5342702Z cp.async.cg.shared.global [ %r2478 + 0 ], [ %rd188 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5342765Z // end inline asm 2026-02-21T08:30:48.5342828Z cp.async.commit_group; 2026-02-21T08:30:48.5342992Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5343063Z cp.async.wait_group 2; 2026-02-21T08:30:48.5343116Z bar.sync 0; 2026-02-21T08:30:48.5343214Z ld.shared.v4.b32 {%r1226, %r1227, %r1228, %r1229}, [%r53]; 2026-02-21T08:30:48.5343279Z mov.b32 {%rs385, %rs386}, %r1229; 2026-02-21T08:30:48.5343351Z mov.b32 {%rs387, %rs388}, %r1228; 2026-02-21T08:30:48.5343414Z mov.b32 {%rs389, %rs390}, %r1227; 2026-02-21T08:30:48.5343473Z mov.b32 {%rs391, %rs392}, %r1226; 2026-02-21T08:30:48.5343583Z ld.shared.v4.b32 {%r1230, %r1231, %r1232, %r1233}, [%r53+16]; 2026-02-21T08:30:48.5343645Z mov.b32 {%rs393, %rs394}, %r1233; 2026-02-21T08:30:48.5343702Z mov.b32 {%rs395, %rs396}, %r1232; 2026-02-21T08:30:48.5343768Z mov.b32 {%rs397, %rs398}, %r1231; 2026-02-21T08:30:48.5343827Z mov.b32 {%rs399, %rs400}, %r1230; 2026-02-21T08:30:48.5343924Z ld.shared.v4.b32 {%r1234, %r1235, %r1236, %r1237}, [%r53+32]; 2026-02-21T08:30:48.5343983Z mov.b32 {%rs401, %rs402}, %r1237; 2026-02-21T08:30:48.5344049Z mov.b32 {%rs403, %rs404}, %r1236; 2026-02-21T08:30:48.5344109Z mov.b32 {%rs405, %rs406}, %r1235; 2026-02-21T08:30:48.5344168Z mov.b32 {%rs407, %rs408}, %r1234; 2026-02-21T08:30:48.5344273Z ld.shared.v4.b32 {%r1238, %r1239, %r1240, %r1241}, [%r53+48]; 2026-02-21T08:30:48.5344332Z mov.b32 {%rs409, %rs410}, %r1241; 2026-02-21T08:30:48.5344392Z 
mov.b32 {%rs411, %rs412}, %r1240; 2026-02-21T08:30:48.5344460Z mov.b32 {%rs413, %rs414}, %r1239; 2026-02-21T08:30:48.5344518Z mov.b32 {%rs415, %rs416}, %r1238; 2026-02-21T08:30:48.5344689Z .loc 1 52 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:52:32 2026-02-21T08:30:48.5344752Z cvt.f32.bf16 %r1051, %rs391; 2026-02-21T08:30:48.5344822Z cvt.f32.bf16 %r1052, %rs392; 2026-02-21T08:30:48.5344880Z cvt.f32.bf16 %r1053, %rs389; 2026-02-21T08:30:48.5344938Z cvt.f32.bf16 %r1054, %rs390; 2026-02-21T08:30:48.5345005Z cvt.f32.bf16 %r1055, %rs387; 2026-02-21T08:30:48.5345063Z cvt.f32.bf16 %r1056, %rs388; 2026-02-21T08:30:48.5345121Z cvt.f32.bf16 %r1057, %rs385; 2026-02-21T08:30:48.5345180Z cvt.f32.bf16 %r1058, %rs386; 2026-02-21T08:30:48.5345249Z cvt.f32.bf16 %r1059, %rs399; 2026-02-21T08:30:48.5345309Z cvt.f32.bf16 %r1060, %rs400; 2026-02-21T08:30:48.5345369Z cvt.f32.bf16 %r1061, %rs397; 2026-02-21T08:30:48.5345446Z cvt.f32.bf16 %r1062, %rs398; 2026-02-21T08:30:48.5345555Z cvt.f32.bf16 %r1063, %rs395; 2026-02-21T08:30:48.5345613Z cvt.f32.bf16 %r1064, %rs396; 2026-02-21T08:30:48.5345671Z cvt.f32.bf16 %r1065, %rs393; 2026-02-21T08:30:48.5345736Z cvt.f32.bf16 %r1066, %rs394; 2026-02-21T08:30:48.5345794Z cvt.f32.bf16 %r1068, %rs407; 2026-02-21T08:30:48.5345851Z cvt.f32.bf16 %r1069, %rs408; 2026-02-21T08:30:48.5345916Z cvt.f32.bf16 %r1070, %rs405; 2026-02-21T08:30:48.5345975Z cvt.f32.bf16 %r1071, %rs406; 2026-02-21T08:30:48.5346031Z cvt.f32.bf16 %r1072, %rs403; 2026-02-21T08:30:48.5346095Z cvt.f32.bf16 %r1073, %rs404; 2026-02-21T08:30:48.5346151Z cvt.f32.bf16 %r1074, %rs401; 2026-02-21T08:30:48.5346209Z cvt.f32.bf16 %r1075, %rs402; 2026-02-21T08:30:48.5346265Z cvt.f32.bf16 %r1076, %rs415; 2026-02-21T08:30:48.5346330Z cvt.f32.bf16 %r1077, %rs416; 2026-02-21T08:30:48.5346387Z cvt.f32.bf16 %r1078, %rs413; 2026-02-21T08:30:48.5346446Z cvt.f32.bf16 %r1079, %rs414; 2026-02-21T08:30:48.5346548Z cvt.f32.bf16 %r1080, %rs411; 2026-02-21T08:30:48.5346609Z cvt.f32.bf16 %r1081, 
%rs412; 2026-02-21T08:30:48.5346666Z cvt.f32.bf16 %r1082, %rs409; 2026-02-21T08:30:48.5346722Z cvt.f32.bf16 %r1083, %rs410; 2026-02-21T08:30:48.5346784Z $L__tmp145: 2026-02-21T08:30:48.5347009Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5347066Z // begin inline asm 2026-02-21T08:30:48.5347424Z @%p69 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 0], 32, {%r1051, %r1052, %r1053, %r1054, %r1055, %r1056, %r1057, %r1058, %r1059, %r1060, %r1061, %r1062, %r1063, %r1064, %r1065, %r1066}; 2026-02-21T08:30:48.5347483Z // end inline asm 2026-02-21T08:30:48.5347544Z // begin inline asm 2026-02-21T08:30:48.5347878Z @%p69 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 16], 32, {%r1068, %r1069, %r1070, %r1071, %r1072, %r1073, %r1074, %r1075, %r1076, %r1077, %r1078, %r1079, %r1080, %r1081, %r1082, %r1083}; 2026-02-21T08:30:48.5347938Z // end inline asm 2026-02-21T08:30:48.5347999Z // begin inline asm 2026-02-21T08:30:48.5348081Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5348139Z // end inline asm 2026-02-21T08:30:48.5348195Z bar.sync 0; 2026-02-21T08:30:48.5348253Z $L__tmp146: 2026-02-21T08:30:48.5348439Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5348503Z ld.shared.s8 %rs417, [%r54]; 2026-02-21T08:30:48.5348678Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5348749Z shl.b16 %rs418, %rs417, 4; 2026-02-21T08:30:48.5348927Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5348996Z ld.shared.s8 %rs419, [%r54+128]; 2026-02-21T08:30:48.5349175Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5349238Z shl.b16 %rs420, %rs419, 4; 2026-02-21T08:30:48.5349414Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5349485Z ld.shared.s8 %rs421, 
[%r54+256]; 2026-02-21T08:30:48.5349664Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5349725Z shl.b16 %rs422, %rs421, 4; 2026-02-21T08:30:48.5349898Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5349972Z ld.shared.s8 %rs423, [%r54+384]; 2026-02-21T08:30:48.5350144Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5350207Z shl.b16 %rs424, %rs423, 4; 2026-02-21T08:30:48.5350385Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5350451Z ld.shared.s8 %rs425, [%r54+512]; 2026-02-21T08:30:48.5350622Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5350731Z shl.b16 %rs426, %rs425, 4; 2026-02-21T08:30:48.5350903Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5350967Z ld.shared.s8 %rs427, [%r54+640]; 2026-02-21T08:30:48.5351133Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5351202Z shl.b16 %rs428, %rs427, 4; 2026-02-21T08:30:48.5351372Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5351435Z ld.shared.s8 %rs429, [%r54+768]; 2026-02-21T08:30:48.5351633Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5351696Z shl.b16 %rs430, %rs429, 4; 2026-02-21T08:30:48.5351918Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5351995Z ld.shared.s8 %rs431, [%r56]; 2026-02-21T08:30:48.5352168Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5352229Z shl.b16 %rs432, %rs431, 4; 2026-02-21T08:30:48.5352412Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5352479Z ld.shared.s8 %rs433, 
[%r54+1024]; 2026-02-21T08:30:48.5352650Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5352711Z shl.b16 %rs434, %rs433, 4; 2026-02-21T08:30:48.5352895Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5352961Z ld.shared.s8 %rs435, [%r54+1152]; 2026-02-21T08:30:48.5353128Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5353200Z shl.b16 %rs436, %rs435, 4; 2026-02-21T08:30:48.5353372Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5353437Z ld.shared.s8 %rs437, [%r54+1280]; 2026-02-21T08:30:48.5353616Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5353678Z shl.b16 %rs438, %rs437, 4; 2026-02-21T08:30:48.5353847Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5353921Z ld.shared.s8 %rs439, [%r54+1408]; 2026-02-21T08:30:48.5354094Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5354155Z shl.b16 %rs440, %rs439, 4; 2026-02-21T08:30:48.5354328Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5354401Z ld.shared.s8 %rs441, [%r54+1536]; 2026-02-21T08:30:48.5354575Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5354640Z shl.b16 %rs442, %rs441, 4; 2026-02-21T08:30:48.5354816Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5354881Z ld.shared.s8 %rs443, [%r54+1664]; 2026-02-21T08:30:48.5355051Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5355120Z shl.b16 %rs444, %rs443, 4; 2026-02-21T08:30:48.5355305Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5355368Z ld.shared.s8 %rs445, 
[%r54+1792]; 2026-02-21T08:30:48.5355542Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5355601Z shl.b16 %rs446, %rs445, 4; 2026-02-21T08:30:48.5355771Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5355885Z ld.shared.s8 %rs447, [%r58]; 2026-02-21T08:30:48.5356062Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5356122Z shl.b16 %rs448, %rs447, 4; 2026-02-21T08:30:48.5356287Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5356359Z ld.shared.s8 %rs449, [%r54+2048]; 2026-02-21T08:30:48.5356521Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5356579Z shl.b16 %rs450, %rs449, 4; 2026-02-21T08:30:48.5356753Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5356814Z ld.shared.s8 %rs451, [%r54+2176]; 2026-02-21T08:30:48.5356976Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5357088Z shl.b16 %rs452, %rs451, 4; 2026-02-21T08:30:48.5357254Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5357316Z ld.shared.s8 %rs453, [%r54+2304]; 2026-02-21T08:30:48.5357480Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5357545Z shl.b16 %rs454, %rs453, 4; 2026-02-21T08:30:48.5357709Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5357772Z ld.shared.s8 %rs455, [%r54+2432]; 2026-02-21T08:30:48.5357945Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5358004Z shl.b16 %rs456, %rs455, 4; 2026-02-21T08:30:48.5358172Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5358243Z ld.shared.s8 %rs457, 
[%r54+2560]; 2026-02-21T08:30:48.5358410Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5358472Z shl.b16 %rs458, %rs457, 4; 2026-02-21T08:30:48.5358645Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5358708Z ld.shared.s8 %rs459, [%r54+2688]; 2026-02-21T08:30:48.5358875Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5358933Z shl.b16 %rs460, %rs459, 4; 2026-02-21T08:30:48.5359106Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5359168Z ld.shared.s8 %rs461, [%r54+2816]; 2026-02-21T08:30:48.5359335Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5359403Z shl.b16 %rs462, %rs461, 4; 2026-02-21T08:30:48.5359569Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5359632Z ld.shared.s8 %rs463, [%r60]; 2026-02-21T08:30:48.5359803Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5359861Z shl.b16 %rs464, %rs463, 4; 2026-02-21T08:30:48.5360024Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5360092Z ld.shared.s8 %rs465, [%r54+3072]; 2026-02-21T08:30:48.5360254Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5360311Z shl.b16 %rs466, %rs465, 4; 2026-02-21T08:30:48.5360475Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5360546Z ld.shared.s8 %rs467, [%r54+3200]; 2026-02-21T08:30:48.5360709Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5360769Z shl.b16 %rs468, %rs467, 4; 2026-02-21T08:30:48.5360983Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5361046Z ld.shared.s8 %rs469, 
[%r54+3328]; 2026-02-21T08:30:48.5361204Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5361271Z shl.b16 %rs470, %rs469, 4; 2026-02-21T08:30:48.5361430Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5361491Z ld.shared.s8 %rs471, [%r54+3456]; 2026-02-21T08:30:48.5361691Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5361751Z shl.b16 %rs472, %rs471, 4; 2026-02-21T08:30:48.5361913Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5361980Z ld.shared.s8 %rs473, [%r54+3584]; 2026-02-21T08:30:48.5362195Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5362255Z shl.b16 %rs474, %rs473, 4; 2026-02-21T08:30:48.5362417Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5362488Z ld.shared.s8 %rs475, [%r54+3712]; 2026-02-21T08:30:48.5362649Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5362706Z shl.b16 %rs476, %rs475, 4; 2026-02-21T08:30:48.5362874Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5362935Z ld.shared.s8 %rs477, [%r54+3840]; 2026-02-21T08:30:48.5363099Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5363163Z shl.b16 %rs478, %rs477, 4; 2026-02-21T08:30:48.5363327Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5363390Z ld.shared.s8 %rs479, [%r62]; 2026-02-21T08:30:48.5363559Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5363619Z shl.b16 %rs480, %rs479, 4; 2026-02-21T08:30:48.5363781Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5363841Z cvt.s16.s8 %rs481, %rs418; 
2026-02-21T08:30:48.5363908Z shr.s16 %rs482, %rs481, 4; 2026-02-21T08:30:48.5363968Z cvt.s16.s8 %rs483, %rs420; 2026-02-21T08:30:48.5364026Z shr.s16 %rs484, %rs483, 4; 2026-02-21T08:30:48.5364089Z shr.s16 %rs485, %rs417, 4; 2026-02-21T08:30:48.5364146Z shr.s16 %rs486, %rs419, 4; 2026-02-21T08:30:48.5364310Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5364374Z cvt.rn.f32.s16 %r1242, %rs486; 2026-02-21T08:30:48.5364446Z cvt.rn.f32.s16 %r1243, %rs485; 2026-02-21T08:30:48.5364509Z cvt.rn.f32.s16 %r1244, %rs484; 2026-02-21T08:30:48.5364568Z cvt.rn.f32.s16 %r1245, %rs482; 2026-02-21T08:30:48.5364740Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5364801Z cvt.s16.s8 %rs487, %rs422; 2026-02-21T08:30:48.5364858Z shr.s16 %rs488, %rs487, 4; 2026-02-21T08:30:48.5364922Z cvt.s16.s8 %rs489, %rs424; 2026-02-21T08:30:48.5364979Z shr.s16 %rs490, %rs489, 4; 2026-02-21T08:30:48.5365036Z shr.s16 %rs491, %rs421, 4; 2026-02-21T08:30:48.5365092Z shr.s16 %rs492, %rs423, 4; 2026-02-21T08:30:48.5365271Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5365331Z cvt.rn.f32.s16 %r1246, %rs492; 2026-02-21T08:30:48.5365391Z cvt.rn.f32.s16 %r1247, %rs491; 2026-02-21T08:30:48.5365458Z cvt.rn.f32.s16 %r1248, %rs490; 2026-02-21T08:30:48.5365517Z cvt.rn.f32.s16 %r1249, %rs488; 2026-02-21T08:30:48.5365685Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5365799Z cvt.s16.s8 %rs493, %rs426; 2026-02-21T08:30:48.5365858Z shr.s16 %rs494, %rs493, 4; 2026-02-21T08:30:48.5365916Z cvt.s16.s8 %rs495, %rs428; 2026-02-21T08:30:48.5365974Z shr.s16 %rs496, %rs495, 4; 2026-02-21T08:30:48.5366042Z shr.s16 %rs497, %rs425, 4; 2026-02-21T08:30:48.5366099Z shr.s16 %rs498, %rs427, 4; 2026-02-21T08:30:48.5366267Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5366339Z 
cvt.rn.f32.s16 %r1250, %rs498; 2026-02-21T08:30:48.5366400Z cvt.rn.f32.s16 %r1251, %rs497; 2026-02-21T08:30:48.5366463Z cvt.rn.f32.s16 %r1252, %rs496; 2026-02-21T08:30:48.5366522Z cvt.rn.f32.s16 %r1253, %rs494; 2026-02-21T08:30:48.5366695Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5366793Z cvt.s16.s8 %rs499, %rs430; 2026-02-21T08:30:48.5366855Z shr.s16 %rs500, %rs499, 4; 2026-02-21T08:30:48.5366921Z cvt.s16.s8 %rs501, %rs432; 2026-02-21T08:30:48.5366978Z shr.s16 %rs502, %rs501, 4; 2026-02-21T08:30:48.5367035Z shr.s16 %rs503, %rs429, 4; 2026-02-21T08:30:48.5367097Z shr.s16 %rs504, %rs431, 4; 2026-02-21T08:30:48.5367263Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5367323Z cvt.rn.f32.s16 %r1254, %rs504; 2026-02-21T08:30:48.5367382Z cvt.rn.f32.s16 %r1255, %rs503; 2026-02-21T08:30:48.5367448Z cvt.rn.f32.s16 %r1256, %rs502; 2026-02-21T08:30:48.5367506Z cvt.rn.f32.s16 %r1257, %rs500; 2026-02-21T08:30:48.5367671Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5367738Z cvt.s16.s8 %rs505, %rs434; 2026-02-21T08:30:48.5367795Z shr.s16 %rs506, %rs505, 4; 2026-02-21T08:30:48.5367853Z cvt.s16.s8 %rs507, %rs436; 2026-02-21T08:30:48.5367912Z shr.s16 %rs508, %rs507, 4; 2026-02-21T08:30:48.5367979Z shr.s16 %rs509, %rs433, 4; 2026-02-21T08:30:48.5368037Z shr.s16 %rs510, %rs435, 4; 2026-02-21T08:30:48.5368202Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5368269Z cvt.rn.f32.s16 %r1258, %rs510; 2026-02-21T08:30:48.5368329Z cvt.rn.f32.s16 %r1259, %rs509; 2026-02-21T08:30:48.5368388Z cvt.rn.f32.s16 %r1260, %rs508; 2026-02-21T08:30:48.5368452Z cvt.rn.f32.s16 %r1261, %rs506; 2026-02-21T08:30:48.5368616Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5368674Z cvt.s16.s8 %rs511, %rs438; 2026-02-21T08:30:48.5368731Z shr.s16 %rs512, 
%rs511, 4; 2026-02-21T08:30:48.5368797Z cvt.s16.s8 %rs513, %rs440; 2026-02-21T08:30:48.5368853Z shr.s16 %rs514, %rs513, 4; 2026-02-21T08:30:48.5368910Z shr.s16 %rs515, %rs437, 4; 2026-02-21T08:30:48.5368976Z shr.s16 %rs516, %rs439, 4; 2026-02-21T08:30:48.5369145Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5369207Z cvt.rn.f32.s16 %r1262, %rs516; 2026-02-21T08:30:48.5369274Z cvt.rn.f32.s16 %r1263, %rs515; 2026-02-21T08:30:48.5369332Z cvt.rn.f32.s16 %r1264, %rs514; 2026-02-21T08:30:48.5369391Z cvt.rn.f32.s16 %r1265, %rs512; 2026-02-21T08:30:48.5369556Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5369622Z cvt.s16.s8 %rs517, %rs442; 2026-02-21T08:30:48.5369680Z shr.s16 %rs518, %rs517, 4; 2026-02-21T08:30:48.5369738Z cvt.s16.s8 %rs519, %rs444; 2026-02-21T08:30:48.5369802Z shr.s16 %rs520, %rs519, 4; 2026-02-21T08:30:48.5369859Z shr.s16 %rs521, %rs441, 4; 2026-02-21T08:30:48.5369915Z shr.s16 %rs522, %rs443, 4; 2026-02-21T08:30:48.5370080Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5370148Z cvt.rn.f32.s16 %r1266, %rs522; 2026-02-21T08:30:48.5370253Z cvt.rn.f32.s16 %r1267, %rs521; 2026-02-21T08:30:48.5370311Z cvt.rn.f32.s16 %r1268, %rs520; 2026-02-21T08:30:48.5370377Z cvt.rn.f32.s16 %r1269, %rs518; 2026-02-21T08:30:48.5370540Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5370599Z cvt.s16.s8 %rs523, %rs446; 2026-02-21T08:30:48.5370663Z shr.s16 %rs524, %rs523, 4; 2026-02-21T08:30:48.5370719Z cvt.s16.s8 %rs525, %rs448; 2026-02-21T08:30:48.5370777Z shr.s16 %rs526, %rs525, 4; 2026-02-21T08:30:48.5370833Z shr.s16 %rs527, %rs445, 4; 2026-02-21T08:30:48.5370897Z shr.s16 %rs528, %rs447, 4; 2026-02-21T08:30:48.5371060Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5371119Z cvt.rn.f32.s16 %r1270, %rs528; 
2026-02-21T08:30:48.5371184Z cvt.rn.f32.s16 %r1271, %rs527; 2026-02-21T08:30:48.5371243Z cvt.rn.f32.s16 %r1272, %rs526; 2026-02-21T08:30:48.5371341Z cvt.rn.f32.s16 %r1273, %rs524; 2026-02-21T08:30:48.5371516Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5371609Z cvt.s16.s8 %rs529, %rs450; 2026-02-21T08:30:48.5371671Z shr.s16 %rs530, %rs529, 4; 2026-02-21T08:30:48.5371729Z cvt.s16.s8 %rs531, %rs452; 2026-02-21T08:30:48.5371795Z shr.s16 %rs532, %rs531, 4; 2026-02-21T08:30:48.5371853Z shr.s16 %rs533, %rs449, 4; 2026-02-21T08:30:48.5371910Z shr.s16 %rs534, %rs451, 4; 2026-02-21T08:30:48.5372081Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5372140Z cvt.rn.f32.s16 %r1274, %rs534; 2026-02-21T08:30:48.5372199Z cvt.rn.f32.s16 %r1275, %rs533; 2026-02-21T08:30:48.5372258Z cvt.rn.f32.s16 %r1276, %rs532; 2026-02-21T08:30:48.5372324Z cvt.rn.f32.s16 %r1277, %rs530; 2026-02-21T08:30:48.5372487Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5372547Z cvt.s16.s8 %rs535, %rs454; 2026-02-21T08:30:48.5372615Z shr.s16 %rs536, %rs535, 4; 2026-02-21T08:30:48.5372672Z cvt.s16.s8 %rs537, %rs456; 2026-02-21T08:30:48.5372729Z shr.s16 %rs538, %rs537, 4; 2026-02-21T08:30:48.5372793Z shr.s16 %rs539, %rs453, 4; 2026-02-21T08:30:48.5372851Z shr.s16 %rs540, %rs455, 4; 2026-02-21T08:30:48.5373018Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5373080Z cvt.rn.f32.s16 %r1278, %rs540; 2026-02-21T08:30:48.5373148Z cvt.rn.f32.s16 %r1279, %rs539; 2026-02-21T08:30:48.5373207Z cvt.rn.f32.s16 %r1280, %rs538; 2026-02-21T08:30:48.5373267Z cvt.rn.f32.s16 %r1281, %rs536; 2026-02-21T08:30:48.5373441Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5373499Z cvt.s16.s8 %rs541, %rs458; 2026-02-21T08:30:48.5373558Z shr.s16 %rs542, %rs541, 4; 
2026-02-21T08:30:48.5373616Z cvt.s16.s8 %rs543, %rs460; 2026-02-21T08:30:48.5373687Z shr.s16 %rs544, %rs543, 4; 2026-02-21T08:30:48.5373745Z shr.s16 %rs545, %rs457, 4; 2026-02-21T08:30:48.5373802Z shr.s16 %rs546, %rs459, 4; 2026-02-21T08:30:48.5373976Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5374037Z cvt.rn.f32.s16 %r1282, %rs546; 2026-02-21T08:30:48.5374098Z cvt.rn.f32.s16 %r1283, %rs545; 2026-02-21T08:30:48.5374166Z cvt.rn.f32.s16 %r1284, %rs544; 2026-02-21T08:30:48.5374226Z cvt.rn.f32.s16 %r1285, %rs542; 2026-02-21T08:30:48.5374393Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5374454Z cvt.s16.s8 %rs547, %rs462; 2026-02-21T08:30:48.5374520Z shr.s16 %rs548, %rs547, 4; 2026-02-21T08:30:48.5374578Z cvt.s16.s8 %rs549, %rs464; 2026-02-21T08:30:48.5374636Z shr.s16 %rs550, %rs549, 4; 2026-02-21T08:30:48.5374702Z shr.s16 %rs551, %rs461, 4; 2026-02-21T08:30:48.5374760Z shr.s16 %rs552, %rs463, 4; 2026-02-21T08:30:48.5374969Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5375037Z cvt.rn.f32.s16 %r1286, %rs552; 2026-02-21T08:30:48.5375097Z cvt.rn.f32.s16 %r1287, %rs551; 2026-02-21T08:30:48.5375156Z cvt.rn.f32.s16 %r1288, %rs550; 2026-02-21T08:30:48.5375214Z cvt.rn.f32.s16 %r1289, %rs548; 2026-02-21T08:30:48.5375389Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5375447Z cvt.s16.s8 %rs553, %rs466; 2026-02-21T08:30:48.5375506Z shr.s16 %rs554, %rs553, 4; 2026-02-21T08:30:48.5375572Z cvt.s16.s8 %rs555, %rs468; 2026-02-21T08:30:48.5375631Z shr.s16 %rs556, %rs555, 4; 2026-02-21T08:30:48.5375688Z shr.s16 %rs557, %rs465, 4; 2026-02-21T08:30:48.5375745Z shr.s16 %rs558, %rs467, 4; 2026-02-21T08:30:48.5375917Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5376036Z cvt.rn.f32.s16 %r1290, %rs558; 2026-02-21T08:30:48.5376099Z 
cvt.rn.f32.s16 %r1291, %rs557; 2026-02-21T08:30:48.5376166Z cvt.rn.f32.s16 %r1292, %rs556; 2026-02-21T08:30:48.5376224Z cvt.rn.f32.s16 %r1293, %rs554; 2026-02-21T08:30:48.5376390Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5376457Z cvt.s16.s8 %rs559, %rs470; 2026-02-21T08:30:48.5376515Z shr.s16 %rs560, %rs559, 4; 2026-02-21T08:30:48.5376572Z cvt.s16.s8 %rs561, %rs472; 2026-02-21T08:30:48.5376629Z shr.s16 %rs562, %rs561, 4; 2026-02-21T08:30:48.5376693Z shr.s16 %rs563, %rs469, 4; 2026-02-21T08:30:48.5376751Z shr.s16 %rs564, %rs471, 4; 2026-02-21T08:30:48.5376917Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5376984Z cvt.rn.f32.s16 %r1294, %rs564; 2026-02-21T08:30:48.5377042Z cvt.rn.f32.s16 %r1295, %rs563; 2026-02-21T08:30:48.5377103Z cvt.rn.f32.s16 %r1296, %rs562; 2026-02-21T08:30:48.5377165Z cvt.rn.f32.s16 %r1297, %rs560; 2026-02-21T08:30:48.5377336Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5377394Z cvt.s16.s8 %rs565, %rs474; 2026-02-21T08:30:48.5377453Z shr.s16 %rs566, %rs565, 4; 2026-02-21T08:30:48.5377518Z cvt.s16.s8 %rs567, %rs476; 2026-02-21T08:30:48.5377576Z shr.s16 %rs568, %rs567, 4; 2026-02-21T08:30:48.5377633Z shr.s16 %rs569, %rs473, 4; 2026-02-21T08:30:48.5377700Z shr.s16 %rs570, %rs475, 4; 2026-02-21T08:30:48.5377865Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5377925Z cvt.rn.f32.s16 %r1298, %rs570; 2026-02-21T08:30:48.5377982Z cvt.rn.f32.s16 %r1299, %rs569; 2026-02-21T08:30:48.5378050Z cvt.rn.f32.s16 %r1300, %rs568; 2026-02-21T08:30:48.5378109Z cvt.rn.f32.s16 %r1301, %rs566; 2026-02-21T08:30:48.5378276Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5378344Z cvt.s16.s8 %rs571, %rs478; 2026-02-21T08:30:48.5378404Z shr.s16 %rs572, %rs571, 4; 2026-02-21T08:30:48.5378462Z cvt.s16.s8 %rs573, 
%rs480; 2026-02-21T08:30:48.5378527Z shr.s16 %rs574, %rs573, 4; 2026-02-21T08:30:48.5378584Z shr.s16 %rs575, %rs477, 4; 2026-02-21T08:30:48.5378640Z shr.s16 %rs576, %rs479, 4; 2026-02-21T08:30:48.5378806Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5378872Z cvt.rn.f32.s16 %r1302, %rs576; 2026-02-21T08:30:48.5378930Z cvt.rn.f32.s16 %r1303, %rs575; 2026-02-21T08:30:48.5378987Z cvt.rn.f32.s16 %r1304, %rs574; 2026-02-21T08:30:48.5379052Z cvt.rn.f32.s16 %r1305, %rs572; 2026-02-21T08:30:48.5379151Z st.shared.v4.b32 [%r63], {%r1245, %r1243, %r1244, %r1242}; 2026-02-21T08:30:48.5379264Z st.shared.v4.b32 [%r63+16384], {%r1277, %r1275, %r1276, %r1274}; 2026-02-21T08:30:48.5379361Z st.shared.v4.b32 [%r64], {%r1249, %r1247, %r1248, %r1246}; 2026-02-21T08:30:48.5379514Z st.shared.v4.b32 [%r64+16384], {%r1281, %r1279, %r1280, %r1278}; 2026-02-21T08:30:48.5379607Z st.shared.v4.b32 [%r65], {%r1253, %r1251, %r1252, %r1250}; 2026-02-21T08:30:48.5379708Z st.shared.v4.b32 [%r65+16384], {%r1285, %r1283, %r1284, %r1282}; 2026-02-21T08:30:48.5379805Z st.shared.v4.b32 [%r66], {%r1257, %r1255, %r1256, %r1254}; 2026-02-21T08:30:48.5379903Z st.shared.v4.b32 [%r66+16384], {%r1289, %r1287, %r1288, %r1286}; 2026-02-21T08:30:48.5379993Z st.shared.v4.b32 [%r67], {%r1261, %r1259, %r1260, %r1258}; 2026-02-21T08:30:48.5380096Z st.shared.v4.b32 [%r67+16384], {%r1293, %r1291, %r1292, %r1290}; 2026-02-21T08:30:48.5380186Z st.shared.v4.b32 [%r68], {%r1265, %r1263, %r1264, %r1262}; 2026-02-21T08:30:48.5380283Z st.shared.v4.b32 [%r68+16384], {%r1297, %r1295, %r1296, %r1294}; 2026-02-21T08:30:48.5380379Z st.shared.v4.b32 [%r69], {%r1269, %r1267, %r1268, %r1266}; 2026-02-21T08:30:48.5380517Z st.shared.v4.b32 [%r69+16384], {%r1301, %r1299, %r1300, %r1298}; 2026-02-21T08:30:48.5380609Z st.shared.v4.b32 [%r70], {%r1273, %r1271, %r1272, %r1270}; 2026-02-21T08:30:48.5380707Z st.shared.v4.b32 [%r70+16384], {%r1305, %r1303, %r1304, %r1302}; 
2026-02-21T08:30:48.5380767Z $L__tmp147: 2026-02-21T08:30:48.5380989Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5381047Z // begin inline asm 2026-02-21T08:30:48.5381131Z fence.proxy.async.shared::cta; 2026-02-21T08:30:48.5381188Z // end inline asm 2026-02-21T08:30:48.5381243Z bar.sync 0; 2026-02-21T08:30:48.5381311Z @%p14 bra $L__BB0_11; 2026-02-21T08:30:48.5381412Z // %bb.10: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5381479Z elect.sync %r1330|%p68, -1; 2026-02-21T08:30:48.5381564Z mov.b32 %r1308, 69208336; 2026-02-21T08:30:48.5381632Z mov.pred %p67, 0; 2026-02-21T08:30:48.5381690Z // begin inline asm 2026-02-21T08:30:48.5381855Z @%p68 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 0 ], %rd737, %r1308, %p67; 2026-02-21T08:30:48.5381922Z // end inline asm 2026-02-21T08:30:48.5381980Z // begin inline asm 2026-02-21T08:30:48.5382132Z @%p68 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 8 ], %rd738, %r1308, %p69; 2026-02-21T08:30:48.5382195Z // end inline asm 2026-02-21T08:30:48.5382251Z // begin inline asm 2026-02-21T08:30:48.5382404Z @%p68 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 16 ], %rd739, %r1308, %p69; 2026-02-21T08:30:48.5382462Z // end inline asm 2026-02-21T08:30:48.5382527Z // begin inline asm 2026-02-21T08:30:48.5382680Z @%p68 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 24 ], %rd740, %r1308, %p69; 2026-02-21T08:30:48.5382738Z // end inline asm 2026-02-21T08:30:48.5382806Z // begin inline asm 2026-02-21T08:30:48.5400246Z @%p68 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 32 ], %rd741, %r1308, %p69; 2026-02-21T08:30:48.5400330Z // end inline asm 2026-02-21T08:30:48.5400405Z // begin inline asm 2026-02-21T08:30:48.5400589Z @%p68 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 40 ], %rd742, %r1308, %p69; 2026-02-21T08:30:48.5400649Z // end inline asm 
2026-02-21T08:30:48.5400711Z // begin inline asm 2026-02-21T08:30:48.5400875Z @%p68 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 48 ], %rd743, %r1308, %p69; 2026-02-21T08:30:48.5400934Z // end inline asm 2026-02-21T08:30:48.5400995Z // begin inline asm 2026-02-21T08:30:48.5401140Z @%p68 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 56 ], %rd744, %r1308, %p69; 2026-02-21T08:30:48.5401209Z // end inline asm 2026-02-21T08:30:48.5401275Z add.s32 %r1332, %r294, 57344; 2026-02-21T08:30:48.5401340Z cvt.u64.u32 %rd341, %r1332; 2026-02-21T08:30:48.5401409Z // begin inline asm 2026-02-21T08:30:48.5401599Z @%p68 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd341]; 2026-02-21T08:30:48.5401658Z // end inline asm 2026-02-21T08:30:48.5401732Z $L__tmp148: 2026-02-21T08:30:48.5401983Z $L__BB0_11: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5402157Z .loc 1 0 0 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0 2026-02-21T08:30:48.5402224Z or.b32 %r116, %r114, %r8; 2026-02-21T08:30:48.5402299Z or.b32 %r118, %r117, %r14; 2026-02-21T08:30:48.5402361Z or.b32 %r119, %r117, %r15; 2026-02-21T08:30:48.5402423Z or.b32 %r120, %r117, %r16; 2026-02-21T08:30:48.5402493Z or.b32 %r121, %r117, %r17; 2026-02-21T08:30:48.5402551Z or.b32 %r122, %r117, %r18; 2026-02-21T08:30:48.5402610Z or.b32 %r123, %r117, %r19; 2026-02-21T08:30:48.5402667Z or.b32 %r124, %r117, %r20; 2026-02-21T08:30:48.5402735Z or.b32 %r125, %r117, %r21; 2026-02-21T08:30:48.5402916Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5402980Z add.s64 %rd342, %rd30, 256; 2026-02-21T08:30:48.5403053Z add.s64 %rd343, %rd31, 256; 2026-02-21T08:30:48.5403181Z add.s64 %rd344, %rd32, 256; 2026-02-21T08:30:48.5403245Z add.s64 %rd345, %rd33, 256; 2026-02-21T08:30:48.5403429Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5403491Z // begin inline asm 2026-02-21T08:30:48.5403615Z 
cp.async.cg.shared.global [ %r2763 + 0 ], [ %rd342 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5403674Z // end inline asm 2026-02-21T08:30:48.5403743Z // begin inline asm 2026-02-21T08:30:48.5403862Z cp.async.cg.shared.global [ %r2765 + 0 ], [ %rd343 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5403920Z // end inline asm 2026-02-21T08:30:48.5403989Z // begin inline asm 2026-02-21T08:30:48.5404106Z cp.async.cg.shared.global [ %r2767 + 0 ], [ %rd344 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5404163Z // end inline asm 2026-02-21T08:30:48.5404231Z // begin inline asm 2026-02-21T08:30:48.5404346Z cp.async.cg.shared.global [ %r2769 + 0 ], [ %rd345 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5404406Z // end inline asm 2026-02-21T08:30:48.5404476Z cp.async.commit_group; 2026-02-21T08:30:48.5404656Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5404722Z add.s32 %r1348, %r115, %r71; 2026-02-21T08:30:48.5404783Z add.s32 %r1349, %r115, %r72; 2026-02-21T08:30:48.5404960Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5405024Z cvt.s64.s32 %rd349, %r1348; 2026-02-21T08:30:48.5405088Z add.s64 %rd346, %rd93, %rd349; 2026-02-21T08:30:48.5405148Z cvt.s64.s32 %rd350, %r1349; 2026-02-21T08:30:48.5405218Z add.s64 %rd347, %rd93, %rd350; 2026-02-21T08:30:48.5405385Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5405444Z // begin inline asm 2026-02-21T08:30:48.5405570Z cp.async.cg.shared.global [ %r2771 + 0 ], [ %rd346 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5405626Z // end inline asm 2026-02-21T08:30:48.5405687Z // begin inline asm 2026-02-21T08:30:48.5405811Z cp.async.cg.shared.global [ %r2773 + 0 ], [ %rd347 + 0 ], 0x10, %r1334; 2026-02-21T08:30:48.5405867Z // end inline asm 2026-02-21T08:30:48.5405934Z cp.async.commit_group; 2026-02-21T08:30:48.5406101Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 
2026-02-21T08:30:48.5406172Z add.s32 %r1350, %r39, %r114; 2026-02-21T08:30:48.5406234Z cvt.u64.u32 %rd1092, %r1350; 2026-02-21T08:30:48.5406294Z add.s32 %r1351, %r11, %r117; 2026-02-21T08:30:48.5406362Z shl.b32 %r1352, %r1351, 10; 2026-02-21T08:30:48.5406436Z mad.wide.s32 %rd1091, %r1352, 2, %rd10; 2026-02-21T08:30:48.5406494Z add.s32 %r1353, %r10, %r117; 2026-02-21T08:30:48.5406563Z shl.b32 %r1354, %r1353, 10; 2026-02-21T08:30:48.5406632Z mad.wide.s32 %rd1090, %r1354, 2, %rd10; 2026-02-21T08:30:48.5406692Z shl.b32 %r1355, %r113, 16; 2026-02-21T08:30:48.5406752Z or.b32 %r1356, %r79, %r1355; 2026-02-21T08:30:48.5406830Z mad.wide.s32 %rd1089, %r1356, 2, %rd10; 2026-02-21T08:30:48.5406931Z or.b32 %r4025, %r80, %r1355; 2026-02-21T08:30:48.5406994Z mov.b64 %rd1093, -32; 2026-02-21T08:30:48.5407065Z mov.b32 %r4028, %r4026; 2026-02-21T08:30:48.5407124Z mov.b32 %r4030, %r4026; 2026-02-21T08:30:48.5407187Z bra.uni $L__BB0_12; 2026-02-21T08:30:48.5407292Z $L__BB0_14: // in Loop: Header=BB0_12 Depth=2 2026-02-21T08:30:48.5407472Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5407539Z add.s64 %rd1093, %rd1093, 32; 2026-02-21T08:30:48.5407609Z setp.lt.u64 %p105, %rd1093, 416; 2026-02-21T08:30:48.5407677Z $L__tmp149: 2026-02-21T08:30:48.5407903Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5407965Z add.s32 %r1525, %r4029, 1; 2026-02-21T08:30:48.5408039Z setp.gt.s32 %p106, %r1525, 1; 2026-02-21T08:30:48.5408145Z selp.b32 %r4029, 0, %r1525, %p106; 2026-02-21T08:30:48.5408211Z selp.b32 %r1526, 1, 0, %p106; 2026-02-21T08:30:48.5408273Z xor.b32 %r138, %r4030, %r1526; 2026-02-21T08:30:48.5408339Z $L__tmp150: 2026-02-21T08:30:48.5408508Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5408572Z add.s64 %rd360, %rd1089, %rd9; 2026-02-21T08:30:48.5408642Z add.s64 %rd361, %rd1090, %rd9; 
2026-02-21T08:30:48.5408704Z add.s64 %rd362, %rd1091, %rd9; 2026-02-21T08:30:48.5408775Z mad.wide.s32 %rd363, %r4025, 2, %rd92; 2026-02-21T08:30:48.5408950Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5409010Z add.s32 %r1513, %r134, %r4047; 2026-02-21T08:30:48.5409073Z selp.b32 %r1514, 16, 0, %p105; 2026-02-21T08:30:48.5409131Z // begin inline asm 2026-02-21T08:30:48.5409259Z cp.async.cg.shared.global [ %r1513 + 0 ], [ %rd360 + 0 ], 0x10, %r1514; 2026-02-21T08:30:48.5409320Z // end inline asm 2026-02-21T08:30:48.5409383Z add.s32 %r1515, %r1513, 2048; 2026-02-21T08:30:48.5409450Z // begin inline asm 2026-02-21T08:30:48.5409568Z cp.async.cg.shared.global [ %r1515 + 0 ], [ %rd361 + 0 ], 0x10, %r1514; 2026-02-21T08:30:48.5409624Z // end inline asm 2026-02-21T08:30:48.5409687Z add.s32 %r1517, %r1513, 4096; 2026-02-21T08:30:48.5409755Z // begin inline asm 2026-02-21T08:30:48.5409868Z cp.async.cg.shared.global [ %r1517 + 0 ], [ %rd362 + 0 ], 0x10, %r1514; 2026-02-21T08:30:48.5409926Z // end inline asm 2026-02-21T08:30:48.5409996Z add.s32 %r1519, %r1513, 6144; 2026-02-21T08:30:48.5410054Z // begin inline asm 2026-02-21T08:30:48.5410166Z cp.async.cg.shared.global [ %r1519 + 0 ], [ %rd363 + 0 ], 0x10, %r1514; 2026-02-21T08:30:48.5410233Z // end inline asm 2026-02-21T08:30:48.5410297Z cp.async.commit_group; 2026-02-21T08:30:48.5410466Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5410530Z add.s64 %rd366, %rd9, %rd1092; 2026-02-21T08:30:48.5410605Z add.s64 %rd367, %rd366, 786432; 2026-02-21T08:30:48.5410773Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5410838Z add.s64 %rd368, %rd366, 917504; 2026-02-21T08:30:48.5410910Z cvt.s64.s32 %rd369, %rd367; 2026-02-21T08:30:48.5410973Z add.s64 %rd364, %rd93, %rd369; 2026-02-21T08:30:48.5411035Z cvt.s64.s32 %rd370, %rd368; 2026-02-21T08:30:48.5411095Z add.s64 %rd365, %rd93, 
%rd370; 2026-02-21T08:30:48.5411271Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5411331Z add.s32 %r1521, %r135, %r111; 2026-02-21T08:30:48.5411390Z // begin inline asm 2026-02-21T08:30:48.5411517Z cp.async.cg.shared.global [ %r1521 + 0 ], [ %rd364 + 0 ], 0x10, %r1514; 2026-02-21T08:30:48.5411616Z // end inline asm 2026-02-21T08:30:48.5411679Z add.s32 %r1523, %r1521, 2048; 2026-02-21T08:30:48.5411748Z // begin inline asm 2026-02-21T08:30:48.5411929Z cp.async.cg.shared.global [ %r1523 + 0 ], [ %rd365 + 0 ], 0x10, %r1514; 2026-02-21T08:30:48.5411987Z // end inline asm 2026-02-21T08:30:48.5412052Z cp.async.commit_group; 2026-02-21T08:30:48.5412235Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5412303Z add.s64 %rd1092, %rd1092, 262144; 2026-02-21T08:30:48.5412365Z add.s64 %rd1091, %rd1091, 128; 2026-02-21T08:30:48.5412434Z add.s64 %rd1090, %rd1090, 128; 2026-02-21T08:30:48.5412494Z add.s64 %rd1089, %rd1089, 128; 2026-02-21T08:30:48.5412554Z add.s32 %r4025, %r4025, 64; 2026-02-21T08:30:48.5412630Z setp.lt.u64 %p107, %rd1093, 448; 2026-02-21T08:30:48.5412691Z mov.b32 %r4026, %r4030; 2026-02-21T08:30:48.5412751Z mov.b32 %r4030, %r138; 2026-02-21T08:30:48.5412814Z @%p107 bra $L__BB0_12; 2026-02-21T08:30:48.5412883Z bra.uni $L__BB0_15; 2026-02-21T08:30:48.5413168Z $L__BB0_12: // Parent Loop BB0_3 Depth=1 2026-02-21T08:30:48.5413276Z // => This Inner Loop Header: Depth=2 2026-02-21T08:30:48.5413348Z add.s32 %r1394, %r4028, 1; 2026-02-21T08:30:48.5413412Z setp.gt.s32 %p87, %r1394, 1; 2026-02-21T08:30:48.5413480Z selp.b32 %r4028, 0, %r1394, %p87; 2026-02-21T08:30:48.5413647Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5413722Z cp.async.wait_group 2; 2026-02-21T08:30:48.5413780Z bar.sync 0; 2026-02-21T08:30:48.5413840Z shl.b32 %r1395, %r4028, 12; 2026-02-21T08:30:48.5413909Z shl.b32 %r1396, %r4028, 13; 
2026-02-21T08:30:48.5413968Z add.s32 %r1398, %r294, %r1396; 2026-02-21T08:30:48.5414029Z add.s32 %r134, %r1398, 32768; 2026-02-21T08:30:48.5414099Z add.s32 %r1399, %r134, %r52; 2026-02-21T08:30:48.5414204Z ld.shared.v4.b32 {%r1400, %r1401, %r1402, %r1403}, [%r1399]; 2026-02-21T08:30:48.5414269Z mov.b32 {%rs577, %rs578}, %r1403; 2026-02-21T08:30:48.5414336Z mov.b32 {%rs579, %rs580}, %r1402; 2026-02-21T08:30:48.5414413Z mov.b32 {%rs581, %rs582}, %r1401; 2026-02-21T08:30:48.5414474Z mov.b32 {%rs583, %rs584}, %r1400; 2026-02-21T08:30:48.5414583Z ld.shared.v4.b32 {%r1404, %r1405, %r1406, %r1407}, [%r1399+16]; 2026-02-21T08:30:48.5414655Z mov.b32 {%rs585, %rs586}, %r1407; 2026-02-21T08:30:48.5414717Z mov.b32 {%rs587, %rs588}, %r1406; 2026-02-21T08:30:48.5414778Z mov.b32 {%rs589, %rs590}, %r1405; 2026-02-21T08:30:48.5414837Z mov.b32 {%rs591, %rs592}, %r1404; 2026-02-21T08:30:48.5414943Z ld.shared.v4.b32 {%r1408, %r1409, %r1410, %r1411}, [%r1399+32]; 2026-02-21T08:30:48.5415004Z mov.b32 {%rs593, %rs594}, %r1411; 2026-02-21T08:30:48.5415062Z mov.b32 {%rs595, %rs596}, %r1410; 2026-02-21T08:30:48.5415130Z mov.b32 {%rs597, %rs598}, %r1409; 2026-02-21T08:30:48.5415190Z mov.b32 {%rs599, %rs600}, %r1408; 2026-02-21T08:30:48.5415286Z ld.shared.v4.b32 {%r1412, %r1413, %r1414, %r1415}, [%r1399+48]; 2026-02-21T08:30:48.5415357Z mov.b32 {%rs601, %rs602}, %r1415; 2026-02-21T08:30:48.5415420Z mov.b32 {%rs603, %rs604}, %r1414; 2026-02-21T08:30:48.5415483Z mov.b32 {%rs605, %rs606}, %r1413; 2026-02-21T08:30:48.5415544Z mov.b32 {%rs607, %rs608}, %r1412; 2026-02-21T08:30:48.5415725Z .loc 1 52 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:52:32 2026-02-21T08:30:48.5415792Z cvt.f32.bf16 %r1361, %rs583; 2026-02-21T08:30:48.5415856Z cvt.f32.bf16 %r1362, %rs584; 2026-02-21T08:30:48.5415924Z cvt.f32.bf16 %r1363, %rs581; 2026-02-21T08:30:48.5415986Z cvt.f32.bf16 %r1364, %rs582; 2026-02-21T08:30:48.5416046Z cvt.f32.bf16 %r1365, %rs579; 2026-02-21T08:30:48.5416104Z cvt.f32.bf16 
%r1366, %rs580; 2026-02-21T08:30:48.5416174Z cvt.f32.bf16 %r1367, %rs577; 2026-02-21T08:30:48.5416234Z cvt.f32.bf16 %r1368, %rs578; 2026-02-21T08:30:48.5416292Z cvt.f32.bf16 %r1369, %rs591; 2026-02-21T08:30:48.5416359Z cvt.f32.bf16 %r1370, %rs592; 2026-02-21T08:30:48.5416419Z cvt.f32.bf16 %r1371, %rs589; 2026-02-21T08:30:48.5416481Z cvt.f32.bf16 %r1372, %rs590; 2026-02-21T08:30:48.5416587Z cvt.f32.bf16 %r1373, %rs587; 2026-02-21T08:30:48.5416647Z cvt.f32.bf16 %r1374, %rs588; 2026-02-21T08:30:48.5416707Z cvt.f32.bf16 %r1375, %rs585; 2026-02-21T08:30:48.5416766Z cvt.f32.bf16 %r1376, %rs586; 2026-02-21T08:30:48.5416835Z cvt.f32.bf16 %r1378, %rs599; 2026-02-21T08:30:48.5416894Z cvt.f32.bf16 %r1379, %rs600; 2026-02-21T08:30:48.5416954Z cvt.f32.bf16 %r1380, %rs597; 2026-02-21T08:30:48.5417021Z cvt.f32.bf16 %r1381, %rs598; 2026-02-21T08:30:48.5417078Z cvt.f32.bf16 %r1382, %rs595; 2026-02-21T08:30:48.5417137Z cvt.f32.bf16 %r1383, %rs596; 2026-02-21T08:30:48.5417194Z cvt.f32.bf16 %r1384, %rs593; 2026-02-21T08:30:48.5417263Z cvt.f32.bf16 %r1385, %rs594; 2026-02-21T08:30:48.5417322Z cvt.f32.bf16 %r1386, %rs607; 2026-02-21T08:30:48.5417380Z cvt.f32.bf16 %r1387, %rs608; 2026-02-21T08:30:48.5417446Z cvt.f32.bf16 %r1388, %rs605; 2026-02-21T08:30:48.5417506Z cvt.f32.bf16 %r1389, %rs606; 2026-02-21T08:30:48.5417603Z cvt.f32.bf16 %r1390, %rs603; 2026-02-21T08:30:48.5417667Z cvt.f32.bf16 %r1391, %rs604; 2026-02-21T08:30:48.5417735Z cvt.f32.bf16 %r1392, %rs601; 2026-02-21T08:30:48.5417793Z cvt.f32.bf16 %r1393, %rs602; 2026-02-21T08:30:48.5417847Z $L__tmp151: 2026-02-21T08:30:48.5418081Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5418142Z // begin inline asm 2026-02-21T08:30:48.5418195Z 2026-02-21T08:30:48.5418257Z { 2026-02-21T08:30:48.5418320Z .reg .pred complete; 2026-02-21T08:30:48.5418376Z waitLoop: 2026-02-21T08:30:48.5418500Z mbarrier.try_wait.parity.shared.b64 complete, [%r4027], %r4026; 
2026-02-21T08:30:48.5418574Z @!complete bra.uni waitLoop; 2026-02-21T08:30:48.5418628Z } 2026-02-21T08:30:48.5418633Z 2026-02-21T08:30:48.5418689Z // end inline asm 2026-02-21T08:30:48.5418761Z mov.pred %p88, -1; 2026-02-21T08:30:48.5418822Z // begin inline asm 2026-02-21T08:30:48.5419148Z @%p88 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 0], 32, {%r1361, %r1362, %r1363, %r1364, %r1365, %r1366, %r1367, %r1368, %r1369, %r1370, %r1371, %r1372, %r1373, %r1374, %r1375, %r1376}; 2026-02-21T08:30:48.5419219Z // end inline asm 2026-02-21T08:30:48.5419276Z // begin inline asm 2026-02-21T08:30:48.5419584Z @%p88 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 16], 32, {%r1378, %r1379, %r1380, %r1381, %r1382, %r1383, %r1384, %r1385, %r1386, %r1387, %r1388, %r1389, %r1390, %r1391, %r1392, %r1393}; 2026-02-21T08:30:48.5419639Z // end inline asm 2026-02-21T08:30:48.5419704Z // begin inline asm 2026-02-21T08:30:48.5419776Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5419829Z // end inline asm 2026-02-21T08:30:48.5419894Z bar.sync 0; 2026-02-21T08:30:48.5419947Z $L__tmp152: 2026-02-21T08:30:48.5420120Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5420187Z add.s32 %r1416, %r294, %r1395; 2026-02-21T08:30:48.5420249Z add.s32 %r1417, %r1416, 49152; 2026-02-21T08:30:48.5420312Z add.s32 %r135, %r1417, %r4048; 2026-02-21T08:30:48.5420370Z add.s32 %r1418, %r1417, %r55; 2026-02-21T08:30:48.5420438Z add.s32 %r1419, %r1417, %r57; 2026-02-21T08:30:48.5420499Z add.s32 %r1420, %r1417, %r59; 2026-02-21T08:30:48.5420558Z add.s32 %r1421, %r1417, %r61; 2026-02-21T08:30:48.5420735Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5420798Z ld.shared.s8 %rs609, [%r135]; 2026-02-21T08:30:48.5420960Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5421023Z shl.b16 %rs610, %rs609, 4; 2026-02-21T08:30:48.5421194Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5421258Z ld.shared.s8 %rs611, [%r135+128]; 2026-02-21T08:30:48.5421421Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5421527Z shl.b16 %rs612, %rs611, 4; 2026-02-21T08:30:48.5421735Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5421797Z ld.shared.s8 %rs613, [%r135+256]; 2026-02-21T08:30:48.5421971Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5422032Z shl.b16 %rs614, %rs613, 4; 2026-02-21T08:30:48.5422195Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5422262Z ld.shared.s8 %rs615, [%r135+384]; 2026-02-21T08:30:48.5422422Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5422482Z shl.b16 %rs616, %rs615, 4; 2026-02-21T08:30:48.5422649Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5422763Z ld.shared.s8 %rs617, [%r135+512]; 2026-02-21T08:30:48.5422930Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5422990Z shl.b16 %rs618, %rs617, 4; 2026-02-21T08:30:48.5423159Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5423222Z ld.shared.s8 %rs619, [%r135+640]; 2026-02-21T08:30:48.5423387Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5423452Z shl.b16 %rs620, %rs619, 4; 2026-02-21T08:30:48.5423613Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5423675Z ld.shared.s8 %rs621, [%r135+768]; 2026-02-21T08:30:48.5423844Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5423904Z shl.b16 %rs622, %rs621, 4; 2026-02-21T08:30:48.5424070Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5424136Z ld.shared.s8 %rs623, [%r1418]; 2026-02-21T08:30:48.5424309Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5424369Z shl.b16 %rs624, %rs623, 4; 2026-02-21T08:30:48.5424532Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5424602Z ld.shared.s8 %rs625, [%r135+1024]; 2026-02-21T08:30:48.5424764Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5424823Z shl.b16 %rs626, %rs625, 4; 2026-02-21T08:30:48.5424990Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5425053Z ld.shared.s8 %rs627, [%r135+1152]; 2026-02-21T08:30:48.5425219Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5425285Z shl.b16 %rs628, %rs627, 4; 2026-02-21T08:30:48.5425452Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5425517Z ld.shared.s8 %rs629, [%r135+1280]; 2026-02-21T08:30:48.5425691Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5425749Z shl.b16 %rs630, %rs629, 4; 2026-02-21T08:30:48.5425919Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5425982Z ld.shared.s8 %rs631, [%r135+1408]; 2026-02-21T08:30:48.5426155Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5426214Z shl.b16 %rs632, %rs631, 4; 2026-02-21T08:30:48.5426378Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5426450Z ld.shared.s8 %rs633, [%r135+1536]; 2026-02-21T08:30:48.5426660Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5426718Z shl.b16 %rs634, %rs633, 4; 2026-02-21T08:30:48.5426884Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5426947Z ld.shared.s8 %rs635, [%r135+1664]; 2026-02-21T08:30:48.5427110Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5427174Z shl.b16 %rs636, %rs635, 4; 2026-02-21T08:30:48.5427338Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5427396Z ld.shared.s8 %rs637, [%r135+1792]; 2026-02-21T08:30:48.5427557Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5427621Z shl.b16 %rs638, %rs637, 4; 2026-02-21T08:30:48.5427827Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5427893Z ld.shared.s8 %rs639, [%r1419]; 2026-02-21T08:30:48.5428063Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5428123Z shl.b16 %rs640, %rs639, 4; 2026-02-21T08:30:48.5428284Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5428354Z ld.shared.s8 %rs641, [%r135+2048]; 2026-02-21T08:30:48.5428513Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5428572Z shl.b16 %rs642, %rs641, 4; 2026-02-21T08:30:48.5428737Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5428797Z ld.shared.s8 %rs643, [%r135+2176]; 2026-02-21T08:30:48.5428963Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5429026Z shl.b16 %rs644, %rs643, 4; 2026-02-21T08:30:48.5429190Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5429257Z ld.shared.s8 %rs645, [%r135+2304]; 2026-02-21T08:30:48.5429420Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5429478Z shl.b16 %rs646, %rs645, 4; 2026-02-21T08:30:48.5429640Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5429704Z ld.shared.s8 %rs647, [%r135+2432]; 2026-02-21T08:30:48.5429865Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5429920Z shl.b16 %rs648, %rs647, 4; 2026-02-21T08:30:48.5430085Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5430148Z ld.shared.s8 %rs649, [%r135+2560]; 2026-02-21T08:30:48.5430307Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5430368Z shl.b16 %rs650, %rs649, 4; 2026-02-21T08:30:48.5430528Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5430588Z ld.shared.s8 %rs651, [%r135+2688]; 2026-02-21T08:30:48.5430757Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5430812Z shl.b16 %rs652, %rs651, 4; 2026-02-21T08:30:48.5430976Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5431034Z ld.shared.s8 %rs653, [%r135+2816]; 2026-02-21T08:30:48.5431199Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5431256Z shl.b16 %rs654, %rs653, 4; 2026-02-21T08:30:48.5431485Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5431589Z ld.shared.s8 %rs655, [%r1420]; 2026-02-21T08:30:48.5431755Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5431810Z shl.b16 %rs656, %rs655, 4; 2026-02-21T08:30:48.5431977Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5432035Z ld.shared.s8 %rs657, [%r135+3072]; 2026-02-21T08:30:48.5432196Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5432257Z shl.b16 %rs658, %rs657, 4; 2026-02-21T08:30:48.5432423Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5432483Z ld.shared.s8 %rs659, [%r135+3200]; 2026-02-21T08:30:48.5432690Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5432754Z shl.b16 %rs660, %rs659, 4; 2026-02-21T08:30:48.5432911Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5432971Z ld.shared.s8 %rs661, [%r135+3328]; 2026-02-21T08:30:48.5433135Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5433193Z shl.b16 %rs662, %rs661, 4; 2026-02-21T08:30:48.5433374Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5433445Z ld.shared.s8 %rs663, [%r135+3456]; 2026-02-21T08:30:48.5433609Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5433670Z shl.b16 %rs664, %rs663, 4; 2026-02-21T08:30:48.5433844Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5433908Z ld.shared.s8 %rs665, [%r135+3584]; 2026-02-21T08:30:48.5434074Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5434134Z shl.b16 %rs666, %rs665, 4; 2026-02-21T08:30:48.5434304Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5434366Z ld.shared.s8 %rs667, [%r135+3712]; 2026-02-21T08:30:48.5434537Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5434606Z shl.b16 %rs668, %rs667, 4; 2026-02-21T08:30:48.5434774Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5434836Z ld.shared.s8 %rs669, [%r135+3840]; 2026-02-21T08:30:48.5435014Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5435075Z shl.b16 %rs670, %rs669, 4; 2026-02-21T08:30:48.5435245Z .loc 1 59 
25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5435314Z ld.shared.s8 %rs671, [%r1421]; 2026-02-21T08:30:48.5435483Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5435542Z shl.b16 %rs672, %rs671, 4; 2026-02-21T08:30:48.5435711Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5435778Z cvt.s16.s8 %rs673, %rs610; 2026-02-21T08:30:48.5435836Z shr.s16 %rs674, %rs673, 4; 2026-02-21T08:30:48.5435896Z cvt.s16.s8 %rs675, %rs612; 2026-02-21T08:30:48.5435960Z shr.s16 %rs676, %rs675, 4; 2026-02-21T08:30:48.5436016Z shr.s16 %rs677, %rs609, 4; 2026-02-21T08:30:48.5436074Z shr.s16 %rs678, %rs611, 4; 2026-02-21T08:30:48.5436246Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5436359Z cvt.rn.f32.s16 %r1422, %rs678; 2026-02-21T08:30:48.5436421Z cvt.rn.f32.s16 %r1423, %rs677; 2026-02-21T08:30:48.5436482Z cvt.rn.f32.s16 %r1424, %rs676; 2026-02-21T08:30:48.5436549Z cvt.rn.f32.s16 %r1425, %rs674; 2026-02-21T08:30:48.5436720Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5436781Z cvt.s16.s8 %rs679, %rs614; 2026-02-21T08:30:48.5436846Z shr.s16 %rs680, %rs679, 4; 2026-02-21T08:30:48.5436904Z cvt.s16.s8 %rs681, %rs616; 2026-02-21T08:30:48.5436962Z shr.s16 %rs682, %rs681, 4; 2026-02-21T08:30:48.5437026Z shr.s16 %rs683, %rs613, 4; 2026-02-21T08:30:48.5437083Z shr.s16 %rs684, %rs615, 4; 2026-02-21T08:30:48.5437256Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5437328Z cvt.rn.f32.s16 %r1426, %rs684; 2026-02-21T08:30:48.5437391Z cvt.rn.f32.s16 %r1427, %rs683; 2026-02-21T08:30:48.5437499Z cvt.rn.f32.s16 %r1428, %rs682; 2026-02-21T08:30:48.5437563Z cvt.rn.f32.s16 %r1429, %rs680; 2026-02-21T08:30:48.5437735Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5437796Z cvt.s16.s8 %rs685, %rs618; 2026-02-21T08:30:48.5437864Z shr.s16 %rs686, %rs685, 4; 2026-02-21T08:30:48.5437923Z cvt.s16.s8 %rs687, %rs620; 2026-02-21T08:30:48.5437983Z shr.s16 %rs688, %rs687, 4; 2026-02-21T08:30:48.5438050Z shr.s16 %rs689, %rs617, 4; 2026-02-21T08:30:48.5438110Z shr.s16 %rs690, %rs619, 4; 2026-02-21T08:30:48.5438284Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5438354Z cvt.rn.f32.s16 %r1430, %rs690; 2026-02-21T08:30:48.5438415Z cvt.rn.f32.s16 %r1431, %rs689; 2026-02-21T08:30:48.5438477Z cvt.rn.f32.s16 %r1432, %rs688; 2026-02-21T08:30:48.5438540Z cvt.rn.f32.s16 %r1433, %rs686; 2026-02-21T08:30:48.5438725Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5438789Z cvt.s16.s8 %rs691, %rs622; 2026-02-21T08:30:48.5438849Z shr.s16 %rs692, %rs691, 4; 2026-02-21T08:30:48.5438917Z cvt.s16.s8 %rs693, %rs624; 2026-02-21T08:30:48.5438976Z shr.s16 %rs694, %rs693, 4; 2026-02-21T08:30:48.5439036Z shr.s16 %rs695, %rs621, 4; 2026-02-21T08:30:48.5439096Z shr.s16 %rs696, %rs623, 4; 2026-02-21T08:30:48.5439277Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5439339Z cvt.rn.f32.s16 %r1434, %rs696; 2026-02-21T08:30:48.5439400Z cvt.rn.f32.s16 %r1435, %rs695; 2026-02-21T08:30:48.5439470Z cvt.rn.f32.s16 %r1436, %rs694; 2026-02-21T08:30:48.5439532Z cvt.rn.f32.s16 %r1437, %rs692; 2026-02-21T08:30:48.5439703Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5439772Z cvt.s16.s8 %rs697, %rs626; 2026-02-21T08:30:48.5439832Z shr.s16 %rs698, %rs697, 4; 2026-02-21T08:30:48.5439895Z cvt.s16.s8 %rs699, %rs628; 2026-02-21T08:30:48.5439955Z shr.s16 %rs700, %rs699, 4; 2026-02-21T08:30:48.5440021Z shr.s16 %rs701, %rs625, 4; 2026-02-21T08:30:48.5440082Z shr.s16 %rs702, %rs627, 4; 2026-02-21T08:30:48.5440255Z .loc 1 77 32 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5440325Z cvt.rn.f32.s16 %r1438, %rs702; 2026-02-21T08:30:48.5440387Z cvt.rn.f32.s16 %r1439, %rs701; 2026-02-21T08:30:48.5440447Z cvt.rn.f32.s16 %r1440, %rs700; 2026-02-21T08:30:48.5440515Z cvt.rn.f32.s16 %r1441, %rs698; 2026-02-21T08:30:48.5440686Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5440749Z cvt.s16.s8 %rs703, %rs630; 2026-02-21T08:30:48.5440809Z shr.s16 %rs704, %rs703, 4; 2026-02-21T08:30:48.5440876Z cvt.s16.s8 %rs705, %rs632; 2026-02-21T08:30:48.5440936Z shr.s16 %rs706, %rs705, 4; 2026-02-21T08:30:48.5440997Z shr.s16 %rs707, %rs629, 4; 2026-02-21T08:30:48.5441104Z shr.s16 %rs708, %rs631, 4; 2026-02-21T08:30:48.5441281Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5441341Z cvt.rn.f32.s16 %r1442, %rs708; 2026-02-21T08:30:48.5441400Z cvt.rn.f32.s16 %r1443, %rs707; 2026-02-21T08:30:48.5441464Z cvt.rn.f32.s16 %r1444, %rs706; 2026-02-21T08:30:48.5441520Z cvt.rn.f32.s16 %r1445, %rs704; 2026-02-21T08:30:48.5441718Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5441784Z cvt.s16.s8 %rs709, %rs634; 2026-02-21T08:30:48.5441841Z shr.s16 %rs710, %rs709, 4; 2026-02-21T08:30:48.5441898Z cvt.s16.s8 %rs711, %rs636; 2026-02-21T08:30:48.5441961Z shr.s16 %rs712, %rs711, 4; 2026-02-21T08:30:48.5442017Z shr.s16 %rs713, %rs633, 4; 2026-02-21T08:30:48.5442073Z shr.s16 %rs714, %rs635, 4; 2026-02-21T08:30:48.5442282Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5442353Z cvt.rn.f32.s16 %r1446, %rs714; 2026-02-21T08:30:48.5442413Z cvt.rn.f32.s16 %r1447, %rs713; 2026-02-21T08:30:48.5442473Z cvt.rn.f32.s16 %r1448, %rs712; 2026-02-21T08:30:48.5442540Z cvt.rn.f32.s16 %r1449, %rs710; 2026-02-21T08:30:48.5442709Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5442766Z cvt.s16.s8 %rs715, %rs638; 2026-02-21T08:30:48.5442830Z shr.s16 %rs716, %rs715, 4; 2026-02-21T08:30:48.5442888Z cvt.s16.s8 %rs717, %rs640; 2026-02-21T08:30:48.5442946Z shr.s16 %rs718, %rs717, 4; 2026-02-21T08:30:48.5443003Z shr.s16 %rs719, %rs637, 4; 2026-02-21T08:30:48.5443069Z shr.s16 %rs720, %rs639, 4; 2026-02-21T08:30:48.5443238Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5443298Z cvt.rn.f32.s16 %r1450, %rs720; 2026-02-21T08:30:48.5443365Z cvt.rn.f32.s16 %r1451, %rs719; 2026-02-21T08:30:48.5443427Z cvt.rn.f32.s16 %r1452, %rs718; 2026-02-21T08:30:48.5443485Z cvt.rn.f32.s16 %r1453, %rs716; 2026-02-21T08:30:48.5443650Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5443716Z cvt.s16.s8 %rs721, %rs642; 2026-02-21T08:30:48.5443773Z shr.s16 %rs722, %rs721, 4; 2026-02-21T08:30:48.5443832Z cvt.s16.s8 %rs723, %rs644; 2026-02-21T08:30:48.5443898Z shr.s16 %rs724, %rs723, 4; 2026-02-21T08:30:48.5443955Z shr.s16 %rs725, %rs641, 4; 2026-02-21T08:30:48.5444013Z shr.s16 %rs726, %rs643, 4; 2026-02-21T08:30:48.5444187Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5444248Z cvt.rn.f32.s16 %r1454, %rs726; 2026-02-21T08:30:48.5444307Z cvt.rn.f32.s16 %r1455, %rs725; 2026-02-21T08:30:48.5444367Z cvt.rn.f32.s16 %r1456, %rs724; 2026-02-21T08:30:48.5444434Z cvt.rn.f32.s16 %r1457, %rs722; 2026-02-21T08:30:48.5444600Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5444661Z cvt.s16.s8 %rs727, %rs646; 2026-02-21T08:30:48.5444729Z shr.s16 %rs728, %rs727, 4; 2026-02-21T08:30:48.5444788Z cvt.s16.s8 %rs729, %rs648; 2026-02-21T08:30:48.5444847Z shr.s16 %rs730, %rs729, 4; 2026-02-21T08:30:48.5444906Z shr.s16 %rs731, %rs645, 4; 2026-02-21T08:30:48.5444973Z shr.s16 %rs732, %rs647, 4; 2026-02-21T08:30:48.5445141Z .loc 1 77 32 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5445203Z cvt.rn.f32.s16 %r1458, %rs732; 2026-02-21T08:30:48.5445274Z cvt.rn.f32.s16 %r1459, %rs731; 2026-02-21T08:30:48.5445335Z cvt.rn.f32.s16 %r1460, %rs730; 2026-02-21T08:30:48.5445395Z cvt.rn.f32.s16 %r1461, %rs728; 2026-02-21T08:30:48.5445571Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5445629Z cvt.s16.s8 %rs733, %rs650; 2026-02-21T08:30:48.5445689Z shr.s16 %rs734, %rs733, 4; 2026-02-21T08:30:48.5445796Z cvt.s16.s8 %rs735, %rs652; 2026-02-21T08:30:48.5445860Z shr.s16 %rs736, %rs735, 4; 2026-02-21T08:30:48.5445917Z shr.s16 %rs737, %rs649, 4; 2026-02-21T08:30:48.5445973Z shr.s16 %rs738, %rs651, 4; 2026-02-21T08:30:48.5446146Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5446205Z cvt.rn.f32.s16 %r1462, %rs738; 2026-02-21T08:30:48.5446263Z cvt.rn.f32.s16 %r1463, %rs737; 2026-02-21T08:30:48.5446328Z cvt.rn.f32.s16 %r1464, %rs736; 2026-02-21T08:30:48.5446387Z cvt.rn.f32.s16 %r1465, %rs734; 2026-02-21T08:30:48.5446552Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5446611Z cvt.s16.s8 %rs739, %rs654; 2026-02-21T08:30:48.5446675Z shr.s16 %rs740, %rs739, 4; 2026-02-21T08:30:48.5446735Z cvt.s16.s8 %rs741, %rs656; 2026-02-21T08:30:48.5446853Z shr.s16 %rs742, %rs741, 4; 2026-02-21T08:30:48.5446923Z shr.s16 %rs743, %rs653, 4; 2026-02-21T08:30:48.5446979Z shr.s16 %rs744, %rs655, 4; 2026-02-21T08:30:48.5447146Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5447205Z cvt.rn.f32.s16 %r1466, %rs744; 2026-02-21T08:30:48.5447270Z cvt.rn.f32.s16 %r1467, %rs743; 2026-02-21T08:30:48.5447328Z cvt.rn.f32.s16 %r1468, %rs742; 2026-02-21T08:30:48.5447387Z cvt.rn.f32.s16 %r1469, %rs740; 2026-02-21T08:30:48.5447560Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5447618Z cvt.s16.s8 %rs745, %rs658; 2026-02-21T08:30:48.5447675Z shr.s16 %rs746, %rs745, 4; 2026-02-21T08:30:48.5447739Z cvt.s16.s8 %rs747, %rs660; 2026-02-21T08:30:48.5447796Z shr.s16 %rs748, %rs747, 4; 2026-02-21T08:30:48.5447853Z shr.s16 %rs749, %rs657, 4; 2026-02-21T08:30:48.5447910Z shr.s16 %rs750, %rs659, 4; 2026-02-21T08:30:48.5448081Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5448143Z cvt.rn.f32.s16 %r1470, %rs750; 2026-02-21T08:30:48.5448202Z cvt.rn.f32.s16 %r1471, %rs749; 2026-02-21T08:30:48.5448269Z cvt.rn.f32.s16 %r1472, %rs748; 2026-02-21T08:30:48.5448328Z cvt.rn.f32.s16 %r1473, %rs746; 2026-02-21T08:30:48.5448490Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5448555Z cvt.s16.s8 %rs751, %rs662; 2026-02-21T08:30:48.5448611Z shr.s16 %rs752, %rs751, 4; 2026-02-21T08:30:48.5448668Z cvt.s16.s8 %rs753, %rs664; 2026-02-21T08:30:48.5448725Z shr.s16 %rs754, %rs753, 4; 2026-02-21T08:30:48.5448789Z shr.s16 %rs755, %rs661, 4; 2026-02-21T08:30:48.5448846Z shr.s16 %rs756, %rs663, 4; 2026-02-21T08:30:48.5449012Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5449079Z cvt.rn.f32.s16 %r1474, %rs756; 2026-02-21T08:30:48.5449139Z cvt.rn.f32.s16 %r1475, %rs755; 2026-02-21T08:30:48.5449200Z cvt.rn.f32.s16 %r1476, %rs754; 2026-02-21T08:30:48.5449258Z cvt.rn.f32.s16 %r1477, %rs752; 2026-02-21T08:30:48.5449429Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5449487Z cvt.s16.s8 %rs757, %rs666; 2026-02-21T08:30:48.5449542Z shr.s16 %rs758, %rs757, 4; 2026-02-21T08:30:48.5449606Z cvt.s16.s8 %rs759, %rs668; 2026-02-21T08:30:48.5449662Z shr.s16 %rs760, %rs759, 4; 2026-02-21T08:30:48.5449718Z shr.s16 %rs761, %rs665, 4; 2026-02-21T08:30:48.5449780Z shr.s16 %rs762, %rs667, 4; 2026-02-21T08:30:48.5449944Z .loc 1 77 32 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5450003Z cvt.rn.f32.s16 %r1478, %rs762; 2026-02-21T08:30:48.5450061Z cvt.rn.f32.s16 %r1479, %rs761; 2026-02-21T08:30:48.5450126Z cvt.rn.f32.s16 %r1480, %rs760; 2026-02-21T08:30:48.5450186Z cvt.rn.f32.s16 %r1481, %rs758; 2026-02-21T08:30:48.5450387Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5450452Z cvt.s16.s8 %rs763, %rs670; 2026-02-21T08:30:48.5450510Z shr.s16 %rs764, %rs763, 4; 2026-02-21T08:30:48.5450568Z cvt.s16.s8 %rs765, %rs672; 2026-02-21T08:30:48.5450625Z shr.s16 %rs766, %rs765, 4; 2026-02-21T08:30:48.5450690Z shr.s16 %rs767, %rs669, 4; 2026-02-21T08:30:48.5450748Z shr.s16 %rs768, %rs671, 4; 2026-02-21T08:30:48.5450912Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5450979Z cvt.rn.f32.s16 %r1482, %rs768; 2026-02-21T08:30:48.5451038Z cvt.rn.f32.s16 %r1483, %rs767; 2026-02-21T08:30:48.5451098Z cvt.rn.f32.s16 %r1484, %rs766; 2026-02-21T08:30:48.5451164Z cvt.rn.f32.s16 %r1485, %rs764; 2026-02-21T08:30:48.5451265Z st.shared.v4.b32 [%r63], {%r1425, %r1423, %r1424, %r1422}; 2026-02-21T08:30:48.5451413Z st.shared.v4.b32 [%r63+16384], {%r1457, %r1455, %r1456, %r1454}; 2026-02-21T08:30:48.5451515Z st.shared.v4.b32 [%r64], {%r1429, %r1427, %r1428, %r1426}; 2026-02-21T08:30:48.5451656Z st.shared.v4.b32 [%r64+16384], {%r1461, %r1459, %r1460, %r1458}; 2026-02-21T08:30:48.5451751Z st.shared.v4.b32 [%r65], {%r1433, %r1431, %r1432, %r1430}; 2026-02-21T08:30:48.5451849Z st.shared.v4.b32 [%r65+16384], {%r1465, %r1463, %r1464, %r1462}; 2026-02-21T08:30:48.5451947Z st.shared.v4.b32 [%r66], {%r1437, %r1435, %r1436, %r1434}; 2026-02-21T08:30:48.5452046Z st.shared.v4.b32 [%r66+16384], {%r1469, %r1467, %r1468, %r1466}; 2026-02-21T08:30:48.5452136Z st.shared.v4.b32 [%r67], {%r1441, %r1439, %r1440, %r1438}; 2026-02-21T08:30:48.5452242Z st.shared.v4.b32 [%r67+16384], {%r1473, %r1471, %r1472, 
%r1470}; 2026-02-21T08:30:48.5452331Z st.shared.v4.b32 [%r68], {%r1445, %r1443, %r1444, %r1442}; 2026-02-21T08:30:48.5452428Z st.shared.v4.b32 [%r68+16384], {%r1477, %r1475, %r1476, %r1474}; 2026-02-21T08:30:48.5452525Z st.shared.v4.b32 [%r69], {%r1449, %r1447, %r1448, %r1446}; 2026-02-21T08:30:48.5452623Z st.shared.v4.b32 [%r69+16384], {%r1481, %r1479, %r1480, %r1478}; 2026-02-21T08:30:48.5452712Z st.shared.v4.b32 [%r70], {%r1453, %r1451, %r1452, %r1450}; 2026-02-21T08:30:48.5452806Z st.shared.v4.b32 [%r70+16384], {%r1485, %r1483, %r1484, %r1482}; 2026-02-21T08:30:48.5452980Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5453040Z shl.b32 %r1486, %r4029, 3; 2026-02-21T08:30:48.5453099Z add.s32 %r1487, %r294, %r1486; 2026-02-21T08:30:48.5453170Z add.s32 %r4027, %r1487, 57344; 2026-02-21T08:30:48.5453224Z $L__tmp153: 2026-02-21T08:30:48.5453446Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5453513Z // begin inline asm 2026-02-21T08:30:48.5453588Z fence.proxy.async.shared::cta; 2026-02-21T08:30:48.5453646Z // end inline asm 2026-02-21T08:30:48.5453704Z bar.sync 0; 2026-02-21T08:30:48.5453778Z @%p14 bra $L__BB0_14; 2026-02-21T08:30:48.5453881Z // %bb.13: // in Loop: Header=BB0_12 Depth=2 2026-02-21T08:30:48.5453949Z elect.sync %r1512|%p89, -1; 2026-02-21T08:30:48.5454014Z mov.b32 %r1490, 69208336; 2026-02-21T08:30:48.5454072Z // begin inline asm 2026-02-21T08:30:48.5454233Z @%p89 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 0 ], %rd737, %r1490, %p88; 2026-02-21T08:30:48.5454295Z // end inline asm 2026-02-21T08:30:48.5454352Z // begin inline asm 2026-02-21T08:30:48.5454500Z @%p89 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 8 ], %rd738, %r1490, %p88; 2026-02-21T08:30:48.5454555Z // end inline asm 2026-02-21T08:30:48.5454616Z // begin inline asm 2026-02-21T08:30:48.5454765Z @%p89 
tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 16 ], %rd739, %r1490, %p88; 2026-02-21T08:30:48.5454819Z // end inline asm 2026-02-21T08:30:48.5454883Z // begin inline asm 2026-02-21T08:30:48.5455031Z @%p89 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 24 ], %rd740, %r1490, %p88; 2026-02-21T08:30:48.5455141Z // end inline asm 2026-02-21T08:30:48.5455203Z // begin inline asm 2026-02-21T08:30:48.5455348Z @%p89 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 32 ], %rd741, %r1490, %p88; 2026-02-21T08:30:48.5455402Z // end inline asm 2026-02-21T08:30:48.5455459Z // begin inline asm 2026-02-21T08:30:48.5455610Z @%p89 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 40 ], %rd742, %r1490, %p88; 2026-02-21T08:30:48.5455665Z // end inline asm 2026-02-21T08:30:48.5455721Z // begin inline asm 2026-02-21T08:30:48.5455875Z @%p89 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 48 ], %rd743, %r1490, %p88; 2026-02-21T08:30:48.5455929Z // end inline asm 2026-02-21T08:30:48.5455985Z // begin inline asm 2026-02-21T08:30:48.5456135Z @%p89 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 56 ], %rd744, %r1490, %p88; 2026-02-21T08:30:48.5456241Z // end inline asm 2026-02-21T08:30:48.5456307Z cvt.u64.u32 %rd359, %r4027; 2026-02-21T08:30:48.5456369Z // begin inline asm 2026-02-21T08:30:48.5456500Z @%p89 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd359]; 2026-02-21T08:30:48.5456555Z // end inline asm 2026-02-21T08:30:48.5456613Z bra.uni $L__BB0_14; 2026-02-21T08:30:48.5456722Z $L__BB0_15: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5456816Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:30:48.5456873Z mov.b32 %r4035, 1; 2026-02-21T08:30:48.5457100Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5457157Z // begin inline asm 2026-02-21T08:30:48.5457206Z 2026-02-21T08:30:48.5457263Z { 
2026-02-21T08:30:48.5457325Z .reg .pred complete; 2026-02-21T08:30:48.5457381Z waitLoop: 2026-02-21T08:30:48.5457502Z mbarrier.try_wait.parity.shared.b64 complete, [%r4027], %r4035; 2026-02-21T08:30:48.5457580Z @!complete bra.uni waitLoop; 2026-02-21T08:30:48.5457631Z } 2026-02-21T08:30:48.5457635Z 2026-02-21T08:30:48.5457690Z // end inline asm 2026-02-21T08:30:48.5457750Z $L__tmp154: 2026-02-21T08:30:48.5457919Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5457984Z cp.async.wait_group 0; 2026-02-21T08:30:48.5458036Z bar.sync 0; 2026-02-21T08:30:48.5458103Z add.s32 %r4033, %r294, 57344; 2026-02-21T08:30:48.5458159Z // begin inline asm 2026-02-21T08:30:48.5458247Z @%p10 mbarrier.inval.shared::cta.b64 [%r4033]; 2026-02-21T08:30:48.5458308Z // end inline asm 2026-02-21T08:30:48.5458361Z bar.sync 0; 2026-02-21T08:30:48.5458416Z // begin inline asm 2026-02-21T08:30:48.5458506Z @%p10 mbarrier.inval.shared::cta.b64 [%r2245]; 2026-02-21T08:30:48.5458560Z // end inline asm 2026-02-21T08:30:48.5458728Z .loc 1 88 43 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:43 2026-02-21T08:30:48.5458794Z shl.b32 %r1800, %r118, 13; 2026-02-21T08:30:48.5458860Z shl.b32 %r1801, %r119, 13; 2026-02-21T08:30:48.5458917Z shl.b32 %r1802, %r120, 13; 2026-02-21T08:30:48.5458972Z shl.b32 %r1803, %r121, 13; 2026-02-21T08:30:48.5459034Z shl.b32 %r1804, %r122, 13; 2026-02-21T08:30:48.5459089Z shl.b32 %r1805, %r123, 13; 2026-02-21T08:30:48.5459144Z shl.b32 %r1806, %r124, 13; 2026-02-21T08:30:48.5459199Z shl.b32 %r1807, %r125, 13; 2026-02-21T08:30:48.5459372Z .loc 1 88 50 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:50 2026-02-21T08:30:48.5459433Z add.s32 %r1808, %r1800, %r116; 2026-02-21T08:30:48.5459492Z add.s32 %r1809, %r1801, %r116; 2026-02-21T08:30:48.5459559Z add.s32 %r1810, %r1802, %r116; 2026-02-21T08:30:48.5459616Z add.s32 %r1811, %r1803, %r116; 2026-02-21T08:30:48.5459674Z add.s32 %r1812, %r1804, %r116; 
2026-02-21T08:30:48.5459737Z add.s32 %r1813, %r1805, %r116; 2026-02-21T08:30:48.5459795Z add.s32 %r1814, %r1806, %r116; 2026-02-21T08:30:48.5459895Z add.s32 %r1815, %r1807, %r116; 2026-02-21T08:30:48.5460061Z .loc 1 88 22 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:22 2026-02-21T08:30:48.5460137Z mad.wide.s32 %rd371, %r1808, 2, %rd94; 2026-02-21T08:30:48.5460205Z mad.wide.s32 %rd372, %r1809, 2, %rd94; 2026-02-21T08:30:48.5460269Z mad.wide.s32 %rd373, %r1810, 2, %rd94; 2026-02-21T08:30:48.5460340Z mad.wide.s32 %rd374, %r1811, 2, %rd94; 2026-02-21T08:30:48.5460402Z mad.wide.s32 %rd375, %r1812, 2, %rd94; 2026-02-21T08:30:48.5460466Z mad.wide.s32 %rd376, %r1813, 2, %rd94; 2026-02-21T08:30:48.5460528Z mad.wide.s32 %rd377, %r1814, 2, %rd94; 2026-02-21T08:30:48.5460597Z mad.wide.s32 %rd378, %r1815, 2, %rd94; 2026-02-21T08:30:48.5460649Z $L__tmp155: 2026-02-21T08:30:48.5460865Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5460931Z // begin inline asm 2026-02-21T08:30:48.5461278Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1531, %r1532, %r1533, %r1534, %r1535, %r1536, %r1537, %r1538, %r1539, %r1540, %r1541, %r1542, %r1543, %r1544, %r1545, %r1546}, [%r3028 + 0], 64; 2026-02-21T08:30:48.5461337Z // end inline asm 2026-02-21T08:30:48.5461402Z // begin inline asm 2026-02-21T08:30:48.5461740Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1548, %r1549, %r1550, %r1551, %r1552, %r1553, %r1554, %r1555, %r1556, %r1557, %r1558, %r1559, %r1560, %r1561, %r1562, %r1563}, [%r3028 + 16], 64; 2026-02-21T08:30:48.5461798Z // end inline asm 2026-02-21T08:30:48.5461864Z // begin inline asm 2026-02-21T08:30:48.5462159Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1565, %r1566, %r1567, %r1568, %r1569, %r1570, %r1571, %r1572, %r1573, %r1574, %r1575, %r1576, %r1577, %r1578, %r1579, %r1580}, [%r3028 + 32], 64; 2026-02-21T08:30:48.5462215Z // end inline asm 2026-02-21T08:30:48.5462273Z // begin inline asm 
2026-02-21T08:30:48.5462579Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1582, %r1583, %r1584, %r1585, %r1586, %r1587, %r1588, %r1589, %r1590, %r1591, %r1592, %r1593, %r1594, %r1595, %r1596, %r1597}, [%r3028 + 48], 64; 2026-02-21T08:30:48.5462638Z // end inline asm 2026-02-21T08:30:48.5462694Z // begin inline asm 2026-02-21T08:30:48.5462770Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:48.5462825Z // end inline asm 2026-02-21T08:30:48.5462887Z cvt.u64.u32 %rd391, %r1531; 2026-02-21T08:30:48.5462952Z cvt.u64.u32 %rd392, %r1532; 2026-02-21T08:30:48.5463010Z shl.b64 %rd393, %rd392, 32; 2026-02-21T08:30:48.5463069Z or.b64 %rd394, %rd391, %rd393; 2026-02-21T08:30:48.5463123Z $L__tmp156: 2026-02-21T08:30:48.5463295Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5463358Z mov.b64 {%r1816, %r1817}, %rd394; 2026-02-21T08:30:48.5463429Z cvt.rn.bf16x2.f32 %r1818, %r1817, %r1816; 2026-02-21T08:30:48.5463490Z $L__tmp157: 2026-02-21T08:30:48.5463703Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5463766Z cvt.u64.u32 %rd395, %r1533; 2026-02-21T08:30:48.5463832Z cvt.u64.u32 %rd396, %r1534; 2026-02-21T08:30:48.5463890Z shl.b64 %rd397, %rd396, 32; 2026-02-21T08:30:48.5463949Z or.b64 %rd398, %rd395, %rd397; 2026-02-21T08:30:48.5464001Z $L__tmp158: 2026-02-21T08:30:48.5464172Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5464234Z mov.b64 {%r1819, %r1820}, %rd398; 2026-02-21T08:30:48.5464304Z cvt.rn.bf16x2.f32 %r1821, %r1820, %r1819; 2026-02-21T08:30:48.5464363Z $L__tmp159: 2026-02-21T08:30:48.5464577Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5464637Z cvt.u64.u32 %rd399, %r1535; 2026-02-21T08:30:48.5464702Z cvt.u64.u32 %rd400, %r1536; 2026-02-21T08:30:48.5464761Z shl.b64 %rd401, %rd400, 32; 2026-02-21T08:30:48.5464823Z 
or.b64 %rd402, %rd399, %rd401; 2026-02-21T08:30:48.5464936Z $L__tmp160: 2026-02-21T08:30:48.5465110Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5465171Z mov.b64 {%r1822, %r1823}, %rd402; 2026-02-21T08:30:48.5465239Z cvt.rn.bf16x2.f32 %r1824, %r1823, %r1822; 2026-02-21T08:30:48.5465300Z $L__tmp161: 2026-02-21T08:30:48.5465508Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5465565Z cvt.u64.u32 %rd403, %r1537; 2026-02-21T08:30:48.5465629Z cvt.u64.u32 %rd404, %r1538; 2026-02-21T08:30:48.5465687Z shl.b64 %rd405, %rd404, 32; 2026-02-21T08:30:48.5465746Z or.b64 %rd406, %rd403, %rd405; 2026-02-21T08:30:48.5465797Z $L__tmp162: 2026-02-21T08:30:48.5465966Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5466026Z mov.b64 {%r1825, %r1826}, %rd406; 2026-02-21T08:30:48.5466152Z cvt.rn.bf16x2.f32 %r1827, %r1826, %r1825; 2026-02-21T08:30:48.5466215Z $L__tmp163: 2026-02-21T08:30:48.5466422Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5466480Z cvt.u64.u32 %rd407, %r1539; 2026-02-21T08:30:48.5466543Z cvt.u64.u32 %rd408, %r1540; 2026-02-21T08:30:48.5466600Z shl.b64 %rd409, %rd408, 32; 2026-02-21T08:30:48.5466660Z or.b64 %rd410, %rd407, %rd409; 2026-02-21T08:30:48.5466710Z $L__tmp164: 2026-02-21T08:30:48.5466884Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5466944Z mov.b64 {%r1828, %r1829}, %rd410; 2026-02-21T08:30:48.5467011Z cvt.rn.bf16x2.f32 %r1830, %r1829, %r1828; 2026-02-21T08:30:48.5467066Z $L__tmp165: 2026-02-21T08:30:48.5467275Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5467335Z cvt.u64.u32 %rd411, %r1541; 2026-02-21T08:30:48.5467395Z cvt.u64.u32 %rd412, %r1542; 
2026-02-21T08:30:48.5467459Z shl.b64 %rd413, %rd412, 32; 2026-02-21T08:30:48.5467516Z or.b64 %rd414, %rd411, %rd413; 2026-02-21T08:30:48.5467567Z $L__tmp166: 2026-02-21T08:30:48.5467733Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5467792Z mov.b64 {%r1831, %r1832}, %rd414; 2026-02-21T08:30:48.5467858Z cvt.rn.bf16x2.f32 %r1833, %r1832, %r1831; 2026-02-21T08:30:48.5467916Z $L__tmp167: 2026-02-21T08:30:48.5468126Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5468184Z cvt.u64.u32 %rd415, %r1543; 2026-02-21T08:30:48.5468241Z cvt.u64.u32 %rd416, %r1544; 2026-02-21T08:30:48.5468306Z shl.b64 %rd417, %rd416, 32; 2026-02-21T08:30:48.5468364Z or.b64 %rd418, %rd415, %rd417; 2026-02-21T08:30:48.5468415Z $L__tmp168: 2026-02-21T08:30:48.5468593Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5468656Z mov.b64 {%r1834, %r1835}, %rd418; 2026-02-21T08:30:48.5468721Z cvt.rn.bf16x2.f32 %r1836, %r1835, %r1834; 2026-02-21T08:30:48.5468779Z $L__tmp169: 2026-02-21T08:30:48.5468986Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5469042Z cvt.u64.u32 %rd419, %r1545; 2026-02-21T08:30:48.5469098Z cvt.u64.u32 %rd420, %r1546; 2026-02-21T08:30:48.5469161Z shl.b64 %rd421, %rd420, 32; 2026-02-21T08:30:48.5469219Z or.b64 %rd422, %rd419, %rd421; 2026-02-21T08:30:48.5469271Z $L__tmp170: 2026-02-21T08:30:48.5469439Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5469499Z mov.b64 {%r1837, %r1838}, %rd422; 2026-02-21T08:30:48.5469565Z cvt.rn.bf16x2.f32 %r1839, %r1838, %r1837; 2026-02-21T08:30:48.5469618Z $L__tmp171: 2026-02-21T08:30:48.5469876Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5469936Z cvt.u64.u32 %rd423, 
%r1548; 2026-02-21T08:30:48.5469992Z cvt.u64.u32 %rd424, %r1549; 2026-02-21T08:30:48.5470053Z shl.b64 %rd425, %rd424, 32; 2026-02-21T08:30:48.5470111Z or.b64 %rd426, %rd423, %rd425; 2026-02-21T08:30:48.5470161Z $L__tmp172: 2026-02-21T08:30:48.5470337Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5470396Z mov.b64 {%r1840, %r1841}, %rd426; 2026-02-21T08:30:48.5470464Z cvt.rn.bf16x2.f32 %r1842, %r1841, %r1840; 2026-02-21T08:30:48.5470514Z $L__tmp173: 2026-02-21T08:30:48.5470732Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5470790Z cvt.u64.u32 %rd427, %r1550; 2026-02-21T08:30:48.5470888Z cvt.u64.u32 %rd428, %r1551; 2026-02-21T08:30:48.5470957Z shl.b64 %rd429, %rd428, 32; 2026-02-21T08:30:48.5471022Z or.b64 %rd430, %rd427, %rd429; 2026-02-21T08:30:48.5471074Z $L__tmp174: 2026-02-21T08:30:48.5471248Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5471307Z mov.b64 {%r1843, %r1844}, %rd430; 2026-02-21T08:30:48.5471374Z cvt.rn.bf16x2.f32 %r1845, %r1844, %r1843; 2026-02-21T08:30:48.5471426Z $L__tmp175: 2026-02-21T08:30:48.5471688Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5471749Z cvt.u64.u32 %rd431, %r1552; 2026-02-21T08:30:48.5471807Z cvt.u64.u32 %rd432, %r1553; 2026-02-21T08:30:48.5471874Z shl.b64 %rd433, %rd432, 32; 2026-02-21T08:30:48.5471934Z or.b64 %rd434, %rd431, %rd433; 2026-02-21T08:30:48.5471986Z $L__tmp176: 2026-02-21T08:30:48.5472168Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5472231Z mov.b64 {%r1846, %r1847}, %rd434; 2026-02-21T08:30:48.5472298Z cvt.rn.bf16x2.f32 %r1848, %r1847, %r1846; 2026-02-21T08:30:48.5472350Z $L__tmp177: 2026-02-21T08:30:48.5472571Z .loc 2 291 36 // standard.py:291:36 @[ 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5472630Z cvt.u64.u32 %rd435, %r1554; 2026-02-21T08:30:48.5472688Z cvt.u64.u32 %rd436, %r1555; 2026-02-21T08:30:48.5472753Z shl.b64 %rd437, %rd436, 32; 2026-02-21T08:30:48.5472812Z or.b64 %rd438, %rd435, %rd437; 2026-02-21T08:30:48.5472864Z $L__tmp178: 2026-02-21T08:30:48.5473038Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5473097Z mov.b64 {%r1849, %r1850}, %rd438; 2026-02-21T08:30:48.5473165Z cvt.rn.bf16x2.f32 %r1851, %r1850, %r1849; 2026-02-21T08:30:48.5473216Z $L__tmp179: 2026-02-21T08:30:48.5473436Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5473498Z cvt.u64.u32 %rd439, %r1556; 2026-02-21T08:30:48.5473555Z cvt.u64.u32 %rd440, %r1557; 2026-02-21T08:30:48.5473621Z shl.b64 %rd441, %rd440, 32; 2026-02-21T08:30:48.5473680Z or.b64 %rd442, %rd439, %rd441; 2026-02-21T08:30:48.5473731Z $L__tmp180: 2026-02-21T08:30:48.5473895Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5473964Z mov.b64 {%r1852, %r1853}, %rd442; 2026-02-21T08:30:48.5474031Z cvt.rn.bf16x2.f32 %r1854, %r1853, %r1852; 2026-02-21T08:30:48.5474082Z $L__tmp181: 2026-02-21T08:30:48.5474299Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5474381Z cvt.u64.u32 %rd443, %r1558; 2026-02-21T08:30:48.5474438Z cvt.u64.u32 %rd444, %r1559; 2026-02-21T08:30:48.5474503Z shl.b64 %rd445, %rd444, 32; 2026-02-21T08:30:48.5474565Z or.b64 %rd446, %rd443, %rd445; 2026-02-21T08:30:48.5474665Z $L__tmp182: 2026-02-21T08:30:48.5474829Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5474896Z mov.b64 {%r1855, %r1856}, %rd446; 2026-02-21T08:30:48.5474962Z cvt.rn.bf16x2.f32 %r1857, %r1856, %r1855; 2026-02-21T08:30:48.5475013Z $L__tmp183: 
2026-02-21T08:30:48.5475232Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5475289Z cvt.u64.u32 %rd447, %r1560; 2026-02-21T08:30:48.5475347Z cvt.u64.u32 %rd448, %r1561; 2026-02-21T08:30:48.5475410Z shl.b64 %rd449, %rd448, 32; 2026-02-21T08:30:48.5475469Z or.b64 %rd450, %rd447, %rd449; 2026-02-21T08:30:48.5475520Z $L__tmp184: 2026-02-21T08:30:48.5475683Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5475796Z mov.b64 {%r1858, %r1859}, %rd450; 2026-02-21T08:30:48.5475868Z cvt.rn.bf16x2.f32 %r1860, %r1859, %r1858; 2026-02-21T08:30:48.5475918Z $L__tmp185: 2026-02-21T08:30:48.5476133Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5476191Z cvt.u64.u32 %rd451, %r1562; 2026-02-21T08:30:48.5476248Z cvt.u64.u32 %rd452, %r1563; 2026-02-21T08:30:48.5476311Z shl.b64 %rd453, %rd452, 32; 2026-02-21T08:30:48.5476369Z or.b64 %rd454, %rd451, %rd453; 2026-02-21T08:30:48.5476419Z $L__tmp186: 2026-02-21T08:30:48.5476583Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5476649Z mov.b64 {%r1861, %r1862}, %rd454; 2026-02-21T08:30:48.5476716Z cvt.rn.bf16x2.f32 %r1863, %r1862, %r1861; 2026-02-21T08:30:48.5476768Z $L__tmp187: 2026-02-21T08:30:48.5476985Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5477046Z cvt.u64.u32 %rd455, %r1565; 2026-02-21T08:30:48.5477103Z cvt.u64.u32 %rd456, %r1566; 2026-02-21T08:30:48.5477167Z shl.b64 %rd457, %rd456, 32; 2026-02-21T08:30:48.5477226Z or.b64 %rd458, %rd455, %rd457; 2026-02-21T08:30:48.5477277Z $L__tmp188: 2026-02-21T08:30:48.5477453Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5477524Z mov.b64 {%r1864, %r1865}, %rd458; 2026-02-21T08:30:48.5477594Z cvt.rn.bf16x2.f32 
%r1866, %r1865, %r1864; 2026-02-21T08:30:48.5477647Z $L__tmp189: 2026-02-21T08:30:48.5477872Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5477933Z cvt.u64.u32 %rd459, %r1567; 2026-02-21T08:30:48.5477994Z cvt.u64.u32 %rd460, %r1568; 2026-02-21T08:30:48.5478061Z shl.b64 %rd461, %rd460, 32; 2026-02-21T08:30:48.5478123Z or.b64 %rd462, %rd459, %rd461; 2026-02-21T08:30:48.5478179Z $L__tmp190: 2026-02-21T08:30:48.5478354Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5478425Z mov.b64 {%r1867, %r1868}, %rd462; 2026-02-21T08:30:48.5478496Z cvt.rn.bf16x2.f32 %r1869, %r1868, %r1867; 2026-02-21T08:30:48.5478549Z $L__tmp191: 2026-02-21T08:30:48.5478777Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5478838Z cvt.u64.u32 %rd463, %r1569; 2026-02-21T08:30:48.5478898Z cvt.u64.u32 %rd464, %r1570; 2026-02-21T08:30:48.5478958Z shl.b64 %rd465, %rd464, 32; 2026-02-21T08:30:48.5479027Z or.b64 %rd466, %rd463, %rd465; 2026-02-21T08:30:48.5479082Z $L__tmp192: 2026-02-21T08:30:48.5479259Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5479331Z mov.b64 {%r1870, %r1871}, %rd466; 2026-02-21T08:30:48.5479403Z cvt.rn.bf16x2.f32 %r1872, %r1871, %r1870; 2026-02-21T08:30:48.5479505Z $L__tmp193: 2026-02-21T08:30:48.5479733Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5479796Z cvt.u64.u32 %rd467, %r1571; 2026-02-21T08:30:48.5479859Z cvt.u64.u32 %rd468, %r1572; 2026-02-21T08:30:48.5479920Z shl.b64 %rd469, %rd468, 32; 2026-02-21T08:30:48.5479991Z or.b64 %rd470, %rd467, %rd469; 2026-02-21T08:30:48.5480046Z $L__tmp194: 2026-02-21T08:30:48.5480221Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5480293Z mov.b64 {%r1873, 
%r1874}, %rd470; 2026-02-21T08:30:48.5480362Z cvt.rn.bf16x2.f32 %r1875, %r1874, %r1873; 2026-02-21T08:30:48.5480416Z $L__tmp195: 2026-02-21T08:30:48.5480639Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5480701Z cvt.u64.u32 %rd471, %r1573; 2026-02-21T08:30:48.5480805Z cvt.u64.u32 %rd472, %r1574; 2026-02-21T08:30:48.5480869Z shl.b64 %rd473, %rd472, 32; 2026-02-21T08:30:48.5480938Z or.b64 %rd474, %rd471, %rd473; 2026-02-21T08:30:48.5480993Z $L__tmp196: 2026-02-21T08:30:48.5481171Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5481240Z mov.b64 {%r1876, %r1877}, %rd474; 2026-02-21T08:30:48.5481312Z cvt.rn.bf16x2.f32 %r1878, %r1877, %r1876; 2026-02-21T08:30:48.5481367Z $L__tmp197: 2026-02-21T08:30:48.5481627Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5481691Z cvt.u64.u32 %rd475, %r1575; 2026-02-21T08:30:48.5481751Z cvt.u64.u32 %rd476, %r1576; 2026-02-21T08:30:48.5481811Z shl.b64 %rd477, %rd476, 32; 2026-02-21T08:30:48.5481883Z or.b64 %rd478, %rd475, %rd477; 2026-02-21T08:30:48.5481938Z $L__tmp198: 2026-02-21T08:30:48.5482114Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5482187Z mov.b64 {%r1879, %r1880}, %rd478; 2026-02-21T08:30:48.5482257Z cvt.rn.bf16x2.f32 %r1881, %r1880, %r1879; 2026-02-21T08:30:48.5482312Z $L__tmp199: 2026-02-21T08:30:48.5482538Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5482607Z cvt.u64.u32 %rd479, %r1577; 2026-02-21T08:30:48.5482668Z cvt.u64.u32 %rd480, %r1578; 2026-02-21T08:30:48.5482728Z shl.b64 %rd481, %rd480, 32; 2026-02-21T08:30:48.5482798Z or.b64 %rd482, %rd479, %rd481; 2026-02-21T08:30:48.5482853Z $L__tmp200: 2026-02-21T08:30:48.5483026Z .loc 1 87 28 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5483096Z mov.b64 {%r1882, %r1883}, %rd482; 2026-02-21T08:30:48.5483167Z cvt.rn.bf16x2.f32 %r1884, %r1883, %r1882; 2026-02-21T08:30:48.5483223Z $L__tmp201: 2026-02-21T08:30:48.5483450Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5483522Z cvt.u64.u32 %rd483, %r1579; 2026-02-21T08:30:48.5483583Z cvt.u64.u32 %rd484, %r1580; 2026-02-21T08:30:48.5483643Z shl.b64 %rd485, %rd484, 32; 2026-02-21T08:30:48.5483715Z or.b64 %rd486, %rd483, %rd485; 2026-02-21T08:30:48.5483771Z $L__tmp202: 2026-02-21T08:30:48.5483950Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5484017Z mov.b64 {%r1885, %r1886}, %rd486; 2026-02-21T08:30:48.5484088Z cvt.rn.bf16x2.f32 %r1887, %r1886, %r1885; 2026-02-21T08:30:48.5484142Z $L__tmp203: 2026-02-21T08:30:48.5484369Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5484437Z cvt.u64.u32 %rd487, %r1582; 2026-02-21T08:30:48.5484496Z cvt.u64.u32 %rd488, %r1583; 2026-02-21T08:30:48.5484559Z shl.b64 %rd489, %rd488, 32; 2026-02-21T08:30:48.5484690Z or.b64 %rd490, %rd487, %rd489; 2026-02-21T08:30:48.5484744Z $L__tmp204: 2026-02-21T08:30:48.5484915Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5484984Z mov.b64 {%r1888, %r1889}, %rd490; 2026-02-21T08:30:48.5485053Z cvt.rn.bf16x2.f32 %r1890, %r1889, %r1888; 2026-02-21T08:30:48.5485107Z $L__tmp205: 2026-02-21T08:30:48.5485331Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5485398Z cvt.u64.u32 %rd491, %r1584; 2026-02-21T08:30:48.5485457Z cvt.u64.u32 %rd492, %r1585; 2026-02-21T08:30:48.5485517Z shl.b64 %rd493, %rd492, 32; 2026-02-21T08:30:48.5485586Z or.b64 %rd494, %rd491, %rd493; 
2026-02-21T08:30:48.5485641Z $L__tmp206: 2026-02-21T08:30:48.5485819Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5485932Z mov.b64 {%r1891, %r1892}, %rd494; 2026-02-21T08:30:48.5486002Z cvt.rn.bf16x2.f32 %r1893, %r1892, %r1891; 2026-02-21T08:30:48.5486054Z $L__tmp207: 2026-02-21T08:30:48.5486263Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5486328Z cvt.u64.u32 %rd495, %r1586; 2026-02-21T08:30:48.5486385Z cvt.u64.u32 %rd496, %r1587; 2026-02-21T08:30:48.5486442Z shl.b64 %rd497, %rd496, 32; 2026-02-21T08:30:48.5486507Z or.b64 %rd498, %rd495, %rd497; 2026-02-21T08:30:48.5486558Z $L__tmp208: 2026-02-21T08:30:48.5486721Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5486780Z mov.b64 {%r1894, %r1895}, %rd498; 2026-02-21T08:30:48.5486852Z cvt.rn.bf16x2.f32 %r1896, %r1895, %r1894; 2026-02-21T08:30:48.5486904Z $L__tmp209: 2026-02-21T08:30:48.5487115Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5487183Z cvt.u64.u32 %rd499, %r1588; 2026-02-21T08:30:48.5487241Z cvt.u64.u32 %rd500, %r1589; 2026-02-21T08:30:48.5487299Z shl.b64 %rd501, %rd500, 32; 2026-02-21T08:30:48.5487364Z or.b64 %rd502, %rd499, %rd501; 2026-02-21T08:30:48.5487416Z $L__tmp210: 2026-02-21T08:30:48.5487584Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5487645Z mov.b64 {%r1897, %r1898}, %rd502; 2026-02-21T08:30:48.5487721Z cvt.rn.bf16x2.f32 %r1899, %r1898, %r1897; 2026-02-21T08:30:48.5487773Z $L__tmp211: 2026-02-21T08:30:48.5487984Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5488051Z cvt.u64.u32 %rd503, %r1590; 2026-02-21T08:30:48.5488108Z cvt.u64.u32 %rd504, %r1591; 2026-02-21T08:30:48.5488164Z shl.b64 %rd505, %rd504, 32; 
2026-02-21T08:30:48.5488231Z or.b64 %rd506, %rd503, %rd505; 2026-02-21T08:30:48.5488284Z $L__tmp212: 2026-02-21T08:30:48.5488455Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5488514Z mov.b64 {%r1900, %r1901}, %rd506; 2026-02-21T08:30:48.5488592Z cvt.rn.bf16x2.f32 %r1902, %r1901, %r1900; 2026-02-21T08:30:48.5488645Z $L__tmp213: 2026-02-21T08:30:48.5488854Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5488924Z cvt.u64.u32 %rd507, %r1592; 2026-02-21T08:30:48.5488981Z cvt.u64.u32 %rd508, %r1593; 2026-02-21T08:30:48.5489039Z shl.b64 %rd509, %rd508, 32; 2026-02-21T08:30:48.5489105Z or.b64 %rd510, %rd507, %rd509; 2026-02-21T08:30:48.5489157Z $L__tmp214: 2026-02-21T08:30:48.5489320Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5489379Z mov.b64 {%r1903, %r1904}, %rd510; 2026-02-21T08:30:48.5489457Z cvt.rn.bf16x2.f32 %r1905, %r1904, %r1903; 2026-02-21T08:30:48.5489549Z $L__tmp215: 2026-02-21T08:30:48.5489754Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5489819Z cvt.u64.u32 %rd511, %r1594; 2026-02-21T08:30:48.5489876Z cvt.u64.u32 %rd512, %r1595; 2026-02-21T08:30:48.5489933Z shl.b64 %rd513, %rd512, 32; 2026-02-21T08:30:48.5489999Z or.b64 %rd514, %rd511, %rd513; 2026-02-21T08:30:48.5490050Z $L__tmp216: 2026-02-21T08:30:48.5490214Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5490273Z mov.b64 {%r1906, %r1907}, %rd514; 2026-02-21T08:30:48.5490347Z cvt.rn.bf16x2.f32 %r1908, %r1907, %r1906; 2026-02-21T08:30:48.5490399Z $L__tmp217: 2026-02-21T08:30:48.5490606Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5490720Z cvt.u64.u32 %rd515, %r1596; 2026-02-21T08:30:48.5490780Z cvt.u64.u32 %rd516, 
%r1597; 2026-02-21T08:30:48.5490838Z shl.b64 %rd517, %rd516, 32; 2026-02-21T08:30:48.5490899Z or.b64 %rd518, %rd515, %rd517; 2026-02-21T08:30:48.5490958Z $L__tmp218: 2026-02-21T08:30:48.5491125Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5491184Z mov.b64 {%r1909, %r1910}, %rd518; 2026-02-21T08:30:48.5491258Z cvt.rn.bf16x2.f32 %r1911, %r1910, %r1909; 2026-02-21T08:30:48.5491424Z .loc 1 88 81 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:81 2026-02-21T08:30:48.5491522Z st.shared.v4.b32 [%r73], {%r1818, %r1830, %r1842, %r1854}; 2026-02-21T08:30:48.5491665Z st.shared.v4.b32 [%r74], {%r1866, %r1878, %r1890, %r1902}; 2026-02-21T08:30:48.5491721Z bar.sync 0; 2026-02-21T08:30:48.5491781Z // begin inline asm 2026-02-21T08:30:48.5491936Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1599, %r1600, %r1601, %r1602}, [%r888]; 2026-02-21T08:30:48.5492004Z // end inline asm 2026-02-21T08:30:48.5492064Z // begin inline asm 2026-02-21T08:30:48.5492216Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1604, %r1605, %r1606, %r1607}, [%r893]; 2026-02-21T08:30:48.5492280Z // end inline asm 2026-02-21T08:30:48.5492335Z bar.sync 0; 2026-02-21T08:30:48.5492427Z st.shared.v4.b32 [%r73], {%r1821, %r1833, %r1845, %r1857}; 2026-02-21T08:30:48.5492524Z st.shared.v4.b32 [%r74], {%r1869, %r1881, %r1893, %r1905}; 2026-02-21T08:30:48.5492579Z bar.sync 0; 2026-02-21T08:30:48.5492636Z // begin inline asm 2026-02-21T08:30:48.5492782Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1609, %r1610, %r1611, %r1612}, [%r888]; 2026-02-21T08:30:48.5492845Z // end inline asm 2026-02-21T08:30:48.5492901Z // begin inline asm 2026-02-21T08:30:48.5493047Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1614, %r1615, %r1616, %r1617}, [%r893]; 2026-02-21T08:30:48.5493109Z // end inline asm 2026-02-21T08:30:48.5493163Z bar.sync 0; 2026-02-21T08:30:48.5493257Z st.shared.v4.b32 [%r73], {%r1824, %r1836, %r1848, %r1860}; 2026-02-21T08:30:48.5493349Z st.shared.v4.b32 
[%r74], {%r1872, %r1884, %r1896, %r1908}; 2026-02-21T08:30:48.5493414Z bar.sync 0; 2026-02-21T08:30:48.5493470Z // begin inline asm 2026-02-21T08:30:48.5493615Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1619, %r1620, %r1621, %r1622}, [%r888]; 2026-02-21T08:30:48.5493678Z // end inline asm 2026-02-21T08:30:48.5493733Z // begin inline asm 2026-02-21T08:30:48.5493879Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1624, %r1625, %r1626, %r1627}, [%r893]; 2026-02-21T08:30:48.5493939Z // end inline asm 2026-02-21T08:30:48.5493993Z bar.sync 0; 2026-02-21T08:30:48.5494082Z st.shared.v4.b32 [%r73], {%r1827, %r1839, %r1851, %r1863}; 2026-02-21T08:30:48.5494172Z st.shared.v4.b32 [%r74], {%r1875, %r1887, %r1899, %r1911}; 2026-02-21T08:30:48.5494232Z bar.sync 0; 2026-02-21T08:30:48.5494288Z // begin inline asm 2026-02-21T08:30:48.5494434Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1629, %r1630, %r1631, %r1632}, [%r888]; 2026-02-21T08:30:48.5494546Z // end inline asm 2026-02-21T08:30:48.5494602Z // begin inline asm 2026-02-21T08:30:48.5494746Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1634, %r1635, %r1636, %r1637}, [%r893]; 2026-02-21T08:30:48.5494807Z // end inline asm 2026-02-21T08:30:48.5494863Z // begin inline asm 2026-02-21T08:30:48.5494971Z st.global.v4.b32 [ %rd371 + 0 ], { %r1599, %r1609, %r1619, %r1629 }; 2026-02-21T08:30:48.5495026Z // end inline asm 2026-02-21T08:30:48.5495089Z // begin inline asm 2026-02-21T08:30:48.5495193Z st.global.v4.b32 [ %rd372 + 0 ], { %r1600, %r1610, %r1620, %r1630 }; 2026-02-21T08:30:48.5495248Z // end inline asm 2026-02-21T08:30:48.5495310Z // begin inline asm 2026-02-21T08:30:48.5495412Z st.global.v4.b32 [ %rd373 + 0 ], { %r1601, %r1611, %r1621, %r1631 }; 2026-02-21T08:30:48.5495467Z // end inline asm 2026-02-21T08:30:48.5495522Z // begin inline asm 2026-02-21T08:30:48.5495630Z st.global.v4.b32 [ %rd374 + 0 ], { %r1602, %r1612, %r1622, %r1632 }; 2026-02-21T08:30:48.5495733Z // end inline asm 2026-02-21T08:30:48.5495791Z // begin inline 
asm 2026-02-21T08:30:48.5495897Z st.global.v4.b32 [ %rd375 + 0 ], { %r1604, %r1614, %r1624, %r1634 }; 2026-02-21T08:30:48.5495951Z // end inline asm 2026-02-21T08:30:48.5496006Z // begin inline asm 2026-02-21T08:30:48.5496104Z st.global.v4.b32 [ %rd376 + 0 ], { %r1605, %r1615, %r1625, %r1635 }; 2026-02-21T08:30:48.5496168Z // end inline asm 2026-02-21T08:30:48.5496225Z // begin inline asm 2026-02-21T08:30:48.5496320Z st.global.v4.b32 [ %rd377 + 0 ], { %r1606, %r1616, %r1626, %r1636 }; 2026-02-21T08:30:48.5496382Z // end inline asm 2026-02-21T08:30:48.5496437Z // begin inline asm 2026-02-21T08:30:48.5496534Z st.global.v4.b32 [ %rd378 + 0 ], { %r1607, %r1617, %r1627, %r1637 }; 2026-02-21T08:30:48.5496595Z // end inline asm 2026-02-21T08:30:48.5496779Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5496843Z add.s32 %r1912, %r4018, 4736; 2026-02-21T08:30:48.5497020Z .loc 1 25 35 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:25:35 2026-02-21T08:30:48.5497094Z shr.s32 %r1913, %r1912, 31; 2026-02-21T08:30:48.5497155Z shr.u32 %r1914, %r1913, 23; 2026-02-21T08:30:48.5497219Z add.s32 %r1915, %r1912, %r1914; 2026-02-21T08:30:48.5497290Z shr.s32 %r1916, %r1915, 9; 2026-02-21T08:30:48.5497463Z .loc 1 26 33 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:26:33 2026-02-21T08:30:48.5497522Z shl.b32 %r1917, %r1916, 3; 2026-02-21T08:30:48.5497700Z .loc 1 27 39 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:39 2026-02-21T08:30:48.5497759Z sub.s32 %r1918, 64, %r1917; 2026-02-21T08:30:48.5497928Z .loc 1 27 52 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:52 2026-02-21T08:30:48.5497986Z min.s32 %r1919, %r1918, 8; 2026-02-21T08:30:48.5498163Z .loc 1 28 45 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:45 2026-02-21T08:30:48.5498229Z and.b32 %r1920, %r1915, -512; 2026-02-21T08:30:48.5498290Z sub.s32 %r1921, %r1912, %r1920; 2026-02-21T08:30:48.5498468Z .loc 1 29 51 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:29:51 2026-02-21T08:30:48.5498529Z div.s32 %r140, %r1921, %r1919; 2026-02-21T08:30:48.5498696Z .loc 1 28 64 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:64 2026-02-21T08:30:48.5498767Z mul.lo.s32 %r1922, %r140, %r1919; 2026-02-21T08:30:48.5498826Z sub.s32 %r1923, %r1921, %r1922; 2026-02-21T08:30:48.5498991Z .loc 1 28 30 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:30 2026-02-21T08:30:48.5499058Z add.s32 %r1924, %r1923, %r1917; 2026-02-21T08:30:48.5499225Z .loc 1 30 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:30:27 2026-02-21T08:30:48.5499286Z shl.b32 %r141, %r1924, 7; 2026-02-21T08:30:48.5499454Z .loc 1 31 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:31:32 2026-02-21T08:30:48.5499565Z or.b32 %r142, %r141, %r6; 2026-02-21T08:30:48.5499734Z .loc 1 32 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:32:27 2026-02-21T08:30:48.5499794Z shl.b32 %r144, %r140, 6; 2026-02-21T08:30:48.5499968Z .loc 1 33 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:33:32 2026-02-21T08:30:48.5500027Z or.b32 %r1925, %r144, %r9; 2026-02-21T08:30:48.5500086Z or.b32 %r1926, %r144, %r10; 2026-02-21T08:30:48.5500151Z or.b32 %r1927, %r144, %r11; 2026-02-21T08:30:48.5500207Z or.b32 %r1928, %r144, %r12; 2026-02-21T08:30:48.5500369Z .loc 1 48 53 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:53 2026-02-21T08:30:48.5500429Z shl.b32 %r1929, %r1925, 10; 2026-02-21T08:30:48.5500492Z shl.b32 %r1930, %r1926, 10; 2026-02-21T08:30:48.5500586Z shl.b32 %r1931, %r1927, 10; 2026-02-21T08:30:48.5500647Z shl.b32 %r1932, %r1928, 10; 2026-02-21T08:30:48.5500715Z mov.pred %p121, -1; 2026-02-21T08:30:48.5500771Z mov.b32 %r4032, 0; 2026-02-21T08:30:48.5500823Z $L__tmp219: 2026-02-21T08:30:48.5501046Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5501103Z // begin inline asm 
2026-02-21T08:30:48.5501428Z @%p121 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 0], 64, {%r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032}; 2026-02-21T08:30:48.5501493Z // end inline asm 2026-02-21T08:30:48.5501582Z // begin inline asm 2026-02-21T08:30:48.5501899Z @%p121 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 16], 64, {%r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032}; 2026-02-21T08:30:48.5501955Z // end inline asm 2026-02-21T08:30:48.5502023Z // begin inline asm 2026-02-21T08:30:48.5502327Z @%p121 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 32], 64, {%r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032}; 2026-02-21T08:30:48.5502383Z // end inline asm 2026-02-21T08:30:48.5502447Z // begin inline asm 2026-02-21T08:30:48.5502757Z @%p121 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 48], 64, {%r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032, %r4032}; 2026-02-21T08:30:48.5502812Z // end inline asm 2026-02-21T08:30:48.5502874Z // begin inline asm 2026-02-21T08:30:48.5502947Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5503002Z // end inline asm 2026-02-21T08:30:48.5503056Z bar.sync 0; 2026-02-21T08:30:48.5503116Z $L__tmp220: 2026-02-21T08:30:48.5503282Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5503340Z // begin inline asm 2026-02-21T08:30:48.5503440Z @%p10 mbarrier.init.shared::cta.b64 [%r4033], 1; 2026-02-21T08:30:48.5503496Z // end inline asm 2026-02-21T08:30:48.5503550Z bar.sync 0; 2026-02-21T08:30:48.5503614Z // begin inline asm 2026-02-21T08:30:48.5503701Z @%p10 mbarrier.init.shared::cta.b64 [%r2245], 1; 2026-02-21T08:30:48.5503757Z // end inline asm 
2026-02-21T08:30:48.5503930Z .loc 1 48 60 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:60 2026-02-21T08:30:48.5504000Z or.b32 %r1933, %r1929, %r22; 2026-02-21T08:30:48.5504059Z or.b32 %r1934, %r1930, %r22; 2026-02-21T08:30:48.5504118Z or.b32 %r1935, %r1931, %r22; 2026-02-21T08:30:48.5504183Z or.b32 %r1936, %r1932, %r22; 2026-02-21T08:30:48.5504346Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5504415Z mad.wide.s32 %rd379, %r1933, 2, %rd92; 2026-02-21T08:30:48.5504490Z mad.wide.s32 %rd380, %r1934, 2, %rd92; 2026-02-21T08:30:48.5504620Z mad.wide.s32 %rd381, %r1935, 2, %rd92; 2026-02-21T08:30:48.5504686Z mad.wide.s32 %rd382, %r1936, 2, %rd92; 2026-02-21T08:30:48.5504744Z mov.b32 %r2049, 16; 2026-02-21T08:30:48.5504916Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5504973Z // begin inline asm 2026-02-21T08:30:48.5505098Z cp.async.cg.shared.global [ %r2763 + 0 ], [ %rd379 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5505160Z // end inline asm 2026-02-21T08:30:48.5505215Z // begin inline asm 2026-02-21T08:30:48.5505334Z cp.async.cg.shared.global [ %r2765 + 0 ], [ %rd380 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5505396Z // end inline asm 2026-02-21T08:30:48.5505452Z // begin inline asm 2026-02-21T08:30:48.5505566Z cp.async.cg.shared.global [ %r2767 + 0 ], [ %rd381 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5505619Z // end inline asm 2026-02-21T08:30:48.5505683Z // begin inline asm 2026-02-21T08:30:48.5505852Z cp.async.cg.shared.global [ %r2769 + 0 ], [ %rd382 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5505909Z // end inline asm 2026-02-21T08:30:48.5505981Z cp.async.commit_group; 2026-02-21T08:30:48.5506146Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5506206Z add.s32 %r1937, %r142, %r39; 2026-02-21T08:30:48.5506265Z add.s32 %r1938, %r142, %r40; 2026-02-21T08:30:48.5506435Z .loc 1 54 34 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5506497Z cvt.s64.s32 %rd519, %r1937; 2026-02-21T08:30:48.5506559Z add.s64 %rd383, %rd93, %rd519; 2026-02-21T08:30:48.5506627Z cvt.s64.s32 %rd520, %r1938; 2026-02-21T08:30:48.5506687Z add.s64 %rd384, %rd93, %rd520; 2026-02-21T08:30:48.5506853Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5506920Z // begin inline asm 2026-02-21T08:30:48.5507039Z cp.async.cg.shared.global [ %r2771 + 0 ], [ %rd383 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5507101Z // end inline asm 2026-02-21T08:30:48.5507157Z // begin inline asm 2026-02-21T08:30:48.5507277Z cp.async.cg.shared.global [ %r2773 + 0 ], [ %rd384 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5507332Z // end inline asm 2026-02-21T08:30:48.5507395Z cp.async.commit_group; 2026-02-21T08:30:48.5507567Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5507625Z cvt.s64.s32 %rd521, %r1929; 2026-02-21T08:30:48.5507686Z or.b64 %rd522, %rd521, %rd11; 2026-02-21T08:30:48.5507754Z shl.b64 %rd523, %rd522, 1; 2026-02-21T08:30:48.5507814Z add.s64 %rd48, %rd92, %rd523; 2026-02-21T08:30:48.5507872Z add.s64 %rd385, %rd48, 128; 2026-02-21T08:30:48.5507931Z cvt.s64.s32 %rd524, %r1930; 2026-02-21T08:30:48.5507996Z or.b64 %rd525, %rd524, %rd11; 2026-02-21T08:30:48.5508056Z shl.b64 %rd526, %rd525, 1; 2026-02-21T08:30:48.5508113Z add.s64 %rd49, %rd92, %rd526; 2026-02-21T08:30:48.5508179Z add.s64 %rd386, %rd49, 128; 2026-02-21T08:30:48.5508238Z cvt.s64.s32 %rd527, %r1931; 2026-02-21T08:30:48.5508297Z or.b64 %rd528, %rd527, %rd11; 2026-02-21T08:30:48.5508355Z shl.b64 %rd529, %rd528, 1; 2026-02-21T08:30:48.5508420Z add.s64 %rd50, %rd92, %rd529; 2026-02-21T08:30:48.5508478Z add.s64 %rd387, %rd50, 128; 2026-02-21T08:30:48.5508535Z cvt.s64.s32 %rd530, %r1932; 2026-02-21T08:30:48.5508599Z or.b64 %rd531, %rd530, %rd11; 2026-02-21T08:30:48.5508656Z shl.b64 %rd532, %rd531, 1; 
2026-02-21T08:30:48.5508715Z add.s64 %rd51, %rd92, %rd532; 2026-02-21T08:30:48.5508772Z add.s64 %rd388, %rd51, 128; 2026-02-21T08:30:48.5508939Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5508994Z bar.sync 0; 2026-02-21T08:30:48.5509050Z // begin inline asm 2026-02-21T08:30:48.5509172Z cp.async.cg.shared.global [ %r2468 + 0 ], [ %rd385 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5509227Z // end inline asm 2026-02-21T08:30:48.5509326Z // begin inline asm 2026-02-21T08:30:48.5509448Z cp.async.cg.shared.global [ %r2470 + 0 ], [ %rd386 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5509503Z // end inline asm 2026-02-21T08:30:48.5509559Z // begin inline asm 2026-02-21T08:30:48.5509675Z cp.async.cg.shared.global [ %r2472 + 0 ], [ %rd387 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5509737Z // end inline asm 2026-02-21T08:30:48.5509793Z // begin inline asm 2026-02-21T08:30:48.5509907Z cp.async.cg.shared.global [ %r2474 + 0 ], [ %rd388 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5509969Z // end inline asm 2026-02-21T08:30:48.5510031Z cp.async.commit_group; 2026-02-21T08:30:48.5510200Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5510261Z add.s32 %r1939, %r142, %r48; 2026-02-21T08:30:48.5510327Z add.s32 %r1940, %r142, %r49; 2026-02-21T08:30:48.5510530Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5510592Z cvt.s64.s32 %rd533, %r1939; 2026-02-21T08:30:48.5510661Z add.s64 %rd389, %rd93, %rd533; 2026-02-21T08:30:48.5510720Z cvt.s64.s32 %rd534, %r1940; 2026-02-21T08:30:48.5510779Z add.s64 %rd390, %rd93, %rd534; 2026-02-21T08:30:48.5510949Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5511005Z // begin inline asm 2026-02-21T08:30:48.5511118Z cp.async.cg.shared.global [ %r2476 + 0 ], [ %rd389 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5511174Z // end inline asm 
2026-02-21T08:30:48.5511235Z // begin inline asm 2026-02-21T08:30:48.5511345Z cp.async.cg.shared.global [ %r2478 + 0 ], [ %rd390 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5511399Z // end inline asm 2026-02-21T08:30:48.5511466Z cp.async.commit_group; 2026-02-21T08:30:48.5511668Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5511736Z cp.async.wait_group 2; 2026-02-21T08:30:48.5511797Z bar.sync 0; 2026-02-21T08:30:48.5511895Z ld.shared.v4.b32 {%r1941, %r1942, %r1943, %r1944}, [%r53]; 2026-02-21T08:30:48.5511959Z mov.b32 {%rs769, %rs770}, %r1944; 2026-02-21T08:30:48.5512020Z mov.b32 {%rs771, %rs772}, %r1943; 2026-02-21T08:30:48.5512086Z mov.b32 {%rs773, %rs774}, %r1942; 2026-02-21T08:30:48.5512144Z mov.b32 {%rs775, %rs776}, %r1941; 2026-02-21T08:30:48.5512247Z ld.shared.v4.b32 {%r1945, %r1946, %r1947, %r1948}, [%r53+16]; 2026-02-21T08:30:48.5512313Z mov.b32 {%rs777, %rs778}, %r1948; 2026-02-21T08:30:48.5512370Z mov.b32 {%rs779, %rs780}, %r1947; 2026-02-21T08:30:48.5512428Z mov.b32 {%rs781, %rs782}, %r1946; 2026-02-21T08:30:48.5512484Z mov.b32 {%rs783, %rs784}, %r1945; 2026-02-21T08:30:48.5512589Z ld.shared.v4.b32 {%r1949, %r1950, %r1951, %r1952}, [%r53+32]; 2026-02-21T08:30:48.5512648Z mov.b32 {%rs785, %rs786}, %r1952; 2026-02-21T08:30:48.5512707Z mov.b32 {%rs787, %rs788}, %r1951; 2026-02-21T08:30:48.5512775Z mov.b32 {%rs789, %rs790}, %r1950; 2026-02-21T08:30:48.5512835Z mov.b32 {%rs791, %rs792}, %r1949; 2026-02-21T08:30:48.5512933Z ld.shared.v4.b32 {%r1953, %r1954, %r1955, %r1956}, [%r53+48]; 2026-02-21T08:30:48.5512999Z mov.b32 {%rs793, %rs794}, %r1956; 2026-02-21T08:30:48.5513057Z mov.b32 {%rs795, %rs796}, %r1955; 2026-02-21T08:30:48.5513115Z mov.b32 {%rs797, %rs798}, %r1954; 2026-02-21T08:30:48.5513173Z mov.b32 {%rs799, %rs800}, %r1953; 2026-02-21T08:30:48.5513352Z .loc 1 52 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:52:32 2026-02-21T08:30:48.5513416Z cvt.f32.bf16 %r1766, %rs775; 
2026-02-21T08:30:48.5513477Z cvt.f32.bf16 %r1767, %rs776; 2026-02-21T08:30:48.5513543Z cvt.f32.bf16 %r1768, %rs773; 2026-02-21T08:30:48.5513602Z cvt.f32.bf16 %r1769, %rs774; 2026-02-21T08:30:48.5513658Z cvt.f32.bf16 %r1770, %rs771; 2026-02-21T08:30:48.5513724Z cvt.f32.bf16 %r1771, %rs772; 2026-02-21T08:30:48.5513781Z cvt.f32.bf16 %r1772, %rs769; 2026-02-21T08:30:48.5513841Z cvt.f32.bf16 %r1773, %rs770; 2026-02-21T08:30:48.5513947Z cvt.f32.bf16 %r1774, %rs783; 2026-02-21T08:30:48.5514016Z cvt.f32.bf16 %r1775, %rs784; 2026-02-21T08:30:48.5514072Z cvt.f32.bf16 %r1776, %rs781; 2026-02-21T08:30:48.5514131Z cvt.f32.bf16 %r1777, %rs782; 2026-02-21T08:30:48.5514198Z cvt.f32.bf16 %r1778, %rs779; 2026-02-21T08:30:48.5514275Z cvt.f32.bf16 %r1779, %rs780; 2026-02-21T08:30:48.5514332Z cvt.f32.bf16 %r1780, %rs777; 2026-02-21T08:30:48.5514390Z cvt.f32.bf16 %r1781, %rs778; 2026-02-21T08:30:48.5514458Z cvt.f32.bf16 %r1783, %rs791; 2026-02-21T08:30:48.5514518Z cvt.f32.bf16 %r1784, %rs792; 2026-02-21T08:30:48.5514576Z cvt.f32.bf16 %r1785, %rs789; 2026-02-21T08:30:48.5514641Z cvt.f32.bf16 %r1786, %rs790; 2026-02-21T08:30:48.5514698Z cvt.f32.bf16 %r1787, %rs787; 2026-02-21T08:30:48.5514755Z cvt.f32.bf16 %r1788, %rs788; 2026-02-21T08:30:48.5514813Z cvt.f32.bf16 %r1789, %rs785; 2026-02-21T08:30:48.5514879Z cvt.f32.bf16 %r1790, %rs786; 2026-02-21T08:30:48.5514979Z cvt.f32.bf16 %r1791, %rs799; 2026-02-21T08:30:48.5515039Z cvt.f32.bf16 %r1792, %rs800; 2026-02-21T08:30:48.5515103Z cvt.f32.bf16 %r1793, %rs797; 2026-02-21T08:30:48.5515160Z cvt.f32.bf16 %r1794, %rs798; 2026-02-21T08:30:48.5515218Z cvt.f32.bf16 %r1795, %rs795; 2026-02-21T08:30:48.5515275Z cvt.f32.bf16 %r1796, %rs796; 2026-02-21T08:30:48.5515340Z cvt.f32.bf16 %r1797, %rs793; 2026-02-21T08:30:48.5515396Z cvt.f32.bf16 %r1798, %rs794; 2026-02-21T08:30:48.5515448Z $L__tmp221: 2026-02-21T08:30:48.5515674Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5515730Z 
// begin inline asm 2026-02-21T08:30:48.5516052Z @%p121 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 0], 32, {%r1766, %r1767, %r1768, %r1769, %r1770, %r1771, %r1772, %r1773, %r1774, %r1775, %r1776, %r1777, %r1778, %r1779, %r1780, %r1781}; 2026-02-21T08:30:48.5516115Z // end inline asm 2026-02-21T08:30:48.5516171Z // begin inline asm 2026-02-21T08:30:48.5516475Z @%p121 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 16], 32, {%r1783, %r1784, %r1785, %r1786, %r1787, %r1788, %r1789, %r1790, %r1791, %r1792, %r1793, %r1794, %r1795, %r1796, %r1797, %r1798}; 2026-02-21T08:30:48.5516540Z // end inline asm 2026-02-21T08:30:48.5516596Z // begin inline asm 2026-02-21T08:30:48.5516667Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5516720Z // end inline asm 2026-02-21T08:30:48.5516781Z bar.sync 0; 2026-02-21T08:30:48.5516835Z $L__tmp222: 2026-02-21T08:30:48.5517008Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5517079Z ld.shared.s8 %rs801, [%r54]; 2026-02-21T08:30:48.5517240Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5517301Z shl.b16 %rs802, %rs801, 4; 2026-02-21T08:30:48.5517468Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5517536Z ld.shared.s8 %rs803, [%r54+128]; 2026-02-21T08:30:48.5517698Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5517759Z shl.b16 %rs804, %rs803, 4; 2026-02-21T08:30:48.5517923Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5517989Z ld.shared.s8 %rs805, [%r54+256]; 2026-02-21T08:30:48.5518151Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5518217Z shl.b16 %rs806, %rs805, 4; 2026-02-21T08:30:48.5518376Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5518439Z ld.shared.s8 
%rs807, [%r54+384]; 2026-02-21T08:30:48.5518608Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5518666Z shl.b16 %rs808, %rs807, 4; 2026-02-21T08:30:48.5518830Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5518942Z ld.shared.s8 %rs809, [%r54+512]; 2026-02-21T08:30:48.5519105Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5519163Z shl.b16 %rs810, %rs809, 4; 2026-02-21T08:30:48.5519325Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5519394Z ld.shared.s8 %rs811, [%r54+640]; 2026-02-21T08:30:48.5519558Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5519616Z shl.b16 %rs812, %rs811, 4; 2026-02-21T08:30:48.5519784Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5519846Z ld.shared.s8 %rs813, [%r54+768]; 2026-02-21T08:30:48.5520059Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5520128Z shl.b16 %rs814, %rs813, 4; 2026-02-21T08:30:48.5520290Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5520350Z ld.shared.s8 %rs815, [%r56]; 2026-02-21T08:30:48.5520519Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5520578Z shl.b16 %rs816, %rs815, 4; 2026-02-21T08:30:48.5520737Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5520799Z ld.shared.s8 %rs817, [%r54+1024]; 2026-02-21T08:30:48.5520972Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5521033Z shl.b16 %rs818, %rs817, 4; 2026-02-21T08:30:48.5521208Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5521283Z ld.shared.s8 %rs819, 
[%r54+1152]; 2026-02-21T08:30:48.5521452Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5521512Z shl.b16 %rs820, %rs819, 4; 2026-02-21T08:30:48.5521725Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5521794Z ld.shared.s8 %rs821, [%r54+1280]; 2026-02-21T08:30:48.5521964Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5522031Z shl.b16 %rs822, %rs821, 4; 2026-02-21T08:30:48.5522202Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5522265Z ld.shared.s8 %rs823, [%r54+1408]; 2026-02-21T08:30:48.5522442Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5522505Z shl.b16 %rs824, %rs823, 4; 2026-02-21T08:30:48.5522677Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5522741Z ld.shared.s8 %rs825, [%r54+1536]; 2026-02-21T08:30:48.5522918Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5522979Z shl.b16 %rs826, %rs825, 4; 2026-02-21T08:30:48.5523154Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5523227Z ld.shared.s8 %rs827, [%r54+1664]; 2026-02-21T08:30:48.5523400Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5523461Z shl.b16 %rs828, %rs827, 4; 2026-02-21T08:30:48.5523642Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5523706Z ld.shared.s8 %rs829, [%r54+1792]; 2026-02-21T08:30:48.5523882Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5524006Z shl.b16 %rs830, %rs829, 4; 2026-02-21T08:30:48.5524177Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5524240Z ld.shared.s8 %rs831, 
[%r58]; 2026-02-21T08:30:48.5524411Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5524479Z shl.b16 %rs832, %rs831, 4; 2026-02-21T08:30:48.5524654Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5524720Z ld.shared.s8 %rs833, [%r54+2048]; 2026-02-21T08:30:48.5524902Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5524964Z shl.b16 %rs834, %rs833, 4; 2026-02-21T08:30:48.5525184Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5525263Z ld.shared.s8 %rs835, [%r54+2176]; 2026-02-21T08:30:48.5525434Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5525496Z shl.b16 %rs836, %rs835, 4; 2026-02-21T08:30:48.5525675Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5525740Z ld.shared.s8 %rs837, [%r54+2304]; 2026-02-21T08:30:48.5525909Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5525972Z shl.b16 %rs838, %rs837, 4; 2026-02-21T08:30:48.5526151Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5526215Z ld.shared.s8 %rs839, [%r54+2432]; 2026-02-21T08:30:48.5526385Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5526454Z shl.b16 %rs840, %rs839, 4; 2026-02-21T08:30:48.5526628Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5526691Z ld.shared.s8 %rs841, [%r54+2560]; 2026-02-21T08:30:48.5526865Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5526926Z shl.b16 %rs842, %rs841, 4; 2026-02-21T08:30:48.5527095Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5527165Z ld.shared.s8 %rs843, 
[%r54+2688]; 2026-02-21T08:30:48.5527337Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5527398Z shl.b16 %rs844, %rs843, 4; 2026-02-21T08:30:48.5527570Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5527643Z ld.shared.s8 %rs845, [%r54+2816]; 2026-02-21T08:30:48.5527814Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5527878Z shl.b16 %rs846, %rs845, 4; 2026-02-21T08:30:48.5528058Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5528122Z ld.shared.s8 %rs847, [%r60]; 2026-02-21T08:30:48.5528289Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5528356Z shl.b16 %rs848, %rs847, 4; 2026-02-21T08:30:48.5528523Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5528588Z ld.shared.s8 %rs849, [%r54+3072]; 2026-02-21T08:30:48.5528800Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5528857Z shl.b16 %rs850, %rs849, 4; 2026-02-21T08:30:48.5529020Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5529120Z ld.shared.s8 %rs851, [%r54+3200]; 2026-02-21T08:30:48.5529293Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5529351Z shl.b16 %rs852, %rs851, 4; 2026-02-21T08:30:48.5529513Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5529583Z ld.shared.s8 %rs853, [%r54+3328]; 2026-02-21T08:30:48.5529745Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5529802Z shl.b16 %rs854, %rs853, 4; 2026-02-21T08:30:48.5529972Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5530033Z ld.shared.s8 %rs855, 
[%r54+3456]; 2026-02-21T08:30:48.5530199Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5530309Z shl.b16 %rs856, %rs855, 4; 2026-02-21T08:30:48.5530477Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5530540Z ld.shared.s8 %rs857, [%r54+3584]; 2026-02-21T08:30:48.5530699Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5530766Z shl.b16 %rs858, %rs857, 4; 2026-02-21T08:30:48.5530928Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5530991Z ld.shared.s8 %rs859, [%r54+3712]; 2026-02-21T08:30:48.5531163Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5531220Z shl.b16 %rs860, %rs859, 4; 2026-02-21T08:30:48.5531381Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5531449Z ld.shared.s8 %rs861, [%r54+3840]; 2026-02-21T08:30:48.5531640Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5531702Z shl.b16 %rs862, %rs861, 4; 2026-02-21T08:30:48.5531872Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5531933Z ld.shared.s8 %rs863, [%r62]; 2026-02-21T08:30:48.5532095Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5532152Z shl.b16 %rs864, %rs863, 4; 2026-02-21T08:30:48.5532323Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5532383Z cvt.s16.s8 %rs865, %rs802; 2026-02-21T08:30:48.5532440Z shr.s16 %rs866, %rs865, 4; 2026-02-21T08:30:48.5532508Z cvt.s16.s8 %rs867, %rs804; 2026-02-21T08:30:48.5532565Z shr.s16 %rs868, %rs867, 4; 2026-02-21T08:30:48.5532621Z shr.s16 %rs869, %rs801, 4; 2026-02-21T08:30:48.5532686Z shr.s16 %rs870, %rs803, 4; 2026-02-21T08:30:48.5532854Z .loc 1 77 32 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5532917Z cvt.rn.f32.s16 %r1957, %rs870; 2026-02-21T08:30:48.5532977Z cvt.rn.f32.s16 %r1958, %rs869; 2026-02-21T08:30:48.5533046Z cvt.rn.f32.s16 %r1959, %rs868; 2026-02-21T08:30:48.5533105Z cvt.rn.f32.s16 %r1960, %rs866; 2026-02-21T08:30:48.5533268Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5533335Z cvt.s16.s8 %rs871, %rs806; 2026-02-21T08:30:48.5533392Z shr.s16 %rs872, %rs871, 4; 2026-02-21T08:30:48.5533449Z cvt.s16.s8 %rs873, %rs808; 2026-02-21T08:30:48.5533507Z shr.s16 %rs874, %rs873, 4; 2026-02-21T08:30:48.5533573Z shr.s16 %rs875, %rs805, 4; 2026-02-21T08:30:48.5533631Z shr.s16 %rs876, %rs807, 4; 2026-02-21T08:30:48.5533799Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5533871Z cvt.rn.f32.s16 %r1961, %rs876; 2026-02-21T08:30:48.5533982Z cvt.rn.f32.s16 %r1962, %rs875; 2026-02-21T08:30:48.5534041Z cvt.rn.f32.s16 %r1963, %rs874; 2026-02-21T08:30:48.5534107Z cvt.rn.f32.s16 %r1964, %rs872; 2026-02-21T08:30:48.5534274Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5534332Z cvt.s16.s8 %rs877, %rs810; 2026-02-21T08:30:48.5534390Z shr.s16 %rs878, %rs877, 4; 2026-02-21T08:30:48.5534456Z cvt.s16.s8 %rs879, %rs812; 2026-02-21T08:30:48.5534513Z shr.s16 %rs880, %rs879, 4; 2026-02-21T08:30:48.5534569Z shr.s16 %rs881, %rs809, 4; 2026-02-21T08:30:48.5534633Z shr.s16 %rs882, %rs811, 4; 2026-02-21T08:30:48.5534798Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5534859Z cvt.rn.f32.s16 %r1965, %rs882; 2026-02-21T08:30:48.5534925Z cvt.rn.f32.s16 %r1966, %rs881; 2026-02-21T08:30:48.5535027Z cvt.rn.f32.s16 %r1967, %rs880; 2026-02-21T08:30:48.5535091Z cvt.rn.f32.s16 %r1968, %rs878; 2026-02-21T08:30:48.5535259Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5535330Z cvt.s16.s8 %rs883, %rs814; 2026-02-21T08:30:48.5535388Z shr.s16 %rs884, %rs883, 4; 2026-02-21T08:30:48.5535446Z cvt.s16.s8 %rs885, %rs816; 2026-02-21T08:30:48.5535516Z shr.s16 %rs886, %rs885, 4; 2026-02-21T08:30:48.5535573Z shr.s16 %rs887, %rs813, 4; 2026-02-21T08:30:48.5535630Z shr.s16 %rs888, %rs815, 4; 2026-02-21T08:30:48.5535796Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5535864Z cvt.rn.f32.s16 %r1969, %rs888; 2026-02-21T08:30:48.5535924Z cvt.rn.f32.s16 %r1970, %rs887; 2026-02-21T08:30:48.5535982Z cvt.rn.f32.s16 %r1971, %rs886; 2026-02-21T08:30:48.5536047Z cvt.rn.f32.s16 %r1972, %rs884; 2026-02-21T08:30:48.5536216Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5536277Z cvt.s16.s8 %rs889, %rs818; 2026-02-21T08:30:48.5536342Z shr.s16 %rs890, %rs889, 4; 2026-02-21T08:30:48.5536401Z cvt.s16.s8 %rs891, %rs820; 2026-02-21T08:30:48.5536457Z shr.s16 %rs892, %rs891, 4; 2026-02-21T08:30:48.5536512Z shr.s16 %rs893, %rs817, 4; 2026-02-21T08:30:48.5536575Z shr.s16 %rs894, %rs819, 4; 2026-02-21T08:30:48.5536739Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5536798Z cvt.rn.f32.s16 %r1973, %rs894; 2026-02-21T08:30:48.5536866Z cvt.rn.f32.s16 %r1974, %rs893; 2026-02-21T08:30:48.5536923Z cvt.rn.f32.s16 %r1975, %rs892; 2026-02-21T08:30:48.5536982Z cvt.rn.f32.s16 %r1976, %rs890; 2026-02-21T08:30:48.5537158Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5537218Z cvt.s16.s8 %rs895, %rs822; 2026-02-21T08:30:48.5537275Z shr.s16 %rs896, %rs895, 4; 2026-02-21T08:30:48.5537336Z cvt.s16.s8 %rs897, %rs824; 2026-02-21T08:30:48.5537405Z shr.s16 %rs898, %rs897, 4; 2026-02-21T08:30:48.5537461Z shr.s16 %rs899, %rs821, 4; 2026-02-21T08:30:48.5537517Z shr.s16 %rs900, %rs823, 4; 2026-02-21T08:30:48.5537687Z .loc 1 77 32 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5537746Z cvt.rn.f32.s16 %r1977, %rs900; 2026-02-21T08:30:48.5537805Z cvt.rn.f32.s16 %r1978, %rs899; 2026-02-21T08:30:48.5537864Z cvt.rn.f32.s16 %r1979, %rs898; 2026-02-21T08:30:48.5537931Z cvt.rn.f32.s16 %r1980, %rs896; 2026-02-21T08:30:48.5538097Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5538156Z cvt.s16.s8 %rs901, %rs826; 2026-02-21T08:30:48.5538222Z shr.s16 %rs902, %rs901, 4; 2026-02-21T08:30:48.5538281Z cvt.s16.s8 %rs903, %rs828; 2026-02-21T08:30:48.5538337Z shr.s16 %rs904, %rs903, 4; 2026-02-21T08:30:48.5538402Z shr.s16 %rs905, %rs825, 4; 2026-02-21T08:30:48.5538499Z shr.s16 %rs906, %rs827, 4; 2026-02-21T08:30:48.5538664Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5538723Z cvt.rn.f32.s16 %r1981, %rs906; 2026-02-21T08:30:48.5538789Z cvt.rn.f32.s16 %r1982, %rs905; 2026-02-21T08:30:48.5538848Z cvt.rn.f32.s16 %r1983, %rs904; 2026-02-21T08:30:48.5538906Z cvt.rn.f32.s16 %r1984, %rs902; 2026-02-21T08:30:48.5539082Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5539139Z cvt.s16.s8 %rs907, %rs830; 2026-02-21T08:30:48.5539195Z shr.s16 %rs908, %rs907, 4; 2026-02-21T08:30:48.5539252Z cvt.s16.s8 %rs909, %rs832; 2026-02-21T08:30:48.5539316Z shr.s16 %rs910, %rs909, 4; 2026-02-21T08:30:48.5539372Z shr.s16 %rs911, %rs829, 4; 2026-02-21T08:30:48.5539428Z shr.s16 %rs912, %rs831, 4; 2026-02-21T08:30:48.5539651Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5539713Z cvt.rn.f32.s16 %r1985, %rs912; 2026-02-21T08:30:48.5539770Z cvt.rn.f32.s16 %r1986, %rs911; 2026-02-21T08:30:48.5539832Z cvt.rn.f32.s16 %r1987, %rs910; 2026-02-21T08:30:48.5539892Z cvt.rn.f32.s16 %r1988, %rs908; 2026-02-21T08:30:48.5540057Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5540115Z cvt.s16.s8 %rs913, %rs834; 2026-02-21T08:30:48.5540180Z shr.s16 %rs914, %rs913, 4; 2026-02-21T08:30:48.5540238Z cvt.s16.s8 %rs915, %rs836; 2026-02-21T08:30:48.5540294Z shr.s16 %rs916, %rs915, 4; 2026-02-21T08:30:48.5540358Z shr.s16 %rs917, %rs833, 4; 2026-02-21T08:30:48.5540415Z shr.s16 %rs918, %rs835, 4; 2026-02-21T08:30:48.5540582Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5540645Z cvt.rn.f32.s16 %r1989, %rs918; 2026-02-21T08:30:48.5540704Z cvt.rn.f32.s16 %r1990, %rs917; 2026-02-21T08:30:48.5540764Z cvt.rn.f32.s16 %r1991, %rs916; 2026-02-21T08:30:48.5540824Z cvt.rn.f32.s16 %r1992, %rs914; 2026-02-21T08:30:48.5541000Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5541058Z cvt.s16.s8 %rs919, %rs838; 2026-02-21T08:30:48.5541115Z shr.s16 %rs920, %rs919, 4; 2026-02-21T08:30:48.5541179Z cvt.s16.s8 %rs921, %rs840; 2026-02-21T08:30:48.5541235Z shr.s16 %rs922, %rs921, 4; 2026-02-21T08:30:48.5541291Z shr.s16 %rs923, %rs837, 4; 2026-02-21T08:30:48.5541346Z shr.s16 %rs924, %rs839, 4; 2026-02-21T08:30:48.5541519Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5541613Z cvt.rn.f32.s16 %r1993, %rs924; 2026-02-21T08:30:48.5541674Z cvt.rn.f32.s16 %r1994, %rs923; 2026-02-21T08:30:48.5541741Z cvt.rn.f32.s16 %r1995, %rs922; 2026-02-21T08:30:48.5541802Z cvt.rn.f32.s16 %r1996, %rs920; 2026-02-21T08:30:48.5541971Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5542040Z cvt.s16.s8 %rs925, %rs842; 2026-02-21T08:30:48.5542098Z shr.s16 %rs926, %rs925, 4; 2026-02-21T08:30:48.5542156Z cvt.s16.s8 %rs927, %rs844; 2026-02-21T08:30:48.5542213Z shr.s16 %rs928, %rs927, 4; 2026-02-21T08:30:48.5542277Z shr.s16 %rs929, %rs841, 4; 2026-02-21T08:30:48.5542333Z shr.s16 %rs930, %rs843, 4; 2026-02-21T08:30:48.5542499Z .loc 1 77 32 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5542568Z cvt.rn.f32.s16 %r1997, %rs930; 2026-02-21T08:30:48.5542628Z cvt.rn.f32.s16 %r1998, %rs929; 2026-02-21T08:30:48.5542686Z cvt.rn.f32.s16 %r1999, %rs928; 2026-02-21T08:30:48.5542745Z cvt.rn.f32.s16 %r2000, %rs926; 2026-02-21T08:30:48.5542918Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5542978Z cvt.s16.s8 %rs931, %rs846; 2026-02-21T08:30:48.5543041Z shr.s16 %rs932, %rs931, 4; 2026-02-21T08:30:48.5543163Z cvt.s16.s8 %rs933, %rs848; 2026-02-21T08:30:48.5543223Z shr.s16 %rs934, %rs933, 4; 2026-02-21T08:30:48.5543282Z shr.s16 %rs935, %rs845, 4; 2026-02-21T08:30:48.5543353Z shr.s16 %rs936, %rs847, 4; 2026-02-21T08:30:48.5543517Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5543577Z cvt.rn.f32.s16 %r2001, %rs936; 2026-02-21T08:30:48.5543638Z cvt.rn.f32.s16 %r2002, %rs935; 2026-02-21T08:30:48.5543705Z cvt.rn.f32.s16 %r2003, %rs934; 2026-02-21T08:30:48.5543764Z cvt.rn.f32.s16 %r2004, %rs932; 2026-02-21T08:30:48.5543927Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5543994Z cvt.s16.s8 %rs937, %rs850; 2026-02-21T08:30:48.5544052Z shr.s16 %rs938, %rs937, 4; 2026-02-21T08:30:48.5544110Z cvt.s16.s8 %rs939, %rs852; 2026-02-21T08:30:48.5544173Z shr.s16 %rs940, %rs939, 4; 2026-02-21T08:30:48.5544275Z shr.s16 %rs941, %rs849, 4; 2026-02-21T08:30:48.5544337Z shr.s16 %rs942, %rs851, 4; 2026-02-21T08:30:48.5544501Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5544568Z cvt.rn.f32.s16 %r2005, %rs942; 2026-02-21T08:30:48.5544628Z cvt.rn.f32.s16 %r2006, %rs941; 2026-02-21T08:30:48.5544686Z cvt.rn.f32.s16 %r2007, %rs940; 2026-02-21T08:30:48.5544750Z cvt.rn.f32.s16 %r2008, %rs938; 2026-02-21T08:30:48.5544912Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5544971Z cvt.s16.s8 %rs943, %rs854; 2026-02-21T08:30:48.5545029Z shr.s16 %rs944, %rs943, 4; 2026-02-21T08:30:48.5545094Z cvt.s16.s8 %rs945, %rs856; 2026-02-21T08:30:48.5545150Z shr.s16 %rs946, %rs945, 4; 2026-02-21T08:30:48.5545207Z shr.s16 %rs947, %rs853, 4; 2026-02-21T08:30:48.5545273Z shr.s16 %rs948, %rs855, 4; 2026-02-21T08:30:48.5545440Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5545502Z cvt.rn.f32.s16 %r2009, %rs948; 2026-02-21T08:30:48.5545569Z cvt.rn.f32.s16 %r2010, %rs947; 2026-02-21T08:30:48.5545628Z cvt.rn.f32.s16 %r2011, %rs946; 2026-02-21T08:30:48.5545685Z cvt.rn.f32.s16 %r2012, %rs944; 2026-02-21T08:30:48.5545849Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5545914Z cvt.s16.s8 %rs949, %rs858; 2026-02-21T08:30:48.5545970Z shr.s16 %rs950, %rs949, 4; 2026-02-21T08:30:48.5546027Z cvt.s16.s8 %rs951, %rs860; 2026-02-21T08:30:48.5546089Z shr.s16 %rs952, %rs951, 4; 2026-02-21T08:30:48.5546146Z shr.s16 %rs953, %rs857, 4; 2026-02-21T08:30:48.5546202Z shr.s16 %rs954, %rs859, 4; 2026-02-21T08:30:48.5546370Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5546438Z cvt.rn.f32.s16 %r2013, %rs954; 2026-02-21T08:30:48.5546498Z cvt.rn.f32.s16 %r2014, %rs953; 2026-02-21T08:30:48.5546559Z cvt.rn.f32.s16 %r2015, %rs952; 2026-02-21T08:30:48.5546624Z cvt.rn.f32.s16 %r2016, %rs950; 2026-02-21T08:30:48.5546785Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5546844Z cvt.s16.s8 %rs955, %rs862; 2026-02-21T08:30:48.5546905Z shr.s16 %rs956, %rs955, 4; 2026-02-21T08:30:48.5546963Z cvt.s16.s8 %rs957, %rs864; 2026-02-21T08:30:48.5547018Z shr.s16 %rs958, %rs957, 4; 2026-02-21T08:30:48.5547074Z shr.s16 %rs959, %rs861, 4; 2026-02-21T08:30:48.5547136Z shr.s16 %rs960, %rs863, 4; 2026-02-21T08:30:48.5547300Z .loc 1 77 32 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5547359Z cvt.rn.f32.s16 %r2017, %rs960; 2026-02-21T08:30:48.5547425Z cvt.rn.f32.s16 %r2018, %rs959; 2026-02-21T08:30:48.5547484Z cvt.rn.f32.s16 %r2019, %rs958; 2026-02-21T08:30:48.5547542Z cvt.rn.f32.s16 %r2020, %rs956; 2026-02-21T08:30:48.5547646Z st.shared.v4.b32 [%r63], {%r1960, %r1958, %r1959, %r1957}; 2026-02-21T08:30:48.5547795Z st.shared.v4.b32 [%r63+16384], {%r1992, %r1990, %r1991, %r1989}; 2026-02-21T08:30:48.5547891Z st.shared.v4.b32 [%r64], {%r1964, %r1962, %r1963, %r1961}; 2026-02-21T08:30:48.5547995Z st.shared.v4.b32 [%r64+16384], {%r1996, %r1994, %r1995, %r1993}; 2026-02-21T08:30:48.5548093Z st.shared.v4.b32 [%r65], {%r1968, %r1966, %r1967, %r1965}; 2026-02-21T08:30:48.5548192Z st.shared.v4.b32 [%r65+16384], {%r2000, %r1998, %r1999, %r1997}; 2026-02-21T08:30:48.5548282Z st.shared.v4.b32 [%r66], {%r1972, %r1970, %r1971, %r1969}; 2026-02-21T08:30:48.5548387Z st.shared.v4.b32 [%r66+16384], {%r2004, %r2002, %r2003, %r2001}; 2026-02-21T08:30:48.5548476Z st.shared.v4.b32 [%r67], {%r1976, %r1974, %r1975, %r1973}; 2026-02-21T08:30:48.5548572Z st.shared.v4.b32 [%r67+16384], {%r2008, %r2006, %r2007, %r2005}; 2026-02-21T08:30:48.5548667Z st.shared.v4.b32 [%r68], {%r1980, %r1978, %r1979, %r1977}; 2026-02-21T08:30:48.5548804Z st.shared.v4.b32 [%r68+16384], {%r2012, %r2010, %r2011, %r2009}; 2026-02-21T08:30:48.5548897Z st.shared.v4.b32 [%r69], {%r1984, %r1982, %r1983, %r1981}; 2026-02-21T08:30:48.5549001Z st.shared.v4.b32 [%r69+16384], {%r2016, %r2014, %r2015, %r2013}; 2026-02-21T08:30:48.5549091Z st.shared.v4.b32 [%r70], {%r1988, %r1986, %r1987, %r1985}; 2026-02-21T08:30:48.5549188Z st.shared.v4.b32 [%r70+16384], {%r2020, %r2018, %r2019, %r2017}; 2026-02-21T08:30:48.5549241Z $L__tmp223: 2026-02-21T08:30:48.5549466Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5549523Z // begin inline asm 
2026-02-21T08:30:48.5549598Z fence.proxy.async.shared::cta; 2026-02-21T08:30:48.5549662Z // end inline asm 2026-02-21T08:30:48.5549717Z bar.sync 0; 2026-02-21T08:30:48.5549777Z @%p14 bra $L__BB0_17; 2026-02-21T08:30:48.5549876Z // %bb.16: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5549954Z elect.sync %r2045|%p120, -1; 2026-02-21T08:30:48.5550017Z mov.b32 %r2023, 69208336; 2026-02-21T08:30:48.5550076Z mov.pred %p119, 0; 2026-02-21T08:30:48.5550140Z // begin inline asm 2026-02-21T08:30:48.5550303Z @%p120 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 0 ], %rd737, %r2023, %p119; 2026-02-21T08:30:48.5550359Z // end inline asm 2026-02-21T08:30:48.5550422Z // begin inline asm 2026-02-21T08:30:48.5550577Z @%p120 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 8 ], %rd738, %r2023, %p121; 2026-02-21T08:30:48.5550633Z // end inline asm 2026-02-21T08:30:48.5550688Z // begin inline asm 2026-02-21T08:30:48.5550852Z @%p120 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 16 ], %rd739, %r2023, %p121; 2026-02-21T08:30:48.5550907Z // end inline asm 2026-02-21T08:30:48.5550964Z // begin inline asm 2026-02-21T08:30:48.5551122Z @%p120 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 24 ], %rd740, %r2023, %p121; 2026-02-21T08:30:48.5551180Z // end inline asm 2026-02-21T08:30:48.5551241Z // begin inline asm 2026-02-21T08:30:48.5551400Z @%p120 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 32 ], %rd741, %r2023, %p121; 2026-02-21T08:30:48.5551457Z // end inline asm 2026-02-21T08:30:48.5551514Z // begin inline asm 2026-02-21T08:30:48.5551704Z @%p120 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 40 ], %rd742, %r2023, %p121; 2026-02-21T08:30:48.5551762Z // end inline asm 2026-02-21T08:30:48.5551819Z // begin inline asm 2026-02-21T08:30:48.5551973Z @%p120 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 48 ], %rd743, %r2023, %p121; 2026-02-21T08:30:48.5552028Z // end inline asm 
2026-02-21T08:30:48.5552083Z // begin inline asm 2026-02-21T08:30:48.5552227Z @%p120 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 56 ], %rd744, %r2023, %p121; 2026-02-21T08:30:48.5552288Z // end inline asm 2026-02-21T08:30:48.5552349Z add.s32 %r2047, %r294, 57344; 2026-02-21T08:30:48.5552413Z cvt.u64.u32 %rd543, %r2047; 2026-02-21T08:30:48.5552526Z // begin inline asm 2026-02-21T08:30:48.5552656Z @%p120 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd543]; 2026-02-21T08:30:48.5552711Z // end inline asm 2026-02-21T08:30:48.5552763Z $L__tmp224: 2026-02-21T08:30:48.5552874Z $L__BB0_17: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5553036Z .loc 1 0 0 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0 2026-02-21T08:30:48.5553098Z or.b32 %r143, %r141, %r8; 2026-02-21T08:30:48.5553167Z or.b32 %r145, %r144, %r14; 2026-02-21T08:30:48.5553225Z or.b32 %r146, %r144, %r15; 2026-02-21T08:30:48.5553282Z or.b32 %r147, %r144, %r16; 2026-02-21T08:30:48.5553345Z or.b32 %r148, %r144, %r17; 2026-02-21T08:30:48.5553402Z or.b32 %r149, %r144, %r18; 2026-02-21T08:30:48.5553459Z or.b32 %r150, %r144, %r19; 2026-02-21T08:30:48.5553516Z or.b32 %r151, %r144, %r20; 2026-02-21T08:30:48.5553630Z or.b32 %r152, %r144, %r21; 2026-02-21T08:30:48.5553808Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5553871Z add.s64 %rd544, %rd48, 256; 2026-02-21T08:30:48.5553938Z add.s64 %rd545, %rd49, 256; 2026-02-21T08:30:48.5553997Z add.s64 %rd546, %rd50, 256; 2026-02-21T08:30:48.5554055Z add.s64 %rd547, %rd51, 256; 2026-02-21T08:30:48.5554233Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5554291Z // begin inline asm 2026-02-21T08:30:48.5554415Z cp.async.cg.shared.global [ %r2763 + 0 ], [ %rd544 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5554471Z // end inline asm 2026-02-21T08:30:48.5554535Z // begin inline asm 2026-02-21T08:30:48.5554653Z cp.async.cg.shared.global [ 
%r2765 + 0 ], [ %rd545 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5554709Z // end inline asm 2026-02-21T08:30:48.5554771Z // begin inline asm 2026-02-21T08:30:48.5554890Z cp.async.cg.shared.global [ %r2767 + 0 ], [ %rd546 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5554948Z // end inline asm 2026-02-21T08:30:48.5555004Z // begin inline asm 2026-02-21T08:30:48.5555126Z cp.async.cg.shared.global [ %r2769 + 0 ], [ %rd547 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5555181Z // end inline asm 2026-02-21T08:30:48.5555245Z cp.async.commit_group; 2026-02-21T08:30:48.5555420Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5555481Z add.s32 %r2063, %r142, %r71; 2026-02-21T08:30:48.5555539Z add.s32 %r2064, %r142, %r72; 2026-02-21T08:30:48.5555711Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5555770Z cvt.s64.s32 %rd551, %r2063; 2026-02-21T08:30:48.5555830Z add.s64 %rd548, %rd93, %rd551; 2026-02-21T08:30:48.5555888Z cvt.s64.s32 %rd552, %r2064; 2026-02-21T08:30:48.5555955Z add.s64 %rd549, %rd93, %rd552; 2026-02-21T08:30:48.5556125Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5556186Z // begin inline asm 2026-02-21T08:30:48.5556309Z cp.async.cg.shared.global [ %r2771 + 0 ], [ %rd548 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5556365Z // end inline asm 2026-02-21T08:30:48.5556421Z // begin inline asm 2026-02-21T08:30:48.5556541Z cp.async.cg.shared.global [ %r2773 + 0 ], [ %rd549 + 0 ], 0x10, %r2049; 2026-02-21T08:30:48.5556595Z // end inline asm 2026-02-21T08:30:48.5556656Z cp.async.commit_group; 2026-02-21T08:30:48.5556822Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5556888Z add.s32 %r2065, %r39, %r141; 2026-02-21T08:30:48.5556946Z cvt.u64.u32 %rd1097, %r2065; 2026-02-21T08:30:48.5557006Z add.s32 %r2066, %r11, %r144; 2026-02-21T08:30:48.5557069Z shl.b32 %r2067, %r2066, 10; 
2026-02-21T08:30:48.5557140Z mad.wide.s32 %rd1096, %r2067, 2, %rd10; 2026-02-21T08:30:48.5557198Z add.s32 %r2068, %r10, %r144; 2026-02-21T08:30:48.5557315Z shl.b32 %r2069, %r2068, 10; 2026-02-21T08:30:48.5557392Z mad.wide.s32 %rd1095, %r2069, 2, %rd10; 2026-02-21T08:30:48.5557451Z shl.b32 %r2070, %r140, 16; 2026-02-21T08:30:48.5557507Z or.b32 %r2071, %r79, %r2070; 2026-02-21T08:30:48.5557582Z mad.wide.s32 %rd1094, %r2071, 2, %rd10; 2026-02-21T08:30:48.5557638Z or.b32 %r4031, %r80, %r2070; 2026-02-21T08:30:48.5557697Z mov.b64 %rd1098, -32; 2026-02-21T08:30:48.5557764Z mov.b32 %r4034, %r4032; 2026-02-21T08:30:48.5557822Z mov.b32 %r4036, %r4032; 2026-02-21T08:30:48.5557879Z bra.uni $L__BB0_18; 2026-02-21T08:30:48.5557982Z $L__BB0_20: // in Loop: Header=BB0_18 Depth=2 2026-02-21T08:30:48.5558152Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5558213Z add.s64 %rd1098, %rd1098, 32; 2026-02-21T08:30:48.5558281Z setp.lt.u64 %p157, %rd1098, 416; 2026-02-21T08:30:48.5558341Z $L__tmp225: 2026-02-21T08:30:48.5558621Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5558682Z add.s32 %r2240, %r4035, 1; 2026-02-21T08:30:48.5558748Z setp.gt.s32 %p158, %r2240, 1; 2026-02-21T08:30:48.5558812Z selp.b32 %r4035, 0, %r2240, %p158; 2026-02-21T08:30:48.5558872Z selp.b32 %r2241, 1, 0, %p158; 2026-02-21T08:30:48.5558930Z xor.b32 %r165, %r4036, %r2241; 2026-02-21T08:30:48.5558989Z $L__tmp226: 2026-02-21T08:30:48.5559155Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5559216Z add.s64 %rd562, %rd1094, %rd9; 2026-02-21T08:30:48.5559283Z add.s64 %rd563, %rd1095, %rd9; 2026-02-21T08:30:48.5559343Z add.s64 %rd564, %rd1096, %rd9; 2026-02-21T08:30:48.5559410Z mad.wide.s32 %rd565, %r4031, 2, %rd92; 2026-02-21T08:30:48.5559580Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 
2026-02-21T08:30:48.5559654Z add.s32 %r2228, %r161, %r4047; 2026-02-21T08:30:48.5559718Z selp.b32 %r2229, 16, 0, %p157; 2026-02-21T08:30:48.5559778Z // begin inline asm 2026-02-21T08:30:48.5559904Z cp.async.cg.shared.global [ %r2228 + 0 ], [ %rd562 + 0 ], 0x10, %r2229; 2026-02-21T08:30:48.5559962Z // end inline asm 2026-02-21T08:30:48.5560024Z add.s32 %r2230, %r2228, 2048; 2026-02-21T08:30:48.5560097Z // begin inline asm 2026-02-21T08:30:48.5560214Z cp.async.cg.shared.global [ %r2230 + 0 ], [ %rd563 + 0 ], 0x10, %r2229; 2026-02-21T08:30:48.5560272Z // end inline asm 2026-02-21T08:30:48.5560333Z add.s32 %r2232, %r2228, 4096; 2026-02-21T08:30:48.5560403Z // begin inline asm 2026-02-21T08:30:48.5560519Z cp.async.cg.shared.global [ %r2232 + 0 ], [ %rd564 + 0 ], 0x10, %r2229; 2026-02-21T08:30:48.5560577Z // end inline asm 2026-02-21T08:30:48.5560646Z add.s32 %r2234, %r2228, 6144; 2026-02-21T08:30:48.5560705Z // begin inline asm 2026-02-21T08:30:48.5560820Z cp.async.cg.shared.global [ %r2234 + 0 ], [ %rd565 + 0 ], 0x10, %r2229; 2026-02-21T08:30:48.5560880Z // end inline asm 2026-02-21T08:30:48.5560949Z cp.async.commit_group; 2026-02-21T08:30:48.5561113Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5561174Z add.s64 %rd568, %rd9, %rd1097; 2026-02-21T08:30:48.5561243Z add.s64 %rd569, %rd568, 786432; 2026-02-21T08:30:48.5561410Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5561472Z add.s64 %rd570, %rd568, 917504; 2026-02-21T08:30:48.5561572Z cvt.s64.s32 %rd571, %rd569; 2026-02-21T08:30:48.5561634Z add.s64 %rd566, %rd93, %rd571; 2026-02-21T08:30:48.5561693Z cvt.s64.s32 %rd572, %rd570; 2026-02-21T08:30:48.5561751Z add.s64 %rd567, %rd93, %rd572; 2026-02-21T08:30:48.5561922Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5561983Z add.s32 %r2236, %r162, %r111; 2026-02-21T08:30:48.5562042Z // begin inline asm 
2026-02-21T08:30:48.5562217Z cp.async.cg.shared.global [ %r2236 + 0 ], [ %rd566 + 0 ], 0x10, %r2229; 2026-02-21T08:30:48.5562273Z // end inline asm 2026-02-21T08:30:48.5562332Z add.s32 %r2238, %r2236, 2048; 2026-02-21T08:30:48.5562396Z // begin inline asm 2026-02-21T08:30:48.5562510Z cp.async.cg.shared.global [ %r2238 + 0 ], [ %rd567 + 0 ], 0x10, %r2229; 2026-02-21T08:30:48.5562566Z // end inline asm 2026-02-21T08:30:48.5562629Z cp.async.commit_group; 2026-02-21T08:30:48.5562808Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5562872Z add.s64 %rd1097, %rd1097, 262144; 2026-02-21T08:30:48.5562932Z add.s64 %rd1096, %rd1096, 128; 2026-02-21T08:30:48.5563000Z add.s64 %rd1095, %rd1095, 128; 2026-02-21T08:30:48.5563058Z add.s64 %rd1094, %rd1094, 128; 2026-02-21T08:30:48.5563116Z add.s32 %r4031, %r4031, 64; 2026-02-21T08:30:48.5563182Z setp.lt.u64 %p159, %rd1098, 448; 2026-02-21T08:30:48.5563295Z mov.b32 %r4032, %r4036; 2026-02-21T08:30:48.5563359Z mov.b32 %r4036, %r165; 2026-02-21T08:30:48.5563419Z @%p159 bra $L__BB0_18; 2026-02-21T08:30:48.5563487Z bra.uni $L__BB0_21; 2026-02-21T08:30:48.5563586Z $L__BB0_18: // Parent Loop BB0_3 Depth=1 2026-02-21T08:30:48.5563679Z // => This Inner Loop Header: Depth=2 2026-02-21T08:30:48.5563745Z add.s32 %r2109, %r4034, 1; 2026-02-21T08:30:48.5563807Z setp.gt.s32 %p139, %r2109, 1; 2026-02-21T08:30:48.5563871Z selp.b32 %r4034, 0, %r2109, %p139; 2026-02-21T08:30:48.5564034Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5564105Z cp.async.wait_group 2; 2026-02-21T08:30:48.5564159Z bar.sync 0; 2026-02-21T08:30:48.5564230Z shl.b32 %r2110, %r4034, 12; 2026-02-21T08:30:48.5564297Z shl.b32 %r2111, %r4034, 13; 2026-02-21T08:30:48.5564356Z add.s32 %r2113, %r294, %r2111; 2026-02-21T08:30:48.5564418Z add.s32 %r161, %r2113, 32768; 2026-02-21T08:30:48.5564483Z add.s32 %r2114, %r161, %r52; 2026-02-21T08:30:48.5564596Z ld.shared.v4.b32 {%r2115, 
%r2116, %r2117, %r2118}, [%r2114]; 2026-02-21T08:30:48.5564662Z mov.b32 {%rs961, %rs962}, %r2118; 2026-02-21T08:30:48.5564724Z mov.b32 {%rs963, %rs964}, %r2117; 2026-02-21T08:30:48.5564792Z mov.b32 {%rs965, %rs966}, %r2116; 2026-02-21T08:30:48.5564852Z mov.b32 {%rs967, %rs968}, %r2115; 2026-02-21T08:30:48.5564960Z ld.shared.v4.b32 {%r2119, %r2120, %r2121, %r2122}, [%r2114+16]; 2026-02-21T08:30:48.5565027Z mov.b32 {%rs969, %rs970}, %r2122; 2026-02-21T08:30:48.5565089Z mov.b32 {%rs971, %rs972}, %r2121; 2026-02-21T08:30:48.5565148Z mov.b32 {%rs973, %rs974}, %r2120; 2026-02-21T08:30:48.5565208Z mov.b32 {%rs975, %rs976}, %r2119; 2026-02-21T08:30:48.5565321Z ld.shared.v4.b32 {%r2123, %r2124, %r2125, %r2126}, [%r2114+32]; 2026-02-21T08:30:48.5565382Z mov.b32 {%rs977, %rs978}, %r2126; 2026-02-21T08:30:48.5565441Z mov.b32 {%rs979, %rs980}, %r2125; 2026-02-21T08:30:48.5565510Z mov.b32 {%rs981, %rs982}, %r2124; 2026-02-21T08:30:48.5565573Z mov.b32 {%rs983, %rs984}, %r2123; 2026-02-21T08:30:48.5565673Z ld.shared.v4.b32 {%r2127, %r2128, %r2129, %r2130}, [%r2114+48]; 2026-02-21T08:30:48.5565741Z mov.b32 {%rs985, %rs986}, %r2130; 2026-02-21T08:30:48.5565800Z mov.b32 {%rs987, %rs988}, %r2129; 2026-02-21T08:30:48.5565861Z mov.b32 {%rs989, %rs990}, %r2128; 2026-02-21T08:30:48.5565921Z mov.b32 {%rs991, %rs992}, %r2127; 2026-02-21T08:30:48.5566103Z .loc 1 52 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:52:32 2026-02-21T08:30:48.5566168Z cvt.f32.bf16 %r2076, %rs967; 2026-02-21T08:30:48.5566233Z cvt.f32.bf16 %r2077, %rs968; 2026-02-21T08:30:48.5566299Z cvt.f32.bf16 %r2078, %rs965; 2026-02-21T08:30:48.5566360Z cvt.f32.bf16 %r2079, %rs966; 2026-02-21T08:30:48.5566420Z cvt.f32.bf16 %r2080, %rs963; 2026-02-21T08:30:48.5566480Z cvt.f32.bf16 %r2081, %rs964; 2026-02-21T08:30:48.5566549Z cvt.f32.bf16 %r2082, %rs961; 2026-02-21T08:30:48.5566959Z cvt.f32.bf16 %r2083, %rs962; 2026-02-21T08:30:48.5567021Z cvt.f32.bf16 %r2084, %rs975; 2026-02-21T08:30:48.5567091Z cvt.f32.bf16 
%r2085, %rs976; 2026-02-21T08:30:48.5567151Z cvt.f32.bf16 %r2086, %rs973; 2026-02-21T08:30:48.5567211Z cvt.f32.bf16 %r2087, %rs974; 2026-02-21T08:30:48.5567272Z cvt.f32.bf16 %r2088, %rs971; 2026-02-21T08:30:48.5567341Z cvt.f32.bf16 %r2089, %rs972; 2026-02-21T08:30:48.5567402Z cvt.f32.bf16 %r2090, %rs969; 2026-02-21T08:30:48.5567465Z cvt.f32.bf16 %r2091, %rs970; 2026-02-21T08:30:48.5567535Z cvt.f32.bf16 %r2093, %rs983; 2026-02-21T08:30:48.5567596Z cvt.f32.bf16 %r2094, %rs984; 2026-02-21T08:30:48.5567657Z cvt.f32.bf16 %r2095, %rs981; 2026-02-21T08:30:48.5567728Z cvt.f32.bf16 %r2096, %rs982; 2026-02-21T08:30:48.5567789Z cvt.f32.bf16 %r2097, %rs979; 2026-02-21T08:30:48.5567851Z cvt.f32.bf16 %r2098, %rs980; 2026-02-21T08:30:48.5567913Z cvt.f32.bf16 %r2099, %rs977; 2026-02-21T08:30:48.5567986Z cvt.f32.bf16 %r2100, %rs978; 2026-02-21T08:30:48.5568095Z cvt.f32.bf16 %r2101, %rs991; 2026-02-21T08:30:48.5568159Z cvt.f32.bf16 %r2102, %rs992; 2026-02-21T08:30:48.5568227Z cvt.f32.bf16 %r2103, %rs989; 2026-02-21T08:30:48.5568287Z cvt.f32.bf16 %r2104, %rs990; 2026-02-21T08:30:48.5568346Z cvt.f32.bf16 %r2105, %rs987; 2026-02-21T08:30:48.5568405Z cvt.f32.bf16 %r2106, %rs988; 2026-02-21T08:30:48.5568474Z cvt.f32.bf16 %r2107, %rs985; 2026-02-21T08:30:48.5568533Z cvt.f32.bf16 %r2108, %rs986; 2026-02-21T08:30:48.5568589Z $L__tmp227: 2026-02-21T08:30:48.5568820Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5568879Z // begin inline asm 2026-02-21T08:30:48.5568932Z 2026-02-21T08:30:48.5568983Z { 2026-02-21T08:30:48.5569055Z .reg .pred complete; 2026-02-21T08:30:48.5569113Z waitLoop: 2026-02-21T08:30:48.5569242Z mbarrier.try_wait.parity.shared.b64 complete, [%r4033], %r4032; 2026-02-21T08:30:48.5569320Z @!complete bra.uni waitLoop; 2026-02-21T08:30:48.5569377Z } 2026-02-21T08:30:48.5569381Z 2026-02-21T08:30:48.5569439Z // end inline asm 2026-02-21T08:30:48.5569511Z mov.pred %p140, -1; 2026-02-21T08:30:48.5569570Z 
// begin inline asm 2026-02-21T08:30:48.5569902Z @%p140 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 0], 32, {%r2076, %r2077, %r2078, %r2079, %r2080, %r2081, %r2082, %r2083, %r2084, %r2085, %r2086, %r2087, %r2088, %r2089, %r2090, %r2091}; 2026-02-21T08:30:48.5569960Z // end inline asm 2026-02-21T08:30:48.5570026Z // begin inline asm 2026-02-21T08:30:48.5570350Z @%p140 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 16], 32, {%r2093, %r2094, %r2095, %r2096, %r2097, %r2098, %r2099, %r2100, %r2101, %r2102, %r2103, %r2104, %r2105, %r2106, %r2107, %r2108}; 2026-02-21T08:30:48.5570408Z // end inline asm 2026-02-21T08:30:48.5570475Z // begin inline asm 2026-02-21T08:30:48.5570548Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5570605Z // end inline asm 2026-02-21T08:30:48.5570670Z bar.sync 0; 2026-02-21T08:30:48.5570729Z $L__tmp228: 2026-02-21T08:30:48.5570907Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5570971Z add.s32 %r2131, %r294, %r2110; 2026-02-21T08:30:48.5571041Z add.s32 %r2132, %r2131, 49152; 2026-02-21T08:30:48.5571103Z add.s32 %r162, %r2132, %r4048; 2026-02-21T08:30:48.5571164Z add.s32 %r2133, %r2132, %r55; 2026-02-21T08:30:48.5571232Z add.s32 %r2134, %r2132, %r57; 2026-02-21T08:30:48.5571290Z add.s32 %r2135, %r2132, %r59; 2026-02-21T08:30:48.5571349Z add.s32 %r2136, %r2132, %r61; 2026-02-21T08:30:48.5571531Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5571635Z ld.shared.s8 %rs993, [%r162]; 2026-02-21T08:30:48.5571812Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5571877Z shl.b16 %rs994, %rs993, 4; 2026-02-21T08:30:48.5572061Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5572185Z ld.shared.s8 %rs995, [%r162+128]; 2026-02-21T08:30:48.5572359Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 
2026-02-21T08:30:48.5572430Z shl.b16 %rs996, %rs995, 4; 2026-02-21T08:30:48.5572602Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5572667Z ld.shared.s8 %rs997, [%r162+256]; 2026-02-21T08:30:48.5572844Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5572907Z shl.b16 %rs998, %rs997, 4; 2026-02-21T08:30:48.5573081Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5573154Z ld.shared.s8 %rs999, [%r162+384]; 2026-02-21T08:30:48.5573371Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5573438Z shl.b16 %rs1000, %rs999, 4; 2026-02-21T08:30:48.5573610Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5573686Z ld.shared.s8 %rs1001, [%r162+512]; 2026-02-21T08:30:48.5573860Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5573921Z shl.b16 %rs1002, %rs1001, 4; 2026-02-21T08:30:48.5574105Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5574170Z ld.shared.s8 %rs1003, [%r162+640]; 2026-02-21T08:30:48.5574331Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5574398Z shl.b16 %rs1004, %rs1003, 4; 2026-02-21T08:30:48.5574564Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5574630Z ld.shared.s8 %rs1005, [%r162+768]; 2026-02-21T08:30:48.5574803Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5574863Z shl.b16 %rs1006, %rs1005, 4; 2026-02-21T08:30:48.5575024Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5575089Z ld.shared.s8 %rs1007, [%r2133]; 2026-02-21T08:30:48.5575262Z .loc 1 57 28 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5575321Z shl.b16 %rs1008, %rs1007, 4; 2026-02-21T08:30:48.5575485Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5575561Z ld.shared.s8 %rs1009, [%r162+1024]; 2026-02-21T08:30:48.5575727Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5575787Z shl.b16 %rs1010, %rs1009, 4; 2026-02-21T08:30:48.5575960Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5576030Z ld.shared.s8 %rs1011, [%r162+1152]; 2026-02-21T08:30:48.5576196Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5576261Z shl.b16 %rs1012, %rs1011, 4; 2026-02-21T08:30:48.5576428Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5576492Z ld.shared.s8 %rs1013, [%r162+1280]; 2026-02-21T08:30:48.5576651Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5576716Z shl.b16 %rs1014, %rs1013, 4; 2026-02-21T08:30:48.5576880Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5576943Z ld.shared.s8 %rs1015, [%r162+1408]; 2026-02-21T08:30:48.5577115Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5577234Z shl.b16 %rs1016, %rs1015, 4; 2026-02-21T08:30:48.5577400Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5577472Z ld.shared.s8 %rs1017, [%r162+1536]; 2026-02-21T08:30:48.5577635Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5577693Z shl.b16 %rs1018, %rs1017, 4; 2026-02-21T08:30:48.5577867Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5577932Z ld.shared.s8 %rs1019, [%r162+1664]; 
2026-02-21T08:30:48.5578097Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5578168Z shl.b16 %rs1020, %rs1019, 4; 2026-02-21T08:30:48.5578380Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5578446Z ld.shared.s8 %rs1021, [%r162+1792]; 2026-02-21T08:30:48.5578612Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5578680Z shl.b16 %rs1022, %rs1021, 4; 2026-02-21T08:30:48.5578841Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5578904Z ld.shared.s8 %rs1023, [%r2134]; 2026-02-21T08:30:48.5579083Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5579144Z shl.b16 %rs1024, %rs1023, 4; 2026-02-21T08:30:48.5579310Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5579386Z ld.shared.s8 %rs1025, [%r162+2048]; 2026-02-21T08:30:48.5579552Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5579613Z shl.b16 %rs1026, %rs1025, 4; 2026-02-21T08:30:48.5579788Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5579852Z ld.shared.s8 %rs1027, [%r162+2176]; 2026-02-21T08:30:48.5580017Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5580075Z shl.b16 %rs1028, %rs1027, 4; 2026-02-21T08:30:48.5580244Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5580306Z ld.shared.s8 %rs1029, [%r162+2304]; 2026-02-21T08:30:48.5580469Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5580535Z shl.b16 %rs1030, %rs1029, 4; 2026-02-21T08:30:48.5580696Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5580761Z ld.shared.s8 
%rs1031, [%r162+2432]; 2026-02-21T08:30:48.5580934Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5580992Z shl.b16 %rs1032, %rs1031, 4; 2026-02-21T08:30:48.5581155Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5581227Z ld.shared.s8 %rs1033, [%r162+2560]; 2026-02-21T08:30:48.5581390Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5581448Z shl.b16 %rs1034, %rs1033, 4; 2026-02-21T08:30:48.5581658Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5581729Z ld.shared.s8 %rs1035, [%r162+2688]; 2026-02-21T08:30:48.5581895Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5581953Z shl.b16 %rs1036, %rs1035, 4; 2026-02-21T08:30:48.5582130Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5582254Z ld.shared.s8 %rs1037, [%r162+2816]; 2026-02-21T08:30:48.5582414Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5582481Z shl.b16 %rs1038, %rs1037, 4; 2026-02-21T08:30:48.5582639Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5582702Z ld.shared.s8 %rs1039, [%r2135]; 2026-02-21T08:30:48.5582865Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5582925Z shl.b16 %rs1040, %rs1039, 4; 2026-02-21T08:30:48.5583085Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5583149Z ld.shared.s8 %rs1041, [%r162+3072]; 2026-02-21T08:30:48.5583363Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5583424Z shl.b16 %rs1042, %rs1041, 4; 2026-02-21T08:30:48.5583588Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5583658Z ld.shared.s8 %rs1043, [%r162+3200]; 2026-02-21T08:30:48.5583824Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5583883Z shl.b16 %rs1044, %rs1043, 4; 2026-02-21T08:30:48.5584056Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5584117Z ld.shared.s8 %rs1045, [%r162+3328]; 2026-02-21T08:30:48.5584285Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5584352Z shl.b16 %rs1046, %rs1045, 4; 2026-02-21T08:30:48.5584516Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5584581Z ld.shared.s8 %rs1047, [%r162+3456]; 2026-02-21T08:30:48.5584757Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5584816Z shl.b16 %rs1048, %rs1047, 4; 2026-02-21T08:30:48.5584982Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5585044Z ld.shared.s8 %rs1049, [%r162+3584]; 2026-02-21T08:30:48.5585214Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5585272Z shl.b16 %rs1050, %rs1049, 4; 2026-02-21T08:30:48.5585436Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5585504Z ld.shared.s8 %rs1051, [%r162+3712]; 2026-02-21T08:30:48.5585669Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5585729Z shl.b16 %rs1052, %rs1051, 4; 2026-02-21T08:30:48.5585903Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5585965Z ld.shared.s8 %rs1053, [%r162+3840]; 2026-02-21T08:30:48.5586132Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5586196Z shl.b16 %rs1054, %rs1053, 4; 2026-02-21T08:30:48.5586358Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5586422Z ld.shared.s8 %rs1055, [%r2136]; 2026-02-21T08:30:48.5586587Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5586651Z shl.b16 %rs1056, %rs1055, 4; 2026-02-21T08:30:48.5586816Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5586878Z cvt.s16.s8 %rs1057, %rs994; 2026-02-21T08:30:48.5586946Z shr.s16 %rs1058, %rs1057, 4; 2026-02-21T08:30:48.5587042Z cvt.s16.s8 %rs1059, %rs996; 2026-02-21T08:30:48.5587100Z shr.s16 %rs1060, %rs1059, 4; 2026-02-21T08:30:48.5587166Z shr.s16 %rs1061, %rs993, 4; 2026-02-21T08:30:48.5587224Z shr.s16 %rs1062, %rs995, 4; 2026-02-21T08:30:48.5587388Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5587451Z cvt.rn.f32.s16 %r2137, %rs1062; 2026-02-21T08:30:48.5587520Z cvt.rn.f32.s16 %r2138, %rs1061; 2026-02-21T08:30:48.5587581Z cvt.rn.f32.s16 %r2139, %rs1060; 2026-02-21T08:30:48.5587641Z cvt.rn.f32.s16 %r2140, %rs1058; 2026-02-21T08:30:48.5587812Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5587870Z cvt.s16.s8 %rs1063, %rs998; 2026-02-21T08:30:48.5587928Z shr.s16 %rs1064, %rs1063, 4; 2026-02-21T08:30:48.5587992Z cvt.s16.s8 %rs1065, %rs1000; 2026-02-21T08:30:48.5588049Z shr.s16 %rs1066, %rs1065, 4; 2026-02-21T08:30:48.5588145Z shr.s16 %rs1067, %rs997, 4; 2026-02-21T08:30:48.5588206Z shr.s16 %rs1068, %rs999, 4; 2026-02-21T08:30:48.5588378Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5588441Z cvt.rn.f32.s16 %r2141, %rs1068; 2026-02-21T08:30:48.5588502Z cvt.rn.f32.s16 %r2142, %rs1067; 2026-02-21T08:30:48.5588573Z cvt.rn.f32.s16 %r2143, %rs1066; 2026-02-21T08:30:48.5588634Z cvt.rn.f32.s16 %r2144, %rs1064; 2026-02-21T08:30:48.5588802Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5588873Z cvt.s16.s8 %rs1069, %rs1002; 2026-02-21T08:30:48.5588931Z shr.s16 %rs1070, %rs1069, 4; 2026-02-21T08:30:48.5588991Z cvt.s16.s8 %rs1071, %rs1004; 2026-02-21T08:30:48.5589048Z shr.s16 %rs1072, %rs1071, 4; 2026-02-21T08:30:48.5589116Z shr.s16 %rs1073, %rs1001, 4; 2026-02-21T08:30:48.5589173Z shr.s16 %rs1074, %rs1003, 4; 2026-02-21T08:30:48.5589344Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5589412Z cvt.rn.f32.s16 %r2145, %rs1074; 2026-02-21T08:30:48.5589471Z cvt.rn.f32.s16 %r2146, %rs1073; 2026-02-21T08:30:48.5589529Z cvt.rn.f32.s16 %r2147, %rs1072; 2026-02-21T08:30:48.5589588Z cvt.rn.f32.s16 %r2148, %rs1070; 2026-02-21T08:30:48.5589764Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5589823Z cvt.s16.s8 %rs1075, %rs1006; 2026-02-21T08:30:48.5589880Z shr.s16 %rs1076, %rs1075, 4; 2026-02-21T08:30:48.5589945Z cvt.s16.s8 %rs1077, %rs1008; 2026-02-21T08:30:48.5590003Z shr.s16 %rs1078, %rs1077, 4; 2026-02-21T08:30:48.5590059Z shr.s16 %rs1079, %rs1005, 4; 2026-02-21T08:30:48.5590122Z shr.s16 %rs1080, %rs1007, 4; 2026-02-21T08:30:48.5590288Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5590347Z cvt.rn.f32.s16 %r2149, %rs1080; 2026-02-21T08:30:48.5590408Z cvt.rn.f32.s16 %r2150, %rs1079; 2026-02-21T08:30:48.5590477Z cvt.rn.f32.s16 %r2151, %rs1078; 2026-02-21T08:30:48.5590536Z cvt.rn.f32.s16 %r2152, %rs1076; 2026-02-21T08:30:48.5590705Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5590772Z cvt.s16.s8 %rs1081, %rs1010; 2026-02-21T08:30:48.5590829Z shr.s16 %rs1082, %rs1081, 4; 2026-02-21T08:30:48.5590887Z cvt.s16.s8 %rs1083, %rs1012; 2026-02-21T08:30:48.5590951Z shr.s16 %rs1084, %rs1083, 4; 2026-02-21T08:30:48.5591007Z shr.s16 %rs1085, %rs1009, 4; 2026-02-21T08:30:48.5591064Z shr.s16 %rs1086, %rs1011, 4; 2026-02-21T08:30:48.5591230Z .loc 1 77 
32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5591298Z cvt.rn.f32.s16 %r2153, %rs1086; 2026-02-21T08:30:48.5591356Z cvt.rn.f32.s16 %r2154, %rs1085; 2026-02-21T08:30:48.5591416Z cvt.rn.f32.s16 %r2155, %rs1084; 2026-02-21T08:30:48.5591485Z cvt.rn.f32.s16 %r2156, %rs1082; 2026-02-21T08:30:48.5591730Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5591791Z cvt.s16.s8 %rs1087, %rs1014; 2026-02-21T08:30:48.5591857Z shr.s16 %rs1088, %rs1087, 4; 2026-02-21T08:30:48.5591916Z cvt.s16.s8 %rs1089, %rs1016; 2026-02-21T08:30:48.5591973Z shr.s16 %rs1090, %rs1089, 4; 2026-02-21T08:30:48.5592029Z shr.s16 %rs1091, %rs1013, 4; 2026-02-21T08:30:48.5592093Z shr.s16 %rs1092, %rs1015, 4; 2026-02-21T08:30:48.5592258Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5592319Z cvt.rn.f32.s16 %r2157, %rs1092; 2026-02-21T08:30:48.5592385Z cvt.rn.f32.s16 %r2158, %rs1091; 2026-02-21T08:30:48.5592444Z cvt.rn.f32.s16 %r2159, %rs1090; 2026-02-21T08:30:48.5592501Z cvt.rn.f32.s16 %r2160, %rs1088; 2026-02-21T08:30:48.5592724Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5592796Z cvt.s16.s8 %rs1093, %rs1018; 2026-02-21T08:30:48.5592857Z shr.s16 %rs1094, %rs1093, 4; 2026-02-21T08:30:48.5592916Z cvt.s16.s8 %rs1095, %rs1020; 2026-02-21T08:30:48.5592981Z shr.s16 %rs1096, %rs1095, 4; 2026-02-21T08:30:48.5593038Z shr.s16 %rs1097, %rs1017, 4; 2026-02-21T08:30:48.5593095Z shr.s16 %rs1098, %rs1019, 4; 2026-02-21T08:30:48.5593268Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5593328Z cvt.rn.f32.s16 %r2161, %rs1098; 2026-02-21T08:30:48.5593386Z cvt.rn.f32.s16 %r2162, %rs1097; 2026-02-21T08:30:48.5593445Z cvt.rn.f32.s16 %r2163, %rs1096; 2026-02-21T08:30:48.5593508Z cvt.rn.f32.s16 %r2164, %rs1094; 2026-02-21T08:30:48.5593676Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5593734Z cvt.s16.s8 %rs1099, %rs1022; 2026-02-21T08:30:48.5593797Z shr.s16 %rs1100, %rs1099, 4; 2026-02-21T08:30:48.5593856Z cvt.s16.s8 %rs1101, %rs1024; 2026-02-21T08:30:48.5593915Z shr.s16 %rs1102, %rs1101, 4; 2026-02-21T08:30:48.5593978Z shr.s16 %rs1103, %rs1021, 4; 2026-02-21T08:30:48.5594035Z shr.s16 %rs1104, %rs1023, 4; 2026-02-21T08:30:48.5594197Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5594257Z cvt.rn.f32.s16 %r2165, %rs1104; 2026-02-21T08:30:48.5594325Z cvt.rn.f32.s16 %r2166, %rs1103; 2026-02-21T08:30:48.5594384Z cvt.rn.f32.s16 %r2167, %rs1102; 2026-02-21T08:30:48.5594442Z cvt.rn.f32.s16 %r2168, %rs1100; 2026-02-21T08:30:48.5594612Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5594671Z cvt.s16.s8 %rs1105, %rs1026; 2026-02-21T08:30:48.5594727Z shr.s16 %rs1106, %rs1105, 4; 2026-02-21T08:30:48.5594793Z cvt.s16.s8 %rs1107, %rs1028; 2026-02-21T08:30:48.5594850Z shr.s16 %rs1108, %rs1107, 4; 2026-02-21T08:30:48.5594910Z shr.s16 %rs1109, %rs1025, 4; 2026-02-21T08:30:48.5594971Z shr.s16 %rs1110, %rs1027, 4; 2026-02-21T08:30:48.5595141Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5595202Z cvt.rn.f32.s16 %r2169, %rs1110; 2026-02-21T08:30:48.5595261Z cvt.rn.f32.s16 %r2170, %rs1109; 2026-02-21T08:30:48.5595327Z cvt.rn.f32.s16 %r2171, %rs1108; 2026-02-21T08:30:48.5595385Z cvt.rn.f32.s16 %r2172, %rs1106; 2026-02-21T08:30:48.5595550Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5595610Z cvt.s16.s8 %rs1111, %rs1030; 2026-02-21T08:30:48.5595676Z shr.s16 %rs1112, %rs1111, 4; 2026-02-21T08:30:48.5595732Z cvt.s16.s8 %rs1113, %rs1032; 2026-02-21T08:30:48.5595789Z shr.s16 %rs1114, %rs1113, 4; 2026-02-21T08:30:48.5595855Z shr.s16 %rs1115, %rs1029, 4; 2026-02-21T08:30:48.5595913Z 
shr.s16 %rs1116, %rs1031, 4; 2026-02-21T08:30:48.5596079Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5596192Z cvt.rn.f32.s16 %r2173, %rs1116; 2026-02-21T08:30:48.5596252Z cvt.rn.f32.s16 %r2174, %rs1115; 2026-02-21T08:30:48.5596310Z cvt.rn.f32.s16 %r2175, %rs1114; 2026-02-21T08:30:48.5596370Z cvt.rn.f32.s16 %r2176, %rs1112; 2026-02-21T08:30:48.5596545Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5596606Z cvt.s16.s8 %rs1117, %rs1034; 2026-02-21T08:30:48.5596663Z shr.s16 %rs1118, %rs1117, 4; 2026-02-21T08:30:48.5596731Z cvt.s16.s8 %rs1119, %rs1036; 2026-02-21T08:30:48.5596790Z shr.s16 %rs1120, %rs1119, 4; 2026-02-21T08:30:48.5596849Z shr.s16 %rs1121, %rs1033, 4; 2026-02-21T08:30:48.5596913Z shr.s16 %rs1122, %rs1035, 4; 2026-02-21T08:30:48.5597083Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5597142Z cvt.rn.f32.s16 %r2177, %rs1122; 2026-02-21T08:30:48.5597247Z cvt.rn.f32.s16 %r2178, %rs1121; 2026-02-21T08:30:48.5597318Z cvt.rn.f32.s16 %r2179, %rs1120; 2026-02-21T08:30:48.5597376Z cvt.rn.f32.s16 %r2180, %rs1118; 2026-02-21T08:30:48.5597540Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5597607Z cvt.s16.s8 %rs1123, %rs1038; 2026-02-21T08:30:48.5597664Z shr.s16 %rs1124, %rs1123, 4; 2026-02-21T08:30:48.5597721Z cvt.s16.s8 %rs1125, %rs1040; 2026-02-21T08:30:48.5597780Z shr.s16 %rs1126, %rs1125, 4; 2026-02-21T08:30:48.5597843Z shr.s16 %rs1127, %rs1037, 4; 2026-02-21T08:30:48.5597900Z shr.s16 %rs1128, %rs1039, 4; 2026-02-21T08:30:48.5598065Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5598131Z cvt.rn.f32.s16 %r2181, %rs1128; 2026-02-21T08:30:48.5598190Z cvt.rn.f32.s16 %r2182, %rs1127; 2026-02-21T08:30:48.5598250Z cvt.rn.f32.s16 %r2183, %rs1126; 2026-02-21T08:30:48.5598318Z cvt.rn.f32.s16 %r2184, %rs1124; 
2026-02-21T08:30:48.5598484Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5598542Z cvt.s16.s8 %rs1129, %rs1042; 2026-02-21T08:30:48.5598599Z shr.s16 %rs1130, %rs1129, 4; 2026-02-21T08:30:48.5598665Z cvt.s16.s8 %rs1131, %rs1044; 2026-02-21T08:30:48.5598722Z shr.s16 %rs1132, %rs1131, 4; 2026-02-21T08:30:48.5598780Z shr.s16 %rs1133, %rs1041, 4; 2026-02-21T08:30:48.5598845Z shr.s16 %rs1134, %rs1043, 4; 2026-02-21T08:30:48.5599011Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5599072Z cvt.rn.f32.s16 %r2185, %rs1134; 2026-02-21T08:30:48.5599138Z cvt.rn.f32.s16 %r2186, %rs1133; 2026-02-21T08:30:48.5599197Z cvt.rn.f32.s16 %r2187, %rs1132; 2026-02-21T08:30:48.5599257Z cvt.rn.f32.s16 %r2188, %rs1130; 2026-02-21T08:30:48.5599423Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5599490Z cvt.s16.s8 %rs1135, %rs1046; 2026-02-21T08:30:48.5599547Z shr.s16 %rs1136, %rs1135, 4; 2026-02-21T08:30:48.5599605Z cvt.s16.s8 %rs1137, %rs1048; 2026-02-21T08:30:48.5599671Z shr.s16 %rs1138, %rs1137, 4; 2026-02-21T08:30:48.5599728Z shr.s16 %rs1139, %rs1045, 4; 2026-02-21T08:30:48.5599785Z shr.s16 %rs1140, %rs1047, 4; 2026-02-21T08:30:48.5599957Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5600018Z cvt.rn.f32.s16 %r2189, %rs1140; 2026-02-21T08:30:48.5600075Z cvt.rn.f32.s16 %r2190, %rs1139; 2026-02-21T08:30:48.5600135Z cvt.rn.f32.s16 %r2191, %rs1138; 2026-02-21T08:30:48.5600199Z cvt.rn.f32.s16 %r2192, %rs1136; 2026-02-21T08:30:48.5600366Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5600424Z cvt.s16.s8 %rs1141, %rs1050; 2026-02-21T08:30:48.5600488Z shr.s16 %rs1142, %rs1141, 4; 2026-02-21T08:30:48.5600547Z cvt.s16.s8 %rs1143, %rs1052; 2026-02-21T08:30:48.5600644Z shr.s16 %rs1144, %rs1143, 4; 2026-02-21T08:30:48.5600701Z shr.s16 %rs1145, 
%rs1049, 4; 2026-02-21T08:30:48.5600765Z shr.s16 %rs1146, %rs1051, 4; 2026-02-21T08:30:48.5600929Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5600987Z cvt.rn.f32.s16 %r2193, %rs1146; 2026-02-21T08:30:48.5601053Z cvt.rn.f32.s16 %r2194, %rs1145; 2026-02-21T08:30:48.5601111Z cvt.rn.f32.s16 %r2195, %rs1144; 2026-02-21T08:30:48.5601169Z cvt.rn.f32.s16 %r2196, %rs1142; 2026-02-21T08:30:48.5601338Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5601395Z cvt.s16.s8 %rs1147, %rs1054; 2026-02-21T08:30:48.5601452Z shr.s16 %rs1148, %rs1147, 4; 2026-02-21T08:30:48.5601509Z cvt.s16.s8 %rs1149, %rs1056; 2026-02-21T08:30:48.5601603Z shr.s16 %rs1150, %rs1149, 4; 2026-02-21T08:30:48.5601705Z shr.s16 %rs1151, %rs1053, 4; 2026-02-21T08:30:48.5601766Z shr.s16 %rs1152, %rs1055, 4; 2026-02-21T08:30:48.5601940Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5602000Z cvt.rn.f32.s16 %r2197, %rs1152; 2026-02-21T08:30:48.5602061Z cvt.rn.f32.s16 %r2198, %rs1151; 2026-02-21T08:30:48.5602127Z cvt.rn.f32.s16 %r2199, %rs1150; 2026-02-21T08:30:48.5602186Z cvt.rn.f32.s16 %r2200, %rs1148; 2026-02-21T08:30:48.5602285Z st.shared.v4.b32 [%r63], {%r2140, %r2138, %r2139, %r2137}; 2026-02-21T08:30:48.5602392Z st.shared.v4.b32 [%r63+16384], {%r2172, %r2170, %r2171, %r2169}; 2026-02-21T08:30:48.5602492Z st.shared.v4.b32 [%r64], {%r2144, %r2142, %r2143, %r2141}; 2026-02-21T08:30:48.5602593Z st.shared.v4.b32 [%r64+16384], {%r2176, %r2174, %r2175, %r2173}; 2026-02-21T08:30:48.5602684Z st.shared.v4.b32 [%r65], {%r2148, %r2146, %r2147, %r2145}; 2026-02-21T08:30:48.5602790Z st.shared.v4.b32 [%r65+16384], {%r2180, %r2178, %r2179, %r2177}; 2026-02-21T08:30:48.5602883Z st.shared.v4.b32 [%r66], {%r2152, %r2150, %r2151, %r2149}; 2026-02-21T08:30:48.5602983Z st.shared.v4.b32 [%r66+16384], {%r2184, %r2182, %r2183, %r2181}; 2026-02-21T08:30:48.5603079Z 
st.shared.v4.b32 [%r67], {%r2156, %r2154, %r2155, %r2153}; 2026-02-21T08:30:48.5603176Z st.shared.v4.b32 [%r67+16384], {%r2188, %r2186, %r2187, %r2185}; 2026-02-21T08:30:48.5603264Z st.shared.v4.b32 [%r68], {%r2160, %r2158, %r2159, %r2157}; 2026-02-21T08:30:48.5603370Z st.shared.v4.b32 [%r68+16384], {%r2192, %r2190, %r2191, %r2189}; 2026-02-21T08:30:48.5603461Z st.shared.v4.b32 [%r69], {%r2164, %r2162, %r2163, %r2161}; 2026-02-21T08:30:48.5603556Z st.shared.v4.b32 [%r69+16384], {%r2196, %r2194, %r2195, %r2193}; 2026-02-21T08:30:48.5603646Z st.shared.v4.b32 [%r70], {%r2168, %r2166, %r2167, %r2165}; 2026-02-21T08:30:48.5603752Z st.shared.v4.b32 [%r70+16384], {%r2200, %r2198, %r2199, %r2197}; 2026-02-21T08:30:48.5603925Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5603988Z shl.b32 %r2201, %r4035, 3; 2026-02-21T08:30:48.5604057Z add.s32 %r2202, %r294, %r2201; 2026-02-21T08:30:48.5604117Z add.s32 %r4033, %r2202, 57344; 2026-02-21T08:30:48.5604171Z $L__tmp229: 2026-02-21T08:30:48.5604398Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5604457Z // begin inline asm 2026-02-21T08:30:48.5604531Z fence.proxy.async.shared::cta; 2026-02-21T08:30:48.5604587Z // end inline asm 2026-02-21T08:30:48.5604651Z bar.sync 0; 2026-02-21T08:30:48.5604713Z @%p14 bra $L__BB0_20; 2026-02-21T08:30:48.5604815Z // %bb.19: // in Loop: Header=BB0_18 Depth=2 2026-02-21T08:30:48.5604893Z elect.sync %r2227|%p141, -1; 2026-02-21T08:30:48.5604953Z mov.b32 %r2205, 69208336; 2026-02-21T08:30:48.5605012Z // begin inline asm 2026-02-21T08:30:48.5605188Z @%p141 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 0 ], %rd737, %r2205, %p140; 2026-02-21T08:30:48.5605300Z // end inline asm 2026-02-21T08:30:48.5605358Z // begin inline asm 2026-02-21T08:30:48.5605513Z @%p141 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 8 ], %rd738, %r2205, %p140; 
2026-02-21T08:30:48.5605577Z // end inline asm 2026-02-21T08:30:48.5605633Z // begin inline asm 2026-02-21T08:30:48.5605785Z @%p141 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 16 ], %rd739, %r2205, %p140; 2026-02-21T08:30:48.5605848Z // end inline asm 2026-02-21T08:30:48.5605903Z // begin inline asm 2026-02-21T08:30:48.5606054Z @%p141 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 24 ], %rd740, %r2205, %p140; 2026-02-21T08:30:48.5606116Z // end inline asm 2026-02-21T08:30:48.5606172Z // begin inline asm 2026-02-21T08:30:48.5606317Z @%p141 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 32 ], %rd741, %r2205, %p140; 2026-02-21T08:30:48.5606371Z // end inline asm 2026-02-21T08:30:48.5606473Z // begin inline asm 2026-02-21T08:30:48.5606621Z @%p141 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 40 ], %rd742, %r2205, %p140; 2026-02-21T08:30:48.5606677Z // end inline asm 2026-02-21T08:30:48.5606739Z // begin inline asm 2026-02-21T08:30:48.5606886Z @%p141 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 48 ], %rd743, %r2205, %p140; 2026-02-21T08:30:48.5606940Z // end inline asm 2026-02-21T08:30:48.5607005Z // begin inline asm 2026-02-21T08:30:48.5607151Z @%p141 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 56 ], %rd744, %r2205, %p140; 2026-02-21T08:30:48.5607207Z // end inline asm 2026-02-21T08:30:48.5607275Z cvt.u64.u32 %rd561, %r4033; 2026-02-21T08:30:48.5607331Z // begin inline asm 2026-02-21T08:30:48.5607459Z @%p141 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd561]; 2026-02-21T08:30:48.5607515Z // end inline asm 2026-02-21T08:30:48.5607581Z bra.uni $L__BB0_20; 2026-02-21T08:30:48.5607687Z $L__BB0_21: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5607782Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:30:48.5607846Z mov.b32 %r4041, 1; 2026-02-21T08:30:48.5608070Z .loc 2 291 36 // standard.py:291:36 @[ 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5608127Z // begin inline asm 2026-02-21T08:30:48.5608184Z 2026-02-21T08:30:48.5608235Z { 2026-02-21T08:30:48.5608296Z .reg .pred complete; 2026-02-21T08:30:48.5608353Z waitLoop: 2026-02-21T08:30:48.5608485Z mbarrier.try_wait.parity.shared.b64 complete, [%r4033], %r4041; 2026-02-21T08:30:48.5608555Z @!complete bra.uni waitLoop; 2026-02-21T08:30:48.5608608Z } 2026-02-21T08:30:48.5608612Z 2026-02-21T08:30:48.5608678Z // end inline asm 2026-02-21T08:30:48.5608735Z $L__tmp230: 2026-02-21T08:30:48.5608912Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5608980Z cp.async.wait_group 0; 2026-02-21T08:30:48.5609049Z bar.sync 0; 2026-02-21T08:30:48.5609112Z add.s32 %r4039, %r294, 57344; 2026-02-21T08:30:48.5609171Z // begin inline asm 2026-02-21T08:30:48.5609272Z @%p10 mbarrier.inval.shared::cta.b64 [%r4039]; 2026-02-21T08:30:48.5609329Z // end inline asm 2026-02-21T08:30:48.5609386Z bar.sync 0; 2026-02-21T08:30:48.5609452Z // begin inline asm 2026-02-21T08:30:48.5609540Z @%p10 mbarrier.inval.shared::cta.b64 [%r2245]; 2026-02-21T08:30:48.5609597Z // end inline asm 2026-02-21T08:30:48.5609771Z .loc 1 88 43 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:43 2026-02-21T08:30:48.5609841Z shl.b32 %r2515, %r145, 13; 2026-02-21T08:30:48.5609904Z shl.b32 %r2516, %r146, 13; 2026-02-21T08:30:48.5609965Z shl.b32 %r2517, %r147, 13; 2026-02-21T08:30:48.5610029Z shl.b32 %r2518, %r148, 13; 2026-02-21T08:30:48.5610086Z shl.b32 %r2519, %r149, 13; 2026-02-21T08:30:48.5610145Z shl.b32 %r2520, %r150, 13; 2026-02-21T08:30:48.5610205Z shl.b32 %r2521, %r151, 13; 2026-02-21T08:30:48.5610323Z shl.b32 %r2522, %r152, 13; 2026-02-21T08:30:48.5610496Z .loc 1 88 50 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:50 2026-02-21T08:30:48.5610561Z add.s32 %r2523, %r2515, %r143; 2026-02-21T08:30:48.5610630Z add.s32 %r2524, %r2516, %r143; 
2026-02-21T08:30:48.5610690Z add.s32 %r2525, %r2517, %r143; 2026-02-21T08:30:48.5610750Z add.s32 %r2526, %r2518, %r143; 2026-02-21T08:30:48.5610816Z add.s32 %r2527, %r2519, %r143; 2026-02-21T08:30:48.5610875Z add.s32 %r2528, %r2520, %r143; 2026-02-21T08:30:48.5610936Z add.s32 %r2529, %r2521, %r143; 2026-02-21T08:30:48.5610996Z add.s32 %r2530, %r2522, %r143; 2026-02-21T08:30:48.5611175Z .loc 1 88 22 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:22 2026-02-21T08:30:48.5611249Z mad.wide.s32 %rd573, %r2523, 2, %rd94; 2026-02-21T08:30:48.5611321Z mad.wide.s32 %rd574, %r2524, 2, %rd94; 2026-02-21T08:30:48.5611456Z mad.wide.s32 %rd575, %r2525, 2, %rd94; 2026-02-21T08:30:48.5611525Z mad.wide.s32 %rd576, %r2526, 2, %rd94; 2026-02-21T08:30:48.5611637Z mad.wide.s32 %rd577, %r2527, 2, %rd94; 2026-02-21T08:30:48.5611705Z mad.wide.s32 %rd578, %r2528, 2, %rd94; 2026-02-21T08:30:48.5611777Z mad.wide.s32 %rd579, %r2529, 2, %rd94; 2026-02-21T08:30:48.5611842Z mad.wide.s32 %rd580, %r2530, 2, %rd94; 2026-02-21T08:30:48.5611898Z $L__tmp231: 2026-02-21T08:30:48.5612129Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5612189Z // begin inline asm 2026-02-21T08:30:48.5612514Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2246, %r2247, %r2248, %r2249, %r2250, %r2251, %r2252, %r2253, %r2254, %r2255, %r2256, %r2257, %r2258, %r2259, %r2260, %r2261}, [%r3028 + 0], 64; 2026-02-21T08:30:48.5612581Z // end inline asm 2026-02-21T08:30:48.5612640Z // begin inline asm 2026-02-21T08:30:48.5612967Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2263, %r2264, %r2265, %r2266, %r2267, %r2268, %r2269, %r2270, %r2271, %r2272, %r2273, %r2274, %r2275, %r2276, %r2277, %r2278}, [%r3028 + 16], 64; 2026-02-21T08:30:48.5613039Z // end inline asm 2026-02-21T08:30:48.5613099Z // begin inline asm 2026-02-21T08:30:48.5613413Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2280, %r2281, %r2282, %r2283, %r2284, %r2285, %r2286, %r2287, 
%r2288, %r2289, %r2290, %r2291, %r2292, %r2293, %r2294, %r2295}, [%r3028 + 32], 64; 2026-02-21T08:30:48.5613481Z // end inline asm 2026-02-21T08:30:48.5613541Z // begin inline asm 2026-02-21T08:30:48.5613850Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2297, %r2298, %r2299, %r2300, %r2301, %r2302, %r2303, %r2304, %r2305, %r2306, %r2307, %r2308, %r2309, %r2310, %r2311, %r2312}, [%r3028 + 48], 64; 2026-02-21T08:30:48.5613909Z // end inline asm 2026-02-21T08:30:48.5613979Z // begin inline asm 2026-02-21T08:30:48.5614053Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:48.5614112Z // end inline asm 2026-02-21T08:30:48.5614194Z cvt.u64.u32 %rd593, %r2246; 2026-02-21T08:30:48.5614265Z cvt.u64.u32 %rd594, %r2247; 2026-02-21T08:30:48.5614329Z shl.b64 %rd595, %rd594, 32; 2026-02-21T08:30:48.5614393Z or.b64 %rd596, %rd593, %rd595; 2026-02-21T08:30:48.5614456Z $L__tmp232: 2026-02-21T08:30:48.5614631Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5614698Z mov.b64 {%r2531, %r2532}, %rd596; 2026-02-21T08:30:48.5614780Z cvt.rn.bf16x2.f32 %r2533, %r2532, %r2531; 2026-02-21T08:30:48.5614834Z $L__tmp233: 2026-02-21T08:30:48.5615060Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5615129Z cvt.u64.u32 %rd597, %r2248; 2026-02-21T08:30:48.5615189Z cvt.u64.u32 %rd598, %r2249; 2026-02-21T08:30:48.5615249Z shl.b64 %rd599, %rd598, 32; 2026-02-21T08:30:48.5615312Z or.b64 %rd600, %rd597, %rd599; 2026-02-21T08:30:48.5615373Z $L__tmp234: 2026-02-21T08:30:48.5615548Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5615664Z mov.b64 {%r2534, %r2535}, %rd600; 2026-02-21T08:30:48.5615748Z cvt.rn.bf16x2.f32 %r2536, %r2535, %r2534; 2026-02-21T08:30:48.5615802Z $L__tmp235: 2026-02-21T08:30:48.5616025Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 
2026-02-21T08:30:48.5616101Z cvt.u64.u32 %rd601, %r2250; 2026-02-21T08:30:48.5616158Z cvt.u64.u32 %rd602, %r2251; 2026-02-21T08:30:48.5616216Z shl.b64 %rd603, %rd602, 32; 2026-02-21T08:30:48.5616275Z or.b64 %rd604, %rd601, %rd603; 2026-02-21T08:30:48.5616335Z $L__tmp236: 2026-02-21T08:30:48.5616498Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5616559Z mov.b64 {%r2537, %r2538}, %rd604; 2026-02-21T08:30:48.5616637Z cvt.rn.bf16x2.f32 %r2539, %r2538, %r2537; 2026-02-21T08:30:48.5616689Z $L__tmp237: 2026-02-21T08:30:48.5616947Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5617017Z cvt.u64.u32 %rd605, %r2252; 2026-02-21T08:30:48.5617074Z cvt.u64.u32 %rd606, %r2253; 2026-02-21T08:30:48.5617133Z shl.b64 %rd607, %rd606, 32; 2026-02-21T08:30:48.5617192Z or.b64 %rd608, %rd605, %rd607; 2026-02-21T08:30:48.5617251Z $L__tmp238: 2026-02-21T08:30:48.5617417Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5617477Z mov.b64 {%r2540, %r2541}, %rd608; 2026-02-21T08:30:48.5617553Z cvt.rn.bf16x2.f32 %r2542, %r2541, %r2540; 2026-02-21T08:30:48.5617606Z $L__tmp239: 2026-02-21T08:30:48.5617819Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5617886Z cvt.u64.u32 %rd609, %r2254; 2026-02-21T08:30:48.5617944Z cvt.u64.u32 %rd610, %r2255; 2026-02-21T08:30:48.5618003Z shl.b64 %rd611, %rd610, 32; 2026-02-21T08:30:48.5618065Z or.b64 %rd612, %rd609, %rd611; 2026-02-21T08:30:48.5618123Z $L__tmp240: 2026-02-21T08:30:48.5618287Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5618347Z mov.b64 {%r2543, %r2544}, %rd612; 2026-02-21T08:30:48.5618421Z cvt.rn.bf16x2.f32 %r2545, %r2544, %r2543; 2026-02-21T08:30:48.5618473Z $L__tmp241: 2026-02-21T08:30:48.5618682Z .loc 2 291 36 // standard.py:291:36 @[ 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5618741Z cvt.u64.u32 %rd613, %r2256; 2026-02-21T08:30:48.5618805Z cvt.u64.u32 %rd614, %r2257; 2026-02-21T08:30:48.5618861Z shl.b64 %rd615, %rd614, 32; 2026-02-21T08:30:48.5618920Z or.b64 %rd616, %rd613, %rd615; 2026-02-21T08:30:48.5618978Z $L__tmp242: 2026-02-21T08:30:48.5619141Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5619202Z mov.b64 {%r2546, %r2547}, %rd616; 2026-02-21T08:30:48.5619276Z cvt.rn.bf16x2.f32 %r2548, %r2547, %r2546; 2026-02-21T08:30:48.5619328Z $L__tmp243: 2026-02-21T08:30:48.5619536Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5619594Z cvt.u64.u32 %rd617, %r2258; 2026-02-21T08:30:48.5619656Z cvt.u64.u32 %rd618, %r2259; 2026-02-21T08:30:48.5619712Z shl.b64 %rd619, %rd618, 32; 2026-02-21T08:30:48.5619770Z or.b64 %rd620, %rd617, %rd619; 2026-02-21T08:30:48.5619828Z $L__tmp244: 2026-02-21T08:30:48.5619993Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5620052Z mov.b64 {%r2549, %r2550}, %rd620; 2026-02-21T08:30:48.5620126Z cvt.rn.bf16x2.f32 %r2551, %r2550, %r2549; 2026-02-21T08:30:48.5620178Z $L__tmp245: 2026-02-21T08:30:48.5620395Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5620497Z cvt.u64.u32 %rd621, %r2260; 2026-02-21T08:30:48.5620563Z cvt.u64.u32 %rd622, %r2261; 2026-02-21T08:30:48.5620620Z shl.b64 %rd623, %rd622, 32; 2026-02-21T08:30:48.5620679Z or.b64 %rd624, %rd621, %rd623; 2026-02-21T08:30:48.5620738Z $L__tmp246: 2026-02-21T08:30:48.5620906Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5620966Z mov.b64 {%r2552, %r2553}, %rd624; 2026-02-21T08:30:48.5621038Z cvt.rn.bf16x2.f32 %r2554, %r2553, %r2552; 2026-02-21T08:30:48.5621090Z $L__tmp247: 
2026-02-21T08:30:48.5621305Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5621364Z cvt.u64.u32 %rd625, %r2263; 2026-02-21T08:30:48.5621430Z cvt.u64.u32 %rd626, %r2264; 2026-02-21T08:30:48.5621487Z shl.b64 %rd627, %rd626, 32; 2026-02-21T08:30:48.5621617Z or.b64 %rd628, %rd625, %rd627; 2026-02-21T08:30:48.5621680Z $L__tmp248: 2026-02-21T08:30:48.5621845Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5621905Z mov.b64 {%r2555, %r2556}, %rd628; 2026-02-21T08:30:48.5621972Z cvt.rn.bf16x2.f32 %r2557, %r2556, %r2555; 2026-02-21T08:30:48.5622032Z $L__tmp249: 2026-02-21T08:30:48.5622241Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5622299Z cvt.u64.u32 %rd629, %r2265; 2026-02-21T08:30:48.5622364Z cvt.u64.u32 %rd630, %r2266; 2026-02-21T08:30:48.5622423Z shl.b64 %rd631, %rd630, 32; 2026-02-21T08:30:48.5622484Z or.b64 %rd632, %rd629, %rd631; 2026-02-21T08:30:48.5622546Z $L__tmp250: 2026-02-21T08:30:48.5622711Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5622774Z mov.b64 {%r2558, %r2559}, %rd632; 2026-02-21T08:30:48.5622845Z cvt.rn.bf16x2.f32 %r2560, %r2559, %r2558; 2026-02-21T08:30:48.5622909Z $L__tmp251: 2026-02-21T08:30:48.5623118Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5623178Z cvt.u64.u32 %rd633, %r2267; 2026-02-21T08:30:48.5623245Z cvt.u64.u32 %rd634, %r2268; 2026-02-21T08:30:48.5623303Z shl.b64 %rd635, %rd634, 32; 2026-02-21T08:30:48.5623363Z or.b64 %rd636, %rd633, %rd635; 2026-02-21T08:30:48.5623422Z $L__tmp252: 2026-02-21T08:30:48.5623583Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5623643Z mov.b64 {%r2561, %r2562}, %rd636; 2026-02-21T08:30:48.5623710Z cvt.rn.bf16x2.f32 
%r2563, %r2562, %r2561; 2026-02-21T08:30:48.5623771Z $L__tmp253: 2026-02-21T08:30:48.5623980Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5624040Z cvt.u64.u32 %rd637, %r2269; 2026-02-21T08:30:48.5624108Z cvt.u64.u32 %rd638, %r2270; 2026-02-21T08:30:48.5624165Z shl.b64 %rd639, %rd638, 32; 2026-02-21T08:30:48.5624223Z or.b64 %rd640, %rd637, %rd639; 2026-02-21T08:30:48.5624281Z $L__tmp254: 2026-02-21T08:30:48.5624448Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5624508Z mov.b64 {%r2564, %r2565}, %rd640; 2026-02-21T08:30:48.5624577Z cvt.rn.bf16x2.f32 %r2566, %r2565, %r2564; 2026-02-21T08:30:48.5624638Z $L__tmp255: 2026-02-21T08:30:48.5624845Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5624903Z cvt.u64.u32 %rd641, %r2271; 2026-02-21T08:30:48.5624969Z cvt.u64.u32 %rd642, %r2272; 2026-02-21T08:30:48.5625026Z shl.b64 %rd643, %rd642, 32; 2026-02-21T08:30:48.5625085Z or.b64 %rd644, %rd641, %rd643; 2026-02-21T08:30:48.5625136Z $L__tmp256: 2026-02-21T08:30:48.5625314Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5625420Z mov.b64 {%r2567, %r2568}, %rd644; 2026-02-21T08:30:48.5625487Z cvt.rn.bf16x2.f32 %r2569, %r2568, %r2567; 2026-02-21T08:30:48.5625547Z $L__tmp257: 2026-02-21T08:30:48.5625761Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5625820Z cvt.u64.u32 %rd645, %r2273; 2026-02-21T08:30:48.5625883Z cvt.u64.u32 %rd646, %r2274; 2026-02-21T08:30:48.5625940Z shl.b64 %rd647, %rd646, 32; 2026-02-21T08:30:48.5625998Z or.b64 %rd648, %rd645, %rd647; 2026-02-21T08:30:48.5626049Z $L__tmp258: 2026-02-21T08:30:48.5626227Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5626287Z mov.b64 {%r2570, 
%r2571}, %rd648; 2026-02-21T08:30:48.5626353Z cvt.rn.bf16x2.f32 %r2572, %r2571, %r2570; 2026-02-21T08:30:48.5626454Z $L__tmp259: 2026-02-21T08:30:48.5626669Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5626729Z cvt.u64.u32 %rd649, %r2275; 2026-02-21T08:30:48.5626792Z cvt.u64.u32 %rd650, %r2276; 2026-02-21T08:30:48.5626850Z shl.b64 %rd651, %rd650, 32; 2026-02-21T08:30:48.5626909Z or.b64 %rd652, %rd649, %rd651; 2026-02-21T08:30:48.5626960Z $L__tmp260: 2026-02-21T08:30:48.5627134Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5627194Z mov.b64 {%r2573, %r2574}, %rd652; 2026-02-21T08:30:48.5627260Z cvt.rn.bf16x2.f32 %r2575, %r2574, %r2573; 2026-02-21T08:30:48.5627320Z $L__tmp261: 2026-02-21T08:30:48.5627534Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5627592Z cvt.u64.u32 %rd653, %r2277; 2026-02-21T08:30:48.5627657Z cvt.u64.u32 %rd654, %r2278; 2026-02-21T08:30:48.5627717Z shl.b64 %rd655, %rd654, 32; 2026-02-21T08:30:48.5627775Z or.b64 %rd656, %rd653, %rd655; 2026-02-21T08:30:48.5627825Z $L__tmp262: 2026-02-21T08:30:48.5627999Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5628058Z mov.b64 {%r2576, %r2577}, %rd656; 2026-02-21T08:30:48.5628126Z cvt.rn.bf16x2.f32 %r2578, %r2577, %r2576; 2026-02-21T08:30:48.5628183Z $L__tmp263: 2026-02-21T08:30:48.5628397Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5628454Z cvt.u64.u32 %rd657, %r2280; 2026-02-21T08:30:48.5628518Z cvt.u64.u32 %rd658, %r2281; 2026-02-21T08:30:48.5628574Z shl.b64 %rd659, %rd658, 32; 2026-02-21T08:30:48.5628634Z or.b64 %rd660, %rd657, %rd659; 2026-02-21T08:30:48.5628686Z $L__tmp264: 2026-02-21T08:30:48.5628860Z .loc 1 87 28 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5628922Z mov.b64 {%r2579, %r2580}, %rd660; 2026-02-21T08:30:48.5628989Z cvt.rn.bf16x2.f32 %r2581, %r2580, %r2579; 2026-02-21T08:30:48.5629046Z $L__tmp265: 2026-02-21T08:30:48.5629253Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5629310Z cvt.u64.u32 %rd661, %r2282; 2026-02-21T08:30:48.5629374Z cvt.u64.u32 %rd662, %r2283; 2026-02-21T08:30:48.5629433Z shl.b64 %rd663, %rd662, 32; 2026-02-21T08:30:48.5629492Z or.b64 %rd664, %rd661, %rd663; 2026-02-21T08:30:48.5629543Z $L__tmp266: 2026-02-21T08:30:48.5629714Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5629774Z mov.b64 {%r2582, %r2583}, %rd664; 2026-02-21T08:30:48.5629841Z cvt.rn.bf16x2.f32 %r2584, %r2583, %r2582; 2026-02-21T08:30:48.5629900Z $L__tmp267: 2026-02-21T08:30:48.5630113Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5630225Z cvt.u64.u32 %rd665, %r2284; 2026-02-21T08:30:48.5630281Z cvt.u64.u32 %rd666, %r2285; 2026-02-21T08:30:48.5630344Z shl.b64 %rd667, %rd666, 32; 2026-02-21T08:30:48.5630402Z or.b64 %rd668, %rd665, %rd667; 2026-02-21T08:30:48.5630454Z $L__tmp268: 2026-02-21T08:30:48.5630631Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5630692Z mov.b64 {%r2585, %r2586}, %rd668; 2026-02-21T08:30:48.5630760Z cvt.rn.bf16x2.f32 %r2587, %r2586, %r2585; 2026-02-21T08:30:48.5630818Z $L__tmp269: 2026-02-21T08:30:48.5631029Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5631086Z cvt.u64.u32 %rd669, %r2286; 2026-02-21T08:30:48.5631145Z cvt.u64.u32 %rd670, %r2287; 2026-02-21T08:30:48.5631213Z shl.b64 %rd671, %rd670, 32; 2026-02-21T08:30:48.5631313Z or.b64 %rd672, %rd669, %rd671; 
2026-02-21T08:30:48.5631370Z $L__tmp270: 2026-02-21T08:30:48.5631567Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5631631Z mov.b64 {%r2588, %r2589}, %rd672; 2026-02-21T08:30:48.5631701Z cvt.rn.bf16x2.f32 %r2590, %r2589, %r2588; 2026-02-21T08:30:48.5631761Z $L__tmp271: 2026-02-21T08:30:48.5631973Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5632031Z cvt.u64.u32 %rd673, %r2288; 2026-02-21T08:30:48.5632089Z cvt.u64.u32 %rd674, %r2289; 2026-02-21T08:30:48.5632154Z shl.b64 %rd675, %rd674, 32; 2026-02-21T08:30:48.5632213Z or.b64 %rd676, %rd673, %rd675; 2026-02-21T08:30:48.5632265Z $L__tmp272: 2026-02-21T08:30:48.5632441Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5632500Z mov.b64 {%r2591, %r2592}, %rd676; 2026-02-21T08:30:48.5632573Z cvt.rn.bf16x2.f32 %r2593, %r2592, %r2591; 2026-02-21T08:30:48.5632633Z $L__tmp273: 2026-02-21T08:30:48.5632844Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5632904Z cvt.u64.u32 %rd677, %r2290; 2026-02-21T08:30:48.5632961Z cvt.u64.u32 %rd678, %r2291; 2026-02-21T08:30:48.5633027Z shl.b64 %rd679, %rd678, 32; 2026-02-21T08:30:48.5633085Z or.b64 %rd680, %rd677, %rd679; 2026-02-21T08:30:48.5633136Z $L__tmp274: 2026-02-21T08:30:48.5633313Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5633372Z mov.b64 {%r2594, %r2595}, %rd680; 2026-02-21T08:30:48.5633438Z cvt.rn.bf16x2.f32 %r2596, %r2595, %r2594; 2026-02-21T08:30:48.5633491Z $L__tmp275: 2026-02-21T08:30:48.5633710Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5633775Z cvt.u64.u32 %rd681, %r2292; 2026-02-21T08:30:48.5633834Z cvt.u64.u32 %rd682, %r2293; 2026-02-21T08:30:48.5633901Z shl.b64 %rd683, %rd682, 32; 
2026-02-21T08:30:48.5633961Z or.b64 %rd684, %rd681, %rd683; 2026-02-21T08:30:48.5634012Z $L__tmp276: 2026-02-21T08:30:48.5634188Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5634248Z mov.b64 {%r2597, %r2598}, %rd684; 2026-02-21T08:30:48.5634317Z cvt.rn.bf16x2.f32 %r2599, %r2598, %r2597; 2026-02-21T08:30:48.5634368Z $L__tmp277: 2026-02-21T08:30:48.5634586Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5634646Z cvt.u64.u32 %rd685, %r2294; 2026-02-21T08:30:48.5634702Z cvt.u64.u32 %rd686, %r2295; 2026-02-21T08:30:48.5634766Z shl.b64 %rd687, %rd686, 32; 2026-02-21T08:30:48.5634825Z or.b64 %rd688, %rd685, %rd687; 2026-02-21T08:30:48.5634876Z $L__tmp278: 2026-02-21T08:30:48.5635053Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5635158Z mov.b64 {%r2600, %r2601}, %rd688; 2026-02-21T08:30:48.5635225Z cvt.rn.bf16x2.f32 %r2602, %r2601, %r2600; 2026-02-21T08:30:48.5635277Z $L__tmp279: 2026-02-21T08:30:48.5635494Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5635552Z cvt.u64.u32 %rd689, %r2297; 2026-02-21T08:30:48.5635610Z cvt.u64.u32 %rd690, %r2298; 2026-02-21T08:30:48.5635674Z shl.b64 %rd691, %rd690, 32; 2026-02-21T08:30:48.5635734Z or.b64 %rd692, %rd689, %rd691; 2026-02-21T08:30:48.5635784Z $L__tmp280: 2026-02-21T08:30:48.5635955Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5636014Z mov.b64 {%r2603, %r2604}, %rd692; 2026-02-21T08:30:48.5636079Z cvt.rn.bf16x2.f32 %r2605, %r2604, %r2603; 2026-02-21T08:30:48.5636178Z $L__tmp281: 2026-02-21T08:30:48.5636397Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5636456Z cvt.u64.u32 %rd693, %r2299; 2026-02-21T08:30:48.5636514Z cvt.u64.u32 %rd694, 
%r2300; 2026-02-21T08:30:48.5636579Z shl.b64 %rd695, %rd694, 32; 2026-02-21T08:30:48.5636637Z or.b64 %rd696, %rd693, %rd695; 2026-02-21T08:30:48.5636689Z $L__tmp282: 2026-02-21T08:30:48.5636861Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5636920Z mov.b64 {%r2606, %r2607}, %rd696; 2026-02-21T08:30:48.5636987Z cvt.rn.bf16x2.f32 %r2608, %r2607, %r2606; 2026-02-21T08:30:48.5637039Z $L__tmp283: 2026-02-21T08:30:48.5637256Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5637313Z cvt.u64.u32 %rd697, %r2301; 2026-02-21T08:30:48.5637372Z cvt.u64.u32 %rd698, %r2302; 2026-02-21T08:30:48.5637440Z shl.b64 %rd699, %rd698, 32; 2026-02-21T08:30:48.5637500Z or.b64 %rd700, %rd697, %rd699; 2026-02-21T08:30:48.5637552Z $L__tmp284: 2026-02-21T08:30:48.5637716Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5637784Z mov.b64 {%r2609, %r2610}, %rd700; 2026-02-21T08:30:48.5637852Z cvt.rn.bf16x2.f32 %r2611, %r2610, %r2609; 2026-02-21T08:30:48.5637903Z $L__tmp285: 2026-02-21T08:30:48.5638118Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5638177Z cvt.u64.u32 %rd701, %r2303; 2026-02-21T08:30:48.5638234Z cvt.u64.u32 %rd702, %r2304; 2026-02-21T08:30:48.5638299Z shl.b64 %rd703, %rd702, 32; 2026-02-21T08:30:48.5638358Z or.b64 %rd704, %rd701, %rd703; 2026-02-21T08:30:48.5638410Z $L__tmp286: 2026-02-21T08:30:48.5638576Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5638647Z mov.b64 {%r2612, %r2613}, %rd704; 2026-02-21T08:30:48.5638713Z cvt.rn.bf16x2.f32 %r2614, %r2613, %r2612; 2026-02-21T08:30:48.5638765Z $L__tmp287: 2026-02-21T08:30:48.5638983Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5639042Z cvt.u64.u32 
%rd705, %r2305; 2026-02-21T08:30:48.5639100Z cvt.u64.u32 %rd706, %r2306; 2026-02-21T08:30:48.5639165Z shl.b64 %rd707, %rd706, 32; 2026-02-21T08:30:48.5639224Z or.b64 %rd708, %rd705, %rd707; 2026-02-21T08:30:48.5639276Z $L__tmp288: 2026-02-21T08:30:48.5639441Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5639508Z mov.b64 {%r2615, %r2616}, %rd708; 2026-02-21T08:30:48.5639575Z cvt.rn.bf16x2.f32 %r2617, %r2616, %r2615; 2026-02-21T08:30:48.5639627Z $L__tmp289: 2026-02-21T08:30:48.5639851Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5639949Z cvt.u64.u32 %rd709, %r2307; 2026-02-21T08:30:48.5640009Z cvt.u64.u32 %rd710, %r2308; 2026-02-21T08:30:48.5640076Z shl.b64 %rd711, %rd710, 32; 2026-02-21T08:30:48.5640137Z or.b64 %rd712, %rd709, %rd711; 2026-02-21T08:30:48.5640191Z $L__tmp290: 2026-02-21T08:30:48.5640362Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5640431Z mov.b64 {%r2618, %r2619}, %rd712; 2026-02-21T08:30:48.5640497Z cvt.rn.bf16x2.f32 %r2620, %r2619, %r2618; 2026-02-21T08:30:48.5640549Z $L__tmp291: 2026-02-21T08:30:48.5640770Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5640826Z cvt.u64.u32 %rd713, %r2309; 2026-02-21T08:30:48.5640882Z cvt.u64.u32 %rd714, %r2310; 2026-02-21T08:30:48.5640987Z shl.b64 %rd715, %rd714, 32; 2026-02-21T08:30:48.5641050Z or.b64 %rd716, %rd713, %rd715; 2026-02-21T08:30:48.5641101Z $L__tmp292: 2026-02-21T08:30:48.5641264Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5641331Z mov.b64 {%r2621, %r2622}, %rd716; 2026-02-21T08:30:48.5641397Z cvt.rn.bf16x2.f32 %r2623, %r2622, %r2621; 2026-02-21T08:30:48.5641448Z $L__tmp293: 2026-02-21T08:30:48.5641691Z .loc 2 291 36 // standard.py:291:36 @[ 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5641750Z cvt.u64.u32 %rd717, %r2311; 2026-02-21T08:30:48.5641808Z cvt.u64.u32 %rd718, %r2312; 2026-02-21T08:30:48.5641865Z shl.b64 %rd719, %rd718, 32; 2026-02-21T08:30:48.5641932Z or.b64 %rd720, %rd717, %rd719; 2026-02-21T08:30:48.5641984Z $L__tmp294: 2026-02-21T08:30:48.5642150Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5642221Z mov.b64 {%r2624, %r2625}, %rd720; 2026-02-21T08:30:48.5642291Z cvt.rn.bf16x2.f32 %r2626, %r2625, %r2624; 2026-02-21T08:30:48.5642460Z .loc 1 88 81 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:81 2026-02-21T08:30:48.5642565Z st.shared.v4.b32 [%r73], {%r2533, %r2545, %r2557, %r2569}; 2026-02-21T08:30:48.5642662Z st.shared.v4.b32 [%r74], {%r2581, %r2593, %r2605, %r2617}; 2026-02-21T08:30:48.5642717Z bar.sync 0; 2026-02-21T08:30:48.5642775Z // begin inline asm 2026-02-21T08:30:48.5642939Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2314, %r2315, %r2316, %r2317}, [%r888]; 2026-02-21T08:30:48.5642995Z // end inline asm 2026-02-21T08:30:48.5643054Z // begin inline asm 2026-02-21T08:30:48.5643213Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2319, %r2320, %r2321, %r2322}, [%r893]; 2026-02-21T08:30:48.5643271Z // end inline asm 2026-02-21T08:30:48.5643326Z bar.sync 0; 2026-02-21T08:30:48.5643428Z st.shared.v4.b32 [%r73], {%r2536, %r2548, %r2560, %r2572}; 2026-02-21T08:30:48.5643522Z st.shared.v4.b32 [%r74], {%r2584, %r2596, %r2608, %r2620}; 2026-02-21T08:30:48.5643581Z bar.sync 0; 2026-02-21T08:30:48.5643640Z // begin inline asm 2026-02-21T08:30:48.5643799Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2324, %r2325, %r2326, %r2327}, [%r888]; 2026-02-21T08:30:48.5643856Z // end inline asm 2026-02-21T08:30:48.5643912Z // begin inline asm 2026-02-21T08:30:48.5644070Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2329, %r2330, %r2331, %r2332}, [%r893]; 2026-02-21T08:30:48.5644124Z // end inline asm 
2026-02-21T08:30:48.5644179Z bar.sync 0; 2026-02-21T08:30:48.5644269Z st.shared.v4.b32 [%r73], {%r2539, %r2551, %r2563, %r2575}; 2026-02-21T08:30:48.5644364Z st.shared.v4.b32 [%r74], {%r2587, %r2599, %r2611, %r2623}; 2026-02-21T08:30:48.5644419Z bar.sync 0; 2026-02-21T08:30:48.5644476Z // begin inline asm 2026-02-21T08:30:48.5644628Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2334, %r2335, %r2336, %r2337}, [%r888]; 2026-02-21T08:30:48.5644682Z // end inline asm 2026-02-21T08:30:48.5644788Z // begin inline asm 2026-02-21T08:30:48.5644940Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2339, %r2340, %r2341, %r2342}, [%r893]; 2026-02-21T08:30:48.5644995Z // end inline asm 2026-02-21T08:30:48.5645048Z bar.sync 0; 2026-02-21T08:30:48.5645138Z st.shared.v4.b32 [%r73], {%r2542, %r2554, %r2566, %r2578}; 2026-02-21T08:30:48.5645235Z st.shared.v4.b32 [%r74], {%r2590, %r2602, %r2614, %r2626}; 2026-02-21T08:30:48.5645288Z bar.sync 0; 2026-02-21T08:30:48.5645342Z // begin inline asm 2026-02-21T08:30:48.5645496Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2344, %r2345, %r2346, %r2347}, [%r888]; 2026-02-21T08:30:48.5645551Z // end inline asm 2026-02-21T08:30:48.5645605Z // begin inline asm 2026-02-21T08:30:48.5645756Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r2349, %r2350, %r2351, %r2352}, [%r893]; 2026-02-21T08:30:48.5645810Z // end inline asm 2026-02-21T08:30:48.5645866Z // begin inline asm 2026-02-21T08:30:48.5646018Z st.global.v4.b32 [ %rd573 + 0 ], { %r2314, %r2324, %r2334, %r2344 }; 2026-02-21T08:30:48.5646085Z // end inline asm 2026-02-21T08:30:48.5646139Z // begin inline asm 2026-02-21T08:30:48.5646243Z st.global.v4.b32 [ %rd574 + 0 ], { %r2315, %r2325, %r2335, %r2345 }; 2026-02-21T08:30:48.5646304Z // end inline asm 2026-02-21T08:30:48.5646359Z // begin inline asm 2026-02-21T08:30:48.5646459Z st.global.v4.b32 [ %rd575 + 0 ], { %r2316, %r2326, %r2336, %r2346 }; 2026-02-21T08:30:48.5646513Z // end inline asm 2026-02-21T08:30:48.5646576Z // begin inline asm 
2026-02-21T08:30:48.5646673Z st.global.v4.b32 [ %rd576 + 0 ], { %r2317, %r2327, %r2337, %r2347 }; 2026-02-21T08:30:48.5646728Z // end inline asm 2026-02-21T08:30:48.5646791Z // begin inline asm 2026-02-21T08:30:48.5646887Z st.global.v4.b32 [ %rd577 + 0 ], { %r2319, %r2329, %r2339, %r2349 }; 2026-02-21T08:30:48.5646942Z // end inline asm 2026-02-21T08:30:48.5646998Z // begin inline asm 2026-02-21T08:30:48.5647102Z st.global.v4.b32 [ %rd578 + 0 ], { %r2320, %r2330, %r2340, %r2350 }; 2026-02-21T08:30:48.5647159Z // end inline asm 2026-02-21T08:30:48.5647218Z // begin inline asm 2026-02-21T08:30:48.5647321Z st.global.v4.b32 [ %rd579 + 0 ], { %r2321, %r2331, %r2341, %r2351 }; 2026-02-21T08:30:48.5647376Z // end inline asm 2026-02-21T08:30:48.5647433Z // begin inline asm 2026-02-21T08:30:48.5647539Z st.global.v4.b32 [ %rd580 + 0 ], { %r2322, %r2332, %r2342, %r2352 }; 2026-02-21T08:30:48.5647594Z // end inline asm 2026-02-21T08:30:48.5647769Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5647830Z add.s32 %r2627, %r4018, 7104; 2026-02-21T08:30:48.5648004Z .loc 1 25 35 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:25:35 2026-02-21T08:30:48.5648064Z shr.s32 %r2628, %r2627, 31; 2026-02-21T08:30:48.5648123Z shr.u32 %r2629, %r2628, 23; 2026-02-21T08:30:48.5648194Z add.s32 %r2630, %r2627, %r2629; 2026-02-21T08:30:48.5648255Z shr.s32 %r2631, %r2630, 9; 2026-02-21T08:30:48.5648422Z .loc 1 26 33 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:26:33 2026-02-21T08:30:48.5648494Z shl.b32 %r2632, %r2631, 3; 2026-02-21T08:30:48.5648657Z .loc 1 27 39 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:39 2026-02-21T08:30:48.5648717Z sub.s32 %r2633, 64, %r2632; 2026-02-21T08:30:48.5648881Z .loc 1 27 52 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:52 2026-02-21T08:30:48.5648954Z min.s32 %r2634, %r2633, 8; 2026-02-21T08:30:48.5649114Z .loc 1 28 45 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:45 2026-02-21T08:30:48.5649176Z and.b32 %r2635, %r2630, -512; 2026-02-21T08:30:48.5649245Z sub.s32 %r2636, %r2627, %r2635; 2026-02-21T08:30:48.5649406Z .loc 1 29 51 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:29:51 2026-02-21T08:30:48.5649469Z div.s32 %r167, %r2636, %r2634; 2026-02-21T08:30:48.5649641Z .loc 1 28 64 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:64 2026-02-21T08:30:48.5649758Z mul.lo.s32 %r2637, %r167, %r2634; 2026-02-21T08:30:48.5649817Z sub.s32 %r2638, %r2636, %r2637; 2026-02-21T08:30:48.5649983Z .loc 1 28 30 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:30 2026-02-21T08:30:48.5650042Z add.s32 %r2639, %r2638, %r2632; 2026-02-21T08:30:48.5650206Z .loc 1 30 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:30:27 2026-02-21T08:30:48.5650265Z shl.b32 %r168, %r2639, 7; 2026-02-21T08:30:48.5650452Z .loc 1 31 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:31:32 2026-02-21T08:30:48.5650514Z or.b32 %r169, %r168, %r6; 2026-02-21T08:30:48.5650683Z .loc 1 32 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:32:27 2026-02-21T08:30:48.5650753Z shl.b32 %r171, %r167, 6; 2026-02-21T08:30:48.5650973Z .loc 1 33 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:33:32 2026-02-21T08:30:48.5651037Z or.b32 %r2640, %r171, %r9; 2026-02-21T08:30:48.5651106Z or.b32 %r2641, %r171, %r10; 2026-02-21T08:30:48.5651164Z or.b32 %r2642, %r171, %r11; 2026-02-21T08:30:48.5651222Z or.b32 %r2643, %r171, %r12; 2026-02-21T08:30:48.5651402Z .loc 1 48 53 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:53 2026-02-21T08:30:48.5651462Z shl.b32 %r2644, %r2640, 10; 2026-02-21T08:30:48.5651522Z shl.b32 %r2645, %r2641, 10; 2026-02-21T08:30:48.5651612Z shl.b32 %r2646, %r2642, 10; 2026-02-21T08:30:48.5651682Z shl.b32 %r2647, %r2643, 10; 2026-02-21T08:30:48.5651746Z mov.pred %p173, -1; 2026-02-21T08:30:48.5651805Z mov.b32 %r4038, 0; 
2026-02-21T08:30:48.5651867Z $L__tmp295: 2026-02-21T08:30:48.5652097Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5652160Z // begin inline asm 2026-02-21T08:30:48.5652503Z @%p173 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 0], 64, {%r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038}; 2026-02-21T08:30:48.5652568Z // end inline asm 2026-02-21T08:30:48.5652628Z // begin inline asm 2026-02-21T08:30:48.5652954Z @%p173 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 16], 64, {%r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038}; 2026-02-21T08:30:48.5653020Z // end inline asm 2026-02-21T08:30:48.5653079Z // begin inline asm 2026-02-21T08:30:48.5653404Z @%p173 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 32], 64, {%r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038}; 2026-02-21T08:30:48.5653470Z // end inline asm 2026-02-21T08:30:48.5653528Z // begin inline asm 2026-02-21T08:30:48.5653850Z @%p173 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3028 + 48], 64, {%r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038, %r4038}; 2026-02-21T08:30:48.5653919Z // end inline asm 2026-02-21T08:30:48.5653978Z // begin inline asm 2026-02-21T08:30:48.5654053Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5654117Z // end inline asm 2026-02-21T08:30:48.5654174Z bar.sync 0; 2026-02-21T08:30:48.5654229Z $L__tmp296: 2026-02-21T08:30:48.5654403Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5654471Z // begin inline asm 2026-02-21T08:30:48.5654564Z @%p10 mbarrier.init.shared::cta.b64 [%r4039], 1; 2026-02-21T08:30:48.5654623Z // end inline asm 
2026-02-21T08:30:48.5654687Z bar.sync 0; 2026-02-21T08:30:48.5654746Z // begin inline asm 2026-02-21T08:30:48.5654838Z @%p10 mbarrier.init.shared::cta.b64 [%r2245], 1; 2026-02-21T08:30:48.5654894Z // end inline asm 2026-02-21T08:30:48.5655079Z .loc 1 48 60 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:60 2026-02-21T08:30:48.5655197Z or.b32 %r2648, %r2644, %r22; 2026-02-21T08:30:48.5655258Z or.b32 %r2649, %r2645, %r22; 2026-02-21T08:30:48.5655326Z or.b32 %r2650, %r2646, %r22; 2026-02-21T08:30:48.5655386Z or.b32 %r2651, %r2647, %r22; 2026-02-21T08:30:48.5655560Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5655638Z mad.wide.s32 %rd581, %r2648, 2, %rd92; 2026-02-21T08:30:48.5655706Z mad.wide.s32 %rd582, %r2649, 2, %rd92; 2026-02-21T08:30:48.5655774Z mad.wide.s32 %rd583, %r2650, 2, %rd92; 2026-02-21T08:30:48.5655839Z mad.wide.s32 %rd584, %r2651, 2, %rd92; 2026-02-21T08:30:48.5655905Z mov.b32 %r2764, 16; 2026-02-21T08:30:48.5656079Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5656137Z // begin inline asm 2026-02-21T08:30:48.5656327Z cp.async.cg.shared.global [ %r2763 + 0 ], [ %rd581 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5656389Z // end inline asm 2026-02-21T08:30:48.5656447Z // begin inline asm 2026-02-21T08:30:48.5656580Z cp.async.cg.shared.global [ %r2765 + 0 ], [ %rd582 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5656639Z // end inline asm 2026-02-21T08:30:48.5656698Z // begin inline asm 2026-02-21T08:30:48.5656819Z cp.async.cg.shared.global [ %r2767 + 0 ], [ %rd583 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5656883Z // end inline asm 2026-02-21T08:30:48.5656942Z // begin inline asm 2026-02-21T08:30:48.5657061Z cp.async.cg.shared.global [ %r2769 + 0 ], [ %rd584 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5657123Z // end inline asm 2026-02-21T08:30:48.5657188Z cp.async.commit_group; 2026-02-21T08:30:48.5657358Z .loc 1 54 62 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5657422Z add.s32 %r2652, %r169, %r39; 2026-02-21T08:30:48.5657495Z add.s32 %r2653, %r169, %r40; 2026-02-21T08:30:48.5657670Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5657734Z cvt.s64.s32 %rd721, %r2652; 2026-02-21T08:30:48.5657807Z add.s64 %rd585, %rd93, %rd721; 2026-02-21T08:30:48.5657870Z cvt.s64.s32 %rd722, %r2653; 2026-02-21T08:30:48.5657934Z add.s64 %rd586, %rd93, %rd722; 2026-02-21T08:30:48.5658140Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5658197Z // begin inline asm 2026-02-21T08:30:48.5658310Z cp.async.cg.shared.global [ %r2771 + 0 ], [ %rd585 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5658366Z // end inline asm 2026-02-21T08:30:48.5658433Z // begin inline asm 2026-02-21T08:30:48.5658547Z cp.async.cg.shared.global [ %r2773 + 0 ], [ %rd586 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5658607Z // end inline asm 2026-02-21T08:30:48.5658683Z cp.async.commit_group; 2026-02-21T08:30:48.5658851Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5658913Z cvt.s64.s32 %rd723, %r2644; 2026-02-21T08:30:48.5658983Z or.b64 %rd724, %rd723, %rd11; 2026-02-21T08:30:48.5659044Z shl.b64 %rd725, %rd724, 1; 2026-02-21T08:30:48.5659106Z add.s64 %rd66, %rd92, %rd725; 2026-02-21T08:30:48.5659165Z add.s64 %rd587, %rd66, 128; 2026-02-21T08:30:48.5659231Z cvt.s64.s32 %rd726, %r2645; 2026-02-21T08:30:48.5659289Z or.b64 %rd727, %rd726, %rd11; 2026-02-21T08:30:48.5659348Z shl.b64 %rd728, %rd727, 1; 2026-02-21T08:30:48.5659415Z add.s64 %rd67, %rd92, %rd728; 2026-02-21T08:30:48.5659473Z add.s64 %rd588, %rd67, 128; 2026-02-21T08:30:48.5659532Z cvt.s64.s32 %rd729, %r2646; 2026-02-21T08:30:48.5659589Z or.b64 %rd730, %rd729, %rd11; 2026-02-21T08:30:48.5659656Z shl.b64 %rd731, %rd730, 1; 2026-02-21T08:30:48.5659716Z add.s64 %rd68, %rd92, %rd731; 
2026-02-21T08:30:48.5659773Z add.s64 %rd589, %rd68, 128; 2026-02-21T08:30:48.5659841Z cvt.s64.s32 %rd732, %r2647; 2026-02-21T08:30:48.5659938Z or.b64 %rd733, %rd732, %rd11; 2026-02-21T08:30:48.5659997Z shl.b64 %rd734, %rd733, 1; 2026-02-21T08:30:48.5660055Z add.s64 %rd69, %rd92, %rd734; 2026-02-21T08:30:48.5660120Z add.s64 %rd590, %rd69, 128; 2026-02-21T08:30:48.5660287Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5660341Z bar.sync 0; 2026-02-21T08:30:48.5660406Z // begin inline asm 2026-02-21T08:30:48.5660521Z cp.async.cg.shared.global [ %r2468 + 0 ], [ %rd587 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5660576Z // end inline asm 2026-02-21T08:30:48.5660640Z // begin inline asm 2026-02-21T08:30:48.5660754Z cp.async.cg.shared.global [ %r2470 + 0 ], [ %rd588 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5660810Z // end inline asm 2026-02-21T08:30:48.5660866Z // begin inline asm 2026-02-21T08:30:48.5660987Z cp.async.cg.shared.global [ %r2472 + 0 ], [ %rd589 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5661082Z // end inline asm 2026-02-21T08:30:48.5661142Z // begin inline asm 2026-02-21T08:30:48.5661260Z cp.async.cg.shared.global [ %r2474 + 0 ], [ %rd590 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5661316Z // end inline asm 2026-02-21T08:30:48.5661378Z cp.async.commit_group; 2026-02-21T08:30:48.5661576Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5661647Z add.s32 %r2654, %r169, %r48; 2026-02-21T08:30:48.5661706Z add.s32 %r2655, %r169, %r49; 2026-02-21T08:30:48.5661869Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5661936Z cvt.s64.s32 %rd735, %r2654; 2026-02-21T08:30:48.5661996Z add.s64 %rd591, %rd93, %rd735; 2026-02-21T08:30:48.5662055Z cvt.s64.s32 %rd736, %r2655; 2026-02-21T08:30:48.5662121Z add.s64 %rd592, %rd93, %rd736; 2026-02-21T08:30:48.5662283Z .loc 1 54 87 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5662343Z // begin inline asm 2026-02-21T08:30:48.5662457Z cp.async.cg.shared.global [ %r2476 + 0 ], [ %rd591 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5662520Z // end inline asm 2026-02-21T08:30:48.5662577Z // begin inline asm 2026-02-21T08:30:48.5662689Z cp.async.cg.shared.global [ %r2478 + 0 ], [ %rd592 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5662752Z // end inline asm 2026-02-21T08:30:48.5662814Z cp.async.commit_group; 2026-02-21T08:30:48.5662976Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5663048Z cp.async.wait_group 2; 2026-02-21T08:30:48.5663102Z bar.sync 0; 2026-02-21T08:30:48.5663198Z ld.shared.v4.b32 {%r2656, %r2657, %r2658, %r2659}, [%r53]; 2026-02-21T08:30:48.5663264Z mov.b32 {%rs1153, %rs1154}, %r2659; 2026-02-21T08:30:48.5663336Z mov.b32 {%rs1155, %rs1156}, %r2658; 2026-02-21T08:30:48.5663397Z mov.b32 {%rs1157, %rs1158}, %r2657; 2026-02-21T08:30:48.5663460Z mov.b32 {%rs1159, %rs1160}, %r2656; 2026-02-21T08:30:48.5663573Z ld.shared.v4.b32 {%r2660, %r2661, %r2662, %r2663}, [%r53+16]; 2026-02-21T08:30:48.5663633Z mov.b32 {%rs1161, %rs1162}, %r2663; 2026-02-21T08:30:48.5663693Z mov.b32 {%rs1163, %rs1164}, %r2662; 2026-02-21T08:30:48.5663759Z mov.b32 {%rs1165, %rs1166}, %r2661; 2026-02-21T08:30:48.5663818Z mov.b32 {%rs1167, %rs1168}, %r2660; 2026-02-21T08:30:48.5663917Z ld.shared.v4.b32 {%r2664, %r2665, %r2666, %r2667}, [%r53+32]; 2026-02-21T08:30:48.5663976Z mov.b32 {%rs1169, %rs1170}, %r2667; 2026-02-21T08:30:48.5664042Z mov.b32 {%rs1171, %rs1172}, %r2666; 2026-02-21T08:30:48.5664101Z mov.b32 {%rs1173, %rs1174}, %r2665; 2026-02-21T08:30:48.5664160Z mov.b32 {%rs1175, %rs1176}, %r2664; 2026-02-21T08:30:48.5664260Z ld.shared.v4.b32 {%r2668, %r2669, %r2670, %r2671}, [%r53+48]; 2026-02-21T08:30:48.5664321Z mov.b32 {%rs1177, %rs1178}, %r2671; 2026-02-21T08:30:48.5664382Z mov.b32 {%rs1179, %rs1180}, %r2670; 
2026-02-21T08:30:48.5664442Z mov.b32 {%rs1181, %rs1182}, %r2669; 2026-02-21T08:30:48.5664559Z mov.b32 {%rs1183, %rs1184}, %r2668; 2026-02-21T08:30:48.5664730Z .loc 1 52 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:52:32 2026-02-21T08:30:48.5664792Z cvt.f32.bf16 %r2481, %rs1159; 2026-02-21T08:30:48.5664861Z cvt.f32.bf16 %r2482, %rs1160; 2026-02-21T08:30:48.5664920Z cvt.f32.bf16 %r2483, %rs1157; 2026-02-21T08:30:48.5664979Z cvt.f32.bf16 %r2484, %rs1158; 2026-02-21T08:30:48.5665044Z cvt.f32.bf16 %r2485, %rs1155; 2026-02-21T08:30:48.5665100Z cvt.f32.bf16 %r2486, %rs1156; 2026-02-21T08:30:48.5665156Z cvt.f32.bf16 %r2487, %rs1153; 2026-02-21T08:30:48.5665212Z cvt.f32.bf16 %r2488, %rs1154; 2026-02-21T08:30:48.5665277Z cvt.f32.bf16 %r2489, %rs1167; 2026-02-21T08:30:48.5665334Z cvt.f32.bf16 %r2490, %rs1168; 2026-02-21T08:30:48.5665390Z cvt.f32.bf16 %r2491, %rs1165; 2026-02-21T08:30:48.5665454Z cvt.f32.bf16 %r2492, %rs1166; 2026-02-21T08:30:48.5665510Z cvt.f32.bf16 %r2493, %rs1163; 2026-02-21T08:30:48.5665628Z cvt.f32.bf16 %r2494, %rs1164; 2026-02-21T08:30:48.5665690Z cvt.f32.bf16 %r2495, %rs1161; 2026-02-21T08:30:48.5665755Z cvt.f32.bf16 %r2496, %rs1162; 2026-02-21T08:30:48.5665812Z cvt.f32.bf16 %r2498, %rs1175; 2026-02-21T08:30:48.5665869Z cvt.f32.bf16 %r2499, %rs1176; 2026-02-21T08:30:48.5665934Z cvt.f32.bf16 %r2500, %rs1173; 2026-02-21T08:30:48.5665992Z cvt.f32.bf16 %r2501, %rs1174; 2026-02-21T08:30:48.5666052Z cvt.f32.bf16 %r2502, %rs1171; 2026-02-21T08:30:48.5666120Z cvt.f32.bf16 %r2503, %rs1172; 2026-02-21T08:30:48.5666180Z cvt.f32.bf16 %r2504, %rs1169; 2026-02-21T08:30:48.5666240Z cvt.f32.bf16 %r2505, %rs1170; 2026-02-21T08:30:48.5666300Z cvt.f32.bf16 %r2506, %rs1183; 2026-02-21T08:30:48.5666365Z cvt.f32.bf16 %r2507, %rs1184; 2026-02-21T08:30:48.5666421Z cvt.f32.bf16 %r2508, %rs1181; 2026-02-21T08:30:48.5666478Z cvt.f32.bf16 %r2509, %rs1182; 2026-02-21T08:30:48.5666541Z cvt.f32.bf16 %r2510, %rs1179; 2026-02-21T08:30:48.5666598Z cvt.f32.bf16 %r2511, 
%rs1180; 2026-02-21T08:30:48.5666657Z cvt.f32.bf16 %r2512, %rs1177; 2026-02-21T08:30:48.5666717Z cvt.f32.bf16 %r2513, %rs1178; 2026-02-21T08:30:48.5666778Z $L__tmp297: 2026-02-21T08:30:48.5666999Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5667056Z // begin inline asm 2026-02-21T08:30:48.5667382Z @%p173 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 0], 32, {%r2481, %r2482, %r2483, %r2484, %r2485, %r2486, %r2487, %r2488, %r2489, %r2490, %r2491, %r2492, %r2493, %r2494, %r2495, %r2496}; 2026-02-21T08:30:48.5667437Z // end inline asm 2026-02-21T08:30:48.5667494Z // begin inline asm 2026-02-21T08:30:48.5667825Z @%p173 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 16], 32, {%r2498, %r2499, %r2500, %r2501, %r2502, %r2503, %r2504, %r2505, %r2506, %r2507, %r2508, %r2509, %r2510, %r2511, %r2512, %r2513}; 2026-02-21T08:30:48.5667879Z // end inline asm 2026-02-21T08:30:48.5667934Z // begin inline asm 2026-02-21T08:30:48.5668012Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5668068Z // end inline asm 2026-02-21T08:30:48.5668123Z bar.sync 0; 2026-02-21T08:30:48.5668175Z $L__tmp298: 2026-02-21T08:30:48.5668356Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5668417Z ld.shared.s8 %rs1185, [%r54]; 2026-02-21T08:30:48.5668584Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5668653Z shl.b16 %rs1186, %rs1185, 4; 2026-02-21T08:30:48.5668825Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5668889Z ld.shared.s8 %rs1187, [%r54+128]; 2026-02-21T08:30:48.5669065Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5669125Z shl.b16 %rs1188, %rs1187, 4; 2026-02-21T08:30:48.5669290Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5669395Z ld.shared.s8 
%rs1189, [%r54+256]; 2026-02-21T08:30:48.5669563Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5669622Z shl.b16 %rs1190, %rs1189, 4; 2026-02-21T08:30:48.5669785Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5669856Z ld.shared.s8 %rs1191, [%r54+384]; 2026-02-21T08:30:48.5670018Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5670076Z shl.b16 %rs1192, %rs1191, 4; 2026-02-21T08:30:48.5670246Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5670309Z ld.shared.s8 %rs1193, [%r54+512]; 2026-02-21T08:30:48.5670474Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5670580Z shl.b16 %rs1194, %rs1193, 4; 2026-02-21T08:30:48.5670744Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5670807Z ld.shared.s8 %rs1195, [%r54+640]; 2026-02-21T08:30:48.5670968Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5671033Z shl.b16 %rs1196, %rs1195, 4; 2026-02-21T08:30:48.5671195Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5671256Z ld.shared.s8 %rs1197, [%r54+768]; 2026-02-21T08:30:48.5671427Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5671484Z shl.b16 %rs1198, %rs1197, 4; 2026-02-21T08:30:48.5671675Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5671746Z ld.shared.s8 %rs1199, [%r56]; 2026-02-21T08:30:48.5671910Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5671968Z shl.b16 %rs1200, %rs1199, 4; 2026-02-21T08:30:48.5672139Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5672204Z 
ld.shared.s8 %rs1201, [%r54+1024]; 2026-02-21T08:30:48.5672370Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5672429Z shl.b16 %rs1202, %rs1201, 4; 2026-02-21T08:30:48.5672598Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5672663Z ld.shared.s8 %rs1203, [%r54+1152]; 2026-02-21T08:30:48.5672826Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5672891Z shl.b16 %rs1204, %rs1203, 4; 2026-02-21T08:30:48.5673057Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5673123Z ld.shared.s8 %rs1205, [%r54+1280]; 2026-02-21T08:30:48.5673293Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5673352Z shl.b16 %rs1206, %rs1205, 4; 2026-02-21T08:30:48.5673517Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5673588Z ld.shared.s8 %rs1207, [%r54+1408]; 2026-02-21T08:30:48.5673754Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5673813Z shl.b16 %rs1208, %rs1207, 4; 2026-02-21T08:30:48.5673975Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5674046Z ld.shared.s8 %rs1209, [%r54+1536]; 2026-02-21T08:30:48.5674211Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5674318Z shl.b16 %rs1210, %rs1209, 4; 2026-02-21T08:30:48.5674489Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5674551Z ld.shared.s8 %rs1211, [%r54+1664]; 2026-02-21T08:30:48.5674713Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5674779Z shl.b16 %rs1212, %rs1211, 4; 2026-02-21T08:30:48.5674941Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5675004Z ld.shared.s8 %rs1213, [%r54+1792]; 2026-02-21T08:30:48.5675173Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5675230Z shl.b16 %rs1214, %rs1213, 4; 2026-02-21T08:30:48.5675390Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5675495Z ld.shared.s8 %rs1215, [%r58]; 2026-02-21T08:30:48.5675670Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5675729Z shl.b16 %rs1216, %rs1215, 4; 2026-02-21T08:30:48.5675891Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5675960Z ld.shared.s8 %rs1217, [%r54+2048]; 2026-02-21T08:30:48.5676123Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5676182Z shl.b16 %rs1218, %rs1217, 4; 2026-02-21T08:30:48.5676354Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5676417Z ld.shared.s8 %rs1219, [%r54+2176]; 2026-02-21T08:30:48.5676583Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5676650Z shl.b16 %rs1220, %rs1219, 4; 2026-02-21T08:30:48.5676822Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5676885Z ld.shared.s8 %rs1221, [%r54+2304]; 2026-02-21T08:30:48.5677055Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5677116Z shl.b16 %rs1222, %rs1221, 4; 2026-02-21T08:30:48.5677280Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5677343Z ld.shared.s8 %rs1223, [%r54+2432]; 2026-02-21T08:30:48.5677511Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5677569Z shl.b16 %rs1224, %rs1223, 4; 2026-02-21T08:30:48.5677734Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5677805Z ld.shared.s8 %rs1225, [%r54+2560]; 2026-02-21T08:30:48.5677971Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5678031Z shl.b16 %rs1226, %rs1225, 4; 2026-02-21T08:30:48.5678200Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5678262Z ld.shared.s8 %rs1227, [%r54+2688]; 2026-02-21T08:30:48.5678423Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5678489Z shl.b16 %rs1228, %rs1227, 4; 2026-02-21T08:30:48.5678654Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5678715Z ld.shared.s8 %rs1229, [%r54+2816]; 2026-02-21T08:30:48.5678879Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5678945Z shl.b16 %rs1230, %rs1229, 4; 2026-02-21T08:30:48.5679108Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5679209Z ld.shared.s8 %rs1231, [%r60]; 2026-02-21T08:30:48.5679378Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5679436Z shl.b16 %rs1232, %rs1231, 4; 2026-02-21T08:30:48.5679597Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5679664Z ld.shared.s8 %rs1233, [%r54+3072]; 2026-02-21T08:30:48.5679823Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5679882Z shl.b16 %rs1234, %rs1233, 4; 2026-02-21T08:30:48.5680049Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5680111Z ld.shared.s8 %rs1235, [%r54+3200]; 2026-02-21T08:30:48.5680275Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5680372Z shl.b16 %rs1236, %rs1235, 4; 2026-02-21T08:30:48.5680545Z .loc 
1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5680608Z ld.shared.s8 %rs1237, [%r54+3328]; 2026-02-21T08:30:48.5680768Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5680834Z shl.b16 %rs1238, %rs1237, 4; 2026-02-21T08:30:48.5680994Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5681057Z ld.shared.s8 %rs1239, [%r54+3456]; 2026-02-21T08:30:48.5681227Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5681286Z shl.b16 %rs1240, %rs1239, 4; 2026-02-21T08:30:48.5681448Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5681517Z ld.shared.s8 %rs1241, [%r54+3584]; 2026-02-21T08:30:48.5681730Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5681793Z shl.b16 %rs1242, %rs1241, 4; 2026-02-21T08:30:48.5681952Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5682021Z ld.shared.s8 %rs1243, [%r54+3712]; 2026-02-21T08:30:48.5682183Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5682240Z shl.b16 %rs1244, %rs1243, 4; 2026-02-21T08:30:48.5682407Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5682469Z ld.shared.s8 %rs1245, [%r54+3840]; 2026-02-21T08:30:48.5682628Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5682694Z shl.b16 %rs1246, %rs1245, 4; 2026-02-21T08:30:48.5682858Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5682922Z ld.shared.s8 %rs1247, [%r62]; 2026-02-21T08:30:48.5683091Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5683149Z shl.b16 %rs1248, %rs1247, 4; 
2026-02-21T08:30:48.5683309Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5683369Z cvt.s16.s8 %rs1249, %rs1186; 2026-02-21T08:30:48.5683432Z shr.s16 %rs1250, %rs1249, 4; 2026-02-21T08:30:48.5683490Z cvt.s16.s8 %rs1251, %rs1188; 2026-02-21T08:30:48.5683548Z shr.s16 %rs1252, %rs1251, 4; 2026-02-21T08:30:48.5683612Z shr.s16 %rs1253, %rs1185, 4; 2026-02-21T08:30:48.5683669Z shr.s16 %rs1254, %rs1187, 4; 2026-02-21T08:30:48.5683835Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5683907Z cvt.rn.f32.s16 %r2672, %rs1254; 2026-02-21T08:30:48.5683971Z cvt.rn.f32.s16 %r2673, %rs1253; 2026-02-21T08:30:48.5684077Z cvt.rn.f32.s16 %r2674, %rs1252; 2026-02-21T08:30:48.5684137Z cvt.rn.f32.s16 %r2675, %rs1250; 2026-02-21T08:30:48.5684310Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5684368Z cvt.s16.s8 %rs1255, %rs1190; 2026-02-21T08:30:48.5684426Z shr.s16 %rs1256, %rs1255, 4; 2026-02-21T08:30:48.5684490Z cvt.s16.s8 %rs1257, %rs1192; 2026-02-21T08:30:48.5684547Z shr.s16 %rs1258, %rs1257, 4; 2026-02-21T08:30:48.5684604Z shr.s16 %rs1259, %rs1189, 4; 2026-02-21T08:30:48.5684669Z shr.s16 %rs1260, %rs1191, 4; 2026-02-21T08:30:48.5684833Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5684893Z cvt.rn.f32.s16 %r2676, %rs1260; 2026-02-21T08:30:48.5684953Z cvt.rn.f32.s16 %r2677, %rs1259; 2026-02-21T08:30:48.5685019Z cvt.rn.f32.s16 %r2678, %rs1258; 2026-02-21T08:30:48.5685136Z cvt.rn.f32.s16 %r2679, %rs1256; 2026-02-21T08:30:48.5685310Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5685376Z cvt.s16.s8 %rs1261, %rs1194; 2026-02-21T08:30:48.5685436Z shr.s16 %rs1262, %rs1261, 4; 2026-02-21T08:30:48.5685494Z cvt.s16.s8 %rs1263, %rs1196; 2026-02-21T08:30:48.5685551Z shr.s16 %rs1264, %rs1263, 4; 2026-02-21T08:30:48.5685617Z shr.s16 %rs1265, 
%rs1193, 4; 2026-02-21T08:30:48.5685674Z shr.s16 %rs1266, %rs1195, 4; 2026-02-21T08:30:48.5685839Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5685906Z cvt.rn.f32.s16 %r2680, %rs1266; 2026-02-21T08:30:48.5685967Z cvt.rn.f32.s16 %r2681, %rs1265; 2026-02-21T08:30:48.5686026Z cvt.rn.f32.s16 %r2682, %rs1264; 2026-02-21T08:30:48.5686091Z cvt.rn.f32.s16 %r2683, %rs1262; 2026-02-21T08:30:48.5686259Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5686321Z cvt.s16.s8 %rs1267, %rs1198; 2026-02-21T08:30:48.5686379Z shr.s16 %rs1268, %rs1267, 4; 2026-02-21T08:30:48.5686444Z cvt.s16.s8 %rs1269, %rs1200; 2026-02-21T08:30:48.5686501Z shr.s16 %rs1270, %rs1269, 4; 2026-02-21T08:30:48.5686560Z shr.s16 %rs1271, %rs1197, 4; 2026-02-21T08:30:48.5686626Z shr.s16 %rs1272, %rs1199, 4; 2026-02-21T08:30:48.5686794Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5686855Z cvt.rn.f32.s16 %r2684, %rs1272; 2026-02-21T08:30:48.5686931Z cvt.rn.f32.s16 %r2685, %rs1271; 2026-02-21T08:30:48.5686993Z cvt.rn.f32.s16 %r2686, %rs1270; 2026-02-21T08:30:48.5687052Z cvt.rn.f32.s16 %r2687, %rs1268; 2026-02-21T08:30:48.5687215Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5687282Z cvt.s16.s8 %rs1273, %rs1202; 2026-02-21T08:30:48.5687339Z shr.s16 %rs1274, %rs1273, 4; 2026-02-21T08:30:48.5687399Z cvt.s16.s8 %rs1275, %rs1204; 2026-02-21T08:30:48.5687466Z shr.s16 %rs1276, %rs1275, 4; 2026-02-21T08:30:48.5687523Z shr.s16 %rs1277, %rs1201, 4; 2026-02-21T08:30:48.5687580Z shr.s16 %rs1278, %rs1203, 4; 2026-02-21T08:30:48.5687753Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5687812Z cvt.rn.f32.s16 %r2688, %rs1278; 2026-02-21T08:30:48.5687871Z cvt.rn.f32.s16 %r2689, %rs1277; 2026-02-21T08:30:48.5687930Z cvt.rn.f32.s16 %r2690, %rs1276; 
2026-02-21T08:30:48.5687996Z cvt.rn.f32.s16 %r2691, %rs1274; 2026-02-21T08:30:48.5688160Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5688218Z cvt.s16.s8 %rs1279, %rs1206; 2026-02-21T08:30:48.5688284Z shr.s16 %rs1280, %rs1279, 4; 2026-02-21T08:30:48.5688340Z cvt.s16.s8 %rs1281, %rs1208; 2026-02-21T08:30:48.5688398Z shr.s16 %rs1282, %rs1281, 4; 2026-02-21T08:30:48.5688456Z shr.s16 %rs1283, %rs1205, 4; 2026-02-21T08:30:48.5688566Z shr.s16 %rs1284, %rs1207, 4; 2026-02-21T08:30:48.5688732Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5688793Z cvt.rn.f32.s16 %r2692, %rs1284; 2026-02-21T08:30:48.5688860Z cvt.rn.f32.s16 %r2693, %rs1283; 2026-02-21T08:30:48.5688919Z cvt.rn.f32.s16 %r2694, %rs1282; 2026-02-21T08:30:48.5688978Z cvt.rn.f32.s16 %r2695, %rs1280; 2026-02-21T08:30:48.5689149Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5689208Z cvt.s16.s8 %rs1285, %rs1210; 2026-02-21T08:30:48.5689266Z shr.s16 %rs1286, %rs1285, 4; 2026-02-21T08:30:48.5689325Z cvt.s16.s8 %rs1287, %rs1212; 2026-02-21T08:30:48.5689389Z shr.s16 %rs1288, %rs1287, 4; 2026-02-21T08:30:48.5689446Z shr.s16 %rs1289, %rs1209, 4; 2026-02-21T08:30:48.5689503Z shr.s16 %rs1290, %rs1211, 4; 2026-02-21T08:30:48.5689717Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5689782Z cvt.rn.f32.s16 %r2696, %rs1290; 2026-02-21T08:30:48.5689842Z cvt.rn.f32.s16 %r2697, %rs1289; 2026-02-21T08:30:48.5689909Z cvt.rn.f32.s16 %r2698, %rs1288; 2026-02-21T08:30:48.5689968Z cvt.rn.f32.s16 %r2699, %rs1286; 2026-02-21T08:30:48.5690136Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5690194Z cvt.s16.s8 %rs1291, %rs1214; 2026-02-21T08:30:48.5690260Z shr.s16 %rs1292, %rs1291, 4; 2026-02-21T08:30:48.5690317Z cvt.s16.s8 %rs1293, %rs1216; 2026-02-21T08:30:48.5690375Z shr.s16 %rs1294, 
%rs1293, 4; 2026-02-21T08:30:48.5690440Z shr.s16 %rs1295, %rs1213, 4; 2026-02-21T08:30:48.5690497Z shr.s16 %rs1296, %rs1215, 4; 2026-02-21T08:30:48.5690664Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5690728Z cvt.rn.f32.s16 %r2700, %rs1296; 2026-02-21T08:30:48.5690788Z cvt.rn.f32.s16 %r2701, %rs1295; 2026-02-21T08:30:48.5690849Z cvt.rn.f32.s16 %r2702, %rs1294; 2026-02-21T08:30:48.5690908Z cvt.rn.f32.s16 %r2703, %rs1292; 2026-02-21T08:30:48.5691081Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5691138Z cvt.s16.s8 %rs1297, %rs1218; 2026-02-21T08:30:48.5691195Z shr.s16 %rs1298, %rs1297, 4; 2026-02-21T08:30:48.5691261Z cvt.s16.s8 %rs1299, %rs1220; 2026-02-21T08:30:48.5691318Z shr.s16 %rs1300, %rs1299, 4; 2026-02-21T08:30:48.5691374Z shr.s16 %rs1301, %rs1217, 4; 2026-02-21T08:30:48.5691431Z shr.s16 %rs1302, %rs1219, 4; 2026-02-21T08:30:48.5691639Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5691700Z cvt.rn.f32.s16 %r2704, %rs1302; 2026-02-21T08:30:48.5691758Z cvt.rn.f32.s16 %r2705, %rs1301; 2026-02-21T08:30:48.5691824Z cvt.rn.f32.s16 %r2706, %rs1300; 2026-02-21T08:30:48.5691886Z cvt.rn.f32.s16 %r2707, %rs1298; 2026-02-21T08:30:48.5692049Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5692114Z cvt.s16.s8 %rs1303, %rs1222; 2026-02-21T08:30:48.5692172Z shr.s16 %rs1304, %rs1303, 4; 2026-02-21T08:30:48.5692229Z cvt.s16.s8 %rs1305, %rs1224; 2026-02-21T08:30:48.5692285Z shr.s16 %rs1306, %rs1305, 4; 2026-02-21T08:30:48.5692349Z shr.s16 %rs1307, %rs1221, 4; 2026-02-21T08:30:48.5692406Z shr.s16 %rs1308, %rs1223, 4; 2026-02-21T08:30:48.5692564Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5692632Z cvt.rn.f32.s16 %r2708, %rs1308; 2026-02-21T08:30:48.5692689Z cvt.rn.f32.s16 %r2709, %rs1307; 2026-02-21T08:30:48.5692746Z 
cvt.rn.f32.s16 %r2710, %rs1306; 2026-02-21T08:30:48.5692811Z cvt.rn.f32.s16 %r2711, %rs1304; 2026-02-21T08:30:48.5692972Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5693080Z cvt.s16.s8 %rs1309, %rs1226; 2026-02-21T08:30:48.5693136Z shr.s16 %rs1310, %rs1309, 4; 2026-02-21T08:30:48.5693203Z cvt.s16.s8 %rs1311, %rs1228; 2026-02-21T08:30:48.5693260Z shr.s16 %rs1312, %rs1311, 4; 2026-02-21T08:30:48.5693318Z shr.s16 %rs1313, %rs1225, 4; 2026-02-21T08:30:48.5693381Z shr.s16 %rs1314, %rs1227, 4; 2026-02-21T08:30:48.5693546Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5693605Z cvt.rn.f32.s16 %r2712, %rs1314; 2026-02-21T08:30:48.5693672Z cvt.rn.f32.s16 %r2713, %rs1313; 2026-02-21T08:30:48.5693744Z cvt.rn.f32.s16 %r2714, %rs1312; 2026-02-21T08:30:48.5693806Z cvt.rn.f32.s16 %r2715, %rs1310; 2026-02-21T08:30:48.5693980Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5694051Z cvt.s16.s8 %rs1315, %rs1230; 2026-02-21T08:30:48.5694112Z shr.s16 %rs1316, %rs1315, 4; 2026-02-21T08:30:48.5694223Z cvt.s16.s8 %rs1317, %rs1232; 2026-02-21T08:30:48.5694295Z shr.s16 %rs1318, %rs1317, 4; 2026-02-21T08:30:48.5694356Z shr.s16 %rs1319, %rs1229, 4; 2026-02-21T08:30:48.5694416Z shr.s16 %rs1320, %rs1231, 4; 2026-02-21T08:30:48.5694592Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5694665Z cvt.rn.f32.s16 %r2716, %rs1320; 2026-02-21T08:30:48.5694729Z cvt.rn.f32.s16 %r2717, %rs1319; 2026-02-21T08:30:48.5694792Z cvt.rn.f32.s16 %r2718, %rs1318; 2026-02-21T08:30:48.5694867Z cvt.rn.f32.s16 %r2719, %rs1316; 2026-02-21T08:30:48.5695040Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5695106Z cvt.s16.s8 %rs1321, %rs1234; 2026-02-21T08:30:48.5695176Z shr.s16 %rs1322, %rs1321, 4; 2026-02-21T08:30:48.5695236Z cvt.s16.s8 %rs1323, %rs1236; 
2026-02-21T08:30:48.5695297Z shr.s16 %rs1324, %rs1323, 4; 2026-02-21T08:30:48.5695358Z shr.s16 %rs1325, %rs1233, 4; 2026-02-21T08:30:48.5695428Z shr.s16 %rs1326, %rs1235, 4; 2026-02-21T08:30:48.5695602Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5695664Z cvt.rn.f32.s16 %r2720, %rs1326; 2026-02-21T08:30:48.5695732Z cvt.rn.f32.s16 %r2721, %rs1325; 2026-02-21T08:30:48.5695793Z cvt.rn.f32.s16 %r2722, %rs1324; 2026-02-21T08:30:48.5695854Z cvt.rn.f32.s16 %r2723, %rs1322; 2026-02-21T08:30:48.5696032Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5696093Z cvt.s16.s8 %rs1327, %rs1238; 2026-02-21T08:30:48.5696154Z shr.s16 %rs1328, %rs1327, 4; 2026-02-21T08:30:48.5696215Z cvt.s16.s8 %rs1329, %rs1240; 2026-02-21T08:30:48.5696281Z shr.s16 %rs1330, %rs1329, 4; 2026-02-21T08:30:48.5696340Z shr.s16 %rs1331, %rs1237, 4; 2026-02-21T08:30:48.5696399Z shr.s16 %rs1332, %rs1239, 4; 2026-02-21T08:30:48.5696586Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5696651Z cvt.rn.f32.s16 %r2724, %rs1332; 2026-02-21T08:30:48.5696712Z cvt.rn.f32.s16 %r2725, %rs1331; 2026-02-21T08:30:48.5696780Z cvt.rn.f32.s16 %r2726, %rs1330; 2026-02-21T08:30:48.5696841Z cvt.rn.f32.s16 %r2727, %rs1328; 2026-02-21T08:30:48.5697013Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5697073Z cvt.s16.s8 %rs1333, %rs1242; 2026-02-21T08:30:48.5697141Z shr.s16 %rs1334, %rs1333, 4; 2026-02-21T08:30:48.5697200Z cvt.s16.s8 %rs1335, %rs1244; 2026-02-21T08:30:48.5697261Z shr.s16 %rs1336, %rs1335, 4; 2026-02-21T08:30:48.5697326Z shr.s16 %rs1337, %rs1241, 4; 2026-02-21T08:30:48.5697386Z shr.s16 %rs1338, %rs1243, 4; 2026-02-21T08:30:48.5697558Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5697620Z cvt.rn.f32.s16 %r2728, %rs1338; 2026-02-21T08:30:48.5697692Z cvt.rn.f32.s16 
%r2729, %rs1337; 2026-02-21T08:30:48.5697799Z cvt.rn.f32.s16 %r2730, %rs1336; 2026-02-21T08:30:48.5697859Z cvt.rn.f32.s16 %r2731, %rs1334; 2026-02-21T08:30:48.5698038Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5698099Z cvt.s16.s8 %rs1339, %rs1246; 2026-02-21T08:30:48.5698159Z shr.s16 %rs1340, %rs1339, 4; 2026-02-21T08:30:48.5698225Z cvt.s16.s8 %rs1341, %rs1248; 2026-02-21T08:30:48.5698285Z shr.s16 %rs1342, %rs1341, 4; 2026-02-21T08:30:48.5698344Z shr.s16 %rs1343, %rs1245, 4; 2026-02-21T08:30:48.5698403Z shr.s16 %rs1344, %rs1247, 4; 2026-02-21T08:30:48.5698582Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5698644Z cvt.rn.f32.s16 %r2732, %rs1344; 2026-02-21T08:30:48.5698705Z cvt.rn.f32.s16 %r2733, %rs1343; 2026-02-21T08:30:48.5698772Z cvt.rn.f32.s16 %r2734, %rs1342; 2026-02-21T08:30:48.5698874Z cvt.rn.f32.s16 %r2735, %rs1340; 2026-02-21T08:30:48.5698982Z st.shared.v4.b32 [%r63], {%r2675, %r2673, %r2674, %r2672}; 2026-02-21T08:30:48.5699101Z st.shared.v4.b32 [%r63+16384], {%r2707, %r2705, %r2706, %r2704}; 2026-02-21T08:30:48.5699201Z st.shared.v4.b32 [%r64], {%r2679, %r2677, %r2678, %r2676}; 2026-02-21T08:30:48.5699309Z st.shared.v4.b32 [%r64+16384], {%r2711, %r2709, %r2710, %r2708}; 2026-02-21T08:30:48.5699405Z st.shared.v4.b32 [%r65], {%r2683, %r2681, %r2682, %r2680}; 2026-02-21T08:30:48.5699513Z st.shared.v4.b32 [%r65+16384], {%r2715, %r2713, %r2714, %r2712}; 2026-02-21T08:30:48.5699607Z st.shared.v4.b32 [%r66], {%r2687, %r2685, %r2686, %r2684}; 2026-02-21T08:30:48.5699708Z st.shared.v4.b32 [%r66+16384], {%r2719, %r2717, %r2718, %r2716}; 2026-02-21T08:30:48.5699809Z st.shared.v4.b32 [%r67], {%r2691, %r2689, %r2690, %r2688}; 2026-02-21T08:30:48.5699908Z st.shared.v4.b32 [%r67+16384], {%r2723, %r2721, %r2722, %r2720}; 2026-02-21T08:30:48.5700005Z st.shared.v4.b32 [%r68], {%r2695, %r2693, %r2694, %r2692}; 2026-02-21T08:30:48.5700115Z st.shared.v4.b32 
[%r68+16384], {%r2727, %r2725, %r2726, %r2724}; 2026-02-21T08:30:48.5700208Z st.shared.v4.b32 [%r69], {%r2699, %r2697, %r2698, %r2696}; 2026-02-21T08:30:48.5700309Z st.shared.v4.b32 [%r69+16384], {%r2731, %r2729, %r2730, %r2728}; 2026-02-21T08:30:48.5700406Z st.shared.v4.b32 [%r70], {%r2703, %r2701, %r2702, %r2700}; 2026-02-21T08:30:48.5700508Z st.shared.v4.b32 [%r70+16384], {%r2735, %r2733, %r2734, %r2732}; 2026-02-21T08:30:48.5700565Z $L__tmp299: 2026-02-21T08:30:48.5700794Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5700861Z // begin inline asm 2026-02-21T08:30:48.5700937Z fence.proxy.async.shared::cta; 2026-02-21T08:30:48.5700995Z // end inline asm 2026-02-21T08:30:48.5701059Z bar.sync 0; 2026-02-21T08:30:48.5701121Z @%p14 bra $L__BB0_23; 2026-02-21T08:30:48.5701226Z // %bb.22: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5701304Z elect.sync %r2760|%p172, -1; 2026-02-21T08:30:48.5701366Z mov.b32 %r2738, 69208336; 2026-02-21T08:30:48.5701428Z mov.pred %p171, 0; 2026-02-21T08:30:48.5701498Z // begin inline asm 2026-02-21T08:30:48.5701701Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 0 ], %rd737, %r2738, %p171; 2026-02-21T08:30:48.5701758Z // end inline asm 2026-02-21T08:30:48.5701815Z // begin inline asm 2026-02-21T08:30:48.5701978Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 8 ], %rd738, %r2738, %p173; 2026-02-21T08:30:48.5702036Z // end inline asm 2026-02-21T08:30:48.5702092Z // begin inline asm 2026-02-21T08:30:48.5702253Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 16 ], %rd739, %r2738, %p173; 2026-02-21T08:30:48.5702309Z // end inline asm 2026-02-21T08:30:48.5702366Z // begin inline asm 2026-02-21T08:30:48.5702517Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 24 ], %rd740, %r2738, %p173; 2026-02-21T08:30:48.5702643Z // end inline asm 2026-02-21T08:30:48.5702700Z // begin 
inline asm 2026-02-21T08:30:48.5702852Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 32 ], %rd741, %r2738, %p173; 2026-02-21T08:30:48.5702917Z // end inline asm 2026-02-21T08:30:48.5702973Z // begin inline asm 2026-02-21T08:30:48.5703122Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 40 ], %rd742, %r2738, %p173; 2026-02-21T08:30:48.5703187Z // end inline asm 2026-02-21T08:30:48.5703245Z // begin inline asm 2026-02-21T08:30:48.5703392Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 48 ], %rd743, %r2738, %p173; 2026-02-21T08:30:48.5703458Z // end inline asm 2026-02-21T08:30:48.5703516Z // begin inline asm 2026-02-21T08:30:48.5703664Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 56 ], %rd744, %r2738, %p173; 2026-02-21T08:30:48.5703723Z // end inline asm 2026-02-21T08:30:48.5703794Z add.s32 %r2762, %r294, 57344; 2026-02-21T08:30:48.5703907Z cvt.u64.u32 %rd745, %r2762; 2026-02-21T08:30:48.5703966Z // begin inline asm 2026-02-21T08:30:48.5704109Z @%p172 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd745]; 2026-02-21T08:30:48.5704169Z // end inline asm 2026-02-21T08:30:48.5704227Z $L__tmp300: 2026-02-21T08:30:48.5704344Z $L__BB0_23: // in Loop: Header=BB0_3 Depth=1 2026-02-21T08:30:48.5704514Z .loc 1 0 0 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0 2026-02-21T08:30:48.5704576Z or.b32 %r170, %r168, %r8; 2026-02-21T08:30:48.5704638Z or.b32 %r172, %r171, %r14; 2026-02-21T08:30:48.5704705Z or.b32 %r173, %r171, %r15; 2026-02-21T08:30:48.5704762Z or.b32 %r174, %r171, %r16; 2026-02-21T08:30:48.5704818Z or.b32 %r175, %r171, %r17; 2026-02-21T08:30:48.5704882Z or.b32 %r176, %r171, %r18; 2026-02-21T08:30:48.5704939Z or.b32 %r177, %r171, %r19; 2026-02-21T08:30:48.5704995Z or.b32 %r178, %r171, %r20; 2026-02-21T08:30:48.5705053Z or.b32 %r179, %r171, %r21; 2026-02-21T08:30:48.5705235Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 
2026-02-21T08:30:48.5705297Z add.s64 %rd746, %rd66, 256; 2026-02-21T08:30:48.5705356Z add.s64 %rd747, %rd67, 256; 2026-02-21T08:30:48.5705423Z add.s64 %rd748, %rd68, 256; 2026-02-21T08:30:48.5705481Z add.s64 %rd749, %rd69, 256; 2026-02-21T08:30:48.5705649Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5705713Z // begin inline asm 2026-02-21T08:30:48.5705836Z cp.async.cg.shared.global [ %r2763 + 0 ], [ %rd746 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5705892Z // end inline asm 2026-02-21T08:30:48.5705949Z // begin inline asm 2026-02-21T08:30:48.5706078Z cp.async.cg.shared.global [ %r2765 + 0 ], [ %rd747 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5706135Z // end inline asm 2026-02-21T08:30:48.5706192Z // begin inline asm 2026-02-21T08:30:48.5706316Z cp.async.cg.shared.global [ %r2767 + 0 ], [ %rd748 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5706373Z // end inline asm 2026-02-21T08:30:48.5706430Z // begin inline asm 2026-02-21T08:30:48.5706543Z cp.async.cg.shared.global [ %r2769 + 0 ], [ %rd749 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5706605Z // end inline asm 2026-02-21T08:30:48.5706668Z cp.async.commit_group; 2026-02-21T08:30:48.5706839Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5706906Z add.s32 %r2778, %r169, %r71; 2026-02-21T08:30:48.5706965Z add.s32 %r2779, %r169, %r72; 2026-02-21T08:30:48.5707130Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5707196Z cvt.s64.s32 %rd753, %r2778; 2026-02-21T08:30:48.5707259Z add.s64 %rd750, %rd93, %rd753; 2026-02-21T08:30:48.5707318Z cvt.s64.s32 %rd754, %r2779; 2026-02-21T08:30:48.5707379Z add.s64 %rd751, %rd93, %rd754; 2026-02-21T08:30:48.5707556Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5707651Z // begin inline asm 2026-02-21T08:30:48.5707767Z cp.async.cg.shared.global [ %r2771 + 0 ], [ %rd750 + 0 ], 0x10, %r2764; 
2026-02-21T08:30:48.5707830Z // end inline asm 2026-02-21T08:30:48.5707886Z // begin inline asm 2026-02-21T08:30:48.5707996Z cp.async.cg.shared.global [ %r2773 + 0 ], [ %rd751 + 0 ], 0x10, %r2764; 2026-02-21T08:30:48.5708056Z // end inline asm 2026-02-21T08:30:48.5708118Z cp.async.commit_group; 2026-02-21T08:30:48.5708283Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5708344Z add.s32 %r2780, %r39, %r168; 2026-02-21T08:30:48.5708409Z cvt.u64.u32 %rd1102, %r2780; 2026-02-21T08:30:48.5708466Z add.s32 %r2781, %r11, %r171; 2026-02-21T08:30:48.5708524Z shl.b32 %r2782, %r2781, 10; 2026-02-21T08:30:48.5708601Z mad.wide.s32 %rd1101, %r2782, 2, %rd10; 2026-02-21T08:30:48.5708695Z add.s32 %r2783, %r10, %r171; 2026-02-21T08:30:48.5708757Z shl.b32 %r2784, %r2783, 10; 2026-02-21T08:30:48.5708826Z mad.wide.s32 %rd1100, %r2784, 2, %rd10; 2026-02-21T08:30:48.5708891Z shl.b32 %r2785, %r167, 16; 2026-02-21T08:30:48.5708949Z or.b32 %r2786, %r79, %r2785; 2026-02-21T08:30:48.5709015Z mad.wide.s32 %rd1099, %r2786, 2, %rd10; 2026-02-21T08:30:48.5709079Z or.b32 %r4037, %r80, %r2785; 2026-02-21T08:30:48.5709138Z mov.b64 %rd1103, -32; 2026-02-21T08:30:48.5709195Z mov.b32 %r4040, %r4038; 2026-02-21T08:30:48.5709258Z mov.b32 %r4042, %r4038; 2026-02-21T08:30:48.5709316Z bra.uni $L__BB0_24; 2026-02-21T08:30:48.5709418Z $L__BB0_26: // in Loop: Header=BB0_24 Depth=2 2026-02-21T08:30:48.5709588Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5709657Z add.s64 %rd1103, %rd1103, 32; 2026-02-21T08:30:48.5709724Z setp.lt.u64 %p209, %rd1103, 416; 2026-02-21T08:30:48.5709777Z $L__tmp301: 2026-02-21T08:30:48.5710009Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5710071Z add.s32 %r2955, %r4041, 1; 2026-02-21T08:30:48.5710133Z setp.gt.s32 %p210, %r2955, 1; 2026-02-21T08:30:48.5710206Z selp.b32 %r4041, 0, %r2955, %p210; 
2026-02-21T08:30:48.5710266Z selp.b32 %r2956, 1, 0, %p210; 2026-02-21T08:30:48.5710326Z xor.b32 %r192, %r4042, %r2956; 2026-02-21T08:30:48.5710378Z $L__tmp302: 2026-02-21T08:30:48.5710553Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5710615Z add.s64 %rd764, %rd1099, %rd9; 2026-02-21T08:30:48.5710675Z add.s64 %rd765, %rd1100, %rd9; 2026-02-21T08:30:48.5710740Z add.s64 %rd766, %rd1101, %rd9; 2026-02-21T08:30:48.5710810Z mad.wide.s32 %rd767, %r4037, 2, %rd92; 2026-02-21T08:30:48.5710977Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5711037Z add.s32 %r2943, %r188, %r4047; 2026-02-21T08:30:48.5711109Z selp.b32 %r2944, 16, 0, %p209; 2026-02-21T08:30:48.5711166Z // begin inline asm 2026-02-21T08:30:48.5711283Z cp.async.cg.shared.global [ %r2943 + 0 ], [ %rd764 + 0 ], 0x10, %r2944; 2026-02-21T08:30:48.5711346Z // end inline asm 2026-02-21T08:30:48.5711406Z add.s32 %r2945, %r2943, 2048; 2026-02-21T08:30:48.5711465Z // begin inline asm 2026-02-21T08:30:48.5711618Z cp.async.cg.shared.global [ %r2945 + 0 ], [ %rd765 + 0 ], 0x10, %r2944; 2026-02-21T08:30:48.5711677Z // end inline asm 2026-02-21T08:30:48.5711737Z add.s32 %r2947, %r2943, 4096; 2026-02-21T08:30:48.5711795Z // begin inline asm 2026-02-21T08:30:48.5711919Z cp.async.cg.shared.global [ %r2947 + 0 ], [ %rd766 + 0 ], 0x10, %r2944; 2026-02-21T08:30:48.5711975Z // end inline asm 2026-02-21T08:30:48.5712032Z add.s32 %r2949, %r2943, 6144; 2026-02-21T08:30:48.5712096Z // begin inline asm 2026-02-21T08:30:48.5712211Z cp.async.cg.shared.global [ %r2949 + 0 ], [ %rd767 + 0 ], 0x10, %r2944; 2026-02-21T08:30:48.5712317Z // end inline asm 2026-02-21T08:30:48.5712380Z cp.async.commit_group; 2026-02-21T08:30:48.5712551Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5712616Z add.s64 %rd770, %rd9, %rd1102; 2026-02-21T08:30:48.5712680Z add.s64 %rd771, %rd770, 786432; 
2026-02-21T08:30:48.5712859Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5712925Z add.s64 %rd772, %rd770, 917504; 2026-02-21T08:30:48.5712987Z cvt.s64.s32 %rd773, %rd771; 2026-02-21T08:30:48.5713054Z add.s64 %rd768, %rd93, %rd773; 2026-02-21T08:30:48.5713113Z cvt.s64.s32 %rd774, %rd772; 2026-02-21T08:30:48.5713172Z add.s64 %rd769, %rd93, %rd774; 2026-02-21T08:30:48.5713333Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5713400Z add.s32 %r2951, %r189, %r111; 2026-02-21T08:30:48.5713501Z // begin inline asm 2026-02-21T08:30:48.5713620Z cp.async.cg.shared.global [ %r2951 + 0 ], [ %rd768 + 0 ], 0x10, %r2944; 2026-02-21T08:30:48.5713683Z // end inline asm 2026-02-21T08:30:48.5713741Z add.s32 %r2953, %r2951, 2048; 2026-02-21T08:30:48.5713798Z // begin inline asm 2026-02-21T08:30:48.5713920Z cp.async.cg.shared.global [ %r2953 + 0 ], [ %rd769 + 0 ], 0x10, %r2944; 2026-02-21T08:30:48.5713976Z // end inline asm 2026-02-21T08:30:48.5714040Z cp.async.commit_group; 2026-02-21T08:30:48.5714204Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5714275Z add.s64 %rd1102, %rd1102, 262144; 2026-02-21T08:30:48.5714335Z add.s64 %rd1101, %rd1101, 128; 2026-02-21T08:30:48.5714396Z add.s64 %rd1100, %rd1100, 128; 2026-02-21T08:30:48.5714462Z add.s64 %rd1099, %rd1099, 128; 2026-02-21T08:30:48.5714521Z add.s32 %r4037, %r4037, 64; 2026-02-21T08:30:48.5714587Z setp.lt.u64 %p211, %rd1103, 448; 2026-02-21T08:30:48.5714648Z mov.b32 %r4038, %r4042; 2026-02-21T08:30:48.5714715Z mov.b32 %r4042, %r192; 2026-02-21T08:30:48.5714774Z @%p211 bra $L__BB0_24; 2026-02-21T08:30:48.5714832Z bra.uni $L__BB0_27; 2026-02-21T08:30:48.5714941Z $L__BB0_24: // Parent Loop BB0_3 Depth=1 2026-02-21T08:30:48.5715035Z // => This Inner Loop Header: Depth=2 2026-02-21T08:30:48.5715095Z add.s32 %r2824, %r4040, 1; 2026-02-21T08:30:48.5715164Z setp.gt.s32 %p191, %r2824, 1; 
2026-02-21T08:30:48.5715230Z selp.b32 %r4040, 0, %r2824, %p191; 2026-02-21T08:30:48.5715400Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5715463Z cp.async.wait_group 2; 2026-02-21T08:30:48.5715522Z bar.sync 0; 2026-02-21T08:30:48.5715580Z shl.b32 %r2825, %r4040, 12; 2026-02-21T08:30:48.5715637Z shl.b32 %r2826, %r4040, 13; 2026-02-21T08:30:48.5715702Z add.s32 %r2828, %r294, %r2826; 2026-02-21T08:30:48.5715760Z add.s32 %r188, %r2828, 32768; 2026-02-21T08:30:48.5715821Z add.s32 %r2829, %r188, %r52; 2026-02-21T08:30:48.5715923Z ld.shared.v4.b32 {%r2830, %r2831, %r2832, %r2833}, [%r2829]; 2026-02-21T08:30:48.5715994Z mov.b32 {%rs1345, %rs1346}, %r2833; 2026-02-21T08:30:48.5716057Z mov.b32 {%rs1347, %rs1348}, %r2832; 2026-02-21T08:30:48.5716118Z mov.b32 {%rs1349, %rs1350}, %r2831; 2026-02-21T08:30:48.5716185Z mov.b32 {%rs1351, %rs1352}, %r2830; 2026-02-21T08:30:48.5716288Z ld.shared.v4.b32 {%r2834, %r2835, %r2836, %r2837}, [%r2829+16]; 2026-02-21T08:30:48.5716348Z mov.b32 {%rs1353, %rs1354}, %r2837; 2026-02-21T08:30:48.5716414Z mov.b32 {%rs1355, %rs1356}, %r2836; 2026-02-21T08:30:48.5716474Z mov.b32 {%rs1357, %rs1358}, %r2835; 2026-02-21T08:30:48.5716531Z mov.b32 {%rs1359, %rs1360}, %r2834; 2026-02-21T08:30:48.5716634Z ld.shared.v4.b32 {%r2838, %r2839, %r2840, %r2841}, [%r2829+32]; 2026-02-21T08:30:48.5716702Z mov.b32 {%rs1361, %rs1362}, %r2841; 2026-02-21T08:30:48.5716765Z mov.b32 {%rs1363, %rs1364}, %r2840; 2026-02-21T08:30:48.5716865Z mov.b32 {%rs1365, %rs1366}, %r2839; 2026-02-21T08:30:48.5716930Z mov.b32 {%rs1367, %rs1368}, %r2838; 2026-02-21T08:30:48.5717026Z ld.shared.v4.b32 {%r2842, %r2843, %r2844, %r2845}, [%r2829+48]; 2026-02-21T08:30:48.5717086Z mov.b32 {%rs1369, %rs1370}, %r2845; 2026-02-21T08:30:48.5717151Z mov.b32 {%rs1371, %rs1372}, %r2844; 2026-02-21T08:30:48.5717210Z mov.b32 {%rs1373, %rs1374}, %r2843; 2026-02-21T08:30:48.5717269Z mov.b32 {%rs1375, %rs1376}, %r2842; 2026-02-21T08:30:48.5717434Z .loc 1 
52 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:52:32 2026-02-21T08:30:48.5717503Z cvt.f32.bf16 %r2791, %rs1351; 2026-02-21T08:30:48.5717563Z cvt.f32.bf16 %r2792, %rs1352; 2026-02-21T08:30:48.5717622Z cvt.f32.bf16 %r2793, %rs1349; 2026-02-21T08:30:48.5717688Z cvt.f32.bf16 %r2794, %rs1350; 2026-02-21T08:30:48.5717745Z cvt.f32.bf16 %r2795, %rs1347; 2026-02-21T08:30:48.5717802Z cvt.f32.bf16 %r2796, %rs1348; 2026-02-21T08:30:48.5717901Z cvt.f32.bf16 %r2797, %rs1345; 2026-02-21T08:30:48.5717965Z cvt.f32.bf16 %r2798, %rs1346; 2026-02-21T08:30:48.5718023Z cvt.f32.bf16 %r2799, %rs1359; 2026-02-21T08:30:48.5718082Z cvt.f32.bf16 %r2800, %rs1360; 2026-02-21T08:30:48.5718148Z cvt.f32.bf16 %r2801, %rs1357; 2026-02-21T08:30:48.5718205Z cvt.f32.bf16 %r2802, %rs1358; 2026-02-21T08:30:48.5718262Z cvt.f32.bf16 %r2803, %rs1355; 2026-02-21T08:30:48.5718327Z cvt.f32.bf16 %r2804, %rs1356; 2026-02-21T08:30:48.5718384Z cvt.f32.bf16 %r2805, %rs1353; 2026-02-21T08:30:48.5718441Z cvt.f32.bf16 %r2806, %rs1354; 2026-02-21T08:30:48.5718498Z cvt.f32.bf16 %r2808, %rs1367; 2026-02-21T08:30:48.5718565Z cvt.f32.bf16 %r2809, %rs1368; 2026-02-21T08:30:48.5718622Z cvt.f32.bf16 %r2810, %rs1365; 2026-02-21T08:30:48.5718680Z cvt.f32.bf16 %r2811, %rs1366; 2026-02-21T08:30:48.5718746Z cvt.f32.bf16 %r2812, %rs1363; 2026-02-21T08:30:48.5718803Z cvt.f32.bf16 %r2813, %rs1364; 2026-02-21T08:30:48.5718862Z cvt.f32.bf16 %r2814, %rs1361; 2026-02-21T08:30:48.5718922Z cvt.f32.bf16 %r2815, %rs1362; 2026-02-21T08:30:48.5718989Z cvt.f32.bf16 %r2816, %rs1375; 2026-02-21T08:30:48.5719049Z cvt.f32.bf16 %r2817, %rs1376; 2026-02-21T08:30:48.5719108Z cvt.f32.bf16 %r2818, %rs1373; 2026-02-21T08:30:48.5719181Z cvt.f32.bf16 %r2819, %rs1374; 2026-02-21T08:30:48.5719238Z cvt.f32.bf16 %r2820, %rs1371; 2026-02-21T08:30:48.5719296Z cvt.f32.bf16 %r2821, %rs1372; 2026-02-21T08:30:48.5719353Z cvt.f32.bf16 %r2822, %rs1369; 2026-02-21T08:30:48.5719417Z cvt.f32.bf16 %r2823, %rs1370; 2026-02-21T08:30:48.5719469Z $L__tmp303: 
2026-02-21T08:30:48.5719691Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5719757Z // begin inline asm 2026-02-21T08:30:48.5719807Z 2026-02-21T08:30:48.5719857Z { 2026-02-21T08:30:48.5719924Z .reg .pred complete; 2026-02-21T08:30:48.5719979Z waitLoop: 2026-02-21T08:30:48.5720105Z mbarrier.try_wait.parity.shared.b64 complete, [%r4039], %r4038; 2026-02-21T08:30:48.5720173Z @!complete bra.uni waitLoop; 2026-02-21T08:30:48.5720232Z } 2026-02-21T08:30:48.5720236Z 2026-02-21T08:30:48.5720292Z // end inline asm 2026-02-21T08:30:48.5720353Z mov.pred %p192, -1; 2026-02-21T08:30:48.5720417Z // begin inline asm 2026-02-21T08:30:48.5720738Z @%p192 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 0], 32, {%r2791, %r2792, %r2793, %r2794, %r2795, %r2796, %r2797, %r2798, %r2799, %r2800, %r2801, %r2802, %r2803, %r2804, %r2805, %r2806}; 2026-02-21T08:30:48.5720796Z // end inline asm 2026-02-21T08:30:48.5720857Z // begin inline asm 2026-02-21T08:30:48.5721173Z @%p192 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r2497 + 16], 32, {%r2808, %r2809, %r2810, %r2811, %r2812, %r2813, %r2814, %r2815, %r2816, %r2817, %r2818, %r2819, %r2820, %r2821, %r2822, %r2823}; 2026-02-21T08:30:48.5721228Z // end inline asm 2026-02-21T08:30:48.5721284Z // begin inline asm 2026-02-21T08:30:48.5721362Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5721419Z // end inline asm 2026-02-21T08:30:48.5721529Z bar.sync 0; 2026-02-21T08:30:48.5721623Z $L__tmp304: 2026-02-21T08:30:48.5721793Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5721854Z add.s32 %r2846, %r294, %r2825; 2026-02-21T08:30:48.5721922Z add.s32 %r2847, %r2846, 49152; 2026-02-21T08:30:48.5721982Z add.s32 %r189, %r2847, %r4048; 2026-02-21T08:30:48.5722041Z add.s32 %r2848, %r2847, %r55; 2026-02-21T08:30:48.5722099Z add.s32 %r2849, %r2847, %r57; 2026-02-21T08:30:48.5722164Z add.s32 %r2850, %r2847, %r59; 
2026-02-21T08:30:48.5722221Z add.s32 %r2851, %r2847, %r61; 2026-02-21T08:30:48.5722383Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5722453Z ld.shared.s8 %rs1377, [%r189]; 2026-02-21T08:30:48.5722616Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5722722Z shl.b16 %rs1378, %rs1377, 4; 2026-02-21T08:30:48.5722888Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5722962Z ld.shared.s8 %rs1379, [%r189+128]; 2026-02-21T08:30:48.5723122Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5723183Z shl.b16 %rs1380, %rs1379, 4; 2026-02-21T08:30:48.5723352Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5723414Z ld.shared.s8 %rs1381, [%r189+256]; 2026-02-21T08:30:48.5723575Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5723642Z shl.b16 %rs1382, %rs1381, 4; 2026-02-21T08:30:48.5723802Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5723865Z ld.shared.s8 %rs1383, [%r189+384]; 2026-02-21T08:30:48.5724034Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5724093Z shl.b16 %rs1384, %rs1383, 4; 2026-02-21T08:30:48.5724251Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5724320Z ld.shared.s8 %rs1385, [%r189+512]; 2026-02-21T08:30:48.5724478Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5724537Z shl.b16 %rs1386, %rs1385, 4; 2026-02-21T08:30:48.5724697Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5724767Z ld.shared.s8 %rs1387, [%r189+640]; 2026-02-21T08:30:48.5724926Z .loc 1 57 28 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5724984Z shl.b16 %rs1388, %rs1387, 4; 2026-02-21T08:30:48.5725153Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5725217Z ld.shared.s8 %rs1389, [%r189+768]; 2026-02-21T08:30:48.5725374Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5725439Z shl.b16 %rs1390, %rs1389, 4; 2026-02-21T08:30:48.5725601Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5725664Z ld.shared.s8 %rs1391, [%r2848]; 2026-02-21T08:30:48.5725830Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5725890Z shl.b16 %rs1392, %rs1391, 4; 2026-02-21T08:30:48.5726051Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5726115Z ld.shared.s8 %rs1393, [%r189+1024]; 2026-02-21T08:30:48.5726282Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5726387Z shl.b16 %rs1394, %rs1393, 4; 2026-02-21T08:30:48.5726550Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5726624Z ld.shared.s8 %rs1395, [%r189+1152]; 2026-02-21T08:30:48.5726786Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5726845Z shl.b16 %rs1396, %rs1395, 4; 2026-02-21T08:30:48.5727019Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5727081Z ld.shared.s8 %rs1397, [%r189+1280]; 2026-02-21T08:30:48.5727242Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5727307Z shl.b16 %rs1398, %rs1397, 4; 2026-02-21T08:30:48.5727474Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5727573Z ld.shared.s8 %rs1399, [%r189+1408]; 
2026-02-21T08:30:48.5727740Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5727805Z shl.b16 %rs1400, %rs1399, 4; 2026-02-21T08:30:48.5727969Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5728032Z ld.shared.s8 %rs1401, [%r189+1536]; 2026-02-21T08:30:48.5728203Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5728262Z shl.b16 %rs1402, %rs1401, 4; 2026-02-21T08:30:48.5728429Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5728500Z ld.shared.s8 %rs1403, [%r189+1664]; 2026-02-21T08:30:48.5728670Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5728728Z shl.b16 %rs1404, %rs1403, 4; 2026-02-21T08:30:48.5728905Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5728970Z ld.shared.s8 %rs1405, [%r189+1792]; 2026-02-21T08:30:48.5729137Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5729198Z shl.b16 %rs1406, %rs1405, 4; 2026-02-21T08:30:48.5729374Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5729438Z ld.shared.s8 %rs1407, [%r2849]; 2026-02-21T08:30:48.5729604Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5729671Z shl.b16 %rs1408, %rs1407, 4; 2026-02-21T08:30:48.5729833Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5729896Z ld.shared.s8 %rs1409, [%r189+2048]; 2026-02-21T08:30:48.5730068Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5730130Z shl.b16 %rs1410, %rs1409, 4; 2026-02-21T08:30:48.5730293Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5730362Z ld.shared.s8 
%rs1411, [%r189+2176]; 2026-02-21T08:30:48.5730526Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5730584Z shl.b16 %rs1412, %rs1411, 4; 2026-02-21T08:30:48.5730752Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5730814Z ld.shared.s8 %rs1413, [%r189+2304]; 2026-02-21T08:30:48.5730976Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5731034Z shl.b16 %rs1414, %rs1413, 4; 2026-02-21T08:30:48.5731209Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5731319Z ld.shared.s8 %rs1415, [%r189+2432]; 2026-02-21T08:30:48.5731484Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5731582Z shl.b16 %rs1416, %rs1415, 4; 2026-02-21T08:30:48.5731749Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5731812Z ld.shared.s8 %rs1417, [%r189+2560]; 2026-02-21T08:30:48.5731985Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5732045Z shl.b16 %rs1418, %rs1417, 4; 2026-02-21T08:30:48.5732212Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5732281Z ld.shared.s8 %rs1419, [%r189+2688]; 2026-02-21T08:30:48.5732448Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5732568Z shl.b16 %rs1420, %rs1419, 4; 2026-02-21T08:30:48.5732733Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5732804Z ld.shared.s8 %rs1421, [%r189+2816]; 2026-02-21T08:30:48.5732967Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5733026Z shl.b16 %rs1422, %rs1421, 4; 2026-02-21T08:30:48.5733199Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 
2026-02-21T08:30:48.5733264Z ld.shared.s8 %rs1423, [%r2850]; 2026-02-21T08:30:48.5733429Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5733504Z shl.b16 %rs1424, %rs1423, 4; 2026-02-21T08:30:48.5733669Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5733736Z ld.shared.s8 %rs1425, [%r189+3072]; 2026-02-21T08:30:48.5733910Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5733969Z shl.b16 %rs1426, %rs1425, 4; 2026-02-21T08:30:48.5734132Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5734195Z ld.shared.s8 %rs1427, [%r189+3200]; 2026-02-21T08:30:48.5734361Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5734420Z shl.b16 %rs1428, %rs1427, 4; 2026-02-21T08:30:48.5734579Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5734651Z ld.shared.s8 %rs1429, [%r189+3328]; 2026-02-21T08:30:48.5734816Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5734875Z shl.b16 %rs1430, %rs1429, 4; 2026-02-21T08:30:48.5735046Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5735111Z ld.shared.s8 %rs1431, [%r189+3456]; 2026-02-21T08:30:48.5735274Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5735337Z shl.b16 %rs1432, %rs1431, 4; 2026-02-21T08:30:48.5735497Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5735558Z ld.shared.s8 %rs1433, [%r189+3584]; 2026-02-21T08:30:48.5735727Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5735785Z shl.b16 %rs1434, %rs1433, 4; 2026-02-21T08:30:48.5735950Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5736012Z ld.shared.s8 %rs1435, [%r189+3712]; 2026-02-21T08:30:48.5736183Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5736294Z shl.b16 %rs1436, %rs1435, 4; 2026-02-21T08:30:48.5736456Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5736526Z ld.shared.s8 %rs1437, [%r189+3840]; 2026-02-21T08:30:48.5736704Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5736765Z shl.b16 %rs1438, %rs1437, 4; 2026-02-21T08:30:48.5736943Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5737010Z ld.shared.s8 %rs1439, [%r2851]; 2026-02-21T08:30:48.5737184Z .loc 1 57 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:57:28 2026-02-21T08:30:48.5737251Z shl.b16 %rs1440, %rs1439, 4; 2026-02-21T08:30:48.5737422Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5737543Z cvt.s16.s8 %rs1441, %rs1378; 2026-02-21T08:30:48.5737609Z shr.s16 %rs1442, %rs1441, 4; 2026-02-21T08:30:48.5737679Z cvt.s16.s8 %rs1443, %rs1380; 2026-02-21T08:30:48.5737738Z shr.s16 %rs1444, %rs1443, 4; 2026-02-21T08:30:48.5737798Z shr.s16 %rs1445, %rs1377, 4; 2026-02-21T08:30:48.5737864Z shr.s16 %rs1446, %rs1379, 4; 2026-02-21T08:30:48.5738037Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5738103Z cvt.rn.f32.s16 %r2852, %rs1446; 2026-02-21T08:30:48.5738177Z cvt.rn.f32.s16 %r2853, %rs1445; 2026-02-21T08:30:48.5738241Z cvt.rn.f32.s16 %r2854, %rs1444; 2026-02-21T08:30:48.5738303Z cvt.rn.f32.s16 %r2855, %rs1442; 2026-02-21T08:30:48.5738476Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5738545Z cvt.s16.s8 %rs1447, %rs1382; 2026-02-21T08:30:48.5738606Z shr.s16 %rs1448, %rs1447, 4; 
2026-02-21T08:30:48.5738669Z cvt.s16.s8 %rs1449, %rs1384; 2026-02-21T08:30:48.5738738Z shr.s16 %rs1450, %rs1449, 4; 2026-02-21T08:30:48.5738798Z shr.s16 %rs1451, %rs1381, 4; 2026-02-21T08:30:48.5738857Z shr.s16 %rs1452, %rs1383, 4; 2026-02-21T08:30:48.5739032Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5739103Z cvt.rn.f32.s16 %r2856, %rs1452; 2026-02-21T08:30:48.5739165Z cvt.rn.f32.s16 %r2857, %rs1451; 2026-02-21T08:30:48.5739227Z cvt.rn.f32.s16 %r2858, %rs1450; 2026-02-21T08:30:48.5739296Z cvt.rn.f32.s16 %r2859, %rs1448; 2026-02-21T08:30:48.5739472Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5739535Z cvt.s16.s8 %rs1453, %rs1386; 2026-02-21T08:30:48.5739605Z shr.s16 %rs1454, %rs1453, 4; 2026-02-21T08:30:48.5739666Z cvt.s16.s8 %rs1455, %rs1388; 2026-02-21T08:30:48.5739729Z shr.s16 %rs1456, %rs1455, 4; 2026-02-21T08:30:48.5739791Z shr.s16 %rs1457, %rs1385, 4; 2026-02-21T08:30:48.5739867Z shr.s16 %rs1458, %rs1387, 4; 2026-02-21T08:30:48.5740044Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5740107Z cvt.rn.f32.s16 %r2860, %rs1458; 2026-02-21T08:30:48.5740178Z cvt.rn.f32.s16 %r2861, %rs1457; 2026-02-21T08:30:48.5740240Z cvt.rn.f32.s16 %r2862, %rs1456; 2026-02-21T08:30:48.5740300Z cvt.rn.f32.s16 %r2863, %rs1454; 2026-02-21T08:30:48.5740485Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5740547Z cvt.s16.s8 %rs1459, %rs1390; 2026-02-21T08:30:48.5740607Z shr.s16 %rs1460, %rs1459, 4; 2026-02-21T08:30:48.5740668Z cvt.s16.s8 %rs1461, %rs1392; 2026-02-21T08:30:48.5740736Z shr.s16 %rs1462, %rs1461, 4; 2026-02-21T08:30:48.5740796Z shr.s16 %rs1463, %rs1389, 4; 2026-02-21T08:30:48.5740855Z shr.s16 %rs1464, %rs1391, 4; 2026-02-21T08:30:48.5741037Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5741143Z cvt.rn.f32.s16 
%r2864, %rs1464; 2026-02-21T08:30:48.5741205Z cvt.rn.f32.s16 %r2865, %rs1463; 2026-02-21T08:30:48.5741266Z cvt.rn.f32.s16 %r2866, %rs1462; 2026-02-21T08:30:48.5741332Z cvt.rn.f32.s16 %r2867, %rs1460; 2026-02-21T08:30:48.5741503Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5741601Z cvt.s16.s8 %rs1465, %rs1394; 2026-02-21T08:30:48.5741673Z shr.s16 %rs1466, %rs1465, 4; 2026-02-21T08:30:48.5741735Z cvt.s16.s8 %rs1467, %rs1396; 2026-02-21T08:30:48.5741796Z shr.s16 %rs1468, %rs1467, 4; 2026-02-21T08:30:48.5741864Z shr.s16 %rs1469, %rs1393, 4; 2026-02-21T08:30:48.5741925Z shr.s16 %rs1470, %rs1395, 4; 2026-02-21T08:30:48.5742101Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5742162Z cvt.rn.f32.s16 %r2868, %rs1470; 2026-02-21T08:30:48.5742283Z cvt.rn.f32.s16 %r2869, %rs1469; 2026-02-21T08:30:48.5742348Z cvt.rn.f32.s16 %r2870, %rs1468; 2026-02-21T08:30:48.5742410Z cvt.rn.f32.s16 %r2871, %rs1466; 2026-02-21T08:30:48.5742592Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5742655Z cvt.s16.s8 %rs1471, %rs1398; 2026-02-21T08:30:48.5742715Z shr.s16 %rs1472, %rs1471, 4; 2026-02-21T08:30:48.5742784Z cvt.s16.s8 %rs1473, %rs1400; 2026-02-21T08:30:48.5742844Z shr.s16 %rs1474, %rs1473, 4; 2026-02-21T08:30:48.5742905Z shr.s16 %rs1475, %rs1397, 4; 2026-02-21T08:30:48.5742966Z shr.s16 %rs1476, %rs1399, 4; 2026-02-21T08:30:48.5743146Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5743210Z cvt.rn.f32.s16 %r2872, %rs1476; 2026-02-21T08:30:48.5743272Z cvt.rn.f32.s16 %r2873, %rs1475; 2026-02-21T08:30:48.5743343Z cvt.rn.f32.s16 %r2874, %rs1474; 2026-02-21T08:30:48.5743404Z cvt.rn.f32.s16 %r2875, %rs1472; 2026-02-21T08:30:48.5743576Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5743646Z cvt.s16.s8 %rs1477, %rs1402; 
2026-02-21T08:30:48.5743706Z shr.s16 %rs1478, %rs1477, 4; 2026-02-21T08:30:48.5743767Z cvt.s16.s8 %rs1479, %rs1404; 2026-02-21T08:30:48.5743826Z shr.s16 %rs1480, %rs1479, 4; 2026-02-21T08:30:48.5743894Z shr.s16 %rs1481, %rs1401, 4; 2026-02-21T08:30:48.5743954Z shr.s16 %rs1482, %rs1403, 4; 2026-02-21T08:30:48.5744125Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5744194Z cvt.rn.f32.s16 %r2876, %rs1482; 2026-02-21T08:30:48.5744255Z cvt.rn.f32.s16 %r2877, %rs1481; 2026-02-21T08:30:48.5744315Z cvt.rn.f32.s16 %r2878, %rs1480; 2026-02-21T08:30:48.5744385Z cvt.rn.f32.s16 %r2879, %rs1478; 2026-02-21T08:30:48.5744560Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5744620Z cvt.s16.s8 %rs1483, %rs1406; 2026-02-21T08:30:48.5744681Z shr.s16 %rs1484, %rs1483, 4; 2026-02-21T08:30:48.5744745Z cvt.s16.s8 %rs1485, %rs1408; 2026-02-21T08:30:48.5744802Z shr.s16 %rs1486, %rs1485, 4; 2026-02-21T08:30:48.5744858Z shr.s16 %rs1487, %rs1405, 4; 2026-02-21T08:30:48.5744921Z shr.s16 %rs1488, %rs1407, 4; 2026-02-21T08:30:48.5745087Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5745146Z cvt.rn.f32.s16 %r2880, %rs1488; 2026-02-21T08:30:48.5745205Z cvt.rn.f32.s16 %r2881, %rs1487; 2026-02-21T08:30:48.5745270Z cvt.rn.f32.s16 %r2882, %rs1486; 2026-02-21T08:30:48.5745329Z cvt.rn.f32.s16 %r2883, %rs1484; 2026-02-21T08:30:48.5745496Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5745563Z cvt.s16.s8 %rs1489, %rs1410; 2026-02-21T08:30:48.5745619Z shr.s16 %rs1490, %rs1489, 4; 2026-02-21T08:30:48.5745679Z cvt.s16.s8 %rs1491, %rs1412; 2026-02-21T08:30:48.5745790Z shr.s16 %rs1492, %rs1491, 4; 2026-02-21T08:30:48.5745847Z shr.s16 %rs1493, %rs1409, 4; 2026-02-21T08:30:48.5745903Z shr.s16 %rs1494, %rs1411, 4; 2026-02-21T08:30:48.5746067Z .loc 1 77 32 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5746132Z cvt.rn.f32.s16 %r2884, %rs1494; 2026-02-21T08:30:48.5746192Z cvt.rn.f32.s16 %r2885, %rs1493; 2026-02-21T08:30:48.5746251Z cvt.rn.f32.s16 %r2886, %rs1492; 2026-02-21T08:30:48.5746318Z cvt.rn.f32.s16 %r2887, %rs1490; 2026-02-21T08:30:48.5746483Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5746542Z cvt.s16.s8 %rs1495, %rs1414; 2026-02-21T08:30:48.5746604Z shr.s16 %rs1496, %rs1495, 4; 2026-02-21T08:30:48.5746661Z cvt.s16.s8 %rs1497, %rs1416; 2026-02-21T08:30:48.5746718Z shr.s16 %rs1498, %rs1497, 4; 2026-02-21T08:30:48.5746775Z shr.s16 %rs1499, %rs1413, 4; 2026-02-21T08:30:48.5746878Z shr.s16 %rs1500, %rs1415, 4; 2026-02-21T08:30:48.5747048Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5747107Z cvt.rn.f32.s16 %r2888, %rs1500; 2026-02-21T08:30:48.5747174Z cvt.rn.f32.s16 %r2889, %rs1499; 2026-02-21T08:30:48.5747235Z cvt.rn.f32.s16 %r2890, %rs1498; 2026-02-21T08:30:48.5747295Z cvt.rn.f32.s16 %r2891, %rs1496; 2026-02-21T08:30:48.5747461Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5747527Z cvt.s16.s8 %rs1501, %rs1418; 2026-02-21T08:30:48.5747584Z shr.s16 %rs1502, %rs1501, 4; 2026-02-21T08:30:48.5747642Z cvt.s16.s8 %rs1503, %rs1420; 2026-02-21T08:30:48.5747709Z shr.s16 %rs1504, %rs1503, 4; 2026-02-21T08:30:48.5747768Z shr.s16 %rs1505, %rs1417, 4; 2026-02-21T08:30:48.5747825Z shr.s16 %rs1506, %rs1419, 4; 2026-02-21T08:30:48.5747999Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5748063Z cvt.rn.f32.s16 %r2892, %rs1506; 2026-02-21T08:30:48.5748126Z cvt.rn.f32.s16 %r2893, %rs1505; 2026-02-21T08:30:48.5748185Z cvt.rn.f32.s16 %r2894, %rs1504; 2026-02-21T08:30:48.5748250Z cvt.rn.f32.s16 %r2895, %rs1502; 2026-02-21T08:30:48.5748418Z .loc 1 59 25 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5748477Z cvt.s16.s8 %rs1507, %rs1422; 2026-02-21T08:30:48.5748541Z shr.s16 %rs1508, %rs1507, 4; 2026-02-21T08:30:48.5748597Z cvt.s16.s8 %rs1509, %rs1424; 2026-02-21T08:30:48.5748653Z shr.s16 %rs1510, %rs1509, 4; 2026-02-21T08:30:48.5748718Z shr.s16 %rs1511, %rs1421, 4; 2026-02-21T08:30:48.5748775Z shr.s16 %rs1512, %rs1423, 4; 2026-02-21T08:30:48.5748943Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5749002Z cvt.rn.f32.s16 %r2896, %rs1512; 2026-02-21T08:30:48.5749068Z cvt.rn.f32.s16 %r2897, %rs1511; 2026-02-21T08:30:48.5749132Z cvt.rn.f32.s16 %r2898, %rs1510; 2026-02-21T08:30:48.5749192Z cvt.rn.f32.s16 %r2899, %rs1508; 2026-02-21T08:30:48.5749365Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5749423Z cvt.s16.s8 %rs1513, %rs1426; 2026-02-21T08:30:48.5749481Z shr.s16 %rs1514, %rs1513, 4; 2026-02-21T08:30:48.5749545Z cvt.s16.s8 %rs1515, %rs1428; 2026-02-21T08:30:48.5749604Z shr.s16 %rs1516, %rs1515, 4; 2026-02-21T08:30:48.5749661Z shr.s16 %rs1517, %rs1425, 4; 2026-02-21T08:30:48.5749718Z shr.s16 %rs1518, %rs1427, 4; 2026-02-21T08:30:48.5749890Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5749951Z cvt.rn.f32.s16 %r2900, %rs1518; 2026-02-21T08:30:48.5750011Z cvt.rn.f32.s16 %r2901, %rs1517; 2026-02-21T08:30:48.5750076Z cvt.rn.f32.s16 %r2902, %rs1516; 2026-02-21T08:30:48.5750135Z cvt.rn.f32.s16 %r2903, %rs1514; 2026-02-21T08:30:48.5750303Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5750410Z cvt.s16.s8 %rs1519, %rs1430; 2026-02-21T08:30:48.5750467Z shr.s16 %rs1520, %rs1519, 4; 2026-02-21T08:30:48.5750524Z cvt.s16.s8 %rs1521, %rs1432; 2026-02-21T08:30:48.5750581Z shr.s16 %rs1522, %rs1521, 4; 2026-02-21T08:30:48.5750649Z shr.s16 %rs1523, %rs1429, 4; 2026-02-21T08:30:48.5750705Z 
shr.s16 %rs1524, %rs1431, 4; 2026-02-21T08:30:48.5750867Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5750934Z cvt.rn.f32.s16 %r2904, %rs1524; 2026-02-21T08:30:48.5750991Z cvt.rn.f32.s16 %r2905, %rs1523; 2026-02-21T08:30:48.5751051Z cvt.rn.f32.s16 %r2906, %rs1522; 2026-02-21T08:30:48.5751108Z cvt.rn.f32.s16 %r2907, %rs1520; 2026-02-21T08:30:48.5751281Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5751376Z cvt.s16.s8 %rs1525, %rs1434; 2026-02-21T08:30:48.5751435Z shr.s16 %rs1526, %rs1525, 4; 2026-02-21T08:30:48.5751499Z cvt.s16.s8 %rs1527, %rs1436; 2026-02-21T08:30:48.5751586Z shr.s16 %rs1528, %rs1527, 4; 2026-02-21T08:30:48.5751645Z shr.s16 %rs1529, %rs1433, 4; 2026-02-21T08:30:48.5751708Z shr.s16 %rs1530, %rs1435, 4; 2026-02-21T08:30:48.5751870Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5751930Z cvt.rn.f32.s16 %r2908, %rs1530; 2026-02-21T08:30:48.5751988Z cvt.rn.f32.s16 %r2909, %rs1529; 2026-02-21T08:30:48.5752053Z cvt.rn.f32.s16 %r2910, %rs1528; 2026-02-21T08:30:48.5752112Z cvt.rn.f32.s16 %r2911, %rs1526; 2026-02-21T08:30:48.5752272Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5752337Z cvt.s16.s8 %rs1531, %rs1438; 2026-02-21T08:30:48.5752394Z shr.s16 %rs1532, %rs1531, 4; 2026-02-21T08:30:48.5752452Z cvt.s16.s8 %rs1533, %rs1440; 2026-02-21T08:30:48.5752518Z shr.s16 %rs1534, %rs1533, 4; 2026-02-21T08:30:48.5752576Z shr.s16 %rs1535, %rs1437, 4; 2026-02-21T08:30:48.5752633Z shr.s16 %rs1536, %rs1439, 4; 2026-02-21T08:30:48.5752791Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5752857Z cvt.rn.f32.s16 %r2912, %rs1536; 2026-02-21T08:30:48.5752915Z cvt.rn.f32.s16 %r2913, %rs1535; 2026-02-21T08:30:48.5752972Z cvt.rn.f32.s16 %r2914, %rs1534; 2026-02-21T08:30:48.5753037Z cvt.rn.f32.s16 %r2915, %rs1532; 
2026-02-21T08:30:48.5753137Z st.shared.v4.b32 [%r63], {%r2855, %r2853, %r2854, %r2852}; 2026-02-21T08:30:48.5753242Z st.shared.v4.b32 [%r63+16384], {%r2887, %r2885, %r2886, %r2884}; 2026-02-21T08:30:48.5753345Z st.shared.v4.b32 [%r64], {%r2859, %r2857, %r2858, %r2856}; 2026-02-21T08:30:48.5753448Z st.shared.v4.b32 [%r64+16384], {%r2891, %r2889, %r2890, %r2888}; 2026-02-21T08:30:48.5753541Z st.shared.v4.b32 [%r65], {%r2863, %r2861, %r2862, %r2860}; 2026-02-21T08:30:48.5753643Z st.shared.v4.b32 [%r65+16384], {%r2895, %r2893, %r2894, %r2892}; 2026-02-21T08:30:48.5753747Z st.shared.v4.b32 [%r66], {%r2867, %r2865, %r2866, %r2864}; 2026-02-21T08:30:48.5753845Z st.shared.v4.b32 [%r66+16384], {%r2899, %r2897, %r2898, %r2896}; 2026-02-21T08:30:48.5753934Z st.shared.v4.b32 [%r67], {%r2871, %r2869, %r2870, %r2868}; 2026-02-21T08:30:48.5754039Z st.shared.v4.b32 [%r67+16384], {%r2903, %r2901, %r2902, %r2900}; 2026-02-21T08:30:48.5754129Z st.shared.v4.b32 [%r68], {%r2875, %r2873, %r2874, %r2872}; 2026-02-21T08:30:48.5754225Z st.shared.v4.b32 [%r68+16384], {%r2907, %r2905, %r2906, %r2904}; 2026-02-21T08:30:48.5754321Z st.shared.v4.b32 [%r69], {%r2879, %r2877, %r2878, %r2876}; 2026-02-21T08:30:48.5754419Z st.shared.v4.b32 [%r69+16384], {%r2911, %r2909, %r2910, %r2908}; 2026-02-21T08:30:48.5754508Z st.shared.v4.b32 [%r70], {%r2883, %r2881, %r2882, %r2880}; 2026-02-21T08:30:48.5754603Z st.shared.v4.b32 [%r70+16384], {%r2915, %r2913, %r2914, %r2912}; 2026-02-21T08:30:48.5754826Z .loc 1 40 70 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:40:70 2026-02-21T08:30:48.5754888Z shl.b32 %r2916, %r4041, 3; 2026-02-21T08:30:48.5754951Z add.s32 %r2917, %r294, %r2916; 2026-02-21T08:30:48.5755020Z add.s32 %r4039, %r2917, 57344; 2026-02-21T08:30:48.5755073Z $L__tmp305: 2026-02-21T08:30:48.5755291Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5755357Z // begin inline asm 2026-02-21T08:30:48.5755432Z 
fence.proxy.async.shared::cta; 2026-02-21T08:30:48.5755487Z // end inline asm 2026-02-21T08:30:48.5755541Z bar.sync 0; 2026-02-21T08:30:48.5755609Z @%p14 bra $L__BB0_26; 2026-02-21T08:30:48.5755708Z // %bb.25: // in Loop: Header=BB0_24 Depth=2 2026-02-21T08:30:48.5755774Z elect.sync %r2942|%p193, -1; 2026-02-21T08:30:48.5755843Z mov.b32 %r2920, 69208336; 2026-02-21T08:30:48.5755964Z // begin inline asm 2026-02-21T08:30:48.5756130Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 0 ], %rd737, %r2920, %p192; 2026-02-21T08:30:48.5756195Z // end inline asm 2026-02-21T08:30:48.5756254Z // begin inline asm 2026-02-21T08:30:48.5756413Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 8 ], %rd738, %r2920, %p192; 2026-02-21T08:30:48.5756477Z // end inline asm 2026-02-21T08:30:48.5756532Z // begin inline asm 2026-02-21T08:30:48.5756688Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 16 ], %rd739, %r2920, %p192; 2026-02-21T08:30:48.5756739Z // end inline asm 2026-02-21T08:30:48.5756800Z // begin inline asm 2026-02-21T08:30:48.5756946Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 24 ], %rd740, %r2920, %p192; 2026-02-21T08:30:48.5757001Z // end inline asm 2026-02-21T08:30:48.5757056Z // begin inline asm 2026-02-21T08:30:48.5757206Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 32 ], %rd741, %r2920, %p192; 2026-02-21T08:30:48.5757259Z // end inline asm 2026-02-21T08:30:48.5757319Z // begin inline asm 2026-02-21T08:30:48.5757466Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 40 ], %rd742, %r2920, %p192; 2026-02-21T08:30:48.5757522Z // end inline asm 2026-02-21T08:30:48.5757577Z // begin inline asm 2026-02-21T08:30:48.5757730Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 48 ], %rd743, %r2920, %p192; 2026-02-21T08:30:48.5757786Z // end inline asm 2026-02-21T08:30:48.5757843Z // begin inline asm 
2026-02-21T08:30:48.5758000Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 56 ], %rd744, %r2920, %p192; 2026-02-21T08:30:48.5758056Z // end inline asm 2026-02-21T08:30:48.5758118Z cvt.u64.u32 %rd763, %r4039; 2026-02-21T08:30:48.5758180Z // begin inline asm 2026-02-21T08:30:48.5758310Z @%p193 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd763]; 2026-02-21T08:30:48.5758364Z // end inline asm 2026-02-21T08:30:48.5758424Z bra.uni $L__BB0_26; 2026-02-21T08:30:48.5758485Z $L__tmp306: 2026-02-21T08:30:48.5758580Z $L__BB0_1: // %.._crit_edge_crit_edge 2026-02-21T08:30:48.5758750Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5758818Z or.b32 %r4046, %r4047, 2048; 2026-02-21T08:30:48.5758876Z or.b32 %r4045, %r4047, 4096; 2026-02-21T08:30:48.5758934Z or.b32 %r4044, %r4047, 6144; 2026-02-21T08:30:48.5759027Z $L__BB0_28: // %._crit_edge 2026-02-21T08:30:48.5759200Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5759260Z sub.s32 %r3298, 4096, %r4074; 2026-02-21T08:30:48.5759330Z mul.hi.s32 %r3299, %r3298, -580400985; 2026-02-21T08:30:48.5759397Z add.s32 %r3300, %r3299, %r3298; 2026-02-21T08:30:48.5759456Z shr.u32 %r3301, %r3300, 31; 2026-02-21T08:30:48.5759513Z shr.s32 %r3302, %r3300, 11; 2026-02-21T08:30:48.5759582Z add.s32 %r201, %r3302, %r3301; 2026-02-21T08:30:48.5759681Z mul.lo.s32 %r3303, %r201, 2368; 2026-02-21T08:30:48.5759748Z setp.ne.b32 %p219, %r3298, %r3303; 2026-02-21T08:30:48.5759818Z setp.gt.s32 %p220, %r3298, -1; 2026-02-21T08:30:48.5759884Z and.pred %p221, %p220, %p219; 2026-02-21T08:30:48.5759944Z selp.b32 %r202, 1, 0, %p221; 2026-02-21T08:30:48.5760002Z add.s32 %r203, %r201, %r202; 2026-02-21T08:30:48.5760063Z $L__tmp307: 2026-02-21T08:30:48.5760281Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5760357Z shfl.sync.idx.b32 %r204, %r5, 0, 31, -1; 
2026-02-21T08:30:48.5760419Z shl.b32 %r3304, %r204, 21; 2026-02-21T08:30:48.5760474Z and.b32 %r205, %r3304, 6291456; 2026-02-21T08:30:48.5760532Z add.s32 %r3214, %r205, %r4008; 2026-02-21T08:30:48.5760588Z mov.pred %p215, -1; 2026-02-21T08:30:48.5760647Z mov.b32 %r4051, 0; 2026-02-21T08:30:48.5760753Z // begin inline asm 2026-02-21T08:30:48.5761067Z @%p215 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3214 + 0], 64, {%r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051}; 2026-02-21T08:30:48.5761122Z // end inline asm 2026-02-21T08:30:48.5761178Z // begin inline asm 2026-02-21T08:30:48.5761484Z @%p215 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3214 + 16], 64, {%r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051}; 2026-02-21T08:30:48.5761579Z // end inline asm 2026-02-21T08:30:48.5761633Z // begin inline asm 2026-02-21T08:30:48.5761936Z @%p215 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3214 + 32], 64, {%r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051}; 2026-02-21T08:30:48.5761989Z // end inline asm 2026-02-21T08:30:48.5762043Z // begin inline asm 2026-02-21T08:30:48.5762340Z @%p215 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3214 + 48], 64, {%r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051, %r4051}; 2026-02-21T08:30:48.5762402Z // end inline asm 2026-02-21T08:30:48.5762454Z // begin inline asm 2026-02-21T08:30:48.5762523Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5762573Z // end inline asm 2026-02-21T08:30:48.5762631Z bar.sync 0; 2026-02-21T08:30:48.5762679Z $L__tmp308: 2026-02-21T08:30:48.5762848Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5762913Z setp.lt.s32 %p222, %r203, 1; 2026-02-21T08:30:48.5762974Z 
setp.gt.s32 %p223, %r203, 0; 2026-02-21T08:30:48.5763141Z .loc 1 25 35 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:25:35 2026-02-21T08:30:48.5763196Z shr.s32 %r3305, %r4074, 31; 2026-02-21T08:30:48.5763257Z shr.u32 %r3306, %r3305, 23; 2026-02-21T08:30:48.5763315Z add.s32 %r3307, %r4074, %r3306; 2026-02-21T08:30:48.5763377Z shr.s32 %r3308, %r3307, 9; 2026-02-21T08:30:48.5763541Z .loc 1 26 33 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:26:33 2026-02-21T08:30:48.5763596Z shl.b32 %r207, %r3308, 3; 2026-02-21T08:30:48.5763757Z .loc 1 27 39 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:39 2026-02-21T08:30:48.5763814Z sub.s32 %r3309, 64, %r207; 2026-02-21T08:30:48.5763979Z .loc 1 27 52 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:52 2026-02-21T08:30:48.5764033Z min.s32 %r209, %r3309, 8; 2026-02-21T08:30:48.5764198Z .loc 1 28 45 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:45 2026-02-21T08:30:48.5764259Z and.b32 %r3310, %r3307, -512; 2026-02-21T08:30:48.5764319Z sub.s32 %r208, %r4074, %r3310; 2026-02-21T08:30:48.5764487Z .loc 1 29 51 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:29:51 2026-02-21T08:30:48.5764606Z div.s32 %r210, %r208, %r209; 2026-02-21T08:30:48.5764769Z .loc 1 32 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:32:27 2026-02-21T08:30:48.5764827Z shl.b32 %r4049, %r210, 6; 2026-02-21T08:30:48.5764998Z .loc 1 33 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:33:32 2026-02-21T08:30:48.5765057Z or.b32 %r4070, %r4049, %r9; 2026-02-21T08:30:48.5765116Z or.b32 %r4071, %r4049, %r10; 2026-02-21T08:30:48.5765183Z or.b32 %r4072, %r4049, %r11; 2026-02-21T08:30:48.5765240Z or.b32 %r4073, %r4049, %r12; 2026-02-21T08:30:48.5765407Z .loc 1 48 53 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:53 2026-02-21T08:30:48.5765468Z shl.b32 %r3311, %r4070, 10; 2026-02-21T08:30:48.5765538Z shl.b32 %r3312, %r4071, 10; 2026-02-21T08:30:48.5765596Z 
shl.b32 %r3313, %r4072, 10; 2026-02-21T08:30:48.5765654Z shl.b32 %r3314, %r4073, 10; 2026-02-21T08:30:48.5765870Z .loc 1 48 60 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:60 2026-02-21T08:30:48.5765932Z or.b32 %r3315, %r3311, %r22; 2026-02-21T08:30:48.5765990Z or.b32 %r3316, %r3312, %r22; 2026-02-21T08:30:48.5766055Z or.b32 %r3317, %r3313, %r22; 2026-02-21T08:30:48.5766111Z or.b32 %r3318, %r3314, %r22; 2026-02-21T08:30:48.5766273Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5766344Z mad.wide.s32 %rd911, %r3315, 2, %rd92; 2026-02-21T08:30:48.5766418Z mad.wide.s32 %rd912, %r3316, 2, %rd92; 2026-02-21T08:30:48.5766485Z mad.wide.s32 %rd913, %r3317, 2, %rd92; 2026-02-21T08:30:48.5766547Z mad.wide.s32 %rd914, %r3318, 2, %rd92; 2026-02-21T08:30:48.5766720Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5766781Z add.s32 %r3320, %r294, 32768; 2026-02-21T08:30:48.5766842Z add.s32 %r3282, %r3320, %r4047; 2026-02-21T08:30:48.5766912Z selp.b32 %r3283, 16, 0, %p223; 2026-02-21T08:30:48.5766971Z // begin inline asm 2026-02-21T08:30:48.5767095Z cp.async.cg.shared.global [ %r3282 + 0 ], [ %rd911 + 0 ], 0x10, %r3283; 2026-02-21T08:30:48.5767150Z // end inline asm 2026-02-21T08:30:48.5767216Z add.s32 %r3284, %r3320, %r4046; 2026-02-21T08:30:48.5767270Z // begin inline asm 2026-02-21T08:30:48.5767391Z cp.async.cg.shared.global [ %r3284 + 0 ], [ %rd912 + 0 ], 0x10, %r3283; 2026-02-21T08:30:48.5767451Z // end inline asm 2026-02-21T08:30:48.5767507Z add.s32 %r3286, %r3320, %r4045; 2026-02-21T08:30:48.5767564Z // begin inline asm 2026-02-21T08:30:48.5767678Z cp.async.cg.shared.global [ %r3286 + 0 ], [ %rd913 + 0 ], 0x10, %r3283; 2026-02-21T08:30:48.5767738Z // end inline asm 2026-02-21T08:30:48.5767794Z add.s32 %r3288, %r3320, %r4044; 2026-02-21T08:30:48.5767848Z // begin inline asm 2026-02-21T08:30:48.5767967Z cp.async.cg.shared.global [ %r3288 + 0 ], [ %rd914 + 0 ], 
0x10, %r3283; 2026-02-21T08:30:48.5768021Z // end inline asm 2026-02-21T08:30:48.5768087Z cp.async.commit_group; 2026-02-21T08:30:48.5768255Z .loc 1 48 60 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:60 2026-02-21T08:30:48.5768314Z or.b32 %r3321, %r3311, %r4043; 2026-02-21T08:30:48.5768371Z or.b32 %r3322, %r3312, %r4043; 2026-02-21T08:30:48.5768428Z or.b32 %r3323, %r3313, %r4043; 2026-02-21T08:30:48.5768486Z or.b32 %r3324, %r3314, %r4043; 2026-02-21T08:30:48.5768650Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5768717Z mad.wide.s32 %rd915, %r3321, 2, %rd92; 2026-02-21T08:30:48.5768782Z mad.wide.s32 %rd916, %r3322, 2, %rd92; 2026-02-21T08:30:48.5768846Z mad.wide.s32 %rd917, %r3323, 2, %rd92; 2026-02-21T08:30:48.5768905Z mad.wide.s32 %rd918, %r3324, 2, %rd92; 2026-02-21T08:30:48.5769069Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5769130Z bar.sync 0; 2026-02-21T08:30:48.5769229Z add.s32 %r3325, %r294, 40960; 2026-02-21T08:30:48.5769288Z add.s32 %r3290, %r3325, %r4047; 2026-02-21T08:30:48.5769349Z // begin inline asm 2026-02-21T08:30:48.5769462Z cp.async.cg.shared.global [ %r3290 + 0 ], [ %rd915 + 0 ], 0x10, %r3283; 2026-02-21T08:30:48.5769516Z // end inline asm 2026-02-21T08:30:48.5769578Z add.s32 %r3292, %r3325, %r4046; 2026-02-21T08:30:48.5769632Z // begin inline asm 2026-02-21T08:30:48.5769743Z cp.async.cg.shared.global [ %r3292 + 0 ], [ %rd916 + 0 ], 0x10, %r3283; 2026-02-21T08:30:48.5769798Z // end inline asm 2026-02-21T08:30:48.5769857Z add.s32 %r3294, %r3325, %r4045; 2026-02-21T08:30:48.5769912Z // begin inline asm 2026-02-21T08:30:48.5770023Z cp.async.cg.shared.global [ %r3294 + 0 ], [ %rd917 + 0 ], 0x10, %r3283; 2026-02-21T08:30:48.5770084Z // end inline asm 2026-02-21T08:30:48.5770142Z add.s32 %r3296, %r3325, %r4044; 2026-02-21T08:30:48.5770197Z // begin inline asm 2026-02-21T08:30:48.5770347Z cp.async.cg.shared.global [ %r3296 + 0 
], [ %rd918 + 0 ], 0x10, %r3283; 2026-02-21T08:30:48.5770411Z // end inline asm 2026-02-21T08:30:48.5770469Z cp.async.commit_group; 2026-02-21T08:30:48.5770642Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5770706Z @%p222 bra $L__BB0_37; 2026-02-21T08:30:48.5770784Z // %bb.29: // %.lr.ph447 2026-02-21T08:30:48.5770941Z .loc 1 0 0 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0 2026-02-21T08:30:48.5771005Z and.b32 %r13, %r1, 112; 2026-02-21T08:30:48.5771174Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5771228Z shl.b32 %r3330, %r203, 4; 2026-02-21T08:30:48.5771388Z .loc 1 28 64 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:64 2026-02-21T08:30:48.5771452Z mul.lo.s32 %r3331, %r210, %r209; 2026-02-21T08:30:48.5771509Z sub.s32 %r3332, %r208, %r3331; 2026-02-21T08:30:48.5771701Z .loc 1 28 30 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:30 2026-02-21T08:30:48.5771765Z add.s32 %r3333, %r3332, %r207; 2026-02-21T08:30:48.5771929Z .loc 1 30 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:30:27 2026-02-21T08:30:48.5771983Z shl.b32 %r4050, %r3333, 7; 2026-02-21T08:30:48.5772144Z .loc 1 31 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:31:32 2026-02-21T08:30:48.5772203Z or.b32 %r4052, %r4050, %r6; 2026-02-21T08:30:48.5772259Z add.s32 %r218, %r3330, -2; 2026-02-21T08:30:48.5772321Z shl.b32 %r3336, %r4013, 6; 2026-02-21T08:30:48.5772375Z and.b32 %r3338, %r4014, 64; 2026-02-21T08:30:48.5772537Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5772597Z add.s32 %r3340, %r294, %r4012; 2026-02-21T08:30:48.5772658Z add.s32 %r3341, %r3340, %r3336; 2026-02-21T08:30:48.5772721Z add.s32 %r3342, %r3341, %r3338; 2026-02-21T08:30:48.5772776Z add.s32 %r219, %r3342, 32768; 2026-02-21T08:30:48.5772839Z and.b32 %r3343, %r1, 24; 2026-02-21T08:30:48.5772892Z shl.b32 
%r3344, %r3343, 5; 2026-02-21T08:30:48.5772944Z shr.u32 %r3345, %r3343, 1; 2026-02-21T08:30:48.5773003Z shr.u32 %r3346, %r1, 4; 2026-02-21T08:30:48.5773059Z and.b32 %r3347, %r3346, 2; 2026-02-21T08:30:48.5773116Z shl.b32 %r3348, %r1, 1; 2026-02-21T08:30:48.5773172Z and.b32 %r3349, %r3348, 128; 2026-02-21T08:30:48.5773232Z or.b32 %r3350, %r3347, %r3349; 2026-02-21T08:30:48.5773291Z or.b32 %r3351, %r3344, %r3345; 2026-02-21T08:30:48.5773344Z or.b32 %r3352, %r3351, %r3350; 2026-02-21T08:30:48.5773409Z or.b32 %r3353, %r3352, %r6; 2026-02-21T08:30:48.5773471Z add.s32 %r220, %r294, %r3353; 2026-02-21T08:30:48.5773525Z xor.b32 %r3354, %r3353, 4; 2026-02-21T08:30:48.5773580Z add.s32 %r221, %r294, %r3354; 2026-02-21T08:30:48.5773647Z xor.b32 %r3355, %r3353, 8; 2026-02-21T08:30:48.5773708Z add.s32 %r222, %r294, %r3355; 2026-02-21T08:30:48.5773837Z xor.b32 %r3356, %r3353, 12; 2026-02-21T08:30:48.5773904Z add.s32 %r223, %r294, %r3356; 2026-02-21T08:30:48.5773962Z xor.b32 %r3357, %r3353, 32; 2026-02-21T08:30:48.5774020Z add.s32 %r224, %r294, %r3357; 2026-02-21T08:30:48.5774077Z xor.b32 %r3358, %r3353, 36; 2026-02-21T08:30:48.5774143Z add.s32 %r225, %r294, %r3358; 2026-02-21T08:30:48.5774199Z xor.b32 %r3359, %r3353, 40; 2026-02-21T08:30:48.5774257Z add.s32 %r226, %r294, %r3359; 2026-02-21T08:30:48.5774322Z xor.b32 %r3360, %r3353, 44; 2026-02-21T08:30:48.5774378Z add.s32 %r227, %r294, %r3360; 2026-02-21T08:30:48.5774436Z xor.b32 %r3361, %r3353, 64; 2026-02-21T08:30:48.5774495Z add.s32 %r228, %r294, %r3361; 2026-02-21T08:30:48.5774559Z xor.b32 %r3362, %r3353, 68; 2026-02-21T08:30:48.5774617Z add.s32 %r229, %r294, %r3362; 2026-02-21T08:30:48.5774675Z xor.b32 %r3363, %r3353, 72; 2026-02-21T08:30:48.5774741Z add.s32 %r230, %r294, %r3363; 2026-02-21T08:30:48.5774842Z xor.b32 %r3364, %r3353, 76; 2026-02-21T08:30:48.5774905Z add.s32 %r231, %r294, %r3364; 2026-02-21T08:30:48.5774962Z xor.b32 %r3365, %r3353, 96; 2026-02-21T08:30:48.5775028Z add.s32 %r232, %r294, %r3365; 
2026-02-21T08:30:48.5775086Z xor.b32 %r3366, %r3353, 100; 2026-02-21T08:30:48.5775144Z add.s32 %r233, %r294, %r3366; 2026-02-21T08:30:48.5775210Z xor.b32 %r3367, %r3353, 104; 2026-02-21T08:30:48.5775269Z add.s32 %r234, %r294, %r3367; 2026-02-21T08:30:48.5775326Z xor.b32 %r3368, %r3353, 108; 2026-02-21T08:30:48.5775391Z add.s32 %r235, %r294, %r3368; 2026-02-21T08:30:48.5775449Z and.b32 %r3369, %r1, 12; 2026-02-21T08:30:48.5775506Z and.b32 %r3370, %r4014, 12; 2026-02-21T08:30:48.5775566Z mul.lo.s32 %r3371, %r3369, 264; 2026-02-21T08:30:48.5775632Z or.b32 %r3372, %r3370, %r13; 2026-02-21T08:30:48.5775690Z xor.b32 %r3373, %r3371, %r3372; 2026-02-21T08:30:48.5775748Z add.s32 %r236, %r294, %r3373; 2026-02-21T08:30:48.5775810Z xor.b32 %r3374, %r3373, 260; 2026-02-21T08:30:48.5775869Z add.s32 %r237, %r294, %r3374; 2026-02-21T08:30:48.5775927Z xor.b32 %r3375, %r3373, 520; 2026-02-21T08:30:48.5775984Z add.s32 %r238, %r294, %r3375; 2026-02-21T08:30:48.5776049Z xor.b32 %r3376, %r3373, 780; 2026-02-21T08:30:48.5776103Z add.s32 %r239, %r294, %r3376; 2026-02-21T08:30:48.5776156Z shl.b32 %r3377, %r4048, 7; 2026-02-21T08:30:48.5776217Z or.b32 %r3378, %r3377, %r6; 2026-02-21T08:30:48.5776273Z add.s32 %r240, %r294, %r3378; 2026-02-21T08:30:48.5776326Z xor.b32 %r3379, %r3378, 16; 2026-02-21T08:30:48.5776384Z add.s32 %r241, %r294, %r3379; 2026-02-21T08:30:48.5776444Z xor.b32 %r3380, %r3378, 32; 2026-02-21T08:30:48.5776498Z add.s32 %r242, %r294, %r3380; 2026-02-21T08:30:48.5776551Z xor.b32 %r3381, %r3378, 48; 2026-02-21T08:30:48.5776610Z add.s32 %r243, %r294, %r3381; 2026-02-21T08:30:48.5776663Z xor.b32 %r3382, %r3378, 64; 2026-02-21T08:30:48.5776718Z add.s32 %r244, %r294, %r3382; 2026-02-21T08:30:48.5776770Z xor.b32 %r3383, %r3378, 80; 2026-02-21T08:30:48.5776832Z add.s32 %r245, %r294, %r3383; 2026-02-21T08:30:48.5776887Z xor.b32 %r3384, %r3378, 96; 2026-02-21T08:30:48.5776942Z add.s32 %r246, %r294, %r3384; 2026-02-21T08:30:48.5776998Z xor.b32 %r3385, %r3378, 112; 
2026-02-21T08:30:48.5777053Z add.s32 %r247, %r294, %r3385; 2026-02-21T08:30:48.5777108Z add.s32 %r3440, %r205, %r3694; 2026-02-21T08:30:48.5777169Z bfe.u32 %r3386, %r294, 4, 14; 2026-02-21T08:30:48.5777226Z cvt.u64.u32 %rd919, %r3386; 2026-02-21T08:30:48.5777296Z or.b64 %rd935, %rd919, 4611686293372403712; 2026-02-21T08:30:48.5777354Z add.s32 %r3387, %r294, 32; 2026-02-21T08:30:48.5777416Z bfe.u32 %r3388, %r3387, 4, 14; 2026-02-21T08:30:48.5777471Z cvt.u64.u32 %rd920, %r3388; 2026-02-21T08:30:48.5777541Z or.b64 %rd936, %rd920, 4611686293372403712; 2026-02-21T08:30:48.5777599Z add.s32 %r3389, %r294, 64; 2026-02-21T08:30:48.5777653Z bfe.u32 %r3390, %r3389, 4, 14; 2026-02-21T08:30:48.5777711Z cvt.u64.u32 %rd921, %r3390; 2026-02-21T08:30:48.5777776Z or.b64 %rd937, %rd921, 4611686293372403712; 2026-02-21T08:30:48.5777879Z add.s32 %r3391, %r294, 96; 2026-02-21T08:30:48.5777933Z bfe.u32 %r3392, %r3391, 4, 14; 2026-02-21T08:30:48.5777988Z cvt.u64.u32 %rd922, %r3392; 2026-02-21T08:30:48.5778059Z or.b64 %rd938, %rd922, 4611686293372403712; 2026-02-21T08:30:48.5778113Z add.s32 %r3393, %r294, 16384; 2026-02-21T08:30:48.5778170Z bfe.u32 %r3394, %r3393, 4, 14; 2026-02-21T08:30:48.5778231Z cvt.u64.u32 %rd923, %r3394; 2026-02-21T08:30:48.5778293Z or.b64 %rd939, %rd923, 4611686293372403712; 2026-02-21T08:30:48.5778347Z add.s32 %r3395, %r294, 16416; 2026-02-21T08:30:48.5778404Z bfe.u32 %r3396, %r3395, 4, 14; 2026-02-21T08:30:48.5778460Z cvt.u64.u32 %rd924, %r3396; 2026-02-21T08:30:48.5778526Z or.b64 %rd940, %rd924, 4611686293372403712; 2026-02-21T08:30:48.5778583Z add.s32 %r3397, %r294, 16448; 2026-02-21T08:30:48.5778649Z bfe.u32 %r3398, %r3397, 4, 14; 2026-02-21T08:30:48.5778706Z cvt.u64.u32 %rd925, %r3398; 2026-02-21T08:30:48.5778813Z or.b64 %rd941, %rd925, 4611686293372403712; 2026-02-21T08:30:48.5778875Z add.s32 %r3399, %r294, 16480; 2026-02-21T08:30:48.5778943Z bfe.u32 %r3400, %r3399, 4, 14; 2026-02-21T08:30:48.5779001Z cvt.u64.u32 %rd926, %r3400; 2026-02-21T08:30:48.5779068Z 
or.b64 %rd942, %rd926, 4611686293372403712; 2026-02-21T08:30:48.5779133Z and.b32 %r3402, %r4015, 3072; 2026-02-21T08:30:48.5779191Z shl.b32 %r3404, %r4013, 3; 2026-02-21T08:30:48.5779249Z or.b32 %r3405, %r4016, %r3404; 2026-02-21T08:30:48.5779308Z xor.b32 %r3406, %r3405, %r3338; 2026-02-21T08:30:48.5779377Z or.b32 %r3407, %r3406, %r3402; 2026-02-21T08:30:48.5779435Z add.s32 %r250, %r294, %r3407; 2026-02-21T08:30:48.5779493Z xor.b32 %r3408, %r3407, 32; 2026-02-21T08:30:48.5779559Z add.s32 %r251, %r294, %r3408; 2026-02-21T08:30:48.5779619Z and.b32 %r3410, %r4017, 3168; 2026-02-21T08:30:48.5779679Z shl.b32 %r3411, %r3343, 4; 2026-02-21T08:30:48.5779768Z and.b32 %r3412, %r4014, 16; 2026-02-21T08:30:48.5779828Z or.b32 %r3413, %r3410, %r3411; 2026-02-21T08:30:48.5779892Z xor.b32 %r3414, %r3413, %r4013; 2026-02-21T08:30:48.5779956Z add.s32 %r3415, %r294, %r3412; 2026-02-21T08:30:48.5780024Z add.s32 %r3819, %r3415, %r3414; 2026-02-21T08:30:48.5780084Z add.s32 %r3824, %r3819, 512; 2026-02-21T08:30:48.5780144Z shl.b32 %r3416, %r201, 4; 2026-02-21T08:30:48.5780209Z shl.b32 %r3417, %r202, 4; 2026-02-21T08:30:48.5780270Z add.s32 %r254, %r3416, %r3417; 2026-02-21T08:30:48.5780332Z mov.pred %p256, 0; 2026-02-21T08:30:48.5780389Z mov.b32 %r4056, 1; 2026-02-21T08:30:48.5780456Z mov.b32 %r4055, -1; 2026-02-21T08:30:48.5780515Z mov.b32 %r4053, 32; 2026-02-21T08:30:48.5780576Z mov.b32 %r4054, %r4051; 2026-02-21T08:30:48.5780643Z mov.b32 %r4061, %r4049; 2026-02-21T08:30:48.5780702Z mov.b32 %r4062, %r4052; 2026-02-21T08:30:48.5780761Z mov.b32 %r4063, %r4050; 2026-02-21T08:30:48.5780819Z mov.b32 %r4065, %r4056; 2026-02-21T08:30:48.5780885Z mov.b32 %r4066, %r4051; 2026-02-21T08:30:48.5780943Z mov.b32 %r4067, %r4063; 2026-02-21T08:30:48.5781003Z mov.b32 %r4068, %r4062; 2026-02-21T08:30:48.5781070Z mov.b32 %r4069, %r4061; 2026-02-21T08:30:48.5781129Z bra.uni $L__BB0_30; 2026-02-21T08:30:48.5781239Z $L__BB0_36: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:48.5781418Z .loc 1 19 112 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5781486Z add.s32 %r4066, %r4066, 1; 2026-02-21T08:30:48.5781580Z setp.ne.b32 %p253, %r254, %r4066; 2026-02-21T08:30:48.5781642Z mov.b32 %r4049, %r4061; 2026-02-21T08:30:48.5781709Z mov.b32 %r4050, %r4063; 2026-02-21T08:30:48.5781768Z mov.b32 %r4051, %r4065; 2026-02-21T08:30:48.5781827Z mov.b32 %r4052, %r4062; 2026-02-21T08:30:48.5781895Z mov.b32 %r4054, %r259; 2026-02-21T08:30:48.5781954Z mov.b32 %r4061, %r4069; 2026-02-21T08:30:48.5782012Z mov.b32 %r4062, %r4068; 2026-02-21T08:30:48.5782071Z mov.b32 %r4063, %r4067; 2026-02-21T08:30:48.5782138Z mov.b32 %r4065, %r273; 2026-02-21T08:30:48.5782204Z @%p253 bra $L__BB0_30; 2026-02-21T08:30:48.5782326Z bra.uni $L__BB0_37; 2026-02-21T08:30:48.5782443Z $L__BB0_30: // =>This Inner Loop Header: Depth=1 2026-02-21T08:30:48.5782624Z .loc 1 0 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0:112 2026-02-21T08:30:48.5782684Z mov.b32 %r259, %r4053; 2026-02-21T08:30:48.5782863Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5782934Z add.s32 %r3418, %r4065, 1; 2026-02-21T08:30:48.5783001Z setp.eq.b32 %p225, %r4065, 15; 2026-02-21T08:30:48.5783069Z selp.b32 %r273, 0, %r3418, %p225; 2026-02-21T08:30:48.5783143Z setp.ne.b32 %p226, %r273, 0; 2026-02-21T08:30:48.5783203Z @%p226 bra $L__BB0_32; 2026-02-21T08:30:48.5783305Z // %bb.31: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:48.5783371Z add.s32 %r4074, %r4074, 2368; 2026-02-21T08:30:48.5783598Z .loc 1 25 35 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:25:35 2026-02-21T08:30:48.5783662Z shr.s32 %r3419, %r4074, 31; 2026-02-21T08:30:48.5783721Z shr.u32 %r3420, %r3419, 23; 2026-02-21T08:30:48.5783791Z add.s32 %r3421, %r4074, %r3420; 2026-02-21T08:30:48.5783853Z shr.s32 %r3422, %r3421, 9; 2026-02-21T08:30:48.5784027Z .loc 1 26 33 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:26:33 2026-02-21T08:30:48.5784097Z 
shl.b32 %r3423, %r3422, 3; 2026-02-21T08:30:48.5784270Z .loc 1 27 39 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:39 2026-02-21T08:30:48.5784331Z sub.s32 %r3424, 64, %r3423; 2026-02-21T08:30:48.5784508Z .loc 1 27 52 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:27:52 2026-02-21T08:30:48.5784567Z min.s32 %r3425, %r3424, 8; 2026-02-21T08:30:48.5784739Z .loc 1 28 45 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:45 2026-02-21T08:30:48.5784809Z and.b32 %r3426, %r3421, -512; 2026-02-21T08:30:48.5784874Z sub.s32 %r3427, %r4074, %r3426; 2026-02-21T08:30:48.5785045Z .loc 1 29 51 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:29:51 2026-02-21T08:30:48.5785108Z div.s32 %r3428, %r3427, %r3425; 2026-02-21T08:30:48.5785285Z .loc 1 28 64 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:64 2026-02-21T08:30:48.5785352Z mul.lo.s32 %r3429, %r3428, %r3425; 2026-02-21T08:30:48.5785413Z sub.s32 %r3430, %r3427, %r3429; 2026-02-21T08:30:48.5785592Z .loc 1 28 30 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:28:30 2026-02-21T08:30:48.5785652Z add.s32 %r3431, %r3430, %r3423; 2026-02-21T08:30:48.5785821Z .loc 1 30 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:30:27 2026-02-21T08:30:48.5785891Z shl.b32 %r4067, %r3431, 7; 2026-02-21T08:30:48.5786066Z .loc 1 31 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:31:32 2026-02-21T08:30:48.5786129Z or.b32 %r4068, %r4067, %r6; 2026-02-21T08:30:48.5786301Z .loc 1 32 27 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:32:27 2026-02-21T08:30:48.5786367Z shl.b32 %r4069, %r3428, 6; 2026-02-21T08:30:48.5786535Z .loc 1 33 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:33:32 2026-02-21T08:30:48.5786595Z or.b32 %r4070, %r4069, %r9; 2026-02-21T08:30:48.5786664Z or.b32 %r4071, %r4069, %r10; 2026-02-21T08:30:48.5786725Z or.b32 %r4072, %r4069, %r11; 2026-02-21T08:30:48.5786786Z or.b32 %r4073, %r4069, %r12; 
2026-02-21T08:30:48.5786893Z $L__BB0_32: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:48.5787072Z .loc 1 0 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:0:32 2026-02-21T08:30:48.5787135Z setp.ne.b32 %p230, %r204, 0; 2026-02-21T08:30:48.5787327Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5787427Z add.s32 %r3475, %r4055, 1; 2026-02-21T08:30:48.5787494Z setp.gt.s32 %p231, %r3475, 1; 2026-02-21T08:30:48.5787571Z selp.b32 %r4055, 0, %r3475, %p231; 2026-02-21T08:30:48.5787746Z .loc 1 41 35 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:41:35 2026-02-21T08:30:48.5787805Z add.s32 %r3476, %r4054, %r9; 2026-02-21T08:30:48.5787864Z add.s32 %r3477, %r4054, %r10; 2026-02-21T08:30:48.5788035Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5788099Z cp.async.wait_group 1; 2026-02-21T08:30:48.5788155Z bar.sync 0; 2026-02-21T08:30:48.5788220Z shl.b32 %r3478, %r4055, 13; 2026-02-21T08:30:48.5788281Z add.s32 %r3479, %r219, %r3478; 2026-02-21T08:30:48.5788391Z ld.shared.v4.b32 {%r3480, %r3481, %r3482, %r3483}, [%r3479]; 2026-02-21T08:30:48.5788492Z mov.b32 {%rs1537, %rs1538}, %r3483; 2026-02-21T08:30:48.5788566Z mov.b32 {%rs1539, %rs1540}, %r3482; 2026-02-21T08:30:48.5788625Z mov.b32 {%rs1541, %rs1542}, %r3481; 2026-02-21T08:30:48.5788683Z mov.b32 {%rs1543, %rs1544}, %r3480; 2026-02-21T08:30:48.5788792Z ld.shared.v4.b32 {%r3484, %r3485, %r3486, %r3487}, [%r3479+16]; 2026-02-21T08:30:48.5788850Z mov.b32 {%rs1545, %rs1546}, %r3487; 2026-02-21T08:30:48.5788908Z mov.b32 {%rs1547, %rs1548}, %r3486; 2026-02-21T08:30:48.5788965Z mov.b32 {%rs1549, %rs1550}, %r3485; 2026-02-21T08:30:48.5789026Z mov.b32 {%rs1551, %rs1552}, %r3484; 2026-02-21T08:30:48.5789125Z ld.shared.v4.b32 {%r3488, %r3489, %r3490, %r3491}, [%r3479+32]; 2026-02-21T08:30:48.5789184Z mov.b32 {%rs1553, %rs1554}, %r3491; 2026-02-21T08:30:48.5789247Z mov.b32 {%rs1555, %rs1556}, %r3490; 
2026-02-21T08:30:48.5789303Z mov.b32 {%rs1557, %rs1558}, %r3489; 2026-02-21T08:30:48.5789361Z mov.b32 {%rs1559, %rs1560}, %r3488; 2026-02-21T08:30:48.5789459Z ld.shared.v4.b32 {%r3492, %r3493, %r3494, %r3495}, [%r3479+48]; 2026-02-21T08:30:48.5789519Z mov.b32 {%rs1561, %rs1562}, %r3495; 2026-02-21T08:30:48.5789575Z mov.b32 {%rs1563, %rs1564}, %r3494; 2026-02-21T08:30:48.5789632Z mov.b32 {%rs1565, %rs1566}, %r3493; 2026-02-21T08:30:48.5789693Z mov.b32 {%rs1567, %rs1568}, %r3492; 2026-02-21T08:30:48.5789858Z .loc 1 52 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:52:32 2026-02-21T08:30:48.5789918Z cvt.f32.bf16 %r3441, %rs1543; 2026-02-21T08:30:48.5789980Z cvt.f32.bf16 %r3442, %rs1544; 2026-02-21T08:30:48.5790037Z cvt.f32.bf16 %r3443, %rs1541; 2026-02-21T08:30:48.5790093Z cvt.f32.bf16 %r3444, %rs1542; 2026-02-21T08:30:48.5790152Z cvt.f32.bf16 %r3445, %rs1539; 2026-02-21T08:30:48.5790206Z cvt.f32.bf16 %r3446, %rs1540; 2026-02-21T08:30:48.5790262Z cvt.f32.bf16 %r3447, %rs1537; 2026-02-21T08:30:48.5790317Z cvt.f32.bf16 %r3448, %rs1538; 2026-02-21T08:30:48.5790379Z cvt.f32.bf16 %r3449, %rs1551; 2026-02-21T08:30:48.5790433Z cvt.f32.bf16 %r3450, %rs1552; 2026-02-21T08:30:48.5790490Z cvt.f32.bf16 %r3451, %rs1549; 2026-02-21T08:30:48.5790552Z cvt.f32.bf16 %r3452, %rs1550; 2026-02-21T08:30:48.5790606Z cvt.f32.bf16 %r3453, %rs1547; 2026-02-21T08:30:48.5790660Z cvt.f32.bf16 %r3454, %rs1548; 2026-02-21T08:30:48.5790715Z cvt.f32.bf16 %r3455, %rs1545; 2026-02-21T08:30:48.5790773Z cvt.f32.bf16 %r3456, %rs1546; 2026-02-21T08:30:48.5790828Z cvt.f32.bf16 %r3458, %rs1559; 2026-02-21T08:30:48.5790882Z cvt.f32.bf16 %r3459, %rs1560; 2026-02-21T08:30:48.5790943Z cvt.f32.bf16 %r3460, %rs1557; 2026-02-21T08:30:48.5790997Z cvt.f32.bf16 %r3461, %rs1558; 2026-02-21T08:30:48.5791053Z cvt.f32.bf16 %r3462, %rs1555; 2026-02-21T08:30:48.5791109Z cvt.f32.bf16 %r3463, %rs1556; 2026-02-21T08:30:48.5791170Z cvt.f32.bf16 %r3464, %rs1553; 2026-02-21T08:30:48.5791225Z cvt.f32.bf16 %r3465, 
%rs1554; 2026-02-21T08:30:48.5791280Z cvt.f32.bf16 %r3466, %rs1567; 2026-02-21T08:30:48.5791340Z cvt.f32.bf16 %r3467, %rs1568; 2026-02-21T08:30:48.5791397Z cvt.f32.bf16 %r3468, %rs1565; 2026-02-21T08:30:48.5791512Z cvt.f32.bf16 %r3469, %rs1566; 2026-02-21T08:30:48.5791606Z cvt.f32.bf16 %r3470, %rs1563; 2026-02-21T08:30:48.5791663Z cvt.f32.bf16 %r3471, %rs1564; 2026-02-21T08:30:48.5791719Z cvt.f32.bf16 %r3472, %rs1561; 2026-02-21T08:30:48.5791774Z cvt.f32.bf16 %r3473, %rs1562; 2026-02-21T08:30:48.5791943Z .loc 1 54 55 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:55 2026-02-21T08:30:48.5791999Z shl.b32 %r3496, %r3476, 13; 2026-02-21T08:30:48.5792058Z shl.b32 %r3497, %r3477, 13; 2026-02-21T08:30:48.5792226Z .loc 1 54 62 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:62 2026-02-21T08:30:48.5792284Z add.s32 %r3498, %r4052, %r3496; 2026-02-21T08:30:48.5792340Z add.s32 %r3499, %r4052, %r3497; 2026-02-21T08:30:48.5792499Z .loc 1 54 34 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:34 2026-02-21T08:30:48.5792562Z cvt.s64.s32 %rd933, %r3498; 2026-02-21T08:30:48.5792666Z add.s64 %rd928, %rd93, %rd933; 2026-02-21T08:30:48.5792728Z cvt.s64.s32 %rd934, %r3499; 2026-02-21T08:30:48.5792793Z add.s64 %rd931, %rd93, %rd934; 2026-02-21T08:30:48.5792952Z .loc 1 54 87 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:54:87 2026-02-21T08:30:48.5793008Z // begin inline asm 2026-02-21T08:30:48.5793067Z mov.u64 %rd927, 0x0; 2026-02-21T08:30:48.5793182Z createpolicy.fractional.L2::evict_first.b64 %rd927, 1.0; 2026-02-21T08:30:48.5793237Z // end inline asm 2026-02-21T08:30:48.5793292Z // begin inline asm 2026-02-21T08:30:48.5793352Z mov.u32 %r3432, 0x0; 2026-02-21T08:30:48.5793405Z mov.u32 %r3433, 0x0; 2026-02-21T08:30:48.5793457Z mov.u32 %r3434, 0x0; 2026-02-21T08:30:48.5793525Z mov.u32 %r3435, 0x0; 2026-02-21T08:30:48.5793703Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3432, %r3433, %r3434, %r3435 }, [ %rd928 + 0 ], %rd927; 
2026-02-21T08:30:48.5793757Z // end inline asm 2026-02-21T08:30:48.5793830Z prmt.b32 %r3500, %r3432, 0, 0x8880U; 2026-02-21T08:30:48.5793893Z cvt.u16.u32 %rs1569, %r3500; 2026-02-21T08:30:48.5793954Z prmt.b32 %r3501, %r3432, 0, 0x7770U; 2026-02-21T08:30:48.5794009Z cvt.u16.u32 %rs1570, %r3501; 2026-02-21T08:30:48.5794075Z prmt.b32 %r3502, %r3432, 0, 0x9991U; 2026-02-21T08:30:48.5794131Z cvt.u16.u32 %rs1571, %r3502; 2026-02-21T08:30:48.5794189Z prmt.b32 %r3503, %r3432, 0, 0x7771U; 2026-02-21T08:30:48.5794250Z cvt.u16.u32 %rs1572, %r3503; 2026-02-21T08:30:48.5794309Z prmt.b32 %r3504, %r3432, 0, 0xaaa2U; 2026-02-21T08:30:48.5794365Z cvt.u16.u32 %rs1573, %r3504; 2026-02-21T08:30:48.5794421Z prmt.b32 %r3505, %r3432, 0, 0x7772U; 2026-02-21T08:30:48.5794482Z cvt.u16.u32 %rs1574, %r3505; 2026-02-21T08:30:48.5794541Z prmt.b32 %r3506, %r3432, 0, 0xbbb3U; 2026-02-21T08:30:48.5794599Z cvt.u16.u32 %rs1575, %r3506; 2026-02-21T08:30:48.5794661Z prmt.b32 %r3507, %r3432, 0, 0x7773U; 2026-02-21T08:30:48.5794717Z cvt.u16.u32 %rs1576, %r3507; 2026-02-21T08:30:48.5794774Z prmt.b32 %r3508, %r3433, 0, 0x8880U; 2026-02-21T08:30:48.5794843Z cvt.u16.u32 %rs1577, %r3508; 2026-02-21T08:30:48.5794905Z prmt.b32 %r3509, %r3433, 0, 0x7770U; 2026-02-21T08:30:48.5794961Z cvt.u16.u32 %rs1578, %r3509; 2026-02-21T08:30:48.5795023Z prmt.b32 %r3510, %r3433, 0, 0x9991U; 2026-02-21T08:30:48.5795085Z cvt.u16.u32 %rs1579, %r3510; 2026-02-21T08:30:48.5795142Z prmt.b32 %r3511, %r3433, 0, 0x7771U; 2026-02-21T08:30:48.5795199Z cvt.u16.u32 %rs1580, %r3511; 2026-02-21T08:30:48.5795260Z prmt.b32 %r3512, %r3433, 0, 0xaaa2U; 2026-02-21T08:30:48.5795316Z cvt.u16.u32 %rs1581, %r3512; 2026-02-21T08:30:48.5795375Z prmt.b32 %r3513, %r3433, 0, 0x7772U; 2026-02-21T08:30:48.5795430Z cvt.u16.u32 %rs1582, %r3513; 2026-02-21T08:30:48.5795496Z prmt.b32 %r3514, %r3433, 0, 0xbbb3U; 2026-02-21T08:30:48.5795550Z cvt.u16.u32 %rs1583, %r3514; 2026-02-21T08:30:48.5795608Z prmt.b32 %r3515, %r3433, 0, 0x7773U; 
2026-02-21T08:30:48.5795671Z cvt.u16.u32 %rs1584, %r3515; 2026-02-21T08:30:48.5795730Z prmt.b32 %r3516, %r3434, 0, 0x8880U; 2026-02-21T08:30:48.5795831Z cvt.u16.u32 %rs1585, %r3516; 2026-02-21T08:30:48.5795896Z prmt.b32 %r3517, %r3434, 0, 0x7770U; 2026-02-21T08:30:48.5795954Z cvt.u16.u32 %rs1586, %r3517; 2026-02-21T08:30:48.5796013Z prmt.b32 %r3518, %r3434, 0, 0x9991U; 2026-02-21T08:30:48.5796071Z cvt.u16.u32 %rs1587, %r3518; 2026-02-21T08:30:48.5796135Z prmt.b32 %r3519, %r3434, 0, 0x7771U; 2026-02-21T08:30:48.5796190Z cvt.u16.u32 %rs1588, %r3519; 2026-02-21T08:30:48.5796248Z prmt.b32 %r3520, %r3434, 0, 0xaaa2U; 2026-02-21T08:30:48.5796312Z cvt.u16.u32 %rs1589, %r3520; 2026-02-21T08:30:48.5796370Z prmt.b32 %r3521, %r3434, 0, 0x7772U; 2026-02-21T08:30:48.5796427Z cvt.u16.u32 %rs1590, %r3521; 2026-02-21T08:30:48.5796486Z prmt.b32 %r3522, %r3434, 0, 0xbbb3U; 2026-02-21T08:30:48.5796547Z cvt.u16.u32 %rs1591, %r3522; 2026-02-21T08:30:48.5796606Z prmt.b32 %r3523, %r3434, 0, 0x7773U; 2026-02-21T08:30:48.5796663Z cvt.u16.u32 %rs1592, %r3523; 2026-02-21T08:30:48.5796767Z prmt.b32 %r3524, %r3435, 0, 0x8880U; 2026-02-21T08:30:48.5796827Z cvt.u16.u32 %rs1593, %r3524; 2026-02-21T08:30:48.5796885Z prmt.b32 %r3525, %r3435, 0, 0x7770U; 2026-02-21T08:30:48.5796941Z cvt.u16.u32 %rs1594, %r3525; 2026-02-21T08:30:48.5797005Z prmt.b32 %r3526, %r3435, 0, 0x9991U; 2026-02-21T08:30:48.5797059Z cvt.u16.u32 %rs1595, %r3526; 2026-02-21T08:30:48.5797116Z prmt.b32 %r3527, %r3435, 0, 0x7771U; 2026-02-21T08:30:48.5797175Z cvt.u16.u32 %rs1596, %r3527; 2026-02-21T08:30:48.5797234Z prmt.b32 %r3528, %r3435, 0, 0xaaa2U; 2026-02-21T08:30:48.5797287Z cvt.u16.u32 %rs1597, %r3528; 2026-02-21T08:30:48.5797349Z prmt.b32 %r3529, %r3435, 0, 0x7772U; 2026-02-21T08:30:48.5797405Z cvt.u16.u32 %rs1598, %r3529; 2026-02-21T08:30:48.5797462Z prmt.b32 %r3530, %r3435, 0, 0xbbb3U; 2026-02-21T08:30:48.5797517Z cvt.u16.u32 %rs1599, %r3530; 2026-02-21T08:30:48.5797580Z prmt.b32 %r3531, %r3435, 0, 0x7773U; 
2026-02-21T08:30:48.5797634Z cvt.u16.u32 %rs1600, %r3531; 2026-02-21T08:30:48.5797689Z // begin inline asm 2026-02-21T08:30:48.5797750Z mov.u64 %rd930, 0x0; 2026-02-21T08:30:48.5797860Z createpolicy.fractional.L2::evict_first.b64 %rd930, 1.0; 2026-02-21T08:30:48.5797915Z // end inline asm 2026-02-21T08:30:48.5797967Z // begin inline asm 2026-02-21T08:30:48.5798027Z mov.u32 %r3436, 0x0; 2026-02-21T08:30:48.5798079Z mov.u32 %r3437, 0x0; 2026-02-21T08:30:48.5798133Z mov.u32 %r3438, 0x0; 2026-02-21T08:30:48.5798189Z mov.u32 %r3439, 0x0; 2026-02-21T08:30:48.5798361Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3436, %r3437, %r3438, %r3439 }, [ %rd931 + 0 ], %rd930; 2026-02-21T08:30:48.5798415Z // end inline asm 2026-02-21T08:30:48.5798476Z prmt.b32 %r3532, %r3436, 0, 0x8880U; 2026-02-21T08:30:48.5798533Z cvt.u16.u32 %rs1601, %r3532; 2026-02-21T08:30:48.5798590Z prmt.b32 %r3533, %r3436, 0, 0x7770U; 2026-02-21T08:30:48.5798645Z cvt.u16.u32 %rs1602, %r3533; 2026-02-21T08:30:48.5798708Z prmt.b32 %r3534, %r3436, 0, 0x9991U; 2026-02-21T08:30:48.5798761Z cvt.u16.u32 %rs1603, %r3534; 2026-02-21T08:30:48.5798821Z prmt.b32 %r3535, %r3436, 0, 0x7771U; 2026-02-21T08:30:48.5798884Z cvt.u16.u32 %rs1604, %r3535; 2026-02-21T08:30:48.5798940Z prmt.b32 %r3536, %r3436, 0, 0xaaa2U; 2026-02-21T08:30:48.5798993Z cvt.u16.u32 %rs1605, %r3536; 2026-02-21T08:30:48.5799051Z prmt.b32 %r3537, %r3436, 0, 0x7772U; 2026-02-21T08:30:48.5799113Z cvt.u16.u32 %rs1606, %r3537; 2026-02-21T08:30:48.5799169Z prmt.b32 %r3538, %r3436, 0, 0xbbb3U; 2026-02-21T08:30:48.5799226Z cvt.u16.u32 %rs1607, %r3538; 2026-02-21T08:30:48.5799288Z prmt.b32 %r3539, %r3436, 0, 0x7773U; 2026-02-21T08:30:48.5799342Z cvt.u16.u32 %rs1608, %r3539; 2026-02-21T08:30:48.5799397Z prmt.b32 %r3540, %r3437, 0, 0x8880U; 2026-02-21T08:30:48.5799456Z cvt.u16.u32 %rs1609, %r3540; 2026-02-21T08:30:48.5799514Z prmt.b32 %r3541, %r3437, 0, 0x7770U; 2026-02-21T08:30:48.5799569Z cvt.u16.u32 %rs1610, %r3541; 2026-02-21T08:30:48.5799625Z 
prmt.b32 %r3542, %r3437, 0, 0x9991U; 2026-02-21T08:30:48.5799686Z cvt.u16.u32 %rs1611, %r3542; 2026-02-21T08:30:48.5799745Z prmt.b32 %r3543, %r3437, 0, 0x7771U; 2026-02-21T08:30:48.5799841Z cvt.u16.u32 %rs1612, %r3543; 2026-02-21T08:30:48.5799903Z prmt.b32 %r3544, %r3437, 0, 0xaaa2U; 2026-02-21T08:30:48.5799958Z cvt.u16.u32 %rs1613, %r3544; 2026-02-21T08:30:48.5800016Z prmt.b32 %r3545, %r3437, 0, 0x7772U; 2026-02-21T08:30:48.5800071Z cvt.u16.u32 %rs1614, %r3545; 2026-02-21T08:30:48.5800134Z prmt.b32 %r3546, %r3437, 0, 0xbbb3U; 2026-02-21T08:30:48.5800189Z cvt.u16.u32 %rs1615, %r3546; 2026-02-21T08:30:48.5800245Z prmt.b32 %r3547, %r3437, 0, 0x7773U; 2026-02-21T08:30:48.5800305Z cvt.u16.u32 %rs1616, %r3547; 2026-02-21T08:30:48.5800363Z prmt.b32 %r3548, %r3438, 0, 0x8880U; 2026-02-21T08:30:48.5800418Z cvt.u16.u32 %rs1617, %r3548; 2026-02-21T08:30:48.5800474Z prmt.b32 %r3549, %r3438, 0, 0x7770U; 2026-02-21T08:30:48.5800533Z cvt.u16.u32 %rs1618, %r3549; 2026-02-21T08:30:48.5800590Z prmt.b32 %r3550, %r3438, 0, 0x9991U; 2026-02-21T08:30:48.5800647Z cvt.u16.u32 %rs1619, %r3550; 2026-02-21T08:30:48.5800757Z prmt.b32 %r3551, %r3438, 0, 0x7771U; 2026-02-21T08:30:48.5800816Z cvt.u16.u32 %rs1620, %r3551; 2026-02-21T08:30:48.5800874Z prmt.b32 %r3552, %r3438, 0, 0xaaa2U; 2026-02-21T08:30:48.5800938Z cvt.u16.u32 %rs1621, %r3552; 2026-02-21T08:30:48.5800996Z prmt.b32 %r3553, %r3438, 0, 0x7772U; 2026-02-21T08:30:48.5801054Z cvt.u16.u32 %rs1622, %r3553; 2026-02-21T08:30:48.5801112Z prmt.b32 %r3554, %r3438, 0, 0xbbb3U; 2026-02-21T08:30:48.5801175Z cvt.u16.u32 %rs1623, %r3554; 2026-02-21T08:30:48.5801234Z prmt.b32 %r3555, %r3438, 0, 0x7773U; 2026-02-21T08:30:48.5801289Z cvt.u16.u32 %rs1624, %r3555; 2026-02-21T08:30:48.5801353Z prmt.b32 %r3556, %r3439, 0, 0x8880U; 2026-02-21T08:30:48.5801408Z cvt.u16.u32 %rs1625, %r3556; 2026-02-21T08:30:48.5801464Z prmt.b32 %r3557, %r3439, 0, 0x7770U; 2026-02-21T08:30:48.5801522Z cvt.u16.u32 %rs1626, %r3557; 2026-02-21T08:30:48.5801623Z prmt.b32 
%r3558, %r3439, 0, 0x9991U; 2026-02-21T08:30:48.5801680Z cvt.u16.u32 %rs1627, %r3558; 2026-02-21T08:30:48.5801738Z prmt.b32 %r3559, %r3439, 0, 0x7771U; 2026-02-21T08:30:48.5801799Z cvt.u16.u32 %rs1628, %r3559; 2026-02-21T08:30:48.5801856Z prmt.b32 %r3560, %r3439, 0, 0xaaa2U; 2026-02-21T08:30:48.5801912Z cvt.u16.u32 %rs1629, %r3560; 2026-02-21T08:30:48.5801974Z prmt.b32 %r3561, %r3439, 0, 0x7772U; 2026-02-21T08:30:48.5802031Z cvt.u16.u32 %rs1630, %r3561; 2026-02-21T08:30:48.5802089Z prmt.b32 %r3562, %r3439, 0, 0xbbb3U; 2026-02-21T08:30:48.5802144Z cvt.u16.u32 %rs1631, %r3562; 2026-02-21T08:30:48.5802206Z prmt.b32 %r3563, %r3439, 0, 0x7773U; 2026-02-21T08:30:48.5802261Z cvt.u16.u32 %rs1632, %r3563; 2026-02-21T08:30:48.5802427Z .loc 1 59 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:59:25 2026-02-21T08:30:48.5802488Z shl.b16 %rs1633, %rs1570, 12; 2026-02-21T08:30:48.5802546Z shr.s16 %rs1634, %rs1633, 12; 2026-02-21T08:30:48.5802603Z shl.b16 %rs1635, %rs1602, 12; 2026-02-21T08:30:48.5802658Z shr.s16 %rs1636, %rs1635, 12; 2026-02-21T08:30:48.5802715Z shl.b16 %rs1637, %rs1572, 12; 2026-02-21T08:30:48.5802771Z shr.s16 %rs1638, %rs1637, 12; 2026-02-21T08:30:48.5802824Z shl.b16 %rs1639, %rs1604, 12; 2026-02-21T08:30:48.5802882Z shr.s16 %rs1640, %rs1639, 12; 2026-02-21T08:30:48.5802939Z shl.b16 %rs1641, %rs1574, 12; 2026-02-21T08:30:48.5802992Z shr.s16 %rs1642, %rs1641, 12; 2026-02-21T08:30:48.5803045Z shl.b16 %rs1643, %rs1606, 12; 2026-02-21T08:30:48.5803104Z shr.s16 %rs1644, %rs1643, 12; 2026-02-21T08:30:48.5803158Z shl.b16 %rs1645, %rs1576, 12; 2026-02-21T08:30:48.5803212Z shr.s16 %rs1646, %rs1645, 12; 2026-02-21T08:30:48.5803270Z shl.b16 %rs1647, %rs1608, 12; 2026-02-21T08:30:48.5803323Z shr.s16 %rs1648, %rs1647, 12; 2026-02-21T08:30:48.5803377Z shl.b16 %rs1649, %rs1578, 12; 2026-02-21T08:30:48.5803437Z shr.s16 %rs1650, %rs1649, 12; 2026-02-21T08:30:48.5803490Z shl.b16 %rs1651, %rs1610, 12; 2026-02-21T08:30:48.5803542Z shr.s16 %rs1652, %rs1651, 12; 
2026-02-21T08:30:48.5803596Z shl.b16 %rs1653, %rs1580, 12; 2026-02-21T08:30:48.5803654Z shr.s16 %rs1654, %rs1653, 12; 2026-02-21T08:30:48.5803758Z shl.b16 %rs1655, %rs1612, 12; 2026-02-21T08:30:48.5803813Z shr.s16 %rs1656, %rs1655, 12; 2026-02-21T08:30:48.5803874Z shl.b16 %rs1657, %rs1582, 12; 2026-02-21T08:30:48.5803929Z shr.s16 %rs1658, %rs1657, 12; 2026-02-21T08:30:48.5803984Z shl.b16 %rs1659, %rs1614, 12; 2026-02-21T08:30:48.5804038Z shr.s16 %rs1660, %rs1659, 12; 2026-02-21T08:30:48.5804098Z shl.b16 %rs1661, %rs1584, 12; 2026-02-21T08:30:48.5804152Z shr.s16 %rs1662, %rs1661, 12; 2026-02-21T08:30:48.5804207Z shl.b16 %rs1663, %rs1616, 12; 2026-02-21T08:30:48.5804264Z shr.s16 %rs1664, %rs1663, 12; 2026-02-21T08:30:48.5804319Z shl.b16 %rs1665, %rs1586, 12; 2026-02-21T08:30:48.5804372Z shr.s16 %rs1666, %rs1665, 12; 2026-02-21T08:30:48.5804426Z shl.b16 %rs1667, %rs1618, 12; 2026-02-21T08:30:48.5804484Z shr.s16 %rs1668, %rs1667, 12; 2026-02-21T08:30:48.5804537Z shl.b16 %rs1669, %rs1588, 12; 2026-02-21T08:30:48.5804650Z shr.s16 %rs1670, %rs1669, 12; 2026-02-21T08:30:48.5804714Z shl.b16 %rs1671, %rs1620, 12; 2026-02-21T08:30:48.5804769Z shr.s16 %rs1672, %rs1671, 12; 2026-02-21T08:30:48.5804824Z shl.b16 %rs1673, %rs1590, 12; 2026-02-21T08:30:48.5804882Z shr.s16 %rs1674, %rs1673, 12; 2026-02-21T08:30:48.5804936Z shl.b16 %rs1675, %rs1622, 12; 2026-02-21T08:30:48.5804988Z shr.s16 %rs1676, %rs1675, 12; 2026-02-21T08:30:48.5805043Z shl.b16 %rs1677, %rs1592, 12; 2026-02-21T08:30:48.5805101Z shr.s16 %rs1678, %rs1677, 12; 2026-02-21T08:30:48.5805154Z shl.b16 %rs1679, %rs1624, 12; 2026-02-21T08:30:48.5805207Z shr.s16 %rs1680, %rs1679, 12; 2026-02-21T08:30:48.5805263Z shl.b16 %rs1681, %rs1594, 12; 2026-02-21T08:30:48.5805315Z shr.s16 %rs1682, %rs1681, 12; 2026-02-21T08:30:48.5805367Z shl.b16 %rs1683, %rs1626, 12; 2026-02-21T08:30:48.5805420Z shr.s16 %rs1684, %rs1683, 12; 2026-02-21T08:30:48.5805480Z shl.b16 %rs1685, %rs1596, 12; 2026-02-21T08:30:48.5805533Z shr.s16 %rs1686, 
%rs1685, 12; 2026-02-21T08:30:48.5805589Z shl.b16 %rs1687, %rs1628, 12; 2026-02-21T08:30:48.5805647Z shr.s16 %rs1688, %rs1687, 12; 2026-02-21T08:30:48.5805700Z shl.b16 %rs1689, %rs1598, 12; 2026-02-21T08:30:48.5805752Z shr.s16 %rs1690, %rs1689, 12; 2026-02-21T08:30:48.5805805Z shl.b16 %rs1691, %rs1630, 12; 2026-02-21T08:30:48.5805864Z shr.s16 %rs1692, %rs1691, 12; 2026-02-21T08:30:48.5805918Z shl.b16 %rs1693, %rs1600, 12; 2026-02-21T08:30:48.5805971Z shr.s16 %rs1694, %rs1693, 12; 2026-02-21T08:30:48.5806029Z shl.b16 %rs1695, %rs1632, 12; 2026-02-21T08:30:48.5806083Z shr.s16 %rs1696, %rs1695, 12; 2026-02-21T08:30:48.5806250Z .loc 1 62 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:62:28 2026-02-21T08:30:48.5806312Z shr.s16 %rs1697, %rs1569, 4; 2026-02-21T08:30:48.5806366Z shr.s16 %rs1698, %rs1571, 4; 2026-02-21T08:30:48.5806420Z shr.s16 %rs1699, %rs1573, 4; 2026-02-21T08:30:48.5806474Z shr.s16 %rs1700, %rs1575, 4; 2026-02-21T08:30:48.5806534Z shr.s16 %rs1701, %rs1577, 4; 2026-02-21T08:30:48.5806589Z shr.s16 %rs1702, %rs1579, 4; 2026-02-21T08:30:48.5806647Z shr.s16 %rs1703, %rs1581, 4; 2026-02-21T08:30:48.5806705Z shr.s16 %rs1704, %rs1583, 4; 2026-02-21T08:30:48.5806757Z shr.s16 %rs1705, %rs1585, 4; 2026-02-21T08:30:48.5806810Z shr.s16 %rs1706, %rs1587, 4; 2026-02-21T08:30:48.5806866Z shr.s16 %rs1707, %rs1589, 4; 2026-02-21T08:30:48.5806924Z shr.s16 %rs1708, %rs1591, 4; 2026-02-21T08:30:48.5806979Z shr.s16 %rs1709, %rs1593, 4; 2026-02-21T08:30:48.5807032Z shr.s16 %rs1710, %rs1595, 4; 2026-02-21T08:30:48.5807088Z shr.s16 %rs1711, %rs1597, 4; 2026-02-21T08:30:48.5807143Z shr.s16 %rs1712, %rs1599, 4; 2026-02-21T08:30:48.5807196Z shr.s16 %rs1713, %rs1601, 4; 2026-02-21T08:30:48.5807252Z shr.s16 %rs1714, %rs1603, 4; 2026-02-21T08:30:48.5807312Z shr.s16 %rs1715, %rs1605, 4; 2026-02-21T08:30:48.5807366Z shr.s16 %rs1716, %rs1607, 4; 2026-02-21T08:30:48.5807422Z shr.s16 %rs1717, %rs1609, 4; 2026-02-21T08:30:48.5807486Z shr.s16 %rs1718, %rs1611, 4; 
2026-02-21T08:30:48.5807543Z shr.s16 %rs1719, %rs1613, 4; 2026-02-21T08:30:48.5807640Z shr.s16 %rs1720, %rs1615, 4; 2026-02-21T08:30:48.5807698Z shr.s16 %rs1721, %rs1617, 4; 2026-02-21T08:30:48.5807760Z shr.s16 %rs1722, %rs1619, 4; 2026-02-21T08:30:48.5807817Z shr.s16 %rs1723, %rs1621, 4; 2026-02-21T08:30:48.5807874Z shr.s16 %rs1724, %rs1623, 4; 2026-02-21T08:30:48.5807938Z shr.s16 %rs1725, %rs1625, 4; 2026-02-21T08:30:48.5807990Z shr.s16 %rs1726, %rs1627, 4; 2026-02-21T08:30:48.5808047Z shr.s16 %rs1727, %rs1629, 4; 2026-02-21T08:30:48.5808105Z shr.s16 %rs1728, %rs1631, 4; 2026-02-21T08:30:48.5808270Z .loc 1 66 45 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:66:45 2026-02-21T08:30:48.5808350Z st.shared.v2.b8 [%r220], {%rs1634, %rs1636}; 2026-02-21T08:30:48.5808427Z st.shared.v2.b8 [%r221], {%rs1638, %rs1640}; 2026-02-21T08:30:48.5808501Z st.shared.v2.b8 [%r222], {%rs1642, %rs1644}; 2026-02-21T08:30:48.5808571Z st.shared.v2.b8 [%r223], {%rs1646, %rs1648}; 2026-02-21T08:30:48.5808688Z st.shared.v2.b8 [%r224+1024], {%rs1650, %rs1652}; 2026-02-21T08:30:48.5808774Z st.shared.v2.b8 [%r225+1024], {%rs1654, %rs1656}; 2026-02-21T08:30:48.5808852Z st.shared.v2.b8 [%r226+1024], {%rs1658, %rs1660}; 2026-02-21T08:30:48.5808927Z st.shared.v2.b8 [%r227+1024], {%rs1662, %rs1664}; 2026-02-21T08:30:48.5809006Z st.shared.v2.b8 [%r228+2048], {%rs1666, %rs1668}; 2026-02-21T08:30:48.5809083Z st.shared.v2.b8 [%r229+2048], {%rs1670, %rs1672}; 2026-02-21T08:30:48.5809155Z st.shared.v2.b8 [%r230+2048], {%rs1674, %rs1676}; 2026-02-21T08:30:48.5809229Z st.shared.v2.b8 [%r231+2048], {%rs1678, %rs1680}; 2026-02-21T08:30:48.5809307Z st.shared.v2.b8 [%r232+3072], {%rs1682, %rs1684}; 2026-02-21T08:30:48.5809382Z st.shared.v2.b8 [%r233+3072], {%rs1686, %rs1688}; 2026-02-21T08:30:48.5809461Z st.shared.v2.b8 [%r234+3072], {%rs1690, %rs1692}; 2026-02-21T08:30:48.5809537Z st.shared.v2.b8 [%r235+3072], {%rs1694, %rs1696}; 2026-02-21T08:30:48.5809589Z bar.sync 0; 
2026-02-21T08:30:48.5809655Z ld.shared.b32 %r3564, [%r236]; 2026-02-21T08:30:48.5809725Z ld.shared.b32 %r3565, [%r236+128]; 2026-02-21T08:30:48.5809784Z ld.shared.b32 %r3566, [%r237]; 2026-02-21T08:30:48.5809843Z ld.shared.b32 %r3567, [%r237+128]; 2026-02-21T08:30:48.5809904Z ld.shared.b32 %r3568, [%r238]; 2026-02-21T08:30:48.5809966Z ld.shared.b32 %r3569, [%r238+128]; 2026-02-21T08:30:48.5810022Z ld.shared.b32 %r3570, [%r239]; 2026-02-21T08:30:48.5810081Z ld.shared.b32 %r3571, [%r239+128]; 2026-02-21T08:30:48.5810254Z .loc 1 67 45 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:67:45 2026-02-21T08:30:48.5810304Z bar.sync 0; 2026-02-21T08:30:48.5810374Z st.shared.v2.b8 [%r220], {%rs1697, %rs1713}; 2026-02-21T08:30:48.5810446Z st.shared.v2.b8 [%r221], {%rs1698, %rs1714}; 2026-02-21T08:30:48.5810516Z st.shared.v2.b8 [%r222], {%rs1699, %rs1715}; 2026-02-21T08:30:48.5810587Z st.shared.v2.b8 [%r223], {%rs1700, %rs1716}; 2026-02-21T08:30:48.5810664Z st.shared.v2.b8 [%r224+1024], {%rs1701, %rs1717}; 2026-02-21T08:30:48.5810746Z st.shared.v2.b8 [%r225+1024], {%rs1702, %rs1718}; 2026-02-21T08:30:48.5810819Z st.shared.v2.b8 [%r226+1024], {%rs1703, %rs1719}; 2026-02-21T08:30:48.5810890Z st.shared.v2.b8 [%r227+1024], {%rs1704, %rs1720}; 2026-02-21T08:30:48.5810970Z st.shared.v2.b8 [%r228+2048], {%rs1705, %rs1721}; 2026-02-21T08:30:48.5811041Z st.shared.v2.b8 [%r229+2048], {%rs1706, %rs1722}; 2026-02-21T08:30:48.5811118Z st.shared.v2.b8 [%r230+2048], {%rs1707, %rs1723}; 2026-02-21T08:30:48.5811189Z st.shared.v2.b8 [%r231+2048], {%rs1708, %rs1724}; 2026-02-21T08:30:48.5811268Z st.shared.v2.b8 [%r232+3072], {%rs1709, %rs1725}; 2026-02-21T08:30:48.5811341Z st.shared.v2.b8 [%r233+3072], {%rs1710, %rs1726}; 2026-02-21T08:30:48.5811413Z st.shared.v2.b8 [%r234+3072], {%rs1711, %rs1727}; 2026-02-21T08:30:48.5811494Z st.shared.v2.b8 [%r235+3072], {%rs1712, %rs1728}; 2026-02-21T08:30:48.5811570Z bar.sync 0; 2026-02-21T08:30:48.5811633Z ld.shared.b32 %r3572, [%r236]; 
2026-02-21T08:30:48.5811752Z ld.shared.b32 %r3573, [%r236+128]; 2026-02-21T08:30:48.5811814Z ld.shared.b32 %r3574, [%r237]; 2026-02-21T08:30:48.5811874Z ld.shared.b32 %r3575, [%r237+128]; 2026-02-21T08:30:48.5811936Z ld.shared.b32 %r3576, [%r238]; 2026-02-21T08:30:48.5812005Z ld.shared.b32 %r3577, [%r238+128]; 2026-02-21T08:30:48.5812066Z ld.shared.b32 %r3578, [%r239]; 2026-02-21T08:30:48.5812126Z ld.shared.b32 %r3579, [%r239+128]; 2026-02-21T08:30:48.5812298Z .loc 1 77 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:77:32 2026-02-21T08:30:48.5812358Z cvt.s8.s32 %rs1729, %r3566; 2026-02-21T08:30:48.5812420Z cvt.rn.f32.s16 %r3580, %rs1729; 2026-02-21T08:30:48.5812485Z cvt.s8.s32 %rs1730, %r3564; 2026-02-21T08:30:48.5812544Z cvt.rn.f32.s16 %r3581, %rs1730; 2026-02-21T08:30:48.5812599Z cvt.s8.s32 %rs1731, %r3574; 2026-02-21T08:30:48.5812656Z cvt.rn.f32.s16 %r3582, %rs1731; 2026-02-21T08:30:48.5812719Z cvt.s8.s32 %rs1732, %r3572; 2026-02-21T08:30:48.5812823Z cvt.rn.f32.s16 %r3583, %rs1732; 2026-02-21T08:30:48.5812881Z cvt.s8.s32 %rs1733, %r3570; 2026-02-21T08:30:48.5812939Z cvt.rn.f32.s16 %r3584, %rs1733; 2026-02-21T08:30:48.5812995Z cvt.s8.s32 %rs1734, %r3568; 2026-02-21T08:30:48.5813054Z cvt.rn.f32.s16 %r3585, %rs1734; 2026-02-21T08:30:48.5813110Z cvt.s8.s32 %rs1735, %r3578; 2026-02-21T08:30:48.5813174Z cvt.rn.f32.s16 %r3586, %rs1735; 2026-02-21T08:30:48.5813229Z cvt.s8.s32 %rs1736, %r3576; 2026-02-21T08:30:48.5813285Z cvt.rn.f32.s16 %r3587, %rs1736; 2026-02-21T08:30:48.5813350Z prmt.b32 %r3588, %r3566, 0, 0xaaa2U; 2026-02-21T08:30:48.5813406Z cvt.u16.u32 %rs1737, %r3588; 2026-02-21T08:30:48.5813463Z cvt.rn.f32.s16 %r3589, %rs1737; 2026-02-21T08:30:48.5813522Z prmt.b32 %r3590, %r3564, 0, 0xaaa2U; 2026-02-21T08:30:48.5813580Z cvt.u16.u32 %rs1738, %r3590; 2026-02-21T08:30:48.5813636Z cvt.rn.f32.s16 %r3591, %rs1738; 2026-02-21T08:30:48.5813695Z prmt.b32 %r3592, %r3574, 0, 0xaaa2U; 2026-02-21T08:30:48.5813761Z cvt.u16.u32 %rs1739, %r3592; 
2026-02-21T08:30:48.5813820Z cvt.rn.f32.s16 %r3593, %rs1739; 2026-02-21T08:30:48.5813880Z prmt.b32 %r3594, %r3572, 0, 0xaaa2U; 2026-02-21T08:30:48.5813938Z cvt.u16.u32 %rs1740, %r3594; 2026-02-21T08:30:48.5814001Z cvt.rn.f32.s16 %r3595, %rs1740; 2026-02-21T08:30:48.5814058Z prmt.b32 %r3596, %r3570, 0, 0xaaa2U; 2026-02-21T08:30:48.5814114Z cvt.u16.u32 %rs1741, %r3596; 2026-02-21T08:30:48.5814172Z cvt.rn.f32.s16 %r3597, %rs1741; 2026-02-21T08:30:48.5814230Z prmt.b32 %r3598, %r3568, 0, 0xaaa2U; 2026-02-21T08:30:48.5814285Z cvt.u16.u32 %rs1742, %r3598; 2026-02-21T08:30:48.5814347Z cvt.rn.f32.s16 %r3599, %rs1742; 2026-02-21T08:30:48.5814405Z prmt.b32 %r3600, %r3578, 0, 0xaaa2U; 2026-02-21T08:30:48.5814460Z cvt.u16.u32 %rs1743, %r3600; 2026-02-21T08:30:48.5814514Z cvt.rn.f32.s16 %r3601, %rs1743; 2026-02-21T08:30:48.5814576Z prmt.b32 %r3602, %r3576, 0, 0xaaa2U; 2026-02-21T08:30:48.5814629Z cvt.u16.u32 %rs1744, %r3602; 2026-02-21T08:30:48.5814686Z cvt.rn.f32.s16 %r3603, %rs1744; 2026-02-21T08:30:48.5814748Z cvt.s8.s32 %rs1745, %r3567; 2026-02-21T08:30:48.5814804Z cvt.rn.f32.s16 %r3604, %rs1745; 2026-02-21T08:30:48.5814859Z cvt.s8.s32 %rs1746, %r3565; 2026-02-21T08:30:48.5814918Z cvt.rn.f32.s16 %r3605, %rs1746; 2026-02-21T08:30:48.5814976Z cvt.s8.s32 %rs1747, %r3575; 2026-02-21T08:30:48.5815030Z cvt.rn.f32.s16 %r3606, %rs1747; 2026-02-21T08:30:48.5815083Z cvt.s8.s32 %rs1748, %r3573; 2026-02-21T08:30:48.5815145Z cvt.rn.f32.s16 %r3607, %rs1748; 2026-02-21T08:30:48.5815197Z cvt.s8.s32 %rs1749, %r3571; 2026-02-21T08:30:48.5815252Z cvt.rn.f32.s16 %r3608, %rs1749; 2026-02-21T08:30:48.5815310Z cvt.s8.s32 %rs1750, %r3569; 2026-02-21T08:30:48.5815365Z cvt.rn.f32.s16 %r3609, %rs1750; 2026-02-21T08:30:48.5815421Z cvt.s8.s32 %rs1751, %r3579; 2026-02-21T08:30:48.5815476Z cvt.rn.f32.s16 %r3610, %rs1751; 2026-02-21T08:30:48.5815533Z cvt.s8.s32 %rs1752, %r3577; 2026-02-21T08:30:48.5815588Z cvt.rn.f32.s16 %r3611, %rs1752; 2026-02-21T08:30:48.5815650Z prmt.b32 %r3612, %r3567, 0, 0xaaa2U; 
2026-02-21T08:30:48.5815748Z cvt.u16.u32 %rs1753, %r3612; 2026-02-21T08:30:48.5815804Z cvt.rn.f32.s16 %r3613, %rs1753; 2026-02-21T08:30:48.5815861Z prmt.b32 %r3614, %r3565, 0, 0xaaa2U; 2026-02-21T08:30:48.5815915Z cvt.u16.u32 %rs1754, %r3614; 2026-02-21T08:30:48.5815974Z cvt.rn.f32.s16 %r3615, %rs1754; 2026-02-21T08:30:48.5816030Z prmt.b32 %r3616, %r3575, 0, 0xaaa2U; 2026-02-21T08:30:48.5816087Z cvt.u16.u32 %rs1755, %r3616; 2026-02-21T08:30:48.5816146Z cvt.rn.f32.s16 %r3617, %rs1755; 2026-02-21T08:30:48.5816207Z prmt.b32 %r3618, %r3573, 0, 0xaaa2U; 2026-02-21T08:30:48.5816261Z cvt.u16.u32 %rs1756, %r3618; 2026-02-21T08:30:48.5816321Z cvt.rn.f32.s16 %r3619, %rs1756; 2026-02-21T08:30:48.5816376Z prmt.b32 %r3620, %r3571, 0, 0xaaa2U; 2026-02-21T08:30:48.5816430Z cvt.u16.u32 %rs1757, %r3620; 2026-02-21T08:30:48.5816485Z cvt.rn.f32.s16 %r3621, %rs1757; 2026-02-21T08:30:48.5816549Z prmt.b32 %r3622, %r3569, 0, 0xaaa2U; 2026-02-21T08:30:48.5816644Z cvt.u16.u32 %rs1758, %r3622; 2026-02-21T08:30:48.5816704Z cvt.rn.f32.s16 %r3623, %rs1758; 2026-02-21T08:30:48.5816765Z prmt.b32 %r3624, %r3579, 0, 0xaaa2U; 2026-02-21T08:30:48.5816818Z cvt.u16.u32 %rs1759, %r3624; 2026-02-21T08:30:48.5816874Z cvt.rn.f32.s16 %r3625, %rs1759; 2026-02-21T08:30:48.5816931Z prmt.b32 %r3626, %r3577, 0, 0xaaa2U; 2026-02-21T08:30:48.5816988Z cvt.u16.u32 %rs1760, %r3626; 2026-02-21T08:30:48.5817044Z cvt.rn.f32.s16 %r3627, %rs1760; 2026-02-21T08:30:48.5817102Z prmt.b32 %r3628, %r3566, 0, 0x9991U; 2026-02-21T08:30:48.5817161Z cvt.u16.u32 %rs1761, %r3628; 2026-02-21T08:30:48.5817218Z cvt.rn.f32.s16 %r3629, %rs1761; 2026-02-21T08:30:48.5817276Z prmt.b32 %r3630, %r3564, 0, 0x9991U; 2026-02-21T08:30:48.5817330Z cvt.u16.u32 %rs1762, %r3630; 2026-02-21T08:30:48.5817390Z cvt.rn.f32.s16 %r3631, %rs1762; 2026-02-21T08:30:48.5817446Z prmt.b32 %r3632, %r3574, 0, 0x9991U; 2026-02-21T08:30:48.5817500Z cvt.u16.u32 %rs1763, %r3632; 2026-02-21T08:30:48.5817561Z cvt.rn.f32.s16 %r3633, %rs1763; 2026-02-21T08:30:48.5817621Z 
prmt.b32 %r3634, %r3572, 0, 0x9991U; 2026-02-21T08:30:48.5817677Z cvt.u16.u32 %rs1764, %r3634; 2026-02-21T08:30:48.5817738Z cvt.rn.f32.s16 %r3635, %rs1764; 2026-02-21T08:30:48.5817795Z prmt.b32 %r3636, %r3570, 0, 0x9991U; 2026-02-21T08:30:48.5817848Z cvt.u16.u32 %rs1765, %r3636; 2026-02-21T08:30:48.5817906Z cvt.rn.f32.s16 %r3637, %rs1765; 2026-02-21T08:30:48.5817965Z prmt.b32 %r3638, %r3568, 0, 0x9991U; 2026-02-21T08:30:48.5818021Z cvt.u16.u32 %rs1766, %r3638; 2026-02-21T08:30:48.5818075Z cvt.rn.f32.s16 %r3639, %rs1766; 2026-02-21T08:30:48.5818136Z prmt.b32 %r3640, %r3578, 0, 0x9991U; 2026-02-21T08:30:48.5818188Z cvt.u16.u32 %rs1767, %r3640; 2026-02-21T08:30:48.5818241Z cvt.rn.f32.s16 %r3641, %rs1767; 2026-02-21T08:30:48.5818297Z prmt.b32 %r3642, %r3576, 0, 0x9991U; 2026-02-21T08:30:48.5818355Z cvt.u16.u32 %rs1768, %r3642; 2026-02-21T08:30:48.5818409Z cvt.rn.f32.s16 %r3643, %rs1768; 2026-02-21T08:30:48.5818469Z prmt.b32 %r3644, %r3566, 0, 0xbbb3U; 2026-02-21T08:30:48.5818531Z cvt.u16.u32 %rs1769, %r3644; 2026-02-21T08:30:48.5818588Z cvt.rn.f32.s16 %r3645, %rs1769; 2026-02-21T08:30:48.5818644Z prmt.b32 %r3646, %r3564, 0, 0xbbb3U; 2026-02-21T08:30:48.5818705Z cvt.u16.u32 %rs1770, %r3646; 2026-02-21T08:30:48.5818759Z cvt.rn.f32.s16 %r3647, %rs1770; 2026-02-21T08:30:48.5818816Z prmt.b32 %r3648, %r3574, 0, 0xbbb3U; 2026-02-21T08:30:48.5818873Z cvt.u16.u32 %rs1771, %r3648; 2026-02-21T08:30:48.5818932Z cvt.rn.f32.s16 %r3649, %rs1771; 2026-02-21T08:30:48.5818990Z prmt.b32 %r3650, %r3572, 0, 0xbbb3U; 2026-02-21T08:30:48.5819043Z cvt.u16.u32 %rs1772, %r3650; 2026-02-21T08:30:48.5819103Z cvt.rn.f32.s16 %r3651, %rs1772; 2026-02-21T08:30:48.5819158Z prmt.b32 %r3652, %r3570, 0, 0xbbb3U; 2026-02-21T08:30:48.5819212Z cvt.u16.u32 %rs1773, %r3652; 2026-02-21T08:30:48.5819270Z cvt.rn.f32.s16 %r3653, %rs1773; 2026-02-21T08:30:48.5819330Z prmt.b32 %r3654, %r3568, 0, 0xbbb3U; 2026-02-21T08:30:48.5819389Z cvt.u16.u32 %rs1774, %r3654; 2026-02-21T08:30:48.5819485Z cvt.rn.f32.s16 %r3655, 
%rs1774; 2026-02-21T08:30:48.5819548Z prmt.b32 %r3656, %r3578, 0, 0xbbb3U; 2026-02-21T08:30:48.5819602Z cvt.u16.u32 %rs1775, %r3656; 2026-02-21T08:30:48.5819656Z cvt.rn.f32.s16 %r3657, %rs1775; 2026-02-21T08:30:48.5819716Z prmt.b32 %r3658, %r3576, 0, 0xbbb3U; 2026-02-21T08:30:48.5819777Z cvt.u16.u32 %rs1776, %r3658; 2026-02-21T08:30:48.5819831Z cvt.rn.f32.s16 %r3659, %rs1776; 2026-02-21T08:30:48.5819888Z prmt.b32 %r3660, %r3567, 0, 0x9991U; 2026-02-21T08:30:48.5819948Z cvt.u16.u32 %rs1777, %r3660; 2026-02-21T08:30:48.5820003Z cvt.rn.f32.s16 %r3661, %rs1777; 2026-02-21T08:30:48.5820060Z prmt.b32 %r3662, %r3565, 0, 0x9991U; 2026-02-21T08:30:48.5820119Z cvt.u16.u32 %rs1778, %r3662; 2026-02-21T08:30:48.5820174Z cvt.rn.f32.s16 %r3663, %rs1778; 2026-02-21T08:30:48.5820230Z prmt.b32 %r3664, %r3575, 0, 0x9991U; 2026-02-21T08:30:48.5820287Z cvt.u16.u32 %rs1779, %r3664; 2026-02-21T08:30:48.5820410Z cvt.rn.f32.s16 %r3665, %rs1779; 2026-02-21T08:30:48.5820473Z prmt.b32 %r3666, %r3573, 0, 0x9991U; 2026-02-21T08:30:48.5820531Z cvt.u16.u32 %rs1780, %r3666; 2026-02-21T08:30:48.5820592Z cvt.rn.f32.s16 %r3667, %rs1780; 2026-02-21T08:30:48.5820650Z prmt.b32 %r3668, %r3571, 0, 0x9991U; 2026-02-21T08:30:48.5820706Z cvt.u16.u32 %rs1781, %r3668; 2026-02-21T08:30:48.5820765Z cvt.rn.f32.s16 %r3669, %rs1781; 2026-02-21T08:30:48.5820829Z prmt.b32 %r3670, %r3569, 0, 0x9991U; 2026-02-21T08:30:48.5820883Z cvt.u16.u32 %rs1782, %r3670; 2026-02-21T08:30:48.5820941Z cvt.rn.f32.s16 %r3671, %rs1782; 2026-02-21T08:30:48.5821003Z prmt.b32 %r3672, %r3579, 0, 0x9991U; 2026-02-21T08:30:48.5821058Z cvt.u16.u32 %rs1783, %r3672; 2026-02-21T08:30:48.5821116Z cvt.rn.f32.s16 %r3673, %rs1783; 2026-02-21T08:30:48.5821182Z prmt.b32 %r3674, %r3577, 0, 0x9991U; 2026-02-21T08:30:48.5821238Z cvt.u16.u32 %rs1784, %r3674; 2026-02-21T08:30:48.5821296Z cvt.rn.f32.s16 %r3675, %rs1784; 2026-02-21T08:30:48.5821353Z prmt.b32 %r3676, %r3567, 0, 0xbbb3U; 2026-02-21T08:30:48.5821417Z cvt.u16.u32 %rs1785, %r3676; 
2026-02-21T08:30:48.5821475Z cvt.rn.f32.s16 %r3677, %rs1785; 2026-02-21T08:30:48.5821571Z prmt.b32 %r3678, %r3565, 0, 0xbbb3U; 2026-02-21T08:30:48.5821637Z cvt.u16.u32 %rs1786, %r3678; 2026-02-21T08:30:48.5821693Z cvt.rn.f32.s16 %r3679, %rs1786; 2026-02-21T08:30:48.5821750Z prmt.b32 %r3680, %r3575, 0, 0xbbb3U; 2026-02-21T08:30:48.5821808Z cvt.u16.u32 %rs1787, %r3680; 2026-02-21T08:30:48.5821873Z cvt.rn.f32.s16 %r3681, %rs1787; 2026-02-21T08:30:48.5821931Z prmt.b32 %r3682, %r3573, 0, 0xbbb3U; 2026-02-21T08:30:48.5821986Z cvt.u16.u32 %rs1788, %r3682; 2026-02-21T08:30:48.5822047Z cvt.rn.f32.s16 %r3683, %rs1788; 2026-02-21T08:30:48.5822107Z prmt.b32 %r3684, %r3571, 0, 0xbbb3U; 2026-02-21T08:30:48.5822163Z cvt.u16.u32 %rs1789, %r3684; 2026-02-21T08:30:48.5822220Z cvt.rn.f32.s16 %r3685, %rs1789; 2026-02-21T08:30:48.5822282Z prmt.b32 %r3686, %r3569, 0, 0xbbb3U; 2026-02-21T08:30:48.5822338Z cvt.u16.u32 %rs1790, %r3686; 2026-02-21T08:30:48.5822397Z cvt.rn.f32.s16 %r3687, %rs1790; 2026-02-21T08:30:48.5822462Z prmt.b32 %r3688, %r3579, 0, 0xbbb3U; 2026-02-21T08:30:48.5822519Z cvt.u16.u32 %rs1791, %r3688; 2026-02-21T08:30:48.5822575Z cvt.rn.f32.s16 %r3689, %rs1791; 2026-02-21T08:30:48.5822638Z prmt.b32 %r3690, %r3577, 0, 0xbbb3U; 2026-02-21T08:30:48.5822693Z cvt.u16.u32 %rs1792, %r3690; 2026-02-21T08:30:48.5822747Z cvt.rn.f32.s16 %r3691, %rs1792; 2026-02-21T08:30:48.5822801Z bar.sync 0; 2026-02-21T08:30:48.5822908Z st.shared.v4.b32 [%r240], {%r3581, %r3583, %r3580, %r3582}; 2026-02-21T08:30:48.5823019Z st.shared.v4.b32 [%r240+16384], {%r3631, %r3635, %r3629, %r3633}; 2026-02-21T08:30:48.5823114Z st.shared.v4.b32 [%r241], {%r3585, %r3587, %r3584, %r3586}; 2026-02-21T08:30:48.5823221Z st.shared.v4.b32 [%r241+16384], {%r3639, %r3643, %r3637, %r3641}; 2026-02-21T08:30:48.5823311Z st.shared.v4.b32 [%r242], {%r3591, %r3595, %r3589, %r3593}; 2026-02-21T08:30:48.5823409Z st.shared.v4.b32 [%r242+16384], {%r3647, %r3651, %r3645, %r3649}; 2026-02-21T08:30:48.5823557Z st.shared.v4.b32 
[%r243], {%r3599, %r3603, %r3597, %r3601}; 2026-02-21T08:30:48.5823656Z st.shared.v4.b32 [%r243+16384], {%r3655, %r3659, %r3653, %r3657}; 2026-02-21T08:30:48.5823744Z st.shared.v4.b32 [%r244], {%r3605, %r3607, %r3604, %r3606}; 2026-02-21T08:30:48.5823857Z st.shared.v4.b32 [%r244+16384], {%r3663, %r3667, %r3661, %r3665}; 2026-02-21T08:30:48.5823954Z st.shared.v4.b32 [%r245], {%r3609, %r3611, %r3608, %r3610}; 2026-02-21T08:30:48.5824055Z st.shared.v4.b32 [%r245+16384], {%r3671, %r3675, %r3669, %r3673}; 2026-02-21T08:30:48.5824146Z st.shared.v4.b32 [%r246], {%r3615, %r3619, %r3613, %r3617}; 2026-02-21T08:30:48.5824250Z st.shared.v4.b32 [%r246+16384], {%r3679, %r3683, %r3677, %r3681}; 2026-02-21T08:30:48.5824343Z st.shared.v4.b32 [%r247], {%r3623, %r3627, %r3621, %r3625}; 2026-02-21T08:30:48.5824442Z st.shared.v4.b32 [%r247+16384], {%r3687, %r3691, %r3685, %r3689}; 2026-02-21T08:30:48.5824508Z mov.pred %p234, -1; 2026-02-21T08:30:48.5824616Z $L__tmp309: 2026-02-21T08:30:48.5824860Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5824925Z // begin inline asm 2026-02-21T08:30:48.5825257Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3440 + 0], 32, {%r3441, %r3442, %r3443, %r3444, %r3445, %r3446, %r3447, %r3448, %r3449, %r3450, %r3451, %r3452, %r3453, %r3454, %r3455, %r3456}; 2026-02-21T08:30:48.5825314Z // end inline asm 2026-02-21T08:30:48.5825373Z // begin inline asm 2026-02-21T08:30:48.5825710Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r3440 + 16], 32, {%r3458, %r3459, %r3460, %r3461, %r3462, %r3463, %r3464, %r3465, %r3466, %r3467, %r3468, %r3469, %r3470, %r3471, %r3472, %r3473}; 2026-02-21T08:30:48.5825767Z // end inline asm 2026-02-21T08:30:48.5825825Z // begin inline asm 2026-02-21T08:30:48.5825906Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:30:48.5825962Z // end inline asm 2026-02-21T08:30:48.5826020Z bar.sync 0; 2026-02-21T08:30:48.5826087Z // begin inline asm 
2026-02-21T08:30:48.5826162Z fence.proxy.async.shared::cta; 2026-02-21T08:30:48.5826216Z // end inline asm 2026-02-21T08:30:48.5826278Z add.s32 %r3720, %r294, 49152; 2026-02-21T08:30:48.5826343Z // begin inline asm 2026-02-21T08:30:48.5826435Z @%p10 mbarrier.init.shared::cta.b64 [%r3720], 1; 2026-02-21T08:30:48.5826492Z // end inline asm 2026-02-21T08:30:48.5826551Z bar.sync 0; 2026-02-21T08:30:48.5826615Z @%p230 bra $L__BB0_34; 2026-02-21T08:30:48.5826716Z // %bb.33: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:48.5826783Z elect.sync %r3717|%p233, -1; 2026-02-21T08:30:48.5826850Z mov.b32 %r3695, 69208336; 2026-02-21T08:30:48.5826907Z // begin inline asm 2026-02-21T08:30:48.5827082Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 0 ], %rd935, %r3695, %p256; 2026-02-21T08:30:48.5827146Z // end inline asm 2026-02-21T08:30:48.5827206Z // begin inline asm 2026-02-21T08:30:48.5827371Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 8 ], %rd936, %r3695, %p234; 2026-02-21T08:30:48.5827437Z // end inline asm 2026-02-21T08:30:48.5827496Z // begin inline asm 2026-02-21T08:30:48.5827656Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 16 ], %rd937, %r3695, %p234; 2026-02-21T08:30:48.5827715Z // end inline asm 2026-02-21T08:30:48.5827771Z // begin inline asm 2026-02-21T08:30:48.5827926Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 24 ], %rd938, %r3695, %p234; 2026-02-21T08:30:48.5827981Z // end inline asm 2026-02-21T08:30:48.5828044Z // begin inline asm 2026-02-21T08:30:48.5828197Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 32 ], %rd939, %r3695, %p234; 2026-02-21T08:30:48.5828254Z // end inline asm 2026-02-21T08:30:48.5828315Z // begin inline asm 2026-02-21T08:30:48.5828469Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 40 ], %rd940, %r3695, %p234; 2026-02-21T08:30:48.5828527Z // end inline asm 2026-02-21T08:30:48.5828634Z // begin 
inline asm 2026-02-21T08:30:48.5828785Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 48 ], %rd941, %r3695, %p234; 2026-02-21T08:30:48.5828842Z // end inline asm 2026-02-21T08:30:48.5828897Z // begin inline asm 2026-02-21T08:30:48.5829055Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r4008 + 0 ], [ %r3694 + 56 ], %rd942, %r3695, %p234; 2026-02-21T08:30:48.5829110Z // end inline asm 2026-02-21T08:30:48.5829173Z cvt.u64.u32 %rd943, %r3720; 2026-02-21T08:30:48.5829234Z // begin inline asm 2026-02-21T08:30:48.5829362Z @%p233 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd943]; 2026-02-21T08:30:48.5829418Z // end inline asm 2026-02-21T08:30:48.5829477Z $L__tmp310: 2026-02-21T08:30:48.5829582Z $L__BB0_34: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:48.5829808Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5829883Z setp.eq.b32 %p250, %r273, 0; 2026-02-21T08:30:48.5829952Z setp.lt.s32 %p251, %r4066, %r218; 2026-02-21T08:30:48.5830009Z mov.b32 %r3721, 0; 2026-02-21T08:30:48.5830064Z $L__tmp311: 2026-02-21T08:30:48.5830299Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5830356Z // begin inline asm 2026-02-21T08:30:48.5830407Z 2026-02-21T08:30:48.5830462Z { 2026-02-21T08:30:48.5830526Z .reg .pred complete; 2026-02-21T08:30:48.5830583Z waitLoop: 2026-02-21T08:30:48.5830710Z mbarrier.try_wait.parity.shared.b64 complete, [%r3720], %r3721; 2026-02-21T08:30:48.5830785Z @!complete bra.uni waitLoop; 2026-02-21T08:30:48.5830847Z } 2026-02-21T08:30:48.5830851Z 2026-02-21T08:30:48.5830904Z // end inline asm 2026-02-21T08:30:48.5830962Z bar.sync 0; 2026-02-21T08:30:48.5831017Z // begin inline asm 2026-02-21T08:30:48.5831101Z @%p10 mbarrier.inval.shared::cta.b64 [%r3720]; 2026-02-21T08:30:48.5831163Z // end inline asm 2026-02-21T08:30:48.5831217Z $L__tmp312: 2026-02-21T08:30:48.5831392Z .loc 1 19 112 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5831451Z add.s32 %r3732, %r259, 32; 2026-02-21T08:30:48.5831514Z add.s32 %r3733, %r4056, 1; 2026-02-21T08:30:48.5831601Z setp.gt.s32 %p252, %r3733, 1; 2026-02-21T08:30:48.5831664Z selp.b32 %r4056, 0, %r3733, %p252; 2026-02-21T08:30:48.5831730Z selp.b32 %r4053, 0, %r3732, %p250; 2026-02-21T08:30:48.5831901Z .loc 1 45 22 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:45:22 2026-02-21T08:30:48.5831959Z shl.b32 %r3734, %r4053, 1; 2026-02-21T08:30:48.5832125Z .loc 1 47 25 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:47:25 2026-02-21T08:30:48.5832188Z add.s32 %r3735, %r3734, %r22; 2026-02-21T08:30:48.5832356Z .loc 1 48 53 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:53 2026-02-21T08:30:48.5832417Z shl.b32 %r3736, %r4070, 10; 2026-02-21T08:30:48.5832479Z shl.b32 %r3737, %r4071, 10; 2026-02-21T08:30:48.5832535Z shl.b32 %r3738, %r4072, 10; 2026-02-21T08:30:48.5832589Z shl.b32 %r3739, %r4073, 10; 2026-02-21T08:30:48.5832760Z .loc 1 48 60 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:60 2026-02-21T08:30:48.5832819Z add.s32 %r3740, %r3736, %r3735; 2026-02-21T08:30:48.5832877Z add.s32 %r3741, %r3737, %r3735; 2026-02-21T08:30:48.5832933Z add.s32 %r3742, %r3738, %r3735; 2026-02-21T08:30:48.5832993Z add.s32 %r3743, %r3739, %r3735; 2026-02-21T08:30:48.5833154Z .loc 1 48 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:32 2026-02-21T08:30:48.5833223Z mad.wide.s32 %rd944, %r3740, 2, %rd92; 2026-02-21T08:30:48.5833297Z mad.wide.s32 %rd945, %r3741, 2, %rd92; 2026-02-21T08:30:48.5833362Z mad.wide.s32 %rd946, %r3742, 2, %rd92; 2026-02-21T08:30:48.5833447Z mad.wide.s32 %rd947, %r3743, 2, %rd92; 2026-02-21T08:30:48.5833667Z .loc 1 48 80 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:48:80 2026-02-21T08:30:48.5833725Z shl.b32 %r3744, %r4056, 13; 2026-02-21T08:30:48.5833785Z add.s32 %r3745, %r294, %r3744; 
2026-02-21T08:30:48.5833844Z add.s32 %r3746, %r3745, 32768; 2026-02-21T08:30:48.5833908Z add.s32 %r3723, %r3746, %r4047; 2026-02-21T08:30:48.5833968Z selp.b32 %r3724, 16, 0, %p251; 2026-02-21T08:30:48.5834023Z // begin inline asm 2026-02-21T08:30:48.5834152Z cp.async.cg.shared.global [ %r3723 + 0 ], [ %rd944 + 0 ], 0x10, %r3724; 2026-02-21T08:30:48.5834206Z // end inline asm 2026-02-21T08:30:48.5834263Z add.s32 %r3725, %r3746, %r4046; 2026-02-21T08:30:48.5834326Z // begin inline asm 2026-02-21T08:30:48.5834440Z cp.async.cg.shared.global [ %r3725 + 0 ], [ %rd945 + 0 ], 0x10, %r3724; 2026-02-21T08:30:48.5834495Z // end inline asm 2026-02-21T08:30:48.5834555Z add.s32 %r3727, %r3746, %r4045; 2026-02-21T08:30:48.5834661Z // begin inline asm 2026-02-21T08:30:48.5834777Z cp.async.cg.shared.global [ %r3727 + 0 ], [ %rd946 + 0 ], 0x10, %r3724; 2026-02-21T08:30:48.5834832Z // end inline asm 2026-02-21T08:30:48.5834897Z add.s32 %r3729, %r3746, %r4044; 2026-02-21T08:30:48.5834951Z // begin inline asm 2026-02-21T08:30:48.5835065Z cp.async.cg.shared.global [ %r3729 + 0 ], [ %rd947 + 0 ], 0x10, %r3724; 2026-02-21T08:30:48.5835117Z // end inline asm 2026-02-21T08:30:48.5835183Z cp.async.commit_group; 2026-02-21T08:30:48.5835359Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5835421Z setp.ne.b32 %p256, %r4051, 15; 2026-02-21T08:30:48.5835489Z @%p256 bra $L__BB0_36; 2026-02-21T08:30:48.5835584Z // %bb.35: // in Loop: Header=BB0_30 Depth=1 2026-02-21T08:30:48.5835751Z .loc 1 31 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:31:32 2026-02-21T08:30:48.5835816Z add.s32 %r3887, %r4050, %r8; 2026-02-21T08:30:48.5835991Z .loc 1 33 32 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:33:32 2026-02-21T08:30:48.5836053Z add.s32 %r3888, %r4049, %r14; 2026-02-21T08:30:48.5836115Z add.s32 %r3889, %r4049, %r15; 2026-02-21T08:30:48.5836177Z add.s32 %r3890, %r4049, %r16; 2026-02-21T08:30:48.5836233Z add.s32 %r3891, 
%r4049, %r17; 2026-02-21T08:30:48.5836288Z add.s32 %r3892, %r4049, %r18; 2026-02-21T08:30:48.5836350Z add.s32 %r3893, %r4049, %r19; 2026-02-21T08:30:48.5836408Z add.s32 %r3894, %r4049, %r20; 2026-02-21T08:30:48.5836466Z add.s32 %r3895, %r4049, %r21; 2026-02-21T08:30:48.5836638Z .loc 1 88 43 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:43 2026-02-21T08:30:48.5836695Z shl.b32 %r3896, %r3888, 13; 2026-02-21T08:30:48.5836752Z shl.b32 %r3897, %r3889, 13; 2026-02-21T08:30:48.5836807Z shl.b32 %r3898, %r3890, 13; 2026-02-21T08:30:48.5836868Z shl.b32 %r3899, %r3891, 13; 2026-02-21T08:30:48.5836926Z shl.b32 %r3900, %r3892, 13; 2026-02-21T08:30:48.5836983Z shl.b32 %r3901, %r3893, 13; 2026-02-21T08:30:48.5837043Z shl.b32 %r3902, %r3894, 13; 2026-02-21T08:30:48.5837097Z shl.b32 %r3903, %r3895, 13; 2026-02-21T08:30:48.5837261Z .loc 1 88 50 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:50 2026-02-21T08:30:48.5837325Z add.s32 %r3904, %r3896, %r3887; 2026-02-21T08:30:48.5837385Z add.s32 %r3905, %r3897, %r3887; 2026-02-21T08:30:48.5837442Z add.s32 %r3906, %r3898, %r3887; 2026-02-21T08:30:48.5837498Z add.s32 %r3907, %r3899, %r3887; 2026-02-21T08:30:48.5837559Z add.s32 %r3908, %r3900, %r3887; 2026-02-21T08:30:48.5837615Z add.s32 %r3909, %r3901, %r3887; 2026-02-21T08:30:48.5837670Z add.s32 %r3910, %r3902, %r3887; 2026-02-21T08:30:48.5837730Z add.s32 %r3911, %r3903, %r3887; 2026-02-21T08:30:48.5837899Z .loc 1 88 22 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:22 2026-02-21T08:30:48.5837966Z mad.wide.s32 %rd948, %r3904, 2, %rd94; 2026-02-21T08:30:48.5838086Z mad.wide.s32 %rd949, %r3905, 2, %rd94; 2026-02-21T08:30:48.5838155Z mad.wide.s32 %rd950, %r3906, 2, %rd94; 2026-02-21T08:30:48.5838217Z mad.wide.s32 %rd951, %r3907, 2, %rd94; 2026-02-21T08:30:48.5838279Z mad.wide.s32 %rd952, %r3908, 2, %rd94; 2026-02-21T08:30:48.5838344Z mad.wide.s32 %rd953, %r3909, 2, %rd94; 2026-02-21T08:30:48.5838403Z mad.wide.s32 %rd954, %r3910, 2, %rd94; 
2026-02-21T08:30:48.5838461Z mad.wide.s32 %rd955, %r3911, 2, %rd94; 2026-02-21T08:30:48.5838515Z $L__tmp313: 2026-02-21T08:30:48.5838734Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5838788Z // begin inline asm 2026-02-21T08:30:48.5839098Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3747, %r3748, %r3749, %r3750, %r3751, %r3752, %r3753, %r3754, %r3755, %r3756, %r3757, %r3758, %r3759, %r3760, %r3761, %r3762}, [%r3214 + 0], 64; 2026-02-21T08:30:48.5839157Z // end inline asm 2026-02-21T08:30:48.5839250Z // begin inline asm 2026-02-21T08:30:48.5839557Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3764, %r3765, %r3766, %r3767, %r3768, %r3769, %r3770, %r3771, %r3772, %r3773, %r3774, %r3775, %r3776, %r3777, %r3778, %r3779}, [%r3214 + 16], 64; 2026-02-21T08:30:48.5839617Z // end inline asm 2026-02-21T08:30:48.5839670Z // begin inline asm 2026-02-21T08:30:48.5839973Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3781, %r3782, %r3783, %r3784, %r3785, %r3786, %r3787, %r3788, %r3789, %r3790, %r3791, %r3792, %r3793, %r3794, %r3795, %r3796}, [%r3214 + 32], 64; 2026-02-21T08:30:48.5840032Z // end inline asm 2026-02-21T08:30:48.5840086Z // begin inline asm 2026-02-21T08:30:48.5840385Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3798, %r3799, %r3800, %r3801, %r3802, %r3803, %r3804, %r3805, %r3806, %r3807, %r3808, %r3809, %r3810, %r3811, %r3812, %r3813}, [%r3214 + 48], 64; 2026-02-21T08:30:48.5840446Z // end inline asm 2026-02-21T08:30:48.5840500Z // begin inline asm 2026-02-21T08:30:48.5840570Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:30:48.5840627Z // end inline asm 2026-02-21T08:30:48.5840691Z cvt.u64.u32 %rd956, %r3747; 2026-02-21T08:30:48.5840749Z cvt.u64.u32 %rd957, %r3748; 2026-02-21T08:30:48.5840804Z shl.b64 %rd958, %rd957, 32; 2026-02-21T08:30:48.5840871Z or.b64 %rd959, %rd956, %rd958; 2026-02-21T08:30:48.5840923Z $L__tmp314: 2026-02-21T08:30:48.5841092Z .loc 1 87 28 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5841158Z mov.b64 {%r3912, %r3913}, %rd959; 2026-02-21T08:30:48.5841228Z cvt.rn.bf16x2.f32 %r3914, %r3913, %r3912; 2026-02-21T08:30:48.5841280Z $L__tmp315: 2026-02-21T08:30:48.5841494Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5841587Z cvt.u64.u32 %rd960, %r3749; 2026-02-21T08:30:48.5841645Z cvt.u64.u32 %rd961, %r3750; 2026-02-21T08:30:48.5841702Z shl.b64 %rd962, %rd961, 32; 2026-02-21T08:30:48.5841770Z or.b64 %rd963, %rd960, %rd962; 2026-02-21T08:30:48.5841823Z $L__tmp316: 2026-02-21T08:30:48.5841984Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5842049Z mov.b64 {%r3915, %r3916}, %rd963; 2026-02-21T08:30:48.5842120Z cvt.rn.bf16x2.f32 %r3917, %r3916, %r3915; 2026-02-21T08:30:48.5842170Z $L__tmp317: 2026-02-21T08:30:48.5842382Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5842445Z cvt.u64.u32 %rd964, %r3751; 2026-02-21T08:30:48.5842501Z cvt.u64.u32 %rd965, %r3752; 2026-02-21T08:30:48.5842558Z shl.b64 %rd966, %rd965, 32; 2026-02-21T08:30:48.5842621Z or.b64 %rd967, %rd964, %rd966; 2026-02-21T08:30:48.5842669Z $L__tmp318: 2026-02-21T08:30:48.5842835Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5842899Z mov.b64 {%r3918, %r3919}, %rd967; 2026-02-21T08:30:48.5843026Z cvt.rn.bf16x2.f32 %r3920, %r3919, %r3918; 2026-02-21T08:30:48.5843078Z $L__tmp319: 2026-02-21T08:30:48.5843287Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5843350Z cvt.u64.u32 %rd968, %r3753; 2026-02-21T08:30:48.5843408Z cvt.u64.u32 %rd969, %r3754; 2026-02-21T08:30:48.5843465Z shl.b64 %rd970, %rd969, 32; 2026-02-21T08:30:48.5843527Z or.b64 %rd971, %rd968, %rd970; 
2026-02-21T08:30:48.5843578Z $L__tmp320: 2026-02-21T08:30:48.5843742Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5843802Z mov.b64 {%r3921, %r3922}, %rd971; 2026-02-21T08:30:48.5843874Z cvt.rn.bf16x2.f32 %r3923, %r3922, %r3921; 2026-02-21T08:30:48.5843926Z $L__tmp321: 2026-02-21T08:30:48.5844136Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5844255Z cvt.u64.u32 %rd972, %r3755; 2026-02-21T08:30:48.5844312Z cvt.u64.u32 %rd973, %r3756; 2026-02-21T08:30:48.5844370Z shl.b64 %rd974, %rd973, 32; 2026-02-21T08:30:48.5844441Z or.b64 %rd975, %rd972, %rd974; 2026-02-21T08:30:48.5844494Z $L__tmp322: 2026-02-21T08:30:48.5844662Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5844722Z mov.b64 {%r3924, %r3925}, %rd975; 2026-02-21T08:30:48.5844801Z cvt.rn.bf16x2.f32 %r3926, %r3925, %r3924; 2026-02-21T08:30:48.5844851Z $L__tmp323: 2026-02-21T08:30:48.5845060Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5845122Z cvt.u64.u32 %rd976, %r3757; 2026-02-21T08:30:48.5845178Z cvt.u64.u32 %rd977, %r3758; 2026-02-21T08:30:48.5845233Z shl.b64 %rd978, %rd977, 32; 2026-02-21T08:30:48.5845294Z or.b64 %rd979, %rd976, %rd978; 2026-02-21T08:30:48.5845344Z $L__tmp324: 2026-02-21T08:30:48.5845512Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5845569Z mov.b64 {%r3927, %r3928}, %rd979; 2026-02-21T08:30:48.5845641Z cvt.rn.bf16x2.f32 %r3929, %r3928, %r3927; 2026-02-21T08:30:48.5845690Z $L__tmp325: 2026-02-21T08:30:48.5845896Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5845958Z cvt.u64.u32 %rd980, %r3759; 2026-02-21T08:30:48.5846013Z cvt.u64.u32 %rd981, %r3760; 2026-02-21T08:30:48.5846069Z shl.b64 %rd982, %rd981, 32; 
2026-02-21T08:30:48.5846133Z or.b64 %rd983, %rd980, %rd982; 2026-02-21T08:30:48.5846183Z $L__tmp326: 2026-02-21T08:30:48.5846345Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5846402Z mov.b64 {%r3930, %r3931}, %rd983; 2026-02-21T08:30:48.5846473Z cvt.rn.bf16x2.f32 %r3932, %r3931, %r3930; 2026-02-21T08:30:48.5846525Z $L__tmp327: 2026-02-21T08:30:48.5846736Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5846798Z cvt.u64.u32 %rd984, %r3761; 2026-02-21T08:30:48.5846853Z cvt.u64.u32 %rd985, %r3762; 2026-02-21T08:30:48.5846908Z shl.b64 %rd986, %rd985, 32; 2026-02-21T08:30:48.5846965Z or.b64 %rd987, %rd984, %rd986; 2026-02-21T08:30:48.5847023Z $L__tmp328: 2026-02-21T08:30:48.5847187Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5847247Z mov.b64 {%r3933, %r3934}, %rd987; 2026-02-21T08:30:48.5847315Z cvt.rn.bf16x2.f32 %r3935, %r3934, %r3933; 2026-02-21T08:30:48.5847367Z $L__tmp329: 2026-02-21T08:30:48.5847575Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5847637Z cvt.u64.u32 %rd988, %r3764; 2026-02-21T08:30:48.5847694Z cvt.u64.u32 %rd989, %r3765; 2026-02-21T08:30:48.5847791Z shl.b64 %rd990, %rd989, 32; 2026-02-21T08:30:48.5847847Z or.b64 %rd991, %rd988, %rd990; 2026-02-21T08:30:48.5847903Z $L__tmp330: 2026-02-21T08:30:48.5848068Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5848127Z mov.b64 {%r3936, %r3937}, %rd991; 2026-02-21T08:30:48.5848196Z cvt.rn.bf16x2.f32 %r3938, %r3937, %r3936; 2026-02-21T08:30:48.5848245Z $L__tmp331: 2026-02-21T08:30:48.5848454Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5848515Z cvt.u64.u32 %rd992, %r3766; 2026-02-21T08:30:48.5848567Z cvt.u64.u32 %rd993, 
%r3767; 2026-02-21T08:30:48.5848623Z shl.b64 %rd994, %rd993, 32; 2026-02-21T08:30:48.5848679Z or.b64 %rd995, %rd992, %rd994; 2026-02-21T08:30:48.5848734Z $L__tmp332: 2026-02-21T08:30:48.5848936Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5848998Z mov.b64 {%r3939, %r3940}, %rd995; 2026-02-21T08:30:48.5849068Z cvt.rn.bf16x2.f32 %r3941, %r3940, %r3939; 2026-02-21T08:30:48.5849117Z $L__tmp333: 2026-02-21T08:30:48.5849329Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5849387Z cvt.u64.u32 %rd996, %r3768; 2026-02-21T08:30:48.5849441Z cvt.u64.u32 %rd997, %r3769; 2026-02-21T08:30:48.5849496Z shl.b64 %rd998, %rd997, 32; 2026-02-21T08:30:48.5849553Z or.b64 %rd999, %rd996, %rd998; 2026-02-21T08:30:48.5849609Z $L__tmp334: 2026-02-21T08:30:48.5849777Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5849833Z mov.b64 {%r3942, %r3943}, %rd999; 2026-02-21T08:30:48.5849904Z cvt.rn.bf16x2.f32 %r3944, %r3943, %r3942; 2026-02-21T08:30:48.5849953Z $L__tmp335: 2026-02-21T08:30:48.5850168Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5850238Z cvt.u64.u32 %rd1000, %r3770; 2026-02-21T08:30:48.5850297Z cvt.u64.u32 %rd1001, %r3771; 2026-02-21T08:30:48.5850356Z shl.b64 %rd1002, %rd1001, 32; 2026-02-21T08:30:48.5850413Z or.b64 %rd1003, %rd1000, %rd1002; 2026-02-21T08:30:48.5850470Z $L__tmp336: 2026-02-21T08:30:48.5850634Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5850695Z mov.b64 {%r3945, %r3946}, %rd1003; 2026-02-21T08:30:48.5850765Z cvt.rn.bf16x2.f32 %r3947, %r3946, %r3945; 2026-02-21T08:30:48.5850815Z $L__tmp337: 2026-02-21T08:30:48.5851030Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5851093Z 
cvt.u64.u32 %rd1004, %r3772; 2026-02-21T08:30:48.5851150Z cvt.u64.u32 %rd1005, %r3773; 2026-02-21T08:30:48.5851210Z shl.b64 %rd1006, %rd1005, 32; 2026-02-21T08:30:48.5851271Z or.b64 %rd1007, %rd1004, %rd1006; 2026-02-21T08:30:48.5851327Z $L__tmp338: 2026-02-21T08:30:48.5851491Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5851580Z mov.b64 {%r3948, %r3949}, %rd1007; 2026-02-21T08:30:48.5851656Z cvt.rn.bf16x2.f32 %r3950, %r3949, %r3948; 2026-02-21T08:30:48.5851705Z $L__tmp339: 2026-02-21T08:30:48.5851910Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5851967Z cvt.u64.u32 %rd1008, %r3774; 2026-02-21T08:30:48.5852031Z cvt.u64.u32 %rd1009, %r3775; 2026-02-21T08:30:48.5852088Z shl.b64 %rd1010, %rd1009, 32; 2026-02-21T08:30:48.5852146Z or.b64 %rd1011, %rd1008, %rd1010; 2026-02-21T08:30:48.5852201Z $L__tmp340: 2026-02-21T08:30:48.5852365Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5852427Z mov.b64 {%r3951, %r3952}, %rd1011; 2026-02-21T08:30:48.5852547Z cvt.rn.bf16x2.f32 %r3953, %r3952, %r3951; 2026-02-21T08:30:48.5852597Z $L__tmp341: 2026-02-21T08:30:48.5852801Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5852859Z cvt.u64.u32 %rd1012, %r3776; 2026-02-21T08:30:48.5852918Z cvt.u64.u32 %rd1013, %r3777; 2026-02-21T08:30:48.5852978Z shl.b64 %rd1014, %rd1013, 32; 2026-02-21T08:30:48.5853037Z or.b64 %rd1015, %rd1012, %rd1014; 2026-02-21T08:30:48.5853098Z $L__tmp342: 2026-02-21T08:30:48.5853264Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5853323Z mov.b64 {%r3954, %r3955}, %rd1015; 2026-02-21T08:30:48.5853393Z cvt.rn.bf16x2.f32 %r3956, %r3955, %r3954; 2026-02-21T08:30:48.5853446Z $L__tmp343: 2026-02-21T08:30:48.5853701Z .loc 2 291 36 // standard.py:291:36 @[ 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5853762Z cvt.u64.u32 %rd1016, %r3778; 2026-02-21T08:30:48.5853824Z cvt.u64.u32 %rd1017, %r3779; 2026-02-21T08:30:48.5853881Z shl.b64 %rd1018, %rd1017, 32; 2026-02-21T08:30:48.5853938Z or.b64 %rd1019, %rd1016, %rd1018; 2026-02-21T08:30:48.5853994Z $L__tmp344: 2026-02-21T08:30:48.5854157Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5854218Z mov.b64 {%r3957, %r3958}, %rd1019; 2026-02-21T08:30:48.5854295Z cvt.rn.bf16x2.f32 %r3959, %r3958, %r3957; 2026-02-21T08:30:48.5854345Z $L__tmp345: 2026-02-21T08:30:48.5854552Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5854609Z cvt.u64.u32 %rd1020, %r3781; 2026-02-21T08:30:48.5854672Z cvt.u64.u32 %rd1021, %r3782; 2026-02-21T08:30:48.5854730Z shl.b64 %rd1022, %rd1021, 32; 2026-02-21T08:30:48.5854790Z or.b64 %rd1023, %rd1020, %rd1022; 2026-02-21T08:30:48.5854847Z $L__tmp346: 2026-02-21T08:30:48.5855011Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5855070Z mov.b64 {%r3960, %r3961}, %rd1023; 2026-02-21T08:30:48.5855137Z cvt.rn.bf16x2.f32 %r3962, %r3961, %r3960; 2026-02-21T08:30:48.5855193Z $L__tmp347: 2026-02-21T08:30:48.5855401Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5855456Z cvt.u64.u32 %rd1024, %r3783; 2026-02-21T08:30:48.5855518Z cvt.u64.u32 %rd1025, %r3784; 2026-02-21T08:30:48.5855573Z shl.b64 %rd1026, %rd1025, 32; 2026-02-21T08:30:48.5855630Z or.b64 %rd1027, %rd1024, %rd1026; 2026-02-21T08:30:48.5855682Z $L__tmp348: 2026-02-21T08:30:48.5855846Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5855908Z mov.b64 {%r3963, %r3964}, %rd1027; 2026-02-21T08:30:48.5855977Z cvt.rn.bf16x2.f32 %r3965, %r3964, %r3963; 
2026-02-21T08:30:48.5856034Z $L__tmp349: 2026-02-21T08:30:48.5856246Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5856303Z cvt.u64.u32 %rd1028, %r3785; 2026-02-21T08:30:48.5856363Z cvt.u64.u32 %rd1029, %r3786; 2026-02-21T08:30:48.5856419Z shl.b64 %rd1030, %rd1029, 32; 2026-02-21T08:30:48.5856474Z or.b64 %rd1031, %rd1028, %rd1030; 2026-02-21T08:30:48.5856528Z $L__tmp350: 2026-02-21T08:30:48.5856691Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5856753Z mov.b64 {%r3966, %r3967}, %rd1031; 2026-02-21T08:30:48.5856818Z cvt.rn.bf16x2.f32 %r3968, %r3967, %r3966; 2026-02-21T08:30:48.5856875Z $L__tmp351: 2026-02-21T08:30:48.5857085Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5857195Z cvt.u64.u32 %rd1032, %r3787; 2026-02-21T08:30:48.5857257Z cvt.u64.u32 %rd1033, %r3788; 2026-02-21T08:30:48.5857314Z shl.b64 %rd1034, %rd1033, 32; 2026-02-21T08:30:48.5857371Z or.b64 %rd1035, %rd1032, %rd1034; 2026-02-21T08:30:48.5857426Z $L__tmp352: 2026-02-21T08:30:48.5857593Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5857650Z mov.b64 {%r3969, %r3970}, %rd1035; 2026-02-21T08:30:48.5857715Z cvt.rn.bf16x2.f32 %r3971, %r3970, %r3969; 2026-02-21T08:30:48.5857770Z $L__tmp353: 2026-02-21T08:30:48.5857977Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5858032Z cvt.u64.u32 %rd1036, %r3789; 2026-02-21T08:30:48.5858093Z cvt.u64.u32 %rd1037, %r3790; 2026-02-21T08:30:48.5858149Z shl.b64 %rd1038, %rd1037, 32; 2026-02-21T08:30:48.5858205Z or.b64 %rd1039, %rd1036, %rd1038; 2026-02-21T08:30:48.5858303Z $L__tmp354: 2026-02-21T08:30:48.5858469Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5858527Z mov.b64 {%r3972, 
%r3973}, %rd1039; 2026-02-21T08:30:48.5858595Z cvt.rn.bf16x2.f32 %r3974, %r3973, %r3972; 2026-02-21T08:30:48.5858650Z $L__tmp355: 2026-02-21T08:30:48.5858853Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5858908Z cvt.u64.u32 %rd1040, %r3791; 2026-02-21T08:30:48.5858968Z cvt.u64.u32 %rd1041, %r3792; 2026-02-21T08:30:48.5859024Z shl.b64 %rd1042, %rd1041, 32; 2026-02-21T08:30:48.5859081Z or.b64 %rd1043, %rd1040, %rd1042; 2026-02-21T08:30:48.5859129Z $L__tmp356: 2026-02-21T08:30:48.5859298Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5859355Z mov.b64 {%r3975, %r3976}, %rd1043; 2026-02-21T08:30:48.5859421Z cvt.rn.bf16x2.f32 %r3977, %r3976, %r3975; 2026-02-21T08:30:48.5859480Z $L__tmp357: 2026-02-21T08:30:48.5859684Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5859742Z cvt.u64.u32 %rd1044, %r3793; 2026-02-21T08:30:48.5859800Z cvt.u64.u32 %rd1045, %r3794; 2026-02-21T08:30:48.5859855Z shl.b64 %rd1046, %rd1045, 32; 2026-02-21T08:30:48.5859913Z or.b64 %rd1047, %rd1044, %rd1046; 2026-02-21T08:30:48.5859960Z $L__tmp358: 2026-02-21T08:30:48.5860130Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5860190Z mov.b64 {%r3978, %r3979}, %rd1047; 2026-02-21T08:30:48.5860256Z cvt.rn.bf16x2.f32 %r3980, %r3979, %r3978; 2026-02-21T08:30:48.5860314Z $L__tmp359: 2026-02-21T08:30:48.5860521Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5860582Z cvt.u64.u32 %rd1048, %r3795; 2026-02-21T08:30:48.5860646Z cvt.u64.u32 %rd1049, %r3796; 2026-02-21T08:30:48.5860703Z shl.b64 %rd1050, %rd1049, 32; 2026-02-21T08:30:48.5860759Z or.b64 %rd1051, %rd1048, %rd1050; 2026-02-21T08:30:48.5860809Z $L__tmp360: 2026-02-21T08:30:48.5860979Z .loc 1 87 28 // 
cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5861035Z mov.b64 {%r3981, %r3982}, %rd1051; 2026-02-21T08:30:48.5861101Z cvt.rn.bf16x2.f32 %r3983, %r3982, %r3981; 2026-02-21T08:30:48.5861159Z $L__tmp361: 2026-02-21T08:30:48.5861370Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5861431Z cvt.u64.u32 %rd1052, %r3798; 2026-02-21T08:30:48.5861494Z cvt.u64.u32 %rd1053, %r3799; 2026-02-21T08:30:48.5861579Z shl.b64 %rd1054, %rd1053, 32; 2026-02-21T08:30:48.5861643Z or.b64 %rd1055, %rd1052, %rd1054; 2026-02-21T08:30:48.5861695Z $L__tmp362: 2026-02-21T08:30:48.5861869Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5862625Z mov.b64 {%r3984, %r3985}, %rd1055; 2026-02-21T08:30:48.5862694Z cvt.rn.bf16x2.f32 %r3986, %r3985, %r3984; 2026-02-21T08:30:48.5862753Z $L__tmp363: 2026-02-21T08:30:48.5862964Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5863026Z cvt.u64.u32 %rd1056, %r3800; 2026-02-21T08:30:48.5863091Z cvt.u64.u32 %rd1057, %r3801; 2026-02-21T08:30:48.5863149Z shl.b64 %rd1058, %rd1057, 32; 2026-02-21T08:30:48.5863206Z or.b64 %rd1059, %rd1056, %rd1058; 2026-02-21T08:30:48.5863259Z $L__tmp364: 2026-02-21T08:30:48.5863431Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5863489Z mov.b64 {%r3987, %r3988}, %rd1059; 2026-02-21T08:30:48.5863558Z cvt.rn.bf16x2.f32 %r3989, %r3988, %r3987; 2026-02-21T08:30:48.5863663Z $L__tmp365: 2026-02-21T08:30:48.5863876Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5863935Z cvt.u64.u32 %rd1060, %r3802; 2026-02-21T08:30:48.5863997Z cvt.u64.u32 %rd1061, %r3803; 2026-02-21T08:30:48.5864053Z shl.b64 %rd1062, %rd1061, 32; 2026-02-21T08:30:48.5864110Z or.b64 %rd1063, %rd1060, %rd1062; 
2026-02-21T08:30:48.5864160Z $L__tmp366: 2026-02-21T08:30:48.5864331Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5864389Z mov.b64 {%r3990, %r3991}, %rd1063; 2026-02-21T08:30:48.5864455Z cvt.rn.bf16x2.f32 %r3992, %r3991, %r3990; 2026-02-21T08:30:48.5864508Z $L__tmp367: 2026-02-21T08:30:48.5864719Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5864774Z cvt.u64.u32 %rd1064, %r3804; 2026-02-21T08:30:48.5864834Z cvt.u64.u32 %rd1065, %r3805; 2026-02-21T08:30:48.5864899Z shl.b64 %rd1066, %rd1065, 32; 2026-02-21T08:30:48.5864958Z or.b64 %rd1067, %rd1064, %rd1066; 2026-02-21T08:30:48.5865009Z $L__tmp368: 2026-02-21T08:30:48.5865178Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5865234Z mov.b64 {%r3993, %r3994}, %rd1067; 2026-02-21T08:30:48.5865299Z cvt.rn.bf16x2.f32 %r3995, %r3994, %r3993; 2026-02-21T08:30:48.5865351Z $L__tmp369: 2026-02-21T08:30:48.5865561Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5865618Z cvt.u64.u32 %rd1068, %r3806; 2026-02-21T08:30:48.5865673Z cvt.u64.u32 %rd1069, %r3807; 2026-02-21T08:30:48.5865736Z shl.b64 %rd1070, %rd1069, 32; 2026-02-21T08:30:48.5865793Z or.b64 %rd1071, %rd1068, %rd1070; 2026-02-21T08:30:48.5865844Z $L__tmp370: 2026-02-21T08:30:48.5866017Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5866075Z mov.b64 {%r3996, %r3997}, %rd1071; 2026-02-21T08:30:48.5866140Z cvt.rn.bf16x2.f32 %r3998, %r3997, %r3996; 2026-02-21T08:30:48.5866195Z $L__tmp371: 2026-02-21T08:30:48.5866417Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5866474Z cvt.u64.u32 %rd1072, %r3808; 2026-02-21T08:30:48.5866532Z cvt.u64.u32 %rd1073, %r3809; 2026-02-21T08:30:48.5866596Z shl.b64 
%rd1074, %rd1073, 32; 2026-02-21T08:30:48.5866656Z or.b64 %rd1075, %rd1072, %rd1074; 2026-02-21T08:30:48.5866708Z $L__tmp372: 2026-02-21T08:30:48.5866882Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5866941Z mov.b64 {%r3999, %r4000}, %rd1075; 2026-02-21T08:30:48.5867010Z cvt.rn.bf16x2.f32 %r4001, %r4000, %r3999; 2026-02-21T08:30:48.5867065Z $L__tmp373: 2026-02-21T08:30:48.5867286Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5867395Z cvt.u64.u32 %rd1076, %r3810; 2026-02-21T08:30:48.5867454Z cvt.u64.u32 %rd1077, %r3811; 2026-02-21T08:30:48.5867520Z shl.b64 %rd1078, %rd1077, 32; 2026-02-21T08:30:48.5867581Z or.b64 %rd1079, %rd1076, %rd1078; 2026-02-21T08:30:48.5867634Z $L__tmp374: 2026-02-21T08:30:48.5867812Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5867873Z mov.b64 {%r4002, %r4003}, %rd1079; 2026-02-21T08:30:48.5867941Z cvt.rn.bf16x2.f32 %r4004, %r4003, %r4002; 2026-02-21T08:30:48.5867994Z $L__tmp375: 2026-02-21T08:30:48.5868214Z .loc 2 291 36 // standard.py:291:36 @[ cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:84:40 ] 2026-02-21T08:30:48.5868272Z cvt.u64.u32 %rd1080, %r3812; 2026-02-21T08:30:48.5868332Z cvt.u64.u32 %rd1081, %r3813; 2026-02-21T08:30:48.5868447Z shl.b64 %rd1082, %rd1081, 32; 2026-02-21T08:30:48.5868512Z or.b64 %rd1083, %rd1080, %rd1082; 2026-02-21T08:30:48.5868563Z $L__tmp376: 2026-02-21T08:30:48.5868738Z .loc 1 87 28 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:87:28 2026-02-21T08:30:48.5868798Z mov.b64 {%r4005, %r4006}, %rd1083; 2026-02-21T08:30:48.5868865Z cvt.rn.bf16x2.f32 %r4007, %r4006, %r4005; 2026-02-21T08:30:48.5869036Z .loc 1 88 81 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:88:81 2026-02-21T08:30:48.5869145Z st.shared.v4.b32 [%r250], {%r3914, %r3926, %r3938, %r3950}; 2026-02-21T08:30:48.5869246Z st.shared.v4.b32 
[%r251], {%r3962, %r3974, %r3986, %r3998}; 2026-02-21T08:30:48.5869303Z bar.sync 0; 2026-02-21T08:30:48.5869370Z // begin inline asm 2026-02-21T08:30:48.5869534Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3855, %r3859, %r3863, %r3867}, [%r3819]; 2026-02-21T08:30:48.5869593Z // end inline asm 2026-02-21T08:30:48.5869655Z // begin inline asm 2026-02-21T08:30:48.5869821Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3871, %r3875, %r3879, %r3883}, [%r3824]; 2026-02-21T08:30:48.5869877Z // end inline asm 2026-02-21T08:30:48.5869932Z bar.sync 0; 2026-02-21T08:30:48.5870036Z st.shared.v4.b32 [%r250], {%r3917, %r3929, %r3941, %r3953}; 2026-02-21T08:30:48.5870129Z st.shared.v4.b32 [%r251], {%r3965, %r3977, %r3989, %r4001}; 2026-02-21T08:30:48.5870183Z bar.sync 0; 2026-02-21T08:30:48.5870247Z // begin inline asm 2026-02-21T08:30:48.5870400Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3856, %r3860, %r3864, %r3868}, [%r3819]; 2026-02-21T08:30:48.5870458Z // end inline asm 2026-02-21T08:30:48.5870522Z // begin inline asm 2026-02-21T08:30:48.5870677Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3872, %r3876, %r3880, %r3884}, [%r3824]; 2026-02-21T08:30:48.5870732Z // end inline asm 2026-02-21T08:30:48.5870790Z bar.sync 0; 2026-02-21T08:30:48.5870894Z st.shared.v4.b32 [%r250], {%r3920, %r3932, %r3944, %r3956}; 2026-02-21T08:30:48.5870990Z st.shared.v4.b32 [%r251], {%r3968, %r3980, %r3992, %r4004}; 2026-02-21T08:30:48.5871046Z bar.sync 0; 2026-02-21T08:30:48.5871109Z // begin inline asm 2026-02-21T08:30:48.5871262Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3857, %r3861, %r3865, %r3869}, [%r3819]; 2026-02-21T08:30:48.5871317Z // end inline asm 2026-02-21T08:30:48.5871372Z // begin inline asm 2026-02-21T08:30:48.5871528Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3873, %r3877, %r3881, %r3885}, [%r3824]; 2026-02-21T08:30:48.5871627Z // end inline asm 2026-02-21T08:30:48.5871683Z bar.sync 0; 2026-02-21T08:30:48.5871784Z st.shared.v4.b32 [%r250], {%r3923, %r3935, %r3947, %r3959}; 
2026-02-21T08:30:48.5871874Z st.shared.v4.b32 [%r251], {%r3971, %r3983, %r3995, %r4007}; 2026-02-21T08:30:48.5871929Z bar.sync 0; 2026-02-21T08:30:48.5871993Z // begin inline asm 2026-02-21T08:30:48.5872145Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3858, %r3862, %r3866, %r3870}, [%r3819]; 2026-02-21T08:30:48.5872202Z // end inline asm 2026-02-21T08:30:48.5872260Z // begin inline asm 2026-02-21T08:30:48.5872479Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3874, %r3878, %r3882, %r3886}, [%r3824]; 2026-02-21T08:30:48.5872535Z // end inline asm 2026-02-21T08:30:48.5872592Z // begin inline asm 2026-02-21T08:30:48.5872709Z st.global.v4.b32 [ %rd948 + 0 ], { %r3855, %r3856, %r3857, %r3858 }; 2026-02-21T08:30:48.5872766Z // end inline asm 2026-02-21T08:30:48.5872822Z // begin inline asm 2026-02-21T08:30:48.5872937Z st.global.v4.b32 [ %rd949 + 0 ], { %r3859, %r3860, %r3861, %r3862 }; 2026-02-21T08:30:48.5872995Z // end inline asm 2026-02-21T08:30:48.5873052Z // begin inline asm 2026-02-21T08:30:48.5873154Z st.global.v4.b32 [ %rd950 + 0 ], { %r3863, %r3864, %r3865, %r3866 }; 2026-02-21T08:30:48.5873216Z // end inline asm 2026-02-21T08:30:48.5873273Z // begin inline asm 2026-02-21T08:30:48.5873373Z st.global.v4.b32 [ %rd951 + 0 ], { %r3867, %r3868, %r3869, %r3870 }; 2026-02-21T08:30:48.5873433Z // end inline asm 2026-02-21T08:30:48.5873553Z // begin inline asm 2026-02-21T08:30:48.5873658Z st.global.v4.b32 [ %rd952 + 0 ], { %r3871, %r3872, %r3873, %r3874 }; 2026-02-21T08:30:48.5873715Z // end inline asm 2026-02-21T08:30:48.5873781Z // begin inline asm 2026-02-21T08:30:48.5873881Z st.global.v4.b32 [ %rd953 + 0 ], { %r3875, %r3876, %r3877, %r3878 }; 2026-02-21T08:30:48.5873937Z // end inline asm 2026-02-21T08:30:48.5873999Z // begin inline asm 2026-02-21T08:30:48.5874099Z st.global.v4.b32 [ %rd954 + 0 ], { %r3879, %r3880, %r3881, %r3882 }; 2026-02-21T08:30:48.5874154Z // end inline asm 2026-02-21T08:30:48.5874210Z // begin inline asm 2026-02-21T08:30:48.5874320Z 
st.global.v4.b32 [ %rd955 + 0 ], { %r3883, %r3884, %r3885, %r3886 }; 2026-02-21T08:30:48.5874374Z // end inline asm 2026-02-21T08:30:48.5874431Z bra.uni $L__BB0_36; 2026-02-21T08:30:48.5874523Z $L__BB0_37: // %._crit_edge448 2026-02-21T08:30:48.5874700Z .loc 1 19 112 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:112 2026-02-21T08:30:48.5874770Z cp.async.wait_group 0; 2026-02-21T08:30:48.5874824Z bar.sync 0; 2026-02-21T08:30:48.5874993Z .loc 1 19 4 // cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py:19:4 2026-02-21T08:30:48.5875046Z bar.sync 0; 2026-02-21T08:30:48.5875101Z // begin inline asm 2026-02-21T08:30:48.5875223Z @%p3 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r4008, 256; 2026-02-21T08:30:48.5875275Z // end inline asm 2026-02-21T08:30:48.5875326Z ret; 2026-02-21T08:30:48.5875382Z $L__tmp377: 2026-02-21T08:30:48.5875435Z $L__func_end0: 2026-02-21T08:30:48.5875517Z // -- End function 2026-02-21T08:30:48.5875569Z } 2026-02-21T08:30:48.5875777Z .file 1 "/tmp/torchinductor_root/ay/cayanldlxf2j3xjjgltgcvcwmaneugouicq3ja24fyjll7aikznj.py" 2026-02-21T08:30:48.5875949Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T08:30:48.5876008Z .section .debug_abbrev 2026-02-21T08:30:48.5876065Z { 2026-02-21T08:30:48.5876155Z .b8 1 // Abbreviation Code 2026-02-21T08:30:48.5876239Z .b8 17 // DW_TAG_compile_unit 2026-02-21T08:30:48.5876320Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:30:48.5876400Z .b8 37 // DW_AT_producer 2026-02-21T08:30:48.5876475Z .b8 8 // DW_FORM_string 2026-02-21T08:30:48.5876553Z .b8 19 // DW_AT_language 2026-02-21T08:30:48.5876627Z .b8 5 // DW_FORM_data2 2026-02-21T08:30:48.5876698Z .b8 3 // DW_AT_name 2026-02-21T08:30:48.5876772Z .b8 8 // DW_FORM_string 2026-02-21T08:30:48.5876854Z .b8 16 // DW_AT_stmt_list 2026-02-21T08:30:48.5876928Z .b8 6 // DW_FORM_data4 2026-02-21T08:30:48.5877003Z .b8 27 // DW_AT_comp_dir 2026-02-21T08:30:48.5877127Z .b8 8 // DW_FORM_string 
2026-02-21T08:30:48.5877195Z .b8 0 // EOM(1) 2026-02-21T08:30:48.5877261Z .b8 0 // EOM(2) 2026-02-21T08:30:48.5877346Z .b8 2 // Abbreviation Code 2026-02-21T08:30:48.5877425Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:30:48.5877497Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:30:48.5877568Z .b8 3 // DW_AT_name 2026-02-21T08:30:48.5877643Z .b8 8 // DW_FORM_string 2026-02-21T08:30:48.5877718Z .b8 32 // DW_AT_inline 2026-02-21T08:30:48.5877792Z .b8 11 // DW_FORM_data1 2026-02-21T08:30:48.5877866Z .b8 0 // EOM(1) 2026-02-21T08:30:48.5877976Z .b8 0 // EOM(2) 2026-02-21T08:30:48.5878058Z .b8 3 // Abbreviation Code 2026-02-21T08:30:48.5878142Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:30:48.5878218Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:30:48.5878295Z .b8 17 // DW_AT_low_pc 2026-02-21T08:30:48.5878367Z .b8 1 // DW_FORM_addr 2026-02-21T08:30:48.5878449Z .b8 18 // DW_AT_high_pc 2026-02-21T08:30:48.5878520Z .b8 1 // DW_FORM_addr 2026-02-21T08:30:48.5878599Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:30:48.5878673Z .b8 19 // DW_FORM_ref4 2026-02-21T08:30:48.5878738Z .b8 0 // EOM(1) 2026-02-21T08:30:48.5878802Z .b8 0 // EOM(2) 2026-02-21T08:30:48.5878885Z .b8 4 // Abbreviation Code 2026-02-21T08:30:48.5878978Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T08:30:48.5879054Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:30:48.5879148Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:30:48.5879219Z .b8 19 // DW_FORM_ref4 2026-02-21T08:30:48.5879288Z .b8 17 // DW_AT_low_pc 2026-02-21T08:30:48.5879355Z .b8 1 // DW_FORM_addr 2026-02-21T08:30:48.5879435Z .b8 18 // DW_AT_high_pc 2026-02-21T08:30:48.5879505Z .b8 1 // DW_FORM_addr 2026-02-21T08:30:48.5879579Z .b8 88 // DW_AT_call_file 2026-02-21T08:30:48.5879655Z .b8 11 // DW_FORM_data1 2026-02-21T08:30:48.5879727Z .b8 89 // DW_AT_call_line 2026-02-21T08:30:48.5879803Z .b8 11 // DW_FORM_data1 2026-02-21T08:30:48.5879884Z .b8 87 // DW_AT_call_column 2026-02-21T08:30:48.5879956Z .b8 11 // DW_FORM_data1 2026-02-21T08:30:48.5880022Z .b8 0 // EOM(1) 
2026-02-21T08:30:48.5880085Z .b8 0 // EOM(2) 2026-02-21T08:30:48.5880154Z .b8 0 // EOM(3) 2026-02-21T08:30:48.5880204Z } 2026-02-21T08:30:48.5880262Z .section .debug_info 2026-02-21T08:30:48.5880318Z { 2026-02-21T08:30:48.5880398Z .b32 178 // Length of Unit 2026-02-21T08:30:48.5880481Z .b8 2 // DWARF version number 2026-02-21T08:30:48.5880537Z .b8 0 2026-02-21T08:30:48.5880652Z .b32 .debug_abbrev // Offset Into Abbrev. Section 2026-02-21T08:30:48.5880739Z .b8 8 // Address Size (in bytes) 2026-02-21T08:30:48.5880839Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T08:30:48.5880966Z .b8 116 // DW_AT_producer 2026-02-21T08:30:48.5881018Z .b8 114 2026-02-21T08:30:48.5881070Z .b8 105 2026-02-21T08:30:48.5881125Z .b8 116 2026-02-21T08:30:48.5881174Z .b8 111 2026-02-21T08:30:48.5881223Z .b8 110 2026-02-21T08:30:48.5881272Z .b8 0 2026-02-21T08:30:48.5881348Z .b8 2 // DW_AT_language 2026-02-21T08:30:48.5881399Z .b8 0 2026-02-21T08:30:48.5881471Z .b8 99 // DW_AT_name 2026-02-21T08:30:48.5881526Z .b8 97 2026-02-21T08:30:48.5881598Z .b8 121 2026-02-21T08:30:48.5881649Z .b8 97 2026-02-21T08:30:48.5881697Z .b8 110 2026-02-21T08:30:48.5881753Z .b8 108 2026-02-21T08:30:48.5881802Z .b8 100 2026-02-21T08:30:48.5881852Z .b8 108 2026-02-21T08:30:48.5881907Z .b8 120 2026-02-21T08:30:48.5881955Z .b8 102 2026-02-21T08:30:48.5882003Z .b8 50 2026-02-21T08:30:48.5882050Z .b8 106 2026-02-21T08:30:48.5882151Z .b8 51 2026-02-21T08:30:48.5882204Z .b8 120 2026-02-21T08:30:48.5882254Z .b8 106 2026-02-21T08:30:48.5882303Z .b8 106 2026-02-21T08:30:48.5882360Z .b8 103 2026-02-21T08:30:48.5882410Z .b8 108 2026-02-21T08:30:48.5882457Z .b8 116 2026-02-21T08:30:48.5882511Z .b8 103 2026-02-21T08:30:48.5882558Z .b8 99 2026-02-21T08:30:48.5882605Z .b8 118 2026-02-21T08:30:48.5882653Z .b8 99 2026-02-21T08:30:48.5882704Z .b8 119 2026-02-21T08:30:48.5882753Z .b8 109 2026-02-21T08:30:48.5882800Z .b8 97 2026-02-21T08:30:48.5882854Z .b8 110 2026-02-21T08:30:48.5882903Z .b8 101 
2026-02-21T08:30:48.5882951Z .b8 117 2026-02-21T08:30:48.5883000Z .b8 103 2026-02-21T08:30:48.5883055Z .b8 111 2026-02-21T08:30:48.5883103Z .b8 117 2026-02-21T08:30:48.5883152Z .b8 105 2026-02-21T08:30:48.5883205Z .b8 99 2026-02-21T08:30:48.5883252Z .b8 113 2026-02-21T08:30:48.5883302Z .b8 51 2026-02-21T08:30:48.5883351Z .b8 106 2026-02-21T08:30:48.5883405Z .b8 97 2026-02-21T08:30:48.5883452Z .b8 50 2026-02-21T08:30:48.5883501Z .b8 52 2026-02-21T08:30:48.5883551Z .b8 102 2026-02-21T08:30:48.5883608Z .b8 121 2026-02-21T08:30:48.5883656Z .b8 106 2026-02-21T08:30:48.5883706Z .b8 108 2026-02-21T08:30:48.5883762Z .b8 108 2026-02-21T08:30:48.5883810Z .b8 55 2026-02-21T08:30:48.5883858Z .b8 97 2026-02-21T08:30:48.5883906Z .b8 105 2026-02-21T08:30:48.5883962Z .b8 107 2026-02-21T08:30:48.5884010Z .b8 122 2026-02-21T08:30:48.5884059Z .b8 110 2026-02-21T08:30:48.5884113Z .b8 106 2026-02-21T08:30:48.5884160Z .b8 46 2026-02-21T08:30:48.5884209Z .b8 112 2026-02-21T08:30:48.5884259Z .b8 121 2026-02-21T08:30:48.5884312Z .b8 0 2026-02-21T08:30:48.5884398Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T08:30:48.5884473Z .b8 47 // DW_AT_comp_dir 2026-02-21T08:30:48.5884527Z .b8 116 2026-02-21T08:30:48.5884575Z .b8 109 2026-02-21T08:30:48.5884624Z .b8 112 2026-02-21T08:30:48.5884673Z .b8 47 2026-02-21T08:30:48.5884728Z .b8 116 2026-02-21T08:30:48.5884780Z .b8 111 2026-02-21T08:30:48.5884830Z .b8 114 2026-02-21T08:30:48.5884879Z .b8 99 2026-02-21T08:30:48.5884943Z .b8 104 2026-02-21T08:30:48.5884993Z .b8 105 2026-02-21T08:30:48.5885042Z .b8 110 2026-02-21T08:30:48.5885095Z .b8 100 2026-02-21T08:30:48.5885142Z .b8 117 2026-02-21T08:30:48.5885191Z .b8 99 2026-02-21T08:30:48.5885239Z .b8 116 2026-02-21T08:30:48.5885293Z .b8 111 2026-02-21T08:30:48.5885341Z .b8 114 2026-02-21T08:30:48.5885389Z .b8 95 2026-02-21T08:30:48.5885442Z .b8 114 2026-02-21T08:30:48.5885490Z .b8 111 2026-02-21T08:30:48.5885538Z .b8 111 2026-02-21T08:30:48.5885585Z .b8 116 2026-02-21T08:30:48.5885638Z .b8 47 
2026-02-21T08:30:48.5885684Z .b8 97 2026-02-21T08:30:48.5885734Z .b8 121 2026-02-21T08:30:48.5885787Z .b8 0 2026-02-21T08:30:48.5885885Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T08:30:48.5885956Z .b8 95 // DW_AT_name 2026-02-21T08:30:48.5886007Z .b8 104 2026-02-21T08:30:48.5886061Z .b8 101 2026-02-21T08:30:48.5886110Z .b8 108 2026-02-21T08:30:48.5886160Z .b8 105 2026-02-21T08:30:48.5886216Z .b8 111 2026-02-21T08:30:48.5886314Z .b8 110 2026-02-21T08:30:48.5886364Z .b8 95 2026-02-21T08:30:48.5886412Z .b8 109 2026-02-21T08:30:48.5886468Z .b8 97 2026-02-21T08:30:48.5886518Z .b8 116 2026-02-21T08:30:48.5886567Z .b8 109 2026-02-21T08:30:48.5886617Z .b8 117 2026-02-21T08:30:48.5886670Z .b8 108 2026-02-21T08:30:48.5886718Z .b8 95 2026-02-21T08:30:48.5886768Z .b8 98 2026-02-21T08:30:48.5886823Z .b8 102 2026-02-21T08:30:48.5886873Z .b8 49 2026-02-21T08:30:48.5886922Z .b8 54 2026-02-21T08:30:48.5886970Z .b8 95 2026-02-21T08:30:48.5887024Z .b8 105 2026-02-21T08:30:48.5887072Z .b8 110 2026-02-21T08:30:48.5887119Z .b8 116 2026-02-21T08:30:48.5887172Z .b8 52 2026-02-21T08:30:48.5887220Z .b8 0 2026-02-21T08:30:48.5887292Z .b8 1 // DW_AT_inline 2026-02-21T08:30:48.5887386Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T08:30:48.5887476Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T08:30:48.5887604Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T08:30:48.5887697Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:30:48.5887811Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T08:30:48.5887896Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:30:48.5887976Z .b64 $L__tmp1 // DW_AT_low_pc 2026-02-21T08:30:48.5888064Z .b64 $L__tmp376 // DW_AT_high_pc 2026-02-21T08:30:48.5888136Z .b8 1 // DW_AT_call_file 2026-02-21T08:30:48.5888211Z .b8 84 // DW_AT_call_line 2026-02-21T08:30:48.5888288Z .b8 40 // DW_AT_call_column 2026-02-21T08:30:48.5888374Z .b8 0 // End Of Children Mark 2026-02-21T08:30:48.5888455Z .b8 0 // End Of Children Mark 
2026-02-21T08:30:48.5888504Z } 2026-02-21T08:30:48.5888577Z .section .debug_macinfo { } 2026-02-21T08:30:48.5888581Z 2026-02-21T08:30:48.5888656Z ================================================================ 2026-02-21T08:30:48.5888760Z please share the reproducer above with Triton project. 2026-02-21T08:30:50.8063942Z 2026-02-21T08:30:50.8064494Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 102/102 6.2 configs/s 2026-02-21T08:30:51.5503746Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━ 255/255 278.0 configs/s 2026-02-21T08:30:51.7391930Z [138s] Generation 1 complete: 2026-02-21T08:30:51.7393629Z error=10 2026-02-21T08:30:51.7393792Z ok=94 2026-02-21T08:30:51.7393917Z min=0.7813 2026-02-21T08:30:51.7394047Z mid=2.2437 2026-02-21T08:30:51.7394247Z max=517.8849 2026-02-21T08:30:51.7396372Z best={'block_sizes': [8, 512, 64], 2026-02-21T08:30:51.7396662Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:30:51.7396911Z 'l2_groupings': [4], 2026-02-21T08:30:51.7397114Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:30:51.7397317Z 'loop_orders': [[0, 1]], 2026-02-21T08:30:51.7397482Z 'num_sm_multiplier': 64, 2026-02-21T08:30:51.7397633Z 'num_stages': 4, 2026-02-21T08:30:51.7397775Z 'num_warps': 16, 2026-02-21T08:30:51.7397926Z 'pid_type': 'persistent_blocked', 2026-02-21T08:30:51.7398108Z 'range_flattens': [False, False], 2026-02-21T08:30:51.7398291Z 'range_multi_buffers': [True, None], 2026-02-21T08:30:51.7398472Z 'range_num_stages': [1, 1], 2026-02-21T08:30:51.7398643Z 'range_unroll_factors': [1, 4], 2026-02-21T08:30:51.7398821Z 'range_warp_specializes': [None, False]} 2026-02-21T08:30:51.7408767Z [138s] Fitting surrogate: 204 points, 204 targets 2026-02-21T08:30:53.0615862Z [139s] Generation 2 starting: 98 neighbors, 5 active search path(s) 2026-02-21T08:31:05.9121316Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 1.3 configs/s 2026-02-21T08:31:06.8192211Z [153s] Skipping config 
with accuracy mismatch: helion.Config(block_sizes=[2, 32, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=32, num_stages=6, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:31:06.8193833Z Tensor-likes are not close! 2026-02-21T08:31:06.8198398Z 2026-02-21T08:31:06.8200147Z Mismatched elements: 33484606 / 33554432 (99.8%) 2026-02-21T08:31:06.8200505Z Greatest absolute difference: 2512.0 at index (696, 6537) (up to 0.01 allowed) 2026-02-21T08:31:06.8204573Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:06.8207864Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:31:06.8210773Z 2026-02-21T08:31:07.3635254Z [154s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=32, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[1, 4], range_warp_specializes=[None, False]) 2026-02-21T08:31:07.3636467Z Tensor-likes are not close! 2026-02-21T08:31:07.3642144Z 2026-02-21T08:31:07.3646211Z Mismatched elements: 33435016 / 33554432 (99.6%) 2026-02-21T08:31:07.3648295Z Greatest absolute difference: 1296.0 at index (1726, 7328) (up to 0.01 allowed) 2026-02-21T08:31:07.3648674Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:07.3648994Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:31:07.3649161Z 2026-02-21T08:31:07.6708474Z [154s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=32, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]) 2026-02-21T08:31:07.6709671Z Tensor-likes are not close! 2026-02-21T08:31:07.6709796Z 2026-02-21T08:31:07.6709884Z Mismatched elements: 33483631 / 33554432 (99.8%) 2026-02-21T08:31:07.6710184Z Greatest absolute difference: 2432.0 at index (2056, 3359) (up to 0.01 allowed) 2026-02-21T08:31:07.6710555Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:07.6710872Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:31:07.6711040Z 2026-02-21T08:31:08.1907671Z [154s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=32, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[1, 4], range_warp_specializes=[None, False]) 2026-02-21T08:31:08.1908822Z Tensor-likes are not close! 2026-02-21T08:31:08.1908988Z 2026-02-21T08:31:08.1909147Z Mismatched elements: 33435016 / 33554432 (99.6%) 2026-02-21T08:31:08.1909439Z Greatest absolute difference: 1296.0 at index (1726, 7328) (up to 0.01 allowed) 2026-02-21T08:31:08.1909800Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:08.1910105Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:31:08.1910271Z 2026-02-21T08:31:09.1773928Z [155s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 256], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=6, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:31:09.1775245Z Tensor-likes are not close! 2026-02-21T08:31:09.1775377Z 2026-02-21T08:31:09.1775497Z Mismatched elements: 33437878 / 33554432 (99.7%) 2026-02-21T08:31:09.1775793Z Greatest absolute difference: 1352.0 at index (321, 1931) (up to 0.01 allowed) 2026-02-21T08:31:09.1776135Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:09.1776433Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:31:09.1776605Z 2026-02-21T08:31:09.3209911Z [156s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_stages=6, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:31:09.3210962Z Tensor-likes are not close! 2026-02-21T08:31:09.3216579Z 2026-02-21T08:31:09.3221078Z Mismatched elements: 33477275 / 33554432 (99.8%) 2026-02-21T08:31:09.3225153Z Greatest absolute difference: 2352.0 at index (3414, 7062) (up to 0.01 allowed) 2026-02-21T08:31:09.3228958Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:09.3229380Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:31:09.3229604Z 2026-02-21T08:31:11.6480550Z [158s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 32], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[32], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:31:11.6482587Z Tensor-likes are not close! 2026-02-21T08:31:11.6486234Z 2026-02-21T08:31:11.6490790Z Mismatched elements: 33436839 / 33554432 (99.6%) 2026-02-21T08:31:11.6492257Z Greatest absolute difference: 1320.0 at index (1792, 5366) (up to 0.01 allowed) 2026-02-21T08:31:11.6492646Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:11.6492974Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:31:11.6493153Z 2026-02-21T08:31:11.7158553Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 17.5 configs/s 2026-02-21T08:31:13.1872845Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━ 372/372 230.9 configs/s 2026-02-21T08:31:13.3431002Z [160s] Generation 2 complete: 2026-02-21T08:31:13.3432741Z error=21 2026-02-21T08:31:13.3432884Z ok=82 2026-02-21T08:31:13.3433009Z min=0.5457 2026-02-21T08:31:13.3433138Z mid=1.2524 2026-02-21T08:31:13.3433276Z max=20.5384 2026-02-21T08:31:13.3433634Z best={'block_sizes': [32, 64, 32], 2026-02-21T08:31:13.3433919Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:31:13.3434160Z 'l2_groupings': [32], 2026-02-21T08:31:13.3434336Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:31:13.3434543Z 'loop_orders': [[1, 0]], 2026-02-21T08:31:13.3434702Z 'num_stages': 3, 2026-02-21T08:31:13.3434848Z 'num_warps': 1, 2026-02-21T08:31:13.3434990Z 'pid_type': 'flat', 2026-02-21T08:31:13.3435152Z 'range_flattens': [None, False], 2026-02-21T08:31:13.3435333Z 
'range_multi_buffers': [None, True], 2026-02-21T08:31:13.3435523Z 'range_num_stages': [0, 0], 2026-02-21T08:31:13.3435703Z 'range_unroll_factors': [0, 0], 2026-02-21T08:31:13.3435884Z 'range_warp_specializes': [None, None]} 2026-02-21T08:31:13.3450558Z [160s] Fitting surrogate: 307 points, 307 targets 2026-02-21T08:31:14.5514850Z [161s] Generation 3 starting: 93 neighbors, 5 active search path(s) 2026-02-21T08:31:26.0702545Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 2.0 configs/s 2026-02-21T08:31:27.6430983Z [174s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[1, 4], range_warp_specializes=[None, False]) 2026-02-21T08:31:27.6435735Z Tensor-likes are not close! 2026-02-21T08:31:27.6439299Z 2026-02-21T08:31:27.6443842Z Mismatched elements: 33444356 / 33554432 (99.7%) 2026-02-21T08:31:27.6445404Z Greatest absolute difference: 1296.0 at index (1726, 7328) (up to 0.01 allowed) 2026-02-21T08:31:27.6445783Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:27.6446389Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:31:27.6446583Z 2026-02-21T08:31:28.6367809Z [175s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 256], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_stages=6, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:31:28.6368824Z Tensor-likes are not close! 2026-02-21T08:31:28.6368982Z 2026-02-21T08:31:28.6369227Z Mismatched elements: 33485225 / 33554432 (99.8%) 2026-02-21T08:31:28.6369546Z Greatest absolute difference: 2512.0 at index (696, 6537) (up to 0.01 allowed) 2026-02-21T08:31:28.6373779Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:28.6374211Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:31:28.6378416Z 2026-02-21T08:31:31.3383290Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 17.9 configs/s 2026-02-21T08:31:31.7761757Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━ 612/612 1177.5 2026-02-21T08:31:31.7765875Z configs/s 2026-02-21T08:31:31.8670060Z [178s] Generation 3 complete: 2026-02-21T08:31:31.8674371Z error=29 2026-02-21T08:31:31.8675828Z ok=70 2026-02-21T08:31:31.8675991Z min=0.3268 2026-02-21T08:31:31.8676133Z mid=0.9554 2026-02-21T08:31:31.8676263Z max=29.5772 2026-02-21T08:31:31.8676418Z best={'block_sizes': [8, 128, 256], 2026-02-21T08:31:31.8676666Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:31:31.8676892Z 'l2_groupings': [8], 2026-02-21T08:31:31.8677071Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:31:31.8677263Z 'loop_orders': [[1, 0]], 2026-02-21T08:31:31.8677425Z 'maxnreg': 64, 2026-02-21T08:31:31.8677571Z 'num_sm_multiplier': 32, 2026-02-21T08:31:31.8677747Z 'num_stages': 6, 2026-02-21T08:31:31.8678239Z 'num_warps': 2, 
2026-02-21T08:31:31.8678417Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:31:31.8678614Z 'range_flattens': [False, False], 2026-02-21T08:31:31.8678806Z 'range_multi_buffers': [False, None], 2026-02-21T08:31:31.8678995Z 'range_num_stages': [3, 0], 2026-02-21T08:31:31.8679164Z 'range_unroll_factors': [0, 0], 2026-02-21T08:31:31.8679353Z 'range_warp_specializes': [True, None]} 2026-02-21T08:31:31.8693232Z [178s] Fitting surrogate: 406 points, 406 targets 2026-02-21T08:31:33.0197802Z [179s] Generation 4 starting: 90 neighbors, 5 active search path(s) 2026-02-21T08:31:45.1607078Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 1.7 configs/s 2026-02-21T08:31:45.2388250Z [191s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=32, num_stages=6, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 0], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:31:45.2389424Z Tensor-likes are not close! 2026-02-21T08:31:45.2389542Z 2026-02-21T08:31:45.2389627Z Mismatched elements: 33485703 / 33554432 (99.8%) 2026-02-21T08:31:45.2389909Z Greatest absolute difference: 2400.0 at index (473, 5764) (up to 0.01 allowed) 2026-02-21T08:31:45.2390239Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:45.2390543Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:31:45.2390704Z 2026-02-21T08:31:45.3961423Z [192s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=32, num_stages=6, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, None], range_num_stages=[3, 0], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:31:45.3963343Z Tensor-likes are not close! 2026-02-21T08:31:45.3963468Z 2026-02-21T08:31:45.3963556Z Mismatched elements: 33485703 / 33554432 (99.8%) 2026-02-21T08:31:45.3963850Z Greatest absolute difference: 2400.0 at index (473, 5764) (up to 0.01 allowed) 2026-02-21T08:31:45.3964211Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:45.3964537Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:31:45.3964706Z 2026-02-21T08:31:45.4654825Z [192s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=32, num_stages=6, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, None], range_num_stages=[4, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:31:45.4655895Z Tensor-likes are not close! 
2026-02-21T08:31:45.4656022Z 2026-02-21T08:31:45.4656109Z Mismatched elements: 33485271 / 33554432 (99.8%) 2026-02-21T08:31:45.4656401Z Greatest absolute difference: 2432.0 at index (2320, 6047) (up to 0.01 allowed) 2026-02-21T08:31:45.4656734Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:45.4657043Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:31:45.4657206Z 2026-02-21T08:31:48.1582306Z 2026-02-21T08:31:48.1583689Z 2026-02-21T08:31:48.1583938Z ================================================================ 2026-02-21T08:31:48.1584240Z Internal Triton PTX codegen error 2026-02-21T08:31:48.1587220Z [194s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:31:48.1587561Z `ptxas` stderr: 2026-02-21T08:31:48.1588308Z ptxas fatal : (C7602) Insufficient registers (40) to compile instruction at line 1073 in function _helion_matmul_bf16_int4. Try to compile with register target of 146 or higher. 
2026-02-21T08:31:48.1588812Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:31:48.1588963Z 2026-02-21T08:31:48.1589353Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp0t1r0r8t.ptx -o /tmp/tmp0t1r0r8t.ptx.o 2026-02-21T08:31:48.1589788Z 2026-02-21T08:31:48.1589792Z 2026-02-21T08:31:48.1589849Z // 2026-02-21T08:31:48.1589998Z // Generated by LLVM NVPTX Back-End 2026-02-21T08:31:48.1590182Z // 2026-02-21T08:31:48.1590253Z 2026-02-21T08:31:48.1590323Z .version 8.7 2026-02-21T08:31:48.1590470Z .target sm_100a 2026-02-21T08:31:48.1590620Z .address_size 64 2026-02-21T08:31:48.1590707Z 2026-02-21T08:31:48.1590942Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T08:31:48.1591248Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T08:31:48.1591474Z // @_helion_matmul_bf16_int4 2026-02-21T08:31:48.1592586Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T08:31:48.1592849Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T08:31:48.1593137Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T08:31:48.1593422Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T08:31:48.1593692Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T08:31:48.1593968Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T08:31:48.1594182Z ) 2026-02-21T08:31:48.1594311Z .reqntid 1408 2026-02-21T08:31:48.1594446Z { 2026-02-21T08:31:48.1594571Z .reg .pred %p<108>; 2026-02-21T08:31:48.1594726Z .reg .b16 %rs<65>; 2026-02-21T08:31:48.1594872Z .reg .b32 %r<1998>; 2026-02-21T08:31:48.1595031Z .reg .b64 %rd<131>; 2026-02-21T08:31:48.1595183Z $L__func_begin0: 2026-02-21T08:31:48.1595276Z 2026-02-21T08:31:48.1595331Z // %bb.0: 2026-02-21T08:31:48.1595576Z .loc 1 19 0 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:19 2026-02-21T08:31:48.1595870Z mov.u32 %r1, %tid.x; 2026-02-21T08:31:48.1596021Z shr.u32 %r2, %r1, 5; 2026-02-21T08:31:48.1596176Z shfl.sync.idx.b32 %r3, %r2, 0, 31, -1; 2026-02-21T08:31:48.1596366Z setp.lt.u32 %p3, %r3, 8; 2026-02-21T08:31:48.1596521Z @%p3 bra $L__BB0_43; 2026-02-21T08:31:48.1596704Z // %bb.1: // %.preheader.preheader 2026-02-21T08:31:48.1597011Z .loc 1 0 0 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:0:0 2026-02-21T08:31:48.1597343Z ld.param.b64 %rd20, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T08:31:48.1597588Z ld.param.b64 %rd19, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T08:31:48.1597836Z ld.param.b64 %rd18, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T08:31:48.1598055Z mov.b32 %r273, global_smem; 2026-02-21T08:31:48.1598219Z add.s32 %r274, %r273, %r3; 2026-02-21T08:31:48.1598387Z add.s32 %r1660, %r273, 393216; 2026-02-21T08:31:48.1598549Z mov.u32 %r5, %ctaid.x; 2026-02-21T08:31:48.1598705Z bra.uni $L__BB0_2; 2026-02-21T08:31:48.1598869Z $L__BB0_41: // %._crit_edge 2026-02-21T08:31:48.1599098Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1599417Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1599713Z barrier.sync 1; 2026-02-21T08:31:48.1599883Z $L__BB0_2: // %.preheader 2026-02-21T08:31:48.1600102Z // =>This Loop Header: Depth=1 2026-02-21T08:31:48.1600336Z // Child Loop BB0_36 Depth 2 2026-02-21T08:31:48.1600568Z // Child Loop BB0_32 Depth 2 2026-02-21T08:31:48.1600890Z // Child Loop BB0_28 Depth 2 2026-02-21T08:31:48.1601111Z // Child Loop BB0_22 Depth 2 2026-02-21T08:31:48.1601348Z // Child Loop BB0_8 Depth 2 2026-02-21T08:31:48.1601685Z .loc 1 19 0 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:19 2026-02-21T08:31:48.1601963Z barrier.sync 1; 2026-02-21T08:31:48.1602120Z ld.shared.b8 %r272, [%r274+803192]; 2026-02-21T08:31:48.1602299Z setp.gt.u32 %p4, %r272, 6; 
2026-02-21T08:31:48.1602466Z @%p4 bra $L__BB0_4; 2026-02-21T08:31:48.1602626Z // %bb.3: // %.preheader 2026-02-21T08:31:48.1602847Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1603060Z $L_brx_0: .branchtargets 2026-02-21T08:31:48.1603208Z $L__BB0_6, 2026-02-21T08:31:48.1603412Z $L__BB0_20, 2026-02-21T08:31:48.1603542Z $L__BB0_26, 2026-02-21T08:31:48.1603671Z $L__BB0_30, 2026-02-21T08:31:48.1603793Z $L__BB0_34, 2026-02-21T08:31:48.1603919Z $L__BB0_42, 2026-02-21T08:31:48.1604038Z $L__BB0_5; 2026-02-21T08:31:48.1604176Z brx.idx %r272, $L_brx_0; 2026-02-21T08:31:48.1604369Z $L__BB0_6: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1604698Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1605013Z ld.shared.b32 %r4, [global_smem+393216]; 2026-02-21T08:31:48.1605195Z barrier.sync 1; 2026-02-21T08:31:48.1605450Z .loc 1 32 45 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:32:45 2026-02-21T08:31:48.1605734Z bfe.u32 %r8, %r1, 2, 6; 2026-02-21T08:31:48.1607091Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_sm_multiplier=128, num_stages=6, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:31:48.1608183Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T08:31:48.1608412Z `ptxas` stderr: 2026-02-21T08:31:48.1608879Z ptxas fatal : (C7602) Insufficient registers (40) to compile instruction at line 1073 in function _helion_matmul_bf16_int4. Try to compile with register target of 146 or higher. 
2026-02-21T08:31:48.1609371Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:31:48.1609516Z 2026-02-21T08:31:48.1609892Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp0t1r0r8t.ptx -o /tmp/tmp0t1r0r8t.ptx.o 2026-02-21T08:31:48.1610314Z 2026-02-21T08:31:48.1610442Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:31:48.1610680Z or.b32 %r9, %r8, 64; 2026-02-21T08:31:48.1610934Z .loc 1 47 38 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:47:38 2026-02-21T08:31:48.1611228Z shl.b32 %r1662, %r1, 3; 2026-02-21T08:31:48.1611380Z and.b32 %r10, %r1662, 24; 2026-02-21T08:31:48.1611682Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1611977Z setp.lt.s32 %p45, %r4, 1; 2026-02-21T08:31:48.1612137Z setp.gt.s32 %p46, %r4, 0; 2026-02-21T08:31:48.1612401Z .loc 1 31 27 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:31:27 2026-02-21T08:31:48.1612684Z shl.b32 %r1663, %r5, 7; 2026-02-21T08:31:48.1612846Z and.b32 %r1664, %r1663, 3968; 2026-02-21T08:31:48.1613109Z .loc 1 32 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:32:32 2026-02-21T08:31:48.1613400Z or.b32 %r1969, %r1664, %r8; 2026-02-21T08:31:48.1613630Z or.b32 %r1970, %r9, %r1664; 2026-02-21T08:31:48.1613886Z .loc 1 48 53 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:53 2026-02-21T08:31:48.1614171Z shl.b32 %r1665, %r1969, 10; 2026-02-21T08:31:48.1614324Z shl.b32 %r1666, %r1970, 10; 2026-02-21T08:31:48.1614588Z .loc 1 48 60 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:60 2026-02-21T08:31:48.1614867Z or.b32 %r1667, %r1665, %r10; 2026-02-21T08:31:48.1615033Z or.b32 %r1668, %r1666, %r10; 2026-02-21T08:31:48.1615292Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1615589Z mad.wide.u32 %rd43, %r1667, 2, %rd18; 
2026-02-21T08:31:48.1615780Z mad.wide.u32 %rd44, %r1668, 2, %rd18; 2026-02-21T08:31:48.1616063Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1616406Z shl.b32 %r1669, %r1, 4; 2026-02-21T08:31:48.1616563Z and.b32 %r13, %r1669, 4080; 2026-02-21T08:31:48.1616732Z add.s32 %r1579, %r1660, %r13; 2026-02-21T08:31:48.1616900Z selp.b32 %r1580, 16, 0, %p46; 2026-02-21T08:31:48.1617071Z // begin inline asm 2026-02-21T08:31:48.1617289Z cp.async.cg.shared.global [ %r1579 + 0 ], [ %rd43 + 0 ], 0x10, %r1580; 2026-02-21T08:31:48.1617532Z // end inline asm 2026-02-21T08:31:48.1617690Z add.s32 %r1581, %r1579, 4096; 2026-02-21T08:31:48.1617859Z // begin inline asm 2026-02-21T08:31:48.1618084Z cp.async.cg.shared.global [ %r1581 + 0 ], [ %rd44 + 0 ], 0x10, %r1580; 2026-02-21T08:31:48.1618323Z // end inline asm 2026-02-21T08:31:48.1618485Z cp.async.commit_group; 2026-02-21T08:31:48.1618761Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1619066Z add.s64 %rd45, %rd43, 64; 2026-02-21T08:31:48.1619242Z add.s64 %rd46, %rd44, 64; 2026-02-21T08:31:48.1619521Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1619835Z add.s32 %r1670, %r273, %r13; 2026-02-21T08:31:48.1620009Z add.s32 %r1583, %r1670, 434176; 2026-02-21T08:31:48.1620193Z // begin inline asm 2026-02-21T08:31:48.1620409Z cp.async.cg.shared.global [ %r1583 + 0 ], [ %rd45 + 0 ], 0x10, %r1580; 2026-02-21T08:31:48.1620657Z // end inline asm 2026-02-21T08:31:48.1620813Z add.s32 %r1585, %r1670, 438272; 2026-02-21T08:31:48.1621000Z // begin inline asm 2026-02-21T08:31:48.1621220Z cp.async.cg.shared.global [ %r1585 + 0 ], [ %rd46 + 0 ], 0x10, %r1580; 2026-02-21T08:31:48.1621455Z // end inline asm 2026-02-21T08:31:48.1621640Z cp.async.commit_group; 2026-02-21T08:31:48.1621913Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1622221Z add.s64 
%rd47, %rd43, 128; 2026-02-21T08:31:48.1622388Z add.s64 %rd48, %rd44, 128; 2026-02-21T08:31:48.1622676Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1622977Z add.s32 %r1587, %r1670, 475136; 2026-02-21T08:31:48.1623145Z // begin inline asm 2026-02-21T08:31:48.1623360Z cp.async.cg.shared.global [ %r1587 + 0 ], [ %rd47 + 0 ], 0x10, %r1580; 2026-02-21T08:31:48.1623596Z // end inline asm 2026-02-21T08:31:48.1623749Z add.s32 %r1589, %r1670, 479232; 2026-02-21T08:31:48.1623915Z // begin inline asm 2026-02-21T08:31:48.1624126Z cp.async.cg.shared.global [ %r1589 + 0 ], [ %rd48 + 0 ], 0x10, %r1580; 2026-02-21T08:31:48.1624359Z // end inline asm 2026-02-21T08:31:48.1624515Z cp.async.commit_group; 2026-02-21T08:31:48.1624788Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1625094Z add.s64 %rd49, %rd43, 192; 2026-02-21T08:31:48.1625265Z add.s64 %rd50, %rd44, 192; 2026-02-21T08:31:48.1625536Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1625840Z add.s32 %r1591, %r1670, 516096; 2026-02-21T08:31:48.1626091Z // begin inline asm 2026-02-21T08:31:48.1626301Z cp.async.cg.shared.global [ %r1591 + 0 ], [ %rd49 + 0 ], 0x10, %r1580; 2026-02-21T08:31:48.1626533Z // end inline asm 2026-02-21T08:31:48.1626689Z add.s32 %r1593, %r1670, 520192; 2026-02-21T08:31:48.1626850Z // begin inline asm 2026-02-21T08:31:48.1627043Z cp.async.cg.shared.global [ %r1593 + 0 ], [ %rd50 + 0 ], 0x10, %r1580; 2026-02-21T08:31:48.1627275Z // end inline asm 2026-02-21T08:31:48.1627416Z cp.async.commit_group; 2026-02-21T08:31:48.1627685Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1627977Z setp.gt.s32 %p47, %r4, 1; 2026-02-21T08:31:48.1628243Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1628524Z add.s64 %rd51, %rd43, 256; 2026-02-21T08:31:48.1628685Z 
add.s64 %rd52, %rd44, 256; 2026-02-21T08:31:48.1629008Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1629284Z bar.sync 2, 256; 2026-02-21T08:31:48.1629431Z add.s32 %r1595, %r1670, 401408; 2026-02-21T08:31:48.1629595Z selp.b32 %r1596, 16, 0, %p47; 2026-02-21T08:31:48.1629760Z // begin inline asm 2026-02-21T08:31:48.1629957Z cp.async.cg.shared.global [ %r1595 + 0 ], [ %rd51 + 0 ], 0x10, %r1596; 2026-02-21T08:31:48.1630191Z // end inline asm 2026-02-21T08:31:48.1630330Z add.s32 %r1597, %r1670, 405504; 2026-02-21T08:31:48.1630495Z // begin inline asm 2026-02-21T08:31:48.1630698Z cp.async.cg.shared.global [ %r1597 + 0 ], [ %rd52 + 0 ], 0x10, %r1596; 2026-02-21T08:31:48.1630923Z // end inline asm 2026-02-21T08:31:48.1631074Z cp.async.commit_group; 2026-02-21T08:31:48.1631332Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1631656Z add.s64 %rd53, %rd43, 320; 2026-02-21T08:31:48.1631815Z add.s64 %rd54, %rd44, 320; 2026-02-21T08:31:48.1632091Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1632384Z add.s32 %r1599, %r1670, 442368; 2026-02-21T08:31:48.1632545Z // begin inline asm 2026-02-21T08:31:48.1632749Z cp.async.cg.shared.global [ %r1599 + 0 ], [ %rd53 + 0 ], 0x10, %r1596; 2026-02-21T08:31:48.1632973Z // end inline asm 2026-02-21T08:31:48.1633118Z add.s32 %r1601, %r1670, 446464; 2026-02-21T08:31:48.1633274Z // begin inline asm 2026-02-21T08:31:48.1633477Z cp.async.cg.shared.global [ %r1601 + 0 ], [ %rd54 + 0 ], 0x10, %r1596; 2026-02-21T08:31:48.1633699Z // end inline asm 2026-02-21T08:31:48.1633845Z cp.async.commit_group; 2026-02-21T08:31:48.1634108Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1634391Z add.s64 %rd55, %rd43, 384; 2026-02-21T08:31:48.1634555Z add.s64 %rd56, %rd44, 384; 2026-02-21T08:31:48.1634815Z .loc 1 48 80 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1635103Z add.s32 %r1603, %r1670, 483328; 2026-02-21T08:31:48.1635260Z // begin inline asm 2026-02-21T08:31:48.1635464Z cp.async.cg.shared.global [ %r1603 + 0 ], [ %rd55 + 0 ], 0x10, %r1596; 2026-02-21T08:31:48.1635684Z // end inline asm 2026-02-21T08:31:48.1635828Z add.s32 %r1605, %r1670, 487424; 2026-02-21T08:31:48.1635992Z // begin inline asm 2026-02-21T08:31:48.1636187Z cp.async.cg.shared.global [ %r1605 + 0 ], [ %rd56 + 0 ], 0x10, %r1596; 2026-02-21T08:31:48.1636414Z // end inline asm 2026-02-21T08:31:48.1636555Z cp.async.commit_group; 2026-02-21T08:31:48.1636816Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1637100Z add.s64 %rd57, %rd43, 448; 2026-02-21T08:31:48.1637259Z add.s64 %rd58, %rd44, 448; 2026-02-21T08:31:48.1637517Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1637854Z add.s32 %r1607, %r1670, 524288; 2026-02-21T08:31:48.1638020Z // begin inline asm 2026-02-21T08:31:48.1638214Z cp.async.cg.shared.global [ %r1607 + 0 ], [ %rd57 + 0 ], 0x10, %r1596; 2026-02-21T08:31:48.1638441Z // end inline asm 2026-02-21T08:31:48.1638578Z add.s32 %r1609, %r1670, 528384; 2026-02-21T08:31:48.1638740Z // begin inline asm 2026-02-21T08:31:48.1638935Z cp.async.cg.shared.global [ %r1609 + 0 ], [ %rd58 + 0 ], 0x10, %r1596; 2026-02-21T08:31:48.1639164Z // end inline asm 2026-02-21T08:31:48.1639304Z cp.async.commit_group; 2026-02-21T08:31:48.1639572Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1639872Z setp.gt.s32 %p48, %r4, 2; 2026-02-21T08:31:48.1640129Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1640418Z add.s64 %rd59, %rd43, 512; 2026-02-21T08:31:48.1640629Z add.s64 %rd60, %rd44, 512; 2026-02-21T08:31:48.1640896Z .loc 1 48 80 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1641173Z bar.sync 2, 256; 2026-02-21T08:31:48.1641320Z add.s32 %r1611, %r1670, 409600; 2026-02-21T08:31:48.1641492Z selp.b32 %r1612, 16, 0, %p48; 2026-02-21T08:31:48.1641680Z // begin inline asm 2026-02-21T08:31:48.1641887Z cp.async.cg.shared.global [ %r1611 + 0 ], [ %rd59 + 0 ], 0x10, %r1612; 2026-02-21T08:31:48.1642110Z // end inline asm 2026-02-21T08:31:48.1642263Z add.s32 %r1613, %r1670, 413696; 2026-02-21T08:31:48.1642426Z // begin inline asm 2026-02-21T08:31:48.1642637Z cp.async.cg.shared.global [ %r1613 + 0 ], [ %rd60 + 0 ], 0x10, %r1612; 2026-02-21T08:31:48.1642862Z // end inline asm 2026-02-21T08:31:48.1643013Z cp.async.commit_group; 2026-02-21T08:31:48.1643281Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1643565Z add.s64 %rd61, %rd43, 576; 2026-02-21T08:31:48.1643737Z add.s64 %rd62, %rd44, 576; 2026-02-21T08:31:48.1643995Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1644290Z add.s32 %r1615, %r1670, 450560; 2026-02-21T08:31:48.1644449Z // begin inline asm 2026-02-21T08:31:48.1644656Z cp.async.cg.shared.global [ %r1615 + 0 ], [ %rd61 + 0 ], 0x10, %r1612; 2026-02-21T08:31:48.1644879Z // end inline asm 2026-02-21T08:31:48.1645025Z add.s32 %r1617, %r1670, 454656; 2026-02-21T08:31:48.1645188Z // begin inline asm 2026-02-21T08:31:48.1645381Z cp.async.cg.shared.global [ %r1617 + 0 ], [ %rd62 + 0 ], 0x10, %r1612; 2026-02-21T08:31:48.1645613Z // end inline asm 2026-02-21T08:31:48.1645753Z cp.async.commit_group; 2026-02-21T08:31:48.1646017Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1646300Z add.s64 %rd63, %rd43, 640; 2026-02-21T08:31:48.1646464Z add.s64 %rd64, %rd44, 640; 2026-02-21T08:31:48.1646727Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1647018Z add.s32 %r1619, 
%r1670, 491520; 2026-02-21T08:31:48.1647183Z // begin inline asm 2026-02-21T08:31:48.1647381Z cp.async.cg.shared.global [ %r1619 + 0 ], [ %rd63 + 0 ], 0x10, %r1612; 2026-02-21T08:31:48.1647610Z // end inline asm 2026-02-21T08:31:48.1647746Z add.s32 %r1621, %r1670, 495616; 2026-02-21T08:31:48.1647908Z // begin inline asm 2026-02-21T08:31:48.1648102Z cp.async.cg.shared.global [ %r1621 + 0 ], [ %rd64 + 0 ], 0x10, %r1612; 2026-02-21T08:31:48.1648331Z // end inline asm 2026-02-21T08:31:48.1648471Z cp.async.commit_group; 2026-02-21T08:31:48.1648735Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1649023Z add.s64 %rd65, %rd43, 704; 2026-02-21T08:31:48.1649175Z add.s64 %rd66, %rd44, 704; 2026-02-21T08:31:48.1649442Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1649772Z add.s32 %r1623, %r1670, 532480; 2026-02-21T08:31:48.1649937Z // begin inline asm 2026-02-21T08:31:48.1650132Z cp.async.cg.shared.global [ %r1623 + 0 ], [ %rd65 + 0 ], 0x10, %r1612; 2026-02-21T08:31:48.1650364Z // end inline asm 2026-02-21T08:31:48.1650510Z add.s32 %r1625, %r1670, 536576; 2026-02-21T08:31:48.1650667Z // begin inline asm 2026-02-21T08:31:48.1650872Z cp.async.cg.shared.global [ %r1625 + 0 ], [ %rd66 + 0 ], 0x10, %r1612; 2026-02-21T08:31:48.1651092Z // end inline asm 2026-02-21T08:31:48.1651239Z cp.async.commit_group; 2026-02-21T08:31:48.1651500Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1651825Z setp.gt.s32 %p49, %r4, 3; 2026-02-21T08:31:48.1652086Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1652372Z add.s64 %rd67, %rd43, 768; 2026-02-21T08:31:48.1652588Z add.s64 %rd68, %rd44, 768; 2026-02-21T08:31:48.1652851Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1653144Z bar.sync 2, 256; 2026-02-21T08:31:48.1653292Z add.s32 %r1627, 
%r1670, 417792; 2026-02-21T08:31:48.1653493Z selp.b32 %r1628, 16, 0, %p49; 2026-02-21T08:31:48.1653656Z // begin inline asm 2026-02-21T08:31:48.1653867Z cp.async.cg.shared.global [ %r1627 + 0 ], [ %rd67 + 0 ], 0x10, %r1628; 2026-02-21T08:31:48.1654100Z // end inline asm 2026-02-21T08:31:48.1654243Z add.s32 %r1629, %r1670, 421888; 2026-02-21T08:31:48.1654413Z // begin inline asm 2026-02-21T08:31:48.1654613Z cp.async.cg.shared.global [ %r1629 + 0 ], [ %rd68 + 0 ], 0x10, %r1628; 2026-02-21T08:31:48.1654849Z // end inline asm 2026-02-21T08:31:48.1654993Z cp.async.commit_group; 2026-02-21T08:31:48.1655255Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1655542Z add.s64 %rd69, %rd43, 832; 2026-02-21T08:31:48.1655709Z add.s64 %rd70, %rd44, 832; 2026-02-21T08:31:48.1655980Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1656261Z add.s32 %r1631, %r1670, 458752; 2026-02-21T08:31:48.1656430Z // begin inline asm 2026-02-21T08:31:48.1656629Z cp.async.cg.shared.global [ %r1631 + 0 ], [ %rd69 + 0 ], 0x10, %r1628; 2026-02-21T08:31:48.1656865Z // end inline asm 2026-02-21T08:31:48.1657006Z add.s32 %r1633, %r1670, 462848; 2026-02-21T08:31:48.1657173Z // begin inline asm 2026-02-21T08:31:48.1657368Z cp.async.cg.shared.global [ %r1633 + 0 ], [ %rd70 + 0 ], 0x10, %r1628; 2026-02-21T08:31:48.1657603Z // end inline asm 2026-02-21T08:31:48.1657752Z cp.async.commit_group; 2026-02-21T08:31:48.1658009Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1658303Z add.s64 %rd71, %rd43, 896; 2026-02-21T08:31:48.1658464Z add.s64 %rd72, %rd44, 896; 2026-02-21T08:31:48.1658734Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1659047Z add.s32 %r1635, %r1670, 499712; 2026-02-21T08:31:48.1659217Z // begin inline asm 2026-02-21T08:31:48.1659417Z cp.async.cg.shared.global [ %r1635 + 0 ], [ %rd71 + 0 ], 0x10, 
%r1628; 2026-02-21T08:31:48.1659668Z // end inline asm 2026-02-21T08:31:48.1659819Z add.s32 %r1637, %r1670, 503808; 2026-02-21T08:31:48.1659986Z // begin inline asm 2026-02-21T08:31:48.1660203Z cp.async.cg.shared.global [ %r1637 + 0 ], [ %rd72 + 0 ], 0x10, %r1628; 2026-02-21T08:31:48.1660440Z // end inline asm 2026-02-21T08:31:48.1660595Z cp.async.commit_group; 2026-02-21T08:31:48.1660867Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1661172Z add.s64 %rd73, %rd43, 960; 2026-02-21T08:31:48.1661337Z add.s64 %rd74, %rd44, 960; 2026-02-21T08:31:48.1661656Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1662010Z add.s32 %r1639, %r1670, 540672; 2026-02-21T08:31:48.1662176Z // begin inline asm 2026-02-21T08:31:48.1662391Z cp.async.cg.shared.global [ %r1639 + 0 ], [ %rd73 + 0 ], 0x10, %r1628; 2026-02-21T08:31:48.1662620Z // end inline asm 2026-02-21T08:31:48.1662771Z add.s32 %r1641, %r1670, 544768; 2026-02-21T08:31:48.1662937Z // begin inline asm 2026-02-21T08:31:48.1663150Z cp.async.cg.shared.global [ %r1641 + 0 ], [ %rd74 + 0 ], 0x10, %r1628; 2026-02-21T08:31:48.1663392Z // end inline asm 2026-02-21T08:31:48.1663544Z cp.async.commit_group; 2026-02-21T08:31:48.1663828Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1664132Z setp.gt.s32 %p50, %r4, 4; 2026-02-21T08:31:48.1664416Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1664788Z add.s64 %rd75, %rd43, 1024; 2026-02-21T08:31:48.1664969Z add.s64 %rd76, %rd44, 1024; 2026-02-21T08:31:48.1665243Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1665541Z bar.sync 2, 256; 2026-02-21T08:31:48.1665695Z add.s32 %r1643, %r1670, 425984; 2026-02-21T08:31:48.1665868Z selp.b32 %r1644, 16, 0, %p50; 2026-02-21T08:31:48.1666040Z // begin inline asm 2026-02-21T08:31:48.1666247Z 
cp.async.cg.shared.global [ %r1643 + 0 ], [ %rd75 + 0 ], 0x10, %r1644; 2026-02-21T08:31:48.1666488Z // end inline asm 2026-02-21T08:31:48.1666632Z add.s32 %r1645, %r1670, 430080; 2026-02-21T08:31:48.1666804Z // begin inline asm 2026-02-21T08:31:48.1667008Z cp.async.cg.shared.global [ %r1645 + 0 ], [ %rd76 + 0 ], 0x10, %r1644; 2026-02-21T08:31:48.1667252Z // end inline asm 2026-02-21T08:31:48.1667398Z cp.async.commit_group; 2026-02-21T08:31:48.1667654Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1667940Z add.s64 %rd77, %rd43, 1088; 2026-02-21T08:31:48.1668096Z add.s64 %rd78, %rd44, 1088; 2026-02-21T08:31:48.1668356Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1668632Z add.s32 %r1647, %r1670, 466944; 2026-02-21T08:31:48.1668796Z // begin inline asm 2026-02-21T08:31:48.1668990Z cp.async.cg.shared.global [ %r1647 + 0 ], [ %rd77 + 0 ], 0x10, %r1644; 2026-02-21T08:31:48.1669217Z // end inline asm 2026-02-21T08:31:48.1669361Z add.s32 %r1649, %r1670, 471040; 2026-02-21T08:31:48.1669516Z // begin inline asm 2026-02-21T08:31:48.1669714Z cp.async.cg.shared.global [ %r1649 + 0 ], [ %rd78 + 0 ], 0x10, %r1644; 2026-02-21T08:31:48.1669932Z // end inline asm 2026-02-21T08:31:48.1670074Z cp.async.commit_group; 2026-02-21T08:31:48.1670326Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1670611Z add.s64 %rd79, %rd43, 1152; 2026-02-21T08:31:48.1670778Z add.s64 %rd80, %rd44, 1152; 2026-02-21T08:31:48.1671032Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1671316Z add.s32 %r1651, %r1670, 507904; 2026-02-21T08:31:48.1671472Z // begin inline asm 2026-02-21T08:31:48.1671708Z cp.async.cg.shared.global [ %r1651 + 0 ], [ %rd79 + 0 ], 0x10, %r1644; 2026-02-21T08:31:48.1671928Z // end inline asm 2026-02-21T08:31:48.1672074Z add.s32 %r1653, %r1670, 512000; 2026-02-21T08:31:48.1672228Z // 
begin inline asm 2026-02-21T08:31:48.1672432Z cp.async.cg.shared.global [ %r1653 + 0 ], [ %rd80 + 0 ], 0x10, %r1644; 2026-02-21T08:31:48.1672663Z // end inline asm 2026-02-21T08:31:48.1672805Z cp.async.commit_group; 2026-02-21T08:31:48.1673066Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1673348Z add.s64 %rd81, %rd43, 1216; 2026-02-21T08:31:48.1673513Z add.s64 %rd82, %rd44, 1216; 2026-02-21T08:31:48.1673777Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1674114Z add.s32 %r1655, %r1670, 548864; 2026-02-21T08:31:48.1674271Z // begin inline asm 2026-02-21T08:31:48.1674476Z cp.async.cg.shared.global [ %r1655 + 0 ], [ %rd81 + 0 ], 0x10, %r1644; 2026-02-21T08:31:48.1674709Z // end inline asm 2026-02-21T08:31:48.1674850Z add.s32 %r1657, %r1670, 552960; 2026-02-21T08:31:48.1675018Z // begin inline asm 2026-02-21T08:31:48.1675209Z cp.async.cg.shared.global [ %r1657 + 0 ], [ %rd82 + 0 ], 0x10, %r1644; 2026-02-21T08:31:48.1675435Z // end inline asm 2026-02-21T08:31:48.1675573Z cp.async.commit_group; 2026-02-21T08:31:48.1675726Z $L__tmp0: 2026-02-21T08:31:48.1676012Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1676349Z // begin inline asm 2026-02-21T08:31:48.1676529Z fence.proxy.async.shared::cta; 2026-02-21T08:31:48.1676750Z // end inline asm 2026-02-21T08:31:48.1676898Z $L__tmp1: 2026-02-21T08:31:48.1677138Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1677436Z @%p45 bra $L__BB0_19; 2026-02-21T08:31:48.1677609Z // %bb.7: // %.lr.ph495 2026-02-21T08:31:48.1677845Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1678169Z .loc 1 0 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:0:145 2026-02-21T08:31:48.1678462Z add.s32 %r1661, %r1, -256; 2026-02-21T08:31:48.1678635Z shr.u32 %r6, %r1661, 5; 2026-02-21T08:31:48.1678790Z 
shr.u32 %r7, %r1, 2; 2026-02-21T08:31:48.1678948Z or.b32 %r14, %r10, 32; 2026-02-21T08:31:48.1679105Z or.b32 %r15, %r10, 64; 2026-02-21T08:31:48.1679262Z or.b32 %r16, %r10, 96; 2026-02-21T08:31:48.1679411Z add.s32 %r17, %r4, -5; 2026-02-21T08:31:48.1679570Z shl.b32 %r1678, %r1, 6; 2026-02-21T08:31:48.1679728Z and.b32 %r1679, %r1678, 8128; 2026-02-21T08:31:48.1679903Z and.b32 %r1680, %r7, 32; 2026-02-21T08:31:48.1680067Z or.b32 %r18, %r1679, %r1680; 2026-02-21T08:31:48.1680232Z add.s32 %r1682, %r273, 557056; 2026-02-21T08:31:48.1680406Z bfe.u32 %r1683, %r1682, 4, 14; 2026-02-21T08:31:48.1680572Z cvt.u64.u32 %rd83, %r1683; 2026-02-21T08:31:48.1680748Z or.b64 %rd99, %rd83, 4611686293439512576; 2026-02-21T08:31:48.1680935Z add.s32 %r1684, %r273, 557088; 2026-02-21T08:31:48.1681104Z bfe.u32 %r1685, %r1684, 4, 14; 2026-02-21T08:31:48.1681266Z cvt.u64.u32 %rd84, %r1685; 2026-02-21T08:31:48.1681445Z or.b64 %rd100, %rd84, 4611686293439512576; 2026-02-21T08:31:48.1681676Z add.s32 %r1686, %r273, 557120; 2026-02-21T08:31:48.1681831Z bfe.u32 %r1687, %r1686, 4, 14; 2026-02-21T08:31:48.1681993Z cvt.u64.u32 %rd85, %r1687; 2026-02-21T08:31:48.1682157Z or.b64 %rd101, %rd85, 4611686293439512576; 2026-02-21T08:31:48.1682342Z add.s32 %r1688, %r273, 557152; 2026-02-21T08:31:48.1682498Z bfe.u32 %r1689, %r1688, 4, 14; 2026-02-21T08:31:48.1682665Z cvt.u64.u32 %rd86, %r1689; 2026-02-21T08:31:48.1682826Z or.b64 %rd102, %rd86, 4611686293439512576; 2026-02-21T08:31:48.1683009Z add.s32 %r1690, %r273, 589824; 2026-02-21T08:31:48.1683163Z bfe.u32 %r1691, %r1690, 4, 14; 2026-02-21T08:31:48.1683330Z cvt.u64.u32 %rd87, %r1691; 2026-02-21T08:31:48.1683499Z or.b64 %rd105, %rd87, 4611686293439512576; 2026-02-21T08:31:48.1683675Z add.s32 %r1692, %r273, 589856; 2026-02-21T08:31:48.1683841Z bfe.u32 %r1693, %r1692, 4, 14; 2026-02-21T08:31:48.1683999Z cvt.u64.u32 %rd88, %r1693; 2026-02-21T08:31:48.1684171Z or.b64 %rd106, %rd88, 4611686293439512576; 2026-02-21T08:31:48.1684349Z add.s32 %r1694, %r273, 589888; 
2026-02-21T08:31:48.1684512Z bfe.u32 %r1695, %r1694, 4, 14; 2026-02-21T08:31:48.1684666Z cvt.u64.u32 %rd89, %r1695; 2026-02-21T08:31:48.1684832Z or.b64 %rd107, %rd89, 4611686293439512576; 2026-02-21T08:31:48.1685012Z add.s32 %r1696, %r273, 589920; 2026-02-21T08:31:48.1685167Z bfe.u32 %r1697, %r1696, 4, 14; 2026-02-21T08:31:48.1685392Z cvt.u64.u32 %rd90, %r1697; 2026-02-21T08:31:48.1685550Z or.b64 %rd108, %rd90, 4611686293439512576; 2026-02-21T08:31:48.1685728Z add.s32 %r1698, %r273, 622592; 2026-02-21T08:31:48.1685881Z bfe.u32 %r1699, %r1698, 4, 14; 2026-02-21T08:31:48.1686042Z cvt.u64.u32 %rd91, %r1699; 2026-02-21T08:31:48.1686202Z or.b64 %rd111, %rd91, 4611686293439512576; 2026-02-21T08:31:48.1686383Z add.s32 %r1700, %r273, 622624; 2026-02-21T08:31:48.1686538Z bfe.u32 %r1701, %r1700, 4, 14; 2026-02-21T08:31:48.1686701Z cvt.u64.u32 %rd92, %r1701; 2026-02-21T08:31:48.1686870Z or.b64 %rd112, %rd92, 4611686293439512576; 2026-02-21T08:31:48.1687045Z add.s32 %r1702, %r273, 622656; 2026-02-21T08:31:48.1687207Z bfe.u32 %r1703, %r1702, 4, 14; 2026-02-21T08:31:48.1687362Z cvt.u64.u32 %rd93, %r1703; 2026-02-21T08:31:48.1687529Z or.b64 %rd113, %rd93, 4611686293439512576; 2026-02-21T08:31:48.1687702Z add.s32 %r1704, %r273, 622688; 2026-02-21T08:31:48.1687911Z bfe.u32 %r1705, %r1704, 4, 14; 2026-02-21T08:31:48.1688071Z cvt.u64.u32 %rd94, %r1705; 2026-02-21T08:31:48.1688239Z or.b64 %rd114, %rd94, 4611686293439512576; 2026-02-21T08:31:48.1688416Z add.s32 %r1706, %r273, 655360; 2026-02-21T08:31:48.1688570Z bfe.u32 %r1707, %r1706, 4, 14; 2026-02-21T08:31:48.1688732Z cvt.u64.u32 %rd95, %r1707; 2026-02-21T08:31:48.1688891Z or.b64 %rd117, %rd95, 4611686293439512576; 2026-02-21T08:31:48.1689066Z add.s32 %r1708, %r273, 655392; 2026-02-21T08:31:48.1689217Z bfe.u32 %r1709, %r1708, 4, 14; 2026-02-21T08:31:48.1689378Z cvt.u64.u32 %rd96, %r1709; 2026-02-21T08:31:48.1689538Z or.b64 %rd118, %rd96, 4611686293439512576; 2026-02-21T08:31:48.1689715Z add.s32 %r1710, %r273, 655424; 
2026-02-21T08:31:48.1689868Z bfe.u32 %r1711, %r1710, 4, 14; 2026-02-21T08:31:48.1690027Z cvt.u64.u32 %rd97, %r1711; 2026-02-21T08:31:48.1690195Z or.b64 %rd119, %rd97, 4611686293439512576; 2026-02-21T08:31:48.1690367Z add.s32 %r1712, %r273, 655456; 2026-02-21T08:31:48.1690529Z bfe.u32 %r1713, %r1712, 4, 14; 2026-02-21T08:31:48.1690685Z cvt.u64.u32 %rd98, %r1713; 2026-02-21T08:31:48.1690852Z or.b64 %rd120, %rd98, 4611686293439512576; 2026-02-21T08:31:48.1691026Z mov.pred %p107, 0; 2026-02-21T08:31:48.1691173Z mov.b32 %r1961, 4; 2026-02-21T08:31:48.1691314Z mov.b32 %r1960, -1; 2026-02-21T08:31:48.1691464Z mov.b32 %r1959, 256; 2026-02-21T08:31:48.1691640Z mov.b32 %r1958, 0; 2026-02-21T08:31:48.1691777Z mov.b32 %r1957, 1; 2026-02-21T08:31:48.1691919Z mov.b32 %r1956, 2; 2026-02-21T08:31:48.1692051Z mov.b32 %r1955, 3; 2026-02-21T08:31:48.1692202Z mov.b32 %r1962, %r1958; 2026-02-21T08:31:48.1692350Z mov.b32 %r1968, %r5; 2026-02-21T08:31:48.1692510Z mov.b32 %r1964, %r1961; 2026-02-21T08:31:48.1692657Z mov.b32 %r1965, %r1958; 2026-02-21T08:31:48.1692809Z bra.uni $L__BB0_8; 2026-02-21T08:31:48.1692991Z $L__BB0_18: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1693212Z xor.b32 %r1962, %r1962, 1; 2026-02-21T08:31:48.1693493Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1693790Z setp.eq.b32 %p103, %r34, 0; 2026-02-21T08:31:48.1693966Z setp.lt.s32 %p104, %r1965, %r17; 2026-02-21T08:31:48.1694135Z add.s32 %r1934, %r1959, 64; 2026-02-21T08:31:48.1694292Z add.s32 %r1935, %r1961, 1; 2026-02-21T08:31:48.1694447Z setp.gt.s32 %p105, %r1935, 4; 2026-02-21T08:31:48.1694622Z selp.b32 %r1961, 0, %r1935, %p105; 2026-02-21T08:31:48.1694799Z selp.b32 %r1959, 0, %r1934, %p103; 2026-02-21T08:31:48.1695086Z .loc 1 45 22 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:45:22 2026-02-21T08:31:48.1695378Z shl.b32 %r1936, %r1959, 1; 2026-02-21T08:31:48.1695640Z .loc 1 47 25 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:47:25 2026-02-21T08:31:48.1695932Z add.s32 %r1937, %r1936, %r10; 2026-02-21T08:31:48.1696196Z .loc 1 48 53 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:53 2026-02-21T08:31:48.1696541Z shl.b32 %r1938, %r1969, 10; 2026-02-21T08:31:48.1696694Z shl.b32 %r1939, %r1970, 10; 2026-02-21T08:31:48.1696953Z .loc 1 48 60 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:60 2026-02-21T08:31:48.1697254Z add.s32 %r1940, %r1938, %r1937; 2026-02-21T08:31:48.1697420Z add.s32 %r1941, %r1939, %r1937; 2026-02-21T08:31:48.1697689Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1697976Z mad.wide.s32 %rd123, %r1940, 2, %rd18; 2026-02-21T08:31:48.1698164Z mad.wide.s32 %rd124, %r1941, 2, %rd18; 2026-02-21T08:31:48.1698439Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1698722Z shl.b32 %r1942, %r1961, 13; 2026-02-21T08:31:48.1698882Z bar.sync 2, 256; 2026-02-21T08:31:48.1699023Z add.s32 %r1944, %r273, %r1942; 2026-02-21T08:31:48.1699241Z add.s32 %r1945, %r1944, %r13; 2026-02-21T08:31:48.1699402Z add.s32 %r1918, %r1945, 393216; 2026-02-21T08:31:48.1699572Z selp.b32 %r1919, 16, 0, %p104; 2026-02-21T08:31:48.1699729Z // begin inline asm 2026-02-21T08:31:48.1699939Z cp.async.cg.shared.global [ %r1918 + 0 ], [ %rd123 + 0 ], 0x10, %r1919; 2026-02-21T08:31:48.1700168Z // end inline asm 2026-02-21T08:31:48.1700311Z add.s32 %r1920, %r1945, 397312; 2026-02-21T08:31:48.1700468Z // begin inline asm 2026-02-21T08:31:48.1700675Z cp.async.cg.shared.global [ %r1920 + 0 ], [ %rd124 + 0 ], 0x10, %r1919; 2026-02-21T08:31:48.1700907Z // end inline asm 2026-02-21T08:31:48.1701046Z cp.async.commit_group; 2026-02-21T08:31:48.1701305Z .loc 1 47 25 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:47:25 2026-02-21T08:31:48.1701613Z add.s32 %r1946, %r14, %r1936; 2026-02-21T08:31:48.1701886Z .loc 1 48 60 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:60 2026-02-21T08:31:48.1702169Z add.s32 %r1947, %r1938, %r1946; 2026-02-21T08:31:48.1702342Z add.s32 %r1948, %r1939, %r1946; 2026-02-21T08:31:48.1702615Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1702910Z mad.wide.s32 %rd125, %r1947, 2, %rd18; 2026-02-21T08:31:48.1703107Z mad.wide.s32 %rd126, %r1948, 2, %rd18; 2026-02-21T08:31:48.1703387Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1703674Z add.s32 %r1922, %r1945, 434176; 2026-02-21T08:31:48.1703833Z // begin inline asm 2026-02-21T08:31:48.1704044Z cp.async.cg.shared.global [ %r1922 + 0 ], [ %rd125 + 0 ], 0x10, %r1919; 2026-02-21T08:31:48.1704276Z // end inline asm 2026-02-21T08:31:48.1704413Z add.s32 %r1924, %r1945, 438272; 2026-02-21T08:31:48.1704585Z // begin inline asm 2026-02-21T08:31:48.1704796Z cp.async.cg.shared.global [ %r1924 + 0 ], [ %rd126 + 0 ], 0x10, %r1919; 2026-02-21T08:31:48.1705041Z // end inline asm 2026-02-21T08:31:48.1705194Z cp.async.commit_group; 2026-02-21T08:31:48.1705472Z .loc 1 47 25 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:47:25 2026-02-21T08:31:48.1705767Z add.s32 %r1949, %r15, %r1936; 2026-02-21T08:31:48.1706047Z .loc 1 48 60 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:60 2026-02-21T08:31:48.1706345Z add.s32 %r1950, %r1938, %r1949; 2026-02-21T08:31:48.1706514Z add.s32 %r1951, %r1939, %r1949; 2026-02-21T08:31:48.1706799Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1707095Z mad.wide.s32 %rd127, %r1950, 2, %rd18; 2026-02-21T08:31:48.1707292Z mad.wide.s32 %rd128, %r1951, 2, %rd18; 2026-02-21T08:31:48.1707584Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1707883Z add.s32 %r1926, %r1945, 475136; 2026-02-21T08:31:48.1708058Z // begin inline asm 2026-02-21T08:31:48.1708273Z 
cp.async.cg.shared.global [ %r1926 + 0 ], [ %rd127 + 0 ], 0x10, %r1919; 2026-02-21T08:31:48.1708577Z // end inline asm 2026-02-21T08:31:48.1708721Z add.s32 %r1928, %r1945, 479232; 2026-02-21T08:31:48.1708892Z // begin inline asm 2026-02-21T08:31:48.1709100Z cp.async.cg.shared.global [ %r1928 + 0 ], [ %rd128 + 0 ], 0x10, %r1919; 2026-02-21T08:31:48.1709344Z // end inline asm 2026-02-21T08:31:48.1709488Z cp.async.commit_group; 2026-02-21T08:31:48.1709764Z .loc 1 47 25 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:47:25 2026-02-21T08:31:48.1710065Z add.s32 %r1952, %r16, %r1936; 2026-02-21T08:31:48.1710341Z .loc 1 48 60 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:60 2026-02-21T08:31:48.1710639Z add.s32 %r1953, %r1938, %r1952; 2026-02-21T08:31:48.1710805Z add.s32 %r1954, %r1939, %r1952; 2026-02-21T08:31:48.1711168Z .loc 1 48 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:32 2026-02-21T08:31:48.1711469Z mad.wide.s32 %rd129, %r1953, 2, %rd18; 2026-02-21T08:31:48.1711691Z mad.wide.s32 %rd130, %r1954, 2, %rd18; 2026-02-21T08:31:48.1711990Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1712290Z add.s32 %r1930, %r1945, 516096; 2026-02-21T08:31:48.1712464Z // begin inline asm 2026-02-21T08:31:48.1712675Z cp.async.cg.shared.global [ %r1930 + 0 ], [ %rd129 + 0 ], 0x10, %r1919; 2026-02-21T08:31:48.1712924Z // end inline asm 2026-02-21T08:31:48.1713071Z add.s32 %r1932, %r1945, 520192; 2026-02-21T08:31:48.1713245Z // begin inline asm 2026-02-21T08:31:48.1713454Z cp.async.cg.shared.global [ %r1932 + 0 ], [ %rd130 + 0 ], 0x10, %r1919; 2026-02-21T08:31:48.1713705Z // end inline asm 2026-02-21T08:31:48.1713854Z cp.async.commit_group; 2026-02-21T08:31:48.1714112Z .loc 1 0 0 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:0 2026-02-21T08:31:48.1714406Z setp.ne.b32 %p107, %r1958, 7; 2026-02-21T08:31:48.1714688Z .loc 1 26 145 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1714991Z add.s32 %r1965, %r1965, 1; 2026-02-21T08:31:48.1715156Z setp.ne.b32 %p106, %r4, %r1965; 2026-02-21T08:31:48.1715332Z mov.b32 %r1955, %r1964; 2026-02-21T08:31:48.1715482Z mov.b32 %r1958, %r23; 2026-02-21T08:31:48.1715634Z mov.b32 %r1964, %r34; 2026-02-21T08:31:48.1715784Z @%p106 bra $L__BB0_8; 2026-02-21T08:31:48.1715929Z bra.uni $L__BB0_19; 2026-02-21T08:31:48.1716120Z $L__BB0_8: // Parent Loop BB0_2 Depth=1 2026-02-21T08:31:48.1716361Z // => This Inner Loop Header: Depth=2 2026-02-21T08:31:48.1716686Z .loc 1 0 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:0:145 2026-02-21T08:31:48.1716969Z mov.b32 %r23, %r1957; 2026-02-21T08:31:48.1717122Z mov.b32 %r1957, %r1956; 2026-02-21T08:31:48.1717280Z mov.b32 %r1956, %r1955; 2026-02-21T08:31:48.1717540Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1717835Z add.s32 %r1714, %r1964, 1; 2026-02-21T08:31:48.1717996Z setp.eq.b32 %p52, %r1964, 7; 2026-02-21T08:31:48.1718167Z selp.b32 %r34, 0, %r1714, %p52; 2026-02-21T08:31:48.1718334Z setp.ne.b32 %p53, %r34, 0; 2026-02-21T08:31:48.1718499Z @%p53 bra $L__BB0_10; 2026-02-21T08:31:48.1718684Z // %bb.9: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1718906Z add.s32 %r1968, %r1968, 18944; 2026-02-21T08:31:48.1719181Z .loc 1 29 30 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:29:30 2026-02-21T08:31:48.1719465Z shr.s32 %r1715, %r1968, 31; 2026-02-21T08:31:48.1719630Z shr.u32 %r1716, %r1715, 27; 2026-02-21T08:31:48.1719787Z add.s32 %r1717, %r1968, %r1716; 2026-02-21T08:31:48.1719956Z and.b32 %r1718, %r1717, 33554400; 2026-02-21T08:31:48.1720127Z sub.s32 %r1719, %r1968, %r1718; 2026-02-21T08:31:48.1720450Z .loc 1 31 27 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:31:27 2026-02-21T08:31:48.1720726Z shl.b32 %r1720, %r1719, 7; 2026-02-21T08:31:48.1720987Z .loc 1 32 32 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:32:32 2026-02-21T08:31:48.1721269Z or.b32 %r1969, %r1720, %r8; 2026-02-21T08:31:48.1721421Z or.b32 %r1970, %r1720, %r9; 2026-02-21T08:31:48.1721660Z $L__BB0_10: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1721981Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1722271Z add.s32 %r1742, %r1960, 1; 2026-02-21T08:31:48.1722428Z setp.gt.s32 %p55, %r1742, 4; 2026-02-21T08:31:48.1722603Z selp.b32 %r1960, 0, %r1742, %p55; 2026-02-21T08:31:48.1722929Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1723215Z cp.async.wait_group 16; 2026-02-21T08:31:48.1723375Z bar.sync 2, 256; 2026-02-21T08:31:48.1723514Z shl.b32 %r43, %r1960, 12; 2026-02-21T08:31:48.1723676Z shl.b32 %r1743, %r1960, 13; 2026-02-21T08:31:48.1723833Z add.s32 %r1745, %r273, %r1743; 2026-02-21T08:31:48.1724004Z add.s32 %r1746, %r1745, %r18; 2026-02-21T08:31:48.1724218Z ld.shared.v4.b32 {%r1747, %r1748, %r1749, %r1750}, [%r1746+393216]; 2026-02-21T08:31:48.1724448Z mov.b32 {%rs1, %rs2}, %r1750; 2026-02-21T08:31:48.1724613Z mov.b32 {%rs3, %rs4}, %r1749; 2026-02-21T08:31:48.1724770Z mov.b32 {%rs5, %rs6}, %r1748; 2026-02-21T08:31:48.1724934Z mov.b32 {%rs7, %rs8}, %r1747; 2026-02-21T08:31:48.1725138Z ld.shared.v4.b32 {%r1751, %r1752, %r1753, %r1754}, [%r1746+393232]; 2026-02-21T08:31:48.1725370Z mov.b32 {%rs9, %rs10}, %r1754; 2026-02-21T08:31:48.1725535Z mov.b32 {%rs11, %rs12}, %r1753; 2026-02-21T08:31:48.1725705Z mov.b32 {%rs13, %rs14}, %r1752; 2026-02-21T08:31:48.1725868Z mov.b32 {%rs15, %rs16}, %r1751; 2026-02-21T08:31:48.1726142Z .loc 1 52 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:52:32 2026-02-21T08:31:48.1726430Z cvt.f32.bf16 %r1726, %rs7; 2026-02-21T08:31:48.1726587Z cvt.f32.bf16 %r1727, %rs8; 2026-02-21T08:31:48.1726747Z cvt.f32.bf16 %r1728, %rs5; 2026-02-21T08:31:48.1726895Z cvt.f32.bf16 %r1729, %rs6; 
2026-02-21T08:31:48.1727050Z cvt.f32.bf16 %r1730, %rs3; 2026-02-21T08:31:48.1727199Z cvt.f32.bf16 %r1731, %rs4; 2026-02-21T08:31:48.1727354Z cvt.f32.bf16 %r1732, %rs1; 2026-02-21T08:31:48.1727504Z cvt.f32.bf16 %r1733, %rs2; 2026-02-21T08:31:48.1727666Z cvt.f32.bf16 %r1734, %rs15; 2026-02-21T08:31:48.1727828Z cvt.f32.bf16 %r1735, %rs16; 2026-02-21T08:31:48.1727982Z cvt.f32.bf16 %r1736, %rs13; 2026-02-21T08:31:48.1728140Z cvt.f32.bf16 %r1737, %rs14; 2026-02-21T08:31:48.1728291Z cvt.f32.bf16 %r1738, %rs11; 2026-02-21T08:31:48.1728451Z cvt.f32.bf16 %r1739, %rs12; 2026-02-21T08:31:48.1728601Z cvt.f32.bf16 %r1740, %rs9; 2026-02-21T08:31:48.1728764Z cvt.f32.bf16 %r1741, %rs10; 2026-02-21T08:31:48.1728912Z $L__tmp2: 2026-02-21T08:31:48.1729206Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1729547Z add.s32 %r1721, %r273, 803280; 2026-02-21T08:31:48.1729705Z // begin inline asm 2026-02-21T08:31:48.1729849Z 2026-02-21T08:31:48.1729958Z { 2026-02-21T08:31:48.1730085Z .reg .pred complete; 2026-02-21T08:31:48.1730229Z waitLoop: 2026-02-21T08:31:48.1730427Z mbarrier.try_wait.parity.shared.b64 complete, [%r1721], %r1962; 2026-02-21T08:31:48.1730661Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1730817Z } 2026-02-21T08:31:48.1730882Z 2026-02-21T08:31:48.1730938Z // end inline asm 2026-02-21T08:31:48.1731079Z $L__tmp3: 2026-02-21T08:31:48.1731313Z .loc 1 77 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:77:32 2026-02-21T08:31:48.1731632Z add.s32 %r1723, %r273, 803312; 2026-02-21T08:31:48.1731799Z // begin inline asm 2026-02-21T08:31:48.1731998Z 2026-02-21T08:31:48.1732114Z { 2026-02-21T08:31:48.1732235Z .reg .pred complete; 2026-02-21T08:31:48.1732382Z waitLoop: 2026-02-21T08:31:48.1732569Z mbarrier.try_wait.parity.shared.b64 complete, [%r1723], %r1962; 2026-02-21T08:31:48.1732809Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1732957Z } 2026-02-21T08:31:48.1733029Z 
2026-02-21T08:31:48.1733085Z // end inline asm 2026-02-21T08:31:48.1733227Z $L__tmp4: 2026-02-21T08:31:48.1733514Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1733861Z shfl.sync.idx.b32 %r44, %r6, 0, 31, -1; 2026-02-21T08:31:48.1734045Z mov.pred %p59, -1; 2026-02-21T08:31:48.1734204Z // begin inline asm 2026-02-21T08:31:48.1734652Z @%p59 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1725 + 0], {%r1726, %r1727, %r1728, %r1729, %r1730, %r1731, %r1732, %r1733, %r1734, %r1735, %r1736, %r1737, %r1738, %r1739, %r1740, %r1741}; 2026-02-21T08:31:48.1735088Z // end inline asm 2026-02-21T08:31:48.1735231Z // begin inline asm 2026-02-21T08:31:48.1735387Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:31:48.1735555Z // end inline asm 2026-02-21T08:31:48.1735687Z bar.sync 2, 256; 2026-02-21T08:31:48.1735836Z setp.ne.b32 %p56, %r44, 0; 2026-02-21T08:31:48.1735994Z @%p56 bra $L__BB0_12; 2026-02-21T08:31:48.1736188Z // %bb.11: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1736404Z elect.sync %r1767|%p58, -1; 2026-02-21T08:31:48.1736571Z mov.b32 %r1757, 138414352; 2026-02-21T08:31:48.1736727Z // begin inline asm 2026-02-21T08:31:48.1736969Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r1755 + 0 ], [ %r1756 + 0 ], %rd99, %r1757, %p107; 2026-02-21T08:31:48.1737239Z // end inline asm 2026-02-21T08:31:48.1737377Z // begin inline asm 2026-02-21T08:31:48.1737615Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r1758 + 0 ], [ %r1759 + 8 ], %rd100, %r1757, %p59; 2026-02-21T08:31:48.1737876Z // end inline asm 2026-02-21T08:31:48.1738019Z // begin inline asm 2026-02-21T08:31:48.1738257Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r1761 + 0 ], [ %r1762 + 16 ], %rd101, %r1757, %p59; 2026-02-21T08:31:48.1738518Z // end inline asm 2026-02-21T08:31:48.1738661Z // begin inline asm 2026-02-21T08:31:48.1738890Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r1764 + 0 ], [ %r1765 + 24 ], %rd102, %r1757, %p59; 
2026-02-21T08:31:48.1739157Z // end inline asm 2026-02-21T08:31:48.1739295Z add.s32 %r1769, %r273, 803248; 2026-02-21T08:31:48.1739465Z cvt.u64.u32 %rd103, %r1769; 2026-02-21T08:31:48.1739619Z // begin inline asm 2026-02-21T08:31:48.1739832Z @%p58 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd103]; 2026-02-21T08:31:48.1740064Z // end inline asm 2026-02-21T08:31:48.1740202Z add.s32 %r1770, %r273, 803296; 2026-02-21T08:31:48.1740365Z cvt.u64.u32 %rd104, %r1770; 2026-02-21T08:31:48.1740518Z // begin inline asm 2026-02-21T08:31:48.1740723Z @%p58 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd104]; 2026-02-21T08:31:48.1740948Z // end inline asm 2026-02-21T08:31:48.1741084Z $L__tmp5: 2026-02-21T08:31:48.1741252Z $L__BB0_12: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1741608Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1741905Z shl.b32 %r1792, %r43, 1; 2026-02-21T08:31:48.1742059Z add.s32 %r1794, %r273, %r1792; 2026-02-21T08:31:48.1742225Z add.s32 %r1795, %r1794, %r18; 2026-02-21T08:31:48.1742434Z ld.shared.v4.b32 {%r1796, %r1797, %r1798, %r1799}, [%r1795+434176]; 2026-02-21T08:31:48.1742663Z mov.b32 {%rs17, %rs18}, %r1799; 2026-02-21T08:31:48.1742827Z mov.b32 {%rs19, %rs20}, %r1798; 2026-02-21T08:31:48.1742996Z mov.b32 {%rs21, %rs22}, %r1797; 2026-02-21T08:31:48.1743153Z mov.b32 {%rs23, %rs24}, %r1796; 2026-02-21T08:31:48.1743365Z ld.shared.v4.b32 {%r1800, %r1801, %r1802, %r1803}, [%r1795+434192]; 2026-02-21T08:31:48.1743592Z mov.b32 {%rs25, %rs26}, %r1803; 2026-02-21T08:31:48.1743810Z mov.b32 {%rs27, %rs28}, %r1802; 2026-02-21T08:31:48.1743983Z mov.b32 {%rs29, %rs30}, %r1801; 2026-02-21T08:31:48.1744150Z mov.b32 {%rs31, %rs32}, %r1800; 2026-02-21T08:31:48.1744427Z .loc 1 52 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:52:32 2026-02-21T08:31:48.1744716Z cvt.f32.bf16 %r1776, %rs23; 2026-02-21T08:31:48.1744893Z cvt.f32.bf16 %r1777, %rs24; 2026-02-21T08:31:48.1745061Z 
cvt.f32.bf16 %r1778, %rs21; 2026-02-21T08:31:48.1745221Z cvt.f32.bf16 %r1779, %rs22; 2026-02-21T08:31:48.1745387Z cvt.f32.bf16 %r1780, %rs19; 2026-02-21T08:31:48.1745546Z cvt.f32.bf16 %r1781, %rs20; 2026-02-21T08:31:48.1745710Z cvt.f32.bf16 %r1782, %rs17; 2026-02-21T08:31:48.1745865Z cvt.f32.bf16 %r1783, %rs18; 2026-02-21T08:31:48.1746027Z cvt.f32.bf16 %r1784, %rs31; 2026-02-21T08:31:48.1746181Z cvt.f32.bf16 %r1785, %rs32; 2026-02-21T08:31:48.1746344Z cvt.f32.bf16 %r1786, %rs29; 2026-02-21T08:31:48.1746554Z cvt.f32.bf16 %r1787, %rs30; 2026-02-21T08:31:48.1746715Z cvt.f32.bf16 %r1788, %rs27; 2026-02-21T08:31:48.1746873Z cvt.f32.bf16 %r1789, %rs28; 2026-02-21T08:31:48.1747026Z cvt.f32.bf16 %r1790, %rs25; 2026-02-21T08:31:48.1747185Z cvt.f32.bf16 %r1791, %rs26; 2026-02-21T08:31:48.1747338Z $L__tmp6: 2026-02-21T08:31:48.1747643Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1747991Z add.s32 %r1771, %r273, 803328; 2026-02-21T08:31:48.1748165Z // begin inline asm 2026-02-21T08:31:48.1748305Z 2026-02-21T08:31:48.1748430Z { 2026-02-21T08:31:48.1748557Z .reg .pred complete; 2026-02-21T08:31:48.1748711Z waitLoop: 2026-02-21T08:31:48.1748916Z mbarrier.try_wait.parity.shared.b64 complete, [%r1771], %r1962; 2026-02-21T08:31:48.1749161Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1749325Z } 2026-02-21T08:31:48.1749392Z 2026-02-21T08:31:48.1749450Z // end inline asm 2026-02-21T08:31:48.1749598Z $L__tmp7: 2026-02-21T08:31:48.1749847Z .loc 1 77 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:77:32 2026-02-21T08:31:48.1750150Z add.s32 %r1773, %r273, 803376; 2026-02-21T08:31:48.1750316Z // begin inline asm 2026-02-21T08:31:48.1750462Z 2026-02-21T08:31:48.1750583Z { 2026-02-21T08:31:48.1750706Z .reg .pred complete; 2026-02-21T08:31:48.1750860Z waitLoop: 2026-02-21T08:31:48.1751052Z mbarrier.try_wait.parity.shared.b64 complete, [%r1773], %r1962; 2026-02-21T08:31:48.1751300Z @!complete 
bra.uni waitLoop; 2026-02-21T08:31:48.1751456Z } 2026-02-21T08:31:48.1751529Z 2026-02-21T08:31:48.1751625Z // end inline asm 2026-02-21T08:31:48.1751763Z $L__tmp8: 2026-02-21T08:31:48.1752064Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1752417Z // begin inline asm 2026-02-21T08:31:48.1752827Z @%p59 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1775 + 0], {%r1776, %r1777, %r1778, %r1779, %r1780, %r1781, %r1782, %r1783, %r1784, %r1785, %r1786, %r1787, %r1788, %r1789, %r1790, %r1791}; 2026-02-21T08:31:48.1753270Z // end inline asm 2026-02-21T08:31:48.1753412Z // begin inline asm 2026-02-21T08:31:48.1753581Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:31:48.1753751Z // end inline asm 2026-02-21T08:31:48.1753902Z bar.sync 2, 256; 2026-02-21T08:31:48.1754048Z @%p56 bra $L__BB0_14; 2026-02-21T08:31:48.1754260Z // %bb.13: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1754497Z elect.sync %r1816|%p70, -1; 2026-02-21T08:31:48.1754667Z mov.b32 %r1806, 138414352; 2026-02-21T08:31:48.1754838Z mov.pred %p69, -1; 2026-02-21T08:31:48.1754985Z // begin inline asm 2026-02-21T08:31:48.1755240Z @%p70 tcgen05.mma.cta_group::1.kind::tf32 [ %r1804 + 0 ], [ %r1805 + 0 ], %rd105, %r1806, %p69; 2026-02-21T08:31:48.1755523Z // end inline asm 2026-02-21T08:31:48.1755664Z // begin inline asm 2026-02-21T08:31:48.1755903Z @%p70 tcgen05.mma.cta_group::1.kind::tf32 [ %r1807 + 0 ], [ %r1808 + 8 ], %rd106, %r1806, %p69; 2026-02-21T08:31:48.1756239Z // end inline asm 2026-02-21T08:31:48.1756382Z // begin inline asm 2026-02-21T08:31:48.1756612Z @%p70 tcgen05.mma.cta_group::1.kind::tf32 [ %r1810 + 0 ], [ %r1811 + 16 ], %rd107, %r1806, %p69; 2026-02-21T08:31:48.1756880Z // end inline asm 2026-02-21T08:31:48.1757012Z // begin inline asm 2026-02-21T08:31:48.1757249Z @%p70 tcgen05.mma.cta_group::1.kind::tf32 [ %r1813 + 0 ], [ %r1814 + 24 ], %rd108, %r1806, %p69; 2026-02-21T08:31:48.1757518Z // end inline asm 
2026-02-21T08:31:48.1757658Z add.s32 %r1818, %r273, 803344; 2026-02-21T08:31:48.1757828Z cvt.u64.u32 %rd109, %r1818; 2026-02-21T08:31:48.1757986Z // begin inline asm 2026-02-21T08:31:48.1758200Z @%p70 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd109]; 2026-02-21T08:31:48.1758432Z // end inline asm 2026-02-21T08:31:48.1758577Z add.s32 %r1819, %r273, 803360; 2026-02-21T08:31:48.1758736Z cvt.u64.u32 %rd110, %r1819; 2026-02-21T08:31:48.1758958Z // begin inline asm 2026-02-21T08:31:48.1759165Z @%p70 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd110]; 2026-02-21T08:31:48.1759395Z // end inline asm 2026-02-21T08:31:48.1759529Z $L__tmp9: 2026-02-21T08:31:48.1759696Z $L__BB0_14: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1760017Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1760348Z ld.shared.v4.b32 {%r1845, %r1846, %r1847, %r1848}, [%r1795+475136]; 2026-02-21T08:31:48.1760574Z mov.b32 {%rs33, %rs34}, %r1848; 2026-02-21T08:31:48.1760739Z mov.b32 {%rs35, %rs36}, %r1847; 2026-02-21T08:31:48.1760910Z mov.b32 {%rs37, %rs38}, %r1846; 2026-02-21T08:31:48.1761077Z mov.b32 {%rs39, %rs40}, %r1845; 2026-02-21T08:31:48.1761280Z ld.shared.v4.b32 {%r1849, %r1850, %r1851, %r1852}, [%r1795+475152]; 2026-02-21T08:31:48.1761507Z mov.b32 {%rs41, %rs42}, %r1852; 2026-02-21T08:31:48.1761699Z mov.b32 {%rs43, %rs44}, %r1851; 2026-02-21T08:31:48.1761865Z mov.b32 {%rs45, %rs46}, %r1850; 2026-02-21T08:31:48.1762023Z mov.b32 {%rs47, %rs48}, %r1849; 2026-02-21T08:31:48.1762291Z .loc 1 52 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:52:32 2026-02-21T08:31:48.1762577Z cvt.f32.bf16 %r1825, %rs39; 2026-02-21T08:31:48.1762743Z cvt.f32.bf16 %r1826, %rs40; 2026-02-21T08:31:48.1762906Z cvt.f32.bf16 %r1827, %rs37; 2026-02-21T08:31:48.1763059Z cvt.f32.bf16 %r1828, %rs38; 2026-02-21T08:31:48.1763220Z cvt.f32.bf16 %r1829, %rs35; 2026-02-21T08:31:48.1763369Z cvt.f32.bf16 %r1830, %rs36; 
2026-02-21T08:31:48.1763529Z cvt.f32.bf16 %r1831, %rs33; 2026-02-21T08:31:48.1763681Z cvt.f32.bf16 %r1832, %rs34; 2026-02-21T08:31:48.1763842Z cvt.f32.bf16 %r1833, %rs47; 2026-02-21T08:31:48.1763995Z cvt.f32.bf16 %r1834, %rs48; 2026-02-21T08:31:48.1764157Z cvt.f32.bf16 %r1835, %rs45; 2026-02-21T08:31:48.1764319Z cvt.f32.bf16 %r1836, %rs46; 2026-02-21T08:31:48.1764472Z cvt.f32.bf16 %r1837, %rs43; 2026-02-21T08:31:48.1764633Z cvt.f32.bf16 %r1838, %rs44; 2026-02-21T08:31:48.1764783Z cvt.f32.bf16 %r1839, %rs41; 2026-02-21T08:31:48.1764942Z cvt.f32.bf16 %r1840, %rs42; 2026-02-21T08:31:48.1765087Z $L__tmp10: 2026-02-21T08:31:48.1765377Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1765708Z add.s32 %r1820, %r273, 803392; 2026-02-21T08:31:48.1765872Z // begin inline asm 2026-02-21T08:31:48.1766006Z 2026-02-21T08:31:48.1766124Z { 2026-02-21T08:31:48.1766252Z .reg .pred complete; 2026-02-21T08:31:48.1766394Z waitLoop: 2026-02-21T08:31:48.1766590Z mbarrier.try_wait.parity.shared.b64 complete, [%r1820], %r1962; 2026-02-21T08:31:48.1766823Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1766982Z } 2026-02-21T08:31:48.1767046Z 2026-02-21T08:31:48.1767101Z // end inline asm 2026-02-21T08:31:48.1767240Z $L__tmp11: 2026-02-21T08:31:48.1767470Z .loc 1 77 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:77:32 2026-02-21T08:31:48.1767819Z add.s32 %r1822, %r273, 803440; 2026-02-21T08:31:48.1767984Z // begin inline asm 2026-02-21T08:31:48.1768117Z 2026-02-21T08:31:48.1768234Z { 2026-02-21T08:31:48.1768356Z .reg .pred complete; 2026-02-21T08:31:48.1768502Z waitLoop: 2026-02-21T08:31:48.1768684Z mbarrier.try_wait.parity.shared.b64 complete, [%r1822], %r1962; 2026-02-21T08:31:48.1768921Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1769070Z } 2026-02-21T08:31:48.1769139Z 2026-02-21T08:31:48.1769194Z // end inline asm 2026-02-21T08:31:48.1769331Z mov.pred %p81, -1; 2026-02-21T08:31:48.1769474Z 
$L__tmp12: 2026-02-21T08:31:48.1769758Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1770078Z // begin inline asm 2026-02-21T08:31:48.1770524Z @%p81 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1824 + 0], {%r1825, %r1826, %r1827, %r1828, %r1829, %r1830, %r1831, %r1832, %r1833, %r1834, %r1835, %r1836, %r1837, %r1838, %r1839, %r1840}; 2026-02-21T08:31:48.1770941Z // end inline asm 2026-02-21T08:31:48.1771086Z // begin inline asm 2026-02-21T08:31:48.1771243Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:31:48.1771417Z // end inline asm 2026-02-21T08:31:48.1771578Z bar.sync 2, 256; 2026-02-21T08:31:48.1771719Z @%p56 bra $L__BB0_16; 2026-02-21T08:31:48.1771911Z // %bb.15: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1772122Z elect.sync %r1865|%p82, -1; 2026-02-21T08:31:48.1772289Z mov.b32 %r1855, 138414352; 2026-02-21T08:31:48.1772443Z // begin inline asm 2026-02-21T08:31:48.1772686Z @%p82 tcgen05.mma.cta_group::1.kind::tf32 [ %r1853 + 0 ], [ %r1854 + 0 ], %rd111, %r1855, %p81; 2026-02-21T08:31:48.1772953Z // end inline asm 2026-02-21T08:31:48.1773096Z // begin inline asm 2026-02-21T08:31:48.1773332Z @%p82 tcgen05.mma.cta_group::1.kind::tf32 [ %r1856 + 0 ], [ %r1857 + 8 ], %rd112, %r1855, %p81; 2026-02-21T08:31:48.1773592Z // end inline asm 2026-02-21T08:31:48.1773747Z // begin inline asm 2026-02-21T08:31:48.1786335Z @%p82 tcgen05.mma.cta_group::1.kind::tf32 [ %r1859 + 0 ], [ %r1860 + 16 ], %rd113, %r1855, %p81; 2026-02-21T08:31:48.1786647Z // end inline asm 2026-02-21T08:31:48.1786801Z // begin inline asm 2026-02-21T08:31:48.1787062Z @%p82 tcgen05.mma.cta_group::1.kind::tf32 [ %r1862 + 0 ], [ %r1863 + 24 ], %rd114, %r1855, %p81; 2026-02-21T08:31:48.1787331Z // end inline asm 2026-02-21T08:31:48.1787491Z add.s32 %r1867, %r273, 803408; 2026-02-21T08:31:48.1787663Z cvt.u64.u32 %rd115, %r1867; 2026-02-21T08:31:48.1787843Z // begin inline asm 2026-02-21T08:31:48.1788070Z @%p82 
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd115]; 2026-02-21T08:31:48.1788305Z // end inline asm 2026-02-21T08:31:48.1788461Z add.s32 %r1868, %r273, 803424; 2026-02-21T08:31:48.1788628Z cvt.u64.u32 %rd116, %r1868; 2026-02-21T08:31:48.1788801Z // begin inline asm 2026-02-21T08:31:48.1789016Z @%p82 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd116]; 2026-02-21T08:31:48.1789262Z // end inline asm 2026-02-21T08:31:48.1789402Z $L__tmp13: 2026-02-21T08:31:48.1789591Z $L__BB0_16: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1789928Z .loc 1 48 80 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:48:80 2026-02-21T08:31:48.1790271Z ld.shared.v4.b32 {%r1894, %r1895, %r1896, %r1897}, [%r1795+516096]; 2026-02-21T08:31:48.1790508Z mov.b32 {%rs49, %rs50}, %r1897; 2026-02-21T08:31:48.1790678Z mov.b32 {%rs51, %rs52}, %r1896; 2026-02-21T08:31:48.1790849Z mov.b32 {%rs53, %rs54}, %r1895; 2026-02-21T08:31:48.1791007Z mov.b32 {%rs55, %rs56}, %r1894; 2026-02-21T08:31:48.1791221Z ld.shared.v4.b32 {%r1898, %r1899, %r1900, %r1901}, [%r1795+516112]; 2026-02-21T08:31:48.1791437Z mov.b32 {%rs57, %rs58}, %r1901; 2026-02-21T08:31:48.1791644Z mov.b32 {%rs59, %rs60}, %r1900; 2026-02-21T08:31:48.1791820Z mov.b32 {%rs61, %rs62}, %r1899; 2026-02-21T08:31:48.1791994Z mov.b32 {%rs63, %rs64}, %r1898; 2026-02-21T08:31:48.1792418Z .loc 1 52 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:52:32 2026-02-21T08:31:48.1792726Z cvt.f32.bf16 %r1874, %rs55; 2026-02-21T08:31:48.1792909Z cvt.f32.bf16 %r1875, %rs56; 2026-02-21T08:31:48.1793077Z cvt.f32.bf16 %r1876, %rs53; 2026-02-21T08:31:48.1793253Z cvt.f32.bf16 %r1877, %rs54; 2026-02-21T08:31:48.1793417Z cvt.f32.bf16 %r1878, %rs51; 2026-02-21T08:31:48.1793591Z cvt.f32.bf16 %r1879, %rs52; 2026-02-21T08:31:48.1793763Z cvt.f32.bf16 %r1880, %rs49; 2026-02-21T08:31:48.1793925Z cvt.f32.bf16 %r1881, %rs50; 2026-02-21T08:31:48.1794098Z cvt.f32.bf16 %r1882, %rs63; 2026-02-21T08:31:48.1794261Z cvt.f32.bf16 %r1883, 
%rs64; 2026-02-21T08:31:48.1794434Z cvt.f32.bf16 %r1884, %rs61; 2026-02-21T08:31:48.1794597Z cvt.f32.bf16 %r1885, %rs62; 2026-02-21T08:31:48.1794770Z cvt.f32.bf16 %r1886, %rs59; 2026-02-21T08:31:48.1794928Z cvt.f32.bf16 %r1887, %rs60; 2026-02-21T08:31:48.1795184Z cvt.f32.bf16 %r1888, %rs57; 2026-02-21T08:31:48.1795363Z cvt.f32.bf16 %r1889, %rs58; 2026-02-21T08:31:48.1795521Z $L__tmp14: 2026-02-21T08:31:48.1795843Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1796200Z add.s32 %r1869, %r273, 803456; 2026-02-21T08:31:48.1796388Z // begin inline asm 2026-02-21T08:31:48.1796532Z 2026-02-21T08:31:48.1796659Z { 2026-02-21T08:31:48.1796792Z .reg .pred complete; 2026-02-21T08:31:48.1796957Z waitLoop: 2026-02-21T08:31:48.1797163Z mbarrier.try_wait.parity.shared.b64 complete, [%r1869], %r1962; 2026-02-21T08:31:48.1797428Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1797599Z } 2026-02-21T08:31:48.1797669Z 2026-02-21T08:31:48.1797732Z // end inline asm 2026-02-21T08:31:48.1797885Z $L__tmp15: 2026-02-21T08:31:48.1798137Z .loc 1 77 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:77:32 2026-02-21T08:31:48.1798455Z add.s32 %r1871, %r273, 803504; 2026-02-21T08:31:48.1798626Z // begin inline asm 2026-02-21T08:31:48.1798789Z 2026-02-21T08:31:48.1798901Z { 2026-02-21T08:31:48.1799035Z .reg .pred complete; 2026-02-21T08:31:48.1799179Z waitLoop: 2026-02-21T08:31:48.1799375Z mbarrier.try_wait.parity.shared.b64 complete, [%r1871], %r1962; 2026-02-21T08:31:48.1799621Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1799771Z } 2026-02-21T08:31:48.1799834Z 2026-02-21T08:31:48.1799900Z // end inline asm 2026-02-21T08:31:48.1800033Z $L__tmp16: 2026-02-21T08:31:48.1800327Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1800659Z // begin inline asm 2026-02-21T08:31:48.1801071Z @%p81 tcgen05.st.sync.aligned.32x32b.x16.b32 
[%r1873 + 0], {%r1874, %r1875, %r1876, %r1877, %r1878, %r1879, %r1880, %r1881, %r1882, %r1883, %r1884, %r1885, %r1886, %r1887, %r1888, %r1889}; 2026-02-21T08:31:48.1801504Z // end inline asm 2026-02-21T08:31:48.1801676Z // begin inline asm 2026-02-21T08:31:48.1801848Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:31:48.1802017Z // end inline asm 2026-02-21T08:31:48.1802160Z bar.sync 2, 256; 2026-02-21T08:31:48.1802303Z @%p56 bra $L__BB0_18; 2026-02-21T08:31:48.1802507Z // %bb.17: // in Loop: Header=BB0_8 Depth=2 2026-02-21T08:31:48.1802737Z elect.sync %r1914|%p94, -1; 2026-02-21T08:31:48.1802920Z mov.b32 %r1904, 138414352; 2026-02-21T08:31:48.1803092Z mov.pred %p93, -1; 2026-02-21T08:31:48.1803242Z // begin inline asm 2026-02-21T08:31:48.1803502Z @%p94 tcgen05.mma.cta_group::1.kind::tf32 [ %r1902 + 0 ], [ %r1903 + 0 ], %rd117, %r1904, %p93; 2026-02-21T08:31:48.1803778Z // end inline asm 2026-02-21T08:31:48.1803927Z // begin inline asm 2026-02-21T08:31:48.1804165Z @%p94 tcgen05.mma.cta_group::1.kind::tf32 [ %r1905 + 0 ], [ %r1906 + 8 ], %rd118, %r1904, %p93; 2026-02-21T08:31:48.1804446Z // end inline asm 2026-02-21T08:31:48.1804597Z // begin inline asm 2026-02-21T08:31:48.1804841Z @%p94 tcgen05.mma.cta_group::1.kind::tf32 [ %r1908 + 0 ], [ %r1909 + 16 ], %rd119, %r1904, %p93; 2026-02-21T08:31:48.1805193Z // end inline asm 2026-02-21T08:31:48.1805328Z // begin inline asm 2026-02-21T08:31:48.1805574Z @%p94 tcgen05.mma.cta_group::1.kind::tf32 [ %r1911 + 0 ], [ %r1912 + 24 ], %rd120, %r1904, %p93; 2026-02-21T08:31:48.1805843Z // end inline asm 2026-02-21T08:31:48.1805996Z add.s32 %r1916, %r273, 803472; 2026-02-21T08:31:48.1806163Z cvt.u64.u32 %rd121, %r1916; 2026-02-21T08:31:48.1806329Z // begin inline asm 2026-02-21T08:31:48.1806551Z @%p94 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd121]; 2026-02-21T08:31:48.1806788Z // end inline asm 2026-02-21T08:31:48.1806939Z add.s32 %r1917, %r273, 803488; 2026-02-21T08:31:48.1807102Z cvt.u64.u32 %rd122, %r1917; 
2026-02-21T08:31:48.1807269Z // begin inline asm 2026-02-21T08:31:48.1807474Z @%p94 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd122]; 2026-02-21T08:31:48.1807714Z // end inline asm 2026-02-21T08:31:48.1807914Z bra.uni $L__BB0_18; 2026-02-21T08:31:48.1808069Z $L__tmp17: 2026-02-21T08:31:48.1808254Z $L__BB0_34: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1808589Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1808914Z ld.shared.b32 %r1988, [global_smem+393216]; 2026-02-21T08:31:48.1809105Z barrier.sync 1; 2026-02-21T08:31:48.1809266Z setp.lt.s32 %p5, %r1988, 1; 2026-02-21T08:31:48.1809431Z @%p5 bra $L__BB0_41; 2026-02-21T08:31:48.1809608Z // %bb.35: // %.lr.ph 2026-02-21T08:31:48.1809843Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1810150Z .loc 1 0 0 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:0 2026-02-21T08:31:48.1810442Z and.b32 %r96, %r1, 31; 2026-02-21T08:31:48.1810602Z or.b32 %r97, %r2, 134217696; 2026-02-21T08:31:48.1810774Z shl.b32 %r98, %r97, 5; 2026-02-21T08:31:48.1810931Z and.b32 %r99, %r98, 224; 2026-02-21T08:31:48.1811099Z bfe.u32 %r100, %r98, 5, 3; 2026-02-21T08:31:48.1811259Z or.b32 %r101, %r100, 8; 2026-02-21T08:31:48.1811423Z or.b32 %r102, %r100, 16; 2026-02-21T08:31:48.1811607Z or.b32 %r103, %r100, 24; 2026-02-21T08:31:48.1811768Z or.b32 %r104, %r100, 32; 2026-02-21T08:31:48.1811925Z or.b32 %r105, %r100, 40; 2026-02-21T08:31:48.1812071Z or.b32 %r106, %r100, 48; 2026-02-21T08:31:48.1812232Z or.b32 %r107, %r100, 56; 2026-02-21T08:31:48.1812379Z or.b32 %r108, %r100, 64; 2026-02-21T08:31:48.1812534Z or.b32 %r109, %r100, 72; 2026-02-21T08:31:48.1812681Z or.b32 %r110, %r100, 80; 2026-02-21T08:31:48.1812839Z or.b32 %r111, %r100, 88; 2026-02-21T08:31:48.1812988Z or.b32 %r112, %r100, 96; 2026-02-21T08:31:48.1813149Z or.b32 %r113, %r100, 104; 2026-02-21T08:31:48.1813315Z or.b32 %r114, %r100, 112; 2026-02-21T08:31:48.1813468Z or.b32 %r115, %r100, 120; 
2026-02-21T08:31:48.1813631Z shl.b32 %r116, %r96, 3; 2026-02-21T08:31:48.1813902Z .loc 1 21 66 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:21:66 2026-02-21T08:31:48.1814201Z mov.u32 %r278, %ctaid.x; 2026-02-21T08:31:48.1814464Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1814765Z add.s32 %r1997, %r278, -18944; 2026-02-21T08:31:48.1814932Z and.b32 %r279, %r98, 96; 2026-02-21T08:31:48.1815094Z or.b32 %r280, %r279, %r96; 2026-02-21T08:31:48.1815261Z shl.b32 %r281, %r280, 10; 2026-02-21T08:31:48.1815323Z and.b32 %r282, %r1, 128; 2026-02-21T08:31:48.1815382Z shl.b32 %r283, %r282, 2; 2026-02-21T08:31:48.1815445Z add.s32 %r285, %r273, %r281; 2026-02-21T08:31:48.1815517Z add.s32 %r286, %r285, %r283; 2026-02-21T08:31:48.1815578Z add.s32 %r118, %r286, 262144; 2026-02-21T08:31:48.1815638Z shl.b32 %r287, %r1, 11; 2026-02-21T08:31:48.1815707Z and.b32 %r288, %r287, 14336; 2026-02-21T08:31:48.1815765Z shl.b32 %r289, %r280, 4; 2026-02-21T08:31:48.1815825Z shr.u32 %r290, %r282, 1; 2026-02-21T08:31:48.1815948Z xor.b32 %r291, %r289, %r290; 2026-02-21T08:31:48.1816020Z or.b32 %r292, %r291, %r288; 2026-02-21T08:31:48.1816080Z add.s32 %r293, %r273, 786432; 2026-02-21T08:31:48.1816138Z add.s32 %r119, %r293, %r292; 2026-02-21T08:31:48.1816204Z xor.b32 %r294, %r292, 16; 2026-02-21T08:31:48.1816264Z add.s32 %r120, %r293, %r294; 2026-02-21T08:31:48.1816322Z xor.b32 %r295, %r292, 32; 2026-02-21T08:31:48.1816382Z add.s32 %r121, %r293, %r295; 2026-02-21T08:31:48.1816452Z xor.b32 %r296, %r292, 48; 2026-02-21T08:31:48.1816508Z add.s32 %r122, %r293, %r296; 2026-02-21T08:31:48.1816569Z shl.b32 %r297, %r99, 6; 2026-02-21T08:31:48.1816637Z shl.b32 %r298, %r96, 4; 2026-02-21T08:31:48.1816695Z shr.u32 %r299, %r99, 1; 2026-02-21T08:31:48.1816755Z or.b32 %r300, %r297, %r298; 2026-02-21T08:31:48.1816813Z xor.b32 %r301, %r300, %r299; 2026-02-21T08:31:48.1816879Z add.s32 %r579, %r293, %r301; 2026-02-21T08:31:48.1816993Z add.s32 %r584, 
%r579, 512; 2026-02-21T08:31:48.1817058Z add.s32 %r589, %r579, 1024; 2026-02-21T08:31:48.1817128Z add.s32 %r594, %r579, 1536; 2026-02-21T08:31:48.1817189Z mov.b32 %r1994, -1; 2026-02-21T08:31:48.1817248Z mov.b32 %r1990, 0; 2026-02-21T08:31:48.1817311Z mov.b32 %r1989, 1; 2026-02-21T08:31:48.1817371Z mov.b32 %r1996, %r1990; 2026-02-21T08:31:48.1817442Z mov.b32 %r1995, %r1990; 2026-02-21T08:31:48.1817500Z bra.uni $L__BB0_36; 2026-02-21T08:31:48.1817603Z $L__BB0_40: // in Loop: Header=BB0_36 Depth=2 2026-02-21T08:31:48.1817657Z $L__tmp18: 2026-02-21T08:31:48.1817878Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1817937Z xor.b32 %r1990, %r1990, 1; 2026-02-21T08:31:48.1817996Z add.s32 %r832, %r273, 803264; 2026-02-21T08:31:48.1818061Z // begin inline asm 2026-02-21T08:31:48.1818111Z 2026-02-21T08:31:48.1818161Z { 2026-02-21T08:31:48.1818224Z .reg .pred complete; 2026-02-21T08:31:48.1818287Z waitLoop: 2026-02-21T08:31:48.1818409Z mbarrier.try_wait.parity.shared.b64 complete, [%r832], %r1990; 2026-02-21T08:31:48.1818474Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1818531Z } 2026-02-21T08:31:48.1818535Z 2026-02-21T08:31:48.1818590Z // end inline asm 2026-02-21T08:31:48.1818648Z // begin inline asm 2026-02-21T08:31:48.1820050Z @%p9 tcgen05.st.sync.aligned.32x32b.x128.b32 [%r834 + 0], {%r835, %r836, %r837, %r838, %r839, %r840, %r841, %r842, %r843, %r844, %r845, %r846, %r847, %r848, %r849, %r850, %r851, %r852, %r853, %r854, %r855, %r856, %r857, %r858, %r859, %r860, %r861, %r862, %r863, %r864, %r865, %r866, %r867, %r868, %r869, %r870, %r871, %r872, %r873, %r874, %r875, %r876, %r877, %r878, %r879, %r880, %r881, %r882, %r883, %r884, %r885, %r886, %r887, %r888, %r889, %r890, %r891, %r892, %r893, %r894, %r895, %r896, %r897, %r898, %r899, %r900, %r901, %r902, %r903, %r904, %r905, %r906, %r907, %r908, %r909, %r910, %r911, %r912, %r913, %r914, %r915, %r916, %r917, %r918, %r919, %r920, %r921, %r922, 
%r923, %r924, %r925, %r926, %r927, %r928, %r929, %r930, %r931, %r932, %r933, %r934, %r935, %r936, %r937, %r938, %r939, %r940, %r941, %r942, %r943, %r944, %r945, %r946, %r947, %r948, %r949, %r950, %r951, %r952, %r953, %r954, %r955, %r956, %r957, %r958, %r959, %r960, %r961, %r962}; 2026-02-21T08:31:48.1820109Z // end inline asm 2026-02-21T08:31:48.1820171Z // begin inline asm 2026-02-21T08:31:48.1820241Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:31:48.1820295Z // end inline asm 2026-02-21T08:31:48.1820350Z bar.sync 6, 256; 2026-02-21T08:31:48.1820413Z bar.sync 6, 256; 2026-02-21T08:31:48.1820472Z add.s32 %r963, %r273, 803280; 2026-02-21T08:31:48.1820527Z // begin inline asm 2026-02-21T08:31:48.1820627Z @%p8 mbarrier.arrive.shared::cta.b64 _, [%r963]; 2026-02-21T08:31:48.1820682Z // end inline asm 2026-02-21T08:31:48.1820735Z $L__tmp19: 2026-02-21T08:31:48.1820910Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1821024Z add.s32 %r1988, %r1988, -1; 2026-02-21T08:31:48.1821088Z setp.ne.b32 %p14, %r1988, 0; 2026-02-21T08:31:48.1821148Z @%p14 bra $L__BB0_36; 2026-02-21T08:31:48.1821212Z bra.uni $L__BB0_41; 2026-02-21T08:31:48.1821313Z $L__BB0_36: // Parent Loop BB0_2 Depth=1 2026-02-21T08:31:48.1821408Z // => This Inner Loop Header: Depth=2 2026-02-21T08:31:48.1821475Z add.s32 %r302, %r1994, 1; 2026-02-21T08:31:48.1821575Z setp.eq.b32 %p6, %r1994, 7; 2026-02-21T08:31:48.1821641Z selp.b32 %r1994, 0, %r302, %p6; 2026-02-21T08:31:48.1821703Z setp.ne.b32 %p7, %r1994, 0; 2026-02-21T08:31:48.1821768Z @%p7 bra $L__BB0_38; 2026-02-21T08:31:48.1821865Z // %bb.37: // in Loop: Header=BB0_36 Depth=2 2026-02-21T08:31:48.1821926Z add.s32 %r1997, %r1997, 18944; 2026-02-21T08:31:48.1822151Z .loc 1 30 31 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:30:31 2026-02-21T08:31:48.1822215Z shr.s32 %r303, %r1997, 31; 2026-02-21T08:31:48.1822273Z shr.u32 %r304, %r303, 27; 2026-02-21T08:31:48.1822339Z add.s32 %r305, %r1997, 
%r304; 2026-02-21T08:31:48.1822501Z .loc 1 29 30 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:29:30 2026-02-21T08:31:48.1822563Z and.b32 %r306, %r305, 33554400; 2026-02-21T08:31:48.1822623Z sub.s32 %r307, %r1997, %r306; 2026-02-21T08:31:48.1822792Z .loc 1 31 27 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:31:27 2026-02-21T08:31:48.1822853Z shl.b32 %r1995, %r307, 7; 2026-02-21T08:31:48.1823016Z .loc 1 33 27 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:33:27 2026-02-21T08:31:48.1823087Z shl.b32 %r308, %r305, 3; 2026-02-21T08:31:48.1823152Z and.b32 %r1996, %r308, -256; 2026-02-21T08:31:48.1823253Z $L__BB0_38: // in Loop: Header=BB0_36 Depth=2 2026-02-21T08:31:48.1823319Z $L__tmp20: 2026-02-21T08:31:48.1823541Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1823606Z xor.b32 %r1989, %r1989, 1; 2026-02-21T08:31:48.1823665Z bar.sync 6, 256; 2026-02-21T08:31:48.1823731Z add.s32 %r309, %r273, 803600; 2026-02-21T08:31:48.1823790Z // begin inline asm 2026-02-21T08:31:48.1823840Z 2026-02-21T08:31:48.1823898Z { 2026-02-21T08:31:48.1823958Z .reg .pred complete; 2026-02-21T08:31:48.1824011Z waitLoop: 2026-02-21T08:31:48.1824131Z mbarrier.try_wait.parity.shared.b64 complete, [%r309], %r1989; 2026-02-21T08:31:48.1824203Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1824252Z } 2026-02-21T08:31:48.1824256Z 2026-02-21T08:31:48.1824311Z // end inline asm 2026-02-21T08:31:48.1824413Z ld.shared.v4.b32 {%r313, %r314, %r315, %r316}, [%r118]; 2026-02-21T08:31:48.1824511Z ld.shared.v4.b32 {%r317, %r318, %r319, %r320}, [%r118+16]; 2026-02-21T08:31:48.1824606Z ld.shared.v4.b32 {%r321, %r322, %r323, %r324}, [%r118+32]; 2026-02-21T08:31:48.1824706Z ld.shared.v4.b32 {%r325, %r326, %r327, %r328}, [%r118+48]; 2026-02-21T08:31:48.1824795Z ld.shared.v4.b32 {%r329, %r330, %r331, %r332}, [%r118+64]; 2026-02-21T08:31:48.1824882Z ld.shared.v4.b32 {%r333, %r334, %r335, %r336}, 
[%r118+80]; 2026-02-21T08:31:48.1824978Z ld.shared.v4.b32 {%r337, %r338, %r339, %r340}, [%r118+96]; 2026-02-21T08:31:48.1825072Z ld.shared.v4.b32 {%r341, %r342, %r343, %r344}, [%r118+112]; 2026-02-21T08:31:48.1825173Z ld.shared.v4.b32 {%r345, %r346, %r347, %r348}, [%r118+128]; 2026-02-21T08:31:48.1825263Z ld.shared.v4.b32 {%r349, %r350, %r351, %r352}, [%r118+144]; 2026-02-21T08:31:48.1825353Z ld.shared.v4.b32 {%r353, %r354, %r355, %r356}, [%r118+160]; 2026-02-21T08:31:48.1825448Z ld.shared.v4.b32 {%r357, %r358, %r359, %r360}, [%r118+176]; 2026-02-21T08:31:48.1825537Z ld.shared.v4.b32 {%r361, %r362, %r363, %r364}, [%r118+192]; 2026-02-21T08:31:48.1825628Z ld.shared.v4.b32 {%r365, %r366, %r367, %r368}, [%r118+208]; 2026-02-21T08:31:48.1825772Z ld.shared.v4.b32 {%r369, %r370, %r371, %r372}, [%r118+224]; 2026-02-21T08:31:48.1825865Z ld.shared.v4.b32 {%r373, %r374, %r375, %r376}, [%r118+240]; 2026-02-21T08:31:48.1825953Z ld.shared.v4.b32 {%r377, %r378, %r379, %r380}, [%r118+256]; 2026-02-21T08:31:48.1826043Z ld.shared.v4.b32 {%r381, %r382, %r383, %r384}, [%r118+272]; 2026-02-21T08:31:48.1826140Z ld.shared.v4.b32 {%r385, %r386, %r387, %r388}, [%r118+288]; 2026-02-21T08:31:48.1826230Z ld.shared.v4.b32 {%r389, %r390, %r391, %r392}, [%r118+304]; 2026-02-21T08:31:48.1826321Z ld.shared.v4.b32 {%r393, %r394, %r395, %r396}, [%r118+320]; 2026-02-21T08:31:48.1826418Z ld.shared.v4.b32 {%r397, %r398, %r399, %r400}, [%r118+336]; 2026-02-21T08:31:48.1826509Z ld.shared.v4.b32 {%r401, %r402, %r403, %r404}, [%r118+352]; 2026-02-21T08:31:48.1826599Z ld.shared.v4.b32 {%r405, %r406, %r407, %r408}, [%r118+368]; 2026-02-21T08:31:48.1826697Z ld.shared.v4.b32 {%r409, %r410, %r411, %r412}, [%r118+384]; 2026-02-21T08:31:48.1826839Z ld.shared.v4.b32 {%r413, %r414, %r415, %r416}, [%r118+400]; 2026-02-21T08:31:48.1826933Z ld.shared.v4.b32 {%r417, %r418, %r419, %r420}, [%r118+416]; 2026-02-21T08:31:48.1827030Z ld.shared.v4.b32 {%r421, %r422, %r423, %r424}, [%r118+432]; 2026-02-21T08:31:48.1827119Z 
ld.shared.v4.b32 {%r425, %r426, %r427, %r428}, [%r118+448]; 2026-02-21T08:31:48.1827206Z ld.shared.v4.b32 {%r429, %r430, %r431, %r432}, [%r118+464]; 2026-02-21T08:31:48.1827294Z ld.shared.v4.b32 {%r433, %r434, %r435, %r436}, [%r118+480]; 2026-02-21T08:31:48.1827391Z ld.shared.v4.b32 {%r437, %r438, %r439, %r440}, [%r118+496]; 2026-02-21T08:31:48.1827447Z bar.sync 6, 256; 2026-02-21T08:31:48.1827506Z add.s32 %r311, %r273, 803584; 2026-02-21T08:31:48.1827573Z mov.pred %p8, 0; 2026-02-21T08:31:48.1827631Z // begin inline asm 2026-02-21T08:31:48.1827719Z @%p8 mbarrier.arrive.shared::cta.b64 _, [%r311]; 2026-02-21T08:31:48.1827782Z // end inline asm 2026-02-21T08:31:48.1827856Z shfl.sync.idx.b32 %r574, %r97, 0, 31, -1; 2026-02-21T08:31:48.1827920Z mov.pred %p9, -1; 2026-02-21T08:31:48.1827978Z // begin inline asm 2026-02-21T08:31:48.1829368Z @%p9 tcgen05.st.sync.aligned.32x32b.x128.b32 [%r312 + 0], {%r313, %r314, %r315, %r316, %r317, %r318, %r319, %r320, %r321, %r322, %r323, %r324, %r325, %r326, %r327, %r328, %r329, %r330, %r331, %r332, %r333, %r334, %r335, %r336, %r337, %r338, %r339, %r340, %r341, %r342, %r343, %r344, %r345, %r346, %r347, %r348, %r349, %r350, %r351, %r352, %r353, %r354, %r355, %r356, %r357, %r358, %r359, %r360, %r361, %r362, %r363, %r364, %r365, %r366, %r367, %r368, %r369, %r370, %r371, %r372, %r373, %r374, %r375, %r376, %r377, %r378, %r379, %r380, %r381, %r382, %r383, %r384, %r385, %r386, %r387, %r388, %r389, %r390, %r391, %r392, %r393, %r394, %r395, %r396, %r397, %r398, %r399, %r400, %r401, %r402, %r403, %r404, %r405, %r406, %r407, %r408, %r409, %r410, %r411, %r412, %r413, %r414, %r415, %r416, %r417, %r418, %r419, %r420, %r421, %r422, %r423, %r424, %r425, %r426, %r427, %r428, %r429, %r430, %r431, %r432, %r433, %r434, %r435, %r436, %r437, %r438, %r439, %r440}; 2026-02-21T08:31:48.1829428Z // end inline asm 2026-02-21T08:31:48.1829492Z // begin inline asm 2026-02-21T08:31:48.1829559Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:31:48.1829612Z // end 
inline asm 2026-02-21T08:31:48.1829672Z bar.sync 6, 256; 2026-02-21T08:31:48.1829726Z bar.sync 6, 256; 2026-02-21T08:31:48.1829784Z add.s32 %r441, %r273, 803456; 2026-02-21T08:31:48.1829840Z // begin inline asm 2026-02-21T08:31:48.1829935Z @%p8 mbarrier.arrive.shared::cta.b64 _, [%r441]; 2026-02-21T08:31:48.1829989Z // end inline asm 2026-02-21T08:31:48.1830040Z $L__tmp21: 2026-02-21T08:31:48.1830216Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1830281Z setp.ne.b32 %p11, %r1994, 7; 2026-02-21T08:31:48.1830333Z $L__tmp22: 2026-02-21T08:31:48.1830557Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1830654Z bar.sync 6, 256; 2026-02-21T08:31:48.1830711Z add.s32 %r442, %r273, 803472; 2026-02-21T08:31:48.1830767Z // begin inline asm 2026-02-21T08:31:48.1830824Z 2026-02-21T08:31:48.1830872Z { 2026-02-21T08:31:48.1830931Z .reg .pred complete; 2026-02-21T08:31:48.1830990Z waitLoop: 2026-02-21T08:31:48.1831107Z mbarrier.try_wait.parity.shared.b64 complete, [%r442], %r1990; 2026-02-21T08:31:48.1831171Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1831219Z } 2026-02-21T08:31:48.1831223Z 2026-02-21T08:31:48.1831282Z // end inline asm 2026-02-21T08:31:48.1831337Z // begin inline asm 2026-02-21T08:31:48.1832767Z tcgen05.ld.sync.aligned.32x32b.x128.b32 {%r835, %r836, %r837, %r838, %r839, %r840, %r841, %r842, %r843, %r844, %r845, %r846, %r847, %r848, %r849, %r850, %r851, %r852, %r853, %r854, %r855, %r856, %r857, %r858, %r859, %r860, %r861, %r862, %r863, %r864, %r865, %r866, %r867, %r868, %r869, %r870, %r871, %r872, %r873, %r874, %r875, %r876, %r877, %r878, %r879, %r880, %r881, %r882, %r883, %r884, %r885, %r886, %r887, %r888, %r889, %r890, %r891, %r892, %r893, %r894, %r895, %r896, %r897, %r898, %r899, %r900, %r901, %r902, %r903, %r904, %r905, %r906, %r907, %r908, %r909, %r910, %r911, %r912, %r913, %r914, %r915, %r916, %r917, %r918, %r919, %r920, 
%r921, %r922, %r923, %r924, %r925, %r926, %r927, %r928, %r929, %r930, %r931, %r932, %r933, %r934, %r935, %r936, %r937, %r938, %r939, %r940, %r941, %r942, %r943, %r944, %r945, %r946, %r947, %r948, %r949, %r950, %r951, %r952, %r953, %r954, %r955, %r956, %r957, %r958, %r959, %r960, %r961, %r962}, [%r572 + 0]; 2026-02-21T08:31:48.1832826Z // end inline asm 2026-02-21T08:31:48.1832883Z // begin inline asm 2026-02-21T08:31:48.1832956Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:31:48.1833009Z // end inline asm 2026-02-21T08:31:48.1833062Z $L__tmp23: 2026-02-21T08:31:48.1833232Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1833299Z @%p11 bra $L__BB0_40; 2026-02-21T08:31:48.1833400Z // %bb.39: // in Loop: Header=BB0_36 Depth=2 2026-02-21T08:31:48.1833565Z .loc 1 32 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:32:32 2026-02-21T08:31:48.1833633Z or.b32 %r719, %r1995, %r100; 2026-02-21T08:31:48.1833692Z or.b32 %r720, %r1995, %r101; 2026-02-21T08:31:48.1833748Z or.b32 %r721, %r1995, %r102; 2026-02-21T08:31:48.1833812Z or.b32 %r722, %r1995, %r103; 2026-02-21T08:31:48.1833867Z or.b32 %r723, %r1995, %r104; 2026-02-21T08:31:48.1833922Z or.b32 %r724, %r1995, %r105; 2026-02-21T08:31:48.1833977Z or.b32 %r725, %r1995, %r106; 2026-02-21T08:31:48.1834039Z or.b32 %r726, %r1995, %r107; 2026-02-21T08:31:48.1834094Z or.b32 %r727, %r1995, %r108; 2026-02-21T08:31:48.1834150Z or.b32 %r728, %r1995, %r109; 2026-02-21T08:31:48.1834214Z or.b32 %r729, %r1995, %r110; 2026-02-21T08:31:48.1834270Z or.b32 %r730, %r1995, %r111; 2026-02-21T08:31:48.1834325Z or.b32 %r731, %r1995, %r112; 2026-02-21T08:31:48.1834390Z or.b32 %r732, %r1995, %r113; 2026-02-21T08:31:48.1834448Z or.b32 %r733, %r1995, %r114; 2026-02-21T08:31:48.1834503Z or.b32 %r734, %r1995, %r115; 2026-02-21T08:31:48.1834661Z .loc 1 34 32 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:34:32 2026-02-21T08:31:48.1834727Z or.b32 %r735, %r1996, %r116; 
2026-02-21T08:31:48.1834885Z .loc 1 88 43 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:88:43 2026-02-21T08:31:48.1834945Z shl.b32 %r736, %r719, 13; 2026-02-21T08:31:48.1835012Z shl.b32 %r737, %r720, 13; 2026-02-21T08:31:48.1835070Z shl.b32 %r738, %r721, 13; 2026-02-21T08:31:48.1835126Z shl.b32 %r739, %r722, 13; 2026-02-21T08:31:48.1835186Z shl.b32 %r740, %r723, 13; 2026-02-21T08:31:48.1835254Z shl.b32 %r741, %r724, 13; 2026-02-21T08:31:48.1835312Z shl.b32 %r742, %r725, 13; 2026-02-21T08:31:48.1835368Z shl.b32 %r743, %r726, 13; 2026-02-21T08:31:48.1835433Z shl.b32 %r744, %r727, 13; 2026-02-21T08:31:48.1835491Z shl.b32 %r745, %r728, 13; 2026-02-21T08:31:48.1835600Z shl.b32 %r746, %r729, 13; 2026-02-21T08:31:48.1835659Z shl.b32 %r747, %r730, 13; 2026-02-21T08:31:48.1835725Z shl.b32 %r748, %r731, 13; 2026-02-21T08:31:48.1835785Z shl.b32 %r749, %r732, 13; 2026-02-21T08:31:48.1835845Z shl.b32 %r750, %r733, 13; 2026-02-21T08:31:48.1835911Z shl.b32 %r751, %r734, 13; 2026-02-21T08:31:48.1836081Z .loc 1 88 50 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:88:50 2026-02-21T08:31:48.1836142Z add.s32 %r752, %r736, %r735; 2026-02-21T08:31:48.1836209Z add.s32 %r753, %r737, %r735; 2026-02-21T08:31:48.1836268Z add.s32 %r754, %r738, %r735; 2026-02-21T08:31:48.1836326Z add.s32 %r755, %r739, %r735; 2026-02-21T08:31:48.1836385Z add.s32 %r756, %r740, %r735; 2026-02-21T08:31:48.1836450Z add.s32 %r757, %r741, %r735; 2026-02-21T08:31:48.1836509Z add.s32 %r758, %r742, %r735; 2026-02-21T08:31:48.1836567Z add.s32 %r759, %r743, %r735; 2026-02-21T08:31:48.1836675Z add.s32 %r760, %r744, %r735; 2026-02-21T08:31:48.1836738Z add.s32 %r761, %r745, %r735; 2026-02-21T08:31:48.1836797Z add.s32 %r762, %r746, %r735; 2026-02-21T08:31:48.1836855Z add.s32 %r763, %r747, %r735; 2026-02-21T08:31:48.1836922Z add.s32 %r764, %r748, %r735; 2026-02-21T08:31:48.1836980Z add.s32 %r765, %r749, %r735; 2026-02-21T08:31:48.1837039Z add.s32 %r766, %r750, %r735; 2026-02-21T08:31:48.1837106Z 
add.s32 %r767, %r751, %r735; 2026-02-21T08:31:48.1837278Z .loc 1 88 22 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:88:22 2026-02-21T08:31:48.1837352Z mad.wide.s32 %rd21, %r752, 2, %rd19; 2026-02-21T08:31:48.1837430Z mad.wide.s32 %rd22, %r753, 2, %rd19; 2026-02-21T08:31:48.1837497Z mad.wide.s32 %rd23, %r754, 2, %rd19; 2026-02-21T08:31:48.1837562Z mad.wide.s32 %rd24, %r755, 2, %rd19; 2026-02-21T08:31:48.1837626Z mad.wide.s32 %rd25, %r756, 2, %rd19; 2026-02-21T08:31:48.1837698Z mad.wide.s32 %rd26, %r757, 2, %rd19; 2026-02-21T08:31:48.1837764Z mad.wide.s32 %rd27, %r758, 2, %rd19; 2026-02-21T08:31:48.1837829Z mad.wide.s32 %rd28, %r759, 2, %rd19; 2026-02-21T08:31:48.1837898Z mad.wide.s32 %rd29, %r760, 2, %rd19; 2026-02-21T08:31:48.1837962Z mad.wide.s32 %rd30, %r761, 2, %rd19; 2026-02-21T08:31:48.1838026Z mad.wide.s32 %rd31, %r762, 2, %rd19; 2026-02-21T08:31:48.1838089Z mad.wide.s32 %rd32, %r763, 2, %rd19; 2026-02-21T08:31:48.1838160Z mad.wide.s32 %rd33, %r764, 2, %rd19; 2026-02-21T08:31:48.1838223Z mad.wide.s32 %rd34, %r765, 2, %rd19; 2026-02-21T08:31:48.1838287Z mad.wide.s32 %rd35, %r766, 2, %rd19; 2026-02-21T08:31:48.1838358Z mad.wide.s32 %rd36, %r767, 2, %rd19; 2026-02-21T08:31:48.1838531Z .loc 1 88 81 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:88:81 2026-02-21T08:31:48.1838604Z cvt.rn.bf16x2.f32 %r768, %r860, %r859; 2026-02-21T08:31:48.1838681Z cvt.rn.bf16x2.f32 %r769, %r852, %r851; 2026-02-21T08:31:48.1838749Z cvt.rn.bf16x2.f32 %r770, %r844, %r843; 2026-02-21T08:31:48.1838817Z cvt.rn.bf16x2.f32 %r771, %r836, %r835; 2026-02-21T08:31:48.1838917Z st.shared.v4.b32 [%r119], {%r771, %r770, %r769, %r768}; 2026-02-21T08:31:48.1838989Z cvt.rn.bf16x2.f32 %r772, %r892, %r891; 2026-02-21T08:31:48.1839054Z cvt.rn.bf16x2.f32 %r773, %r884, %r883; 2026-02-21T08:31:48.1839119Z cvt.rn.bf16x2.f32 %r774, %r876, %r875; 2026-02-21T08:31:48.1839191Z cvt.rn.bf16x2.f32 %r775, %r868, %r867; 2026-02-21T08:31:48.1839285Z st.shared.v4.b32 [%r120], {%r775, %r774, 
%r773, %r772}; 2026-02-21T08:31:48.1839351Z cvt.rn.bf16x2.f32 %r776, %r924, %r923; 2026-02-21T08:31:48.1839425Z cvt.rn.bf16x2.f32 %r777, %r916, %r915; 2026-02-21T08:31:48.1839489Z cvt.rn.bf16x2.f32 %r778, %r908, %r907; 2026-02-21T08:31:48.1839554Z cvt.rn.bf16x2.f32 %r779, %r900, %r899; 2026-02-21T08:31:48.1839645Z st.shared.v4.b32 [%r121], {%r779, %r778, %r777, %r776}; 2026-02-21T08:31:48.1839717Z cvt.rn.bf16x2.f32 %r780, %r956, %r955; 2026-02-21T08:31:48.1839781Z cvt.rn.bf16x2.f32 %r781, %r948, %r947; 2026-02-21T08:31:48.1839847Z cvt.rn.bf16x2.f32 %r782, %r940, %r939; 2026-02-21T08:31:48.1839965Z cvt.rn.bf16x2.f32 %r783, %r932, %r931; 2026-02-21T08:31:48.1840056Z st.shared.v4.b32 [%r122], {%r783, %r782, %r781, %r780}; 2026-02-21T08:31:48.1840115Z bar.sync 6, 256; 2026-02-21T08:31:48.1840180Z // begin inline asm 2026-02-21T08:31:48.1840339Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r655, %r659, %r663, %r667}, [%r579]; 2026-02-21T08:31:48.1840397Z // end inline asm 2026-02-21T08:31:48.1840456Z // begin inline asm 2026-02-21T08:31:48.1840618Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r671, %r675, %r679, %r683}, [%r584]; 2026-02-21T08:31:48.1840675Z // end inline asm 2026-02-21T08:31:48.1840733Z // begin inline asm 2026-02-21T08:31:48.1840888Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r687, %r691, %r695, %r699}, [%r589]; 2026-02-21T08:31:48.1840944Z // end inline asm 2026-02-21T08:31:48.1841002Z // begin inline asm 2026-02-21T08:31:48.1841189Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r703, %r707, %r711, %r715}, [%r594]; 2026-02-21T08:31:48.1841257Z // end inline asm 2026-02-21T08:31:48.1841314Z bar.sync 6, 256; 2026-02-21T08:31:48.1841380Z cvt.rn.bf16x2.f32 %r784, %r862, %r861; 2026-02-21T08:31:48.1841455Z cvt.rn.bf16x2.f32 %r785, %r854, %r853; 2026-02-21T08:31:48.1841521Z cvt.rn.bf16x2.f32 %r786, %r846, %r845; 2026-02-21T08:31:48.1841615Z cvt.rn.bf16x2.f32 %r787, %r838, %r837; 2026-02-21T08:31:48.1841714Z st.shared.v4.b32 [%r119], {%r787, %r786, %r785, %r784}; 
2026-02-21T08:31:48.1841779Z cvt.rn.bf16x2.f32 %r788, %r894, %r893; 2026-02-21T08:31:48.1841844Z cvt.rn.bf16x2.f32 %r789, %r886, %r885; 2026-02-21T08:31:48.1841909Z cvt.rn.bf16x2.f32 %r790, %r878, %r877; 2026-02-21T08:31:48.1841980Z cvt.rn.bf16x2.f32 %r791, %r870, %r869; 2026-02-21T08:31:48.1842069Z st.shared.v4.b32 [%r120], {%r791, %r790, %r789, %r788}; 2026-02-21T08:31:48.1842135Z cvt.rn.bf16x2.f32 %r792, %r926, %r925; 2026-02-21T08:31:48.1842206Z cvt.rn.bf16x2.f32 %r793, %r918, %r917; 2026-02-21T08:31:48.1842275Z cvt.rn.bf16x2.f32 %r794, %r910, %r909; 2026-02-21T08:31:48.1842343Z cvt.rn.bf16x2.f32 %r795, %r902, %r901; 2026-02-21T08:31:48.1842434Z st.shared.v4.b32 [%r121], {%r795, %r794, %r793, %r792}; 2026-02-21T08:31:48.1842508Z cvt.rn.bf16x2.f32 %r796, %r958, %r957; 2026-02-21T08:31:48.1842575Z cvt.rn.bf16x2.f32 %r797, %r950, %r949; 2026-02-21T08:31:48.1842644Z cvt.rn.bf16x2.f32 %r798, %r942, %r941; 2026-02-21T08:31:48.1842718Z cvt.rn.bf16x2.f32 %r799, %r934, %r933; 2026-02-21T08:31:48.1842809Z st.shared.v4.b32 [%r122], {%r799, %r798, %r797, %r796}; 2026-02-21T08:31:48.1842868Z bar.sync 6, 256; 2026-02-21T08:31:48.1842935Z // begin inline asm 2026-02-21T08:31:48.1843083Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r656, %r660, %r664, %r668}, [%r579]; 2026-02-21T08:31:48.1843142Z // end inline asm 2026-02-21T08:31:48.1843208Z // begin inline asm 2026-02-21T08:31:48.1843353Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r672, %r676, %r680, %r684}, [%r584]; 2026-02-21T08:31:48.1843407Z // end inline asm 2026-02-21T08:31:48.1843465Z // begin inline asm 2026-02-21T08:31:48.1843611Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r688, %r692, %r696, %r700}, [%r589]; 2026-02-21T08:31:48.1843665Z // end inline asm 2026-02-21T08:31:48.1843721Z // begin inline asm 2026-02-21T08:31:48.1843860Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r704, %r708, %r712, %r716}, [%r594]; 2026-02-21T08:31:48.1843921Z // end inline asm 2026-02-21T08:31:48.1843976Z bar.sync 6, 256; 
2026-02-21T08:31:48.1844040Z cvt.rn.bf16x2.f32 %r800, %r864, %r863; 2026-02-21T08:31:48.1844110Z cvt.rn.bf16x2.f32 %r801, %r856, %r855; 2026-02-21T08:31:48.1844173Z cvt.rn.bf16x2.f32 %r802, %r848, %r847; 2026-02-21T08:31:48.1844235Z cvt.rn.bf16x2.f32 %r803, %r840, %r839; 2026-02-21T08:31:48.1844326Z st.shared.v4.b32 [%r119], {%r803, %r802, %r801, %r800}; 2026-02-21T08:31:48.1844389Z cvt.rn.bf16x2.f32 %r804, %r896, %r895; 2026-02-21T08:31:48.1844452Z cvt.rn.bf16x2.f32 %r805, %r888, %r887; 2026-02-21T08:31:48.1844516Z cvt.rn.bf16x2.f32 %r806, %r880, %r879; 2026-02-21T08:31:48.1844654Z cvt.rn.bf16x2.f32 %r807, %r872, %r871; 2026-02-21T08:31:48.1844740Z st.shared.v4.b32 [%r120], {%r807, %r806, %r805, %r804}; 2026-02-21T08:31:48.1844802Z cvt.rn.bf16x2.f32 %r808, %r928, %r927; 2026-02-21T08:31:48.1844870Z cvt.rn.bf16x2.f32 %r809, %r920, %r919; 2026-02-21T08:31:48.1844930Z cvt.rn.bf16x2.f32 %r810, %r912, %r911; 2026-02-21T08:31:48.1844991Z cvt.rn.bf16x2.f32 %r811, %r904, %r903; 2026-02-21T08:31:48.1845079Z st.shared.v4.b32 [%r121], {%r811, %r810, %r809, %r808}; 2026-02-21T08:31:48.1845140Z cvt.rn.bf16x2.f32 %r812, %r960, %r959; 2026-02-21T08:31:48.1845200Z cvt.rn.bf16x2.f32 %r813, %r952, %r951; 2026-02-21T08:31:48.1845261Z cvt.rn.bf16x2.f32 %r814, %r944, %r943; 2026-02-21T08:31:48.1845330Z cvt.rn.bf16x2.f32 %r815, %r936, %r935; 2026-02-21T08:31:48.1845413Z st.shared.v4.b32 [%r122], {%r815, %r814, %r813, %r812}; 2026-02-21T08:31:48.1845469Z bar.sync 6, 256; 2026-02-21T08:31:48.1845532Z // begin inline asm 2026-02-21T08:31:48.1845719Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r657, %r661, %r665, %r669}, [%r579]; 2026-02-21T08:31:48.1845776Z // end inline asm 2026-02-21T08:31:48.1845832Z // begin inline asm 2026-02-21T08:31:48.1845979Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r673, %r677, %r681, %r685}, [%r584]; 2026-02-21T08:31:48.1846032Z // end inline asm 2026-02-21T08:31:48.1846088Z // begin inline asm 2026-02-21T08:31:48.1846233Z 
ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r689, %r693, %r697, %r701}, [%r589]; 2026-02-21T08:31:48.1846287Z // end inline asm 2026-02-21T08:31:48.1846342Z // begin inline asm 2026-02-21T08:31:48.1846486Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r705, %r709, %r713, %r717}, [%r594]; 2026-02-21T08:31:48.1846541Z // end inline asm 2026-02-21T08:31:48.1846594Z bar.sync 6, 256; 2026-02-21T08:31:48.1846658Z cvt.rn.bf16x2.f32 %r816, %r866, %r865; 2026-02-21T08:31:48.1846728Z cvt.rn.bf16x2.f32 %r817, %r858, %r857; 2026-02-21T08:31:48.1846788Z cvt.rn.bf16x2.f32 %r818, %r850, %r849; 2026-02-21T08:31:48.1846853Z cvt.rn.bf16x2.f32 %r819, %r842, %r841; 2026-02-21T08:31:48.1846946Z st.shared.v4.b32 [%r119], {%r819, %r818, %r817, %r816}; 2026-02-21T08:31:48.1847008Z cvt.rn.bf16x2.f32 %r820, %r898, %r897; 2026-02-21T08:31:48.1847070Z cvt.rn.bf16x2.f32 %r821, %r890, %r889; 2026-02-21T08:31:48.1847132Z cvt.rn.bf16x2.f32 %r822, %r882, %r881; 2026-02-21T08:31:48.1847199Z cvt.rn.bf16x2.f32 %r823, %r874, %r873; 2026-02-21T08:31:48.1847284Z st.shared.v4.b32 [%r120], {%r823, %r822, %r821, %r820}; 2026-02-21T08:31:48.1847346Z cvt.rn.bf16x2.f32 %r824, %r930, %r929; 2026-02-21T08:31:48.1847415Z cvt.rn.bf16x2.f32 %r825, %r922, %r921; 2026-02-21T08:31:48.1847477Z cvt.rn.bf16x2.f32 %r826, %r914, %r913; 2026-02-21T08:31:48.1847539Z cvt.rn.bf16x2.f32 %r827, %r906, %r905; 2026-02-21T08:31:48.1847630Z st.shared.v4.b32 [%r121], {%r827, %r826, %r825, %r824}; 2026-02-21T08:31:48.1847691Z cvt.rn.bf16x2.f32 %r828, %r962, %r961; 2026-02-21T08:31:48.1847753Z cvt.rn.bf16x2.f32 %r829, %r954, %r953; 2026-02-21T08:31:48.1847817Z cvt.rn.bf16x2.f32 %r830, %r946, %r945; 2026-02-21T08:31:48.1847887Z cvt.rn.bf16x2.f32 %r831, %r938, %r937; 2026-02-21T08:31:48.1847972Z st.shared.v4.b32 [%r122], {%r831, %r830, %r829, %r828}; 2026-02-21T08:31:48.1848026Z bar.sync 6, 256; 2026-02-21T08:31:48.1848088Z // begin inline asm 2026-02-21T08:31:48.1848226Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r658, %r662, %r666, 
%r670}, [%r579]; 2026-02-21T08:31:48.1848279Z // end inline asm 2026-02-21T08:31:48.1848335Z // begin inline asm 2026-02-21T08:31:48.1848482Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r674, %r678, %r682, %r686}, [%r584]; 2026-02-21T08:31:48.1848536Z // end inline asm 2026-02-21T08:31:48.1848590Z // begin inline asm 2026-02-21T08:31:48.1848735Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r690, %r694, %r698, %r702}, [%r589]; 2026-02-21T08:31:48.1848788Z // end inline asm 2026-02-21T08:31:48.1848842Z // begin inline asm 2026-02-21T08:31:48.1848987Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r706, %r710, %r714, %r718}, [%r594]; 2026-02-21T08:31:48.1849081Z // end inline asm 2026-02-21T08:31:48.1849136Z // begin inline asm 2026-02-21T08:31:48.1849239Z st.global.v4.b32 [ %rd21 + 0 ], { %r655, %r656, %r657, %r658 }; 2026-02-21T08:31:48.1849300Z // end inline asm 2026-02-21T08:31:48.1849356Z // begin inline asm 2026-02-21T08:31:48.1849458Z st.global.v4.b32 [ %rd22 + 0 ], { %r659, %r660, %r661, %r662 }; 2026-02-21T08:31:48.1849520Z // end inline asm 2026-02-21T08:31:48.1849573Z // begin inline asm 2026-02-21T08:31:48.1849669Z st.global.v4.b32 [ %rd23 + 0 ], { %r663, %r664, %r665, %r666 }; 2026-02-21T08:31:48.1849722Z // end inline asm 2026-02-21T08:31:48.1849784Z // begin inline asm 2026-02-21T08:31:48.1849876Z st.global.v4.b32 [ %rd24 + 0 ], { %r667, %r668, %r669, %r670 }; 2026-02-21T08:31:48.1849928Z // end inline asm 2026-02-21T08:31:48.1849992Z // begin inline asm 2026-02-21T08:31:48.1850084Z st.global.v4.b32 [ %rd25 + 0 ], { %r671, %r672, %r673, %r674 }; 2026-02-21T08:31:48.1850183Z // end inline asm 2026-02-21T08:31:48.1850253Z // begin inline asm 2026-02-21T08:31:48.1850346Z st.global.v4.b32 [ %rd26 + 0 ], { %r675, %r676, %r677, %r678 }; 2026-02-21T08:31:48.1850403Z // end inline asm 2026-02-21T08:31:48.1850458Z // begin inline asm 2026-02-21T08:31:48.1850558Z st.global.v4.b32 [ %rd27 + 0 ], { %r679, %r680, %r681, %r682 }; 2026-02-21T08:31:48.1850611Z // end inline 
asm 2026-02-21T08:31:48.1850666Z // begin inline asm 2026-02-21T08:31:48.1850765Z st.global.v4.b32 [ %rd28 + 0 ], { %r683, %r684, %r685, %r686 }; 2026-02-21T08:31:48.1850818Z // end inline asm 2026-02-21T08:31:48.1850873Z // begin inline asm 2026-02-21T08:31:48.1850964Z st.global.v4.b32 [ %rd29 + 0 ], { %r687, %r688, %r689, %r690 }; 2026-02-21T08:31:48.1851023Z // end inline asm 2026-02-21T08:31:48.1851078Z // begin inline asm 2026-02-21T08:31:48.1851170Z st.global.v4.b32 [ %rd30 + 0 ], { %r691, %r692, %r693, %r694 }; 2026-02-21T08:31:48.1851229Z // end inline asm 2026-02-21T08:31:48.1851283Z // begin inline asm 2026-02-21T08:31:48.1851378Z st.global.v4.b32 [ %rd31 + 0 ], { %r695, %r696, %r697, %r698 }; 2026-02-21T08:31:48.1851440Z // end inline asm 2026-02-21T08:31:48.1851495Z // begin inline asm 2026-02-21T08:31:48.1851613Z st.global.v4.b32 [ %rd32 + 0 ], { %r699, %r700, %r701, %r702 }; 2026-02-21T08:31:48.1851670Z // end inline asm 2026-02-21T08:31:48.1851732Z // begin inline asm 2026-02-21T08:31:48.1851823Z st.global.v4.b32 [ %rd33 + 0 ], { %r703, %r704, %r705, %r706 }; 2026-02-21T08:31:48.1851877Z // end inline asm 2026-02-21T08:31:48.1851938Z // begin inline asm 2026-02-21T08:31:48.1852030Z st.global.v4.b32 [ %rd34 + 0 ], { %r707, %r708, %r709, %r710 }; 2026-02-21T08:31:48.1852083Z // end inline asm 2026-02-21T08:31:48.1852140Z // begin inline asm 2026-02-21T08:31:48.1852238Z st.global.v4.b32 [ %rd35 + 0 ], { %r711, %r712, %r713, %r714 }; 2026-02-21T08:31:48.1852291Z // end inline asm 2026-02-21T08:31:48.1852348Z // begin inline asm 2026-02-21T08:31:48.1852448Z st.global.v4.b32 [ %rd36 + 0 ], { %r715, %r716, %r717, %r718 }; 2026-02-21T08:31:48.1852504Z // end inline asm 2026-02-21T08:31:48.1852562Z bra.uni $L__BB0_40; 2026-02-21T08:31:48.1852663Z $L__BB0_26: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1852844Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1852923Z ld.shared.b32 %r1980, 
[global_smem+393216]; 2026-02-21T08:31:48.1852980Z barrier.sync 1; 2026-02-21T08:31:48.1853051Z setp.lt.s32 %p21, %r1980, 1; 2026-02-21T08:31:48.1853109Z @%p21 bra $L__BB0_29; 2026-02-21T08:31:48.1853190Z // %bb.27: // %.lr.ph489 2026-02-21T08:31:48.1853288Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1853458Z .loc 1 0 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:0:145 2026-02-21T08:31:48.1853521Z add.s32 %r70, %r1, -512; 2026-02-21T08:31:48.1853578Z shr.u32 %r71, %r70, 5; 2026-02-21T08:31:48.1853712Z shl.b32 %r1245, %r1, 10; 2026-02-21T08:31:48.1853777Z and.b32 %r1246, %r1245, 130048; 2026-02-21T08:31:48.1853835Z shl.b32 %r1247, %r1, 2; 2026-02-21T08:31:48.1853900Z and.b32 %r1248, %r1247, 512; 2026-02-21T08:31:48.1853960Z or.b32 %r1249, %r1246, %r1248; 2026-02-21T08:31:48.1854017Z add.s32 %r72, %r273, %r1249; 2026-02-21T08:31:48.1854080Z add.s32 %r73, %r72, 131072; 2026-02-21T08:31:48.1854136Z mov.b32 %r1982, 1; 2026-02-21T08:31:48.1854189Z mov.b32 %r1981, 0; 2026-02-21T08:31:48.1854245Z mov.b32 %r1983, %r1981; 2026-02-21T08:31:48.1854349Z $L__BB0_28: // Parent Loop BB0_2 Depth=1 2026-02-21T08:31:48.1854440Z // => This Inner Loop Header: Depth=2 2026-02-21T08:31:48.1854502Z setp.eq.b32 %p22, %r70, 0; 2026-02-21T08:31:48.1854558Z $L__tmp24: 2026-02-21T08:31:48.1854827Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1854892Z xor.b32 %r1982, %r1982, 1; 2026-02-21T08:31:48.1854947Z bar.sync 4, 256; 2026-02-21T08:31:48.1855016Z add.s32 %r1251, %r273, 803536; 2026-02-21T08:31:48.1855071Z // begin inline asm 2026-02-21T08:31:48.1855120Z 2026-02-21T08:31:48.1855177Z { 2026-02-21T08:31:48.1855237Z .reg .pred complete; 2026-02-21T08:31:48.1855289Z waitLoop: 2026-02-21T08:31:48.1855406Z mbarrier.try_wait.parity.shared.b64 complete, [%r1251], %r1982; 2026-02-21T08:31:48.1855475Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1855524Z } 2026-02-21T08:31:48.1855528Z 
2026-02-21T08:31:48.1855582Z // end inline asm 2026-02-21T08:31:48.1855683Z ld.shared.v4.b32 {%r1255, %r1256, %r1257, %r1258}, [%r72]; 2026-02-21T08:31:48.1855781Z ld.shared.v4.b32 {%r1259, %r1260, %r1261, %r1262}, [%r72+16]; 2026-02-21T08:31:48.1855877Z ld.shared.v4.b32 {%r1263, %r1264, %r1265, %r1266}, [%r72+32]; 2026-02-21T08:31:48.1855975Z ld.shared.v4.b32 {%r1267, %r1268, %r1269, %r1270}, [%r72+48]; 2026-02-21T08:31:48.1856068Z ld.shared.v4.b32 {%r1271, %r1272, %r1273, %r1274}, [%r72+64]; 2026-02-21T08:31:48.1856160Z ld.shared.v4.b32 {%r1275, %r1276, %r1277, %r1278}, [%r72+80]; 2026-02-21T08:31:48.1856258Z ld.shared.v4.b32 {%r1279, %r1280, %r1281, %r1282}, [%r72+96]; 2026-02-21T08:31:48.1856352Z ld.shared.v4.b32 {%r1283, %r1284, %r1285, %r1286}, [%r72+112]; 2026-02-21T08:31:48.1856446Z ld.shared.v4.b32 {%r1287, %r1288, %r1289, %r1290}, [%r72+128]; 2026-02-21T08:31:48.1856539Z ld.shared.v4.b32 {%r1291, %r1292, %r1293, %r1294}, [%r72+144]; 2026-02-21T08:31:48.1856638Z ld.shared.v4.b32 {%r1295, %r1296, %r1297, %r1298}, [%r72+160]; 2026-02-21T08:31:48.1856731Z ld.shared.v4.b32 {%r1299, %r1300, %r1301, %r1302}, [%r72+176]; 2026-02-21T08:31:48.1856822Z ld.shared.v4.b32 {%r1303, %r1304, %r1305, %r1306}, [%r72+192]; 2026-02-21T08:31:48.1856919Z ld.shared.v4.b32 {%r1307, %r1308, %r1309, %r1310}, [%r72+208]; 2026-02-21T08:31:48.1857009Z ld.shared.v4.b32 {%r1311, %r1312, %r1313, %r1314}, [%r72+224]; 2026-02-21T08:31:48.1857104Z ld.shared.v4.b32 {%r1315, %r1316, %r1317, %r1318}, [%r72+240]; 2026-02-21T08:31:48.1857202Z ld.shared.v4.b32 {%r1319, %r1320, %r1321, %r1322}, [%r72+256]; 2026-02-21T08:31:48.1857293Z ld.shared.v4.b32 {%r1323, %r1324, %r1325, %r1326}, [%r72+272]; 2026-02-21T08:31:48.1857383Z ld.shared.v4.b32 {%r1327, %r1328, %r1329, %r1330}, [%r72+288]; 2026-02-21T08:31:48.1857475Z ld.shared.v4.b32 {%r1331, %r1332, %r1333, %r1334}, [%r72+304]; 2026-02-21T08:31:48.1857572Z ld.shared.v4.b32 {%r1335, %r1336, %r1337, %r1338}, [%r72+320]; 2026-02-21T08:31:48.1857662Z 
ld.shared.v4.b32 {%r1339, %r1340, %r1341, %r1342}, [%r72+336]; 2026-02-21T08:31:48.1857754Z ld.shared.v4.b32 {%r1343, %r1344, %r1345, %r1346}, [%r72+352]; 2026-02-21T08:31:48.1857852Z ld.shared.v4.b32 {%r1347, %r1348, %r1349, %r1350}, [%r72+368]; 2026-02-21T08:31:48.1857943Z ld.shared.v4.b32 {%r1351, %r1352, %r1353, %r1354}, [%r72+384]; 2026-02-21T08:31:48.1858039Z ld.shared.v4.b32 {%r1355, %r1356, %r1357, %r1358}, [%r72+400]; 2026-02-21T08:31:48.1858181Z ld.shared.v4.b32 {%r1359, %r1360, %r1361, %r1362}, [%r72+416]; 2026-02-21T08:31:48.1858274Z ld.shared.v4.b32 {%r1363, %r1364, %r1365, %r1366}, [%r72+432]; 2026-02-21T08:31:48.1858371Z ld.shared.v4.b32 {%r1367, %r1368, %r1369, %r1370}, [%r72+448]; 2026-02-21T08:31:48.1858469Z ld.shared.v4.b32 {%r1371, %r1372, %r1373, %r1374}, [%r72+464]; 2026-02-21T08:31:48.1858559Z ld.shared.v4.b32 {%r1375, %r1376, %r1377, %r1378}, [%r72+480]; 2026-02-21T08:31:48.1858650Z ld.shared.v4.b32 {%r1379, %r1380, %r1381, %r1382}, [%r72+496]; 2026-02-21T08:31:48.1858705Z bar.sync 4, 256; 2026-02-21T08:31:48.1858772Z add.s32 %r1253, %r273, 803520; 2026-02-21T08:31:48.1858828Z // begin inline asm 2026-02-21T08:31:48.1858920Z @%p22 mbarrier.arrive.shared::cta.b64 _, [%r1253]; 2026-02-21T08:31:48.1858983Z // end inline asm 2026-02-21T08:31:48.1859058Z shfl.sync.idx.b32 %r1519, %r71, 0, 31, -1; 2026-02-21T08:31:48.1859118Z mov.pred %p23, -1; 2026-02-21T08:31:48.1859222Z // begin inline asm 2026-02-21T08:31:48.1860854Z @%p23 tcgen05.st.sync.aligned.32x32b.x128.b32 [%r1254 + 0], {%r1255, %r1256, %r1257, %r1258, %r1259, %r1260, %r1261, %r1262, %r1263, %r1264, %r1265, %r1266, %r1267, %r1268, %r1269, %r1270, %r1271, %r1272, %r1273, %r1274, %r1275, %r1276, %r1277, %r1278, %r1279, %r1280, %r1281, %r1282, %r1283, %r1284, %r1285, %r1286, %r1287, %r1288, %r1289, %r1290, %r1291, %r1292, %r1293, %r1294, %r1295, %r1296, %r1297, %r1298, %r1299, %r1300, %r1301, %r1302, %r1303, %r1304, %r1305, %r1306, %r1307, %r1308, %r1309, %r1310, %r1311, %r1312, %r1313, 
%r1314, %r1315, %r1316, %r1317, %r1318, %r1319, %r1320, %r1321, %r1322, %r1323, %r1324, %r1325, %r1326, %r1327, %r1328, %r1329, %r1330, %r1331, %r1332, %r1333, %r1334, %r1335, %r1336, %r1337, %r1338, %r1339, %r1340, %r1341, %r1342, %r1343, %r1344, %r1345, %r1346, %r1347, %r1348, %r1349, %r1350, %r1351, %r1352, %r1353, %r1354, %r1355, %r1356, %r1357, %r1358, %r1359, %r1360, %r1361, %r1362, %r1363, %r1364, %r1365, %r1366, %r1367, %r1368, %r1369, %r1370, %r1371, %r1372, %r1373, %r1374, %r1375, %r1376, %r1377, %r1378, %r1379, %r1380, %r1381, %r1382}; 2026-02-21T08:31:48.1860919Z // end inline asm 2026-02-21T08:31:48.1860976Z // begin inline asm 2026-02-21T08:31:48.1861045Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:31:48.1861100Z // end inline asm 2026-02-21T08:31:48.1861162Z bar.sync 4, 256; 2026-02-21T08:31:48.1861217Z bar.sync 4, 256; 2026-02-21T08:31:48.1861275Z add.s32 %r1383, %r273, 803328; 2026-02-21T08:31:48.1861338Z // begin inline asm 2026-02-21T08:31:48.1861427Z @%p22 mbarrier.arrive.shared::cta.b64 _, [%r1383]; 2026-02-21T08:31:48.1861483Z // end inline asm 2026-02-21T08:31:48.1861579Z bar.sync 4, 256; 2026-02-21T08:31:48.1861641Z add.s32 %r1384, %r273, 803344; 2026-02-21T08:31:48.1861698Z // begin inline asm 2026-02-21T08:31:48.1861747Z 2026-02-21T08:31:48.1861806Z { 2026-02-21T08:31:48.1861867Z .reg .pred complete; 2026-02-21T08:31:48.1861920Z waitLoop: 2026-02-21T08:31:48.1862045Z mbarrier.try_wait.parity.shared.b64 complete, [%r1384], %r1983; 2026-02-21T08:31:48.1862112Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1862163Z } 2026-02-21T08:31:48.1862166Z 2026-02-21T08:31:48.1862219Z // end inline asm 2026-02-21T08:31:48.1862285Z xor.b32 %r1981, %r1981, 1; 2026-02-21T08:31:48.1862342Z add.s32 %r1386, %r273, 803552; 2026-02-21T08:31:48.1862397Z // begin inline asm 2026-02-21T08:31:48.1862453Z 2026-02-21T08:31:48.1862501Z { 2026-02-21T08:31:48.1862560Z .reg .pred complete; 2026-02-21T08:31:48.1862612Z waitLoop: 2026-02-21T08:31:48.1862734Z 
mbarrier.try_wait.parity.shared.b64 complete, [%r1386], %r1981; 2026-02-21T08:31:48.1862800Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1862848Z } 2026-02-21T08:31:48.1862851Z 2026-02-21T08:31:48.1862911Z // end inline asm 2026-02-21T08:31:48.1862966Z // begin inline asm 2026-02-21T08:31:48.1864581Z tcgen05.ld.sync.aligned.32x32b.x128.b32 {%r1388, %r1389, %r1390, %r1391, %r1392, %r1393, %r1394, %r1395, %r1396, %r1397, %r1398, %r1399, %r1400, %r1401, %r1402, %r1403, %r1404, %r1405, %r1406, %r1407, %r1408, %r1409, %r1410, %r1411, %r1412, %r1413, %r1414, %r1415, %r1416, %r1417, %r1418, %r1419, %r1420, %r1421, %r1422, %r1423, %r1424, %r1425, %r1426, %r1427, %r1428, %r1429, %r1430, %r1431, %r1432, %r1433, %r1434, %r1435, %r1436, %r1437, %r1438, %r1439, %r1440, %r1441, %r1442, %r1443, %r1444, %r1445, %r1446, %r1447, %r1448, %r1449, %r1450, %r1451, %r1452, %r1453, %r1454, %r1455, %r1456, %r1457, %r1458, %r1459, %r1460, %r1461, %r1462, %r1463, %r1464, %r1465, %r1466, %r1467, %r1468, %r1469, %r1470, %r1471, %r1472, %r1473, %r1474, %r1475, %r1476, %r1477, %r1478, %r1479, %r1480, %r1481, %r1482, %r1483, %r1484, %r1485, %r1486, %r1487, %r1488, %r1489, %r1490, %r1491, %r1492, %r1493, %r1494, %r1495, %r1496, %r1497, %r1498, %r1499, %r1500, %r1501, %r1502, %r1503, %r1504, %r1505, %r1506, %r1507, %r1508, %r1509, %r1510, %r1511, %r1512, %r1513, %r1514, %r1515}, [%r1516 + 0]; 2026-02-21T08:31:48.1864714Z // end inline asm 2026-02-21T08:31:48.1864771Z // begin inline asm 2026-02-21T08:31:48.1864893Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:31:48.1864953Z // end inline asm 2026-02-21T08:31:48.1865047Z st.shared.v4.b32 [%r73], {%r1388, %r1389, %r1390, %r1391}; 2026-02-21T08:31:48.1865142Z st.shared.v4.b32 [%r73+16], {%r1392, %r1393, %r1394, %r1395}; 2026-02-21T08:31:48.1865240Z st.shared.v4.b32 [%r73+32], {%r1396, %r1397, %r1398, %r1399}; 2026-02-21T08:31:48.1865331Z st.shared.v4.b32 [%r73+48], {%r1400, %r1401, %r1402, %r1403}; 2026-02-21T08:31:48.1865421Z st.shared.v4.b32 
[%r73+64], {%r1404, %r1405, %r1406, %r1407}; 2026-02-21T08:31:48.1865520Z st.shared.v4.b32 [%r73+80], {%r1408, %r1409, %r1410, %r1411}; 2026-02-21T08:31:48.1865610Z st.shared.v4.b32 [%r73+96], {%r1412, %r1413, %r1414, %r1415}; 2026-02-21T08:31:48.1865706Z st.shared.v4.b32 [%r73+112], {%r1416, %r1417, %r1418, %r1419}; 2026-02-21T08:31:48.1865807Z st.shared.v4.b32 [%r73+128], {%r1420, %r1421, %r1422, %r1423}; 2026-02-21T08:31:48.1865903Z st.shared.v4.b32 [%r73+144], {%r1424, %r1425, %r1426, %r1427}; 2026-02-21T08:31:48.1865996Z st.shared.v4.b32 [%r73+160], {%r1428, %r1429, %r1430, %r1431}; 2026-02-21T08:31:48.1866094Z st.shared.v4.b32 [%r73+176], {%r1432, %r1433, %r1434, %r1435}; 2026-02-21T08:31:48.1866187Z st.shared.v4.b32 [%r73+192], {%r1436, %r1437, %r1438, %r1439}; 2026-02-21T08:31:48.1866278Z st.shared.v4.b32 [%r73+208], {%r1440, %r1441, %r1442, %r1443}; 2026-02-21T08:31:48.1866370Z st.shared.v4.b32 [%r73+224], {%r1444, %r1445, %r1446, %r1447}; 2026-02-21T08:31:48.1866469Z st.shared.v4.b32 [%r73+240], {%r1448, %r1449, %r1450, %r1451}; 2026-02-21T08:31:48.1866561Z st.shared.v4.b32 [%r73+256], {%r1452, %r1453, %r1454, %r1455}; 2026-02-21T08:31:48.1866654Z st.shared.v4.b32 [%r73+272], {%r1456, %r1457, %r1458, %r1459}; 2026-02-21T08:31:48.1866752Z st.shared.v4.b32 [%r73+288], {%r1460, %r1461, %r1462, %r1463}; 2026-02-21T08:31:48.1866843Z st.shared.v4.b32 [%r73+304], {%r1464, %r1465, %r1466, %r1467}; 2026-02-21T08:31:48.1866936Z st.shared.v4.b32 [%r73+320], {%r1468, %r1469, %r1470, %r1471}; 2026-02-21T08:31:48.1867036Z st.shared.v4.b32 [%r73+336], {%r1472, %r1473, %r1474, %r1475}; 2026-02-21T08:31:48.1867128Z st.shared.v4.b32 [%r73+352], {%r1476, %r1477, %r1478, %r1479}; 2026-02-21T08:31:48.1867220Z st.shared.v4.b32 [%r73+368], {%r1480, %r1481, %r1482, %r1483}; 2026-02-21T08:31:48.1867311Z st.shared.v4.b32 [%r73+384], {%r1484, %r1485, %r1486, %r1487}; 2026-02-21T08:31:48.1867409Z st.shared.v4.b32 [%r73+400], {%r1488, %r1489, %r1490, %r1491}; 
2026-02-21T08:31:48.1867500Z st.shared.v4.b32 [%r73+416], {%r1492, %r1493, %r1494, %r1495}; 2026-02-21T08:31:48.1867591Z st.shared.v4.b32 [%r73+432], {%r1496, %r1497, %r1498, %r1499}; 2026-02-21T08:31:48.1867690Z st.shared.v4.b32 [%r73+448], {%r1500, %r1501, %r1502, %r1503}; 2026-02-21T08:31:48.1867782Z st.shared.v4.b32 [%r73+464], {%r1504, %r1505, %r1506, %r1507}; 2026-02-21T08:31:48.1867873Z st.shared.v4.b32 [%r73+480], {%r1508, %r1509, %r1510, %r1511}; 2026-02-21T08:31:48.1867974Z st.shared.v4.b32 [%r73+496], {%r1512, %r1513, %r1514, %r1515}; 2026-02-21T08:31:48.1868080Z bar.sync 4, 256; 2026-02-21T08:31:48.1868140Z add.s32 %r1517, %r273, 803568; 2026-02-21T08:31:48.1868204Z // begin inline asm 2026-02-21T08:31:48.1868293Z @%p22 mbarrier.arrive.shared::cta.b64 _, [%r1517]; 2026-02-21T08:31:48.1868349Z // end inline asm 2026-02-21T08:31:48.1868408Z xor.b32 %r1983, %r1983, 1; 2026-02-21T08:31:48.1868469Z $L__tmp25: 2026-02-21T08:31:48.1868642Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1868703Z add.s32 %r1980, %r1980, -1; 2026-02-21T08:31:48.1868773Z setp.ne.b32 %p26, %r1980, 0; 2026-02-21T08:31:48.1868833Z @%p26 bra $L__BB0_28; 2026-02-21T08:31:48.1868920Z $L__BB0_29: // %._crit_edge490 2026-02-21T08:31:48.1869013Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1869082Z barrier.sync 1; 2026-02-21T08:31:48.1869141Z bra.uni $L__BB0_2; 2026-02-21T08:31:48.1869279Z $L__BB0_30: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1869460Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1869541Z ld.shared.b32 %r1984, [global_smem+393216]; 2026-02-21T08:31:48.1869598Z barrier.sync 1; 2026-02-21T08:31:48.1869674Z setp.lt.s32 %p15, %r1984, 1; 2026-02-21T08:31:48.1869735Z @%p15 bra $L__BB0_33; 2026-02-21T08:31:48.1869818Z // %bb.31: // %.lr.ph486 2026-02-21T08:31:48.1869908Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1870088Z .loc 1 0 145 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:0:145 2026-02-21T08:31:48.1870149Z add.s32 %r83, %r1, -768; 2026-02-21T08:31:48.1870207Z shr.u32 %r84, %r83, 5; 2026-02-21T08:31:48.1870273Z shl.b32 %r967, %r1, 10; 2026-02-21T08:31:48.1870334Z and.b32 %r968, %r967, 130048; 2026-02-21T08:31:48.1870394Z shl.b32 %r969, %r1, 2; 2026-02-21T08:31:48.1870462Z and.b32 %r970, %r969, 512; 2026-02-21T08:31:48.1870520Z or.b32 %r971, %r968, %r970; 2026-02-21T08:31:48.1870578Z add.s32 %r973, %r273, %r971; 2026-02-21T08:31:48.1870637Z add.s32 %r85, %r973, 131072; 2026-02-21T08:31:48.1870702Z add.s32 %r86, %r973, 262144; 2026-02-21T08:31:48.1870757Z mov.b32 %r1986, 1; 2026-02-21T08:31:48.1870812Z mov.b32 %r1985, 0; 2026-02-21T08:31:48.1870878Z mov.b32 %r1987, %r1985; 2026-02-21T08:31:48.1870975Z $L__BB0_32: // Parent Loop BB0_2 Depth=1 2026-02-21T08:31:48.1871067Z // => This Inner Loop Header: Depth=2 2026-02-21T08:31:48.1871128Z setp.eq.b32 %p16, %r83, 0; 2026-02-21T08:31:48.1871187Z $L__tmp26: 2026-02-21T08:31:48.1871403Z .loc 2 291 36 // standard.py:291:36 @[ cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:84:40 ] 2026-02-21T08:31:48.1871461Z xor.b32 %r1986, %r1986, 1; 2026-02-21T08:31:48.1871526Z bar.sync 5, 256; 2026-02-21T08:31:48.1871619Z add.s32 %r974, %r273, 803568; 2026-02-21T08:31:48.1871676Z // begin inline asm 2026-02-21T08:31:48.1871733Z 2026-02-21T08:31:48.1871782Z { 2026-02-21T08:31:48.1871840Z .reg .pred complete; 2026-02-21T08:31:48.1871894Z waitLoop: 2026-02-21T08:31:48.1872017Z mbarrier.try_wait.parity.shared.b64 complete, [%r974], %r1986; 2026-02-21T08:31:48.1872081Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1872132Z } 2026-02-21T08:31:48.1872135Z 2026-02-21T08:31:48.1872195Z // end inline asm 2026-02-21T08:31:48.1872285Z ld.shared.v4.b32 {%r978, %r979, %r980, %r981}, [%r85]; 2026-02-21T08:31:48.1872379Z ld.shared.v4.b32 {%r982, %r983, %r984, %r985}, [%r85+16]; 2026-02-21T08:31:48.1872472Z ld.shared.v4.b32 {%r986, %r987, %r988, %r989}, 
[%r85+32]; 2026-02-21T08:31:48.1872567Z ld.shared.v4.b32 {%r990, %r991, %r992, %r993}, [%r85+48]; 2026-02-21T08:31:48.1872654Z ld.shared.v4.b32 {%r994, %r995, %r996, %r997}, [%r85+64]; 2026-02-21T08:31:48.1872753Z ld.shared.v4.b32 {%r998, %r999, %r1000, %r1001}, [%r85+80]; 2026-02-21T08:31:48.1872907Z ld.shared.v4.b32 {%r1002, %r1003, %r1004, %r1005}, [%r85+96]; 2026-02-21T08:31:48.1873003Z ld.shared.v4.b32 {%r1006, %r1007, %r1008, %r1009}, [%r85+112]; 2026-02-21T08:31:48.1873096Z ld.shared.v4.b32 {%r1010, %r1011, %r1012, %r1013}, [%r85+128]; 2026-02-21T08:31:48.1873194Z ld.shared.v4.b32 {%r1014, %r1015, %r1016, %r1017}, [%r85+144]; 2026-02-21T08:31:48.1873285Z ld.shared.v4.b32 {%r1018, %r1019, %r1020, %r1021}, [%r85+160]; 2026-02-21T08:31:48.1873376Z ld.shared.v4.b32 {%r1022, %r1023, %r1024, %r1025}, [%r85+176]; 2026-02-21T08:31:48.1873474Z ld.shared.v4.b32 {%r1026, %r1027, %r1028, %r1029}, [%r85+192]; 2026-02-21T08:31:48.1873565Z ld.shared.v4.b32 {%r1030, %r1031, %r1032, %r1033}, [%r85+208]; 2026-02-21T08:31:48.1873655Z ld.shared.v4.b32 {%r1034, %r1035, %r1036, %r1037}, [%r85+224]; 2026-02-21T08:31:48.1873746Z ld.shared.v4.b32 {%r1038, %r1039, %r1040, %r1041}, [%r85+240]; 2026-02-21T08:31:48.1873891Z ld.shared.v4.b32 {%r1042, %r1043, %r1044, %r1045}, [%r85+256]; 2026-02-21T08:31:48.1873986Z ld.shared.v4.b32 {%r1046, %r1047, %r1048, %r1049}, [%r85+272]; 2026-02-21T08:31:48.1874076Z ld.shared.v4.b32 {%r1050, %r1051, %r1052, %r1053}, [%r85+288]; 2026-02-21T08:31:48.1874174Z ld.shared.v4.b32 {%r1054, %r1055, %r1056, %r1057}, [%r85+304]; 2026-02-21T08:31:48.1874265Z ld.shared.v4.b32 {%r1058, %r1059, %r1060, %r1061}, [%r85+320]; 2026-02-21T08:31:48.1874356Z ld.shared.v4.b32 {%r1062, %r1063, %r1064, %r1065}, [%r85+336]; 2026-02-21T08:31:48.1874453Z ld.shared.v4.b32 {%r1066, %r1067, %r1068, %r1069}, [%r85+352]; 2026-02-21T08:31:48.1874545Z ld.shared.v4.b32 {%r1070, %r1071, %r1072, %r1073}, [%r85+368]; 2026-02-21T08:31:48.1874636Z ld.shared.v4.b32 {%r1074, %r1075, %r1076, 
%r1077}, [%r85+384]; 2026-02-21T08:31:48.1874732Z ld.shared.v4.b32 {%r1078, %r1079, %r1080, %r1081}, [%r85+400]; 2026-02-21T08:31:48.1874822Z ld.shared.v4.b32 {%r1082, %r1083, %r1084, %r1085}, [%r85+416]; 2026-02-21T08:31:48.1874916Z ld.shared.v4.b32 {%r1086, %r1087, %r1088, %r1089}, [%r85+432]; 2026-02-21T08:31:48.1875009Z ld.shared.v4.b32 {%r1090, %r1091, %r1092, %r1093}, [%r85+448]; 2026-02-21T08:31:48.1875108Z ld.shared.v4.b32 {%r1094, %r1095, %r1096, %r1097}, [%r85+464]; 2026-02-21T08:31:48.1875198Z ld.shared.v4.b32 {%r1098, %r1099, %r1100, %r1101}, [%r85+480]; 2026-02-21T08:31:48.1875287Z ld.shared.v4.b32 {%r1102, %r1103, %r1104, %r1105}, [%r85+496]; 2026-02-21T08:31:48.1875348Z bar.sync 5, 256; 2026-02-21T08:31:48.1875408Z add.s32 %r976, %r273, 803552; 2026-02-21T08:31:48.1875467Z // begin inline asm 2026-02-21T08:31:48.1875562Z @%p16 mbarrier.arrive.shared::cta.b64 _, [%r976]; 2026-02-21T08:31:48.1875617Z // end inline asm 2026-02-21T08:31:48.1875693Z shfl.sync.idx.b32 %r1242, %r84, 0, 31, -1; 2026-02-21T08:31:48.1875753Z mov.pred %p17, -1; 2026-02-21T08:31:48.1875816Z // begin inline asm 2026-02-21T08:31:48.1877423Z @%p17 tcgen05.st.sync.aligned.32x32b.x128.b32 [%r977 + 0], {%r978, %r979, %r980, %r981, %r982, %r983, %r984, %r985, %r986, %r987, %r988, %r989, %r990, %r991, %r992, %r993, %r994, %r995, %r996, %r997, %r998, %r999, %r1000, %r1001, %r1002, %r1003, %r1004, %r1005, %r1006, %r1007, %r1008, %r1009, %r1010, %r1011, %r1012, %r1013, %r1014, %r1015, %r1016, %r1017, %r1018, %r1019, %r1020, %r1021, %r1022, %r1023, %r1024, %r1025, %r1026, %r1027, %r1028, %r1029, %r1030, %r1031, %r1032, %r1033, %r1034, %r1035, %r1036, %r1037, %r1038, %r1039, %r1040, %r1041, %r1042, %r1043, %r1044, %r1045, %r1046, %r1047, %r1048, %r1049, %r1050, %r1051, %r1052, %r1053, %r1054, %r1055, %r1056, %r1057, %r1058, %r1059, %r1060, %r1061, %r1062, %r1063, %r1064, %r1065, %r1066, %r1067, %r1068, %r1069, %r1070, %r1071, %r1072, %r1073, %r1074, %r1075, %r1076, %r1077, %r1078, %r1079, 
%r1080, %r1081, %r1082, %r1083, %r1084, %r1085, %r1086, %r1087, %r1088, %r1089, %r1090, %r1091, %r1092, %r1093, %r1094, %r1095, %r1096, %r1097, %r1098, %r1099, %r1100, %r1101, %r1102, %r1103, %r1104, %r1105}; 2026-02-21T08:31:48.1877491Z // end inline asm 2026-02-21T08:31:48.1877585Z // begin inline asm 2026-02-21T08:31:48.1877654Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:31:48.1877716Z // end inline asm 2026-02-21T08:31:48.1877771Z bar.sync 5, 256; 2026-02-21T08:31:48.1877825Z bar.sync 5, 256; 2026-02-21T08:31:48.1877884Z add.s32 %r1106, %r273, 803392; 2026-02-21T08:31:48.1877948Z // begin inline asm 2026-02-21T08:31:48.1878038Z @%p16 mbarrier.arrive.shared::cta.b64 _, [%r1106]; 2026-02-21T08:31:48.1878093Z // end inline asm 2026-02-21T08:31:48.1878153Z bar.sync 5, 256; 2026-02-21T08:31:48.1878214Z add.s32 %r1107, %r273, 803408; 2026-02-21T08:31:48.1878271Z // begin inline asm 2026-02-21T08:31:48.1878329Z 2026-02-21T08:31:48.1878380Z { 2026-02-21T08:31:48.1878439Z .reg .pred complete; 2026-02-21T08:31:48.1878493Z waitLoop: 2026-02-21T08:31:48.1878619Z mbarrier.try_wait.parity.shared.b64 complete, [%r1107], %r1987; 2026-02-21T08:31:48.1878685Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1878735Z } 2026-02-21T08:31:48.1878791Z 2026-02-21T08:31:48.1878855Z // end inline asm 2026-02-21T08:31:48.1878915Z xor.b32 %r1985, %r1985, 1; 2026-02-21T08:31:48.1878972Z add.s32 %r1109, %r273, 803584; 2026-02-21T08:31:48.1879027Z // begin inline asm 2026-02-21T08:31:48.1879083Z 2026-02-21T08:31:48.1879131Z { 2026-02-21T08:31:48.1879189Z .reg .pred complete; 2026-02-21T08:31:48.1879250Z waitLoop: 2026-02-21T08:31:48.1879363Z mbarrier.try_wait.parity.shared.b64 complete, [%r1109], %r1985; 2026-02-21T08:31:48.1879427Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1879476Z } 2026-02-21T08:31:48.1879480Z 2026-02-21T08:31:48.1879541Z // end inline asm 2026-02-21T08:31:48.1879597Z // begin inline asm 2026-02-21T08:31:48.1881312Z tcgen05.ld.sync.aligned.32x32b.x128.b32 
{%r1111, %r1112, %r1113, %r1114, %r1115, %r1116, %r1117, %r1118, %r1119, %r1120, %r1121, %r1122, %r1123, %r1124, %r1125, %r1126, %r1127, %r1128, %r1129, %r1130, %r1131, %r1132, %r1133, %r1134, %r1135, %r1136, %r1137, %r1138, %r1139, %r1140, %r1141, %r1142, %r1143, %r1144, %r1145, %r1146, %r1147, %r1148, %r1149, %r1150, %r1151, %r1152, %r1153, %r1154, %r1155, %r1156, %r1157, %r1158, %r1159, %r1160, %r1161, %r1162, %r1163, %r1164, %r1165, %r1166, %r1167, %r1168, %r1169, %r1170, %r1171, %r1172, %r1173, %r1174, %r1175, %r1176, %r1177, %r1178, %r1179, %r1180, %r1181, %r1182, %r1183, %r1184, %r1185, %r1186, %r1187, %r1188, %r1189, %r1190, %r1191, %r1192, %r1193, %r1194, %r1195, %r1196, %r1197, %r1198, %r1199, %r1200, %r1201, %r1202, %r1203, %r1204, %r1205, %r1206, %r1207, %r1208, %r1209, %r1210, %r1211, %r1212, %r1213, %r1214, %r1215, %r1216, %r1217, %r1218, %r1219, %r1220, %r1221, %r1222, %r1223, %r1224, %r1225, %r1226, %r1227, %r1228, %r1229, %r1230, %r1231, %r1232, %r1233, %r1234, %r1235, %r1236, %r1237, %r1238}, [%r1239 + 0]; 2026-02-21T08:31:48.1881372Z // end inline asm 2026-02-21T08:31:48.1881430Z // begin inline asm 2026-02-21T08:31:48.1881507Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:31:48.1881608Z // end inline asm 2026-02-21T08:31:48.1881712Z st.shared.v4.b32 [%r86], {%r1111, %r1112, %r1113, %r1114}; 2026-02-21T08:31:48.1881824Z st.shared.v4.b32 [%r86+16], {%r1115, %r1116, %r1117, %r1118}; 2026-02-21T08:31:48.1881923Z st.shared.v4.b32 [%r86+32], {%r1119, %r1120, %r1121, %r1122}; 2026-02-21T08:31:48.1882019Z st.shared.v4.b32 [%r86+48], {%r1123, %r1124, %r1125, %r1126}; 2026-02-21T08:31:48.1882124Z st.shared.v4.b32 [%r86+64], {%r1127, %r1128, %r1129, %r1130}; 2026-02-21T08:31:48.1882220Z st.shared.v4.b32 [%r86+80], {%r1131, %r1132, %r1133, %r1134}; 2026-02-21T08:31:48.1882314Z st.shared.v4.b32 [%r86+96], {%r1135, %r1136, %r1137, %r1138}; 2026-02-21T08:31:48.1882415Z st.shared.v4.b32 [%r86+112], {%r1139, %r1140, %r1141, %r1142}; 2026-02-21T08:31:48.1882522Z 
st.shared.v4.b32 [%r86+128], {%r1143, %r1144, %r1145, %r1146}; 2026-02-21T08:31:48.1882621Z st.shared.v4.b32 [%r86+144], {%r1147, %r1148, %r1149, %r1150}; 2026-02-21T08:31:48.1882719Z st.shared.v4.b32 [%r86+160], {%r1151, %r1152, %r1153, %r1154}; 2026-02-21T08:31:48.1882827Z st.shared.v4.b32 [%r86+176], {%r1155, %r1156, %r1157, %r1158}; 2026-02-21T08:31:48.1882981Z st.shared.v4.b32 [%r86+192], {%r1159, %r1160, %r1161, %r1162}; 2026-02-21T08:31:48.1883078Z st.shared.v4.b32 [%r86+208], {%r1163, %r1164, %r1165, %r1166}; 2026-02-21T08:31:48.1883185Z st.shared.v4.b32 [%r86+224], {%r1167, %r1168, %r1169, %r1170}; 2026-02-21T08:31:48.1883282Z st.shared.v4.b32 [%r86+240], {%r1171, %r1172, %r1173, %r1174}; 2026-02-21T08:31:48.1883378Z st.shared.v4.b32 [%r86+256], {%r1175, %r1176, %r1177, %r1178}; 2026-02-21T08:31:48.1883476Z st.shared.v4.b32 [%r86+272], {%r1179, %r1180, %r1181, %r1182}; 2026-02-21T08:31:48.1883582Z st.shared.v4.b32 [%r86+288], {%r1183, %r1184, %r1185, %r1186}; 2026-02-21T08:31:48.1883679Z st.shared.v4.b32 [%r86+304], {%r1187, %r1188, %r1189, %r1190}; 2026-02-21T08:31:48.1883778Z st.shared.v4.b32 [%r86+320], {%r1191, %r1192, %r1193, %r1194}; 2026-02-21T08:31:48.1883883Z st.shared.v4.b32 [%r86+336], {%r1195, %r1196, %r1197, %r1198}; 2026-02-21T08:31:48.1884024Z st.shared.v4.b32 [%r86+352], {%r1199, %r1200, %r1201, %r1202}; 2026-02-21T08:31:48.1884123Z st.shared.v4.b32 [%r86+368], {%r1203, %r1204, %r1205, %r1206}; 2026-02-21T08:31:48.1884228Z st.shared.v4.b32 [%r86+384], {%r1207, %r1208, %r1209, %r1210}; 2026-02-21T08:31:48.1884324Z st.shared.v4.b32 [%r86+400], {%r1211, %r1212, %r1213, %r1214}; 2026-02-21T08:31:48.1884420Z st.shared.v4.b32 [%r86+416], {%r1215, %r1216, %r1217, %r1218}; 2026-02-21T08:31:48.1884523Z st.shared.v4.b32 [%r86+432], {%r1219, %r1220, %r1221, %r1222}; 2026-02-21T08:31:48.1884620Z st.shared.v4.b32 [%r86+448], {%r1223, %r1224, %r1225, %r1226}; 2026-02-21T08:31:48.1884715Z st.shared.v4.b32 [%r86+464], {%r1227, %r1228, %r1229, %r1230}; 
2026-02-21T08:31:48.1884811Z st.shared.v4.b32 [%r86+480], {%r1231, %r1232, %r1233, %r1234}; 2026-02-21T08:31:48.1884911Z st.shared.v4.b32 [%r86+496], {%r1235, %r1236, %r1237, %r1238}; 2026-02-21T08:31:48.1884968Z bar.sync 5, 256; 2026-02-21T08:31:48.1885030Z add.s32 %r1240, %r273, 803600; 2026-02-21T08:31:48.1885099Z // begin inline asm 2026-02-21T08:31:48.1885193Z @%p16 mbarrier.arrive.shared::cta.b64 _, [%r1240]; 2026-02-21T08:31:48.1885250Z // end inline asm 2026-02-21T08:31:48.1885317Z xor.b32 %r1987, %r1987, 1; 2026-02-21T08:31:48.1885373Z $L__tmp27: 2026-02-21T08:31:48.1885550Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1885611Z add.s32 %r1984, %r1984, -1; 2026-02-21T08:31:48.1885684Z setp.ne.b32 %p20, %r1984, 0; 2026-02-21T08:31:48.1885744Z @%p20 bra $L__BB0_32; 2026-02-21T08:31:48.1885835Z $L__BB0_33: // %._crit_edge487 2026-02-21T08:31:48.1885938Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1885998Z barrier.sync 1; 2026-02-21T08:31:48.1886058Z bra.uni $L__BB0_2; 2026-02-21T08:31:48.1886163Z $L__BB0_20: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1886341Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1886427Z ld.shared.b32 %r1971, [global_smem+393216]; 2026-02-21T08:31:48.1886486Z barrier.sync 1; 2026-02-21T08:31:48.1886667Z .loc 1 21 66 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:21:66 2026-02-21T08:31:48.1886730Z mov.u32 %r49, %ctaid.x; 2026-02-21T08:31:48.1886791Z mov.u32 %r1520, %ctaid.y; 2026-02-21T08:31:48.1886859Z mov.u32 %r1521, %ctaid.z; 2026-02-21T08:31:48.1886920Z mov.u32 %r1522, %nctaid.x; 2026-02-21T08:31:48.1886981Z mov.u32 %r1523, %nctaid.y; 2026-02-21T08:31:48.1887057Z mad.lo.s32 %r1524, %r1521, %r1523, %r1520; 2026-02-21T08:31:48.1887137Z mad.lo.s32 %r1525, %r1524, %r1522, %r49; 2026-02-21T08:31:48.1887196Z shl.b32 %r1526, %r1525, 7; 2026-02-21T08:31:48.1887259Z cvt.s64.s32 %rd37, %r1526; 
2026-02-21T08:31:48.1887329Z add.s64 %rd38, %rd20, %rd37; 2026-02-21T08:31:48.1887396Z cvta.global.u64 %rd39, %rd38; 2026-02-21T08:31:48.1887576Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1887688Z setp.lt.s32 %p27, %r1971, 1; 2026-02-21T08:31:48.1887749Z @%p27 bra $L__BB0_25; 2026-02-21T08:31:48.1887834Z // %bb.21: // %.lr.ph492 2026-02-21T08:31:48.1887925Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1887997Z add.s32 %r1979, %r49, -18944; 2026-02-21T08:31:48.1888060Z add.s32 %r1529, %r1, -1280; 2026-02-21T08:31:48.1888122Z shr.u32 %r51, %r1529, 5; 2026-02-21T08:31:48.1888190Z mov.b32 %r1977, -1; 2026-02-21T08:31:48.1888248Z mov.b32 %r1972, 0; 2026-02-21T08:31:48.1888308Z mov.b32 %r1973, %r1972; 2026-02-21T08:31:48.1888369Z mov.b32 %r1978, %r1972; 2026-02-21T08:31:48.1888439Z mov.b32 %r1975, %r1972; 2026-02-21T08:31:48.1888500Z bra.uni $L__BB0_22; 2026-02-21T08:31:48.1888611Z $L__BB0_24: // in Loop: Header=BB0_22 Depth=2 2026-02-21T08:31:48.1888830Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1888893Z shl.b32 %r1563, %r1973, 3; 2026-02-21T08:31:48.1888954Z add.s32 %r1565, %r273, %r1563; 2026-02-21T08:31:48.1889022Z add.s32 %r1535, %r1565, 802816; 2026-02-21T08:31:48.1889187Z .loc 1 54 33 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:54:33 2026-02-21T08:31:48.1889244Z // begin inline asm 2026-02-21T08:31:48.1889294Z 2026-02-21T08:31:48.1889351Z { 2026-02-21T08:31:48.1889411Z .reg .pred complete; 2026-02-21T08:31:48.1889464Z waitLoop: 2026-02-21T08:31:48.1889586Z mbarrier.try_wait.parity.shared.b64 complete, [%r1535], %r1972; 2026-02-21T08:31:48.1889650Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1889698Z } 2026-02-21T08:31:48.1889702Z 2026-02-21T08:31:48.1889763Z // end inline asm 2026-02-21T08:31:48.1889931Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1889996Z add.s32 
%r1541, %r1565, 802864; 2026-02-21T08:31:48.1890054Z mov.pred %p31, 0; 2026-02-21T08:31:48.1890221Z .loc 1 54 33 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:54:33 2026-02-21T08:31:48.1890277Z // begin inline asm 2026-02-21T08:31:48.1890390Z @%p31 mbarrier.arrive.expect_tx.shared.b64 _, [%r1541], 4096; 2026-02-21T08:31:48.1890453Z // end inline asm 2026-02-21T08:31:48.1890510Z shl.b32 %r1566, %r1973, 12; 2026-02-21T08:31:48.1890566Z bar.sync 3, 64; 2026-02-21T08:31:48.1890641Z shfl.sync.idx.b32 %r1567, %r51, 0, 31, -1; 2026-02-21T08:31:48.1890711Z elect.sync %r1568|%p39, -1; 2026-02-21T08:31:48.1890768Z and.b32 %r1569, %r1567, 1; 2026-02-21T08:31:48.1890824Z shl.b32 %r1570, %r1569, 11; 2026-02-21T08:31:48.1890890Z add.s32 %r1571, %r273, %r1566; 2026-02-21T08:31:48.1890951Z add.s32 %r1572, %r1571, %r1570; 2026-02-21T08:31:48.1891010Z add.s32 %r1538, %r1572, 688128; 2026-02-21T08:31:48.1891077Z shl.b32 %r1573, %r1569, 7; 2026-02-21T08:31:48.1891138Z or.b32 %r1539, %r1573, %r1978; 2026-02-21T08:31:48.1891195Z // begin inline asm 2026-02-21T08:31:48.1891448Z @%p31 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1538], [%rd39, {%r1539, %r1540}], [%r1541]; 2026-02-21T08:31:48.1891511Z // end inline asm 2026-02-21T08:31:48.1891716Z .loc 1 41 125 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:41:125 2026-02-21T08:31:48.1891777Z add.s32 %r1547, %r1540, 16; 2026-02-21T08:31:48.1891954Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1892013Z add.s32 %r1542, %r1565, 802912; 2026-02-21T08:31:48.1892178Z .loc 1 54 33 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:54:33 2026-02-21T08:31:48.1892241Z // begin inline asm 2026-02-21T08:31:48.1892290Z 2026-02-21T08:31:48.1892338Z { 2026-02-21T08:31:48.1892397Z .reg .pred complete; 2026-02-21T08:31:48.1892459Z waitLoop: 2026-02-21T08:31:48.1892648Z mbarrier.try_wait.parity.shared.b64 complete, [%r1542], %r1972; 
2026-02-21T08:31:48.1892714Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1892772Z } 2026-02-21T08:31:48.1892775Z 2026-02-21T08:31:48.1892831Z // end inline asm 2026-02-21T08:31:48.1892995Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1893062Z add.s32 %r1548, %r1565, 802960; 2026-02-21T08:31:48.1893224Z .loc 1 54 33 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:54:33 2026-02-21T08:31:48.1893282Z // begin inline asm 2026-02-21T08:31:48.1893391Z @%p31 mbarrier.arrive.expect_tx.shared.b64 _, [%r1548], 4096; 2026-02-21T08:31:48.1893455Z // end inline asm 2026-02-21T08:31:48.1893510Z bar.sync 3, 64; 2026-02-21T08:31:48.1893574Z elect.sync %r1574|%p40, -1; 2026-02-21T08:31:48.1893641Z add.s32 %r1545, %r1572, 712704; 2026-02-21T08:31:48.1893750Z // begin inline asm 2026-02-21T08:31:48.1894006Z @%p31 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1545], [%rd39, {%r1539, %r1547}], [%r1548]; 2026-02-21T08:31:48.1894071Z // end inline asm 2026-02-21T08:31:48.1894241Z .loc 1 41 125 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:41:125 2026-02-21T08:31:48.1894299Z add.s32 %r1554, %r1540, 32; 2026-02-21T08:31:48.1894465Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1894530Z add.s32 %r1549, %r1565, 803008; 2026-02-21T08:31:48.1894690Z .loc 1 54 33 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:54:33 2026-02-21T08:31:48.1894746Z // begin inline asm 2026-02-21T08:31:48.1894803Z 2026-02-21T08:31:48.1894853Z { 2026-02-21T08:31:48.1894911Z .reg .pred complete; 2026-02-21T08:31:48.1894970Z waitLoop: 2026-02-21T08:31:48.1895089Z mbarrier.try_wait.parity.shared.b64 complete, [%r1549], %r1972; 2026-02-21T08:31:48.1895155Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1895204Z } 2026-02-21T08:31:48.1895206Z 2026-02-21T08:31:48.1895268Z // end inline asm 2026-02-21T08:31:48.1895433Z .loc 1 26 145 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1895491Z add.s32 %r1555, %r1565, 803056; 2026-02-21T08:31:48.1895659Z .loc 1 54 33 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:54:33 2026-02-21T08:31:48.1895715Z // begin inline asm 2026-02-21T08:31:48.1895822Z @%p31 mbarrier.arrive.expect_tx.shared.b64 _, [%r1555], 4096; 2026-02-21T08:31:48.1895885Z // end inline asm 2026-02-21T08:31:48.1895940Z bar.sync 3, 64; 2026-02-21T08:31:48.1896003Z elect.sync %r1575|%p41, -1; 2026-02-21T08:31:48.1896062Z add.s32 %r1552, %r1572, 737280; 2026-02-21T08:31:48.1896126Z // begin inline asm 2026-02-21T08:31:48.1896372Z @%p31 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1552], [%rd39, {%r1539, %r1554}], [%r1555]; 2026-02-21T08:31:48.1896429Z // end inline asm 2026-02-21T08:31:48.1896604Z .loc 1 41 125 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:41:125 2026-02-21T08:31:48.1896662Z add.s32 %r1561, %r1540, 48; 2026-02-21T08:31:48.1896831Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1896897Z add.s32 %r1556, %r1565, 803104; 2026-02-21T08:31:48.1897061Z .loc 1 54 33 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:54:33 2026-02-21T08:31:48.1897118Z // begin inline asm 2026-02-21T08:31:48.1897168Z 2026-02-21T08:31:48.1897227Z { 2026-02-21T08:31:48.1897287Z .reg .pred complete; 2026-02-21T08:31:48.1897340Z waitLoop: 2026-02-21T08:31:48.1897468Z mbarrier.try_wait.parity.shared.b64 complete, [%r1556], %r1972; 2026-02-21T08:31:48.1897532Z @!complete bra.uni waitLoop; 2026-02-21T08:31:48.1897582Z } 2026-02-21T08:31:48.1897587Z 2026-02-21T08:31:48.1897695Z // end inline asm 2026-02-21T08:31:48.1897859Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1897917Z add.s32 %r1562, %r1565, 803152; 2026-02-21T08:31:48.1898077Z .loc 1 54 33 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:54:33 2026-02-21T08:31:48.1898139Z // begin inline asm 2026-02-21T08:31:48.1898244Z @%p31 mbarrier.arrive.expect_tx.shared.b64 _, [%r1562], 4096; 2026-02-21T08:31:48.1898298Z // end inline asm 2026-02-21T08:31:48.1898360Z bar.sync 3, 64; 2026-02-21T08:31:48.1898422Z elect.sync %r1576|%p42, -1; 2026-02-21T08:31:48.1898480Z add.s32 %r1559, %r1572, 761856; 2026-02-21T08:31:48.1898543Z // begin inline asm 2026-02-21T08:31:48.1898784Z @%p31 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1559], [%rd39, {%r1539, %r1561}], [%r1562]; 2026-02-21T08:31:48.1898839Z // end inline asm 2026-02-21T08:31:48.1899067Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1899134Z add.s32 %r1975, %r1540, 64; 2026-02-21T08:31:48.1899299Z .loc 1 54 33 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:54:33 2026-02-21T08:31:48.1899358Z add.s32 %r1577, %r1973, 1; 2026-02-21T08:31:48.1899430Z setp.eq.b32 %p43, %r1577, 6; 2026-02-21T08:31:48.1899496Z selp.b32 %r1973, 0, %r1577, %p43; 2026-02-21T08:31:48.1899557Z selp.b32 %r1578, 1, 0, %p43; 2026-02-21T08:31:48.1899622Z xor.b32 %r1972, %r1972, %r1578; 2026-02-21T08:31:48.1899788Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1899846Z add.s32 %r1971, %r1971, -1; 2026-02-21T08:31:48.1899908Z setp.ne.b32 %p44, %r1971, 0; 2026-02-21T08:31:48.1899974Z @%p44 bra $L__BB0_22; 2026-02-21T08:31:48.1900030Z bra.uni $L__BB0_25; 2026-02-21T08:31:48.1900134Z $L__BB0_22: // Parent Loop BB0_2 Depth=1 2026-02-21T08:31:48.1900238Z // => This Inner Loop Header: Depth=2 2026-02-21T08:31:48.1900298Z add.s32 %r1530, %r1977, 1; 2026-02-21T08:31:48.1900358Z setp.eq.b32 %p28, %r1977, 7; 2026-02-21T08:31:48.1900429Z selp.b32 %r1977, 0, %r1530, %p28; 2026-02-21T08:31:48.1900489Z setp.ne.b32 %p29, %r1977, 0; 2026-02-21T08:31:48.1900548Z setp.eq.b32 %p30, %r1977, 0; 
2026-02-21T08:31:48.1900609Z selp.b32 %r1540, 0, %r1975, %p30; 2026-02-21T08:31:48.1900674Z @%p29 bra $L__BB0_24; 2026-02-21T08:31:48.1900772Z // %bb.23: // in Loop: Header=BB0_22 Depth=2 2026-02-21T08:31:48.1900939Z .loc 1 0 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:0:145 2026-02-21T08:31:48.1901005Z add.s32 %r1979, %r1979, 18944; 2026-02-21T08:31:48.1901173Z .loc 1 30 31 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:30:31 2026-02-21T08:31:48.1901232Z shr.s32 %r1531, %r1979, 31; 2026-02-21T08:31:48.1901298Z shr.u32 %r1532, %r1531, 27; 2026-02-21T08:31:48.1901357Z add.s32 %r1533, %r1979, %r1532; 2026-02-21T08:31:48.1901518Z .loc 1 33 27 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:33:27 2026-02-21T08:31:48.1901603Z shl.b32 %r1534, %r1533, 3; 2026-02-21T08:31:48.1901673Z and.b32 %r1978, %r1534, -256; 2026-02-21T08:31:48.1901729Z bra.uni $L__BB0_24; 2026-02-21T08:31:48.1901815Z $L__BB0_19: // %._crit_edge496 2026-02-21T08:31:48.1901909Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1902077Z .loc 1 26 145 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1902142Z cp.async.wait_group 0; 2026-02-21T08:31:48.1902205Z bar.sync 2, 256; 2026-02-21T08:31:48.1902261Z barrier.sync 1; 2026-02-21T08:31:48.1902317Z bra.uni $L__BB0_2; 2026-02-21T08:31:48.1902404Z $L__BB0_25: // %._crit_edge493 2026-02-21T08:31:48.1902543Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1902599Z barrier.sync 1; 2026-02-21T08:31:48.1902654Z bra.uni $L__BB0_2; 2026-02-21T08:31:48.1902754Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1902908Z .loc 1 19 0 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:19 2026-02-21T08:31:48.1902963Z barrier.sync 1; 2026-02-21T08:31:48.1903026Z barrier.sync 1; 2026-02-21T08:31:48.1903080Z bra.uni $L__BB0_2; 2026-02-21T08:31:48.1903172Z $L__BB0_42: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:31:48.1903337Z .loc 1 26 145 // 
cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:26:145 2026-02-21T08:31:48.1903399Z barrier.sync 1; 2026-02-21T08:31:48.1903454Z barrier.sync 1; 2026-02-21T08:31:48.1903507Z bra.uni $L__BB0_2; 2026-02-21T08:31:48.1903613Z $L__BB0_5: 2026-02-21T08:31:48.1903773Z .loc 1 19 0 // cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py:19 2026-02-21T08:31:48.1903825Z ret; 2026-02-21T08:31:48.1903880Z $L__BB0_43: 2026-02-21T08:31:48.1903940Z trap; 2026-02-21T08:31:48.1903994Z $L__tmp28: 2026-02-21T08:31:48.1904048Z $L__func_end0: 2026-02-21T08:31:48.1904135Z // -- End function 2026-02-21T08:31:48.1904185Z } 2026-02-21T08:31:48.1904391Z .file 1 "/tmp/torchinductor_root/po/cpoz7obkwr2pz2x4tidbd6lji5hiey7a4oqdj32puqtu7j4cn6bi.py" 2026-02-21T08:31:48.1904575Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T08:31:48.1904638Z .section .debug_abbrev 2026-02-21T08:31:48.1904688Z { 2026-02-21T08:31:48.1904777Z .b8 1 // Abbreviation Code 2026-02-21T08:31:48.1904871Z .b8 17 // DW_TAG_compile_unit 2026-02-21T08:31:48.1904951Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:31:48.1905033Z .b8 37 // DW_AT_producer 2026-02-21T08:31:48.1905117Z .b8 8 // DW_FORM_string 2026-02-21T08:31:48.1905193Z .b8 19 // DW_AT_language 2026-02-21T08:31:48.1905272Z .b8 5 // DW_FORM_data2 2026-02-21T08:31:48.1905353Z .b8 3 // DW_AT_name 2026-02-21T08:31:48.1905426Z .b8 8 // DW_FORM_string 2026-02-21T08:31:48.1905504Z .b8 16 // DW_AT_stmt_list 2026-02-21T08:31:48.1905580Z .b8 6 // DW_FORM_data4 2026-02-21T08:31:48.1905663Z .b8 27 // DW_AT_comp_dir 2026-02-21T08:31:48.1905736Z .b8 8 // DW_FORM_string 2026-02-21T08:31:48.1905808Z .b8 0 // EOM(1) 2026-02-21T08:31:48.1905890Z .b8 0 // EOM(2) 2026-02-21T08:31:48.1905975Z .b8 2 // Abbreviation Code 2026-02-21T08:31:48.1906059Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:31:48.1906139Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:31:48.1906212Z .b8 3 // DW_AT_name 2026-02-21T08:31:48.1906285Z .b8 8 // DW_FORM_string 
2026-02-21T08:31:48.1906360Z .b8 32 // DW_AT_inline 2026-02-21T08:31:48.1906442Z .b8 11 // DW_FORM_data1 2026-02-21T08:31:48.1906509Z .b8 0 // EOM(1) 2026-02-21T08:31:48.1906575Z .b8 0 // EOM(2) 2026-02-21T08:31:48.1906661Z .b8 3 // Abbreviation Code 2026-02-21T08:31:48.1906740Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:31:48.1906819Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:31:48.1906946Z .b8 17 // DW_AT_low_pc 2026-02-21T08:31:48.1907019Z .b8 1 // DW_FORM_addr 2026-02-21T08:31:48.1907096Z .b8 18 // DW_AT_high_pc 2026-02-21T08:31:48.1907167Z .b8 1 // DW_FORM_addr 2026-02-21T08:31:48.1907259Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:31:48.1907329Z .b8 19 // DW_FORM_ref4 2026-02-21T08:31:48.1907396Z .b8 0 // EOM(1) 2026-02-21T08:31:48.1907469Z .b8 0 // EOM(2) 2026-02-21T08:31:48.1907548Z .b8 4 // Abbreviation Code 2026-02-21T08:31:48.1907639Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T08:31:48.1907719Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:31:48.1907841Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:31:48.1907917Z .b8 19 // DW_FORM_ref4 2026-02-21T08:31:48.1907994Z .b8 17 // DW_AT_low_pc 2026-02-21T08:31:48.1908066Z .b8 1 // DW_FORM_addr 2026-02-21T08:31:48.1908143Z .b8 18 // DW_AT_high_pc 2026-02-21T08:31:48.1908212Z .b8 1 // DW_FORM_addr 2026-02-21T08:31:48.1908297Z .b8 88 // DW_AT_call_file 2026-02-21T08:31:48.1908371Z .b8 11 // DW_FORM_data1 2026-02-21T08:31:48.1908447Z .b8 89 // DW_AT_call_line 2026-02-21T08:31:48.1908527Z .b8 11 // DW_FORM_data1 2026-02-21T08:31:48.1908605Z .b8 87 // DW_AT_call_column 2026-02-21T08:31:48.1908677Z .b8 11 // DW_FORM_data1 2026-02-21T08:31:48.1908755Z .b8 0 // EOM(1) 2026-02-21T08:31:48.1908820Z .b8 0 // EOM(2) 2026-02-21T08:31:48.1908887Z .b8 0 // EOM(3) 2026-02-21T08:31:48.1908938Z } 2026-02-21T08:31:48.1909006Z .section .debug_info 2026-02-21T08:31:48.1909056Z { 2026-02-21T08:31:48.1909139Z .b32 178 // Length of Unit 2026-02-21T08:31:48.1909229Z .b8 2 // DWARF version number 2026-02-21T08:31:48.1909280Z .b8 0 
2026-02-21T08:31:48.1909394Z .b32 .debug_abbrev // Offset Into Abbrev. Section 2026-02-21T08:31:48.1909480Z .b8 8 // Address Size (in bytes) 2026-02-21T08:31:48.1909587Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T08:31:48.1909666Z .b8 116 // DW_AT_producer 2026-02-21T08:31:48.1909720Z .b8 114 2026-02-21T08:31:48.1909781Z .b8 105 2026-02-21T08:31:48.1909835Z .b8 116 2026-02-21T08:31:48.1909884Z .b8 111 2026-02-21T08:31:48.1909938Z .b8 110 2026-02-21T08:31:48.1909988Z .b8 0 2026-02-21T08:31:48.1910062Z .b8 2 // DW_AT_language 2026-02-21T08:31:48.1910112Z .b8 0 2026-02-21T08:31:48.1910192Z .b8 99 // DW_AT_name 2026-02-21T08:31:48.1910242Z .b8 112 2026-02-21T08:31:48.1910292Z .b8 111 2026-02-21T08:31:48.1910347Z .b8 122 2026-02-21T08:31:48.1910397Z .b8 55 2026-02-21T08:31:48.1910446Z .b8 111 2026-02-21T08:31:48.1910494Z .b8 98 2026-02-21T08:31:48.1910551Z .b8 107 2026-02-21T08:31:48.1910600Z .b8 119 2026-02-21T08:31:48.1910648Z .b8 114 2026-02-21T08:31:48.1910698Z .b8 50 2026-02-21T08:31:48.1910755Z .b8 112 2026-02-21T08:31:48.1910805Z .b8 122 2026-02-21T08:31:48.1910853Z .b8 50 2026-02-21T08:31:48.1910909Z .b8 120 2026-02-21T08:31:48.1910959Z .b8 52 2026-02-21T08:31:48.1911009Z .b8 116 2026-02-21T08:31:48.1911058Z .b8 105 2026-02-21T08:31:48.1911114Z .b8 100 2026-02-21T08:31:48.1911207Z .b8 98 2026-02-21T08:31:48.1911259Z .b8 100 2026-02-21T08:31:48.1911313Z .b8 54 2026-02-21T08:31:48.1911364Z .b8 108 2026-02-21T08:31:48.1911413Z .b8 106 2026-02-21T08:31:48.1911462Z .b8 105 2026-02-21T08:31:48.1911519Z .b8 53 2026-02-21T08:31:48.1911601Z .b8 104 2026-02-21T08:31:48.1911653Z .b8 105 2026-02-21T08:31:48.1911703Z .b8 101 2026-02-21T08:31:48.1911759Z .b8 121 2026-02-21T08:31:48.1911808Z .b8 55 2026-02-21T08:31:48.1911857Z .b8 97 2026-02-21T08:31:48.1911916Z .b8 52 2026-02-21T08:31:48.1911965Z .b8 111 2026-02-21T08:31:48.1912015Z .b8 113 2026-02-21T08:31:48.1912066Z .b8 100 2026-02-21T08:31:48.1912124Z .b8 106 2026-02-21T08:31:48.1912174Z .b8 51 
2026-02-21T08:31:48.1912225Z .b8 50 2026-02-21T08:31:48.1912284Z .b8 112 2026-02-21T08:31:48.1912336Z .b8 117 2026-02-21T08:31:48.1912387Z .b8 113 2026-02-21T08:31:48.1912437Z .b8 116 2026-02-21T08:31:48.1912496Z .b8 117 2026-02-21T08:31:48.1912546Z .b8 55 2026-02-21T08:31:48.1912597Z .b8 106 2026-02-21T08:31:48.1912704Z .b8 52 2026-02-21T08:31:48.1912758Z .b8 99 2026-02-21T08:31:48.1912809Z .b8 110 2026-02-21T08:31:48.1912860Z .b8 54 2026-02-21T08:31:48.1912917Z .b8 98 2026-02-21T08:31:48.1912968Z .b8 105 2026-02-21T08:31:48.1913017Z .b8 46 2026-02-21T08:31:48.1913067Z .b8 112 2026-02-21T08:31:48.1913124Z .b8 121 2026-02-21T08:31:48.1913175Z .b8 0 2026-02-21T08:31:48.1913266Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T08:31:48.1913347Z .b8 47 // DW_AT_comp_dir 2026-02-21T08:31:48.1913398Z .b8 116 2026-02-21T08:31:48.1913447Z .b8 109 2026-02-21T08:31:48.1913498Z .b8 112 2026-02-21T08:31:48.1913555Z .b8 47 2026-02-21T08:31:48.1913604Z .b8 116 2026-02-21T08:31:48.1913653Z .b8 111 2026-02-21T08:31:48.1913708Z .b8 114 2026-02-21T08:31:48.1913757Z .b8 99 2026-02-21T08:31:48.1913806Z .b8 104 2026-02-21T08:31:48.1913856Z .b8 105 2026-02-21T08:31:48.1913912Z .b8 110 2026-02-21T08:31:48.1913962Z .b8 100 2026-02-21T08:31:48.1914012Z .b8 117 2026-02-21T08:31:48.1914070Z .b8 99 2026-02-21T08:31:48.1914123Z .b8 116 2026-02-21T08:31:48.1914176Z .b8 111 2026-02-21T08:31:48.1914226Z .b8 114 2026-02-21T08:31:48.1914283Z .b8 95 2026-02-21T08:31:48.1914334Z .b8 114 2026-02-21T08:31:48.1914384Z .b8 111 2026-02-21T08:31:48.1914441Z .b8 111 2026-02-21T08:31:48.1914491Z .b8 116 2026-02-21T08:31:48.1914540Z .b8 47 2026-02-21T08:31:48.1914589Z .b8 112 2026-02-21T08:31:48.1914646Z .b8 111 2026-02-21T08:31:48.1914695Z .b8 0 2026-02-21T08:31:48.1914794Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T08:31:48.1914876Z .b8 95 // DW_AT_name 2026-02-21T08:31:48.1914926Z .b8 104 2026-02-21T08:31:48.1914977Z .b8 101 2026-02-21T08:31:48.1915025Z .b8 108 2026-02-21T08:31:48.1915081Z 
.b8 105 2026-02-21T08:31:48.1915131Z .b8 111 2026-02-21T08:31:48.1915181Z .b8 110 2026-02-21T08:31:48.1915229Z .b8 95 2026-02-21T08:31:48.1915285Z .b8 109 2026-02-21T08:31:48.1915334Z .b8 97 2026-02-21T08:31:48.1915383Z .b8 116 2026-02-21T08:31:48.1915439Z .b8 109 2026-02-21T08:31:48.1915490Z .b8 117 2026-02-21T08:31:48.1915538Z .b8 108 2026-02-21T08:31:48.1915587Z .b8 95 2026-02-21T08:31:48.1915642Z .b8 98 2026-02-21T08:31:48.1915691Z .b8 102 2026-02-21T08:31:48.1915740Z .b8 49 2026-02-21T08:31:48.1915794Z .b8 54 2026-02-21T08:31:48.1915843Z .b8 95 2026-02-21T08:31:48.1915891Z .b8 105 2026-02-21T08:31:48.1915940Z .b8 110 2026-02-21T08:31:48.1915997Z .b8 116 2026-02-21T08:31:48.1916046Z .b8 52 2026-02-21T08:31:48.1916094Z .b8 0 2026-02-21T08:31:48.1916170Z .b8 1 // DW_AT_inline 2026-02-21T08:31:48.1916265Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T08:31:48.1916351Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T08:31:48.1916438Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T08:31:48.1916534Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:31:48.1916649Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T08:31:48.1916797Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:31:48.1916886Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T08:31:48.1916970Z .b64 $L__tmp27 // DW_AT_high_pc 2026-02-21T08:31:48.1917045Z .b8 1 // DW_AT_call_file 2026-02-21T08:31:48.1917128Z .b8 84 // DW_AT_call_line 2026-02-21T08:31:48.1917207Z .b8 40 // DW_AT_call_column 2026-02-21T08:31:48.1917289Z .b8 0 // End Of Children Mark 2026-02-21T08:31:48.1917370Z .b8 0 // End Of Children Mark 2026-02-21T08:31:48.1917426Z } 2026-02-21T08:31:48.1917491Z .section .debug_macinfo { } 2026-02-21T08:31:48.1917494Z 2026-02-21T08:31:48.1917569Z ================================================================ 2026-02-21T08:31:48.1917717Z please share the reproducer above with Triton project. 
2026-02-21T08:31:49.5606970Z [196s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 64], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[32], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:31:49.5607940Z Tensor-likes are not close! 2026-02-21T08:31:49.5608057Z 2026-02-21T08:31:49.5608151Z Mismatched elements: 33448161 / 33554432 (99.7%) 2026-02-21T08:31:49.5608427Z Greatest absolute difference: 1408.0 at index (289, 7045) (up to 0.01 allowed) 2026-02-21T08:31:49.5608768Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:31:49.5609063Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:31:49.5609236Z 2026-02-21T08:31:49.9951248Z 2026-02-21T08:31:49.9956119Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 19.4 configs/s 2026-02-21T08:31:51.1655841Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━ 652/652 526.0 configs/s 2026-02-21T08:31:51.2559715Z [197s] Generation 4 complete: 2026-02-21T08:31:51.2561131Z error=35 2026-02-21T08:31:51.2561319Z ok=61 2026-02-21T08:31:51.2561497Z min=0.3112 2026-02-21T08:31:51.2561923Z mid=0.9462 2026-02-21T08:31:51.2562066Z max=9.6215 2026-02-21T08:31:51.2562213Z best={'block_sizes': [8, 128, 128], 2026-02-21T08:31:51.2562458Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:31:51.2562686Z 'l2_groupings': [8], 2026-02-21T08:31:51.2562879Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:31:51.2563086Z 'loop_orders': [[1, 0]], 2026-02-21T08:31:51.2563241Z 'maxnreg': 64, 2026-02-21T08:31:51.2563395Z 'num_sm_multiplier': 64, 2026-02-21T08:31:51.2563549Z 'num_stages': 7, 2026-02-21T08:31:51.2563696Z 'num_warps': 4, 2026-02-21T08:31:51.2563870Z 
'pid_type': 'persistent_interleaved', 2026-02-21T08:31:51.2564093Z 'range_flattens': [False, False], 2026-02-21T08:31:51.2564285Z 'range_multi_buffers': [False, None], 2026-02-21T08:31:51.2564485Z 'range_num_stages': [3, 0], 2026-02-21T08:31:51.2564662Z 'range_unroll_factors': [0, 0], 2026-02-21T08:31:51.2564861Z 'range_warp_specializes': [True, None]} 2026-02-21T08:31:51.2586039Z [198s] Fitting surrogate: 502 points, 502 targets 2026-02-21T08:31:52.6236394Z [199s] Generation 5 starting: 97 neighbors, 5 active search path(s) 2026-02-21T08:32:06.3514107Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 1.7 configs/s 2026-02-21T08:32:07.0205975Z [213s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 128], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 0], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:32:07.0207356Z Tensor-likes are not close! 2026-02-21T08:32:07.0212013Z 2026-02-21T08:32:07.0216099Z Mismatched elements: 33484797 / 33554432 (99.8%) 2026-02-21T08:32:07.0220588Z Greatest absolute difference: 2304.0 at index (1825, 4939) (up to 0.01 allowed) 2026-02-21T08:32:07.0222003Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:32:07.0222359Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:32:07.0222531Z 2026-02-21T08:32:08.0815639Z [214s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[2, 2], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]) 2026-02-21T08:32:08.0816897Z Tensor-likes are not close! 2026-02-21T08:32:08.0817031Z 2026-02-21T08:32:08.0817119Z Mismatched elements: 33444241 / 33554432 (99.7%) 2026-02-21T08:32:08.0817447Z Greatest absolute difference: 1408.0 at index (3439, 1611) (up to 0.01 allowed) 2026-02-21T08:32:08.0817830Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:32:08.0818170Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:32:08.0818379Z 2026-02-21T08:32:08.3979257Z [215s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T08:32:08.3980435Z Tensor-likes are not close! 2026-02-21T08:32:08.3980591Z 2026-02-21T08:32:08.3980821Z Mismatched elements: 33444241 / 33554432 (99.7%) 2026-02-21T08:32:08.3981143Z Greatest absolute difference: 1408.0 at index (3439, 1611) (up to 0.01 allowed) 2026-02-21T08:32:08.3986292Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:32:08.3990585Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:32:08.3994846Z 2026-02-21T08:32:08.5883522Z [215s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[1, 4], range_warp_specializes=[True, None]) 2026-02-21T08:32:08.5884625Z Tensor-likes are not close! 2026-02-21T08:32:08.5884745Z 2026-02-21T08:32:08.5884830Z Mismatched elements: 33444356 / 33554432 (99.7%) 2026-02-21T08:32:08.5885118Z Greatest absolute difference: 1296.0 at index (1726, 7328) (up to 0.01 allowed) 2026-02-21T08:32:08.5885454Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:32:08.5885765Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:32:08.5885929Z 2026-02-21T08:32:08.7588980Z [215s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=128, num_stages=6, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]) 2026-02-21T08:32:08.7590373Z Tensor-likes are not close! 2026-02-21T08:32:08.7590490Z 2026-02-21T08:32:08.7590575Z Mismatched elements: 33436835 / 33554432 (99.6%) 2026-02-21T08:32:08.7590865Z Greatest absolute difference: 1392.0 at index (3414, 7062) (up to 0.01 allowed) 2026-02-21T08:32:08.7591195Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:32:08.7591501Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:32:08.7591814Z 2026-02-21T08:32:08.7667346Z [215s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 64, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_sm_multiplier=128, num_stages=6, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]) 2026-02-21T08:32:08.7668668Z Tensor-likes are not close! 2026-02-21T08:32:08.7668832Z 2026-02-21T08:32:08.7675288Z Mismatched elements: 2291668 / 33554432 (6.8%) 2026-02-21T08:32:08.7679773Z Greatest absolute difference: nan at index (0, 0) (up to 0.01 allowed) 2026-02-21T08:32:08.7684152Z Greatest relative difference: nan at index (0, 0) (up to 0.01 allowed) 2026-02-21T08:32:08.7685549Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:32:08.7685745Z 2026-02-21T08:32:09.1024003Z [215s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 64, 128], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=128, num_stages=6, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]) 2026-02-21T08:32:09.1025166Z Tensor-likes are not close! 2026-02-21T08:32:09.1025337Z 2026-02-21T08:32:09.1025594Z Mismatched elements: 2299589 / 33554432 (6.9%) 2026-02-21T08:32:09.1025870Z Greatest absolute difference: nan at index (0, 0) (up to 0.01 allowed) 2026-02-21T08:32:09.1026205Z Greatest relative difference: nan at index (0, 0) (up to 0.01 allowed) 2026-02-21T08:32:09.1030211Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:32:09.1033867Z 2026-02-21T08:32:09.2537465Z [215s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_sm_multiplier=128, num_stages=6, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[1, 1], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]) 2026-02-21T08:32:09.2538513Z Tensor-likes are not close! 2026-02-21T08:32:09.2538630Z 2026-02-21T08:32:09.2538730Z Mismatched elements: 33436835 / 33554432 (99.6%) 2026-02-21T08:32:09.2539013Z Greatest absolute difference: 1392.0 at index (3414, 7062) (up to 0.01 allowed) 2026-02-21T08:32:09.2539358Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:32:09.2539664Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:32:09.2539839Z 2026-02-21T08:32:09.4328142Z [216s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 64, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_sm_multiplier=128, num_stages=6, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]) 2026-02-21T08:32:09.4329428Z Tensor-likes are not close! 2026-02-21T08:32:09.4333135Z 2026-02-21T08:32:09.4337234Z Mismatched elements: 2036325 / 33554432 (6.1%) 2026-02-21T08:32:09.4339258Z Greatest absolute difference: nan at index (0, 0) (up to 0.01 allowed) 2026-02-21T08:32:09.4339832Z Greatest relative difference: nan at index (0, 0) (up to 0.01 allowed) 2026-02-21T08:32:09.4340150Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:32:09.4340324Z 2026-02-21T08:32:11.7757982Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 18.6 configs/s 2026-02-21T08:32:15.8366977Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━ 652/652 158.0 configs/s 2026-02-21T08:32:15.9661717Z [222s] Generation 5 complete: 2026-02-21T08:32:15.9662041Z error=19 2026-02-21T08:32:15.9666383Z ok=84 2026-02-21T08:32:15.9668050Z min=0.3092 2026-02-21T08:32:15.9668255Z mid=0.7200 2026-02-21T08:32:15.9668402Z max=14.6647 2026-02-21T08:32:15.9668600Z best={'block_sizes': [8, 128, 128], 2026-02-21T08:32:15.9668877Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:32:15.9669126Z 'l2_groupings': [8], 2026-02-21T08:32:15.9669620Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:32:15.9669870Z 'loop_orders': [[1, 0]], 2026-02-21T08:32:15.9670041Z 'maxnreg': 64, 2026-02-21T08:32:15.9670220Z 'num_sm_multiplier': 64, 2026-02-21T08:32:15.9670420Z 'num_stages': 7, 2026-02-21T08:32:15.9670592Z 'num_warps': 2, 2026-02-21T08:32:15.9670759Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:32:15.9670955Z 'range_flattens': [False, False], 2026-02-21T08:32:15.9671146Z 'range_multi_buffers': [None, None], 2026-02-21T08:32:15.9671332Z 'range_num_stages': [3, 0], 2026-02-21T08:32:15.9671508Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:15.9671772Z 'range_warp_specializes': [True, None]} 2026-02-21T08:32:15.9687593Z [222s] Fitting surrogate: 605 points, 605 targets 2026-02-21T08:32:17.3245855Z [224s] Generation 6 starting: 98 neighbors, 5 active search path(s) 2026-02-21T08:32:26.3379282Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 3.3 configs/s 2026-02-21T08:32:27.7462413Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T08:32:27.7464084Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:32:27.7464383Z ^ 
2026-02-21T08:32:27.7464768Z /tmp/torchinductor_root/ig/cigrbfbsy7dsnaksbuwrbpgtz4q3sx6pvmgukjqrp6u52id74ft5.py:87:40: note: called from 2026-02-21T08:32:27.7465189Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T08:32:27.7465411Z ^ 2026-02-21T08:32:27.7465830Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T08:32:27.7466335Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:32:27.7466585Z ^ 2026-02-21T08:32:27.7467154Z module { 2026-02-21T08:32:27.7469584Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:32:27.7470198Z %cst = arith.constant dense<0> : tensor<8x2x64xi8> 2026-02-21T08:32:27.7470424Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:32:27.7470624Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:32:27.7470809Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:32:27.7470997Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:32:27.7471210Z %cst_0 = arith.constant dense<8192> : tensor<256x1xi32> 2026-02-21T08:32:27.7471466Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:32:27.7472106Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:32:27.7472344Z %cst_3 = arith.constant dense<4> : tensor<8x64xi8> 2026-02-21T08:32:27.7472599Z %cst_4 = arith.constant dense<8192> : tensor<8x1xi32> 2026-02-21T08:32:27.7472854Z %cst_5 = arith.constant dense<1024> : tensor<256x1xi32> 2026-02-21T08:32:27.7473219Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<256x64xf32> 2026-02-21T08:32:27.7473461Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:32:27.7473662Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:32:27.7473855Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:32:27.7474034Z %c16_i32 = arith.constant 16 : i32 
2026-02-21T08:32:27.7474224Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:32:27.7474414Z %0 = tt.get_program_id x : i32 2026-02-21T08:32:27.7474604Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T08:32:27.7474785Z %2 = arith.minsi %1, %c2048_i32 : i32 2026-02-21T08:32:27.7474996Z scf.for %arg3 = %0 to %2 step %c1_i32 : i32 { 2026-02-21T08:32:27.7475204Z %3 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T08:32:27.7475400Z %4 = arith.muli %3, %c2_i32 : i32 2026-02-21T08:32:27.7475591Z %5 = arith.subi %c16_i32, %4 : i32 2026-02-21T08:32:27.7475768Z %6 = arith.minsi %5, %c2_i32 : i32 2026-02-21T08:32:27.7476081Z %7 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T08:32:27.7476274Z %8 = arith.remsi %7, %6 : i32 2026-02-21T08:32:27.7476457Z %9 = arith.addi %4, %8 : i32 2026-02-21T08:32:27.7476629Z %10 = arith.divsi %7, %6 : i32 2026-02-21T08:32:27.7476818Z %11 = arith.muli %9, %c256_i32 : i32 2026-02-21T08:32:27.7477100Z %12 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:32:27.7477372Z %13 = tt.splat %11 : i32 -> tensor<256xi32> 2026-02-21T08:32:27.7477572Z %14 = arith.addi %13, %12 : tensor<256xi32> 2026-02-21T08:32:27.7477768Z %15 = arith.muli %10, %c64_i32 : i32 2026-02-21T08:32:27.7477996Z %16 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:32:27.7478250Z %17 = tt.splat %15 : i32 -> tensor<64xi32> 2026-02-21T08:32:27.7478446Z %18 = arith.addi %17, %16 : tensor<64xi32> 2026-02-21T08:32:27.7478643Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:32:27.7478962Z %19 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c32_i32 iter_args(%arg5 = %cst_6) -> (tensor<256x64xf32>) : i32 { 2026-02-21T08:32:27.7479331Z %29 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:32:27.7479589Z %30 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T08:32:27.7479794Z %31 = arith.addi %30, %29 : tensor<8xi32> 2026-02-21T08:32:27.7479994Z %32 = arith.muli %arg4, %c2_i32 : i32 
2026-02-21T08:32:27.7480221Z %33 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:32:27.7480474Z %34 = tt.splat %32 : i32 -> tensor<16xi32> 2026-02-21T08:32:27.7480682Z %35 = arith.addi %34, %33 : tensor<16xi32> 2026-02-21T08:32:27.7480940Z %36 = tt.expand_dims %14 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T08:32:27.7481228Z %37 = arith.muli %36, %cst_5 : tensor<256x1xi32> 2026-02-21T08:32:27.7481491Z %38 = tt.expand_dims %35 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:32:27.7481840Z %39 = tt.broadcast %37 : tensor<256x1xi32> -> tensor<256x16xi32> 2026-02-21T08:32:27.7482110Z %40 = tt.broadcast %38 : tensor<1x16xi32> -> tensor<256x16xi32> 2026-02-21T08:32:27.7482356Z %41 = arith.addi %39, %40 : tensor<256x16xi32> 2026-02-21T08:32:27.7482608Z %42 = tt.splat %arg0 : !tt.ptr -> tensor<256x16x!tt.ptr> 2026-02-21T08:32:27.7482900Z %43 = tt.addptr %42, %41 : tensor<256x16x!tt.ptr>, tensor<256x16xi32> 2026-02-21T08:32:27.7483225Z %44 = tt.load %43 evictionPolicy = evict_last : tensor<256x16x!tt.ptr> 2026-02-21T08:32:27.7483565Z %45 = arith.extf %44 : tensor<256x16xbf16> to tensor<256x16xf32> 2026-02-21T08:32:27.7483860Z %46 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:32:27.7484124Z %47 = arith.muli %46, %cst_4 : tensor<8x1xi32> 2026-02-21T08:32:27.7484391Z %48 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:32:27.7484730Z %49 = tt.broadcast %47 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T08:32:27.7484996Z %50 = tt.broadcast %48 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T08:32:27.7485245Z %51 = arith.addi %49, %50 : tensor<8x64xi32> 2026-02-21T08:32:27.7485482Z %52 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T08:32:27.7485759Z %53 = tt.addptr %52, %51 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T08:32:27.7486055Z %54 = tt.load %53 evictionPolicy = evict_last : 
tensor<8x64x!tt.ptr> 2026-02-21T08:32:27.7486313Z %55 = arith.shli %54, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7486533Z %56 = arith.shrsi %55, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7486746Z %57 = arith.shrsi %54, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7486990Z %58 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:32:27.7487338Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:32:27.7487655Z %60 = tt.expand_dims %59 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:32:27.7487971Z %61 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T08:32:27.7488274Z %62 = tt.expand_dims %57 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T08:32:27.7488554Z %63 = arith.cmpi eq, %60, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:32:27.7488800Z %64 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T08:32:27.7489069Z %65 = tt.broadcast %61 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T08:32:27.7489339Z %66 = arith.select %64, %65, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T08:32:27.7489612Z %67 = arith.cmpi eq, %60, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:32:27.7489866Z %68 = tt.broadcast %62 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T08:32:27.7490120Z %69 = tt.broadcast %67 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T08:32:27.7490400Z %70 = arith.select %69, %68, %66 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T08:32:27.7490663Z %71 = tt.reshape %70 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T08:32:27.7490924Z %72 = arith.sitofp %71 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T08:32:27.7491282Z %73 = tt.dot %45, %72, %arg5, inputPrecision = tf32 : tensor<256x16xf32> * tensor<16x64xf32> -> tensor<256x64xf32> 2026-02-21T08:32:27.7491646Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T08:32:27.7491849Z %74 = arith.muli %c8_i32, %c1_i32_7 : i32 
2026-02-21T08:32:27.7492042Z %75 = arith.addi %arg4, %74 : i32 2026-02-21T08:32:27.7492272Z %76 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:32:27.7492511Z %77 = tt.splat %75 : i32 -> tensor<8xi32> 2026-02-21T08:32:27.7492716Z %78 = arith.addi %77, %76 : tensor<8xi32> 2026-02-21T08:32:27.7492914Z %79 = arith.muli %75, %c2_i32 : i32 2026-02-21T08:32:27.7493142Z %80 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:32:27.7493393Z %81 = tt.splat %79 : i32 -> tensor<16xi32> 2026-02-21T08:32:27.7493591Z %82 = arith.addi %81, %80 : tensor<16xi32> 2026-02-21T08:32:27.7493848Z %83 = tt.expand_dims %14 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T08:32:27.7494119Z %84 = arith.muli %83, %cst_5 : tensor<256x1xi32> 2026-02-21T08:32:27.7494381Z %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:32:27.7494705Z %86 = tt.broadcast %84 : tensor<256x1xi32> -> tensor<256x16xi32> 2026-02-21T08:32:27.7494974Z %87 = tt.broadcast %85 : tensor<1x16xi32> -> tensor<256x16xi32> 2026-02-21T08:32:27.7495214Z %88 = arith.addi %86, %87 : tensor<256x16xi32> 2026-02-21T08:32:27.7495457Z %89 = tt.splat %arg0 : !tt.ptr -> tensor<256x16x!tt.ptr> 2026-02-21T08:32:27.7495783Z %90 = tt.addptr %89, %88 : tensor<256x16x!tt.ptr>, tensor<256x16xi32> 2026-02-21T08:32:27.7496089Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<256x16x!tt.ptr> 2026-02-21T08:32:27.7496398Z %92 = arith.extf %91 : tensor<256x16xbf16> to tensor<256x16xf32> 2026-02-21T08:32:27.7496690Z %93 = tt.expand_dims %78 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:32:27.7496945Z %94 = arith.muli %93, %cst_4 : tensor<8x1xi32> 2026-02-21T08:32:27.7497200Z %95 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:32:27.7497479Z %96 = tt.broadcast %94 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T08:32:27.7497741Z %97 = tt.broadcast %95 : 
tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T08:32:27.7497969Z %98 = arith.addi %96, %97 : tensor<8x64xi32> 2026-02-21T08:32:27.7498257Z %99 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T08:32:27.7498533Z %100 = tt.addptr %99, %98 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T08:32:27.7498832Z %101 = tt.load %100 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T08:32:27.7499109Z %102 = arith.shli %101, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7499330Z %103 = arith.shrsi %102, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7499556Z %104 = arith.shrsi %101, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7499811Z %105 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:32:27.7500105Z %106 = tt.expand_dims %105 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:32:27.7500430Z %107 = tt.expand_dims %106 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:32:27.7500752Z %108 = tt.expand_dims %103 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T08:32:27.7501081Z %109 = tt.expand_dims %104 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T08:32:27.7501368Z %110 = arith.cmpi eq, %107, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:32:27.7501674Z %111 = tt.broadcast %110 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T08:32:27.7501956Z %112 = tt.broadcast %108 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T08:32:27.7502241Z %113 = arith.select %111, %112, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T08:32:27.7502520Z %114 = arith.cmpi eq, %107, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:32:27.7502770Z %115 = tt.broadcast %109 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T08:32:27.7503047Z %116 = tt.broadcast %114 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T08:32:27.7503333Z %117 = arith.select %116, %115, %113 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T08:32:27.7503609Z %118 = tt.reshape %117 : tensor<8x2x64xi8> 
-> tensor<16x64xi8> 2026-02-21T08:32:27.7503881Z %119 = arith.sitofp %118 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T08:32:27.7504235Z %120 = tt.dot %92, %119, %73, inputPrecision = tf32 : tensor<256x16xf32> * tensor<16x64xf32> -> tensor<256x64xf32> 2026-02-21T08:32:27.7504569Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T08:32:27.7504764Z %121 = arith.muli %c8_i32, %c2_i32_8 : i32 2026-02-21T08:32:27.7504966Z %122 = arith.addi %arg4, %121 : i32 2026-02-21T08:32:27.7505202Z %123 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:32:27.7505448Z %124 = tt.splat %122 : i32 -> tensor<8xi32> 2026-02-21T08:32:27.7505685Z %125 = arith.addi %124, %123 : tensor<8xi32> 2026-02-21T08:32:27.7505887Z %126 = arith.muli %122, %c2_i32 : i32 2026-02-21T08:32:27.7506137Z %127 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:32:27.7506396Z %128 = tt.splat %126 : i32 -> tensor<16xi32> 2026-02-21T08:32:27.7506620Z %129 = arith.addi %128, %127 : tensor<16xi32> 2026-02-21T08:32:27.7506926Z %130 = tt.expand_dims %14 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T08:32:27.7507216Z %131 = arith.muli %130, %cst_5 : tensor<256x1xi32> 2026-02-21T08:32:27.7507499Z %132 = tt.expand_dims %129 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:32:27.7507813Z %133 = tt.broadcast %131 : tensor<256x1xi32> -> tensor<256x16xi32> 2026-02-21T08:32:27.7508107Z %134 = tt.broadcast %132 : tensor<1x16xi32> -> tensor<256x16xi32> 2026-02-21T08:32:27.7508364Z %135 = arith.addi %133, %134 : tensor<256x16xi32> 2026-02-21T08:32:27.7508635Z %136 = tt.splat %arg0 : !tt.ptr -> tensor<256x16x!tt.ptr> 2026-02-21T08:32:27.7508951Z %137 = tt.addptr %136, %135 : tensor<256x16x!tt.ptr>, tensor<256x16xi32> 2026-02-21T08:32:27.7509368Z %138 = tt.load %137 evictionPolicy = evict_last : tensor<256x16x!tt.ptr> 2026-02-21T08:32:27.7509702Z %139 = arith.extf %138 : tensor<256x16xbf16> to tensor<256x16xf32> 
2026-02-21T08:32:27.7510010Z %140 = tt.expand_dims %125 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:32:27.7510321Z %141 = arith.muli %140, %cst_4 : tensor<8x1xi32> 2026-02-21T08:32:27.7510602Z %142 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:32:27.7510911Z %143 = tt.broadcast %141 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T08:32:27.7511199Z %144 = tt.broadcast %142 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T08:32:27.7511457Z %145 = arith.addi %143, %144 : tensor<8x64xi32> 2026-02-21T08:32:27.7511760Z %146 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T08:32:27.7512052Z %147 = tt.addptr %146, %145 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T08:32:27.7512378Z %148 = tt.load %147 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T08:32:27.7512666Z %149 = arith.shli %148, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7512897Z %150 = arith.shrsi %149, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7513138Z %151 = arith.shrsi %148, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7513388Z %152 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:32:27.7513685Z %153 = tt.expand_dims %152 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:32:27.7513998Z %154 = tt.expand_dims %153 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:32:27.7514328Z %155 = tt.expand_dims %150 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T08:32:27.7514657Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T08:32:27.7514943Z %157 = arith.cmpi eq, %154, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:32:27.7515206Z %158 = tt.broadcast %157 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T08:32:27.7515477Z %159 = tt.broadcast %155 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T08:32:27.7515769Z %160 = arith.select %158, %159, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 
2026-02-21T08:32:27.7516045Z %161 = arith.cmpi eq, %154, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:32:27.7516292Z %162 = tt.broadcast %156 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T08:32:27.7516567Z %163 = tt.broadcast %161 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T08:32:27.7516842Z %164 = arith.select %163, %162, %160 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T08:32:27.7517153Z %165 = tt.reshape %164 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T08:32:27.7517443Z %166 = arith.sitofp %165 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T08:32:27.7517804Z %167 = tt.dot %139, %166, %120, inputPrecision = tf32 : tensor<256x16xf32> * tensor<16x64xf32> -> tensor<256x64xf32> 2026-02-21T08:32:27.7518141Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:32:27.7518363Z %168 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T08:32:27.7518555Z %169 = arith.addi %arg4, %168 : i32 2026-02-21T08:32:27.7518789Z %170 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:32:27.7519031Z %171 = tt.splat %169 : i32 -> tensor<8xi32> 2026-02-21T08:32:27.7519240Z %172 = arith.addi %171, %170 : tensor<8xi32> 2026-02-21T08:32:27.7519433Z %173 = arith.muli %169, %c2_i32 : i32 2026-02-21T08:32:27.7519672Z %174 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:32:27.7519921Z %175 = tt.splat %173 : i32 -> tensor<16xi32> 2026-02-21T08:32:27.7520135Z %176 = arith.addi %175, %174 : tensor<16xi32> 2026-02-21T08:32:27.7520397Z %177 = tt.expand_dims %14 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T08:32:27.7520716Z %178 = arith.muli %177, %cst_5 : tensor<256x1xi32> 2026-02-21T08:32:27.7520992Z %179 = tt.expand_dims %176 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:32:27.7521292Z %180 = tt.broadcast %178 : tensor<256x1xi32> -> tensor<256x16xi32> 2026-02-21T08:32:27.7521601Z %181 = tt.broadcast %179 : tensor<1x16xi32> -> tensor<256x16xi32> 2026-02-21T08:32:27.7521857Z %182 = 
arith.addi %180, %181 : tensor<256x16xi32> 2026-02-21T08:32:27.7522103Z %183 = tt.splat %arg0 : !tt.ptr -> tensor<256x16x!tt.ptr> 2026-02-21T08:32:27.7522405Z %184 = tt.addptr %183, %182 : tensor<256x16x!tt.ptr>, tensor<256x16xi32> 2026-02-21T08:32:27.7522721Z %185 = tt.load %184 evictionPolicy = evict_last : tensor<256x16x!tt.ptr> 2026-02-21T08:32:27.7523028Z %186 = arith.extf %185 : tensor<256x16xbf16> to tensor<256x16xf32> 2026-02-21T08:32:27.7523315Z %187 = tt.expand_dims %172 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:32:27.7523595Z %188 = arith.muli %187, %cst_4 : tensor<8x1xi32> 2026-02-21T08:32:27.7523865Z %189 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:32:27.7524156Z %190 = tt.broadcast %188 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T08:32:27.7524421Z %191 = tt.broadcast %189 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T08:32:27.7524661Z %192 = arith.addi %190, %191 : tensor<8x64xi32> 2026-02-21T08:32:27.7524905Z %193 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T08:32:27.7525183Z %194 = tt.addptr %193, %192 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T08:32:27.7525484Z %195 = tt.load %194 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T08:32:27.7525759Z %196 = arith.shli %195, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7525977Z %197 = arith.shrsi %196, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7526201Z %198 = arith.shrsi %195, %cst_3 : tensor<8x64xi8> 2026-02-21T08:32:27.7526449Z %199 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:32:27.7526745Z %200 = tt.expand_dims %199 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:32:27.7527063Z %201 = tt.expand_dims %200 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:32:27.7527381Z %202 = tt.expand_dims %197 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T08:32:27.7527705Z %203 = tt.expand_dims %198 {axis 
= 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T08:32:27.7527983Z %204 = arith.cmpi eq, %201, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:32:27.7528277Z %205 = tt.broadcast %204 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T08:32:27.7528547Z %206 = tt.broadcast %202 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T08:32:27.7528837Z %207 = arith.select %205, %206, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T08:32:27.7529119Z %208 = arith.cmpi eq, %201, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:32:27.7529390Z %209 = tt.broadcast %203 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T08:32:27.7529660Z %210 = tt.broadcast %208 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T08:32:27.7529936Z %211 = arith.select %210, %209, %207 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T08:32:27.7530223Z %212 = tt.reshape %211 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T08:32:27.7530486Z %213 = arith.sitofp %212 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T08:32:27.7530842Z %214 = tt.dot %186, %213, %167, inputPrecision = tf32 : tensor<256x16xf32> * tensor<16x64xf32> -> tensor<256x64xf32> 2026-02-21T08:32:27.7531174Z scf.yield %214 : tensor<256x64xf32> 2026-02-21T08:32:27.7531350Z } 2026-02-21T08:32:27.7531563Z %20 = arith.truncf %19 : tensor<256x64xf32> to tensor<256x64xbf16> 2026-02-21T08:32:27.7531902Z %21 = tt.expand_dims %14 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T08:32:27.7532179Z %22 = arith.muli %21, %cst_0 : tensor<256x1xi32> 2026-02-21T08:32:27.7532440Z %23 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:32:27.7532722Z %24 = tt.broadcast %22 : tensor<256x1xi32> -> tensor<256x64xi32> 2026-02-21T08:32:27.7532988Z %25 = tt.broadcast %23 : tensor<1x64xi32> -> tensor<256x64xi32> 2026-02-21T08:32:27.7533220Z %26 = arith.addi %24, %25 : tensor<256x64xi32> 2026-02-21T08:32:27.7533465Z %27 = tt.splat %arg2 : !tt.ptr -> tensor<256x64x!tt.ptr> 
2026-02-21T08:32:27.7533748Z %28 = tt.addptr %27, %26 : tensor<256x64x!tt.ptr>, tensor<256x64xi32> 2026-02-21T08:32:27.7534012Z tt.store %28, %20 : tensor<256x64x!tt.ptr> 2026-02-21T08:32:27.7534292Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:32:27.7534543Z tt.return 2026-02-21T08:32:27.7534679Z } 2026-02-21T08:32:27.7534803Z } 2026-02-21T08:32:27.7534880Z 2026-02-21T08:32:27.7534932Z {-# 2026-02-21T08:32:27.7535062Z external_resources: { 2026-02-21T08:32:27.7535227Z mlir_reproducer: { 2026-02-21T08:32:27.7539615Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, 
tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:32:27.7544113Z disable_threading: false, 2026-02-21T08:32:27.7544318Z verify_each: true 2026-02-21T08:32:27.7544462Z } 2026-02-21T08:32:27.7544588Z } 2026-02-21T08:32:27.7544702Z #-} 2026-02-21T08:32:27.7545131Z /tmp/torchinductor_root/ig/cigrbfbsy7dsnaksbuwrbpgtz4q3sx6pvmgukjqrp6u52id74ft5.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:32:27.7546161Z /tmp/torchinductor_root/ig/cigrbfbsy7dsnaksbuwrbpgtz4q3sx6pvmgukjqrp6u52id74ft5.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:32:27.7546974Z [234s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:32:27.7548178Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[1, 0], range_unroll_factors=[1, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:32:27.7549238Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:32:27.7549507Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:32:27.9568156Z [234s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[1, 0], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]) 2026-02-21T08:32:27.9569298Z Tensor-likes are not close! 2026-02-21T08:32:27.9569430Z 2026-02-21T08:32:27.9569545Z Mismatched elements: 33485695 / 33554432 (99.8%) 2026-02-21T08:32:27.9569851Z Greatest absolute difference: 2592.0 at index (1672, 2372) (up to 0.01 allowed) 2026-02-21T08:32:27.9574948Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:32:27.9579206Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:32:27.9583476Z 2026-02-21T08:32:31.7083442Z [238s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[32], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:32:31.7084503Z Tensor-likes are not close! 2026-02-21T08:32:31.7084621Z 2026-02-21T08:32:31.7084706Z Mismatched elements: 33450159 / 33554432 (99.7%) 2026-02-21T08:32:31.7084996Z Greatest absolute difference: 1408.0 at index (289, 7045) (up to 0.01 allowed) 2026-02-21T08:32:31.7085333Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:32:31.7085640Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:32:31.7085803Z 2026-02-21T08:32:31.7097224Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 18.4 configs/s 2026-02-21T08:32:39.1255909Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 653/653 87.4 configs/s 2026-02-21T08:32:39.2885202Z [246s] Generation 6 complete: 2026-02-21T08:32:39.2885478Z error=21 2026-02-21T08:32:39.2889910Z ok=82 2026-02-21T08:32:39.2893547Z min=0.3083 2026-02-21T08:32:39.2893724Z mid=0.5644 2026-02-21T08:32:39.2893858Z max=26.9445 2026-02-21T08:32:39.2894008Z best={'block_sizes': [8, 128, 128], 2026-02-21T08:32:39.2894265Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:32:39.2894490Z 'l2_groupings': [8], 2026-02-21T08:32:39.2894710Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:32:39.2895180Z 'loop_orders': [[1, 0]], 2026-02-21T08:32:39.2895351Z 'maxnreg': 64, 2026-02-21T08:32:39.2895511Z 'num_sm_multiplier': 64, 2026-02-21T08:32:39.2895669Z 'num_stages': 7, 2026-02-21T08:32:39.2895820Z 'num_warps': 1, 2026-02-21T08:32:39.2895976Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:32:39.2896174Z 'range_flattens': [False, False], 2026-02-21T08:32:39.2896355Z 'range_multi_buffers': [False, None], 2026-02-21T08:32:39.2896541Z 'range_num_stages': [3, 0], 2026-02-21T08:32:39.2896709Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:39.2896897Z 'range_warp_specializes': [True, None]} 2026-02-21T08:32:39.2916751Z [246s] Fitting surrogate: 708 points, 708 targets 2026-02-21T08:32:40.2607852Z [247s] Generation 7 starting: 56 neighbors, 3 active search path(s) 2026-02-21T08:32:44.5507099Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 15.6 configs/s 2026-02-21T08:32:47.9747838Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 17.5 configs/s 2026-02-21T08:32:53.4496501Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━ 673/673 132.0 configs/s 2026-02-21T08:32:53.5884111Z [260s] Generation 7 complete: 2026-02-21T08:32:53.5885454Z 
error=8 2026-02-21T08:32:53.5885624Z ok=52 2026-02-21T08:32:53.5885758Z min=0.2971 2026-02-21T08:32:53.5885901Z mid=0.5396 2026-02-21T08:32:53.5886031Z max=11.9071 2026-02-21T08:32:53.5886190Z best={'block_sizes': [16, 128, 128], 2026-02-21T08:32:53.5886441Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:32:53.5886677Z 'l2_groupings': [8], 2026-02-21T08:32:53.5886861Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:32:53.5887097Z 'loop_orders': [[1, 0]], 2026-02-21T08:32:53.5887266Z 'maxnreg': 64, 2026-02-21T08:32:53.5887426Z 'num_sm_multiplier': 64, 2026-02-21T08:32:53.5887583Z 'num_stages': 7, 2026-02-21T08:32:53.5887722Z 'num_warps': 2, 2026-02-21T08:32:53.5887884Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:32:53.5888095Z 'range_flattens': [None, False], 2026-02-21T08:32:53.5888290Z 'range_multi_buffers': [None, None], 2026-02-21T08:32:53.5888471Z 'range_num_stages': [3, 0], 2026-02-21T08:32:53.5888647Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:53.5888827Z 'range_warp_specializes': [True, None]} 2026-02-21T08:32:53.5904450Z [260s] Fitting surrogate: 768 points, 768 targets 2026-02-21T08:32:54.5166994Z [261s] Generation 8 starting: 56 neighbors, 3 active search path(s) 2026-02-21T08:33:01.3868616Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58/58 1.8 configs/s 2026-02-21T08:33:03.4214199Z [270s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:33:03.4214529Z 2026-02-21T08:33:03.4219456Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:33:03.4220517Z 2026-02-21T08:33:03.4220772Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T08:33:03.4221014Z `ptxas` stderr: 2026-02-21T08:33:03.4221460Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 271 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:33:03.4222266Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:33:03.4222695Z 2026-02-21T08:33:03.4223096Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpajymc1d3.ptx -o /tmp/tmpajymc1d3.ptx.o 2026-02-21T08:33:03.4223532Z 2026-02-21T08:33:03.4223683Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:33:03.4224055Z ================================================================ 2026-02-21T08:33:03.4224278Z Internal Triton PTX codegen error 2026-02-21T08:33:03.4224452Z `ptxas` stderr: 2026-02-21T08:33:03.4224892Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 271 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T08:33:03.4225394Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:33:03.4225545Z 2026-02-21T08:33:03.4225919Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpajymc1d3.ptx -o /tmp/tmpajymc1d3.ptx.o 2026-02-21T08:33:03.4226371Z 2026-02-21T08:33:03.4226374Z 2026-02-21T08:33:03.4226433Z // 2026-02-21T08:33:03.4226587Z // Generated by LLVM NVPTX Back-End 2026-02-21T08:33:03.4226761Z // 2026-02-21T08:33:03.4226830Z 2026-02-21T08:33:03.4227030Z .version 8.7 2026-02-21T08:33:03.4227176Z .target sm_100a 2026-02-21T08:33:03.4227325Z .address_size 64 2026-02-21T08:33:03.4227412Z 2026-02-21T08:33:03.4227565Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T08:33:03.4227883Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T08:33:03.4228115Z // @_helion_matmul_bf16_int4 2026-02-21T08:33:03.4228354Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T08:33:03.4228624Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T08:33:03.4228929Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T08:33:03.4229229Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T08:33:03.4229535Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T08:33:03.4229831Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T08:33:03.4230068Z ) 2026-02-21T08:33:03.4230197Z .reqntid 128 2026-02-21T08:33:03.4230341Z .maxnreg 32 2026-02-21T08:33:03.4230468Z { 2026-02-21T08:33:03.4230608Z .reg .pred %p<106>; 2026-02-21T08:33:03.4230763Z .reg .b16 %rs<385>; 2026-02-21T08:33:03.4230911Z .reg .b32 %r<1390>; 2026-02-21T08:33:03.4231049Z .reg .b64 %rd<606>; 2026-02-21T08:33:03.4231191Z $L__func_begin0: 2026-02-21T08:33:03.4231276Z 2026-02-21T08:33:03.4231347Z // %bb.0: 2026-02-21T08:33:03.4231672Z .loc 1 19 0 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:19 2026-02-21T08:33:03.4232003Z mov.u32 %r1, %tid.x; 2026-02-21T08:33:03.4232198Z ld.param.b64 %rd16, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T08:33:03.4232426Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T08:33:03.4232626Z ld.param.b64 %rd34, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T08:33:03.4232847Z mov.b32 %r53, global_smem; 2026-02-21T08:33:03.4233009Z // begin inline asm 2026-02-21T08:33:03.4233272Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r53], 512; 2026-02-21T08:33:03.4233525Z // end inline asm 2026-02-21T08:33:03.4233713Z ld.param.b64 %rd51, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T08:33:03.4233922Z bar.sync 0; 2026-02-21T08:33:03.4234076Z ld.shared.b32 %r1384, [global_smem]; 2026-02-21T08:33:03.4234258Z bar.sync 0; 2026-02-21T08:33:03.4234391Z // begin inline asm 2026-02-21T08:33:03.4234603Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T08:33:03.4234829Z // end inline asm 2026-02-21T08:33:03.4235092Z .loc 1 21 66 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:21:66 2026-02-21T08:33:03.4235397Z mov.u32 %r3, %ctaid.x; 2026-02-21T08:33:03.4235610Z mov.u32 %r70, %ctaid.y; 2026-02-21T08:33:03.4235767Z mov.u32 %r71, %ctaid.z; 2026-02-21T08:33:03.4235916Z mov.u32 %r72, %nctaid.x; 2026-02-21T08:33:03.4236075Z mov.u32 %r73, %nctaid.y; 2026-02-21T08:33:03.4236234Z mad.lo.s32 %r74, %r71, %r73, %r70; 2026-02-21T08:33:03.4236421Z mad.lo.s32 %r75, %r74, %r72, %r3; 2026-02-21T08:33:03.4236629Z shl.b32 %r76, %r75, 8; 2026-02-21T08:33:03.4236788Z cvt.s64.s32 %rd52, %r76; 2026-02-21T08:33:03.4236943Z add.s64 %rd30, %rd51, %rd52; 2026-02-21T08:33:03.4237108Z shl.b32 %r77, %r1, 2; 2026-02-21T08:33:03.4237255Z add.s32 %r54, %r53, %r77; 2026-02-21T08:33:03.4237409Z mov.b32 %r63, 0; 2026-02-21T08:33:03.4237551Z // begin inline asm 2026-02-21T08:33:03.4237701Z @%p1 st.shared.b32 [ %r54 + 0 ], %r63; 2026-02-21T08:33:03.4237879Z // end inline asm 
2026-02-21T08:33:03.4238021Z bar.warp.sync -1; 2026-02-21T08:33:03.4238181Z setp.eq.b32 %p99, %r1, 0; 2026-02-21T08:33:03.4238339Z cvt.u64.u32 %rd15, %r53; 2026-02-21T08:33:03.4238497Z // begin inline asm 2026-02-21T08:33:03.4238747Z @%p99 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd16; 2026-02-21T08:33:03.4239034Z // end inline asm 2026-02-21T08:33:03.4239177Z // begin inline asm 2026-02-21T08:33:03.4239464Z @%p99 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T08:33:03.4239731Z // end inline asm 2026-02-21T08:33:03.4239867Z mov.b32 %r56, 128; 2026-02-21T08:33:03.4240017Z // begin inline asm 2026-02-21T08:33:03.4240251Z @%p99 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r56; 2026-02-21T08:33:03.4240519Z // end inline asm 2026-02-21T08:33:03.4240653Z mov.b32 %r354, 16; 2026-02-21T08:33:03.4240794Z // begin inline asm 2026-02-21T08:33:03.4241033Z @%p99 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r354; 2026-02-21T08:33:03.4241302Z // end inline asm 2026-02-21T08:33:03.4241460Z mov.b32 %r58, 8192; 2026-02-21T08:33:03.4241634Z // begin inline asm 2026-02-21T08:33:03.4241893Z @%p99 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r58; 2026-02-21T08:33:03.4242166Z // end inline asm 2026-02-21T08:33:03.4242314Z mov.b32 %r59, 512; 2026-02-21T08:33:03.4242465Z // begin inline asm 2026-02-21T08:33:03.4242701Z @%p99 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r59; 2026-02-21T08:33:03.4242970Z // end inline asm 2026-02-21T08:33:03.4243103Z mov.b64 %rd23, 8192; 2026-02-21T08:33:03.4243249Z // begin inline asm 2026-02-21T08:33:03.4243493Z @%p99 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd23; 2026-02-21T08:33:03.4243783Z // end inline asm 2026-02-21T08:33:03.4243913Z mov.b32 %r60, 1; 2026-02-21T08:33:03.4244050Z // begin inline asm 
2026-02-21T08:33:03.4244307Z @%p99 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r60; 2026-02-21T08:33:03.4244588Z // end inline asm 2026-02-21T08:33:03.4244728Z // begin inline asm 2026-02-21T08:33:03.4244974Z @%p99 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r60; 2026-02-21T08:33:03.4245250Z // end inline asm 2026-02-21T08:33:03.4245382Z // begin inline asm 2026-02-21T08:33:03.4245616Z @%p99 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T08:33:03.4245885Z // end inline asm 2026-02-21T08:33:03.4246022Z // begin inline asm 2026-02-21T08:33:03.4246277Z @%p99 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T08:33:03.4246554Z // end inline asm 2026-02-21T08:33:03.4246692Z // begin inline asm 2026-02-21T08:33:03.4246924Z @%p99 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x3; 2026-02-21T08:33:03.4247195Z // end inline asm 2026-02-21T08:33:03.4247327Z // begin inline asm 2026-02-21T08:33:03.4247562Z @%p99 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T08:33:03.4247825Z // end inline asm 2026-02-21T08:33:03.4247999Z // begin inline asm 2026-02-21T08:33:03.4248354Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd30 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T08:33:03.4248727Z // end inline asm 2026-02-21T08:33:03.4248876Z // begin inline asm 2026-02-21T08:33:03.4249092Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd30 + 0 ], 0x80; 2026-02-21T08:33:03.4249381Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T08:33:03.4249580Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T08:33:03.4249762Z // end inline asm 2026-02-21T08:33:03.4249909Z bar.sync 0; 2026-02-21T08:33:03.4250054Z cvta.global.u64 %rd71, %rd30; 2026-02-21T08:33:03.4250344Z .loc 1 23 67 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:23:67 2026-02-21T08:33:03.4250639Z add.s64 %rd48, %rd30, 128; 2026-02-21T08:33:03.4250805Z bar.sync 0; 2026-02-21T08:33:03.4250941Z // begin inline asm 2026-02-21T08:33:03.4251105Z @%p1 st.shared.b32 [ %r54 + 0 ], %r63; 2026-02-21T08:33:03.4251287Z // end inline asm 2026-02-21T08:33:03.4251429Z bar.warp.sync -1; 2026-02-21T08:33:03.4251610Z // begin inline asm 2026-02-21T08:33:03.4251859Z @%p99 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd34; 2026-02-21T08:33:03.4252145Z // end inline asm 2026-02-21T08:33:03.4252370Z // begin inline asm 2026-02-21T08:33:03.4252606Z @%p99 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T08:33:03.4252863Z // end inline asm 2026-02-21T08:33:03.4253004Z mov.b32 %r64, 64; 2026-02-21T08:33:03.4253147Z // begin inline asm 2026-02-21T08:33:03.4253379Z @%p99 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r64; 2026-02-21T08:33:03.4253651Z // end inline asm 2026-02-21T08:33:03.4253784Z // begin inline asm 2026-02-21T08:33:03.4254021Z @%p99 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r56; 2026-02-21T08:33:03.4254302Z // end inline asm 2026-02-21T08:33:03.4254452Z // begin inline asm 2026-02-21T08:33:03.4254713Z @%p99 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r58; 2026-02-21T08:33:03.4254991Z // end inline asm 2026-02-21T08:33:03.4255138Z mov.b32 %r67, 4096; 2026-02-21T08:33:03.4255283Z // begin inline asm 2026-02-21T08:33:03.4255541Z @%p99 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r67; 2026-02-21T08:33:03.4255823Z // end inline asm 2026-02-21T08:33:03.4255973Z mov.b64 %rd41, 16384; 2026-02-21T08:33:03.4256126Z // begin inline asm 2026-02-21T08:33:03.4256392Z @%p99 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd41; 2026-02-21T08:33:03.4256689Z // end inline asm 
2026-02-21T08:33:03.4256828Z // begin inline asm 2026-02-21T08:33:03.4257095Z @%p99 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r60; 2026-02-21T08:33:03.4257391Z // end inline asm 2026-02-21T08:33:03.4257536Z // begin inline asm 2026-02-21T08:33:03.4257798Z @%p99 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r60; 2026-02-21T08:33:03.4258101Z // end inline asm 2026-02-21T08:33:03.4258247Z // begin inline asm 2026-02-21T08:33:03.4258489Z @%p99 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0xa; 2026-02-21T08:33:03.4258777Z // end inline asm 2026-02-21T08:33:03.4258920Z // begin inline asm 2026-02-21T08:33:03.4259196Z @%p99 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T08:33:03.4259491Z // end inline asm 2026-02-21T08:33:03.4259638Z // begin inline asm 2026-02-21T08:33:03.4259886Z @%p99 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x3; 2026-02-21T08:33:03.4260163Z // end inline asm 2026-02-21T08:33:03.4260310Z // begin inline asm 2026-02-21T08:33:03.4260545Z @%p99 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T08:33:03.4260819Z // end inline asm 2026-02-21T08:33:03.4260990Z // begin inline asm 2026-02-21T08:33:03.4261352Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd48 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T08:33:03.4261769Z // end inline asm 2026-02-21T08:33:03.4261910Z // begin inline asm 2026-02-21T08:33:03.4262123Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd48 + 0 ], 0x80; 2026-02-21T08:33:03.4262399Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T08:33:03.4262593Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T08:33:03.4262767Z // end inline asm 2026-02-21T08:33:03.4262905Z bar.sync 0; 2026-02-21T08:33:03.4263044Z cvta.global.u64 %rd91, %rd48; 2026-02-21T08:33:03.4263325Z .loc 1 31 74 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:31:74 2026-02-21T08:33:03.4263634Z setp.gt.u32 %p39, %r3, 1023; 2026-02-21T08:33:03.4263798Z @%p39 bra $L__BB0_8; 2026-02-21T08:33:03.4263967Z // %bb.1: // %.lr.ph 2026-02-21T08:33:03.4264265Z .loc 1 0 74 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:0:74 2026-02-21T08:33:03.4264598Z ld.param.b64 %rd14, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T08:33:03.4264917Z .loc 1 57 38 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:57:38 2026-02-21T08:33:03.4265295Z shl.b32 %r402, %r1, 3; 2026-02-21T08:33:03.4265452Z and.b32 %r403, %r402, 24; 2026-02-21T08:33:03.4265720Z .loc 1 43 45 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:43:45 2026-02-21T08:33:03.4266012Z bfe.u32 %r4, %r1, 2, 5; 2026-02-21T08:33:03.4266160Z shr.u32 %r404, %r1, 5; 2026-02-21T08:33:03.4266317Z and.b32 %r405, %r1, 127; 2026-02-21T08:33:03.4266470Z shl.b32 %r406, %r405, 4; 2026-02-21T08:33:03.4266630Z add.s32 %r408, %r53, %r406; 2026-02-21T08:33:03.4266790Z add.s32 %r541, %r408, 98304; 2026-02-21T08:33:03.4266955Z add.s32 %r543, %r408, 100352; 2026-02-21T08:33:03.4267112Z add.s32 %r545, %r408, 102400; 2026-02-21T08:33:03.4267275Z add.s32 %r547, %r408, 104448; 2026-02-21T08:33:03.4267439Z add.s32 %r527, %r1384, 256; 2026-02-21T08:33:03.4267600Z setp.lt.u32 %p63, %r1, 64; 2026-02-21T08:33:03.4267774Z mad.lo.s32 %r10, %r405, 48, %r541; 2026-02-21T08:33:03.4267945Z add.s32 %r409, %r53, 106496; 2026-02-21T08:33:03.4268109Z add.s32 %r11, %r409, %r405; 2026-02-21T08:33:03.4268267Z xor.b32 %r410, %r405, 16; 2026-02-21T08:33:03.4268427Z add.s32 %r12, %r409, %r410; 2026-02-21T08:33:03.4268581Z xor.b32 %r411, %r405, 32; 2026-02-21T08:33:03.4268739Z add.s32 %r13, %r409, %r411; 2026-02-21T08:33:03.4268891Z xor.b32 %r412, %r405, 48; 2026-02-21T08:33:03.4269049Z add.s32 %r14, %r409, %r412; 2026-02-21T08:33:03.4269213Z xor.b32 %r413, %r405, 64; 2026-02-21T08:33:03.4269365Z add.s32 %r15, %r409, %r413; 
2026-02-21T08:33:03.4269526Z xor.b32 %r414, %r405, 80; 2026-02-21T08:33:03.4269672Z add.s32 %r16, %r409, %r414; 2026-02-21T08:33:03.4269829Z xor.b32 %r415, %r405, 96; 2026-02-21T08:33:03.4269977Z add.s32 %r17, %r409, %r415; 2026-02-21T08:33:03.4270139Z xor.b32 %r416, %r405, 112; 2026-02-21T08:33:03.4270291Z add.s32 %r18, %r409, %r416; 2026-02-21T08:33:03.4270450Z shl.b32 %r417, %r405, 7; 2026-02-21T08:33:03.4270605Z shl.b32 %r418, %r1, 4; 2026-02-21T08:33:03.4270754Z and.b32 %r419, %r418, 112; 2026-02-21T08:33:03.4270915Z or.b32 %r420, %r417, %r419; 2026-02-21T08:33:03.4271068Z add.s32 %r421, %r53, 65536; 2026-02-21T08:33:03.4271227Z add.s32 %r19, %r421, %r420; 2026-02-21T08:33:03.4271378Z xor.b32 %r422, %r420, 16; 2026-02-21T08:33:03.4271531Z add.s32 %r20, %r421, %r422; 2026-02-21T08:33:03.4271710Z xor.b32 %r423, %r420, 32; 2026-02-21T08:33:03.4271864Z add.s32 %r21, %r421, %r423; 2026-02-21T08:33:03.4272015Z xor.b32 %r424, %r420, 48; 2026-02-21T08:33:03.4272170Z add.s32 %r22, %r421, %r424; 2026-02-21T08:33:03.4272328Z xor.b32 %r425, %r420, 64; 2026-02-21T08:33:03.4272475Z add.s32 %r23, %r421, %r425; 2026-02-21T08:33:03.4272632Z xor.b32 %r426, %r420, 80; 2026-02-21T08:33:03.4272812Z add.s32 %r24, %r421, %r426; 2026-02-21T08:33:03.4272975Z xor.b32 %r427, %r420, 96; 2026-02-21T08:33:03.4273124Z add.s32 %r25, %r421, %r427; 2026-02-21T08:33:03.4273285Z xor.b32 %r428, %r420, 112; 2026-02-21T08:33:03.4273433Z add.s32 %r26, %r421, %r428; 2026-02-21T08:33:03.4273597Z bfe.u32 %r429, %r421, 4, 14; 2026-02-21T08:33:03.4273785Z cvt.u64.u32 %rd58, %r429; 2026-02-21T08:33:03.4273956Z or.b64 %rd62, %rd58, 4611686293439512576; 2026-02-21T08:33:03.4274136Z add.s32 %r430, %r53, 65568; 2026-02-21T08:33:03.4274289Z bfe.u32 %r431, %r430, 4, 14; 2026-02-21T08:33:03.4274450Z cvt.u64.u32 %rd59, %r431; 2026-02-21T08:33:03.4274607Z or.b64 %rd63, %rd59, 4611686293439512576; 2026-02-21T08:33:03.4274789Z add.s32 %r432, %r53, 65600; 2026-02-21T08:33:03.4274940Z bfe.u32 %r433, %r432, 4, 14; 
2026-02-21T08:33:03.4275112Z cvt.u64.u32 %rd60, %r433; 2026-02-21T08:33:03.4275323Z or.b64 %rd64, %rd60, 4611686293439512576; 2026-02-21T08:33:03.4275514Z add.s32 %r434, %r53, 65632; 2026-02-21T08:33:03.4275678Z bfe.u32 %r435, %r434, 4, 14; 2026-02-21T08:33:03.4275833Z cvt.u64.u32 %rd61, %r435; 2026-02-21T08:33:03.4276000Z or.b64 %rd65, %rd61, 4611686293439512576; 2026-02-21T08:33:03.4276175Z or.b32 %r27, %r403, 32; 2026-02-21T08:33:03.4276504Z .loc 1 42 27 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:42:27 2026-02-21T08:33:03.4276797Z shl.b32 %r436, %r3, 7; 2026-02-21T08:33:03.4276953Z and.b32 %r994, %r436, 3968; 2026-02-21T08:33:03.4277217Z .loc 1 43 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:43:32 2026-02-21T08:33:03.4277518Z or.b32 %r437, %r994, %r4; 2026-02-21T08:33:03.4277790Z .loc 1 44 27 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:44:27 2026-02-21T08:33:03.4278077Z shl.b32 %r438, %r3, 3; 2026-02-21T08:33:03.4278245Z and.b32 %r37, %r438, 7936; 2026-02-21T08:33:03.4278512Z .loc 1 58 53 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:53 2026-02-21T08:33:03.4278817Z shl.b32 %r439, %r437, 10; 2026-02-21T08:33:03.4278970Z $L__tmp0: 2026-02-21T08:33:03.4279274Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4279639Z shfl.sync.idx.b32 %r38, %r404, 0, 31, -1; 2026-02-21T08:33:03.4279827Z and.b32 %r39, %r38, 3; 2026-02-21T08:33:03.4279988Z shl.b32 %r440, %r39, 21; 2026-02-21T08:33:03.4280147Z add.s32 %r992, %r440, %r1384; 2026-02-21T08:33:03.4280321Z mov.pred %p68, -1; 2026-02-21T08:33:03.4280475Z // begin inline asm 2026-02-21T08:33:03.4280823Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 0], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4281191Z // end inline asm 2026-02-21T08:33:03.4281341Z // begin inline asm 2026-02-21T08:33:03.4281711Z @%p68 
tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 16], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4282067Z // end inline asm 2026-02-21T08:33:03.4282212Z // begin inline asm 2026-02-21T08:33:03.4282528Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 32], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4282887Z // end inline asm 2026-02-21T08:33:03.4283018Z // begin inline asm 2026-02-21T08:33:03.4283340Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 48], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4283689Z // end inline asm 2026-02-21T08:33:03.4283822Z // begin inline asm 2026-02-21T08:33:03.4284140Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 64], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4284484Z // end inline asm 2026-02-21T08:33:03.4284652Z // begin inline asm 2026-02-21T08:33:03.4284961Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 80], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4285318Z // end inline asm 2026-02-21T08:33:03.4285458Z // begin inline asm 2026-02-21T08:33:03.4285771Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 96], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4286154Z // end inline asm 2026-02-21T08:33:03.4286288Z // begin inline asm 2026-02-21T08:33:03.4286614Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 112], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4286976Z // end inline asm 2026-02-21T08:33:03.4287109Z // begin inline asm 2026-02-21T08:33:03.4287433Z @%p68 
tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 128], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4287783Z // end inline asm 2026-02-21T08:33:03.4287923Z // begin inline asm 2026-02-21T08:33:03.4288238Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 144], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4288655Z // end inline asm 2026-02-21T08:33:03.4288799Z // begin inline asm 2026-02-21T08:33:03.4289116Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 160], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4289476Z // end inline asm 2026-02-21T08:33:03.4289611Z // begin inline asm 2026-02-21T08:33:03.4289932Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 176], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4290288Z // end inline asm 2026-02-21T08:33:03.4290429Z // begin inline asm 2026-02-21T08:33:03.4290753Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 192], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4291103Z // end inline asm 2026-02-21T08:33:03.4291242Z // begin inline asm 2026-02-21T08:33:03.4291598Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 208], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4291955Z // end inline asm 2026-02-21T08:33:03.4292089Z // begin inline asm 2026-02-21T08:33:03.4292419Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 224], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4292773Z // end inline asm 2026-02-21T08:33:03.4292907Z // begin inline asm 2026-02-21T08:33:03.4293235Z @%p68 
tcgen05.st.sync.aligned.32x32b.x16.b32 [%r992 + 240], {%r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63, %r63}; 2026-02-21T08:33:03.4293585Z // end inline asm 2026-02-21T08:33:03.4293727Z // begin inline asm 2026-02-21T08:33:03.4293879Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:33:03.4294049Z // end inline asm 2026-02-21T08:33:03.4294186Z bar.sync 0; 2026-02-21T08:33:03.4294314Z $L__tmp1: 2026-02-21T08:33:03.4294570Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4294870Z add.s32 %r1386, %r53, 110592; 2026-02-21T08:33:03.4295033Z // begin inline asm 2026-02-21T08:33:03.4295200Z @%p99 mbarrier.init.shared::cta.b64 [%r1386], 1; 2026-02-21T08:33:03.4295395Z // end inline asm 2026-02-21T08:33:03.4295525Z bar.sync 0; 2026-02-21T08:33:03.4295665Z add.s32 %r351, %r53, 110600; 2026-02-21T08:33:03.4295816Z // begin inline asm 2026-02-21T08:33:03.4295986Z @%p99 mbarrier.init.shared::cta.b64 [%r351], 1; 2026-02-21T08:33:03.4296177Z // end inline asm 2026-02-21T08:33:03.4296313Z add.s32 %r549, %r53, 110608; 2026-02-21T08:33:03.4296503Z // begin inline asm 2026-02-21T08:33:03.4296662Z @%p99 mbarrier.init.shared::cta.b64 [%r549], 1; 2026-02-21T08:33:03.4296851Z // end inline asm 2026-02-21T08:33:03.4297095Z .loc 1 58 60 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:60 2026-02-21T08:33:03.4297393Z or.b32 %r441, %r439, %r403; 2026-02-21T08:33:03.4297665Z .loc 1 58 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:32 2026-02-21T08:33:03.4297983Z mad.wide.u32 %rd53, %r441, 2, %rd14; 2026-02-21T08:33:03.4298176Z cvt.u64.u32 %rd7, %r439; 2026-02-21T08:33:03.4298340Z add.s64 %rd54, %rd53, 65536; 2026-02-21T08:33:03.4298510Z add.s64 %rd55, %rd53, 131072; 2026-02-21T08:33:03.4298675Z add.s64 %rd56, %rd53, 196608; 2026-02-21T08:33:03.4298957Z .loc 1 58 80 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:80 2026-02-21T08:33:03.4299249Z // begin inline 
asm 2026-02-21T08:33:03.4299469Z cp.async.cg.shared.global [ %r541 + 0 ], [ %rd53 + 0 ], 0x10, %r354; 2026-02-21T08:33:03.4299708Z // end inline asm 2026-02-21T08:33:03.4299848Z // begin inline asm 2026-02-21T08:33:03.4300060Z cp.async.cg.shared.global [ %r543 + 0 ], [ %rd54 + 0 ], 0x10, %r354; 2026-02-21T08:33:03.4300291Z // end inline asm 2026-02-21T08:33:03.4300436Z // begin inline asm 2026-02-21T08:33:03.4300688Z cp.async.cg.shared.global [ %r545 + 0 ], [ %rd55 + 0 ], 0x10, %r354; 2026-02-21T08:33:03.4300929Z // end inline asm 2026-02-21T08:33:03.4301068Z // begin inline asm 2026-02-21T08:33:03.4301274Z cp.async.cg.shared.global [ %r547 + 0 ], [ %rd56 + 0 ], 0x10, %r354; 2026-02-21T08:33:03.4301506Z // end inline asm 2026-02-21T08:33:03.4301687Z cp.async.commit_group; 2026-02-21T08:33:03.4301978Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4302284Z bar.sync 0; 2026-02-21T08:33:03.4302427Z // begin inline asm 2026-02-21T08:33:03.4302624Z @%p99 mbarrier.arrive.expect_tx.shared.b64 _, [%r549], 4096; 2026-02-21T08:33:03.4302859Z // end inline asm 2026-02-21T08:33:03.4303112Z .loc 1 64 33 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:64:33 2026-02-21T08:33:03.4303412Z bar.sync 0; 2026-02-21T08:33:03.4303566Z elect.sync %r442|%p64, -1; 2026-02-21T08:33:03.4303740Z and.pred %p60, %p63, %p64; 2026-02-21T08:33:03.4303918Z and.b32 %r443, %r38, 1; 2026-02-21T08:33:03.4304077Z shl.b32 %r444, %r443, 11; 2026-02-21T08:33:03.4304247Z add.s32 %r550, %r409, %r444; 2026-02-21T08:33:03.4304408Z shl.b32 %r445, %r443, 7; 2026-02-21T08:33:03.4304574Z or.b32 %r551, %r445, %r37; 2026-02-21T08:33:03.4304731Z // begin inline asm 2026-02-21T08:33:03.4305444Z @%p60 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r550], [%rd71, {%r551, %r63}], [%r549]; 2026-02-21T08:33:03.4305819Z // end inline asm 2026-02-21T08:33:03.4306083Z .loc 1 58 80 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:80 2026-02-21T08:33:03.4306392Z cp.async.wait_group 0; 2026-02-21T08:33:03.4306542Z bar.sync 0; 2026-02-21T08:33:03.4306714Z ld.shared.v4.b32 {%r446, %r447, %r448, %r449}, [%r10]; 2026-02-21T08:33:03.4306918Z mov.b32 {%rs1, %rs2}, %r449; 2026-02-21T08:33:03.4307082Z mov.b32 {%rs3, %rs4}, %r448; 2026-02-21T08:33:03.4307241Z mov.b32 {%rs5, %rs6}, %r447; 2026-02-21T08:33:03.4307393Z mov.b32 {%rs7, %rs8}, %r446; 2026-02-21T08:33:03.4307587Z ld.shared.v4.b32 {%r450, %r451, %r452, %r453}, [%r10+16]; 2026-02-21T08:33:03.4307792Z mov.b32 {%rs9, %rs10}, %r453; 2026-02-21T08:33:03.4307960Z mov.b32 {%rs11, %rs12}, %r452; 2026-02-21T08:33:03.4308119Z mov.b32 {%rs13, %rs14}, %r451; 2026-02-21T08:33:03.4308281Z mov.b32 {%rs15, %rs16}, %r450; 2026-02-21T08:33:03.4308470Z ld.shared.v4.b32 {%r454, %r455, %r456, %r457}, [%r10+32]; 2026-02-21T08:33:03.4308678Z mov.b32 {%rs17, %rs18}, %r457; 2026-02-21T08:33:03.4308832Z mov.b32 {%rs19, %rs20}, %r456; 2026-02-21T08:33:03.4308991Z mov.b32 {%rs21, %rs22}, %r455; 2026-02-21T08:33:03.4309183Z mov.b32 {%rs23, %rs24}, %r454; 2026-02-21T08:33:03.4309371Z ld.shared.v4.b32 {%r458, %r459, %r460, %r461}, [%r10+48]; 2026-02-21T08:33:03.4309579Z mov.b32 {%rs25, %rs26}, %r461; 2026-02-21T08:33:03.4309734Z mov.b32 {%rs27, %rs28}, %r460; 2026-02-21T08:33:03.4309895Z mov.b32 {%rs29, %rs30}, %r459; 2026-02-21T08:33:03.4310053Z mov.b32 {%rs31, %rs32}, %r458; 2026-02-21T08:33:03.4310382Z .loc 1 62 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:62:32 2026-02-21T08:33:03.4310675Z cvt.f32.bf16 %r367, %rs7; 2026-02-21T08:33:03.4310829Z cvt.f32.bf16 %r368, %rs8; 2026-02-21T08:33:03.4310984Z cvt.f32.bf16 %r369, %rs5; 2026-02-21T08:33:03.4311130Z cvt.f32.bf16 %r370, %rs6; 2026-02-21T08:33:03.4311284Z cvt.f32.bf16 %r371, %rs3; 2026-02-21T08:33:03.4311430Z cvt.f32.bf16 %r372, %rs4; 2026-02-21T08:33:03.4311616Z cvt.f32.bf16 %r373, %rs1; 2026-02-21T08:33:03.4311769Z cvt.f32.bf16 %r374, %rs2; 
2026-02-21T08:33:03.4311931Z cvt.f32.bf16 %r375, %rs15; 2026-02-21T08:33:03.4312087Z cvt.f32.bf16 %r376, %rs16; 2026-02-21T08:33:03.4312249Z cvt.f32.bf16 %r377, %rs13; 2026-02-21T08:33:03.4312408Z cvt.f32.bf16 %r378, %rs14; 2026-02-21T08:33:03.4312559Z cvt.f32.bf16 %r379, %rs11; 2026-02-21T08:33:03.4312715Z cvt.f32.bf16 %r380, %rs12; 2026-02-21T08:33:03.4312920Z cvt.f32.bf16 %r381, %rs9; 2026-02-21T08:33:03.4313079Z cvt.f32.bf16 %r382, %rs10; 2026-02-21T08:33:03.4313227Z cvt.f32.bf16 %r384, %rs23; 2026-02-21T08:33:03.4313383Z cvt.f32.bf16 %r385, %rs24; 2026-02-21T08:33:03.4313532Z cvt.f32.bf16 %r386, %rs21; 2026-02-21T08:33:03.4313687Z cvt.f32.bf16 %r387, %rs22; 2026-02-21T08:33:03.4313837Z cvt.f32.bf16 %r388, %rs19; 2026-02-21T08:33:03.4313991Z cvt.f32.bf16 %r389, %rs20; 2026-02-21T08:33:03.4314147Z cvt.f32.bf16 %r390, %rs17; 2026-02-21T08:33:03.4314298Z cvt.f32.bf16 %r391, %rs18; 2026-02-21T08:33:03.4314455Z cvt.f32.bf16 %r392, %rs31; 2026-02-21T08:33:03.4314607Z cvt.f32.bf16 %r393, %rs32; 2026-02-21T08:33:03.4314767Z cvt.f32.bf16 %r394, %rs29; 2026-02-21T08:33:03.4314914Z cvt.f32.bf16 %r395, %rs30; 2026-02-21T08:33:03.4315068Z cvt.f32.bf16 %r396, %rs27; 2026-02-21T08:33:03.4315215Z cvt.f32.bf16 %r397, %rs28; 2026-02-21T08:33:03.4315370Z cvt.f32.bf16 %r398, %rs25; 2026-02-21T08:33:03.4315519Z cvt.f32.bf16 %r399, %rs26; 2026-02-21T08:33:03.4315674Z $L__tmp2: 2026-02-21T08:33:03.4315975Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4316303Z add.s32 %r366, %r440, %r527; 2026-02-21T08:33:03.4316465Z // begin inline asm 2026-02-21T08:33:03.4316825Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r366 + 0], {%r367, %r368, %r369, %r370, %r371, %r372, %r373, %r374, %r375, %r376, %r377, %r378, %r379, %r380, %r381, %r382}; 2026-02-21T08:33:03.4317219Z // end inline asm 2026-02-21T08:33:03.4317354Z // begin inline asm 2026-02-21T08:33:03.4317714Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r366 
+ 16], {%r384, %r385, %r386, %r387, %r388, %r389, %r390, %r391, %r392, %r393, %r394, %r395, %r396, %r397, %r398, %r399}; 2026-02-21T08:33:03.4318105Z // end inline asm 2026-02-21T08:33:03.4318240Z // begin inline asm 2026-02-21T08:33:03.4318404Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:33:03.4318563Z // end inline asm 2026-02-21T08:33:03.4318704Z bar.sync 0; 2026-02-21T08:33:03.4318832Z $L__tmp3: 2026-02-21T08:33:03.4319085Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4319377Z // begin inline asm 2026-02-21T08:33:03.4319520Z 2026-02-21T08:33:03.4319642Z { 2026-02-21T08:33:03.4319766Z .reg .pred complete; 2026-02-21T08:33:03.4319914Z waitLoop: 2026-02-21T08:33:03.4320100Z mbarrier.try_wait.parity.shared.b64 complete, [%r549], %r63; 2026-02-21T08:33:03.4320338Z @!complete bra.uni waitLoop; 2026-02-21T08:33:03.4320490Z } 2026-02-21T08:33:03.4320562Z 2026-02-21T08:33:03.4320616Z // end inline asm 2026-02-21T08:33:03.4320899Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4321200Z ld.shared.s8 %rs33, [%r11]; 2026-02-21T08:33:03.4321475Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4321801Z shl.b16 %rs34, %rs33, 4; 2026-02-21T08:33:03.4322104Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4326843Z ld.shared.s8 %rs35, [%r11+2048]; 2026-02-21T08:33:03.4327168Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4327461Z shl.b16 %rs36, %rs35, 4; 2026-02-21T08:33:03.4327735Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4328028Z ld.shared.s8 %rs37, [%r12+128]; 2026-02-21T08:33:03.4328311Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4328596Z shl.b16 %rs38, %rs37, 4; 2026-02-21T08:33:03.4328865Z .loc 1 69 25 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4329159Z ld.shared.s8 %rs39, [%r12+2176]; 2026-02-21T08:33:03.4329482Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4329768Z shl.b16 %rs40, %rs39, 4; 2026-02-21T08:33:03.4330054Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4330340Z ld.shared.s8 %rs41, [%r13+256]; 2026-02-21T08:33:03.4330624Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4330914Z shl.b16 %rs42, %rs41, 4; 2026-02-21T08:33:03.4331170Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4331468Z ld.shared.s8 %rs43, [%r13+2304]; 2026-02-21T08:33:03.4331791Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4332081Z shl.b16 %rs44, %rs43, 4; 2026-02-21T08:33:03.4332342Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4332643Z ld.shared.s8 %rs45, [%r14+384]; 2026-02-21T08:33:03.4332927Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4333220Z shl.b16 %rs46, %rs45, 4; 2026-02-21T08:33:03.4333490Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4333778Z ld.shared.s8 %rs47, [%r14+2432]; 2026-02-21T08:33:03.4334056Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4334351Z shl.b16 %rs48, %rs47, 4; 2026-02-21T08:33:03.4334617Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4334913Z ld.shared.s8 %rs49, [%r15+512]; 2026-02-21T08:33:03.4335182Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4335475Z shl.b16 %rs50, %rs49, 4; 2026-02-21T08:33:03.4335734Z .loc 1 69 25 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4336030Z ld.shared.s8 %rs51, [%r15+2560]; 2026-02-21T08:33:03.4336306Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4336587Z shl.b16 %rs52, %rs51, 4; 2026-02-21T08:33:03.4336847Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4337133Z ld.shared.s8 %rs53, [%r16+640]; 2026-02-21T08:33:03.4337411Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4337731Z shl.b16 %rs54, %rs53, 4; 2026-02-21T08:33:03.4337996Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4338289Z ld.shared.s8 %rs55, [%r16+2688]; 2026-02-21T08:33:03.4338559Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4338882Z shl.b16 %rs56, %rs55, 4; 2026-02-21T08:33:03.4339207Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4339494Z ld.shared.s8 %rs57, [%r17+768]; 2026-02-21T08:33:03.4339765Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4340093Z shl.b16 %rs58, %rs57, 4; 2026-02-21T08:33:03.4340356Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4340643Z ld.shared.s8 %rs59, [%r17+2816]; 2026-02-21T08:33:03.4340931Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4341224Z shl.b16 %rs60, %rs59, 4; 2026-02-21T08:33:03.4341525Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4341865Z ld.shared.s8 %rs61, [%r18+896]; 2026-02-21T08:33:03.4342150Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4342457Z shl.b16 %rs62, %rs61, 4; 2026-02-21T08:33:03.4342723Z .loc 1 69 25 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4343031Z ld.shared.s8 %rs63, [%r18+2944]; 2026-02-21T08:33:03.4343311Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4343611Z shl.b16 %rs64, %rs63, 4; 2026-02-21T08:33:03.4343885Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4344186Z ld.shared.s8 %rs65, [%r11+1024]; 2026-02-21T08:33:03.4344475Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4344776Z shl.b16 %rs66, %rs65, 4; 2026-02-21T08:33:03.4345050Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4345350Z ld.shared.s8 %rs67, [%r11+3072]; 2026-02-21T08:33:03.4345637Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4345936Z shl.b16 %rs68, %rs67, 4; 2026-02-21T08:33:03.4346201Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4346505Z ld.shared.s8 %rs69, [%r12+1152]; 2026-02-21T08:33:03.4346786Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4347087Z shl.b16 %rs70, %rs69, 4; 2026-02-21T08:33:03.4347353Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4347659Z ld.shared.s8 %rs71, [%r12+3200]; 2026-02-21T08:33:03.4347945Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4348241Z shl.b16 %rs72, %rs71, 4; 2026-02-21T08:33:03.4348517Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4348813Z ld.shared.s8 %rs73, [%r13+1280]; 2026-02-21T08:33:03.4349107Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4349384Z shl.b16 %rs74, %rs73, 4; 2026-02-21T08:33:03.4349644Z .loc 1 69 25 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4349961Z ld.shared.s8 %rs75, [%r13+3328]; 2026-02-21T08:33:03.4350225Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4350513Z shl.b16 %rs76, %rs75, 4; 2026-02-21T08:33:03.4350770Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4351087Z ld.shared.s8 %rs77, [%r14+1408]; 2026-02-21T08:33:03.4351360Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4351736Z shl.b16 %rs78, %rs77, 4; 2026-02-21T08:33:03.4352006Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4352295Z ld.shared.s8 %rs79, [%r14+3456]; 2026-02-21T08:33:03.4352572Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4352854Z shl.b16 %rs80, %rs79, 4; 2026-02-21T08:33:03.4353122Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4353412Z ld.shared.s8 %rs81, [%r15+1536]; 2026-02-21T08:33:03.4353678Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4354008Z shl.b16 %rs82, %rs81, 4; 2026-02-21T08:33:03.4354273Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4354554Z ld.shared.s8 %rs83, [%r15+3584]; 2026-02-21T08:33:03.4354826Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4355100Z shl.b16 %rs84, %rs83, 4; 2026-02-21T08:33:03.4355360Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4355642Z ld.shared.s8 %rs85, [%r16+1664]; 2026-02-21T08:33:03.4355914Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4356198Z shl.b16 %rs86, %rs85, 4; 2026-02-21T08:33:03.4356453Z .loc 1 69 25 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4356742Z ld.shared.s8 %rs87, [%r16+3712]; 2026-02-21T08:33:03.4357005Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4357285Z shl.b16 %rs88, %rs87, 4; 2026-02-21T08:33:03.4357540Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4357830Z ld.shared.s8 %rs89, [%r17+1792]; 2026-02-21T08:33:03.4358104Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4358379Z shl.b16 %rs90, %rs89, 4; 2026-02-21T08:33:03.4358640Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4358919Z ld.shared.s8 %rs91, [%r17+3840]; 2026-02-21T08:33:03.4359185Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4359459Z shl.b16 %rs92, %rs91, 4; 2026-02-21T08:33:03.4359721Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4360008Z ld.shared.s8 %rs93, [%r18+1920]; 2026-02-21T08:33:03.4360276Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4360558Z shl.b16 %rs94, %rs93, 4; 2026-02-21T08:33:03.4360809Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4361095Z ld.shared.s8 %rs95, [%r18+3968]; 2026-02-21T08:33:03.4361355Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4361668Z shl.b16 %rs96, %rs95, 4; 2026-02-21T08:33:03.4361961Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4362244Z cvt.s16.s8 %rs97, %rs34; 2026-02-21T08:33:03.4362400Z shr.s16 %rs98, %rs97, 4; 2026-02-21T08:33:03.4362548Z cvt.s16.s8 %rs99, %rs38; 2026-02-21T08:33:03.4362706Z shr.s16 %rs100, %rs99, 4; 2026-02-21T08:33:03.4362860Z shr.s16 %rs101, %rs33, 4; 
2026-02-21T08:33:03.4363047Z shr.s16 %rs102, %rs37, 4; 2026-02-21T08:33:03.4363345Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4363634Z cvt.rn.f32.s16 %r462, %rs102; 2026-02-21T08:33:03.4363806Z cvt.rn.f32.s16 %r463, %rs101; 2026-02-21T08:33:03.4363966Z cvt.rn.f32.s16 %r464, %rs100; 2026-02-21T08:33:03.4364132Z cvt.rn.f32.s16 %r465, %rs98; 2026-02-21T08:33:03.4364392Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4364680Z cvt.s16.s8 %rs103, %rs36; 2026-02-21T08:33:03.4364836Z shr.s16 %rs104, %rs103, 4; 2026-02-21T08:33:03.4365001Z cvt.s16.s8 %rs105, %rs40; 2026-02-21T08:33:03.4365157Z shr.s16 %rs106, %rs105, 4; 2026-02-21T08:33:03.4365309Z shr.s16 %rs107, %rs35, 4; 2026-02-21T08:33:03.4365464Z shr.s16 %rs108, %rs39, 4; 2026-02-21T08:33:03.4365743Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4366036Z cvt.rn.f32.s16 %r466, %rs108; 2026-02-21T08:33:03.4366195Z cvt.rn.f32.s16 %r467, %rs107; 2026-02-21T08:33:03.4366359Z cvt.rn.f32.s16 %r468, %rs106; 2026-02-21T08:33:03.4366512Z cvt.rn.f32.s16 %r469, %rs104; 2026-02-21T08:33:03.4366779Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4367064Z cvt.s16.s8 %rs109, %rs42; 2026-02-21T08:33:03.4367215Z shr.s16 %rs110, %rs109, 4; 2026-02-21T08:33:03.4367375Z cvt.s16.s8 %rs111, %rs46; 2026-02-21T08:33:03.4367523Z shr.s16 %rs112, %rs111, 4; 2026-02-21T08:33:03.4367687Z shr.s16 %rs113, %rs41, 4; 2026-02-21T08:33:03.4367833Z shr.s16 %rs114, %rs45, 4; 2026-02-21T08:33:03.4368098Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4368379Z cvt.rn.f32.s16 %r470, %rs114; 2026-02-21T08:33:03.4368543Z cvt.rn.f32.s16 %r471, %rs113; 2026-02-21T08:33:03.4368707Z cvt.rn.f32.s16 %r472, %rs112; 2026-02-21T08:33:03.4368860Z cvt.rn.f32.s16 %r473, %rs110; 2026-02-21T08:33:03.4369129Z .loc 1 69 
25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4369406Z cvt.s16.s8 %rs115, %rs44; 2026-02-21T08:33:03.4369562Z shr.s16 %rs116, %rs115, 4; 2026-02-21T08:33:03.4369710Z cvt.s16.s8 %rs117, %rs48; 2026-02-21T08:33:03.4369865Z shr.s16 %rs118, %rs117, 4; 2026-02-21T08:33:03.4370014Z shr.s16 %rs119, %rs43, 4; 2026-02-21T08:33:03.4370166Z shr.s16 %rs120, %rs47, 4; 2026-02-21T08:33:03.4370430Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4370713Z cvt.rn.f32.s16 %r474, %rs120; 2026-02-21T08:33:03.4370876Z cvt.rn.f32.s16 %r475, %rs119; 2026-02-21T08:33:03.4371028Z cvt.rn.f32.s16 %r476, %rs118; 2026-02-21T08:33:03.4371188Z cvt.rn.f32.s16 %r477, %rs116; 2026-02-21T08:33:03.4371450Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4371769Z cvt.s16.s8 %rs121, %rs50; 2026-02-21T08:33:03.4371920Z shr.s16 %rs122, %rs121, 4; 2026-02-21T08:33:03.4372083Z cvt.s16.s8 %rs123, %rs54; 2026-02-21T08:33:03.4372238Z shr.s16 %rs124, %rs123, 4; 2026-02-21T08:33:03.4372389Z shr.s16 %rs125, %rs49, 4; 2026-02-21T08:33:03.4372546Z shr.s16 %rs126, %rs53, 4; 2026-02-21T08:33:03.4372805Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4373097Z cvt.rn.f32.s16 %r478, %rs126; 2026-02-21T08:33:03.4373253Z cvt.rn.f32.s16 %r479, %rs125; 2026-02-21T08:33:03.4373450Z cvt.rn.f32.s16 %r480, %rs124; 2026-02-21T08:33:03.4373606Z cvt.rn.f32.s16 %r481, %rs122; 2026-02-21T08:33:03.4373876Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4374169Z cvt.s16.s8 %rs127, %rs52; 2026-02-21T08:33:03.4374322Z shr.s16 %rs128, %rs127, 4; 2026-02-21T08:33:03.4374509Z cvt.s16.s8 %rs129, %rs56; 2026-02-21T08:33:03.4374659Z shr.s16 %rs130, %rs129, 4; 2026-02-21T08:33:03.4374822Z shr.s16 %rs131, %rs51, 4; 2026-02-21T08:33:03.4375003Z shr.s16 %rs132, %rs55, 4; 2026-02-21T08:33:03.4375268Z .loc 1 
87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4375551Z cvt.rn.f32.s16 %r482, %rs132; 2026-02-21T08:33:03.4375719Z cvt.rn.f32.s16 %r483, %rs131; 2026-02-21T08:33:03.4375936Z cvt.rn.f32.s16 %r484, %rs130; 2026-02-21T08:33:03.4376093Z cvt.rn.f32.s16 %r485, %rs128; 2026-02-21T08:33:03.4376361Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4376641Z cvt.s16.s8 %rs133, %rs58; 2026-02-21T08:33:03.4376797Z shr.s16 %rs134, %rs133, 4; 2026-02-21T08:33:03.4376948Z cvt.s16.s8 %rs135, %rs62; 2026-02-21T08:33:03.4377103Z shr.s16 %rs136, %rs135, 4; 2026-02-21T08:33:03.4377293Z shr.s16 %rs137, %rs57, 4; 2026-02-21T08:33:03.4377453Z shr.s16 %rs138, %rs61, 4; 2026-02-21T08:33:03.4377721Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4378002Z cvt.rn.f32.s16 %r486, %rs138; 2026-02-21T08:33:03.4378166Z cvt.rn.f32.s16 %r487, %rs137; 2026-02-21T08:33:03.4378318Z cvt.rn.f32.s16 %r488, %rs136; 2026-02-21T08:33:03.4378478Z cvt.rn.f32.s16 %r489, %rs134; 2026-02-21T08:33:03.4378732Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4379016Z cvt.s16.s8 %rs139, %rs60; 2026-02-21T08:33:03.4379170Z shr.s16 %rs140, %rs139, 4; 2026-02-21T08:33:03.4379322Z cvt.s16.s8 %rs141, %rs64; 2026-02-21T08:33:03.4379477Z shr.s16 %rs142, %rs141, 4; 2026-02-21T08:33:03.4379627Z shr.s16 %rs143, %rs59, 4; 2026-02-21T08:33:03.4379781Z shr.s16 %rs144, %rs63, 4; 2026-02-21T08:33:03.4380032Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4380320Z cvt.rn.f32.s16 %r490, %rs144; 2026-02-21T08:33:03.4380475Z cvt.rn.f32.s16 %r491, %rs143; 2026-02-21T08:33:03.4380637Z cvt.rn.f32.s16 %r492, %rs142; 2026-02-21T08:33:03.4380792Z cvt.rn.f32.s16 %r493, %rs140; 2026-02-21T08:33:03.4381057Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 
2026-02-21T08:33:03.4381346Z cvt.s16.s8 %rs145, %rs66; 2026-02-21T08:33:03.4381493Z shr.s16 %rs146, %rs145, 4; 2026-02-21T08:33:03.4381695Z cvt.s16.s8 %rs147, %rs70; 2026-02-21T08:33:03.4381843Z shr.s16 %rs148, %rs147, 4; 2026-02-21T08:33:03.4382004Z shr.s16 %rs149, %rs65, 4; 2026-02-21T08:33:03.4382150Z shr.s16 %rs150, %rs69, 4; 2026-02-21T08:33:03.4382413Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4382704Z cvt.rn.f32.s16 %r494, %rs150; 2026-02-21T08:33:03.4382862Z cvt.rn.f32.s16 %r495, %rs149; 2026-02-21T08:33:03.4383025Z cvt.rn.f32.s16 %r496, %rs148; 2026-02-21T08:33:03.4383182Z cvt.rn.f32.s16 %r497, %rs146; 2026-02-21T08:33:03.4383458Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4383742Z cvt.s16.s8 %rs151, %rs68; 2026-02-21T08:33:03.4383902Z shr.s16 %rs152, %rs151, 4; 2026-02-21T08:33:03.4384055Z cvt.s16.s8 %rs153, %rs72; 2026-02-21T08:33:03.4384212Z shr.s16 %rs154, %rs153, 4; 2026-02-21T08:33:03.4384369Z shr.s16 %rs155, %rs67, 4; 2026-02-21T08:33:03.4384515Z shr.s16 %rs156, %rs71, 4; 2026-02-21T08:33:03.4384784Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4385123Z cvt.rn.f32.s16 %r498, %rs156; 2026-02-21T08:33:03.4385293Z cvt.rn.f32.s16 %r499, %rs155; 2026-02-21T08:33:03.4385453Z cvt.rn.f32.s16 %r500, %rs154; 2026-02-21T08:33:03.4385617Z cvt.rn.f32.s16 %r501, %rs152; 2026-02-21T08:33:03.4385894Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4386229Z cvt.s16.s8 %rs157, %rs74; 2026-02-21T08:33:03.4386394Z shr.s16 %rs158, %rs157, 4; 2026-02-21T08:33:03.4386584Z cvt.s16.s8 %rs159, %rs78; 2026-02-21T08:33:03.4386748Z shr.s16 %rs160, %rs159, 4; 2026-02-21T08:33:03.4386905Z shr.s16 %rs161, %rs73, 4; 2026-02-21T08:33:03.4387065Z shr.s16 %rs162, %rs77, 4; 2026-02-21T08:33:03.4387337Z .loc 1 87 32 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4387642Z cvt.rn.f32.s16 %r502, %rs162; 2026-02-21T08:33:03.4387806Z cvt.rn.f32.s16 %r503, %rs161; 2026-02-21T08:33:03.4387976Z cvt.rn.f32.s16 %r504, %rs160; 2026-02-21T08:33:03.4388147Z cvt.rn.f32.s16 %r505, %rs158; 2026-02-21T08:33:03.4388424Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4388726Z cvt.s16.s8 %rs163, %rs76; 2026-02-21T08:33:03.4388908Z shr.s16 %rs164, %rs163, 4; 2026-02-21T08:33:03.4389076Z cvt.s16.s8 %rs165, %rs80; 2026-02-21T08:33:03.4389232Z shr.s16 %rs166, %rs165, 4; 2026-02-21T08:33:03.4389393Z shr.s16 %rs167, %rs75, 4; 2026-02-21T08:33:03.4389549Z shr.s16 %rs168, %rs79, 4; 2026-02-21T08:33:03.4389827Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4390129Z cvt.rn.f32.s16 %r506, %rs168; 2026-02-21T08:33:03.4390289Z cvt.rn.f32.s16 %r507, %rs167; 2026-02-21T08:33:03.4390454Z cvt.rn.f32.s16 %r508, %rs166; 2026-02-21T08:33:03.4390611Z cvt.rn.f32.s16 %r509, %rs164; 2026-02-21T08:33:03.4390889Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4391184Z cvt.s16.s8 %rs169, %rs82; 2026-02-21T08:33:03.4391346Z shr.s16 %rs170, %rs169, 4; 2026-02-21T08:33:03.4391504Z cvt.s16.s8 %rs171, %rs86; 2026-02-21T08:33:03.4391722Z shr.s16 %rs172, %rs171, 4; 2026-02-21T08:33:03.4391889Z shr.s16 %rs173, %rs81, 4; 2026-02-21T08:33:03.4392048Z shr.s16 %rs174, %rs85, 4; 2026-02-21T08:33:03.4392323Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4392623Z cvt.rn.f32.s16 %r510, %rs174; 2026-02-21T08:33:03.4392795Z cvt.rn.f32.s16 %r511, %rs173; 2026-02-21T08:33:03.4392957Z cvt.rn.f32.s16 %r512, %rs172; 2026-02-21T08:33:03.4393126Z cvt.rn.f32.s16 %r513, %rs170; 2026-02-21T08:33:03.4393400Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4393716Z 
cvt.s16.s8 %rs175, %rs84; 2026-02-21T08:33:03.4393877Z shr.s16 %rs176, %rs175, 4; 2026-02-21T08:33:03.4394031Z cvt.s16.s8 %rs177, %rs88; 2026-02-21T08:33:03.4394192Z shr.s16 %rs178, %rs177, 4; 2026-02-21T08:33:03.4394344Z shr.s16 %rs179, %rs83, 4; 2026-02-21T08:33:03.4394503Z shr.s16 %rs180, %rs87, 4; 2026-02-21T08:33:03.4394763Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4395055Z cvt.rn.f32.s16 %r514, %rs180; 2026-02-21T08:33:03.4395211Z cvt.rn.f32.s16 %r515, %rs179; 2026-02-21T08:33:03.4395375Z cvt.rn.f32.s16 %r516, %rs178; 2026-02-21T08:33:03.4395534Z cvt.rn.f32.s16 %r517, %rs176; 2026-02-21T08:33:03.4395796Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4396083Z cvt.s16.s8 %rs181, %rs90; 2026-02-21T08:33:03.4396239Z shr.s16 %rs182, %rs181, 4; 2026-02-21T08:33:03.4396395Z cvt.s16.s8 %rs183, %rs94; 2026-02-21T08:33:03.4396542Z shr.s16 %rs184, %rs183, 4; 2026-02-21T08:33:03.4396731Z shr.s16 %rs185, %rs89, 4; 2026-02-21T08:33:03.4396878Z shr.s16 %rs186, %rs93, 4; 2026-02-21T08:33:03.4397143Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4397433Z cvt.rn.f32.s16 %r518, %rs186; 2026-02-21T08:33:03.4397590Z cvt.rn.f32.s16 %r519, %rs185; 2026-02-21T08:33:03.4397751Z cvt.rn.f32.s16 %r520, %rs184; 2026-02-21T08:33:03.4397957Z cvt.rn.f32.s16 %r521, %rs182; 2026-02-21T08:33:03.4398253Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4398537Z cvt.s16.s8 %rs187, %rs92; 2026-02-21T08:33:03.4398694Z shr.s16 %rs188, %rs187, 4; 2026-02-21T08:33:03.4398846Z cvt.s16.s8 %rs189, %rs96; 2026-02-21T08:33:03.4399003Z shr.s16 %rs190, %rs189, 4; 2026-02-21T08:33:03.4399161Z shr.s16 %rs191, %rs91, 4; 2026-02-21T08:33:03.4399309Z shr.s16 %rs192, %rs95, 4; 2026-02-21T08:33:03.4399581Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 
2026-02-21T08:33:03.4399867Z cvt.rn.f32.s16 %r522, %rs192; 2026-02-21T08:33:03.4400031Z cvt.rn.f32.s16 %r523, %rs191; 2026-02-21T08:33:03.4400186Z cvt.rn.f32.s16 %r524, %rs190; 2026-02-21T08:33:03.4400347Z cvt.rn.f32.s16 %r525, %rs188; 2026-02-21T08:33:03.4400561Z st.shared.v4.b32 [%r19], {%r465, %r463, %r464, %r462}; 2026-02-21T08:33:03.4400811Z st.shared.v4.b32 [%r19+16384], {%r469, %r467, %r468, %r466}; 2026-02-21T08:33:03.4401054Z st.shared.v4.b32 [%r20], {%r473, %r471, %r472, %r470}; 2026-02-21T08:33:03.4401287Z st.shared.v4.b32 [%r20+16384], {%r477, %r475, %r476, %r474}; 2026-02-21T08:33:03.4401526Z st.shared.v4.b32 [%r21], {%r481, %r479, %r480, %r478}; 2026-02-21T08:33:03.4401792Z st.shared.v4.b32 [%r21+16384], {%r485, %r483, %r484, %r482}; 2026-02-21T08:33:03.4402030Z st.shared.v4.b32 [%r22], {%r489, %r487, %r488, %r486}; 2026-02-21T08:33:03.4402256Z st.shared.v4.b32 [%r22+16384], {%r493, %r491, %r492, %r490}; 2026-02-21T08:33:03.4402490Z st.shared.v4.b32 [%r23], {%r497, %r495, %r496, %r494}; 2026-02-21T08:33:03.4402721Z st.shared.v4.b32 [%r23+16384], {%r501, %r499, %r500, %r498}; 2026-02-21T08:33:03.4402947Z st.shared.v4.b32 [%r24], {%r505, %r503, %r504, %r502}; 2026-02-21T08:33:03.4403180Z st.shared.v4.b32 [%r24+16384], {%r509, %r507, %r508, %r506}; 2026-02-21T08:33:03.4403408Z st.shared.v4.b32 [%r25], {%r513, %r511, %r512, %r510}; 2026-02-21T08:33:03.4403643Z st.shared.v4.b32 [%r25+16384], {%r517, %r515, %r516, %r514}; 2026-02-21T08:33:03.4403869Z st.shared.v4.b32 [%r26], {%r521, %r519, %r520, %r518}; 2026-02-21T08:33:03.4404103Z st.shared.v4.b32 [%r26+16384], {%r525, %r523, %r524, %r522}; 2026-02-21T08:33:03.4404307Z $L__tmp4: 2026-02-21T08:33:03.4404598Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4404936Z // begin inline asm 2026-02-21T08:33:03.4405103Z fence.proxy.async.shared::cta; 2026-02-21T08:33:03.4405274Z // end inline asm 2026-02-21T08:33:03.4405409Z bar.sync 0; 
2026-02-21T08:33:03.4405552Z setp.ne.b32 %p65, %r38, 0; 2026-02-21T08:33:03.4405706Z @%p65 bra $L__BB0_3; 2026-02-21T08:33:03.4405854Z // %bb.2: 2026-02-21T08:33:03.4405995Z elect.sync %r538|%p67, -1; 2026-02-21T08:33:03.4406153Z mov.b32 %r528, 138414352; 2026-02-21T08:33:03.4406313Z mov.pred %p66, 0; 2026-02-21T08:33:03.4406454Z // begin inline asm 2026-02-21T08:33:03.4406701Z @%p67 tcgen05.mma.cta_group::1.kind::tf32 [ %r1384 + 0 ], [ %r527 + 0 ], %rd62, %r528, %p66; 2026-02-21T08:33:03.4406964Z // end inline asm 2026-02-21T08:33:03.4407108Z // begin inline asm 2026-02-21T08:33:03.4407335Z @%p67 tcgen05.mma.cta_group::1.kind::tf32 [ %r1384 + 0 ], [ %r527 + 8 ], %rd63, %r528, %p68; 2026-02-21T08:33:03.4407595Z // end inline asm 2026-02-21T08:33:03.4407736Z // begin inline asm 2026-02-21T08:33:03.4407958Z @%p67 tcgen05.mma.cta_group::1.kind::tf32 [ %r1384 + 0 ], [ %r527 + 16 ], %rd64, %r528, %p68; 2026-02-21T08:33:03.4408249Z // end inline asm 2026-02-21T08:33:03.4408382Z // begin inline asm 2026-02-21T08:33:03.4408613Z @%p67 tcgen05.mma.cta_group::1.kind::tf32 [ %r1384 + 0 ], [ %r527 + 24 ], %rd65, %r528, %p68; 2026-02-21T08:33:03.4408868Z // end inline asm 2026-02-21T08:33:03.4409013Z add.s32 %r540, %r53, 110592; 2026-02-21T08:33:03.4409175Z cvt.u64.u32 %rd66, %r540; 2026-02-21T08:33:03.4409328Z // begin inline asm 2026-02-21T08:33:03.4409571Z @%p67 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd66]; 2026-02-21T08:33:03.4409801Z // end inline asm 2026-02-21T08:33:03.4409969Z $L__tmp5: 2026-02-21T08:33:03.4410095Z $L__BB0_3: 2026-02-21T08:33:03.4410259Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:33:03.4410452Z add.s32 %r28, %r53, %r420; 2026-02-21T08:33:03.4410612Z add.s32 %r29, %r53, %r422; 2026-02-21T08:33:03.4410762Z add.s32 %r30, %r53, %r423; 2026-02-21T08:33:03.4410917Z add.s32 %r31, %r53, %r424; 2026-02-21T08:33:03.4411070Z add.s32 %r32, %r53, %r425; 2026-02-21T08:33:03.4411216Z add.s32 %r33, %r53, %r426; 2026-02-21T08:33:03.4411369Z add.s32 
%r34, %r53, %r427; 2026-02-21T08:33:03.4411515Z add.s32 %r35, %r53, %r428; 2026-02-21T08:33:03.4411817Z .loc 1 58 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:32 2026-02-21T08:33:03.4412105Z add.s64 %rd67, %rd53, 64; 2026-02-21T08:33:03.4412292Z cvt.u64.u32 %rd73, %r27; 2026-02-21T08:33:03.4412449Z add.s64 %rd74, %rd7, %rd73; 2026-02-21T08:33:03.4412617Z shl.b64 %rd75, %rd74, 1; 2026-02-21T08:33:03.4412778Z add.s64 %rd76, %rd14, %rd75; 2026-02-21T08:33:03.4412936Z add.s64 %rd68, %rd76, 65536; 2026-02-21T08:33:03.4413096Z add.s64 %rd69, %rd76, 131072; 2026-02-21T08:33:03.4413255Z add.s64 %rd70, %rd76, 196608; 2026-02-21T08:33:03.4413416Z mov.b32 %r542, 16; 2026-02-21T08:33:03.4413664Z .loc 1 58 80 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:80 2026-02-21T08:33:03.4413953Z // begin inline asm 2026-02-21T08:33:03.4414156Z cp.async.cg.shared.global [ %r541 + 0 ], [ %rd67 + 0 ], 0x10, %r542; 2026-02-21T08:33:03.4414386Z // end inline asm 2026-02-21T08:33:03.4414527Z // begin inline asm 2026-02-21T08:33:03.4414720Z cp.async.cg.shared.global [ %r543 + 0 ], [ %rd68 + 0 ], 0x10, %r542; 2026-02-21T08:33:03.4414949Z // end inline asm 2026-02-21T08:33:03.4415084Z // begin inline asm 2026-02-21T08:33:03.4415283Z cp.async.cg.shared.global [ %r545 + 0 ], [ %rd69 + 0 ], 0x10, %r542; 2026-02-21T08:33:03.4415503Z // end inline asm 2026-02-21T08:33:03.4415642Z // begin inline asm 2026-02-21T08:33:03.4415832Z cp.async.cg.shared.global [ %r547 + 0 ], [ %rd70 + 0 ], 0x10, %r542; 2026-02-21T08:33:03.4416057Z // end inline asm 2026-02-21T08:33:03.4416203Z cp.async.commit_group; 2026-02-21T08:33:03.4416468Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4416762Z // begin inline asm 2026-02-21T08:33:03.4416952Z @%p99 mbarrier.arrive.expect_tx.shared.b64 _, [%r549], 4096; 2026-02-21T08:33:03.4417175Z // end inline asm 2026-02-21T08:33:03.4417415Z .loc 1 64 33 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:64:33 2026-02-21T08:33:03.4417697Z bar.sync 0; 2026-02-21T08:33:03.4417837Z elect.sync %r558|%p78, -1; 2026-02-21T08:33:03.4418007Z and.pred %p76, %p63, %p78; 2026-02-21T08:33:03.4418168Z // begin inline asm 2026-02-21T08:33:03.4418495Z @%p76 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r550], [%rd71, {%r551, %r542}], [%r549]; 2026-02-21T08:33:03.4418849Z // end inline asm 2026-02-21T08:33:03.4419094Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4419391Z and.b32 %r559, %r1, 3; 2026-02-21T08:33:03.4419545Z mul.wide.u32 %rd77, %r559, 16; 2026-02-21T08:33:03.4419715Z and.b32 %r560, %r3, 31; 2026-02-21T08:33:03.4419872Z shl.b32 %r561, %r560, 17; 2026-02-21T08:33:03.4420021Z shl.b32 %r562, %r4, 10; 2026-02-21T08:33:03.4420215Z or.b32 %r563, %r561, %r562; 2026-02-21T08:33:03.4420375Z mul.wide.u32 %rd78, %r563, 2; 2026-02-21T08:33:03.4420543Z or.b64 %rd79, %rd77, %rd78; 2026-02-21T08:33:03.4420698Z add.s64 %rd80, %rd79, %rd14; 2026-02-21T08:33:03.4420862Z add.s64 %rd604, %rd80, 196736; 2026-02-21T08:33:03.4421016Z mov.b32 %r1388, 1; 2026-02-21T08:33:03.4421160Z mov.b32 %r1385, 0; 2026-02-21T08:33:03.4421321Z mov.b64 %rd605, -16; 2026-02-21T08:33:03.4421472Z mov.b32 %r1387, %r1385; 2026-02-21T08:33:03.4421659Z mov.b32 %r1389, %r1385; 2026-02-21T08:33:03.4421836Z bra.uni $L__BB0_4; 2026-02-21T08:33:03.4422033Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T08:33:03.4422369Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4422671Z add.s64 %rd12, %rd605, 16; 2026-02-21T08:33:03.4422837Z setp.lt.u64 %p93, %rd12, 480; 2026-02-21T08:33:03.4423008Z $L__tmp6: 2026-02-21T08:33:03.4423303Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4423648Z add.s32 %r711, %r1388, 1; 2026-02-21T08:33:03.4423817Z 
setp.gt.s32 %p96, %r711, 1; 2026-02-21T08:33:03.4423986Z selp.b32 %r1388, 0, %r711, %p96; 2026-02-21T08:33:03.4424197Z selp.b32 %r712, 1, 0, %p96; 2026-02-21T08:33:03.4424360Z xor.b32 %r52, %r1389, %r712; 2026-02-21T08:33:03.4424515Z $L__tmp7: 2026-02-21T08:33:03.4424751Z .loc 1 58 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:32 2026-02-21T08:33:03.4425049Z add.s64 %rd86, %rd604, -196608; 2026-02-21T08:33:03.4425217Z add.s64 %rd87, %rd604, -131072; 2026-02-21T08:33:03.4425386Z add.s64 %rd88, %rd604, -65536; 2026-02-21T08:33:03.4425664Z .loc 1 58 80 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:80 2026-02-21T08:33:03.4425946Z selp.b32 %r699, 16, 0, %p93; 2026-02-21T08:33:03.4426106Z // begin inline asm 2026-02-21T08:33:03.4426305Z cp.async.cg.shared.global [ %r541 + 0 ], [ %rd86 + 0 ], 0x10, %r699; 2026-02-21T08:33:03.4426538Z // end inline asm 2026-02-21T08:33:03.4426675Z // begin inline asm 2026-02-21T08:33:03.4426880Z cp.async.cg.shared.global [ %r543 + 0 ], [ %rd87 + 0 ], 0x10, %r699; 2026-02-21T08:33:03.4427104Z // end inline asm 2026-02-21T08:33:03.4427241Z // begin inline asm 2026-02-21T08:33:03.4427442Z cp.async.cg.shared.global [ %r545 + 0 ], [ %rd88 + 0 ], 0x10, %r699; 2026-02-21T08:33:03.4427660Z // end inline asm 2026-02-21T08:33:03.4427802Z // begin inline asm 2026-02-21T08:33:03.4427996Z cp.async.cg.shared.global [ %r547 + 0 ], [ %rd604 + 0 ], 0x10, %r699; 2026-02-21T08:33:03.4428220Z // end inline asm 2026-02-21T08:33:03.4428360Z cp.async.commit_group; 2026-02-21T08:33:03.4428631Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4428933Z and.pred %p91, %p99, %p93; 2026-02-21T08:33:03.4429092Z // begin inline asm 2026-02-21T08:33:03.4429293Z @%p91 mbarrier.arrive.expect_tx.shared.b64 _, [%r549], 4096; 2026-02-21T08:33:03.4429516Z // end inline asm 2026-02-21T08:33:03.4429778Z .loc 1 64 33 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:64:33 
2026-02-21T08:33:03.4430073Z bar.sync 0; 2026-02-21T08:33:03.4430224Z elect.sync %r714|%p97, -1; 2026-02-21T08:33:03.4430393Z and.pred %p98, %p93, %p97; 2026-02-21T08:33:03.4430565Z and.pred %p92, %p63, %p98; 2026-02-21T08:33:03.4430729Z cvt.u32.u64 %r715, %rd605; 2026-02-21T08:33:03.4430894Z add.s32 %r709, %r715, 48; 2026-02-21T08:33:03.4431055Z // begin inline asm 2026-02-21T08:33:03.4431393Z @%p92 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r550], [%rd71, {%r551, %r709}], [%r549]; 2026-02-21T08:33:03.4431789Z // end inline asm 2026-02-21T08:33:03.4432053Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4432398Z add.s64 %rd604, %rd604, 64; 2026-02-21T08:33:03.4432562Z mov.b64 %rd605, %rd12; 2026-02-21T08:33:03.4432725Z mov.b32 %r1385, %r1389; 2026-02-21T08:33:03.4432886Z mov.b32 %r1389, %r52; 2026-02-21T08:33:03.4433040Z @%p93 bra $L__BB0_4; 2026-02-21T08:33:03.4433199Z bra.uni $L__BB0_7; 2026-02-21T08:33:03.4433397Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T08:33:03.4433768Z .loc 1 58 80 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:58:80 2026-02-21T08:33:03.4434101Z cp.async.wait_group 0; 2026-02-21T08:33:03.4434267Z bar.sync 0; 2026-02-21T08:33:03.4434441Z ld.shared.v4.b32 {%r602, %r603, %r604, %r605}, [%r10]; 2026-02-21T08:33:03.4434665Z mov.b32 {%rs193, %rs194}, %r605; 2026-02-21T08:33:03.4434853Z mov.b32 {%rs195, %rs196}, %r604; 2026-02-21T08:33:03.4435025Z mov.b32 {%rs197, %rs198}, %r603; 2026-02-21T08:33:03.4435202Z mov.b32 {%rs199, %rs200}, %r602; 2026-02-21T08:33:03.4435410Z ld.shared.v4.b32 {%r606, %r607, %r608, %r609}, [%r10+16]; 2026-02-21T08:33:03.4435634Z mov.b32 {%rs201, %rs202}, %r609; 2026-02-21T08:33:03.4435802Z mov.b32 {%rs203, %rs204}, %r608; 2026-02-21T08:33:03.4435975Z mov.b32 {%rs205, %rs206}, %r607; 2026-02-21T08:33:03.4436142Z mov.b32 {%rs207, %rs208}, %r606; 2026-02-21T08:33:03.4436381Z ld.shared.v4.b32 {%r610, %r611, 
%r612, %r613}, [%r10+32]; 2026-02-21T08:33:03.4436607Z mov.b32 {%rs209, %rs210}, %r613; 2026-02-21T08:33:03.4436782Z mov.b32 {%rs211, %rs212}, %r612; 2026-02-21T08:33:03.4436960Z mov.b32 {%rs213, %rs214}, %r611; 2026-02-21T08:33:03.4437132Z mov.b32 {%rs215, %rs216}, %r610; 2026-02-21T08:33:03.4437329Z ld.shared.v4.b32 {%r614, %r615, %r616, %r617}, [%r10+48]; 2026-02-21T08:33:03.4437532Z mov.b32 {%rs217, %rs218}, %r617; 2026-02-21T08:33:03.4437697Z mov.b32 {%rs219, %rs220}, %r616; 2026-02-21T08:33:03.4437856Z mov.b32 {%rs221, %rs222}, %r615; 2026-02-21T08:33:03.4438020Z mov.b32 {%rs223, %rs224}, %r614; 2026-02-21T08:33:03.4438299Z .loc 1 62 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:62:32 2026-02-21T08:33:03.4438592Z cvt.f32.bf16 %r567, %rs199; 2026-02-21T08:33:03.4438761Z cvt.f32.bf16 %r568, %rs200; 2026-02-21T08:33:03.4438917Z cvt.f32.bf16 %r569, %rs197; 2026-02-21T08:33:03.4439080Z cvt.f32.bf16 %r570, %rs198; 2026-02-21T08:33:03.4439232Z cvt.f32.bf16 %r571, %rs195; 2026-02-21T08:33:03.4439392Z cvt.f32.bf16 %r572, %rs196; 2026-02-21T08:33:03.4439544Z cvt.f32.bf16 %r573, %rs193; 2026-02-21T08:33:03.4439704Z cvt.f32.bf16 %r574, %rs194; 2026-02-21T08:33:03.4439863Z cvt.f32.bf16 %r575, %rs207; 2026-02-21T08:33:03.4440014Z cvt.f32.bf16 %r576, %rs208; 2026-02-21T08:33:03.4440173Z cvt.f32.bf16 %r577, %rs205; 2026-02-21T08:33:03.4440321Z cvt.f32.bf16 %r578, %rs206; 2026-02-21T08:33:03.4440475Z cvt.f32.bf16 %r579, %rs203; 2026-02-21T08:33:03.4440623Z cvt.f32.bf16 %r580, %rs204; 2026-02-21T08:33:03.4440778Z cvt.f32.bf16 %r581, %rs201; 2026-02-21T08:33:03.4440925Z cvt.f32.bf16 %r582, %rs202; 2026-02-21T08:33:03.4441084Z cvt.f32.bf16 %r584, %rs215; 2026-02-21T08:33:03.4441234Z cvt.f32.bf16 %r585, %rs216; 2026-02-21T08:33:03.4441389Z cvt.f32.bf16 %r586, %rs213; 2026-02-21T08:33:03.4441623Z cvt.f32.bf16 %r587, %rs214; 2026-02-21T08:33:03.4441779Z cvt.f32.bf16 %r588, %rs211; 2026-02-21T08:33:03.4441940Z cvt.f32.bf16 %r589, %rs212; 2026-02-21T08:33:03.4442094Z 
cvt.f32.bf16 %r590, %rs209; 2026-02-21T08:33:03.4442253Z cvt.f32.bf16 %r591, %rs210; 2026-02-21T08:33:03.4442404Z cvt.f32.bf16 %r592, %rs223; 2026-02-21T08:33:03.4442564Z cvt.f32.bf16 %r593, %rs224; 2026-02-21T08:33:03.4442716Z cvt.f32.bf16 %r594, %rs221; 2026-02-21T08:33:03.4442874Z cvt.f32.bf16 %r595, %rs222; 2026-02-21T08:33:03.4443032Z cvt.f32.bf16 %r596, %rs219; 2026-02-21T08:33:03.4443185Z cvt.f32.bf16 %r597, %rs220; 2026-02-21T08:33:03.4443345Z cvt.f32.bf16 %r598, %rs217; 2026-02-21T08:33:03.4443495Z cvt.f32.bf16 %r599, %rs218; 2026-02-21T08:33:03.4443650Z $L__tmp8: 2026-02-21T08:33:03.4443999Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4444331Z // begin inline asm 2026-02-21T08:33:03.4444467Z 2026-02-21T08:33:03.4444589Z { 2026-02-21T08:33:03.4444713Z .reg .pred complete; 2026-02-21T08:33:03.4444869Z waitLoop: 2026-02-21T08:33:03.4445068Z mbarrier.try_wait.parity.shared.b64 complete, [%r1386], %r1385; 2026-02-21T08:33:03.4445376Z @!complete bra.uni waitLoop; 2026-02-21T08:33:03.4445540Z } 2026-02-21T08:33:03.4445607Z 2026-02-21T08:33:03.4445698Z // end inline asm 2026-02-21T08:33:03.4445848Z $L__tmp9: 2026-02-21T08:33:03.4446095Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4446395Z xor.b32 %r1387, %r1387, 1; 2026-02-21T08:33:03.4446554Z mov.pred %p82, -1; 2026-02-21T08:33:03.4446709Z $L__tmp10: 2026-02-21T08:33:03.4446999Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4447326Z // begin inline asm 2026-02-21T08:33:03.4447695Z @%p82 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r366 + 0], {%r567, %r568, %r569, %r570, %r571, %r572, %r573, %r574, %r575, %r576, %r577, %r578, %r579, %r580, %r581, %r582}; 2026-02-21T08:33:03.4448079Z // end inline asm 2026-02-21T08:33:03.4448248Z // begin inline asm 2026-02-21T08:33:03.4448535Z @%p82 
tcgen05.st.sync.aligned.32x32b.x16.b32 [%r366 + 16], {%r584, %r585, %r586, %r587, %r588, %r589, %r590, %r591, %r592, %r593, %r594, %r595, %r596, %r597, %r598, %r599}; 2026-02-21T08:33:03.4448599Z // end inline asm 2026-02-21T08:33:03.4448656Z // begin inline asm 2026-02-21T08:33:03.4448729Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:33:03.4448784Z // end inline asm 2026-02-21T08:33:03.4448847Z bar.sync 0; 2026-02-21T08:33:03.4448900Z $L__tmp11: 2026-02-21T08:33:03.4449081Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4449144Z // begin inline asm 2026-02-21T08:33:03.4449194Z 2026-02-21T08:33:03.4449242Z { 2026-02-21T08:33:03.4449303Z .reg .pred complete; 2026-02-21T08:33:03.4449364Z waitLoop: 2026-02-21T08:33:03.4449480Z mbarrier.try_wait.parity.shared.b64 complete, [%r549], %r1387; 2026-02-21T08:33:03.4449545Z @!complete bra.uni waitLoop; 2026-02-21T08:33:03.4449602Z } 2026-02-21T08:33:03.4449606Z 2026-02-21T08:33:03.4449660Z // end inline asm 2026-02-21T08:33:03.4449826Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4449898Z ld.shared.s8 %rs225, [%r11]; 2026-02-21T08:33:03.4450065Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4450126Z shl.b16 %rs226, %rs225, 4; 2026-02-21T08:33:03.4450288Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4450361Z ld.shared.s8 %rs227, [%r11+2048]; 2026-02-21T08:33:03.4450527Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4450588Z shl.b16 %rs228, %rs227, 4; 2026-02-21T08:33:03.4450753Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4450820Z ld.shared.s8 %rs229, [%r12+128]; 2026-02-21T08:33:03.4450981Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4451047Z shl.b16 %rs230, 
%rs229, 4; 2026-02-21T08:33:03.4451211Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4451275Z ld.shared.s8 %rs231, [%r12+2176]; 2026-02-21T08:33:03.4451442Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4451499Z shl.b16 %rs232, %rs231, 4; 2026-02-21T08:33:03.4451687Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4451779Z ld.shared.s8 %rs233, [%r13+256]; 2026-02-21T08:33:03.4451949Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4452007Z shl.b16 %rs234, %rs233, 4; 2026-02-21T08:33:03.4452173Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4452269Z ld.shared.s8 %rs235, [%r13+2304]; 2026-02-21T08:33:03.4452457Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4452516Z shl.b16 %rs236, %rs235, 4; 2026-02-21T08:33:03.4452684Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4452746Z ld.shared.s8 %rs237, [%r14+384]; 2026-02-21T08:33:03.4452913Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4452979Z shl.b16 %rs238, %rs237, 4; 2026-02-21T08:33:03.4453143Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4453205Z ld.shared.s8 %rs239, [%r14+2432]; 2026-02-21T08:33:03.4453390Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4453457Z shl.b16 %rs240, %rs239, 4; 2026-02-21T08:33:03.4453625Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4453687Z ld.shared.s8 %rs241, [%r15+512]; 2026-02-21T08:33:03.4453864Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4453921Z shl.b16 %rs242, 
%rs241, 4; 2026-02-21T08:33:03.4454088Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4454160Z ld.shared.s8 %rs243, [%r15+2560]; 2026-02-21T08:33:03.4454330Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4454388Z shl.b16 %rs244, %rs243, 4; 2026-02-21T08:33:03.4454563Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4454626Z ld.shared.s8 %rs245, [%r16+640]; 2026-02-21T08:33:03.4454792Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4454853Z shl.b16 %rs246, %rs245, 4; 2026-02-21T08:33:03.4455030Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4455091Z ld.shared.s8 %rs247, [%r16+2688]; 2026-02-21T08:33:03.4455257Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4455322Z shl.b16 %rs248, %rs247, 4; 2026-02-21T08:33:03.4455490Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4455551Z ld.shared.s8 %rs249, [%r17+768]; 2026-02-21T08:33:03.4455720Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4455779Z shl.b16 %rs250, %rs249, 4; 2026-02-21T08:33:03.4455947Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4456019Z ld.shared.s8 %rs251, [%r17+2816]; 2026-02-21T08:33:03.4456183Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4456240Z shl.b16 %rs252, %rs251, 4; 2026-02-21T08:33:03.4456404Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4456474Z ld.shared.s8 %rs253, [%r18+896]; 2026-02-21T08:33:03.4456640Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4456722Z shl.b16 %rs254, 
%rs253, 4; 2026-02-21T08:33:03.4456897Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4456960Z ld.shared.s8 %rs255, [%r18+2944]; 2026-02-21T08:33:03.4457126Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4457215Z shl.b16 %rs256, %rs255, 4; 2026-02-21T08:33:03.4457403Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4457466Z ld.shared.s8 %rs257, [%r11+1024]; 2026-02-21T08:33:03.4457641Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4457698Z shl.b16 %rs258, %rs257, 4; 2026-02-21T08:33:03.4457860Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4457923Z ld.shared.s8 %rs259, [%r11+3072]; 2026-02-21T08:33:03.4458099Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4458158Z shl.b16 %rs260, %rs259, 4; 2026-02-21T08:33:03.4458345Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4458417Z ld.shared.s8 %rs261, [%r12+1152]; 2026-02-21T08:33:03.4458583Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4458640Z shl.b16 %rs262, %rs261, 4; 2026-02-21T08:33:03.4458809Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4458871Z ld.shared.s8 %rs263, [%r12+3200]; 2026-02-21T08:33:03.4459033Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4459099Z shl.b16 %rs264, %rs263, 4; 2026-02-21T08:33:03.4459266Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4459327Z ld.shared.s8 %rs265, [%r13+1280]; 2026-02-21T08:33:03.4459492Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4459559Z shl.b16 %rs266, 
%rs265, 4; 2026-02-21T08:33:03.4459724Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4459786Z ld.shared.s8 %rs267, [%r13+3328]; 2026-02-21T08:33:03.4459954Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4460011Z shl.b16 %rs268, %rs267, 4; 2026-02-21T08:33:03.4460176Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4460245Z ld.shared.s8 %rs269, [%r14+1408]; 2026-02-21T08:33:03.4460411Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4460470Z shl.b16 %rs270, %rs269, 4; 2026-02-21T08:33:03.4460638Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4460698Z ld.shared.s8 %rs271, [%r14+3456]; 2026-02-21T08:33:03.4460862Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4460920Z shl.b16 %rs272, %rs271, 4; 2026-02-21T08:33:03.4461091Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4461154Z ld.shared.s8 %rs273, [%r15+1536]; 2026-02-21T08:33:03.4461314Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4461378Z shl.b16 %rs274, %rs273, 4; 2026-02-21T08:33:03.4461565Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4461651Z ld.shared.s8 %rs275, [%r15+3584]; 2026-02-21T08:33:03.4461825Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4461883Z shl.b16 %rs276, %rs275, 4; 2026-02-21T08:33:03.4462049Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4462142Z ld.shared.s8 %rs277, [%r16+1664]; 2026-02-21T08:33:03.4462330Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4462388Z shl.b16 %rs278, 
%rs277, 4; 2026-02-21T08:33:03.4462554Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4462615Z ld.shared.s8 %rs279, [%r16+3712]; 2026-02-21T08:33:03.4462782Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4462841Z shl.b16 %rs280, %rs279, 4; 2026-02-21T08:33:03.4463012Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4463072Z ld.shared.s8 %rs281, [%r17+1792]; 2026-02-21T08:33:03.4463239Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4463358Z shl.b16 %rs282, %rs281, 4; 2026-02-21T08:33:03.4463526Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4463588Z ld.shared.s8 %rs283, [%r17+3840]; 2026-02-21T08:33:03.4463755Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4463814Z shl.b16 %rs284, %rs283, 4; 2026-02-21T08:33:03.4463977Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4464046Z ld.shared.s8 %rs285, [%r18+1920]; 2026-02-21T08:33:03.4464209Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4464267Z shl.b16 %rs286, %rs285, 4; 2026-02-21T08:33:03.4464431Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4464502Z ld.shared.s8 %rs287, [%r18+3968]; 2026-02-21T08:33:03.4464667Z .loc 1 67 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:67:28 2026-02-21T08:33:03.4464724Z shl.b16 %rs288, %rs287, 4; 2026-02-21T08:33:03.4464896Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4464958Z cvt.s16.s8 %rs289, %rs226; 2026-02-21T08:33:03.4465015Z shr.s16 %rs290, %rs289, 4; 2026-02-21T08:33:03.4465083Z cvt.s16.s8 %rs291, %rs230; 2026-02-21T08:33:03.4465141Z shr.s16 %rs292, 
%rs291, 4; 2026-02-21T08:33:03.4465199Z shr.s16 %rs293, %rs225, 4; 2026-02-21T08:33:03.4465256Z shr.s16 %rs294, %rs229, 4; 2026-02-21T08:33:03.4465430Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4465495Z cvt.rn.f32.s16 %r619, %rs294; 2026-02-21T08:33:03.4465556Z cvt.rn.f32.s16 %r620, %rs293; 2026-02-21T08:33:03.4465626Z cvt.rn.f32.s16 %r621, %rs292; 2026-02-21T08:33:03.4465687Z cvt.rn.f32.s16 %r622, %rs290; 2026-02-21T08:33:03.4465855Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4465923Z cvt.s16.s8 %rs295, %rs228; 2026-02-21T08:33:03.4465982Z shr.s16 %rs296, %rs295, 4; 2026-02-21T08:33:03.4466039Z cvt.s16.s8 %rs297, %rs232; 2026-02-21T08:33:03.4466096Z shr.s16 %rs298, %rs297, 4; 2026-02-21T08:33:03.4466161Z shr.s16 %rs299, %rs227, 4; 2026-02-21T08:33:03.4466220Z shr.s16 %rs300, %rs231, 4; 2026-02-21T08:33:03.4466387Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4466475Z cvt.rn.f32.s16 %r623, %rs300; 2026-02-21T08:33:03.4466533Z cvt.rn.f32.s16 %r624, %rs299; 2026-02-21T08:33:03.4466590Z cvt.rn.f32.s16 %r625, %rs298; 2026-02-21T08:33:03.4466647Z cvt.rn.f32.s16 %r626, %rs296; 2026-02-21T08:33:03.4466821Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4466879Z cvt.s16.s8 %rs301, %rs234; 2026-02-21T08:33:03.4466959Z shr.s16 %rs302, %rs301, 4; 2026-02-21T08:33:03.4467024Z cvt.s16.s8 %rs303, %rs238; 2026-02-21T08:33:03.4467104Z shr.s16 %rs304, %rs303, 4; 2026-02-21T08:33:03.4467163Z shr.s16 %rs305, %rs233, 4; 2026-02-21T08:33:03.4467225Z shr.s16 %rs306, %rs237, 4; 2026-02-21T08:33:03.4467393Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4467451Z cvt.rn.f32.s16 %r627, %rs306; 2026-02-21T08:33:03.4467508Z cvt.rn.f32.s16 %r628, %rs305; 2026-02-21T08:33:03.4467573Z cvt.rn.f32.s16 %r629, %rs304; 
2026-02-21T08:33:03.4467631Z cvt.rn.f32.s16 %r630, %rs302; 2026-02-21T08:33:03.4467795Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4467860Z cvt.s16.s8 %rs307, %rs236; 2026-02-21T08:33:03.4467916Z shr.s16 %rs308, %rs307, 4; 2026-02-21T08:33:03.4467972Z cvt.s16.s8 %rs309, %rs240; 2026-02-21T08:33:03.4468051Z shr.s16 %rs310, %rs309, 4; 2026-02-21T08:33:03.4468115Z shr.s16 %rs311, %rs235, 4; 2026-02-21T08:33:03.4468172Z shr.s16 %rs312, %rs239, 4; 2026-02-21T08:33:03.4468342Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4468409Z cvt.rn.f32.s16 %r631, %rs312; 2026-02-21T08:33:03.4468468Z cvt.rn.f32.s16 %r632, %rs311; 2026-02-21T08:33:03.4468526Z cvt.rn.f32.s16 %r633, %rs310; 2026-02-21T08:33:03.4468592Z cvt.rn.f32.s16 %r634, %rs308; 2026-02-21T08:33:03.4468759Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4468820Z cvt.s16.s8 %rs313, %rs242; 2026-02-21T08:33:03.4468878Z shr.s16 %rs314, %rs313, 4; 2026-02-21T08:33:03.4468945Z cvt.s16.s8 %rs315, %rs246; 2026-02-21T08:33:03.4469003Z shr.s16 %rs316, %rs315, 4; 2026-02-21T08:33:03.4469060Z shr.s16 %rs317, %rs241, 4; 2026-02-21T08:33:03.4469123Z shr.s16 %rs318, %rs245, 4; 2026-02-21T08:33:03.4469291Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4469349Z cvt.rn.f32.s16 %r635, %rs318; 2026-02-21T08:33:03.4469416Z cvt.rn.f32.s16 %r636, %rs317; 2026-02-21T08:33:03.4469473Z cvt.rn.f32.s16 %r637, %rs316; 2026-02-21T08:33:03.4469530Z cvt.rn.f32.s16 %r638, %rs314; 2026-02-21T08:33:03.4469692Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4469756Z cvt.s16.s8 %rs319, %rs244; 2026-02-21T08:33:03.4469812Z shr.s16 %rs320, %rs319, 4; 2026-02-21T08:33:03.4469869Z cvt.s16.s8 %rs321, %rs248; 2026-02-21T08:33:03.4469933Z shr.s16 %rs322, %rs321, 4; 2026-02-21T08:33:03.4469988Z shr.s16 
%rs323, %rs243, 4; 2026-02-21T08:33:03.4470043Z shr.s16 %rs324, %rs247, 4; 2026-02-21T08:33:03.4470209Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4470274Z cvt.rn.f32.s16 %r639, %rs324; 2026-02-21T08:33:03.4470334Z cvt.rn.f32.s16 %r640, %rs323; 2026-02-21T08:33:03.4470392Z cvt.rn.f32.s16 %r641, %rs322; 2026-02-21T08:33:03.4470456Z cvt.rn.f32.s16 %r642, %rs320; 2026-02-21T08:33:03.4470621Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4470681Z cvt.s16.s8 %rs325, %rs250; 2026-02-21T08:33:03.4470745Z shr.s16 %rs326, %rs325, 4; 2026-02-21T08:33:03.4470800Z cvt.s16.s8 %rs327, %rs254; 2026-02-21T08:33:03.4470856Z shr.s16 %rs328, %rs327, 4; 2026-02-21T08:33:03.4470912Z shr.s16 %rs329, %rs249, 4; 2026-02-21T08:33:03.4470995Z shr.s16 %rs330, %rs253, 4; 2026-02-21T08:33:03.4471160Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4471219Z cvt.rn.f32.s16 %r643, %rs330; 2026-02-21T08:33:03.4471284Z cvt.rn.f32.s16 %r644, %rs329; 2026-02-21T08:33:03.4471342Z cvt.rn.f32.s16 %r645, %rs328; 2026-02-21T08:33:03.4471400Z cvt.rn.f32.s16 %r646, %rs326; 2026-02-21T08:33:03.4471635Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4471719Z cvt.s16.s8 %rs331, %rs252; 2026-02-21T08:33:03.4471777Z shr.s16 %rs332, %rs331, 4; 2026-02-21T08:33:03.4471835Z cvt.s16.s8 %rs333, %rs256; 2026-02-21T08:33:03.4471898Z shr.s16 %rs334, %rs333, 4; 2026-02-21T08:33:03.4471955Z shr.s16 %rs335, %rs251, 4; 2026-02-21T08:33:03.4472011Z shr.s16 %rs336, %rs255, 4; 2026-02-21T08:33:03.4472178Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4472240Z cvt.rn.f32.s16 %r647, %rs336; 2026-02-21T08:33:03.4472299Z cvt.rn.f32.s16 %r648, %rs335; 2026-02-21T08:33:03.4472359Z cvt.rn.f32.s16 %r649, %rs334; 2026-02-21T08:33:03.4472427Z cvt.rn.f32.s16 %r650, %rs332; 
2026-02-21T08:33:03.4472622Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4472685Z cvt.s16.s8 %rs337, %rs258; 2026-02-21T08:33:03.4472752Z shr.s16 %rs338, %rs337, 4; 2026-02-21T08:33:03.4472813Z cvt.s16.s8 %rs339, %rs262; 2026-02-21T08:33:03.4472874Z shr.s16 %rs340, %rs339, 4; 2026-02-21T08:33:03.4472944Z shr.s16 %rs341, %rs257, 4; 2026-02-21T08:33:03.4473004Z shr.s16 %rs342, %rs261, 4; 2026-02-21T08:33:03.4473179Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4473239Z cvt.rn.f32.s16 %r651, %rs342; 2026-02-21T08:33:03.4473310Z cvt.rn.f32.s16 %r652, %rs341; 2026-02-21T08:33:03.4473372Z cvt.rn.f32.s16 %r653, %rs340; 2026-02-21T08:33:03.4473435Z cvt.rn.f32.s16 %r654, %rs338; 2026-02-21T08:33:03.4473613Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4473674Z cvt.s16.s8 %rs343, %rs260; 2026-02-21T08:33:03.4473734Z shr.s16 %rs344, %rs343, 4; 2026-02-21T08:33:03.4473795Z cvt.s16.s8 %rs345, %rs264; 2026-02-21T08:33:03.4473863Z shr.s16 %rs346, %rs345, 4; 2026-02-21T08:33:03.4473922Z shr.s16 %rs347, %rs259, 4; 2026-02-21T08:33:03.4473981Z shr.s16 %rs348, %rs263, 4; 2026-02-21T08:33:03.4474171Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4474232Z cvt.rn.f32.s16 %r655, %rs348; 2026-02-21T08:33:03.4474294Z cvt.rn.f32.s16 %r656, %rs347; 2026-02-21T08:33:03.4474358Z cvt.rn.f32.s16 %r657, %rs346; 2026-02-21T08:33:03.4474418Z cvt.rn.f32.s16 %r658, %rs344; 2026-02-21T08:33:03.4474593Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4474653Z cvt.s16.s8 %rs349, %rs266; 2026-02-21T08:33:03.4474720Z shr.s16 %rs350, %rs349, 4; 2026-02-21T08:33:03.4474779Z cvt.s16.s8 %rs351, %rs270; 2026-02-21T08:33:03.4474836Z shr.s16 %rs352, %rs351, 4; 2026-02-21T08:33:03.4474901Z shr.s16 %rs353, %rs265, 4; 2026-02-21T08:33:03.4474967Z shr.s16 
%rs354, %rs269, 4; 2026-02-21T08:33:03.4475140Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4475209Z cvt.rn.f32.s16 %r659, %rs354; 2026-02-21T08:33:03.4475269Z cvt.rn.f32.s16 %r660, %rs353; 2026-02-21T08:33:03.4475328Z cvt.rn.f32.s16 %r661, %rs352; 2026-02-21T08:33:03.4475387Z cvt.rn.f32.s16 %r662, %rs350; 2026-02-21T08:33:03.4475567Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4475629Z cvt.s16.s8 %rs355, %rs268; 2026-02-21T08:33:03.4475688Z shr.s16 %rs356, %rs355, 4; 2026-02-21T08:33:03.4475782Z cvt.s16.s8 %rs357, %rs272; 2026-02-21T08:33:03.4475843Z shr.s16 %rs358, %rs357, 4; 2026-02-21T08:33:03.4475902Z shr.s16 %rs359, %rs267, 4; 2026-02-21T08:33:03.4475962Z shr.s16 %rs360, %rs271, 4; 2026-02-21T08:33:03.4476141Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4476205Z cvt.rn.f32.s16 %r663, %rs360; 2026-02-21T08:33:03.4476292Z cvt.rn.f32.s16 %r664, %rs359; 2026-02-21T08:33:03.4476369Z cvt.rn.f32.s16 %r665, %rs358; 2026-02-21T08:33:03.4476471Z cvt.rn.f32.s16 %r666, %rs356; 2026-02-21T08:33:03.4476662Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4476732Z cvt.s16.s8 %rs361, %rs274; 2026-02-21T08:33:03.4476791Z shr.s16 %rs362, %rs361, 4; 2026-02-21T08:33:03.4476850Z cvt.s16.s8 %rs363, %rs278; 2026-02-21T08:33:03.4476909Z shr.s16 %rs364, %rs363, 4; 2026-02-21T08:33:03.4476977Z shr.s16 %rs365, %rs273, 4; 2026-02-21T08:33:03.4477040Z shr.s16 %rs366, %rs277, 4; 2026-02-21T08:33:03.4477213Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4477281Z cvt.rn.f32.s16 %r667, %rs366; 2026-02-21T08:33:03.4477343Z cvt.rn.f32.s16 %r668, %rs365; 2026-02-21T08:33:03.4477405Z cvt.rn.f32.s16 %r669, %rs364; 2026-02-21T08:33:03.4477492Z cvt.rn.f32.s16 %r670, %rs362; 2026-02-21T08:33:03.4477676Z .loc 1 69 25 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4477739Z cvt.s16.s8 %rs367, %rs276; 2026-02-21T08:33:03.4477803Z shr.s16 %rs368, %rs367, 4; 2026-02-21T08:33:03.4477871Z cvt.s16.s8 %rs369, %rs280; 2026-02-21T08:33:03.4477933Z shr.s16 %rs370, %rs369, 4; 2026-02-21T08:33:03.4477993Z shr.s16 %rs371, %rs275, 4; 2026-02-21T08:33:03.4478058Z shr.s16 %rs372, %rs279, 4; 2026-02-21T08:33:03.4478229Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4478292Z cvt.rn.f32.s16 %r671, %rs372; 2026-02-21T08:33:03.4478353Z cvt.rn.f32.s16 %r672, %rs371; 2026-02-21T08:33:03.4478423Z cvt.rn.f32.s16 %r673, %rs370; 2026-02-21T08:33:03.4478483Z cvt.rn.f32.s16 %r674, %rs368; 2026-02-21T08:33:03.4478655Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4478725Z cvt.s16.s8 %rs373, %rs282; 2026-02-21T08:33:03.4478784Z shr.s16 %rs374, %rs373, 4; 2026-02-21T08:33:03.4478843Z cvt.s16.s8 %rs375, %rs286; 2026-02-21T08:33:03.4478910Z shr.s16 %rs376, %rs375, 4; 2026-02-21T08:33:03.4478971Z shr.s16 %rs377, %rs281, 4; 2026-02-21T08:33:03.4479030Z shr.s16 %rs378, %rs285, 4; 2026-02-21T08:33:03.4479200Z .loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4479266Z cvt.rn.f32.s16 %r675, %rs378; 2026-02-21T08:33:03.4479327Z cvt.rn.f32.s16 %r676, %rs377; 2026-02-21T08:33:03.4479387Z cvt.rn.f32.s16 %r677, %rs376; 2026-02-21T08:33:03.4479457Z cvt.rn.f32.s16 %r678, %rs374; 2026-02-21T08:33:03.4479626Z .loc 1 69 25 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:69:25 2026-02-21T08:33:03.4479686Z cvt.s16.s8 %rs379, %rs284; 2026-02-21T08:33:03.4479746Z shr.s16 %rs380, %rs379, 4; 2026-02-21T08:33:03.4479815Z cvt.s16.s8 %rs381, %rs288; 2026-02-21T08:33:03.4479876Z shr.s16 %rs382, %rs381, 4; 2026-02-21T08:33:03.4479935Z shr.s16 %rs383, %rs283, 4; 2026-02-21T08:33:03.4480010Z shr.s16 %rs384, %rs287, 4; 2026-02-21T08:33:03.4480177Z 
.loc 1 87 32 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:87:32 2026-02-21T08:33:03.4480236Z cvt.rn.f32.s16 %r679, %rs384; 2026-02-21T08:33:03.4480300Z cvt.rn.f32.s16 %r680, %rs383; 2026-02-21T08:33:03.4480358Z cvt.rn.f32.s16 %r681, %rs382; 2026-02-21T08:33:03.4480415Z cvt.rn.f32.s16 %r682, %rs380; 2026-02-21T08:33:03.4480509Z st.shared.v4.b32 [%r19], {%r622, %r620, %r621, %r619}; 2026-02-21T08:33:03.4480640Z st.shared.v4.b32 [%r19+16384], {%r626, %r624, %r625, %r623}; 2026-02-21T08:33:03.4480729Z st.shared.v4.b32 [%r20], {%r630, %r628, %r629, %r627}; 2026-02-21T08:33:03.4480825Z st.shared.v4.b32 [%r20+16384], {%r634, %r632, %r633, %r631}; 2026-02-21T08:33:03.4480916Z st.shared.v4.b32 [%r21], {%r638, %r636, %r637, %r635}; 2026-02-21T08:33:03.4481033Z st.shared.v4.b32 [%r21+16384], {%r642, %r640, %r641, %r639}; 2026-02-21T08:33:03.4481119Z st.shared.v4.b32 [%r22], {%r646, %r644, %r645, %r643}; 2026-02-21T08:33:03.4481253Z st.shared.v4.b32 [%r22+16384], {%r650, %r648, %r649, %r647}; 2026-02-21T08:33:03.4481338Z st.shared.v4.b32 [%r23], {%r654, %r652, %r653, %r651}; 2026-02-21T08:33:03.4481430Z st.shared.v4.b32 [%r23+16384], {%r658, %r656, %r657, %r655}; 2026-02-21T08:33:03.4481514Z st.shared.v4.b32 [%r24], {%r662, %r660, %r661, %r659}; 2026-02-21T08:33:03.4481648Z st.shared.v4.b32 [%r24+16384], {%r666, %r664, %r665, %r663}; 2026-02-21T08:33:03.4481733Z st.shared.v4.b32 [%r25], {%r670, %r668, %r669, %r667}; 2026-02-21T08:33:03.4481826Z st.shared.v4.b32 [%r25+16384], {%r674, %r672, %r673, %r671}; 2026-02-21T08:33:03.4481916Z st.shared.v4.b32 [%r26], {%r678, %r676, %r677, %r675}; 2026-02-21T08:33:03.4482007Z st.shared.v4.b32 [%r26+16384], {%r682, %r680, %r681, %r679}; 2026-02-21T08:33:03.4482211Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4482280Z shl.b32 %r683, %r1388, 3; 2026-02-21T08:33:03.4482342Z add.s32 %r684, %r53, %r683; 2026-02-21T08:33:03.4482405Z add.s32 %r1386, %r684, 110592; 
2026-02-21T08:33:03.4482459Z $L__tmp12: 2026-02-21T08:33:03.4482687Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4482745Z // begin inline asm 2026-02-21T08:33:03.4482821Z fence.proxy.async.shared::cta; 2026-02-21T08:33:03.4482885Z // end inline asm 2026-02-21T08:33:03.4482939Z bar.sync 0; 2026-02-21T08:33:03.4483001Z @%p65 bra $L__BB0_6; 2026-02-21T08:33:03.4483107Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T08:33:03.4483172Z elect.sync %r697|%p83, -1; 2026-02-21T08:33:03.4483233Z mov.b32 %r687, 138414352; 2026-02-21T08:33:03.4483290Z // begin inline asm 2026-02-21T08:33:03.4483458Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r1384 + 0 ], [ %r527 + 0 ], %rd62, %r687, %p82; 2026-02-21T08:33:03.4483516Z // end inline asm 2026-02-21T08:33:03.4483574Z // begin inline asm 2026-02-21T08:33:03.4483735Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r1384 + 0 ], [ %r527 + 8 ], %rd63, %r687, %p82; 2026-02-21T08:33:03.4483790Z // end inline asm 2026-02-21T08:33:03.4483850Z // begin inline asm 2026-02-21T08:33:03.4484005Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r1384 + 0 ], [ %r527 + 16 ], %rd64, %r687, %p82; 2026-02-21T08:33:03.4484060Z // end inline asm 2026-02-21T08:33:03.4484114Z // begin inline asm 2026-02-21T08:33:03.4484269Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r1384 + 0 ], [ %r527 + 24 ], %rd65, %r687, %p82; 2026-02-21T08:33:03.4484325Z // end inline asm 2026-02-21T08:33:03.4484384Z cvt.u64.u32 %rd85, %r1386; 2026-02-21T08:33:03.4484439Z // begin inline asm 2026-02-21T08:33:03.4484570Z @%p83 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd85]; 2026-02-21T08:33:03.4484627Z // end inline asm 2026-02-21T08:33:03.4484686Z bra.uni $L__BB0_6; 2026-02-21T08:33:03.4484787Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T08:33:03.4484878Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:33:03.4484944Z setp.lt.u32 %p103, %r1, 128; 2026-02-21T08:33:03.4484999Z mov.b32 
%r717, 1; 2026-02-21T08:33:03.4485225Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4485281Z // begin inline asm 2026-02-21T08:33:03.4485330Z 2026-02-21T08:33:03.4485388Z { 2026-02-21T08:33:03.4485478Z .reg .pred complete; 2026-02-21T08:33:03.4485531Z waitLoop: 2026-02-21T08:33:03.4485655Z mbarrier.try_wait.parity.shared.b64 complete, [%r1386], %r717; 2026-02-21T08:33:03.4485721Z @!complete bra.uni waitLoop; 2026-02-21T08:33:03.4485771Z } 2026-02-21T08:33:03.4485775Z 2026-02-21T08:33:03.4485829Z // end inline asm 2026-02-21T08:33:03.4485891Z $L__tmp13: 2026-02-21T08:33:03.4486089Z .loc 1 51 103 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:51:103 2026-02-21T08:33:03.4486152Z cp.async.wait_group 0; 2026-02-21T08:33:03.4486234Z bar.sync 0; 2026-02-21T08:33:03.4486290Z // begin inline asm 2026-02-21T08:33:03.4486376Z @%p99 mbarrier.inval.shared::cta.b64 [%r549]; 2026-02-21T08:33:03.4486431Z // end inline asm 2026-02-21T08:33:03.4486499Z add.s32 %r719, %r53, 110592; 2026-02-21T08:33:03.4486553Z // begin inline asm 2026-02-21T08:33:03.4486635Z @%p99 mbarrier.inval.shared::cta.b64 [%r719]; 2026-02-21T08:33:03.4486697Z // end inline asm 2026-02-21T08:33:03.4486752Z bar.sync 0; 2026-02-21T08:33:03.4486806Z // begin inline asm 2026-02-21T08:33:03.4486893Z @%p99 mbarrier.inval.shared::cta.b64 [%r351]; 2026-02-21T08:33:03.4486946Z // end inline asm 2026-02-21T08:33:03.4486998Z $L__tmp14: 2026-02-21T08:33:03.4487231Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4487298Z // begin inline asm 2026-02-21T08:33:03.4487570Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r721, %r722, %r723, %r724, %r725, %r726, %r727, %r728, %r729, %r730, %r731, %r732, %r733, %r734, %r735, %r736}, [%r992 + 0]; 2026-02-21T08:33:03.4487625Z // end inline asm 2026-02-21T08:33:03.4487688Z // begin inline asm 2026-02-21T08:33:03.4487957Z 
tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r738, %r739, %r740, %r741, %r742, %r743, %r744, %r745, %r746, %r747, %r748, %r749, %r750, %r751, %r752, %r753}, [%r992 + 16]; 2026-02-21T08:33:03.4488012Z // end inline asm 2026-02-21T08:33:03.4488072Z // begin inline asm 2026-02-21T08:33:03.4488332Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r755, %r756, %r757, %r758, %r759, %r760, %r761, %r762, %r763, %r764, %r765, %r766, %r767, %r768, %r769, %r770}, [%r992 + 32]; 2026-02-21T08:33:03.4488386Z // end inline asm 2026-02-21T08:33:03.4488447Z // begin inline asm 2026-02-21T08:33:03.4488711Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r772, %r773, %r774, %r775, %r776, %r777, %r778, %r779, %r780, %r781, %r782, %r783, %r784, %r785, %r786, %r787}, [%r992 + 48]; 2026-02-21T08:33:03.4488765Z // end inline asm 2026-02-21T08:33:03.4488822Z // begin inline asm 2026-02-21T08:33:03.4489091Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r789, %r790, %r791, %r792, %r793, %r794, %r795, %r796, %r797, %r798, %r799, %r800, %r801, %r802, %r803, %r804}, [%r992 + 64]; 2026-02-21T08:33:03.4489145Z // end inline asm 2026-02-21T08:33:03.4489200Z // begin inline asm 2026-02-21T08:33:03.4489465Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r806, %r807, %r808, %r809, %r810, %r811, %r812, %r813, %r814, %r815, %r816, %r817, %r818, %r819, %r820, %r821}, [%r992 + 80]; 2026-02-21T08:33:03.4489521Z // end inline asm 2026-02-21T08:33:03.4489577Z // begin inline asm 2026-02-21T08:33:03.4489846Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r823, %r824, %r825, %r826, %r827, %r828, %r829, %r830, %r831, %r832, %r833, %r834, %r835, %r836, %r837, %r838}, [%r992 + 96]; 2026-02-21T08:33:03.4489903Z // end inline asm 2026-02-21T08:33:03.4489958Z // begin inline asm 2026-02-21T08:33:03.4490228Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r840, %r841, %r842, %r843, %r844, %r845, %r846, %r847, %r848, %r849, %r850, %r851, %r852, %r853, %r854, %r855}, [%r992 + 112]; 2026-02-21T08:33:03.4490283Z // end inline asm 2026-02-21T08:33:03.4490337Z // 
begin inline asm 2026-02-21T08:33:03.4490606Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r857, %r858, %r859, %r860, %r861, %r862, %r863, %r864, %r865, %r866, %r867, %r868, %r869, %r870, %r871, %r872}, [%r992 + 128]; 2026-02-21T08:33:03.4490668Z // end inline asm 2026-02-21T08:33:03.4490747Z // begin inline asm 2026-02-21T08:33:03.4491005Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r874, %r875, %r876, %r877, %r878, %r879, %r880, %r881, %r882, %r883, %r884, %r885, %r886, %r887, %r888, %r889}, [%r992 + 144]; 2026-02-21T08:33:03.4491068Z // end inline asm 2026-02-21T08:33:03.4491124Z // begin inline asm 2026-02-21T08:33:03.4491379Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r891, %r892, %r893, %r894, %r895, %r896, %r897, %r898, %r899, %r900, %r901, %r902, %r903, %r904, %r905, %r906}, [%r992 + 160]; 2026-02-21T08:33:03.4491481Z // end inline asm 2026-02-21T08:33:03.4491563Z // begin inline asm 2026-02-21T08:33:03.4491826Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r908, %r909, %r910, %r911, %r912, %r913, %r914, %r915, %r916, %r917, %r918, %r919, %r920, %r921, %r922, %r923}, [%r992 + 176]; 2026-02-21T08:33:03.4491889Z // end inline asm 2026-02-21T08:33:03.4491945Z // begin inline asm 2026-02-21T08:33:03.4492202Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r925, %r926, %r927, %r928, %r929, %r930, %r931, %r932, %r933, %r934, %r935, %r936, %r937, %r938, %r939, %r940}, [%r992 + 192]; 2026-02-21T08:33:03.4492259Z // end inline asm 2026-02-21T08:33:03.4492321Z // begin inline asm 2026-02-21T08:33:03.4492602Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r942, %r943, %r944, %r945, %r946, %r947, %r948, %r949, %r950, %r951, %r952, %r953, %r954, %r955, %r956, %r957}, [%r992 + 208]; 2026-02-21T08:33:03.4492659Z // end inline asm 2026-02-21T08:33:03.4492723Z // begin inline asm 2026-02-21T08:33:03.4492983Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r959, %r960, %r961, %r962, %r963, %r964, %r965, %r966, %r967, %r968, %r969, %r970, %r971, %r972, %r973, %r974}, [%r992 + 224]; 
2026-02-21T08:33:03.4493038Z // end inline asm 2026-02-21T08:33:03.4493100Z // begin inline asm 2026-02-21T08:33:03.4493357Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r976, %r977, %r978, %r979, %r980, %r981, %r982, %r983, %r984, %r985, %r986, %r987, %r988, %r989, %r990, %r991}, [%r992 + 240]; 2026-02-21T08:33:03.4493412Z // end inline asm 2026-02-21T08:33:03.4493475Z // begin inline asm 2026-02-21T08:33:03.4493544Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:33:03.4493597Z // end inline asm 2026-02-21T08:33:03.4493659Z cvt.u64.u32 %rd92, %r721; 2026-02-21T08:33:03.4493732Z cvt.u64.u32 %rd93, %r722; 2026-02-21T08:33:03.4493792Z shl.b64 %rd94, %rd93, 32; 2026-02-21T08:33:03.4493857Z or.b64 %rd95, %rd92, %rd94; 2026-02-21T08:33:03.4493921Z $L__tmp15: 2026-02-21T08:33:03.4494096Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4494161Z mov.b64 {%r997, %r998}, %rd95; 2026-02-21T08:33:03.4494231Z cvt.rn.bf16x2.f32 %r999, %r998, %r997; 2026-02-21T08:33:03.4494290Z $L__tmp16: 2026-02-21T08:33:03.4494506Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4494564Z cvt.u64.u32 %rd96, %r723; 2026-02-21T08:33:03.4494629Z cvt.u64.u32 %rd97, %r724; 2026-02-21T08:33:03.4494689Z shl.b64 %rd98, %rd97, 32; 2026-02-21T08:33:03.4494749Z or.b64 %rd99, %rd96, %rd98; 2026-02-21T08:33:03.4494807Z $L__tmp17: 2026-02-21T08:33:03.4494978Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4495042Z mov.b64 {%r1000, %r1001}, %rd99; 2026-02-21T08:33:03.4495116Z cvt.rn.bf16x2.f32 %r1002, %r1001, %r1000; 2026-02-21T08:33:03.4495177Z $L__tmp18: 2026-02-21T08:33:03.4495394Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4495456Z cvt.u64.u32 %rd100, %r725; 2026-02-21T08:33:03.4495523Z cvt.u64.u32 %rd101, %r726; 2026-02-21T08:33:03.4495582Z shl.b64 %rd102, 
%rd101, 32; 2026-02-21T08:33:03.4495643Z or.b64 %rd103, %rd100, %rd102; 2026-02-21T08:33:03.4495700Z $L__tmp19: 2026-02-21T08:33:03.4495867Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4495957Z mov.b64 {%r1003, %r1004}, %rd103; 2026-02-21T08:33:03.4496029Z cvt.rn.bf16x2.f32 %r1005, %r1004, %r1003; 2026-02-21T08:33:03.4496088Z $L__tmp20: 2026-02-21T08:33:03.4496297Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4496357Z cvt.u64.u32 %rd104, %r727; 2026-02-21T08:33:03.4496455Z cvt.u64.u32 %rd105, %r728; 2026-02-21T08:33:03.4496513Z shl.b64 %rd106, %rd105, 32; 2026-02-21T08:33:03.4496573Z or.b64 %rd107, %rd104, %rd106; 2026-02-21T08:33:03.4496647Z $L__tmp21: 2026-02-21T08:33:03.4496822Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4496883Z mov.b64 {%r1006, %r1007}, %rd107; 2026-02-21T08:33:03.4496952Z cvt.rn.bf16x2.f32 %r1008, %r1007, %r1006; 2026-02-21T08:33:03.4497011Z $L__tmp22: 2026-02-21T08:33:03.4497226Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4497288Z cvt.u64.u32 %rd108, %r729; 2026-02-21T08:33:03.4497351Z cvt.u64.u32 %rd109, %r730; 2026-02-21T08:33:03.4497409Z shl.b64 %rd110, %rd109, 32; 2026-02-21T08:33:03.4497467Z or.b64 %rd111, %rd108, %rd110; 2026-02-21T08:33:03.4497518Z $L__tmp23: 2026-02-21T08:33:03.4497712Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4497776Z mov.b64 {%r1009, %r1010}, %rd111; 2026-02-21T08:33:03.4497846Z cvt.rn.bf16x2.f32 %r1011, %r1010, %r1009; 2026-02-21T08:33:03.4497906Z $L__tmp24: 2026-02-21T08:33:03.4498118Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4498177Z cvt.u64.u32 %rd112, %r731; 2026-02-21T08:33:03.4498240Z cvt.u64.u32 %rd113, 
%r732; 2026-02-21T08:33:03.4498297Z shl.b64 %rd114, %rd113, 32; 2026-02-21T08:33:03.4498356Z or.b64 %rd115, %rd112, %rd114; 2026-02-21T08:33:03.4498408Z $L__tmp25: 2026-02-21T08:33:03.4498579Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4498637Z mov.b64 {%r1012, %r1013}, %rd115; 2026-02-21T08:33:03.4498704Z cvt.rn.bf16x2.f32 %r1014, %r1013, %r1012; 2026-02-21T08:33:03.4498759Z $L__tmp26: 2026-02-21T08:33:03.4498969Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4499027Z cvt.u64.u32 %rd116, %r733; 2026-02-21T08:33:03.4499091Z cvt.u64.u32 %rd117, %r734; 2026-02-21T08:33:03.4499149Z shl.b64 %rd118, %rd117, 32; 2026-02-21T08:33:03.4499208Z or.b64 %rd119, %rd116, %rd118; 2026-02-21T08:33:03.4499257Z $L__tmp27: 2026-02-21T08:33:03.4499430Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4499490Z mov.b64 {%r1015, %r1016}, %rd119; 2026-02-21T08:33:03.4499556Z cvt.rn.bf16x2.f32 %r1017, %r1016, %r1015; 2026-02-21T08:33:03.4499615Z $L__tmp28: 2026-02-21T08:33:03.4499825Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4499882Z cvt.u64.u32 %rd120, %r735; 2026-02-21T08:33:03.4499944Z cvt.u64.u32 %rd121, %r736; 2026-02-21T08:33:03.4500003Z shl.b64 %rd122, %rd121, 32; 2026-02-21T08:33:03.4500064Z or.b64 %rd123, %rd120, %rd122; 2026-02-21T08:33:03.4500114Z $L__tmp29: 2026-02-21T08:33:03.4500292Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4500353Z mov.b64 {%r1018, %r1019}, %rd123; 2026-02-21T08:33:03.4500420Z cvt.rn.bf16x2.f32 %r1020, %r1019, %r1018; 2026-02-21T08:33:03.4500479Z $L__tmp30: 2026-02-21T08:33:03.4500692Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4500750Z cvt.u64.u32 %rd124, %r738; 
2026-02-21T08:33:03.4500830Z cvt.u64.u32 %rd125, %r739; 2026-02-21T08:33:03.4500894Z shl.b64 %rd126, %rd125, 32; 2026-02-21T08:33:03.4500952Z or.b64 %rd127, %rd124, %rd126; 2026-02-21T08:33:03.4501003Z $L__tmp31: 2026-02-21T08:33:03.4501180Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4501242Z mov.b64 {%r1021, %r1022}, %rd127; 2026-02-21T08:33:03.4501344Z cvt.rn.bf16x2.f32 %r1023, %r1022, %r1021; 2026-02-21T08:33:03.4501403Z $L__tmp32: 2026-02-21T08:33:03.4501671Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4501732Z cvt.u64.u32 %rd128, %r740; 2026-02-21T08:33:03.4501792Z cvt.u64.u32 %rd129, %r741; 2026-02-21T08:33:03.4501861Z shl.b64 %rd130, %rd129, 32; 2026-02-21T08:33:03.4501921Z or.b64 %rd131, %rd128, %rd130; 2026-02-21T08:33:03.4501974Z $L__tmp33: 2026-02-21T08:33:03.4502151Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4502214Z mov.b64 {%r1024, %r1025}, %rd131; 2026-02-21T08:33:03.4502282Z cvt.rn.bf16x2.f32 %r1026, %r1025, %r1024; 2026-02-21T08:33:03.4502339Z $L__tmp34: 2026-02-21T08:33:03.4502583Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4502648Z cvt.u64.u32 %rd132, %r742; 2026-02-21T08:33:03.4502708Z cvt.u64.u32 %rd133, %r743; 2026-02-21T08:33:03.4502782Z shl.b64 %rd134, %rd133, 32; 2026-02-21T08:33:03.4502845Z or.b64 %rd135, %rd132, %rd134; 2026-02-21T08:33:03.4502897Z $L__tmp35: 2026-02-21T08:33:03.4503073Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4503133Z mov.b64 {%r1027, %r1028}, %rd135; 2026-02-21T08:33:03.4503202Z cvt.rn.bf16x2.f32 %r1029, %r1028, %r1027; 2026-02-21T08:33:03.4503261Z $L__tmp36: 2026-02-21T08:33:03.4503472Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 
2026-02-21T08:33:03.4503531Z cvt.u64.u32 %rd136, %r744; 2026-02-21T08:33:03.4503588Z cvt.u64.u32 %rd137, %r745; 2026-02-21T08:33:03.4503654Z shl.b64 %rd138, %rd137, 32; 2026-02-21T08:33:03.4503713Z or.b64 %rd139, %rd136, %rd138; 2026-02-21T08:33:03.4503766Z $L__tmp37: 2026-02-21T08:33:03.4503947Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4504008Z mov.b64 {%r1030, %r1031}, %rd139; 2026-02-21T08:33:03.4504075Z cvt.rn.bf16x2.f32 %r1032, %r1031, %r1030; 2026-02-21T08:33:03.4504126Z $L__tmp38: 2026-02-21T08:33:03.4504347Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4504406Z cvt.u64.u32 %rd140, %r746; 2026-02-21T08:33:03.4504463Z cvt.u64.u32 %rd141, %r747; 2026-02-21T08:33:03.4504531Z shl.b64 %rd142, %rd141, 32; 2026-02-21T08:33:03.4504590Z or.b64 %rd143, %rd140, %rd142; 2026-02-21T08:33:03.4504640Z $L__tmp39: 2026-02-21T08:33:03.4504816Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4504875Z mov.b64 {%r1033, %r1034}, %rd143; 2026-02-21T08:33:03.4504944Z cvt.rn.bf16x2.f32 %r1035, %r1034, %r1033; 2026-02-21T08:33:03.4504997Z $L__tmp40: 2026-02-21T08:33:03.4505217Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4505277Z cvt.u64.u32 %rd144, %r748; 2026-02-21T08:33:03.4505334Z cvt.u64.u32 %rd145, %r749; 2026-02-21T08:33:03.4505398Z shl.b64 %rd146, %rd145, 32; 2026-02-21T08:33:03.4505457Z or.b64 %rd147, %rd144, %rd146; 2026-02-21T08:33:03.4505508Z $L__tmp41: 2026-02-21T08:33:03.4505680Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4505739Z mov.b64 {%r1036, %r1037}, %rd147; 2026-02-21T08:33:03.4505834Z cvt.rn.bf16x2.f32 %r1038, %r1037, %r1036; 2026-02-21T08:33:03.4505905Z $L__tmp42: 2026-02-21T08:33:03.4506123Z .loc 2 291 36 // standard.py:291:36 @[ 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4506180Z cvt.u64.u32 %rd148, %r750; 2026-02-21T08:33:03.4506269Z cvt.u64.u32 %rd149, %r751; 2026-02-21T08:33:03.4506334Z shl.b64 %rd150, %rd149, 32; 2026-02-21T08:33:03.4506393Z or.b64 %rd151, %rd148, %rd150; 2026-02-21T08:33:03.4506464Z $L__tmp43: 2026-02-21T08:33:03.4506640Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4506699Z mov.b64 {%r1039, %r1040}, %rd151; 2026-02-21T08:33:03.4506765Z cvt.rn.bf16x2.f32 %r1041, %r1040, %r1039; 2026-02-21T08:33:03.4506816Z $L__tmp44: 2026-02-21T08:33:03.4507035Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4507094Z cvt.u64.u32 %rd152, %r752; 2026-02-21T08:33:03.4507151Z cvt.u64.u32 %rd153, %r753; 2026-02-21T08:33:03.4507214Z shl.b64 %rd154, %rd153, 32; 2026-02-21T08:33:03.4507273Z or.b64 %rd155, %rd152, %rd154; 2026-02-21T08:33:03.4507324Z $L__tmp45: 2026-02-21T08:33:03.4507516Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4507584Z mov.b64 {%r1042, %r1043}, %rd155; 2026-02-21T08:33:03.4507654Z cvt.rn.bf16x2.f32 %r1044, %r1043, %r1042; 2026-02-21T08:33:03.4507705Z $L__tmp46: 2026-02-21T08:33:03.4507921Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4507978Z cvt.u64.u32 %rd156, %r755; 2026-02-21T08:33:03.4508035Z cvt.u64.u32 %rd157, %r756; 2026-02-21T08:33:03.4508100Z shl.b64 %rd158, %rd157, 32; 2026-02-21T08:33:03.4508160Z or.b64 %rd159, %rd156, %rd158; 2026-02-21T08:33:03.4508211Z $L__tmp47: 2026-02-21T08:33:03.4508379Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4508444Z mov.b64 {%r1045, %r1046}, %rd159; 2026-02-21T08:33:03.4508511Z cvt.rn.bf16x2.f32 %r1047, %r1046, %r1045; 2026-02-21T08:33:03.4508563Z $L__tmp48: 
2026-02-21T08:33:03.4508773Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4508834Z cvt.u64.u32 %rd160, %r757; 2026-02-21T08:33:03.4508893Z cvt.u64.u32 %rd161, %r758; 2026-02-21T08:33:03.4508957Z shl.b64 %rd162, %rd161, 32; 2026-02-21T08:33:03.4509017Z or.b64 %rd163, %rd160, %rd162; 2026-02-21T08:33:03.4509068Z $L__tmp49: 2026-02-21T08:33:03.4509234Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4509301Z mov.b64 {%r1048, %r1049}, %rd163; 2026-02-21T08:33:03.4509367Z cvt.rn.bf16x2.f32 %r1050, %r1049, %r1048; 2026-02-21T08:33:03.4509420Z $L__tmp50: 2026-02-21T08:33:03.4509630Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4509689Z cvt.u64.u32 %rd164, %r759; 2026-02-21T08:33:03.4509747Z cvt.u64.u32 %rd165, %r760; 2026-02-21T08:33:03.4509814Z shl.b64 %rd166, %rd165, 32; 2026-02-21T08:33:03.4509875Z or.b64 %rd167, %rd164, %rd166; 2026-02-21T08:33:03.4509927Z $L__tmp51: 2026-02-21T08:33:03.4510093Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4510162Z mov.b64 {%r1051, %r1052}, %rd167; 2026-02-21T08:33:03.4510228Z cvt.rn.bf16x2.f32 %r1053, %r1052, %r1051; 2026-02-21T08:33:03.4510280Z $L__tmp52: 2026-02-21T08:33:03.4510492Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4510549Z cvt.u64.u32 %rd168, %r761; 2026-02-21T08:33:03.4510628Z cvt.u64.u32 %rd169, %r762; 2026-02-21T08:33:03.4510687Z shl.b64 %rd170, %rd169, 32; 2026-02-21T08:33:03.4510754Z or.b64 %rd171, %rd168, %rd170; 2026-02-21T08:33:03.4510805Z $L__tmp53: 2026-02-21T08:33:03.4510972Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4511040Z mov.b64 {%r1054, %r1055}, %rd171; 2026-02-21T08:33:03.4511129Z cvt.rn.bf16x2.f32 %r1056, %r1055, 
%r1054; 2026-02-21T08:33:03.4511188Z $L__tmp54: 2026-02-21T08:33:03.4511425Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4511484Z cvt.u64.u32 %rd172, %r763; 2026-02-21T08:33:03.4511569Z cvt.u64.u32 %rd173, %r764; 2026-02-21T08:33:03.4511629Z shl.b64 %rd174, %rd173, 32; 2026-02-21T08:33:03.4511695Z or.b64 %rd175, %rd172, %rd174; 2026-02-21T08:33:03.4511747Z $L__tmp55: 2026-02-21T08:33:03.4511914Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4511982Z mov.b64 {%r1057, %r1058}, %rd175; 2026-02-21T08:33:03.4512050Z cvt.rn.bf16x2.f32 %r1059, %r1058, %r1057; 2026-02-21T08:33:03.4512101Z $L__tmp56: 2026-02-21T08:33:03.4512344Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4512405Z cvt.u64.u32 %rd176, %r765; 2026-02-21T08:33:03.4512464Z cvt.u64.u32 %rd177, %r766; 2026-02-21T08:33:03.4512523Z shl.b64 %rd178, %rd177, 32; 2026-02-21T08:33:03.4512589Z or.b64 %rd179, %rd176, %rd178; 2026-02-21T08:33:03.4512641Z $L__tmp57: 2026-02-21T08:33:03.4512808Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4512875Z mov.b64 {%r1060, %r1061}, %rd179; 2026-02-21T08:33:03.4512942Z cvt.rn.bf16x2.f32 %r1062, %r1061, %r1060; 2026-02-21T08:33:03.4512992Z $L__tmp58: 2026-02-21T08:33:03.4513206Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4513267Z cvt.u64.u32 %rd180, %r767; 2026-02-21T08:33:03.4513325Z cvt.u64.u32 %rd181, %r768; 2026-02-21T08:33:03.4513384Z shl.b64 %rd182, %rd181, 32; 2026-02-21T08:33:03.4513450Z or.b64 %rd183, %rd180, %rd182; 2026-02-21T08:33:03.4513503Z $L__tmp59: 2026-02-21T08:33:03.4513673Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4513740Z mov.b64 {%r1063, %r1064}, %rd183; 
2026-02-21T08:33:03.4513807Z cvt.rn.bf16x2.f32 %r1065, %r1064, %r1063; 2026-02-21T08:33:03.4513860Z $L__tmp60: 2026-02-21T08:33:03.4514078Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4514136Z cvt.u64.u32 %rd184, %r769; 2026-02-21T08:33:03.4514193Z cvt.u64.u32 %rd185, %r770; 2026-02-21T08:33:03.4514251Z shl.b64 %rd186, %rd185, 32; 2026-02-21T08:33:03.4514321Z or.b64 %rd187, %rd184, %rd186; 2026-02-21T08:33:03.4514372Z $L__tmp61: 2026-02-21T08:33:03.4514541Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4514608Z mov.b64 {%r1066, %r1067}, %rd187; 2026-02-21T08:33:03.4514677Z cvt.rn.bf16x2.f32 %r1068, %r1067, %r1066; 2026-02-21T08:33:03.4514729Z $L__tmp62: 2026-02-21T08:33:03.4514937Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4515001Z cvt.u64.u32 %rd188, %r772; 2026-02-21T08:33:03.4515057Z cvt.u64.u32 %rd189, %r773; 2026-02-21T08:33:03.4515113Z shl.b64 %rd190, %rd189, 32; 2026-02-21T08:33:03.4515177Z or.b64 %rd191, %rd188, %rd190; 2026-02-21T08:33:03.4515228Z $L__tmp63: 2026-02-21T08:33:03.4515396Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4515461Z mov.b64 {%r1069, %r1070}, %rd191; 2026-02-21T08:33:03.4515553Z cvt.rn.bf16x2.f32 %r1071, %r1070, %r1069; 2026-02-21T08:33:03.4515605Z $L__tmp64: 2026-02-21T08:33:03.4515818Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4515882Z cvt.u64.u32 %rd192, %r774; 2026-02-21T08:33:03.4515940Z cvt.u64.u32 %rd193, %r775; 2026-02-21T08:33:03.4516023Z shl.b64 %rd194, %rd193, 32; 2026-02-21T08:33:03.4516091Z or.b64 %rd195, %rd192, %rd194; 2026-02-21T08:33:03.4516146Z $L__tmp65: 2026-02-21T08:33:03.4516345Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 
2026-02-21T08:33:03.4516415Z mov.b64 {%r1072, %r1073}, %rd195; 2026-02-21T08:33:03.4516485Z cvt.rn.bf16x2.f32 %r1074, %r1073, %r1072; 2026-02-21T08:33:03.4516538Z $L__tmp66: 2026-02-21T08:33:03.4516755Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4516826Z cvt.u64.u32 %rd196, %r776; 2026-02-21T08:33:03.4516886Z cvt.u64.u32 %rd197, %r777; 2026-02-21T08:33:03.4516948Z shl.b64 %rd198, %rd197, 32; 2026-02-21T08:33:03.4517016Z or.b64 %rd199, %rd196, %rd198; 2026-02-21T08:33:03.4517070Z $L__tmp67: 2026-02-21T08:33:03.4517263Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4517334Z mov.b64 {%r1075, %r1076}, %rd199; 2026-02-21T08:33:03.4517405Z cvt.rn.bf16x2.f32 %r1077, %r1076, %r1075; 2026-02-21T08:33:03.4517461Z $L__tmp68: 2026-02-21T08:33:03.4517684Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4517754Z cvt.u64.u32 %rd200, %r778; 2026-02-21T08:33:03.4517814Z cvt.u64.u32 %rd201, %r779; 2026-02-21T08:33:03.4517877Z shl.b64 %rd202, %rd201, 32; 2026-02-21T08:33:03.4517947Z or.b64 %rd203, %rd200, %rd202; 2026-02-21T08:33:03.4518001Z $L__tmp69: 2026-02-21T08:33:03.4518180Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4518241Z mov.b64 {%r1078, %r1079}, %rd203; 2026-02-21T08:33:03.4518318Z cvt.rn.bf16x2.f32 %r1080, %r1079, %r1078; 2026-02-21T08:33:03.4518371Z $L__tmp70: 2026-02-21T08:33:03.4518595Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4518665Z cvt.u64.u32 %rd204, %r780; 2026-02-21T08:33:03.4518726Z cvt.u64.u32 %rd205, %r781; 2026-02-21T08:33:03.4518787Z shl.b64 %rd206, %rd205, 32; 2026-02-21T08:33:03.4518855Z or.b64 %rd207, %rd204, %rd206; 2026-02-21T08:33:03.4518909Z $L__tmp71: 2026-02-21T08:33:03.4519086Z .loc 1 97 28 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4519149Z mov.b64 {%r1081, %r1082}, %rd207; 2026-02-21T08:33:03.4519225Z cvt.rn.bf16x2.f32 %r1083, %r1082, %r1081; 2026-02-21T08:33:03.4519280Z $L__tmp72: 2026-02-21T08:33:03.4519502Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4519570Z cvt.u64.u32 %rd208, %r782; 2026-02-21T08:33:03.4519629Z cvt.u64.u32 %rd209, %r783; 2026-02-21T08:33:03.4519688Z shl.b64 %rd210, %rd209, 32; 2026-02-21T08:33:03.4519756Z or.b64 %rd211, %rd208, %rd210; 2026-02-21T08:33:03.4519810Z $L__tmp73: 2026-02-21T08:33:03.4519986Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4520049Z mov.b64 {%r1084, %r1085}, %rd211; 2026-02-21T08:33:03.4520128Z cvt.rn.bf16x2.f32 %r1086, %r1085, %r1084; 2026-02-21T08:33:03.4520182Z $L__tmp74: 2026-02-21T08:33:03.4520404Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4520472Z cvt.u64.u32 %rd212, %r784; 2026-02-21T08:33:03.4520532Z cvt.u64.u32 %rd213, %r785; 2026-02-21T08:33:03.4520633Z shl.b64 %rd214, %rd213, 32; 2026-02-21T08:33:03.4520695Z or.b64 %rd215, %rd212, %rd214; 2026-02-21T08:33:03.4520755Z $L__tmp75: 2026-02-21T08:33:03.4520930Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4520995Z mov.b64 {%r1087, %r1088}, %rd215; 2026-02-21T08:33:03.4521095Z cvt.rn.bf16x2.f32 %r1089, %r1088, %r1087; 2026-02-21T08:33:03.4521150Z $L__tmp76: 2026-02-21T08:33:03.4521389Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4521457Z cvt.u64.u32 %rd216, %r786; 2026-02-21T08:33:03.4521517Z cvt.u64.u32 %rd217, %r787; 2026-02-21T08:33:03.4521617Z shl.b64 %rd218, %rd217, 32; 2026-02-21T08:33:03.4521682Z or.b64 %rd219, %rd216, %rd218; 2026-02-21T08:33:03.4521743Z 
$L__tmp77: 2026-02-21T08:33:03.4521918Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4521982Z mov.b64 {%r1090, %r1091}, %rd219; 2026-02-21T08:33:03.4522060Z cvt.rn.bf16x2.f32 %r1092, %r1091, %r1090; 2026-02-21T08:33:03.4522114Z $L__tmp78: 2026-02-21T08:33:03.4522359Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4522428Z cvt.u64.u32 %rd220, %r789; 2026-02-21T08:33:03.4522486Z cvt.u64.u32 %rd221, %r790; 2026-02-21T08:33:03.4522546Z shl.b64 %rd222, %rd221, 32; 2026-02-21T08:33:03.4522608Z or.b64 %rd223, %rd220, %rd222; 2026-02-21T08:33:03.4522670Z $L__tmp79: 2026-02-21T08:33:03.4522839Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4522901Z mov.b64 {%r1093, %r1094}, %rd223; 2026-02-21T08:33:03.4522978Z cvt.rn.bf16x2.f32 %r1095, %r1094, %r1093; 2026-02-21T08:33:03.4523030Z $L__tmp80: 2026-02-21T08:33:03.4523246Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4523314Z cvt.u64.u32 %rd224, %r791; 2026-02-21T08:33:03.4523372Z cvt.u64.u32 %rd225, %r792; 2026-02-21T08:33:03.4523433Z shl.b64 %rd226, %rd225, 32; 2026-02-21T08:33:03.4523494Z or.b64 %rd227, %rd224, %rd226; 2026-02-21T08:33:03.4523554Z $L__tmp81: 2026-02-21T08:33:03.4523726Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4523796Z mov.b64 {%r1096, %r1097}, %rd227; 2026-02-21T08:33:03.4523869Z cvt.rn.bf16x2.f32 %r1098, %r1097, %r1096; 2026-02-21T08:33:03.4523920Z $L__tmp82: 2026-02-21T08:33:03.4524123Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4524186Z cvt.u64.u32 %rd228, %r793; 2026-02-21T08:33:03.4524242Z cvt.u64.u32 %rd229, %r794; 2026-02-21T08:33:03.4524299Z shl.b64 %rd230, %rd229, 32; 2026-02-21T08:33:03.4524359Z or.b64 
%rd231, %rd228, %rd230; 2026-02-21T08:33:03.4524417Z $L__tmp83: 2026-02-21T08:33:03.4524577Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4524635Z mov.b64 {%r1099, %r1100}, %rd231; 2026-02-21T08:33:03.4524708Z cvt.rn.bf16x2.f32 %r1101, %r1100, %r1099; 2026-02-21T08:33:03.4524760Z $L__tmp84: 2026-02-21T08:33:03.4524967Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4525026Z cvt.u64.u32 %rd232, %r795; 2026-02-21T08:33:03.4525089Z cvt.u64.u32 %rd233, %r796; 2026-02-21T08:33:03.4525147Z shl.b64 %rd234, %rd233, 32; 2026-02-21T08:33:03.4525204Z or.b64 %rd235, %rd232, %rd234; 2026-02-21T08:33:03.4525263Z $L__tmp85: 2026-02-21T08:33:03.4525424Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4525484Z mov.b64 {%r1102, %r1103}, %rd235; 2026-02-21T08:33:03.4525674Z cvt.rn.bf16x2.f32 %r1104, %r1103, %r1102; 2026-02-21T08:33:03.4525725Z $L__tmp86: 2026-02-21T08:33:03.4525938Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4525996Z cvt.u64.u32 %rd236, %r797; 2026-02-21T08:33:03.4526062Z cvt.u64.u32 %rd237, %r798; 2026-02-21T08:33:03.4526148Z shl.b64 %rd238, %rd237, 32; 2026-02-21T08:33:03.4526207Z or.b64 %rd239, %rd236, %rd238; 2026-02-21T08:33:03.4526269Z $L__tmp87: 2026-02-21T08:33:03.4526459Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4526520Z mov.b64 {%r1105, %r1106}, %rd239; 2026-02-21T08:33:03.4526594Z cvt.rn.bf16x2.f32 %r1107, %r1106, %r1105; 2026-02-21T08:33:03.4526646Z $L__tmp88: 2026-02-21T08:33:03.4526861Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4526921Z cvt.u64.u32 %rd240, %r799; 2026-02-21T08:33:03.4526989Z cvt.u64.u32 %rd241, %r800; 2026-02-21T08:33:03.4527047Z shl.b64 %rd242, 
%rd241, 32; 2026-02-21T08:33:03.4527108Z or.b64 %rd243, %rd240, %rd242; 2026-02-21T08:33:03.4527166Z $L__tmp89: 2026-02-21T08:33:03.4527355Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4527416Z mov.b64 {%r1108, %r1109}, %rd243; 2026-02-21T08:33:03.4527488Z cvt.rn.bf16x2.f32 %r1110, %r1109, %r1108; 2026-02-21T08:33:03.4527541Z $L__tmp90: 2026-02-21T08:33:03.4527747Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4527805Z cvt.u64.u32 %rd244, %r801; 2026-02-21T08:33:03.4527868Z cvt.u64.u32 %rd245, %r802; 2026-02-21T08:33:03.4527926Z shl.b64 %rd246, %rd245, 32; 2026-02-21T08:33:03.4527985Z or.b64 %rd247, %rd244, %rd246; 2026-02-21T08:33:03.4528042Z $L__tmp91: 2026-02-21T08:33:03.4528207Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4528266Z mov.b64 {%r1111, %r1112}, %rd247; 2026-02-21T08:33:03.4528332Z cvt.rn.bf16x2.f32 %r1113, %r1112, %r1111; 2026-02-21T08:33:03.4528389Z $L__tmp92: 2026-02-21T08:33:03.4528599Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4528658Z cvt.u64.u32 %rd248, %r803; 2026-02-21T08:33:03.4528724Z cvt.u64.u32 %rd249, %r804; 2026-02-21T08:33:03.4528784Z shl.b64 %rd250, %rd249, 32; 2026-02-21T08:33:03.4528844Z or.b64 %rd251, %rd248, %rd250; 2026-02-21T08:33:03.4528900Z $L__tmp93: 2026-02-21T08:33:03.4529063Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4529122Z mov.b64 {%r1114, %r1115}, %rd251; 2026-02-21T08:33:03.4529189Z cvt.rn.bf16x2.f32 %r1116, %r1115, %r1114; 2026-02-21T08:33:03.4529245Z $L__tmp94: 2026-02-21T08:33:03.4529454Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4529510Z cvt.u64.u32 %rd252, %r806; 2026-02-21T08:33:03.4529574Z cvt.u64.u32 %rd253, 
%r807; 2026-02-21T08:33:03.4529632Z shl.b64 %rd254, %rd253, 32; 2026-02-21T08:33:03.4529691Z or.b64 %rd255, %rd252, %rd254; 2026-02-21T08:33:03.4529750Z $L__tmp95: 2026-02-21T08:33:03.4529919Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4529979Z mov.b64 {%r1117, %r1118}, %rd255; 2026-02-21T08:33:03.4530047Z cvt.rn.bf16x2.f32 %r1119, %r1118, %r1117; 2026-02-21T08:33:03.4530105Z $L__tmp96: 2026-02-21T08:33:03.4530315Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4530371Z cvt.u64.u32 %rd256, %r808; 2026-02-21T08:33:03.4530436Z cvt.u64.u32 %rd257, %r809; 2026-02-21T08:33:03.4530517Z shl.b64 %rd258, %rd257, 32; 2026-02-21T08:33:03.4530575Z or.b64 %rd259, %rd256, %rd258; 2026-02-21T08:33:03.4530627Z $L__tmp97: 2026-02-21T08:33:03.4530803Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4530863Z mov.b64 {%r1120, %r1121}, %rd259; 2026-02-21T08:33:03.4530931Z cvt.rn.bf16x2.f32 %r1122, %r1121, %r1120; 2026-02-21T08:33:03.4531012Z $L__tmp98: 2026-02-21T08:33:03.4531241Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4531299Z cvt.u64.u32 %rd260, %r810; 2026-02-21T08:33:03.4531362Z cvt.u64.u32 %rd261, %r811; 2026-02-21T08:33:03.4531420Z shl.b64 %rd262, %rd261, 32; 2026-02-21T08:33:03.4531477Z or.b64 %rd263, %rd260, %rd262; 2026-02-21T08:33:03.4531528Z $L__tmp99: 2026-02-21T08:33:03.4531739Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4531799Z mov.b64 {%r1123, %r1124}, %rd263; 2026-02-21T08:33:03.4531866Z cvt.rn.bf16x2.f32 %r1125, %r1124, %r1123; 2026-02-21T08:33:03.4531927Z $L__tmp100: 2026-02-21T08:33:03.4532140Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4532223Z cvt.u64.u32 %rd264, %r812; 
2026-02-21T08:33:03.4532289Z cvt.u64.u32 %rd265, %r813; 2026-02-21T08:33:03.4532348Z shl.b64 %rd266, %rd265, 32; 2026-02-21T08:33:03.4532407Z or.b64 %rd267, %rd264, %rd266; 2026-02-21T08:33:03.4532460Z $L__tmp101: 2026-02-21T08:33:03.4532631Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4532691Z mov.b64 {%r1126, %r1127}, %rd267; 2026-02-21T08:33:03.4532758Z cvt.rn.bf16x2.f32 %r1128, %r1127, %r1126; 2026-02-21T08:33:03.4532819Z $L__tmp102: 2026-02-21T08:33:03.4533025Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4533086Z cvt.u64.u32 %rd268, %r814; 2026-02-21T08:33:03.4533152Z cvt.u64.u32 %rd269, %r815; 2026-02-21T08:33:03.4533210Z shl.b64 %rd270, %rd269, 32; 2026-02-21T08:33:03.4533267Z or.b64 %rd271, %rd268, %rd270; 2026-02-21T08:33:03.4533319Z $L__tmp103: 2026-02-21T08:33:03.4533491Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4533551Z mov.b64 {%r1129, %r1130}, %rd271; 2026-02-21T08:33:03.4533617Z cvt.rn.bf16x2.f32 %r1131, %r1130, %r1129; 2026-02-21T08:33:03.4533675Z $L__tmp104: 2026-02-21T08:33:03.4533878Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4533934Z cvt.u64.u32 %rd272, %r816; 2026-02-21T08:33:03.4533997Z cvt.u64.u32 %rd273, %r817; 2026-02-21T08:33:03.4534054Z shl.b64 %rd274, %rd273, 32; 2026-02-21T08:33:03.4534112Z or.b64 %rd275, %rd272, %rd274; 2026-02-21T08:33:03.4534166Z $L__tmp105: 2026-02-21T08:33:03.4534334Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4534392Z mov.b64 {%r1132, %r1133}, %rd275; 2026-02-21T08:33:03.4534459Z cvt.rn.bf16x2.f32 %r1134, %r1133, %r1132; 2026-02-21T08:33:03.4534519Z $L__tmp106: 2026-02-21T08:33:03.4534726Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 
2026-02-21T08:33:03.4534785Z cvt.u64.u32 %rd276, %r818; 2026-02-21T08:33:03.4534841Z cvt.u64.u32 %rd277, %r819; 2026-02-21T08:33:03.4534905Z shl.b64 %rd278, %rd277, 32; 2026-02-21T08:33:03.4534963Z or.b64 %rd279, %rd276, %rd278; 2026-02-21T08:33:03.4535014Z $L__tmp107: 2026-02-21T08:33:03.4535185Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4535244Z mov.b64 {%r1135, %r1136}, %rd279; 2026-02-21T08:33:03.4535337Z cvt.rn.bf16x2.f32 %r1137, %r1136, %r1135; 2026-02-21T08:33:03.4535396Z $L__tmp108: 2026-02-21T08:33:03.4535604Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4535662Z cvt.u64.u32 %rd280, %r820; 2026-02-21T08:33:03.4535720Z cvt.u64.u32 %rd281, %r821; 2026-02-21T08:33:03.4535809Z shl.b64 %rd282, %rd281, 32; 2026-02-21T08:33:03.4535866Z or.b64 %rd283, %rd280, %rd282; 2026-02-21T08:33:03.4535918Z $L__tmp109: 2026-02-21T08:33:03.4536116Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4536178Z mov.b64 {%r1138, %r1139}, %rd283; 2026-02-21T08:33:03.4536244Z cvt.rn.bf16x2.f32 %r1140, %r1139, %r1138; 2026-02-21T08:33:03.4536304Z $L__tmp110: 2026-02-21T08:33:03.4536510Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4536569Z cvt.u64.u32 %rd284, %r823; 2026-02-21T08:33:03.4536627Z cvt.u64.u32 %rd285, %r824; 2026-02-21T08:33:03.4536694Z shl.b64 %rd286, %rd285, 32; 2026-02-21T08:33:03.4536751Z or.b64 %rd287, %rd284, %rd286; 2026-02-21T08:33:03.4536802Z $L__tmp111: 2026-02-21T08:33:03.4537007Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4537069Z mov.b64 {%r1141, %r1142}, %rd287; 2026-02-21T08:33:03.4537137Z cvt.rn.bf16x2.f32 %r1143, %r1142, %r1141; 2026-02-21T08:33:03.4537197Z $L__tmp112: 2026-02-21T08:33:03.4537406Z .loc 2 291 36 // standard.py:291:36 @[ 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4537465Z cvt.u64.u32 %rd288, %r825; 2026-02-21T08:33:03.4537522Z cvt.u64.u32 %rd289, %r826; 2026-02-21T08:33:03.4537587Z shl.b64 %rd290, %rd289, 32; 2026-02-21T08:33:03.4537645Z or.b64 %rd291, %rd288, %rd290; 2026-02-21T08:33:03.4537698Z $L__tmp113: 2026-02-21T08:33:03.4537872Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4537931Z mov.b64 {%r1144, %r1145}, %rd291; 2026-02-21T08:33:03.4537997Z cvt.rn.bf16x2.f32 %r1146, %r1145, %r1144; 2026-02-21T08:33:03.4538047Z $L__tmp114: 2026-02-21T08:33:03.4538262Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4538321Z cvt.u64.u32 %rd292, %r827; 2026-02-21T08:33:03.4538379Z cvt.u64.u32 %rd293, %r828; 2026-02-21T08:33:03.4538446Z shl.b64 %rd294, %rd293, 32; 2026-02-21T08:33:03.4538505Z or.b64 %rd295, %rd292, %rd294; 2026-02-21T08:33:03.4538557Z $L__tmp115: 2026-02-21T08:33:03.4538727Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4538787Z mov.b64 {%r1147, %r1148}, %rd295; 2026-02-21T08:33:03.4538854Z cvt.rn.bf16x2.f32 %r1149, %r1148, %r1147; 2026-02-21T08:33:03.4538907Z $L__tmp116: 2026-02-21T08:33:03.4539126Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4539184Z cvt.u64.u32 %rd296, %r829; 2026-02-21T08:33:03.4539241Z cvt.u64.u32 %rd297, %r830; 2026-02-21T08:33:03.4539307Z shl.b64 %rd298, %rd297, 32; 2026-02-21T08:33:03.4539367Z or.b64 %rd299, %rd296, %rd298; 2026-02-21T08:33:03.4539418Z $L__tmp117: 2026-02-21T08:33:03.4539591Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4539651Z mov.b64 {%r1150, %r1151}, %rd299; 2026-02-21T08:33:03.4539717Z cvt.rn.bf16x2.f32 %r1152, %r1151, %r1150; 2026-02-21T08:33:03.4539768Z $L__tmp118: 
2026-02-21T08:33:03.4539982Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4540039Z cvt.u64.u32 %rd300, %r831; 2026-02-21T08:33:03.4540096Z cvt.u64.u32 %rd301, %r832; 2026-02-21T08:33:03.4540187Z shl.b64 %rd302, %rd301, 32; 2026-02-21T08:33:03.4540246Z or.b64 %rd303, %rd300, %rd302; 2026-02-21T08:33:03.4540298Z $L__tmp119: 2026-02-21T08:33:03.4540471Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4540531Z mov.b64 {%r1153, %r1154}, %rd303; 2026-02-21T08:33:03.4540600Z cvt.rn.bf16x2.f32 %r1155, %r1154, %r1153; 2026-02-21T08:33:03.4540682Z $L__tmp120: 2026-02-21T08:33:03.4540922Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4540982Z cvt.u64.u32 %rd304, %r833; 2026-02-21T08:33:03.4541039Z cvt.u64.u32 %rd305, %r834; 2026-02-21T08:33:03.4541105Z shl.b64 %rd306, %rd305, 32; 2026-02-21T08:33:03.4541163Z or.b64 %rd307, %rd304, %rd306; 2026-02-21T08:33:03.4541214Z $L__tmp121: 2026-02-21T08:33:03.4541380Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4541448Z mov.b64 {%r1156, %r1157}, %rd307; 2026-02-21T08:33:03.4541516Z cvt.rn.bf16x2.f32 %r1158, %r1157, %r1156; 2026-02-21T08:33:03.4541601Z $L__tmp122: 2026-02-21T08:33:03.4541818Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4541905Z cvt.u64.u32 %rd308, %r835; 2026-02-21T08:33:03.4541966Z cvt.u64.u32 %rd309, %r836; 2026-02-21T08:33:03.4542031Z shl.b64 %rd310, %rd309, 32; 2026-02-21T08:33:03.4542090Z or.b64 %rd311, %rd308, %rd310; 2026-02-21T08:33:03.4542140Z $L__tmp123: 2026-02-21T08:33:03.4542308Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4542373Z mov.b64 {%r1159, %r1160}, %rd311; 2026-02-21T08:33:03.4542439Z cvt.rn.bf16x2.f32 %r1161, 
%r1160, %r1159; 2026-02-21T08:33:03.4542490Z $L__tmp124: 2026-02-21T08:33:03.4542704Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4542763Z cvt.u64.u32 %rd312, %r837; 2026-02-21T08:33:03.4542821Z cvt.u64.u32 %rd313, %r838; 2026-02-21T08:33:03.4542885Z shl.b64 %rd314, %rd313, 32; 2026-02-21T08:33:03.4542943Z or.b64 %rd315, %rd312, %rd314; 2026-02-21T08:33:03.4542995Z $L__tmp125: 2026-02-21T08:33:03.4543159Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4543227Z mov.b64 {%r1162, %r1163}, %rd315; 2026-02-21T08:33:03.4543296Z cvt.rn.bf16x2.f32 %r1164, %r1163, %r1162; 2026-02-21T08:33:03.4543348Z $L__tmp126: 2026-02-21T08:33:03.4543561Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4543619Z cvt.u64.u32 %rd316, %r840; 2026-02-21T08:33:03.4543676Z cvt.u64.u32 %rd317, %r841; 2026-02-21T08:33:03.4543742Z shl.b64 %rd318, %rd317, 32; 2026-02-21T08:33:03.4543801Z or.b64 %rd319, %rd316, %rd318; 2026-02-21T08:33:03.4543853Z $L__tmp127: 2026-02-21T08:33:03.4544018Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4544084Z mov.b64 {%r1165, %r1166}, %rd319; 2026-02-21T08:33:03.4544151Z cvt.rn.bf16x2.f32 %r1167, %r1166, %r1165; 2026-02-21T08:33:03.4544204Z $L__tmp128: 2026-02-21T08:33:03.4544417Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4544477Z cvt.u64.u32 %rd320, %r842; 2026-02-21T08:33:03.4544535Z cvt.u64.u32 %rd321, %r843; 2026-02-21T08:33:03.4544594Z shl.b64 %rd322, %rd321, 32; 2026-02-21T08:33:03.4544662Z or.b64 %rd323, %rd320, %rd322; 2026-02-21T08:33:03.4544714Z $L__tmp129: 2026-02-21T08:33:03.4544879Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4544948Z mov.b64 {%r1168, %r1169}, %rd323; 
2026-02-21T08:33:03.4545041Z cvt.rn.bf16x2.f32 %r1170, %r1169, %r1168; 2026-02-21T08:33:03.4545095Z $L__tmp130: 2026-02-21T08:33:03.4545316Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4545375Z cvt.u64.u32 %rd324, %r844; 2026-02-21T08:33:03.4545435Z cvt.u64.u32 %rd325, %r845; 2026-02-21T08:33:03.4545519Z shl.b64 %rd326, %rd325, 32; 2026-02-21T08:33:03.4545585Z or.b64 %rd327, %rd324, %rd326; 2026-02-21T08:33:03.4545637Z $L__tmp131: 2026-02-21T08:33:03.4545831Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4545899Z mov.b64 {%r1171, %r1172}, %rd327; 2026-02-21T08:33:03.4545968Z cvt.rn.bf16x2.f32 %r1173, %r1172, %r1171; 2026-02-21T08:33:03.4546020Z $L__tmp132: 2026-02-21T08:33:03.4546232Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4546292Z cvt.u64.u32 %rd328, %r846; 2026-02-21T08:33:03.4546355Z cvt.u64.u32 %rd329, %r847; 2026-02-21T08:33:03.4546413Z shl.b64 %rd330, %rd329, 32; 2026-02-21T08:33:03.4546478Z or.b64 %rd331, %rd328, %rd330; 2026-02-21T08:33:03.4546528Z $L__tmp133: 2026-02-21T08:33:03.4546715Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4546784Z mov.b64 {%r1174, %r1175}, %rd331; 2026-02-21T08:33:03.4546851Z cvt.rn.bf16x2.f32 %r1176, %r1175, %r1174; 2026-02-21T08:33:03.4546903Z $L__tmp134: 2026-02-21T08:33:03.4547120Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4547178Z cvt.u64.u32 %rd332, %r848; 2026-02-21T08:33:03.4547235Z cvt.u64.u32 %rd333, %r849; 2026-02-21T08:33:03.4547294Z shl.b64 %rd334, %rd333, 32; 2026-02-21T08:33:03.4547358Z or.b64 %rd335, %rd332, %rd334; 2026-02-21T08:33:03.4547410Z $L__tmp135: 2026-02-21T08:33:03.4547577Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 
2026-02-21T08:33:03.4547643Z mov.b64 {%r1177, %r1178}, %rd335; 2026-02-21T08:33:03.4547709Z cvt.rn.bf16x2.f32 %r1179, %r1178, %r1177; 2026-02-21T08:33:03.4547761Z $L__tmp136: 2026-02-21T08:33:03.4547977Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4548037Z cvt.u64.u32 %rd336, %r850; 2026-02-21T08:33:03.4548095Z cvt.u64.u32 %rd337, %r851; 2026-02-21T08:33:03.4548155Z shl.b64 %rd338, %rd337, 32; 2026-02-21T08:33:03.4548221Z or.b64 %rd339, %rd336, %rd338; 2026-02-21T08:33:03.4548273Z $L__tmp137: 2026-02-21T08:33:03.4548439Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4548506Z mov.b64 {%r1180, %r1181}, %rd339; 2026-02-21T08:33:03.4548572Z cvt.rn.bf16x2.f32 %r1182, %r1181, %r1180; 2026-02-21T08:33:03.4548624Z $L__tmp138: 2026-02-21T08:33:03.4548835Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4548900Z cvt.u64.u32 %rd340, %r852; 2026-02-21T08:33:03.4548958Z cvt.u64.u32 %rd341, %r853; 2026-02-21T08:33:03.4549016Z shl.b64 %rd342, %rd341, 32; 2026-02-21T08:33:03.4549083Z or.b64 %rd343, %rd340, %rd342; 2026-02-21T08:33:03.4549137Z $L__tmp139: 2026-02-21T08:33:03.4549305Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4549370Z mov.b64 {%r1183, %r1184}, %rd343; 2026-02-21T08:33:03.4549437Z cvt.rn.bf16x2.f32 %r1185, %r1184, %r1183; 2026-02-21T08:33:03.4549489Z $L__tmp140: 2026-02-21T08:33:03.4549697Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4549761Z cvt.u64.u32 %rd344, %r854; 2026-02-21T08:33:03.4549817Z cvt.u64.u32 %rd345, %r855; 2026-02-21T08:33:03.4549895Z shl.b64 %rd346, %rd345, 32; 2026-02-21T08:33:03.4549961Z or.b64 %rd347, %rd344, %rd346; 2026-02-21T08:33:03.4550011Z $L__tmp141: 2026-02-21T08:33:03.4550180Z .loc 1 97 28 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4550245Z mov.b64 {%r1186, %r1187}, %rd347; 2026-02-21T08:33:03.4550332Z cvt.rn.bf16x2.f32 %r1188, %r1187, %r1186; 2026-02-21T08:33:03.4550383Z $L__tmp142: 2026-02-21T08:33:03.4550613Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4550681Z cvt.u64.u32 %rd348, %r857; 2026-02-21T08:33:03.4550739Z cvt.u64.u32 %rd349, %r858; 2026-02-21T08:33:03.4550797Z shl.b64 %rd350, %rd349, 32; 2026-02-21T08:33:03.4550862Z or.b64 %rd351, %rd348, %rd350; 2026-02-21T08:33:03.4550913Z $L__tmp143: 2026-02-21T08:33:03.4551080Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4551146Z mov.b64 {%r1189, %r1190}, %rd351; 2026-02-21T08:33:03.4551214Z cvt.rn.bf16x2.f32 %r1191, %r1190, %r1189; 2026-02-21T08:33:03.4551265Z $L__tmp144: 2026-02-21T08:33:03.4551476Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4551591Z cvt.u64.u32 %rd352, %r859; 2026-02-21T08:33:03.4551652Z cvt.u64.u32 %rd353, %r860; 2026-02-21T08:33:03.4551710Z shl.b64 %rd354, %rd353, 32; 2026-02-21T08:33:03.4551775Z or.b64 %rd355, %rd352, %rd354; 2026-02-21T08:33:03.4551827Z $L__tmp145: 2026-02-21T08:33:03.4551995Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4552054Z mov.b64 {%r1192, %r1193}, %rd355; 2026-02-21T08:33:03.4552130Z cvt.rn.bf16x2.f32 %r1194, %r1193, %r1192; 2026-02-21T08:33:03.4552181Z $L__tmp146: 2026-02-21T08:33:03.4552392Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4552459Z cvt.u64.u32 %rd356, %r861; 2026-02-21T08:33:03.4552517Z cvt.u64.u32 %rd357, %r862; 2026-02-21T08:33:03.4552576Z shl.b64 %rd358, %rd357, 32; 2026-02-21T08:33:03.4552641Z or.b64 %rd359, %rd356, %rd358; 2026-02-21T08:33:03.4552693Z 
$L__tmp147: 2026-02-21T08:33:03.4552861Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4552922Z mov.b64 {%r1195, %r1196}, %rd359; 2026-02-21T08:33:03.4552998Z cvt.rn.bf16x2.f32 %r1197, %r1196, %r1195; 2026-02-21T08:33:03.4553051Z $L__tmp148: 2026-02-21T08:33:03.4553261Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4553328Z cvt.u64.u32 %rd360, %r863; 2026-02-21T08:33:03.4553387Z cvt.u64.u32 %rd361, %r864; 2026-02-21T08:33:03.4553447Z shl.b64 %rd362, %rd361, 32; 2026-02-21T08:33:03.4553515Z or.b64 %rd363, %rd360, %rd362; 2026-02-21T08:33:03.4553569Z $L__tmp149: 2026-02-21T08:33:03.4553734Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4553793Z mov.b64 {%r1198, %r1199}, %rd363; 2026-02-21T08:33:03.4553866Z cvt.rn.bf16x2.f32 %r1200, %r1199, %r1198; 2026-02-21T08:33:03.4553918Z $L__tmp150: 2026-02-21T08:33:03.4554127Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4554194Z cvt.u64.u32 %rd364, %r865; 2026-02-21T08:33:03.4554251Z cvt.u64.u32 %rd365, %r866; 2026-02-21T08:33:03.4554308Z shl.b64 %rd366, %rd365, 32; 2026-02-21T08:33:03.4554372Z or.b64 %rd367, %rd364, %rd366; 2026-02-21T08:33:03.4554422Z $L__tmp151: 2026-02-21T08:33:03.4554587Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4554647Z mov.b64 {%r1201, %r1202}, %rd367; 2026-02-21T08:33:03.4554749Z cvt.rn.bf16x2.f32 %r1203, %r1202, %r1201; 2026-02-21T08:33:03.4554801Z $L__tmp152: 2026-02-21T08:33:03.4555014Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4555079Z cvt.u64.u32 %rd368, %r867; 2026-02-21T08:33:03.4555137Z cvt.u64.u32 %rd369, %r868; 2026-02-21T08:33:03.4555220Z shl.b64 %rd370, %rd369, 32; 2026-02-21T08:33:03.4555278Z or.b64 
%rd371, %rd368, %rd370; 2026-02-21T08:33:03.4555338Z $L__tmp153: 2026-02-21T08:33:03.4555542Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4555602Z mov.b64 {%r1204, %r1205}, %rd371; 2026-02-21T08:33:03.4555677Z cvt.rn.bf16x2.f32 %r1206, %r1205, %r1204; 2026-02-21T08:33:03.4555730Z $L__tmp154: 2026-02-21T08:33:03.4555940Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4556007Z cvt.u64.u32 %rd372, %r869; 2026-02-21T08:33:03.4556063Z cvt.u64.u32 %rd373, %r870; 2026-02-21T08:33:03.4556120Z shl.b64 %rd374, %rd373, 32; 2026-02-21T08:33:03.4556179Z or.b64 %rd375, %rd372, %rd374; 2026-02-21T08:33:03.4556238Z $L__tmp155: 2026-02-21T08:33:03.4556426Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4556487Z mov.b64 {%r1207, %r1208}, %rd375; 2026-02-21T08:33:03.4556561Z cvt.rn.bf16x2.f32 %r1209, %r1208, %r1207; 2026-02-21T08:33:03.4556616Z $L__tmp156: 2026-02-21T08:33:03.4556828Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4556893Z cvt.u64.u32 %rd376, %r871; 2026-02-21T08:33:03.4556949Z cvt.u64.u32 %rd377, %r872; 2026-02-21T08:33:03.4557006Z shl.b64 %rd378, %rd377, 32; 2026-02-21T08:33:03.4557064Z or.b64 %rd379, %rd376, %rd378; 2026-02-21T08:33:03.4557124Z $L__tmp157: 2026-02-21T08:33:03.4557292Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4557353Z mov.b64 {%r1210, %r1211}, %rd379; 2026-02-21T08:33:03.4557427Z cvt.rn.bf16x2.f32 %r1212, %r1211, %r1210; 2026-02-21T08:33:03.4557479Z $L__tmp158: 2026-02-21T08:33:03.4557690Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4557758Z cvt.u64.u32 %rd380, %r874; 2026-02-21T08:33:03.4557816Z cvt.u64.u32 %rd381, %r875; 2026-02-21T08:33:03.4557876Z shl.b64 
%rd382, %rd381, 32; 2026-02-21T08:33:03.4557936Z or.b64 %rd383, %rd380, %rd382; 2026-02-21T08:33:03.4557995Z $L__tmp159: 2026-02-21T08:33:03.4558164Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4558223Z mov.b64 {%r1213, %r1214}, %rd383; 2026-02-21T08:33:03.4558295Z cvt.rn.bf16x2.f32 %r1215, %r1214, %r1213; 2026-02-21T08:33:03.4558348Z $L__tmp160: 2026-02-21T08:33:03.4558557Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4558621Z cvt.u64.u32 %rd384, %r876; 2026-02-21T08:33:03.4558678Z cvt.u64.u32 %rd385, %r877; 2026-02-21T08:33:03.4558735Z shl.b64 %rd386, %rd385, 32; 2026-02-21T08:33:03.4558795Z or.b64 %rd387, %rd384, %rd386; 2026-02-21T08:33:03.4558856Z $L__tmp161: 2026-02-21T08:33:03.4559025Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4559083Z mov.b64 {%r1216, %r1217}, %rd387; 2026-02-21T08:33:03.4559157Z cvt.rn.bf16x2.f32 %r1218, %r1217, %r1216; 2026-02-21T08:33:03.4559207Z $L__tmp162: 2026-02-21T08:33:03.4559417Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4559475Z cvt.u64.u32 %rd388, %r878; 2026-02-21T08:33:03.4559540Z cvt.u64.u32 %rd389, %r879; 2026-02-21T08:33:03.4559626Z shl.b64 %rd390, %rd389, 32; 2026-02-21T08:33:03.4559687Z or.b64 %rd391, %rd388, %rd390; 2026-02-21T08:33:03.4559748Z $L__tmp163: 2026-02-21T08:33:03.4559923Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4559985Z mov.b64 {%r1219, %r1220}, %rd391; 2026-02-21T08:33:03.4560085Z cvt.rn.bf16x2.f32 %r1221, %r1220, %r1219; 2026-02-21T08:33:03.4560139Z $L__tmp164: 2026-02-21T08:33:03.4560385Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4560446Z cvt.u64.u32 %rd392, %r880; 2026-02-21T08:33:03.4560513Z cvt.u64.u32 
%rd393, %r881; 2026-02-21T08:33:03.4560574Z shl.b64 %rd394, %rd393, 32; 2026-02-21T08:33:03.4560635Z or.b64 %rd395, %rd392, %rd394; 2026-02-21T08:33:03.4560694Z $L__tmp165: 2026-02-21T08:33:03.4560875Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4560938Z mov.b64 {%r1222, %r1223}, %rd395; 2026-02-21T08:33:03.4561016Z cvt.rn.bf16x2.f32 %r1224, %r1223, %r1222; 2026-02-21T08:33:03.4561070Z $L__tmp166: 2026-02-21T08:33:03.4561292Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4561373Z cvt.u64.u32 %rd396, %r882; 2026-02-21T08:33:03.4561443Z cvt.u64.u32 %rd397, %r883; 2026-02-21T08:33:03.4561503Z shl.b64 %rd398, %rd397, 32; 2026-02-21T08:33:03.4561604Z or.b64 %rd399, %rd396, %rd398; 2026-02-21T08:33:03.4561666Z $L__tmp167: 2026-02-21T08:33:03.4561840Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4561901Z mov.b64 {%r1225, %r1226}, %rd399; 2026-02-21T08:33:03.4561972Z cvt.rn.bf16x2.f32 %r1227, %r1226, %r1225; 2026-02-21T08:33:03.4562036Z $L__tmp168: 2026-02-21T08:33:03.4562258Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4562320Z cvt.u64.u32 %rd400, %r884; 2026-02-21T08:33:03.4562388Z cvt.u64.u32 %rd401, %r885; 2026-02-21T08:33:03.4562449Z shl.b64 %rd402, %rd401, 32; 2026-02-21T08:33:03.4562511Z or.b64 %rd403, %rd400, %rd402; 2026-02-21T08:33:03.4562571Z $L__tmp169: 2026-02-21T08:33:03.4562750Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4562814Z mov.b64 {%r1228, %r1229}, %rd403; 2026-02-21T08:33:03.4562886Z cvt.rn.bf16x2.f32 %r1230, %r1229, %r1228; 2026-02-21T08:33:03.4562947Z $L__tmp170: 2026-02-21T08:33:03.4563168Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4563229Z cvt.u64.u32 
%rd404, %r886; 2026-02-21T08:33:03.4563296Z cvt.u64.u32 %rd405, %r887; 2026-02-21T08:33:03.4563357Z shl.b64 %rd406, %rd405, 32; 2026-02-21T08:33:03.4563422Z or.b64 %rd407, %rd404, %rd406; 2026-02-21T08:33:03.4563483Z $L__tmp171: 2026-02-21T08:33:03.4563659Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4563723Z mov.b64 {%r1231, %r1232}, %rd407; 2026-02-21T08:33:03.4563793Z cvt.rn.bf16x2.f32 %r1233, %r1232, %r1231; 2026-02-21T08:33:03.4563858Z $L__tmp172: 2026-02-21T08:33:03.4564078Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4564139Z cvt.u64.u32 %rd408, %r888; 2026-02-21T08:33:03.4564207Z cvt.u64.u32 %rd409, %r889; 2026-02-21T08:33:03.4564268Z shl.b64 %rd410, %rd409, 32; 2026-02-21T08:33:03.4564329Z or.b64 %rd411, %rd408, %rd410; 2026-02-21T08:33:03.4564382Z $L__tmp173: 2026-02-21T08:33:03.4564564Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4564626Z mov.b64 {%r1234, %r1235}, %rd411; 2026-02-21T08:33:03.4564725Z cvt.rn.bf16x2.f32 %r1236, %r1235, %r1234; 2026-02-21T08:33:03.4564787Z $L__tmp174: 2026-02-21T08:33:03.4565011Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4565071Z cvt.u64.u32 %rd412, %r891; 2026-02-21T08:33:03.4565140Z cvt.u64.u32 %rd413, %r892; 2026-02-21T08:33:03.4565228Z shl.b64 %rd414, %rd413, 32; 2026-02-21T08:33:03.4565290Z or.b64 %rd415, %rd412, %rd414; 2026-02-21T08:33:03.4565343Z $L__tmp175: 2026-02-21T08:33:03.4565554Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4565618Z mov.b64 {%r1237, %r1238}, %rd415; 2026-02-21T08:33:03.4565688Z cvt.rn.bf16x2.f32 %r1239, %r1238, %r1237; 2026-02-21T08:33:03.4565751Z $L__tmp176: 2026-02-21T08:33:03.4565975Z .loc 2 291 36 // standard.py:291:36 @[ 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4566036Z cvt.u64.u32 %rd416, %r893; 2026-02-21T08:33:03.4566104Z cvt.u64.u32 %rd417, %r894; 2026-02-21T08:33:03.4566164Z shl.b64 %rd418, %rd417, 32; 2026-02-21T08:33:03.4566226Z or.b64 %rd419, %rd416, %rd418; 2026-02-21T08:33:03.4566280Z $L__tmp177: 2026-02-21T08:33:03.4566490Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4566554Z mov.b64 {%r1240, %r1241}, %rd419; 2026-02-21T08:33:03.4566623Z cvt.rn.bf16x2.f32 %r1242, %r1241, %r1240; 2026-02-21T08:33:03.4566685Z $L__tmp178: 2026-02-21T08:33:03.4566905Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4566964Z cvt.u64.u32 %rd420, %r895; 2026-02-21T08:33:03.4567030Z cvt.u64.u32 %rd421, %r896; 2026-02-21T08:33:03.4567090Z shl.b64 %rd422, %rd421, 32; 2026-02-21T08:33:03.4567151Z or.b64 %rd423, %rd420, %rd422; 2026-02-21T08:33:03.4567203Z $L__tmp179: 2026-02-21T08:33:03.4567390Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4567462Z mov.b64 {%r1243, %r1244}, %rd423; 2026-02-21T08:33:03.4567528Z cvt.rn.bf16x2.f32 %r1245, %r1244, %r1243; 2026-02-21T08:33:03.4567585Z $L__tmp180: 2026-02-21T08:33:03.4567796Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4567855Z cvt.u64.u32 %rd424, %r897; 2026-02-21T08:33:03.4567917Z cvt.u64.u32 %rd425, %r898; 2026-02-21T08:33:03.4567976Z shl.b64 %rd426, %rd425, 32; 2026-02-21T08:33:03.4568034Z or.b64 %rd427, %rd424, %rd426; 2026-02-21T08:33:03.4568084Z $L__tmp181: 2026-02-21T08:33:03.4568258Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4568317Z mov.b64 {%r1246, %r1247}, %rd427; 2026-02-21T08:33:03.4568383Z cvt.rn.bf16x2.f32 %r1248, %r1247, %r1246; 2026-02-21T08:33:03.4568443Z $L__tmp182: 
2026-02-21T08:33:03.4568653Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4568710Z cvt.u64.u32 %rd428, %r899; 2026-02-21T08:33:03.4568768Z cvt.u64.u32 %rd429, %r900; 2026-02-21T08:33:03.4568834Z shl.b64 %rd430, %rd429, 32; 2026-02-21T08:33:03.4568893Z or.b64 %rd431, %rd428, %rd430; 2026-02-21T08:33:03.4568945Z $L__tmp183: 2026-02-21T08:33:03.4569124Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4569184Z mov.b64 {%r1249, %r1250}, %rd431; 2026-02-21T08:33:03.4569251Z cvt.rn.bf16x2.f32 %r1251, %r1250, %r1249; 2026-02-21T08:33:03.4569308Z $L__tmp184: 2026-02-21T08:33:03.4569520Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4569578Z cvt.u64.u32 %rd432, %r901; 2026-02-21T08:33:03.4569635Z cvt.u64.u32 %rd433, %r902; 2026-02-21T08:33:03.4569721Z shl.b64 %rd434, %rd433, 32; 2026-02-21T08:33:03.4569779Z or.b64 %rd435, %rd432, %rd434; 2026-02-21T08:33:03.4569831Z $L__tmp185: 2026-02-21T08:33:03.4570004Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4570065Z mov.b64 {%r1252, %r1253}, %rd435; 2026-02-21T08:33:03.4570156Z cvt.rn.bf16x2.f32 %r1254, %r1253, %r1252; 2026-02-21T08:33:03.4570213Z $L__tmp186: 2026-02-21T08:33:03.4570444Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4570502Z cvt.u64.u32 %rd436, %r903; 2026-02-21T08:33:03.4570561Z cvt.u64.u32 %rd437, %r904; 2026-02-21T08:33:03.4570628Z shl.b64 %rd438, %rd437, 32; 2026-02-21T08:33:03.4570687Z or.b64 %rd439, %rd436, %rd438; 2026-02-21T08:33:03.4570739Z $L__tmp187: 2026-02-21T08:33:03.4570915Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4570977Z mov.b64 {%r1255, %r1256}, %rd439; 2026-02-21T08:33:03.4571044Z cvt.rn.bf16x2.f32 %r1257, 
%r1256, %r1255; 2026-02-21T08:33:03.4571102Z $L__tmp188: 2026-02-21T08:33:03.4571318Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4571404Z cvt.u64.u32 %rd440, %r905; 2026-02-21T08:33:03.4571465Z cvt.u64.u32 %rd441, %r906; 2026-02-21T08:33:03.4571531Z shl.b64 %rd442, %rd441, 32; 2026-02-21T08:33:03.4571621Z or.b64 %rd443, %rd440, %rd442; 2026-02-21T08:33:03.4571673Z $L__tmp189: 2026-02-21T08:33:03.4571846Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4571907Z mov.b64 {%r1258, %r1259}, %rd443; 2026-02-21T08:33:03.4571973Z cvt.rn.bf16x2.f32 %r1260, %r1259, %r1258; 2026-02-21T08:33:03.4572025Z $L__tmp190: 2026-02-21T08:33:03.4572241Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4572300Z cvt.u64.u32 %rd444, %r908; 2026-02-21T08:33:03.4572357Z cvt.u64.u32 %rd445, %r909; 2026-02-21T08:33:03.4572424Z shl.b64 %rd446, %rd445, 32; 2026-02-21T08:33:03.4572483Z or.b64 %rd447, %rd444, %rd446; 2026-02-21T08:33:03.4572533Z $L__tmp191: 2026-02-21T08:33:03.4572709Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4572772Z mov.b64 {%r1261, %r1262}, %rd447; 2026-02-21T08:33:03.4572843Z cvt.rn.bf16x2.f32 %r1263, %r1262, %r1261; 2026-02-21T08:33:03.4572894Z $L__tmp192: 2026-02-21T08:33:03.4573109Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4573166Z cvt.u64.u32 %rd448, %r910; 2026-02-21T08:33:03.4573223Z cvt.u64.u32 %rd449, %r911; 2026-02-21T08:33:03.4573288Z shl.b64 %rd450, %rd449, 32; 2026-02-21T08:33:03.4573349Z or.b64 %rd451, %rd448, %rd450; 2026-02-21T08:33:03.4573400Z $L__tmp193: 2026-02-21T08:33:03.4573574Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4573632Z mov.b64 {%r1264, %r1265}, %rd451; 
2026-02-21T08:33:03.4573699Z cvt.rn.bf16x2.f32 %r1266, %r1265, %r1264; 2026-02-21T08:33:03.4573753Z $L__tmp194: 2026-02-21T08:33:03.4573973Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4574032Z cvt.u64.u32 %rd452, %r912; 2026-02-21T08:33:03.4574090Z cvt.u64.u32 %rd453, %r913; 2026-02-21T08:33:03.4574156Z shl.b64 %rd454, %rd453, 32; 2026-02-21T08:33:03.4574216Z or.b64 %rd455, %rd452, %rd454; 2026-02-21T08:33:03.4574268Z $L__tmp195: 2026-02-21T08:33:03.4574447Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4574506Z mov.b64 {%r1267, %r1268}, %rd455; 2026-02-21T08:33:03.4574599Z cvt.rn.bf16x2.f32 %r1269, %r1268, %r1267; 2026-02-21T08:33:03.4574651Z $L__tmp196: 2026-02-21T08:33:03.4574866Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4574923Z cvt.u64.u32 %rd456, %r914; 2026-02-21T08:33:03.4574980Z cvt.u64.u32 %rd457, %r915; 2026-02-21T08:33:03.4575087Z shl.b64 %rd458, %rd457, 32; 2026-02-21T08:33:03.4575147Z or.b64 %rd459, %rd456, %rd458; 2026-02-21T08:33:03.4575198Z $L__tmp197: 2026-02-21T08:33:03.4575392Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4575459Z mov.b64 {%r1270, %r1271}, %rd459; 2026-02-21T08:33:03.4575526Z cvt.rn.bf16x2.f32 %r1272, %r1271, %r1270; 2026-02-21T08:33:03.4575577Z $L__tmp198: 2026-02-21T08:33:03.4575794Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4575854Z cvt.u64.u32 %rd460, %r916; 2026-02-21T08:33:03.4575912Z cvt.u64.u32 %rd461, %r917; 2026-02-21T08:33:03.4575978Z shl.b64 %rd462, %rd461, 32; 2026-02-21T08:33:03.4576037Z or.b64 %rd463, %rd460, %rd462; 2026-02-21T08:33:03.4576089Z $L__tmp199: 2026-02-21T08:33:03.4576282Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 
2026-02-21T08:33:03.4576351Z mov.b64 {%r1273, %r1274}, %rd463; 2026-02-21T08:33:03.4576418Z cvt.rn.bf16x2.f32 %r1275, %r1274, %r1273; 2026-02-21T08:33:03.4576470Z $L__tmp200: 2026-02-21T08:33:03.4576688Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4576745Z cvt.u64.u32 %rd464, %r918; 2026-02-21T08:33:03.4576801Z cvt.u64.u32 %rd465, %r919; 2026-02-21T08:33:03.4576866Z shl.b64 %rd466, %rd465, 32; 2026-02-21T08:33:03.4576923Z or.b64 %rd467, %rd464, %rd466; 2026-02-21T08:33:03.4576986Z $L__tmp201: 2026-02-21T08:33:03.4577193Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4577260Z mov.b64 {%r1276, %r1277}, %rd467; 2026-02-21T08:33:03.4577329Z cvt.rn.bf16x2.f32 %r1278, %r1277, %r1276; 2026-02-21T08:33:03.4577380Z $L__tmp202: 2026-02-21T08:33:03.4577594Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4577653Z cvt.u64.u32 %rd468, %r920; 2026-02-21T08:33:03.4577712Z cvt.u64.u32 %rd469, %r921; 2026-02-21T08:33:03.4577778Z shl.b64 %rd470, %rd469, 32; 2026-02-21T08:33:03.4577835Z or.b64 %rd471, %rd468, %rd470; 2026-02-21T08:33:03.4577887Z $L__tmp203: 2026-02-21T08:33:03.4578055Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4578122Z mov.b64 {%r1279, %r1280}, %rd471; 2026-02-21T08:33:03.4578189Z cvt.rn.bf16x2.f32 %r1281, %r1280, %r1279; 2026-02-21T08:33:03.4578243Z $L__tmp204: 2026-02-21T08:33:03.4578460Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4578519Z cvt.u64.u32 %rd472, %r922; 2026-02-21T08:33:03.4578576Z cvt.u64.u32 %rd473, %r923; 2026-02-21T08:33:03.4578634Z shl.b64 %rd474, %rd473, 32; 2026-02-21T08:33:03.4578702Z or.b64 %rd475, %rd472, %rd474; 2026-02-21T08:33:03.4578755Z $L__tmp205: 2026-02-21T08:33:03.4578924Z .loc 1 97 28 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4578993Z mov.b64 {%r1282, %r1283}, %rd475; 2026-02-21T08:33:03.4579059Z cvt.rn.bf16x2.f32 %r1284, %r1283, %r1282; 2026-02-21T08:33:03.4579110Z $L__tmp206: 2026-02-21T08:33:03.4579327Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4579384Z cvt.u64.u32 %rd476, %r925; 2026-02-21T08:33:03.4579442Z cvt.u64.u32 %rd477, %r926; 2026-02-21T08:33:03.4579526Z shl.b64 %rd478, %rd477, 32; 2026-02-21T08:33:03.4579591Z or.b64 %rd479, %rd476, %rd478; 2026-02-21T08:33:03.4579642Z $L__tmp207: 2026-02-21T08:33:03.4579809Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4579879Z mov.b64 {%r1285, %r1286}, %rd479; 2026-02-21T08:33:03.4579969Z cvt.rn.bf16x2.f32 %r1287, %r1286, %r1285; 2026-02-21T08:33:03.4580021Z $L__tmp208: 2026-02-21T08:33:03.4580255Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4580314Z cvt.u64.u32 %rd480, %r927; 2026-02-21T08:33:03.4580372Z cvt.u64.u32 %rd481, %r928; 2026-02-21T08:33:03.4580430Z shl.b64 %rd482, %rd481, 32; 2026-02-21T08:33:03.4580496Z or.b64 %rd483, %rd480, %rd482; 2026-02-21T08:33:03.4580546Z $L__tmp209: 2026-02-21T08:33:03.4580715Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4580782Z mov.b64 {%r1288, %r1289}, %rd483; 2026-02-21T08:33:03.4580850Z cvt.rn.bf16x2.f32 %r1290, %r1289, %r1288; 2026-02-21T08:33:03.4580901Z $L__tmp210: 2026-02-21T08:33:03.4581119Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4581199Z cvt.u64.u32 %rd484, %r929; 2026-02-21T08:33:03.4581257Z cvt.u64.u32 %rd485, %r930; 2026-02-21T08:33:03.4581314Z shl.b64 %rd486, %rd485, 32; 2026-02-21T08:33:03.4581382Z or.b64 %rd487, %rd484, %rd486; 2026-02-21T08:33:03.4581440Z 
$L__tmp211: 2026-02-21T08:33:03.4581644Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4581712Z mov.b64 {%r1291, %r1292}, %rd487; 2026-02-21T08:33:03.4581779Z cvt.rn.bf16x2.f32 %r1293, %r1292, %r1291; 2026-02-21T08:33:03.4581830Z $L__tmp212: 2026-02-21T08:33:03.4582047Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4582109Z cvt.u64.u32 %rd488, %r931; 2026-02-21T08:33:03.4582164Z cvt.u64.u32 %rd489, %r932; 2026-02-21T08:33:03.4582222Z shl.b64 %rd490, %rd489, 32; 2026-02-21T08:33:03.4582287Z or.b64 %rd491, %rd488, %rd490; 2026-02-21T08:33:03.4582337Z $L__tmp213: 2026-02-21T08:33:03.4582509Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4582576Z mov.b64 {%r1294, %r1295}, %rd491; 2026-02-21T08:33:03.4582644Z cvt.rn.bf16x2.f32 %r1296, %r1295, %r1294; 2026-02-21T08:33:03.4582696Z $L__tmp214: 2026-02-21T08:33:03.4582904Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4582969Z cvt.u64.u32 %rd492, %r933; 2026-02-21T08:33:03.4583025Z cvt.u64.u32 %rd493, %r934; 2026-02-21T08:33:03.4583083Z shl.b64 %rd494, %rd493, 32; 2026-02-21T08:33:03.4583149Z or.b64 %rd495, %rd492, %rd494; 2026-02-21T08:33:03.4583199Z $L__tmp215: 2026-02-21T08:33:03.4583366Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4583434Z mov.b64 {%r1297, %r1298}, %rd495; 2026-02-21T08:33:03.4583500Z cvt.rn.bf16x2.f32 %r1299, %r1298, %r1297; 2026-02-21T08:33:03.4583551Z $L__tmp216: 2026-02-21T08:33:03.4583762Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4583827Z cvt.u64.u32 %rd496, %r935; 2026-02-21T08:33:03.4583884Z cvt.u64.u32 %rd497, %r936; 2026-02-21T08:33:03.4583942Z shl.b64 %rd498, %rd497, 32; 2026-02-21T08:33:03.4584006Z or.b64 
%rd499, %rd496, %rd498; 2026-02-21T08:33:03.4584056Z $L__tmp217: 2026-02-21T08:33:03.4584225Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4584290Z mov.b64 {%r1300, %r1301}, %rd499; 2026-02-21T08:33:03.4584389Z cvt.rn.bf16x2.f32 %r1302, %r1301, %r1300; 2026-02-21T08:33:03.4584441Z $L__tmp218: 2026-02-21T08:33:03.4584653Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4584718Z cvt.u64.u32 %rd500, %r937; 2026-02-21T08:33:03.4584776Z cvt.u64.u32 %rd501, %r938; 2026-02-21T08:33:03.4584861Z shl.b64 %rd502, %rd501, 32; 2026-02-21T08:33:03.4584926Z or.b64 %rd503, %rd500, %rd502; 2026-02-21T08:33:03.4584977Z $L__tmp219: 2026-02-21T08:33:03.4585171Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4585237Z mov.b64 {%r1303, %r1304}, %rd503; 2026-02-21T08:33:03.4585304Z cvt.rn.bf16x2.f32 %r1305, %r1304, %r1303; 2026-02-21T08:33:03.4585355Z $L__tmp220: 2026-02-21T08:33:03.4585566Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4585645Z cvt.u64.u32 %rd504, %r939; 2026-02-21T08:33:03.4585701Z cvt.u64.u32 %rd505, %r940; 2026-02-21T08:33:03.4585759Z shl.b64 %rd506, %rd505, 32; 2026-02-21T08:33:03.4585823Z or.b64 %rd507, %rd504, %rd506; 2026-02-21T08:33:03.4585874Z $L__tmp221: 2026-02-21T08:33:03.4586066Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4586128Z mov.b64 {%r1306, %r1307}, %rd507; 2026-02-21T08:33:03.4586201Z cvt.rn.bf16x2.f32 %r1308, %r1307, %r1306; 2026-02-21T08:33:03.4586254Z $L__tmp222: 2026-02-21T08:33:03.4586467Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4586532Z cvt.u64.u32 %rd508, %r942; 2026-02-21T08:33:03.4586589Z cvt.u64.u32 %rd509, %r943; 2026-02-21T08:33:03.4586646Z shl.b64 
%rd510, %rd509, 32; 2026-02-21T08:33:03.4586711Z or.b64 %rd511, %rd508, %rd510; 2026-02-21T08:33:03.4586763Z $L__tmp223: 2026-02-21T08:33:03.4586928Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4586988Z mov.b64 {%r1309, %r1310}, %rd511; 2026-02-21T08:33:03.4587060Z cvt.rn.bf16x2.f32 %r1311, %r1310, %r1309; 2026-02-21T08:33:03.4587112Z $L__tmp224: 2026-02-21T08:33:03.4587320Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4587386Z cvt.u64.u32 %rd512, %r944; 2026-02-21T08:33:03.4587445Z cvt.u64.u32 %rd513, %r945; 2026-02-21T08:33:03.4587503Z shl.b64 %rd514, %rd513, 32; 2026-02-21T08:33:03.4587569Z or.b64 %rd515, %rd512, %rd514; 2026-02-21T08:33:03.4587621Z $L__tmp225: 2026-02-21T08:33:03.4587789Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4587848Z mov.b64 {%r1312, %r1313}, %rd515; 2026-02-21T08:33:03.4587923Z cvt.rn.bf16x2.f32 %r1314, %r1313, %r1312; 2026-02-21T08:33:03.4587976Z $L__tmp226: 2026-02-21T08:33:03.4588186Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4588252Z cvt.u64.u32 %rd516, %r946; 2026-02-21T08:33:03.4588309Z cvt.u64.u32 %rd517, %r947; 2026-02-21T08:33:03.4588368Z shl.b64 %rd518, %rd517, 32; 2026-02-21T08:33:03.4588432Z or.b64 %rd519, %rd516, %rd518; 2026-02-21T08:33:03.4588485Z $L__tmp227: 2026-02-21T08:33:03.4588652Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4588711Z mov.b64 {%r1315, %r1316}, %rd519; 2026-02-21T08:33:03.4588785Z cvt.rn.bf16x2.f32 %r1317, %r1316, %r1315; 2026-02-21T08:33:03.4588835Z $L__tmp228: 2026-02-21T08:33:03.4589042Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4589106Z cvt.u64.u32 %rd520, %r948; 2026-02-21T08:33:03.4589162Z cvt.u64.u32 
%rd521, %r949; 2026-02-21T08:33:03.4589242Z shl.b64 %rd522, %rd521, 32; 2026-02-21T08:33:03.4589300Z or.b64 %rd523, %rd520, %rd522; 2026-02-21T08:33:03.4589360Z $L__tmp229: 2026-02-21T08:33:03.4589526Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4589586Z mov.b64 {%r1318, %r1319}, %rd523; 2026-02-21T08:33:03.4589693Z cvt.rn.bf16x2.f32 %r1320, %r1319, %r1318; 2026-02-21T08:33:03.4589745Z $L__tmp230: 2026-02-21T08:33:03.4589974Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4590040Z cvt.u64.u32 %rd524, %r950; 2026-02-21T08:33:03.4590098Z cvt.u64.u32 %rd525, %r951; 2026-02-21T08:33:03.4590158Z shl.b64 %rd526, %rd525, 32; 2026-02-21T08:33:03.4590216Z or.b64 %rd527, %rd524, %rd526; 2026-02-21T08:33:03.4590275Z $L__tmp231: 2026-02-21T08:33:03.4590442Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4590502Z mov.b64 {%r1321, %r1322}, %rd527; 2026-02-21T08:33:03.4590577Z cvt.rn.bf16x2.f32 %r1323, %r1322, %r1321; 2026-02-21T08:33:03.4590628Z $L__tmp232: 2026-02-21T08:33:03.4590859Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4590926Z cvt.u64.u32 %rd528, %r952; 2026-02-21T08:33:03.4590984Z cvt.u64.u32 %rd529, %r953; 2026-02-21T08:33:03.4591043Z shl.b64 %rd530, %rd529, 32; 2026-02-21T08:33:03.4591101Z or.b64 %rd531, %rd528, %rd530; 2026-02-21T08:33:03.4591159Z $L__tmp233: 2026-02-21T08:33:03.4591324Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4591384Z mov.b64 {%r1324, %r1325}, %rd531; 2026-02-21T08:33:03.4591459Z cvt.rn.bf16x2.f32 %r1326, %r1325, %r1324; 2026-02-21T08:33:03.4591509Z $L__tmp234: 2026-02-21T08:33:03.4591747Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4591816Z cvt.u64.u32 
%rd532, %r954; 2026-02-21T08:33:03.4591873Z cvt.u64.u32 %rd533, %r955; 2026-02-21T08:33:03.4591931Z shl.b64 %rd534, %rd533, 32; 2026-02-21T08:33:03.4591989Z or.b64 %rd535, %rd532, %rd534; 2026-02-21T08:33:03.4592050Z $L__tmp235: 2026-02-21T08:33:03.4592218Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4592277Z mov.b64 {%r1327, %r1328}, %rd535; 2026-02-21T08:33:03.4592351Z cvt.rn.bf16x2.f32 %r1329, %r1328, %r1327; 2026-02-21T08:33:03.4592403Z $L__tmp236: 2026-02-21T08:33:03.4592613Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4592679Z cvt.u64.u32 %rd536, %r956; 2026-02-21T08:33:03.4592736Z cvt.u64.u32 %rd537, %r957; 2026-02-21T08:33:03.4592793Z shl.b64 %rd538, %rd537, 32; 2026-02-21T08:33:03.4592853Z or.b64 %rd539, %rd536, %rd538; 2026-02-21T08:33:03.4592912Z $L__tmp237: 2026-02-21T08:33:03.4593080Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4593138Z mov.b64 {%r1330, %r1331}, %rd539; 2026-02-21T08:33:03.4593211Z cvt.rn.bf16x2.f32 %r1332, %r1331, %r1330; 2026-02-21T08:33:03.4593263Z $L__tmp238: 2026-02-21T08:33:03.4593473Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4593533Z cvt.u64.u32 %rd540, %r959; 2026-02-21T08:33:03.4593596Z cvt.u64.u32 %rd541, %r960; 2026-02-21T08:33:03.4593654Z shl.b64 %rd542, %rd541, 32; 2026-02-21T08:33:03.4593713Z or.b64 %rd543, %rd540, %rd542; 2026-02-21T08:33:03.4593772Z $L__tmp239: 2026-02-21T08:33:03.4593940Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4593998Z mov.b64 {%r1333, %r1334}, %rd543; 2026-02-21T08:33:03.4594115Z cvt.rn.bf16x2.f32 %r1335, %r1334, %r1333; 2026-02-21T08:33:03.4594166Z $L__tmp240: 2026-02-21T08:33:03.4594373Z .loc 2 291 36 // standard.py:291:36 @[ 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4594430Z cvt.u64.u32 %rd544, %r961; 2026-02-21T08:33:03.4594493Z cvt.u64.u32 %rd545, %r962; 2026-02-21T08:33:03.4594578Z shl.b64 %rd546, %rd545, 32; 2026-02-21T08:33:03.4594635Z or.b64 %rd547, %rd544, %rd546; 2026-02-21T08:33:03.4594692Z $L__tmp241: 2026-02-21T08:33:03.4594882Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4594944Z mov.b64 {%r1336, %r1337}, %rd547; 2026-02-21T08:33:03.4595017Z cvt.rn.bf16x2.f32 %r1338, %r1337, %r1336; 2026-02-21T08:33:03.4595069Z $L__tmp242: 2026-02-21T08:33:03.4595280Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4595340Z cvt.u64.u32 %rd548, %r963; 2026-02-21T08:33:03.4595408Z cvt.u64.u32 %rd549, %r964; 2026-02-21T08:33:03.4595467Z shl.b64 %rd550, %rd549, 32; 2026-02-21T08:33:03.4595527Z or.b64 %rd551, %rd548, %rd550; 2026-02-21T08:33:03.4595587Z $L__tmp243: 2026-02-21T08:33:03.4595781Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4595843Z mov.b64 {%r1339, %r1340}, %rd551; 2026-02-21T08:33:03.4595911Z cvt.rn.bf16x2.f32 %r1341, %r1340, %r1339; 2026-02-21T08:33:03.4595970Z $L__tmp244: 2026-02-21T08:33:03.4596177Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4596236Z cvt.u64.u32 %rd552, %r965; 2026-02-21T08:33:03.4596302Z cvt.u64.u32 %rd553, %r966; 2026-02-21T08:33:03.4596360Z shl.b64 %rd554, %rd553, 32; 2026-02-21T08:33:03.4596417Z or.b64 %rd555, %rd552, %rd554; 2026-02-21T08:33:03.4596476Z $L__tmp245: 2026-02-21T08:33:03.4596642Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4596701Z mov.b64 {%r1342, %r1343}, %rd555; 2026-02-21T08:33:03.4596768Z cvt.rn.bf16x2.f32 %r1344, %r1343, %r1342; 2026-02-21T08:33:03.4596828Z $L__tmp246: 
2026-02-21T08:33:03.4597033Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4597093Z cvt.u64.u32 %rd556, %r967; 2026-02-21T08:33:03.4597159Z cvt.u64.u32 %rd557, %r968; 2026-02-21T08:33:03.4597218Z shl.b64 %rd558, %rd557, 32; 2026-02-21T08:33:03.4597275Z or.b64 %rd559, %rd556, %rd558; 2026-02-21T08:33:03.4597332Z $L__tmp247: 2026-02-21T08:33:03.4597503Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4597562Z mov.b64 {%r1345, %r1346}, %rd559; 2026-02-21T08:33:03.4597628Z cvt.rn.bf16x2.f32 %r1347, %r1346, %r1345; 2026-02-21T08:33:03.4597688Z $L__tmp248: 2026-02-21T08:33:03.4597898Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4597954Z cvt.u64.u32 %rd560, %r969; 2026-02-21T08:33:03.4598017Z cvt.u64.u32 %rd561, %r970; 2026-02-21T08:33:03.4598074Z shl.b64 %rd562, %rd561, 32; 2026-02-21T08:33:03.4598132Z or.b64 %rd563, %rd560, %rd562; 2026-02-21T08:33:03.4598183Z $L__tmp249: 2026-02-21T08:33:03.4598360Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4598417Z mov.b64 {%r1348, %r1349}, %rd563; 2026-02-21T08:33:03.4598483Z cvt.rn.bf16x2.f32 %r1350, %r1349, %r1348; 2026-02-21T08:33:03.4598543Z $L__tmp250: 2026-02-21T08:33:03.4598749Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4598806Z cvt.u64.u32 %rd564, %r971; 2026-02-21T08:33:03.4598871Z cvt.u64.u32 %rd565, %r972; 2026-02-21T08:33:03.4598951Z shl.b64 %rd566, %rd565, 32; 2026-02-21T08:33:03.4599009Z or.b64 %rd567, %rd564, %rd566; 2026-02-21T08:33:03.4599060Z $L__tmp251: 2026-02-21T08:33:03.4599233Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4599293Z mov.b64 {%r1351, %r1352}, %rd567; 2026-02-21T08:33:03.4599383Z cvt.rn.bf16x2.f32 %r1353, 
%r1352, %r1351; 2026-02-21T08:33:03.4599443Z $L__tmp252: 2026-02-21T08:33:03.4599671Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4599731Z cvt.u64.u32 %rd568, %r973; 2026-02-21T08:33:03.4599795Z cvt.u64.u32 %rd569, %r974; 2026-02-21T08:33:03.4599853Z shl.b64 %rd570, %rd569, 32; 2026-02-21T08:33:03.4599911Z or.b64 %rd571, %rd568, %rd570; 2026-02-21T08:33:03.4599962Z $L__tmp253: 2026-02-21T08:33:03.4600141Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4600201Z mov.b64 {%r1354, %r1355}, %rd571; 2026-02-21T08:33:03.4600268Z cvt.rn.bf16x2.f32 %r1356, %r1355, %r1354; 2026-02-21T08:33:03.4600326Z $L__tmp254: 2026-02-21T08:33:03.4600559Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4600619Z cvt.u64.u32 %rd572, %r976; 2026-02-21T08:33:03.4600684Z cvt.u64.u32 %rd573, %r977; 2026-02-21T08:33:03.4600741Z shl.b64 %rd574, %rd573, 32; 2026-02-21T08:33:03.4600800Z or.b64 %rd575, %rd572, %rd574; 2026-02-21T08:33:03.4600851Z $L__tmp255: 2026-02-21T08:33:03.4601028Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4601087Z mov.b64 {%r1357, %r1358}, %rd575; 2026-02-21T08:33:03.4601154Z cvt.rn.bf16x2.f32 %r1359, %r1358, %r1357; 2026-02-21T08:33:03.4601211Z $L__tmp256: 2026-02-21T08:33:03.4601422Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4601481Z cvt.u64.u32 %rd576, %r978; 2026-02-21T08:33:03.4601582Z cvt.u64.u32 %rd577, %r979; 2026-02-21T08:33:03.4601642Z shl.b64 %rd578, %rd577, 32; 2026-02-21T08:33:03.4601703Z or.b64 %rd579, %rd576, %rd578; 2026-02-21T08:33:03.4601757Z $L__tmp257: 2026-02-21T08:33:03.4601936Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4601995Z mov.b64 {%r1360, %r1361}, %rd579; 
2026-02-21T08:33:03.4602064Z cvt.rn.bf16x2.f32 %r1362, %r1361, %r1360; 2026-02-21T08:33:03.4602122Z $L__tmp258: 2026-02-21T08:33:03.4602335Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4602392Z cvt.u64.u32 %rd580, %r980; 2026-02-21T08:33:03.4602448Z cvt.u64.u32 %rd581, %r981; 2026-02-21T08:33:03.4602515Z shl.b64 %rd582, %rd581, 32; 2026-02-21T08:33:03.4602577Z or.b64 %rd583, %rd580, %rd582; 2026-02-21T08:33:03.4602627Z $L__tmp259: 2026-02-21T08:33:03.4602803Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4602863Z mov.b64 {%r1363, %r1364}, %rd583; 2026-02-21T08:33:03.4602931Z cvt.rn.bf16x2.f32 %r1365, %r1364, %r1363; 2026-02-21T08:33:03.4602991Z $L__tmp260: 2026-02-21T08:33:03.4603204Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4603262Z cvt.u64.u32 %rd584, %r982; 2026-02-21T08:33:03.4603321Z cvt.u64.u32 %rd585, %r983; 2026-02-21T08:33:03.4603388Z shl.b64 %rd586, %rd585, 32; 2026-02-21T08:33:03.4603448Z or.b64 %rd587, %rd584, %rd586; 2026-02-21T08:33:03.4603502Z $L__tmp261: 2026-02-21T08:33:03.4603687Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4603748Z mov.b64 {%r1366, %r1367}, %rd587; 2026-02-21T08:33:03.4603846Z cvt.rn.bf16x2.f32 %r1368, %r1367, %r1366; 2026-02-21T08:33:03.4603906Z $L__tmp262: 2026-02-21T08:33:03.4604124Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4604185Z cvt.u64.u32 %rd588, %r984; 2026-02-21T08:33:03.4604247Z cvt.u64.u32 %rd589, %r985; 2026-02-21T08:33:03.4604342Z shl.b64 %rd590, %rd589, 32; 2026-02-21T08:33:03.4604405Z or.b64 %rd591, %rd588, %rd590; 2026-02-21T08:33:03.4604458Z $L__tmp263: 2026-02-21T08:33:03.4604672Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 
2026-02-21T08:33:03.4604736Z mov.b64 {%r1369, %r1370}, %rd591; 2026-02-21T08:33:03.4604805Z cvt.rn.bf16x2.f32 %r1371, %r1370, %r1369; 2026-02-21T08:33:03.4604866Z $L__tmp264: 2026-02-21T08:33:03.4605089Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4605151Z cvt.u64.u32 %rd592, %r986; 2026-02-21T08:33:03.4605211Z cvt.u64.u32 %rd593, %r987; 2026-02-21T08:33:03.4605279Z shl.b64 %rd594, %rd593, 32; 2026-02-21T08:33:03.4605338Z or.b64 %rd595, %rd592, %rd594; 2026-02-21T08:33:03.4605392Z $L__tmp265: 2026-02-21T08:33:03.4605609Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4605673Z mov.b64 {%r1372, %r1373}, %rd595; 2026-02-21T08:33:03.4605746Z cvt.rn.bf16x2.f32 %r1374, %r1373, %r1372; 2026-02-21T08:33:03.4605801Z $L__tmp266: 2026-02-21T08:33:03.4606028Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4606088Z cvt.u64.u32 %rd596, %r988; 2026-02-21T08:33:03.4606147Z cvt.u64.u32 %rd597, %r989; 2026-02-21T08:33:03.4606217Z shl.b64 %rd598, %rd597, 32; 2026-02-21T08:33:03.4606279Z or.b64 %rd599, %rd596, %rd598; 2026-02-21T08:33:03.4606331Z $L__tmp267: 2026-02-21T08:33:03.4606514Z .loc 1 97 28 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4606576Z mov.b64 {%r1375, %r1376}, %rd599; 2026-02-21T08:33:03.4606647Z cvt.rn.bf16x2.f32 %r1377, %r1376, %r1375; 2026-02-21T08:33:03.4606700Z $L__tmp268: 2026-02-21T08:33:03.4606927Z .loc 2 291 36 // standard.py:291:36 @[ c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:94:40 ] 2026-02-21T08:33:03.4606989Z cvt.u64.u32 %rd600, %r990; 2026-02-21T08:33:03.4607050Z cvt.u64.u32 %rd601, %r991; 2026-02-21T08:33:03.4607118Z shl.b64 %rd602, %rd601, 32; 2026-02-21T08:33:03.4607179Z or.b64 %rd603, %rd600, %rd602; 2026-02-21T08:33:03.4607232Z $L__tmp269: 2026-02-21T08:33:03.4607414Z .loc 1 97 28 // 
c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:97:28 2026-02-21T08:33:03.4607476Z mov.b64 {%r1378, %r1379}, %rd603; 2026-02-21T08:33:03.4607545Z cvt.rn.bf16x2.f32 %r1380, %r1379, %r1378; 2026-02-21T08:33:03.4607718Z .loc 1 98 43 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:98:43 2026-02-21T08:33:03.4607801Z cp.async.bulk.wait_group.read 0; 2026-02-21T08:33:03.4607860Z bar.sync 0; 2026-02-21T08:33:03.4607961Z st.shared.v4.b32 [%r28], {%r999, %r1002, %r1005, %r1008}; 2026-02-21T08:33:03.4608082Z st.shared.v4.b32 [%r28+16384], {%r1095, %r1098, %r1101, %r1104}; 2026-02-21T08:33:03.4608191Z st.shared.v4.b32 [%r28+32768], {%r1191, %r1194, %r1197, %r1200}; 2026-02-21T08:33:03.4608297Z st.shared.v4.b32 [%r28+49152], {%r1287, %r1290, %r1293, %r1296}; 2026-02-21T08:33:03.4608403Z st.shared.v4.b32 [%r29], {%r1011, %r1014, %r1017, %r1020}; 2026-02-21T08:33:03.4608506Z st.shared.v4.b32 [%r29+16384], {%r1107, %r1110, %r1113, %r1116}; 2026-02-21T08:33:03.4608607Z st.shared.v4.b32 [%r29+32768], {%r1203, %r1206, %r1209, %r1212}; 2026-02-21T08:33:03.4608707Z st.shared.v4.b32 [%r29+49152], {%r1299, %r1302, %r1305, %r1308}; 2026-02-21T08:33:03.4608811Z st.shared.v4.b32 [%r30], {%r1023, %r1026, %r1029, %r1032}; 2026-02-21T08:33:03.4608930Z st.shared.v4.b32 [%r30+16384], {%r1119, %r1122, %r1125, %r1128}; 2026-02-21T08:33:03.4609031Z st.shared.v4.b32 [%r30+32768], {%r1215, %r1218, %r1221, %r1224}; 2026-02-21T08:33:03.4609139Z st.shared.v4.b32 [%r30+49152], {%r1311, %r1314, %r1317, %r1320}; 2026-02-21T08:33:03.4609236Z st.shared.v4.b32 [%r31], {%r1035, %r1038, %r1041, %r1044}; 2026-02-21T08:33:03.4609358Z st.shared.v4.b32 [%r31+16384], {%r1131, %r1134, %r1137, %r1140}; 2026-02-21T08:33:03.4609489Z st.shared.v4.b32 [%r31+32768], {%r1227, %r1230, %r1233, %r1236}; 2026-02-21T08:33:03.4609591Z st.shared.v4.b32 [%r31+49152], {%r1323, %r1326, %r1329, %r1332}; 2026-02-21T08:33:03.4609684Z st.shared.v4.b32 [%r32], {%r1047, %r1050, %r1053, %r1056}; 
2026-02-21T08:33:03.4609793Z st.shared.v4.b32 [%r32+16384], {%r1143, %r1146, %r1149, %r1152}; 2026-02-21T08:33:03.4609893Z st.shared.v4.b32 [%r32+32768], {%r1239, %r1242, %r1245, %r1248}; 2026-02-21T08:33:03.4609995Z st.shared.v4.b32 [%r32+49152], {%r1335, %r1338, %r1341, %r1344}; 2026-02-21T08:33:03.4610088Z st.shared.v4.b32 [%r33], {%r1059, %r1062, %r1065, %r1068}; 2026-02-21T08:33:03.4610196Z st.shared.v4.b32 [%r33+16384], {%r1155, %r1158, %r1161, %r1164}; 2026-02-21T08:33:03.4610295Z st.shared.v4.b32 [%r33+32768], {%r1251, %r1254, %r1257, %r1260}; 2026-02-21T08:33:03.4610428Z st.shared.v4.b32 [%r33+49152], {%r1347, %r1350, %r1353, %r1356}; 2026-02-21T08:33:03.4610533Z st.shared.v4.b32 [%r34], {%r1071, %r1074, %r1077, %r1080}; 2026-02-21T08:33:03.4610635Z st.shared.v4.b32 [%r34+16384], {%r1167, %r1170, %r1173, %r1176}; 2026-02-21T08:33:03.4610736Z st.shared.v4.b32 [%r34+32768], {%r1263, %r1266, %r1269, %r1272}; 2026-02-21T08:33:03.4610853Z st.shared.v4.b32 [%r34+49152], {%r1359, %r1362, %r1365, %r1368}; 2026-02-21T08:33:03.4610942Z st.shared.v4.b32 [%r35], {%r1083, %r1086, %r1089, %r1092}; 2026-02-21T08:33:03.4611035Z st.shared.v4.b32 [%r35+16384], {%r1179, %r1182, %r1185, %r1188}; 2026-02-21T08:33:03.4611137Z st.shared.v4.b32 [%r35+32768], {%r1275, %r1278, %r1281, %r1284}; 2026-02-21T08:33:03.4611232Z st.shared.v4.b32 [%r35+49152], {%r1371, %r1374, %r1377, %r1380}; 2026-02-21T08:33:03.4611289Z // begin inline asm 2026-02-21T08:33:03.4611366Z fence.proxy.async.shared::cta; 2026-02-21T08:33:03.4611431Z // end inline asm 2026-02-21T08:33:03.4611487Z bar.sync 0; 2026-02-21T08:33:03.4611584Z elect.sync %r1381|%p104, -1; 2026-02-21T08:33:03.4611661Z and.pred %p102, %p103, %p104; 2026-02-21T08:33:03.4611720Z shl.b32 %r1382, %r39, 14; 2026-02-21T08:33:03.4611781Z add.s32 %r995, %r53, %r1382; 2026-02-21T08:33:03.4611839Z shl.b32 %r1383, %r39, 6; 2026-02-21T08:33:03.4611906Z or.b32 %r993, %r1383, %r37; 2026-02-21T08:33:03.4611962Z // begin inline asm 
2026-02-21T08:33:03.4612144Z @%p102 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd91, {%r993, %r994}], [%r995]; 2026-02-21T08:33:03.4612207Z // end inline asm 2026-02-21T08:33:03.4612273Z cp.async.bulk.commit_group; 2026-02-21T08:33:03.4612356Z $L__BB0_8: // %._crit_edge 2026-02-21T08:33:03.4612529Z .loc 1 31 74 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:31:74 2026-02-21T08:33:03.4612601Z cp.async.bulk.wait_group.read 0; 2026-02-21T08:33:03.4612654Z bar.sync 0; 2026-02-21T08:33:03.4612824Z .loc 1 31 4 // c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py:31:4 2026-02-21T08:33:03.4612885Z bar.sync 0; 2026-02-21T08:33:03.4612942Z // begin inline asm 2026-02-21T08:33:03.4613059Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r1384, 512; 2026-02-21T08:33:03.4613122Z // end inline asm 2026-02-21T08:33:03.4613172Z ret; 2026-02-21T08:33:03.4613226Z $L__tmp270: 2026-02-21T08:33:03.4613287Z $L__func_end0: 2026-02-21T08:33:03.4613370Z // -- End function 2026-02-21T08:33:03.4613420Z } 2026-02-21T08:33:03.4613624Z .file 1 "/tmp/torchinductor_root/5o/c5o4j7typmanyrei4hbirs7obiedyapdnotefzsnrpolorrerzao.py" 2026-02-21T08:33:03.4613833Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T08:33:03.4613895Z .section .debug_abbrev 2026-02-21T08:33:03.4613946Z { 2026-02-21T08:33:03.4614040Z .b8 1 // Abbreviation Code 2026-02-21T08:33:03.4614129Z .b8 17 // DW_TAG_compile_unit 2026-02-21T08:33:03.4614235Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:33:03.4614346Z .b8 37 // DW_AT_producer 2026-02-21T08:33:03.4614425Z .b8 8 // DW_FORM_string 2026-02-21T08:33:03.4614498Z .b8 19 // DW_AT_language 2026-02-21T08:33:03.4614574Z .b8 5 // DW_FORM_data2 2026-02-21T08:33:03.4614656Z .b8 3 // DW_AT_name 2026-02-21T08:33:03.4614728Z .b8 8 // DW_FORM_string 2026-02-21T08:33:03.4614806Z .b8 16 // DW_AT_stmt_list 2026-02-21T08:33:03.4614889Z .b8 6 // DW_FORM_data4 2026-02-21T08:33:03.4614964Z .b8 27 // DW_AT_comp_dir 
2026-02-21T08:33:03.4615035Z .b8 8 // DW_FORM_string 2026-02-21T08:33:03.4615136Z .b8 0 // EOM(1) 2026-02-21T08:33:03.4615207Z .b8 0 // EOM(2) 2026-02-21T08:33:03.4615288Z .b8 2 // Abbreviation Code 2026-02-21T08:33:03.4615368Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:33:03.4615448Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:33:03.4615520Z .b8 3 // DW_AT_name 2026-02-21T08:33:03.4615594Z .b8 8 // DW_FORM_string 2026-02-21T08:33:03.4615678Z .b8 32 // DW_AT_inline 2026-02-21T08:33:03.4615753Z .b8 11 // DW_FORM_data1 2026-02-21T08:33:03.4615821Z .b8 0 // EOM(1) 2026-02-21T08:33:03.4615893Z .b8 0 // EOM(2) 2026-02-21T08:33:03.4615971Z .b8 3 // Abbreviation Code 2026-02-21T08:33:03.4616052Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:33:03.4616136Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:33:03.4616212Z .b8 17 // DW_AT_low_pc 2026-02-21T08:33:03.4616284Z .b8 1 // DW_FORM_addr 2026-02-21T08:33:03.4616359Z .b8 18 // DW_AT_high_pc 2026-02-21T08:33:03.4616437Z .b8 1 // DW_FORM_addr 2026-02-21T08:33:03.4616520Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:33:03.4616597Z .b8 19 // DW_FORM_ref4 2026-02-21T08:33:03.4616672Z .b8 0 // EOM(1) 2026-02-21T08:33:03.4616738Z .b8 0 // EOM(2) 2026-02-21T08:33:03.4616815Z .b8 4 // Abbreviation Code 2026-02-21T08:33:03.4616912Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T08:33:03.4616986Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:33:03.4617070Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:33:03.4617142Z .b8 19 // DW_FORM_ref4 2026-02-21T08:33:03.4617220Z .b8 17 // DW_AT_low_pc 2026-02-21T08:33:03.4617290Z .b8 1 // DW_FORM_addr 2026-02-21T08:33:03.4617363Z .b8 18 // DW_AT_high_pc 2026-02-21T08:33:03.4617439Z .b8 1 // DW_FORM_addr 2026-02-21T08:33:03.4617514Z .b8 88 // DW_AT_call_file 2026-02-21T08:33:03.4617608Z .b8 11 // DW_FORM_data1 2026-02-21T08:33:03.4617689Z .b8 89 // DW_AT_call_line 2026-02-21T08:33:03.4617761Z .b8 11 // DW_FORM_data1 2026-02-21T08:33:03.4617839Z .b8 87 // DW_AT_call_column 2026-02-21T08:33:03.4617933Z .b8 11 // 
DW_FORM_data1 2026-02-21T08:33:03.4618009Z .b8 0 // EOM(1) 2026-02-21T08:33:03.4618096Z .b8 0 // EOM(2) 2026-02-21T08:33:03.4618163Z .b8 0 // EOM(3) 2026-02-21T08:33:03.4618223Z } 2026-02-21T08:33:03.4618284Z .section .debug_info 2026-02-21T08:33:03.4618336Z { 2026-02-21T08:33:03.4618424Z .b32 178 // Length of Unit 2026-02-21T08:33:03.4618506Z .b8 2 // DWARF version number 2026-02-21T08:33:03.4618557Z .b8 0 2026-02-21T08:33:03.4618674Z .b32 .debug_abbrev // Offset Into Abbrev. Section 2026-02-21T08:33:03.4618767Z .b8 8 // Address Size (in bytes) 2026-02-21T08:33:03.4618867Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T08:33:03.4618970Z .b8 116 // DW_AT_producer 2026-02-21T08:33:03.4619033Z .b8 114 2026-02-21T08:33:03.4619086Z .b8 105 2026-02-21T08:33:03.4619137Z .b8 116 2026-02-21T08:33:03.4619187Z .b8 111 2026-02-21T08:33:03.4619247Z .b8 110 2026-02-21T08:33:03.4619298Z .b8 0 2026-02-21T08:33:03.4619372Z .b8 2 // DW_AT_language 2026-02-21T08:33:03.4619430Z .b8 0 2026-02-21T08:33:03.4619505Z .b8 99 // DW_AT_name 2026-02-21T08:33:03.4619557Z .b8 53 2026-02-21T08:33:03.4619607Z .b8 111 2026-02-21T08:33:03.4619666Z .b8 52 2026-02-21T08:33:03.4619716Z .b8 106 2026-02-21T08:33:03.4619766Z .b8 55 2026-02-21T08:33:03.4619824Z .b8 116 2026-02-21T08:33:03.4619874Z .b8 121 2026-02-21T08:33:03.4619923Z .b8 112 2026-02-21T08:33:03.4619974Z .b8 109 2026-02-21T08:33:03.4620030Z .b8 97 2026-02-21T08:33:03.4620079Z .b8 110 2026-02-21T08:33:03.4620130Z .b8 121 2026-02-21T08:33:03.4620187Z .b8 114 2026-02-21T08:33:03.4620238Z .b8 101 2026-02-21T08:33:03.4620287Z .b8 105 2026-02-21T08:33:03.4620338Z .b8 52 2026-02-21T08:33:03.4620397Z .b8 104 2026-02-21T08:33:03.4620446Z .b8 98 2026-02-21T08:33:03.4620496Z .b8 105 2026-02-21T08:33:03.4620554Z .b8 114 2026-02-21T08:33:03.4620603Z .b8 115 2026-02-21T08:33:03.4620654Z .b8 55 2026-02-21T08:33:03.4620703Z .b8 111 2026-02-21T08:33:03.4620761Z .b8 98 2026-02-21T08:33:03.4620811Z .b8 105 2026-02-21T08:33:03.4620860Z .b8 101 
2026-02-21T08:33:03.4620909Z .b8 100 2026-02-21T08:33:03.4620966Z .b8 121 2026-02-21T08:33:03.4621015Z .b8 97 2026-02-21T08:33:03.4621063Z .b8 112 2026-02-21T08:33:03.4621120Z .b8 100 2026-02-21T08:33:03.4621170Z .b8 110 2026-02-21T08:33:03.4621219Z .b8 111 2026-02-21T08:33:03.4621271Z .b8 116 2026-02-21T08:33:03.4621328Z .b8 101 2026-02-21T08:33:03.4621378Z .b8 102 2026-02-21T08:33:03.4621427Z .b8 122 2026-02-21T08:33:03.4621483Z .b8 115 2026-02-21T08:33:03.4621558Z .b8 110 2026-02-21T08:33:03.4621609Z .b8 114 2026-02-21T08:33:03.4621659Z .b8 112 2026-02-21T08:33:03.4621717Z .b8 111 2026-02-21T08:33:03.4621767Z .b8 108 2026-02-21T08:33:03.4621818Z .b8 111 2026-02-21T08:33:03.4621868Z .b8 114 2026-02-21T08:33:03.4621925Z .b8 114 2026-02-21T08:33:03.4621976Z .b8 101 2026-02-21T08:33:03.4622025Z .b8 114 2026-02-21T08:33:03.4622082Z .b8 122 2026-02-21T08:33:03.4622130Z .b8 97 2026-02-21T08:33:03.4622180Z .b8 111 2026-02-21T08:33:03.4622228Z .b8 46 2026-02-21T08:33:03.4622286Z .b8 112 2026-02-21T08:33:03.4622336Z .b8 121 2026-02-21T08:33:03.4622383Z .b8 0 2026-02-21T08:33:03.4622481Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T08:33:03.4622556Z .b8 47 // DW_AT_comp_dir 2026-02-21T08:33:03.4622606Z .b8 116 2026-02-21T08:33:03.4622698Z .b8 109 2026-02-21T08:33:03.4622756Z .b8 112 2026-02-21T08:33:03.4622805Z .b8 47 2026-02-21T08:33:03.4622856Z .b8 116 2026-02-21T08:33:03.4622914Z .b8 111 2026-02-21T08:33:03.4622965Z .b8 114 2026-02-21T08:33:03.4623014Z .b8 99 2026-02-21T08:33:03.4623063Z .b8 104 2026-02-21T08:33:03.4623121Z .b8 105 2026-02-21T08:33:03.4623170Z .b8 110 2026-02-21T08:33:03.4623222Z .b8 100 2026-02-21T08:33:03.4623304Z .b8 117 2026-02-21T08:33:03.4623355Z .b8 99 2026-02-21T08:33:03.4623405Z .b8 116 2026-02-21T08:33:03.4623456Z .b8 111 2026-02-21T08:33:03.4623513Z .b8 114 2026-02-21T08:33:03.4623586Z .b8 95 2026-02-21T08:33:03.4623640Z .b8 114 2026-02-21T08:33:03.4623692Z .b8 111 2026-02-21T08:33:03.4623751Z .b8 111 2026-02-21T08:33:03.4623800Z .b8 116 
2026-02-21T08:33:03.4623850Z .b8 47 2026-02-21T08:33:03.4623907Z .b8 53 2026-02-21T08:33:03.4623957Z .b8 111 2026-02-21T08:33:03.4624007Z .b8 0 2026-02-21T08:33:03.4624106Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T08:33:03.4624187Z .b8 95 // DW_AT_name 2026-02-21T08:33:03.4624238Z .b8 104 2026-02-21T08:33:03.4624287Z .b8 101 2026-02-21T08:33:03.4624342Z .b8 108 2026-02-21T08:33:03.4624392Z .b8 105 2026-02-21T08:33:03.4624440Z .b8 111 2026-02-21T08:33:03.4624490Z .b8 110 2026-02-21T08:33:03.4624545Z .b8 95 2026-02-21T08:33:03.4624594Z .b8 109 2026-02-21T08:33:03.4624667Z .b8 97 2026-02-21T08:33:03.4624730Z .b8 116 2026-02-21T08:33:03.4624779Z .b8 109 2026-02-21T08:33:03.4624829Z .b8 117 2026-02-21T08:33:03.4624878Z .b8 108 2026-02-21T08:33:03.4624935Z .b8 95 2026-02-21T08:33:03.4624985Z .b8 98 2026-02-21T08:33:03.4625034Z .b8 102 2026-02-21T08:33:03.4625089Z .b8 49 2026-02-21T08:33:03.4625138Z .b8 54 2026-02-21T08:33:03.4625186Z .b8 95 2026-02-21T08:33:03.4625235Z .b8 105 2026-02-21T08:33:03.4625290Z .b8 110 2026-02-21T08:33:03.4625340Z .b8 116 2026-02-21T08:33:03.4625390Z .b8 52 2026-02-21T08:33:03.4625440Z .b8 0 2026-02-21T08:33:03.4625532Z .b8 1 // DW_AT_inline 2026-02-21T08:33:03.4625629Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T08:33:03.4625714Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T08:33:03.4625808Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T08:33:03.4625899Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:33:03.4626012Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T08:33:03.4626109Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:33:03.4626191Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T08:33:03.4626274Z .b64 $L__tmp269 // DW_AT_high_pc 2026-02-21T08:33:03.4626357Z .b8 1 // DW_AT_call_file 2026-02-21T08:33:03.4626432Z .b8 94 // DW_AT_call_line 2026-02-21T08:33:03.4626511Z .b8 40 // DW_AT_call_column 2026-02-21T08:33:03.4626595Z .b8 0 // End Of Children Mark 2026-02-21T08:33:03.4626682Z .b8 0 
// End Of Children Mark 2026-02-21T08:33:03.4626733Z } 2026-02-21T08:33:03.4626799Z .section .debug_macinfo { } 2026-02-21T08:33:03.4626803Z 2026-02-21T08:33:03.4626887Z ================================================================ 2026-02-21T08:33:03.4626991Z please share the reproducer above with Triton project. 2026-02-21T08:33:04.6296973Z 2026-02-21T08:33:04.6298971Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 58/58 18.1 configs/s 2026-02-21T08:33:11.2282899Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━ 742/742 111.7 configs/s 2026-02-21T08:33:11.3842250Z [278s] Generation 8 complete: 2026-02-21T08:33:11.3843906Z error=13 2026-02-21T08:33:11.3844063Z ok=47 2026-02-21T08:33:11.3844198Z min=0.2713 2026-02-21T08:33:11.3844326Z mid=0.3984 2026-02-21T08:33:11.3844456Z max=13.0294 2026-02-21T08:33:11.3844601Z best={'block_sizes': [16, 128, 128], 2026-02-21T08:33:11.3845062Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:33:11.3845282Z 'l2_groupings': [8], 2026-02-21T08:33:11.3845463Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:33:11.3845657Z 'loop_orders': [[1, 0]], 2026-02-21T08:33:11.3845821Z 'maxnreg': 64, 2026-02-21T08:33:11.3845985Z 'num_sm_multiplier': 32, 2026-02-21T08:33:11.3846189Z 'num_stages': 7, 2026-02-21T08:33:11.3846340Z 'num_warps': 4, 2026-02-21T08:33:11.3846500Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:33:11.3846782Z 'range_flattens': [None, False], 2026-02-21T08:33:11.3846969Z 'range_multi_buffers': [None, None], 2026-02-21T08:33:11.3847165Z 'range_num_stages': [3, 0], 2026-02-21T08:33:11.3847339Z 'range_unroll_factors': [0, 0], 2026-02-21T08:33:11.3847536Z 'range_warp_specializes': [True, None]} 2026-02-21T08:33:11.3872044Z [278s] Fitting surrogate: 828 points, 828 targets 2026-02-21T08:33:12.3956888Z [279s] Generation 9 starting: 62 neighbors, 3 active search path(s) 2026-02-21T08:33:20.6146109Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 1.4 
configs/s 2026-02-21T08:33:22.4720956Z [289s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=6, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:33:22.4722088Z Tensor-likes are not close! 2026-02-21T08:33:22.4722217Z 2026-02-21T08:33:22.4722304Z Mismatched elements: 33485243 / 33554432 (99.8%) 2026-02-21T08:33:22.4722588Z Greatest absolute difference: 2432.0 at index (2056, 3359) (up to 0.01 allowed) 2026-02-21T08:33:22.4722923Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:33:22.4723233Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:33:22.4723409Z 2026-02-21T08:33:22.8289153Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T08:33:22.8290602Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:33:22.8290905Z ^ 2026-02-21T08:33:22.8291274Z /tmp/torchinductor_root/6d/c6dnd4oliz43qfyfjlmm2wyrlvyo6d4w4a7ov4wsvwcarkg5hxog.py:72:36: note: called from 2026-02-21T08:33:22.8291872Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T08:33:22.8292083Z ^ 2026-02-21T08:33:22.8292477Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T08:33:22.8292949Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:33:22.8293185Z ^ 2026-02-21T08:33:22.8293440Z module { 2026-02-21T08:33:22.8298696Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: 
!tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:33:22.8302884Z %cst = arith.constant dense<0> : tensor<4x2x128xi8> 2026-02-21T08:33:22.8307368Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:33:22.8311230Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:33:22.8312859Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:33:22.8313194Z %cst_0 = arith.constant dense<8192> : tensor<128x1xi32> 2026-02-21T08:33:22.8318639Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:33:22.8323179Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:33:22.8327277Z %cst_3 = arith.constant dense<4> : tensor<4x128xi8> 2026-02-21T08:33:22.8327653Z %cst_4 = arith.constant dense<8192> : tensor<4x1xi32> 2026-02-21T08:33:22.8328196Z %cst_5 = arith.constant dense<1024> : tensor<128x1xi32> 2026-02-21T08:33:22.8334458Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:33:22.8338984Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<128x128xf32> 2026-02-21T08:33:22.8343179Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:33:22.8344938Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:33:22.8345458Z %0 = tt.get_program_id x : i32 2026-02-21T08:33:22.8345664Z %1 = arith.remsi %0, %c32_i32 : i32 2026-02-21T08:33:22.8345928Z %2 = arith.divsi %0, %c32_i32 : i32 2026-02-21T08:33:22.8346124Z %3 = arith.muli %1, %c128_i32 : i32 2026-02-21T08:33:22.8346410Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:33:22.8346700Z %5 = tt.splat %3 : i32 -> tensor<128xi32> 2026-02-21T08:33:22.8346912Z %6 = arith.addi %5, %4 : tensor<128xi32> 2026-02-21T08:33:22.8347119Z %7 = arith.muli %2, %c128_i32 : i32 2026-02-21T08:33:22.8347311Z %8 = tt.splat %7 : i32 -> tensor<128xi32> 2026-02-21T08:33:22.8347524Z %9 = arith.addi %8, %4 : tensor<128xi32> 2026-02-21T08:33:22.8347724Z %c504_i32 = arith.constant 504 : i32 2026-02-21T08:33:22.8347918Z %c12_i32 = arith.constant 12 : i32 2026-02-21T08:33:22.8348286Z %10 = 
scf.for %arg3 = %c0_i32 to %c504_i32 step %c12_i32 iter_args(%arg4 = %cst_6) -> (tensor<128x128xf32>) : i32 { 2026-02-21T08:33:22.8348668Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:33:22.8348930Z %22 = tt.splat %arg3 : i32 -> tensor<4xi32> 2026-02-21T08:33:22.8349138Z %23 = arith.addi %22, %21 : tensor<4xi32> 2026-02-21T08:33:22.8349510Z %24 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:33:22.8349736Z %25 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:33:22.8349977Z %26 = tt.splat %24 : i32 -> tensor<8xi32> 2026-02-21T08:33:22.8350177Z %27 = arith.addi %26, %25 : tensor<8xi32> 2026-02-21T08:33:22.8350428Z %28 = tt.expand_dims %6 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:33:22.8350707Z %29 = arith.muli %28, %cst_5 : tensor<128x1xi32> 2026-02-21T08:33:22.8350998Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:33:22.8355039Z %31 = tt.broadcast %29 : tensor<128x1xi32> -> tensor<128x8xi32> 2026-02-21T08:33:22.8357364Z %32 = tt.broadcast %30 : tensor<1x8xi32> -> tensor<128x8xi32> 2026-02-21T08:33:22.8357671Z %33 = arith.addi %31, %32 : tensor<128x8xi32> 2026-02-21T08:33:22.8357941Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr> 2026-02-21T08:33:22.8358284Z %35 = tt.addptr %34, %33 : tensor<128x8x!tt.ptr>, tensor<128x8xi32> 2026-02-21T08:33:22.8358595Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<128x8x!tt.ptr> 2026-02-21T08:33:22.8358893Z %37 = arith.extf %36 : tensor<128x8xbf16> to tensor<128x8xf32> 2026-02-21T08:33:22.8359174Z %38 = tt.expand_dims %23 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:33:22.8359446Z %39 = arith.muli %38, %cst_4 : tensor<4x1xi32> 2026-02-21T08:33:22.8359700Z %40 = tt.expand_dims %9 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:33:22.8359994Z %41 = tt.broadcast %39 : tensor<4x1xi32> -> tensor<4x128xi32> 2026-02-21T08:33:22.8360257Z %42 
= tt.broadcast %40 : tensor<1x128xi32> -> tensor<4x128xi32> 2026-02-21T08:33:22.8360501Z %43 = arith.addi %41, %42 : tensor<4x128xi32> 2026-02-21T08:33:22.8360743Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr> 2026-02-21T08:33:22.8361012Z %45 = tt.addptr %44, %43 : tensor<4x128x!tt.ptr>, tensor<4x128xi32> 2026-02-21T08:33:22.8361262Z %46 = tt.load %45 : tensor<4x128x!tt.ptr> 2026-02-21T08:33:22.8361467Z %47 = arith.shli %46, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8361791Z %48 = arith.shrsi %47, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8362010Z %49 = arith.shrsi %46, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8362480Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:33:22.8362780Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:33:22.8363088Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:33:22.8363474Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<4x128xi8> -> tensor<4x1x128xi8> 2026-02-21T08:33:22.8363832Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<4x128xi8> -> tensor<4x1x128xi8> 2026-02-21T08:33:22.8364121Z %55 = arith.cmpi eq, %52, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:33:22.8364371Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<4x2x128xi1> 2026-02-21T08:33:22.8364650Z %57 = tt.broadcast %53 : tensor<4x1x128xi8> -> tensor<4x2x128xi8> 2026-02-21T08:33:22.8364942Z %58 = arith.select %56, %57, %cst : tensor<4x2x128xi1>, tensor<4x2x128xi8> 2026-02-21T08:33:22.8365212Z %59 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:33:22.8365468Z %60 = tt.broadcast %54 : tensor<4x1x128xi8> -> tensor<4x2x128xi8> 2026-02-21T08:33:22.8365729Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<4x2x128xi1> 2026-02-21T08:33:22.8366046Z %62 = arith.select %61, %60, %58 : tensor<4x2x128xi1>, tensor<4x2x128xi8> 2026-02-21T08:33:22.8366327Z %63 = tt.reshape %62 : tensor<4x2x128xi8> -> tensor<8x128xi8> 
2026-02-21T08:33:22.8366582Z %64 = arith.sitofp %63 : tensor<8x128xi8> to tensor<8x128xf32> 2026-02-21T08:33:22.8366950Z %65 = tt.dot %37, %64, %arg4, inputPrecision = tf32 : tensor<128x8xf32> * tensor<8x128xf32> -> tensor<128x128xf32> 2026-02-21T08:33:22.8367277Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:33:22.8367476Z %66 = arith.muli %c4_i32, %c1_i32 : i32 2026-02-21T08:33:22.8367664Z %67 = arith.addi %arg3, %66 : i32 2026-02-21T08:33:22.8367892Z %68 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:33:22.8368141Z %69 = tt.splat %67 : i32 -> tensor<4xi32> 2026-02-21T08:33:22.8368335Z %70 = arith.addi %69, %68 : tensor<4xi32> 2026-02-21T08:33:22.8368531Z %71 = arith.muli %67, %c2_i32 : i32 2026-02-21T08:33:22.8368751Z %72 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:33:22.8368993Z %73 = tt.splat %71 : i32 -> tensor<8xi32> 2026-02-21T08:33:22.8369184Z %74 = arith.addi %73, %72 : tensor<8xi32> 2026-02-21T08:33:22.8369438Z %75 = tt.expand_dims %6 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:33:22.8369712Z %76 = arith.muli %75, %cst_5 : tensor<128x1xi32> 2026-02-21T08:33:22.8369963Z %77 = tt.expand_dims %74 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:33:22.8370249Z %78 = tt.broadcast %76 : tensor<128x1xi32> -> tensor<128x8xi32> 2026-02-21T08:33:22.8370504Z %79 = tt.broadcast %77 : tensor<1x8xi32> -> tensor<128x8xi32> 2026-02-21T08:33:22.8370739Z %80 = arith.addi %78, %79 : tensor<128x8xi32> 2026-02-21T08:33:22.8370975Z %81 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr> 2026-02-21T08:33:22.8371256Z %82 = tt.addptr %81, %80 : tensor<128x8x!tt.ptr>, tensor<128x8xi32> 2026-02-21T08:33:22.8371607Z %83 = tt.load %82 evictionPolicy = evict_last : tensor<128x8x!tt.ptr> 2026-02-21T08:33:22.8371892Z %84 = arith.extf %83 : tensor<128x8xbf16> to tensor<128x8xf32> 2026-02-21T08:33:22.8372176Z %85 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4xi32> -> 
tensor<4x1xi32> 2026-02-21T08:33:22.8372431Z %86 = arith.muli %85, %cst_4 : tensor<4x1xi32> 2026-02-21T08:33:22.8372691Z %87 = tt.expand_dims %9 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:33:22.8372974Z %88 = tt.broadcast %86 : tensor<4x1xi32> -> tensor<4x128xi32> 2026-02-21T08:33:22.8373239Z %89 = tt.broadcast %87 : tensor<1x128xi32> -> tensor<4x128xi32> 2026-02-21T08:33:22.8373480Z %90 = arith.addi %88, %89 : tensor<4x128xi32> 2026-02-21T08:33:22.8373762Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr> 2026-02-21T08:33:22.8374038Z %92 = tt.addptr %91, %90 : tensor<4x128x!tt.ptr>, tensor<4x128xi32> 2026-02-21T08:33:22.8374285Z %93 = tt.load %92 : tensor<4x128x!tt.ptr> 2026-02-21T08:33:22.8374508Z %94 = arith.shli %93, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8374751Z %95 = arith.shrsi %94, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8375002Z %96 = arith.shrsi %93, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8375247Z %97 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:33:22.8375529Z %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:33:22.8375835Z %99 = tt.expand_dims %98 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:33:22.8376155Z %100 = tt.expand_dims %95 {axis = 1 : i32} : tensor<4x128xi8> -> tensor<4x1x128xi8> 2026-02-21T08:33:22.8376479Z %101 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x128xi8> -> tensor<4x1x128xi8> 2026-02-21T08:33:22.8376765Z %102 = arith.cmpi eq, %99, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:33:22.8377018Z %103 = tt.broadcast %102 : tensor<1x2x1xi1> -> tensor<4x2x128xi1> 2026-02-21T08:33:22.8377329Z %104 = tt.broadcast %100 : tensor<4x1x128xi8> -> tensor<4x2x128xi8> 2026-02-21T08:33:22.8377628Z %105 = arith.select %103, %104, %cst : tensor<4x2x128xi1>, tensor<4x2x128xi8> 2026-02-21T08:33:22.8377914Z %106 = arith.cmpi eq, %99, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:33:22.8378169Z %107 = tt.broadcast 
%101 : tensor<4x1x128xi8> -> tensor<4x2x128xi8> 2026-02-21T08:33:22.8378449Z %108 = tt.broadcast %106 : tensor<1x2x1xi1> -> tensor<4x2x128xi1> 2026-02-21T08:33:22.8378734Z %109 = arith.select %108, %107, %105 : tensor<4x2x128xi1>, tensor<4x2x128xi8> 2026-02-21T08:33:22.8379014Z %110 = tt.reshape %109 : tensor<4x2x128xi8> -> tensor<8x128xi8> 2026-02-21T08:33:22.8379283Z %111 = arith.sitofp %110 : tensor<8x128xi8> to tensor<8x128xf32> 2026-02-21T08:33:22.8379640Z %112 = tt.dot %84, %111, %65, inputPrecision = tf32 : tensor<128x8xf32> * tensor<8x128xf32> -> tensor<128x128xf32> 2026-02-21T08:33:22.8379968Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T08:33:22.8380171Z %113 = arith.muli %c4_i32, %c2_i32_7 : i32 2026-02-21T08:33:22.8380367Z %114 = arith.addi %arg3, %113 : i32 2026-02-21T08:33:22.8380599Z %115 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:33:22.8380841Z %116 = tt.splat %114 : i32 -> tensor<4xi32> 2026-02-21T08:33:22.8381051Z %117 = arith.addi %116, %115 : tensor<4xi32> 2026-02-21T08:33:22.8381242Z %118 = arith.muli %114, %c2_i32 : i32 2026-02-21T08:33:22.8381473Z %119 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:33:22.8381754Z %120 = tt.splat %118 : i32 -> tensor<8xi32> 2026-02-21T08:33:22.8381964Z %121 = arith.addi %120, %119 : tensor<8xi32> 2026-02-21T08:33:22.8382224Z %122 = tt.expand_dims %6 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:33:22.8382497Z %123 = arith.muli %122, %cst_5 : tensor<128x1xi32> 2026-02-21T08:33:22.8382765Z %124 = tt.expand_dims %121 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:33:22.8383060Z %125 = tt.broadcast %123 : tensor<128x1xi32> -> tensor<128x8xi32> 2026-02-21T08:33:22.8383329Z %126 = tt.broadcast %124 : tensor<1x8xi32> -> tensor<128x8xi32> 2026-02-21T08:33:22.8383566Z %127 = arith.addi %125, %126 : tensor<128x8xi32> 2026-02-21T08:33:22.8383810Z %128 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr> 
2026-02-21T08:33:22.8384106Z %129 = tt.addptr %128, %127 : tensor<128x8x!tt.ptr>, tensor<128x8xi32> 2026-02-21T08:33:22.8384414Z %130 = tt.load %129 evictionPolicy = evict_last : tensor<128x8x!tt.ptr> 2026-02-21T08:33:22.8384758Z %131 = arith.extf %130 : tensor<128x8xbf16> to tensor<128x8xf32> 2026-02-21T08:33:22.8385040Z %132 = tt.expand_dims %117 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:33:22.8385311Z %133 = arith.muli %132, %cst_4 : tensor<4x1xi32> 2026-02-21T08:33:22.8385571Z %134 = tt.expand_dims %9 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:33:22.8385883Z %135 = tt.broadcast %133 : tensor<4x1xi32> -> tensor<4x128xi32> 2026-02-21T08:33:22.8386177Z %136 = tt.broadcast %134 : tensor<1x128xi32> -> tensor<4x128xi32> 2026-02-21T08:33:22.8386412Z %137 = arith.addi %135, %136 : tensor<4x128xi32> 2026-02-21T08:33:22.8386649Z %138 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr> 2026-02-21T08:33:22.8386921Z %139 = tt.addptr %138, %137 : tensor<4x128x!tt.ptr>, tensor<4x128xi32> 2026-02-21T08:33:22.8387185Z %140 = tt.load %139 : tensor<4x128x!tt.ptr> 2026-02-21T08:33:22.8387408Z %141 = arith.shli %140, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8387630Z %142 = arith.shrsi %141, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8387855Z %143 = arith.shrsi %140, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8388098Z %144 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:33:22.8388418Z %145 = tt.expand_dims %144 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:33:22.8388730Z %146 = tt.expand_dims %145 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:33:22.8389091Z %147 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4x128xi8> -> tensor<4x1x128xi8> 2026-02-21T08:33:22.8389439Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<4x128xi8> -> tensor<4x1x128xi8> 2026-02-21T08:33:22.8389739Z %149 = arith.cmpi eq, %146, %cst_2 : tensor<1x2x1xi32> 
2026-02-21T08:33:22.8390012Z %150 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<4x2x128xi1> 2026-02-21T08:33:22.8390296Z %151 = tt.broadcast %147 : tensor<4x1x128xi8> -> tensor<4x2x128xi8> 2026-02-21T08:33:22.8390606Z %152 = arith.select %150, %151, %cst : tensor<4x2x128xi1>, tensor<4x2x128xi8> 2026-02-21T08:33:22.8390906Z %153 = arith.cmpi eq, %146, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:33:22.8391175Z %154 = tt.broadcast %148 : tensor<4x1x128xi8> -> tensor<4x2x128xi8> 2026-02-21T08:33:22.8391469Z %155 = tt.broadcast %153 : tensor<1x2x1xi1> -> tensor<4x2x128xi1> 2026-02-21T08:33:22.8391802Z %156 = arith.select %155, %154, %152 : tensor<4x2x128xi1>, tensor<4x2x128xi8> 2026-02-21T08:33:22.8392105Z %157 = tt.reshape %156 : tensor<4x2x128xi8> -> tensor<8x128xi8> 2026-02-21T08:33:22.8392376Z %158 = arith.sitofp %157 : tensor<8x128xi8> to tensor<8x128xf32> 2026-02-21T08:33:22.8392759Z %159 = tt.dot %131, %158, %112, inputPrecision = tf32 : tensor<128x8xf32> * tensor<8x128xf32> -> tensor<128x128xf32> 2026-02-21T08:33:22.8393110Z scf.yield %159 : tensor<128x128xf32> 2026-02-21T08:33:22.8393310Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:33:22.8393645Z %11 = scf.for %arg3 = %c504_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %10) -> (tensor<128x128xf32>) : i32 { 2026-02-21T08:33:22.8394022Z %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:33:22.8394287Z %22 = tt.splat %arg3 : i32 -> tensor<4xi32> 2026-02-21T08:33:22.8394513Z %23 = arith.addi %22, %21 : tensor<4xi32> 2026-02-21T08:33:22.8394721Z %24 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:33:22.8394967Z %25 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:33:22.8395214Z %26 = tt.splat %24 : i32 -> tensor<8xi32> 2026-02-21T08:33:22.8395424Z %27 = arith.addi %26, %25 : tensor<8xi32> 2026-02-21T08:33:22.8395683Z %28 = tt.expand_dims %6 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:33:22.8395969Z %29 = arith.muli %28, 
%cst_5 : tensor<128x1xi32> 2026-02-21T08:33:22.8396234Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:33:22.8396568Z %31 = tt.broadcast %29 : tensor<128x1xi32> -> tensor<128x8xi32> 2026-02-21T08:33:22.8396843Z %32 = tt.broadcast %30 : tensor<1x8xi32> -> tensor<128x8xi32> 2026-02-21T08:33:22.8397080Z %33 = arith.addi %31, %32 : tensor<128x8xi32> 2026-02-21T08:33:22.8397343Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<128x8x!tt.ptr> 2026-02-21T08:33:22.8397676Z %35 = tt.addptr %34, %33 : tensor<128x8x!tt.ptr>, tensor<128x8xi32> 2026-02-21T08:33:22.8398007Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<128x8x!tt.ptr> 2026-02-21T08:33:22.8398299Z %37 = arith.extf %36 : tensor<128x8xbf16> to tensor<128x8xf32> 2026-02-21T08:33:22.8398570Z %38 = tt.expand_dims %23 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:33:22.8398835Z %39 = arith.muli %38, %cst_4 : tensor<4x1xi32> 2026-02-21T08:33:22.8399085Z %40 = tt.expand_dims %9 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:33:22.8399375Z %41 = tt.broadcast %39 : tensor<4x1xi32> -> tensor<4x128xi32> 2026-02-21T08:33:22.8399622Z %42 = tt.broadcast %40 : tensor<1x128xi32> -> tensor<4x128xi32> 2026-02-21T08:33:22.8399860Z %43 = arith.addi %41, %42 : tensor<4x128xi32> 2026-02-21T08:33:22.8400122Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<4x128x!tt.ptr> 2026-02-21T08:33:22.8400395Z %45 = tt.addptr %44, %43 : tensor<4x128x!tt.ptr>, tensor<4x128xi32> 2026-02-21T08:33:22.8400647Z %46 = tt.load %45 : tensor<4x128x!tt.ptr> 2026-02-21T08:33:22.8400856Z %47 = arith.shli %46, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8401076Z %48 = arith.shrsi %47, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8401284Z %49 = arith.shrsi %46, %cst_3 : tensor<4x128xi8> 2026-02-21T08:33:22.8401526Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:33:22.8401855Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> 
tensor<1x2xi32> 2026-02-21T08:33:22.8402156Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:33:22.8402480Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<4x128xi8> -> tensor<4x1x128xi8> 2026-02-21T08:33:22.8402796Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<4x128xi8> -> tensor<4x1x128xi8> 2026-02-21T08:33:22.8403082Z %55 = arith.cmpi eq, %52, %cst_2 : tensor<1x2x1xi32> 2026-02-21T08:33:22.8403336Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<4x2x128xi1> 2026-02-21T08:33:22.8403604Z %57 = tt.broadcast %53 : tensor<4x1x128xi8> -> tensor<4x2x128xi8> 2026-02-21T08:33:22.8403894Z %58 = arith.select %56, %57, %cst : tensor<4x2x128xi1>, tensor<4x2x128xi8> 2026-02-21T08:33:22.8404157Z %59 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:33:22.8404416Z %60 = tt.broadcast %54 : tensor<4x1x128xi8> -> tensor<4x2x128xi8> 2026-02-21T08:33:22.8404681Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<4x2x128xi1> 2026-02-21T08:33:22.8404967Z %62 = arith.select %61, %60, %58 : tensor<4x2x128xi1>, tensor<4x2x128xi8> 2026-02-21T08:33:22.8405263Z %63 = tt.reshape %62 : tensor<4x2x128xi8> -> tensor<8x128xi8> 2026-02-21T08:33:22.8405518Z %64 = arith.sitofp %63 : tensor<8x128xi8> to tensor<8x128xf32> 2026-02-21T08:33:22.8405881Z %65 = tt.dot %37, %64, %arg4, inputPrecision = tf32 : tensor<128x8xf32> * tensor<8x128xf32> -> tensor<128x128xf32> 2026-02-21T08:33:22.8406202Z scf.yield %65 : tensor<128x128xf32> 2026-02-21T08:33:22.8406430Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:33:22.8406691Z %12 = arith.truncf %11 : tensor<128x128xf32> to tensor<128x128xbf16> 2026-02-21T08:33:22.8406987Z %13 = tt.expand_dims %6 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:33:22.8407258Z %14 = arith.muli %13, %cst_0 : tensor<128x1xi32> 2026-02-21T08:33:22.8407513Z %15 = tt.expand_dims %9 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:33:22.8407834Z 
%16 = tt.broadcast %14 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:33:22.8408098Z %17 = tt.broadcast %15 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:33:22.8408339Z %18 = arith.addi %16, %17 : tensor<128x128xi32> 2026-02-21T08:33:22.8408584Z %19 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:33:22.8408902Z %20 = tt.addptr %19, %18 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:33:22.8409195Z tt.store %20, %12 : tensor<128x128x!tt.ptr> 2026-02-21T08:33:22.8409389Z tt.return 2026-02-21T08:33:22.8409527Z } 2026-02-21T08:33:22.8409649Z } 2026-02-21T08:33:22.8409728Z 2026-02-21T08:33:22.8409780Z {-# 2026-02-21T08:33:22.8409911Z external_resources: { 2026-02-21T08:33:22.8410075Z mlir_reproducer: { 2026-02-21T08:33:22.8414509Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, 
tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:33:22.8418974Z disable_threading: false, 2026-02-21T08:33:22.8419152Z verify_each: true 2026-02-21T08:33:22.8419300Z } 2026-02-21T08:33:22.8419427Z } 2026-02-21T08:33:22.8419541Z #-} 2026-02-21T08:33:22.8419977Z /tmp/torchinductor_root/6d/c6dnd4oliz43qfyfjlmm2wyrlvyo6d4w4a7ov4wsvwcarkg5hxog.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:33:22.8421010Z /tmp/torchinductor_root/6d/c6dnd4oliz43qfyfjlmm2wyrlvyo6d4w4a7ov4wsvwcarkg5hxog.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:33:22.8421866Z [289s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:33:22.8422962Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=6, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:33:22.8423944Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:33:22.8424239Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:33:23.7532221Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 20.1 configs/s 2026-02-21T08:33:32.7866395Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 831/831 91.8 configs/s 2026-02-21T08:33:32.9772958Z [299s] Generation 9 complete: 2026-02-21T08:33:32.9776174Z error=16 2026-02-21T08:33:32.9780676Z ok=49 2026-02-21T08:33:32.9785688Z min=0.2406 2026-02-21T08:33:32.9789591Z mid=0.3245 2026-02-21T08:33:32.9794692Z max=11.0691 2026-02-21T08:33:32.9799232Z best={'block_sizes': [8, 128, 256], 2026-02-21T08:33:32.9803674Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:33:32.9808041Z 'l2_groupings': [1], 2026-02-21T08:33:32.9809656Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:33:32.9809880Z 'loop_orders': [[0, 1]], 2026-02-21T08:33:32.9810053Z 'num_stages': 6, 2026-02-21T08:33:32.9810197Z 'num_warps': 2, 2026-02-21T08:33:32.9810398Z 'pid_type': 'flat', 2026-02-21T08:33:32.9814703Z 'range_flattens': [None, False], 2026-02-21T08:33:32.9816809Z 'range_multi_buffers': [None, False], 2026-02-21T08:33:32.9817064Z 'range_num_stages': [0, 0], 2026-02-21T08:33:32.9817308Z 'range_unroll_factors': [0, 0], 2026-02-21T08:33:32.9817525Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:32.9820254Z [299s] Fitting surrogate: 893 points, 893 targets 2026-02-21T08:33:34.0102812Z 
[300s] Generation 10 starting: 61 neighbors, 3 active search path(s) 2026-02-21T08:33:41.9939539Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 2.1 configs/s 2026-02-21T08:33:44.8175661Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 62/62 22.4 configs/s 2026-02-21T08:33:54.2264557Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 831/831 88.2 configs/s 2026-02-21T08:33:54.4245738Z [321s] Generation 10 complete: 2026-02-21T08:33:54.4250113Z error=20 2026-02-21T08:33:54.4254493Z ok=45 2026-02-21T08:33:54.4256106Z min=0.2406 2026-02-21T08:33:54.4256335Z mid=0.3011 2026-02-21T08:33:54.4261050Z max=14.9494 2026-02-21T08:33:54.4265423Z best={'block_sizes': [8, 128, 256], 2026-02-21T08:33:54.4265793Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:33:54.4266068Z 'l2_groupings': [1], 2026-02-21T08:33:54.4270537Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:33:54.4274306Z 'loop_orders': [[0, 1]], 2026-02-21T08:33:54.4278173Z 'num_stages': 6, 2026-02-21T08:33:54.4282566Z 'num_warps': 2, 2026-02-21T08:33:54.4286370Z 'pid_type': 'flat', 2026-02-21T08:33:54.4287779Z 'range_flattens': [None, False], 2026-02-21T08:33:54.4288021Z 'range_multi_buffers': [None, False], 2026-02-21T08:33:54.4288216Z 'range_num_stages': [0, 0], 2026-02-21T08:33:54.4288398Z 'range_unroll_factors': [0, 0], 2026-02-21T08:33:54.4288584Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:54.4288875Z [321s] Fitting surrogate: 958 points, 958 targets 2026-02-21T08:33:55.0980013Z [321s] Generation 11 starting: 34 neighbors, 2 active search path(s) 2026-02-21T08:34:08.8680263Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 0.6 configs/s 2026-02-21T08:34:09.9108512Z [336s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], 
num_stages=6, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:34:09.9109849Z Tensor-likes are not close! 2026-02-21T08:34:09.9115169Z 2026-02-21T08:34:09.9119662Z Mismatched elements: 33485703 / 33554432 (99.8%) 2026-02-21T08:34:09.9121156Z Greatest absolute difference: 2400.0 at index (473, 5764) (up to 0.01 allowed) 2026-02-21T08:34:09.9121674Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:34:09.9122060Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:34:09.9122517Z 2026-02-21T08:34:09.9140105Z [336s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_stages=6, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:34:09.9141334Z Tensor-likes are not close! 2026-02-21T08:34:09.9146225Z 2026-02-21T08:34:09.9147769Z Mismatched elements: 33485703 / 33554432 (99.8%) 2026-02-21T08:34:09.9148093Z Greatest absolute difference: 2400.0 at index (473, 5764) (up to 0.01 allowed) 2026-02-21T08:34:09.9148445Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:34:09.9148759Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:34:09.9148939Z 2026-02-21T08:34:10.4195582Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 22.8 configs/s 2026-02-21T08:34:15.4325550Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━ 852/852 168.8 configs/s 2026-02-21T08:34:15.5653692Z [342s] Generation 11 complete: 2026-02-21T08:34:15.5658041Z error=9 2026-02-21T08:34:15.5659562Z ok=27 2026-02-21T08:34:15.5659749Z min=0.2366 2026-02-21T08:34:15.5659889Z mid=0.2713 2026-02-21T08:34:15.5660025Z max=0.8193 2026-02-21T08:34:15.5660163Z best={'block_sizes': [16, 128, 128], 2026-02-21T08:34:15.5660432Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:34:15.5660673Z 'l2_groupings': [4], 2026-02-21T08:34:15.5660852Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:34:15.5661044Z 'loop_orders': [[1, 0]], 2026-02-21T08:34:15.5661202Z 'num_stages': 7, 2026-02-21T08:34:15.5661338Z 'num_warps': 4, 2026-02-21T08:34:15.5661483Z 'pid_type': 'flat', 2026-02-21T08:34:15.5661829Z 'range_flattens': [None, False], 2026-02-21T08:34:15.5662019Z 'range_multi_buffers': [None, None], 2026-02-21T08:34:15.5662212Z 'range_num_stages': [0, 0], 2026-02-21T08:34:15.5662381Z 'range_unroll_factors': [0, 0], 2026-02-21T08:34:15.5662570Z 'range_warp_specializes': [None, False]} 2026-02-21T08:34:15.5694157Z [342s] Fitting surrogate: 994 points, 994 targets 2026-02-21T08:34:16.1956544Z [342s] Generation 12 starting: 30 neighbors, 2 active search path(s) 2026-02-21T08:34:19.1202563Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 14.0 configs/s 2026-02-21T08:34:21.6412478Z [348s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], 
range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:34:21.6413577Z Tensor-likes are not close! 2026-02-21T08:34:21.6413720Z 2026-02-21T08:34:21.6418679Z Mismatched elements: 33485703 / 33554432 (99.8%) 2026-02-21T08:34:21.6423256Z Greatest absolute difference: 2400.0 at index (473, 5764) (up to 0.01 allowed) 2026-02-21T08:34:21.6424813Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T08:34:21.6425528Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:34:21.6425725Z 2026-02-21T08:34:21.6426094Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 30/30 11.6 configs/s 2026-02-21T08:34:25.3088172Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━ 852/852 230.1 configs/s 2026-02-21T08:34:25.4239566Z [352s] Generation 12 complete: 2026-02-21T08:34:25.4243976Z error=4 2026-02-21T08:34:25.4245477Z ok=28 2026-02-21T08:34:25.4245643Z min=0.2346 2026-02-21T08:34:25.4245798Z mid=0.2939 2026-02-21T08:34:25.4245929Z max=15.8577 2026-02-21T08:34:25.4246084Z best={'block_sizes': [16, 128, 128], 2026-02-21T08:34:25.4246601Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:34:25.4246842Z 'l2_groupings': [4], 2026-02-21T08:34:25.4247025Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:34:25.4247220Z 'loop_orders': [[1, 0]], 2026-02-21T08:34:25.4247383Z 'num_stages': 7, 2026-02-21T08:34:25.4247595Z 'num_warps': 4, 2026-02-21T08:34:25.4247755Z 'pid_type': 'flat', 2026-02-21T08:34:25.4247912Z 'range_flattens': [None, False], 2026-02-21T08:34:25.4248106Z 'range_multi_buffers': [None, False], 2026-02-21T08:34:25.4248364Z 'range_num_stages': [0, 0], 2026-02-21T08:34:25.4248544Z 'range_unroll_factors': [0, 0], 2026-02-21T08:34:25.4248732Z 'range_warp_specializes': [None, False]} 2026-02-21T08:34:25.4270418Z [352s] Fitting surrogate: 1026 points, 1026 targets 2026-02-21T08:34:26.0945661Z [352s] Generation 13 starting: 32 neighbors, 2 active 
search path(s) 2026-02-21T08:34:28.7797615Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 14.3 configs/s 2026-02-21T08:34:30.5427485Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 18.1 configs/s 2026-02-21T08:34:36.2468138Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━ 891/891 155.5 configs/s 2026-02-21T08:34:36.3956422Z [363s] Generation 13 complete: 2026-02-21T08:34:36.3960274Z error=7 2026-02-21T08:34:36.3960532Z ok=27 2026-02-21T08:34:36.3971404Z min=0.2243 2026-02-21T08:34:36.3971679Z mid=0.2468 2026-02-21T08:34:36.3973507Z max=15.7921 2026-02-21T08:34:36.3976851Z best={'block_sizes': [16, 128, 256], 2026-02-21T08:34:36.3977193Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:34:36.3977427Z 'l2_groupings': [1], 2026-02-21T08:34:36.3981207Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:34:36.3985688Z 'loop_orders': [[0, 1]], 2026-02-21T08:34:36.3987037Z 'num_stages': 6, 2026-02-21T08:34:36.3987233Z 'num_warps': 1, 2026-02-21T08:34:36.3987396Z 'pid_type': 'flat', 2026-02-21T08:34:36.3987581Z 'range_flattens': [None, False], 2026-02-21T08:34:36.3987797Z 'range_multi_buffers': [None, False], 2026-02-21T08:34:36.3987992Z 'range_num_stages': [0, 0], 2026-02-21T08:34:36.3988178Z 'range_unroll_factors': [0, 0], 2026-02-21T08:34:36.3988363Z 'range_warp_specializes': [None, True]} 2026-02-21T08:34:36.3992880Z [363s] Fitting surrogate: 1060 points, 1060 targets 2026-02-21T08:34:36.8209003Z [363s] Generation 14 starting: 13 neighbors, 1 active search path(s) 2026-02-21T08:34:38.0276609Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 21.5 configs/s 2026-02-21T08:34:38.6903143Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 13/13 21.4 configs/s 2026-02-21T08:34:41.0466938Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━ 891/891 372.1 configs/s 2026-02-21T08:34:41.1401333Z [367s] Generation 14 complete: 2026-02-21T08:34:41.1403367Z error=2 
2026-02-21T08:34:41.1403556Z ok=12 2026-02-21T08:34:41.1403728Z min=0.2243 2026-02-21T08:34:41.1403898Z mid=0.2244 2026-02-21T08:34:41.1404064Z max=0.4844 2026-02-21T08:34:41.1404521Z best={'block_sizes': [16, 128, 256], 2026-02-21T08:34:41.1404779Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:34:41.1405046Z 'l2_groupings': [1], 2026-02-21T08:34:41.1405233Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:34:41.1405458Z 'loop_orders': [[0, 1]], 2026-02-21T08:34:41.1405632Z 'num_stages': 6, 2026-02-21T08:34:41.1405919Z 'num_warps': 1, 2026-02-21T08:34:41.1406105Z 'pid_type': 'flat', 2026-02-21T08:34:41.1406294Z 'range_flattens': [None, False], 2026-02-21T08:34:41.1406534Z 'range_multi_buffers': [None, False], 2026-02-21T08:34:41.1406763Z 'range_num_stages': [0, 0], 2026-02-21T08:34:41.1406993Z 'range_unroll_factors': [0, 0], 2026-02-21T08:34:41.1407195Z 'range_warp_specializes': [None, True]} 2026-02-21T08:34:41.1444618Z [367s] Fitting surrogate: 1074 points, 1074 targets 2026-02-21T08:34:41.4422284Z [368s] Autotuning complete in 368.2s after searching 1041 configs. 
2026-02-21T08:34:41.4424299Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:34:41.4425680Z @helion.kernel(config=helion.Config(block_sizes=[16, 128, 256], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:34:41.4426622Z 2026-02-21T08:34:41.4427000Z [368s] Code of selected kernel: /tmp/torchinductor_root/6n/c6n4wxbea3zrogahqbysvw4lkpf5nasiv6npgjyzc7jgjjeoszxz.py 2026-02-21T08:34:42.4984403Z WARNING:tritonbench.utils.triton_op:Completed input ID 17: 2026-02-21T08:34:42.4985748Z x_val 2026-02-21T08:34:42.4985986Z --------------------- 2026-02-21T08:34:42.4986174Z (1, 4096, 8192, 1024) 2026-02-21T08:34:42.4986354Z 2026-02-21T08:34:42.5008997Z 60%|██████ | 6/10 [25:42<20:04, 301.06s/it]WARNING:tritonbench.utils.triton_op:Running input ID 21: 2026-02-21T08:34:42.5013197Z x_val 2026-02-21T08:34:42.5014497Z --------------------- 2026-02-21T08:34:42.5014692Z (4, 4096, 8192, 1024) 2026-02-21T08:34:42.5015012Z INFO:tritonbench.utils.triton_op:Took 0.22ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:34:43.8251823Z INFO:tritonbench.utils.triton_op:Took 2.68ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:34:44.9939665Z Autotune Choices Stats: 2026-02-21T08:34:44.9942330Z {"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.17824000120162964, "best_triton_pos": 1, "best_triton_time": 0.3112959861755371, "best_triton_kernel": "triton_mm_87", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4"} 2026-02-21T08:34:44.9955393Z 
AUTOTUNE mm(16384x1024, 1024x8192) 2026-02-21T08:34:44.9959654Z strides: [1024, 1], [8192, 1] 2026-02-21T08:34:44.9962968Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T08:34:44.9967250Z mm 0.1782 ms 100.0% 2026-02-21T08:34:44.9968923Z triton_mm_87 0.3113 ms 57.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:34:44.9969676Z triton_mm_86 0.3318 ms 53.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:34:44.9970364Z triton_mm_79 0.4157 ms 42.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:34:44.9971033Z triton_mm_80 0.4434 ms 40.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2026-02-21T08:34:44.9971773Z triton_mm_83 0.4464 ms 39.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:34:44.9972711Z triton_mm_81 0.4904 ms 36.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T08:34:44.9973383Z triton_mm_84 0.5079 ms 35.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2026-02-21T08:34:44.9974158Z triton_mm_88 0.5110 ms 34.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2026-02-21T08:34:44.9974840Z triton_mm_77 0.5284 ms 33.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2026-02-21T08:34:44.9975420Z SingleProcess AUTOTUNE benchmarking takes 0.7562 seconds and 0.2453 seconds precompiling for 20 choices 2026-02-21T08:34:46.4671000Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:34:47.5914689Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:34:47.5919286Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:34:47.5923514Z 'dtype': 'torch.bfloat16', 2026-02-21T08:34:47.5927843Z 'shape': (4, 4096, 1024), 2026-02-21T08:34:47.5932148Z 'stride': (4194304, 1024, 1)}, 2026-02-21T08:34:47.5936354Z { 'device': 'cuda:0', 2026-02-21T08:34:47.5937732Z 'dtype': 'torch.int32', 2026-02-21T08:34:47.5937974Z 'shape': (1024, 8192), 2026-02-21T08:34:47.5938159Z 'stride': (8192, 1)}), 2026-02-21T08:34:47.5938343Z 'kwargs': {}} 2026-02-21T08:34:47.5956226Z INFO:tritonbench.utils.triton_op:Took 4.71ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:34:47.8518662Z [0s] Autotune random seed: 2134813318 2026-02-21T08:34:47.9886511Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:35:23.9514848Z [35s] Timeout after 30s compiling Config(block_sizes=[64, 1024, 8], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], num_stages=5, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:35:23.9531885Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.2 configs/s 2026-02-21T08:35:35.8420512Z [47s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 128, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], 
l2_groupings=[2], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=32, num_stages=8, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:35:35.8425723Z Tensor-likes are not close! 2026-02-21T08:35:35.8429125Z 2026-02-21T08:35:35.8430834Z Mismatched elements: 134215284 / 134217728 (100.0%) 2026-02-21T08:35:35.8431189Z Greatest absolute difference: 2464.0 at index (11995, 4463) (up to 0.01 allowed) 2026-02-21T08:35:35.8433815Z Greatest relative difference: 3.046875 at index (7745, 1202) (up to 0.01 allowed) 2026-02-21T08:35:35.8434152Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:35:35.8434321Z 2026-02-21T08:35:40.7681521Z [52s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 64, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=7, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[3, 1], range_warp_specializes=[None, False]) 2026-02-21T08:35:40.7683843Z Tensor-likes are not close! 2026-02-21T08:35:40.7688059Z 2026-02-21T08:35:40.7692646Z Mismatched elements: 133940600 / 134217728 (99.8%) 2026-02-21T08:35:40.7697281Z Greatest absolute difference: 2464.0 at index (15523, 1213) (up to 0.01 allowed) 2026-02-21T08:35:40.7699186Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:35:40.7699557Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:35:40.7699735Z 2026-02-21T08:37:07.0323412Z [139s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=16, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[4, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:37:07.0324618Z Tensor-likes are not close! 2026-02-21T08:37:07.0329309Z 2026-02-21T08:37:07.0331833Z Mismatched elements: 134064047 / 134217728 (99.9%) 2026-02-21T08:37:07.0332203Z Greatest absolute difference: 3392.0 at index (12027, 4463) (up to 0.01 allowed) 2026-02-21T08:37:07.0332617Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:37:07.0332932Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:37:07.0333096Z 2026-02-21T08:37:21.8493996Z [153s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 64, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'first'], loop_orders=[[1, 0]], maxnreg=128, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[0, 4], range_unroll_factors=[4, 0], range_warp_specializes=[None, None]) 2026-02-21T08:37:21.8495191Z Tensor-likes are not close! 
2026-02-21T08:37:21.8498890Z 2026-02-21T08:37:21.8502281Z Mismatched elements: 134016461 / 134217728 (99.9%) 2026-02-21T08:37:21.8506854Z Greatest absolute difference: 3296.0 at index (6219, 2731) (up to 0.01 allowed) 2026-02-21T08:37:21.8511266Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:37:21.8512416Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:37:21.8512592Z 2026-02-21T08:37:36.1060429Z [168s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=1, num_stages=7, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 0], range_unroll_factors=[2, 0], range_warp_specializes=[None, None]) 2026-02-21T08:37:36.1061753Z Tensor-likes are not close! 2026-02-21T08:37:36.1065829Z 2026-02-21T08:37:36.1070527Z Mismatched elements: 133800092 / 134217728 (99.7%) 2026-02-21T08:37:36.1075566Z Greatest absolute difference: 1448.0 at index (4017, 215) (up to 0.01 allowed) 2026-02-21T08:37:36.1079931Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:37:36.1084268Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:37:36.1088368Z 2026-02-21T08:37:36.2299168Z [168s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 512, 1024], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_stages=7, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]) 2026-02-21T08:37:36.2300475Z Tensor-likes are not close! 
2026-02-21T08:37:36.2300595Z 2026-02-21T08:37:36.2300689Z Mismatched elements: 134016173 / 134217728 (99.8%) 2026-02-21T08:37:36.2300996Z Greatest absolute difference: 3376.0 at index (3700, 1784) (up to 0.01 allowed) 2026-02-21T08:37:36.2301432Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:37:36.2301949Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:37:36.2302127Z 2026-02-21T08:37:37.2413584Z [169s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[8, 256, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=8, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[0, 3], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]) 2026-02-21T08:37:37.2414781Z Tensor-likes are not close! 2026-02-21T08:37:37.2417022Z 2026-02-21T08:37:37.2417201Z Mismatched elements: 710134 / 134217728 (0.5%) 2026-02-21T08:37:37.2417704Z Greatest absolute difference: nan at index (128, 6) (up to 0.01 allowed) 2026-02-21T08:37:37.2418061Z Greatest relative difference: nan at index (128, 6) (up to 0.01 allowed) 2026-02-21T08:37:37.2418378Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:37:37.2418603Z 2026-02-21T08:37:37.5398625Z [169s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 64, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=1, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, False]) 2026-02-21T08:37:37.5399918Z Tensor-likes are not close! 2026-02-21T08:37:37.5400051Z 2026-02-21T08:37:37.5400149Z Mismatched elements: 134112088 / 134217728 (99.9%) 2026-02-21T08:37:37.5400443Z Greatest absolute difference: 3216.0 at index (11453, 7080) (up to 0.01 allowed) 2026-02-21T08:37:37.5400817Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:37:37.5401156Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:37:37.5401337Z 2026-02-21T08:37:58.7065702Z [190s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 128, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=1, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, False], range_num_stages=[4, 1], range_unroll_factors=[3, 0], range_warp_specializes=[None, None]) 2026-02-21T08:37:58.7066835Z Tensor-likes are not close! 
2026-02-21T08:37:58.7072192Z 2026-02-21T08:37:58.7076289Z Mismatched elements: 134215284 / 134217728 (100.0%) 2026-02-21T08:37:58.7077799Z Greatest absolute difference: 2464.0 at index (11995, 4463) (up to 0.01 allowed) 2026-02-21T08:37:58.7078206Z Greatest relative difference: 3.046875 at index (7745, 1202) (up to 0.01 allowed) 2026-02-21T08:37:58.7078524Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:37:58.7078706Z 2026-02-21T08:38:01.8494776Z [193s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 16], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=4, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T08:38:01.8496209Z Tensor-likes are not close! 2026-02-21T08:38:01.8500102Z 2026-02-21T08:38:01.8505431Z Mismatched elements: 134020107 / 134217728 (99.9%) 2026-02-21T08:38:01.8509359Z Greatest absolute difference: 3392.0 at index (6346, 2411) (up to 0.01 allowed) 2026-02-21T08:38:01.8513688Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:38:01.8514964Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:38:01.8515136Z 2026-02-21T08:38:10.0563542Z [202s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 1024], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:38:10.0568321Z Tensor-likes are not close! 
2026-02-21T08:38:10.0571823Z 2026-02-21T08:38:10.0575981Z Mismatched elements: 134034949 / 134217728 (99.9%) 2026-02-21T08:38:10.0579830Z Greatest absolute difference: 3584.0 at index (11995, 4719) (up to 0.01 allowed) 2026-02-21T08:38:10.0583929Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:38:10.0584727Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:38:10.0584898Z 2026-02-21T08:38:10.2798048Z [202s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T08:38:10.2799132Z Tensor-likes are not close! 2026-02-21T08:38:10.2803851Z 2026-02-21T08:38:10.2804087Z Mismatched elements: 134215284 / 134217728 (100.0%) 2026-02-21T08:38:10.2804497Z Greatest absolute difference: 2464.0 at index (11995, 4463) (up to 0.01 allowed) 2026-02-21T08:38:10.2804889Z Greatest relative difference: 3.046875 at index (7745, 1202) (up to 0.01 allowed) 2026-02-21T08:38:10.2809854Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:38:10.2814571Z 2026-02-21T08:38:10.6346140Z [202s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 64, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], num_sm_multiplier=1, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]) 2026-02-21T08:38:10.6347203Z Tensor-likes are not close! 
2026-02-21T08:38:10.6347327Z 2026-02-21T08:38:10.6347420Z Mismatched elements: 133931917 / 134217728 (99.8%) 2026-02-21T08:38:10.6347710Z Greatest absolute difference: 2624.0 at index (13540, 3531) (up to 0.01 allowed) 2026-02-21T08:38:10.6348048Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:38:10.6348356Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:38:10.6348525Z 2026-02-21T08:38:13.4884931Z [205s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 64, 16], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=128, num_stages=6, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[2, 2], range_unroll_factors=[0, 1], range_warp_specializes=[False, True]) 2026-02-21T08:38:13.4886057Z Tensor-likes are not close! 2026-02-21T08:38:13.4886175Z 2026-02-21T08:38:13.4886495Z Mismatched elements: 133905853 / 134217728 (99.8%) 2026-02-21T08:38:13.4886783Z Greatest absolute difference: 2688.0 at index (2171, 7645) (up to 0.01 allowed) 2026-02-21T08:38:13.4887132Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:38:13.4887435Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:38:13.4887608Z 2026-02-21T08:38:58.6634271Z [250s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['first', 'first'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=8, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[1, 3], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T08:38:58.6635410Z Tensor-likes are not close! 2026-02-21T08:38:58.6639242Z 2026-02-21T08:38:58.6641348Z Mismatched elements: 133927058 / 134217728 (99.8%) 2026-02-21T08:38:58.6641852Z Greatest absolute difference: 2720.0 at index (14880, 5350) (up to 0.01 allowed) 2026-02-21T08:38:58.6642204Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:38:58.6642510Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:38:58.6642685Z 2026-02-21T08:38:58.6814644Z [250s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:38:58.6815668Z Tensor-likes are not close! 2026-02-21T08:38:58.6815794Z 2026-02-21T08:38:58.6815883Z Mismatched elements: 133906682 / 134217728 (99.8%) 2026-02-21T08:38:58.6816171Z Greatest absolute difference: 2720.0 at index (14880, 5350) (up to 0.01 allowed) 2026-02-21T08:38:58.6816514Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:38:58.6816825Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:38:58.6816988Z 2026-02-21T08:39:00.1016248Z [252s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 256, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]) 2026-02-21T08:39:00.1017336Z Tensor-likes are not close! 2026-02-21T08:39:00.1017483Z 2026-02-21T08:39:00.1017719Z Mismatched elements: 134016080 / 134217728 (99.8%) 2026-02-21T08:39:00.1018006Z Greatest absolute difference: 3376.0 at index (3700, 1512) (up to 0.01 allowed) 2026-02-21T08:39:00.1018377Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:39:00.1022431Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:39:00.1026704Z 2026-02-21T08:39:15.7442451Z [267s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=16, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, False], range_num_stages=[4, 0], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]) 2026-02-21T08:39:15.7443653Z Tensor-likes are not close! 2026-02-21T08:39:15.7443777Z 2026-02-21T08:39:15.7443868Z Mismatched elements: 134215284 / 134217728 (100.0%) 2026-02-21T08:39:15.7444165Z Greatest absolute difference: 2464.0 at index (11995, 4463) (up to 0.01 allowed) 2026-02-21T08:39:15.7444780Z Greatest relative difference: 3.046875 at index (7745, 1202) (up to 0.01 allowed) 2026-02-21T08:39:15.7445088Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:39:15.7445251Z 2026-02-21T08:39:16.7650907Z [268s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=128, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 0], range_warp_specializes=[None, True]) 2026-02-21T08:39:16.7652351Z Tensor-likes are not close! 2026-02-21T08:39:16.7656527Z 2026-02-21T08:39:16.7660594Z Mismatched elements: 134112747 / 134217728 (99.9%) 2026-02-21T08:39:16.7665184Z Greatest absolute difference: 3440.0 at index (4464, 3962) (up to 0.01 allowed) 2026-02-21T08:39:16.7668443Z Greatest relative difference: 56885248.0 at index (3806, 6813) (up to 0.01 allowed) 2026-02-21T08:39:16.7673581Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:39:16.7673852Z 2026-02-21T08:39:19.6905566Z [271s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[8], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[1, 3], range_warp_specializes=[None, None]) 2026-02-21T08:39:19.6906690Z Tensor-likes are not close! 
2026-02-21T08:39:19.6906808Z 2026-02-21T08:39:19.6906903Z Mismatched elements: 134112890 / 134217728 (99.9%) 2026-02-21T08:39:19.6907186Z Greatest absolute difference: 3264.0 at index (11453, 7136) (up to 0.01 allowed) 2026-02-21T08:39:19.6907539Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:39:19.6907841Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:39:19.6908013Z 2026-02-21T08:39:25.4607514Z [277s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[1], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[1, 0], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]) 2026-02-21T08:39:25.4608672Z Tensor-likes are not close! 2026-02-21T08:39:25.4612469Z 2026-02-21T08:39:25.4617072Z Mismatched elements: 134112318 / 134217728 (99.9%) 2026-02-21T08:39:25.4618831Z Greatest absolute difference: 3216.0 at index (9419, 2928) (up to 0.01 allowed) 2026-02-21T08:39:25.4619218Z Greatest relative difference: 98566144.0 at index (10510, 1767) (up to 0.01 allowed) 2026-02-21T08:39:25.4619565Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:39:25.4619731Z 2026-02-21T08:40:04.7750454Z [316s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 512], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:40:04.7751811Z Tensor-likes are not close! 
2026-02-21T08:40:04.7756388Z 2026-02-21T08:40:04.7761201Z Mismatched elements: 133778692 / 134217728 (99.7%) 2026-02-21T08:40:04.7765473Z Greatest absolute difference: 1448.0 at index (12839, 4488) (up to 0.01 allowed) 2026-02-21T08:40:04.7769755Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:40:04.7770871Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:40:04.7771054Z 2026-02-21T08:40:16.1802985Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.9 configs/s 2026-02-21T08:40:16.1818510Z [328s] Adaptive compile timeout: 30s (90% percentile=4.9s, bounds=[30.0s, 30s]) 2026-02-21T08:40:16.2120329Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 - configs/s 2026-02-21T08:40:16.8684417Z [328s] Initial random population of 100, 5 starting points: 2026-02-21T08:40:16.8686260Z error=29 2026-02-21T08:40:16.8686432Z timeout=1 2026-02-21T08:40:16.8686559Z ok=70 2026-02-21T08:40:16.8686694Z min=7.0891 2026-02-21T08:40:16.8686823Z mid=187.0347 2026-02-21T08:40:16.8686967Z max=9085.0684 2026-02-21T08:40:16.8687117Z best={'block_sizes': [8, 16, 256], 2026-02-21T08:40:16.8687353Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:40:16.8687575Z 'l2_groupings': [8], 2026-02-21T08:40:16.8687753Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:40:16.8687967Z 'loop_orders': [[1, 0]], 2026-02-21T08:40:16.8688119Z 'maxnreg': 256, 2026-02-21T08:40:16.8688268Z 'num_sm_multiplier': 32, 2026-02-21T08:40:16.8688421Z 'num_stages': 6, 2026-02-21T08:40:16.8688566Z 'num_warps': 8, 2026-02-21T08:40:16.8688719Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:40:16.8689137Z 'range_flattens': [False, None], 2026-02-21T08:40:16.8689335Z 'range_multi_buffers': [True, None], 2026-02-21T08:40:16.8689530Z 'range_num_stages': [3, 0], 2026-02-21T08:40:16.8689772Z 'range_unroll_factors': [0, 0], 2026-02-21T08:40:16.8689960Z 'range_warp_specializes': [True, None]} 
2026-02-21T08:40:16.8704614Z [328s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:40:18.2373768Z [330s] Generation 1 starting: 99 neighbors, 5 active search path(s) 2026-02-21T08:40:32.4852998Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 103/103 1.6 configs/s 2026-02-21T08:40:55.3448846Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T08:40:55.3449778Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:40:55.3454289Z ^ 2026-02-21T08:40:55.3460103Z /tmp/torchinductor_root/mz/cmzxbqmluhkpsa4qdftsscvccisiqqrx7apvswc7hhfkcikvl2me.py:86:36: note: called from 2026-02-21T08:40:55.3460804Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T08:40:55.3461025Z ^ 2026-02-21T08:40:55.3461499Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T08:40:55.3462047Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T08:40:55.3462300Z ^ 2026-02-21T08:40:55.3462463Z module { 2026-02-21T08:40:55.3462945Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:40:55.3463518Z %cst = arith.constant dense<0> : tensor<8x2x16xi8> 2026-02-21T08:40:55.3463741Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:40:55.3463960Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:40:55.3464145Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:40:55.3464325Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:40:55.3464526Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:40:55.3464765Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:40:55.3464991Z %cst_2 = arith.constant dense<4> : tensor<8x16xi8> 2026-02-21T08:40:55.3465231Z 
%cst_3 = arith.constant dense<8192> : tensor<8x1xi32> 2026-02-21T08:40:55.3465473Z %cst_4 = arith.constant dense<1024> : tensor<512x1xi32> 2026-02-21T08:40:55.3465692Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:40:55.3466197Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<512x16xf32> 2026-02-21T08:40:55.3466429Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:40:55.3466620Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:40:55.3466798Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:40:55.3466985Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:40:55.3467173Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:40:55.3467443Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:40:55.3467626Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:40:55.3467950Z %0 = tt.make_tensor_descriptor %arg2, [%c16384_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:40:55.3468283Z %1 = tt.get_program_id x : i32 2026-02-21T08:40:55.3468462Z %2 = arith.divsi %1, %c2048_i32 : i32 2026-02-21T08:40:55.3468654Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:40:55.3468828Z %4 = arith.subi %c32_i32, %3 : i32 2026-02-21T08:40:55.3469010Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T08:40:55.3469196Z %6 = arith.remsi %1, %c2048_i32 : i32 2026-02-21T08:40:55.3469394Z %7 = arith.remsi %6, %5 : i32 2026-02-21T08:40:55.3469574Z %8 = arith.addi %3, %7 : i32 2026-02-21T08:40:55.3469739Z %9 = arith.divsi %6, %5 : i32 2026-02-21T08:40:55.3469917Z %10 = arith.muli %8, %c512_i32 : i32 2026-02-21T08:40:55.3470208Z %11 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:40:55.3470478Z %12 = tt.splat %10 : i32 -> tensor<512xi32> 2026-02-21T08:40:55.3470756Z %13 = arith.addi %12, %11 : tensor<512xi32> 2026-02-21T08:40:55.3470961Z %14 = arith.muli %9, %c16_i32 : i32 2026-02-21T08:40:55.3471197Z %15 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:40:55.3471444Z %16 = tt.splat %14 : i32 -> tensor<16xi32> 
2026-02-21T08:40:55.3471691Z %17 = arith.addi %16, %15 : tensor<16xi32> 2026-02-21T08:40:55.3471881Z %c32_i32_6 = arith.constant 32 : i32 2026-02-21T08:40:55.3472208Z %18 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c32_i32_6 iter_args(%arg4 = %cst_5) -> (tensor<512x16xf32>) : i32 { 2026-02-21T08:40:55.3472572Z %20 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:40:55.3472828Z %21 = tt.splat %arg3 : i32 -> tensor<8xi32> 2026-02-21T08:40:55.3473044Z %22 = arith.addi %21, %20 : tensor<8xi32> 2026-02-21T08:40:55.3473240Z %23 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T08:40:55.3473441Z %24 = tt.splat %23 : i32 -> tensor<16xi32> 2026-02-21T08:40:55.3473639Z %25 = arith.addi %24, %15 : tensor<16xi32> 2026-02-21T08:40:55.3473903Z %26 = tt.expand_dims %13 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:40:55.3474181Z %27 = arith.muli %26, %cst_4 : tensor<512x1xi32> 2026-02-21T08:40:55.3474450Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:40:55.3474749Z %29 = tt.broadcast %27 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:40:55.3475015Z %30 = tt.broadcast %28 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:40:55.3475262Z %31 = arith.addi %29, %30 : tensor<512x16xi32> 2026-02-21T08:40:55.3475506Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:40:55.3475803Z %33 = tt.addptr %32, %31 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:40:55.3476113Z %34 = tt.load %33 evictionPolicy = evict_last : tensor<512x16x!tt.ptr> 2026-02-21T08:40:55.3476422Z %35 = arith.extf %34 : tensor<512x16xbf16> to tensor<512x16xf32> 2026-02-21T08:40:55.3476705Z %36 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:40:55.3476963Z %37 = arith.muli %36, %cst_3 : tensor<8x1xi32> 2026-02-21T08:40:55.3477224Z %38 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:40:55.3477508Z %39 = 
tt.broadcast %37 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:40:55.3477810Z %40 = tt.broadcast %38 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:40:55.3478039Z %41 = arith.addi %39, %40 : tensor<8x16xi32> 2026-02-21T08:40:55.3478274Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:40:55.3478550Z %43 = tt.addptr %42, %41 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:40:55.3478871Z %44 = tt.load %43 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:40:55.3479136Z %45 = arith.shli %44, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3479342Z %46 = arith.shrsi %45, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3479561Z %47 = arith.shrsi %44, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3479802Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:40:55.3480085Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:40:55.3480393Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:40:55.3480699Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:40:55.3481009Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:40:55.3481318Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:40:55.3481604Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:40:55.3481905Z %55 = tt.broadcast %51 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:40:55.3482173Z %56 = arith.select %54, %55, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:40:55.3482439Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:40:55.3482678Z %58 = tt.broadcast %52 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:40:55.3482939Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:40:55.3483201Z %60 = arith.select %59, %58, %56 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 
2026-02-21T08:40:55.3483473Z %61 = tt.reshape %60 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:40:55.3483733Z %62 = arith.sitofp %61 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:40:55.3484091Z %63 = tt.dot %35, %62, %arg4, inputPrecision = tf32 : tensor<512x16xf32> * tensor<16x16xf32> -> tensor<512x16xf32> 2026-02-21T08:40:55.3484467Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:40:55.3484661Z %64 = arith.muli %c8_i32, %c1_i32 : i32 2026-02-21T08:40:55.3484858Z %65 = arith.addi %arg3, %64 : i32 2026-02-21T08:40:55.3485090Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:40:55.3485343Z %67 = tt.splat %65 : i32 -> tensor<8xi32> 2026-02-21T08:40:55.3485553Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T08:40:55.3485747Z %69 = arith.muli %65, %c2_i32 : i32 2026-02-21T08:40:55.3485947Z %70 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T08:40:55.3486149Z %71 = arith.addi %70, %15 : tensor<16xi32> 2026-02-21T08:40:55.3486411Z %72 = tt.expand_dims %13 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:40:55.3486689Z %73 = arith.muli %72, %cst_4 : tensor<512x1xi32> 2026-02-21T08:40:55.3486960Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:40:55.3487265Z %75 = tt.broadcast %73 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:40:55.3487537Z %76 = tt.broadcast %74 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:40:55.3487786Z %77 = arith.addi %75, %76 : tensor<512x16xi32> 2026-02-21T08:40:55.3488037Z %78 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:40:55.3488339Z %79 = tt.addptr %78, %77 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:40:55.3488655Z %80 = tt.load %79 evictionPolicy = evict_last : tensor<512x16x!tt.ptr> 2026-02-21T08:40:55.3488956Z %81 = arith.extf %80 : tensor<512x16xbf16> to tensor<512x16xf32> 2026-02-21T08:40:55.3489290Z %82 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> 
tensor<8x1xi32> 2026-02-21T08:40:55.3489559Z %83 = arith.muli %82, %cst_3 : tensor<8x1xi32> 2026-02-21T08:40:55.3489828Z %84 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:40:55.3490151Z %85 = tt.broadcast %83 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:40:55.3490426Z %86 = tt.broadcast %84 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:40:55.3490674Z %87 = arith.addi %85, %86 : tensor<8x16xi32> 2026-02-21T08:40:55.3490916Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:40:55.3491195Z %89 = tt.addptr %88, %87 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:40:55.3491490Z %90 = tt.load %89 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:40:55.3491822Z %91 = arith.shli %90, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3492041Z %92 = arith.shrsi %91, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3492266Z %93 = arith.shrsi %90, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3492523Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:40:55.3492832Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:40:55.3493149Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:40:55.3493484Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:40:55.3493800Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:40:55.3494070Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:40:55.3494326Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:40:55.3494596Z %101 = tt.broadcast %97 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:40:55.3494877Z %102 = arith.select %100, %101, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:40:55.3495154Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:40:55.3495401Z %104 = 
tt.broadcast %98 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:40:55.3495669Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:40:55.3495978Z %106 = arith.select %105, %104, %102 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:40:55.3496258Z %107 = tt.reshape %106 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:40:55.3496525Z %108 = arith.sitofp %107 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:40:55.3496890Z %109 = tt.dot %81, %108, %63, inputPrecision = tf32 : tensor<512x16xf32> * tensor<16x16xf32> -> tensor<512x16xf32> 2026-02-21T08:40:55.3497214Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T08:40:55.3497416Z %110 = arith.muli %c8_i32, %c2_i32_7 : i32 2026-02-21T08:40:55.3497612Z %111 = arith.addi %arg3, %110 : i32 2026-02-21T08:40:55.3497856Z %112 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:40:55.3498103Z %113 = tt.splat %111 : i32 -> tensor<8xi32> 2026-02-21T08:40:55.3498314Z %114 = arith.addi %113, %112 : tensor<8xi32> 2026-02-21T08:40:55.3498526Z %115 = arith.muli %111, %c2_i32 : i32 2026-02-21T08:40:55.3498726Z %116 = tt.splat %115 : i32 -> tensor<16xi32> 2026-02-21T08:40:55.3498941Z %117 = arith.addi %116, %15 : tensor<16xi32> 2026-02-21T08:40:55.3499197Z %118 = tt.expand_dims %13 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:40:55.3499479Z %119 = arith.muli %118, %cst_4 : tensor<512x1xi32> 2026-02-21T08:40:55.3499742Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:40:55.3500048Z %121 = tt.broadcast %119 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:40:55.3500325Z %122 = tt.broadcast %120 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:40:55.3500598Z %123 = arith.addi %121, %122 : tensor<512x16xi32> 2026-02-21T08:40:55.3500852Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:40:55.3501146Z %125 = tt.addptr %124, %123 : tensor<512x16x!tt.ptr>, 
tensor<512x16xi32> 2026-02-21T08:40:55.3501473Z %126 = tt.load %125 evictionPolicy = evict_last : tensor<512x16x!tt.ptr> 2026-02-21T08:40:55.3501834Z %127 = arith.extf %126 : tensor<512x16xbf16> to tensor<512x16xf32> 2026-02-21T08:40:55.3502126Z %128 = tt.expand_dims %114 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:40:55.3502397Z %129 = arith.muli %128, %cst_3 : tensor<8x1xi32> 2026-02-21T08:40:55.3502650Z %130 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:40:55.3502943Z %131 = tt.broadcast %129 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:40:55.3503200Z %132 = tt.broadcast %130 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:40:55.3503443Z %133 = arith.addi %131, %132 : tensor<8x16xi32> 2026-02-21T08:40:55.3503683Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:40:55.3503952Z %135 = tt.addptr %134, %133 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:40:55.3504305Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:40:55.3504571Z %137 = arith.shli %136, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3504816Z %138 = arith.shrsi %137, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3505030Z %139 = arith.shrsi %136, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3505275Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:40:55.3505566Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:40:55.3505873Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:40:55.3506198Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:40:55.3506508Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:40:55.3506795Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:40:55.3507048Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> 
tensor<8x2x16xi1> 2026-02-21T08:40:55.3507321Z %147 = tt.broadcast %143 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:40:55.3507607Z %148 = arith.select %146, %147, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:40:55.3507877Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:40:55.3508133Z %150 = tt.broadcast %144 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:40:55.3508393Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:40:55.3508669Z %152 = arith.select %151, %150, %148 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:40:55.3508950Z %153 = tt.reshape %152 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:40:55.3509204Z %154 = arith.sitofp %153 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:40:55.3509570Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<512x16xf32> * tensor<16x16xf32> -> tensor<512x16xf32> 2026-02-21T08:40:55.3509898Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:40:55.3510091Z %156 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T08:40:55.3510283Z %157 = arith.addi %arg3, %156 : i32 2026-02-21T08:40:55.3510518Z %158 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:40:55.3510773Z %159 = tt.splat %157 : i32 -> tensor<8xi32> 2026-02-21T08:40:55.3510978Z %160 = arith.addi %159, %158 : tensor<8xi32> 2026-02-21T08:40:55.3511197Z %161 = arith.muli %157, %c2_i32 : i32 2026-02-21T08:40:55.3511388Z %162 = tt.splat %161 : i32 -> tensor<16xi32> 2026-02-21T08:40:55.3511654Z %163 = arith.addi %162, %15 : tensor<16xi32> 2026-02-21T08:40:55.3511908Z %164 = tt.expand_dims %13 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:40:55.3512191Z %165 = arith.muli %164, %cst_4 : tensor<512x1xi32> 2026-02-21T08:40:55.3512464Z %166 = tt.expand_dims %163 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:40:55.3512789Z %167 = tt.broadcast %165 : tensor<512x1xi32> -> tensor<512x16xi32> 
2026-02-21T08:40:55.3513069Z %168 = tt.broadcast %166 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:40:55.3513309Z %169 = arith.addi %167, %168 : tensor<512x16xi32> 2026-02-21T08:40:55.3513561Z %170 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:40:55.3513852Z %171 = tt.addptr %170, %169 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:40:55.3514171Z %172 = tt.load %171 evictionPolicy = evict_last : tensor<512x16x!tt.ptr> 2026-02-21T08:40:55.3514478Z %173 = arith.extf %172 : tensor<512x16xbf16> to tensor<512x16xf32> 2026-02-21T08:40:55.3514758Z %174 = tt.expand_dims %160 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:40:55.3515029Z %175 = arith.muli %174, %cst_3 : tensor<8x1xi32> 2026-02-21T08:40:55.3515311Z %176 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:40:55.3515605Z %177 = tt.broadcast %175 : tensor<8x1xi32> -> tensor<8x16xi32> 2026-02-21T08:40:55.3515884Z %178 = tt.broadcast %176 : tensor<1x16xi32> -> tensor<8x16xi32> 2026-02-21T08:40:55.3516127Z %179 = arith.addi %177, %178 : tensor<8x16xi32> 2026-02-21T08:40:55.3516368Z %180 = tt.splat %arg1 : !tt.ptr -> tensor<8x16x!tt.ptr> 2026-02-21T08:40:55.3516640Z %181 = tt.addptr %180, %179 : tensor<8x16x!tt.ptr>, tensor<8x16xi32> 2026-02-21T08:40:55.3516943Z %182 = tt.load %181 evictionPolicy = evict_last : tensor<8x16x!tt.ptr> 2026-02-21T08:40:55.3517207Z %183 = arith.shli %182, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3517427Z %184 = arith.shrsi %183, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3517647Z %185 = arith.shrsi %182, %cst_2 : tensor<8x16xi8> 2026-02-21T08:40:55.3517885Z %186 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:40:55.3518176Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:40:55.3518488Z %188 = tt.expand_dims %187 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:40:55.3518808Z %189 = 
tt.expand_dims %184 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:40:55.3519116Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<8x16xi8> -> tensor<8x1x16xi8> 2026-02-21T08:40:55.3519403Z %191 = arith.cmpi eq, %188, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:40:55.3519656Z %192 = tt.broadcast %191 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:40:55.3519924Z %193 = tt.broadcast %189 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:40:55.3520208Z %194 = arith.select %192, %193, %cst : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:40:55.3520477Z %195 = arith.cmpi eq, %188, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:40:55.3520732Z %196 = tt.broadcast %190 : tensor<8x1x16xi8> -> tensor<8x2x16xi8> 2026-02-21T08:40:55.3520996Z %197 = tt.broadcast %195 : tensor<1x2x1xi1> -> tensor<8x2x16xi1> 2026-02-21T08:40:55.3521275Z %198 = arith.select %197, %196, %194 : tensor<8x2x16xi1>, tensor<8x2x16xi8> 2026-02-21T08:40:55.3521599Z %199 = tt.reshape %198 : tensor<8x2x16xi8> -> tensor<16x16xi8> 2026-02-21T08:40:55.3521861Z %200 = arith.sitofp %199 : tensor<16x16xi8> to tensor<16x16xf32> 2026-02-21T08:40:55.3522223Z %201 = tt.dot %173, %200, %155, inputPrecision = tf32 : tensor<512x16xf32> * tensor<16x16xf32> -> tensor<512x16xf32> 2026-02-21T08:40:55.3522543Z scf.yield %201 : tensor<512x16xf32> 2026-02-21T08:40:55.3522755Z } 2026-02-21T08:40:55.3522939Z %19 = arith.truncf %18 : tensor<512x16xf32> to tensor<512x16xbf16> 2026-02-21T08:40:55.3523262Z tt.descriptor_store %0[%10, %14], %19 : !tt.tensordesc>, tensor<512x16xbf16> 2026-02-21T08:40:55.3523543Z tt.return 2026-02-21T08:40:55.3523674Z } 2026-02-21T08:40:55.3523806Z } 2026-02-21T08:40:55.3523905Z 2026-02-21T08:40:55.3523958Z {-# 2026-02-21T08:40:55.3524101Z external_resources: { 2026-02-21T08:40:55.3524259Z mlir_reproducer: { 2026-02-21T08:40:55.3528664Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 
threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:40:55.3533331Z disable_threading: false, 2026-02-21T08:40:55.3533518Z verify_each: true 2026-02-21T08:40:55.3533668Z } 2026-02-21T08:40:55.3533801Z } 2026-02-21T08:40:55.3533920Z #-} 2026-02-21T08:40:55.3534378Z /tmp/torchinductor_root/mz/cmzxbqmluhkpsa4qdftsscvccisiqqrx7apvswc7hhfkcikvl2me.py:19:0: error: Failures have been detected 
while processing an MLIR pass pipeline 2026-02-21T08:40:55.3535467Z /tmp/torchinductor_root/mz/cmzxbqmluhkpsa4qdftsscvccisiqqrx7apvswc7hhfkcikvl2me.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:40:55.3536326Z [367s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:40:55.3537430Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:40:55.3538416Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:40:55.3538674Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:40:56.0636221Z [368s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:40:56.0636492Z 2026-02-21T08:40:56.0640919Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 512, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[1, 4], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:40:56.0642464Z 2026-02-21T08:40:56.0642750Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T08:40:56.0642995Z `ptxas` stderr: 2026-02-21T08:40:56.0643439Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 147 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T08:40:56.0643921Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:40:56.0644100Z 2026-02-21T08:40:56.0644490Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp_k1cqf9h.ptx -o /tmp/tmp_k1cqf9h.ptx.o 2026-02-21T08:40:56.0644922Z 2026-02-21T08:40:56.0645058Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:40:56.0645319Z ================================================================ 2026-02-21T08:40:56.0645533Z Internal Triton PTX codegen error 2026-02-21T08:40:56.0645755Z `ptxas` stderr: 2026-02-21T08:40:56.0646249Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 147 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T08:40:56.0646732Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T08:40:56.0646875Z 2026-02-21T08:40:56.0647243Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp_k1cqf9h.ptx -o /tmp/tmp_k1cqf9h.ptx.o 2026-02-21T08:40:56.0647672Z 2026-02-21T08:40:56.0647676Z 2026-02-21T08:40:56.0647731Z // 2026-02-21T08:40:56.0647872Z // Generated by LLVM NVPTX Back-End 2026-02-21T08:40:56.0648057Z // 2026-02-21T08:40:56.0648127Z 2026-02-21T08:40:56.0648193Z .version 8.7 2026-02-21T08:40:56.0648331Z .target sm_100a 2026-02-21T08:40:56.0648482Z .address_size 64 2026-02-21T08:40:56.0648566Z 2026-02-21T08:40:56.0648717Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T08:40:56.0649010Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T08:40:56.0649233Z // @_helion_matmul_bf16_int4 2026-02-21T08:40:56.0649461Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T08:40:56.0649705Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T08:40:56.0649991Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T08:40:56.0650279Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T08:40:56.0650555Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T08:40:56.0650841Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T08:40:56.0651057Z ) 2026-02-21T08:40:56.0651185Z .reqntid 256 2026-02-21T08:40:56.0651311Z .maxnreg 32 2026-02-21T08:40:56.0651438Z { 2026-02-21T08:40:56.0651611Z .reg .pred %p<110>; 2026-02-21T08:40:56.0651761Z .reg .b16 %rs<85>; 2026-02-21T08:40:56.0651911Z .reg .b32 %r<866>; 2026-02-21T08:40:56.0652051Z .reg .b64 %rd<95>; 2026-02-21T08:40:56.0652319Z .loc 1 19 0 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:19:0 2026-02-21T08:40:56.0652603Z $L__func_begin0: 
2026-02-21T08:40:56.0652852Z .loc 1 19 0 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:19:0 2026-02-21T08:40:56.0653078Z 2026-02-21T08:40:56.0653131Z // %bb.0: 2026-02-21T08:40:56.0653311Z ld.param.b64 %rd3, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T08:40:56.0653523Z $L__tmp0: 2026-02-21T08:40:56.0653754Z .loc 1 19 0 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:19 2026-02-21T08:40:56.0654086Z mov.u32 %r1, %tid.x; 2026-02-21T08:40:56.0654240Z setp.lt.u32 %p3, %r1, 32; 2026-02-21T08:40:56.0654448Z ld.param.b64 %rd6, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T08:40:56.0654661Z mov.b32 %r121, global_smem; 2026-02-21T08:40:56.0654830Z // begin inline asm 2026-02-21T08:40:56.0655101Z @%p3 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r121], 512; 2026-02-21T08:40:56.0655354Z // end inline asm 2026-02-21T08:40:56.0655540Z ld.param.b64 %rd47, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T08:40:56.0655742Z bar.sync 0; 2026-02-21T08:40:56.0655895Z ld.shared.b32 %r829, [global_smem]; 2026-02-21T08:40:56.0656063Z bar.sync 0; 2026-02-21T08:40:56.0656201Z // begin inline asm 2026-02-21T08:40:56.0656403Z @%p3 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T08:40:56.0656636Z // end inline asm 2026-02-21T08:40:56.0656885Z .loc 1 21 68 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:21:68 2026-02-21T08:40:56.0657177Z mov.u32 %r212, %ctaid.x; 2026-02-21T08:40:56.0657341Z mov.u32 %r213, %ctaid.y; 2026-02-21T08:40:56.0657495Z mov.u32 %r214, %ctaid.z; 2026-02-21T08:40:56.0657655Z mov.u32 %r215, %nctaid.x; 2026-02-21T08:40:56.0657811Z mov.u32 %r216, %nctaid.y; 2026-02-21T08:40:56.0658023Z mad.lo.s32 %r217, %r214, %r216, %r213; 2026-02-21T08:40:56.0658213Z mad.lo.s32 %r218, %r217, %r215, %r212; 2026-02-21T08:40:56.0658392Z shl.b32 %r219, %r218, 7; 2026-02-21T08:40:56.0658579Z cvt.s64.s32 %rd48, %r219; 2026-02-21T08:40:56.0658746Z add.s64 %rd20, %rd47, %rd48; 2026-02-21T08:40:56.0658911Z shl.b32 %r220, %r1, 2; 
2026-02-21T08:40:56.0659063Z add.s32 %r122, %r121, %r220; 2026-02-21T08:40:56.0659225Z mov.b32 %r835, 0; 2026-02-21T08:40:56.0659365Z // begin inline asm 2026-02-21T08:40:56.0659529Z @%p3 st.shared.b32 [ %r122 + 0 ], %r835; 2026-02-21T08:40:56.0659705Z // end inline asm 2026-02-21T08:40:56.0659852Z bar.warp.sync -1; 2026-02-21T08:40:56.0660001Z setp.eq.b32 %p6, %r1, 0; 2026-02-21T08:40:56.0660161Z cvt.u64.u32 %rd5, %r121; 2026-02-21T08:40:56.0660311Z // begin inline asm 2026-02-21T08:40:56.0660563Z @%p6 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd5 + 0 ], %rd6; 2026-02-21T08:40:56.0660839Z // end inline asm 2026-02-21T08:40:56.0660975Z // begin inline asm 2026-02-21T08:40:56.0661202Z @%p6 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x1; 2026-02-21T08:40:56.0661447Z // end inline asm 2026-02-21T08:40:56.0661628Z mov.b32 %r847, 16; 2026-02-21T08:40:56.0661765Z // begin inline asm 2026-02-21T08:40:56.0662001Z @%p6 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x0, %r847; 2026-02-21T08:40:56.0662270Z // end inline asm 2026-02-21T08:40:56.0662404Z mov.b32 %r125, 256; 2026-02-21T08:40:56.0662549Z // begin inline asm 2026-02-21T08:40:56.0662774Z @%p6 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x1, %r125; 2026-02-21T08:40:56.0663038Z // end inline asm 2026-02-21T08:40:56.0663172Z mov.b32 %r126, 8192; 2026-02-21T08:40:56.0663319Z // begin inline asm 2026-02-21T08:40:56.0663556Z @%p6 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x0, %r126; 2026-02-21T08:40:56.0663830Z // end inline asm 2026-02-21T08:40:56.0663973Z mov.b32 %r127, 16384; 2026-02-21T08:40:56.0664119Z // begin inline asm 2026-02-21T08:40:56.0664362Z @%p6 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x1, %r127; 2026-02-21T08:40:56.0664628Z // end inline asm 2026-02-21T08:40:56.0664771Z mov.b64 %rd13, 16384; 2026-02-21T08:40:56.0664914Z // begin inline asm 
2026-02-21T08:40:56.0665164Z @%p6 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd5 + 0 ], 0x0, %rd13; 2026-02-21T08:40:56.0665436Z // end inline asm 2026-02-21T08:40:56.0665580Z mov.b32 %r834, 1; 2026-02-21T08:40:56.0665721Z // begin inline asm 2026-02-21T08:40:56.0665967Z @%p6 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x0, %r834; 2026-02-21T08:40:56.0666284Z // end inline asm 2026-02-21T08:40:56.0666423Z // begin inline asm 2026-02-21T08:40:56.0666682Z @%p6 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x1, %r834; 2026-02-21T08:40:56.0666961Z // end inline asm 2026-02-21T08:40:56.0667112Z // begin inline asm 2026-02-21T08:40:56.0667354Z @%p6 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd5 + 0 ], 0xa; 2026-02-21T08:40:56.0667665Z // end inline asm 2026-02-21T08:40:56.0667797Z // begin inline asm 2026-02-21T08:40:56.0668045Z @%p6 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x0; 2026-02-21T08:40:56.0668319Z // end inline asm 2026-02-21T08:40:56.0668459Z // begin inline asm 2026-02-21T08:40:56.0668694Z @%p6 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x1; 2026-02-21T08:40:56.0668962Z // end inline asm 2026-02-21T08:40:56.0669104Z // begin inline asm 2026-02-21T08:40:56.0669329Z @%p6 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd5 + 0 ], 0x0; 2026-02-21T08:40:56.0669591Z // end inline asm 2026-02-21T08:40:56.0669725Z // begin inline asm 2026-02-21T08:40:56.0670066Z @%p3 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd20 + 0 ], [ %rd5 + 0 ], 0x80; 2026-02-21T08:40:56.0670453Z // end inline asm 2026-02-21T08:40:56.0670596Z // begin inline asm 2026-02-21T08:40:56.0670807Z @%p3 fence.proxy.tensormap::generic.acquire.gpu [ %rd20 + 0 ], 0x80; 2026-02-21T08:40:56.0671090Z @%p3 cp.async.bulk.commit_group ; 2026-02-21T08:40:56.0671290Z @%p3 cp.async.bulk.wait_group.read 0 
; 2026-02-21T08:40:56.0671466Z // end inline asm 2026-02-21T08:40:56.0671643Z bar.sync 0; 2026-02-21T08:40:56.0671787Z cvta.global.u64 %rd94, %rd20; 2026-02-21T08:40:56.0672071Z .loc 1 27 35 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:27:35 2026-02-21T08:40:56.0672357Z shl.b32 %r863, %r212, 1; 2026-02-21T08:40:56.0672632Z .loc 1 28 37 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:28:37 2026-02-21T08:40:56.0672925Z add.s32 %r221, %r863, 2; 2026-02-21T08:40:56.0673184Z .loc 1 28 49 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:28:49 2026-02-21T08:40:56.0673474Z min.s32 %r222, %r221, 16384; 2026-02-21T08:40:56.0673741Z .loc 1 41 45 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:41:45 2026-02-21T08:40:56.0674029Z shr.u32 %r223, %r1, 5; 2026-02-21T08:40:56.0674184Z and.b32 %r4, %r1, 255; 2026-02-21T08:40:56.0674340Z or.b32 %r5, %r4, 256; 2026-02-21T08:40:56.0674602Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0674894Z sub.s32 %r224, %r222, %r863; 2026-02-21T08:40:56.0675056Z shl.b32 %r7, %r224, 5; 2026-02-21T08:40:56.0675200Z $L__tmp1: 2026-02-21T08:40:56.0675498Z .loc 2 291 36 // standard.py:291:36 @[ crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:94:40 ] 2026-02-21T08:40:56.0675843Z shfl.sync.idx.b32 %r10, %r223, 0, 31, -1; 2026-02-21T08:40:56.0676033Z shl.b32 %r225, %r10, 21; 2026-02-21T08:40:56.0676186Z and.b32 %r11, %r225, 6291456; 2026-02-21T08:40:56.0676352Z and.b32 %r12, %r10, 4; 2026-02-21T08:40:56.0676510Z shl.b32 %r226, %r12, 2; 2026-02-21T08:40:56.0676666Z or.b32 %r13, %r226, %r11; 2026-02-21T08:40:56.0676832Z add.s32 %r359, %r13, %r829; 2026-02-21T08:40:56.0676992Z mov.pred %p23, -1; 2026-02-21T08:40:56.0677148Z // begin inline asm 2026-02-21T08:40:56.0677514Z @%p23 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r359 + 0], {%r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835}; 
2026-02-21T08:40:56.0677929Z // end inline asm 2026-02-21T08:40:56.0678074Z // begin inline asm 2026-02-21T08:40:56.0678455Z @%p23 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r359 + 32], {%r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835, %r835}; 2026-02-21T08:40:56.0678898Z // end inline asm 2026-02-21T08:40:56.0679045Z // begin inline asm 2026-02-21T08:40:56.0679213Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:40:56.0679384Z // end inline asm 2026-02-21T08:40:56.0679532Z bar.sync 0; 2026-02-21T08:40:56.0679667Z $L__tmp2: 2026-02-21T08:40:56.0679927Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0680272Z setp.lt.s32 %p25, %r7, 1; 2026-02-21T08:40:56.0680440Z setp.gt.s32 %p26, %r7, 0; 2026-02-21T08:40:56.0680720Z .loc 1 35 35 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:35:35 2026-02-21T08:40:56.0681019Z bfe.s32 %r227, %r212, 30, 1; 2026-02-21T08:40:56.0681190Z shr.u32 %r228, %r227, 21; 2026-02-21T08:40:56.0681349Z add.s32 %r229, %r863, %r228; 2026-02-21T08:40:56.0681517Z shr.s32 %r230, %r229, 11; 2026-02-21T08:40:56.0681818Z .loc 1 36 33 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:36:33 2026-02-21T08:40:56.0682120Z shl.b32 %r231, %r230, 2; 2026-02-21T08:40:56.0682396Z .loc 1 37 39 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:37:39 2026-02-21T08:40:56.0682688Z sub.s32 %r232, 32, %r231; 2026-02-21T08:40:56.0683020Z .loc 1 37 52 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:37:52 2026-02-21T08:40:56.0683316Z min.s32 %r233, %r232, 4; 2026-02-21T08:40:56.0683620Z .loc 1 38 45 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:38:45 2026-02-21T08:40:56.0683915Z and.b32 %r234, %r229, -2048; 2026-02-21T08:40:56.0684088Z sub.s32 %r235, %r863, %r234; 2026-02-21T08:40:56.0684365Z .loc 1 39 51 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:39:51 2026-02-21T08:40:56.0684652Z div.s32 %r15, %r235, 
%r233; 2026-02-21T08:40:56.0684933Z .loc 1 38 64 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:38:64 2026-02-21T08:40:56.0685228Z mul.lo.s32 %r236, %r15, %r233; 2026-02-21T08:40:56.0685412Z sub.s32 %r237, %r235, %r236; 2026-02-21T08:40:56.0685664Z .loc 1 38 30 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:38:30 2026-02-21T08:40:56.0685957Z add.s32 %r238, %r237, %r231; 2026-02-21T08:40:56.0686225Z .loc 1 40 27 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:40:27 2026-02-21T08:40:56.0686502Z shl.b32 %r832, %r238, 9; 2026-02-21T08:40:56.0686761Z .loc 1 41 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:41:32 2026-02-21T08:40:56.0687036Z or.b32 %r864, %r832, %r4; 2026-02-21T08:40:56.0687193Z or.b32 %r865, %r832, %r5; 2026-02-21T08:40:56.0687448Z .loc 1 58 53 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:53 2026-02-21T08:40:56.0687734Z shl.b32 %r239, %r864, 10; 2026-02-21T08:40:56.0687889Z shl.b32 %r240, %r865, 10; 2026-02-21T08:40:56.0688138Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0688433Z mad.wide.s32 %rd23, %r239, 2, %rd3; 2026-02-21T08:40:56.0688615Z mad.wide.s32 %rd24, %r240, 2, %rd3; 2026-02-21T08:40:56.0688896Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0689169Z bar.sync 0; 2026-02-21T08:40:56.0689310Z shl.b32 %r19, %r4, 4; 2026-02-21T08:40:56.0689460Z add.s32 %r164, %r121, %r19; 2026-02-21T08:40:56.0689628Z selp.b32 %r165, 16, 0, %p26; 2026-02-21T08:40:56.0689791Z // begin inline asm 2026-02-21T08:40:56.0689994Z cp.async.cg.shared.global [ %r164 + 0 ], [ %rd23 + 0 ], 0x10, %r165; 2026-02-21T08:40:56.0690228Z // end inline asm 2026-02-21T08:40:56.0690369Z add.s32 %r166, %r164, 4096; 2026-02-21T08:40:56.0690534Z // begin inline asm 2026-02-21T08:40:56.0690734Z cp.async.cg.shared.global [ %r166 + 0 ], [ %rd24 + 0 ], 0x10, %r165; 2026-02-21T08:40:56.0690995Z // end inline asm 
2026-02-21T08:40:56.0691138Z cp.async.commit_group; 2026-02-21T08:40:56.0691402Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0691712Z add.s64 %rd25, %rd23, 16; 2026-02-21T08:40:56.0691865Z add.s64 %rd26, %rd24, 16; 2026-02-21T08:40:56.0692155Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0692430Z add.s32 %r168, %r164, 24576; 2026-02-21T08:40:56.0692594Z // begin inline asm 2026-02-21T08:40:56.0692792Z cp.async.cg.shared.global [ %r168 + 0 ], [ %rd25 + 0 ], 0x10, %r165; 2026-02-21T08:40:56.0693020Z // end inline asm 2026-02-21T08:40:56.0693157Z add.s32 %r170, %r164, 28672; 2026-02-21T08:40:56.0693315Z // begin inline asm 2026-02-21T08:40:56.0693514Z cp.async.cg.shared.global [ %r170 + 0 ], [ %rd26 + 0 ], 0x10, %r165; 2026-02-21T08:40:56.0693733Z // end inline asm 2026-02-21T08:40:56.0693883Z cp.async.commit_group; 2026-02-21T08:40:56.0694138Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0694418Z add.s64 %rd27, %rd23, 32; 2026-02-21T08:40:56.0694568Z add.s64 %rd28, %rd24, 32; 2026-02-21T08:40:56.0694857Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0695142Z add.s32 %r172, %r164, 49152; 2026-02-21T08:40:56.0695293Z // begin inline asm 2026-02-21T08:40:56.0695525Z cp.async.cg.shared.global [ %r172 + 0 ], [ %rd27 + 0 ], 0x10, %r165; 2026-02-21T08:40:56.0695746Z // end inline asm 2026-02-21T08:40:56.0695890Z add.s32 %r174, %r164, 53248; 2026-02-21T08:40:56.0696040Z // begin inline asm 2026-02-21T08:40:56.0696239Z cp.async.cg.shared.global [ %r174 + 0 ], [ %rd28 + 0 ], 0x10, %r165; 2026-02-21T08:40:56.0696457Z // end inline asm 2026-02-21T08:40:56.0696605Z cp.async.commit_group; 2026-02-21T08:40:56.0696862Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0697134Z add.s64 %rd29, %rd23, 48; 
2026-02-21T08:40:56.0697291Z add.s64 %rd30, %rd24, 48; 2026-02-21T08:40:56.0697542Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0697825Z add.s32 %r176, %r164, 73728; 2026-02-21T08:40:56.0697979Z // begin inline asm 2026-02-21T08:40:56.0698180Z cp.async.cg.shared.global [ %r176 + 0 ], [ %rd29 + 0 ], 0x10, %r165; 2026-02-21T08:40:56.0698398Z // end inline asm 2026-02-21T08:40:56.0698546Z add.s32 %r178, %r164, 77824; 2026-02-21T08:40:56.0698705Z // begin inline asm 2026-02-21T08:40:56.0698898Z cp.async.cg.shared.global [ %r178 + 0 ], [ %rd30 + 0 ], 0x10, %r165; 2026-02-21T08:40:56.0699128Z // end inline asm 2026-02-21T08:40:56.0699267Z cp.async.commit_group; 2026-02-21T08:40:56.0699533Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0699820Z setp.gt.s32 %p27, %r7, 1; 2026-02-21T08:40:56.0700080Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0700357Z add.s64 %rd31, %rd23, 64; 2026-02-21T08:40:56.0700515Z add.s64 %rd32, %rd24, 64; 2026-02-21T08:40:56.0700774Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0701048Z bar.sync 0; 2026-02-21T08:40:56.0701195Z add.s32 %r180, %r164, 8192; 2026-02-21T08:40:56.0701359Z selp.b32 %r181, 16, 0, %p27; 2026-02-21T08:40:56.0701522Z // begin inline asm 2026-02-21T08:40:56.0701761Z cp.async.cg.shared.global [ %r180 + 0 ], [ %rd31 + 0 ], 0x10, %r181; 2026-02-21T08:40:56.0701986Z // end inline asm 2026-02-21T08:40:56.0702123Z add.s32 %r182, %r164, 12288; 2026-02-21T08:40:56.0702283Z // begin inline asm 2026-02-21T08:40:56.0702485Z cp.async.cg.shared.global [ %r182 + 0 ], [ %rd32 + 0 ], 0x10, %r181; 2026-02-21T08:40:56.0702741Z // end inline asm 2026-02-21T08:40:56.0702893Z cp.async.commit_group; 2026-02-21T08:40:56.0703147Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0703432Z 
add.s64 %rd33, %rd23, 80; 2026-02-21T08:40:56.0703584Z add.s64 %rd34, %rd24, 80; 2026-02-21T08:40:56.0703843Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0704153Z add.s32 %r184, %r164, 32768; 2026-02-21T08:40:56.0704308Z // begin inline asm 2026-02-21T08:40:56.0704513Z cp.async.cg.shared.global [ %r184 + 0 ], [ %rd33 + 0 ], 0x10, %r181; 2026-02-21T08:40:56.0704736Z // end inline asm 2026-02-21T08:40:56.0704882Z add.s32 %r186, %r164, 36864; 2026-02-21T08:40:56.0705036Z // begin inline asm 2026-02-21T08:40:56.0705237Z cp.async.cg.shared.global [ %r186 + 0 ], [ %rd34 + 0 ], 0x10, %r181; 2026-02-21T08:40:56.0705457Z // end inline asm 2026-02-21T08:40:56.0705607Z cp.async.commit_group; 2026-02-21T08:40:56.0705865Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0706138Z add.s64 %rd35, %rd23, 96; 2026-02-21T08:40:56.0706296Z add.s64 %rd36, %rd24, 96; 2026-02-21T08:40:56.0706575Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0706855Z add.s32 %r188, %r164, 57344; 2026-02-21T08:40:56.0707008Z // begin inline asm 2026-02-21T08:40:56.0707231Z cp.async.cg.shared.global [ %r188 + 0 ], [ %rd35 + 0 ], 0x10, %r181; 2026-02-21T08:40:56.0707449Z // end inline asm 2026-02-21T08:40:56.0707590Z add.s32 %r190, %r164, 61440; 2026-02-21T08:40:56.0707747Z // begin inline asm 2026-02-21T08:40:56.0707938Z cp.async.cg.shared.global [ %r190 + 0 ], [ %rd36 + 0 ], 0x10, %r181; 2026-02-21T08:40:56.0708168Z // end inline asm 2026-02-21T08:40:56.0708308Z cp.async.commit_group; 2026-02-21T08:40:56.0708572Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0708852Z add.s64 %rd37, %rd23, 112; 2026-02-21T08:40:56.0709018Z add.s64 %rd38, %rd24, 112; 2026-02-21T08:40:56.0709275Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0709563Z add.s32 %r192, 
%r164, 81920; 2026-02-21T08:40:56.0709724Z // begin inline asm 2026-02-21T08:40:56.0709919Z cp.async.cg.shared.global [ %r192 + 0 ], [ %rd37 + 0 ], 0x10, %r181; 2026-02-21T08:40:56.0710151Z // end inline asm 2026-02-21T08:40:56.0710288Z add.s32 %r194, %r164, 86016; 2026-02-21T08:40:56.0710449Z // begin inline asm 2026-02-21T08:40:56.0710642Z cp.async.cg.shared.global [ %r194 + 0 ], [ %rd38 + 0 ], 0x10, %r181; 2026-02-21T08:40:56.0710869Z // end inline asm 2026-02-21T08:40:56.0711008Z cp.async.commit_group; 2026-02-21T08:40:56.0711277Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0711613Z setp.gt.s32 %p28, %r7, 2; 2026-02-21T08:40:56.0711880Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0712168Z add.s64 %rd39, %rd23, 128; 2026-02-21T08:40:56.0712325Z add.s64 %rd40, %rd24, 128; 2026-02-21T08:40:56.0712591Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0712864Z bar.sync 0; 2026-02-21T08:40:56.0713006Z add.s32 %r196, %r164, 16384; 2026-02-21T08:40:56.0713169Z selp.b32 %r197, 16, 0, %p28; 2026-02-21T08:40:56.0713324Z // begin inline asm 2026-02-21T08:40:56.0713531Z cp.async.cg.shared.global [ %r196 + 0 ], [ %rd39 + 0 ], 0x10, %r197; 2026-02-21T08:40:56.0713756Z // end inline asm 2026-02-21T08:40:56.0713903Z add.s32 %r198, %r164, 20480; 2026-02-21T08:40:56.0714054Z // begin inline asm 2026-02-21T08:40:56.0714256Z cp.async.cg.shared.global [ %r198 + 0 ], [ %rd40 + 0 ], 0x10, %r197; 2026-02-21T08:40:56.0714508Z // end inline asm 2026-02-21T08:40:56.0714655Z cp.async.commit_group; 2026-02-21T08:40:56.0714915Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0715194Z add.s64 %rd41, %rd23, 144; 2026-02-21T08:40:56.0715357Z add.s64 %rd42, %rd24, 144; 2026-02-21T08:40:56.0715612Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 
2026-02-21T08:40:56.0715930Z add.s32 %r200, %r164, 40960; 2026-02-21T08:40:56.0716081Z // begin inline asm 2026-02-21T08:40:56.0716287Z cp.async.cg.shared.global [ %r200 + 0 ], [ %rd41 + 0 ], 0x10, %r197; 2026-02-21T08:40:56.0716508Z // end inline asm 2026-02-21T08:40:56.0716653Z add.s32 %r202, %r164, 45056; 2026-02-21T08:40:56.0716812Z // begin inline asm 2026-02-21T08:40:56.0717007Z cp.async.cg.shared.global [ %r202 + 0 ], [ %rd42 + 0 ], 0x10, %r197; 2026-02-21T08:40:56.0717233Z // end inline asm 2026-02-21T08:40:56.0717371Z cp.async.commit_group; 2026-02-21T08:40:56.0717635Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0717916Z add.s64 %rd43, %rd23, 160; 2026-02-21T08:40:56.0718077Z add.s64 %rd44, %rd24, 160; 2026-02-21T08:40:56.0718361Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0718644Z add.s32 %r204, %r164, 65536; 2026-02-21T08:40:56.0718802Z // begin inline asm 2026-02-21T08:40:56.0719025Z cp.async.cg.shared.global [ %r204 + 0 ], [ %rd43 + 0 ], 0x10, %r197; 2026-02-21T08:40:56.0719254Z // end inline asm 2026-02-21T08:40:56.0719391Z add.s32 %r206, %r164, 69632; 2026-02-21T08:40:56.0719548Z // begin inline asm 2026-02-21T08:40:56.0719744Z cp.async.cg.shared.global [ %r206 + 0 ], [ %rd44 + 0 ], 0x10, %r197; 2026-02-21T08:40:56.0719970Z // end inline asm 2026-02-21T08:40:56.0720110Z cp.async.commit_group; 2026-02-21T08:40:56.0720378Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0720674Z add.s64 %rd45, %rd23, 176; 2026-02-21T08:40:56.0720831Z add.s64 %rd46, %rd24, 176; 2026-02-21T08:40:56.0721094Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0721386Z add.s32 %r208, %r164, 90112; 2026-02-21T08:40:56.0721585Z // begin inline asm 2026-02-21T08:40:56.0721802Z cp.async.cg.shared.global [ %r208 + 0 ], [ %rd45 + 0 ], 0x10, %r197; 2026-02-21T08:40:56.0722046Z 
// end inline asm 2026-02-21T08:40:56.0722204Z add.s32 %r210, %r164, 94208; 2026-02-21T08:40:56.0722364Z // begin inline asm 2026-02-21T08:40:56.0722574Z cp.async.cg.shared.global [ %r210 + 0 ], [ %rd46 + 0 ], 0x10, %r197; 2026-02-21T08:40:56.0722804Z // end inline asm 2026-02-21T08:40:56.0722960Z cp.async.commit_group; 2026-02-21T08:40:56.0723232Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0723541Z @%p25 bra $L__BB0_15; 2026-02-21T08:40:56.0723713Z // %bb.1: // %.lr.ph 2026-02-21T08:40:56.0724031Z .loc 1 0 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:0:121 2026-02-21T08:40:56.0724381Z ld.param.b64 %rd4, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T08:40:56.0724599Z and.b32 %r6, %r1, 15; 2026-02-21T08:40:56.0724765Z bfe.u32 %r8, %r1, 5, 2; 2026-02-21T08:40:56.0724922Z and.b32 %r9, %r1, 16; 2026-02-21T08:40:56.0725080Z add.s32 %r446, %r829, 64; 2026-02-21T08:40:56.0725244Z add.s32 %r567, %r829, 128; 2026-02-21T08:40:56.0725416Z add.s32 %r688, %r829, 192; 2026-02-21T08:40:56.0725685Z .loc 1 42 27 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:42:27 2026-02-21T08:40:56.0725985Z shl.b32 %r830, %r15, 4; 2026-02-21T08:40:56.0726260Z .loc 1 43 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:43:32 2026-02-21T08:40:56.0726554Z or.b32 %r845, %r830, %r6; 2026-02-21T08:40:56.0726755Z add.s32 %r25, %r7, -3; 2026-02-21T08:40:56.0726912Z shl.b32 %r256, %r1, 5; 2026-02-21T08:40:56.0727074Z and.b32 %r257, %r256, 352; 2026-02-21T08:40:56.0727232Z shr.u32 %r258, %r1, 2; 2026-02-21T08:40:56.0727392Z and.b32 %r259, %r258, 28; 2026-02-21T08:40:56.0727551Z bfe.s32 %r260, %r1, 2, 1; 2026-02-21T08:40:56.0727715Z and.b32 %r261, %r260, 144; 2026-02-21T08:40:56.0727944Z xor.b32 %r262, %r261, %r259; 2026-02-21T08:40:56.0728108Z add.s32 %r264, %r121, 114688; 2026-02-21T08:40:56.0728281Z add.s32 %r265, %r264, %r257; 2026-02-21T08:40:56.0728444Z add.s32 %r26, %r265, %r262; 
2026-02-21T08:40:56.0728615Z add.s32 %r689, %r829, 256; 2026-02-21T08:40:56.0728775Z add.s32 %r266, %r11, %r689; 2026-02-21T08:40:56.0728943Z shl.b32 %r267, %r12, 1; 2026-02-21T08:40:56.0729099Z add.s32 %r653, %r266, %r267; 2026-02-21T08:40:56.0729266Z bfe.u32 %r268, %r264, 4, 14; 2026-02-21T08:40:56.0729437Z cvt.u64.u32 %rd49, %r268; 2026-02-21T08:40:56.0729613Z or.b64 %rd81, %rd49, -4611685949705814016; 2026-02-21T08:40:56.0729809Z add.s32 %r480, %r13, %r446; 2026-02-21T08:40:56.0729963Z add.s32 %r601, %r13, %r567; 2026-02-21T08:40:56.0730124Z add.s32 %r722, %r13, %r688; 2026-02-21T08:40:56.0730273Z and.b32 %r269, %r256, 8032; 2026-02-21T08:40:56.0730430Z or.b32 %r270, %r261, %r269; 2026-02-21T08:40:56.0730612Z add.s32 %r271, %r121, 98304; 2026-02-21T08:40:56.0730776Z add.s32 %r32, %r271, %r270; 2026-02-21T08:40:56.0730928Z xor.b32 %r272, %r270, 16; 2026-02-21T08:40:56.0731116Z add.s32 %r33, %r271, %r272; 2026-02-21T08:40:56.0731271Z and.b32 %r273, %r10, 1; 2026-02-21T08:40:56.0731432Z shl.b32 %r274, %r273, 13; 2026-02-21T08:40:56.0731629Z add.s32 %r811, %r271, %r274; 2026-02-21T08:40:56.0731786Z shl.b32 %r35, %r273, 8; 2026-02-21T08:40:56.0731948Z mov.pred %p109, 0; 2026-02-21T08:40:56.0732092Z mov.b32 %r850, 2; 2026-02-21T08:40:56.0732247Z mov.b32 %r849, -1; 2026-02-21T08:40:56.0732391Z mov.b32 %r846, 32; 2026-02-21T08:40:56.0732540Z mov.b32 %r844, 4; 2026-02-21T08:40:56.0732674Z mov.b32 %r843, 20; 2026-02-21T08:40:56.0732813Z mov.b32 %r842, 36; 2026-02-21T08:40:56.0732948Z mov.b32 %r841, 8; 2026-02-21T08:40:56.0733090Z mov.b32 %r840, 24; 2026-02-21T08:40:56.0733232Z mov.b32 %r839, 40; 2026-02-21T08:40:56.0733365Z mov.b32 %r838, 12; 2026-02-21T08:40:56.0733508Z mov.b32 %r837, 28; 2026-02-21T08:40:56.0733642Z mov.b32 %r836, 44; 2026-02-21T08:40:56.0733785Z mov.b32 %r831, %r830; 2026-02-21T08:40:56.0733929Z mov.b32 %r833, %r832; 2026-02-21T08:40:56.0734080Z mov.b32 %r848, %r835; 2026-02-21T08:40:56.0734220Z mov.b32 %r851, %r830; 2026-02-21T08:40:56.0734370Z 
mov.b32 %r852, %r832; 2026-02-21T08:40:56.0734510Z mov.b32 %r854, %r850; 2026-02-21T08:40:56.0734658Z mov.b32 %r855, %r835; 2026-02-21T08:40:56.0734804Z mov.b32 %r858, %r845; 2026-02-21T08:40:56.0734943Z mov.b32 %r859, %r845; 2026-02-21T08:40:56.0735088Z mov.b32 %r860, %r852; 2026-02-21T08:40:56.0735227Z mov.b32 %r861, %r851; 2026-02-21T08:40:56.0735375Z mov.b32 %r862, %r858; 2026-02-21T08:40:56.0735513Z bra.uni $L__BB0_2; 2026-02-21T08:40:56.0735706Z $L__BB0_14: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0736034Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0736325Z add.s32 %r855, %r855, 1; 2026-02-21T08:40:56.0736486Z setp.ne.b32 %p107, %r7, %r855; 2026-02-21T08:40:56.0736653Z mov.b32 %r830, %r851; 2026-02-21T08:40:56.0736797Z mov.b32 %r831, %r38; 2026-02-21T08:40:56.0736939Z mov.b32 %r832, %r852; 2026-02-21T08:40:56.0737085Z mov.b32 %r833, %r40; 2026-02-21T08:40:56.0737227Z mov.b32 %r834, %r854; 2026-02-21T08:40:56.0737373Z mov.b32 %r835, %r42; 2026-02-21T08:40:56.0737509Z mov.b32 %r838, %r45; 2026-02-21T08:40:56.0737656Z mov.b32 %r841, %r48; 2026-02-21T08:40:56.0737791Z mov.b32 %r844, %r51; 2026-02-21T08:40:56.0737936Z mov.b32 %r845, %r858; 2026-02-21T08:40:56.0738075Z mov.b32 %r848, %r55; 2026-02-21T08:40:56.0738254Z mov.b32 %r851, %r861; 2026-02-21T08:40:56.0738401Z mov.b32 %r852, %r860; 2026-02-21T08:40:56.0738540Z mov.b32 %r854, %r68; 2026-02-21T08:40:56.0738687Z mov.b32 %r858, %r862; 2026-02-21T08:40:56.0738827Z mov.b32 %r859, %r53; 2026-02-21T08:40:56.0738975Z @%p107 bra $L__BB0_2; 2026-02-21T08:40:56.0739119Z bra.uni $L__BB0_15; 2026-02-21T08:40:56.0739317Z $L__BB0_2: // =>This Inner Loop Header: Depth=1 2026-02-21T08:40:56.0739689Z .loc 1 0 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:0:121 2026-02-21T08:40:56.0739984Z mov.b32 %r55, %r847; 2026-02-21T08:40:56.0740132Z mov.b32 %r847, %r846; 2026-02-21T08:40:56.0740272Z mov.b32 %r53, %r845; 2026-02-21T08:40:56.0740417Z 
mov.b32 %r51, %r843; 2026-02-21T08:40:56.0740554Z mov.b32 %r843, %r842; 2026-02-21T08:40:56.0740703Z mov.b32 %r48, %r840; 2026-02-21T08:40:56.0740840Z mov.b32 %r840, %r839; 2026-02-21T08:40:56.0740985Z mov.b32 %r45, %r837; 2026-02-21T08:40:56.0741122Z mov.b32 %r837, %r836; 2026-02-21T08:40:56.0741270Z mov.b32 %r42, %r834; 2026-02-21T08:40:56.0741408Z mov.b32 %r40, %r832; 2026-02-21T08:40:56.0741578Z mov.b32 %r38, %r830; 2026-02-21T08:40:56.0741839Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0742161Z add.s32 %r275, %r854, 1; 2026-02-21T08:40:56.0742332Z setp.eq.b32 %p30, %r854, 31; 2026-02-21T08:40:56.0742503Z selp.b32 %r68, 0, %r275, %p30; 2026-02-21T08:40:56.0742720Z setp.ne.b32 %p31, %r68, 0; 2026-02-21T08:40:56.0742880Z @%p31 bra $L__BB0_4; 2026-02-21T08:40:56.0743075Z // %bb.3: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0743281Z add.s32 %r863, %r863, 1; 2026-02-21T08:40:56.0743546Z .loc 1 35 35 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:35:35 2026-02-21T08:40:56.0743837Z shr.s32 %r276, %r863, 31; 2026-02-21T08:40:56.0743992Z shr.u32 %r277, %r276, 21; 2026-02-21T08:40:56.0744155Z add.s32 %r278, %r863, %r277; 2026-02-21T08:40:56.0744311Z shr.s32 %r279, %r278, 11; 2026-02-21T08:40:56.0744577Z .loc 1 36 33 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:36:33 2026-02-21T08:40:56.0744857Z shl.b32 %r280, %r279, 2; 2026-02-21T08:40:56.0745118Z .loc 1 37 39 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:37:39 2026-02-21T08:40:56.0745399Z sub.s32 %r281, 32, %r280; 2026-02-21T08:40:56.0745652Z .loc 1 37 52 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:37:52 2026-02-21T08:40:56.0745935Z min.s32 %r282, %r281, 4; 2026-02-21T08:40:56.0746187Z .loc 1 38 45 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:38:45 2026-02-21T08:40:56.0746470Z and.b32 %r283, %r278, -2048; 2026-02-21T08:40:56.0746626Z sub.s32 %r284, %r863, %r283; 
2026-02-21T08:40:56.0746890Z .loc 1 39 51 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:39:51 2026-02-21T08:40:56.0747175Z div.s32 %r285, %r284, %r282; 2026-02-21T08:40:56.0747430Z .loc 1 38 64 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:38:64 2026-02-21T08:40:56.0747720Z mul.lo.s32 %r286, %r285, %r282; 2026-02-21T08:40:56.0747885Z sub.s32 %r287, %r284, %r286; 2026-02-21T08:40:56.0748148Z .loc 1 38 30 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:38:30 2026-02-21T08:40:56.0748432Z add.s32 %r288, %r287, %r280; 2026-02-21T08:40:56.0748699Z .loc 1 40 27 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:40:27 2026-02-21T08:40:56.0748983Z shl.b32 %r860, %r288, 9; 2026-02-21T08:40:56.0749238Z .loc 1 41 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:41:32 2026-02-21T08:40:56.0749525Z or.b32 %r864, %r860, %r4; 2026-02-21T08:40:56.0749677Z or.b32 %r865, %r860, %r5; 2026-02-21T08:40:56.0749943Z .loc 1 42 27 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:42:27 2026-02-21T08:40:56.0750254Z shl.b32 %r861, %r285, 4; 2026-02-21T08:40:56.0750519Z .loc 1 43 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:43:32 2026-02-21T08:40:56.0750817Z or.b32 %r862, %r861, %r6; 2026-02-21T08:40:56.0751017Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0751372Z .loc 1 0 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:0:32 2026-02-21T08:40:56.0751691Z setp.ne.b32 %p35, %r10, 0; 2026-02-21T08:40:56.0751965Z .loc 1 81 38 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:81:38 2026-02-21T08:40:56.0752247Z setp.eq.b32 %p36, %r9, 0; 2026-02-21T08:40:56.0752520Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0752818Z add.s32 %r308, %r849, 1; 2026-02-21T08:40:56.0752974Z setp.gt.s32 %p37, %r308, 2; 2026-02-21T08:40:56.0753150Z selp.b32 %r849, 0, %r308, %p37; 2026-02-21T08:40:56.0753421Z .loc 1 51 35 // 
crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:51:35 2026-02-21T08:40:56.0753715Z add.s32 %r309, %r848, %r8; 2026-02-21T08:40:56.0754009Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0754303Z cp.async.wait_group 8; 2026-02-21T08:40:56.0754457Z bar.sync 0; 2026-02-21T08:40:56.0754631Z shl.b32 %r82, %r849, 12; 2026-02-21T08:40:56.0754794Z shl.b32 %r310, %r849, 13; 2026-02-21T08:40:56.0754949Z add.s32 %r312, %r121, %r310; 2026-02-21T08:40:56.0755116Z add.s32 %r313, %r312, %r19; 2026-02-21T08:40:56.0755315Z ld.shared.v4.b32 {%r314, %r315, %r316, %r317}, [%r313]; 2026-02-21T08:40:56.0755532Z mov.b32 {%rs2, %rs3}, %r317; 2026-02-21T08:40:56.0755691Z mov.b32 {%rs4, %rs5}, %r316; 2026-02-21T08:40:56.0755859Z mov.b32 {%rs6, %rs7}, %r315; 2026-02-21T08:40:56.0756020Z mov.b32 {%rs8, %rs9}, %r314; 2026-02-21T08:40:56.0756223Z ld.shared.v4.b32 {%r318, %r319, %r320, %r321}, [%r313+4096]; 2026-02-21T08:40:56.0756449Z mov.b32 {%rs10, %rs11}, %r321; 2026-02-21T08:40:56.0756617Z mov.b32 {%rs12, %rs13}, %r320; 2026-02-21T08:40:56.0756791Z mov.b32 {%rs14, %rs15}, %r319; 2026-02-21T08:40:56.0756953Z mov.b32 {%rs16, %rs17}, %r318; 2026-02-21T08:40:56.0757229Z .loc 1 62 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:62:32 2026-02-21T08:40:56.0757516Z cvt.f32.bf16 %r290, %rs8; 2026-02-21T08:40:56.0757682Z cvt.f32.bf16 %r291, %rs9; 2026-02-21T08:40:56.0757836Z cvt.f32.bf16 %r292, %rs6; 2026-02-21T08:40:56.0757996Z cvt.f32.bf16 %r293, %rs7; 2026-02-21T08:40:56.0758155Z cvt.f32.bf16 %r294, %rs4; 2026-02-21T08:40:56.0758305Z cvt.f32.bf16 %r295, %rs5; 2026-02-21T08:40:56.0758462Z cvt.f32.bf16 %r296, %rs2; 2026-02-21T08:40:56.0758611Z cvt.f32.bf16 %r297, %rs3; 2026-02-21T08:40:56.0758772Z cvt.f32.bf16 %r299, %rs16; 2026-02-21T08:40:56.0758934Z cvt.f32.bf16 %r300, %rs17; 2026-02-21T08:40:56.0759095Z cvt.f32.bf16 %r301, %rs14; 2026-02-21T08:40:56.0759250Z cvt.f32.bf16 %r302, %rs15; 2026-02-21T08:40:56.0759413Z 
cvt.f32.bf16 %r303, %rs12; 2026-02-21T08:40:56.0759566Z cvt.f32.bf16 %r304, %rs13; 2026-02-21T08:40:56.0759728Z cvt.f32.bf16 %r305, %rs10; 2026-02-21T08:40:56.0759892Z cvt.f32.bf16 %r306, %rs11; 2026-02-21T08:40:56.0760153Z .loc 1 64 55 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:55 2026-02-21T08:40:56.0760445Z shl.b32 %r322, %r309, 13; 2026-02-21T08:40:56.0760704Z .loc 1 64 62 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:62 2026-02-21T08:40:56.0761001Z add.s32 %r323, %r859, %r322; 2026-02-21T08:40:56.0761265Z .loc 1 64 34 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:34 2026-02-21T08:40:56.0761579Z cvt.s64.s32 %rd53, %r323; 2026-02-21T08:40:56.0761780Z add.s64 %rd51, %rd4, %rd53; 2026-02-21T08:40:56.0762047Z .loc 1 64 87 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:87 2026-02-21T08:40:56.0762333Z // begin inline asm 2026-02-21T08:40:56.0762477Z mov.u64 %rd50, 0x0; 2026-02-21T08:40:56.0762676Z createpolicy.fractional.L2::evict_last.b64 %rd50, 1.0; 2026-02-21T08:40:56.0762889Z // end inline asm 2026-02-21T08:40:56.0763066Z // begin inline asm 2026-02-21T08:40:56.0763205Z mov.u16 %rs1, 0x0; 2026-02-21T08:40:56.0763425Z ld.global.L1::evict_last.L2::cache_hint.b8 { %rs1 }, [ %rd51 + 0 ], %rd50; 2026-02-21T08:40:56.0763664Z // end inline asm 2026-02-21T08:40:56.0763904Z .loc 1 67 28 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:67:28 2026-02-21T08:40:56.0764191Z shl.b16 %rs18, %rs1, 4; 2026-02-21T08:40:56.0764441Z .loc 1 82 58 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:82:58 2026-02-21T08:40:56.0764747Z selp.b16 %rs19, %rs18, %rs1, %p36; 2026-02-21T08:40:56.0764932Z cvt.s16.s8 %rs20, %rs19; 2026-02-21T08:40:56.0765102Z shr.s16 %rs21, %rs20, 4; 2026-02-21T08:40:56.0765368Z .loc 1 87 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:87:32 2026-02-21T08:40:56.0765655Z cvt.rn.f32.s16 %r324, %rs21; 2026-02-21T08:40:56.0765861Z st.shared.b32 [%r26], %r324; 
2026-02-21T08:40:56.0766033Z mov.pred %p48, -1; 2026-02-21T08:40:56.0766188Z $L__tmp3: 2026-02-21T08:40:56.0766519Z .loc 2 291 36 // standard.py:291:36 @[ crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:94:40 ] 2026-02-21T08:40:56.0766875Z // begin inline asm 2026-02-21T08:40:56.0767167Z @%p48 tcgen05.st.sync.aligned.32x32b.x8.b32 [%r653 + 0], {%r290, %r291, %r292, %r293, %r294, %r295, %r296, %r297}; 2026-02-21T08:40:56.0767494Z // end inline asm 2026-02-21T08:40:56.0767642Z // begin inline asm 2026-02-21T08:40:56.0767929Z @%p48 tcgen05.st.sync.aligned.32x32b.x8.b32 [%r653 + 16], {%r299, %r300, %r301, %r302, %r303, %r304, %r305, %r306}; 2026-02-21T08:40:56.0768252Z // end inline asm 2026-02-21T08:40:56.0768395Z // begin inline asm 2026-02-21T08:40:56.0768562Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:40:56.0768732Z // end inline asm 2026-02-21T08:40:56.0768880Z bar.sync 0; 2026-02-21T08:40:56.0769016Z // begin inline asm 2026-02-21T08:40:56.0769188Z fence.proxy.async.shared::cta; 2026-02-21T08:40:56.0769369Z // end inline asm 2026-02-21T08:40:56.0769515Z add.s32 %r703, %r121, 115200; 2026-02-21T08:40:56.0769684Z // begin inline asm 2026-02-21T08:40:56.0769858Z @%p6 mbarrier.init.shared::cta.b64 [%r703], 1; 2026-02-21T08:40:56.0770062Z // end inline asm 2026-02-21T08:40:56.0770202Z bar.sync 0; 2026-02-21T08:40:56.0770347Z @%p35 bra $L__BB0_6; 2026-02-21T08:40:56.0770540Z // %bb.5: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0770771Z elect.sync %r337|%p39, -1; 2026-02-21T08:40:56.0770947Z mov.b32 %r327, 134482192; 2026-02-21T08:40:56.0771108Z // begin inline asm 2026-02-21T08:40:56.0771365Z @%p39 tcgen05.mma.cta_group::1.kind::tf32 [ %r829 + 0 ], [ %r689 + 0 ], %rd81, %r327, %p109; 2026-02-21T08:40:56.0771691Z // end inline asm 2026-02-21T08:40:56.0771844Z // begin inline asm 2026-02-21T08:40:56.0772094Z @%p39 tcgen05.mma.cta_group::1.kind::tf32 [ %r829 + 16 ], [ %r689 + 8 ], %rd81, %r327, %p109; 2026-02-21T08:40:56.0772369Z // end inline asm 
2026-02-21T08:40:56.0772503Z // begin inline asm 2026-02-21T08:40:56.0772740Z @%p39 tcgen05.mma.cta_group::1.kind::tf32 [ %r829 + 32 ], [ %r689 + 16 ], %rd81, %r327, %p109; 2026-02-21T08:40:56.0773012Z // end inline asm 2026-02-21T08:40:56.0773146Z // begin inline asm 2026-02-21T08:40:56.0773383Z @%p39 tcgen05.mma.cta_group::1.kind::tf32 [ %r829 + 48 ], [ %r689 + 24 ], %rd81, %r327, %p109; 2026-02-21T08:40:56.0773647Z // end inline asm 2026-02-21T08:40:56.0773798Z cvt.u64.u32 %rd58, %r703; 2026-02-21T08:40:56.0773952Z // begin inline asm 2026-02-21T08:40:56.0774165Z @%p39 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd58]; 2026-02-21T08:40:56.0774457Z // end inline asm 2026-02-21T08:40:56.0774635Z $L__BB0_6: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0774883Z .loc 2 0 36 // standard.py:0:36 2026-02-21T08:40:56.0775076Z mov.b32 %r341, 0; 2026-02-21T08:40:56.0775381Z .loc 2 291 36 // standard.py:291:36 @[ crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:94:40 ] 2026-02-21T08:40:56.0775736Z // begin inline asm 2026-02-21T08:40:56.0775880Z 2026-02-21T08:40:56.0775998Z { 2026-02-21T08:40:56.0776131Z .reg .pred complete; 2026-02-21T08:40:56.0776283Z waitLoop: 2026-02-21T08:40:56.0776471Z mbarrier.try_wait.parity.shared.b64 complete, [%r703], %r341; 2026-02-21T08:40:56.0776709Z @!complete bra.uni waitLoop; 2026-02-21T08:40:56.0776863Z } 2026-02-21T08:40:56.0776934Z 2026-02-21T08:40:56.0776989Z // end inline asm 2026-02-21T08:40:56.0777122Z bar.sync 0; 2026-02-21T08:40:56.0777257Z // begin inline asm 2026-02-21T08:40:56.0777419Z @%p6 mbarrier.inval.shared::cta.b64 [%r703]; 2026-02-21T08:40:56.0777607Z // end inline asm 2026-02-21T08:40:56.0777744Z $L__tmp4: 2026-02-21T08:40:56.0777978Z .loc 1 51 35 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:51:35 2026-02-21T08:40:56.0778297Z add.s32 %r431, %r844, %r8; 2026-02-21T08:40:56.0778563Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 
2026-02-21T08:40:56.0778853Z shl.b32 %r432, %r82, 1; 2026-02-21T08:40:56.0779039Z add.s32 %r433, %r121, %r432; 2026-02-21T08:40:56.0779207Z add.s32 %r434, %r433, %r19; 2026-02-21T08:40:56.0779408Z ld.shared.v4.b32 {%r435, %r436, %r437, %r438}, [%r434+24576]; 2026-02-21T08:40:56.0779632Z mov.b32 {%rs23, %rs24}, %r438; 2026-02-21T08:40:56.0779802Z mov.b32 {%rs25, %rs26}, %r437; 2026-02-21T08:40:56.0779963Z mov.b32 {%rs27, %rs28}, %r436; 2026-02-21T08:40:56.0780127Z mov.b32 {%rs29, %rs30}, %r435; 2026-02-21T08:40:56.0780325Z ld.shared.v4.b32 {%r439, %r440, %r441, %r442}, [%r434+28672]; 2026-02-21T08:40:56.0780540Z mov.b32 {%rs31, %rs32}, %r442; 2026-02-21T08:40:56.0780696Z mov.b32 {%rs33, %rs34}, %r441; 2026-02-21T08:40:56.0780857Z mov.b32 {%rs35, %rs36}, %r440; 2026-02-21T08:40:56.0781013Z mov.b32 {%rs37, %rs38}, %r439; 2026-02-21T08:40:56.0781282Z .loc 1 62 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:62:32 2026-02-21T08:40:56.0781602Z cvt.f32.bf16 %r412, %rs29; 2026-02-21T08:40:56.0781766Z cvt.f32.bf16 %r413, %rs30; 2026-02-21T08:40:56.0781935Z cvt.f32.bf16 %r414, %rs27; 2026-02-21T08:40:56.0782093Z cvt.f32.bf16 %r415, %rs28; 2026-02-21T08:40:56.0782255Z cvt.f32.bf16 %r416, %rs25; 2026-02-21T08:40:56.0782406Z cvt.f32.bf16 %r417, %rs26; 2026-02-21T08:40:56.0782564Z cvt.f32.bf16 %r418, %rs23; 2026-02-21T08:40:56.0782712Z cvt.f32.bf16 %r419, %rs24; 2026-02-21T08:40:56.0782872Z cvt.f32.bf16 %r421, %rs37; 2026-02-21T08:40:56.0783030Z cvt.f32.bf16 %r422, %rs38; 2026-02-21T08:40:56.0783183Z cvt.f32.bf16 %r423, %rs35; 2026-02-21T08:40:56.0783343Z cvt.f32.bf16 %r424, %rs36; 2026-02-21T08:40:56.0783492Z cvt.f32.bf16 %r425, %rs33; 2026-02-21T08:40:56.0783651Z cvt.f32.bf16 %r426, %rs34; 2026-02-21T08:40:56.0783801Z cvt.f32.bf16 %r427, %rs31; 2026-02-21T08:40:56.0783963Z cvt.f32.bf16 %r428, %rs32; 2026-02-21T08:40:56.0784223Z .loc 1 64 55 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:55 2026-02-21T08:40:56.0784512Z shl.b32 %r443, %r431, 13; 
2026-02-21T08:40:56.0784769Z .loc 1 64 62 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:62 2026-02-21T08:40:56.0785058Z add.s32 %r444, %r443, %r859; 2026-02-21T08:40:56.0785325Z .loc 1 64 34 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:34 2026-02-21T08:40:56.0785608Z cvt.s64.s32 %rd62, %r444; 2026-02-21T08:40:56.0785772Z add.s64 %rd60, %rd4, %rd62; 2026-02-21T08:40:56.0786070Z .loc 1 64 87 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:87 2026-02-21T08:40:56.0786357Z // begin inline asm 2026-02-21T08:40:56.0786499Z mov.u64 %rd59, 0x0; 2026-02-21T08:40:56.0786698Z createpolicy.fractional.L2::evict_last.b64 %rd59, 1.0; 2026-02-21T08:40:56.0786913Z // end inline asm 2026-02-21T08:40:56.0787051Z // begin inline asm 2026-02-21T08:40:56.0787230Z mov.u16 %rs22, 0x0; 2026-02-21T08:40:56.0787439Z ld.global.L1::evict_last.L2::cache_hint.b8 { %rs22 }, [ %rd60 + 0 ], %rd59; 2026-02-21T08:40:56.0787685Z // end inline asm 2026-02-21T08:40:56.0787922Z .loc 1 67 28 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:67:28 2026-02-21T08:40:56.0788211Z shl.b16 %rs39, %rs22, 4; 2026-02-21T08:40:56.0788474Z .loc 1 82 58 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:82:58 2026-02-21T08:40:56.0788762Z selp.b16 %rs40, %rs39, %rs22, %p36; 2026-02-21T08:40:56.0788947Z cvt.s16.s8 %rs41, %rs40; 2026-02-21T08:40:56.0789102Z shr.s16 %rs42, %rs41, 4; 2026-02-21T08:40:56.0789359Z .loc 1 87 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:87:32 2026-02-21T08:40:56.0789641Z cvt.rn.f32.s16 %r445, %rs42; 2026-02-21T08:40:56.0789810Z st.shared.b32 [%r26], %r445; 2026-02-21T08:40:56.0789961Z $L__tmp5: 2026-02-21T08:40:56.0790284Z .loc 2 291 36 // standard.py:291:36 @[ crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:94:40 ] 2026-02-21T08:40:56.0790616Z // begin inline asm 2026-02-21T08:40:56.0790999Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r343, %r344, %r345, %r346, %r347, %r348, %r349, %r350, %r351, %r352, %r353, 
%r354, %r355, %r356, %r357, %r358}, [%r359 + 0]; 2026-02-21T08:40:56.0791378Z // end inline asm 2026-02-21T08:40:56.0791514Z // begin inline asm 2026-02-21T08:40:56.0791907Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r360, %r361, %r362, %r363, %r364, %r365, %r366, %r367, %r368, %r369, %r370, %r371, %r372, %r373, %r374, %r375}, [%r359 + 32]; 2026-02-21T08:40:56.0792292Z // end inline asm 2026-02-21T08:40:56.0792428Z // begin inline asm 2026-02-21T08:40:56.0792588Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:40:56.0792754Z // end inline asm 2026-02-21T08:40:56.0792900Z // begin inline asm 2026-02-21T08:40:56.0793258Z @%p48 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r480 + 0], {%r343, %r344, %r345, %r346, %r347, %r348, %r349, %r350, %r351, %r352, %r353, %r354, %r355, %r356, %r357, %r358}; 2026-02-21T08:40:56.0793652Z // end inline asm 2026-02-21T08:40:56.0793789Z // begin inline asm 2026-02-21T08:40:56.0794147Z @%p48 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r480 + 32], {%r360, %r361, %r362, %r363, %r364, %r365, %r366, %r367, %r368, %r369, %r370, %r371, %r372, %r373, %r374, %r375}; 2026-02-21T08:40:56.0794547Z // end inline asm 2026-02-21T08:40:56.0794689Z // begin inline asm 2026-02-21T08:40:56.0794853Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:40:56.0795017Z // end inline asm 2026-02-21T08:40:56.0795158Z bar.sync 0; 2026-02-21T08:40:56.0795288Z // begin inline asm 2026-02-21T08:40:56.0795567Z @%p48 tcgen05.st.sync.aligned.32x32b.x8.b32 [%r653 + 0], {%r412, %r413, %r414, %r415, %r416, %r417, %r418, %r419}; 2026-02-21T08:40:56.0795874Z // end inline asm 2026-02-21T08:40:56.0796008Z // begin inline asm 2026-02-21T08:40:56.0796286Z @%p48 tcgen05.st.sync.aligned.32x32b.x8.b32 [%r653 + 16], {%r421, %r422, %r423, %r424, %r425, %r426, %r427, %r428}; 2026-02-21T08:40:56.0796585Z // end inline asm 2026-02-21T08:40:56.0796729Z // begin inline asm 2026-02-21T08:40:56.0796878Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:40:56.0797046Z // end inline asm 
2026-02-21T08:40:56.0797179Z bar.sync 0; 2026-02-21T08:40:56.0797317Z // begin inline asm 2026-02-21T08:40:56.0797469Z fence.proxy.async.shared::cta; 2026-02-21T08:40:56.0797639Z // end inline asm 2026-02-21T08:40:56.0797777Z bar.sync 0; 2026-02-21T08:40:56.0797906Z // begin inline asm 2026-02-21T08:40:56.0798079Z @%p6 mbarrier.init.shared::cta.b64 [%r703], 1; 2026-02-21T08:40:56.0798304Z // end inline asm 2026-02-21T08:40:56.0798447Z @%p35 bra $L__BB0_8; 2026-02-21T08:40:56.0798629Z // %bb.7: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0798852Z elect.sync %r458|%p56, -1; 2026-02-21T08:40:56.0799013Z mov.b32 %r448, 134482192; 2026-02-21T08:40:56.0799173Z mov.pred %p55, -1; 2026-02-21T08:40:56.0799351Z // begin inline asm 2026-02-21T08:40:56.0799588Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r446 + 0 ], [ %r689 + 0 ], %rd81, %r448, %p55; 2026-02-21T08:40:56.0799861Z // end inline asm 2026-02-21T08:40:56.0799995Z // begin inline asm 2026-02-21T08:40:56.0800232Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r446 + 16 ], [ %r689 + 8 ], %rd81, %r448, %p55; 2026-02-21T08:40:56.0800492Z // end inline asm 2026-02-21T08:40:56.0800633Z // begin inline asm 2026-02-21T08:40:56.0800861Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r446 + 32 ], [ %r689 + 16 ], %rd81, %r448, %p55; 2026-02-21T08:40:56.0801127Z // end inline asm 2026-02-21T08:40:56.0801268Z // begin inline asm 2026-02-21T08:40:56.0801494Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r446 + 48 ], [ %r689 + 24 ], %rd81, %r448, %p55; 2026-02-21T08:40:56.0801793Z // end inline asm 2026-02-21T08:40:56.0801933Z cvt.u64.u32 %rd67, %r703; 2026-02-21T08:40:56.0802093Z // begin inline asm 2026-02-21T08:40:56.0802331Z @%p56 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd67]; 2026-02-21T08:40:56.0802569Z // end inline asm 2026-02-21T08:40:56.0802786Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0802995Z bar.sync 0; 2026-02-21T08:40:56.0803139Z // begin inline asm 
2026-02-21T08:40:56.0803272Z 2026-02-21T08:40:56.0803399Z { 2026-02-21T08:40:56.0803525Z .reg .pred complete; 2026-02-21T08:40:56.0803675Z waitLoop: 2026-02-21T08:40:56.0803862Z mbarrier.try_wait.parity.shared.b64 complete, [%r703], %r341; 2026-02-21T08:40:56.0804103Z @!complete bra.uni waitLoop; 2026-02-21T08:40:56.0804258Z } 2026-02-21T08:40:56.0804331Z 2026-02-21T08:40:56.0804386Z // end inline asm 2026-02-21T08:40:56.0804528Z bar.sync 0; 2026-02-21T08:40:56.0804658Z // begin inline asm 2026-02-21T08:40:56.0804829Z @%p6 mbarrier.inval.shared::cta.b64 [%r703]; 2026-02-21T08:40:56.0805013Z // end inline asm 2026-02-21T08:40:56.0805153Z $L__tmp6: 2026-02-21T08:40:56.0805394Z .loc 1 51 35 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:51:35 2026-02-21T08:40:56.0805687Z add.s32 %r552, %r841, %r8; 2026-02-21T08:40:56.0805949Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0806275Z ld.shared.v4.b32 {%r556, %r557, %r558, %r559}, [%r434+49152]; 2026-02-21T08:40:56.0806503Z mov.b32 {%rs44, %rs45}, %r559; 2026-02-21T08:40:56.0806668Z mov.b32 {%rs46, %rs47}, %r558; 2026-02-21T08:40:56.0806834Z mov.b32 {%rs48, %rs49}, %r557; 2026-02-21T08:40:56.0806995Z mov.b32 {%rs50, %rs51}, %r556; 2026-02-21T08:40:56.0807199Z ld.shared.v4.b32 {%r560, %r561, %r562, %r563}, [%r434+53248]; 2026-02-21T08:40:56.0807408Z mov.b32 {%rs52, %rs53}, %r563; 2026-02-21T08:40:56.0807572Z mov.b32 {%rs54, %rs55}, %r562; 2026-02-21T08:40:56.0807728Z mov.b32 {%rs56, %rs57}, %r561; 2026-02-21T08:40:56.0807889Z mov.b32 {%rs58, %rs59}, %r560; 2026-02-21T08:40:56.0808161Z .loc 1 62 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:62:32 2026-02-21T08:40:56.0808453Z cvt.f32.bf16 %r533, %rs50; 2026-02-21T08:40:56.0808626Z cvt.f32.bf16 %r534, %rs51; 2026-02-21T08:40:56.0808787Z cvt.f32.bf16 %r535, %rs48; 2026-02-21T08:40:56.0808951Z cvt.f32.bf16 %r536, %rs49; 2026-02-21T08:40:56.0809107Z cvt.f32.bf16 %r537, %rs46; 
2026-02-21T08:40:56.0809270Z cvt.f32.bf16 %r538, %rs47; 2026-02-21T08:40:56.0809424Z cvt.f32.bf16 %r539, %rs44; 2026-02-21T08:40:56.0809588Z cvt.f32.bf16 %r540, %rs45; 2026-02-21T08:40:56.0809750Z cvt.f32.bf16 %r542, %rs58; 2026-02-21T08:40:56.0809904Z cvt.f32.bf16 %r543, %rs59; 2026-02-21T08:40:56.0810105Z cvt.f32.bf16 %r544, %rs56; 2026-02-21T08:40:56.0810263Z cvt.f32.bf16 %r545, %rs57; 2026-02-21T08:40:56.0810425Z cvt.f32.bf16 %r546, %rs54; 2026-02-21T08:40:56.0810583Z cvt.f32.bf16 %r547, %rs55; 2026-02-21T08:40:56.0810744Z cvt.f32.bf16 %r548, %rs52; 2026-02-21T08:40:56.0810900Z cvt.f32.bf16 %r549, %rs53; 2026-02-21T08:40:56.0811178Z .loc 1 64 55 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:55 2026-02-21T08:40:56.0811512Z shl.b32 %r564, %r552, 13; 2026-02-21T08:40:56.0811826Z .loc 1 64 62 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:62 2026-02-21T08:40:56.0812130Z add.s32 %r565, %r564, %r859; 2026-02-21T08:40:56.0812408Z .loc 1 64 34 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:34 2026-02-21T08:40:56.0812720Z cvt.s64.s32 %rd71, %r565; 2026-02-21T08:40:56.0812889Z add.s64 %rd69, %rd4, %rd71; 2026-02-21T08:40:56.0813178Z .loc 1 64 87 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:87 2026-02-21T08:40:56.0813481Z // begin inline asm 2026-02-21T08:40:56.0813631Z mov.u64 %rd68, 0x0; 2026-02-21T08:40:56.0813839Z createpolicy.fractional.L2::evict_last.b64 %rd68, 1.0; 2026-02-21T08:40:56.0814060Z // end inline asm 2026-02-21T08:40:56.0814246Z // begin inline asm 2026-02-21T08:40:56.0814395Z mov.u16 %rs43, 0x0; 2026-02-21T08:40:56.0814623Z ld.global.L1::evict_last.L2::cache_hint.b8 { %rs43 }, [ %rd69 + 0 ], %rd68; 2026-02-21T08:40:56.0814920Z // end inline asm 2026-02-21T08:40:56.0815289Z .loc 1 67 28 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:67:28 2026-02-21T08:40:56.0815577Z shl.b16 %rs60, %rs43, 4; 2026-02-21T08:40:56.0815833Z .loc 1 82 58 // 
crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:82:58 2026-02-21T08:40:56.0816130Z selp.b16 %rs61, %rs60, %rs43, %p36; 2026-02-21T08:40:56.0816310Z cvt.s16.s8 %rs62, %rs61; 2026-02-21T08:40:56.0816473Z shr.s16 %rs63, %rs62, 4; 2026-02-21T08:40:56.0816724Z .loc 1 87 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:87:32 2026-02-21T08:40:56.0817012Z cvt.rn.f32.s16 %r566, %rs63; 2026-02-21T08:40:56.0817184Z st.shared.b32 [%r26], %r566; 2026-02-21T08:40:56.0817338Z $L__tmp7: 2026-02-21T08:40:56.0817631Z .loc 2 291 36 // standard.py:291:36 @[ crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:94:40 ] 2026-02-21T08:40:56.0817961Z // begin inline asm 2026-02-21T08:40:56.0818320Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r464, %r465, %r466, %r467, %r468, %r469, %r470, %r471, %r472, %r473, %r474, %r475, %r476, %r477, %r478, %r479}, [%r480 + 0]; 2026-02-21T08:40:56.0818694Z // end inline asm 2026-02-21T08:40:56.0818837Z // begin inline asm 2026-02-21T08:40:56.0819183Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r481, %r482, %r483, %r484, %r485, %r486, %r487, %r488, %r489, %r490, %r491, %r492, %r493, %r494, %r495, %r496}, [%r480 + 32]; 2026-02-21T08:40:56.0819555Z // end inline asm 2026-02-21T08:40:56.0819696Z // begin inline asm 2026-02-21T08:40:56.0819845Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:40:56.0820013Z // end inline asm 2026-02-21T08:40:56.0820152Z mov.pred %p72, -1; 2026-02-21T08:40:56.0820297Z // begin inline asm 2026-02-21T08:40:56.0820652Z @%p72 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r601 + 0], {%r464, %r465, %r466, %r467, %r468, %r469, %r470, %r471, %r472, %r473, %r474, %r475, %r476, %r477, %r478, %r479}; 2026-02-21T08:40:56.0821051Z // end inline asm 2026-02-21T08:40:56.0821193Z // begin inline asm 2026-02-21T08:40:56.0821574Z @%p72 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r601 + 32], {%r481, %r482, %r483, %r484, %r485, %r486, %r487, %r488, %r489, %r490, %r491, %r492, %r493, %r494, %r495, %r496}; 2026-02-21T08:40:56.0821962Z // end 
inline asm 2026-02-21T08:40:56.0822094Z // begin inline asm 2026-02-21T08:40:56.0822251Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:40:56.0822412Z // end inline asm 2026-02-21T08:40:56.0822587Z bar.sync 0; 2026-02-21T08:40:56.0822726Z // begin inline asm 2026-02-21T08:40:56.0823000Z @%p72 tcgen05.st.sync.aligned.32x32b.x8.b32 [%r653 + 0], {%r533, %r534, %r535, %r536, %r537, %r538, %r539, %r540}; 2026-02-21T08:40:56.0823307Z // end inline asm 2026-02-21T08:40:56.0823441Z // begin inline asm 2026-02-21T08:40:56.0823722Z @%p72 tcgen05.st.sync.aligned.32x32b.x8.b32 [%r653 + 16], {%r542, %r543, %r544, %r545, %r546, %r547, %r548, %r549}; 2026-02-21T08:40:56.0824049Z // end inline asm 2026-02-21T08:40:56.0824202Z // begin inline asm 2026-02-21T08:40:56.0824358Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:40:56.0824536Z // end inline asm 2026-02-21T08:40:56.0824685Z bar.sync 0; 2026-02-21T08:40:56.0824821Z // begin inline asm 2026-02-21T08:40:56.0824987Z fence.proxy.async.shared::cta; 2026-02-21T08:40:56.0825153Z // end inline asm 2026-02-21T08:40:56.0825297Z bar.sync 0; 2026-02-21T08:40:56.0825429Z // begin inline asm 2026-02-21T08:40:56.0825609Z @%p6 mbarrier.init.shared::cta.b64 [%r703], 1; 2026-02-21T08:40:56.0825803Z // end inline asm 2026-02-21T08:40:56.0825953Z @%p35 bra $L__BB0_10; 2026-02-21T08:40:56.0826145Z // %bb.9: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0826375Z elect.sync %r579|%p73, -1; 2026-02-21T08:40:56.0826581Z mov.b32 %r569, 134482192; 2026-02-21T08:40:56.0826737Z // begin inline asm 2026-02-21T08:40:56.0826983Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r567 + 0 ], [ %r689 + 0 ], %rd81, %r569, %p72; 2026-02-21T08:40:56.0827282Z // end inline asm 2026-02-21T08:40:56.0827425Z // begin inline asm 2026-02-21T08:40:56.0827656Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r567 + 16 ], [ %r689 + 8 ], %rd81, %r569, %p72; 2026-02-21T08:40:56.0827930Z // end inline asm 2026-02-21T08:40:56.0828073Z // begin inline asm 
2026-02-21T08:40:56.0828299Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r567 + 32 ], [ %r689 + 16 ], %rd81, %r569, %p72; 2026-02-21T08:40:56.0828563Z // end inline asm 2026-02-21T08:40:56.0828698Z // begin inline asm 2026-02-21T08:40:56.0828925Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r567 + 48 ], [ %r689 + 24 ], %rd81, %r569, %p72; 2026-02-21T08:40:56.0829181Z // end inline asm 2026-02-21T08:40:56.0829326Z cvt.u64.u32 %rd76, %r703; 2026-02-21T08:40:56.0829478Z // begin inline asm 2026-02-21T08:40:56.0829695Z @%p73 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd76]; 2026-02-21T08:40:56.0829935Z // end inline asm 2026-02-21T08:40:56.0830113Z $L__BB0_10: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0830326Z bar.sync 0; 2026-02-21T08:40:56.0830454Z mov.b32 %r583, 0; 2026-02-21T08:40:56.0830593Z // begin inline asm 2026-02-21T08:40:56.0830723Z 2026-02-21T08:40:56.0830843Z { 2026-02-21T08:40:56.0830966Z .reg .pred complete; 2026-02-21T08:40:56.0831115Z waitLoop: 2026-02-21T08:40:56.0831306Z mbarrier.try_wait.parity.shared.b64 complete, [%r703], %r583; 2026-02-21T08:40:56.0831560Z @!complete bra.uni waitLoop; 2026-02-21T08:40:56.0831724Z } 2026-02-21T08:40:56.0831789Z 2026-02-21T08:40:56.0831844Z // end inline asm 2026-02-21T08:40:56.0831984Z bar.sync 0; 2026-02-21T08:40:56.0832114Z // begin inline asm 2026-02-21T08:40:56.0832285Z @%p6 mbarrier.inval.shared::cta.b64 [%r703]; 2026-02-21T08:40:56.0832468Z // end inline asm 2026-02-21T08:40:56.0832611Z $L__tmp8: 2026-02-21T08:40:56.0832851Z .loc 1 51 35 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:51:35 2026-02-21T08:40:56.0833143Z add.s32 %r673, %r838, %r8; 2026-02-21T08:40:56.0833419Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0833751Z ld.shared.v4.b32 {%r677, %r678, %r679, %r680}, [%r434+73728]; 2026-02-21T08:40:56.0833989Z mov.b32 {%rs65, %rs66}, %r680; 2026-02-21T08:40:56.0834161Z mov.b32 {%rs67, %rs68}, %r679; 
2026-02-21T08:40:56.0834332Z mov.b32 {%rs69, %rs70}, %r678; 2026-02-21T08:40:56.0834489Z mov.b32 {%rs71, %rs72}, %r677; 2026-02-21T08:40:56.0834728Z ld.shared.v4.b32 {%r681, %r682, %r683, %r684}, [%r434+77824]; 2026-02-21T08:40:56.0834944Z mov.b32 {%rs73, %rs74}, %r684; 2026-02-21T08:40:56.0835100Z mov.b32 {%rs75, %rs76}, %r683; 2026-02-21T08:40:56.0835265Z mov.b32 {%rs77, %rs78}, %r682; 2026-02-21T08:40:56.0835419Z mov.b32 {%rs79, %rs80}, %r681; 2026-02-21T08:40:56.0835686Z .loc 1 62 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:62:32 2026-02-21T08:40:56.0835994Z cvt.f32.bf16 %r654, %rs71; 2026-02-21T08:40:56.0836159Z cvt.f32.bf16 %r655, %rs72; 2026-02-21T08:40:56.0836313Z cvt.f32.bf16 %r656, %rs69; 2026-02-21T08:40:56.0836472Z cvt.f32.bf16 %r657, %rs70; 2026-02-21T08:40:56.0836629Z cvt.f32.bf16 %r658, %rs67; 2026-02-21T08:40:56.0836780Z cvt.f32.bf16 %r659, %rs68; 2026-02-21T08:40:56.0836937Z cvt.f32.bf16 %r660, %rs65; 2026-02-21T08:40:56.0837086Z cvt.f32.bf16 %r661, %rs66; 2026-02-21T08:40:56.0837244Z cvt.f32.bf16 %r663, %rs79; 2026-02-21T08:40:56.0837395Z cvt.f32.bf16 %r664, %rs80; 2026-02-21T08:40:56.0837552Z cvt.f32.bf16 %r665, %rs77; 2026-02-21T08:40:56.0837699Z cvt.f32.bf16 %r666, %rs78; 2026-02-21T08:40:56.0837854Z cvt.f32.bf16 %r667, %rs75; 2026-02-21T08:40:56.0838002Z cvt.f32.bf16 %r668, %rs76; 2026-02-21T08:40:56.0838158Z cvt.f32.bf16 %r669, %rs73; 2026-02-21T08:40:56.0838342Z cvt.f32.bf16 %r670, %rs74; 2026-02-21T08:40:56.0838598Z .loc 1 64 55 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:55 2026-02-21T08:40:56.0838907Z shl.b32 %r685, %r673, 13; 2026-02-21T08:40:56.0839164Z .loc 1 64 62 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:62 2026-02-21T08:40:56.0839449Z add.s32 %r686, %r685, %r859; 2026-02-21T08:40:56.0839704Z .loc 1 64 34 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:34 2026-02-21T08:40:56.0839989Z cvt.s64.s32 %rd80, %r686; 2026-02-21T08:40:56.0840150Z add.s64 %rd78, %rd4, %rd80; 
2026-02-21T08:40:56.0840412Z .loc 1 64 87 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:64:87 2026-02-21T08:40:56.0840692Z // begin inline asm 2026-02-21T08:40:56.0840832Z mov.u64 %rd77, 0x0; 2026-02-21T08:40:56.0841029Z createpolicy.fractional.L2::evict_last.b64 %rd77, 1.0; 2026-02-21T08:40:56.0841235Z // end inline asm 2026-02-21T08:40:56.0841380Z // begin inline asm 2026-02-21T08:40:56.0841518Z mov.u16 %rs64, 0x0; 2026-02-21T08:40:56.0841773Z ld.global.L1::evict_last.L2::cache_hint.b8 { %rs64 }, [ %rd78 + 0 ], %rd77; 2026-02-21T08:40:56.0842017Z // end inline asm 2026-02-21T08:40:56.0842257Z .loc 1 67 28 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:67:28 2026-02-21T08:40:56.0842544Z shl.b16 %rs81, %rs64, 4; 2026-02-21T08:40:56.0842799Z .loc 1 82 58 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:82:58 2026-02-21T08:40:56.0843094Z selp.b16 %rs82, %rs81, %rs64, %p36; 2026-02-21T08:40:56.0843271Z cvt.s16.s8 %rs83, %rs82; 2026-02-21T08:40:56.0843433Z shr.s16 %rs84, %rs83, 4; 2026-02-21T08:40:56.0843698Z .loc 1 87 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:87:32 2026-02-21T08:40:56.0843985Z cvt.rn.f32.s16 %r687, %rs84; 2026-02-21T08:40:56.0844164Z st.shared.b32 [%r26], %r687; 2026-02-21T08:40:56.0844323Z $L__tmp9: 2026-02-21T08:40:56.0844628Z .loc 2 291 36 // standard.py:291:36 @[ crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:94:40 ] 2026-02-21T08:40:56.0844964Z // begin inline asm 2026-02-21T08:40:56.0845324Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r585, %r586, %r587, %r588, %r589, %r590, %r591, %r592, %r593, %r594, %r595, %r596, %r597, %r598, %r599, %r600}, [%r601 + 0]; 2026-02-21T08:40:56.0845709Z // end inline asm 2026-02-21T08:40:56.0845848Z // begin inline asm 2026-02-21T08:40:56.0846201Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r602, %r603, %r604, %r605, %r606, %r607, %r608, %r609, %r610, %r611, %r612, %r613, %r614, %r615, %r616, %r617}, [%r601 + 32]; 2026-02-21T08:40:56.0846601Z // end inline 
asm 2026-02-21T08:40:56.0846744Z // begin inline asm 2026-02-21T08:40:56.0846895Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:40:56.0847063Z // end inline asm 2026-02-21T08:40:56.0847195Z // begin inline asm 2026-02-21T08:40:56.0847556Z @%p72 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r722 + 0], {%r585, %r586, %r587, %r588, %r589, %r590, %r591, %r592, %r593, %r594, %r595, %r596, %r597, %r598, %r599, %r600}; 2026-02-21T08:40:56.0847974Z // end inline asm 2026-02-21T08:40:56.0848112Z // begin inline asm 2026-02-21T08:40:56.0848470Z @%p72 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r722 + 32], {%r602, %r603, %r604, %r605, %r606, %r607, %r608, %r609, %r610, %r611, %r612, %r613, %r614, %r615, %r616, %r617}; 2026-02-21T08:40:56.0848846Z // end inline asm 2026-02-21T08:40:56.0848989Z // begin inline asm 2026-02-21T08:40:56.0849147Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:40:56.0849305Z // end inline asm 2026-02-21T08:40:56.0849448Z bar.sync 0; 2026-02-21T08:40:56.0849578Z // begin inline asm 2026-02-21T08:40:56.0849864Z @%p72 tcgen05.st.sync.aligned.32x32b.x8.b32 [%r653 + 0], {%r654, %r655, %r656, %r657, %r658, %r659, %r660, %r661}; 2026-02-21T08:40:56.0850164Z // end inline asm 2026-02-21T08:40:56.0850305Z // begin inline asm 2026-02-21T08:40:56.0850607Z @%p72 tcgen05.st.sync.aligned.32x32b.x8.b32 [%r653 + 16], {%r663, %r664, %r665, %r666, %r667, %r668, %r669, %r670}; 2026-02-21T08:40:56.0850913Z // end inline asm 2026-02-21T08:40:56.0851107Z // begin inline asm 2026-02-21T08:40:56.0851266Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:40:56.0851444Z // end inline asm 2026-02-21T08:40:56.0851609Z bar.sync 0; 2026-02-21T08:40:56.0851757Z // begin inline asm 2026-02-21T08:40:56.0851918Z fence.proxy.async.shared::cta; 2026-02-21T08:40:56.0852102Z // end inline asm 2026-02-21T08:40:56.0852242Z bar.sync 0; 2026-02-21T08:40:56.0852387Z // begin inline asm 2026-02-21T08:40:56.0852563Z @%p6 mbarrier.init.shared::cta.b64 [%r703], 1; 2026-02-21T08:40:56.0852770Z // end inline asm 
2026-02-21T08:40:56.0852922Z @%p35 bra $L__BB0_12; 2026-02-21T08:40:56.0853120Z // %bb.11: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0853356Z elect.sync %r700|%p90, -1; 2026-02-21T08:40:56.0853525Z mov.b32 %r690, 134482192; 2026-02-21T08:40:56.0853695Z mov.pred %p89, -1; 2026-02-21T08:40:56.0853844Z // begin inline asm 2026-02-21T08:40:56.0854097Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r688 + 0 ], [ %r689 + 0 ], %rd81, %r690, %p89; 2026-02-21T08:40:56.0854374Z // end inline asm 2026-02-21T08:40:56.0854524Z // begin inline asm 2026-02-21T08:40:56.0854772Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r688 + 16 ], [ %r689 + 8 ], %rd81, %r690, %p89; 2026-02-21T08:40:56.0855046Z // end inline asm 2026-02-21T08:40:56.0855201Z // begin inline asm 2026-02-21T08:40:56.0855435Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r688 + 32 ], [ %r689 + 16 ], %rd81, %r690, %p89; 2026-02-21T08:40:56.0855714Z // end inline asm 2026-02-21T08:40:56.0855853Z // begin inline asm 2026-02-21T08:40:56.0856095Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r688 + 48 ], [ %r689 + 24 ], %rd81, %r690, %p89; 2026-02-21T08:40:56.0856371Z // end inline asm 2026-02-21T08:40:56.0856518Z cvt.u64.u32 %rd85, %r703; 2026-02-21T08:40:56.0856685Z // begin inline asm 2026-02-21T08:40:56.0856903Z @%p90 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd85]; 2026-02-21T08:40:56.0857147Z // end inline asm 2026-02-21T08:40:56.0857285Z $L__tmp10: 2026-02-21T08:40:56.0857472Z $L__BB0_12: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0857822Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0858149Z setp.eq.b32 %p101, %r68, 0; 2026-02-21T08:40:56.0858334Z setp.lt.s32 %p102, %r855, %r25; 2026-02-21T08:40:56.0858502Z $L__tmp11: 2026-02-21T08:40:56.0858727Z .loc 2 291 36 // standard.py:291:36 @[ crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:94:40 ] 2026-02-21T08:40:56.0858827Z bar.sync 0; 2026-02-21T08:40:56.0858895Z // begin 
inline asm 2026-02-21T08:40:56.0858946Z 2026-02-21T08:40:56.0859004Z { 2026-02-21T08:40:56.0859066Z .reg .pred complete; 2026-02-21T08:40:56.0859121Z waitLoop: 2026-02-21T08:40:56.0859240Z mbarrier.try_wait.parity.shared.b64 complete, [%r703], %r583; 2026-02-21T08:40:56.0859346Z @!complete bra.uni waitLoop; 2026-02-21T08:40:56.0859396Z } 2026-02-21T08:40:56.0859400Z 2026-02-21T08:40:56.0859456Z // end inline asm 2026-02-21T08:40:56.0859518Z bar.sync 0; 2026-02-21T08:40:56.0859576Z // begin inline asm 2026-02-21T08:40:56.0859659Z @%p6 mbarrier.inval.shared::cta.b64 [%r703]; 2026-02-21T08:40:56.0859721Z // end inline asm 2026-02-21T08:40:56.0859776Z // begin inline asm 2026-02-21T08:40:56.0860044Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r741, %r742, %r743, %r744, %r745, %r746, %r747, %r748, %r749, %r750, %r751, %r752, %r753, %r754, %r755, %r756}, [%r722 + 0]; 2026-02-21T08:40:56.0860100Z // end inline asm 2026-02-21T08:40:56.0860164Z // begin inline asm 2026-02-21T08:40:56.0860430Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r758, %r759, %r760, %r761, %r762, %r763, %r764, %r765, %r766, %r767, %r768, %r769, %r770, %r771, %r772, %r773}, [%r722 + 32]; 2026-02-21T08:40:56.0860537Z // end inline asm 2026-02-21T08:40:56.0860605Z // begin inline asm 2026-02-21T08:40:56.0860674Z tcgen05.wait::ld.sync.aligned; 2026-02-21T08:40:56.0860728Z // end inline asm 2026-02-21T08:40:56.0860821Z mov.pred %p99, -1; 2026-02-21T08:40:56.0860878Z // begin inline asm 2026-02-21T08:40:56.0861144Z @%p99 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r359 + 0], {%r741, %r742, %r743, %r744, %r745, %r746, %r747, %r748, %r749, %r750, %r751, %r752, %r753, %r754, %r755, %r756}; 2026-02-21T08:40:56.0861198Z // end inline asm 2026-02-21T08:40:56.0861261Z // begin inline asm 2026-02-21T08:40:56.0861527Z @%p99 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r359 + 32], {%r758, %r759, %r760, %r761, %r762, %r763, %r764, %r765, %r766, %r767, %r768, %r769, %r770, %r771, %r772, %r773}; 2026-02-21T08:40:56.0861614Z // end 
inline asm 2026-02-21T08:40:56.0861678Z // begin inline asm 2026-02-21T08:40:56.0861746Z tcgen05.wait::st.sync.aligned; 2026-02-21T08:40:56.0861801Z // end inline asm 2026-02-21T08:40:56.0861865Z bar.sync 0; 2026-02-21T08:40:56.0861923Z $L__tmp12: 2026-02-21T08:40:56.0862094Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0862155Z add.s32 %r791, %r847, 16; 2026-02-21T08:40:56.0862224Z add.s32 %r792, %r850, 1; 2026-02-21T08:40:56.0862290Z setp.gt.s32 %p103, %r792, 2; 2026-02-21T08:40:56.0862355Z selp.b32 %r850, 0, %r792, %p103; 2026-02-21T08:40:56.0862426Z selp.b32 %r846, 0, %r791, %p101; 2026-02-21T08:40:56.0862590Z .loc 1 55 22 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:55:22 2026-02-21T08:40:56.0862649Z shl.b32 %r793, %r846, 1; 2026-02-21T08:40:56.0862818Z .loc 1 58 53 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:53 2026-02-21T08:40:56.0862877Z shl.b32 %r794, %r864, 10; 2026-02-21T08:40:56.0862935Z shl.b32 %r795, %r865, 10; 2026-02-21T08:40:56.0863094Z .loc 1 58 60 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:60 2026-02-21T08:40:56.0863166Z add.s32 %r796, %r794, %r793; 2026-02-21T08:40:56.0863224Z add.s32 %r797, %r795, %r793; 2026-02-21T08:40:56.0863386Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0863463Z mad.wide.s32 %rd86, %r796, 2, %rd3; 2026-02-21T08:40:56.0863529Z mad.wide.s32 %rd87, %r797, 2, %rd3; 2026-02-21T08:40:56.0863687Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0863752Z shl.b32 %r798, %r850, 13; 2026-02-21T08:40:56.0863810Z add.s32 %r799, %r121, %r798; 2026-02-21T08:40:56.0863903Z add.s32 %r774, %r799, %r19; 2026-02-21T08:40:56.0863966Z selp.b32 %r775, 16, 0, %p102; 2026-02-21T08:40:56.0864032Z // begin inline asm 2026-02-21T08:40:56.0864154Z cp.async.cg.shared.global [ %r774 + 0 ], [ %rd86 + 0 ], 0x10, %r775; 2026-02-21T08:40:56.0864209Z 
// end inline asm 2026-02-21T08:40:56.0864277Z add.s32 %r776, %r774, 4096; 2026-02-21T08:40:56.0864383Z // begin inline asm 2026-02-21T08:40:56.0864501Z cp.async.cg.shared.global [ %r776 + 0 ], [ %rd87 + 0 ], 0x10, %r775; 2026-02-21T08:40:56.0864557Z // end inline asm 2026-02-21T08:40:56.0864632Z cp.async.commit_group; 2026-02-21T08:40:56.0864802Z .loc 1 50 115 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:50:115 2026-02-21T08:40:56.0864861Z add.s32 %r842, %r846, 4; 2026-02-21T08:40:56.0865033Z .loc 1 55 22 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:55:22 2026-02-21T08:40:56.0865091Z shl.b32 %r800, %r842, 1; 2026-02-21T08:40:56.0865249Z .loc 1 58 60 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:60 2026-02-21T08:40:56.0865317Z add.s32 %r801, %r794, %r800; 2026-02-21T08:40:56.0865376Z add.s32 %r802, %r795, %r800; 2026-02-21T08:40:56.0865533Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0865637Z mad.wide.s32 %rd88, %r801, 2, %rd3; 2026-02-21T08:40:56.0865704Z mad.wide.s32 %rd89, %r802, 2, %rd3; 2026-02-21T08:40:56.0865895Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0865959Z add.s32 %r778, %r774, 24576; 2026-02-21T08:40:56.0866029Z // begin inline asm 2026-02-21T08:40:56.0866140Z cp.async.cg.shared.global [ %r778 + 0 ], [ %rd88 + 0 ], 0x10, %r775; 2026-02-21T08:40:56.0866196Z // end inline asm 2026-02-21T08:40:56.0866262Z add.s32 %r780, %r774, 28672; 2026-02-21T08:40:56.0866318Z // begin inline asm 2026-02-21T08:40:56.0866429Z cp.async.cg.shared.global [ %r780 + 0 ], [ %rd89 + 0 ], 0x10, %r775; 2026-02-21T08:40:56.0866486Z // end inline asm 2026-02-21T08:40:56.0866557Z cp.async.commit_group; 2026-02-21T08:40:56.0866723Z .loc 1 50 115 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:50:115 2026-02-21T08:40:56.0866781Z add.s32 %r839, %r846, 8; 2026-02-21T08:40:56.0866950Z .loc 1 55 22 // 
crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:55:22 2026-02-21T08:40:56.0867010Z shl.b32 %r803, %r839, 1; 2026-02-21T08:40:56.0867170Z .loc 1 58 60 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:60 2026-02-21T08:40:56.0867238Z add.s32 %r804, %r794, %r803; 2026-02-21T08:40:56.0867296Z add.s32 %r805, %r795, %r803; 2026-02-21T08:40:56.0867457Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0867527Z mad.wide.s32 %rd90, %r804, 2, %rd3; 2026-02-21T08:40:56.0867590Z mad.wide.s32 %rd91, %r805, 2, %rd3; 2026-02-21T08:40:56.0867751Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0867808Z add.s32 %r782, %r774, 49152; 2026-02-21T08:40:56.0867872Z // begin inline asm 2026-02-21T08:40:56.0867982Z cp.async.cg.shared.global [ %r782 + 0 ], [ %rd90 + 0 ], 0x10, %r775; 2026-02-21T08:40:56.0868037Z // end inline asm 2026-02-21T08:40:56.0868103Z add.s32 %r784, %r774, 53248; 2026-02-21T08:40:56.0868158Z // begin inline asm 2026-02-21T08:40:56.0868268Z cp.async.cg.shared.global [ %r784 + 0 ], [ %rd91 + 0 ], 0x10, %r775; 2026-02-21T08:40:56.0868323Z // end inline asm 2026-02-21T08:40:56.0868394Z cp.async.commit_group; 2026-02-21T08:40:56.0868564Z .loc 1 50 115 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:50:115 2026-02-21T08:40:56.0868622Z add.s32 %r836, %r846, 12; 2026-02-21T08:40:56.0868792Z .loc 1 55 22 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:55:22 2026-02-21T08:40:56.0868877Z shl.b32 %r806, %r836, 1; 2026-02-21T08:40:56.0869038Z .loc 1 58 60 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:60 2026-02-21T08:40:56.0869103Z add.s32 %r807, %r794, %r806; 2026-02-21T08:40:56.0869160Z add.s32 %r808, %r795, %r806; 2026-02-21T08:40:56.0869320Z .loc 1 58 32 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:32 2026-02-21T08:40:56.0869414Z mad.wide.s32 %rd92, %r807, 2, %rd3; 2026-02-21T08:40:56.0869478Z 
mad.wide.s32 %rd93, %r808, 2, %rd3; 2026-02-21T08:40:56.0869638Z .loc 1 58 80 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:58:80 2026-02-21T08:40:56.0869696Z add.s32 %r786, %r774, 73728; 2026-02-21T08:40:56.0869759Z // begin inline asm 2026-02-21T08:40:56.0869867Z cp.async.cg.shared.global [ %r786 + 0 ], [ %rd92 + 0 ], 0x10, %r775; 2026-02-21T08:40:56.0869921Z // end inline asm 2026-02-21T08:40:56.0869984Z add.s32 %r788, %r774, 77824; 2026-02-21T08:40:56.0870040Z // begin inline asm 2026-02-21T08:40:56.0870148Z cp.async.cg.shared.global [ %r788 + 0 ], [ %rd93 + 0 ], 0x10, %r775; 2026-02-21T08:40:56.0870209Z // end inline asm 2026-02-21T08:40:56.0870271Z cp.async.commit_group; 2026-02-21T08:40:56.0870457Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0870523Z setp.ne.b32 %p109, %r835, 31; 2026-02-21T08:40:56.0870592Z @%p109 bra $L__BB0_14; 2026-02-21T08:40:56.0870711Z // %bb.13: // in Loop: Header=BB0_2 Depth=1 2026-02-21T08:40:56.0870875Z .loc 1 0 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:0:121 2026-02-21T08:40:56.0870947Z setp.lt.u32 %p105, %r1, 64; 2026-02-21T08:40:56.0871109Z .loc 1 98 43 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:98:43 2026-02-21T08:40:56.0871182Z cp.async.bulk.wait_group.read 0; 2026-02-21T08:40:56.0871244Z bar.sync 0; 2026-02-21T08:40:56.0871316Z cvt.rn.bf16x2.f32 %r812, %r748, %r747; 2026-02-21T08:40:56.0871387Z cvt.rn.bf16x2.f32 %r813, %r746, %r745; 2026-02-21T08:40:56.0871451Z cvt.rn.bf16x2.f32 %r814, %r744, %r743; 2026-02-21T08:40:56.0871522Z cvt.rn.bf16x2.f32 %r815, %r742, %r741; 2026-02-21T08:40:56.0871653Z st.shared.v4.b32 [%r32], {%r815, %r814, %r813, %r812}; 2026-02-21T08:40:56.0871720Z cvt.rn.bf16x2.f32 %r816, %r765, %r764; 2026-02-21T08:40:56.0871789Z cvt.rn.bf16x2.f32 %r817, %r763, %r762; 2026-02-21T08:40:56.0871853Z cvt.rn.bf16x2.f32 %r818, %r761, %r760; 2026-02-21T08:40:56.0871914Z cvt.rn.bf16x2.f32 %r819, %r759, %r758; 
2026-02-21T08:40:56.0872013Z st.shared.v4.b32 [%r32+8192], {%r819, %r818, %r817, %r816}; 2026-02-21T08:40:56.0872082Z cvt.rn.bf16x2.f32 %r820, %r756, %r755; 2026-02-21T08:40:56.0872145Z cvt.rn.bf16x2.f32 %r821, %r754, %r753; 2026-02-21T08:40:56.0872207Z cvt.rn.bf16x2.f32 %r822, %r752, %r751; 2026-02-21T08:40:56.0872279Z cvt.rn.bf16x2.f32 %r823, %r750, %r749; 2026-02-21T08:40:56.0872370Z st.shared.v4.b32 [%r33], {%r823, %r822, %r821, %r820}; 2026-02-21T08:40:56.0872433Z cvt.rn.bf16x2.f32 %r824, %r773, %r772; 2026-02-21T08:40:56.0872503Z cvt.rn.bf16x2.f32 %r825, %r771, %r770; 2026-02-21T08:40:56.0872566Z cvt.rn.bf16x2.f32 %r826, %r769, %r768; 2026-02-21T08:40:56.0872633Z cvt.rn.bf16x2.f32 %r827, %r767, %r766; 2026-02-21T08:40:56.0872731Z st.shared.v4.b32 [%r33+8192], {%r827, %r826, %r825, %r824}; 2026-02-21T08:40:56.0872797Z // begin inline asm 2026-02-21T08:40:56.0872873Z fence.proxy.async.shared::cta; 2026-02-21T08:40:56.0872928Z // end inline asm 2026-02-21T08:40:56.0872990Z bar.sync 0; 2026-02-21T08:40:56.0873055Z elect.sync %r828|%p106, -1; 2026-02-21T08:40:56.0873121Z and.pred %p104, %p105, %p106; 2026-02-21T08:40:56.0873181Z add.s32 %r810, %r833, %r35; 2026-02-21T08:40:56.0873250Z // begin inline asm 2026-02-21T08:40:56.0873432Z @%p104 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd94, {%r831, %r810}], [%r811]; 2026-02-21T08:40:56.0873520Z // end inline asm 2026-02-21T08:40:56.0873607Z cp.async.bulk.commit_group; 2026-02-21T08:40:56.0873666Z bra.uni $L__BB0_14; 2026-02-21T08:40:56.0873750Z $L__BB0_15: // %._crit_edge 2026-02-21T08:40:56.0873928Z .loc 1 29 121 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:121 2026-02-21T08:40:56.0874032Z cp.async.bulk.wait_group.read 0; 2026-02-21T08:40:56.0874087Z bar.sync 0; 2026-02-21T08:40:56.0874152Z cp.async.wait_group 0; 2026-02-21T08:40:56.0874215Z bar.sync 0; 2026-02-21T08:40:56.0874381Z .loc 1 29 4 // crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py:29:4 2026-02-21T08:40:56.0874435Z 
bar.sync 0; 2026-02-21T08:40:56.0874500Z // begin inline asm 2026-02-21T08:40:56.0874616Z @%p3 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r829, 512; 2026-02-21T08:40:56.0874671Z // end inline asm 2026-02-21T08:40:56.0874732Z ret; 2026-02-21T08:40:56.0874787Z $L__tmp13: 2026-02-21T08:40:56.0874843Z $L__func_end0: 2026-02-21T08:40:56.0874926Z // -- End function 2026-02-21T08:40:56.0874988Z } 2026-02-21T08:40:56.0875186Z .file 1 "/tmp/torchinductor_root/ra/crabfvk37a6znwb72n5nsgxxvt7q3bymzifroxqfwnr3m3ta4xx6.py" 2026-02-21T08:40:56.0875392Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T08:40:56.0875466Z .section .debug_abbrev 2026-02-21T08:40:56.0875518Z { 2026-02-21T08:40:56.0875629Z .b8 1 // Abbreviation Code 2026-02-21T08:40:56.0875717Z .b8 17 // DW_TAG_compile_unit 2026-02-21T08:40:56.0875807Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:40:56.0875888Z .b8 37 // DW_AT_producer 2026-02-21T08:40:56.0875968Z .b8 8 // DW_FORM_string 2026-02-21T08:40:56.0876050Z .b8 19 // DW_AT_language 2026-02-21T08:40:56.0876128Z .b8 5 // DW_FORM_data2 2026-02-21T08:40:56.0876202Z .b8 3 // DW_AT_name 2026-02-21T08:40:56.0876282Z .b8 8 // DW_FORM_string 2026-02-21T08:40:56.0876362Z .b8 16 // DW_AT_stmt_list 2026-02-21T08:40:56.0876438Z .b8 6 // DW_FORM_data4 2026-02-21T08:40:56.0876521Z .b8 27 // DW_AT_comp_dir 2026-02-21T08:40:56.0876596Z .b8 8 // DW_FORM_string 2026-02-21T08:40:56.0876670Z .b8 0 // EOM(1) 2026-02-21T08:40:56.0876739Z .b8 0 // EOM(2) 2026-02-21T08:40:56.0876831Z .b8 2 // Abbreviation Code 2026-02-21T08:40:56.0876912Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:40:56.0876986Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:40:56.0877068Z .b8 3 // DW_AT_name 2026-02-21T08:40:56.0877141Z .b8 8 // DW_FORM_string 2026-02-21T08:40:56.0877219Z .b8 32 // DW_AT_inline 2026-02-21T08:40:56.0877303Z .b8 11 // DW_FORM_data1 2026-02-21T08:40:56.0877374Z .b8 0 // EOM(1) 2026-02-21T08:40:56.0877442Z .b8 0 // EOM(2) 
2026-02-21T08:40:56.0877523Z .b8 3 // Abbreviation Code 2026-02-21T08:40:56.0877610Z .b8 46 // DW_TAG_subprogram 2026-02-21T08:40:56.0877687Z .b8 1 // DW_CHILDREN_yes 2026-02-21T08:40:56.0877765Z .b8 17 // DW_AT_low_pc 2026-02-21T08:40:56.0877845Z .b8 1 // DW_FORM_addr 2026-02-21T08:40:56.0877920Z .b8 18 // DW_AT_high_pc 2026-02-21T08:40:56.0878014Z .b8 1 // DW_FORM_addr 2026-02-21T08:40:56.0878104Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:40:56.0878175Z .b8 19 // DW_FORM_ref4 2026-02-21T08:40:56.0878242Z .b8 0 // EOM(1) 2026-02-21T08:40:56.0878347Z .b8 0 // EOM(2) 2026-02-21T08:40:56.0878434Z .b8 4 // Abbreviation Code 2026-02-21T08:40:56.0878526Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T08:40:56.0878602Z .b8 0 // DW_CHILDREN_no 2026-02-21T08:40:56.0878690Z .b8 49 // DW_AT_abstract_origin 2026-02-21T08:40:56.0878761Z .b8 19 // DW_FORM_ref4 2026-02-21T08:40:56.0878832Z .b8 17 // DW_AT_low_pc 2026-02-21T08:40:56.0878908Z .b8 1 // DW_FORM_addr 2026-02-21T08:40:56.0878986Z .b8 18 // DW_AT_high_pc 2026-02-21T08:40:56.0879057Z .b8 1 // DW_FORM_addr 2026-02-21T08:40:56.0879139Z .b8 88 // DW_AT_call_file 2026-02-21T08:40:56.0879237Z .b8 11 // DW_FORM_data1 2026-02-21T08:40:56.0879317Z .b8 89 // DW_AT_call_line 2026-02-21T08:40:56.0879393Z .b8 11 // DW_FORM_data1 2026-02-21T08:40:56.0879498Z .b8 87 // DW_AT_call_column 2026-02-21T08:40:56.0879573Z .b8 11 // DW_FORM_data1 2026-02-21T08:40:56.0879642Z .b8 0 // EOM(1) 2026-02-21T08:40:56.0879719Z .b8 0 // EOM(2) 2026-02-21T08:40:56.0879788Z .b8 0 // EOM(3) 2026-02-21T08:40:56.0879839Z } 2026-02-21T08:40:56.0879910Z .section .debug_info 2026-02-21T08:40:56.0879962Z { 2026-02-21T08:40:56.0880045Z .b32 178 // Length of Unit 2026-02-21T08:40:56.0880132Z .b8 2 // DWARF version number 2026-02-21T08:40:56.0880190Z .b8 0 2026-02-21T08:40:56.0880307Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T08:40:56.0880396Z .b8 8 // Address Size (in bytes) 2026-02-21T08:40:56.0880504Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T08:40:56.0880585Z .b8 116 // DW_AT_producer 2026-02-21T08:40:56.0880639Z .b8 114 2026-02-21T08:40:56.0880692Z .b8 105 2026-02-21T08:40:56.0880752Z .b8 116 2026-02-21T08:40:56.0880803Z .b8 111 2026-02-21T08:40:56.0880854Z .b8 110 2026-02-21T08:40:56.0880913Z .b8 0 2026-02-21T08:40:56.0880986Z .b8 2 // DW_AT_language 2026-02-21T08:40:56.0881037Z .b8 0 2026-02-21T08:40:56.0881114Z .b8 99 // DW_AT_name 2026-02-21T08:40:56.0881176Z .b8 114 2026-02-21T08:40:56.0881229Z .b8 97 2026-02-21T08:40:56.0881282Z .b8 98 2026-02-21T08:40:56.0881341Z .b8 102 2026-02-21T08:40:56.0881396Z .b8 118 2026-02-21T08:40:56.0881447Z .b8 107 2026-02-21T08:40:56.0881498Z .b8 51 2026-02-21T08:40:56.0881589Z .b8 55 2026-02-21T08:40:56.0881642Z .b8 97 2026-02-21T08:40:56.0881693Z .b8 54 2026-02-21T08:40:56.0881753Z .b8 122 2026-02-21T08:40:56.0881804Z .b8 110 2026-02-21T08:40:56.0881856Z .b8 119 2026-02-21T08:40:56.0881907Z .b8 98 2026-02-21T08:40:56.0881965Z .b8 55 2026-02-21T08:40:56.0882014Z .b8 50 2026-02-21T08:40:56.0882067Z .b8 110 2026-02-21T08:40:56.0882123Z .b8 53 2026-02-21T08:40:56.0882173Z .b8 110 2026-02-21T08:40:56.0882224Z .b8 115 2026-02-21T08:40:56.0882274Z .b8 103 2026-02-21T08:40:56.0882331Z .b8 120 2026-02-21T08:40:56.0882381Z .b8 120 2026-02-21T08:40:56.0882431Z .b8 118 2026-02-21T08:40:56.0882481Z .b8 116 2026-02-21T08:40:56.0882540Z .b8 55 2026-02-21T08:40:56.0882625Z .b8 113 2026-02-21T08:40:56.0882675Z .b8 51 2026-02-21T08:40:56.0882734Z .b8 98 2026-02-21T08:40:56.0882785Z .b8 121 2026-02-21T08:40:56.0882835Z .b8 109 2026-02-21T08:40:56.0882887Z .b8 122 2026-02-21T08:40:56.0882944Z .b8 105 2026-02-21T08:40:56.0882995Z .b8 102 2026-02-21T08:40:56.0883046Z .b8 114 2026-02-21T08:40:56.0883104Z .b8 111 2026-02-21T08:40:56.0883183Z .b8 120 2026-02-21T08:40:56.0883234Z .b8 113 2026-02-21T08:40:56.0883285Z .b8 102 
2026-02-21T08:40:56.0883343Z .b8 119 2026-02-21T08:40:56.0883393Z .b8 110 2026-02-21T08:40:56.0883445Z .b8 114 2026-02-21T08:40:56.0883512Z .b8 51 2026-02-21T08:40:56.0883571Z .b8 109 2026-02-21T08:40:56.0883620Z .b8 51 2026-02-21T08:40:56.0883671Z .b8 116 2026-02-21T08:40:56.0883727Z .b8 97 2026-02-21T08:40:56.0883778Z .b8 52 2026-02-21T08:40:56.0883828Z .b8 120 2026-02-21T08:40:56.0883877Z .b8 120 2026-02-21T08:40:56.0883933Z .b8 54 2026-02-21T08:40:56.0883983Z .b8 46 2026-02-21T08:40:56.0884033Z .b8 112 2026-02-21T08:40:56.0884088Z .b8 121 2026-02-21T08:40:56.0884140Z .b8 0 2026-02-21T08:40:56.0884227Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T08:40:56.0884301Z .b8 47 // DW_AT_comp_dir 2026-02-21T08:40:56.0884358Z .b8 116 2026-02-21T08:40:56.0884409Z .b8 109 2026-02-21T08:40:56.0884458Z .b8 112 2026-02-21T08:40:56.0884514Z .b8 47 2026-02-21T08:40:56.0884595Z .b8 116 2026-02-21T08:40:56.0884646Z .b8 111 2026-02-21T08:40:56.0884696Z .b8 114 2026-02-21T08:40:56.0884755Z .b8 99 2026-02-21T08:40:56.0884805Z .b8 104 2026-02-21T08:40:56.0884882Z .b8 105 2026-02-21T08:40:56.0884942Z .b8 110 2026-02-21T08:40:56.0884992Z .b8 100 2026-02-21T08:40:56.0885043Z .b8 117 2026-02-21T08:40:56.0885092Z .b8 99 2026-02-21T08:40:56.0885150Z .b8 116 2026-02-21T08:40:56.0885199Z .b8 111 2026-02-21T08:40:56.0885249Z .b8 114 2026-02-21T08:40:56.0885298Z .b8 95 2026-02-21T08:40:56.0885357Z .b8 114 2026-02-21T08:40:56.0885407Z .b8 111 2026-02-21T08:40:56.0885456Z .b8 111 2026-02-21T08:40:56.0885514Z .b8 116 2026-02-21T08:40:56.0885565Z .b8 47 2026-02-21T08:40:56.0885617Z .b8 114 2026-02-21T08:40:56.0885666Z .b8 97 2026-02-21T08:40:56.0885723Z .b8 0 2026-02-21T08:40:56.0885822Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T08:40:56.0885896Z .b8 95 // DW_AT_name 2026-02-21T08:40:56.0885958Z .b8 104 2026-02-21T08:40:56.0886010Z .b8 101 2026-02-21T08:40:56.0886061Z .b8 108 2026-02-21T08:40:56.0886111Z .b8 105 2026-02-21T08:40:56.0886169Z .b8 111 2026-02-21T08:40:56.0886221Z 
.b8 110 2026-02-21T08:40:56.0886273Z .b8 95 2026-02-21T08:40:56.0886333Z .b8 109 2026-02-21T08:40:56.0886383Z .b8 97 2026-02-21T08:40:56.0886434Z .b8 116 2026-02-21T08:40:56.0886486Z .b8 109 2026-02-21T08:40:56.0886546Z .b8 117 2026-02-21T08:40:56.0886596Z .b8 108 2026-02-21T08:40:56.0886647Z .b8 95 2026-02-21T08:40:56.0886707Z .b8 98 2026-02-21T08:40:56.0886758Z .b8 102 2026-02-21T08:40:56.0886811Z .b8 49 2026-02-21T08:40:56.0886863Z .b8 54 2026-02-21T08:40:56.0886924Z .b8 95 2026-02-21T08:40:56.0886978Z .b8 105 2026-02-21T08:40:56.0887030Z .b8 110 2026-02-21T08:40:56.0887081Z .b8 116 2026-02-21T08:40:56.0887139Z .b8 52 2026-02-21T08:40:56.0887190Z .b8 0 2026-02-21T08:40:56.0887265Z .b8 1 // DW_AT_inline 2026-02-21T08:40:56.0887369Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T08:40:56.0887456Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T08:40:56.0887546Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T08:40:56.0887643Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:40:56.0887753Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T08:40:56.0887842Z .b32 108 // DW_AT_abstract_origin 2026-02-21T08:40:56.0887925Z .b64 $L__tmp1 // DW_AT_low_pc 2026-02-21T08:40:56.0888016Z .b64 $L__tmp12 // DW_AT_high_pc 2026-02-21T08:40:56.0888113Z .b8 1 // DW_AT_call_file 2026-02-21T08:40:56.0888189Z .b8 94 // DW_AT_call_line 2026-02-21T08:40:56.0888275Z .b8 40 // DW_AT_call_column 2026-02-21T08:40:56.0888358Z .b8 0 // End Of Children Mark 2026-02-21T08:40:56.0888465Z .b8 0 // End Of Children Mark 2026-02-21T08:40:56.0888522Z } 2026-02-21T08:40:56.0888589Z .section .debug_macinfo { } 2026-02-21T08:40:56.0888594Z 2026-02-21T08:40:56.0888671Z ================================================================ 2026-02-21T08:40:56.0888781Z please share the reproducer above with Triton project. 
2026-02-21T08:41:03.4992711Z [375s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=6, num_warps=32, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:41:03.4993865Z Tensor-likes are not close! 2026-02-21T08:41:03.4998603Z 2026-02-21T08:41:03.5000630Z Mismatched elements: 133871517 / 134217728 (99.7%) 2026-02-21T08:41:03.5001285Z Greatest absolute difference: 1944.0 at index (7376, 3075) (up to 0.01 allowed) 2026-02-21T08:41:03.5007011Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:41:03.5010590Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:41:03.5014923Z 2026-02-21T08:41:14.0797191Z 2026-02-21T08:41:14.0799968Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 103/103 3.9 configs/s 2026-02-21T08:41:14.2968842Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 65/65 70.9 configs/s 2026-02-21T08:41:15.0352474Z [387s] Generation 1 complete: 2026-02-21T08:41:15.0356920Z error=9 2026-02-21T08:41:15.0358340Z ok=95 2026-02-21T08:41:15.0358515Z min=3.0721 2026-02-21T08:41:15.0358655Z mid=8.8965 2026-02-21T08:41:15.0358778Z max=2069.1907 2026-02-21T08:41:15.0358942Z best={'block_sizes': [8, 512, 64], 2026-02-21T08:41:15.0359201Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:41:15.0359467Z 'l2_groupings': [4], 2026-02-21T08:41:15.0359653Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:41:15.0359856Z 'loop_orders': [[0, 1]], 2026-02-21T08:41:15.0360019Z 'num_sm_multiplier': 64, 2026-02-21T08:41:15.0360187Z 'num_stages': 4, 2026-02-21T08:41:15.0360328Z 'num_warps': 16, 2026-02-21T08:41:15.0360486Z 'pid_type': 'persistent_blocked', 
2026-02-21T08:41:15.0360673Z 'range_flattens': [False, False], 2026-02-21T08:41:15.0360851Z 'range_multi_buffers': [True, None], 2026-02-21T08:41:15.0361041Z 'range_num_stages': [1, 1], 2026-02-21T08:41:15.0361206Z 'range_unroll_factors': [1, 4], 2026-02-21T08:41:15.0361393Z 'range_warp_specializes': [None, False]} 2026-02-21T08:41:15.0373466Z [387s] Fitting surrogate: 204 points, 204 targets 2026-02-21T08:41:16.3484330Z [388s] Generation 2 starting: 97 neighbors, 5 active search path(s) 2026-02-21T08:41:24.2632364Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 11.4 configs/s 2026-02-21T08:41:25.7187642Z [397s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_stages=6, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:41:25.7188834Z Tensor-likes are not close! 2026-02-21T08:41:25.7193864Z 2026-02-21T08:41:25.7196140Z Mismatched elements: 133775635 / 134217728 (99.7%) 2026-02-21T08:41:25.7196576Z Greatest absolute difference: 1424.0 at index (11028, 2801) (up to 0.01 allowed) 2026-02-21T08:41:25.7201044Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:41:25.7202142Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:41:25.7202341Z 2026-02-21T08:41:26.3134747Z [398s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=32, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[1, 4], range_warp_specializes=[None, False]) 2026-02-21T08:41:26.3136051Z Tensor-likes are not close! 2026-02-21T08:41:26.3136172Z 2026-02-21T08:41:26.3136262Z Mismatched elements: 133741162 / 134217728 (99.6%) 2026-02-21T08:41:26.3136566Z Greatest absolute difference: 1424.0 at index (15238, 5873) (up to 0.01 allowed) 2026-02-21T08:41:26.3136910Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:41:26.3137234Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:41:26.3137401Z 2026-02-21T08:41:27.0697057Z [399s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=4, num_warps=32, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[1, 4], range_warp_specializes=[None, False]) 2026-02-21T08:41:27.0698244Z Tensor-likes are not close! 2026-02-21T08:41:27.0698392Z 2026-02-21T08:41:27.0698623Z Mismatched elements: 133741162 / 134217728 (99.6%) 2026-02-21T08:41:27.0698916Z Greatest absolute difference: 1424.0 at index (15238, 5873) (up to 0.01 allowed) 2026-02-21T08:41:27.0699272Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:41:27.0699627Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:41:27.0703478Z 2026-02-21T08:41:28.1047865Z [400s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_stages=6, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:41:28.1049834Z Tensor-likes are not close! 2026-02-21T08:41:28.1050023Z 2026-02-21T08:41:28.1054282Z Mismatched elements: 133907987 / 134217728 (99.8%) 2026-02-21T08:41:28.1058933Z Greatest absolute difference: 2464.0 at index (3576, 6955) (up to 0.01 allowed) 2026-02-21T08:41:28.1060790Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:41:28.1061173Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:41:28.1061341Z 2026-02-21T08:41:29.0652011Z [401s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=6, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:41:29.0653046Z Tensor-likes are not close! 2026-02-21T08:41:29.0653165Z 2026-02-21T08:41:29.0653263Z Mismatched elements: 133693216 / 134217728 (99.6%) 2026-02-21T08:41:29.0653543Z Greatest absolute difference: 1488.0 at index (3576, 6955) (up to 0.01 allowed) 2026-02-21T08:41:29.0653886Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:41:29.0654182Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:41:29.0654555Z 2026-02-21T08:41:32.1874792Z [404s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 16, 64], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[32], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:41:32.1876148Z Tensor-likes are not close! 2026-02-21T08:41:32.1876394Z 2026-02-21T08:41:32.1876490Z Mismatched elements: 133741162 / 134217728 (99.6%) 2026-02-21T08:41:32.1876845Z Greatest absolute difference: 1424.0 at index (15238, 5873) (up to 0.01 allowed) 2026-02-21T08:41:32.1877262Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T08:41:32.1877618Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:41:32.1877824Z 2026-02-21T08:41:33.9181412Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 10.4 configs/s 2026-02-21T08:41:34.4545536Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━ 118/118 129.3 configs/s 2026-02-21T08:41:34.8588176Z [406s] Generation 2 complete: 2026-02-21T08:41:34.8592493Z error=21 2026-02-21T08:41:34.8594017Z ok=82 2026-02-21T08:41:34.8594229Z min=1.6885 2026-02-21T08:41:34.8598666Z mid=5.9029 2026-02-21T08:41:34.8601800Z max=82.9010 2026-02-21T08:41:34.8606022Z best={'block_sizes': [16, 64, 256], 2026-02-21T08:41:34.8607691Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:41:34.8607951Z 'l2_groupings': [8], 2026-02-21T08:41:34.8608131Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:41:34.8608338Z 'loop_orders': [[1, 0]], 2026-02-21T08:41:34.8608500Z 'maxnreg': 256, 2026-02-21T08:41:34.8608658Z 'num_sm_multiplier': 32, 2026-02-21T08:41:34.8608814Z 'num_stages': 6, 2026-02-21T08:41:34.8608961Z 'num_warps': 4, 2026-02-21T08:41:34.8609116Z 
'pid_type': 'persistent_interleaved', 2026-02-21T08:41:34.8609331Z 'range_flattens': [False, False], 2026-02-21T08:41:34.8609527Z 'range_multi_buffers': [True, None], 2026-02-21T08:41:34.8609713Z 'range_num_stages': [3, 0], 2026-02-21T08:41:34.8609891Z 'range_unroll_factors': [0, 0], 2026-02-21T08:41:34.8610072Z 'range_warp_specializes': [True, None]} 2026-02-21T08:41:34.8612405Z [406s] Fitting surrogate: 307 points, 307 targets 2026-02-21T08:41:36.0717454Z [408s] Generation 3 starting: 93 neighbors, 5 active search path(s) 2026-02-21T08:41:50.9606147Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 2.0 configs/s 2026-02-21T08:41:52.2643194Z Generation 3: exploring neighbors 16% ━━━ 15/94 12.6 configs/s 2026-02-21T08:41:52.2726985Z 2026-02-21T08:41:52.2727333Z 60%|██████ | 6/10 [32:52<21:54, 328.73s/it] 2026-02-21T08:41:52.2736034Z WARNING:tritonbench.utils.triton_op:Caught exception on backend helion_int4_gemm_tritonbench, terminating early with partial results 2026-02-21T08:41:52.2737380Z Traceback (most recent call last): 2026-02-21T08:41:52.2748095Z File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1199, in run 2026-02-21T08:41:52.2748567Z y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce( 2026-02-21T08:41:52.2748860Z ^^^^^^^^^^^^^^^^^ 2026-02-21T08:41:52.2749328Z File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1188, in _reduce_benchmarks 2026-02-21T08:41:52.2749730Z torch.accelerator.synchronize() 2026-02-21T08:41:52.2750139Z File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/accelerator/__init__.py", line 235, in synchronize 2026-02-21T08:41:52.2750559Z torch._C._accelerator_synchronizeDevice(device_index) 2026-02-21T08:41:52.2750837Z torch.AcceleratorError: CUDA error: misaligned address 2026-02-21T08:41:52.2751308Z Search for `cudaErrorMisalignedAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more 
information. 2026-02-21T08:41:52.2752218Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. 2026-02-21T08:41:52.2752634Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1 2026-02-21T08:41:52.2752926Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 2026-02-21T08:41:52.2753117Z 2026-02-21T08:41:52.2753442Z WARNING:tritonbench.utils.triton_op:Failing input: --input-id 21 --num-inputs 1 --input-sample-mode first-k 2026-02-21T08:41:52.2753996Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpqbve6vo0.csv 2026-02-21T08:41:53.3137224Z x_val preprocessed_torch_compile_int4_gemm-speedup preprocessed_torch_compile_int4_gemm-accuracy preprocessed_triton_int4_gemm-speedup preprocessed_triton_int4_gemm-accuracy helion_int4_gemm_tritonbench-speedup helion_int4_gemm_tritonbench-accuracy 2026-02-21T08:41:53.3138485Z --------------------- ---------------------------------------------- ----------------------------------------------- --------------------------------------- ---------------------------------------- -------------------------------------- --------------------------------------- 2026-02-21T08:41:53.3139600Z (1, 1, 1280, 8192) 8.78248 1 0.771215 1 8.61771 1 2026-02-21T08:41:53.3140223Z (1, 1, 8192, 3584) 8.33484 1 4.02633 1 13.9388 1 2026-02-21T08:41:53.3140834Z (4, 1, 8192, 3584) 10.2644 1 4.25197 1 6.8542 1 2026-02-21T08:41:53.3141442Z (16, 1, 7168, 8192) 8.61667 1 3.72182 1 5.93722 1 2026-02-21T08:41:53.3142297Z (64, 1, 7168, 8192) 8.16595 1 2.61275 1 3.95119 1 2026-02-21T08:41:53.3142897Z (1, 4096, 8192, 1024) 2.07964 1 0.119139 1 0.529183 1 2026-02-21T08:41:53.3143513Z average 7.70733 1 2.58387 1 6.63806 1 2026-02-21T08:43:59.6130354Z Applying custom args for int4_gemm: {'num_inputs': 10} 2026-02-21T08:43:59.6234512Z Running int4_gemm benchmark with Helion implementation... 
2026-02-21T08:43:59.6236126Z 2026-02-21T08:43:59.8489566Z Equally-spaced-k mode: Selected 10 equally spaced inputs (total available: 32) 2026-02-21T08:43:59.8491627Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 3, 7, 10, 14, 17, 21, 24, 28, 31] 2026-02-21T08:43:59.8496647Z 2026-02-21T08:43:59.8504835Z 0%| | 0/10 [00:00> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T08:49:07.7198550Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T08:49:07.7198728Z v_2 = b_tile << v_1 2026-02-21T08:49:07.7198902Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T08:49:07.7199121Z v_4 = v_2 >> v_3 2026-02-21T08:49:07.7199371Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T08:49:07.7199632Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T08:49:07.7199811Z v_6 = b_tile >> v_5 2026-02-21T08:49:07.7200127Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 2026-02-21T08:49:07.7200482Z stack_idx = tl.arange(0, 2) 2026-02-21T08:49:07.7200862Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T08:49:07.7201197Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T08:49:07.7201518Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T08:49:07.7201914Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T08:49:07.7202255Z mask_0 = broadcast_idx == 0 2026-02-21T08:49:07.7202613Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T08:49:07.7202943Z mask_1 = broadcast_idx == 1 2026-02-21T08:49:07.7203280Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T08:49:07.7203560Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T08:49:07.7203856Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T08:49:07.7204133Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T08:49:07.7204391Z view = tl.reshape(stacked_result, [_SHAPE_DIM_3, _BLOCK_SIZE_2]) 2026-02-21T08:49:07.7204641Z v_7 = tl.cast(view, tl.float32) 2026-02-21T08:49:07.7204946Z # 
src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T08:49:07.7205232Z a_tile_1 = v_0[:, :, None] 2026-02-21T08:49:07.7205466Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T08:49:07.7205820Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T08:49:07.7206247Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T08:49:07.7206691Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T08:49:07.7206983Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T08:49:07.7207256Z acc = acc_copy_0 + sum_1 2026-02-21T08:49:07.7207569Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T08:49:07.7207918Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T08:49:07.7208352Z tl.store(C + tl.broadcast_to(indices_2[None, :] * 1, [_BLOCK_SIZE_1, _BLOCK_SIZE_2]), v_10, None) 2026-02-21T08:49:07.7208661Z 2026-02-21T08:49:07.7208799Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T08:49:07.7209039Z """ 2026-02-21T08:49:07.7209245Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T08:49:07.7209397Z 2026-02-21T08:49:07.7209490Z This kernel performs matrix multiplication where: 2026-02-21T08:49:07.7209722Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T08:49:07.7209966Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T08:49:07.7210224Z (two 4-bit values packed into each int8) 2026-02-21T08:49:07.7210352Z 2026-02-21T08:49:07.7210413Z Args: 2026-02-21T08:49:07.7210590Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T08:49:07.7210953Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T08:49:07.7211123Z 2026-02-21T08:49:07.7211186Z Returns: 2026-02-21T08:49:07.7211379Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 
2026-02-21T08:49:07.7211649Z """ 2026-02-21T08:49:07.7211800Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T08:49:07.7211976Z M, K = A.shape 2026-02-21T08:49:07.7212137Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T08:49:07.7212322Z _, N = B.shape 2026-02-21T08:49:07.7212546Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T08:49:07.7212938Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T08:49:07.7213203Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T08:49:07.7213486Z _NUM_SM = helion.runtime.get_num_sm(A.device) 2026-02-21T08:49:07.7213875Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T08:49:07.7214322Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T08:49:07.7214765Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T08:49:07.7215064Z # src[int4_gemm.py:60-89]: ... 2026-02-21T08:49:07.7215245Z _BLOCK_SIZE_0 = 1024 2026-02-21T08:49:07.7215489Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T08:49:07.7215772Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T08:49:07.7215991Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T08:49:07.7216278Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T08:49:07.7216545Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T08:49:07.7216741Z _SHAPE_DIM_3 = 2 * _BLOCK_SIZE_0 2026-02-21T08:49:07.7216970Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T08:49:07.7217258Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T08:49:07.7217520Z # src[int4_gemm.py:57-91]: ... 
2026-02-21T08:49:07.7217724Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T08:49:07.7218163Z _launcher(_helion_matmul_bf16_int4, (_NUM_SM * 4,), A, B, C, _NUM_SM, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, _SHAPE_DIM_3, num_warps=16, num_stages=1, maxnreg=256) 2026-02-21T08:49:07.7218560Z # src[int4_gemm.py:93]: return C 2026-02-21T08:49:07.7218725Z return C 2026-02-21T08:49:08.0526124Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T08:49:08.0526450Z x_val 2026-02-21T08:49:08.0532156Z ------------------ 2026-02-21T08:49:08.0536445Z (1, 1, 1280, 8192) 2026-02-21T08:49:08.0540316Z 2026-02-21T08:49:08.0544173Z 10%|█ | 1/10 [05:08<46:13, 308.20s/it]WARNING:tritonbench.utils.triton_op:Running input ID 3: 2026-02-21T08:49:08.0548549Z x_val 2026-02-21T08:49:08.0553292Z ------------------ 2026-02-21T08:49:08.0553579Z (1, 1, 8192, 3584) 2026-02-21T08:49:08.0553987Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:49:09.1394447Z INFO:tritonbench.utils.triton_op:Took 2.94ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:49:10.4198827Z INFO:tritonbench.utils.triton_op:Took 0.11ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:49:11.8530796Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:49:11.8534741Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:49:11.8539259Z 'dtype': 'torch.bfloat16', 2026-02-21T08:49:11.8543649Z 'shape': (1, 1, 3584), 2026-02-21T08:49:11.8545234Z 'stride': (3584, 3584, 1)}, 2026-02-21T08:49:11.8545469Z { 'device': 'cuda:0', 2026-02-21T08:49:11.8545662Z 'dtype': 'torch.int32', 2026-02-21T08:49:11.8545857Z 'shape': (3584, 8192), 2026-02-21T08:49:11.8546282Z 'stride': (8192, 1)}), 2026-02-21T08:49:11.8546449Z 'kwargs': {}} 2026-02-21T08:49:11.8573716Z INFO:tritonbench.utils.triton_op:Took 4.53ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:49:12.1259863Z [0s] 
Autotune random seed: 2136913670 2026-02-21T08:49:12.1517618Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:49:20.8082511Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 4.3 configs/s 2026-02-21T08:49:26.6356439Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.3 configs/s 2026-02-21T08:49:26.6366373Z [14s] Adaptive compile timeout: 30s (90% percentile=1.6s, bounds=[30.0s, 30s]) 2026-02-21T08:49:26.8462766Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 4583.2 configs/s 2026-02-21T08:49:26.8727913Z [14s] Initial random population of 100, 5 starting points: 2026-02-21T08:49:26.8729879Z error=2 2026-02-21T08:49:26.8730067Z ok=98 2026-02-21T08:49:26.8730203Z min=0.0257 2026-02-21T08:49:26.8730330Z mid=0.1381 2026-02-21T08:49:26.8730462Z max=3.9978 2026-02-21T08:49:26.8730709Z best={'block_sizes': [512, 1, 64], 2026-02-21T08:49:26.8730990Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:49:26.8731263Z 'l2_groupings': [1], 2026-02-21T08:49:26.8731437Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:49:26.8731835Z 'loop_orders': [[0, 1]], 2026-02-21T08:49:26.8731991Z 'num_stages': 8, 2026-02-21T08:49:26.8732141Z 'num_warps': 16, 2026-02-21T08:49:26.8732291Z 'pid_type': 'xyz', 2026-02-21T08:49:26.8732451Z 'range_flattens': [None, True], 2026-02-21T08:49:26.8732630Z 'range_multi_buffers': [None, None], 2026-02-21T08:49:26.8732816Z 'range_num_stages': [0, 4], 2026-02-21T08:49:26.8732990Z 'range_unroll_factors': [0, 1], 2026-02-21T08:49:26.8733169Z 'range_warp_specializes': [None, None]} 2026-02-21T08:49:26.8746914Z [14s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:49:27.9991398Z [15s] Generation 1 starting: 84 neighbors, 5 active search path(s) 2026-02-21T08:49:46.5028812Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 2.7 configs/s 
2026-02-21T08:49:51.7912329Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 16.6 configs/s 2026-02-21T08:49:53.4855819Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 597.6 2026-02-21T08:49:53.4859775Z configs/s 2026-02-21T08:49:53.6005119Z [41s] Generation 1 complete: 2026-02-21T08:49:53.6009568Z error=3 2026-02-21T08:49:53.6012838Z ok=87 2026-02-21T08:49:53.6014862Z min=0.0236 2026-02-21T08:49:53.6020113Z mid=0.0440 2026-02-21T08:49:53.6022062Z max=0.4538 2026-02-21T08:49:53.6022285Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:49:53.6022635Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:49:53.6022938Z 'l2_groupings': [1], 2026-02-21T08:49:53.6023132Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:49:53.6023356Z 'loop_orders': [[0, 1]], 2026-02-21T08:49:53.6023536Z 'num_stages': 8, 2026-02-21T08:49:53.6023729Z 'num_warps': 16, 2026-02-21T08:49:53.6023872Z 'pid_type': 'xyz', 2026-02-21T08:49:53.6024037Z 'range_flattens': [None, None], 2026-02-21T08:49:53.6024224Z 'range_multi_buffers': [None, None], 2026-02-21T08:49:53.6024422Z 'range_num_stages': [0, 4], 2026-02-21T08:49:53.6024585Z 'range_unroll_factors': [0, 1], 2026-02-21T08:49:53.6024768Z 'range_warp_specializes': [None, None]} 2026-02-21T08:49:53.6025269Z [41s] Fitting surrogate: 190 points, 190 targets 2026-02-21T08:49:54.7484520Z [42s] Generation 2 starting: 80 neighbors, 5 active search path(s) 2026-02-21T08:50:07.6681208Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 2.0 configs/s 2026-02-21T08:50:12.4467743Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 17.3 configs/s 2026-02-21T08:50:16.3954932Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 269.4 2026-02-21T08:50:16.3958912Z configs/s 2026-02-21T08:50:16.6513746Z [64s] Generation 2 complete: 2026-02-21T08:50:16.6515217Z error=2 2026-02-21T08:50:16.6515384Z ok=84 2026-02-21T08:50:16.6515532Z 
min=0.0236 2026-02-21T08:50:16.6515672Z mid=0.0339 2026-02-21T08:50:16.6515816Z max=0.3328 2026-02-21T08:50:16.6515972Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:50:16.6516285Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:50:16.6516594Z 'l2_groupings': [1], 2026-02-21T08:50:16.6516791Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:50:16.6517008Z 'loop_orders': [[0, 1]], 2026-02-21T08:50:16.6517180Z 'num_stages': 8, 2026-02-21T08:50:16.6517344Z 'num_warps': 16, 2026-02-21T08:50:16.6517492Z 'pid_type': 'xyz', 2026-02-21T08:50:16.6517899Z 'range_flattens': [None, None], 2026-02-21T08:50:16.6518101Z 'range_multi_buffers': [None, None], 2026-02-21T08:50:16.6518296Z 'range_num_stages': [0, 4], 2026-02-21T08:50:16.6518463Z 'range_unroll_factors': [0, 1], 2026-02-21T08:50:16.6518741Z 'range_warp_specializes': [None, None]} 2026-02-21T08:50:16.6525649Z [64s] Fitting surrogate: 276 points, 276 targets 2026-02-21T08:50:17.6926743Z [65s] Generation 3 starting: 69 neighbors, 5 active search path(s) 2026-02-21T08:50:33.7749865Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 6.2 configs/s 2026-02-21T08:50:38.0067531Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 17.0 configs/s 2026-02-21T08:50:41.5884329Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 300.3 2026-02-21T08:50:41.5886928Z configs/s 2026-02-21T08:50:41.8338123Z [89s] Generation 3 complete: 2026-02-21T08:50:41.8338328Z ok=75 2026-02-21T08:50:41.8338470Z min=0.0236 2026-02-21T08:50:41.8338620Z mid=0.0299 2026-02-21T08:50:41.8338757Z max=0.1302 2026-02-21T08:50:41.8338894Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:50:41.8339166Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:50:41.8339425Z 'l2_groupings': [1], 2026-02-21T08:50:41.8339644Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:50:41.8339838Z 'loop_orders': [[1, 0]], 
2026-02-21T08:50:41.8340003Z 'num_stages': 8, 2026-02-21T08:50:41.8340148Z 'num_warps': 8, 2026-02-21T08:50:41.8340290Z 'pid_type': 'xyz', 2026-02-21T08:50:41.8340493Z 'range_flattens': [None, None], 2026-02-21T08:50:41.8340738Z 'range_multi_buffers': [None, None], 2026-02-21T08:50:41.8341001Z 'range_num_stages': [0, 4], 2026-02-21T08:50:41.8341223Z 'range_unroll_factors': [0, 0], 2026-02-21T08:50:41.8341492Z 'range_warp_specializes': [None, None]} 2026-02-21T08:50:41.8353239Z [89s] Fitting surrogate: 351 points, 351 targets 2026-02-21T08:50:42.8393801Z [90s] Generation 4 starting: 74 neighbors, 5 active search path(s) 2026-02-21T08:50:46.6913078Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 17.6 configs/s 2026-02-21T08:50:51.3054881Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 16.8 configs/s 2026-02-21T08:50:55.0967731Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 283.1 2026-02-21T08:50:55.0969092Z configs/s 2026-02-21T08:50:55.3599089Z [103s] Generation 4 complete: 2026-02-21T08:50:55.3603494Z ok=79 2026-02-21T08:50:55.3605256Z min=0.0216 2026-02-21T08:50:55.3605425Z mid=0.0297 2026-02-21T08:50:55.3605827Z max=0.3001 2026-02-21T08:50:55.3605977Z best={'block_sizes': [256, 1, 32], 2026-02-21T08:50:55.3606233Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:50:55.3606486Z 'l2_groupings': [1], 2026-02-21T08:50:55.3606651Z 'load_eviction_policies': ['', ''], 2026-02-21T08:50:55.3606846Z 'loop_orders': [[0, 1]], 2026-02-21T08:50:55.3607017Z 'num_stages': 6, 2026-02-21T08:50:55.3607269Z 'num_warps': 4, 2026-02-21T08:50:55.3607429Z 'pid_type': 'flat', 2026-02-21T08:50:55.3607591Z 'range_flattens': [None, True], 2026-02-21T08:50:55.3607861Z 'range_multi_buffers': [None, True], 2026-02-21T08:50:55.3608048Z 'range_num_stages': [0, 3], 2026-02-21T08:50:55.3608223Z 'range_unroll_factors': [0, 1], 2026-02-21T08:50:55.3608406Z 'range_warp_specializes': [None, None]} 
2026-02-21T08:50:55.3629779Z [103s] Fitting surrogate: 430 points, 430 targets 2026-02-21T08:50:56.5058661Z [104s] Generation 5 starting: 82 neighbors, 5 active search path(s) 2026-02-21T08:51:05.2107172Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 4.5 configs/s 2026-02-21T08:51:12.4955766Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 11.6 configs/s 2026-02-21T08:51:16.9988027Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 238.7 2026-02-21T08:51:16.9989385Z configs/s 2026-02-21T08:51:17.3370813Z [125s] Generation 5 complete: 2026-02-21T08:51:17.3371098Z ok=87 2026-02-21T08:51:17.3371269Z min=0.0216 2026-02-21T08:51:17.3371447Z mid=0.0257 2026-02-21T08:51:17.3371900Z max=0.2653 2026-02-21T08:51:17.3372072Z best={'block_sizes': [1024, 1, 64], 2026-02-21T08:51:17.3372404Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:51:17.3372737Z 'l2_groupings': [1], 2026-02-21T08:51:17.3372956Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:51:17.3373201Z 'loop_orders': [[1, 0]], 2026-02-21T08:51:17.3373404Z 'maxnreg': 256, 2026-02-21T08:51:17.3373596Z 'num_sm_multiplier': 8, 2026-02-21T08:51:17.3373803Z 'num_stages': 2, 2026-02-21T08:51:17.3373991Z 'num_warps': 16, 2026-02-21T08:51:17.3374189Z 'pid_type': 'persistent_blocked', 2026-02-21T08:51:17.3374446Z 'range_flattens': [None, False], 2026-02-21T08:51:17.3374697Z 'range_multi_buffers': [None, False], 2026-02-21T08:51:17.3374967Z 'range_num_stages': [3, 4], 2026-02-21T08:51:17.3375207Z 'range_unroll_factors': [1, 1], 2026-02-21T08:51:17.3375466Z 'range_warp_specializes': [None, False]} 2026-02-21T08:51:17.3397664Z [125s] Fitting surrogate: 517 points, 517 targets 2026-02-21T08:51:18.5705651Z [126s] Generation 6 starting: 81 neighbors, 5 active search path(s) 2026-02-21T08:51:27.5268556Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 3.4 configs/s 2026-02-21T08:51:32.3514447Z Generation 6: exploring 
neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 17.4 configs/s 2026-02-21T08:51:36.1564689Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 267.7 2026-02-21T08:51:36.1565773Z configs/s 2026-02-21T08:51:36.4484030Z [144s] Generation 6 complete: 2026-02-21T08:51:36.4485939Z error=3 2026-02-21T08:51:36.4486102Z ok=84 2026-02-21T08:51:36.4486242Z min=0.0214 2026-02-21T08:51:36.4486373Z mid=0.0256 2026-02-21T08:51:36.4486508Z max=0.5785 2026-02-21T08:51:36.4486939Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:51:36.4487193Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:51:36.4487420Z 'l2_groupings': [1], 2026-02-21T08:51:36.4487614Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:51:36.4487822Z 'loop_orders': [[0, 1]], 2026-02-21T08:51:36.4487978Z 'num_stages': 8, 2026-02-21T08:51:36.4488125Z 'num_warps': 8, 2026-02-21T08:51:36.4488267Z 'pid_type': 'xyz', 2026-02-21T08:51:36.4488425Z 'range_flattens': [None, None], 2026-02-21T08:51:36.4488607Z 'range_multi_buffers': [None, None], 2026-02-21T08:51:36.4488799Z 'range_num_stages': [0, 4], 2026-02-21T08:51:36.4489074Z 'range_unroll_factors': [0, 0], 2026-02-21T08:51:36.4489284Z 'range_warp_specializes': [None, None]} 2026-02-21T08:51:36.4510144Z [144s] Fitting surrogate: 604 points, 604 targets 2026-02-21T08:51:37.2926574Z [145s] Generation 7 starting: 48 neighbors, 3 active search path(s) 2026-02-21T08:51:50.2938562Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49/49 2.4 configs/s 2026-02-21T08:51:55.1399011Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 49/49 10.1 configs/s 2026-02-21T08:51:56.7148139Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 644.8 2026-02-21T08:51:56.7152126Z configs/s 2026-02-21T08:51:56.8419440Z [164s] Generation 7 complete: 2026-02-21T08:51:56.8419747Z error=1 2026-02-21T08:51:56.8419921Z ok=50 2026-02-21T08:51:56.8420080Z min=0.0215 2026-02-21T08:51:56.8420237Z mid=0.0317 
2026-02-21T08:51:56.8420395Z max=0.5725 2026-02-21T08:51:56.8420596Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:51:56.8420816Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:51:56.8421032Z 'l2_groupings': [1], 2026-02-21T08:51:56.8426334Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:51:56.8426583Z 'loop_orders': [[0, 1]], 2026-02-21T08:51:56.8426759Z 'num_stages': 8, 2026-02-21T08:51:56.8426918Z 'num_warps': 8, 2026-02-21T08:51:56.8427076Z 'pid_type': 'xyz', 2026-02-21T08:51:56.8427232Z 'range_flattens': [None, None], 2026-02-21T08:51:56.8427421Z 'range_multi_buffers': [None, None], 2026-02-21T08:51:56.8427607Z 'range_num_stages': [0, 4], 2026-02-21T08:51:56.8427780Z 'range_unroll_factors': [0, 0], 2026-02-21T08:51:56.8427966Z 'range_warp_specializes': [None, None]} 2026-02-21T08:51:56.8445506Z [164s] Fitting surrogate: 655 points, 655 targets 2026-02-21T08:51:57.5322202Z [165s] Generation 8 starting: 36 neighbors, 2 active search path(s) 2026-02-21T08:52:28.9385029Z [196s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=8, num_stages=8, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[2, 1], range_unroll_factors=[4, 0], range_warp_specializes=[None, False]) 2026-02-21T08:52:28.9403403Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 0.8 configs/s 2026-02-21T08:52:31.3426887Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 38/38 16.1 configs/s 2026-02-21T08:52:32.6005359Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 804.9 2026-02-21T08:52:32.6009180Z configs/s 2026-02-21T08:52:32.6986977Z [200s] Generation 8 complete: 2026-02-21T08:52:32.6991440Z error=2 2026-02-21T08:52:32.6995898Z timeout=1 2026-02-21T08:52:32.7000313Z 
ok=36 2026-02-21T08:52:32.7004691Z min=0.0214 2026-02-21T08:52:32.7009308Z mid=0.0278 2026-02-21T08:52:32.7009523Z max=0.1899 2026-02-21T08:52:32.7009710Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:52:32.7009969Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:52:32.7010208Z 'l2_groupings': [1], 2026-02-21T08:52:32.7010428Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:52:32.7057148Z 'loop_orders': [[0, 1]], 2026-02-21T08:52:32.7057338Z 'num_stages': 8, 2026-02-21T08:52:32.7057493Z 'num_warps': 8, 2026-02-21T08:52:32.7057657Z 'pid_type': 'xyz', 2026-02-21T08:52:32.7057850Z 'range_flattens': [None, None], 2026-02-21T08:52:32.7058032Z 'range_multi_buffers': [None, None], 2026-02-21T08:52:32.7058225Z 'range_num_stages': [0, 3], 2026-02-21T08:52:32.7058391Z 'range_unroll_factors': [0, 0], 2026-02-21T08:52:32.7058578Z 'range_warp_specializes': [None, None]} 2026-02-21T08:52:32.7058784Z [200s] Fitting surrogate: 694 points, 694 targets 2026-02-21T08:52:33.3837015Z [201s] Generation 9 starting: 34 neighbors, 2 active search path(s) 2026-02-21T08:53:04.4949448Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 1.1 configs/s 2026-02-21T08:53:07.5214604Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 11.6 configs/s 2026-02-21T08:53:08.9418265Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 847.3 2026-02-21T08:53:08.9424425Z configs/s 2026-02-21T08:53:09.0368706Z [236s] Generation 9 complete: 2026-02-21T08:53:09.2172634Z ok=36 2026-02-21T08:53:09.2172943Z min=0.0196 2026-02-21T08:53:09.2173085Z mid=0.0298 2026-02-21T08:53:09.2173209Z max=0.4272 2026-02-21T08:53:09.2173357Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:53:09.2173580Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:53:09.2173793Z 'l2_groupings': [1], 2026-02-21T08:53:09.2173964Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:53:09.2174174Z 'loop_orders': [[0, 1]], 2026-02-21T08:53:09.2174353Z 
'num_stages': 8, 2026-02-21T08:53:09.2174491Z 'num_warps': 8, 2026-02-21T08:53:09.2174638Z 'pid_type': 'xyz', 2026-02-21T08:53:09.2174792Z 'range_flattens': [None, None], 2026-02-21T08:53:09.2174977Z 'range_multi_buffers': [None, None], 2026-02-21T08:53:09.2175158Z 'range_num_stages': [0, 3], 2026-02-21T08:53:09.2175332Z 'range_unroll_factors': [0, 0], 2026-02-21T08:53:09.2175515Z 'range_warp_specializes': [None, None]} 2026-02-21T08:53:09.2175731Z [236s] Fitting surrogate: 730 points, 730 targets 2026-02-21T08:53:09.4809152Z [237s] Generation 10 starting: 18 neighbors, 1 active search path(s) 2026-02-21T08:53:13.4105162Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 9.7 configs/s 2026-02-21T08:53:14.6740460Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 15.5 configs/s 2026-02-21T08:53:14.7964626Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 7693.7 2026-02-21T08:53:14.7969260Z configs/s 2026-02-21T08:53:14.8167646Z [242s] Generation 10 complete: 2026-02-21T08:53:14.8169306Z ok=20 2026-02-21T08:53:14.8169439Z min=0.0195 2026-02-21T08:53:14.8169580Z mid=0.0379 2026-02-21T08:53:14.8169727Z max=0.1732 2026-02-21T08:53:14.8169865Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:53:14.8170103Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:53:14.8170314Z 'l2_groupings': [1], 2026-02-21T08:53:14.8170492Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:53:14.8170691Z 'loop_orders': [[0, 1]], 2026-02-21T08:53:14.8170854Z 'num_stages': 8, 2026-02-21T08:53:14.8170993Z 'num_warps': 8, 2026-02-21T08:53:14.8171142Z 'pid_type': 'xyz', 2026-02-21T08:53:14.8171292Z 'range_flattens': [None, None], 2026-02-21T08:53:14.8171478Z 'range_multi_buffers': [None, None], 2026-02-21T08:53:14.8171849Z 'range_num_stages': [0, 3], 2026-02-21T08:53:14.8172015Z 'range_unroll_factors': [0, 0], 2026-02-21T08:53:14.8172205Z 'range_warp_specializes': [None, None]} 2026-02-21T08:53:14.8192941Z [242s] Fitting 
surrogate: 750 points, 750 targets 2026-02-21T08:53:15.3199652Z [243s] Generation 11 starting: 20 neighbors, 1 active search path(s) 2026-02-21T08:53:17.8207030Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 8.7 configs/s 2026-02-21T08:53:19.0805291Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 17.3 configs/s 2026-02-21T08:53:19.2725001Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 5023.0 2026-02-21T08:53:19.2725854Z configs/s 2026-02-21T08:53:19.2976451Z [247s] Generation 11 complete: 2026-02-21T08:53:19.2979892Z ok=22 2026-02-21T08:53:19.2983903Z min=0.0195 2026-02-21T08:53:19.2987692Z mid=0.0380 2026-02-21T08:53:19.2991813Z max=0.1034 2026-02-21T08:53:19.2995396Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:53:19.2999304Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:53:19.3001258Z 'l2_groupings': [1], 2026-02-21T08:53:19.3001439Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:53:19.3001705Z 'loop_orders': [[0, 1]], 2026-02-21T08:53:19.3001870Z 'num_stages': 8, 2026-02-21T08:53:19.3002015Z 'num_warps': 8, 2026-02-21T08:53:19.3002166Z 'pid_type': 'xyz', 2026-02-21T08:53:19.3002320Z 'range_flattens': [None, None], 2026-02-21T08:53:19.3002713Z 'range_multi_buffers': [None, None], 2026-02-21T08:53:19.3002909Z 'range_num_stages': [0, 3], 2026-02-21T08:53:19.3003084Z 'range_unroll_factors': [0, 0], 2026-02-21T08:53:19.3003355Z 'range_warp_specializes': [None, None]} 2026-02-21T08:53:19.3003576Z [247s] Fitting surrogate: 772 points, 772 targets 2026-02-21T08:53:19.7044355Z [247s] Generation 12 starting: 13 neighbors, 1 active search path(s) 2026-02-21T08:53:21.3817451Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 10.5 configs/s 2026-02-21T08:53:22.2144801Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.8 configs/s 2026-02-21T08:53:22.3366059Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 7708.2 
2026-02-21T08:53:22.3367568Z configs/s 2026-02-21T08:53:22.3567633Z [250s] Generation 12 complete: 2026-02-21T08:53:22.3567967Z ok=15 2026-02-21T08:53:22.3573236Z min=0.0195 2026-02-21T08:53:22.3577507Z mid=0.0420 2026-02-21T08:53:22.3581272Z max=0.1136 2026-02-21T08:53:22.3584560Z best={'block_sizes': [256, 1, 64], 2026-02-21T08:53:22.3588060Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:53:22.3592783Z 'l2_groupings': [1], 2026-02-21T08:53:22.3596547Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:53:22.3599537Z 'loop_orders': [[0, 1]], 2026-02-21T08:53:22.3599771Z 'num_stages': 8, 2026-02-21T08:53:22.3599937Z 'num_warps': 8, 2026-02-21T08:53:22.3600093Z 'pid_type': 'xyz', 2026-02-21T08:53:22.3600267Z 'range_flattens': [None, None], 2026-02-21T08:53:22.3600463Z 'range_multi_buffers': [None, None], 2026-02-21T08:53:22.3600669Z 'range_num_stages': [0, 3], 2026-02-21T08:53:22.3600847Z 'range_unroll_factors': [0, 0], 2026-02-21T08:53:22.3601030Z 'range_warp_specializes': [None, None]} 2026-02-21T08:53:22.3601249Z [250s] Fitting surrogate: 787 points, 787 targets 2026-02-21T08:53:22.6392253Z [250s] Autotuning complete in 250.5s after searching 760 configs. 
2026-02-21T08:53:22.6396218Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:53:22.6401817Z @helion.kernel(config=helion.Config(block_sizes=[256, 1, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_stages=8, num_warps=8, pid_type='xyz', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T08:53:22.6402892Z 2026-02-21T08:53:22.6403213Z [250s] Code of selected kernel: /tmp/torchinductor_root/o5/co5ygt2py6spqnx7wpzyqlblisujwnaizkemz4yylecoqxkycjqp.py 2026-02-21T08:53:22.6596627Z from __future__ import annotations 2026-02-21T08:53:22.6601104Z 2026-02-21T08:53:22.6604147Z import torch 2026-02-21T08:53:22.6604324Z import triton 2026-02-21T08:53:22.6604478Z import triton.language as tl 2026-02-21T08:53:22.6604723Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T08:53:22.6605116Z 2026-02-21T08:53:22.6605191Z _BLOCK_SIZE_2 = tl.constexpr(64) 2026-02-21T08:53:22.6605379Z _BLOCK_SIZE_1 = tl.constexpr(1) 2026-02-21T08:53:22.6605639Z _BLOCK_SIZE_0 = tl.constexpr(256) 2026-02-21T08:53:22.6605756Z 2026-02-21T08:53:22.6605813Z @triton.jit 2026-02-21T08:53:22.6606120Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr, _SHAPE_DIM_3: tl.constexpr): 2026-02-21T08:53:22.6606504Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T08:53:22.6606778Z pid_1 = tl.program_id(1) 2026-02-21T08:53:22.6606970Z offset_2 = pid_1 * _BLOCK_SIZE_2 2026-02-21T08:53:22.6607213Z indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T08:53:22.6607534Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T08:53:22.6607836Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T08:53:22.6608209Z # 
src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T08:53:22.6608627Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T08:53:22.6609015Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T08:53:22.6609294Z # src[int4_gemm.py:60-89]: ... 2026-02-21T08:53:22.6609522Z for offset_3 in tl.range(0, 1792, _BLOCK_SIZE_0, num_stages=3): 2026-02-21T08:53:22.6609798Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T08:53:22.6610032Z acc_copy = acc 2026-02-21T08:53:22.6610191Z acc_copy_0 = acc_copy 2026-02-21T08:53:22.6610410Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T08:53:22.6610636Z mul = 2 * offset_3 2026-02-21T08:53:22.6610893Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T08:53:22.6611193Z iota = mul + tl.arange(0, mul_1) 2026-02-21T08:53:22.6611613Z load = tl.broadcast_to(tl.load(A + iota[None, :] * 1, None, eviction_policy='evict_last'), [_BLOCK_SIZE_1, _SHAPE_DIM_2]) 2026-02-21T08:53:22.6612068Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T08:53:22.6612368Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T08:53:22.6612615Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T08:53:22.6612843Z v_0 = tl.cast(load, tl.float32) 2026-02-21T08:53:22.6613134Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T08:53:22.6613573Z b_tile = tl.load(B + (indices_3[:, None] * 8192 + indices_2[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:53:22.6613991Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T08:53:22.6614288Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T08:53:22.6614471Z v_2 = b_tile << v_1 
2026-02-21T08:53:22.6614650Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T08:53:22.6614868Z v_4 = v_2 >> v_3 2026-02-21T08:53:22.6615115Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T08:53:22.6615391Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T08:53:22.6615566Z v_6 = b_tile >> v_5 2026-02-21T08:53:22.6615796Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 2026-02-21T08:53:22.6616053Z stack_idx = tl.arange(0, 2) 2026-02-21T08:53:22.6616250Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T08:53:22.6616514Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T08:53:22.6616712Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T08:53:22.6616926Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T08:53:22.6617136Z mask_0 = broadcast_idx == 0 2026-02-21T08:53:22.6617382Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T08:53:22.6617659Z mask_1 = broadcast_idx == 1 2026-02-21T08:53:22.6617931Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T08:53:22.6618223Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T08:53:22.6618519Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T08:53:22.6618804Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T08:53:22.6619063Z view = tl.reshape(stacked_result, [_SHAPE_DIM_3, _BLOCK_SIZE_2]) 2026-02-21T08:53:22.6619320Z v_7 = tl.cast(view, tl.float32) 2026-02-21T08:53:22.6619597Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T08:53:22.6619924Z a_tile_1 = v_0[:, :, None] 2026-02-21T08:53:22.6620155Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T08:53:22.6620385Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T08:53:22.6620706Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T08:53:22.6620997Z v_8 = a_tile_1 * b_unpacked_1 
2026-02-21T08:53:22.6621195Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T08:53:22.6621390Z acc = acc_copy_0 + sum_1 2026-02-21T08:53:22.6621655Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T08:53:22.6621895Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T08:53:22.6622174Z tl.store(C + tl.broadcast_to(indices_2[None, :] * 1, [_BLOCK_SIZE_1, _BLOCK_SIZE_2]), v_10, None) 2026-02-21T08:53:22.6622409Z 2026-02-21T08:53:22.6622540Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T08:53:22.6622780Z """ 2026-02-21T08:53:22.6622958Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T08:53:22.6623112Z 2026-02-21T08:53:22.6623215Z This kernel performs matrix multiplication where: 2026-02-21T08:53:22.6623441Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T08:53:22.6623703Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T08:53:22.6623958Z (two 4-bit values packed into each int8) 2026-02-21T08:53:22.6624092Z 2026-02-21T08:53:22.6624147Z Args: 2026-02-21T08:53:22.6624329Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T08:53:22.6624609Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T08:53:22.6624776Z 2026-02-21T08:53:22.6624839Z Returns: 2026-02-21T08:53:22.6625021Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 
2026-02-21T08:53:22.6625242Z """ 2026-02-21T08:53:22.6625380Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T08:53:22.6625562Z M, K = A.shape 2026-02-21T08:53:22.6625714Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T08:53:22.6625891Z _, N = B.shape 2026-02-21T08:53:22.6626110Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T08:53:22.6626424Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T08:53:22.6626693Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T08:53:22.6626908Z _BLOCK_SIZE_2 = 64 2026-02-21T08:53:22.6627161Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T08:53:22.6627549Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T08:53:22.6627934Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T08:53:22.6628241Z # src[int4_gemm.py:60-89]: ... 2026-02-21T08:53:22.6628408Z _BLOCK_SIZE_0 = 256 2026-02-21T08:53:22.6628655Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T08:53:22.6628923Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T08:53:22.6629146Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T08:53:22.6629444Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T08:53:22.6629737Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T08:53:22.6629933Z _SHAPE_DIM_3 = 2 * _BLOCK_SIZE_0 2026-02-21T08:53:22.6630158Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T08:53:22.6630449Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T08:53:22.6630703Z # src[int4_gemm.py:57-91]: ... 
2026-02-21T08:53:22.6630920Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T08:53:22.6631358Z _launcher(_helion_matmul_bf16_int4, (1, triton.cdiv(8192, _BLOCK_SIZE_2)), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, _SHAPE_DIM_3, num_warps=8, num_stages=8) 2026-02-21T08:53:22.6631815Z # src[int4_gemm.py:93]: return C 2026-02-21T08:53:22.6631984Z return C 2026-02-21T08:53:23.7470108Z WARNING:tritonbench.utils.triton_op:Completed input ID 3: 2026-02-21T08:53:23.7470530Z x_val 2026-02-21T08:53:23.7470665Z ------------------ 2026-02-21T08:53:23.7470822Z (1, 1, 8192, 3584) 2026-02-21T08:53:23.7470907Z 2026-02-21T08:53:23.7471292Z 20%|██ | 2/10 [09:23<36:56, 277.12s/it]WARNING:tritonbench.utils.triton_op:Running input ID 7: 2026-02-21T08:53:23.7474089Z x_val 2026-02-21T08:53:23.7474235Z ------------------ 2026-02-21T08:53:23.7474371Z (4, 1, 8192, 3584) 2026-02-21T08:53:23.7474664Z INFO:tritonbench.utils.triton_op:Took 0.19ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:53:24.5064035Z INFO:tritonbench.utils.triton_op:Took 2.59ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:53:25.6822197Z INFO:tritonbench.utils.triton_op:Took 0.10ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:53:27.1285314Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:53:27.1287443Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:53:27.1287695Z 'dtype': 'torch.bfloat16', 2026-02-21T08:53:27.1287959Z 'shape': (4, 1, 3584), 2026-02-21T08:53:27.1288160Z 'stride': (3584, 3584, 1)}, 2026-02-21T08:53:27.1292411Z { 'device': 'cuda:0', 2026-02-21T08:53:27.1296934Z 'dtype': 'torch.int32', 2026-02-21T08:53:27.1302120Z 'shape': (3584, 8192), 2026-02-21T08:53:27.1304225Z 'stride': (8192, 1)}), 2026-02-21T08:53:27.1304500Z 'kwargs': {}} 2026-02-21T08:53:27.1315678Z INFO:tritonbench.utils.triton_op:Took 3.07ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:53:27.3880605Z [0s] 
Autotune random seed: 2136913670 2026-02-21T08:53:27.4137861Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:54:00.1744222Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 1, 1024], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T08:54:00.5050629Z [33s] Timeout after 30s compiling Config(block_sizes=[256, 1, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_stages=5, num_warps=4, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T08:54:00.5067011Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T08:54:03.8135350Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:54:03.8140029Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:54:03.8141051Z %c3584_i32 = arith.constant 3584 : i32 2026-02-21T08:54:03.8141399Z %cst = arith.constant dense<0> : tensor<256x2x16xi8> 2026-02-21T08:54:03.8141843Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:54:03.8142056Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:54:03.8142253Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:54:03.8142449Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:54:03.8142672Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T08:54:03.8142911Z %cst_1 = 
arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T08:54:03.8143163Z %cst_2 = arith.constant dense<4> : tensor<256x16xi8> 2026-02-21T08:54:03.8143376Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:54:03.8143606Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<1x16xf32> 2026-02-21T08:54:03.8143838Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:54:03.8144108Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:54:03.8144304Z %c1792_i32 = arith.constant 1792 : i32 2026-02-21T08:54:03.8144500Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:54:03.8144687Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:54:03.8144875Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:54:03.8145193Z %0 = tt.make_tensor_descriptor %arg1, [%c1792_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:54:03.8145524Z %1 = tt.get_program_id x : i32 2026-02-21T08:54:03.8145712Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:54:03.8145900Z %3 = arith.minsi %2, %c2048_i32 : i32 2026-02-21T08:54:03.8146113Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:54:03.8146325Z %4 = arith.divsi %arg3, %c2048_i32 : i32 2026-02-21T08:54:03.8146518Z %5 = arith.muli %4, %c4_i32 : i32 2026-02-21T08:54:03.8146698Z %6 = arith.subi %c4_i32, %5 : i32 2026-02-21T08:54:03.8146882Z %7 = arith.minsi %6, %c4_i32 : i32 2026-02-21T08:54:03.8147067Z %8 = arith.remsi %arg3, %c2048_i32 : i32 2026-02-21T08:54:03.8147256Z %9 = arith.remsi %8, %7 : i32 2026-02-21T08:54:03.8147433Z %10 = arith.addi %5, %9 : i32 2026-02-21T08:54:03.8147601Z %11 = arith.divsi %8, %7 : i32 2026-02-21T08:54:03.8147780Z %12 = arith.muli %11, %c16_i32 : i32 2026-02-21T08:54:03.8148011Z %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:54:03.8148275Z %14 = tt.splat %12 : i32 -> tensor<16xi32> 2026-02-21T08:54:03.8148473Z %15 = arith.addi %14, %13 : tensor<16xi32> 2026-02-21T08:54:03.8148678Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:54:03.8148874Z %c1024_i32_4 = 
arith.constant 1024 : i32 2026-02-21T08:54:03.8149066Z %16 = arith.muli %c0_i32, %c2_i32 : i32 2026-02-21T08:54:03.8149305Z %17 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:03.8149558Z %18 = tt.splat %16 : i32 -> tensor<512xi32> 2026-02-21T08:54:03.8149767Z %19 = arith.addi %18, %17 : tensor<512xi32> 2026-02-21T08:54:03.8149958Z %20 = arith.muli %10, %c3584_i32 : i32 2026-02-21T08:54:03.8150218Z %21 = tt.expand_dims %19 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:54:03.8150495Z %22 = tt.splat %20 : i32 -> tensor<1x512xi32> 2026-02-21T08:54:03.8150702Z %23 = arith.addi %22, %21 : tensor<1x512xi32> 2026-02-21T08:54:03.8150958Z %24 = tt.splat %arg0 : !tt.ptr -> tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8151249Z %25 = tt.addptr %24, %23 : tensor<1x512x!tt.ptr>, tensor<1x512xi32> 2026-02-21T08:54:03.8151707Z %26 = tt.load %25 evictionPolicy = evict_last : tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8152007Z %27 = arith.extf %26 : tensor<1x512xbf16> to tensor<1x512xf32> 2026-02-21T08:54:03.8152356Z %28 = tt.descriptor_load %0[%c0_i32, %12] : !tt.tensordesc> -> tensor<256x16xi8> 2026-02-21T08:54:03.8152675Z %29 = arith.shli %28, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8153580Z %30 = arith.shrsi %29, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8153890Z %31 = arith.shrsi %28, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8154141Z %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:54:03.8154448Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:54:03.8154761Z %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:54:03.8155099Z %35 = tt.expand_dims %30 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8155441Z %36 = tt.expand_dims %31 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8155719Z %37 = arith.cmpi eq, %34, %cst_1 : 
tensor<1x2x1xi32> 2026-02-21T08:54:03.8155975Z %38 = tt.broadcast %37 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 2026-02-21T08:54:03.8156296Z %39 = tt.broadcast %35 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 2026-02-21T08:54:03.8156599Z %40 = arith.select %38, %39, %cst : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8156875Z %41 = arith.cmpi eq, %34, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:54:03.8157122Z %42 = tt.broadcast %36 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 2026-02-21T08:54:03.8157397Z %43 = tt.broadcast %41 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 2026-02-21T08:54:03.8157670Z %44 = arith.select %43, %42, %40 : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8157957Z %45 = tt.reshape %44 : tensor<256x2x16xi8> -> tensor<512x16xi8> 2026-02-21T08:54:03.8158220Z %46 = arith.sitofp %45 : tensor<512x16xi8> to tensor<512x16xf32> 2026-02-21T08:54:03.8158515Z %47 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x512xf32> -> tensor<1x512x1xf32> 2026-02-21T08:54:03.8158852Z %48 = tt.expand_dims %46 {axis = 0 : i32} : tensor<512x16xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8159161Z %49 = tt.broadcast %47 : tensor<1x512x1xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8159421Z %50 = arith.mulf %49, %48 : tensor<1x512x16xf32> 2026-02-21T08:54:03.8159630Z %51 = "tt.reduce"(%50) <{axis = 1 : i32}> ({ 2026-02-21T08:54:03.8159830Z ^bb0(%arg4: f32, %arg5: f32): 2026-02-21T08:54:03.8160012Z %178 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:54:03.8160207Z tt.reduce.return %178 : f32 2026-02-21T08:54:03.8160403Z }) : (tensor<1x512x16xf32>) -> tensor<1x16xf32> 2026-02-21T08:54:03.8160611Z %52 = arith.addf %cst_3, %51 : tensor<1x16xf32> 2026-02-21T08:54:03.8160816Z %c1_i32_5 = arith.constant 1 : i32 2026-02-21T08:54:03.8161007Z %53 = arith.muli %c256_i32, %c1_i32_5 : i32 2026-02-21T08:54:03.8161204Z %54 = arith.addi %c0_i32, %53 : i32 2026-02-21T08:54:03.8161383Z %55 = arith.muli %54, %c2_i32 : i32 2026-02-21T08:54:03.8161675Z %56 = 
tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:03.8161942Z %57 = tt.splat %55 : i32 -> tensor<512xi32> 2026-02-21T08:54:03.8162148Z %58 = arith.addi %57, %56 : tensor<512xi32> 2026-02-21T08:54:03.8162357Z %59 = arith.muli %10, %c3584_i32 : i32 2026-02-21T08:54:03.8162609Z %60 = tt.expand_dims %58 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:54:03.8162889Z %61 = tt.splat %59 : i32 -> tensor<1x512xi32> 2026-02-21T08:54:03.8163102Z %62 = arith.addi %61, %60 : tensor<1x512xi32> 2026-02-21T08:54:03.8163359Z %63 = tt.splat %arg0 : !tt.ptr -> tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8167482Z %64 = tt.addptr %63, %62 : tensor<1x512x!tt.ptr>, tensor<1x512xi32> 2026-02-21T08:54:03.8167817Z %65 = tt.load %64 evictionPolicy = evict_last : tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8168134Z %66 = arith.extf %65 : tensor<1x512xbf16> to tensor<1x512xf32> 2026-02-21T08:54:03.8168479Z %67 = tt.descriptor_load %0[%54, %12] : !tt.tensordesc> -> tensor<256x16xi8> 2026-02-21T08:54:03.8168913Z %68 = arith.shli %67, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8169197Z %69 = arith.shrsi %68, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8169438Z %70 = arith.shrsi %67, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8169702Z %71 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:54:03.8170008Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:54:03.8170339Z %73 = tt.expand_dims %72 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:54:03.8170688Z %74 = tt.expand_dims %69 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8171022Z %75 = tt.expand_dims %70 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8171319Z %76 = arith.cmpi eq, %73, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:54:03.8171694Z %77 = tt.broadcast %76 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 
2026-02-21T08:54:03.8171989Z %78 = tt.broadcast %74 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 2026-02-21T08:54:03.8172280Z %79 = arith.select %77, %78, %cst : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8172556Z %80 = arith.cmpi eq, %73, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:54:03.8172799Z %81 = tt.broadcast %75 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 2026-02-21T08:54:03.8173074Z %82 = tt.broadcast %80 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 2026-02-21T08:54:03.8173351Z %83 = arith.select %82, %81, %79 : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8173625Z %84 = tt.reshape %83 : tensor<256x2x16xi8> -> tensor<512x16xi8> 2026-02-21T08:54:03.8173891Z %85 = arith.sitofp %84 : tensor<512x16xi8> to tensor<512x16xf32> 2026-02-21T08:54:03.8174230Z %86 = tt.expand_dims %66 {axis = 2 : i32} : tensor<1x512xf32> -> tensor<1x512x1xf32> 2026-02-21T08:54:03.8174556Z %87 = tt.expand_dims %85 {axis = 0 : i32} : tensor<512x16xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8174872Z %88 = tt.broadcast %86 : tensor<1x512x1xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8175129Z %89 = arith.mulf %88, %87 : tensor<1x512x16xf32> 2026-02-21T08:54:03.8175343Z %90 = "tt.reduce"(%89) <{axis = 1 : i32}> ({ 2026-02-21T08:54:03.8175546Z ^bb0(%arg4: f32, %arg5: f32): 2026-02-21T08:54:03.8175733Z %178 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:54:03.8175932Z tt.reduce.return %178 : f32 2026-02-21T08:54:03.8176127Z }) : (tensor<1x512x16xf32>) -> tensor<1x16xf32> 2026-02-21T08:54:03.8176341Z %91 = arith.addf %52, %90 : tensor<1x16xf32> 2026-02-21T08:54:03.8176543Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T08:54:03.8176732Z %92 = arith.muli %c256_i32, %c2_i32_6 : i32 2026-02-21T08:54:03.8176926Z %93 = arith.addi %c0_i32, %92 : i32 2026-02-21T08:54:03.8177104Z %94 = arith.muli %93, %c2_i32 : i32 2026-02-21T08:54:03.8177341Z %95 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:03.8177587Z %96 = tt.splat %94 : 
i32 -> tensor<512xi32> 2026-02-21T08:54:03.8177791Z %97 = arith.addi %96, %95 : tensor<512xi32> 2026-02-21T08:54:03.8177978Z %98 = arith.muli %10, %c3584_i32 : i32 2026-02-21T08:54:03.8178227Z %99 = tt.expand_dims %97 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:54:03.8178498Z %100 = tt.splat %98 : i32 -> tensor<1x512xi32> 2026-02-21T08:54:03.8178707Z %101 = arith.addi %100, %99 : tensor<1x512xi32> 2026-02-21T08:54:03.8178954Z %102 = tt.splat %arg0 : !tt.ptr -> tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8179312Z %103 = tt.addptr %102, %101 : tensor<1x512x!tt.ptr>, tensor<1x512xi32> 2026-02-21T08:54:03.8179638Z %104 = tt.load %103 evictionPolicy = evict_last : tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8179945Z %105 = arith.extf %104 : tensor<1x512xbf16> to tensor<1x512xf32> 2026-02-21T08:54:03.8180311Z %106 = tt.descriptor_load %0[%93, %12] : !tt.tensordesc> -> tensor<256x16xi8> 2026-02-21T08:54:03.8180657Z %107 = arith.shli %106, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8180885Z %108 = arith.shrsi %107, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8181113Z %109 = arith.shrsi %106, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8181362Z %110 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:54:03.8181740Z %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:54:03.8182063Z %112 = tt.expand_dims %111 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:54:03.8182393Z %113 = tt.expand_dims %108 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8182734Z %114 = tt.expand_dims %109 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8183064Z %115 = arith.cmpi eq, %112, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:54:03.8183335Z %116 = tt.broadcast %115 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 2026-02-21T08:54:03.8183619Z %117 = tt.broadcast %113 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 
2026-02-21T08:54:03.8183928Z %118 = arith.select %116, %117, %cst : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8184214Z %119 = arith.cmpi eq, %112, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:54:03.8184466Z %120 = tt.broadcast %114 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 2026-02-21T08:54:03.8184746Z %121 = tt.broadcast %119 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 2026-02-21T08:54:03.8185035Z %122 = arith.select %121, %120, %118 : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8185328Z %123 = tt.reshape %122 : tensor<256x2x16xi8> -> tensor<512x16xi8> 2026-02-21T08:54:03.8185604Z %124 = arith.sitofp %123 : tensor<512x16xi8> to tensor<512x16xf32> 2026-02-21T08:54:03.8185902Z %125 = tt.expand_dims %105 {axis = 2 : i32} : tensor<1x512xf32> -> tensor<1x512x1xf32> 2026-02-21T08:54:03.8186247Z %126 = tt.expand_dims %124 {axis = 0 : i32} : tensor<512x16xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8186561Z %127 = tt.broadcast %125 : tensor<1x512x1xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8186826Z %128 = arith.mulf %127, %126 : tensor<1x512x16xf32> 2026-02-21T08:54:03.8187039Z %129 = "tt.reduce"(%128) <{axis = 1 : i32}> ({ 2026-02-21T08:54:03.8187239Z ^bb0(%arg4: f32, %arg5: f32): 2026-02-21T08:54:03.8187426Z %178 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:54:03.8187612Z tt.reduce.return %178 : f32 2026-02-21T08:54:03.8187811Z }) : (tensor<1x512x16xf32>) -> tensor<1x16xf32> 2026-02-21T08:54:03.8188017Z %130 = arith.addf %91, %129 : tensor<1x16xf32> 2026-02-21T08:54:03.8188219Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:54:03.8188409Z %131 = arith.muli %c256_i32, %c3_i32 : i32 2026-02-21T08:54:03.8188607Z %132 = arith.addi %c0_i32, %131 : i32 2026-02-21T08:54:03.8188789Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T08:54:03.8189028Z %134 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:03.8189288Z %135 = tt.splat %133 : i32 -> tensor<512xi32> 2026-02-21T08:54:03.8189494Z %136 = 
arith.addi %135, %134 : tensor<512xi32> 2026-02-21T08:54:03.8189694Z %137 = arith.muli %10, %c3584_i32 : i32 2026-02-21T08:54:03.8189941Z %138 = tt.expand_dims %136 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:54:03.8190221Z %139 = tt.splat %137 : i32 -> tensor<1x512xi32> 2026-02-21T08:54:03.8194753Z %140 = arith.addi %139, %138 : tensor<1x512xi32> 2026-02-21T08:54:03.8195036Z %141 = tt.splat %arg0 : !tt.ptr -> tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8195347Z %142 = tt.addptr %141, %140 : tensor<1x512x!tt.ptr>, tensor<1x512xi32> 2026-02-21T08:54:03.8195672Z %143 = tt.load %142 evictionPolicy = evict_last : tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8196095Z %144 = arith.extf %143 : tensor<1x512xbf16> to tensor<1x512xf32> 2026-02-21T08:54:03.8196474Z %145 = tt.descriptor_load %0[%132, %12] : !tt.tensordesc> -> tensor<256x16xi8> 2026-02-21T08:54:03.8196794Z %146 = arith.shli %145, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8197024Z %147 = arith.shrsi %146, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8197244Z %148 = arith.shrsi %145, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8197504Z %149 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:54:03.8197805Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:54:03.8198131Z %151 = tt.expand_dims %150 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:54:03.8198460Z %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8198848Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8199150Z %154 = arith.cmpi eq, %151, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:54:03.8199406Z %155 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 2026-02-21T08:54:03.8199695Z %156 = tt.broadcast %152 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 2026-02-21T08:54:03.8199997Z %157 = arith.select %155, %156, 
%cst : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8200287Z %158 = arith.cmpi eq, %151, %cst_0 : tensor<1x2x1xi32> 2026-02-21T08:54:03.8200550Z %159 = tt.broadcast %153 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 2026-02-21T08:54:03.8200832Z %160 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 2026-02-21T08:54:03.8201132Z %161 = arith.select %160, %159, %157 : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8201427Z %162 = tt.reshape %161 : tensor<256x2x16xi8> -> tensor<512x16xi8> 2026-02-21T08:54:03.8201791Z %163 = arith.sitofp %162 : tensor<512x16xi8> to tensor<512x16xf32> 2026-02-21T08:54:03.8202101Z %164 = tt.expand_dims %144 {axis = 2 : i32} : tensor<1x512xf32> -> tensor<1x512x1xf32> 2026-02-21T08:54:03.8202459Z %165 = tt.expand_dims %163 {axis = 0 : i32} : tensor<512x16xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8202792Z %166 = tt.broadcast %164 : tensor<1x512x1xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8203061Z %167 = arith.mulf %166, %165 : tensor<1x512x16xf32> 2026-02-21T08:54:03.8203294Z %168 = "tt.reduce"(%167) <{axis = 1 : i32}> ({ 2026-02-21T08:54:03.8203500Z ^bb0(%arg4: f32, %arg5: f32): 2026-02-21T08:54:03.8203698Z %178 = arith.addf %arg4, %arg5 : f32 2026-02-21T08:54:03.8203890Z tt.reduce.return %178 : f32 2026-02-21T08:54:03.8204097Z }) : (tensor<1x512x16xf32>) -> tensor<1x16xf32> 2026-02-21T08:54:03.8204314Z %169 = arith.addf %130, %168 : tensor<1x16xf32> 2026-02-21T08:54:03.8204647Z %170 = scf.for %arg4 = %c1024_i32 to %c1792_i32 step %c256_i32 iter_args(%arg5 = %169) -> (tensor<1x16xf32>) : i32 { 2026-02-21T08:54:03.8204989Z %178 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T08:54:03.8205234Z %179 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:03.8205496Z %180 = tt.splat %178 : i32 -> tensor<512xi32> 2026-02-21T08:54:03.8205705Z %181 = arith.addi %180, %179 : tensor<512xi32> 2026-02-21T08:54:03.8205916Z %182 = arith.muli %10, %c3584_i32 : i32 
2026-02-21T08:54:03.8206181Z %183 = tt.expand_dims %181 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:54:03.8206507Z %184 = tt.splat %182 : i32 -> tensor<1x512xi32> 2026-02-21T08:54:03.8206733Z %185 = arith.addi %184, %183 : tensor<1x512xi32> 2026-02-21T08:54:03.8206982Z %186 = tt.splat %arg0 : !tt.ptr -> tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8207288Z %187 = tt.addptr %186, %185 : tensor<1x512x!tt.ptr>, tensor<1x512xi32> 2026-02-21T08:54:03.8207648Z %188 = tt.load %187 evictionPolicy = evict_last : tensor<1x512x!tt.ptr> 2026-02-21T08:54:03.8208001Z %189 = arith.extf %188 : tensor<1x512xbf16> to tensor<1x512xf32> 2026-02-21T08:54:03.8208354Z %190 = tt.descriptor_load %0[%arg4, %12] : !tt.tensordesc> -> tensor<256x16xi8> 2026-02-21T08:54:03.8208681Z %191 = arith.shli %190, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8208926Z %192 = arith.shrsi %191, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8209156Z %193 = arith.shrsi %190, %cst_2 : tensor<256x16xi8> 2026-02-21T08:54:03.8209423Z %194 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T08:54:03.8209742Z %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T08:54:03.8210077Z %196 = tt.expand_dims %195 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T08:54:03.8210518Z %197 = tt.expand_dims %192 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8210873Z %198 = tt.expand_dims %193 {axis = 1 : i32} : tensor<256x16xi8> -> tensor<256x1x16xi8> 2026-02-21T08:54:03.8211189Z %199 = arith.cmpi eq, %196, %cst_1 : tensor<1x2x1xi32> 2026-02-21T08:54:03.8211469Z %200 = tt.broadcast %199 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 2026-02-21T08:54:03.8211809Z %201 = tt.broadcast %197 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 2026-02-21T08:54:03.8212131Z %202 = arith.select %200, %201, %cst : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8212425Z %203 = arith.cmpi eq, %196, 
%cst_0 : tensor<1x2x1xi32> 2026-02-21T08:54:03.8212697Z %204 = tt.broadcast %198 : tensor<256x1x16xi8> -> tensor<256x2x16xi8> 2026-02-21T08:54:03.8212992Z %205 = tt.broadcast %203 : tensor<1x2x1xi1> -> tensor<256x2x16xi1> 2026-02-21T08:54:03.8213304Z %206 = arith.select %205, %204, %202 : tensor<256x2x16xi1>, tensor<256x2x16xi8> 2026-02-21T08:54:03.8213615Z %207 = tt.reshape %206 : tensor<256x2x16xi8> -> tensor<512x16xi8> 2026-02-21T08:54:03.8213901Z %208 = arith.sitofp %207 : tensor<512x16xi8> to tensor<512x16xf32> 2026-02-21T08:54:03.8214225Z %209 = tt.expand_dims %189 {axis = 2 : i32} : tensor<1x512xf32> -> tensor<1x512x1xf32> 2026-02-21T08:54:03.8214582Z %210 = tt.expand_dims %208 {axis = 0 : i32} : tensor<512x16xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8214927Z %211 = tt.broadcast %209 : tensor<1x512x1xf32> -> tensor<1x512x16xf32> 2026-02-21T08:54:03.8215213Z %212 = arith.mulf %211, %210 : tensor<1x512x16xf32> 2026-02-21T08:54:03.8215447Z %213 = "tt.reduce"(%212) <{axis = 1 : i32}> ({ 2026-02-21T08:54:03.8215681Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:54:03.8215870Z %215 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:54:03.8216071Z tt.reduce.return %215 : f32 2026-02-21T08:54:03.8216268Z }) : (tensor<1x512x16xf32>) -> tensor<1x16xf32> 2026-02-21T08:54:03.8216491Z %214 = arith.addf %arg5, %213 : tensor<1x16xf32> 2026-02-21T08:54:03.8216697Z scf.yield %214 : tensor<1x16xf32> 2026-02-21T08:54:03.8216927Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:54:03.8217200Z %171 = arith.truncf %170 : tensor<1x16xf32> to tensor<1x16xbf16> 2026-02-21T08:54:03.8217432Z %172 = arith.muli %10, %c8192_i32 : i32 2026-02-21T08:54:03.8217684Z %173 = tt.expand_dims %15 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:54:03.8217952Z %174 = tt.splat %172 : i32 -> tensor<1x16xi32> 2026-02-21T08:54:03.8218220Z %175 = arith.addi %174, %173 : tensor<1x16xi32> 2026-02-21T08:54:03.8218459Z %176 = tt.splat %arg2 : !tt.ptr -> 
tensor<1x16x!tt.ptr> 2026-02-21T08:54:03.8218750Z %177 = tt.addptr %176, %175 : tensor<1x16x!tt.ptr>, tensor<1x16xi32> 2026-02-21T08:54:03.8219017Z tt.store %177, %171 : tensor<1x16x!tt.ptr> 2026-02-21T08:54:03.8219278Z } {tt.flatten, tt.warp_specialize} 2026-02-21T08:54:03.8219466Z tt.return 2026-02-21T08:54:03.8219596Z } 2026-02-21T08:54:03.8219840Z } 2026-02-21T08:54:03.8219911Z 2026-02-21T08:54:03.8219962Z {-# 2026-02-21T08:54:03.8220103Z external_resources: { 2026-02-21T08:54:03.8220259Z mlir_reproducer: { 2026-02-21T08:54:03.8224723Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, 
tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:54:03.8229116Z disable_threading: false, 2026-02-21T08:54:03.8229291Z verify_each: true 2026-02-21T08:54:03.8229437Z } 2026-02-21T08:54:03.8229565Z } 2026-02-21T08:54:03.8229680Z #-} 2026-02-21T08:54:03.8230107Z /tmp/torchinductor_root/tj/ctjh6ocw4i4nobfug3vjlptom72vrbn3bjqt3qy2guicnxlt67bm.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:54:03.8231337Z /tmp/torchinductor_root/tj/ctjh6ocw4i4nobfug3vjlptom72vrbn3bjqt3qy2guicnxlt67bm.py:19:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPURewritePartitionDependencies` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:54:03.8232369Z [36s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:54:03.8233547Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 1, 16], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=128, num_stages=8, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:54:03.8234621Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:54:03.8234882Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:54:06.2315629Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.6 configs/s 2026-02-21T08:54:06.2323042Z [38s] Adaptive compile timeout: 30s (90% percentile=2.3s, bounds=[30.0s, 30s]) 2026-02-21T08:54:06.2327919Z [38s] Initial random population of 100, 5 starting points: 2026-02-21T08:54:06.2331946Z error=5 2026-02-21T08:54:06.2336384Z timeout=2 2026-02-21T08:54:06.2337935Z ok=93 2026-02-21T08:54:06.2338116Z min=0.0584 2026-02-21T08:54:06.2338251Z mid=0.2940 2026-02-21T08:54:06.2338607Z max=8.3928 2026-02-21T08:54:06.2338763Z best={'block_sizes': [32, 2, 16], 2026-02-21T08:54:06.2339034Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:54:06.2339294Z 'l2_groupings': [1], 2026-02-21T08:54:06.2339480Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:54:06.2339685Z 'loop_orders': [[1, 0]], 2026-02-21T08:54:06.2339842Z 'num_stages': 1, 2026-02-21T08:54:06.2339989Z 'num_warps': 1, 2026-02-21T08:54:06.2340141Z 'pid_type': 'flat', 2026-02-21T08:54:06.2340309Z 'range_flattens': [None, True], 2026-02-21T08:54:06.2340489Z 'range_multi_buffers': [None, True], 2026-02-21T08:54:06.2340682Z 'range_num_stages': [0, 3], 2026-02-21T08:54:06.2340847Z 'range_unroll_factors': [0, 0], 2026-02-21T08:54:06.2341124Z 'range_warp_specializes': [None, True]} 2026-02-21T08:54:06.2341356Z [38s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:54:07.4702319Z [40s] Generation 1 starting: 98 neighbors, 5 active search path(s) 2026-02-21T08:54:12.4133953Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 30.3 configs/s 2026-02-21T08:54:18.2907993Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 17.3 configs/s 2026-02-21T08:54:20.5503619Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 449.3 2026-02-21T08:54:20.5507694Z configs/s 2026-02-21T08:54:20.6757147Z [53s] Generation 1 complete: 2026-02-21T08:54:20.6762369Z ok=103 2026-02-21T08:54:20.6764080Z min=0.0420 
2026-02-21T08:54:20.6764298Z mid=0.1014 2026-02-21T08:54:20.6769139Z max=0.4823 2026-02-21T08:54:20.6770735Z best={'block_sizes': [128, 2, 32], 2026-02-21T08:54:20.6771065Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:54:20.6777036Z 'l2_groupings': [4], 2026-02-21T08:54:20.6780983Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:54:20.6784692Z 'loop_orders': [[1, 0]], 2026-02-21T08:54:20.6789082Z 'maxnreg': 64, 2026-02-21T08:54:20.6790530Z 'num_sm_multiplier': 2, 2026-02-21T08:54:20.6790747Z 'num_stages': 3, 2026-02-21T08:54:20.6790896Z 'num_warps': 4, 2026-02-21T08:54:20.6791068Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:54:20.6791266Z 'range_flattens': [False, True], 2026-02-21T08:54:20.6791459Z 'range_multi_buffers': [False, False], 2026-02-21T08:54:20.6791728Z 'range_num_stages': [2, 4], 2026-02-21T08:54:20.6791911Z 'range_unroll_factors': [0, 0], 2026-02-21T08:54:20.6792102Z 'range_warp_specializes': [True, None]} 2026-02-21T08:54:20.6792326Z [53s] Fitting surrogate: 203 points, 203 targets 2026-02-21T08:54:21.9987073Z [54s] Generation 2 starting: 97 neighbors, 5 active search path(s) 2026-02-21T08:54:37.9028834Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 4.3 configs/s 2026-02-21T08:54:44.9038666Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 14.9 configs/s 2026-02-21T08:54:48.6976061Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 266.7 2026-02-21T08:54:48.6980124Z configs/s 2026-02-21T08:54:48.9193424Z [81s] Generation 2 complete: 2026-02-21T08:54:48.9197746Z ok=103 2026-02-21T08:54:48.9202879Z min=0.0359 2026-02-21T08:54:48.9204393Z mid=0.0584 2026-02-21T08:54:48.9204598Z max=0.9268 2026-02-21T08:54:48.9204754Z best={'block_sizes': [128, 2, 64], 2026-02-21T08:54:48.9207724Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:54:48.9208315Z 'l2_groupings': [4], 2026-02-21T08:54:48.9212114Z 
'load_eviction_policies': ['last', 'last'], 2026-02-21T08:54:48.9216192Z 'loop_orders': [[1, 0]], 2026-02-21T08:54:48.9221341Z 'maxnreg': 64, 2026-02-21T08:54:48.9221672Z 'num_sm_multiplier': 2, 2026-02-21T08:54:48.9222111Z 'num_stages': 3, 2026-02-21T08:54:48.9227077Z 'num_warps': 4, 2026-02-21T08:54:48.9227364Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:54:48.9227592Z 'range_flattens': [False, True], 2026-02-21T08:54:48.9232265Z 'range_multi_buffers': [False, False], 2026-02-21T08:54:48.9233277Z 'range_num_stages': [2, 4], 2026-02-21T08:54:48.9233493Z 'range_unroll_factors': [0, 0], 2026-02-21T08:54:48.9233695Z 'range_warp_specializes': [True, None]} 2026-02-21T08:54:48.9233918Z [81s] Fitting surrogate: 306 points, 306 targets 2026-02-21T08:54:50.2220015Z [82s] Generation 3 starting: 95 neighbors, 5 active search path(s) 2026-02-21T08:55:23.8634577Z [116s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['', 'first'], loop_orders=[[1, 0]], num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, None]) 2026-02-21T08:55:24.2755102Z [116s] Timeout after 30s compiling Config(block_sizes=[2048, 4, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['', 'first'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:55:24.3926415Z [116s] Timeout after 30s compiling Config(block_sizes=[2048, 4, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['', 
'first'], loop_orders=[[1, 0]], num_sm_multiplier=32, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T08:55:24.3943393Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 0.8 configs/s 2026-02-21T08:55:30.3570717Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 16.7 configs/s 2026-02-21T08:55:34.3263789Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 255.2 2026-02-21T08:55:34.3264595Z configs/s 2026-02-21T08:55:34.5587102Z [127s] Generation 3 complete: 2026-02-21T08:55:34.5587395Z timeout=3 2026-02-21T08:55:34.5587587Z ok=98 2026-02-21T08:55:34.5587766Z min=0.0359 2026-02-21T08:55:34.5587929Z mid=0.0544 2026-02-21T08:55:34.5588119Z max=1.4847 2026-02-21T08:55:34.5588295Z best={'block_sizes': [128, 2, 64], 2026-02-21T08:55:34.5588580Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:55:34.5588821Z 'l2_groupings': [4], 2026-02-21T08:55:34.5589037Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:55:34.5595570Z 'loop_orders': [[1, 0]], 2026-02-21T08:55:34.5596094Z 'maxnreg': 64, 2026-02-21T08:55:34.5596254Z 'num_sm_multiplier': 2, 2026-02-21T08:55:34.5596423Z 'num_stages': 3, 2026-02-21T08:55:34.5596573Z 'num_warps': 4, 2026-02-21T08:55:34.5596752Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:55:34.5596955Z 'range_flattens': [False, True], 2026-02-21T08:55:34.5597149Z 'range_multi_buffers': [False, False], 2026-02-21T08:55:34.5597353Z 'range_num_stages': [2, 4], 2026-02-21T08:55:34.5597533Z 'range_unroll_factors': [0, 0], 2026-02-21T08:55:34.5597717Z 'range_warp_specializes': [True, None]} 2026-02-21T08:55:34.5602518Z [127s] Fitting surrogate: 407 points, 407 targets 2026-02-21T08:55:35.7629334Z [128s] Generation 4 starting: 89 neighbors, 5 active search path(s) 2026-02-21T08:56:08.1482205Z 
[160s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 16], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T08:56:08.1497585Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 0.9 configs/s 2026-02-21T08:56:14.9219383Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 13.6 configs/s 2026-02-21T08:56:19.5840321Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 229.9 2026-02-21T08:56:19.5841922Z configs/s 2026-02-21T08:56:19.8530110Z [172s] Generation 4 complete: 2026-02-21T08:56:19.8534700Z timeout=1 2026-02-21T08:56:19.8538979Z ok=94 2026-02-21T08:56:19.8540441Z min=0.0359 2026-02-21T08:56:19.8540602Z mid=0.0523 2026-02-21T08:56:19.8540738Z max=2.2108 2026-02-21T08:56:19.8540882Z best={'block_sizes': [128, 2, 64], 2026-02-21T08:56:19.8541150Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:56:19.8541412Z 'l2_groupings': [4], 2026-02-21T08:56:19.8541693Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:56:19.8541899Z 'loop_orders': [[1, 0]], 2026-02-21T08:56:19.8542064Z 'maxnreg': 64, 2026-02-21T08:56:19.8542219Z 'num_sm_multiplier': 2, 2026-02-21T08:56:19.8542377Z 'num_stages': 3, 2026-02-21T08:56:19.8542529Z 'num_warps': 4, 2026-02-21T08:56:19.8542687Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:56:19.8542887Z 'range_flattens': [False, True], 2026-02-21T08:56:19.8543069Z 'range_multi_buffers': [False, False], 2026-02-21T08:56:19.8543265Z 'range_num_stages': [2, 4], 2026-02-21T08:56:19.8543432Z 'range_unroll_factors': [0, 0], 2026-02-21T08:56:19.8543626Z 'range_warp_specializes': [True, None]} 2026-02-21T08:56:19.8553811Z [172s] Fitting surrogate: 502 points, 502 
targets 2026-02-21T08:56:20.6358538Z [173s] Generation 5 starting: 49 neighbors, 3 active search path(s) 2026-02-21T08:56:32.0977106Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 2.5 configs/s 2026-02-21T08:56:35.2826857Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 16.2 configs/s 2026-02-21T08:56:36.4967552Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 826.9 2026-02-21T08:56:36.4972453Z configs/s 2026-02-21T08:56:36.5772444Z [189s] Generation 5 complete: 2026-02-21T08:56:36.5774316Z ok=53 2026-02-21T08:56:36.5774491Z min=0.0359 2026-02-21T08:56:36.5774631Z mid=0.0790 2026-02-21T08:56:36.5774769Z max=1.2830 2026-02-21T08:56:36.5774919Z best={'block_sizes': [256, 4, 16], 2026-02-21T08:56:36.5775145Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:56:36.5775620Z 'l2_groupings': [4], 2026-02-21T08:56:36.5775801Z 'load_eviction_policies': ['', ''], 2026-02-21T08:56:36.5775991Z 'loop_orders': [[1, 0]], 2026-02-21T08:56:36.5776160Z 'num_stages': 7, 2026-02-21T08:56:36.5776301Z 'num_warps': 2, 2026-02-21T08:56:36.5776447Z 'pid_type': 'xyz', 2026-02-21T08:56:36.5776612Z 'range_flattens': [None, True], 2026-02-21T08:56:36.5776884Z 'range_multi_buffers': [None, False], 2026-02-21T08:56:36.5777075Z 'range_num_stages': [0, 3], 2026-02-21T08:56:36.5777247Z 'range_unroll_factors': [0, 1], 2026-02-21T08:56:36.5777437Z 'range_warp_specializes': [None, None]} 2026-02-21T08:56:36.5797020Z [189s] Fitting surrogate: 555 points, 555 targets 2026-02-21T08:56:37.3688038Z [189s] Generation 6 starting: 44 neighbors, 3 active search path(s) 2026-02-21T08:56:50.2608310Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 1.7 configs/s 2026-02-21T08:56:53.0049207Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 17.0 configs/s 2026-02-21T08:56:55.1623543Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 534.5 2026-02-21T08:56:55.1624151Z configs/s 
2026-02-21T08:56:55.2833436Z [207s] Generation 6 complete: 2026-02-21T08:56:55.2835143Z ok=47 2026-02-21T08:56:55.2835371Z min=0.0359 2026-02-21T08:56:55.2835549Z mid=0.0543 2026-02-21T08:56:55.2835694Z max=1.1623 2026-02-21T08:56:55.2835864Z best={'block_sizes': [256, 4, 16], 2026-02-21T08:56:55.2836213Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:56:55.2836458Z 'l2_groupings': [4], 2026-02-21T08:56:55.2836645Z 'load_eviction_policies': ['', ''], 2026-02-21T08:56:55.2836851Z 'loop_orders': [[1, 0]], 2026-02-21T08:56:55.2837030Z 'num_stages': 7, 2026-02-21T08:56:55.2837202Z 'num_warps': 2, 2026-02-21T08:56:55.2837355Z 'pid_type': 'xyz', 2026-02-21T08:56:55.2837512Z 'range_flattens': [None, True], 2026-02-21T08:56:55.2837715Z 'range_multi_buffers': [None, False], 2026-02-21T08:56:55.2837914Z 'range_num_stages': [0, 3], 2026-02-21T08:56:55.2838094Z 'range_unroll_factors': [0, 1], 2026-02-21T08:56:55.2838277Z 'range_warp_specializes': [None, None]} 2026-02-21T08:56:55.2850973Z [207s] Fitting surrogate: 602 points, 602 targets 2026-02-21T08:56:55.8553744Z [208s] Generation 7 starting: 31 neighbors, 2 active search path(s) 2026-02-21T08:57:04.7929387Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 1.6 configs/s 2026-02-21T08:57:06.8356143Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 16.5 configs/s 2026-02-21T08:57:07.7453881Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1097.6 2026-02-21T08:57:07.7455693Z configs/s 2026-02-21T08:57:07.8079741Z [220s] Generation 7 complete: 2026-02-21T08:57:07.8084117Z ok=34 2026-02-21T08:57:07.8085522Z min=0.0359 2026-02-21T08:57:07.8085695Z mid=0.0973 2026-02-21T08:57:07.8085838Z max=2.8098 2026-02-21T08:57:07.8085988Z best={'block_sizes': [256, 4, 16], 2026-02-21T08:57:07.8086218Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:57:07.8086424Z 'l2_groupings': [4], 2026-02-21T08:57:07.8086599Z 'load_eviction_policies': ['', ''], 
2026-02-21T08:57:07.8086784Z 'loop_orders': [[1, 0]], 2026-02-21T08:57:07.8086960Z 'num_stages': 7, 2026-02-21T08:57:07.8087112Z 'num_warps': 2, 2026-02-21T08:57:07.8087267Z 'pid_type': 'xyz', 2026-02-21T08:57:07.8087424Z 'range_flattens': [None, True], 2026-02-21T08:57:07.8087624Z 'range_multi_buffers': [None, False], 2026-02-21T08:57:07.8087810Z 'range_num_stages': [0, 3], 2026-02-21T08:57:07.8087987Z 'range_unroll_factors': [0, 1], 2026-02-21T08:57:07.8088176Z 'range_warp_specializes': [None, None]} 2026-02-21T08:57:07.8104549Z [220s] Fitting surrogate: 636 points, 636 targets 2026-02-21T08:57:08.3919380Z [220s] Generation 8 starting: 26 neighbors, 2 active search path(s) 2026-02-21T08:57:20.9678621Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 1.4 configs/s 2026-02-21T08:57:22.5603155Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 17.4 configs/s 2026-02-21T08:57:23.3482804Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1264.7 2026-02-21T08:57:23.3487134Z configs/s 2026-02-21T08:57:23.4051503Z [235s] Generation 8 complete: 2026-02-21T08:57:23.4055932Z ok=29 2026-02-21T08:57:23.4060282Z min=0.0359 2026-02-21T08:57:23.4064480Z mid=0.0687 2026-02-21T08:57:23.4068172Z max=0.9308 2026-02-21T08:57:23.4072630Z best={'block_sizes': [256, 4, 16], 2026-02-21T08:57:23.4072954Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:57:23.4077154Z 'l2_groupings': [4], 2026-02-21T08:57:23.4081661Z 'load_eviction_policies': ['', ''], 2026-02-21T08:57:23.4085972Z 'loop_orders': [[1, 0]], 2026-02-21T08:57:23.4087511Z 'num_stages': 7, 2026-02-21T08:57:23.4087708Z 'num_warps': 2, 2026-02-21T08:57:23.4087880Z 'pid_type': 'xyz', 2026-02-21T08:57:23.4088045Z 'range_flattens': [None, True], 2026-02-21T08:57:23.4088238Z 'range_multi_buffers': [None, False], 2026-02-21T08:57:23.4088431Z 'range_num_stages': [0, 3], 2026-02-21T08:57:23.4088598Z 'range_unroll_factors': [0, 1], 2026-02-21T08:57:23.4088989Z 
'range_warp_specializes': [None, None]} 2026-02-21T08:57:23.4089211Z [235s] Fitting surrogate: 665 points, 665 targets 2026-02-21T08:57:23.8986553Z [236s] Generation 9 starting: 22 neighbors, 2 active search path(s) 2026-02-21T08:57:28.5994210Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 2.8 configs/s 2026-02-21T08:57:29.9413661Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 17.8 configs/s 2026-02-21T08:57:30.8082663Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1152.3 2026-02-21T08:57:30.8087471Z configs/s 2026-02-21T08:57:30.8716680Z [243s] Generation 9 complete: 2026-02-21T08:57:30.8717094Z ok=25 2026-02-21T08:57:30.8722175Z min=0.0359 2026-02-21T08:57:30.8726410Z mid=0.0604 2026-02-21T08:57:30.8729891Z max=0.4414 2026-02-21T08:57:30.8733477Z best={'block_sizes': [256, 4, 16], 2026-02-21T08:57:30.8737504Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:57:30.8738858Z 'l2_groupings': [4], 2026-02-21T08:57:30.8739060Z 'load_eviction_policies': ['', ''], 2026-02-21T08:57:30.8739272Z 'loop_orders': [[1, 0]], 2026-02-21T08:57:30.8739439Z 'num_stages': 7, 2026-02-21T08:57:30.8739592Z 'num_warps': 2, 2026-02-21T08:57:30.8739742Z 'pid_type': 'xyz', 2026-02-21T08:57:30.8739895Z 'range_flattens': [None, True], 2026-02-21T08:57:30.8740083Z 'range_multi_buffers': [None, False], 2026-02-21T08:57:30.8740276Z 'range_num_stages': [0, 3], 2026-02-21T08:57:30.8740443Z 'range_unroll_factors': [0, 1], 2026-02-21T08:57:30.8740633Z 'range_warp_specializes': [None, None]} 2026-02-21T08:57:30.8740840Z [243s] Fitting surrogate: 690 points, 690 targets 2026-02-21T08:57:31.1634443Z [243s] Autotuning complete in 243.7s after searching 667 configs. 
2026-02-21T08:57:31.1634962Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:57:31.1636101Z @helion.kernel(config=helion.Config(block_sizes=[256, 4, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], num_stages=7, num_warps=2, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T08:57:31.1637017Z 2026-02-21T08:57:31.1637317Z [243s] Code of selected kernel: /tmp/torchinductor_root/2m/c2muht6eyqksbo6zkspxgx6yzwt5j5vqibp635xii43uj75knut2.py 2026-02-21T08:57:31.1842458Z from __future__ import annotations 2026-02-21T08:57:31.1843760Z 2026-02-21T08:57:31.1843903Z import torch 2026-02-21T08:57:31.1852268Z import triton 2026-02-21T08:57:31.1852455Z import triton.language as tl 2026-02-21T08:57:31.1852905Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T08:57:31.1853089Z 2026-02-21T08:57:31.1853175Z _BLOCK_SIZE_2 = tl.constexpr(16) 2026-02-21T08:57:31.1853362Z _BLOCK_SIZE_1 = tl.constexpr(4) 2026-02-21T08:57:31.1853556Z _BLOCK_SIZE_0 = tl.constexpr(256) 2026-02-21T08:57:31.1853676Z 2026-02-21T08:57:31.1853741Z @triton.jit 2026-02-21T08:57:31.1854073Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr): 2026-02-21T08:57:31.1854515Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T08:57:31.1854765Z num_blocks_0 = tl.cdiv(8192, _BLOCK_SIZE_2) 2026-02-21T08:57:31.1854987Z num_pid_m = tl.cdiv(8192, _BLOCK_SIZE_2) 2026-02-21T08:57:31.1855185Z num_pid_n = tl.cdiv(4, _BLOCK_SIZE_1) 2026-02-21T08:57:31.1855438Z inner_2d_pid = tl.program_id(0) + tl.program_id(1) * num_blocks_0 2026-02-21T08:57:31.1855685Z num_pid_in_group = 4 * num_pid_n 2026-02-21T08:57:31.1855904Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T08:57:31.1856124Z first_pid_m = 
group_id * 4 2026-02-21T08:57:31.1856336Z group_size_m = min(num_pid_m - first_pid_m, 4) 2026-02-21T08:57:31.1856604Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T08:57:31.1856942Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T08:57:31.1857177Z offset_2 = pid_0 * _BLOCK_SIZE_2 2026-02-21T08:57:31.1857408Z indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T08:57:31.1857650Z offset_1 = pid_1 * _BLOCK_SIZE_1 2026-02-21T08:57:31.1857872Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T08:57:31.1858184Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T08:57:31.1858523Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T08:57:31.1858876Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T08:57:31.1859293Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T08:57:31.1859722Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T08:57:31.1860020Z # src[int4_gemm.py:60-89]: ... 
2026-02-21T08:57:31.1860425Z for offset_3 in tl.range(0, 1792, _BLOCK_SIZE_0, loop_unroll_factor=1, num_stages=3, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T08:57:31.1860886Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T08:57:31.1861157Z acc_copy = acc 2026-02-21T08:57:31.1861332Z acc_copy_0 = acc_copy 2026-02-21T08:57:31.1861598Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T08:57:31.1861850Z mul = 2 * offset_3 2026-02-21T08:57:31.1862124Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T08:57:31.1862439Z iota = mul + tl.arange(0, mul_1) 2026-02-21T08:57:31.1862697Z load = tl.load(A + (indices_1[:, None] * 3584 + iota[None, :] * 1), None) 2026-02-21T08:57:31.1863076Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T08:57:31.1863398Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T08:57:31.1863645Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T08:57:31.1863887Z v_0 = tl.cast(load, tl.float32) 2026-02-21T08:57:31.1864184Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T08:57:31.1864554Z b_tile = tl.load(B + (indices_3[:, None] * 8192 + indices_2[None, :] * 1), None) 2026-02-21T08:57:31.1864943Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T08:57:31.1865245Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T08:57:31.1865443Z v_2 = b_tile << v_1 2026-02-21T08:57:31.1865658Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T08:57:31.1865859Z v_4 = v_2 >> v_3 2026-02-21T08:57:31.1866109Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T08:57:31.1866401Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T08:57:31.1866581Z v_6 = b_tile >> v_5 2026-02-21T08:57:31.1866851Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 
2026-02-21T08:57:31.1867138Z stack_idx = tl.arange(0, 2) 2026-02-21T08:57:31.1867384Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T08:57:31.1867615Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T08:57:31.1867821Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T08:57:31.1868046Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T08:57:31.1868261Z mask_0 = broadcast_idx == 0 2026-02-21T08:57:31.1868513Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T08:57:31.1868760Z mask_1 = broadcast_idx == 1 2026-02-21T08:57:31.1868996Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T08:57:31.1869290Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T08:57:31.1869616Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T08:57:31.1869907Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T08:57:31.1870172Z view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2]) 2026-02-21T08:57:31.1870423Z v_7 = tl.cast(view, tl.float32) 2026-02-21T08:57:31.1870694Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T08:57:31.1870966Z a_tile_1 = v_0[:, :, None] 2026-02-21T08:57:31.1871195Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T08:57:31.1871425Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T08:57:31.1871745Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T08:57:31.1872042Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T08:57:31.1872249Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T08:57:31.1872451Z acc = acc_copy_0 + sum_1 2026-02-21T08:57:31.1872683Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T08:57:31.1872932Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T08:57:31.1873177Z tl.store(C + (indices_1[:, None] * 8192 + indices_2[None, :] * 1), v_10, None) 2026-02-21T08:57:31.1873375Z 2026-02-21T08:57:31.1873509Z 
def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T08:57:31.1873752Z """ 2026-02-21T08:57:31.1873935Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T08:57:31.1874090Z 2026-02-21T08:57:31.1874196Z This kernel performs matrix multiplication where: 2026-02-21T08:57:31.1874422Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T08:57:31.1874685Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T08:57:31.1874942Z (two 4-bit values packed into each int8) 2026-02-21T08:57:31.1875078Z 2026-02-21T08:57:31.1875133Z Args: 2026-02-21T08:57:31.1875316Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T08:57:31.1875604Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T08:57:31.1875777Z 2026-02-21T08:57:31.1875846Z Returns: 2026-02-21T08:57:31.1876035Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 2026-02-21T08:57:31.1876255Z """ 2026-02-21T08:57:31.1876398Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T08:57:31.1876586Z M, K = A.shape 2026-02-21T08:57:31.1876737Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T08:57:31.1876922Z _, N = B.shape 2026-02-21T08:57:31.1877149Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T08:57:31.1877501Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T08:57:31.1877773Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T08:57:31.1877994Z _BLOCK_SIZE_2 = 16 2026-02-21T08:57:31.1878156Z _BLOCK_SIZE_1 = 4 2026-02-21T08:57:31.1878406Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T08:57:31.1878870Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T08:57:31.1879282Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T08:57:31.1879562Z # 
src[int4_gemm.py:60-89]: ... 2026-02-21T08:57:31.1879748Z _BLOCK_SIZE_0 = 256 2026-02-21T08:57:31.1879942Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T08:57:31.1880228Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T08:57:31.1880491Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T08:57:31.1880700Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T08:57:31.1880919Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T08:57:31.1881218Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T08:57:31.1881515Z # src[int4_gemm.py:57-91]: ... 2026-02-21T08:57:31.1881760Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T08:57:31.1882235Z _launcher(_helion_matmul_bf16_int4, (triton.cdiv(8192, _BLOCK_SIZE_2), triton.cdiv(4, _BLOCK_SIZE_1)), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=2, num_stages=7) 2026-02-21T08:57:31.1882660Z # src[int4_gemm.py:93]: return C 2026-02-21T08:57:31.1882841Z return C 2026-02-21T08:57:32.1007596Z WARNING:tritonbench.utils.triton_op:Completed input ID 7: 2026-02-21T08:57:32.1011999Z x_val 2026-02-21T08:57:32.1015087Z ------------------ 2026-02-21T08:57:32.1019472Z (4, 1, 8192, 3584) 2026-02-21T08:57:32.1020790Z 2026-02-21T08:57:32.1021387Z 30%|███ | 3/10 [13:32<30:48, 264.14s/it]WARNING:tritonbench.utils.triton_op:Running input ID 10: 2026-02-21T08:57:32.1021797Z x_val 2026-02-21T08:57:32.1021940Z ------------------- 2026-02-21T08:57:32.1022105Z (16, 1, 7168, 8192) 2026-02-21T08:57:32.1027657Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T08:57:33.2728672Z INFO:tritonbench.utils.triton_op:Took 125.47ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T08:57:34.5324310Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T08:57:35.8936823Z 
WARNING:__main__:Input tensor metadata: 2026-02-21T08:57:35.8940919Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:57:35.8945301Z 'dtype': 'torch.bfloat16', 2026-02-21T08:57:35.8946632Z 'shape': (16, 1, 8192), 2026-02-21T08:57:35.8946860Z 'stride': (8192, 8192, 1)}, 2026-02-21T08:57:35.8947086Z { 'device': 'cuda:0', 2026-02-21T08:57:35.8947281Z 'dtype': 'torch.int32', 2026-02-21T08:57:35.8947463Z 'shape': (8192, 7168), 2026-02-21T08:57:35.8947644Z 'stride': (7168, 1)}), 2026-02-21T08:57:35.8947811Z 'kwargs': {}} 2026-02-21T08:57:35.8968296Z INFO:tritonbench.utils.triton_op:Took 3.29ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T08:57:36.1519642Z [0s] Autotune random seed: 2136913670 2026-02-21T08:57:36.1773908Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:58:08.5988401Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 1, 1024], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T08:58:08.8387636Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T08:58:18.3756643Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.5 configs/s 2026-02-21T08:58:18.3772571Z [42s] Adaptive compile timeout: 30s (90% percentile=3.4s, bounds=[30.0s, 30s]) 2026-02-21T08:58:19.3250333Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 715/715 706.0 configs/s 2026-02-21T08:58:19.4097715Z [43s] Initial random population of 100, 5 starting points: 2026-02-21T08:58:19.4098809Z error=7 2026-02-21T08:58:19.4098978Z timeout=1 2026-02-21T08:58:19.4099106Z ok=92 2026-02-21T08:58:19.4099239Z min=0.2222 2026-02-21T08:58:19.4099375Z 
mid=2.1760 2026-02-21T08:58:19.4099499Z max=112.2212 2026-02-21T08:58:19.4099660Z best={'block_sizes': [16, 16, 16], 2026-02-21T08:58:19.4099874Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:58:19.4100084Z 'l2_groupings': [1], 2026-02-21T08:58:19.4100263Z 'load_eviction_policies': ['', ''], 2026-02-21T08:58:19.4100451Z 'loop_orders': [[0, 1]], 2026-02-21T08:58:19.4100604Z 'num_stages': 1, 2026-02-21T08:58:19.4100751Z 'num_warps': 4, 2026-02-21T08:58:19.4100890Z 'pid_type': 'flat', 2026-02-21T08:58:19.4101052Z 'range_flattens': [None, None], 2026-02-21T08:58:19.4101523Z 'range_multi_buffers': [None, None], 2026-02-21T08:58:19.4101884Z 'range_num_stages': [0, 0], 2026-02-21T08:58:19.4102066Z 'range_unroll_factors': [0, 0], 2026-02-21T08:58:19.4102251Z 'range_warp_specializes': [None, None]} 2026-02-21T08:58:19.4113597Z [43s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:58:20.5349497Z [44s] Generation 1 starting: 87 neighbors, 5 active search path(s) 2026-02-21T08:58:31.2762346Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 1.4 configs/s 2026-02-21T08:58:36.8115733Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 16.2 configs/s 2026-02-21T08:58:38.0290024Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 810.7 2026-02-21T08:58:38.0294131Z configs/s 2026-02-21T08:58:38.0955042Z [61s] Generation 1 complete: 2026-02-21T08:58:38.0957001Z ok=93 2026-02-21T08:58:38.0957177Z min=0.0993 2026-02-21T08:58:38.0957308Z mid=0.4557 2026-02-21T08:58:38.0957441Z max=6.4481 2026-02-21T08:58:38.0957611Z best={'block_sizes': [128, 16, 32], 2026-02-21T08:58:38.0957872Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:58:38.0958125Z 'l2_groupings': [32], 2026-02-21T08:58:38.0958307Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:58:38.0958498Z 'loop_orders': [[0, 1]], 2026-02-21T08:58:38.0958661Z 'num_stages': 3, 2026-02-21T08:58:38.0958809Z 'num_warps': 4, 
2026-02-21T08:58:38.0958951Z 'pid_type': 'flat', 2026-02-21T08:58:38.0959114Z 'range_flattens': [None, True], 2026-02-21T08:58:38.0959292Z 'range_multi_buffers': [None, True], 2026-02-21T08:58:38.0959488Z 'range_num_stages': [0, 3], 2026-02-21T08:58:38.0959660Z 'range_unroll_factors': [0, 0], 2026-02-21T08:58:38.0959854Z 'range_warp_specializes': [None, True]} 2026-02-21T08:58:38.0972580Z [61s] Fitting surrogate: 193 points, 193 targets 2026-02-21T08:58:39.1046531Z [62s] Generation 2 starting: 80 neighbors, 5 active search path(s) 2026-02-21T08:58:43.9979451Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 10.5 configs/s 2026-02-21T08:58:47.2858719Z [71s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 16, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]) 2026-02-21T08:58:47.2859780Z Tensor-likes are not close! 2026-02-21T08:58:47.2859912Z 2026-02-21T08:58:47.2859997Z Mismatched elements: 114203 / 114688 (99.6%) 2026-02-21T08:58:47.2860579Z Greatest absolute difference: 3040.0 at index (5, 4084) (up to 0.01 allowed) 2026-02-21T08:58:47.2860936Z Greatest relative difference: 68608.0 at index (15, 2156) (up to 0.01 allowed) 2026-02-21T08:58:47.2861264Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:58:47.2861438Z 2026-02-21T08:58:49.1518693Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 16.6 configs/s 2026-02-21T08:58:52.6931638Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 283.1 2026-02-21T08:58:52.6932239Z configs/s 2026-02-21T08:58:52.8310904Z [76s] Generation 2 complete: 2026-02-21T08:58:52.8315614Z error=1 2026-02-21T08:58:52.8319556Z ok=85 2026-02-21T08:58:52.8321171Z min=0.0891 2026-02-21T08:58:52.8321402Z mid=0.2365 2026-02-21T08:58:52.8324383Z max=5.4631 2026-02-21T08:58:52.8324566Z best={'block_sizes': [32, 16, 32], 2026-02-21T08:58:52.8324823Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:58:52.8325064Z 'l2_groupings': [2], 2026-02-21T08:58:52.8325244Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:58:52.8325439Z 'loop_orders': [[0, 1]], 2026-02-21T08:58:52.8325605Z 'num_stages': 5, 2026-02-21T08:58:52.8325747Z 'num_warps': 4, 2026-02-21T08:58:52.8325905Z 'pid_type': 'flat', 2026-02-21T08:58:52.8326068Z 'range_flattens': [None, False], 2026-02-21T08:58:52.8326258Z 'range_multi_buffers': [None, None], 2026-02-21T08:58:52.8326446Z 'range_num_stages': [0, 0], 2026-02-21T08:58:52.8326614Z 'range_unroll_factors': [0, 3], 2026-02-21T08:58:52.8326912Z 'range_warp_specializes': [None, False]} 2026-02-21T08:58:52.8331728Z [76s] Fitting surrogate: 279 points, 279 targets 2026-02-21T08:58:53.8941735Z [77s] Generation 3 starting: 77 neighbors, 5 active search path(s) 2026-02-21T08:58:59.9229499Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 6.1 configs/s 2026-02-21T08:59:04.6179554Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 17.2 configs/s 2026-02-21T08:59:08.8484965Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 237.5 2026-02-21T08:59:08.8489069Z configs/s 2026-02-21T08:59:09.0161351Z [92s] Generation 3 complete: 2026-02-21T08:59:09.0165887Z error=1 2026-02-21T08:59:09.0170301Z ok=82 
2026-02-21T08:59:09.0174760Z min=0.0809 2026-02-21T08:59:09.0178612Z mid=0.2304 2026-02-21T08:59:09.0182657Z max=3.0833 2026-02-21T08:59:09.0186610Z best={'block_sizes': [128, 16, 16], 2026-02-21T08:59:09.0186971Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:59:09.0187217Z 'l2_groupings': [2], 2026-02-21T08:59:09.0191074Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:59:09.0192754Z 'loop_orders': [[0, 1]], 2026-02-21T08:59:09.0192926Z 'num_stages': 2, 2026-02-21T08:59:09.0193082Z 'num_warps': 2, 2026-02-21T08:59:09.0193232Z 'pid_type': 'flat', 2026-02-21T08:59:09.0193687Z 'range_flattens': [None, None], 2026-02-21T08:59:09.0193885Z 'range_multi_buffers': [None, True], 2026-02-21T08:59:09.0194082Z 'range_num_stages': [0, 0], 2026-02-21T08:59:09.0194270Z 'range_unroll_factors': [0, 0], 2026-02-21T08:59:09.0194461Z 'range_warp_specializes': [None, None]} 2026-02-21T08:59:09.0194728Z [92s] Fitting surrogate: 362 points, 362 targets 2026-02-21T08:59:10.1562115Z [93s] Generation 4 starting: 84 neighbors, 5 active search path(s) 2026-02-21T08:59:19.6880765Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 3.4 configs/s 2026-02-21T08:59:22.0184440Z [105s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 16, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[None, False]) 2026-02-21T08:59:22.0185581Z Tensor-likes are not close! 
2026-02-21T08:59:22.0185710Z 2026-02-21T08:59:22.0185800Z Mismatched elements: 114084 / 114688 (99.5%) 2026-02-21T08:59:22.0186115Z Greatest absolute difference: 2800.0 at index (12, 6895) (up to 0.01 allowed) 2026-02-21T08:59:22.0186725Z Greatest relative difference: 140288.0 at index (2, 1421) (up to 0.01 allowed) 2026-02-21T08:59:22.0187084Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:59:22.0187292Z 2026-02-21T08:59:22.5551254Z [106s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 16, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]) 2026-02-21T08:59:22.5553206Z Tensor-likes are not close! 2026-02-21T08:59:22.5553344Z 2026-02-21T08:59:22.5553433Z Mismatched elements: 114250 / 114688 (99.6%) 2026-02-21T08:59:22.5553723Z Greatest absolute difference: 2848.0 at index (14, 1361) (up to 0.01 allowed) 2026-02-21T08:59:22.5554076Z Greatest relative difference: 94208.0 at index (15, 2156) (up to 0.01 allowed) 2026-02-21T08:59:22.5554407Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:59:22.5554585Z 2026-02-21T08:59:24.6842723Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 17.6 configs/s 2026-02-21T08:59:30.1770464Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 194.3 2026-02-21T08:59:30.1774287Z configs/s 2026-02-21T08:59:30.3927322Z [114s] Generation 4 complete: 2026-02-21T08:59:30.3928643Z error=5 2026-02-21T08:59:30.3928794Z ok=84 2026-02-21T08:59:30.3928928Z min=0.0789 2026-02-21T08:59:30.3929056Z mid=0.1547 2026-02-21T08:59:30.3929187Z max=9.7732 2026-02-21T08:59:30.3929340Z best={'block_sizes': [64, 16, 32], 2026-02-21T08:59:30.3929584Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:59:30.3929813Z 'l2_groupings': [2], 2026-02-21T08:59:30.3929986Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:59:30.3930187Z 'loop_orders': [[0, 1]], 2026-02-21T08:59:30.3930342Z 'num_stages': 5, 2026-02-21T08:59:30.3930501Z 'num_warps': 4, 2026-02-21T08:59:30.3930640Z 'pid_type': 'flat', 2026-02-21T08:59:30.3930803Z 'range_flattens': [None, False], 2026-02-21T08:59:30.3930986Z 'range_multi_buffers': [None, True], 2026-02-21T08:59:30.3931181Z 'range_num_stages': [0, 0], 2026-02-21T08:59:30.3931360Z 'range_unroll_factors': [0, 2], 2026-02-21T08:59:30.3931740Z 'range_warp_specializes': [None, False]} 2026-02-21T08:59:30.3946068Z [114s] Fitting surrogate: 451 points, 451 targets 2026-02-21T08:59:31.4892147Z [115s] Generation 5 starting: 81 neighbors, 5 active search path(s) 2026-02-21T08:59:47.2934853Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 0.4 configs/s 2026-02-21T08:59:52.2795638Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 16.8 configs/s 2026-02-21T08:59:56.4041964Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 243.9 2026-02-21T08:59:56.4046035Z configs/s 2026-02-21T08:59:56.5752075Z [140s] Generation 5 complete: 2026-02-21T08:59:56.5753877Z error=2 2026-02-21T08:59:56.5754072Z ok=84 
2026-02-21T08:59:56.5754241Z min=0.0789 2026-02-21T08:59:56.5754408Z mid=0.1485 2026-02-21T08:59:56.5754564Z max=6.1450 2026-02-21T08:59:56.5754727Z best={'block_sizes': [64, 16, 16], 2026-02-21T08:59:56.5755003Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:59:56.5755246Z 'l2_groupings': [2], 2026-02-21T08:59:56.5755441Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:59:56.5755659Z 'loop_orders': [[0, 1]], 2026-02-21T08:59:56.5755824Z 'num_stages': 3, 2026-02-21T08:59:56.5756003Z 'num_warps': 2, 2026-02-21T08:59:56.5756187Z 'pid_type': 'flat', 2026-02-21T08:59:56.5756365Z 'range_flattens': [None, None], 2026-02-21T08:59:56.5756554Z 'range_multi_buffers': [None, True], 2026-02-21T08:59:56.5756751Z 'range_num_stages': [0, 0], 2026-02-21T08:59:56.5775938Z 'range_unroll_factors': [0, 0], 2026-02-21T08:59:56.5776462Z 'range_warp_specializes': [None, None]} 2026-02-21T08:59:56.5776724Z [140s] Fitting surrogate: 537 points, 537 targets 2026-02-21T08:59:57.5097691Z [141s] Generation 6 starting: 62 neighbors, 4 active search path(s) 2026-02-21T09:00:02.2807625Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 5.6 configs/s 2026-02-21T09:00:06.1616783Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 16.7 configs/s 2026-02-21T09:00:10.2049292Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 268.5 2026-02-21T09:00:10.2053240Z configs/s 2026-02-21T09:00:10.3734932Z [154s] Generation 6 complete: 2026-02-21T09:00:10.3739211Z ok=66 2026-02-21T09:00:10.3740704Z min=0.0789 2026-02-21T09:00:10.3740878Z mid=0.1321 2026-02-21T09:00:10.3741023Z max=5.8332 2026-02-21T09:00:10.3741175Z best={'block_sizes': [64, 16, 32], 2026-02-21T09:00:10.3741411Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:00:10.3741840Z 'l2_groupings': [2], 2026-02-21T09:00:10.3742036Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:00:10.3742246Z 'loop_orders': [[0, 1]], 2026-02-21T09:00:10.3742413Z 
'num_stages': 5, 2026-02-21T09:00:10.3742565Z 'num_warps': 4, 2026-02-21T09:00:10.3742719Z 'pid_type': 'flat', 2026-02-21T09:00:10.3742878Z 'range_flattens': [None, True], 2026-02-21T09:00:10.3743066Z 'range_multi_buffers': [None, None], 2026-02-21T09:00:10.3743260Z 'range_num_stages': [0, 0], 2026-02-21T09:00:10.3743430Z 'range_unroll_factors': [0, 1], 2026-02-21T09:00:10.3743622Z 'range_warp_specializes': [None, False]} 2026-02-21T09:00:10.3754281Z [154s] Fitting surrogate: 603 points, 603 targets 2026-02-21T09:00:10.8266708Z [154s] Generation 7 starting: 14 neighbors, 1 active search path(s) 2026-02-21T09:00:13.8392599Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 5.0 configs/s 2026-02-21T09:00:14.7271510Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 15/15 17.8 configs/s 2026-02-21T09:00:15.2324771Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1915.3 2026-02-21T09:00:15.2329381Z configs/s 2026-02-21T09:00:15.2751183Z [159s] Generation 7 complete: 2026-02-21T09:00:15.2757942Z ok=16 2026-02-21T09:00:15.2766889Z min=0.0789 2026-02-21T09:00:15.2770274Z mid=0.1547 2026-02-21T09:00:15.2773272Z max=1.2627 2026-02-21T09:00:15.2783941Z best={'block_sizes': [64, 16, 32], 2026-02-21T09:00:15.2785685Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:00:15.2785947Z 'l2_groupings': [2], 2026-02-21T09:00:15.2786128Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:00:15.2786645Z 'loop_orders': [[0, 1]], 2026-02-21T09:00:15.2786820Z 'num_stages': 5, 2026-02-21T09:00:15.2786970Z 'num_warps': 4, 2026-02-21T09:00:15.2787137Z 'pid_type': 'flat', 2026-02-21T09:00:15.2787318Z 'range_flattens': [None, True], 2026-02-21T09:00:15.2787532Z 'range_multi_buffers': [None, None], 2026-02-21T09:00:15.2787752Z 'range_num_stages': [0, 0], 2026-02-21T09:00:15.2788103Z 'range_unroll_factors': [0, 1], 2026-02-21T09:00:15.2788328Z 'range_warp_specializes': [None, False]} 2026-02-21T09:00:15.2788690Z [159s] Fitting 
surrogate: 619 points, 619 targets 2026-02-21T09:00:15.8152904Z [159s] Generation 8 starting: 17 neighbors, 1 active search path(s) 2026-02-21T09:00:17.6214059Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 14.4 configs/s 2026-02-21T09:00:18.7258340Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 18/18 17.0 configs/s 2026-02-21T09:00:19.2352367Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1893.5 2026-02-21T09:00:19.2356648Z configs/s 2026-02-21T09:00:19.2780333Z [163s] Generation 8 complete: 2026-02-21T09:00:19.2784778Z ok=19 2026-02-21T09:00:19.2789222Z min=0.0789 2026-02-21T09:00:19.2793493Z mid=0.2222 2026-02-21T09:00:19.2797927Z max=5.1774 2026-02-21T09:00:19.2800587Z best={'block_sizes': [64, 16, 32], 2026-02-21T09:00:19.2805180Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:00:19.2805507Z 'l2_groupings': [2], 2026-02-21T09:00:19.2805737Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:00:19.2805969Z 'loop_orders': [[0, 1]], 2026-02-21T09:00:19.2811107Z 'num_stages': 5, 2026-02-21T09:00:19.2815446Z 'num_warps': 4, 2026-02-21T09:00:19.2819823Z 'pid_type': 'flat', 2026-02-21T09:00:19.2824157Z 'range_flattens': [None, True], 2026-02-21T09:00:19.2825761Z 'range_multi_buffers': [None, None], 2026-02-21T09:00:19.2825993Z 'range_num_stages': [0, 0], 2026-02-21T09:00:19.2826191Z 'range_unroll_factors': [0, 1], 2026-02-21T09:00:19.2826380Z 'range_warp_specializes': [None, False]} 2026-02-21T09:00:19.2826677Z [163s] Fitting surrogate: 638 points, 638 targets 2026-02-21T09:00:19.7235357Z [163s] Generation 9 starting: 16 neighbors, 1 active search path(s) 2026-02-21T09:00:21.6106921Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 10.8 configs/s 2026-02-21T09:00:22.6314014Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 17.4 configs/s 2026-02-21T09:00:23.2499349Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1573.9 
2026-02-21T09:00:23.2501114Z configs/s 2026-02-21T09:00:23.2964200Z [167s] Generation 9 complete: 2026-02-21T09:00:23.2964461Z ok=18 2026-02-21T09:00:23.2968511Z min=0.0789 2026-02-21T09:00:23.2972915Z mid=0.1608 2026-02-21T09:00:23.2977209Z max=2.9645 2026-02-21T09:00:23.2981533Z best={'block_sizes': [64, 16, 32], 2026-02-21T09:00:23.2982907Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:00:23.2983143Z 'l2_groupings': [2], 2026-02-21T09:00:23.2983329Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:00:23.2983537Z 'loop_orders': [[0, 1]], 2026-02-21T09:00:23.2983699Z 'num_stages': 5, 2026-02-21T09:00:23.2983857Z 'num_warps': 4, 2026-02-21T09:00:23.2984021Z 'pid_type': 'flat', 2026-02-21T09:00:23.2984448Z 'range_flattens': [None, True], 2026-02-21T09:00:23.2984644Z 'range_multi_buffers': [None, None], 2026-02-21T09:00:23.2984857Z 'range_num_stages': [0, 0], 2026-02-21T09:00:23.2985038Z 'range_unroll_factors': [0, 1], 2026-02-21T09:00:23.2985244Z 'range_warp_specializes': [None, False]} 2026-02-21T09:00:23.2993032Z [167s] Fitting surrogate: 656 points, 656 targets 2026-02-21T09:00:23.7351506Z [167s] Generation 10 starting: 18 neighbors, 1 active search path(s) 2026-02-21T09:00:26.7700783Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 6.2 configs/s 2026-02-21T09:00:27.8238657Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 17.9 configs/s 2026-02-21T09:00:29.3085308Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 670.8 2026-02-21T09:00:29.3089737Z configs/s 2026-02-21T09:00:29.3832023Z [173s] Generation 10 complete: 2026-02-21T09:00:29.3834180Z ok=20 2026-02-21T09:00:29.3834400Z min=0.0789 2026-02-21T09:00:29.3834566Z mid=0.0890 2026-02-21T09:00:29.3834720Z max=2.3777 2026-02-21T09:00:29.3834974Z best={'block_sizes': [64, 16, 32], 2026-02-21T09:00:29.3835232Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:00:29.3835452Z 'l2_groupings': [2], 2026-02-21T09:00:29.3835638Z 
'load_eviction_policies': ['last', 'first'], 2026-02-21T09:00:29.3835853Z 'loop_orders': [[0, 1]], 2026-02-21T09:00:29.3836011Z 'num_stages': 5, 2026-02-21T09:00:29.3836176Z 'num_warps': 4, 2026-02-21T09:00:29.3836349Z 'pid_type': 'flat', 2026-02-21T09:00:29.3836527Z 'range_flattens': [None, True], 2026-02-21T09:00:29.3836728Z 'range_multi_buffers': [None, None], 2026-02-21T09:00:29.3836938Z 'range_num_stages': [0, 0], 2026-02-21T09:00:29.3837131Z 'range_unroll_factors': [0, 1], 2026-02-21T09:00:29.3837322Z 'range_warp_specializes': [None, False]} 2026-02-21T09:00:29.3858671Z [173s] Fitting surrogate: 676 points, 676 targets 2026-02-21T09:00:29.6557063Z [173s] Autotuning complete in 173.5s after searching 655 configs. 2026-02-21T09:00:29.6561494Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:00:29.6566304Z @helion.kernel(config=helion.Config(block_sizes=[64, 16, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_stages=5, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:00:29.6567199Z 2026-02-21T09:00:29.6567472Z [173s] Code of selected kernel: /tmp/torchinductor_root/co/ccop6xaxxthplexm32ktbonvkzdgl6qt222yxpfdv757phvaj76k.py 2026-02-21T09:00:29.6760659Z from __future__ import annotations 2026-02-21T09:00:29.6765179Z 2026-02-21T09:00:29.6767069Z import torch 2026-02-21T09:00:29.6767249Z import triton 2026-02-21T09:00:29.6767425Z import triton.language as tl 2026-02-21T09:00:29.6767670Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:00:29.6767864Z 2026-02-21T09:00:29.6767939Z _BLOCK_SIZE_1 = tl.constexpr(16) 2026-02-21T09:00:29.6768127Z _BLOCK_SIZE_2 = tl.constexpr(32) 2026-02-21T09:00:29.6768299Z _BLOCK_SIZE_0 = tl.constexpr(64) 2026-02-21T09:00:29.6768405Z 
2026-02-21T09:00:29.6768470Z @triton.jit 2026-02-21T09:00:29.6768702Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr): 2026-02-21T09:00:29.6769036Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:00:29.6769272Z num_pid_m = tl.cdiv(16, _BLOCK_SIZE_1) 2026-02-21T09:00:29.6769477Z num_pid_n = tl.cdiv(7168, _BLOCK_SIZE_2) 2026-02-21T09:00:29.6769677Z inner_2d_pid = tl.program_id(0) 2026-02-21T09:00:29.6769862Z num_pid_in_group = 2 * num_pid_n 2026-02-21T09:00:29.6770066Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T09:00:29.6770268Z first_pid_m = group_id * 2 2026-02-21T09:00:29.6770476Z group_size_m = min(num_pid_m - first_pid_m, 2) 2026-02-21T09:00:29.6770977Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T09:00:29.6774602Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T09:00:29.6776742Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T09:00:29.6776992Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T09:00:29.6777243Z offset_2 = pid_1 * _BLOCK_SIZE_2 2026-02-21T09:00:29.6777474Z indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T09:00:29.6777797Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:00:29.6778181Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T09:00:29.6778534Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:00:29.6779008Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:00:29.6779426Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:00:29.6779765Z # src[int4_gemm.py:60-89]: ... 
2026-02-21T09:00:29.6780087Z for offset_3 in tl.range(0, 4096, _BLOCK_SIZE_0, loop_unroll_factor=1, warp_specialize=False, flatten=True): 2026-02-21T09:00:29.6780480Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T09:00:29.6780715Z acc_copy = acc 2026-02-21T09:00:29.6780887Z acc_copy_0 = acc_copy 2026-02-21T09:00:29.6781106Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T09:00:29.6781347Z mul = 2 * offset_3 2026-02-21T09:00:29.6781648Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:00:29.6781940Z iota = mul + tl.arange(0, mul_1) 2026-02-21T09:00:29.6782264Z load = tl.load(A + (indices_1[:, None] * 8192 + iota[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T09:00:29.6782683Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:00:29.6783013Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T09:00:29.6783260Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T09:00:29.6783491Z v_0 = tl.cast(load, tl.float32) 2026-02-21T09:00:29.6783784Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T09:00:29.6784208Z b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T09:00:29.6784631Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:00:29.6784903Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:00:29.6785085Z v_2 = b_tile << v_1 2026-02-21T09:00:29.6785256Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:00:29.6785431Z v_4 = v_2 >> v_3 2026-02-21T09:00:29.6785674Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:00:29.6785936Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:00:29.6786109Z v_6 = b_tile >> v_5 2026-02-21T09:00:29.6786321Z # src[int4_gemm.py:79]: b_stacked = 
torch.stack([b_lo, b_hi], dim=1) 2026-02-21T09:00:29.6786565Z stack_idx = tl.arange(0, 2) 2026-02-21T09:00:29.6786759Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:00:29.6786956Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:00:29.6787170Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:00:29.6787370Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:00:29.6787575Z mask_0 = broadcast_idx == 0 2026-02-21T09:00:29.6787798Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T09:00:29.6788034Z mask_1 = broadcast_idx == 1 2026-02-21T09:00:29.6788252Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T09:00:29.6788568Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:00:29.6788887Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:00:29.6789170Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:00:29.6789443Z view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2]) 2026-02-21T09:00:29.6789696Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:00:29.6789987Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:00:29.6790316Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:00:29.6790553Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:00:29.6790803Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:00:29.6791157Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:00:29.6791479Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:00:29.6791748Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:00:29.6792025Z acc = acc_copy_0 + sum_1 2026-02-21T09:00:29.6792288Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:00:29.6792558Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:00:29.6792844Z tl.store(C + (indices_1[:, None] * 7168 + indices_2[None, :] * 1), v_10, None) 
2026-02-21T09:00:29.6793060Z 2026-02-21T09:00:29.6793223Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T09:00:29.6793469Z """ 2026-02-21T09:00:29.6793640Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:00:29.6793801Z 2026-02-21T09:00:29.6793895Z This kernel performs matrix multiplication where: 2026-02-21T09:00:29.6794124Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:00:29.6794377Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:00:29.6794639Z (two 4-bit values packed into each int8) 2026-02-21T09:00:29.6794765Z 2026-02-21T09:00:29.6794818Z Args: 2026-02-21T09:00:29.6795004Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:00:29.6795284Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T09:00:29.6795453Z 2026-02-21T09:00:29.6795509Z Returns: 2026-02-21T09:00:29.6795697Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 2026-02-21T09:00:29.6795905Z """ 2026-02-21T09:00:29.6796049Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:00:29.6796225Z M, K = A.shape 2026-02-21T09:00:29.6796383Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:00:29.6796554Z _, N = B.shape 2026-02-21T09:00:29.6796783Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:00:29.6797095Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:00:29.6797357Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:00:29.6797581Z _BLOCK_SIZE_1 = 16 2026-02-21T09:00:29.6797729Z _BLOCK_SIZE_2 = 32 2026-02-21T09:00:29.6797979Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:00:29.6798371Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:00:29.6798756Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the 
corresponding range in A 2026-02-21T09:00:29.6799030Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:00:29.6799197Z _BLOCK_SIZE_0 = 64 2026-02-21T09:00:29.6799391Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:00:29.6799661Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:00:29.6799923Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:00:29.6800116Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:00:29.6800372Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:00:29.6800664Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:00:29.6800919Z # src[int4_gemm.py:57-91]: ... 2026-02-21T09:00:29.6801134Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:00:29.6801644Z _launcher(_helion_matmul_bf16_int4, (triton.cdiv(16, _BLOCK_SIZE_1) * triton.cdiv(7168, _BLOCK_SIZE_2),), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=4, num_stages=5) 2026-02-21T09:00:29.6802077Z # src[int4_gemm.py:93]: return C 2026-02-21T09:00:29.6802281Z return C 2026-02-21T09:00:30.6894346Z WARNING:tritonbench.utils.triton_op:Completed input ID 10: 2026-02-21T09:00:30.6897737Z x_val 2026-02-21T09:00:30.6899113Z ------------------- 2026-02-21T09:00:30.6899317Z (16, 1, 7168, 8192) 2026-02-21T09:00:30.6899407Z 2026-02-21T09:00:30.6910723Z 40%|████ | 4/10 [16:30<23:02, 230.36s/it]WARNING:tritonbench.utils.triton_op:Running input ID 14: 2026-02-21T09:00:30.6915759Z x_val 2026-02-21T09:00:30.6920341Z ------------------- 2026-02-21T09:00:30.6925897Z (64, 1, 7168, 8192) 2026-02-21T09:00:30.6936976Z INFO:tritonbench.utils.triton_op:Took 0.19ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:00:31.7451890Z INFO:tritonbench.utils.triton_op:Took 2.80ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:00:32.9925172Z INFO:tritonbench.utils.triton_op:Took 0.10ms to get benchmark function for 
preprocessed_triton_int4_gemm 2026-02-21T09:00:34.2705160Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:00:34.2706655Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:00:34.2706927Z 'dtype': 'torch.bfloat16', 2026-02-21T09:00:34.2707164Z 'shape': (64, 1, 8192), 2026-02-21T09:00:34.2707379Z 'stride': (8192, 8192, 1)}, 2026-02-21T09:00:34.2707588Z { 'device': 'cuda:0', 2026-02-21T09:00:34.2707806Z 'dtype': 'torch.int32', 2026-02-21T09:00:34.2708019Z 'shape': (8192, 7168), 2026-02-21T09:00:34.2708208Z 'stride': (7168, 1)}), 2026-02-21T09:00:34.2708386Z 'kwargs': {}} 2026-02-21T09:00:34.2736898Z INFO:tritonbench.utils.triton_op:Took 3.34ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T09:00:34.5265882Z [0s] Autotune random seed: 2136913670 2026-02-21T09:00:34.5530251Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:00:56.8740768Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T09:01:02.2381976Z [27s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T09:01:02.2387147Z Tensor-likes are not close! 2026-02-21T09:01:02.2390475Z 2026-02-21T09:01:02.2393966Z Mismatched elements: 458750 / 458752 (100.0%) 2026-02-21T09:01:02.2397822Z Greatest absolute difference: 5696.0 at index (29, 5255) (up to 0.01 allowed) 2026-02-21T09:01:02.2402027Z Greatest relative difference: 3.078125 at index (19, 992) (up to 0.01 allowed) 2026-02-21T09:01:02.2405627Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:01:02.2406772Z 2026-02-21T09:01:06.3553313Z [31s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[64], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], num_stages=8, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T09:01:06.3554792Z Tensor-likes are not close! 2026-02-21T09:01:06.3554919Z 2026-02-21T09:01:06.3555004Z Mismatched elements: 458367 / 458752 (99.9%) 2026-02-21T09:01:06.3555309Z Greatest absolute difference: 7136.0 at index (29, 5263) (up to 0.01 allowed) 2026-02-21T09:01:06.3555712Z Greatest relative difference: 1064960.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:01:06.3558807Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:01:06.3559041Z 2026-02-21T09:01:09.3477808Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 8.0 configs/s 2026-02-21T09:01:09.3485471Z [34s] Adaptive compile timeout: 30s (90% percentile=3.4s, bounds=[30.0s, 30s]) 2026-02-21T09:01:10.0993087Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━ 868/868 1074.5 configs/s 2026-02-21T09:01:10.1728080Z [35s] Initial random population of 100, 5 starting points: 2026-02-21T09:01:10.1729418Z error=9 2026-02-21T09:01:10.1729823Z ok=91 2026-02-21T09:01:10.1730015Z min=0.2303 2026-02-21T09:01:10.1730189Z mid=5.0647 2026-02-21T09:01:10.1730343Z max=147.9752 2026-02-21T09:01:10.1730535Z best={'block_sizes': [32, 32, 64], 2026-02-21T09:01:10.1730877Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:01:10.1731134Z 'l2_groupings': [8], 2026-02-21T09:01:10.1731342Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:01:10.1731657Z 'loop_orders': [[1, 0]], 2026-02-21T09:01:10.1731845Z 'num_stages': 1, 2026-02-21T09:01:10.1732023Z 
'num_warps': 8, 2026-02-21T09:01:10.1732195Z 'pid_type': 'flat', 2026-02-21T09:01:10.1732390Z 'range_flattens': [None, None], 2026-02-21T09:01:10.1732597Z 'range_multi_buffers': [None, None], 2026-02-21T09:01:10.1732792Z 'range_num_stages': [0, 4], 2026-02-21T09:01:10.1733006Z 'range_unroll_factors': [0, 0], 2026-02-21T09:01:10.1733218Z 'range_warp_specializes': [None, None]} 2026-02-21T09:01:10.1741892Z [35s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:01:11.5459364Z [36s] Generation 1 starting: 100 neighbors, 5 active search path(s) 2026-02-21T09:01:23.7077656Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 103/103 2.1 configs/s 2026-02-21T09:01:27.3427111Z [52s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T09:01:27.3428448Z Tensor-likes are not close! 2026-02-21T09:01:27.3433533Z 2026-02-21T09:01:27.3438108Z Mismatched elements: 457788 / 458752 (99.8%) 2026-02-21T09:01:27.3442609Z Greatest absolute difference: 6752.0 at index (57, 1532) (up to 0.01 allowed) 2026-02-21T09:01:27.3447230Z Greatest relative difference: 1064960.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:01:27.3451006Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:01:27.3451271Z 2026-02-21T09:01:27.4045146Z [52s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T09:01:27.4046612Z Tensor-likes are not close! 2026-02-21T09:01:27.4051820Z 2026-02-21T09:01:27.4053938Z Mismatched elements: 456989 / 458752 (99.6%) 2026-02-21T09:01:27.4054244Z Greatest absolute difference: 3872.0 at index (57, 1532) (up to 0.01 allowed) 2026-02-21T09:01:27.4054603Z Greatest relative difference: 532480.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:01:27.4055152Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:01:27.4055325Z 2026-02-21T09:01:27.4093383Z [52s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 0], range_warp_specializes=[False, True]) 2026-02-21T09:01:27.4094493Z Tensor-likes are not close! 
2026-02-21T09:01:27.4094623Z 2026-02-21T09:01:27.4094705Z Mismatched elements: 457788 / 458752 (99.8%) 2026-02-21T09:01:27.4094974Z Greatest absolute difference: 6752.0 at index (57, 1532) (up to 0.01 allowed) 2026-02-21T09:01:27.4095510Z Greatest relative difference: 1064960.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:01:27.4095844Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:01:27.4096008Z 2026-02-21T09:01:27.5350789Z [52s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]) 2026-02-21T09:01:27.5352048Z Tensor-likes are not close! 2026-02-21T09:01:27.5352170Z 2026-02-21T09:01:27.5352262Z Mismatched elements: 457788 / 458752 (99.8%) 2026-02-21T09:01:27.5352533Z Greatest absolute difference: 6752.0 at index (57, 1532) (up to 0.01 allowed) 2026-02-21T09:01:27.5352882Z Greatest relative difference: 1064960.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:01:27.5353194Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:01:27.5353367Z 2026-02-21T09:01:27.5395912Z [52s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[2], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T09:01:27.5397051Z Tensor-likes are not close! 2026-02-21T09:01:27.5402717Z 2026-02-21T09:01:27.5404326Z Mismatched elements: 456587 / 458752 (99.5%) 2026-02-21T09:01:27.5404639Z Greatest absolute difference: 3072.0 at index (40, 446) (up to 0.01 allowed) 2026-02-21T09:01:27.5405004Z Greatest relative difference: 380928.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:01:27.5405328Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:01:27.5405505Z 2026-02-21T09:01:27.9604145Z [53s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T09:01:27.9605310Z Tensor-likes are not close! 2026-02-21T09:01:27.9608314Z 2026-02-21T09:01:27.9608572Z Mismatched elements: 456989 / 458752 (99.6%) 2026-02-21T09:01:27.9608906Z Greatest absolute difference: 3872.0 at index (57, 1532) (up to 0.01 allowed) 2026-02-21T09:01:27.9609295Z Greatest relative difference: 532480.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:01:27.9609654Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:01:27.9610110Z 2026-02-21T09:01:29.3298708Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 103/103 18.5 configs/s 2026-02-21T09:01:31.5734022Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 441.4 2026-02-21T09:01:31.5735588Z configs/s 2026-02-21T09:01:31.6658045Z [57s] Generation 1 complete: 2026-02-21T09:01:31.6659690Z error=12 2026-02-21T09:01:31.6659929Z ok=94 2026-02-21T09:01:31.6660109Z min=0.1441 2026-02-21T09:01:31.6660284Z mid=0.3144 2026-02-21T09:01:31.6660441Z max=9.6963 2026-02-21T09:01:31.6660623Z best={'block_sizes': [32, 32, 64], 2026-02-21T09:01:31.6660861Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:01:31.6661090Z 'l2_groupings': [4], 2026-02-21T09:01:31.6661269Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:01:31.6661462Z 'loop_orders': [[1, 0]], 2026-02-21T09:01:31.6661822Z 'num_stages': 1, 2026-02-21T09:01:31.6662185Z 'num_warps': 2, 2026-02-21T09:01:31.6662354Z 'pid_type': 'flat', 2026-02-21T09:01:31.6662517Z 'range_flattens': [None, None], 2026-02-21T09:01:31.6662783Z 'range_multi_buffers': [None, None], 2026-02-21T09:01:31.6662974Z 'range_num_stages': [0, 4], 2026-02-21T09:01:31.6663153Z 'range_unroll_factors': [0, 0], 2026-02-21T09:01:31.6663339Z 'range_warp_specializes': [None, None]} 2026-02-21T09:01:31.6674737Z [57s] Fitting surrogate: 206 points, 206 targets 2026-02-21T09:01:32.9144822Z [58s] Generation 2 starting: 93 neighbors, 5 active search path(s) 2026-02-21T09:01:39.1959641Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 12.1 configs/s 2026-02-21T09:01:42.7747289Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:42.7747930Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:42.7748179Z ^ 2026-02-21T09:01:42.7748598Z 
/tmp/torchinductor_root/5k/c5kz3cda3c5hfh2ilvy5vox5nwdc5rus3gvelzqvmjirftpd6ebf.py:90:40: note: called from 2026-02-21T09:01:42.7749006Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:42.7749211Z ^ 2026-02-21T09:01:42.7749605Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:42.7750056Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:42.7750301Z ^ 2026-02-21T09:01:42.7750492Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:01:42.7751039Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:42.7751689Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:01:42.7751910Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:01:42.7752103Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:01:42.7752290Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:01:42.7752483Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:01:42.7752662Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:42.7752855Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:01:42.7753087Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:42.7753334Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:42.7753573Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:42.7754078Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:01:42.7754329Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:42.7754543Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:01:42.7754785Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:42.7755036Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:42.7755313Z %c4_i32 = arith.constant 4 : i32 
2026-02-21T09:01:42.7755509Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:42.7755709Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:42.7755907Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:42.7756094Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:42.7756428Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:42.7756775Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:42.7756963Z %2 = arith.subi %c112_i32, %1 : i32 2026-02-21T09:01:42.7757163Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:01:42.7757360Z %3 = arith.subi %c9472_i32, %c1_i32_6 : i32 2026-02-21T09:01:42.7757563Z %4 = arith.addi %2, %3 : i32 2026-02-21T09:01:42.7757747Z %5 = arith.divui %4, %c9472_i32 : i32 2026-02-21T09:01:42.7758000Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:01:42.7758189Z %6 = arith.remsi %5, %c2_i32_7 : i32 2026-02-21T09:01:42.7758383Z %7 = arith.subi %5, %6 : i32 2026-02-21T09:01:42.7758629Z %8 = arith.muli %7, %c9472_i32 : i32 2026-02-21T09:01:42.7758816Z %9 = arith.addi %1, %8 : i32 2026-02-21T09:01:42.7759012Z %10 = arith.muli %c9472_i32, %c2_i32_7 : i32 2026-02-21T09:01:42.7759224Z scf.for %arg3 = %1 to %9 step %10 : i32 { 2026-02-21T09:01:42.7759439Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:42.7759635Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:42.7759829Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:42.7760017Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:42.7760218Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:42.7760419Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:42.7760602Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:42.7760795Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:42.7760981Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:42.7761231Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:42.7761499Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 
2026-02-21T09:01:42.7761766Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:42.7761960Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:42.7762165Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:42.7762376Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:42.7762577Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:42.7762920Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:42.7763264Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:42.7763514Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7763776Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7764028Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:42.7764305Z %67 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7764593Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7764873Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7765180Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7765447Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7765699Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:42.7766010Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7766319Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7766584Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7766822Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7767186Z %77 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7767494Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7767723Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7767945Z %80 = 
arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7768205Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7768512Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7768826Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7769156Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7769504Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7769792Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7770046Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7770346Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7770635Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7770900Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7771153Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7771416Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7771722Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7771996Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7772247Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7772599Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7772917Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:42.7773121Z %97 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:42.7773316Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:42.7773504Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:42.7773742Z %100 = tt.make_range {end = 32 : i32, start = 0 : 
i32} : tensor<32xi32> 2026-02-21T09:01:42.7773990Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7774207Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:42.7774466Z %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7774755Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7775021Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7775323Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7775599Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7775839Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:42.7776091Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7776385Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7776661Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7776909Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7777256Z %113 = tt.descriptor_load %0[%98, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7777565Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7777787Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7778019Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7778296Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7778607Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7778938Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7779272Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7779610Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 
2026-02-21T09:01:42.7779906Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7780172Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7780460Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7780781Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7781071Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7781349Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7781675Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7781956Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7782251Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7782522Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7782880Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7783206Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:42.7783406Z %133 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:42.7783610Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:42.7783798Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:42.7784042Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7784303Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7784510Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:42.7784776Z %139 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7785048Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7785314Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7785608Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> 
tensor<64x32xi32> 2026-02-21T09:01:42.7785877Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7786122Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:42.7786368Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7786668Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7786935Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7787181Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7787508Z %149 = tt.descriptor_load %0[%134, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7787808Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7788037Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7788287Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7788545Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7788852Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7789183Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7789547Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7789873Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7790165Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7790418Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7790702Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7791008Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7791288Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7791612Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:42.7791939Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7792231Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7792538Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7792814Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7793182Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7793508Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:42.7793710Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:42.7793908Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:42.7794103Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:42.7794338Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7794599Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7794816Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:42.7795072Z %175 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7795349Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7795619Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7795922Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7796184Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7796430Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:42.7796685Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7796978Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7797247Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7797484Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7797812Z %185 
= tt.descriptor_load %0[%170, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7798195Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7798410Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7798641Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7798905Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7799219Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7799579Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7799928Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7800276Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7800614Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7800889Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7801181Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7801496Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7801824Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7802089Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7802382Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7802676Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7802983Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7803293Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7803672Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7804048Z scf.yield %204 
: tensor<64x64xf32> 2026-02-21T09:01:42.7804242Z } 2026-02-21T09:01:42.7804448Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:42.7804758Z %28 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7805055Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:42.7805329Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:42.7805636Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:42.7805917Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:42.7806163Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:42.7806429Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:42.7806728Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:42.7807039Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:42.7807252Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:01:42.7807449Z %36 = arith.muli %c9472_i32, %c1_i32_9 : i32 2026-02-21T09:01:42.7807658Z %37 = arith.addi %arg3, %36 : i32 2026-02-21T09:01:42.7807846Z %38 = arith.divsi %37, %c448_i32 : i32 2026-02-21T09:01:42.7808043Z %39 = arith.muli %38, %c4_i32 : i32 2026-02-21T09:01:42.7808227Z %40 = arith.subi %c1_i32, %39 : i32 2026-02-21T09:01:42.7808417Z %41 = arith.minsi %40, %c4_i32 : i32 2026-02-21T09:01:42.7808605Z %42 = arith.remsi %37, %c448_i32 : i32 2026-02-21T09:01:42.7808799Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:01:42.7808987Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:01:42.7809165Z %45 = arith.divsi %42, %41 : i32 2026-02-21T09:01:42.7809354Z %46 = arith.muli %44, %c64_i32 : i32 2026-02-21T09:01:42.7809585Z %47 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:42.7809843Z %48 = tt.splat %46 : i32 -> tensor<64xi32> 2026-02-21T09:01:42.7810046Z %49 = arith.addi %48, %47 : tensor<64xi32> 
2026-02-21T09:01:42.7810247Z %50 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:01:42.7810435Z %51 = tt.splat %50 : i32 -> tensor<64xi32> 2026-02-21T09:01:42.7810643Z %52 = arith.addi %51, %47 : tensor<64xi32> 2026-02-21T09:01:42.7810843Z %c64_i32_10 = arith.constant 64 : i32 2026-02-21T09:01:42.7811165Z %53 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_10 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:42.7811531Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:42.7811792Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7812046Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7812280Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:42.7812540Z %67 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7812818Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7813074Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7813367Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7813624Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7813868Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:42.7814118Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7814401Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7814669Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7814933Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7815295Z %77 = tt.descriptor_load %0[%arg4, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7815599Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7815825Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7816048Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 
2026-02-21T09:01:42.7816293Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7816590Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7816903Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7817231Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7817556Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7817834Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7818096Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7818360Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7818650Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7818918Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7819175Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7819445Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7819716Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7819994Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7820251Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7820608Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7820927Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:42.7821135Z %97 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:42.7821339Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:42.7821522Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:42.7821794Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 
2026-02-21T09:01:42.7822045Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7822289Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:42.7822547Z %103 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7822826Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7823096Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7823427Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7823709Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7823952Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:42.7824207Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7824507Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7824773Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7825025Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7825345Z %113 = tt.descriptor_load %0[%98, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7825651Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7825903Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7826132Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7826388Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7826710Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7827037Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7827367Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7827700Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7827994Z %122 = 
arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7828250Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7828535Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7828827Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7829114Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7829373Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7829654Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7829944Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7830232Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7830510Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7830871Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7831201Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:42.7831404Z %133 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:42.7831640Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:42.7831838Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:42.7832072Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7832331Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7832536Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:42.7832797Z %139 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7833073Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7833347Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7833676Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 
2026-02-21T09:01:42.7833940Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7834192Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:42.7834439Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7834793Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7835067Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7835310Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7835641Z %149 = tt.descriptor_load %0[%134, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7835940Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7836169Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7836392Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7836647Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7836948Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7837292Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7837653Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7837975Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7838266Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7838523Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7838795Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7839090Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7839362Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7839620Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:42.7839889Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7840181Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7840470Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7840731Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7841091Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7841407Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:42.7841633Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:42.7841830Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:42.7842025Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:42.7842266Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7842522Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7842739Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:42.7842996Z %175 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7843278Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7843542Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7843863Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7844153Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7844417Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:42.7844712Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7845020Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7845307Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7845567Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7845930Z %185 
= tt.descriptor_load %0[%170, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7846252Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7846482Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7846716Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7846972Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7847284Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7847621Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7847955Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7848335Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7848635Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7848936Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7849231Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7849535Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7849834Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7850100Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7850386Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7850681Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7850986Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7851268Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7851669Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7852006Z scf.yield %204 
: tensor<64x64xf32> 2026-02-21T09:01:42.7852188Z } 2026-02-21T09:01:42.7852380Z %54 = arith.truncf %53 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:42.7852675Z %55 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7852958Z %56 = arith.muli %55, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:42.7853230Z %57 = tt.expand_dims %52 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:42.7853528Z %58 = tt.broadcast %56 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:42.7853804Z %59 = tt.broadcast %57 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:42.7854046Z %60 = arith.addi %58, %59 : tensor<64x64xi32> 2026-02-21T09:01:42.7854293Z %61 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:42.7854571Z %62 = tt.addptr %61, %60 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:42.7854834Z tt.store %62, %54 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:42.7855044Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:42.7855247Z scf.for %arg3 = %9 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T09:01:42.7855476Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:42.7855667Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:42.7855856Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:42.7856033Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:42.7856257Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:42.7856452Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:42.7856629Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:42.7856813Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:42.7856991Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:42.7857252Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:42.7857501Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:42.7857710Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:42.7857901Z %23 = arith.muli %18, %c64_i32 : i32 
2026-02-21T09:01:42.7858093Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:42.7858291Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:42.7858477Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:42.7858804Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:42.7859126Z %36 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:42.7859364Z %37 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7859612Z %38 = tt.splat %36 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7859846Z %39 = arith.addi %38, %37 : tensor<32xi32> 2026-02-21T09:01:42.7860111Z %40 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7860401Z %41 = arith.muli %40, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7860661Z %42 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7860944Z %43 = tt.broadcast %41 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7861205Z %44 = tt.broadcast %42 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7861443Z %45 = arith.addi %43, %44 : tensor<64x32xi32> 2026-02-21T09:01:42.7861708Z %46 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7862001Z %47 = tt.addptr %46, %45 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7862257Z %48 = tt.load %47 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7862502Z %49 = arith.extf %48 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7862822Z %50 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7863126Z %51 = arith.shli %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7863371Z %52 = arith.shrsi %51, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7863585Z %53 = arith.shrsi %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7863833Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : 
tensor<2xi32> 2026-02-21T09:01:42.7864117Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7864428Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7864741Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7865061Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7865346Z %59 = arith.cmpi eq, %56, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7865595Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7865866Z %61 = tt.broadcast %57 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7866146Z %62 = arith.select %60, %61, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7866418Z %63 = arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7866671Z %64 = tt.broadcast %58 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7866934Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7867246Z %66 = arith.select %65, %64, %62 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7867516Z %67 = tt.reshape %66 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7867776Z %68 = arith.sitofp %67 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7868118Z %69 = tt.dot %49, %68, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7868487Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:01:42.7868691Z %70 = arith.muli %c16_i32, %c1_i32_9 : i32 2026-02-21T09:01:42.7868886Z %71 = arith.addi %arg4, %70 : i32 2026-02-21T09:01:42.7869078Z %72 = arith.muli %71, %c2_i32 : i32 2026-02-21T09:01:42.7869306Z %73 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7869558Z %74 = tt.splat %72 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7869755Z %75 = 
arith.addi %74, %73 : tensor<32xi32> 2026-02-21T09:01:42.7870016Z %76 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7870294Z %77 = arith.muli %76, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7870549Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7870865Z %79 = tt.broadcast %77 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7871118Z %80 = tt.broadcast %78 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7871381Z %81 = arith.addi %79, %80 : tensor<64x32xi32> 2026-02-21T09:01:42.7871657Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7871947Z %83 = tt.addptr %82, %81 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7872207Z %84 = tt.load %83 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7872440Z %85 = arith.extf %84 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7872757Z %86 = tt.descriptor_load %0[%71, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7873048Z %87 = arith.shli %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7873272Z %88 = arith.shrsi %87, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7873498Z %89 = arith.shrsi %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7873737Z %90 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7874029Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7874370Z %92 = tt.expand_dims %91 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7874680Z %93 = tt.expand_dims %88 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7875002Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7875277Z %95 = arith.cmpi eq, %92, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7875533Z %96 = tt.broadcast %95 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 
2026-02-21T09:01:42.7875798Z %97 = tt.broadcast %93 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7876087Z %98 = arith.select %96, %97, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7876362Z %99 = arith.cmpi eq, %92, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7876613Z %100 = tt.broadcast %94 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7876894Z %101 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7877174Z %102 = arith.select %101, %100, %98 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7877469Z %103 = tt.reshape %102 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7877744Z %104 = arith.sitofp %103 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7878095Z %105 = tt.dot %85, %104, %69, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7878467Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:01:42.7878663Z %106 = arith.muli %c16_i32, %c2_i32_10 : i32 2026-02-21T09:01:42.7878863Z %107 = arith.addi %arg4, %106 : i32 2026-02-21T09:01:42.7879048Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T09:01:42.7879319Z %109 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7879576Z %110 = tt.splat %108 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7879781Z %111 = arith.addi %110, %109 : tensor<32xi32> 2026-02-21T09:01:42.7880041Z %112 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7880312Z %113 = arith.muli %112, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7880580Z %114 = tt.expand_dims %111 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7880875Z %115 = tt.broadcast %113 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7881143Z %116 = tt.broadcast %114 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7881389Z %117 = arith.addi %115, %116 : tensor<64x32xi32> 
2026-02-21T09:01:42.7881678Z %118 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7882008Z %119 = tt.addptr %118, %117 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7882279Z %120 = tt.load %119 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7882554Z %121 = arith.extf %120 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7882878Z %122 = tt.descriptor_load %0[%107, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7883186Z %123 = arith.shli %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7883417Z %124 = arith.shrsi %123, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7883643Z %125 = arith.shrsi %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7883909Z %126 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7884256Z %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7884581Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7884908Z %129 = tt.expand_dims %124 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7885233Z %130 = tt.expand_dims %125 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7885524Z %131 = arith.cmpi eq, %128, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7885776Z %132 = tt.broadcast %131 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7886057Z %133 = tt.broadcast %129 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7886349Z %134 = arith.select %132, %133, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7886637Z %135 = arith.cmpi eq, %128, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7886896Z %136 = tt.broadcast %130 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7887166Z %137 = tt.broadcast %135 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7887454Z %138 = arith.select %137, %136, %134 : tensor<16x2x64xi1>, 
tensor<16x2x64xi8> 2026-02-21T09:01:42.7887740Z %139 = tt.reshape %138 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7888011Z %140 = arith.sitofp %139 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7888379Z %141 = tt.dot %121, %140, %105, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7888707Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:42.7888914Z %142 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:42.7889117Z %143 = arith.addi %arg4, %142 : i32 2026-02-21T09:01:42.7889317Z %144 = arith.muli %143, %c2_i32 : i32 2026-02-21T09:01:42.7889598Z %145 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:42.7889872Z %146 = tt.splat %144 : i32 -> tensor<32xi32> 2026-02-21T09:01:42.7890099Z %147 = arith.addi %146, %145 : tensor<32xi32> 2026-02-21T09:01:42.7890368Z %148 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7890686Z %149 = arith.muli %148, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:42.7890959Z %150 = tt.expand_dims %147 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:42.7891271Z %151 = tt.broadcast %149 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7891583Z %152 = tt.broadcast %150 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:42.7891844Z %153 = arith.addi %151, %152 : tensor<64x32xi32> 2026-02-21T09:01:42.7892108Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7892412Z %155 = tt.addptr %154, %153 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:42.7892694Z %156 = tt.load %155 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:42.7892944Z %157 = arith.extf %156 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:42.7893316Z %158 = tt.descriptor_load %0[%143, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:42.7893633Z %159 = arith.shli %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7893903Z 
%160 = arith.shrsi %159, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7894144Z %161 = arith.shrsi %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:42.7894404Z %162 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:42.7894723Z %163 = tt.expand_dims %162 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:42.7895054Z %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:42.7895410Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7895759Z %166 = tt.expand_dims %161 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:42.7896061Z %167 = arith.cmpi eq, %164, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7896339Z %168 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7896630Z %169 = tt.broadcast %165 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7896931Z %170 = arith.select %168, %169, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7897210Z %171 = arith.cmpi eq, %164, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:42.7897472Z %172 = tt.broadcast %166 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:42.7897758Z %173 = tt.broadcast %171 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:42.7898041Z %174 = arith.select %173, %172, %170 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:42.7898348Z %175 = tt.reshape %174 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:42.7898614Z %176 = arith.sitofp %175 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:42.7898982Z %177 = tt.dot %157, %176, %141, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:42.7899309Z scf.yield %177 : tensor<64x64xf32> 2026-02-21T09:01:42.7899482Z } 2026-02-21T09:01:42.7899669Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:42.7899957Z %28 = 
tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:42.7900230Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:42.7900481Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:42.7900774Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:42.7901067Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:42.7901297Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:42.7901565Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:42.7901844Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:42.7902138Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:42.7902341Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:42.7902514Z tt.return 2026-02-21T09:01:42.7902642Z } 2026-02-21T09:01:42.7902773Z } 2026-02-21T09:01:42.7902844Z 2026-02-21T09:01:42.7902904Z {-# 2026-02-21T09:01:42.7903033Z external_resources: { 2026-02-21T09:01:42.7903196Z mlir_reproducer: { 2026-02-21T09:01:42.7908594Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, 
tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:42.7914733Z disable_threading: false, 2026-02-21T09:01:42.7914907Z verify_each: true 2026-02-21T09:01:42.7915051Z } 2026-02-21T09:01:42.7915182Z } 2026-02-21T09:01:42.7916211Z #-} 2026-02-21T09:01:42.7916636Z /tmp/torchinductor_root/5k/c5kz3cda3c5hfh2ilvy5vox5nwdc5rus3gvelzqvmjirftpd6ebf.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:01:42.7917658Z /tmp/torchinductor_root/5k/c5kz3cda3c5hfh2ilvy5vox5nwdc5rus3gvelzqvmjirftpd6ebf.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:01:42.7918458Z [68s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:01:42.7919627Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:01:42.7920681Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:01:42.7920943Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:01:43.1557304Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:43.1562735Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:43.1564119Z ^ 2026-02-21T09:01:43.1564832Z /tmp/torchinductor_root/yn/cynttwfh6do4sc3q7s5ls6uzybz3hivjpx5huxdz6fohr3mfoxxe.py:90:40: note: called from 2026-02-21T09:01:43.1565296Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:43.1565517Z ^ 2026-02-21T09:01:43.1565931Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:43.1566447Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:43.1566705Z ^ 2026-02-21T09:01:43.1567003Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:01:43.1571857Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:43.1573523Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:01:43.1573821Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:01:43.1574019Z 
%c16_i32 = arith.constant 16 : i32 2026-02-21T09:01:43.1574264Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:43.1574453Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:01:43.1574690Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:43.1574949Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:43.1575183Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:43.1575428Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:01:43.1575664Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:43.1575877Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:01:43.1576099Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:43.1576334Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:43.1576516Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:01:43.1576705Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:43.1576896Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:43.1577078Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:43.1577263Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:43.1577575Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:43.1577930Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:43.1578114Z %2 = arith.subi %c112_i32, %1 : i32 2026-02-21T09:01:43.1578301Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:01:43.1578493Z %3 = arith.subi %c9472_i32, %c1_i32 : i32 2026-02-21T09:01:43.1578673Z %4 = arith.addi %2, %3 : i32 2026-02-21T09:01:43.1578853Z %5 = arith.divui %4, %c9472_i32 : i32 2026-02-21T09:01:43.1579032Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:01:43.1579216Z %6 = arith.remsi %5, %c2_i32_6 : i32 2026-02-21T09:01:43.1579393Z %7 = arith.subi %5, %6 : i32 2026-02-21T09:01:43.1579565Z %8 = arith.muli %7, %c9472_i32 : i32 2026-02-21T09:01:43.1579743Z %9 = arith.addi %1, %8 : i32 2026-02-21T09:01:43.1579928Z %10 = arith.muli %c9472_i32, 
%c2_i32_6 : i32 2026-02-21T09:01:43.1580173Z scf.for %arg3 = %1 to %9 step %10 : i32 { 2026-02-21T09:01:43.1580380Z %11 = arith.divsi %arg3, %c4_i32 : i32 2026-02-21T09:01:43.1580566Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:43.1580758Z %13 = arith.subi %c112_i32, %12 : i32 2026-02-21T09:01:43.1580939Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:43.1581132Z %15 = arith.remsi %arg3, %c4_i32 : i32 2026-02-21T09:01:43.1581399Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:43.1581667Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:43.1581854Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:43.1582032Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:43.1582271Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:43.1582570Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.1582775Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:43.1582964Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:43.1583160Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.1583355Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:43.1583559Z %c64_i32_7 = arith.constant 64 : i32 2026-02-21T09:01:43.1583889Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_7 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:43.1584220Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:43.1584470Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1584724Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1584935Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:43.1585227Z %67 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1585505Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1585794Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1586082Z %70 = tt.broadcast %68 : 
tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1586349Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1586584Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:43.1586835Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1587128Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1587396Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1587640Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1587965Z %77 = tt.descriptor_load %0[%arg4, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1588276Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1588490Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1588714Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1588964Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1589252Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1589564Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1589883Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1590206Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1590488Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1590738Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1591010Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1591291Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1591598Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1591845Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:43.1592114Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1592389Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1592692Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1592953Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1593303Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1593659Z %c1_i32_10 = arith.constant 1 : i32 2026-02-21T09:01:43.1593857Z %97 = arith.muli %c16_i32, %c1_i32_10 : i32 2026-02-21T09:01:43.1594062Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:43.1594250Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:43.1594483Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1594744Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1594953Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:43.1595234Z %103 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1595512Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1595784Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1596090Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1596400Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1596661Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:43.1596942Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1597253Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1597540Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1597787Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1598118Z %113 = 
tt.descriptor_load %0[%98, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1598417Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1598656Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1598886Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1599158Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1599461Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1599783Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1600117Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1600445Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1600742Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1601009Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1601285Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1601622Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1601910Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1602176Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1602453Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1602748Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1603048Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1603314Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1603681Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1604038Z %c2_i32_11 = 
arith.constant 2 : i32 2026-02-21T09:01:43.1604251Z %133 = arith.muli %c16_i32, %c2_i32_11 : i32 2026-02-21T09:01:43.1604452Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:43.1604646Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:43.1604888Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1605166Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1605385Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:43.1605644Z %139 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1605924Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1606187Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1606491Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1606771Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1607012Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:43.1607290Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1607614Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1607894Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1608162Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1608494Z %149 = tt.descriptor_load %0[%134, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1608804Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1609024Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1609255Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1609504Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1609808Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1610144Z %155 = tt.expand_dims 
%154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1610497Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1610851Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1611154Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1611428Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1611753Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1612068Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1612366Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1612637Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1612934Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1613233Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1613554Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1613835Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1614217Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1614566Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:43.1614770Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:43.1614983Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:43.1615178Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:43.1615459Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1615730Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1615948Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:43.1616232Z %175 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> 
tensor<64x1xi32> 2026-02-21T09:01:43.1616540Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1616823Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1617135Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1617420Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1617679Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:43.1617938Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1618258Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1618537Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1618799Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1619166Z %185 = tt.descriptor_load %0[%170, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1619492Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1619777Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1620039Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1620297Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1620604Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1620937Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1621282Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1621663Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1621959Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1622214Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1622500Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:43.1622792Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1623074Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1623336Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1623607Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1623897Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1624181Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1624456Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1624824Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1625149Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:43.1625333Z } 2026-02-21T09:01:43.1625513Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:43.1625809Z %28 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1626073Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:43.1626333Z %30 = tt.expand_dims %22 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:43.1626624Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.1626919Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.1627160Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:43.1627402Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.1627693Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:43.1627976Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.1628190Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:01:43.1628390Z %36 = arith.muli %c9472_i32, %c1_i32_8 : i32 2026-02-21T09:01:43.1628586Z %37 = arith.addi 
%arg3, %36 : i32 2026-02-21T09:01:43.1628775Z %38 = arith.divsi %37, %c4_i32 : i32 2026-02-21T09:01:43.1629028Z %39 = arith.muli %38, %c4_i32 : i32 2026-02-21T09:01:43.1629215Z %40 = arith.subi %c112_i32, %39 : i32 2026-02-21T09:01:43.1629396Z %41 = arith.minsi %40, %c4_i32 : i32 2026-02-21T09:01:43.1629583Z %42 = arith.remsi %37, %c4_i32 : i32 2026-02-21T09:01:43.1629763Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:01:43.1629945Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:01:43.1630126Z %45 = arith.divsi %42, %41 : i32 2026-02-21T09:01:43.1630301Z %46 = arith.muli %44, %c64_i32 : i32 2026-02-21T09:01:43.1630562Z %47 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:43.1630811Z %48 = tt.splat %46 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.1631045Z %49 = arith.addi %48, %47 : tensor<64xi32> 2026-02-21T09:01:43.1631236Z %50 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:01:43.1631430Z %51 = tt.splat %50 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.1631662Z %52 = arith.addi %51, %47 : tensor<64xi32> 2026-02-21T09:01:43.1631853Z %c64_i32_9 = arith.constant 64 : i32 2026-02-21T09:01:43.1632180Z %53 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_9 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:43.1632507Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:43.1632750Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1633002Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1633218Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:43.1633488Z %67 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1633762Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1634030Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1634324Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1634605Z %71 = 
tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1634847Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:43.1635097Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1635389Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1635646Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1635889Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1636214Z %77 = tt.descriptor_load %0[%arg4, %46] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1636525Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1636742Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1636967Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1637215Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1637506Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1637819Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1638164Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1638491Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1638781Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1639038Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1639337Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1639621Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1639895Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1640144Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1640415Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 
2026-02-21T09:01:43.1640698Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1640972Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1641236Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1641641Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1641967Z %c1_i32_10 = arith.constant 1 : i32 2026-02-21T09:01:43.1642172Z %97 = arith.muli %c16_i32, %c1_i32_10 : i32 2026-02-21T09:01:43.1642394Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:43.1642588Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:43.1642823Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1643081Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1643289Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:43.1643555Z %103 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1643837Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1644102Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1644408Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1644681Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1644929Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:43.1645176Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1645477Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1645755Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1645998Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1646332Z %113 = tt.descriptor_load %0[%98, %46] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1646637Z %114 
= arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1646871Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1647096Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1647353Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1647660Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1647978Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1648313Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1648642Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1648935Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1649237Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1649514Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1649814Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1650095Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1650393Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1650667Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1650962Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1651262Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1651531Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1651939Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1652258Z %c2_i32_11 = arith.constant 2 : i32 2026-02-21T09:01:43.1652469Z %133 = arith.muli %c16_i32, %c2_i32_11 : i32 
2026-02-21T09:01:43.1652672Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:43.1652885Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:43.1653129Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1653386Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1653632Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:43.1653889Z %139 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1654167Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1654466Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1654777Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1655067Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1655319Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:43.1655586Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1655895Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1656182Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1656441Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1656779Z %149 = tt.descriptor_load %0[%134, %46] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1657104Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1657337Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1657577Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1657838Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1658156Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1658494Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1658837Z %156 = 
tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1659193Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1659492Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1659768Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1660065Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1660372Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1660704Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1660976Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1661283Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1661617Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1661956Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1662241Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1662620Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1662963Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:43.1663167Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:43.1663370Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:43.1663563Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:43.1663800Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1664064Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1664272Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:43.1664575Z %175 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1664853Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 
2026-02-21T09:01:43.1665150Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1665456Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1665728Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1665983Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:43.1666239Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1666543Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1666814Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1667068Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1667405Z %185 = tt.descriptor_load %0[%170, %46] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1667713Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1667951Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1668180Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1668441Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1668742Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1669076Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1669418Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1669751Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1670047Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1670305Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1670591Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1670896Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, 
tensor<16x2x64xi8> 2026-02-21T09:01:43.1671176Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1671442Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1671752Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1672046Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1672359Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1672627Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1672991Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1673354Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:43.1673543Z } 2026-02-21T09:01:43.1673725Z %54 = arith.truncf %53 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:43.1674025Z %55 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1674301Z %56 = arith.muli %55, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:43.1674557Z %57 = tt.expand_dims %49 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:43.1674854Z %58 = tt.broadcast %56 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.1675119Z %59 = tt.broadcast %57 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.1675360Z %60 = arith.addi %58, %59 : tensor<64x64xi32> 2026-02-21T09:01:43.1675599Z %61 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.1675920Z %62 = tt.addptr %61, %60 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:43.1676190Z tt.store %62, %54 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.1676393Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:43.1676636Z scf.for %arg3 = %9 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T09:01:43.1676858Z %11 = arith.divsi %arg3, %c4_i32 : i32 2026-02-21T09:01:43.1677051Z %12 = arith.muli %11, %c4_i32 : 
i32 2026-02-21T09:01:43.1677233Z %13 = arith.subi %c112_i32, %12 : i32 2026-02-21T09:01:43.1677425Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:43.1677612Z %15 = arith.remsi %arg3, %c4_i32 : i32 2026-02-21T09:01:43.1677806Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:43.1677988Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:43.1678161Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:43.1678342Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:43.1678572Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:43.1678829Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.1679027Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:43.1679224Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:43.1679410Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.1679611Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:43.1679809Z %c64_i32_7 = arith.constant 64 : i32 2026-02-21T09:01:43.1680123Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_7 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:43.1680452Z %36 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:43.1680684Z %37 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1680935Z %38 = tt.splat %36 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1681140Z %39 = arith.addi %38, %37 : tensor<32xi32> 2026-02-21T09:01:43.1681389Z %40 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1681691Z %41 = arith.muli %40, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1681947Z %42 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1682240Z %43 = tt.broadcast %41 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1682500Z %44 = tt.broadcast %42 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1682744Z %45 = arith.addi %43, %44 : tensor<64x32xi32> 
2026-02-21T09:01:43.1682993Z %46 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1683307Z %47 = tt.addptr %46, %45 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1683573Z %48 = tt.load %47 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1683809Z %49 = arith.extf %48 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1684140Z %50 = tt.descriptor_load %0[%arg4, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1684474Z %51 = arith.shli %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1684702Z %52 = arith.shrsi %51, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1684927Z %53 = arith.shrsi %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1685172Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1685474Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1685784Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1686112Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1686440Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1686727Z %59 = arith.cmpi eq, %56, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1687018Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1687290Z %61 = tt.broadcast %57 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1687608Z %62 = arith.select %60, %61, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1687879Z %63 = arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1688133Z %64 = tt.broadcast %58 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1688406Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1688679Z %66 = arith.select %65, %64, %62 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1688961Z 
%67 = tt.reshape %66 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1689218Z %68 = arith.sitofp %67 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1689573Z %69 = tt.dot %49, %68, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1689893Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:01:43.1690095Z %70 = arith.muli %c16_i32, %c1_i32_8 : i32 2026-02-21T09:01:43.1690296Z %71 = arith.addi %arg4, %70 : i32 2026-02-21T09:01:43.1690478Z %72 = arith.muli %71, %c2_i32 : i32 2026-02-21T09:01:43.1690714Z %73 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1690963Z %74 = tt.splat %72 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1691170Z %75 = arith.addi %74, %73 : tensor<32xi32> 2026-02-21T09:01:43.1691421Z %76 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1691735Z %77 = arith.muli %76, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1691995Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1692280Z %79 = tt.broadcast %77 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1692546Z %80 = tt.broadcast %78 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1692781Z %81 = arith.addi %79, %80 : tensor<64x32xi32> 2026-02-21T09:01:43.1693034Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1693327Z %83 = tt.addptr %82, %81 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1693585Z %84 = tt.load %83 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1693825Z %85 = arith.extf %84 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1694134Z %86 = tt.descriptor_load %0[%71, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1694461Z %87 = arith.shli %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1694676Z %88 = arith.shrsi %87, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1694904Z %89 = 
arith.shrsi %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1695151Z %90 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1695441Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1695783Z %92 = tt.expand_dims %91 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1696101Z %93 = tt.expand_dims %88 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1696422Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1696702Z %95 = arith.cmpi eq, %92, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1696961Z %96 = tt.broadcast %95 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1697235Z %97 = tt.broadcast %93 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1697520Z %98 = arith.select %96, %97, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1697796Z %99 = arith.cmpi eq, %92, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1698077Z %100 = tt.broadcast %94 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1698357Z %101 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1698699Z %102 = arith.select %101, %100, %98 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1699001Z %103 = tt.reshape %102 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1699287Z %104 = arith.sitofp %103 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1699656Z %105 = tt.dot %85, %104, %69, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1700002Z %c2_i32_9 = arith.constant 2 : i32 2026-02-21T09:01:43.1700210Z %106 = arith.muli %c16_i32, %c2_i32_9 : i32 2026-02-21T09:01:43.1700427Z %107 = arith.addi %arg4, %106 : i32 2026-02-21T09:01:43.1700635Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T09:01:43.1700885Z %109 = tt.make_range {end = 32 : i32, 
start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1701161Z %110 = tt.splat %108 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1701381Z %111 = arith.addi %110, %109 : tensor<32xi32> 2026-02-21T09:01:43.1701700Z %112 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1701989Z %113 = arith.muli %112, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1702275Z %114 = tt.expand_dims %111 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1702592Z %115 = tt.broadcast %113 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1702874Z %116 = tt.broadcast %114 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1703138Z %117 = arith.addi %115, %116 : tensor<64x32xi32> 2026-02-21T09:01:43.1703398Z %118 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1703716Z %119 = tt.addptr %118, %117 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1703995Z %120 = tt.load %119 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1704260Z %121 = arith.extf %120 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1704608Z %122 = tt.descriptor_load %0[%107, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1704931Z %123 = arith.shli %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1705173Z %124 = arith.shrsi %123, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1705409Z %125 = arith.shrsi %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1705681Z %126 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1706027Z %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1706362Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1706716Z %129 = tt.expand_dims %124 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1707063Z %130 = tt.expand_dims %125 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 
2026-02-21T09:01:43.1707406Z %131 = arith.cmpi eq, %128, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1707659Z %132 = tt.broadcast %131 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1707941Z %133 = tt.broadcast %129 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1708243Z %134 = arith.select %132, %133, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1708519Z %135 = arith.cmpi eq, %128, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1708780Z %136 = tt.broadcast %130 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1709051Z %137 = tt.broadcast %135 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1709342Z %138 = arith.select %137, %136, %134 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1709659Z %139 = tt.reshape %138 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1709927Z %140 = arith.sitofp %139 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1710316Z %141 = tt.dot %121, %140, %105, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1710635Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:43.1710838Z %142 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:43.1711034Z %143 = arith.addi %arg4, %142 : i32 2026-02-21T09:01:43.1711226Z %144 = arith.muli %143, %c2_i32 : i32 2026-02-21T09:01:43.1711465Z %145 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.1711757Z %146 = tt.splat %144 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.1711972Z %147 = arith.addi %146, %145 : tensor<32xi32> 2026-02-21T09:01:43.1712228Z %148 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1712512Z %149 = arith.muli %148, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.1712780Z %150 = tt.expand_dims %147 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.1713080Z %151 = tt.broadcast %149 : tensor<64x1xi32> -> 
tensor<64x32xi32> 2026-02-21T09:01:43.1713351Z %152 = tt.broadcast %150 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.1713592Z %153 = arith.addi %151, %152 : tensor<64x32xi32> 2026-02-21T09:01:43.1713848Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1714142Z %155 = tt.addptr %154, %153 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.1714422Z %156 = tt.load %155 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.1714668Z %157 = arith.extf %156 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.1715003Z %158 = tt.descriptor_load %0[%143, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.1715311Z %159 = arith.shli %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1715532Z %160 = arith.shrsi %159, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1715760Z %161 = arith.shrsi %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.1716005Z %162 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.1716305Z %163 = tt.expand_dims %162 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.1716626Z %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.1716950Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1717311Z %166 = tt.expand_dims %161 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.1717598Z %167 = arith.cmpi eq, %164, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1717859Z %168 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1718135Z %169 = tt.broadcast %165 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.1718458Z %170 = arith.select %168, %169, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1718745Z %171 = arith.cmpi eq, %164, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.1718995Z %172 = tt.broadcast %166 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:43.1719278Z %173 = tt.broadcast %171 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.1719557Z %174 = arith.select %173, %172, %170 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.1719846Z %175 = tt.reshape %174 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.1720116Z %176 = arith.sitofp %175 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.1720471Z %177 = tt.dot %157, %176, %141, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.1720820Z scf.yield %177 : tensor<64x64xf32> 2026-02-21T09:01:43.1720996Z } 2026-02-21T09:01:43.1721176Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:43.1721482Z %28 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.1721781Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:43.1722037Z %30 = tt.expand_dims %22 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:43.1722317Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.1722578Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.1722808Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:43.1723053Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.1723327Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:43.1723587Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.1723794Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:43.1723957Z tt.return 2026-02-21T09:01:43.1724093Z } 2026-02-21T09:01:43.1724212Z } 2026-02-21T09:01:43.1724289Z 2026-02-21T09:01:43.1724342Z {-# 2026-02-21T09:01:43.1724470Z external_resources: { 2026-02-21T09:01:43.1724637Z mlir_reproducer: { 2026-02-21T09:01:43.1729020Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 
threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:43.1733564Z disable_threading: false, 2026-02-21T09:01:43.1733744Z verify_each: true 2026-02-21T09:01:43.1733890Z } 2026-02-21T09:01:43.1734021Z } 2026-02-21T09:01:43.1734138Z #-} 2026-02-21T09:01:43.1734568Z /tmp/torchinductor_root/yn/cynttwfh6do4sc3q7s5ls6uzybz3hivjpx5huxdz6fohr3mfoxxe.py:19:0: error: Failures have been detected 
while processing an MLIR pass pipeline 2026-02-21T09:01:43.1735590Z /tmp/torchinductor_root/yn/cynttwfh6do4sc3q7s5ls6uzybz3hivjpx5huxdz6fohr3mfoxxe.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:01:43.1736404Z [68s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:01:43.1737632Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:01:43.1738710Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:01:43.1738965Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:01:43.4786803Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:43.4791208Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:43.4795760Z ^ 2026-02-21T09:01:43.4799870Z /tmp/torchinductor_root/og/cogddhc52vlyf72ft22oi55bd6gkpfrvs2dp23rk5ud4odrg25v2.py:90:40: note: called from 2026-02-21T09:01:43.4803694Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:43.4805349Z ^ 2026-02-21T09:01:43.4805825Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:43.4806302Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:43.4806558Z ^ 2026-02-21T09:01:43.4806827Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:01:43.4811686Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:43.4812350Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:01:43.4817210Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:01:43.4817491Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:01:43.4821859Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:01:43.4825870Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:01:43.4829747Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:43.4833734Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:01:43.4839158Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:43.4839511Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:43.4839776Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:43.4845685Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:01:43.4847768Z %cst_4 = arith.constant 
dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:43.4848021Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:01:43.4848263Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:43.4848503Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:43.4848704Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:01:43.4849088Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:43.4849285Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:43.4849471Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:43.4849657Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:43.4849981Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:43.4850298Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:43.4850485Z %2 = arith.subi %c112_i32, %1 : i32 2026-02-21T09:01:43.4850666Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:01:43.4850866Z %3 = arith.subi %c9472_i32, %c1_i32_6 : i32 2026-02-21T09:01:43.4851054Z %4 = arith.addi %2, %3 : i32 2026-02-21T09:01:43.4851233Z %5 = arith.divui %4, %c9472_i32 : i32 2026-02-21T09:01:43.4851423Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:01:43.4851676Z %6 = arith.remsi %5, %c2_i32_7 : i32 2026-02-21T09:01:43.4851920Z %7 = arith.subi %5, %6 : i32 2026-02-21T09:01:43.4852099Z %8 = arith.muli %7, %c9472_i32 : i32 2026-02-21T09:01:43.4852296Z %9 = arith.addi %1, %8 : i32 2026-02-21T09:01:43.4852524Z %10 = arith.muli %c9472_i32, %c2_i32_7 : i32 2026-02-21T09:01:43.4852747Z scf.for %arg3 = %1 to %9 step %10 : i32 { 2026-02-21T09:01:43.4852945Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:43.4853139Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:43.4853324Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:43.4853503Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:43.4853700Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:43.4853883Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:43.4854067Z %17 = arith.addi %12, %16 : 
i32 2026-02-21T09:01:43.4854241Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:43.4854423Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:43.4854654Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:43.4854922Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.4855127Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:43.4855317Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:43.4855510Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.4855701Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:43.4855894Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:43.4856209Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:43.4856547Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:43.4856786Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4857035Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4857243Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:43.4857497Z %67 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4857877Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4858133Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4858429Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4858691Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4858925Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:43.4859175Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4859528Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4859794Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4860039Z %76 = arith.extf %75 : tensor<64x32xbf16> to 
tensor<64x32xf32> 2026-02-21T09:01:43.4860360Z %77 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4860716Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4860935Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4861159Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4861399Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4861735Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4862057Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4862381Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4862705Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4863027Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4863298Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4863615Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4863904Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4864181Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4864430Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4864703Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4864973Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4865255Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4865516Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4865869Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4866201Z 
%c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:43.4866401Z %97 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:43.4866607Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:43.4866788Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:43.4867032Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4867292Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4867502Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:43.4867770Z %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4868051Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4868327Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4868626Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4868902Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4869156Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:43.4869405Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4869706Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4869976Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4870224Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4870551Z %113 = tt.descriptor_load %0[%98, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4870875Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4871101Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4871318Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4871619Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4871946Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4872274Z %119 = 
tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4872608Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4872934Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4873228Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4873506Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4873788Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4874098Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4874410Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4874679Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4874982Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4875261Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4875555Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4875821Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4876185Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4876512Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:43.4876711Z %133 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:43.4876915Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:43.4877102Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:43.4877350Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4877613Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4877841Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:43.4878113Z %139 = tt.expand_dims %22 {axis = 1 : i32} : 
tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4878400Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4878680Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4878987Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4879275Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4879531Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:43.4879794Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4880110Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4880391Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4880651Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4880989Z %149 = tt.descriptor_load %0[%134, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4881311Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4881575Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4881822Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4882110Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4882415Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4882754Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4883123Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4883476Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4883780Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4884049Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4884348Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> 
tensor<16x2x64xi8> 2026-02-21T09:01:43.4884656Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4884954Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4885218Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4885510Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4885841Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4886140Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4886463Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4886820Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4887151Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:43.4887356Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:43.4887555Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:43.4887752Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:43.4887984Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4888243Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4888449Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:43.4888712Z %175 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4888998Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4889269Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4889572Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4889836Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4890084Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:43.4890331Z %181 = tt.splat %arg0 : !tt.ptr -> 
tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4890632Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4890902Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4891140Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4891468Z %185 = tt.descriptor_load %0[%170, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4891807Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4892036Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4892257Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4892515Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4892816Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4893134Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4893499Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4893819Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4894110Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4894395Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4894675Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4894978Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4895255Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4895520Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4895797Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4896089Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4896385Z %202 = tt.reshape %201 
: tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4896651Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4897046Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4897364Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:43.4897547Z } 2026-02-21T09:01:43.4897765Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:43.4898049Z %28 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4898317Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:43.4898566Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:43.4898854Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.4899110Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.4899349Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:43.4899595Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.4899875Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:43.4900137Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.4900341Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:01:43.4900544Z %36 = arith.muli %c9472_i32, %c1_i32_9 : i32 2026-02-21T09:01:43.4900739Z %37 = arith.addi %arg3, %36 : i32 2026-02-21T09:01:43.4900938Z %38 = arith.divsi %37, %c448_i32 : i32 2026-02-21T09:01:43.4901123Z %39 = arith.muli %38, %c4_i32 : i32 2026-02-21T09:01:43.4901309Z %40 = arith.subi %c1_i32, %39 : i32 2026-02-21T09:01:43.4901495Z %41 = arith.minsi %40, %c4_i32 : i32 2026-02-21T09:01:43.4901708Z %42 = arith.remsi %37, %c448_i32 : i32 2026-02-21T09:01:43.4901897Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:01:43.4902072Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:01:43.4902248Z %45 = arith.divsi %42, %41 : i32 
2026-02-21T09:01:43.4902421Z %46 = arith.muli %44, %c64_i32 : i32 2026-02-21T09:01:43.4902657Z %47 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:43.4902912Z %48 = tt.splat %46 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.4903113Z %49 = arith.addi %48, %47 : tensor<64xi32> 2026-02-21T09:01:43.4903306Z %50 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:01:43.4903489Z %51 = tt.splat %50 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.4903691Z %52 = arith.addi %51, %47 : tensor<64xi32> 2026-02-21T09:01:43.4903879Z %c64_i32_10 = arith.constant 64 : i32 2026-02-21T09:01:43.4904209Z %53 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_10 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:43.4904568Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:43.4904798Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4905051Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4905254Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:43.4905537Z %67 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4905808Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4906068Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4906362Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4906618Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4906859Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:43.4907099Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4907396Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4907651Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4907889Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4908248Z %77 = tt.descriptor_load 
%0[%arg4, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4908549Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4908801Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4909016Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4909263Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4909548Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4909860Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4910180Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4910498Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4910783Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4911030Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4911304Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4911622Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4911890Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4912143Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4912406Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4912686Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4912962Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4913231Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4913594Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4913925Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:43.4914141Z %97 = 
arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:43.4914344Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:43.4914546Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:43.4914779Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4915037Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4915252Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:43.4915537Z %103 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4915816Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4916080Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4916381Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4916670Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4916915Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:43.4917171Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4917463Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4917733Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4917970Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4918292Z %113 = tt.descriptor_load %0[%98, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4918595Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4918814Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4919068Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4919317Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4919638Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4919955Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 
2026-02-21T09:01:43.4920289Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4920626Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4920924Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4921196Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4921484Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4921810Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4922108Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4922373Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4922668Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4922966Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4923271Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4923548Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4923926Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4924271Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:43.4924480Z %133 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:43.4924697Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:43.4924896Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:43.4925150Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4925416Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4925645Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:43.4925919Z %139 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4926205Z %140 = 
arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4926488Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4926796Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4927114Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4927375Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:43.4927638Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4927976Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4928255Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4928512Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4928850Z %149 = tt.descriptor_load %0[%134, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4929158Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4929385Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4929606Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4929857Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4930145Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4930515Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4930845Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4931205Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4931499Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4931791Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4932066Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4932353Z %161 = arith.select %159, 
%160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4932636Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4932897Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4933168Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4933455Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4933738Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4934008Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4934361Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4934682Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:43.4934881Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:43.4935079Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:43.4935272Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:43.4935504Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4935758Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4935968Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:43.4936233Z %175 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4936511Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4936773Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4937075Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4937339Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4937589Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:43.4937831Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4938158Z %182 = tt.addptr %181, %180 : 
tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4938430Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4938669Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4938996Z %185 = tt.descriptor_load %0[%170, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4939321Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4939550Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4939779Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4940025Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4940330Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4940646Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4940981Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4941305Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4941660Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4941922Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4942219Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4942513Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4942784Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4943041Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4943313Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4943590Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4943879Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4944136Z %203 = 
arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4944498Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4944817Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:43.4944998Z } 2026-02-21T09:01:43.4945179Z %54 = arith.truncf %53 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:43.4945459Z %55 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4945727Z %56 = arith.muli %55, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:43.4945975Z %57 = tt.expand_dims %52 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:43.4946259Z %58 = tt.broadcast %56 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.4946509Z %59 = tt.broadcast %57 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.4946743Z %60 = arith.addi %58, %59 : tensor<64x64xi32> 2026-02-21T09:01:43.4946987Z %61 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.4947263Z %62 = tt.addptr %61, %60 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:43.4947523Z tt.store %62, %54 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.4947730Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:01:43.4947957Z scf.for %arg3 = %9 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T09:01:43.4948175Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:43.4948367Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:43.4948552Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:43.4948728Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:43.4948919Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:43.4949132Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:43.4949313Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:43.4949485Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:43.4949667Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:43.4949890Z %20 = tt.make_range {end = 64 : 
i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:43.4950166Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.4950373Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:43.4950561Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:43.4950753Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.4950948Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:43.4951145Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:43.4951462Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:43.4951822Z %36 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:43.4952062Z %37 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4952310Z %38 = tt.splat %36 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4952530Z %39 = arith.addi %38, %37 : tensor<32xi32> 2026-02-21T09:01:43.4952812Z %40 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4953099Z %41 = arith.muli %40, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4953388Z %42 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4953687Z %43 = tt.broadcast %41 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4953952Z %44 = tt.broadcast %42 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4954185Z %45 = arith.addi %43, %44 : tensor<64x32xi32> 2026-02-21T09:01:43.4954431Z %46 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4954719Z %47 = tt.addptr %46, %45 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4954988Z %48 = tt.load %47 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4955229Z %49 = arith.extf %48 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4955549Z %50 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4955858Z %51 = arith.shli %50, %cst_3 : tensor<16x64xi8> 
2026-02-21T09:01:43.4956073Z %52 = arith.shrsi %51, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4956297Z %53 = arith.shrsi %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4956538Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4956835Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4957148Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4957464Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4957787Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4958062Z %59 = arith.cmpi eq, %56, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4958319Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4958595Z %61 = tt.broadcast %57 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4958876Z %62 = arith.select %60, %61, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4959147Z %63 = arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4959396Z %64 = tt.broadcast %58 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4959665Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4959936Z %66 = arith.select %65, %64, %62 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4960244Z %67 = tt.reshape %66 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4960503Z %68 = arith.sitofp %67 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4960852Z %69 = tt.dot %49, %68, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4961203Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:01:43.4961399Z %70 = arith.muli %c16_i32, %c1_i32_9 : i32 2026-02-21T09:01:43.4961643Z %71 = arith.addi %arg4, %70 : i32 2026-02-21T09:01:43.4961828Z %72 = 
arith.muli %71, %c2_i32 : i32 2026-02-21T09:01:43.4962061Z %73 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4962313Z %74 = tt.splat %72 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4962513Z %75 = arith.addi %74, %73 : tensor<32xi32> 2026-02-21T09:01:43.4962769Z %76 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4963040Z %77 = arith.muli %76, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4963299Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4963585Z %79 = tt.broadcast %77 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4963873Z %80 = tt.broadcast %78 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4964118Z %81 = arith.addi %79, %80 : tensor<64x32xi32> 2026-02-21T09:01:43.4964394Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4964704Z %83 = tt.addptr %82, %81 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4964968Z %84 = tt.load %83 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4965221Z %85 = arith.extf %84 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4965562Z %86 = tt.descriptor_load %0[%71, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4965873Z %87 = arith.shli %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4966111Z %88 = arith.shrsi %87, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4966339Z %89 = arith.shrsi %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4966606Z %90 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4966911Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4967242Z %92 = tt.expand_dims %91 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4967583Z %93 = tt.expand_dims %88 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4967915Z %94 = tt.expand_dims %89 {axis = 1 
: i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4968211Z %95 = arith.cmpi eq, %92, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4968470Z %96 = tt.broadcast %95 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4968759Z %97 = tt.broadcast %93 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4969054Z %98 = arith.select %96, %97, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4969335Z %99 = arith.cmpi eq, %92, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4969607Z %100 = tt.broadcast %94 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4969893Z %101 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4970195Z %102 = arith.select %101, %100, %98 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4970494Z %103 = tt.reshape %102 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4970780Z %104 = arith.sitofp %103 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4971156Z %105 = tt.dot %85, %104, %69, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4971526Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:01:43.4971773Z %106 = arith.muli %c16_i32, %c2_i32_10 : i32 2026-02-21T09:01:43.4971977Z %107 = arith.addi %arg4, %106 : i32 2026-02-21T09:01:43.4972182Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T09:01:43.4972426Z %109 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4972743Z %110 = tt.splat %108 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4972970Z %111 = arith.addi %110, %109 : tensor<32xi32> 2026-02-21T09:01:43.4973239Z %112 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4973534Z %113 = arith.muli %112, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4973820Z %114 = tt.expand_dims %111 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4974123Z %115 = tt.broadcast 
%113 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4974383Z %116 = tt.broadcast %114 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4974630Z %117 = arith.addi %115, %116 : tensor<64x32xi32> 2026-02-21T09:01:43.4974881Z %118 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4975194Z %119 = tt.addptr %118, %117 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4975472Z %120 = tt.load %119 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4975715Z %121 = arith.extf %120 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.4976069Z %122 = tt.descriptor_load %0[%107, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4976370Z %123 = arith.shli %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4976597Z %124 = arith.shrsi %123, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4976824Z %125 = arith.shrsi %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4977071Z %126 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4977373Z %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4977688Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4978024Z %129 = tt.expand_dims %124 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4978358Z %130 = tt.expand_dims %125 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4978645Z %131 = arith.cmpi eq, %128, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4978909Z %132 = tt.broadcast %131 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4979181Z %133 = tt.broadcast %129 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4979480Z %134 = arith.select %132, %133, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4979754Z %135 = arith.cmpi eq, %128, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4980018Z %136 = tt.broadcast %130 : 
tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4980301Z %137 = tt.broadcast %135 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4980590Z %138 = arith.select %137, %136, %134 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4980883Z %139 = tt.reshape %138 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4981144Z %140 = arith.sitofp %139 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4981505Z %141 = tt.dot %121, %140, %105, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.4981872Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:43.4982065Z %142 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:43.4982268Z %143 = arith.addi %arg4, %142 : i32 2026-02-21T09:01:43.4982454Z %144 = arith.muli %143, %c2_i32 : i32 2026-02-21T09:01:43.4982693Z %145 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.4982986Z %146 = tt.splat %144 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.4983203Z %147 = arith.addi %146, %145 : tensor<32xi32> 2026-02-21T09:01:43.4983465Z %148 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4983736Z %149 = arith.muli %148, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.4984036Z %150 = tt.expand_dims %147 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.4984338Z %151 = tt.broadcast %149 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4984612Z %152 = tt.broadcast %150 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.4984851Z %153 = arith.addi %151, %152 : tensor<64x32xi32> 2026-02-21T09:01:43.4985107Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4985412Z %155 = tt.addptr %154, %153 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.4985677Z %156 = tt.load %155 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.4985923Z %157 = arith.extf %156 : tensor<64x32xbf16> to 
tensor<64x32xf32> 2026-02-21T09:01:43.4986247Z %158 = tt.descriptor_load %0[%143, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.4986583Z %159 = arith.shli %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4986803Z %160 = arith.shrsi %159, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4987053Z %161 = arith.shrsi %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.4987313Z %162 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.4987607Z %163 = tt.expand_dims %162 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.4987929Z %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.4988255Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4988589Z %166 = tt.expand_dims %161 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.4988882Z %167 = arith.cmpi eq, %164, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4989136Z %168 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4989418Z %169 = tt.broadcast %165 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4989710Z %170 = arith.select %168, %169, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4989991Z %171 = arith.cmpi eq, %164, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.4990242Z %172 = tt.broadcast %166 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.4990515Z %173 = tt.broadcast %171 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.4990800Z %174 = arith.select %173, %172, %170 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.4991083Z %175 = tt.reshape %174 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.4991351Z %176 = arith.sitofp %175 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.4991736Z %177 = tt.dot %157, %176, %141, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> 
tensor<64x64xf32> 2026-02-21T09:01:43.4992064Z scf.yield %177 : tensor<64x64xf32> 2026-02-21T09:01:43.4992247Z } 2026-02-21T09:01:43.4992424Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:43.4992718Z %28 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.4992984Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:43.4993246Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:43.4993532Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.4993802Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.4994081Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:43.4994323Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.4994610Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:43.4994866Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.4995113Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:01:43.4995292Z tt.return 2026-02-21T09:01:43.4995429Z } 2026-02-21T09:01:43.4995550Z } 2026-02-21T09:01:43.4995628Z 2026-02-21T09:01:43.4995680Z {-# 2026-02-21T09:01:43.4995820Z external_resources: { 2026-02-21T09:01:43.4995975Z mlir_reproducer: { 2026-02-21T09:01:43.5000486Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:43.5004955Z disable_threading: false, 2026-02-21T09:01:43.5005124Z verify_each: true 2026-02-21T09:01:43.5005278Z } 2026-02-21T09:01:43.5005396Z } 2026-02-21T09:01:43.5005520Z #-} 2026-02-21T09:01:43.5005944Z /tmp/torchinductor_root/og/cogddhc52vlyf72ft22oi55bd6gkpfrvs2dp23rk5ud4odrg25v2.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:01:43.5006965Z /tmp/torchinductor_root/og/cogddhc52vlyf72ft22oi55bd6gkpfrvs2dp23rk5ud4odrg25v2.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:01:43.5007810Z [68s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:01:43.5009025Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:01:43.5010125Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:01:43.5010394Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:01:43.8552421Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:43.8556926Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:43.8558271Z ^ 2026-02-21T09:01:43.8558695Z /tmp/torchinductor_root/5k/c5kz3cda3c5hfh2ilvy5vox5nwdc5rus3gvelzqvmjirftpd6ebf.py:90:40: note: called from 2026-02-21T09:01:43.8559285Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:43.8559509Z ^ 2026-02-21T09:01:43.8559916Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:43.8560387Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:43.8560638Z ^ 2026-02-21T09:01:43.8560915Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:01:43.8564949Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:43.8569227Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:01:43.8573930Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:01:43.8575245Z 
%c112_i32 = arith.constant 112 : i32 2026-02-21T09:01:43.8575485Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:01:43.8575862Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:01:43.8576067Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:43.8576266Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:01:43.8576492Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:43.8576745Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:43.8576970Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:43.8577207Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:01:43.8577434Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:43.8577648Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:01:43.8577867Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:43.8578104Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:43.8578289Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:01:43.8578464Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:43.8578656Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:43.8578834Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:43.8579018Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:43.8579327Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:43.8579660Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:43.8579852Z %2 = arith.subi %c112_i32, %1 : i32 2026-02-21T09:01:43.8580032Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:01:43.8580227Z %3 = arith.subi %c9472_i32, %c1_i32_6 : i32 2026-02-21T09:01:43.8580412Z %4 = arith.addi %2, %3 : i32 2026-02-21T09:01:43.8580591Z %5 = arith.divui %4, %c9472_i32 : i32 2026-02-21T09:01:43.8580773Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:01:43.8580954Z %6 = arith.remsi %5, %c2_i32_7 : i32 2026-02-21T09:01:43.8581129Z %7 = arith.subi %5, %6 : i32 2026-02-21T09:01:43.8581299Z %8 = arith.muli %7, 
%c9472_i32 : i32 2026-02-21T09:01:43.8581475Z %9 = arith.addi %1, %8 : i32 2026-02-21T09:01:43.8581736Z %10 = arith.muli %c9472_i32, %c2_i32_7 : i32 2026-02-21T09:01:43.8581951Z scf.for %arg3 = %1 to %9 step %10 : i32 { 2026-02-21T09:01:43.8582151Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:43.8582348Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:43.8582528Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:43.8582713Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:43.8582961Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:43.8583154Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:43.8583338Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:43.8583511Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:43.8583699Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:43.8583999Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:43.8584264Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.8584463Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:43.8584661Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:43.8584846Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.8585046Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:43.8585245Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:43.8585615Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:43.8585947Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:43.8586190Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8586441Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8586690Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:43.8586958Z %67 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8587257Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8587534Z 
%69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8587827Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8588106Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8588355Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:43.8588600Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8588894Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8589152Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8589394Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8589710Z %77 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8590014Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8590237Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8590449Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8590693Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8590980Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8591294Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8591638Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8591984Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8592285Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8592549Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8592844Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8593141Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8593436Z %90 = arith.cmpi eq, %83, %cst_1 : 
tensor<1x2x1xi32> 2026-02-21T09:01:43.8593704Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8593988Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8594329Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8594611Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8594904Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8595269Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8595640Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:43.8595855Z %97 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:43.8596065Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:43.8596262Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:43.8596506Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8596776Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8596997Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:43.8597278Z %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8597577Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8597854Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8598222Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8598506Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8598795Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:43.8599056Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8599374Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8599665Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 
2026-02-21T09:01:43.8599929Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8600262Z %113 = tt.descriptor_load %0[%98, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8600566Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8600796Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8601021Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8601278Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8601615Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8601934Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8602267Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8602589Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8602882Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8603141Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8603413Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8603713Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8603990Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8604251Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8604521Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8604811Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8605103Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8605368Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8605754Z %132 = tt.dot %112, %131, %96, 
inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8606069Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:43.8606275Z %133 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:43.8606478Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:43.8606696Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:43.8606935Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8607186Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8607400Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:43.8607656Z %139 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8607937Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8608205Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8608497Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8608771Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8609013Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:43.8609287Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8609586Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8609890Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8610141Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8610473Z %149 = tt.descriptor_load %0[%134, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8610778Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8611001Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8611234Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8611483Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8611824Z %154 
= tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8612147Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8612474Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8612807Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8613091Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8613354Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8613634Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8613922Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8614205Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8614456Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8614737Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8615018Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8615309Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8615581Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8615934Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8616263Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:43.8616455Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:43.8616658Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:43.8616879Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:43.8617114Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8617372Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8617586Z %174 = arith.addi 
%173, %172 : tensor<32xi32> 2026-02-21T09:01:43.8617876Z %175 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8618153Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8618424Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8618726Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8618990Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8619241Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:43.8619489Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8619792Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8620062Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8620335Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8620665Z %185 = tt.descriptor_load %0[%170, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8620991Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8621218Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8621439Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8621724Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8622020Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8622346Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8622684Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8623008Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8623297Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8623553Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> 
tensor<16x2x64xi1> 2026-02-21T09:01:43.8623836Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8624132Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8624413Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8624677Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8624949Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8625253Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8625555Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8625818Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8626180Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8626499Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:43.8626685Z } 2026-02-21T09:01:43.8626864Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:43.8627159Z %28 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8627427Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:43.8627679Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:43.8628029Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.8628288Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.8628526Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:43.8628769Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.8629080Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:43.8629346Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.8629554Z %c1_i32_9 = arith.constant 1 : i32 
2026-02-21T09:01:43.8629757Z %36 = arith.muli %c9472_i32, %c1_i32_9 : i32 2026-02-21T09:01:43.8629955Z %37 = arith.addi %arg3, %36 : i32 2026-02-21T09:01:43.8630151Z %38 = arith.divsi %37, %c448_i32 : i32 2026-02-21T09:01:43.8630337Z %39 = arith.muli %38, %c4_i32 : i32 2026-02-21T09:01:43.8630525Z %40 = arith.subi %c1_i32, %39 : i32 2026-02-21T09:01:43.8630703Z %41 = arith.minsi %40, %c4_i32 : i32 2026-02-21T09:01:43.8630894Z %42 = arith.remsi %37, %c448_i32 : i32 2026-02-21T09:01:43.8631082Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:01:43.8631256Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:01:43.8631437Z %45 = arith.divsi %42, %41 : i32 2026-02-21T09:01:43.8631669Z %46 = arith.muli %44, %c64_i32 : i32 2026-02-21T09:01:43.8631902Z %47 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:43.8632170Z %48 = tt.splat %46 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.8632374Z %49 = arith.addi %48, %47 : tensor<64xi32> 2026-02-21T09:01:43.8632568Z %50 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:01:43.8632751Z %51 = tt.splat %50 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.8632950Z %52 = arith.addi %51, %47 : tensor<64xi32> 2026-02-21T09:01:43.8633142Z %c64_i32_10 = arith.constant 64 : i32 2026-02-21T09:01:43.8633468Z %53 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_10 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:43.8633795Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:43.8634031Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8634284Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8634483Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:43.8634755Z %67 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8635023Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8635281Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 
2026-02-21T09:01:43.8635565Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8635826Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8636064Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:43.8636303Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8636593Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8636850Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8637099Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8637411Z %77 = tt.descriptor_load %0[%arg4, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8637742Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8637974Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8638195Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8638450Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8638745Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8639105Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8639444Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8639776Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8640080Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8640369Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8640659Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8640957Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8641243Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8641511Z %91 = tt.broadcast %85 : 
tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8641823Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8642117Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8642402Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8642673Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8643064Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8643404Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:43.8643666Z %97 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:43.8643874Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:43.8644074Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:43.8644318Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8644586Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8644804Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:43.8645081Z %103 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8645382Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8645660Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8645975Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8646257Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8646523Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:43.8646798Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8647089Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8647363Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8647607Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 
2026-02-21T09:01:43.8647933Z %113 = tt.descriptor_load %0[%98, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8648232Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8648460Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8648692Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8648939Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8649245Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8649562Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8649951Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8650285Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8650611Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8650882Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8651171Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8651482Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8651809Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8652071Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8652347Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8652629Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8652917Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8653177Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8653541Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 
2026-02-21T09:01:43.8653869Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:43.8654068Z %133 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:43.8654297Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:43.8654487Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:43.8654731Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8655009Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8655227Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:43.8655491Z %139 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8655764Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8656032Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8656326Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8656598Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8656844Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:43.8657101Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8657402Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8657664Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8657914Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8658237Z %149 = tt.descriptor_load %0[%134, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8658543Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8658765Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8658992Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8659244Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8659537Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 
2026-02-21T09:01:43.8659859Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8660184Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8660515Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8660805Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8661057Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8661334Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8661644Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8661952Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8662203Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8662480Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8662793Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8663079Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8663345Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8663694Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8664021Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:43.8664227Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:43.8664425Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:43.8664627Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:43.8664859Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8665114Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8665347Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:43.8665613Z %175 = 
tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8665933Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8666196Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8666498Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8666761Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8667008Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:43.8667255Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8667559Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8667828Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8668070Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8668398Z %185 = tt.descriptor_load %0[%170, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8668697Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8668925Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8669146Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8669401Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8669704Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8670019Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8670352Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8670672Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8670958Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8671217Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8671486Z %196 = 
tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8671809Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8672082Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8672336Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8672607Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8672920Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8673208Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8673473Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8673832Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8674190Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:43.8674375Z } 2026-02-21T09:01:43.8674561Z %54 = arith.truncf %53 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:43.8674851Z %55 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8675127Z %56 = arith.muli %55, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:43.8675377Z %57 = tt.expand_dims %52 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:43.8675668Z %58 = tt.broadcast %56 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.8675924Z %59 = tt.broadcast %57 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.8676166Z %60 = arith.addi %58, %59 : tensor<64x64xi32> 2026-02-21T09:01:43.8676457Z %61 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.8676741Z %62 = tt.addptr %61, %60 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:43.8677037Z tt.store %62, %54 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.8677238Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:43.8677455Z scf.for %arg3 = %9 to %c112_i32 step 
%c9472_i32 : i32 { 2026-02-21T09:01:43.8677683Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:43.8677886Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:43.8678073Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:43.8678262Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:43.8678458Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:43.8678643Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:43.8678826Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:43.8678997Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:43.8679180Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:43.8679407Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:43.8679661Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.8679870Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:43.8680059Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:43.8680250Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:43.8680441Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:43.8680637Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:43.8680955Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:43.8681288Z %36 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:43.8681555Z %37 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8681814Z %38 = tt.splat %36 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8682037Z %39 = arith.addi %38, %37 : tensor<32xi32> 2026-02-21T09:01:43.8682300Z %40 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8682593Z %41 = arith.muli %40, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8682859Z %42 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8683166Z %43 = tt.broadcast %41 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8683441Z 
%44 = tt.broadcast %42 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8683687Z %45 = arith.addi %43, %44 : tensor<64x32xi32> 2026-02-21T09:01:43.8683943Z %46 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8684264Z %47 = tt.addptr %46, %45 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8684541Z %48 = tt.load %47 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8684785Z %49 = arith.extf %48 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8685127Z %50 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8685488Z %51 = arith.shli %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8685716Z %52 = arith.shrsi %51, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8685953Z %53 = arith.shrsi %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8686205Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8686517Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8686839Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8687183Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8687525Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8687840Z %59 = arith.cmpi eq, %56, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8688114Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8688420Z %61 = tt.broadcast %57 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8688720Z %62 = arith.select %60, %61, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8689005Z %63 = arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8689262Z %64 = tt.broadcast %58 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8689547Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> 
tensor<16x2x64xi1> 2026-02-21T09:01:43.8689838Z %66 = arith.select %65, %64, %62 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8690115Z %67 = tt.reshape %66 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8690371Z %68 = arith.sitofp %67 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8690727Z %69 = tt.dot %49, %68, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8691057Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:01:43.8691252Z %70 = arith.muli %c16_i32, %c1_i32_9 : i32 2026-02-21T09:01:43.8691455Z %71 = arith.addi %arg4, %70 : i32 2026-02-21T09:01:43.8691655Z %72 = arith.muli %71, %c2_i32 : i32 2026-02-21T09:01:43.8691893Z %73 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8692142Z %74 = tt.splat %72 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8692350Z %75 = arith.addi %74, %73 : tensor<32xi32> 2026-02-21T09:01:43.8692610Z %76 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8692874Z %77 = arith.muli %76, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8693135Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8693423Z %79 = tt.broadcast %77 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8693690Z %80 = tt.broadcast %78 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8693935Z %81 = arith.addi %79, %80 : tensor<64x32xi32> 2026-02-21T09:01:43.8694177Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8694465Z %83 = tt.addptr %82, %81 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8694722Z %84 = tt.load %83 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8694964Z %85 = arith.extf %84 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8695273Z %86 = tt.descriptor_load %0[%71, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8695603Z %87 = 
arith.shli %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8695826Z %88 = arith.shrsi %87, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8696040Z %89 = arith.shrsi %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8696290Z %90 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8696605Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8696917Z %92 = tt.expand_dims %91 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8697233Z %93 = tt.expand_dims %88 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8697552Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8697835Z %95 = arith.cmpi eq, %92, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8698081Z %96 = tt.broadcast %95 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8698353Z %97 = tt.broadcast %93 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8698635Z %98 = arith.select %96, %97, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8698929Z %99 = arith.cmpi eq, %92, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8699185Z %100 = tt.broadcast %94 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8699479Z %101 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8699765Z %102 = arith.select %101, %100, %98 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8700044Z %103 = tt.reshape %102 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8700311Z %104 = arith.sitofp %103 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8700660Z %105 = tt.dot %85, %104, %69, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8700981Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:01:43.8701184Z %106 = arith.muli %c16_i32, %c2_i32_10 : i32 2026-02-21T09:01:43.8701377Z %107 = 
arith.addi %arg4, %106 : i32 2026-02-21T09:01:43.8701599Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T09:01:43.8701836Z %109 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8702094Z %110 = tt.splat %108 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8702301Z %111 = arith.addi %110, %109 : tensor<32xi32> 2026-02-21T09:01:43.8702563Z %112 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8702847Z %113 = arith.muli %112, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8703109Z %114 = tt.expand_dims %111 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8703409Z %115 = tt.broadcast %113 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8703676Z %116 = tt.broadcast %114 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8703929Z %117 = arith.addi %115, %116 : tensor<64x32xi32> 2026-02-21T09:01:43.8704183Z %118 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8704481Z %119 = tt.addptr %118, %117 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8704758Z %120 = tt.load %119 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8704998Z %121 = arith.extf %120 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8705325Z %122 = tt.descriptor_load %0[%107, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8705624Z %123 = arith.shli %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8705856Z %124 = arith.shrsi %123, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8706083Z %125 = arith.shrsi %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8706358Z %126 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8706664Z %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8706981Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8707314Z %129 = tt.expand_dims %124 {axis = 1 : 
i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8707666Z %130 = tt.expand_dims %125 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8707957Z %131 = arith.cmpi eq, %128, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8708220Z %132 = tt.broadcast %131 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8708494Z %133 = tt.broadcast %129 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8708789Z %134 = arith.select %132, %133, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8709065Z %135 = arith.cmpi eq, %128, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8709327Z %136 = tt.broadcast %130 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8709603Z %137 = tt.broadcast %135 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8709940Z %138 = arith.select %137, %136, %134 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8710235Z %139 = tt.reshape %138 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8710523Z %140 = arith.sitofp %139 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8710884Z %141 = tt.dot %121, %140, %105, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8711203Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:43.8711401Z %142 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:43.8711633Z %143 = arith.addi %arg4, %142 : i32 2026-02-21T09:01:43.8711823Z %144 = arith.muli %143, %c2_i32 : i32 2026-02-21T09:01:43.8712064Z %145 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:43.8712316Z %146 = tt.splat %144 : i32 -> tensor<32xi32> 2026-02-21T09:01:43.8712530Z %147 = arith.addi %146, %145 : tensor<32xi32> 2026-02-21T09:01:43.8712783Z %148 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8713064Z %149 = arith.muli %148, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:43.8713336Z %150 = 
tt.expand_dims %147 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:43.8713634Z %151 = tt.broadcast %149 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8713903Z %152 = tt.broadcast %150 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:43.8714149Z %153 = arith.addi %151, %152 : tensor<64x32xi32> 2026-02-21T09:01:43.8714400Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8714696Z %155 = tt.addptr %154, %153 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:43.8714966Z %156 = tt.load %155 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:43.8715212Z %157 = arith.extf %156 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:43.8715533Z %158 = tt.descriptor_load %0[%143, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:43.8715841Z %159 = arith.shli %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8716062Z %160 = arith.shrsi %159, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8716289Z %161 = arith.shrsi %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:43.8716542Z %162 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:43.8716834Z %163 = tt.expand_dims %162 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:43.8717160Z %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:43.8717513Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8717848Z %166 = tt.expand_dims %161 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:43.8718133Z %167 = arith.cmpi eq, %164, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8718393Z %168 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8718700Z %169 = tt.broadcast %165 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8718989Z %170 = arith.select %168, %169, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8719272Z 
%171 = arith.cmpi eq, %164, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:43.8719523Z %172 = tt.broadcast %166 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:43.8719803Z %173 = tt.broadcast %171 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:43.8720093Z %174 = arith.select %173, %172, %170 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:43.8720380Z %175 = tt.reshape %174 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:43.8720649Z %176 = arith.sitofp %175 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:43.8721027Z %177 = tt.dot %157, %176, %141, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:43.8721363Z scf.yield %177 : tensor<64x64xf32> 2026-02-21T09:01:43.8721557Z } 2026-02-21T09:01:43.8721774Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:43.8722068Z %28 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:43.8722332Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:43.8722590Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:43.8722875Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.8723139Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:43.8723371Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:43.8723620Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.8723910Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:43.8724168Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:43.8724378Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:43.8724545Z tt.return 2026-02-21T09:01:43.8724684Z } 2026-02-21T09:01:43.8724801Z } 2026-02-21T09:01:43.8724876Z 2026-02-21T09:01:43.8724926Z {-# 2026-02-21T09:01:43.8725055Z external_resources: { 2026-02-21T09:01:43.8725217Z mlir_reproducer: { 
2026-02-21T09:01:43.8729731Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:43.8734420Z disable_threading: false, 2026-02-21T09:01:43.8734600Z verify_each: true 2026-02-21T09:01:43.8734748Z } 2026-02-21T09:01:43.8734881Z } 2026-02-21T09:01:43.8735001Z #-} 
2026-02-21T09:01:43.8735482Z /tmp/torchinductor_root/5k/c5kz3cda3c5hfh2ilvy5vox5nwdc5rus3gvelzqvmjirftpd6ebf.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:01:43.8736587Z /tmp/torchinductor_root/5k/c5kz3cda3c5hfh2ilvy5vox5nwdc5rus3gvelzqvmjirftpd6ebf.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:01:43.8737461Z [69s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:01:43.8738707Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[2, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T09:01:43.8739823Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:01:43.8740086Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:01:44.1704144Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:44.1707881Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:44.1710135Z ^ 2026-02-21T09:01:44.1710639Z /tmp/torchinductor_root/ol/coliylrqdrfjaama3rkkgrvpqh55m4dfdbtdisf7jv2srmwz62en.py:90:40: note: called from 2026-02-21T09:01:44.1716435Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:44.1718013Z ^ 2026-02-21T09:01:44.1718536Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:44.1722834Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:44.1728092Z ^ 2026-02-21T09:01:44.1732737Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:01:44.1733388Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:44.1737539Z %cst = arith.constant dense<0> : tensor<32x2x64xi8> 2026-02-21T09:01:44.1739675Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:01:44.1739987Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:01:44.1740201Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:01:44.1740394Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:01:44.1740572Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:44.1740760Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:01:44.1740981Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:44.1741233Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:44.1741460Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:44.1741975Z %cst_3 = arith.constant dense<4> : tensor<32x64xi8> 2026-02-21T09:01:44.1742215Z %cst_4 = arith.constant 
dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:44.1742424Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:01:44.1742656Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:44.1742949Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:44.1743139Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:01:44.1743319Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:44.1743521Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:44.1743704Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:44.1743890Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:44.1744214Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:44.1744534Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:44.1744718Z %2 = arith.subi %c112_i32, %1 : i32 2026-02-21T09:01:44.1744901Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:01:44.1745094Z %3 = arith.subi %c9472_i32, %c1_i32_6 : i32 2026-02-21T09:01:44.1745281Z %4 = arith.addi %2, %3 : i32 2026-02-21T09:01:44.1745461Z %5 = arith.divui %4, %c9472_i32 : i32 2026-02-21T09:01:44.1745683Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:01:44.1745872Z %6 = arith.remsi %5, %c2_i32_7 : i32 2026-02-21T09:01:44.1746058Z %7 = arith.subi %5, %6 : i32 2026-02-21T09:01:44.1746289Z %8 = arith.muli %7, %c9472_i32 : i32 2026-02-21T09:01:44.1746475Z %9 = arith.addi %1, %8 : i32 2026-02-21T09:01:44.1746660Z %10 = arith.muli %c9472_i32, %c2_i32_7 : i32 2026-02-21T09:01:44.1746881Z scf.for %arg3 = %1 to %9 step %10 : i32 { 2026-02-21T09:01:44.1747088Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.1747292Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:44.1747477Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:44.1747674Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:44.1747874Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.1748065Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:44.1748262Z %17 = arith.addi %12, %16 : 
i32 2026-02-21T09:01:44.1748441Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:44.1748626Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:44.1748864Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:44.1749134Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1749348Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:44.1749543Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:44.1749745Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1749939Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:44.1750140Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:01:44.1750466Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:44.1750808Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:44.1751019Z %64 = tt.splat %63 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1751226Z %65 = arith.addi %64, %20 : tensor<64xi32> 2026-02-21T09:01:44.1751495Z %66 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1751817Z %67 = arith.muli %66, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1752083Z %68 = tt.expand_dims %65 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1752371Z %69 = tt.broadcast %67 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1752642Z %70 = tt.broadcast %68 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1752885Z %71 = arith.addi %69, %70 : tensor<64x64xi32> 2026-02-21T09:01:44.1753134Z %72 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1753467Z %73 = tt.addptr %72, %71 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1753724Z %74 = tt.load %73 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1753967Z %75 = arith.extf %74 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1754293Z %76 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> 
tensor<32x64xi8> 2026-02-21T09:01:44.1754636Z %77 = arith.shli %76, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1754863Z %78 = arith.shrsi %77, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1755079Z %79 = arith.shrsi %76, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1755334Z %80 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1755626Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1755939Z %82 = tt.expand_dims %81 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1756264Z %83 = tt.expand_dims %78 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1756581Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1756872Z %85 = arith.cmpi eq, %82, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1757156Z %86 = tt.broadcast %85 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1757434Z %87 = tt.broadcast %83 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1757750Z %88 = arith.select %86, %87, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1758028Z %89 = arith.cmpi eq, %82, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1758285Z %90 = tt.broadcast %84 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1758549Z %91 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1758831Z %92 = arith.select %91, %90, %88 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1759106Z %93 = tt.reshape %92 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1759370Z %94 = arith.sitofp %93 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1759727Z %95 = tt.dot %75, %94, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1760061Z %c1_i32_10 = arith.constant 1 : i32 2026-02-21T09:01:44.1760273Z %96 = arith.muli %c32_i32, %c1_i32_10 : i32 
2026-02-21T09:01:44.1760473Z %97 = arith.addi %arg4, %96 : i32 2026-02-21T09:01:44.1760678Z %98 = arith.muli %97, %c2_i32 : i32 2026-02-21T09:01:44.1760870Z %99 = tt.splat %98 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1761084Z %100 = arith.addi %99, %20 : tensor<64xi32> 2026-02-21T09:01:44.1761341Z %101 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1761660Z %102 = arith.muli %101, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1761937Z %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1762232Z %104 = tt.broadcast %102 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1762507Z %105 = tt.broadcast %103 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1762755Z %106 = arith.addi %104, %105 : tensor<64x64xi32> 2026-02-21T09:01:44.1763013Z %107 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1763325Z %108 = tt.addptr %107, %106 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1763599Z %109 = tt.load %108 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1763853Z %110 = arith.extf %109 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1764175Z %111 = tt.descriptor_load %0[%97, %23] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1764486Z %112 = arith.shli %111, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1764733Z %113 = arith.shrsi %112, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1764960Z %114 = arith.shrsi %111, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1765211Z %115 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1765506Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1765875Z %117 = tt.expand_dims %116 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1766218Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1766565Z 
%119 = tt.expand_dims %114 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1766870Z %120 = arith.cmpi eq, %117, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1767134Z %121 = tt.broadcast %120 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1767424Z %122 = tt.broadcast %118 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1767729Z %123 = arith.select %121, %122, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1768021Z %124 = arith.cmpi eq, %117, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1768321Z %125 = tt.broadcast %119 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1768617Z %126 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1768955Z %127 = arith.select %126, %125, %123 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1769252Z %128 = tt.reshape %127 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1769535Z %129 = arith.sitofp %128 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1769916Z %130 = tt.dot %110, %129, %95, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1770255Z %c2_i32_11 = arith.constant 2 : i32 2026-02-21T09:01:44.1770463Z %131 = arith.muli %c32_i32, %c2_i32_11 : i32 2026-02-21T09:01:44.1770678Z %132 = arith.addi %arg4, %131 : i32 2026-02-21T09:01:44.1770879Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:01:44.1771083Z %134 = tt.splat %133 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1771305Z %135 = arith.addi %134, %20 : tensor<64xi32> 2026-02-21T09:01:44.1771603Z %136 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1771898Z %137 = arith.muli %136, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1772173Z %138 = tt.expand_dims %135 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1772487Z %139 = tt.broadcast %137 : tensor<64x1xi32> -> tensor<64x64xi32> 
2026-02-21T09:01:44.1772768Z %140 = tt.broadcast %138 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1773018Z %141 = arith.addi %139, %140 : tensor<64x64xi32> 2026-02-21T09:01:44.1773290Z %142 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1773599Z %143 = tt.addptr %142, %141 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1773891Z %144 = tt.load %143 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1774156Z %145 = arith.extf %144 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1774492Z %146 = tt.descriptor_load %0[%132, %23] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1774818Z %147 = arith.shli %146, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1775055Z %148 = arith.shrsi %147, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1775283Z %149 = arith.shrsi %146, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1775529Z %150 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1775830Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1776182Z %152 = tt.expand_dims %151 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1776506Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1776841Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1777153Z %155 = arith.cmpi eq, %152, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1777416Z %156 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1777698Z %157 = tt.broadcast %153 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1777990Z %158 = arith.select %156, %157, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1778278Z %159 = arith.cmpi eq, %152, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1778531Z %160 = tt.broadcast %154 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 
2026-02-21T09:01:44.1778809Z %161 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1779092Z %162 = arith.select %161, %160, %158 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1779384Z %163 = tt.reshape %162 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1779680Z %164 = arith.sitofp %163 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1779881Z %165 = tt.dot %145, %164, %130, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1779977Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:44.1780062Z %166 = arith.muli %c32_i32, %c3_i32 : i32 2026-02-21T09:01:44.1780130Z %167 = arith.addi %arg4, %166 : i32 2026-02-21T09:01:44.1780200Z %168 = arith.muli %167, %c2_i32 : i32 2026-02-21T09:01:44.1780277Z %169 = tt.splat %168 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1780360Z %170 = arith.addi %169, %20 : tensor<64xi32> 2026-02-21T09:01:44.1780486Z %171 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1780571Z %172 = arith.muli %171, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1780706Z %173 = tt.expand_dims %170 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1780810Z %174 = tt.broadcast %172 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1780912Z %175 = tt.broadcast %173 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1780999Z %176 = arith.addi %174, %175 : tensor<64x64xi32> 2026-02-21T09:01:44.1781110Z %177 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1781228Z %178 = tt.addptr %177, %176 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1781319Z %179 = tt.load %178 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1781425Z %180 = arith.extf %179 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1781608Z %181 = tt.descriptor_load %0[%167, %23] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1781692Z 
%182 = arith.shli %181, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1781783Z %183 = arith.shrsi %182, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1781864Z %184 = arith.shrsi %181, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1781974Z %185 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1782107Z %186 = tt.expand_dims %185 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1782236Z %187 = tt.expand_dims %186 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1782367Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1782502Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1782592Z %190 = arith.cmpi eq, %187, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1782725Z %191 = tt.broadcast %190 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1782839Z %192 = tt.broadcast %188 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1782961Z %193 = arith.select %191, %192, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1783053Z %194 = arith.cmpi eq, %187, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1783196Z %195 = tt.broadcast %189 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1783302Z %196 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1783422Z %197 = arith.select %196, %195, %193 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1783524Z %198 = tt.reshape %197 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1783636Z %199 = arith.sitofp %198 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1783830Z %200 = tt.dot %180, %199, %165, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1783901Z scf.yield %200 : tensor<64x64xf32> 2026-02-21T09:01:44.1783967Z } 2026-02-21T09:01:44.1784075Z %27 = arith.truncf 
%26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:44.1784240Z %28 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1784332Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:44.1784474Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1784577Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1784684Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1784762Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:44.1784870Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1784982Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1785076Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1785145Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:01:44.1785221Z %36 = arith.muli %c9472_i32, %c1_i32_8 : i32 2026-02-21T09:01:44.1785298Z %37 = arith.addi %arg3, %36 : i32 2026-02-21T09:01:44.1785372Z %38 = arith.divsi %37, %c448_i32 : i32 2026-02-21T09:01:44.1785441Z %39 = arith.muli %38, %c4_i32 : i32 2026-02-21T09:01:44.1785518Z %40 = arith.subi %c1_i32, %39 : i32 2026-02-21T09:01:44.1785585Z %41 = arith.minsi %40, %c4_i32 : i32 2026-02-21T09:01:44.1785654Z %42 = arith.remsi %37, %c448_i32 : i32 2026-02-21T09:01:44.1785721Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:01:44.1785798Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:01:44.1785862Z %45 = arith.divsi %42, %41 : i32 2026-02-21T09:01:44.1785929Z %46 = arith.muli %44, %c64_i32 : i32 2026-02-21T09:01:44.1786060Z %47 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:44.1786140Z %48 = tt.splat %46 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1786215Z %49 = arith.addi %48, %47 : tensor<64xi32> 2026-02-21T09:01:44.1786294Z %50 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:01:44.1786370Z %51 = tt.splat %50 : i32 
-> tensor<64xi32> 2026-02-21T09:01:44.1786466Z %52 = arith.addi %51, %47 : tensor<64xi32> 2026-02-21T09:01:44.1786537Z %c128_i32_9 = arith.constant 128 : i32 2026-02-21T09:01:44.1786746Z %53 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c128_i32_9 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:44.1786817Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:44.1786891Z %64 = tt.splat %63 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1786970Z %65 = arith.addi %64, %47 : tensor<64xi32> 2026-02-21T09:01:44.1787095Z %66 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1787201Z %67 = arith.muli %66, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1787330Z %68 = tt.expand_dims %65 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1787433Z %69 = tt.broadcast %67 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1787534Z %70 = tt.broadcast %68 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1787645Z %71 = arith.addi %69, %70 : tensor<64x64xi32> 2026-02-21T09:01:44.1787756Z %72 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1787870Z %73 = tt.addptr %72, %71 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1787953Z %74 = tt.load %73 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1788062Z %75 = arith.extf %74 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1788228Z %76 = tt.descriptor_load %0[%arg4, %50] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1788310Z %77 = arith.shli %76, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1788402Z %78 = arith.shrsi %77, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1788483Z %79 = arith.shrsi %76, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1788611Z %80 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1788740Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1788883Z %82 = 
tt.expand_dims %81 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1789010Z %83 = tt.expand_dims %78 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1789141Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1789232Z %85 = arith.cmpi eq, %82, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1789335Z %86 = tt.broadcast %85 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1789449Z %87 = tt.broadcast %83 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1789567Z %88 = arith.select %86, %87, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1789656Z %89 = arith.cmpi eq, %82, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1789761Z %90 = tt.broadcast %84 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1789870Z %91 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1789984Z %92 = arith.select %91, %90, %88 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1790082Z %93 = tt.reshape %92 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1790191Z %94 = arith.sitofp %93 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1790436Z %95 = tt.dot %75, %94, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1790507Z %c1_i32_10 = arith.constant 1 : i32 2026-02-21T09:01:44.1790591Z %96 = arith.muli %c32_i32, %c1_i32_10 : i32 2026-02-21T09:01:44.1790659Z %97 = arith.addi %arg4, %96 : i32 2026-02-21T09:01:44.1790725Z %98 = arith.muli %97, %c2_i32 : i32 2026-02-21T09:01:44.1790804Z %99 = tt.splat %98 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1790880Z %100 = arith.addi %99, %47 : tensor<64xi32> 2026-02-21T09:01:44.1791006Z %101 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1791091Z %102 = arith.muli %101, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1791225Z 
%103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1791328Z %104 = tt.broadcast %102 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1791430Z %105 = tt.broadcast %103 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1791521Z %106 = arith.addi %104, %105 : tensor<64x64xi32> 2026-02-21T09:01:44.1791696Z %107 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1791819Z %108 = tt.addptr %107, %106 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1791912Z %109 = tt.load %108 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1792020Z %110 = arith.extf %109 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1792201Z %111 = tt.descriptor_load %0[%97, %50] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1792292Z %112 = arith.shli %111, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1792377Z %113 = arith.shrsi %112, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1792461Z %114 = arith.shrsi %111, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1792576Z %115 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1792699Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1792827Z %117 = tt.expand_dims %116 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1792958Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1793123Z %119 = tt.expand_dims %114 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1793220Z %120 = arith.cmpi eq, %117, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1793352Z %121 = tt.broadcast %120 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1793471Z %122 = tt.broadcast %118 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1793594Z %123 = arith.select %121, %122, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 
2026-02-21T09:01:44.1793686Z %124 = arith.cmpi eq, %117, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1793800Z %125 = tt.broadcast %119 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1793905Z %126 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1794024Z %127 = arith.select %126, %125, %123 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1794135Z %128 = tt.reshape %127 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1794238Z %129 = arith.sitofp %128 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1794428Z %130 = tt.dot %110, %129, %95, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1794507Z %c2_i32_11 = arith.constant 2 : i32 2026-02-21T09:01:44.1794586Z %131 = arith.muli %c32_i32, %c2_i32_11 : i32 2026-02-21T09:01:44.1794652Z %132 = arith.addi %arg4, %131 : i32 2026-02-21T09:01:44.1794722Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:01:44.1794808Z %134 = tt.splat %133 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1794881Z %135 = arith.addi %134, %47 : tensor<64xi32> 2026-02-21T09:01:44.1795007Z %136 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1795098Z %137 = arith.muli %136, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1795222Z %138 = tt.expand_dims %135 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1795327Z %139 = tt.broadcast %137 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1795437Z %140 = tt.broadcast %138 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1795521Z %141 = arith.addi %139, %140 : tensor<64x64xi32> 2026-02-21T09:01:44.1795635Z %142 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1795766Z %143 = tt.addptr %142, %141 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1795852Z %144 = tt.load %143 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1795956Z %145 = 
arith.extf %144 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1796144Z %146 = tt.descriptor_load %0[%132, %50] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1796237Z %147 = arith.shli %146, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1796322Z %148 = arith.shrsi %147, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1796407Z %149 = arith.shrsi %146, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1796551Z %150 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1796673Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1796800Z %152 = tt.expand_dims %151 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1796937Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1797065Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1797155Z %155 = arith.cmpi eq, %152, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1797268Z %156 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1797375Z %157 = tt.broadcast %153 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1797498Z %158 = arith.select %156, %157, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1797616Z %159 = arith.cmpi eq, %152, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1797728Z %160 = tt.broadcast %154 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1797855Z %161 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1797975Z %162 = arith.select %161, %160, %158 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1798085Z %163 = tt.reshape %162 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1798189Z %164 = arith.sitofp %163 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1798376Z %165 = tt.dot %145, %164, %130, inputPrecision = tf32 : 
tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1798455Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:44.1798531Z %166 = arith.muli %c32_i32, %c3_i32 : i32 2026-02-21T09:01:44.1798599Z %167 = arith.addi %arg4, %166 : i32 2026-02-21T09:01:44.1798676Z %168 = arith.muli %167, %c2_i32 : i32 2026-02-21T09:01:44.1798755Z %169 = tt.splat %168 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1798830Z %170 = arith.addi %169, %47 : tensor<64xi32> 2026-02-21T09:01:44.1798965Z %171 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1799047Z %172 = arith.muli %171, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1799172Z %173 = tt.expand_dims %170 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1799276Z %174 = tt.broadcast %172 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1799384Z %175 = tt.broadcast %173 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1799467Z %176 = arith.addi %174, %175 : tensor<64x64xi32> 2026-02-21T09:01:44.1799577Z %177 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1799705Z %178 = tt.addptr %177, %176 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1799789Z %179 = tt.load %178 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1799891Z %180 = arith.extf %179 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1800055Z %181 = tt.descriptor_load %0[%167, %50] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1800137Z %182 = arith.shli %181, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1800221Z %183 = arith.shrsi %182, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1800311Z %184 = arith.shrsi %181, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1800421Z %185 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1800565Z %186 = tt.expand_dims %185 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1800699Z %187 = tt.expand_dims %186 
{axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1800829Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1800999Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1801091Z %190 = arith.cmpi eq, %187, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1801203Z %191 = tt.broadcast %190 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1801312Z %192 = tt.broadcast %188 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1801432Z %193 = arith.select %191, %192, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1801530Z %194 = arith.cmpi eq, %187, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1801673Z %195 = tt.broadcast %189 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1801776Z %196 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1801903Z %197 = arith.select %196, %195, %193 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1802031Z %198 = tt.reshape %197 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1802138Z %199 = arith.sitofp %198 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1802361Z %200 = tt.dot %180, %199, %165, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1802431Z scf.yield %200 : tensor<64x64xf32> 2026-02-21T09:01:44.1802487Z } 2026-02-21T09:01:44.1802600Z %54 = arith.truncf %53 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:44.1802722Z %55 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1802805Z %56 = arith.muli %55, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:44.1802922Z %57 = tt.expand_dims %52 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1803033Z %58 = tt.broadcast %56 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1803135Z %59 = 
tt.broadcast %57 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1803215Z %60 = arith.addi %58, %59 : tensor<64x64xi32> 2026-02-21T09:01:44.1803330Z %61 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1803446Z %62 = tt.addptr %61, %60 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1803530Z tt.store %62, %54 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1803604Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:44.1803702Z scf.for %arg3 = %9 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T09:01:44.1803778Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.1803845Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:44.1803920Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:44.1803987Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:44.1804060Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.1804135Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:44.1804202Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:44.1804268Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:44.1804341Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:44.1804454Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:44.1804527Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1804598Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:44.1804671Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:44.1804742Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1804811Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:44.1804916Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:01:44.1805110Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:44.1805183Z %36 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:44.1805267Z %37 = tt.splat %36 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1805368Z %38 = arith.addi %37, %20 : tensor<64xi32> 
2026-02-21T09:01:44.1805494Z %39 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1805585Z %40 = arith.muli %39, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1805703Z %41 = tt.expand_dims %38 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1805808Z %42 = tt.broadcast %40 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1805910Z %43 = tt.broadcast %41 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1806000Z %44 = arith.addi %42, %43 : tensor<64x64xi32> 2026-02-21T09:01:44.1806113Z %45 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1806231Z %46 = tt.addptr %45, %44 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1806329Z %47 = tt.load %46 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1806452Z %48 = arith.extf %47 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1806613Z %49 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1806741Z %50 = arith.shli %49, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1806827Z %51 = arith.shrsi %50, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1806909Z %52 = arith.shrsi %49, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1807027Z %53 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1807146Z %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1807270Z %55 = tt.expand_dims %54 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1807398Z %56 = tt.expand_dims %51 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1807532Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1807625Z %58 = arith.cmpi eq, %55, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1807730Z %59 = tt.broadcast %58 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1807843Z %60 = tt.broadcast 
%56 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1807960Z %61 = arith.select %59, %60, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1808049Z %62 = arith.cmpi eq, %55, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1808158Z %63 = tt.broadcast %57 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1808258Z %64 = tt.broadcast %62 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1808372Z %65 = arith.select %64, %63, %61 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1808477Z %66 = tt.reshape %65 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1808577Z %67 = arith.sitofp %66 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1808783Z %68 = tt.dot %48, %67, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1808868Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:01:44.1808947Z %69 = arith.muli %c32_i32, %c1_i32_8 : i32 2026-02-21T09:01:44.1809018Z %70 = arith.addi %arg4, %69 : i32 2026-02-21T09:01:44.1809087Z %71 = arith.muli %70, %c2_i32 : i32 2026-02-21T09:01:44.1809172Z %72 = tt.splat %71 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1809247Z %73 = arith.addi %72, %20 : tensor<64xi32> 2026-02-21T09:01:44.1809374Z %74 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1809493Z %75 = arith.muli %74, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1809617Z %76 = tt.expand_dims %73 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1809720Z %77 = tt.broadcast %75 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1809833Z %78 = tt.broadcast %76 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1809939Z %79 = arith.addi %77, %78 : tensor<64x64xi32> 2026-02-21T09:01:44.1810055Z %80 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1810182Z %81 = tt.addptr %80, %79 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 
2026-02-21T09:01:44.1810267Z %82 = tt.load %81 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1810374Z %83 = arith.extf %82 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1810536Z %84 = tt.descriptor_load %0[%70, %23] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1810627Z %85 = arith.shli %84, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1810712Z %86 = arith.shrsi %85, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1810796Z %87 = arith.shrsi %84, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1810942Z %88 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1811071Z %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1811222Z %90 = tt.expand_dims %89 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1811363Z %91 = tt.expand_dims %86 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1811493Z %92 = tt.expand_dims %87 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1811630Z %93 = arith.cmpi eq, %90, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1811749Z %94 = tt.broadcast %93 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1811860Z %95 = tt.broadcast %91 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1811983Z %96 = arith.select %94, %95, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1812082Z %97 = arith.cmpi eq, %90, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1812191Z %98 = tt.broadcast %92 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1812297Z %99 = tt.broadcast %97 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1812419Z %100 = arith.select %99, %98, %96 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1812533Z %101 = tt.reshape %100 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1812643Z %102 = arith.sitofp %101 : tensor<64x64xi8> to tensor<64x64xf32> 
2026-02-21T09:01:44.1812836Z %103 = tt.dot %83, %102, %68, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1812919Z %c2_i32_9 = arith.constant 2 : i32 2026-02-21T09:01:44.1812998Z %104 = arith.muli %c32_i32, %c2_i32_9 : i32 2026-02-21T09:01:44.1813070Z %105 = arith.addi %arg4, %104 : i32 2026-02-21T09:01:44.1813148Z %106 = arith.muli %105, %c2_i32 : i32 2026-02-21T09:01:44.1813232Z %107 = tt.splat %106 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1813313Z %108 = arith.addi %107, %20 : tensor<64xi32> 2026-02-21T09:01:44.1813453Z %109 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1813542Z %110 = arith.muli %109, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1813676Z %111 = tt.expand_dims %108 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1813786Z %112 = tt.broadcast %110 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1813902Z %113 = tt.broadcast %111 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1813986Z %114 = arith.addi %112, %113 : tensor<64x64xi32> 2026-02-21T09:01:44.1814134Z %115 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1814272Z %116 = tt.addptr %115, %114 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1814363Z %117 = tt.load %116 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1814475Z %118 = arith.extf %117 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1814682Z %119 = tt.descriptor_load %0[%105, %23] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1814768Z %120 = arith.shli %119, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1814858Z %121 = arith.shrsi %120, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1814952Z %122 = arith.shrsi %119, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1815070Z %123 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1815198Z %124 = tt.expand_dims %123 {axis = 0 : i32} : 
tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1815335Z %125 = tt.expand_dims %124 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1815480Z %126 = tt.expand_dims %121 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1815647Z %127 = tt.expand_dims %122 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1815749Z %128 = arith.cmpi eq, %125, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1815900Z %129 = tt.broadcast %128 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1816016Z %130 = tt.broadcast %126 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1816147Z %131 = arith.select %129, %130, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1816252Z %132 = arith.cmpi eq, %125, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1816367Z %133 = tt.broadcast %127 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1816478Z %134 = tt.broadcast %132 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1816613Z %135 = arith.select %134, %133, %131 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1816721Z %136 = tt.reshape %135 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1816834Z %137 = arith.sitofp %136 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1817048Z %138 = tt.dot %118, %137, %103, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1817123Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:44.1817204Z %139 = arith.muli %c32_i32, %c3_i32 : i32 2026-02-21T09:01:44.1817278Z %140 = arith.addi %arg4, %139 : i32 2026-02-21T09:01:44.1817363Z %141 = arith.muli %140, %c2_i32 : i32 2026-02-21T09:01:44.1817445Z %142 = tt.splat %141 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.1817535Z %143 = arith.addi %142, %20 : tensor<64xi32> 2026-02-21T09:01:44.1817669Z %144 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:01:44.1817752Z %145 = arith.muli %144, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.1817878Z %146 = tt.expand_dims %143 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1817990Z %147 = tt.broadcast %145 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1818093Z %148 = tt.broadcast %146 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1818174Z %149 = arith.addi %147, %148 : tensor<64x64xi32> 2026-02-21T09:01:44.1818294Z %150 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1818414Z %151 = tt.addptr %150, %149 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1818498Z %152 = tt.load %151 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1818600Z %153 = arith.extf %152 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:01:44.1818804Z %154 = tt.descriptor_load %0[%140, %23] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:01:44.1818885Z %155 = arith.shli %154, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1818969Z %156 = arith.shrsi %155, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1819063Z %157 = arith.shrsi %154, %cst_3 : tensor<32x64xi8> 2026-02-21T09:01:44.1819197Z %158 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.1819321Z %159 = tt.expand_dims %158 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.1819460Z %160 = tt.expand_dims %159 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.1819591Z %161 = tt.expand_dims %156 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1819719Z %162 = tt.expand_dims %157 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:01:44.1819819Z %163 = arith.cmpi eq, %160, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1819927Z %164 = tt.broadcast %163 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1820036Z %165 = tt.broadcast %161 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 
2026-02-21T09:01:44.1820189Z %166 = arith.select %164, %165, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1820284Z %167 = arith.cmpi eq, %160, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.1820417Z %168 = tt.broadcast %162 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:01:44.1820523Z %169 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:01:44.1820651Z %170 = arith.select %169, %168, %166 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:01:44.1820754Z %171 = tt.reshape %170 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:01:44.1820858Z %172 = arith.sitofp %171 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:01:44.1821058Z %173 = tt.dot %153, %172, %138, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.1821128Z scf.yield %173 : tensor<64x64xf32> 2026-02-21T09:01:44.1821183Z } 2026-02-21T09:01:44.1821296Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:44.1821420Z %28 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.1821503Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:44.1821670Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.1821777Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1821877Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.1821964Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:44.1822073Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1822186Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.1822272Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.1822346Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:44.1822404Z tt.return 2026-02-21T09:01:44.1822457Z } 2026-02-21T09:01:44.1822518Z } 
2026-02-21T09:01:44.1822526Z 2026-02-21T09:01:44.1822577Z {-# 2026-02-21T09:01:44.1822643Z external_resources: { 2026-02-21T09:01:44.1822708Z mlir_reproducer: { 2026-02-21T09:01:44.1826994Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:44.1827134Z 
disable_threading: false, 2026-02-21T09:01:44.1827221Z verify_each: true 2026-02-21T09:01:44.1827277Z } 2026-02-21T09:01:44.1827334Z } 2026-02-21T09:01:44.1827386Z #-} 2026-02-21T09:01:44.1827775Z /tmp/torchinductor_root/ol/coliylrqdrfjaama3rkkgrvpqh55m4dfdbtdisf7jv2srmwz62en.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:01:44.1828406Z /tmp/torchinductor_root/ol/coliylrqdrfjaama3rkkgrvpqh55m4dfdbtdisf7jv2srmwz62en.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:01:44.1828585Z [69s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:01:44.1829540Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:01:44.1829628Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:01:44.1829753Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:01:44.4838924Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:44.4840920Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:44.4841244Z ^ 2026-02-21T09:01:44.4841815Z /tmp/torchinductor_root/ue/cuecdaopwhltcgvbiuuifr5d3lqzvbczorbbq2quk5izvpgfdmds.py:93:40: note: called from 2026-02-21T09:01:44.4842236Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:44.4842450Z ^ 2026-02-21T09:01:44.4842861Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:44.4843326Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:44.4843574Z ^ 2026-02-21T09:01:44.4843774Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:01:44.4844308Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:44.4845051Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:01:44.4845267Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:01:44.4845461Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:01:44.4845647Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:01:44.4845837Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:44.4846083Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:01:44.4846295Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:44.4846547Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:44.4846778Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:44.4847013Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:01:44.4847242Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:44.4847459Z %c2_i32 = 
arith.constant 2 : i32 2026-02-21T09:01:44.4847688Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:44.4847920Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:44.4848105Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:01:44.4848287Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:44.4848479Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:44.4848710Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:44.4848894Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:44.4849247Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:44.4849574Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:44.4849756Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:01:44.4849933Z %3 = arith.minsi %2, %c112_i32 : i32 2026-02-21T09:01:44.4850117Z %4 = arith.subi %3, %1 : i32 2026-02-21T09:01:44.4850287Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:01:44.4850476Z %5 = arith.subi %c1_i32, %c1_i32_6 : i32 2026-02-21T09:01:44.4850658Z %6 = arith.addi %4, %5 : i32 2026-02-21T09:01:44.4850833Z %7 = arith.divui %6, %c1_i32 : i32 2026-02-21T09:01:44.4851010Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:01:44.4851192Z %8 = arith.remsi %7, %c2_i32_7 : i32 2026-02-21T09:01:44.4851374Z %9 = arith.subi %7, %8 : i32 2026-02-21T09:01:44.4851582Z %10 = arith.muli %9, %c1_i32 : i32 2026-02-21T09:01:44.4851772Z %11 = arith.addi %1, %10 : i32 2026-02-21T09:01:44.4851953Z %12 = arith.muli %c1_i32, %c2_i32_7 : i32 2026-02-21T09:01:44.4852163Z scf.for %arg3 = %1 to %11 step %12 : i32 { 2026-02-21T09:01:44.4852367Z %13 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.4852566Z %14 = arith.muli %13, %c4_i32 : i32 2026-02-21T09:01:44.4852748Z %15 = arith.subi %c1_i32, %14 : i32 2026-02-21T09:01:44.4852941Z %16 = arith.minsi %15, %c4_i32 : i32 2026-02-21T09:01:44.4853139Z %17 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.4853328Z %18 = arith.remsi %17, %16 : i32 
2026-02-21T09:01:44.4853519Z %19 = arith.addi %14, %18 : i32 2026-02-21T09:01:44.4853693Z %20 = arith.divsi %17, %16 : i32 2026-02-21T09:01:44.4853874Z %21 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:01:44.4854099Z %22 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:44.4854361Z %23 = tt.splat %21 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.4854568Z %24 = arith.addi %23, %22 : tensor<64xi32> 2026-02-21T09:01:44.4854753Z %25 = arith.muli %20, %c64_i32 : i32 2026-02-21T09:01:44.4854943Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.4855135Z %27 = arith.addi %26, %22 : tensor<64xi32> 2026-02-21T09:01:44.4855329Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:44.4855648Z %28 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:44.4855983Z %65 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:44.4856218Z %66 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4856518Z %67 = tt.splat %65 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4856735Z %68 = arith.addi %67, %66 : tensor<32xi32> 2026-02-21T09:01:44.4857026Z %69 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4857300Z %70 = arith.muli %69, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4857610Z %71 = tt.expand_dims %68 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4857901Z %72 = tt.broadcast %70 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4858169Z %73 = tt.broadcast %71 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4858405Z %74 = arith.addi %72, %73 : tensor<64x32xi32> 2026-02-21T09:01:44.4858658Z %75 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4858948Z %76 = tt.addptr %75, %74 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4859208Z %77 = tt.load %76 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4859449Z 
%78 = arith.extf %77 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4859771Z %79 = tt.descriptor_load %0[%arg4, %25] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4860123Z %80 = arith.shli %79, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4860337Z %81 = arith.shrsi %80, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4860605Z %82 = arith.shrsi %79, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4860857Z %83 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4861144Z %84 = tt.expand_dims %83 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4861458Z %85 = tt.expand_dims %84 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4861822Z %86 = tt.expand_dims %81 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4862152Z %87 = tt.expand_dims %82 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4862439Z %88 = arith.cmpi eq, %85, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4862690Z %89 = tt.broadcast %88 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4862968Z %90 = tt.broadcast %86 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4863257Z %91 = arith.select %89, %90, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4863535Z %92 = arith.cmpi eq, %85, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4863782Z %93 = tt.broadcast %87 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4864059Z %94 = tt.broadcast %92 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4864345Z %95 = arith.select %94, %93, %91 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4864619Z %96 = tt.reshape %95 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4864886Z %97 = arith.sitofp %96 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4865240Z %98 = tt.dot %78, %97, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> 
tensor<64x64xf32> 2026-02-21T09:01:44.4865571Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:44.4865773Z %99 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:44.4865997Z %100 = arith.addi %arg4, %99 : i32 2026-02-21T09:01:44.4866195Z %101 = arith.muli %100, %c2_i32 : i32 2026-02-21T09:01:44.4866431Z %102 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4866693Z %103 = tt.splat %101 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4866904Z %104 = arith.addi %103, %102 : tensor<32xi32> 2026-02-21T09:01:44.4867168Z %105 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4867445Z %106 = arith.muli %105, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4873714Z %107 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4874040Z %108 = tt.broadcast %106 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4874318Z %109 = tt.broadcast %107 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4874697Z %110 = arith.addi %108, %109 : tensor<64x32xi32> 2026-02-21T09:01:44.4874954Z %111 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4875264Z %112 = tt.addptr %111, %110 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4875545Z %113 = tt.load %112 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4875793Z %114 = arith.extf %113 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4876123Z %115 = tt.descriptor_load %0[%100, %25] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4876430Z %116 = arith.shli %115, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4876663Z %117 = arith.shrsi %116, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4876884Z %118 = arith.shrsi %115, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4877155Z %119 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4877526Z %120 = tt.expand_dims %119 {axis = 0 : i32} : tensor<2xi32> -> 
tensor<1x2xi32> 2026-02-21T09:01:44.4877911Z %121 = tt.expand_dims %120 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4878269Z %122 = tt.expand_dims %117 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4878614Z %123 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4878922Z %124 = arith.cmpi eq, %121, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4879197Z %125 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4879492Z %126 = tt.broadcast %122 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4879809Z %127 = arith.select %125, %126, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4880103Z %128 = arith.cmpi eq, %121, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4880378Z %129 = tt.broadcast %123 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4880663Z %130 = tt.broadcast %128 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4880965Z %131 = arith.select %130, %129, %127 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4881269Z %132 = tt.reshape %131 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4881647Z %133 = arith.sitofp %132 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4882032Z %134 = tt.dot %114, %133, %98, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4882371Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:44.4882594Z %135 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:44.4882803Z %136 = arith.addi %arg4, %135 : i32 2026-02-21T09:01:44.4883005Z %137 = arith.muli %136, %c2_i32 : i32 2026-02-21T09:01:44.4883260Z %138 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4883531Z %139 = tt.splat %137 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4883760Z %140 = arith.addi %139, %138 : tensor<32xi32> 
2026-02-21T09:01:44.4884032Z %141 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4884330Z %142 = arith.muli %141, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4884609Z %143 = tt.expand_dims %140 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4884931Z %144 = tt.broadcast %142 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4885221Z %145 = tt.broadcast %143 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4885523Z %146 = arith.addi %144, %145 : tensor<64x32xi32> 2026-02-21T09:01:44.4885790Z %147 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4886098Z %148 = tt.addptr %147, %146 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4886391Z %149 = tt.load %148 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4886699Z %150 = arith.extf %149 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4887029Z %151 = tt.descriptor_load %0[%136, %25] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4887350Z %152 = arith.shli %151, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4887578Z %153 = arith.shrsi %152, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4887810Z %154 = arith.shrsi %151, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4888060Z %155 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4888366Z %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4888692Z %157 = tt.expand_dims %156 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4889054Z %158 = tt.expand_dims %153 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4889387Z %159 = tt.expand_dims %154 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4889698Z %160 = arith.cmpi eq, %157, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4889957Z %161 = tt.broadcast %160 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 
2026-02-21T09:01:44.4890237Z %162 = tt.broadcast %158 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4890525Z %163 = arith.select %161, %162, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4890805Z %164 = arith.cmpi eq, %157, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4891057Z %165 = tt.broadcast %159 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4891335Z %166 = tt.broadcast %164 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4891642Z %167 = arith.select %166, %165, %163 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4891938Z %168 = tt.reshape %167 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4892213Z %169 = arith.sitofp %168 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4892572Z %170 = tt.dot %150, %169, %134, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4892901Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:44.4893094Z %171 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:44.4893297Z %172 = arith.addi %arg4, %171 : i32 2026-02-21T09:01:44.4893482Z %173 = arith.muli %172, %c2_i32 : i32 2026-02-21T09:01:44.4893724Z %174 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4893989Z %175 = tt.splat %173 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4894260Z %176 = arith.addi %175, %174 : tensor<32xi32> 2026-02-21T09:01:44.4894515Z %177 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4894794Z %178 = arith.muli %177, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4895062Z %179 = tt.expand_dims %176 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4895362Z %180 = tt.broadcast %178 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4895635Z %181 = tt.broadcast %179 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4895879Z %182 = arith.addi %180, %181 : tensor<64x32xi32> 
2026-02-21T09:01:44.4896134Z %183 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4896430Z %184 = tt.addptr %183, %182 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4896735Z %185 = tt.load %184 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4896978Z %186 = arith.extf %185 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4897294Z %187 = tt.descriptor_load %0[%172, %25] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4897597Z %188 = arith.shli %187, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4897846Z %189 = arith.shrsi %188, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4898075Z %190 = arith.shrsi %187, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4898319Z %191 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4898622Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4898945Z %193 = tt.expand_dims %192 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4899267Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4899601Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4899885Z %196 = arith.cmpi eq, %193, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4900146Z %197 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4900450Z %198 = tt.broadcast %194 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4900746Z %199 = arith.select %197, %198, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4901059Z %200 = arith.cmpi eq, %193, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4901316Z %201 = tt.broadcast %195 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4901652Z %202 = tt.broadcast %200 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4901933Z %203 = arith.select %202, %201, %199 : tensor<16x2x64xi1>, 
tensor<16x2x64xi8> 2026-02-21T09:01:44.4902223Z %204 = tt.reshape %203 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4902494Z %205 = arith.sitofp %204 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4902845Z %206 = tt.dot %186, %205, %170, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4903173Z scf.yield %206 : tensor<64x64xf32> 2026-02-21T09:01:44.4903352Z } 2026-02-21T09:01:44.4903537Z %29 = arith.truncf %28 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:44.4903825Z %30 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4904095Z %31 = arith.muli %30, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:44.4904357Z %32 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.4904643Z %33 = tt.broadcast %31 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.4904907Z %34 = tt.broadcast %32 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.4905142Z %35 = arith.addi %33, %34 : tensor<64x64xi32> 2026-02-21T09:01:44.4905390Z %36 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.4905666Z %37 = tt.addptr %36, %35 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.4905926Z tt.store %37, %29 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.4906137Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:01:44.4906327Z %38 = arith.muli %c1_i32, %c1_i32_9 : i32 2026-02-21T09:01:44.4906524Z %39 = arith.addi %arg3, %38 : i32 2026-02-21T09:01:44.4906707Z %40 = arith.divsi %39, %c448_i32 : i32 2026-02-21T09:01:44.4906895Z %41 = arith.muli %40, %c4_i32 : i32 2026-02-21T09:01:44.4907072Z %42 = arith.subi %c1_i32, %41 : i32 2026-02-21T09:01:44.4907257Z %43 = arith.minsi %42, %c4_i32 : i32 2026-02-21T09:01:44.4907446Z %44 = arith.remsi %39, %c448_i32 : i32 2026-02-21T09:01:44.4907626Z %45 = arith.remsi %44, %43 : i32 2026-02-21T09:01:44.4907843Z %46 = arith.addi %41, %45 : i32 
2026-02-21T09:01:44.4908015Z %47 = arith.divsi %44, %43 : i32 2026-02-21T09:01:44.4908197Z %48 = arith.muli %46, %c64_i32 : i32 2026-02-21T09:01:44.4908418Z %49 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:44.4908676Z %50 = tt.splat %48 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.4908899Z %51 = arith.addi %50, %49 : tensor<64xi32> 2026-02-21T09:01:44.4909095Z %52 = arith.muli %47, %c64_i32 : i32 2026-02-21T09:01:44.4909289Z %53 = tt.splat %52 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.4909482Z %54 = arith.addi %53, %49 : tensor<64xi32> 2026-02-21T09:01:44.4909680Z %c64_i32_10 = arith.constant 64 : i32 2026-02-21T09:01:44.4910006Z %55 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_10 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:44.4910341Z %65 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:44.4910575Z %66 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4910835Z %67 = tt.splat %65 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4911047Z %68 = arith.addi %67, %66 : tensor<32xi32> 2026-02-21T09:01:44.4911304Z %69 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4911664Z %70 = arith.muli %69, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4911922Z %71 = tt.expand_dims %68 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4912242Z %72 = tt.broadcast %70 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4912501Z %73 = tt.broadcast %71 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4912742Z %74 = arith.addi %72, %73 : tensor<64x32xi32> 2026-02-21T09:01:44.4912993Z %75 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4913280Z %76 = tt.addptr %75, %74 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4913545Z %77 = tt.load %76 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4913780Z %78 = arith.extf %77 : tensor<64x32xbf16> to 
tensor<64x32xf32> 2026-02-21T09:01:44.4914106Z %79 = tt.descriptor_load %0[%arg4, %52] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4914416Z %80 = arith.shli %79, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4914634Z %81 = arith.shrsi %80, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4914860Z %82 = arith.shrsi %79, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4915102Z %83 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4915392Z %84 = tt.expand_dims %83 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4915697Z %85 = tt.expand_dims %84 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4916014Z %86 = tt.expand_dims %81 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4916337Z %87 = tt.expand_dims %82 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4916612Z %88 = arith.cmpi eq, %85, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4916866Z %89 = tt.broadcast %88 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4917132Z %90 = tt.broadcast %86 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4917421Z %91 = arith.select %89, %90, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4917693Z %92 = arith.cmpi eq, %85, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4917936Z %93 = tt.broadcast %87 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4918201Z %94 = tt.broadcast %92 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4918465Z %95 = arith.select %94, %93, %91 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4918742Z %96 = tt.reshape %95 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4919025Z %97 = arith.sitofp %96 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4919380Z %98 = tt.dot %78, %97, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4919701Z 
%c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:44.4919926Z %99 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:44.4920129Z %100 = arith.addi %arg4, %99 : i32 2026-02-21T09:01:44.4920314Z %101 = arith.muli %100, %c2_i32 : i32 2026-02-21T09:01:44.4920559Z %102 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4920835Z %103 = tt.splat %101 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4921064Z %104 = arith.addi %103, %102 : tensor<32xi32> 2026-02-21T09:01:44.4921345Z %105 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4921661Z %106 = arith.muli %105, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4921956Z %107 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4922276Z %108 = tt.broadcast %106 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4922600Z %109 = tt.broadcast %107 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4922859Z %110 = arith.addi %108, %109 : tensor<64x32xi32> 2026-02-21T09:01:44.4923157Z %111 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4923478Z %112 = tt.addptr %111, %110 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4923762Z %113 = tt.load %112 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4924030Z %114 = arith.extf %113 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4924374Z %115 = tt.descriptor_load %0[%100, %52] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4924711Z %116 = arith.shli %115, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4936368Z %117 = arith.shrsi %116, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4936661Z %118 = arith.shrsi %115, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4936937Z %119 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4937254Z %120 = tt.expand_dims %119 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4937604Z %121 = 
tt.expand_dims %120 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4937950Z %122 = tt.expand_dims %117 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4938283Z %123 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4938590Z %124 = arith.cmpi eq, %121, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4938853Z %125 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4939146Z %126 = tt.broadcast %122 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4939445Z %127 = arith.select %125, %126, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4939741Z %128 = arith.cmpi eq, %121, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4940010Z %129 = tt.broadcast %123 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4940288Z %130 = tt.broadcast %128 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4940585Z %131 = arith.select %130, %129, %127 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4940870Z %132 = tt.reshape %131 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4941148Z %133 = arith.sitofp %132 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4941518Z %134 = tt.dot %114, %133, %98, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4941914Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:44.4942225Z %135 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:44.4942430Z %136 = arith.addi %arg4, %135 : i32 2026-02-21T09:01:44.4942633Z %137 = arith.muli %136, %c2_i32 : i32 2026-02-21T09:01:44.4942877Z %138 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4943184Z %139 = tt.splat %137 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4943406Z %140 = arith.addi %139, %138 : tensor<32xi32> 2026-02-21T09:01:44.4943669Z %141 = tt.expand_dims %51 {axis = 1 : i32} : 
tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4943957Z %142 = arith.muli %141, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4944225Z %143 = tt.expand_dims %140 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4944533Z %144 = tt.broadcast %142 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4944801Z %145 = tt.broadcast %143 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4945061Z %146 = arith.addi %144, %145 : tensor<64x32xi32> 2026-02-21T09:01:44.4945329Z %147 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4945628Z %148 = tt.addptr %147, %146 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4945950Z %149 = tt.load %148 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4946200Z %150 = arith.extf %149 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4946574Z %151 = tt.descriptor_load %0[%136, %52] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4946883Z %152 = arith.shli %151, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4947122Z %153 = arith.shrsi %152, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4947367Z %154 = arith.shrsi %151, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4947621Z %155 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4947935Z %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4948264Z %157 = tt.expand_dims %156 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4948614Z %158 = tt.expand_dims %153 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4948952Z %159 = tt.expand_dims %154 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4949240Z %160 = arith.cmpi eq, %157, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4949515Z %161 = tt.broadcast %160 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4949795Z %162 = tt.broadcast %158 : tensor<16x1x64xi8> -> 
tensor<16x2x64xi8> 2026-02-21T09:01:44.4950104Z %163 = arith.select %161, %162, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4950381Z %164 = arith.cmpi eq, %157, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4950648Z %165 = tt.broadcast %159 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4950929Z %166 = tt.broadcast %164 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4951212Z %167 = arith.select %166, %165, %163 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4951514Z %168 = tt.reshape %167 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4951935Z %169 = arith.sitofp %168 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4952311Z %170 = tt.dot %150, %169, %134, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4952652Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:44.4952855Z %171 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:44.4953068Z %172 = arith.addi %arg4, %171 : i32 2026-02-21T09:01:44.4953260Z %173 = arith.muli %172, %c2_i32 : i32 2026-02-21T09:01:44.4953513Z %174 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4953812Z %175 = tt.splat %173 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4954040Z %176 = arith.addi %175, %174 : tensor<32xi32> 2026-02-21T09:01:44.4954314Z %177 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4954593Z %178 = arith.muli %177, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4954899Z %179 = tt.expand_dims %176 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4955203Z %180 = tt.broadcast %178 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4955481Z %181 = tt.broadcast %179 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4955723Z %182 = arith.addi %180, %181 : tensor<64x32xi32> 2026-02-21T09:01:44.4955981Z %183 = tt.splat %arg0 : !tt.ptr -> 
tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4956290Z %184 = tt.addptr %183, %182 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4956559Z %185 = tt.load %184 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4956813Z %186 = arith.extf %185 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4957135Z %187 = tt.descriptor_load %0[%172, %52] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4957475Z %188 = arith.shli %187, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4957703Z %189 = arith.shrsi %188, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4957961Z %190 = arith.shrsi %187, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4958222Z %191 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4958522Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4958850Z %193 = tt.expand_dims %192 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4959173Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4959511Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4959803Z %196 = arith.cmpi eq, %193, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4960055Z %197 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4960343Z %198 = tt.broadcast %194 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4960637Z %199 = arith.select %197, %198, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4960923Z %200 = arith.cmpi eq, %193, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4961179Z %201 = tt.broadcast %195 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4961464Z %202 = tt.broadcast %200 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4961788Z %203 = arith.select %202, %201, %199 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4962080Z %204 = tt.reshape %203 
: tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4962360Z %205 = arith.sitofp %204 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4962718Z %206 = tt.dot %186, %205, %170, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4963055Z scf.yield %206 : tensor<64x64xf32> 2026-02-21T09:01:44.4963248Z } 2026-02-21T09:01:44.4963428Z %56 = arith.truncf %55 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:44.4963731Z %57 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4964005Z %58 = arith.muli %57, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:44.4964274Z %59 = tt.expand_dims %54 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.4964570Z %60 = tt.broadcast %58 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.4964847Z %61 = tt.broadcast %59 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.4965158Z %62 = arith.addi %60, %61 : tensor<64x64xi32> 2026-02-21T09:01:44.4965413Z %63 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.4965715Z %64 = tt.addptr %63, %62 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.4965989Z tt.store %64, %56 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.4966245Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:44.4966450Z scf.for %arg3 = %11 to %3 step %c1_i32 : i32 { 2026-02-21T09:01:44.4966687Z %13 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.4966902Z %14 = arith.muli %13, %c4_i32 : i32 2026-02-21T09:01:44.4967100Z %15 = arith.subi %c1_i32, %14 : i32 2026-02-21T09:01:44.4967309Z %16 = arith.minsi %15, %c4_i32 : i32 2026-02-21T09:01:44.4967514Z %17 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.4967726Z %18 = arith.remsi %17, %16 : i32 2026-02-21T09:01:44.4967916Z %19 = arith.addi %14, %18 : i32 2026-02-21T09:01:44.4968111Z %20 = arith.divsi %17, %16 : i32 2026-02-21T09:01:44.4968301Z %21 = arith.muli %19, %c64_i32 : i32 
2026-02-21T09:01:44.4968555Z %22 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:44.4968823Z %23 = tt.splat %21 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.4969060Z %24 = arith.addi %23, %22 : tensor<64xi32> 2026-02-21T09:01:44.4969276Z %25 = arith.muli %20, %c64_i32 : i32 2026-02-21T09:01:44.4969471Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.4969711Z %27 = arith.addi %26, %22 : tensor<64xi32> 2026-02-21T09:01:44.4969912Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:44.4970261Z %28 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:44.4970613Z %38 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:44.4970912Z %39 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4971174Z %40 = tt.splat %38 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4971397Z %41 = arith.addi %40, %39 : tensor<32xi32> 2026-02-21T09:01:44.4971688Z %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4971987Z %43 = arith.muli %42, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4972274Z %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4972582Z %45 = tt.broadcast %43 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4972868Z %46 = tt.broadcast %44 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4973120Z %47 = arith.addi %45, %46 : tensor<64x32xi32> 2026-02-21T09:01:44.4973387Z %48 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4973718Z %49 = tt.addptr %48, %47 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4973988Z %50 = tt.load %49 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4974237Z %51 = arith.extf %50 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4974558Z %52 = tt.descriptor_load %0[%arg4, %25] : !tt.tensordesc> -> tensor<16x64xi8> 
2026-02-21T09:01:44.4974877Z %53 = arith.shli %52, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4975100Z %54 = arith.shrsi %53, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4975332Z %55 = arith.shrsi %52, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4975579Z %56 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4975883Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4976202Z %58 = tt.expand_dims %57 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4976522Z %59 = tt.expand_dims %54 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4976873Z %60 = tt.expand_dims %55 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4977154Z %61 = arith.cmpi eq, %58, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4977413Z %62 = tt.broadcast %61 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4977693Z %63 = tt.broadcast %59 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4978026Z %64 = arith.select %62, %63, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4978304Z %65 = arith.cmpi eq, %58, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4978554Z %66 = tt.broadcast %60 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4978833Z %67 = tt.broadcast %65 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4979109Z %68 = arith.select %67, %66, %64 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4979400Z %69 = tt.reshape %68 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4979668Z %70 = arith.sitofp %69 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4980022Z %71 = tt.dot %51, %70, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4980351Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:01:44.4980583Z %72 = arith.muli %c16_i32, %c1_i32_9 : i32 
2026-02-21T09:01:44.4980782Z %73 = arith.addi %arg4, %72 : i32 2026-02-21T09:01:44.4980974Z %74 = arith.muli %73, %c2_i32 : i32 2026-02-21T09:01:44.4981233Z %75 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4981480Z %76 = tt.splat %74 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4981713Z %77 = arith.addi %76, %75 : tensor<32xi32> 2026-02-21T09:01:44.4981961Z %78 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4982234Z %79 = arith.muli %78, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4982487Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4982779Z %81 = tt.broadcast %79 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4983039Z %82 = tt.broadcast %80 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4983272Z %83 = arith.addi %81, %82 : tensor<64x32xi32> 2026-02-21T09:01:44.4983523Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4983805Z %85 = tt.addptr %84, %83 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4984066Z %86 = tt.load %85 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4984297Z %87 = arith.extf %86 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4984615Z %88 = tt.descriptor_load %0[%73, %25] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4984916Z %89 = arith.shli %88, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4985132Z %90 = arith.shrsi %89, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4985352Z %91 = arith.shrsi %88, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4985593Z %92 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4985886Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4986201Z %94 = tt.expand_dims %93 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4986516Z %95 = tt.expand_dims %90 {axis = 1 : i32} : 
tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4986837Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4987114Z %97 = arith.cmpi eq, %94, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4987369Z %98 = tt.broadcast %97 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4987633Z %99 = tt.broadcast %95 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4987951Z %100 = arith.select %98, %99, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4988232Z %101 = arith.cmpi eq, %94, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4988486Z %102 = tt.broadcast %96 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4988769Z %103 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4989079Z %104 = arith.select %103, %102, %100 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4989375Z %105 = tt.reshape %104 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4989645Z %106 = arith.sitofp %105 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4989993Z %107 = tt.dot %87, %106, %71, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4990317Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:01:44.4990517Z %108 = arith.muli %c16_i32, %c2_i32_10 : i32 2026-02-21T09:01:44.4990718Z %109 = arith.addi %arg4, %108 : i32 2026-02-21T09:01:44.4990900Z %110 = arith.muli %109, %c2_i32 : i32 2026-02-21T09:01:44.4991126Z %111 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4991371Z %112 = tt.splat %110 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4991620Z %113 = arith.addi %112, %111 : tensor<32xi32> 2026-02-21T09:01:44.4991879Z %114 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4992177Z %115 = arith.muli %114, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4992435Z %116 = tt.expand_dims 
%113 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4992722Z %117 = tt.broadcast %115 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4992996Z %118 = tt.broadcast %116 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4993242Z %119 = arith.addi %117, %118 : tensor<64x32xi32> 2026-02-21T09:01:44.4993493Z %120 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4993790Z %121 = tt.addptr %120, %119 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4993877Z %122 = tt.load %121 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4993984Z %123 = arith.extf %122 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4994153Z %124 = tt.descriptor_load %0[%109, %25] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4994238Z %125 = arith.shli %124, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4994325Z %126 = arith.shrsi %125, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4994418Z %127 = arith.shrsi %124, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4994527Z %128 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4994650Z %129 = tt.expand_dims %128 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4994787Z %130 = tt.expand_dims %129 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4994919Z %131 = tt.expand_dims %126 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4995049Z %132 = tt.expand_dims %127 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4995151Z %133 = arith.cmpi eq, %130, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4995260Z %134 = tt.broadcast %133 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4995370Z %135 = tt.broadcast %131 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4995493Z %136 = arith.select %134, %135, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4995593Z %137 = 
arith.cmpi eq, %130, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4995700Z %138 = tt.broadcast %132 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4995835Z %139 = tt.broadcast %137 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4995960Z %140 = arith.select %139, %138, %136 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4996063Z %141 = tt.reshape %140 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.4996165Z %142 = arith.sitofp %141 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.4996389Z %143 = tt.dot %123, %142, %107, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.4996460Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:44.4996535Z %144 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:44.4996610Z %145 = arith.addi %arg4, %144 : i32 2026-02-21T09:01:44.4996679Z %146 = arith.muli %145, %c2_i32 : i32 2026-02-21T09:01:44.4996792Z %147 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.4996872Z %148 = tt.splat %146 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.4996958Z %149 = arith.addi %148, %147 : tensor<32xi32> 2026-02-21T09:01:44.4997084Z %150 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.4997165Z %151 = arith.muli %150, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.4997328Z %152 = tt.expand_dims %149 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.4997435Z %153 = tt.broadcast %151 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4997570Z %154 = tt.broadcast %152 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.4997661Z %155 = arith.addi %153, %154 : tensor<64x32xi32> 2026-02-21T09:01:44.4997774Z %156 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4997892Z %157 = tt.addptr %156, %155 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.4997981Z %158 = tt.load %157 : 
tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.4998083Z %159 = arith.extf %158 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.4998241Z %160 = tt.descriptor_load %0[%145, %25] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.4998327Z %161 = arith.shli %160, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4998412Z %162 = arith.shrsi %161, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4998495Z %163 = arith.shrsi %160, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.4998604Z %164 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.4998736Z %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.4998864Z %166 = tt.expand_dims %165 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.4998992Z %167 = tt.expand_dims %162 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4999129Z %168 = tt.expand_dims %163 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.4999223Z %169 = arith.cmpi eq, %166, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4999330Z %170 = tt.broadcast %169 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4999446Z %171 = tt.broadcast %167 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4999567Z %172 = arith.select %170, %171, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.4999658Z %173 = arith.cmpi eq, %166, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.4999773Z %174 = tt.broadcast %168 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.4999876Z %175 = tt.broadcast %173 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.4999994Z %176 = arith.select %175, %174, %172 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.5000103Z %177 = tt.reshape %176 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.5000207Z %178 = arith.sitofp %177 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.5000425Z %179 = 
tt.dot %159, %178, %143, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.5000496Z scf.yield %179 : tensor<64x64xf32> 2026-02-21T09:01:44.5000558Z } 2026-02-21T09:01:44.5000667Z %29 = arith.truncf %28 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:44.5000813Z %30 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.5000904Z %31 = arith.muli %30, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:44.5001025Z %32 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.5001127Z %33 = tt.broadcast %31 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.5001236Z %34 = tt.broadcast %32 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.5001315Z %35 = arith.addi %33, %34 : tensor<64x64xi32> 2026-02-21T09:01:44.5001426Z %36 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.5001564Z %37 = tt.addptr %36, %35 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.5001652Z tt.store %37, %29 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.5001721Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:44.5001819Z tt.return 2026-02-21T09:01:44.5001888Z } 2026-02-21T09:01:44.5001943Z } 2026-02-21T09:01:44.5001949Z 2026-02-21T09:01:44.5001999Z {-# 2026-02-21T09:01:44.5002075Z external_resources: { 2026-02-21T09:01:44.5002162Z mlir_reproducer: { 2026-02-21T09:01:44.5006521Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:44.5006595Z disable_threading: false, 2026-02-21T09:01:44.5006663Z verify_each: true 2026-02-21T09:01:44.5006715Z } 2026-02-21T09:01:44.5006765Z } 2026-02-21T09:01:44.5006818Z #-} 2026-02-21T09:01:44.5007195Z /tmp/torchinductor_root/ue/cuecdaopwhltcgvbiuuifr5d3lqzvbczorbbq2quk5izvpgfdmds.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:01:44.5007817Z /tmp/torchinductor_root/ue/cuecdaopwhltcgvbiuuifr5d3lqzvbczorbbq2quk5izvpgfdmds.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:01:44.5008022Z [69s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:01:44.5008957Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:01:44.5009075Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:01:44.5009208Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:01:44.8036494Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:44.8038437Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:44.8038553Z ^ 2026-02-21T09:01:44.8043591Z /tmp/torchinductor_root/5k/c5kz3cda3c5hfh2ilvy5vox5nwdc5rus3gvelzqvmjirftpd6ebf.py:90:40: note: called from 2026-02-21T09:01:44.8045292Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:44.8045398Z ^ 2026-02-21T09:01:44.8045758Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:44.8045907Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:44.8046015Z ^ 2026-02-21T09:01:44.8046110Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:01:44.8046513Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:44.8046614Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:01:44.8046695Z 
%c448_i32 = arith.constant 448 : i32 2026-02-21T09:01:44.8046763Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:01:44.8046830Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:01:44.8046901Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:01:44.8046972Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:44.8047044Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:01:44.8047141Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:44.8047242Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:44.8047328Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:44.8047414Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:01:44.8047503Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:44.8047571Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:01:44.8047681Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:44.8047748Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:44.8047817Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:01:44.8047887Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:44.8047955Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:44.8048026Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:44.8048090Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:44.8048287Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:44.8048361Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:44.8048430Z %2 = arith.subi %c112_i32, %1 : i32 2026-02-21T09:01:44.8048495Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:01:44.8048571Z %3 = arith.subi %c9472_i32, %c1_i32_6 : i32 2026-02-21T09:01:44.8048687Z %4 = arith.addi %2, %3 : i32 2026-02-21T09:01:44.8048756Z %5 = arith.divui %4, %c9472_i32 : i32 2026-02-21T09:01:44.8048822Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:01:44.8048894Z %6 = arith.remsi %5, %c2_i32_7 : i32 2026-02-21T09:01:44.8048954Z %7 = 
arith.subi %5, %6 : i32 2026-02-21T09:01:44.8049020Z %8 = arith.muli %7, %c9472_i32 : i32 2026-02-21T09:01:44.8049132Z %9 = arith.addi %1, %8 : i32 2026-02-21T09:01:44.8049215Z %10 = arith.muli %c9472_i32, %c2_i32_7 : i32 2026-02-21T09:01:44.8049293Z scf.for %arg3 = %1 to %9 step %10 : i32 { 2026-02-21T09:01:44.8049366Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.8049438Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:44.8049502Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:44.8049567Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:44.8049643Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.8049707Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:44.8049775Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:44.8049836Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:44.8049908Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:44.8050029Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:44.8050136Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.8050218Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:44.8050283Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:44.8050376Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.8050453Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:44.8050521Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:44.8050723Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:44.8050796Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:44.8050921Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8050996Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8051069Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:44.8051207Z %67 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8051290Z %68 = 
arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8051415Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8051527Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8051816Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8051899Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:44.8052024Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8052144Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8052229Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8052336Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8052513Z %77 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8052596Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8052682Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8052779Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8052893Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8053019Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8053163Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8053298Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8053459Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8053558Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8053668Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8053804Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8053927Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, 
tensor<16x2x64xi8> 2026-02-21T09:01:44.8054013Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8054118Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8054220Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8054342Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8054442Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8054545Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8054748Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8054869Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:44.8054948Z %97 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:44.8055023Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:44.8055113Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:44.8055229Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8055312Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8055391Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:44.8055517Z %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8055605Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8055739Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8055846Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8055951Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8056040Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:44.8056155Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8056278Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 
2026-02-21T09:01:44.8056372Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8056477Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8056628Z %113 = tt.descriptor_load %0[%98, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8056720Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8056807Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8056890Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8056999Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8057131Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8057261Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8057393Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8057532Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8057627Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8057733Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8057883Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8058008Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8058099Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8058216Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8058367Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8058495Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8058597Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8058700Z %131 = arith.sitofp %130 : tensor<32x64xi8> to 
tensor<32x64xf32> 2026-02-21T09:01:44.8058894Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8058966Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:44.8059045Z %133 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:44.8059111Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:44.8059186Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:44.8059324Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8059407Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8059493Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:44.8059642Z %139 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8059726Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8059857Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8059960Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8060062Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8060143Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:44.8060262Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8060383Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8060467Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8060578Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8060738Z %149 = tt.descriptor_load %0[%134, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8060819Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8060910Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8060992Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8061102Z %153 = tt.make_range {end 
= 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8061233Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8061362Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8061494Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8061666Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8061760Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8061864Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8061971Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8062098Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8062188Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8062323Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8062434Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8062553Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8062658Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8062797Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8062992Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8063065Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:44.8063154Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:44.8063224Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:44.8063294Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:44.8063421Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8063503Z %173 = tt.splat 
%171 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8063584Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:44.8063713Z %175 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8063850Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8063979Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8064111Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8064223Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8064302Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:44.8064414Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8064541Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8064624Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8064728Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8064890Z %185 = tt.descriptor_load %0[%170, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8064974Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8065058Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8065140Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8065256Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8065384Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8065511Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8065650Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8065782Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8065873Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 
2026-02-21T09:01:44.8065990Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8066098Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8066220Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8066315Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8066420Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8066522Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8066649Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8066776Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8066879Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8067075Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8067146Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:44.8067227Z } 2026-02-21T09:01:44.8067335Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:44.8067465Z %28 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8067543Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:44.8067661Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.8067769Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.8067884Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.8067966Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:44.8068084Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.8068202Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.8068308Z tt.store %35, %27 : 
tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.8068385Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:01:44.8068470Z %36 = arith.muli %c9472_i32, %c1_i32_9 : i32 2026-02-21T09:01:44.8068564Z %37 = arith.addi %arg3, %36 : i32 2026-02-21T09:01:44.8068683Z %38 = arith.divsi %37, %c448_i32 : i32 2026-02-21T09:01:44.8068753Z %39 = arith.muli %38, %c4_i32 : i32 2026-02-21T09:01:44.8068821Z %40 = arith.subi %c1_i32, %39 : i32 2026-02-21T09:01:44.8068897Z %41 = arith.minsi %40, %c4_i32 : i32 2026-02-21T09:01:44.8068967Z %42 = arith.remsi %37, %c448_i32 : i32 2026-02-21T09:01:44.8069034Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:01:44.8069103Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:01:44.8069174Z %45 = arith.divsi %42, %41 : i32 2026-02-21T09:01:44.8069240Z %46 = arith.muli %44, %c64_i32 : i32 2026-02-21T09:01:44.8069358Z %47 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:44.8069445Z %48 = tt.splat %46 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.8069522Z %49 = arith.addi %48, %47 : tensor<64xi32> 2026-02-21T09:01:44.8069590Z %50 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:01:44.8069670Z %51 = tt.splat %50 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.8069743Z %52 = arith.addi %51, %47 : tensor<64xi32> 2026-02-21T09:01:44.8069815Z %c64_i32_10 = arith.constant 64 : i32 2026-02-21T09:01:44.8070025Z %53 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_10 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:44.8070109Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:44.8070227Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8070305Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8070389Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:44.8070520Z %67 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8070608Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8070741Z 
%69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8070850Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8070954Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8071041Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:44.8071156Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8071275Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8071385Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8071498Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8071701Z %77 = tt.descriptor_load %0[%arg4, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8071829Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8071926Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8072013Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8072127Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8072262Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8072394Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8072529Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8072672Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8072769Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8072909Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8073029Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8073191Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8073286Z %90 = arith.cmpi eq, %83, %cst_1 : 
tensor<1x2x1xi32> 2026-02-21T09:01:44.8073396Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8073521Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8073640Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8073745Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8073858Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8074056Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8074132Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:44.8074219Z %97 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:44.8074289Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:44.8074360Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:44.8074491Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8074571Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8074650Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:44.8074777Z %103 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8074872Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8075005Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8075114Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8075228Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8075313Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:44.8075432Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8075564Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8075651Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 
2026-02-21T09:01:44.8075759Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8075924Z %113 = tt.descriptor_load %0[%98, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8076040Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8076128Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8076219Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8076351Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8076506Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8076646Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8076785Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8076922Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8077025Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8077136Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8077266Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8077406Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8077497Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8077630Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8077737Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8077875Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8077978Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8078088Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8078276Z %132 = tt.dot %112, %131, %96, 
inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8078345Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:44.8078424Z %133 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:44.8078496Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:44.8078564Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:44.8078678Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8078760Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8078838Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:44.8078961Z %139 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8079044Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8079168Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8079269Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8079372Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8079453Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:44.8079563Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8079682Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8079768Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8079868Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8080024Z %149 = tt.descriptor_load %0[%134, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8080108Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8080188Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8080270Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8080382Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8080534Z %154 
= tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8080661Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8080795Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8080949Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8081038Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8081148Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8081256Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8081376Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8081466Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8081603Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8081706Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8081826Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8081959Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8082062Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8082275Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8082359Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:44.8082433Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:44.8082500Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:44.8082568Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:44.8082679Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8082760Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8082833Z %174 = arith.addi 
%173, %172 : tensor<32xi32> 2026-02-21T09:01:44.8082958Z %175 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8083040Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8083164Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8083273Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8083372Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8083452Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:44.8083566Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8083696Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8083778Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8083884Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8084035Z %185 = tt.descriptor_load %0[%170, %50] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8084115Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8084200Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8084280Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8084387Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8084510Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8084638Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8084766Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8084923Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8085015Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8085122Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> 
tensor<16x2x64xi1> 2026-02-21T09:01:44.8085258Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8085380Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8085468Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8085571Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8085676Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8085792Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8085893Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8085998Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8086187Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8086278Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:44.8086337Z } 2026-02-21T09:01:44.8086441Z %54 = arith.truncf %53 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:44.8086578Z %55 = tt.expand_dims %49 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8086659Z %56 = arith.muli %55, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:44.8086775Z %57 = tt.expand_dims %52 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.8086875Z %58 = tt.broadcast %56 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.8086971Z %59 = tt.broadcast %57 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.8087050Z %60 = arith.addi %58, %59 : tensor<64x64xi32> 2026-02-21T09:01:44.8087154Z %61 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.8087263Z %62 = tt.addptr %61, %60 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.8087351Z tt.store %62, %54 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.8087417Z } {tt.num_stages = 1 : i32} 
2026-02-21T09:01:44.8087513Z scf.for %arg3 = %9 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T09:01:44.8087586Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.8087647Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:44.8087708Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:44.8087770Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:44.8087844Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:44.8087906Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:44.8087968Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:44.8088031Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:44.8088092Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:44.8088198Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:44.8088274Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.8088345Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:44.8088405Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:44.8088475Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:44.8088543Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:44.8088607Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:01:44.8088801Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:44.8088873Z %36 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:44.8089052Z %37 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8093167Z %38 = tt.splat %36 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8098409Z %39 = arith.addi %38, %37 : tensor<32xi32> 2026-02-21T09:01:44.8100462Z %40 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8100573Z %41 = arith.muli %40, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8100932Z %42 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8101046Z %43 = tt.broadcast %41 : 
tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8101144Z %44 = tt.broadcast %42 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8101230Z %45 = arith.addi %43, %44 : tensor<64x32xi32> 2026-02-21T09:01:44.8101350Z %46 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8101471Z %47 = tt.addptr %46, %45 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8101619Z %48 = tt.load %47 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8101731Z %49 = arith.extf %48 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8101899Z %50 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8102034Z %51 = arith.shli %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8102130Z %52 = arith.shrsi %51, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8102252Z %53 = arith.shrsi %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8102364Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8102492Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8102616Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8102744Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8102870Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8102962Z %59 = arith.cmpi eq, %56, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8103065Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8103170Z %61 = tt.broadcast %57 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8103298Z %62 = arith.select %60, %61, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8103387Z %63 = arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8103491Z %64 = tt.broadcast %58 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:44.8103595Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8103713Z %66 = arith.select %65, %64, %62 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8104033Z %67 = tt.reshape %66 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8104139Z %68 = arith.sitofp %67 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8104333Z %69 = tt.dot %49, %68, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8104406Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:01:44.8104536Z %70 = arith.muli %c16_i32, %c1_i32_9 : i32 2026-02-21T09:01:44.8110832Z %71 = arith.addi %arg4, %70 : i32 2026-02-21T09:01:44.8110939Z %72 = arith.muli %71, %c2_i32 : i32 2026-02-21T09:01:44.8111083Z %73 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8111174Z %74 = tt.splat %72 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8111245Z %75 = arith.addi %74, %73 : tensor<32xi32> 2026-02-21T09:01:44.8111397Z %76 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8111499Z %77 = arith.muli %76, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8111882Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8112010Z %79 = tt.broadcast %77 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8112142Z %80 = tt.broadcast %78 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8112235Z %81 = arith.addi %79, %80 : tensor<64x32xi32> 2026-02-21T09:01:44.8112420Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8112563Z %83 = tt.addptr %82, %81 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8112650Z %84 = tt.load %83 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8112770Z %85 = arith.extf %84 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8112962Z %86 = tt.descriptor_load %0[%71, %23] : 
!tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8113060Z %87 = arith.shli %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8113153Z %88 = arith.shrsi %87, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8113248Z %89 = arith.shrsi %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8113386Z %90 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8113570Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8113704Z %92 = tt.expand_dims %91 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8113888Z %93 = tt.expand_dims %88 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8114021Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8114122Z %95 = arith.cmpi eq, %92, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8114241Z %96 = tt.broadcast %95 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8114352Z %97 = tt.broadcast %93 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8114476Z %98 = arith.select %96, %97, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8114572Z %99 = arith.cmpi eq, %92, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8114715Z %100 = tt.broadcast %94 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8114821Z %101 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8114950Z %102 = arith.select %101, %100, %98 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8115058Z %103 = tt.reshape %102 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8115165Z %104 = arith.sitofp %103 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8115383Z %105 = tt.dot %85, %104, %69, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8115457Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:01:44.8115537Z %106 = arith.muli 
%c16_i32, %c2_i32_10 : i32 2026-02-21T09:01:44.8115613Z %107 = arith.addi %arg4, %106 : i32 2026-02-21T09:01:44.8115684Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T09:01:44.8115820Z %109 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8115907Z %110 = tt.splat %108 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8115987Z %111 = arith.addi %110, %109 : tensor<32xi32> 2026-02-21T09:01:44.8116117Z %112 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8116203Z %113 = arith.muli %112, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8116337Z %114 = tt.expand_dims %111 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8116455Z %115 = tt.broadcast %113 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8116554Z %116 = tt.broadcast %114 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8116641Z %117 = arith.addi %115, %116 : tensor<64x32xi32> 2026-02-21T09:01:44.8116779Z %118 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8116902Z %119 = tt.addptr %118, %117 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8116992Z %120 = tt.load %119 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8117095Z %121 = arith.extf %120 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8117281Z %122 = tt.descriptor_load %0[%107, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8117368Z %123 = arith.shli %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8117453Z %124 = arith.shrsi %123, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8117536Z %125 = arith.shrsi %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8117649Z %126 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8117779Z %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8117910Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 
2026-02-21T09:01:44.8118039Z %129 = tt.expand_dims %124 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8118198Z %130 = tt.expand_dims %125 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8118295Z %131 = arith.cmpi eq, %128, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8118426Z %132 = tt.broadcast %131 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8118541Z %133 = tt.broadcast %129 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8118664Z %134 = arith.select %132, %133, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8118755Z %135 = arith.cmpi eq, %128, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8118870Z %136 = tt.broadcast %130 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8118974Z %137 = tt.broadcast %135 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8119093Z %138 = arith.select %137, %136, %134 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8119199Z %139 = tt.reshape %138 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8119302Z %140 = arith.sitofp %139 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8119500Z %141 = tt.dot %121, %140, %105, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8119577Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:44.8119652Z %142 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:44.8119716Z %143 = arith.addi %arg4, %142 : i32 2026-02-21T09:01:44.8119782Z %144 = arith.muli %143, %c2_i32 : i32 2026-02-21T09:01:44.8119900Z %145 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:44.8119976Z %146 = tt.splat %144 : i32 -> tensor<32xi32> 2026-02-21T09:01:44.8120055Z %147 = arith.addi %146, %145 : tensor<32xi32> 2026-02-21T09:01:44.8120184Z %148 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8120264Z %149 = arith.muli 
%148, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:44.8120389Z %150 = tt.expand_dims %147 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:44.8120495Z %151 = tt.broadcast %149 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8120594Z %152 = tt.broadcast %150 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:44.8120673Z %153 = arith.addi %151, %152 : tensor<64x32xi32> 2026-02-21T09:01:44.8120790Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8120911Z %155 = tt.addptr %154, %153 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:44.8120991Z %156 = tt.load %155 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:44.8121116Z %157 = arith.extf %156 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:44.8121282Z %158 = tt.descriptor_load %0[%143, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:44.8121361Z %159 = arith.shli %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8121446Z %160 = arith.shrsi %159, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8121597Z %161 = arith.shrsi %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:44.8121707Z %162 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:44.8121830Z %163 = tt.expand_dims %162 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:44.8121961Z %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:44.8122094Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8122223Z %166 = tt.expand_dims %161 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:44.8122317Z %167 = arith.cmpi eq, %164, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8122421Z %168 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8122563Z %169 = tt.broadcast %165 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8122691Z %170 = arith.select %168, %169, %cst : 
tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8122808Z %171 = arith.cmpi eq, %164, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:44.8122917Z %172 = tt.broadcast %166 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:44.8123019Z %173 = tt.broadcast %171 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:44.8123142Z %174 = arith.select %173, %172, %170 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:44.8123244Z %175 = tt.reshape %174 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:44.8123347Z %176 = arith.sitofp %175 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:44.8123539Z %177 = tt.dot %157, %176, %141, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:44.8123605Z scf.yield %177 : tensor<64x64xf32> 2026-02-21T09:01:44.8123659Z } 2026-02-21T09:01:44.8123768Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:44.8123890Z %28 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:44.8123970Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:44.8124094Z %30 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:44.8124195Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.8124292Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:44.8124366Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:44.8124475Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.8124583Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:44.8124664Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:44.8124733Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:44.8124790Z tt.return 2026-02-21T09:01:44.8124839Z } 2026-02-21T09:01:44.8124892Z } 2026-02-21T09:01:44.8124898Z 2026-02-21T09:01:44.8124944Z {-# 2026-02-21T09:01:44.8125007Z 
external_resources: { 2026-02-21T09:01:44.8125065Z mlir_reproducer: { 2026-02-21T09:01:44.8129326Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:44.8129528Z disable_threading: false, 2026-02-21T09:01:44.8133531Z verify_each: true 2026-02-21T09:01:44.8134927Z } 
2026-02-21T09:01:44.8135156Z } 2026-02-21T09:01:44.8135216Z #-} 2026-02-21T09:01:44.8135696Z /tmp/torchinductor_root/5k/c5kz3cda3c5hfh2ilvy5vox5nwdc5rus3gvelzqvmjirftpd6ebf.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:01:44.8138923Z /tmp/torchinductor_root/5k/c5kz3cda3c5hfh2ilvy5vox5nwdc5rus3gvelzqvmjirftpd6ebf.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:01:44.8139152Z [70s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:01:44.8140208Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[4, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:01:44.8140311Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:01:44.8140455Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:01:45.2867826Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:45.2868325Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:45.2872465Z ^ 2026-02-21T09:01:45.2877317Z /tmp/torchinductor_root/kn/cknpoc4m7lfel52vscqqxsnmsmy2zzdshbe3yfn3vziszxllov65.py:90:40: note: called from 2026-02-21T09:01:45.2879131Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:45.2879372Z ^ 2026-02-21T09:01:45.2879788Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:45.2880247Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:45.2880486Z ^ 2026-02-21T09:01:45.2880749Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:01:45.2884199Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:45.2885205Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:01:45.2885429Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:01:45.2885682Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:01:45.2885936Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:01:45.2886129Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:01:45.2886330Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:45.2886519Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:01:45.2886753Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:45.2887013Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:45.2887241Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:45.2887478Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:01:45.2887707Z %cst_4 = arith.constant 
dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:45.2887921Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:01:45.2888144Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:45.2888383Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:45.2888564Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:01:45.2888785Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:45.2888985Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:45.2889172Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:45.2889393Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:45.2889709Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:45.2890037Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:45.2890229Z %2 = arith.subi %c112_i32, %1 : i32 2026-02-21T09:01:45.2890413Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:01:45.2890613Z %3 = arith.subi %c9472_i32, %c1_i32_6 : i32 2026-02-21T09:01:45.2890807Z %4 = arith.addi %2, %3 : i32 2026-02-21T09:01:45.2890990Z %5 = arith.divui %4, %c9472_i32 : i32 2026-02-21T09:01:45.2891181Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:01:45.2891372Z %6 = arith.remsi %5, %c2_i32_7 : i32 2026-02-21T09:01:45.2891761Z %7 = arith.subi %5, %6 : i32 2026-02-21T09:01:45.2891940Z %8 = arith.muli %7, %c9472_i32 : i32 2026-02-21T09:01:45.2892117Z %9 = arith.addi %1, %8 : i32 2026-02-21T09:01:45.2892302Z %10 = arith.muli %c9472_i32, %c2_i32_7 : i32 2026-02-21T09:01:45.2892516Z scf.for %arg3 = %1 to %9 step %10 : i32 { 2026-02-21T09:01:45.2892720Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:45.2892917Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:45.2893100Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:45.2893292Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:45.2893480Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:45.2893676Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:45.2893857Z %17 = arith.addi %12, %16 : 
i32 2026-02-21T09:01:45.2894030Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:45.2894214Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:45.2894447Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:45.2894710Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.2894908Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:45.2895104Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:45.2895289Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.2895493Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:45.2895695Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:01:45.2895888Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:01:45.2896211Z %26 = scf.for %arg4 = %c0_i32 to %c4080_i32 step %c48_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:45.2896576Z %131 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:45.2896825Z %132 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.2897083Z %133 = tt.splat %131 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.2897303Z %134 = arith.addi %133, %132 : tensor<32xi32> 2026-02-21T09:01:45.2897611Z %135 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2897898Z %136 = arith.muli %135, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.2898175Z %137 = tt.expand_dims %134 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.2898474Z %138 = tt.broadcast %136 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2898750Z %139 = tt.broadcast %137 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2898992Z %140 = arith.addi %138, %139 : tensor<64x32xi32> 2026-02-21T09:01:45.2899249Z %141 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2899556Z %142 = tt.addptr %141, %140 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.2899825Z %143 = tt.load %142 : 
tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2900106Z %144 = arith.extf %143 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.2900437Z %145 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.2900808Z %146 = arith.shli %145, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2901036Z %147 = arith.shrsi %146, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2901255Z %148 = arith.shrsi %145, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2901505Z %149 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.2901842Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.2902166Z %151 = tt.expand_dims %150 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.2902493Z %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2902827Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2903122Z %154 = arith.cmpi eq, %151, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2903373Z %155 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2903653Z %156 = tt.broadcast %152 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2903945Z %157 = arith.select %155, %156, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2904231Z %158 = arith.cmpi eq, %151, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2904490Z %159 = tt.broadcast %153 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2904762Z %160 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2905054Z %161 = arith.select %160, %159, %157 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2905336Z %162 = tt.reshape %161 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.2905609Z %163 = arith.sitofp %162 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.2905970Z %164 = 
tt.dot %144, %163, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.2906304Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:45.2906510Z %165 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:45.2906705Z %166 = arith.addi %arg4, %165 : i32 2026-02-21T09:01:45.2906901Z %167 = arith.muli %166, %c2_i32 : i32 2026-02-21T09:01:45.2907139Z %168 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.2907399Z %169 = tt.splat %167 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.2907607Z %170 = arith.addi %169, %168 : tensor<32xi32> 2026-02-21T09:01:45.2907900Z %171 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2908180Z %172 = arith.muli %171, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.2908439Z %173 = tt.expand_dims %170 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.2908774Z %174 = tt.broadcast %172 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2909041Z %175 = tt.broadcast %173 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2909297Z %176 = arith.addi %174, %175 : tensor<64x32xi32> 2026-02-21T09:01:45.2909543Z %177 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2909843Z %178 = tt.addptr %177, %176 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.2910116Z %179 = tt.load %178 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2910356Z %180 = arith.extf %179 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.2910682Z %181 = tt.descriptor_load %0[%166, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.2910982Z %182 = arith.shli %181, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2911207Z %183 = arith.shrsi %182, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2911457Z %184 = arith.shrsi %181, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2911736Z %185 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 
2026-02-21T09:01:45.2912069Z %186 = tt.expand_dims %185 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.2912383Z %187 = tt.expand_dims %186 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.2912715Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2913038Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2913332Z %190 = arith.cmpi eq, %187, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2913591Z %191 = tt.broadcast %190 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2913863Z %192 = tt.broadcast %188 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2914159Z %193 = arith.select %191, %192, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2914436Z %194 = arith.cmpi eq, %187, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2914699Z %195 = tt.broadcast %189 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2914976Z %196 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2915254Z %197 = arith.select %196, %195, %193 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2915540Z %198 = tt.reshape %197 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.2915803Z %199 = arith.sitofp %198 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.2916169Z %200 = tt.dot %180, %199, %164, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.2916488Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:45.2916690Z %201 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:45.2916890Z %202 = arith.addi %arg4, %201 : i32 2026-02-21T09:01:45.2917073Z %203 = arith.muli %202, %c2_i32 : i32 2026-02-21T09:01:45.2917311Z %204 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.2917563Z %205 = tt.splat %203 : i32 -> tensor<32xi32> 
2026-02-21T09:01:45.2917776Z %206 = arith.addi %205, %204 : tensor<32xi32> 2026-02-21T09:01:45.2918025Z %207 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2918305Z %208 = arith.muli %207, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.2918572Z %209 = tt.expand_dims %206 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.2918893Z %210 = tt.broadcast %208 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2919163Z %211 = tt.broadcast %209 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2919401Z %212 = arith.addi %210, %211 : tensor<64x32xi32> 2026-02-21T09:01:45.2919652Z %213 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2919980Z %214 = tt.addptr %213, %212 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.2920246Z %215 = tt.load %214 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2920495Z %216 = arith.extf %215 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.2920815Z %217 = tt.descriptor_load %0[%202, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.2921124Z %218 = arith.shli %217, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2921399Z %219 = arith.shrsi %218, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2921670Z %220 = arith.shrsi %217, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2921925Z %221 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.2922225Z %222 = tt.expand_dims %221 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.2922591Z %223 = tt.expand_dims %222 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.2922951Z %224 = tt.expand_dims %219 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2923286Z %225 = tt.expand_dims %220 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2923582Z %226 = arith.cmpi eq, %223, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2923832Z 
%227 = tt.broadcast %226 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2924143Z %228 = tt.broadcast %224 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2924447Z %229 = arith.select %227, %228, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2924744Z %230 = arith.cmpi eq, %223, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2925007Z %231 = tt.broadcast %225 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2925302Z %232 = tt.broadcast %230 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2925608Z %233 = arith.select %232, %231, %229 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2925909Z %234 = tt.reshape %233 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.2926196Z %235 = arith.sitofp %234 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.2926568Z %236 = tt.dot %216, %235, %200, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.2926910Z scf.yield %236 : tensor<64x64xf32> 2026-02-21T09:01:45.2927100Z } 2026-02-21T09:01:45.2927250Z %27 = arith.muli %c4080_i32, %c2_i32 : i32 2026-02-21T09:01:45.2927506Z %28 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.2927757Z %29 = tt.splat %27 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.2927970Z %30 = arith.addi %29, %28 : tensor<32xi32> 2026-02-21T09:01:45.2928231Z %31 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2928515Z %32 = arith.muli %31, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.2928779Z %33 = tt.expand_dims %30 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.2929086Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2929358Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2929597Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:01:45.2929851Z %37 = tt.splat %arg0 : 
!tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2930140Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.2930443Z %39 = tt.load %38 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2930696Z %40 = arith.extf %39 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.2931038Z %41 = tt.descriptor_load %0[%c4080_i32, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.2931391Z %42 = arith.shli %41, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2931656Z %43 = arith.shrsi %42, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2931883Z %44 = arith.shrsi %41, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2932132Z %45 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.2932435Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.2932761Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.2933088Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2933421Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2933709Z %50 = arith.cmpi eq, %47, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2934001Z %51 = tt.broadcast %50 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2934294Z %52 = tt.broadcast %48 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2934580Z %53 = arith.select %51, %52, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2934877Z %54 = arith.cmpi eq, %47, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2935121Z %55 = tt.broadcast %49 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2935393Z %56 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2935660Z %57 = arith.select %56, %55, %53 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2935939Z %58 = tt.reshape %57 : tensor<16x2x64xi8> -> 
tensor<32x64xi8> 2026-02-21T09:01:45.2936202Z %59 = arith.sitofp %58 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.2936546Z %60 = tt.dot %40, %59, %26, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.2936896Z %61 = arith.truncf %60 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:45.2937178Z %62 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2937447Z %63 = arith.muli %62, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:45.2937697Z %64 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:45.2937980Z %65 = tt.broadcast %63 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.2938240Z %66 = tt.broadcast %64 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.2938468Z %67 = arith.addi %65, %66 : tensor<64x64xi32> 2026-02-21T09:01:45.2938711Z %68 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.2938985Z %69 = tt.addptr %68, %67 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:45.2939243Z tt.store %69, %61 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.2939445Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:01:45.2939647Z %70 = arith.muli %c9472_i32, %c1_i32_8 : i32 2026-02-21T09:01:45.2939845Z %71 = arith.addi %arg3, %70 : i32 2026-02-21T09:01:45.2940031Z %72 = arith.divsi %71, %c448_i32 : i32 2026-02-21T09:01:45.2940221Z %73 = arith.muli %72, %c4_i32 : i32 2026-02-21T09:01:45.2940399Z %74 = arith.subi %c1_i32, %73 : i32 2026-02-21T09:01:45.2940584Z %75 = arith.minsi %74, %c4_i32 : i32 2026-02-21T09:01:45.2940765Z %76 = arith.remsi %71, %c448_i32 : i32 2026-02-21T09:01:45.2940949Z %77 = arith.remsi %76, %75 : i32 2026-02-21T09:01:45.2941130Z %78 = arith.addi %73, %77 : i32 2026-02-21T09:01:45.2941300Z %79 = arith.divsi %76, %75 : i32 2026-02-21T09:01:45.2941510Z %80 = arith.muli %78, %c64_i32 : i32 2026-02-21T09:01:45.2941768Z %81 = tt.make_range {end = 64 : i32, 
start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:45.2942011Z %82 = tt.splat %80 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.2942206Z %83 = arith.addi %82, %81 : tensor<64xi32> 2026-02-21T09:01:45.2942399Z %84 = arith.muli %79, %c64_i32 : i32 2026-02-21T09:01:45.2942613Z %85 = tt.splat %84 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.2942813Z %86 = arith.addi %85, %81 : tensor<64xi32> 2026-02-21T09:01:45.2943017Z %c4080_i32_9 = arith.constant 4080 : i32 2026-02-21T09:01:45.2943214Z %c48_i32_10 = arith.constant 48 : i32 2026-02-21T09:01:45.2943546Z %87 = scf.for %arg4 = %c0_i32 to %c4080_i32_9 step %c48_i32_10 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:45.2943876Z %131 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:45.2944117Z %132 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.2944372Z %133 = tt.splat %131 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.2944586Z %134 = arith.addi %133, %132 : tensor<32xi32> 2026-02-21T09:01:45.2944850Z %135 = tt.expand_dims %83 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2945190Z %136 = arith.muli %135, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.2945466Z %137 = tt.expand_dims %134 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.2945782Z %138 = tt.broadcast %136 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2946054Z %139 = tt.broadcast %137 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2946295Z %140 = arith.addi %138, %139 : tensor<64x32xi32> 2026-02-21T09:01:45.2946539Z %141 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2946834Z %142 = tt.addptr %141, %140 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.2947097Z %143 = tt.load %142 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2947341Z %144 = arith.extf %143 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.2947660Z %145 = tt.descriptor_load %0[%arg4, %84] : !tt.tensordesc> -> 
tensor<16x64xi8> 2026-02-21T09:01:45.2947973Z %146 = arith.shli %145, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2948198Z %147 = arith.shrsi %146, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2948418Z %148 = arith.shrsi %145, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2948669Z %149 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.2948959Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.2949278Z %151 = tt.expand_dims %150 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.2949597Z %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2949925Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2950207Z %154 = arith.cmpi eq, %151, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2950454Z %155 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2950732Z %156 = tt.broadcast %152 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2951016Z %157 = arith.select %155, %156, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2951294Z %158 = arith.cmpi eq, %151, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2951566Z %159 = tt.broadcast %153 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2951833Z %160 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2952111Z %161 = arith.select %160, %159, %157 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2952394Z %162 = tt.reshape %161 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.2952691Z %163 = arith.sitofp %162 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.2953041Z %164 = tt.dot %144, %163, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.2953367Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:01:45.2953595Z %165 
= arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:01:45.2953785Z %166 = arith.addi %arg4, %165 : i32 2026-02-21T09:01:45.2953970Z %167 = arith.muli %166, %c2_i32 : i32 2026-02-21T09:01:45.2954197Z %168 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.2954448Z %169 = tt.splat %167 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.2954650Z %170 = arith.addi %169, %168 : tensor<32xi32> 2026-02-21T09:01:45.2954909Z %171 = tt.expand_dims %83 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2955179Z %172 = arith.muli %171, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.2955438Z %173 = tt.expand_dims %170 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.2955727Z %174 = tt.broadcast %172 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2956008Z %175 = tt.broadcast %173 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2956255Z %176 = arith.addi %174, %175 : tensor<64x32xi32> 2026-02-21T09:01:45.2956523Z %177 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2956813Z %178 = tt.addptr %177, %176 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.2957074Z %179 = tt.load %178 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2957305Z %180 = arith.extf %179 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.2957620Z %181 = tt.descriptor_load %0[%166, %84] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.2957917Z %182 = arith.shli %181, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2958138Z %183 = arith.shrsi %182, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2958355Z %184 = arith.shrsi %181, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2958596Z %185 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.2958889Z %186 = tt.expand_dims %185 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.2959202Z %187 = tt.expand_dims %186 {axis = 2 : i32} : tensor<1x2xi32> -> 
tensor<1x2x1xi32> 2026-02-21T09:01:45.2959522Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2959838Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2960119Z %190 = arith.cmpi eq, %187, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2960368Z %191 = tt.broadcast %190 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2960640Z %192 = tt.broadcast %188 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2960928Z %193 = arith.select %191, %192, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2961200Z %194 = arith.cmpi eq, %187, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2961452Z %195 = tt.broadcast %189 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2961766Z %196 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2962039Z %197 = arith.select %196, %195, %193 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2962322Z %198 = tt.reshape %197 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.2962581Z %199 = arith.sitofp %198 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.2962934Z %200 = tt.dot %180, %199, %164, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.2963280Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:01:45.2963477Z %201 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:01:45.2963672Z %202 = arith.addi %arg4, %201 : i32 2026-02-21T09:01:45.2963854Z %203 = arith.muli %202, %c2_i32 : i32 2026-02-21T09:01:45.2964094Z %204 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.2964369Z %205 = tt.splat %203 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.2964580Z %206 = arith.addi %205, %204 : tensor<32xi32> 2026-02-21T09:01:45.2964833Z %207 = tt.expand_dims %83 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:01:45.2965113Z %208 = arith.muli %207, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.2965381Z %209 = tt.expand_dims %206 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.2965674Z %210 = tt.broadcast %208 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2965946Z %211 = tt.broadcast %209 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2966189Z %212 = arith.addi %210, %211 : tensor<64x32xi32> 2026-02-21T09:01:45.2966442Z %213 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2966738Z %214 = tt.addptr %213, %212 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.2967035Z %215 = tt.load %214 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2967285Z %216 = arith.extf %215 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.2967634Z %217 = tt.descriptor_load %0[%202, %84] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.2967942Z %218 = arith.shli %217, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2968173Z %219 = arith.shrsi %218, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2968412Z %220 = arith.shrsi %217, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2968673Z %221 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.2968981Z %222 = tt.expand_dims %221 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.2969319Z %223 = tt.expand_dims %222 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.2969663Z %224 = tt.expand_dims %219 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2970013Z %225 = tt.expand_dims %220 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2970321Z %226 = arith.cmpi eq, %223, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2970587Z %227 = tt.broadcast %226 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2970879Z %228 = tt.broadcast %224 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:45.2971183Z %229 = arith.select %227, %228, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2971476Z %230 = arith.cmpi eq, %223, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2971777Z %231 = tt.broadcast %225 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2972069Z %232 = tt.broadcast %230 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2972371Z %233 = arith.select %232, %231, %229 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2972669Z %234 = tt.reshape %233 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.2972954Z %235 = arith.sitofp %234 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.2973329Z %236 = tt.dot %216, %235, %200, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.2973672Z scf.yield %236 : tensor<64x64xf32> 2026-02-21T09:01:45.2973855Z } 2026-02-21T09:01:45.2974013Z %88 = arith.muli %c4080_i32_9, %c2_i32 : i32 2026-02-21T09:01:45.2974271Z %89 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.2974530Z %90 = tt.splat %88 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.2974780Z %91 = arith.addi %90, %89 : tensor<32xi32> 2026-02-21T09:01:45.2975045Z %92 = tt.expand_dims %83 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2975332Z %93 = arith.muli %92, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.2975601Z %94 = tt.expand_dims %91 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.2975930Z %95 = tt.broadcast %93 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2976220Z %96 = tt.broadcast %94 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2976446Z %97 = arith.addi %95, %96 : tensor<64x32xi32> 2026-02-21T09:01:45.2976690Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2976964Z %99 = tt.addptr %98, %97 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.2977218Z 
%100 = tt.load %99 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2977458Z %101 = arith.extf %100 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.2977797Z %102 = tt.descriptor_load %0[%c4080_i32_9, %84] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.2978122Z %103 = arith.shli %102, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2978365Z %104 = arith.shrsi %103, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2978591Z %105 = arith.shrsi %102, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2978832Z %106 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.2979156Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.2979478Z %108 = tt.expand_dims %107 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.2979802Z %109 = tt.expand_dims %104 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2980131Z %110 = tt.expand_dims %105 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2980415Z %111 = arith.cmpi eq, %108, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2980677Z %112 = tt.broadcast %111 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2980950Z %113 = tt.broadcast %109 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2981247Z %114 = arith.select %112, %113, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2981562Z %115 = arith.cmpi eq, %108, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2981817Z %116 = tt.broadcast %110 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2982097Z %117 = tt.broadcast %115 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2982378Z %118 = arith.select %117, %116, %114 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2982670Z %119 = tt.reshape %118 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.2982936Z %120 = arith.sitofp %119 : tensor<32x64xi8> to tensor<32x64xf32> 
2026-02-21T09:01:45.2983292Z %121 = tt.dot %101, %120, %87, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.2983651Z %122 = arith.truncf %121 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:45.2983941Z %123 = tt.expand_dims %83 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2984221Z %124 = arith.muli %123, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:45.2984481Z %125 = tt.expand_dims %86 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:45.2984776Z %126 = tt.broadcast %124 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.2985045Z %127 = tt.broadcast %125 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.2985282Z %128 = arith.addi %126, %127 : tensor<64x64xi32> 2026-02-21T09:01:45.2985538Z %129 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.2985829Z %130 = tt.addptr %129, %128 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:45.2986130Z tt.store %130, %122 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.2986334Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:45.2986546Z scf.for %arg3 = %9 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T09:01:45.2986772Z %11 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:01:45.2986964Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:45.2987192Z %13 = arith.subi %c1_i32, %12 : i32 2026-02-21T09:01:45.2987373Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:45.2987568Z %15 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:01:45.2987755Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:45.2987940Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:45.2988123Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:45.2988304Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:45.2988546Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:45.2988790Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.2989003Z %22 = 
arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:45.2989196Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:45.2989398Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.2989615Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:45.2989818Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:01:45.2990011Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:01:45.2990348Z %26 = scf.for %arg4 = %c0_i32 to %c4080_i32 step %c48_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:45.2990674Z %70 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:45.2990904Z %71 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.2991160Z %72 = tt.splat %70 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.2991360Z %73 = arith.addi %72, %71 : tensor<32xi32> 2026-02-21T09:01:45.2991651Z %74 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.2991931Z %75 = arith.muli %74, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.2992186Z %76 = tt.expand_dims %73 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.2992482Z %77 = tt.broadcast %75 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2992742Z %78 = tt.broadcast %76 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.2992987Z %79 = arith.addi %77, %78 : tensor<64x32xi32> 2026-02-21T09:01:45.2993232Z %80 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2993524Z %81 = tt.addptr %80, %79 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.2993792Z %82 = tt.load %81 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.2994023Z %83 = arith.extf %82 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.2994381Z %84 = tt.descriptor_load %0[%arg4, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.2994685Z %85 = arith.shli %84, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2994900Z %86 = arith.shrsi %85, %cst_3 : tensor<16x64xi8> 
2026-02-21T09:01:45.2995118Z %87 = arith.shrsi %84, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.2995371Z %88 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.2995658Z %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.2995971Z %90 = tt.expand_dims %89 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.2996290Z %91 = tt.expand_dims %86 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2996610Z %92 = tt.expand_dims %87 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.2996896Z %93 = arith.cmpi eq, %90, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2997177Z %94 = tt.broadcast %93 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2997455Z %95 = tt.broadcast %91 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2997734Z %96 = arith.select %94, %95, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2998008Z %97 = arith.cmpi eq, %90, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.2998277Z %98 = tt.broadcast %92 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.2998545Z %99 = tt.broadcast %97 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.2998827Z %100 = arith.select %99, %98, %96 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.2999104Z %101 = tt.reshape %100 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.2999374Z %102 = arith.sitofp %101 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.2999727Z %103 = tt.dot %83, %102, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.3000057Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:01:45.3000259Z %104 = arith.muli %c16_i32, %c1_i32_8 : i32 2026-02-21T09:01:45.3000455Z %105 = arith.addi %arg4, %104 : i32 2026-02-21T09:01:45.3000651Z %106 = arith.muli %105, %c2_i32 : i32 2026-02-21T09:01:45.3000912Z %107 = 
tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.3001181Z %108 = tt.splat %106 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.3001420Z %109 = arith.addi %108, %107 : tensor<32xi32> 2026-02-21T09:01:45.3001723Z %110 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.3002009Z %111 = arith.muli %110, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.3002274Z %112 = tt.expand_dims %109 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.3002579Z %113 = tt.broadcast %111 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3002844Z %114 = tt.broadcast %112 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3003091Z %115 = arith.addi %113, %114 : tensor<64x32xi32> 2026-02-21T09:01:45.3003334Z %116 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3003637Z %117 = tt.addptr %116, %115 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.3003913Z %118 = tt.load %117 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3004156Z %119 = arith.extf %118 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.3004484Z %120 = tt.descriptor_load %0[%105, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.3004788Z %121 = arith.shli %120, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3005016Z %122 = arith.shrsi %121, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3005238Z %123 = arith.shrsi %120, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3005493Z %124 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.3005795Z %125 = tt.expand_dims %124 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.3006111Z %126 = tt.expand_dims %125 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.3006445Z %127 = tt.expand_dims %122 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3006778Z %128 = tt.expand_dims %123 {axis = 1 : i32} : 
tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3007073Z %129 = arith.cmpi eq, %126, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3007330Z %130 = tt.broadcast %129 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3007603Z %131 = tt.broadcast %127 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3007899Z %132 = arith.select %130, %131, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3008204Z %133 = arith.cmpi eq, %126, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3008465Z %134 = tt.broadcast %128 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3008735Z %135 = tt.broadcast %133 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3009023Z %136 = arith.select %135, %134, %132 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3009334Z %137 = tt.reshape %136 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.3009593Z %138 = arith.sitofp %137 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.3009955Z %139 = tt.dot %119, %138, %103, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.3010272Z %c2_i32_9 = arith.constant 2 : i32 2026-02-21T09:01:45.3010475Z %140 = arith.muli %c16_i32, %c2_i32_9 : i32 2026-02-21T09:01:45.3010676Z %141 = arith.addi %arg4, %140 : i32 2026-02-21T09:01:45.3010859Z %142 = arith.muli %141, %c2_i32 : i32 2026-02-21T09:01:45.3011096Z %143 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.3011344Z %144 = tt.splat %142 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.3011588Z %145 = arith.addi %144, %143 : tensor<32xi32> 2026-02-21T09:01:45.3011872Z %146 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.3012154Z %147 = arith.muli %146, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.3012447Z %148 = tt.expand_dims %145 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.3012751Z %149 = 
tt.broadcast %147 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3013033Z %150 = tt.broadcast %148 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3013291Z %151 = arith.addi %149, %150 : tensor<64x32xi32> 2026-02-21T09:01:45.3013556Z %152 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3013861Z %153 = tt.addptr %152, %151 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.3014143Z %154 = tt.load %153 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3014404Z %155 = arith.extf %154 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.3014738Z %156 = tt.descriptor_load %0[%141, %23] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.3015062Z %157 = arith.shli %156, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3015155Z %158 = arith.shrsi %157, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3015246Z %159 = arith.shrsi %156, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3015373Z %160 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.3015504Z %161 = tt.expand_dims %160 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.3015642Z %162 = tt.expand_dims %161 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.3015789Z %163 = tt.expand_dims %158 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3015926Z %164 = tt.expand_dims %159 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3016022Z %165 = arith.cmpi eq, %162, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3016144Z %166 = tt.broadcast %165 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3016259Z %167 = tt.broadcast %163 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3016386Z %168 = arith.select %166, %167, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3016489Z %169 = arith.cmpi eq, %162, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3016602Z %170 = tt.broadcast %164 : 
tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3016711Z %171 = tt.broadcast %169 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3016836Z %172 = arith.select %171, %170, %168 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3016986Z %173 = tt.reshape %172 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.3017095Z %174 = arith.sitofp %173 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.3017297Z %175 = tt.dot %155, %174, %139, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.3017408Z scf.yield %175 : tensor<64x64xf32> 2026-02-21T09:01:45.3017465Z } 2026-02-21T09:01:45.3017546Z %27 = arith.muli %c4080_i32, %c2_i32 : i32 2026-02-21T09:01:45.3017669Z %28 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.3017748Z %29 = tt.splat %27 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.3017824Z %30 = arith.addi %29, %28 : tensor<32xi32> 2026-02-21T09:01:45.3017961Z %31 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.3018046Z %32 = arith.muli %31, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.3018173Z %33 = tt.expand_dims %30 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.3018282Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3018395Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3018499Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:01:45.3018616Z %37 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3018783Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.3018871Z %39 = tt.load %38 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3018978Z %40 = arith.extf %39 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.3019161Z %41 = tt.descriptor_load %0[%c4080_i32, %23] : !tt.tensordesc> -> tensor<16x64xi8> 
2026-02-21T09:01:45.3019243Z %42 = arith.shli %41, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3019332Z %43 = arith.shrsi %42, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3019424Z %44 = arith.shrsi %41, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3019538Z %45 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.3019668Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.3019799Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.3019937Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3020065Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3020161Z %50 = arith.cmpi eq, %47, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3020279Z %51 = tt.broadcast %50 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3020389Z %52 = tt.broadcast %48 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3020511Z %53 = arith.select %51, %52, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3020612Z %54 = arith.cmpi eq, %47, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3020721Z %55 = tt.broadcast %49 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3020829Z %56 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3020952Z %57 = arith.select %56, %55, %53 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3021057Z %58 = tt.reshape %57 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.3021160Z %59 = arith.sitofp %58 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.3021357Z %60 = tt.dot %40, %59, %26, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.3021467Z %61 = arith.truncf %60 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:45.3021616Z %62 = tt.expand_dims %22 {axis = 1 
: i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.3021746Z %63 = arith.muli %62, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:45.3021880Z %64 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:45.3021997Z %65 = tt.broadcast %63 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.3022112Z %66 = tt.broadcast %64 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.3022224Z %67 = arith.addi %65, %66 : tensor<64x64xi32> 2026-02-21T09:01:45.3022336Z %68 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.3022450Z %69 = tt.addptr %68, %67 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:45.3022540Z tt.store %69, %61 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.3022609Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:45.3022666Z tt.return 2026-02-21T09:01:45.3022718Z } 2026-02-21T09:01:45.3022779Z } 2026-02-21T09:01:45.3022785Z 2026-02-21T09:01:45.3022837Z {-# 2026-02-21T09:01:45.3022904Z external_resources: { 2026-02-21T09:01:45.3022974Z mlir_reproducer: { 2026-02-21T09:01:45.3027416Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, 
tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:45.3027498Z disable_threading: false, 2026-02-21T09:01:45.3027560Z verify_each: true 2026-02-21T09:01:45.3027622Z } 2026-02-21T09:01:45.3027673Z } 2026-02-21T09:01:45.3027724Z #-} 2026-02-21T09:01:45.3028094Z /tmp/torchinductor_root/kn/cknpoc4m7lfel52vscqqxsnmsmy2zzdshbe3yfn3vziszxllov65.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:01:45.3028718Z /tmp/torchinductor_root/kn/cknpoc4m7lfel52vscqqxsnmsmy2zzdshbe3yfn3vziszxllov65.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:01:45.3028893Z [70s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:01:45.3029853Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:01:45.3029970Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:01:45.3030094Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:01:45.3866339Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:45.3870868Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:45.3872197Z ^ 2026-02-21T09:01:45.3872597Z /tmp/torchinductor_root/34/c34suihzlz7n7krk2zaje4ga62sglfbyac5xyaf6s56bbo4dzknd.py:85:36: note: called from 2026-02-21T09:01:45.3872992Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:45.3873195Z ^ 2026-02-21T09:01:45.3873606Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:45.3874077Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:45.3874319Z ^ 2026-02-21T09:01:45.3874568Z module { 2026-02-21T09:01:45.3878590Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:45.3879259Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:01:45.3879484Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:01:45.3879678Z %c1_i32 = arith.constant 1 : i32 
2026-02-21T09:01:45.3879861Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:01:45.3880048Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:45.3880259Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:45.3880509Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:45.3880747Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:45.3880973Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:01:45.3881212Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:45.3881420Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:01:45.3881742Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:45.3881974Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:45.3882163Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:01:45.3882344Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:45.3882540Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:45.3887192Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:45.3894122Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:45.3894457Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:45.3894789Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:45.3894985Z %2 = arith.divsi %1, %c448_i32 : i32 2026-02-21T09:01:45.3895174Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:01:45.3895372Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:01:45.3895568Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:01:45.3895753Z %6 = arith.remsi %1, %c448_i32 : i32 2026-02-21T09:01:45.3895948Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:01:45.3896123Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:01:45.3896299Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:01:45.3896469Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:01:45.3896707Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:45.3896963Z %12 = tt.splat %10 : i32 -> 
tensor<64xi32> 2026-02-21T09:01:45.3897188Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:01:45.3897434Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:01:45.3897635Z %15 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.3897831Z %16 = arith.addi %15, %11 : tensor<64xi32> 2026-02-21T09:01:45.3898031Z %c64_i32_6 = arith.constant 64 : i32 2026-02-21T09:01:45.3898356Z %17 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c64_i32_6 iter_args(%arg4 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:45.3898728Z %27 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:01:45.3898968Z %28 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.3899212Z %29 = tt.splat %27 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.3899417Z %30 = arith.addi %29, %28 : tensor<32xi32> 2026-02-21T09:01:45.3899669Z %31 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.3899941Z %32 = arith.muli %31, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.3900194Z %33 = tt.expand_dims %30 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.3900488Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3900752Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3900984Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:01:45.3901276Z %37 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3901608Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.3901922Z %39 = tt.load %38 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3902179Z %40 = arith.extf %39 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.3902515Z %41 = tt.descriptor_load %0[%arg3, %14] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.3902830Z %42 = arith.shli %41, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3903054Z %43 = arith.shrsi %42, %cst_3 : tensor<16x64xi8> 
2026-02-21T09:01:45.3903287Z %44 = arith.shrsi %41, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3903541Z %45 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.3903853Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.3904180Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.3904512Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3904851Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3905141Z %50 = arith.cmpi eq, %47, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3905406Z %51 = tt.broadcast %50 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3905683Z %52 = tt.broadcast %48 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3905980Z %53 = arith.select %51, %52, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3906264Z %54 = arith.cmpi eq, %47, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3906517Z %55 = tt.broadcast %49 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3906796Z %56 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3907076Z %57 = arith.select %56, %55, %53 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3907366Z %58 = tt.reshape %57 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.3907639Z %59 = arith.sitofp %58 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.3908009Z %60 = tt.dot %40, %59, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.3908350Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:01:45.3908554Z %61 = arith.muli %c16_i32, %c1_i32_7 : i32 2026-02-21T09:01:45.3908760Z %62 = arith.addi %arg3, %61 : i32 2026-02-21T09:01:45.3908948Z %63 = arith.muli %62, %c2_i32 : i32 2026-02-21T09:01:45.3909226Z %64 = 
tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.3909485Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.3909691Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:45.3909966Z %67 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.3910268Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.3910545Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.3910839Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3911111Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3911361Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:45.3911697Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3911998Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.3912256Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3912496Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.3912803Z %77 = tt.descriptor_load %0[%62, %14] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.3913128Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3913350Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3913606Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3913852Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.3914133Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.3914439Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.3914747Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3915063Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 
2026-02-21T09:01:45.3915345Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3915589Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3915862Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3916140Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3916411Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3916662Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3916923Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3917199Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3917464Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.3917718Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.3918047Z %96 = tt.dot %76, %95, %60, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.3918358Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:01:45.3918554Z %97 = arith.muli %c16_i32, %c2_i32_8 : i32 2026-02-21T09:01:45.3918742Z %98 = arith.addi %arg3, %97 : i32 2026-02-21T09:01:45.3918928Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:45.3919158Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.3919413Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.3919617Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:45.3919874Z %103 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.3920146Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.3920436Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.3920734Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 
2026-02-21T09:01:45.3920997Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3921242Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:45.3921513Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3921850Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.3922119Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3922357Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.3922674Z %113 = tt.descriptor_load %0[%98, %14] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.3922969Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3923196Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3923415Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3923675Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.3924004Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.3924330Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.3924696Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3925021Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3925317Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3925576Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3925844Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3926138Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3926408Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3926666Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:45.3926936Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3927223Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3927513Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.3927776Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.3928130Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.3928447Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:45.3928645Z %133 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:45.3928837Z %134 = arith.addi %arg3, %133 : i32 2026-02-21T09:01:45.3929026Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:45.3929261Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.3929509Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.3929724Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:45.3929978Z %139 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.3930256Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.3930513Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.3930817Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3931083Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.3931319Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:45.3931629Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3931918Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.3932180Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.3932424Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.3932765Z %149 = 
tt.descriptor_load %0[%134, %14] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.3933066Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3933284Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3933507Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.3933752Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.3934049Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.3934364Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.3934702Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3935034Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.3935368Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3935623Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3935931Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3936221Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3936504Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.3936754Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.3937032Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.3937319Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.3937598Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.3937871Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.3938229Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.3938562Z scf.yield %168 : 
tensor<64x64xf32> 2026-02-21T09:01:45.3938737Z } 2026-02-21T09:01:45.3938917Z %18 = arith.truncf %17 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:45.3939206Z %19 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.3939465Z %20 = arith.muli %19, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:45.3939726Z %21 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:45.3940007Z %22 = tt.broadcast %20 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.3940269Z %23 = tt.broadcast %21 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.3940498Z %24 = arith.addi %22, %23 : tensor<64x64xi32> 2026-02-21T09:01:45.3940746Z %25 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.3941029Z %26 = tt.addptr %25, %24 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:45.3941281Z tt.store %26, %18 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.3941478Z tt.return 2026-02-21T09:01:45.3941641Z } 2026-02-21T09:01:45.3941767Z } 2026-02-21T09:01:45.3941838Z 2026-02-21T09:01:45.3941889Z {-# 2026-02-21T09:01:45.3942026Z external_resources: { 2026-02-21T09:01:45.3942183Z mlir_reproducer: { 2026-02-21T09:01:45.3946714Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, 
tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:45.3951434Z disable_threading: false, 2026-02-21T09:01:45.3951643Z verify_each: true 2026-02-21T09:01:45.3951796Z } 2026-02-21T09:01:45.3951930Z } 2026-02-21T09:01:45.3952050Z #-} 2026-02-21T09:01:45.3952494Z /tmp/torchinductor_root/34/c34suihzlz7n7krk2zaje4ga62sglfbyac5xyaf6s56bbo4dzknd.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:01:45.3953550Z /tmp/torchinductor_root/34/c34suihzlz7n7krk2zaje4ga62sglfbyac5xyaf6s56bbo4dzknd.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:01:45.3954394Z [70s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:01:45.3955444Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:01:45.3956380Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:01:45.3956635Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:01:45.8153093Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:01:45.8157483Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:45.8162061Z ^ 2026-02-21T09:01:45.8164294Z /tmp/torchinductor_root/nb/cnbbg7hgf7qc3e54idqylspxgd2rdjqhkc5qwarmuqvkr2si43j3.py:90:40: note: called from 2026-02-21T09:01:45.8164764Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:01:45.8164979Z ^ 2026-02-21T09:01:45.8165391Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:01:45.8165864Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:01:45.8166124Z ^ 2026-02-21T09:01:45.8166335Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:01:45.8170887Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:01:45.8175896Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:01:45.8177644Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:01:45.8178101Z %c16_i32 = arith.constant 16 : i32 
2026-02-21T09:01:45.8178318Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:01:45.8178514Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:01:45.8178752Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:01:45.8179008Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:01:45.8179261Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:01:45.8179506Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:01:45.8179741Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:01:45.8179968Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:01:45.8180194Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:01:45.8180436Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:01:45.8180617Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:01:45.8180868Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:01:45.8181078Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:01:45.8181264Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:01:45.8181498Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:01:45.8181909Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:01:45.8182251Z %1 = tt.get_program_id x : i32 2026-02-21T09:01:45.8182438Z %2 = arith.subi %c112_i32, %1 : i32 2026-02-21T09:01:45.8182631Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:01:45.8182826Z %3 = arith.subi %c9472_i32, %c1_i32 : i32 2026-02-21T09:01:45.8183022Z %4 = arith.addi %2, %3 : i32 2026-02-21T09:01:45.8183208Z %5 = arith.divui %4, %c9472_i32 : i32 2026-02-21T09:01:45.8183396Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:01:45.8183586Z %6 = arith.remsi %5, %c2_i32_6 : i32 2026-02-21T09:01:45.8183768Z %7 = arith.subi %5, %6 : i32 2026-02-21T09:01:45.8183949Z %8 = arith.muli %7, %c9472_i32 : i32 2026-02-21T09:01:45.8184137Z %9 = arith.addi %1, %8 : i32 2026-02-21T09:01:45.8184327Z %10 = arith.muli %c9472_i32, %c2_i32_6 : i32 
2026-02-21T09:01:45.8184543Z scf.for %arg3 = %1 to %9 step %10 : i32 { 2026-02-21T09:01:45.8184747Z %11 = arith.divsi %arg3, %c4_i32 : i32 2026-02-21T09:01:45.8184941Z %12 = arith.muli %11, %c4_i32 : i32 2026-02-21T09:01:45.8185126Z %13 = arith.subi %c112_i32, %12 : i32 2026-02-21T09:01:45.8185317Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:45.8185502Z %15 = arith.remsi %arg3, %c4_i32 : i32 2026-02-21T09:01:45.8185694Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:45.8185873Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:45.8186063Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:45.8186235Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:45.8186461Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:45.8186719Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.8186911Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:45.8187100Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:45.8187295Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.8187487Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:45.8187670Z %c64_i32_7 = arith.constant 64 : i32 2026-02-21T09:01:45.8187994Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_7 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:45.8188320Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:45.8188661Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8188907Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8189115Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:45.8189365Z %67 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8189639Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8189933Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8190220Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> 
tensor<64x32xi32> 2026-02-21T09:01:45.8190477Z %71 = tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8190705Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:45.8190955Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8191237Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8191489Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8191759Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8192079Z %77 = tt.descriptor_load %0[%arg4, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8192410Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8192628Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8192873Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8193116Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8193404Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8193716Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8194030Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8194349Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8194620Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8194870Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8195139Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8195419Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8195687Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8195931Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8196193Z %92 = 
tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8196466Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8196737Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8196993Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8197339Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8197661Z %c1_i32_10 = arith.constant 1 : i32 2026-02-21T09:01:45.8197857Z %97 = arith.muli %c16_i32, %c1_i32_10 : i32 2026-02-21T09:01:45.8198055Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:45.8198238Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:45.8198468Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8198719Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8198921Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:45.8199177Z %103 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8199449Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8199747Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8200045Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8200310Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8200581Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:45.8200825Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8201124Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8201392Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8201654Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8201976Z %113 = tt.descriptor_load %0[%98, %19] : 
!tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8202281Z %114 = arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8202508Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8202725Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8202978Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8203306Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8203661Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8203996Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8204327Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8204622Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8204876Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8205159Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8205458Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8205739Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8206004Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8206281Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8206574Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8206869Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8207134Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8207495Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8207808Z %c2_i32_11 = arith.constant 2 : i32 
2026-02-21T09:01:45.8208008Z %133 = arith.muli %c16_i32, %c2_i32_11 : i32 2026-02-21T09:01:45.8208196Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:45.8208388Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:45.8208626Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8208878Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8209096Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:45.8209350Z %139 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8209629Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8209891Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8210194Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8210464Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8210758Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:45.8211011Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8211301Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8211611Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8211878Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8212208Z %149 = tt.descriptor_load %0[%134, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8212515Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8212735Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8212963Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8213212Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8213517Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8213848Z %155 = tt.expand_dims %154 {axis = 2 : i32} 
: tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8214206Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8214546Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8214859Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8215128Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8215406Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8215705Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8215989Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8216249Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8216534Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8216819Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8217112Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8217383Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8217743Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8218071Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:45.8218260Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:45.8218462Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:45.8218661Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:45.8218899Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8219160Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8219366Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:45.8219628Z %175 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:01:45.8219898Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8220172Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8220464Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8220747Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8221010Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:45.8221269Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8221605Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8221912Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8222172Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8222513Z %185 = tt.descriptor_load %0[%170, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8222826Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8223090Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8223319Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8223588Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8223902Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8224251Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8224600Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8224945Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8225271Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8225539Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8225862Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:45.8226213Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8226499Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8226773Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8227056Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8227359Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8227657Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8227946Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8228332Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8228672Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:45.8228878Z } 2026-02-21T09:01:45.8229064Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:45.8229366Z %28 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8229629Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:45.8229886Z %30 = tt.expand_dims %22 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:45.8230176Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.8230429Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.8230666Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:45.8230907Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.8231191Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:45.8231448Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.8231697Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:01:45.8231903Z %36 = arith.muli %c9472_i32, %c1_i32_8 : i32 2026-02-21T09:01:45.8232099Z %37 = arith.addi 
%arg3, %36 : i32 2026-02-21T09:01:45.8232290Z %38 = arith.divsi %37, %c4_i32 : i32 2026-02-21T09:01:45.8232471Z %39 = arith.muli %38, %c4_i32 : i32 2026-02-21T09:01:45.8232660Z %40 = arith.subi %c112_i32, %39 : i32 2026-02-21T09:01:45.8232839Z %41 = arith.minsi %40, %c4_i32 : i32 2026-02-21T09:01:45.8233026Z %42 = arith.remsi %37, %c4_i32 : i32 2026-02-21T09:01:45.8233210Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:01:45.8233418Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:01:45.8233601Z %45 = arith.divsi %42, %41 : i32 2026-02-21T09:01:45.8233774Z %46 = arith.muli %44, %c64_i32 : i32 2026-02-21T09:01:45.8234009Z %47 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:45.8234255Z %48 = tt.splat %46 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.8234499Z %49 = arith.addi %48, %47 : tensor<64xi32> 2026-02-21T09:01:45.8234687Z %50 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:01:45.8234883Z %51 = tt.splat %50 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.8235080Z %52 = arith.addi %51, %47 : tensor<64xi32> 2026-02-21T09:01:45.8235269Z %c64_i32_9 = arith.constant 64 : i32 2026-02-21T09:01:45.8235584Z %53 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_9 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:45.8235904Z %63 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:45.8236141Z %64 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8236386Z %65 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8236590Z %66 = arith.addi %65, %64 : tensor<32xi32> 2026-02-21T09:01:45.8236876Z %67 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8237149Z %68 = arith.muli %67, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8237413Z %69 = tt.expand_dims %66 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8237725Z %70 = tt.broadcast %68 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8237989Z %71 = 
tt.broadcast %69 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8238222Z %72 = arith.addi %70, %71 : tensor<64x32xi32> 2026-02-21T09:01:45.8238469Z %73 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8238757Z %74 = tt.addptr %73, %72 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8239013Z %75 = tt.load %74 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8239254Z %76 = arith.extf %75 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8239574Z %77 = tt.descriptor_load %0[%arg4, %46] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8239885Z %78 = arith.shli %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8240113Z %79 = arith.shrsi %78, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8240334Z %80 = arith.shrsi %77, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8240586Z %81 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8240877Z %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8241190Z %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8241508Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8241858Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8242148Z %86 = arith.cmpi eq, %83, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8242399Z %87 = tt.broadcast %86 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8242675Z %88 = tt.broadcast %84 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8242960Z %89 = arith.select %87, %88, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8243238Z %90 = arith.cmpi eq, %83, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8243488Z %91 = tt.broadcast %85 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8243763Z %92 = tt.broadcast %90 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 
2026-02-21T09:01:45.8244043Z %93 = arith.select %92, %91, %89 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8244316Z %94 = tt.reshape %93 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8244616Z %95 = arith.sitofp %94 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8244973Z %96 = tt.dot %76, %95, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8245300Z %c1_i32_10 = arith.constant 1 : i32 2026-02-21T09:01:45.8245529Z %97 = arith.muli %c16_i32, %c1_i32_10 : i32 2026-02-21T09:01:45.8245726Z %98 = arith.addi %arg4, %97 : i32 2026-02-21T09:01:45.8245917Z %99 = arith.muli %98, %c2_i32 : i32 2026-02-21T09:01:45.8246149Z %100 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8246406Z %101 = tt.splat %99 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8246613Z %102 = arith.addi %101, %100 : tensor<32xi32> 2026-02-21T09:01:45.8246873Z %103 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8246959Z %104 = arith.muli %103, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8247095Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8247201Z %106 = tt.broadcast %104 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8247333Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8247424Z %108 = arith.addi %106, %107 : tensor<64x32xi32> 2026-02-21T09:01:45.8247578Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8247704Z %110 = tt.addptr %109, %108 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8247790Z %111 = tt.load %110 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8247904Z %112 = arith.extf %111 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8248060Z %113 = tt.descriptor_load %0[%98, %46] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8248145Z %114 
= arith.shli %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8248238Z %115 = arith.shrsi %114, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8248322Z %116 = arith.shrsi %113, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8248434Z %117 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8248565Z %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8248698Z %119 = tt.expand_dims %118 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8248830Z %120 = tt.expand_dims %115 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8248967Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8249062Z %122 = arith.cmpi eq, %119, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8249170Z %123 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8249289Z %124 = tt.broadcast %120 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8249412Z %125 = arith.select %123, %124, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8249504Z %126 = arith.cmpi eq, %119, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8249617Z %127 = tt.broadcast %121 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8249730Z %128 = tt.broadcast %126 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8249851Z %129 = arith.select %128, %127, %125 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8250009Z %130 = tt.reshape %129 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8250122Z %131 = arith.sitofp %130 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8250312Z %132 = tt.dot %112, %131, %96, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8250653Z %c2_i32_11 = arith.constant 2 : i32 2026-02-21T09:01:45.8250886Z %133 = arith.muli %c16_i32, %c2_i32_11 : i32 
2026-02-21T09:01:45.8251097Z %134 = arith.addi %arg4, %133 : i32 2026-02-21T09:01:45.8251286Z %135 = arith.muli %134, %c2_i32 : i32 2026-02-21T09:01:45.8251561Z %136 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8251842Z %137 = tt.splat %135 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8252050Z %138 = arith.addi %137, %136 : tensor<32xi32> 2026-02-21T09:01:45.8252313Z %139 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8252587Z %140 = arith.muli %139, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8252869Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8253164Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8253446Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8253705Z %144 = arith.addi %142, %143 : tensor<64x32xi32> 2026-02-21T09:01:45.8253956Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8254263Z %146 = tt.addptr %145, %144 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8254564Z %147 = tt.load %146 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8254816Z %148 = arith.extf %147 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8255231Z %149 = tt.descriptor_load %0[%134, %46] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8255543Z %150 = arith.shli %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8255775Z %151 = arith.shrsi %150, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8256003Z %152 = arith.shrsi %149, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8256250Z %153 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8256549Z %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8256866Z %155 = tt.expand_dims %154 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8257206Z %156 = 
tt.expand_dims %151 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8257542Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8257825Z %158 = arith.cmpi eq, %155, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8258085Z %159 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8258357Z %160 = tt.broadcast %156 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8258654Z %161 = arith.select %159, %160, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8258929Z %162 = arith.cmpi eq, %155, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8259190Z %163 = tt.broadcast %157 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8259468Z %164 = tt.broadcast %162 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8259750Z %165 = arith.select %164, %163, %161 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8260043Z %166 = tt.reshape %165 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8260311Z %167 = arith.sitofp %166 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8260675Z %168 = tt.dot %148, %167, %132, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8260999Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:45.8261191Z %169 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:45.8261391Z %170 = arith.addi %arg4, %169 : i32 2026-02-21T09:01:45.8261596Z %171 = arith.muli %170, %c2_i32 : i32 2026-02-21T09:01:45.8261835Z %172 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8262119Z %173 = tt.splat %171 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8262335Z %174 = arith.addi %173, %172 : tensor<32xi32> 2026-02-21T09:01:45.8262596Z %175 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8262867Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 
2026-02-21T09:01:45.8263162Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8263454Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8263725Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8263968Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:01:45.8264233Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8264548Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8264831Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8265090Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8265424Z %185 = tt.descriptor_load %0[%170, %46] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8265776Z %186 = arith.shli %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8266008Z %187 = arith.shrsi %186, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8266265Z %188 = arith.shrsi %185, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8266529Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8266838Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8267178Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8267517Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8267865Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8268165Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8268430Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8268723Z %196 = tt.broadcast %192 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8269029Z %197 = arith.select %195, %196, %cst : tensor<16x2x64xi1>, 
tensor<16x2x64xi8> 2026-02-21T09:01:45.8269320Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8269582Z %199 = tt.broadcast %193 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8269873Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8270172Z %201 = arith.select %200, %199, %197 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8270470Z %202 = tt.reshape %201 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8270752Z %203 = arith.sitofp %202 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8271121Z %204 = tt.dot %184, %203, %168, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8271463Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:01:45.8271682Z } 2026-02-21T09:01:45.8271868Z %54 = arith.truncf %53 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:45.8272176Z %55 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8272452Z %56 = arith.muli %55, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:45.8272721Z %57 = tt.expand_dims %49 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:45.8273018Z %58 = tt.broadcast %56 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.8273293Z %59 = tt.broadcast %57 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.8273574Z %60 = arith.addi %58, %59 : tensor<64x64xi32> 2026-02-21T09:01:45.8273827Z %61 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.8274133Z %62 = tt.addptr %61, %60 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:45.8274389Z tt.store %62, %54 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.8274622Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:45.8274828Z scf.for %arg3 = %9 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T09:01:45.8275060Z %11 = arith.divsi %arg3, %c4_i32 : i32 2026-02-21T09:01:45.8275246Z %12 = arith.muli %11, %c4_i32 : 
i32 2026-02-21T09:01:45.8275433Z %13 = arith.subi %c112_i32, %12 : i32 2026-02-21T09:01:45.8275627Z %14 = arith.minsi %13, %c4_i32 : i32 2026-02-21T09:01:45.8275811Z %15 = arith.remsi %arg3, %c4_i32 : i32 2026-02-21T09:01:45.8275999Z %16 = arith.remsi %15, %14 : i32 2026-02-21T09:01:45.8276184Z %17 = arith.addi %12, %16 : i32 2026-02-21T09:01:45.8276357Z %18 = arith.divsi %15, %14 : i32 2026-02-21T09:01:45.8276539Z %19 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:01:45.8276768Z %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:01:45.8277013Z %21 = tt.splat %19 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.8277242Z %22 = arith.addi %21, %20 : tensor<64xi32> 2026-02-21T09:01:45.8277431Z %23 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:01:45.8277645Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:01:45.8277836Z %25 = arith.addi %24, %20 : tensor<64xi32> 2026-02-21T09:01:45.8278030Z %c64_i32_7 = arith.constant 64 : i32 2026-02-21T09:01:45.8278350Z %26 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_7 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:01:45.8278671Z %36 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:01:45.8278910Z %37 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8279157Z %38 = tt.splat %36 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8279366Z %39 = arith.addi %38, %37 : tensor<32xi32> 2026-02-21T09:01:45.8279616Z %40 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8279890Z %41 = arith.muli %40, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8280150Z %42 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8280432Z %43 = tt.broadcast %41 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8280693Z %44 = tt.broadcast %42 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8280926Z %45 = arith.addi %43, %44 : tensor<64x32xi32> 
2026-02-21T09:01:45.8281171Z %46 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8281451Z %47 = tt.addptr %46, %45 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8281739Z %48 = tt.load %47 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8281982Z %49 = arith.extf %48 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8282301Z %50 = tt.descriptor_load %0[%arg4, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8282608Z %51 = arith.shli %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8282822Z %52 = arith.shrsi %51, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8283043Z %53 = arith.shrsi %50, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8283283Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8283579Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8283892Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8284210Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8284605Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8284883Z %59 = arith.cmpi eq, %56, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8285140Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8285416Z %61 = tt.broadcast %57 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8285724Z %62 = arith.select %60, %61, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8285995Z %63 = arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8286241Z %64 = tt.broadcast %58 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8286511Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8286781Z %66 = arith.select %65, %64, %62 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8287062Z 
%67 = tt.reshape %66 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8287323Z %68 = arith.sitofp %67 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8287669Z %69 = tt.dot %49, %68, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8287993Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:01:45.8288229Z %70 = arith.muli %c16_i32, %c1_i32_8 : i32 2026-02-21T09:01:45.8288437Z %71 = arith.addi %arg4, %70 : i32 2026-02-21T09:01:45.8288631Z %72 = arith.muli %71, %c2_i32 : i32 2026-02-21T09:01:45.8288885Z %73 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8289141Z %74 = tt.splat %72 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8289338Z %75 = arith.addi %74, %73 : tensor<32xi32> 2026-02-21T09:01:45.8289590Z %76 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8289855Z %77 = arith.muli %76, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8290114Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8290402Z %79 = tt.broadcast %77 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8290654Z %80 = tt.broadcast %78 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8290895Z %81 = arith.addi %79, %80 : tensor<64x32xi32> 2026-02-21T09:01:45.8291135Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8291426Z %83 = tt.addptr %82, %81 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8291700Z %84 = tt.load %83 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8291943Z %85 = arith.extf %84 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8292256Z %86 = tt.descriptor_load %0[%71, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8292546Z %87 = arith.shli %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8292767Z %88 = arith.shrsi %87, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8292980Z %89 = 
arith.shrsi %86, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8293230Z %90 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8293517Z %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8293830Z %92 = tt.expand_dims %91 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8294156Z %93 = tt.expand_dims %88 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8294468Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8294751Z %95 = arith.cmpi eq, %92, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8294998Z %96 = tt.broadcast %95 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8295268Z %97 = tt.broadcast %93 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8295579Z %98 = arith.select %96, %97, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8295841Z %99 = arith.cmpi eq, %92, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8296098Z %100 = tt.broadcast %94 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8296364Z %101 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8296671Z %102 = arith.select %101, %100, %98 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8296954Z %103 = tt.reshape %102 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8297222Z %104 = arith.sitofp %103 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8297582Z %105 = tt.dot %85, %104, %69, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8297895Z %c2_i32_9 = arith.constant 2 : i32 2026-02-21T09:01:45.8298094Z %106 = arith.muli %c16_i32, %c2_i32_9 : i32 2026-02-21T09:01:45.8298291Z %107 = arith.addi %arg4, %106 : i32 2026-02-21T09:01:45.8298483Z %108 = arith.muli %107, %c2_i32 : i32 2026-02-21T09:01:45.8298713Z %109 = tt.make_range {end = 32 : i32, 
start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8298972Z %110 = tt.splat %108 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8299212Z %111 = arith.addi %110, %109 : tensor<32xi32> 2026-02-21T09:01:45.8299468Z %112 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8299783Z %113 = arith.muli %112, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8300048Z %114 = tt.expand_dims %111 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8300347Z %115 = tt.broadcast %113 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8300610Z %116 = tt.broadcast %114 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8300861Z %117 = arith.addi %115, %116 : tensor<64x32xi32> 2026-02-21T09:01:45.8301112Z %118 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8301407Z %119 = tt.addptr %118, %117 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8301705Z %120 = tt.load %119 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8301949Z %121 = arith.extf %120 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8302289Z %122 = tt.descriptor_load %0[%107, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8302609Z %123 = arith.shli %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8302838Z %124 = arith.shrsi %123, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8303064Z %125 = arith.shrsi %122, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8303312Z %126 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8303613Z %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8303926Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8304259Z %129 = tt.expand_dims %124 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8304590Z %130 = tt.expand_dims %125 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 
2026-02-21T09:01:45.8304878Z %131 = arith.cmpi eq, %128, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8305143Z %132 = tt.broadcast %131 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8305418Z %133 = tt.broadcast %129 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8305715Z %134 = arith.select %132, %133, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8306000Z %135 = arith.cmpi eq, %128, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8306252Z %136 = tt.broadcast %130 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8306566Z %137 = tt.broadcast %135 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8306843Z %138 = arith.select %137, %136, %134 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8307136Z %139 = tt.reshape %138 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8307400Z %140 = arith.sitofp %139 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8307788Z %141 = tt.dot %121, %140, %105, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8308115Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:01:45.8308317Z %142 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:01:45.8308530Z %143 = arith.addi %arg4, %142 : i32 2026-02-21T09:01:45.8308725Z %144 = arith.muli %143, %c2_i32 : i32 2026-02-21T09:01:45.8308972Z %145 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:01:45.8309234Z %146 = tt.splat %144 : i32 -> tensor<32xi32> 2026-02-21T09:01:45.8309460Z %147 = arith.addi %146, %145 : tensor<32xi32> 2026-02-21T09:01:45.8309732Z %148 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8310015Z %149 = arith.muli %148, %cst_4 : tensor<64x1xi32> 2026-02-21T09:01:45.8310318Z %150 = tt.expand_dims %147 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:01:45.8310630Z %151 = tt.broadcast %149 : tensor<64x1xi32> -> 
tensor<64x32xi32> 2026-02-21T09:01:45.8310939Z %152 = tt.broadcast %150 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:01:45.8311193Z %153 = arith.addi %151, %152 : tensor<64x32xi32> 2026-02-21T09:01:45.8311457Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8311798Z %155 = tt.addptr %154, %153 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:01:45.8312079Z %156 = tt.load %155 : tensor<64x32x!tt.ptr> 2026-02-21T09:01:45.8312339Z %157 = arith.extf %156 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:01:45.8312673Z %158 = tt.descriptor_load %0[%143, %19] : !tt.tensordesc> -> tensor<16x64xi8> 2026-02-21T09:01:45.8312997Z %159 = arith.shli %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8313238Z %160 = arith.shrsi %159, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8313472Z %161 = arith.shrsi %158, %cst_3 : tensor<16x64xi8> 2026-02-21T09:01:45.8313740Z %162 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:01:45.8314048Z %163 = tt.expand_dims %162 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:01:45.8314381Z %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:01:45.8314724Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8315079Z %166 = tt.expand_dims %161 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:01:45.8315388Z %167 = arith.cmpi eq, %164, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8315655Z %168 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8315952Z %169 = tt.broadcast %165 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:01:45.8316270Z %170 = arith.select %168, %169, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8316558Z %171 = arith.cmpi eq, %164, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:01:45.8316819Z %172 = tt.broadcast %166 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 
2026-02-21T09:01:45.8317090Z %173 = tt.broadcast %171 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:01:45.8317378Z %174 = arith.select %173, %172, %170 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:01:45.8317662Z %175 = tt.reshape %174 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:01:45.8317929Z %176 = arith.sitofp %175 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:01:45.8318304Z %177 = tt.dot %157, %176, %141, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:01:45.8318626Z scf.yield %177 : tensor<64x64xf32> 2026-02-21T09:01:45.8318807Z } 2026-02-21T09:01:45.8318982Z %27 = arith.truncf %26 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:01:45.8319295Z %28 = tt.expand_dims %25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:01:45.8319554Z %29 = arith.muli %28, %cst_0 : tensor<64x1xi32> 2026-02-21T09:01:45.8319811Z %30 = tt.expand_dims %22 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:01:45.8320093Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.8320359Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:01:45.8320595Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:01:45.8320834Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.8321118Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:01:45.8321370Z tt.store %35, %27 : tensor<64x64x!tt.ptr> 2026-02-21T09:01:45.8321602Z } {tt.num_stages = 1 : i32} 2026-02-21T09:01:45.8321794Z tt.return 2026-02-21T09:01:45.8321933Z } 2026-02-21T09:01:45.8322060Z } 2026-02-21T09:01:45.8322131Z 2026-02-21T09:01:45.8322182Z {-# 2026-02-21T09:01:45.8322319Z external_resources: { 2026-02-21T09:01:45.8322501Z mlir_reproducer: { 2026-02-21T09:01:45.8326779Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 
threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:01:45.8331315Z disable_threading: false, 2026-02-21T09:01:45.8331484Z verify_each: true 2026-02-21T09:01:45.8331658Z } 2026-02-21T09:01:45.8331780Z } 2026-02-21T09:01:45.8331905Z #-} 2026-02-21T09:01:45.8332331Z /tmp/torchinductor_root/nb/cnbbg7hgf7qc3e54idqylspxgd2rdjqhkc5qwarmuqvkr2si43j3.py:19:0: error: Failures have been detected 
while processing an MLIR pass pipeline 2026-02-21T09:01:45.8333343Z /tmp/torchinductor_root/nb/cnbbg7hgf7qc3e54idqylspxgd2rdjqhkc5qwarmuqvkr2si43j3.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:01:45.8334189Z [71s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:01:45.8335371Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:01:45.8336479Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:01:45.8336741Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:01:45.8337876Z [71s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]) 2026-02-21T09:01:45.8338956Z Tensor-likes are not close! 
2026-02-21T09:01:45.8339068Z 2026-02-21T09:01:45.8339146Z Mismatched elements: 456592 / 458752 (99.5%) 2026-02-21T09:01:45.8339439Z Greatest absolute difference: 3072.0 at index (25, 3953) (up to 0.01 allowed) 2026-02-21T09:01:45.8339778Z Greatest relative difference: 532480.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:01:45.8340076Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:01:45.8340235Z 2026-02-21T09:01:46.9011810Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 12.8 configs/s 2026-02-21T09:01:52.2230838Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 187.9 2026-02-21T09:01:52.2231766Z configs/s 2026-02-21T09:01:52.3853791Z [77s] Generation 2 complete: 2026-02-21T09:01:52.3858071Z error=17 2026-02-21T09:01:52.3862345Z ok=82 2026-02-21T09:01:52.3866350Z min=0.1311 2026-02-21T09:01:52.3867689Z mid=0.2377 2026-02-21T09:01:52.3867860Z max=2.6327 2026-02-21T09:01:52.3868016Z best={'block_sizes': [32, 64, 64], 2026-02-21T09:01:52.3868258Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:01:52.3868498Z 'l2_groupings': [4], 2026-02-21T09:01:52.3868674Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:01:52.3868870Z 'loop_orders': [[1, 0]], 2026-02-21T09:01:52.3869024Z 'num_stages': 1, 2026-02-21T09:01:52.3869170Z 'num_warps': 8, 2026-02-21T09:01:52.3869310Z 'pid_type': 'flat', 2026-02-21T09:01:52.3869473Z 'range_flattens': [None, None], 2026-02-21T09:01:52.3869657Z 'range_multi_buffers': [None, False], 2026-02-21T09:01:52.3869848Z 'range_num_stages': [0, 4], 2026-02-21T09:01:52.3870020Z 'range_unroll_factors': [0, 0], 2026-02-21T09:01:52.3870198Z 'range_warp_specializes': [None, None]} 2026-02-21T09:01:52.3874413Z [77s] Fitting surrogate: 305 points, 305 targets 2026-02-21T09:01:53.7069612Z [79s] Generation 3 starting: 96 neighbors, 5 active search path(s) 2026-02-21T09:02:01.2579891Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 11.4 
configs/s 2026-02-21T09:02:05.3382345Z [90s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]) 2026-02-21T09:02:05.3383474Z Tensor-likes are not close! 2026-02-21T09:02:05.3387247Z 2026-02-21T09:02:05.3387355Z Mismatched elements: 457785 / 458752 (99.8%) 2026-02-21T09:02:05.3387672Z Greatest absolute difference: 6752.0 at index (57, 1532) (up to 0.01 allowed) 2026-02-21T09:02:05.3388036Z Greatest relative difference: 540672.0 at index (44, 4279) (up to 0.01 allowed) 2026-02-21T09:02:05.3388366Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:02:05.3392584Z 2026-02-21T09:02:05.6369608Z [91s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]) 2026-02-21T09:02:05.6370829Z Tensor-likes are not close! 2026-02-21T09:02:05.6370971Z 2026-02-21T09:02:05.6371067Z Mismatched elements: 457785 / 458752 (99.8%) 2026-02-21T09:02:05.6371383Z Greatest absolute difference: 6752.0 at index (57, 1532) (up to 0.01 allowed) 2026-02-21T09:02:05.6371978Z Greatest relative difference: 540672.0 at index (44, 4279) (up to 0.01 allowed) 2026-02-21T09:02:05.6372538Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:02:05.6372740Z 2026-02-21T09:02:07.1336841Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 16.8 configs/s 2026-02-21T09:02:12.8515937Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 175.1 2026-02-21T09:02:12.8519998Z configs/s 2026-02-21T09:02:13.0246388Z [98s] Generation 3 complete: 2026-02-21T09:02:13.0252521Z error=3 2026-02-21T09:02:13.0258140Z ok=99 2026-02-21T09:02:13.0258356Z min=0.1197 2026-02-21T09:02:13.0262179Z mid=0.2038 2026-02-21T09:02:13.0266938Z max=14.1957 2026-02-21T09:02:13.0268945Z best={'block_sizes': [32, 64, 32], 2026-02-21T09:02:13.0269225Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:02:13.0269452Z 'l2_groupings': [4], 2026-02-21T09:02:13.0269639Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:02:13.0269834Z 'loop_orders': [[1, 0]], 2026-02-21T09:02:13.0270000Z 'num_stages': 1, 2026-02-21T09:02:13.0270170Z 'num_warps': 2, 2026-02-21T09:02:13.0270316Z 'pid_type': 'flat', 2026-02-21T09:02:13.0270480Z 'range_flattens': [None, None], 2026-02-21T09:02:13.0270667Z 'range_multi_buffers': [None, False], 2026-02-21T09:02:13.0270857Z 'range_num_stages': [0, 4], 2026-02-21T09:02:13.0271023Z 'range_unroll_factors': [0, 0], 2026-02-21T09:02:13.0271207Z 'range_warp_specializes': [None, None]} 2026-02-21T09:02:13.0271419Z [98s] Fitting surrogate: 407 points, 407 targets 2026-02-21T09:02:14.3456078Z [99s] Generation 4 starting: 94 neighbors, 5 active search path(s) 2026-02-21T09:02:22.4859483Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 6.7 configs/s 2026-02-21T09:02:27.0723980Z [112s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], 
range_multi_buffers=[False, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]) 2026-02-21T09:02:27.0725332Z Tensor-likes are not close! 2026-02-21T09:02:27.0729691Z 2026-02-21T09:02:27.0733726Z Mismatched elements: 456592 / 458752 (99.5%) 2026-02-21T09:02:27.0735277Z Greatest absolute difference: 3072.0 at index (25, 3953) (up to 0.01 allowed) 2026-02-21T09:02:27.0735659Z Greatest relative difference: 532480.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:02:27.0735972Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:02:27.0736409Z 2026-02-21T09:02:28.2944139Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 16.6 configs/s 2026-02-21T09:02:36.3833569Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 129.6 2026-02-21T09:02:36.3837160Z configs/s 2026-02-21T09:02:36.6184701Z [122s] Generation 4 complete: 2026-02-21T09:02:36.6189053Z error=3 2026-02-21T09:02:36.6192984Z ok=96 2026-02-21T09:02:36.6195058Z min=0.1095 2026-02-21T09:02:36.6195295Z mid=0.1731 2026-02-21T09:02:36.6200347Z max=27.7842 2026-02-21T09:02:36.6204758Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:02:36.6207929Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:02:36.6208207Z 'l2_groupings': [1], 2026-02-21T09:02:36.6212317Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:02:36.6216784Z 'loop_orders': [[0, 1]], 2026-02-21T09:02:36.6221310Z 'num_stages': 3, 2026-02-21T09:02:36.6223166Z 'num_warps': 1, 2026-02-21T09:02:36.6223380Z 'pid_type': 'flat', 2026-02-21T09:02:36.6223553Z 'range_flattens': [None, None], 2026-02-21T09:02:36.6223749Z 'range_multi_buffers': [None, True], 2026-02-21T09:02:36.6223946Z 'range_num_stages': [0, 0], 2026-02-21T09:02:36.6224114Z 'range_unroll_factors': [0, 0], 2026-02-21T09:02:36.6224305Z 'range_warp_specializes': [None, None]} 2026-02-21T09:02:36.6224786Z [122s] Fitting surrogate: 506 points, 506 targets 2026-02-21T09:02:37.9742386Z 
[123s] Generation 5 starting: 92 neighbors, 5 active search path(s) 2026-02-21T09:02:46.3639000Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 5.1 configs/s 2026-02-21T09:02:49.5903850Z [135s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 64, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]) 2026-02-21T09:02:49.5905526Z Tensor-likes are not close! 2026-02-21T09:02:49.5905693Z 2026-02-21T09:02:49.5905800Z Mismatched elements: 457772 / 458752 (99.8%) 2026-02-21T09:02:49.5906179Z Greatest absolute difference: 5696.0 at index (42, 5979) (up to 0.01 allowed) 2026-02-21T09:02:49.5906597Z Greatest relative difference: 368640.0 at index (41, 3764) (up to 0.01 allowed) 2026-02-21T09:02:49.5906975Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:02:49.5907233Z 2026-02-21T09:02:49.6536236Z [135s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]) 2026-02-21T09:02:49.6537365Z Tensor-likes are not close! 
2026-02-21T09:02:49.6538068Z 2026-02-21T09:02:49.6538179Z Mismatched elements: 457785 / 458752 (99.8%) 2026-02-21T09:02:49.6538464Z Greatest absolute difference: 6752.0 at index (57, 1532) (up to 0.01 allowed) 2026-02-21T09:02:49.6538817Z Greatest relative difference: 540672.0 at index (44, 4279) (up to 0.01 allowed) 2026-02-21T09:02:49.6539120Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:02:49.6539283Z 2026-02-21T09:02:50.3550165Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:02:50.3554778Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:02:50.3556858Z ^ 2026-02-21T09:02:50.3557269Z /tmp/torchinductor_root/7f/c7fnzrkkisnfmjd5xpp5ka2e57wl6pcqeq7p6dlx2ggnqauwc57j.py:87:40: note: called from 2026-02-21T09:02:50.3557950Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:02:50.3558174Z ^ 2026-02-21T09:02:50.3558587Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:02:50.3559161Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:02:50.3559411Z ^ 2026-02-21T09:02:50.3559605Z module { 2026-02-21T09:02:50.3560267Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:02:50.3560840Z %cst = arith.constant dense<0> : tensor<8x2x64xi8> 2026-02-21T09:02:50.3561070Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:02:50.3561258Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:02:50.3561452Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:02:50.3561805Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:02:50.3561992Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:02:50.3562212Z %cst_0 = arith.constant 
dense<7168> : tensor<64x1xi32> 2026-02-21T09:02:50.3562520Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:02:50.3562761Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:02:50.3562981Z %cst_3 = arith.constant dense<4> : tensor<8x64xi8> 2026-02-21T09:02:50.3566754Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:02:50.3567096Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:02:50.3572842Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:02:50.3575160Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:02:50.3575516Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:02:50.3575760Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:02:50.3575997Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:02:50.3576219Z %0 = tt.get_program_id x : i32 2026-02-21T09:02:50.3576426Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:02:50.3576651Z %2 = arith.minsi %1, %c112_i32 : i32 2026-02-21T09:02:50.3576873Z %3 = arith.subi %2, %0 : i32 2026-02-21T09:02:50.3577079Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:02:50.3577302Z %4 = arith.subi %c1_i32, %c1_i32_7 : i32 2026-02-21T09:02:50.3577529Z %5 = arith.addi %3, %4 : i32 2026-02-21T09:02:50.3577723Z %6 = arith.divui %5, %c1_i32 : i32 2026-02-21T09:02:50.3577902Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:02:50.3578086Z %7 = arith.remsi %6, %c2_i32_8 : i32 2026-02-21T09:02:50.3578267Z %8 = arith.subi %6, %7 : i32 2026-02-21T09:02:50.3578431Z %9 = arith.muli %8, %c1_i32 : i32 2026-02-21T09:02:50.3578610Z %10 = arith.addi %0, %9 : i32 2026-02-21T09:02:50.3578793Z %11 = arith.muli %c1_i32, %c2_i32_8 : i32 2026-02-21T09:02:50.3579008Z scf.for %arg3 = %0 to %10 step %11 : i32 { 2026-02-21T09:02:50.3579211Z %12 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:02:50.3579408Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:02:50.3579589Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:02:50.3579779Z %15 = arith.minsi %14, 
%c4_i32 : i32 2026-02-21T09:02:50.3579976Z %16 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:02:50.3580160Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:02:50.3580345Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:02:50.3580517Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:02:50.3580699Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:02:50.3580932Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:50.3581198Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.3581396Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:02:50.3581663Z %24 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:02:50.3588250Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.3588450Z %26 = arith.addi %25, %21 : tensor<64xi32> 2026-02-21T09:02:50.3588647Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:02:50.3588976Z %27 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:50.3589428Z %64 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3589717Z %65 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3589936Z %66 = arith.addi %65, %64 : tensor<8xi32> 2026-02-21T09:02:50.3590139Z %67 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:50.3590368Z %68 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3590619Z %69 = tt.splat %67 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3590815Z %70 = arith.addi %69, %68 : tensor<16xi32> 2026-02-21T09:02:50.3591080Z %71 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3591348Z %72 = arith.muli %71, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3591642Z %73 = tt.expand_dims %70 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3591971Z %74 = tt.broadcast %72 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3592231Z %75 = tt.broadcast %73 : 
tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3592480Z %76 = arith.addi %74, %75 : tensor<64x16xi32> 2026-02-21T09:02:50.3592735Z %77 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3593038Z %78 = tt.addptr %77, %76 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3593310Z %79 = tt.load %78 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3593553Z %80 = arith.extf %79 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3593850Z %81 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3594123Z %82 = arith.muli %81, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3594393Z %83 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3594684Z %84 = tt.broadcast %82 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3594962Z %85 = tt.broadcast %83 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3595210Z %86 = arith.addi %84, %85 : tensor<8x64xi32> 2026-02-21T09:02:50.3595448Z %87 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3595724Z %88 = tt.addptr %87, %86 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3596017Z %89 = tt.load %88 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3596289Z %90 = arith.shli %89, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3596509Z %91 = arith.shrsi %90, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3596737Z %92 = arith.shrsi %89, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3596985Z %93 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3597275Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3597595Z %95 = tt.expand_dims %94 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3597915Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3598237Z %97 = tt.expand_dims %92 {axis = 1 : 
i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3598589Z %98 = arith.cmpi eq, %95, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3598837Z %99 = tt.broadcast %98 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3599113Z %100 = tt.broadcast %96 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3599404Z %101 = arith.select %99, %100, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3599720Z %102 = arith.cmpi eq, %95, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3599971Z %103 = tt.broadcast %97 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3600243Z %104 = tt.broadcast %102 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3600531Z %105 = arith.select %104, %103, %101 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3600837Z %106 = tt.reshape %105 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3601131Z %107 = arith.sitofp %106 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3601488Z %108 = tt.dot %80, %107, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3601868Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:02:50.3602072Z %109 = arith.muli %c8_i32, %c1_i32_11 : i32 2026-02-21T09:02:50.3602267Z %110 = arith.addi %arg4, %109 : i32 2026-02-21T09:02:50.3602506Z %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3602762Z %112 = tt.splat %110 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3602986Z %113 = arith.addi %112, %111 : tensor<8xi32> 2026-02-21T09:02:50.3603218Z %114 = arith.muli %110, %c2_i32 : i32 2026-02-21T09:02:50.3603470Z %115 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3603734Z %116 = tt.splat %114 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3603952Z %117 = arith.addi %116, %115 : tensor<16xi32> 2026-02-21T09:02:50.3604226Z %118 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:02:50.3604510Z %119 = arith.muli %118, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3604793Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3605094Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3605376Z %122 = tt.broadcast %120 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3605636Z %123 = arith.addi %121, %122 : tensor<64x16xi32> 2026-02-21T09:02:50.3605888Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3606201Z %125 = tt.addptr %124, %123 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3606483Z %126 = tt.load %125 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3606743Z %127 = arith.extf %126 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3607041Z %128 = tt.expand_dims %113 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3607329Z %129 = arith.muli %128, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3607610Z %130 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3607916Z %131 = tt.broadcast %129 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3608203Z %132 = tt.broadcast %130 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3608451Z %133 = arith.addi %131, %132 : tensor<8x64xi32> 2026-02-21T09:02:50.3608706Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3608991Z %135 = tt.addptr %134, %133 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3609317Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3609603Z %137 = arith.shli %136, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3609828Z %138 = arith.shrsi %137, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3610057Z %139 = arith.shrsi %136, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3610306Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 
2026-02-21T09:02:50.3610616Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3610985Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3611324Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3611704Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3612034Z %145 = arith.cmpi eq, %142, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3612307Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3612617Z %147 = tt.broadcast %143 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3612922Z %148 = arith.select %146, %147, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3613204Z %149 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3613454Z %150 = tt.broadcast %144 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3613725Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3613995Z %152 = arith.select %151, %150, %148 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3614277Z %153 = tt.reshape %152 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3614540Z %154 = arith.sitofp %153 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3614920Z %155 = tt.dot %127, %154, %108, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3615252Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:02:50.3615448Z %156 = arith.muli %c8_i32, %c2_i32_12 : i32 2026-02-21T09:02:50.3615645Z %157 = arith.addi %arg4, %156 : i32 2026-02-21T09:02:50.3615872Z %158 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3616122Z %159 = tt.splat %157 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3616332Z %160 = arith.addi %159, %158 : tensor<8xi32> 
2026-02-21T09:02:50.3616528Z %161 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:02:50.3616765Z %162 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3617012Z %163 = tt.splat %161 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3617223Z %164 = arith.addi %163, %162 : tensor<16xi32> 2026-02-21T09:02:50.3617479Z %165 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3617759Z %166 = arith.muli %165, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3618027Z %167 = tt.expand_dims %164 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3618321Z %168 = tt.broadcast %166 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3618592Z %169 = tt.broadcast %167 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3618833Z %170 = arith.addi %168, %169 : tensor<64x16xi32> 2026-02-21T09:02:50.3619085Z %171 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3619378Z %172 = tt.addptr %171, %170 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3619651Z %173 = tt.load %172 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3619896Z %174 = arith.extf %173 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3620180Z %175 = tt.expand_dims %160 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3620455Z %176 = arith.muli %175, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3620713Z %177 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3621011Z %178 = tt.broadcast %176 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3621285Z %179 = tt.broadcast %177 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3621527Z %180 = arith.addi %178, %179 : tensor<8x64xi32> 2026-02-21T09:02:50.3621814Z %181 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3622111Z %182 = tt.addptr %181, %180 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 
2026-02-21T09:02:50.3622421Z %183 = tt.load %182 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3622691Z %184 = arith.shli %183, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3622917Z %185 = arith.shrsi %184, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3623170Z %186 = arith.shrsi %183, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3623459Z %187 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3623765Z %188 = tt.expand_dims %187 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3624084Z %189 = tt.expand_dims %188 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3624410Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3624726Z %191 = tt.expand_dims %186 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3625017Z %192 = arith.cmpi eq, %189, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3625276Z %193 = tt.broadcast %192 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3625544Z %194 = tt.broadcast %190 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3625864Z %195 = arith.select %193, %194, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3626142Z %196 = arith.cmpi eq, %189, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3626406Z %197 = tt.broadcast %191 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3626679Z %198 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3626956Z %199 = arith.select %198, %197, %195 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3627246Z %200 = tt.reshape %199 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3627507Z %201 = arith.sitofp %200 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3627874Z %202 = tt.dot %174, %201, %155, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3628200Z 
%c3_i32 = arith.constant 3 : i32 2026-02-21T09:02:50.3628396Z %203 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:02:50.3628598Z %204 = arith.addi %arg4, %203 : i32 2026-02-21T09:02:50.3628831Z %205 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3629084Z %206 = tt.splat %204 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3629293Z %207 = arith.addi %206, %205 : tensor<8xi32> 2026-02-21T09:02:50.3629500Z %208 = arith.muli %204, %c2_i32 : i32 2026-02-21T09:02:50.3629733Z %209 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3629992Z %210 = tt.splat %208 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3630205Z %211 = arith.addi %210, %209 : tensor<16xi32> 2026-02-21T09:02:50.3630464Z %212 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3630742Z %213 = arith.muli %212, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3631007Z %214 = tt.expand_dims %211 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3631308Z %215 = tt.broadcast %213 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3631601Z %216 = tt.broadcast %214 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3631854Z %217 = arith.addi %215, %216 : tensor<64x16xi32> 2026-02-21T09:02:50.3632108Z %218 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3632403Z %219 = tt.addptr %218, %217 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3632680Z %220 = tt.load %219 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3632923Z %221 = arith.extf %220 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3633243Z %222 = tt.expand_dims %207 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3633521Z %223 = arith.muli %222, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3633784Z %224 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3634088Z %225 = tt.broadcast %223 : 
tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3634379Z %226 = tt.broadcast %224 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3634676Z %227 = arith.addi %225, %226 : tensor<8x64xi32> 2026-02-21T09:02:50.3634912Z %228 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3635194Z %229 = tt.addptr %228, %227 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3635501Z %230 = tt.load %229 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3635765Z %231 = arith.shli %230, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3635992Z %232 = arith.shrsi %231, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3636206Z %233 = arith.shrsi %230, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3636456Z %234 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3636776Z %235 = tt.expand_dims %234 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3637106Z %236 = tt.expand_dims %235 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3637441Z %237 = tt.expand_dims %232 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3637763Z %238 = tt.expand_dims %233 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3638057Z %239 = arith.cmpi eq, %236, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3638310Z %240 = tt.broadcast %239 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3638587Z %241 = tt.broadcast %237 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3638878Z %242 = arith.select %240, %241, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3639155Z %243 = arith.cmpi eq, %236, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3639415Z %244 = tt.broadcast %238 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3639682Z %245 = tt.broadcast %243 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3639969Z %246 = arith.select %245, %244, %242 : 
tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3640249Z %247 = tt.reshape %246 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3640521Z %248 = arith.sitofp %247 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3640886Z %249 = tt.dot %221, %248, %202, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3641212Z scf.yield %249 : tensor<64x64xf32> 2026-02-21T09:02:50.3641399Z } {tt.flatten} 2026-02-21T09:02:50.3641628Z %28 = arith.truncf %27 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:50.3641921Z %29 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3642187Z %30 = arith.muli %29, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:50.3642449Z %31 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3642741Z %32 = tt.broadcast %30 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.3642998Z %33 = tt.broadcast %31 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.3643236Z %34 = arith.addi %32, %33 : tensor<64x64xi32> 2026-02-21T09:02:50.3643475Z %35 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.3643761Z %36 = tt.addptr %35, %34 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:50.3644026Z tt.store %36, %28 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.3644259Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:02:50.3644460Z %37 = arith.muli %c1_i32, %c1_i32_9 : i32 2026-02-21T09:02:50.3644652Z %38 = arith.addi %arg3, %37 : i32 2026-02-21T09:02:50.3644846Z %39 = arith.divsi %38, %c448_i32 : i32 2026-02-21T09:02:50.3645031Z %40 = arith.muli %39, %c4_i32 : i32 2026-02-21T09:02:50.3645222Z %41 = arith.subi %c1_i32, %40 : i32 2026-02-21T09:02:50.3645426Z %42 = arith.minsi %41, %c4_i32 : i32 2026-02-21T09:02:50.3645618Z %43 = arith.remsi %38, %c448_i32 : i32 2026-02-21T09:02:50.3645829Z %44 = arith.remsi %43, %42 : i32 2026-02-21T09:02:50.3646007Z %45 
= arith.addi %40, %44 : i32 2026-02-21T09:02:50.3646186Z %46 = arith.divsi %43, %42 : i32 2026-02-21T09:02:50.3646361Z %47 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:02:50.3646599Z %48 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:50.3646848Z %49 = tt.splat %47 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.3647058Z %50 = arith.addi %49, %48 : tensor<64xi32> 2026-02-21T09:02:50.3647245Z %51 = arith.muli %46, %c64_i32 : i32 2026-02-21T09:02:50.3647442Z %52 = tt.splat %51 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.3647647Z %53 = arith.addi %52, %48 : tensor<64xi32> 2026-02-21T09:02:50.3647846Z %c32_i32_10 = arith.constant 32 : i32 2026-02-21T09:02:50.3648217Z %54 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32_10 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:50.3648603Z %64 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3648872Z %65 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3649088Z %66 = arith.addi %65, %64 : tensor<8xi32> 2026-02-21T09:02:50.3649301Z %67 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:50.3649553Z %68 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3649809Z %69 = tt.splat %67 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3650028Z %70 = arith.addi %69, %68 : tensor<16xi32> 2026-02-21T09:02:50.3650294Z %71 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3650585Z %72 = arith.muli %71, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3650857Z %73 = tt.expand_dims %70 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3651165Z %74 = tt.broadcast %72 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3651444Z %75 = tt.broadcast %73 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3651721Z %76 = arith.addi %74, %75 : tensor<64x16xi32> 2026-02-21T09:02:50.3651984Z %77 = tt.splat %arg0 : !tt.ptr -> 
tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3652285Z %78 = tt.addptr %77, %76 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3652572Z %79 = tt.load %78 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3652834Z %80 = arith.extf %79 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3653130Z %81 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3653410Z %82 = arith.muli %81, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3653678Z %83 = tt.expand_dims %53 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3653986Z %84 = tt.broadcast %82 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3654265Z %85 = tt.broadcast %83 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3654528Z %86 = arith.addi %84, %85 : tensor<8x64xi32> 2026-02-21T09:02:50.3654786Z %87 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3655074Z %88 = tt.addptr %87, %86 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3655390Z %89 = tt.load %88 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3655653Z %90 = arith.shli %89, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3655898Z %91 = arith.shrsi %90, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3656119Z %92 = arith.shrsi %89, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3656374Z %93 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3656677Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3657012Z %95 = tt.expand_dims %94 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3657415Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3657755Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3658036Z %98 = arith.cmpi eq, %95, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3658295Z %99 = tt.broadcast 
%98 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3658568Z %100 = tt.broadcast %96 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3658853Z %101 = arith.select %99, %100, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3659138Z %102 = arith.cmpi eq, %95, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3659413Z %103 = tt.broadcast %97 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3659690Z %104 = tt.broadcast %102 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3659977Z %105 = arith.select %104, %103, %101 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3660251Z %106 = tt.reshape %105 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3660525Z %107 = arith.sitofp %106 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3660889Z %108 = tt.dot %80, %107, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3661221Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:02:50.3661422Z %109 = arith.muli %c8_i32, %c1_i32_11 : i32 2026-02-21T09:02:50.3661661Z %110 = arith.addi %arg4, %109 : i32 2026-02-21T09:02:50.3661902Z %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3662154Z %112 = tt.splat %110 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3662373Z %113 = arith.addi %112, %111 : tensor<8xi32> 2026-02-21T09:02:50.3662579Z %114 = arith.muli %110, %c2_i32 : i32 2026-02-21T09:02:50.3662825Z %115 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3663078Z %116 = tt.splat %114 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3663294Z %117 = arith.addi %116, %115 : tensor<16xi32> 2026-02-21T09:02:50.3663558Z %118 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3663838Z %119 = arith.muli %118, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3664115Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<16xi32> -> 
tensor<1x16xi32> 2026-02-21T09:02:50.3664416Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3664692Z %122 = tt.broadcast %120 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3664939Z %123 = arith.addi %121, %122 : tensor<64x16xi32> 2026-02-21T09:02:50.3665199Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3665504Z %125 = tt.addptr %124, %123 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3665773Z %126 = tt.load %125 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3666025Z %127 = arith.extf %126 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3666313Z %128 = tt.expand_dims %113 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3666592Z %129 = arith.muli %128, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3666851Z %130 = tt.expand_dims %53 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3667211Z %131 = tt.broadcast %129 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3667485Z %132 = tt.broadcast %130 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3667724Z %133 = arith.addi %131, %132 : tensor<8x64xi32> 2026-02-21T09:02:50.3667973Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3668312Z %135 = tt.addptr %134, %133 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3668659Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3668935Z %137 = arith.shli %136, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3669153Z %138 = arith.shrsi %137, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3669379Z %139 = arith.shrsi %136, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3669627Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3669929Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3670247Z %142 = tt.expand_dims %141 {axis = 2 : i32} : 
tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3670579Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3670953Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3671238Z %145 = arith.cmpi eq, %142, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3671496Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3671798Z %147 = tt.broadcast %143 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3672091Z %148 = arith.select %146, %147, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3672374Z %149 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3672628Z %150 = tt.broadcast %144 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3672907Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3673191Z %152 = arith.select %151, %150, %148 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3673481Z %153 = tt.reshape %152 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3673744Z %154 = arith.sitofp %153 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3674114Z %155 = tt.dot %127, %154, %108, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3674445Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:02:50.3674646Z %156 = arith.muli %c8_i32, %c2_i32_12 : i32 2026-02-21T09:02:50.3674850Z %157 = arith.addi %arg4, %156 : i32 2026-02-21T09:02:50.3675080Z %158 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3675340Z %159 = tt.splat %157 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3675549Z %160 = arith.addi %159, %158 : tensor<8xi32> 2026-02-21T09:02:50.3675760Z %161 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:02:50.3676006Z %162 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3676261Z %163 = 
tt.splat %161 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3676482Z %164 = arith.addi %163, %162 : tensor<16xi32> 2026-02-21T09:02:50.3676740Z %165 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3677023Z %166 = arith.muli %165, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3677285Z %167 = tt.expand_dims %164 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3677596Z %168 = tt.broadcast %166 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3677872Z %169 = tt.broadcast %167 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3678115Z %170 = arith.addi %168, %169 : tensor<64x16xi32> 2026-02-21T09:02:50.3678409Z %171 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3678705Z %172 = tt.addptr %171, %170 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3678982Z %173 = tt.load %172 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3679223Z %174 = arith.extf %173 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3679548Z %175 = tt.expand_dims %160 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3679865Z %176 = arith.muli %175, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3680128Z %177 = tt.expand_dims %53 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3680425Z %178 = tt.broadcast %176 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3680685Z %179 = tt.broadcast %177 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3680931Z %180 = arith.addi %178, %179 : tensor<8x64xi32> 2026-02-21T09:02:50.3681174Z %181 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3681449Z %182 = tt.addptr %181, %180 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3681816Z %183 = tt.load %182 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3682109Z %184 = arith.shli %183, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3682340Z %185 = arith.shrsi 
%184, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3682562Z %186 = arith.shrsi %183, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3682816Z %187 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3683119Z %188 = tt.expand_dims %187 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3683436Z %189 = tt.expand_dims %188 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3683774Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3684094Z %191 = tt.expand_dims %186 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3684388Z %192 = arith.cmpi eq, %189, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3684648Z %193 = tt.broadcast %192 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3684933Z %194 = tt.broadcast %190 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3685239Z %195 = arith.select %193, %194, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3685528Z %196 = arith.cmpi eq, %189, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3685778Z %197 = tt.broadcast %191 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3686051Z %198 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3686330Z %199 = arith.select %198, %197, %195 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3686617Z %200 = tt.reshape %199 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3686882Z %201 = arith.sitofp %200 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3687248Z %202 = tt.dot %174, %201, %155, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3687575Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:02:50.3687767Z %203 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:02:50.3687967Z %204 = arith.addi %arg4, %203 : i32 2026-02-21T09:02:50.3688197Z %205 = tt.make_range {end = 8 : 
i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3688448Z %206 = tt.splat %204 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3688661Z %207 = arith.addi %206, %205 : tensor<8xi32> 2026-02-21T09:02:50.3688856Z %208 = arith.muli %204, %c2_i32 : i32 2026-02-21T09:02:50.3689096Z %209 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3689371Z %210 = tt.splat %208 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3689587Z %211 = arith.addi %210, %209 : tensor<16xi32> 2026-02-21T09:02:50.3689841Z %212 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3690122Z %213 = arith.muli %212, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3690390Z %214 = tt.expand_dims %211 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3690711Z %215 = tt.broadcast %213 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3691016Z %216 = tt.broadcast %214 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3691259Z %217 = arith.addi %215, %216 : tensor<64x16xi32> 2026-02-21T09:02:50.3691511Z %218 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3691840Z %219 = tt.addptr %218, %217 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3692112Z %220 = tt.load %219 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3692358Z %221 = arith.extf %220 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3692648Z %222 = tt.expand_dims %207 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3692934Z %223 = arith.muli %222, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3693230Z %224 = tt.expand_dims %53 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3693545Z %225 = tt.broadcast %223 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3693821Z %226 = tt.broadcast %224 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3694074Z %227 = arith.addi %225, %226 : tensor<8x64xi32> 
2026-02-21T09:02:50.3694330Z %228 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3694673Z %229 = tt.addptr %228, %227 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3694999Z %230 = tt.load %229 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3695280Z %231 = arith.shli %230, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3695513Z %232 = arith.shrsi %231, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3695750Z %233 = arith.shrsi %230, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3696007Z %234 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3696325Z %235 = tt.expand_dims %234 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3696659Z %236 = tt.expand_dims %235 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3697003Z %237 = tt.expand_dims %232 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3697334Z %238 = tt.expand_dims %233 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3697639Z %239 = arith.cmpi eq, %236, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3697918Z %240 = tt.broadcast %239 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3698206Z %241 = tt.broadcast %237 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3698516Z %242 = arith.select %240, %241, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3698801Z %243 = arith.cmpi eq, %236, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3699074Z %244 = tt.broadcast %238 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3699365Z %245 = tt.broadcast %243 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3699656Z %246 = arith.select %245, %244, %242 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3699957Z %247 = tt.reshape %246 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3700227Z %248 = arith.sitofp %247 : tensor<16x64xi8> to 
tensor<16x64xf32> 2026-02-21T09:02:50.3700611Z %249 = tt.dot %221, %248, %202, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3700973Z scf.yield %249 : tensor<64x64xf32> 2026-02-21T09:02:50.3701172Z } {tt.flatten} 2026-02-21T09:02:50.3701383Z %55 = arith.truncf %54 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:50.3701708Z %56 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3701994Z %57 = arith.muli %56, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:50.3702269Z %58 = tt.expand_dims %53 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3702582Z %59 = tt.broadcast %57 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.3702839Z %60 = tt.broadcast %58 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.3703079Z %61 = arith.addi %59, %60 : tensor<64x64xi32> 2026-02-21T09:02:50.3703327Z %62 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.3703607Z %63 = tt.addptr %62, %61 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:50.3703874Z tt.store %63, %55 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.3704077Z } {tt.num_stages = 1 : i32} 2026-02-21T09:02:50.3704275Z scf.for %arg3 = %10 to %2 step %c1_i32 : i32 { 2026-02-21T09:02:50.3704484Z %12 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:02:50.3704706Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:02:50.3704897Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:02:50.3705078Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:02:50.3705272Z %16 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:02:50.3705457Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:02:50.3705642Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:02:50.3705813Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:02:50.3705995Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:02:50.3706220Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 
2026-02-21T09:02:50.3706472Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.3706678Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:02:50.3706866Z %24 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:02:50.3707058Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.3707252Z %26 = arith.addi %25, %21 : tensor<64xi32> 2026-02-21T09:02:50.3707449Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:02:50.3707766Z %27 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:50.3708136Z %37 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3708388Z %38 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3708594Z %39 = arith.addi %38, %37 : tensor<8xi32> 2026-02-21T09:02:50.3708798Z %40 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:50.3709026Z %41 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3709282Z %42 = tt.splat %40 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3709482Z %43 = arith.addi %42, %41 : tensor<16xi32> 2026-02-21T09:02:50.3709741Z %44 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3710022Z %45 = arith.muli %44, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3710282Z %46 = tt.expand_dims %43 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3710581Z %47 = tt.broadcast %45 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3710843Z %48 = tt.broadcast %46 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3711086Z %49 = arith.addi %47, %48 : tensor<64x16xi32> 2026-02-21T09:02:50.3711328Z %50 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3711648Z %51 = tt.addptr %50, %49 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3711912Z %52 = tt.load %51 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3712195Z %53 = arith.extf %52 : tensor<64x16xbf16> to 
tensor<64x16xf32> 2026-02-21T09:02:50.3712480Z %54 = tt.expand_dims %39 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3712749Z %55 = arith.muli %54, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3713014Z %56 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3713332Z %57 = tt.broadcast %55 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3713610Z %58 = tt.broadcast %56 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3713855Z %59 = arith.addi %57, %58 : tensor<8x64xi32> 2026-02-21T09:02:50.3714086Z %60 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3714357Z %61 = tt.addptr %60, %59 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3714655Z %62 = tt.load %61 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3714924Z %63 = arith.shli %62, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3715144Z %64 = arith.shrsi %63, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3715356Z %65 = arith.shrsi %62, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3715630Z %66 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3715921Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3716238Z %68 = tt.expand_dims %67 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3716550Z %69 = tt.expand_dims %64 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3716863Z %70 = tt.expand_dims %65 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3717145Z %71 = arith.cmpi eq, %68, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3717389Z %72 = tt.broadcast %71 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3717661Z %73 = tt.broadcast %69 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3717936Z %74 = arith.select %72, %73, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 
2026-02-21T09:02:50.3718208Z %75 = arith.cmpi eq, %68, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3718458Z %76 = tt.broadcast %70 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3718716Z %77 = tt.broadcast %75 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3718988Z %78 = arith.select %77, %76, %74 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3719253Z %79 = tt.reshape %78 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3719508Z %80 = arith.sitofp %79 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3719852Z %81 = tt.dot %53, %80, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3720181Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:02:50.3720383Z %82 = arith.muli %c8_i32, %c1_i32_9 : i32 2026-02-21T09:02:50.3720577Z %83 = arith.addi %arg4, %82 : i32 2026-02-21T09:02:50.3720812Z %84 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3721054Z %85 = tt.splat %83 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3721263Z %86 = arith.addi %85, %84 : tensor<8xi32> 2026-02-21T09:02:50.3721456Z %87 = arith.muli %83, %c2_i32 : i32 2026-02-21T09:02:50.3721748Z %88 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3722000Z %89 = tt.splat %87 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3722200Z %90 = arith.addi %89, %88 : tensor<16xi32> 2026-02-21T09:02:50.3722460Z %91 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3722728Z %92 = arith.muli %91, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3722989Z %93 = tt.expand_dims %90 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3723307Z %94 = tt.broadcast %92 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3723579Z %95 = tt.broadcast %93 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3723834Z %96 = arith.addi %94, %95 : tensor<64x16xi32> 
2026-02-21T09:02:50.3724085Z %97 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3724427Z %98 = tt.addptr %97, %96 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3724707Z %99 = tt.load %98 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3724959Z %100 = arith.extf %99 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3725246Z %101 = tt.expand_dims %86 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3725525Z %102 = arith.muli %101, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3725794Z %103 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3726091Z %104 = tt.broadcast %102 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3726360Z %105 = tt.broadcast %103 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3726597Z %106 = arith.addi %104, %105 : tensor<8x64xi32> 2026-02-21T09:02:50.3726868Z %107 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3727149Z %108 = tt.addptr %107, %106 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3727452Z %109 = tt.load %108 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3727731Z %110 = arith.shli %109, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3727948Z %111 = arith.shrsi %110, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3728176Z %112 = arith.shrsi %109, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3728425Z %113 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3728727Z %114 = tt.expand_dims %113 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3729053Z %115 = tt.expand_dims %114 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3729378Z %116 = tt.expand_dims %111 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3729708Z %117 = tt.expand_dims %112 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3729995Z %118 = 
arith.cmpi eq, %115, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3730258Z %119 = tt.broadcast %118 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3730531Z %120 = tt.broadcast %116 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3730823Z %121 = arith.select %119, %120, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3731102Z %122 = arith.cmpi eq, %115, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3731354Z %123 = tt.broadcast %117 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3731661Z %124 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3731945Z %125 = arith.select %124, %123, %121 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3732234Z %126 = tt.reshape %125 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3732501Z %127 = arith.sitofp %126 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3732858Z %128 = tt.dot %100, %127, %81, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3733185Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:02:50.3733382Z %129 = arith.muli %c8_i32, %c2_i32_10 : i32 2026-02-21T09:02:50.3733586Z %130 = arith.addi %arg4, %129 : i32 2026-02-21T09:02:50.3733812Z %131 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3734064Z %132 = tt.splat %130 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3734305Z %133 = arith.addi %132, %131 : tensor<8xi32> 2026-02-21T09:02:50.3734506Z %134 = arith.muli %130, %c2_i32 : i32 2026-02-21T09:02:50.3734746Z %135 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3734998Z %136 = tt.splat %134 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3735216Z %137 = arith.addi %136, %135 : tensor<16xi32> 2026-02-21T09:02:50.3735513Z %138 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3735830Z %139 = arith.muli %138, %cst_5 : 
tensor<64x1xi32> 2026-02-21T09:02:50.3736115Z %140 = tt.expand_dims %137 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3736423Z %141 = tt.broadcast %139 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3736711Z %142 = tt.broadcast %140 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3736967Z %143 = arith.addi %141, %142 : tensor<64x16xi32> 2026-02-21T09:02:50.3737235Z %144 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3737552Z %145 = tt.addptr %144, %143 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3737841Z %146 = tt.load %145 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3738123Z %147 = arith.extf %146 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3738427Z %148 = tt.expand_dims %133 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3738715Z %149 = arith.muli %148, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3738984Z %150 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3739299Z %151 = tt.broadcast %149 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3739584Z %152 = tt.broadcast %150 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3739835Z %153 = arith.addi %151, %152 : tensor<8x64xi32> 2026-02-21T09:02:50.3740092Z %154 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3740381Z %155 = tt.addptr %154, %153 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3740706Z %156 = tt.load %155 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3740987Z %157 = arith.shli %156, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3741222Z %158 = arith.shrsi %157, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3741460Z %159 = arith.shrsi %156, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3741766Z %160 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3742083Z %161 = tt.expand_dims %160 {axis = 0 : i32} : 
tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3742419Z %162 = tt.expand_dims %161 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3742768Z %163 = tt.expand_dims %158 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3743117Z %164 = tt.expand_dims %159 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3743419Z %165 = arith.cmpi eq, %162, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3743690Z %166 = tt.broadcast %165 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3743979Z %167 = tt.broadcast %163 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3744290Z %168 = arith.select %166, %167, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3744579Z %169 = arith.cmpi eq, %162, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3744857Z %170 = tt.broadcast %164 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3745147Z %171 = tt.broadcast %169 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3745428Z %172 = arith.select %171, %170, %168 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3745719Z %173 = tt.reshape %172 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3746001Z %174 = arith.sitofp %173 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3746363Z %175 = tt.dot %147, %174, %128, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3746685Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:02:50.3746902Z %176 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:02:50.3747100Z %177 = arith.addi %arg4, %176 : i32 2026-02-21T09:02:50.3747352Z %178 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.3747607Z %179 = tt.splat %177 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.3747813Z %180 = arith.addi %179, %178 : tensor<8xi32> 2026-02-21T09:02:50.3748018Z %181 = arith.muli %177, %c2_i32 : i32 2026-02-21T09:02:50.3748250Z %182 
= tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.3748509Z %183 = tt.splat %181 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.3748725Z %184 = arith.addi %183, %182 : tensor<16xi32> 2026-02-21T09:02:50.3748982Z %185 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3749261Z %186 = arith.muli %185, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.3749564Z %187 = tt.expand_dims %184 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.3749873Z %188 = tt.broadcast %186 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3750144Z %189 = tt.broadcast %187 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.3750397Z %190 = arith.addi %188, %189 : tensor<64x16xi32> 2026-02-21T09:02:50.3750655Z %191 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3750955Z %192 = tt.addptr %191, %190 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.3751240Z %193 = tt.load %192 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.3751483Z %194 = arith.extf %193 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.3751821Z %195 = tt.expand_dims %180 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.3752101Z %196 = arith.muli %195, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.3752367Z %197 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3752679Z %198 = tt.broadcast %196 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3752943Z %199 = tt.broadcast %197 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.3753189Z %200 = arith.addi %198, %199 : tensor<8x64xi32> 2026-02-21T09:02:50.3753425Z %201 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.3753710Z %202 = tt.addptr %201, %200 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.3754017Z %203 = tt.load %202 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 
2026-02-21T09:02:50.3754282Z %204 = arith.shli %203, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3754507Z %205 = arith.shrsi %204, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3754721Z %206 = arith.shrsi %203, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.3754973Z %207 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.3755267Z %208 = tt.expand_dims %207 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.3755603Z %209 = tt.expand_dims %208 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.3755926Z %210 = tt.expand_dims %205 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3756239Z %211 = tt.expand_dims %206 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.3756528Z %212 = arith.cmpi eq, %209, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3756776Z %213 = tt.broadcast %212 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3757079Z %214 = tt.broadcast %210 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3757368Z %215 = arith.select %213, %214, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3757639Z %216 = arith.cmpi eq, %209, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.3757899Z %217 = tt.broadcast %211 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.3758184Z %218 = tt.broadcast %216 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.3758489Z %219 = arith.select %218, %217, %215 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.3758773Z %220 = tt.reshape %219 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.3759040Z %221 = arith.sitofp %220 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.3759398Z %222 = tt.dot %194, %221, %175, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.3759718Z scf.yield %222 : tensor<64x64xf32> 2026-02-21T09:02:50.3759904Z } {tt.flatten} 
2026-02-21T09:02:50.3760099Z %28 = arith.truncf %27 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:50.3760392Z %29 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.3760785Z %30 = arith.muli %29, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:50.3761038Z %31 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.3761333Z %32 = tt.broadcast %30 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.3761619Z %33 = tt.broadcast %31 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.3761861Z %34 = arith.addi %32, %33 : tensor<64x64xi32> 2026-02-21T09:02:50.3762103Z %35 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.3762394Z %36 = tt.addptr %35, %34 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:50.3762659Z tt.store %36, %28 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.3762862Z } {tt.num_stages = 1 : i32} 2026-02-21T09:02:50.3763037Z tt.return 2026-02-21T09:02:50.3763166Z } 2026-02-21T09:02:50.3763299Z } 2026-02-21T09:02:50.3763369Z 2026-02-21T09:02:50.3763422Z {-# 2026-02-21T09:02:50.3763563Z external_resources: { 2026-02-21T09:02:50.3763727Z mlir_reproducer: { 2026-02-21T09:02:50.3768084Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, 
tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:02:50.3772696Z disable_threading: false, 2026-02-21T09:02:50.3772883Z verify_each: true 2026-02-21T09:02:50.3773034Z } 2026-02-21T09:02:50.3773173Z } 2026-02-21T09:02:50.3773298Z #-} 2026-02-21T09:02:50.3773782Z /tmp/torchinductor_root/7f/c7fnzrkkisnfmjd5xpp5ka2e57wl6pcqeq7p6dlx2ggnqauwc57j.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:02:50.3774817Z /tmp/torchinductor_root/7f/c7fnzrkkisnfmjd5xpp5ka2e57wl6pcqeq7p6dlx2ggnqauwc57j.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:02:50.3775640Z [135s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:02:50.3776805Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:02:50.3777827Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:02:50.3778085Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:02:50.9042334Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:02:50.9044221Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:02:50.9044501Z ^ 2026-02-21T09:02:50.9044868Z /tmp/torchinductor_root/22/c22ckupeeabgzgknr32c7hrhqvgo47imt3h5zrttb5ezu5batycn.py:87:40: note: called from 2026-02-21T09:02:50.9045278Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:02:50.9045491Z ^ 2026-02-21T09:02:50.9045899Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:02:50.9046360Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:02:50.9046608Z ^ 2026-02-21T09:02:50.9047113Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:02:50.9047698Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:02:50.9048246Z %cst = arith.constant dense<0> : tensor<8x2x64xi8> 2026-02-21T09:02:50.9048468Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:02:50.9048658Z %c4096_i32 = 
arith.constant 4096 : i32 2026-02-21T09:02:50.9048851Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:02:50.9049025Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:02:50.9049236Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:02:50.9049473Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:02:50.9049706Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:02:50.9049937Z %cst_3 = arith.constant dense<4> : tensor<8x64xi8> 2026-02-21T09:02:50.9050165Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:02:50.9050405Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:02:50.9050607Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:02:50.9050828Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:02:50.9051056Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:02:50.9051239Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:02:50.9051702Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:02:50.9051890Z %0 = tt.get_program_id x : i32 2026-02-21T09:02:50.9052079Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:02:50.9052258Z %2 = arith.minsi %1, %c112_i32 : i32 2026-02-21T09:02:50.9052451Z %3 = arith.subi %2, %0 : i32 2026-02-21T09:02:50.9052628Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:02:50.9052868Z %4 = arith.subi %c1_i32, %c1_i32_7 : i32 2026-02-21T09:02:50.9053058Z %5 = arith.addi %3, %4 : i32 2026-02-21T09:02:50.9053276Z %6 = arith.divui %5, %c1_i32 : i32 2026-02-21T09:02:50.9053454Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:02:50.9053635Z %7 = arith.remsi %6, %c2_i32_8 : i32 2026-02-21T09:02:50.9053817Z %8 = arith.subi %6, %7 : i32 2026-02-21T09:02:50.9053985Z %9 = arith.muli %8, %c1_i32 : i32 2026-02-21T09:02:50.9054166Z %10 = arith.addi %0, %9 : i32 2026-02-21T09:02:50.9054344Z %11 = arith.muli %c1_i32, %c2_i32_8 : i32 2026-02-21T09:02:50.9054558Z scf.for %arg3 = %0 to %10 step %11 : i32 { 2026-02-21T09:02:50.9054759Z %12 = arith.divsi 
%arg3, %c4_i32 : i32 2026-02-21T09:02:50.9054953Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:02:50.9055144Z %14 = arith.subi %c112_i32, %13 : i32 2026-02-21T09:02:50.9055369Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:02:50.9055569Z %16 = arith.remsi %arg3, %c4_i32 : i32 2026-02-21T09:02:50.9055752Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:02:50.9055942Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:02:50.9056115Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:02:50.9056294Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:02:50.9056522Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:50.9056789Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.9057007Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:02:50.9057200Z %24 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:02:50.9057402Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.9057594Z %26 = arith.addi %25, %21 : tensor<64xi32> 2026-02-21T09:02:50.9057789Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:02:50.9058109Z %27 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:50.9058505Z %64 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9058761Z %65 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9058968Z %66 = arith.addi %65, %64 : tensor<8xi32> 2026-02-21T09:02:50.9059172Z %67 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:50.9059401Z %68 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9059650Z %69 = tt.splat %67 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9059849Z %70 = arith.addi %69, %68 : tensor<16xi32> 2026-02-21T09:02:50.9060113Z %71 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9060392Z %72 = arith.muli %71, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9060648Z %73 = tt.expand_dims %70 {axis = 
0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9060943Z %74 = tt.broadcast %72 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9061200Z %75 = tt.broadcast %73 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9061447Z %76 = arith.addi %74, %75 : tensor<64x16xi32> 2026-02-21T09:02:50.9061773Z %77 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9062068Z %78 = tt.addptr %77, %76 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9062345Z %79 = tt.load %78 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9062596Z %80 = arith.extf %79 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9062927Z %81 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9063217Z %82 = arith.muli %81, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9063467Z %83 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9063763Z %84 = tt.broadcast %82 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9064055Z %85 = tt.broadcast %83 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9064315Z %86 = arith.addi %84, %85 : tensor<8x64xi32> 2026-02-21T09:02:50.9064556Z %87 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9064818Z %88 = tt.addptr %87, %86 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9065113Z %89 = tt.load %88 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9065373Z %90 = arith.shli %89, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9065593Z %91 = arith.shrsi %90, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9065813Z %92 = arith.shrsi %89, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9066050Z %93 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9066367Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9066714Z %95 = tt.expand_dims %94 {axis = 2 : i32} : tensor<1x2xi32> -> 
tensor<1x2x1xi32> 2026-02-21T09:02:50.9067053Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9067381Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9067665Z %98 = arith.cmpi eq, %95, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9067928Z %99 = tt.broadcast %98 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9068202Z %100 = tt.broadcast %96 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9068503Z %101 = arith.select %99, %100, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9068784Z %102 = arith.cmpi eq, %95, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9069054Z %103 = tt.broadcast %97 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9069337Z %104 = tt.broadcast %102 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9069631Z %105 = arith.select %104, %103, %101 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9069930Z %106 = tt.reshape %105 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9070198Z %107 = arith.sitofp %106 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9070582Z %108 = tt.dot %80, %107, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9070924Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:02:50.9071132Z %109 = arith.muli %c8_i32, %c1_i32_11 : i32 2026-02-21T09:02:50.9071347Z %110 = arith.addi %arg4, %109 : i32 2026-02-21T09:02:50.9071624Z %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9071890Z %112 = tt.splat %110 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9072106Z %113 = arith.addi %112, %111 : tensor<8xi32> 2026-02-21T09:02:50.9072320Z %114 = arith.muli %110, %c2_i32 : i32 2026-02-21T09:02:50.9072564Z %115 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9072832Z %116 = tt.splat %114 : i32 -> 
tensor<16xi32> 2026-02-21T09:02:50.9073056Z %117 = arith.addi %116, %115 : tensor<16xi32> 2026-02-21T09:02:50.9073323Z %118 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9073622Z %119 = arith.muli %118, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9073902Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9074220Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9074556Z %122 = tt.broadcast %120 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9074817Z %123 = arith.addi %121, %122 : tensor<64x16xi32> 2026-02-21T09:02:50.9075092Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9075404Z %125 = tt.addptr %124, %123 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9075723Z %126 = tt.load %125 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9075987Z %127 = arith.extf %126 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9076288Z %128 = tt.expand_dims %113 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9076573Z %129 = arith.muli %128, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9076830Z %130 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9077127Z %131 = tt.broadcast %129 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9077388Z %132 = tt.broadcast %130 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9077634Z %133 = arith.addi %131, %132 : tensor<8x64xi32> 2026-02-21T09:02:50.9077869Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9078170Z %135 = tt.addptr %134, %133 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9078479Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9078750Z %137 = arith.shli %136, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9078975Z %138 = arith.shrsi %137, %cst_3 : 
tensor<8x64xi8> 2026-02-21T09:02:50.9079193Z %139 = arith.shrsi %136, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9079446Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9079745Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9080060Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9080388Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9080708Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9081001Z %145 = arith.cmpi eq, %142, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9081251Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9081526Z %147 = tt.broadcast %143 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9081847Z %148 = arith.select %146, %147, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9082122Z %149 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9082377Z %150 = tt.broadcast %144 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9082639Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9082929Z %152 = arith.select %151, %150, %148 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9083224Z %153 = tt.reshape %152 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9083483Z %154 = arith.sitofp %153 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9083842Z %155 = tt.dot %127, %154, %108, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9084166Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:02:50.9084370Z %156 = arith.muli %c8_i32, %c2_i32_12 : i32 2026-02-21T09:02:50.9084566Z %157 = arith.addi %arg4, %156 : i32 2026-02-21T09:02:50.9084802Z %158 = tt.make_range {end = 8 : i32, start 
= 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9085052Z %159 = tt.splat %157 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9085257Z %160 = arith.addi %159, %158 : tensor<8xi32> 2026-02-21T09:02:50.9085485Z %161 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:02:50.9085716Z %162 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9085981Z %163 = tt.splat %161 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9086186Z %164 = arith.addi %163, %162 : tensor<16xi32> 2026-02-21T09:02:50.9086452Z %165 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9086754Z %166 = arith.muli %165, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9087052Z %167 = tt.expand_dims %164 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9087353Z %168 = tt.broadcast %166 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9087617Z %169 = tt.broadcast %167 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9087874Z %170 = arith.addi %168, %169 : tensor<64x16xi32> 2026-02-21T09:02:50.9088124Z %171 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9088429Z %172 = tt.addptr %171, %170 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9088702Z %173 = tt.load %172 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9088944Z %174 = arith.extf %173 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9089261Z %175 = tt.expand_dims %160 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9089534Z %176 = arith.muli %175, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9089807Z %177 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9090097Z %178 = tt.broadcast %176 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9090367Z %179 = tt.broadcast %177 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9090651Z %180 = arith.addi %178, %179 : tensor<8x64xi32> 2026-02-21T09:02:50.9090894Z %181 
= tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9091170Z %182 = tt.addptr %181, %180 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9091518Z %183 = tt.load %182 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9091832Z %184 = arith.shli %183, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9092057Z %185 = arith.shrsi %184, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9092288Z %186 = arith.shrsi %183, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9092545Z %187 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9092852Z %188 = tt.expand_dims %187 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9093174Z %189 = tt.expand_dims %188 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9093519Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9093851Z %191 = tt.expand_dims %186 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9094145Z %192 = arith.cmpi eq, %189, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9094408Z %193 = tt.broadcast %192 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9094685Z %194 = tt.broadcast %190 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9094982Z %195 = arith.select %193, %194, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9095263Z %196 = arith.cmpi eq, %189, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9095522Z %197 = tt.broadcast %191 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9095797Z %198 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9096076Z %199 = arith.select %198, %197, %195 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9096366Z %200 = tt.reshape %199 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9096644Z %201 = arith.sitofp %200 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9097023Z 
%202 = tt.dot %174, %201, %155, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9097350Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:02:50.9097543Z %203 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:02:50.9097767Z %204 = arith.addi %arg4, %203 : i32 2026-02-21T09:02:50.9097993Z %205 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9098267Z %206 = tt.splat %204 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9098483Z %207 = arith.addi %206, %205 : tensor<8xi32> 2026-02-21T09:02:50.9098686Z %208 = arith.muli %204, %c2_i32 : i32 2026-02-21T09:02:50.9098923Z %209 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9099170Z %210 = tt.splat %208 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9099385Z %211 = arith.addi %210, %209 : tensor<16xi32> 2026-02-21T09:02:50.9099644Z %212 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9099923Z %213 = arith.muli %212, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9100185Z %214 = tt.expand_dims %211 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9100510Z %215 = tt.broadcast %213 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9100785Z %216 = tt.broadcast %214 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9101025Z %217 = arith.addi %215, %216 : tensor<64x16xi32> 2026-02-21T09:02:50.9101282Z %218 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9101612Z %219 = tt.addptr %218, %217 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9101889Z %220 = tt.load %219 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9102130Z %221 = arith.extf %220 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9102424Z %222 = tt.expand_dims %207 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9102699Z %223 = arith.muli %222, %cst_4 : tensor<8x1xi32> 
2026-02-21T09:02:50.9102958Z %224 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9103257Z %225 = tt.broadcast %223 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9103519Z %226 = tt.broadcast %224 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9103766Z %227 = arith.addi %225, %226 : tensor<8x64xi32> 2026-02-21T09:02:50.9104009Z %228 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9104280Z %229 = tt.addptr %228, %227 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9104594Z %230 = tt.load %229 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9104861Z %231 = arith.shli %230, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9105084Z %232 = arith.shrsi %231, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9105301Z %233 = arith.shrsi %230, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9105551Z %234 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9105850Z %235 = tt.expand_dims %234 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9106162Z %236 = tt.expand_dims %235 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9106488Z %237 = tt.expand_dims %232 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9106804Z %238 = tt.expand_dims %233 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9107090Z %239 = arith.cmpi eq, %236, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9107340Z %240 = tt.broadcast %239 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9107614Z %241 = tt.broadcast %237 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9107925Z %242 = arith.select %240, %241, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9108195Z %243 = arith.cmpi eq, %236, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9108451Z %244 = tt.broadcast %238 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 
2026-02-21T09:02:50.9108717Z %245 = tt.broadcast %243 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9109042Z %246 = arith.select %245, %244, %242 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9109324Z %247 = tt.reshape %246 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9109595Z %248 = arith.sitofp %247 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9109954Z %249 = tt.dot %221, %248, %202, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9110268Z scf.yield %249 : tensor<64x64xf32> 2026-02-21T09:02:50.9110455Z } {tt.flatten} 2026-02-21T09:02:50.9110648Z %28 = arith.truncf %27 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:50.9110939Z %29 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9111205Z %30 = arith.muli %29, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:50.9111474Z %31 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9111807Z %32 = tt.broadcast %30 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.9112074Z %33 = tt.broadcast %31 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.9112323Z %34 = arith.addi %32, %33 : tensor<64x64xi32> 2026-02-21T09:02:50.9112574Z %35 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.9112877Z %36 = tt.addptr %35, %34 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:50.9113152Z tt.store %36, %28 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.9113369Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:02:50.9113580Z %37 = arith.muli %c1_i32, %c1_i32_9 : i32 2026-02-21T09:02:50.9113785Z %38 = arith.addi %arg3, %37 : i32 2026-02-21T09:02:50.9113988Z %39 = arith.divsi %38, %c4_i32 : i32 2026-02-21T09:02:50.9114185Z %40 = arith.muli %39, %c4_i32 : i32 2026-02-21T09:02:50.9114396Z %41 = arith.subi %c112_i32, %40 : i32 2026-02-21T09:02:50.9114597Z %42 = arith.minsi %41, %c4_i32 : i32 
2026-02-21T09:02:50.9114783Z %43 = arith.remsi %38, %c4_i32 : i32 2026-02-21T09:02:50.9114977Z %44 = arith.remsi %43, %42 : i32 2026-02-21T09:02:50.9115161Z %45 = arith.addi %40, %44 : i32 2026-02-21T09:02:50.9115349Z %46 = arith.divsi %43, %42 : i32 2026-02-21T09:02:50.9115530Z %47 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:02:50.9115772Z %48 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:50.9116028Z %49 = tt.splat %47 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.9116242Z %50 = arith.addi %49, %48 : tensor<64xi32> 2026-02-21T09:02:50.9116449Z %51 = arith.muli %46, %c64_i32 : i32 2026-02-21T09:02:50.9116642Z %52 = tt.splat %51 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.9116851Z %53 = arith.addi %52, %48 : tensor<64xi32> 2026-02-21T09:02:50.9117053Z %c32_i32_10 = arith.constant 32 : i32 2026-02-21T09:02:50.9117398Z %54 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32_10 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:50.9117778Z %64 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9118043Z %65 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9118262Z %66 = arith.addi %65, %64 : tensor<8xi32> 2026-02-21T09:02:50.9118469Z %67 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:50.9118716Z %68 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9118972Z %69 = tt.splat %67 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9119216Z %70 = arith.addi %69, %68 : tensor<16xi32> 2026-02-21T09:02:50.9119479Z %71 = tt.expand_dims %53 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9119765Z %72 = arith.muli %71, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9120039Z %73 = tt.expand_dims %70 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9120359Z %74 = tt.broadcast %72 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9120658Z %75 = tt.broadcast %73 : tensor<1x16xi32> -> 
tensor<64x16xi32> 2026-02-21T09:02:50.9120903Z %76 = arith.addi %74, %75 : tensor<64x16xi32> 2026-02-21T09:02:50.9121161Z %77 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9121461Z %78 = tt.addptr %77, %76 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9121771Z %79 = tt.load %78 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9122016Z %80 = arith.extf %79 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9122290Z %81 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9122555Z %82 = arith.muli %81, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9122826Z %83 = tt.expand_dims %50 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9123116Z %84 = tt.broadcast %82 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9123375Z %85 = tt.broadcast %83 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9123607Z %86 = arith.addi %84, %85 : tensor<8x64xi32> 2026-02-21T09:02:50.9123846Z %87 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9124114Z %88 = tt.addptr %87, %86 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9124410Z %89 = tt.load %88 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9124668Z %90 = arith.shli %89, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9124890Z %91 = arith.shrsi %90, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9125107Z %92 = arith.shrsi %89, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9125344Z %93 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9125636Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9125944Z %95 = tt.expand_dims %94 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9126268Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9126592Z %97 = tt.expand_dims %92 {axis = 1 : i32} : 
tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9126871Z %98 = arith.cmpi eq, %95, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9127121Z %99 = tt.broadcast %98 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9127384Z %100 = tt.broadcast %96 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9127670Z %101 = arith.select %99, %100, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9127941Z %102 = arith.cmpi eq, %95, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9128198Z %103 = tt.broadcast %97 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9128466Z %104 = tt.broadcast %102 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9128743Z %105 = arith.select %104, %103, %101 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9129032Z %106 = tt.reshape %105 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9129293Z %107 = arith.sitofp %106 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9129657Z %108 = tt.dot %80, %107, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9129977Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:02:50.9130209Z %109 = arith.muli %c8_i32, %c1_i32_11 : i32 2026-02-21T09:02:50.9130414Z %110 = arith.addi %arg4, %109 : i32 2026-02-21T09:02:50.9130642Z %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9130899Z %112 = tt.splat %110 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9131107Z %113 = arith.addi %112, %111 : tensor<8xi32> 2026-02-21T09:02:50.9131353Z %114 = arith.muli %110, %c2_i32 : i32 2026-02-21T09:02:50.9131633Z %115 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9131894Z %116 = tt.splat %114 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9132105Z %117 = arith.addi %116, %115 : tensor<16xi32> 2026-02-21T09:02:50.9132359Z %118 = tt.expand_dims %53 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:02:50.9132638Z %119 = arith.muli %118, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9132900Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9133199Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9133459Z %122 = tt.broadcast %120 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9133704Z %123 = arith.addi %121, %122 : tensor<64x16xi32> 2026-02-21T09:02:50.9133982Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9134285Z %125 = tt.addptr %124, %123 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9134560Z %126 = tt.load %125 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9134803Z %127 = arith.extf %126 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9135094Z %128 = tt.expand_dims %113 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9135369Z %129 = arith.muli %128, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9135627Z %130 = tt.expand_dims %50 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9135931Z %131 = tt.broadcast %129 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9136193Z %132 = tt.broadcast %130 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9136439Z %133 = arith.addi %131, %132 : tensor<8x64xi32> 2026-02-21T09:02:50.9136677Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9136959Z %135 = tt.addptr %134, %133 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9137271Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9137540Z %137 = arith.shli %136, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9137771Z %138 = arith.shrsi %137, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9137991Z %139 = arith.shrsi %136, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9138245Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 
2026-02-21T09:02:50.9138541Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9138868Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9139201Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9139523Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9139820Z %145 = arith.cmpi eq, %142, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9140079Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9140360Z %147 = tt.broadcast %143 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9140650Z %148 = arith.select %146, %147, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9140922Z %149 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9141193Z %150 = tt.broadcast %144 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9141479Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9141795Z %152 = arith.select %151, %150, %148 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9142079Z %153 = tt.reshape %152 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9142369Z %154 = arith.sitofp %153 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9142747Z %155 = tt.dot %127, %154, %108, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9143067Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:02:50.9143273Z %156 = arith.muli %c8_i32, %c2_i32_12 : i32 2026-02-21T09:02:50.9143467Z %157 = arith.addi %arg4, %156 : i32 2026-02-21T09:02:50.9143708Z %158 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9143958Z %159 = tt.splat %157 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9144165Z %160 = arith.addi %159, %158 : tensor<8xi32> 
2026-02-21T09:02:50.9144372Z %161 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:02:50.9144605Z %162 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9144886Z %163 = tt.splat %161 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9145095Z %164 = arith.addi %163, %162 : tensor<16xi32> 2026-02-21T09:02:50.9145359Z %165 = tt.expand_dims %53 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9145628Z %166 = arith.muli %165, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9145895Z %167 = tt.expand_dims %164 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9146192Z %168 = tt.broadcast %166 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9146455Z %169 = tt.broadcast %167 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9146699Z %170 = arith.addi %168, %169 : tensor<64x16xi32> 2026-02-21T09:02:50.9146944Z %171 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9147238Z %172 = tt.addptr %171, %170 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9147509Z %173 = tt.load %172 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9147750Z %174 = arith.extf %173 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9148041Z %175 = tt.expand_dims %160 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9148305Z %176 = arith.muli %175, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9148569Z %177 = tt.expand_dims %50 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9148857Z %178 = tt.broadcast %176 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9149125Z %179 = tt.broadcast %177 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9149367Z %180 = arith.addi %178, %179 : tensor<8x64xi32> 2026-02-21T09:02:50.9149605Z %181 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9149885Z %182 = tt.addptr %181, %180 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 
2026-02-21T09:02:50.9150185Z %183 = tt.load %182 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9150461Z %184 = arith.shli %183, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9150676Z %185 = arith.shrsi %184, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9150898Z %186 = arith.shrsi %183, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9151206Z %187 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9151499Z %188 = tt.expand_dims %187 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9151866Z %189 = tt.expand_dims %188 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9152187Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9152539Z %191 = tt.expand_dims %186 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9152834Z %192 = arith.cmpi eq, %189, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9153091Z %193 = tt.broadcast %192 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9153398Z %194 = tt.broadcast %190 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9153707Z %195 = arith.select %193, %194, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9153991Z %196 = arith.cmpi eq, %189, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9154274Z %197 = tt.broadcast %191 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9154549Z %198 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9154844Z %199 = arith.select %198, %197, %195 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9155136Z %200 = tt.reshape %199 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9155410Z %201 = arith.sitofp %200 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9155778Z %202 = tt.dot %174, %201, %155, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9156140Z 
%c3_i32 = arith.constant 3 : i32 2026-02-21T09:02:50.9156354Z %203 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:02:50.9156558Z %204 = arith.addi %arg4, %203 : i32 2026-02-21T09:02:50.9156807Z %205 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9157060Z %206 = tt.splat %204 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9157280Z %207 = arith.addi %206, %205 : tensor<8xi32> 2026-02-21T09:02:50.9157488Z %208 = arith.muli %204, %c2_i32 : i32 2026-02-21T09:02:50.9157735Z %209 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9158001Z %210 = tt.splat %208 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9158217Z %211 = arith.addi %210, %209 : tensor<16xi32> 2026-02-21T09:02:50.9158492Z %212 = tt.expand_dims %53 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9158775Z %213 = arith.muli %212, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9159055Z %214 = tt.expand_dims %211 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9159358Z %215 = tt.broadcast %213 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9159635Z %216 = tt.broadcast %214 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9159888Z %217 = arith.addi %215, %216 : tensor<64x16xi32> 2026-02-21T09:02:50.9160141Z %218 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9160449Z %219 = tt.addptr %218, %217 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9160722Z %220 = tt.load %219 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9160979Z %221 = arith.extf %220 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9161274Z %222 = tt.expand_dims %207 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9161586Z %223 = arith.muli %222, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9161863Z %224 = tt.expand_dims %50 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9162162Z %225 = tt.broadcast %223 : 
tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9162430Z %226 = tt.broadcast %224 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9162667Z %227 = arith.addi %225, %226 : tensor<8x64xi32> 2026-02-21T09:02:50.9162906Z %228 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9163185Z %229 = tt.addptr %228, %227 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9163482Z %230 = tt.load %229 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9163783Z %231 = arith.shli %230, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9163999Z %232 = arith.shrsi %231, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9164220Z %233 = arith.shrsi %230, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9164463Z %234 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9164782Z %235 = tt.expand_dims %234 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9165124Z %236 = tt.expand_dims %235 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9165443Z %237 = tt.expand_dims %232 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9165759Z %238 = tt.expand_dims %233 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9166041Z %239 = arith.cmpi eq, %236, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9166303Z %240 = tt.broadcast %239 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9166581Z %241 = tt.broadcast %237 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9166876Z %242 = arith.select %240, %241, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9167174Z %243 = arith.cmpi eq, %236, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9167427Z %244 = tt.broadcast %238 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9167700Z %245 = tt.broadcast %243 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9167975Z %246 = arith.select %245, %244, %242 : 
tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9168262Z %247 = tt.reshape %246 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9168530Z %248 = arith.sitofp %247 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9168881Z %249 = tt.dot %221, %248, %202, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9169208Z scf.yield %249 : tensor<64x64xf32> 2026-02-21T09:02:50.9169388Z } {tt.flatten} 2026-02-21T09:02:50.9169593Z %55 = arith.truncf %54 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:50.9169875Z %56 = tt.expand_dims %53 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9170149Z %57 = arith.muli %56, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:50.9170409Z %58 = tt.expand_dims %50 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9170693Z %59 = tt.broadcast %57 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.9170954Z %60 = tt.broadcast %58 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.9171185Z %61 = arith.addi %59, %60 : tensor<64x64xi32> 2026-02-21T09:02:50.9171434Z %62 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.9171756Z %63 = tt.addptr %62, %61 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:50.9172021Z tt.store %63, %55 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.9172231Z } {tt.num_stages = 1 : i32} 2026-02-21T09:02:50.9172421Z scf.for %arg3 = %10 to %2 step %c1_i32 : i32 { 2026-02-21T09:02:50.9172632Z %12 = arith.divsi %arg3, %c4_i32 : i32 2026-02-21T09:02:50.9172820Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:02:50.9173010Z %14 = arith.subi %c112_i32, %13 : i32 2026-02-21T09:02:50.9173191Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:02:50.9173385Z %16 = arith.remsi %arg3, %c4_i32 : i32 2026-02-21T09:02:50.9173570Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:02:50.9173746Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:02:50.9173927Z %19 = 
arith.divsi %16, %15 : i32 2026-02-21T09:02:50.9174101Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:02:50.9174336Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:50.9174580Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.9174828Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:02:50.9175017Z %24 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:02:50.9175210Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:02:50.9175406Z %26 = arith.addi %25, %21 : tensor<64xi32> 2026-02-21T09:02:50.9175594Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:02:50.9175938Z %27 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:50.9176320Z %37 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9176574Z %38 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9176782Z %39 = arith.addi %38, %37 : tensor<8xi32> 2026-02-21T09:02:50.9176992Z %40 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:50.9177233Z %41 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9177481Z %42 = tt.splat %40 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9177694Z %43 = arith.addi %42, %41 : tensor<16xi32> 2026-02-21T09:02:50.9177951Z %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9178233Z %45 = arith.muli %44, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9178515Z %46 = tt.expand_dims %43 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9178813Z %47 = tt.broadcast %45 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9179082Z %48 = tt.broadcast %46 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9179314Z %49 = arith.addi %47, %48 : tensor<64x16xi32> 2026-02-21T09:02:50.9179561Z %50 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9179843Z %51 = tt.addptr %50, %49 
: tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9180106Z %52 = tt.load %51 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9180348Z %53 = arith.extf %52 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9180622Z %54 = tt.expand_dims %39 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9180886Z %55 = arith.muli %54, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9181132Z %56 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9181418Z %57 = tt.broadcast %55 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9181703Z %58 = tt.broadcast %56 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9181942Z %59 = arith.addi %57, %58 : tensor<8x64xi32> 2026-02-21T09:02:50.9182182Z %60 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9182451Z %61 = tt.addptr %60, %59 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9182746Z %62 = tt.load %61 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9183005Z %63 = arith.shli %62, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9183225Z %64 = arith.shrsi %63, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9183436Z %65 = arith.shrsi %62, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9183682Z %66 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9183978Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9184285Z %68 = tt.expand_dims %67 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9184606Z %69 = tt.expand_dims %64 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9184915Z %70 = tt.expand_dims %65 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9185203Z %71 = arith.cmpi eq, %68, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9185453Z %72 = tt.broadcast %71 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9185735Z %73 
= tt.broadcast %69 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9186010Z %74 = arith.select %72, %73, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9186271Z %75 = arith.cmpi eq, %68, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9186522Z %76 = tt.broadcast %70 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9186802Z %77 = tt.broadcast %75 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9187096Z %78 = arith.select %77, %76, %74 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9187365Z %79 = tt.reshape %78 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9187613Z %80 = arith.sitofp %79 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9187968Z %81 = tt.dot %53, %80, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9188288Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:02:50.9188497Z %82 = arith.muli %c8_i32, %c1_i32_9 : i32 2026-02-21T09:02:50.9188694Z %83 = arith.addi %arg4, %82 : i32 2026-02-21T09:02:50.9188929Z %84 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9189178Z %85 = tt.splat %83 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9189400Z %86 = arith.addi %85, %84 : tensor<8xi32> 2026-02-21T09:02:50.9189601Z %87 = arith.muli %83, %c2_i32 : i32 2026-02-21T09:02:50.9189829Z %88 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9190076Z %89 = tt.splat %87 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9190277Z %90 = arith.addi %89, %88 : tensor<16xi32> 2026-02-21T09:02:50.9190534Z %91 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9190805Z %92 = arith.muli %91, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9191056Z %93 = tt.expand_dims %90 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9191346Z %94 = tt.broadcast %92 : tensor<64x1xi32> -> tensor<64x16xi32> 
2026-02-21T09:02:50.9191634Z %95 = tt.broadcast %93 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9191881Z %96 = arith.addi %94, %95 : tensor<64x16xi32> 2026-02-21T09:02:50.9192124Z %97 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9192415Z %98 = tt.addptr %97, %96 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9192678Z %99 = tt.load %98 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9192917Z %100 = arith.extf %99 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9193208Z %101 = tt.expand_dims %86 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9193473Z %102 = arith.muli %101, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9193737Z %103 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9194030Z %104 = tt.broadcast %102 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9194289Z %105 = tt.broadcast %103 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9194531Z %106 = arith.addi %104, %105 : tensor<8x64xi32> 2026-02-21T09:02:50.9194764Z %107 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9195045Z %108 = tt.addptr %107, %106 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9195342Z %109 = tt.load %108 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9195617Z %110 = arith.shli %109, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9195843Z %111 = arith.shrsi %110, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9196058Z %112 = arith.shrsi %109, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9196310Z %113 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9196629Z %114 = tt.expand_dims %113 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9196953Z %115 = tt.expand_dims %114 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9197275Z %116 = tt.expand_dims %111 {axis = 1 : i32} : tensor<8x64xi8> 
-> tensor<8x1x64xi8> 2026-02-21T09:02:50.9197602Z %117 = tt.expand_dims %112 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9197924Z %118 = arith.cmpi eq, %115, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9198208Z %119 = tt.broadcast %118 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9198499Z %120 = tt.broadcast %116 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9198796Z %121 = arith.select %119, %120, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9199088Z %122 = arith.cmpi eq, %115, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9199357Z %123 = tt.broadcast %117 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9199632Z %124 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9199931Z %125 = arith.select %124, %123, %121 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9200240Z %126 = tt.reshape %125 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9200525Z %127 = arith.sitofp %126 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9200900Z %128 = tt.dot %100, %127, %81, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9201244Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:02:50.9201463Z %129 = arith.muli %c8_i32, %c2_i32_10 : i32 2026-02-21T09:02:50.9201692Z %130 = arith.addi %arg4, %129 : i32 2026-02-21T09:02:50.9201941Z %131 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9202197Z %132 = tt.splat %130 : i32 -> tensor<8xi32> 2026-02-21T09:02:50.9202425Z %133 = arith.addi %132, %131 : tensor<8xi32> 2026-02-21T09:02:50.9202631Z %134 = arith.muli %130, %c2_i32 : i32 2026-02-21T09:02:50.9202884Z %135 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9203153Z %136 = tt.splat %134 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9203373Z %137 = arith.addi %136, %135 : tensor<16xi32> 
2026-02-21T09:02:50.9203652Z %138 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9203938Z %139 = arith.muli %138, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9204223Z %140 = tt.expand_dims %137 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9204532Z %141 = tt.broadcast %139 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9204818Z %142 = tt.broadcast %140 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9205081Z %143 = arith.addi %141, %142 : tensor<64x16xi32> 2026-02-21T09:02:50.9205347Z %144 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9205662Z %145 = tt.addptr %144, %143 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9205926Z %146 = tt.load %145 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9206178Z %147 = arith.extf %146 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9206472Z %148 = tt.expand_dims %133 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9206737Z %149 = arith.muli %148, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9207001Z %150 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9207289Z %151 = tt.broadcast %149 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9207553Z %152 = tt.broadcast %150 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9207789Z %153 = arith.addi %151, %152 : tensor<8x64xi32> 2026-02-21T09:02:50.9208057Z %154 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9208339Z %155 = tt.addptr %154, %153 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9208642Z %156 = tt.load %155 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9208918Z %157 = arith.shli %156, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9209159Z %158 = arith.shrsi %157, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9209402Z %159 = arith.shrsi %156, %cst_3 : tensor<8x64xi8> 
2026-02-21T09:02:50.9209647Z %160 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9209946Z %161 = tt.expand_dims %160 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9210265Z %162 = tt.expand_dims %161 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9210584Z %163 = tt.expand_dims %158 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9210902Z %164 = tt.expand_dims %159 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9211181Z %165 = arith.cmpi eq, %162, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9211440Z %166 = tt.broadcast %165 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9211798Z %167 = tt.broadcast %163 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9212085Z %168 = arith.select %166, %167, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9212362Z %169 = arith.cmpi eq, %162, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9212611Z %170 = tt.broadcast %164 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9212879Z %171 = tt.broadcast %169 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9213155Z %172 = arith.select %171, %170, %168 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9213438Z %173 = tt.reshape %172 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9213704Z %174 = arith.sitofp %173 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9214057Z %175 = tt.dot %147, %174, %128, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:50.9214385Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:02:50.9214580Z %176 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:02:50.9214784Z %177 = arith.addi %arg4, %176 : i32 2026-02-21T09:02:50.9215015Z %178 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:50.9215267Z %179 = tt.splat %177 : i32 -> tensor<8xi32> 
2026-02-21T09:02:50.9215481Z %180 = arith.addi %179, %178 : tensor<8xi32> 2026-02-21T09:02:50.9215677Z %181 = arith.muli %177, %c2_i32 : i32 2026-02-21T09:02:50.9215918Z %182 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:50.9216166Z %183 = tt.splat %181 : i32 -> tensor<16xi32> 2026-02-21T09:02:50.9216386Z %184 = arith.addi %183, %182 : tensor<16xi32> 2026-02-21T09:02:50.9216640Z %185 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9216919Z %186 = arith.muli %185, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:50.9217193Z %187 = tt.expand_dims %184 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:50.9217486Z %188 = tt.broadcast %186 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9217761Z %189 = tt.broadcast %187 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:50.9218002Z %190 = arith.addi %188, %189 : tensor<64x16xi32> 2026-02-21T09:02:50.9218258Z %191 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9218560Z %192 = tt.addptr %191, %190 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:50.9218837Z %193 = tt.load %192 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:50.9219106Z %194 = arith.extf %193 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:50.9219390Z %195 = tt.expand_dims %180 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:50.9219662Z %196 = arith.muli %195, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:50.9219917Z %197 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9220237Z %198 = tt.broadcast %196 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9220537Z %199 = tt.broadcast %197 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:50.9220773Z %200 = arith.addi %198, %199 : tensor<8x64xi32> 2026-02-21T09:02:50.9221020Z %201 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9221291Z %202 = tt.addptr 
%201, %200 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:50.9221623Z %203 = tt.load %202 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:50.9221891Z %204 = arith.shli %203, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9222114Z %205 = arith.shrsi %204, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9222336Z %206 = arith.shrsi %203, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:50.9222602Z %207 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:50.9222906Z %208 = tt.expand_dims %207 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:50.9223222Z %209 = tt.expand_dims %208 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:50.9223549Z %210 = tt.expand_dims %205 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9223871Z %211 = tt.expand_dims %206 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:50.9224152Z %212 = arith.cmpi eq, %209, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9224411Z %213 = tt.broadcast %212 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9224680Z %214 = tt.broadcast %210 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9224969Z %215 = arith.select %213, %214, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9225239Z %216 = arith.cmpi eq, %209, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:50.9225496Z %217 = tt.broadcast %211 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:50.9225764Z %218 = tt.broadcast %216 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:50.9226036Z %219 = arith.select %218, %217, %215 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:50.9226319Z %220 = tt.reshape %219 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:50.9226576Z %221 = arith.sitofp %220 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:50.9226935Z %222 = tt.dot %194, %221, %175, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> 
tensor<64x64xf32> 2026-02-21T09:02:50.9227259Z scf.yield %222 : tensor<64x64xf32> 2026-02-21T09:02:50.9227437Z } {tt.flatten} 2026-02-21T09:02:50.9227637Z %28 = arith.truncf %27 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:50.9227919Z %29 = tt.expand_dims %26 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:50.9228189Z %30 = arith.muli %29, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:50.9228440Z %31 = tt.expand_dims %23 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:50.9228730Z %32 = tt.broadcast %30 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.9228989Z %33 = tt.broadcast %31 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:50.9229222Z %34 = arith.addi %32, %33 : tensor<64x64xi32> 2026-02-21T09:02:50.9229467Z %35 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.9229741Z %36 = tt.addptr %35, %34 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:50.9230023Z tt.store %36, %28 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:50.9230225Z } {tt.num_stages = 1 : i32} 2026-02-21T09:02:50.9230396Z tt.return 2026-02-21T09:02:50.9230525Z } 2026-02-21T09:02:50.9230652Z } 2026-02-21T09:02:50.9230722Z 2026-02-21T09:02:50.9230781Z {-# 2026-02-21T09:02:50.9230911Z external_resources: { 2026-02-21T09:02:50.9231102Z mlir_reproducer: { 2026-02-21T09:02:50.9235543Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:02:50.9239895Z disable_threading: false, 2026-02-21T09:02:50.9240073Z verify_each: true 2026-02-21T09:02:50.9240216Z } 2026-02-21T09:02:50.9240343Z } 2026-02-21T09:02:50.9240461Z #-} 2026-02-21T09:02:50.9240906Z /tmp/torchinductor_root/22/c22ckupeeabgzgknr32c7hrhqvgo47imt3h5zrttb5ezu5batycn.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:02:50.9242015Z /tmp/torchinductor_root/22/c22ckupeeabgzgknr32c7hrhqvgo47imt3h5zrttb5ezu5batycn.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:02:50.9242879Z [136s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:02:50.9244098Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:02:50.9245204Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:02:50.9245470Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:02:51.4354962Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:02:51.4356700Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:02:51.4357050Z ^ 2026-02-21T09:02:51.4357449Z /tmp/torchinductor_root/pd/cpd7pnzzp5ymhqqpg33txqka7aqoi3k7e5jaabie6grhx4jgd4cc.py:87:40: note: called from 2026-02-21T09:02:51.4358115Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:02:51.4358339Z ^ 2026-02-21T09:02:51.4358788Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:02:51.4359362Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:02:51.4359667Z ^ 2026-02-21T09:02:51.4360059Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:02:51.4360705Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:02:51.4363649Z %cst = arith.constant dense<0> : tensor<8x2x64xi8> 2026-02-21T09:02:51.4363985Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:02:51.4364195Z 
%c8_i32 = arith.constant 8 : i32 2026-02-21T09:02:51.4368116Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:02:51.4369504Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:02:51.4369755Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:02:51.4370163Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:02:51.4370451Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:02:51.4370695Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:02:51.4370940Z %cst_3 = arith.constant dense<4> : tensor<8x64xi8> 2026-02-21T09:02:51.4371187Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:02:51.4371428Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:02:51.4371717Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:02:51.4371960Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:02:51.4372215Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:02:51.4372407Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:02:51.4372603Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:02:51.4372810Z %0 = tt.get_program_id x : i32 2026-02-21T09:02:51.4372998Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:02:51.4373199Z %2 = arith.minsi %1, %c112_i32 : i32 2026-02-21T09:02:51.4373399Z %3 = arith.subi %2, %0 : i32 2026-02-21T09:02:51.4373578Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:02:51.4373765Z %4 = arith.subi %c1_i32, %c1_i32_7 : i32 2026-02-21T09:02:51.4373951Z %5 = arith.addi %3, %4 : i32 2026-02-21T09:02:51.4374119Z %6 = arith.divui %5, %c1_i32 : i32 2026-02-21T09:02:51.4374301Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:02:51.4374481Z %7 = arith.remsi %6, %c2_i32_8 : i32 2026-02-21T09:02:51.4374655Z %8 = arith.subi %6, %7 : i32 2026-02-21T09:02:51.4374828Z %9 = arith.muli %8, %c1_i32 : i32 2026-02-21T09:02:51.4375001Z %10 = arith.addi %0, %9 : i32 2026-02-21T09:02:51.4375194Z %11 = arith.muli %c1_i32, %c2_i32_8 : i32 2026-02-21T09:02:51.4375405Z scf.for %arg3 = %0 
to %10 step %11 : i32 { 2026-02-21T09:02:51.4375625Z %12 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:02:51.4375819Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:02:51.4376007Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:02:51.4376194Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:02:51.4376379Z %16 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:02:51.4376571Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:02:51.4376743Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:02:51.4376921Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:02:51.4377095Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:02:51.4377333Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:51.4377585Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:02:51.4377854Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:02:51.4378049Z %24 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:02:51.4378237Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:02:51.4378436Z %26 = arith.addi %25, %21 : tensor<64xi32> 2026-02-21T09:02:51.4378629Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:02:51.4378863Z %c24_i32 = arith.constant 24 : i32 2026-02-21T09:02:51.4379210Z %27 = scf.for %arg4 = %c0_i32 to %c4080_i32 step %c24_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:51.4379580Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4379870Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4380083Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T09:02:51.4380277Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:51.4380510Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4380756Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4380959Z %72 = arith.addi %71, %70 : tensor<16xi32> 2026-02-21T09:02:51.4381219Z %73 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> 
tensor<64x1xi32> 2026-02-21T09:02:51.4381521Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4381825Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4382116Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4382384Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4382618Z %78 = arith.addi %76, %77 : tensor<64x16xi32> 2026-02-21T09:02:51.4382877Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4383169Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4383428Z %81 = tt.load %80 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4383673Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4383953Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4384225Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4384477Z %85 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4384769Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4385035Z %87 = tt.broadcast %85 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4385266Z %88 = arith.addi %86, %87 : tensor<8x64xi32> 2026-02-21T09:02:51.4385504Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4385768Z %90 = tt.addptr %89, %88 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4386066Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4386337Z %92 = arith.shli %91, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4386550Z %93 = arith.shrsi %92, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4386773Z %94 = arith.shrsi %91, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4387013Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4387316Z %96 = 
tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4387625Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4387943Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4388252Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4388532Z %100 = arith.cmpi eq, %97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4388790Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4389092Z %102 = tt.broadcast %98 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4389387Z %103 = arith.select %101, %102, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4389659Z %104 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4389911Z %105 = tt.broadcast %99 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4390234Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4390543Z %107 = arith.select %106, %105, %103 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4390825Z %108 = tt.reshape %107 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4391085Z %109 = arith.sitofp %108 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4391454Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4391822Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T09:02:51.4392021Z %111 = arith.muli %c8_i32, %c1_i32_12 : i32 2026-02-21T09:02:51.4392225Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:02:51.4392455Z %113 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4392732Z %114 = tt.splat %112 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4392943Z %115 = arith.addi %114, %113 : tensor<8xi32> 2026-02-21T09:02:51.4393148Z %116 = arith.muli %112, %c2_i32 : i32 
2026-02-21T09:02:51.4393387Z %117 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4393639Z %118 = tt.splat %116 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4393853Z %119 = arith.addi %118, %117 : tensor<16xi32> 2026-02-21T09:02:51.4394106Z %120 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4394385Z %121 = arith.muli %120, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4394645Z %122 = tt.expand_dims %119 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4394954Z %123 = tt.broadcast %121 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4395237Z %124 = tt.broadcast %122 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4395480Z %125 = arith.addi %123, %124 : tensor<64x16xi32> 2026-02-21T09:02:51.4395734Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4396035Z %127 = tt.addptr %126, %125 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4396305Z %128 = tt.load %127 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4396554Z %129 = arith.extf %128 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4396841Z %130 = tt.expand_dims %115 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4397116Z %131 = arith.muli %130, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4397383Z %132 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4397669Z %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4397937Z %134 = tt.broadcast %132 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4398177Z %135 = arith.addi %133, %134 : tensor<8x64xi32> 2026-02-21T09:02:51.4398418Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4398691Z %137 = tt.addptr %136, %135 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4399001Z %138 = tt.load %137 evictionPolicy = evict_last : 
tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4399276Z %139 = arith.shli %138, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4399495Z %140 = arith.shrsi %139, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4399720Z %141 = arith.shrsi %138, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4399964Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4400297Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4400611Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4400943Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4401293Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4401631Z %147 = arith.cmpi eq, %144, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4401891Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4402162Z %149 = tt.broadcast %145 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4402455Z %150 = arith.select %148, %149, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4402736Z %151 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4402987Z %152 = tt.broadcast %146 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4403261Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4403535Z %154 = arith.select %153, %152, %150 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4403851Z %155 = tt.reshape %154 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4404111Z %156 = arith.sitofp %155 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4404471Z %157 = tt.dot %129, %156, %110, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4404798Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T09:02:51.4404994Z %158 = arith.muli 
%c8_i32, %c2_i32_13 : i32 2026-02-21T09:02:51.4405198Z %159 = arith.addi %arg4, %158 : i32 2026-02-21T09:02:51.4405424Z %160 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4405677Z %161 = tt.splat %159 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4405882Z %162 = arith.addi %161, %160 : tensor<8xi32> 2026-02-21T09:02:51.4406085Z %163 = arith.muli %159, %c2_i32 : i32 2026-02-21T09:02:51.4406321Z %164 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4406569Z %165 = tt.splat %163 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4406782Z %166 = arith.addi %165, %164 : tensor<16xi32> 2026-02-21T09:02:51.4407039Z %167 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4407320Z %168 = arith.muli %167, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4407582Z %169 = tt.expand_dims %166 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4407885Z %170 = tt.broadcast %168 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4408152Z %171 = tt.broadcast %169 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4408394Z %172 = arith.addi %170, %171 : tensor<64x16xi32> 2026-02-21T09:02:51.4408645Z %173 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4408950Z %174 = tt.addptr %173, %172 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4409237Z %175 = tt.load %174 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4409494Z %176 = arith.extf %175 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4409792Z %177 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4410077Z %178 = arith.muli %177, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4410345Z %179 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4410652Z %180 = tt.broadcast %178 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4410923Z %181 = 
tt.broadcast %179 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4411212Z %182 = arith.addi %180, %181 : tensor<8x64xi32> 2026-02-21T09:02:51.4411468Z %183 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4411790Z %184 = tt.addptr %183, %182 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4412116Z %185 = tt.load %184 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4412420Z %186 = arith.shli %185, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4412700Z %187 = arith.shrsi %186, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4412928Z %188 = arith.shrsi %185, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4413191Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4413505Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4413837Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4414179Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4414508Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4414816Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4415113Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4415399Z %196 = tt.broadcast %192 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4415709Z %197 = arith.select %195, %196, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4415997Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4416267Z %199 = tt.broadcast %193 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4416544Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4416842Z %201 = arith.select %200, %199, %197 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4417147Z %202 = tt.reshape 
%201 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4417421Z %203 = arith.sitofp %202 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4417798Z %204 = tt.dot %176, %203, %157, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4418128Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:02:51.4418318Z } {tt.flatten} 2026-02-21T09:02:51.4418605Z %28 = scf.for %arg4 = %c4080_i32 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %27) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:51.4418969Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4419230Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4419434Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T09:02:51.4419640Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:51.4419874Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4420126Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4420332Z %72 = arith.addi %71, %70 : tensor<16xi32> 2026-02-21T09:02:51.4420582Z %73 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4420863Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4421117Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4421415Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4421702Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4421949Z %78 = arith.addi %76, %77 : tensor<64x16xi32> 2026-02-21T09:02:51.4422200Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4422480Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4422770Z %81 = tt.load %80 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4423008Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 
2026-02-21T09:02:51.4423295Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4423558Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4423908Z %85 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4424220Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4424476Z %87 = tt.broadcast %85 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4424717Z %88 = arith.addi %86, %87 : tensor<8x64xi32> 2026-02-21T09:02:51.4424948Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4425223Z %90 = tt.addptr %89, %88 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4425523Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4425788Z %92 = arith.shli %91, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4426010Z %93 = arith.shrsi %92, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4426220Z %94 = arith.shrsi %91, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4426494Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4426784Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4427095Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4427410Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4427721Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4428006Z %100 = arith.cmpi eq, %97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4428258Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4428533Z %102 = tt.broadcast %98 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4428819Z %103 = arith.select %101, %102, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4429095Z %104 = 
arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4429353Z %105 = tt.broadcast %99 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4429613Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4429896Z %107 = arith.select %106, %105, %103 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4430171Z %108 = tt.reshape %107 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4430440Z %109 = arith.sitofp %108 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4430805Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4431132Z scf.yield %110 : tensor<64x64xf32> 2026-02-21T09:02:51.4431336Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:02:51.4431593Z %29 = arith.truncf %28 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:51.4431882Z %30 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4432145Z %31 = arith.muli %30, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:51.4432406Z %32 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4432690Z %33 = tt.broadcast %31 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:51.4432943Z %34 = tt.broadcast %32 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:51.4433183Z %35 = arith.addi %33, %34 : tensor<64x64xi32> 2026-02-21T09:02:51.4433423Z %36 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:51.4433749Z %37 = tt.addptr %36, %35 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:51.4434018Z tt.store %37, %29 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:51.4434228Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:02:51.4434435Z %38 = arith.muli %c1_i32, %c1_i32_9 : i32 2026-02-21T09:02:51.4434629Z %39 = arith.addi %arg3, %38 : i32 2026-02-21T09:02:51.4434849Z %40 = arith.divsi %39, %c448_i32 : i32 2026-02-21T09:02:51.4435031Z 
%41 = arith.muli %40, %c4_i32 : i32 2026-02-21T09:02:51.4435239Z %42 = arith.subi %c1_i32, %41 : i32 2026-02-21T09:02:51.4435415Z %43 = arith.minsi %42, %c4_i32 : i32 2026-02-21T09:02:51.4435611Z %44 = arith.remsi %39, %c448_i32 : i32 2026-02-21T09:02:51.4435806Z %45 = arith.remsi %44, %43 : i32 2026-02-21T09:02:51.4435981Z %46 = arith.addi %41, %45 : i32 2026-02-21T09:02:51.4436161Z %47 = arith.divsi %44, %43 : i32 2026-02-21T09:02:51.4436336Z %48 = arith.muli %46, %c64_i32 : i32 2026-02-21T09:02:51.4436565Z %49 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:51.4436807Z %50 = tt.splat %48 : i32 -> tensor<64xi32> 2026-02-21T09:02:51.4437012Z %51 = arith.addi %50, %49 : tensor<64xi32> 2026-02-21T09:02:51.4437198Z %52 = arith.muli %47, %c64_i32 : i32 2026-02-21T09:02:51.4437415Z %53 = tt.splat %52 : i32 -> tensor<64xi32> 2026-02-21T09:02:51.4437625Z %54 = arith.addi %53, %49 : tensor<64xi32> 2026-02-21T09:02:51.4437820Z %c4080_i32_10 = arith.constant 4080 : i32 2026-02-21T09:02:51.4438022Z %c24_i32_11 = arith.constant 24 : i32 2026-02-21T09:02:51.4438349Z %55 = scf.for %arg4 = %c0_i32 to %c4080_i32_10 step %c24_i32_11 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:51.4438726Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4438971Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4439180Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T09:02:51.4439385Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:51.4439611Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4439862Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4440063Z %72 = arith.addi %71, %70 : tensor<16xi32> 2026-02-21T09:02:51.4440319Z %73 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4440585Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4440844Z %75 
= tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4441134Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4441386Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4441668Z %78 = arith.addi %76, %77 : tensor<64x16xi32> 2026-02-21T09:02:51.4441907Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4442196Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4442452Z %81 = tt.load %80 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4442692Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4442980Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4443242Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4443499Z %85 = tt.expand_dims %54 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4443776Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4444035Z %87 = tt.broadcast %85 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4444274Z %88 = arith.addi %86, %87 : tensor<8x64xi32> 2026-02-21T09:02:51.4444510Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4444810Z %90 = tt.addptr %89, %88 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4445099Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4445367Z %92 = arith.shli %91, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4445577Z %93 = arith.shrsi %92, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4445833Z %94 = arith.shrsi %91, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4446115Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4457823Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4458259Z %97 = tt.expand_dims %96 {axis = 2 
: i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4458616Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4458941Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4459254Z %100 = arith.cmpi eq, %97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4459534Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4459834Z %102 = tt.broadcast %98 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4460223Z %103 = arith.select %101, %102, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4460535Z %104 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4460815Z %105 = tt.broadcast %99 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4461094Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4461397Z %107 = arith.select %106, %105, %103 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4461745Z %108 = tt.reshape %107 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4462034Z %109 = arith.sitofp %108 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4462431Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4462779Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T09:02:51.4463011Z %111 = arith.muli %c8_i32, %c1_i32_12 : i32 2026-02-21T09:02:51.4463225Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:02:51.4463486Z %113 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4463755Z %114 = tt.splat %112 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4463991Z %115 = arith.addi %114, %113 : tensor<8xi32> 2026-02-21T09:02:51.4464213Z %116 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:02:51.4464469Z %117 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4464731Z %118 
= tt.splat %116 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4464943Z %119 = arith.addi %118, %117 : tensor<16xi32> 2026-02-21T09:02:51.4465213Z %120 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4465493Z %121 = arith.muli %120, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4465772Z %122 = tt.expand_dims %119 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4466076Z %123 = tt.broadcast %121 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4466345Z %124 = tt.broadcast %122 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4466601Z %125 = arith.addi %123, %124 : tensor<64x16xi32> 2026-02-21T09:02:51.4466857Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4467167Z %127 = tt.addptr %126, %125 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4467441Z %128 = tt.load %127 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4467698Z %129 = arith.extf %128 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4468192Z %130 = tt.expand_dims %115 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4468465Z %131 = arith.muli %130, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4468739Z %132 = tt.expand_dims %54 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4469036Z %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4469353Z %134 = tt.broadcast %132 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4469647Z %135 = arith.addi %133, %134 : tensor<8x64xi32> 2026-02-21T09:02:51.4469895Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4470187Z %137 = tt.addptr %136, %135 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4470503Z %138 = tt.load %137 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4470792Z %139 = arith.shli %138, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4471018Z %140 = arith.shrsi 
%139, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4471245Z %141 = arith.shrsi %138, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4471504Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4471861Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4472193Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4472521Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4472857Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4473146Z %147 = arith.cmpi eq, %144, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4473415Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4473698Z %149 = tt.broadcast %145 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4473993Z %150 = arith.select %148, %149, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4474281Z %151 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4474539Z %152 = tt.broadcast %146 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4474825Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4475121Z %154 = arith.select %153, %152, %150 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4475406Z %155 = tt.reshape %154 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4475682Z %156 = arith.sitofp %155 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4476044Z %157 = tt.dot %129, %156, %110, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4476381Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T09:02:51.4476587Z %158 = arith.muli %c8_i32, %c2_i32_13 : i32 2026-02-21T09:02:51.4476801Z %159 = arith.addi %arg4, %158 : i32 2026-02-21T09:02:51.4477043Z %160 = tt.make_range {end = 
8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4477289Z %161 = tt.splat %159 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4477509Z %162 = arith.addi %161, %160 : tensor<8xi32> 2026-02-21T09:02:51.4477710Z %163 = arith.muli %159, %c2_i32 : i32 2026-02-21T09:02:51.4477953Z %164 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4478205Z %165 = tt.splat %163 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4478423Z %166 = arith.addi %165, %164 : tensor<16xi32> 2026-02-21T09:02:51.4478689Z %167 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4478963Z %168 = arith.muli %167, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4479236Z %169 = tt.expand_dims %166 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4479564Z %170 = tt.broadcast %168 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4479843Z %171 = tt.broadcast %169 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4480085Z %172 = arith.addi %170, %171 : tensor<64x16xi32> 2026-02-21T09:02:51.4480347Z %173 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4480695Z %174 = tt.addptr %173, %172 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4480990Z %175 = tt.load %174 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4481242Z %176 = arith.extf %175 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4481564Z %177 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4481847Z %178 = arith.muli %177, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4482114Z %179 = tt.expand_dims %54 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4482407Z %180 = tt.broadcast %178 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4482683Z %181 = tt.broadcast %179 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4482926Z %182 = arith.addi %180, %181 : tensor<8x64xi32> 
2026-02-21T09:02:51.4483204Z %183 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4483488Z %184 = tt.addptr %183, %182 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4483808Z %185 = tt.load %184 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4484088Z %186 = arith.shli %185, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4484310Z %187 = arith.shrsi %186, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4484539Z %188 = arith.shrsi %185, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4484786Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4485093Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4485418Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4485751Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4486083Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4486375Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4486642Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4486916Z %196 = tt.broadcast %192 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4487212Z %197 = arith.select %195, %196, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4487503Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4487757Z %199 = tt.broadcast %193 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4488042Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4488325Z %201 = arith.select %200, %199, %197 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4488619Z %202 = tt.reshape %201 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4488886Z %203 = arith.sitofp %202 : tensor<16x64xi8> to 
tensor<16x64xf32> 2026-02-21T09:02:51.4489263Z %204 = tt.dot %176, %203, %157, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4489600Z scf.yield %204 : tensor<64x64xf32> 2026-02-21T09:02:51.4489788Z } {tt.flatten} 2026-02-21T09:02:51.4490099Z %56 = scf.for %arg4 = %c4080_i32_10 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %55) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:51.4490466Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4490724Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4490956Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T09:02:51.4491165Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:51.4491408Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4491683Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4491927Z %72 = arith.addi %71, %70 : tensor<16xi32> 2026-02-21T09:02:51.4492205Z %73 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4492483Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4492737Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4493030Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4493299Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4493536Z %78 = arith.addi %76, %77 : tensor<64x16xi32> 2026-02-21T09:02:51.4493793Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4494080Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4494348Z %81 = tt.load %80 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4494618Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4494898Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 
2026-02-21T09:02:51.4495172Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4495425Z %85 = tt.expand_dims %54 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4495719Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4495977Z %87 = tt.broadcast %85 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4496220Z %88 = arith.addi %86, %87 : tensor<8x64xi32> 2026-02-21T09:02:51.4496465Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4496730Z %90 = tt.addptr %89, %88 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4497043Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4497309Z %92 = arith.shli %91, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4497533Z %93 = arith.shrsi %92, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4497746Z %94 = arith.shrsi %91, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4497996Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4498291Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4498600Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4498947Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4499276Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4499583Z %100 = arith.cmpi eq, %97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4499858Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4500142Z %102 = tt.broadcast %98 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4500455Z %103 = arith.select %101, %102, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4500745Z %104 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4501020Z %105 = tt.broadcast %99 : 
tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4501298Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4501629Z %107 = arith.select %106, %105, %103 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4501938Z %108 = tt.reshape %107 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4502244Z %109 = arith.sitofp %108 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4502626Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4502970Z scf.yield %110 : tensor<64x64xf32> 2026-02-21T09:02:51.4503178Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:02:51.4503453Z %57 = arith.truncf %56 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:51.4503789Z %58 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4504066Z %59 = arith.muli %58, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:51.4504336Z %60 = tt.expand_dims %54 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4504632Z %61 = tt.broadcast %59 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:51.4504904Z %62 = tt.broadcast %60 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:51.4505147Z %63 = arith.addi %61, %62 : tensor<64x64xi32> 2026-02-21T09:02:51.4505407Z %64 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:51.4505710Z %65 = tt.addptr %64, %63 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:51.4506004Z tt.store %65, %57 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:51.4506263Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:02:51.4506507Z scf.for %arg3 = %10 to %2 step %c1_i32 : i32 { 2026-02-21T09:02:51.4506737Z %12 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:02:51.4506942Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:02:51.4507143Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:02:51.4507351Z %15 = 
arith.minsi %14, %c4_i32 : i32 2026-02-21T09:02:51.4507544Z %16 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:02:51.4507743Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:02:51.4507922Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:02:51.4508106Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:02:51.4508285Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:02:51.4508526Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:51.4508778Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:02:51.4508989Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:02:51.4509190Z %24 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:02:51.4509384Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:02:51.4509589Z %26 = arith.addi %25, %21 : tensor<64xi32> 2026-02-21T09:02:51.4509786Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:02:51.4509990Z %c24_i32 = arith.constant 24 : i32 2026-02-21T09:02:51.4510315Z %27 = scf.for %arg4 = %c0_i32 to %c4080_i32 step %c24_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:51.4510695Z %38 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4510960Z %39 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4511170Z %40 = arith.addi %39, %38 : tensor<8xi32> 2026-02-21T09:02:51.4511380Z %41 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:51.4511636Z %42 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4511890Z %43 = tt.splat %41 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4512104Z %44 = arith.addi %43, %42 : tensor<16xi32> 2026-02-21T09:02:51.4512360Z %45 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4512634Z %46 = arith.muli %45, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4512886Z %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4513176Z %48 = tt.broadcast %46 : tensor<64x1xi32> 
-> tensor<64x16xi32> 2026-02-21T09:02:51.4513425Z %49 = tt.broadcast %47 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4513695Z %50 = arith.addi %48, %49 : tensor<64x16xi32> 2026-02-21T09:02:51.4513942Z %51 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4514223Z %52 = tt.addptr %51, %50 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4514490Z %53 = tt.load %52 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4514751Z %54 = arith.extf %53 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4515062Z %55 = tt.expand_dims %40 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4515322Z %56 = arith.muli %55, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4515579Z %57 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4515867Z %58 = tt.broadcast %56 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4516122Z %59 = tt.broadcast %57 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4516361Z %60 = arith.addi %58, %59 : tensor<8x64xi32> 2026-02-21T09:02:51.4516589Z %61 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4516853Z %62 = tt.addptr %61, %60 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4517159Z %63 = tt.load %62 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4517429Z %64 = arith.shli %63, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4517644Z %65 = arith.shrsi %64, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4517854Z %66 = arith.shrsi %63, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4518098Z %67 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4518384Z %68 = tt.expand_dims %67 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4518701Z %69 = tt.expand_dims %68 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4519017Z %70 = tt.expand_dims %65 {axis = 1 : i32} : tensor<8x64xi8> -> 
tensor<8x1x64xi8> 2026-02-21T09:02:51.4519328Z %71 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4519608Z %72 = arith.cmpi eq, %69, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4519854Z %73 = tt.broadcast %72 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4520122Z %74 = tt.broadcast %70 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4520396Z %75 = arith.select %73, %74, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4520665Z %76 = arith.cmpi eq, %69, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4520917Z %77 = tt.broadcast %71 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4521173Z %78 = tt.broadcast %76 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4521448Z %79 = arith.select %78, %77, %75 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4521740Z %80 = tt.reshape %79 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4521992Z %81 = arith.sitofp %80 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4522340Z %82 = tt.dot %54, %81, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4522658Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:02:51.4522861Z %83 = arith.muli %c8_i32, %c1_i32_9 : i32 2026-02-21T09:02:51.4523054Z %84 = arith.addi %arg4, %83 : i32 2026-02-21T09:02:51.4523286Z %85 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4523525Z %86 = tt.splat %84 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4523728Z %87 = arith.addi %86, %85 : tensor<8xi32> 2026-02-21T09:02:51.4523920Z %88 = arith.muli %84, %c2_i32 : i32 2026-02-21T09:02:51.4524149Z %89 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4524440Z %90 = tt.splat %88 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4524641Z %91 = arith.addi %90, %89 : tensor<16xi32> 2026-02-21T09:02:51.4524896Z %92 = tt.expand_dims %23 {axis = 1 
: i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4525160Z %93 = arith.muli %92, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4525419Z %94 = tt.expand_dims %91 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4525734Z %95 = tt.broadcast %93 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4526015Z %96 = tt.broadcast %94 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4526266Z %97 = arith.addi %95, %96 : tensor<64x16xi32> 2026-02-21T09:02:51.4526512Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4526806Z %99 = tt.addptr %98, %97 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4527067Z %100 = tt.load %99 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4527319Z %101 = arith.extf %100 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4527612Z %102 = tt.expand_dims %87 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4527877Z %103 = arith.muli %102, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4528163Z %104 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4528452Z %105 = tt.broadcast %103 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4528723Z %106 = tt.broadcast %104 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4528966Z %107 = arith.addi %105, %106 : tensor<8x64xi32> 2026-02-21T09:02:51.4529215Z %108 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4529500Z %109 = tt.addptr %108, %107 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4529804Z %110 = tt.load %109 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4530081Z %111 = arith.shli %110, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4530302Z %112 = arith.shrsi %111, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4530528Z %113 = arith.shrsi %110, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4530780Z %114 = tt.make_range {end = 2 : i32, start = 0 : i32} : 
tensor<2xi32> 2026-02-21T09:02:51.4531081Z %115 = tt.expand_dims %114 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4531407Z %116 = tt.expand_dims %115 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4531754Z %117 = tt.expand_dims %112 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4532075Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4532354Z %119 = arith.cmpi eq, %116, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4532614Z %120 = tt.broadcast %119 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4532891Z %121 = tt.broadcast %117 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4533173Z %122 = arith.select %120, %121, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4533454Z %123 = arith.cmpi eq, %116, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4533702Z %124 = tt.broadcast %118 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4533970Z %125 = tt.broadcast %123 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4534249Z %126 = arith.select %125, %124, %122 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4534533Z %127 = tt.reshape %126 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4534803Z %128 = arith.sitofp %127 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4535158Z %129 = tt.dot %101, %128, %82, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4535510Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:02:51.4535707Z %130 = arith.muli %c8_i32, %c2_i32_10 : i32 2026-02-21T09:02:51.4535907Z %131 = arith.addi %arg4, %130 : i32 2026-02-21T09:02:51.4536138Z %132 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4536419Z %133 = tt.splat %131 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4536654Z %134 = arith.addi %133, %132 : tensor<8xi32> 
2026-02-21T09:02:51.4536852Z %135 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:02:51.4537108Z %136 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4537365Z %137 = tt.splat %135 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4537571Z %138 = arith.addi %137, %136 : tensor<16xi32> 2026-02-21T09:02:51.4537837Z %139 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4538113Z %140 = arith.muli %139, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4538389Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4538697Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4538968Z %143 = tt.broadcast %141 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4539234Z %144 = arith.addi %142, %143 : tensor<64x16xi32> 2026-02-21T09:02:51.4539478Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4539777Z %146 = tt.addptr %145, %144 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4540042Z %147 = tt.load %146 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4540288Z %148 = arith.extf %147 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4540579Z %149 = tt.expand_dims %134 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4540842Z %150 = arith.muli %149, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4541102Z %151 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4541390Z %152 = tt.broadcast %150 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4541692Z %153 = tt.broadcast %151 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4541930Z %154 = arith.addi %152, %153 : tensor<8x64xi32> 2026-02-21T09:02:51.4542172Z %155 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4542464Z %156 = tt.addptr %155, %154 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 
2026-02-21T09:02:51.4542783Z %157 = tt.load %156 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4543067Z %158 = arith.shli %157, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4543293Z %159 = arith.shrsi %158, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4543526Z %160 = arith.shrsi %157, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4543789Z %161 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4544093Z %162 = tt.expand_dims %161 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4544432Z %163 = tt.expand_dims %162 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4544771Z %164 = tt.expand_dims %159 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4545112Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4545407Z %166 = arith.cmpi eq, %163, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4545679Z %167 = tt.broadcast %166 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4545962Z %168 = tt.broadcast %164 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4546258Z %169 = arith.select %167, %168, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4546549Z %170 = arith.cmpi eq, %163, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4546837Z %171 = tt.broadcast %165 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4547120Z %172 = tt.broadcast %170 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4547410Z %173 = arith.select %172, %171, %169 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4547704Z %174 = tt.reshape %173 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4548004Z %175 = arith.sitofp %174 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4548400Z %176 = tt.dot %148, %175, %129, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4548749Z 
scf.yield %176 : tensor<64x64xf32> 2026-02-21T09:02:51.4548937Z } {tt.flatten} 2026-02-21T09:02:51.4549244Z %28 = scf.for %arg4 = %c4080_i32 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %27) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:51.4549627Z %38 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:51.4549882Z %39 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:51.4550104Z %40 = arith.addi %39, %38 : tensor<8xi32> 2026-02-21T09:02:51.4550310Z %41 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:51.4550586Z %42 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:51.4550839Z %43 = tt.splat %41 : i32 -> tensor<16xi32> 2026-02-21T09:02:51.4551055Z %44 = arith.addi %43, %42 : tensor<16xi32> 2026-02-21T09:02:51.4551327Z %45 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4551665Z %46 = arith.muli %45, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:51.4551948Z %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:51.4552246Z %48 = tt.broadcast %46 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4552518Z %49 = tt.broadcast %47 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:51.4552751Z %50 = arith.addi %48, %49 : tensor<64x16xi32> 2026-02-21T09:02:51.4553001Z %51 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4553289Z %52 = tt.addptr %51, %50 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:51.4553549Z %53 = tt.load %52 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:51.4553796Z %54 = arith.extf %53 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:51.4554073Z %55 = tt.expand_dims %40 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:51.4554339Z %56 = arith.muli %55, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:51.4554596Z %57 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4554878Z %58 = 
tt.broadcast %56 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4555144Z %59 = tt.broadcast %57 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:51.4555376Z %60 = arith.addi %58, %59 : tensor<8x64xi32> 2026-02-21T09:02:51.4555616Z %61 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4555876Z %62 = tt.addptr %61, %60 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:51.4556174Z %63 = tt.load %62 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:51.4556436Z %64 = arith.shli %63, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4556646Z %65 = arith.shrsi %64, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4556862Z %66 = arith.shrsi %63, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:51.4557097Z %67 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:51.4557390Z %68 = tt.expand_dims %67 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:51.4557690Z %69 = tt.expand_dims %68 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:51.4558028Z %70 = tt.expand_dims %65 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4558338Z %71 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:51.4558608Z %72 = arith.cmpi eq, %69, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4558858Z %73 = tt.broadcast %72 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4559144Z %74 = tt.broadcast %70 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4559448Z %75 = arith.select %73, %74, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:51.4559714Z %76 = arith.cmpi eq, %69, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:51.4559953Z %77 = tt.broadcast %71 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:51.4560212Z %78 = tt.broadcast %76 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:51.4560475Z %79 = arith.select %78, %77, %75 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 
2026-02-21T09:02:51.4560744Z %80 = tt.reshape %79 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:51.4560990Z %81 = arith.sitofp %80 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:51.4561339Z %82 = tt.dot %54, %81, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:51.4561784Z scf.yield %82 : tensor<64x64xf32> 2026-02-21T09:02:51.4561979Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:02:51.4562216Z %29 = arith.truncf %28 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:51.4562497Z %30 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:51.4562763Z %31 = arith.muli %30, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:51.4563014Z %32 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:51.4563299Z %33 = tt.broadcast %31 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:51.4563557Z %34 = tt.broadcast %32 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:51.4563785Z %35 = arith.addi %33, %34 : tensor<64x64xi32> 2026-02-21T09:02:51.4564032Z %36 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:51.4564309Z %37 = tt.addptr %36, %35 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:51.4564572Z tt.store %37, %29 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:51.4564807Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:02:51.4565026Z tt.return 2026-02-21T09:02:51.4565165Z } 2026-02-21T09:02:51.4565290Z } 2026-02-21T09:02:51.4565359Z 2026-02-21T09:02:51.4565419Z {-# 2026-02-21T09:02:51.4565546Z external_resources: { 2026-02-21T09:02:51.4565706Z mlir_reproducer: { 2026-02-21T09:02:51.4570132Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, 
tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:02:51.4574695Z disable_threading: false, 2026-02-21T09:02:51.4574856Z verify_each: true 2026-02-21T09:02:51.4575002Z } 2026-02-21T09:02:51.4575114Z } 2026-02-21T09:02:51.4575230Z #-} 2026-02-21T09:02:51.4575650Z /tmp/torchinductor_root/pd/cpd7pnzzp5ymhqqpg33txqka7aqoi3k7e5jaabie6grhx4jgd4cc.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:02:51.4576676Z 
/tmp/torchinductor_root/pd/cpd7pnzzp5ymhqqpg33txqka7aqoi3k7e5jaabie6grhx4jgd4cc.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:02:51.4577515Z [136s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:02:51.4578684Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[3, 0], range_unroll_factors=[2, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:02:51.4579723Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:02:51.4579987Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:02:51.5688401Z [137s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[3, 4], range_warp_specializes=[None, None]) 2026-02-21T09:02:51.5689521Z Tensor-likes are not close! 
2026-02-21T09:02:51.5692890Z 2026-02-21T09:02:51.5694882Z Mismatched elements: 456592 / 458752 (99.5%) 2026-02-21T09:02:51.5699677Z Greatest absolute difference: 3072.0 at index (25, 3953) (up to 0.01 allowed) 2026-02-21T09:02:51.5701803Z Greatest relative difference: 532480.0 at index (59, 5406) (up to 0.01 allowed) 2026-02-21T09:02:51.5702169Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:02:51.5702358Z 2026-02-21T09:02:52.0398043Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:02:52.0402480Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:02:52.0403915Z ^ 2026-02-21T09:02:52.0404392Z /tmp/torchinductor_root/sy/csy6nrsbdtvzgnjgiqql7afzvdvfltgzrctfoqat3xtfyss3m5q3.py:84:40: note: called from 2026-02-21T09:02:52.0404828Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:02:52.0405047Z ^ 2026-02-21T09:02:52.0405563Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:02:52.0406115Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:02:52.0406379Z ^ 2026-02-21T09:02:52.0413601Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:02:52.0419060Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:02:52.0419709Z %cst = arith.constant dense<0> : tensor<8x2x64xi8> 2026-02-21T09:02:52.0419972Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:02:52.0420352Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:02:52.0420531Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:02:52.0420755Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:02:52.0420927Z %c9472_i32 = arith.constant 9472 : i32 
2026-02-21T09:02:52.0421137Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:02:52.0421365Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:02:52.0421645Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:02:52.0421866Z %cst_3 = arith.constant dense<4> : tensor<8x64xi8> 2026-02-21T09:02:52.0422086Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:02:52.0422310Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:02:52.0422505Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:02:52.0422719Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:02:52.0422990Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:02:52.0423159Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:02:52.0423326Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:02:52.0423495Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:02:52.0423671Z %0 = tt.get_program_id x : i32 2026-02-21T09:02:52.0423840Z %1 = arith.subi %c112_i32, %0 : i32 2026-02-21T09:02:52.0424011Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:02:52.0424188Z %2 = arith.subi %c9472_i32, %c1_i32_7 : i32 2026-02-21T09:02:52.0424371Z %3 = arith.addi %1, %2 : i32 2026-02-21T09:02:52.0424538Z %4 = arith.divui %3, %c9472_i32 : i32 2026-02-21T09:02:52.0424710Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:02:52.0424880Z %5 = arith.remsi %4, %c2_i32_8 : i32 2026-02-21T09:02:52.0425046Z %6 = arith.subi %4, %5 : i32 2026-02-21T09:02:52.0425214Z %7 = arith.muli %6, %c9472_i32 : i32 2026-02-21T09:02:52.0425382Z %8 = arith.addi %0, %7 : i32 2026-02-21T09:02:52.0425556Z %9 = arith.muli %c9472_i32, %c2_i32_8 : i32 2026-02-21T09:02:52.0425748Z scf.for %arg3 = %0 to %8 step %9 : i32 { 2026-02-21T09:02:52.0425947Z %10 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:02:52.0426133Z %11 = arith.muli %10, %c4_i32 : i32 2026-02-21T09:02:52.0426304Z %12 = arith.subi %c1_i32, %11 : i32 2026-02-21T09:02:52.0426483Z %13 = arith.minsi %12, 
%c4_i32 : i32 2026-02-21T09:02:52.0426661Z %14 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:02:52.0426838Z %15 = arith.remsi %14, %13 : i32 2026-02-21T09:02:52.0427004Z %16 = arith.addi %11, %15 : i32 2026-02-21T09:02:52.0427175Z %17 = arith.divsi %14, %13 : i32 2026-02-21T09:02:52.0427341Z %18 = arith.muli %16, %c64_i32 : i32 2026-02-21T09:02:52.0427569Z %19 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:52.0427828Z %20 = tt.splat %18 : i32 -> tensor<64xi32> 2026-02-21T09:02:52.0428028Z %21 = arith.addi %20, %19 : tensor<64xi32> 2026-02-21T09:02:52.0428225Z %22 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:02:52.0428457Z %23 = tt.splat %22 : i32 -> tensor<64xi32> 2026-02-21T09:02:52.0428651Z %24 = arith.addi %23, %19 : tensor<64xi32> 2026-02-21T09:02:52.0428846Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:02:52.0429164Z %25 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:52.0429530Z %62 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0429784Z %63 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0430047Z %64 = arith.addi %63, %62 : tensor<8xi32> 2026-02-21T09:02:52.0430255Z %65 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:52.0430484Z %66 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0430736Z %67 = tt.splat %65 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0430937Z %68 = arith.addi %67, %66 : tensor<16xi32> 2026-02-21T09:02:52.0431229Z %69 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0431595Z %70 = arith.muli %69, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0431854Z %71 = tt.expand_dims %68 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0432146Z %72 = tt.broadcast %70 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0432404Z %73 = tt.broadcast %71 : 
tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0432643Z %74 = arith.addi %72, %73 : tensor<64x16xi32> 2026-02-21T09:02:52.0432893Z %75 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0433178Z %76 = tt.addptr %75, %74 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0433445Z %77 = tt.load %76 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0433707Z %78 = arith.extf %77 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0434000Z %79 = tt.expand_dims %64 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0434267Z %80 = arith.muli %79, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0434534Z %81 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0434818Z %82 = tt.broadcast %80 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0435075Z %83 = tt.broadcast %81 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0435309Z %84 = arith.addi %82, %83 : tensor<8x64xi32> 2026-02-21T09:02:52.0435541Z %85 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0435814Z %86 = tt.addptr %85, %84 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0436108Z %87 = tt.load %86 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0436378Z %88 = arith.shli %87, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0436599Z %89 = arith.shrsi %88, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0436808Z %90 = arith.shrsi %87, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0437057Z %91 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0437340Z %92 = tt.expand_dims %91 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0437656Z %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0437983Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0438293Z %95 = tt.expand_dims %90 {axis = 1 : 
i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0438585Z %96 = arith.cmpi eq, %93, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0438831Z %97 = tt.broadcast %96 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0439103Z %98 = tt.broadcast %94 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0439377Z %99 = arith.select %97, %98, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0439656Z %100 = arith.cmpi eq, %93, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0439916Z %101 = tt.broadcast %95 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0440178Z %102 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0440461Z %103 = arith.select %102, %101, %99 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0440736Z %104 = tt.reshape %103 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0441005Z %105 = arith.sitofp %104 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0441400Z %106 = tt.dot %78, %105, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0441773Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:02:52.0441981Z %107 = arith.muli %c8_i32, %c1_i32_11 : i32 2026-02-21T09:02:52.0442203Z %108 = arith.addi %arg4, %107 : i32 2026-02-21T09:02:52.0442434Z %109 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0442710Z %110 = tt.splat %108 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0442927Z %111 = arith.addi %110, %109 : tensor<8xi32> 2026-02-21T09:02:52.0443122Z %112 = arith.muli %108, %c2_i32 : i32 2026-02-21T09:02:52.0443358Z %113 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0443614Z %114 = tt.splat %112 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0443820Z %115 = arith.addi %114, %113 : tensor<16xi32> 2026-02-21T09:02:52.0444077Z %116 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:02:52.0444351Z %117 = arith.muli %116, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0444617Z %118 = tt.expand_dims %115 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0444943Z %119 = tt.broadcast %117 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0445206Z %120 = tt.broadcast %118 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0445458Z %121 = arith.addi %119, %120 : tensor<64x16xi32> 2026-02-21T09:02:52.0445703Z %122 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0446000Z %123 = tt.addptr %122, %121 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0446268Z %124 = tt.load %123 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0446514Z %125 = arith.extf %124 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0446808Z %126 = tt.expand_dims %111 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0447075Z %127 = arith.muli %126, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0447337Z %128 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0447624Z %129 = tt.broadcast %127 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0447893Z %130 = tt.broadcast %128 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0448130Z %131 = arith.addi %129, %130 : tensor<8x64xi32> 2026-02-21T09:02:52.0448370Z %132 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0448648Z %133 = tt.addptr %132, %131 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0448952Z %134 = tt.load %133 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0449227Z %135 = arith.shli %134, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0449444Z %136 = arith.shrsi %135, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0449664Z %137 = arith.shrsi %134, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0449904Z %138 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 
2026-02-21T09:02:52.0450204Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0450546Z %140 = tt.expand_dims %139 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0450883Z %141 = tt.expand_dims %136 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0451223Z %142 = tt.expand_dims %137 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0451521Z %143 = arith.cmpi eq, %140, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0451838Z %144 = tt.broadcast %143 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0452125Z %145 = tt.broadcast %141 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0452455Z %146 = arith.select %144, %145, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0452753Z %147 = arith.cmpi eq, %140, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0453022Z %148 = tt.broadcast %142 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0453334Z %149 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0453663Z %150 = arith.select %149, %148, %146 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0453957Z %151 = tt.reshape %150 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0454245Z %152 = arith.sitofp %151 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0454626Z %153 = tt.dot %125, %152, %106, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0454973Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:02:52.0455184Z %154 = arith.muli %c8_i32, %c2_i32_12 : i32 2026-02-21T09:02:52.0455400Z %155 = arith.addi %arg4, %154 : i32 2026-02-21T09:02:52.0455659Z %156 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0455921Z %157 = tt.splat %155 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0456187Z %158 = arith.addi %157, %156 : tensor<8xi32> 
2026-02-21T09:02:52.0456395Z %159 = arith.muli %155, %c2_i32 : i32 2026-02-21T09:02:52.0456644Z %160 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0456910Z %161 = tt.splat %159 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0457124Z %162 = arith.addi %161, %160 : tensor<16xi32> 2026-02-21T09:02:52.0457395Z %163 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0457677Z %164 = arith.muli %163, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0457959Z %165 = tt.expand_dims %162 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0458265Z %166 = tt.broadcast %164 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0458546Z %167 = tt.broadcast %165 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0458808Z %168 = arith.addi %166, %167 : tensor<64x16xi32> 2026-02-21T09:02:52.0459054Z %169 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0459350Z %170 = tt.addptr %169, %168 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0459616Z %171 = tt.load %170 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0459861Z %172 = arith.extf %171 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0460148Z %173 = tt.expand_dims %158 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0460419Z %174 = arith.muli %173, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0460680Z %175 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0460970Z %176 = tt.broadcast %174 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0461238Z %177 = tt.broadcast %175 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0461472Z %178 = arith.addi %176, %177 : tensor<8x64xi32> 2026-02-21T09:02:52.0461743Z %179 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0462020Z %180 = tt.addptr %179, %178 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 
2026-02-21T09:02:52.0462320Z %181 = tt.load %180 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0462590Z %182 = arith.shli %181, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0462803Z %183 = arith.shrsi %182, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0463024Z %184 = arith.shrsi %181, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0463266Z %185 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0463596Z %186 = tt.expand_dims %185 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0463918Z %187 = tt.expand_dims %186 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0464240Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0464567Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0464898Z %190 = arith.cmpi eq, %187, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0465162Z %191 = tt.broadcast %190 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0465438Z %192 = tt.broadcast %188 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0465725Z %193 = arith.select %191, %192, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0466009Z %194 = arith.cmpi eq, %187, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0466261Z %195 = tt.broadcast %189 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0466536Z %196 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0466810Z %197 = arith.select %196, %195, %193 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0467121Z %198 = tt.reshape %197 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0467392Z %199 = arith.sitofp %198 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0467749Z %200 = tt.dot %172, %199, %153, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0468072Z 
%c3_i32 = arith.constant 3 : i32 2026-02-21T09:02:52.0468262Z %201 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:02:52.0468467Z %202 = arith.addi %arg4, %201 : i32 2026-02-21T09:02:52.0468698Z %203 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0468956Z %204 = tt.splat %202 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0469167Z %205 = arith.addi %204, %203 : tensor<8xi32> 2026-02-21T09:02:52.0469359Z %206 = arith.muli %202, %c2_i32 : i32 2026-02-21T09:02:52.0469598Z %207 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0469847Z %208 = tt.splat %206 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0470063Z %209 = arith.addi %208, %207 : tensor<16xi32> 2026-02-21T09:02:52.0470317Z %210 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0470599Z %211 = arith.muli %210, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0470868Z %212 = tt.expand_dims %209 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0471162Z %213 = tt.broadcast %211 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0471434Z %214 = tt.broadcast %212 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0471725Z %215 = arith.addi %213, %214 : tensor<64x16xi32> 2026-02-21T09:02:52.0471982Z %216 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0472286Z %217 = tt.addptr %216, %215 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0472552Z %218 = tt.load %217 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0472802Z %219 = arith.extf %218 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0473087Z %220 = tt.expand_dims %205 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0473361Z %221 = arith.muli %220, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0473618Z %222 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0473920Z %223 = tt.broadcast %221 : 
tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0474190Z %224 = tt.broadcast %222 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0474429Z %225 = arith.addi %223, %224 : tensor<8x64xi32> 2026-02-21T09:02:52.0474700Z %226 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0474972Z %227 = tt.addptr %226, %225 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0475276Z %228 = tt.load %227 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0475539Z %229 = arith.shli %228, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0475783Z %230 = arith.shrsi %229, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0476022Z %231 = arith.shrsi %228, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0476262Z %232 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0476555Z %233 = tt.expand_dims %232 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0476865Z %234 = tt.expand_dims %233 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0477185Z %235 = tt.expand_dims %230 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0477504Z %236 = tt.expand_dims %231 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0477783Z %237 = arith.cmpi eq, %234, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0478037Z %238 = tt.broadcast %237 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0478328Z %239 = tt.broadcast %235 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0478620Z %240 = arith.select %238, %239, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0478894Z %241 = arith.cmpi eq, %234, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0479156Z %242 = tt.broadcast %236 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0479431Z %243 = tt.broadcast %241 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0479706Z %244 = arith.select %243, %242, %240 : 
tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0479994Z %245 = tt.reshape %244 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0480259Z %246 = arith.sitofp %245 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0480618Z %247 = tt.dot %219, %246, %200, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0480950Z scf.yield %247 : tensor<64x64xf32> 2026-02-21T09:02:52.0481133Z } {tt.flatten} 2026-02-21T09:02:52.0481339Z %26 = arith.truncf %25 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:52.0481664Z %27 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0481941Z %28 = arith.muli %27, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:52.0482194Z %29 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0482483Z %30 = tt.broadcast %28 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:52.0482742Z %31 = tt.broadcast %29 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:52.0482971Z %32 = arith.addi %30, %31 : tensor<64x64xi32> 2026-02-21T09:02:52.0483216Z %33 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:52.0483490Z %34 = tt.addptr %33, %32 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:52.0483754Z tt.store %34, %26 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:52.0483959Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:02:52.0484162Z %35 = arith.muli %c9472_i32, %c1_i32_9 : i32 2026-02-21T09:02:52.0484370Z %36 = arith.addi %arg3, %35 : i32 2026-02-21T09:02:52.0484552Z %37 = arith.divsi %36, %c448_i32 : i32 2026-02-21T09:02:52.0484742Z %38 = arith.muli %37, %c4_i32 : i32 2026-02-21T09:02:52.0484918Z %39 = arith.subi %c1_i32, %38 : i32 2026-02-21T09:02:52.0485105Z %40 = arith.minsi %39, %c4_i32 : i32 2026-02-21T09:02:52.0485288Z %41 = arith.remsi %36, %c448_i32 : i32 2026-02-21T09:02:52.0485473Z %42 = arith.remsi %41, %40 : i32 2026-02-21T09:02:52.0485685Z 
%43 = arith.addi %38, %42 : i32 2026-02-21T09:02:52.0485873Z %44 = arith.divsi %41, %40 : i32 2026-02-21T09:02:52.0486059Z %45 = arith.muli %43, %c64_i32 : i32 2026-02-21T09:02:52.0486287Z %46 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:02:52.0486547Z %47 = tt.splat %45 : i32 -> tensor<64xi32> 2026-02-21T09:02:52.0486773Z %48 = arith.addi %47, %46 : tensor<64xi32> 2026-02-21T09:02:52.0486966Z %49 = arith.muli %44, %c64_i32 : i32 2026-02-21T09:02:52.0487181Z %50 = tt.splat %49 : i32 -> tensor<64xi32> 2026-02-21T09:02:52.0487379Z %51 = arith.addi %50, %46 : tensor<64xi32> 2026-02-21T09:02:52.0487567Z %c32_i32_10 = arith.constant 32 : i32 2026-02-21T09:02:52.0487894Z %52 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32_10 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:52.0488266Z %62 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0488514Z %63 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0488722Z %64 = arith.addi %63, %62 : tensor<8xi32> 2026-02-21T09:02:52.0488919Z %65 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:52.0489180Z %66 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0489427Z %67 = tt.splat %65 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0489631Z %68 = arith.addi %67, %66 : tensor<16xi32> 2026-02-21T09:02:52.0489891Z %69 = tt.expand_dims %48 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0490230Z %70 = arith.muli %69, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0490489Z %71 = tt.expand_dims %68 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0490776Z %72 = tt.broadcast %70 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0491047Z %73 = tt.broadcast %71 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0491291Z %74 = arith.addi %72, %73 : tensor<64x16xi32> 2026-02-21T09:02:52.0491558Z %75 = tt.splat %arg0 : !tt.ptr -> 
tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0491853Z %76 = tt.addptr %75, %74 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0492112Z %77 = tt.load %76 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0492356Z %78 = arith.extf %77 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0492633Z %79 = tt.expand_dims %64 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0492903Z %80 = arith.muli %79, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0493161Z %81 = tt.expand_dims %51 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0493444Z %82 = tt.broadcast %80 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0493714Z %83 = tt.broadcast %81 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0493946Z %84 = arith.addi %82, %83 : tensor<8x64xi32> 2026-02-21T09:02:52.0494185Z %85 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0494452Z %86 = tt.addptr %85, %84 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0494774Z %87 = tt.load %86 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0495065Z %88 = arith.shli %87, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0495285Z %89 = arith.shrsi %88, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0495510Z %90 = arith.shrsi %87, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0495758Z %91 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0496067Z %92 = tt.expand_dims %91 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0496391Z %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0496718Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0497077Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0497364Z %96 = arith.cmpi eq, %93, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0497627Z %97 = tt.broadcast 
%96 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0497927Z %98 = tt.broadcast %94 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0498256Z %99 = arith.select %97, %98, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0498543Z %100 = arith.cmpi eq, %93, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0498800Z %101 = tt.broadcast %95 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0499082Z %102 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0499374Z %103 = arith.select %102, %101, %99 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0499667Z %104 = tt.reshape %103 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0499944Z %105 = arith.sitofp %104 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0500317Z %106 = tt.dot %78, %105, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0500685Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:02:52.0500892Z %107 = arith.muli %c8_i32, %c1_i32_11 : i32 2026-02-21T09:02:52.0501104Z %108 = arith.addi %arg4, %107 : i32 2026-02-21T09:02:52.0501338Z %109 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0501633Z %110 = tt.splat %108 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0501853Z %111 = arith.addi %110, %109 : tensor<8xi32> 2026-02-21T09:02:52.0502055Z %112 = arith.muli %108, %c2_i32 : i32 2026-02-21T09:02:52.0502300Z %113 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0502556Z %114 = tt.splat %112 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0502775Z %115 = arith.addi %114, %113 : tensor<16xi32> 2026-02-21T09:02:52.0503038Z %116 = tt.expand_dims %48 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0503330Z %117 = arith.muli %116, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0503618Z %118 = tt.expand_dims %115 {axis = 0 : i32} : tensor<16xi32> -> 
tensor<1x16xi32> 2026-02-21T09:02:52.0503914Z %119 = tt.broadcast %117 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0504183Z %120 = tt.broadcast %118 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0504417Z %121 = arith.addi %119, %120 : tensor<64x16xi32> 2026-02-21T09:02:52.0504666Z %122 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0504955Z %123 = tt.addptr %122, %121 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0505225Z %124 = tt.load %123 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0505473Z %125 = arith.extf %124 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0505754Z %126 = tt.expand_dims %111 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0506024Z %127 = arith.muli %126, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0506278Z %128 = tt.expand_dims %51 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0506571Z %129 = tt.broadcast %127 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0506830Z %130 = tt.broadcast %128 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0507070Z %131 = arith.addi %129, %130 : tensor<8x64xi32> 2026-02-21T09:02:52.0507321Z %132 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0507597Z %133 = tt.addptr %132, %131 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0507906Z %134 = tt.load %133 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0508203Z %135 = arith.shli %134, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0508429Z %136 = arith.shrsi %135, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0508656Z %137 = arith.shrsi %134, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0508905Z %138 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0509234Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0509573Z %140 = tt.expand_dims %139 {axis = 2 : i32} : 
tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0509897Z %141 = tt.expand_dims %136 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0510218Z %142 = tt.expand_dims %137 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0510515Z %143 = arith.cmpi eq, %140, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0510773Z %144 = tt.broadcast %143 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0511044Z %145 = tt.broadcast %141 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0511339Z %146 = arith.select %144, %145, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0511690Z %147 = arith.cmpi eq, %140, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0511954Z %148 = tt.broadcast %142 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0512225Z %149 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0512504Z %150 = arith.select %149, %148, %146 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0512794Z %151 = tt.reshape %150 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0513054Z %152 = arith.sitofp %151 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0513415Z %153 = tt.dot %125, %152, %106, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0513742Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:02:52.0513951Z %154 = arith.muli %c8_i32, %c2_i32_12 : i32 2026-02-21T09:02:52.0514153Z %155 = arith.addi %arg4, %154 : i32 2026-02-21T09:02:52.0514380Z %156 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0514638Z %157 = tt.splat %155 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0514844Z %158 = arith.addi %157, %156 : tensor<8xi32> 2026-02-21T09:02:52.0515048Z %159 = arith.muli %155, %c2_i32 : i32 2026-02-21T09:02:52.0515279Z %160 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0515533Z %161 = 
tt.splat %159 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0515746Z %162 = arith.addi %161, %160 : tensor<16xi32> 2026-02-21T09:02:52.0516000Z %163 = tt.expand_dims %48 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0516281Z %164 = arith.muli %163, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0516540Z %165 = tt.expand_dims %162 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0516840Z %166 = tt.broadcast %164 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0517103Z %167 = tt.broadcast %165 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0517352Z %168 = arith.addi %166, %167 : tensor<64x16xi32> 2026-02-21T09:02:52.0517602Z %169 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0517894Z %170 = tt.addptr %169, %168 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0518172Z %171 = tt.load %170 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0518413Z %172 = arith.extf %171 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0518701Z %173 = tt.expand_dims %158 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0518968Z %174 = arith.muli %173, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0519259Z %175 = tt.expand_dims %51 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0519562Z %176 = tt.broadcast %174 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0519826Z %177 = tt.broadcast %175 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0520101Z %178 = arith.addi %176, %177 : tensor<8x64xi32> 2026-02-21T09:02:52.0520339Z %179 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0520652Z %180 = tt.addptr %179, %178 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0520964Z %181 = tt.load %180 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0521233Z %182 = arith.shli %181, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0521456Z %183 = arith.shrsi 
%182, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0521706Z %184 = arith.shrsi %181, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0521959Z %185 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0522253Z %186 = tt.expand_dims %185 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0522576Z %187 = tt.expand_dims %186 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0522934Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0523257Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0523548Z %190 = arith.cmpi eq, %187, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0523800Z %191 = tt.broadcast %190 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0524077Z %192 = tt.broadcast %188 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0524372Z %193 = arith.select %191, %192, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0524649Z %194 = arith.cmpi eq, %187, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0524908Z %195 = tt.broadcast %189 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0525173Z %196 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0525463Z %197 = arith.select %196, %195, %193 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0525745Z %198 = tt.reshape %197 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0526018Z %199 = arith.sitofp %198 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0526384Z %200 = tt.dot %172, %199, %153, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0526706Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:02:52.0526905Z %201 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:02:52.0527098Z %202 = arith.addi %arg4, %201 : i32 2026-02-21T09:02:52.0527335Z %203 = tt.make_range {end = 8 : 
i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0527581Z %204 = tt.splat %202 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0527790Z %205 = arith.addi %204, %203 : tensor<8xi32> 2026-02-21T09:02:52.0527993Z %206 = arith.muli %202, %c2_i32 : i32 2026-02-21T09:02:52.0528225Z %207 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0528483Z %208 = tt.splat %206 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0528687Z %209 = arith.addi %208, %207 : tensor<16xi32> 2026-02-21T09:02:52.0528947Z %210 = tt.expand_dims %48 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0529217Z %211 = arith.muli %210, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0529485Z %212 = tt.expand_dims %209 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0529785Z %213 = tt.broadcast %211 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0530049Z %214 = tt.broadcast %212 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0530331Z %215 = arith.addi %213, %214 : tensor<64x16xi32> 2026-02-21T09:02:52.0530577Z %216 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0530878Z %217 = tt.addptr %216, %215 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0531169Z %218 = tt.load %217 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0531415Z %219 = arith.extf %218 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0531753Z %220 = tt.expand_dims %205 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0532029Z %221 = arith.muli %220, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0532293Z %222 = tt.expand_dims %51 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0532585Z %223 = tt.broadcast %221 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0532854Z %224 = tt.broadcast %222 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0533100Z %225 = arith.addi %223, %224 : tensor<8x64xi32> 
2026-02-21T09:02:52.0533337Z %226 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0533621Z %227 = tt.addptr %226, %225 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0533957Z %228 = tt.load %227 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0534234Z %229 = arith.shli %228, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0534448Z %230 = arith.shrsi %229, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0534673Z %231 = arith.shrsi %228, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0534919Z %232 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0535207Z %233 = tt.expand_dims %232 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0535528Z %234 = tt.expand_dims %233 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0535850Z %235 = tt.expand_dims %230 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0536174Z %236 = tt.expand_dims %231 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0536464Z %237 = arith.cmpi eq, %234, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0536717Z %238 = tt.broadcast %237 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0536993Z %239 = tt.broadcast %235 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0537276Z %240 = arith.select %238, %239, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0537555Z %241 = arith.cmpi eq, %234, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0537803Z %242 = tt.broadcast %236 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0538072Z %243 = tt.broadcast %241 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0538355Z %244 = arith.select %243, %242, %240 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0538637Z %245 = tt.reshape %244 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0538904Z %246 = arith.sitofp %245 : tensor<16x64xi8> to 
tensor<16x64xf32> 2026-02-21T09:02:52.0539275Z %247 = tt.dot %219, %246, %200, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0539620Z scf.yield %247 : tensor<64x64xf32> 2026-02-21T09:02:52.0539811Z } {tt.flatten} 2026-02-21T09:02:52.0540021Z %53 = arith.truncf %52 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:52.0540323Z %54 = tt.expand_dims %48 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0540597Z %55 = arith.muli %54, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:52.0540866Z %56 = tt.expand_dims %51 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0541194Z %57 = tt.broadcast %55 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:52.0541588Z %58 = tt.broadcast %56 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:52.0541832Z %59 = arith.addi %57, %58 : tensor<64x64xi32> 2026-02-21T09:02:52.0542091Z %60 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:52.0542395Z %61 = tt.addptr %60, %59 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:52.0542704Z tt.store %61, %53 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:52.0542947Z } {tt.num_stages = 1 : i32} 2026-02-21T09:02:52.0543160Z scf.for %arg3 = %8 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T09:02:52.0543399Z %10 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:02:52.0543597Z %11 = arith.muli %10, %c4_i32 : i32 2026-02-21T09:02:52.0543797Z %12 = arith.subi %c1_i32, %11 : i32 2026-02-21T09:02:52.0543995Z %13 = arith.minsi %12, %c4_i32 : i32 2026-02-21T09:02:52.0544195Z %14 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:02:52.0544399Z %15 = arith.remsi %14, %13 : i32 2026-02-21T09:02:52.0544581Z %16 = arith.addi %11, %15 : i32 2026-02-21T09:02:52.0544768Z %17 = arith.divsi %14, %13 : i32 2026-02-21T09:02:52.0544951Z %18 = arith.muli %16, %c64_i32 : i32 2026-02-21T09:02:52.0545219Z %19 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 
2026-02-21T09:02:52.0545490Z %20 = tt.splat %18 : i32 -> tensor<64xi32> 2026-02-21T09:02:52.0545698Z %21 = arith.addi %20, %19 : tensor<64xi32> 2026-02-21T09:02:52.0545900Z %22 = arith.muli %17, %c64_i32 : i32 2026-02-21T09:02:52.0546091Z %23 = tt.splat %22 : i32 -> tensor<64xi32> 2026-02-21T09:02:52.0546298Z %24 = arith.addi %23, %19 : tensor<64xi32> 2026-02-21T09:02:52.0546494Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:02:52.0546828Z %25 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:02:52.0547207Z %35 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0547476Z %36 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0547701Z %37 = arith.addi %36, %35 : tensor<8xi32> 2026-02-21T09:02:52.0547907Z %38 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:02:52.0548158Z %39 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0548403Z %40 = tt.splat %38 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0548612Z %41 = arith.addi %40, %39 : tensor<16xi32> 2026-02-21T09:02:52.0548863Z %42 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0549137Z %43 = arith.muli %42, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0549398Z %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0549686Z %45 = tt.broadcast %43 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0549948Z %46 = tt.broadcast %44 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0550184Z %47 = arith.addi %45, %46 : tensor<64x16xi32> 2026-02-21T09:02:52.0550431Z %48 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0550724Z %49 = tt.addptr %48, %47 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0550976Z %50 = tt.load %49 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0551217Z %51 = arith.extf %50 : tensor<64x16xbf16> to 
tensor<64x16xf32> 2026-02-21T09:02:52.0551493Z %52 = tt.expand_dims %37 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0551790Z %53 = arith.muli %52, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0552038Z %54 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0552333Z %55 = tt.broadcast %53 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0552595Z %56 = tt.broadcast %54 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0552853Z %57 = arith.addi %55, %56 : tensor<8x64xi32> 2026-02-21T09:02:52.0553091Z %58 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0553353Z %59 = tt.addptr %58, %57 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0553654Z %60 = tt.load %59 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0553936Z %61 = arith.shli %60, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0554177Z %62 = arith.shrsi %61, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0554395Z %63 = arith.shrsi %60, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0554637Z %64 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0554933Z %65 = tt.expand_dims %64 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0555246Z %66 = tt.expand_dims %65 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0555569Z %67 = tt.expand_dims %62 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0555885Z %68 = tt.expand_dims %63 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0556169Z %69 = arith.cmpi eq, %66, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0556451Z %70 = tt.broadcast %69 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0556713Z %71 = tt.broadcast %67 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0556995Z %72 = arith.select %70, %71, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 
2026-02-21T09:02:52.0557261Z %73 = arith.cmpi eq, %66, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0557512Z %74 = tt.broadcast %68 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0557775Z %75 = tt.broadcast %73 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0558036Z %76 = arith.select %75, %74, %72 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0558309Z %77 = tt.reshape %76 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0558556Z %78 = arith.sitofp %77 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0558908Z %79 = tt.dot %51, %78, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0559226Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:02:52.0559427Z %80 = arith.muli %c8_i32, %c1_i32_9 : i32 2026-02-21T09:02:52.0559628Z %81 = arith.addi %arg4, %80 : i32 2026-02-21T09:02:52.0559852Z %82 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0560106Z %83 = tt.splat %81 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0560303Z %84 = arith.addi %83, %82 : tensor<8xi32> 2026-02-21T09:02:52.0560499Z %85 = arith.muli %81, %c2_i32 : i32 2026-02-21T09:02:52.0560727Z %86 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0560981Z %87 = tt.splat %85 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0561185Z %88 = arith.addi %87, %86 : tensor<16xi32> 2026-02-21T09:02:52.0561432Z %89 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0561752Z %90 = arith.muli %89, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0562010Z %91 = tt.expand_dims %88 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0562304Z %92 = tt.broadcast %90 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0562565Z %93 = tt.broadcast %91 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0562805Z %94 = arith.addi %92, %93 : tensor<64x16xi32> 
2026-02-21T09:02:52.0563053Z %95 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0563336Z %96 = tt.addptr %95, %94 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0563596Z %97 = tt.load %96 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0563855Z %98 = arith.extf %97 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0564135Z %99 = tt.expand_dims %84 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0564404Z %100 = arith.muli %99, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0564662Z %101 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0564984Z %102 = tt.broadcast %100 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0565279Z %103 = tt.broadcast %101 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0565520Z %104 = arith.addi %102, %103 : tensor<8x64xi32> 2026-02-21T09:02:52.0565752Z %105 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0566031Z %106 = tt.addptr %105, %104 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0566338Z %107 = tt.load %106 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0566602Z %108 = arith.shli %107, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0566818Z %109 = arith.shrsi %108, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0567034Z %110 = arith.shrsi %107, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0567175Z %111 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0567302Z %112 = tt.expand_dims %111 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0567433Z %113 = tt.expand_dims %112 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0567567Z %114 = tt.expand_dims %109 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0567692Z %115 = tt.expand_dims %110 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0567786Z %116 = 
arith.cmpi eq, %113, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0567892Z %117 = tt.broadcast %116 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0568004Z %118 = tt.broadcast %114 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0568126Z %119 = arith.select %117, %118, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0568221Z %120 = arith.cmpi eq, %113, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0568334Z %121 = tt.broadcast %115 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0568436Z %122 = tt.broadcast %120 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0568554Z %123 = arith.select %122, %121, %119 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0568660Z %124 = tt.reshape %123 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0568765Z %125 = arith.sitofp %124 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0568955Z %126 = tt.dot %98, %125, %79, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0569034Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:02:52.0569109Z %127 = arith.muli %c8_i32, %c2_i32_10 : i32 2026-02-21T09:02:52.0569178Z %128 = arith.addi %arg4, %127 : i32 2026-02-21T09:02:52.0569295Z %129 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0569372Z %130 = tt.splat %128 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0569449Z %131 = arith.addi %130, %129 : tensor<8xi32> 2026-02-21T09:02:52.0569518Z %132 = arith.muli %128, %c2_i32 : i32 2026-02-21T09:02:52.0569638Z %133 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0569715Z %134 = tt.splat %132 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0569792Z %135 = arith.addi %134, %133 : tensor<16xi32> 2026-02-21T09:02:52.0569924Z %136 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0570007Z %137 = arith.muli %136, %cst_5 : 
tensor<64x1xi32> 2026-02-21T09:02:52.0570155Z %138 = tt.expand_dims %135 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0570266Z %139 = tt.broadcast %137 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0570364Z %140 = tt.broadcast %138 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0570472Z %141 = arith.addi %139, %140 : tensor<64x16xi32> 2026-02-21T09:02:52.0570591Z %142 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0570732Z %143 = tt.addptr %142, %141 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0570815Z %144 = tt.load %143 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0570918Z %145 = arith.extf %144 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0571047Z %146 = tt.expand_dims %131 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0571127Z %147 = arith.muli %146, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0571252Z %148 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0571361Z %149 = tt.broadcast %147 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0571462Z %150 = tt.broadcast %148 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0571592Z %151 = arith.addi %149, %150 : tensor<8x64xi32> 2026-02-21T09:02:52.0571704Z %152 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0571821Z %153 = tt.addptr %152, %151 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0571947Z %154 = tt.load %153 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0572034Z %155 = arith.shli %154, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0572115Z %156 = arith.shrsi %155, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0572195Z %157 = arith.shrsi %154, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0572305Z %158 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0572435Z %159 = tt.expand_dims %158 {axis = 0 : i32} : 
tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0572563Z %160 = tt.expand_dims %159 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0572693Z %161 = tt.expand_dims %156 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0572830Z %162 = tt.expand_dims %157 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0572925Z %163 = arith.cmpi eq, %160, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0573029Z %164 = tt.broadcast %163 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0573144Z %165 = tt.broadcast %161 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0573270Z %166 = arith.select %164, %165, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0573361Z %167 = arith.cmpi eq, %160, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0573474Z %168 = tt.broadcast %162 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0573575Z %169 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0573697Z %170 = arith.select %169, %168, %166 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0573807Z %171 = tt.reshape %170 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0573912Z %172 = arith.sitofp %171 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0574108Z %173 = tt.dot %145, %172, %126, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0574179Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:02:52.0574259Z %174 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:02:52.0574325Z %175 = arith.addi %arg4, %174 : i32 2026-02-21T09:02:52.0574434Z %176 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:02:52.0574557Z %177 = tt.splat %175 : i32 -> tensor<8xi32> 2026-02-21T09:02:52.0574634Z %178 = arith.addi %177, %176 : tensor<8xi32> 2026-02-21T09:02:52.0574704Z %179 = arith.muli %175, %c2_i32 : i32 2026-02-21T09:02:52.0574823Z %180 
= tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:02:52.0574926Z %181 = tt.splat %179 : i32 -> tensor<16xi32> 2026-02-21T09:02:52.0575002Z %182 = arith.addi %181, %180 : tensor<16xi32> 2026-02-21T09:02:52.0575155Z %183 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0575237Z %184 = arith.muli %183, %cst_5 : tensor<64x1xi32> 2026-02-21T09:02:52.0575360Z %185 = tt.expand_dims %182 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:02:52.0575461Z %186 = tt.broadcast %184 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0575568Z %187 = tt.broadcast %185 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:02:52.0575650Z %188 = arith.addi %186, %187 : tensor<64x16xi32> 2026-02-21T09:02:52.0575761Z %189 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0575890Z %190 = tt.addptr %189, %188 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:02:52.0575987Z %191 = tt.load %190 : tensor<64x16x!tt.ptr> 2026-02-21T09:02:52.0576094Z %192 = arith.extf %191 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:02:52.0576223Z %193 = tt.expand_dims %178 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:02:52.0576303Z %194 = arith.muli %193, %cst_4 : tensor<8x1xi32> 2026-02-21T09:02:52.0576425Z %195 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0576534Z %196 = tt.broadcast %194 : tensor<8x1xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0576634Z %197 = tt.broadcast %195 : tensor<1x64xi32> -> tensor<8x64xi32> 2026-02-21T09:02:52.0576716Z %198 = arith.addi %196, %197 : tensor<8x64xi32> 2026-02-21T09:02:52.0576818Z %199 = tt.splat %arg1 : !tt.ptr -> tensor<8x64x!tt.ptr> 2026-02-21T09:02:52.0576940Z %200 = tt.addptr %199, %198 : tensor<8x64x!tt.ptr>, tensor<8x64xi32> 2026-02-21T09:02:52.0577066Z %201 = tt.load %200 evictionPolicy = evict_last : tensor<8x64x!tt.ptr> 
2026-02-21T09:02:52.0577148Z %202 = arith.shli %201, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0577237Z %203 = arith.shrsi %202, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0577316Z %204 = arith.shrsi %201, %cst_3 : tensor<8x64xi8> 2026-02-21T09:02:52.0577424Z %205 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:02:52.0577552Z %206 = tt.expand_dims %205 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:02:52.0577675Z %207 = tt.expand_dims %206 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:02:52.0577805Z %208 = tt.expand_dims %203 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0577939Z %209 = tt.expand_dims %204 {axis = 1 : i32} : tensor<8x64xi8> -> tensor<8x1x64xi8> 2026-02-21T09:02:52.0578032Z %210 = arith.cmpi eq, %207, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0578137Z %211 = tt.broadcast %210 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0578249Z %212 = tt.broadcast %208 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0578369Z %213 = arith.select %211, %212, %cst : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0578462Z %214 = arith.cmpi eq, %207, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:02:52.0578566Z %215 = tt.broadcast %209 : tensor<8x1x64xi8> -> tensor<8x2x64xi8> 2026-02-21T09:02:52.0578674Z %216 = tt.broadcast %214 : tensor<1x2x1xi1> -> tensor<8x2x64xi1> 2026-02-21T09:02:52.0578790Z %217 = arith.select %216, %215, %213 : tensor<8x2x64xi1>, tensor<8x2x64xi8> 2026-02-21T09:02:52.0578912Z %218 = tt.reshape %217 : tensor<8x2x64xi8> -> tensor<16x64xi8> 2026-02-21T09:02:52.0579023Z %219 = arith.sitofp %218 : tensor<16x64xi8> to tensor<16x64xf32> 2026-02-21T09:02:52.0579215Z %220 = tt.dot %192, %219, %173, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x64xf32> -> tensor<64x64xf32> 2026-02-21T09:02:52.0579304Z scf.yield %220 : tensor<64x64xf32> 2026-02-21T09:02:52.0579373Z } {tt.flatten} 
2026-02-21T09:02:52.0579499Z %26 = arith.truncf %25 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:02:52.0579622Z %27 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:02:52.0579710Z %28 = arith.muli %27, %cst_0 : tensor<64x1xi32> 2026-02-21T09:02:52.0579829Z %29 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:02:52.0579930Z %30 = tt.broadcast %28 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:02:52.0580038Z %31 = tt.broadcast %29 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:02:52.0580109Z %32 = arith.addi %30, %31 : tensor<64x64xi32> 2026-02-21T09:02:52.0580221Z %33 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:02:52.0580356Z %34 = tt.addptr %33, %32 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:02:52.0580452Z tt.store %34, %26 : tensor<64x64x!tt.ptr> 2026-02-21T09:02:52.0580519Z } {tt.num_stages = 1 : i32} 2026-02-21T09:02:52.0580580Z tt.return 2026-02-21T09:02:52.0580638Z } 2026-02-21T09:02:52.0580692Z } 2026-02-21T09:02:52.0580698Z 2026-02-21T09:02:52.0580748Z {-# 2026-02-21T09:02:52.0580812Z external_resources: { 2026-02-21T09:02:52.0580882Z mlir_reproducer: { 2026-02-21T09:02:52.0589765Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, 
tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:02:52.0589855Z disable_threading: false, 2026-02-21T09:02:52.0589919Z verify_each: true 2026-02-21T09:02:52.0589985Z } 2026-02-21T09:02:52.0590041Z } 2026-02-21T09:02:52.0590098Z #-} 2026-02-21T09:02:52.0590491Z /tmp/torchinductor_root/sy/csy6nrsbdtvzgnjgiqql7afzvdvfltgzrctfoqat3xtfyss3m5q3.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:02:52.0591176Z /tmp/torchinductor_root/sy/csy6nrsbdtvzgnjgiqql7afzvdvfltgzrctfoqat3xtfyss3m5q3.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:02:52.0591372Z [137s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:02:52.0592436Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:02:52.0592524Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:02:52.0592655Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:02:53.4408866Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 13.5 configs/s 2026-02-21T09:02:59.1268998Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 176.4 2026-02-21T09:02:59.1274151Z configs/s 2026-02-21T09:02:59.3134389Z [144s] Generation 5 complete: 2026-02-21T09:02:59.3136258Z error=11 2026-02-21T09:02:59.3136540Z ok=86 2026-02-21T09:02:59.3136722Z min=0.1076 2026-02-21T09:02:59.3140635Z mid=0.1935 2026-02-21T09:02:59.3144448Z max=29.0202 2026-02-21T09:02:59.3147711Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:02:59.3150865Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:02:59.3156726Z 'l2_groupings': [1], 2026-02-21T09:02:59.3157014Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:02:59.3157244Z 'loop_orders': [[0, 1]], 2026-02-21T09:02:59.3161950Z 'num_stages': 3, 2026-02-21T09:02:59.3163736Z 'num_warps': 1, 2026-02-21T09:02:59.3163935Z 'pid_type': 'flat', 2026-02-21T09:02:59.3164105Z 'range_flattens': [None, None], 2026-02-21T09:02:59.3164300Z 'range_multi_buffers': [None, True], 2026-02-21T09:02:59.3164494Z 'range_num_stages': [0, 0], 2026-02-21T09:02:59.3164665Z 'range_unroll_factors': [0, 0], 2026-02-21T09:02:59.3164864Z 'range_warp_specializes': [None, None]} 2026-02-21T09:02:59.3165157Z [144s] Fitting 
surrogate: 603 points, 603 targets 2026-02-21T09:03:00.6501516Z [146s] Generation 6 starting: 87 neighbors, 5 active search path(s) 2026-02-21T09:03:07.9037470Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 8.4 configs/s 2026-02-21T09:03:09.4779868Z 2026-02-21T09:03:09.4782317Z [154s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:03:09.4782646Z 2026-02-21T09:03:09.4782740Z ================================================================ 2026-02-21T09:03:09.4782989Z Internal Triton PTX codegen error 2026-02-21T09:03:09.4783223Z `ptxas` stderr: 2026-02-21T09:03:09.4783689Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 259 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:03:09.4784223Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:03:09.4784393Z 2026-02-21T09:03:09.4785049Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpjegcl4u9.ptx -o /tmp/tmpjegcl4u9.ptx.o 2026-02-21T09:03:09.4785495Z 2026-02-21T09:03:09.4785499Z 2026-02-21T09:03:09.4785560Z // 2026-02-21T09:03:09.4785715Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:03:09.4785894Z // 2026-02-21T09:03:09.4785966Z 2026-02-21T09:03:09.4786035Z .version 8.7 2026-02-21T09:03:09.4786181Z .target sm_100a 2026-02-21T09:03:09.4786339Z .address_size 64 2026-02-21T09:03:09.4786429Z 2026-02-21T09:03:09.4786585Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:03:09.4786982Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:03:09.4787221Z // @_helion_matmul_bf16_int4 2026-02-21T09:03:09.4787465Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:03:09.4787744Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:03:09.4788051Z .param .u64 .ptr .global .align 1 
_helion_matmul_bf16_int4_param_1, 2026-02-21T09:03:09.4788422Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:03:09.4788706Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:03:09.4789000Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:03:09.4789231Z ) 2026-02-21T09:03:09.4789356Z .reqntid 128 2026-02-21T09:03:09.4789497Z .maxnreg 32 2026-02-21T09:03:09.4789623Z { 2026-02-21T09:03:09.4789759Z .reg .pred %p<93>; 2026-02-21T09:03:09.4789915Z .reg .b16 %rs<769>; 2026-02-21T09:03:09.4790076Z .reg .b32 %r<849>; 2026-02-21T09:03:09.4790221Z .reg .b64 %rd<219>; 2026-02-21T09:03:09.4790373Z $L__func_begin0: 2026-02-21T09:03:09.4790460Z 2026-02-21T09:03:09.4790516Z // %bb.0: 2026-02-21T09:03:09.4790770Z .loc 1 14 0 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:14 2026-02-21T09:03:09.4792492Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:03:09.4793634Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:03:09.4793884Z `ptxas` stderr: 2026-02-21T09:03:09.4794326Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 259 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:03:09.4794802Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:03:09.4794957Z 2026-02-21T09:03:09.4795328Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpjegcl4u9.ptx -o /tmp/tmpjegcl4u9.ptx.o 2026-02-21T09:03:09.4795752Z 2026-02-21T09:03:09.4795887Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:03:09.4796123Z mov.u32 %r1, %tid.x; 2026-02-21T09:03:09.4796284Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:03:09.4796449Z mov.b32 %r52, global_smem; 2026-02-21T09:03:09.4796614Z // begin inline asm 2026-02-21T09:03:09.4796854Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r52], 256; 2026-02-21T09:03:09.4797101Z // end inline asm 2026-02-21T09:03:09.4797235Z bar.sync 0; 2026-02-21T09:03:09.4797391Z ld.shared.b32 %r843, [global_smem]; 2026-02-21T09:03:09.4797567Z bar.sync 0; 2026-02-21T09:03:09.4797697Z // begin inline asm 2026-02-21T09:03:09.4797903Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:03:09.4798128Z // end inline asm 2026-02-21T09:03:09.4798386Z .loc 1 19 46 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:19:46 2026-02-21T09:03:09.4798717Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:03:09.4798978Z .loc 1 19 98 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:19:98 2026-02-21T09:03:09.4799271Z setp.gt.u32 %p3, %r3, 111; 2026-02-21T09:03:09.4799432Z @%p3 bra $L__BB0_8; 2026-02-21T09:03:09.4799598Z // %bb.1: // %.lr.ph 2026-02-21T09:03:09.4799889Z .loc 1 0 98 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:0:98 2026-02-21T09:03:09.4800218Z ld.param.b64 %rd36, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:03:09.4800503Z ld.param.b64 %rd35, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:03:09.4800717Z and.b32 %r4, %r1, 64; 2026-02-21T09:03:09.4800967Z .loc 1 71 38 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:71:38 
2026-02-21T09:03:09.4801265Z setp.eq.b32 %p12, %r4, 0; 2026-02-21T09:03:09.4801566Z .loc 1 47 38 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:47:38 2026-02-21T09:03:09.4801853Z and.b32 %r197, %r1, 15; 2026-02-21T09:03:09.4802056Z shl.b32 %r198, %r197, 3; 2026-02-21T09:03:09.4802307Z .loc 1 31 45 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:31:45 2026-02-21T09:03:09.4802592Z and.b32 %r199, %r1, 7; 2026-02-21T09:03:09.4802743Z shl.b32 %r200, %r199, 3; 2026-02-21T09:03:09.4802905Z shl.b32 %r201, %r1, 4; 2026-02-21T09:03:09.4803066Z and.b32 %r5, %r201, 48; 2026-02-21T09:03:09.4803218Z bfe.u32 %r202, %r1, 2, 5; 2026-02-21T09:03:09.4803375Z shr.u32 %r203, %r1, 5; 2026-02-21T09:03:09.4803523Z bfe.u32 %r204, %r1, 3, 4; 2026-02-21T09:03:09.4803682Z shl.b32 %r205, %r1, 9; 2026-02-21T09:03:09.4803832Z and.b32 %r206, %r205, 57344; 2026-02-21T09:03:09.4803997Z setp.eq.b32 %p90, %r1, 0; 2026-02-21T09:03:09.4804152Z or.b32 %r207, %r206, %r198; 2026-02-21T09:03:09.4804355Z mad.wide.u32 %rd38, %r207, 2, %rd35; 2026-02-21T09:03:09.4804538Z add.s64 %rd39, %rd38, 131072; 2026-02-21T09:03:09.4804709Z add.s64 %rd40, %rd38, 262144; 2026-02-21T09:03:09.4804875Z add.s64 %rd41, %rd38, 393216; 2026-02-21T09:03:09.4805036Z add.s64 %rd42, %rd38, 524288; 2026-02-21T09:03:09.4805204Z add.s64 %rd43, %rd38, 655360; 2026-02-21T09:03:09.4805361Z add.s64 %rd44, %rd38, 786432; 2026-02-21T09:03:09.4805525Z add.s64 %rd45, %rd38, 917504; 2026-02-21T09:03:09.4805683Z and.b32 %r6, %r201, 2032; 2026-02-21T09:03:09.4805845Z add.s32 %r209, %r52, 32768; 2026-02-21T09:03:09.4806002Z add.s32 %r441, %r209, %r6; 2026-02-21T09:03:09.4806167Z add.s32 %r443, %r441, 2048; 2026-02-21T09:03:09.4806323Z add.s32 %r445, %r441, 4096; 2026-02-21T09:03:09.4806484Z add.s32 %r447, %r441, 6144; 2026-02-21T09:03:09.4806643Z add.s32 %r449, %r441, 8192; 2026-02-21T09:03:09.4806800Z add.s32 %r451, %r441, 10240; 2026-02-21T09:03:09.4806963Z add.s32 %r453, %r441, 12288; 2026-02-21T09:03:09.4807122Z or.b32 
%r14, %r201, 14336; 2026-02-21T09:03:09.4807285Z add.s32 %r455, %r209, %r14; 2026-02-21T09:03:09.4807438Z add.s32 %r391, %r843, 64; 2026-02-21T09:03:09.4807603Z mul.lo.s32 %r210, %r202, 7168; 2026-02-21T09:03:09.4807766Z add.s32 %r211, %r52, 65536; 2026-02-21T09:03:09.4807929Z add.s32 %r457, %r211, %r6; 2026-02-21T09:03:09.4808089Z add.s32 %r459, %r457, 2048; 2026-02-21T09:03:09.4808243Z add.s64 %rd48, %rd38, 256; 2026-02-21T09:03:09.4808404Z or.b32 %r212, %r207, 128; 2026-02-21T09:03:09.4808567Z mad.wide.u32 %rd58, %r212, 2, %rd35; 2026-02-21T09:03:09.4808751Z add.s64 %rd49, %rd58, 131072; 2026-02-21T09:03:09.4808911Z add.s64 %rd50, %rd58, 262144; 2026-02-21T09:03:09.4809078Z add.s64 %rd51, %rd58, 393216; 2026-02-21T09:03:09.4809235Z add.s64 %rd52, %rd58, 524288; 2026-02-21T09:03:09.4809399Z add.s64 %rd53, %rd58, 655360; 2026-02-21T09:03:09.4809557Z add.s64 %rd54, %rd58, 786432; 2026-02-21T09:03:09.4809726Z add.s64 %rd55, %rd58, 917504; 2026-02-21T09:03:09.4809899Z add.s32 %r213, %r52, 49152; 2026-02-21T09:03:09.4810083Z add.s32 %r109, %r213, %r6; 2026-02-21T09:03:09.4810245Z add.s32 %r111, %r109, 2048; 2026-02-21T09:03:09.4810395Z add.s32 %r113, %r109, 4096; 2026-02-21T09:03:09.4810552Z add.s32 %r115, %r109, 6144; 2026-02-21T09:03:09.4810701Z add.s32 %r117, %r109, 8192; 2026-02-21T09:03:09.4810860Z add.s32 %r119, %r109, 10240; 2026-02-21T09:03:09.4811012Z add.s32 %r121, %r109, 12288; 2026-02-21T09:03:09.4811171Z add.s32 %r123, %r213, %r14; 2026-02-21T09:03:09.4811324Z add.s32 %r214, %r52, %r6; 2026-02-21T09:03:09.4811483Z add.s32 %r125, %r214, 69632; 2026-02-21T09:03:09.4811679Z add.s32 %r127, %r214, 71680; 2026-02-21T09:03:09.4811868Z shl.b32 %r215, %r197, 8; 2026-02-21T09:03:09.4812028Z and.b32 %r216, %r1, 96; 2026-02-21T09:03:09.4812178Z shl.b32 %r217, %r216, 7; 2026-02-21T09:03:09.4812336Z and.b32 %r218, %r1, 16; 2026-02-21T09:03:09.4812485Z shl.b32 %r219, %r218, 3; 2026-02-21T09:03:09.4812642Z or.b32 %r220, %r215, %r217; 2026-02-21T09:03:09.4812798Z or.b32 
%r19, %r220, %r219; 2026-02-21T09:03:09.4812960Z add.s32 %r221, %r209, %r19; 2026-02-21T09:03:09.4813112Z and.b32 %r20, %r1, 63; 2026-02-21T09:03:09.4813270Z add.s32 %r222, %r211, %r20; 2026-02-21T09:03:09.4813458Z or.b32 %r21, %r1, 960; 2026-02-21T09:03:09.4813606Z add.s32 %r223, %r211, %r21; 2026-02-21T09:03:09.4813765Z or.b32 %r22, %r1, 1984; 2026-02-21T09:03:09.4813914Z add.s32 %r224, %r211, %r22; 2026-02-21T09:03:09.4814069Z or.b32 %r23, %r1, 3008; 2026-02-21T09:03:09.4814214Z add.s32 %r225, %r211, %r23; 2026-02-21T09:03:09.4814371Z or.b32 %r24, %r1, 4032; 2026-02-21T09:03:09.4814515Z add.s32 %r226, %r211, %r24; 2026-02-21T09:03:09.4814673Z shl.b32 %r227, %r20, 7; 2026-02-21T09:03:09.4814824Z shl.b32 %r228, %r199, 4; 2026-02-21T09:03:09.4814971Z shr.u32 %r229, %r4, 4; 2026-02-21T09:03:09.4815122Z or.b32 %r230, %r228, %r229; 2026-02-21T09:03:09.4815272Z or.b32 %r231, %r230, %r227; 2026-02-21T09:03:09.4815429Z add.s32 %r25, %r52, %r231; 2026-02-21T09:03:09.4815609Z xor.b32 %r232, %r231, 16; 2026-02-21T09:03:09.4815773Z add.s32 %r26, %r52, %r232; 2026-02-21T09:03:09.4815926Z xor.b32 %r233, %r231, 32; 2026-02-21T09:03:09.4816087Z add.s32 %r27, %r52, %r233; 2026-02-21T09:03:09.4816240Z xor.b32 %r234, %r231, 48; 2026-02-21T09:03:09.4816399Z add.s32 %r28, %r52, %r234; 2026-02-21T09:03:09.4816557Z xor.b32 %r235, %r231, 64; 2026-02-21T09:03:09.4816705Z add.s32 %r29, %r52, %r235; 2026-02-21T09:03:09.4816867Z xor.b32 %r236, %r231, 80; 2026-02-21T09:03:09.4817015Z add.s32 %r30, %r52, %r236; 2026-02-21T09:03:09.4817174Z xor.b32 %r237, %r231, 96; 2026-02-21T09:03:09.4817322Z add.s32 %r31, %r52, %r237; 2026-02-21T09:03:09.4817486Z xor.b32 %r238, %r231, 112; 2026-02-21T09:03:09.4817640Z add.s32 %r32, %r52, %r238; 2026-02-21T09:03:09.4817804Z bfe.u32 %r239, %r52, 4, 14; 2026-02-21T09:03:09.4817966Z cvt.u64.u32 %rd59, %r239; 2026-02-21T09:03:09.4818139Z or.b64 %rd81, %rd59, 4611686293338849280; 2026-02-21T09:03:09.4818328Z add.s32 %r240, %r52, 32; 2026-02-21T09:03:09.4818482Z 
bfe.u32 %r241, %r240, 4, 14; 2026-02-21T09:03:09.4818647Z cvt.u64.u32 %rd60, %r241; 2026-02-21T09:03:09.4818806Z or.b64 %rd82, %rd60, 4611686293338849280; 2026-02-21T09:03:09.4818993Z add.s32 %r242, %r52, 64; 2026-02-21T09:03:09.4819146Z bfe.u32 %r243, %r242, 4, 14; 2026-02-21T09:03:09.4819310Z cvt.u64.u32 %rd61, %r243; 2026-02-21T09:03:09.4819469Z or.b64 %rd83, %rd61, 4611686293338849280; 2026-02-21T09:03:09.4819651Z add.s32 %r244, %r52, 96; 2026-02-21T09:03:09.4819801Z bfe.u32 %r245, %r244, 4, 14; 2026-02-21T09:03:09.4819961Z cvt.u64.u32 %rd62, %r245; 2026-02-21T09:03:09.4820125Z or.b64 %rd84, %rd62, 4611686293338849280; 2026-02-21T09:03:09.4820299Z add.s32 %r246, %r52, 8192; 2026-02-21T09:03:09.4820459Z bfe.u32 %r247, %r246, 4, 14; 2026-02-21T09:03:09.4820612Z cvt.u64.u32 %rd63, %r247; 2026-02-21T09:03:09.4820779Z or.b64 %rd85, %rd63, 4611686293338849280; 2026-02-21T09:03:09.4820950Z add.s32 %r248, %r52, 8224; 2026-02-21T09:03:09.4821110Z bfe.u32 %r249, %r248, 4, 14; 2026-02-21T09:03:09.4821300Z cvt.u64.u32 %rd64, %r249; 2026-02-21T09:03:09.4821467Z or.b64 %rd86, %rd64, 4611686293338849280; 2026-02-21T09:03:09.4821686Z add.s32 %r250, %r52, 8256; 2026-02-21T09:03:09.4821838Z bfe.u32 %r251, %r250, 4, 14; 2026-02-21T09:03:09.4822001Z cvt.u64.u32 %rd65, %r251; 2026-02-21T09:03:09.4822159Z or.b64 %rd87, %rd65, 4611686293338849280; 2026-02-21T09:03:09.4822340Z add.s32 %r252, %r52, 8288; 2026-02-21T09:03:09.4822495Z bfe.u32 %r253, %r252, 4, 14; 2026-02-21T09:03:09.4822658Z cvt.u64.u32 %rd66, %r253; 2026-02-21T09:03:09.4822816Z or.b64 %rd88, %rd66, 4611686293338849280; 2026-02-21T09:03:09.4823058Z add.s32 %r254, %r52, 16384; 2026-02-21T09:03:09.4823213Z bfe.u32 %r255, %r254, 4, 14; 2026-02-21T09:03:09.4823372Z cvt.u64.u32 %rd67, %r255; 2026-02-21T09:03:09.4823537Z or.b64 %rd89, %rd67, 4611686293338849280; 2026-02-21T09:03:09.4823710Z add.s32 %r256, %r52, 16416; 2026-02-21T09:03:09.4823871Z bfe.u32 %r257, %r256, 4, 14; 2026-02-21T09:03:09.4824025Z cvt.u64.u32 %rd68, 
%r257; 2026-02-21T09:03:09.4824193Z or.b64 %rd90, %rd68, 4611686293338849280; 2026-02-21T09:03:09.4824366Z add.s32 %r258, %r52, 16448; 2026-02-21T09:03:09.4824562Z bfe.u32 %r259, %r258, 4, 14; 2026-02-21T09:03:09.4824718Z cvt.u64.u32 %rd69, %r259; 2026-02-21T09:03:09.4824883Z or.b64 %rd91, %rd69, 4611686293338849280; 2026-02-21T09:03:09.4825063Z add.s32 %r260, %r52, 16480; 2026-02-21T09:03:09.4825218Z bfe.u32 %r261, %r260, 4, 14; 2026-02-21T09:03:09.4825381Z cvt.u64.u32 %rd70, %r261; 2026-02-21T09:03:09.4825538Z or.b64 %rd92, %rd70, 4611686293338849280; 2026-02-21T09:03:09.4825718Z add.s32 %r262, %r52, 24576; 2026-02-21T09:03:09.4825873Z bfe.u32 %r263, %r262, 4, 14; 2026-02-21T09:03:09.4826036Z cvt.u64.u32 %rd71, %r263; 2026-02-21T09:03:09.4826208Z or.b64 %rd93, %rd71, 4611686293338849280; 2026-02-21T09:03:09.4826389Z add.s32 %r264, %r52, 24608; 2026-02-21T09:03:09.4826542Z bfe.u32 %r265, %r264, 4, 14; 2026-02-21T09:03:09.4826729Z cvt.u64.u32 %rd72, %r265; 2026-02-21T09:03:09.4826893Z or.b64 %rd94, %rd72, 4611686293338849280; 2026-02-21T09:03:09.4827064Z add.s32 %r266, %r52, 24640; 2026-02-21T09:03:09.4827223Z bfe.u32 %r267, %r266, 4, 14; 2026-02-21T09:03:09.4827376Z cvt.u64.u32 %rd73, %r267; 2026-02-21T09:03:09.4827538Z or.b64 %rd95, %rd73, 4611686293338849280; 2026-02-21T09:03:09.4827708Z add.s32 %r268, %r52, 24672; 2026-02-21T09:03:09.4827865Z bfe.u32 %r269, %r268, 4, 14; 2026-02-21T09:03:09.4828016Z cvt.u64.u32 %rd74, %r269; 2026-02-21T09:03:09.4828183Z or.b64 %rd96, %rd74, 4611686293338849280; 2026-02-21T09:03:09.4828370Z add.s64 %rd98, %rd38, 512; 2026-02-21T09:03:09.4828531Z or.b32 %r270, %r207, 256; 2026-02-21T09:03:09.4828702Z mad.wide.u32 %rd75, %r270, 2, %rd35; 2026-02-21T09:03:09.4828883Z add.s64 %rd99, %rd75, 131072; 2026-02-21T09:03:09.4829059Z add.s64 %rd100, %rd75, 262144; 2026-02-21T09:03:09.4829228Z add.s64 %rd101, %rd75, 393216; 2026-02-21T09:03:09.4829401Z add.s64 %rd102, %rd75, 524288; 2026-02-21T09:03:09.4829565Z add.s64 %rd103, %rd75, 655360; 
2026-02-21T09:03:09.4829735Z add.s64 %rd104, %rd75, 786432; 2026-02-21T09:03:09.4829903Z add.s64 %rd105, %rd75, 917504; 2026-02-21T09:03:09.4830067Z and.b32 %r271, %r201, 176; 2026-02-21T09:03:09.4830234Z shl.b32 %r272, %r216, 3; 2026-02-21T09:03:09.4830391Z bfe.s32 %r273, %r1, 2, 1; 2026-02-21T09:03:09.4830557Z and.b32 %r274, %r273, 1088; 2026-02-21T09:03:09.4830713Z shl.b32 %r275, %r218, 2; 2026-02-21T09:03:09.4830876Z xor.b32 %r276, %r274, %r275; 2026-02-21T09:03:09.4831033Z add.s32 %r277, %r52, %r271; 2026-02-21T09:03:09.4831200Z add.s32 %r278, %r277, %r272; 2026-02-21T09:03:09.4831361Z shl.b32 %r279, %r1, 5; 2026-02-21T09:03:09.4831524Z and.b32 %r280, %r279, 1792; 2026-02-21T09:03:09.4831734Z shl.b32 %r281, %r1, 3; 2026-02-21T09:03:09.4831890Z and.b32 %r282, %r281, 48; 2026-02-21T09:03:09.4832055Z shl.b32 %r283, %r216, 1; 2026-02-21T09:03:09.4832213Z shl.b32 %r284, %r1, 6; 2026-02-21T09:03:09.4832376Z and.b32 %r285, %r284, 64; 2026-02-21T09:03:09.4832577Z xor.b32 %r286, %r283, %r285; 2026-02-21T09:03:09.4832751Z add.s32 %r287, %r52, %r280; 2026-02-21T09:03:09.4832917Z add.s32 %r288, %r287, %r282; 2026-02-21T09:03:09.4833210Z .loc 1 19 98 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:19:98 2026-02-21T09:03:09.4833519Z cvt.u64.u32 %rd25, %r202; 2026-02-21T09:03:09.4833805Z .loc 1 30 27 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:30:27 2026-02-21T09:03:09.4834112Z shl.b32 %r36, %r3, 6; 2026-02-21T09:03:09.4834383Z .loc 1 31 32 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:31:32 2026-02-21T09:03:09.4834714Z or.b32 %r289, %r36, %r5; 2026-02-21T09:03:09.4834871Z $L__tmp0: 2026-02-21T09:03:09.4835185Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.4835552Z shfl.sync.idx.b32 %r38, %r203, 0, 31, -1; 2026-02-21T09:03:09.4835762Z shl.b32 %r290, %r38, 21; 2026-02-21T09:03:09.4835929Z and.b32 %r291, %r290, 6291456; 2026-02-21T09:03:09.4836097Z add.s32 %r756, 
%r291, %r843; 2026-02-21T09:03:09.4836303Z mov.pred %p16, -1; 2026-02-21T09:03:09.4836461Z mov.b32 %r844, 0; 2026-02-21T09:03:09.4836606Z // begin inline asm 2026-02-21T09:03:09.4836982Z @%p16 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r756 + 0], 32, {%r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844}; 2026-02-21T09:03:09.4837383Z // end inline asm 2026-02-21T09:03:09.4837527Z // begin inline asm 2026-02-21T09:03:09.4837893Z @%p16 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r756 + 16], 32, {%r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844, %r844}; 2026-02-21T09:03:09.4838290Z // end inline asm 2026-02-21T09:03:09.4838428Z // begin inline asm 2026-02-21T09:03:09.4838591Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:03:09.4838780Z // end inline asm 2026-02-21T09:03:09.4838925Z bar.sync 0; 2026-02-21T09:03:09.4839052Z $L__tmp1: 2026-02-21T09:03:09.4839297Z .loc 1 40 93 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:40:93 2026-02-21T09:03:09.4839588Z add.s32 %r845, %r52, 73728; 2026-02-21T09:03:09.4839740Z // begin inline asm 2026-02-21T09:03:09.4839913Z @%p90 mbarrier.init.shared::cta.b64 [%r845], 1; 2026-02-21T09:03:09.4840104Z // end inline asm 2026-02-21T09:03:09.4840242Z bar.sync 0; 2026-02-21T09:03:09.4840376Z add.s32 %r88, %r52, 73736; 2026-02-21T09:03:09.4840533Z // begin inline asm 2026-02-21T09:03:09.4840698Z @%p90 mbarrier.init.shared::cta.b64 [%r88], 1; 2026-02-21T09:03:09.4840892Z // end inline asm 2026-02-21T09:03:09.4841029Z mov.b32 %r442, 16; 2026-02-21T09:03:09.4841274Z .loc 1 48 80 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:48:80 2026-02-21T09:03:09.4841602Z // begin inline asm 2026-02-21T09:03:09.4841806Z cp.async.cg.shared.global [ %r441 + 0 ], [ %rd38 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4842037Z // end inline asm 2026-02-21T09:03:09.4842168Z // begin inline asm 2026-02-21T09:03:09.4842370Z 
cp.async.cg.shared.global [ %r443 + 0 ], [ %rd39 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4842588Z // end inline asm 2026-02-21T09:03:09.4842728Z // begin inline asm 2026-02-21T09:03:09.4842926Z cp.async.cg.shared.global [ %r445 + 0 ], [ %rd40 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4843142Z // end inline asm 2026-02-21T09:03:09.4843280Z // begin inline asm 2026-02-21T09:03:09.4843467Z cp.async.cg.shared.global [ %r447 + 0 ], [ %rd41 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4843690Z // end inline asm 2026-02-21T09:03:09.4843822Z // begin inline asm 2026-02-21T09:03:09.4844018Z cp.async.cg.shared.global [ %r449 + 0 ], [ %rd42 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4844234Z // end inline asm 2026-02-21T09:03:09.4844376Z // begin inline asm 2026-02-21T09:03:09.4844566Z cp.async.cg.shared.global [ %r451 + 0 ], [ %rd43 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4844833Z // end inline asm 2026-02-21T09:03:09.4844975Z // begin inline asm 2026-02-21T09:03:09.4845170Z cp.async.cg.shared.global [ %r453 + 0 ], [ %rd44 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4845396Z // end inline asm 2026-02-21T09:03:09.4845532Z // begin inline asm 2026-02-21T09:03:09.4845731Z cp.async.cg.shared.global [ %r455 + 0 ], [ %rd45 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4845946Z // end inline asm 2026-02-21T09:03:09.4846094Z cp.async.commit_group; 2026-02-21T09:03:09.4846351Z .loc 1 54 62 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:62 2026-02-21T09:03:09.4846681Z add.s32 %r292, %r289, %r210; 2026-02-21T09:03:09.4846855Z add.s32 %r293, %r292, 229376; 2026-02-21T09:03:09.4847114Z .loc 1 54 34 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:34 2026-02-21T09:03:09.4847404Z cvt.u64.u32 %rd76, %r292; 2026-02-21T09:03:09.4847562Z add.s64 %rd46, %rd36, %rd76; 2026-02-21T09:03:09.4847726Z cvt.u64.u32 %rd77, %r293; 2026-02-21T09:03:09.4847882Z add.s64 %rd47, %rd36, %rd77; 2026-02-21T09:03:09.4848176Z .loc 1 54 87 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:87 
2026-02-21T09:03:09.4848460Z // begin inline asm 2026-02-21T09:03:09.4848656Z cp.async.cg.shared.global [ %r457 + 0 ], [ %rd46 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4848885Z // end inline asm 2026-02-21T09:03:09.4849020Z // begin inline asm 2026-02-21T09:03:09.4849218Z cp.async.cg.shared.global [ %r459 + 0 ], [ %rd47 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4849437Z // end inline asm 2026-02-21T09:03:09.4849586Z cp.async.commit_group; 2026-02-21T09:03:09.4849842Z .loc 1 48 80 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:48:80 2026-02-21T09:03:09.4850130Z bar.sync 0; 2026-02-21T09:03:09.4850266Z // begin inline asm 2026-02-21T09:03:09.4850460Z cp.async.cg.shared.global [ %r109 + 0 ], [ %rd48 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4850711Z // end inline asm 2026-02-21T09:03:09.4850848Z // begin inline asm 2026-02-21T09:03:09.4851045Z cp.async.cg.shared.global [ %r111 + 0 ], [ %rd49 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4851262Z // end inline asm 2026-02-21T09:03:09.4851402Z // begin inline asm 2026-02-21T09:03:09.4851613Z cp.async.cg.shared.global [ %r113 + 0 ], [ %rd50 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4851835Z // end inline asm 2026-02-21T09:03:09.4851973Z // begin inline asm 2026-02-21T09:03:09.4852159Z cp.async.cg.shared.global [ %r115 + 0 ], [ %rd51 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4852380Z // end inline asm 2026-02-21T09:03:09.4852514Z // begin inline asm 2026-02-21T09:03:09.4852711Z cp.async.cg.shared.global [ %r117 + 0 ], [ %rd52 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4852923Z // end inline asm 2026-02-21T09:03:09.4853066Z // begin inline asm 2026-02-21T09:03:09.4853255Z cp.async.cg.shared.global [ %r119 + 0 ], [ %rd53 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4853481Z // end inline asm 2026-02-21T09:03:09.4853622Z // begin inline asm 2026-02-21T09:03:09.4853812Z cp.async.cg.shared.global [ %r121 + 0 ], [ %rd54 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4854036Z // end inline asm 2026-02-21T09:03:09.4854171Z // begin inline asm 
2026-02-21T09:03:09.4854368Z cp.async.cg.shared.global [ %r123 + 0 ], [ %rd55 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4854583Z // end inline asm 2026-02-21T09:03:09.4854730Z cp.async.commit_group; 2026-02-21T09:03:09.4854983Z .loc 1 54 34 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:34 2026-02-21T09:03:09.4855271Z cvt.u64.u32 %rd78, %r289; 2026-02-21T09:03:09.4855438Z cvt.u64.u32 %rd79, %r210; 2026-02-21T09:03:09.4855598Z add.s64 %rd80, %rd78, %rd79; 2026-02-21T09:03:09.4855770Z add.s64 %rd26, %rd36, %rd80; 2026-02-21T09:03:09.4855928Z add.s64 %rd56, %rd26, 458752; 2026-02-21T09:03:09.4856096Z add.s64 %rd57, %rd26, 688128; 2026-02-21T09:03:09.4856356Z .loc 1 54 87 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:87 2026-02-21T09:03:09.4856687Z // begin inline asm 2026-02-21T09:03:09.4856879Z cp.async.cg.shared.global [ %r125 + 0 ], [ %rd56 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4857108Z // end inline asm 2026-02-21T09:03:09.4857251Z // begin inline asm 2026-02-21T09:03:09.4857443Z cp.async.cg.shared.global [ %r127 + 0 ], [ %rd57 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4857669Z // end inline asm 2026-02-21T09:03:09.4857808Z cp.async.commit_group; 2026-02-21T09:03:09.4858067Z .loc 1 48 80 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:48:80 2026-02-21T09:03:09.4858352Z cp.async.wait_group 2; 2026-02-21T09:03:09.4858545Z bar.sync 0; 2026-02-21T09:03:09.4858711Z ld.shared.v4.b32 {%r294, %r295, %r296, %r297}, [%r221]; 2026-02-21T09:03:09.4858926Z mov.b32 {%rs1, %rs2}, %r297; 2026-02-21T09:03:09.4859092Z mov.b32 {%rs3, %rs4}, %r296; 2026-02-21T09:03:09.4859247Z mov.b32 {%rs5, %rs6}, %r295; 2026-02-21T09:03:09.4859408Z mov.b32 {%rs7, %rs8}, %r294; 2026-02-21T09:03:09.4859600Z ld.shared.v4.b32 {%r298, %r299, %r300, %r301}, [%r221+16]; 2026-02-21T09:03:09.4859817Z mov.b32 {%rs9, %rs10}, %r301; 2026-02-21T09:03:09.4860006Z mov.b32 {%rs11, %rs12}, %r300; 2026-02-21T09:03:09.4860181Z mov.b32 {%rs13, %rs14}, %r299; 
2026-02-21T09:03:09.4860340Z mov.b32 {%rs15, %rs16}, %r298; 2026-02-21T09:03:09.4860537Z ld.shared.v4.b32 {%r302, %r303, %r304, %r305}, [%r221+32]; 2026-02-21T09:03:09.4860748Z mov.b32 {%rs17, %rs18}, %r305; 2026-02-21T09:03:09.4860907Z mov.b32 {%rs19, %rs20}, %r304; 2026-02-21T09:03:09.4861068Z mov.b32 {%rs21, %rs22}, %r303; 2026-02-21T09:03:09.4861222Z mov.b32 {%rs23, %rs24}, %r302; 2026-02-21T09:03:09.4861416Z ld.shared.v4.b32 {%r306, %r307, %r308, %r309}, [%r221+48]; 2026-02-21T09:03:09.4861655Z mov.b32 {%rs25, %rs26}, %r309; 2026-02-21T09:03:09.4861818Z mov.b32 {%rs27, %rs28}, %r308; 2026-02-21T09:03:09.4861972Z mov.b32 {%rs29, %rs30}, %r307; 2026-02-21T09:03:09.4862167Z mov.b32 {%rs31, %rs32}, %r306; 2026-02-21T09:03:09.4862360Z ld.shared.v4.b32 {%r310, %r311, %r312, %r313}, [%r221+64]; 2026-02-21T09:03:09.4862574Z mov.b32 {%rs33, %rs34}, %r313; 2026-02-21T09:03:09.4862743Z mov.b32 {%rs35, %rs36}, %r312; 2026-02-21T09:03:09.4862904Z mov.b32 {%rs37, %rs38}, %r311; 2026-02-21T09:03:09.4863073Z mov.b32 {%rs39, %rs40}, %r310; 2026-02-21T09:03:09.4863263Z ld.shared.v4.b32 {%r314, %r315, %r316, %r317}, [%r221+80]; 2026-02-21T09:03:09.4863475Z mov.b32 {%rs41, %rs42}, %r317; 2026-02-21T09:03:09.4863635Z mov.b32 {%rs43, %rs44}, %r316; 2026-02-21T09:03:09.4863807Z mov.b32 {%rs45, %rs46}, %r315; 2026-02-21T09:03:09.4863967Z mov.b32 {%rs47, %rs48}, %r314; 2026-02-21T09:03:09.4864170Z ld.shared.v4.b32 {%r318, %r319, %r320, %r321}, [%r221+96]; 2026-02-21T09:03:09.4864383Z mov.b32 {%rs49, %rs50}, %r321; 2026-02-21T09:03:09.4864543Z mov.b32 {%rs51, %rs52}, %r320; 2026-02-21T09:03:09.4864710Z mov.b32 {%rs53, %rs54}, %r319; 2026-02-21T09:03:09.4864874Z mov.b32 {%rs55, %rs56}, %r318; 2026-02-21T09:03:09.4865084Z ld.shared.v4.b32 {%r322, %r323, %r324, %r325}, [%r221+112]; 2026-02-21T09:03:09.4865305Z mov.b32 {%rs57, %rs58}, %r325; 2026-02-21T09:03:09.4865473Z mov.b32 {%rs59, %rs60}, %r324; 2026-02-21T09:03:09.4865632Z mov.b32 {%rs61, %rs62}, %r323; 2026-02-21T09:03:09.4865797Z 
mov.b32 {%rs63, %rs64}, %r322; 2026-02-21T09:03:09.4866070Z .loc 1 52 32 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:52:32 2026-02-21T09:03:09.4866365Z cvt.f32.bf16 %r130, %rs7; 2026-02-21T09:03:09.4866535Z cvt.f32.bf16 %r131, %rs8; 2026-02-21T09:03:09.4866691Z cvt.f32.bf16 %r132, %rs5; 2026-02-21T09:03:09.4866855Z cvt.f32.bf16 %r133, %rs6; 2026-02-21T09:03:09.4867007Z cvt.f32.bf16 %r134, %rs3; 2026-02-21T09:03:09.4867167Z cvt.f32.bf16 %r135, %rs4; 2026-02-21T09:03:09.4867318Z cvt.f32.bf16 %r136, %rs1; 2026-02-21T09:03:09.4867476Z cvt.f32.bf16 %r137, %rs2; 2026-02-21T09:03:09.4867637Z cvt.f32.bf16 %r138, %rs15; 2026-02-21T09:03:09.4867799Z cvt.f32.bf16 %r139, %rs16; 2026-02-21T09:03:09.4868011Z cvt.f32.bf16 %r140, %rs13; 2026-02-21T09:03:09.4868163Z cvt.f32.bf16 %r141, %rs14; 2026-02-21T09:03:09.4868320Z cvt.f32.bf16 %r142, %rs11; 2026-02-21T09:03:09.4868470Z cvt.f32.bf16 %r143, %rs12; 2026-02-21T09:03:09.4868627Z cvt.f32.bf16 %r144, %rs9; 2026-02-21T09:03:09.4868774Z cvt.f32.bf16 %r145, %rs10; 2026-02-21T09:03:09.4868930Z cvt.f32.bf16 %r147, %rs23; 2026-02-21T09:03:09.4869078Z cvt.f32.bf16 %r148, %rs24; 2026-02-21T09:03:09.4869233Z cvt.f32.bf16 %r149, %rs21; 2026-02-21T09:03:09.4869386Z cvt.f32.bf16 %r150, %rs22; 2026-02-21T09:03:09.4869535Z cvt.f32.bf16 %r151, %rs19; 2026-02-21T09:03:09.4869723Z cvt.f32.bf16 %r152, %rs20; 2026-02-21T09:03:09.4869872Z cvt.f32.bf16 %r153, %rs17; 2026-02-21T09:03:09.4870027Z cvt.f32.bf16 %r154, %rs18; 2026-02-21T09:03:09.4870174Z cvt.f32.bf16 %r155, %rs31; 2026-02-21T09:03:09.4870331Z cvt.f32.bf16 %r156, %rs32; 2026-02-21T09:03:09.4870484Z cvt.f32.bf16 %r157, %rs29; 2026-02-21T09:03:09.4870647Z cvt.f32.bf16 %r158, %rs30; 2026-02-21T09:03:09.4870802Z cvt.f32.bf16 %r159, %rs27; 2026-02-21T09:03:09.4870965Z cvt.f32.bf16 %r160, %rs28; 2026-02-21T09:03:09.4871158Z cvt.f32.bf16 %r161, %rs25; 2026-02-21T09:03:09.4871319Z cvt.f32.bf16 %r162, %rs26; 2026-02-21T09:03:09.4871484Z cvt.f32.bf16 %r164, %rs39; 
2026-02-21T09:03:09.4871688Z cvt.f32.bf16 %r165, %rs40; 2026-02-21T09:03:09.4871853Z cvt.f32.bf16 %r166, %rs37; 2026-02-21T09:03:09.4872009Z cvt.f32.bf16 %r167, %rs38; 2026-02-21T09:03:09.4872176Z cvt.f32.bf16 %r168, %rs35; 2026-02-21T09:03:09.4872333Z cvt.f32.bf16 %r169, %rs36; 2026-02-21T09:03:09.4872497Z cvt.f32.bf16 %r170, %rs33; 2026-02-21T09:03:09.4872656Z cvt.f32.bf16 %r171, %rs34; 2026-02-21T09:03:09.4872823Z cvt.f32.bf16 %r172, %rs47; 2026-02-21T09:03:09.4872992Z cvt.f32.bf16 %r173, %rs48; 2026-02-21T09:03:09.4873155Z cvt.f32.bf16 %r174, %rs45; 2026-02-21T09:03:09.4873330Z cvt.f32.bf16 %r175, %rs46; 2026-02-21T09:03:09.4873492Z cvt.f32.bf16 %r176, %rs43; 2026-02-21T09:03:09.4873705Z cvt.f32.bf16 %r177, %rs44; 2026-02-21T09:03:09.4873869Z cvt.f32.bf16 %r178, %rs41; 2026-02-21T09:03:09.4874033Z cvt.f32.bf16 %r179, %rs42; 2026-02-21T09:03:09.4874192Z cvt.f32.bf16 %r181, %rs55; 2026-02-21T09:03:09.4874359Z cvt.f32.bf16 %r182, %rs56; 2026-02-21T09:03:09.4874516Z cvt.f32.bf16 %r183, %rs53; 2026-02-21T09:03:09.4874680Z cvt.f32.bf16 %r184, %rs54; 2026-02-21T09:03:09.4874842Z cvt.f32.bf16 %r185, %rs51; 2026-02-21T09:03:09.4875000Z cvt.f32.bf16 %r186, %rs52; 2026-02-21T09:03:09.4875166Z cvt.f32.bf16 %r187, %rs49; 2026-02-21T09:03:09.4875325Z cvt.f32.bf16 %r188, %rs50; 2026-02-21T09:03:09.4875492Z cvt.f32.bf16 %r189, %rs63; 2026-02-21T09:03:09.4875650Z cvt.f32.bf16 %r190, %rs64; 2026-02-21T09:03:09.4875815Z cvt.f32.bf16 %r191, %rs61; 2026-02-21T09:03:09.4875973Z cvt.f32.bf16 %r192, %rs62; 2026-02-21T09:03:09.4876140Z cvt.f32.bf16 %r193, %rs59; 2026-02-21T09:03:09.4876298Z cvt.f32.bf16 %r194, %rs60; 2026-02-21T09:03:09.4876464Z cvt.f32.bf16 %r195, %rs57; 2026-02-21T09:03:09.4876630Z cvt.f32.bf16 %r196, %rs58; 2026-02-21T09:03:09.4876782Z $L__tmp2: 2026-02-21T09:03:09.4877093Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.4877448Z add.s32 %r129, %r291, %r391; 2026-02-21T09:03:09.4877618Z // 
begin inline asm 2026-02-21T09:03:09.4878009Z @%p16 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r129 + 0], 64, {%r130, %r131, %r132, %r133, %r134, %r135, %r136, %r137, %r138, %r139, %r140, %r141, %r142, %r143, %r144, %r145}; 2026-02-21T09:03:09.4878435Z // end inline asm 2026-02-21T09:03:09.4878583Z // begin inline asm 2026-02-21T09:03:09.4878947Z @%p16 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r129 + 16], 64, {%r147, %r148, %r149, %r150, %r151, %r152, %r153, %r154, %r155, %r156, %r157, %r158, %r159, %r160, %r161, %r162}; 2026-02-21T09:03:09.4879345Z // end inline asm 2026-02-21T09:03:09.4879479Z // begin inline asm 2026-02-21T09:03:09.4879843Z @%p16 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r129 + 32], 64, {%r164, %r165, %r166, %r167, %r168, %r169, %r170, %r171, %r172, %r173, %r174, %r175, %r176, %r177, %r178, %r179}; 2026-02-21T09:03:09.4880266Z // end inline asm 2026-02-21T09:03:09.4880406Z // begin inline asm 2026-02-21T09:03:09.4880768Z @%p16 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r129 + 48], 64, {%r181, %r182, %r183, %r184, %r185, %r186, %r187, %r188, %r189, %r190, %r191, %r192, %r193, %r194, %r195, %r196}; 2026-02-21T09:03:09.4881151Z // end inline asm 2026-02-21T09:03:09.4881291Z // begin inline asm 2026-02-21T09:03:09.4881444Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:03:09.4881663Z // end inline asm 2026-02-21T09:03:09.4881798Z bar.sync 0; 2026-02-21T09:03:09.4881935Z $L__tmp3: 2026-02-21T09:03:09.4882177Z .loc 1 54 87 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:87 2026-02-21T09:03:09.4882480Z ld.shared.b8 %rs65, [%r222]; 2026-02-21T09:03:09.4882656Z ld.shared.b8 %rs66, [%r222+64]; 2026-02-21T09:03:09.4882833Z ld.shared.b8 %rs67, [%r222+128]; 2026-02-21T09:03:09.4883013Z ld.shared.b8 %rs68, [%r222+192]; 2026-02-21T09:03:09.4883180Z ld.shared.b8 %rs69, [%r222+256]; 2026-02-21T09:03:09.4883394Z ld.shared.b8 %rs70, [%r222+320]; 2026-02-21T09:03:09.4883565Z ld.shared.b8 %rs71, [%r222+384]; 2026-02-21T09:03:09.4883747Z ld.shared.b8 
%rs72, [%r222+448]; 2026-02-21T09:03:09.4883918Z ld.shared.b8 %rs73, [%r222+512]; 2026-02-21T09:03:09.4884097Z ld.shared.b8 %rs74, [%r222+576]; 2026-02-21T09:03:09.4884277Z ld.shared.b8 %rs75, [%r222+640]; 2026-02-21T09:03:09.4884443Z ld.shared.b8 %rs76, [%r222+704]; 2026-02-21T09:03:09.4884621Z ld.shared.b8 %rs77, [%r222+768]; 2026-02-21T09:03:09.4884787Z ld.shared.b8 %rs78, [%r222+832]; 2026-02-21T09:03:09.4884962Z ld.shared.b8 %rs79, [%r222+896]; 2026-02-21T09:03:09.4885133Z ld.shared.b8 %rs80, [%r223]; 2026-02-21T09:03:09.4885312Z ld.shared.b8 %rs81, [%r222+1024]; 2026-02-21T09:03:09.4885496Z ld.shared.b8 %rs82, [%r222+1088]; 2026-02-21T09:03:09.4885709Z ld.shared.b8 %rs83, [%r222+1152]; 2026-02-21T09:03:09.4885886Z ld.shared.b8 %rs84, [%r222+1216]; 2026-02-21T09:03:09.4886052Z ld.shared.b8 %rs85, [%r222+1280]; 2026-02-21T09:03:09.4886225Z ld.shared.b8 %rs86, [%r222+1344]; 2026-02-21T09:03:09.4886388Z ld.shared.b8 %rs87, [%r222+1408]; 2026-02-21T09:03:09.4886559Z ld.shared.b8 %rs88, [%r222+1472]; 2026-02-21T09:03:09.4886722Z ld.shared.b8 %rs89, [%r222+1536]; 2026-02-21T09:03:09.4886890Z ld.shared.b8 %rs90, [%r222+1600]; 2026-02-21T09:03:09.4887053Z ld.shared.b8 %rs91, [%r222+1664]; 2026-02-21T09:03:09.4887225Z ld.shared.b8 %rs92, [%r222+1728]; 2026-02-21T09:03:09.4887399Z ld.shared.b8 %rs93, [%r222+1792]; 2026-02-21T09:03:09.4887563Z ld.shared.b8 %rs94, [%r222+1856]; 2026-02-21T09:03:09.4887733Z ld.shared.b8 %rs95, [%r222+1920]; 2026-02-21T09:03:09.4887897Z ld.shared.b8 %rs96, [%r224]; 2026-02-21T09:03:09.4888062Z ld.shared.b8 %rs97, [%r222+2048]; 2026-02-21T09:03:09.4888226Z ld.shared.b8 %rs98, [%r222+2112]; 2026-02-21T09:03:09.4888397Z ld.shared.b8 %rs99, [%r222+2176]; 2026-02-21T09:03:09.4888564Z ld.shared.b8 %rs100, [%r222+2240]; 2026-02-21T09:03:09.4888744Z ld.shared.b8 %rs101, [%r222+2304]; 2026-02-21T09:03:09.4888918Z ld.shared.b8 %rs102, [%r222+2368]; 2026-02-21T09:03:09.4889086Z ld.shared.b8 %rs103, [%r222+2432]; 2026-02-21T09:03:09.4889257Z 
ld.shared.b8 %rs104, [%r222+2496]; 2026-02-21T09:03:09.4889421Z ld.shared.b8 %rs105, [%r222+2560]; 2026-02-21T09:03:09.4889592Z ld.shared.b8 %rs106, [%r222+2624]; 2026-02-21T09:03:09.4889756Z ld.shared.b8 %rs107, [%r222+2688]; 2026-02-21T09:03:09.4889928Z ld.shared.b8 %rs108, [%r222+2752]; 2026-02-21T09:03:09.4890092Z ld.shared.b8 %rs109, [%r222+2816]; 2026-02-21T09:03:09.4890263Z ld.shared.b8 %rs110, [%r222+2880]; 2026-02-21T09:03:09.4890429Z ld.shared.b8 %rs111, [%r222+2944]; 2026-02-21T09:03:09.4890604Z ld.shared.b8 %rs112, [%r225]; 2026-02-21T09:03:09.4890772Z ld.shared.b8 %rs113, [%r222+3072]; 2026-02-21T09:03:09.4890940Z ld.shared.b8 %rs114, [%r222+3136]; 2026-02-21T09:03:09.4891145Z ld.shared.b8 %rs115, [%r222+3200]; 2026-02-21T09:03:09.4891312Z ld.shared.b8 %rs116, [%r222+3264]; 2026-02-21T09:03:09.4891485Z ld.shared.b8 %rs117, [%r222+3328]; 2026-02-21T09:03:09.4891693Z ld.shared.b8 %rs118, [%r222+3392]; 2026-02-21T09:03:09.4891866Z ld.shared.b8 %rs119, [%r222+3456]; 2026-02-21T09:03:09.4892031Z ld.shared.b8 %rs120, [%r222+3520]; 2026-02-21T09:03:09.4892205Z ld.shared.b8 %rs121, [%r222+3584]; 2026-02-21T09:03:09.4892380Z ld.shared.b8 %rs122, [%r222+3648]; 2026-02-21T09:03:09.4892548Z ld.shared.b8 %rs123, [%r222+3712]; 2026-02-21T09:03:09.4892755Z ld.shared.b8 %rs124, [%r222+3776]; 2026-02-21T09:03:09.4892924Z ld.shared.b8 %rs125, [%r222+3840]; 2026-02-21T09:03:09.4893097Z ld.shared.b8 %rs126, [%r222+3904]; 2026-02-21T09:03:09.4893267Z ld.shared.b8 %rs127, [%r222+3968]; 2026-02-21T09:03:09.4893453Z ld.shared.b8 %rs128, [%r226]; 2026-02-21T09:03:09.4893731Z .loc 1 57 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:57:28 2026-02-21T09:03:09.4894032Z shl.b16 %rs129, %rs65, 4; 2026-02-21T09:03:09.4894194Z shl.b16 %rs130, %rs66, 4; 2026-02-21T09:03:09.4894378Z shl.b16 %rs131, %rs67, 4; 2026-02-21T09:03:09.4894538Z shl.b16 %rs132, %rs68, 4; 2026-02-21T09:03:09.4894687Z shl.b16 %rs133, %rs69, 4; 2026-02-21T09:03:09.4894842Z shl.b16 %rs134, %rs70, 4; 
2026-02-21T09:03:09.4894987Z shl.b16 %rs135, %rs71, 4; 2026-02-21T09:03:09.4895140Z shl.b16 %rs136, %rs72, 4; 2026-02-21T09:03:09.4895287Z shl.b16 %rs137, %rs73, 4; 2026-02-21T09:03:09.4895440Z shl.b16 %rs138, %rs74, 4; 2026-02-21T09:03:09.4895587Z shl.b16 %rs139, %rs75, 4; 2026-02-21T09:03:09.4895741Z shl.b16 %rs140, %rs76, 4; 2026-02-21T09:03:09.4895894Z shl.b16 %rs141, %rs77, 4; 2026-02-21T09:03:09.4896042Z shl.b16 %rs142, %rs78, 4; 2026-02-21T09:03:09.4896196Z shl.b16 %rs143, %rs79, 4; 2026-02-21T09:03:09.4896344Z shl.b16 %rs144, %rs80, 4; 2026-02-21T09:03:09.4896531Z shl.b16 %rs145, %rs81, 4; 2026-02-21T09:03:09.4896681Z shl.b16 %rs146, %rs82, 4; 2026-02-21T09:03:09.4896835Z shl.b16 %rs147, %rs83, 4; 2026-02-21T09:03:09.4896981Z shl.b16 %rs148, %rs84, 4; 2026-02-21T09:03:09.4897134Z shl.b16 %rs149, %rs85, 4; 2026-02-21T09:03:09.4897280Z shl.b16 %rs150, %rs86, 4; 2026-02-21T09:03:09.4897431Z shl.b16 %rs151, %rs87, 4; 2026-02-21T09:03:09.4897581Z shl.b16 %rs152, %rs88, 4; 2026-02-21T09:03:09.4897725Z shl.b16 %rs153, %rs89, 4; 2026-02-21T09:03:09.4897880Z shl.b16 %rs154, %rs90, 4; 2026-02-21T09:03:09.4898025Z shl.b16 %rs155, %rs91, 4; 2026-02-21T09:03:09.4898178Z shl.b16 %rs156, %rs92, 4; 2026-02-21T09:03:09.4898325Z shl.b16 %rs157, %rs93, 4; 2026-02-21T09:03:09.4898477Z shl.b16 %rs158, %rs94, 4; 2026-02-21T09:03:09.4898622Z shl.b16 %rs159, %rs95, 4; 2026-02-21T09:03:09.4898776Z shl.b16 %rs160, %rs96, 4; 2026-02-21T09:03:09.4898922Z shl.b16 %rs161, %rs97, 4; 2026-02-21T09:03:09.4899073Z shl.b16 %rs162, %rs98, 4; 2026-02-21T09:03:09.4899228Z shl.b16 %rs163, %rs99, 4; 2026-02-21T09:03:09.4899380Z shl.b16 %rs164, %rs100, 4; 2026-02-21T09:03:09.4899543Z shl.b16 %rs165, %rs101, 4; 2026-02-21T09:03:09.4899698Z shl.b16 %rs166, %rs102, 4; 2026-02-21T09:03:09.4899858Z shl.b16 %rs167, %rs103, 4; 2026-02-21T09:03:09.4900009Z shl.b16 %rs168, %rs104, 4; 2026-02-21T09:03:09.4900166Z shl.b16 %rs169, %rs105, 4; 2026-02-21T09:03:09.4900315Z shl.b16 %rs170, %rs106, 4; 
2026-02-21T09:03:09.4900473Z shl.b16 %rs171, %rs107, 4; 2026-02-21T09:03:09.4900631Z shl.b16 %rs172, %rs108, 4; 2026-02-21T09:03:09.4900783Z shl.b16 %rs173, %rs109, 4; 2026-02-21T09:03:09.4900948Z shl.b16 %rs174, %rs110, 4; 2026-02-21T09:03:09.4901099Z shl.b16 %rs175, %rs111, 4; 2026-02-21T09:03:09.4901256Z shl.b16 %rs176, %rs112, 4; 2026-02-21T09:03:09.4901404Z shl.b16 %rs177, %rs113, 4; 2026-02-21T09:03:09.4901589Z shl.b16 %rs178, %rs114, 4; 2026-02-21T09:03:09.4901741Z shl.b16 %rs179, %rs115, 4; 2026-02-21T09:03:09.4901900Z shl.b16 %rs180, %rs116, 4; 2026-02-21T09:03:09.4902051Z shl.b16 %rs181, %rs117, 4; 2026-02-21T09:03:09.4902245Z shl.b16 %rs182, %rs118, 4; 2026-02-21T09:03:09.4902400Z shl.b16 %rs183, %rs119, 4; 2026-02-21T09:03:09.4902553Z shl.b16 %rs184, %rs120, 4; 2026-02-21T09:03:09.4902712Z shl.b16 %rs185, %rs121, 4; 2026-02-21T09:03:09.4902861Z shl.b16 %rs186, %rs122, 4; 2026-02-21T09:03:09.4903021Z shl.b16 %rs187, %rs123, 4; 2026-02-21T09:03:09.4903170Z shl.b16 %rs188, %rs124, 4; 2026-02-21T09:03:09.4903329Z shl.b16 %rs189, %rs125, 4; 2026-02-21T09:03:09.4903480Z shl.b16 %rs190, %rs126, 4; 2026-02-21T09:03:09.4903637Z shl.b16 %rs191, %rs127, 4; 2026-02-21T09:03:09.4903784Z shl.b16 %rs192, %rs128, 4; 2026-02-21T09:03:09.4904083Z .loc 1 72 58 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:72:58 2026-02-21T09:03:09.4904382Z selp.b16 %rs193, %rs129, %rs65, %p12; 2026-02-21T09:03:09.4904559Z cvt.s16.s8 %rs194, %rs193; 2026-02-21T09:03:09.4904721Z shr.s16 %rs195, %rs194, 4; 2026-02-21T09:03:09.4904880Z selp.b16 %rs196, %rs130, %rs66, %p12; 2026-02-21T09:03:09.4905062Z cvt.s16.s8 %rs197, %rs196; 2026-02-21T09:03:09.4905210Z shr.s16 %rs198, %rs197, 4; 2026-02-21T09:03:09.4905399Z selp.b16 %rs199, %rs131, %rs67, %p12; 2026-02-21T09:03:09.4905571Z cvt.s16.s8 %rs200, %rs199; 2026-02-21T09:03:09.4905727Z shr.s16 %rs201, %rs200, 4; 2026-02-21T09:03:09.4905889Z selp.b16 %rs202, %rs132, %rs68, %p12; 2026-02-21T09:03:09.4906057Z cvt.s16.s8 %rs203, %rs202; 
2026-02-21T09:03:09.4906218Z shr.s16 %rs204, %rs203, 4; 2026-02-21T09:03:09.4906375Z selp.b16 %rs205, %rs133, %rs69, %p12; 2026-02-21T09:03:09.4906551Z cvt.s16.s8 %rs206, %rs205; 2026-02-21T09:03:09.4906703Z shr.s16 %rs207, %rs206, 4; 2026-02-21T09:03:09.4906864Z selp.b16 %rs208, %rs134, %rs70, %p12; 2026-02-21T09:03:09.4907030Z cvt.s16.s8 %rs209, %rs208; 2026-02-21T09:03:09.4907189Z shr.s16 %rs210, %rs209, 4; 2026-02-21T09:03:09.4907344Z selp.b16 %rs211, %rs135, %rs71, %p12; 2026-02-21T09:03:09.4907523Z cvt.s16.s8 %rs212, %rs211; 2026-02-21T09:03:09.4907744Z shr.s16 %rs213, %rs212, 4; 2026-02-21T09:03:09.4907903Z selp.b16 %rs214, %rs136, %rs72, %p12; 2026-02-21T09:03:09.4908082Z cvt.s16.s8 %rs215, %rs214; 2026-02-21T09:03:09.4908233Z shr.s16 %rs216, %rs215, 4; 2026-02-21T09:03:09.4908397Z selp.b16 %rs217, %rs137, %rs73, %p12; 2026-02-21T09:03:09.4908565Z cvt.s16.s8 %rs218, %rs217; 2026-02-21T09:03:09.4908724Z shr.s16 %rs219, %rs218, 4; 2026-02-21T09:03:09.4908879Z selp.b16 %rs220, %rs138, %rs74, %p12; 2026-02-21T09:03:09.4909057Z cvt.s16.s8 %rs221, %rs220; 2026-02-21T09:03:09.4909219Z shr.s16 %rs222, %rs221, 4; 2026-02-21T09:03:09.4909380Z selp.b16 %rs223, %rs139, %rs75, %p12; 2026-02-21T09:03:09.4909567Z cvt.s16.s8 %rs224, %rs223; 2026-02-21T09:03:09.4909725Z shr.s16 %rs225, %rs224, 4; 2026-02-21T09:03:09.4909890Z selp.b16 %rs226, %rs140, %rs76, %p12; 2026-02-21T09:03:09.4910058Z cvt.s16.s8 %rs227, %rs226; 2026-02-21T09:03:09.4910216Z shr.s16 %rs228, %rs227, 4; 2026-02-21T09:03:09.4910372Z selp.b16 %rs229, %rs141, %rs77, %p12; 2026-02-21T09:03:09.4910548Z cvt.s16.s8 %rs230, %rs229; 2026-02-21T09:03:09.4910697Z shr.s16 %rs231, %rs230, 4; 2026-02-21T09:03:09.4910863Z selp.b16 %rs232, %rs142, %rs78, %p12; 2026-02-21T09:03:09.4911040Z cvt.s16.s8 %rs233, %rs232; 2026-02-21T09:03:09.4911191Z shr.s16 %rs234, %rs233, 4; 2026-02-21T09:03:09.4911353Z selp.b16 %rs235, %rs143, %rs79, %p12; 2026-02-21T09:03:09.4911519Z cvt.s16.s8 %rs236, %rs235; 2026-02-21T09:03:09.4911720Z 
shr.s16 %rs237, %rs236, 4; 2026-02-21T09:03:09.4911874Z selp.b16 %rs238, %rs144, %rs80, %p12; 2026-02-21T09:03:09.4912050Z cvt.s16.s8 %rs239, %rs238; 2026-02-21T09:03:09.4912203Z shr.s16 %rs240, %rs239, 4; 2026-02-21T09:03:09.4912366Z selp.b16 %rs241, %rs145, %rs81, %p12; 2026-02-21T09:03:09.4912533Z cvt.s16.s8 %rs242, %rs241; 2026-02-21T09:03:09.4912694Z shr.s16 %rs243, %rs242, 4; 2026-02-21T09:03:09.4912864Z selp.b16 %rs244, %rs146, %rs82, %p12; 2026-02-21T09:03:09.4913040Z cvt.s16.s8 %rs245, %rs244; 2026-02-21T09:03:09.4913208Z shr.s16 %rs246, %rs245, 4; 2026-02-21T09:03:09.4913408Z selp.b16 %rs247, %rs147, %rs83, %p12; 2026-02-21T09:03:09.4913593Z cvt.s16.s8 %rs248, %rs247; 2026-02-21T09:03:09.4913752Z shr.s16 %rs249, %rs248, 4; 2026-02-21T09:03:09.4913922Z selp.b16 %rs250, %rs148, %rs84, %p12; 2026-02-21T09:03:09.4914097Z cvt.s16.s8 %rs251, %rs250; 2026-02-21T09:03:09.4914263Z shr.s16 %rs252, %rs251, 4; 2026-02-21T09:03:09.4914433Z selp.b16 %rs253, %rs149, %rs85, %p12; 2026-02-21T09:03:09.4914610Z cvt.s16.s8 %rs254, %rs253; 2026-02-21T09:03:09.4914778Z shr.s16 %rs255, %rs254, 4; 2026-02-21T09:03:09.4914941Z selp.b16 %rs256, %rs150, %rs86, %p12; 2026-02-21T09:03:09.4915163Z cvt.s16.s8 %rs257, %rs256; 2026-02-21T09:03:09.4915321Z shr.s16 %rs258, %rs257, 4; 2026-02-21T09:03:09.4915489Z selp.b16 %rs259, %rs151, %rs87, %p12; 2026-02-21T09:03:09.4915666Z cvt.s16.s8 %rs260, %rs259; 2026-02-21T09:03:09.4915830Z shr.s16 %rs261, %rs260, 4; 2026-02-21T09:03:09.4915997Z selp.b16 %rs262, %rs152, %rs88, %p12; 2026-02-21T09:03:09.4916181Z cvt.s16.s8 %rs263, %rs262; 2026-02-21T09:03:09.4916345Z shr.s16 %rs264, %rs263, 4; 2026-02-21T09:03:09.4916508Z selp.b16 %rs265, %rs153, %rs89, %p12; 2026-02-21T09:03:09.4916723Z cvt.s16.s8 %rs266, %rs265; 2026-02-21T09:03:09.4916882Z shr.s16 %rs267, %rs266, 4; 2026-02-21T09:03:09.4917052Z selp.b16 %rs268, %rs154, %rs90, %p12; 2026-02-21T09:03:09.4917228Z cvt.s16.s8 %rs269, %rs268; 2026-02-21T09:03:09.4917397Z shr.s16 %rs270, %rs269, 4; 
2026-02-21T09:03:09.4917561Z selp.b16 %rs271, %rs155, %rs91, %p12; 2026-02-21T09:03:09.4917749Z cvt.s16.s8 %rs272, %rs271; 2026-02-21T09:03:09.4917917Z shr.s16 %rs273, %rs272, 4; 2026-02-21T09:03:09.4918084Z selp.b16 %rs274, %rs156, %rs92, %p12; 2026-02-21T09:03:09.4918268Z cvt.s16.s8 %rs275, %rs274; 2026-02-21T09:03:09.4918425Z shr.s16 %rs276, %rs275, 4; 2026-02-21T09:03:09.4918607Z selp.b16 %rs277, %rs157, %rs93, %p12; 2026-02-21T09:03:09.4918787Z cvt.s16.s8 %rs278, %rs277; 2026-02-21T09:03:09.4918988Z shr.s16 %rs279, %rs278, 4; 2026-02-21T09:03:09.4919156Z selp.b16 %rs280, %rs158, %rs94, %p12; 2026-02-21T09:03:09.4919342Z cvt.s16.s8 %rs281, %rs280; 2026-02-21T09:03:09.4919502Z shr.s16 %rs282, %rs281, 4; 2026-02-21T09:03:09.4919680Z selp.b16 %rs283, %rs159, %rs95, %p12; 2026-02-21T09:03:09.4919865Z cvt.s16.s8 %rs284, %rs283; 2026-02-21T09:03:09.4920025Z shr.s16 %rs285, %rs284, 4; 2026-02-21T09:03:09.4920198Z selp.b16 %rs286, %rs160, %rs96, %p12; 2026-02-21T09:03:09.4920375Z cvt.s16.s8 %rs287, %rs286; 2026-02-21T09:03:09.4920540Z shr.s16 %rs288, %rs287, 4; 2026-02-21T09:03:09.4920703Z selp.b16 %rs289, %rs161, %rs97, %p12; 2026-02-21T09:03:09.4920889Z cvt.s16.s8 %rs290, %rs289; 2026-02-21T09:03:09.4921048Z shr.s16 %rs291, %rs290, 4; 2026-02-21T09:03:09.4921218Z selp.b16 %rs292, %rs162, %rs98, %p12; 2026-02-21T09:03:09.4921401Z cvt.s16.s8 %rs293, %rs292; 2026-02-21T09:03:09.4921599Z shr.s16 %rs294, %rs293, 4; 2026-02-21T09:03:09.4921779Z selp.b16 %rs295, %rs163, %rs99, %p12; 2026-02-21T09:03:09.4921948Z cvt.s16.s8 %rs296, %rs295; 2026-02-21T09:03:09.4922110Z shr.s16 %rs297, %rs296, 4; 2026-02-21T09:03:09.4922273Z selp.b16 %rs298, %rs164, %rs100, %p12; 2026-02-21T09:03:09.4922453Z cvt.s16.s8 %rs299, %rs298; 2026-02-21T09:03:09.4922604Z shr.s16 %rs300, %rs299, 4; 2026-02-21T09:03:09.4922771Z selp.b16 %rs301, %rs165, %rs101, %p12; 2026-02-21T09:03:09.4922941Z cvt.s16.s8 %rs302, %rs301; 2026-02-21T09:03:09.4923098Z shr.s16 %rs303, %rs302, 4; 2026-02-21T09:03:09.4923263Z 
selp.b16 %rs304, %rs166, %rs102, %p12; 2026-02-21T09:03:09.4923435Z cvt.s16.s8 %rs305, %rs304; 2026-02-21T09:03:09.4923596Z shr.s16 %rs306, %rs305, 4; 2026-02-21T09:03:09.4923755Z selp.b16 %rs307, %rs167, %rs103, %p12; 2026-02-21T09:03:09.4923936Z cvt.s16.s8 %rs308, %rs307; 2026-02-21T09:03:09.4924086Z shr.s16 %rs309, %rs308, 4; 2026-02-21T09:03:09.4924252Z selp.b16 %rs310, %rs168, %rs104, %p12; 2026-02-21T09:03:09.4924422Z cvt.s16.s8 %rs311, %rs310; 2026-02-21T09:03:09.4924581Z shr.s16 %rs312, %rs311, 4; 2026-02-21T09:03:09.4924777Z selp.b16 %rs313, %rs169, %rs105, %p12; 2026-02-21T09:03:09.4924954Z cvt.s16.s8 %rs314, %rs313; 2026-02-21T09:03:09.4925113Z shr.s16 %rs315, %rs314, 4; 2026-02-21T09:03:09.4925272Z selp.b16 %rs316, %rs170, %rs106, %p12; 2026-02-21T09:03:09.4925453Z cvt.s16.s8 %rs317, %rs316; 2026-02-21T09:03:09.4925605Z shr.s16 %rs318, %rs317, 4; 2026-02-21T09:03:09.4925771Z selp.b16 %rs319, %rs171, %rs107, %p12; 2026-02-21T09:03:09.4925943Z cvt.s16.s8 %rs320, %rs319; 2026-02-21T09:03:09.4926106Z shr.s16 %rs321, %rs320, 4; 2026-02-21T09:03:09.4926265Z selp.b16 %rs322, %rs172, %rs108, %p12; 2026-02-21T09:03:09.4926491Z cvt.s16.s8 %rs323, %rs322; 2026-02-21T09:03:09.4926651Z shr.s16 %rs324, %rs323, 4; 2026-02-21T09:03:09.4926809Z selp.b16 %rs325, %rs173, %rs109, %p12; 2026-02-21T09:03:09.4926989Z cvt.s16.s8 %rs326, %rs325; 2026-02-21T09:03:09.4927139Z shr.s16 %rs327, %rs326, 4; 2026-02-21T09:03:09.4927305Z selp.b16 %rs328, %rs174, %rs110, %p12; 2026-02-21T09:03:09.4927480Z cvt.s16.s8 %rs329, %rs328; 2026-02-21T09:03:09.4927641Z shr.s16 %rs330, %rs329, 4; 2026-02-21T09:03:09.4927798Z selp.b16 %rs331, %rs175, %rs111, %p12; 2026-02-21T09:03:09.4928006Z cvt.s16.s8 %rs332, %rs331; 2026-02-21T09:03:09.4928158Z shr.s16 %rs333, %rs332, 4; 2026-02-21T09:03:09.4928324Z selp.b16 %rs334, %rs176, %rs112, %p12; 2026-02-21T09:03:09.4928502Z cvt.s16.s8 %rs335, %rs334; 2026-02-21T09:03:09.4928652Z shr.s16 %rs336, %rs335, 4; 2026-02-21T09:03:09.4928817Z selp.b16 %rs337, 
%rs177, %rs113, %p12; 2026-02-21T09:03:09.4928987Z cvt.s16.s8 %rs338, %rs337; 2026-02-21T09:03:09.4929144Z shr.s16 %rs339, %rs338, 4; 2026-02-21T09:03:09.4929302Z selp.b16 %rs340, %rs178, %rs114, %p12; 2026-02-21T09:03:09.4929479Z cvt.s16.s8 %rs341, %rs340; 2026-02-21T09:03:09.4929629Z shr.s16 %rs342, %rs341, 4; 2026-02-21T09:03:09.4929794Z selp.b16 %rs343, %rs179, %rs115, %p12; 2026-02-21T09:03:09.4929971Z cvt.s16.s8 %rs344, %rs343; 2026-02-21T09:03:09.4930148Z shr.s16 %rs345, %rs344, 4; 2026-02-21T09:03:09.4930315Z selp.b16 %rs346, %rs180, %rs116, %p12; 2026-02-21T09:03:09.4930484Z cvt.s16.s8 %rs347, %rs346; 2026-02-21T09:03:09.4930639Z shr.s16 %rs348, %rs347, 4; 2026-02-21T09:03:09.4930796Z selp.b16 %rs349, %rs181, %rs117, %p12; 2026-02-21T09:03:09.4930976Z cvt.s16.s8 %rs350, %rs349; 2026-02-21T09:03:09.4931123Z shr.s16 %rs351, %rs350, 4; 2026-02-21T09:03:09.4931287Z selp.b16 %rs352, %rs182, %rs118, %p12; 2026-02-21T09:03:09.4931455Z cvt.s16.s8 %rs353, %rs352; 2026-02-21T09:03:09.4931637Z shr.s16 %rs354, %rs353, 4; 2026-02-21T09:03:09.4931801Z selp.b16 %rs355, %rs183, %rs119, %p12; 2026-02-21T09:03:09.4931969Z cvt.s16.s8 %rs356, %rs355; 2026-02-21T09:03:09.4932129Z shr.s16 %rs357, %rs356, 4; 2026-02-21T09:03:09.4932285Z selp.b16 %rs358, %rs184, %rs120, %p12; 2026-02-21T09:03:09.4932463Z cvt.s16.s8 %rs359, %rs358; 2026-02-21T09:03:09.4932615Z shr.s16 %rs360, %rs359, 4; 2026-02-21T09:03:09.4932780Z selp.b16 %rs361, %rs185, %rs121, %p12; 2026-02-21T09:03:09.4932951Z cvt.s16.s8 %rs362, %rs361; 2026-02-21T09:03:09.4933109Z shr.s16 %rs363, %rs362, 4; 2026-02-21T09:03:09.4933271Z selp.b16 %rs364, %rs186, %rs122, %p12; 2026-02-21T09:03:09.4933444Z cvt.s16.s8 %rs365, %rs364; 2026-02-21T09:03:09.4933606Z shr.s16 %rs366, %rs365, 4; 2026-02-21T09:03:09.4933765Z selp.b16 %rs367, %rs187, %rs123, %p12; 2026-02-21T09:03:09.4933944Z cvt.s16.s8 %rs368, %rs367; 2026-02-21T09:03:09.4934094Z shr.s16 %rs369, %rs368, 4; 2026-02-21T09:03:09.4934259Z selp.b16 %rs370, %rs188, %rs124, 
%p12; 2026-02-21T09:03:09.4934432Z cvt.s16.s8 %rs371, %rs370; 2026-02-21T09:03:09.4934593Z shr.s16 %rs372, %rs371, 4; 2026-02-21T09:03:09.4934759Z selp.b16 %rs373, %rs189, %rs125, %p12; 2026-02-21T09:03:09.4934941Z cvt.s16.s8 %rs374, %rs373; 2026-02-21T09:03:09.4935098Z shr.s16 %rs375, %rs374, 4; 2026-02-21T09:03:09.4935254Z selp.b16 %rs376, %rs190, %rs126, %p12; 2026-02-21T09:03:09.4935433Z cvt.s16.s8 %rs377, %rs376; 2026-02-21T09:03:09.4935585Z shr.s16 %rs378, %rs377, 4; 2026-02-21T09:03:09.4935786Z selp.b16 %rs379, %rs191, %rs127, %p12; 2026-02-21T09:03:09.4935956Z cvt.s16.s8 %rs380, %rs379; 2026-02-21T09:03:09.4936113Z shr.s16 %rs381, %rs380, 4; 2026-02-21T09:03:09.4936270Z selp.b16 %rs382, %rs192, %rs128, %p12; 2026-02-21T09:03:09.4936448Z cvt.s16.s8 %rs383, %rs382; 2026-02-21T09:03:09.4936599Z shr.s16 %rs384, %rs383, 4; 2026-02-21T09:03:09.4936871Z .loc 1 77 32 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:77:32 2026-02-21T09:03:09.4937172Z cvt.rn.f32.s16 %r326, %rs195; 2026-02-21T09:03:09.4937334Z cvt.rn.f32.s16 %r327, %rs198; 2026-02-21T09:03:09.4937541Z cvt.rn.f32.s16 %r328, %rs201; 2026-02-21T09:03:09.4937699Z cvt.rn.f32.s16 %r329, %rs204; 2026-02-21T09:03:09.4937862Z cvt.rn.f32.s16 %r330, %rs207; 2026-02-21T09:03:09.4938015Z cvt.rn.f32.s16 %r331, %rs210; 2026-02-21T09:03:09.4938180Z cvt.rn.f32.s16 %r332, %rs213; 2026-02-21T09:03:09.4938334Z cvt.rn.f32.s16 %r333, %rs216; 2026-02-21T09:03:09.4938496Z cvt.rn.f32.s16 %r334, %rs219; 2026-02-21T09:03:09.4938657Z cvt.rn.f32.s16 %r335, %rs222; 2026-02-21T09:03:09.4938809Z cvt.rn.f32.s16 %r336, %rs225; 2026-02-21T09:03:09.4939002Z cvt.rn.f32.s16 %r337, %rs228; 2026-02-21T09:03:09.4939159Z cvt.rn.f32.s16 %r338, %rs231; 2026-02-21T09:03:09.4939322Z cvt.rn.f32.s16 %r339, %rs234; 2026-02-21T09:03:09.4939476Z cvt.rn.f32.s16 %r340, %rs237; 2026-02-21T09:03:09.4939637Z cvt.rn.f32.s16 %r341, %rs240; 2026-02-21T09:03:09.4939791Z cvt.rn.f32.s16 %r342, %rs243; 2026-02-21T09:03:09.4939950Z cvt.rn.f32.s16 %r343, 
%rs246; 2026-02-21T09:03:09.4940110Z cvt.rn.f32.s16 %r344, %rs249; 2026-02-21T09:03:09.4940263Z cvt.rn.f32.s16 %r345, %rs252; 2026-02-21T09:03:09.4940422Z cvt.rn.f32.s16 %r346, %rs255; 2026-02-21T09:03:09.4940573Z cvt.rn.f32.s16 %r347, %rs258; 2026-02-21T09:03:09.4940730Z cvt.rn.f32.s16 %r348, %rs261; 2026-02-21T09:03:09.4940881Z cvt.rn.f32.s16 %r349, %rs264; 2026-02-21T09:03:09.4941039Z cvt.rn.f32.s16 %r350, %rs267; 2026-02-21T09:03:09.4941218Z cvt.rn.f32.s16 %r351, %rs270; 2026-02-21T09:03:09.4941382Z cvt.rn.f32.s16 %r352, %rs273; 2026-02-21T09:03:09.4941577Z cvt.rn.f32.s16 %r353, %rs276; 2026-02-21T09:03:09.4941743Z cvt.rn.f32.s16 %r354, %rs279; 2026-02-21T09:03:09.4941904Z cvt.rn.f32.s16 %r355, %rs282; 2026-02-21T09:03:09.4942059Z cvt.rn.f32.s16 %r356, %rs285; 2026-02-21T09:03:09.4942219Z cvt.rn.f32.s16 %r357, %rs288; 2026-02-21T09:03:09.4942372Z cvt.rn.f32.s16 %r358, %rs291; 2026-02-21T09:03:09.4942534Z cvt.rn.f32.s16 %r359, %rs294; 2026-02-21T09:03:09.4942689Z cvt.rn.f32.s16 %r360, %rs297; 2026-02-21T09:03:09.4942853Z cvt.rn.f32.s16 %r361, %rs300; 2026-02-21T09:03:09.4943011Z cvt.rn.f32.s16 %r362, %rs303; 2026-02-21T09:03:09.4943176Z cvt.rn.f32.s16 %r363, %rs306; 2026-02-21T09:03:09.4943337Z cvt.rn.f32.s16 %r364, %rs309; 2026-02-21T09:03:09.4943490Z cvt.rn.f32.s16 %r365, %rs312; 2026-02-21T09:03:09.4943651Z cvt.rn.f32.s16 %r366, %rs315; 2026-02-21T09:03:09.4943804Z cvt.rn.f32.s16 %r367, %rs318; 2026-02-21T09:03:09.4943966Z cvt.rn.f32.s16 %r368, %rs321; 2026-02-21T09:03:09.4944118Z cvt.rn.f32.s16 %r369, %rs324; 2026-02-21T09:03:09.4944277Z cvt.rn.f32.s16 %r370, %rs327; 2026-02-21T09:03:09.4944430Z cvt.rn.f32.s16 %r371, %rs330; 2026-02-21T09:03:09.4944591Z cvt.rn.f32.s16 %r372, %rs333; 2026-02-21T09:03:09.4944744Z cvt.rn.f32.s16 %r373, %rs336; 2026-02-21T09:03:09.4944907Z cvt.rn.f32.s16 %r374, %rs339; 2026-02-21T09:03:09.4945069Z cvt.rn.f32.s16 %r375, %rs342; 2026-02-21T09:03:09.4945221Z cvt.rn.f32.s16 %r376, %rs345; 2026-02-21T09:03:09.4945384Z cvt.rn.f32.s16 
%r377, %rs348; 2026-02-21T09:03:09.4945536Z cvt.rn.f32.s16 %r378, %rs351; 2026-02-21T09:03:09.4945699Z cvt.rn.f32.s16 %r379, %rs354; 2026-02-21T09:03:09.4945850Z cvt.rn.f32.s16 %r380, %rs357; 2026-02-21T09:03:09.4946012Z cvt.rn.f32.s16 %r381, %rs360; 2026-02-21T09:03:09.4946166Z cvt.rn.f32.s16 %r382, %rs363; 2026-02-21T09:03:09.4946326Z cvt.rn.f32.s16 %r383, %rs366; 2026-02-21T09:03:09.4946479Z cvt.rn.f32.s16 %r384, %rs369; 2026-02-21T09:03:09.4946669Z cvt.rn.f32.s16 %r385, %rs372; 2026-02-21T09:03:09.4946829Z cvt.rn.f32.s16 %r386, %rs375; 2026-02-21T09:03:09.4946980Z cvt.rn.f32.s16 %r387, %rs378; 2026-02-21T09:03:09.4947140Z cvt.rn.f32.s16 %r388, %rs381; 2026-02-21T09:03:09.4947291Z cvt.rn.f32.s16 %r389, %rs384; 2026-02-21T09:03:09.4947454Z st.shared.b32 [%r25], %r326; 2026-02-21T09:03:09.4947618Z st.shared.b32 [%r25+8], %r327; 2026-02-21T09:03:09.4947790Z st.shared.b32 [%r25+8192], %r342; 2026-02-21T09:03:09.4947962Z st.shared.b32 [%r25+8200], %r343; 2026-02-21T09:03:09.4948139Z st.shared.b32 [%r25+16384], %r358; 2026-02-21T09:03:09.4948346Z st.shared.b32 [%r25+16392], %r359; 2026-02-21T09:03:09.4948513Z st.shared.b32 [%r25+24576], %r374; 2026-02-21T09:03:09.4948689Z st.shared.b32 [%r25+24584], %r375; 2026-02-21T09:03:09.4948857Z st.shared.b32 [%r26], %r328; 2026-02-21T09:03:09.4949028Z st.shared.b32 [%r26+8], %r329; 2026-02-21T09:03:09.4949194Z st.shared.b32 [%r26+8192], %r344; 2026-02-21T09:03:09.4949370Z st.shared.b32 [%r26+8200], %r345; 2026-02-21T09:03:09.4949535Z st.shared.b32 [%r26+16384], %r360; 2026-02-21T09:03:09.4949707Z st.shared.b32 [%r26+16392], %r361; 2026-02-21T09:03:09.4949930Z st.shared.b32 [%r26+24576], %r376; 2026-02-21T09:03:09.4950099Z st.shared.b32 [%r26+24584], %r377; 2026-02-21T09:03:09.4950275Z st.shared.b32 [%r27], %r330; 2026-02-21T09:03:09.4950436Z st.shared.b32 [%r27+8], %r331; 2026-02-21T09:03:09.4950608Z st.shared.b32 [%r27+8192], %r346; 2026-02-21T09:03:09.4950773Z st.shared.b32 [%r27+8200], %r347; 2026-02-21T09:03:09.4950947Z 
st.shared.b32 [%r27+16384], %r362; 2026-02-21T09:03:09.4951115Z st.shared.b32 [%r27+16392], %r363; 2026-02-21T09:03:09.4951293Z st.shared.b32 [%r27+24576], %r378; 2026-02-21T09:03:09.4951461Z st.shared.b32 [%r27+24584], %r379; 2026-02-21T09:03:09.4951674Z st.shared.b32 [%r28], %r332; 2026-02-21T09:03:09.4951843Z st.shared.b32 [%r28+8], %r333; 2026-02-21T09:03:09.4952036Z st.shared.b32 [%r28+8192], %r348; 2026-02-21T09:03:09.4952215Z st.shared.b32 [%r28+8200], %r349; 2026-02-21T09:03:09.4952380Z st.shared.b32 [%r28+16384], %r364; 2026-02-21T09:03:09.4952552Z st.shared.b32 [%r28+16392], %r365; 2026-02-21T09:03:09.4952717Z st.shared.b32 [%r28+24576], %r380; 2026-02-21T09:03:09.4952889Z st.shared.b32 [%r28+24584], %r381; 2026-02-21T09:03:09.4953056Z st.shared.b32 [%r29], %r334; 2026-02-21T09:03:09.4953223Z st.shared.b32 [%r29+8], %r335; 2026-02-21T09:03:09.4953390Z st.shared.b32 [%r29+8192], %r350; 2026-02-21T09:03:09.4953558Z st.shared.b32 [%r29+8200], %r351; 2026-02-21T09:03:09.4953730Z st.shared.b32 [%r29+16384], %r366; 2026-02-21T09:03:09.4953895Z st.shared.b32 [%r29+16392], %r367; 2026-02-21T09:03:09.4954067Z st.shared.b32 [%r29+24576], %r382; 2026-02-21T09:03:09.4954231Z st.shared.b32 [%r29+24584], %r383; 2026-02-21T09:03:09.4954403Z st.shared.b32 [%r30], %r336; 2026-02-21T09:03:09.4954561Z st.shared.b32 [%r30+8], %r337; 2026-02-21T09:03:09.4954728Z st.shared.b32 [%r30+8192], %r352; 2026-02-21T09:03:09.4954900Z st.shared.b32 [%r30+8200], %r353; 2026-02-21T09:03:09.4955064Z st.shared.b32 [%r30+16384], %r368; 2026-02-21T09:03:09.4955238Z st.shared.b32 [%r30+16392], %r369; 2026-02-21T09:03:09.4955402Z st.shared.b32 [%r30+24576], %r384; 2026-02-21T09:03:09.4955573Z st.shared.b32 [%r30+24584], %r385; 2026-02-21T09:03:09.4955739Z st.shared.b32 [%r31], %r338; 2026-02-21T09:03:09.4955903Z st.shared.b32 [%r31+8], %r339; 2026-02-21T09:03:09.4956061Z st.shared.b32 [%r31+8192], %r354; 2026-02-21T09:03:09.4956230Z st.shared.b32 [%r31+8200], %r355; 2026-02-21T09:03:09.4956393Z 
st.shared.b32 [%r31+16384], %r370; 2026-02-21T09:03:09.4956568Z st.shared.b32 [%r31+16392], %r371; 2026-02-21T09:03:09.4956743Z st.shared.b32 [%r31+24576], %r386; 2026-02-21T09:03:09.4956914Z st.shared.b32 [%r31+24584], %r387; 2026-02-21T09:03:09.4957092Z st.shared.b32 [%r32], %r340; 2026-02-21T09:03:09.4957258Z st.shared.b32 [%r32+8], %r341; 2026-02-21T09:03:09.4957433Z st.shared.b32 [%r32+8192], %r356; 2026-02-21T09:03:09.4957638Z st.shared.b32 [%r32+8200], %r357; 2026-02-21T09:03:09.4957816Z st.shared.b32 [%r32+16384], %r372; 2026-02-21T09:03:09.4957989Z st.shared.b32 [%r32+16392], %r373; 2026-02-21T09:03:09.4958167Z st.shared.b32 [%r32+24576], %r388; 2026-02-21T09:03:09.4958347Z st.shared.b32 [%r32+24584], %r389; 2026-02-21T09:03:09.4958515Z $L__tmp4: 2026-02-21T09:03:09.4958829Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.4959182Z // begin inline asm 2026-02-21T09:03:09.4959361Z fence.proxy.async.shared::cta; 2026-02-21T09:03:09.4959566Z // end inline asm 2026-02-21T09:03:09.4959717Z bar.sync 0; 2026-02-21T09:03:09.4959864Z setp.ne.b32 %p13, %r38, 0; 2026-02-21T09:03:09.4960042Z @%p13 bra $L__BB0_3; 2026-02-21T09:03:09.4960202Z // %bb.2: 2026-02-21T09:03:09.4960351Z elect.sync %r438|%p15, -1; 2026-02-21T09:03:09.4960529Z mov.b32 %r392, 68159760; 2026-02-21T09:03:09.4960691Z mov.pred %p14, 0; 2026-02-21T09:03:09.4960847Z // begin inline asm 2026-02-21T09:03:09.4961146Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 0 ], %rd81, %r392, %p14; 2026-02-21T09:03:09.4961434Z // end inline asm 2026-02-21T09:03:09.4961615Z // begin inline asm 2026-02-21T09:03:09.4961861Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 8 ], %rd82, %r392, %p16; 2026-02-21T09:03:09.4962140Z // end inline asm 2026-02-21T09:03:09.4962283Z // begin inline asm 2026-02-21T09:03:09.4962532Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 16 ], %rd83, %r392, 
%p16; 2026-02-21T09:03:09.4962801Z // end inline asm 2026-02-21T09:03:09.4962950Z // begin inline asm 2026-02-21T09:03:09.4963186Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 24 ], %rd84, %r392, %p16; 2026-02-21T09:03:09.4963462Z // end inline asm 2026-02-21T09:03:09.4963604Z // begin inline asm 2026-02-21T09:03:09.4963879Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 32 ], %rd85, %r392, %p16; 2026-02-21T09:03:09.4964158Z // end inline asm 2026-02-21T09:03:09.4964298Z // begin inline asm 2026-02-21T09:03:09.4964543Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 40 ], %rd86, %r392, %p16; 2026-02-21T09:03:09.4964809Z // end inline asm 2026-02-21T09:03:09.4964961Z // begin inline asm 2026-02-21T09:03:09.4965182Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 48 ], %rd87, %r392, %p16; 2026-02-21T09:03:09.4965442Z // end inline asm 2026-02-21T09:03:09.4965582Z // begin inline asm 2026-02-21T09:03:09.4965802Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 56 ], %rd88, %r392, %p16; 2026-02-21T09:03:09.4966068Z // end inline asm 2026-02-21T09:03:09.4966203Z // begin inline asm 2026-02-21T09:03:09.4966431Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 64 ], %rd89, %r392, %p16; 2026-02-21T09:03:09.4966690Z // end inline asm 2026-02-21T09:03:09.4966830Z // begin inline asm 2026-02-21T09:03:09.4967058Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 72 ], %rd90, %r392, %p16; 2026-02-21T09:03:09.4967311Z // end inline asm 2026-02-21T09:03:09.4967451Z // begin inline asm 2026-02-21T09:03:09.4967670Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 80 ], %rd91, %r392, %p16; 2026-02-21T09:03:09.4967934Z // end inline asm 2026-02-21T09:03:09.4968067Z // begin inline asm 2026-02-21T09:03:09.4968297Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 88 ], %rd92, %r392, %p16; 2026-02-21T09:03:09.4968557Z // 
end inline asm 2026-02-21T09:03:09.4968691Z // begin inline asm 2026-02-21T09:03:09.4968920Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 96 ], %rd93, %r392, %p16; 2026-02-21T09:03:09.4969172Z // end inline asm 2026-02-21T09:03:09.4969315Z // begin inline asm 2026-02-21T09:03:09.4969544Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 104 ], %rd94, %r392, %p16; 2026-02-21T09:03:09.4969837Z // end inline asm 2026-02-21T09:03:09.4969971Z // begin inline asm 2026-02-21T09:03:09.4970207Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 112 ], %rd95, %r392, %p16; 2026-02-21T09:03:09.4970472Z // end inline asm 2026-02-21T09:03:09.4970610Z // begin inline asm 2026-02-21T09:03:09.4970850Z @%p15 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 120 ], %rd96, %r392, %p16; 2026-02-21T09:03:09.4971105Z // end inline asm 2026-02-21T09:03:09.4971252Z add.s32 %r440, %r52, 73728; 2026-02-21T09:03:09.4971414Z cvt.u64.u32 %rd97, %r440; 2026-02-21T09:03:09.4971598Z // begin inline asm 2026-02-21T09:03:09.4971848Z @%p15 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd97]; 2026-02-21T09:03:09.4972075Z // end inline asm 2026-02-21T09:03:09.4972216Z $L__tmp5: 2026-02-21T09:03:09.4972339Z $L__BB0_3: 2026-02-21T09:03:09.4972506Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:03:09.4972745Z ld.param.b64 %rd37, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:03:09.4972964Z mul.lo.s32 %r33, %r204, 7168; 2026-02-21T09:03:09.4973124Z add.s32 %r34, %r278, %r276; 2026-02-21T09:03:09.4973237Z add.s32 %r761, %r288, %r286; 2026-02-21T09:03:09.4973300Z or.b32 %r37, %r36, %r200; 2026-02-21T09:03:09.4973473Z .loc 1 48 80 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:48:80 2026-02-21T09:03:09.4973539Z // begin inline asm 2026-02-21T09:03:09.4973660Z cp.async.cg.shared.global [ %r441 + 0 ], [ %rd98 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4973717Z // end inline asm 2026-02-21T09:03:09.4973782Z // begin inline asm 
2026-02-21T09:03:09.4973899Z cp.async.cg.shared.global [ %r443 + 0 ], [ %rd99 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4973954Z // end inline asm 2026-02-21T09:03:09.4974012Z // begin inline asm 2026-02-21T09:03:09.4974136Z cp.async.cg.shared.global [ %r445 + 0 ], [ %rd100 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4974192Z // end inline asm 2026-02-21T09:03:09.4974276Z // begin inline asm 2026-02-21T09:03:09.4974399Z cp.async.cg.shared.global [ %r447 + 0 ], [ %rd101 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4974454Z // end inline asm 2026-02-21T09:03:09.4974511Z // begin inline asm 2026-02-21T09:03:09.4974620Z cp.async.cg.shared.global [ %r449 + 0 ], [ %rd102 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4974682Z // end inline asm 2026-02-21T09:03:09.4974738Z // begin inline asm 2026-02-21T09:03:09.4974847Z cp.async.cg.shared.global [ %r451 + 0 ], [ %rd103 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4974908Z // end inline asm 2026-02-21T09:03:09.4974964Z // begin inline asm 2026-02-21T09:03:09.4975071Z cp.async.cg.shared.global [ %r453 + 0 ], [ %rd104 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4975134Z // end inline asm 2026-02-21T09:03:09.4975189Z // begin inline asm 2026-02-21T09:03:09.4975296Z cp.async.cg.shared.global [ %r455 + 0 ], [ %rd105 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4975351Z // end inline asm 2026-02-21T09:03:09.4975423Z cp.async.commit_group; 2026-02-21T09:03:09.4975594Z .loc 1 54 34 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:34 2026-02-21T09:03:09.4975658Z add.s64 %rd106, %rd26, 917504; 2026-02-21T09:03:09.4975730Z add.s64 %rd107, %rd26, 1146880; 2026-02-21T09:03:09.4975893Z .loc 1 54 87 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:87 2026-02-21T09:03:09.4975948Z // begin inline asm 2026-02-21T09:03:09.4976055Z cp.async.cg.shared.global [ %r457 + 0 ], [ %rd106 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4976116Z // end inline asm 2026-02-21T09:03:09.4976171Z // begin inline asm 2026-02-21T09:03:09.4976280Z 
cp.async.cg.shared.global [ %r459 + 0 ], [ %rd107 + 0 ], 0x10, %r442; 2026-02-21T09:03:09.4976342Z // end inline asm 2026-02-21T09:03:09.4976403Z cp.async.commit_group; 2026-02-21T09:03:09.4976565Z .loc 1 40 93 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:40:93 2026-02-21T09:03:09.4976634Z bfe.u32 %r464, %r1, 4, 3; 2026-02-21T09:03:09.4976731Z mul.wide.u32 %rd109, %r464, 16384; 2026-02-21T09:03:09.4976794Z mul.wide.u32 %rd110, %r197, 16; 2026-02-21T09:03:09.4976857Z or.b64 %rd111, %rd109, %rd110; 2026-02-21T09:03:09.4976925Z add.s64 %rd112, %rd111, %rd35; 2026-02-21T09:03:09.4976987Z add.s64 %rd217, %rd112, 918272; 2026-02-21T09:03:09.4977048Z add.s32 %r466, %r36, %r5; 2026-02-21T09:03:09.4977117Z cvt.u64.u32 %rd113, %r466; 2026-02-21T09:03:09.4977187Z mad.lo.s64 %rd114, %rd25, 7168, %rd113; 2026-02-21T09:03:09.4977248Z add.s64 %rd115, %rd114, %rd36; 2026-02-21T09:03:09.4977310Z add.s64 %rd216, %rd115, 1605632; 2026-02-21T09:03:09.4977395Z mov.b32 %r847, 1; 2026-02-21T09:03:09.4977456Z mov.b64 %rd218, -64; 2026-02-21T09:03:09.4977513Z mov.b32 %r846, %r844; 2026-02-21T09:03:09.4977577Z mov.b32 %r848, %r844; 2026-02-21T09:03:09.4977634Z bra.uni $L__BB0_4; 2026-02-21T09:03:09.4977737Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:03:09.4977906Z .loc 1 40 93 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:40:93 2026-02-21T09:03:09.4977969Z add.s64 %rd218, %rd218, 64; 2026-02-21T09:03:09.4978059Z setp.lt.u64 %p87, %rd218, 3904; 2026-02-21T09:03:09.4978114Z $L__tmp6: 2026-02-21T09:03:09.4978337Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.4978398Z add.s32 %r717, %r847, 1; 2026-02-21T09:03:09.4978460Z setp.gt.s32 %p88, %r717, 1; 2026-02-21T09:03:09.4978530Z selp.b32 %r847, 0, %r717, %p88; 2026-02-21T09:03:09.4978590Z selp.b32 %r718, 1, 0, %p88; 2026-02-21T09:03:09.4978650Z xor.b32 %r51, %r848, %r718; 2026-02-21T09:03:09.4978705Z $L__tmp7: 
2026-02-21T09:03:09.4978873Z .loc 1 48 32 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:48:32 2026-02-21T09:03:09.4978940Z add.s64 %rd133, %rd217, -917504; 2026-02-21T09:03:09.4979003Z add.s64 %rd134, %rd217, -786432; 2026-02-21T09:03:09.4979097Z add.s64 %rd135, %rd217, -655360; 2026-02-21T09:03:09.4979162Z add.s64 %rd136, %rd217, -524288; 2026-02-21T09:03:09.4979223Z add.s64 %rd137, %rd217, -393216; 2026-02-21T09:03:09.4979294Z add.s64 %rd138, %rd217, -262144; 2026-02-21T09:03:09.4979352Z add.s64 %rd139, %rd217, -131072; 2026-02-21T09:03:09.4979509Z .loc 1 48 80 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:48:80 2026-02-21T09:03:09.4979571Z add.s32 %r697, %r47, %r6; 2026-02-21T09:03:09.4979640Z selp.b32 %r698, 16, 0, %p87; 2026-02-21T09:03:09.4979696Z // begin inline asm 2026-02-21T09:03:09.4979807Z cp.async.cg.shared.global [ %r697 + 0 ], [ %rd133 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4979871Z // end inline asm 2026-02-21T09:03:09.4979928Z add.s32 %r699, %r697, 2048; 2026-02-21T09:03:09.4979984Z // begin inline asm 2026-02-21T09:03:09.4980103Z cp.async.cg.shared.global [ %r699 + 0 ], [ %rd134 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4980158Z // end inline asm 2026-02-21T09:03:09.4980217Z add.s32 %r701, %r697, 4096; 2026-02-21T09:03:09.4980275Z // begin inline asm 2026-02-21T09:03:09.4980392Z cp.async.cg.shared.global [ %r701 + 0 ], [ %rd135 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4980447Z // end inline asm 2026-02-21T09:03:09.4980506Z add.s32 %r703, %r697, 6144; 2026-02-21T09:03:09.4980569Z // begin inline asm 2026-02-21T09:03:09.4980678Z cp.async.cg.shared.global [ %r703 + 0 ], [ %rd136 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4980731Z // end inline asm 2026-02-21T09:03:09.4980790Z add.s32 %r705, %r697, 8192; 2026-02-21T09:03:09.4980855Z // begin inline asm 2026-02-21T09:03:09.4980964Z cp.async.cg.shared.global [ %r705 + 0 ], [ %rd137 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4981021Z // end inline asm 2026-02-21T09:03:09.4981088Z 
add.s32 %r707, %r697, 10240; 2026-02-21T09:03:09.4981144Z // begin inline asm 2026-02-21T09:03:09.4981254Z cp.async.cg.shared.global [ %r707 + 0 ], [ %rd138 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4981316Z // end inline asm 2026-02-21T09:03:09.4981376Z add.s32 %r709, %r697, 12288; 2026-02-21T09:03:09.4981459Z // begin inline asm 2026-02-21T09:03:09.4981610Z cp.async.cg.shared.global [ %r709 + 0 ], [ %rd139 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4981673Z // end inline asm 2026-02-21T09:03:09.4981734Z add.s32 %r711, %r47, %r14; 2026-02-21T09:03:09.4981788Z // begin inline asm 2026-02-21T09:03:09.4981905Z cp.async.cg.shared.global [ %r711 + 0 ], [ %rd217 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4981959Z // end inline asm 2026-02-21T09:03:09.4982022Z cp.async.commit_group; 2026-02-21T09:03:09.4982179Z .loc 1 54 34 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:34 2026-02-21T09:03:09.4982297Z add.s64 %rd141, %rd216, -229376; 2026-02-21T09:03:09.4982458Z .loc 1 54 87 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:87 2026-02-21T09:03:09.4982518Z add.s32 %r713, %r48, %r6; 2026-02-21T09:03:09.4982582Z // begin inline asm 2026-02-21T09:03:09.4982692Z cp.async.cg.shared.global [ %r713 + 0 ], [ %rd141 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4982750Z // end inline asm 2026-02-21T09:03:09.4982816Z add.s32 %r715, %r713, 2048; 2026-02-21T09:03:09.4982903Z // begin inline asm 2026-02-21T09:03:09.4983014Z cp.async.cg.shared.global [ %r715 + 0 ], [ %rd216 + 0 ], 0x10, %r698; 2026-02-21T09:03:09.4983069Z // end inline asm 2026-02-21T09:03:09.4983139Z cp.async.commit_group; 2026-02-21T09:03:09.4983304Z .loc 1 40 93 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:40:93 2026-02-21T09:03:09.4983365Z add.s64 %rd217, %rd217, 256; 2026-02-21T09:03:09.4983434Z add.s64 %rd216, %rd216, 458752; 2026-02-21T09:03:09.4983499Z setp.lt.u64 %p89, %rd218, 3968; 2026-02-21T09:03:09.4983558Z mov.b32 %r844, %r848; 2026-02-21T09:03:09.4983614Z mov.b32 %r848, %r51; 
2026-02-21T09:03:09.4983678Z @%p89 bra $L__BB0_4; 2026-02-21T09:03:09.4983734Z bra.uni $L__BB0_7; 2026-02-21T09:03:09.4983869Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:03:09.4983938Z add.s32 %r538, %r846, 1; 2026-02-21T09:03:09.4984001Z setp.gt.s32 %p53, %r538, 1; 2026-02-21T09:03:09.4984064Z selp.b32 %r846, 0, %r538, %p53; 2026-02-21T09:03:09.4984233Z .loc 1 48 80 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:48:80 2026-02-21T09:03:09.4984296Z cp.async.wait_group 2; 2026-02-21T09:03:09.4984351Z bar.sync 0; 2026-02-21T09:03:09.4984410Z shl.b32 %r539, %r846, 14; 2026-02-21T09:03:09.4984476Z add.s32 %r541, %r52, %r539; 2026-02-21T09:03:09.4984534Z add.s32 %r47, %r541, 32768; 2026-02-21T09:03:09.4984593Z add.s32 %r542, %r47, %r19; 2026-02-21T09:03:09.4984698Z ld.shared.v4.b32 {%r543, %r544, %r545, %r546}, [%r542]; 2026-02-21T09:03:09.4984762Z mov.b32 {%rs385, %rs386}, %r546; 2026-02-21T09:03:09.4984825Z mov.b32 {%rs387, %rs388}, %r545; 2026-02-21T09:03:09.4984884Z mov.b32 {%rs389, %rs390}, %r544; 2026-02-21T09:03:09.4984951Z mov.b32 {%rs391, %rs392}, %r543; 2026-02-21T09:03:09.4985053Z ld.shared.v4.b32 {%r547, %r548, %r549, %r550}, [%r542+16]; 2026-02-21T09:03:09.4985111Z mov.b32 {%rs393, %rs394}, %r550; 2026-02-21T09:03:09.4985180Z mov.b32 {%rs395, %rs396}, %r549; 2026-02-21T09:03:09.4985238Z mov.b32 {%rs397, %rs398}, %r548; 2026-02-21T09:03:09.4985298Z mov.b32 {%rs399, %rs400}, %r547; 2026-02-21T09:03:09.4985398Z ld.shared.v4.b32 {%r551, %r552, %r553, %r554}, [%r542+32]; 2026-02-21T09:03:09.4985457Z mov.b32 {%rs401, %rs402}, %r554; 2026-02-21T09:03:09.4985515Z mov.b32 {%rs403, %rs404}, %r553; 2026-02-21T09:03:09.4985573Z mov.b32 {%rs405, %rs406}, %r552; 2026-02-21T09:03:09.4985639Z mov.b32 {%rs407, %rs408}, %r551; 2026-02-21T09:03:09.4985732Z ld.shared.v4.b32 {%r555, %r556, %r557, %r558}, [%r542+48]; 2026-02-21T09:03:09.4985790Z mov.b32 {%rs409, %rs410}, %r558; 2026-02-21T09:03:09.4985856Z mov.b32 {%rs411, %rs412}, %r557; 
2026-02-21T09:03:09.4985914Z mov.b32 {%rs413, %rs414}, %r556; 2026-02-21T09:03:09.4985974Z mov.b32 {%rs415, %rs416}, %r555; 2026-02-21T09:03:09.4986092Z ld.shared.v4.b32 {%r559, %r560, %r561, %r562}, [%r542+64]; 2026-02-21T09:03:09.4986159Z mov.b32 {%rs417, %rs418}, %r562; 2026-02-21T09:03:09.4986219Z mov.b32 {%rs419, %rs420}, %r561; 2026-02-21T09:03:09.4986278Z mov.b32 {%rs421, %rs422}, %r560; 2026-02-21T09:03:09.4986344Z mov.b32 {%rs423, %rs424}, %r559; 2026-02-21T09:03:09.4986433Z ld.shared.v4.b32 {%r563, %r564, %r565, %r566}, [%r542+80]; 2026-02-21T09:03:09.4986494Z mov.b32 {%rs425, %rs426}, %r566; 2026-02-21T09:03:09.4986563Z mov.b32 {%rs427, %rs428}, %r565; 2026-02-21T09:03:09.4986622Z mov.b32 {%rs429, %rs430}, %r564; 2026-02-21T09:03:09.4986705Z mov.b32 {%rs431, %rs432}, %r563; 2026-02-21T09:03:09.4986794Z ld.shared.v4.b32 {%r567, %r568, %r569, %r570}, [%r542+96]; 2026-02-21T09:03:09.4986864Z mov.b32 {%rs433, %rs434}, %r570; 2026-02-21T09:03:09.4986922Z mov.b32 {%rs435, %rs436}, %r569; 2026-02-21T09:03:09.4986982Z mov.b32 {%rs437, %rs438}, %r568; 2026-02-21T09:03:09.4987047Z mov.b32 {%rs439, %rs440}, %r567; 2026-02-21T09:03:09.4987143Z ld.shared.v4.b32 {%r571, %r572, %r573, %r574}, [%r542+112]; 2026-02-21T09:03:09.4987201Z mov.b32 {%rs441, %rs442}, %r574; 2026-02-21T09:03:09.4987279Z mov.b32 {%rs443, %rs444}, %r573; 2026-02-21T09:03:09.4987347Z mov.b32 {%rs445, %rs446}, %r572; 2026-02-21T09:03:09.4987405Z mov.b32 {%rs447, %rs448}, %r571; 2026-02-21T09:03:09.4987568Z .loc 1 52 32 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:52:32 2026-02-21T09:03:09.4987636Z cvt.f32.bf16 %r471, %rs391; 2026-02-21T09:03:09.4987697Z cvt.f32.bf16 %r472, %rs392; 2026-02-21T09:03:09.4987755Z cvt.f32.bf16 %r473, %rs389; 2026-02-21T09:03:09.4987821Z cvt.f32.bf16 %r474, %rs390; 2026-02-21T09:03:09.4987880Z cvt.f32.bf16 %r475, %rs387; 2026-02-21T09:03:09.4987938Z cvt.f32.bf16 %r476, %rs388; 2026-02-21T09:03:09.4987994Z cvt.f32.bf16 %r477, %rs385; 
2026-02-21T09:03:09.4988060Z cvt.f32.bf16 %r478, %rs386; 2026-02-21T09:03:09.4988141Z cvt.f32.bf16 %r479, %rs399; 2026-02-21T09:03:09.4988201Z cvt.f32.bf16 %r480, %rs400; 2026-02-21T09:03:09.4988267Z cvt.f32.bf16 %r481, %rs397; 2026-02-21T09:03:09.4988325Z cvt.f32.bf16 %r482, %rs398; 2026-02-21T09:03:09.4988384Z cvt.f32.bf16 %r483, %rs395; 2026-02-21T09:03:09.4988442Z cvt.f32.bf16 %r484, %rs396; 2026-02-21T09:03:09.4988508Z cvt.f32.bf16 %r485, %rs393; 2026-02-21T09:03:09.4988565Z cvt.f32.bf16 %r486, %rs394; 2026-02-21T09:03:09.4988623Z cvt.f32.bf16 %r488, %rs407; 2026-02-21T09:03:09.4988687Z cvt.f32.bf16 %r489, %rs408; 2026-02-21T09:03:09.4988744Z cvt.f32.bf16 %r490, %rs405; 2026-02-21T09:03:09.4988801Z cvt.f32.bf16 %r491, %rs406; 2026-02-21T09:03:09.4988870Z cvt.f32.bf16 %r492, %rs403; 2026-02-21T09:03:09.4988927Z cvt.f32.bf16 %r493, %rs404; 2026-02-21T09:03:09.4988985Z cvt.f32.bf16 %r494, %rs401; 2026-02-21T09:03:09.4989043Z cvt.f32.bf16 %r495, %rs402; 2026-02-21T09:03:09.4989107Z cvt.f32.bf16 %r496, %rs415; 2026-02-21T09:03:09.4989165Z cvt.f32.bf16 %r497, %rs416; 2026-02-21T09:03:09.4989224Z cvt.f32.bf16 %r498, %rs413; 2026-02-21T09:03:09.4989289Z cvt.f32.bf16 %r499, %rs414; 2026-02-21T09:03:09.4989346Z cvt.f32.bf16 %r500, %rs411; 2026-02-21T09:03:09.4989404Z cvt.f32.bf16 %r501, %rs412; 2026-02-21T09:03:09.4989461Z cvt.f32.bf16 %r502, %rs409; 2026-02-21T09:03:09.4989525Z cvt.f32.bf16 %r503, %rs410; 2026-02-21T09:03:09.4989583Z cvt.f32.bf16 %r505, %rs423; 2026-02-21T09:03:09.4989639Z cvt.f32.bf16 %r506, %rs424; 2026-02-21T09:03:09.4989704Z cvt.f32.bf16 %r507, %rs421; 2026-02-21T09:03:09.4989761Z cvt.f32.bf16 %r508, %rs422; 2026-02-21T09:03:09.4989817Z cvt.f32.bf16 %r509, %rs419; 2026-02-21T09:03:09.4989876Z cvt.f32.bf16 %r510, %rs420; 2026-02-21T09:03:09.4989940Z cvt.f32.bf16 %r511, %rs417; 2026-02-21T09:03:09.4989997Z cvt.f32.bf16 %r512, %rs418; 2026-02-21T09:03:09.4990052Z cvt.f32.bf16 %r513, %rs431; 2026-02-21T09:03:09.4990115Z cvt.f32.bf16 %r514, %rs432; 
2026-02-21T09:03:09.4990173Z cvt.f32.bf16 %r515, %rs429; 2026-02-21T09:03:09.4990230Z cvt.f32.bf16 %r516, %rs430; 2026-02-21T09:03:09.4990314Z cvt.f32.bf16 %r517, %rs427; 2026-02-21T09:03:09.4990378Z cvt.f32.bf16 %r518, %rs428; 2026-02-21T09:03:09.4990436Z cvt.f32.bf16 %r519, %rs425; 2026-02-21T09:03:09.4990493Z cvt.f32.bf16 %r520, %rs426; 2026-02-21T09:03:09.4990556Z cvt.f32.bf16 %r522, %rs439; 2026-02-21T09:03:09.4990612Z cvt.f32.bf16 %r523, %rs440; 2026-02-21T09:03:09.4990669Z cvt.f32.bf16 %r524, %rs437; 2026-02-21T09:03:09.4990734Z cvt.f32.bf16 %r525, %rs438; 2026-02-21T09:03:09.4990791Z cvt.f32.bf16 %r526, %rs435; 2026-02-21T09:03:09.4990848Z cvt.f32.bf16 %r527, %rs436; 2026-02-21T09:03:09.4990905Z cvt.f32.bf16 %r528, %rs433; 2026-02-21T09:03:09.4990994Z cvt.f32.bf16 %r529, %rs434; 2026-02-21T09:03:09.4991053Z cvt.f32.bf16 %r530, %rs447; 2026-02-21T09:03:09.4991111Z cvt.f32.bf16 %r531, %rs448; 2026-02-21T09:03:09.4991175Z cvt.f32.bf16 %r532, %rs445; 2026-02-21T09:03:09.4991233Z cvt.f32.bf16 %r533, %rs446; 2026-02-21T09:03:09.4991291Z cvt.f32.bf16 %r534, %rs443; 2026-02-21T09:03:09.4991351Z cvt.f32.bf16 %r535, %rs444; 2026-02-21T09:03:09.4991416Z cvt.f32.bf16 %r536, %rs441; 2026-02-21T09:03:09.4991473Z cvt.f32.bf16 %r537, %rs442; 2026-02-21T09:03:09.4991587Z $L__tmp8: 2026-02-21T09:03:09.4991814Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.4991874Z // begin inline asm 2026-02-21T09:03:09.4991930Z 2026-02-21T09:03:09.4991981Z { 2026-02-21T09:03:09.4992050Z .reg .pred complete; 2026-02-21T09:03:09.4992106Z waitLoop: 2026-02-21T09:03:09.4992227Z mbarrier.try_wait.parity.shared.b64 complete, [%r845], %r844; 2026-02-21T09:03:09.4992301Z @!complete bra.uni waitLoop; 2026-02-21T09:03:09.4992352Z } 2026-02-21T09:03:09.4992356Z 2026-02-21T09:03:09.4992412Z // end inline asm 2026-02-21T09:03:09.4992481Z mov.pred %p54, -1; 2026-02-21T09:03:09.4992538Z // begin inline asm 
2026-02-21T09:03:09.4992862Z @%p54 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r129 + 0], 64, {%r471, %r472, %r473, %r474, %r475, %r476, %r477, %r478, %r479, %r480, %r481, %r482, %r483, %r484, %r485, %r486}; 2026-02-21T09:03:09.4992922Z // end inline asm 2026-02-21T09:03:09.4992990Z // begin inline asm 2026-02-21T09:03:09.4993275Z @%p54 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r129 + 16], 64, {%r488, %r489, %r490, %r491, %r492, %r493, %r494, %r495, %r496, %r497, %r498, %r499, %r500, %r501, %r502, %r503}; 2026-02-21T09:03:09.4993332Z // end inline asm 2026-02-21T09:03:09.4993397Z // begin inline asm 2026-02-21T09:03:09.4993678Z @%p54 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r129 + 32], 64, {%r505, %r506, %r507, %r508, %r509, %r510, %r511, %r512, %r513, %r514, %r515, %r516, %r517, %r518, %r519, %r520}; 2026-02-21T09:03:09.4993737Z // end inline asm 2026-02-21T09:03:09.4993805Z // begin inline asm 2026-02-21T09:03:09.4994083Z @%p54 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r129 + 48], 64, {%r522, %r523, %r524, %r525, %r526, %r527, %r528, %r529, %r530, %r531, %r532, %r533, %r534, %r535, %r536, %r537}; 2026-02-21T09:03:09.4994142Z // end inline asm 2026-02-21T09:03:09.4994208Z // begin inline asm 2026-02-21T09:03:09.4994282Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:03:09.4994338Z // end inline asm 2026-02-21T09:03:09.4994394Z bar.sync 0; 2026-02-21T09:03:09.4994456Z $L__tmp9: 2026-02-21T09:03:09.4994625Z .loc 1 54 87 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:54:87 2026-02-21T09:03:09.4994686Z shl.b32 %r575, %r846, 12; 2026-02-21T09:03:09.4994753Z add.s32 %r576, %r52, %r575; 2026-02-21T09:03:09.4994811Z add.s32 %r48, %r576, 65536; 2026-02-21T09:03:09.4994870Z add.s32 %r577, %r48, %r20; 2026-02-21T09:03:09.4994935Z ld.shared.b8 %rs449, [%r577]; 2026-02-21T09:03:09.4995006Z ld.shared.b8 %rs450, [%r577+64]; 2026-02-21T09:03:09.4995070Z ld.shared.b8 %rs451, [%r577+128]; 2026-02-21T09:03:09.4995134Z ld.shared.b8 %rs452, [%r577+192]; 
2026-02-21T09:03:09.4995202Z ld.shared.b8 %rs453, [%r577+256]; 2026-02-21T09:03:09.4995262Z ld.shared.b8 %rs454, [%r577+320]; 2026-02-21T09:03:09.4995354Z ld.shared.b8 %rs455, [%r577+384]; 2026-02-21T09:03:09.4995422Z ld.shared.b8 %rs456, [%r577+448]; 2026-02-21T09:03:09.4995484Z ld.shared.b8 %rs457, [%r577+512]; 2026-02-21T09:03:09.4995546Z ld.shared.b8 %rs458, [%r577+576]; 2026-02-21T09:03:09.4995606Z ld.shared.b8 %rs459, [%r577+640]; 2026-02-21T09:03:09.4995672Z ld.shared.b8 %rs460, [%r577+704]; 2026-02-21T09:03:09.4995732Z ld.shared.b8 %rs461, [%r577+768]; 2026-02-21T09:03:09.4995791Z ld.shared.b8 %rs462, [%r577+832]; 2026-02-21T09:03:09.4995857Z ld.shared.b8 %rs463, [%r577+896]; 2026-02-21T09:03:09.4995917Z add.s32 %r578, %r48, %r21; 2026-02-21T09:03:09.4996007Z ld.shared.b8 %rs464, [%r578]; 2026-02-21T09:03:09.4996072Z ld.shared.b8 %rs465, [%r577+1024]; 2026-02-21T09:03:09.4996141Z ld.shared.b8 %rs466, [%r577+1088]; 2026-02-21T09:03:09.4996201Z ld.shared.b8 %rs467, [%r577+1152]; 2026-02-21T09:03:09.4996262Z ld.shared.b8 %rs468, [%r577+1216]; 2026-02-21T09:03:09.4996330Z ld.shared.b8 %rs469, [%r577+1280]; 2026-02-21T09:03:09.4996391Z ld.shared.b8 %rs470, [%r577+1344]; 2026-02-21T09:03:09.4996451Z ld.shared.b8 %rs471, [%r577+1408]; 2026-02-21T09:03:09.4996543Z ld.shared.b8 %rs472, [%r577+1472]; 2026-02-21T09:03:09.4996614Z ld.shared.b8 %rs473, [%r577+1536]; 2026-02-21T09:03:09.4996675Z ld.shared.b8 %rs474, [%r577+1600]; 2026-02-21T09:03:09.4996736Z ld.shared.b8 %rs475, [%r577+1664]; 2026-02-21T09:03:09.4996804Z ld.shared.b8 %rs476, [%r577+1728]; 2026-02-21T09:03:09.4996866Z ld.shared.b8 %rs477, [%r577+1792]; 2026-02-21T09:03:09.4996927Z ld.shared.b8 %rs478, [%r577+1856]; 2026-02-21T09:03:09.4996995Z ld.shared.b8 %rs479, [%r577+1920]; 2026-02-21T09:03:09.4997054Z add.s32 %r579, %r48, %r22; 2026-02-21T09:03:09.4997115Z ld.shared.b8 %rs480, [%r579]; 2026-02-21T09:03:09.4997175Z ld.shared.b8 %rs481, [%r577+2048]; 2026-02-21T09:03:09.4997245Z ld.shared.b8 %rs482, 
[%r577+2112]; 2026-02-21T09:03:09.4997326Z ld.shared.b8 %rs483, [%r577+2176]; 2026-02-21T09:03:09.4997389Z ld.shared.b8 %rs484, [%r577+2240]; 2026-02-21T09:03:09.4997454Z ld.shared.b8 %rs485, [%r577+2304]; 2026-02-21T09:03:09.4997515Z ld.shared.b8 %rs486, [%r577+2368]; 2026-02-21T09:03:09.4997575Z ld.shared.b8 %rs487, [%r577+2432]; 2026-02-21T09:03:09.4997635Z ld.shared.b8 %rs488, [%r577+2496]; 2026-02-21T09:03:09.4997700Z ld.shared.b8 %rs489, [%r577+2560]; 2026-02-21T09:03:09.4997761Z ld.shared.b8 %rs490, [%r577+2624]; 2026-02-21T09:03:09.4997821Z ld.shared.b8 %rs491, [%r577+2688]; 2026-02-21T09:03:09.4997887Z ld.shared.b8 %rs492, [%r577+2752]; 2026-02-21T09:03:09.4997946Z ld.shared.b8 %rs493, [%r577+2816]; 2026-02-21T09:03:09.4998006Z ld.shared.b8 %rs494, [%r577+2880]; 2026-02-21T09:03:09.4998072Z ld.shared.b8 %rs495, [%r577+2944]; 2026-02-21T09:03:09.4998130Z add.s32 %r580, %r48, %r23; 2026-02-21T09:03:09.4998190Z ld.shared.b8 %rs496, [%r580]; 2026-02-21T09:03:09.4998249Z ld.shared.b8 %rs497, [%r577+3072]; 2026-02-21T09:03:09.4998317Z ld.shared.b8 %rs498, [%r577+3136]; 2026-02-21T09:03:09.4998377Z ld.shared.b8 %rs499, [%r577+3200]; 2026-02-21T09:03:09.4998436Z ld.shared.b8 %rs500, [%r577+3264]; 2026-02-21T09:03:09.4998502Z ld.shared.b8 %rs501, [%r577+3328]; 2026-02-21T09:03:09.4998561Z ld.shared.b8 %rs502, [%r577+3392]; 2026-02-21T09:03:09.4998620Z ld.shared.b8 %rs503, [%r577+3456]; 2026-02-21T09:03:09.4998679Z ld.shared.b8 %rs504, [%r577+3520]; 2026-02-21T09:03:09.4998745Z ld.shared.b8 %rs505, [%r577+3584]; 2026-02-21T09:03:09.4998806Z ld.shared.b8 %rs506, [%r577+3648]; 2026-02-21T09:03:09.4998865Z ld.shared.b8 %rs507, [%r577+3712]; 2026-02-21T09:03:09.4998932Z ld.shared.b8 %rs508, [%r577+3776]; 2026-02-21T09:03:09.4998993Z ld.shared.b8 %rs509, [%r577+3840]; 2026-02-21T09:03:09.4999052Z ld.shared.b8 %rs510, [%r577+3904]; 2026-02-21T09:03:09.4999114Z ld.shared.b8 %rs511, [%r577+3968]; 2026-02-21T09:03:09.4999178Z add.s32 %r581, %r48, %r24; 
2026-02-21T09:03:09.4999238Z ld.shared.b8 %rs512, [%r581]; 2026-02-21T09:03:09.4999415Z .loc 1 57 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:57:28 2026-02-21T09:03:09.4999525Z shl.b16 %rs513, %rs449, 4; 2026-02-21T09:03:09.4999589Z shl.b16 %rs514, %rs450, 4; 2026-02-21T09:03:09.4999650Z shl.b16 %rs515, %rs451, 4; 2026-02-21T09:03:09.4999718Z shl.b16 %rs516, %rs452, 4; 2026-02-21T09:03:09.4999779Z shl.b16 %rs517, %rs453, 4; 2026-02-21T09:03:09.4999839Z shl.b16 %rs518, %rs454, 4; 2026-02-21T09:03:09.4999899Z shl.b16 %rs519, %rs455, 4; 2026-02-21T09:03:09.4999966Z shl.b16 %rs520, %rs456, 4; 2026-02-21T09:03:09.5000027Z shl.b16 %rs521, %rs457, 4; 2026-02-21T09:03:09.5000087Z shl.b16 %rs522, %rs458, 4; 2026-02-21T09:03:09.5000182Z shl.b16 %rs523, %rs459, 4; 2026-02-21T09:03:09.5000242Z shl.b16 %rs524, %rs460, 4; 2026-02-21T09:03:09.5000303Z shl.b16 %rs525, %rs461, 4; 2026-02-21T09:03:09.5000364Z shl.b16 %rs526, %rs462, 4; 2026-02-21T09:03:09.5000433Z shl.b16 %rs527, %rs463, 4; 2026-02-21T09:03:09.5000496Z shl.b16 %rs528, %rs464, 4; 2026-02-21T09:03:09.5000559Z shl.b16 %rs529, %rs465, 4; 2026-02-21T09:03:09.5000631Z shl.b16 %rs530, %rs466, 4; 2026-02-21T09:03:09.5000692Z shl.b16 %rs531, %rs467, 4; 2026-02-21T09:03:09.5000776Z shl.b16 %rs532, %rs468, 4; 2026-02-21T09:03:09.5000847Z shl.b16 %rs533, %rs469, 4; 2026-02-21T09:03:09.5000915Z shl.b16 %rs534, %rs470, 4; 2026-02-21T09:03:09.5000976Z shl.b16 %rs535, %rs471, 4; 2026-02-21T09:03:09.5001036Z shl.b16 %rs536, %rs472, 4; 2026-02-21T09:03:09.5001102Z shl.b16 %rs537, %rs473, 4; 2026-02-21T09:03:09.5001163Z shl.b16 %rs538, %rs474, 4; 2026-02-21T09:03:09.5001223Z shl.b16 %rs539, %rs475, 4; 2026-02-21T09:03:09.5001285Z shl.b16 %rs540, %rs476, 4; 2026-02-21T09:03:09.5001354Z shl.b16 %rs541, %rs477, 4; 2026-02-21T09:03:09.5001414Z shl.b16 %rs542, %rs478, 4; 2026-02-21T09:03:09.5001473Z shl.b16 %rs543, %rs479, 4; 2026-02-21T09:03:09.5001579Z shl.b16 %rs544, %rs480, 4; 2026-02-21T09:03:09.5001640Z shl.b16 
%rs545, %rs481, 4; 2026-02-21T09:03:09.5001728Z shl.b16 %rs546, %rs482, 4; 2026-02-21T09:03:09.5001793Z shl.b16 %rs547, %rs483, 4; 2026-02-21T09:03:09.5001862Z shl.b16 %rs548, %rs484, 4; 2026-02-21T09:03:09.5001922Z shl.b16 %rs549, %rs485, 4; 2026-02-21T09:03:09.5001984Z shl.b16 %rs550, %rs486, 4; 2026-02-21T09:03:09.5002051Z shl.b16 %rs551, %rs487, 4; 2026-02-21T09:03:09.5002111Z shl.b16 %rs552, %rs488, 4; 2026-02-21T09:03:09.5002170Z shl.b16 %rs553, %rs489, 4; 2026-02-21T09:03:09.5002239Z shl.b16 %rs554, %rs490, 4; 2026-02-21T09:03:09.5002297Z shl.b16 %rs555, %rs491, 4; 2026-02-21T09:03:09.5002356Z shl.b16 %rs556, %rs492, 4; 2026-02-21T09:03:09.5002415Z shl.b16 %rs557, %rs493, 4; 2026-02-21T09:03:09.5002485Z shl.b16 %rs558, %rs494, 4; 2026-02-21T09:03:09.5002547Z shl.b16 %rs559, %rs495, 4; 2026-02-21T09:03:09.5002607Z shl.b16 %rs560, %rs496, 4; 2026-02-21T09:03:09.5002674Z shl.b16 %rs561, %rs497, 4; 2026-02-21T09:03:09.5002735Z shl.b16 %rs562, %rs498, 4; 2026-02-21T09:03:09.5002795Z shl.b16 %rs563, %rs499, 4; 2026-02-21T09:03:09.5002859Z shl.b16 %rs564, %rs500, 4; 2026-02-21T09:03:09.5002930Z shl.b16 %rs565, %rs501, 4; 2026-02-21T09:03:09.5002989Z shl.b16 %rs566, %rs502, 4; 2026-02-21T09:03:09.5003053Z shl.b16 %rs567, %rs503, 4; 2026-02-21T09:03:09.5003122Z shl.b16 %rs568, %rs504, 4; 2026-02-21T09:03:09.5003184Z shl.b16 %rs569, %rs505, 4; 2026-02-21T09:03:09.5003243Z shl.b16 %rs570, %rs506, 4; 2026-02-21T09:03:09.5003303Z shl.b16 %rs571, %rs507, 4; 2026-02-21T09:03:09.5003371Z shl.b16 %rs572, %rs508, 4; 2026-02-21T09:03:09.5003431Z shl.b16 %rs573, %rs509, 4; 2026-02-21T09:03:09.5003491Z shl.b16 %rs574, %rs510, 4; 2026-02-21T09:03:09.5003559Z shl.b16 %rs575, %rs511, 4; 2026-02-21T09:03:09.5003620Z shl.b16 %rs576, %rs512, 4; 2026-02-21T09:03:09.5003798Z .loc 1 72 58 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:72:58 2026-02-21T09:03:09.5003872Z selp.b16 %rs577, %rs513, %rs449, %p12; 2026-02-21T09:03:09.5003942Z cvt.s16.s8 %rs578, %rs577; 
2026-02-21T09:03:09.5004002Z shr.s16 %rs579, %rs578, 4; 2026-02-21T09:03:09.5004104Z selp.b16 %rs580, %rs514, %rs450, %p12; 2026-02-21T09:03:09.5004172Z cvt.s16.s8 %rs581, %rs580; 2026-02-21T09:03:09.5004233Z shr.s16 %rs582, %rs581, 4; 2026-02-21T09:03:09.5004303Z selp.b16 %rs583, %rs515, %rs451, %p12; 2026-02-21T09:03:09.5004371Z cvt.s16.s8 %rs584, %rs583; 2026-02-21T09:03:09.5004431Z shr.s16 %rs585, %rs584, 4; 2026-02-21T09:03:09.5004498Z selp.b16 %rs586, %rs516, %rs452, %p12; 2026-02-21T09:03:09.5004558Z cvt.s16.s8 %rs587, %rs586; 2026-02-21T09:03:09.5004625Z shr.s16 %rs588, %rs587, 4; 2026-02-21T09:03:09.5004693Z selp.b16 %rs589, %rs517, %rs453, %p12; 2026-02-21T09:03:09.5004780Z cvt.s16.s8 %rs590, %rs589; 2026-02-21T09:03:09.5004846Z shr.s16 %rs591, %rs590, 4; 2026-02-21T09:03:09.5004912Z selp.b16 %rs592, %rs518, %rs454, %p12; 2026-02-21T09:03:09.5004973Z cvt.s16.s8 %rs593, %rs592; 2026-02-21T09:03:09.5005033Z shr.s16 %rs594, %rs593, 4; 2026-02-21T09:03:09.5005109Z selp.b16 %rs595, %rs519, %rs455, %p12; 2026-02-21T09:03:09.5005171Z cvt.s16.s8 %rs596, %rs595; 2026-02-21T09:03:09.5005231Z shr.s16 %rs597, %rs596, 4; 2026-02-21T09:03:09.5005304Z selp.b16 %rs598, %rs520, %rs456, %p12; 2026-02-21T09:03:09.5005393Z cvt.s16.s8 %rs599, %rs598; 2026-02-21T09:03:09.5005457Z shr.s16 %rs600, %rs599, 4; 2026-02-21T09:03:09.5005525Z selp.b16 %rs601, %rs521, %rs457, %p12; 2026-02-21T09:03:09.5005593Z cvt.s16.s8 %rs602, %rs601; 2026-02-21T09:03:09.5005653Z shr.s16 %rs603, %rs602, 4; 2026-02-21T09:03:09.5005719Z selp.b16 %rs604, %rs522, %rs458, %p12; 2026-02-21T09:03:09.5005787Z cvt.s16.s8 %rs605, %rs604; 2026-02-21T09:03:09.5005850Z shr.s16 %rs606, %rs605, 4; 2026-02-21T09:03:09.5005917Z selp.b16 %rs607, %rs523, %rs459, %p12; 2026-02-21T09:03:09.5005984Z cvt.s16.s8 %rs608, %rs607; 2026-02-21T09:03:09.5006043Z shr.s16 %rs609, %rs608, 4; 2026-02-21T09:03:09.5006109Z selp.b16 %rs610, %rs524, %rs460, %p12; 2026-02-21T09:03:09.5006171Z cvt.s16.s8 %rs611, %rs610; 
2026-02-21T09:03:09.5006268Z shr.s16 %rs612, %rs611, 4; 2026-02-21T09:03:09.5006339Z selp.b16 %rs613, %rs525, %rs461, %p12; 2026-02-21T09:03:09.5006399Z cvt.s16.s8 %rs614, %rs613; 2026-02-21T09:03:09.5006467Z shr.s16 %rs615, %rs614, 4; 2026-02-21T09:03:09.5006535Z selp.b16 %rs616, %rs526, %rs462, %p12; 2026-02-21T09:03:09.5006596Z cvt.s16.s8 %rs617, %rs616; 2026-02-21T09:03:09.5006656Z shr.s16 %rs618, %rs617, 4; 2026-02-21T09:03:09.5006732Z selp.b16 %rs619, %rs527, %rs463, %p12; 2026-02-21T09:03:09.5006793Z cvt.s16.s8 %rs620, %rs619; 2026-02-21T09:03:09.5006854Z shr.s16 %rs621, %rs620, 4; 2026-02-21T09:03:09.5006928Z selp.b16 %rs622, %rs528, %rs464, %p12; 2026-02-21T09:03:09.5006989Z cvt.s16.s8 %rs623, %rs622; 2026-02-21T09:03:09.5007050Z shr.s16 %rs624, %rs623, 4; 2026-02-21T09:03:09.5007117Z selp.b16 %rs625, %rs529, %rs465, %p12; 2026-02-21T09:03:09.5007188Z cvt.s16.s8 %rs626, %rs625; 2026-02-21T09:03:09.5007248Z shr.s16 %rs627, %rs626, 4; 2026-02-21T09:03:09.5007319Z selp.b16 %rs628, %rs530, %rs466, %p12; 2026-02-21T09:03:09.5007391Z cvt.s16.s8 %rs629, %rs628; 2026-02-21T09:03:09.5007455Z shr.s16 %rs630, %rs629, 4; 2026-02-21T09:03:09.5007523Z selp.b16 %rs631, %rs531, %rs467, %p12; 2026-02-21T09:03:09.5007586Z cvt.s16.s8 %rs632, %rs631; 2026-02-21T09:03:09.5007653Z shr.s16 %rs633, %rs632, 4; 2026-02-21T09:03:09.5007721Z selp.b16 %rs634, %rs532, %rs468, %p12; 2026-02-21T09:03:09.5007780Z cvt.s16.s8 %rs635, %rs634; 2026-02-21T09:03:09.5007849Z shr.s16 %rs636, %rs635, 4; 2026-02-21T09:03:09.5007914Z selp.b16 %rs637, %rs533, %rs469, %p12; 2026-02-21T09:03:09.5007975Z cvt.s16.s8 %rs638, %rs637; 2026-02-21T09:03:09.5008051Z shr.s16 %rs639, %rs638, 4; 2026-02-21T09:03:09.5008115Z selp.b16 %rs640, %rs534, %rs470, %p12; 2026-02-21T09:03:09.5008173Z cvt.s16.s8 %rs641, %rs640; 2026-02-21T09:03:09.5008230Z shr.s16 %rs642, %rs641, 4; 2026-02-21T09:03:09.5008300Z selp.b16 %rs643, %rs535, %rs471, %p12; 2026-02-21T09:03:09.5008358Z cvt.s16.s8 %rs644, %rs643; 
2026-02-21T09:03:09.5008417Z shr.s16 %rs645, %rs644, 4; 2026-02-21T09:03:09.5008513Z selp.b16 %rs646, %rs536, %rs472, %p12; 2026-02-21T09:03:09.5008570Z cvt.s16.s8 %rs647, %rs646; 2026-02-21T09:03:09.5008628Z shr.s16 %rs648, %rs647, 4; 2026-02-21T09:03:09.5008694Z selp.b16 %rs649, %rs537, %rs473, %p12; 2026-02-21T09:03:09.5008758Z cvt.s16.s8 %rs650, %rs649; 2026-02-21T09:03:09.5008815Z shr.s16 %rs651, %rs650, 4; 2026-02-21T09:03:09.5008879Z selp.b16 %rs652, %rs538, %rs474, %p12; 2026-02-21T09:03:09.5008945Z cvt.s16.s8 %rs653, %rs652; 2026-02-21T09:03:09.5009001Z shr.s16 %rs654, %rs653, 4; 2026-02-21T09:03:09.5009065Z selp.b16 %rs655, %rs539, %rs475, %p12; 2026-02-21T09:03:09.5009147Z cvt.s16.s8 %rs656, %rs655; 2026-02-21T09:03:09.5009213Z shr.s16 %rs657, %rs656, 4; 2026-02-21T09:03:09.5009275Z selp.b16 %rs658, %rs540, %rs476, %p12; 2026-02-21T09:03:09.5009333Z cvt.s16.s8 %rs659, %rs658; 2026-02-21T09:03:09.5009397Z shr.s16 %rs660, %rs659, 4; 2026-02-21T09:03:09.5009461Z selp.b16 %rs661, %rs541, %rs477, %p12; 2026-02-21T09:03:09.5009521Z cvt.s16.s8 %rs662, %rs661; 2026-02-21T09:03:09.5009579Z shr.s16 %rs663, %rs662, 4; 2026-02-21T09:03:09.5009649Z selp.b16 %rs664, %rs542, %rs478, %p12; 2026-02-21T09:03:09.5009734Z cvt.s16.s8 %rs665, %rs664; 2026-02-21T09:03:09.5009793Z shr.s16 %rs666, %rs665, 4; 2026-02-21T09:03:09.5009866Z selp.b16 %rs667, %rs543, %rs479, %p12; 2026-02-21T09:03:09.5009923Z cvt.s16.s8 %rs668, %rs667; 2026-02-21T09:03:09.5009980Z shr.s16 %rs669, %rs668, 4; 2026-02-21T09:03:09.5010049Z selp.b16 %rs670, %rs544, %rs480, %p12; 2026-02-21T09:03:09.5010107Z cvt.s16.s8 %rs671, %rs670; 2026-02-21T09:03:09.5010164Z shr.s16 %rs672, %rs671, 4; 2026-02-21T09:03:09.5010228Z selp.b16 %rs673, %rs545, %rs481, %p12; 2026-02-21T09:03:09.5010292Z cvt.s16.s8 %rs674, %rs673; 2026-02-21T09:03:09.5010349Z shr.s16 %rs675, %rs674, 4; 2026-02-21T09:03:09.5010412Z selp.b16 %rs676, %rs546, %rs482, %p12; 2026-02-21T09:03:09.5010476Z cvt.s16.s8 %rs677, %rs676; 
2026-02-21T09:03:09.5010573Z shr.s16 %rs678, %rs677, 4; 2026-02-21T09:03:09.5010639Z selp.b16 %rs679, %rs547, %rs483, %p12; 2026-02-21T09:03:09.5010696Z cvt.s16.s8 %rs680, %rs679; 2026-02-21T09:03:09.5010762Z shr.s16 %rs681, %rs680, 4; 2026-02-21T09:03:09.5010825Z selp.b16 %rs682, %rs548, %rs484, %p12; 2026-02-21T09:03:09.5010882Z cvt.s16.s8 %rs683, %rs682; 2026-02-21T09:03:09.5010945Z shr.s16 %rs684, %rs683, 4; 2026-02-21T09:03:09.5011006Z selp.b16 %rs685, %rs549, %rs485, %p12; 2026-02-21T09:03:09.5011062Z cvt.s16.s8 %rs686, %rs685; 2026-02-21T09:03:09.5011118Z shr.s16 %rs687, %rs686, 4; 2026-02-21T09:03:09.5011186Z selp.b16 %rs688, %rs550, %rs486, %p12; 2026-02-21T09:03:09.5011246Z cvt.s16.s8 %rs689, %rs688; 2026-02-21T09:03:09.5011304Z shr.s16 %rs690, %rs689, 4; 2026-02-21T09:03:09.5011375Z selp.b16 %rs691, %rs551, %rs487, %p12; 2026-02-21T09:03:09.5011431Z cvt.s16.s8 %rs692, %rs691; 2026-02-21T09:03:09.5011487Z shr.s16 %rs693, %rs692, 4; 2026-02-21T09:03:09.5011597Z selp.b16 %rs694, %rs552, %rs488, %p12; 2026-02-21T09:03:09.5011658Z cvt.s16.s8 %rs695, %rs694; 2026-02-21T09:03:09.5011714Z shr.s16 %rs696, %rs695, 4; 2026-02-21T09:03:09.5011777Z selp.b16 %rs697, %rs553, %rs489, %p12; 2026-02-21T09:03:09.5011845Z cvt.s16.s8 %rs698, %rs697; 2026-02-21T09:03:09.5011901Z shr.s16 %rs699, %rs698, 4; 2026-02-21T09:03:09.5011964Z selp.b16 %rs700, %rs554, %rs490, %p12; 2026-02-21T09:03:09.5012029Z cvt.s16.s8 %rs701, %rs700; 2026-02-21T09:03:09.5012086Z shr.s16 %rs702, %rs701, 4; 2026-02-21T09:03:09.5012148Z selp.b16 %rs703, %rs555, %rs491, %p12; 2026-02-21T09:03:09.5012206Z cvt.s16.s8 %rs704, %rs703; 2026-02-21T09:03:09.5012272Z shr.s16 %rs705, %rs704, 4; 2026-02-21T09:03:09.5012336Z selp.b16 %rs706, %rs556, %rs492, %p12; 2026-02-21T09:03:09.5012392Z cvt.s16.s8 %rs707, %rs706; 2026-02-21T09:03:09.5012456Z shr.s16 %rs708, %rs707, 4; 2026-02-21T09:03:09.5012519Z selp.b16 %rs709, %rs557, %rs493, %p12; 2026-02-21T09:03:09.5012577Z cvt.s16.s8 %rs710, %rs709; 
2026-02-21T09:03:09.5012636Z shr.s16 %rs711, %rs710, 4; 2026-02-21T09:03:09.5012744Z selp.b16 %rs712, %rs558, %rs494, %p12; 2026-02-21T09:03:09.5012803Z cvt.s16.s8 %rs713, %rs712; 2026-02-21T09:03:09.5012863Z shr.s16 %rs714, %rs713, 4; 2026-02-21T09:03:09.5012935Z selp.b16 %rs715, %rs559, %rs495, %p12; 2026-02-21T09:03:09.5012994Z cvt.s16.s8 %rs716, %rs715; 2026-02-21T09:03:09.5013052Z shr.s16 %rs717, %rs716, 4; 2026-02-21T09:03:09.5013116Z selp.b16 %rs718, %rs560, %rs496, %p12; 2026-02-21T09:03:09.5013180Z cvt.s16.s8 %rs719, %rs718; 2026-02-21T09:03:09.5013238Z shr.s16 %rs720, %rs719, 4; 2026-02-21T09:03:09.5013302Z selp.b16 %rs721, %rs561, %rs497, %p12; 2026-02-21T09:03:09.5013396Z cvt.s16.s8 %rs722, %rs721; 2026-02-21T09:03:09.5013455Z shr.s16 %rs723, %rs722, 4; 2026-02-21T09:03:09.5013521Z selp.b16 %rs724, %rs562, %rs498, %p12; 2026-02-21T09:03:09.5013589Z cvt.s16.s8 %rs725, %rs724; 2026-02-21T09:03:09.5013649Z shr.s16 %rs726, %rs725, 4; 2026-02-21T09:03:09.5013716Z selp.b16 %rs727, %rs563, %rs499, %p12; 2026-02-21T09:03:09.5013776Z cvt.s16.s8 %rs728, %rs727; 2026-02-21T09:03:09.5013845Z shr.s16 %rs729, %rs728, 4; 2026-02-21T09:03:09.5013910Z selp.b16 %rs730, %rs564, %rs500, %p12; 2026-02-21T09:03:09.5013993Z cvt.s16.s8 %rs731, %rs730; 2026-02-21T09:03:09.5014061Z shr.s16 %rs732, %rs731, 4; 2026-02-21T09:03:09.5014125Z selp.b16 %rs733, %rs565, %rs501, %p12; 2026-02-21T09:03:09.5014182Z cvt.s16.s8 %rs734, %rs733; 2026-02-21T09:03:09.5014240Z shr.s16 %rs735, %rs734, 4; 2026-02-21T09:03:09.5014312Z selp.b16 %rs736, %rs566, %rs502, %p12; 2026-02-21T09:03:09.5014370Z cvt.s16.s8 %rs737, %rs736; 2026-02-21T09:03:09.5014427Z shr.s16 %rs738, %rs737, 4; 2026-02-21T09:03:09.5014497Z selp.b16 %rs739, %rs567, %rs503, %p12; 2026-02-21T09:03:09.5014553Z cvt.s16.s8 %rs740, %rs739; 2026-02-21T09:03:09.5014609Z shr.s16 %rs741, %rs740, 4; 2026-02-21T09:03:09.5014673Z selp.b16 %rs742, %rs568, %rs504, %p12; 2026-02-21T09:03:09.5014739Z cvt.s16.s8 %rs743, %rs742; 
2026-02-21T09:03:09.5014852Z shr.s16 %rs744, %rs743, 4; 2026-02-21T09:03:09.5014919Z selp.b16 %rs745, %rs569, %rs505, %p12; 2026-02-21T09:03:09.5014986Z cvt.s16.s8 %rs746, %rs745; 2026-02-21T09:03:09.5015046Z shr.s16 %rs747, %rs746, 4; 2026-02-21T09:03:09.5015108Z selp.b16 %rs748, %rs570, %rs506, %p12; 2026-02-21T09:03:09.5015165Z cvt.s16.s8 %rs749, %rs748; 2026-02-21T09:03:09.5015230Z shr.s16 %rs750, %rs749, 4; 2026-02-21T09:03:09.5015293Z selp.b16 %rs751, %rs571, %rs507, %p12; 2026-02-21T09:03:09.5015351Z cvt.s16.s8 %rs752, %rs751; 2026-02-21T09:03:09.5015414Z shr.s16 %rs753, %rs752, 4; 2026-02-21T09:03:09.5015477Z selp.b16 %rs754, %rs572, %rs508, %p12; 2026-02-21T09:03:09.5015537Z cvt.s16.s8 %rs755, %rs754; 2026-02-21T09:03:09.5015603Z shr.s16 %rs756, %rs755, 4; 2026-02-21T09:03:09.5015664Z selp.b16 %rs757, %rs573, %rs509, %p12; 2026-02-21T09:03:09.5015722Z cvt.s16.s8 %rs758, %rs757; 2026-02-21T09:03:09.5015779Z shr.s16 %rs759, %rs758, 4; 2026-02-21T09:03:09.5015850Z selp.b16 %rs760, %rs574, %rs510, %p12; 2026-02-21T09:03:09.5015908Z cvt.s16.s8 %rs761, %rs760; 2026-02-21T09:03:09.5015966Z shr.s16 %rs762, %rs761, 4; 2026-02-21T09:03:09.5016036Z selp.b16 %rs763, %rs575, %rs511, %p12; 2026-02-21T09:03:09.5016094Z cvt.s16.s8 %rs764, %rs763; 2026-02-21T09:03:09.5016152Z shr.s16 %rs765, %rs764, 4; 2026-02-21T09:03:09.5016214Z selp.b16 %rs766, %rs576, %rs512, %p12; 2026-02-21T09:03:09.5016283Z cvt.s16.s8 %rs767, %rs766; 2026-02-21T09:03:09.5016340Z shr.s16 %rs768, %rs767, 4; 2026-02-21T09:03:09.5016509Z .loc 1 77 32 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:77:32 2026-02-21T09:03:09.5016578Z cvt.rn.f32.s16 %r582, %rs579; 2026-02-21T09:03:09.5016639Z cvt.rn.f32.s16 %r583, %rs582; 2026-02-21T09:03:09.5016698Z cvt.rn.f32.s16 %r584, %rs585; 2026-02-21T09:03:09.5016762Z cvt.rn.f32.s16 %r585, %rs588; 2026-02-21T09:03:09.5016820Z cvt.rn.f32.s16 %r586, %rs591; 2026-02-21T09:03:09.5016879Z cvt.rn.f32.s16 %r587, %rs594; 2026-02-21T09:03:09.5016938Z 
cvt.rn.f32.s16 %r588, %rs597; 2026-02-21T09:03:09.5017026Z cvt.rn.f32.s16 %r589, %rs600; 2026-02-21T09:03:09.5017085Z cvt.rn.f32.s16 %r590, %rs603; 2026-02-21T09:03:09.5017145Z cvt.rn.f32.s16 %r591, %rs606; 2026-02-21T09:03:09.5017209Z cvt.rn.f32.s16 %r592, %rs609; 2026-02-21T09:03:09.5017267Z cvt.rn.f32.s16 %r593, %rs612; 2026-02-21T09:03:09.5017325Z cvt.rn.f32.s16 %r594, %rs615; 2026-02-21T09:03:09.5017383Z cvt.rn.f32.s16 %r595, %rs618; 2026-02-21T09:03:09.5017446Z cvt.rn.f32.s16 %r596, %rs621; 2026-02-21T09:03:09.5017504Z cvt.rn.f32.s16 %r597, %rs624; 2026-02-21T09:03:09.5017562Z cvt.rn.f32.s16 %r598, %rs627; 2026-02-21T09:03:09.5017647Z cvt.rn.f32.s16 %r599, %rs630; 2026-02-21T09:03:09.5017704Z cvt.rn.f32.s16 %r600, %rs633; 2026-02-21T09:03:09.5017761Z cvt.rn.f32.s16 %r601, %rs636; 2026-02-21T09:03:09.5017817Z cvt.rn.f32.s16 %r602, %rs639; 2026-02-21T09:03:09.5017882Z cvt.rn.f32.s16 %r603, %rs642; 2026-02-21T09:03:09.5017941Z cvt.rn.f32.s16 %r604, %rs645; 2026-02-21T09:03:09.5018000Z cvt.rn.f32.s16 %r605, %rs648; 2026-02-21T09:03:09.5018066Z cvt.rn.f32.s16 %r606, %rs651; 2026-02-21T09:03:09.5018125Z cvt.rn.f32.s16 %r607, %rs654; 2026-02-21T09:03:09.5018202Z cvt.rn.f32.s16 %r608, %rs657; 2026-02-21T09:03:09.5018261Z cvt.rn.f32.s16 %r609, %rs660; 2026-02-21T09:03:09.5018327Z cvt.rn.f32.s16 %r610, %rs663; 2026-02-21T09:03:09.5018384Z cvt.rn.f32.s16 %r611, %rs666; 2026-02-21T09:03:09.5018443Z cvt.rn.f32.s16 %r612, %rs669; 2026-02-21T09:03:09.5018509Z cvt.rn.f32.s16 %r613, %rs672; 2026-02-21T09:03:09.5018567Z cvt.rn.f32.s16 %r614, %rs675; 2026-02-21T09:03:09.5018625Z cvt.rn.f32.s16 %r615, %rs678; 2026-02-21T09:03:09.5018693Z cvt.rn.f32.s16 %r616, %rs681; 2026-02-21T09:03:09.5018750Z cvt.rn.f32.s16 %r617, %rs684; 2026-02-21T09:03:09.5018808Z cvt.rn.f32.s16 %r618, %rs687; 2026-02-21T09:03:09.5018866Z cvt.rn.f32.s16 %r619, %rs690; 2026-02-21T09:03:09.5018933Z cvt.rn.f32.s16 %r620, %rs693; 2026-02-21T09:03:09.5018991Z cvt.rn.f32.s16 %r621, %rs696; 
2026-02-21T09:03:09.5019071Z cvt.rn.f32.s16 %r622, %rs699; 2026-02-21T09:03:09.5019138Z cvt.rn.f32.s16 %r623, %rs702; 2026-02-21T09:03:09.5019196Z cvt.rn.f32.s16 %r624, %rs705; 2026-02-21T09:03:09.5019254Z cvt.rn.f32.s16 %r625, %rs708; 2026-02-21T09:03:09.5019311Z cvt.rn.f32.s16 %r626, %rs711; 2026-02-21T09:03:09.5019378Z cvt.rn.f32.s16 %r627, %rs714; 2026-02-21T09:03:09.5019436Z cvt.rn.f32.s16 %r628, %rs717; 2026-02-21T09:03:09.5019493Z cvt.rn.f32.s16 %r629, %rs720; 2026-02-21T09:03:09.5019558Z cvt.rn.f32.s16 %r630, %rs723; 2026-02-21T09:03:09.5019615Z cvt.rn.f32.s16 %r631, %rs726; 2026-02-21T09:03:09.5019673Z cvt.rn.f32.s16 %r632, %rs729; 2026-02-21T09:03:09.5019732Z cvt.rn.f32.s16 %r633, %rs732; 2026-02-21T09:03:09.5019798Z cvt.rn.f32.s16 %r634, %rs735; 2026-02-21T09:03:09.5019858Z cvt.rn.f32.s16 %r635, %rs738; 2026-02-21T09:03:09.5019917Z cvt.rn.f32.s16 %r636, %rs741; 2026-02-21T09:03:09.5019984Z cvt.rn.f32.s16 %r637, %rs744; 2026-02-21T09:03:09.5020045Z cvt.rn.f32.s16 %r638, %rs747; 2026-02-21T09:03:09.5020105Z cvt.rn.f32.s16 %r639, %rs750; 2026-02-21T09:03:09.5020164Z cvt.rn.f32.s16 %r640, %rs753; 2026-02-21T09:03:09.5020233Z cvt.rn.f32.s16 %r641, %rs756; 2026-02-21T09:03:09.5020293Z cvt.rn.f32.s16 %r642, %rs759; 2026-02-21T09:03:09.5020350Z cvt.rn.f32.s16 %r643, %rs762; 2026-02-21T09:03:09.5020416Z cvt.rn.f32.s16 %r644, %rs765; 2026-02-21T09:03:09.5020473Z cvt.rn.f32.s16 %r645, %rs768; 2026-02-21T09:03:09.5020535Z st.shared.b32 [%r25], %r582; 2026-02-21T09:03:09.5020606Z st.shared.b32 [%r25+8], %r583; 2026-02-21T09:03:09.5020669Z st.shared.b32 [%r25+8192], %r598; 2026-02-21T09:03:09.5020731Z st.shared.b32 [%r25+8200], %r599; 2026-02-21T09:03:09.5020794Z st.shared.b32 [%r25+16384], %r614; 2026-02-21T09:03:09.5020864Z st.shared.b32 [%r25+16392], %r615; 2026-02-21T09:03:09.5020923Z st.shared.b32 [%r25+24576], %r630; 2026-02-21T09:03:09.5020983Z st.shared.b32 [%r25+24584], %r631; 2026-02-21T09:03:09.5021050Z st.shared.b32 [%r26], %r584; 
2026-02-21T09:03:09.5021112Z st.shared.b32 [%r26+8], %r585; 2026-02-21T09:03:09.5021200Z st.shared.b32 [%r26+8192], %r600; 2026-02-21T09:03:09.5021261Z st.shared.b32 [%r26+8200], %r601; 2026-02-21T09:03:09.5021332Z st.shared.b32 [%r26+16384], %r616; 2026-02-21T09:03:09.5021392Z st.shared.b32 [%r26+16392], %r617; 2026-02-21T09:03:09.5021451Z st.shared.b32 [%r26+24576], %r632; 2026-02-21T09:03:09.5021518Z st.shared.b32 [%r26+24584], %r633; 2026-02-21T09:03:09.5021621Z st.shared.b32 [%r27], %r586; 2026-02-21T09:03:09.5021685Z st.shared.b32 [%r27+8], %r587; 2026-02-21T09:03:09.5021746Z st.shared.b32 [%r27+8192], %r602; 2026-02-21T09:03:09.5021814Z st.shared.b32 [%r27+8200], %r603; 2026-02-21T09:03:09.5021907Z st.shared.b32 [%r27+16384], %r618; 2026-02-21T09:03:09.5021967Z st.shared.b32 [%r27+16392], %r619; 2026-02-21T09:03:09.5022036Z st.shared.b32 [%r27+24576], %r634; 2026-02-21T09:03:09.5022096Z st.shared.b32 [%r27+24584], %r635; 2026-02-21T09:03:09.5022155Z st.shared.b32 [%r28], %r588; 2026-02-21T09:03:09.5022224Z st.shared.b32 [%r28+8], %r589; 2026-02-21T09:03:09.5022284Z st.shared.b32 [%r28+8192], %r604; 2026-02-21T09:03:09.5022342Z st.shared.b32 [%r28+8200], %r605; 2026-02-21T09:03:09.5022427Z st.shared.b32 [%r28+16384], %r620; 2026-02-21T09:03:09.5022499Z st.shared.b32 [%r28+16392], %r621; 2026-02-21T09:03:09.5022560Z st.shared.b32 [%r28+24576], %r636; 2026-02-21T09:03:09.5022621Z st.shared.b32 [%r28+24584], %r637; 2026-02-21T09:03:09.5022689Z st.shared.b32 [%r29], %r590; 2026-02-21T09:03:09.5022752Z st.shared.b32 [%r29+8], %r591; 2026-02-21T09:03:09.5022812Z st.shared.b32 [%r29+8192], %r606; 2026-02-21T09:03:09.5022874Z st.shared.b32 [%r29+8200], %r607; 2026-02-21T09:03:09.5022945Z st.shared.b32 [%r29+16384], %r622; 2026-02-21T09:03:09.5023006Z st.shared.b32 [%r29+16392], %r623; 2026-02-21T09:03:09.5023065Z st.shared.b32 [%r29+24576], %r638; 2026-02-21T09:03:09.5023133Z st.shared.b32 [%r29+24584], %r639; 2026-02-21T09:03:09.5023219Z st.shared.b32 [%r30], %r592; 
2026-02-21T09:03:09.5023284Z st.shared.b32 [%r30+8], %r593; 2026-02-21T09:03:09.5023350Z st.shared.b32 [%r30+8192], %r608; 2026-02-21T09:03:09.5023409Z st.shared.b32 [%r30+8200], %r609; 2026-02-21T09:03:09.5023469Z st.shared.b32 [%r30+16384], %r624; 2026-02-21T09:03:09.5023527Z st.shared.b32 [%r30+16392], %r625; 2026-02-21T09:03:09.5023595Z st.shared.b32 [%r30+24576], %r640; 2026-02-21T09:03:09.5023652Z st.shared.b32 [%r30+24584], %r641; 2026-02-21T09:03:09.5023710Z st.shared.b32 [%r31], %r594; 2026-02-21T09:03:09.5023776Z st.shared.b32 [%r31+8], %r595; 2026-02-21T09:03:09.5023835Z st.shared.b32 [%r31+8192], %r610; 2026-02-21T09:03:09.5023896Z st.shared.b32 [%r31+8200], %r611; 2026-02-21T09:03:09.5023955Z st.shared.b32 [%r31+16384], %r626; 2026-02-21T09:03:09.5024020Z st.shared.b32 [%r31+16392], %r627; 2026-02-21T09:03:09.5024079Z st.shared.b32 [%r31+24576], %r642; 2026-02-21T09:03:09.5024138Z st.shared.b32 [%r31+24584], %r643; 2026-02-21T09:03:09.5024205Z st.shared.b32 [%r32], %r596; 2026-02-21T09:03:09.5024266Z st.shared.b32 [%r32+8], %r597; 2026-02-21T09:03:09.5024326Z st.shared.b32 [%r32+8192], %r612; 2026-02-21T09:03:09.5024385Z st.shared.b32 [%r32+8200], %r613; 2026-02-21T09:03:09.5024452Z st.shared.b32 [%r32+16384], %r628; 2026-02-21T09:03:09.5024511Z st.shared.b32 [%r32+16392], %r629; 2026-02-21T09:03:09.5024571Z st.shared.b32 [%r32+24576], %r644; 2026-02-21T09:03:09.5024637Z st.shared.b32 [%r32+24584], %r645; 2026-02-21T09:03:09.5024811Z .loc 1 40 93 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:40:93 2026-02-21T09:03:09.5024872Z shl.b32 %r646, %r847, 3; 2026-02-21T09:03:09.5024940Z add.s32 %r647, %r52, %r646; 2026-02-21T09:03:09.5025000Z add.s32 %r845, %r647, 73728; 2026-02-21T09:03:09.5025053Z $L__tmp10: 2026-02-21T09:03:09.5025272Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5025338Z // begin inline asm 2026-02-21T09:03:09.5025411Z fence.proxy.async.shared::cta; 
2026-02-21T09:03:09.5025494Z // end inline asm 2026-02-21T09:03:09.5025559Z bar.sync 0; 2026-02-21T09:03:09.5025619Z @%p13 bra $L__BB0_6; 2026-02-21T09:03:09.5025721Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:03:09.5025795Z elect.sync %r696|%p55, -1; 2026-02-21T09:03:09.5025855Z mov.b32 %r650, 68159760; 2026-02-21T09:03:09.5025913Z // begin inline asm 2026-02-21T09:03:09.5026067Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 0 ], %rd81, %r650, %p54; 2026-02-21T09:03:09.5026131Z // end inline asm 2026-02-21T09:03:09.5026214Z // begin inline asm 2026-02-21T09:03:09.5026361Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 8 ], %rd82, %r650, %p54; 2026-02-21T09:03:09.5026427Z // end inline asm 2026-02-21T09:03:09.5026483Z // begin inline asm 2026-02-21T09:03:09.5026628Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 16 ], %rd83, %r650, %p54; 2026-02-21T09:03:09.5026692Z // end inline asm 2026-02-21T09:03:09.5026749Z // begin inline asm 2026-02-21T09:03:09.5026909Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 24 ], %rd84, %r650, %p54; 2026-02-21T09:03:09.5026966Z // end inline asm 2026-02-21T09:03:09.5027031Z // begin inline asm 2026-02-21T09:03:09.5027171Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 32 ], %rd85, %r650, %p54; 2026-02-21T09:03:09.5027226Z // end inline asm 2026-02-21T09:03:09.5027300Z // begin inline asm 2026-02-21T09:03:09.5027439Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 40 ], %rd86, %r650, %p54; 2026-02-21T09:03:09.5027496Z // end inline asm 2026-02-21T09:03:09.5027559Z // begin inline asm 2026-02-21T09:03:09.5027695Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 48 ], %rd87, %r650, %p54; 2026-02-21T09:03:09.5027749Z // end inline asm 2026-02-21T09:03:09.5027812Z // begin inline asm 2026-02-21T09:03:09.5027971Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 56 ], 
%rd88, %r650, %p54; 2026-02-21T09:03:09.5028029Z // end inline asm 2026-02-21T09:03:09.5028085Z // begin inline asm 2026-02-21T09:03:09.5028232Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 64 ], %rd89, %r650, %p54; 2026-02-21T09:03:09.5028287Z // end inline asm 2026-02-21T09:03:09.5028344Z // begin inline asm 2026-02-21T09:03:09.5028489Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 72 ], %rd90, %r650, %p54; 2026-02-21T09:03:09.5028545Z // end inline asm 2026-02-21T09:03:09.5028600Z // begin inline asm 2026-02-21T09:03:09.5028746Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 80 ], %rd91, %r650, %p54; 2026-02-21T09:03:09.5028801Z // end inline asm 2026-02-21T09:03:09.5028857Z // begin inline asm 2026-02-21T09:03:09.5029001Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 88 ], %rd92, %r650, %p54; 2026-02-21T09:03:09.5029056Z // end inline asm 2026-02-21T09:03:09.5029113Z // begin inline asm 2026-02-21T09:03:09.5029251Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 96 ], %rd93, %r650, %p54; 2026-02-21T09:03:09.5029316Z // end inline asm 2026-02-21T09:03:09.5029374Z // begin inline asm 2026-02-21T09:03:09.5029516Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 104 ], %rd94, %r650, %p54; 2026-02-21T09:03:09.5029579Z // end inline asm 2026-02-21T09:03:09.5029635Z // begin inline asm 2026-02-21T09:03:09.5029775Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 112 ], %rd95, %r650, %p54; 2026-02-21T09:03:09.5029838Z // end inline asm 2026-02-21T09:03:09.5029897Z // begin inline asm 2026-02-21T09:03:09.5030038Z @%p55 tcgen05.mma.cta_group::1.kind::tf32 [ %r843 + 0 ], [ %r391 + 120 ], %rd96, %r650, %p54; 2026-02-21T09:03:09.5030093Z // end inline asm 2026-02-21T09:03:09.5030161Z cvt.u64.u32 %rd132, %r845; 2026-02-21T09:03:09.5030219Z // begin inline asm 2026-02-21T09:03:09.5030345Z @%p55 
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd132]; 2026-02-21T09:03:09.5030434Z // end inline asm 2026-02-21T09:03:09.5030491Z bra.uni $L__BB0_6; 2026-02-21T09:03:09.5030588Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:03:09.5030683Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:03:09.5030738Z mov.b32 %r720, 1; 2026-02-21T09:03:09.5030951Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5031007Z // begin inline asm 2026-02-21T09:03:09.5031066Z 2026-02-21T09:03:09.5031116Z { 2026-02-21T09:03:09.5031202Z .reg .pred complete; 2026-02-21T09:03:09.5031265Z waitLoop: 2026-02-21T09:03:09.5031384Z mbarrier.try_wait.parity.shared.b64 complete, [%r845], %r720; 2026-02-21T09:03:09.5031449Z @!complete bra.uni waitLoop; 2026-02-21T09:03:09.5031498Z } 2026-02-21T09:03:09.5031509Z 2026-02-21T09:03:09.5031596Z // end inline asm 2026-02-21T09:03:09.5031652Z $L__tmp11: 2026-02-21T09:03:09.5031816Z .loc 1 40 93 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:40:93 2026-02-21T09:03:09.5031936Z cp.async.wait_group 0; 2026-02-21T09:03:09.5031993Z bar.sync 0; 2026-02-21T09:03:09.5032055Z add.s32 %r721, %r52, 73728; 2026-02-21T09:03:09.5032118Z // begin inline asm 2026-02-21T09:03:09.5032203Z @%p90 mbarrier.inval.shared::cta.b64 [%r721]; 2026-02-21T09:03:09.5032258Z // end inline asm 2026-02-21T09:03:09.5032311Z bar.sync 0; 2026-02-21T09:03:09.5032374Z // begin inline asm 2026-02-21T09:03:09.5032459Z @%p90 mbarrier.inval.shared::cta.b64 [%r88]; 2026-02-21T09:03:09.5032517Z // end inline asm 2026-02-21T09:03:09.5032687Z .loc 1 88 50 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:88:50 2026-02-21T09:03:09.5032748Z add.s32 %r794, %r37, %r33; 2026-02-21T09:03:09.5032937Z .loc 1 88 22 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:88:22 2026-02-21T09:03:09.5033015Z mad.wide.u32 %rd143, %r794, 2, %rd37; 2026-02-21T09:03:09.5033076Z cvt.u64.u32 %rd147, %r37; 
2026-02-21T09:03:09.5033136Z cvt.u64.u32 %rd148, %r33; 2026-02-21T09:03:09.5033202Z add.s64 %rd149, %rd148, %rd147; 2026-02-21T09:03:09.5033269Z shl.b64 %rd150, %rd149, 1; 2026-02-21T09:03:09.5033331Z add.s64 %rd151, %rd37, %rd150; 2026-02-21T09:03:09.5033392Z add.s64 %rd144, %rd151, 229376; 2026-02-21T09:03:09.5033458Z add.s64 %rd145, %rd151, 458752; 2026-02-21T09:03:09.5033516Z add.s64 %rd146, %rd151, 688128; 2026-02-21T09:03:09.5033571Z $L__tmp12: 2026-02-21T09:03:09.5033787Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5033852Z // begin inline asm 2026-02-21T09:03:09.5034137Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r723, %r724, %r725, %r726, %r727, %r728, %r729, %r730, %r731, %r732, %r733, %r734, %r735, %r736, %r737, %r738}, [%r756 + 0], 32; 2026-02-21T09:03:09.5034195Z // end inline asm 2026-02-21T09:03:09.5034260Z // begin inline asm 2026-02-21T09:03:09.5034538Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r740, %r741, %r742, %r743, %r744, %r745, %r746, %r747, %r748, %r749, %r750, %r751, %r752, %r753, %r754, %r755}, [%r756 + 16], 32; 2026-02-21T09:03:09.5034593Z // end inline asm 2026-02-21T09:03:09.5034657Z // begin inline asm 2026-02-21T09:03:09.5034728Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:03:09.5034783Z // end inline asm 2026-02-21T09:03:09.5034849Z cvt.u64.u32 %rd152, %r723; 2026-02-21T09:03:09.5034909Z cvt.u64.u32 %rd153, %r724; 2026-02-21T09:03:09.5034968Z shl.b64 %rd154, %rd153, 32; 2026-02-21T09:03:09.5035030Z or.b64 %rd155, %rd152, %rd154; 2026-02-21T09:03:09.5035091Z $L__tmp13: 2026-02-21T09:03:09.5035260Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5035323Z mov.b64 {%r795, %r796}, %rd155; 2026-02-21T09:03:09.5035404Z cvt.rn.bf16x2.f32 %r797, %r796, %r795; 2026-02-21T09:03:09.5035490Z $L__tmp14: 2026-02-21T09:03:09.5035701Z .loc 2 291 36 // standard.py:291:36 @[ 
c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5035763Z cvt.u64.u32 %rd156, %r725; 2026-02-21T09:03:09.5035831Z cvt.u64.u32 %rd157, %r726; 2026-02-21T09:03:09.5035892Z shl.b64 %rd158, %rd157, 32; 2026-02-21T09:03:09.5035953Z or.b64 %rd159, %rd156, %rd158; 2026-02-21T09:03:09.5036013Z $L__tmp15: 2026-02-21T09:03:09.5036173Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5036234Z mov.b64 {%r798, %r799}, %rd159; 2026-02-21T09:03:09.5036341Z cvt.rn.bf16x2.f32 %r800, %r799, %r798; 2026-02-21T09:03:09.5036393Z $L__tmp16: 2026-02-21T09:03:09.5036605Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5036664Z cvt.u64.u32 %rd160, %r727; 2026-02-21T09:03:09.5036731Z cvt.u64.u32 %rd161, %r728; 2026-02-21T09:03:09.5036791Z shl.b64 %rd162, %rd161, 32; 2026-02-21T09:03:09.5036850Z or.b64 %rd163, %rd160, %rd162; 2026-02-21T09:03:09.5036911Z $L__tmp17: 2026-02-21T09:03:09.5037100Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5037161Z mov.b64 {%r801, %r802}, %rd163; 2026-02-21T09:03:09.5037237Z cvt.rn.bf16x2.f32 %r803, %r802, %r801; 2026-02-21T09:03:09.5037289Z $L__tmp18: 2026-02-21T09:03:09.5037495Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5037557Z cvt.u64.u32 %rd164, %r729; 2026-02-21T09:03:09.5037623Z cvt.u64.u32 %rd165, %r730; 2026-02-21T09:03:09.5037683Z shl.b64 %rd166, %rd165, 32; 2026-02-21T09:03:09.5037742Z or.b64 %rd167, %rd164, %rd166; 2026-02-21T09:03:09.5037802Z $L__tmp19: 2026-02-21T09:03:09.5037987Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5038049Z mov.b64 {%r804, %r805}, %rd167; 2026-02-21T09:03:09.5038114Z cvt.rn.bf16x2.f32 %r806, %r805, %r804; 2026-02-21T09:03:09.5038175Z $L__tmp20: 2026-02-21T09:03:09.5038382Z .loc 2 
291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5038441Z cvt.u64.u32 %rd168, %r731; 2026-02-21T09:03:09.5038507Z cvt.u64.u32 %rd169, %r732; 2026-02-21T09:03:09.5038565Z shl.b64 %rd170, %rd169, 32; 2026-02-21T09:03:09.5038624Z or.b64 %rd171, %rd168, %rd170; 2026-02-21T09:03:09.5038685Z $L__tmp21: 2026-02-21T09:03:09.5038850Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5038910Z mov.b64 {%r807, %r808}, %rd171; 2026-02-21T09:03:09.5038976Z cvt.rn.bf16x2.f32 %r809, %r808, %r807; 2026-02-21T09:03:09.5039037Z $L__tmp22: 2026-02-21T09:03:09.5039248Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5039308Z cvt.u64.u32 %rd172, %r733; 2026-02-21T09:03:09.5039374Z cvt.u64.u32 %rd173, %r734; 2026-02-21T09:03:09.5039433Z shl.b64 %rd174, %rd173, 32; 2026-02-21T09:03:09.5039493Z or.b64 %rd175, %rd172, %rd174; 2026-02-21T09:03:09.5039553Z $L__tmp23: 2026-02-21T09:03:09.5039719Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5039777Z mov.b64 {%r810, %r811}, %rd175; 2026-02-21T09:03:09.5039842Z cvt.rn.bf16x2.f32 %r812, %r811, %r810; 2026-02-21T09:03:09.5039903Z $L__tmp24: 2026-02-21T09:03:09.5040109Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5040168Z cvt.u64.u32 %rd176, %r735; 2026-02-21T09:03:09.5040233Z cvt.u64.u32 %rd177, %r736; 2026-02-21T09:03:09.5040292Z shl.b64 %rd178, %rd177, 32; 2026-02-21T09:03:09.5040352Z or.b64 %rd179, %rd176, %rd178; 2026-02-21T09:03:09.5040437Z $L__tmp25: 2026-02-21T09:03:09.5040606Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5040664Z mov.b64 {%r813, %r814}, %rd179; 2026-02-21T09:03:09.5040730Z cvt.rn.bf16x2.f32 %r815, %r814, %r813; 2026-02-21T09:03:09.5040789Z $L__tmp26: 
2026-02-21T09:03:09.5040996Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5041054Z cvt.u64.u32 %rd180, %r737; 2026-02-21T09:03:09.5041117Z cvt.u64.u32 %rd181, %r738; 2026-02-21T09:03:09.5041202Z shl.b64 %rd182, %rd181, 32; 2026-02-21T09:03:09.5041261Z or.b64 %rd183, %rd180, %rd182; 2026-02-21T09:03:09.5041315Z $L__tmp27: 2026-02-21T09:03:09.5041490Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5041588Z mov.b64 {%r816, %r817}, %rd183; 2026-02-21T09:03:09.5041658Z cvt.rn.bf16x2.f32 %r818, %r817, %r816; 2026-02-21T09:03:09.5041718Z $L__tmp28: 2026-02-21T09:03:09.5041955Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5042017Z cvt.u64.u32 %rd184, %r740; 2026-02-21T09:03:09.5042082Z cvt.u64.u32 %rd185, %r741; 2026-02-21T09:03:09.5042141Z shl.b64 %rd186, %rd185, 32; 2026-02-21T09:03:09.5042201Z or.b64 %rd187, %rd184, %rd186; 2026-02-21T09:03:09.5042253Z $L__tmp29: 2026-02-21T09:03:09.5042426Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5042486Z mov.b64 {%r819, %r820}, %rd187; 2026-02-21T09:03:09.5042552Z cvt.rn.bf16x2.f32 %r821, %r820, %r819; 2026-02-21T09:03:09.5042613Z $L__tmp30: 2026-02-21T09:03:09.5042820Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5042904Z cvt.u64.u32 %rd188, %r742; 2026-02-21T09:03:09.5042973Z cvt.u64.u32 %rd189, %r743; 2026-02-21T09:03:09.5043032Z shl.b64 %rd190, %rd189, 32; 2026-02-21T09:03:09.5043093Z or.b64 %rd191, %rd188, %rd190; 2026-02-21T09:03:09.5043149Z $L__tmp31: 2026-02-21T09:03:09.5043327Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5043387Z mov.b64 {%r822, %r823}, %rd191; 2026-02-21T09:03:09.5043455Z cvt.rn.bf16x2.f32 %r824, %r823, %r822; 
2026-02-21T09:03:09.5043526Z $L__tmp32: 2026-02-21T09:03:09.5043748Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5043811Z cvt.u64.u32 %rd192, %r744; 2026-02-21T09:03:09.5043879Z cvt.u64.u32 %rd193, %r745; 2026-02-21T09:03:09.5043941Z shl.b64 %rd194, %rd193, 32; 2026-02-21T09:03:09.5044003Z or.b64 %rd195, %rd192, %rd194; 2026-02-21T09:03:09.5044058Z $L__tmp33: 2026-02-21T09:03:09.5044241Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5044307Z mov.b64 {%r825, %r826}, %rd195; 2026-02-21T09:03:09.5044380Z cvt.rn.bf16x2.f32 %r827, %r826, %r825; 2026-02-21T09:03:09.5044445Z $L__tmp34: 2026-02-21T09:03:09.5044665Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5044729Z cvt.u64.u32 %rd196, %r746; 2026-02-21T09:03:09.5044792Z cvt.u64.u32 %rd197, %r747; 2026-02-21T09:03:09.5044861Z shl.b64 %rd198, %rd197, 32; 2026-02-21T09:03:09.5044926Z or.b64 %rd199, %rd196, %rd198; 2026-02-21T09:03:09.5044983Z $L__tmp35: 2026-02-21T09:03:09.5045159Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5045221Z mov.b64 {%r828, %r829}, %rd199; 2026-02-21T09:03:09.5045289Z cvt.rn.bf16x2.f32 %r830, %r829, %r828; 2026-02-21T09:03:09.5045352Z $L__tmp36: 2026-02-21T09:03:09.5045571Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5045661Z cvt.u64.u32 %rd200, %r748; 2026-02-21T09:03:09.5045722Z cvt.u64.u32 %rd201, %r749; 2026-02-21T09:03:09.5045794Z shl.b64 %rd202, %rd201, 32; 2026-02-21T09:03:09.5045856Z or.b64 %rd203, %rd200, %rd202; 2026-02-21T09:03:09.5045911Z $L__tmp37: 2026-02-21T09:03:09.5046086Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5046147Z mov.b64 {%r831, %r832}, %rd203; 2026-02-21T09:03:09.5046217Z 
cvt.rn.bf16x2.f32 %r833, %r832, %r831; 2026-02-21T09:03:09.5046310Z $L__tmp38: 2026-02-21T09:03:09.5046523Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5046584Z cvt.u64.u32 %rd204, %r750; 2026-02-21T09:03:09.5046645Z cvt.u64.u32 %rd205, %r751; 2026-02-21T09:03:09.5046714Z shl.b64 %rd206, %rd205, 32; 2026-02-21T09:03:09.5046779Z or.b64 %rd207, %rd204, %rd206; 2026-02-21T09:03:09.5046832Z $L__tmp39: 2026-02-21T09:03:09.5047037Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5047101Z mov.b64 {%r834, %r835}, %rd207; 2026-02-21T09:03:09.5047172Z cvt.rn.bf16x2.f32 %r836, %r835, %r834; 2026-02-21T09:03:09.5047227Z $L__tmp40: 2026-02-21T09:03:09.5047455Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5047515Z cvt.u64.u32 %rd208, %r752; 2026-02-21T09:03:09.5047578Z cvt.u64.u32 %rd209, %r753; 2026-02-21T09:03:09.5047649Z shl.b64 %rd210, %rd209, 32; 2026-02-21T09:03:09.5047712Z or.b64 %rd211, %rd208, %rd210; 2026-02-21T09:03:09.5047766Z $L__tmp41: 2026-02-21T09:03:09.5047943Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5048031Z mov.b64 {%r837, %r838}, %rd211; 2026-02-21T09:03:09.5048103Z cvt.rn.bf16x2.f32 %r839, %r838, %r837; 2026-02-21T09:03:09.5048159Z $L__tmp42: 2026-02-21T09:03:09.5048387Z .loc 2 291 36 // standard.py:291:36 @[ c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:84:40 ] 2026-02-21T09:03:09.5048449Z cvt.u64.u32 %rd212, %r754; 2026-02-21T09:03:09.5048511Z cvt.u64.u32 %rd213, %r755; 2026-02-21T09:03:09.5048580Z shl.b64 %rd214, %rd213, 32; 2026-02-21T09:03:09.5048642Z or.b64 %rd215, %rd212, %rd214; 2026-02-21T09:03:09.5048697Z $L__tmp43: 2026-02-21T09:03:09.5048876Z .loc 1 87 28 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:87:28 2026-02-21T09:03:09.5048940Z mov.b64 {%r840, %r841}, 
%rd215; 2026-02-21T09:03:09.5049009Z cvt.rn.bf16x2.f32 %r842, %r841, %r840; 2026-02-21T09:03:09.5049182Z .loc 1 88 81 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:88:81 2026-02-21T09:03:09.5049289Z st.shared.v4.b32 [%r34], {%r797, %r809, %r821, %r833}; 2026-02-21T09:03:09.5049349Z bar.sync 0; 2026-02-21T09:03:09.5049408Z // begin inline asm 2026-02-21T09:03:09.5049576Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r777, %r781, %r785, %r789}, [%r761]; 2026-02-21T09:03:09.5049634Z // end inline asm 2026-02-21T09:03:09.5049690Z bar.sync 0; 2026-02-21T09:03:09.5049794Z st.shared.v4.b32 [%r34], {%r800, %r812, %r824, %r836}; 2026-02-21T09:03:09.5049853Z bar.sync 0; 2026-02-21T09:03:09.5049912Z // begin inline asm 2026-02-21T09:03:09.5050066Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r778, %r782, %r786, %r790}, [%r761]; 2026-02-21T09:03:09.5050131Z // end inline asm 2026-02-21T09:03:09.5050187Z bar.sync 0; 2026-02-21T09:03:09.5050279Z st.shared.v4.b32 [%r34], {%r803, %r815, %r827, %r839}; 2026-02-21T09:03:09.5050342Z bar.sync 0; 2026-02-21T09:03:09.5050400Z // begin inline asm 2026-02-21T09:03:09.5050547Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r779, %r783, %r787, %r791}, [%r761]; 2026-02-21T09:03:09.5050605Z // end inline asm 2026-02-21T09:03:09.5050695Z bar.sync 0; 2026-02-21T09:03:09.5050785Z st.shared.v4.b32 [%r34], {%r806, %r818, %r830, %r842}; 2026-02-21T09:03:09.5050841Z bar.sync 0; 2026-02-21T09:03:09.5050908Z // begin inline asm 2026-02-21T09:03:09.5051054Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r780, %r784, %r788, %r792}, [%r761]; 2026-02-21T09:03:09.5051112Z // end inline asm 2026-02-21T09:03:09.5051169Z // begin inline asm 2026-02-21T09:03:09.5051288Z st.global.v4.b32 [ %rd143 + 0 ], { %r777, %r778, %r779, %r780 }; 2026-02-21T09:03:09.5051345Z // end inline asm 2026-02-21T09:03:09.5051403Z // begin inline asm 2026-02-21T09:03:09.5051601Z st.global.v4.b32 [ %rd144 + 0 ], { %r781, %r782, %r783, %r784 }; 2026-02-21T09:03:09.5051661Z // end 
inline asm 2026-02-21T09:03:09.5051720Z // begin inline asm 2026-02-21T09:03:09.5051827Z st.global.v4.b32 [ %rd145 + 0 ], { %r785, %r786, %r787, %r788 }; 2026-02-21T09:03:09.5051885Z // end inline asm 2026-02-21T09:03:09.5051945Z // begin inline asm 2026-02-21T09:03:09.5052047Z st.global.v4.b32 [ %rd146 + 0 ], { %r789, %r790, %r791, %r792 }; 2026-02-21T09:03:09.5052112Z // end inline asm 2026-02-21T09:03:09.5052223Z $L__BB0_8: // %._crit_edge 2026-02-21T09:03:09.5052401Z .loc 1 19 4 // c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py:19:4 2026-02-21T09:03:09.5052466Z bar.sync 0; 2026-02-21T09:03:09.5052525Z // begin inline asm 2026-02-21T09:03:09.5052650Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r843, 256; 2026-02-21T09:03:09.5052714Z // end inline asm 2026-02-21T09:03:09.5052769Z ret; 2026-02-21T09:03:09.5052827Z $L__tmp44: 2026-02-21T09:03:09.5052896Z $L__func_end0: 2026-02-21T09:03:09.5052989Z // -- End function 2026-02-21T09:03:09.5053041Z } 2026-02-21T09:03:09.5053244Z .file 1 "/tmp/torchinductor_root/7j/c7jmqsa3bcnqcerdy22gouicy7rohfmyh2m2dxwd7r6c6zzq2vrm.py" 2026-02-21T09:03:09.5053458Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:03:09.5053526Z .section .debug_abbrev 2026-02-21T09:03:09.5053578Z { 2026-02-21T09:03:09.5053670Z .b8 1 // Abbreviation Code 2026-02-21T09:03:09.5053764Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:03:09.5053844Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:03:09.5053925Z .b8 37 // DW_AT_producer 2026-02-21T09:03:09.5054009Z .b8 8 // DW_FORM_string 2026-02-21T09:03:09.5054083Z .b8 19 // DW_AT_language 2026-02-21T09:03:09.5054159Z .b8 5 // DW_FORM_data2 2026-02-21T09:03:09.5054241Z .b8 3 // DW_AT_name 2026-02-21T09:03:09.5054314Z .b8 8 // DW_FORM_string 2026-02-21T09:03:09.5054391Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:03:09.5054466Z .b8 6 // DW_FORM_data4 2026-02-21T09:03:09.5054548Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:03:09.5054622Z .b8 8 // 
DW_FORM_string 2026-02-21T09:03:09.5054692Z .b8 0 // EOM(1) 2026-02-21T09:03:09.5054768Z .b8 0 // EOM(2) 2026-02-21T09:03:09.5054848Z .b8 2 // Abbreviation Code 2026-02-21T09:03:09.5054930Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:03:09.5055010Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:03:09.5055082Z .b8 3 // DW_AT_name 2026-02-21T09:03:09.5055154Z .b8 8 // DW_FORM_string 2026-02-21T09:03:09.5055235Z .b8 32 // DW_AT_inline 2026-02-21T09:03:09.5055309Z .b8 11 // DW_FORM_data1 2026-02-21T09:03:09.5055378Z .b8 0 // EOM(1) 2026-02-21T09:03:09.5055472Z .b8 0 // EOM(2) 2026-02-21T09:03:09.5055563Z .b8 3 // Abbreviation Code 2026-02-21T09:03:09.5055642Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:03:09.5055718Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:03:09.5055802Z .b8 17 // DW_AT_low_pc 2026-02-21T09:03:09.5055874Z .b8 1 // DW_FORM_addr 2026-02-21T09:03:09.5055950Z .b8 18 // DW_AT_high_pc 2026-02-21T09:03:09.5056057Z .b8 1 // DW_FORM_addr 2026-02-21T09:03:09.5056142Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:03:09.5056214Z .b8 19 // DW_FORM_ref4 2026-02-21T09:03:09.5056281Z .b8 0 // EOM(1) 2026-02-21T09:03:09.5056356Z .b8 0 // EOM(2) 2026-02-21T09:03:09.5056435Z .b8 4 // Abbreviation Code 2026-02-21T09:03:09.5056549Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:03:09.5056632Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:03:09.5056716Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:03:09.5056787Z .b8 19 // DW_FORM_ref4 2026-02-21T09:03:09.5056865Z .b8 17 // DW_AT_low_pc 2026-02-21T09:03:09.5056936Z .b8 1 // DW_FORM_addr 2026-02-21T09:03:09.5057013Z .b8 18 // DW_AT_high_pc 2026-02-21T09:03:09.5057084Z .b8 1 // DW_FORM_addr 2026-02-21T09:03:09.5057166Z .b8 88 // DW_AT_call_file 2026-02-21T09:03:09.5057265Z .b8 11 // DW_FORM_data1 2026-02-21T09:03:09.5057342Z .b8 89 // DW_AT_call_line 2026-02-21T09:03:09.5057425Z .b8 11 // DW_FORM_data1 2026-02-21T09:03:09.5057502Z .b8 87 // DW_AT_call_column 2026-02-21T09:03:09.5057574Z .b8 11 // DW_FORM_data1 2026-02-21T09:03:09.5057646Z .b8 0 
// EOM(1) 2026-02-21T09:03:09.5057713Z .b8 0 // EOM(2) 2026-02-21T09:03:09.5057778Z .b8 0 // EOM(3) 2026-02-21T09:03:09.5057830Z } 2026-02-21T09:03:09.5057899Z .section .debug_info 2026-02-21T09:03:09.5057949Z { 2026-02-21T09:03:09.5058033Z .b32 178 // Length of Unit 2026-02-21T09:03:09.5058125Z .b8 2 // DWARF version number 2026-02-21T09:03:09.5058176Z .b8 0 2026-02-21T09:03:09.5058292Z .b32 .debug_abbrev // Offset Into Abbrev. Section 2026-02-21T09:03:09.5058386Z .b8 8 // Address Size (in bytes) 2026-02-21T09:03:09.5058487Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:03:09.5058567Z .b8 116 // DW_AT_producer 2026-02-21T09:03:09.5058619Z .b8 114 2026-02-21T09:03:09.5058680Z .b8 105 2026-02-21T09:03:09.5058731Z .b8 116 2026-02-21T09:03:09.5058781Z .b8 111 2026-02-21T09:03:09.5058839Z .b8 110 2026-02-21T09:03:09.5058889Z .b8 0 2026-02-21T09:03:09.5058963Z .b8 2 // DW_AT_language 2026-02-21T09:03:09.5059012Z .b8 0 2026-02-21T09:03:09.5059094Z .b8 99 // DW_AT_name 2026-02-21T09:03:09.5059146Z .b8 55 2026-02-21T09:03:09.5059196Z .b8 106 2026-02-21T09:03:09.5059254Z .b8 109 2026-02-21T09:03:09.5059303Z .b8 113 2026-02-21T09:03:09.5059353Z .b8 115 2026-02-21T09:03:09.5059402Z .b8 97 2026-02-21T09:03:09.5059460Z .b8 51 2026-02-21T09:03:09.5059511Z .b8 98 2026-02-21T09:03:09.5059582Z .b8 99 2026-02-21T09:03:09.5059641Z .b8 110 2026-02-21T09:03:09.5059692Z .b8 113 2026-02-21T09:03:09.5059742Z .b8 99 2026-02-21T09:03:09.5059792Z .b8 101 2026-02-21T09:03:09.5059851Z .b8 114 2026-02-21T09:03:09.5059903Z .b8 100 2026-02-21T09:03:09.5059953Z .b8 121 2026-02-21T09:03:09.5060002Z .b8 50 2026-02-21T09:03:09.5060058Z .b8 50 2026-02-21T09:03:09.5060109Z .b8 103 2026-02-21T09:03:09.5060159Z .b8 111 2026-02-21T09:03:09.5060217Z .b8 117 2026-02-21T09:03:09.5060269Z .b8 105 2026-02-21T09:03:09.5060322Z .b8 99 2026-02-21T09:03:09.5060373Z .b8 121 2026-02-21T09:03:09.5060433Z .b8 55 2026-02-21T09:03:09.5060484Z .b8 114 2026-02-21T09:03:09.5060564Z .b8 111 
2026-02-21T09:03:09.5060623Z .b8 104 2026-02-21T09:03:09.5060674Z .b8 102 2026-02-21T09:03:09.5060723Z .b8 109 2026-02-21T09:03:09.5060773Z .b8 121 2026-02-21T09:03:09.5060831Z .b8 104 2026-02-21T09:03:09.5060880Z .b8 50 2026-02-21T09:03:09.5060931Z .b8 109 2026-02-21T09:03:09.5060988Z .b8 50 2026-02-21T09:03:09.5061040Z .b8 100 2026-02-21T09:03:09.5061091Z .b8 120 2026-02-21T09:03:09.5061142Z .b8 119 2026-02-21T09:03:09.5061200Z .b8 100 2026-02-21T09:03:09.5061250Z .b8 55 2026-02-21T09:03:09.5061299Z .b8 114 2026-02-21T09:03:09.5061369Z .b8 54 2026-02-21T09:03:09.5061429Z .b8 99 2026-02-21T09:03:09.5061478Z .b8 54 2026-02-21T09:03:09.5061528Z .b8 122 2026-02-21T09:03:09.5061606Z .b8 122 2026-02-21T09:03:09.5061657Z .b8 113 2026-02-21T09:03:09.5061707Z .b8 50 2026-02-21T09:03:09.5061759Z .b8 118 2026-02-21T09:03:09.5061817Z .b8 114 2026-02-21T09:03:09.5061867Z .b8 109 2026-02-21T09:03:09.5061916Z .b8 46 2026-02-21T09:03:09.5061973Z .b8 112 2026-02-21T09:03:09.5062023Z .b8 121 2026-02-21T09:03:09.5062074Z .b8 0 2026-02-21T09:03:09.5062165Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:03:09.5062249Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:03:09.5062299Z .b8 116 2026-02-21T09:03:09.5062349Z .b8 109 2026-02-21T09:03:09.5062406Z .b8 112 2026-02-21T09:03:09.5062455Z .b8 47 2026-02-21T09:03:09.5062535Z .b8 116 2026-02-21T09:03:09.5062588Z .b8 111 2026-02-21T09:03:09.5062649Z .b8 114 2026-02-21T09:03:09.5062700Z .b8 99 2026-02-21T09:03:09.5062751Z .b8 104 2026-02-21T09:03:09.5062803Z .b8 105 2026-02-21T09:03:09.5062863Z .b8 110 2026-02-21T09:03:09.5062914Z .b8 100 2026-02-21T09:03:09.5062964Z .b8 117 2026-02-21T09:03:09.5063022Z .b8 99 2026-02-21T09:03:09.5063073Z .b8 116 2026-02-21T09:03:09.5063123Z .b8 111 2026-02-21T09:03:09.5063172Z .b8 114 2026-02-21T09:03:09.5063228Z .b8 95 2026-02-21T09:03:09.5063279Z .b8 114 2026-02-21T09:03:09.5063327Z .b8 111 2026-02-21T09:03:09.5063383Z .b8 111 2026-02-21T09:03:09.5063432Z .b8 116 2026-02-21T09:03:09.5063482Z .b8 47 
2026-02-21T09:03:09.5063531Z .b8 55 2026-02-21T09:03:09.5063588Z .b8 106 2026-02-21T09:03:09.5063637Z .b8 0 2026-02-21T09:03:09.5063737Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:03:09.5063818Z .b8 95 // DW_AT_name 2026-02-21T09:03:09.5063870Z .b8 104 2026-02-21T09:03:09.5063921Z .b8 101 2026-02-21T09:03:09.5063970Z .b8 108 2026-02-21T09:03:09.5064028Z .b8 105 2026-02-21T09:03:09.5064077Z .b8 111 2026-02-21T09:03:09.5064126Z .b8 110 2026-02-21T09:03:09.5064183Z .b8 95 2026-02-21T09:03:09.5064234Z .b8 109 2026-02-21T09:03:09.5064283Z .b8 97 2026-02-21T09:03:09.5064332Z .b8 116 2026-02-21T09:03:09.5064387Z .b8 109 2026-02-21T09:03:09.5064437Z .b8 117 2026-02-21T09:03:09.5064487Z .b8 108 2026-02-21T09:03:09.5064536Z .b8 95 2026-02-21T09:03:09.5064592Z .b8 98 2026-02-21T09:03:09.5064642Z .b8 102 2026-02-21T09:03:09.5064692Z .b8 49 2026-02-21T09:03:09.5064747Z .b8 54 2026-02-21T09:03:09.5064799Z .b8 95 2026-02-21T09:03:09.5064849Z .b8 105 2026-02-21T09:03:09.5064898Z .b8 110 2026-02-21T09:03:09.5064956Z .b8 116 2026-02-21T09:03:09.5065005Z .b8 52 2026-02-21T09:03:09.5065054Z .b8 0 2026-02-21T09:03:09.5065136Z .b8 1 // DW_AT_inline 2026-02-21T09:03:09.5065234Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:03:09.5065352Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:03:09.5065441Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:03:09.5065539Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:03:09.5065651Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:03:09.5065739Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:03:09.5065830Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:03:09.5065915Z .b64 $L__tmp43 // DW_AT_high_pc 2026-02-21T09:03:09.5066020Z .b8 1 // DW_AT_call_file 2026-02-21T09:03:09.5066105Z .b8 84 // DW_AT_call_line 2026-02-21T09:03:09.5066187Z .b8 40 // DW_AT_call_column 2026-02-21T09:03:09.5066272Z .b8 0 // End Of Children Mark 2026-02-21T09:03:09.5066365Z .b8 0 // End Of Children Mark 
2026-02-21T09:03:09.5066418Z } 2026-02-21T09:03:09.5066486Z .section .debug_macinfo { } 2026-02-21T09:03:09.5066518Z 2026-02-21T09:03:09.5066597Z ================================================================ 2026-02-21T09:03:09.5066710Z please share the reproducer above with Triton project. 2026-02-21T09:03:10.2990977Z [155s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:03:10.2991333Z 2026-02-21T09:03:10.2991396Z 2026-02-21T09:03:10.2991400Z 2026-02-21T09:03:10.2991515Z ================================================================ 2026-02-21T09:03:10.2991956Z Internal Triton PTX codegen error 2026-02-21T09:03:10.2992141Z `ptxas` stderr: 2026-02-21T09:03:10.2992612Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 250 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:03:10.2994268Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:03:10.2995364Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:03:10.2995915Z `ptxas` stderr: 2026-02-21T09:03:10.2996348Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 250 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:03:10.2996855Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:03:10.2997010Z 2026-02-21T09:03:10.2997421Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpvlc23vw9.ptx -o /tmp/tmpvlc23vw9.ptx.o 2026-02-21T09:03:10.2997879Z 2026-02-21T09:03:10.2998018Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:03:10.2998311Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:03:10.2998465Z 2026-02-21T09:03:10.2998847Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpvlc23vw9.ptx -o /tmp/tmpvlc23vw9.ptx.o 2026-02-21T09:03:10.2999283Z 2026-02-21T09:03:10.2999286Z 2026-02-21T09:03:10.2999346Z // 2026-02-21T09:03:10.2999506Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:03:10.2999690Z // 2026-02-21T09:03:10.2999771Z 2026-02-21T09:03:10.2999831Z .version 8.7 2026-02-21T09:03:10.2999972Z .target sm_100a 2026-02-21T09:03:10.3000120Z .address_size 64 2026-02-21T09:03:10.3000207Z 2026-02-21T09:03:10.3000366Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:03:10.3000760Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:03:10.3001007Z // @_helion_matmul_bf16_int4 2026-02-21T09:03:10.3001236Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:03:10.3001500Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:03:10.3001833Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:03:10.3002124Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:03:10.3002410Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:03:10.3002740Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:03:10.3002972Z ) 2026-02-21T09:03:10.3003099Z 
.reqntid 128 2026-02-21T09:03:10.3003243Z .maxnreg 32 2026-02-21T09:03:10.3003372Z { 2026-02-21T09:03:10.3003509Z .reg .pred %p<92>; 2026-02-21T09:03:10.3003667Z .reg .b16 %rs<449>; 2026-02-21T09:03:10.3003830Z .reg .b32 %r<676>; 2026-02-21T09:03:10.3003975Z .reg .b64 %rd<180>; 2026-02-21T09:03:10.3004131Z $L__func_begin0: 2026-02-21T09:03:10.3004217Z 2026-02-21T09:03:10.3004319Z // %bb.0: 2026-02-21T09:03:10.3004576Z .loc 1 14 0 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:14 2026-02-21T09:03:10.3004885Z mov.u32 %r1, %tid.x; 2026-02-21T09:03:10.3005047Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:03:10.3005224Z mov.b32 %r48, global_smem; 2026-02-21T09:03:10.3005386Z // begin inline asm 2026-02-21T09:03:10.3005644Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r48], 256; 2026-02-21T09:03:10.3005889Z // end inline asm 2026-02-21T09:03:10.3006036Z bar.sync 0; 2026-02-21T09:03:10.3006188Z ld.shared.b32 %r670, [global_smem]; 2026-02-21T09:03:10.3006361Z bar.sync 0; 2026-02-21T09:03:10.3006501Z // begin inline asm 2026-02-21T09:03:10.3006732Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:03:10.3006965Z // end inline asm 2026-02-21T09:03:10.3007222Z .loc 1 19 46 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:19:46 2026-02-21T09:03:10.3007531Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:03:10.3007795Z .loc 1 19 98 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:19:98 2026-02-21T09:03:10.3008101Z setp.gt.u32 %p3, %r3, 223; 2026-02-21T09:03:10.3008275Z @%p3 bra $L__BB0_8; 2026-02-21T09:03:10.3008441Z // %bb.1: // %.lr.ph 2026-02-21T09:03:10.3008745Z .loc 1 0 98 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:0:98 2026-02-21T09:03:10.3009075Z ld.param.b64 %rd36, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:03:10.3009333Z ld.param.b64 %rd35, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:03:10.3009543Z and.b32 %r4, %r1, 32; 2026-02-21T09:03:10.3009817Z .loc 1 65 38 // 
cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:65:38 2026-02-21T09:03:10.3010120Z setp.eq.b32 %p11, %r4, 0; 2026-02-21T09:03:10.3010394Z .loc 1 41 38 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:41:38 2026-02-21T09:03:10.3010692Z and.b32 %r172, %r1, 15; 2026-02-21T09:03:10.3010852Z shl.b32 %r173, %r172, 3; 2026-02-21T09:03:10.3011121Z .loc 1 27 45 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:27:45 2026-02-21T09:03:10.3011405Z shr.u32 %r174, %r1, 1; 2026-02-21T09:03:10.3011615Z bfe.u32 %r175, %r1, 1, 6; 2026-02-21T09:03:10.3011882Z .loc 1 25 45 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:25:45 2026-02-21T09:03:10.3012161Z and.b32 %r176, %r1, 3; 2026-02-21T09:03:10.3012317Z shl.b32 %r177, %r176, 3; 2026-02-21T09:03:10.3012467Z shl.b32 %r178, %r1, 4; 2026-02-21T09:03:10.3012620Z and.b32 %r5, %r178, 16; 2026-02-21T09:03:10.3012768Z shr.u32 %r179, %r1, 5; 2026-02-21T09:03:10.3013028Z .loc 1 27 45 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:27:45 2026-02-21T09:03:10.3013344Z and.b32 %r180, %r1, 124; 2026-02-21T09:03:10.3013505Z bfe.u32 %r181, %r1, 2, 5; 2026-02-21T09:03:10.3013664Z shl.b32 %r182, %r1, 9; 2026-02-21T09:03:10.3013814Z and.b32 %r183, %r182, 57344; 2026-02-21T09:03:10.3013980Z setp.eq.b32 %p89, %r1, 0; 2026-02-21T09:03:10.3014135Z or.b32 %r184, %r183, %r173; 2026-02-21T09:03:10.3014310Z mad.wide.u32 %rd38, %r184, 2, %rd35; 2026-02-21T09:03:10.3014488Z add.s64 %rd39, %rd38, 131072; 2026-02-21T09:03:10.3014658Z add.s64 %rd40, %rd38, 262144; 2026-02-21T09:03:10.3014814Z add.s64 %rd41, %rd38, 393216; 2026-02-21T09:03:10.3015008Z add.s64 %rd42, %rd38, 524288; 2026-02-21T09:03:10.3015162Z add.s64 %rd43, %rd38, 655360; 2026-02-21T09:03:10.3015323Z add.s64 %rd44, %rd38, 786432; 2026-02-21T09:03:10.3015480Z add.s64 %rd45, %rd38, 917504; 2026-02-21T09:03:10.3015633Z and.b32 %r6, %r178, 2032; 2026-02-21T09:03:10.3015791Z add.s32 %r378, %r48, %r6; 2026-02-21T09:03:10.3015943Z add.s32 %r380, %r378, 2048; 
2026-02-21T09:03:10.3016104Z add.s32 %r382, %r378, 4096; 2026-02-21T09:03:10.3016255Z add.s32 %r384, %r378, 6144; 2026-02-21T09:03:10.3016438Z add.s32 %r386, %r378, 8192; 2026-02-21T09:03:10.3016595Z add.s32 %r388, %r378, 10240; 2026-02-21T09:03:10.3016756Z add.s32 %r390, %r378, 12288; 2026-02-21T09:03:10.3016920Z or.b32 %r14, %r178, 14336; 2026-02-21T09:03:10.3017073Z add.s32 %r392, %r48, %r14; 2026-02-21T09:03:10.3017233Z add.s32 %r328, %r670, 32; 2026-02-21T09:03:10.3017387Z mul.lo.s32 %r186, %r175, 7168; 2026-02-21T09:03:10.3017555Z add.s32 %r187, %r48, 49152; 2026-02-21T09:03:10.3017709Z add.s32 %r394, %r187, %r6; 2026-02-21T09:03:10.3017869Z add.s64 %rd47, %rd38, 256; 2026-02-21T09:03:10.3018020Z or.b32 %r188, %r184, 128; 2026-02-21T09:03:10.3018184Z mad.wide.u32 %rd56, %r188, 2, %rd35; 2026-02-21T09:03:10.3018356Z add.s64 %rd48, %rd56, 131072; 2026-02-21T09:03:10.3018519Z add.s64 %rd49, %rd56, 262144; 2026-02-21T09:03:10.3018710Z add.s64 %rd50, %rd56, 393216; 2026-02-21T09:03:10.3018865Z add.s64 %rd51, %rd56, 524288; 2026-02-21T09:03:10.3019026Z add.s64 %rd52, %rd56, 655360; 2026-02-21T09:03:10.3019180Z add.s64 %rd53, %rd56, 786432; 2026-02-21T09:03:10.3019343Z add.s64 %rd54, %rd56, 917504; 2026-02-21T09:03:10.3019498Z add.s32 %r189, %r48, 16384; 2026-02-21T09:03:10.3019664Z add.s32 %r86, %r189, %r6; 2026-02-21T09:03:10.3019814Z add.s32 %r88, %r86, 2048; 2026-02-21T09:03:10.3019968Z add.s32 %r90, %r86, 4096; 2026-02-21T09:03:10.3020116Z add.s32 %r92, %r86, 6144; 2026-02-21T09:03:10.3020268Z add.s32 %r94, %r86, 8192; 2026-02-21T09:03:10.3020422Z add.s32 %r96, %r86, 10240; 2026-02-21T09:03:10.3020589Z add.s32 %r98, %r86, 12288; 2026-02-21T09:03:10.3020749Z add.s32 %r100, %r189, %r14; 2026-02-21T09:03:10.3020902Z add.s32 %r102, %r378, 51200; 2026-02-21T09:03:10.3021060Z shl.b32 %r190, %r172, 8; 2026-02-21T09:03:10.3021209Z and.b32 %r191, %r1, 96; 2026-02-21T09:03:10.3021363Z shl.b32 %r192, %r191, 7; 2026-02-21T09:03:10.3021515Z and.b32 %r193, %r1, 16; 
2026-02-21T09:03:10.3021702Z shl.b32 %r194, %r193, 3; 2026-02-21T09:03:10.3021859Z or.b32 %r195, %r190, %r192; 2026-02-21T09:03:10.3022016Z or.b32 %r18, %r195, %r194; 2026-02-21T09:03:10.3022180Z add.s32 %r196, %r48, %r18; 2026-02-21T09:03:10.3022330Z and.b32 %r197, %r1, 31; 2026-02-21T09:03:10.3022486Z and.b32 %r198, %r174, 32; 2026-02-21T09:03:10.3022635Z or.b32 %r19, %r198, %r197; 2026-02-21T09:03:10.3022793Z add.s32 %r199, %r187, %r19; 2026-02-21T09:03:10.3022946Z shl.b32 %r200, %r197, 7; 2026-02-21T09:03:10.3023103Z and.b32 %r201, %r178, 112; 2026-02-21T09:03:10.3023257Z shr.u32 %r202, %r191, 3; 2026-02-21T09:03:10.3023415Z or.b32 %r203, %r200, %r202; 2026-02-21T09:03:10.3023577Z or.b32 %r204, %r203, %r201; 2026-02-21T09:03:10.3023728Z add.s32 %r205, %r48, 32768; 2026-02-21T09:03:10.3023887Z add.s32 %r20, %r205, %r204; 2026-02-21T09:03:10.3024036Z xor.b32 %r206, %r204, 16; 2026-02-21T09:03:10.3024195Z add.s32 %r21, %r205, %r206; 2026-02-21T09:03:10.3024381Z xor.b32 %r207, %r204, 32; 2026-02-21T09:03:10.3024536Z add.s32 %r22, %r205, %r207; 2026-02-21T09:03:10.3024685Z xor.b32 %r208, %r204, 48; 2026-02-21T09:03:10.3024839Z add.s32 %r23, %r205, %r208; 2026-02-21T09:03:10.3024988Z xor.b32 %r209, %r204, 64; 2026-02-21T09:03:10.3025142Z add.s32 %r24, %r205, %r209; 2026-02-21T09:03:10.3025296Z xor.b32 %r210, %r204, 80; 2026-02-21T09:03:10.3025443Z add.s32 %r25, %r205, %r210; 2026-02-21T09:03:10.3025602Z xor.b32 %r211, %r204, 96; 2026-02-21T09:03:10.3025751Z add.s32 %r26, %r205, %r211; 2026-02-21T09:03:10.3025908Z xor.b32 %r212, %r204, 112; 2026-02-21T09:03:10.3026090Z add.s32 %r27, %r205, %r212; 2026-02-21T09:03:10.3026251Z bfe.u32 %r213, %r205, 4, 14; 2026-02-21T09:03:10.3026405Z cvt.u64.u32 %rd57, %r213; 2026-02-21T09:03:10.3026572Z or.b64 %rd78, %rd57, 4611686293322072064; 2026-02-21T09:03:10.3026756Z add.s32 %r214, %r48, 32800; 2026-02-21T09:03:10.3026911Z bfe.u32 %r215, %r214, 4, 14; 2026-02-21T09:03:10.3027076Z cvt.u64.u32 %rd58, %r215; 
2026-02-21T09:03:10.3027235Z or.b64 %rd79, %rd58, 4611686293322072064; 2026-02-21T09:03:10.3027446Z add.s32 %r216, %r48, 32832; 2026-02-21T09:03:10.3027604Z bfe.u32 %r217, %r216, 4, 14; 2026-02-21T09:03:10.3027772Z cvt.u64.u32 %rd59, %r217; 2026-02-21T09:03:10.3027933Z or.b64 %rd80, %rd59, 4611686293322072064; 2026-02-21T09:03:10.3028121Z add.s32 %r218, %r48, 32864; 2026-02-21T09:03:10.3028281Z bfe.u32 %r219, %r218, 4, 14; 2026-02-21T09:03:10.3028452Z cvt.u64.u32 %rd60, %r219; 2026-02-21T09:03:10.3028622Z or.b64 %rd81, %rd60, 4611686293322072064; 2026-02-21T09:03:10.3028799Z add.s32 %r220, %r48, 36864; 2026-02-21T09:03:10.3028964Z bfe.u32 %r221, %r220, 4, 14; 2026-02-21T09:03:10.3029119Z cvt.u64.u32 %rd61, %r221; 2026-02-21T09:03:10.3029286Z or.b64 %rd82, %rd61, 4611686293322072064; 2026-02-21T09:03:10.3029462Z add.s32 %r222, %r48, 36896; 2026-02-21T09:03:10.3029634Z bfe.u32 %r223, %r222, 4, 14; 2026-02-21T09:03:10.3029823Z cvt.u64.u32 %rd62, %r223; 2026-02-21T09:03:10.3029990Z or.b64 %rd83, %rd62, 4611686293322072064; 2026-02-21T09:03:10.3030169Z add.s32 %r224, %r48, 36928; 2026-02-21T09:03:10.3030322Z bfe.u32 %r225, %r224, 4, 14; 2026-02-21T09:03:10.3030481Z cvt.u64.u32 %rd63, %r225; 2026-02-21T09:03:10.3030637Z or.b64 %rd84, %rd63, 4611686293322072064; 2026-02-21T09:03:10.3030812Z add.s32 %r226, %r48, 36960; 2026-02-21T09:03:10.3030963Z bfe.u32 %r227, %r226, 4, 14; 2026-02-21T09:03:10.3031124Z cvt.u64.u32 %rd64, %r227; 2026-02-21T09:03:10.3031278Z or.b64 %rd85, %rd64, 4611686293322072064; 2026-02-21T09:03:10.3031455Z add.s32 %r228, %r48, 40960; 2026-02-21T09:03:10.3031640Z bfe.u32 %r229, %r228, 4, 14; 2026-02-21T09:03:10.3031804Z cvt.u64.u32 %rd65, %r229; 2026-02-21T09:03:10.3031968Z or.b64 %rd86, %rd65, 4611686293322072064; 2026-02-21T09:03:10.3032138Z add.s32 %r230, %r48, 40992; 2026-02-21T09:03:10.3032298Z bfe.u32 %r231, %r230, 4, 14; 2026-02-21T09:03:10.3032450Z cvt.u64.u32 %rd66, %r231; 2026-02-21T09:03:10.3032615Z or.b64 %rd87, %rd66, 4611686293322072064; 
2026-02-21T09:03:10.3032788Z add.s32 %r232, %r48, 41024; 2026-02-21T09:03:10.3032950Z bfe.u32 %r233, %r232, 4, 14; 2026-02-21T09:03:10.3033106Z cvt.u64.u32 %rd67, %r233; 2026-02-21T09:03:10.3033274Z or.b64 %rd88, %rd67, 4611686293322072064; 2026-02-21T09:03:10.3033453Z add.s32 %r234, %r48, 41056; 2026-02-21T09:03:10.3033603Z bfe.u32 %r235, %r234, 4, 14; 2026-02-21T09:03:10.3033767Z cvt.u64.u32 %rd68, %r235; 2026-02-21T09:03:10.3033922Z or.b64 %rd89, %rd68, 4611686293322072064; 2026-02-21T09:03:10.3034101Z add.s32 %r236, %r48, 45056; 2026-02-21T09:03:10.3034255Z bfe.u32 %r237, %r236, 4, 14; 2026-02-21T09:03:10.3034414Z cvt.u64.u32 %rd69, %r237; 2026-02-21T09:03:10.3034569Z or.b64 %rd90, %rd69, 4611686293322072064; 2026-02-21T09:03:10.3034747Z add.s32 %r238, %r48, 45088; 2026-02-21T09:03:10.3034902Z bfe.u32 %r239, %r238, 4, 14; 2026-02-21T09:03:10.3035060Z cvt.u64.u32 %rd70, %r239; 2026-02-21T09:03:10.3035227Z or.b64 %rd91, %rd70, 4611686293322072064; 2026-02-21T09:03:10.3035431Z add.s32 %r240, %r48, 45120; 2026-02-21T09:03:10.3035591Z bfe.u32 %r241, %r240, 4, 14; 2026-02-21T09:03:10.3035744Z cvt.u64.u32 %rd71, %r241; 2026-02-21T09:03:10.3035908Z or.b64 %rd92, %rd71, 4611686293322072064; 2026-02-21T09:03:10.3036077Z add.s32 %r242, %r48, 45152; 2026-02-21T09:03:10.3036239Z bfe.u32 %r243, %r242, 4, 14; 2026-02-21T09:03:10.3036391Z cvt.u64.u32 %rd72, %r243; 2026-02-21T09:03:10.3036562Z or.b64 %rd93, %rd72, 4611686293322072064; 2026-02-21T09:03:10.3036751Z add.s64 %rd95, %rd38, 512; 2026-02-21T09:03:10.3036910Z or.b32 %r244, %r184, 256; 2026-02-21T09:03:10.3037107Z mad.wide.u32 %rd73, %r244, 2, %rd35; 2026-02-21T09:03:10.3037278Z add.s64 %rd96, %rd73, 131072; 2026-02-21T09:03:10.3037444Z add.s64 %rd97, %rd73, 262144; 2026-02-21T09:03:10.3037602Z add.s64 %rd98, %rd73, 393216; 2026-02-21T09:03:10.3037764Z add.s64 %rd99, %rd73, 524288; 2026-02-21T09:03:10.3037922Z add.s64 %rd100, %rd73, 655360; 2026-02-21T09:03:10.3038096Z add.s64 %rd101, %rd73, 786432; 
2026-02-21T09:03:10.3038256Z add.s64 %rd102, %rd73, 917504; 2026-02-21T09:03:10.3038450Z and.b32 %r245, %r182, 3072; 2026-02-21T09:03:10.3038615Z shl.b32 %r246, %r172, 4; 2026-02-21T09:03:10.3038766Z shl.b32 %r247, %r191, 3; 2026-02-21T09:03:10.3038921Z shl.b32 %r248, %r193, 2; 2026-02-21T09:03:10.3039067Z or.b32 %r249, %r246, %r247; 2026-02-21T09:03:10.3039230Z xor.b32 %r250, %r249, %r248; 2026-02-21T09:03:10.3039382Z or.b32 %r251, %r250, %r245; 2026-02-21T09:03:10.3039539Z xor.b32 %r252, %r251, 32; 2026-02-21T09:03:10.3039688Z shl.b32 %r253, %r1, 7; 2026-02-21T09:03:10.3039847Z and.b32 %r254, %r253, 3072; 2026-02-21T09:03:10.3040006Z shl.b32 %r255, %r176, 5; 2026-02-21T09:03:10.3040153Z shl.b32 %r256, %r180, 2; 2026-02-21T09:03:10.3040322Z or.b32 %r257, %r254, %r255; 2026-02-21T09:03:10.3040483Z xor.b32 %r258, %r257, %r256; 2026-02-21T09:03:10.3040823Z .loc 1 19 98 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:19:98 2026-02-21T09:03:10.3041129Z cvt.u64.u32 %rd25, %r175; 2026-02-21T09:03:10.3041408Z .loc 1 24 27 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:24:27 2026-02-21T09:03:10.3041733Z shl.b32 %r32, %r3, 5; 2026-02-21T09:03:10.3042013Z .loc 1 25 32 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:25:32 2026-02-21T09:03:10.3042320Z or.b32 %r259, %r32, %r5; 2026-02-21T09:03:10.3042478Z $L__tmp0: 2026-02-21T09:03:10.3042793Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3043164Z shfl.sync.idx.b32 %r34, %r179, 0, 31, -1; 2026-02-21T09:03:10.3043363Z shl.b32 %r260, %r34, 21; 2026-02-21T09:03:10.3043525Z and.b32 %r261, %r260, 6291456; 2026-02-21T09:03:10.3043700Z add.s32 %r635, %r261, %r670; 2026-02-21T09:03:10.3043874Z mov.pred %p15, -1; 2026-02-21T09:03:10.3044029Z mov.b32 %r671, 0; 2026-02-21T09:03:10.3044188Z // begin inline asm 2026-02-21T09:03:10.3044584Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r635 + 0], 16, {%r671, %r671, %r671, %r671, 
%r671, %r671, %r671, %r671, %r671, %r671, %r671, %r671, %r671, %r671, %r671, %r671}; 2026-02-21T09:03:10.3045007Z // end inline asm 2026-02-21T09:03:10.3045152Z // begin inline asm 2026-02-21T09:03:10.3045321Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:03:10.3045492Z // end inline asm 2026-02-21T09:03:10.3045641Z bar.sync 0; 2026-02-21T09:03:10.3045774Z $L__tmp1: 2026-02-21T09:03:10.3046038Z .loc 1 34 93 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:34:93 2026-02-21T09:03:10.3046351Z add.s32 %r672, %r48, 53248; 2026-02-21T09:03:10.3046515Z // begin inline asm 2026-02-21T09:03:10.3046703Z @%p89 mbarrier.init.shared::cta.b64 [%r672], 1; 2026-02-21T09:03:10.3046906Z // end inline asm 2026-02-21T09:03:10.3047059Z bar.sync 0; 2026-02-21T09:03:10.3047201Z add.s32 %r67, %r48, 53256; 2026-02-21T09:03:10.3047404Z // begin inline asm 2026-02-21T09:03:10.3047576Z @%p89 mbarrier.init.shared::cta.b64 [%r67], 1; 2026-02-21T09:03:10.3047780Z // end inline asm 2026-02-21T09:03:10.3047921Z mov.b32 %r379, 16; 2026-02-21T09:03:10.3048162Z .loc 1 42 80 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:42:80 2026-02-21T09:03:10.3048449Z // begin inline asm 2026-02-21T09:03:10.3048651Z cp.async.cg.shared.global [ %r378 + 0 ], [ %rd38 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3048884Z // end inline asm 2026-02-21T09:03:10.3049016Z // begin inline asm 2026-02-21T09:03:10.3049214Z cp.async.cg.shared.global [ %r380 + 0 ], [ %rd39 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3049463Z // end inline asm 2026-02-21T09:03:10.3049602Z // begin inline asm 2026-02-21T09:03:10.3049800Z cp.async.cg.shared.global [ %r382 + 0 ], [ %rd40 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3050017Z // end inline asm 2026-02-21T09:03:10.3050158Z // begin inline asm 2026-02-21T09:03:10.3050346Z cp.async.cg.shared.global [ %r384 + 0 ], [ %rd41 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3050568Z // end inline asm 2026-02-21T09:03:10.3050700Z // begin inline asm 2026-02-21T09:03:10.3050929Z 
cp.async.cg.shared.global [ %r386 + 0 ], [ %rd42 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3051150Z // end inline asm 2026-02-21T09:03:10.3051292Z // begin inline asm 2026-02-21T09:03:10.3051486Z cp.async.cg.shared.global [ %r388 + 0 ], [ %rd43 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3051727Z // end inline asm 2026-02-21T09:03:10.3051867Z // begin inline asm 2026-02-21T09:03:10.3052058Z cp.async.cg.shared.global [ %r390 + 0 ], [ %rd44 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3052288Z // end inline asm 2026-02-21T09:03:10.3052421Z // begin inline asm 2026-02-21T09:03:10.3052623Z cp.async.cg.shared.global [ %r392 + 0 ], [ %rd45 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3052846Z // end inline asm 2026-02-21T09:03:10.3052994Z cp.async.commit_group; 2026-02-21T09:03:10.3053294Z .loc 1 48 62 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:62 2026-02-21T09:03:10.3053582Z add.s32 %r262, %r259, %r186; 2026-02-21T09:03:10.3053852Z .loc 1 48 34 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:34 2026-02-21T09:03:10.3054133Z cvt.u64.u32 %rd74, %r262; 2026-02-21T09:03:10.3054296Z add.s64 %rd46, %rd36, %rd74; 2026-02-21T09:03:10.3054553Z .loc 1 48 87 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:87 2026-02-21T09:03:10.3054838Z // begin inline asm 2026-02-21T09:03:10.3055038Z cp.async.cg.shared.global [ %r394 + 0 ], [ %rd46 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3055260Z // end inline asm 2026-02-21T09:03:10.3055405Z cp.async.commit_group; 2026-02-21T09:03:10.3055664Z .loc 1 42 80 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:42:80 2026-02-21T09:03:10.3055951Z bar.sync 0; 2026-02-21T09:03:10.3056080Z // begin inline asm 2026-02-21T09:03:10.3056286Z cp.async.cg.shared.global [ %r86 + 0 ], [ %rd47 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3056504Z // end inline asm 2026-02-21T09:03:10.3056646Z // begin inline asm 2026-02-21T09:03:10.3056850Z cp.async.cg.shared.global [ %r88 + 0 ], [ %rd48 + 0 ], 0x10, %r379; 
2026-02-21T09:03:10.3057065Z // end inline asm 2026-02-21T09:03:10.3057215Z // begin inline asm 2026-02-21T09:03:10.3057405Z cp.async.cg.shared.global [ %r90 + 0 ], [ %rd49 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3057630Z // end inline asm 2026-02-21T09:03:10.3057762Z // begin inline asm 2026-02-21T09:03:10.3057956Z cp.async.cg.shared.global [ %r92 + 0 ], [ %rd50 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3058172Z // end inline asm 2026-02-21T09:03:10.3058312Z // begin inline asm 2026-02-21T09:03:10.3058504Z cp.async.cg.shared.global [ %r94 + 0 ], [ %rd51 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3058721Z // end inline asm 2026-02-21T09:03:10.3058858Z // begin inline asm 2026-02-21T09:03:10.3059049Z cp.async.cg.shared.global [ %r96 + 0 ], [ %rd52 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3059304Z // end inline asm 2026-02-21T09:03:10.3059437Z // begin inline asm 2026-02-21T09:03:10.3059634Z cp.async.cg.shared.global [ %r98 + 0 ], [ %rd53 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3059849Z // end inline asm 2026-02-21T09:03:10.3059988Z // begin inline asm 2026-02-21T09:03:10.3060178Z cp.async.cg.shared.global [ %r100 + 0 ], [ %rd54 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3060399Z // end inline asm 2026-02-21T09:03:10.3060545Z cp.async.commit_group; 2026-02-21T09:03:10.3060804Z .loc 1 48 34 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:34 2026-02-21T09:03:10.3061126Z cvt.u64.u32 %rd75, %r259; 2026-02-21T09:03:10.3061284Z cvt.u64.u32 %rd76, %r186; 2026-02-21T09:03:10.3061445Z add.s64 %rd77, %rd75, %rd76; 2026-02-21T09:03:10.3061632Z add.s64 %rd26, %rd36, %rd77; 2026-02-21T09:03:10.3061802Z add.s64 %rd55, %rd26, 458752; 2026-02-21T09:03:10.3062074Z .loc 1 48 87 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:87 2026-02-21T09:03:10.3062362Z // begin inline asm 2026-02-21T09:03:10.3062566Z cp.async.cg.shared.global [ %r102 + 0 ], [ %rd55 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3062814Z // end inline asm 2026-02-21T09:03:10.3062964Z 
cp.async.commit_group; 2026-02-21T09:03:10.3063227Z .loc 1 42 80 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:42:80 2026-02-21T09:03:10.3063527Z cp.async.wait_group 2; 2026-02-21T09:03:10.3063681Z bar.sync 0; 2026-02-21T09:03:10.3063863Z ld.shared.v4.b32 {%r263, %r264, %r265, %r266}, [%r196]; 2026-02-21T09:03:10.3064082Z mov.b32 {%rs1, %rs2}, %r266; 2026-02-21T09:03:10.3064243Z mov.b32 {%rs3, %rs4}, %r265; 2026-02-21T09:03:10.3064408Z mov.b32 {%rs5, %rs6}, %r264; 2026-02-21T09:03:10.3064564Z mov.b32 {%rs7, %rs8}, %r263; 2026-02-21T09:03:10.3064766Z ld.shared.v4.b32 {%r267, %r268, %r269, %r270}, [%r196+16]; 2026-02-21T09:03:10.3064978Z mov.b32 {%rs9, %rs10}, %r270; 2026-02-21T09:03:10.3065181Z mov.b32 {%rs11, %rs12}, %r269; 2026-02-21T09:03:10.3065348Z mov.b32 {%rs13, %rs14}, %r268; 2026-02-21T09:03:10.3065518Z mov.b32 {%rs15, %rs16}, %r267; 2026-02-21T09:03:10.3065712Z ld.shared.v4.b32 {%r271, %r272, %r273, %r274}, [%r196+32]; 2026-02-21T09:03:10.3065929Z mov.b32 {%rs17, %rs18}, %r274; 2026-02-21T09:03:10.3066095Z mov.b32 {%rs19, %rs20}, %r273; 2026-02-21T09:03:10.3066250Z mov.b32 {%rs21, %rs22}, %r272; 2026-02-21T09:03:10.3066417Z mov.b32 {%rs23, %rs24}, %r271; 2026-02-21T09:03:10.3066612Z ld.shared.v4.b32 {%r275, %r276, %r277, %r278}, [%r196+48]; 2026-02-21T09:03:10.3066827Z mov.b32 {%rs25, %rs26}, %r278; 2026-02-21T09:03:10.3066986Z mov.b32 {%rs27, %rs28}, %r277; 2026-02-21T09:03:10.3067149Z mov.b32 {%rs29, %rs30}, %r276; 2026-02-21T09:03:10.3067302Z mov.b32 {%rs31, %rs32}, %r275; 2026-02-21T09:03:10.3067500Z ld.shared.v4.b32 {%r279, %r280, %r281, %r282}, [%r196+64]; 2026-02-21T09:03:10.3067706Z mov.b32 {%rs33, %rs34}, %r282; 2026-02-21T09:03:10.3067863Z mov.b32 {%rs35, %rs36}, %r281; 2026-02-21T09:03:10.3068027Z mov.b32 {%rs37, %rs38}, %r280; 2026-02-21T09:03:10.3068182Z mov.b32 {%rs39, %rs40}, %r279; 2026-02-21T09:03:10.3068376Z ld.shared.v4.b32 {%r283, %r284, %r285, %r286}, [%r196+80]; 2026-02-21T09:03:10.3068575Z mov.b32 {%rs41, %rs42}, %r286; 
2026-02-21T09:03:10.3068738Z mov.b32 {%rs43, %rs44}, %r285; 2026-02-21T09:03:10.3068893Z mov.b32 {%rs45, %rs46}, %r284; 2026-02-21T09:03:10.3069056Z mov.b32 {%rs47, %rs48}, %r283; 2026-02-21T09:03:10.3069251Z ld.shared.v4.b32 {%r287, %r288, %r289, %r290}, [%r196+96]; 2026-02-21T09:03:10.3069448Z mov.b32 {%rs49, %rs50}, %r290; 2026-02-21T09:03:10.3069612Z mov.b32 {%rs51, %rs52}, %r289; 2026-02-21T09:03:10.3069766Z mov.b32 {%rs53, %rs54}, %r288; 2026-02-21T09:03:10.3069928Z mov.b32 {%rs55, %rs56}, %r287; 2026-02-21T09:03:10.3070120Z ld.shared.v4.b32 {%r291, %r292, %r293, %r294}, [%r196+112]; 2026-02-21T09:03:10.3070329Z mov.b32 {%rs57, %rs58}, %r294; 2026-02-21T09:03:10.3070484Z mov.b32 {%rs59, %rs60}, %r293; 2026-02-21T09:03:10.3070676Z mov.b32 {%rs61, %rs62}, %r292; 2026-02-21T09:03:10.3070839Z mov.b32 {%rs63, %rs64}, %r291; 2026-02-21T09:03:10.3071104Z .loc 1 46 32 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:46:32 2026-02-21T09:03:10.3071400Z cvt.f32.bf16 %r105, %rs7; 2026-02-21T09:03:10.3071584Z cvt.f32.bf16 %r106, %rs8; 2026-02-21T09:03:10.3071745Z cvt.f32.bf16 %r107, %rs5; 2026-02-21T09:03:10.3071894Z cvt.f32.bf16 %r108, %rs6; 2026-02-21T09:03:10.3072051Z cvt.f32.bf16 %r109, %rs3; 2026-02-21T09:03:10.3072198Z cvt.f32.bf16 %r110, %rs4; 2026-02-21T09:03:10.3072356Z cvt.f32.bf16 %r111, %rs1; 2026-02-21T09:03:10.3072534Z cvt.f32.bf16 %r112, %rs2; 2026-02-21T09:03:10.3072686Z cvt.f32.bf16 %r113, %rs15; 2026-02-21T09:03:10.3072847Z cvt.f32.bf16 %r114, %rs16; 2026-02-21T09:03:10.3072998Z cvt.f32.bf16 %r115, %rs13; 2026-02-21T09:03:10.3073159Z cvt.f32.bf16 %r116, %rs14; 2026-02-21T09:03:10.3073311Z cvt.f32.bf16 %r117, %rs11; 2026-02-21T09:03:10.3073474Z cvt.f32.bf16 %r118, %rs12; 2026-02-21T09:03:10.3073621Z cvt.f32.bf16 %r119, %rs9; 2026-02-21T09:03:10.3073776Z cvt.f32.bf16 %r120, %rs10; 2026-02-21T09:03:10.3073955Z cvt.f32.bf16 %r122, %rs23; 2026-02-21T09:03:10.3074117Z cvt.f32.bf16 %r123, %rs24; 2026-02-21T09:03:10.3074277Z cvt.f32.bf16 %r124, %rs21; 
2026-02-21T09:03:10.3074429Z cvt.f32.bf16 %r125, %rs22; 2026-02-21T09:03:10.3074586Z cvt.f32.bf16 %r126, %rs19; 2026-02-21T09:03:10.3074736Z cvt.f32.bf16 %r127, %rs20; 2026-02-21T09:03:10.3074895Z cvt.f32.bf16 %r128, %rs17; 2026-02-21T09:03:10.3075048Z cvt.f32.bf16 %r129, %rs18; 2026-02-21T09:03:10.3075212Z cvt.f32.bf16 %r130, %rs31; 2026-02-21T09:03:10.3075362Z cvt.f32.bf16 %r131, %rs32; 2026-02-21T09:03:10.3075522Z cvt.f32.bf16 %r132, %rs29; 2026-02-21T09:03:10.3075671Z cvt.f32.bf16 %r133, %rs30; 2026-02-21T09:03:10.3075827Z cvt.f32.bf16 %r134, %rs27; 2026-02-21T09:03:10.3075984Z cvt.f32.bf16 %r135, %rs28; 2026-02-21T09:03:10.3076163Z cvt.f32.bf16 %r136, %rs25; 2026-02-21T09:03:10.3076324Z cvt.f32.bf16 %r137, %rs26; 2026-02-21T09:03:10.3076474Z cvt.f32.bf16 %r139, %rs39; 2026-02-21T09:03:10.3076629Z cvt.f32.bf16 %r140, %rs40; 2026-02-21T09:03:10.3076781Z cvt.f32.bf16 %r141, %rs37; 2026-02-21T09:03:10.3076935Z cvt.f32.bf16 %r142, %rs38; 2026-02-21T09:03:10.3077082Z cvt.f32.bf16 %r143, %rs35; 2026-02-21T09:03:10.3077239Z cvt.f32.bf16 %r144, %rs36; 2026-02-21T09:03:10.3077388Z cvt.f32.bf16 %r145, %rs33; 2026-02-21T09:03:10.3077544Z cvt.f32.bf16 %r146, %rs34; 2026-02-21T09:03:10.3077703Z cvt.f32.bf16 %r147, %rs47; 2026-02-21T09:03:10.3077852Z cvt.f32.bf16 %r148, %rs48; 2026-02-21T09:03:10.3078012Z cvt.f32.bf16 %r149, %rs45; 2026-02-21T09:03:10.3078162Z cvt.f32.bf16 %r150, %rs46; 2026-02-21T09:03:10.3078319Z cvt.f32.bf16 %r151, %rs43; 2026-02-21T09:03:10.3078468Z cvt.f32.bf16 %r152, %rs44; 2026-02-21T09:03:10.3078626Z cvt.f32.bf16 %r153, %rs41; 2026-02-21T09:03:10.3078776Z cvt.f32.bf16 %r154, %rs42; 2026-02-21T09:03:10.3078935Z cvt.f32.bf16 %r156, %rs55; 2026-02-21T09:03:10.3079086Z cvt.f32.bf16 %r157, %rs56; 2026-02-21T09:03:10.3079241Z cvt.f32.bf16 %r158, %rs53; 2026-02-21T09:03:10.3079398Z cvt.f32.bf16 %r159, %rs54; 2026-02-21T09:03:10.3079549Z cvt.f32.bf16 %r160, %rs51; 2026-02-21T09:03:10.3079705Z cvt.f32.bf16 %r161, %rs52; 2026-02-21T09:03:10.3079852Z 
cvt.f32.bf16 %r162, %rs49; 2026-02-21T09:03:10.3080006Z cvt.f32.bf16 %r163, %rs50; 2026-02-21T09:03:10.3080154Z cvt.f32.bf16 %r164, %rs63; 2026-02-21T09:03:10.3080309Z cvt.f32.bf16 %r165, %rs64; 2026-02-21T09:03:10.3080457Z cvt.f32.bf16 %r166, %rs61; 2026-02-21T09:03:10.3080615Z cvt.f32.bf16 %r167, %rs62; 2026-02-21T09:03:10.3080765Z cvt.f32.bf16 %r168, %rs59; 2026-02-21T09:03:10.3080923Z cvt.f32.bf16 %r169, %rs60; 2026-02-21T09:03:10.3081082Z cvt.f32.bf16 %r170, %rs57; 2026-02-21T09:03:10.3081231Z cvt.f32.bf16 %r171, %rs58; 2026-02-21T09:03:10.3081386Z $L__tmp2: 2026-02-21T09:03:10.3081702Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3082082Z add.s32 %r104, %r261, %r328; 2026-02-21T09:03:10.3082237Z // begin inline asm 2026-02-21T09:03:10.3082623Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r104 + 0], 64, {%r105, %r106, %r107, %r108, %r109, %r110, %r111, %r112, %r113, %r114, %r115, %r116, %r117, %r118, %r119, %r120}; 2026-02-21T09:03:10.3083030Z // end inline asm 2026-02-21T09:03:10.3083171Z // begin inline asm 2026-02-21T09:03:10.3083552Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r104 + 16], 64, {%r122, %r123, %r124, %r125, %r126, %r127, %r128, %r129, %r130, %r131, %r132, %r133, %r134, %r135, %r136, %r137}; 2026-02-21T09:03:10.3083979Z // end inline asm 2026-02-21T09:03:10.3084142Z // begin inline asm 2026-02-21T09:03:10.3084536Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r104 + 32], 64, {%r139, %r140, %r141, %r142, %r143, %r144, %r145, %r146, %r147, %r148, %r149, %r150, %r151, %r152, %r153, %r154}; 2026-02-21T09:03:10.3084970Z // end inline asm 2026-02-21T09:03:10.3085123Z // begin inline asm 2026-02-21T09:03:10.3085543Z @%p15 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r104 + 48], 64, {%r156, %r157, %r158, %r159, %r160, %r161, %r162, %r163, %r164, %r165, %r166, %r167, %r168, %r169, %r170, %r171}; 2026-02-21T09:03:10.3085958Z // end inline asm 
2026-02-21T09:03:10.3086103Z // begin inline asm 2026-02-21T09:03:10.3086279Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:03:10.3086454Z // end inline asm 2026-02-21T09:03:10.3086605Z bar.sync 0; 2026-02-21T09:03:10.3086745Z $L__tmp3: 2026-02-21T09:03:10.3087012Z .loc 1 48 87 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:87 2026-02-21T09:03:10.3087336Z ld.shared.b8 %rs65, [%r199]; 2026-02-21T09:03:10.3087518Z ld.shared.b8 %rs66, [%r199+64]; 2026-02-21T09:03:10.3087714Z ld.shared.b8 %rs67, [%r199+128]; 2026-02-21T09:03:10.3087902Z ld.shared.b8 %rs68, [%r199+192]; 2026-02-21T09:03:10.3088114Z ld.shared.b8 %rs69, [%r199+256]; 2026-02-21T09:03:10.3088288Z ld.shared.b8 %rs70, [%r199+320]; 2026-02-21T09:03:10.3088468Z ld.shared.b8 %rs71, [%r199+384]; 2026-02-21T09:03:10.3088639Z ld.shared.b8 %rs72, [%r199+448]; 2026-02-21T09:03:10.3088815Z ld.shared.b8 %rs73, [%r199+512]; 2026-02-21T09:03:10.3088992Z ld.shared.b8 %rs74, [%r199+576]; 2026-02-21T09:03:10.3089160Z ld.shared.b8 %rs75, [%r199+640]; 2026-02-21T09:03:10.3089338Z ld.shared.b8 %rs76, [%r199+704]; 2026-02-21T09:03:10.3089506Z ld.shared.b8 %rs77, [%r199+768]; 2026-02-21T09:03:10.3089680Z ld.shared.b8 %rs78, [%r199+832]; 2026-02-21T09:03:10.3089848Z ld.shared.b8 %rs79, [%r199+896]; 2026-02-21T09:03:10.3090025Z ld.shared.b8 %rs80, [%r199+960]; 2026-02-21T09:03:10.3090199Z ld.shared.b8 %rs81, [%r199+1024]; 2026-02-21T09:03:10.3090385Z ld.shared.b8 %rs82, [%r199+1088]; 2026-02-21T09:03:10.3090565Z ld.shared.b8 %rs83, [%r199+1152]; 2026-02-21T09:03:10.3090738Z ld.shared.b8 %rs84, [%r199+1216]; 2026-02-21T09:03:10.3090917Z ld.shared.b8 %rs85, [%r199+1280]; 2026-02-21T09:03:10.3090981Z ld.shared.b8 %rs86, [%r199+1344]; 2026-02-21T09:03:10.3091042Z ld.shared.b8 %rs87, [%r199+1408]; 2026-02-21T09:03:10.3091113Z ld.shared.b8 %rs88, [%r199+1472]; 2026-02-21T09:03:10.3091174Z ld.shared.b8 %rs89, [%r199+1536]; 2026-02-21T09:03:10.3091235Z ld.shared.b8 %rs90, [%r199+1600]; 2026-02-21T09:03:10.3091296Z 
ld.shared.b8 %rs91, [%r199+1664]; 2026-02-21T09:03:10.3091364Z ld.shared.b8 %rs92, [%r199+1728]; 2026-02-21T09:03:10.3091426Z ld.shared.b8 %rs93, [%r199+1792]; 2026-02-21T09:03:10.3091487Z ld.shared.b8 %rs94, [%r199+1856]; 2026-02-21T09:03:10.3091577Z ld.shared.b8 %rs95, [%r199+1920]; 2026-02-21T09:03:10.3091643Z ld.shared.b8 %rs96, [%r199+1984]; 2026-02-21T09:03:10.3091828Z .loc 1 51 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:51:28 2026-02-21T09:03:10.3091900Z shl.b16 %rs97, %rs65, 4; 2026-02-21T09:03:10.3091973Z shl.b16 %rs98, %rs66, 4; 2026-02-21T09:03:10.3092032Z shl.b16 %rs99, %rs67, 4; 2026-02-21T09:03:10.3092117Z shl.b16 %rs100, %rs68, 4; 2026-02-21T09:03:10.3092183Z shl.b16 %rs101, %rs69, 4; 2026-02-21T09:03:10.3092241Z shl.b16 %rs102, %rs70, 4; 2026-02-21T09:03:10.3092299Z shl.b16 %rs103, %rs71, 4; 2026-02-21T09:03:10.3092364Z shl.b16 %rs104, %rs72, 4; 2026-02-21T09:03:10.3092424Z shl.b16 %rs105, %rs73, 4; 2026-02-21T09:03:10.3092482Z shl.b16 %rs106, %rs74, 4; 2026-02-21T09:03:10.3092539Z shl.b16 %rs107, %rs75, 4; 2026-02-21T09:03:10.3092605Z shl.b16 %rs108, %rs76, 4; 2026-02-21T09:03:10.3092662Z shl.b16 %rs109, %rs77, 4; 2026-02-21T09:03:10.3092718Z shl.b16 %rs110, %rs78, 4; 2026-02-21T09:03:10.3092812Z shl.b16 %rs111, %rs79, 4; 2026-02-21T09:03:10.3092869Z shl.b16 %rs112, %rs80, 4; 2026-02-21T09:03:10.3092926Z shl.b16 %rs113, %rs81, 4; 2026-02-21T09:03:10.3092982Z shl.b16 %rs114, %rs82, 4; 2026-02-21T09:03:10.3093046Z shl.b16 %rs115, %rs83, 4; 2026-02-21T09:03:10.3093103Z shl.b16 %rs116, %rs84, 4; 2026-02-21T09:03:10.3093162Z shl.b16 %rs117, %rs85, 4; 2026-02-21T09:03:10.3093227Z shl.b16 %rs118, %rs86, 4; 2026-02-21T09:03:10.3093285Z shl.b16 %rs119, %rs87, 4; 2026-02-21T09:03:10.3093344Z shl.b16 %rs120, %rs88, 4; 2026-02-21T09:03:10.3093428Z shl.b16 %rs121, %rs89, 4; 2026-02-21T09:03:10.3093501Z shl.b16 %rs122, %rs90, 4; 2026-02-21T09:03:10.3093560Z shl.b16 %rs123, %rs91, 4; 2026-02-21T09:03:10.3093628Z shl.b16 %rs124, %rs92, 4; 
2026-02-21T09:03:10.3093694Z shl.b16 %rs125, %rs93, 4; 2026-02-21T09:03:10.3093751Z shl.b16 %rs126, %rs94, 4; 2026-02-21T09:03:10.3093808Z shl.b16 %rs127, %rs95, 4; 2026-02-21T09:03:10.3093865Z shl.b16 %rs128, %rs96, 4; 2026-02-21T09:03:10.3094047Z .loc 1 66 58 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:66:58 2026-02-21T09:03:10.3094118Z selp.b16 %rs129, %rs97, %rs65, %p11; 2026-02-21T09:03:10.3094181Z cvt.s16.s8 %rs130, %rs129; 2026-02-21T09:03:10.3094249Z shr.s16 %rs131, %rs130, 4; 2026-02-21T09:03:10.3094347Z selp.b16 %rs132, %rs98, %rs66, %p11; 2026-02-21T09:03:10.3094409Z cvt.s16.s8 %rs133, %rs132; 2026-02-21T09:03:10.3094476Z shr.s16 %rs134, %rs133, 4; 2026-02-21T09:03:10.3094541Z selp.b16 %rs135, %rs99, %rs67, %p11; 2026-02-21T09:03:10.3094602Z cvt.s16.s8 %rs136, %rs135; 2026-02-21T09:03:10.3094659Z shr.s16 %rs137, %rs136, 4; 2026-02-21T09:03:10.3094734Z selp.b16 %rs138, %rs100, %rs68, %p11; 2026-02-21T09:03:10.3094793Z cvt.s16.s8 %rs139, %rs138; 2026-02-21T09:03:10.3094850Z shr.s16 %rs140, %rs139, 4; 2026-02-21T09:03:10.3094923Z selp.b16 %rs141, %rs101, %rs69, %p11; 2026-02-21T09:03:10.3094982Z cvt.s16.s8 %rs142, %rs141; 2026-02-21T09:03:10.3095039Z shr.s16 %rs143, %rs142, 4; 2026-02-21T09:03:10.3095104Z selp.b16 %rs144, %rs102, %rs70, %p11; 2026-02-21T09:03:10.3095171Z cvt.s16.s8 %rs145, %rs144; 2026-02-21T09:03:10.3095228Z shr.s16 %rs146, %rs145, 4; 2026-02-21T09:03:10.3095291Z selp.b16 %rs147, %rs103, %rs71, %p11; 2026-02-21T09:03:10.3095358Z cvt.s16.s8 %rs148, %rs147; 2026-02-21T09:03:10.3095417Z shr.s16 %rs149, %rs148, 4; 2026-02-21T09:03:10.3095482Z selp.b16 %rs150, %rs104, %rs72, %p11; 2026-02-21T09:03:10.3095540Z cvt.s16.s8 %rs151, %rs150; 2026-02-21T09:03:10.3095607Z shr.s16 %rs152, %rs151, 4; 2026-02-21T09:03:10.3095671Z selp.b16 %rs153, %rs105, %rs73, %p11; 2026-02-21T09:03:10.3095729Z cvt.s16.s8 %rs154, %rs153; 2026-02-21T09:03:10.3095795Z shr.s16 %rs155, %rs154, 4; 2026-02-21T09:03:10.3095858Z selp.b16 %rs156, %rs106, %rs74, 
%p11; 2026-02-21T09:03:10.3095914Z cvt.s16.s8 %rs157, %rs156; 2026-02-21T09:03:10.3095980Z shr.s16 %rs158, %rs157, 4; 2026-02-21T09:03:10.3096042Z selp.b16 %rs159, %rs107, %rs75, %p11; 2026-02-21T09:03:10.3096101Z cvt.s16.s8 %rs160, %rs159; 2026-02-21T09:03:10.3096159Z shr.s16 %rs161, %rs160, 4; 2026-02-21T09:03:10.3096230Z selp.b16 %rs162, %rs108, %rs76, %p11; 2026-02-21T09:03:10.3096287Z cvt.s16.s8 %rs163, %rs162; 2026-02-21T09:03:10.3096343Z shr.s16 %rs164, %rs163, 4; 2026-02-21T09:03:10.3096413Z selp.b16 %rs165, %rs109, %rs77, %p11; 2026-02-21T09:03:10.3096472Z cvt.s16.s8 %rs166, %rs165; 2026-02-21T09:03:10.3096555Z shr.s16 %rs167, %rs166, 4; 2026-02-21T09:03:10.3096619Z selp.b16 %rs168, %rs110, %rs78, %p11; 2026-02-21T09:03:10.3096686Z cvt.s16.s8 %rs169, %rs168; 2026-02-21T09:03:10.3096743Z shr.s16 %rs170, %rs169, 4; 2026-02-21T09:03:10.3096804Z selp.b16 %rs171, %rs111, %rs79, %p11; 2026-02-21T09:03:10.3096868Z cvt.s16.s8 %rs172, %rs171; 2026-02-21T09:03:10.3096925Z shr.s16 %rs173, %rs172, 4; 2026-02-21T09:03:10.3096988Z selp.b16 %rs174, %rs112, %rs80, %p11; 2026-02-21T09:03:10.3097046Z cvt.s16.s8 %rs175, %rs174; 2026-02-21T09:03:10.3097111Z shr.s16 %rs176, %rs175, 4; 2026-02-21T09:03:10.3097197Z selp.b16 %rs177, %rs113, %rs81, %p11; 2026-02-21T09:03:10.3097256Z cvt.s16.s8 %rs178, %rs177; 2026-02-21T09:03:10.3097322Z shr.s16 %rs179, %rs178, 4; 2026-02-21T09:03:10.3097384Z selp.b16 %rs180, %rs114, %rs82, %p11; 2026-02-21T09:03:10.3097442Z cvt.s16.s8 %rs181, %rs180; 2026-02-21T09:03:10.3097500Z shr.s16 %rs182, %rs181, 4; 2026-02-21T09:03:10.3097570Z selp.b16 %rs183, %rs115, %rs83, %p11; 2026-02-21T09:03:10.3097627Z cvt.s16.s8 %rs184, %rs183; 2026-02-21T09:03:10.3097684Z shr.s16 %rs185, %rs184, 4; 2026-02-21T09:03:10.3097787Z selp.b16 %rs186, %rs116, %rs84, %p11; 2026-02-21T09:03:10.3097847Z cvt.s16.s8 %rs187, %rs186; 2026-02-21T09:03:10.3097904Z shr.s16 %rs188, %rs187, 4; 2026-02-21T09:03:10.3097974Z selp.b16 %rs189, %rs117, %rs85, %p11; 
2026-02-21T09:03:10.3098031Z cvt.s16.s8 %rs190, %rs189; 2026-02-21T09:03:10.3098088Z shr.s16 %rs191, %rs190, 4; 2026-02-21T09:03:10.3098151Z selp.b16 %rs192, %rs118, %rs86, %p11; 2026-02-21T09:03:10.3098217Z cvt.s16.s8 %rs193, %rs192; 2026-02-21T09:03:10.3098276Z shr.s16 %rs194, %rs193, 4; 2026-02-21T09:03:10.3098340Z selp.b16 %rs195, %rs119, %rs87, %p11; 2026-02-21T09:03:10.3098406Z cvt.s16.s8 %rs196, %rs195; 2026-02-21T09:03:10.3098464Z shr.s16 %rs197, %rs196, 4; 2026-02-21T09:03:10.3098527Z selp.b16 %rs198, %rs120, %rs88, %p11; 2026-02-21T09:03:10.3098606Z cvt.s16.s8 %rs199, %rs198; 2026-02-21T09:03:10.3098677Z shr.s16 %rs200, %rs199, 4; 2026-02-21T09:03:10.3098740Z selp.b16 %rs201, %rs121, %rs89, %p11; 2026-02-21T09:03:10.3098799Z cvt.s16.s8 %rs202, %rs201; 2026-02-21T09:03:10.3098864Z shr.s16 %rs203, %rs202, 4; 2026-02-21T09:03:10.3098928Z selp.b16 %rs204, %rs122, %rs90, %p11; 2026-02-21T09:03:10.3098986Z cvt.s16.s8 %rs205, %rs204; 2026-02-21T09:03:10.3099044Z shr.s16 %rs206, %rs205, 4; 2026-02-21T09:03:10.3099116Z selp.b16 %rs207, %rs123, %rs91, %p11; 2026-02-21T09:03:10.3099174Z cvt.s16.s8 %rs208, %rs207; 2026-02-21T09:03:10.3099233Z shr.s16 %rs209, %rs208, 4; 2026-02-21T09:03:10.3099308Z selp.b16 %rs210, %rs124, %rs92, %p11; 2026-02-21T09:03:10.3099366Z cvt.s16.s8 %rs211, %rs210; 2026-02-21T09:03:10.3099425Z shr.s16 %rs212, %rs211, 4; 2026-02-21T09:03:10.3099496Z selp.b16 %rs213, %rs125, %rs93, %p11; 2026-02-21T09:03:10.3099556Z cvt.s16.s8 %rs214, %rs213; 2026-02-21T09:03:10.3099614Z shr.s16 %rs215, %rs214, 4; 2026-02-21T09:03:10.3099681Z selp.b16 %rs216, %rs126, %rs94, %p11; 2026-02-21T09:03:10.3099751Z cvt.s16.s8 %rs217, %rs216; 2026-02-21T09:03:10.3099810Z shr.s16 %rs218, %rs217, 4; 2026-02-21T09:03:10.3099876Z selp.b16 %rs219, %rs127, %rs95, %p11; 2026-02-21T09:03:10.3099945Z cvt.s16.s8 %rs220, %rs219; 2026-02-21T09:03:10.3100006Z shr.s16 %rs221, %rs220, 4; 2026-02-21T09:03:10.3100069Z selp.b16 %rs222, %rs128, %rs96, %p11; 2026-02-21T09:03:10.3100128Z 
cvt.s16.s8 %rs223, %rs222; 2026-02-21T09:03:10.3100194Z shr.s16 %rs224, %rs223, 4; 2026-02-21T09:03:10.3100364Z .loc 1 71 32 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:71:32 2026-02-21T09:03:10.3100430Z cvt.rn.f32.s16 %r295, %rs131; 2026-02-21T09:03:10.3100501Z cvt.rn.f32.s16 %r296, %rs134; 2026-02-21T09:03:10.3100561Z cvt.rn.f32.s16 %r297, %rs137; 2026-02-21T09:03:10.3100621Z cvt.rn.f32.s16 %r298, %rs140; 2026-02-21T09:03:10.3100679Z cvt.rn.f32.s16 %r299, %rs143; 2026-02-21T09:03:10.3100746Z cvt.rn.f32.s16 %r300, %rs146; 2026-02-21T09:03:10.3100830Z cvt.rn.f32.s16 %r301, %rs149; 2026-02-21T09:03:10.3100887Z cvt.rn.f32.s16 %r302, %rs152; 2026-02-21T09:03:10.3100953Z cvt.rn.f32.s16 %r303, %rs155; 2026-02-21T09:03:10.3101012Z cvt.rn.f32.s16 %r304, %rs158; 2026-02-21T09:03:10.3101069Z cvt.rn.f32.s16 %r305, %rs161; 2026-02-21T09:03:10.3101135Z cvt.rn.f32.s16 %r306, %rs164; 2026-02-21T09:03:10.3101193Z cvt.rn.f32.s16 %r307, %rs167; 2026-02-21T09:03:10.3101252Z cvt.rn.f32.s16 %r308, %rs170; 2026-02-21T09:03:10.3101310Z cvt.rn.f32.s16 %r309, %rs173; 2026-02-21T09:03:10.3101376Z cvt.rn.f32.s16 %r310, %rs176; 2026-02-21T09:03:10.3101433Z cvt.rn.f32.s16 %r311, %rs179; 2026-02-21T09:03:10.3101511Z cvt.rn.f32.s16 %r312, %rs182; 2026-02-21T09:03:10.3101610Z cvt.rn.f32.s16 %r313, %rs185; 2026-02-21T09:03:10.3101669Z cvt.rn.f32.s16 %r314, %rs188; 2026-02-21T09:03:10.3101727Z cvt.rn.f32.s16 %r315, %rs191; 2026-02-21T09:03:10.3101784Z cvt.rn.f32.s16 %r316, %rs194; 2026-02-21T09:03:10.3101852Z cvt.rn.f32.s16 %r317, %rs197; 2026-02-21T09:03:10.3101911Z cvt.rn.f32.s16 %r318, %rs200; 2026-02-21T09:03:10.3101968Z cvt.rn.f32.s16 %r319, %rs203; 2026-02-21T09:03:10.3102064Z cvt.rn.f32.s16 %r320, %rs206; 2026-02-21T09:03:10.3102125Z cvt.rn.f32.s16 %r321, %rs209; 2026-02-21T09:03:10.3102184Z cvt.rn.f32.s16 %r322, %rs212; 2026-02-21T09:03:10.3102242Z cvt.rn.f32.s16 %r323, %rs215; 2026-02-21T09:03:10.3102309Z cvt.rn.f32.s16 %r324, %rs218; 2026-02-21T09:03:10.3102366Z 
cvt.rn.f32.s16 %r325, %rs221; 2026-02-21T09:03:10.3102426Z cvt.rn.f32.s16 %r326, %rs224; 2026-02-21T09:03:10.3102496Z st.shared.b32 [%r20], %r295; 2026-02-21T09:03:10.3102562Z st.shared.b32 [%r20+4096], %r303; 2026-02-21T09:03:10.3102625Z st.shared.b32 [%r20+8192], %r311; 2026-02-21T09:03:10.3102691Z st.shared.b32 [%r20+12288], %r319; 2026-02-21T09:03:10.3102761Z st.shared.b32 [%r21], %r296; 2026-02-21T09:03:10.3102824Z st.shared.b32 [%r21+4096], %r304; 2026-02-21T09:03:10.3102912Z st.shared.b32 [%r21+8192], %r312; 2026-02-21T09:03:10.3102984Z st.shared.b32 [%r21+12288], %r320; 2026-02-21T09:03:10.3103044Z st.shared.b32 [%r22], %r297; 2026-02-21T09:03:10.3103105Z st.shared.b32 [%r22+4096], %r305; 2026-02-21T09:03:10.3103176Z st.shared.b32 [%r22+8192], %r313; 2026-02-21T09:03:10.3103238Z st.shared.b32 [%r22+12288], %r321; 2026-02-21T09:03:10.3103299Z st.shared.b32 [%r23], %r298; 2026-02-21T09:03:10.3103361Z st.shared.b32 [%r23+4096], %r306; 2026-02-21T09:03:10.3103429Z st.shared.b32 [%r23+8192], %r314; 2026-02-21T09:03:10.3103490Z st.shared.b32 [%r23+12288], %r322; 2026-02-21T09:03:10.3103550Z st.shared.b32 [%r24], %r299; 2026-02-21T09:03:10.3103617Z st.shared.b32 [%r24+4096], %r307; 2026-02-21T09:03:10.3103677Z st.shared.b32 [%r24+8192], %r315; 2026-02-21T09:03:10.3103737Z st.shared.b32 [%r24+12288], %r323; 2026-02-21T09:03:10.3103797Z st.shared.b32 [%r25], %r300; 2026-02-21T09:03:10.3103864Z st.shared.b32 [%r25+4096], %r308; 2026-02-21T09:03:10.3103925Z st.shared.b32 [%r25+8192], %r316; 2026-02-21T09:03:10.3103987Z st.shared.b32 [%r25+12288], %r324; 2026-02-21T09:03:10.3104054Z st.shared.b32 [%r26], %r301; 2026-02-21T09:03:10.3104115Z st.shared.b32 [%r26+4096], %r309; 2026-02-21T09:03:10.3104175Z st.shared.b32 [%r26+8192], %r317; 2026-02-21T09:03:10.3104243Z st.shared.b32 [%r26+12288], %r325; 2026-02-21T09:03:10.3104303Z st.shared.b32 [%r27], %r302; 2026-02-21T09:03:10.3104365Z st.shared.b32 [%r27+4096], %r310; 2026-02-21T09:03:10.3104425Z st.shared.b32 
[%r27+8192], %r318; 2026-02-21T09:03:10.3104494Z st.shared.b32 [%r27+12288], %r326; 2026-02-21T09:03:10.3104548Z $L__tmp4: 2026-02-21T09:03:10.3104775Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3104843Z // begin inline asm 2026-02-21T09:03:10.3104917Z fence.proxy.async.shared::cta; 2026-02-21T09:03:10.3104972Z // end inline asm 2026-02-21T09:03:10.3105027Z bar.sync 0; 2026-02-21T09:03:10.3105100Z setp.ne.b32 %p12, %r34, 0; 2026-02-21T09:03:10.3105193Z @%p12 bra $L__BB0_3; 2026-02-21T09:03:10.3105247Z // %bb.2: 2026-02-21T09:03:10.3105321Z elect.sync %r375|%p14, -1; 2026-02-21T09:03:10.3105381Z mov.b32 %r329, 67635472; 2026-02-21T09:03:10.3105441Z mov.pred %p13, 0; 2026-02-21T09:03:10.3105498Z // begin inline asm 2026-02-21T09:03:10.3105664Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 0 ], %rd78, %r329, %p13; 2026-02-21T09:03:10.3105720Z // end inline asm 2026-02-21T09:03:10.3105775Z // begin inline asm 2026-02-21T09:03:10.3105929Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 8 ], %rd79, %r329, %p15; 2026-02-21T09:03:10.3106032Z // end inline asm 2026-02-21T09:03:10.3106089Z // begin inline asm 2026-02-21T09:03:10.3106243Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 16 ], %rd80, %r329, %p15; 2026-02-21T09:03:10.3106299Z // end inline asm 2026-02-21T09:03:10.3106355Z // begin inline asm 2026-02-21T09:03:10.3106512Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 24 ], %rd81, %r329, %p15; 2026-02-21T09:03:10.3106569Z // end inline asm 2026-02-21T09:03:10.3106626Z // begin inline asm 2026-02-21T09:03:10.3106787Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 32 ], %rd82, %r329, %p15; 2026-02-21T09:03:10.3106855Z // end inline asm 2026-02-21T09:03:10.3106913Z // begin inline asm 2026-02-21T09:03:10.3107057Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 40 ], %rd83, 
%r329, %p15; 2026-02-21T09:03:10.3107124Z // end inline asm 2026-02-21T09:03:10.3107184Z // begin inline asm 2026-02-21T09:03:10.3107328Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 48 ], %rd84, %r329, %p15; 2026-02-21T09:03:10.3107392Z // end inline asm 2026-02-21T09:03:10.3107448Z // begin inline asm 2026-02-21T09:03:10.3107590Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 56 ], %rd85, %r329, %p15; 2026-02-21T09:03:10.3107658Z // end inline asm 2026-02-21T09:03:10.3107738Z // begin inline asm 2026-02-21T09:03:10.3107883Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 64 ], %rd86, %r329, %p15; 2026-02-21T09:03:10.3107945Z // end inline asm 2026-02-21T09:03:10.3108014Z // begin inline asm 2026-02-21T09:03:10.3108153Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 72 ], %rd87, %r329, %p15; 2026-02-21T09:03:10.3108208Z // end inline asm 2026-02-21T09:03:10.3108273Z // begin inline asm 2026-02-21T09:03:10.3108411Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 80 ], %rd88, %r329, %p15; 2026-02-21T09:03:10.3108465Z // end inline asm 2026-02-21T09:03:10.3108531Z // begin inline asm 2026-02-21T09:03:10.3108672Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 88 ], %rd89, %r329, %p15; 2026-02-21T09:03:10.3108727Z // end inline asm 2026-02-21T09:03:10.3108783Z // begin inline asm 2026-02-21T09:03:10.3108932Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 96 ], %rd90, %r329, %p15; 2026-02-21T09:03:10.3108988Z // end inline asm 2026-02-21T09:03:10.3109045Z // begin inline asm 2026-02-21T09:03:10.3109197Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 104 ], %rd91, %r329, %p15; 2026-02-21T09:03:10.3109251Z // end inline asm 2026-02-21T09:03:10.3109307Z // begin inline asm 2026-02-21T09:03:10.3109455Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 112 ], %rd92, %r329, %p15; 
2026-02-21T09:03:10.3109511Z // end inline asm 2026-02-21T09:03:10.3109567Z // begin inline asm 2026-02-21T09:03:10.3109715Z @%p14 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 120 ], %rd93, %r329, %p15; 2026-02-21T09:03:10.3109772Z // end inline asm 2026-02-21T09:03:10.3109835Z add.s32 %r377, %r48, 53248; 2026-02-21T09:03:10.3109898Z cvt.u64.u32 %rd94, %r377; 2026-02-21T09:03:10.3109963Z // begin inline asm 2026-02-21T09:03:10.3110089Z @%p14 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd94]; 2026-02-21T09:03:10.3110146Z // end inline asm 2026-02-21T09:03:10.3110233Z $L__tmp5: 2026-02-21T09:03:10.3110288Z $L__BB0_3: 2026-02-21T09:03:10.3110382Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:03:10.3110488Z ld.param.b64 %rd37, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:03:10.3110560Z mul.lo.s32 %r28, %r181, 7168; 2026-02-21T09:03:10.3110623Z add.s32 %r29, %r48, %r251; 2026-02-21T09:03:10.3110684Z add.s32 %r30, %r48, %r252; 2026-02-21T09:03:10.3110750Z add.s32 %r31, %r48, %r258; 2026-02-21T09:03:10.3110811Z or.b32 %r33, %r32, %r177; 2026-02-21T09:03:10.3110980Z .loc 1 42 80 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:42:80 2026-02-21T09:03:10.3111068Z // begin inline asm 2026-02-21T09:03:10.3111187Z cp.async.cg.shared.global [ %r378 + 0 ], [ %rd95 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3111241Z // end inline asm 2026-02-21T09:03:10.3111298Z // begin inline asm 2026-02-21T09:03:10.3111422Z cp.async.cg.shared.global [ %r380 + 0 ], [ %rd96 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3111478Z // end inline asm 2026-02-21T09:03:10.3111564Z // begin inline asm 2026-02-21T09:03:10.3111711Z cp.async.cg.shared.global [ %r382 + 0 ], [ %rd97 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3111768Z // end inline asm 2026-02-21T09:03:10.3111823Z // begin inline asm 2026-02-21T09:03:10.3111929Z cp.async.cg.shared.global [ %r384 + 0 ], [ %rd98 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3111990Z // end inline asm 2026-02-21T09:03:10.3112047Z // begin 
inline asm 2026-02-21T09:03:10.3112152Z cp.async.cg.shared.global [ %r386 + 0 ], [ %rd99 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3112213Z // end inline asm 2026-02-21T09:03:10.3112271Z // begin inline asm 2026-02-21T09:03:10.3112381Z cp.async.cg.shared.global [ %r388 + 0 ], [ %rd100 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3112443Z // end inline asm 2026-02-21T09:03:10.3112498Z // begin inline asm 2026-02-21T09:03:10.3112607Z cp.async.cg.shared.global [ %r390 + 0 ], [ %rd101 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3112688Z // end inline asm 2026-02-21T09:03:10.3112757Z // begin inline asm 2026-02-21T09:03:10.3112870Z cp.async.cg.shared.global [ %r392 + 0 ], [ %rd102 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3112926Z // end inline asm 2026-02-21T09:03:10.3112998Z cp.async.commit_group; 2026-02-21T09:03:10.3113170Z .loc 1 48 34 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:34 2026-02-21T09:03:10.3113237Z add.s64 %rd103, %rd26, 917504; 2026-02-21T09:03:10.3113417Z .loc 1 48 87 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:87 2026-02-21T09:03:10.3113475Z // begin inline asm 2026-02-21T09:03:10.3113587Z cp.async.cg.shared.global [ %r394 + 0 ], [ %rd103 + 0 ], 0x10, %r379; 2026-02-21T09:03:10.3113643Z // end inline asm 2026-02-21T09:03:10.3113714Z cp.async.commit_group; 2026-02-21T09:03:10.3113883Z .loc 1 34 93 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:34:93 2026-02-21T09:03:10.3113947Z bfe.u32 %r399, %r1, 4, 3; 2026-02-21T09:03:10.3114022Z mul.wide.u32 %rd105, %r399, 16384; 2026-02-21T09:03:10.3114088Z mul.wide.u32 %rd106, %r172, 16; 2026-02-21T09:03:10.3114153Z or.b64 %rd107, %rd105, %rd106; 2026-02-21T09:03:10.3114214Z add.s64 %rd108, %rd107, %rd35; 2026-02-21T09:03:10.3114285Z add.s64 %rd178, %rd108, 918272; 2026-02-21T09:03:10.3114345Z add.s32 %r401, %r32, %r5; 2026-02-21T09:03:10.3114406Z cvt.u64.u32 %rd109, %r401; 2026-02-21T09:03:10.3114485Z mad.lo.s64 %rd110, %rd25, 7168, %rd109; 2026-02-21T09:03:10.3114546Z add.s64 
%rd111, %rd110, %rd36; 2026-02-21T09:03:10.3114610Z add.s64 %rd177, %rd111, 1376256; 2026-02-21T09:03:10.3114674Z mov.b32 %r674, 1; 2026-02-21T09:03:10.3114735Z mov.b64 %rd179, -64; 2026-02-21T09:03:10.3114793Z mov.b32 %r673, %r671; 2026-02-21T09:03:10.3114850Z mov.b32 %r675, %r671; 2026-02-21T09:03:10.3114917Z bra.uni $L__BB0_4; 2026-02-21T09:03:10.3115020Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:03:10.3115194Z .loc 1 34 93 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:34:93 2026-02-21T09:03:10.3115297Z add.s64 %rd179, %rd179, 64; 2026-02-21T09:03:10.3115367Z setp.lt.u64 %p86, %rd179, 3904; 2026-02-21T09:03:10.3115423Z $L__tmp6: 2026-02-21T09:03:10.3115649Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3115719Z add.s32 %r613, %r674, 1; 2026-02-21T09:03:10.3115782Z setp.gt.s32 %p87, %r613, 1; 2026-02-21T09:03:10.3115845Z selp.b32 %r674, 0, %r613, %p87; 2026-02-21T09:03:10.3115914Z selp.b32 %r614, 1, 0, %p87; 2026-02-21T09:03:10.3116008Z xor.b32 %r47, %r675, %r614; 2026-02-21T09:03:10.3116061Z $L__tmp7: 2026-02-21T09:03:10.3116239Z .loc 1 42 32 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:42:32 2026-02-21T09:03:10.3116304Z add.s64 %rd129, %rd178, -917504; 2026-02-21T09:03:10.3116369Z add.s64 %rd130, %rd178, -786432; 2026-02-21T09:03:10.3116432Z add.s64 %rd131, %rd178, -655360; 2026-02-21T09:03:10.3116500Z add.s64 %rd132, %rd178, -524288; 2026-02-21T09:03:10.3116560Z add.s64 %rd133, %rd178, -393216; 2026-02-21T09:03:10.3116642Z add.s64 %rd134, %rd178, -262144; 2026-02-21T09:03:10.3116712Z add.s64 %rd135, %rd178, -131072; 2026-02-21T09:03:10.3116880Z .loc 1 42 80 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:42:80 2026-02-21T09:03:10.3116940Z add.s32 %r595, %r43, %r6; 2026-02-21T09:03:10.3117009Z selp.b32 %r596, 16, 0, %p86; 2026-02-21T09:03:10.3117067Z // begin inline asm 2026-02-21T09:03:10.3117182Z cp.async.cg.shared.global [ %r595 
+ 0 ], [ %rd129 + 0 ], 0x10, %r596; 2026-02-21T09:03:10.3117241Z // end inline asm 2026-02-21T09:03:10.3117310Z add.s32 %r597, %r595, 2048; 2026-02-21T09:03:10.3117367Z // begin inline asm 2026-02-21T09:03:10.3117480Z cp.async.cg.shared.global [ %r597 + 0 ], [ %rd130 + 0 ], 0x10, %r596; 2026-02-21T09:03:10.3117568Z // end inline asm 2026-02-21T09:03:10.3117630Z add.s32 %r599, %r595, 4096; 2026-02-21T09:03:10.3117687Z // begin inline asm 2026-02-21T09:03:10.3117799Z cp.async.cg.shared.global [ %r599 + 0 ], [ %rd131 + 0 ], 0x10, %r596; 2026-02-21T09:03:10.3117863Z // end inline asm 2026-02-21T09:03:10.3117922Z add.s32 %r601, %r595, 6144; 2026-02-21T09:03:10.3117978Z // begin inline asm 2026-02-21T09:03:10.3118096Z cp.async.cg.shared.global [ %r601 + 0 ], [ %rd132 + 0 ], 0x10, %r596; 2026-02-21T09:03:10.3118152Z // end inline asm 2026-02-21T09:03:10.3118209Z add.s32 %r603, %r595, 8192; 2026-02-21T09:03:10.3118266Z // begin inline asm 2026-02-21T09:03:10.3118384Z cp.async.cg.shared.global [ %r603 + 0 ], [ %rd133 + 0 ], 0x10, %r596; 2026-02-21T09:03:10.3118440Z // end inline asm 2026-02-21T09:03:10.3118501Z add.s32 %r605, %r595, 10240; 2026-02-21T09:03:10.3118565Z // begin inline asm 2026-02-21T09:03:10.3118675Z cp.async.cg.shared.global [ %r605 + 0 ], [ %rd134 + 0 ], 0x10, %r596; 2026-02-21T09:03:10.3118733Z // end inline asm 2026-02-21T09:03:10.3118802Z add.s32 %r607, %r595, 12288; 2026-02-21T09:03:10.3118858Z // begin inline asm 2026-02-21T09:03:10.3118968Z cp.async.cg.shared.global [ %r607 + 0 ], [ %rd135 + 0 ], 0x10, %r596; 2026-02-21T09:03:10.3119023Z // end inline asm 2026-02-21T09:03:10.3119091Z add.s32 %r609, %r43, %r14; 2026-02-21T09:03:10.3119147Z // begin inline asm 2026-02-21T09:03:10.3119257Z cp.async.cg.shared.global [ %r609 + 0 ], [ %rd178 + 0 ], 0x10, %r596; 2026-02-21T09:03:10.3119319Z // end inline asm 2026-02-21T09:03:10.3119381Z cp.async.commit_group; 2026-02-21T09:03:10.3119549Z .loc 1 48 87 // 
cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:87 2026-02-21T09:03:10.3119610Z add.s32 %r611, %r44, %r6; 2026-02-21T09:03:10.3119673Z // begin inline asm 2026-02-21T09:03:10.3119782Z cp.async.cg.shared.global [ %r611 + 0 ], [ %rd177 + 0 ], 0x10, %r596; 2026-02-21T09:03:10.3119836Z // end inline asm 2026-02-21T09:03:10.3119906Z cp.async.commit_group; 2026-02-21T09:03:10.3120102Z .loc 1 34 93 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:34:93 2026-02-21T09:03:10.3120162Z add.s64 %rd178, %rd178, 256; 2026-02-21T09:03:10.3120233Z add.s64 %rd177, %rd177, 458752; 2026-02-21T09:03:10.3120297Z setp.lt.u64 %p88, %rd179, 3968; 2026-02-21T09:03:10.3120354Z mov.b32 %r671, %r675; 2026-02-21T09:03:10.3120412Z mov.b32 %r675, %r47; 2026-02-21T09:03:10.3120478Z @%p88 bra $L__BB0_4; 2026-02-21T09:03:10.3120534Z bra.uni $L__BB0_7; 2026-02-21T09:03:10.3120640Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:03:10.3120708Z add.s32 %r473, %r673, 1; 2026-02-21T09:03:10.3120796Z setp.gt.s32 %p52, %r473, 1; 2026-02-21T09:03:10.3120858Z selp.b32 %r673, 0, %r473, %p52; 2026-02-21T09:03:10.3121026Z .loc 1 42 80 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:42:80 2026-02-21T09:03:10.3121097Z cp.async.wait_group 2; 2026-02-21T09:03:10.3121155Z bar.sync 0; 2026-02-21T09:03:10.3121215Z shl.b32 %r474, %r673, 14; 2026-02-21T09:03:10.3121281Z add.s32 %r43, %r48, %r474; 2026-02-21T09:03:10.3121340Z add.s32 %r476, %r43, %r18; 2026-02-21T09:03:10.3121457Z ld.shared.v4.b32 {%r477, %r478, %r479, %r480}, [%r476]; 2026-02-21T09:03:10.3121530Z mov.b32 {%rs225, %rs226}, %r480; 2026-02-21T09:03:10.3121626Z mov.b32 {%rs227, %rs228}, %r479; 2026-02-21T09:03:10.3121688Z mov.b32 {%rs229, %rs230}, %r478; 2026-02-21T09:03:10.3121747Z mov.b32 {%rs231, %rs232}, %r477; 2026-02-21T09:03:10.3121852Z ld.shared.v4.b32 {%r481, %r482, %r483, %r484}, [%r476+16]; 2026-02-21T09:03:10.3121911Z mov.b32 {%rs233, %rs234}, %r484; 2026-02-21T09:03:10.3121971Z mov.b32 {%rs235, 
%rs236}, %r483; 2026-02-21T09:03:10.3122036Z mov.b32 {%rs237, %rs238}, %r482; 2026-02-21T09:03:10.3122095Z mov.b32 {%rs239, %rs240}, %r481; 2026-02-21T09:03:10.3122188Z ld.shared.v4.b32 {%r485, %r486, %r487, %r488}, [%r476+32]; 2026-02-21T09:03:10.3122287Z mov.b32 {%rs241, %rs242}, %r488; 2026-02-21T09:03:10.3122355Z mov.b32 {%rs243, %rs244}, %r487; 2026-02-21T09:03:10.3122413Z mov.b32 {%rs245, %rs246}, %r486; 2026-02-21T09:03:10.3122472Z mov.b32 {%rs247, %rs248}, %r485; 2026-02-21T09:03:10.3122572Z ld.shared.v4.b32 {%r489, %r490, %r491, %r492}, [%r476+48]; 2026-02-21T09:03:10.3122631Z mov.b32 {%rs249, %rs250}, %r492; 2026-02-21T09:03:10.3122692Z mov.b32 {%rs251, %rs252}, %r491; 2026-02-21T09:03:10.3122758Z mov.b32 {%rs253, %rs254}, %r490; 2026-02-21T09:03:10.3122819Z mov.b32 {%rs255, %rs256}, %r489; 2026-02-21T09:03:10.3122910Z ld.shared.v4.b32 {%r493, %r494, %r495, %r496}, [%r476+64]; 2026-02-21T09:03:10.3122974Z mov.b32 {%rs257, %rs258}, %r496; 2026-02-21T09:03:10.3123045Z mov.b32 {%rs259, %rs260}, %r495; 2026-02-21T09:03:10.3123105Z mov.b32 {%rs261, %rs262}, %r494; 2026-02-21T09:03:10.3123171Z mov.b32 {%rs263, %rs264}, %r493; 2026-02-21T09:03:10.3123269Z ld.shared.v4.b32 {%r497, %r498, %r499, %r500}, [%r476+80]; 2026-02-21T09:03:10.3123330Z mov.b32 {%rs265, %rs266}, %r500; 2026-02-21T09:03:10.3123391Z mov.b32 {%rs267, %rs268}, %r499; 2026-02-21T09:03:10.3123457Z mov.b32 {%rs269, %rs270}, %r498; 2026-02-21T09:03:10.3123517Z mov.b32 {%rs271, %rs272}, %r497; 2026-02-21T09:03:10.3123608Z ld.shared.v4.b32 {%r501, %r502, %r503, %r504}, [%r476+96]; 2026-02-21T09:03:10.3123666Z mov.b32 {%rs273, %rs274}, %r504; 2026-02-21T09:03:10.3123740Z mov.b32 {%rs275, %rs276}, %r503; 2026-02-21T09:03:10.3123802Z mov.b32 {%rs277, %rs278}, %r502; 2026-02-21T09:03:10.3123865Z mov.b32 {%rs279, %rs280}, %r501; 2026-02-21T09:03:10.3123973Z ld.shared.v4.b32 {%r505, %r506, %r507, %r508}, [%r476+112]; 2026-02-21T09:03:10.3124035Z mov.b32 {%rs281, %rs282}, %r508; 2026-02-21T09:03:10.3124095Z 
mov.b32 {%rs283, %rs284}, %r507; 2026-02-21T09:03:10.3124153Z mov.b32 {%rs285, %rs286}, %r506; 2026-02-21T09:03:10.3124219Z mov.b32 {%rs287, %rs288}, %r505; 2026-02-21T09:03:10.3124384Z .loc 1 46 32 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:46:32 2026-02-21T09:03:10.3124495Z cvt.f32.bf16 %r406, %rs231; 2026-02-21T09:03:10.3124564Z cvt.f32.bf16 %r407, %rs232; 2026-02-21T09:03:10.3124624Z cvt.f32.bf16 %r408, %rs229; 2026-02-21T09:03:10.3124685Z cvt.f32.bf16 %r409, %rs230; 2026-02-21T09:03:10.3124753Z cvt.f32.bf16 %r410, %rs227; 2026-02-21T09:03:10.3124811Z cvt.f32.bf16 %r411, %rs228; 2026-02-21T09:03:10.3124868Z cvt.f32.bf16 %r412, %rs225; 2026-02-21T09:03:10.3124927Z cvt.f32.bf16 %r413, %rs226; 2026-02-21T09:03:10.3124993Z cvt.f32.bf16 %r414, %rs239; 2026-02-21T09:03:10.3125051Z cvt.f32.bf16 %r415, %rs240; 2026-02-21T09:03:10.3125109Z cvt.f32.bf16 %r416, %rs237; 2026-02-21T09:03:10.3125201Z cvt.f32.bf16 %r417, %rs238; 2026-02-21T09:03:10.3125259Z cvt.f32.bf16 %r418, %rs235; 2026-02-21T09:03:10.3125317Z cvt.f32.bf16 %r419, %rs236; 2026-02-21T09:03:10.3125374Z cvt.f32.bf16 %r420, %rs233; 2026-02-21T09:03:10.3125440Z cvt.f32.bf16 %r421, %rs234; 2026-02-21T09:03:10.3125499Z cvt.f32.bf16 %r423, %rs247; 2026-02-21T09:03:10.3125557Z cvt.f32.bf16 %r424, %rs248; 2026-02-21T09:03:10.3125624Z cvt.f32.bf16 %r425, %rs245; 2026-02-21T09:03:10.3125681Z cvt.f32.bf16 %r426, %rs246; 2026-02-21T09:03:10.3125762Z cvt.f32.bf16 %r427, %rs243; 2026-02-21T09:03:10.3125822Z cvt.f32.bf16 %r428, %rs244; 2026-02-21T09:03:10.3125889Z cvt.f32.bf16 %r429, %rs241; 2026-02-21T09:03:10.3125945Z cvt.f32.bf16 %r430, %rs242; 2026-02-21T09:03:10.3126002Z cvt.f32.bf16 %r431, %rs255; 2026-02-21T09:03:10.3126067Z cvt.f32.bf16 %r432, %rs256; 2026-02-21T09:03:10.3126123Z cvt.f32.bf16 %r433, %rs253; 2026-02-21T09:03:10.3126180Z cvt.f32.bf16 %r434, %rs254; 2026-02-21T09:03:10.3126245Z cvt.f32.bf16 %r435, %rs251; 2026-02-21T09:03:10.3126302Z cvt.f32.bf16 %r436, %rs252; 
2026-02-21T09:03:10.3126358Z cvt.f32.bf16 %r437, %rs249; 2026-02-21T09:03:10.3126416Z cvt.f32.bf16 %r438, %rs250; 2026-02-21T09:03:10.3126479Z cvt.f32.bf16 %r440, %rs263; 2026-02-21T09:03:10.3126536Z cvt.f32.bf16 %r441, %rs264; 2026-02-21T09:03:10.3126615Z cvt.f32.bf16 %r442, %rs261; 2026-02-21T09:03:10.3126680Z cvt.f32.bf16 %r443, %rs262; 2026-02-21T09:03:10.3126737Z cvt.f32.bf16 %r444, %rs259; 2026-02-21T09:03:10.3126793Z cvt.f32.bf16 %r445, %rs260; 2026-02-21T09:03:10.3126850Z cvt.f32.bf16 %r446, %rs257; 2026-02-21T09:03:10.3126915Z cvt.f32.bf16 %r447, %rs258; 2026-02-21T09:03:10.3126972Z cvt.f32.bf16 %r448, %rs271; 2026-02-21T09:03:10.3127029Z cvt.f32.bf16 %r449, %rs272; 2026-02-21T09:03:10.3127092Z cvt.f32.bf16 %r450, %rs269; 2026-02-21T09:03:10.3127149Z cvt.f32.bf16 %r451, %rs270; 2026-02-21T09:03:10.3127206Z cvt.f32.bf16 %r452, %rs267; 2026-02-21T09:03:10.3127263Z cvt.f32.bf16 %r453, %rs268; 2026-02-21T09:03:10.3127328Z cvt.f32.bf16 %r454, %rs265; 2026-02-21T09:03:10.3127384Z cvt.f32.bf16 %r455, %rs266; 2026-02-21T09:03:10.3127441Z cvt.f32.bf16 %r457, %rs279; 2026-02-21T09:03:10.3127506Z cvt.f32.bf16 %r458, %rs280; 2026-02-21T09:03:10.3127565Z cvt.f32.bf16 %r459, %rs277; 2026-02-21T09:03:10.3127625Z cvt.f32.bf16 %r460, %rs278; 2026-02-21T09:03:10.3127688Z cvt.f32.bf16 %r461, %rs275; 2026-02-21T09:03:10.3127754Z cvt.f32.bf16 %r462, %rs276; 2026-02-21T09:03:10.3127814Z cvt.f32.bf16 %r463, %rs273; 2026-02-21T09:03:10.3127874Z cvt.f32.bf16 %r464, %rs274; 2026-02-21T09:03:10.3127941Z cvt.f32.bf16 %r465, %rs287; 2026-02-21T09:03:10.3128001Z cvt.f32.bf16 %r466, %rs288; 2026-02-21T09:03:10.3128061Z cvt.f32.bf16 %r467, %rs285; 2026-02-21T09:03:10.3128128Z cvt.f32.bf16 %r468, %rs286; 2026-02-21T09:03:10.3128188Z cvt.f32.bf16 %r469, %rs283; 2026-02-21T09:03:10.3128248Z cvt.f32.bf16 %r470, %rs284; 2026-02-21T09:03:10.3128307Z cvt.f32.bf16 %r471, %rs281; 2026-02-21T09:03:10.3128377Z cvt.f32.bf16 %r472, %rs282; 2026-02-21T09:03:10.3128433Z $L__tmp8: 
2026-02-21T09:03:10.3128665Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3128733Z // begin inline asm 2026-02-21T09:03:10.3128789Z 2026-02-21T09:03:10.3128843Z { 2026-02-21T09:03:10.3128932Z .reg .pred complete; 2026-02-21T09:03:10.3129000Z waitLoop: 2026-02-21T09:03:10.3129127Z mbarrier.try_wait.parity.shared.b64 complete, [%r672], %r671; 2026-02-21T09:03:10.3129196Z @!complete bra.uni waitLoop; 2026-02-21T09:03:10.3129255Z } 2026-02-21T09:03:10.3129260Z 2026-02-21T09:03:10.3129318Z // end inline asm 2026-02-21T09:03:10.3129382Z mov.pred %p53, -1; 2026-02-21T09:03:10.3129442Z // begin inline asm 2026-02-21T09:03:10.3129757Z @%p53 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r104 + 0], 64, {%r406, %r407, %r408, %r409, %r410, %r411, %r412, %r413, %r414, %r415, %r416, %r417, %r418, %r419, %r420, %r421}; 2026-02-21T09:03:10.3129839Z // end inline asm 2026-02-21T09:03:10.3129900Z // begin inline asm 2026-02-21T09:03:10.3130207Z @%p53 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r104 + 16], 64, {%r423, %r424, %r425, %r426, %r427, %r428, %r429, %r430, %r431, %r432, %r433, %r434, %r435, %r436, %r437, %r438}; 2026-02-21T09:03:10.3130268Z // end inline asm 2026-02-21T09:03:10.3130329Z // begin inline asm 2026-02-21T09:03:10.3130651Z @%p53 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r104 + 32], 64, {%r440, %r441, %r442, %r443, %r444, %r445, %r446, %r447, %r448, %r449, %r450, %r451, %r452, %r453, %r454, %r455}; 2026-02-21T09:03:10.3130711Z // end inline asm 2026-02-21T09:03:10.3130769Z // begin inline asm 2026-02-21T09:03:10.3131065Z @%p53 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r104 + 48], 64, {%r457, %r458, %r459, %r460, %r461, %r462, %r463, %r464, %r465, %r466, %r467, %r468, %r469, %r470, %r471, %r472}; 2026-02-21T09:03:10.3131123Z // end inline asm 2026-02-21T09:03:10.3131180Z // begin inline asm 2026-02-21T09:03:10.3131264Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:03:10.3131320Z // end 
inline asm 2026-02-21T09:03:10.3131377Z bar.sync 0; 2026-02-21T09:03:10.3131433Z $L__tmp9: 2026-02-21T09:03:10.3131663Z .loc 1 48 87 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:48:87 2026-02-21T09:03:10.3131759Z shl.b32 %r509, %r673, 11; 2026-02-21T09:03:10.3131825Z add.s32 %r510, %r48, %r509; 2026-02-21T09:03:10.3131893Z add.s32 %r44, %r510, 49152; 2026-02-21T09:03:10.3131955Z add.s32 %r511, %r44, %r19; 2026-02-21T09:03:10.3132022Z ld.shared.b8 %rs289, [%r511]; 2026-02-21T09:03:10.3132090Z ld.shared.b8 %rs290, [%r511+64]; 2026-02-21T09:03:10.3132166Z ld.shared.b8 %rs291, [%r511+128]; 2026-02-21T09:03:10.3132233Z ld.shared.b8 %rs292, [%r511+192]; 2026-02-21T09:03:10.3132297Z ld.shared.b8 %rs293, [%r511+256]; 2026-02-21T09:03:10.3132367Z ld.shared.b8 %rs294, [%r511+320]; 2026-02-21T09:03:10.3132431Z ld.shared.b8 %rs295, [%r511+384]; 2026-02-21T09:03:10.3132495Z ld.shared.b8 %rs296, [%r511+448]; 2026-02-21T09:03:10.3132558Z ld.shared.b8 %rs297, [%r511+512]; 2026-02-21T09:03:10.3132629Z ld.shared.b8 %rs298, [%r511+576]; 2026-02-21T09:03:10.3132691Z ld.shared.b8 %rs299, [%r511+640]; 2026-02-21T09:03:10.3132752Z ld.shared.b8 %rs300, [%r511+704]; 2026-02-21T09:03:10.3132824Z ld.shared.b8 %rs301, [%r511+768]; 2026-02-21T09:03:10.3132892Z ld.shared.b8 %rs302, [%r511+832]; 2026-02-21T09:03:10.3132955Z ld.shared.b8 %rs303, [%r511+896]; 2026-02-21T09:03:10.3133026Z ld.shared.b8 %rs304, [%r511+960]; 2026-02-21T09:03:10.3133094Z ld.shared.b8 %rs305, [%r511+1024]; 2026-02-21T09:03:10.3133160Z ld.shared.b8 %rs306, [%r511+1088]; 2026-02-21T09:03:10.3133223Z ld.shared.b8 %rs307, [%r511+1152]; 2026-02-21T09:03:10.3133293Z ld.shared.b8 %rs308, [%r511+1216]; 2026-02-21T09:03:10.3133355Z ld.shared.b8 %rs309, [%r511+1280]; 2026-02-21T09:03:10.3133417Z ld.shared.b8 %rs310, [%r511+1344]; 2026-02-21T09:03:10.3133487Z ld.shared.b8 %rs311, [%r511+1408]; 2026-02-21T09:03:10.3133552Z ld.shared.b8 %rs312, [%r511+1472]; 2026-02-21T09:03:10.3133614Z ld.shared.b8 %rs313, 
[%r511+1536]; 2026-02-21T09:03:10.3133676Z ld.shared.b8 %rs314, [%r511+1600]; 2026-02-21T09:03:10.3133746Z ld.shared.b8 %rs315, [%r511+1664]; 2026-02-21T09:03:10.3133810Z ld.shared.b8 %rs316, [%r511+1728]; 2026-02-21T09:03:10.3133873Z ld.shared.b8 %rs317, [%r511+1792]; 2026-02-21T09:03:10.3133978Z ld.shared.b8 %rs318, [%r511+1856]; 2026-02-21T09:03:10.3134040Z ld.shared.b8 %rs319, [%r511+1920]; 2026-02-21T09:03:10.3134104Z ld.shared.b8 %rs320, [%r511+1984]; 2026-02-21T09:03:10.3134292Z .loc 1 51 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:51:28 2026-02-21T09:03:10.3134355Z shl.b16 %rs321, %rs289, 4; 2026-02-21T09:03:10.3134417Z shl.b16 %rs322, %rs290, 4; 2026-02-21T09:03:10.3134478Z shl.b16 %rs323, %rs291, 4; 2026-02-21T09:03:10.3134545Z shl.b16 %rs324, %rs292, 4; 2026-02-21T09:03:10.3134607Z shl.b16 %rs325, %rs293, 4; 2026-02-21T09:03:10.3134708Z shl.b16 %rs326, %rs294, 4; 2026-02-21T09:03:10.3134774Z shl.b16 %rs327, %rs295, 4; 2026-02-21T09:03:10.3134835Z shl.b16 %rs328, %rs296, 4; 2026-02-21T09:03:10.3134894Z shl.b16 %rs329, %rs297, 4; 2026-02-21T09:03:10.3134952Z shl.b16 %rs330, %rs298, 4; 2026-02-21T09:03:10.3135019Z shl.b16 %rs331, %rs299, 4; 2026-02-21T09:03:10.3135081Z shl.b16 %rs332, %rs300, 4; 2026-02-21T09:03:10.3135141Z shl.b16 %rs333, %rs301, 4; 2026-02-21T09:03:10.3135206Z shl.b16 %rs334, %rs302, 4; 2026-02-21T09:03:10.3135265Z shl.b16 %rs335, %rs303, 4; 2026-02-21T09:03:10.3135355Z shl.b16 %rs336, %rs304, 4; 2026-02-21T09:03:10.3135415Z shl.b16 %rs337, %rs305, 4; 2026-02-21T09:03:10.3135484Z shl.b16 %rs338, %rs306, 4; 2026-02-21T09:03:10.3135543Z shl.b16 %rs339, %rs307, 4; 2026-02-21T09:03:10.3135602Z shl.b16 %rs340, %rs308, 4; 2026-02-21T09:03:10.3135668Z shl.b16 %rs341, %rs309, 4; 2026-02-21T09:03:10.3135735Z shl.b16 %rs342, %rs310, 4; 2026-02-21T09:03:10.3135792Z shl.b16 %rs343, %rs311, 4; 2026-02-21T09:03:10.3135852Z shl.b16 %rs344, %rs312, 4; 2026-02-21T09:03:10.3135915Z shl.b16 %rs345, %rs313, 4; 2026-02-21T09:03:10.3135971Z 
shl.b16 %rs346, %rs314, 4; 2026-02-21T09:03:10.3136027Z shl.b16 %rs347, %rs315, 4; 2026-02-21T09:03:10.3136091Z shl.b16 %rs348, %rs316, 4; 2026-02-21T09:03:10.3136148Z shl.b16 %rs349, %rs317, 4; 2026-02-21T09:03:10.3136226Z shl.b16 %rs350, %rs318, 4; 2026-02-21T09:03:10.3136287Z shl.b16 %rs351, %rs319, 4; 2026-02-21T09:03:10.3136352Z shl.b16 %rs352, %rs320, 4; 2026-02-21T09:03:10.3136521Z .loc 1 66 58 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:66:58 2026-02-21T09:03:10.3136593Z selp.b16 %rs353, %rs321, %rs289, %p11; 2026-02-21T09:03:10.3136659Z cvt.s16.s8 %rs354, %rs353; 2026-02-21T09:03:10.3136717Z shr.s16 %rs355, %rs354, 4; 2026-02-21T09:03:10.3136783Z selp.b16 %rs356, %rs322, %rs290, %p11; 2026-02-21T09:03:10.3136850Z cvt.s16.s8 %rs357, %rs356; 2026-02-21T09:03:10.3136907Z shr.s16 %rs358, %rs357, 4; 2026-02-21T09:03:10.3136974Z selp.b16 %rs359, %rs323, %rs291, %p11; 2026-02-21T09:03:10.3137032Z cvt.s16.s8 %rs360, %rs359; 2026-02-21T09:03:10.3137098Z shr.s16 %rs361, %rs360, 4; 2026-02-21T09:03:10.3137163Z selp.b16 %rs362, %rs324, %rs292, %p11; 2026-02-21T09:03:10.3137222Z cvt.s16.s8 %rs363, %rs362; 2026-02-21T09:03:10.3137292Z shr.s16 %rs364, %rs363, 4; 2026-02-21T09:03:10.3137359Z selp.b16 %rs365, %rs325, %rs293, %p11; 2026-02-21T09:03:10.3137419Z cvt.s16.s8 %rs366, %rs365; 2026-02-21T09:03:10.3137477Z shr.s16 %rs367, %rs366, 4; 2026-02-21T09:03:10.3137552Z selp.b16 %rs368, %rs326, %rs294, %p11; 2026-02-21T09:03:10.3137614Z cvt.s16.s8 %rs369, %rs368; 2026-02-21T09:03:10.3137673Z shr.s16 %rs370, %rs369, 4; 2026-02-21T09:03:10.3137745Z selp.b16 %rs371, %rs327, %rs295, %p11; 2026-02-21T09:03:10.3137803Z cvt.s16.s8 %rs372, %rs371; 2026-02-21T09:03:10.3137860Z shr.s16 %rs373, %rs372, 4; 2026-02-21T09:03:10.3137930Z selp.b16 %rs374, %rs328, %rs296, %p11; 2026-02-21T09:03:10.3137988Z cvt.s16.s8 %rs375, %rs374; 2026-02-21T09:03:10.3138046Z shr.s16 %rs376, %rs375, 4; 2026-02-21T09:03:10.3138110Z selp.b16 %rs377, %rs329, %rs297, %p11; 
2026-02-21T09:03:10.3138176Z cvt.s16.s8 %rs378, %rs377; 2026-02-21T09:03:10.3138233Z shr.s16 %rs379, %rs378, 4; 2026-02-21T09:03:10.3138297Z selp.b16 %rs380, %rs330, %rs298, %p11; 2026-02-21T09:03:10.3138364Z cvt.s16.s8 %rs381, %rs380; 2026-02-21T09:03:10.3138445Z shr.s16 %rs382, %rs381, 4; 2026-02-21T09:03:10.3138508Z selp.b16 %rs383, %rs331, %rs299, %p11; 2026-02-21T09:03:10.3138566Z cvt.s16.s8 %rs384, %rs383; 2026-02-21T09:03:10.3138633Z shr.s16 %rs385, %rs384, 4; 2026-02-21T09:03:10.3138696Z selp.b16 %rs386, %rs332, %rs300, %p11; 2026-02-21T09:03:10.3138753Z cvt.s16.s8 %rs387, %rs386; 2026-02-21T09:03:10.3138816Z shr.s16 %rs388, %rs387, 4; 2026-02-21T09:03:10.3138881Z selp.b16 %rs389, %rs333, %rs301, %p11; 2026-02-21T09:03:10.3138938Z cvt.s16.s8 %rs390, %rs389; 2026-02-21T09:03:10.3138994Z shr.s16 %rs391, %rs390, 4; 2026-02-21T09:03:10.3139092Z selp.b16 %rs392, %rs334, %rs302, %p11; 2026-02-21T09:03:10.3139149Z cvt.s16.s8 %rs393, %rs392; 2026-02-21T09:03:10.3139208Z shr.s16 %rs394, %rs393, 4; 2026-02-21T09:03:10.3139279Z selp.b16 %rs395, %rs335, %rs303, %p11; 2026-02-21T09:03:10.3139336Z cvt.s16.s8 %rs396, %rs395; 2026-02-21T09:03:10.3139394Z shr.s16 %rs397, %rs396, 4; 2026-02-21T09:03:10.3139461Z selp.b16 %rs398, %rs336, %rs304, %p11; 2026-02-21T09:03:10.3139526Z cvt.s16.s8 %rs399, %rs398; 2026-02-21T09:03:10.3139582Z shr.s16 %rs400, %rs399, 4; 2026-02-21T09:03:10.3139668Z selp.b16 %rs401, %rs337, %rs305, %p11; 2026-02-21T09:03:10.3139736Z cvt.s16.s8 %rs402, %rs401; 2026-02-21T09:03:10.3139793Z shr.s16 %rs403, %rs402, 4; 2026-02-21T09:03:10.3139858Z selp.b16 %rs404, %rs338, %rs306, %p11; 2026-02-21T09:03:10.3139923Z cvt.s16.s8 %rs405, %rs404; 2026-02-21T09:03:10.3139979Z shr.s16 %rs406, %rs405, 4; 2026-02-21T09:03:10.3140043Z selp.b16 %rs407, %rs339, %rs307, %p11; 2026-02-21T09:03:10.3140103Z cvt.s16.s8 %rs408, %rs407; 2026-02-21T09:03:10.3140168Z shr.s16 %rs409, %rs408, 4; 2026-02-21T09:03:10.3140231Z selp.b16 %rs410, %rs340, %rs308, %p11; 
2026-02-21T09:03:10.3140289Z cvt.s16.s8 %rs411, %rs410; 2026-02-21T09:03:10.3140354Z shr.s16 %rs412, %rs411, 4; 2026-02-21T09:03:10.3140417Z selp.b16 %rs413, %rs341, %rs309, %p11; 2026-02-21T09:03:10.3140517Z cvt.s16.s8 %rs414, %rs413; 2026-02-21T09:03:10.3140577Z shr.s16 %rs415, %rs414, 4; 2026-02-21T09:03:10.3140649Z selp.b16 %rs416, %rs342, %rs310, %p11; 2026-02-21T09:03:10.3140707Z cvt.s16.s8 %rs417, %rs416; 2026-02-21T09:03:10.3140764Z shr.s16 %rs418, %rs417, 4; 2026-02-21T09:03:10.3140834Z selp.b16 %rs419, %rs343, %rs311, %p11; 2026-02-21T09:03:10.3140892Z cvt.s16.s8 %rs420, %rs419; 2026-02-21T09:03:10.3140948Z shr.s16 %rs421, %rs420, 4; 2026-02-21T09:03:10.3141011Z selp.b16 %rs422, %rs344, %rs312, %p11; 2026-02-21T09:03:10.3141075Z cvt.s16.s8 %rs423, %rs422; 2026-02-21T09:03:10.3141132Z shr.s16 %rs424, %rs423, 4; 2026-02-21T09:03:10.3141198Z selp.b16 %rs425, %rs345, %rs313, %p11; 2026-02-21T09:03:10.3141262Z cvt.s16.s8 %rs426, %rs425; 2026-02-21T09:03:10.3141319Z shr.s16 %rs427, %rs426, 4; 2026-02-21T09:03:10.3141383Z selp.b16 %rs428, %rs346, %rs314, %p11; 2026-02-21T09:03:10.3141441Z cvt.s16.s8 %rs429, %rs428; 2026-02-21T09:03:10.3141506Z shr.s16 %rs430, %rs429, 4; 2026-02-21T09:03:10.3141596Z selp.b16 %rs431, %rs347, %rs315, %p11; 2026-02-21T09:03:10.3141655Z cvt.s16.s8 %rs432, %rs431; 2026-02-21T09:03:10.3141720Z shr.s16 %rs433, %rs432, 4; 2026-02-21T09:03:10.3141785Z selp.b16 %rs434, %rs348, %rs316, %p11; 2026-02-21T09:03:10.3141843Z cvt.s16.s8 %rs435, %rs434; 2026-02-21T09:03:10.3141907Z shr.s16 %rs436, %rs435, 4; 2026-02-21T09:03:10.3141970Z selp.b16 %rs437, %rs349, %rs317, %p11; 2026-02-21T09:03:10.3142028Z cvt.s16.s8 %rs438, %rs437; 2026-02-21T09:03:10.3142087Z shr.s16 %rs439, %rs438, 4; 2026-02-21T09:03:10.3142157Z selp.b16 %rs440, %rs350, %rs318, %p11; 2026-02-21T09:03:10.3142215Z cvt.s16.s8 %rs441, %rs440; 2026-02-21T09:03:10.3142273Z shr.s16 %rs442, %rs441, 4; 2026-02-21T09:03:10.3142343Z selp.b16 %rs443, %rs351, %rs319, %p11; 
2026-02-21T09:03:10.3142402Z cvt.s16.s8 %rs444, %rs443; 2026-02-21T09:03:10.3142459Z shr.s16 %rs445, %rs444, 4; 2026-02-21T09:03:10.3142523Z selp.b16 %rs446, %rs352, %rs320, %p11; 2026-02-21T09:03:10.3142590Z cvt.s16.s8 %rs447, %rs446; 2026-02-21T09:03:10.3147162Z shr.s16 %rs448, %rs447, 4; 2026-02-21T09:03:10.3147384Z .loc 1 71 32 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:71:32 2026-02-21T09:03:10.3147465Z cvt.rn.f32.s16 %r512, %rs355; 2026-02-21T09:03:10.3147528Z cvt.rn.f32.s16 %r513, %rs358; 2026-02-21T09:03:10.3147588Z cvt.rn.f32.s16 %r514, %rs361; 2026-02-21T09:03:10.3147656Z cvt.rn.f32.s16 %r515, %rs364; 2026-02-21T09:03:10.3147716Z cvt.rn.f32.s16 %r516, %rs367; 2026-02-21T09:03:10.3147775Z cvt.rn.f32.s16 %r517, %rs370; 2026-02-21T09:03:10.3147833Z cvt.rn.f32.s16 %r518, %rs373; 2026-02-21T09:03:10.3148012Z cvt.rn.f32.s16 %r519, %rs376; 2026-02-21T09:03:10.3148073Z cvt.rn.f32.s16 %r520, %rs379; 2026-02-21T09:03:10.3148131Z cvt.rn.f32.s16 %r521, %rs382; 2026-02-21T09:03:10.3148200Z cvt.rn.f32.s16 %r522, %rs385; 2026-02-21T09:03:10.3148259Z cvt.rn.f32.s16 %r523, %rs388; 2026-02-21T09:03:10.3148322Z cvt.rn.f32.s16 %r524, %rs391; 2026-02-21T09:03:10.3148386Z cvt.rn.f32.s16 %r525, %rs394; 2026-02-21T09:03:10.3148458Z cvt.rn.f32.s16 %r526, %rs397; 2026-02-21T09:03:10.3148519Z cvt.rn.f32.s16 %r527, %rs400; 2026-02-21T09:03:10.3148618Z cvt.rn.f32.s16 %r528, %rs403; 2026-02-21T09:03:10.3148690Z cvt.rn.f32.s16 %r529, %rs406; 2026-02-21T09:03:10.3148751Z cvt.rn.f32.s16 %r530, %rs409; 2026-02-21T09:03:10.3148811Z cvt.rn.f32.s16 %r531, %rs412; 2026-02-21T09:03:10.3148871Z cvt.rn.f32.s16 %r532, %rs415; 2026-02-21T09:03:10.3148937Z cvt.rn.f32.s16 %r533, %rs418; 2026-02-21T09:03:10.3148994Z cvt.rn.f32.s16 %r534, %rs421; 2026-02-21T09:03:10.3149051Z cvt.rn.f32.s16 %r535, %rs424; 2026-02-21T09:03:10.3149117Z cvt.rn.f32.s16 %r536, %rs427; 2026-02-21T09:03:10.3149174Z cvt.rn.f32.s16 %r537, %rs430; 2026-02-21T09:03:10.3149231Z cvt.rn.f32.s16 %r538, %rs433; 
2026-02-21T09:03:10.3149289Z cvt.rn.f32.s16 %r539, %rs436; 2026-02-21T09:03:10.3149352Z cvt.rn.f32.s16 %r540, %rs439; 2026-02-21T09:03:10.3149452Z cvt.rn.f32.s16 %r541, %rs442; 2026-02-21T09:03:10.3149515Z cvt.rn.f32.s16 %r542, %rs445; 2026-02-21T09:03:10.3149581Z cvt.rn.f32.s16 %r543, %rs448; 2026-02-21T09:03:10.3149645Z st.shared.b32 [%r20], %r512; 2026-02-21T09:03:10.3149715Z st.shared.b32 [%r20+4096], %r520; 2026-02-21T09:03:10.3149784Z st.shared.b32 [%r20+8192], %r528; 2026-02-21T09:03:10.3149850Z st.shared.b32 [%r20+12288], %r536; 2026-02-21T09:03:10.3149912Z st.shared.b32 [%r21], %r513; 2026-02-21T09:03:10.3149973Z st.shared.b32 [%r21+4096], %r521; 2026-02-21T09:03:10.3150041Z st.shared.b32 [%r21+8192], %r529; 2026-02-21T09:03:10.3150101Z st.shared.b32 [%r21+12288], %r537; 2026-02-21T09:03:10.3150162Z st.shared.b32 [%r22], %r514; 2026-02-21T09:03:10.3150230Z st.shared.b32 [%r22+4096], %r522; 2026-02-21T09:03:10.3150291Z st.shared.b32 [%r22+8192], %r530; 2026-02-21T09:03:10.3150351Z st.shared.b32 [%r22+12288], %r538; 2026-02-21T09:03:10.3150411Z st.shared.b32 [%r23], %r515; 2026-02-21T09:03:10.3150483Z st.shared.b32 [%r23+4096], %r523; 2026-02-21T09:03:10.3150544Z st.shared.b32 [%r23+8192], %r531; 2026-02-21T09:03:10.3150604Z st.shared.b32 [%r23+12288], %r539; 2026-02-21T09:03:10.3150671Z st.shared.b32 [%r24], %r516; 2026-02-21T09:03:10.3150733Z st.shared.b32 [%r24+4096], %r524; 2026-02-21T09:03:10.3150794Z st.shared.b32 [%r24+8192], %r532; 2026-02-21T09:03:10.3150854Z st.shared.b32 [%r24+12288], %r540; 2026-02-21T09:03:10.3150921Z st.shared.b32 [%r25], %r517; 2026-02-21T09:03:10.3150981Z st.shared.b32 [%r25+4096], %r525; 2026-02-21T09:03:10.3151041Z st.shared.b32 [%r25+8192], %r533; 2026-02-21T09:03:10.3151109Z st.shared.b32 [%r25+12288], %r541; 2026-02-21T09:03:10.3151170Z st.shared.b32 [%r26], %r518; 2026-02-21T09:03:10.3151230Z st.shared.b32 [%r26+4096], %r526; 2026-02-21T09:03:10.3151297Z st.shared.b32 [%r26+8192], %r534; 2026-02-21T09:03:10.3151356Z 
st.shared.b32 [%r26+12288], %r542; 2026-02-21T09:03:10.3151417Z st.shared.b32 [%r27], %r519; 2026-02-21T09:03:10.3151477Z st.shared.b32 [%r27+4096], %r527; 2026-02-21T09:03:10.3151638Z st.shared.b32 [%r27+8192], %r535; 2026-02-21T09:03:10.3151701Z st.shared.b32 [%r27+12288], %r543; 2026-02-21T09:03:10.3151891Z .loc 1 34 93 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:34:93 2026-02-21T09:03:10.3151964Z shl.b32 %r544, %r674, 3; 2026-02-21T09:03:10.3152027Z add.s32 %r545, %r48, %r544; 2026-02-21T09:03:10.3152090Z add.s32 %r672, %r545, 53248; 2026-02-21T09:03:10.3152146Z $L__tmp10: 2026-02-21T09:03:10.3152385Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3152488Z // begin inline asm 2026-02-21T09:03:10.3152573Z fence.proxy.async.shared::cta; 2026-02-21T09:03:10.3152637Z // end inline asm 2026-02-21T09:03:10.3152692Z bar.sync 0; 2026-02-21T09:03:10.3152753Z @%p12 bra $L__BB0_6; 2026-02-21T09:03:10.3152865Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:03:10.3152938Z elect.sync %r594|%p54, -1; 2026-02-21T09:03:10.3152999Z mov.b32 %r548, 67635472; 2026-02-21T09:03:10.3153056Z // begin inline asm 2026-02-21T09:03:10.3153263Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 0 ], %rd78, %r548, %p53; 2026-02-21T09:03:10.3153322Z // end inline asm 2026-02-21T09:03:10.3153378Z // begin inline asm 2026-02-21T09:03:10.3153534Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 8 ], %rd79, %r548, %p53; 2026-02-21T09:03:10.3153590Z // end inline asm 2026-02-21T09:03:10.3153645Z // begin inline asm 2026-02-21T09:03:10.3153799Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 16 ], %rd80, %r548, %p53; 2026-02-21T09:03:10.3153856Z // end inline asm 2026-02-21T09:03:10.3153912Z // begin inline asm 2026-02-21T09:03:10.3154061Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 24 ], %rd81, %r548, %p53; 
2026-02-21T09:03:10.3154116Z // end inline asm 2026-02-21T09:03:10.3154200Z // begin inline asm 2026-02-21T09:03:10.3154341Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 32 ], %rd82, %r548, %p53; 2026-02-21T09:03:10.3154404Z // end inline asm 2026-02-21T09:03:10.3154461Z // begin inline asm 2026-02-21T09:03:10.3154600Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 40 ], %rd83, %r548, %p53; 2026-02-21T09:03:10.3154662Z // end inline asm 2026-02-21T09:03:10.3154717Z // begin inline asm 2026-02-21T09:03:10.3154855Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 48 ], %rd84, %r548, %p53; 2026-02-21T09:03:10.3154917Z // end inline asm 2026-02-21T09:03:10.3154974Z // begin inline asm 2026-02-21T09:03:10.3155114Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 56 ], %rd85, %r548, %p53; 2026-02-21T09:03:10.3155169Z // end inline asm 2026-02-21T09:03:10.3155234Z // begin inline asm 2026-02-21T09:03:10.3155372Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 64 ], %rd86, %r548, %p53; 2026-02-21T09:03:10.3155429Z // end inline asm 2026-02-21T09:03:10.3155496Z // begin inline asm 2026-02-21T09:03:10.3155637Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 72 ], %rd87, %r548, %p53; 2026-02-21T09:03:10.3155693Z // end inline asm 2026-02-21T09:03:10.3155756Z // begin inline asm 2026-02-21T09:03:10.3155897Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 80 ], %rd88, %r548, %p53; 2026-02-21T09:03:10.3155954Z // end inline asm 2026-02-21T09:03:10.3156019Z // begin inline asm 2026-02-21T09:03:10.3156161Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 88 ], %rd89, %r548, %p53; 2026-02-21T09:03:10.3156219Z // end inline asm 2026-02-21T09:03:10.3156275Z // begin inline asm 2026-02-21T09:03:10.3156424Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 96 ], %rd90, %r548, %p53; 2026-02-21T09:03:10.3156492Z // end 
inline asm 2026-02-21T09:03:10.3156549Z // begin inline asm 2026-02-21T09:03:10.3156700Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 104 ], %rd91, %r548, %p53; 2026-02-21T09:03:10.3156784Z // end inline asm 2026-02-21T09:03:10.3156839Z // begin inline asm 2026-02-21T09:03:10.3156988Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 112 ], %rd92, %r548, %p53; 2026-02-21T09:03:10.3157044Z // end inline asm 2026-02-21T09:03:10.3157099Z // begin inline asm 2026-02-21T09:03:10.3157246Z @%p54 tcgen05.mma.cta_group::1.kind::tf32 [ %r670 + 0 ], [ %r328 + 120 ], %rd93, %r548, %p53; 2026-02-21T09:03:10.3157301Z // end inline asm 2026-02-21T09:03:10.3157365Z cvt.u64.u32 %rd128, %r672; 2026-02-21T09:03:10.3157421Z // begin inline asm 2026-02-21T09:03:10.3157579Z @%p54 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd128]; 2026-02-21T09:03:10.3157634Z // end inline asm 2026-02-21T09:03:10.3157692Z bra.uni $L__BB0_6; 2026-02-21T09:03:10.3157799Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:03:10.3157895Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:03:10.3157953Z mov.b32 %r616, 1; 2026-02-21T09:03:10.3158210Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3158268Z // begin inline asm 2026-02-21T09:03:10.3158323Z 2026-02-21T09:03:10.3158374Z { 2026-02-21T09:03:10.3158448Z .reg .pred complete; 2026-02-21T09:03:10.3158504Z waitLoop: 2026-02-21T09:03:10.3158626Z mbarrier.try_wait.parity.shared.b64 complete, [%r672], %r616; 2026-02-21T09:03:10.3158703Z @!complete bra.uni waitLoop; 2026-02-21T09:03:10.3158754Z } 2026-02-21T09:03:10.3158759Z 2026-02-21T09:03:10.3158817Z // end inline asm 2026-02-21T09:03:10.3158870Z $L__tmp11: 2026-02-21T09:03:10.3159054Z .loc 1 34 93 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:34:93 2026-02-21T09:03:10.3159118Z cp.async.wait_group 0; 2026-02-21T09:03:10.3159173Z bar.sync 0; 2026-02-21T09:03:10.3159239Z 
add.s32 %r617, %r48, 53248; 2026-02-21T09:03:10.3159318Z // begin inline asm 2026-02-21T09:03:10.3159410Z @%p89 mbarrier.inval.shared::cta.b64 [%r617]; 2026-02-21T09:03:10.3159465Z // end inline asm 2026-02-21T09:03:10.3159527Z bar.sync 0; 2026-02-21T09:03:10.3159583Z // begin inline asm 2026-02-21T09:03:10.3159664Z @%p89 mbarrier.inval.shared::cta.b64 [%r67]; 2026-02-21T09:03:10.3159728Z // end inline asm 2026-02-21T09:03:10.3159901Z .loc 1 82 50 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:82:50 2026-02-21T09:03:10.3159964Z add.s32 %r645, %r33, %r28; 2026-02-21T09:03:10.3160139Z .loc 1 82 22 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:82:22 2026-02-21T09:03:10.3160210Z mad.wide.u32 %rd138, %r645, 2, %rd37; 2026-02-21T09:03:10.3160272Z cvt.u64.u32 %rd140, %r33; 2026-02-21T09:03:10.3160332Z cvt.u64.u32 %rd141, %r28; 2026-02-21T09:03:10.3160402Z add.s64 %rd142, %rd141, %rd140; 2026-02-21T09:03:10.3160462Z shl.b64 %rd143, %rd142, 1; 2026-02-21T09:03:10.3160527Z add.s64 %rd144, %rd37, %rd143; 2026-02-21T09:03:10.3160595Z add.s64 %rd139, %rd144, 458752; 2026-02-21T09:03:10.3160648Z $L__tmp12: 2026-02-21T09:03:10.3160871Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3160936Z // begin inline asm 2026-02-21T09:03:10.3161224Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r619, %r620, %r621, %r622, %r623, %r624, %r625, %r626, %r627, %r628, %r629, %r630, %r631, %r632, %r633, %r634}, [%r635 + 0], 16; 2026-02-21T09:03:10.3161281Z // end inline asm 2026-02-21T09:03:10.3161337Z // begin inline asm 2026-02-21T09:03:10.3161415Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:03:10.3161470Z // end inline asm 2026-02-21T09:03:10.3161528Z cvt.u64.u32 %rd145, %r619; 2026-02-21T09:03:10.3161636Z cvt.u64.u32 %rd146, %r620; 2026-02-21T09:03:10.3161696Z shl.b64 %rd147, %rd146, 32; 2026-02-21T09:03:10.3161758Z or.b64 %rd148, %rd145, %rd147; 2026-02-21T09:03:10.3161812Z $L__tmp13: 
2026-02-21T09:03:10.3162024Z .loc 1 81 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:81:28 2026-02-21T09:03:10.3162088Z mov.b64 {%r646, %r647}, %rd148; 2026-02-21T09:03:10.3162161Z cvt.rn.bf16x2.f32 %r648, %r647, %r646; 2026-02-21T09:03:10.3162222Z $L__tmp14: 2026-02-21T09:03:10.3162437Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3162498Z cvt.u64.u32 %rd149, %r621; 2026-02-21T09:03:10.3162564Z cvt.u64.u32 %rd150, %r622; 2026-02-21T09:03:10.3162624Z shl.b64 %rd151, %rd150, 32; 2026-02-21T09:03:10.3162711Z or.b64 %rd152, %rd149, %rd151; 2026-02-21T09:03:10.3162765Z $L__tmp15: 2026-02-21T09:03:10.3162942Z .loc 1 81 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:81:28 2026-02-21T09:03:10.3163005Z mov.b64 {%r649, %r650}, %rd152; 2026-02-21T09:03:10.3163075Z cvt.rn.bf16x2.f32 %r651, %r650, %r649; 2026-02-21T09:03:10.3163139Z $L__tmp16: 2026-02-21T09:03:10.3163349Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3163456Z cvt.u64.u32 %rd153, %r623; 2026-02-21T09:03:10.3163524Z cvt.u64.u32 %rd154, %r624; 2026-02-21T09:03:10.3163583Z shl.b64 %rd155, %rd154, 32; 2026-02-21T09:03:10.3163643Z or.b64 %rd156, %rd153, %rd155; 2026-02-21T09:03:10.3163697Z $L__tmp17: 2026-02-21T09:03:10.3163870Z .loc 1 81 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:81:28 2026-02-21T09:03:10.3163930Z mov.b64 {%r652, %r653}, %rd156; 2026-02-21T09:03:10.3164000Z cvt.rn.bf16x2.f32 %r654, %r653, %r652; 2026-02-21T09:03:10.3164061Z $L__tmp18: 2026-02-21T09:03:10.3164269Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3164329Z cvt.u64.u32 %rd157, %r625; 2026-02-21T09:03:10.3164418Z cvt.u64.u32 %rd158, %r626; 2026-02-21T09:03:10.3164481Z shl.b64 %rd159, %rd158, 32; 2026-02-21T09:03:10.3164541Z or.b64 %rd160, %rd157, %rd159; 
2026-02-21T09:03:10.3164594Z $L__tmp19: 2026-02-21T09:03:10.3164778Z .loc 1 81 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:81:28 2026-02-21T09:03:10.3164840Z mov.b64 {%r655, %r656}, %rd160; 2026-02-21T09:03:10.3164908Z cvt.rn.bf16x2.f32 %r657, %r656, %r655; 2026-02-21T09:03:10.3164972Z $L__tmp20: 2026-02-21T09:03:10.3165186Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3165249Z cvt.u64.u32 %rd161, %r627; 2026-02-21T09:03:10.3165322Z cvt.u64.u32 %rd162, %r628; 2026-02-21T09:03:10.3165383Z shl.b64 %rd163, %rd162, 32; 2026-02-21T09:03:10.3165444Z or.b64 %rd164, %rd161, %rd163; 2026-02-21T09:03:10.3165500Z $L__tmp21: 2026-02-21T09:03:10.3165684Z .loc 1 81 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:81:28 2026-02-21T09:03:10.3165745Z mov.b64 {%r658, %r659}, %rd164; 2026-02-21T09:03:10.3165812Z cvt.rn.bf16x2.f32 %r660, %r659, %r658; 2026-02-21T09:03:10.3165870Z $L__tmp22: 2026-02-21T09:03:10.3166083Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3166141Z cvt.u64.u32 %rd165, %r629; 2026-02-21T09:03:10.3166199Z cvt.u64.u32 %rd166, %r630; 2026-02-21T09:03:10.3166266Z shl.b64 %rd167, %rd166, 32; 2026-02-21T09:03:10.3166325Z or.b64 %rd168, %rd165, %rd167; 2026-02-21T09:03:10.3166377Z $L__tmp23: 2026-02-21T09:03:10.3166553Z .loc 1 81 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:81:28 2026-02-21T09:03:10.3166612Z mov.b64 {%r661, %r662}, %rd168; 2026-02-21T09:03:10.3166679Z cvt.rn.bf16x2.f32 %r663, %r662, %r661; 2026-02-21T09:03:10.3166738Z $L__tmp24: 2026-02-21T09:03:10.3166954Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3167037Z cvt.u64.u32 %rd169, %r631; 2026-02-21T09:03:10.3167096Z cvt.u64.u32 %rd170, %r632; 2026-02-21T09:03:10.3167163Z shl.b64 %rd171, %rd170, 32; 2026-02-21T09:03:10.3167222Z 
or.b64 %rd172, %rd169, %rd171; 2026-02-21T09:03:10.3167275Z $L__tmp25: 2026-02-21T09:03:10.3167451Z .loc 1 81 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:81:28 2026-02-21T09:03:10.3167509Z mov.b64 {%r664, %r665}, %rd172; 2026-02-21T09:03:10.3167576Z cvt.rn.bf16x2.f32 %r666, %r665, %r664; 2026-02-21T09:03:10.3167636Z $L__tmp26: 2026-02-21T09:03:10.3167865Z .loc 2 291 36 // standard.py:291:36 @[ cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:78:40 ] 2026-02-21T09:03:10.3167924Z cvt.u64.u32 %rd173, %r633; 2026-02-21T09:03:10.3167984Z cvt.u64.u32 %rd174, %r634; 2026-02-21T09:03:10.3168053Z shl.b64 %rd175, %rd174, 32; 2026-02-21T09:03:10.3168114Z or.b64 %rd176, %rd173, %rd175; 2026-02-21T09:03:10.3168169Z $L__tmp27: 2026-02-21T09:03:10.3168342Z .loc 1 81 28 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:81:28 2026-02-21T09:03:10.3168424Z mov.b64 {%r667, %r668}, %rd176; 2026-02-21T09:03:10.3168492Z cvt.rn.bf16x2.f32 %r669, %r668, %r667; 2026-02-21T09:03:10.3168665Z .loc 1 82 81 // cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:82:81 2026-02-21T09:03:10.3168764Z st.shared.v4.b32 [%r29], {%r648, %r651, %r654, %r657}; 2026-02-21T09:03:10.3168857Z st.shared.v4.b32 [%r30], {%r660, %r663, %r666, %r669}; 2026-02-21T09:03:10.3168915Z bar.sync 0; 2026-02-21T09:03:10.3169021Z ld.shared.v4.b32 {%r640, %r641, %r642, %r643}, [%r31+512]; 2026-02-21T09:03:10.3169107Z ld.shared.v4.b32 {%r636, %r637, %r638, %r639}, [%r31]; 2026-02-21T09:03:10.3169165Z // begin inline asm 2026-02-21T09:03:10.3169274Z st.global.v4.b32 [ %rd138 + 0 ], { %r636, %r637, %r638, %r639 }; 2026-02-21T09:03:10.3169352Z // end inline asm 2026-02-21T09:03:10.3169410Z // begin inline asm 2026-02-21T09:03:10.3169510Z st.global.v4.b32 [ %rd139 + 0 ], { %r640, %r641, %r642, %r643 }; 2026-02-21T09:03:10.3169573Z // end inline asm 2026-02-21T09:03:10.3169654Z $L__BB0_8: // %._crit_edge 2026-02-21T09:03:10.3169822Z .loc 1 19 4 // 
cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py:19:4 2026-02-21T09:03:10.3169885Z bar.sync 0; 2026-02-21T09:03:10.3169942Z // begin inline asm 2026-02-21T09:03:10.3170058Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r670, 256; 2026-02-21T09:03:10.3170119Z // end inline asm 2026-02-21T09:03:10.3170173Z ret; 2026-02-21T09:03:10.3170225Z $L__tmp28: 2026-02-21T09:03:10.3170281Z $L__func_end0: 2026-02-21T09:03:10.3170372Z // -- End function 2026-02-21T09:03:10.3170423Z } 2026-02-21T09:03:10.3170630Z .file 1 "/tmp/torchinductor_root/ml/cmlicxwgeth7zahdwg3tuufcbkk3iaskwf2nj44gwll2gmwbhr65.py" 2026-02-21T09:03:10.3170810Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:03:10.3170872Z .section .debug_abbrev 2026-02-21T09:03:10.3170924Z { 2026-02-21T09:03:10.3171019Z .b8 1 // Abbreviation Code 2026-02-21T09:03:10.3171106Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:03:10.3171187Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:03:10.3171267Z .b8 37 // DW_AT_producer 2026-02-21T09:03:10.3171349Z .b8 8 // DW_FORM_string 2026-02-21T09:03:10.3171423Z .b8 19 // DW_AT_language 2026-02-21T09:03:10.3171498Z .b8 5 // DW_FORM_data2 2026-02-21T09:03:10.3171634Z .b8 3 // DW_AT_name 2026-02-21T09:03:10.3171710Z .b8 8 // DW_FORM_string 2026-02-21T09:03:10.3171789Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:03:10.3171899Z .b8 6 // DW_FORM_data4 2026-02-21T09:03:10.3171976Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:03:10.3172051Z .b8 8 // DW_FORM_string 2026-02-21T09:03:10.3172123Z .b8 0 // EOM(1) 2026-02-21T09:03:10.3172198Z .b8 0 // EOM(2) 2026-02-21T09:03:10.3172281Z .b8 2 // Abbreviation Code 2026-02-21T09:03:10.3172361Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:03:10.3172471Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:03:10.3172545Z .b8 3 // DW_AT_name 2026-02-21T09:03:10.3172617Z .b8 8 // DW_FORM_string 2026-02-21T09:03:10.3172702Z .b8 32 // DW_AT_inline 2026-02-21T09:03:10.3172779Z .b8 11 // DW_FORM_data1 
2026-02-21T09:03:10.3172850Z .b8 0 // EOM(1) 2026-02-21T09:03:10.3172941Z .b8 0 // EOM(2) 2026-02-21T09:03:10.3173030Z .b8 3 // Abbreviation Code 2026-02-21T09:03:10.3173111Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:03:10.3173188Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:03:10.3173271Z .b8 17 // DW_AT_low_pc 2026-02-21T09:03:10.3173345Z .b8 1 // DW_FORM_addr 2026-02-21T09:03:10.3173424Z .b8 18 // DW_AT_high_pc 2026-02-21T09:03:10.3173506Z .b8 1 // DW_FORM_addr 2026-02-21T09:03:10.3173593Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:03:10.3173667Z .b8 19 // DW_FORM_ref4 2026-02-21T09:03:10.3173770Z .b8 0 // EOM(1) 2026-02-21T09:03:10.3173841Z .b8 0 // EOM(2) 2026-02-21T09:03:10.3173921Z .b8 4 // Abbreviation Code 2026-02-21T09:03:10.3174017Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:03:10.3174106Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:03:10.3174194Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:03:10.3174269Z .b8 19 // DW_FORM_ref4 2026-02-21T09:03:10.3174354Z .b8 17 // DW_AT_low_pc 2026-02-21T09:03:10.3174429Z .b8 1 // DW_FORM_addr 2026-02-21T09:03:10.3174508Z .b8 18 // DW_AT_high_pc 2026-02-21T09:03:10.3174588Z .b8 1 // DW_FORM_addr 2026-02-21T09:03:10.3174668Z .b8 88 // DW_AT_call_file 2026-02-21T09:03:10.3174748Z .b8 11 // DW_FORM_data1 2026-02-21T09:03:10.3174825Z .b8 89 // DW_AT_call_line 2026-02-21T09:03:10.3174910Z .b8 11 // DW_FORM_data1 2026-02-21T09:03:10.3174992Z .b8 87 // DW_AT_call_column 2026-02-21T09:03:10.3175068Z .b8 11 // DW_FORM_data1 2026-02-21T09:03:10.3175145Z .b8 0 // EOM(1) 2026-02-21T09:03:10.3175216Z .b8 0 // EOM(2) 2026-02-21T09:03:10.3175286Z .b8 0 // EOM(3) 2026-02-21T09:03:10.3175348Z } 2026-02-21T09:03:10.3175414Z .section .debug_info 2026-02-21T09:03:10.3175466Z { 2026-02-21T09:03:10.3175552Z .b32 178 // Length of Unit 2026-02-21T09:03:10.3175648Z .b8 2 // DWARF version number 2026-02-21T09:03:10.3175702Z .b8 0 2026-02-21T09:03:10.3175826Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:03:10.3175944Z .b8 8 // Address Size (in bytes) 2026-02-21T09:03:10.3176050Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:03:10.3176133Z .b8 116 // DW_AT_producer 2026-02-21T09:03:10.3176200Z .b8 114 2026-02-21T09:03:10.3176256Z .b8 105 2026-02-21T09:03:10.3176309Z .b8 116 2026-02-21T09:03:10.3176363Z .b8 111 2026-02-21T09:03:10.3176425Z .b8 110 2026-02-21T09:03:10.3176479Z .b8 0 2026-02-21T09:03:10.3176556Z .b8 2 // DW_AT_language 2026-02-21T09:03:10.3176647Z .b8 0 2026-02-21T09:03:10.3176726Z .b8 99 // DW_AT_name 2026-02-21T09:03:10.3176780Z .b8 109 2026-02-21T09:03:10.3176834Z .b8 108 2026-02-21T09:03:10.3176896Z .b8 105 2026-02-21T09:03:10.3176951Z .b8 99 2026-02-21T09:03:10.3177004Z .b8 120 2026-02-21T09:03:10.3177057Z .b8 119 2026-02-21T09:03:10.3177119Z .b8 103 2026-02-21T09:03:10.3177173Z .b8 101 2026-02-21T09:03:10.3177226Z .b8 116 2026-02-21T09:03:10.3177286Z .b8 104 2026-02-21T09:03:10.3177340Z .b8 55 2026-02-21T09:03:10.3177417Z .b8 122 2026-02-21T09:03:10.3177472Z .b8 97 2026-02-21T09:03:10.3177532Z .b8 104 2026-02-21T09:03:10.3177584Z .b8 100 2026-02-21T09:03:10.3177637Z .b8 119 2026-02-21T09:03:10.3177695Z .b8 103 2026-02-21T09:03:10.3177746Z .b8 51 2026-02-21T09:03:10.3177798Z .b8 116 2026-02-21T09:03:10.3177849Z .b8 117 2026-02-21T09:03:10.3177909Z .b8 117 2026-02-21T09:03:10.3177961Z .b8 102 2026-02-21T09:03:10.3178014Z .b8 99 2026-02-21T09:03:10.3178072Z .b8 98 2026-02-21T09:03:10.3178126Z .b8 107 2026-02-21T09:03:10.3178178Z .b8 107 2026-02-21T09:03:10.3178229Z .b8 51 2026-02-21T09:03:10.3178289Z .b8 105 2026-02-21T09:03:10.3178341Z .b8 97 2026-02-21T09:03:10.3178393Z .b8 115 2026-02-21T09:03:10.3178446Z .b8 107 2026-02-21T09:03:10.3178506Z .b8 119 2026-02-21T09:03:10.3178558Z .b8 102 2026-02-21T09:03:10.3178610Z .b8 50 2026-02-21T09:03:10.3178696Z .b8 110 2026-02-21T09:03:10.3178750Z .b8 106 2026-02-21T09:03:10.3178802Z .b8 52 2026-02-21T09:03:10.3178854Z .b8 52 2026-02-21T09:03:10.3178914Z .b8 103 
2026-02-21T09:03:10.3178967Z .b8 119 2026-02-21T09:03:10.3179019Z .b8 108 2026-02-21T09:03:10.3179079Z .b8 108 2026-02-21T09:03:10.3179131Z .b8 50 2026-02-21T09:03:10.3179183Z .b8 103 2026-02-21T09:03:10.3179235Z .b8 109 2026-02-21T09:03:10.3179296Z .b8 119 2026-02-21T09:03:10.3179349Z .b8 98 2026-02-21T09:03:10.3179402Z .b8 104 2026-02-21T09:03:10.3179454Z .b8 114 2026-02-21T09:03:10.3179514Z .b8 54 2026-02-21T09:03:10.3179566Z .b8 53 2026-02-21T09:03:10.3179618Z .b8 46 2026-02-21T09:03:10.3179680Z .b8 112 2026-02-21T09:03:10.3179733Z .b8 121 2026-02-21T09:03:10.3179785Z .b8 0 2026-02-21T09:03:10.3179878Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:03:10.3179964Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:03:10.3180020Z .b8 116 2026-02-21T09:03:10.3180075Z .b8 109 2026-02-21T09:03:10.3180139Z .b8 112 2026-02-21T09:03:10.3180194Z .b8 47 2026-02-21T09:03:10.3180249Z .b8 116 2026-02-21T09:03:10.3180304Z .b8 111 2026-02-21T09:03:10.3180365Z .b8 114 2026-02-21T09:03:10.3180418Z .b8 99 2026-02-21T09:03:10.3180471Z .b8 104 2026-02-21T09:03:10.3180531Z .b8 105 2026-02-21T09:03:10.3180584Z .b8 110 2026-02-21T09:03:10.3180636Z .b8 100 2026-02-21T09:03:10.3180688Z .b8 117 2026-02-21T09:03:10.3180746Z .b8 99 2026-02-21T09:03:10.3180798Z .b8 116 2026-02-21T09:03:10.3180850Z .b8 111 2026-02-21T09:03:10.3180908Z .b8 114 2026-02-21T09:03:10.3180963Z .b8 95 2026-02-21T09:03:10.3181016Z .b8 114 2026-02-21T09:03:10.3181068Z .b8 111 2026-02-21T09:03:10.3181130Z .b8 111 2026-02-21T09:03:10.3181183Z .b8 116 2026-02-21T09:03:10.3181235Z .b8 47 2026-02-21T09:03:10.3181286Z .b8 109 2026-02-21T09:03:10.3181345Z .b8 108 2026-02-21T09:03:10.3181397Z .b8 0 2026-02-21T09:03:10.3181500Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:03:10.3181617Z .b8 95 // DW_AT_name 2026-02-21T09:03:10.3181699Z .b8 104 2026-02-21T09:03:10.3181752Z .b8 101 2026-02-21T09:03:10.3181805Z .b8 108 2026-02-21T09:03:10.3181866Z .b8 105 2026-02-21T09:03:10.3181921Z .b8 111 2026-02-21T09:03:10.3181973Z 
.b8 110 2026-02-21T09:03:10.3182033Z .b8 95 2026-02-21T09:03:10.3182087Z .b8 109 2026-02-21T09:03:10.3182139Z .b8 97 2026-02-21T09:03:10.3182192Z .b8 116 2026-02-21T09:03:10.3182262Z .b8 109 2026-02-21T09:03:10.3182311Z .b8 117 2026-02-21T09:03:10.3182362Z .b8 108 2026-02-21T09:03:10.3182419Z .b8 95 2026-02-21T09:03:10.3182471Z .b8 98 2026-02-21T09:03:10.3182522Z .b8 102 2026-02-21T09:03:10.3182617Z .b8 49 2026-02-21T09:03:10.3182675Z .b8 54 2026-02-21T09:03:10.3182725Z .b8 95 2026-02-21T09:03:10.3182775Z .b8 105 2026-02-21T09:03:10.3182832Z .b8 110 2026-02-21T09:03:10.3182881Z .b8 116 2026-02-21T09:03:10.3182930Z .b8 52 2026-02-21T09:03:10.3182980Z .b8 0 2026-02-21T09:03:10.3183060Z .b8 1 // DW_AT_inline 2026-02-21T09:03:10.3183160Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:03:10.3183248Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:03:10.3183366Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:03:10.3183457Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:03:10.3183566Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:03:10.3183653Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:03:10.3183741Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:03:10.3183824Z .b64 $L__tmp27 // DW_AT_high_pc 2026-02-21T09:03:10.3183900Z .b8 1 // DW_AT_call_file 2026-02-21T09:03:10.3183982Z .b8 78 // DW_AT_call_line 2026-02-21T09:03:10.3184059Z .b8 40 // DW_AT_call_column 2026-02-21T09:03:10.3184171Z .b8 0 // End Of Children Mark 2026-02-21T09:03:10.3184260Z .b8 0 // End Of Children Mark 2026-02-21T09:03:10.3184311Z } 2026-02-21T09:03:10.3184378Z .section .debug_macinfo { } 2026-02-21T09:03:10.3184383Z 2026-02-21T09:03:10.3184465Z ================================================================ 2026-02-21T09:03:10.3184570Z please share the reproducer above with Triton project. 
2026-02-21T09:03:12.1590252Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:03:12.1594894Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:12.1596897Z ^ 2026-02-21T09:03:12.1597380Z /tmp/torchinductor_root/uv/cuvio34qae35iloldr2an3sfrjkasralkftp2tz6c76qzpvivuvp.py:87:40: note: called from 2026-02-21T09:03:12.1597853Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:03:12.1598101Z ^ 2026-02-21T09:03:12.1598529Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:03:12.1599009Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:12.1599268Z ^ 2026-02-21T09:03:12.1599475Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:03:12.1604490Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:12.1608860Z %cst = arith.constant dense<0> : tensor<8x2x32xi8> 2026-02-21T09:03:12.1613635Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:03:12.1615724Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:03:12.1615998Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:03:12.1616457Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:12.1616668Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:12.1616895Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:03:12.1617165Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:03:12.1617417Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:03:12.1617656Z %cst_3 = arith.constant dense<4> : tensor<8x32xi8> 2026-02-21T09:03:12.1617912Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:03:12.1618157Z %cst_5 = 
arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:03:12.1618448Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:03:12.1618688Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:03:12.1618964Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:03:12.1619178Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:03:12.1619373Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:03:12.1619578Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:03:12.1619776Z %0 = tt.get_program_id x : i32 2026-02-21T09:03:12.1619977Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:03:12.1620222Z %2 = arith.minsi %1, %c224_i32 : i32 2026-02-21T09:03:12.1620412Z %3 = arith.subi %2, %0 : i32 2026-02-21T09:03:12.1620582Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:03:12.1620779Z %4 = arith.subi %c1_i32, %c1_i32_7 : i32 2026-02-21T09:03:12.1620968Z %5 = arith.addi %3, %4 : i32 2026-02-21T09:03:12.1621136Z %6 = arith.divui %5, %c1_i32 : i32 2026-02-21T09:03:12.1621317Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:03:12.1621496Z %7 = arith.remsi %6, %c2_i32_8 : i32 2026-02-21T09:03:12.1621927Z %8 = arith.subi %6, %7 : i32 2026-02-21T09:03:12.1622097Z %9 = arith.muli %8, %c1_i32 : i32 2026-02-21T09:03:12.1622283Z %10 = arith.addi %0, %9 : i32 2026-02-21T09:03:12.1622468Z %11 = arith.muli %c1_i32, %c2_i32_8 : i32 2026-02-21T09:03:12.1622753Z scf.for %arg3 = %0 to %10 step %11 : i32 { 2026-02-21T09:03:12.1622972Z %12 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:12.1623165Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:03:12.1623359Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:03:12.1623543Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:03:12.1623744Z %16 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:12.1623933Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:03:12.1624121Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:03:12.1624296Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:03:12.1624487Z %20 = arith.muli %18, %c64_i32 : i32 
2026-02-21T09:03:12.1624776Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:12.1625047Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:03:12.1625250Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:03:12.1625453Z %24 = arith.muli %19, %c32_i32 : i32 2026-02-21T09:03:12.1625680Z %25 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:12.1625933Z %26 = tt.splat %24 : i32 -> tensor<32xi32> 2026-02-21T09:03:12.1626149Z %27 = arith.addi %26, %25 : tensor<32xi32> 2026-02-21T09:03:12.1626349Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:03:12.1626555Z %c24_i32 = arith.constant 24 : i32 2026-02-21T09:03:12.1626879Z %28 = scf.for %arg4 = %c0_i32 to %c4080_i32 step %c24_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:12.1627259Z %68 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1627510Z %69 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1627730Z %70 = arith.addi %69, %68 : tensor<8xi32> 2026-02-21T09:03:12.1627940Z %71 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:12.1628171Z %72 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1628428Z %73 = tt.splat %71 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1628682Z %74 = arith.addi %73, %72 : tensor<16xi32> 2026-02-21T09:03:12.1628952Z %75 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1629227Z %76 = arith.muli %75, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1629502Z %77 = tt.expand_dims %74 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1629802Z %78 = tt.broadcast %76 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1630065Z %79 = tt.broadcast %77 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1630347Z %80 = arith.addi %78, %79 : tensor<64x16xi32> 2026-02-21T09:03:12.1630599Z %81 = tt.splat %arg0 : !tt.ptr -> 
tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1630898Z %82 = tt.addptr %81, %80 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1631157Z %83 = tt.load %82 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1631411Z %84 = arith.extf %83 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1631771Z %85 = tt.expand_dims %70 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1632037Z %86 = arith.muli %85, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1632303Z %87 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1632592Z %88 = tt.broadcast %86 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1632862Z %89 = tt.broadcast %87 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1633100Z %90 = arith.addi %88, %89 : tensor<8x32xi32> 2026-02-21T09:03:12.1633354Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1633634Z %92 = tt.addptr %91, %90 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1633988Z %93 = tt.load %92 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1634262Z %94 = arith.shli %93, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1634480Z %95 = arith.shrsi %94, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1634705Z %96 = arith.shrsi %93, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1634946Z %97 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1635246Z %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1635566Z %99 = tt.expand_dims %98 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1635883Z %100 = tt.expand_dims %95 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1636215Z %101 = tt.expand_dims %96 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1636500Z %102 = arith.cmpi eq, %99, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1636768Z %103 = 
tt.broadcast %102 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1637056Z %104 = tt.broadcast %100 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1637365Z %105 = arith.select %103, %104, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1637660Z %106 = arith.cmpi eq, %99, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1637926Z %107 = tt.broadcast %101 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1638209Z %108 = tt.broadcast %106 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1638493Z %109 = arith.select %108, %107, %105 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1638792Z %110 = tt.reshape %109 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1639067Z %111 = arith.sitofp %110 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1639431Z %112 = tt.dot %84, %111, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1639773Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T09:03:12.1640013Z %113 = arith.muli %c8_i32, %c1_i32_12 : i32 2026-02-21T09:03:12.1640220Z %114 = arith.addi %arg4, %113 : i32 2026-02-21T09:03:12.1640463Z %115 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1640715Z %116 = tt.splat %114 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1640929Z %117 = arith.addi %116, %115 : tensor<8xi32> 2026-02-21T09:03:12.1641129Z %118 = arith.muli %114, %c2_i32 : i32 2026-02-21T09:03:12.1641377Z %119 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1641686Z %120 = tt.splat %118 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1641908Z %121 = arith.addi %120, %119 : tensor<16xi32> 2026-02-21T09:03:12.1642179Z %122 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1642464Z %123 = arith.muli %122, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1642746Z %124 = tt.expand_dims %121 {axis = 0 : i32} : 
tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1643077Z %125 = tt.broadcast %123 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1643360Z %126 = tt.broadcast %124 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1643611Z %127 = arith.addi %125, %126 : tensor<64x16xi32> 2026-02-21T09:03:12.1643876Z %128 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1644189Z %129 = tt.addptr %128, %127 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1644461Z %130 = tt.load %129 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1644724Z %131 = arith.extf %130 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1645018Z %132 = tt.expand_dims %117 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1645295Z %133 = arith.muli %132, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1645581Z %134 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1645883Z %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1646154Z %136 = tt.broadcast %134 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1646397Z %137 = arith.addi %135, %136 : tensor<8x32xi32> 2026-02-21T09:03:12.1646644Z %138 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1646920Z %139 = tt.addptr %138, %137 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1647234Z %140 = tt.load %139 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1647514Z %141 = arith.shli %140, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1647732Z %142 = arith.shrsi %141, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1647958Z %143 = arith.shrsi %140, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1648205Z %144 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1648511Z %145 = tt.expand_dims %144 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1648836Z %146 = tt.expand_dims %145 {axis 
= 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1649167Z %147 = tt.expand_dims %142 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1649497Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1649782Z %149 = arith.cmpi eq, %146, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1650042Z %150 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1650320Z %151 = tt.broadcast %147 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1650602Z %152 = arith.select %150, %151, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1650884Z %153 = arith.cmpi eq, %146, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1651168Z %154 = tt.broadcast %148 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1651443Z %155 = tt.broadcast %153 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1651756Z %156 = arith.select %155, %154, %152 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1652079Z %157 = tt.reshape %156 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1652345Z %158 = arith.sitofp %157 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1652710Z %159 = tt.dot %131, %158, %112, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1653081Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T09:03:12.1653288Z %160 = arith.muli %c8_i32, %c2_i32_13 : i32 2026-02-21T09:03:12.1653500Z %161 = arith.addi %arg4, %160 : i32 2026-02-21T09:03:12.1653745Z %162 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1654009Z %163 = tt.splat %161 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1654233Z %164 = arith.addi %163, %162 : tensor<8xi32> 2026-02-21T09:03:12.1654471Z %165 = arith.muli %161, %c2_i32 : i32 2026-02-21T09:03:12.1654727Z %166 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 
2026-02-21T09:03:12.1654989Z %167 = tt.splat %165 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1655214Z %168 = arith.addi %167, %166 : tensor<16xi32> 2026-02-21T09:03:12.1655488Z %169 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1655772Z %170 = arith.muli %169, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1656055Z %171 = tt.expand_dims %168 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1656368Z %172 = tt.broadcast %170 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1656699Z %173 = tt.broadcast %171 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1656955Z %174 = arith.addi %172, %173 : tensor<64x16xi32> 2026-02-21T09:03:12.1657220Z %175 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1657538Z %176 = tt.addptr %175, %174 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1657819Z %177 = tt.load %176 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1658085Z %178 = arith.extf %177 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1658389Z %179 = tt.expand_dims %164 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1658680Z %180 = arith.muli %179, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1658956Z %181 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1659278Z %182 = tt.broadcast %180 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1659569Z %183 = tt.broadcast %181 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1659826Z %184 = arith.addi %182, %183 : tensor<8x32xi32> 2026-02-21T09:03:12.1660095Z %185 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1660390Z %186 = tt.addptr %185, %184 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1660729Z %187 = tt.load %186 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1661017Z %188 = arith.shli %187, %cst_3 : tensor<8x32xi8> 
2026-02-21T09:03:12.1661240Z %189 = arith.shrsi %188, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1661472Z %190 = arith.shrsi %187, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1661760Z %191 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1662084Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1662400Z %193 = tt.expand_dims %192 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1662732Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1663082Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1663361Z %196 = arith.cmpi eq, %193, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1663615Z %197 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1663881Z %198 = tt.broadcast %194 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1664172Z %199 = arith.select %197, %198, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1664481Z %200 = arith.cmpi eq, %193, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1664737Z %201 = tt.broadcast %195 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1665017Z %202 = tt.broadcast %200 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1665296Z %203 = arith.select %202, %201, %199 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1665590Z %204 = tt.reshape %203 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1665878Z %205 = arith.sitofp %204 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1666241Z %206 = tt.dot %178, %205, %159, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1666567Z scf.yield %206 : tensor<64x32xf32> 2026-02-21T09:03:12.1666741Z } 2026-02-21T09:03:12.1667012Z %29 = scf.for %arg4 = %c4080_i32 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %28) -> 
(tensor<64x32xf32>) : i32 { 2026-02-21T09:03:12.1667370Z %68 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1667627Z %69 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1667835Z %70 = arith.addi %69, %68 : tensor<8xi32> 2026-02-21T09:03:12.1668032Z %71 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:12.1668305Z %72 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1668551Z %73 = tt.splat %71 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1668759Z %74 = arith.addi %73, %72 : tensor<16xi32> 2026-02-21T09:03:12.1669016Z %75 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1669294Z %76 = arith.muli %75, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1669559Z %77 = tt.expand_dims %74 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1669845Z %78 = tt.broadcast %76 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1670109Z %79 = tt.broadcast %77 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1670339Z %80 = arith.addi %78, %79 : tensor<64x16xi32> 2026-02-21T09:03:12.1670588Z %81 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1670870Z %82 = tt.addptr %81, %80 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1671137Z %83 = tt.load %82 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1671374Z %84 = arith.extf %83 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1671677Z %85 = tt.expand_dims %70 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1671940Z %86 = arith.muli %85, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1672190Z %87 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1672476Z %88 = tt.broadcast %86 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1672732Z %89 = tt.broadcast %87 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1672971Z %90 = arith.addi %88, 
%89 : tensor<8x32xi32> 2026-02-21T09:03:12.1673211Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1673480Z %92 = tt.addptr %91, %90 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1673781Z %93 = tt.load %92 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1674089Z %94 = arith.shli %93, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1674316Z %95 = arith.shrsi %94, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1674527Z %96 = arith.shrsi %93, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1674770Z %97 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1675066Z %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1675369Z %99 = tt.expand_dims %98 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1675718Z %100 = tt.expand_dims %95 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1676034Z %101 = tt.expand_dims %96 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1676319Z %102 = arith.cmpi eq, %99, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1676575Z %103 = tt.broadcast %102 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1676847Z %104 = tt.broadcast %100 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1677163Z %105 = arith.select %103, %104, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1677437Z %106 = arith.cmpi eq, %99, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1677693Z %107 = tt.broadcast %101 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1677963Z %108 = tt.broadcast %106 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1678250Z %109 = arith.select %108, %107, %105 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1678540Z %110 = tt.reshape %109 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1678797Z %111 = arith.sitofp %110 : tensor<16x32xi8> to 
tensor<16x32xf32> 2026-02-21T09:03:12.1679188Z %112 = tt.dot %84, %111, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1679509Z scf.yield %112 : tensor<64x32xf32> 2026-02-21T09:03:12.1679705Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:12.1679928Z %30 = arith.truncf %29 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:12.1680212Z %31 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1680481Z %32 = arith.muli %31, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:12.1680731Z %33 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1681015Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:12.1681268Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:12.1681504Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:03:12.1681781Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:12.1682057Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:12.1682320Z tt.store %38, %30 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:12.1682521Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:03:12.1682717Z %39 = arith.muli %c1_i32, %c1_i32_9 : i32 2026-02-21T09:03:12.1682907Z %40 = arith.addi %arg3, %39 : i32 2026-02-21T09:03:12.1683099Z %41 = arith.divsi %40, %c896_i32 : i32 2026-02-21T09:03:12.1683289Z %42 = arith.muli %41, %c4_i32 : i32 2026-02-21T09:03:12.1683466Z %43 = arith.subi %c1_i32, %42 : i32 2026-02-21T09:03:12.1683649Z %44 = arith.minsi %43, %c4_i32 : i32 2026-02-21T09:03:12.1683831Z %45 = arith.remsi %40, %c896_i32 : i32 2026-02-21T09:03:12.1684021Z %46 = arith.remsi %45, %44 : i32 2026-02-21T09:03:12.1684193Z %47 = arith.addi %42, %46 : i32 2026-02-21T09:03:12.1684368Z %48 = arith.divsi %45, %44 : i32 2026-02-21T09:03:12.1684542Z %49 = arith.muli %47, %c64_i32 : i32 2026-02-21T09:03:12.1684773Z %50 = 
tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:12.1685076Z %51 = tt.splat %49 : i32 -> tensor<64xi32> 2026-02-21T09:03:12.1685275Z %52 = arith.addi %51, %50 : tensor<64xi32> 2026-02-21T09:03:12.1685474Z %53 = arith.muli %48, %c32_i32 : i32 2026-02-21T09:03:12.1685699Z %54 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:12.1685945Z %55 = tt.splat %53 : i32 -> tensor<32xi32> 2026-02-21T09:03:12.1686137Z %56 = arith.addi %55, %54 : tensor<32xi32> 2026-02-21T09:03:12.1686345Z %c4080_i32_10 = arith.constant 4080 : i32 2026-02-21T09:03:12.1686541Z %c24_i32_11 = arith.constant 24 : i32 2026-02-21T09:03:12.1686910Z %57 = scf.for %arg4 = %c0_i32 to %c4080_i32_10 step %c24_i32_11 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:12.1687287Z %68 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1687532Z %69 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1687743Z %70 = arith.addi %69, %68 : tensor<8xi32> 2026-02-21T09:03:12.1687937Z %71 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:12.1688195Z %72 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1688445Z %73 = tt.splat %71 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1688640Z %74 = arith.addi %73, %72 : tensor<16xi32> 2026-02-21T09:03:12.1688895Z %75 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1689154Z %76 = arith.muli %75, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1689482Z %77 = tt.expand_dims %74 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1689766Z %78 = tt.broadcast %76 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1690025Z %79 = tt.broadcast %77 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1690289Z %80 = arith.addi %78, %79 : tensor<64x16xi32> 2026-02-21T09:03:12.1690531Z %81 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 
2026-02-21T09:03:12.1690817Z %82 = tt.addptr %81, %80 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1691066Z %83 = tt.load %82 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1691302Z %84 = arith.extf %83 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1691596Z %85 = tt.expand_dims %70 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1691857Z %86 = arith.muli %85, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1692110Z %87 = tt.expand_dims %56 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1692392Z %88 = tt.broadcast %86 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1692649Z %89 = tt.broadcast %87 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1692878Z %90 = arith.addi %88, %89 : tensor<8x32xi32> 2026-02-21T09:03:12.1693112Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1693382Z %92 = tt.addptr %91, %90 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1693673Z %93 = tt.load %92 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1693938Z %94 = arith.shli %93, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1694144Z %95 = arith.shrsi %94, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1694352Z %96 = arith.shrsi %93, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1694587Z %97 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1694875Z %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1695180Z %99 = tt.expand_dims %98 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1695491Z %100 = tt.expand_dims %95 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1695810Z %101 = tt.expand_dims %96 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1696115Z %102 = arith.cmpi eq, %99, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1696369Z %103 = tt.broadcast %102 : 
tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1696629Z %104 = tt.broadcast %100 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1696910Z %105 = arith.select %103, %104, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1697182Z %106 = arith.cmpi eq, %99, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1697424Z %107 = tt.broadcast %101 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1697718Z %108 = tt.broadcast %106 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1697995Z %109 = arith.select %108, %107, %105 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1698279Z %110 = tt.reshape %109 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1698558Z %111 = arith.sitofp %110 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1698977Z %112 = tt.dot %84, %111, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1699322Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T09:03:12.1699529Z %113 = arith.muli %c8_i32, %c1_i32_12 : i32 2026-02-21T09:03:12.1699736Z %114 = arith.addi %arg4, %113 : i32 2026-02-21T09:03:12.1699973Z %115 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1700230Z %116 = tt.splat %114 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1700445Z %117 = arith.addi %116, %115 : tensor<8xi32> 2026-02-21T09:03:12.1700646Z %118 = arith.muli %114, %c2_i32 : i32 2026-02-21T09:03:12.1700890Z %119 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1701173Z %120 = tt.splat %118 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1701392Z %121 = arith.addi %120, %119 : tensor<16xi32> 2026-02-21T09:03:12.1701675Z %122 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1701965Z %123 = arith.muli %122, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1702241Z %124 = tt.expand_dims %121 {axis = 0 : i32} : tensor<16xi32> -> 
tensor<1x16xi32> 2026-02-21T09:03:12.1702543Z %125 = tt.broadcast %123 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1702820Z %126 = tt.broadcast %124 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1703067Z %127 = arith.addi %125, %126 : tensor<64x16xi32> 2026-02-21T09:03:12.1703327Z %128 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1703630Z %129 = tt.addptr %128, %127 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1703909Z %130 = tt.load %129 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1704160Z %131 = arith.extf %130 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1704458Z %132 = tt.expand_dims %117 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1704740Z %133 = arith.muli %132, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1704998Z %134 = tt.expand_dims %56 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1705305Z %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1705576Z %136 = tt.broadcast %134 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1705818Z %137 = arith.addi %135, %136 : tensor<8x32xi32> 2026-02-21T09:03:12.1706065Z %138 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1706343Z %139 = tt.addptr %138, %137 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1706657Z %140 = tt.load %139 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1706928Z %141 = arith.shli %140, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1707181Z %142 = arith.shrsi %141, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1707398Z %143 = arith.shrsi %140, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1707634Z %144 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1707928Z %145 = tt.expand_dims %144 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1708240Z %146 = tt.expand_dims %145 {axis = 2 : i32} : 
tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1708564Z %147 = tt.expand_dims %142 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1708904Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1709189Z %149 = arith.cmpi eq, %146, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1709440Z %150 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1709706Z %151 = tt.broadcast %147 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1710020Z %152 = arith.select %150, %151, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1710291Z %153 = arith.cmpi eq, %146, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1710542Z %154 = tt.broadcast %148 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1710804Z %155 = tt.broadcast %153 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1711078Z %156 = arith.select %155, %154, %152 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1711357Z %157 = tt.reshape %156 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1711641Z %158 = arith.sitofp %157 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1711996Z %159 = tt.dot %131, %158, %112, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1712355Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T09:03:12.1712565Z %160 = arith.muli %c8_i32, %c2_i32_13 : i32 2026-02-21T09:03:12.1712767Z %161 = arith.addi %arg4, %160 : i32 2026-02-21T09:03:12.1712998Z %162 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1713247Z %163 = tt.splat %161 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1713449Z %164 = arith.addi %163, %162 : tensor<8xi32> 2026-02-21T09:03:12.1713650Z %165 = arith.muli %161, %c2_i32 : i32 2026-02-21T09:03:12.1713883Z %166 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1714138Z %167 = 
tt.splat %165 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1714349Z %168 = arith.addi %167, %166 : tensor<16xi32> 2026-02-21T09:03:12.1714602Z %169 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1714880Z %170 = arith.muli %169, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1715142Z %171 = tt.expand_dims %168 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1715443Z %172 = tt.broadcast %170 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1715708Z %173 = tt.broadcast %171 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1715954Z %174 = arith.addi %172, %173 : tensor<64x16xi32> 2026-02-21T09:03:12.1716206Z %175 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1716499Z %176 = tt.addptr %175, %174 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1716774Z %177 = tt.load %176 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1717014Z %178 = arith.extf %177 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1717307Z %179 = tt.expand_dims %164 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1717580Z %180 = arith.muli %179, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1717840Z %181 = tt.expand_dims %56 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1718171Z %182 = tt.broadcast %180 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1718434Z %183 = tt.broadcast %181 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1718675Z %184 = arith.addi %182, %183 : tensor<8x32xi32> 2026-02-21T09:03:12.1718907Z %185 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1719186Z %186 = tt.addptr %185, %184 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1719487Z %187 = tt.load %186 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1719775Z %188 = arith.shli %187, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1719997Z %189 = arith.shrsi 
%188, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1720209Z %190 = arith.shrsi %187, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1720458Z %191 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1720750Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1721097Z %193 = tt.expand_dims %192 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1721426Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1721775Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1722063Z %196 = arith.cmpi eq, %193, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1722312Z %197 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1722589Z %198 = tt.broadcast %194 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1722881Z %199 = arith.select %197, %198, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1723155Z %200 = arith.cmpi eq, %193, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1723443Z %201 = tt.broadcast %195 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1723706Z %202 = tt.broadcast %200 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1723989Z %203 = arith.select %202, %201, %199 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1724270Z %204 = tt.reshape %203 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1724536Z %205 = arith.sitofp %204 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1724892Z %206 = tt.dot %178, %205, %159, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1725218Z scf.yield %206 : tensor<64x32xf32> 2026-02-21T09:03:12.1725403Z } 2026-02-21T09:03:12.1725673Z %58 = scf.for %arg4 = %c4080_i32_10 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %57) -> (tensor<64x32xf32>) : i32 { 
2026-02-21T09:03:12.1726045Z %68 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1726294Z %69 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1726494Z %70 = arith.addi %69, %68 : tensor<8xi32> 2026-02-21T09:03:12.1726698Z %71 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:12.1726926Z %72 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1727174Z %73 = tt.splat %71 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1727370Z %74 = arith.addi %73, %72 : tensor<16xi32> 2026-02-21T09:03:12.1727621Z %75 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1727893Z %76 = arith.muli %75, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1728143Z %77 = tt.expand_dims %74 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1728431Z %78 = tt.broadcast %76 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1728684Z %79 = tt.broadcast %77 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1728954Z %80 = arith.addi %78, %79 : tensor<64x16xi32> 2026-02-21T09:03:12.1729197Z %81 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1729485Z %82 = tt.addptr %81, %80 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1729746Z %83 = tt.load %82 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1729978Z %84 = arith.extf %83 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1730261Z %85 = tt.expand_dims %70 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1730515Z %86 = arith.muli %85, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1730796Z %87 = tt.expand_dims %56 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1731075Z %88 = tt.broadcast %86 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1731334Z %89 = tt.broadcast %87 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1731593Z %90 = arith.addi %88, %89 : tensor<8x32xi32> 
2026-02-21T09:03:12.1731822Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1732115Z %92 = tt.addptr %91, %90 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1732400Z %93 = tt.load %92 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1732661Z %94 = arith.shli %93, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1732866Z %95 = arith.shrsi %94, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1733076Z %96 = arith.shrsi %93, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1733316Z %97 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1733595Z %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1733906Z %99 = tt.expand_dims %98 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1734244Z %100 = tt.expand_dims %95 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1734567Z %101 = tt.expand_dims %96 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1734858Z %102 = arith.cmpi eq, %99, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1735108Z %103 = tt.broadcast %102 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1735376Z %104 = tt.broadcast %100 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1735652Z %105 = arith.select %103, %104, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1735921Z %106 = arith.cmpi eq, %99, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1736167Z %107 = tt.broadcast %101 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1736429Z %108 = tt.broadcast %106 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1736708Z %109 = arith.select %108, %107, %105 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1736983Z %110 = tt.reshape %109 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1737244Z %111 = arith.sitofp %110 : tensor<16x32xi8> to tensor<16x32xf32> 
2026-02-21T09:03:12.1737597Z %112 = tt.dot %84, %111, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1737924Z scf.yield %112 : tensor<64x32xf32> 2026-02-21T09:03:12.1738116Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:12.1738325Z %59 = arith.truncf %58 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:12.1738611Z %60 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1738874Z %61 = arith.muli %60, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:12.1739129Z %62 = tt.expand_dims %56 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1739408Z %63 = tt.broadcast %61 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:12.1739662Z %64 = tt.broadcast %62 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:12.1739918Z %65 = arith.addi %63, %64 : tensor<64x32xi32> 2026-02-21T09:03:12.1740154Z %66 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:12.1740430Z %67 = tt.addptr %66, %65 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:12.1740678Z tt.store %67, %59 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:12.1740879Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:12.1741065Z scf.for %arg3 = %10 to %2 step %c1_i32 : i32 { 2026-02-21T09:03:12.1741274Z %12 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:12.1741508Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:03:12.1741713Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:03:12.1741894Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:03:12.1742076Z %16 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:12.1742260Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:03:12.1742433Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:03:12.1742605Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:03:12.1742773Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:03:12.1743023Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 
2026-02-21T09:03:12.1743274Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:03:12.1743466Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:03:12.1743650Z %24 = arith.muli %19, %c32_i32 : i32 2026-02-21T09:03:12.1743867Z %25 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:12.1744112Z %26 = tt.splat %24 : i32 -> tensor<32xi32> 2026-02-21T09:03:12.1744310Z %27 = arith.addi %26, %25 : tensor<32xi32> 2026-02-21T09:03:12.1744509Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:03:12.1744700Z %c24_i32 = arith.constant 24 : i32 2026-02-21T09:03:12.1745047Z %28 = scf.for %arg4 = %c0_i32 to %c4080_i32 step %c24_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:12.1745422Z %39 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1745671Z %40 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1745878Z %41 = arith.addi %40, %39 : tensor<8xi32> 2026-02-21T09:03:12.1746074Z %42 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:12.1746309Z %43 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1746557Z %44 = tt.splat %42 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1746758Z %45 = arith.addi %44, %43 : tensor<16xi32> 2026-02-21T09:03:12.1747019Z %46 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1747287Z %47 = arith.muli %46, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1747550Z %48 = tt.expand_dims %45 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1747842Z %49 = tt.broadcast %47 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1748114Z %50 = tt.broadcast %48 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1748369Z %51 = arith.addi %49, %50 : tensor<64x16xi32> 2026-02-21T09:03:12.1748620Z %52 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1748920Z %53 = tt.addptr %52, %51 : tensor<64x16x!tt.ptr>, 
tensor<64x16xi32> 2026-02-21T09:03:12.1749182Z %54 = tt.load %53 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1749434Z %55 = arith.extf %54 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1749723Z %56 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1750000Z %57 = arith.muli %56, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1750270Z %58 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1750565Z %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1750872Z %60 = tt.broadcast %58 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1751115Z %61 = arith.addi %59, %60 : tensor<8x32xi32> 2026-02-21T09:03:12.1751362Z %62 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1751661Z %63 = tt.addptr %62, %61 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1751970Z %64 = tt.load %63 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1752249Z %65 = arith.shli %64, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1752468Z %66 = arith.shrsi %65, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1752722Z %67 = arith.shrsi %64, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1752968Z %68 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1753274Z %69 = tt.expand_dims %68 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1753599Z %70 = tt.expand_dims %69 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1753963Z %71 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1754291Z %72 = tt.expand_dims %67 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1754565Z %73 = arith.cmpi eq, %70, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1754813Z %74 = tt.broadcast %73 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1755069Z %75 = tt.broadcast %71 : 
tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1755346Z %76 = arith.select %74, %75, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1755615Z %77 = arith.cmpi eq, %70, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1755856Z %78 = tt.broadcast %72 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1756118Z %79 = tt.broadcast %77 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1756427Z %80 = arith.select %79, %78, %76 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1756704Z %81 = tt.reshape %80 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1756954Z %82 = arith.sitofp %81 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1757307Z %83 = tt.dot %55, %82, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1757628Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:03:12.1757822Z %84 = arith.muli %c8_i32, %c1_i32_9 : i32 2026-02-21T09:03:12.1758021Z %85 = arith.addi %arg4, %84 : i32 2026-02-21T09:03:12.1758243Z %86 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1758488Z %87 = tt.splat %85 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1758684Z %88 = arith.addi %87, %86 : tensor<8xi32> 2026-02-21T09:03:12.1758880Z %89 = arith.muli %85, %c2_i32 : i32 2026-02-21T09:03:12.1759113Z %90 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1759357Z %91 = tt.splat %89 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1759560Z %92 = arith.addi %91, %90 : tensor<16xi32> 2026-02-21T09:03:12.1759807Z %93 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1760080Z %94 = arith.muli %93, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1760329Z %95 = tt.expand_dims %92 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1760617Z %96 = tt.broadcast %94 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1760879Z %97 = 
tt.broadcast %95 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1761113Z %98 = arith.addi %96, %97 : tensor<64x16xi32> 2026-02-21T09:03:12.1761360Z %99 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1761679Z %100 = tt.addptr %99, %98 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1761982Z %101 = tt.load %100 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1762239Z %102 = arith.extf %101 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1762530Z %103 = tt.expand_dims %88 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1762808Z %104 = arith.muli %103, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1763066Z %105 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1763369Z %106 = tt.broadcast %104 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1763668Z %107 = tt.broadcast %105 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1763924Z %108 = arith.addi %106, %107 : tensor<8x32xi32> 2026-02-21T09:03:12.1764170Z %109 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1764441Z %110 = tt.addptr %109, %108 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1764746Z %111 = tt.load %110 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1765037Z %112 = arith.shli %111, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1765274Z %113 = arith.shrsi %112, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1765500Z %114 = arith.shrsi %111, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1765756Z %115 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1766059Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1766380Z %117 = tt.expand_dims %116 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1766719Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 
2026-02-21T09:03:12.1767039Z %119 = tt.expand_dims %114 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1767364Z %120 = arith.cmpi eq, %117, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1767627Z %121 = tt.broadcast %120 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1767896Z %122 = tt.broadcast %118 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1768182Z %123 = arith.select %121, %122, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1768454Z %124 = arith.cmpi eq, %117, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1768709Z %125 = tt.broadcast %119 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1768971Z %126 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1769248Z %127 = arith.select %126, %125, %123 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1769530Z %128 = tt.reshape %127 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1769785Z %129 = arith.sitofp %128 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1770144Z %130 = tt.dot %102, %129, %83, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1770463Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:03:12.1770668Z %131 = arith.muli %c8_i32, %c2_i32_10 : i32 2026-02-21T09:03:12.1770867Z %132 = arith.addi %arg4, %131 : i32 2026-02-21T09:03:12.1771094Z %133 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1771344Z %134 = tt.splat %132 : i32 -> tensor<8xi32> 2026-02-21T09:03:12.1771579Z %135 = arith.addi %134, %133 : tensor<8xi32> 2026-02-21T09:03:12.1771782Z %136 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:03:12.1772016Z %137 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1772285Z %138 = tt.splat %136 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1772499Z %139 = arith.addi %138, %137 : tensor<16xi32> 2026-02-21T09:03:12.1772758Z %140 = 
tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1773077Z %141 = arith.muli %140, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1773339Z %142 = tt.expand_dims %139 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1773639Z %143 = tt.broadcast %141 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1773904Z %144 = tt.broadcast %142 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1774152Z %145 = arith.addi %143, %144 : tensor<64x16xi32> 2026-02-21T09:03:12.1774405Z %146 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1774737Z %147 = tt.addptr %146, %145 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1775013Z %148 = tt.load %147 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1775255Z %149 = arith.extf %148 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1775549Z %150 = tt.expand_dims %135 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1775820Z %151 = arith.muli %150, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1776122Z %152 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1776422Z %153 = tt.broadcast %151 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1776685Z %154 = tt.broadcast %152 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1776941Z %155 = arith.addi %153, %154 : tensor<8x32xi32> 2026-02-21T09:03:12.1777182Z %156 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1777471Z %157 = tt.addptr %156, %155 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:12.1777777Z %158 = tt.load %157 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1778055Z %159 = arith.shli %158, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1778316Z %160 = arith.shrsi %159, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1778535Z %161 = arith.shrsi %158, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1778789Z %162 = 
tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1779083Z %163 = tt.expand_dims %162 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1779407Z %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1779737Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1780052Z %166 = tt.expand_dims %161 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1780343Z %167 = arith.cmpi eq, %164, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1780591Z %168 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1780865Z %169 = tt.broadcast %165 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1781148Z %170 = arith.select %168, %169, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1781428Z %171 = arith.cmpi eq, %164, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1781720Z %172 = tt.broadcast %166 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1781982Z %173 = tt.broadcast %171 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1782264Z %174 = arith.select %173, %172, %170 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1782546Z %175 = tt.reshape %174 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1782815Z %176 = arith.sitofp %175 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1783176Z %177 = tt.dot %149, %176, %130, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1783494Z scf.yield %177 : tensor<64x32xf32> 2026-02-21T09:03:12.1783676Z } 2026-02-21T09:03:12.1783942Z %29 = scf.for %arg4 = %c4080_i32 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %28) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:12.1784350Z %39 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:12.1784598Z %40 = tt.splat %arg4 : i32 -> 
tensor<8xi32> 2026-02-21T09:03:12.1784803Z %41 = arith.addi %40, %39 : tensor<8xi32> 2026-02-21T09:03:12.1785004Z %42 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:12.1785236Z %43 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:12.1785485Z %44 = tt.splat %42 : i32 -> tensor<16xi32> 2026-02-21T09:03:12.1785683Z %45 = arith.addi %44, %43 : tensor<16xi32> 2026-02-21T09:03:12.1785967Z %46 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1786235Z %47 = arith.muli %46, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:12.1786496Z %48 = tt.expand_dims %45 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:12.1786792Z %49 = tt.broadcast %47 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1788979Z %50 = tt.broadcast %48 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:12.1789259Z %51 = arith.addi %49, %50 : tensor<64x16xi32> 2026-02-21T09:03:12.1789517Z %52 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1789826Z %53 = tt.addptr %52, %51 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:12.1790092Z %54 = tt.load %53 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:12.1790346Z %55 = arith.extf %54 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:12.1790644Z %56 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:12.1791147Z %57 = arith.muli %56, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:12.1791419Z %58 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1791771Z %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1792048Z %60 = tt.broadcast %58 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:12.1792302Z %61 = arith.addi %59, %60 : tensor<8x32xi32> 2026-02-21T09:03:12.1792546Z %62 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1792839Z %63 = tt.addptr %62, %61 : tensor<8x32x!tt.ptr>, 
tensor<8x32xi32> 2026-02-21T09:03:12.1793152Z %64 = tt.load %63 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:12.1793436Z %65 = arith.shli %64, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1793655Z %66 = arith.shrsi %65, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1793885Z %67 = arith.shrsi %64, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:12.1796076Z %68 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:12.1796383Z %69 = tt.expand_dims %68 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:12.1796702Z %70 = tt.expand_dims %69 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:12.1797015Z %71 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1797331Z %72 = tt.expand_dims %67 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:12.1797603Z %73 = arith.cmpi eq, %70, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1797852Z %74 = tt.broadcast %73 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1798119Z %75 = tt.broadcast %71 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1798391Z %76 = arith.select %74, %75, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1798665Z %77 = arith.cmpi eq, %70, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:12.1798911Z %78 = tt.broadcast %72 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:12.1799175Z %79 = tt.broadcast %77 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:12.1799474Z %80 = arith.select %79, %78, %76 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:12.1799742Z %81 = tt.reshape %80 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:12.1800000Z %82 = arith.sitofp %81 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:12.1800346Z %83 = tt.dot %55, %82, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:12.1800668Z scf.yield %83 : 
tensor<64x32xf32> 2026-02-21T09:03:12.1800857Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:12.1801078Z %30 = arith.truncf %29 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:12.1801396Z %31 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:12.1801684Z %32 = arith.muli %31, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:12.1801943Z %33 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:12.1802220Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:12.1802481Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:12.1802737Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:03:12.1802986Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:12.1803268Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:12.1803521Z tt.store %38, %30 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:12.1803727Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:12.1803888Z tt.return 2026-02-21T09:03:12.1804025Z } 2026-02-21T09:03:12.1804142Z } 2026-02-21T09:03:12.1804220Z 2026-02-21T09:03:12.1804271Z {-# 2026-02-21T09:03:12.1804398Z external_resources: { 2026-02-21T09:03:12.1804562Z mlir_reproducer: { 2026-02-21T09:03:12.1812103Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:03:12.1816636Z disable_threading: false, 2026-02-21T09:03:12.1816820Z verify_each: true 2026-02-21T09:03:12.1816966Z } 2026-02-21T09:03:12.1817096Z } 2026-02-21T09:03:12.1817214Z #-} 2026-02-21T09:03:12.1817659Z /tmp/torchinductor_root/uv/cuvio34qae35iloldr2an3sfrjkasralkftp2tz6c76qzpvivuvp.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:03:12.1818749Z /tmp/torchinductor_root/uv/cuvio34qae35iloldr2an3sfrjkasralkftp2tz6c76qzpvivuvp.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:03:12.1819615Z [157s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:03:12.1820810Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:03:12.1821935Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:03:12.1822203Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:03:13.1967506Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:03:13.1971000Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:13.1972934Z ^ 2026-02-21T09:03:13.1973342Z /tmp/torchinductor_root/eu/ceukoscejkhf6df26w7fyctej2ccypjoxzu5kke66ofoxzig7fmy.py:87:40: note: called from 2026-02-21T09:03:13.1973755Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:03:13.1973983Z ^ 2026-02-21T09:03:13.1974386Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:03:13.1974865Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:13.1975313Z ^ 2026-02-21T09:03:13.1975544Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:03:13.1976556Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:13.1977106Z %cst = arith.constant dense<0> : tensor<8x2x32xi8> 2026-02-21T09:03:13.1977331Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:03:13.1977522Z %c8_i32 = 
arith.constant 8 : i32 2026-02-21T09:03:13.1977725Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:03:13.1977926Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:13.1978113Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:13.1978336Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:03:13.1978602Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:03:13.1978853Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:03:13.1979098Z %cst_3 = arith.constant dense<4> : tensor<8x32xi8> 2026-02-21T09:03:13.1979335Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:03:13.1979589Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:03:13.1979802Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:03:13.1980042Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:03:13.1980280Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:03:13.1980491Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:03:13.1980682Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:03:13.1980869Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:03:13.1981065Z %0 = tt.get_program_id x : i32 2026-02-21T09:03:13.1981256Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:03:13.1981442Z %2 = arith.minsi %1, %c224_i32 : i32 2026-02-21T09:03:13.1981713Z %3 = arith.subi %2, %0 : i32 2026-02-21T09:03:13.1981897Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:03:13.1982206Z %4 = arith.subi %c1_i32, %c1_i32_7 : i32 2026-02-21T09:03:13.1982397Z %5 = arith.addi %3, %4 : i32 2026-02-21T09:03:13.1982585Z %6 = arith.divui %5, %c1_i32 : i32 2026-02-21T09:03:13.1982769Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:03:13.1982960Z %7 = arith.remsi %6, %c2_i32_8 : i32 2026-02-21T09:03:13.1983151Z %8 = arith.subi %6, %7 : i32 2026-02-21T09:03:13.1983326Z %9 = arith.muli %8, %c1_i32 : i32 2026-02-21T09:03:13.1983514Z %10 = arith.addi %0, %9 : i32 2026-02-21T09:03:13.1983703Z %11 = arith.muli %c1_i32, %c2_i32_8 
: i32 2026-02-21T09:03:13.1983984Z scf.for %arg3 = %0 to %10 step %11 : i32 { 2026-02-21T09:03:13.1984194Z %12 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:13.1984400Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:03:13.1984587Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:03:13.1984780Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:03:13.1984980Z %16 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:13.1985175Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:03:13.1985367Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:03:13.1985588Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:03:13.1985779Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:03:13.1986017Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:13.1986289Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:03:13.1986504Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:03:13.1986703Z %24 = arith.muli %19, %c32_i32 : i32 2026-02-21T09:03:13.1986933Z %25 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:13.1987169Z %26 = tt.splat %24 : i32 -> tensor<32xi32> 2026-02-21T09:03:13.1987371Z %27 = arith.addi %26, %25 : tensor<32xi32> 2026-02-21T09:03:13.1987561Z %c32_i32_9 = arith.constant 32 : i32 2026-02-21T09:03:13.1987935Z %28 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32_9 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:13.1988312Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.1988557Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.1988770Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T09:03:13.1988965Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:13.1989198Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.1989439Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.1989644Z %72 = arith.addi %71, %70 : tensor<16xi32> 
2026-02-21T09:03:13.1989904Z %73 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.1990178Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.1990448Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.1990741Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.1991012Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.1991250Z %78 = arith.addi %76, %77 : tensor<64x16xi32> 2026-02-21T09:03:13.1991507Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.1991828Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.1992127Z %81 = tt.load %80 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.1992419Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.1992693Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.1992963Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.1993211Z %85 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.1993539Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.1993806Z %87 = tt.broadcast %85 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.1994040Z %88 = arith.addi %86, %87 : tensor<8x32xi32> 2026-02-21T09:03:13.1994281Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.1994545Z %90 = tt.addptr %89, %88 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.1994841Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.1995139Z %92 = arith.shli %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.1995349Z %93 = arith.shrsi %92, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.1995565Z %94 = arith.shrsi %91, %cst_3 : tensor<8x32xi8> 
2026-02-21T09:03:13.1995802Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.1996091Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.1996393Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.1996740Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.1997050Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.1997325Z %100 = arith.cmpi eq, %97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.1997586Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.1997852Z %102 = tt.broadcast %98 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.1998146Z %103 = arith.select %101, %102, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.1998425Z %104 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.1998700Z %105 = tt.broadcast %99 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.1998966Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.1999243Z %107 = arith.select %106, %105, %103 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.1999526Z %108 = tt.reshape %107 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.1999784Z %109 = arith.sitofp %108 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2000150Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2000479Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T09:03:13.2000681Z %111 = arith.muli %c8_i32, %c1_i32_12 : i32 2026-02-21T09:03:13.2000881Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:03:13.2001106Z %113 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2001355Z %114 = tt.splat %112 : i32 -> tensor<8xi32> 
2026-02-21T09:03:13.2001601Z %115 = arith.addi %114, %113 : tensor<8xi32> 2026-02-21T09:03:13.2001811Z %116 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:03:13.2002052Z %117 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2002304Z %118 = tt.splat %116 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2002520Z %119 = arith.addi %118, %117 : tensor<16xi32> 2026-02-21T09:03:13.2002777Z %120 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2003060Z %121 = arith.muli %120, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2003324Z %122 = tt.expand_dims %119 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2003628Z %123 = tt.broadcast %121 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2003906Z %124 = tt.broadcast %122 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2004149Z %125 = arith.addi %123, %124 : tensor<64x16xi32> 2026-02-21T09:03:13.2004411Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2004747Z %127 = tt.addptr %126, %125 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.2005073Z %128 = tt.load %127 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2005415Z %129 = arith.extf %128 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2005706Z %130 = tt.expand_dims %115 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2005970Z %131 = arith.muli %130, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2006264Z %132 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2006551Z %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2006822Z %134 = tt.broadcast %132 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2007064Z %135 = arith.addi %133, %134 : tensor<8x32xi32> 2026-02-21T09:03:13.2007301Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 
2026-02-21T09:03:13.2007606Z %137 = tt.addptr %136, %135 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2007907Z %138 = tt.load %137 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2008186Z %139 = arith.shli %138, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2008403Z %140 = arith.shrsi %139, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2008627Z %141 = arith.shrsi %138, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2008875Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2009167Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2009491Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2009837Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2010161Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2010445Z %147 = arith.cmpi eq, %144, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2010702Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2010978Z %149 = tt.broadcast %145 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2011263Z %150 = arith.select %148, %149, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2011575Z %151 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2011829Z %152 = tt.broadcast %146 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2012100Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2012379Z %154 = arith.select %153, %152, %150 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2012657Z %155 = tt.reshape %154 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2012921Z %156 = arith.sitofp %155 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2013271Z %157 = tt.dot %129, %156, %110, inputPrecision = 
tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2013598Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T09:03:13.2013797Z %158 = arith.muli %c8_i32, %c2_i32_13 : i32 2026-02-21T09:03:13.2014001Z %159 = arith.addi %arg4, %158 : i32 2026-02-21T09:03:13.2014235Z %160 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2014480Z %161 = tt.splat %159 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.2014695Z %162 = arith.addi %161, %160 : tensor<8xi32> 2026-02-21T09:03:13.2014890Z %163 = arith.muli %159, %c2_i32 : i32 2026-02-21T09:03:13.2015130Z %164 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2015380Z %165 = tt.splat %163 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2015647Z %166 = arith.addi %165, %164 : tensor<16xi32> 2026-02-21T09:03:13.2015911Z %167 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2016179Z %168 = arith.muli %167, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2016444Z %169 = tt.expand_dims %166 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2016732Z %170 = tt.broadcast %168 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2017003Z %171 = tt.broadcast %169 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2017273Z %172 = arith.addi %170, %171 : tensor<64x16xi32> 2026-02-21T09:03:13.2017533Z %173 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2017844Z %174 = tt.addptr %173, %172 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.2018166Z %175 = tt.load %174 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2018483Z %176 = arith.extf %175 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2018796Z %177 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2019071Z %178 = arith.muli %177, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2019338Z %179 = 
tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2019627Z %180 = tt.broadcast %178 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2019900Z %181 = tt.broadcast %179 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2020139Z %182 = arith.addi %180, %181 : tensor<8x32xi32> 2026-02-21T09:03:13.2020382Z %183 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2020683Z %184 = tt.addptr %183, %182 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2021021Z %185 = tt.load %184 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2021318Z %186 = arith.shli %185, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2021583Z %187 = arith.shrsi %186, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2021824Z %188 = arith.shrsi %185, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2022082Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2022401Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2022733Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2023078Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2023419Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2023718Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2023991Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2024272Z %196 = tt.broadcast %192 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2024577Z %197 = arith.select %195, %196, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2024870Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2025130Z %199 = tt.broadcast %193 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2025416Z %200 = 
tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2025702Z %201 = arith.select %200, %199, %197 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2026003Z %202 = tt.reshape %201 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2026271Z %203 = arith.sitofp %202 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2026643Z %204 = tt.dot %176, %203, %157, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2027012Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:13.2027210Z %205 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:03:13.2027420Z %206 = arith.addi %arg4, %205 : i32 2026-02-21T09:03:13.2027654Z %207 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2027916Z %208 = tt.splat %206 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.2028132Z %209 = arith.addi %208, %207 : tensor<8xi32> 2026-02-21T09:03:13.2028343Z %210 = arith.muli %206, %c2_i32 : i32 2026-02-21T09:03:13.2028595Z %211 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2028883Z %212 = tt.splat %210 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2029130Z %213 = arith.addi %212, %211 : tensor<16xi32> 2026-02-21T09:03:13.2029397Z %214 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2029689Z %215 = arith.muli %214, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2029977Z %216 = tt.expand_dims %213 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2030311Z %217 = tt.broadcast %215 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2030603Z %218 = tt.broadcast %216 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2030843Z %219 = arith.addi %217, %218 : tensor<64x16xi32> 2026-02-21T09:03:13.2031108Z %220 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2031399Z %221 = tt.addptr %220, %219 : tensor<64x16x!tt.ptr>, 
tensor<64x16xi32> 2026-02-21T09:03:13.2031754Z %222 = tt.load %221 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2032056Z %223 = arith.extf %222 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2032342Z %224 = tt.expand_dims %209 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2032646Z %225 = arith.muli %224, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2032904Z %226 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2033199Z %227 = tt.broadcast %225 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2033464Z %228 = tt.broadcast %226 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2033710Z %229 = arith.addi %227, %228 : tensor<8x32xi32> 2026-02-21T09:03:13.2033951Z %230 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2034227Z %231 = tt.addptr %230, %229 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2034533Z %232 = tt.load %231 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2034796Z %233 = arith.shli %232, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2035020Z %234 = arith.shrsi %233, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2035244Z %235 = arith.shrsi %232, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2035488Z %236 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2035791Z %237 = tt.expand_dims %236 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2036104Z %238 = tt.expand_dims %237 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2036429Z %239 = tt.expand_dims %234 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2036741Z %240 = tt.expand_dims %235 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2037030Z %241 = arith.cmpi eq, %238, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2037290Z %242 = tt.broadcast %241 : tensor<1x2x1xi1> -> 
tensor<8x2x32xi1> 2026-02-21T09:03:13.2037556Z %243 = tt.broadcast %239 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2037843Z %244 = arith.select %242, %243, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2038114Z %245 = arith.cmpi eq, %238, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2038400Z %246 = tt.broadcast %240 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2038669Z %247 = tt.broadcast %245 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2038944Z %248 = arith.select %247, %246, %244 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2039228Z %249 = tt.reshape %248 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2039492Z %250 = arith.sitofp %249 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2039846Z %251 = tt.dot %223, %250, %204, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2040192Z scf.yield %251 : tensor<64x32xf32> 2026-02-21T09:03:13.2040377Z } 2026-02-21T09:03:13.2040559Z %29 = arith.truncf %28 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:13.2040844Z %30 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2041117Z %31 = arith.muli %30, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:13.2041395Z %32 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2041713Z %33 = tt.broadcast %31 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.2041968Z %34 = tt.broadcast %32 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.2042203Z %35 = arith.addi %33, %34 : tensor<64x32xi32> 2026-02-21T09:03:13.2042449Z %36 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.2042725Z %37 = tt.addptr %36, %35 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:13.2042986Z tt.store %37, %29 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.2043190Z %c1_i32_10 = arith.constant 1 : i32 
2026-02-21T09:03:13.2043390Z %38 = arith.muli %c1_i32, %c1_i32_10 : i32 2026-02-21T09:03:13.2043610Z %39 = arith.addi %arg3, %38 : i32 2026-02-21T09:03:13.2043809Z %40 = arith.divsi %39, %c896_i32 : i32 2026-02-21T09:03:13.2044007Z %41 = arith.muli %40, %c4_i32 : i32 2026-02-21T09:03:13.2044193Z %42 = arith.subi %c1_i32, %41 : i32 2026-02-21T09:03:13.2044389Z %43 = arith.minsi %42, %c4_i32 : i32 2026-02-21T09:03:13.2044571Z %44 = arith.remsi %39, %c896_i32 : i32 2026-02-21T09:03:13.2044758Z %45 = arith.remsi %44, %43 : i32 2026-02-21T09:03:13.2044935Z %46 = arith.addi %41, %45 : i32 2026-02-21T09:03:13.2045114Z %47 = arith.divsi %44, %43 : i32 2026-02-21T09:03:13.2045287Z %48 = arith.muli %46, %c64_i32 : i32 2026-02-21T09:03:13.2045524Z %49 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:13.2045773Z %50 = tt.splat %48 : i32 -> tensor<64xi32> 2026-02-21T09:03:13.2045967Z %51 = arith.addi %50, %49 : tensor<64xi32> 2026-02-21T09:03:13.2046166Z %52 = arith.muli %47, %c32_i32 : i32 2026-02-21T09:03:13.2046389Z %53 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:13.2046638Z %54 = tt.splat %52 : i32 -> tensor<32xi32> 2026-02-21T09:03:13.2046835Z %55 = arith.addi %54, %53 : tensor<32xi32> 2026-02-21T09:03:13.2047033Z %c32_i32_11 = arith.constant 32 : i32 2026-02-21T09:03:13.2047357Z %56 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32_11 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:13.2047720Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2047988Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.2048192Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T09:03:13.2048395Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:13.2048623Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2048871Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2049078Z %72 
= arith.addi %71, %70 : tensor<16xi32> 2026-02-21T09:03:13.2049361Z %73 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2049635Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2049887Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2050177Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2050432Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2050671Z %78 = arith.addi %76, %77 : tensor<64x16xi32> 2026-02-21T09:03:13.2050946Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2051224Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.2051528Z %81 = tt.load %80 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2051852Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2052143Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2052446Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2052701Z %85 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2052994Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2053254Z %87 = tt.broadcast %85 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2053496Z %88 = arith.addi %86, %87 : tensor<8x32xi32> 2026-02-21T09:03:13.2053733Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2054012Z %90 = tt.addptr %89, %88 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2054307Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2054617Z %92 = arith.shli %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2054843Z %93 = arith.shrsi %92, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2055056Z %94 = arith.shrsi %91, %cst_3 : 
tensor<8x32xi8> 2026-02-21T09:03:13.2055302Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2055588Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2055976Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2056300Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2056604Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2056885Z %100 = arith.cmpi eq, %97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2057136Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2057414Z %102 = tt.broadcast %98 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2057705Z %103 = arith.select %101, %102, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2057984Z %104 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2058245Z %105 = tt.broadcast %99 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2058503Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2058784Z %107 = arith.select %106, %105, %103 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2059062Z %108 = tt.reshape %107 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2059328Z %109 = arith.sitofp %108 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2059693Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2060011Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T09:03:13.2060215Z %111 = arith.muli %c8_i32, %c1_i32_12 : i32 2026-02-21T09:03:13.2060439Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:03:13.2060675Z %113 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2060918Z %114 = tt.splat %112 : i32 -> 
tensor<8xi32> 2026-02-21T09:03:13.2061128Z %115 = arith.addi %114, %113 : tensor<8xi32> 2026-02-21T09:03:13.2061330Z %116 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:03:13.2061596Z %117 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2061850Z %118 = tt.splat %116 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2062089Z %119 = arith.addi %118, %117 : tensor<16xi32> 2026-02-21T09:03:13.2062350Z %120 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2062623Z %121 = arith.muli %120, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2062893Z %122 = tt.expand_dims %119 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2063194Z %123 = tt.broadcast %121 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2063478Z %124 = tt.broadcast %122 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2063727Z %125 = arith.addi %123, %124 : tensor<64x16xi32> 2026-02-21T09:03:13.2063972Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2064273Z %127 = tt.addptr %126, %125 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.2064595Z %128 = tt.load %127 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2064890Z %129 = arith.extf %128 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2065181Z %130 = tt.expand_dims %115 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2065448Z %131 = arith.muli %130, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2065759Z %132 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2066050Z %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2066331Z %134 = tt.broadcast %132 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2066590Z %135 = arith.addi %133, %134 : tensor<8x32xi32> 2026-02-21T09:03:13.2066835Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 
2026-02-21T09:03:13.2067131Z %137 = tt.addptr %136, %135 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2067446Z %138 = tt.load %137 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2067732Z %139 = arith.shli %138, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2067957Z %140 = arith.shrsi %139, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2068193Z %141 = arith.shrsi %138, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2068460Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2068769Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2069111Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2069449Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2069791Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2070097Z %147 = arith.cmpi eq, %144, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2070360Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2070650Z %149 = tt.broadcast %145 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2070947Z %150 = arith.select %148, %149, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2071237Z %151 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2071500Z %152 = tt.broadcast %146 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2071853Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2072149Z %154 = arith.select %153, %152, %150 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2072439Z %155 = tt.reshape %154 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2072717Z %156 = arith.sitofp %155 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2073085Z %157 = tt.dot %129, %156, %110, inputPrecision = 
tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2073429Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T09:03:13.2073678Z %158 = arith.muli %c8_i32, %c2_i32_13 : i32 2026-02-21T09:03:13.2073882Z %159 = arith.addi %arg4, %158 : i32 2026-02-21T09:03:13.2074132Z %160 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2074395Z %161 = tt.splat %159 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.2074623Z %162 = arith.addi %161, %160 : tensor<8xi32> 2026-02-21T09:03:13.2074832Z %163 = arith.muli %159, %c2_i32 : i32 2026-02-21T09:03:13.2075114Z %164 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2075383Z %165 = tt.splat %163 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2075602Z %166 = arith.addi %165, %164 : tensor<16xi32> 2026-02-21T09:03:13.2075875Z %167 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2076156Z %168 = arith.muli %167, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2076434Z %169 = tt.expand_dims %166 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2076740Z %170 = tt.broadcast %168 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2077004Z %171 = tt.broadcast %169 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2077284Z %172 = arith.addi %170, %171 : tensor<64x16xi32> 2026-02-21T09:03:13.2077529Z %173 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2077828Z %174 = tt.addptr %173, %172 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.2078143Z %175 = tt.load %174 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2078443Z %176 = arith.extf %175 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2078728Z %177 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2078999Z %178 = arith.muli %177, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2079260Z %179 = 
tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2079544Z %180 = tt.broadcast %178 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2079808Z %181 = tt.broadcast %179 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2080043Z %182 = arith.addi %180, %181 : tensor<8x32xi32> 2026-02-21T09:03:13.2080285Z %183 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2080569Z %184 = tt.addptr %183, %182 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2080867Z %185 = tt.load %184 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2081137Z %186 = arith.shli %185, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2081351Z %187 = arith.shrsi %186, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2081603Z %188 = arith.shrsi %185, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2081847Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2082144Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2082470Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2082788Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2083143Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2083427Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2083705Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2083985Z %196 = tt.broadcast %192 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2084269Z %197 = arith.select %195, %196, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2084551Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2084830Z %199 = tt.broadcast %193 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2085100Z %200 = 
tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2085378Z %201 = arith.select %200, %199, %197 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2085664Z %202 = tt.reshape %201 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2085933Z %203 = arith.sitofp %202 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2086314Z %204 = tt.dot %176, %203, %157, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2086638Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:13.2086828Z %205 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:03:13.2087028Z %206 = arith.addi %arg4, %205 : i32 2026-02-21T09:03:13.2087255Z %207 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2087510Z %208 = tt.splat %206 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.2087722Z %209 = arith.addi %208, %207 : tensor<8xi32> 2026-02-21T09:03:13.2087915Z %210 = arith.muli %206, %c2_i32 : i32 2026-02-21T09:03:13.2088152Z %211 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2088425Z %212 = tt.splat %210 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2088637Z %213 = arith.addi %212, %211 : tensor<16xi32> 2026-02-21T09:03:13.2088887Z %214 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2089165Z %215 = arith.muli %214, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2089431Z %216 = tt.expand_dims %213 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2089722Z %217 = tt.broadcast %215 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2089992Z %218 = tt.broadcast %216 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2090227Z %219 = arith.addi %217, %218 : tensor<64x16xi32> 2026-02-21T09:03:13.2090477Z %220 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2090764Z %221 = tt.addptr %220, %219 : tensor<64x16x!tt.ptr>, 
tensor<64x16xi32> 2026-02-21T09:03:13.2091079Z %222 = tt.load %221 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2091376Z %223 = arith.extf %222 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2091687Z %224 = tt.expand_dims %209 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2091958Z %225 = arith.muli %224, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2092212Z %226 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2092506Z %227 = tt.broadcast %225 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2092772Z %228 = tt.broadcast %226 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2093010Z %229 = arith.addi %227, %228 : tensor<8x32xi32> 2026-02-21T09:03:13.2093251Z %230 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2093522Z %231 = tt.addptr %230, %229 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2093831Z %232 = tt.load %231 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2094123Z %233 = arith.shli %232, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2094352Z %234 = arith.shrsi %233, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2094582Z %235 = arith.shrsi %232, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2094832Z %236 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2095137Z %237 = tt.expand_dims %236 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2095457Z %238 = tt.expand_dims %237 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2095817Z %239 = tt.expand_dims %234 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2096143Z %240 = tt.expand_dims %235 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2096427Z %241 = arith.cmpi eq, %238, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2096693Z %242 = tt.broadcast %241 : tensor<1x2x1xi1> -> 
tensor<8x2x32xi1> 2026-02-21T09:03:13.2097009Z %243 = tt.broadcast %239 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2097301Z %244 = arith.select %242, %243, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2097572Z %245 = arith.cmpi eq, %238, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2097828Z %246 = tt.broadcast %240 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2098097Z %247 = tt.broadcast %245 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2098367Z %248 = arith.select %247, %246, %244 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2098649Z %249 = tt.reshape %248 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2098907Z %250 = arith.sitofp %249 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2099315Z %251 = tt.dot %223, %250, %204, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2099653Z scf.yield %251 : tensor<64x32xf32> 2026-02-21T09:03:13.2099832Z } 2026-02-21T09:03:13.2100021Z %57 = arith.truncf %56 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:13.2100314Z %58 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2100593Z %59 = arith.muli %58, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:13.2100848Z %60 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2101144Z %61 = tt.broadcast %59 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.2101416Z %62 = tt.broadcast %60 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.2101687Z %63 = arith.addi %61, %62 : tensor<64x32xi32> 2026-02-21T09:03:13.2101938Z %64 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.2102227Z %65 = tt.addptr %64, %63 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:13.2102499Z tt.store %65, %57 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.2102704Z } {tt.num_stages = 1 : i32} 
2026-02-21T09:03:13.2102901Z scf.for %arg3 = %10 to %2 step %c1_i32 : i32 { 2026-02-21T09:03:13.2103114Z %12 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:13.2103314Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:03:13.2103505Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:03:13.2103714Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:03:13.2103908Z %16 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:13.2104091Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:03:13.2104271Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:03:13.2104443Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:03:13.2104623Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:03:13.2104853Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:13.2105101Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:03:13.2105330Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:03:13.2105518Z %24 = arith.muli %19, %c32_i32 : i32 2026-02-21T09:03:13.2105746Z %25 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:13.2105977Z %26 = tt.splat %24 : i32 -> tensor<32xi32> 2026-02-21T09:03:13.2106177Z %27 = arith.addi %26, %25 : tensor<32xi32> 2026-02-21T09:03:13.2106366Z %c32_i32_9 = arith.constant 32 : i32 2026-02-21T09:03:13.2106689Z %28 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32_9 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:13.2107079Z %38 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2107320Z %39 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.2107531Z %40 = arith.addi %39, %38 : tensor<8xi32> 2026-02-21T09:03:13.2107729Z %41 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:13.2107972Z %42 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2108255Z %43 = tt.splat %41 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2108458Z %44 = arith.addi %43, %42 : tensor<16xi32> 
2026-02-21T09:03:13.2108720Z %45 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2108990Z %46 = arith.muli %45, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2109254Z %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2109536Z %48 = tt.broadcast %46 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2109798Z %49 = tt.broadcast %47 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2110039Z %50 = arith.addi %48, %49 : tensor<64x16xi32> 2026-02-21T09:03:13.2110278Z %51 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2110591Z %52 = tt.addptr %51, %50 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.2110895Z %53 = tt.load %52 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2111192Z %54 = arith.extf %53 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2111467Z %55 = tt.expand_dims %40 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2111763Z %56 = arith.muli %55, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2112022Z %57 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2112304Z %58 = tt.broadcast %56 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2112564Z %59 = tt.broadcast %57 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2112793Z %60 = arith.addi %58, %59 : tensor<8x32xi32> 2026-02-21T09:03:13.2113029Z %61 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2113293Z %62 = tt.addptr %61, %60 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2113585Z %63 = tt.load %62 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2113854Z %64 = arith.shli %63, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2114067Z %65 = arith.shrsi %64, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2114285Z %66 = arith.shrsi %63, %cst_3 : tensor<8x32xi8> 
2026-02-21T09:03:13.2114522Z %67 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2114812Z %68 = tt.expand_dims %67 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2115126Z %69 = tt.expand_dims %68 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2115434Z %70 = tt.expand_dims %65 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2115742Z %71 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2116015Z %72 = arith.cmpi eq, %69, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2116297Z %73 = tt.broadcast %72 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2116560Z %74 = tt.broadcast %70 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2116840Z %75 = arith.select %73, %74, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2117109Z %76 = arith.cmpi eq, %69, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2117354Z %77 = tt.broadcast %71 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2117622Z %78 = tt.broadcast %76 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2117914Z %79 = arith.select %78, %77, %75 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2118182Z %80 = tt.reshape %79 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2118430Z %81 = arith.sitofp %80 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2118783Z %82 = tt.dot %54, %81, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2119107Z %c1_i32_10 = arith.constant 1 : i32 2026-02-21T09:03:13.2119330Z %83 = arith.muli %c8_i32, %c1_i32_10 : i32 2026-02-21T09:03:13.2119532Z %84 = arith.addi %arg4, %83 : i32 2026-02-21T09:03:13.2119753Z %85 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2120018Z %86 = tt.splat %84 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.2120227Z %87 = 
arith.addi %86, %85 : tensor<8xi32> 2026-02-21T09:03:13.2120434Z %88 = arith.muli %84, %c2_i32 : i32 2026-02-21T09:03:13.2120678Z %89 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2120932Z %90 = tt.splat %88 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2121146Z %91 = arith.addi %90, %89 : tensor<16xi32> 2026-02-21T09:03:13.2121436Z %92 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2121756Z %93 = arith.muli %92, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2122028Z %94 = tt.expand_dims %91 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2122340Z %95 = tt.broadcast %93 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2122618Z %96 = tt.broadcast %94 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2122864Z %97 = arith.addi %95, %96 : tensor<64x16xi32> 2026-02-21T09:03:13.2123123Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2123421Z %99 = tt.addptr %98, %97 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.2123749Z %100 = tt.load %99 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2124070Z %101 = arith.extf %100 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2124374Z %102 = tt.expand_dims %87 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2124662Z %103 = arith.muli %102, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2124937Z %104 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2125254Z %105 = tt.broadcast %103 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2125530Z %106 = tt.broadcast %104 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2125790Z %107 = arith.addi %105, %106 : tensor<8x32xi32> 2026-02-21T09:03:13.2126047Z %108 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2126341Z %109 = tt.addptr %108, %107 : tensor<8x32x!tt.ptr>, 
tensor<8x32xi32> 2026-02-21T09:03:13.2126665Z %110 = tt.load %109 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2126948Z %111 = arith.shli %110, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2127187Z %112 = arith.shrsi %111, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2127462Z %113 = arith.shrsi %110, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2127726Z %114 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2128048Z %115 = tt.expand_dims %114 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2128363Z %116 = tt.expand_dims %115 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2128692Z %117 = tt.expand_dims %112 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2129006Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2129328Z %119 = arith.cmpi eq, %116, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2129584Z %120 = tt.broadcast %119 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2129852Z %121 = tt.broadcast %117 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2130145Z %122 = arith.select %120, %121, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2130416Z %123 = arith.cmpi eq, %116, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2130698Z %124 = tt.broadcast %118 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2130961Z %125 = tt.broadcast %123 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2131243Z %126 = arith.select %125, %124, %122 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2131523Z %127 = tt.reshape %126 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2131809Z %128 = arith.sitofp %127 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2132168Z %129 = tt.dot %101, %128, %82, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 
2026-02-21T09:03:13.2132487Z %c2_i32_11 = arith.constant 2 : i32 2026-02-21T09:03:13.2132694Z %130 = arith.muli %c8_i32, %c2_i32_11 : i32 2026-02-21T09:03:13.2132924Z %131 = arith.addi %arg4, %130 : i32 2026-02-21T09:03:13.2133154Z %132 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2133409Z %133 = tt.splat %131 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.2133612Z %134 = arith.addi %133, %132 : tensor<8xi32> 2026-02-21T09:03:13.2133816Z %135 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:03:13.2134047Z %136 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2134304Z %137 = tt.splat %135 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2134523Z %138 = arith.addi %137, %136 : tensor<16xi32> 2026-02-21T09:03:13.2134777Z %139 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2135059Z %140 = arith.muli %139, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2135319Z %141 = tt.expand_dims %138 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2135624Z %142 = tt.broadcast %140 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2135891Z %143 = tt.broadcast %141 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2136148Z %144 = arith.addi %142, %143 : tensor<64x16xi32> 2026-02-21T09:03:13.2136400Z %145 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2136688Z %146 = tt.addptr %145, %144 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.2137005Z %147 = tt.load %146 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2137296Z %148 = arith.extf %147 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2137585Z %149 = tt.expand_dims %134 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2137847Z %150 = arith.muli %149, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2138109Z %151 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> 
tensor<1x32xi32> 2026-02-21T09:03:13.2138407Z %152 = tt.broadcast %150 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2138694Z %153 = tt.broadcast %151 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2138938Z %154 = arith.addi %152, %153 : tensor<8x32xi32> 2026-02-21T09:03:13.2139169Z %155 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2139447Z %156 = tt.addptr %155, %154 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2139753Z %157 = tt.load %156 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2140016Z %158 = arith.shli %157, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2140290Z %159 = arith.shrsi %158, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2140504Z %160 = arith.shrsi %157, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2140754Z %161 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2141046Z %162 = tt.expand_dims %161 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2141364Z %163 = tt.expand_dims %162 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2141757Z %164 = tt.expand_dims %159 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2142080Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2142367Z %166 = arith.cmpi eq, %163, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2142616Z %167 = tt.broadcast %166 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2142891Z %168 = tt.broadcast %164 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2143176Z %169 = arith.select %167, %168, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2143447Z %170 = arith.cmpi eq, %163, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2143704Z %171 = tt.broadcast %165 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2143999Z %172 = tt.broadcast %170 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 
2026-02-21T09:03:13.2144287Z %173 = arith.select %172, %171, %169 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2144566Z %174 = tt.reshape %173 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2144835Z %175 = arith.sitofp %174 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2145196Z %176 = tt.dot %148, %175, %129, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2145513Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:13.2145709Z %177 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:03:13.2145904Z %178 = arith.addi %arg4, %177 : i32 2026-02-21T09:03:13.2146142Z %179 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.2146386Z %180 = tt.splat %178 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.2146601Z %181 = arith.addi %180, %179 : tensor<8xi32> 2026-02-21T09:03:13.2146803Z %182 = arith.muli %178, %c2_i32 : i32 2026-02-21T09:03:13.2147034Z %183 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.2147293Z %184 = tt.splat %182 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.2147499Z %185 = arith.addi %184, %183 : tensor<16xi32> 2026-02-21T09:03:13.2147762Z %186 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2148030Z %187 = arith.muli %186, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.2148298Z %188 = tt.expand_dims %185 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.2148600Z %189 = tt.broadcast %187 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2148865Z %190 = tt.broadcast %188 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.2149115Z %191 = arith.addi %189, %190 : tensor<64x16xi32> 2026-02-21T09:03:13.2149365Z %192 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2149690Z %193 = tt.addptr %192, %191 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.2150012Z %194 = tt.load %193 
evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.2150306Z %195 = arith.extf %194 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.2150598Z %196 = tt.expand_dims %181 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.2150861Z %197 = arith.muli %196, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.2151124Z %198 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2151445Z %199 = tt.broadcast %197 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2151745Z %200 = tt.broadcast %198 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.2151991Z %201 = arith.addi %199, %200 : tensor<8x32xi32> 2026-02-21T09:03:13.2152229Z %202 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2152516Z %203 = tt.addptr %202, %201 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.2152842Z %204 = tt.load %203 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.2153114Z %205 = arith.shli %204, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2153329Z %206 = arith.shrsi %205, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2153549Z %207 = arith.shrsi %204, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.2153799Z %208 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.2154092Z %209 = tt.expand_dims %208 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.2154410Z %210 = tt.expand_dims %209 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.2154722Z %211 = tt.expand_dims %206 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2155066Z %212 = tt.expand_dims %207 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.2155358Z %213 = arith.cmpi eq, %210, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2155608Z %214 = tt.broadcast %213 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2155885Z %215 = tt.broadcast %211 : 
tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2156171Z %216 = arith.select %214, %215, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2156451Z %217 = arith.cmpi eq, %210, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.2156698Z %218 = tt.broadcast %212 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.2156972Z %219 = tt.broadcast %217 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.2157253Z %220 = arith.select %219, %218, %216 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.2157530Z %221 = tt.reshape %220 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.2157799Z %222 = arith.sitofp %221 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.2158152Z %223 = tt.dot %195, %222, %176, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.2158481Z scf.yield %223 : tensor<64x32xf32> 2026-02-21T09:03:13.2158662Z } 2026-02-21T09:03:13.2158836Z %29 = arith.truncf %28 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:13.2159120Z %30 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.2159384Z %31 = arith.muli %30, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:13.2159641Z %32 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.2159921Z %33 = tt.broadcast %31 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.2160186Z %34 = tt.broadcast %32 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.2160423Z %35 = arith.addi %33, %34 : tensor<64x32xi32> 2026-02-21T09:03:13.2160691Z %36 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.2160973Z %37 = tt.addptr %36, %35 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:13.2161227Z tt.store %37, %29 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.2161432Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:13.2161615Z tt.return 2026-02-21T09:03:13.2161751Z } 
2026-02-21T09:03:13.2161887Z } 2026-02-21T09:03:13.2161969Z 2026-02-21T09:03:13.2162023Z {-# 2026-02-21T09:03:13.2162172Z external_resources: { 2026-02-21T09:03:13.2162337Z mlir_reproducer: { 2026-02-21T09:03:13.2166966Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 
2026-02-21T09:03:13.2171498Z disable_threading: false, 2026-02-21T09:03:13.2171693Z verify_each: true 2026-02-21T09:03:13.2171842Z } 2026-02-21T09:03:13.2171984Z } 2026-02-21T09:03:13.2172100Z #-} 2026-02-21T09:03:13.2172530Z /tmp/torchinductor_root/eu/ceukoscejkhf6df26w7fyctej2ccypjoxzu5kke66ofoxzig7fmy.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:03:13.2173571Z /tmp/torchinductor_root/eu/ceukoscejkhf6df26w7fyctej2ccypjoxzu5kke66ofoxzig7fmy.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:03:13.2174399Z [158s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:03:13.2175569Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:03:13.2176629Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:03:13.2176885Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:03:13.4976871Z [158s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]) 2026-02-21T09:03:13.4978309Z Tensor-likes are not close! 2026-02-21T09:03:13.4982684Z 2026-02-21T09:03:13.4986169Z Mismatched elements: 457837 / 458752 (99.8%) 2026-02-21T09:03:13.4989389Z Greatest absolute difference: 5760.0 at index (52, 1777) (up to 0.01 allowed) 2026-02-21T09:03:13.4992656Z Greatest relative difference: 929792.0 at index (41, 3764) (up to 0.01 allowed) 2026-02-21T09:03:13.4996589Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:03:13.4996859Z 2026-02-21T09:03:13.8936350Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:03:13.8940701Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:13.8942210Z ^ 2026-02-21T09:03:13.8942848Z /tmp/torchinductor_root/i4/ci4utcp2xuir34degt4jknweduohprfr42hgtb5hamhm3now2cxw.py:84:40: note: called from 2026-02-21T09:03:13.8943254Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:03:13.8947252Z ^ 2026-02-21T09:03:13.8948981Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:03:13.8949544Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:13.8954728Z ^ 2026-02-21T09:03:13.8958731Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:03:13.8963206Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : 
i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:13.8963858Z %cst = arith.constant dense<0> : tensor<8x2x32xi8> 2026-02-21T09:03:13.8968369Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:03:13.8970720Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:03:13.8970991Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:03:13.8971192Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:13.8971397Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:03:13.8971673Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:03:13.8971944Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:03:13.8972175Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:03:13.8972414Z %cst_3 = arith.constant dense<4> : tensor<8x32xi8> 2026-02-21T09:03:13.8972653Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:03:13.8972890Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:03:13.8973128Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:03:13.8973357Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:03:13.8973600Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:03:13.8973780Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:03:13.8973966Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:03:13.8974139Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:13.8974324Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:03:13.8974517Z %0 = tt.get_program_id x : i32 2026-02-21T09:03:13.8974697Z %1 = arith.subi %c224_i32, %0 : i32 2026-02-21T09:03:13.8974887Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:03:13.8975086Z %2 = arith.subi %c9472_i32, %c1_i32_7 : i32 2026-02-21T09:03:13.8975282Z %3 = arith.addi %1, %2 : i32 2026-02-21T09:03:13.8975455Z %4 = arith.divui %3, %c9472_i32 : i32 2026-02-21T09:03:13.8975644Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:03:13.8976005Z %5 = arith.remsi 
%4, %c2_i32_8 : i32 2026-02-21T09:03:13.8976191Z %6 = arith.subi %4, %5 : i32 2026-02-21T09:03:13.8976368Z %7 = arith.muli %6, %c9472_i32 : i32 2026-02-21T09:03:13.8976545Z %8 = arith.addi %0, %7 : i32 2026-02-21T09:03:13.8976727Z %9 = arith.muli %c9472_i32, %c2_i32_8 : i32 2026-02-21T09:03:13.8976926Z scf.for %arg3 = %0 to %8 step %9 : i32 { 2026-02-21T09:03:13.8977135Z %10 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:13.8977323Z %11 = arith.muli %10, %c4_i32 : i32 2026-02-21T09:03:13.8977510Z %12 = arith.subi %c1_i32, %11 : i32 2026-02-21T09:03:13.8977738Z %13 = arith.minsi %12, %c4_i32 : i32 2026-02-21T09:03:13.8977930Z %14 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:13.8978120Z %15 = arith.remsi %14, %13 : i32 2026-02-21T09:03:13.8978296Z %16 = arith.addi %11, %15 : i32 2026-02-21T09:03:13.8978477Z %17 = arith.divsi %14, %13 : i32 2026-02-21T09:03:13.8978653Z %18 = arith.muli %16, %c64_i32 : i32 2026-02-21T09:03:13.8978894Z %19 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:13.8979214Z %20 = tt.splat %18 : i32 -> tensor<64xi32> 2026-02-21T09:03:13.8979424Z %21 = arith.addi %20, %19 : tensor<64xi32> 2026-02-21T09:03:13.8979633Z %22 = arith.muli %17, %c32_i32 : i32 2026-02-21T09:03:13.8979872Z %23 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:13.8980117Z %24 = tt.splat %22 : i32 -> tensor<32xi32> 2026-02-21T09:03:13.8980327Z %25 = arith.addi %24, %23 : tensor<32xi32> 2026-02-21T09:03:13.8980523Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:03:13.8980724Z %c24_i32 = arith.constant 24 : i32 2026-02-21T09:03:13.8981066Z %26 = scf.for %arg4 = %c0_i32 to %c4080_i32 step %c24_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:13.8981471Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.8981780Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.8981990Z %68 = arith.addi %67, %66 : 
tensor<8xi32> 2026-02-21T09:03:13.8982197Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:13.8982426Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.8982676Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.8982879Z %72 = arith.addi %71, %70 : tensor<16xi32> 2026-02-21T09:03:13.8983141Z %73 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.8983424Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.8983684Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.8983979Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.8984241Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.8984488Z %78 = arith.addi %76, %77 : tensor<64x16xi32> 2026-02-21T09:03:13.8984736Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.8985039Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.8985302Z %81 = tt.load %80 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.8985540Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.8985823Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.8986083Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.8986341Z %85 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.8986618Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.8986940Z %87 = tt.broadcast %85 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.8987180Z %88 = arith.addi %86, %87 : tensor<8x32xi32> 2026-02-21T09:03:13.8987461Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.8987746Z %90 = tt.addptr %89, %88 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.8988046Z %91 = tt.load %90 
evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.8988328Z %92 = arith.shli %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.8988556Z %93 = arith.shrsi %92, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.8988776Z %94 = arith.shrsi %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.8989030Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.8989356Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.8989686Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.8990011Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.8990337Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.8990662Z %100 = arith.cmpi eq, %97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.8990931Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.8991222Z %102 = tt.broadcast %98 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.8991526Z %103 = arith.select %101, %102, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.8991866Z %104 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.8992142Z %105 = tt.broadcast %99 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.8992420Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.8992720Z %107 = arith.select %106, %105, %103 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.8993047Z %108 = tt.reshape %107 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.8993334Z %109 = arith.sitofp %108 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.8993724Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.8994084Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T09:03:13.8994305Z %111 = 
arith.muli %c8_i32, %c1_i32_12 : i32 2026-02-21T09:03:13.8994512Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:03:13.8994760Z %113 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.8995018Z %114 = tt.splat %112 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.8995239Z %115 = arith.addi %114, %113 : tensor<8xi32> 2026-02-21T09:03:13.8995445Z %116 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:03:13.8995698Z %117 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.8995967Z %118 = tt.splat %116 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.8996184Z %119 = arith.addi %118, %117 : tensor<16xi32> 2026-02-21T09:03:13.8996463Z %120 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.8996753Z %121 = arith.muli %120, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.8997024Z %122 = tt.expand_dims %119 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.8997315Z %123 = tt.broadcast %121 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.8997587Z %124 = tt.broadcast %122 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.8997837Z %125 = arith.addi %123, %124 : tensor<64x16xi32> 2026-02-21T09:03:13.8998081Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.8998382Z %127 = tt.addptr %126, %125 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.8998649Z %128 = tt.load %127 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.8998924Z %129 = arith.extf %128 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.8999214Z %130 = tt.expand_dims %115 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.8999480Z %131 = arith.muli %130, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.8999740Z %132 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9000027Z %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9000294Z 
%134 = tt.broadcast %132 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9000560Z %135 = arith.addi %133, %134 : tensor<8x32xi32> 2026-02-21T09:03:13.9000802Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9001087Z %137 = tt.addptr %136, %135 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9001385Z %138 = tt.load %137 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9001717Z %139 = arith.shli %138, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9001962Z %140 = arith.shrsi %139, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9002186Z %141 = arith.shrsi %138, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9002429Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9002728Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9003053Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9003378Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9003706Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9003992Z %147 = arith.cmpi eq, %144, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9004284Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9004561Z %149 = tt.broadcast %145 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9004849Z %150 = arith.select %148, %149, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9005131Z %151 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9005387Z %152 = tt.broadcast %146 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9005662Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9005937Z %154 = arith.select %153, %152, %150 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9006222Z %155 = 
tt.reshape %154 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9006487Z %156 = arith.sitofp %155 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9006840Z %157 = tt.dot %129, %156, %110, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9007188Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T09:03:13.9007390Z %158 = arith.muli %c8_i32, %c2_i32_13 : i32 2026-02-21T09:03:13.9007594Z %159 = arith.addi %arg4, %158 : i32 2026-02-21T09:03:13.9007820Z %160 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9008075Z %161 = tt.splat %159 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9008285Z %162 = arith.addi %161, %160 : tensor<8xi32> 2026-02-21T09:03:13.9008478Z %163 = arith.muli %159, %c2_i32 : i32 2026-02-21T09:03:13.9008716Z %164 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9008965Z %165 = tt.splat %163 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9009176Z %166 = arith.addi %165, %164 : tensor<16xi32> 2026-02-21T09:03:13.9009428Z %167 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9009708Z %168 = arith.muli %167, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9010010Z %169 = tt.expand_dims %166 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9010304Z %170 = tt.broadcast %168 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9010580Z %171 = tt.broadcast %169 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9010829Z %172 = arith.addi %170, %171 : tensor<64x16xi32> 2026-02-21T09:03:13.9011088Z %173 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9011391Z %174 = tt.addptr %173, %172 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9011733Z %175 = tt.load %174 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9011981Z %176 = arith.extf %175 : tensor<64x16xbf16> to tensor<64x16xf32> 
2026-02-21T09:03:13.9012267Z %177 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9012541Z %178 = arith.muli %177, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.9012799Z %179 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9013121Z %180 = tt.broadcast %178 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9013388Z %181 = tt.broadcast %179 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9013624Z %182 = arith.addi %180, %181 : tensor<8x32xi32> 2026-02-21T09:03:13.9013868Z %183 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9014142Z %184 = tt.addptr %183, %182 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9014449Z %185 = tt.load %184 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9014715Z %186 = arith.shli %185, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9014937Z %187 = arith.shrsi %186, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9015160Z %188 = arith.shrsi %185, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9015434Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9015734Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9016049Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9016373Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9016697Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9016974Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9017234Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9017500Z %196 = tt.broadcast %192 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9017789Z %197 = arith.select %195, %196, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 
2026-02-21T09:03:13.9018062Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9018322Z %199 = tt.broadcast %193 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9018598Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9018873Z %201 = arith.select %200, %199, %197 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9019158Z %202 = tt.reshape %201 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9019414Z %203 = arith.sitofp %202 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9019769Z %204 = tt.dot %176, %203, %157, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9020092Z scf.yield %204 : tensor<64x32xf32> 2026-02-21T09:03:13.9020265Z } 2026-02-21T09:03:13.9020534Z %27 = scf.for %arg4 = %c4080_i32 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %26) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:13.9020889Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9021176Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9021390Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T09:03:13.9021641Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:13.9021879Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9022124Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9022333Z %72 = arith.addi %71, %70 : tensor<16xi32> 2026-02-21T09:03:13.9022583Z %73 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9022907Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9023162Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9023455Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9023725Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 
2026-02-21T09:03:13.9023959Z %78 = arith.addi %76, %77 : tensor<64x16xi32> 2026-02-21T09:03:13.9024231Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9024513Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9024778Z %81 = tt.load %80 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9025013Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.9025299Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9025565Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.9025814Z %85 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9026104Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9026379Z %87 = tt.broadcast %85 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9026616Z %88 = arith.addi %86, %87 : tensor<8x32xi32> 2026-02-21T09:03:13.9026845Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9027112Z %90 = tt.addptr %89, %88 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9027404Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9027663Z %92 = arith.shli %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9027878Z %93 = arith.shrsi %92, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9028084Z %94 = arith.shrsi %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9028331Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9028622Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9028925Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9029243Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9029548Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x32xi8> -> 
tensor<8x1x32xi8> 2026-02-21T09:03:13.9029832Z %100 = arith.cmpi eq, %97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9030083Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9030355Z %102 = tt.broadcast %98 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9030647Z %103 = arith.select %101, %102, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9030918Z %104 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9031174Z %105 = tt.broadcast %99 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9031461Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9031803Z %107 = arith.select %106, %105, %103 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9032129Z %108 = tt.reshape %107 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9032404Z %109 = arith.sitofp %108 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9032782Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9033119Z scf.yield %110 : tensor<64x32xf32> 2026-02-21T09:03:13.9033322Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:13.9033548Z %28 = arith.truncf %27 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:13.9033853Z %29 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9034164Z %30 = arith.muli %29, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:13.9034428Z %31 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9034731Z %32 = tt.broadcast %30 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.9035000Z %33 = tt.broadcast %31 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.9035245Z %34 = arith.addi %32, %33 : tensor<64x32xi32> 2026-02-21T09:03:13.9035542Z %35 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.9035845Z %36 = tt.addptr 
%35, %34 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:13.9036120Z tt.store %36, %28 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.9036336Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:03:13.9036544Z %37 = arith.muli %c9472_i32, %c1_i32_9 : i32 2026-02-21T09:03:13.9036750Z %38 = arith.addi %arg3, %37 : i32 2026-02-21T09:03:13.9036951Z %39 = arith.divsi %38, %c896_i32 : i32 2026-02-21T09:03:13.9037142Z %40 = arith.muli %39, %c4_i32 : i32 2026-02-21T09:03:13.9037337Z %41 = arith.subi %c1_i32, %40 : i32 2026-02-21T09:03:13.9037523Z %42 = arith.minsi %41, %c4_i32 : i32 2026-02-21T09:03:13.9037750Z %43 = arith.remsi %38, %c896_i32 : i32 2026-02-21T09:03:13.9037952Z %44 = arith.remsi %43, %42 : i32 2026-02-21T09:03:13.9038141Z %45 = arith.addi %40, %44 : i32 2026-02-21T09:03:13.9038334Z %46 = arith.divsi %43, %42 : i32 2026-02-21T09:03:13.9038519Z %47 = arith.muli %45, %c64_i32 : i32 2026-02-21T09:03:13.9038765Z %48 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:13.9039029Z %49 = tt.splat %47 : i32 -> tensor<64xi32> 2026-02-21T09:03:13.9039246Z %50 = arith.addi %49, %48 : tensor<64xi32> 2026-02-21T09:03:13.9039454Z %51 = arith.muli %46, %c32_i32 : i32 2026-02-21T09:03:13.9039695Z %52 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:13.9039957Z %53 = tt.splat %51 : i32 -> tensor<32xi32> 2026-02-21T09:03:13.9040165Z %54 = arith.addi %53, %52 : tensor<32xi32> 2026-02-21T09:03:13.9040381Z %c4080_i32_10 = arith.constant 4080 : i32 2026-02-21T09:03:13.9040589Z %c24_i32_11 = arith.constant 24 : i32 2026-02-21T09:03:13.9040950Z %55 = scf.for %arg4 = %c0_i32 to %c4080_i32_10 step %c24_i32_11 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:13.9041377Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9041655Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9041866Z %68 = arith.addi %67, %66 : tensor<8xi32> 
2026-02-21T09:03:13.9042064Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:13.9042302Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9042547Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9042751Z %72 = arith.addi %71, %70 : tensor<16xi32> 2026-02-21T09:03:13.9043011Z %73 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9043275Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9043545Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9043858Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9044129Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9044367Z %78 = arith.addi %76, %77 : tensor<64x16xi32> 2026-02-21T09:03:13.9044618Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9044907Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9045159Z %81 = tt.load %80 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9045432Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.9045713Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9045978Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.9046233Z %85 = tt.expand_dims %54 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9046523Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9046818Z %87 = tt.broadcast %85 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9047055Z %88 = arith.addi %86, %87 : tensor<8x32xi32> 2026-02-21T09:03:13.9047291Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9047551Z %90 = tt.addptr %89, %88 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9047846Z %91 = tt.load %90 evictionPolicy = 
evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9048113Z %92 = arith.shli %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9048324Z %93 = arith.shrsi %92, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9048541Z %94 = arith.shrsi %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9048780Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9049103Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9049410Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9049731Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9050043Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9050319Z %100 = arith.cmpi eq, %97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9050577Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9050843Z %102 = tt.broadcast %98 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9051136Z %103 = arith.select %101, %102, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9051405Z %104 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9051686Z %105 = tt.broadcast %99 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9051953Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9052226Z %107 = arith.select %106, %105, %103 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9052505Z %108 = tt.reshape %107 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9052762Z %109 = arith.sitofp %108 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9053127Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9053458Z %c1_i32_12 = arith.constant 1 : i32 2026-02-21T09:03:13.9053658Z %111 = arith.muli 
%c8_i32, %c1_i32_12 : i32 2026-02-21T09:03:13.9053858Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:03:13.9054086Z %113 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9054341Z %114 = tt.splat %112 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9054575Z %115 = arith.addi %114, %113 : tensor<8xi32> 2026-02-21T09:03:13.9054781Z %116 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:03:13.9055019Z %117 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9055265Z %118 = tt.splat %116 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9055477Z %119 = arith.addi %118, %117 : tensor<16xi32> 2026-02-21T09:03:13.9055730Z %120 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9056011Z %121 = arith.muli %120, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9056301Z %122 = tt.expand_dims %119 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9056607Z %123 = tt.broadcast %121 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9056883Z %124 = tt.broadcast %122 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9057126Z %125 = arith.addi %123, %124 : tensor<64x16xi32> 2026-02-21T09:03:13.9057383Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9057699Z %127 = tt.addptr %126, %125 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9057976Z %128 = tt.load %127 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9058217Z %129 = arith.extf %128 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.9058510Z %130 = tt.expand_dims %115 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9058785Z %131 = arith.muli %130, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.9059041Z %132 = tt.expand_dims %54 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9059338Z %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9059600Z %134 = 
tt.broadcast %132 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9059870Z %135 = arith.addi %133, %134 : tensor<8x32xi32> 2026-02-21T09:03:13.9060109Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9060397Z %137 = tt.addptr %136, %135 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9060712Z %138 = tt.load %137 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9060981Z %139 = arith.shli %138, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9061211Z %140 = arith.shrsi %139, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9061435Z %141 = arith.shrsi %138, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9061717Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9062014Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9062328Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9062655Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9062971Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9063264Z %147 = arith.cmpi eq, %144, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9063516Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9063788Z %149 = tt.broadcast %145 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9064075Z %150 = arith.select %148, %149, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9064346Z %151 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9064600Z %152 = tt.broadcast %146 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9064866Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9065147Z %154 = arith.select %153, %152, %150 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9065434Z %155 = tt.reshape 
%154 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9065735Z %156 = arith.sitofp %155 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9066094Z %157 = tt.dot %129, %156, %110, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9066413Z %c2_i32_13 = arith.constant 2 : i32 2026-02-21T09:03:13.9066616Z %158 = arith.muli %c8_i32, %c2_i32_13 : i32 2026-02-21T09:03:13.9066808Z %159 = arith.addi %arg4, %158 : i32 2026-02-21T09:03:13.9067044Z %160 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9067355Z %161 = tt.splat %159 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9067565Z %162 = arith.addi %161, %160 : tensor<8xi32> 2026-02-21T09:03:13.9067771Z %163 = arith.muli %159, %c2_i32 : i32 2026-02-21T09:03:13.9068006Z %164 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9068270Z %165 = tt.splat %163 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9068480Z %166 = arith.addi %165, %164 : tensor<16xi32> 2026-02-21T09:03:13.9068778Z %167 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9069061Z %168 = arith.muli %167, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9069331Z %169 = tt.expand_dims %166 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9069634Z %170 = tt.broadcast %168 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9069903Z %171 = tt.broadcast %169 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9070172Z %172 = arith.addi %170, %171 : tensor<64x16xi32> 2026-02-21T09:03:13.9070417Z %173 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9070721Z %174 = tt.addptr %173, %172 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9071024Z %175 = tt.load %174 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9071265Z %176 = arith.extf %175 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.9071584Z %177 
= tt.expand_dims %162 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9071850Z %178 = arith.muli %177, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.9072115Z %179 = tt.expand_dims %54 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9072413Z %180 = tt.broadcast %178 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9072676Z %181 = tt.broadcast %179 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9072921Z %182 = arith.addi %180, %181 : tensor<8x32xi32> 2026-02-21T09:03:13.9073157Z %183 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9073438Z %184 = tt.addptr %183, %182 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9073736Z %185 = tt.load %184 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9074007Z %186 = arith.shli %185, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9074230Z %187 = arith.shrsi %186, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9074447Z %188 = arith.shrsi %185, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9074701Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9074992Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9075316Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9075639Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9075963Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9076314Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9076563Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9076883Z %196 = tt.broadcast %192 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9077173Z %197 = arith.select %195, %196, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9077457Z %198 = 
arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9077717Z %199 = tt.broadcast %193 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9077989Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9078277Z %201 = arith.select %200, %199, %197 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9078582Z %202 = tt.reshape %201 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9078852Z %203 = arith.sitofp %202 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9079204Z %204 = tt.dot %176, %203, %157, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9079534Z scf.yield %204 : tensor<64x32xf32> 2026-02-21T09:03:13.9079717Z } 2026-02-21T09:03:13.9080023Z %56 = scf.for %arg4 = %c4080_i32_10 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %55) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:13.9080389Z %66 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9080634Z %67 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9080844Z %68 = arith.addi %67, %66 : tensor<8xi32> 2026-02-21T09:03:13.9081049Z %69 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:13.9081283Z %70 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9081554Z %71 = tt.splat %69 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9081757Z %72 = arith.addi %71, %70 : tensor<16xi32> 2026-02-21T09:03:13.9082013Z %73 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9082306Z %74 = arith.muli %73, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9082573Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9082866Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9083126Z %77 = tt.broadcast %75 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9083367Z %78 = arith.addi %76, %77 : 
tensor<64x16xi32> 2026-02-21T09:03:13.9083612Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9083906Z %80 = tt.addptr %79, %78 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9084174Z %81 = tt.load %80 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9084418Z %82 = arith.extf %81 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.9084704Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9084962Z %84 = arith.muli %83, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.9085222Z %85 = tt.expand_dims %54 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9085500Z %86 = tt.broadcast %84 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9085762Z %87 = tt.broadcast %85 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9085992Z %88 = arith.addi %86, %87 : tensor<8x32xi32> 2026-02-21T09:03:13.9086227Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9086495Z %90 = tt.addptr %89, %88 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9086781Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9087047Z %92 = arith.shli %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9087256Z %93 = arith.shrsi %92, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9087475Z %94 = arith.shrsi %91, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9087723Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9088057Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9088384Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9088709Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9089040Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9089331Z %100 = arith.cmpi eq, 
%97, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9089602Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9089913Z %102 = tt.broadcast %98 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9090209Z %103 = arith.select %101, %102, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9090500Z %104 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9090763Z %105 = tt.broadcast %99 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9091064Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9091357Z %107 = arith.select %106, %105, %103 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9091684Z %108 = tt.reshape %107 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9091967Z %109 = arith.sitofp %108 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9092346Z %110 = tt.dot %82, %109, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9092689Z scf.yield %110 : tensor<64x32xf32> 2026-02-21T09:03:13.9092887Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:13.9093124Z %57 = arith.truncf %56 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:13.9093457Z %58 = tt.expand_dims %50 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9093737Z %59 = arith.muli %58, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:13.9094007Z %60 = tt.expand_dims %54 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9094300Z %61 = tt.broadcast %59 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.9094572Z %62 = tt.broadcast %60 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.9094813Z %63 = arith.addi %61, %62 : tensor<64x32xi32> 2026-02-21T09:03:13.9095072Z %64 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.9095375Z %65 = tt.addptr %64, %63 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 
2026-02-21T09:03:13.9095647Z tt.store %65, %57 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.9095897Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:13.9096099Z scf.for %arg3 = %8 to %c224_i32 step %c9472_i32 : i32 { 2026-02-21T09:03:13.9096328Z %10 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:13.9096520Z %11 = arith.muli %10, %c4_i32 : i32 2026-02-21T09:03:13.9096713Z %12 = arith.subi %c1_i32, %11 : i32 2026-02-21T09:03:13.9096894Z %13 = arith.minsi %12, %c4_i32 : i32 2026-02-21T09:03:13.9097097Z %14 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:13.9097296Z %15 = arith.remsi %14, %13 : i32 2026-02-21T09:03:13.9097473Z %16 = arith.addi %11, %15 : i32 2026-02-21T09:03:13.9097656Z %17 = arith.divsi %14, %13 : i32 2026-02-21T09:03:13.9097831Z %18 = arith.muli %16, %c64_i32 : i32 2026-02-21T09:03:13.9098062Z %19 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:13.9098305Z %20 = tt.splat %18 : i32 -> tensor<64xi32> 2026-02-21T09:03:13.9098509Z %21 = arith.addi %20, %19 : tensor<64xi32> 2026-02-21T09:03:13.9098702Z %22 = arith.muli %17, %c32_i32 : i32 2026-02-21T09:03:13.9098923Z %23 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:13.9099167Z %24 = tt.splat %22 : i32 -> tensor<32xi32> 2026-02-21T09:03:13.9099386Z %25 = arith.addi %24, %23 : tensor<32xi32> 2026-02-21T09:03:13.9099584Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:03:13.9099770Z %c24_i32 = arith.constant 24 : i32 2026-02-21T09:03:13.9100091Z %26 = scf.for %arg4 = %c0_i32 to %c4080_i32 step %c24_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:13.9100449Z %37 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9100693Z %38 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9100903Z %39 = arith.addi %38, %37 : tensor<8xi32> 2026-02-21T09:03:13.9101124Z %40 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:13.9101359Z %41 = tt.make_range {end = 16 : 
i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9101630Z %42 = tt.splat %40 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9101836Z %43 = arith.addi %42, %41 : tensor<16xi32> 2026-02-21T09:03:13.9102095Z %44 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9102363Z %45 = arith.muli %44, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9102649Z %46 = tt.expand_dims %43 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9102934Z %47 = tt.broadcast %45 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9103198Z %48 = tt.broadcast %46 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9103428Z %49 = arith.addi %47, %48 : tensor<64x16xi32> 2026-02-21T09:03:13.9103673Z %50 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9103963Z %51 = tt.addptr %50, %49 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9104215Z %52 = tt.load %51 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9104454Z %53 = arith.extf %52 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.9104777Z %54 = tt.expand_dims %39 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9105041Z %55 = arith.muli %54, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.9105291Z %56 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9105578Z %57 = tt.broadcast %55 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9105838Z %58 = tt.broadcast %56 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9106069Z %59 = arith.addi %57, %58 : tensor<8x32xi32> 2026-02-21T09:03:13.9106305Z %60 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9106566Z %61 = tt.addptr %60, %59 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9106862Z %62 = tt.load %61 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9107121Z %63 = arith.shli %62, %cst_3 : tensor<8x32xi8> 
2026-02-21T09:03:13.9107339Z %64 = arith.shrsi %63, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9107560Z %65 = arith.shrsi %62, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9107798Z %66 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9108090Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9108396Z %68 = tt.expand_dims %67 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9108717Z %69 = tt.expand_dims %64 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9109031Z %70 = tt.expand_dims %65 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9109309Z %71 = arith.cmpi eq, %68, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9109568Z %72 = tt.broadcast %71 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9109829Z %73 = tt.broadcast %69 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9110111Z %74 = arith.select %72, %73, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9110399Z %75 = arith.cmpi eq, %68, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9110649Z %76 = tt.broadcast %70 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9110908Z %77 = tt.broadcast %75 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9111167Z %78 = arith.select %77, %76, %74 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9111436Z %79 = tt.reshape %78 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9111715Z %80 = arith.sitofp %79 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9112089Z %81 = tt.dot %53, %80, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9112409Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:03:13.9112604Z %82 = arith.muli %c8_i32, %c1_i32_9 : i32 2026-02-21T09:03:13.9112809Z %83 = arith.addi %arg4, %82 : i32 2026-02-21T09:03:13.9113035Z %84 = tt.make_range 
{end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9113280Z %85 = tt.splat %83 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9113501Z %86 = arith.addi %85, %84 : tensor<8xi32> 2026-02-21T09:03:13.9113701Z %87 = arith.muli %83, %c2_i32 : i32 2026-02-21T09:03:13.9113929Z %88 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9114179Z %89 = tt.splat %87 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9114384Z %90 = arith.addi %89, %88 : tensor<16xi32> 2026-02-21T09:03:13.9114630Z %91 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9114899Z %92 = arith.muli %91, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9115150Z %93 = tt.expand_dims %90 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9115466Z %94 = tt.broadcast %92 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9115729Z %95 = tt.broadcast %93 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9115958Z %96 = arith.addi %94, %95 : tensor<64x16xi32> 2026-02-21T09:03:13.9116207Z %97 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9116487Z %98 = tt.addptr %97, %96 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9116749Z %99 = tt.load %98 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9116983Z %100 = arith.extf %99 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.9117272Z %101 = tt.expand_dims %86 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9117546Z %102 = arith.muli %101, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.9117806Z %103 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9118103Z %104 = tt.broadcast %102 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9118364Z %105 = tt.broadcast %103 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9118610Z %106 = arith.addi %104, %105 : tensor<8x32xi32> 2026-02-21T09:03:13.9118844Z %107 = tt.splat 
%arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9119125Z %108 = tt.addptr %107, %106 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9119432Z %109 = tt.load %108 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9119701Z %110 = arith.shli %109, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9119925Z %111 = arith.shrsi %110, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9120142Z %112 = arith.shrsi %109, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9120394Z %113 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9120686Z %114 = tt.expand_dims %113 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9121016Z %115 = tt.expand_dims %114 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9121377Z %116 = tt.expand_dims %111 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9121733Z %117 = tt.expand_dims %112 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9122024Z %118 = arith.cmpi eq, %115, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9122281Z %119 = tt.broadcast %118 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9122566Z %120 = tt.broadcast %116 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9122866Z %121 = arith.select %119, %120, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9123168Z %122 = arith.cmpi eq, %115, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9123431Z %123 = tt.broadcast %117 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9123699Z %124 = tt.broadcast %122 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9123987Z %125 = arith.select %124, %123, %121 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9124293Z %126 = tt.reshape %125 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9124568Z %127 = arith.sitofp %126 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9124930Z %128 = 
tt.dot %100, %127, %81, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9125254Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:03:13.9125461Z %129 = arith.muli %c8_i32, %c2_i32_10 : i32 2026-02-21T09:03:13.9125656Z %130 = arith.addi %arg4, %129 : i32 2026-02-21T09:03:13.9125895Z %131 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9126153Z %132 = tt.splat %130 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9126360Z %133 = arith.addi %132, %131 : tensor<8xi32> 2026-02-21T09:03:13.9126590Z %134 = arith.muli %130, %c2_i32 : i32 2026-02-21T09:03:13.9126828Z %135 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9127086Z %136 = tt.splat %134 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9127295Z %137 = arith.addi %136, %135 : tensor<16xi32> 2026-02-21T09:03:13.9127557Z %138 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9127839Z %139 = arith.muli %138, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9128102Z %140 = tt.expand_dims %137 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9128407Z %141 = tt.broadcast %139 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9128674Z %142 = tt.broadcast %140 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9128920Z %143 = arith.addi %141, %142 : tensor<64x16xi32> 2026-02-21T09:03:13.9129167Z %144 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9129472Z %145 = tt.addptr %144, %143 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9129745Z %146 = tt.load %145 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9129984Z %147 = arith.extf %146 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.9130278Z %148 = tt.expand_dims %133 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9130545Z %149 = arith.muli %148, %cst_4 : tensor<8x1xi32> 
2026-02-21T09:03:13.9130810Z %150 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9131100Z %151 = tt.broadcast %149 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9131369Z %152 = tt.broadcast %150 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9131653Z %153 = arith.addi %151, %152 : tensor<8x32xi32> 2026-02-21T09:03:13.9131892Z %154 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9132199Z %155 = tt.addptr %154, %153 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9132553Z %156 = tt.load %155 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9132850Z %157 = arith.shli %156, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9133084Z %158 = arith.shrsi %157, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9133323Z %159 = arith.shrsi %156, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9133594Z %160 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9133913Z %161 = tt.expand_dims %160 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9134286Z %162 = tt.expand_dims %161 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9134628Z %163 = tt.expand_dims %158 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9134979Z %164 = tt.expand_dims %159 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9135287Z %165 = arith.cmpi eq, %162, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9135557Z %166 = tt.broadcast %165 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9135877Z %167 = tt.broadcast %163 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9136182Z %168 = arith.select %166, %167, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9136487Z %169 = arith.cmpi eq, %162, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9136765Z %170 = tt.broadcast %164 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 
2026-02-21T09:03:13.9137055Z %171 = tt.broadcast %169 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9137355Z %172 = arith.select %171, %170, %168 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9137649Z %173 = tt.reshape %172 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9137968Z %174 = arith.sitofp %173 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9138346Z %175 = tt.dot %147, %174, %128, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9138697Z scf.yield %175 : tensor<64x32xf32> 2026-02-21T09:03:13.9138887Z } 2026-02-21T09:03:13.9139167Z %27 = scf.for %arg4 = %c4080_i32 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %26) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:13.9139557Z %37 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:13.9139807Z %38 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:13.9140020Z %39 = arith.addi %38, %37 : tensor<8xi32> 2026-02-21T09:03:13.9140220Z %40 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:13.9140461Z %41 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:13.9140713Z %42 = tt.splat %40 : i32 -> tensor<16xi32> 2026-02-21T09:03:13.9140915Z %43 = arith.addi %42, %41 : tensor<16xi32> 2026-02-21T09:03:13.9141181Z %44 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9141455Z %45 = arith.muli %44, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:13.9141752Z %46 = tt.expand_dims %43 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:13.9142039Z %47 = tt.broadcast %45 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9142305Z %48 = tt.broadcast %46 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:13.9142544Z %49 = arith.addi %47, %48 : tensor<64x16xi32> 2026-02-21T09:03:13.9142788Z %50 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9143079Z %51 = 
tt.addptr %50, %49 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:13.9143333Z %52 = tt.load %51 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:13.9143575Z %53 = arith.extf %52 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:13.9143853Z %54 = tt.expand_dims %39 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:13.9144149Z %55 = arith.muli %54, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:13.9144408Z %56 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9144689Z %57 = tt.broadcast %55 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9144952Z %58 = tt.broadcast %56 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:13.9145186Z %59 = arith.addi %57, %58 : tensor<8x32xi32> 2026-02-21T09:03:13.9145423Z %60 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9145725Z %61 = tt.addptr %60, %59 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:13.9146017Z %62 = tt.load %61 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:03:13.9146284Z %63 = arith.shli %62, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9146495Z %64 = arith.shrsi %63, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9146715Z %65 = arith.shrsi %62, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:13.9146992Z %66 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:13.9147290Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:13.9147605Z %68 = tt.expand_dims %67 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:13.9147917Z %69 = tt.expand_dims %64 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9148229Z %70 = tt.expand_dims %65 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:13.9148506Z %71 = arith.cmpi eq, %68, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9148759Z %72 = tt.broadcast %71 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 
2026-02-21T09:03:13.9149018Z %73 = tt.broadcast %69 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9149324Z %74 = arith.select %72, %73, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9149604Z %75 = arith.cmpi eq, %68, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:13.9149853Z %76 = tt.broadcast %70 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:13.9150129Z %77 = tt.broadcast %75 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:13.9150401Z %78 = arith.select %77, %76, %74 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:13.9150686Z %79 = tt.reshape %78 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:13.9150946Z %80 = arith.sitofp %79 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:13.9151291Z %81 = tt.dot %53, %80, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:13.9151661Z scf.yield %81 : tensor<64x32xf32> 2026-02-21T09:03:13.9151847Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:13.9152071Z %28 = arith.truncf %27 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:13.9152356Z %29 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:13.9152633Z %30 = arith.muli %29, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:13.9152893Z %31 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:13.9153176Z %32 = tt.broadcast %30 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.9153440Z %33 = tt.broadcast %31 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:13.9153672Z %34 = arith.addi %32, %33 : tensor<64x32xi32> 2026-02-21T09:03:13.9153925Z %35 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.9154211Z %36 = tt.addptr %35, %34 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:13.9154476Z tt.store %36, %28 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:13.9154687Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:13.9154852Z 
tt.return 2026-02-21T09:03:13.9154989Z } 2026-02-21T09:03:13.9155156Z } 2026-02-21T09:03:13.9155227Z 2026-02-21T09:03:13.9155286Z {-# 2026-02-21T09:03:13.9155416Z external_resources: { 2026-02-21T09:03:13.9155581Z mlir_reproducer: { 2026-02-21T09:03:13.9159988Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true})", 2026-02-21T09:03:13.9164549Z disable_threading: false, 2026-02-21T09:03:13.9164720Z verify_each: true 2026-02-21T09:03:13.9164871Z } 2026-02-21T09:03:13.9164997Z } 2026-02-21T09:03:13.9165111Z #-} 2026-02-21T09:03:13.9165542Z /tmp/torchinductor_root/i4/ci4utcp2xuir34degt4jknweduohprfr42hgtb5hamhm3now2cxw.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:03:13.9166571Z /tmp/torchinductor_root/i4/ci4utcp2xuir34degt4jknweduohprfr42hgtb5hamhm3now2cxw.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:03:13.9167398Z [159s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:03:13.9168584Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:03:13.9169649Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:03:13.9169912Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:03:14.4064461Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:03:14.4066056Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:14.4066338Z ^ 2026-02-21T09:03:14.4066777Z /tmp/torchinductor_root/q2/cq23kavdx65kw6lmohpcdzpphzdbesrrqvysps7vqmjlwrr5jxnm.py:87:40: note: called from 2026-02-21T09:03:14.4067204Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:03:14.4072174Z ^ 2026-02-21T09:03:14.4074127Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:03:14.4074796Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:14.4076960Z ^ 2026-02-21T09:03:14.4077277Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:03:14.4081883Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:14.4087422Z %cst = arith.constant dense<0> : tensor<8x2x32xi8> 2026-02-21T09:03:14.4092574Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:03:14.4094818Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:03:14.4095109Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:03:14.4095326Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:14.4099836Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:14.4104101Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:03:14.4108372Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:03:14.4109878Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:03:14.4110169Z %cst_3 = arith.constant dense<4> : tensor<8x32xi8> 2026-02-21T09:03:14.4110412Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:03:14.4110668Z %cst_5 = 
arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:03:14.4110899Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:03:14.4111142Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:03:14.4111377Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:03:14.4111633Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:03:14.4111822Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:03:14.4112181Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:03:14.4112395Z %0 = tt.get_program_id x : i32 2026-02-21T09:03:14.4115657Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:03:14.4116018Z %2 = arith.minsi %1, %c224_i32 : i32 2026-02-21T09:03:14.4116315Z scf.for %arg3 = %0 to %2 step %c1_i32 : i32 { 2026-02-21T09:03:14.4121386Z %3 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:14.4125295Z %4 = arith.muli %3, %c4_i32 : i32 2026-02-21T09:03:14.4127359Z %5 = arith.subi %c1_i32, %4 : i32 2026-02-21T09:03:14.4127606Z %6 = arith.minsi %5, %c4_i32 : i32 2026-02-21T09:03:14.4127817Z %7 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:14.4128019Z %8 = arith.remsi %7, %6 : i32 2026-02-21T09:03:14.4128199Z %9 = arith.addi %4, %8 : i32 2026-02-21T09:03:14.4128379Z %10 = arith.divsi %7, %6 : i32 2026-02-21T09:03:14.4128561Z %11 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:03:14.4128809Z %12 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:14.4129072Z %13 = tt.splat %11 : i32 -> tensor<64xi32> 2026-02-21T09:03:14.4129285Z %14 = arith.addi %13, %12 : tensor<64xi32> 2026-02-21T09:03:14.4129483Z %15 = arith.muli %10, %c32_i32 : i32 2026-02-21T09:03:14.4129709Z %16 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:14.4129953Z %17 = tt.splat %15 : i32 -> tensor<32xi32> 2026-02-21T09:03:14.4130145Z %18 = arith.addi %17, %16 : tensor<32xi32> 2026-02-21T09:03:14.4130342Z %c32_i32_7 = arith.constant 32 : i32 2026-02-21T09:03:14.4130670Z %19 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32_7 
iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:14.4131052Z %29 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:14.4131311Z %30 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:03:14.4131519Z %31 = arith.addi %30, %29 : tensor<8xi32> 2026-02-21T09:03:14.4132087Z %32 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:14.4132327Z %33 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:14.4132597Z %34 = tt.splat %32 : i32 -> tensor<16xi32> 2026-02-21T09:03:14.4132808Z %35 = arith.addi %34, %33 : tensor<16xi32> 2026-02-21T09:03:14.4133086Z %36 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:14.4133386Z %37 = arith.muli %36, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:14.4133655Z %38 = tt.expand_dims %35 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:14.4134017Z %39 = tt.broadcast %37 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:14.4134289Z %40 = tt.broadcast %38 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:14.4134551Z %41 = arith.addi %39, %40 : tensor<64x16xi32> 2026-02-21T09:03:14.4134875Z %42 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:14.4135191Z %43 = tt.addptr %42, %41 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:14.4135502Z %44 = tt.load %43 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:14.4135763Z %45 = arith.extf %44 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:14.4136064Z %46 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:14.4136353Z %47 = arith.muli %46, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:14.4136627Z %48 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:14.4136930Z %49 = tt.broadcast %47 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:14.4137210Z %50 = tt.broadcast %48 : tensor<1x32xi32> -> tensor<8x32xi32> 
2026-02-21T09:03:14.4137464Z %51 = arith.addi %49, %50 : tensor<8x32xi32> 2026-02-21T09:03:14.4137758Z %52 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:14.4138042Z %53 = tt.addptr %52, %51 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:14.4138306Z %54 = tt.load %53 : tensor<8x32x!tt.ptr> 2026-02-21T09:03:14.4138532Z %55 = arith.shli %54, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4138754Z %56 = arith.shrsi %55, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4138980Z %57 = arith.shrsi %54, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4139223Z %58 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:14.4139530Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:14.4139855Z %60 = tt.expand_dims %59 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:14.4140194Z %61 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:14.4140526Z %62 = tt.expand_dims %57 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:14.4140817Z %63 = arith.cmpi eq, %60, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:14.4141082Z %64 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:14.4141380Z %65 = tt.broadcast %61 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:14.4141704Z %66 = arith.select %64, %65, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:14.4141975Z %67 = arith.cmpi eq, %60, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:14.4142216Z %68 = tt.broadcast %62 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:14.4142478Z %69 = tt.broadcast %67 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:14.4142745Z %70 = arith.select %69, %68, %66 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:14.4143015Z %71 = tt.reshape %70 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:14.4143264Z %72 = arith.sitofp %71 : tensor<16x32xi8> to 
tensor<16x32xf32> 2026-02-21T09:03:14.4143621Z %73 = tt.dot %45, %72, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:14.4143990Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:03:14.4144198Z %74 = arith.muli %c8_i32, %c1_i32_8 : i32 2026-02-21T09:03:14.4144400Z %75 = arith.addi %arg4, %74 : i32 2026-02-21T09:03:14.4144626Z %76 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:14.4144876Z %77 = tt.splat %75 : i32 -> tensor<8xi32> 2026-02-21T09:03:14.4145075Z %78 = arith.addi %77, %76 : tensor<8xi32> 2026-02-21T09:03:14.4145274Z %79 = arith.muli %75, %c2_i32 : i32 2026-02-21T09:03:14.4145530Z %80 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:14.4145780Z %81 = tt.splat %79 : i32 -> tensor<16xi32> 2026-02-21T09:03:14.4145979Z %82 = arith.addi %81, %80 : tensor<16xi32> 2026-02-21T09:03:14.4146232Z %83 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:14.4146504Z %84 = arith.muli %83, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:14.4146783Z %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:14.4147076Z %86 = tt.broadcast %84 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:14.4147329Z %87 = tt.broadcast %85 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:14.4147574Z %88 = arith.addi %86, %87 : tensor<64x16xi32> 2026-02-21T09:03:14.4147821Z %89 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:14.4148102Z %90 = tt.addptr %89, %88 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:14.4148370Z %91 = tt.load %90 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:14.4148602Z %92 = arith.extf %91 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:14.4148914Z %93 = tt.expand_dims %78 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:14.4149173Z %94 = arith.muli %93, %cst_4 : tensor<8x1xi32> 
2026-02-21T09:03:14.4149430Z %95 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:14.4149716Z %96 = tt.broadcast %94 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:14.4149966Z %97 = tt.broadcast %95 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:14.4150197Z %98 = arith.addi %96, %97 : tensor<8x32xi32> 2026-02-21T09:03:14.4150423Z %99 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:14.4150692Z %100 = tt.addptr %99, %98 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:14.4150943Z %101 = tt.load %100 : tensor<8x32x!tt.ptr> 2026-02-21T09:03:14.4151163Z %102 = arith.shli %101, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4151387Z %103 = arith.shrsi %102, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4151645Z %104 = arith.shrsi %101, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4151903Z %105 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:14.4152202Z %106 = tt.expand_dims %105 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:14.4152529Z %107 = tt.expand_dims %106 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:14.4152866Z %108 = tt.expand_dims %103 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:14.4153187Z %109 = tt.expand_dims %104 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:14.4153487Z %110 = arith.cmpi eq, %107, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:14.4153745Z %111 = tt.broadcast %110 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:14.4154022Z %112 = tt.broadcast %108 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:14.4154312Z %113 = arith.select %111, %112, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:14.4154599Z %114 = arith.cmpi eq, %107, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:14.4154886Z %115 = tt.broadcast %109 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:14.4155155Z %116 = tt.broadcast 
%114 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:14.4155446Z %117 = arith.select %116, %115, %113 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:14.4155725Z %118 = tt.reshape %117 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:14.4155993Z %119 = arith.sitofp %118 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:14.4156356Z %120 = tt.dot %92, %119, %73, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:14.4156703Z %c2_i32_9 = arith.constant 2 : i32 2026-02-21T09:03:14.4156911Z %121 = arith.muli %c8_i32, %c2_i32_9 : i32 2026-02-21T09:03:14.4157106Z %122 = arith.addi %arg4, %121 : i32 2026-02-21T09:03:14.4157344Z %123 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:14.4157593Z %124 = tt.splat %122 : i32 -> tensor<8xi32> 2026-02-21T09:03:14.4157849Z %125 = arith.addi %124, %123 : tensor<8xi32> 2026-02-21T09:03:14.4158051Z %126 = arith.muli %122, %c2_i32 : i32 2026-02-21T09:03:14.4158289Z %127 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:14.4158541Z %128 = tt.splat %126 : i32 -> tensor<16xi32> 2026-02-21T09:03:14.4158745Z %129 = arith.addi %128, %127 : tensor<16xi32> 2026-02-21T09:03:14.4159004Z %130 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:14.4159277Z %131 = arith.muli %130, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:14.4159546Z %132 = tt.expand_dims %129 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:14.4159846Z %133 = tt.broadcast %131 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:03:14.4160141Z %134 = tt.broadcast %132 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:14.4160394Z %135 = arith.addi %133, %134 : tensor<64x16xi32> 2026-02-21T09:03:14.4160639Z %136 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:14.4160942Z %137 = tt.addptr %136, %135 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 
2026-02-21T09:03:14.4161212Z %138 = tt.load %137 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:14.4161459Z %139 = arith.extf %138 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:14.4161783Z %140 = tt.expand_dims %125 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:14.4162048Z %141 = arith.muli %140, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:14.4162312Z %142 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:14.4162600Z %143 = tt.broadcast %141 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:14.4162873Z %144 = tt.broadcast %142 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:14.4163112Z %145 = arith.addi %143, %144 : tensor<8x32xi32> 2026-02-21T09:03:14.4163358Z %146 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:14.4163637Z %147 = tt.addptr %146, %145 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:14.4163893Z %148 = tt.load %147 : tensor<8x32x!tt.ptr> 2026-02-21T09:03:14.4164114Z %149 = arith.shli %148, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4164333Z %150 = arith.shrsi %149, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4164561Z %151 = arith.shrsi %148, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4164804Z %152 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:14.4165104Z %153 = tt.expand_dims %152 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:14.4165426Z %154 = tt.expand_dims %153 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:14.4165772Z %155 = tt.expand_dims %150 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:14.4166104Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:14.4166392Z %157 = arith.cmpi eq, %154, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:14.4166650Z %158 = tt.broadcast %157 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:14.4166928Z %159 = tt.broadcast %155 : 
tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:14.4167216Z %160 = arith.select %158, %159, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:14.4167527Z %161 = arith.cmpi eq, %154, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:14.4167777Z %162 = tt.broadcast %156 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:14.4168049Z %163 = tt.broadcast %161 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:14.4168324Z %164 = arith.select %163, %162, %160 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:14.4168611Z %165 = tt.reshape %164 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:14.4168902Z %166 = arith.sitofp %165 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:14.4169259Z %167 = tt.dot %139, %166, %120, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:14.4169590Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:14.4169785Z %168 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:03:14.4169992Z %169 = arith.addi %arg4, %168 : i32 2026-02-21T09:03:14.4170235Z %170 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:14.4170480Z %171 = tt.splat %169 : i32 -> tensor<8xi32> 2026-02-21T09:03:14.4170693Z %172 = arith.addi %171, %170 : tensor<8xi32> 2026-02-21T09:03:14.4170891Z %173 = arith.muli %169, %c2_i32 : i32 2026-02-21T09:03:14.4171160Z %174 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:03:14.4171411Z %175 = tt.splat %173 : i32 -> tensor<16xi32> 2026-02-21T09:03:14.4171667Z %176 = arith.addi %175, %174 : tensor<16xi32> 2026-02-21T09:03:14.4171924Z %177 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:14.4172205Z %178 = arith.muli %177, %cst_5 : tensor<64x1xi32> 2026-02-21T09:03:14.4172475Z %179 = tt.expand_dims %176 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:03:14.4172770Z %180 = tt.broadcast %178 : tensor<64x1xi32> -> 
tensor<64x16xi32> 2026-02-21T09:03:14.4173047Z %181 = tt.broadcast %179 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:03:14.4173289Z %182 = arith.addi %180, %181 : tensor<64x16xi32> 2026-02-21T09:03:14.4173545Z %183 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:03:14.4173850Z %184 = tt.addptr %183, %182 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:03:14.4174123Z %185 = tt.load %184 : tensor<64x16x!tt.ptr> 2026-02-21T09:03:14.4174374Z %186 = arith.extf %185 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:03:14.4174679Z %187 = tt.expand_dims %172 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:14.4174966Z %188 = arith.muli %187, %cst_4 : tensor<8x1xi32> 2026-02-21T09:03:14.4175241Z %189 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:14.4175557Z %190 = tt.broadcast %188 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:03:14.4175841Z %191 = tt.broadcast %189 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:03:14.4176090Z %192 = arith.addi %190, %191 : tensor<8x32xi32> 2026-02-21T09:03:14.4176347Z %193 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:03:14.4176641Z %194 = tt.addptr %193, %192 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:03:14.4176940Z %195 = tt.load %194 : tensor<8x32x!tt.ptr> 2026-02-21T09:03:14.4177162Z %196 = arith.shli %195, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4177398Z %197 = arith.shrsi %196, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4177631Z %198 = arith.shrsi %195, %cst_3 : tensor<8x32xi8> 2026-02-21T09:03:14.4177889Z %199 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:14.4178203Z %200 = tt.expand_dims %199 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:14.4178533Z %201 = tt.expand_dims %200 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:14.4178904Z %202 = tt.expand_dims %197 {axis = 1 : i32} : 
tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:14.4179246Z %203 = tt.expand_dims %198 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:03:14.4179546Z %204 = arith.cmpi eq, %201, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:14.4179817Z %205 = tt.broadcast %204 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:14.4180124Z %206 = tt.broadcast %202 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:14.4180430Z %207 = arith.select %205, %206, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:14.4180715Z %208 = arith.cmpi eq, %201, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:14.4180985Z %209 = tt.broadcast %203 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:03:14.4181266Z %210 = tt.broadcast %208 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:03:14.4181578Z %211 = arith.select %210, %209, %207 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:03:14.4181884Z %212 = tt.reshape %211 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:03:14.4182159Z %213 = arith.sitofp %212 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:03:14.4182567Z %214 = tt.dot %186, %213, %167, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:14.4182911Z scf.yield %214 : tensor<64x32xf32> 2026-02-21T09:03:14.4183097Z } 2026-02-21T09:03:14.4183298Z %20 = arith.truncf %19 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:14.4183610Z %21 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:14.4183891Z %22 = arith.muli %21, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:14.4184148Z %23 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:14.4184439Z %24 = tt.broadcast %22 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:14.4184702Z %25 = tt.broadcast %23 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:14.4184933Z %26 = arith.addi %24, %25 : tensor<64x32xi32> 
2026-02-21T09:03:14.4185180Z %27 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:14.4185461Z %28 = tt.addptr %27, %26 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:14.4185729Z tt.store %28, %20 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:14.4185953Z } {tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T09:03:14.4186153Z tt.return 2026-02-21T09:03:14.4186281Z } 2026-02-21T09:03:14.4186407Z } 2026-02-21T09:03:14.4186477Z 2026-02-21T09:03:14.4186536Z {-# 2026-02-21T09:03:14.4186667Z external_resources: { 2026-02-21T09:03:14.4186832Z mlir_reproducer: { 2026-02-21T09:03:14.4191296Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, 
tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:03:14.4195815Z disable_threading: false, 2026-02-21T09:03:14.4195984Z verify_each: true 2026-02-21T09:03:14.4196136Z } 2026-02-21T09:03:14.4196256Z } 2026-02-21T09:03:14.4196381Z #-} 2026-02-21T09:03:14.4196807Z /tmp/torchinductor_root/q2/cq23kavdx65kw6lmohpcdzpphzdbesrrqvysps7vqmjlwrr5jxnm.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:03:14.4197832Z /tmp/torchinductor_root/q2/cq23kavdx65kw6lmohpcdzpphzdbesrrqvysps7vqmjlwrr5jxnm.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:03:14.4198692Z [159s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:03:14.4199856Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:03:14.4200880Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:03:14.4201143Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:03:14.9874104Z 2026-02-21T09:03:14.9878411Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 12.7 configs/s 2026-02-21T09:03:21.8299084Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 146.7 2026-02-21T09:03:21.8303217Z configs/s 2026-02-21T09:03:22.0509702Z [167s] Generation 6 complete: 2026-02-21T09:03:22.0513467Z error=11 2026-02-21T09:03:22.0515120Z ok=81 2026-02-21T09:03:22.0515346Z min=0.1076 2026-02-21T09:03:22.0520661Z mid=0.1648 2026-02-21T09:03:22.0525111Z max=13.5762 2026-02-21T09:03:22.0527052Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:03:22.0527342Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:03:22.0527580Z 'l2_groupings': [1], 2026-02-21T09:03:22.0527766Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:03:22.0527994Z 'loop_orders': [[0, 1]], 2026-02-21T09:03:22.0528192Z 'num_stages': 3, 2026-02-21T09:03:22.0528366Z 'num_warps': 1, 2026-02-21T09:03:22.0528544Z 'pid_type': 'flat', 2026-02-21T09:03:22.0528729Z 'range_flattens': [None, None], 2026-02-21T09:03:22.0528963Z 'range_multi_buffers': [None, True], 2026-02-21T09:03:22.0529166Z 'range_num_stages': [0, 0], 2026-02-21T09:03:22.0529373Z 'range_unroll_factors': [0, 0], 2026-02-21T09:03:22.0529813Z 'range_warp_specializes': [None, None]} 2026-02-21T09:03:22.0534040Z [167s] Fitting surrogate: 695 points, 695 targets 2026-02-21T09:03:23.1889064Z [168s] Generation 7 starting: 73 neighbors, 4 active search path(s) 2026-02-21T09:03:38.2125739Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 1.4 configs/s 2026-02-21T09:03:41.8626770Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:03:41.8627713Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:41.8628258Z ^ 2026-02-21T09:03:41.8628681Z /tmp/torchinductor_root/v3/cv36cfq6jhwmmmsuq6hh3imbtidotcd6eebzvkpqxvdyt2uxe2kf.py:87:40: 
note: called from 2026-02-21T09:03:41.8629104Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:03:41.8629310Z ^ 2026-02-21T09:03:41.8629817Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:03:41.8630416Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:41.8630680Z ^ 2026-02-21T09:03:41.8630897Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:03:41.8631478Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:41.8632142Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:03:41.8632373Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:03:41.8632569Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:03:41.8632775Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:41.8632958Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:41.8633199Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:03:41.8633442Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:03:41.8633714Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:03:41.8633960Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:03:41.8634229Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:03:41.8634474Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:03:41.8634725Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:03:41.8634992Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:03:41.8635177Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:03:41.8635378Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:03:41.8635572Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:03:41.8635778Z %0 = tt.get_program_id x : i32 2026-02-21T09:03:41.8635967Z %1 = arith.addi 
%0, %c1_i32 : i32 2026-02-21T09:03:41.8636152Z %2 = arith.minsi %1, %c224_i32 : i32 2026-02-21T09:03:41.8636348Z %3 = arith.subi %2, %0 : i32 2026-02-21T09:03:41.8636525Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:03:41.8636735Z %4 = arith.subi %c1_i32, %c1_i32_6 : i32 2026-02-21T09:03:41.8636923Z %5 = arith.addi %3, %4 : i32 2026-02-21T09:03:41.8637112Z %6 = arith.divui %5, %c1_i32 : i32 2026-02-21T09:03:41.8637285Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:03:41.8637467Z %7 = arith.remsi %6, %c2_i32_7 : i32 2026-02-21T09:03:41.8637647Z %8 = arith.subi %6, %7 : i32 2026-02-21T09:03:41.8637813Z %9 = arith.muli %8, %c1_i32 : i32 2026-02-21T09:03:41.8637991Z %10 = arith.addi %0, %9 : i32 2026-02-21T09:03:41.8638174Z %11 = arith.muli %c1_i32, %c2_i32_7 : i32 2026-02-21T09:03:41.8638381Z scf.for %arg3 = %0 to %10 step %11 : i32 { 2026-02-21T09:03:41.8638591Z %12 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:41.8638792Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:03:41.8638973Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:03:41.8639251Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:03:41.8639444Z %16 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:41.8639628Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:03:41.8639809Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:03:41.8639978Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:03:41.8640159Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:03:41.8640388Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:41.8640653Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8640898Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:03:41.8641095Z %24 = arith.muli %19, %c32_i32 : i32 2026-02-21T09:03:41.8641327Z %25 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:41.8641608Z %26 = tt.splat %24 : i32 -> tensor<32xi32> 2026-02-21T09:03:41.8641849Z %27 = arith.addi %26, %25 : tensor<32xi32> 
2026-02-21T09:03:41.8642039Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:03:41.8642426Z %28 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:41.8642761Z %66 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8642975Z %67 = arith.addi %66, %21 : tensor<64xi32> 2026-02-21T09:03:41.8643182Z %68 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:41.8643415Z %69 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8643675Z %70 = tt.splat %68 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8643881Z %71 = arith.addi %70, %69 : tensor<128xi32> 2026-02-21T09:03:41.8644149Z %72 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8644422Z %73 = arith.muli %72, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8644692Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8644997Z %75 = tt.broadcast %73 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8645273Z %76 = tt.broadcast %74 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8645531Z %77 = arith.addi %75, %76 : tensor<64x128xi32> 2026-02-21T09:03:41.8645777Z %78 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8646068Z %79 = tt.addptr %78, %77 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8646326Z %80 = tt.load %79 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8646573Z %81 = arith.extf %80 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8646862Z %82 = tt.expand_dims %67 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8647126Z %83 = arith.muli %82, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8647382Z %84 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8647667Z %85 = tt.broadcast %83 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8647931Z %86 
= tt.broadcast %84 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8648174Z %87 = arith.addi %85, %86 : tensor<64x32xi32> 2026-02-21T09:03:41.8648412Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8648690Z %89 = tt.addptr %88, %87 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8648983Z %90 = tt.load %89 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8649284Z %91 = arith.shli %90, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8649506Z %92 = arith.shrsi %91, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8649725Z %93 = arith.shrsi %90, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8649958Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8650253Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8650596Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8650919Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8651232Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8651515Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8651796Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8652088Z %101 = tt.broadcast %97 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8652380Z %102 = arith.select %100, %101, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8652654Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8652937Z %104 = tt.broadcast %98 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8653214Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8653523Z %106 = arith.select %105, %104, %102 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8653810Z %107 = tt.reshape %106 : 
tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8654078Z %108 = arith.sitofp %107 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8654461Z %109 = tt.dot %81, %108, %arg5, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8654787Z %c1_i32_10 = arith.constant 1 : i32 2026-02-21T09:03:41.8654996Z %110 = arith.muli %c64_i32, %c1_i32_10 : i32 2026-02-21T09:03:41.8655199Z %111 = arith.addi %arg4, %110 : i32 2026-02-21T09:03:41.8655395Z %112 = tt.splat %111 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8655601Z %113 = arith.addi %112, %21 : tensor<64xi32> 2026-02-21T09:03:41.8655793Z %114 = arith.muli %111, %c2_i32 : i32 2026-02-21T09:03:41.8656038Z %115 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8656294Z %116 = tt.splat %114 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8656514Z %117 = arith.addi %116, %115 : tensor<128xi32> 2026-02-21T09:03:41.8656781Z %118 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8657056Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8657334Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8657630Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8657910Z %122 = tt.broadcast %120 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8658154Z %123 = arith.addi %121, %122 : tensor<64x128xi32> 2026-02-21T09:03:41.8658408Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8658713Z %125 = tt.addptr %124, %123 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8658987Z %126 = tt.load %125 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8659250Z %127 = arith.extf %126 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8659552Z %128 = tt.expand_dims %113 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:03:41.8659842Z %129 = arith.muli %128, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8660125Z %130 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8660431Z %131 = tt.broadcast %129 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8660716Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8660970Z %133 = arith.addi %131, %132 : tensor<64x32xi32> 2026-02-21T09:03:41.8661227Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8661600Z %135 = tt.addptr %134, %133 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8661939Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8662230Z %137 = arith.shli %136, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8662462Z %138 = arith.shrsi %137, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8662703Z %139 = arith.shrsi %136, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8662964Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8663285Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8663648Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8664003Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8664393Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8664694Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8665025Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8665315Z %147 = tt.broadcast %143 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8665630Z %148 = arith.select %146, %147, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8665931Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 
2026-02-21T09:03:41.8666196Z %150 = tt.broadcast %144 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8666491Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8666788Z %152 = arith.select %151, %150, %148 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8667091Z %153 = tt.reshape %152 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8667375Z %154 = arith.sitofp %153 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8667779Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8668122Z %c2_i32_11 = arith.constant 2 : i32 2026-02-21T09:03:41.8668324Z %156 = arith.muli %c64_i32, %c2_i32_11 : i32 2026-02-21T09:03:41.8668528Z %157 = arith.addi %arg4, %156 : i32 2026-02-21T09:03:41.8668723Z %158 = tt.splat %157 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8668934Z %159 = arith.addi %158, %21 : tensor<64xi32> 2026-02-21T09:03:41.8669131Z %160 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:03:41.8669378Z %161 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8669641Z %162 = tt.splat %160 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8669849Z %163 = arith.addi %162, %161 : tensor<128xi32> 2026-02-21T09:03:41.8670115Z %164 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8670392Z %165 = arith.muli %164, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8670672Z %166 = tt.expand_dims %163 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8671039Z %167 = tt.broadcast %165 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8671335Z %168 = tt.broadcast %166 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8671653Z %169 = arith.addi %167, %168 : tensor<64x128xi32> 2026-02-21T09:03:41.8673598Z %170 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8673906Z %171 
= tt.addptr %170, %169 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8674182Z %172 = tt.load %171 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8674448Z %173 = arith.extf %172 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8674753Z %174 = tt.expand_dims %159 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8675068Z %175 = arith.muli %174, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8685134Z %176 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8685430Z %177 = tt.broadcast %175 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8685707Z %178 = tt.broadcast %176 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8685949Z %179 = arith.addi %177, %178 : tensor<64x32xi32> 2026-02-21T09:03:41.8686200Z %180 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8686526Z %181 = tt.addptr %180, %179 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8686836Z %182 = tt.load %181 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8687117Z %183 = arith.shli %182, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8687373Z %184 = arith.shrsi %183, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8687605Z %185 = arith.shrsi %182, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8687886Z %186 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8688185Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8688516Z %188 = tt.expand_dims %187 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8688845Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8689183Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8689472Z %191 = arith.cmpi eq, %188, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8689733Z %192 = 
tt.broadcast %191 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8690017Z %193 = tt.broadcast %189 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8690313Z %194 = arith.select %192, %193, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8690596Z %195 = arith.cmpi eq, %188, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8690848Z %196 = tt.broadcast %190 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8691130Z %197 = tt.broadcast %195 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8691421Z %198 = arith.select %197, %196, %194 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8691730Z %199 = tt.reshape %198 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8692015Z %200 = arith.sitofp %199 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8692380Z %201 = tt.dot %173, %200, %155, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8692718Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:41.8692915Z %202 = arith.muli %c64_i32, %c3_i32 : i32 2026-02-21T09:03:41.8693122Z %203 = arith.addi %arg4, %202 : i32 2026-02-21T09:03:41.8693329Z %204 = tt.splat %203 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8693539Z %205 = arith.addi %204, %21 : tensor<64xi32> 2026-02-21T09:03:41.8693749Z %206 = arith.muli %203, %c2_i32 : i32 2026-02-21T09:03:41.8693989Z %207 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8694259Z %208 = tt.splat %206 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8694466Z %209 = arith.addi %208, %207 : tensor<128xi32> 2026-02-21T09:03:41.8694735Z %210 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8695024Z %211 = arith.muli %210, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8695297Z %212 = tt.expand_dims %209 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8695612Z %213 = tt.broadcast 
%211 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8695914Z %214 = tt.broadcast %212 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8696172Z %215 = arith.addi %213, %214 : tensor<64x128xi32> 2026-02-21T09:03:41.8696421Z %216 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8696725Z %217 = tt.addptr %216, %215 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8697009Z %218 = tt.load %217 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8697259Z %219 = arith.extf %218 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8697592Z %220 = tt.expand_dims %205 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8697867Z %221 = arith.muli %220, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8698136Z %222 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8698465Z %223 = tt.broadcast %221 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8698734Z %224 = tt.broadcast %222 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8699015Z %225 = arith.addi %223, %224 : tensor<64x32xi32> 2026-02-21T09:03:41.8699254Z %226 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8699544Z %227 = tt.addptr %226, %225 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8699849Z %228 = tt.load %227 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8700129Z %229 = arith.shli %228, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8700358Z %230 = arith.shrsi %229, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8700579Z %231 = arith.shrsi %228, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8700837Z %232 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8701138Z %233 = tt.expand_dims %232 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8701466Z %234 = tt.expand_dims %233 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 
2026-02-21T09:03:41.8701850Z %235 = tt.expand_dims %230 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8702183Z %236 = tt.expand_dims %231 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8702484Z %237 = arith.cmpi eq, %234, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8702737Z %238 = tt.broadcast %237 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8703019Z %239 = tt.broadcast %235 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8703309Z %240 = arith.select %238, %239, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8703595Z %241 = arith.cmpi eq, %234, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8703854Z %242 = tt.broadcast %236 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8704129Z %243 = tt.broadcast %241 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8704432Z %244 = arith.select %243, %242, %240 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8704736Z %245 = tt.reshape %244 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8705035Z %246 = arith.sitofp %245 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8705432Z %247 = tt.dot %219, %246, %201, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8705783Z scf.yield %247 : tensor<64x32xf32> 2026-02-21T09:03:41.8705978Z } 2026-02-21T09:03:41.8706239Z %29 = arith.truncf %28 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:41.8706551Z %30 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8706834Z %31 = arith.muli %30, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8707111Z %32 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8707449Z %33 = tt.broadcast %31 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8707722Z %34 = tt.broadcast %32 : tensor<1x32xi32> -> tensor<64x32xi32> 
2026-02-21T09:03:41.8707978Z %35 = arith.addi %33, %34 : tensor<64x32xi32> 2026-02-21T09:03:41.8708235Z %36 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8708538Z %37 = tt.addptr %36, %35 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8708814Z tt.store %37, %29 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8709037Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:03:41.8709284Z %38 = arith.muli %c1_i32, %c1_i32_8 : i32 2026-02-21T09:03:41.8709488Z %39 = arith.addi %arg3, %38 : i32 2026-02-21T09:03:41.8709692Z %40 = arith.divsi %39, %c896_i32 : i32 2026-02-21T09:03:41.8709885Z %41 = arith.muli %40, %c4_i32 : i32 2026-02-21T09:03:41.8710104Z %42 = arith.subi %c1_i32, %41 : i32 2026-02-21T09:03:41.8710296Z %43 = arith.minsi %42, %c4_i32 : i32 2026-02-21T09:03:41.8710497Z %44 = arith.remsi %39, %c896_i32 : i32 2026-02-21T09:03:41.8710714Z %45 = arith.remsi %44, %43 : i32 2026-02-21T09:03:41.8710908Z %46 = arith.addi %41, %45 : i32 2026-02-21T09:03:41.8711099Z %47 = arith.divsi %44, %43 : i32 2026-02-21T09:03:41.8711284Z %48 = arith.muli %46, %c64_i32 : i32 2026-02-21T09:03:41.8711531Z %49 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:41.8711839Z %50 = tt.splat %48 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8712055Z %51 = arith.addi %50, %49 : tensor<64xi32> 2026-02-21T09:03:41.8712256Z %52 = arith.muli %47, %c32_i32 : i32 2026-02-21T09:03:41.8712502Z %53 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:41.8712765Z %54 = tt.splat %52 : i32 -> tensor<32xi32> 2026-02-21T09:03:41.8712971Z %55 = arith.addi %54, %53 : tensor<32xi32> 2026-02-21T09:03:41.8713182Z %c256_i32_9 = arith.constant 256 : i32 2026-02-21T09:03:41.8713533Z %56 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c256_i32_9 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:41.8713897Z %66 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8714118Z %67 = arith.addi 
%66, %49 : tensor<64xi32> 2026-02-21T09:03:41.8714340Z %68 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:41.8714588Z %69 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8714844Z %70 = tt.splat %68 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8715056Z %71 = arith.addi %70, %69 : tensor<128xi32> 2026-02-21T09:03:41.8715308Z %72 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8715587Z %73 = arith.muli %72, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8715848Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8716153Z %75 = tt.broadcast %73 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8716428Z %76 = tt.broadcast %74 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8716669Z %77 = arith.addi %75, %76 : tensor<64x128xi32> 2026-02-21T09:03:41.8716921Z %78 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8717211Z %79 = tt.addptr %78, %77 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8717484Z %80 = tt.load %79 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8717724Z %81 = arith.extf %80 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8718014Z %82 = tt.expand_dims %67 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8718292Z %83 = arith.muli %82, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8718548Z %84 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8718871Z %85 = tt.broadcast %83 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8719131Z %86 = tt.broadcast %84 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8719378Z %87 = arith.addi %85, %86 : tensor<64x32xi32> 2026-02-21T09:03:41.8719622Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8719895Z %89 = tt.addptr %88, %87 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 
2026-02-21T09:03:41.8720202Z %90 = tt.load %89 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8720494Z %91 = arith.shli %90, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8720722Z %92 = arith.shrsi %91, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8720939Z %93 = arith.shrsi %90, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8721196Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8721603Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8721915Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8722266Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8722583Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8722887Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8723145Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8723434Z %101 = tt.broadcast %97 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8723740Z %102 = arith.select %100, %101, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8724018Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8724281Z %104 = tt.broadcast %98 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8724553Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8724845Z %106 = arith.select %105, %104, %102 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8725141Z %107 = tt.reshape %106 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8725411Z %108 = arith.sitofp %107 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8725790Z %109 = tt.dot %81, %108, %arg5, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8726118Z 
%c1_i32_10 = arith.constant 1 : i32 2026-02-21T09:03:41.8726327Z %110 = arith.muli %c64_i32, %c1_i32_10 : i32 2026-02-21T09:03:41.8726525Z %111 = arith.addi %arg4, %110 : i32 2026-02-21T09:03:41.8726732Z %112 = tt.splat %111 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8726944Z %113 = arith.addi %112, %49 : tensor<64xi32> 2026-02-21T09:03:41.8727141Z %114 = arith.muli %111, %c2_i32 : i32 2026-02-21T09:03:41.8727388Z %115 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8727645Z %116 = tt.splat %114 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8727860Z %117 = arith.addi %116, %115 : tensor<128xi32> 2026-02-21T09:03:41.8728114Z %118 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8728393Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8728663Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8728967Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8729245Z %122 = tt.broadcast %120 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8729491Z %123 = arith.addi %121, %122 : tensor<64x128xi32> 2026-02-21T09:03:41.8729745Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8730068Z %125 = tt.addptr %124, %123 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8730347Z %126 = tt.load %125 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8730602Z %127 = arith.extf %126 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8730893Z %128 = tt.expand_dims %113 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8731175Z %129 = arith.muli %128, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8731433Z %130 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8731797Z %131 = tt.broadcast %129 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8732065Z %132 
= tt.broadcast %130 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8732310Z %133 = arith.addi %131, %132 : tensor<64x32xi32> 2026-02-21T09:03:41.8732582Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8732867Z %135 = tt.addptr %134, %133 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8733204Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8733472Z %137 = arith.shli %136, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8733701Z %138 = arith.shrsi %137, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8733929Z %139 = arith.shrsi %136, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8734176Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8734488Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8734807Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8735151Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8735493Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8735783Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8736045Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8736316Z %147 = tt.broadcast %143 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8736611Z %148 = arith.select %146, %147, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8736883Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8737142Z %150 = tt.broadcast %144 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8737420Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8737700Z %152 = arith.select %151, %150, %148 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 
2026-02-21T09:03:41.8737989Z %153 = tt.reshape %152 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8738258Z %154 = arith.sitofp %153 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8738632Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8738966Z %c2_i32_11 = arith.constant 2 : i32 2026-02-21T09:03:41.8739165Z %156 = arith.muli %c64_i32, %c2_i32_11 : i32 2026-02-21T09:03:41.8739373Z %157 = arith.addi %arg4, %156 : i32 2026-02-21T09:03:41.8739565Z %158 = tt.splat %157 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8739773Z %159 = arith.addi %158, %49 : tensor<64xi32> 2026-02-21T09:03:41.8739972Z %160 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:03:41.8740215Z %161 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8740469Z %162 = tt.splat %160 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8740680Z %163 = arith.addi %162, %161 : tensor<128xi32> 2026-02-21T09:03:41.8740941Z %164 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8741240Z %165 = arith.muli %164, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8741510Z %166 = tt.expand_dims %163 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8741843Z %167 = tt.broadcast %165 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8748123Z %168 = tt.broadcast %166 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8748373Z %169 = arith.addi %167, %168 : tensor<64x128xi32> 2026-02-21T09:03:41.8748618Z %170 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8748943Z %171 = tt.addptr %170, %169 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8749210Z %172 = tt.load %171 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8751689Z %173 = arith.extf %172 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8752023Z %174 = tt.expand_dims %159 {axis 
= 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8752318Z %175 = arith.muli %174, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8752625Z %176 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8752929Z %177 = tt.broadcast %175 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8755266Z %178 = tt.broadcast %176 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8755528Z %179 = arith.addi %177, %178 : tensor<64x32xi32> 2026-02-21T09:03:41.8755787Z %180 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8756091Z %181 = tt.addptr %180, %179 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8756423Z %182 = tt.load %181 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8756721Z %183 = arith.shli %182, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8756957Z %184 = arith.shrsi %183, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8757201Z %185 = arith.shrsi %182, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8757464Z %186 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8757785Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8758135Z %188 = tt.expand_dims %187 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8758489Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8758851Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8759146Z %191 = arith.cmpi eq, %188, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8759412Z %192 = tt.broadcast %191 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8759687Z %193 = tt.broadcast %189 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8759993Z %194 = arith.select %192, %193, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8760276Z %195 = 
arith.cmpi eq, %188, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8760534Z %196 = tt.broadcast %190 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8760816Z %197 = tt.broadcast %195 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8761098Z %198 = arith.select %197, %196, %194 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8761392Z %199 = tt.reshape %198 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8761697Z %200 = arith.sitofp %199 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8762067Z %201 = tt.dot %173, %200, %155, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8762402Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:41.8762600Z %202 = arith.muli %c64_i32, %c3_i32 : i32 2026-02-21T09:03:41.8762854Z %203 = arith.addi %arg4, %202 : i32 2026-02-21T09:03:41.8763053Z %204 = tt.splat %203 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8763268Z %205 = arith.addi %204, %49 : tensor<64xi32> 2026-02-21T09:03:41.8763473Z %206 = arith.muli %203, %c2_i32 : i32 2026-02-21T09:03:41.8763711Z %207 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8763975Z %208 = tt.splat %206 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8764183Z %209 = arith.addi %208, %207 : tensor<128xi32> 2026-02-21T09:03:41.8764443Z %210 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8764739Z %211 = arith.muli %210, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8765011Z %212 = tt.expand_dims %209 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8765345Z %213 = tt.broadcast %211 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8765619Z %214 = tt.broadcast %212 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8765903Z %215 = arith.addi %213, %214 : tensor<64x128xi32> 2026-02-21T09:03:41.8766151Z %216 = tt.splat %arg0 : !tt.ptr -> 
tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8766448Z %217 = tt.addptr %216, %215 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8766715Z %218 = tt.load %217 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8766970Z %219 = arith.extf %218 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8767269Z %220 = tt.expand_dims %205 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8767540Z %221 = arith.muli %220, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8767802Z %222 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8768090Z %223 = tt.broadcast %221 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8768364Z %224 = tt.broadcast %222 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8768618Z %225 = arith.addi %223, %224 : tensor<64x32xi32> 2026-02-21T09:03:41.8768852Z %226 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8769138Z %227 = tt.addptr %226, %225 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8769445Z %228 = tt.load %227 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8769720Z %229 = arith.shli %228, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8769937Z %230 = arith.shrsi %229, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8770164Z %231 = arith.shrsi %228, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8770418Z %232 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8770714Z %233 = tt.expand_dims %232 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8771038Z %234 = tt.expand_dims %233 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8771363Z %235 = tt.expand_dims %230 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8771738Z %236 = tt.expand_dims %231 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8772033Z %237 = arith.cmpi eq, %234, %cst_1 : 
tensor<1x2x1xi32> 2026-02-21T09:03:41.8772288Z %238 = tt.broadcast %237 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8772574Z %239 = tt.broadcast %235 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8772867Z %240 = arith.select %238, %239, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8773146Z %241 = arith.cmpi eq, %234, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8773397Z %242 = tt.broadcast %236 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8773676Z %243 = tt.broadcast %241 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8773990Z %244 = arith.select %243, %242, %240 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8774272Z %245 = tt.reshape %244 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8774549Z %246 = arith.sitofp %245 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8774909Z %247 = tt.dot %219, %246, %201, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8775239Z scf.yield %247 : tensor<64x32xf32> 2026-02-21T09:03:41.8775413Z } 2026-02-21T09:03:41.8775640Z %57 = arith.truncf %56 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:41.8775931Z %58 = tt.expand_dims %51 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8776198Z %59 = arith.muli %58, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8776482Z %60 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8776769Z %61 = tt.broadcast %59 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8777063Z %62 = tt.broadcast %60 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8777307Z %63 = arith.addi %61, %62 : tensor<64x32xi32> 2026-02-21T09:03:41.8777549Z %64 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8777836Z %65 = tt.addptr %64, %63 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8778093Z tt.store 
%65, %57 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8778302Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:41.8778493Z scf.for %arg3 = %10 to %2 step %c1_i32 : i32 { 2026-02-21T09:03:41.8778709Z %12 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:41.8778901Z %13 = arith.muli %12, %c4_i32 : i32 2026-02-21T09:03:41.8779090Z %14 = arith.subi %c1_i32, %13 : i32 2026-02-21T09:03:41.8779277Z %15 = arith.minsi %14, %c4_i32 : i32 2026-02-21T09:03:41.8779467Z %16 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:41.8779660Z %17 = arith.remsi %16, %15 : i32 2026-02-21T09:03:41.8779834Z %18 = arith.addi %13, %17 : i32 2026-02-21T09:03:41.8780014Z %19 = arith.divsi %16, %15 : i32 2026-02-21T09:03:41.8780188Z %20 = arith.muli %18, %c64_i32 : i32 2026-02-21T09:03:41.8780421Z %21 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:41.8780675Z %22 = tt.splat %20 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8780873Z %23 = arith.addi %22, %21 : tensor<64xi32> 2026-02-21T09:03:41.8781068Z %24 = arith.muli %19, %c32_i32 : i32 2026-02-21T09:03:41.8781292Z %25 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:41.8781561Z %26 = tt.splat %24 : i32 -> tensor<32xi32> 2026-02-21T09:03:41.8781758Z %27 = arith.addi %26, %25 : tensor<32xi32> 2026-02-21T09:03:41.8781957Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:03:41.8782275Z %28 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:41.8782621Z %38 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8782835Z %39 = arith.addi %38, %21 : tensor<64xi32> 2026-02-21T09:03:41.8783030Z %40 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:41.8783274Z %41 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8783529Z %42 = tt.splat %40 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8783738Z %43 = arith.addi %42, %41 : tensor<128xi32> 
2026-02-21T09:03:41.8783989Z %44 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8784272Z %45 = arith.muli %44, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8784539Z %46 = tt.expand_dims %43 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8784841Z %47 = tt.broadcast %45 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8785150Z %48 = tt.broadcast %46 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8785391Z %49 = arith.addi %47, %48 : tensor<64x128xi32> 2026-02-21T09:03:41.8785642Z %50 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8785936Z %51 = tt.addptr %50, %49 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8786198Z %52 = tt.load %51 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8786442Z %53 = arith.extf %52 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8786753Z %54 = tt.expand_dims %39 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8787024Z %55 = arith.muli %54, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8787278Z %56 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8787602Z %57 = tt.broadcast %55 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8787867Z %58 = tt.broadcast %56 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8788124Z %59 = arith.addi %57, %58 : tensor<64x32xi32> 2026-02-21T09:03:41.8788364Z %60 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8788633Z %61 = tt.addptr %60, %59 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8788934Z %62 = tt.load %61 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8789197Z %63 = arith.shli %62, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8789421Z %64 = arith.shrsi %63, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8789639Z %65 = arith.shrsi %62, %cst_2 : tensor<64x32xi8> 
2026-02-21T09:03:41.8789877Z %66 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8790172Z %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8790476Z %68 = tt.expand_dims %67 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8790801Z %69 = tt.expand_dims %64 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8791121Z %70 = tt.expand_dims %65 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8791398Z %71 = arith.cmpi eq, %68, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8791707Z %72 = tt.broadcast %71 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8791975Z %73 = tt.broadcast %69 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8792265Z %74 = arith.select %72, %73, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8792535Z %75 = arith.cmpi eq, %68, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8792787Z %76 = tt.broadcast %70 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8793062Z %77 = tt.broadcast %75 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8793332Z %78 = arith.select %77, %76, %74 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8793615Z %79 = tt.reshape %78 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8793875Z %80 = arith.sitofp %79 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8794242Z %81 = tt.dot %53, %80, %arg5, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8794596Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:03:41.8794804Z %82 = arith.muli %c64_i32, %c1_i32_8 : i32 2026-02-21T09:03:41.8795023Z %83 = arith.addi %arg4, %82 : i32 2026-02-21T09:03:41.8795223Z %84 = tt.splat %83 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8795441Z %85 = arith.addi %84, %21 : tensor<64xi32> 2026-02-21T09:03:41.8795645Z %86 = 
arith.muli %83, %c2_i32 : i32 2026-02-21T09:03:41.8795900Z %87 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8796193Z %88 = tt.splat %86 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8796414Z %89 = arith.addi %88, %87 : tensor<128xi32> 2026-02-21T09:03:41.8796687Z %90 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8796966Z %91 = arith.muli %90, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8797245Z %92 = tt.expand_dims %89 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8797556Z %93 = tt.broadcast %91 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8797844Z %94 = tt.broadcast %92 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8798146Z %95 = arith.addi %93, %94 : tensor<64x128xi32> 2026-02-21T09:03:41.8798408Z %96 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8798752Z %97 = tt.addptr %96, %95 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8799029Z %98 = tt.load %97 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8799287Z %99 = arith.extf %98 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8799614Z %100 = tt.expand_dims %85 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8799913Z %101 = arith.muli %100, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8800196Z %102 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8800509Z %103 = tt.broadcast %101 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8800797Z %104 = tt.broadcast %102 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8801052Z %105 = arith.addi %103, %104 : tensor<64x32xi32> 2026-02-21T09:03:41.8801316Z %106 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8801643Z %107 = tt.addptr %106, %105 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8801984Z %108 = tt.load %107 evictionPolicy 
= evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8802264Z %109 = arith.shli %108, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8802492Z %110 = arith.shrsi %109, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8802722Z %111 = arith.shrsi %108, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8802974Z %112 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8803283Z %113 = tt.expand_dims %112 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8803608Z %114 = tt.expand_dims %113 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8803936Z %115 = tt.expand_dims %110 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8804275Z %116 = tt.expand_dims %111 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8804570Z %117 = arith.cmpi eq, %114, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8804834Z %118 = tt.broadcast %117 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8805111Z %119 = tt.broadcast %115 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8805407Z %120 = arith.select %118, %119, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8805689Z %121 = arith.cmpi eq, %114, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8805942Z %122 = tt.broadcast %116 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8806220Z %123 = tt.broadcast %121 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8806501Z %124 = arith.select %123, %122, %120 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8806794Z %125 = tt.reshape %124 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8807068Z %126 = arith.sitofp %125 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8807434Z %127 = tt.dot %99, %126, %81, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8807798Z %c2_i32_9 = arith.constant 2 : i32 
2026-02-21T09:03:41.8807996Z %128 = arith.muli %c64_i32, %c2_i32_9 : i32 2026-02-21T09:03:41.8808203Z %129 = arith.addi %arg4, %128 : i32 2026-02-21T09:03:41.8808402Z %130 = tt.splat %129 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8808619Z %131 = arith.addi %130, %21 : tensor<64xi32> 2026-02-21T09:03:41.8808820Z %132 = arith.muli %129, %c2_i32 : i32 2026-02-21T09:03:41.8809056Z %133 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8809349Z %134 = tt.splat %132 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8809559Z %135 = arith.addi %134, %133 : tensor<128xi32> 2026-02-21T09:03:41.8809822Z %136 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8810117Z %137 = arith.muli %136, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8810394Z %138 = tt.expand_dims %135 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8810721Z %139 = tt.broadcast %137 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8810993Z %140 = tt.broadcast %138 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8811248Z %141 = arith.addi %139, %140 : tensor<64x128xi32> 2026-02-21T09:03:41.8811500Z %142 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8811848Z %143 = tt.addptr %142, %141 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8812124Z %144 = tt.load %143 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8812380Z %145 = arith.extf %144 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8812684Z %146 = tt.expand_dims %131 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8812962Z %147 = arith.muli %146, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8813231Z %148 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8813524Z %149 = tt.broadcast %147 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8813799Z %150 = tt.broadcast %148 : 
tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8814051Z %151 = arith.addi %149, %150 : tensor<64x32xi32> 2026-02-21T09:03:41.8814291Z %152 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8814579Z %153 = tt.addptr %152, %151 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8814885Z %154 = tt.load %153 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8815164Z %155 = arith.shli %154, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8815384Z %156 = arith.shrsi %155, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8815614Z %157 = arith.shrsi %154, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8815873Z %158 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8816171Z %159 = tt.expand_dims %158 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8823223Z %160 = tt.expand_dims %159 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8823552Z %161 = tt.expand_dims %156 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8823893Z %162 = tt.expand_dims %157 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8824186Z %163 = arith.cmpi eq, %160, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8828871Z %164 = tt.broadcast %163 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8836015Z %165 = tt.broadcast %161 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8836303Z %166 = arith.select %164, %165, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8836585Z %167 = arith.cmpi eq, %160, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8836892Z %168 = tt.broadcast %162 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8837176Z %169 = tt.broadcast %167 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8837467Z %170 = arith.select %169, %168, %166 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8837749Z %171 = 
tt.reshape %170 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8838029Z %172 = arith.sitofp %171 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8838401Z %173 = tt.dot %145, %172, %127, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8838843Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:41.8839040Z %174 = arith.muli %c64_i32, %c3_i32 : i32 2026-02-21T09:03:41.8839244Z %175 = arith.addi %arg4, %174 : i32 2026-02-21T09:03:41.8839529Z %176 = tt.splat %175 : i32 -> tensor<64xi32> 2026-02-21T09:03:41.8839749Z %177 = arith.addi %176, %21 : tensor<64xi32> 2026-02-21T09:03:41.8839960Z %178 = arith.muli %175, %c2_i32 : i32 2026-02-21T09:03:41.8840247Z %179 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:03:41.8840517Z %180 = tt.splat %178 : i32 -> tensor<128xi32> 2026-02-21T09:03:41.8840745Z %181 = arith.addi %180, %179 : tensor<128xi32> 2026-02-21T09:03:41.8841014Z %182 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8841305Z %183 = arith.muli %182, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:41.8841614Z %184 = tt.expand_dims %181 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:03:41.8841935Z %185 = tt.broadcast %183 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8842229Z %186 = tt.broadcast %184 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:03:41.8842496Z %187 = arith.addi %185, %186 : tensor<64x128xi32> 2026-02-21T09:03:41.8842763Z %188 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8843073Z %189 = tt.addptr %188, %187 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:03:41.8843363Z %190 = tt.load %189 : tensor<64x128x!tt.ptr> 2026-02-21T09:03:41.8843626Z %191 = arith.extf %190 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:03:41.8843937Z %192 = tt.expand_dims %177 {axis = 1 : i32} : tensor<64xi32> -> 
tensor<64x1xi32> 2026-02-21T09:03:41.8844233Z %193 = arith.muli %192, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8844508Z %194 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8844827Z %195 = tt.broadcast %193 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8845104Z %196 = tt.broadcast %194 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8845372Z %197 = arith.addi %195, %196 : tensor<64x32xi32> 2026-02-21T09:03:41.8845628Z %198 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8845924Z %199 = tt.addptr %198, %197 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8846255Z %200 = tt.load %199 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8846539Z %201 = arith.shli %200, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8846778Z %202 = arith.shrsi %201, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8847006Z %203 = arith.shrsi %200, %cst_2 : tensor<64x32xi8> 2026-02-21T09:03:41.8847274Z %204 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:41.8847589Z %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:41.8847923Z %206 = tt.expand_dims %205 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:41.8848273Z %207 = tt.expand_dims %202 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8848653Z %208 = tt.expand_dims %203 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:03:41.8848962Z %209 = arith.cmpi eq, %206, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:41.8849234Z %210 = tt.broadcast %209 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8849508Z %211 = tt.broadcast %207 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8849807Z %212 = arith.select %210, %211, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8850104Z %213 = arith.cmpi eq, %206, %cst_0 : 
tensor<1x2x1xi32> 2026-02-21T09:03:41.8850361Z %214 = tt.broadcast %208 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:03:41.8850627Z %215 = tt.broadcast %213 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:03:41.8850937Z %216 = arith.select %215, %214, %212 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:03:41.8851233Z %217 = tt.reshape %216 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:03:41.8851530Z %218 = arith.sitofp %217 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:03:41.8851921Z %219 = tt.dot %191, %218, %173, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:41.8852244Z scf.yield %219 : tensor<64x32xf32> 2026-02-21T09:03:41.8852424Z } 2026-02-21T09:03:41.8852606Z %29 = arith.truncf %28 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:41.8852890Z %30 = tt.expand_dims %23 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:41.8853157Z %31 = arith.muli %30, %cst_3 : tensor<64x1xi32> 2026-02-21T09:03:41.8853405Z %32 = tt.expand_dims %27 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:41.8853693Z %33 = tt.broadcast %31 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8853948Z %34 = tt.broadcast %32 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:41.8854188Z %35 = arith.addi %33, %34 : tensor<64x32xi32> 2026-02-21T09:03:41.8854436Z %36 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8854713Z %37 = tt.addptr %36, %35 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:41.8854981Z tt.store %37, %29 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:41.8855181Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:41.8855350Z tt.return 2026-02-21T09:03:41.8855480Z } 2026-02-21T09:03:41.8855608Z } 2026-02-21T09:03:41.8855678Z 2026-02-21T09:03:41.8855731Z {-# 2026-02-21T09:03:41.8857614Z external_resources: { 2026-02-21T09:03:41.8857814Z mlir_reproducer: { 2026-02-21T09:03:41.8863521Z 
pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:03:41.8869391Z disable_threading: false, 2026-02-21T09:03:41.8869562Z verify_each: true 2026-02-21T09:03:41.8869768Z } 2026-02-21T09:03:41.8869890Z } 2026-02-21T09:03:41.8870010Z #-} 2026-02-21T09:03:41.8870431Z 
/tmp/torchinductor_root/v3/cv36cfq6jhwmmmsuq6hh3imbtidotcd6eebzvkpqxvdyt2uxe2kf.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:03:41.8871511Z /tmp/torchinductor_root/v3/cv36cfq6jhwmmmsuq6hh3imbtidotcd6eebzvkpqxvdyt2uxe2kf.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:03:41.8872399Z [187s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:03:41.8873555Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:03:41.8874610Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:03:41.8874874Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:03:42.2636973Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:03:42.2638528Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:42.2638868Z ^ 2026-02-21T09:03:42.2644300Z /tmp/torchinductor_root/3a/c3axv66jqi3ugqsctzgru3fjur6zipobvyobuwn4ft6azye3v7h4.py:93:40: note: called from 2026-02-21T09:03:42.2648860Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:03:42.2652809Z ^ 2026-02-21T09:03:42.2655118Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:03:42.2655676Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:42.2655964Z ^ 2026-02-21T09:03:42.2669156Z module { 2026-02-21T09:03:42.2670889Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:42.2671593Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:03:42.2671838Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:03:42.2672026Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:03:42.2672220Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:03:42.2672404Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:42.2672588Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:42.2672809Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:03:42.2673049Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:03:42.2673286Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:03:42.2673514Z %cst_3 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:03:42.2673753Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:03:42.2674202Z %c2_i32 = arith.constant 2 : i32 
2026-02-21T09:03:42.2674431Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:03:42.2674661Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:03:42.2674850Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:03:42.2675050Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:03:42.2675249Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:03:42.2675435Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:03:42.2675625Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:03:42.2675864Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:03:42.2676175Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:03:42.2681768Z %1 = tt.get_program_id x : i32 2026-02-21T09:03:42.2685890Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:03:42.2687609Z %3 = arith.minsi %2, %c224_i32 : i32 2026-02-21T09:03:42.2687863Z %4 = arith.subi %3, %1 : i32 2026-02-21T09:03:42.2688057Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:03:42.2688302Z %5 = arith.subi %c1_i32, %c1_i32_6 : i32 2026-02-21T09:03:42.2688493Z %6 = arith.addi %4, %5 : i32 2026-02-21T09:03:42.2688675Z %7 = arith.divui %6, %c1_i32 : i32 2026-02-21T09:03:42.2688855Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:03:42.2689042Z %8 = arith.remsi %7, %c2_i32_7 : i32 2026-02-21T09:03:42.2689217Z %9 = arith.subi %7, %8 : i32 2026-02-21T09:03:42.2689394Z %10 = arith.muli %9, %c1_i32 : i32 2026-02-21T09:03:42.2689572Z %11 = arith.addi %1, %10 : i32 2026-02-21T09:03:42.2689764Z %12 = arith.muli %c1_i32, %c2_i32_7 : i32 2026-02-21T09:03:42.2689981Z scf.for %arg3 = %1 to %11 step %12 : i32 { 2026-02-21T09:03:42.2690185Z %13 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:42.2690381Z %14 = arith.muli %13, %c4_i32 : i32 2026-02-21T09:03:42.2690561Z %15 = arith.subi %c1_i32, %14 : i32 2026-02-21T09:03:42.2690750Z %16 = arith.minsi %15, %c4_i32 : i32 2026-02-21T09:03:42.2690935Z %17 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:42.2691124Z %18 = 
arith.remsi %17, %16 : i32 2026-02-21T09:03:42.2691298Z %19 = arith.addi %14, %18 : i32 2026-02-21T09:03:42.2691483Z %20 = arith.divsi %17, %16 : i32 2026-02-21T09:03:42.2691745Z %21 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:03:42.2691981Z %22 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:42.2692250Z %23 = tt.splat %21 : i32 -> tensor<64xi32> 2026-02-21T09:03:42.2692453Z %24 = arith.addi %23, %22 : tensor<64xi32> 2026-02-21T09:03:42.2692647Z %25 = arith.muli %20, %c32_i32 : i32 2026-02-21T09:03:42.2692870Z %26 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:42.2693120Z %27 = tt.splat %25 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2693323Z %28 = arith.addi %27, %26 : tensor<32xi32> 2026-02-21T09:03:42.2693516Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:03:42.2693849Z %29 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:42.2694172Z %67 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:42.2694376Z %68 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2694579Z %69 = arith.addi %68, %26 : tensor<32xi32> 2026-02-21T09:03:42.2694845Z %70 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2695128Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2695388Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2695687Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2695946Z %74 = tt.broadcast %72 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2696248Z %75 = arith.addi %73, %74 : tensor<64x32xi32> 2026-02-21T09:03:42.2696502Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2696810Z %77 = tt.addptr %76, %75 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2697087Z %78 = tt.load %77 : tensor<64x32x!tt.ptr> 
2026-02-21T09:03:42.2697331Z %79 = arith.extf %78 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2697701Z %80 = tt.descriptor_load %0[%arg4, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2698009Z %81 = arith.shli %80, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2698286Z %82 = arith.shrsi %81, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2698508Z %83 = arith.shrsi %80, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2698756Z %84 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2699074Z %85 = tt.expand_dims %84 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2699398Z %86 = tt.expand_dims %85 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2699753Z %87 = tt.expand_dims %82 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2700079Z %88 = tt.expand_dims %83 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2700367Z %89 = arith.cmpi eq, %86, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2700618Z %90 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2700895Z %91 = tt.broadcast %87 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2701180Z %92 = arith.select %90, %91, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2701452Z %93 = arith.cmpi eq, %86, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2701744Z %94 = tt.broadcast %88 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2702022Z %95 = tt.broadcast %93 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2702301Z %96 = arith.select %95, %94, %92 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2702577Z %97 = tt.reshape %96 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2702841Z %98 = arith.sitofp %97 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2703193Z %99 = tt.dot %79, %98, %arg5, inputPrecision = tf32 : 
tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2703526Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:03:42.2703729Z %100 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:03:42.2703934Z %101 = arith.addi %arg4, %100 : i32 2026-02-21T09:03:42.2704128Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:03:42.2704328Z %103 = tt.splat %102 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2704544Z %104 = arith.addi %103, %26 : tensor<32xi32> 2026-02-21T09:03:42.2704803Z %105 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2705087Z %106 = arith.muli %105, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2705357Z %107 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2705665Z %108 = tt.broadcast %106 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2705945Z %109 = tt.broadcast %107 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2706192Z %110 = arith.addi %108, %109 : tensor<64x32xi32> 2026-02-21T09:03:42.2706448Z %111 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2706751Z %112 = tt.addptr %111, %110 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2707030Z %113 = tt.load %112 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2707283Z %114 = arith.extf %113 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2707648Z %115 = tt.descriptor_load %0[%101, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2707965Z %116 = arith.shli %115, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2708189Z %117 = arith.shrsi %116, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2708419Z %118 = arith.shrsi %115, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2708668Z %119 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2708978Z %120 = tt.expand_dims %119 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2709311Z %121 = tt.expand_dims 
%120 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2709672Z %122 = tt.expand_dims %117 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2710024Z %123 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2710347Z %124 = arith.cmpi eq, %121, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2710618Z %125 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2710927Z %126 = tt.broadcast %122 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2711220Z %127 = arith.select %125, %126, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2711514Z %128 = arith.cmpi eq, %121, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2711810Z %129 = tt.broadcast %123 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2712093Z %130 = tt.broadcast %128 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2712374Z %131 = arith.select %130, %129, %127 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2712665Z %132 = tt.reshape %131 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2712935Z %133 = arith.sitofp %132 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2713288Z %134 = tt.dot %114, %133, %99, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2713614Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:03:42.2713815Z %135 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:03:42.2714019Z %136 = arith.addi %arg4, %135 : i32 2026-02-21T09:03:42.2714202Z %137 = arith.muli %136, %c2_i32 : i32 2026-02-21T09:03:42.2714411Z %138 = tt.splat %137 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2714622Z %139 = arith.addi %138, %26 : tensor<32xi32> 2026-02-21T09:03:42.2714883Z %140 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2715181Z %141 = arith.muli %140, %cst_4 : tensor<64x1xi32> 
2026-02-21T09:03:42.2715457Z %142 = tt.expand_dims %139 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2715770Z %143 = tt.broadcast %141 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2716051Z %144 = tt.broadcast %142 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2716312Z %145 = arith.addi %143, %144 : tensor<64x32xi32> 2026-02-21T09:03:42.2716580Z %146 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2716885Z %147 = tt.addptr %146, %145 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2717171Z %148 = tt.load %147 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2717421Z %149 = arith.extf %148 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2717760Z %150 = tt.descriptor_load %0[%136, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2718081Z %151 = arith.shli %150, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2718311Z %152 = arith.shrsi %151, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2718550Z %153 = arith.shrsi %150, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2718807Z %154 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2719150Z %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2719483Z %156 = tt.expand_dims %155 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2719834Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2720184Z %158 = tt.expand_dims %153 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2720478Z %159 = arith.cmpi eq, %156, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2720779Z %160 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2721070Z %161 = tt.broadcast %157 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2721386Z %162 = arith.select %160, %161, %cst : tensor<16x2x32xi1>, 
tensor<16x2x32xi8> 2026-02-21T09:03:42.2721743Z %163 = arith.cmpi eq, %156, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2722010Z %164 = tt.broadcast %158 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2722329Z %165 = tt.broadcast %163 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2722621Z %166 = arith.select %165, %164, %162 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2722930Z %167 = tt.reshape %166 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2723209Z %168 = arith.sitofp %167 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2723596Z %169 = tt.dot %149, %168, %134, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2723941Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:42.2724173Z %170 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:03:42.2724375Z %171 = arith.addi %arg4, %170 : i32 2026-02-21T09:03:42.2724558Z %172 = arith.muli %171, %c2_i32 : i32 2026-02-21T09:03:42.2724762Z %173 = tt.splat %172 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2724970Z %174 = arith.addi %173, %26 : tensor<32xi32> 2026-02-21T09:03:42.2725233Z %175 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2725512Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2725772Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2726076Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2726337Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2726586Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:03:42.2726830Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2727130Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2727406Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2727646Z 
%184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2727973Z %185 = tt.descriptor_load %0[%171, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2728274Z %186 = arith.shli %185, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2728500Z %187 = arith.shrsi %186, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2728731Z %188 = arith.shrsi %185, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2733518Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2733825Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2734155Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2734492Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2734825Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2735188Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2735443Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2735727Z %196 = tt.broadcast %192 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2736027Z %197 = arith.select %195, %196, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2736303Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2736582Z %199 = tt.broadcast %193 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2736861Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2737150Z %201 = arith.select %200, %199, %197 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2737492Z %202 = tt.reshape %201 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2737770Z %203 = arith.sitofp %202 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2738161Z %204 = tt.dot %184, %203, %169, inputPrecision = tf32 : 
tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2738496Z scf.yield %204 : tensor<64x32xf32> 2026-02-21T09:03:42.2738681Z } 2026-02-21T09:03:42.2738874Z %30 = arith.truncf %29 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:42.2739172Z %31 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2739437Z %32 = arith.muli %31, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:42.2739691Z %33 = tt.expand_dims %28 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2739969Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2740231Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2740468Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:03:42.2740714Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2741012Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2741270Z tt.store %38, %30 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2741482Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:03:42.2741718Z %39 = arith.muli %c1_i32, %c1_i32_9 : i32 2026-02-21T09:03:42.2741912Z %40 = arith.addi %arg3, %39 : i32 2026-02-21T09:03:42.2742103Z %41 = arith.divsi %40, %c896_i32 : i32 2026-02-21T09:03:42.2742286Z %42 = arith.muli %41, %c4_i32 : i32 2026-02-21T09:03:42.2742489Z %43 = arith.subi %c1_i32, %42 : i32 2026-02-21T09:03:42.2742667Z %44 = arith.minsi %43, %c4_i32 : i32 2026-02-21T09:03:42.2742856Z %45 = arith.remsi %40, %c896_i32 : i32 2026-02-21T09:03:42.2743036Z %46 = arith.remsi %45, %44 : i32 2026-02-21T09:03:42.2743220Z %47 = arith.addi %42, %46 : i32 2026-02-21T09:03:42.2743393Z %48 = arith.divsi %45, %44 : i32 2026-02-21T09:03:42.2743576Z %49 = arith.muli %47, %c64_i32 : i32 2026-02-21T09:03:42.2743811Z %50 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:42.2744053Z %51 = tt.splat %49 : i32 -> 
tensor<64xi32> 2026-02-21T09:03:42.2744320Z %52 = arith.addi %51, %50 : tensor<64xi32> 2026-02-21T09:03:42.2744507Z %53 = arith.muli %48, %c32_i32 : i32 2026-02-21T09:03:42.2744735Z %54 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:42.2745052Z %55 = tt.splat %53 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2745243Z %56 = arith.addi %55, %54 : tensor<32xi32> 2026-02-21T09:03:42.2745439Z %c64_i32_10 = arith.constant 64 : i32 2026-02-21T09:03:42.2745757Z %57 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_10 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:42.2746088Z %67 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:42.2746318Z %68 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2746534Z %69 = arith.addi %68, %54 : tensor<32xi32> 2026-02-21T09:03:42.2746794Z %70 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2747077Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2747344Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2747631Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2747903Z %74 = tt.broadcast %72 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2748144Z %75 = arith.addi %73, %74 : tensor<64x32xi32> 2026-02-21T09:03:42.2748394Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2748685Z %77 = tt.addptr %76, %75 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2748969Z %78 = tt.load %77 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2749215Z %79 = arith.extf %78 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2749565Z %80 = tt.descriptor_load %0[%arg4, %53] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2749875Z %81 = arith.shli %80, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2750090Z %82 = arith.shrsi %81, %cst_3 : tensor<16x32xi8> 
2026-02-21T09:03:42.2750309Z %83 = arith.shrsi %80, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2750560Z %84 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2750849Z %85 = tt.expand_dims %84 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2751163Z %86 = tt.expand_dims %85 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2751732Z %87 = tt.expand_dims %82 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2752055Z %88 = tt.expand_dims %83 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2752335Z %89 = arith.cmpi eq, %86, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2752579Z %90 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2752848Z %91 = tt.broadcast %87 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2753125Z %92 = arith.select %90, %91, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2753397Z %93 = arith.cmpi eq, %86, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2753638Z %94 = tt.broadcast %88 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2753909Z %95 = tt.broadcast %93 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2754186Z %96 = arith.select %95, %94, %92 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2754454Z %97 = tt.reshape %96 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2756012Z %98 = arith.sitofp %97 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2756362Z %99 = tt.dot %79, %98, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2756684Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:03:42.2756883Z %100 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:03:42.2757084Z %101 = arith.addi %arg4, %100 : i32 2026-02-21T09:03:42.2757276Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:03:42.2757518Z %103 = 
tt.splat %102 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2757729Z %104 = arith.addi %103, %54 : tensor<32xi32> 2026-02-21T09:03:42.2757983Z %105 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2758267Z %106 = arith.muli %105, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2760936Z %107 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2761300Z %108 = tt.broadcast %106 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2761626Z %109 = tt.broadcast %107 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2761880Z %110 = arith.addi %108, %109 : tensor<64x32xi32> 2026-02-21T09:03:42.2762142Z %111 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2765748Z %112 = tt.addptr %111, %110 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2766045Z %113 = tt.load %112 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2766305Z %114 = arith.extf %113 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2766645Z %115 = tt.descriptor_load %0[%101, %53] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2766970Z %116 = arith.shli %115, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2767200Z %117 = arith.shrsi %116, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2767477Z %118 = arith.shrsi %115, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2767735Z %119 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2768081Z %120 = tt.expand_dims %119 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2768402Z %121 = tt.expand_dims %120 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2768725Z %122 = tt.expand_dims %117 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2769061Z %123 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2769351Z %124 = arith.cmpi eq, %121, %cst_2 : 
tensor<1x2x1xi32> 2026-02-21T09:03:42.2769617Z %125 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2769898Z %126 = tt.broadcast %122 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2770190Z %127 = arith.select %125, %126, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2770473Z %128 = arith.cmpi eq, %121, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2770733Z %129 = tt.broadcast %123 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2771011Z %130 = tt.broadcast %128 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2771292Z %131 = arith.select %130, %129, %127 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2771613Z %132 = tt.reshape %131 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2771886Z %133 = arith.sitofp %132 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2772244Z %134 = tt.dot %114, %133, %99, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2772571Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:03:42.2772767Z %135 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:03:42.2772971Z %136 = arith.addi %arg4, %135 : i32 2026-02-21T09:03:42.2773157Z %137 = arith.muli %136, %c2_i32 : i32 2026-02-21T09:03:42.2773359Z %138 = tt.splat %137 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2773574Z %139 = arith.addi %138, %54 : tensor<32xi32> 2026-02-21T09:03:42.2773829Z %140 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2774113Z %141 = arith.muli %140, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2774379Z %142 = tt.expand_dims %139 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2774730Z %143 = tt.broadcast %141 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2774996Z %144 = tt.broadcast %142 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2775247Z %145 = arith.addi %143, %144 : 
tensor<64x32xi32> 2026-02-21T09:03:42.2775497Z %146 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2775791Z %147 = tt.addptr %146, %145 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2776102Z %148 = tt.load %147 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2776345Z %149 = arith.extf %148 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2776678Z %150 = tt.descriptor_load %0[%136, %53] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2776982Z %151 = arith.shli %150, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2777201Z %152 = arith.shrsi %151, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2777433Z %153 = arith.shrsi %150, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2777680Z %154 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2777985Z %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2778305Z %156 = tt.expand_dims %155 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2778672Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2779032Z %158 = tt.expand_dims %153 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2779317Z %159 = arith.cmpi eq, %156, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2779577Z %160 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2779851Z %161 = tt.broadcast %157 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2780143Z %162 = arith.select %160, %161, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2780420Z %163 = arith.cmpi eq, %156, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2780667Z %164 = tt.broadcast %158 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2780944Z %165 = tt.broadcast %163 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2781227Z %166 = arith.select %165, %164, %162 : 
tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2781521Z %167 = tt.reshape %166 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2781826Z %168 = arith.sitofp %167 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2782187Z %169 = tt.dot %149, %168, %134, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2782517Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:42.2782707Z %170 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:03:42.2782911Z %171 = arith.addi %arg4, %170 : i32 2026-02-21T09:03:42.2783095Z %172 = arith.muli %171, %c2_i32 : i32 2026-02-21T09:03:42.2783296Z %173 = tt.splat %172 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2783498Z %174 = arith.addi %173, %54 : tensor<32xi32> 2026-02-21T09:03:42.2783760Z %175 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2784042Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2784305Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2784608Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2784871Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2785116Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:03:42.2785362Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2785702Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2785972Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2786214Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2786542Z %185 = tt.descriptor_load %0[%171, %53] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2786841Z %186 = arith.shli %185, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2787106Z %187 = arith.shrsi %186, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2787336Z %188 
= arith.shrsi %185, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2787583Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2787889Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2788199Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2788529Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2788852Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2789143Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2789400Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2789704Z %196 = tt.broadcast %192 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2790001Z %197 = arith.select %195, %196, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2790300Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2790558Z %199 = tt.broadcast %193 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2790834Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2791114Z %201 = arith.select %200, %199, %197 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2791402Z %202 = tt.reshape %201 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2791705Z %203 = arith.sitofp %202 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2792066Z %204 = tt.dot %184, %203, %169, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2792384Z scf.yield %204 : tensor<64x32xf32> 2026-02-21T09:03:42.2792566Z } 2026-02-21T09:03:42.2792747Z %58 = arith.truncf %57 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:42.2793029Z %59 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:03:42.2793298Z %60 = arith.muli %59, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:42.2793547Z %61 = tt.expand_dims %56 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2793837Z %62 = tt.broadcast %60 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2794092Z %63 = tt.broadcast %61 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2794333Z %64 = arith.addi %62, %63 : tensor<64x32xi32> 2026-02-21T09:03:42.2794581Z %65 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2794856Z %66 = tt.addptr %65, %64 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2795123Z tt.store %66, %58 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2795322Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:42.2795517Z scf.for %arg3 = %11 to %3 step %c1_i32 : i32 { 2026-02-21T09:03:42.2795728Z %13 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:42.2795924Z %14 = arith.muli %13, %c4_i32 : i32 2026-02-21T09:03:42.2796109Z %15 = arith.subi %c1_i32, %14 : i32 2026-02-21T09:03:42.2802786Z %16 = arith.minsi %15, %c4_i32 : i32 2026-02-21T09:03:42.2806163Z %17 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:42.2806359Z %18 = arith.remsi %17, %16 : i32 2026-02-21T09:03:42.2806585Z %19 = arith.addi %14, %18 : i32 2026-02-21T09:03:42.2806765Z %20 = arith.divsi %17, %16 : i32 2026-02-21T09:03:42.2806955Z %21 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:03:42.2807189Z %22 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:42.2807456Z %23 = tt.splat %21 : i32 -> tensor<64xi32> 2026-02-21T09:03:42.2807669Z %24 = arith.addi %23, %22 : tensor<64xi32> 2026-02-21T09:03:42.2807897Z %25 = arith.muli %20, %c32_i32 : i32 2026-02-21T09:03:42.2808140Z %26 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:42.2808389Z %27 = tt.splat %25 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2808596Z %28 = arith.addi %27, %26 : tensor<32xi32> 2026-02-21T09:03:42.2808792Z 
%c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:03:42.2809129Z %29 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:42.2809477Z %39 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:42.2809692Z %40 = tt.splat %39 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2809907Z %41 = arith.addi %40, %26 : tensor<32xi32> 2026-02-21T09:03:42.2810167Z %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2810481Z %43 = arith.muli %42, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2810747Z %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2811072Z %45 = tt.broadcast %43 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2811348Z %46 = tt.broadcast %44 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2811627Z %47 = arith.addi %45, %46 : tensor<64x32xi32> 2026-02-21T09:03:42.2811884Z %48 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2812180Z %49 = tt.addptr %48, %47 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2812456Z %50 = tt.load %49 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2812700Z %51 = arith.extf %50 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2813038Z %52 = tt.descriptor_load %0[%arg4, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2813352Z %53 = arith.shli %52, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2813579Z %54 = arith.shrsi %53, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2813807Z %55 = arith.shrsi %52, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2814057Z %56 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2814359Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2814681Z %58 = tt.expand_dims %57 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2815009Z %59 = tt.expand_dims %54 
{axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2815341Z %60 = tt.expand_dims %55 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2815629Z %61 = arith.cmpi eq, %58, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2815892Z %62 = tt.broadcast %61 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2816164Z %63 = tt.broadcast %59 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2816463Z %64 = arith.select %62, %63, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2816750Z %65 = arith.cmpi eq, %58, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2816991Z %66 = tt.broadcast %60 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2817259Z %67 = tt.broadcast %65 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2817523Z %68 = arith.select %67, %66, %64 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2817799Z %69 = tt.reshape %68 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2818088Z %70 = arith.sitofp %69 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2818432Z %71 = tt.dot %51, %70, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2818756Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:03:42.2818951Z %72 = arith.muli %c16_i32, %c1_i32_9 : i32 2026-02-21T09:03:42.2819174Z %73 = arith.addi %arg4, %72 : i32 2026-02-21T09:03:42.2819353Z %74 = arith.muli %73, %c2_i32 : i32 2026-02-21T09:03:42.2819551Z %75 = tt.splat %74 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2819746Z %76 = arith.addi %75, %26 : tensor<32xi32> 2026-02-21T09:03:42.2820001Z %77 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2820271Z %78 = arith.muli %77, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2820524Z %79 = tt.expand_dims %76 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2820816Z %80 = tt.broadcast %78 : 
tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2821066Z %81 = tt.broadcast %79 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2821308Z %82 = arith.addi %80, %81 : tensor<64x32xi32> 2026-02-21T09:03:42.2821578Z %83 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2821888Z %84 = tt.addptr %83, %82 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2822148Z %85 = tt.load %84 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2822403Z %86 = arith.extf %85 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2822721Z %87 = tt.descriptor_load %0[%73, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2823010Z %88 = arith.shli %87, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2823233Z %89 = arith.shrsi %88, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2823458Z %90 = arith.shrsi %87, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2823699Z %91 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2823992Z %92 = tt.expand_dims %91 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2824296Z %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2824614Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2824930Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2825205Z %96 = arith.cmpi eq, %93, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2825457Z %97 = tt.broadcast %96 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2825716Z %98 = tt.broadcast %94 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2826000Z %99 = arith.select %97, %98, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2826265Z %100 = arith.cmpi eq, %93, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2826521Z %101 = tt.broadcast %95 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 
2026-02-21T09:03:42.2826794Z %102 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2827072Z %103 = arith.select %102, %101, %99 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2827356Z %104 = tt.reshape %103 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2827618Z %105 = arith.sitofp %104 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2827974Z %106 = tt.dot %86, %105, %71, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2828290Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:03:42.2828495Z %107 = arith.muli %c16_i32, %c2_i32_10 : i32 2026-02-21T09:03:42.2828729Z %108 = arith.addi %arg4, %107 : i32 2026-02-21T09:03:42.2828914Z %109 = arith.muli %108, %c2_i32 : i32 2026-02-21T09:03:42.2829119Z %110 = tt.splat %109 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2829318Z %111 = arith.addi %110, %26 : tensor<32xi32> 2026-02-21T09:03:42.2829572Z %112 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2829842Z %113 = arith.muli %112, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2830137Z %114 = tt.expand_dims %111 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2830437Z %115 = tt.broadcast %113 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2830701Z %116 = tt.broadcast %114 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2830950Z %117 = arith.addi %115, %116 : tensor<64x32xi32> 2026-02-21T09:03:42.2831191Z %118 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2831489Z %119 = tt.addptr %118, %117 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2831782Z %120 = tt.load %119 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2832024Z %121 = arith.extf %120 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2832351Z %122 = tt.descriptor_load %0[%108, %25] : !tt.tensordesc> -> tensor<16x32xi8> 
2026-02-21T09:03:42.2832689Z %123 = arith.shli %122, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2832918Z %124 = arith.shrsi %123, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2833165Z %125 = arith.shrsi %122, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2833422Z %126 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2833723Z %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2834036Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2834370Z %129 = tt.expand_dims %124 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2834692Z %130 = tt.expand_dims %125 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2834986Z %131 = arith.cmpi eq, %128, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2835240Z %132 = tt.broadcast %131 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2835520Z %133 = tt.broadcast %129 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2835818Z %134 = arith.select %132, %133, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2836092Z %135 = arith.cmpi eq, %128, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2836347Z %136 = tt.broadcast %130 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2836615Z %137 = tt.broadcast %135 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2836900Z %138 = arith.select %137, %136, %134 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2837186Z %139 = tt.reshape %138 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2837446Z %140 = arith.sitofp %139 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2837807Z %141 = tt.dot %121, %140, %106, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2838127Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:42.2838327Z %142 = arith.muli 
%c16_i32, %c3_i32 : i32 2026-02-21T09:03:42.2838518Z %143 = arith.addi %arg4, %142 : i32 2026-02-21T09:03:42.2838710Z %144 = arith.muli %143, %c2_i32 : i32 2026-02-21T09:03:42.2838912Z %145 = tt.splat %144 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.2839113Z %146 = arith.addi %145, %26 : tensor<32xi32> 2026-02-21T09:03:42.2839371Z %147 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2839637Z %148 = arith.muli %147, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.2839958Z %149 = tt.expand_dims %146 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2840249Z %150 = tt.broadcast %148 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2840517Z %151 = tt.broadcast %149 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2840765Z %152 = arith.addi %150, %151 : tensor<64x32xi32> 2026-02-21T09:03:42.2841012Z %153 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2841343Z %154 = tt.addptr %153, %152 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2841649Z %155 = tt.load %154 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2841914Z %156 = arith.extf %155 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.2842233Z %157 = tt.descriptor_load %0[%143, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.2842528Z %158 = arith.shli %157, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2842752Z %159 = arith.shrsi %158, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2842971Z %160 = arith.shrsi %157, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.2843220Z %161 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.2843518Z %162 = tt.expand_dims %161 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.2843862Z %163 = tt.expand_dims %162 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.2844219Z %164 = tt.expand_dims %159 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 
2026-02-21T09:03:42.2844549Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.2844839Z %166 = arith.cmpi eq, %163, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2845089Z %167 = tt.broadcast %166 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2845366Z %168 = tt.broadcast %164 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2845663Z %169 = arith.select %167, %168, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2845940Z %170 = arith.cmpi eq, %163, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.2846199Z %171 = tt.broadcast %165 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.2846471Z %172 = tt.broadcast %170 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.2846761Z %173 = arith.select %172, %171, %169 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.2847054Z %174 = tt.reshape %173 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.2847319Z %175 = arith.sitofp %174 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.2847685Z %176 = tt.dot %156, %175, %141, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.2848002Z scf.yield %176 : tensor<64x32xf32> 2026-02-21T09:03:42.2848184Z } 2026-02-21T09:03:42.2848358Z %30 = arith.truncf %29 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:42.2848646Z %31 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.2848914Z %32 = arith.muli %31, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:42.2849166Z %33 = tt.expand_dims %28 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.2849455Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2849719Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.2849966Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:03:42.2850221Z %37 = tt.splat 
%arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2850518Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.2850791Z tt.store %38, %30 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.2851128Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:42.2851312Z tt.return 2026-02-21T09:03:42.2851450Z } 2026-02-21T09:03:42.2851620Z } 2026-02-21T09:03:42.2851696Z 2026-02-21T09:03:42.2851750Z {-# 2026-02-21T09:03:42.2851895Z external_resources: { 2026-02-21T09:03:42.2852061Z mlir_reproducer: { 2026-02-21T09:03:42.2856779Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, 
tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:03:42.2861483Z disable_threading: false,
2026-02-21T09:03:42.2861686Z verify_each: true
2026-02-21T09:03:42.2861842Z }
2026-02-21T09:03:42.2861963Z }
2026-02-21T09:03:42.2862087Z #-}
2026-02-21T09:03:42.2862512Z /tmp/torchinductor_root/3a/c3axv66jqi3ugqsctzgru3fjur6zipobvyobuwn4ft6azye3v7h4.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:03:42.2863538Z /tmp/torchinductor_root/3a/c3axv66jqi3ugqsctzgru3fjur6zipobvyobuwn4ft6azye3v7h4.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:03:42.2864372Z [187s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:03:42.2865507Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T09:03:42.2866531Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:03:42.2866796Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:03:42.6718577Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:03:42.6719188Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:42.6723045Z ^ 2026-02-21T09:03:42.6724852Z /tmp/torchinductor_root/3q/c3qzdda5d7klpscguemyusgzh75ogg3hkwsue6ohcqxbnqjjui2e.py:85:36: note: called from 2026-02-21T09:03:42.6725654Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:03:42.6728038Z ^ 2026-02-21T09:03:42.6733947Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:03:42.6736019Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:42.6736319Z ^ 2026-02-21T09:03:42.6736758Z module { 2026-02-21T09:03:42.6739852Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:42.6740482Z %cst = arith.constant dense<0> : tensor<32x2x64xi8> 2026-02-21T09:03:42.6740710Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:03:42.6740911Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:42.6741098Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:03:42.6741286Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:42.6741510Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:03:42.6741836Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:03:42.6742074Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:03:42.6742494Z %cst_3 = arith.constant dense<4> : tensor<32x64xi8> 2026-02-21T09:03:42.6742757Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:03:42.6742964Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:03:42.6743257Z %cst_5 = arith.constant dense<0.000000e+00> : 
tensor<64x64xf32> 2026-02-21T09:03:42.6743519Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:03:42.6743712Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:03:42.6743907Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:03:42.6744116Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:03:42.6744304Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:03:42.6744478Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:03:42.6744796Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:03:42.6745107Z %1 = tt.get_program_id x : i32 2026-02-21T09:03:42.6745289Z %2 = arith.divsi %1, %c448_i32 : i32 2026-02-21T09:03:42.6745472Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:03:42.6745658Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:03:42.6745827Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:03:42.6745995Z %6 = arith.remsi %1, %c448_i32 : i32 2026-02-21T09:03:42.6746169Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:03:42.6746334Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:03:42.6746499Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:03:42.6746665Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:03:42.6746890Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:42.6747134Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:03:42.6747328Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:03:42.6747508Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:03:42.6747685Z %15 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:03:42.6747870Z %16 = arith.addi %15, %11 : tensor<64xi32> 2026-02-21T09:03:42.6748048Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:03:42.6748358Z %17 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:03:42.6748673Z %27 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:03:42.6748864Z %28 = tt.splat %27 : i32 -> tensor<64xi32> 2026-02-21T09:03:42.6749054Z %29 = 
arith.addi %28, %11 : tensor<64xi32> 2026-02-21T09:03:42.6749296Z %30 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.6749559Z %31 = arith.muli %30, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.6749844Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:03:42.6750127Z %33 = tt.broadcast %31 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6750377Z %34 = tt.broadcast %32 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6750602Z %35 = arith.addi %33, %34 : tensor<64x64xi32> 2026-02-21T09:03:42.6750840Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6751152Z %37 = tt.addptr %36, %35 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:03:42.6751405Z %38 = tt.load %37 : tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6751681Z %39 = arith.extf %38 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:03:42.6751994Z %40 = tt.descriptor_load %0[%arg3, %14] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:03:42.6752281Z %41 = arith.shli %40, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6752485Z %42 = arith.shrsi %41, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6752689Z %43 = arith.shrsi %40, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6752918Z %44 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.6753201Z %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.6753526Z %46 = tt.expand_dims %45 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.6753835Z %47 = tt.expand_dims %42 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:03:42.6754163Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:03:42.6754466Z %49 = arith.cmpi eq, %46, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.6754702Z %50 = tt.broadcast %49 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 
2026-02-21T09:03:42.6754964Z %51 = tt.broadcast %47 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:03:42.6755233Z %52 = arith.select %50, %51, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:03:42.6755492Z %53 = arith.cmpi eq, %46, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.6755727Z %54 = tt.broadcast %48 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:03:42.6755981Z %55 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:03:42.6756245Z %56 = arith.select %55, %54, %52 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:03:42.6756505Z %57 = tt.reshape %56 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:03:42.6756756Z %58 = arith.sitofp %57 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:03:42.6757093Z %59 = tt.dot %39, %58, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:03:42.6757408Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:03:42.6757598Z %60 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:03:42.6757782Z %61 = arith.addi %arg3, %60 : i32 2026-02-21T09:03:42.6757960Z %62 = arith.muli %61, %c2_i32 : i32 2026-02-21T09:03:42.6758144Z %63 = tt.splat %62 : i32 -> tensor<64xi32> 2026-02-21T09:03:42.6758340Z %64 = arith.addi %63, %11 : tensor<64xi32> 2026-02-21T09:03:42.6758582Z %65 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.6758839Z %66 = arith.muli %65, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.6759090Z %67 = tt.expand_dims %64 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:03:42.6759364Z %68 = tt.broadcast %66 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6759614Z %69 = tt.broadcast %67 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6759832Z %70 = arith.addi %68, %69 : tensor<64x64xi32> 2026-02-21T09:03:42.6760065Z %71 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6760339Z %72 = tt.addptr %71, %70 : 
tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:03:42.6760617Z %73 = tt.load %72 : tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6760846Z %74 = arith.extf %73 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:03:42.6761159Z %75 = tt.descriptor_load %0[%61, %14] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:03:42.6761463Z %76 = arith.shli %75, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6761726Z %77 = arith.shrsi %76, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6761978Z %78 = arith.shrsi %75, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6762224Z %79 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.6762520Z %80 = tt.expand_dims %79 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.6762833Z %81 = tt.expand_dims %80 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.6763156Z %82 = tt.expand_dims %77 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:03:42.6763487Z %83 = tt.expand_dims %78 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:03:42.6763772Z %84 = arith.cmpi eq, %81, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.6764041Z %85 = tt.broadcast %84 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:03:42.6764321Z %86 = tt.broadcast %82 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:03:42.6764638Z %87 = arith.select %85, %86, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:03:42.6764923Z %88 = arith.cmpi eq, %81, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.6765208Z %89 = tt.broadcast %83 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:03:42.6765479Z %90 = tt.broadcast %88 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:03:42.6765765Z %91 = arith.select %90, %89, %87 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:03:42.6766044Z %92 = tt.reshape %91 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:03:42.6766317Z %93 = arith.sitofp %92 : tensor<64x64xi8> to 
tensor<64x64xf32> 2026-02-21T09:03:42.6766667Z %94 = tt.dot %74, %93, %59, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:03:42.6766997Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:03:42.6767203Z %95 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:03:42.6767403Z %96 = arith.addi %arg3, %95 : i32 2026-02-21T09:03:42.6767600Z %97 = arith.muli %96, %c2_i32 : i32 2026-02-21T09:03:42.6767795Z %98 = tt.splat %97 : i32 -> tensor<64xi32> 2026-02-21T09:03:42.6768007Z %99 = arith.addi %98, %11 : tensor<64xi32> 2026-02-21T09:03:42.6768266Z %100 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.6768559Z %101 = arith.muli %100, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.6768834Z %102 = tt.expand_dims %99 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:03:42.6769137Z %103 = tt.broadcast %101 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6769420Z %104 = tt.broadcast %102 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6769668Z %105 = arith.addi %103, %104 : tensor<64x64xi32> 2026-02-21T09:03:42.6769929Z %106 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6770232Z %107 = tt.addptr %106, %105 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:03:42.6770514Z %108 = tt.load %107 : tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6770771Z %109 = arith.extf %108 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:03:42.6771106Z %110 = tt.descriptor_load %0[%96, %14] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:03:42.6771409Z %111 = arith.shli %110, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6771670Z %112 = arith.shrsi %111, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6771895Z %113 = arith.shrsi %110, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6772169Z %114 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.6772468Z %115 = tt.expand_dims %114 {axis = 0 : i32} 
: tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.6772792Z %116 = tt.expand_dims %115 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.6773118Z %117 = tt.expand_dims %112 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:03:42.6773494Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:03:42.6773784Z %119 = arith.cmpi eq, %116, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.6774044Z %120 = tt.broadcast %119 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:03:42.6774318Z %121 = tt.broadcast %117 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:03:42.6774610Z %122 = arith.select %120, %121, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:03:42.6774895Z %123 = arith.cmpi eq, %116, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.6775145Z %124 = tt.broadcast %118 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:03:42.6775422Z %125 = tt.broadcast %123 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:03:42.6775700Z %126 = arith.select %125, %124, %122 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:03:42.6776016Z %127 = tt.reshape %126 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:03:42.6776289Z %128 = arith.sitofp %127 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:03:42.6776677Z %129 = tt.dot %109, %128, %94, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:03:42.6777009Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:42.6777201Z %130 = arith.muli %c32_i32, %c3_i32 : i32 2026-02-21T09:03:42.6777405Z %131 = arith.addi %arg3, %130 : i32 2026-02-21T09:03:42.6777601Z %132 = arith.muli %131, %c2_i32 : i32 2026-02-21T09:03:42.6777799Z %133 = tt.splat %132 : i32 -> tensor<64xi32> 2026-02-21T09:03:42.6778012Z %134 = arith.addi %133, %11 : tensor<64xi32> 2026-02-21T09:03:42.6778266Z %135 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:03:42.6778539Z %136 = arith.muli %135, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.6778797Z %137 = tt.expand_dims %134 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:03:42.6779100Z %138 = tt.broadcast %136 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6779366Z %139 = tt.broadcast %137 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6779605Z %140 = arith.addi %138, %139 : tensor<64x64xi32> 2026-02-21T09:03:42.6779859Z %141 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6780149Z %142 = tt.addptr %141, %140 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:03:42.6780418Z %143 = tt.load %142 : tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6780657Z %144 = arith.extf %143 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:03:42.6780979Z %145 = tt.descriptor_load %0[%131, %14] : !tt.tensordesc> -> tensor<32x64xi8> 2026-02-21T09:03:42.6781285Z %146 = arith.shli %145, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6781502Z %147 = arith.shrsi %146, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6781779Z %148 = arith.shrsi %145, %cst_3 : tensor<32x64xi8> 2026-02-21T09:03:42.6782022Z %149 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.6782324Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.6782640Z %151 = tt.expand_dims %150 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.6782974Z %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:03:42.6783333Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:03:42.6783617Z %154 = arith.cmpi eq, %151, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.6783878Z %155 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:03:42.6784152Z %156 = tt.broadcast %152 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 
2026-02-21T09:03:42.6784448Z %157 = arith.select %155, %156, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:03:42.6784773Z %158 = arith.cmpi eq, %151, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.6785022Z %159 = tt.broadcast %153 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:03:42.6785296Z %160 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:03:42.6785573Z %161 = arith.select %160, %159, %157 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:03:42.6785860Z %162 = tt.reshape %161 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:03:42.6786122Z %163 = arith.sitofp %162 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:03:42.6786476Z %164 = tt.dot %144, %163, %129, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:03:42.6786797Z scf.yield %164 : tensor<64x64xf32> 2026-02-21T09:03:42.6786975Z } {tt.flatten} 2026-02-21T09:03:42.6787172Z %18 = arith.truncf %17 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:03:42.6787483Z %19 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.6787752Z %20 = arith.muli %19, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:42.6788044Z %21 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:03:42.6788336Z %22 = tt.broadcast %20 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6788596Z %23 = tt.broadcast %21 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:03:42.6788826Z %24 = arith.addi %22, %23 : tensor<64x64xi32> 2026-02-21T09:03:42.6789074Z %25 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6789349Z %26 = tt.addptr %25, %24 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:03:42.6789608Z tt.store %26, %18 : tensor<64x64x!tt.ptr> 2026-02-21T09:03:42.6789800Z tt.return 2026-02-21T09:03:42.6789938Z } 2026-02-21T09:03:42.6790069Z } 2026-02-21T09:03:42.6790140Z 2026-02-21T09:03:42.6790194Z {-# 
2026-02-21T09:03:42.6790340Z external_resources: { 2026-02-21T09:03:42.6790504Z mlir_reproducer: { 2026-02-21T09:03:42.6794864Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:03:42.6799443Z disable_threading: false, 2026-02-21T09:03:42.6799618Z verify_each: true 
2026-02-21T09:03:42.6799787Z } 2026-02-21T09:03:42.6799948Z } 2026-02-21T09:03:42.6800074Z #-} 2026-02-21T09:03:42.6800549Z /tmp/torchinductor_root/3q/c3qzdda5d7klpscguemyusgzh75ogg3hkwsue6ohcqxbnqjjui2e.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:03:42.6801702Z /tmp/torchinductor_root/3q/c3qzdda5d7klpscguemyusgzh75ogg3hkwsue6ohcqxbnqjjui2e.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:03:42.6802562Z [188s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:03:42.6803741Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:03:42.6805901Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:03:42.6806186Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:03:42.8345273Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:03:42.8346403Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:42.8346714Z ^ 2026-02-21T09:03:42.8347162Z /tmp/torchinductor_root/kj/ckjus5xq2metox3mlkhgmdrigwepoxqps2w3una3w27u2lb2jlx3.py:85:36: note: called from 2026-02-21T09:03:42.8353597Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:03:42.8358447Z ^ 2026-02-21T09:03:42.8363541Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:03:42.8364146Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:42.8364469Z ^ 2026-02-21T09:03:42.8364664Z module { 2026-02-21T09:03:42.8365160Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:42.8365762Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:03:42.8365982Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:03:42.8366178Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:42.8366359Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:03:42.8370366Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:42.8377280Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:03:42.8381379Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:03:42.8381798Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:03:42.8382069Z %cst_3 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:03:42.8382347Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:03:42.8382600Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:03:42.8382838Z %cst_5 = arith.constant dense<0.000000e+00> : 
tensor<64x32xf32> 2026-02-21T09:03:42.8383084Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:03:42.8383273Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:03:42.8383656Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:03:42.8383855Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:03:42.8384065Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:03:42.8384262Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:03:42.8384466Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:03:42.8384806Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:03:42.8385184Z %1 = tt.get_program_id x : i32 2026-02-21T09:03:42.8385376Z %2 = arith.divsi %1, %c896_i32 : i32 2026-02-21T09:03:42.8385562Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:03:42.8385766Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:03:42.8385945Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:03:42.8386139Z %6 = arith.remsi %1, %c896_i32 : i32 2026-02-21T09:03:42.8386324Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:03:42.8386511Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:03:42.8386698Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:03:42.8386877Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:03:42.8387126Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:42.8387389Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:03:42.8387602Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:03:42.8387840Z %14 = arith.muli %9, %c32_i32 : i32 2026-02-21T09:03:42.8388084Z %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:42.8388377Z %16 = tt.splat %14 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.8388577Z %17 = arith.addi %16, %15 : tensor<32xi32> 2026-02-21T09:03:42.8388782Z %c64_i32_6 = arith.constant 64 : i32 2026-02-21T09:03:42.8389117Z %18 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c64_i32_6 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 
2026-02-21T09:03:42.8389459Z %28 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:03:42.8389662Z %29 = tt.splat %28 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.8389874Z %30 = arith.addi %29, %15 : tensor<32xi32> 2026-02-21T09:03:42.8390151Z %31 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.8390439Z %32 = arith.muli %31, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.8390712Z %33 = tt.expand_dims %30 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.8391014Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8391291Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8391576Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:03:42.8391847Z %37 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.8392152Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.8392421Z %39 = tt.load %38 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.8392678Z %40 = arith.extf %39 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.8393009Z %41 = tt.descriptor_load %0[%arg3, %14] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.8393332Z %42 = arith.shli %41, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8393558Z %43 = arith.shrsi %42, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8393789Z %44 = arith.shrsi %41, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8394047Z %45 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.8394350Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.8394677Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.8395006Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.8395343Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 
2026-02-21T09:03:42.8395675Z %50 = arith.cmpi eq, %47, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.8395928Z %51 = tt.broadcast %50 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.8396206Z %52 = tt.broadcast %48 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.8396481Z %53 = arith.select %51, %52, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.8396749Z %54 = arith.cmpi eq, %47, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.8397017Z %55 = tt.broadcast %49 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.8397312Z %56 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.8397572Z %57 = arith.select %56, %55, %53 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.8397842Z %58 = tt.reshape %57 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.8398088Z %59 = arith.sitofp %58 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.8398430Z %60 = tt.dot %40, %59, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.8398741Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:03:42.8398941Z %61 = arith.muli %c16_i32, %c1_i32_7 : i32 2026-02-21T09:03:42.8399141Z %62 = arith.addi %arg3, %61 : i32 2026-02-21T09:03:42.8399320Z %63 = arith.muli %62, %c2_i32 : i32 2026-02-21T09:03:42.8399547Z %64 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.8399745Z %65 = arith.addi %64, %15 : tensor<32xi32> 2026-02-21T09:03:42.8400026Z %66 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.8400287Z %67 = arith.muli %66, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.8400546Z %68 = tt.expand_dims %65 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.8400837Z %69 = tt.broadcast %67 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8401090Z %70 = tt.broadcast %68 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8401324Z %71 = 
arith.addi %69, %70 : tensor<64x32xi32> 2026-02-21T09:03:42.8401602Z %72 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.8401888Z %73 = tt.addptr %72, %71 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.8402140Z %74 = tt.load %73 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.8402379Z %75 = arith.extf %74 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.8402691Z %76 = tt.descriptor_load %0[%62, %14] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.8402977Z %77 = arith.shli %76, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8403192Z %78 = arith.shrsi %77, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8403398Z %79 = arith.shrsi %76, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8403632Z %80 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.8403918Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.8404220Z %82 = tt.expand_dims %81 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.8404540Z %83 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.8404849Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.8405129Z %85 = arith.cmpi eq, %82, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.8405369Z %86 = tt.broadcast %85 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.8405635Z %87 = tt.broadcast %83 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.8405917Z %88 = arith.select %86, %87, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.8406180Z %89 = arith.cmpi eq, %82, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.8406429Z %90 = tt.broadcast %84 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.8406736Z %91 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.8407004Z %92 = arith.select %91, %90, %88 : tensor<16x2x32xi1>, 
tensor<16x2x32xi8> 2026-02-21T09:03:42.8407268Z %93 = tt.reshape %92 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.8407522Z %94 = arith.sitofp %93 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.8407863Z %95 = tt.dot %75, %94, %60, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.8408199Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:03:42.8408397Z %96 = arith.muli %c16_i32, %c2_i32_8 : i32 2026-02-21T09:03:42.8408589Z %97 = arith.addi %arg3, %96 : i32 2026-02-21T09:03:42.8408777Z %98 = arith.muli %97, %c2_i32 : i32 2026-02-21T09:03:42.8408967Z %99 = tt.splat %98 : i32 -> tensor<32xi32> 2026-02-21T09:03:42.8409182Z %100 = arith.addi %99, %15 : tensor<32xi32> 2026-02-21T09:03:42.8409451Z %101 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.8409720Z %102 = arith.muli %101, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.8409985Z %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.8410275Z %104 = tt.broadcast %102 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8410569Z %105 = tt.broadcast %103 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8410808Z %106 = arith.addi %104, %105 : tensor<64x32xi32> 2026-02-21T09:03:42.8411082Z %107 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.8411378Z %108 = tt.addptr %107, %106 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.8411677Z %109 = tt.load %108 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.8411927Z %110 = arith.extf %109 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.8412239Z %111 = tt.descriptor_load %0[%97, %14] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.8412549Z %112 = arith.shli %111, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8412765Z %113 = arith.shrsi %112, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8412989Z %114 = arith.shrsi %111, %cst_3 : 
tensor<16x32xi8> 2026-02-21T09:03:42.8413243Z %115 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.8413538Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.8413857Z %117 = tt.expand_dims %116 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.8414177Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.8414510Z %119 = tt.expand_dims %114 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.8414804Z %120 = arith.cmpi eq, %117, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.8415055Z %121 = tt.broadcast %120 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.8415330Z %122 = tt.broadcast %118 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.8415615Z %123 = arith.select %121, %122, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.8415895Z %124 = arith.cmpi eq, %117, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.8416145Z %125 = tt.broadcast %119 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.8416418Z %126 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.8416699Z %127 = arith.select %126, %125, %123 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.8416976Z %128 = tt.reshape %127 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.8417242Z %129 = arith.sitofp %128 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.8417589Z %130 = tt.dot %110, %129, %95, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.8417942Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:42.8418135Z %131 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:03:42.8418321Z %132 = arith.addi %arg3, %131 : i32 2026-02-21T09:03:42.8418509Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:03:42.8418704Z %134 = tt.splat %133 : i32 -> tensor<32xi32> 
2026-02-21T09:03:42.8418914Z %135 = arith.addi %134, %15 : tensor<32xi32> 2026-02-21T09:03:42.8419202Z %136 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.8420682Z %137 = arith.muli %136, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:42.8420954Z %138 = tt.expand_dims %135 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.8421252Z %139 = tt.broadcast %137 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8421511Z %140 = tt.broadcast %138 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8421800Z %141 = arith.addi %139, %140 : tensor<64x32xi32> 2026-02-21T09:03:42.8422042Z %142 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.8422336Z %143 = tt.addptr %142, %141 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.8422606Z %144 = tt.load %143 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.8422876Z %145 = arith.extf %144 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:42.8423207Z %146 = tt.descriptor_load %0[%132, %14] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:42.8423512Z %147 = arith.shli %146, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8423762Z %148 = arith.shrsi %147, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8423982Z %149 = arith.shrsi %146, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:42.8424236Z %150 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:42.8424536Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:42.8424847Z %152 = tt.expand_dims %151 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:42.8425176Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.8425503Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:42.8425800Z %155 = arith.cmpi eq, %152, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:42.8426063Z 
%156 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.8426339Z %157 = tt.broadcast %153 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.8426635Z %158 = arith.select %156, %157, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.8426910Z %159 = arith.cmpi eq, %152, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:42.8427169Z %160 = tt.broadcast %154 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:42.8427438Z %161 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:42.8427728Z %162 = arith.select %161, %160, %158 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:42.8428019Z %163 = tt.reshape %162 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:42.8428277Z %164 = arith.sitofp %163 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:42.8428639Z %165 = tt.dot %145, %164, %130, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:42.8428962Z scf.yield %165 : tensor<64x32xf32> 2026-02-21T09:03:42.8429139Z } 2026-02-21T09:03:42.8429320Z %19 = arith.truncf %18 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:42.8429612Z %20 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:42.8429889Z %21 = arith.muli %20, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:42.8430152Z %22 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:42.8430476Z %23 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8430738Z %24 = tt.broadcast %22 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:42.8430982Z %25 = arith.addi %23, %24 : tensor<64x32xi32> 2026-02-21T09:03:42.8431230Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:42.8431525Z %27 = tt.addptr %26, %25 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:42.8431861Z tt.store %27, %19 : tensor<64x32x!tt.ptr> 
2026-02-21T09:03:42.8432055Z tt.return 2026-02-21T09:03:42.8432190Z } 2026-02-21T09:03:42.8432377Z } 2026-02-21T09:03:42.8432456Z 2026-02-21T09:03:42.8432507Z {-# 2026-02-21T09:03:42.8432640Z external_resources: { 2026-02-21T09:03:42.8432805Z mlir_reproducer: { 2026-02-21T09:03:42.8437376Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:03:42.8441905Z disable_threading: false, 2026-02-21T09:03:42.8442080Z verify_each: true 2026-02-21T09:03:42.8442221Z } 2026-02-21T09:03:42.8442345Z } 2026-02-21T09:03:42.8442459Z #-} 2026-02-21T09:03:42.8442885Z /tmp/torchinductor_root/kj/ckjus5xq2metox3mlkhgmdrigwepoxqps2w3una3w27u2lb2jlx3.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:03:42.8443899Z /tmp/torchinductor_root/kj/ckjus5xq2metox3mlkhgmdrigwepoxqps2w3una3w27u2lb2jlx3.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:03:42.8444720Z [188s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:03:42.8445763Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:03:42.8446689Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:03:42.8446976Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:03:43.2834966Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:03:43.2836419Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:43.2836695Z ^ 2026-02-21T09:03:43.2837123Z /tmp/torchinductor_root/3a/c3axv66jqi3ugqsctzgru3fjur6zipobvyobuwn4ft6azye3v7h4.py:93:40: note: called from 2026-02-21T09:03:43.2837840Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:03:43.2841008Z ^ 2026-02-21T09:03:43.2841715Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:03:43.2842223Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:03:43.2842477Z ^ 2026-02-21T09:03:43.2843059Z module { 2026-02-21T09:03:43.2845729Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:43.2846473Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:03:43.2851473Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:03:43.2851886Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:03:43.2852080Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:03:43.2852277Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:43.2852455Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:43.2852675Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:03:43.2852921Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:03:43.2853162Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:03:43.2853398Z %cst_3 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:03:43.2853639Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:03:43.2853854Z %c2_i32 = arith.constant 2 : i32 
2026-02-21T09:03:43.2854091Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:03:43.2854322Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:03:43.2854506Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:03:43.2854692Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:03:43.2854870Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:03:43.2855064Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:03:43.2855245Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:03:43.2855434Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:03:43.2855747Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:03:43.2856081Z %1 = tt.get_program_id x : i32 2026-02-21T09:03:43.2856270Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:03:43.2856452Z %3 = arith.minsi %2, %c224_i32 : i32 2026-02-21T09:03:43.2856643Z %4 = arith.subi %3, %1 : i32 2026-02-21T09:03:43.2856814Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:03:43.2857003Z %5 = arith.subi %c1_i32, %c1_i32_6 : i32 2026-02-21T09:03:43.2857185Z %6 = arith.addi %4, %5 : i32 2026-02-21T09:03:43.2857362Z %7 = arith.divui %6, %c1_i32 : i32 2026-02-21T09:03:43.2857539Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:03:43.2857726Z %8 = arith.remsi %7, %c2_i32_7 : i32 2026-02-21T09:03:43.2857906Z %9 = arith.subi %7, %8 : i32 2026-02-21T09:03:43.2858073Z %10 = arith.muli %9, %c1_i32 : i32 2026-02-21T09:03:43.2858258Z %11 = arith.addi %1, %10 : i32 2026-02-21T09:03:43.2858436Z %12 = arith.muli %c1_i32, %c2_i32_7 : i32 2026-02-21T09:03:43.2858643Z scf.for %arg3 = %1 to %11 step %12 : i32 { 2026-02-21T09:03:43.2858844Z %13 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:43.2859101Z %14 = arith.muli %13, %c4_i32 : i32 2026-02-21T09:03:43.2859280Z %15 = arith.subi %c1_i32, %14 : i32 2026-02-21T09:03:43.2859465Z %16 = arith.minsi %15, %c4_i32 : i32 2026-02-21T09:03:43.2859657Z %17 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:43.2859838Z %18 = 
arith.remsi %17, %16 : i32 2026-02-21T09:03:43.2860016Z %19 = arith.addi %14, %18 : i32 2026-02-21T09:03:43.2860185Z %20 = arith.divsi %17, %16 : i32 2026-02-21T09:03:43.2860408Z %21 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:03:43.2860638Z %22 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:43.2860948Z %23 = tt.splat %21 : i32 -> tensor<64xi32> 2026-02-21T09:03:43.2861149Z %24 = arith.addi %23, %22 : tensor<64xi32> 2026-02-21T09:03:43.2861341Z %25 = arith.muli %20, %c32_i32 : i32 2026-02-21T09:03:43.2861607Z %26 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:43.2861851Z %27 = tt.splat %25 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2862053Z %28 = arith.addi %27, %26 : tensor<32xi32> 2026-02-21T09:03:43.2862239Z %c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:03:43.2862567Z %29 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:43.2862892Z %67 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:43.2863130Z %68 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2863343Z %69 = arith.addi %68, %26 : tensor<32xi32> 2026-02-21T09:03:43.2863595Z %70 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2863877Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2864131Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2864428Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2864689Z %74 = tt.broadcast %72 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2864935Z %75 = arith.addi %73, %74 : tensor<64x32xi32> 2026-02-21T09:03:43.2865184Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2865469Z %77 = tt.addptr %76, %75 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2865745Z %78 = tt.load %77 : tensor<64x32x!tt.ptr> 
2026-02-21T09:03:43.2865989Z %79 = arith.extf %78 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2866320Z %80 = tt.descriptor_load %0[%arg4, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2866630Z %81 = arith.shli %80, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2866843Z %82 = arith.shrsi %81, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2867063Z %83 = arith.shrsi %80, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2867300Z %84 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2867597Z %85 = tt.expand_dims %84 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2867901Z %86 = tt.expand_dims %85 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2868222Z %87 = tt.expand_dims %82 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2868544Z %88 = tt.expand_dims %83 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2868823Z %89 = arith.cmpi eq, %86, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2869086Z %90 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2869368Z %91 = tt.broadcast %87 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2869666Z %92 = arith.select %90, %91, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2869952Z %93 = arith.cmpi eq, %86, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2870207Z %94 = tt.broadcast %88 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2870522Z %95 = tt.broadcast %93 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2870805Z %96 = arith.select %95, %94, %92 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2871099Z %97 = tt.reshape %96 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2871367Z %98 = arith.sitofp %97 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2871818Z %99 = tt.dot %79, %98, %arg5, inputPrecision = tf32 : 
tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2872160Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:03:43.2872411Z %100 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:03:43.2872625Z %101 = arith.addi %arg4, %100 : i32 2026-02-21T09:03:43.2872820Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:03:43.2873035Z %103 = tt.splat %102 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2873245Z %104 = arith.addi %103, %26 : tensor<32xi32> 2026-02-21T09:03:43.2873520Z %105 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2873813Z %106 = arith.muli %105, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2874087Z %107 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2874428Z %108 = tt.broadcast %106 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2874711Z %109 = tt.broadcast %107 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2874968Z %110 = arith.addi %108, %109 : tensor<64x32xi32> 2026-02-21T09:03:43.2875230Z %111 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2875543Z %112 = tt.addptr %111, %110 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2875832Z %113 = tt.load %112 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2876084Z %114 = arith.extf %113 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2876430Z %115 = tt.descriptor_load %0[%101, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2876748Z %116 = arith.shli %115, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2876983Z %117 = arith.shrsi %116, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2877221Z %118 = arith.shrsi %115, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2877480Z %119 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2877800Z %120 = tt.expand_dims %119 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2878138Z %121 = tt.expand_dims 
%120 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2878493Z %122 = tt.expand_dims %117 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2878833Z %123 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2879143Z %124 = arith.cmpi eq, %121, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2879425Z %125 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2879697Z %126 = tt.broadcast %122 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2879987Z %127 = arith.select %125, %126, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2880258Z %128 = arith.cmpi eq, %121, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2880519Z %129 = tt.broadcast %123 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2880793Z %130 = tt.broadcast %128 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2881071Z %131 = arith.select %130, %129, %127 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2881354Z %132 = tt.reshape %131 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2881641Z %133 = arith.sitofp %132 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2882034Z %134 = tt.dot %114, %133, %99, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2882352Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:03:43.2882550Z %135 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:03:43.2882749Z %136 = arith.addi %arg4, %135 : i32 2026-02-21T09:03:43.2882931Z %137 = arith.muli %136, %c2_i32 : i32 2026-02-21T09:03:43.2883157Z %138 = tt.splat %137 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2883355Z %139 = arith.addi %138, %26 : tensor<32xi32> 2026-02-21T09:03:43.2883642Z %140 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2883909Z %141 = arith.muli %140, %cst_4 : tensor<64x1xi32> 
2026-02-21T09:03:43.2884169Z %142 = tt.expand_dims %139 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2884458Z %143 = tt.broadcast %141 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2884716Z %144 = tt.broadcast %142 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2884959Z %145 = arith.addi %143, %144 : tensor<64x32xi32> 2026-02-21T09:03:43.2885201Z %146 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2885539Z %147 = tt.addptr %146, %145 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2885800Z %148 = tt.load %147 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2886040Z %149 = arith.extf %148 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2886354Z %150 = tt.descriptor_load %0[%136, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2886648Z %151 = arith.shli %150, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2886870Z %152 = arith.shrsi %151, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2887081Z %153 = arith.shrsi %150, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2887330Z %154 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2887629Z %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2887938Z %156 = tt.expand_dims %155 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2888265Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2888587Z %158 = tt.expand_dims %153 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2888878Z %159 = arith.cmpi eq, %156, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2889124Z %160 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2889396Z %161 = tt.broadcast %157 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2889682Z %162 = arith.select %160, %161, %cst : tensor<16x2x32xi1>, 
tensor<16x2x32xi8> 2026-02-21T09:03:43.2889954Z %163 = arith.cmpi eq, %156, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2890203Z %164 = tt.broadcast %158 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2890467Z %165 = tt.broadcast %163 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2890746Z %166 = arith.select %165, %164, %162 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2891029Z %167 = tt.reshape %166 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2891290Z %168 = arith.sitofp %167 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2891686Z %169 = tt.dot %149, %168, %134, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2892001Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:43.2892191Z %170 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:03:43.2892379Z %171 = arith.addi %arg4, %170 : i32 2026-02-21T09:03:43.2892591Z %172 = arith.muli %171, %c2_i32 : i32 2026-02-21T09:03:43.2892786Z %173 = tt.splat %172 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2892983Z %174 = arith.addi %173, %26 : tensor<32xi32> 2026-02-21T09:03:43.2893234Z %175 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2893500Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2893762Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2894074Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2894363Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2894608Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:03:43.2894851Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2895192Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2895453Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2895694Z 
%184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2896013Z %185 = tt.descriptor_load %0[%171, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2896308Z %186 = arith.shli %185, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2896553Z %187 = arith.shrsi %186, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2896769Z %188 = arith.shrsi %185, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2897009Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2897301Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2897622Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2897952Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2898281Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2898569Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2898819Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2899097Z %196 = tt.broadcast %192 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2899399Z %197 = arith.select %195, %196, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2899673Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2899931Z %199 = tt.broadcast %193 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2900200Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2900480Z %201 = arith.select %200, %199, %197 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2900769Z %202 = tt.reshape %201 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2901044Z %203 = arith.sitofp %202 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2901405Z %204 = tt.dot %184, %203, %169, inputPrecision = tf32 : 
tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2901757Z scf.yield %204 : tensor<64x32xf32> 2026-02-21T09:03:43.2901936Z } 2026-02-21T09:03:43.2902109Z %30 = arith.truncf %29 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:43.2902396Z %31 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2902654Z %32 = arith.muli %31, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:43.2902904Z %33 = tt.expand_dims %28 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2903184Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2903433Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2903685Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:03:43.2903920Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2904202Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2904452Z tt.store %38, %30 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2904667Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:03:43.2904894Z %39 = arith.muli %c1_i32, %c1_i32_9 : i32 2026-02-21T09:03:43.2905084Z %40 = arith.addi %arg3, %39 : i32 2026-02-21T09:03:43.2905274Z %41 = arith.divsi %40, %c896_i32 : i32 2026-02-21T09:03:43.2905492Z %42 = arith.muli %41, %c4_i32 : i32 2026-02-21T09:03:43.2905678Z %43 = arith.subi %c1_i32, %42 : i32 2026-02-21T09:03:43.2905859Z %44 = arith.minsi %43, %c4_i32 : i32 2026-02-21T09:03:43.2906048Z %45 = arith.remsi %40, %c896_i32 : i32 2026-02-21T09:03:43.2906234Z %46 = arith.remsi %45, %44 : i32 2026-02-21T09:03:43.2906412Z %47 = arith.addi %42, %46 : i32 2026-02-21T09:03:43.2906595Z %48 = arith.divsi %45, %44 : i32 2026-02-21T09:03:43.2906770Z %49 = arith.muli %47, %c64_i32 : i32 2026-02-21T09:03:43.2907006Z %50 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:43.2907251Z %51 = tt.splat %49 : i32 -> 
tensor<64xi32> 2026-02-21T09:03:43.2907488Z %52 = arith.addi %51, %50 : tensor<64xi32> 2026-02-21T09:03:43.2907681Z %53 = arith.muli %48, %c32_i32 : i32 2026-02-21T09:03:43.2907912Z %54 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:43.2908157Z %55 = tt.splat %53 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2908348Z %56 = arith.addi %55, %54 : tensor<32xi32> 2026-02-21T09:03:43.2908544Z %c64_i32_10 = arith.constant 64 : i32 2026-02-21T09:03:43.2908869Z %57 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_10 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:43.2909195Z %67 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:43.2909390Z %68 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2909596Z %69 = arith.addi %68, %54 : tensor<32xi32> 2026-02-21T09:03:43.2909850Z %70 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2910118Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2910375Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2910659Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2910925Z %74 = tt.broadcast %72 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2911154Z %75 = arith.addi %73, %74 : tensor<64x32xi32> 2026-02-21T09:03:43.2911396Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2911723Z %77 = tt.addptr %76, %75 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2911979Z %78 = tt.load %77 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2912222Z %79 = arith.extf %78 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2912539Z %80 = tt.descriptor_load %0[%arg4, %53] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2912843Z %81 = arith.shli %80, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2913067Z %82 = arith.shrsi %81, %cst_3 : tensor<16x32xi8> 
2026-02-21T09:03:43.2913282Z %83 = arith.shrsi %80, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2913542Z %84 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2913843Z %85 = tt.expand_dims %84 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2914171Z %86 = tt.expand_dims %85 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2914504Z %87 = tt.expand_dims %82 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2914868Z %88 = tt.expand_dims %83 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2915168Z %89 = arith.cmpi eq, %86, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2915427Z %90 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2915715Z %91 = tt.broadcast %87 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2916039Z %92 = arith.select %90, %91, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2916332Z %93 = arith.cmpi eq, %86, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2916628Z %94 = tt.broadcast %88 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2916913Z %95 = tt.broadcast %93 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2917206Z %96 = arith.select %95, %94, %92 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2917492Z %97 = tt.reshape %96 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2917769Z %98 = arith.sitofp %97 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2918135Z %99 = tt.dot %79, %98, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2918474Z %c1_i32_11 = arith.constant 1 : i32 2026-02-21T09:03:43.2918718Z %100 = arith.muli %c16_i32, %c1_i32_11 : i32 2026-02-21T09:03:43.2918928Z %101 = arith.addi %arg4, %100 : i32 2026-02-21T09:03:43.2919131Z %102 = arith.muli %101, %c2_i32 : i32 2026-02-21T09:03:43.2919342Z %103 = 
tt.splat %102 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2919567Z %104 = arith.addi %103, %54 : tensor<32xi32> 2026-02-21T09:03:43.2919831Z %105 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2920130Z %106 = arith.muli %105, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2920419Z %107 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2920735Z %108 = tt.broadcast %106 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2921022Z %109 = tt.broadcast %107 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2921281Z %110 = arith.addi %108, %109 : tensor<64x32xi32> 2026-02-21T09:03:43.2921586Z %111 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2921896Z %112 = tt.addptr %111, %110 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2922184Z %113 = tt.load %112 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2922445Z %114 = arith.extf %113 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2922781Z %115 = tt.descriptor_load %0[%101, %53] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2923115Z %116 = arith.shli %115, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2923338Z %117 = arith.shrsi %116, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2923567Z %118 = arith.shrsi %115, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2923813Z %119 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2924112Z %120 = tt.expand_dims %119 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2924435Z %121 = tt.expand_dims %120 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2924761Z %122 = tt.expand_dims %117 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2925099Z %123 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2925386Z %124 = arith.cmpi eq, %121, %cst_2 : 
tensor<1x2x1xi32> 2026-02-21T09:03:43.2925650Z %125 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2925930Z %126 = tt.broadcast %122 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2926250Z %127 = arith.select %125, %126, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2926532Z %128 = arith.cmpi eq, %121, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2926789Z %129 = tt.broadcast %123 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2927066Z %130 = tt.broadcast %128 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2927348Z %131 = arith.select %130, %129, %127 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2927664Z %132 = tt.reshape %131 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2927974Z %133 = arith.sitofp %132 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2928331Z %134 = tt.dot %114, %133, %99, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2928655Z %c2_i32_12 = arith.constant 2 : i32 2026-02-21T09:03:43.2928853Z %135 = arith.muli %c16_i32, %c2_i32_12 : i32 2026-02-21T09:03:43.2929057Z %136 = arith.addi %arg4, %135 : i32 2026-02-21T09:03:43.2929249Z %137 = arith.muli %136, %c2_i32 : i32 2026-02-21T09:03:43.2929444Z %138 = tt.splat %137 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2929656Z %139 = arith.addi %138, %54 : tensor<32xi32> 2026-02-21T09:03:43.2929937Z %140 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2930221Z %141 = arith.muli %140, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2930488Z %142 = tt.expand_dims %139 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2930792Z %143 = tt.broadcast %141 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2931065Z %144 = tt.broadcast %142 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2931306Z %145 = arith.addi %143, %144 : 
tensor<64x32xi32> 2026-02-21T09:03:43.2931605Z %146 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2931896Z %147 = tt.addptr %146, %145 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2932165Z %148 = tt.load %147 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2932405Z %149 = arith.extf %148 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2932732Z %150 = tt.descriptor_load %0[%136, %53] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2933042Z %151 = arith.shli %150, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2933260Z %152 = arith.shrsi %151, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2933488Z %153 = arith.shrsi %150, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2933735Z %154 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2934038Z %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2934350Z %156 = tt.expand_dims %155 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2934686Z %157 = tt.expand_dims %152 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2935017Z %158 = tt.expand_dims %153 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2935304Z %159 = arith.cmpi eq, %156, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2935565Z %160 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2935837Z %161 = tt.broadcast %157 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2936136Z %162 = arith.select %160, %161, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2936416Z %163 = arith.cmpi eq, %156, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2936665Z %164 = tt.broadcast %158 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2936944Z %165 = tt.broadcast %163 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2937256Z %166 = arith.select %165, %164, %162 : 
tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2937542Z %167 = tt.reshape %166 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2937806Z %168 = arith.sitofp %167 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2938170Z %169 = tt.dot %149, %168, %134, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2938522Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:43.2938715Z %170 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:03:43.2938914Z %171 = arith.addi %arg4, %170 : i32 2026-02-21T09:03:43.2939125Z %172 = arith.muli %171, %c2_i32 : i32 2026-02-21T09:03:43.2939332Z %173 = tt.splat %172 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2939537Z %174 = arith.addi %173, %54 : tensor<32xi32> 2026-02-21T09:03:43.2939799Z %175 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2940079Z %176 = arith.muli %175, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2940339Z %177 = tt.expand_dims %174 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2940638Z %178 = tt.broadcast %176 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2940899Z %179 = tt.broadcast %177 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2941174Z %180 = arith.addi %178, %179 : tensor<64x32xi32> 2026-02-21T09:03:43.2941432Z %181 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2941775Z %182 = tt.addptr %181, %180 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2942044Z %183 = tt.load %182 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2942285Z %184 = arith.extf %183 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2942613Z %185 = tt.descriptor_load %0[%171, %53] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2942913Z %186 = arith.shli %185, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2943140Z %187 = arith.shrsi %186, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2943372Z %188 
= arith.shrsi %185, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2943617Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2943928Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2944247Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2946289Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2946621Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2946906Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2947166Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2947442Z %196 = tt.broadcast %192 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2948173Z %197 = arith.select %195, %196, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2948445Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2948705Z %199 = tt.broadcast %193 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2948988Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2949267Z %201 = arith.select %200, %199, %197 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2949559Z %202 = tt.reshape %201 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2949822Z %203 = arith.sitofp %202 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2950187Z %204 = tt.dot %184, %203, %169, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2950541Z scf.yield %204 : tensor<64x32xf32> 2026-02-21T09:03:43.2950718Z } 2026-02-21T09:03:43.2950901Z %58 = arith.truncf %57 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:43.2951190Z %59 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:03:43.2952358Z %60 = arith.muli %59, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:43.2952613Z %61 = tt.expand_dims %56 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2952927Z %62 = tt.broadcast %60 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2954190Z %63 = tt.broadcast %61 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2954422Z %64 = arith.addi %62, %63 : tensor<64x32xi32> 2026-02-21T09:03:43.2954669Z %65 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2959474Z %66 = tt.addptr %65, %64 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2959753Z tt.store %66, %58 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2959963Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:43.2960169Z scf.for %arg3 = %11 to %3 step %c1_i32 : i32 { 2026-02-21T09:03:43.2960386Z %13 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:03:43.2960592Z %14 = arith.muli %13, %c4_i32 : i32 2026-02-21T09:03:43.2962818Z %15 = arith.subi %c1_i32, %14 : i32 2026-02-21T09:03:43.2963038Z %16 = arith.minsi %15, %c4_i32 : i32 2026-02-21T09:03:43.2963252Z %17 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:03:43.2963453Z %18 = arith.remsi %17, %16 : i32 2026-02-21T09:03:43.2963656Z %19 = arith.addi %14, %18 : i32 2026-02-21T09:03:43.2963844Z %20 = arith.divsi %17, %16 : i32 2026-02-21T09:03:43.2964045Z %21 = arith.muli %19, %c64_i32 : i32 2026-02-21T09:03:43.2964297Z %22 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:03:43.2964561Z %23 = tt.splat %21 : i32 -> tensor<64xi32> 2026-02-21T09:03:43.2964785Z %24 = arith.addi %23, %22 : tensor<64xi32> 2026-02-21T09:03:43.2964987Z %25 = arith.muli %20, %c32_i32 : i32 2026-02-21T09:03:43.2965234Z %26 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:03:43.2965495Z %27 = tt.splat %25 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2965712Z %28 = arith.addi %27, %26 : tensor<32xi32> 2026-02-21T09:03:43.2965916Z 
%c64_i32_8 = arith.constant 64 : i32 2026-02-21T09:03:43.2966264Z %29 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c64_i32_8 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:03:43.2967031Z %39 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:03:43.2967241Z %40 = tt.splat %39 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2967466Z %41 = arith.addi %40, %26 : tensor<32xi32> 2026-02-21T09:03:43.2967738Z %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2968022Z %43 = arith.muli %42, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2968290Z %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2968597Z %45 = tt.broadcast %43 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2968863Z %46 = tt.broadcast %44 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2969098Z %47 = arith.addi %45, %46 : tensor<64x32xi32> 2026-02-21T09:03:43.2969347Z %48 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2969629Z %49 = tt.addptr %48, %47 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2969896Z %50 = tt.load %49 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2970134Z %51 = arith.extf %50 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2970451Z %52 = tt.descriptor_load %0[%arg4, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2970789Z %53 = arith.shli %52, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2971003Z %54 = arith.shrsi %53, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2971223Z %55 = arith.shrsi %52, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2971467Z %56 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2971791Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2972131Z %58 = tt.expand_dims %57 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2972443Z %59 = tt.expand_dims %54 
{axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2972787Z %60 = tt.expand_dims %55 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2973066Z %61 = arith.cmpi eq, %58, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2973322Z %62 = tt.broadcast %61 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2973593Z %63 = tt.broadcast %59 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2973871Z %64 = arith.select %62, %63, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2974146Z %65 = arith.cmpi eq, %58, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2974393Z %66 = tt.broadcast %60 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2974689Z %67 = tt.broadcast %65 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2974957Z %68 = arith.select %67, %66, %64 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2975236Z %69 = tt.reshape %68 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2975499Z %70 = arith.sitofp %69 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2975845Z %71 = tt.dot %51, %70, %arg5, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2976170Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:03:43.2976364Z %72 = arith.muli %c16_i32, %c1_i32_9 : i32 2026-02-21T09:03:43.2976567Z %73 = arith.addi %arg4, %72 : i32 2026-02-21T09:03:43.2976747Z %74 = arith.muli %73, %c2_i32 : i32 2026-02-21T09:03:43.2976940Z %75 = tt.splat %74 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2977145Z %76 = arith.addi %75, %26 : tensor<32xi32> 2026-02-21T09:03:43.2977390Z %77 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2977662Z %78 = arith.muli %77, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2977911Z %79 = tt.expand_dims %76 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2978201Z %80 = tt.broadcast %78 : 
tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2978453Z %81 = tt.broadcast %79 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2978687Z %82 = arith.addi %80, %81 : tensor<64x32xi32> 2026-02-21T09:03:43.2978937Z %83 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2979221Z %84 = tt.addptr %83, %82 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2979482Z %85 = tt.load %84 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2979713Z %86 = arith.extf %85 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2980028Z %87 = tt.descriptor_load %0[%73, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.2980329Z %88 = arith.shli %87, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2980546Z %89 = arith.shrsi %88, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2980774Z %90 = arith.shrsi %87, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2981013Z %91 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2981304Z %92 = tt.expand_dims %91 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2981676Z %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2982013Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2982332Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2982612Z %96 = arith.cmpi eq, %93, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2982876Z %97 = tt.broadcast %96 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2983199Z %98 = tt.broadcast %94 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2983482Z %99 = arith.select %97, %98, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2983786Z %100 = arith.cmpi eq, %93, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2984040Z %101 = tt.broadcast %95 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 
2026-02-21T09:03:43.2984314Z %102 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2984606Z %103 = arith.select %102, %101, %99 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2984887Z %104 = tt.reshape %103 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2985155Z %105 = arith.sitofp %104 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2985501Z %106 = tt.dot %86, %105, %71, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2985858Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:03:43.2986060Z %107 = arith.muli %c16_i32, %c2_i32_10 : i32 2026-02-21T09:03:43.2986265Z %108 = arith.addi %arg4, %107 : i32 2026-02-21T09:03:43.2986458Z %109 = arith.muli %108, %c2_i32 : i32 2026-02-21T09:03:43.2986653Z %110 = tt.splat %109 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.2986862Z %111 = arith.addi %110, %26 : tensor<32xi32> 2026-02-21T09:03:43.2987111Z %112 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.2988552Z %113 = arith.muli %112, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.2988811Z %114 = tt.expand_dims %111 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.2989107Z %115 = tt.broadcast %113 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2989376Z %116 = tt.broadcast %114 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.2989616Z %117 = arith.addi %115, %116 : tensor<64x32xi32> 2026-02-21T09:03:43.2989871Z %118 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2990160Z %119 = tt.addptr %118, %117 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.2990432Z %120 = tt.load %119 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.2990671Z %121 = arith.extf %120 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.2991004Z %122 = tt.descriptor_load %0[%108, %25] : !tt.tensordesc> -> tensor<16x32xi8> 
2026-02-21T09:03:43.2991311Z %123 = arith.shli %122, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2991574Z %124 = arith.shrsi %123, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2991802Z %125 = arith.shrsi %122, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.2992046Z %126 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.2992580Z %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.2992910Z %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.2993234Z %129 = tt.expand_dims %124 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2993569Z %130 = tt.expand_dims %125 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.2995234Z %131 = arith.cmpi eq, %128, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2995498Z %132 = tt.broadcast %131 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2995799Z %133 = tt.broadcast %129 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2996099Z %134 = arith.select %132, %133, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2996384Z %135 = arith.cmpi eq, %128, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.2996644Z %136 = tt.broadcast %130 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.2996959Z %137 = tt.broadcast %135 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.2997243Z %138 = arith.select %137, %136, %134 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.2999019Z %139 = tt.reshape %138 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.2999292Z %140 = arith.sitofp %139 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.2999643Z %141 = tt.dot %121, %140, %106, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.2999966Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:43.3000160Z %142 = arith.muli 
%c16_i32, %c3_i32 : i32 2026-02-21T09:03:43.3000361Z %143 = arith.addi %arg4, %142 : i32 2026-02-21T09:03:43.3000545Z %144 = arith.muli %143, %c2_i32 : i32 2026-02-21T09:03:43.3000751Z %145 = tt.splat %144 : i32 -> tensor<32xi32> 2026-02-21T09:03:43.3000978Z %146 = arith.addi %145, %26 : tensor<32xi32> 2026-02-21T09:03:43.3001242Z %147 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.3001524Z %148 = arith.muli %147, %cst_4 : tensor<64x1xi32> 2026-02-21T09:03:43.3001812Z %149 = tt.expand_dims %146 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.3002108Z %150 = tt.broadcast %148 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.3002371Z %151 = tt.broadcast %149 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.3002619Z %152 = arith.addi %150, %151 : tensor<64x32xi32> 2026-02-21T09:03:43.3002879Z %153 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.3003182Z %154 = tt.addptr %153, %152 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.3003466Z %155 = tt.load %154 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.3003713Z %156 = arith.extf %155 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:03:43.3004052Z %157 = tt.descriptor_load %0[%143, %25] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:03:43.3004366Z %158 = arith.shli %157, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.3004604Z %159 = arith.shrsi %158, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.3004839Z %160 = arith.shrsi %157, %cst_3 : tensor<16x32xi8> 2026-02-21T09:03:43.3005093Z %161 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:03:43.3005407Z %162 = tt.expand_dims %161 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:03:43.3005738Z %163 = tt.expand_dims %162 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:03:43.3006078Z %164 = tt.expand_dims %159 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 
2026-02-21T09:03:43.3006424Z %165 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:03:43.3006720Z %166 = arith.cmpi eq, %163, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:03:43.3006994Z %167 = tt.broadcast %166 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.3007277Z %168 = tt.broadcast %164 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.3007583Z %169 = arith.select %167, %168, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.3007867Z %170 = arith.cmpi eq, %163, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:03:43.3008133Z %171 = tt.broadcast %165 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:03:43.3008461Z %172 = tt.broadcast %170 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:03:43.3008752Z %173 = arith.select %172, %171, %169 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:03:43.3009055Z %174 = tt.reshape %173 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:03:43.3009328Z %175 = arith.sitofp %174 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:03:43.3009706Z %176 = tt.dot %156, %175, %141, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:03:43.3010075Z scf.yield %176 : tensor<64x32xf32> 2026-02-21T09:03:43.3010258Z } 2026-02-21T09:03:43.3010478Z %30 = arith.truncf %29 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:03:43.3010779Z %31 = tt.expand_dims %24 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:03:43.3011066Z %32 = arith.muli %31, %cst_0 : tensor<64x1xi32> 2026-02-21T09:03:43.3011332Z %33 = tt.expand_dims %28 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:03:43.3011681Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.3011957Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:03:43.3012202Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:03:43.3012499Z %37 = tt.splat 
%arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.3012787Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:03:43.3013053Z tt.store %38, %30 : tensor<64x32x!tt.ptr> 2026-02-21T09:03:43.3013254Z } {tt.num_stages = 1 : i32} 2026-02-21T09:03:43.3013428Z tt.return 2026-02-21T09:03:43.3013558Z } 2026-02-21T09:03:43.3013687Z } 2026-02-21T09:03:43.3013757Z 2026-02-21T09:03:43.3013816Z {-# 2026-02-21T09:03:43.3013942Z external_resources: { 2026-02-21T09:03:43.3014108Z mlir_reproducer: { 2026-02-21T09:03:43.3018531Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, 
tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:03:43.3023006Z disable_threading: false, 2026-02-21T09:03:43.3023172Z verify_each: true 2026-02-21T09:03:43.3023324Z } 2026-02-21T09:03:43.3023446Z } 2026-02-21T09:03:43.3023568Z #-} 2026-02-21T09:03:43.3024000Z /tmp/torchinductor_root/3a/c3axv66jqi3ugqsctzgru3fjur6zipobvyobuwn4ft6azye3v7h4.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:03:43.3025050Z /tmp/torchinductor_root/3a/c3axv66jqi3ugqsctzgru3fjur6zipobvyobuwn4ft6azye3v7h4.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:03:43.3025875Z [188s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:03:43.3027069Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:03:43.3028095Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:03:43.3028362Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:03:43.6465220Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 13.9 configs/s 2026-02-21T09:03:49.3109456Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 177.1 2026-02-21T09:03:49.3114033Z configs/s 2026-02-21T09:03:49.5000964Z [194s] Generation 7 complete: 2026-02-21T09:03:49.5005434Z error=6 2026-02-21T09:03:49.5009920Z ok=72 2026-02-21T09:03:49.5013836Z min=0.1096 2026-02-21T09:03:49.5015771Z mid=0.1690 2026-02-21T09:03:49.5015930Z max=7.1097 2026-02-21T09:03:49.5016076Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:03:49.5016295Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:03:49.5016501Z 'l2_groupings': [1], 2026-02-21T09:03:49.5016668Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:03:49.5016861Z 'loop_orders': [[0, 1]], 2026-02-21T09:03:49.5017018Z 'num_stages': 3, 2026-02-21T09:03:49.5017152Z 'num_warps': 1, 2026-02-21T09:03:49.5017291Z 'pid_type': 'flat', 2026-02-21T09:03:49.5017441Z 'range_flattens': [None, None], 2026-02-21T09:03:49.5017618Z 'range_multi_buffers': [None, True], 2026-02-21T09:03:49.5017795Z 'range_num_stages': [0, 0], 2026-02-21T09:03:49.5017964Z 'range_unroll_factors': [0, 0], 2026-02-21T09:03:49.5018147Z 'range_warp_specializes': [None, None]} 2026-02-21T09:03:49.5032425Z [194s] Fitting surrogate: 773 points, 773 targets 2026-02-21T09:03:50.6630719Z [196s] Generation 8 starting: 75 neighbors, 4 active search path(s) 2026-02-21T09:04:13.0483116Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 0.7 configs/s 2026-02-21T09:04:14.8168556Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:14.8173325Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:14.8174736Z ^ 2026-02-21T09:04:14.8175147Z /tmp/torchinductor_root/cn/ccn5a5q6ythd2dbscrq4juvz7sw53ddww2lx7k6awnonjagbribm.py:78:36: note: called from 
2026-02-21T09:04:14.8175553Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:14.8175763Z ^ 2026-02-21T09:04:14.8176186Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:14.8176669Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:14.8176923Z ^ 2026-02-21T09:04:14.8177106Z module { 2026-02-21T09:04:14.8182372Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:14.8183531Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:04:14.8183805Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:04:14.8184027Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:14.8184293Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:14.8187140Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:14.8187483Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:14.8192124Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:04:14.8196657Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:14.8197219Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:14.8197498Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:14.8202007Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:04:14.8206550Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:14.8208091Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:14.8208392Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:14.8211229Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:14.8211482Z %0 = tt.get_program_id x : i32 2026-02-21T09:04:14.8211741Z %1 = arith.divsi %0, %c896_i32 : i32 2026-02-21T09:04:14.8211950Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:04:14.8212135Z %3 = 
arith.subi %c1_i32, %2 : i32 2026-02-21T09:04:14.8212526Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:04:14.8212723Z %5 = arith.remsi %0, %c896_i32 : i32 2026-02-21T09:04:14.8212919Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:04:14.8213096Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:04:14.8213283Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:04:14.8213465Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T09:04:14.8213705Z %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:14.8213975Z %11 = tt.splat %9 : i32 -> tensor<64xi32> 2026-02-21T09:04:14.8214181Z %12 = arith.addi %11, %10 : tensor<64xi32> 2026-02-21T09:04:14.8214385Z %13 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:04:14.8214614Z %14 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:14.8214866Z %15 = tt.splat %13 : i32 -> tensor<32xi32> 2026-02-21T09:04:14.8215067Z %16 = arith.addi %15, %14 : tensor<32xi32> 2026-02-21T09:04:14.8215279Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:04:14.8215609Z %17 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:04:14.8215943Z %27 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:04:14.8216164Z %28 = arith.addi %27, %10 : tensor<64xi32> 2026-02-21T09:04:14.8216366Z %29 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:14.8216616Z %30 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:14.8216868Z %31 = tt.splat %29 : i32 -> tensor<128xi32> 2026-02-21T09:04:14.8217075Z %32 = arith.addi %31, %30 : tensor<128xi32> 2026-02-21T09:04:14.8217341Z %33 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:14.8217612Z %34 = arith.muli %33, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:14.8217881Z %35 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:14.8218180Z %36 = tt.broadcast %34 : tensor<64x1xi32> -> 
tensor<64x128xi32> 2026-02-21T09:04:14.8218503Z %37 = tt.broadcast %35 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:14.8218771Z %38 = arith.addi %36, %37 : tensor<64x128xi32> 2026-02-21T09:04:14.8219024Z %39 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:14.8219320Z %40 = tt.addptr %39, %38 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:14.8219581Z %41 = tt.load %40 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:14.8219831Z %42 = arith.extf %41 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:14.8220180Z %43 = tt.expand_dims %28 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:14.8220449Z %44 = arith.muli %43, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:14.8220709Z %45 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:14.8220992Z %46 = tt.broadcast %44 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8221262Z %47 = tt.broadcast %45 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8221589Z %48 = arith.addi %46, %47 : tensor<64x32xi32> 2026-02-21T09:04:14.8221875Z %49 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8222161Z %50 = tt.addptr %49, %48 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:14.8222460Z %51 = tt.load %50 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8222741Z %52 = arith.shli %51, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8222958Z %53 = arith.shrsi %52, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8223183Z %54 = arith.shrsi %51, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8223425Z %55 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:14.8223733Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:14.8224086Z %57 = tt.expand_dims %56 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:14.8224407Z %58 = tt.expand_dims %53 {axis = 1 : i32} : 
tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:14.8224741Z %59 = tt.expand_dims %54 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:14.8225020Z %60 = arith.cmpi eq, %57, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:14.8225276Z %61 = tt.broadcast %60 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:14.8225557Z %62 = tt.broadcast %58 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:14.8225839Z %63 = arith.select %61, %62, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:14.8226115Z %64 = arith.cmpi eq, %57, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:14.8226362Z %65 = tt.broadcast %59 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:14.8226635Z %66 = tt.broadcast %64 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:14.8226906Z %67 = arith.select %66, %65, %63 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:14.8227190Z %68 = tt.reshape %67 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:14.8227456Z %69 = arith.sitofp %68 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:14.8227814Z %70 = tt.dot %42, %69, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:14.8228147Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:04:14.8228346Z %71 = arith.muli %c64_i32, %c1_i32_6 : i32 2026-02-21T09:04:14.8228551Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T09:04:14.8228741Z %73 = tt.splat %72 : i32 -> tensor<64xi32> 2026-02-21T09:04:14.8228948Z %74 = arith.addi %73, %10 : tensor<64xi32> 2026-02-21T09:04:14.8229152Z %75 = arith.muli %72, %c2_i32 : i32 2026-02-21T09:04:14.8229384Z %76 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:14.8229645Z %77 = tt.splat %75 : i32 -> tensor<128xi32> 2026-02-21T09:04:14.8229849Z %78 = arith.addi %77, %76 : tensor<128xi32> 2026-02-21T09:04:14.8230109Z %79 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:04:14.8230373Z %80 = arith.muli %79, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:14.8230637Z %81 = tt.expand_dims %78 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:14.8230936Z %82 = tt.broadcast %80 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:14.8231198Z %83 = tt.broadcast %81 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:14.8231479Z %84 = arith.addi %82, %83 : tensor<64x128xi32> 2026-02-21T09:04:14.8231757Z %85 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:14.8232053Z %86 = tt.addptr %85, %84 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:14.8232309Z %87 = tt.load %86 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:14.8232558Z %88 = arith.extf %87 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:14.8232882Z %89 = tt.expand_dims %74 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:14.8233144Z %90 = arith.muli %89, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:14.8233436Z %91 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:14.8233719Z %92 = tt.broadcast %90 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8233989Z %93 = tt.broadcast %91 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8234230Z %94 = arith.addi %92, %93 : tensor<64x32xi32> 2026-02-21T09:04:14.8234468Z %95 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8234747Z %96 = tt.addptr %95, %94 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:14.8235033Z %97 = tt.load %96 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8235334Z %98 = arith.shli %97, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8235552Z %99 = arith.shrsi %98, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8235783Z %100 = arith.shrsi %97, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8236042Z %101 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:14.8236336Z 
%102 = tt.expand_dims %101 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:14.8236660Z %103 = tt.expand_dims %102 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:14.8236979Z %104 = tt.expand_dims %99 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:14.8237308Z %105 = tt.expand_dims %100 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:14.8237597Z %106 = arith.cmpi eq, %103, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:14.8237863Z %107 = tt.broadcast %106 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:14.8238145Z %108 = tt.broadcast %104 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:14.8238438Z %109 = arith.select %107, %108, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:14.8238722Z %110 = arith.cmpi eq, %103, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:14.8238975Z %111 = tt.broadcast %105 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:14.8239252Z %112 = tt.broadcast %110 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:14.8239542Z %113 = arith.select %112, %111, %109 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:14.8239824Z %114 = tt.reshape %113 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:14.8240097Z %115 = arith.sitofp %114 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:14.8240449Z %116 = tt.dot %88, %115, %70, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:14.8240782Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:04:14.8240979Z %117 = arith.muli %c64_i32, %c2_i32_7 : i32 2026-02-21T09:04:14.8241181Z %118 = arith.addi %arg3, %117 : i32 2026-02-21T09:04:14.8241390Z %119 = tt.splat %118 : i32 -> tensor<64xi32> 2026-02-21T09:04:14.8241636Z %120 = arith.addi %119, %10 : tensor<64xi32> 2026-02-21T09:04:14.8241845Z %121 = arith.muli %118, %c2_i32 : i32 2026-02-21T09:04:14.8242081Z %122 = tt.make_range {end = 128 : 
i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:14.8242344Z %123 = tt.splat %121 : i32 -> tensor<128xi32> 2026-02-21T09:04:14.8242553Z %124 = arith.addi %123, %122 : tensor<128xi32> 2026-02-21T09:04:14.8242857Z %125 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:14.8243145Z %126 = arith.muli %125, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:14.8243415Z %127 = tt.expand_dims %124 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:14.8243729Z %128 = tt.broadcast %126 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:14.8244036Z %129 = tt.broadcast %127 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:14.8244295Z %130 = arith.addi %128, %129 : tensor<64x128xi32> 2026-02-21T09:04:14.8244571Z %131 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:14.8244880Z %132 = tt.addptr %131, %130 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:14.8245163Z %133 = tt.load %132 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:14.8245413Z %134 = arith.extf %133 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:14.8245718Z %135 = tt.expand_dims %120 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:14.8245990Z %136 = arith.muli %135, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:14.8246272Z %137 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:14.8246563Z %138 = tt.broadcast %136 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8246865Z %139 = tt.broadcast %137 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8247120Z %140 = arith.addi %138, %139 : tensor<64x32xi32> 2026-02-21T09:04:14.8247358Z %141 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8247652Z %142 = tt.addptr %141, %140 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:14.8247962Z %143 = tt.load %142 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8248236Z 
%144 = arith.shli %143, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8248458Z %145 = arith.shrsi %144, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8248692Z %146 = arith.shrsi %143, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8248943Z %147 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:14.8249236Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:14.8249556Z %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:14.8249892Z %150 = tt.expand_dims %145 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:14.8250235Z %151 = tt.expand_dims %146 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:14.8250524Z %152 = arith.cmpi eq, %149, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:14.8250787Z %153 = tt.broadcast %152 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:14.8251068Z %154 = tt.broadcast %150 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:14.8251370Z %155 = arith.select %153, %154, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:14.8251710Z %156 = arith.cmpi eq, %149, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:14.8251994Z %157 = tt.broadcast %151 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:14.8252267Z %158 = tt.broadcast %156 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:14.8252562Z %159 = arith.select %158, %157, %155 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:14.8252849Z %160 = tt.reshape %159 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:14.8253128Z %161 = arith.sitofp %160 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:14.8253494Z %162 = tt.dot %134, %161, %116, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:14.8253831Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:14.8254034Z %163 = arith.muli %c64_i32, %c3_i32 : i32 
2026-02-21T09:04:14.8254288Z %164 = arith.addi %arg3, %163 : i32 2026-02-21T09:04:14.8254492Z %165 = tt.splat %164 : i32 -> tensor<64xi32> 2026-02-21T09:04:14.8254707Z %166 = arith.addi %165, %10 : tensor<64xi32> 2026-02-21T09:04:14.8254901Z %167 = arith.muli %164, %c2_i32 : i32 2026-02-21T09:04:14.8255143Z %168 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:14.8255416Z %169 = tt.splat %167 : i32 -> tensor<128xi32> 2026-02-21T09:04:14.8255673Z %170 = arith.addi %169, %168 : tensor<128xi32> 2026-02-21T09:04:14.8255955Z %171 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:14.8256265Z %172 = arith.muli %171, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:14.8256554Z %173 = tt.expand_dims %170 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:14.8256872Z %174 = tt.broadcast %172 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:14.8257170Z %175 = tt.broadcast %173 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:14.8257471Z %176 = arith.addi %174, %175 : tensor<64x128xi32> 2026-02-21T09:04:14.8257817Z %177 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:14.8258139Z %178 = tt.addptr %177, %176 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:14.8258449Z %179 = tt.load %178 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:14.8258725Z %180 = arith.extf %179 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:14.8259039Z %181 = tt.expand_dims %166 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:14.8259337Z %182 = arith.muli %181, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:14.8259616Z %183 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:14.8259920Z %184 = tt.broadcast %182 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8260208Z %185 = tt.broadcast %183 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8260466Z %186 = arith.addi 
%184, %185 : tensor<64x32xi32> 2026-02-21T09:04:14.8260725Z %187 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8261013Z %188 = tt.addptr %187, %186 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:14.8261349Z %189 = tt.load %188 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8261669Z %190 = arith.shli %189, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8261901Z %191 = arith.shrsi %190, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8262147Z %192 = arith.shrsi %189, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:14.8262406Z %193 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:14.8262733Z %194 = tt.expand_dims %193 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:14.8263047Z %195 = tt.expand_dims %194 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:14.8263386Z %196 = tt.expand_dims %191 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:14.8263718Z %197 = tt.expand_dims %192 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:14.8264008Z %198 = arith.cmpi eq, %195, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:14.8264274Z %199 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:14.8264548Z %200 = tt.broadcast %196 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:14.8264848Z %201 = arith.select %199, %200, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:14.8265133Z %202 = arith.cmpi eq, %195, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:14.8265381Z %203 = tt.broadcast %197 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:14.8265660Z %204 = tt.broadcast %202 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:14.8265942Z %205 = arith.select %204, %203, %201 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:14.8266264Z %206 = tt.reshape %205 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:14.8266532Z %207 = 
arith.sitofp %206 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:14.8266900Z %208 = tt.dot %180, %207, %162, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:14.8267233Z scf.yield %208 : tensor<64x32xf32> 2026-02-21T09:04:14.8267435Z } 2026-02-21T09:04:14.8267626Z %18 = arith.truncf %17 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:14.8267936Z %19 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:14.8268208Z %20 = arith.muli %19, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:14.8268462Z %21 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:14.8268751Z %22 = tt.broadcast %20 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8269016Z %23 = tt.broadcast %21 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:14.8269248Z %24 = arith.addi %22, %23 : tensor<64x32xi32> 2026-02-21T09:04:14.8269499Z %25 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8269779Z %26 = tt.addptr %25, %24 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:14.8270069Z tt.store %26, %18 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:14.8270267Z tt.return 2026-02-21T09:04:14.8270409Z } 2026-02-21T09:04:14.8270549Z } 2026-02-21T09:04:14.8270619Z 2026-02-21T09:04:14.8270672Z {-# 2026-02-21T09:04:14.8270815Z external_resources: { 2026-02-21T09:04:14.8270982Z mlir_reproducer: { 2026-02-21T09:04:14.8275334Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, 
canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:14.8279859Z disable_threading: false, 2026-02-21T09:04:14.8280048Z verify_each: true 2026-02-21T09:04:14.8280223Z } 2026-02-21T09:04:14.8280380Z } 2026-02-21T09:04:14.8280527Z #-} 2026-02-21T09:04:14.8281089Z /tmp/torchinductor_root/cn/ccn5a5q6ythd2dbscrq4juvz7sw53ddww2lx7k6awnonjagbribm.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:14.8282327Z /tmp/torchinductor_root/cn/ccn5a5q6ythd2dbscrq4juvz7sw53ddww2lx7k6awnonjagbribm.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:14.8283294Z [220s] Triton 
compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:14.8284491Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:14.8285545Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:14.8285814Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:04:15.0576493Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:15.0577092Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:15.0580886Z ^ 2026-02-21T09:04:15.0585770Z /tmp/torchinductor_root/2v/c2vzwroveuekb5aubjldaifdktbe2f26uicxyt376l6sm7baxeki.py:85:36: note: called from 2026-02-21T09:04:15.0590948Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:15.0592450Z ^ 2026-02-21T09:04:15.0592963Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:15.0596233Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:15.0596543Z ^ 2026-02-21T09:04:15.0596796Z module { 2026-02-21T09:04:15.0601683Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:15.0605893Z %cst = arith.constant dense<0> : tensor<64x2x128xi8> 2026-02-21T09:04:15.0611115Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:04:15.0614991Z %c1_i32 = 
arith.constant 1 : i32 2026-02-21T09:04:15.0616473Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:15.0616754Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:15.0617028Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:15.0617264Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:15.0617505Z %cst_3 = arith.constant dense<4> : tensor<64x128xi8> 2026-02-21T09:04:15.0617741Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:15.0617965Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:15.0618200Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x128xf32> 2026-02-21T09:04:15.0618440Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:04:15.0618642Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:15.0618827Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:15.0619024Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:15.0619215Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:15.0619410Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:15.0619593Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:15.0619928Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:15.0620252Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:15.0620446Z %2 = arith.divsi %1, %c224_i32 : i32 2026-02-21T09:04:15.0620634Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:04:15.0620833Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:04:15.0621012Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:04:15.0621198Z %6 = arith.remsi %1, %c224_i32 : i32 2026-02-21T09:04:15.0621634Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:04:15.0621813Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:04:15.0621999Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:04:15.0622178Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:04:15.0622431Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:15.0622692Z %12 = 
tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:04:15.0622960Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:04:15.0623160Z %14 = arith.muli %9, %c128_i32 : i32 2026-02-21T09:04:15.0623388Z %15 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:15.0623684Z %16 = tt.splat %14 : i32 -> tensor<128xi32> 2026-02-21T09:04:15.0623894Z %17 = arith.addi %16, %15 : tensor<128xi32> 2026-02-21T09:04:15.0624106Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:04:15.0624448Z %18 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x128xf32>) : i32 { 2026-02-21T09:04:15.0624807Z %28 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:15.0625027Z %29 = tt.splat %28 : i32 -> tensor<128xi32> 2026-02-21T09:04:15.0625241Z %30 = arith.addi %29, %15 : tensor<128xi32> 2026-02-21T09:04:15.0625520Z %31 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.0625850Z %32 = arith.muli %31, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.0626138Z %33 = tt.expand_dims %30 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:15.0626448Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0626738Z %35 = tt.broadcast %33 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0627005Z %36 = arith.addi %34, %35 : tensor<64x128xi32> 2026-02-21T09:04:15.0627265Z %37 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0627576Z %38 = tt.addptr %37, %36 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:15.0627901Z %39 = tt.load %38 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0628149Z %40 = arith.extf %39 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:15.0628502Z %41 = tt.descriptor_load %0[%arg3, %14] : !tt.tensordesc> -> tensor<64x128xi8> 2026-02-21T09:04:15.0628836Z %42 = arith.shli %41, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0629063Z %43 = arith.shrsi %42, 
%cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0629296Z %44 = arith.shrsi %41, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0629564Z %45 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.0629868Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.0630196Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.0630534Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:04:15.0630879Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:04:15.0631183Z %50 = arith.cmpi eq, %47, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.0631439Z %51 = tt.broadcast %50 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:04:15.0631778Z %52 = tt.broadcast %48 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:04:15.0632080Z %53 = arith.select %51, %52, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:04:15.0632374Z %54 = arith.cmpi eq, %47, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.0632633Z %55 = tt.broadcast %49 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:04:15.0632900Z %56 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:04:15.0633180Z %57 = arith.select %56, %55, %53 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:04:15.0633493Z %58 = tt.reshape %57 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:04:15.0633783Z %59 = arith.sitofp %58 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:04:15.0634151Z %60 = tt.dot %40, %59, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:15.0634490Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:04:15.0634690Z %61 = arith.muli %c64_i32, %c1_i32_6 : i32 2026-02-21T09:04:15.0634931Z %62 = arith.addi %arg3, %61 : i32 2026-02-21T09:04:15.0635118Z %63 = arith.muli %62, %c2_i32 
: i32 2026-02-21T09:04:15.0635325Z %64 = tt.splat %63 : i32 -> tensor<128xi32> 2026-02-21T09:04:15.0635553Z %65 = arith.addi %64, %15 : tensor<128xi32> 2026-02-21T09:04:15.0635813Z %66 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.0636078Z %67 = arith.muli %66, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.0636339Z %68 = tt.expand_dims %65 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:15.0636653Z %69 = tt.broadcast %67 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0636931Z %70 = tt.broadcast %68 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0637179Z %71 = arith.addi %69, %70 : tensor<64x128xi32> 2026-02-21T09:04:15.0637431Z %72 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0637749Z %73 = tt.addptr %72, %71 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:15.0638021Z %74 = tt.load %73 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0638261Z %75 = arith.extf %74 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:15.0638584Z %76 = tt.descriptor_load %0[%62, %14] : !tt.tensordesc> -> tensor<64x128xi8> 2026-02-21T09:04:15.0638882Z %77 = arith.shli %76, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0639107Z %78 = arith.shrsi %77, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0639339Z %79 = arith.shrsi %76, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0639583Z %80 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.0639877Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.0640303Z %82 = tt.expand_dims %81 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.0640621Z %83 = tt.expand_dims %78 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:04:15.0640971Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:04:15.0641251Z %85 = arith.cmpi eq, %82, 
%cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.0641503Z %86 = tt.broadcast %85 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:04:15.0641803Z %87 = tt.broadcast %83 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:04:15.0642098Z %88 = arith.select %86, %87, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:04:15.0642374Z %89 = arith.cmpi eq, %82, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.0642621Z %90 = tt.broadcast %84 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:04:15.0642897Z %91 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:04:15.0643168Z %92 = arith.select %91, %90, %88 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:04:15.0643454Z %93 = tt.reshape %92 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:04:15.0643726Z %94 = arith.sitofp %93 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:04:15.0644084Z %95 = tt.dot %75, %94, %60, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:15.0644410Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:04:15.0644604Z %96 = arith.muli %c64_i32, %c2_i32_7 : i32 2026-02-21T09:04:15.0644810Z %97 = arith.addi %arg3, %96 : i32 2026-02-21T09:04:15.0645024Z %98 = arith.muli %97, %c2_i32 : i32 2026-02-21T09:04:15.0645227Z %99 = tt.splat %98 : i32 -> tensor<128xi32> 2026-02-21T09:04:15.0645430Z %100 = arith.addi %99, %15 : tensor<128xi32> 2026-02-21T09:04:15.0645695Z %101 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.0645977Z %102 = arith.muli %101, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.0646258Z %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:15.0646596Z %104 = tt.broadcast %102 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0646896Z %105 = tt.broadcast %103 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0647148Z %106 = arith.addi %104, %105 : 
tensor<64x128xi32> 2026-02-21T09:04:15.0647407Z %107 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0647703Z %108 = tt.addptr %107, %106 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:15.0647983Z %109 = tt.load %108 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0648226Z %110 = arith.extf %109 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:15.0648551Z %111 = tt.descriptor_load %0[%97, %14] : !tt.tensordesc> -> tensor<64x128xi8> 2026-02-21T09:04:15.0648851Z %112 = arith.shli %111, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0649109Z %113 = arith.shrsi %112, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0649345Z %114 = arith.shrsi %111, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0649594Z %115 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.0649900Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.0650213Z %117 = tt.expand_dims %116 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.0650551Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:04:15.0650884Z %119 = tt.expand_dims %114 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:04:15.0651184Z %120 = arith.cmpi eq, %117, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.0651447Z %121 = tt.broadcast %120 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:04:15.0651753Z %122 = tt.broadcast %118 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:04:15.0652059Z %123 = arith.select %121, %122, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:04:15.0652340Z %124 = arith.cmpi eq, %117, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.0652609Z %125 = tt.broadcast %119 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:04:15.0652892Z %126 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:04:15.0653181Z %127 = arith.select %126, 
%125, %123 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:04:15.0653483Z %128 = tt.reshape %127 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:04:15.0653759Z %129 = arith.sitofp %128 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:04:15.0654135Z %130 = tt.dot %110, %129, %95, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:15.0654465Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:15.0654675Z %131 = arith.muli %c64_i32, %c3_i32 : i32 2026-02-21T09:04:15.0654883Z %132 = arith.addi %arg3, %131 : i32 2026-02-21T09:04:15.0655075Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:04:15.0655287Z %134 = tt.splat %133 : i32 -> tensor<128xi32> 2026-02-21T09:04:15.0655502Z %135 = arith.addi %134, %15 : tensor<128xi32> 2026-02-21T09:04:15.0655769Z %136 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.0656038Z %137 = arith.muli %136, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.0656310Z %138 = tt.expand_dims %135 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:15.0656670Z %139 = tt.broadcast %137 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0656940Z %140 = tt.broadcast %138 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0657194Z %141 = arith.addi %139, %140 : tensor<64x128xi32> 2026-02-21T09:04:15.0657443Z %142 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0657783Z %143 = tt.addptr %142, %141 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:15.0658055Z %144 = tt.load %143 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0658363Z %145 = arith.extf %144 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:15.0658699Z %146 = tt.descriptor_load %0[%132, %14] : !tt.tensordesc> -> tensor<64x128xi8> 2026-02-21T09:04:15.0659007Z %147 = arith.shli %146, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0659244Z %148 = arith.shrsi %147, %cst_3 : 
tensor<64x128xi8> 2026-02-21T09:04:15.0659469Z %149 = arith.shrsi %146, %cst_3 : tensor<64x128xi8> 2026-02-21T09:04:15.0659725Z %150 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.0660035Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.0660374Z %152 = tt.expand_dims %151 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.0660714Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:04:15.0661052Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:04:15.0661361Z %155 = arith.cmpi eq, %152, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.0661657Z %156 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:04:15.0661949Z %157 = tt.broadcast %153 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:04:15.0662257Z %158 = arith.select %156, %157, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:04:15.0662535Z %159 = arith.cmpi eq, %152, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.0662797Z %160 = tt.broadcast %154 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:04:15.0663073Z %161 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:04:15.0663374Z %162 = arith.select %161, %160, %158 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:04:15.0663673Z %163 = tt.reshape %162 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:04:15.0663956Z %164 = arith.sitofp %163 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:04:15.0664335Z %165 = tt.dot %145, %164, %130, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:15.0664667Z scf.yield %165 : tensor<64x128xf32> 2026-02-21T09:04:15.0664854Z } 2026-02-21T09:04:15.0665037Z %19 = arith.truncf %18 : tensor<64x128xf32> to tensor<64x128xbf16> 2026-02-21T09:04:15.0665341Z %20 = 
tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.0665618Z %21 = arith.muli %20, %cst_0 : tensor<64x1xi32> 2026-02-21T09:04:15.0665878Z %22 = tt.expand_dims %17 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:15.0666180Z %23 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0666448Z %24 = tt.broadcast %22 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:15.0666692Z %25 = arith.addi %23, %24 : tensor<64x128xi32> 2026-02-21T09:04:15.0666934Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0667229Z %27 = tt.addptr %26, %25 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:15.0667500Z tt.store %27, %19 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:15.0667700Z tt.return 2026-02-21T09:04:15.0667846Z } 2026-02-21T09:04:15.0668026Z } 2026-02-21T09:04:15.0668118Z 2026-02-21T09:04:15.0668180Z {-# 2026-02-21T09:04:15.0668328Z external_resources: { 2026-02-21T09:04:15.0668518Z mlir_reproducer: { 2026-02-21T09:04:15.0673209Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false 
num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:15.0677971Z disable_threading: false, 2026-02-21T09:04:15.0678149Z verify_each: true 2026-02-21T09:04:15.0678308Z } 2026-02-21T09:04:15.0678441Z } 2026-02-21T09:04:15.0678560Z #-} 2026-02-21T09:04:15.0678998Z /tmp/torchinductor_root/2v/c2vzwroveuekb5aubjldaifdktbe2f26uicxyt376l6sm7baxeki.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:15.0680021Z /tmp/torchinductor_root/2v/c2vzwroveuekb5aubjldaifdktbe2f26uicxyt376l6sm7baxeki.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:15.0680847Z [220s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:04:15.0681946Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:15.0682876Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:15.0683143Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:04:15.1645488Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:15.1649903Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:15.1654359Z ^ 2026-02-21T09:04:15.1656146Z /tmp/torchinductor_root/gz/cgz33bvhi45buwt5dotuinrd6xlfasfzd4b3uc5nij4hoz6vhhdc.py:78:36: note: called from 2026-02-21T09:04:15.1656634Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:15.1660916Z ^ 2026-02-21T09:04:15.1663089Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:15.1663604Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:15.1663856Z ^ 2026-02-21T09:04:15.1664116Z module { 2026-02-21T09:04:15.1668456Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:15.1669841Z %cst = arith.constant dense<0> : tensor<32x2x32xi8> 2026-02-21T09:04:15.1670129Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:04:15.1670327Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:15.1670537Z %c0_i32 = arith.constant 
0 : i32 2026-02-21T09:04:15.1670754Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:15.1671004Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:15.1671240Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:15.1671483Z %cst_3 = arith.constant dense<4> : tensor<32x32xi8> 2026-02-21T09:04:15.1671778Z %cst_4 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:04:15.1672013Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:15.1672304Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:15.1672537Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:04:15.1672783Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:15.1672971Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:15.1673213Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:15.1673405Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:15.1673587Z %0 = tt.get_program_id x : i32 2026-02-21T09:04:15.1673782Z %1 = arith.divsi %0, %c896_i32 : i32 2026-02-21T09:04:15.1673970Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:04:15.1674165Z %3 = arith.subi %c1_i32, %2 : i32 2026-02-21T09:04:15.1674342Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:04:15.1674532Z %5 = arith.remsi %0, %c896_i32 : i32 2026-02-21T09:04:15.1674722Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:04:15.1674898Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:04:15.1675090Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:04:15.1675265Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T09:04:15.1675569Z %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:15.1675821Z %11 = tt.splat %9 : i32 -> tensor<64xi32> 2026-02-21T09:04:15.1676033Z %12 = arith.addi %11, %10 : tensor<64xi32> 2026-02-21T09:04:15.1676226Z %13 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:04:15.1676499Z %14 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:15.1676746Z %15 = tt.splat %13 : i32 -> 
tensor<32xi32> 2026-02-21T09:04:15.1676940Z %16 = arith.addi %15, %14 : tensor<32xi32> 2026-02-21T09:04:15.1677142Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:04:15.1677464Z %17 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:04:15.1677810Z %27 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.1678043Z %28 = arith.addi %27, %14 : tensor<32xi32> 2026-02-21T09:04:15.1678265Z %29 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:15.1678478Z %30 = tt.splat %29 : i32 -> tensor<64xi32> 2026-02-21T09:04:15.1678674Z %31 = arith.addi %30, %10 : tensor<64xi32> 2026-02-21T09:04:15.1678937Z %32 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.1679211Z %33 = arith.muli %32, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:15.1679476Z %34 = tt.expand_dims %31 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:15.1679769Z %35 = tt.broadcast %33 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:15.1680097Z %36 = tt.broadcast %34 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:15.1680333Z %37 = arith.addi %35, %36 : tensor<64x64xi32> 2026-02-21T09:04:15.1680593Z %38 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:15.1680882Z %39 = tt.addptr %38, %37 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:15.1681150Z %40 = tt.load %39 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:15.1681430Z %41 = arith.extf %40 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:15.1681760Z %42 = tt.expand_dims %28 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:15.1682067Z %43 = arith.muli %42, %cst_4 : tensor<32x1xi32> 2026-02-21T09:04:15.1682322Z %44 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.1682612Z %45 = tt.broadcast %43 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:04:15.1682867Z %46 = tt.broadcast %44 : 
tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:04:15.1683109Z %47 = arith.addi %45, %46 : tensor<32x32xi32> 2026-02-21T09:04:15.1683349Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:04:15.1683681Z %49 = tt.addptr %48, %47 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:04:15.1684002Z %50 = tt.load %49 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:04:15.1684275Z %51 = arith.shli %50, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1684490Z %52 = arith.shrsi %51, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1684713Z %53 = arith.shrsi %50, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1684959Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.1685260Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.1685573Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.1685892Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:15.1686223Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:15.1686501Z %59 = arith.cmpi eq, %56, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.1686755Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:15.1687035Z %61 = tt.broadcast %57 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:15.1687319Z %62 = arith.select %60, %61, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:15.1687597Z %63 = arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.1687846Z %64 = tt.broadcast %58 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:15.1688125Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:15.1688397Z %66 = arith.select %65, %64, %62 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:15.1688682Z %67 = tt.reshape %66 : tensor<32x2x32xi8> -> 
tensor<64x32xi8> 2026-02-21T09:04:15.1688938Z %68 = arith.sitofp %67 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:04:15.1689297Z %69 = tt.dot %41, %68, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.1689620Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:04:15.1689819Z %70 = arith.muli %c32_i32, %c1_i32_7 : i32 2026-02-21T09:04:15.1690020Z %71 = arith.addi %arg3, %70 : i32 2026-02-21T09:04:15.1690215Z %72 = tt.splat %71 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.1690419Z %73 = arith.addi %72, %14 : tensor<32xi32> 2026-02-21T09:04:15.1690614Z %74 = arith.muli %71, %c2_i32 : i32 2026-02-21T09:04:15.1690807Z %75 = tt.splat %74 : i32 -> tensor<64xi32> 2026-02-21T09:04:15.1691002Z %76 = arith.addi %75, %10 : tensor<64xi32> 2026-02-21T09:04:15.1691253Z %77 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.1691578Z %78 = arith.muli %77, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:15.1691836Z %79 = tt.expand_dims %76 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:15.1692123Z %80 = tt.broadcast %78 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:15.1692379Z %81 = tt.broadcast %79 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:15.1692646Z %82 = arith.addi %80, %81 : tensor<64x64xi32> 2026-02-21T09:04:15.1692894Z %83 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:15.1693208Z %84 = tt.addptr %83, %82 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:15.1693470Z %85 = tt.load %84 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:15.1693714Z %86 = arith.extf %85 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:15.1693993Z %87 = tt.expand_dims %73 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:15.1694257Z %88 = arith.muli %87, %cst_4 : tensor<32x1xi32> 2026-02-21T09:04:15.1694517Z %89 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 
2026-02-21T09:04:15.1694800Z %90 = tt.broadcast %88 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:04:15.1695057Z %91 = tt.broadcast %89 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:04:15.1695317Z %92 = arith.addi %90, %91 : tensor<32x32xi32> 2026-02-21T09:04:15.1695559Z %93 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:04:15.1695833Z %94 = tt.addptr %93, %92 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:04:15.1696133Z %95 = tt.load %94 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:04:15.1696398Z %96 = arith.shli %95, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1696617Z %97 = arith.shrsi %96, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1696836Z %98 = arith.shrsi %95, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1697087Z %99 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.1697385Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.1697709Z %101 = tt.expand_dims %100 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.1698039Z %102 = tt.expand_dims %97 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:15.1698362Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:15.1698651Z %104 = arith.cmpi eq, %101, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.1698915Z %105 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:15.1699194Z %106 = tt.broadcast %102 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:15.1699488Z %107 = arith.select %105, %106, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:15.1699771Z %108 = arith.cmpi eq, %101, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.1700025Z %109 = tt.broadcast %103 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:15.1700299Z %110 = tt.broadcast %108 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 
2026-02-21T09:04:15.1700588Z %111 = arith.select %110, %109, %107 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:15.1700875Z %112 = tt.reshape %111 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:04:15.1701150Z %113 = arith.sitofp %112 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:04:15.1701509Z %114 = tt.dot %86, %113, %69, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.1701862Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:04:15.1702066Z %115 = arith.muli %c32_i32, %c2_i32_8 : i32 2026-02-21T09:04:15.1702270Z %116 = arith.addi %arg3, %115 : i32 2026-02-21T09:04:15.1702472Z %117 = tt.splat %116 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.1702715Z %118 = arith.addi %117, %14 : tensor<32xi32> 2026-02-21T09:04:15.1702922Z %119 = arith.muli %116, %c2_i32 : i32 2026-02-21T09:04:15.1703122Z %120 = tt.splat %119 : i32 -> tensor<64xi32> 2026-02-21T09:04:15.1703328Z %121 = arith.addi %120, %10 : tensor<64xi32> 2026-02-21T09:04:15.1703593Z %122 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.1703880Z %123 = arith.muli %122, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:15.1704179Z %124 = tt.expand_dims %121 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:15.1704503Z %125 = tt.broadcast %123 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:15.1704774Z %126 = tt.broadcast %124 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:15.1705026Z %127 = arith.addi %125, %126 : tensor<64x64xi32> 2026-02-21T09:04:15.1705277Z %128 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:15.1705582Z %129 = tt.addptr %128, %127 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:15.1705857Z %130 = tt.load %129 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:15.1706102Z %131 = arith.extf %130 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:15.1706398Z %132 = tt.expand_dims %118 {axis = 1 : i32} : 
tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:15.1706705Z %133 = arith.muli %132, %cst_4 : tensor<32x1xi32> 2026-02-21T09:04:15.1706973Z %134 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.1707266Z %135 = tt.broadcast %133 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:04:15.1707530Z %136 = tt.broadcast %134 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:04:15.1707778Z %137 = arith.addi %135, %136 : tensor<32x32xi32> 2026-02-21T09:04:15.1708021Z %138 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:04:15.1708301Z %139 = tt.addptr %138, %137 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:04:15.1708621Z %140 = tt.load %139 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:04:15.1708895Z %141 = arith.shli %140, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1709121Z %142 = arith.shrsi %141, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1709359Z %143 = arith.shrsi %140, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1709623Z %144 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.1709936Z %145 = tt.expand_dims %144 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.1710267Z %146 = tt.expand_dims %145 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.1710610Z %147 = tt.expand_dims %142 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:15.1710955Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:15.1711265Z %149 = arith.cmpi eq, %146, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.1711592Z %150 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:15.1711888Z %151 = tt.broadcast %147 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:15.1712196Z %152 = arith.select %150, %151, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:15.1712491Z %153 = arith.cmpi eq, 
%146, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.1712759Z %154 = tt.broadcast %148 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:15.1713049Z %155 = tt.broadcast %153 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:15.1713347Z %156 = arith.select %155, %154, %152 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:15.1713648Z %157 = tt.reshape %156 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:04:15.1713931Z %158 = arith.sitofp %157 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:04:15.1714344Z %159 = tt.dot %131, %158, %114, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.1714686Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:15.1714898Z %160 = arith.muli %c32_i32, %c3_i32 : i32 2026-02-21T09:04:15.1715108Z %161 = arith.addi %arg3, %160 : i32 2026-02-21T09:04:15.1715317Z %162 = tt.splat %161 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.1715595Z %163 = arith.addi %162, %14 : tensor<32xi32> 2026-02-21T09:04:15.1715805Z %164 = arith.muli %161, %c2_i32 : i32 2026-02-21T09:04:15.1716020Z %165 = tt.splat %164 : i32 -> tensor<64xi32> 2026-02-21T09:04:15.1716300Z %166 = arith.addi %165, %10 : tensor<64xi32> 2026-02-21T09:04:15.1716573Z %167 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.1716863Z %168 = arith.muli %167, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:15.1717142Z %169 = tt.expand_dims %166 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:15.1717458Z %170 = tt.broadcast %168 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:15.1717742Z %171 = tt.broadcast %169 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:15.1717985Z %172 = arith.addi %170, %171 : tensor<64x64xi32> 2026-02-21T09:04:15.1718240Z %173 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:15.1718566Z %174 = tt.addptr %173, %172 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 
2026-02-21T09:04:15.1718844Z %175 = tt.load %174 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:15.1719095Z %176 = arith.extf %175 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:15.1719388Z %177 = tt.expand_dims %163 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:15.1719667Z %178 = arith.muli %177, %cst_4 : tensor<32x1xi32> 2026-02-21T09:04:15.1719925Z %179 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.1720218Z %180 = tt.broadcast %178 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:04:15.1720484Z %181 = tt.broadcast %179 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:04:15.1720728Z %182 = arith.addi %180, %181 : tensor<32x32xi32> 2026-02-21T09:04:15.1720970Z %183 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:04:15.1721257Z %184 = tt.addptr %183, %182 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:04:15.1721610Z %185 = tt.load %184 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:04:15.1721893Z %186 = arith.shli %185, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1722121Z %187 = arith.shrsi %186, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1722352Z %188 = arith.shrsi %185, %cst_3 : tensor<32x32xi8> 2026-02-21T09:04:15.1722610Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.1722912Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.1723232Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.1723565Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:15.1723898Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:15.1724186Z %194 = arith.cmpi eq, %191, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.1724456Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 
2026-02-21T09:04:15.1724751Z %196 = tt.broadcast %192 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:15.1725053Z %197 = arith.select %195, %196, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:15.1725335Z %198 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.1725595Z %199 = tt.broadcast %193 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:15.1725915Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:15.1726204Z %201 = arith.select %200, %199, %197 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:15.1726491Z %202 = tt.reshape %201 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:04:15.1726762Z %203 = arith.sitofp %202 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:04:15.1727122Z %204 = tt.dot %176, %203, %159, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.1727474Z scf.yield %204 : tensor<64x32xf32> 2026-02-21T09:04:15.1727651Z } 2026-02-21T09:04:15.1727866Z %18 = arith.truncf %17 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:15.1728166Z %19 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.1728435Z %20 = arith.muli %19, %cst_0 : tensor<64x1xi32> 2026-02-21T09:04:15.1728692Z %21 = tt.expand_dims %16 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.1728979Z %22 = tt.broadcast %20 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.1729239Z %23 = tt.broadcast %21 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.1729475Z %24 = arith.addi %22, %23 : tensor<64x32xi32> 2026-02-21T09:04:15.1729721Z %25 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.1730040Z %26 = tt.addptr %25, %24 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.1730301Z tt.store %26, %18 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.1730496Z tt.return 2026-02-21T09:04:15.1730632Z } 
2026-02-21T09:04:15.1730765Z } 2026-02-21T09:04:15.1730838Z 2026-02-21T09:04:15.1730896Z {-# 2026-02-21T09:04:15.1731038Z external_resources: { 2026-02-21T09:04:15.1731204Z mlir_reproducer: { 2026-02-21T09:04:15.1735654Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 
2026-02-21T09:04:15.1740125Z disable_threading: false,
2026-02-21T09:04:15.1740306Z verify_each: true
2026-02-21T09:04:15.1740462Z }
2026-02-21T09:04:15.1740590Z }
2026-02-21T09:04:15.1740715Z #-}
2026-02-21T09:04:15.1741145Z /tmp/torchinductor_root/gz/cgz33bvhi45buwt5dotuinrd6xlfasfzd4b3uc5nij4hoz6vhhdc.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:04:15.1742238Z /tmp/torchinductor_root/gz/cgz33bvhi45buwt5dotuinrd6xlfasfzd4b3uc5nij4hoz6vhhdc.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:04:15.1743091Z [220s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:04:15.1744181Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T09:04:15.1745109Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:04:15.1745379Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:04:15.4340647Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:15.4345079Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:15.4346836Z ^ 2026-02-21T09:04:15.4347442Z /tmp/torchinductor_root/br/cbr75xotwb4uckv5d724lf4hcc26icck3iyiotho77wvjmo5bzah.py:85:36: note: called from 2026-02-21T09:04:15.4347866Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:15.4348086Z ^ 2026-02-21T09:04:15.4348503Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:15.4348985Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:15.4349231Z ^ 2026-02-21T09:04:15.4349412Z module { 2026-02-21T09:04:15.4350327Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:15.4350928Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:04:15.4355609Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:04:15.4355978Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:15.4356261Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:04:15.4360073Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:15.4365333Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:15.4367331Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:15.4367675Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:15.4372355Z %cst_3 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:04:15.4374509Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:15.4374816Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:15.4377643Z %cst_5 = arith.constant dense<0.000000e+00> : 
tensor<64x32xf32> 2026-02-21T09:04:15.4377933Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:15.4378140Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:15.4378329Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:15.4378538Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:15.4378737Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:15.4378927Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:15.4379115Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:15.4379444Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:15.4379775Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:15.4379967Z %2 = arith.divsi %1, %c896_i32 : i32 2026-02-21T09:04:15.4380156Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:04:15.4380596Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:04:15.4380784Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:04:15.4380967Z %6 = arith.remsi %1, %c896_i32 : i32 2026-02-21T09:04:15.4381163Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:04:15.4381348Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:04:15.4381532Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:04:15.4381786Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:04:15.4382084Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:15.4382348Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:04:15.4382592Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:04:15.4382803Z %14 = arith.muli %9, %c32_i32 : i32 2026-02-21T09:04:15.4383039Z %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:15.4383282Z %16 = tt.splat %14 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.4383481Z %17 = arith.addi %16, %15 : tensor<32xi32> 2026-02-21T09:04:15.4383676Z %c64_i32_6 = arith.constant 64 : i32 2026-02-21T09:04:15.4384001Z %18 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c64_i32_6 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 
2026-02-21T09:04:15.4384335Z %28 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:15.4384539Z %29 = tt.splat %28 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.4384782Z %30 = arith.addi %29, %15 : tensor<32xi32> 2026-02-21T09:04:15.4385047Z %31 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.4385324Z %32 = arith.muli %31, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.4385593Z %33 = tt.expand_dims %30 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.4385889Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4386152Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4386394Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:04:15.4386648Z %37 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.4386933Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.4387197Z %39 = tt.load %38 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.4387445Z %40 = arith.extf %39 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:15.4387771Z %41 = tt.descriptor_load %0[%arg3, %14] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:15.4388078Z %42 = arith.shli %41, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4388299Z %43 = arith.shrsi %42, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4388517Z %44 = arith.shrsi %41, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4388800Z %45 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.4389094Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.4389407Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.4389725Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.4390040Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 
2026-02-21T09:04:15.4390332Z %50 = arith.cmpi eq, %47, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.4390587Z %51 = tt.broadcast %50 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.4390863Z %52 = tt.broadcast %48 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.4391151Z %53 = arith.select %51, %52, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.4391424Z %54 = arith.cmpi eq, %47, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.4391719Z %55 = tt.broadcast %49 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.4391989Z %56 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.4392304Z %57 = arith.select %56, %55, %53 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.4392586Z %58 = tt.reshape %57 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:15.4392847Z %59 = arith.sitofp %58 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:15.4393209Z %60 = tt.dot %40, %59, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.4393575Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:04:15.4393780Z %61 = arith.muli %c16_i32, %c1_i32_7 : i32 2026-02-21T09:04:15.4393984Z %62 = arith.addi %arg3, %61 : i32 2026-02-21T09:04:15.4394202Z %63 = arith.muli %62, %c2_i32 : i32 2026-02-21T09:04:15.4394403Z %64 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.4394605Z %65 = arith.addi %64, %15 : tensor<32xi32> 2026-02-21T09:04:15.4394858Z %66 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.4395132Z %67 = arith.muli %66, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.4395389Z %68 = tt.expand_dims %65 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.4395677Z %69 = tt.broadcast %67 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4395935Z %70 = tt.broadcast %68 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4396202Z %71 = 
arith.addi %69, %70 : tensor<64x32xi32> 2026-02-21T09:04:15.4396454Z %72 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.4396743Z %73 = tt.addptr %72, %71 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.4397008Z %74 = tt.load %73 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.4397247Z %75 = arith.extf %74 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:15.4397560Z %76 = tt.descriptor_load %0[%62, %14] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:15.4397872Z %77 = arith.shli %76, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4398088Z %78 = arith.shrsi %77, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4398308Z %79 = arith.shrsi %76, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4398554Z %80 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.4398845Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.4399155Z %82 = tt.expand_dims %81 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.4399476Z %83 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.4399792Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.4400079Z %85 = arith.cmpi eq, %82, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.4400328Z %86 = tt.broadcast %85 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.4400611Z %87 = tt.broadcast %83 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.4400891Z %88 = arith.select %86, %87, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.4401161Z %89 = arith.cmpi eq, %82, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.4401409Z %90 = tt.broadcast %84 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.4401712Z %91 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.4401991Z %92 = arith.select %91, %90, %88 : tensor<16x2x32xi1>, 
tensor<16x2x32xi8> 2026-02-21T09:04:15.4402275Z %93 = tt.reshape %92 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:15.4402554Z %94 = arith.sitofp %93 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:15.4402921Z %95 = tt.dot %75, %94, %60, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.4403260Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:04:15.4403512Z %96 = arith.muli %c16_i32, %c2_i32_8 : i32 2026-02-21T09:04:15.4403719Z %97 = arith.addi %arg3, %96 : i32 2026-02-21T09:04:15.4403940Z %98 = arith.muli %97, %c2_i32 : i32 2026-02-21T09:04:15.4404141Z %99 = tt.splat %98 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.4404363Z %100 = arith.addi %99, %15 : tensor<32xi32> 2026-02-21T09:04:15.4404635Z %101 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.4404963Z %102 = arith.muli %101, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.4405242Z %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.4405588Z %104 = tt.broadcast %102 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4405886Z %105 = tt.broadcast %103 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4406143Z %106 = arith.addi %104, %105 : tensor<64x32xi32> 2026-02-21T09:04:15.4406417Z %107 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.4406727Z %108 = tt.addptr %107, %106 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.4407021Z %109 = tt.load %108 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.4407275Z %110 = arith.extf %109 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:15.4407648Z %111 = tt.descriptor_load %0[%97, %14] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:15.4407978Z %112 = arith.shli %111, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4408215Z %113 = arith.shrsi %112, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4408444Z %114 = arith.shrsi %111, %cst_3 : 
tensor<16x32xi8> 2026-02-21T09:04:15.4408705Z %115 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.4409007Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.4409341Z %117 = tt.expand_dims %116 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.4409676Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.4410029Z %119 = tt.expand_dims %114 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.4410317Z %120 = arith.cmpi eq, %117, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.4410568Z %121 = tt.broadcast %120 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.4410840Z %122 = tt.broadcast %118 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.4411157Z %123 = arith.select %121, %122, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.4411437Z %124 = arith.cmpi eq, %117, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.4411717Z %125 = tt.broadcast %119 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.4411992Z %126 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.4412270Z %127 = arith.select %126, %125, %123 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.4412565Z %128 = tt.reshape %127 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:15.4412834Z %129 = arith.sitofp %128 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:15.4413189Z %130 = tt.dot %110, %129, %95, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.4413521Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:15.4413714Z %131 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:04:15.4413913Z %132 = arith.addi %arg3, %131 : i32 2026-02-21T09:04:15.4414099Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:04:15.4414303Z %134 = tt.splat %133 : i32 -> tensor<32xi32> 
2026-02-21T09:04:15.4414516Z %135 = arith.addi %134, %15 : tensor<32xi32> 2026-02-21T09:04:15.4414775Z %136 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.4415055Z %137 = arith.muli %136, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.4415337Z %138 = tt.expand_dims %135 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.4415634Z %139 = tt.broadcast %137 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4415895Z %140 = tt.broadcast %138 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4416137Z %141 = arith.addi %139, %140 : tensor<64x32xi32> 2026-02-21T09:04:15.4416388Z %142 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.4416702Z %143 = tt.addptr %142, %141 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.4417015Z %144 = tt.load %143 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.4417257Z %145 = arith.extf %144 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:15.4417580Z %146 = tt.descriptor_load %0[%132, %14] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:15.4417874Z %147 = arith.shli %146, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4418099Z %148 = arith.shrsi %147, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4418319Z %149 = arith.shrsi %146, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.4418561Z %150 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.4418859Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.4419196Z %152 = tt.expand_dims %151 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.4419529Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.4419861Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.4420151Z %155 = arith.cmpi eq, %152, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.4420410Z 
%156 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.4420679Z %157 = tt.broadcast %153 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.4420975Z %158 = arith.select %156, %157, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.4421247Z %159 = arith.cmpi eq, %152, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.4421503Z %160 = tt.broadcast %154 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.4421813Z %161 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.4422095Z %162 = arith.select %161, %160, %158 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.4422386Z %163 = tt.reshape %162 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:15.4422648Z %164 = arith.sitofp %163 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:15.4423006Z %165 = tt.dot %145, %164, %130, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.4423325Z scf.yield %165 : tensor<64x32xf32> 2026-02-21T09:04:15.4423505Z } 2026-02-21T09:04:15.4423693Z %19 = arith.truncf %18 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:15.4423978Z %20 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.4424250Z %21 = arith.muli %20, %cst_0 : tensor<64x1xi32> 2026-02-21T09:04:15.4424501Z %22 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.4424790Z %23 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4425043Z %24 = tt.broadcast %22 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.4425277Z %25 = arith.addi %23, %24 : tensor<64x32xi32> 2026-02-21T09:04:15.4425523Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.4425798Z %27 = tt.addptr %26, %25 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.4426061Z tt.store %27, %19 : tensor<64x32x!tt.ptr> 
2026-02-21T09:04:15.4426249Z tt.return 2026-02-21T09:04:15.4426422Z } 2026-02-21T09:04:15.4426543Z } 2026-02-21T09:04:15.4426619Z 2026-02-21T09:04:15.4426670Z {-# 2026-02-21T09:04:15.4426799Z external_resources: { 2026-02-21T09:04:15.4426967Z mlir_reproducer: { 2026-02-21T09:04:15.4431393Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:15.4436002Z disable_threading: false, 2026-02-21T09:04:15.4436183Z verify_each: true 2026-02-21T09:04:15.4436327Z } 2026-02-21T09:04:15.4436456Z } 2026-02-21T09:04:15.4436570Z #-} 2026-02-21T09:04:15.4436998Z /tmp/torchinductor_root/br/cbr75xotwb4uckv5d724lf4hcc26icck3iyiotho77wvjmo5bzah.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:15.4438019Z /tmp/torchinductor_root/br/cbr75xotwb4uckv5d724lf4hcc26icck3iyiotho77wvjmo5bzah.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:15.4438848Z [220s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:15.4439916Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:15.4440879Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:15.4441135Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:04:15.5932453Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:15.5935045Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:15.5940124Z ^ 2026-02-21T09:04:15.5944986Z /tmp/torchinductor_root/g4/cg4kryeis5xxwsd2qpjsodim7pykaxcerxijk32saw54f2yffnqk.py:85:36: note: called from 2026-02-21T09:04:15.5949357Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:15.5950696Z ^ 2026-02-21T09:04:15.5951343Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:15.5951990Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:15.5952234Z ^ 2026-02-21T09:04:15.5952487Z module { 2026-02-21T09:04:15.5957690Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:15.5962042Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:04:15.5964531Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:04:15.5964826Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:04:15.5965036Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:15.5965263Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:15.5965535Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:15.5965776Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:15.5966048Z %cst_3 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:04:15.5966298Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:15.5966516Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:15.5966813Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:04:15.5967065Z %c64_i32 = arith.constant 
64 : i32 2026-02-21T09:04:15.5967269Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:15.5967470Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:15.5967670Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:15.5967883Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:15.5968082Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:15.5968287Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:15.5968617Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:15.5968978Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:15.5969170Z %2 = arith.divsi %1, %c4_i32 : i32 2026-02-21T09:04:15.5969357Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:04:15.5969550Z %4 = arith.subi %c224_i32, %3 : i32 2026-02-21T09:04:15.5969735Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:04:15.5969920Z %6 = arith.remsi %1, %c4_i32 : i32 2026-02-21T09:04:15.5970101Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:04:15.5970296Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:04:15.5970461Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:04:15.5970639Z %10 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:04:15.5970870Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:15.5971120Z %12 = tt.splat %10 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.5971323Z %13 = arith.addi %12, %11 : tensor<32xi32> 2026-02-21T09:04:15.5971505Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:04:15.5971826Z %15 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:15.5972066Z %16 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:04:15.5972271Z %17 = arith.addi %16, %15 : tensor<64xi32> 2026-02-21T09:04:15.5972462Z %c64_i32_6 = arith.constant 64 : i32 2026-02-21T09:04:15.5972799Z %18 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c64_i32_6 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:04:15.5973133Z %28 = arith.muli %arg3, %c2_i32 : i32 
2026-02-21T09:04:15.5973329Z %29 = tt.splat %28 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.5973537Z %30 = arith.addi %29, %11 : tensor<32xi32> 2026-02-21T09:04:15.5973790Z %31 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.5974071Z %32 = arith.muli %31, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.5974335Z %33 = tt.expand_dims %30 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.5974667Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.5974933Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.5975164Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:04:15.5975419Z %37 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.5975740Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.5976090Z %39 = tt.load %38 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.5976425Z %40 = arith.extf %39 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:15.5976756Z %41 = tt.descriptor_load %0[%arg3, %10] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:15.5977060Z %42 = arith.shli %41, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.5977282Z %43 = arith.shrsi %42, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.5977496Z %44 = arith.shrsi %41, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.5977745Z %45 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.5978042Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.5978343Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.5978693Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.5979017Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.5979310Z %50 = arith.cmpi eq, 
%47, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.5979558Z %51 = tt.broadcast %50 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.5979826Z %52 = tt.broadcast %48 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.5980108Z %53 = arith.select %51, %52, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.5980368Z %54 = arith.cmpi eq, %47, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.5980617Z %55 = tt.broadcast %49 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.5980879Z %56 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.5981168Z %57 = arith.select %56, %55, %53 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.5981442Z %58 = tt.reshape %57 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:15.5981732Z %59 = arith.sitofp %58 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:15.5982088Z %60 = tt.dot %40, %59, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.5982406Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:15.5982604Z %61 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:04:15.5982792Z %62 = arith.addi %arg3, %61 : i32 2026-02-21T09:04:15.5982982Z %63 = arith.muli %62, %c2_i32 : i32 2026-02-21T09:04:15.5983180Z %64 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.5983380Z %65 = arith.addi %64, %11 : tensor<32xi32> 2026-02-21T09:04:15.5983633Z %66 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.5983894Z %67 = arith.muli %66, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.5984153Z %68 = tt.expand_dims %65 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.5984430Z %69 = tt.broadcast %67 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.5984691Z %70 = tt.broadcast %68 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.5984930Z %71 = arith.addi %69, %70 : tensor<64x32xi32> 
2026-02-21T09:04:15.5985168Z %72 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.5985450Z %73 = tt.addptr %72, %71 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.5985779Z %74 = tt.load %73 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.5986069Z %75 = arith.extf %74 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:15.5986377Z %76 = tt.descriptor_load %0[%62, %10] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:15.5986678Z %77 = arith.shli %76, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.5986895Z %78 = arith.shrsi %77, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.5987130Z %79 = arith.shrsi %76, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.5987375Z %80 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.5987691Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.5987996Z %82 = tt.expand_dims %81 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.5988302Z %83 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.5988616Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.5988898Z %85 = arith.cmpi eq, %82, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.5989142Z %86 = tt.broadcast %85 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.5989407Z %87 = tt.broadcast %83 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.5989707Z %88 = arith.select %86, %87, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.5989978Z %89 = arith.cmpi eq, %82, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.5990229Z %90 = tt.broadcast %84 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.5990487Z %91 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.5990759Z %92 = arith.select %91, %90, %88 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 
2026-02-21T09:04:15.5991027Z %93 = tt.reshape %92 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:15.5991284Z %94 = arith.sitofp %93 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:15.5991652Z %95 = tt.dot %75, %94, %60, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.5991983Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:04:15.5992175Z %96 = arith.muli %c16_i32, %c2_i32_7 : i32 2026-02-21T09:04:15.5992370Z %97 = arith.addi %arg3, %96 : i32 2026-02-21T09:04:15.5992551Z %98 = arith.muli %97, %c2_i32 : i32 2026-02-21T09:04:15.5992744Z %99 = tt.splat %98 : i32 -> tensor<32xi32> 2026-02-21T09:04:15.5992941Z %100 = arith.addi %99, %11 : tensor<32xi32> 2026-02-21T09:04:15.5993199Z %101 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.5993465Z %102 = arith.muli %101, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.5993731Z %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.5994031Z %104 = tt.broadcast %102 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.5994292Z %105 = tt.broadcast %103 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.5994537Z %106 = arith.addi %104, %105 : tensor<64x32xi32> 2026-02-21T09:04:15.5994778Z %107 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.5995079Z %108 = tt.addptr %107, %106 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.5995398Z %109 = tt.load %108 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.5995690Z %110 = arith.extf %109 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:15.5996008Z %111 = tt.descriptor_load %0[%97, %10] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:15.5996304Z %112 = arith.shli %111, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.5996531Z %113 = arith.shrsi %112, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.5996747Z %114 = arith.shrsi %111, 
%cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.5997027Z %115 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.5997327Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.5997635Z %117 = tt.expand_dims %116 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.5997966Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.5998314Z %119 = tt.expand_dims %114 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.5998608Z %120 = arith.cmpi eq, %117, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:15.5998891Z %121 = tt.broadcast %120 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.5999164Z %122 = tt.broadcast %118 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.5999457Z %123 = arith.select %121, %122, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.5999732Z %124 = arith.cmpi eq, %117, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.5999990Z %125 = tt.broadcast %119 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.6000252Z %126 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.6000539Z %127 = arith.select %126, %125, %123 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.6000850Z %128 = tt.reshape %127 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:15.6001115Z %129 = arith.sitofp %128 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:15.6001477Z %130 = tt.dot %110, %129, %95, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.6001817Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:15.6002013Z %131 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:04:15.6002202Z %132 = arith.addi %arg3, %131 : i32 2026-02-21T09:04:15.6002392Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:04:15.6002593Z %134 = tt.splat %133 : i32 -> 
tensor<32xi32> 2026-02-21T09:04:15.6002798Z %135 = arith.addi %134, %11 : tensor<32xi32> 2026-02-21T09:04:15.6003057Z %136 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.6003323Z %137 = arith.muli %136, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:15.6003589Z %138 = tt.expand_dims %135 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.6003881Z %139 = tt.broadcast %137 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.6004179Z %140 = tt.broadcast %138 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.6004443Z %141 = arith.addi %139, %140 : tensor<64x32xi32> 2026-02-21T09:04:15.6004704Z %142 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.6005018Z %143 = tt.addptr %142, %141 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.6005347Z %144 = tt.load %143 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.6005666Z %145 = arith.extf %144 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:15.6006006Z %146 = tt.descriptor_load %0[%132, %10] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:15.6006322Z %147 = arith.shli %146, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.6006559Z %148 = arith.shrsi %147, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.6006788Z %149 = arith.shrsi %146, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:15.6007050Z %150 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:15.6007355Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:15.6007690Z %152 = tt.expand_dims %151 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:15.6008042Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.6008408Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:15.6008711Z %155 = arith.cmpi eq, %152, %cst_2 : 
tensor<1x2x1xi32> 2026-02-21T09:04:15.6008974Z %156 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.6009265Z %157 = tt.broadcast %153 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.6009566Z %158 = arith.select %156, %157, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.6009889Z %159 = arith.cmpi eq, %152, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:15.6010191Z %160 = tt.broadcast %154 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:15.6010474Z %161 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:15.6010771Z %162 = arith.select %161, %160, %158 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:15.6011064Z %163 = tt.reshape %162 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:15.6011345Z %164 = arith.sitofp %163 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:15.6011753Z %165 = tt.dot %145, %164, %130, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:15.6012081Z scf.yield %165 : tensor<64x32xf32> 2026-02-21T09:04:15.6012268Z } 2026-02-21T09:04:15.6012478Z %19 = arith.truncf %18 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:15.6012786Z %20 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:15.6013056Z %21 = arith.muli %20, %cst_0 : tensor<64x1xi32> 2026-02-21T09:04:15.6013326Z %22 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:15.6013629Z %23 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.6013894Z %24 = tt.broadcast %22 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:15.6014129Z %25 = arith.addi %23, %24 : tensor<64x32xi32> 2026-02-21T09:04:15.6014368Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.6014651Z %27 = tt.addptr %26, %25 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:15.6014902Z tt.store %27, 
%19 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:15.6015098Z tt.return 2026-02-21T09:04:15.6015234Z } 2026-02-21T09:04:15.6015352Z } 2026-02-21T09:04:15.6015420Z 2026-02-21T09:04:15.6015481Z {-# 2026-02-21T09:04:15.6015613Z external_resources: { 2026-02-21T09:04:15.6015778Z mlir_reproducer: { 2026-02-21T09:04:15.6020181Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:15.6024634Z disable_threading: false, 2026-02-21T09:04:15.6024805Z verify_each: true 2026-02-21T09:04:15.6025008Z } 2026-02-21T09:04:15.6025128Z } 2026-02-21T09:04:15.6025250Z #-} 2026-02-21T09:04:15.6025708Z /tmp/torchinductor_root/g4/cg4kryeis5xxwsd2qpjsodim7pykaxcerxijk32saw54f2yffnqk.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:15.6026732Z /tmp/torchinductor_root/g4/cg4kryeis5xxwsd2qpjsodim7pykaxcerxijk32saw54f2yffnqk.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:15.6027559Z [221s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:15.6028657Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:15.6029602Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:15.6029866Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:04:15.6030456Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 31.0 configs/s 2026-02-21T09:04:18.1325476Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 393.4 2026-02-21T09:04:18.1328632Z configs/s 2026-02-21T09:04:18.2334277Z [223s] Generation 8 complete: 2026-02-21T09:04:18.2338584Z error=50 2026-02-21T09:04:18.2342948Z ok=30 2026-02-21T09:04:18.2346260Z min=0.1095 2026-02-21T09:04:18.2350734Z mid=0.1628 2026-02-21T09:04:18.2355072Z max=1.8217 2026-02-21T09:04:18.2356499Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:04:18.2356783Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:04:18.2357004Z 'l2_groupings': [1], 2026-02-21T09:04:18.2357186Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:04:18.2357380Z 'loop_orders': [[0, 1]], 2026-02-21T09:04:18.2357551Z 'num_stages': 3, 2026-02-21T09:04:18.2357700Z 'num_warps': 1, 2026-02-21T09:04:18.2357854Z 'pid_type': 'flat', 2026-02-21T09:04:18.2358015Z 'range_flattens': [None, None], 2026-02-21T09:04:18.2358206Z 'range_multi_buffers': [None, True], 2026-02-21T09:04:18.2358403Z 'range_num_stages': [0, 0], 2026-02-21T09:04:18.2358578Z 'range_unroll_factors': [0, 0], 2026-02-21T09:04:18.2358785Z 'range_warp_specializes': [None, None]} 2026-02-21T09:04:18.2359022Z [223s] Fitting surrogate: 853 points, 853 targets 2026-02-21T09:04:19.0252111Z [224s] Generation 9 starting: 47 neighbors, 3 active search path(s) 2026-02-21T09:04:24.2428363Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49/49 6.1 configs/s 2026-02-21T09:04:26.3874769Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:26.3876415Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:26.3876713Z ^ 2026-02-21T09:04:26.3877136Z /tmp/torchinductor_root/3y/c3yvrk3jfk47mbdlxjsc7anrgqljw5s26c65h4kxkwffuuio5ns4.py:78:36: note: called from 
2026-02-21T09:04:26.3877601Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:26.3877820Z ^ 2026-02-21T09:04:26.3878513Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:26.3879055Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:26.3879324Z ^ 2026-02-21T09:04:26.3879514Z module { 2026-02-21T09:04:26.3880055Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:26.3880765Z %cst = arith.constant dense<0> : tensor<32x2x64xi8> 2026-02-21T09:04:26.3881068Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:04:26.3881299Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:26.3881496Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:26.3883125Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:26.3883347Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:26.3883588Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:26.3883826Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:26.3884062Z %cst_3 = arith.constant dense<4> : tensor<32x64xi8> 2026-02-21T09:04:26.3884295Z %cst_4 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:04:26.3884606Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:26.3884817Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:26.3885047Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:04:26.3885278Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:26.3885465Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:26.3885644Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:26.3885820Z %0 = tt.get_program_id x : i32 2026-02-21T09:04:26.3886004Z %1 = arith.divsi %0, %c448_i32 : i32 
2026-02-21T09:04:26.3886188Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:04:26.3886374Z %3 = arith.subi %c1_i32, %2 : i32 2026-02-21T09:04:26.3886549Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:04:26.3886741Z %5 = arith.remsi %0, %c448_i32 : i32 2026-02-21T09:04:26.3886924Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:04:26.3887105Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:04:26.3887280Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:04:26.3887449Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T09:04:26.3887683Z %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:26.3887935Z %11 = tt.splat %9 : i32 -> tensor<64xi32> 2026-02-21T09:04:26.3888143Z %12 = arith.addi %11, %10 : tensor<64xi32> 2026-02-21T09:04:26.3888334Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:04:26.3888533Z %14 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:04:26.3888729Z %15 = arith.addi %14, %10 : tensor<64xi32> 2026-02-21T09:04:26.3888926Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:04:26.3889255Z %16 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:04:26.3889625Z %26 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:26.3889891Z %27 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T09:04:26.3890095Z %28 = arith.addi %27, %26 : tensor<32xi32> 2026-02-21T09:04:26.3890297Z %29 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:26.3890491Z %30 = tt.splat %29 : i32 -> tensor<64xi32> 2026-02-21T09:04:26.3890719Z %31 = arith.addi %30, %10 : tensor<64xi32> 2026-02-21T09:04:26.3890971Z %32 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:26.3891243Z %33 = arith.muli %32, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:26.3891498Z %34 = tt.expand_dims %31 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:26.3891852Z %35 = tt.broadcast %33 : tensor<64x1xi32> -> 
tensor<64x64xi32> 2026-02-21T09:04:26.3892163Z %36 = tt.broadcast %34 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:26.3892401Z %37 = arith.addi %35, %36 : tensor<64x64xi32> 2026-02-21T09:04:26.3892661Z %38 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3892946Z %39 = tt.addptr %38, %37 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:26.3893212Z %40 = tt.load %39 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3893487Z %41 = arith.extf %40 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:26.3893774Z %42 = tt.expand_dims %28 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:26.3894075Z %43 = arith.muli %42, %cst_4 : tensor<32x1xi32> 2026-02-21T09:04:26.3894324Z %44 = tt.expand_dims %15 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:26.3894605Z %45 = tt.broadcast %43 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:26.3894855Z %46 = tt.broadcast %44 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:26.3895087Z %47 = arith.addi %45, %46 : tensor<32x64xi32> 2026-02-21T09:04:26.3895324Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:26.3895600Z %49 = tt.addptr %48, %47 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:26.3895925Z %50 = tt.load %49 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:26.3896191Z %51 = arith.shli %50, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3896408Z %52 = arith.shrsi %51, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3896618Z %53 = arith.shrsi %50, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3896866Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:26.3897151Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:26.3897457Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:26.3897774Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<32x64xi8> 
-> tensor<32x1x64xi8> 2026-02-21T09:04:26.3898086Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:26.3898368Z %59 = arith.cmpi eq, %56, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:26.3898615Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:26.3898889Z %61 = tt.broadcast %57 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:26.3899173Z %62 = arith.select %60, %61, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:26.3899437Z %63 = arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:26.3899685Z %64 = tt.broadcast %58 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:26.3899944Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:26.3900219Z %66 = arith.select %65, %64, %62 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:26.3900488Z %67 = tt.reshape %66 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:26.3900746Z %68 = arith.sitofp %67 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:26.3901102Z %69 = tt.dot %41, %68, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:26.3901421Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:04:26.3901651Z %70 = arith.muli %c32_i32, %c1_i32_7 : i32 2026-02-21T09:04:26.3901844Z %71 = arith.addi %arg3, %70 : i32 2026-02-21T09:04:26.3902075Z %72 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:26.3902316Z %73 = tt.splat %71 : i32 -> tensor<32xi32> 2026-02-21T09:04:26.3902517Z %74 = arith.addi %73, %72 : tensor<32xi32> 2026-02-21T09:04:26.3902709Z %75 = arith.muli %71, %c2_i32 : i32 2026-02-21T09:04:26.3902895Z %76 = tt.splat %75 : i32 -> tensor<64xi32> 2026-02-21T09:04:26.3903126Z %77 = arith.addi %76, %10 : tensor<64xi32> 2026-02-21T09:04:26.3903371Z %78 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:26.3903648Z %79 
= arith.muli %78, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:26.3903899Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:26.3904185Z %81 = tt.broadcast %79 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:26.3904480Z %82 = tt.broadcast %80 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:26.3904707Z %83 = arith.addi %81, %82 : tensor<64x64xi32> 2026-02-21T09:04:26.3905014Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3905304Z %85 = tt.addptr %84, %83 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:26.3905560Z %86 = tt.load %85 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3905802Z %87 = arith.extf %86 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:26.3906086Z %88 = tt.expand_dims %74 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:26.3906343Z %89 = arith.muli %88, %cst_4 : tensor<32x1xi32> 2026-02-21T09:04:26.3906598Z %90 = tt.expand_dims %15 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:26.3906878Z %91 = tt.broadcast %89 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:26.3907164Z %92 = tt.broadcast %90 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:26.3907391Z %93 = arith.addi %91, %92 : tensor<32x64xi32> 2026-02-21T09:04:26.3907642Z %94 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:26.3907928Z %95 = tt.addptr %94, %93 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:26.3908227Z %96 = tt.load %95 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:26.3908509Z %97 = arith.shli %96, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3908729Z %98 = arith.shrsi %97, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3908956Z %99 = arith.shrsi %96, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3909210Z %100 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:26.3909524Z %101 = tt.expand_dims %100 {axis = 0 : i32} : 
tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:26.3909859Z %102 = tt.expand_dims %101 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:26.3910197Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:26.3910536Z %104 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:26.3910832Z %105 = arith.cmpi eq, %102, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:26.3911102Z %106 = tt.broadcast %105 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:26.3911390Z %107 = tt.broadcast %103 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:26.3911739Z %108 = arith.select %106, %107, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:26.3912034Z %109 = arith.cmpi eq, %102, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:26.3912294Z %110 = tt.broadcast %104 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:26.3912581Z %111 = tt.broadcast %109 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:26.3912874Z %112 = arith.select %111, %110, %108 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:26.3913178Z %113 = tt.reshape %112 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:26.3913461Z %114 = arith.sitofp %113 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:26.3913829Z %115 = tt.dot %87, %114, %69, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:26.3914172Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:04:26.3914379Z %116 = arith.muli %c32_i32, %c2_i32_8 : i32 2026-02-21T09:04:26.3914626Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T09:04:26.3914866Z %118 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:26.3915132Z %119 = tt.splat %117 : i32 -> tensor<32xi32> 2026-02-21T09:04:26.3915354Z %120 = arith.addi %119, %118 : tensor<32xi32> 2026-02-21T09:04:26.3915557Z %121 = arith.muli %117, %c2_i32 : i32 
2026-02-21T09:04:26.3915770Z %122 = tt.splat %121 : i32 -> tensor<64xi32> 2026-02-21T09:04:26.3916009Z %123 = arith.addi %122, %10 : tensor<64xi32> 2026-02-21T09:04:26.3916278Z %124 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:26.3916619Z %125 = arith.muli %124, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:26.3916904Z %126 = tt.expand_dims %123 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:26.3917218Z %127 = tt.broadcast %125 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:26.3917501Z %128 = tt.broadcast %126 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:26.3917747Z %129 = arith.addi %127, %128 : tensor<64x64xi32> 2026-02-21T09:04:26.3917987Z %130 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3918279Z %131 = tt.addptr %130, %129 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:26.3918542Z %132 = tt.load %131 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3918818Z %133 = arith.extf %132 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:26.3919118Z %134 = tt.expand_dims %120 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:26.3919389Z %135 = arith.muli %134, %cst_4 : tensor<32x1xi32> 2026-02-21T09:04:26.3919657Z %136 = tt.expand_dims %15 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:26.3919940Z %137 = tt.broadcast %135 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:26.3920210Z %138 = tt.broadcast %136 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:26.3920455Z %139 = arith.addi %137, %138 : tensor<32x64xi32> 2026-02-21T09:04:26.3920688Z %140 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:26.3920971Z %141 = tt.addptr %140, %139 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:26.3921273Z %142 = tt.load %141 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:26.3921589Z %143 = arith.shli %142, %cst_3 : tensor<32x64xi8> 
2026-02-21T09:04:26.3921811Z %144 = arith.shrsi %143, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3922036Z %145 = arith.shrsi %142, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3922290Z %146 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:26.3922578Z %147 = tt.expand_dims %146 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:26.3922893Z %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:26.3923215Z %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:26.3923549Z %150 = tt.expand_dims %145 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:26.3923840Z %151 = arith.cmpi eq, %148, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:26.3924091Z %152 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:26.3924370Z %153 = tt.broadcast %149 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:26.3924655Z %154 = arith.select %152, %153, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:26.3924932Z %155 = arith.cmpi eq, %148, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:26.3925180Z %156 = tt.broadcast %150 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:26.3925456Z %157 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:26.3925740Z %158 = arith.select %157, %156, %154 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:26.3926066Z %159 = tt.reshape %158 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:26.3926335Z %160 = arith.sitofp %159 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:26.3926687Z %161 = tt.dot %133, %160, %115, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:26.3927010Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:26.3927199Z %162 = arith.muli %c32_i32, %c3_i32 : i32 2026-02-21T09:04:26.3927423Z %163 = arith.addi %arg3, %162 : 
i32 2026-02-21T09:04:26.3927658Z %164 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:26.3927928Z %165 = tt.splat %163 : i32 -> tensor<32xi32> 2026-02-21T09:04:26.3928140Z %166 = arith.addi %165, %164 : tensor<32xi32> 2026-02-21T09:04:26.3928330Z %167 = arith.muli %163, %c2_i32 : i32 2026-02-21T09:04:26.3928528Z %168 = tt.splat %167 : i32 -> tensor<64xi32> 2026-02-21T09:04:26.3928726Z %169 = arith.addi %168, %10 : tensor<64xi32> 2026-02-21T09:04:26.3928984Z %170 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:26.3929256Z %171 = arith.muli %170, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:26.3929511Z %172 = tt.expand_dims %169 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:26.3929832Z %173 = tt.broadcast %171 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:26.3930100Z %174 = tt.broadcast %172 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:26.3930353Z %175 = arith.addi %173, %174 : tensor<64x64xi32> 2026-02-21T09:04:26.3930604Z %176 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3930909Z %177 = tt.addptr %176, %175 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:26.3931181Z %178 = tt.load %177 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3931419Z %179 = arith.extf %178 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:26.3931756Z %180 = tt.expand_dims %166 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:26.3932020Z %181 = arith.muli %180, %cst_4 : tensor<32x1xi32> 2026-02-21T09:04:26.3932283Z %182 = tt.expand_dims %15 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:26.3932578Z %183 = tt.broadcast %181 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:26.3932838Z %184 = tt.broadcast %182 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:26.3933083Z %185 = arith.addi %183, %184 : tensor<32x64xi32> 2026-02-21T09:04:26.3933319Z %186 = tt.splat 
%arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:26.3933606Z %187 = tt.addptr %186, %185 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:26.3933910Z %188 = tt.load %187 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:26.3934184Z %189 = arith.shli %188, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3934411Z %190 = arith.shrsi %189, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3934629Z %191 = arith.shrsi %188, %cst_3 : tensor<32x64xi8> 2026-02-21T09:04:26.3934884Z %192 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:26.3935172Z %193 = tt.expand_dims %192 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:26.3935493Z %194 = tt.expand_dims %193 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:26.3935813Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:26.3936146Z %196 = tt.expand_dims %191 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:26.3936434Z %197 = arith.cmpi eq, %194, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:26.3936685Z %198 = tt.broadcast %197 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:26.3936962Z %199 = tt.broadcast %195 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:26.3937276Z %200 = arith.select %198, %199, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:26.3937560Z %201 = arith.cmpi eq, %194, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:26.3937812Z %202 = tt.broadcast %196 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:26.3938083Z %203 = tt.broadcast %201 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:26.3938397Z %204 = arith.select %203, %202, %200 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:26.3938675Z %205 = tt.reshape %204 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:26.3938965Z %206 = arith.sitofp %205 : tensor<64x64xi8> to tensor<64x64xf32> 
2026-02-21T09:04:26.3939315Z %207 = tt.dot %179, %206, %161, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:26.3939635Z scf.yield %207 : tensor<64x64xf32> 2026-02-21T09:04:26.3939829Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:26.3940051Z %17 = arith.truncf %16 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:04:26.3940346Z %18 = tt.expand_dims %12 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:26.3940608Z %19 = arith.muli %18, %cst_0 : tensor<64x1xi32> 2026-02-21T09:04:26.3940867Z %20 = tt.expand_dims %15 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:26.3941197Z %21 = tt.broadcast %19 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:26.3941458Z %22 = tt.broadcast %20 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:26.3941719Z %23 = arith.addi %21, %22 : tensor<64x64xi32> 2026-02-21T09:04:26.3941961Z %24 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3942245Z %25 = tt.addptr %24, %23 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:26.3942497Z tt.store %25, %17 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:26.3942693Z tt.return 2026-02-21T09:04:26.3942827Z } 2026-02-21T09:04:26.3942960Z } 2026-02-21T09:04:26.3943030Z 2026-02-21T09:04:26.3943091Z {-# 2026-02-21T09:04:26.3943225Z external_resources: { 2026-02-21T09:04:26.3943404Z mlir_reproducer: { 2026-02-21T09:04:26.3947758Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:26.3952462Z disable_threading: false, 2026-02-21T09:04:26.3952639Z verify_each: true 2026-02-21T09:04:26.3952801Z } 2026-02-21T09:04:26.3952934Z } 2026-02-21T09:04:26.3953071Z #-} 2026-02-21T09:04:26.3953514Z /tmp/torchinductor_root/3y/c3yvrk3jfk47mbdlxjsc7anrgqljw5s26c65h4kxkwffuuio5ns4.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:26.3954597Z /tmp/torchinductor_root/3y/c3yvrk3jfk47mbdlxjsc7anrgqljw5s26c65h4kxkwffuuio5ns4.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:26.3955525Z [231s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:26.3956607Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:26.3957576Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:26.3957852Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:04:27.0860340Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:27.0862021Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:27.0862404Z ^ 2026-02-21T09:04:27.0862803Z /tmp/torchinductor_root/ky/ckyqyzinnsfkiyq2hf6f5svqam4r2ykgltpgyyeqw6tzlbbmklsv.py:87:40: note: called from 2026-02-21T09:04:27.0863206Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:27.0863415Z ^ 2026-02-21T09:04:27.0863858Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:27.0869105Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:27.0870418Z ^ 2026-02-21T09:04:27.0870750Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:04:27.0876659Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:27.0879102Z %cst = arith.constant dense<0> : tensor<8x2x32xi8> 2026-02-21T09:04:27.0879382Z %c896_i32 = arith.constant 896 : i32 
2026-02-21T09:04:27.0879600Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:04:27.0879799Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:27.0879988Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:27.0880175Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:27.0880383Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:27.0880636Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:27.0880869Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:27.0881109Z %cst_3 = arith.constant dense<4> : tensor<8x32xi8> 2026-02-21T09:04:27.0881347Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:04:27.0881681Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:27.0881900Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:27.0882129Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:04:27.0882384Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:27.0882577Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:27.0882778Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:27.0882961Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:04:27.0883337Z %0 = tt.get_program_id x : i32 2026-02-21T09:04:27.0883527Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:04:27.0883709Z %2 = arith.minsi %1, %c224_i32 : i32 2026-02-21T09:04:27.0883916Z scf.for %arg3 = %0 to %2 step %c1_i32 : i32 { 2026-02-21T09:04:27.0884125Z %3 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:04:27.0884319Z %4 = arith.muli %3, %c4_i32 : i32 2026-02-21T09:04:27.0884589Z %5 = arith.subi %c1_i32, %4 : i32 2026-02-21T09:04:27.0884777Z %6 = arith.minsi %5, %c4_i32 : i32 2026-02-21T09:04:27.0884963Z %7 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:04:27.0885192Z %8 = arith.remsi %7, %6 : i32 2026-02-21T09:04:27.0885378Z %9 = arith.addi %4, %8 : i32 2026-02-21T09:04:27.0885552Z %10 = arith.divsi %7, %6 : i32 2026-02-21T09:04:27.0885787Z %11 = arith.muli %9, %c64_i32 : i32 
2026-02-21T09:04:27.0886035Z %12 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:27.0886296Z %13 = tt.splat %11 : i32 -> tensor<64xi32> 2026-02-21T09:04:27.0886508Z %14 = arith.addi %13, %12 : tensor<64xi32> 2026-02-21T09:04:27.0886702Z %15 = arith.muli %10, %c32_i32 : i32 2026-02-21T09:04:27.0886935Z %16 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:27.0887182Z %17 = tt.splat %15 : i32 -> tensor<32xi32> 2026-02-21T09:04:27.0887414Z %18 = arith.addi %17, %16 : tensor<32xi32> 2026-02-21T09:04:27.0887611Z %c32_i32_7 = arith.constant 32 : i32 2026-02-21T09:04:27.0887935Z %19 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32_7 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:04:27.0888306Z %29 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:04:27.0888551Z %30 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:04:27.0888761Z %31 = arith.addi %30, %29 : tensor<8xi32> 2026-02-21T09:04:27.0888963Z %32 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:04:27.0889188Z %33 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:04:27.0889444Z %34 = tt.splat %32 : i32 -> tensor<16xi32> 2026-02-21T09:04:27.0889644Z %35 = arith.addi %34, %33 : tensor<16xi32> 2026-02-21T09:04:27.0889904Z %36 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:27.0890178Z %37 = arith.muli %36, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:27.0890443Z %38 = tt.expand_dims %35 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:04:27.0890737Z %39 = tt.broadcast %37 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:04:27.0890999Z %40 = tt.broadcast %38 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:04:27.0891241Z %41 = arith.addi %39, %40 : tensor<64x16xi32> 2026-02-21T09:04:27.0891483Z %42 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:04:27.0891829Z %43 = 
tt.addptr %42, %41 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:04:27.0892095Z %44 = tt.load %43 : tensor<64x16x!tt.ptr> 2026-02-21T09:04:27.0892333Z %45 = arith.extf %44 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:04:27.0892624Z %46 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:04:27.0892889Z %47 = arith.muli %46, %cst_4 : tensor<8x1xi32> 2026-02-21T09:04:27.0893155Z %48 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.0893436Z %49 = tt.broadcast %47 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:04:27.0893703Z %50 = tt.broadcast %48 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:04:27.0893949Z %51 = arith.addi %49, %50 : tensor<8x32xi32> 2026-02-21T09:04:27.0894190Z %52 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:04:27.0894475Z %53 = tt.addptr %52, %51 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:04:27.0894802Z %54 = tt.load %53 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:04:27.0895072Z %55 = arith.shli %54, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0895285Z %56 = arith.shrsi %55, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0895505Z %57 = arith.shrsi %54, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0895749Z %58 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:27.0896063Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:27.0896406Z %60 = tt.expand_dims %59 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:27.0896721Z %61 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:27.0897041Z %62 = tt.expand_dims %57 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:27.0897318Z %63 = arith.cmpi eq, %60, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:27.0897568Z %64 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 
2026-02-21T09:04:27.0897842Z %65 = tt.broadcast %61 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:27.0898115Z %66 = arith.select %64, %65, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:27.0898410Z %67 = arith.cmpi eq, %60, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:27.0898661Z %68 = tt.broadcast %62 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:27.0898931Z %69 = tt.broadcast %67 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:27.0899204Z %70 = arith.select %69, %68, %66 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:27.0899476Z %71 = tt.reshape %70 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:04:27.0899737Z %72 = arith.sitofp %71 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:04:27.0900089Z %73 = tt.dot %45, %72, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:27.0900422Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:04:27.0900621Z %74 = arith.muli %c8_i32, %c1_i32_8 : i32 2026-02-21T09:04:27.0900825Z %75 = arith.addi %arg4, %74 : i32 2026-02-21T09:04:27.0901060Z %76 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:04:27.0901304Z %77 = tt.splat %75 : i32 -> tensor<8xi32> 2026-02-21T09:04:27.0901512Z %78 = arith.addi %77, %76 : tensor<8xi32> 2026-02-21T09:04:27.0901756Z %79 = arith.muli %75, %c2_i32 : i32 2026-02-21T09:04:27.0901990Z %80 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:04:27.0902233Z %81 = tt.splat %79 : i32 -> tensor<16xi32> 2026-02-21T09:04:27.0902440Z %82 = arith.addi %81, %80 : tensor<16xi32> 2026-02-21T09:04:27.0902696Z %83 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:27.0902961Z %84 = arith.muli %83, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:27.0903219Z %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:04:27.0903503Z %86 = tt.broadcast %84 : tensor<64x1xi32> 
-> tensor<64x16xi32> 2026-02-21T09:04:27.0903765Z %87 = tt.broadcast %85 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:04:27.0904000Z %88 = arith.addi %86, %87 : tensor<64x16xi32> 2026-02-21T09:04:27.0904252Z %89 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:04:27.0904546Z %90 = tt.addptr %89, %88 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:04:27.0904801Z %91 = tt.load %90 : tensor<64x16x!tt.ptr> 2026-02-21T09:04:27.0905045Z %92 = arith.extf %91 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:04:27.0905321Z %93 = tt.expand_dims %78 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:04:27.0905621Z %94 = arith.muli %93, %cst_4 : tensor<8x1xi32> 2026-02-21T09:04:27.0905880Z %95 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.0906181Z %96 = tt.broadcast %94 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:04:27.0906449Z %97 = tt.broadcast %95 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:04:27.0906682Z %98 = arith.addi %96, %97 : tensor<8x32xi32> 2026-02-21T09:04:27.0906949Z %99 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:04:27.0907220Z %100 = tt.addptr %99, %98 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:04:27.0907570Z %101 = tt.load %100 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:04:27.0907855Z %102 = arith.shli %101, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0908080Z %103 = arith.shrsi %102, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0908312Z %104 = arith.shrsi %101, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0908566Z %105 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:27.0908874Z %106 = tt.expand_dims %105 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:27.0909198Z %107 = tt.expand_dims %106 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:27.0909565Z %108 = tt.expand_dims %103 {axis = 1 : i32} : 
tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:27.0909897Z %109 = tt.expand_dims %104 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:27.0910182Z %110 = arith.cmpi eq, %107, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:27.0910448Z %111 = tt.broadcast %110 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:27.0910719Z %112 = tt.broadcast %108 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:27.0911015Z %113 = arith.select %111, %112, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:27.0911303Z %114 = arith.cmpi eq, %107, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:27.0911613Z %115 = tt.broadcast %109 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:27.0911891Z %116 = tt.broadcast %114 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:27.0912170Z %117 = arith.select %116, %115, %113 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:27.0912460Z %118 = tt.reshape %117 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:04:27.0912722Z %119 = arith.sitofp %118 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:04:27.0913082Z %120 = tt.dot %92, %119, %73, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:27.0913411Z %c2_i32_9 = arith.constant 2 : i32 2026-02-21T09:04:27.0913607Z %121 = arith.muli %c8_i32, %c2_i32_9 : i32 2026-02-21T09:04:27.0913813Z %122 = arith.addi %arg4, %121 : i32 2026-02-21T09:04:27.0914042Z %123 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:04:27.0914296Z %124 = tt.splat %122 : i32 -> tensor<8xi32> 2026-02-21T09:04:27.0914504Z %125 = arith.addi %124, %123 : tensor<8xi32> 2026-02-21T09:04:27.0914709Z %126 = arith.muli %122, %c2_i32 : i32 2026-02-21T09:04:27.0914948Z %127 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:04:27.0915203Z %128 = tt.splat %126 : i32 -> tensor<16xi32> 2026-02-21T09:04:27.0915419Z %129 = arith.addi %128, %127 : tensor<16xi32> 
2026-02-21T09:04:27.0915671Z %130 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:27.0915955Z %131 = arith.muli %130, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:27.0916232Z %132 = tt.expand_dims %129 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:04:27.0916544Z %133 = tt.broadcast %131 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:04:27.0916860Z %134 = tt.broadcast %132 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:04:27.0917113Z %135 = arith.addi %133, %134 : tensor<64x16xi32> 2026-02-21T09:04:27.0917379Z %136 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:04:27.0917684Z %137 = tt.addptr %136, %135 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:04:27.0917973Z %138 = tt.load %137 : tensor<64x16x!tt.ptr> 2026-02-21T09:04:27.0918252Z %139 = arith.extf %138 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:04:27.0918564Z %140 = tt.expand_dims %125 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:04:27.0918878Z %141 = arith.muli %140, %cst_4 : tensor<8x1xi32> 2026-02-21T09:04:27.0919151Z %142 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.0919466Z %143 = tt.broadcast %141 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:04:27.0919744Z %144 = tt.broadcast %142 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:04:27.0920008Z %145 = arith.addi %143, %144 : tensor<8x32xi32> 2026-02-21T09:04:27.0920265Z %146 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:04:27.0920553Z %147 = tt.addptr %146, %145 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:04:27.0920903Z %148 = tt.load %147 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:04:27.0921198Z %149 = arith.shli %148, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0921439Z %150 = arith.shrsi %149, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0921710Z %151 = arith.shrsi %148, %cst_3 : tensor<8x32xi8> 
2026-02-21T09:04:27.0921975Z %152 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:27.0922292Z %153 = tt.expand_dims %152 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:27.0922622Z %154 = tt.expand_dims %153 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:27.0922971Z %155 = tt.expand_dims %150 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:27.0923305Z %156 = tt.expand_dims %151 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:27.0923612Z %157 = arith.cmpi eq, %154, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:27.0923888Z %158 = tt.broadcast %157 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:27.0924181Z %159 = tt.broadcast %155 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:27.0924477Z %160 = arith.select %158, %159, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:27.0924753Z %161 = arith.cmpi eq, %154, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:27.0925015Z %162 = tt.broadcast %156 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:27.0925281Z %163 = tt.broadcast %161 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:27.0925567Z %164 = arith.select %163, %162, %160 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:27.0925856Z %165 = tt.reshape %164 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:04:27.0926113Z %166 = arith.sitofp %165 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:04:27.0926477Z %167 = tt.dot %139, %166, %120, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:27.0926802Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:27.0927002Z %168 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:04:27.0927203Z %169 = arith.addi %arg4, %168 : i32 2026-02-21T09:04:27.0927432Z %170 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:04:27.0927687Z %171 = tt.splat %169 : i32 -> tensor<8xi32> 
2026-02-21T09:04:27.0927894Z %172 = arith.addi %171, %170 : tensor<8xi32> 2026-02-21T09:04:27.0928099Z %173 = arith.muli %169, %c2_i32 : i32 2026-02-21T09:04:27.0928386Z %174 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:04:27.0928645Z %175 = tt.splat %173 : i32 -> tensor<16xi32> 2026-02-21T09:04:27.0928857Z %176 = arith.addi %175, %174 : tensor<16xi32> 2026-02-21T09:04:27.0929112Z %177 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:27.0929393Z %178 = arith.muli %177, %cst_5 : tensor<64x1xi32> 2026-02-21T09:04:27.0929682Z %179 = tt.expand_dims %176 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:04:27.0929985Z %180 = tt.broadcast %178 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:04:27.0930274Z %181 = tt.broadcast %179 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:04:27.0930524Z %182 = arith.addi %180, %181 : tensor<64x16xi32> 2026-02-21T09:04:27.0930779Z %183 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:04:27.0931073Z %184 = tt.addptr %183, %182 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:04:27.0931353Z %185 = tt.load %184 : tensor<64x16x!tt.ptr> 2026-02-21T09:04:27.0931622Z %186 = arith.extf %185 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:04:27.0931915Z %187 = tt.expand_dims %172 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:04:27.0932207Z %188 = arith.muli %187, %cst_4 : tensor<8x1xi32> 2026-02-21T09:04:27.0932474Z %189 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.0932770Z %190 = tt.broadcast %188 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:04:27.0933031Z %191 = tt.broadcast %189 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:04:27.0933272Z %192 = arith.addi %190, %191 : tensor<8x32xi32> 2026-02-21T09:04:27.0933506Z %193 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:04:27.0933790Z %194 = tt.addptr 
%193, %192 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:04:27.0934095Z %195 = tt.load %194 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:04:27.0934371Z %196 = arith.shli %195, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0934600Z %197 = arith.shrsi %196, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0934814Z %198 = arith.shrsi %195, %cst_3 : tensor<8x32xi8> 2026-02-21T09:04:27.0935068Z %199 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:27.0935359Z %200 = tt.expand_dims %199 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:27.0935683Z %201 = tt.expand_dims %200 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:27.0936008Z %202 = tt.expand_dims %197 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:27.0936324Z %203 = tt.expand_dims %198 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:27.0936613Z %204 = arith.cmpi eq, %201, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:27.0936865Z %205 = tt.broadcast %204 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:27.0937142Z %206 = tt.broadcast %202 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:27.0937424Z %207 = arith.select %205, %206, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:27.0937707Z %208 = arith.cmpi eq, %201, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:27.0937982Z %209 = tt.broadcast %203 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:27.0938247Z %210 = tt.broadcast %208 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:27.0938532Z %211 = arith.select %210, %209, %207 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:27.0938818Z %212 = tt.reshape %211 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:04:27.0939077Z %213 = arith.sitofp %212 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:04:27.0939471Z %214 = tt.dot %186, %213, %167, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> 
tensor<64x32xf32> 2026-02-21T09:04:27.0939793Z scf.yield %214 : tensor<64x32xf32> 2026-02-21T09:04:27.0939993Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:27.0940209Z %20 = arith.truncf %19 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:27.0940503Z %21 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:27.0940819Z %22 = arith.muli %21, %cst_0 : tensor<64x1xi32> 2026-02-21T09:04:27.0941070Z %23 = tt.expand_dims %18 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.0941388Z %24 = tt.broadcast %22 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.0941691Z %25 = tt.broadcast %23 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.0941930Z %26 = arith.addi %24, %25 : tensor<64x32xi32> 2026-02-21T09:04:27.0942168Z %27 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.0942457Z %28 = tt.addptr %27, %26 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:27.0942723Z tt.store %28, %20 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.0942962Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:04:27.0943182Z tt.return 2026-02-21T09:04:27.0943313Z } 2026-02-21T09:04:27.0943438Z } 2026-02-21T09:04:27.0943533Z 2026-02-21T09:04:27.0943592Z {-# 2026-02-21T09:04:27.0943737Z external_resources: { 2026-02-21T09:04:27.0943901Z mlir_reproducer: { 2026-02-21T09:04:27.0948289Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:27.0952795Z disable_threading: false, 2026-02-21T09:04:27.0952966Z verify_each: true 2026-02-21T09:04:27.0953122Z } 2026-02-21T09:04:27.0953244Z } 2026-02-21T09:04:27.0953369Z #-} 2026-02-21T09:04:27.0953793Z /tmp/torchinductor_root/ky/ckyqyzinnsfkiyq2hf6f5svqam4r2ykgltpgyyeqw6tzlbbmklsv.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:27.0954827Z /tmp/torchinductor_root/ky/ckyqyzinnsfkiyq2hf6f5svqam4r2ykgltpgyyeqw6tzlbbmklsv.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:27.0955682Z [232s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:27.0956861Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[3, 0], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]), static_shapes=True) 2026-02-21T09:04:27.0957970Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:27.0958242Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:04:27.3569309Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:27.3573756Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:27.3575078Z ^ 2026-02-21T09:04:27.3575509Z /tmp/torchinductor_root/7k/c7k7cmcr7x3f3tm25q5aadrhwhiangvzlxybdkfzsl3zxzzi4huv.py:85:36: note: called from 2026-02-21T09:04:27.3575918Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:27.3576131Z ^ 2026-02-21T09:04:27.3576732Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:27.3577210Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:27.3577463Z ^ 2026-02-21T09:04:27.3577626Z module { 2026-02-21T09:04:27.3578104Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:27.3578665Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:04:27.3578881Z %c224_i32 = arith.constant 224 : i32 
2026-02-21T09:04:27.3579085Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:04:27.3579262Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:27.3579474Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:27.3579712Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:27.3579949Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:27.3580179Z %cst_3 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:04:27.3580407Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:27.3580615Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:27.3580833Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:04:27.3581065Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:27.3581241Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:27.3581420Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:27.3581760Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:27.3581956Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:27.3582145Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:27.3582322Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:27.3582678Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:27.3583003Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:27.3583177Z %2 = arith.divsi %1, %c4_i32 : i32 2026-02-21T09:04:27.3583361Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:04:27.3583545Z %4 = arith.subi %c224_i32, %3 : i32 2026-02-21T09:04:27.3583720Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:04:27.3583895Z %6 = arith.remsi %1, %c4_i32 : i32 2026-02-21T09:04:27.3584065Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:04:27.3584309Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:04:27.3584476Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:04:27.3584654Z %10 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:04:27.3584882Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 
2026-02-21T09:04:27.3585142Z %12 = tt.splat %10 : i32 -> tensor<32xi32> 2026-02-21T09:04:27.3585347Z %13 = arith.addi %12, %11 : tensor<32xi32> 2026-02-21T09:04:27.3585535Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:04:27.3585804Z %15 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:27.3586039Z %16 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:04:27.3586273Z %17 = arith.addi %16, %15 : tensor<64xi32> 2026-02-21T09:04:27.3586472Z %c64_i32_6 = arith.constant 64 : i32 2026-02-21T09:04:27.3586813Z %18 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c64_i32_6 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:04:27.3587160Z %28 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:27.3587369Z %29 = tt.splat %28 : i32 -> tensor<32xi32> 2026-02-21T09:04:27.3587588Z %30 = arith.addi %29, %11 : tensor<32xi32> 2026-02-21T09:04:27.3587848Z %31 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:27.3588131Z %32 = arith.muli %31, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:27.3588449Z %33 = tt.expand_dims %30 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.3588748Z %34 = tt.broadcast %32 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3589014Z %35 = tt.broadcast %33 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3589247Z %36 = arith.addi %34, %35 : tensor<64x32xi32> 2026-02-21T09:04:27.3589499Z %37 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3589784Z %38 = tt.addptr %37, %36 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:27.3590048Z %39 = tt.load %38 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3590283Z %40 = arith.extf %39 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:27.3590607Z %41 = tt.descriptor_load %0[%arg3, %10] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:27.3590916Z %42 = arith.shli %41, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3591129Z %43 = 
arith.shrsi %42, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3591350Z %44 = arith.shrsi %41, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3591633Z %45 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:27.3591930Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:27.3592229Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:27.3592549Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:27.3592867Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:27.3593140Z %50 = arith.cmpi eq, %47, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:27.3593390Z %51 = tt.broadcast %50 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:27.3593652Z %52 = tt.broadcast %48 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:27.3593938Z %53 = arith.select %51, %52, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:27.3594211Z %54 = arith.cmpi eq, %47, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:27.3594454Z %55 = tt.broadcast %49 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:27.3594726Z %56 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:27.3594992Z %57 = arith.select %56, %55, %53 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:27.3595262Z %58 = tt.reshape %57 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:27.3595514Z %59 = arith.sitofp %58 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:27.3595911Z %60 = tt.dot %40, %59, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:27.3596238Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:27.3596427Z %61 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:04:27.3596625Z %62 = arith.addi %arg3, %61 : i32 2026-02-21T09:04:27.3596804Z %63 = arith.muli %62, %c2_i32 : i32 
2026-02-21T09:04:27.3597029Z %64 = tt.splat %63 : i32 -> tensor<32xi32> 2026-02-21T09:04:27.3597225Z %65 = arith.addi %64, %11 : tensor<32xi32> 2026-02-21T09:04:27.3597509Z %66 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:27.3597780Z %67 = arith.muli %66, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:27.3598031Z %68 = tt.expand_dims %65 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.3598315Z %69 = tt.broadcast %67 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3598566Z %70 = tt.broadcast %68 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3598799Z %71 = arith.addi %69, %70 : tensor<64x32xi32> 2026-02-21T09:04:27.3599038Z %72 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3599328Z %73 = tt.addptr %72, %71 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:27.3599628Z %74 = tt.load %73 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3599862Z %75 = arith.extf %74 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:27.3600176Z %76 = tt.descriptor_load %0[%62, %10] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:27.3600466Z %77 = arith.shli %76, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3600682Z %78 = arith.shrsi %77, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3600891Z %79 = arith.shrsi %76, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3601138Z %80 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:27.3601432Z %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:27.3601781Z %82 = tt.expand_dims %81 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:27.3602097Z %83 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:27.3602407Z %84 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:27.3602689Z %85 = arith.cmpi eq, %82, %cst_2 : tensor<1x2x1xi32> 
2026-02-21T09:04:27.3602938Z %86 = tt.broadcast %85 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:27.3603206Z %87 = tt.broadcast %83 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:27.3603489Z %88 = arith.select %86, %87, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:27.3603753Z %89 = arith.cmpi eq, %82, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:27.3604005Z %90 = tt.broadcast %84 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:27.3604270Z %91 = tt.broadcast %89 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:27.3604547Z %92 = arith.select %91, %90, %88 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:27.3604826Z %93 = tt.reshape %92 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:27.3605077Z %94 = arith.sitofp %93 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:27.3605417Z %95 = tt.dot %75, %94, %60, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:27.3605723Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:04:27.3605928Z %96 = arith.muli %c16_i32, %c2_i32_7 : i32 2026-02-21T09:04:27.3606127Z %97 = arith.addi %arg3, %96 : i32 2026-02-21T09:04:27.3606319Z %98 = arith.muli %97, %c2_i32 : i32 2026-02-21T09:04:27.3606519Z %99 = tt.splat %98 : i32 -> tensor<32xi32> 2026-02-21T09:04:27.3606725Z %100 = arith.addi %99, %11 : tensor<32xi32> 2026-02-21T09:04:27.3607027Z %101 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:27.3607310Z %102 = arith.muli %101, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:27.3607587Z %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.3607894Z %104 = tt.broadcast %102 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3608205Z %105 = tt.broadcast %103 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3608460Z %106 = arith.addi %104, %105 : tensor<64x32xi32> 2026-02-21T09:04:27.3608746Z %107 = 
tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3609061Z %108 = tt.addptr %107, %106 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:27.3609338Z %109 = tt.load %108 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3609597Z %110 = arith.extf %109 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:27.3609934Z %111 = tt.descriptor_load %0[%97, %10] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:27.3610242Z %112 = arith.shli %111, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3610478Z %113 = arith.shrsi %112, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3610709Z %114 = arith.shrsi %111, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3610999Z %115 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:27.3611309Z %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:27.3611688Z %117 = tt.expand_dims %116 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:27.3612032Z %118 = tt.expand_dims %113 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:27.3612371Z %119 = tt.expand_dims %114 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:27.3612683Z %120 = arith.cmpi eq, %117, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:27.3612950Z %121 = tt.broadcast %120 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:27.3613248Z %122 = tt.broadcast %118 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:27.3613550Z %123 = arith.select %121, %122, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:27.3613850Z %124 = arith.cmpi eq, %117, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:27.3614121Z %125 = tt.broadcast %119 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:27.3614402Z %126 = tt.broadcast %124 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:27.3614713Z %127 = arith.select %126, %125, %123 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 
2026-02-21T09:04:27.3614998Z %128 = tt.reshape %127 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:27.3615269Z %129 = arith.sitofp %128 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:27.3615624Z %130 = tt.dot %110, %129, %95, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:27.3615938Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:27.3616134Z %131 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:04:27.3616326Z %132 = arith.addi %arg3, %131 : i32 2026-02-21T09:04:27.3616518Z %133 = arith.muli %132, %c2_i32 : i32 2026-02-21T09:04:27.3616713Z %134 = tt.splat %133 : i32 -> tensor<32xi32> 2026-02-21T09:04:27.3616924Z %135 = arith.addi %134, %11 : tensor<32xi32> 2026-02-21T09:04:27.3617185Z %136 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:27.3617453Z %137 = arith.muli %136, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:27.3617717Z %138 = tt.expand_dims %135 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.3618007Z %139 = tt.broadcast %137 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3618288Z %140 = tt.broadcast %138 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3618563Z %141 = arith.addi %139, %140 : tensor<64x32xi32> 2026-02-21T09:04:27.3618808Z %142 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3619102Z %143 = tt.addptr %142, %141 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:27.3619364Z %144 = tt.load %143 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3619614Z %145 = arith.extf %144 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:04:27.3619957Z %146 = tt.descriptor_load %0[%132, %10] : !tt.tensordesc> -> tensor<16x32xi8> 2026-02-21T09:04:27.3620290Z %147 = arith.shli %146, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3620520Z %148 = arith.shrsi %147, %cst_3 : tensor<16x32xi8> 2026-02-21T09:04:27.3620736Z %149 = arith.shrsi %146, %cst_3 : 
tensor<16x32xi8> 2026-02-21T09:04:27.3620989Z %150 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:27.3621279Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:27.3621647Z %152 = tt.expand_dims %151 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:27.3621976Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:27.3622322Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:04:27.3622617Z %155 = arith.cmpi eq, %152, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:04:27.3622869Z %156 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:27.3623151Z %157 = tt.broadcast %153 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:27.3623439Z %158 = arith.select %156, %157, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:27.3623722Z %159 = arith.cmpi eq, %152, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:27.3623983Z %160 = tt.broadcast %154 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:04:27.3624256Z %161 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:04:27.3624546Z %162 = arith.select %161, %160, %158 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:04:27.3624828Z %163 = tt.reshape %162 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:04:27.3625098Z %164 = arith.sitofp %163 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:04:27.3625449Z %165 = tt.dot %145, %164, %130, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:27.3625770Z scf.yield %165 : tensor<64x32xf32> 2026-02-21T09:04:27.3625966Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:27.3626188Z %19 = arith.truncf %18 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:27.3626481Z %20 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:04:27.3626745Z %21 = arith.muli %20, %cst_0 : tensor<64x1xi32> 2026-02-21T09:04:27.3627010Z %22 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:27.3627299Z %23 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3627552Z %24 = tt.broadcast %22 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:27.3627788Z %25 = arith.addi %23, %24 : tensor<64x32xi32> 2026-02-21T09:04:27.3628030Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3628315Z %27 = tt.addptr %26, %25 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:27.3628572Z tt.store %27, %19 : tensor<64x32x!tt.ptr> 2026-02-21T09:04:27.3628770Z tt.return 2026-02-21T09:04:27.3628901Z } 2026-02-21T09:04:27.3629027Z } 2026-02-21T09:04:27.3629096Z 2026-02-21T09:04:27.3629153Z {-# 2026-02-21T09:04:27.3629282Z external_resources: { 2026-02-21T09:04:27.3629449Z mlir_reproducer: { 2026-02-21T09:04:27.3633841Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:27.3638312Z disable_threading: false, 2026-02-21T09:04:27.3638479Z verify_each: true 2026-02-21T09:04:27.3638631Z } 2026-02-21T09:04:27.3638754Z } 2026-02-21T09:04:27.3638879Z #-} 2026-02-21T09:04:27.3639300Z /tmp/torchinductor_root/7k/c7k7cmcr7x3f3tm25q5aadrhwhiangvzlxybdkfzsl3zxzzi4huv.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:27.3640314Z /tmp/torchinductor_root/7k/c7k7cmcr7x3f3tm25q5aadrhwhiangvzlxybdkfzsl3zxzzi4huv.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:27.3641138Z [232s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:04:27.3642209Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:27.3643151Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:27.3643408Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:04:27.4234260Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 49/49 15.6 configs/s 2026-02-21T09:04:29.7200284Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 433.1 2026-02-21T09:04:29.7202914Z configs/s 2026-02-21T09:04:29.8167766Z [235s] Generation 9 complete: 2026-02-21T09:04:29.8169765Z error=3 2026-02-21T09:04:29.8169940Z ok=48 2026-02-21T09:04:29.8170081Z min=0.1077 2026-02-21T09:04:29.8170220Z mid=0.1978 2026-02-21T09:04:29.8170345Z max=13.6008 2026-02-21T09:04:29.8170510Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:04:29.8170730Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:04:29.8170938Z 'l2_groupings': [1], 2026-02-21T09:04:29.8171102Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:04:29.8171292Z 'loop_orders': [[0, 1]], 2026-02-21T09:04:29.8172792Z 'num_stages': 3, 2026-02-21T09:04:29.8172942Z 'num_warps': 1, 2026-02-21T09:04:29.8173083Z 'pid_type': 'flat', 2026-02-21T09:04:29.8173249Z 'range_flattens': [None, None], 2026-02-21T09:04:29.8173431Z 'range_multi_buffers': [None, True], 2026-02-21T09:04:29.8173611Z 'range_num_stages': [0, 0], 2026-02-21T09:04:29.8173783Z 'range_unroll_factors': [0, 0], 2026-02-21T09:04:29.8173964Z 'range_warp_specializes': [None, None]} 2026-02-21T09:04:29.8199847Z [235s] Fitting surrogate: 904 points, 904 targets 
2026-02-21T09:04:30.5834903Z [236s] Generation 10 starting: 45 neighbors, 3 active search path(s) 2026-02-21T09:04:38.7503965Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 1.5 configs/s 2026-02-21T09:04:40.8465594Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:40.8468137Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:40.8470133Z ^ 2026-02-21T09:04:40.8470673Z /tmp/torchinductor_root/np/cnpivsgsnua2hbkrxj3nxem6pbl2kv3uhl2cdj6p67uinuvizwl6.py:86:36: note: called from 2026-02-21T09:04:40.8473403Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:40.8473689Z ^ 2026-02-21T09:04:40.8479522Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:40.8480341Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:40.8480639Z ^ 2026-02-21T09:04:40.8480859Z module { 2026-02-21T09:04:40.8481413Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:40.8482101Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:04:40.8482341Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:04:40.8482537Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:40.8482741Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:40.8482974Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:40.8483205Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:40.8483454Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:04:40.8483694Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:40.8483944Z %cst_4 = arith.constant dense<8192> : 
tensor<64x1xi32> 2026-02-21T09:04:40.8484151Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:40.8484380Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:04:40.8484604Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:40.8484785Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:40.8484992Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:40.8485176Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:40.8485366Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:40.8485542Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:40.8485864Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:40.8486190Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:40.8486368Z %2 = arith.divsi %1, %c4_i32 : i32 2026-02-21T09:04:40.8486557Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:04:40.8486733Z %4 = arith.subi %c224_i32, %3 : i32 2026-02-21T09:04:40.8486912Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:04:40.8487083Z %6 = arith.remsi %1, %c4_i32 : i32 2026-02-21T09:04:40.8487276Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:04:40.8487452Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:04:40.8487625Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:04:40.8487799Z %10 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:04:40.8488097Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:40.8488354Z %12 = tt.splat %10 : i32 -> tensor<32xi32> 2026-02-21T09:04:40.8488551Z %13 = arith.addi %12, %11 : tensor<32xi32> 2026-02-21T09:04:40.8488745Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:04:40.8488970Z %15 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:40.8489213Z %16 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:04:40.8489504Z %17 = arith.addi %16, %15 : tensor<64xi32> 2026-02-21T09:04:40.8489700Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:04:40.8490064Z %18 = scf.for %arg3 = %c0_i32 to %c4096_i32 step 
%c256_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:04:40.8490401Z %20 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:04:40.8490621Z %21 = arith.addi %20, %15 : tensor<64xi32> 2026-02-21T09:04:40.8490821Z %22 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:40.8491067Z %23 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:40.8491322Z %24 = tt.splat %22 : i32 -> tensor<128xi32> 2026-02-21T09:04:40.8491578Z %25 = arith.addi %24, %23 : tensor<128xi32> 2026-02-21T09:04:40.8491841Z %26 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:40.8492140Z %27 = arith.muli %26, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:40.8492410Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:40.8492704Z %29 = tt.broadcast %27 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:40.8492981Z %30 = tt.broadcast %28 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:40.8493234Z %31 = arith.addi %29, %30 : tensor<64x128xi32> 2026-02-21T09:04:40.8493484Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:40.8493781Z %33 = tt.addptr %32, %31 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:40.8494046Z %34 = tt.load %33 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:40.8494294Z %35 = arith.extf %34 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:40.8494576Z %36 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:40.8494850Z %37 = arith.muli %36, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:40.8495113Z %38 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:40.8495399Z %39 = tt.broadcast %37 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:40.8495668Z %40 = tt.broadcast %38 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:40.8495906Z %41 = arith.addi %39, %40 : tensor<64x32xi32> 
2026-02-21T09:04:40.8496157Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:40.8496431Z %43 = tt.addptr %42, %41 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:40.8496740Z %44 = tt.load %43 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:40.8497009Z %45 = arith.shli %44, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8497221Z %46 = arith.shrsi %45, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8497440Z %47 = arith.shrsi %44, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8497682Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:40.8497980Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:40.8498320Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:40.8498645Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:40.8498961Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:40.8499248Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:40.8499523Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:40.8499796Z %55 = tt.broadcast %51 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:40.8500085Z %56 = arith.select %54, %55, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:40.8500348Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:40.8500604Z %58 = tt.broadcast %52 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:40.8500893Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:40.8501190Z %60 = arith.select %59, %58, %56 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:40.8501461Z %61 = tt.reshape %60 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:40.8501766Z %62 = arith.sitofp %61 : tensor<128x32xi8> to tensor<128x32xf32> 
2026-02-21T09:04:40.8502131Z %63 = tt.dot %35, %62, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:40.8502452Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:40.8502646Z %64 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:04:40.8502835Z %65 = arith.addi %arg3, %64 : i32 2026-02-21T09:04:40.8503028Z %66 = tt.splat %65 : i32 -> tensor<64xi32> 2026-02-21T09:04:40.8503223Z %67 = arith.addi %66, %15 : tensor<64xi32> 2026-02-21T09:04:40.8503441Z %68 = arith.muli %65, %c2_i32 : i32 2026-02-21T09:04:40.8503682Z %69 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:40.8503930Z %70 = tt.splat %68 : i32 -> tensor<128xi32> 2026-02-21T09:04:40.8504137Z %71 = arith.addi %70, %69 : tensor<128xi32> 2026-02-21T09:04:40.8504383Z %72 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:40.8504650Z %73 = arith.muli %72, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:40.8504898Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:40.8505189Z %75 = tt.broadcast %73 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:40.8505453Z %76 = tt.broadcast %74 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:40.8505686Z %77 = arith.addi %75, %76 : tensor<64x128xi32> 2026-02-21T09:04:40.8505933Z %78 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:40.8506215Z %79 = tt.addptr %78, %77 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:40.8506477Z %80 = tt.load %79 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:40.8506710Z %81 = arith.extf %80 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:40.8506993Z %82 = tt.expand_dims %67 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:40.8507255Z %83 = arith.muli %82, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:40.8507504Z %84 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 
2026-02-21T09:04:40.8507810Z %85 = tt.broadcast %83 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:40.8508077Z %86 = tt.broadcast %84 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:40.8508326Z %87 = arith.addi %85, %86 : tensor<64x32xi32> 2026-02-21T09:04:40.8508569Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:40.8508859Z %89 = tt.addptr %88, %87 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:40.8509176Z %90 = tt.load %89 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:40.8509451Z %91 = arith.shli %90, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8509683Z %92 = arith.shrsi %91, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8509902Z %93 = arith.shrsi %90, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8510154Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:40.8510458Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:40.8510805Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:40.8511142Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:40.8511466Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:40.8511792Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:40.8512079Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:40.8512369Z %101 = tt.broadcast %97 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:40.8512708Z %102 = arith.select %100, %101, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:40.8512995Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:40.8513266Z %104 = tt.broadcast %98 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:40.8513551Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:40.8513854Z %106 = 
arith.select %105, %104, %102 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:40.8514151Z %107 = tt.reshape %106 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:40.8514441Z %108 = arith.sitofp %107 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:40.8514851Z %109 = tt.dot %81, %108, %63, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:40.8515186Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:04:40.8515387Z %110 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:04:40.8515582Z %111 = arith.addi %arg3, %110 : i32 2026-02-21T09:04:40.8515785Z %112 = tt.splat %111 : i32 -> tensor<64xi32> 2026-02-21T09:04:40.8515990Z %113 = arith.addi %112, %15 : tensor<64xi32> 2026-02-21T09:04:40.8516188Z %114 = arith.muli %111, %c2_i32 : i32 2026-02-21T09:04:40.8516425Z %115 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:40.8516675Z %116 = tt.splat %114 : i32 -> tensor<128xi32> 2026-02-21T09:04:40.8516889Z %117 = arith.addi %116, %115 : tensor<128xi32> 2026-02-21T09:04:40.8517138Z %118 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:40.8517409Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:40.8517671Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:40.8517973Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:40.8518286Z %122 = tt.broadcast %120 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:40.8518528Z %123 = arith.addi %121, %122 : tensor<64x128xi32> 2026-02-21T09:04:40.8518779Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:40.8519069Z %125 = tt.addptr %124, %123 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:40.8519344Z %126 = tt.load %125 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:40.8519592Z %127 = arith.extf %126 : tensor<64x128xbf16> to 
tensor<64x128xf32> 2026-02-21T09:04:40.8519880Z %128 = tt.expand_dims %113 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:40.8520154Z %129 = arith.muli %128, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:40.8520409Z %130 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:40.8520704Z %131 = tt.broadcast %129 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:40.8520966Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:40.8521212Z %133 = arith.addi %131, %132 : tensor<64x32xi32> 2026-02-21T09:04:40.8521453Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:40.8521767Z %135 = tt.addptr %134, %133 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:40.8522133Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:40.8522400Z %137 = arith.shli %136, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8522636Z %138 = arith.shrsi %137, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8522859Z %139 = arith.shrsi %136, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8523122Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:40.8523459Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:40.8523775Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:40.8524134Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:40.8524468Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:40.8524771Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:40.8525037Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:40.8525309Z %147 = tt.broadcast %143 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:40.8525609Z %148 = arith.select %146, %147, %cst : 
tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:40.8525884Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:40.8526172Z %150 = tt.broadcast %144 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:40.8526449Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:40.8526739Z %152 = arith.select %151, %150, %148 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:40.8527031Z %153 = tt.reshape %152 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:40.8527301Z %154 = arith.sitofp %153 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:40.8527677Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:40.8528009Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:40.8528209Z %156 = arith.muli %c64_i32, %c3_i32 : i32 2026-02-21T09:04:40.8528402Z %157 = arith.addi %arg3, %156 : i32 2026-02-21T09:04:40.8528601Z %158 = tt.splat %157 : i32 -> tensor<64xi32> 2026-02-21T09:04:40.8528810Z %159 = arith.addi %158, %15 : tensor<64xi32> 2026-02-21T09:04:40.8529004Z %160 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:04:40.8529248Z %161 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:40.8529506Z %162 = tt.splat %160 : i32 -> tensor<128xi32> 2026-02-21T09:04:40.8529725Z %163 = arith.addi %162, %161 : tensor<128xi32> 2026-02-21T09:04:40.8529980Z %164 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:40.8530257Z %165 = arith.muli %164, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:40.8530529Z %166 = tt.expand_dims %163 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:40.8530830Z %167 = tt.broadcast %165 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:40.8531108Z %168 = tt.broadcast %166 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:40.8531352Z %169 = arith.addi %167, %168 : tensor<64x128xi32> 
2026-02-21T09:04:40.8531650Z %170 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:40.8531945Z %171 = tt.addptr %170, %169 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:40.8532219Z %172 = tt.load %171 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:40.8532471Z %173 = arith.extf %172 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:40.8532760Z %174 = tt.expand_dims %159 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:40.8533032Z %175 = arith.muli %174, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:40.8533284Z %176 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:40.8533608Z %177 = tt.broadcast %175 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:40.8533878Z %178 = tt.broadcast %176 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:40.8534121Z %179 = arith.addi %177, %178 : tensor<64x32xi32> 2026-02-21T09:04:40.8534368Z %180 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:40.8534671Z %181 = tt.addptr %180, %179 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:40.8534983Z %182 = tt.load %181 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:40.8535270Z %183 = arith.shli %182, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8535494Z %184 = arith.shrsi %183, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8535721Z %185 = arith.shrsi %182, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:40.8535966Z %186 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:40.8536268Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:40.8536582Z %188 = tt.expand_dims %187 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:40.8536923Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:40.8537279Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 
2026-02-21T09:04:40.8537567Z %191 = arith.cmpi eq, %188, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:40.8537824Z %192 = tt.broadcast %191 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:40.8538096Z %193 = tt.broadcast %189 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:40.8538391Z %194 = arith.select %192, %193, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:40.8538663Z %195 = arith.cmpi eq, %188, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:40.8538922Z %196 = tt.broadcast %190 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:40.8539199Z %197 = tt.broadcast %195 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:40.8539481Z %198 = arith.select %197, %196, %194 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:40.8539775Z %199 = tt.reshape %198 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:40.8540047Z %200 = arith.sitofp %199 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:40.8540416Z %201 = tt.dot %173, %200, %155, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:40.8540741Z scf.yield %201 : tensor<64x32xf32> 2026-02-21T09:04:40.8540939Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:40.8541171Z %19 = arith.truncf %18 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:40.8541496Z tt.descriptor_store %0[%14, %10], %19 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:04:40.8541808Z tt.return 2026-02-21T09:04:40.8541938Z } 2026-02-21T09:04:40.8542069Z } 2026-02-21T09:04:40.8542138Z 2026-02-21T09:04:40.8542190Z {-# 2026-02-21T09:04:40.8542331Z external_resources: { 2026-02-21T09:04:40.8542490Z mlir_reproducer: { 2026-02-21T09:04:40.8546922Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, 
tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:40.8551485Z disable_threading: false, 2026-02-21T09:04:40.8551679Z verify_each: true 2026-02-21T09:04:40.8551837Z } 2026-02-21T09:04:40.8551962Z } 2026-02-21T09:04:40.8552092Z #-} 2026-02-21T09:04:40.8552538Z /tmp/torchinductor_root/np/cnpivsgsnua2hbkrxj3nxem6pbl2kv3uhl2cdj6p67uinuvizwl6.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:40.8553657Z /tmp/torchinductor_root/np/cnpivsgsnua2hbkrxj3nxem6pbl2kv3uhl2cdj6p67uinuvizwl6.py:19:0: note: 
Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:40.8554531Z [246s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:40.8555650Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:40.8556645Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:40.8556922Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:04:41.0013729Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:41.0018612Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:41.0022996Z ^ 2026-02-21T09:04:41.0028354Z /tmp/torchinductor_root/jr/cjrv4alzo5xgbt6s6mxtzhjadprme6rn6cmhecmvkmrrrzn4wovx.py:86:36: note: called from 2026-02-21T09:04:41.0032234Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:41.0033755Z ^ 2026-02-21T09:04:41.0034221Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:41.0034691Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:41.0034947Z ^ 2026-02-21T09:04:41.0035195Z module { 2026-02-21T09:04:41.0039798Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 
16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:41.0040467Z %cst = arith.constant dense<0> : tensor<128x2x32xi8> 2026-02-21T09:04:41.0040713Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:04:41.0046100Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:41.0050663Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:04:41.0052809Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:41.0053337Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:41.0056541Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:41.0056821Z %cst_2 = arith.constant dense<4> : tensor<128x32xi8> 2026-02-21T09:04:41.0057061Z %cst_3 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:41.0057290Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:41.0057537Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:04:41.0057953Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:41.0058146Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:41.0058324Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:41.0058554Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:41.0058741Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:41.0058929Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:41.0059108Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:41.0059447Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:41.0059908Z %1 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:41.0060227Z %2 = tt.get_program_id x : i32 2026-02-21T09:04:41.0060428Z %3 = arith.divsi %2, %c896_i32 : i32 2026-02-21T09:04:41.0060656Z %4 = arith.muli %3, %c4_i32 : i32 2026-02-21T09:04:41.0060843Z %5 = arith.subi %c1_i32, %4 : i32 2026-02-21T09:04:41.0061014Z %6 = arith.minsi %5, %c4_i32 : i32 2026-02-21T09:04:41.0061193Z %7 = arith.remsi %2, %c896_i32 : i32 2026-02-21T09:04:41.0061377Z %8 = arith.remsi %7, %6 : 
i32 2026-02-21T09:04:41.0061626Z %9 = arith.addi %4, %8 : i32 2026-02-21T09:04:41.0061811Z %10 = arith.divsi %7, %6 : i32 2026-02-21T09:04:41.0061991Z %11 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:04:41.0062228Z %12 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:41.0062482Z %13 = tt.splat %11 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.0062688Z %14 = arith.addi %13, %12 : tensor<64xi32> 2026-02-21T09:04:41.0062873Z %15 = arith.muli %10, %c32_i32 : i32 2026-02-21T09:04:41.0063106Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:04:41.0063420Z %16 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c512_i32 iter_args(%arg4 = %cst_4) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:04:41.0063751Z %18 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:41.0063986Z %19 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:04:41.0064246Z %20 = tt.splat %18 : i32 -> tensor<256xi32> 2026-02-21T09:04:41.0064448Z %21 = arith.addi %20, %19 : tensor<256xi32> 2026-02-21T09:04:41.0064712Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.0064990Z %23 = arith.muli %22, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:41.0065250Z %24 = tt.expand_dims %21 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T09:04:41.0065550Z %25 = tt.broadcast %23 : tensor<64x1xi32> -> tensor<64x256xi32> 2026-02-21T09:04:41.0065820Z %26 = tt.broadcast %24 : tensor<1x256xi32> -> tensor<64x256xi32> 2026-02-21T09:04:41.0066081Z %27 = arith.addi %25, %26 : tensor<64x256xi32> 2026-02-21T09:04:41.0066340Z %28 = tt.splat %arg0 : !tt.ptr -> tensor<64x256x!tt.ptr> 2026-02-21T09:04:41.0066644Z %29 = tt.addptr %28, %27 : tensor<64x256x!tt.ptr>, tensor<64x256xi32> 2026-02-21T09:04:41.0066918Z %30 = tt.load %29 : tensor<64x256x!tt.ptr> 2026-02-21T09:04:41.0067167Z %31 = arith.extf %30 : tensor<64x256xbf16> to tensor<64x256xf32> 2026-02-21T09:04:41.0067509Z %32 = tt.descriptor_load %0[%arg3, 
%15] : !tt.tensordesc> -> tensor<128x32xi8> 2026-02-21T09:04:41.0067828Z %33 = arith.shli %32, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0068063Z %34 = arith.shrsi %33, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0068368Z %35 = arith.shrsi %32, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0068618Z %36 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.0068929Z %37 = tt.expand_dims %36 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.0069247Z %38 = tt.expand_dims %37 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.0069613Z %39 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128x32xi8> -> tensor<128x1x32xi8> 2026-02-21T09:04:41.0069948Z %40 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128x32xi8> -> tensor<128x1x32xi8> 2026-02-21T09:04:41.0070275Z %41 = arith.cmpi eq, %38, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.0070537Z %42 = tt.broadcast %41 : tensor<1x2x1xi1> -> tensor<128x2x32xi1> 2026-02-21T09:04:41.0070818Z %43 = tt.broadcast %39 : tensor<128x1x32xi8> -> tensor<128x2x32xi8> 2026-02-21T09:04:41.0071126Z %44 = arith.select %42, %43, %cst : tensor<128x2x32xi1>, tensor<128x2x32xi8> 2026-02-21T09:04:41.0071408Z %45 = arith.cmpi eq, %38, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.0071709Z %46 = tt.broadcast %40 : tensor<128x1x32xi8> -> tensor<128x2x32xi8> 2026-02-21T09:04:41.0072001Z %47 = tt.broadcast %45 : tensor<1x2x1xi1> -> tensor<128x2x32xi1> 2026-02-21T09:04:41.0072316Z %48 = arith.select %47, %46, %44 : tensor<128x2x32xi1>, tensor<128x2x32xi8> 2026-02-21T09:04:41.0072621Z %49 = tt.reshape %48 : tensor<128x2x32xi8> -> tensor<256x32xi8> 2026-02-21T09:04:41.0072902Z %50 = arith.sitofp %49 : tensor<256x32xi8> to tensor<256x32xf32> 2026-02-21T09:04:41.0073300Z %51 = tt.dot %31, %50, %arg4, inputPrecision = tf32 : tensor<64x256xf32> * tensor<256x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.0073655Z %c1_i32_5 = arith.constant 1 : i32 
2026-02-21T09:04:41.0073878Z %52 = arith.muli %c128_i32, %c1_i32_5 : i32 2026-02-21T09:04:41.0074097Z %53 = arith.addi %arg3, %52 : i32 2026-02-21T09:04:41.0074285Z %54 = arith.muli %53, %c2_i32 : i32 2026-02-21T09:04:41.0074525Z %55 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:04:41.0074779Z %56 = tt.splat %54 : i32 -> tensor<256xi32> 2026-02-21T09:04:41.0074998Z %57 = arith.addi %56, %55 : tensor<256xi32> 2026-02-21T09:04:41.0075255Z %58 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.0075538Z %59 = arith.muli %58, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:41.0075801Z %60 = tt.expand_dims %57 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T09:04:41.0076097Z %61 = tt.broadcast %59 : tensor<64x1xi32> -> tensor<64x256xi32> 2026-02-21T09:04:41.0076368Z %62 = tt.broadcast %60 : tensor<1x256xi32> -> tensor<64x256xi32> 2026-02-21T09:04:41.0076607Z %63 = arith.addi %61, %62 : tensor<64x256xi32> 2026-02-21T09:04:41.0076859Z %64 = tt.splat %arg0 : !tt.ptr -> tensor<64x256x!tt.ptr> 2026-02-21T09:04:41.0077148Z %65 = tt.addptr %64, %63 : tensor<64x256x!tt.ptr>, tensor<64x256xi32> 2026-02-21T09:04:41.0077415Z %66 = tt.load %65 : tensor<64x256x!tt.ptr> 2026-02-21T09:04:41.0077660Z %67 = arith.extf %66 : tensor<64x256xbf16> to tensor<64x256xf32> 2026-02-21T09:04:41.0077975Z %68 = tt.descriptor_load %0[%53, %15] : !tt.tensordesc> -> tensor<128x32xi8> 2026-02-21T09:04:41.0078285Z %69 = arith.shli %68, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0078517Z %70 = arith.shrsi %69, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0078737Z %71 = arith.shrsi %68, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0078976Z %72 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.0079265Z %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.0079573Z %74 = tt.expand_dims %73 {axis = 2 : i32} : tensor<1x2xi32> -> 
tensor<1x2x1xi32> 2026-02-21T09:04:41.0079923Z %75 = tt.expand_dims %70 {axis = 1 : i32} : tensor<128x32xi8> -> tensor<128x1x32xi8> 2026-02-21T09:04:41.0080245Z %76 = tt.expand_dims %71 {axis = 1 : i32} : tensor<128x32xi8> -> tensor<128x1x32xi8> 2026-02-21T09:04:41.0080521Z %77 = arith.cmpi eq, %74, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.0080768Z %78 = tt.broadcast %77 : tensor<1x2x1xi1> -> tensor<128x2x32xi1> 2026-02-21T09:04:41.0081075Z %79 = tt.broadcast %75 : tensor<128x1x32xi8> -> tensor<128x2x32xi8> 2026-02-21T09:04:41.0081359Z %80 = arith.select %78, %79, %cst : tensor<128x2x32xi1>, tensor<128x2x32xi8> 2026-02-21T09:04:41.0081684Z %81 = arith.cmpi eq, %74, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.0081931Z %82 = tt.broadcast %76 : tensor<128x1x32xi8> -> tensor<128x2x32xi8> 2026-02-21T09:04:41.0082205Z %83 = tt.broadcast %81 : tensor<1x2x1xi1> -> tensor<128x2x32xi1> 2026-02-21T09:04:41.0082473Z %84 = arith.select %83, %82, %80 : tensor<128x2x32xi1>, tensor<128x2x32xi8> 2026-02-21T09:04:41.0082756Z %85 = tt.reshape %84 : tensor<128x2x32xi8> -> tensor<256x32xi8> 2026-02-21T09:04:41.0083023Z %86 = arith.sitofp %85 : tensor<256x32xi8> to tensor<256x32xf32> 2026-02-21T09:04:41.0083370Z %87 = tt.dot %67, %86, %51, inputPrecision = tf32 : tensor<64x256xf32> * tensor<256x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.0083722Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:04:41.0083917Z %88 = arith.muli %c128_i32, %c2_i32_6 : i32 2026-02-21T09:04:41.0084119Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T09:04:41.0084298Z %90 = arith.muli %89, %c2_i32 : i32 2026-02-21T09:04:41.0084529Z %91 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:04:41.0084781Z %92 = tt.splat %90 : i32 -> tensor<256xi32> 2026-02-21T09:04:41.0084978Z %93 = arith.addi %92, %91 : tensor<256xi32> 2026-02-21T09:04:41.0085230Z %94 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.0085490Z %95 = arith.muli 
%94, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:41.0085750Z %96 = tt.expand_dims %93 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T09:04:41.0086047Z %97 = tt.broadcast %95 : tensor<64x1xi32> -> tensor<64x256xi32> 2026-02-21T09:04:41.0086306Z %98 = tt.broadcast %96 : tensor<1x256xi32> -> tensor<64x256xi32> 2026-02-21T09:04:41.0086550Z %99 = arith.addi %97, %98 : tensor<64x256xi32> 2026-02-21T09:04:41.0086795Z %100 = tt.splat %arg0 : !tt.ptr -> tensor<64x256x!tt.ptr> 2026-02-21T09:04:41.0087100Z %101 = tt.addptr %100, %99 : tensor<64x256x!tt.ptr>, tensor<64x256xi32> 2026-02-21T09:04:41.0087363Z %102 = tt.load %101 : tensor<64x256x!tt.ptr> 2026-02-21T09:04:41.0087616Z %103 = arith.extf %102 : tensor<64x256xbf16> to tensor<64x256xf32> 2026-02-21T09:04:41.0087937Z %104 = tt.descriptor_load %0[%89, %15] : !tt.tensordesc> -> tensor<128x32xi8> 2026-02-21T09:04:41.0088237Z %105 = arith.shli %104, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0088463Z %106 = arith.shrsi %105, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0088682Z %107 = arith.shrsi %104, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0088933Z %108 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.0089224Z %109 = tt.expand_dims %108 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.0089542Z %110 = tt.expand_dims %109 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.0089874Z %111 = tt.expand_dims %106 {axis = 1 : i32} : tensor<128x32xi8> -> tensor<128x1x32xi8> 2026-02-21T09:04:41.0090198Z %112 = tt.expand_dims %107 {axis = 1 : i32} : tensor<128x32xi8> -> tensor<128x1x32xi8> 2026-02-21T09:04:41.0090492Z %113 = arith.cmpi eq, %110, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.0090745Z %114 = tt.broadcast %113 : tensor<1x2x1xi1> -> tensor<128x2x32xi1> 2026-02-21T09:04:41.0091056Z %115 = tt.broadcast %111 : tensor<128x1x32xi8> -> tensor<128x2x32xi8> 2026-02-21T09:04:41.0091357Z %116 = arith.select %114, 
%115, %cst : tensor<128x2x32xi1>, tensor<128x2x32xi8> 2026-02-21T09:04:41.0091685Z %117 = arith.cmpi eq, %110, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.0091945Z %118 = tt.broadcast %112 : tensor<128x1x32xi8> -> tensor<128x2x32xi8> 2026-02-21T09:04:41.0092244Z %119 = tt.broadcast %117 : tensor<1x2x1xi1> -> tensor<128x2x32xi1> 2026-02-21T09:04:41.0092544Z %120 = arith.select %119, %118, %116 : tensor<128x2x32xi1>, tensor<128x2x32xi8> 2026-02-21T09:04:41.0092852Z %121 = tt.reshape %120 : tensor<128x2x32xi8> -> tensor<256x32xi8> 2026-02-21T09:04:41.0093129Z %122 = arith.sitofp %121 : tensor<256x32xi8> to tensor<256x32xf32> 2026-02-21T09:04:41.0093490Z %123 = tt.dot %103, %122, %87, inputPrecision = tf32 : tensor<64x256xf32> * tensor<256x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.0093833Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:41.0094032Z %124 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T09:04:41.0094240Z %125 = arith.addi %arg3, %124 : i32 2026-02-21T09:04:41.0094428Z %126 = arith.muli %125, %c2_i32 : i32 2026-02-21T09:04:41.0094656Z %127 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:04:41.0094951Z %128 = tt.splat %126 : i32 -> tensor<256xi32> 2026-02-21T09:04:41.0095166Z %129 = arith.addi %128, %127 : tensor<256xi32> 2026-02-21T09:04:41.0095426Z %130 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.0095694Z %131 = arith.muli %130, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:41.0095963Z %132 = tt.expand_dims %129 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T09:04:41.0096263Z %133 = tt.broadcast %131 : tensor<64x1xi32> -> tensor<64x256xi32> 2026-02-21T09:04:41.0096528Z %134 = tt.broadcast %132 : tensor<1x256xi32> -> tensor<64x256xi32> 2026-02-21T09:04:41.0096779Z %135 = arith.addi %133, %134 : tensor<64x256xi32> 2026-02-21T09:04:41.0097023Z %136 = tt.splat %arg0 : !tt.ptr -> tensor<64x256x!tt.ptr> 2026-02-21T09:04:41.0097322Z %137 = 
tt.addptr %136, %135 : tensor<64x256x!tt.ptr>, tensor<64x256xi32> 2026-02-21T09:04:41.0097593Z %138 = tt.load %137 : tensor<64x256x!tt.ptr> 2026-02-21T09:04:41.0097836Z %139 = arith.extf %138 : tensor<64x256xbf16> to tensor<64x256xf32> 2026-02-21T09:04:41.0098169Z %140 = tt.descriptor_load %0[%125, %15] : !tt.tensordesc> -> tensor<128x32xi8> 2026-02-21T09:04:41.0098473Z %141 = arith.shli %140, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0098700Z %142 = arith.shrsi %141, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0098918Z %143 = arith.shrsi %140, %cst_2 : tensor<128x32xi8> 2026-02-21T09:04:41.0099170Z %144 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.0099466Z %145 = tt.expand_dims %144 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.0099775Z %146 = tt.expand_dims %145 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.0100103Z %147 = tt.expand_dims %142 {axis = 1 : i32} : tensor<128x32xi8> -> tensor<128x1x32xi8> 2026-02-21T09:04:41.0100426Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<128x32xi8> -> tensor<128x1x32xi8> 2026-02-21T09:04:41.0100716Z %149 = arith.cmpi eq, %146, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.0100968Z %150 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<128x2x32xi1> 2026-02-21T09:04:41.0101250Z %151 = tt.broadcast %147 : tensor<128x1x32xi8> -> tensor<128x2x32xi8> 2026-02-21T09:04:41.0101578Z %152 = arith.select %150, %151, %cst : tensor<128x2x32xi1>, tensor<128x2x32xi8> 2026-02-21T09:04:41.0101852Z %153 = arith.cmpi eq, %146, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.0102140Z %154 = tt.broadcast %148 : tensor<128x1x32xi8> -> tensor<128x2x32xi8> 2026-02-21T09:04:41.0102413Z %155 = tt.broadcast %153 : tensor<1x2x1xi1> -> tensor<128x2x32xi1> 2026-02-21T09:04:41.0102704Z %156 = arith.select %155, %154, %152 : tensor<128x2x32xi1>, tensor<128x2x32xi8> 2026-02-21T09:04:41.0102996Z %157 = tt.reshape %156 : tensor<128x2x32xi8> -> 
tensor<256x32xi8> 2026-02-21T09:04:41.0103263Z %158 = arith.sitofp %157 : tensor<256x32xi8> to tensor<256x32xf32> 2026-02-21T09:04:41.0103654Z %159 = tt.dot %139, %158, %123, inputPrecision = tf32 : tensor<64x256xf32> * tensor<256x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.0103998Z scf.yield %159 : tensor<64x32xf32> 2026-02-21T09:04:41.0104196Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:41.0104418Z %17 = arith.truncf %16 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:41.0104745Z tt.descriptor_store %1[%11, %15], %17 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:04:41.0105023Z tt.return 2026-02-21T09:04:41.0105154Z } 2026-02-21T09:04:41.0105287Z } 2026-02-21T09:04:41.0105359Z 2026-02-21T09:04:41.0105414Z {-# 2026-02-21T09:04:41.0105557Z external_resources: { 2026-02-21T09:04:41.0105720Z mlir_reproducer: { 2026-02-21T09:04:41.0110267Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:41.0114932Z disable_threading: false, 2026-02-21T09:04:41.0115130Z verify_each: true 2026-02-21T09:04:41.0115287Z } 2026-02-21T09:04:41.0115423Z } 2026-02-21T09:04:41.0115551Z #-} 2026-02-21T09:04:41.0115996Z /tmp/torchinductor_root/jr/cjrv4alzo5xgbt6s6mxtzhjadprme6rn6cmhecmvkmrrrzn4wovx.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:41.0117075Z /tmp/torchinductor_root/jr/cjrv4alzo5xgbt6s6mxtzhjadprme6rn6cmhecmvkmrrrzn4wovx.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:41.0117944Z [246s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:04:41.0119046Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:41.0120056Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:41.0120315Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:04:41.3297924Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:41.3302585Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:41.3303856Z ^ 2026-02-21T09:04:41.3304265Z /tmp/torchinductor_root/nj/cnjlzdzd5ib6xtdfz2uvvcebcsz6pqnoipdmcpi6sebxby6k2f7e.py:86:36: note: called from 2026-02-21T09:04:41.3304670Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:41.3304892Z ^ 2026-02-21T09:04:41.3305283Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:41.3305748Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:41.3305987Z ^ 2026-02-21T09:04:41.3306154Z module { 2026-02-21T09:04:41.3306793Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:41.3307377Z %cst = arith.constant dense<0> : tensor<32x2x128xi8> 2026-02-21T09:04:41.3307612Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:04:41.3307797Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:41.3307984Z %c32_i32 = 
arith.constant 32 : i32 2026-02-21T09:04:41.3308161Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:41.3308369Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:41.3308626Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:41.3308848Z %cst_2 = arith.constant dense<4> : tensor<32x128xi8> 2026-02-21T09:04:41.3309083Z %cst_3 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:41.3309289Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:41.3309512Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<64x128xf32> 2026-02-21T09:04:41.3309751Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:04:41.3309932Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:41.3310115Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:41.3310297Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:41.3310492Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:41.3310674Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:41.3310859Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:41.3311179Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:41.3311800Z %1 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:41.3312131Z %2 = tt.get_program_id x : i32 2026-02-21T09:04:41.3312318Z %3 = arith.divsi %2, %c224_i32 : i32 2026-02-21T09:04:41.3312520Z %4 = arith.muli %3, %c4_i32 : i32 2026-02-21T09:04:41.3312712Z %5 = arith.subi %c1_i32, %4 : i32 2026-02-21T09:04:41.3312907Z %6 = arith.minsi %5, %c4_i32 : i32 2026-02-21T09:04:41.3313090Z %7 = arith.remsi %2, %c224_i32 : i32 2026-02-21T09:04:41.3313285Z %8 = arith.remsi %7, %6 : i32 2026-02-21T09:04:41.3313476Z %9 = arith.addi %4, %8 : i32 2026-02-21T09:04:41.3313647Z %10 = arith.divsi %7, %6 : i32 2026-02-21T09:04:41.3313827Z %11 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:04:41.3314057Z %12 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 
2026-02-21T09:04:41.3314371Z %13 = tt.splat %11 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.3314566Z %14 = arith.addi %13, %12 : tensor<64xi32> 2026-02-21T09:04:41.3314762Z %15 = arith.muli %10, %c128_i32 : i32 2026-02-21T09:04:41.3314942Z %c128_i32_5 = arith.constant 128 : i32 2026-02-21T09:04:41.3315276Z %16 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32_5 iter_args(%arg4 = %cst_4) -> (tensor<64x128xf32>) : i32 { 2026-02-21T09:04:41.3315647Z %18 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:41.3315842Z %19 = tt.splat %18 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.3316076Z %20 = arith.addi %19, %12 : tensor<64xi32> 2026-02-21T09:04:41.3316332Z %21 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.3316611Z %22 = arith.muli %21, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:41.3316866Z %23 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:41.3317156Z %24 = tt.broadcast %22 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.3317425Z %25 = tt.broadcast %23 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.3317656Z %26 = arith.addi %24, %25 : tensor<64x64xi32> 2026-02-21T09:04:41.3317904Z %27 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.3318214Z %28 = tt.addptr %27, %26 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:41.3318483Z %29 = tt.load %28 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.3318733Z %30 = arith.extf %29 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:41.3319069Z %31 = tt.descriptor_load %0[%arg3, %15] : !tt.tensordesc> -> tensor<32x128xi8> 2026-02-21T09:04:41.3319392Z %32 = arith.shli %31, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3319619Z %33 = arith.shrsi %32, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3319851Z %34 = arith.shrsi %31, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3320102Z %35 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 
2026-02-21T09:04:41.3320410Z %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.3320734Z %37 = tt.expand_dims %36 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.3321070Z %38 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:41.3321416Z %39 = tt.expand_dims %34 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:41.3321754Z %40 = arith.cmpi eq, %37, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.3322025Z %41 = tt.broadcast %40 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:41.3322308Z %42 = tt.broadcast %38 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:41.3322612Z %43 = arith.select %41, %42, %cst : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:41.3322901Z %44 = arith.cmpi eq, %37, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.3323156Z %45 = tt.broadcast %39 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:41.3323439Z %46 = tt.broadcast %44 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:41.3323726Z %47 = arith.select %46, %45, %43 : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:41.3324025Z %48 = tt.reshape %47 : tensor<32x2x128xi8> -> tensor<64x128xi8> 2026-02-21T09:04:41.3324308Z %49 = arith.sitofp %48 : tensor<64x128xi8> to tensor<64x128xf32> 2026-02-21T09:04:41.3324685Z %50 = tt.dot %30, %49, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:41.3325035Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:04:41.3325238Z %51 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:04:41.3325452Z %52 = arith.addi %arg3, %51 : i32 2026-02-21T09:04:41.3325646Z %53 = arith.muli %52, %c2_i32 : i32 2026-02-21T09:04:41.3325882Z %54 = tt.splat %53 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.3326093Z %55 = arith.addi %54, %12 : tensor<64xi32> 2026-02-21T09:04:41.3326352Z %56 = tt.expand_dims %14 {axis 
= 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.3326662Z %57 = arith.muli %56, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:41.3326913Z %58 = tt.expand_dims %55 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:41.3327228Z %59 = tt.broadcast %57 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.3327479Z %60 = tt.broadcast %58 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.3327739Z %61 = arith.addi %59, %60 : tensor<64x64xi32> 2026-02-21T09:04:41.3327990Z %62 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.3328273Z %63 = tt.addptr %62, %61 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:41.3328543Z %64 = tt.load %63 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.3328783Z %65 = arith.extf %64 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:41.3329105Z %66 = tt.descriptor_load %0[%52, %15] : !tt.tensordesc> -> tensor<32x128xi8> 2026-02-21T09:04:41.3329407Z %67 = arith.shli %66, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3329635Z %68 = arith.shrsi %67, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3329889Z %69 = arith.shrsi %66, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3330130Z %70 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.3330424Z %71 = tt.expand_dims %70 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.3330751Z %72 = tt.expand_dims %71 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.3331065Z %73 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:41.3331393Z %74 = tt.expand_dims %69 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:41.3331704Z %75 = arith.cmpi eq, %72, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.3331955Z %76 = tt.broadcast %75 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:41.3332229Z %77 = tt.broadcast %73 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 
2026-02-21T09:04:41.3332519Z %78 = arith.select %76, %77, %cst : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:41.3332798Z %79 = arith.cmpi eq, %72, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.3333043Z %80 = tt.broadcast %74 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:41.3333316Z %81 = tt.broadcast %79 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:41.3333589Z %82 = arith.select %81, %80, %78 : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:41.3333872Z %83 = tt.reshape %82 : tensor<32x2x128xi8> -> tensor<64x128xi8> 2026-02-21T09:04:41.3334139Z %84 = arith.sitofp %83 : tensor<64x128xi8> to tensor<64x128xf32> 2026-02-21T09:04:41.3334488Z %85 = tt.dot %65, %84, %50, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:41.3334810Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:04:41.3335003Z %86 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:04:41.3335206Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T09:04:41.3335392Z %88 = arith.muli %87, %c2_i32 : i32 2026-02-21T09:04:41.3335591Z %89 = tt.splat %88 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.3335793Z %90 = arith.addi %89, %12 : tensor<64xi32> 2026-02-21T09:04:41.3336045Z %91 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.3336317Z %92 = arith.muli %91, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:41.3336572Z %93 = tt.expand_dims %90 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:41.3336862Z %94 = tt.broadcast %92 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.3337145Z %95 = tt.broadcast %93 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.3337380Z %96 = arith.addi %94, %95 : tensor<64x64xi32> 2026-02-21T09:04:41.3337628Z %97 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.3337908Z %98 = tt.addptr %97, %96 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:41.3338173Z %99 = tt.load 
%98 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.3338451Z %100 = arith.extf %99 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:41.3338771Z %101 = tt.descriptor_load %0[%87, %15] : !tt.tensordesc> -> tensor<32x128xi8> 2026-02-21T09:04:41.3339108Z %102 = arith.shli %101, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3339332Z %103 = arith.shrsi %102, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3339564Z %104 = arith.shrsi %101, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3339810Z %105 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.3340113Z %106 = tt.expand_dims %105 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.3340424Z %107 = tt.expand_dims %106 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.3340759Z %108 = tt.expand_dims %103 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:41.3341137Z %109 = tt.expand_dims %104 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:41.3341427Z %110 = arith.cmpi eq, %107, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.3341725Z %111 = tt.broadcast %110 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:41.3342003Z %112 = tt.broadcast %108 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:41.3342306Z %113 = arith.select %111, %112, %cst : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:41.3342586Z %114 = arith.cmpi eq, %107, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.3342853Z %115 = tt.broadcast %109 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:41.3354113Z %116 = tt.broadcast %114 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:41.3354513Z %117 = arith.select %116, %115, %113 : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:41.3354822Z %118 = tt.reshape %117 : tensor<32x2x128xi8> -> tensor<64x128xi8> 2026-02-21T09:04:41.3355120Z %119 = arith.sitofp %118 : tensor<64x128xi8> to tensor<64x128xf32> 
2026-02-21T09:04:41.3355504Z %120 = tt.dot %100, %119, %85, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:41.3355843Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:41.3356045Z %121 = arith.muli %c32_i32, %c3_i32 : i32 2026-02-21T09:04:41.3356241Z %122 = arith.addi %arg3, %121 : i32 2026-02-21T09:04:41.3356428Z %123 = arith.muli %122, %c2_i32 : i32 2026-02-21T09:04:41.3356624Z %124 = tt.splat %123 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.3356843Z %125 = arith.addi %124, %12 : tensor<64xi32> 2026-02-21T09:04:41.3357110Z %126 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.3357391Z %127 = arith.muli %126, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:41.3357668Z %128 = tt.expand_dims %125 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:41.3357969Z %129 = tt.broadcast %127 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.3358245Z %130 = tt.broadcast %128 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.3358483Z %131 = arith.addi %129, %130 : tensor<64x64xi32> 2026-02-21T09:04:41.3358741Z %132 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.3359037Z %133 = tt.addptr %132, %131 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:41.3359300Z %134 = tt.load %133 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.3359647Z %135 = arith.extf %134 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:41.3359973Z %136 = tt.descriptor_load %0[%122, %15] : !tt.tensordesc> -> tensor<32x128xi8> 2026-02-21T09:04:41.3360294Z %137 = arith.shli %136, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3360522Z %138 = arith.shrsi %137, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3360762Z %139 = arith.shrsi %136, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:41.3361059Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.3361355Z %141 = tt.expand_dims %140 {axis = 0 : i32} : 
tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.3361772Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.3362123Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:41.3362482Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:41.3362800Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.3363070Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:41.3363376Z %147 = tt.broadcast %143 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:41.3363687Z %148 = arith.select %146, %147, %cst : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:41.3364043Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.3364313Z %150 = tt.broadcast %144 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:41.3364615Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:41.3364929Z %152 = arith.select %151, %150, %148 : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:41.3365238Z %153 = tt.reshape %152 : tensor<32x2x128xi8> -> tensor<64x128xi8> 2026-02-21T09:04:41.3365535Z %154 = arith.sitofp %153 : tensor<64x128xi8> to tensor<64x128xf32> 2026-02-21T09:04:41.3365925Z %155 = tt.dot %135, %154, %120, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:41.3366281Z scf.yield %155 : tensor<64x128xf32> 2026-02-21T09:04:41.3366491Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:41.3366738Z %17 = arith.truncf %16 : tensor<64x128xf32> to tensor<64x128xbf16> 2026-02-21T09:04:41.3367100Z tt.descriptor_store %1[%11, %15], %17 : !tt.tensordesc>, tensor<64x128xbf16> 2026-02-21T09:04:41.3367404Z tt.return 2026-02-21T09:04:41.3367559Z } 2026-02-21T09:04:41.3367693Z } 2026-02-21T09:04:41.3367779Z 2026-02-21T09:04:41.3367837Z {-# 
2026-02-21T09:04:41.3367978Z external_resources: { 2026-02-21T09:04:41.3368155Z mlir_reproducer: { 2026-02-21T09:04:41.3372564Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:41.3377069Z disable_threading: false, 2026-02-21T09:04:41.3377253Z verify_each: true 
2026-02-21T09:04:41.3377424Z } 2026-02-21T09:04:41.3377561Z } 2026-02-21T09:04:41.3377687Z #-} 2026-02-21T09:04:41.3378130Z /tmp/torchinductor_root/nj/cnjlzdzd5ib6xtdfz2uvvcebcsz6pqnoipdmcpi6sebxby6k2f7e.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:41.3379155Z /tmp/torchinductor_root/nj/cnjlzdzd5ib6xtdfz2uvvcebcsz6pqnoipdmcpi6sebxby6k2f7e.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:41.3380008Z [246s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:41.3381140Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:41.3382166Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:41.3382432Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:04:41.4614391Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:41.4619357Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:41.4623821Z ^ 2026-02-21T09:04:41.4626001Z /tmp/torchinductor_root/6j/c6j5we6g5yhmbkiz7jwo7n2cfu7juplawax3sbqfglmezkbdmhpi.py:94:40: note: called from 2026-02-21T09:04:41.4626524Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:41.4630842Z ^ 2026-02-21T09:04:41.4632607Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:41.4633115Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:41.4633378Z ^ 2026-02-21T09:04:41.4633649Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:04:41.4638510Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:41.4639243Z %cst = arith.constant dense<0> : tensor<8x2x32xi8> 2026-02-21T09:04:41.4639554Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:04:41.4639770Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:04:41.4639972Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:04:41.4640167Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:41.4640370Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:41.4640567Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:41.4640790Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:41.4641032Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:41.4641268Z %cst_2 = arith.constant dense<4> : tensor<8x32xi8> 2026-02-21T09:04:41.4641822Z %cst_3 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:04:41.4642065Z %cst_4 = arith.constant dense<8192> : 
tensor<64x1xi32> 2026-02-21T09:04:41.4642290Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:41.4642516Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:04:41.4642765Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:41.4642957Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:41.4643228Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:41.4643437Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:41.4643638Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:41.4643882Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:41.4644220Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:41.4644579Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:41.4644773Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:04:41.4644977Z %3 = arith.minsi %2, %c224_i32 : i32 2026-02-21T09:04:41.4645189Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:04:41.4645422Z %4 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:04:41.4645632Z %5 = arith.muli %4, %c4_i32 : i32 2026-02-21T09:04:41.4645822Z %6 = arith.subi %c1_i32, %5 : i32 2026-02-21T09:04:41.4646060Z %7 = arith.minsi %6, %c4_i32 : i32 2026-02-21T09:04:41.4646261Z %8 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:04:41.4646467Z %9 = arith.remsi %8, %7 : i32 2026-02-21T09:04:41.4646650Z %10 = arith.addi %5, %9 : i32 2026-02-21T09:04:41.4646841Z %11 = arith.divsi %8, %7 : i32 2026-02-21T09:04:41.4647026Z %12 = arith.muli %10, %c64_i32 : i32 2026-02-21T09:04:41.4647281Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:41.4647561Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.4647774Z %15 = arith.addi %14, %13 : tensor<64xi32> 2026-02-21T09:04:41.4647988Z %16 = arith.muli %11, %c32_i32 : i32 2026-02-21T09:04:41.4648226Z %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:41.4648488Z %18 = tt.splat %16 : i32 -> tensor<32xi32> 
2026-02-21T09:04:41.4648694Z %19 = arith.addi %18, %17 : tensor<32xi32> 2026-02-21T09:04:41.4648900Z %c32_i32_6 = arith.constant 32 : i32 2026-02-21T09:04:41.4649249Z %20 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c32_i32_6 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:04:41.4649630Z %22 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:04:41.4649900Z %23 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:04:41.4650117Z %24 = arith.addi %23, %22 : tensor<8xi32> 2026-02-21T09:04:41.4650332Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:04:41.4650578Z %26 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:04:41.4650831Z %27 = tt.splat %25 : i32 -> tensor<16xi32> 2026-02-21T09:04:41.4651039Z %28 = arith.addi %27, %26 : tensor<16xi32> 2026-02-21T09:04:41.4651294Z %29 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.4651624Z %30 = arith.muli %29, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:41.4651889Z %31 = tt.expand_dims %28 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:04:41.4652194Z %32 = tt.broadcast %30 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:04:41.4652456Z %33 = tt.broadcast %31 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:04:41.4652705Z %34 = arith.addi %32, %33 : tensor<64x16xi32> 2026-02-21T09:04:41.4652964Z %35 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:04:41.4653250Z %36 = tt.addptr %35, %34 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:04:41.4653560Z %37 = tt.load %36 : tensor<64x16x!tt.ptr> 2026-02-21T09:04:41.4653800Z %38 = arith.extf %37 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:04:41.4654094Z %39 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:04:41.4654372Z %40 = arith.muli %39, %cst_3 : tensor<8x1xi32> 2026-02-21T09:04:41.4654628Z %41 = tt.expand_dims %19 {axis = 0 : i32} : 
tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:41.4654957Z %42 = tt.broadcast %40 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:04:41.4655219Z %43 = tt.broadcast %41 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:04:41.4655498Z %44 = arith.addi %42, %43 : tensor<8x32xi32> 2026-02-21T09:04:41.4655735Z %45 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:04:41.4656012Z %46 = tt.addptr %45, %44 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:04:41.4656314Z %47 = tt.load %46 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:04:41.4656582Z %48 = arith.shli %47, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4656793Z %49 = arith.shrsi %48, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4657014Z %50 = arith.shrsi %47, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4657253Z %51 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.4657571Z %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.4657881Z %53 = tt.expand_dims %52 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.4658202Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:41.4658517Z %55 = tt.expand_dims %50 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:41.4658794Z %56 = arith.cmpi eq, %53, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.4659041Z %57 = tt.broadcast %56 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:41.4659305Z %58 = tt.broadcast %54 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:41.4659582Z %59 = arith.select %57, %58, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:41.4659843Z %60 = arith.cmpi eq, %53, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.4660097Z %61 = tt.broadcast %55 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:41.4660363Z %62 = tt.broadcast %60 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:41.4660632Z %63 
= arith.select %62, %61, %59 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:41.4660904Z %64 = tt.reshape %63 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:04:41.4661156Z %65 = arith.sitofp %64 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:04:41.4661514Z %66 = tt.dot %38, %65, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.4661873Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T09:04:41.4662069Z %67 = arith.muli %c8_i32, %c1_i32_7 : i32 2026-02-21T09:04:41.4662270Z %68 = arith.addi %arg4, %67 : i32 2026-02-21T09:04:41.4662494Z %69 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:04:41.4662741Z %70 = tt.splat %68 : i32 -> tensor<8xi32> 2026-02-21T09:04:41.4662941Z %71 = arith.addi %70, %69 : tensor<8xi32> 2026-02-21T09:04:41.4663141Z %72 = arith.muli %68, %c2_i32 : i32 2026-02-21T09:04:41.4663379Z %73 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:04:41.4663621Z %74 = tt.splat %72 : i32 -> tensor<16xi32> 2026-02-21T09:04:41.4663829Z %75 = arith.addi %74, %73 : tensor<16xi32> 2026-02-21T09:04:41.4664080Z %76 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.4664359Z %77 = arith.muli %76, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:41.4664651Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:04:41.4664945Z %79 = tt.broadcast %77 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:04:41.4665208Z %80 = tt.broadcast %78 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:04:41.4665439Z %81 = arith.addi %79, %80 : tensor<64x16xi32> 2026-02-21T09:04:41.4665687Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:04:41.4665996Z %83 = tt.addptr %82, %81 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:04:41.4666258Z %84 = tt.load %83 : tensor<64x16x!tt.ptr> 2026-02-21T09:04:41.4666524Z %85 = arith.extf %84 : 
tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:04:41.4666808Z %86 = tt.expand_dims %71 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:04:41.4667071Z %87 = arith.muli %86, %cst_3 : tensor<8x1xi32> 2026-02-21T09:04:41.4667319Z %88 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:41.4667608Z %89 = tt.broadcast %87 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:04:41.4667862Z %90 = tt.broadcast %88 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:04:41.4668097Z %91 = arith.addi %89, %90 : tensor<8x32xi32> 2026-02-21T09:04:41.4668334Z %92 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:04:41.4668623Z %93 = tt.addptr %92, %91 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:04:41.4668918Z %94 = tt.load %93 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:04:41.4669181Z %95 = arith.shli %94, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4669401Z %96 = arith.shrsi %95, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4669615Z %97 = arith.shrsi %94, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4669865Z %98 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.4670160Z %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.4670469Z %100 = tt.expand_dims %99 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.4670795Z %101 = tt.expand_dims %96 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:41.4671114Z %102 = tt.expand_dims %97 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:41.4671408Z %103 = arith.cmpi eq, %100, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.4671732Z %104 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:41.4672009Z %105 = tt.broadcast %101 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:41.4672295Z %106 = arith.select %104, %105, %cst : tensor<8x2x32xi1>, 
tensor<8x2x32xi8> 2026-02-21T09:04:41.4672575Z %107 = arith.cmpi eq, %100, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.4672828Z %108 = tt.broadcast %102 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:41.4673105Z %109 = tt.broadcast %107 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:41.4673395Z %110 = arith.select %109, %108, %106 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:41.4673674Z %111 = tt.reshape %110 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:04:41.4673945Z %112 = arith.sitofp %111 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:04:41.4674302Z %113 = tt.dot %85, %112, %66, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.4674624Z %c2_i32_8 = arith.constant 2 : i32 2026-02-21T09:04:41.4674818Z %114 = arith.muli %c8_i32, %c2_i32_8 : i32 2026-02-21T09:04:41.4675019Z %115 = arith.addi %arg4, %114 : i32 2026-02-21T09:04:41.4675253Z %116 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:04:41.4675499Z %117 = tt.splat %115 : i32 -> tensor<8xi32> 2026-02-21T09:04:41.4675760Z %118 = arith.addi %117, %116 : tensor<8xi32> 2026-02-21T09:04:41.4675954Z %119 = arith.muli %115, %c2_i32 : i32 2026-02-21T09:04:41.4676191Z %120 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:04:41.4676436Z %121 = tt.splat %119 : i32 -> tensor<16xi32> 2026-02-21T09:04:41.4676646Z %122 = arith.addi %121, %120 : tensor<16xi32> 2026-02-21T09:04:41.4676908Z %123 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.4677210Z %124 = arith.muli %123, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:41.4677506Z %125 = tt.expand_dims %122 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:04:41.4677802Z %126 = tt.broadcast %124 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:04:41.4678075Z %127 = tt.broadcast %125 : tensor<1x16xi32> -> tensor<64x16xi32> 
2026-02-21T09:04:41.4678318Z %128 = arith.addi %126, %127 : tensor<64x16xi32> 2026-02-21T09:04:41.4678573Z %129 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:04:41.4678877Z %130 = tt.addptr %129, %128 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:04:41.4679144Z %131 = tt.load %130 : tensor<64x16x!tt.ptr> 2026-02-21T09:04:41.4679396Z %132 = arith.extf %131 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:04:41.4679709Z %133 = tt.expand_dims %118 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:04:41.4679991Z %134 = arith.muli %133, %cst_3 : tensor<8x1xi32> 2026-02-21T09:04:41.4680250Z %135 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:41.4680548Z %136 = tt.broadcast %134 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:04:41.4680816Z %137 = tt.broadcast %135 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:04:41.4681055Z %138 = arith.addi %136, %137 : tensor<8x32xi32> 2026-02-21T09:04:41.4681298Z %139 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:04:41.4681601Z %140 = tt.addptr %139, %138 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:04:41.4681914Z %141 = tt.load %140 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:04:41.4682193Z %142 = arith.shli %141, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4682412Z %143 = arith.shrsi %142, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4682639Z %144 = arith.shrsi %141, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4682882Z %145 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.4683184Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.4683506Z %147 = tt.expand_dims %146 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.4683840Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:41.4684171Z %149 = tt.expand_dims %144 {axis 
= 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:41.4684455Z %150 = arith.cmpi eq, %147, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.4684716Z %151 = tt.broadcast %150 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:41.4684986Z %152 = tt.broadcast %148 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:41.4685282Z %153 = arith.select %151, %152, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:41.4685564Z %154 = arith.cmpi eq, %147, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.4685814Z %155 = tt.broadcast %149 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:41.4686083Z %156 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:41.4686355Z %157 = arith.select %156, %155, %153 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:41.4686638Z %158 = tt.reshape %157 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:04:41.4686927Z %159 = arith.sitofp %158 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:04:41.4687287Z %160 = tt.dot %132, %159, %113, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.4687613Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:41.4687808Z %161 = arith.muli %c8_i32, %c3_i32 : i32 2026-02-21T09:04:41.4688044Z %162 = arith.addi %arg4, %161 : i32 2026-02-21T09:04:41.4688281Z %163 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:04:41.4688543Z %164 = tt.splat %162 : i32 -> tensor<8xi32> 2026-02-21T09:04:41.4688782Z %165 = arith.addi %164, %163 : tensor<8xi32> 2026-02-21T09:04:41.4688994Z %166 = arith.muli %162, %c2_i32 : i32 2026-02-21T09:04:41.4689245Z %167 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:04:41.4689506Z %168 = tt.splat %166 : i32 -> tensor<16xi32> 2026-02-21T09:04:41.4689731Z %169 = arith.addi %168, %167 : tensor<16xi32> 2026-02-21T09:04:41.4689996Z %170 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> 
tensor<64x1xi32> 2026-02-21T09:04:41.4690287Z %171 = arith.muli %170, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:41.4690560Z %172 = tt.expand_dims %169 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:04:41.4690900Z %173 = tt.broadcast %171 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:04:41.4691185Z %174 = tt.broadcast %172 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:04:41.4691437Z %175 = arith.addi %173, %174 : tensor<64x16xi32> 2026-02-21T09:04:41.4691733Z %176 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:04:41.4692042Z %177 = tt.addptr %176, %175 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:04:41.4692332Z %178 = tt.load %177 : tensor<64x16x!tt.ptr> 2026-02-21T09:04:41.4692588Z %179 = arith.extf %178 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:04:41.4692902Z %180 = tt.expand_dims %165 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:04:41.4693198Z %181 = arith.muli %180, %cst_3 : tensor<8x1xi32> 2026-02-21T09:04:41.4693471Z %182 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:41.4693784Z %183 = tt.broadcast %181 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:04:41.4694061Z %184 = tt.broadcast %182 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:04:41.4694323Z %185 = arith.addi %183, %184 : tensor<8x32xi32> 2026-02-21T09:04:41.4694574Z %186 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:04:41.4694865Z %187 = tt.addptr %186, %185 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:04:41.4695198Z %188 = tt.load %187 evictionPolicy = evict_last : tensor<8x32x!tt.ptr> 2026-02-21T09:04:41.4695480Z %189 = arith.shli %188, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4695713Z %190 = arith.shrsi %189, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4695939Z %191 = arith.shrsi %188, %cst_2 : tensor<8x32xi8> 2026-02-21T09:04:41.4696204Z %192 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 
2026-02-21T09:04:41.4696523Z %193 = tt.expand_dims %192 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.4696862Z %194 = tt.expand_dims %193 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.4697205Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:41.4697525Z %196 = tt.expand_dims %191 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:04:41.4697814Z %197 = arith.cmpi eq, %194, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.4698072Z %198 = tt.broadcast %197 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:41.4698366Z %199 = tt.broadcast %195 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:41.4698658Z %200 = arith.select %198, %199, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:41.4698930Z %201 = arith.cmpi eq, %194, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.4699185Z %202 = tt.broadcast %196 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:04:41.4699451Z %203 = tt.broadcast %201 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:04:41.4699765Z %204 = arith.select %203, %202, %200 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:04:41.4700078Z %205 = tt.reshape %204 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:04:41.4700338Z %206 = arith.sitofp %205 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:04:41.4700702Z %207 = tt.dot %179, %206, %160, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.4701022Z scf.yield %207 : tensor<64x32xf32> 2026-02-21T09:04:41.4701203Z } 2026-02-21T09:04:41.4701378Z %21 = arith.truncf %20 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:41.4701751Z tt.descriptor_store %0[%12, %16], %21 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:04:41.4702080Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:04:41.4702319Z tt.return 2026-02-21T09:04:41.4702458Z } 
2026-02-21T09:04:41.4702577Z } 2026-02-21T09:04:41.4702652Z 2026-02-21T09:04:41.4702703Z {-# 2026-02-21T09:04:41.4702832Z external_resources: { 2026-02-21T09:04:41.4702995Z mlir_reproducer: { 2026-02-21T09:04:41.4707333Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 
2026-02-21T09:04:41.4711832Z disable_threading: false, 2026-02-21T09:04:41.4712007Z verify_each: true 2026-02-21T09:04:41.4712149Z } 2026-02-21T09:04:41.4712275Z } 2026-02-21T09:04:41.4712389Z #-} 2026-02-21T09:04:41.4712821Z /tmp/torchinductor_root/6j/c6j5we6g5yhmbkiz7jwo7n2cfu7juplawax3sbqfglmezkbdmhpi.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:41.4713836Z /tmp/torchinductor_root/6j/c6j5we6g5yhmbkiz7jwo7n2cfu7juplawax3sbqfglmezkbdmhpi.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:41.4714665Z [246s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:41.4715837Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[1, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:41.4716907Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:41.4717168Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:04:41.6928993Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:41.6929899Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:41.6930160Z ^ 2026-02-21T09:04:41.6930556Z /tmp/torchinductor_root/xs/cxsrgewg443vgyhatlx4rclkeizlcumsd3b4olveagzqjrmn3bnm.py:86:36: note: called from 2026-02-21T09:04:41.6931125Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:41.6931342Z ^ 2026-02-21T09:04:41.6931798Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:41.6932252Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:41.6932499Z ^ 2026-02-21T09:04:41.6932660Z module { 2026-02-21T09:04:41.6933142Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:41.6933702Z %cst = arith.constant dense<0> : tensor<32x2x32xi8> 2026-02-21T09:04:41.6933923Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:04:41.6934124Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:41.6934310Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:41.6934531Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:41.6934769Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:41.6934995Z %cst_2 = arith.constant dense<4> : tensor<32x32xi8> 2026-02-21T09:04:41.6935236Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:04:41.6935467Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:41.6935678Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:41.6935901Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:04:41.6936142Z %c32_i32 = 
arith.constant 32 : i32 2026-02-21T09:04:41.6936319Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:41.6936502Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:41.6936688Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:41.6936873Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:41.6937059Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:41.6937374Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:41.6937701Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:41.6937877Z %2 = arith.divsi %1, %c4_i32 : i32 2026-02-21T09:04:41.6938064Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:04:41.6938250Z %4 = arith.subi %c224_i32, %3 : i32 2026-02-21T09:04:41.6938428Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:04:41.6938609Z %6 = arith.remsi %1, %c4_i32 : i32 2026-02-21T09:04:41.6938782Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:04:41.6939035Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:04:41.6939205Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:04:41.6939383Z %10 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:04:41.6939614Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:41.6939879Z %12 = tt.splat %10 : i32 -> tensor<32xi32> 2026-02-21T09:04:41.6940087Z %13 = arith.addi %12, %11 : tensor<32xi32> 2026-02-21T09:04:41.6940312Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:04:41.6940538Z %15 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:41.6940805Z %16 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.6941005Z %17 = arith.addi %16, %15 : tensor<64xi32> 2026-02-21T09:04:41.6941204Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:04:41.6941521Z %18 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:04:41.6941907Z %20 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T09:04:41.6942112Z %21 = arith.addi %20, %11 : tensor<32xi32> 
2026-02-21T09:04:41.6942309Z %22 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:41.6942499Z %23 = tt.splat %22 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.6942691Z %24 = arith.addi %23, %15 : tensor<64xi32> 2026-02-21T09:04:41.6942978Z %25 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.6943250Z %26 = arith.muli %25, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:41.6943508Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:41.6943792Z %28 = tt.broadcast %26 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.6944052Z %29 = tt.broadcast %27 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.6944283Z %30 = arith.addi %28, %29 : tensor<64x64xi32> 2026-02-21T09:04:41.6944521Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.6944803Z %32 = tt.addptr %31, %30 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:41.6945053Z %33 = tt.load %32 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.6945289Z %34 = arith.extf %33 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:41.6945560Z %35 = tt.expand_dims %21 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:41.6945818Z %36 = arith.muli %35, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:41.6946065Z %37 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:41.6946336Z %38 = tt.broadcast %36 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:04:41.6946586Z %39 = tt.broadcast %37 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:04:41.6946805Z %40 = arith.addi %38, %39 : tensor<32x32xi32> 2026-02-21T09:04:41.6947038Z %41 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:04:41.6947304Z %42 = tt.addptr %41, %40 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:04:41.6947592Z %43 = tt.load %42 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:04:41.6947850Z %44 = arith.shli %43, %cst_2 
: tensor<32x32xi8> 2026-02-21T09:04:41.6948055Z %45 = arith.shrsi %44, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6948261Z %46 = arith.shrsi %43, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6948497Z %47 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.6948781Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.6949082Z %49 = tt.expand_dims %48 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.6949390Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:41.6949704Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:41.6949998Z %52 = arith.cmpi eq, %49, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.6950269Z %53 = tt.broadcast %52 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:41.6950528Z %54 = tt.broadcast %50 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:41.6950799Z %55 = arith.select %53, %54, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:41.6951055Z %56 = arith.cmpi eq, %49, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.6951330Z %57 = tt.broadcast %51 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:41.6951613Z %58 = tt.broadcast %56 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:41.6951906Z %59 = arith.select %58, %57, %55 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:41.6952170Z %60 = tt.reshape %59 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:04:41.6952426Z %61 = arith.sitofp %60 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:04:41.6952777Z %62 = tt.dot %34, %61, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.6953092Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:41.6953285Z %63 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T09:04:41.6953472Z %64 = arith.addi %arg3, %63 : i32 
2026-02-21T09:04:41.6953660Z %65 = tt.splat %64 : i32 -> tensor<32xi32> 2026-02-21T09:04:41.6953882Z %66 = arith.addi %65, %11 : tensor<32xi32> 2026-02-21T09:04:41.6954072Z %67 = arith.muli %64, %c2_i32 : i32 2026-02-21T09:04:41.6954253Z %68 = tt.splat %67 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.6954438Z %69 = arith.addi %68, %15 : tensor<64xi32> 2026-02-21T09:04:41.6954679Z %70 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.6954932Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:41.6955179Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:41.6955452Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.6955700Z %74 = tt.broadcast %72 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.6955922Z %75 = arith.addi %73, %74 : tensor<64x64xi32> 2026-02-21T09:04:41.6956154Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.6956436Z %77 = tt.addptr %76, %75 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:41.6956679Z %78 = tt.load %77 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.6956905Z %79 = arith.extf %78 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:41.6957173Z %80 = tt.expand_dims %66 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:41.6957424Z %81 = arith.muli %80, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:41.6957675Z %82 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:41.6957948Z %83 = tt.broadcast %81 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:04:41.6958209Z %84 = tt.broadcast %82 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:04:41.6958435Z %85 = arith.addi %83, %84 : tensor<32x32xi32> 2026-02-21T09:04:41.6958675Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:04:41.6958942Z %87 = tt.addptr %86, %85 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 
2026-02-21T09:04:41.6959238Z %88 = tt.load %87 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:04:41.6959501Z %89 = arith.shli %88, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6959713Z %90 = arith.shrsi %89, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6959926Z %91 = arith.shrsi %88, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6960159Z %92 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.6960452Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.6960781Z %94 = tt.expand_dims %93 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.6961093Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:41.6961412Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:41.6961718Z %97 = arith.cmpi eq, %94, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.6961967Z %98 = tt.broadcast %97 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:41.6962256Z %99 = tt.broadcast %95 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:41.6962568Z %100 = arith.select %98, %99, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:41.6962848Z %101 = arith.cmpi eq, %94, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.6963099Z %102 = tt.broadcast %96 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:41.6963377Z %103 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:41.6963659Z %104 = arith.select %103, %102, %100 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:41.6963952Z %105 = tt.reshape %104 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:04:41.6964217Z %106 = arith.sitofp %105 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:04:41.6964606Z %107 = tt.dot %79, %106, %62, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.6964934Z %c2_i32_6 = 
arith.constant 2 : i32 2026-02-21T09:04:41.6965131Z %108 = arith.muli %c32_i32, %c2_i32_6 : i32 2026-02-21T09:04:41.6965336Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:04:41.6965532Z %110 = tt.splat %109 : i32 -> tensor<32xi32> 2026-02-21T09:04:41.6965746Z %111 = arith.addi %110, %11 : tensor<32xi32> 2026-02-21T09:04:41.6965937Z %112 = arith.muli %109, %c2_i32 : i32 2026-02-21T09:04:41.6966132Z %113 = tt.splat %112 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.6966336Z %114 = arith.addi %113, %15 : tensor<64xi32> 2026-02-21T09:04:41.6966589Z %115 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.6966868Z %116 = arith.muli %115, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:41.6967134Z %117 = tt.expand_dims %114 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:41.6967438Z %118 = tt.broadcast %116 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.6967705Z %119 = tt.broadcast %117 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.6967948Z %120 = arith.addi %118, %119 : tensor<64x64xi32> 2026-02-21T09:04:41.6968202Z %121 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.6968494Z %122 = tt.addptr %121, %120 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:41.6968778Z %123 = tt.load %122 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.6969032Z %124 = arith.extf %123 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:41.6969341Z %125 = tt.expand_dims %111 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:41.6969627Z %126 = arith.muli %125, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:41.6969895Z %127 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:41.6970205Z %128 = tt.broadcast %126 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:04:41.6970479Z %129 = tt.broadcast %127 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:04:41.6970737Z %130 = arith.addi %128, %129 : 
tensor<32x32xi32> 2026-02-21T09:04:41.6970983Z %131 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:04:41.6971279Z %132 = tt.addptr %131, %130 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:04:41.6971642Z %133 = tt.load %132 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:04:41.6971920Z %134 = arith.shli %133, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6972184Z %135 = arith.shrsi %134, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6972412Z %136 = arith.shrsi %133, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6972675Z %137 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.6972975Z %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.6973309Z %139 = tt.expand_dims %138 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.6973705Z %140 = tt.expand_dims %135 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:41.6974069Z %141 = tt.expand_dims %136 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:41.6974375Z %142 = arith.cmpi eq, %139, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.6974641Z %143 = tt.broadcast %142 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:41.6974934Z %144 = tt.broadcast %140 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:41.6975243Z %145 = arith.select %143, %144, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:41.6975531Z %146 = arith.cmpi eq, %139, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.6975801Z %147 = tt.broadcast %141 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:41.6976094Z %148 = tt.broadcast %146 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:41.6976383Z %149 = arith.select %148, %147, %145 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:41.6976667Z %150 = tt.reshape %149 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:04:41.6976936Z %151 = arith.sitofp 
%150 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:04:41.6977296Z %152 = tt.dot %124, %151, %107, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.6977612Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:04:41.6977809Z %153 = arith.muli %c32_i32, %c3_i32 : i32 2026-02-21T09:04:41.6978003Z %154 = arith.addi %arg3, %153 : i32 2026-02-21T09:04:41.6978203Z %155 = tt.splat %154 : i32 -> tensor<32xi32> 2026-02-21T09:04:41.6978405Z %156 = arith.addi %155, %11 : tensor<32xi32> 2026-02-21T09:04:41.6978604Z %157 = arith.muli %154, %c2_i32 : i32 2026-02-21T09:04:41.6978801Z %158 = tt.splat %157 : i32 -> tensor<64xi32> 2026-02-21T09:04:41.6978998Z %159 = arith.addi %158, %15 : tensor<64xi32> 2026-02-21T09:04:41.6979263Z %160 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:41.6979533Z %161 = arith.muli %160, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:41.6979807Z %162 = tt.expand_dims %159 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:41.6980097Z %163 = tt.broadcast %161 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.6980366Z %164 = tt.broadcast %162 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:41.6980611Z %165 = arith.addi %163, %164 : tensor<64x64xi32> 2026-02-21T09:04:41.6980852Z %166 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.6981144Z %167 = tt.addptr %166, %165 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:41.6981407Z %168 = tt.load %167 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:41.6981683Z %169 = arith.extf %168 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:41.6981964Z %170 = tt.expand_dims %156 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:41.6982238Z %171 = arith.muli %170, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:41.6982501Z %172 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:41.6982785Z %173 
= tt.broadcast %171 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:04:41.6983045Z %174 = tt.broadcast %172 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:04:41.6983310Z %175 = arith.addi %173, %174 : tensor<32x32xi32> 2026-02-21T09:04:41.6983552Z %176 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:04:41.6983834Z %177 = tt.addptr %176, %175 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:04:41.6984137Z %178 = tt.load %177 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:04:41.6984411Z %179 = arith.shli %178, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6984661Z %180 = arith.shrsi %179, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6984889Z %181 = arith.shrsi %178, %cst_2 : tensor<32x32xi8> 2026-02-21T09:04:41.6985164Z %182 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:41.6985463Z %183 = tt.expand_dims %182 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:41.6985780Z %184 = tt.expand_dims %183 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:41.6986098Z %185 = tt.expand_dims %180 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:41.6986428Z %186 = tt.expand_dims %181 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:04:41.6986706Z %187 = arith.cmpi eq, %184, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:41.6986961Z %188 = tt.broadcast %187 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:41.6987259Z %189 = tt.broadcast %185 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:41.6987547Z %190 = arith.select %188, %189, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:41.6987827Z %191 = arith.cmpi eq, %184, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:41.6988077Z %192 = tt.broadcast %186 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:04:41.6988350Z %193 = tt.broadcast %191 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:04:41.6988626Z %194 = 
arith.select %193, %192, %190 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:04:41.6988915Z %195 = tt.reshape %194 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:04:41.6989181Z %196 = arith.sitofp %195 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:04:41.6989529Z %197 = tt.dot %169, %196, %152, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:41.6989850Z scf.yield %197 : tensor<64x32xf32> 2026-02-21T09:04:41.6990039Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:41.6990268Z %19 = arith.truncf %18 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:41.6990583Z tt.descriptor_store %0[%14, %10], %19 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:04:41.6990862Z tt.return 2026-02-21T09:04:41.6990999Z } 2026-02-21T09:04:41.6991119Z } 2026-02-21T09:04:41.6991187Z 2026-02-21T09:04:41.6991247Z {-# 2026-02-21T09:04:41.6991377Z external_resources: { 2026-02-21T09:04:41.6991583Z mlir_reproducer: { 2026-02-21T09:04:41.6995966Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:41.7000445Z disable_threading: false, 2026-02-21T09:04:41.7000626Z verify_each: true 2026-02-21T09:04:41.7000768Z } 2026-02-21T09:04:41.7000882Z } 2026-02-21T09:04:41.7000993Z #-} 2026-02-21T09:04:41.7001407Z /tmp/torchinductor_root/xs/cxsrgewg443vgyhatlx4rclkeizlcumsd3b4olveagzqjrmn3bnm.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:41.7002486Z /tmp/torchinductor_root/xs/cxsrgewg443vgyhatlx4rclkeizlcumsd3b4olveagzqjrmn3bnm.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:41.7003342Z [247s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:04:41.7004411Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T09:04:41.7005352Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:04:41.7005597Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:04:41.8268951Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 48/48 15.8 configs/s
2026-02-21T09:04:44.7459547Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 341.6
2026-02-21T09:04:44.7463882Z configs/s
2026-02-21T09:04:44.8524101Z [250s] Generation 10 complete:
2026-02-21T09:04:44.8527308Z error=5
2026-02-21T09:04:44.8531326Z ok=44
2026-02-21T09:04:44.8535715Z min=0.1076
2026-02-21T09:04:44.8540020Z mid=0.2099
2026-02-21T09:04:44.8543927Z max=1.8751
2026-02-21T09:04:44.8544173Z best={'block_sizes': [64, 64, 16],
2026-02-21T09:04:44.8544414Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:04:44.8544622Z 'l2_groupings': [1],
2026-02-21T09:04:44.8544819Z 'load_eviction_policies': ['', 'last'],
2026-02-21T09:04:44.8548829Z 'loop_orders': [[0, 1]],
2026-02-21T09:04:44.8552730Z 'num_stages': 3,
2026-02-21T09:04:44.8556497Z 'num_warps': 1,
2026-02-21T09:04:44.8560813Z 'pid_type': 'flat',
2026-02-21T09:04:44.8562127Z 'range_flattens': [None, None],
2026-02-21T09:04:44.8562354Z 'range_multi_buffers': [None, True],
2026-02-21T09:04:44.8562550Z 'range_num_stages': [0, 0],
2026-02-21T09:04:44.8562723Z 'range_unroll_factors': [0, 0],
2026-02-21T09:04:44.8562918Z 'range_warp_specializes': [None, None]}
2026-02-21T09:04:44.8563222Z [250s] Fitting surrogate: 953 points, 953 targets
2026-02-21T09:04:45.5360869Z [250s] Generation 11 starting: 36 neighbors, 2 active search path(s) 2026-02-21T09:04:49.4550450Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 8.5 configs/s 2026-02-21T09:04:50.8928561Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:50.8930300Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:50.8930918Z ^ 2026-02-21T09:04:50.8931334Z /tmp/torchinductor_root/fh/cfhvtc3xkckho6q6qborxawuizxdtlvfltxxo374lkickonfh6jt.py:94:40: note: called from 2026-02-21T09:04:50.8932068Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:50.8932309Z ^ 2026-02-21T09:04:50.8932773Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:50.8933456Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:50.8933738Z ^ 2026-02-21T09:04:50.8934030Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:04:50.8935894Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:50.8936555Z %cst = arith.constant dense<0> : tensor<32x2x64xi8> 2026-02-21T09:04:50.8941222Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:04:50.8945162Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:04:50.8948579Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:50.8949961Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:50.8950206Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:50.8950625Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:50.8950860Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:50.8951093Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 
2026-02-21T09:04:50.8951334Z %cst_2 = arith.constant dense<4> : tensor<32x64xi8> 2026-02-21T09:04:50.8951745Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:04:50.8951997Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:50.8952220Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:50.8952452Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:04:50.8952697Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:50.8952887Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:50.8953088Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:50.8953276Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:50.8953462Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:50.8953787Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:50.8954114Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:50.8954303Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:04:50.8954492Z %3 = arith.minsi %2, %c112_i32 : i32 2026-02-21T09:04:50.8954703Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:04:50.8954911Z %4 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:04:50.8955110Z %5 = arith.muli %4, %c4_i32 : i32 2026-02-21T09:04:50.8955289Z %6 = arith.subi %c1_i32, %5 : i32 2026-02-21T09:04:50.8955476Z %7 = arith.minsi %6, %c4_i32 : i32 2026-02-21T09:04:50.8955667Z %8 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:04:50.8955850Z %9 = arith.remsi %8, %7 : i32 2026-02-21T09:04:50.8956029Z %10 = arith.addi %5, %9 : i32 2026-02-21T09:04:50.8956198Z %11 = arith.divsi %8, %7 : i32 2026-02-21T09:04:50.8956382Z %12 = arith.muli %10, %c64_i32 : i32 2026-02-21T09:04:50.8956617Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:50.8956883Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 2026-02-21T09:04:50.8957085Z %15 = arith.addi %14, %13 : tensor<64xi32> 2026-02-21T09:04:50.8957282Z %16 = arith.muli %11, %c64_i32 : i32 
2026-02-21T09:04:50.8957476Z %17 = tt.splat %16 : i32 -> tensor<64xi32> 2026-02-21T09:04:50.8957666Z %18 = arith.addi %17, %13 : tensor<64xi32> 2026-02-21T09:04:50.8957864Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:04:50.8958139Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:04:50.8958474Z %19 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c96_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:04:50.8958847Z %22 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:50.8959112Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:04:50.8959379Z %24 = arith.addi %23, %22 : tensor<32xi32> 2026-02-21T09:04:50.8959638Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:04:50.8959840Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:04:50.8960051Z %27 = arith.addi %26, %13 : tensor<64xi32> 2026-02-21T09:04:50.8960341Z %28 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:50.8960627Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:50.8960894Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:50.8961186Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:50.8961457Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:50.8961745Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:04:50.8962000Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:50.8962327Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:50.8962597Z %36 = tt.load %35 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:50.8962844Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:50.8963130Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:50.8963406Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:50.8963661Z %40 = 
tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:50.8963957Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:50.8964223Z %42 = tt.broadcast %40 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:50.8964456Z %43 = arith.addi %41, %42 : tensor<32x64xi32> 2026-02-21T09:04:50.8964705Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:50.8964984Z %45 = tt.addptr %44, %43 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:50.8965299Z %46 = tt.load %45 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:50.8965560Z %47 = arith.shli %46, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.8965785Z %48 = arith.shrsi %47, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.8966006Z %49 = arith.shrsi %46, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.8966246Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:50.8966541Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:50.8966850Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:50.8967186Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:50.8967504Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:50.8967793Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:50.8968048Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:50.8968317Z %57 = tt.broadcast %53 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:50.8968609Z %58 = arith.select %56, %57, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:50.8968874Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:50.8969129Z %60 = tt.broadcast %54 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:50.8969429Z %61 = tt.broadcast %59 : 
tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:50.8969697Z %62 = arith.select %61, %60, %58 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:50.8969980Z %63 = tt.reshape %62 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:50.8970232Z %64 = arith.sitofp %63 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:50.8970589Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:50.8970934Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:04:50.8971134Z %66 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:04:50.8971357Z %67 = arith.addi %arg4, %66 : i32 2026-02-21T09:04:50.8971624Z %68 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:50.8971883Z %69 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:04:50.8972083Z %70 = arith.addi %69, %68 : tensor<32xi32> 2026-02-21T09:04:50.8972283Z %71 = arith.muli %67, %c2_i32 : i32 2026-02-21T09:04:50.8972472Z %72 = tt.splat %71 : i32 -> tensor<64xi32> 2026-02-21T09:04:50.8972674Z %73 = arith.addi %72, %13 : tensor<64xi32> 2026-02-21T09:04:50.8972928Z %74 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:50.8973192Z %75 = arith.muli %74, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:50.8973479Z %76 = tt.expand_dims %73 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:50.8973763Z %77 = tt.broadcast %75 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:50.8974026Z %78 = tt.broadcast %76 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:50.8974257Z %79 = arith.addi %77, %78 : tensor<64x64xi32> 2026-02-21T09:04:50.8974503Z %80 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:50.8974793Z %81 = tt.addptr %80, %79 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:50.8975044Z %82 = tt.load %81 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:50.8975285Z %83 = arith.extf %82 : tensor<64x64xbf16> to 
tensor<64x64xf32> 2026-02-21T09:04:50.8975561Z %84 = tt.expand_dims %70 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:50.8975823Z %85 = arith.muli %84, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:50.8976071Z %86 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:50.8976360Z %87 = tt.broadcast %85 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:50.8976638Z %88 = tt.broadcast %86 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:50.8976880Z %89 = arith.addi %87, %88 : tensor<32x64xi32> 2026-02-21T09:04:50.8977135Z %90 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:50.8977415Z %91 = tt.addptr %90, %89 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:50.8977736Z %92 = tt.load %91 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:50.8978013Z %93 = arith.shli %92, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.8978236Z %94 = arith.shrsi %93, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.8978465Z %95 = arith.shrsi %92, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.8978718Z %96 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:50.8979027Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:50.8979345Z %98 = tt.expand_dims %97 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:50.8979685Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:50.8980028Z %100 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:50.8980329Z %101 = arith.cmpi eq, %98, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:50.8980635Z %102 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:50.8980929Z %103 = tt.broadcast %99 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:50.8981241Z %104 = arith.select %102, %103, %cst : tensor<32x2x64xi1>, 
tensor<32x2x64xi8> 2026-02-21T09:04:50.8981575Z %105 = arith.cmpi eq, %98, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:50.8981851Z %106 = tt.broadcast %100 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:50.8982197Z %107 = tt.broadcast %105 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:50.8982533Z %108 = arith.select %107, %106, %104 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:50.8982855Z %109 = tt.reshape %108 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:50.8983141Z %110 = arith.sitofp %109 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:50.8983523Z %111 = tt.dot %83, %110, %65, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:50.8983857Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:04:50.8984073Z %112 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:04:50.8984287Z %113 = arith.addi %arg4, %112 : i32 2026-02-21T09:04:50.8984528Z %114 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:50.8984821Z %115 = tt.splat %113 : i32 -> tensor<32xi32> 2026-02-21T09:04:50.8985032Z %116 = arith.addi %115, %114 : tensor<32xi32> 2026-02-21T09:04:50.8985236Z %117 = arith.muli %113, %c2_i32 : i32 2026-02-21T09:04:50.8985435Z %118 = tt.splat %117 : i32 -> tensor<64xi32> 2026-02-21T09:04:50.8985642Z %119 = arith.addi %118, %13 : tensor<64xi32> 2026-02-21T09:04:50.8985894Z %120 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:50.8986172Z %121 = arith.muli %120, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:50.8986443Z %122 = tt.expand_dims %119 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:50.8986733Z %123 = tt.broadcast %121 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:50.8987003Z %124 = tt.broadcast %122 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:50.8987241Z %125 = arith.addi %123, %124 : tensor<64x64xi32> 2026-02-21T09:04:50.8987492Z %126 = 
tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:50.8987785Z %127 = tt.addptr %126, %125 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:50.8988054Z %128 = tt.load %127 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:50.8988298Z %129 = arith.extf %128 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:50.8988584Z %130 = tt.expand_dims %116 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:50.8988860Z %131 = arith.muli %130, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:50.8989117Z %132 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:50.8989409Z %133 = tt.broadcast %131 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:50.8989669Z %134 = tt.broadcast %132 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:50.8989914Z %135 = arith.addi %133, %134 : tensor<32x64xi32> 2026-02-21T09:04:50.8990154Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:50.8990430Z %137 = tt.addptr %136, %135 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:50.8990741Z %138 = tt.load %137 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:50.8991008Z %139 = arith.shli %138, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.8991232Z %140 = arith.shrsi %139, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.8991458Z %141 = arith.shrsi %138, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.8991772Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:50.8992073Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:50.8992388Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:50.8992716Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:50.8993044Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:50.8993393Z %147 = 
arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:50.8993688Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:50.8993964Z %149 = tt.broadcast %145 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:50.8994264Z %150 = arith.select %148, %149, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:50.8994543Z %151 = arith.cmpi eq, %144, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:50.8994809Z %152 = tt.broadcast %146 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:50.8995094Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:50.8995378Z %154 = arith.select %153, %152, %150 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:50.8995697Z %155 = tt.reshape %154 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:50.8995963Z %156 = arith.sitofp %155 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:50.8996325Z %157 = tt.dot %129, %156, %111, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:50.8996645Z scf.yield %157 : tensor<64x64xf32> 2026-02-21T09:04:50.8996842Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:50.8997158Z %20 = scf.for %arg4 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %19) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:04:50.8997520Z %22 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:50.8997776Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:04:50.8997981Z %24 = arith.addi %23, %22 : tensor<32xi32> 2026-02-21T09:04:50.8998185Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:04:50.8998379Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:04:50.8998583Z %27 = arith.addi %26, %13 : tensor<64xi32> 2026-02-21T09:04:50.8998839Z %28 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:50.8999104Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 
2026-02-21T09:04:50.8999363Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:50.8999643Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:50.8999904Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:50.9000136Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:04:50.9000380Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:50.9000669Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:50.9000922Z %36 = tt.load %35 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:50.9001166Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:50.9001447Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:50.9001742Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:50.9001998Z %40 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:50.9002278Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:50.9002538Z %42 = tt.broadcast %40 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:50.9002766Z %43 = arith.addi %41, %42 : tensor<32x64xi32> 2026-02-21T09:04:50.9003032Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:50.9003301Z %45 = tt.addptr %44, %43 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:50.9003602Z %46 = tt.load %45 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:50.9003865Z %47 = arith.shli %46, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.9004078Z %48 = arith.shrsi %47, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.9004329Z %49 = arith.shrsi %46, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:50.9004596Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:50.9004892Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 
2026-02-21T09:04:50.9005196Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:50.9005519Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:50.9005843Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:50.9006121Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:50.9006376Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:50.9006666Z %57 = tt.broadcast %53 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:50.9006956Z %58 = arith.select %56, %57, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:50.9007227Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:50.9007476Z %60 = tt.broadcast %54 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:50.9007746Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:50.9008018Z %62 = arith.select %61, %60, %58 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:50.9008299Z %63 = tt.reshape %62 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:50.9008557Z %64 = arith.sitofp %63 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:50.9008915Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:50.9009247Z scf.yield %65 : tensor<64x64xf32> 2026-02-21T09:04:50.9009473Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:04:50.9009742Z %21 = arith.truncf %20 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:04:50.9010055Z tt.descriptor_store %0[%12, %16], %21 : !tt.tensordesc>, tensor<64x64xbf16> 2026-02-21T09:04:50.9010418Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T09:04:50.9010674Z tt.return 2026-02-21T09:04:50.9010805Z } 2026-02-21T09:04:50.9010933Z } 2026-02-21T09:04:50.9011003Z 
2026-02-21T09:04:50.9011054Z {-# 2026-02-21T09:04:50.9011192Z external_resources: { 2026-02-21T09:04:50.9011352Z mlir_reproducer: { 2026-02-21T09:04:50.9015772Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:50.9020271Z disable_threading: false, 
2026-02-21T09:04:50.9020465Z verify_each: true 2026-02-21T09:04:50.9020629Z } 2026-02-21T09:04:50.9020790Z } 2026-02-21T09:04:50.9020956Z #-} 2026-02-21T09:04:50.9021531Z /tmp/torchinductor_root/fh/cfhvtc3xkckho6q6qborxawuizxdtlvfltxxo374lkickonfh6jt.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:50.9022846Z /tmp/torchinductor_root/fh/cfhvtc3xkckho6q6qborxawuizxdtlvfltxxo374lkickonfh6jt.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:50.9023821Z [256s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:50.9025173Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:04:50.9026358Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:50.9026659Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:04:51.1526129Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:51.1527521Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:51.1527867Z ^ 2026-02-21T09:04:51.1530129Z /tmp/torchinductor_root/ti/ctinma5luqrzujmr4v5p3lyc5qtb6unzbzuvqmy7xipuumtswy6g.py:86:36: note: called from 2026-02-21T09:04:51.1530611Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:51.1535142Z ^ 2026-02-21T09:04:51.1539957Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:51.1544283Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:51.1548769Z ^ 2026-02-21T09:04:51.1553308Z module { 2026-02-21T09:04:51.1555693Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:51.1556692Z %cst = arith.constant dense<0> : tensor<32x2x128xi8> 2026-02-21T09:04:51.1560507Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:04:51.1561860Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:51.1562200Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:51.1562440Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:51.1562654Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:51.1562867Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:51.1563311Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:51.1567811Z %cst_2 = arith.constant dense<4> : tensor<32x128xi8> 2026-02-21T09:04:51.1569648Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:04:51.1569931Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:51.1570163Z %c2_i32 = arith.constant 2 : i32 
2026-02-21T09:04:51.1570400Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x128xf32> 2026-02-21T09:04:51.1570839Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:04:51.1571036Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:51.1571254Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:51.1571450Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:51.1571713Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:51.1571907Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:51.1572232Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:51.1572577Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:51.1572770Z %2 = arith.divsi %1, %c224_i32 : i32 2026-02-21T09:04:51.1572975Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:04:51.1573171Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:04:51.1573356Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:04:51.1573598Z %6 = arith.remsi %1, %c224_i32 : i32 2026-02-21T09:04:51.1573783Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:04:51.1573965Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:04:51.1574132Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:04:51.1574316Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:04:51.1574549Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:51.1574815Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.1575017Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:04:51.1575204Z %14 = arith.muli %9, %c128_i32 : i32 2026-02-21T09:04:51.1575440Z %15 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:51.1575740Z %16 = tt.splat %14 : i32 -> tensor<128xi32> 2026-02-21T09:04:51.1575936Z %17 = arith.addi %16, %15 : tensor<128xi32> 2026-02-21T09:04:51.1576134Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:04:51.1576321Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:04:51.1576637Z %18 = scf.for %arg3 = %c0_i32 to %c4032_i32 step 
%c96_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x128xf32>) : i32 { 2026-02-21T09:04:51.1577010Z %21 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:51.1577263Z %22 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T09:04:51.1577479Z %23 = arith.addi %22, %21 : tensor<32xi32> 2026-02-21T09:04:51.1577673Z %24 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:51.1577873Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.1578076Z %26 = arith.addi %25, %11 : tensor<64xi32> 2026-02-21T09:04:51.1578325Z %27 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.1578601Z %28 = arith.muli %27, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.1578856Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.1579145Z %30 = tt.broadcast %28 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.1579405Z %31 = tt.broadcast %29 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.1579650Z %32 = arith.addi %30, %31 : tensor<64x64xi32> 2026-02-21T09:04:51.1579924Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.1580225Z %34 = tt.addptr %33, %32 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:51.1580498Z %35 = tt.load %34 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.1580742Z %36 = arith.extf %35 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:51.1581075Z %37 = tt.expand_dims %23 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:51.1581347Z %38 = arith.muli %37, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:51.1581659Z %39 = tt.expand_dims %17 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:51.1581971Z %40 = tt.broadcast %38 : tensor<32x1xi32> -> tensor<32x128xi32> 2026-02-21T09:04:51.1582261Z %41 = tt.broadcast %39 : tensor<1x128xi32> -> tensor<32x128xi32> 2026-02-21T09:04:51.1582550Z %42 = arith.addi %40, %41 : tensor<32x128xi32> 
2026-02-21T09:04:51.1582796Z %43 = tt.splat %arg1 : !tt.ptr -> tensor<32x128x!tt.ptr> 2026-02-21T09:04:51.1583119Z %44 = tt.addptr %43, %42 : tensor<32x128x!tt.ptr>, tensor<32x128xi32> 2026-02-21T09:04:51.1583430Z %45 = tt.load %44 evictionPolicy = evict_last : tensor<32x128x!tt.ptr> 2026-02-21T09:04:51.1583725Z %46 = arith.shli %45, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1583962Z %47 = arith.shrsi %46, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1584204Z %48 = arith.shrsi %45, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1584469Z %49 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.1584770Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.1585131Z %51 = tt.expand_dims %50 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.1585466Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:51.1585813Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:51.1586112Z %54 = arith.cmpi eq, %51, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.1586365Z %55 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:51.1586655Z %56 = tt.broadcast %52 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:51.1586954Z %57 = arith.select %55, %56, %cst : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:51.1587237Z %58 = arith.cmpi eq, %51, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.1587500Z %59 = tt.broadcast %53 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:51.1587772Z %60 = tt.broadcast %58 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:51.1588051Z %61 = arith.select %60, %59, %57 : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:51.1588326Z %62 = tt.reshape %61 : tensor<32x2x128xi8> -> tensor<64x128xi8> 2026-02-21T09:04:51.1588590Z %63 = arith.sitofp %62 : tensor<64x128xi8> to 
tensor<64x128xf32> 2026-02-21T09:04:51.1588944Z %64 = tt.dot %36, %63, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:51.1589270Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:04:51.1589468Z %65 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:04:51.1589659Z %66 = arith.addi %arg3, %65 : i32 2026-02-21T09:04:51.1589893Z %67 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:51.1590132Z %68 = tt.splat %66 : i32 -> tensor<32xi32> 2026-02-21T09:04:51.1590332Z %69 = arith.addi %68, %67 : tensor<32xi32> 2026-02-21T09:04:51.1590519Z %70 = arith.muli %66, %c2_i32 : i32 2026-02-21T09:04:51.1590715Z %71 = tt.splat %70 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.1590905Z %72 = arith.addi %71, %11 : tensor<64xi32> 2026-02-21T09:04:51.1591159Z %73 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.1591427Z %74 = arith.muli %73, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.1591709Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.1591996Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.1592251Z %77 = tt.broadcast %75 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.1592520Z %78 = arith.addi %76, %77 : tensor<64x64xi32> 2026-02-21T09:04:51.1592762Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.1593052Z %80 = tt.addptr %79, %78 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:51.1593316Z %81 = tt.load %80 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.1593554Z %82 = arith.extf %81 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:51.1593863Z %83 = tt.expand_dims %69 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:51.1594116Z %84 = arith.muli %83, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:51.1594399Z %85 = tt.expand_dims %17 {axis = 0 : i32} : tensor<128xi32> -> 
tensor<1x128xi32> 2026-02-21T09:04:51.1594696Z %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x128xi32> 2026-02-21T09:04:51.1594955Z %87 = tt.broadcast %85 : tensor<1x128xi32> -> tensor<32x128xi32> 2026-02-21T09:04:51.1595196Z %88 = arith.addi %86, %87 : tensor<32x128xi32> 2026-02-21T09:04:51.1595428Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<32x128x!tt.ptr> 2026-02-21T09:04:51.1595704Z %90 = tt.addptr %89, %88 : tensor<32x128x!tt.ptr>, tensor<32x128xi32> 2026-02-21T09:04:51.1595996Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<32x128x!tt.ptr> 2026-02-21T09:04:51.1596295Z %92 = arith.shli %91, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1596521Z %93 = arith.shrsi %92, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1596733Z %94 = arith.shrsi %91, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1596981Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.1597261Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.1597564Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.1597876Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:51.1598198Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:51.1598486Z %100 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.1598741Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:51.1599023Z %102 = tt.broadcast %98 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:51.1599318Z %103 = arith.select %101, %102, %cst : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:51.1599600Z %104 = arith.cmpi eq, %97, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.1599858Z %105 = tt.broadcast %99 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:51.1600130Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> 
tensor<32x2x128xi1> 2026-02-21T09:04:51.1600418Z %107 = arith.select %106, %105, %103 : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:51.1600706Z %108 = tt.reshape %107 : tensor<32x2x128xi8> -> tensor<64x128xi8> 2026-02-21T09:04:51.1600981Z %109 = arith.sitofp %108 : tensor<64x128xi8> to tensor<64x128xf32> 2026-02-21T09:04:51.1601332Z %110 = tt.dot %82, %109, %64, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:51.1601686Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:04:51.1601888Z %111 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:04:51.1602087Z %112 = arith.addi %arg3, %111 : i32 2026-02-21T09:04:51.1602322Z %113 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:51.1602574Z %114 = tt.splat %112 : i32 -> tensor<32xi32> 2026-02-21T09:04:51.1602787Z %115 = arith.addi %114, %113 : tensor<32xi32> 2026-02-21T09:04:51.1602983Z %116 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:04:51.1603184Z %117 = tt.splat %116 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.1603391Z %118 = arith.addi %117, %11 : tensor<64xi32> 2026-02-21T09:04:51.1603672Z %119 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.1603948Z %120 = arith.muli %119, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.1604206Z %121 = tt.expand_dims %118 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.1604504Z %122 = tt.broadcast %120 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.1604821Z %123 = tt.broadcast %121 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.1605070Z %124 = arith.addi %122, %123 : tensor<64x64xi32> 2026-02-21T09:04:51.1605345Z %125 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.1605636Z %126 = tt.addptr %125, %124 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:51.1605911Z %127 = tt.load %126 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.1606153Z %128 = 
arith.extf %127 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:51.1606459Z %129 = tt.expand_dims %115 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:51.1606735Z %130 = arith.muli %129, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:51.1607009Z %131 = tt.expand_dims %17 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:51.1607327Z %132 = tt.broadcast %130 : tensor<32x1xi32> -> tensor<32x128xi32> 2026-02-21T09:04:51.1607629Z %133 = tt.broadcast %131 : tensor<1x128xi32> -> tensor<32x128xi32> 2026-02-21T09:04:51.1607881Z %134 = arith.addi %132, %133 : tensor<32x128xi32> 2026-02-21T09:04:51.1608116Z %135 = tt.splat %arg1 : !tt.ptr -> tensor<32x128x!tt.ptr> 2026-02-21T09:04:51.1608408Z %136 = tt.addptr %135, %134 : tensor<32x128x!tt.ptr>, tensor<32x128xi32> 2026-02-21T09:04:51.1608720Z %137 = tt.load %136 evictionPolicy = evict_last : tensor<32x128x!tt.ptr> 2026-02-21T09:04:51.1608991Z %138 = arith.shli %137, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1609223Z %139 = arith.shrsi %138, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1609449Z %140 = arith.shrsi %137, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1609701Z %141 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.1609997Z %142 = tt.expand_dims %141 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.1610322Z %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.1610661Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:51.1610993Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:51.1611292Z %146 = arith.cmpi eq, %143, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.1611567Z %147 = tt.broadcast %146 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:51.1611861Z %148 = tt.broadcast %144 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 
2026-02-21T09:04:51.1612164Z %149 = arith.select %147, %148, %cst : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:51.1612440Z %150 = arith.cmpi eq, %143, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.1612700Z %151 = tt.broadcast %145 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:51.1612970Z %152 = tt.broadcast %150 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:51.1613262Z %153 = arith.select %152, %151, %149 : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:51.1613550Z %154 = tt.reshape %153 : tensor<32x2x128xi8> -> tensor<64x128xi8> 2026-02-21T09:04:51.1613820Z %155 = arith.sitofp %154 : tensor<64x128xi8> to tensor<64x128xf32> 2026-02-21T09:04:51.1614186Z %156 = tt.dot %128, %155, %110, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:51.1614509Z scf.yield %156 : tensor<64x128xf32> 2026-02-21T09:04:51.1614733Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:51.1615039Z %19 = scf.for %arg3 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg4 = %18) -> (tensor<64x128xf32>) : i32 { 2026-02-21T09:04:51.1615409Z %21 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:51.1615658Z %22 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T09:04:51.1615864Z %23 = arith.addi %22, %21 : tensor<32xi32> 2026-02-21T09:04:51.1616065Z %24 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:51.1616286Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.1616491Z %26 = arith.addi %25, %11 : tensor<64xi32> 2026-02-21T09:04:51.1616762Z %27 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.1617033Z %28 = arith.muli %27, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.1617285Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.1617577Z %30 = tt.broadcast %28 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.1617841Z %31 = tt.broadcast %29 
: tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.1618069Z %32 = arith.addi %30, %31 : tensor<64x64xi32> 2026-02-21T09:04:51.1618317Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.1618598Z %34 = tt.addptr %33, %32 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:51.1618884Z %35 = tt.load %34 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.1619119Z %36 = arith.extf %35 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:51.1619404Z %37 = tt.expand_dims %23 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:51.1619671Z %38 = arith.muli %37, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:51.1619927Z %39 = tt.expand_dims %17 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:51.1620224Z %40 = tt.broadcast %38 : tensor<32x1xi32> -> tensor<32x128xi32> 2026-02-21T09:04:51.1620490Z %41 = tt.broadcast %39 : tensor<1x128xi32> -> tensor<32x128xi32> 2026-02-21T09:04:51.1620730Z %42 = arith.addi %40, %41 : tensor<32x128xi32> 2026-02-21T09:04:51.1620969Z %43 = tt.splat %arg1 : !tt.ptr -> tensor<32x128x!tt.ptr> 2026-02-21T09:04:51.1621236Z %44 = tt.addptr %43, %42 : tensor<32x128x!tt.ptr>, tensor<32x128xi32> 2026-02-21T09:04:51.1621601Z %45 = tt.load %44 evictionPolicy = evict_last : tensor<32x128x!tt.ptr> 2026-02-21T09:04:51.1621868Z %46 = arith.shli %45, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1622089Z %47 = arith.shrsi %46, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1622308Z %48 = arith.shrsi %45, %cst_2 : tensor<32x128xi8> 2026-02-21T09:04:51.1622558Z %49 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.1622853Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.1623157Z %51 = tt.expand_dims %50 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.1623506Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:51.1623842Z %53 
= tt.expand_dims %48 {axis = 1 : i32} : tensor<32x128xi8> -> tensor<32x1x128xi8> 2026-02-21T09:04:51.1624143Z %54 = arith.cmpi eq, %51, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.1624399Z %55 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:51.1624691Z %56 = tt.broadcast %52 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:51.1624999Z %57 = arith.select %55, %56, %cst : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:51.1625276Z %58 = arith.cmpi eq, %51, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.1625541Z %59 = tt.broadcast %53 : tensor<32x1x128xi8> -> tensor<32x2x128xi8> 2026-02-21T09:04:51.1625819Z %60 = tt.broadcast %58 : tensor<1x2x1xi1> -> tensor<32x2x128xi1> 2026-02-21T09:04:51.1626136Z %61 = arith.select %60, %59, %57 : tensor<32x2x128xi1>, tensor<32x2x128xi8> 2026-02-21T09:04:51.1626428Z %62 = tt.reshape %61 : tensor<32x2x128xi8> -> tensor<64x128xi8> 2026-02-21T09:04:51.1626696Z %63 = arith.sitofp %62 : tensor<64x128xi8> to tensor<64x128xf32> 2026-02-21T09:04:51.1627070Z %64 = tt.dot %36, %63, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x128xf32> -> tensor<64x128xf32> 2026-02-21T09:04:51.1627423Z scf.yield %64 : tensor<64x128xf32> 2026-02-21T09:04:51.1627659Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:04:51.1627929Z %20 = arith.truncf %19 : tensor<64x128xf32> to tensor<64x128xbf16> 2026-02-21T09:04:51.1628298Z tt.descriptor_store %0[%10, %14], %20 : !tt.tensordesc>, tensor<64x128xbf16> 2026-02-21T09:04:51.1628593Z tt.return 2026-02-21T09:04:51.1628727Z } 2026-02-21T09:04:51.1628858Z } 2026-02-21T09:04:51.1628931Z 2026-02-21T09:04:51.1628984Z {-# 2026-02-21T09:04:51.1629128Z external_resources: { 2026-02-21T09:04:51.1629291Z mlir_reproducer: { 2026-02-21T09:04:51.1633792Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, 
triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:51.1638192Z disable_threading: false, 2026-02-21T09:04:51.1638370Z verify_each: true 2026-02-21T09:04:51.1638521Z } 2026-02-21T09:04:51.1638641Z } 2026-02-21T09:04:51.1638765Z #-} 2026-02-21T09:04:51.1639192Z /tmp/torchinductor_root/ti/ctinma5luqrzujmr4v5p3lyc5qtb6unzbzuvqmy7xipuumtswy6g.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 
2026-02-21T09:04:51.1640217Z /tmp/torchinductor_root/ti/ctinma5luqrzujmr4v5p3lyc5qtb6unzbzuvqmy7xipuumtswy6g.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:51.1641037Z [256s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:51.1642127Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 128], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:51.1643114Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:51.1643374Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:04:51.3759796Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:51.3764155Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:51.3768638Z ^ 2026-02-21T09:04:51.3773035Z /tmp/torchinductor_root/yj/cyjez3nrdcswigki65zevh734g4mhrgxlizqwmx7ugpygkmu7abj.py:86:36: note: called from 2026-02-21T09:04:51.3777879Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:51.3782998Z ^ 2026-02-21T09:04:51.3787832Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:51.3788744Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:51.3789048Z ^ 2026-02-21T09:04:51.3789240Z module { 2026-02-21T09:04:51.3789979Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:51.3790552Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:04:51.3790782Z %c1792_i32 = arith.constant 1792 : i32 2026-02-21T09:04:51.3790973Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:51.3791166Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:51.3791348Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:51.3791723Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:51.3791968Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:51.3792198Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:04:51.3792441Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:04:51.3792674Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:51.3792890Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:51.3793110Z %cst_5 = arith.constant dense<0.000000e+00> : 
tensor<64x32xf32> 2026-02-21T09:04:51.3793349Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:51.3793528Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:04:51.3793709Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:51.3793896Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:51.3794080Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:51.3794266Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:51.3794574Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:51.3794902Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:51.3795083Z %2 = arith.divsi %1, %c1792_i32 : i32 2026-02-21T09:04:51.3795272Z %3 = arith.muli %2, %c8_i32 : i32 2026-02-21T09:04:51.3795455Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:04:51.3795627Z %5 = arith.minsi %4, %c8_i32 : i32 2026-02-21T09:04:51.3795807Z %6 = arith.remsi %1, %c1792_i32 : i32 2026-02-21T09:04:51.3795985Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:04:51.3796167Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:04:51.3796337Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:04:51.3796518Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:04:51.3796752Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:51.3797015Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.3797219Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:04:51.3797405Z %14 = arith.muli %9, %c32_i32 : i32 2026-02-21T09:04:51.3797634Z %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:51.3797942Z %16 = tt.splat %14 : i32 -> tensor<32xi32> 2026-02-21T09:04:51.3798149Z %17 = arith.addi %16, %15 : tensor<32xi32> 2026-02-21T09:04:51.3798341Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:04:51.3798541Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:04:51.3798877Z %18 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 
2026-02-21T09:04:51.3799254Z %64 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.3799468Z %65 = arith.addi %64, %11 : tensor<64xi32> 2026-02-21T09:04:51.3799691Z %66 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:04:51.3799937Z %67 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:51.3800193Z %68 = tt.splat %66 : i32 -> tensor<128xi32> 2026-02-21T09:04:51.3800412Z %69 = arith.addi %68, %67 : tensor<128xi32> 2026-02-21T09:04:51.3800671Z %70 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.3800942Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.3801208Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:51.3801500Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:51.3801846Z %74 = tt.broadcast %72 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:51.3802087Z %75 = arith.addi %73, %74 : tensor<64x128xi32> 2026-02-21T09:04:51.3802339Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:51.3802632Z %77 = tt.addptr %76, %75 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:51.3802891Z %78 = tt.load %77 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:51.3803137Z %79 = arith.extf %78 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:51.3803415Z %80 = tt.expand_dims %65 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.3803685Z %81 = arith.muli %80, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:51.3803937Z %82 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:51.3804227Z %83 = tt.broadcast %81 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:51.3804491Z %84 = tt.broadcast %82 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:51.3804722Z %85 = arith.addi %83, %84 : tensor<64x32xi32> 2026-02-21T09:04:51.3804966Z %86 = tt.splat %arg1 : !tt.ptr -> 
tensor<64x32x!tt.ptr> 2026-02-21T09:04:51.3805232Z %87 = tt.addptr %86, %85 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:51.3805532Z %88 = tt.load %87 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:51.3805789Z %89 = arith.shli %88, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3806005Z %90 = arith.shrsi %89, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3806220Z %91 = arith.shrsi %88, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3806457Z %92 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.3806747Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.3807045Z %94 = tt.expand_dims %93 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.3807367Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:51.3807688Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:51.3807961Z %97 = arith.cmpi eq, %94, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.3808211Z %98 = tt.broadcast %97 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:51.3808474Z %99 = tt.broadcast %95 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:51.3808761Z %100 = arith.select %98, %99, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:51.3809060Z %101 = arith.cmpi eq, %94, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.3809318Z %102 = tt.broadcast %96 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:51.3809593Z %103 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:51.3809873Z %104 = arith.select %103, %102, %100 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:51.3810164Z %105 = tt.reshape %104 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:51.3810461Z %106 = arith.sitofp %105 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:51.3810862Z %107 = tt.dot %79, %106, %arg4, 
inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:51.3811201Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:04:51.3811398Z %108 = arith.muli %c64_i32, %c1_i32_6 : i32 2026-02-21T09:04:51.3811643Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:04:51.3811842Z %110 = tt.splat %109 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.3812058Z %111 = arith.addi %110, %11 : tensor<64xi32> 2026-02-21T09:04:51.3812250Z %112 = arith.muli %109, %c2_i32 : i32 2026-02-21T09:04:51.3812493Z %113 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:51.3812745Z %114 = tt.splat %112 : i32 -> tensor<128xi32> 2026-02-21T09:04:51.3812988Z %115 = arith.addi %114, %113 : tensor<128xi32> 2026-02-21T09:04:51.3813254Z %116 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.3813529Z %117 = arith.muli %116, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.3813802Z %118 = tt.expand_dims %115 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:51.3814099Z %119 = tt.broadcast %117 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:51.3814378Z %120 = tt.broadcast %118 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:51.3814621Z %121 = arith.addi %119, %120 : tensor<64x128xi32> 2026-02-21T09:04:51.3814878Z %122 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:51.3815175Z %123 = tt.addptr %122, %121 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:51.3815444Z %124 = tt.load %123 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:51.3815693Z %125 = arith.extf %124 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:51.3815986Z %126 = tt.expand_dims %111 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.3816260Z %127 = arith.muli %126, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:51.3816524Z %128 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 
2026-02-21T09:04:51.3816810Z %129 = tt.broadcast %127 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:51.3817099Z %130 = tt.broadcast %128 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:51.3817349Z %131 = arith.addi %129, %130 : tensor<64x32xi32> 2026-02-21T09:04:51.3817598Z %132 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:51.3817886Z %133 = tt.addptr %132, %131 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:51.3818208Z %134 = tt.load %133 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:51.3818490Z %135 = arith.shli %134, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3818717Z %136 = arith.shrsi %135, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3818951Z %137 = arith.shrsi %134, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3819205Z %138 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.3819517Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.3819843Z %140 = tt.expand_dims %139 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.3820186Z %141 = tt.expand_dims %136 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:51.3820560Z %142 = tt.expand_dims %137 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:51.3820858Z %143 = arith.cmpi eq, %140, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.3821127Z %144 = tt.broadcast %143 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:51.3821412Z %145 = tt.broadcast %141 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:51.3821783Z %146 = arith.select %144, %145, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:51.3822078Z %147 = arith.cmpi eq, %140, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.3822375Z %148 = tt.broadcast %142 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:51.3822669Z %149 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 
2026-02-21T09:04:51.3822963Z %150 = arith.select %149, %148, %146 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:51.3823294Z %151 = tt.reshape %150 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:51.3823582Z %152 = arith.sitofp %151 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:51.3823965Z %153 = tt.dot %125, %152, %107, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:51.3824315Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:04:51.3824543Z %154 = arith.muli %c64_i32, %c2_i32_7 : i32 2026-02-21T09:04:51.3824760Z %155 = arith.addi %arg3, %154 : i32 2026-02-21T09:04:51.3824971Z %156 = tt.splat %155 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.3825192Z %157 = arith.addi %156, %11 : tensor<64xi32> 2026-02-21T09:04:51.3825402Z %158 = arith.muli %155, %c2_i32 : i32 2026-02-21T09:04:51.3825644Z %159 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:51.3825912Z %160 = tt.splat %158 : i32 -> tensor<128xi32> 2026-02-21T09:04:51.3826126Z %161 = arith.addi %160, %159 : tensor<128xi32> 2026-02-21T09:04:51.3826399Z %162 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.3826672Z %163 = arith.muli %162, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.3826933Z %164 = tt.expand_dims %161 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:51.3827232Z %165 = tt.broadcast %163 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:51.3827500Z %166 = tt.broadcast %164 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:51.3827749Z %167 = arith.addi %165, %166 : tensor<64x128xi32> 2026-02-21T09:04:51.3827997Z %168 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:51.3828299Z %169 = tt.addptr %168, %167 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:51.3828570Z %170 = tt.load %169 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:51.3828812Z %171 = arith.extf %170 
: tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:51.3829111Z %172 = tt.expand_dims %157 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.3829383Z %173 = arith.muli %172, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:51.3829649Z %174 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:51.3829945Z %175 = tt.broadcast %173 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:51.3830207Z %176 = tt.broadcast %174 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:51.3830454Z %177 = arith.addi %175, %176 : tensor<64x32xi32> 2026-02-21T09:04:51.3830689Z %178 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:51.3830975Z %179 = tt.addptr %178, %177 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:51.3831279Z %180 = tt.load %179 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:51.3831586Z %181 = arith.shli %180, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3831839Z %182 = arith.shrsi %181, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3832059Z %183 = arith.shrsi %180, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3832314Z %184 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.3832609Z %185 = tt.expand_dims %184 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.3832929Z %186 = tt.expand_dims %185 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.3833295Z %187 = tt.expand_dims %182 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:51.3833649Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:51.3833941Z %189 = arith.cmpi eq, %186, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.3834191Z %190 = tt.broadcast %189 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:51.3834469Z %191 = tt.broadcast %187 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:51.3834757Z %192 = 
arith.select %190, %191, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:51.3835041Z %193 = arith.cmpi eq, %186, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.3835300Z %194 = tt.broadcast %188 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:51.3835568Z %195 = tt.broadcast %193 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:51.3835882Z %196 = arith.select %195, %194, %192 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:51.3836167Z %197 = tt.reshape %196 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:51.3836443Z %198 = arith.sitofp %197 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:51.3836802Z %199 = tt.dot %171, %198, %153, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:51.3837130Z scf.yield %199 : tensor<64x32xf32> 2026-02-21T09:04:51.3837325Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:51.3837526Z %19 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.3837743Z %20 = arith.addi %19, %11 : tensor<64xi32> 2026-02-21T09:04:51.3837938Z %21 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:04:51.3838179Z %22 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:04:51.3838423Z %23 = tt.splat %21 : i32 -> tensor<128xi32> 2026-02-21T09:04:51.3838633Z %24 = arith.addi %23, %22 : tensor<128xi32> 2026-02-21T09:04:51.3838907Z %25 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.3839167Z %26 = arith.muli %25, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.3839431Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:04:51.3839717Z %28 = tt.broadcast %26 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:04:51.3839985Z %29 = tt.broadcast %27 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:04:51.3840216Z %30 = arith.addi %28, %29 : tensor<64x128xi32> 2026-02-21T09:04:51.3840464Z %31 = tt.splat %arg0 : 
!tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:04:51.3840749Z %32 = tt.addptr %31, %30 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:04:51.3841002Z %33 = tt.load %32 : tensor<64x128x!tt.ptr> 2026-02-21T09:04:51.3841247Z %34 = arith.extf %33 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:04:51.3841528Z %35 = tt.expand_dims %20 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.3841832Z %36 = arith.muli %35, %cst_3 : tensor<64x1xi32> 2026-02-21T09:04:51.3842082Z %37 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:04:51.3842372Z %38 = tt.broadcast %36 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:04:51.3842631Z %39 = tt.broadcast %37 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:04:51.3842862Z %40 = arith.addi %38, %39 : tensor<64x32xi32> 2026-02-21T09:04:51.3843129Z %41 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:04:51.3843399Z %42 = tt.addptr %41, %40 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:04:51.3843695Z %43 = tt.load %42 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:04:51.3843961Z %44 = arith.shli %43, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3844172Z %45 = arith.shrsi %44, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3844417Z %46 = arith.shrsi %43, %cst_2 : tensor<64x32xi8> 2026-02-21T09:04:51.3844653Z %47 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.3844966Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.3845269Z %49 = tt.expand_dims %48 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.3845586Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:51.3845903Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:04:51.3846179Z %52 = arith.cmpi eq, %49, %cst_1 : tensor<1x2x1xi32> 
2026-02-21T09:04:51.3846426Z %53 = tt.broadcast %52 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:51.3846690Z %54 = tt.broadcast %50 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:51.3846994Z %55 = arith.select %53, %54, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:51.3847258Z %56 = arith.cmpi eq, %49, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.3847503Z %57 = tt.broadcast %51 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:04:51.3847762Z %58 = tt.broadcast %56 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:04:51.3848022Z %59 = arith.select %58, %57, %55 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:04:51.3848293Z %60 = tt.reshape %59 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:04:51.3848546Z %61 = arith.sitofp %60 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:04:51.3848894Z %62 = tt.dot %34, %61, %18, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:04:51.3849246Z %63 = arith.truncf %62 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:04:51.3849558Z tt.descriptor_store %0[%10, %14], %63 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:04:51.3849836Z tt.return 2026-02-21T09:04:51.3849967Z } 2026-02-21T09:04:51.3850097Z } 2026-02-21T09:04:51.3850166Z 2026-02-21T09:04:51.3850218Z {-# 2026-02-21T09:04:51.3850355Z external_resources: { 2026-02-21T09:04:51.3850515Z mlir_reproducer: { 2026-02-21T09:04:51.3854851Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, 
canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:51.3859424Z disable_threading: false, 2026-02-21T09:04:51.3859648Z verify_each: true 2026-02-21T09:04:51.3859821Z } 2026-02-21T09:04:51.3859956Z } 2026-02-21T09:04:51.3860090Z #-} 2026-02-21T09:04:51.3860584Z /tmp/torchinductor_root/yj/cyjez3nrdcswigki65zevh734g4mhrgxlizqwmx7ugpygkmu7abj.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:51.3861852Z /tmp/torchinductor_root/yj/cyjez3nrdcswigki65zevh734g4mhrgxlizqwmx7ugpygkmu7abj.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:51.3862826Z [256s] Triton 
compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:51.3864081Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:51.3865173Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:51.3865453Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:04:51.5658632Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:04:51.5663103Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:51.5664488Z ^ 2026-02-21T09:04:51.5664908Z /tmp/torchinductor_root/zz/czzptp3oweqrqf3xw56ek34cnkokxga3vducne4rjuyhdd4pmuxb.py:94:40: note: called from 2026-02-21T09:04:51.5665361Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:04:51.5667880Z ^ 2026-02-21T09:04:51.5668353Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:04:51.5668853Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:04:51.5669111Z ^ 2026-02-21T09:04:51.5669310Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:04:51.5669873Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:04:51.5670433Z %cst = arith.constant dense<0> : tensor<32x2x64xi8> 2026-02-21T09:04:51.5670669Z %c448_i32 = arith.constant 448 : i32 
2026-02-21T09:04:51.5672850Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:04:51.5673114Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:04:51.5673333Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:04:51.5673535Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:04:51.5673716Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:04:51.5673935Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:04:51.5674190Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:04:51.5674443Z %cst_2 = arith.constant dense<4> : tensor<32x64xi8> 2026-02-21T09:04:51.5674695Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:04:51.5674955Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:04:51.5675410Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:04:51.5675661Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:04:51.5675916Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:04:51.5676113Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:04:51.5676321Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:04:51.5676531Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:04:51.5676784Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:04:51.5677141Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:04:51.5677533Z %1 = tt.get_program_id x : i32 2026-02-21T09:04:51.5677736Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:04:51.5677927Z %3 = arith.minsi %2, %c112_i32 : i32 2026-02-21T09:04:51.5678140Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:04:51.5678352Z %4 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:04:51.5678550Z %5 = arith.muli %4, %c4_i32 : i32 2026-02-21T09:04:51.5678731Z %6 = arith.subi %c1_i32, %5 : i32 2026-02-21T09:04:51.5678916Z %7 = arith.minsi %6, %c4_i32 : i32 2026-02-21T09:04:51.5679102Z %8 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:04:51.5679299Z %9 = arith.remsi %8, %7 : i32 
2026-02-21T09:04:51.5679478Z %10 = arith.addi %5, %9 : i32 2026-02-21T09:04:51.5679729Z %11 = arith.divsi %8, %7 : i32 2026-02-21T09:04:51.5679909Z %12 = arith.muli %10, %c64_i32 : i32 2026-02-21T09:04:51.5680150Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:04:51.5680407Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.5680614Z %15 = arith.addi %14, %13 : tensor<64xi32> 2026-02-21T09:04:51.5680798Z %16 = arith.muli %11, %c64_i32 : i32 2026-02-21T09:04:51.5680994Z %17 = tt.splat %16 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.5681193Z %18 = arith.addi %17, %13 : tensor<64xi32> 2026-02-21T09:04:51.5681385Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:04:51.5681731Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:04:51.5682053Z %19 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c96_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:04:51.5682431Z %22 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:51.5682690Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:04:51.5682910Z %24 = arith.addi %23, %22 : tensor<32xi32> 2026-02-21T09:04:51.5683115Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:04:51.5683310Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.5683517Z %27 = arith.addi %26, %13 : tensor<64xi32> 2026-02-21T09:04:51.5683767Z %28 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.5684044Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.5684298Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.5684595Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.5684860Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.5685095Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:04:51.5685348Z %34 = tt.splat %arg0 : !tt.ptr -> 
tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.5685633Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:51.5685898Z %36 = tt.load %35 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.5686144Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:51.5686423Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:51.5686691Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:51.5686976Z %40 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.5687262Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:51.5687515Z %42 = tt.broadcast %40 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:51.5687753Z %43 = arith.addi %41, %42 : tensor<32x64xi32> 2026-02-21T09:04:51.5687998Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:51.5688301Z %45 = tt.addptr %44, %43 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:51.5688609Z %46 = tt.load %45 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:51.5688901Z %47 = arith.shli %46, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5689128Z %48 = arith.shrsi %47, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5689345Z %49 = arith.shrsi %46, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5689597Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.5689893Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.5690197Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.5690516Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:51.5690856Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:51.5691149Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.5691406Z 
%56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:51.5691727Z %57 = tt.broadcast %53 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:51.5692017Z %58 = arith.select %56, %57, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:51.5692284Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.5692542Z %60 = tt.broadcast %54 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:51.5692805Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:51.5693081Z %62 = arith.select %61, %60, %58 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:51.5693361Z %63 = tt.reshape %62 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:51.5693617Z %64 = arith.sitofp %63 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:51.5693980Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:51.5694304Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:04:51.5694511Z %66 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:04:51.5694712Z %67 = arith.addi %arg4, %66 : i32 2026-02-21T09:04:51.5694958Z %68 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:51.5695219Z %69 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:04:51.5695428Z %70 = arith.addi %69, %68 : tensor<32xi32> 2026-02-21T09:04:51.5695636Z %71 = arith.muli %67, %c2_i32 : i32 2026-02-21T09:04:51.5695832Z %72 = tt.splat %71 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.5696043Z %73 = arith.addi %72, %13 : tensor<64xi32> 2026-02-21T09:04:51.5696302Z %74 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.5696587Z %75 = arith.muli %74, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.5696855Z %76 = tt.expand_dims %73 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.5697150Z %77 = tt.broadcast %75 : tensor<64x1xi32> -> tensor<64x64xi32> 
2026-02-21T09:04:51.5697423Z %78 = tt.broadcast %76 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.5697665Z %79 = arith.addi %77, %78 : tensor<64x64xi32> 2026-02-21T09:04:51.5697923Z %80 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.5698251Z %81 = tt.addptr %80, %79 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:51.5698524Z %82 = tt.load %81 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.5698778Z %83 = arith.extf %82 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:51.5699066Z %84 = tt.expand_dims %70 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:51.5699346Z %85 = arith.muli %84, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:51.5699633Z %86 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.5699959Z %87 = tt.broadcast %85 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:51.5700233Z %88 = tt.broadcast %86 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:51.5700474Z %89 = arith.addi %87, %88 : tensor<32x64xi32> 2026-02-21T09:04:51.5700748Z %90 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:51.5701032Z %91 = tt.addptr %90, %89 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:51.5701345Z %92 = tt.load %91 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:51.5701651Z %93 = arith.shli %92, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5701882Z %94 = arith.shrsi %93, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5702130Z %95 = arith.shrsi %92, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5702390Z %96 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.5702695Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.5703019Z %98 = tt.expand_dims %97 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.5703356Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<32x64xi8> -> 
tensor<32x1x64xi8> 2026-02-21T09:04:51.5703696Z %100 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:51.5704009Z %101 = arith.cmpi eq, %98, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.5704285Z %102 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:51.5704569Z %103 = tt.broadcast %99 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:51.5704875Z %104 = arith.select %102, %103, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:51.5705148Z %105 = arith.cmpi eq, %98, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.5705411Z %106 = tt.broadcast %100 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:51.5705682Z %107 = tt.broadcast %105 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:51.5705970Z %108 = arith.select %107, %106, %104 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:51.5706256Z %109 = tt.reshape %108 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:51.5706520Z %110 = arith.sitofp %109 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:51.5706878Z %111 = tt.dot %83, %110, %65, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:51.5707191Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:04:51.5707395Z %112 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:04:51.5707597Z %113 = arith.addi %arg4, %112 : i32 2026-02-21T09:04:51.5707826Z %114 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:51.5708085Z %115 = tt.splat %113 : i32 -> tensor<32xi32> 2026-02-21T09:04:51.5708292Z %116 = arith.addi %115, %114 : tensor<32xi32> 2026-02-21T09:04:51.5708496Z %117 = arith.muli %113, %c2_i32 : i32 2026-02-21T09:04:51.5708691Z %118 = tt.splat %117 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.5708899Z %119 = arith.addi %118, %13 : tensor<64xi32> 2026-02-21T09:04:51.5709154Z %120 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:04:51.5709454Z %121 = arith.muli %120, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.5709727Z %122 = tt.expand_dims %119 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.5710015Z %123 = tt.broadcast %121 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.5710287Z %124 = tt.broadcast %122 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.5710529Z %125 = arith.addi %123, %124 : tensor<64x64xi32> 2026-02-21T09:04:51.5710810Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.5711157Z %127 = tt.addptr %126, %125 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:51.5711427Z %128 = tt.load %127 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.5711712Z %129 = arith.extf %128 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:51.5712004Z %130 = tt.expand_dims %116 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:51.5712287Z %131 = arith.muli %130, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:51.5712550Z %132 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.5712852Z %133 = tt.broadcast %131 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:51.5713122Z %134 = tt.broadcast %132 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:51.5713392Z %135 = arith.addi %133, %134 : tensor<32x64xi32> 2026-02-21T09:04:51.5713643Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:51.5713925Z %137 = tt.addptr %136, %135 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:51.5714245Z %138 = tt.load %137 evictionPolicy = evict_last : tensor<32x64x!tt.ptr> 2026-02-21T09:04:51.5714538Z %139 = arith.shli %138, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5714768Z %140 = arith.shrsi %139, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5715005Z %141 = arith.shrsi %138, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5715258Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 
2026-02-21T09:04:51.5715566Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.5715884Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.5716222Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:51.5716561Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:51.5716854Z %147 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.5717123Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:51.5717400Z %149 = tt.broadcast %145 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:51.5717703Z %150 = arith.select %148, %149, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:51.5717989Z %151 = arith.cmpi eq, %144, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.5718251Z %152 = tt.broadcast %146 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:51.5718538Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:51.5718823Z %154 = arith.select %153, %152, %150 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:51.5719119Z %155 = tt.reshape %154 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:51.5719387Z %156 = arith.sitofp %155 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:51.5719757Z %157 = tt.dot %129, %156, %111, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:51.5720091Z scf.yield %157 : tensor<64x64xf32> 2026-02-21T09:04:51.5720285Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:04:51.5720612Z %20 = scf.for %arg4 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %19) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:04:51.5721004Z %22 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:04:51.5721260Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 
2026-02-21T09:04:51.5721464Z %24 = arith.addi %23, %22 : tensor<32xi32> 2026-02-21T09:04:51.5721700Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:04:51.5721901Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:04:51.5722127Z %27 = arith.addi %26, %13 : tensor<64xi32> 2026-02-21T09:04:51.5722382Z %28 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:04:51.5722674Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 2026-02-21T09:04:51.5722936Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.5723224Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.5723490Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:04:51.5723730Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:04:51.5723988Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.5724279Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:04:51.5724573Z %36 = tt.load %35 : tensor<64x64x!tt.ptr> 2026-02-21T09:04:51.5724809Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:04:51.5725092Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:04:51.5725354Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:04:51.5725611Z %40 = tt.expand_dims %18 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:04:51.5725890Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:04:51.5726151Z %42 = tt.broadcast %40 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:04:51.5726391Z %43 = arith.addi %41, %42 : tensor<32x64xi32> 2026-02-21T09:04:51.5726624Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:04:51.5726903Z %45 = tt.addptr %44, %43 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:04:51.5727199Z %46 = tt.load %45 evictionPolicy = evict_last : 
tensor<32x64x!tt.ptr> 2026-02-21T09:04:51.5727475Z %47 = arith.shli %46, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5727692Z %48 = arith.shrsi %47, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5727915Z %49 = arith.shrsi %46, %cst_2 : tensor<32x64xi8> 2026-02-21T09:04:51.5728164Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:04:51.5728451Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:04:51.5728759Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:04:51.5729075Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:51.5729391Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:04:51.5729673Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:04:51.5729914Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:51.5730186Z %57 = tt.broadcast %53 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:51.5730468Z %58 = arith.select %56, %57, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:51.5730739Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:04:51.5730985Z %60 = tt.broadcast %54 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:04:51.5731255Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:04:51.5731572Z %62 = arith.select %61, %60, %58 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:04:51.5731873Z %63 = tt.reshape %62 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:04:51.5732136Z %64 = arith.sitofp %63 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:04:51.5732484Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:04:51.5732804Z scf.yield %65 : tensor<64x64xf32> 2026-02-21T09:04:51.5733048Z } {tt.disallow_acc_multi_buffer, 
tt.num_stages = 1 : i32} 2026-02-21T09:04:51.5733312Z %21 = arith.truncf %20 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:04:51.5733663Z tt.descriptor_store %0[%12, %16], %21 : !tt.tensordesc>, tensor<64x64xbf16> 2026-02-21T09:04:51.5733985Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:04:51.5734204Z tt.return 2026-02-21T09:04:51.5734336Z } 2026-02-21T09:04:51.5734462Z } 2026-02-21T09:04:51.5734534Z 2026-02-21T09:04:51.5734585Z {-# 2026-02-21T09:04:51.5734721Z external_resources: { 2026-02-21T09:04:51.5734882Z mlir_reproducer: { 2026-02-21T09:04:51.5739304Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, 
triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:04:51.5744071Z disable_threading: false, 2026-02-21T09:04:51.5744249Z verify_each: true 2026-02-21T09:04:51.5744408Z } 2026-02-21T09:04:51.5744534Z } 2026-02-21T09:04:51.5744664Z #-} 2026-02-21T09:04:51.5745143Z /tmp/torchinductor_root/zz/czzptp3oweqrqf3xw56ek34cnkokxga3vducne4rjuyhdd4pmuxb.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:04:51.5746256Z /tmp/torchinductor_root/zz/czzptp3oweqrqf3xw56ek34cnkokxga3vducne4rjuyhdd4pmuxb.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:04:51.5747107Z [257s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:04:51.5748307Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:04:51.5749408Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:04:51.5749679Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:04:51.9563165Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 15.4 configs/s 2026-02-21T09:04:52.4411775Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1951.4 2026-02-21T09:04:52.4412698Z configs/s 2026-02-21T09:04:52.4878166Z [257s] Generation 11 complete: 2026-02-21T09:04:52.4882514Z error=5 2026-02-21T09:04:52.4882941Z ok=34 2026-02-21T09:04:52.4883139Z min=0.1077 2026-02-21T09:04:52.4883327Z mid=0.2571 2026-02-21T09:04:52.4883479Z max=5.2378 2026-02-21T09:04:52.4883643Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:04:52.4883889Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:04:52.4884113Z 'l2_groupings': [1], 2026-02-21T09:04:52.4884287Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:04:52.4884488Z 'loop_orders': [[0, 1]], 2026-02-21T09:04:52.4884658Z 'num_stages': 3, 2026-02-21T09:04:52.4884803Z 'num_warps': 1, 2026-02-21T09:04:52.4884959Z 'pid_type': 'flat', 2026-02-21T09:04:52.4885330Z 'range_flattens': [None, None], 2026-02-21T09:04:52.4885540Z 'range_multi_buffers': [None, True], 2026-02-21T09:04:52.4885738Z 'range_num_stages': [0, 0], 2026-02-21T09:04:52.4885921Z 'range_unroll_factors': [0, 0], 2026-02-21T09:04:52.4886114Z 'range_warp_specializes': [None, None]} 2026-02-21T09:04:52.4909191Z [257s] Fitting surrogate: 992 points, 992 targets 2026-02-21T09:04:53.1936126Z [258s] Generation 12 starting: 37 neighbors, 2 active search path(s) 2026-02-21T09:05:00.2275753Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 2.3 configs/s 2026-02-21T09:05:02.0454041Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:02.0459220Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:02.0460487Z ^ 2026-02-21T09:05:02.0460915Z /tmp/torchinductor_root/pb/cpbgll5xtavsmktpolwitxofqnguu7b4bagn4lmh3wa6vppzwgkk.py:94:40: note: called from 
2026-02-21T09:05:02.0461337Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:02.0461722Z ^ 2026-02-21T09:05:02.0462144Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:02.0462615Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:02.0462856Z ^ 2026-02-21T09:05:02.0463130Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T09:05:02.0467711Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:02.0474406Z %cst = arith.constant dense<0> : tensor<32x2x32xi8> 2026-02-21T09:05:02.0479680Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:05:02.0479893Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:02.0480083Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:02.0480278Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:02.0480482Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:02.0480692Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:02.0480920Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:02.0481150Z %cst_2 = arith.constant dense<4> : tensor<32x32xi8> 2026-02-21T09:05:02.0481381Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:05:02.0481985Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:02.0482206Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:02.0482425Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:02.0482663Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:02.0482842Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:02.0483029Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:02.0483308Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:02.0483505Z %c7168_i64 = arith.constant 
7168 : i64 2026-02-21T09:05:02.0483691Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:02.0484074Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:02.0484400Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:02.0484580Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:05:02.0484767Z %3 = arith.minsi %2, %c224_i32 : i32 2026-02-21T09:05:02.0484967Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:05:02.0485181Z %4 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:05:02.0485371Z %5 = arith.muli %4, %c4_i32 : i32 2026-02-21T09:05:02.0485553Z %6 = arith.subi %c1_i32, %5 : i32 2026-02-21T09:05:02.0485736Z %7 = arith.minsi %6, %c4_i32 : i32 2026-02-21T09:05:02.0485917Z %8 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:05:02.0486174Z %9 = arith.remsi %8, %7 : i32 2026-02-21T09:05:02.0486345Z %10 = arith.addi %5, %9 : i32 2026-02-21T09:05:02.0486518Z %11 = arith.divsi %8, %7 : i32 2026-02-21T09:05:02.0486691Z %12 = arith.muli %10, %c64_i32 : i32 2026-02-21T09:05:02.0486932Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:02.0487186Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 2026-02-21T09:05:02.0487394Z %15 = arith.addi %14, %13 : tensor<64xi32> 2026-02-21T09:05:02.0487590Z %16 = arith.muli %11, %c32_i32 : i32 2026-02-21T09:05:02.0487813Z %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:02.0488061Z %18 = tt.splat %16 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.0488250Z %19 = arith.addi %18, %17 : tensor<32xi32> 2026-02-21T09:05:02.0488448Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:02.0488632Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:05:02.0488951Z %20 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c96_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:02.0489293Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.0489502Z %24 = arith.addi %23, %17 : 
tensor<32xi32> 2026-02-21T09:05:02.0489706Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:02.0489898Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:05:02.0490102Z %27 = arith.addi %26, %13 : tensor<64xi32> 2026-02-21T09:05:02.0490352Z %28 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:02.0490632Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:02.0490894Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:02.0491178Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:02.0491444Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:02.0491719Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:05:02.0491972Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:02.0492265Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:02.0492534Z %36 = tt.load %35 : tensor<64x64x!tt.ptr> 2026-02-21T09:05:02.0492790Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:02.0493070Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:02.0493388Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:02.0493637Z %40 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.0493920Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:02.0494169Z %42 = tt.broadcast %40 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:02.0494409Z %43 = arith.addi %41, %42 : tensor<32x32xi32> 2026-02-21T09:05:02.0494691Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:02.0495045Z %45 = tt.addptr %44, %43 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:02.0495349Z %46 = tt.load %45 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:05:02.0495612Z %47 = 
arith.shli %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0495836Z %48 = arith.shrsi %47, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0496052Z %49 = arith.shrsi %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0496303Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:02.0496599Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:02.0496897Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:02.0497255Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:02.0497575Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:02.0497862Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:02.0498113Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:02.0498376Z %57 = tt.broadcast %53 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:02.0498663Z %58 = arith.select %56, %57, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:02.0498931Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:02.0499182Z %60 = tt.broadcast %54 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:02.0499443Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:02.0499717Z %62 = arith.select %61, %60, %58 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:02.0499999Z %63 = tt.reshape %62 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:02.0500254Z %64 = arith.sitofp %63 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:02.0500611Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:02.0500926Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:05:02.0501129Z %66 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:05:02.0501334Z %67 = arith.addi %arg4, 
%66 : i32 2026-02-21T09:05:02.0501523Z %68 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.0501760Z %69 = arith.addi %68, %17 : tensor<32xi32> 2026-02-21T09:05:02.0501951Z %70 = arith.muli %67, %c2_i32 : i32 2026-02-21T09:05:02.0502144Z %71 = tt.splat %70 : i32 -> tensor<64xi32> 2026-02-21T09:05:02.0502336Z %72 = arith.addi %71, %13 : tensor<64xi32> 2026-02-21T09:05:02.0502589Z %73 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:02.0502851Z %74 = arith.muli %73, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:02.0503106Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:02.0503435Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:02.0503684Z %77 = tt.broadcast %75 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:02.0503921Z %78 = arith.addi %76, %77 : tensor<64x64xi32> 2026-02-21T09:05:02.0504170Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:02.0504507Z %80 = tt.addptr %79, %78 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:02.0504780Z %81 = tt.load %80 : tensor<64x64x!tt.ptr> 2026-02-21T09:05:02.0505025Z %82 = arith.extf %81 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:02.0505322Z %83 = tt.expand_dims %69 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:02.0505625Z %84 = arith.muli %83, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:02.0505894Z %85 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.0506217Z %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:02.0506492Z %87 = tt.broadcast %85 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:02.0506740Z %88 = arith.addi %86, %87 : tensor<32x32xi32> 2026-02-21T09:05:02.0506985Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:02.0507277Z %90 = tt.addptr %89, %88 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 
2026-02-21T09:05:02.0507585Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:05:02.0507864Z %92 = arith.shli %91, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0508086Z %93 = arith.shrsi %92, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0508349Z %94 = arith.shrsi %91, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0508610Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:02.0508910Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:02.0509233Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:02.0509597Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:02.0509931Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:02.0510224Z %100 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:02.0510495Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:02.0510777Z %102 = tt.broadcast %98 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:02.0511085Z %103 = arith.select %101, %102, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:02.0511370Z %104 = arith.cmpi eq, %97, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:02.0511671Z %105 = tt.broadcast %99 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:02.0511961Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:02.0512252Z %107 = arith.select %106, %105, %103 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:02.0512550Z %108 = tt.reshape %107 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:02.0512830Z %109 = arith.sitofp %108 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:02.0513191Z %110 = tt.dot %82, %109, %65, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:02.0513513Z %c2_i32_7 
= arith.constant 2 : i32 2026-02-21T09:05:02.0513708Z %111 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:05:02.0513912Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:05:02.0514115Z %113 = tt.splat %112 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.0514327Z %114 = arith.addi %113, %17 : tensor<32xi32> 2026-02-21T09:05:02.0514521Z %115 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:05:02.0514723Z %116 = tt.splat %115 : i32 -> tensor<64xi32> 2026-02-21T09:05:02.0514918Z %117 = arith.addi %116, %13 : tensor<64xi32> 2026-02-21T09:05:02.0515175Z %118 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:02.0515458Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:02.0515750Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:02.0516051Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:02.0516315Z %122 = tt.broadcast %120 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:02.0516563Z %123 = arith.addi %121, %122 : tensor<64x64xi32> 2026-02-21T09:05:02.0516807Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:02.0517128Z %125 = tt.addptr %124, %123 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:02.0517445Z %126 = tt.load %125 : tensor<64x64x!tt.ptr> 2026-02-21T09:05:02.0517685Z %127 = arith.extf %126 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:02.0517974Z %128 = tt.expand_dims %114 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:02.0518237Z %129 = arith.muli %128, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:02.0518507Z %130 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.0518799Z %131 = tt.broadcast %129 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:02.0519056Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:02.0519302Z %133 = arith.addi %131, %132 : 
tensor<32x32xi32> 2026-02-21T09:05:02.0519574Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:02.0519865Z %135 = tt.addptr %134, %133 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:02.0520176Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:05:02.0520452Z %137 = arith.shli %136, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0520679Z %138 = arith.shrsi %137, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0520900Z %139 = arith.shrsi %136, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0521155Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:02.0521443Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:02.0521794Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:02.0522133Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:02.0522462Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:02.0522754Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:02.0523008Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:02.0523287Z %147 = tt.broadcast %143 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:02.0523577Z %148 = arith.select %146, %147, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:02.0523861Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:02.0524121Z %150 = tt.broadcast %144 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:02.0524390Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:02.0524678Z %152 = arith.select %151, %150, %148 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:02.0524961Z %153 = tt.reshape %152 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:02.0525232Z %154 = arith.sitofp 
%153 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:02.0525595Z %155 = tt.dot %127, %154, %110, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:02.0525914Z scf.yield %155 : tensor<64x32xf32> 2026-02-21T09:05:02.0526110Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:02.0526422Z %21 = scf.for %arg4 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %20) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:02.0526788Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.0526993Z %24 = arith.addi %23, %17 : tensor<32xi32> 2026-02-21T09:05:02.0527195Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:02.0527396Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:05:02.0527590Z %27 = arith.addi %26, %13 : tensor<64xi32> 2026-02-21T09:05:02.0527843Z %28 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:02.0528136Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:02.0528419Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:02.0528708Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:02.0528972Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:02.0529212Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:05:02.0529454Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:02.0529744Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:02.0529994Z %36 = tt.load %35 : tensor<64x64x!tt.ptr> 2026-02-21T09:05:02.0530238Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:02.0530538Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:02.0530804Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:02.0531060Z %40 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> 
tensor<1x32xi32> 2026-02-21T09:05:02.0531344Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:02.0531636Z %42 = tt.broadcast %40 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:02.0531876Z %43 = arith.addi %41, %42 : tensor<32x32xi32> 2026-02-21T09:05:02.0532122Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:02.0532389Z %45 = tt.addptr %44, %43 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:02.0532690Z %46 = tt.load %45 evictionPolicy = evict_last : tensor<32x32x!tt.ptr> 2026-02-21T09:05:02.0532960Z %47 = arith.shli %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0533174Z %48 = arith.shrsi %47, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0533398Z %49 = arith.shrsi %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:02.0533638Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:02.0533934Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:02.0534244Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:02.0534561Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:02.0534884Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:02.0535159Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:02.0535412Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:02.0535674Z %57 = tt.broadcast %53 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:02.0535963Z %58 = arith.select %56, %57, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:02.0536234Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:02.0536476Z %60 = tt.broadcast %54 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:02.0536743Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 
2026-02-21T09:05:02.0537013Z %62 = arith.select %61, %60, %58 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:02.0537291Z %63 = tt.reshape %62 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:02.0537577Z %64 = arith.sitofp %63 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:02.0537923Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:02.0538245Z scf.yield %65 : tensor<64x32xf32> 2026-02-21T09:05:02.0538465Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:05:02.0538726Z %22 = arith.truncf %21 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:02.0539067Z tt.descriptor_store %0[%12, %16], %22 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:05:02.0539471Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T09:05:02.0539725Z tt.return 2026-02-21T09:05:02.0539856Z } 2026-02-21T09:05:02.0539984Z } 2026-02-21T09:05:02.0540055Z 2026-02-21T09:05:02.0540108Z {-# 2026-02-21T09:05:02.0540245Z external_resources: { 2026-02-21T09:05:02.0540400Z mlir_reproducer: { 2026-02-21T09:05:02.0544826Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, 
tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:02.0549283Z disable_threading: false, 2026-02-21T09:05:02.0549460Z verify_each: true 2026-02-21T09:05:02.0549617Z } 2026-02-21T09:05:02.0549767Z } 2026-02-21T09:05:02.0549917Z #-} 2026-02-21T09:05:02.0550454Z /tmp/torchinductor_root/pb/cpbgll5xtavsmktpolwitxofqnguu7b4bagn4lmh3wa6vppzwgkk.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:02.0551675Z /tmp/torchinductor_root/pb/cpbgll5xtavsmktpolwitxofqnguu7b4bagn4lmh3wa6vppzwgkk.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:02.0552562Z [267s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:05:02.0553791Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:05:02.0554932Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:02.0555227Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:02.5627719Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:02.5632982Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:02.5635012Z ^ 2026-02-21T09:05:02.5635421Z /tmp/torchinductor_root/ci/cciaymts3qdnqwytfo2a3n53opj6gqzug4awh4khka7pf33sxzvh.py:86:36: note: called from 2026-02-21T09:05:02.5636017Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:02.5636244Z ^ 2026-02-21T09:05:02.5636640Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:02.5637118Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:02.5637362Z ^ 2026-02-21T09:05:02.5637613Z module { 2026-02-21T09:05:02.5641488Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:02.5643162Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:05:02.5643443Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:05:02.5643645Z %c1_i32 = arith.constant 1 : i32 
2026-02-21T09:05:02.5643858Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:05:02.5644053Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:02.5644250Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:02.5644457Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:02.5644693Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:02.5644930Z %cst_2 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:05:02.5645160Z %cst_3 = arith.constant dense<7168> : tensor<16x1xi32> 2026-02-21T09:05:02.5645402Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:02.5645608Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:02.5645839Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:02.5646071Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:02.5646257Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:02.5646439Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:02.5646622Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:02.5646816Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:02.5646994Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:02.5647315Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:02.5647635Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:02.5647823Z %2 = arith.divsi %1, %c896_i32 : i32 2026-02-21T09:05:02.5648005Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:05:02.5648187Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:05:02.5648381Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:05:02.5648587Z %6 = arith.remsi %1, %c896_i32 : i32 2026-02-21T09:05:02.5648773Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:05:02.5648952Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:05:02.5649116Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:05:02.5649294Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:02.5649527Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 
2026-02-21T09:05:02.5649786Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:05:02.5649985Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:05:02.5650175Z %14 = arith.muli %9, %c32_i32 : i32 2026-02-21T09:05:02.5650595Z %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:02.5650840Z %16 = tt.splat %14 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.5651039Z %17 = arith.addi %16, %15 : tensor<32xi32> 2026-02-21T09:05:02.5651227Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:05:02.5651419Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:05:02.5651818Z %18 = scf.for %arg3 = %c0_i32 to %c4080_i32 step %c48_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:02.5652242Z %64 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:02.5652501Z %65 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T09:05:02.5652751Z %66 = arith.addi %65, %64 : tensor<16xi32> 2026-02-21T09:05:02.5652956Z %67 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:02.5653148Z %68 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.5653353Z %69 = arith.addi %68, %15 : tensor<32xi32> 2026-02-21T09:05:02.5653606Z %70 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:02.5653891Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:02.5654152Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.5654451Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:02.5654752Z %74 = tt.broadcast %72 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:02.5654987Z %75 = arith.addi %73, %74 : tensor<64x32xi32> 2026-02-21T09:05:02.5655237Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:02.5655521Z %77 = tt.addptr %76, %75 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:02.5655784Z %78 = tt.load %77 : tensor<64x32x!tt.ptr> 
2026-02-21T09:05:02.5656019Z %79 = arith.extf %78 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:02.5656303Z %80 = tt.expand_dims %66 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:02.5656573Z %81 = arith.muli %80, %cst_3 : tensor<16x1xi32> 2026-02-21T09:05:02.5656820Z %82 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.5657107Z %83 = tt.broadcast %81 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:05:02.5657356Z %84 = tt.broadcast %82 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:05:02.5657597Z %85 = arith.addi %83, %84 : tensor<16x32xi32> 2026-02-21T09:05:02.5657832Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:05:02.5658107Z %87 = tt.addptr %86, %85 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:05:02.5658406Z %88 = tt.load %87 evictionPolicy = evict_last : tensor<16x32x!tt.ptr> 2026-02-21T09:05:02.5658663Z %89 = arith.shli %88, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5658883Z %90 = arith.shrsi %89, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5659095Z %91 = arith.shrsi %88, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5659341Z %92 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:02.5659631Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:02.5659929Z %94 = tt.expand_dims %93 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:02.5660248Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:02.5660561Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:02.5660841Z %97 = arith.cmpi eq, %94, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:02.5661083Z %98 = tt.broadcast %97 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:02.5661352Z %99 = tt.broadcast %95 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:02.5661680Z %100 
= arith.select %98, %99, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:02.5661986Z %101 = arith.cmpi eq, %94, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:02.5662244Z %102 = tt.broadcast %96 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:02.5662511Z %103 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:02.5662801Z %104 = arith.select %103, %102, %100 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:02.5663093Z %105 = tt.reshape %104 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:05:02.5663385Z %106 = arith.sitofp %105 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:05:02.5663778Z %107 = tt.dot %79, %106, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:02.5664098Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:05:02.5664299Z %108 = arith.muli %c16_i32, %c1_i32_6 : i32 2026-02-21T09:05:02.5664493Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:05:02.5664732Z %110 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:02.5664988Z %111 = tt.splat %109 : i32 -> tensor<16xi32> 2026-02-21T09:05:02.5665195Z %112 = arith.addi %111, %110 : tensor<16xi32> 2026-02-21T09:05:02.5665400Z %113 = arith.muli %109, %c2_i32 : i32 2026-02-21T09:05:02.5665591Z %114 = tt.splat %113 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.5665820Z %115 = arith.addi %114, %15 : tensor<32xi32> 2026-02-21T09:05:02.5666073Z %116 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:02.5666356Z %117 = arith.muli %116, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:02.5666626Z %118 = tt.expand_dims %115 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.5666921Z %119 = tt.broadcast %117 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:02.5667202Z %120 = tt.broadcast %118 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:02.5667444Z %121 = arith.addi %119, %120 : 
tensor<64x32xi32> 2026-02-21T09:05:02.5667699Z %122 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:02.5667993Z %123 = tt.addptr %122, %121 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:02.5668270Z %124 = tt.load %123 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:02.5668514Z %125 = arith.extf %124 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:02.5668802Z %126 = tt.expand_dims %112 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:02.5669078Z %127 = arith.muli %126, %cst_3 : tensor<16x1xi32> 2026-02-21T09:05:02.5669335Z %128 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.5669630Z %129 = tt.broadcast %127 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:05:02.5669888Z %130 = tt.broadcast %128 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:05:02.5670131Z %131 = arith.addi %129, %130 : tensor<16x32xi32> 2026-02-21T09:05:02.5670375Z %132 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:05:02.5670651Z %133 = tt.addptr %132, %131 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:05:02.5670960Z %134 = tt.load %133 evictionPolicy = evict_last : tensor<16x32x!tt.ptr> 2026-02-21T09:05:02.5671223Z %135 = arith.shli %134, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5671452Z %136 = arith.shrsi %135, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5671723Z %137 = arith.shrsi %134, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5671981Z %138 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:02.5672285Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:02.5672600Z %140 = tt.expand_dims %139 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:02.5672936Z %141 = tt.expand_dims %136 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:02.5673289Z %142 = tt.expand_dims %137 {axis = 1 : i32} : tensor<16x32xi8> -> 
tensor<16x1x32xi8> 2026-02-21T09:05:02.5673585Z %143 = arith.cmpi eq, %140, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:02.5673848Z %144 = tt.broadcast %143 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:02.5674123Z %145 = tt.broadcast %141 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:02.5674428Z %146 = arith.select %144, %145, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:02.5674730Z %147 = arith.cmpi eq, %140, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:02.5675015Z %148 = tt.broadcast %142 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:02.5675317Z %149 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:02.5675594Z %150 = arith.select %149, %148, %146 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:02.5675877Z %151 = tt.reshape %150 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:05:02.5676154Z %152 = arith.sitofp %151 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:05:02.5676515Z %153 = tt.dot %125, %152, %107, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:02.5676855Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:02.5677067Z %154 = arith.muli %c16_i32, %c2_i32_7 : i32 2026-02-21T09:05:02.5677330Z %155 = arith.addi %arg3, %154 : i32 2026-02-21T09:05:02.5677576Z %156 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:02.5677841Z %157 = tt.splat %155 : i32 -> tensor<16xi32> 2026-02-21T09:05:02.5678056Z %158 = arith.addi %157, %156 : tensor<16xi32> 2026-02-21T09:05:02.5678269Z %159 = arith.muli %155, %c2_i32 : i32 2026-02-21T09:05:02.5678471Z %160 = tt.splat %159 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.5678691Z %161 = arith.addi %160, %15 : tensor<32xi32> 2026-02-21T09:05:02.5678962Z %162 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:02.5679240Z %163 = arith.muli %162, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:02.5679518Z 
%164 = tt.expand_dims %161 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.5679822Z %165 = tt.broadcast %163 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:02.5680103Z %166 = tt.broadcast %164 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:02.5680355Z %167 = arith.addi %165, %166 : tensor<64x32xi32> 2026-02-21T09:05:02.5680617Z %168 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:02.5680927Z %169 = tt.addptr %168, %167 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:02.5681197Z %170 = tt.load %169 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:02.5681451Z %171 = arith.extf %170 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:02.5681782Z %172 = tt.expand_dims %158 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:02.5682069Z %173 = arith.muli %172, %cst_3 : tensor<16x1xi32> 2026-02-21T09:05:02.5682341Z %174 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.5682642Z %175 = tt.broadcast %173 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:05:02.5682924Z %176 = tt.broadcast %174 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:05:02.5683172Z %177 = arith.addi %175, %176 : tensor<16x32xi32> 2026-02-21T09:05:02.5683422Z %178 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:05:02.5683713Z %179 = tt.addptr %178, %177 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:05:02.5684035Z %180 = tt.load %179 evictionPolicy = evict_last : tensor<16x32x!tt.ptr> 2026-02-21T09:05:02.5684340Z %181 = arith.shli %180, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5684556Z %182 = arith.shrsi %181, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5684827Z %183 = arith.shrsi %180, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5685074Z %184 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:02.5685374Z %185 = tt.expand_dims %184 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 
2026-02-21T09:05:02.5685686Z %186 = tt.expand_dims %185 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:02.5686066Z %187 = tt.expand_dims %182 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:02.5686398Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:02.5686710Z %189 = arith.cmpi eq, %186, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:02.5686973Z %190 = tt.broadcast %189 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:02.5687243Z %191 = tt.broadcast %187 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:02.5687539Z %192 = arith.select %190, %191, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:02.5687826Z %193 = arith.cmpi eq, %186, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:02.5688077Z %194 = tt.broadcast %188 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:02.5688356Z %195 = tt.broadcast %193 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:02.5688669Z %196 = arith.select %195, %194, %192 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:02.5688963Z %197 = tt.reshape %196 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:05:02.5689229Z %198 = arith.sitofp %197 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:05:02.5689595Z %199 = tt.dot %171, %198, %153, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:02.5689925Z scf.yield %199 : tensor<64x32xf32> 2026-02-21T09:05:02.5690114Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:02.5690347Z %19 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:02.5690603Z %20 = tt.splat %c4080_i32 : i32 -> tensor<16xi32> 2026-02-21T09:05:02.5690818Z %21 = arith.addi %20, %19 : tensor<16xi32> 2026-02-21T09:05:02.5691012Z %22 = arith.muli %c4080_i32, %c2_i32 : i32 2026-02-21T09:05:02.5691210Z %23 = tt.splat %22 : i32 -> tensor<32xi32> 2026-02-21T09:05:02.5691407Z %24 = 
arith.addi %23, %15 : tensor<32xi32> 2026-02-21T09:05:02.5691695Z %25 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:02.5691974Z %26 = arith.muli %25, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:02.5692225Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.5692516Z %28 = tt.broadcast %26 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:02.5692773Z %29 = tt.broadcast %27 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:02.5693014Z %30 = arith.addi %28, %29 : tensor<64x32xi32> 2026-02-21T09:05:02.5693261Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:02.5693539Z %32 = tt.addptr %31, %30 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:02.5693797Z %33 = tt.load %32 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:02.5694030Z %34 = arith.extf %33 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:02.5694315Z %35 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:02.5694578Z %36 = arith.muli %35, %cst_3 : tensor<16x1xi32> 2026-02-21T09:05:02.5694836Z %37 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:02.5695127Z %38 = tt.broadcast %36 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:05:02.5695377Z %39 = tt.broadcast %37 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:05:02.5695614Z %40 = arith.addi %38, %39 : tensor<16x32xi32> 2026-02-21T09:05:02.5695847Z %41 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:05:02.5696149Z %42 = tt.addptr %41, %40 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:05:02.5696443Z %43 = tt.load %42 evictionPolicy = evict_last : tensor<16x32x!tt.ptr> 2026-02-21T09:05:02.5696703Z %44 = arith.shli %43, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5696921Z %45 = arith.shrsi %44, %cst_2 : tensor<16x32xi8> 2026-02-21T09:05:02.5697136Z %46 = arith.shrsi %43, %cst_2 : tensor<16x32xi8> 
2026-02-21T09:05:02.5697407Z %47 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:02.5697690Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:02.5698022Z %49 = tt.expand_dims %48 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:02.5698332Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:02.5698640Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:02.5698916Z %52 = arith.cmpi eq, %49, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:02.5699157Z %53 = tt.broadcast %52 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:02.5699423Z %54 = tt.broadcast %50 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:02.5699697Z %55 = arith.select %53, %54, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:02.5699993Z %56 = arith.cmpi eq, %49, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:02.5700244Z %57 = tt.broadcast %51 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:02.5700501Z %58 = tt.broadcast %56 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:02.5700776Z %59 = arith.select %58, %57, %55 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:02.5701040Z %60 = tt.reshape %59 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:05:02.5701299Z %61 = arith.sitofp %60 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:05:02.5701659Z %62 = tt.dot %34, %61, %18, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:02.5702011Z %63 = arith.truncf %62 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:02.5702336Z tt.descriptor_store %0[%10, %14], %63 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:05:02.5702609Z tt.return 2026-02-21T09:05:02.5702752Z } 2026-02-21T09:05:02.5702878Z } 2026-02-21T09:05:02.5702959Z 2026-02-21T09:05:02.5703012Z {-# 2026-02-21T09:05:02.5703143Z 
external_resources: { 2026-02-21T09:05:02.5703311Z mlir_reproducer: { 2026-02-21T09:05:02.5709445Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:02.5714075Z disable_threading: false, 2026-02-21T09:05:02.5714257Z verify_each: true 2026-02-21T09:05:02.5714441Z } 
2026-02-21T09:05:02.5714571Z } 2026-02-21T09:05:02.5714687Z #-} 2026-02-21T09:05:02.5715217Z /tmp/torchinductor_root/ci/cciaymts3qdnqwytfo2a3n53opj6gqzug4awh4khka7pf33sxzvh.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:02.5716347Z /tmp/torchinductor_root/ci/cciaymts3qdnqwytfo2a3n53opj6gqzug4awh4khka7pf33sxzvh.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:02.5717278Z [268s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:02.5718449Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:02.5719459Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:02.5719747Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:03.1556115Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 39/39 13.5 configs/s 2026-02-21T09:05:04.8185704Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 594.2 2026-02-21T09:05:04.8187651Z configs/s 2026-02-21T09:05:04.8960400Z [270s] Generation 12 complete: 2026-02-21T09:05:04.8968364Z error=2 2026-02-21T09:05:04.8968575Z ok=38 2026-02-21T09:05:04.8971798Z min=0.1077 2026-02-21T09:05:04.8973906Z mid=0.2612 2026-02-21T09:05:04.8979610Z max=23.7476 2026-02-21T09:05:04.8981053Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:05:04.8981339Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:05:04.8981631Z 'l2_groupings': [1], 2026-02-21T09:05:04.8981821Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:05:04.8982017Z 'loop_orders': [[0, 1]], 2026-02-21T09:05:04.8982197Z 'num_stages': 3, 2026-02-21T09:05:04.8982348Z 'num_warps': 1, 2026-02-21T09:05:04.8982506Z 'pid_type': 'flat', 2026-02-21T09:05:04.8982681Z 'range_flattens': [None, None], 2026-02-21T09:05:04.8982868Z 'range_multi_buffers': [None, True], 2026-02-21T09:05:04.8983066Z 'range_num_stages': [0, 0], 2026-02-21T09:05:04.8983243Z 'range_unroll_factors': [0, 0], 2026-02-21T09:05:04.8983434Z 'range_warp_specializes': [None, None]} 2026-02-21T09:05:04.8993596Z [270s] Fitting surrogate: 1032 points, 1032 targets 2026-02-21T09:05:05.6627149Z [271s] Generation 13 starting: 36 neighbors, 2 active search path(s) 2026-02-21T09:05:09.6118901Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 7.7 configs/s 2026-02-21T09:05:11.3312482Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:11.3313687Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:11.3313950Z ^ 2026-02-21T09:05:11.3314318Z /tmp/torchinductor_root/4n/c4nygr2kzmq77wqg7e7hsnttsukcdnjj4q3btlzo26uor5dr43ne.py:86:36: note: called from 
2026-02-21T09:05:11.3314702Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:11.3314908Z ^ 2026-02-21T09:05:11.3315553Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:11.3316017Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:11.3316297Z ^ 2026-02-21T09:05:11.3316459Z module { 2026-02-21T09:05:11.3316947Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:11.3317617Z %cst = arith.constant dense<0> : tensor<16x2x16xi8> 2026-02-21T09:05:11.3317894Z %c1792_i32 = arith.constant 1792 : i32 2026-02-21T09:05:11.3318096Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:11.3318284Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:11.3318481Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:11.3318691Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:11.3318939Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:11.3319179Z %cst_2 = arith.constant dense<4> : tensor<16x16xi8> 2026-02-21T09:05:11.3319412Z %cst_3 = arith.constant dense<7168> : tensor<16x1xi32> 2026-02-21T09:05:11.3319656Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:11.3319928Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:11.3320164Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x16xf32> 2026-02-21T09:05:11.3320393Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:05:11.3320581Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:11.3320765Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:11.3320943Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:11.3321134Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:11.3321314Z %c1_i64 = arith.constant 1 : i64 
2026-02-21T09:05:11.3321796Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:11.3322122Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:11.3322312Z %2 = arith.divsi %1, %c1792_i32 : i32 2026-02-21T09:05:11.3322498Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:05:11.3322686Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:05:11.3322870Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:05:11.3323049Z %6 = arith.remsi %1, %c1792_i32 : i32 2026-02-21T09:05:11.3323245Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:05:11.3323417Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:05:11.3323593Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:05:11.3323766Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:11.3324007Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:11.3324262Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:05:11.3324496Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:05:11.3324687Z %14 = arith.muli %9, %c16_i32 : i32 2026-02-21T09:05:11.3324918Z %15 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:11.3325157Z %16 = tt.splat %14 : i32 -> tensor<16xi32> 2026-02-21T09:05:11.3325359Z %17 = arith.addi %16, %15 : tensor<16xi32> 2026-02-21T09:05:11.3325554Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:05:11.3325740Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:05:11.3326061Z %18 = scf.for %arg3 = %c0_i32 to %c4080_i32 step %c48_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x16xf32>) : i32 { 2026-02-21T09:05:11.3326393Z %64 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T09:05:11.3326607Z %65 = arith.addi %64, %15 : tensor<16xi32> 2026-02-21T09:05:11.3326801Z %66 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:11.3327039Z %67 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:11.3327290Z %68 = tt.splat %66 : i32 -> tensor<32xi32> 2026-02-21T09:05:11.3327534Z %69 = 
arith.addi %68, %67 : tensor<32xi32> 2026-02-21T09:05:11.3327794Z %70 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:11.3328060Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:11.3328319Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:11.3328603Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:11.3328913Z %74 = tt.broadcast %72 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:11.3329153Z %75 = arith.addi %73, %74 : tensor<64x32xi32> 2026-02-21T09:05:11.3329428Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:11.3329713Z %77 = tt.addptr %76, %75 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:11.3330016Z %78 = tt.load %77 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:11.3330311Z %79 = arith.extf %78 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:11.3330598Z %80 = tt.expand_dims %65 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:11.3330874Z %81 = arith.muli %80, %cst_3 : tensor<16x1xi32> 2026-02-21T09:05:11.3331145Z %82 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:11.3331569Z %83 = tt.broadcast %81 : tensor<16x1xi32> -> tensor<16x16xi32> 2026-02-21T09:05:11.3331841Z %84 = tt.broadcast %82 : tensor<1x16xi32> -> tensor<16x16xi32> 2026-02-21T09:05:11.3332077Z %85 = arith.addi %83, %84 : tensor<16x16xi32> 2026-02-21T09:05:11.3332334Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<16x16x!tt.ptr> 2026-02-21T09:05:11.3332629Z %87 = tt.addptr %86, %85 : tensor<16x16x!tt.ptr>, tensor<16x16xi32> 2026-02-21T09:05:11.3332933Z %88 = tt.load %87 evictionPolicy = evict_first : tensor<16x16x!tt.ptr> 2026-02-21T09:05:11.3333213Z %89 = arith.shli %88, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3333439Z %90 = arith.shrsi %89, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3333668Z %91 = arith.shrsi 
%88, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3333914Z %92 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:11.3334219Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:11.3334538Z %94 = tt.expand_dims %93 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:11.3334865Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:11.3335198Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:11.3335484Z %97 = arith.cmpi eq, %94, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:11.3335752Z %98 = tt.broadcast %97 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:11.3336037Z %99 = tt.broadcast %95 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:11.3336332Z %100 = arith.select %98, %99, %cst : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:11.3336625Z %101 = arith.cmpi eq, %94, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:11.3336892Z %102 = tt.broadcast %96 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:11.3337185Z %103 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:11.3337484Z %104 = arith.select %103, %102, %100 : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:11.3337788Z %105 = tt.reshape %104 : tensor<16x2x16xi8> -> tensor<32x16xi8> 2026-02-21T09:05:11.3338068Z %106 = arith.sitofp %105 : tensor<32x16xi8> to tensor<32x16xf32> 2026-02-21T09:05:11.3338437Z %107 = tt.dot %79, %106, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:11.3338774Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:05:11.3338977Z %108 = arith.muli %c16_i32, %c1_i32_6 : i32 2026-02-21T09:05:11.3339222Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:05:11.3339451Z %110 = tt.splat %109 : i32 -> tensor<16xi32> 2026-02-21T09:05:11.3339659Z %111 = arith.addi %110, %15 : 
tensor<16xi32> 2026-02-21T09:05:11.3339849Z %112 = arith.muli %109, %c2_i32 : i32 2026-02-21T09:05:11.3340087Z %113 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:11.3340333Z %114 = tt.splat %112 : i32 -> tensor<32xi32> 2026-02-21T09:05:11.3340570Z %115 = arith.addi %114, %113 : tensor<32xi32> 2026-02-21T09:05:11.3340827Z %116 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:11.3341131Z %117 = arith.muli %116, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:11.3341402Z %118 = tt.expand_dims %115 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:11.3341733Z %119 = tt.broadcast %117 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:11.3342004Z %120 = tt.broadcast %118 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:11.3342245Z %121 = arith.addi %119, %120 : tensor<64x32xi32> 2026-02-21T09:05:11.3342495Z %122 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:11.3342784Z %123 = tt.addptr %122, %121 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:11.3343132Z %124 = tt.load %123 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:11.3343435Z %125 = arith.extf %124 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:11.3343725Z %126 = tt.expand_dims %111 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:11.3344006Z %127 = arith.muli %126, %cst_3 : tensor<16x1xi32> 2026-02-21T09:05:11.3344265Z %128 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:11.3344563Z %129 = tt.broadcast %127 : tensor<16x1xi32> -> tensor<16x16xi32> 2026-02-21T09:05:11.3344832Z %130 = tt.broadcast %128 : tensor<1x16xi32> -> tensor<16x16xi32> 2026-02-21T09:05:11.3345071Z %131 = arith.addi %129, %130 : tensor<16x16xi32> 2026-02-21T09:05:11.3345317Z %132 = tt.splat %arg1 : !tt.ptr -> tensor<16x16x!tt.ptr> 2026-02-21T09:05:11.3345598Z %133 = tt.addptr %132, %131 : 
tensor<16x16x!tt.ptr>, tensor<16x16xi32> 2026-02-21T09:05:11.3345911Z %134 = tt.load %133 evictionPolicy = evict_first : tensor<16x16x!tt.ptr> 2026-02-21T09:05:11.3346182Z %135 = arith.shli %134, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3346411Z %136 = arith.shrsi %135, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3346640Z %137 = arith.shrsi %134, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3346887Z %138 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:11.3347189Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:11.3347503Z %140 = tt.expand_dims %139 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:11.3347834Z %141 = tt.expand_dims %136 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:11.3348162Z %142 = tt.expand_dims %137 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:11.3348456Z %143 = arith.cmpi eq, %140, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:11.3348721Z %144 = tt.broadcast %143 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:11.3348998Z %145 = tt.broadcast %141 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:11.3349296Z %146 = arith.select %144, %145, %cst : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:11.3349580Z %147 = arith.cmpi eq, %140, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:11.3349850Z %148 = tt.broadcast %142 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:11.3350130Z %149 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:11.3350442Z %150 = arith.select %149, %148, %146 : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:11.3350733Z %151 = tt.reshape %150 : tensor<16x2x16xi8> -> tensor<32x16xi8> 2026-02-21T09:05:11.3350995Z %152 = arith.sitofp %151 : tensor<32x16xi8> to tensor<32x16xf32> 2026-02-21T09:05:11.3351358Z %153 = tt.dot %125, %152, %107, inputPrecision = tf32 : tensor<64x32xf32> * 
tensor<32x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:11.3351708Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:11.3351938Z %154 = arith.muli %c16_i32, %c2_i32_7 : i32 2026-02-21T09:05:11.3352137Z %155 = arith.addi %arg3, %154 : i32 2026-02-21T09:05:11.3352377Z %156 = tt.splat %155 : i32 -> tensor<16xi32> 2026-02-21T09:05:11.3352599Z %157 = arith.addi %156, %15 : tensor<16xi32> 2026-02-21T09:05:11.3352796Z %158 = arith.muli %155, %c2_i32 : i32 2026-02-21T09:05:11.3353034Z %159 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:11.3353281Z %160 = tt.splat %158 : i32 -> tensor<32xi32> 2026-02-21T09:05:11.3353490Z %161 = arith.addi %160, %159 : tensor<32xi32> 2026-02-21T09:05:11.3353748Z %162 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:11.3354015Z %163 = arith.muli %162, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:11.3354280Z %164 = tt.expand_dims %161 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:11.3354602Z %165 = tt.broadcast %163 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:11.3354878Z %166 = tt.broadcast %164 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:11.3355116Z %167 = arith.addi %165, %166 : tensor<64x32xi32> 2026-02-21T09:05:11.3355372Z %168 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:11.3355668Z %169 = tt.addptr %168, %167 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:11.3355980Z %170 = tt.load %169 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:11.3356284Z %171 = arith.extf %170 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:11.3356573Z %172 = tt.expand_dims %157 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:11.3356848Z %173 = arith.muli %172, %cst_3 : tensor<16x1xi32> 2026-02-21T09:05:11.3357103Z %174 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:11.3357394Z %175 = tt.broadcast 
%173 : tensor<16x1xi32> -> tensor<16x16xi32> 2026-02-21T09:05:11.3357657Z %176 = tt.broadcast %174 : tensor<1x16xi32> -> tensor<16x16xi32> 2026-02-21T09:05:11.3357891Z %177 = arith.addi %175, %176 : tensor<16x16xi32> 2026-02-21T09:05:11.3358131Z %178 = tt.splat %arg1 : !tt.ptr -> tensor<16x16x!tt.ptr> 2026-02-21T09:05:11.3358404Z %179 = tt.addptr %178, %177 : tensor<16x16x!tt.ptr>, tensor<16x16xi32> 2026-02-21T09:05:11.3358711Z %180 = tt.load %179 evictionPolicy = evict_first : tensor<16x16x!tt.ptr> 2026-02-21T09:05:11.3358984Z %181 = arith.shli %180, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3359201Z %182 = arith.shrsi %181, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3359424Z %183 = arith.shrsi %180, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3359668Z %184 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:11.3359964Z %185 = tt.expand_dims %184 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:11.3360276Z %186 = tt.expand_dims %185 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:11.3360607Z %187 = tt.expand_dims %182 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:11.3360935Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:11.3361217Z %189 = arith.cmpi eq, %186, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:11.3361475Z %190 = tt.broadcast %189 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:11.3361797Z %191 = tt.broadcast %187 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:11.3362092Z %192 = arith.select %190, %191, %cst : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:11.3362372Z %193 = arith.cmpi eq, %186, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:11.3362620Z %194 = tt.broadcast %188 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:11.3362901Z %195 = tt.broadcast %193 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:11.3363205Z %196 = arith.select 
%195, %194, %192 : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:11.3363518Z %197 = tt.reshape %196 : tensor<16x2x16xi8> -> tensor<32x16xi8> 2026-02-21T09:05:11.3363784Z %198 = arith.sitofp %197 : tensor<32x16xi8> to tensor<32x16xf32> 2026-02-21T09:05:11.3364153Z %199 = tt.dot %171, %198, %153, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:11.3364486Z scf.yield %199 : tensor<64x16xf32> 2026-02-21T09:05:11.3364685Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:11.3364898Z %19 = tt.splat %c4080_i32 : i32 -> tensor<16xi32> 2026-02-21T09:05:11.3365113Z %20 = arith.addi %19, %15 : tensor<16xi32> 2026-02-21T09:05:11.3365320Z %21 = arith.muli %c4080_i32, %c2_i32 : i32 2026-02-21T09:05:11.3365590Z %22 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:11.3365840Z %23 = tt.splat %21 : i32 -> tensor<32xi32> 2026-02-21T09:05:11.3366036Z %24 = arith.addi %23, %22 : tensor<32xi32> 2026-02-21T09:05:11.3366278Z %25 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:11.3366543Z %26 = arith.muli %25, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:11.3366792Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:11.3367082Z %28 = tt.broadcast %26 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:11.3367335Z %29 = tt.broadcast %27 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:11.3367569Z %30 = arith.addi %28, %29 : tensor<64x32xi32> 2026-02-21T09:05:11.3367817Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:11.3368092Z %32 = tt.addptr %31, %30 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:11.3368394Z %33 = tt.load %32 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:11.3368681Z %34 = arith.extf %33 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:11.3368962Z %35 = tt.expand_dims %20 {axis = 1 : i32} : tensor<16xi32> -> 
tensor<16x1xi32> 2026-02-21T09:05:11.3369216Z %36 = arith.muli %35, %cst_3 : tensor<16x1xi32> 2026-02-21T09:05:11.3369469Z %37 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:11.3369751Z %38 = tt.broadcast %36 : tensor<16x1xi32> -> tensor<16x16xi32> 2026-02-21T09:05:11.3369996Z %39 = tt.broadcast %37 : tensor<1x16xi32> -> tensor<16x16xi32> 2026-02-21T09:05:11.3370231Z %40 = arith.addi %38, %39 : tensor<16x16xi32> 2026-02-21T09:05:11.3370461Z %41 = tt.splat %arg1 : !tt.ptr -> tensor<16x16x!tt.ptr> 2026-02-21T09:05:11.3370733Z %42 = tt.addptr %41, %40 : tensor<16x16x!tt.ptr>, tensor<16x16xi32> 2026-02-21T09:05:11.3371020Z %43 = tt.load %42 evictionPolicy = evict_first : tensor<16x16x!tt.ptr> 2026-02-21T09:05:11.3371281Z %44 = arith.shli %43, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3371496Z %45 = arith.shrsi %44, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3371739Z %46 = arith.shrsi %43, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:11.3371984Z %47 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:11.3372263Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:11.3372571Z %49 = tt.expand_dims %48 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:11.3372921Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:11.3373229Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:11.3373514Z %52 = arith.cmpi eq, %49, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:11.3373756Z %53 = tt.broadcast %52 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:11.3374026Z %54 = tt.broadcast %50 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:11.3374341Z %55 = arith.select %53, %54, %cst : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:11.3374622Z %56 = arith.cmpi eq, %49, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:11.3374912Z 
%57 = tt.broadcast %51 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:11.3375182Z %58 = tt.broadcast %56 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:11.3375467Z %59 = arith.select %58, %57, %55 : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:11.3375746Z %60 = tt.reshape %59 : tensor<16x2x16xi8> -> tensor<32x16xi8> 2026-02-21T09:05:11.3376020Z %61 = arith.sitofp %60 : tensor<32x16xi8> to tensor<32x16xf32> 2026-02-21T09:05:11.3376373Z %62 = tt.dot %34, %61, %18, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:11.3376740Z %63 = arith.truncf %62 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T09:05:11.3377104Z tt.descriptor_store %0[%10, %14], %63 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T09:05:11.3377391Z tt.return 2026-02-21T09:05:11.3377538Z } 2026-02-21T09:05:11.3377678Z } 2026-02-21T09:05:11.3377760Z 2026-02-21T09:05:11.3377816Z {-# 2026-02-21T09:05:11.3377953Z external_resources: { 2026-02-21T09:05:11.3378125Z mlir_reproducer: { 2026-02-21T09:05:11.3382694Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false 
num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:11.3387168Z disable_threading: false, 2026-02-21T09:05:11.3387346Z verify_each: true 2026-02-21T09:05:11.3387492Z } 2026-02-21T09:05:11.3387623Z } 2026-02-21T09:05:11.3387739Z #-} 2026-02-21T09:05:11.3388166Z /tmp/torchinductor_root/4n/c4nygr2kzmq77wqg7e7hsnttsukcdnjj4q3btlzo26uor5dr43ne.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:11.3389193Z /tmp/torchinductor_root/4n/c4nygr2kzmq77wqg7e7hsnttsukcdnjj4q3btlzo26uor5dr43ne.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:11.3390040Z [276s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:05:11.3391136Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 16], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:11.3392146Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:11.3392402Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:11.7542676Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:11.7544744Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:11.7545080Z ^ 2026-02-21T09:05:11.7545645Z /tmp/torchinductor_root/nf/cnf3xl33ocm6enate2bzxuoikjklfjzml5ahgsrrmxhmjmkwhlw7.py:91:40: note: called from 2026-02-21T09:05:11.7551003Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:11.7552559Z ^ 2026-02-21T09:05:11.7553024Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:11.7553506Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:11.7553757Z ^ 2026-02-21T09:05:11.7554024Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T09:05:11.7558647Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:11.7562790Z %cst = arith.constant dense<0> : tensor<32x2x64xi8> 2026-02-21T09:05:11.7564436Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:05:11.7564731Z %c112_i32 = arith.constant 112 : i32 
2026-02-21T09:05:11.7567677Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:11.7567931Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:11.7568132Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:11.7568341Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:11.7568528Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:05:11.7568746Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:11.7568987Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:11.7569221Z %cst_2 = arith.constant dense<4> : tensor<32x64xi8> 2026-02-21T09:05:11.7569461Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:05:11.7569694Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:11.7569909Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:11.7571487Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:05:11.7571958Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:11.7572160Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:11.7572350Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:11.7572544Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:11.7572726Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:11.7573048Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:11.7573369Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:11.7573588Z scf.for %arg3 = %1 to %c112_i32 step %c9472_i32 : i32 { 2026-02-21T09:05:11.7574088Z %2 = arith.divsi %arg3, %c448_i32 : i32 2026-02-21T09:05:11.7574290Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:05:11.7574476Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:05:11.7574656Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:05:11.7574852Z %6 = arith.remsi %arg3, %c448_i32 : i32 2026-02-21T09:05:11.7575038Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:05:11.7575220Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:05:11.7575451Z %9 = arith.divsi %6, %5 : i32 
2026-02-21T09:05:11.7575635Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:11.7575903Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:11.7576169Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:05:11.7576377Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:05:11.7576564Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:05:11.7576755Z %15 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:05:11.7576947Z %16 = arith.addi %15, %11 : tensor<64xi32> 2026-02-21T09:05:11.7577146Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:11.7577331Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:05:11.7577654Z %17 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c96_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:05:11.7578059Z %20 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:11.7578318Z %21 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:11.7578542Z %22 = arith.addi %21, %20 : tensor<32xi32> 2026-02-21T09:05:11.7578751Z %23 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:11.7578963Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:05:11.7579166Z %25 = arith.addi %24, %11 : tensor<64xi32> 2026-02-21T09:05:11.7579434Z %26 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:11.7579722Z %27 = arith.muli %26, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:11.7579991Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:11.7580297Z %29 = tt.broadcast %27 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:11.7580570Z %30 = tt.broadcast %28 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:11.7580828Z %31 = arith.addi %29, %30 : tensor<64x64xi32> 2026-02-21T09:05:11.7581084Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:11.7581377Z %33 = tt.addptr %32, %31 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 
2026-02-21T09:05:11.7581690Z %34 = tt.load %33 : tensor<64x64x!tt.ptr> 2026-02-21T09:05:11.7581927Z %35 = arith.extf %34 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:11.7582216Z %36 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:11.7582480Z %37 = arith.muli %36, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:11.7582739Z %38 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:11.7583026Z %39 = tt.broadcast %37 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:05:11.7583279Z %40 = tt.broadcast %38 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:05:11.7583519Z %41 = arith.addi %39, %40 : tensor<32x64xi32> 2026-02-21T09:05:11.7583758Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:05:11.7584054Z %43 = tt.addptr %42, %41 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:05:11.7584352Z %44 = tt.load %43 evictionPolicy = evict_first : tensor<32x64x!tt.ptr> 2026-02-21T09:05:11.7584673Z %45 = arith.shli %44, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7584888Z %46 = arith.shrsi %45, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7585109Z %47 = arith.shrsi %44, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7585389Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:11.7585687Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:11.7585994Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:11.7586321Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:11.7586676Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:11.7586954Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:11.7587237Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:11.7587503Z %55 = tt.broadcast %51 : 
tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:11.7587792Z %56 = arith.select %54, %55, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:11.7588065Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:11.7588309Z %58 = tt.broadcast %52 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:11.7588579Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:11.7588850Z %60 = arith.select %59, %58, %56 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:11.7589151Z %61 = tt.reshape %60 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:05:11.7589412Z %62 = arith.sitofp %61 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:05:11.7589772Z %63 = tt.dot %35, %62, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:11.7590097Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:05:11.7590294Z %64 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:05:11.7590495Z %65 = arith.addi %arg4, %64 : i32 2026-02-21T09:05:11.7590722Z %66 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:11.7590979Z %67 = tt.splat %65 : i32 -> tensor<32xi32> 2026-02-21T09:05:11.7591178Z %68 = arith.addi %67, %66 : tensor<32xi32> 2026-02-21T09:05:11.7591379Z %69 = arith.muli %65, %c2_i32 : i32 2026-02-21T09:05:11.7591617Z %70 = tt.splat %69 : i32 -> tensor<64xi32> 2026-02-21T09:05:11.7591818Z %71 = arith.addi %70, %11 : tensor<64xi32> 2026-02-21T09:05:11.7592077Z %72 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:11.7592343Z %73 = arith.muli %72, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:11.7592604Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:11.7592884Z %75 = tt.broadcast %73 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:11.7593151Z %76 = tt.broadcast %74 : tensor<1x64xi32> -> tensor<64x64xi32> 
2026-02-21T09:05:11.7593389Z %77 = arith.addi %75, %76 : tensor<64x64xi32> 2026-02-21T09:05:11.7593628Z %78 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:11.7593922Z %79 = tt.addptr %78, %77 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:11.7594176Z %80 = tt.load %79 : tensor<64x64x!tt.ptr> 2026-02-21T09:05:11.7594419Z %81 = arith.extf %80 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:11.7594694Z %82 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:11.7594960Z %83 = arith.muli %82, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:11.7595217Z %84 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:11.7595496Z %85 = tt.broadcast %83 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:05:11.7595753Z %86 = tt.broadcast %84 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:05:11.7596011Z %87 = arith.addi %85, %86 : tensor<32x64xi32> 2026-02-21T09:05:11.7596293Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:05:11.7596583Z %89 = tt.addptr %88, %87 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:05:11.7596891Z %90 = tt.load %89 evictionPolicy = evict_first : tensor<32x64x!tt.ptr> 2026-02-21T09:05:11.7597169Z %91 = arith.shli %90, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7597394Z %92 = arith.shrsi %91, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7597655Z %93 = arith.shrsi %90, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7597906Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:11.7598245Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:11.7598577Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:11.7598910Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:11.7599250Z %98 = tt.expand_dims %93 {axis = 1 : i32} : 
tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:11.7599539Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:11.7599807Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:11.7600095Z %101 = tt.broadcast %97 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:11.7600439Z %102 = arith.select %100, %101, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:11.7600742Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:11.7601013Z %104 = tt.broadcast %98 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:11.7601308Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:11.7601651Z %106 = arith.select %105, %104, %102 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:11.7601960Z %107 = tt.reshape %106 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:05:11.7602246Z %108 = arith.sitofp %107 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:05:11.7602618Z %109 = tt.dot %81, %108, %63, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:11.7602963Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:11.7603170Z %110 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:05:11.7603388Z %111 = arith.addi %arg4, %110 : i32 2026-02-21T09:05:11.7603634Z %112 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:11.7603938Z %113 = tt.splat %111 : i32 -> tensor<32xi32> 2026-02-21T09:05:11.7604156Z %114 = arith.addi %113, %112 : tensor<32xi32> 2026-02-21T09:05:11.7604357Z %115 = arith.muli %111, %c2_i32 : i32 2026-02-21T09:05:11.7604568Z %116 = tt.splat %115 : i32 -> tensor<64xi32> 2026-02-21T09:05:11.7604771Z %117 = arith.addi %116, %11 : tensor<64xi32> 2026-02-21T09:05:11.7605033Z %118 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:11.7605310Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 
2026-02-21T09:05:11.7605583Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:11.7605892Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:11.7606155Z %122 = tt.broadcast %120 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:11.7606404Z %123 = arith.addi %121, %122 : tensor<64x64xi32> 2026-02-21T09:05:11.7606649Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:11.7606949Z %125 = tt.addptr %124, %123 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:11.7607215Z %126 = tt.load %125 : tensor<64x64x!tt.ptr> 2026-02-21T09:05:11.7607462Z %127 = arith.extf %126 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:11.7607788Z %128 = tt.expand_dims %114 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:11.7608058Z %129 = arith.muli %128, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:11.7608324Z %130 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:11.7608612Z %131 = tt.broadcast %129 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:05:11.7608882Z %132 = tt.broadcast %130 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:05:11.7609156Z %133 = arith.addi %131, %132 : tensor<32x64xi32> 2026-02-21T09:05:11.7609394Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:05:11.7609722Z %135 = tt.addptr %134, %133 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:05:11.7610047Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<32x64x!tt.ptr> 2026-02-21T09:05:11.7610325Z %137 = arith.shli %136, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7610555Z %138 = arith.shrsi %137, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7610778Z %139 = arith.shrsi %136, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7611030Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:11.7611323Z %141 = tt.expand_dims %140 {axis = 0 : i32} : 
tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:11.7611730Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:11.7612059Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:11.7612394Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:11.7612686Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:11.7612939Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:11.7613219Z %147 = tt.broadcast %143 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:11.7613515Z %148 = arith.select %146, %147, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:11.7613801Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:11.7614063Z %150 = tt.broadcast %144 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:11.7614335Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:11.7614628Z %152 = arith.select %151, %150, %148 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:11.7614913Z %153 = tt.reshape %152 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:05:11.7615186Z %154 = arith.sitofp %153 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:05:11.7615546Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:11.7615871Z scf.yield %155 : tensor<64x64xf32> 2026-02-21T09:05:11.7616068Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:11.7616384Z %18 = scf.for %arg4 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %17) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:05:11.7616754Z %20 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:11.7617008Z %21 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:11.7617220Z %22 = arith.addi %21, %20 : tensor<32xi32> 
2026-02-21T09:05:11.7617427Z %23 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:11.7617620Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:05:11.7617822Z %25 = arith.addi %24, %11 : tensor<64xi32> 2026-02-21T09:05:11.7618068Z %26 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:11.7618341Z %27 = arith.muli %26, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:11.7618590Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:11.7618911Z %29 = tt.broadcast %27 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:11.7619177Z %30 = tt.broadcast %28 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:11.7619408Z %31 = arith.addi %29, %30 : tensor<64x64xi32> 2026-02-21T09:05:11.7619653Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:11.7619940Z %33 = tt.addptr %32, %31 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:11.7620225Z %34 = tt.load %33 : tensor<64x64x!tt.ptr> 2026-02-21T09:05:11.7620461Z %35 = arith.extf %34 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:11.7620774Z %36 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:11.7621041Z %37 = arith.muli %36, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:11.7621293Z %38 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:11.7621611Z %39 = tt.broadcast %37 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:05:11.7621870Z %40 = tt.broadcast %38 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:05:11.7622111Z %41 = arith.addi %39, %40 : tensor<32x64xi32> 2026-02-21T09:05:11.7622344Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:05:11.7622622Z %43 = tt.addptr %42, %41 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:05:11.7622954Z %44 = tt.load %43 evictionPolicy = evict_first : tensor<32x64x!tt.ptr> 2026-02-21T09:05:11.7623223Z %45 = arith.shli %44, %cst_2 
: tensor<32x64xi8> 2026-02-21T09:05:11.7623451Z %46 = arith.shrsi %45, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7623665Z %47 = arith.shrsi %44, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:11.7623914Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:11.7624199Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:11.7624520Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:11.7624852Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:11.7625167Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:11.7625450Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:11.7625695Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:11.7625968Z %55 = tt.broadcast %51 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:11.7626252Z %56 = arith.select %54, %55, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:11.7626513Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:11.7626763Z %58 = tt.broadcast %52 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:11.7627024Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:11.7627302Z %60 = arith.select %59, %58, %56 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:11.7627571Z %61 = tt.reshape %60 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:05:11.7627831Z %62 = arith.sitofp %61 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:05:11.7628184Z %63 = tt.dot %35, %62, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:11.7628495Z scf.yield %63 : tensor<64x64xf32> 2026-02-21T09:05:11.7628720Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:05:11.7628979Z %19 = arith.truncf %18 : 
tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:05:11.7629302Z tt.descriptor_store %0[%10, %14], %19 : !tt.tensordesc>, tensor<64x64xbf16> 2026-02-21T09:05:11.7629627Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:05:11.7629837Z tt.return 2026-02-21T09:05:11.7630004Z } 2026-02-21T09:05:11.7630126Z } 2026-02-21T09:05:11.7630196Z 2026-02-21T09:05:11.7630253Z {-# 2026-02-21T09:05:11.7630383Z external_resources: { 2026-02-21T09:05:11.7630546Z mlir_reproducer: { 2026-02-21T09:05:11.7634943Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, 
tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:11.7639500Z disable_threading: false, 2026-02-21T09:05:11.7639678Z verify_each: true 2026-02-21T09:05:11.7639838Z } 2026-02-21T09:05:11.7639967Z } 2026-02-21T09:05:11.7640085Z #-} 2026-02-21T09:05:11.7640529Z /tmp/torchinductor_root/nf/cnf3xl33ocm6enate2bzxuoikjklfjzml5ahgsrrmxhmjmkwhlw7.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:11.7641626Z /tmp/torchinductor_root/nf/cnf3xl33ocm6enate2bzxuoikjklfjzml5ahgsrrmxhmjmkwhlw7.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:11.7642501Z [277s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:11.7643731Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:11.7644844Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:11.7645113Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:12.1699653Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 15.1 configs/s 2026-02-21T09:05:12.8442621Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1426.1 2026-02-21T09:05:12.8446400Z configs/s 2026-02-21T09:05:12.8963547Z [278s] Generation 13 complete: 2026-02-21T09:05:12.8968524Z error=2 2026-02-21T09:05:12.8972716Z ok=37 2026-02-21T09:05:12.8974159Z min=0.1077 2026-02-21T09:05:12.8974551Z mid=0.2530 2026-02-21T09:05:12.8974678Z max=13.7329 2026-02-21T09:05:12.8974831Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:05:12.8975048Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:05:12.8975259Z 'l2_groupings': [1], 2026-02-21T09:05:12.8975425Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:05:12.8975618Z 'loop_orders': [[0, 1]], 2026-02-21T09:05:12.8975780Z 'num_stages': 3, 2026-02-21T09:05:12.8975929Z 'num_warps': 1, 2026-02-21T09:05:12.8976168Z 'pid_type': 'flat', 2026-02-21T09:05:12.8976325Z 'range_flattens': [None, None], 2026-02-21T09:05:12.8976511Z 'range_multi_buffers': [None, True], 2026-02-21T09:05:12.8976696Z 'range_num_stages': [0, 0], 2026-02-21T09:05:12.8976908Z 'range_unroll_factors': [0, 0], 2026-02-21T09:05:12.8977091Z 'range_warp_specializes': [None, None]} 2026-02-21T09:05:12.8995608Z [278s] Fitting surrogate: 1071 points, 1071 targets 2026-02-21T09:05:13.5633248Z [279s] Generation 14 starting: 33 neighbors, 2 active search path(s) 2026-02-21T09:05:19.2339980Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 3.0 configs/s 2026-02-21T09:05:20.6931892Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:20.6936623Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:20.6940388Z ^ 2026-02-21T09:05:20.6942381Z /tmp/torchinductor_root/ep/cep2pd4d46axjcqnahck3led53kkwonqhfibk6u2ooqkgckd6jln.py:86:36: note: called from 
2026-02-21T09:05:20.6942845Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:20.6943057Z ^ 2026-02-21T09:05:20.6943486Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:20.6943964Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:20.6944226Z ^ 2026-02-21T09:05:20.6944410Z module { 2026-02-21T09:05:20.6946119Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:20.6946763Z %cst = arith.constant dense<0> : tensor<16x2x16xi8> 2026-02-21T09:05:20.6947019Z %c1792_i32 = arith.constant 1792 : i32 2026-02-21T09:05:20.6947246Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:20.6947440Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:20.6947666Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:20.6947911Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:20.6948163Z %cst_2 = arith.constant dense<4> : tensor<16x16xi8> 2026-02-21T09:05:20.6948415Z %cst_3 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:20.6948636Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:20.6948879Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<64x16xf32> 2026-02-21T09:05:20.6949121Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:05:20.6949314Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:20.6949497Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:20.6949693Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:20.6949890Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:20.6950082Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:20.6950289Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:20.6950620Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, 
%c1_i64] : , > 2026-02-21T09:05:20.6951082Z %1 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:20.6951392Z %2 = tt.get_program_id x : i32 2026-02-21T09:05:20.6951689Z %3 = arith.divsi %2, %c1792_i32 : i32 2026-02-21T09:05:20.6952137Z %4 = arith.muli %3, %c4_i32 : i32 2026-02-21T09:05:20.6952316Z %5 = arith.subi %c1_i32, %4 : i32 2026-02-21T09:05:20.6952500Z %6 = arith.minsi %5, %c4_i32 : i32 2026-02-21T09:05:20.6952681Z %7 = arith.remsi %2, %c1792_i32 : i32 2026-02-21T09:05:20.6952869Z %8 = arith.remsi %7, %6 : i32 2026-02-21T09:05:20.6953042Z %9 = arith.addi %4, %8 : i32 2026-02-21T09:05:20.6953219Z %10 = arith.divsi %7, %6 : i32 2026-02-21T09:05:20.6953394Z %11 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:05:20.6953727Z %12 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:20.6953983Z %13 = tt.splat %11 : i32 -> tensor<64xi32> 2026-02-21T09:05:20.6954246Z %14 = arith.addi %13, %12 : tensor<64xi32> 2026-02-21T09:05:20.6954445Z %15 = arith.muli %10, %c16_i32 : i32 2026-02-21T09:05:20.6954628Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:05:20.6954815Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:05:20.6955128Z %16 = scf.for %arg3 = %c0_i32 to %c4080_i32 step %c48_i32 iter_args(%arg4 = %cst_4) -> (tensor<64x16xf32>) : i32 { 2026-02-21T09:05:20.6955458Z %52 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:20.6955697Z %53 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:20.6955945Z %54 = tt.splat %52 : i32 -> tensor<32xi32> 2026-02-21T09:05:20.6956153Z %55 = arith.addi %54, %53 : tensor<32xi32> 2026-02-21T09:05:20.6956447Z %56 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:20.6956724Z %57 = arith.muli %56, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:20.6956981Z %58 = tt.expand_dims %55 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:20.6957279Z %59 = tt.broadcast 
%57 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:20.6957549Z %60 = tt.broadcast %58 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:20.6957785Z %61 = arith.addi %59, %60 : tensor<64x32xi32> 2026-02-21T09:05:20.6958040Z %62 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:20.6958325Z %63 = tt.addptr %62, %61 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:20.6958651Z %64 = tt.load %63 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:20.6958948Z %65 = arith.extf %64 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:20.6959288Z %66 = tt.descriptor_load %0[%arg3, %15] : !tt.tensordesc> -> tensor<16x16xi8> 2026-02-21T09:05:20.6959604Z %67 = arith.shli %66, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6959818Z %68 = arith.shrsi %67, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6960037Z %69 = arith.shrsi %66, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6960277Z %70 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:20.6960598Z %71 = tt.expand_dims %70 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:20.6960901Z %72 = tt.expand_dims %71 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:20.6961222Z %73 = tt.expand_dims %68 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:20.6961584Z %74 = tt.expand_dims %69 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:20.6961868Z %75 = arith.cmpi eq, %72, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:20.6962113Z %76 = tt.broadcast %75 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:20.6962387Z %77 = tt.broadcast %73 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:20.6962672Z %78 = arith.select %76, %77, %cst : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:20.6962933Z %79 = arith.cmpi eq, %72, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:20.6963183Z %80 = tt.broadcast %74 : tensor<16x1x16xi8> -> 
tensor<16x2x16xi8> 2026-02-21T09:05:20.6963488Z %81 = tt.broadcast %79 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:20.6963797Z %82 = arith.select %81, %80, %78 : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:20.6964075Z %83 = tt.reshape %82 : tensor<16x2x16xi8> -> tensor<32x16xi8> 2026-02-21T09:05:20.6964337Z %84 = arith.sitofp %83 : tensor<32x16xi8> to tensor<32x16xf32> 2026-02-21T09:05:20.6964687Z %85 = tt.dot %65, %84, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:20.6965041Z %c1_i32_5 = arith.constant 1 : i32 2026-02-21T09:05:20.6965234Z %86 = arith.muli %c16_i32, %c1_i32_5 : i32 2026-02-21T09:05:20.6965430Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T09:05:20.6965641Z %88 = arith.muli %87, %c2_i32 : i32 2026-02-21T09:05:20.6965876Z %89 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:20.6966125Z %90 = tt.splat %88 : i32 -> tensor<32xi32> 2026-02-21T09:05:20.6966319Z %91 = arith.addi %90, %89 : tensor<32xi32> 2026-02-21T09:05:20.6966570Z %92 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:20.6966830Z %93 = arith.muli %92, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:20.6967086Z %94 = tt.expand_dims %91 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:20.6967366Z %95 = tt.broadcast %93 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:20.6967648Z %96 = tt.broadcast %94 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:20.6967884Z %97 = arith.addi %95, %96 : tensor<64x32xi32> 2026-02-21T09:05:20.6968124Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:20.6968413Z %99 = tt.addptr %98, %97 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:20.6968713Z %100 = tt.load %99 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:20.6969017Z %101 = arith.extf %100 : tensor<64x32xbf16> to tensor<64x32xf32> 
2026-02-21T09:05:20.6969332Z %102 = tt.descriptor_load %0[%87, %15] : !tt.tensordesc> -> tensor<16x16xi8> 2026-02-21T09:05:20.6969627Z %103 = arith.shli %102, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6969852Z %104 = arith.shrsi %103, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6970067Z %105 = arith.shrsi %102, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6970318Z %106 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:20.6970609Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:20.6970928Z %108 = tt.expand_dims %107 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:20.6971258Z %109 = tt.expand_dims %104 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:20.6971634Z %110 = tt.expand_dims %105 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:20.6971927Z %111 = arith.cmpi eq, %108, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:20.6972180Z %112 = tt.broadcast %111 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:20.6972459Z %113 = tt.broadcast %109 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:20.6972746Z %114 = arith.select %112, %113, %cst : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:20.6973028Z %115 = arith.cmpi eq, %108, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:20.6973292Z %116 = tt.broadcast %110 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:20.6973564Z %117 = tt.broadcast %115 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:20.6973854Z %118 = arith.select %117, %116, %114 : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:20.6974135Z %119 = tt.reshape %118 : tensor<16x2x16xi8> -> tensor<32x16xi8> 2026-02-21T09:05:20.6974406Z %120 = arith.sitofp %119 : tensor<32x16xi8> to tensor<32x16xf32> 2026-02-21T09:05:20.6974771Z %121 = tt.dot %101, %120, %85, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x16xf32> -> tensor<64x16xf32> 
2026-02-21T09:05:20.6975119Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:05:20.6975319Z %122 = arith.muli %c16_i32, %c2_i32_6 : i32 2026-02-21T09:05:20.6975511Z %123 = arith.addi %arg3, %122 : i32 2026-02-21T09:05:20.6975700Z %124 = arith.muli %123, %c2_i32 : i32 2026-02-21T09:05:20.6975930Z %125 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:20.6976216Z %126 = tt.splat %124 : i32 -> tensor<32xi32> 2026-02-21T09:05:20.6976428Z %127 = arith.addi %126, %125 : tensor<32xi32> 2026-02-21T09:05:20.6976707Z %128 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:20.6976985Z %129 = arith.muli %128, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:20.6977242Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:20.6977539Z %131 = tt.broadcast %129 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:20.6977799Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:20.6978046Z %133 = arith.addi %131, %132 : tensor<64x32xi32> 2026-02-21T09:05:20.6978295Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:20.6978584Z %135 = tt.addptr %134, %133 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:20.6978930Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:20.6979230Z %137 = arith.extf %136 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:20.6979555Z %138 = tt.descriptor_load %0[%123, %15] : !tt.tensordesc> -> tensor<16x16xi8> 2026-02-21T09:05:20.6979859Z %139 = arith.shli %138, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6980078Z %140 = arith.shrsi %139, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6980302Z %141 = arith.shrsi %138, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6980547Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:20.6980840Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> 
-> tensor<1x2xi32> 2026-02-21T09:05:20.6981149Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:20.6981476Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:20.6981950Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:20.6982235Z %147 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:20.6982497Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:20.6982769Z %149 = tt.broadcast %145 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:20.6983063Z %150 = arith.select %148, %149, %cst : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:20.6983335Z %151 = arith.cmpi eq, %144, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:20.6983593Z %152 = tt.broadcast %146 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:20.6983868Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:20.6984172Z %154 = arith.select %153, %152, %150 : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:20.6984474Z %155 = tt.reshape %154 : tensor<16x2x16xi8> -> tensor<32x16xi8> 2026-02-21T09:05:20.6984748Z %156 = arith.sitofp %155 : tensor<32x16xi8> to tensor<32x16xf32> 2026-02-21T09:05:20.6985128Z %157 = tt.dot %137, %156, %121, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:20.6985467Z scf.yield %157 : tensor<64x16xf32> 2026-02-21T09:05:20.6985663Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:20.6985867Z %17 = arith.muli %c4080_i32, %c2_i32 : i32 2026-02-21T09:05:20.6986112Z %18 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:20.6986413Z %19 = tt.splat %17 : i32 -> tensor<32xi32> 2026-02-21T09:05:20.6986615Z %20 = arith.addi %19, %18 : tensor<32xi32> 2026-02-21T09:05:20.6986877Z %21 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> 
tensor<64x1xi32> 2026-02-21T09:05:20.6987167Z %22 = arith.muli %21, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:20.6987429Z %23 = tt.expand_dims %20 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:20.6987778Z %24 = tt.broadcast %22 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:20.6988055Z %25 = tt.broadcast %23 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:20.6988361Z %26 = arith.addi %24, %25 : tensor<64x32xi32> 2026-02-21T09:05:20.6988612Z %27 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:20.6988908Z %28 = tt.addptr %27, %26 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:20.6989224Z %29 = tt.load %28 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:20.6989522Z %30 = arith.extf %29 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:20.6989864Z %31 = tt.descriptor_load %0[%c4080_i32, %15] : !tt.tensordesc> -> tensor<16x16xi8> 2026-02-21T09:05:20.6990183Z %32 = arith.shli %31, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6990409Z %33 = arith.shrsi %32, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6990658Z %34 = arith.shrsi %31, %cst_2 : tensor<16x16xi8> 2026-02-21T09:05:20.6990915Z %35 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:20.6991217Z %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:20.6991550Z %37 = tt.expand_dims %36 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:20.6991869Z %38 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:20.6992177Z %39 = tt.expand_dims %34 {axis = 1 : i32} : tensor<16x16xi8> -> tensor<16x1x16xi8> 2026-02-21T09:05:20.6992461Z %40 = arith.cmpi eq, %37, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:20.6992713Z %41 = tt.broadcast %40 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:20.6992975Z %42 = tt.broadcast %38 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 
2026-02-21T09:05:20.6993259Z %43 = arith.select %41, %42, %cst : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:20.6993519Z %44 = arith.cmpi eq, %37, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:20.6993767Z %45 = tt.broadcast %39 : tensor<16x1x16xi8> -> tensor<16x2x16xi8> 2026-02-21T09:05:20.6994021Z %46 = tt.broadcast %44 : tensor<1x2x1xi1> -> tensor<16x2x16xi1> 2026-02-21T09:05:20.6994294Z %47 = arith.select %46, %45, %43 : tensor<16x2x16xi1>, tensor<16x2x16xi8> 2026-02-21T09:05:20.6994567Z %48 = tt.reshape %47 : tensor<16x2x16xi8> -> tensor<32x16xi8> 2026-02-21T09:05:20.6994815Z %49 = arith.sitofp %48 : tensor<32x16xi8> to tensor<32x16xf32> 2026-02-21T09:05:20.6995156Z %50 = tt.dot %30, %49, %16, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:20.6995492Z %51 = arith.truncf %50 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T09:05:20.6995810Z tt.descriptor_store %1[%11, %15], %51 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T09:05:20.6996081Z tt.return 2026-02-21T09:05:20.6996225Z } 2026-02-21T09:05:20.6996350Z } 2026-02-21T09:05:20.6996424Z 2026-02-21T09:05:20.6996475Z {-# 2026-02-21T09:05:20.6996614Z external_resources: { 2026-02-21T09:05:20.6996770Z mlir_reproducer: { 2026-02-21T09:05:20.7001096Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:20.7005589Z disable_threading: false, 2026-02-21T09:05:20.7005799Z verify_each: true 2026-02-21T09:05:20.7005950Z } 2026-02-21T09:05:20.7006081Z } 2026-02-21T09:05:20.7006198Z #-} 2026-02-21T09:05:20.7006633Z /tmp/torchinductor_root/ep/cep2pd4d46axjcqnahck3led53kkwonqhfibk6u2ooqkgckd6jln.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:20.7007682Z /tmp/torchinductor_root/ep/cep2pd4d46axjcqnahck3led53kkwonqhfibk6u2ooqkgckd6jln.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:20.7008509Z [286s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:05:20.7009606Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:20.7010591Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:20.7010848Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:20.8769258Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:20.8773784Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:20.8777644Z ^ 2026-02-21T09:05:20.8781964Z /tmp/torchinductor_root/xn/cxn757cwhfdjr2cqdeixdmvqmks5opsjqg7ktmexacquwggxfpxi.py:91:40: note: called from 2026-02-21T09:05:20.8783008Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:20.8783272Z ^ 2026-02-21T09:05:20.8783725Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:20.8784218Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:20.8784481Z ^ 2026-02-21T09:05:20.8784673Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T09:05:20.8785280Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:20.8786116Z %cst = arith.constant dense<0> : tensor<32x2x16xi8> 2026-02-21T09:05:20.8786347Z %c1792_i32 = arith.constant 1792 : i32 2026-02-21T09:05:20.8786545Z %c448_i32 = arith.constant 448 : 
i32 2026-02-21T09:05:20.8786731Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:20.8786921Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:20.8787102Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:20.8787349Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:20.8787533Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:05:20.8787778Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:20.8788015Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:20.8788242Z %cst_2 = arith.constant dense<4> : tensor<32x16xi8> 2026-02-21T09:05:20.8788476Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:05:20.8788710Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:20.8788924Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:20.8789147Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x16xf32> 2026-02-21T09:05:20.8789368Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:05:20.8789548Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:20.8789718Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:20.8789941Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:20.8790128Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:20.8790311Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:20.8790620Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:20.8790948Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:20.8791167Z scf.for %arg3 = %1 to %c448_i32 step %c9472_i32 : i32 { 2026-02-21T09:05:20.8791394Z %2 = arith.divsi %arg3, %c1792_i32 : i32 2026-02-21T09:05:20.8791638Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:05:20.8791818Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:05:20.8792001Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:05:20.8792187Z %6 = arith.remsi %arg3, %c1792_i32 : i32 2026-02-21T09:05:20.8792379Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:05:20.8792557Z %8 = arith.addi %3, %7 : i32 
2026-02-21T09:05:20.8792729Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:05:20.8792915Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:20.8793151Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:20.8793413Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:05:20.8793614Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:05:20.8793812Z %14 = arith.muli %9, %c16_i32 : i32 2026-02-21T09:05:20.8794052Z %15 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:20.8794302Z %16 = tt.splat %14 : i32 -> tensor<16xi32> 2026-02-21T09:05:20.8794507Z %17 = arith.addi %16, %15 : tensor<16xi32> 2026-02-21T09:05:20.8794697Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:20.8794888Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:05:20.8795201Z %18 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c96_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x16xf32>) : i32 { 2026-02-21T09:05:20.8795569Z %21 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:20.8795822Z %22 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:20.8796039Z %23 = arith.addi %22, %21 : tensor<32xi32> 2026-02-21T09:05:20.8796245Z %24 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:20.8796438Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:05:20.8796641Z %26 = arith.addi %25, %11 : tensor<64xi32> 2026-02-21T09:05:20.8796894Z %27 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:20.8797205Z %28 = arith.muli %27, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:20.8797460Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:20.8797759Z %30 = tt.broadcast %28 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:20.8798027Z %31 = tt.broadcast %29 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:20.8798263Z %32 = arith.addi %30, %31 : tensor<64x64xi32> 2026-02-21T09:05:20.8798551Z %33 = 
tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:20.8798835Z %34 = tt.addptr %33, %32 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:20.8799176Z %35 = tt.load %34 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:20.8799476Z %36 = arith.extf %35 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:20.8799758Z %37 = tt.expand_dims %23 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:20.8800031Z %38 = arith.muli %37, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:20.8800285Z %39 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:20.8800572Z %40 = tt.broadcast %38 : tensor<32x1xi32> -> tensor<32x16xi32> 2026-02-21T09:05:20.8800824Z %41 = tt.broadcast %39 : tensor<1x16xi32> -> tensor<32x16xi32> 2026-02-21T09:05:20.8801085Z %42 = arith.addi %40, %41 : tensor<32x16xi32> 2026-02-21T09:05:20.8801327Z %43 = tt.splat %arg1 : !tt.ptr -> tensor<32x16x!tt.ptr> 2026-02-21T09:05:20.8801627Z %44 = tt.addptr %43, %42 : tensor<32x16x!tt.ptr>, tensor<32x16xi32> 2026-02-21T09:05:20.8801959Z %45 = tt.load %44 evictionPolicy = evict_first : tensor<32x16x!tt.ptr> 2026-02-21T09:05:20.8802227Z %46 = arith.shli %45, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8802443Z %47 = arith.shrsi %46, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8802664Z %48 = arith.shrsi %45, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8802913Z %49 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:20.8803202Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:20.8803534Z %51 = tt.expand_dims %50 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:20.8803864Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<32x16xi8> -> tensor<32x1x16xi8> 2026-02-21T09:05:20.8804206Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x16xi8> -> tensor<32x1x16xi8> 2026-02-21T09:05:20.8804507Z %54 = arith.cmpi eq, %51, 
%cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:20.8804771Z %55 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<32x2x16xi1> 2026-02-21T09:05:20.8805060Z %56 = tt.broadcast %52 : tensor<32x1x16xi8> -> tensor<32x2x16xi8> 2026-02-21T09:05:20.8805355Z %57 = arith.select %55, %56, %cst : tensor<32x2x16xi1>, tensor<32x2x16xi8> 2026-02-21T09:05:20.8805641Z %58 = arith.cmpi eq, %51, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:20.8805898Z %59 = tt.broadcast %53 : tensor<32x1x16xi8> -> tensor<32x2x16xi8> 2026-02-21T09:05:20.8806181Z %60 = tt.broadcast %58 : tensor<1x2x1xi1> -> tensor<32x2x16xi1> 2026-02-21T09:05:20.8806471Z %61 = arith.select %60, %59, %57 : tensor<32x2x16xi1>, tensor<32x2x16xi8> 2026-02-21T09:05:20.8806760Z %62 = tt.reshape %61 : tensor<32x2x16xi8> -> tensor<64x16xi8> 2026-02-21T09:05:20.8807041Z %63 = arith.sitofp %62 : tensor<64x16xi8> to tensor<64x16xf32> 2026-02-21T09:05:20.8807413Z %64 = tt.dot %36, %63, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:20.8807757Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:05:20.8807966Z %65 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:05:20.8808177Z %66 = arith.addi %arg4, %65 : i32 2026-02-21T09:05:20.8808424Z %67 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:20.8808709Z %68 = tt.splat %66 : i32 -> tensor<32xi32> 2026-02-21T09:05:20.8808923Z %69 = arith.addi %68, %67 : tensor<32xi32> 2026-02-21T09:05:20.8809123Z %70 = arith.muli %66, %c2_i32 : i32 2026-02-21T09:05:20.8809329Z %71 = tt.splat %70 : i32 -> tensor<64xi32> 2026-02-21T09:05:20.8809533Z %72 = arith.addi %71, %11 : tensor<64xi32> 2026-02-21T09:05:20.8809822Z %73 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:20.8810112Z %74 = arith.muli %73, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:20.8810401Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:20.8810703Z %76 = 
tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:20.8810969Z %77 = tt.broadcast %75 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:20.8811220Z %78 = arith.addi %76, %77 : tensor<64x64xi32> 2026-02-21T09:05:20.8811460Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:20.8811789Z %80 = tt.addptr %79, %78 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:20.8812099Z %81 = tt.load %80 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:20.8812416Z %82 = arith.extf %81 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:20.8812706Z %83 = tt.expand_dims %69 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:20.8812972Z %84 = arith.muli %83, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:20.8813235Z %85 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:20.8813522Z %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x16xi32> 2026-02-21T09:05:20.8813780Z %87 = tt.broadcast %85 : tensor<1x16xi32> -> tensor<32x16xi32> 2026-02-21T09:05:20.8814020Z %88 = arith.addi %86, %87 : tensor<32x16xi32> 2026-02-21T09:05:20.8814258Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<32x16x!tt.ptr> 2026-02-21T09:05:20.8814532Z %90 = tt.addptr %89, %88 : tensor<32x16x!tt.ptr>, tensor<32x16xi32> 2026-02-21T09:05:20.8814826Z %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x16x!tt.ptr> 2026-02-21T09:05:20.8815099Z %92 = arith.shli %91, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8815322Z %93 = arith.shrsi %92, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8815542Z %94 = arith.shrsi %91, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8815796Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:20.8816086Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:20.8816398Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 
2026-02-21T09:05:20.8816713Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<32x16xi8> -> tensor<32x1x16xi8> 2026-02-21T09:05:20.8817040Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<32x16xi8> -> tensor<32x1x16xi8> 2026-02-21T09:05:20.8817334Z %100 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:20.8817593Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<32x2x16xi1> 2026-02-21T09:05:20.8817879Z %102 = tt.broadcast %98 : tensor<32x1x16xi8> -> tensor<32x2x16xi8> 2026-02-21T09:05:20.8818173Z %103 = arith.select %101, %102, %cst : tensor<32x2x16xi1>, tensor<32x2x16xi8> 2026-02-21T09:05:20.8818458Z %104 = arith.cmpi eq, %97, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:20.8818724Z %105 = tt.broadcast %99 : tensor<32x1x16xi8> -> tensor<32x2x16xi8> 2026-02-21T09:05:20.8818997Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<32x2x16xi1> 2026-02-21T09:05:20.8819290Z %107 = arith.select %106, %105, %103 : tensor<32x2x16xi1>, tensor<32x2x16xi8> 2026-02-21T09:05:20.8819601Z %108 = tt.reshape %107 : tensor<32x2x16xi8> -> tensor<64x16xi8> 2026-02-21T09:05:20.8819875Z %109 = arith.sitofp %108 : tensor<64x16xi8> to tensor<64x16xf32> 2026-02-21T09:05:20.8820234Z %110 = tt.dot %82, %109, %64, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:20.8820560Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:20.8820770Z %111 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:05:20.8820995Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:05:20.8821236Z %113 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:20.8821524Z %114 = tt.splat %112 : i32 -> tensor<32xi32> 2026-02-21T09:05:20.8821774Z %115 = arith.addi %114, %113 : tensor<32xi32> 2026-02-21T09:05:20.8821970Z %116 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:05:20.8822171Z %117 = tt.splat %116 : i32 -> tensor<64xi32> 2026-02-21T09:05:20.8822377Z %118 = arith.addi %117, %11 : tensor<64xi32> 
2026-02-21T09:05:20.8822629Z %119 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:20.8822907Z %120 = arith.muli %119, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:20.8823174Z %121 = tt.expand_dims %118 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:20.8823501Z %122 = tt.broadcast %120 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:20.8823771Z %123 = tt.broadcast %121 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:20.8824022Z %124 = arith.addi %122, %123 : tensor<64x64xi32> 2026-02-21T09:05:20.8824273Z %125 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:20.8824568Z %126 = tt.addptr %125, %124 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:20.8824889Z %127 = tt.load %126 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:20.8825182Z %128 = arith.extf %127 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:20.8825481Z %129 = tt.expand_dims %115 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:20.8825763Z %130 = arith.muli %129, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:20.8826020Z %131 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:20.8826317Z %132 = tt.broadcast %130 : tensor<32x1xi32> -> tensor<32x16xi32> 2026-02-21T09:05:20.8826581Z %133 = tt.broadcast %131 : tensor<1x16xi32> -> tensor<32x16xi32> 2026-02-21T09:05:20.8826825Z %134 = arith.addi %132, %133 : tensor<32x16xi32> 2026-02-21T09:05:20.8827060Z %135 = tt.splat %arg1 : !tt.ptr -> tensor<32x16x!tt.ptr> 2026-02-21T09:05:20.8827345Z %136 = tt.addptr %135, %134 : tensor<32x16x!tt.ptr>, tensor<32x16xi32> 2026-02-21T09:05:20.8827654Z %137 = tt.load %136 evictionPolicy = evict_first : tensor<32x16x!tt.ptr> 2026-02-21T09:05:20.8827924Z %138 = arith.shli %137, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8828151Z %139 = arith.shrsi %138, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8828372Z %140 = 
arith.shrsi %137, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8828624Z %141 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:20.8828918Z %142 = tt.expand_dims %141 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:20.8829237Z %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:20.8829566Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<32x16xi8> -> tensor<32x1x16xi8> 2026-02-21T09:05:20.8829889Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<32x16xi8> -> tensor<32x1x16xi8> 2026-02-21T09:05:20.8830181Z %146 = arith.cmpi eq, %143, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:20.8830433Z %147 = tt.broadcast %146 : tensor<1x2x1xi1> -> tensor<32x2x16xi1> 2026-02-21T09:05:20.8830741Z %148 = tt.broadcast %144 : tensor<32x1x16xi8> -> tensor<32x2x16xi8> 2026-02-21T09:05:20.8831034Z %149 = arith.select %147, %148, %cst : tensor<32x2x16xi1>, tensor<32x2x16xi8> 2026-02-21T09:05:20.8831306Z %150 = arith.cmpi eq, %143, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:20.8831593Z %151 = tt.broadcast %145 : tensor<32x1x16xi8> -> tensor<32x2x16xi8> 2026-02-21T09:05:20.8831869Z %152 = tt.broadcast %150 : tensor<1x2x1xi1> -> tensor<32x2x16xi1> 2026-02-21T09:05:20.8832184Z %153 = arith.select %152, %151, %149 : tensor<32x2x16xi1>, tensor<32x2x16xi8> 2026-02-21T09:05:20.8832490Z %154 = tt.reshape %153 : tensor<32x2x16xi8> -> tensor<64x16xi8> 2026-02-21T09:05:20.8832766Z %155 = arith.sitofp %154 : tensor<64x16xi8> to tensor<64x16xf32> 2026-02-21T09:05:20.8833134Z %156 = tt.dot %128, %155, %110, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:20.8833462Z scf.yield %156 : tensor<64x16xf32> 2026-02-21T09:05:20.8833660Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:20.8833977Z %19 = scf.for %arg4 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %18) -> (tensor<64x16xf32>) : i32 { 2026-02-21T09:05:20.8834354Z %21 = 
tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:20.8834619Z %22 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:20.8834865Z %23 = arith.addi %22, %21 : tensor<32xi32> 2026-02-21T09:05:20.8835075Z %24 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:20.8835268Z %25 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T09:05:20.8835470Z %26 = arith.addi %25, %11 : tensor<64xi32> 2026-02-21T09:05:20.8835717Z %27 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:20.8835991Z %28 = arith.muli %27, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:20.8836247Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:20.8836533Z %30 = tt.broadcast %28 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:20.8836794Z %31 = tt.broadcast %29 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:20.8837026Z %32 = arith.addi %30, %31 : tensor<64x64xi32> 2026-02-21T09:05:20.8837271Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:20.8837549Z %34 = tt.addptr %33, %32 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:20.8837856Z %35 = tt.load %34 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:20.8838152Z %36 = arith.extf %35 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:20.8838427Z %37 = tt.expand_dims %23 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:20.8838697Z %38 = arith.muli %37, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:20.8838942Z %39 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:20.8839229Z %40 = tt.broadcast %38 : tensor<32x1xi32> -> tensor<32x16xi32> 2026-02-21T09:05:20.8839486Z %41 = tt.broadcast %39 : tensor<1x16xi32> -> tensor<32x16xi32> 2026-02-21T09:05:20.8839715Z %42 = arith.addi %40, %41 : tensor<32x16xi32> 2026-02-21T09:05:20.8839954Z %43 = tt.splat %arg1 : !tt.ptr -> tensor<32x16x!tt.ptr> 
2026-02-21T09:05:20.8840218Z %44 = tt.addptr %43, %42 : tensor<32x16x!tt.ptr>, tensor<32x16xi32> 2026-02-21T09:05:20.8840515Z %45 = tt.load %44 evictionPolicy = evict_first : tensor<32x16x!tt.ptr> 2026-02-21T09:05:20.8840781Z %46 = arith.shli %45, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8840998Z %47 = arith.shrsi %46, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8841216Z %48 = arith.shrsi %45, %cst_2 : tensor<32x16xi8> 2026-02-21T09:05:20.8841450Z %49 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:20.8841810Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:20.8842114Z %51 = tt.expand_dims %50 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:20.8842435Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<32x16xi8> -> tensor<32x1x16xi8> 2026-02-21T09:05:20.8842746Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x16xi8> -> tensor<32x1x16xi8> 2026-02-21T09:05:20.8843075Z %54 = arith.cmpi eq, %51, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:20.8843330Z %55 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<32x2x16xi1> 2026-02-21T09:05:20.8843621Z %56 = tt.broadcast %52 : tensor<32x1x16xi8> -> tensor<32x2x16xi8> 2026-02-21T09:05:20.8843912Z %57 = arith.select %55, %56, %cst : tensor<32x2x16xi1>, tensor<32x2x16xi8> 2026-02-21T09:05:20.8844175Z %58 = arith.cmpi eq, %51, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:20.8844431Z %59 = tt.broadcast %53 : tensor<32x1x16xi8> -> tensor<32x2x16xi8> 2026-02-21T09:05:20.8844705Z %60 = tt.broadcast %58 : tensor<1x2x1xi1> -> tensor<32x2x16xi1> 2026-02-21T09:05:20.8844978Z %61 = arith.select %60, %59, %57 : tensor<32x2x16xi1>, tensor<32x2x16xi8> 2026-02-21T09:05:20.8845258Z %62 = tt.reshape %61 : tensor<32x2x16xi8> -> tensor<64x16xi8> 2026-02-21T09:05:20.8845541Z %63 = arith.sitofp %62 : tensor<64x16xi8> to tensor<64x16xf32> 2026-02-21T09:05:20.8845900Z %64 = tt.dot %36, %63, %arg5, inputPrecision = tf32 : tensor<64x64xf32> 
* tensor<64x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:20.8846220Z scf.yield %64 : tensor<64x16xf32> 2026-02-21T09:05:20.8846448Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:05:20.8846716Z %20 = arith.truncf %19 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T09:05:20.8847034Z tt.descriptor_store %0[%10, %14], %20 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T09:05:20.8847424Z } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T09:05:20.8847712Z tt.return 2026-02-21T09:05:20.8847869Z } 2026-02-21T09:05:20.8847994Z } 2026-02-21T09:05:20.8848075Z 2026-02-21T09:05:20.8848130Z {-# 2026-02-21T09:05:20.8848276Z external_resources: { 2026-02-21T09:05:20.8848440Z mlir_reproducer: { 2026-02-21T09:05:20.8853058Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:20.8857723Z disable_threading: false, 2026-02-21T09:05:20.8857897Z verify_each: true 2026-02-21T09:05:20.8858038Z } 2026-02-21T09:05:20.8858166Z } 2026-02-21T09:05:20.8858279Z #-} 2026-02-21T09:05:20.8858708Z /tmp/torchinductor_root/xn/cxn757cwhfdjr2cqdeixdmvqmks5opsjqg7ktmexacquwggxfpxi.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:20.8859780Z /tmp/torchinductor_root/xn/cxn757cwhfdjr2cqdeixdmvqmks5opsjqg7ktmexacquwggxfpxi.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:20.8860594Z [286s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:05:20.8861849Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 16], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:05:20.8862909Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:20.8863164Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:21.2562857Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:21.2563764Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:21.2564026Z ^ 2026-02-21T09:05:21.2564405Z /tmp/torchinductor_root/n7/cn7dk543isajvfv3ii33k6id675gfeqinrogis3pv5566s4kdlnd.py:94:40: note: called from 2026-02-21T09:05:21.2564799Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:21.2565014Z ^ 2026-02-21T09:05:21.2565406Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:21.2565873Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:21.2566115Z ^ 2026-02-21T09:05:21.2566311Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T09:05:21.2566837Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:21.2567382Z %cst = arith.constant dense<0> : tensor<32x2x32xi8> 2026-02-21T09:05:21.2567610Z %c896_i32 = arith.constant 896 : i32 
2026-02-21T09:05:21.2567796Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:21.2567988Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:21.2568179Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:21.2568369Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:21.2568581Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:21.2568832Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:21.2569077Z %cst_2 = arith.constant dense<4> : tensor<32x32xi8> 2026-02-21T09:05:21.2569323Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:05:21.2569575Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:21.2569791Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:21.2570027Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:21.2570265Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:21.2570454Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:21.2570833Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:21.2571020Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:21.2571225Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:21.2571414Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:21.2571909Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:21.2572250Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:21.2572532Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:05:21.2572721Z %3 = arith.minsi %2, %c224_i32 : i32 2026-02-21T09:05:21.2572976Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:05:21.2573205Z %4 = arith.divsi %arg3, %c896_i32 : i32 2026-02-21T09:05:21.2573401Z %5 = arith.muli %4, %c4_i32 : i32 2026-02-21T09:05:21.2573595Z %6 = arith.subi %c1_i32, %5 : i32 2026-02-21T09:05:21.2573779Z %7 = arith.minsi %6, %c4_i32 : i32 2026-02-21T09:05:21.2573980Z %8 = arith.remsi %arg3, %c896_i32 : i32 2026-02-21T09:05:21.2574173Z %9 = arith.remsi %8, %7 : i32 
2026-02-21T09:05:21.2574359Z %10 = arith.addi %5, %9 : i32 2026-02-21T09:05:21.2574534Z %11 = arith.divsi %8, %7 : i32 2026-02-21T09:05:21.2574723Z %12 = arith.muli %10, %c64_i32 : i32 2026-02-21T09:05:21.2574977Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:21.2575292Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.2575513Z %15 = arith.addi %14, %13 : tensor<64xi32> 2026-02-21T09:05:21.2575710Z %16 = arith.muli %11, %c32_i32 : i32 2026-02-21T09:05:21.2575958Z %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:21.2576210Z %18 = tt.splat %16 : i32 -> tensor<32xi32> 2026-02-21T09:05:21.2576430Z %19 = arith.addi %18, %17 : tensor<32xi32> 2026-02-21T09:05:21.2576632Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:21.2576820Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:05:21.2577146Z %20 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c96_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:21.2577486Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:21.2577711Z %24 = arith.addi %23, %17 : tensor<32xi32> 2026-02-21T09:05:21.2577905Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:21.2578108Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.2578315Z %27 = arith.addi %26, %13 : tensor<64xi32> 2026-02-21T09:05:21.2578563Z %28 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.2578845Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.2579104Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.2579400Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.2579661Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.2579904Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:05:21.2580157Z %34 = tt.splat %arg0 : !tt.ptr -> 
tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.2580441Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:21.2580755Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.2581046Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:21.2581335Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:21.2581647Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:21.2581903Z %40 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:21.2582196Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:21.2582478Z %42 = tt.broadcast %40 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:21.2582718Z %43 = arith.addi %41, %42 : tensor<32x32xi32> 2026-02-21T09:05:21.2582955Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:21.2583230Z %45 = tt.addptr %44, %43 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:21.2583555Z %46 = tt.load %45 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T09:05:21.2583844Z %47 = arith.shli %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2584063Z %48 = arith.shrsi %47, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2584310Z %49 = arith.shrsi %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2584551Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.2584840Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.2585144Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.2585468Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:21.2585786Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:21.2586070Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 
2026-02-21T09:05:21.2586352Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:21.2586619Z %57 = tt.broadcast %53 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:21.2586911Z %58 = arith.select %56, %57, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:21.2587175Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.2587426Z %60 = tt.broadcast %54 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:21.2587697Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:21.2587968Z %62 = arith.select %61, %60, %58 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:21.2588245Z %63 = tt.reshape %62 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:21.2588493Z %64 = arith.sitofp %63 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:21.2588846Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:21.2589163Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:05:21.2589366Z %66 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:05:21.2589565Z %67 = arith.addi %arg4, %66 : i32 2026-02-21T09:05:21.2589755Z %68 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:05:21.2589961Z %69 = arith.addi %68, %17 : tensor<32xi32> 2026-02-21T09:05:21.2590154Z %70 = arith.muli %67, %c2_i32 : i32 2026-02-21T09:05:21.2590352Z %71 = tt.splat %70 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.2590545Z %72 = arith.addi %71, %13 : tensor<64xi32> 2026-02-21T09:05:21.2590799Z %73 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.2591070Z %74 = arith.muli %73, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.2591320Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.2591644Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.2591902Z %77 = tt.broadcast %75 : tensor<1x64xi32> 
-> tensor<64x64xi32> 2026-02-21T09:05:21.2592144Z %78 = arith.addi %76, %77 : tensor<64x64xi32> 2026-02-21T09:05:21.2592388Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.2592678Z %80 = tt.addptr %79, %78 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:21.2592984Z %81 = tt.load %80 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.2593274Z %82 = arith.extf %81 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:21.2593584Z %83 = tt.expand_dims %69 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:21.2593845Z %84 = arith.muli %83, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:21.2594104Z %85 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:21.2594393Z %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:21.2594675Z %87 = tt.broadcast %85 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:21.2594912Z %88 = arith.addi %86, %87 : tensor<32x32xi32> 2026-02-21T09:05:21.2595172Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:21.2595450Z %90 = tt.addptr %89, %88 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:21.2595744Z %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T09:05:21.2596013Z %92 = arith.shli %91, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2596236Z %93 = arith.shrsi %92, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2596449Z %94 = arith.shrsi %91, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2596693Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.2596981Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.2597318Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.2597636Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:21.2597963Z %99 = 
tt.expand_dims %94 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:21.2598251Z %100 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.2598506Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:21.2598788Z %102 = tt.broadcast %98 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:21.2599082Z %103 = arith.select %101, %102, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:21.2599367Z %104 = arith.cmpi eq, %97, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.2599622Z %105 = tt.broadcast %99 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:21.2599894Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:21.2600181Z %107 = arith.select %106, %105, %103 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:21.2600464Z %108 = tt.reshape %107 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:21.2600736Z %109 = arith.sitofp %108 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:21.2601088Z %110 = tt.dot %82, %109, %65, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:21.2601413Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:21.2601647Z %111 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:05:21.2601843Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:05:21.2602051Z %113 = tt.splat %112 : i32 -> tensor<32xi32> 2026-02-21T09:05:21.2602257Z %114 = arith.addi %113, %17 : tensor<32xi32> 2026-02-21T09:05:21.2602462Z %115 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:05:21.2602657Z %116 = tt.splat %115 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.2602870Z %117 = arith.addi %116, %13 : tensor<64xi32> 2026-02-21T09:05:21.2603131Z %118 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.2603409Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.2603686Z %120 = tt.expand_dims %117 {axis = 0 : i32} : 
tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.2603976Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.2604248Z %122 = tt.broadcast %120 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.2604518Z %123 = arith.addi %121, %122 : tensor<64x64xi32> 2026-02-21T09:05:21.2604771Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.2605074Z %125 = tt.addptr %124, %123 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:21.2605393Z %126 = tt.load %125 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.2605728Z %127 = arith.extf %126 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:21.2606021Z %128 = tt.expand_dims %114 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:21.2606331Z %129 = arith.muli %128, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:21.2606611Z %130 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:21.2606909Z %131 = tt.broadcast %129 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:21.2607168Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:21.2607412Z %133 = arith.addi %131, %132 : tensor<32x32xi32> 2026-02-21T09:05:21.2607644Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:21.2607931Z %135 = tt.addptr %134, %133 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:21.2608267Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T09:05:21.2608545Z %137 = arith.shli %136, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2608775Z %138 = arith.shrsi %137, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2608999Z %139 = arith.shrsi %136, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2609250Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.2609552Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 
2026-02-21T09:05:21.2609869Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.2610206Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:21.2610533Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:21.2610842Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.2611110Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:21.2611409Z %147 = tt.broadcast %143 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:21.2611757Z %148 = arith.select %146, %147, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:21.2612050Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.2612323Z %150 = tt.broadcast %144 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:21.2612611Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:21.2612918Z %152 = arith.select %151, %150, %148 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:21.2613226Z %153 = tt.reshape %152 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:21.2613503Z %154 = arith.sitofp %153 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:21.2613885Z %155 = tt.dot %127, %154, %110, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:21.2614222Z scf.yield %155 : tensor<64x32xf32> 2026-02-21T09:05:21.2614428Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:21.2614756Z %21 = scf.for %arg4 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %20) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:21.2615110Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:21.2615334Z %24 = arith.addi %23, %17 : tensor<32xi32> 2026-02-21T09:05:21.2615542Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:21.2615753Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 
2026-02-21T09:05:21.2616001Z %27 = arith.addi %26, %13 : tensor<64xi32> 2026-02-21T09:05:21.2616266Z %28 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.2616545Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.2616820Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.2617123Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.2617414Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.2617687Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:05:21.2617962Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.2618270Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:21.2618597Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.2618890Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:21.2619178Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:21.2619438Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:21.2619697Z %40 = tt.expand_dims %19 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:21.2620021Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:21.2620276Z %42 = tt.broadcast %40 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:21.2620523Z %43 = arith.addi %41, %42 : tensor<32x32xi32> 2026-02-21T09:05:21.2620756Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:21.2621030Z %45 = tt.addptr %44, %43 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:21.2621320Z %46 = tt.load %45 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T09:05:21.2621607Z %47 = arith.shli %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2621826Z %48 = arith.shrsi %47, %cst_2 : tensor<32x32xi8> 
2026-02-21T09:05:21.2622040Z %49 = arith.shrsi %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:21.2622299Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.2622591Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.2622907Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.2623234Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:21.2623552Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:21.2623838Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.2624083Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:21.2624360Z %57 = tt.broadcast %53 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:21.2624642Z %58 = arith.select %56, %57, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:21.2624917Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.2625175Z %60 = tt.broadcast %54 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:21.2625438Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:21.2625718Z %62 = arith.select %61, %60, %58 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:21.2625993Z %63 = tt.reshape %62 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:21.2626259Z %64 = arith.sitofp %63 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:21.2626610Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:21.2626980Z scf.yield %65 : tensor<64x32xf32> 2026-02-21T09:05:21.2627207Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:05:21.2627469Z %22 = arith.truncf %21 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:21.2627793Z tt.descriptor_store %0[%12, %16], %22 
: !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:05:21.2628152Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T09:05:21.2628438Z tt.return 2026-02-21T09:05:21.2628569Z } 2026-02-21T09:05:21.2628697Z } 2026-02-21T09:05:21.2628768Z 2026-02-21T09:05:21.2628827Z {-# 2026-02-21T09:05:21.2628959Z external_resources: { 2026-02-21T09:05:21.2629146Z mlir_reproducer: { 2026-02-21T09:05:21.2633558Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, 
triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:21.2637964Z disable_threading: false, 2026-02-21T09:05:21.2638129Z verify_each: true 2026-02-21T09:05:21.2638280Z } 2026-02-21T09:05:21.2638398Z } 2026-02-21T09:05:21.2638526Z #-} 2026-02-21T09:05:21.2638939Z /tmp/torchinductor_root/n7/cn7dk543isajvfv3ii33k6id675gfeqinrogis3pv5566s4kdlnd.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:21.2639943Z /tmp/torchinductor_root/n7/cn7dk543isajvfv3ii33k6id675gfeqinrogis3pv5566s4kdlnd.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:21.2640754Z [286s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:21.2641955Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:05:21.2643013Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:21.2643274Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:21.4252377Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:21.4253177Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:21.4253591Z ^ 2026-02-21T09:05:21.4254193Z /tmp/torchinductor_root/m7/cm7uidlbool47iacl6pgbackpimlc7ut7ay3yb2tx4nhjm2hp7r6.py:86:36: note: called from 2026-02-21T09:05:21.4255042Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:21.4255356Z ^ 2026-02-21T09:05:21.4256021Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:21.4256749Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:21.4257102Z ^ 2026-02-21T09:05:21.4257358Z module { 2026-02-21T09:05:21.4258110Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:21.4258969Z %cst = arith.constant dense<0> : tensor<32x2x64xi8> 2026-02-21T09:05:21.4259299Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:05:21.4259572Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:21.4259906Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:21.4260178Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:21.4260461Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:21.4260771Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:21.4261115Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:21.4261451Z %cst_2 = arith.constant dense<4> : tensor<32x64xi8> 2026-02-21T09:05:21.4261858Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:05:21.4262222Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:21.4262554Z %c2_i32 = arith.constant 2 : i32 
2026-02-21T09:05:21.4262917Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:05:21.4263286Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:21.4263577Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:21.4263874Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:21.4264173Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:21.4264472Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:21.4264989Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:21.4265522Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:21.4265797Z %2 = arith.divsi %1, %c448_i32 : i32 2026-02-21T09:05:21.4266092Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:05:21.4266378Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:05:21.4266654Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:05:21.4266945Z %6 = arith.remsi %1, %c448_i32 : i32 2026-02-21T09:05:21.4267218Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:05:21.4267489Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:05:21.4267748Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:05:21.4268020Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:21.4268387Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:21.4268801Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.4269126Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:05:21.4269424Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:05:21.4269729Z %15 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.4270035Z %16 = arith.addi %15, %11 : tensor<64xi32> 2026-02-21T09:05:21.4270345Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:21.4270635Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:05:21.4271152Z %17 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c96_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:05:21.4271868Z %20 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 
2026-02-21T09:05:21.4272313Z %21 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T09:05:21.4272654Z %22 = arith.addi %21, %20 : tensor<32xi32> 2026-02-21T09:05:21.4272972Z %23 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:21.4273281Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.4273664Z %25 = arith.addi %24, %11 : tensor<64xi32> 2026-02-21T09:05:21.4274075Z %26 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.4274560Z %27 = arith.muli %26, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.4274983Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.4275461Z %29 = tt.broadcast %27 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.4275890Z %30 = tt.broadcast %28 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.4276261Z %31 = arith.addi %29, %30 : tensor<64x64xi32> 2026-02-21T09:05:21.4276652Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.4277114Z %33 = tt.addptr %32, %31 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:21.4277696Z %34 = tt.load %33 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.4278173Z %35 = arith.extf %34 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:21.4278645Z %36 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:21.4279077Z %37 = arith.muli %36, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:21.4279484Z %38 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.4279916Z %39 = tt.broadcast %37 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:05:21.4280300Z %40 = tt.broadcast %38 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:05:21.4280657Z %41 = arith.addi %39, %40 : tensor<32x64xi32> 2026-02-21T09:05:21.4281005Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:05:21.4281425Z %43 = tt.addptr %42, %41 : tensor<32x64x!tt.ptr>, 
tensor<32x64xi32> 2026-02-21T09:05:21.4281929Z %44 = tt.load %43 evictionPolicy = evict_first : tensor<32x64x!tt.ptr> 2026-02-21T09:05:21.4282328Z %45 = arith.shli %44, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4282679Z %46 = arith.shrsi %45, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4283020Z %47 = arith.shrsi %44, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4283414Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.4283890Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.4284394Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.4284925Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:21.4285444Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:21.4285903Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.4286298Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:21.4286742Z %55 = tt.broadcast %51 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:21.4287207Z %56 = arith.select %54, %55, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:21.4287637Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.4288049Z %58 = tt.broadcast %52 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:21.4288484Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:21.4288932Z %60 = arith.select %59, %58, %56 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:21.4289418Z %61 = tt.reshape %60 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:05:21.4289844Z %62 = arith.sitofp %61 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:05:21.4290432Z %63 = tt.dot %35, %62, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:21.4290953Z %c1_i32_6 
= arith.constant 1 : i32 2026-02-21T09:05:21.4291262Z %64 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:05:21.4291659Z %65 = arith.addi %arg3, %64 : i32 2026-02-21T09:05:21.4292035Z %66 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:21.4292465Z %67 = tt.splat %65 : i32 -> tensor<32xi32> 2026-02-21T09:05:21.4292794Z %68 = arith.addi %67, %66 : tensor<32xi32> 2026-02-21T09:05:21.4293124Z %69 = arith.muli %65, %c2_i32 : i32 2026-02-21T09:05:21.4293426Z %70 = tt.splat %69 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.4293707Z %71 = arith.addi %70, %11 : tensor<64xi32> 2026-02-21T09:05:21.4294091Z %72 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.4294537Z %73 = arith.muli %72, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.4294947Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.4295427Z %75 = tt.broadcast %73 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.4295893Z %76 = tt.broadcast %74 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.4296265Z %77 = arith.addi %75, %76 : tensor<64x64xi32> 2026-02-21T09:05:21.4296661Z %78 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.4297109Z %79 = tt.addptr %78, %77 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:21.4297606Z %80 = tt.load %79 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.4298075Z %81 = arith.extf %80 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:21.4298539Z %82 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:21.4298972Z %83 = arith.muli %82, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:21.4299381Z %84 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.4299844Z %85 = tt.broadcast %83 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:05:21.4300256Z %86 = tt.broadcast %84 : tensor<1x64xi32> -> 
tensor<32x64xi32> 2026-02-21T09:05:21.4300636Z %87 = arith.addi %85, %86 : tensor<32x64xi32> 2026-02-21T09:05:21.4301018Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:05:21.4301461Z %89 = tt.addptr %88, %87 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:05:21.4301984Z %90 = tt.load %89 evictionPolicy = evict_first : tensor<32x64x!tt.ptr> 2026-02-21T09:05:21.4302402Z %91 = arith.shli %90, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4302730Z %92 = arith.shrsi %91, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4303071Z %93 = arith.shrsi %90, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4303464Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.4303936Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.4304433Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.4304949Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:21.4305458Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:21.4305926Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.4306300Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:21.4306715Z %101 = tt.broadcast %97 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:21.4307226Z %102 = arith.select %100, %101, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:21.4307642Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.4308034Z %104 = tt.broadcast %98 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:21.4308447Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:21.4308892Z %106 = arith.select %105, %104, %102 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:21.4309378Z %107 = tt.reshape %106 : tensor<32x2x64xi8> -> tensor<64x64xi8> 
2026-02-21T09:05:21.4309776Z %108 = arith.sitofp %107 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:05:21.4310357Z %109 = tt.dot %81, %108, %63, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:21.4310838Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:21.4311129Z %110 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:05:21.4311425Z %111 = arith.addi %arg3, %110 : i32 2026-02-21T09:05:21.4311860Z %112 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:21.4312266Z %113 = tt.splat %111 : i32 -> tensor<32xi32> 2026-02-21T09:05:21.4312600Z %114 = arith.addi %113, %112 : tensor<32xi32> 2026-02-21T09:05:21.4312924Z %115 = arith.muli %111, %c2_i32 : i32 2026-02-21T09:05:21.4313277Z %116 = tt.splat %115 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.4313612Z %117 = arith.addi %116, %11 : tensor<64xi32> 2026-02-21T09:05:21.4314028Z %118 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.4314473Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.4314879Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.4315319Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.4315729Z %122 = tt.broadcast %120 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.4316090Z %123 = arith.addi %121, %122 : tensor<64x64xi32> 2026-02-21T09:05:21.4316463Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.4316903Z %125 = tt.addptr %124, %123 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:21.4317393Z %126 = tt.load %125 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.4317850Z %127 = arith.extf %126 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:21.4318286Z %128 = tt.expand_dims %114 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:21.4318701Z %129 = 
arith.muli %128, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:21.4319094Z %130 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.4319555Z %131 = tt.broadcast %129 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:05:21.4319948Z %132 = tt.broadcast %130 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:05:21.4320310Z %133 = arith.addi %131, %132 : tensor<32x64xi32> 2026-02-21T09:05:21.4320672Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:05:21.4321090Z %135 = tt.addptr %134, %133 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:05:21.4321621Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<32x64x!tt.ptr> 2026-02-21T09:05:21.4322036Z %137 = arith.shli %136, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4322376Z %138 = arith.shrsi %137, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4322717Z %139 = arith.shrsi %136, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4323112Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.4323618Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.4324129Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.4324690Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:21.4325229Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:21.4325692Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.4326104Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:21.4326539Z %147 = tt.broadcast %143 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:21.4327049Z %148 = arith.select %146, %147, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:21.4327524Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.4327932Z %150 = 
tt.broadcast %144 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:21.4328382Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:21.4328835Z %152 = arith.select %151, %150, %148 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:21.4329302Z %153 = tt.reshape %152 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:05:21.4329728Z %154 = arith.sitofp %153 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:05:21.4330302Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:21.4330863Z scf.yield %155 : tensor<64x64xf32> 2026-02-21T09:05:21.4331136Z } 2026-02-21T09:05:21.4331578Z %18 = scf.for %arg3 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg4 = %17) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:05:21.4332137Z %20 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:21.4332520Z %21 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T09:05:21.4332855Z %22 = arith.addi %21, %20 : tensor<32xi32> 2026-02-21T09:05:21.4333167Z %23 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:21.4333483Z %24 = tt.splat %23 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.4333801Z %25 = arith.addi %24, %11 : tensor<64xi32> 2026-02-21T09:05:21.4334216Z %26 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.4334652Z %27 = arith.muli %26, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.4335045Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.4335485Z %29 = tt.broadcast %27 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.4335872Z %30 = tt.broadcast %28 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:21.4336235Z %31 = arith.addi %29, %30 : tensor<64x64xi32> 2026-02-21T09:05:21.4336601Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.4337037Z %33 = tt.addptr %32, %31 : 
tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:21.4337495Z %34 = tt.load %33 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:21.4337943Z %35 = arith.extf %34 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:21.4338374Z %36 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:21.4338769Z %37 = arith.muli %36, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:21.4339158Z %38 = tt.expand_dims %16 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:21.4339617Z %39 = tt.broadcast %37 : tensor<32x1xi32> -> tensor<32x64xi32> 2026-02-21T09:05:21.4340041Z %40 = tt.broadcast %38 : tensor<1x64xi32> -> tensor<32x64xi32> 2026-02-21T09:05:21.4340408Z %41 = arith.addi %39, %40 : tensor<32x64xi32> 2026-02-21T09:05:21.4340795Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<32x64x!tt.ptr> 2026-02-21T09:05:21.4341242Z %43 = tt.addptr %42, %41 : tensor<32x64x!tt.ptr>, tensor<32x64xi32> 2026-02-21T09:05:21.4341768Z %44 = tt.load %43 evictionPolicy = evict_first : tensor<32x64x!tt.ptr> 2026-02-21T09:05:21.4342223Z %45 = arith.shli %44, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4342540Z %46 = arith.shrsi %45, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4342866Z %47 = arith.shrsi %44, %cst_2 : tensor<32x64xi8> 2026-02-21T09:05:21.4343235Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.4343677Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.4344141Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.4344653Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:21.4345187Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<32x64xi8> -> tensor<32x1x64xi8> 2026-02-21T09:05:21.4345601Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.4345981Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> 
tensor<32x2x64xi1> 2026-02-21T09:05:21.4346387Z %55 = tt.broadcast %51 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:21.4346805Z %56 = arith.select %54, %55, %cst : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:21.4347208Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.4347575Z %58 = tt.broadcast %52 : tensor<32x1x64xi8> -> tensor<32x2x64xi8> 2026-02-21T09:05:21.4348014Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<32x2x64xi1> 2026-02-21T09:05:21.4348431Z %60 = arith.select %59, %58, %56 : tensor<32x2x64xi1>, tensor<32x2x64xi8> 2026-02-21T09:05:21.4348848Z %61 = tt.reshape %60 : tensor<32x2x64xi8> -> tensor<64x64xi8> 2026-02-21T09:05:21.4349237Z %62 = arith.sitofp %61 : tensor<64x64xi8> to tensor<64x64xf32> 2026-02-21T09:05:21.4349772Z %63 = tt.dot %35, %62, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:21.4350256Z scf.yield %63 : tensor<64x64xf32> 2026-02-21T09:05:21.4350521Z } {tt.num_stages = 1 : i32} 2026-02-21T09:05:21.4350839Z %19 = arith.truncf %18 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:05:21.4351331Z tt.descriptor_store %0[%10, %14], %19 : !tt.tensordesc>, tensor<64x64xbf16> 2026-02-21T09:05:21.4351776Z tt.return 2026-02-21T09:05:21.4351963Z } 2026-02-21T09:05:21.4352139Z } 2026-02-21T09:05:21.4352242Z 2026-02-21T09:05:21.4352323Z {-# 2026-02-21T09:05:21.4352518Z external_resources: { 2026-02-21T09:05:21.4352770Z mlir_reproducer: { 2026-02-21T09:05:21.4360460Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, 
triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:21.4368298Z disable_threading: false, 2026-02-21T09:05:21.4368552Z verify_each: true 2026-02-21T09:05:21.4368777Z } 2026-02-21T09:05:21.4368957Z } 2026-02-21T09:05:21.4369156Z #-} 2026-02-21T09:05:21.4369871Z /tmp/torchinductor_root/m7/cm7uidlbool47iacl6pgbackpimlc7ut7ay3yb2tx4nhjm2hp7r6.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:21.4371748Z /tmp/torchinductor_root/m7/cm7uidlbool47iacl6pgbackpimlc7ut7ay3yb2tx4nhjm2hp7r6.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton 
project.` 2026-02-21T09:05:21.4373218Z [286s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:21.4375145Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:21.4376841Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:21.4377273Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:21.5865689Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:21.5870312Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:21.5874910Z ^ 2026-02-21T09:05:21.5879042Z /tmp/torchinductor_root/js/cjsngyt5ademlae42hzmuwjdxudk5yuhkdj7cs4omwpmwpnxpmyw.py:86:36: note: called from 2026-02-21T09:05:21.5880898Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:21.5881151Z ^ 2026-02-21T09:05:21.5883050Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:21.5883581Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:21.5883834Z ^ 2026-02-21T09:05:21.5884009Z module { 2026-02-21T09:05:21.5885241Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:21.5885843Z %cst = arith.constant dense<0> : tensor<64x2x16xi8> 2026-02-21T09:05:21.5886083Z %c1792_i32 = 
arith.constant 1792 : i32 2026-02-21T09:05:21.5886283Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:21.5886469Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:21.5886663Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:21.5886887Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:21.5887142Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:21.5887374Z %cst_2 = arith.constant dense<4> : tensor<64x16xi8> 2026-02-21T09:05:21.5887632Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:21.5887895Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:21.5888119Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:21.5888348Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x16xf32> 2026-02-21T09:05:21.5888594Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:05:21.5888783Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:21.5896906Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:21.5897166Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:21.5897403Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:21.5897602Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:21.5897925Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:21.5898264Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:21.5898564Z %2 = arith.divsi %1, %c1792_i32 : i32 2026-02-21T09:05:21.5898765Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:05:21.5898947Z %4 = arith.subi %c1_i32, %3 : i32 2026-02-21T09:05:21.5899171Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:05:21.5899368Z %6 = arith.remsi %1, %c1792_i32 : i32 2026-02-21T09:05:21.5899552Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:05:21.5899736Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:05:21.5899904Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:05:21.5900090Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:21.5900322Z %11 = tt.make_range {end = 64 : i32, start = 0 : 
i32} : tensor<64xi32> 2026-02-21T09:05:21.5900643Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.5900853Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:05:21.5901048Z %14 = arith.muli %9, %c16_i32 : i32 2026-02-21T09:05:21.5901308Z %15 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:21.5901604Z %16 = tt.splat %14 : i32 -> tensor<16xi32> 2026-02-21T09:05:21.5901802Z %17 = arith.addi %16, %15 : tensor<16xi32> 2026-02-21T09:05:21.5902001Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:21.5902188Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:05:21.5902519Z %18 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x16xf32>) : i32 { 2026-02-21T09:05:21.5902860Z %64 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.5903070Z %65 = arith.addi %64, %11 : tensor<64xi32> 2026-02-21T09:05:21.5903277Z %66 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:21.5903513Z %67 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:21.5903776Z %68 = tt.splat %66 : i32 -> tensor<128xi32> 2026-02-21T09:05:21.5903976Z %69 = arith.addi %68, %67 : tensor<128xi32> 2026-02-21T09:05:21.5904247Z %70 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.5904531Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.5904794Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:21.5905099Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:21.5905363Z %74 = tt.broadcast %72 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:21.5905612Z %75 = arith.addi %73, %74 : tensor<64x128xi32> 2026-02-21T09:05:21.5905860Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:21.5906153Z %77 = tt.addptr %76, %75 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:21.5906467Z %78 = 
tt.load %77 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:21.5906759Z %79 = arith.extf %78 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:21.5907054Z %80 = tt.expand_dims %65 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.5907318Z %81 = arith.muli %80, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:21.5907575Z %82 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:21.5907855Z %83 = tt.broadcast %81 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:05:21.5908119Z %84 = tt.broadcast %82 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:05:21.5908356Z %85 = arith.addi %83, %84 : tensor<64x16xi32> 2026-02-21T09:05:21.5908591Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:05:21.5908908Z %87 = tt.addptr %86, %85 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:05:21.5909205Z %88 = tt.load %87 evictionPolicy = evict_first : tensor<64x16x!tt.ptr> 2026-02-21T09:05:21.5909482Z %89 = arith.shli %88, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5909698Z %90 = arith.shrsi %89, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5909927Z %91 = arith.shrsi %88, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5910212Z %92 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.5910532Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.5910848Z %94 = tt.expand_dims %93 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.5911166Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:05:21.5911494Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:05:21.5911830Z %97 = arith.cmpi eq, %94, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.5912082Z %98 = tt.broadcast %97 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:05:21.5912367Z %99 = tt.broadcast %95 : 
tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:05:21.5912698Z %100 = arith.select %98, %99, %cst : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:05:21.5913000Z %101 = arith.cmpi eq, %94, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.5913272Z %102 = tt.broadcast %96 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:05:21.5913572Z %103 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:05:21.5913884Z %104 = arith.select %103, %102, %100 : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:05:21.5914185Z %105 = tt.reshape %104 : tensor<64x2x16xi8> -> tensor<128x16xi8> 2026-02-21T09:05:21.5914470Z %106 = arith.sitofp %105 : tensor<128x16xi8> to tensor<128x16xf32> 2026-02-21T09:05:21.5914860Z %107 = tt.dot %79, %106, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:21.5915217Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:05:21.5915434Z %108 = arith.muli %c64_i32, %c1_i32_6 : i32 2026-02-21T09:05:21.5915640Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:05:21.5915856Z %110 = tt.splat %109 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.5916072Z %111 = arith.addi %110, %11 : tensor<64xi32> 2026-02-21T09:05:21.5916284Z %112 = arith.muli %109, %c2_i32 : i32 2026-02-21T09:05:21.5916534Z %113 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:21.5916810Z %114 = tt.splat %112 : i32 -> tensor<128xi32> 2026-02-21T09:05:21.5917032Z %115 = arith.addi %114, %113 : tensor<128xi32> 2026-02-21T09:05:21.5917311Z %116 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.5917609Z %117 = arith.muli %116, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.5917886Z %118 = tt.expand_dims %115 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:21.5918207Z %119 = tt.broadcast %117 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:21.5918493Z %120 = tt.broadcast %118 : 
tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:21.5918764Z %121 = arith.addi %119, %120 : tensor<64x128xi32> 2026-02-21T09:05:21.5919034Z %122 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:21.5919349Z %123 = tt.addptr %122, %121 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:21.5919695Z %124 = tt.load %123 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:21.5920019Z %125 = arith.extf %124 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:21.5920325Z %126 = tt.expand_dims %111 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.5920635Z %127 = arith.muli %126, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:21.5920906Z %128 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:21.5921205Z %129 = tt.broadcast %127 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:05:21.5921473Z %130 = tt.broadcast %128 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:05:21.5921757Z %131 = arith.addi %129, %130 : tensor<64x16xi32> 2026-02-21T09:05:21.5922023Z %132 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:05:21.5922332Z %133 = tt.addptr %132, %131 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:05:21.5922641Z %134 = tt.load %133 evictionPolicy = evict_first : tensor<64x16x!tt.ptr> 2026-02-21T09:05:21.5922925Z %135 = arith.shli %134, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5923156Z %136 = arith.shrsi %135, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5923383Z %137 = arith.shrsi %134, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5923647Z %138 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.5923949Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.5924281Z %140 = tt.expand_dims %139 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.5924654Z %141 = tt.expand_dims %136 {axis = 1 : i32} : 
tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:05:21.5924990Z %142 = tt.expand_dims %137 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:05:21.5925292Z %143 = arith.cmpi eq, %140, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.5925554Z %144 = tt.broadcast %143 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:05:21.5925842Z %145 = tt.broadcast %141 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:05:21.5926136Z %146 = arith.select %144, %145, %cst : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:05:21.5926424Z %147 = arith.cmpi eq, %140, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.5926687Z %148 = tt.broadcast %142 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:05:21.5926956Z %149 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:05:21.5927248Z %150 = arith.select %149, %148, %146 : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:05:21.5927537Z %151 = tt.reshape %150 : tensor<64x2x16xi8> -> tensor<128x16xi8> 2026-02-21T09:05:21.5927811Z %152 = arith.sitofp %151 : tensor<128x16xi8> to tensor<128x16xf32> 2026-02-21T09:05:21.5928187Z %153 = tt.dot %125, %152, %107, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:21.5928515Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:21.5928724Z %154 = arith.muli %c64_i32, %c2_i32_7 : i32 2026-02-21T09:05:21.5928921Z %155 = arith.addi %arg3, %154 : i32 2026-02-21T09:05:21.5929126Z %156 = tt.splat %155 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.5929328Z %157 = arith.addi %156, %11 : tensor<64xi32> 2026-02-21T09:05:21.5929532Z %158 = arith.muli %155, %c2_i32 : i32 2026-02-21T09:05:21.5929770Z %159 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:21.5930033Z %160 = tt.splat %158 : i32 -> tensor<128xi32> 2026-02-21T09:05:21.5930256Z %161 = arith.addi %160, %159 : tensor<128xi32> 2026-02-21T09:05:21.5930515Z %162 = tt.expand_dims %13 {axis = 1 : i32} : 
tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.5930796Z %163 = arith.muli %162, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.5931064Z %164 = tt.expand_dims %161 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:21.5931371Z %165 = tt.broadcast %163 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:21.5931692Z %166 = tt.broadcast %164 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:21.5931967Z %167 = arith.addi %165, %166 : tensor<64x128xi32> 2026-02-21T09:05:21.5932226Z %168 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:21.5932521Z %169 = tt.addptr %168, %167 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:21.5932846Z %170 = tt.load %169 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:21.5933147Z %171 = arith.extf %170 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:21.5933481Z %172 = tt.expand_dims %157 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.5933799Z %173 = arith.muli %172, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:21.5934062Z %174 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:21.5934365Z %175 = tt.broadcast %173 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:05:21.5934630Z %176 = tt.broadcast %174 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:05:21.5934878Z %177 = arith.addi %175, %176 : tensor<64x16xi32> 2026-02-21T09:05:21.5935112Z %178 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:05:21.5935391Z %179 = tt.addptr %178, %177 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:05:21.5935701Z %180 = tt.load %179 evictionPolicy = evict_first : tensor<64x16x!tt.ptr> 2026-02-21T09:05:21.5936014Z %181 = arith.shli %180, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5936253Z %182 = arith.shrsi %181, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5936476Z %183 = arith.shrsi %180, %cst_2 : tensor<64x16xi8> 
2026-02-21T09:05:21.5936737Z %184 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.5937042Z %185 = tt.expand_dims %184 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.5937362Z %186 = tt.expand_dims %185 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.5937696Z %187 = tt.expand_dims %182 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:05:21.5938022Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:05:21.5938320Z %189 = arith.cmpi eq, %186, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.5938575Z %190 = tt.broadcast %189 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:05:21.5938857Z %191 = tt.broadcast %187 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:05:21.5939162Z %192 = arith.select %190, %191, %cst : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:05:21.5939442Z %193 = arith.cmpi eq, %186, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.5939693Z %194 = tt.broadcast %188 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:05:21.5939970Z %195 = tt.broadcast %193 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:05:21.5940251Z %196 = arith.select %195, %194, %192 : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:05:21.5940546Z %197 = tt.reshape %196 : tensor<64x2x16xi8> -> tensor<128x16xi8> 2026-02-21T09:05:21.5940815Z %198 = arith.sitofp %197 : tensor<128x16xi8> to tensor<128x16xf32> 2026-02-21T09:05:21.5941167Z %199 = tt.dot %171, %198, %153, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:21.5941493Z scf.yield %199 : tensor<64x16xf32> 2026-02-21T09:05:21.5941727Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:21.5941936Z %19 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:05:21.5942144Z %20 = arith.addi %19, %11 : tensor<64xi32> 2026-02-21T09:05:21.5942347Z %21 = arith.muli %c4032_i32, %c2_i32 : i32 
2026-02-21T09:05:21.5942588Z %22 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:21.5942834Z %23 = tt.splat %21 : i32 -> tensor<128xi32> 2026-02-21T09:05:21.5943040Z %24 = arith.addi %23, %22 : tensor<128xi32> 2026-02-21T09:05:21.5943313Z %25 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.5943579Z %26 = arith.muli %25, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:21.5943828Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:21.5944121Z %28 = tt.broadcast %26 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:21.5944388Z %29 = tt.broadcast %27 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:21.5944646Z %30 = arith.addi %28, %29 : tensor<64x128xi32> 2026-02-21T09:05:21.5944893Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:21.5945196Z %32 = tt.addptr %31, %30 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:21.5945503Z %33 = tt.load %32 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:21.5945788Z %34 = arith.extf %33 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:21.5946075Z %35 = tt.expand_dims %20 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:21.5946338Z %36 = arith.muli %35, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:21.5946585Z %37 = tt.expand_dims %17 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:21.5946872Z %38 = tt.broadcast %36 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:05:21.5947160Z %39 = tt.broadcast %37 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:05:21.5947401Z %40 = arith.addi %38, %39 : tensor<64x16xi32> 2026-02-21T09:05:21.5947644Z %41 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:05:21.5947920Z %42 = tt.addptr %41, %40 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:05:21.5948223Z %43 = tt.load %42 evictionPolicy = evict_first : 
tensor<64x16x!tt.ptr> 2026-02-21T09:05:21.5948481Z %44 = arith.shli %43, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5948699Z %45 = arith.shrsi %44, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5948913Z %46 = arith.shrsi %43, %cst_2 : tensor<64x16xi8> 2026-02-21T09:05:21.5949155Z %47 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:21.5949442Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:21.5949751Z %49 = tt.expand_dims %48 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:21.5950065Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:05:21.5950377Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:05:21.5950660Z %52 = arith.cmpi eq, %49, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:21.5950910Z %53 = tt.broadcast %52 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:05:21.5951175Z %54 = tt.broadcast %50 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:05:21.5951455Z %55 = arith.select %53, %54, %cst : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:05:21.5951750Z %56 = arith.cmpi eq, %49, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:21.5951998Z %57 = tt.broadcast %51 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:05:21.5952254Z %58 = tt.broadcast %56 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:05:21.5952525Z %59 = arith.select %58, %57, %55 : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:05:21.5952800Z %60 = tt.reshape %59 : tensor<64x2x16xi8> -> tensor<128x16xi8> 2026-02-21T09:05:21.5953056Z %61 = arith.sitofp %60 : tensor<128x16xi8> to tensor<128x16xf32> 2026-02-21T09:05:21.5953411Z %62 = tt.dot %34, %61, %18, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x16xf32> -> tensor<64x16xf32> 2026-02-21T09:05:21.5953763Z %63 = arith.truncf %62 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T09:05:21.5954086Z 
tt.descriptor_store %0[%10, %14], %63 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T09:05:21.5954358Z tt.return 2026-02-21T09:05:21.5954494Z } 2026-02-21T09:05:21.5954654Z } 2026-02-21T09:05:21.5954724Z 2026-02-21T09:05:21.5954776Z {-# 2026-02-21T09:05:21.5954913Z external_resources: { 2026-02-21T09:05:21.5955069Z mlir_reproducer: { 2026-02-21T09:05:21.5959685Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, 
triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:21.5964475Z disable_threading: false, 2026-02-21T09:05:21.5964655Z verify_each: true 2026-02-21T09:05:21.5964799Z } 2026-02-21T09:05:21.5964929Z } 2026-02-21T09:05:21.5965043Z #-} 2026-02-21T09:05:21.5965484Z /tmp/torchinductor_root/js/cjsngyt5ademlae42hzmuwjdxudk5yuhkdj7cs4omwpmwpnxpmyw.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:21.5966530Z /tmp/torchinductor_root/js/cjsngyt5ademlae42hzmuwjdxudk5yuhkdj7cs4omwpmwpnxpmyw.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:21.5967366Z [287s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:21.5968448Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 16], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:21.5969419Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:21.5969676Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:21.6517410Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 14.7 configs/s 2026-02-21T09:05:22.8905471Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 795.2 2026-02-21T09:05:22.8907125Z configs/s 2026-02-21T09:05:22.9542455Z [288s] Generation 14 complete: 2026-02-21T09:05:22.9547440Z error=6 2026-02-21T09:05:22.9551996Z ok=30 2026-02-21T09:05:22.9553697Z min=0.1076 2026-02-21T09:05:22.9553833Z mid=0.2857 2026-02-21T09:05:22.9553998Z max=13.5004 2026-02-21T09:05:22.9554152Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:05:22.9558339Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:05:22.9562782Z 'l2_groupings': [1], 2026-02-21T09:05:22.9564345Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:05:22.9564570Z 'loop_orders': [[0, 1]], 2026-02-21T09:05:22.9564741Z 'num_stages': 3, 2026-02-21T09:05:22.9564884Z 'num_warps': 1, 2026-02-21T09:05:22.9565039Z 'pid_type': 'flat', 2026-02-21T09:05:22.9565222Z 'range_flattens': [None, None], 2026-02-21T09:05:22.9565689Z 'range_multi_buffers': [None, True], 2026-02-21T09:05:22.9565886Z 'range_num_stages': [0, 0], 2026-02-21T09:05:22.9566055Z 'range_unroll_factors': [0, 0], 2026-02-21T09:05:22.9566302Z 'range_warp_specializes': [None, None]} 2026-02-21T09:05:22.9578374Z [288s] Fitting surrogate: 1107 points, 1107 targets 2026-02-21T09:05:23.5333131Z [288s] Generation 15 starting: 30 neighbors, 2 active search path(s) 2026-02-21T09:05:27.4041206Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 5.8 configs/s 2026-02-21T09:05:28.6243382Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:28.6244298Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:28.6244549Z ^ 2026-02-21T09:05:28.6246811Z /tmp/torchinductor_root/r6/cr6zawkrb3jqe74ygtcvlbclf2ez5o4hhbavk7piok4gh2wqwqry.py:78:36: note: called from 
2026-02-21T09:05:28.6247213Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:28.6247421Z ^ 2026-02-21T09:05:28.6247809Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:28.6248281Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:28.6248525Z ^ 2026-02-21T09:05:28.6250223Z module { 2026-02-21T09:05:28.6250900Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:28.6251524Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:05:28.6251858Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:05:28.6252051Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:28.6252263Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:28.6252476Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:28.6252726Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:28.6252958Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:28.6253192Z %cst_3 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:05:28.6253430Z %cst_4 = arith.constant dense<7168> : tensor<16x1xi32> 2026-02-21T09:05:28.6253662Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:28.6253876Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:28.6254096Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:28.6254331Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:28.6254511Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:28.6254692Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:28.6254875Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:28.6255079Z %0 = tt.get_program_id x : i32 2026-02-21T09:05:28.6255513Z %1 = arith.divsi %0, %c4_i32 : i32 
2026-02-21T09:05:28.6255691Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:05:28.6255879Z %3 = arith.subi %c224_i32, %2 : i32 2026-02-21T09:05:28.6256059Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:05:28.6256247Z %5 = arith.remsi %0, %c4_i32 : i32 2026-02-21T09:05:28.6256428Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:05:28.6256617Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:05:28.6256802Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:05:28.6257059Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:05:28.6257307Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:28.6257574Z %11 = tt.splat %9 : i32 -> tensor<32xi32> 2026-02-21T09:05:28.6257793Z %12 = arith.addi %11, %10 : tensor<32xi32> 2026-02-21T09:05:28.6257990Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:28.6258233Z %14 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:28.6258482Z %15 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:05:28.6258696Z %16 = arith.addi %15, %14 : tensor<64xi32> 2026-02-21T09:05:28.6259004Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:05:28.6259213Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:05:28.6259551Z %17 = scf.for %arg3 = %c0_i32 to %c4080_i32 step %c48_i32 iter_args(%arg4 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:28.6259952Z %71 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:28.6260225Z %72 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T09:05:28.6260464Z %73 = arith.addi %72, %71 : tensor<16xi32> 2026-02-21T09:05:28.6260679Z %74 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:28.6260882Z %75 = tt.splat %74 : i32 -> tensor<32xi32> 2026-02-21T09:05:28.6261093Z %76 = arith.addi %75, %10 : tensor<32xi32> 2026-02-21T09:05:28.6261406Z %77 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:28.6261740Z %78 = arith.muli %77, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:28.6262014Z %79 
= tt.expand_dims %76 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:28.6262328Z %80 = tt.broadcast %78 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6262612Z %81 = tt.broadcast %79 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6262859Z %82 = arith.addi %80, %81 : tensor<64x32xi32> 2026-02-21T09:05:28.6263129Z %83 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6263430Z %84 = tt.addptr %83, %82 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:28.6263756Z %85 = tt.load %84 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6264060Z %86 = arith.extf %85 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:28.6264363Z %87 = tt.expand_dims %73 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:28.6264660Z %88 = arith.muli %87, %cst_4 : tensor<16x1xi32> 2026-02-21T09:05:28.6264910Z %89 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:28.6265195Z %90 = tt.broadcast %88 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:05:28.6265449Z %91 = tt.broadcast %89 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:05:28.6265685Z %92 = arith.addi %90, %91 : tensor<16x32xi32> 2026-02-21T09:05:28.6265921Z %93 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:05:28.6266196Z %94 = tt.addptr %93, %92 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:05:28.6266499Z %95 = tt.load %94 evictionPolicy = evict_first : tensor<16x32x!tt.ptr> 2026-02-21T09:05:28.6266761Z %96 = arith.shli %95, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6266980Z %97 = arith.shrsi %96, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6267241Z %98 = arith.shrsi %95, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6267496Z %99 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:28.6267797Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 
2026-02-21T09:05:28.6268115Z %101 = tt.expand_dims %100 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:28.6268454Z %102 = tt.expand_dims %97 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:28.6268822Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:28.6269115Z %104 = arith.cmpi eq, %101, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:28.6269373Z %105 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:28.6269650Z %106 = tt.broadcast %102 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:28.6269950Z %107 = arith.select %105, %106, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:28.6274907Z %108 = arith.cmpi eq, %101, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:28.6280434Z %109 = tt.broadcast %103 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:28.6280735Z %110 = tt.broadcast %108 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:28.6281025Z %111 = arith.select %110, %109, %107 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:28.6281317Z %112 = tt.reshape %111 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:05:28.6281603Z %113 = arith.sitofp %112 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:05:28.6281971Z %114 = tt.dot %86, %113, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:28.6282293Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:28.6282495Z %115 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:05:28.6282692Z %116 = arith.addi %arg3, %115 : i32 2026-02-21T09:05:28.6282963Z %117 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:28.6283224Z %118 = tt.splat %116 : i32 -> tensor<16xi32> 2026-02-21T09:05:28.6283434Z %119 = arith.addi %118, %117 : tensor<16xi32> 2026-02-21T09:05:28.6283642Z %120 = arith.muli %116, %c2_i32 : i32 2026-02-21T09:05:28.6283840Z %121 = tt.splat %120 
: i32 -> tensor<32xi32> 2026-02-21T09:05:28.6284051Z %122 = arith.addi %121, %10 : tensor<32xi32> 2026-02-21T09:05:28.6284307Z %123 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:28.6284593Z %124 = arith.muli %123, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:28.6284867Z %125 = tt.expand_dims %122 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:28.6285168Z %126 = tt.broadcast %124 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6285441Z %127 = tt.broadcast %125 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6285682Z %128 = arith.addi %126, %127 : tensor<64x32xi32> 2026-02-21T09:05:28.6285940Z %129 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6286235Z %130 = tt.addptr %129, %128 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:28.6286561Z %131 = tt.load %130 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6286861Z %132 = arith.extf %131 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:28.6287149Z %133 = tt.expand_dims %119 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:28.6287431Z %134 = arith.muli %133, %cst_4 : tensor<16x1xi32> 2026-02-21T09:05:28.6287693Z %135 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:28.6287988Z %136 = tt.broadcast %134 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:05:28.6288248Z %137 = tt.broadcast %135 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:05:28.6288494Z %138 = arith.addi %136, %137 : tensor<16x32xi32> 2026-02-21T09:05:28.6288772Z %139 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:05:28.6289050Z %140 = tt.addptr %139, %138 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:05:28.6289358Z %141 = tt.load %140 evictionPolicy = evict_first : tensor<16x32x!tt.ptr> 2026-02-21T09:05:28.6289622Z %142 = arith.shli %141, %cst_3 : tensor<16x32xi8> 
2026-02-21T09:05:28.6289844Z %143 = arith.shrsi %142, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6290132Z %144 = arith.shrsi %141, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6290377Z %145 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:28.6290674Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:28.6290990Z %147 = tt.expand_dims %146 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:28.6291308Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:28.6291665Z %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:28.6291975Z %150 = arith.cmpi eq, %147, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:28.6292239Z %151 = tt.broadcast %150 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:28.6292510Z %152 = tt.broadcast %148 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:28.6292802Z %153 = arith.select %151, %152, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:28.6293083Z %154 = arith.cmpi eq, %147, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:28.6293334Z %155 = tt.broadcast %149 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:28.6293608Z %156 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:28.6293886Z %157 = arith.select %156, %155, %153 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:28.6294205Z %158 = tt.reshape %157 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:05:28.6294479Z %159 = arith.sitofp %158 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:05:28.6294842Z %160 = tt.dot %132, %159, %114, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:28.6295173Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:28.6295367Z %161 = arith.muli %c16_i32, %c2_i32_7 : i32 2026-02-21T09:05:28.6295569Z %162 = arith.addi %arg3, %161 
: i32 2026-02-21T09:05:28.6295802Z %163 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:28.6296057Z %164 = tt.splat %162 : i32 -> tensor<16xi32> 2026-02-21T09:05:28.6296270Z %165 = arith.addi %164, %163 : tensor<16xi32> 2026-02-21T09:05:28.6296465Z %166 = arith.muli %162, %c2_i32 : i32 2026-02-21T09:05:28.6296662Z %167 = tt.splat %166 : i32 -> tensor<32xi32> 2026-02-21T09:05:28.6296862Z %168 = arith.addi %167, %10 : tensor<32xi32> 2026-02-21T09:05:28.6297120Z %169 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:28.6297386Z %170 = arith.muli %169, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:28.6297649Z %171 = tt.expand_dims %168 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:28.6297948Z %172 = tt.broadcast %170 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6298211Z %173 = tt.broadcast %171 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6298457Z %174 = arith.addi %172, %173 : tensor<64x32xi32> 2026-02-21T09:05:28.6298700Z %175 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6298998Z %176 = tt.addptr %175, %174 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:28.6299310Z %177 = tt.load %176 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6299611Z %178 = arith.extf %177 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:28.6299927Z %179 = tt.expand_dims %165 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:28.6300194Z %180 = arith.muli %179, %cst_4 : tensor<16x1xi32> 2026-02-21T09:05:28.6300466Z %181 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:28.6300764Z %182 = tt.broadcast %180 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:05:28.6301040Z %183 = tt.broadcast %181 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:05:28.6301323Z %184 = arith.addi %182, %183 : tensor<16x32xi32> 
2026-02-21T09:05:28.6301583Z %185 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:05:28.6301884Z %186 = tt.addptr %185, %184 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:05:28.6302202Z %187 = tt.load %186 evictionPolicy = evict_first : tensor<16x32x!tt.ptr> 2026-02-21T09:05:28.6302500Z %188 = arith.shli %187, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6302732Z %189 = arith.shrsi %188, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6302972Z %190 = arith.shrsi %187, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6303291Z %191 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:28.6303604Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:28.6303939Z %193 = tt.expand_dims %192 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:28.6304278Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:28.6304621Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:28.6304917Z %196 = arith.cmpi eq, %193, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:28.6305186Z %197 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:28.6305514Z %198 = tt.broadcast %194 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:28.6305813Z %199 = arith.select %197, %198, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:28.6306100Z %200 = arith.cmpi eq, %193, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:28.6306361Z %201 = tt.broadcast %195 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:28.6306647Z %202 = tt.broadcast %200 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:28.6306944Z %203 = arith.select %202, %201, %199 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:28.6307236Z %204 = tt.reshape %203 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:05:28.6307518Z %205 = arith.sitofp %204 : 
tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:05:28.6307883Z %206 = tt.dot %178, %205, %160, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:28.6308221Z scf.yield %206 : tensor<64x32xf32> 2026-02-21T09:05:28.6308417Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:28.6308661Z %18 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:28.6308933Z %19 = tt.splat %c4080_i32 : i32 -> tensor<16xi32> 2026-02-21T09:05:28.6309151Z %20 = arith.addi %19, %18 : tensor<16xi32> 2026-02-21T09:05:28.6309362Z %21 = arith.muli %c4080_i32, %c2_i32 : i32 2026-02-21T09:05:28.6309562Z %22 = tt.splat %21 : i32 -> tensor<32xi32> 2026-02-21T09:05:28.6309766Z %23 = arith.addi %22, %10 : tensor<32xi32> 2026-02-21T09:05:28.6310020Z %24 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:28.6310282Z %25 = arith.muli %24, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:28.6310536Z %26 = tt.expand_dims %23 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:28.6310811Z %27 = tt.broadcast %25 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6311067Z %28 = tt.broadcast %26 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6311295Z %29 = arith.addi %27, %28 : tensor<64x32xi32> 2026-02-21T09:05:28.6311583Z %30 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6311861Z %31 = tt.addptr %30, %29 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:28.6312160Z %32 = tt.load %31 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6312451Z %33 = arith.extf %32 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:28.6312726Z %34 = tt.expand_dims %20 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:28.6313042Z %35 = arith.muli %34, %cst_4 : tensor<16x1xi32> 2026-02-21T09:05:28.6313286Z %36 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 
2026-02-21T09:05:28.6313564Z %37 = tt.broadcast %35 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:05:28.6313817Z %38 = tt.broadcast %36 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:05:28.6314041Z %39 = arith.addi %37, %38 : tensor<16x32xi32> 2026-02-21T09:05:28.6314279Z %40 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:05:28.6314540Z %41 = tt.addptr %40, %39 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:05:28.6314875Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<16x32x!tt.ptr> 2026-02-21T09:05:28.6315131Z %43 = arith.shli %42, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6315352Z %44 = arith.shrsi %43, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6315579Z %45 = arith.shrsi %42, %cst_3 : tensor<16x32xi8> 2026-02-21T09:05:28.6315823Z %46 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:28.6316121Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:28.6316429Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:28.6316784Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:28.6317105Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:05:28.6317384Z %51 = arith.cmpi eq, %48, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:28.6317636Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:28.6317897Z %53 = tt.broadcast %49 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:28.6318177Z %54 = arith.select %52, %53, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:28.6318435Z %55 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:28.6318686Z %56 = tt.broadcast %50 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:05:28.6318951Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:05:28.6319214Z %58 = 
arith.select %57, %56, %54 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:05:28.6319488Z %59 = tt.reshape %58 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:05:28.6319742Z %60 = arith.sitofp %59 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:05:28.6320087Z %61 = tt.dot %33, %60, %17, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:28.6320431Z %62 = arith.truncf %61 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:28.6320718Z %63 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:28.6320982Z %64 = arith.muli %63, %cst_0 : tensor<64x1xi32> 2026-02-21T09:05:28.6321230Z %65 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:28.6321512Z %66 = tt.broadcast %64 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6321790Z %67 = tt.broadcast %65 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:28.6322026Z %68 = arith.addi %66, %67 : tensor<64x32xi32> 2026-02-21T09:05:28.6322273Z %69 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6322554Z %70 = tt.addptr %69, %68 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:28.6322851Z tt.store %70, %62 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:28.6323041Z tt.return 2026-02-21T09:05:28.6323177Z } 2026-02-21T09:05:28.6323299Z } 2026-02-21T09:05:28.6323378Z 2026-02-21T09:05:28.6323429Z {-# 2026-02-21T09:05:28.6323557Z external_resources: { 2026-02-21T09:05:28.6323721Z mlir_reproducer: { 2026-02-21T09:05:28.6328080Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:28.6332687Z disable_threading: false, 2026-02-21T09:05:28.6332865Z verify_each: true 2026-02-21T09:05:28.6333011Z } 2026-02-21T09:05:28.6333138Z } 2026-02-21T09:05:28.6333255Z #-} 2026-02-21T09:05:28.6333689Z /tmp/torchinductor_root/r6/cr6zawkrb3jqe74ygtcvlbclf2ez5o4hhbavk7piok4gh2wqwqry.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:28.6334727Z /tmp/torchinductor_root/r6/cr6zawkrb3jqe74ygtcvlbclf2ez5o4hhbavk7piok4gh2wqwqry.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, 
please share the reproducer above with Triton project.` 2026-02-21T09:05:28.6335564Z [294s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:28.6336626Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:28.6337567Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:28.6337821Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:28.9964833Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:28.9965435Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:28.9969195Z ^ 2026-02-21T09:05:28.9973520Z /tmp/torchinductor_root/72/c725sir7jhtpivvrzoruzf4hks6ghwuwjosfcv6sy2suorsafa3q.py:94:40: note: called from 2026-02-21T09:05:28.9977901Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:28.9982716Z ^ 2026-02-21T09:05:28.9983250Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:28.9987974Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:28.9990022Z ^ 2026-02-21T09:05:28.9990559Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T09:05:28.9995110Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:28.9996871Z %cst = arith.constant 
dense<0> : tensor<32x2x32xi8> 2026-02-21T09:05:29.0000186Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:29.0000425Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:29.0000636Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:29.0000833Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:29.0001212Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:29.0001469Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:29.0001881Z %cst_2 = arith.constant dense<4> : tensor<32x32xi8> 2026-02-21T09:05:29.0002124Z %cst_3 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:05:29.0002364Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:29.0002580Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:29.0002801Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:29.0003039Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:29.0003227Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:29.0003413Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:29.0003666Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:29.0003854Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:29.0004044Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:29.0004359Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:29.0004692Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:29.0004882Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:05:29.0005064Z %3 = arith.minsi %2, %c224_i32 : i32 2026-02-21T09:05:29.0005272Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:05:29.0005480Z %4 = arith.divsi %arg3, %c4_i32 : i32 2026-02-21T09:05:29.0005674Z %5 = arith.muli %4, %c4_i32 : i32 2026-02-21T09:05:29.0005853Z %6 = arith.subi %c224_i32, %5 : i32 2026-02-21T09:05:29.0006040Z %7 = arith.minsi %6, %c4_i32 : i32 2026-02-21T09:05:29.0006220Z %8 = arith.remsi %arg3, %c4_i32 : i32 2026-02-21T09:05:29.0006407Z %9 = arith.remsi 
%8, %7 : i32 2026-02-21T09:05:29.0006583Z %10 = arith.addi %5, %9 : i32 2026-02-21T09:05:29.0006751Z %11 = arith.divsi %8, %7 : i32 2026-02-21T09:05:29.0006932Z %12 = arith.muli %10, %c32_i32 : i32 2026-02-21T09:05:29.0007164Z %13 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:29.0007424Z %14 = tt.splat %12 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.0007623Z %15 = arith.addi %14, %13 : tensor<32xi32> 2026-02-21T09:05:29.0007819Z %16 = arith.muli %11, %c64_i32 : i32 2026-02-21T09:05:29.0008046Z %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:29.0008309Z %18 = tt.splat %16 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.0008498Z %19 = arith.addi %18, %17 : tensor<64xi32> 2026-02-21T09:05:29.0008695Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:29.0008879Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:05:29.0009203Z %20 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c96_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:29.0009585Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.0009807Z %24 = arith.addi %23, %13 : tensor<32xi32> 2026-02-21T09:05:29.0010015Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:29.0010228Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.0010443Z %27 = arith.addi %26, %17 : tensor<64xi32> 2026-02-21T09:05:29.0010709Z %28 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.0011034Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:29.0011304Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.0011657Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.0011946Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.0012200Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:05:29.0012472Z %34 = tt.splat %arg0 : !tt.ptr -> 
tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.0012801Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.0013134Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.0013442Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:29.0013748Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:29.0014037Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:29.0014301Z %40 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:29.0014605Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.0014875Z %42 = tt.broadcast %40 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.0015193Z %43 = arith.addi %41, %42 : tensor<32x32xi32> 2026-02-21T09:05:29.0015447Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.0015736Z %45 = tt.addptr %44, %43 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:29.0015994Z %46 = tt.load %45 : tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.0016221Z %47 = arith.shli %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0016447Z %48 = arith.shrsi %47, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0016681Z %49 = arith.shrsi %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0016932Z %50 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.0017246Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.0017575Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.0017910Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.0018255Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.0018547Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.0018819Z 
%56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.0019093Z %57 = tt.broadcast %53 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.0019379Z %58 = arith.select %56, %57, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.0019653Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:29.0019902Z %60 = tt.broadcast %54 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.0020173Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.0020442Z %62 = arith.select %61, %60, %58 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.0020721Z %63 = tt.reshape %62 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:29.0021015Z %64 = arith.sitofp %63 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:29.0021369Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:29.0021746Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:05:29.0021947Z %66 = arith.muli %c32_i32, %c1_i32_6 : i32 2026-02-21T09:05:29.0022150Z %67 = arith.addi %arg4, %66 : i32 2026-02-21T09:05:29.0022366Z %68 = tt.splat %67 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.0022571Z %69 = arith.addi %68, %13 : tensor<32xi32> 2026-02-21T09:05:29.0022765Z %70 = arith.muli %67, %c2_i32 : i32 2026-02-21T09:05:29.0025897Z %71 = tt.splat %70 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.0027447Z %72 = arith.addi %71, %17 : tensor<64xi32> 2026-02-21T09:05:29.0030718Z %73 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.0030993Z %74 = arith.muli %73, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:29.0031246Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.0031603Z %76 = tt.broadcast %74 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.0031865Z %77 = tt.broadcast %75 : tensor<1x64xi32> -> tensor<64x64xi32> 
2026-02-21T09:05:29.0032095Z %78 = arith.addi %76, %77 : tensor<64x64xi32> 2026-02-21T09:05:29.0032341Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.0032624Z %80 = tt.addptr %79, %78 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.0032927Z %81 = tt.load %80 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.0033212Z %82 = arith.extf %81 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:29.0033520Z %83 = tt.expand_dims %69 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:29.0033793Z %84 = arith.muli %83, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:29.0034044Z %85 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:29.0034331Z %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.0034579Z %87 = tt.broadcast %85 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.0034812Z %88 = arith.addi %86, %87 : tensor<32x32xi32> 2026-02-21T09:05:29.0035052Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.0035319Z %90 = tt.addptr %89, %88 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:29.0035570Z %91 = tt.load %90 : tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.0035775Z %92 = arith.shli %91, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0035993Z %93 = arith.shrsi %92, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0036206Z %94 = arith.shrsi %91, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0036454Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.0036747Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.0037053Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.0037369Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.0037681Z %99 = tt.expand_dims %94 {axis = 1 : i32} : 
tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.0037967Z %100 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.0038223Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.0038508Z %102 = tt.broadcast %98 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.0038809Z %103 = arith.select %101, %102, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.0039108Z %104 = arith.cmpi eq, %97, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:29.0039370Z %105 = tt.broadcast %99 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.0039642Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.0039930Z %107 = arith.select %106, %105, %103 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.0040217Z %108 = tt.reshape %107 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:29.0040481Z %109 = arith.sitofp %108 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:29.0040868Z %110 = tt.dot %82, %109, %65, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:29.0041190Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:29.0041397Z %111 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:05:29.0041632Z %112 = arith.addi %arg4, %111 : i32 2026-02-21T09:05:29.0041838Z %113 = tt.splat %112 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.0042053Z %114 = arith.addi %113, %13 : tensor<32xi32> 2026-02-21T09:05:29.0042247Z %115 = arith.muli %112, %c2_i32 : i32 2026-02-21T09:05:29.0042472Z %116 = tt.splat %115 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.0042672Z %117 = arith.addi %116, %17 : tensor<64xi32> 2026-02-21T09:05:29.0042932Z %118 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.0043207Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:29.0043487Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 
2026-02-21T09:05:29.0043792Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.0044055Z %122 = tt.broadcast %120 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.0044306Z %123 = arith.addi %121, %122 : tensor<64x64xi32> 2026-02-21T09:05:29.0044594Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.0044895Z %125 = tt.addptr %124, %123 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.0045210Z %126 = tt.load %125 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.0045515Z %127 = arith.extf %126 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:29.0045809Z %128 = tt.expand_dims %114 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:29.0046082Z %129 = arith.muli %128, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:29.0046347Z %130 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:29.0046635Z %131 = tt.broadcast %129 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.0046903Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.0047146Z %133 = arith.addi %131, %132 : tensor<32x32xi32> 2026-02-21T09:05:29.0047383Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.0047671Z %135 = tt.addptr %134, %133 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:29.0047930Z %136 = tt.load %135 : tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.0048151Z %137 = arith.shli %136, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0048372Z %138 = arith.shrsi %137, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0048599Z %139 = arith.shrsi %136, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0048853Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.0049147Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.0049470Z %142 = tt.expand_dims %141 {axis = 2 : i32} : 
tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.0049800Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.0050170Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.0050464Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.0050720Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.0051001Z %147 = tt.broadcast %143 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.0051290Z %148 = arith.select %146, %147, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.0051601Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:29.0051877Z %150 = tt.broadcast %144 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.0052156Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.0052445Z %152 = arith.select %151, %150, %148 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.0052729Z %153 = tt.reshape %152 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:29.0053008Z %154 = arith.sitofp %153 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:29.0053405Z %155 = tt.dot %127, %154, %110, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:29.0053738Z scf.yield %155 : tensor<64x32xf32> 2026-02-21T09:05:29.0053934Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:29.0054262Z %21 = scf.for %arg4 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg5 = %20) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:29.0054613Z %23 = tt.splat %arg4 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.0054831Z %24 = arith.addi %23, %13 : tensor<32xi32> 2026-02-21T09:05:29.0055046Z %25 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:29.0055247Z %26 = tt.splat %25 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.0055462Z %27 = arith.addi %26, %17 : tensor<64xi32> 
2026-02-21T09:05:29.0055749Z %28 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.0056039Z %29 = arith.muli %28, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:29.0056315Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.0056613Z %31 = tt.broadcast %29 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.0056892Z %32 = tt.broadcast %30 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.0057139Z %33 = arith.addi %31, %32 : tensor<64x64xi32> 2026-02-21T09:05:29.0057399Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.0057703Z %35 = tt.addptr %34, %33 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.0058018Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.0058330Z %37 = arith.extf %36 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:29.0058625Z %38 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:29.0058909Z %39 = arith.muli %38, %cst_3 : tensor<32x1xi32> 2026-02-21T09:05:29.0059176Z %40 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:29.0059475Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.0059748Z %42 = tt.broadcast %40 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.0059992Z %43 = arith.addi %41, %42 : tensor<32x32xi32> 2026-02-21T09:05:29.0060245Z %44 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.0060529Z %45 = tt.addptr %44, %43 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:29.0060796Z %46 = tt.load %45 : tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.0061019Z %47 = arith.shli %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0061251Z %48 = arith.shrsi %47, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0061481Z %49 = arith.shrsi %46, %cst_2 : tensor<32x32xi8> 2026-02-21T09:05:29.0061798Z %50 = 
tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.0062109Z %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.0062430Z %52 = tt.expand_dims %51 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.0062769Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.0063110Z %54 = tt.expand_dims %49 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.0063424Z %55 = arith.cmpi eq, %52, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.0063695Z %56 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.0063966Z %57 = tt.broadcast %53 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.0064253Z %58 = arith.select %56, %57, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.0064524Z %59 = arith.cmpi eq, %52, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:29.0064780Z %60 = tt.broadcast %54 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.0065074Z %61 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.0065344Z %62 = arith.select %61, %60, %58 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.0065620Z %63 = tt.reshape %62 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:29.0065872Z %64 = arith.sitofp %63 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:29.0066223Z %65 = tt.dot %37, %64, %arg5, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:29.0066545Z scf.yield %65 : tensor<64x32xf32> 2026-02-21T09:05:29.0066762Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:05:29.0067049Z %22 = arith.truncf %21 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:29.0067370Z tt.descriptor_store %0[%16, %12], %22 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:05:29.0067695Z } {tt.loop_unroll_factor = 1 : i32, 
tt.num_stages = 3 : i32} 2026-02-21T09:05:29.0067907Z tt.return 2026-02-21T09:05:29.0068042Z } 2026-02-21T09:05:29.0068163Z } 2026-02-21T09:05:29.0068240Z 2026-02-21T09:05:29.0068291Z {-# 2026-02-21T09:05:29.0068425Z external_resources: { 2026-02-21T09:05:29.0068581Z mlir_reproducer: { 2026-02-21T09:05:29.0072986Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:29.0077376Z disable_threading: false, 2026-02-21T09:05:29.0077556Z verify_each: true 2026-02-21T09:05:29.0077699Z } 2026-02-21T09:05:29.0077824Z } 2026-02-21T09:05:29.0077937Z #-} 2026-02-21T09:05:29.0078362Z /tmp/torchinductor_root/72/c725sir7jhtpivvrzoruzf4hks6ghwuwjosfcv6sy2suorsafa3q.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:29.0079406Z /tmp/torchinductor_root/72/c725sir7jhtpivvrzoruzf4hks6ghwuwjosfcv6sy2suorsafa3q.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:29.0080233Z [294s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:29.0081438Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[False, None]), static_shapes=True) 2026-02-21T09:05:29.0082530Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:29.0082780Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:29.2236812Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:29.2241449Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:29.2242363Z ^ 2026-02-21T09:05:29.2242848Z /tmp/torchinductor_root/kv/ckvkb3zlfisk4hapg3qm6ie7cskmemxapxuxnneadn32pbzi76wi.py:78:36: note: called from 2026-02-21T09:05:29.2243322Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:29.2248633Z ^ 2026-02-21T09:05:29.2253385Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:29.2256063Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:29.2260009Z ^ 2026-02-21T09:05:29.2264082Z module { 2026-02-21T09:05:29.2264757Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:29.2265377Z %cst = arith.constant dense<0> : tensor<32x2x32xi8> 2026-02-21T09:05:29.2267662Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:29.2267974Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:29.2268238Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:29.2272201Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:29.2276385Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:29.2279725Z %cst_3 = arith.constant dense<4> : tensor<32x32xi8> 2026-02-21T09:05:29.2283668Z %cst_4 = arith.constant dense<7168> : tensor<32x1xi32> 2026-02-21T09:05:29.2285710Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:29.2285976Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:29.2286225Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:29.2286470Z 
%c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:29.2286655Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:29.2286853Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:29.2287225Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:29.2287420Z %0 = tt.get_program_id x : i32 2026-02-21T09:05:29.2287599Z %1 = arith.divsi %0, %c4_i32 : i32 2026-02-21T09:05:29.2287789Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T09:05:29.2287967Z %3 = arith.subi %c224_i32, %2 : i32 2026-02-21T09:05:29.2288152Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T09:05:29.2288333Z %5 = arith.remsi %0, %c4_i32 : i32 2026-02-21T09:05:29.2288507Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:05:29.2288738Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:05:29.2288906Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:05:29.2289084Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:05:29.2289321Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:29.2289594Z %11 = tt.splat %9 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.2289806Z %12 = arith.addi %11, %10 : tensor<32xi32> 2026-02-21T09:05:29.2290020Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:29.2290261Z %14 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:29.2290548Z %15 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.2290760Z %16 = arith.addi %15, %14 : tensor<64xi32> 2026-02-21T09:05:29.2290963Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:29.2291160Z %c96_i32 = arith.constant 96 : i32 2026-02-21T09:05:29.2291491Z %17 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c96_i32 iter_args(%arg4 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:29.2291937Z %28 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.2292161Z %29 = arith.addi %28, %10 : tensor<32xi32> 2026-02-21T09:05:29.2292366Z %30 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:29.2292574Z %31 = tt.splat %30 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.2292780Z %32 = 
arith.addi %31, %14 : tensor<64xi32> 2026-02-21T09:05:29.2293094Z %33 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.2293381Z %34 = arith.muli %33, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:29.2293662Z %35 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.2293969Z %36 = tt.broadcast %34 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.2294238Z %37 = tt.broadcast %35 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.2294493Z %38 = arith.addi %36, %37 : tensor<64x64xi32> 2026-02-21T09:05:29.2294756Z %39 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.2295065Z %40 = tt.addptr %39, %38 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.2295389Z %41 = tt.load %40 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.2295702Z %42 = arith.extf %41 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:29.2296005Z %43 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:29.2296281Z %44 = arith.muli %43, %cst_4 : tensor<32x1xi32> 2026-02-21T09:05:29.2296553Z %45 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:29.2296840Z %46 = tt.broadcast %44 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.2297110Z %47 = tt.broadcast %45 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.2297357Z %48 = arith.addi %46, %47 : tensor<32x32xi32> 2026-02-21T09:05:29.2297602Z %49 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.2297893Z %50 = tt.addptr %49, %48 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:29.2298151Z %51 = tt.load %50 : tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.2298377Z %52 = arith.shli %51, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2298598Z %53 = arith.shrsi %52, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2298862Z %54 = arith.shrsi %51, %cst_3 : tensor<32x32xi8> 
2026-02-21T09:05:29.2299122Z %55 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.2299417Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.2299727Z %57 = tt.expand_dims %56 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.2300088Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.2300406Z %59 = tt.expand_dims %54 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.2300740Z %60 = arith.cmpi eq, %57, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:29.2300996Z %61 = tt.broadcast %60 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.2301269Z %62 = tt.broadcast %58 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.2301599Z %63 = arith.select %61, %62, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.2301866Z %64 = arith.cmpi eq, %57, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.2302129Z %65 = tt.broadcast %59 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.2302423Z %66 = tt.broadcast %64 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.2302701Z %67 = arith.select %66, %65, %63 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.2302976Z %68 = tt.reshape %67 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:29.2303228Z %69 = arith.sitofp %68 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:29.2303592Z %70 = tt.dot %42, %69, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:29.2303913Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:29.2304112Z %71 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T09:05:29.2304298Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T09:05:29.2304516Z %73 = tt.splat %72 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.2304723Z %74 = arith.addi %73, %10 : tensor<32xi32> 2026-02-21T09:05:29.2304913Z %75 = arith.muli %72, 
%c2_i32 : i32 2026-02-21T09:05:29.2305107Z %76 = tt.splat %75 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.2305300Z %77 = arith.addi %76, %14 : tensor<64xi32> 2026-02-21T09:05:29.2305551Z %78 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.2305812Z %79 = arith.muli %78, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:29.2306072Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.2306356Z %81 = tt.broadcast %79 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.2306610Z %82 = tt.broadcast %80 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.2306844Z %83 = arith.addi %81, %82 : tensor<64x64xi32> 2026-02-21T09:05:29.2307083Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.2307373Z %85 = tt.addptr %84, %83 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.2307669Z %86 = tt.load %85 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.2307960Z %87 = arith.extf %86 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:29.2308239Z %88 = tt.expand_dims %74 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:29.2308496Z %89 = arith.muli %88, %cst_4 : tensor<32x1xi32> 2026-02-21T09:05:29.2308749Z %90 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:29.2309022Z %91 = tt.broadcast %89 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.2309280Z %92 = tt.broadcast %90 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.2309510Z %93 = arith.addi %91, %92 : tensor<32x32xi32> 2026-02-21T09:05:29.2309741Z %94 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.2310018Z %95 = tt.addptr %94, %93 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:29.2310294Z %96 = tt.load %95 : tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.2310505Z %97 = arith.shli %96, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2310718Z %98 = 
arith.shrsi %97, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2310934Z %99 = arith.shrsi %96, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2311185Z %100 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.2311478Z %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.2311854Z %102 = tt.expand_dims %101 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.2312179Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.2312507Z %104 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.2312793Z %105 = arith.cmpi eq, %102, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:29.2313062Z %106 = tt.broadcast %105 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.2313368Z %107 = tt.broadcast %103 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.2313659Z %108 = arith.select %106, %107, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.2313948Z %109 = arith.cmpi eq, %102, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.2314207Z %110 = tt.broadcast %104 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.2314490Z %111 = tt.broadcast %109 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.2314776Z %112 = arith.select %111, %110, %108 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.2315076Z %113 = tt.reshape %112 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:29.2315348Z %114 = arith.sitofp %113 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:29.2315724Z %115 = tt.dot %87, %114, %70, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:29.2316047Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:29.2316243Z %116 = arith.muli %c32_i32, %c2_i32_7 : i32 2026-02-21T09:05:29.2316447Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T09:05:29.2316646Z %118 = 
tt.splat %117 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.2316850Z %119 = arith.addi %118, %10 : tensor<32xi32> 2026-02-21T09:05:29.2317052Z %120 = arith.muli %117, %c2_i32 : i32 2026-02-21T09:05:29.2317244Z %121 = tt.splat %120 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.2317447Z %122 = arith.addi %121, %14 : tensor<64xi32> 2026-02-21T09:05:29.2317697Z %123 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.2317977Z %124 = arith.muli %123, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:29.2318237Z %125 = tt.expand_dims %122 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.2318538Z %126 = tt.broadcast %124 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.2318810Z %127 = tt.broadcast %125 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.2319053Z %128 = arith.addi %126, %127 : tensor<64x64xi32> 2026-02-21T09:05:29.2319303Z %129 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.2319593Z %130 = tt.addptr %129, %128 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.2319914Z %131 = tt.load %130 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.2320218Z %132 = arith.extf %131 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:29.2320501Z %133 = tt.expand_dims %119 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:29.2320775Z %134 = arith.muli %133, %cst_4 : tensor<32x1xi32> 2026-02-21T09:05:29.2321027Z %135 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:29.2321353Z %136 = tt.broadcast %134 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.2321644Z %137 = tt.broadcast %135 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.2321892Z %138 = arith.addi %136, %137 : tensor<32x32xi32> 2026-02-21T09:05:29.2322133Z %139 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.2322409Z %140 = tt.addptr %139, %138 : 
tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:29.2322670Z %141 = tt.load %140 : tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.2322912Z %142 = arith.shli %141, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2323136Z %143 = arith.shrsi %142, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2323354Z %144 = arith.shrsi %141, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2323605Z %145 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.2323908Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.2324221Z %147 = tt.expand_dims %146 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.2324577Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.2324902Z %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.2325193Z %150 = arith.cmpi eq, %147, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:29.2325451Z %151 = tt.broadcast %150 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.2325723Z %152 = tt.broadcast %148 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.2326016Z %153 = arith.select %151, %152, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.2326292Z %154 = arith.cmpi eq, %147, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.2326555Z %155 = tt.broadcast %149 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.2326852Z %156 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.2327141Z %157 = arith.select %156, %155, %153 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.2327435Z %158 = tt.reshape %157 : tensor<32x2x32xi8> -> tensor<64x32xi8> 2026-02-21T09:05:29.2327698Z %159 = arith.sitofp %158 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:29.2328063Z %160 = tt.dot %132, %159, %115, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 
2026-02-21T09:05:29.2328384Z scf.yield %160 : tensor<64x32xf32> 2026-02-21T09:05:29.2328593Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:29.2328912Z %18 = scf.for %arg3 = %c4032_i32 to %c4096_i32 step %c32_i32 iter_args(%arg4 = %17) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:29.2329236Z %28 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T09:05:29.2329448Z %29 = arith.addi %28, %10 : tensor<32xi32> 2026-02-21T09:05:29.2329643Z %30 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:29.2329843Z %31 = tt.splat %30 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.2330037Z %32 = arith.addi %31, %14 : tensor<64xi32> 2026-02-21T09:05:29.2330289Z %33 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.2330561Z %34 = arith.muli %33, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:29.2330811Z %35 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.2331099Z %36 = tt.broadcast %34 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.2331354Z %37 = tt.broadcast %35 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.2331624Z %38 = arith.addi %36, %37 : tensor<64x64xi32> 2026-02-21T09:05:29.2331864Z %39 = tt.splat %arg0 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.2332151Z %40 = tt.addptr %39, %38 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.2332461Z %41 = tt.load %40 evictionPolicy = evict_last : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.2332788Z %42 = arith.extf %41 : tensor<64x64xbf16> to tensor<64x64xf32> 2026-02-21T09:05:29.2333091Z %43 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T09:05:29.2333364Z %44 = arith.muli %43, %cst_4 : tensor<32x1xi32> 2026-02-21T09:05:29.2333632Z %45 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:29.2333921Z %46 = tt.broadcast %44 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.2334217Z %47 = tt.broadcast %45 : 
tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T09:05:29.2334462Z %48 = arith.addi %46, %47 : tensor<32x32xi32> 2026-02-21T09:05:29.2334705Z %49 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.2334992Z %50 = tt.addptr %49, %48 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T09:05:29.2335250Z %51 = tt.load %50 : tensor<32x32x!tt.ptr> 2026-02-21T09:05:29.2335476Z %52 = arith.shli %51, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2335697Z %53 = arith.shrsi %52, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2335951Z %54 = arith.shrsi %51, %cst_3 : tensor<32x32xi8> 2026-02-21T09:05:29.2336207Z %55 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.2336502Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.2336823Z %57 = tt.expand_dims %56 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.2337148Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.2337479Z %59 = tt.expand_dims %54 {axis = 1 : i32} : tensor<32x32xi8> -> tensor<32x1x32xi8> 2026-02-21T09:05:29.2337773Z %60 = arith.cmpi eq, %57, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:29.2338050Z %61 = tt.broadcast %60 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.2338336Z %62 = tt.broadcast %58 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.2338625Z %63 = arith.select %61, %62, %cst : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.2338907Z %64 = arith.cmpi eq, %57, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.2339162Z %65 = tt.broadcast %59 : tensor<32x1x32xi8> -> tensor<32x2x32xi8> 2026-02-21T09:05:29.2339446Z %66 = tt.broadcast %64 : tensor<1x2x1xi1> -> tensor<32x2x32xi1> 2026-02-21T09:05:29.2339733Z %67 = arith.select %66, %65, %63 : tensor<32x2x32xi1>, tensor<32x2x32xi8> 2026-02-21T09:05:29.2340018Z %68 = tt.reshape %67 : tensor<32x2x32xi8> -> tensor<64x32xi8> 
2026-02-21T09:05:29.2340293Z %69 = arith.sitofp %68 : tensor<64x32xi8> to tensor<64x32xf32> 2026-02-21T09:05:29.2340654Z %70 = tt.dot %42, %69, %arg4, inputPrecision = tf32 : tensor<64x64xf32> * tensor<64x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:29.2340984Z scf.yield %70 : tensor<64x32xf32> 2026-02-21T09:05:29.2341205Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:05:29.2341475Z %19 = arith.truncf %18 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:29.2341791Z %20 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.2342056Z %21 = arith.muli %20, %cst_0 : tensor<64x1xi32> 2026-02-21T09:05:29.2342318Z %22 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:29.2342602Z %23 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:29.2342868Z %24 = tt.broadcast %22 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:29.2343098Z %25 = arith.addi %23, %24 : tensor<64x32xi32> 2026-02-21T09:05:29.2343345Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:29.2343632Z %27 = tt.addptr %26, %25 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:29.2343887Z tt.store %27, %19 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:29.2344139Z tt.return 2026-02-21T09:05:29.2344269Z } 2026-02-21T09:05:29.2344396Z } 2026-02-21T09:05:29.2344465Z 2026-02-21T09:05:29.2344517Z {-# 2026-02-21T09:05:29.2344656Z external_resources: { 2026-02-21T09:05:29.2344814Z mlir_reproducer: { 2026-02-21T09:05:29.2349193Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:29.2353646Z disable_threading: false, 2026-02-21T09:05:29.2353815Z verify_each: true 2026-02-21T09:05:29.2353966Z } 2026-02-21T09:05:29.2354086Z } 2026-02-21T09:05:29.2354211Z #-} 2026-02-21T09:05:29.2354641Z /tmp/torchinductor_root/kv/ckvkb3zlfisk4hapg3qm6ie7cskmemxapxuxnneadn32pbzi76wi.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:29.2355688Z /tmp/torchinductor_root/kv/ckvkb3zlfisk4hapg3qm6ie7cskmemxapxuxnneadn32pbzi76wi.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, 
please share the reproducer above with Triton project.` 2026-02-21T09:05:29.2356528Z [294s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:29.2357577Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 64, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:29.2358501Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:29.2358763Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:29.4741348Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:29.4745794Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:29.4750997Z ^ 2026-02-21T09:05:29.4760052Z /tmp/torchinductor_root/dm/cdm4lh3nrg5rtev2j7nt5rhurm7wwwqm7pqlqudcpboflpu6gc6h.py:86:36: note: called from 2026-02-21T09:05:29.4764408Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:29.4764945Z ^ 2026-02-21T09:05:29.4765445Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:29.4765997Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:29.4766273Z ^ 2026-02-21T09:05:29.4766444Z module { 2026-02-21T09:05:29.4766987Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:29.4767625Z %cst = arith.constant dense<0> : tensor<64x2x64xi8> 
2026-02-21T09:05:29.4767865Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:05:29.4768064Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:29.4768266Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:29.4768488Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:29.4768723Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:29.4768973Z %cst_2 = arith.constant dense<4> : tensor<64x64xi8> 2026-02-21T09:05:29.4769248Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:29.4769497Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:29.4769723Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:29.4769970Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:05:29.4770209Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:29.4770405Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:29.4770606Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:29.4770790Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:29.4770977Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:29.4771340Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:29.4771735Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:29.4771923Z %2 = arith.divsi %1, %c4_i32 : i32 2026-02-21T09:05:29.4772107Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:05:29.4772297Z %4 = arith.subi %c112_i32, %3 : i32 2026-02-21T09:05:29.4772476Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:05:29.4772658Z %6 = arith.remsi %1, %c4_i32 : i32 2026-02-21T09:05:29.4772833Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:05:29.4773015Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:05:29.4773188Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:05:29.4773376Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:29.4773617Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:29.4773870Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 
2026-02-21T09:05:29.4774085Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:05:29.4774277Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:05:29.4774480Z %15 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.4774676Z %16 = arith.addi %15, %11 : tensor<64xi32> 2026-02-21T09:05:29.4774875Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:29.4775059Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:05:29.4775390Z %17 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:05:29.4775733Z %63 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.4775941Z %64 = arith.addi %63, %11 : tensor<64xi32> 2026-02-21T09:05:29.4776143Z %65 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:29.4776380Z %66 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:29.4776640Z %67 = tt.splat %65 : i32 -> tensor<128xi32> 2026-02-21T09:05:29.4776839Z %68 = arith.addi %67, %66 : tensor<128xi32> 2026-02-21T09:05:29.4777100Z %69 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.4777421Z %70 = arith.muli %69, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:29.4777680Z %71 = tt.expand_dims %68 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:29.4777980Z %72 = tt.broadcast %70 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:29.4778247Z %73 = tt.broadcast %71 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:29.4778495Z %74 = arith.addi %72, %73 : tensor<64x128xi32> 2026-02-21T09:05:29.4778740Z %75 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:29.4779074Z %76 = tt.addptr %75, %74 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:29.4779391Z %77 = tt.load %76 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:29.4779685Z %78 = arith.extf %77 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:29.4779976Z %79 = 
tt.expand_dims %64 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.4780242Z %80 = arith.muli %79, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:29.4780500Z %81 = tt.expand_dims %13 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.4780815Z %82 = tt.broadcast %80 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.4781071Z %83 = tt.broadcast %81 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.4781314Z %84 = arith.addi %82, %83 : tensor<64x64xi32> 2026-02-21T09:05:29.4781609Z %85 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.4781887Z %86 = tt.addptr %85, %84 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.4782180Z %87 = tt.load %86 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.4782448Z %88 = arith.shli %87, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4782667Z %89 = arith.shrsi %88, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4782907Z %90 = arith.shrsi %87, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4783156Z %91 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.4783442Z %92 = tt.expand_dims %91 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.4783782Z %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.4784099Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:29.4784418Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:29.4784693Z %96 = arith.cmpi eq, %93, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.4784943Z %97 = tt.broadcast %96 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:29.4785205Z %98 = tt.broadcast %94 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:29.4785487Z %99 = arith.select %97, %98, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:29.4785752Z %100 = arith.cmpi eq, %93, 
%cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:29.4786008Z %101 = tt.broadcast %95 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:29.4786291Z %102 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:29.4786587Z %103 = arith.select %102, %101, %99 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:29.4786877Z %104 = tt.reshape %103 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:05:29.4787155Z %105 = arith.sitofp %104 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:05:29.4787555Z %106 = tt.dot %78, %105, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:29.4787902Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:29.4788103Z %107 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:05:29.4788314Z %108 = arith.addi %arg3, %107 : i32 2026-02-21T09:05:29.4788520Z %109 = tt.splat %108 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.4788773Z %110 = arith.addi %109, %11 : tensor<64xi32> 2026-02-21T09:05:29.4788976Z %111 = arith.muli %108, %c2_i32 : i32 2026-02-21T09:05:29.4789232Z %112 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:29.4789505Z %113 = tt.splat %111 : i32 -> tensor<128xi32> 2026-02-21T09:05:29.4789726Z %114 = arith.addi %113, %112 : tensor<128xi32> 2026-02-21T09:05:29.4790003Z %115 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.4790314Z %116 = arith.muli %115, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:29.4790599Z %117 = tt.expand_dims %114 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:29.4790912Z %118 = tt.broadcast %116 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:29.4791217Z %119 = tt.broadcast %117 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:29.4791478Z %120 = arith.addi %118, %119 : tensor<64x128xi32> 2026-02-21T09:05:29.4791791Z %121 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 
2026-02-21T09:05:29.4792126Z %122 = tt.addptr %121, %120 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:29.4792468Z %123 = tt.load %122 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:29.4792789Z %124 = arith.extf %123 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:29.4793093Z %125 = tt.expand_dims %110 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.4793385Z %126 = arith.muli %125, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:29.4793654Z %127 = tt.expand_dims %13 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.4793960Z %128 = tt.broadcast %126 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.4794238Z %129 = tt.broadcast %127 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.4794545Z %130 = arith.addi %128, %129 : tensor<64x64xi32> 2026-02-21T09:05:29.4794805Z %131 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.4795096Z %132 = tt.addptr %131, %130 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.4795420Z %133 = tt.load %132 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.4795701Z %134 = arith.shli %133, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4795937Z %135 = arith.shrsi %134, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4796171Z %136 = arith.shrsi %133, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4796439Z %137 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.4796751Z %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.4797080Z %139 = tt.expand_dims %138 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.4797428Z %140 = tt.expand_dims %135 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:29.4797760Z %141 = tt.expand_dims %136 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:29.4798052Z %142 = arith.cmpi eq, %139, 
%cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.4798312Z %143 = tt.broadcast %142 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:29.4798586Z %144 = tt.broadcast %140 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:29.4798887Z %145 = arith.select %143, %144, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:29.4799162Z %146 = arith.cmpi eq, %139, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:29.4799419Z %147 = tt.broadcast %141 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:29.4799689Z %148 = tt.broadcast %146 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:29.4799975Z %149 = arith.select %148, %147, %145 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:29.4800261Z %150 = tt.reshape %149 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:05:29.4800552Z %151 = arith.sitofp %150 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:05:29.4800926Z %152 = tt.dot %124, %151, %106, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:29.4801247Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:05:29.4801450Z %153 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:05:29.4801690Z %154 = arith.addi %arg3, %153 : i32 2026-02-21T09:05:29.4801921Z %155 = tt.splat %154 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.4802132Z %156 = arith.addi %155, %11 : tensor<64xi32> 2026-02-21T09:05:29.4802326Z %157 = arith.muli %154, %c2_i32 : i32 2026-02-21T09:05:29.4802568Z %158 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:29.4802817Z %159 = tt.splat %157 : i32 -> tensor<128xi32> 2026-02-21T09:05:29.4803034Z %160 = arith.addi %159, %158 : tensor<128xi32> 2026-02-21T09:05:29.4803287Z %161 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.4803562Z %162 = arith.muli %161, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:29.4803862Z %163 = tt.expand_dims %160 {axis = 0 : i32} : tensor<128xi32> -> 
tensor<1x128xi32> 2026-02-21T09:05:29.4804159Z %164 = tt.broadcast %162 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:29.4804438Z %165 = tt.broadcast %163 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:29.4804685Z %166 = arith.addi %164, %165 : tensor<64x128xi32> 2026-02-21T09:05:29.4804944Z %167 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:29.4805239Z %168 = tt.addptr %167, %166 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:29.4805559Z %169 = tt.load %168 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:29.4805884Z %170 = arith.extf %169 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:29.4806175Z %171 = tt.expand_dims %156 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.4806447Z %172 = arith.muli %171, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:29.4806703Z %173 = tt.expand_dims %13 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.4806998Z %174 = tt.broadcast %172 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.4807265Z %175 = tt.broadcast %173 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.4807504Z %176 = arith.addi %174, %175 : tensor<64x64xi32> 2026-02-21T09:05:29.4807747Z %177 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.4808025Z %178 = tt.addptr %177, %176 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.4808335Z %179 = tt.load %178 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.4808603Z %180 = arith.shli %179, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4808830Z %181 = arith.shrsi %180, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4809058Z %182 = arith.shrsi %179, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4809309Z %183 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.4809610Z %184 = tt.expand_dims %183 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 
2026-02-21T09:05:29.4809924Z %185 = tt.expand_dims %184 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.4810259Z %186 = tt.expand_dims %181 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:29.4810589Z %187 = tt.expand_dims %182 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:29.4810873Z %188 = arith.cmpi eq, %185, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.4811133Z %189 = tt.broadcast %188 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:29.4811405Z %190 = tt.broadcast %186 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:29.4811780Z %191 = arith.select %189, %190, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:29.4812059Z %192 = arith.cmpi eq, %185, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:29.4812323Z %193 = tt.broadcast %187 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:29.4812603Z %194 = tt.broadcast %192 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:29.4812884Z %195 = arith.select %194, %193, %191 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:29.4813210Z %196 = tt.reshape %195 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:05:29.4813478Z %197 = arith.sitofp %196 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:05:29.4813849Z %198 = tt.dot %170, %197, %152, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:29.4814178Z scf.yield %198 : tensor<64x64xf32> 2026-02-21T09:05:29.4814369Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:29.4814580Z %18 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:05:29.4814791Z %19 = arith.addi %18, %11 : tensor<64xi32> 2026-02-21T09:05:29.4815022Z %20 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:05:29.4815260Z %21 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:29.4815517Z %22 = tt.splat %20 : i32 -> tensor<128xi32> 2026-02-21T09:05:29.4815715Z %23 
= arith.addi %22, %21 : tensor<128xi32> 2026-02-21T09:05:29.4815974Z %24 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.4816241Z %25 = arith.muli %24, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:29.4816495Z %26 = tt.expand_dims %23 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:29.4816787Z %27 = tt.broadcast %25 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:29.4817075Z %28 = tt.broadcast %26 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:29.4817320Z %29 = arith.addi %27, %28 : tensor<64x128xi32> 2026-02-21T09:05:29.4817561Z %30 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:29.4817849Z %31 = tt.addptr %30, %29 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:29.4818158Z %32 = tt.load %31 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:29.4818450Z %33 = arith.extf %32 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:29.4818738Z %34 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:29.4818994Z %35 = arith.muli %34, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:29.4819250Z %36 = tt.expand_dims %13 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:29.4819531Z %37 = tt.broadcast %35 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.4819785Z %38 = tt.broadcast %36 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:29.4820026Z %39 = arith.addi %37, %38 : tensor<64x64xi32> 2026-02-21T09:05:29.4820261Z %40 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.4820537Z %41 = tt.addptr %40, %39 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:29.4820823Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:05:29.4821089Z %43 = arith.shli %42, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4821305Z %44 = arith.shrsi %43, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4821517Z %45 
= arith.shrsi %42, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:29.4821799Z %46 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:29.4822080Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:29.4822384Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:29.4822690Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:29.4823034Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:29.4823313Z %51 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:29.4823557Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:29.4823827Z %53 = tt.broadcast %49 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:29.4824099Z %54 = arith.select %52, %53, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:29.4824392Z %55 = arith.cmpi eq, %48, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:29.4824637Z %56 = tt.broadcast %50 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:29.4824905Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:29.4825177Z %58 = arith.select %57, %56, %54 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:29.4825446Z %59 = tt.reshape %58 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:05:29.4825710Z %60 = arith.sitofp %59 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:05:29.4826055Z %61 = tt.dot %33, %60, %17, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:29.4826445Z %62 = arith.truncf %61 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:05:29.4826770Z tt.descriptor_store %0[%14, %10], %62 : !tt.tensordesc>, tensor<64x64xbf16> 2026-02-21T09:05:29.4827042Z tt.return 2026-02-21T09:05:29.4827179Z } 2026-02-21T09:05:29.4827302Z } 2026-02-21T09:05:29.4827371Z 
2026-02-21T09:05:29.4827429Z {-# 2026-02-21T09:05:29.4827559Z external_resources: { 2026-02-21T09:05:29.4827725Z mlir_reproducer: { 2026-02-21T09:05:29.4832263Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:29.4836977Z disable_threading: false, 
2026-02-21T09:05:29.4837166Z verify_each: true 2026-02-21T09:05:29.4837319Z } 2026-02-21T09:05:29.4837452Z } 2026-02-21T09:05:29.4837574Z #-} 2026-02-21T09:05:29.4838025Z /tmp/torchinductor_root/dm/cdm4lh3nrg5rtev2j7nt5rhurm7wwwqm7pqlqudcpboflpu6gc6h.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:29.4839112Z /tmp/torchinductor_root/dm/cdm4lh3nrg5rtev2j7nt5rhurm7wwwqm7pqlqudcpboflpu6gc6h.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:29.4839973Z [294s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:29.4841053Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:29.4842073Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:29.4842346Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:29.5369023Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 15.3 configs/s 2026-02-21T09:05:30.7909482Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 782.4 2026-02-21T09:05:30.7913938Z configs/s 2026-02-21T09:05:30.8585722Z [296s] Generation 15 complete: 2026-02-21T09:05:30.8589637Z error=5 2026-02-21T09:05:30.8594792Z ok=28 2026-02-21T09:05:30.8596816Z min=0.1076 2026-02-21T09:05:30.8597005Z mid=0.2366 2026-02-21T09:05:30.8597172Z max=5.3095 2026-02-21T09:05:30.8597360Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:05:30.8597623Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:05:30.8597843Z 'l2_groupings': [1], 2026-02-21T09:05:30.8598033Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:05:30.8599988Z 'loop_orders': [[0, 1]], 2026-02-21T09:05:30.8601993Z 'num_stages': 3, 2026-02-21T09:05:30.8604503Z 'num_warps': 1, 2026-02-21T09:05:30.8606535Z 'pid_type': 'flat', 2026-02-21T09:05:30.8606913Z 'range_flattens': [None, None], 2026-02-21T09:05:30.8607110Z 'range_multi_buffers': [None, True], 2026-02-21T09:05:30.8607311Z 'range_num_stages': [0, 0], 2026-02-21T09:05:30.8607480Z 'range_unroll_factors': [0, 0], 2026-02-21T09:05:30.8607676Z 'range_warp_specializes': [None, None]} 2026-02-21T09:05:30.8620536Z [296s] Fitting surrogate: 1140 points, 1140 targets 2026-02-21T09:05:31.4344496Z [296s] Generation 16 starting: 30 neighbors, 2 active search path(s) 2026-02-21T09:05:43.6474557Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 1.2 configs/s 2026-02-21T09:05:44.5876262Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:44.5878431Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:44.5878745Z ^ 2026-02-21T09:05:44.5879139Z /tmp/torchinductor_root/hs/chsqtfbcrgnenniit2grtbq5dupbwpijvce6yepfwtb7qykoqsrv.py:86:36: note: called from 
2026-02-21T09:05:44.5879547Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:44.5879746Z ^ 2026-02-21T09:05:44.5880148Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:44.5880607Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:44.5880851Z ^ 2026-02-21T09:05:44.5881013Z module { 2026-02-21T09:05:44.5881527Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:44.5882431Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:05:44.5882669Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:44.5882877Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:44.5883072Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:44.5883405Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:44.5883635Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:44.5883936Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:05:44.5884176Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:44.5884420Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:44.5884632Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:44.5884854Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:44.5885091Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:44.5885271Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:44.5885455Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:44.5885636Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:44.5885829Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:44.5886077Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:44.5886404Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], 
[%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:44.5886732Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:44.5886909Z %2 = arith.divsi %1, %c4_i32 : i32 2026-02-21T09:05:44.5887094Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:05:44.5887271Z %4 = arith.subi %c224_i32, %3 : i32 2026-02-21T09:05:44.5887455Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:05:44.5887626Z %6 = arith.remsi %1, %c4_i32 : i32 2026-02-21T09:05:44.5887813Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:05:44.5887988Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:05:44.5888166Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:05:44.5888340Z %10 = arith.muli %8, %c32_i32 : i32 2026-02-21T09:05:44.5888567Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:44.5888828Z %12 = tt.splat %10 : i32 -> tensor<32xi32> 2026-02-21T09:05:44.5889026Z %13 = arith.addi %12, %11 : tensor<32xi32> 2026-02-21T09:05:44.5889217Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:05:44.5889433Z %15 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:44.5889674Z %16 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.5889878Z %17 = arith.addi %16, %15 : tensor<64xi32> 2026-02-21T09:05:44.5890073Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:44.5890272Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:05:44.5890607Z %18 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:44.5890958Z %64 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.5891169Z %65 = arith.addi %64, %15 : tensor<64xi32> 2026-02-21T09:05:44.5891378Z %66 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:44.5891679Z %67 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:44.5891945Z %68 = tt.splat %66 : i32 -> tensor<128xi32> 2026-02-21T09:05:44.5892166Z %69 = arith.addi %68, %67 : tensor<128xi32> 2026-02-21T09:05:44.5892433Z %70 = tt.expand_dims %17 
{axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.5892726Z %71 = arith.muli %70, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:44.5893002Z %72 = tt.expand_dims %69 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:44.5893326Z %73 = tt.broadcast %71 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.5893697Z %74 = tt.broadcast %72 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.5893952Z %75 = arith.addi %73, %74 : tensor<64x128xi32> 2026-02-21T09:05:44.5894214Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.5894516Z %77 = tt.addptr %76, %75 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:44.5894850Z %78 = tt.load %77 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.5900121Z %79 = arith.extf %78 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:44.5900644Z %80 = tt.expand_dims %65 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.5900952Z %81 = arith.muli %80, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:44.5901227Z %82 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:44.5901654Z %83 = tt.broadcast %81 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.5901924Z %84 = tt.broadcast %82 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.5902158Z %85 = arith.addi %83, %84 : tensor<64x32xi32> 2026-02-21T09:05:44.5902414Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.5902694Z %87 = tt.addptr %86, %85 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:44.5903011Z %88 = tt.load %87 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.5903238Z %89 = arith.shli %88, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5903455Z %90 = arith.shrsi %89, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5903678Z %91 = arith.shrsi %88, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5903921Z %92 = tt.make_range {end = 2 : i32, start = 0 : 
i32} : tensor<2xi32> 2026-02-21T09:05:44.5904218Z %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:44.5904529Z %94 = tt.expand_dims %93 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:44.5904845Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.5905165Z %96 = tt.expand_dims %91 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.5905440Z %97 = arith.cmpi eq, %94, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:44.5905697Z %98 = tt.broadcast %97 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.5905970Z %99 = tt.broadcast %95 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.5906253Z %100 = arith.select %98, %99, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.5906528Z %101 = arith.cmpi eq, %94, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:44.5906780Z %102 = tt.broadcast %96 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.5907053Z %103 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.5907336Z %104 = arith.select %103, %102, %100 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.5907628Z %105 = tt.reshape %104 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:44.5907900Z %106 = arith.sitofp %105 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:44.5908268Z %107 = tt.dot %79, %106, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:44.5908603Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:44.5908800Z %108 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:05:44.5909002Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:05:44.5909197Z %110 = tt.splat %109 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.5909411Z %111 = arith.addi %110, %15 : tensor<64xi32> 2026-02-21T09:05:44.5909610Z %112 = arith.muli %109, %c2_i32 : i32 2026-02-21T09:05:44.5909840Z %113 
= tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:44.5910100Z %114 = tt.splat %112 : i32 -> tensor<128xi32> 2026-02-21T09:05:44.5910355Z %115 = arith.addi %114, %113 : tensor<128xi32> 2026-02-21T09:05:44.5910617Z %116 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.5910887Z %117 = arith.muli %116, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:44.5911163Z %118 = tt.expand_dims %115 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:44.5911471Z %119 = tt.broadcast %117 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.5911813Z %120 = tt.broadcast %118 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.5912069Z %121 = arith.addi %119, %120 : tensor<64x128xi32> 2026-02-21T09:05:44.5912357Z %122 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.5912672Z %123 = tt.addptr %122, %121 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:44.5913002Z %124 = tt.load %123 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.5913304Z %125 = arith.extf %124 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:44.5913601Z %126 = tt.expand_dims %111 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.5913865Z %127 = arith.muli %126, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:44.5914128Z %128 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:44.5914448Z %129 = tt.broadcast %127 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.5914720Z %130 = tt.broadcast %128 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.5914963Z %131 = arith.addi %129, %130 : tensor<64x32xi32> 2026-02-21T09:05:44.5915201Z %132 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.5915484Z %133 = tt.addptr %132, %131 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:44.5915744Z %134 = tt.load %133 : tensor<64x32x!tt.ptr> 
2026-02-21T09:05:44.5915964Z %135 = arith.shli %134, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5916183Z %136 = arith.shrsi %135, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5916409Z %137 = arith.shrsi %134, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5916660Z %138 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:44.5916953Z %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:44.5917278Z %140 = tt.expand_dims %139 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:44.5917601Z %141 = tt.expand_dims %136 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.5917930Z %142 = tt.expand_dims %137 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.5918220Z %143 = arith.cmpi eq, %140, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:44.5918473Z %144 = tt.broadcast %143 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.5918750Z %145 = tt.broadcast %141 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.5919043Z %146 = arith.select %144, %145, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.5919322Z %147 = arith.cmpi eq, %140, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:44.5919571Z %148 = tt.broadcast %142 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.5919847Z %149 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.5920131Z %150 = arith.select %149, %148, %146 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.5920411Z %151 = tt.reshape %150 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:44.5920684Z %152 = arith.sitofp %151 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:44.5921042Z %153 = tt.dot %125, %152, %107, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:44.5921373Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:05:44.5921625Z %154 = arith.muli 
%c64_i32, %c2_i32_6 : i32 2026-02-21T09:05:44.5921828Z %155 = arith.addi %arg3, %154 : i32 2026-02-21T09:05:44.5922027Z %156 = tt.splat %155 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.5922227Z %157 = arith.addi %156, %15 : tensor<64xi32> 2026-02-21T09:05:44.5922425Z %158 = arith.muli %155, %c2_i32 : i32 2026-02-21T09:05:44.5922660Z %159 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:44.5922947Z %160 = tt.splat %158 : i32 -> tensor<128xi32> 2026-02-21T09:05:44.5923154Z %161 = arith.addi %160, %159 : tensor<128xi32> 2026-02-21T09:05:44.5923445Z %162 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.5923721Z %163 = arith.muli %162, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:44.5923981Z %164 = tt.expand_dims %161 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:44.5924284Z %165 = tt.broadcast %163 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.5924553Z %166 = tt.broadcast %164 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.5924802Z %167 = arith.addi %165, %166 : tensor<64x128xi32> 2026-02-21T09:05:44.5925049Z %168 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.5925378Z %169 = tt.addptr %168, %167 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:44.5925707Z %170 = tt.load %169 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.5926004Z %171 = arith.extf %170 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:44.5926301Z %172 = tt.expand_dims %157 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.5926565Z %173 = arith.muli %172, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:44.5926828Z %174 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:44.5927124Z %175 = tt.broadcast %173 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.5927389Z %176 = tt.broadcast %174 : tensor<1x32xi32> -> 
tensor<64x32xi32> 2026-02-21T09:05:44.5927631Z %177 = arith.addi %175, %176 : tensor<64x32xi32> 2026-02-21T09:05:44.5927868Z %178 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.5928154Z %179 = tt.addptr %178, %177 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:44.5928411Z %180 = tt.load %179 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.5928630Z %181 = arith.shli %180, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5928853Z %182 = arith.shrsi %181, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5929075Z %183 = arith.shrsi %180, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5929326Z %184 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:44.5929619Z %185 = tt.expand_dims %184 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:44.5929940Z %186 = tt.expand_dims %185 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:44.5930265Z %187 = tt.expand_dims %182 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.5930597Z %188 = tt.expand_dims %183 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.5930889Z %189 = arith.cmpi eq, %186, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:44.5931140Z %190 = tt.broadcast %189 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.5931417Z %191 = tt.broadcast %187 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.5931736Z %192 = arith.select %190, %191, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.5932014Z %193 = arith.cmpi eq, %186, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:44.5932275Z %194 = tt.broadcast %188 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.5932544Z %195 = tt.broadcast %193 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.5932861Z %196 = arith.select %195, %194, %192 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.5933142Z %197 = tt.reshape %196 : tensor<64x2x32xi8> -> 
tensor<128x32xi8> 2026-02-21T09:05:44.5933411Z %198 = arith.sitofp %197 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:44.5933772Z %199 = tt.dot %171, %198, %153, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:44.5934128Z scf.yield %199 : tensor<64x32xf32> 2026-02-21T09:05:44.5934320Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:44.5934518Z %19 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.5934757Z %20 = arith.addi %19, %15 : tensor<64xi32> 2026-02-21T09:05:44.5934959Z %21 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:05:44.5935202Z %22 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:44.5935452Z %23 = tt.splat %21 : i32 -> tensor<128xi32> 2026-02-21T09:05:44.5935660Z %24 = arith.addi %23, %22 : tensor<128xi32> 2026-02-21T09:05:44.5935914Z %25 = tt.expand_dims %17 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.5936176Z %26 = arith.muli %25, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:44.5936438Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:44.5936759Z %28 = tt.broadcast %26 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.5937044Z %29 = tt.broadcast %27 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.5937290Z %30 = arith.addi %28, %29 : tensor<64x128xi32> 2026-02-21T09:05:44.5937548Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.5937850Z %32 = tt.addptr %31, %30 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:44.5938163Z %33 = tt.load %32 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.5938478Z %34 = arith.extf %33 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:44.5938772Z %35 = tt.expand_dims %20 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.5939067Z %36 = arith.muli %35, %cst_3 : tensor<64x1xi32> 
2026-02-21T09:05:44.5939352Z %37 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:44.5939643Z %38 = tt.broadcast %36 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.5939916Z %39 = tt.broadcast %37 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.5940154Z %40 = arith.addi %38, %39 : tensor<64x32xi32> 2026-02-21T09:05:44.5940406Z %41 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.5940691Z %42 = tt.addptr %41, %40 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:44.5940946Z %43 = tt.load %42 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.5941170Z %44 = arith.shli %43, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5941391Z %45 = arith.shrsi %44, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5941649Z %46 = arith.shrsi %43, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.5941899Z %47 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:44.5942203Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:44.5942526Z %49 = tt.expand_dims %48 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:44.5942848Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.5943177Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.5943462Z %52 = arith.cmpi eq, %49, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:44.5943724Z %53 = tt.broadcast %52 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.5944001Z %54 = tt.broadcast %50 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.5944323Z %55 = arith.select %53, %54, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.5944604Z %56 = arith.cmpi eq, %49, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:44.5944865Z %57 = tt.broadcast %51 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.5945130Z %58 = tt.broadcast %56 : 
tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.5945395Z %59 = arith.select %58, %57, %55 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.5945723Z %60 = tt.reshape %59 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:44.5945973Z %61 = arith.sitofp %60 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:44.5946351Z %62 = tt.dot %34, %61, %18, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:44.5946706Z %63 = arith.truncf %62 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:44.5947020Z tt.descriptor_store %0[%14, %10], %63 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:05:44.5947302Z tt.return 2026-02-21T09:05:44.5947430Z } 2026-02-21T09:05:44.5947558Z } 2026-02-21T09:05:44.5947629Z 2026-02-21T09:05:44.5947681Z {-# 2026-02-21T09:05:44.5947817Z external_resources: { 2026-02-21T09:05:44.5947980Z mlir_reproducer: { 2026-02-21T09:05:44.5952411Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, 
triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:44.5956866Z disable_threading: false, 2026-02-21T09:05:44.5957033Z verify_each: true 2026-02-21T09:05:44.5957183Z } 2026-02-21T09:05:44.5957302Z } 2026-02-21T09:05:44.5957424Z #-} 2026-02-21T09:05:44.5957848Z /tmp/torchinductor_root/hs/chsqtfbcrgnenniit2grtbq5dupbwpijvce6yepfwtb7qykoqsrv.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:44.5958898Z /tmp/torchinductor_root/hs/chsqtfbcrgnenniit2grtbq5dupbwpijvce6yepfwtb7qykoqsrv.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:44.5959724Z [310s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:05:44.5960760Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:44.5961747Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:44.5962015Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:44.7781187Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:44.7786056Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:44.7791737Z ^ 2026-02-21T09:05:44.7793996Z /tmp/torchinductor_root/t5/ct5f5vqckpzdcqevrqe5agmpa6fdcdarnmvvw3hlrhsad4cksaud.py:94:40: note: called from 2026-02-21T09:05:44.7794469Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:44.7798838Z ^ 2026-02-21T09:05:44.7803713Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:44.7805544Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:44.7805829Z ^ 2026-02-21T09:05:44.7806301Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T09:05:44.7810905Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:44.7811852Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:05:44.7812088Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:44.7812287Z %c4096_i32 = arith.constant 4096 : i32 
2026-02-21T09:05:44.7812475Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:44.7812666Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:44.7812870Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:44.7813107Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:44.7813336Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:05:44.7813573Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:44.7813816Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:44.7814129Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:44.7814392Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:44.7819237Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:44.7821363Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:44.7821708Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:44.7821944Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:44.7822160Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:44.7822522Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:44.7822848Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:44.7823046Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:05:44.7823236Z %3 = arith.minsi %2, %c224_i32 : i32 2026-02-21T09:05:44.7823437Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:05:44.7828211Z %4 = arith.divsi %arg3, %c2_i32 : i32 2026-02-21T09:05:44.7829839Z %5 = arith.muli %4, %c2_i32 : i32 2026-02-21T09:05:44.7830114Z %6 = arith.subi %c224_i32, %5 : i32 2026-02-21T09:05:44.7835049Z %7 = arith.minsi %6, %c2_i32 : i32 2026-02-21T09:05:44.7836709Z %8 = arith.remsi %arg3, %c2_i32 : i32 2026-02-21T09:05:44.7836986Z %9 = arith.remsi %8, %7 : i32 2026-02-21T09:05:44.7840845Z %10 = arith.addi %5, %9 : i32 2026-02-21T09:05:44.7844835Z %11 = arith.divsi %8, %7 : i32 2026-02-21T09:05:44.7846840Z %12 = arith.muli %10, %c32_i32 : i32 2026-02-21T09:05:44.7847455Z 
%13 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:44.7851810Z %14 = tt.splat %12 : i32 -> tensor<32xi32> 2026-02-21T09:05:44.7852139Z %15 = arith.addi %14, %13 : tensor<32xi32> 2026-02-21T09:05:44.7852373Z %16 = arith.muli %11, %c64_i32 : i32 2026-02-21T09:05:44.7852638Z %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:44.7853094Z %18 = tt.splat %16 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.7853309Z %19 = arith.addi %18, %17 : tensor<64xi32> 2026-02-21T09:05:44.7858137Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:44.7862166Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:05:44.7864225Z %20 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:44.7864610Z %66 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.7864864Z %67 = arith.addi %66, %17 : tensor<64xi32> 2026-02-21T09:05:44.7865068Z %68 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:44.7865322Z %69 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:44.7865583Z %70 = tt.splat %68 : i32 -> tensor<128xi32> 2026-02-21T09:05:44.7865798Z %71 = arith.addi %70, %69 : tensor<128xi32> 2026-02-21T09:05:44.7866225Z %72 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.7866516Z %73 = arith.muli %72, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:44.7866790Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:44.7867091Z %75 = tt.broadcast %73 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.7867368Z %76 = tt.broadcast %74 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.7867611Z %77 = arith.addi %75, %76 : tensor<64x128xi32> 2026-02-21T09:05:44.7867878Z %78 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.7868179Z %79 = tt.addptr %78, %77 : tensor<64x128x!tt.ptr>, 
tensor<64x128xi32> 2026-02-21T09:05:44.7868535Z %80 = tt.load %79 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.7868835Z %81 = arith.extf %80 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:44.7869131Z %82 = tt.expand_dims %67 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.7869405Z %83 = arith.muli %82, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:44.7869666Z %84 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:44.7869958Z %85 = tt.broadcast %83 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.7870216Z %86 = tt.broadcast %84 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.7870457Z %87 = arith.addi %85, %86 : tensor<64x32xi32> 2026-02-21T09:05:44.7870696Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.7870979Z %89 = tt.addptr %88, %87 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:44.7871288Z %90 = tt.load %89 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.7871617Z %91 = arith.shli %90, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7871852Z %92 = arith.shrsi %91, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7872070Z %93 = arith.shrsi %90, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7872324Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:44.7872614Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:44.7872931Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:44.7873257Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.7873641Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.7873931Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:44.7874186Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 
2026-02-21T09:05:44.7874471Z %101 = tt.broadcast %97 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.7874770Z %102 = arith.select %100, %101, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.7875095Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:44.7875387Z %104 = tt.broadcast %98 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.7875658Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.7875951Z %106 = arith.select %105, %104, %102 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.7876236Z %107 = tt.reshape %106 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:44.7876521Z %108 = arith.sitofp %107 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:44.7876904Z %109 = tt.dot %81, %108, %arg5, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:44.7877243Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:05:44.7877457Z %110 = arith.muli %c64_i32, %c1_i32_6 : i32 2026-02-21T09:05:44.7877694Z %111 = arith.addi %arg4, %110 : i32 2026-02-21T09:05:44.7877905Z %112 = tt.splat %111 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.7878109Z %113 = arith.addi %112, %17 : tensor<64xi32> 2026-02-21T09:05:44.7878309Z %114 = arith.muli %111, %c2_i32 : i32 2026-02-21T09:05:44.7878555Z %115 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:44.7878806Z %116 = tt.splat %114 : i32 -> tensor<128xi32> 2026-02-21T09:05:44.7879018Z %117 = arith.addi %116, %115 : tensor<128xi32> 2026-02-21T09:05:44.7879273Z %118 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.7879553Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:44.7879818Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:44.7880125Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x128xi32> 
2026-02-21T09:05:44.7880403Z %122 = tt.broadcast %120 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.7880651Z %123 = arith.addi %121, %122 : tensor<64x128xi32> 2026-02-21T09:05:44.7880910Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.7881203Z %125 = tt.addptr %124, %123 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:44.7881531Z %126 = tt.load %125 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.7881869Z %127 = arith.extf %126 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:44.7882160Z %128 = tt.expand_dims %113 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.7882436Z %129 = arith.muli %128, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:44.7882692Z %130 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:44.7882989Z %131 = tt.broadcast %129 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.7883261Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.7883512Z %133 = arith.addi %131, %132 : tensor<64x32xi32> 2026-02-21T09:05:44.7883754Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.7884033Z %135 = tt.addptr %134, %133 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:44.7884345Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.7884615Z %137 = arith.shli %136, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7884873Z %138 = arith.shrsi %137, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7885094Z %139 = arith.shrsi %136, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7885348Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:44.7885648Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:44.7885998Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 
2026-02-21T09:05:44.7886330Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.7886689Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.7887004Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:44.7887277Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.7887565Z %147 = tt.broadcast %143 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.7887875Z %148 = arith.select %146, %147, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.7888164Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:44.7888438Z %150 = tt.broadcast %144 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.7888748Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.7889052Z %152 = arith.select %151, %150, %148 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.7889358Z %153 = tt.reshape %152 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:44.7889638Z %154 = arith.sitofp %153 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:44.7890030Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:44.7890376Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:44.7890594Z %156 = arith.muli %c64_i32, %c2_i32_7 : i32 2026-02-21T09:05:44.7890823Z %157 = arith.addi %arg4, %156 : i32 2026-02-21T09:05:44.7891030Z %158 = tt.splat %157 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.7891252Z %159 = arith.addi %158, %17 : tensor<64xi32> 2026-02-21T09:05:44.7891458Z %160 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:05:44.7891746Z %161 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:44.7892014Z %162 = tt.splat %160 : i32 -> tensor<128xi32> 2026-02-21T09:05:44.7892239Z %163 = arith.addi %162, %161 : 
tensor<128xi32> 2026-02-21T09:05:44.7892514Z %164 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.7892798Z %165 = arith.muli %164, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:44.7893086Z %166 = tt.expand_dims %163 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:44.7893399Z %167 = tt.broadcast %165 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.7893691Z %168 = tt.broadcast %166 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.7893950Z %169 = arith.addi %167, %168 : tensor<64x128xi32> 2026-02-21T09:05:44.7894217Z %170 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.7894535Z %171 = tt.addptr %170, %169 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:44.7894870Z %172 = tt.load %171 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.7895195Z %173 = arith.extf %172 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:44.7895488Z %174 = tt.expand_dims %159 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.7895767Z %175 = arith.muli %174, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:44.7896032Z %176 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:44.7896370Z %177 = tt.broadcast %175 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.7896639Z %178 = tt.broadcast %176 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.7896879Z %179 = arith.addi %177, %178 : tensor<64x32xi32> 2026-02-21T09:05:44.7897121Z %180 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.7897396Z %181 = tt.addptr %180, %179 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:44.7897736Z %182 = tt.load %181 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.7898037Z %183 = arith.shli %182, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7898257Z %184 = arith.shrsi %183, %cst_2 : tensor<64x32xi8> 
2026-02-21T09:05:44.7898494Z %185 = arith.shrsi %182, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7898744Z %186 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:44.7899052Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:44.7899369Z %188 = tt.expand_dims %187 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:44.7899706Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.7900067Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.7900351Z %191 = arith.cmpi eq, %188, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:44.7900606Z %192 = tt.broadcast %191 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.7900879Z %193 = tt.broadcast %189 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.7901173Z %194 = arith.select %192, %193, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.7901452Z %195 = arith.cmpi eq, %188, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:44.7901724Z %196 = tt.broadcast %190 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.7902007Z %197 = tt.broadcast %195 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.7902289Z %198 = arith.select %197, %196, %194 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.7902581Z %199 = tt.reshape %198 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:44.7902854Z %200 = arith.sitofp %199 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:44.7903232Z %201 = tt.dot %173, %200, %155, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:44.7903569Z scf.yield %201 : tensor<64x32xf32> 2026-02-21T09:05:44.7903763Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:44.7903978Z %21 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:05:44.7904193Z %22 = arith.addi %21, %17 : 
tensor<64xi32> 2026-02-21T09:05:44.7904405Z %23 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:05:44.7904653Z %24 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:44.7904928Z %25 = tt.splat %23 : i32 -> tensor<128xi32> 2026-02-21T09:05:44.7905145Z %26 = arith.addi %25, %24 : tensor<128xi32> 2026-02-21T09:05:44.7905394Z %27 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.7905665Z %28 = arith.muli %27, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:44.7905923Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:44.7906224Z %30 = tt.broadcast %28 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.7906492Z %31 = tt.broadcast %29 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:44.7906736Z %32 = arith.addi %30, %31 : tensor<64x128xi32> 2026-02-21T09:05:44.7906986Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.7907269Z %34 = tt.addptr %33, %32 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:44.7907617Z %35 = tt.load %34 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:44.7907916Z %36 = arith.extf %35 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:44.7908210Z %37 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:44.7908489Z %38 = arith.muli %37, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:44.7908768Z %39 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:44.7909058Z %40 = tt.broadcast %38 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.7909336Z %41 = tt.broadcast %39 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:44.7909576Z %42 = arith.addi %40, %41 : tensor<64x32xi32> 2026-02-21T09:05:44.7909809Z %43 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.7910084Z %44 = tt.addptr %43, %42 : tensor<64x32x!tt.ptr>, 
tensor<64x32xi32> 2026-02-21T09:05:44.7910384Z %45 = tt.load %44 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:44.7910642Z %46 = arith.shli %45, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7910859Z %47 = arith.shrsi %46, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7911066Z %48 = arith.shrsi %45, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:44.7911334Z %49 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:44.7911648Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:44.7911954Z %51 = tt.expand_dims %50 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:44.7912267Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.7912573Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:44.7912857Z %54 = arith.cmpi eq, %51, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:44.7913103Z %55 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.7913374Z %56 = tt.broadcast %52 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.7913653Z %57 = arith.select %55, %56, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.7913916Z %58 = arith.cmpi eq, %51, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:44.7914167Z %59 = tt.broadcast %53 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:44.7914426Z %60 = tt.broadcast %58 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:44.7914702Z %61 = arith.select %60, %59, %57 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:44.7914973Z %62 = tt.reshape %61 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:44.7915233Z %63 = arith.sitofp %62 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:44.7915588Z %64 = tt.dot %36, %63, %20, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:44.7915937Z %65 = 
arith.truncf %64 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:44.7916262Z tt.descriptor_store %0[%16, %12], %65 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:05:44.7916578Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:05:44.7916799Z tt.return 2026-02-21T09:05:44.7916930Z } 2026-02-21T09:05:44.7917058Z } 2026-02-21T09:05:44.7917130Z 2026-02-21T09:05:44.7917191Z {-# 2026-02-21T09:05:44.7917322Z external_resources: { 2026-02-21T09:05:44.7917496Z mlir_reproducer: { 2026-02-21T09:05:44.7922000Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, 
tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:44.7926585Z disable_threading: false, 2026-02-21T09:05:44.7926764Z verify_each: true 2026-02-21T09:05:44.7926924Z } 2026-02-21T09:05:44.7927043Z } 2026-02-21T09:05:44.7927167Z #-} 2026-02-21T09:05:44.7927589Z /tmp/torchinductor_root/t5/ct5f5vqckpzdcqevrqe5agmpa6fdcdarnmvvw3hlrhsad4cksaud.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:44.7928612Z /tmp/torchinductor_root/t5/ct5f5vqckpzdcqevrqe5agmpa6fdcdarnmvvw3hlrhsad4cksaud.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:44.7929421Z [310s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:44.7930577Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[False, None]), static_shapes=True) 2026-02-21T09:05:44.7931683Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:44.7931962Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:45.1623911Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:45.1624469Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:45.1628712Z ^ 2026-02-21T09:05:45.1633510Z /tmp/torchinductor_root/uk/cukfvkfsild6yw6kv5u765a3sdroqehzlat4oraazujb7c7xcsqg.py:86:36: note: called from 2026-02-21T09:05:45.1634623Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:45.1634856Z ^ 2026-02-21T09:05:45.1635268Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:45.1635740Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:45.1635990Z ^ 2026-02-21T09:05:45.1636197Z module { 2026-02-21T09:05:45.1640946Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:45.1645891Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:05:45.1646233Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:45.1646456Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:45.1646695Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:45.1647253Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:45.1651864Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:05:45.1655686Z %cst_3 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:45.1657298Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:45.1657652Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:45.1662904Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:45.1667334Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:45.1668842Z %c64_i32 = arith.constant 64 : i32 
2026-02-21T09:05:45.1669089Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:45.1669298Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:45.1669495Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:45.1669683Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:45.1670019Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:45.1670645Z %1 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:45.1670979Z %2 = tt.get_program_id x : i32 2026-02-21T09:05:45.1671162Z %3 = arith.divsi %2, %c4_i32 : i32 2026-02-21T09:05:45.1671355Z %4 = arith.muli %3, %c4_i32 : i32 2026-02-21T09:05:45.1671612Z %5 = arith.subi %c224_i32, %4 : i32 2026-02-21T09:05:45.1671795Z %6 = arith.minsi %5, %c4_i32 : i32 2026-02-21T09:05:45.1671979Z %7 = arith.remsi %2, %c4_i32 : i32 2026-02-21T09:05:45.1672166Z %8 = arith.remsi %7, %6 : i32 2026-02-21T09:05:45.1672358Z %9 = arith.addi %4, %8 : i32 2026-02-21T09:05:45.1672533Z %10 = arith.divsi %7, %6 : i32 2026-02-21T09:05:45.1672732Z %11 = arith.muli %9, %c32_i32 : i32 2026-02-21T09:05:45.1672921Z %12 = arith.muli %10, %c64_i32 : i32 2026-02-21T09:05:45.1673166Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:45.1673434Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 2026-02-21T09:05:45.1673632Z %15 = arith.addi %14, %13 : tensor<64xi32> 2026-02-21T09:05:45.1673830Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:45.1674012Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:05:45.1674336Z %16 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_4) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:45.1674654Z %52 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:45.1674892Z %53 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:45.1675151Z %54 = tt.splat %52 : i32 -> tensor<128xi32> 2026-02-21T09:05:45.1675352Z %55 = 
arith.addi %54, %53 : tensor<128xi32> 2026-02-21T09:05:45.1675611Z %56 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.1675880Z %57 = arith.muli %56, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:45.1676143Z %58 = tt.expand_dims %55 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:45.1676431Z %59 = tt.broadcast %57 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.1676699Z %60 = tt.broadcast %58 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.1676942Z %61 = arith.addi %59, %60 : tensor<64x128xi32> 2026-02-21T09:05:45.1677184Z %62 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.1677473Z %63 = tt.addptr %62, %61 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:45.1677776Z %64 = tt.load %63 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.1678185Z %65 = arith.extf %64 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:45.1678516Z %66 = tt.descriptor_load %0[%arg3, %11] : !tt.tensordesc> -> tensor<64x32xi8> 2026-02-21T09:05:45.1678818Z %67 = arith.shli %66, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1679045Z %68 = arith.shrsi %67, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1679258Z %69 = arith.shrsi %66, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1679546Z %70 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:45.1679863Z %71 = tt.expand_dims %70 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:45.1680177Z %72 = tt.expand_dims %71 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:45.1680496Z %73 = tt.expand_dims %68 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:45.1680829Z %74 = tt.expand_dims %69 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:45.1681120Z %75 = arith.cmpi eq, %72, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:45.1681366Z %76 = tt.broadcast %75 : 
tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:45.1681677Z %77 = tt.broadcast %73 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:45.1681983Z %78 = arith.select %76, %77, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:45.1682256Z %79 = arith.cmpi eq, %72, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:45.1682502Z %80 = tt.broadcast %74 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:45.1682811Z %81 = tt.broadcast %79 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:45.1683074Z %82 = arith.select %81, %80, %78 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:45.1683355Z %83 = tt.reshape %82 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:45.1683612Z %84 = arith.sitofp %83 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:45.1683982Z %85 = tt.dot %65, %84, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:45.1684310Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:45.1684499Z %86 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:05:45.1684696Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T09:05:45.1684876Z %88 = arith.muli %87, %c2_i32 : i32 2026-02-21T09:05:45.1685116Z %89 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:45.1685366Z %90 = tt.splat %88 : i32 -> tensor<128xi32> 2026-02-21T09:05:45.1685572Z %91 = arith.addi %90, %89 : tensor<128xi32> 2026-02-21T09:05:45.1685820Z %92 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.1686090Z %93 = arith.muli %92, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:45.1686347Z %94 = tt.expand_dims %91 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:45.1686633Z %95 = tt.broadcast %93 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.1686891Z %96 = tt.broadcast %94 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.1687120Z %97 = arith.addi %95, %96 : 
tensor<64x128xi32> 2026-02-21T09:05:45.1687363Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.1687649Z %99 = tt.addptr %98, %97 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:45.1687954Z %100 = tt.load %99 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.1688261Z %101 = arith.extf %100 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:45.1688579Z %102 = tt.descriptor_load %0[%87, %11] : !tt.tensordesc> -> tensor<64x32xi8> 2026-02-21T09:05:45.1688882Z %103 = arith.shli %102, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1689101Z %104 = arith.shrsi %103, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1689357Z %105 = arith.shrsi %102, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1689606Z %106 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:45.1689898Z %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:45.1690216Z %108 = tt.expand_dims %107 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:45.1690538Z %109 = tt.expand_dims %104 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:45.1690890Z %110 = tt.expand_dims %105 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:45.1691202Z %111 = arith.cmpi eq, %108, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:45.1691455Z %112 = tt.broadcast %111 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:45.1691754Z %113 = tt.broadcast %109 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:45.1692045Z %114 = arith.select %112, %113, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:45.1692324Z %115 = arith.cmpi eq, %108, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:45.1692573Z %116 = tt.broadcast %110 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:45.1692852Z %117 = tt.broadcast %115 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:45.1693165Z %118 = 
arith.select %117, %116, %114 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:45.1693455Z %119 = tt.reshape %118 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:45.1693741Z %120 = arith.sitofp %119 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:45.1694109Z %121 = tt.dot %101, %120, %85, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:45.1694450Z %c2_i32_5 = arith.constant 2 : i32 2026-02-21T09:05:45.1694649Z %122 = arith.muli %c64_i32, %c2_i32_5 : i32 2026-02-21T09:05:45.1694849Z %123 = arith.addi %arg3, %122 : i32 2026-02-21T09:05:45.1695040Z %124 = arith.muli %123, %c2_i32 : i32 2026-02-21T09:05:45.1695277Z %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:45.1695537Z %126 = tt.splat %124 : i32 -> tensor<128xi32> 2026-02-21T09:05:45.1695742Z %127 = arith.addi %126, %125 : tensor<128xi32> 2026-02-21T09:05:45.1696006Z %128 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.1696279Z %129 = arith.muli %128, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:45.1696550Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:45.1696859Z %131 = tt.broadcast %129 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.1697128Z %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.1697377Z %133 = arith.addi %131, %132 : tensor<64x128xi32> 2026-02-21T09:05:45.1697623Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.1697930Z %135 = tt.addptr %134, %133 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:45.1698254Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.1698550Z %137 = arith.extf %136 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:45.1698882Z %138 = tt.descriptor_load %0[%123, %11] : !tt.tensordesc> -> tensor<64x32xi8> 
2026-02-21T09:05:45.1699180Z %139 = arith.shli %138, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1699406Z %140 = arith.shrsi %139, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1699623Z %141 = arith.shrsi %138, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1699871Z %142 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:45.1700163Z %143 = tt.expand_dims %142 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:45.1700502Z %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:45.1700829Z %145 = tt.expand_dims %140 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:45.1701154Z %146 = tt.expand_dims %141 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:45.1701442Z %147 = arith.cmpi eq, %144, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:45.1701724Z %148 = tt.broadcast %147 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:45.1702045Z %149 = tt.broadcast %145 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:45.1702370Z %150 = arith.select %148, %149, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:45.1702643Z %151 = arith.cmpi eq, %144, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:45.1702901Z %152 = tt.broadcast %146 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:45.1703170Z %153 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:45.1703458Z %154 = arith.select %153, %152, %150 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:45.1703747Z %155 = tt.reshape %154 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:45.1704012Z %156 = arith.sitofp %155 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:45.1704404Z %157 = tt.dot %137, %156, %121, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:45.1704728Z scf.yield %157 : tensor<64x32xf32> 2026-02-21T09:05:45.1704909Z } 
2026-02-21T09:05:45.1705053Z %17 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:05:45.1705299Z %18 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:45.1705548Z %19 = tt.splat %17 : i32 -> tensor<128xi32> 2026-02-21T09:05:45.1705745Z %20 = arith.addi %19, %18 : tensor<128xi32> 2026-02-21T09:05:45.1705997Z %21 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.1706261Z %22 = arith.muli %21, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:45.1706514Z %23 = tt.expand_dims %20 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:45.1706800Z %24 = tt.broadcast %22 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.1707070Z %25 = tt.broadcast %23 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.1707310Z %26 = arith.addi %24, %25 : tensor<64x128xi32> 2026-02-21T09:05:45.1707557Z %27 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.1707841Z %28 = tt.addptr %27, %26 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:45.1708143Z %29 = tt.load %28 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.1708440Z %30 = arith.extf %29 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:45.1708763Z %31 = tt.descriptor_load %0[%c4032_i32, %11] : !tt.tensordesc> -> tensor<64x32xi8> 2026-02-21T09:05:45.1709074Z %32 = arith.shli %31, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1709292Z %33 = arith.shrsi %32, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1709499Z %34 = arith.shrsi %31, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:45.1709743Z %35 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:45.1710023Z %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:45.1710330Z %37 = tt.expand_dims %36 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:45.1710629Z %38 = tt.expand_dims %33 {axis = 1 : i32} : 
tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:45.1710943Z %39 = tt.expand_dims %34 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:45.1711219Z %40 = arith.cmpi eq, %37, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:45.1711465Z %41 = tt.broadcast %40 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:45.1711781Z %42 = tt.broadcast %38 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:45.1712179Z %43 = arith.select %41, %42, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:45.1712460Z %44 = arith.cmpi eq, %37, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:45.1712725Z %45 = tt.broadcast %39 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:45.1712995Z %46 = tt.broadcast %44 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:45.1713314Z %47 = arith.select %46, %45, %43 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:45.1713603Z %48 = tt.reshape %47 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:45.1713912Z %49 = arith.sitofp %48 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:45.1714272Z %50 = tt.dot %30, %49, %16, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:45.1714647Z %51 = arith.truncf %50 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:45.1714983Z tt.descriptor_store %1[%12, %11], %51 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:05:45.1715265Z tt.return 2026-02-21T09:05:45.1715408Z } 2026-02-21T09:05:45.1715535Z } 2026-02-21T09:05:45.1715614Z 2026-02-21T09:05:45.1715668Z {-# 2026-02-21T09:05:45.1715803Z external_resources: { 2026-02-21T09:05:45.1715974Z mlir_reproducer: { 2026-02-21T09:05:45.1720538Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, 
tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:45.1724995Z disable_threading: false, 2026-02-21T09:05:45.1725175Z verify_each: true 2026-02-21T09:05:45.1725321Z } 2026-02-21T09:05:45.1725449Z } 2026-02-21T09:05:45.1725566Z #-} 2026-02-21T09:05:45.1726004Z /tmp/torchinductor_root/uk/cukfvkfsild6yw6kv5u765a3sdroqehzlat4oraazujb7c7xcsqg.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:45.1727034Z /tmp/torchinductor_root/uk/cukfvkfsild6yw6kv5u765a3sdroqehzlat4oraazujb7c7xcsqg.py:19:0: note: 
Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:45.1727878Z [310s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:45.1728971Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:45.1729988Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:45.1730272Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:45.4438187Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:45.4442332Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:45.4446211Z ^ 2026-02-21T09:05:45.4450945Z /tmp/torchinductor_root/go/cgox4ptrz4pt62y6hkrzqwqlkzkfewa6irn755jdgunn6gik2xru.py:86:36: note: called from 2026-02-21T09:05:45.4451974Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:45.4452201Z ^ 2026-02-21T09:05:45.4452638Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:45.4458327Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:45.4462501Z ^ 2026-02-21T09:05:45.4462802Z module { 2026-02-21T09:05:45.4466127Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr 
{tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:45.4466758Z %cst = arith.constant dense<0> : tensor<64x2x64xi8> 2026-02-21T09:05:45.4466994Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:05:45.4467209Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:45.4467413Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:45.4467641Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:45.4467887Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:45.4468125Z %cst_2 = arith.constant dense<4> : tensor<64x64xi8> 2026-02-21T09:05:45.4468379Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:45.4468628Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:45.4468856Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:45.4469087Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:05:45.4469332Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:05:45.4469517Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:45.4469716Z %c7168_i32 = arith.constant 7168 : i32 2026-02-21T09:05:45.4469917Z %c7168_i64 = arith.constant 7168 : i64 2026-02-21T09:05:45.4470108Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:05:45.4470450Z %0 = tt.make_tensor_descriptor %arg2, [%c64_i32, %c7168_i32], [%c7168_i64, %c1_i64] : , > 2026-02-21T09:05:45.4470780Z %1 = tt.get_program_id x : i32 2026-02-21T09:05:45.4470973Z %2 = arith.divsi %1, %c4_i32 : i32 2026-02-21T09:05:45.4471159Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T09:05:45.4471350Z %4 = arith.subi %c112_i32, %3 : i32 2026-02-21T09:05:45.4471623Z %5 = arith.minsi %4, %c4_i32 : i32 2026-02-21T09:05:45.4471811Z %6 = arith.remsi %1, %c4_i32 : i32 2026-02-21T09:05:45.4472001Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:05:45.4472180Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:05:45.4472361Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:05:45.4472541Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:45.4472786Z 
%11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:45.4473051Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:05:45.4473317Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:05:45.4473684Z %14 = arith.muli %9, %c64_i32 : i32 2026-02-21T09:05:45.4473886Z %15 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T09:05:45.4474091Z %16 = arith.addi %15, %11 : tensor<64xi32> 2026-02-21T09:05:45.4474289Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:45.4474476Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:05:45.4474808Z %17 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:05:45.4475216Z %63 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:05:45.4475429Z %64 = arith.addi %63, %11 : tensor<64xi32> 2026-02-21T09:05:45.4475670Z %65 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:45.4475915Z %66 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:45.4476195Z %67 = tt.splat %65 : i32 -> tensor<128xi32> 2026-02-21T09:05:45.4476398Z %68 = arith.addi %67, %66 : tensor<128xi32> 2026-02-21T09:05:45.4476665Z %69 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.4476943Z %70 = arith.muli %69, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:45.4477203Z %71 = tt.expand_dims %68 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:45.4477507Z %72 = tt.broadcast %70 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.4477816Z %73 = tt.broadcast %71 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.4478064Z %74 = arith.addi %72, %73 : tensor<64x128xi32> 2026-02-21T09:05:45.4478306Z %75 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.4478601Z %76 = tt.addptr %75, %74 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:45.4478915Z %77 = tt.load %76 evictionPolicy = evict_last : 
tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.4479209Z %78 = arith.extf %77 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:45.4479499Z %79 = tt.expand_dims %64 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.4479759Z %80 = arith.muli %79, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:45.4480016Z %81 = tt.expand_dims %13 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:45.4480295Z %82 = tt.broadcast %80 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:45.4480562Z %83 = tt.broadcast %81 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:45.4480804Z %84 = arith.addi %82, %83 : tensor<64x64xi32> 2026-02-21T09:05:45.4481041Z %85 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:45.4481321Z %86 = tt.addptr %85, %84 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:45.4481658Z %87 = tt.load %86 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:05:45.4481925Z %88 = arith.shli %87, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4482136Z %89 = arith.shrsi %88, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4482351Z %90 = arith.shrsi %87, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4482596Z %91 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:45.4482878Z %92 = tt.expand_dims %91 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:45.4483184Z %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:45.4483497Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:45.4483818Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:45.4484105Z %96 = arith.cmpi eq, %93, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:45.4484353Z %97 = tt.broadcast %96 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:45.4484626Z %98 = tt.broadcast %94 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 
2026-02-21T09:05:45.4484931Z %99 = arith.select %97, %98, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:45.4485205Z %100 = arith.cmpi eq, %93, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:45.4485460Z %101 = tt.broadcast %95 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:45.4485738Z %102 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:45.4486025Z %103 = arith.select %102, %101, %99 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:45.4486334Z %104 = tt.reshape %103 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:05:45.4486604Z %105 = arith.sitofp %104 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:05:45.4486996Z %106 = tt.dot %78, %105, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:45.4487336Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:45.4487535Z %107 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:05:45.4487727Z %108 = arith.addi %arg3, %107 : i32 2026-02-21T09:05:45.4487931Z %109 = tt.splat %108 : i32 -> tensor<64xi32> 2026-02-21T09:05:45.4488132Z %110 = arith.addi %109, %11 : tensor<64xi32> 2026-02-21T09:05:45.4488331Z %111 = arith.muli %108, %c2_i32 : i32 2026-02-21T09:05:45.4488568Z %112 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:45.4488863Z %113 = tt.splat %111 : i32 -> tensor<128xi32> 2026-02-21T09:05:45.4489079Z %114 = arith.addi %113, %112 : tensor<128xi32> 2026-02-21T09:05:45.4489344Z %115 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.4489624Z %116 = arith.muli %115, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:45.4489888Z %117 = tt.expand_dims %114 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:45.4490188Z %118 = tt.broadcast %116 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.4490455Z %119 = tt.broadcast %117 : tensor<1x128xi32> -> tensor<64x128xi32> 
2026-02-21T09:05:45.4490705Z %120 = arith.addi %118, %119 : tensor<64x128xi32> 2026-02-21T09:05:45.4490960Z %121 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.4491248Z %122 = tt.addptr %121, %120 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:45.4491596Z %123 = tt.load %122 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.4491898Z %124 = arith.extf %123 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:45.4492191Z %125 = tt.expand_dims %110 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.4492458Z %126 = arith.muli %125, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:45.4492721Z %127 = tt.expand_dims %13 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:45.4493013Z %128 = tt.broadcast %126 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:45.4493271Z %129 = tt.broadcast %127 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:45.4493515Z %130 = arith.addi %128, %129 : tensor<64x64xi32> 2026-02-21T09:05:45.4493748Z %131 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:45.4494029Z %132 = tt.addptr %131, %130 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:45.4494329Z %133 = tt.load %132 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:05:45.4494607Z %134 = arith.shli %133, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4494836Z %135 = arith.shrsi %134, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4495053Z %136 = arith.shrsi %133, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4495304Z %137 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:45.4495594Z %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:45.4495913Z %139 = tt.expand_dims %138 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:45.4496275Z %140 = tt.expand_dims %135 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 
2026-02-21T09:05:45.4496601Z %141 = tt.expand_dims %136 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:45.4496895Z %142 = arith.cmpi eq, %139, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:45.4497151Z %143 = tt.broadcast %142 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:45.4497460Z %144 = tt.broadcast %140 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:45.4497750Z %145 = arith.select %143, %144, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:45.4498075Z %146 = arith.cmpi eq, %139, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:45.4498335Z %147 = tt.broadcast %141 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:45.4498607Z %148 = tt.broadcast %146 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:45.4498897Z %149 = arith.select %148, %147, %145 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:45.4499179Z %150 = tt.reshape %149 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:05:45.4499454Z %151 = arith.sitofp %150 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:05:45.4499821Z %152 = tt.dot %124, %151, %106, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:45.4500174Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:05:45.4500381Z %153 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:05:45.4500576Z %154 = arith.addi %arg3, %153 : i32 2026-02-21T09:05:45.4500775Z %155 = tt.splat %154 : i32 -> tensor<64xi32> 2026-02-21T09:05:45.4500976Z %156 = arith.addi %155, %11 : tensor<64xi32> 2026-02-21T09:05:45.4501176Z %157 = arith.muli %154, %c2_i32 : i32 2026-02-21T09:05:45.4501406Z %158 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:45.4501703Z %159 = tt.splat %157 : i32 -> tensor<128xi32> 2026-02-21T09:05:45.4501918Z %160 = arith.addi %159, %158 : tensor<128xi32> 2026-02-21T09:05:45.4502176Z %161 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:05:45.4502455Z %162 = arith.muli %161, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:45.4502722Z %163 = tt.expand_dims %160 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:45.4503034Z %164 = tt.broadcast %162 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.4503316Z %165 = tt.broadcast %163 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.4503560Z %166 = arith.addi %164, %165 : tensor<64x128xi32> 2026-02-21T09:05:45.4503816Z %167 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.4504105Z %168 = tt.addptr %167, %166 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:45.4504427Z %169 = tt.load %168 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.4504728Z %170 = arith.extf %169 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:45.4505026Z %171 = tt.expand_dims %156 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.4505301Z %172 = arith.muli %171, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:45.4505555Z %173 = tt.expand_dims %13 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:45.4505850Z %174 = tt.broadcast %172 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:45.4506114Z %175 = tt.broadcast %173 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:45.4506361Z %176 = arith.addi %174, %175 : tensor<64x64xi32> 2026-02-21T09:05:45.4506599Z %177 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:45.4506882Z %178 = tt.addptr %177, %176 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:45.4507196Z %179 = tt.load %178 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:05:45.4507500Z %180 = arith.shli %179, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4507736Z %181 = arith.shrsi %180, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4507966Z %182 = arith.shrsi %179, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4508230Z %183 = tt.make_range {end = 2 : 
i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:45.4508545Z %184 = tt.expand_dims %183 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:45.4508902Z %185 = tt.expand_dims %184 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:45.4509283Z %186 = tt.expand_dims %181 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:45.4509623Z %187 = tt.expand_dims %182 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:45.4509930Z %188 = arith.cmpi eq, %185, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:45.4510197Z %189 = tt.broadcast %188 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:45.4510490Z %190 = tt.broadcast %186 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:45.4510798Z %191 = arith.select %189, %190, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:45.4511084Z %192 = arith.cmpi eq, %185, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:45.4511357Z %193 = tt.broadcast %187 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:45.4511699Z %194 = tt.broadcast %192 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:45.4512001Z %195 = arith.select %194, %193, %191 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:45.4512297Z %196 = tt.reshape %195 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:05:45.4512585Z %197 = arith.sitofp %196 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:05:45.4512970Z %198 = tt.dot %170, %197, %152, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:45.4513307Z scf.yield %198 : tensor<64x64xf32> 2026-02-21T09:05:45.4513510Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:45.4513718Z %18 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:05:45.4513943Z %19 = arith.addi %18, %11 : tensor<64xi32> 2026-02-21T09:05:45.4514144Z %20 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:05:45.4514398Z %21 = tt.make_range {end = 128 : 
i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:45.4514665Z %22 = tt.splat %20 : i32 -> tensor<128xi32> 2026-02-21T09:05:45.4514870Z %23 = arith.addi %22, %21 : tensor<128xi32> 2026-02-21T09:05:45.4515133Z %24 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.4515407Z %25 = arith.muli %24, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:45.4515681Z %26 = tt.expand_dims %23 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:45.4516030Z %27 = tt.broadcast %25 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.4516293Z %28 = tt.broadcast %26 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:45.4516537Z %29 = arith.addi %27, %28 : tensor<64x128xi32> 2026-02-21T09:05:45.4516778Z %30 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.4517066Z %31 = tt.addptr %30, %29 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:45.4517365Z %32 = tt.load %31 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:45.4517662Z %33 = arith.extf %32 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:45.4517944Z %34 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:45.4518197Z %35 = arith.muli %34, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:45.4518447Z %36 = tt.expand_dims %13 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:45.4518724Z %37 = tt.broadcast %35 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:45.4518986Z %38 = tt.broadcast %36 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:45.4519239Z %39 = arith.addi %37, %38 : tensor<64x64xi32> 2026-02-21T09:05:45.4519475Z %40 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:45.4519748Z %41 = tt.addptr %40, %39 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:45.4520035Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:05:45.4520300Z %43 = 
arith.shli %42, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4520534Z %44 = arith.shrsi %43, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4520752Z %45 = arith.shrsi %42, %cst_2 : tensor<64x64xi8> 2026-02-21T09:05:45.4521010Z %46 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:45.4521297Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:45.4521635Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:45.4521943Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:45.4522251Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:05:45.4522521Z %51 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:45.4522771Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:45.4523067Z %53 = tt.broadcast %49 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:45.4523341Z %54 = arith.select %52, %53, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:45.4523610Z %55 = arith.cmpi eq, %48, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:45.4523852Z %56 = tt.broadcast %50 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:05:45.4524113Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:05:45.4524377Z %58 = arith.select %57, %56, %54 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:05:45.4524654Z %59 = tt.reshape %58 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:05:45.4524916Z %60 = arith.sitofp %59 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:05:45.4525255Z %61 = tt.dot %33, %60, %17, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:45.4525608Z %62 = arith.truncf %61 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:05:45.4525920Z tt.descriptor_store %0[%14, %10], %62 : !tt.tensordesc>, 
tensor<64x64xbf16> 2026-02-21T09:05:45.4526197Z tt.return 2026-02-21T09:05:45.4526328Z } 2026-02-21T09:05:45.4526454Z } 2026-02-21T09:05:45.4526525Z 2026-02-21T09:05:45.4526582Z {-# 2026-02-21T09:05:45.4526711Z external_resources: { 2026-02-21T09:05:45.4526877Z mlir_reproducer: { 2026-02-21T09:05:45.4531160Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:45.4535804Z disable_threading: false, 2026-02-21T09:05:45.4535972Z verify_each: true 2026-02-21T09:05:45.4536152Z } 2026-02-21T09:05:45.4536274Z } 2026-02-21T09:05:45.4536399Z #-} 2026-02-21T09:05:45.4536825Z /tmp/torchinductor_root/go/cgox4ptrz4pt62y6hkrzqwqlkzkfewa6irn755jdgunn6gik2xru.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:45.4537834Z /tmp/torchinductor_root/go/cgox4ptrz4pt62y6hkrzqwqlkzkfewa6irn755jdgunn6gik2xru.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:45.4538653Z [310s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:45.4539758Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:45.4540689Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:45.4540965Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:45.6870533Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 15.5 configs/s 2026-02-21T09:05:46.7792871Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 898.9 2026-02-21T09:05:46.7794298Z configs/s 2026-02-21T09:05:46.8402821Z [312s] Generation 16 complete: 2026-02-21T09:05:46.8407338Z error=5 2026-02-21T09:05:46.8409125Z ok=28 2026-02-21T09:05:46.8409295Z min=0.1076 2026-02-21T09:05:46.8409438Z mid=0.2490 2026-02-21T09:05:46.8409570Z max=6.8311 2026-02-21T09:05:46.8409711Z best={'block_sizes': [64, 64, 16], 2026-02-21T09:05:46.8409937Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:05:46.8410145Z 'l2_groupings': [1], 2026-02-21T09:05:46.8410319Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:05:46.8410507Z 'loop_orders': [[0, 1]], 2026-02-21T09:05:46.8410671Z 'num_stages': 3, 2026-02-21T09:05:46.8410812Z 'num_warps': 1, 2026-02-21T09:05:46.8410963Z 'pid_type': 'flat', 2026-02-21T09:05:46.8411127Z 'range_flattens': [None, None], 2026-02-21T09:05:46.8411315Z 'range_multi_buffers': [None, True], 2026-02-21T09:05:46.8411511Z 'range_num_stages': [0, 0], 2026-02-21T09:05:46.8411903Z 'range_unroll_factors': [0, 0], 2026-02-21T09:05:46.8412101Z 'range_warp_specializes': [None, None]} 2026-02-21T09:05:46.8440945Z [312s] Fitting surrogate: 1173 points, 1173 targets 2026-02-21T09:05:47.4247522Z [312s] Generation 17 starting: 32 neighbors, 2 active search path(s) 2026-02-21T09:05:51.1397437Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 5.7 configs/s 2026-02-21T09:05:52.2072680Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:52.2076432Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:52.2078485Z ^ 2026-02-21T09:05:52.2078972Z /tmp/torchinductor_root/k2/ck2iuhpsqlrqipnw5mefxom6jvlao72qgmdzdeho2kc7tzs6jl6h.py:84:40: note: called from 
2026-02-21T09:05:52.2079688Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:52.2079932Z ^ 2026-02-21T09:05:52.2080394Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:52.2080963Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:52.2081356Z ^ 2026-02-21T09:05:52.2081762Z module { 2026-02-21T09:05:52.2082401Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:52.2083051Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:05:52.2083288Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:52.2083513Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:52.2083742Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:05:52.2084000Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:52.2084276Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:52.2084531Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:05:52.2084878Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:52.2085145Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:52.2085453Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:52.2085686Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:52.2085879Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:52.2086063Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:52.2086253Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:52.2086448Z %0 = tt.get_program_id x : i32 2026-02-21T09:05:52.2086668Z scf.for %arg3 = %0 to %c224_i32 step %c9472_i32 : i32 { 2026-02-21T09:05:52.2086911Z %1 = arith.divsi %arg3, %c2_i32 : i32 2026-02-21T09:05:52.2087106Z %2 = arith.muli %1, %c2_i32 : i32 
2026-02-21T09:05:52.2087299Z %3 = arith.subi %c224_i32, %2 : i32 2026-02-21T09:05:52.2087483Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:05:52.2087680Z %5 = arith.remsi %arg3, %c2_i32 : i32 2026-02-21T09:05:52.2087867Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:05:52.2088055Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:05:52.2088239Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:05:52.2088415Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:05:52.2088660Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:52.2088921Z %11 = tt.splat %9 : i32 -> tensor<32xi32> 2026-02-21T09:05:52.2089138Z %12 = arith.addi %11, %10 : tensor<32xi32> 2026-02-21T09:05:52.2089334Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:52.2089575Z %14 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:52.2089823Z %15 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.2090028Z %16 = arith.addi %15, %14 : tensor<64xi32> 2026-02-21T09:05:52.2090231Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:52.2090421Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:05:52.2090756Z %17 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:52.2091099Z %71 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.2091324Z %72 = arith.addi %71, %14 : tensor<64xi32> 2026-02-21T09:05:52.2091526Z %73 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:52.2091821Z %74 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:52.2092084Z %75 = tt.splat %73 : i32 -> tensor<128xi32> 2026-02-21T09:05:52.2092288Z %76 = arith.addi %75, %74 : tensor<128xi32> 2026-02-21T09:05:52.2092609Z %77 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.2092884Z %78 = arith.muli %77, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:52.2093151Z %79 = tt.expand_dims %76 {axis = 0 : i32} : 
tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:52.2093447Z %80 = tt.broadcast %78 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.2093718Z %81 = tt.broadcast %79 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.2094018Z %82 = arith.addi %80, %81 : tensor<64x128xi32> 2026-02-21T09:05:52.2094295Z %83 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.2094594Z %84 = tt.addptr %83, %82 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:52.2094904Z %85 = tt.load %84 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.2095211Z %86 = arith.extf %85 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:52.2095509Z %87 = tt.expand_dims %72 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.2095774Z %88 = arith.muli %87, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.2096032Z %89 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:52.2096356Z %90 = tt.broadcast %88 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2096622Z %91 = tt.broadcast %89 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2096853Z %92 = arith.addi %90, %91 : tensor<64x32xi32> 2026-02-21T09:05:52.2097095Z %93 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2097368Z %94 = tt.addptr %93, %92 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.2097659Z %95 = tt.load %94 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2097924Z %96 = arith.shli %95, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2098134Z %97 = arith.shrsi %96, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2098352Z %98 = arith.shrsi %95, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2098588Z %99 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:52.2098882Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:52.2099235Z %101 = 
tt.expand_dims %100 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:52.2099567Z %102 = tt.expand_dims %97 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.2099893Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.2100183Z %104 = arith.cmpi eq, %101, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:52.2100447Z %105 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.2100723Z %106 = tt.broadcast %102 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.2101024Z %107 = arith.select %105, %106, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.2101301Z %108 = arith.cmpi eq, %101, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:52.2101596Z %109 = tt.broadcast %103 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.2101877Z %110 = tt.broadcast %108 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.2102176Z %111 = arith.select %110, %109, %107 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.2102476Z %112 = tt.reshape %111 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:52.2102753Z %113 = arith.sitofp %112 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:52.2103137Z %114 = tt.dot %86, %113, %arg5, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:52.2103471Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:52.2103717Z %115 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:05:52.2103922Z %116 = arith.addi %arg4, %115 : i32 2026-02-21T09:05:52.2104132Z %117 = tt.splat %116 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.2104346Z %118 = arith.addi %117, %14 : tensor<64xi32> 2026-02-21T09:05:52.2104545Z %119 = arith.muli %116, %c2_i32 : i32 2026-02-21T09:05:52.2104795Z %120 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:52.2105083Z %121 = tt.splat %119 : i32 -> tensor<128xi32> 
2026-02-21T09:05:52.2105297Z %122 = arith.addi %121, %120 : tensor<128xi32> 2026-02-21T09:05:52.2105579Z %123 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.2105866Z %124 = arith.muli %123, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:52.2106143Z %125 = tt.expand_dims %122 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:52.2106445Z %126 = tt.broadcast %124 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.2106730Z %127 = tt.broadcast %125 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.2106980Z %128 = arith.addi %126, %127 : tensor<64x128xi32> 2026-02-21T09:05:52.2107243Z %129 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.2107574Z %130 = tt.addptr %129, %128 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:52.2107896Z %131 = tt.load %130 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.2108210Z %132 = arith.extf %131 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:52.2108512Z %133 = tt.expand_dims %118 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.2108795Z %134 = arith.muli %133, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.2109057Z %135 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:52.2109358Z %136 = tt.broadcast %134 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2109635Z %137 = tt.broadcast %135 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2109882Z %138 = arith.addi %136, %137 : tensor<64x32xi32> 2026-02-21T09:05:52.2110131Z %139 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2110416Z %140 = tt.addptr %139, %138 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.2110734Z %141 = tt.load %140 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2111016Z %142 = arith.shli %141, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2111240Z %143 = 
arith.shrsi %142, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2111472Z %144 = arith.shrsi %141, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2111746Z %145 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:52.2112049Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:52.2112368Z %147 = tt.expand_dims %146 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:52.2112711Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.2113051Z %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.2113338Z %150 = arith.cmpi eq, %147, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:52.2113605Z %151 = tt.broadcast %150 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.2113884Z %152 = tt.broadcast %148 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.2114185Z %153 = arith.select %151, %152, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.2114473Z %154 = arith.cmpi eq, %147, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:52.2114734Z %155 = tt.broadcast %149 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.2115042Z %156 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.2115323Z %157 = arith.select %156, %155, %153 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.2115620Z %158 = tt.reshape %157 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:52.2115889Z %159 = arith.sitofp %158 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:52.2116307Z %160 = tt.dot %132, %159, %114, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:52.2116654Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:05:52.2116897Z %161 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:05:52.2117103Z %162 = arith.addi %arg4, %161 : i32 
2026-02-21T09:05:52.2117307Z %163 = tt.splat %162 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.2117524Z %164 = arith.addi %163, %14 : tensor<64xi32> 2026-02-21T09:05:52.2117728Z %165 = arith.muli %162, %c2_i32 : i32 2026-02-21T09:05:52.2117980Z %166 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:52.2118247Z %167 = tt.splat %165 : i32 -> tensor<128xi32> 2026-02-21T09:05:52.2118458Z %168 = arith.addi %167, %166 : tensor<128xi32> 2026-02-21T09:05:52.2118728Z %169 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.2119068Z %170 = arith.muli %169, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:52.2119348Z %171 = tt.expand_dims %168 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:52.2119672Z %172 = tt.broadcast %170 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.2119954Z %173 = tt.broadcast %171 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.2120204Z %174 = arith.addi %172, %173 : tensor<64x128xi32> 2026-02-21T09:05:52.2120462Z %175 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.2120761Z %176 = tt.addptr %175, %174 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:52.2121102Z %177 = tt.load %176 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.2121434Z %178 = arith.extf %177 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:52.2121768Z %179 = tt.expand_dims %164 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.2122067Z %180 = arith.muli %179, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.2122337Z %181 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:52.2122656Z %182 = tt.broadcast %180 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2122939Z %183 = tt.broadcast %181 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2123205Z %184 = arith.addi %182, %183 : tensor<64x32xi32> 
2026-02-21T09:05:52.2123465Z %185 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2123763Z %186 = tt.addptr %185, %184 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.2124092Z %187 = tt.load %186 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2124379Z %188 = arith.shli %187, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2124622Z %189 = arith.shrsi %188, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2124865Z %190 = arith.shrsi %187, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2125124Z %191 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:52.2125442Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:52.2125777Z %193 = tt.expand_dims %192 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:52.2126133Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.2126525Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.2126831Z %196 = arith.cmpi eq, %193, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:52.2127099Z %197 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.2127384Z %198 = tt.broadcast %194 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.2127696Z %199 = arith.select %197, %198, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.2128021Z %200 = arith.cmpi eq, %193, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:52.2128297Z %201 = tt.broadcast %195 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.2128622Z %202 = tt.broadcast %200 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.2128919Z %203 = arith.select %202, %201, %199 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.2129229Z %204 = tt.reshape %203 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:52.2129520Z %205 = arith.sitofp %204 : 
tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:52.2129917Z %206 = tt.dot %178, %205, %160, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:52.2130261Z scf.yield %206 : tensor<64x32xf32> 2026-02-21T09:05:52.2130478Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:52.2130710Z %18 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.2130927Z %19 = arith.addi %18, %14 : tensor<64xi32> 2026-02-21T09:05:52.2131135Z %20 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:05:52.2131378Z %21 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:52.2131665Z %22 = tt.splat %20 : i32 -> tensor<128xi32> 2026-02-21T09:05:52.2131864Z %23 = arith.addi %22, %21 : tensor<128xi32> 2026-02-21T09:05:52.2132126Z %24 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.2132403Z %25 = arith.muli %24, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:52.2132663Z %26 = tt.expand_dims %23 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:52.2132970Z %27 = tt.broadcast %25 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.2133239Z %28 = tt.broadcast %26 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.2133490Z %29 = arith.addi %27, %28 : tensor<64x128xi32> 2026-02-21T09:05:52.2133735Z %30 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.2134033Z %31 = tt.addptr %30, %29 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:52.2134351Z %32 = tt.load %31 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.2134655Z %33 = arith.extf %32 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:52.2134957Z %34 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.2135227Z %35 = arith.muli %34, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.2135489Z %36 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> 
tensor<1x32xi32> 2026-02-21T09:05:52.2135782Z %37 = tt.broadcast %35 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2136044Z %38 = tt.broadcast %36 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2136286Z %39 = arith.addi %37, %38 : tensor<64x32xi32> 2026-02-21T09:05:52.2136522Z %40 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2136799Z %41 = tt.addptr %40, %39 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.2137101Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2137372Z %43 = arith.shli %42, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2137594Z %44 = arith.shrsi %43, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2137808Z %45 = arith.shrsi %42, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.2138099Z %46 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:52.2138391Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:52.2138708Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:52.2139027Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.2139380Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.2139666Z %51 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:52.2139935Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.2140204Z %53 = tt.broadcast %49 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.2140481Z %54 = arith.select %52, %53, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.2140750Z %55 = arith.cmpi eq, %48, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:52.2140996Z %56 = tt.broadcast %50 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.2141267Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 
2026-02-21T09:05:52.2141576Z %58 = arith.select %57, %56, %54 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.2141874Z %59 = tt.reshape %58 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:52.2142141Z %60 = arith.sitofp %59 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:52.2142494Z %61 = tt.dot %33, %60, %17, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:52.2142854Z %62 = arith.truncf %61 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:52.2143143Z %63 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.2143405Z %64 = arith.muli %63, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.2143665Z %65 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:52.2143946Z %66 = tt.broadcast %64 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2144210Z %67 = tt.broadcast %65 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.2144444Z %68 = arith.addi %66, %67 : tensor<64x32xi32> 2026-02-21T09:05:52.2144694Z %69 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2144987Z %70 = tt.addptr %69, %68 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.2145248Z tt.store %70, %62 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.2145499Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:05:52.2145711Z tt.return 2026-02-21T09:05:52.2145850Z } 2026-02-21T09:05:52.2145976Z } 2026-02-21T09:05:52.2146057Z 2026-02-21T09:05:52.2146111Z {-# 2026-02-21T09:05:52.2146244Z external_resources: { 2026-02-21T09:05:52.2146418Z mlir_reproducer: { 2026-02-21T09:05:52.2150894Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, 
tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:52.2155530Z disable_threading: false, 2026-02-21T09:05:52.2155718Z verify_each: true 2026-02-21T09:05:52.2155862Z } 2026-02-21T09:05:52.2155989Z } 2026-02-21T09:05:52.2156105Z #-} 2026-02-21T09:05:52.2156548Z /tmp/torchinductor_root/k2/ck2iuhpsqlrqipnw5mefxom6jvlao72qgmdzdeho2kc7tzs6jl6h.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:52.2157658Z /tmp/torchinductor_root/k2/ck2iuhpsqlrqipnw5mefxom6jvlao72qgmdzdeho2kc7tzs6jl6h.py:14:0: note: 
Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:52.2158490Z [317s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:52.2159642Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[False, None]), static_shapes=True) 2026-02-21T09:05:52.2160677Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:52.2160932Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:05:52.4317217Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:52.4322473Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:52.4326095Z ^ 2026-02-21T09:05:52.4327896Z /tmp/torchinductor_root/ss/css6ucdhzkgbqf6vpzc3tfbmpfnmbxa2jsdqu66m3mv4vec5wviv.py:78:36: note: called from 2026-02-21T09:05:52.4328343Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:52.4328552Z ^ 2026-02-21T09:05:52.4328961Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:52.4329431Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:52.4329682Z ^ 2026-02-21T09:05:52.4329846Z module { 2026-02-21T09:05:52.4330356Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: 
!tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:52.4330914Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:05:52.4331132Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:52.4331356Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:52.4331676Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:52.4331910Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:52.4332147Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:05:52.4332383Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:52.4332859Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:52.4333121Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:52.4333355Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:52.4333543Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:52.4333722Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:52.4333913Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:52.4334169Z %0 = tt.get_program_id x : i32 2026-02-21T09:05:52.4334344Z %1 = arith.divsi %0, %c2_i32 : i32 2026-02-21T09:05:52.4334532Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:05:52.4334751Z %3 = arith.subi %c224_i32, %2 : i32 2026-02-21T09:05:52.4334940Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:05:52.4335114Z %5 = arith.remsi %0, %c2_i32 : i32 2026-02-21T09:05:52.4335295Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:05:52.4335469Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:05:52.4335649Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:05:52.4335826Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:05:52.4336064Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:52.4336321Z %11 = tt.splat %9 : i32 -> tensor<32xi32> 2026-02-21T09:05:52.4336518Z %12 = arith.addi %11, %10 : tensor<32xi32> 2026-02-21T09:05:52.4336718Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:52.4336990Z 
%14 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:52.4337237Z %15 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.4337430Z %16 = arith.addi %15, %14 : tensor<64xi32> 2026-02-21T09:05:52.4337630Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:05:52.4337822Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:05:52.4338146Z %17 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:52.4338492Z %71 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.4338702Z %72 = arith.addi %71, %14 : tensor<64xi32> 2026-02-21T09:05:52.4338904Z %73 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:52.4339138Z %74 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:52.4339401Z %75 = tt.splat %73 : i32 -> tensor<128xi32> 2026-02-21T09:05:52.4339608Z %76 = arith.addi %75, %74 : tensor<128xi32> 2026-02-21T09:05:52.4339867Z %77 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.4340147Z %78 = arith.muli %77, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:52.4340403Z %79 = tt.expand_dims %76 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:52.4340705Z %80 = tt.broadcast %78 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.4340968Z %81 = tt.broadcast %79 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.4341215Z %82 = arith.addi %80, %81 : tensor<64x128xi32> 2026-02-21T09:05:52.4341466Z %83 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.4341790Z %84 = tt.addptr %83, %82 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:52.4342101Z %85 = tt.load %84 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.4342396Z %86 = arith.extf %85 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:52.4342684Z %87 = tt.expand_dims %72 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:05:52.4342955Z %88 = arith.muli %87, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.4343209Z %89 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:52.4343497Z %90 = tt.broadcast %88 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4343753Z %91 = tt.broadcast %89 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4343990Z %92 = arith.addi %90, %91 : tensor<64x32xi32> 2026-02-21T09:05:52.4344266Z %93 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4344545Z %94 = tt.addptr %93, %92 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.4344847Z %95 = tt.load %94 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4345107Z %96 = arith.shli %95, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4345326Z %97 = arith.shrsi %96, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4345570Z %98 = arith.shrsi %95, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4345819Z %99 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:52.4346141Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:52.4346465Z %101 = tt.expand_dims %100 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:52.4346799Z %102 = tt.expand_dims %97 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.4347120Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.4347424Z %104 = arith.cmpi eq, %101, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:52.4347691Z %105 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.4347986Z %106 = tt.broadcast %102 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.4348331Z %107 = arith.select %105, %106, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.4348624Z %108 = arith.cmpi eq, %101, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:52.4348913Z 
%109 = tt.broadcast %103 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.4349193Z %110 = tt.broadcast %108 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.4349492Z %111 = arith.select %110, %109, %107 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.4349785Z %112 = tt.reshape %111 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:52.4350073Z %113 = arith.sitofp %112 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:52.4350466Z %114 = tt.dot %86, %113, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:52.4350810Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:52.4351017Z %115 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:05:52.4351218Z %116 = arith.addi %arg3, %115 : i32 2026-02-21T09:05:52.4351429Z %117 = tt.splat %116 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.4351674Z %118 = arith.addi %117, %14 : tensor<64xi32> 2026-02-21T09:05:52.4351885Z %119 = arith.muli %116, %c2_i32 : i32 2026-02-21T09:05:52.4352140Z %120 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:52.4352404Z %121 = tt.splat %119 : i32 -> tensor<128xi32> 2026-02-21T09:05:52.4352630Z %122 = arith.addi %121, %120 : tensor<128xi32> 2026-02-21T09:05:52.4352895Z %123 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.4353188Z %124 = arith.muli %123, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:52.4353465Z %125 = tt.expand_dims %122 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:52.4353788Z %126 = tt.broadcast %124 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.4354080Z %127 = tt.broadcast %125 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.4354338Z %128 = arith.addi %126, %127 : tensor<64x128xi32> 2026-02-21T09:05:52.4354609Z %129 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.4354920Z %130 = tt.addptr %129, %128 : 
tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:52.4355257Z %131 = tt.load %130 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.4355570Z %132 = arith.extf %131 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:52.4355913Z %133 = tt.expand_dims %118 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.4356197Z %134 = arith.muli %133, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.4356465Z %135 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:52.4356774Z %136 = tt.broadcast %134 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4357050Z %137 = tt.broadcast %135 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4357337Z %138 = arith.addi %136, %137 : tensor<64x32xi32> 2026-02-21T09:05:52.4357610Z %139 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4357927Z %140 = tt.addptr %139, %138 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.4358260Z %141 = tt.load %140 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4358530Z %142 = arith.shli %141, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4358757Z %143 = arith.shrsi %142, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4358978Z %144 = arith.shrsi %141, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4359231Z %145 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:52.4359531Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:52.4359872Z %147 = tt.expand_dims %146 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:52.4360207Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.4360531Z %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.4360826Z %150 = arith.cmpi eq, %147, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:52.4361079Z %151 = 
tt.broadcast %150 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.4361363Z %152 = tt.broadcast %148 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.4361771Z %153 = arith.select %151, %152, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.4362047Z %154 = arith.cmpi eq, %147, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:52.4362313Z %155 = tt.broadcast %149 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.4362590Z %156 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.4362890Z %157 = arith.select %156, %155, %153 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.4363189Z %158 = tt.reshape %157 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:52.4363462Z %159 = arith.sitofp %158 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:52.4363847Z %160 = tt.dot %132, %159, %114, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:52.4364179Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:05:52.4364394Z %161 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:05:52.4364597Z %162 = arith.addi %arg3, %161 : i32 2026-02-21T09:05:52.4364801Z %163 = tt.splat %162 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.4365011Z %164 = arith.addi %163, %14 : tensor<64xi32> 2026-02-21T09:05:52.4365204Z %165 = arith.muli %162, %c2_i32 : i32 2026-02-21T09:05:52.4365446Z %166 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:52.4365698Z %167 = tt.splat %165 : i32 -> tensor<128xi32> 2026-02-21T09:05:52.4365912Z %168 = arith.addi %167, %166 : tensor<128xi32> 2026-02-21T09:05:52.4366166Z %169 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.4366443Z %170 = arith.muli %169, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:52.4366711Z %171 = tt.expand_dims %168 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:52.4367009Z %172 = 
tt.broadcast %170 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.4367287Z %173 = tt.broadcast %171 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.4367559Z %174 = arith.addi %172, %173 : tensor<64x128xi32> 2026-02-21T09:05:52.4367807Z %175 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.4368099Z %176 = tt.addptr %175, %174 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:52.4368418Z %177 = tt.load %176 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.4368747Z %178 = arith.extf %177 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:52.4369038Z %179 = tt.expand_dims %164 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.4369357Z %180 = arith.muli %179, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.4369615Z %181 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:52.4369905Z %182 = tt.broadcast %180 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4370171Z %183 = tt.broadcast %181 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4370412Z %184 = arith.addi %182, %183 : tensor<64x32xi32> 2026-02-21T09:05:52.4370657Z %185 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4370931Z %186 = tt.addptr %185, %184 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.4371284Z %187 = tt.load %186 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4371585Z %188 = arith.shli %187, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4371811Z %189 = arith.shrsi %188, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4372036Z %190 = arith.shrsi %187, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4372278Z %191 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:52.4372596Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:52.4372911Z %193 = tt.expand_dims %192 {axis = 2 : i32} : 
tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:52.4373232Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.4373560Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.4373842Z %196 = arith.cmpi eq, %193, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:52.4374103Z %197 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.4374382Z %198 = tt.broadcast %194 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.4374670Z %199 = arith.select %197, %198, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.4374952Z %200 = arith.cmpi eq, %193, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:52.4375203Z %201 = tt.broadcast %195 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.4375481Z %202 = tt.broadcast %200 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.4375762Z %203 = arith.select %202, %201, %199 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.4376051Z %204 = tt.reshape %203 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:52.4376328Z %205 = arith.sitofp %204 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:52.4376683Z %206 = tt.dot %178, %205, %160, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:52.4377010Z scf.yield %206 : tensor<64x32xf32> 2026-02-21T09:05:52.4377181Z } 2026-02-21T09:05:52.4377339Z %18 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:05:52.4377545Z %19 = arith.addi %18, %14 : tensor<64xi32> 2026-02-21T09:05:52.4377747Z %20 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:05:52.4377986Z %21 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:05:52.4378230Z %22 = tt.splat %20 : i32 -> tensor<128xi32> 2026-02-21T09:05:52.4378435Z %23 = arith.addi %22, %21 : tensor<128xi32> 2026-02-21T09:05:52.4378703Z %24 = tt.expand_dims %16 {axis = 1 : 
i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.4378972Z %25 = arith.muli %24, %cst_4 : tensor<64x1xi32> 2026-02-21T09:05:52.4379224Z %26 = tt.expand_dims %23 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:05:52.4379521Z %27 = tt.broadcast %25 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.4379792Z %28 = tt.broadcast %26 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:05:52.4380046Z %29 = arith.addi %27, %28 : tensor<64x128xi32> 2026-02-21T09:05:52.4380291Z %30 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.4380590Z %31 = tt.addptr %30, %29 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:05:52.4380897Z %32 = tt.load %31 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:05:52.4381181Z %33 = arith.extf %32 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:05:52.4381464Z %34 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.4381760Z %35 = arith.muli %34, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.4382005Z %36 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:52.4382290Z %37 = tt.broadcast %35 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4382572Z %38 = tt.broadcast %36 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4382808Z %39 = arith.addi %37, %38 : tensor<64x32xi32> 2026-02-21T09:05:52.4383036Z %40 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4383312Z %41 = tt.addptr %40, %39 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.4383606Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4383865Z %43 = arith.shli %42, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4384080Z %44 = arith.shrsi %43, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4384287Z %45 = arith.shrsi %42, %cst_2 : tensor<64x32xi8> 2026-02-21T09:05:52.4384528Z %46 = tt.make_range {end = 2 : 
i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:52.4384803Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:52.4385105Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:52.4385417Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.4385728Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:05:52.4386006Z %51 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:52.4386245Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.4386514Z %53 = tt.broadcast %49 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.4386794Z %54 = arith.select %52, %53, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.4387055Z %55 = arith.cmpi eq, %48, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:05:52.4387302Z %56 = tt.broadcast %50 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:05:52.4387556Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:05:52.4387824Z %58 = arith.select %57, %56, %54 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:05:52.4388094Z %59 = tt.reshape %58 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:05:52.4388360Z %60 = arith.sitofp %59 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:05:52.4388712Z %61 = tt.dot %33, %60, %17, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:52.4389058Z %62 = arith.truncf %61 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:52.4389345Z %63 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:52.4389606Z %64 = arith.muli %63, %cst_3 : tensor<64x1xi32> 2026-02-21T09:05:52.4389891Z %65 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:52.4390170Z %66 = tt.broadcast %64 : 
tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4390427Z %67 = tt.broadcast %65 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:52.4390676Z %68 = arith.addi %66, %67 : tensor<64x32xi32> 2026-02-21T09:05:52.4390927Z %69 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4391247Z %70 = tt.addptr %69, %68 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:52.4391513Z tt.store %70, %62 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:52.4391743Z tt.return 2026-02-21T09:05:52.4391903Z } 2026-02-21T09:05:52.4392038Z } 2026-02-21T09:05:52.4392110Z 2026-02-21T09:05:52.4392170Z {-# 2026-02-21T09:05:52.4392306Z external_resources: { 2026-02-21T09:05:52.4392480Z mlir_reproducer: { 2026-02-21T09:05:52.4397127Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, 
tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:52.4401612Z disable_threading: false, 2026-02-21T09:05:52.4401781Z verify_each: true 2026-02-21T09:05:52.4401934Z } 2026-02-21T09:05:52.4402052Z } 2026-02-21T09:05:52.4402172Z #-} 2026-02-21T09:05:52.4402591Z /tmp/torchinductor_root/ss/css6ucdhzkgbqf6vpzc3tfbmpfnmbxa2jsdqu66m3mv4vec5wviv.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:52.4403627Z /tmp/torchinductor_root/ss/css6ucdhzkgbqf6vpzc3tfbmpfnmbxa2jsdqu66m3mv4vec5wviv.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:52.4404464Z [317s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:52.4405499Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:52.4406399Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:52.4406690Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:53.1416319Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:53.1418004Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:53.1418389Z ^ 2026-02-21T09:05:53.1418903Z /tmp/torchinductor_root/si/csiov3l75kci2ehcwgzx4zjove3xyfj5ycid7upcvvg76k6h22hn.py:84:40: note: called from 2026-02-21T09:05:53.1419682Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:53.1420048Z ^ 2026-02-21T09:05:53.1420608Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:53.1421185Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:53.1421446Z ^ 2026-02-21T09:05:53.1421839Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T09:05:53.1422627Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:53.1423240Z %cst = arith.constant dense<0> : tensor<8x2x32xi8> 2026-02-21T09:05:53.1423464Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:05:53.1423663Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:53.1423877Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:53.1424059Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:05:53.1424281Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:53.1424519Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:53.1424752Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:53.1424984Z %cst_3 = arith.constant dense<4> : tensor<8x32xi8> 2026-02-21T09:05:53.1425213Z %cst_4 = arith.constant dense<7168> : tensor<8x1xi32> 2026-02-21T09:05:53.1425453Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 
2026-02-21T09:05:53.1425702Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:05:53.1425940Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:53.1426120Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:05:53.1426304Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:53.1426487Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:05:53.1426670Z %0 = tt.get_program_id x : i32 2026-02-21T09:05:53.1426886Z scf.for %arg3 = %0 to %c224_i32 step %c9472_i32 : i32 { 2026-02-21T09:05:53.1427104Z %1 = arith.divsi %arg3, %c2_i32 : i32 2026-02-21T09:05:53.1427297Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:05:53.1427474Z %3 = arith.subi %c224_i32, %2 : i32 2026-02-21T09:05:53.1427660Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:05:53.1427839Z %5 = arith.remsi %arg3, %c2_i32 : i32 2026-02-21T09:05:53.1428024Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:05:53.1428198Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:05:53.1428363Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:05:53.1428542Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:05:53.1428766Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:53.1429031Z %11 = tt.splat %9 : i32 -> tensor<32xi32> 2026-02-21T09:05:53.1429230Z %12 = arith.addi %11, %10 : tensor<32xi32> 2026-02-21T09:05:53.1429430Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:53.1429655Z %14 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:53.1429901Z %15 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:05:53.1430104Z %16 = arith.addi %15, %14 : tensor<64xi32> 2026-02-21T09:05:53.1430295Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:05:53.1430535Z %c24_i32 = arith.constant 24 : i32 2026-02-21T09:05:53.1430855Z %17 = scf.for %arg4 = %c0_i32 to %c4080_i32 step %c24_i32 iter_args(%arg5 = %cst_6) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:53.1431229Z %28 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 
2026-02-21T09:05:53.1431485Z %29 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:05:53.1431735Z %30 = arith.addi %29, %28 : tensor<8xi32> 2026-02-21T09:05:53.1431978Z %31 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:53.1432212Z %32 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:53.1432497Z %33 = tt.splat %31 : i32 -> tensor<16xi32> 2026-02-21T09:05:53.1432708Z %34 = arith.addi %33, %32 : tensor<16xi32> 2026-02-21T09:05:53.1432984Z %35 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.1433268Z %36 = arith.muli %35, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:53.1433552Z %37 = tt.expand_dims %34 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:53.1433864Z %38 = tt.broadcast %36 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:05:53.1434136Z %39 = tt.broadcast %37 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:05:53.1434391Z %40 = arith.addi %38, %39 : tensor<64x16xi32> 2026-02-21T09:05:53.1434695Z %41 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:05:53.1434988Z %42 = tt.addptr %41, %40 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:05:53.1435299Z %43 = tt.load %42 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:05:53.1435589Z %44 = arith.extf %43 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:05:53.1435873Z %45 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:05:53.1436136Z %46 = arith.muli %45, %cst_4 : tensor<8x1xi32> 2026-02-21T09:05:53.1436401Z %47 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:53.1436684Z %48 = tt.broadcast %46 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:05:53.1436947Z %49 = tt.broadcast %47 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:05:53.1437185Z %50 = arith.addi %48, %49 : tensor<8x32xi32> 2026-02-21T09:05:53.1437417Z %51 = tt.splat %arg1 : !tt.ptr -> 
tensor<8x32x!tt.ptr> 2026-02-21T09:05:53.1437695Z %52 = tt.addptr %51, %50 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:05:53.1437986Z %53 = tt.load %52 evictionPolicy = evict_first : tensor<8x32x!tt.ptr> 2026-02-21T09:05:53.1438258Z %54 = arith.shli %53, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1438470Z %55 = arith.shrsi %54, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1438689Z %56 = arith.shrsi %53, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1438933Z %57 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:53.1439228Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:53.1439555Z %59 = tt.expand_dims %58 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:53.1439879Z %60 = tt.expand_dims %55 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:05:53.1440205Z %61 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:05:53.1440497Z %62 = arith.cmpi eq, %59, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:53.1440752Z %63 = tt.broadcast %62 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:05:53.1441034Z %64 = tt.broadcast %60 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:05:53.1441321Z %65 = arith.select %63, %64, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:05:53.1441644Z %66 = arith.cmpi eq, %59, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:53.1441927Z %67 = tt.broadcast %61 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:05:53.1442202Z %68 = tt.broadcast %66 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:05:53.1442491Z %69 = arith.select %68, %67, %65 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:05:53.1442769Z %70 = tt.reshape %69 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:05:53.1443039Z %71 = arith.sitofp %70 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:05:53.1443436Z %72 = tt.dot %44, %71, %arg5, inputPrecision = tf32 : tensor<64x16xf32> 
* tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:53.1443780Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:53.1444019Z %73 = arith.muli %c8_i32, %c1_i32 : i32 2026-02-21T09:05:53.1444219Z %74 = arith.addi %arg4, %73 : i32 2026-02-21T09:05:53.1444464Z %75 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:05:53.1444717Z %76 = tt.splat %74 : i32 -> tensor<8xi32> 2026-02-21T09:05:53.1444935Z %77 = arith.addi %76, %75 : tensor<8xi32> 2026-02-21T09:05:53.1445134Z %78 = arith.muli %74, %c2_i32 : i32 2026-02-21T09:05:53.1445382Z %79 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:53.1445641Z %80 = tt.splat %78 : i32 -> tensor<16xi32> 2026-02-21T09:05:53.1445861Z %81 = arith.addi %80, %79 : tensor<16xi32> 2026-02-21T09:05:53.1446164Z %82 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.1446446Z %83 = arith.muli %82, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:53.1446751Z %84 = tt.expand_dims %81 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:53.1447057Z %85 = tt.broadcast %83 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:05:53.1447328Z %86 = tt.broadcast %84 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:05:53.1447578Z %87 = arith.addi %85, %86 : tensor<64x16xi32> 2026-02-21T09:05:53.1447840Z %88 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:05:53.1448141Z %89 = tt.addptr %88, %87 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:05:53.1448467Z %90 = tt.load %89 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:05:53.1448770Z %91 = arith.extf %90 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:05:53.1449068Z %92 = tt.expand_dims %77 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:05:53.1449347Z %93 = arith.muli %92, %cst_4 : tensor<8x1xi32> 2026-02-21T09:05:53.1449611Z %94 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 
2026-02-21T09:05:53.1449917Z %95 = tt.broadcast %93 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:05:53.1450173Z %96 = tt.broadcast %94 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:05:53.1450411Z %97 = arith.addi %95, %96 : tensor<8x32xi32> 2026-02-21T09:05:53.1450642Z %98 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:05:53.1450917Z %99 = tt.addptr %98, %97 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:05:53.1451225Z %100 = tt.load %99 evictionPolicy = evict_first : tensor<8x32x!tt.ptr> 2026-02-21T09:05:53.1451496Z %101 = arith.shli %100, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1451754Z %102 = arith.shrsi %101, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1451975Z %103 = arith.shrsi %100, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1452229Z %104 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:53.1452522Z %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:53.1452846Z %106 = tt.expand_dims %105 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:53.1453176Z %107 = tt.expand_dims %102 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:05:53.1453519Z %108 = tt.expand_dims %103 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:05:53.1453807Z %109 = arith.cmpi eq, %106, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:53.1454057Z %110 = tt.broadcast %109 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:05:53.1454335Z %111 = tt.broadcast %107 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:05:53.1454655Z %112 = arith.select %110, %111, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:05:53.1454928Z %113 = arith.cmpi eq, %106, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:53.1455209Z %114 = tt.broadcast %108 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:05:53.1455476Z %115 = tt.broadcast %113 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:05:53.1455759Z %116 = 
arith.select %115, %114, %112 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:05:53.1456042Z %117 = tt.reshape %116 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:05:53.1456313Z %118 = arith.sitofp %117 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:05:53.1456674Z %119 = tt.dot %91, %118, %72, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:53.1456991Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:53.1457193Z %120 = arith.muli %c8_i32, %c2_i32_7 : i32 2026-02-21T09:05:53.1457412Z %121 = arith.addi %arg4, %120 : i32 2026-02-21T09:05:53.1457656Z %122 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:05:53.1457898Z %123 = tt.splat %121 : i32 -> tensor<8xi32> 2026-02-21T09:05:53.1458114Z %124 = arith.addi %123, %122 : tensor<8xi32> 2026-02-21T09:05:53.1458322Z %125 = arith.muli %121, %c2_i32 : i32 2026-02-21T09:05:53.1458553Z %126 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:53.1458810Z %127 = tt.splat %125 : i32 -> tensor<16xi32> 2026-02-21T09:05:53.1459018Z %128 = arith.addi %127, %126 : tensor<16xi32> 2026-02-21T09:05:53.1459285Z %129 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.1459563Z %130 = arith.muli %129, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:53.1459834Z %131 = tt.expand_dims %128 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:53.1460135Z %132 = tt.broadcast %130 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:05:53.1460398Z %133 = tt.broadcast %131 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:05:53.1460649Z %134 = arith.addi %132, %133 : tensor<64x16xi32> 2026-02-21T09:05:53.1460895Z %135 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:05:53.1461191Z %136 = tt.addptr %135, %134 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:05:53.1461510Z %137 = tt.load %136 evictionPolicy = evict_last : 
tensor<64x16x!tt.ptr> 2026-02-21T09:05:53.1461838Z %138 = arith.extf %137 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:05:53.1462131Z %139 = tt.expand_dims %124 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:05:53.1462399Z %140 = arith.muli %139, %cst_4 : tensor<8x1xi32> 2026-02-21T09:05:53.1462663Z %141 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:53.1462950Z %142 = tt.broadcast %140 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:05:53.1463220Z %143 = tt.broadcast %141 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:05:53.1463470Z %144 = arith.addi %142, %143 : tensor<8x32xi32> 2026-02-21T09:05:53.1463706Z %145 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:05:53.1463989Z %146 = tt.addptr %145, %144 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:05:53.1464287Z %147 = tt.load %146 evictionPolicy = evict_first : tensor<8x32x!tt.ptr> 2026-02-21T09:05:53.1464596Z %148 = arith.shli %147, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1464815Z %149 = arith.shrsi %148, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1465038Z %150 = arith.shrsi %147, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1465291Z %151 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:53.1465587Z %152 = tt.expand_dims %151 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:53.1465956Z %153 = tt.expand_dims %152 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:53.1466315Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:05:53.1466647Z %155 = tt.expand_dims %150 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:05:53.1466941Z %156 = arith.cmpi eq, %153, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:53.1467193Z %157 = tt.broadcast %156 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:05:53.1467466Z %158 = tt.broadcast %154 : tensor<8x1x32xi8> -> 
tensor<8x2x32xi8> 2026-02-21T09:05:53.1467746Z %159 = arith.select %157, %158, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:05:53.1468022Z %160 = arith.cmpi eq, %153, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:53.1468307Z %161 = tt.broadcast %155 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:05:53.1468581Z %162 = tt.broadcast %160 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:05:53.1468866Z %163 = arith.select %162, %161, %159 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:05:53.1469145Z %164 = tt.reshape %163 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:05:53.1469416Z %165 = arith.sitofp %164 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:05:53.1469775Z %166 = tt.dot %138, %165, %119, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:53.1470104Z scf.yield %166 : tensor<64x32xf32> 2026-02-21T09:05:53.1470301Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:53.1470613Z %18 = scf.for %arg4 = %c4080_i32 to %c4096_i32 step %c8_i32 iter_args(%arg5 = %17) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:05:53.1470980Z %28 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:05:53.1471226Z %29 = tt.splat %arg4 : i32 -> tensor<8xi32> 2026-02-21T09:05:53.1471439Z %30 = arith.addi %29, %28 : tensor<8xi32> 2026-02-21T09:05:53.1471701Z %31 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:05:53.1471940Z %32 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:53.1472195Z %33 = tt.splat %31 : i32 -> tensor<16xi32> 2026-02-21T09:05:53.1472395Z %34 = arith.addi %33, %32 : tensor<16xi32> 2026-02-21T09:05:53.1472660Z %35 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.1472933Z %36 = arith.muli %35, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:53.1473215Z %37 = tt.expand_dims %34 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:05:53.1473511Z %38 = tt.broadcast %36 : 
tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:05:53.1473787Z %39 = tt.broadcast %37 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:05:53.1474039Z %40 = arith.addi %38, %39 : tensor<64x16xi32> 2026-02-21T09:05:53.1474285Z %41 = tt.splat %arg0 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:05:53.1474576Z %42 = tt.addptr %41, %40 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:05:53.1474876Z %43 = tt.load %42 evictionPolicy = evict_last : tensor<64x16x!tt.ptr> 2026-02-21T09:05:53.1475167Z %44 = arith.extf %43 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T09:05:53.1475448Z %45 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:05:53.1475737Z %46 = arith.muli %45, %cst_4 : tensor<8x1xi32> 2026-02-21T09:05:53.1475994Z %47 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:53.1476274Z %48 = tt.broadcast %46 : tensor<8x1xi32> -> tensor<8x32xi32> 2026-02-21T09:05:53.1476536Z %49 = tt.broadcast %47 : tensor<1x32xi32> -> tensor<8x32xi32> 2026-02-21T09:05:53.1476768Z %50 = arith.addi %48, %49 : tensor<8x32xi32> 2026-02-21T09:05:53.1477031Z %51 = tt.splat %arg1 : !tt.ptr -> tensor<8x32x!tt.ptr> 2026-02-21T09:05:53.1477301Z %52 = tt.addptr %51, %50 : tensor<8x32x!tt.ptr>, tensor<8x32xi32> 2026-02-21T09:05:53.1477615Z %53 = tt.load %52 evictionPolicy = evict_first : tensor<8x32x!tt.ptr> 2026-02-21T09:05:53.1477884Z %54 = arith.shli %53, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1478098Z %55 = arith.shrsi %54, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1478322Z %56 = arith.shrsi %53, %cst_3 : tensor<8x32xi8> 2026-02-21T09:05:53.1478564Z %57 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:53.1478866Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:53.1479183Z %59 = tt.expand_dims %58 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:53.1479531Z %60 = tt.expand_dims %55 
{axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:05:53.1479848Z %61 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8x32xi8> -> tensor<8x1x32xi8> 2026-02-21T09:05:53.1480124Z %62 = arith.cmpi eq, %59, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:53.1480376Z %63 = tt.broadcast %62 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:05:53.1480643Z %64 = tt.broadcast %60 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:05:53.1480918Z %65 = arith.select %63, %64, %cst : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:05:53.1481189Z %66 = arith.cmpi eq, %59, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:53.1481432Z %67 = tt.broadcast %61 : tensor<8x1x32xi8> -> tensor<8x2x32xi8> 2026-02-21T09:05:53.1481721Z %68 = tt.broadcast %66 : tensor<1x2x1xi1> -> tensor<8x2x32xi1> 2026-02-21T09:05:53.1481985Z %69 = arith.select %68, %67, %65 : tensor<8x2x32xi1>, tensor<8x2x32xi8> 2026-02-21T09:05:53.1482258Z %70 = tt.reshape %69 : tensor<8x2x32xi8> -> tensor<16x32xi8> 2026-02-21T09:05:53.1482518Z %71 = arith.sitofp %70 : tensor<16x32xi8> to tensor<16x32xf32> 2026-02-21T09:05:53.1482872Z %72 = tt.dot %44, %71, %arg5, inputPrecision = tf32 : tensor<64x16xf32> * tensor<16x32xf32> -> tensor<64x32xf32> 2026-02-21T09:05:53.1483210Z scf.yield %72 : tensor<64x32xf32> 2026-02-21T09:05:53.1483441Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:05:53.1483717Z %19 = arith.truncf %18 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:05:53.1484011Z %20 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.1484289Z %21 = arith.muli %20, %cst_0 : tensor<64x1xi32> 2026-02-21T09:05:53.1484560Z %22 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:53.1484854Z %23 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.1485126Z %24 = tt.broadcast %22 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.1485369Z %25 = arith.addi %23, 
%24 : tensor<64x32xi32> 2026-02-21T09:05:53.1485628Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:53.1485928Z %27 = tt.addptr %26, %25 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:53.1486197Z tt.store %27, %19 : tensor<64x32x!tt.ptr> 2026-02-21T09:05:53.1486460Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:05:53.1486684Z tt.return 2026-02-21T09:05:53.1486860Z } 2026-02-21T09:05:53.1486988Z } 2026-02-21T09:05:53.1487069Z 2026-02-21T09:05:53.1487123Z {-# 2026-02-21T09:05:53.1487260Z external_resources: { 2026-02-21T09:05:53.1487429Z mlir_reproducer: { 2026-02-21T09:05:53.1492042Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, 
triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:53.1496537Z disable_threading: false, 2026-02-21T09:05:53.1496717Z verify_each: true 2026-02-21T09:05:53.1496861Z } 2026-02-21T09:05:53.1496993Z } 2026-02-21T09:05:53.1497109Z #-} 2026-02-21T09:05:53.1497536Z /tmp/torchinductor_root/si/csiov3l75kci2ehcwgzx4zjove3xyfj5ycid7upcvvg76k6h22hn.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:53.1498549Z /tmp/torchinductor_root/si/csiov3l75kci2ehcwgzx4zjove3xyfj5ycid7upcvvg76k6h22hn.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:53.1499380Z [318s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:53.1500552Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 32], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[False, None]), static_shapes=True) 2026-02-21T09:05:53.1501635Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:53.1501893Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:53.4159691Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:05:53.4161673Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:53.4161995Z ^ 2026-02-21T09:05:53.4162404Z /tmp/torchinductor_root/77/c7734btywd4yibk4agst6w23dk2f3sdzkuspy7vf2z47bwlmgcxp.py:78:36: note: called from 2026-02-21T09:05:53.4162834Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:05:53.4163300Z ^ 2026-02-21T09:05:53.4163763Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:05:53.4164317Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:05:53.4164595Z ^ 2026-02-21T09:05:53.4164807Z module { 2026-02-21T09:05:53.4169100Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:05:53.4171082Z %cst = arith.constant dense<0> : tensor<16x2x64xi8> 2026-02-21T09:05:53.4171432Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:05:53.4176721Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:05:53.4178685Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:05:53.4178960Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:05:53.4179220Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:05:53.4179460Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:05:53.4179687Z %cst_3 = arith.constant dense<4> : tensor<16x64xi8> 2026-02-21T09:05:53.4179928Z %cst_4 = arith.constant dense<7168> : tensor<16x1xi32> 2026-02-21T09:05:53.4180341Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:05:53.4180612Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:05:53.4180866Z 
%c64_i32 = arith.constant 64 : i32 2026-02-21T09:05:53.4181053Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:05:53.4181246Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:05:53.4181436Z %0 = tt.get_program_id x : i32 2026-02-21T09:05:53.4181702Z %1 = arith.divsi %0, %c2_i32 : i32 2026-02-21T09:05:53.4181890Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:05:53.4182138Z %3 = arith.subi %c112_i32, %2 : i32 2026-02-21T09:05:53.4182326Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:05:53.4182517Z %5 = arith.remsi %0, %c2_i32 : i32 2026-02-21T09:05:53.4182695Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:05:53.4182875Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:05:53.4183043Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:05:53.4183218Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T09:05:53.4183446Z %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:05:53.4183710Z %11 = tt.splat %9 : i32 -> tensor<64xi32> 2026-02-21T09:05:53.4183917Z %12 = arith.addi %11, %10 : tensor<64xi32> 2026-02-21T09:05:53.4184103Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:05:53.4184298Z %14 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:05:53.4184490Z %15 = arith.addi %14, %10 : tensor<64xi32> 2026-02-21T09:05:53.4184686Z %c4080_i32 = arith.constant 4080 : i32 2026-02-21T09:05:53.4184870Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:05:53.4185199Z %16 = scf.for %arg3 = %c0_i32 to %c4080_i32 step %c48_i32 iter_args(%arg4 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:05:53.4185574Z %71 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:53.4185827Z %72 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T09:05:53.4186040Z %73 = arith.addi %72, %71 : tensor<16xi32> 2026-02-21T09:05:53.4186234Z %74 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:05:53.4186470Z %75 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:53.4186713Z %76 = tt.splat %74 : i32 -> tensor<32xi32> 
2026-02-21T09:05:53.4186915Z %77 = arith.addi %76, %75 : tensor<32xi32> 2026-02-21T09:05:53.4187173Z %78 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.4187440Z %79 = arith.muli %78, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:53.4187709Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:53.4188046Z %81 = tt.broadcast %79 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.4188315Z %82 = tt.broadcast %80 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.4188548Z %83 = arith.addi %81, %82 : tensor<64x32xi32> 2026-02-21T09:05:53.4188802Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:53.4189093Z %85 = tt.addptr %84, %83 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:53.4189438Z %86 = tt.load %85 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:53.4189780Z %87 = arith.extf %86 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:53.4190064Z %88 = tt.expand_dims %73 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:53.4190334Z %89 = arith.muli %88, %cst_4 : tensor<16x1xi32> 2026-02-21T09:05:53.4190589Z %90 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:53.4190882Z %91 = tt.broadcast %89 : tensor<16x1xi32> -> tensor<16x64xi32> 2026-02-21T09:05:53.4191140Z %92 = tt.broadcast %90 : tensor<1x64xi32> -> tensor<16x64xi32> 2026-02-21T09:05:53.4191372Z %93 = arith.addi %91, %92 : tensor<16x64xi32> 2026-02-21T09:05:53.4191674Z %94 = tt.splat %arg1 : !tt.ptr -> tensor<16x64x!tt.ptr> 2026-02-21T09:05:53.4191982Z %95 = tt.addptr %94, %93 : tensor<16x64x!tt.ptr>, tensor<16x64xi32> 2026-02-21T09:05:53.4192288Z %96 = tt.load %95 evictionPolicy = evict_first : tensor<16x64x!tt.ptr> 2026-02-21T09:05:53.4192551Z %97 = arith.shli %96, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4192773Z %98 = arith.shrsi %97, %cst_3 : tensor<16x64xi8> 
2026-02-21T09:05:53.4192999Z %99 = arith.shrsi %96, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4193247Z %100 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:53.4193553Z %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:53.4193869Z %102 = tt.expand_dims %101 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:53.4194205Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:05:53.4194538Z %104 = tt.expand_dims %99 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:05:53.4194833Z %105 = arith.cmpi eq, %102, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:53.4195112Z %106 = tt.broadcast %105 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:05:53.4195403Z %107 = tt.broadcast %103 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:05:53.4195713Z %108 = arith.select %106, %107, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:05:53.4195995Z %109 = arith.cmpi eq, %102, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:53.4196261Z %110 = tt.broadcast %104 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:05:53.4196545Z %111 = tt.broadcast %109 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:05:53.4196832Z %112 = arith.select %111, %110, %108 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:05:53.4197131Z %113 = tt.reshape %112 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:05:53.4197400Z %114 = arith.sitofp %113 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:05:53.4197779Z %115 = tt.dot %87, %114, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:53.4198117Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:05:53.4198315Z %116 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:05:53.4198524Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T09:05:53.4198763Z %118 = tt.make_range {end = 16 : i32, start = 0 : 
i32} : tensor<16xi32> 2026-02-21T09:05:53.4199028Z %119 = tt.splat %117 : i32 -> tensor<16xi32> 2026-02-21T09:05:53.4199241Z %120 = arith.addi %119, %118 : tensor<16xi32> 2026-02-21T09:05:53.4199478Z %121 = arith.muli %117, %c2_i32 : i32 2026-02-21T09:05:53.4199710Z %122 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:53.4199969Z %123 = tt.splat %121 : i32 -> tensor<32xi32> 2026-02-21T09:05:53.4200181Z %124 = arith.addi %123, %122 : tensor<32xi32> 2026-02-21T09:05:53.4200440Z %125 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.4200762Z %126 = arith.muli %125, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:53.4201020Z %127 = tt.expand_dims %124 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:53.4201340Z %128 = tt.broadcast %126 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.4201650Z %129 = tt.broadcast %127 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.4201887Z %130 = arith.addi %128, %129 : tensor<64x32xi32> 2026-02-21T09:05:53.4202136Z %131 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:53.4202426Z %132 = tt.addptr %131, %130 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:53.4202741Z %133 = tt.load %132 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:53.4203028Z %134 = arith.extf %133 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:53.4203342Z %135 = tt.expand_dims %120 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:53.4203621Z %136 = arith.muli %135, %cst_4 : tensor<16x1xi32> 2026-02-21T09:05:53.4203878Z %137 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:53.4204170Z %138 = tt.broadcast %136 : tensor<16x1xi32> -> tensor<16x64xi32> 2026-02-21T09:05:53.4204425Z %139 = tt.broadcast %137 : tensor<1x64xi32> -> tensor<16x64xi32> 2026-02-21T09:05:53.4204665Z %140 = arith.addi %138, %139 : tensor<16x64xi32> 
2026-02-21T09:05:53.4204897Z %141 = tt.splat %arg1 : !tt.ptr -> tensor<16x64x!tt.ptr> 2026-02-21T09:05:53.4205181Z %142 = tt.addptr %141, %140 : tensor<16x64x!tt.ptr>, tensor<16x64xi32> 2026-02-21T09:05:53.4205491Z %143 = tt.load %142 evictionPolicy = evict_first : tensor<16x64x!tt.ptr> 2026-02-21T09:05:53.4205764Z %144 = arith.shli %143, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4205988Z %145 = arith.shrsi %144, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4206206Z %146 = arith.shrsi %143, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4206457Z %147 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:53.4206754Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:53.4207062Z %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:53.4207387Z %150 = tt.expand_dims %145 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:05:53.4207708Z %151 = tt.expand_dims %146 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:05:53.4207999Z %152 = arith.cmpi eq, %149, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:53.4208266Z %153 = tt.broadcast %152 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:05:53.4208544Z %154 = tt.broadcast %150 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:05:53.4208836Z %155 = arith.select %153, %154, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:05:53.4209107Z %156 = arith.cmpi eq, %149, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:53.4209365Z %157 = tt.broadcast %151 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:05:53.4209629Z %158 = tt.broadcast %156 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:05:53.4209911Z %159 = arith.select %158, %157, %155 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:05:53.4210188Z %160 = tt.reshape %159 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:05:53.4210455Z %161 = arith.sitofp %160 : 
tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:05:53.4210851Z %162 = tt.dot %134, %161, %115, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:53.4211166Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:05:53.4211368Z %163 = arith.muli %c16_i32, %c2_i32_7 : i32 2026-02-21T09:05:53.4211590Z %164 = arith.addi %arg3, %163 : i32 2026-02-21T09:05:53.4211829Z %165 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:53.4212114Z %166 = tt.splat %164 : i32 -> tensor<16xi32> 2026-02-21T09:05:53.4212317Z %167 = arith.addi %166, %165 : tensor<16xi32> 2026-02-21T09:05:53.4212562Z %168 = arith.muli %164, %c2_i32 : i32 2026-02-21T09:05:53.4212788Z %169 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:53.4213041Z %170 = tt.splat %168 : i32 -> tensor<32xi32> 2026-02-21T09:05:53.4213242Z %171 = arith.addi %170, %169 : tensor<32xi32> 2026-02-21T09:05:53.4213509Z %172 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.4213784Z %173 = arith.muli %172, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:53.4214041Z %174 = tt.expand_dims %171 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:53.4214339Z %175 = tt.broadcast %173 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.4214630Z %176 = tt.broadcast %174 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.4214882Z %177 = arith.addi %175, %176 : tensor<64x32xi32> 2026-02-21T09:05:53.4215128Z %178 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:05:53.4215429Z %179 = tt.addptr %178, %177 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:53.4215750Z %180 = tt.load %179 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:53.4216041Z %181 = arith.extf %180 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:53.4216333Z %182 = tt.expand_dims %167 {axis = 1 : i32} : tensor<16xi32> -> 
tensor<16x1xi32> 2026-02-21T09:05:53.4216599Z %183 = arith.muli %182, %cst_4 : tensor<16x1xi32> 2026-02-21T09:05:53.4216863Z %184 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:53.4217146Z %185 = tt.broadcast %183 : tensor<16x1xi32> -> tensor<16x64xi32> 2026-02-21T09:05:53.4217413Z %186 = tt.broadcast %184 : tensor<1x64xi32> -> tensor<16x64xi32> 2026-02-21T09:05:53.4217658Z %187 = arith.addi %185, %186 : tensor<16x64xi32> 2026-02-21T09:05:53.4217907Z %188 = tt.splat %arg1 : !tt.ptr -> tensor<16x64x!tt.ptr> 2026-02-21T09:05:53.4218186Z %189 = tt.addptr %188, %187 : tensor<16x64x!tt.ptr>, tensor<16x64xi32> 2026-02-21T09:05:53.4218508Z %190 = tt.load %189 evictionPolicy = evict_first : tensor<16x64x!tt.ptr> 2026-02-21T09:05:53.4218798Z %191 = arith.shli %190, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4219036Z %192 = arith.shrsi %191, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4219265Z %193 = arith.shrsi %190, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4219532Z %194 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:53.4219839Z %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:53.4220171Z %196 = tt.expand_dims %195 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:53.4220510Z %197 = tt.expand_dims %192 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:05:53.4220861Z %198 = tt.expand_dims %193 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:05:53.4221177Z %199 = arith.cmpi eq, %196, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:05:53.4221443Z %200 = tt.broadcast %199 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:05:53.4221774Z %201 = tt.broadcast %197 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:05:53.4222133Z %202 = arith.select %200, %201, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:05:53.4222426Z %203 = arith.cmpi eq, %196, %cst_1 : 
tensor<1x2x1xi32> 2026-02-21T09:05:53.4222687Z %204 = tt.broadcast %198 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:05:53.4222974Z %205 = tt.broadcast %203 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:05:53.4223274Z %206 = arith.select %205, %204, %202 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:05:53.4223620Z %207 = tt.reshape %206 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:05:53.4223938Z %208 = arith.sitofp %207 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:05:53.4224310Z %209 = tt.dot %181, %208, %162, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:53.4224655Z scf.yield %209 : tensor<64x64xf32> 2026-02-21T09:05:53.4224859Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:05:53.4225096Z %17 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:05:53.4225367Z %18 = tt.splat %c4080_i32 : i32 -> tensor<16xi32> 2026-02-21T09:05:53.4225586Z %19 = arith.addi %18, %17 : tensor<16xi32> 2026-02-21T09:05:53.4225799Z %20 = arith.muli %c4080_i32, %c2_i32 : i32 2026-02-21T09:05:53.4226038Z %21 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:05:53.4226352Z %22 = tt.splat %20 : i32 -> tensor<32xi32> 2026-02-21T09:05:53.4226553Z %23 = arith.addi %22, %21 : tensor<32xi32> 2026-02-21T09:05:53.4226793Z %24 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.4227059Z %25 = arith.muli %24, %cst_5 : tensor<64x1xi32> 2026-02-21T09:05:53.4227307Z %26 = tt.expand_dims %23 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:05:53.4227594Z %27 = tt.broadcast %25 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.4227848Z %28 = tt.broadcast %26 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:05:53.4228086Z %29 = arith.addi %27, %28 : tensor<64x32xi32> 2026-02-21T09:05:53.4228330Z %30 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 
2026-02-21T09:05:53.4228605Z %31 = tt.addptr %30, %29 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:05:53.4228906Z %32 = tt.load %31 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:05:53.4229190Z %33 = arith.extf %32 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:05:53.4229469Z %34 = tt.expand_dims %19 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:05:53.4229719Z %35 = arith.muli %34, %cst_4 : tensor<16x1xi32> 2026-02-21T09:05:53.4229970Z %36 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:53.4230245Z %37 = tt.broadcast %35 : tensor<16x1xi32> -> tensor<16x64xi32> 2026-02-21T09:05:53.4230491Z %38 = tt.broadcast %36 : tensor<1x64xi32> -> tensor<16x64xi32> 2026-02-21T09:05:53.4230727Z %39 = arith.addi %37, %38 : tensor<16x64xi32> 2026-02-21T09:05:53.4230959Z %40 = tt.splat %arg1 : !tt.ptr -> tensor<16x64x!tt.ptr> 2026-02-21T09:05:53.4231233Z %41 = tt.addptr %40, %39 : tensor<16x64x!tt.ptr>, tensor<16x64xi32> 2026-02-21T09:05:53.4231524Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<16x64x!tt.ptr> 2026-02-21T09:05:53.4231940Z %43 = arith.shli %42, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4232163Z %44 = arith.shrsi %43, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4232376Z %45 = arith.shrsi %42, %cst_3 : tensor<16x64xi8> 2026-02-21T09:05:53.4232623Z %46 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:05:53.4232904Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:05:53.4233213Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:05:53.4233522Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:05:53.4233897Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16x64xi8> -> tensor<16x1x64xi8> 2026-02-21T09:05:53.4234180Z %51 = arith.cmpi eq, %48, %cst_2 : tensor<1x2x1xi32> 
2026-02-21T09:05:53.4234426Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:05:53.4234699Z %53 = tt.broadcast %49 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:05:53.4235017Z %54 = arith.select %52, %53, %cst : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:05:53.4235297Z %55 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:05:53.4235602Z %56 = tt.broadcast %50 : tensor<16x1x64xi8> -> tensor<16x2x64xi8> 2026-02-21T09:05:53.4235870Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<16x2x64xi1> 2026-02-21T09:05:53.4236152Z %58 = arith.select %57, %56, %54 : tensor<16x2x64xi1>, tensor<16x2x64xi8> 2026-02-21T09:05:53.4236419Z %59 = tt.reshape %58 : tensor<16x2x64xi8> -> tensor<32x64xi8> 2026-02-21T09:05:53.4236681Z %60 = arith.sitofp %59 : tensor<32x64xi8> to tensor<32x64xf32> 2026-02-21T09:05:53.4237022Z %61 = tt.dot %33, %60, %16, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x64xf32> -> tensor<64x64xf32> 2026-02-21T09:05:53.4237368Z %62 = arith.truncf %61 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:05:53.4237694Z %63 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:05:53.4237951Z %64 = arith.muli %63, %cst_0 : tensor<64x1xi32> 2026-02-21T09:05:53.4238208Z %65 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:05:53.4238489Z %66 = tt.broadcast %64 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:05:53.4238738Z %67 = tt.broadcast %65 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:05:53.4238972Z %68 = arith.addi %66, %67 : tensor<64x64xi32> 2026-02-21T09:05:53.4239208Z %69 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:05:53.4239491Z %70 = tt.addptr %69, %68 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:05:53.4239739Z tt.store %70, %62 : tensor<64x64x!tt.ptr> 2026-02-21T09:05:53.4239936Z tt.return 2026-02-21T09:05:53.4240064Z } 2026-02-21T09:05:53.4240192Z } 
2026-02-21T09:05:53.4240263Z 2026-02-21T09:05:53.4240321Z {-# 2026-02-21T09:05:53.4240452Z external_resources: { 2026-02-21T09:05:53.4240618Z mlir_reproducer: { 2026-02-21T09:05:53.4245025Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:05:53.4249470Z 
disable_threading: false, 2026-02-21T09:05:53.4249647Z verify_each: true 2026-02-21T09:05:53.4249792Z } 2026-02-21T09:05:53.4249919Z } 2026-02-21T09:05:53.4250036Z #-} 2026-02-21T09:05:53.4250492Z /tmp/torchinductor_root/77/c7734btywd4yibk4agst6w23dk2f3sdzkuspy7vf2z47bwlmgcxp.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:05:53.4251529Z /tmp/torchinductor_root/77/c7734btywd4yibk4agst6w23dk2f3sdzkuspy7vf2z47bwlmgcxp.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:05:53.4252365Z [318s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:05:53.4253429Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:05:53.4254341Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:05:53.4254596Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:05:53.4255117Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 14.7 configs/s 2026-02-21T09:05:55.2618629Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 539.7 2026-02-21T09:05:55.2619026Z configs/s 2026-02-21T09:05:55.3570721Z [320s] Generation 17 complete: 2026-02-21T09:05:55.3570970Z error=4 2026-02-21T09:05:55.3571104Z ok=31 2026-02-21T09:05:55.3571242Z min=0.1035 2026-02-21T09:05:55.3571377Z mid=0.1936 2026-02-21T09:05:55.3571512Z max=6.6120 2026-02-21T09:05:55.3571836Z best={'block_sizes': [32, 64, 16], 2026-02-21T09:05:55.3572101Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:05:55.3572351Z 'l2_groupings': [16], 2026-02-21T09:05:55.3572577Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:05:55.3572801Z 'loop_orders': [[1, 0]], 2026-02-21T09:05:55.3572986Z 'num_stages': 2, 2026-02-21T09:05:55.3573156Z 'num_warps': 1, 2026-02-21T09:05:55.3573325Z 'pid_type': 'flat', 2026-02-21T09:05:55.3573517Z 'range_flattens': [None, None], 2026-02-21T09:05:55.3573741Z 'range_multi_buffers': [None, False], 2026-02-21T09:05:55.3573982Z 'range_num_stages': [0, 3], 2026-02-21T09:05:55.3574178Z 'range_unroll_factors': [0, 1], 2026-02-21T09:05:55.3574398Z 'range_warp_specializes': [None, None]} 2026-02-21T09:05:55.3614587Z [320s] Fitting surrogate: 1208 points, 1208 targets 2026-02-21T09:05:55.9373132Z [321s] Generation 18 starting: 30 neighbors, 2 active search path(s) 2026-02-21T09:06:03.7641504Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 1.9 configs/s 2026-02-21T09:06:04.8213812Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:06:04.8218601Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:04.8223800Z ^ 2026-02-21T09:06:04.8228609Z /tmp/torchinductor_root/cy/ccyrsvbmdmnsqytykmd2euuk2hbxs3chzwnnaacq45mue2o76ou5.py:84:40: note: called 
from 2026-02-21T09:06:04.8230623Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:06:04.8230912Z ^ 2026-02-21T09:06:04.8235173Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:06:04.8240233Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:04.8240583Z ^ 2026-02-21T09:06:04.8246139Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T09:06:04.8246817Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:06:04.8247677Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:06:04.8252322Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:06:04.8256731Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:06:04.8261901Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:06:04.8263298Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:06:04.8263613Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:06:04.8267991Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:06:04.8269619Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:06:04.8269961Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:06:04.8275599Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:06:04.8280427Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:06:04.8280907Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:06:04.8286261Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:06:04.8286540Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:06:04.8291396Z %0 = tt.get_program_id x : i32 2026-02-21T09:06:04.8295931Z scf.for %arg3 = %0 to %c224_i32 step %c9472_i32 : i32 { 2026-02-21T09:06:04.8296255Z %1 = arith.divsi %arg3, %c2_i32 : i32 2026-02-21T09:06:04.8300313Z 
%2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:06:04.8304863Z %3 = arith.subi %c224_i32, %2 : i32 2026-02-21T09:06:04.8306282Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:06:04.8306523Z %5 = arith.remsi %arg3, %c2_i32 : i32 2026-02-21T09:06:04.8306732Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:06:04.8306924Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:06:04.8307101Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:06:04.8307290Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:06:04.8307542Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:06:04.8307815Z %11 = tt.splat %9 : i32 -> tensor<32xi32> 2026-02-21T09:06:04.8308021Z %12 = arith.addi %11, %10 : tensor<32xi32> 2026-02-21T09:06:04.8308227Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:06:04.8308474Z %14 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:06:04.8308719Z %15 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:06:04.8308917Z %16 = arith.addi %15, %14 : tensor<64xi32> 2026-02-21T09:06:04.8309110Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:06:04.8309307Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:06:04.8309627Z %17 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:06:04.8309965Z %71 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:06:04.8310173Z %72 = arith.addi %71, %14 : tensor<64xi32> 2026-02-21T09:06:04.8310380Z %73 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:06:04.8310621Z %74 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:04.8310869Z %75 = tt.splat %73 : i32 -> tensor<128xi32> 2026-02-21T09:06:04.8311078Z %76 = arith.addi %75, %74 : tensor<128xi32> 2026-02-21T09:06:04.8311330Z %77 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:04.8311681Z %78 = arith.muli %77, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:04.8311947Z %79 = tt.expand_dims 
%76 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:04.8312432Z %80 = tt.broadcast %78 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:04.8312707Z %81 = tt.broadcast %79 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:04.8312953Z %82 = arith.addi %80, %81 : tensor<64x128xi32> 2026-02-21T09:06:04.8313213Z %83 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:04.8313563Z %84 = tt.addptr %83, %82 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:04.8313881Z %85 = tt.load %84 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:04.8314223Z %86 = arith.extf %85 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:04.8314516Z %87 = tt.expand_dims %72 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:04.8314798Z %88 = arith.muli %87, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:04.8315055Z %89 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:06:04.8315352Z %90 = tt.broadcast %88 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8315610Z %91 = tt.broadcast %89 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8315855Z %92 = arith.addi %90, %91 : tensor<64x32xi32> 2026-02-21T09:06:04.8316137Z %93 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8316412Z %94 = tt.addptr %93, %92 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:04.8316715Z %95 = tt.load %94 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8316983Z %96 = arith.shli %95, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8317204Z %97 = arith.shrsi %96, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8317417Z %98 = arith.shrsi %95, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8317665Z %99 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:04.8318012Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 
2026-02-21T09:06:04.8318338Z %101 = tt.expand_dims %100 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:04.8318682Z %102 = tt.expand_dims %97 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:04.8319017Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:04.8319314Z %104 = arith.cmpi eq, %101, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:04.8319572Z %105 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:04.8319858Z %106 = tt.broadcast %102 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:04.8320150Z %107 = arith.select %105, %106, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:04.8320432Z %108 = arith.cmpi eq, %101, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:04.8320690Z %109 = tt.broadcast %103 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:04.8320971Z %110 = tt.broadcast %108 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:04.8321256Z %111 = arith.select %110, %109, %107 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:04.8321573Z %112 = tt.reshape %111 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:06:04.8321851Z %113 = arith.sitofp %112 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:06:04.8322221Z %114 = tt.dot %86, %113, %arg5, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:06:04.8322553Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:06:04.8322753Z %115 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:06:04.8322946Z %116 = arith.addi %arg4, %115 : i32 2026-02-21T09:06:04.8323148Z %117 = tt.splat %116 : i32 -> tensor<64xi32> 2026-02-21T09:06:04.8323354Z %118 = arith.addi %117, %14 : tensor<64xi32> 2026-02-21T09:06:04.8323596Z %119 = arith.muli %116, %c2_i32 : i32 2026-02-21T09:06:04.8323838Z %120 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:04.8324112Z %121 = 
tt.splat %119 : i32 -> tensor<128xi32> 2026-02-21T09:06:04.8324330Z %122 = arith.addi %121, %120 : tensor<128xi32> 2026-02-21T09:06:04.8324601Z %123 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:04.8324917Z %124 = arith.muli %123, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:04.8325188Z %125 = tt.expand_dims %122 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:04.8325528Z %126 = tt.broadcast %124 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:04.8325801Z %127 = tt.broadcast %125 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:04.8326059Z %128 = arith.addi %126, %127 : tensor<64x128xi32> 2026-02-21T09:06:04.8326316Z %129 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:04.8326618Z %130 = tt.addptr %129, %128 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:04.8326944Z %131 = tt.load %130 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:04.8327246Z %132 = arith.extf %131 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:04.8327603Z %133 = tt.expand_dims %118 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:04.8327899Z %134 = arith.muli %133, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:04.8328183Z %135 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:06:04.8328500Z %136 = tt.broadcast %134 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8328781Z %137 = tt.broadcast %135 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8329046Z %138 = arith.addi %136, %137 : tensor<64x32xi32> 2026-02-21T09:06:04.8329297Z %139 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8329594Z %140 = tt.addptr %139, %138 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:04.8329918Z %141 = tt.load %140 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8330204Z %142 = arith.shli %141, %cst_2 : 
tensor<64x32xi8> 2026-02-21T09:06:04.8330443Z %143 = arith.shrsi %142, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8330675Z %144 = arith.shrsi %141, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8330941Z %145 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:04.8331247Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:04.8331614Z %147 = tt.expand_dims %146 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:04.8331964Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:04.8332305Z %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:04.8332614Z %150 = arith.cmpi eq, %147, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:04.8332880Z %151 = tt.broadcast %150 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:04.8333176Z %152 = tt.broadcast %148 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:04.8333490Z %153 = arith.select %151, %152, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:04.8333781Z %154 = arith.cmpi eq, %147, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:04.8334055Z %155 = tt.broadcast %149 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:04.8334338Z %156 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:04.8334642Z %157 = arith.select %156, %155, %153 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:04.8334978Z %158 = tt.reshape %157 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:06:04.8335272Z %159 = arith.sitofp %158 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:06:04.8335666Z %160 = tt.dot %132, %159, %114, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:06:04.8336007Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:06:04.8336221Z %161 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:06:04.8336459Z %162 = 
arith.addi %arg4, %161 : i32 2026-02-21T09:06:04.8336662Z %163 = tt.splat %162 : i32 -> tensor<64xi32> 2026-02-21T09:06:04.8336864Z %164 = arith.addi %163, %14 : tensor<64xi32> 2026-02-21T09:06:04.8337094Z %165 = arith.muli %162, %c2_i32 : i32 2026-02-21T09:06:04.8337338Z %166 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:04.8337590Z %167 = tt.splat %165 : i32 -> tensor<128xi32> 2026-02-21T09:06:04.8337805Z %168 = arith.addi %167, %166 : tensor<128xi32> 2026-02-21T09:06:04.8338060Z %169 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:04.8338338Z %170 = arith.muli %169, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:04.8338600Z %171 = tt.expand_dims %168 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:04.8338928Z %172 = tt.broadcast %170 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:04.8339207Z %173 = tt.broadcast %171 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:04.8339451Z %174 = arith.addi %172, %173 : tensor<64x128xi32> 2026-02-21T09:06:04.8339707Z %175 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:04.8339998Z %176 = tt.addptr %175, %174 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:04.8340321Z %177 = tt.load %176 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:04.8340627Z %178 = arith.extf %177 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:04.8340916Z %179 = tt.expand_dims %164 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:04.8341193Z %180 = arith.muli %179, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:04.8341447Z %181 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:06:04.8341781Z %182 = tt.broadcast %180 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8342048Z %183 = tt.broadcast %181 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8342304Z %184 = arith.addi %182, %183 
: tensor<64x32xi32> 2026-02-21T09:06:04.8342552Z %185 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8342829Z %186 = tt.addptr %185, %184 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:04.8343138Z %187 = tt.load %186 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8343411Z %188 = arith.shli %187, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8343636Z %189 = arith.shrsi %188, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8343856Z %190 = arith.shrsi %187, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8344110Z %191 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:04.8344410Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:04.8344726Z %193 = tt.expand_dims %192 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:04.8345062Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:04.8345394Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:04.8345685Z %196 = arith.cmpi eq, %193, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:04.8345947Z %197 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:04.8346276Z %198 = tt.broadcast %194 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:04.8346572Z %199 = arith.select %197, %198, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:04.8346848Z %200 = arith.cmpi eq, %193, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:04.8347111Z %201 = tt.broadcast %195 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:04.8347387Z %202 = tt.broadcast %200 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:04.8347698Z %203 = arith.select %202, %201, %199 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:04.8348014Z %204 = tt.reshape %203 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:06:04.8348289Z %205 = 
arith.sitofp %204 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:06:04.8348666Z %206 = tt.dot %178, %205, %160, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:06:04.8348995Z scf.yield %206 : tensor<64x32xf32> 2026-02-21T09:06:04.8349179Z } 2026-02-21T09:06:04.8349339Z %18 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:06:04.8349549Z %19 = arith.addi %18, %14 : tensor<64xi32> 2026-02-21T09:06:04.8349754Z %20 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:06:04.8349994Z %21 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:04.8350273Z %22 = tt.splat %20 : i32 -> tensor<128xi32> 2026-02-21T09:06:04.8350475Z %23 = arith.addi %22, %21 : tensor<128xi32> 2026-02-21T09:06:04.8350729Z %24 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:04.8350995Z %25 = arith.muli %24, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:04.8351245Z %26 = tt.expand_dims %23 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:04.8351562Z %27 = tt.broadcast %25 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:04.8351827Z %28 = tt.broadcast %26 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:04.8352070Z %29 = arith.addi %27, %28 : tensor<64x128xi32> 2026-02-21T09:06:04.8352311Z %30 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:04.8352599Z %31 = tt.addptr %30, %29 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:04.8352911Z %32 = tt.load %31 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:04.8353200Z %33 = arith.extf %32 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:04.8353493Z %34 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:04.8353751Z %35 = arith.muli %34, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:04.8354011Z %36 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> 
tensor<1x32xi32> 2026-02-21T09:06:04.8354290Z %37 = tt.broadcast %35 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8354554Z %38 = tt.broadcast %36 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8354789Z %39 = arith.addi %37, %38 : tensor<64x32xi32> 2026-02-21T09:06:04.8355025Z %40 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8355304Z %41 = tt.addptr %40, %39 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:04.8355600Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8355882Z %43 = arith.shli %42, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8356091Z %44 = arith.shrsi %43, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8356308Z %45 = arith.shrsi %42, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:04.8356553Z %46 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:04.8356835Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:04.8357141Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:04.8357483Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:04.8357803Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:04.8358085Z %51 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:04.8358330Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:04.8358626Z %53 = tt.broadcast %49 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:04.8358904Z %54 = arith.select %52, %53, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:04.8359198Z %55 = arith.cmpi eq, %48, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:04.8359445Z %56 = tt.broadcast %50 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:04.8359712Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 
2026-02-21T09:06:04.8359985Z %58 = arith.select %57, %56, %54 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:04.8360254Z %59 = tt.reshape %58 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:06:04.8360519Z %60 = arith.sitofp %59 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:06:04.8360869Z %61 = tt.dot %33, %60, %17, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:06:04.8361251Z %62 = arith.truncf %61 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:06:04.8361568Z %63 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:04.8361827Z %64 = arith.muli %63, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:04.8362086Z %65 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:06:04.8362361Z %66 = tt.broadcast %64 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8362617Z %67 = tt.broadcast %65 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:04.8362844Z %68 = arith.addi %66, %67 : tensor<64x32xi32> 2026-02-21T09:06:04.8363089Z %69 = tt.splat %arg2 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8363375Z %70 = tt.addptr %69, %68 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:04.8363624Z tt.store %70, %62 : tensor<64x32x!tt.ptr> 2026-02-21T09:06:04.8363869Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:06:04.8364081Z tt.return 2026-02-21T09:06:04.8364218Z } 2026-02-21T09:06:04.8364341Z } 2026-02-21T09:06:04.8364418Z 2026-02-21T09:06:04.8364469Z {-# 2026-02-21T09:06:04.8364598Z external_resources: { 2026-02-21T09:06:04.8364763Z mlir_reproducer: { 2026-02-21T09:06:04.8369102Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, 
tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:06:04.8373657Z disable_threading: false, 2026-02-21T09:06:04.8373834Z verify_each: true 2026-02-21T09:06:04.8373978Z } 2026-02-21T09:06:04.8374107Z } 2026-02-21T09:06:04.8374221Z #-} 2026-02-21T09:06:04.8374687Z /tmp/torchinductor_root/cy/ccyrsvbmdmnsqytykmd2euuk2hbxs3chzwnnaacq45mue2o76ou5.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:06:04.8375717Z /tmp/torchinductor_root/cy/ccyrsvbmdmnsqytykmd2euuk2hbxs3chzwnnaacq45mue2o76ou5.py:14:0: note: 
Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:06:04.8376537Z [330s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:06:04.8377700Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[False, None]), static_shapes=True) 2026-02-21T09:06:04.8378736Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:06:04.8378995Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:06:05.2357401Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:06:05.2359287Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:05.2359575Z ^ 2026-02-21T09:06:05.2359945Z /tmp/torchinductor_root/fa/cfazawlke33okiwd7lrxijft3rnlviebzdf7dljxe2p6i6tqy4ab.py:78:36: note: called from 2026-02-21T09:06:05.2360353Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:06:05.2360555Z ^ 2026-02-21T09:06:05.2360957Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:06:05.2361421Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:05.2361816Z ^ 2026-02-21T09:06:05.2361994Z module { 2026-02-21T09:06:05.2366740Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : 
i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:06:05.2367640Z %cst = arith.constant dense<0> : tensor<64x2x64xi8> 2026-02-21T09:06:05.2367874Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:06:05.2368087Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:06:05.2368347Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:06:05.2368582Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:06:05.2368817Z %cst_2 = arith.constant dense<4> : tensor<64x64xi8> 2026-02-21T09:06:05.2369051Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:06:05.2369301Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:06:05.2369566Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:06:05.2369802Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:06:05.2369994Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:06:05.2370364Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:06:05.2370565Z %0 = tt.get_program_id x : i32 2026-02-21T09:06:05.2370751Z %1 = arith.divsi %0, %c2_i32 : i32 2026-02-21T09:06:05.2370948Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:06:05.2371131Z %3 = arith.subi %c112_i32, %2 : i32 2026-02-21T09:06:05.2371324Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:06:05.2371506Z %5 = arith.remsi %0, %c2_i32 : i32 2026-02-21T09:06:05.2371807Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:06:05.2371989Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:06:05.2372159Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:06:05.2372368Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T09:06:05.2372601Z %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:06:05.2372859Z %11 = tt.splat %9 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.2373057Z %12 = arith.addi %11, %10 : tensor<64xi32> 2026-02-21T09:06:05.2373252Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:06:05.2373444Z %14 = tt.splat %13 : i32 -> tensor<64xi32> 
2026-02-21T09:06:05.2373634Z %15 = arith.addi %14, %10 : tensor<64xi32> 2026-02-21T09:06:05.2373829Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:06:05.2374152Z %16 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:06:05.2374546Z %26 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.2374757Z %27 = arith.addi %26, %10 : tensor<64xi32> 2026-02-21T09:06:05.2374962Z %28 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:06:05.2375206Z %29 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:05.2375461Z %30 = tt.splat %28 : i32 -> tensor<128xi32> 2026-02-21T09:06:05.2375673Z %31 = arith.addi %30, %29 : tensor<128xi32> 2026-02-21T09:06:05.2375931Z %32 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.2376212Z %33 = arith.muli %32, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:05.2376473Z %34 = tt.expand_dims %31 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:05.2376781Z %35 = tt.broadcast %33 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.2377055Z %36 = tt.broadcast %34 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.2377297Z %37 = arith.addi %35, %36 : tensor<64x128xi32> 2026-02-21T09:06:05.2377554Z %38 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.2377838Z %39 = tt.addptr %38, %37 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:05.2378156Z %40 = tt.load %39 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.2378460Z %41 = arith.extf %40 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:05.2378742Z %42 = tt.expand_dims %27 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.2379017Z %43 = arith.muli %42, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.2379266Z %44 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 
2026-02-21T09:06:05.2379558Z %45 = tt.broadcast %43 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2379814Z %46 = tt.broadcast %44 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2380055Z %47 = arith.addi %45, %46 : tensor<64x64xi32> 2026-02-21T09:06:05.2380308Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2380582Z %49 = tt.addptr %48, %47 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:05.2380888Z %50 = tt.load %49 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2381156Z %51 = arith.shli %50, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2381380Z %52 = arith.shrsi %51, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2381645Z %53 = arith.shrsi %50, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2381934Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:05.2382238Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:05.2382544Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:05.2382876Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:05.2383228Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:05.2383524Z %59 = arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:05.2383812Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:05.2384079Z %61 = tt.broadcast %57 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:05.2384364Z %62 = arith.select %60, %61, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:05.2384626Z %63 = arith.cmpi eq, %56, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:05.2384873Z %64 = tt.broadcast %58 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:05.2385136Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:05.2385411Z %66 = 
arith.select %65, %64, %62 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:05.2385713Z %67 = tt.reshape %66 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:06:05.2385975Z %68 = arith.sitofp %67 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:06:05.2386343Z %69 = tt.dot %41, %68, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:05.2386668Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:06:05.2386868Z %70 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:06:05.2387058Z %71 = arith.addi %arg3, %70 : i32 2026-02-21T09:06:05.2387257Z %72 = tt.splat %71 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.2387463Z %73 = arith.addi %72, %10 : tensor<64xi32> 2026-02-21T09:06:05.2387652Z %74 = arith.muli %71, %c2_i32 : i32 2026-02-21T09:06:05.2387891Z %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:05.2388155Z %76 = tt.splat %74 : i32 -> tensor<128xi32> 2026-02-21T09:06:05.2388373Z %77 = arith.addi %76, %75 : tensor<128xi32> 2026-02-21T09:06:05.2388633Z %78 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.2388914Z %79 = arith.muli %78, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:05.2389188Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:05.2389498Z %81 = tt.broadcast %79 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.2389777Z %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.2390019Z %83 = arith.addi %81, %82 : tensor<64x128xi32> 2026-02-21T09:06:05.2390278Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.2390577Z %85 = tt.addptr %84, %83 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:05.2390901Z %86 = tt.load %85 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.2391208Z %87 = arith.extf %86 : tensor<64x128xbf16> to tensor<64x128xf32> 
2026-02-21T09:06:05.2391502Z %88 = tt.expand_dims %73 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.2391834Z %89 = arith.muli %88, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.2392093Z %90 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:05.2392395Z %91 = tt.broadcast %89 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2392667Z %92 = tt.broadcast %90 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2392910Z %93 = arith.addi %91, %92 : tensor<64x64xi32> 2026-02-21T09:06:05.2393162Z %94 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2393468Z %95 = tt.addptr %94, %93 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:05.2393781Z %96 = tt.load %95 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2394053Z %97 = arith.shli %96, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2394286Z %98 = arith.shrsi %97, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2394515Z %99 = arith.shrsi %96, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2394801Z %100 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:05.2395146Z %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:05.2395477Z %102 = tt.expand_dims %101 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:05.2395821Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:05.2396155Z %104 = tt.expand_dims %99 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:05.2396459Z %105 = arith.cmpi eq, %102, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:05.2396734Z %106 = tt.broadcast %105 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:05.2397015Z %107 = tt.broadcast %103 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:05.2397363Z %108 = arith.select %106, %107, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 
2026-02-21T09:06:05.2397653Z %109 = arith.cmpi eq, %102, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:05.2397926Z %110 = tt.broadcast %104 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:05.2398215Z %111 = tt.broadcast %109 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:05.2398507Z %112 = arith.select %111, %110, %108 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:05.2398808Z %113 = tt.reshape %112 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:06:05.2399086Z %114 = arith.sitofp %113 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:06:05.2399464Z %115 = tt.dot %87, %114, %69, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:05.2399782Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:06:05.2399981Z %116 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:06:05.2400182Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T09:06:05.2400372Z %118 = tt.splat %117 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.2400583Z %119 = arith.addi %118, %10 : tensor<64xi32> 2026-02-21T09:06:05.2400774Z %120 = arith.muli %117, %c2_i32 : i32 2026-02-21T09:06:05.2407711Z %121 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:05.2407994Z %122 = tt.splat %120 : i32 -> tensor<128xi32> 2026-02-21T09:06:05.2408213Z %123 = arith.addi %122, %121 : tensor<128xi32> 2026-02-21T09:06:05.2408490Z %124 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.2408765Z %125 = arith.muli %124, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:05.2409045Z %126 = tt.expand_dims %123 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:05.2409365Z %127 = tt.broadcast %125 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.2409640Z %128 = tt.broadcast %126 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.2409902Z %129 = arith.addi %127, %128 : tensor<64x128xi32> 2026-02-21T09:06:05.2410198Z %130 = 
tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.2410501Z %131 = tt.addptr %130, %129 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:05.2410830Z %132 = tt.load %131 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.2411131Z %133 = arith.extf %132 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:05.2411436Z %134 = tt.expand_dims %119 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.2411860Z %135 = arith.muli %134, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.2412131Z %136 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:05.2412434Z %137 = tt.broadcast %135 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2412702Z %138 = tt.broadcast %136 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2412955Z %139 = arith.addi %137, %138 : tensor<64x64xi32> 2026-02-21T09:06:05.2413241Z %140 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2413560Z %141 = tt.addptr %140, %139 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:05.2413870Z %142 = tt.load %141 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2414148Z %143 = arith.shli %142, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2414382Z %144 = arith.shrsi %143, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2414610Z %145 = arith.shrsi %142, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2414869Z %146 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:05.2415159Z %147 = tt.expand_dims %146 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:05.2415483Z %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:05.2415849Z %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:05.2416179Z %150 = tt.expand_dims %145 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 
2026-02-21T09:06:05.2416472Z %151 = arith.cmpi eq, %148, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:05.2416728Z %152 = tt.broadcast %151 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:05.2417014Z %153 = tt.broadcast %149 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:05.2417308Z %154 = arith.select %152, %153, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:05.2417596Z %155 = arith.cmpi eq, %148, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:05.2417859Z %156 = tt.broadcast %150 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:05.2418133Z %157 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:05.2418426Z %158 = arith.select %157, %156, %154 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:05.2418708Z %159 = tt.reshape %158 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:06:05.2418991Z %160 = arith.sitofp %159 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:06:05.2419373Z %161 = tt.dot %133, %160, %115, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:05.2419709Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:06:05.2419920Z %162 = arith.muli %c64_i32, %c3_i32 : i32 2026-02-21T09:06:05.2420121Z %163 = arith.addi %arg3, %162 : i32 2026-02-21T09:06:05.2420332Z %164 = tt.splat %163 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.2420536Z %165 = arith.addi %164, %10 : tensor<64xi32> 2026-02-21T09:06:05.2420742Z %166 = arith.muli %163, %c2_i32 : i32 2026-02-21T09:06:05.2420978Z %167 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:05.2421240Z %168 = tt.splat %166 : i32 -> tensor<128xi32> 2026-02-21T09:06:05.2421463Z %169 = arith.addi %168, %167 : tensor<128xi32> 2026-02-21T09:06:05.2421772Z %170 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.2422054Z %171 = arith.muli %170, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:05.2422320Z %172 = 
tt.expand_dims %169 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:05.2422630Z %173 = tt.broadcast %171 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.2422917Z %174 = tt.broadcast %172 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.2423191Z %175 = arith.addi %173, %174 : tensor<64x128xi32> 2026-02-21T09:06:05.2423450Z %176 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.2423743Z %177 = tt.addptr %176, %175 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:05.2424067Z %178 = tt.load %177 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.2424370Z %179 = arith.extf %178 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:05.2424698Z %180 = tt.expand_dims %165 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.2425001Z %181 = arith.muli %180, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.2425260Z %182 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:05.2425557Z %183 = tt.broadcast %181 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2425823Z %184 = tt.broadcast %182 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2426075Z %185 = arith.addi %183, %184 : tensor<64x64xi32> 2026-02-21T09:06:05.2426309Z %186 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2426591Z %187 = tt.addptr %186, %185 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:05.2426905Z %188 = tt.load %187 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2427204Z %189 = arith.shli %188, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2427432Z %190 = arith.shrsi %189, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2427652Z %191 = arith.shrsi %188, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:05.2427908Z %192 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:05.2428206Z %193 = tt.expand_dims %192 {axis = 0 : i32} : 
tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:05.2428515Z %194 = tt.expand_dims %193 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:05.2428849Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:05.2429169Z %196 = tt.expand_dims %191 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:05.2429462Z %197 = arith.cmpi eq, %194, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:05.2429712Z %198 = tt.broadcast %197 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:05.2429997Z %199 = tt.broadcast %195 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:05.2430297Z %200 = arith.select %198, %199, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:05.2430570Z %201 = arith.cmpi eq, %194, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:05.2430843Z %202 = tt.broadcast %196 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:05.2431128Z %203 = tt.broadcast %201 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:05.2431432Z %204 = arith.select %203, %202, %200 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:05.2431787Z %205 = tt.reshape %204 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:06:05.2432075Z %206 = arith.sitofp %205 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:06:05.2432465Z %207 = tt.dot %179, %206, %161, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:05.2432806Z scf.yield %207 : tensor<64x64xf32> 2026-02-21T09:06:05.2433020Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:06:05.2433261Z %17 = arith.truncf %16 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:06:05.2433574Z %18 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.2433860Z %19 = arith.muli %18, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.2434124Z %20 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 
2026-02-21T09:06:05.2434427Z %21 = tt.broadcast %19 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2434729Z %22 = tt.broadcast %20 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:05.2434982Z %23 = arith.addi %21, %22 : tensor<64x64xi32> 2026-02-21T09:06:05.2435235Z %24 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2435541Z %25 = tt.addptr %24, %23 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:05.2435822Z tt.store %25, %17 : tensor<64x64x!tt.ptr> 2026-02-21T09:06:05.2436057Z tt.return 2026-02-21T09:06:05.2436205Z } 2026-02-21T09:06:05.2436336Z } 2026-02-21T09:06:05.2436413Z 2026-02-21T09:06:05.2436478Z {-# 2026-02-21T09:06:05.2436619Z external_resources: { 2026-02-21T09:06:05.2436827Z mlir_reproducer: { 2026-02-21T09:06:05.2441323Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:06:05.2445811Z disable_threading: false, 2026-02-21T09:06:05.2445983Z verify_each: true 2026-02-21T09:06:05.2446138Z } 2026-02-21T09:06:05.2446272Z } 2026-02-21T09:06:05.2446389Z #-} 2026-02-21T09:06:05.2446836Z /tmp/torchinductor_root/fa/cfazawlke33okiwd7lrxijft3rnlviebzdf7dljxe2p6i6tqy4ab.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:06:05.2447860Z /tmp/torchinductor_root/fa/cfazawlke33okiwd7lrxijft3rnlviebzdf7dljxe2p6i6tqy4ab.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:06:05.2448697Z [330s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:06:05.2449731Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:06:05.2450633Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:06:05.2450899Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:06:05.8582298Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:06:05.8583987Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:05.8584284Z ^ 2026-02-21T09:06:05.8584654Z /tmp/torchinductor_root/4z/c4zg2jya6ltxtildedgwkjo5vjr4p656fw7aqpl32eototgqn42c.py:78:36: note: called from 2026-02-21T09:06:05.8585055Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:06:05.8585269Z ^ 2026-02-21T09:06:05.8585894Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:06:05.8586422Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:05.8586666Z ^ 2026-02-21T09:06:05.8586874Z module { 2026-02-21T09:06:05.8587362Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:06:05.8587949Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:06:05.8588180Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:06:05.8588375Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:06:05.8588591Z %cst_0 = arith.constant dense<1> : 
tensor<1x2x1xi32> 2026-02-21T09:06:05.8588881Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:06:05.8589113Z %cst_2 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:06:05.8589354Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:06:05.8589592Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:06:05.8589854Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 2026-02-21T09:06:05.8590086Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:06:05.8590278Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:06:05.8590469Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:06:05.8590651Z %c224_i32 = arith.constant 224 : i32 2026-02-21T09:06:05.8590846Z %0 = tt.get_program_id x : i32 2026-02-21T09:06:05.8591024Z %1 = arith.divsi %0, %c2_i32 : i32 2026-02-21T09:06:05.8591212Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:06:05.8591391Z %3 = arith.subi %c224_i32, %2 : i32 2026-02-21T09:06:05.8591745Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:06:05.8591925Z %5 = arith.remsi %0, %c2_i32 : i32 2026-02-21T09:06:05.8592118Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:06:05.8592293Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:06:05.8592472Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:06:05.8592643Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:06:05.8592878Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:06:05.8593135Z %11 = tt.splat %9 : i32 -> tensor<32xi32> 2026-02-21T09:06:05.8593334Z %12 = arith.addi %11, %10 : tensor<32xi32> 2026-02-21T09:06:05.8593532Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:06:05.8593752Z %14 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:06:05.8593992Z %15 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.8594182Z %16 = arith.addi %15, %14 : tensor<64xi32> 2026-02-21T09:06:05.8594378Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:06:05.8594560Z %c192_i32 = arith.constant 192 : 
i32 2026-02-21T09:06:05.8594887Z %17 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:06:05.8595223Z %71 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.8595427Z %72 = arith.addi %71, %14 : tensor<64xi32> 2026-02-21T09:06:05.8595628Z %73 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:06:05.8595860Z %74 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:05.8596114Z %75 = tt.splat %73 : i32 -> tensor<128xi32> 2026-02-21T09:06:05.8596362Z %76 = arith.addi %75, %74 : tensor<128xi32> 2026-02-21T09:06:05.8596624Z %77 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.8596903Z %78 = arith.muli %77, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:05.8597160Z %79 = tt.expand_dims %76 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:05.8597463Z %80 = tt.broadcast %78 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.8597766Z %81 = tt.broadcast %79 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.8598020Z %82 = arith.addi %80, %81 : tensor<64x128xi32> 2026-02-21T09:06:05.8598301Z %83 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.8598593Z %84 = tt.addptr %83, %82 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:05.8598906Z %85 = tt.load %84 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.8599205Z %86 = arith.extf %85 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:05.8599500Z %87 = tt.expand_dims %72 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.8599768Z %88 = arith.muli %87, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.8600037Z %89 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:06:05.8600357Z %90 = tt.broadcast %88 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8600618Z %91 = tt.broadcast %89 : 
tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8600856Z %92 = arith.addi %90, %91 : tensor<64x32xi32> 2026-02-21T09:06:05.8601090Z %93 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8601367Z %94 = tt.addptr %93, %92 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:05.8601747Z %95 = tt.load %94 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8602014Z %96 = arith.shli %95, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8602231Z %97 = arith.shrsi %96, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8602439Z %98 = arith.shrsi %95, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8602683Z %99 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:05.8602972Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:05.8603296Z %101 = tt.expand_dims %100 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:05.8603630Z %102 = tt.expand_dims %97 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:05.8603948Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:05.8604231Z %104 = arith.cmpi eq, %101, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:05.8604483Z %105 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:05.8604763Z %106 = tt.broadcast %102 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:05.8605053Z %107 = arith.select %105, %106, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:05.8605324Z %108 = arith.cmpi eq, %101, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:05.8605577Z %109 = tt.broadcast %103 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:05.8605844Z %110 = tt.broadcast %108 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:05.8606124Z %111 = arith.select %110, %109, %107 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:05.8606404Z %112 = tt.reshape %111 : 
tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:06:05.8606668Z %113 = arith.sitofp %112 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:06:05.8607046Z %114 = tt.dot %86, %113, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:06:05.8607389Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:06:05.8607618Z %115 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:06:05.8607816Z %116 = arith.addi %arg3, %115 : i32 2026-02-21T09:06:05.8608017Z %117 = tt.splat %116 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.8608220Z %118 = arith.addi %117, %14 : tensor<64xi32> 2026-02-21T09:06:05.8608418Z %119 = arith.muli %116, %c2_i32 : i32 2026-02-21T09:06:05.8608661Z %120 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:05.8608951Z %121 = tt.splat %119 : i32 -> tensor<128xi32> 2026-02-21T09:06:05.8609168Z %122 = arith.addi %121, %120 : tensor<128xi32> 2026-02-21T09:06:05.8609453Z %123 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.8609731Z %124 = arith.muli %123, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:05.8610036Z %125 = tt.expand_dims %122 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:05.8610348Z %126 = tt.broadcast %124 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.8610624Z %127 = tt.broadcast %125 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.8610878Z %128 = arith.addi %126, %127 : tensor<64x128xi32> 2026-02-21T09:06:05.8611132Z %129 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.8611440Z %130 = tt.addptr %129, %128 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:05.8611837Z %131 = tt.load %130 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.8612151Z %132 = arith.extf %131 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:05.8612455Z %133 = tt.expand_dims %118 {axis = 1 : i32} : tensor<64xi32> -> 
tensor<64x1xi32> 2026-02-21T09:06:05.8612731Z %134 = arith.muli %133, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.8612998Z %135 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:06:05.8613293Z %136 = tt.broadcast %134 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8613564Z %137 = tt.broadcast %135 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8613807Z %138 = arith.addi %136, %137 : tensor<64x32xi32> 2026-02-21T09:06:05.8614049Z %139 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8614333Z %140 = tt.addptr %139, %138 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:05.8614643Z %141 = tt.load %140 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8614925Z %142 = arith.shli %141, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8615153Z %143 = arith.shrsi %142, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8615394Z %144 = arith.shrsi %141, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8615657Z %145 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:05.8615959Z %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:05.8616271Z %147 = tt.expand_dims %146 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:05.8616589Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:05.8616913Z %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:05.8617196Z %150 = arith.cmpi eq, %147, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:05.8617446Z %151 = tt.broadcast %150 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:05.8617717Z %152 = tt.broadcast %148 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:05.8618001Z %153 = arith.select %151, %152, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:05.8618274Z %154 = arith.cmpi eq, %147, %cst_0 : 
tensor<1x2x1xi32> 2026-02-21T09:06:05.8618520Z %155 = tt.broadcast %149 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:05.8618784Z %156 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:05.8619092Z %157 = arith.select %156, %155, %153 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:05.8619370Z %158 = tt.reshape %157 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:06:05.8619631Z %159 = arith.sitofp %158 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:06:05.8619984Z %160 = tt.dot %132, %159, %114, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:06:05.8620343Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:06:05.8620538Z %161 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:06:05.8620756Z %162 = arith.addi %arg3, %161 : i32 2026-02-21T09:06:05.8620954Z %163 = tt.splat %162 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.8621154Z %164 = arith.addi %163, %14 : tensor<64xi32> 2026-02-21T09:06:05.8621353Z %165 = arith.muli %162, %c2_i32 : i32 2026-02-21T09:06:05.8621621Z %166 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:05.8621879Z %167 = tt.splat %165 : i32 -> tensor<128xi32> 2026-02-21T09:06:05.8622086Z %168 = arith.addi %167, %166 : tensor<128xi32> 2026-02-21T09:06:05.8622347Z %169 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.8622617Z %170 = arith.muli %169, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:05.8622931Z %171 = tt.expand_dims %168 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:05.8623239Z %172 = tt.broadcast %170 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.8623509Z %173 = tt.broadcast %171 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.8623757Z %174 = arith.addi %172, %173 : tensor<64x128xi32> 2026-02-21T09:06:05.8624002Z %175 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 
2026-02-21T09:06:05.8624301Z %176 = tt.addptr %175, %174 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:05.8624621Z %177 = tt.load %176 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.8624915Z %178 = arith.extf %177 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:05.8625212Z %179 = tt.expand_dims %164 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.8625476Z %180 = arith.muli %179, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.8625742Z %181 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:06:05.8626029Z %182 = tt.broadcast %180 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8626301Z %183 = tt.broadcast %181 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8626549Z %184 = arith.addi %182, %183 : tensor<64x32xi32> 2026-02-21T09:06:05.8626783Z %185 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8627065Z %186 = tt.addptr %185, %184 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:05.8627367Z %187 = tt.load %186 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8627642Z %188 = arith.shli %187, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8627868Z %189 = arith.shrsi %188, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8628088Z %190 = arith.shrsi %187, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8628344Z %191 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:05.8628637Z %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:05.8628954Z %193 = tt.expand_dims %192 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:05.8629272Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:05.8629600Z %195 = tt.expand_dims %190 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:05.8629891Z %196 = arith.cmpi eq, %193, 
%cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:05.8630171Z %197 = tt.broadcast %196 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:05.8630449Z %198 = tt.broadcast %194 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:05.8630737Z %199 = arith.select %197, %198, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:05.8631019Z %200 = arith.cmpi eq, %193, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:05.8631272Z %201 = tt.broadcast %195 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:05.8631594Z %202 = tt.broadcast %200 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:05.8631911Z %203 = arith.select %202, %201, %199 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:05.8632193Z %204 = tt.reshape %203 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:06:05.8632463Z %205 = arith.sitofp %204 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:06:05.8632825Z %206 = tt.dot %178, %205, %160, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:06:05.8633157Z scf.yield %206 : tensor<64x32xf32> 2026-02-21T09:06:05.8633352Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:06:05.8633552Z %18 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:06:05.8633767Z %19 = arith.addi %18, %14 : tensor<64xi32> 2026-02-21T09:06:05.8633963Z %20 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:06:05.8634235Z %21 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:05.8634485Z %22 = tt.splat %20 : i32 -> tensor<128xi32> 2026-02-21T09:06:05.8634689Z %23 = arith.addi %22, %21 : tensor<128xi32> 2026-02-21T09:06:05.8634941Z %24 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.8635201Z %25 = arith.muli %24, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:05.8635458Z %26 = tt.expand_dims %23 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:05.8635743Z %27 = tt.broadcast %25 : 
tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.8636007Z %28 = tt.broadcast %26 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:05.8636236Z %29 = arith.addi %27, %28 : tensor<64x128xi32> 2026-02-21T09:06:05.8636480Z %30 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.8636764Z %31 = tt.addptr %30, %29 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:05.8637062Z %32 = tt.load %31 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:05.8637355Z %33 = arith.extf %32 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:05.8637631Z %34 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.8637891Z %35 = arith.muli %34, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.8638138Z %36 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:06:05.8638422Z %37 = tt.broadcast %35 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8638681Z %38 = tt.broadcast %36 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8638909Z %39 = arith.addi %37, %38 : tensor<64x32xi32> 2026-02-21T09:06:05.8639145Z %40 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8639411Z %41 = tt.addptr %40, %39 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:05.8639709Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8639965Z %43 = arith.shli %42, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8640185Z %44 = arith.shrsi %43, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8640405Z %45 = arith.shrsi %42, %cst_2 : tensor<64x32xi8> 2026-02-21T09:06:05.8640643Z %46 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:05.8640934Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:05.8641261Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:05.8641602Z 
%49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:05.8641924Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:06:05.8642200Z %51 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:05.8642452Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:05.8642743Z %53 = tt.broadcast %49 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:05.8643028Z %54 = arith.select %52, %53, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:05.8643315Z %55 = arith.cmpi eq, %48, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:05.8643567Z %56 = tt.broadcast %50 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:06:05.8643829Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:06:05.8644092Z %58 = arith.select %57, %56, %54 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:06:05.8644367Z %59 = tt.reshape %58 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:06:05.8644621Z %60 = arith.sitofp %59 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:06:05.8644973Z %61 = tt.dot %33, %60, %17, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x32xf32> -> tensor<64x32xf32> 2026-02-21T09:06:05.8645345Z %62 = arith.truncf %61 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:06:05.8645637Z %63 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:05.8645902Z %64 = arith.muli %63, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:05.8646156Z %65 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:06:05.8646445Z %66 = tt.broadcast %64 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8646692Z %67 = tt.broadcast %65 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:06:05.8646928Z %68 = arith.addi %66, %67 : tensor<64x32xi32> 2026-02-21T09:06:05.8647175Z %69 = tt.splat %arg2 : !tt.ptr -> 
tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8647453Z %70 = tt.addptr %69, %68 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:06:05.8647713Z tt.store %70, %62 : tensor<64x32x!tt.ptr> 2026-02-21T09:06:05.8647905Z tt.return 2026-02-21T09:06:05.8648042Z } 2026-02-21T09:06:05.8648163Z } 2026-02-21T09:06:05.8648239Z 2026-02-21T09:06:05.8648291Z {-# 2026-02-21T09:06:05.8648421Z external_resources: { 2026-02-21T09:06:05.8648586Z mlir_reproducer: { 2026-02-21T09:06:05.8653100Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, 
symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:06:05.8657947Z disable_threading: false, 2026-02-21T09:06:05.8658135Z verify_each: true 2026-02-21T09:06:05.8658312Z } 2026-02-21T09:06:05.8658444Z } 2026-02-21T09:06:05.8658562Z #-} 2026-02-21T09:06:05.8659039Z /tmp/torchinductor_root/4z/c4zg2jya6ltxtildedgwkjo5vjr4p656fw7aqpl32eototgqn42c.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:06:05.8660054Z /tmp/torchinductor_root/4z/c4zg2jya6ltxtildedgwkjo5vjr4p656fw7aqpl32eototgqn42c.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:06:05.8660872Z [331s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:06:05.8661961Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:06:05.8662874Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:06:05.8663128Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
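Note on the IR dumped above: each `i8` loaded from `%arg1` appears to carry two signed 4-bit weights. The `arith.shli` by 4 followed by `arith.shrsi` by 4 sign-extends the low nibble, a single `arith.shrsi` by 4 yields the high nibble, and the `arith.select` / `tt.reshape` pair interleaves them so that a 64-row tile becomes a 128-row tile. The sketch below mirrors that arithmetic in plain Python for reference only. The helper names (`to_int8`, `unpack_int4`, `unpack_rows`) are hypothetical and are not part of Helion or Triton, and the row ordering (low nibble first) is read off the select-on-index-0 / select-on-index-1 pattern in the dump rather than taken from any documentation.

```python
def to_int8(x):
    """Wrap a Python int to signed 8-bit two's complement, as i8 arithmetic would."""
    x &= 0xFF
    return x - 256 if x >= 128 else x


def unpack_int4(packed):
    """Split one signed int8 into (low, high) signed 4-bit values.

    low  : shift left by 4 (wrapping to 8 bits), then arithmetic shift right by 4
    high : arithmetic shift right by 4 on the original byte
    """
    lo = to_int8(packed << 4) >> 4
    hi = packed >> 4
    return lo, hi


def unpack_rows(packed_rows):
    """Expand each packed row into two rows: low nibbles first, then high nibbles.

    This is the assumed equivalent of stacking along a new axis of size 2 and
    reshaping (rows, 2, cols) -> (2 * rows, cols).
    """
    out = []
    for row in packed_rows:
        pairs = [unpack_int4(b) for b in row]
        out.append([lo for lo, _ in pairs])
        out.append([hi for _, hi in pairs])
    return out


if __name__ == "__main__":
    print(unpack_int4(0x21))
    print(unpack_rows([[0x21, -1]]))
```

Python's `>>` on negative integers is an arithmetic shift, which is why it can stand in for `arith.shrsi` here once the value has been wrapped to the signed 8-bit range.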
2026-02-21T09:06:06.0374724Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 13.8 configs/s 2026-02-21T09:06:07.9738426Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 513.0 2026-02-21T09:06:07.9742659Z configs/s 2026-02-21T09:06:08.0579886Z [333s] Generation 18 complete: 2026-02-21T09:06:08.0584549Z error=3 2026-02-21T09:06:08.0589055Z ok=29 2026-02-21T09:06:08.0595018Z min=0.1035 2026-02-21T09:06:08.0596654Z mid=0.1895 2026-02-21T09:06:08.0596832Z max=16.5950 2026-02-21T09:06:08.0596986Z best={'block_sizes': [32, 64, 16], 2026-02-21T09:06:08.0597260Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:06:08.0597507Z 'l2_groupings': [16], 2026-02-21T09:06:08.0597699Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:06:08.0597902Z 'loop_orders': [[1, 0]], 2026-02-21T09:06:08.0598081Z 'num_stages': 2, 2026-02-21T09:06:08.0598229Z 'num_warps': 1, 2026-02-21T09:06:08.0598387Z 'pid_type': 'flat', 2026-02-21T09:06:08.0598559Z 'range_flattens': [None, True], 2026-02-21T09:06:08.0598748Z 'range_multi_buffers': [None, False], 2026-02-21T09:06:08.0598941Z 'range_num_stages': [0, 3], 2026-02-21T09:06:08.0599110Z 'range_unroll_factors': [0, 1], 2026-02-21T09:06:08.0599300Z 'range_warp_specializes': [None, None]} 2026-02-21T09:06:08.0632983Z [333s] Fitting surrogate: 1240 points, 1240 targets 2026-02-21T09:06:08.6925337Z [334s] Generation 19 starting: 29 neighbors, 2 active search path(s) 2026-02-21T09:06:25.0682642Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 0.8 configs/s 2026-02-21T09:06:25.9696925Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:06:25.9701454Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:25.9703666Z ^ 2026-02-21T09:06:25.9704154Z /tmp/torchinductor_root/oc/coc3sargr4tczigao47gzl5rhsj4pdrprsmzz2eb7ieufentond2.py:87:40: note: called 
from 2026-02-21T09:06:25.9704605Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:06:25.9705122Z ^ 2026-02-21T09:06:25.9705594Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:06:25.9706178Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:25.9706471Z ^ 2026-02-21T09:06:25.9706781Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T09:06:25.9712567Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:06:25.9717465Z %cst = arith.constant dense<0> : tensor<64x2x16xi8> 2026-02-21T09:06:25.9717816Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:06:25.9718019Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:06:25.9718222Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:06:25.9718443Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:06:25.9718693Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:06:25.9718928Z %cst_2 = arith.constant dense<4> : tensor<64x16xi8> 2026-02-21T09:06:25.9719173Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:06:25.9719416Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:06:25.9719716Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:06:25.9719964Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x16xf32> 2026-02-21T09:06:25.9720196Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:06:25.9720387Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:06:25.9720564Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:06:25.9720752Z %c448_i32 = arith.constant 448 : i32 2026-02-21T09:06:25.9720935Z %0 = tt.get_program_id x : i32 2026-02-21T09:06:25.9721123Z %1 = arith.addi %0, %c1_i32 : i32 2026-02-21T09:06:25.9721312Z %2 = arith.minsi %1, %c448_i32 : 
i32 2026-02-21T09:06:25.9721514Z scf.for %arg3 = %0 to %2 step %c1_i32 : i32 { 2026-02-21T09:06:25.9721829Z %3 = arith.divsi %arg3, %c4_i32 : i32 2026-02-21T09:06:25.9722016Z %4 = arith.muli %3, %c4_i32 : i32 2026-02-21T09:06:25.9722207Z %5 = arith.subi %c448_i32, %4 : i32 2026-02-21T09:06:25.9722388Z %6 = arith.minsi %5, %c4_i32 : i32 2026-02-21T09:06:25.9722580Z %7 = arith.remsi %arg3, %c4_i32 : i32 2026-02-21T09:06:25.9722765Z %8 = arith.remsi %7, %6 : i32 2026-02-21T09:06:25.9722948Z %9 = arith.addi %4, %8 : i32 2026-02-21T09:06:25.9723125Z %10 = arith.divsi %7, %6 : i32 2026-02-21T09:06:25.9723303Z %11 = arith.muli %9, %c16_i32 : i32 2026-02-21T09:06:25.9723551Z %12 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:06:25.9723806Z %13 = tt.splat %11 : i32 -> tensor<16xi32> 2026-02-21T09:06:25.9724013Z %14 = arith.addi %13, %12 : tensor<16xi32> 2026-02-21T09:06:25.9724206Z %15 = arith.muli %10, %c64_i32 : i32 2026-02-21T09:06:25.9724440Z %16 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:06:25.9724686Z %17 = tt.splat %15 : i32 -> tensor<64xi32> 2026-02-21T09:06:25.9724879Z %18 = arith.addi %17, %16 : tensor<64xi32> 2026-02-21T09:06:25.9725078Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:06:25.9725264Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:06:25.9725593Z %19 = scf.for %arg4 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg5 = %cst_5) -> (tensor<64x16xf32>) : i32 { 2026-02-21T09:06:25.9725928Z %73 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:06:25.9726145Z %74 = arith.addi %73, %16 : tensor<64xi32> 2026-02-21T09:06:25.9726355Z %75 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:06:25.9726589Z %76 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:25.9726846Z %77 = tt.splat %75 : i32 -> tensor<128xi32> 2026-02-21T09:06:25.9727112Z %78 = arith.addi %77, %76 : tensor<128xi32> 2026-02-21T09:06:25.9727374Z %79 = tt.expand_dims %18 
{axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:25.9727654Z %80 = arith.muli %79, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:25.9727930Z %81 = tt.expand_dims %78 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:25.9728239Z %82 = tt.broadcast %80 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:25.9728551Z %83 = tt.broadcast %81 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:25.9728808Z %84 = arith.addi %82, %83 : tensor<64x128xi32> 2026-02-21T09:06:25.9729087Z %85 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:25.9729400Z %86 = tt.addptr %85, %84 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:25.9729719Z %87 = tt.load %86 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:25.9730033Z %88 = arith.extf %87 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:25.9730335Z %89 = tt.expand_dims %74 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:25.9730607Z %90 = arith.muli %89, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:25.9730868Z %91 = tt.expand_dims %14 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:06:25.9731187Z %92 = tt.broadcast %90 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:06:25.9731454Z %93 = tt.broadcast %91 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:06:25.9731751Z %94 = arith.addi %92, %93 : tensor<64x16xi32> 2026-02-21T09:06:25.9731993Z %95 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9732285Z %96 = tt.addptr %95, %94 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:06:25.9732598Z %97 = tt.load %96 evictionPolicy = evict_first : tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9732886Z %98 = arith.shli %97, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9733113Z %99 = arith.shrsi %98, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9733353Z %100 = arith.shrsi %97, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9733623Z %101 = 
tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:25.9733937Z %102 = tt.expand_dims %101 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:25.9734283Z %103 = tt.expand_dims %102 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:25.9734627Z %104 = tt.expand_dims %99 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:06:25.9734979Z %105 = tt.expand_dims %100 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:06:25.9735296Z %106 = arith.cmpi eq, %103, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:25.9735563Z %107 = tt.broadcast %106 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:06:25.9735860Z %108 = tt.broadcast %104 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:06:25.9736168Z %109 = arith.select %107, %108, %cst : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:06:25.9736465Z %110 = arith.cmpi eq, %103, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:25.9736730Z %111 = tt.broadcast %105 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:06:25.9737025Z %112 = tt.broadcast %110 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:06:25.9737330Z %113 = arith.select %112, %111, %109 : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:06:25.9737628Z %114 = tt.reshape %113 : tensor<64x2x16xi8> -> tensor<128x16xi8> 2026-02-21T09:06:25.9737915Z %115 = arith.sitofp %114 : tensor<128x16xi8> to tensor<128x16xf32> 2026-02-21T09:06:25.9738310Z %116 = tt.dot %88, %115, %arg5, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x16xf32> -> tensor<64x16xf32> 2026-02-21T09:06:25.9738694Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T09:06:25.9738903Z %117 = arith.muli %c64_i32, %c1_i32_6 : i32 2026-02-21T09:06:25.9739115Z %118 = arith.addi %arg4, %117 : i32 2026-02-21T09:06:25.9739330Z %119 = tt.splat %118 : i32 -> tensor<64xi32> 2026-02-21T09:06:25.9739547Z %120 = arith.addi %119, %16 : tensor<64xi32> 2026-02-21T09:06:25.9739760Z %121 = 
arith.muli %118, %c2_i32 : i32 2026-02-21T09:06:25.9740014Z %122 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:25.9740320Z %123 = tt.splat %121 : i32 -> tensor<128xi32> 2026-02-21T09:06:25.9740534Z %124 = arith.addi %123, %122 : tensor<128xi32> 2026-02-21T09:06:25.9740845Z %125 = tt.expand_dims %18 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:25.9741127Z %126 = arith.muli %125, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:25.9741401Z %127 = tt.expand_dims %124 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:25.9741754Z %128 = tt.broadcast %126 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:25.9742034Z %129 = tt.broadcast %127 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:25.9742299Z %130 = arith.addi %128, %129 : tensor<64x128xi32> 2026-02-21T09:06:25.9742568Z %131 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:25.9742891Z %132 = tt.addptr %131, %130 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:25.9743221Z %133 = tt.load %132 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:25.9743523Z %134 = arith.extf %133 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:25.9743821Z %135 = tt.expand_dims %120 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:25.9744093Z %136 = arith.muli %135, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:25.9744361Z %137 = tt.expand_dims %14 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:06:25.9744660Z %138 = tt.broadcast %136 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:06:25.9744926Z %139 = tt.broadcast %137 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:06:25.9745171Z %140 = arith.addi %138, %139 : tensor<64x16xi32> 2026-02-21T09:06:25.9745409Z %141 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9745698Z %142 = tt.addptr %141, %140 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 
2026-02-21T09:06:25.9746007Z %143 = tt.load %142 evictionPolicy = evict_first : tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9746287Z %144 = arith.shli %143, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9746514Z %145 = arith.shrsi %144, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9746736Z %146 = arith.shrsi %143, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9746988Z %147 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:25.9747283Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:25.9747606Z %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:25.9747938Z %150 = tt.expand_dims %145 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:06:25.9748265Z %151 = tt.expand_dims %146 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:06:25.9748558Z %152 = arith.cmpi eq, %149, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:25.9748811Z %153 = tt.broadcast %152 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:06:25.9749094Z %154 = tt.broadcast %150 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:06:25.9749383Z %155 = arith.select %153, %154, %cst : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:06:25.9749668Z %156 = arith.cmpi eq, %149, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:25.9749950Z %157 = tt.broadcast %151 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:06:25.9750222Z %158 = tt.broadcast %156 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:06:25.9750512Z %159 = arith.select %158, %157, %155 : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:06:25.9750800Z %160 = tt.reshape %159 : tensor<64x2x16xi8> -> tensor<128x16xi8> 2026-02-21T09:06:25.9751078Z %161 = arith.sitofp %160 : tensor<128x16xi8> to tensor<128x16xf32> 2026-02-21T09:06:25.9751475Z %162 = tt.dot %134, %161, %116, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x16xf32> -> tensor<64x16xf32> 
2026-02-21T09:06:25.9751901Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:06:25.9752104Z %163 = arith.muli %c64_i32, %c2_i32_7 : i32 2026-02-21T09:06:25.9752300Z %164 = arith.addi %arg4, %163 : i32 2026-02-21T09:06:25.9752503Z %165 = tt.splat %164 : i32 -> tensor<64xi32> 2026-02-21T09:06:25.9752706Z %166 = arith.addi %165, %16 : tensor<64xi32> 2026-02-21T09:06:25.9752907Z %167 = arith.muli %164, %c2_i32 : i32 2026-02-21T09:06:25.9753150Z %168 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:25.9753405Z %169 = tt.splat %167 : i32 -> tensor<128xi32> 2026-02-21T09:06:25.9753624Z %170 = arith.addi %169, %168 : tensor<128xi32> 2026-02-21T09:06:25.9753899Z %171 = tt.expand_dims %18 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:25.9754180Z %172 = arith.muli %171, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:25.9754444Z %173 = tt.expand_dims %170 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:25.9754749Z %174 = tt.broadcast %172 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:25.9755027Z %175 = tt.broadcast %173 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:25.9755271Z %176 = arith.addi %174, %175 : tensor<64x128xi32> 2026-02-21T09:06:25.9755528Z %177 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:25.9755823Z %178 = tt.addptr %177, %176 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:25.9756149Z %179 = tt.load %178 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:25.9756449Z %180 = arith.extf %179 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:25.9756775Z %181 = tt.expand_dims %166 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:25.9757055Z %182 = arith.muli %181, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:25.9757312Z %183 = tt.expand_dims %14 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:06:25.9757610Z %184 = tt.broadcast %182 : tensor<64x1xi32> 
-> tensor<64x16xi32> 2026-02-21T09:06:25.9757870Z %185 = tt.broadcast %183 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:06:25.9758116Z %186 = arith.addi %184, %185 : tensor<64x16xi32> 2026-02-21T09:06:25.9758359Z %187 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9758635Z %188 = tt.addptr %187, %186 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:06:25.9758947Z %189 = tt.load %188 evictionPolicy = evict_first : tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9759217Z %190 = arith.shli %189, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9759444Z %191 = arith.shrsi %190, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9759666Z %192 = arith.shrsi %189, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9759920Z %193 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:25.9760220Z %194 = tt.expand_dims %193 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:25.9760533Z %195 = tt.expand_dims %194 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:25.9760865Z %196 = tt.expand_dims %191 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:06:25.9761232Z %197 = tt.expand_dims %192 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:06:25.9761526Z %198 = arith.cmpi eq, %195, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:25.9761834Z %199 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:06:25.9762110Z %200 = tt.broadcast %196 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:06:25.9762435Z %201 = arith.select %199, %200, %cst : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:06:25.9762708Z %202 = arith.cmpi eq, %195, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:25.9762994Z %203 = tt.broadcast %197 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:06:25.9763270Z %204 = tt.broadcast %202 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:06:25.9763566Z %205 = arith.select %204, %203, %201 : 
tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:06:25.9763855Z %206 = tt.reshape %205 : tensor<64x2x16xi8> -> tensor<128x16xi8> 2026-02-21T09:06:25.9764126Z %207 = arith.sitofp %206 : tensor<128x16xi8> to tensor<128x16xf32> 2026-02-21T09:06:25.9764498Z %208 = tt.dot %180, %207, %162, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x16xf32> -> tensor<64x16xf32> 2026-02-21T09:06:25.9764826Z scf.yield %208 : tensor<64x16xf32> 2026-02-21T09:06:25.9765110Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:06:25.9765324Z %20 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:06:25.9765535Z %21 = arith.addi %20, %16 : tensor<64xi32> 2026-02-21T09:06:25.9765742Z %22 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:06:25.9765979Z %23 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:25.9766237Z %24 = tt.splat %22 : i32 -> tensor<128xi32> 2026-02-21T09:06:25.9766437Z %25 = arith.addi %24, %23 : tensor<128xi32> 2026-02-21T09:06:25.9766694Z %26 = tt.expand_dims %18 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:25.9766957Z %27 = arith.muli %26, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:25.9767216Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:25.9767508Z %29 = tt.broadcast %27 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:25.9767770Z %30 = tt.broadcast %28 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:25.9768018Z %31 = arith.addi %29, %30 : tensor<64x128xi32> 2026-02-21T09:06:25.9768257Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:25.9768545Z %33 = tt.addptr %32, %31 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:25.9768854Z %34 = tt.load %33 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:25.9769140Z %35 = arith.extf %34 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:25.9769424Z %36 = tt.expand_dims %21 {axis = 1 : i32} : tensor<64xi32> -> 
tensor<64x1xi32> 2026-02-21T09:06:25.9769687Z %37 = arith.muli %36, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:25.9769944Z %38 = tt.expand_dims %14 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:06:25.9770227Z %39 = tt.broadcast %37 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:06:25.9770483Z %40 = tt.broadcast %38 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:06:25.9770721Z %41 = arith.addi %39, %40 : tensor<64x16xi32> 2026-02-21T09:06:25.9770953Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9771228Z %43 = tt.addptr %42, %41 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:06:25.9771516Z %44 = tt.load %43 evictionPolicy = evict_first : tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9771824Z %45 = arith.shli %44, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9772036Z %46 = arith.shrsi %45, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9772283Z %47 = arith.shrsi %44, %cst_2 : tensor<64x16xi8> 2026-02-21T09:06:25.9772526Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:25.9772808Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:25.9773115Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:25.9773424Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:06:25.9773772Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<64x16xi8> -> tensor<64x1x16xi8> 2026-02-21T09:06:25.9774079Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:25.9774323Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:06:25.9774595Z %55 = tt.broadcast %51 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:06:25.9774871Z %56 = arith.select %54, %55, %cst : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:06:25.9775150Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:25.9775406Z 
%58 = tt.broadcast %52 : tensor<64x1x16xi8> -> tensor<64x2x16xi8> 2026-02-21T09:06:25.9775688Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<64x2x16xi1> 2026-02-21T09:06:25.9775974Z %60 = arith.select %59, %58, %56 : tensor<64x2x16xi1>, tensor<64x2x16xi8> 2026-02-21T09:06:25.9776279Z %61 = tt.reshape %60 : tensor<64x2x16xi8> -> tensor<128x16xi8> 2026-02-21T09:06:25.9776559Z %62 = arith.sitofp %61 : tensor<128x16xi8> to tensor<128x16xf32> 2026-02-21T09:06:25.9776926Z %63 = tt.dot %35, %62, %19, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x16xf32> -> tensor<64x16xf32> 2026-02-21T09:06:25.9777302Z %64 = arith.truncf %63 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T09:06:25.9777607Z %65 = tt.expand_dims %18 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:25.9777879Z %66 = arith.muli %65, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:25.9778150Z %67 = tt.expand_dims %14 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T09:06:25.9778443Z %68 = tt.broadcast %66 : tensor<64x1xi32> -> tensor<64x16xi32> 2026-02-21T09:06:25.9778714Z %69 = tt.broadcast %67 : tensor<1x16xi32> -> tensor<64x16xi32> 2026-02-21T09:06:25.9778950Z %70 = arith.addi %68, %69 : tensor<64x16xi32> 2026-02-21T09:06:25.9779226Z %71 = tt.splat %arg2 : !tt.ptr -> tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9779526Z %72 = tt.addptr %71, %70 : tensor<64x16x!tt.ptr>, tensor<64x16xi32> 2026-02-21T09:06:25.9779793Z tt.store %72, %64 : tensor<64x16x!tt.ptr> 2026-02-21T09:06:25.9780116Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32} 2026-02-21T09:06:25.9780407Z tt.return 2026-02-21T09:06:25.9780548Z } 2026-02-21T09:06:25.9780676Z } 2026-02-21T09:06:25.9780757Z 2026-02-21T09:06:25.9780810Z {-# 2026-02-21T09:06:25.9780958Z external_resources: { 2026-02-21T09:06:25.9781124Z mlir_reproducer: { 2026-02-21T09:06:25.9785784Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 
num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:06:25.9790230Z disable_threading: false, 2026-02-21T09:06:25.9790413Z verify_each: true 2026-02-21T09:06:25.9790559Z } 2026-02-21T09:06:25.9790694Z } 2026-02-21T09:06:25.9790814Z #-} 2026-02-21T09:06:25.9791254Z /tmp/torchinductor_root/oc/coc3sargr4tczigao47gzl5rhsj4pdrprsmzz2eb7ieufentond2.py:14:0: error: 
Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:06:25.9792346Z /tmp/torchinductor_root/oc/coc3sargr4tczigao47gzl5rhsj4pdrprsmzz2eb7ieufentond2.py:14:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:06:25.9793168Z [351s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:06:25.9794321Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[1, 3], range_warp_specializes=[False, False]), static_shapes=True) 2026-02-21T09:06:25.9795355Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:06:25.9795607Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:06:26.1421181Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:06:26.1422225Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:26.1422478Z ^ 2026-02-21T09:06:26.1422849Z /tmp/torchinductor_root/oc/cocnvqhunqrocyda5yhfpclx4sxut2pogavpuzx7wjkt6xafsg2a.py:78:36: note: called from 2026-02-21T09:06:26.1423275Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:06:26.1423482Z ^ 2026-02-21T09:06:26.1423915Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:06:26.1424404Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:26.1424671Z ^ 2026-02-21T09:06:26.1429205Z module { 2026-02-21T09:06:26.1431265Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:06:26.1431922Z %cst = arith.constant dense<0> : tensor<64x2x64xi8> 2026-02-21T09:06:26.1432174Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:06:26.1432378Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:06:26.1432599Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:06:26.1432836Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:06:26.1433064Z %cst_2 = arith.constant dense<4> : tensor<64x64xi8> 2026-02-21T09:06:26.1433491Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:06:26.1433726Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:06:26.1433986Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:06:26.1434220Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:06:26.1434411Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:06:26.1434601Z %c112_i32 = 
arith.constant 112 : i32 2026-02-21T09:06:26.1434874Z %0 = tt.get_program_id x : i32 2026-02-21T09:06:26.1435085Z %1 = arith.divsi %0, %c2_i32 : i32 2026-02-21T09:06:26.1435266Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:06:26.1435491Z %3 = arith.subi %c112_i32, %2 : i32 2026-02-21T09:06:26.1435673Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:06:26.1435851Z %5 = arith.remsi %0, %c2_i32 : i32 2026-02-21T09:06:26.1436022Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:06:26.1436198Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:06:26.1436371Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:06:26.1436536Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T09:06:26.1436764Z %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:06:26.1437011Z %11 = tt.splat %9 : i32 -> tensor<64xi32> 2026-02-21T09:06:26.1437216Z %12 = arith.addi %11, %10 : tensor<64xi32> 2026-02-21T09:06:26.1437401Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:06:26.1437633Z %14 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:06:26.1437828Z %15 = arith.addi %14, %10 : tensor<64xi32> 2026-02-21T09:06:26.1438023Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:06:26.1438218Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:06:26.1438534Z %16 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:06:26.1438875Z %70 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:06:26.1439082Z %71 = arith.addi %70, %10 : tensor<64xi32> 2026-02-21T09:06:26.1439287Z %72 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:06:26.1439519Z %73 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:26.1439776Z %74 = tt.splat %72 : i32 -> tensor<128xi32> 2026-02-21T09:06:26.1439983Z %75 = arith.addi %74, %73 : tensor<128xi32> 2026-02-21T09:06:26.1440233Z %76 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.1440507Z %77 = arith.muli %76, 
%cst_4 : tensor<64x1xi32> 2026-02-21T09:06:26.1440766Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:26.1441065Z %79 = tt.broadcast %77 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:26.1441326Z %80 = tt.broadcast %78 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:26.1441609Z %81 = arith.addi %79, %80 : tensor<64x128xi32> 2026-02-21T09:06:26.1441866Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:26.1442156Z %83 = tt.addptr %82, %81 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:26.1442475Z %84 = tt.load %83 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:26.1442770Z %85 = arith.extf %84 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:26.1443059Z %86 = tt.expand_dims %71 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.1443333Z %87 = arith.muli %86, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:26.1443590Z %88 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.1443892Z %89 = tt.broadcast %87 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.1444155Z %90 = tt.broadcast %88 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.1444397Z %91 = arith.addi %89, %90 : tensor<64x64xi32> 2026-02-21T09:06:26.1444631Z %92 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.1444967Z %93 = tt.addptr %92, %91 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:26.1445270Z %94 = tt.load %93 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.1445535Z %95 = arith.shli %94, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1445758Z %96 = arith.shrsi %95, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1445972Z %97 = arith.shrsi %94, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1446278Z %98 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:26.1446561Z %99 = tt.expand_dims %98 
{axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:26.1446907Z %100 = tt.expand_dims %99 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:26.1447239Z %101 = tt.expand_dims %96 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:26.1447562Z %102 = tt.expand_dims %97 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:26.1447855Z %103 = arith.cmpi eq, %100, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:26.1448156Z %104 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:26.1448431Z %105 = tt.broadcast %101 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:26.1448728Z %106 = arith.select %104, %105, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:26.1453373Z %107 = arith.cmpi eq, %100, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:26.1453875Z %108 = tt.broadcast %102 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:26.1454192Z %109 = tt.broadcast %107 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:26.1454513Z %110 = arith.select %109, %108, %106 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:26.1454839Z %111 = tt.reshape %110 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:06:26.1455129Z %112 = arith.sitofp %111 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:06:26.1455525Z %113 = tt.dot %85, %112, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:26.1455879Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:06:26.1456086Z %114 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:06:26.1456301Z %115 = arith.addi %arg3, %114 : i32 2026-02-21T09:06:26.1456512Z %116 = tt.splat %115 : i32 -> tensor<64xi32> 2026-02-21T09:06:26.1456734Z %117 = arith.addi %116, %10 : tensor<64xi32> 2026-02-21T09:06:26.1456936Z %118 = arith.muli %115, %c2_i32 : i32 2026-02-21T09:06:26.1457192Z %119 = tt.make_range {end = 128 : i32, start = 0 : i32} : 
tensor<128xi32> 2026-02-21T09:06:26.1457461Z %120 = tt.splat %118 : i32 -> tensor<128xi32> 2026-02-21T09:06:26.1457690Z %121 = arith.addi %120, %119 : tensor<128xi32> 2026-02-21T09:06:26.1457967Z %122 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.1458255Z %123 = arith.muli %122, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:26.1458540Z %124 = tt.expand_dims %121 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:26.1458853Z %125 = tt.broadcast %123 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:26.1459147Z %126 = tt.broadcast %124 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:26.1459409Z %127 = arith.addi %125, %126 : tensor<64x128xi32> 2026-02-21T09:06:26.1459673Z %128 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:26.1459986Z %129 = tt.addptr %128, %127 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:26.1460319Z %130 = tt.load %129 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:26.1460637Z %131 = arith.extf %130 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:26.1460941Z %132 = tt.expand_dims %117 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.1461364Z %133 = arith.muli %132, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:26.1461715Z %134 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.1462020Z %135 = tt.broadcast %133 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.1462305Z %136 = tt.broadcast %134 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.1462556Z %137 = arith.addi %135, %136 : tensor<64x64xi32> 2026-02-21T09:06:26.1462864Z %138 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.1463152Z %139 = tt.addptr %138, %137 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:26.1463514Z %140 = tt.load %139 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 
2026-02-21T09:06:26.1463789Z %141 = arith.shli %140, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1464015Z %142 = arith.shrsi %141, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1464245Z %143 = arith.shrsi %140, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1464495Z %144 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:26.1464799Z %145 = tt.expand_dims %144 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:26.1465117Z %146 = tt.expand_dims %145 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:26.1465476Z %147 = tt.expand_dims %142 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:26.1465813Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:26.1466100Z %149 = arith.cmpi eq, %146, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:26.1466362Z %150 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:26.1466636Z %151 = tt.broadcast %147 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:26.1466935Z %152 = arith.select %150, %151, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:26.1467218Z %153 = arith.cmpi eq, %146, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:26.1467472Z %154 = tt.broadcast %148 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:26.1467752Z %155 = tt.broadcast %153 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:26.1468033Z %156 = arith.select %155, %154, %152 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:26.1468329Z %157 = tt.reshape %156 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:06:26.1468604Z %158 = arith.sitofp %157 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:06:26.1468977Z %159 = tt.dot %131, %158, %113, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:26.1469310Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:06:26.1469506Z %160 = arith.muli 
%c64_i32, %c2_i32_6 : i32 2026-02-21T09:06:26.1469707Z %161 = arith.addi %arg3, %160 : i32 2026-02-21T09:06:26.1469899Z %162 = tt.splat %161 : i32 -> tensor<64xi32> 2026-02-21T09:06:26.1470109Z %163 = arith.addi %162, %10 : tensor<64xi32> 2026-02-21T09:06:26.1470302Z %164 = arith.muli %161, %c2_i32 : i32 2026-02-21T09:06:26.1470538Z %165 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:26.1470797Z %166 = tt.splat %164 : i32 -> tensor<128xi32> 2026-02-21T09:06:26.1471005Z %167 = arith.addi %166, %165 : tensor<128xi32> 2026-02-21T09:06:26.1471268Z %168 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.1471577Z %169 = arith.muli %168, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:26.1471851Z %170 = tt.expand_dims %167 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:26.1472157Z %171 = tt.broadcast %169 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:26.1472425Z %172 = tt.broadcast %170 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:26.1472713Z %173 = arith.addi %171, %172 : tensor<64x128xi32> 2026-02-21T09:06:26.1472960Z %174 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:26.1473260Z %175 = tt.addptr %174, %173 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:26.1473580Z %176 = tt.load %175 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:26.1473890Z %177 = arith.extf %176 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:26.1474215Z %178 = tt.expand_dims %163 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.1474483Z %179 = arith.muli %178, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:26.1474779Z %180 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.1475062Z %181 = tt.broadcast %179 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.1475326Z %182 = tt.broadcast %180 : tensor<1x64xi32> -> 
tensor<64x64xi32> 2026-02-21T09:06:26.1475565Z %183 = arith.addi %181, %182 : tensor<64x64xi32> 2026-02-21T09:06:26.1475805Z %184 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.1476085Z %185 = tt.addptr %184, %183 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:26.1476388Z %186 = tt.load %185 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.1476689Z %187 = arith.shli %186, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1476909Z %188 = arith.shrsi %187, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1477135Z %189 = arith.shrsi %186, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1477379Z %190 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:26.1477683Z %191 = tt.expand_dims %190 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:26.1477999Z %192 = tt.expand_dims %191 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:26.1478321Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:26.1478653Z %194 = tt.expand_dims %189 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:26.1478937Z %195 = arith.cmpi eq, %192, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:26.1479194Z %196 = tt.broadcast %195 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:26.1479473Z %197 = tt.broadcast %193 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:26.1479762Z %198 = arith.select %196, %197, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:26.1480042Z %199 = arith.cmpi eq, %192, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:26.1480292Z %200 = tt.broadcast %194 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:26.1480568Z %201 = tt.broadcast %199 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:26.1480851Z %202 = arith.select %201, %200, %198 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:26.1481141Z %203 = tt.reshape %202 : 
tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:06:26.1481415Z %204 = arith.sitofp %203 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:06:26.1481809Z %205 = tt.dot %177, %204, %159, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:26.1482148Z scf.yield %205 : tensor<64x64xf32> 2026-02-21T09:06:26.1482321Z } 2026-02-21T09:06:26.1482486Z %17 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:06:26.1482697Z %18 = arith.addi %17, %10 : tensor<64xi32> 2026-02-21T09:06:26.1482898Z %19 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:06:26.1483142Z %20 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:26.1483387Z %21 = tt.splat %19 : i32 -> tensor<128xi32> 2026-02-21T09:06:26.1483590Z %22 = arith.addi %21, %20 : tensor<128xi32> 2026-02-21T09:06:26.1483835Z %23 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.1484151Z %24 = arith.muli %23, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:26.1484402Z %25 = tt.expand_dims %22 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:26.1484693Z %26 = tt.broadcast %24 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:26.1484958Z %27 = tt.broadcast %25 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:26.1485192Z %28 = arith.addi %26, %27 : tensor<64x128xi32> 2026-02-21T09:06:26.1485472Z %29 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:26.1485756Z %30 = tt.addptr %29, %28 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:26.1486090Z %31 = tt.load %30 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:26.1486392Z %32 = arith.extf %31 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:26.1486671Z %33 = tt.expand_dims %18 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.1487332Z %34 = arith.muli %33, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:26.1487580Z 
%35 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.1487869Z %36 = tt.broadcast %34 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.1488123Z %37 = tt.broadcast %35 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.1488387Z %38 = arith.addi %36, %37 : tensor<64x64xi32> 2026-02-21T09:06:26.1488629Z %39 = tt.splat %arg1 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.1488895Z %40 = tt.addptr %39, %38 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:26.1489193Z %41 = tt.load %40 evictionPolicy = evict_first : tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.1489451Z %42 = arith.shli %41, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1489668Z %43 = arith.shrsi %42, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1489878Z %44 = arith.shrsi %41, %cst_2 : tensor<64x64xi8> 2026-02-21T09:06:26.1490124Z %45 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:26.1490416Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:26.1490714Z %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:26.1492449Z %48 = tt.expand_dims %43 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:26.1492757Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<64x64xi8> -> tensor<64x1x64xi8> 2026-02-21T09:06:26.1493039Z %50 = arith.cmpi eq, %47, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:26.1493281Z %51 = tt.broadcast %50 : tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:26.1493551Z %52 = tt.broadcast %48 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:26.1493830Z %53 = arith.select %51, %52, %cst : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:26.1494090Z %54 = arith.cmpi eq, %47, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:26.1494339Z %55 = tt.broadcast %49 : tensor<64x1x64xi8> -> tensor<64x2x64xi8> 2026-02-21T09:06:26.1494596Z %56 = tt.broadcast %54 : 
tensor<1x2x1xi1> -> tensor<64x2x64xi1> 2026-02-21T09:06:26.1496184Z %57 = arith.select %56, %55, %53 : tensor<64x2x64xi1>, tensor<64x2x64xi8> 2026-02-21T09:06:26.1497136Z %58 = tt.reshape %57 : tensor<64x2x64xi8> -> tensor<128x64xi8> 2026-02-21T09:06:26.1497408Z %59 = arith.sitofp %58 : tensor<128x64xi8> to tensor<128x64xf32> 2026-02-21T09:06:26.1497780Z %60 = tt.dot %32, %59, %16, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:26.1498145Z %61 = arith.truncf %60 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:06:26.1498448Z %62 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.1498718Z %63 = arith.muli %62, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:26.1498988Z %64 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.1499321Z %65 = tt.broadcast %63 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.1499612Z %66 = tt.broadcast %64 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.1499867Z %67 = arith.addi %65, %66 : tensor<64x64xi32> 2026-02-21T09:06:26.1500123Z %68 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.1500427Z %69 = tt.addptr %68, %67 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:26.1500741Z tt.store %69, %61 : tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.1500947Z tt.return 2026-02-21T09:06:26.1501100Z } 2026-02-21T09:06:26.1501227Z } 2026-02-21T09:06:26.1501327Z 2026-02-21T09:06:26.1501390Z {-# 2026-02-21T09:06:26.1501527Z external_resources: { 2026-02-21T09:06:26.1501735Z mlir_reproducer: { 2026-02-21T09:06:26.1506295Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, 
tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:06:26.1510729Z disable_threading: false, 2026-02-21T09:06:26.1510899Z verify_each: true 2026-02-21T09:06:26.1511052Z } 2026-02-21T09:06:26.1511180Z } 2026-02-21T09:06:26.1511297Z #-} 2026-02-21T09:06:26.1511766Z /tmp/torchinductor_root/oc/cocnvqhunqrocyda5yhfpclx4sxut2pogavpuzx7wjkt6xafsg2a.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:06:26.1512790Z /tmp/torchinductor_root/oc/cocnvqhunqrocyda5yhfpclx4sxut2pogavpuzx7wjkt6xafsg2a.py:13:0: note: Pipeline failed while 
executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:06:26.1513608Z [351s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:06:26.1514641Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:06:26.1515547Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:06:26.1515811Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:06:26.7436262Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:06:26.7442169Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:26.7444300Z ^ 2026-02-21T09:06:26.7444848Z /tmp/torchinductor_root/45/c45ru6fwawmat4zny5dkwdt6qenw3cxrle6frutyqviyhbqwhg5k.py:78:36: note: called from 2026-02-21T09:06:26.7449068Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:06:26.7453858Z ^ 2026-02-21T09:06:26.7459090Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:06:26.7464060Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:26.7465484Z ^ 2026-02-21T09:06:26.7465706Z module { 2026-02-21T09:06:26.7466205Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = 
false} { 2026-02-21T09:06:26.7466764Z %cst = arith.constant dense<0> : tensor<128x2x64xi8> 2026-02-21T09:06:26.7466991Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:06:26.7467365Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:06:26.7467580Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:06:26.7467800Z %cst_0 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:06:26.7468055Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:06:26.7468312Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:06:26.7468558Z %cst_3 = arith.constant dense<4> : tensor<128x64xi8> 2026-02-21T09:06:26.7468791Z %cst_4 = arith.constant dense<7168> : tensor<128x1xi32> 2026-02-21T09:06:26.7469042Z %cst_5 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:06:26.7469299Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T09:06:26.7469540Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:06:26.7469727Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:06:26.7469921Z %c112_i32 = arith.constant 112 : i32 2026-02-21T09:06:26.7470148Z %0 = tt.get_program_id x : i32 2026-02-21T09:06:26.7470334Z %1 = arith.divsi %0, %c2_i32 : i32 2026-02-21T09:06:26.7470535Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:06:26.7470723Z %3 = arith.subi %c112_i32, %2 : i32 2026-02-21T09:06:26.7470918Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:06:26.7471105Z %5 = arith.remsi %0, %c2_i32 : i32 2026-02-21T09:06:26.7471285Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:06:26.7471470Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:06:26.7471809Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:06:26.7471989Z %9 = arith.muli %7, %c64_i32 : i32 2026-02-21T09:06:26.7472217Z %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:06:26.7472478Z %11 = tt.splat %9 : i32 -> tensor<64xi32> 2026-02-21T09:06:26.7472680Z %12 = arith.addi %11, %10 : tensor<64xi32> 2026-02-21T09:06:26.7472879Z %13 = arith.muli %8, %c64_i32 : i32 
2026-02-21T09:06:26.7473076Z %14 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:06:26.7473269Z %15 = arith.addi %14, %10 : tensor<64xi32> 2026-02-21T09:06:26.7473464Z %c3840_i32 = arith.constant 3840 : i32 2026-02-21T09:06:26.7473647Z %c384_i32 = arith.constant 384 : i32 2026-02-21T09:06:26.7473975Z %16 = scf.for %arg3 = %c0_i32 to %c3840_i32 step %c384_i32 iter_args(%arg4 = %cst_6) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:06:26.7474346Z %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:26.7474612Z %28 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T09:06:26.7474825Z %29 = arith.addi %28, %27 : tensor<128xi32> 2026-02-21T09:06:26.7475079Z %30 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:06:26.7475321Z %31 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:06:26.7475572Z %32 = tt.splat %30 : i32 -> tensor<256xi32> 2026-02-21T09:06:26.7475780Z %33 = arith.addi %32, %31 : tensor<256xi32> 2026-02-21T09:06:26.7476038Z %34 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.7476321Z %35 = arith.muli %34, %cst_5 : tensor<64x1xi32> 2026-02-21T09:06:26.7476632Z %36 = tt.expand_dims %33 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T09:06:26.7476957Z %37 = tt.broadcast %35 : tensor<64x1xi32> -> tensor<64x256xi32> 2026-02-21T09:06:26.7477236Z %38 = tt.broadcast %36 : tensor<1x256xi32> -> tensor<64x256xi32> 2026-02-21T09:06:26.7477477Z %39 = arith.addi %37, %38 : tensor<64x256xi32> 2026-02-21T09:06:26.7477729Z %40 = tt.splat %arg0 : !tt.ptr -> tensor<64x256x!tt.ptr> 2026-02-21T09:06:26.7478016Z %41 = tt.addptr %40, %39 : tensor<64x256x!tt.ptr>, tensor<64x256xi32> 2026-02-21T09:06:26.7478334Z %42 = tt.load %41 evictionPolicy = evict_last : tensor<64x256x!tt.ptr> 2026-02-21T09:06:26.7478635Z %43 = arith.extf %42 : tensor<64x256xbf16> to tensor<64x256xf32> 2026-02-21T09:06:26.7478924Z %44 = tt.expand_dims %29 {axis = 1 : i32} : 
tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:06:26.7479237Z %45 = arith.muli %44, %cst_4 : tensor<128x1xi32> 2026-02-21T09:06:26.7479494Z %46 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.7479827Z %47 = tt.broadcast %45 : tensor<128x1xi32> -> tensor<128x64xi32> 2026-02-21T09:06:26.7480096Z %48 = tt.broadcast %46 : tensor<1x64xi32> -> tensor<128x64xi32> 2026-02-21T09:06:26.7480340Z %49 = arith.addi %47, %48 : tensor<128x64xi32> 2026-02-21T09:06:26.7480569Z %50 = tt.splat %arg1 : !tt.ptr -> tensor<128x64x!tt.ptr> 2026-02-21T09:06:26.7480848Z %51 = tt.addptr %50, %49 : tensor<128x64x!tt.ptr>, tensor<128x64xi32> 2026-02-21T09:06:26.7481151Z %52 = tt.load %51 evictionPolicy = evict_first : tensor<128x64x!tt.ptr> 2026-02-21T09:06:26.7481434Z %53 = arith.shli %52, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7481709Z %54 = arith.shrsi %53, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7481932Z %55 = arith.shrsi %52, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7482196Z %56 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:26.7482498Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:26.7482828Z %58 = tt.expand_dims %57 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:26.7483157Z %59 = tt.expand_dims %54 {axis = 1 : i32} : tensor<128x64xi8> -> tensor<128x1x64xi8> 2026-02-21T09:06:26.7483509Z %60 = tt.expand_dims %55 {axis = 1 : i32} : tensor<128x64xi8> -> tensor<128x1x64xi8> 2026-02-21T09:06:26.7483811Z %61 = arith.cmpi eq, %58, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:06:26.7484069Z %62 = tt.broadcast %61 : tensor<1x2x1xi1> -> tensor<128x2x64xi1> 2026-02-21T09:06:26.7484357Z %63 = tt.broadcast %59 : tensor<128x1x64xi8> -> tensor<128x2x64xi8> 2026-02-21T09:06:26.7484651Z %64 = arith.select %62, %63, %cst : tensor<128x2x64xi1>, tensor<128x2x64xi8> 2026-02-21T09:06:26.7484936Z %65 = arith.cmpi eq, %58, %cst_1 : 
tensor<1x2x1xi32> 2026-02-21T09:06:26.7485194Z %66 = tt.broadcast %60 : tensor<128x1x64xi8> -> tensor<128x2x64xi8> 2026-02-21T09:06:26.7485479Z %67 = tt.broadcast %65 : tensor<1x2x1xi1> -> tensor<128x2x64xi1> 2026-02-21T09:06:26.7485768Z %68 = arith.select %67, %66, %64 : tensor<128x2x64xi1>, tensor<128x2x64xi8> 2026-02-21T09:06:26.7486055Z %69 = tt.reshape %68 : tensor<128x2x64xi8> -> tensor<256x64xi8> 2026-02-21T09:06:26.7486331Z %70 = arith.sitofp %69 : tensor<256x64xi8> to tensor<256x64xf32> 2026-02-21T09:06:26.7486760Z %71 = tt.dot %43, %70, %arg4, inputPrecision = tf32 : tensor<64x256xf32> * tensor<256x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:26.7487099Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:06:26.7487305Z %72 = arith.muli %c128_i32, %c1_i32 : i32 2026-02-21T09:06:26.7487505Z %73 = arith.addi %arg3, %72 : i32 2026-02-21T09:06:26.7487753Z %74 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:26.7488049Z %75 = tt.splat %73 : i32 -> tensor<128xi32> 2026-02-21T09:06:26.7488263Z %76 = arith.addi %75, %74 : tensor<128xi32> 2026-02-21T09:06:26.7488462Z %77 = arith.muli %73, %c2_i32 : i32 2026-02-21T09:06:26.7488748Z %78 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:06:26.7489007Z %79 = tt.splat %77 : i32 -> tensor<256xi32> 2026-02-21T09:06:26.7489217Z %80 = arith.addi %79, %78 : tensor<256xi32> 2026-02-21T09:06:26.7489478Z %81 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.7489754Z %82 = arith.muli %81, %cst_5 : tensor<64x1xi32> 2026-02-21T09:06:26.7490024Z %83 = tt.expand_dims %80 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T09:06:26.7490324Z %84 = tt.broadcast %82 : tensor<64x1xi32> -> tensor<64x256xi32> 2026-02-21T09:06:26.7490629Z %85 = tt.broadcast %83 : tensor<1x256xi32> -> tensor<64x256xi32> 2026-02-21T09:06:26.7490890Z %86 = arith.addi %84, %85 : tensor<64x256xi32> 2026-02-21T09:06:26.7491144Z %87 = 
tt.splat %arg0 : !tt.ptr -> tensor<64x256x!tt.ptr> 2026-02-21T09:06:26.7491453Z %88 = tt.addptr %87, %86 : tensor<64x256x!tt.ptr>, tensor<64x256xi32> 2026-02-21T09:06:26.7491799Z %89 = tt.load %88 evictionPolicy = evict_last : tensor<64x256x!tt.ptr> 2026-02-21T09:06:26.7492114Z %90 = arith.extf %89 : tensor<64x256xbf16> to tensor<64x256xf32> 2026-02-21T09:06:26.7492407Z %91 = tt.expand_dims %76 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:06:26.7492695Z %92 = arith.muli %91, %cst_4 : tensor<128x1xi32> 2026-02-21T09:06:26.7492970Z %93 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.7493270Z %94 = tt.broadcast %92 : tensor<128x1xi32> -> tensor<128x64xi32> 2026-02-21T09:06:26.7493552Z %95 = tt.broadcast %93 : tensor<1x64xi32> -> tensor<128x64xi32> 2026-02-21T09:06:26.7493803Z %96 = arith.addi %94, %95 : tensor<128x64xi32> 2026-02-21T09:06:26.7494041Z %97 = tt.splat %arg1 : !tt.ptr -> tensor<128x64x!tt.ptr> 2026-02-21T09:06:26.7494306Z %98 = tt.addptr %97, %96 : tensor<128x64x!tt.ptr>, tensor<128x64xi32> 2026-02-21T09:06:26.7494608Z %99 = tt.load %98 evictionPolicy = evict_first : tensor<128x64x!tt.ptr> 2026-02-21T09:06:26.7494884Z %100 = arith.shli %99, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7495112Z %101 = arith.shrsi %100, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7495352Z %102 = arith.shrsi %99, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7495598Z %103 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:26.7495898Z %104 = tt.expand_dims %103 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:26.7496217Z %105 = tt.expand_dims %104 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:26.7496543Z %106 = tt.expand_dims %101 {axis = 1 : i32} : tensor<128x64xi8> -> tensor<128x1x64xi8> 2026-02-21T09:06:26.7496880Z %107 = tt.expand_dims %102 {axis = 1 : i32} : tensor<128x64xi8> -> tensor<128x1x64xi8> 
2026-02-21T09:06:26.7497167Z %108 = arith.cmpi eq, %105, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:06:26.7497425Z %109 = tt.broadcast %108 : tensor<1x2x1xi1> -> tensor<128x2x64xi1> 2026-02-21T09:06:26.7497704Z %110 = tt.broadcast %106 : tensor<128x1x64xi8> -> tensor<128x2x64xi8> 2026-02-21T09:06:26.7498007Z %111 = arith.select %109, %110, %cst : tensor<128x2x64xi1>, tensor<128x2x64xi8> 2026-02-21T09:06:26.7498324Z %112 = arith.cmpi eq, %105, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:26.7498576Z %113 = tt.broadcast %107 : tensor<128x1x64xi8> -> tensor<128x2x64xi8> 2026-02-21T09:06:26.7498857Z %114 = tt.broadcast %112 : tensor<1x2x1xi1> -> tensor<128x2x64xi1> 2026-02-21T09:06:26.7499139Z %115 = arith.select %114, %113, %111 : tensor<128x2x64xi1>, tensor<128x2x64xi8> 2026-02-21T09:06:26.7499457Z %116 = tt.reshape %115 : tensor<128x2x64xi8> -> tensor<256x64xi8> 2026-02-21T09:06:26.7499724Z %117 = arith.sitofp %116 : tensor<256x64xi8> to tensor<256x64xf32> 2026-02-21T09:06:26.7500110Z %118 = tt.dot %90, %117, %71, inputPrecision = tf32 : tensor<64x256xf32> * tensor<256x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:26.7500438Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:06:26.7500632Z %119 = arith.muli %c128_i32, %c2_i32_7 : i32 2026-02-21T09:06:26.7500835Z %120 = arith.addi %arg3, %119 : i32 2026-02-21T09:06:26.7501068Z %121 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:26.7501327Z %122 = tt.splat %120 : i32 -> tensor<128xi32> 2026-02-21T09:06:26.7501566Z %123 = arith.addi %122, %121 : tensor<128xi32> 2026-02-21T09:06:26.7501765Z %124 = arith.muli %120, %c2_i32 : i32 2026-02-21T09:06:26.7502038Z %125 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:06:26.7502288Z %126 = tt.splat %124 : i32 -> tensor<256xi32> 2026-02-21T09:06:26.7502502Z %127 = arith.addi %126, %125 : tensor<256xi32> 2026-02-21T09:06:26.7502756Z %128 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 
2026-02-21T09:06:26.7503035Z %129 = arith.muli %128, %cst_5 : tensor<64x1xi32> 2026-02-21T09:06:26.7503307Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T09:06:26.7503602Z %131 = tt.broadcast %129 : tensor<64x1xi32> -> tensor<64x256xi32> 2026-02-21T09:06:26.7503880Z %132 = tt.broadcast %130 : tensor<1x256xi32> -> tensor<64x256xi32> 2026-02-21T09:06:26.7504123Z %133 = arith.addi %131, %132 : tensor<64x256xi32> 2026-02-21T09:06:26.7504375Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<64x256x!tt.ptr> 2026-02-21T09:06:26.7504662Z %135 = tt.addptr %134, %133 : tensor<64x256x!tt.ptr>, tensor<64x256xi32> 2026-02-21T09:06:26.7504984Z %136 = tt.load %135 evictionPolicy = evict_last : tensor<64x256x!tt.ptr> 2026-02-21T09:06:26.7505289Z %137 = arith.extf %136 : tensor<64x256xbf16> to tensor<64x256xf32> 2026-02-21T09:06:26.7505583Z %138 = tt.expand_dims %123 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:06:26.7505865Z %139 = arith.muli %138, %cst_4 : tensor<128x1xi32> 2026-02-21T09:06:26.7506120Z %140 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.7506414Z %141 = tt.broadcast %139 : tensor<128x1xi32> -> tensor<128x64xi32> 2026-02-21T09:06:26.7506684Z %142 = tt.broadcast %140 : tensor<1x64xi32> -> tensor<128x64xi32> 2026-02-21T09:06:26.7506937Z %143 = arith.addi %141, %142 : tensor<128x64xi32> 2026-02-21T09:06:26.7507186Z %144 = tt.splat %arg1 : !tt.ptr -> tensor<128x64x!tt.ptr> 2026-02-21T09:06:26.7507466Z %145 = tt.addptr %144, %143 : tensor<128x64x!tt.ptr>, tensor<128x64xi32> 2026-02-21T09:06:26.7507782Z %146 = tt.load %145 evictionPolicy = evict_first : tensor<128x64x!tt.ptr> 2026-02-21T09:06:26.7508053Z %147 = arith.shli %146, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7508282Z %148 = arith.shrsi %147, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7508510Z %149 = arith.shrsi %146, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7508755Z %150 = 
tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:26.7509050Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:26.7509384Z %152 = tt.expand_dims %151 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:26.7509714Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<128x64xi8> -> tensor<128x1x64xi8> 2026-02-21T09:06:26.7510044Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<128x64xi8> -> tensor<128x1x64xi8> 2026-02-21T09:06:26.7510337Z %155 = arith.cmpi eq, %152, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:06:26.7510622Z %156 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<128x2x64xi1> 2026-02-21T09:06:26.7510899Z %157 = tt.broadcast %153 : tensor<128x1x64xi8> -> tensor<128x2x64xi8> 2026-02-21T09:06:26.7511229Z %158 = arith.select %156, %157, %cst : tensor<128x2x64xi1>, tensor<128x2x64xi8> 2026-02-21T09:06:26.7511504Z %159 = arith.cmpi eq, %152, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:26.7511788Z %160 = tt.broadcast %154 : tensor<128x1x64xi8> -> tensor<128x2x64xi8> 2026-02-21T09:06:26.7512062Z %161 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<128x2x64xi1> 2026-02-21T09:06:26.7512355Z %162 = arith.select %161, %160, %158 : tensor<128x2x64xi1>, tensor<128x2x64xi8> 2026-02-21T09:06:26.7512648Z %163 = tt.reshape %162 : tensor<128x2x64xi8> -> tensor<256x64xi8> 2026-02-21T09:06:26.7512913Z %164 = arith.sitofp %163 : tensor<256x64xi8> to tensor<256x64xf32> 2026-02-21T09:06:26.7513305Z %165 = tt.dot %137, %164, %118, inputPrecision = tf32 : tensor<64x256xf32> * tensor<256x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:26.7513631Z scf.yield %165 : tensor<64x64xf32> 2026-02-21T09:06:26.7513828Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:06:26.7514154Z %17 = scf.for %arg3 = %c3840_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %16) -> (tensor<64x64xf32>) : i32 { 2026-02-21T09:06:26.7514526Z %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : 
tensor<128xi32> 2026-02-21T09:06:26.7514791Z %28 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T09:06:26.7515000Z %29 = arith.addi %28, %27 : tensor<128xi32> 2026-02-21T09:06:26.7515209Z %30 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:06:26.7515443Z %31 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:06:26.7515700Z %32 = tt.splat %30 : i32 -> tensor<256xi32> 2026-02-21T09:06:26.7515904Z %33 = arith.addi %32, %31 : tensor<256xi32> 2026-02-21T09:06:26.7516159Z %34 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.7516432Z %35 = arith.muli %34, %cst_5 : tensor<64x1xi32> 2026-02-21T09:06:26.7516687Z %36 = tt.expand_dims %33 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T09:06:26.7516992Z %37 = tt.broadcast %35 : tensor<64x1xi32> -> tensor<64x256xi32> 2026-02-21T09:06:26.7517260Z %38 = tt.broadcast %36 : tensor<1x256xi32> -> tensor<64x256xi32> 2026-02-21T09:06:26.7517507Z %39 = arith.addi %37, %38 : tensor<64x256xi32> 2026-02-21T09:06:26.7517763Z %40 = tt.splat %arg0 : !tt.ptr -> tensor<64x256x!tt.ptr> 2026-02-21T09:06:26.7518050Z %41 = tt.addptr %40, %39 : tensor<64x256x!tt.ptr>, tensor<64x256xi32> 2026-02-21T09:06:26.7518360Z %42 = tt.load %41 evictionPolicy = evict_last : tensor<64x256x!tt.ptr> 2026-02-21T09:06:26.7518660Z %43 = arith.extf %42 : tensor<64x256xbf16> to tensor<64x256xf32> 2026-02-21T09:06:26.7518954Z %44 = tt.expand_dims %29 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:06:26.7519232Z %45 = arith.muli %44, %cst_4 : tensor<128x1xi32> 2026-02-21T09:06:26.7519489Z %46 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.7519782Z %47 = tt.broadcast %45 : tensor<128x1xi32> -> tensor<128x64xi32> 2026-02-21T09:06:26.7520043Z %48 = tt.broadcast %46 : tensor<1x64xi32> -> tensor<128x64xi32> 2026-02-21T09:06:26.7520293Z %49 = arith.addi %47, %48 : tensor<128x64xi32> 
2026-02-21T09:06:26.7520556Z %50 = tt.splat %arg1 : !tt.ptr -> tensor<128x64x!tt.ptr> 2026-02-21T09:06:26.7520843Z %51 = tt.addptr %50, %49 : tensor<128x64x!tt.ptr>, tensor<128x64xi32> 2026-02-21T09:06:26.7521144Z %52 = tt.load %51 evictionPolicy = evict_first : tensor<128x64x!tt.ptr> 2026-02-21T09:06:26.7521405Z %53 = arith.shli %52, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7521648Z %54 = arith.shrsi %53, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7521897Z %55 = arith.shrsi %52, %cst_3 : tensor<128x64xi8> 2026-02-21T09:06:26.7522142Z %56 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:26.7522453Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:26.7522764Z %58 = tt.expand_dims %57 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:26.7523092Z %59 = tt.expand_dims %54 {axis = 1 : i32} : tensor<128x64xi8> -> tensor<128x1x64xi8> 2026-02-21T09:06:26.7523410Z %60 = tt.expand_dims %55 {axis = 1 : i32} : tensor<128x64xi8> -> tensor<128x1x64xi8> 2026-02-21T09:06:26.7523695Z %61 = arith.cmpi eq, %58, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:06:26.7523938Z %62 = tt.broadcast %61 : tensor<1x2x1xi1> -> tensor<128x2x64xi1> 2026-02-21T09:06:26.7524215Z %63 = tt.broadcast %59 : tensor<128x1x64xi8> -> tensor<128x2x64xi8> 2026-02-21T09:06:26.7524546Z %64 = arith.select %62, %63, %cst : tensor<128x2x64xi1>, tensor<128x2x64xi8> 2026-02-21T09:06:26.7524812Z %65 = arith.cmpi eq, %58, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:26.7525065Z %66 = tt.broadcast %60 : tensor<128x1x64xi8> -> tensor<128x2x64xi8> 2026-02-21T09:06:26.7525329Z %67 = tt.broadcast %65 : tensor<1x2x1xi1> -> tensor<128x2x64xi1> 2026-02-21T09:06:26.7525606Z %68 = arith.select %67, %66, %64 : tensor<128x2x64xi1>, tensor<128x2x64xi8> 2026-02-21T09:06:26.7525894Z %69 = tt.reshape %68 : tensor<128x2x64xi8> -> tensor<256x64xi8> 2026-02-21T09:06:26.7526170Z %70 = arith.sitofp %69 : tensor<256x64xi8> to 
tensor<256x64xf32> 2026-02-21T09:06:26.7526545Z %71 = tt.dot %43, %70, %arg4, inputPrecision = tf32 : tensor<64x256xf32> * tensor<256x64xf32> -> tensor<64x64xf32> 2026-02-21T09:06:26.7526878Z scf.yield %71 : tensor<64x64xf32> 2026-02-21T09:06:26.7527114Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:06:26.7527381Z %18 = arith.truncf %17 : tensor<64x64xf32> to tensor<64x64xbf16> 2026-02-21T09:06:26.7527685Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:26.7527956Z %20 = arith.muli %19, %cst_0 : tensor<64x1xi32> 2026-02-21T09:06:26.7528229Z %21 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T09:06:26.7528525Z %22 = tt.broadcast %20 : tensor<64x1xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.7528786Z %23 = tt.broadcast %21 : tensor<1x64xi32> -> tensor<64x64xi32> 2026-02-21T09:06:26.7529027Z %24 = arith.addi %22, %23 : tensor<64x64xi32> 2026-02-21T09:06:26.7529277Z %25 = tt.splat %arg2 : !tt.ptr -> tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.7529572Z %26 = tt.addptr %25, %24 : tensor<64x64x!tt.ptr>, tensor<64x64xi32> 2026-02-21T09:06:26.7529841Z tt.store %26, %18 : tensor<64x64x!tt.ptr> 2026-02-21T09:06:26.7530040Z tt.return 2026-02-21T09:06:26.7530182Z } 2026-02-21T09:06:26.7530310Z } 2026-02-21T09:06:26.7530382Z 2026-02-21T09:06:26.7530445Z {-# 2026-02-21T09:06:26.7530581Z external_resources: { 2026-02-21T09:06:26.7530749Z mlir_reproducer: { 2026-02-21T09:06:26.7535351Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, 
tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:06:26.7539822Z disable_threading: false, 2026-02-21T09:06:26.7540019Z verify_each: true 2026-02-21T09:06:26.7540177Z } 2026-02-21T09:06:26.7540307Z } 2026-02-21T09:06:26.7540424Z #-} 2026-02-21T09:06:26.7540856Z /tmp/torchinductor_root/45/c45ru6fwawmat4zny5dkwdt6qenw3cxrle6frutyqviyhbqwhg5k.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:06:26.7541917Z /tmp/torchinductor_root/45/c45ru6fwawmat4zny5dkwdt6qenw3cxrle6frutyqviyhbqwhg5k.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 
2026-02-21T09:06:26.7542738Z [352s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:06:26.7543794Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 64, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:06:26.7544724Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:06:26.7544994Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:06:27.0320680Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:06:27.0325198Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:27.0326680Z ^ 2026-02-21T09:06:27.0327125Z /tmp/torchinductor_root/4i/c4ioef7sql5ijkuzpzlnvmnngkyp4hpus5d5joux3fmufnzmii5i.py:72:36: note: called from 2026-02-21T09:06:27.0330083Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:06:27.0330330Z ^ 2026-02-21T09:06:27.0330759Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:06:27.0331241Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:27.0331490Z ^ 2026-02-21T09:06:27.0331803Z module { 2026-02-21T09:06:27.0334740Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:06:27.0335630Z %cst = arith.constant dense<0> : tensor<64x2x128xi8> 2026-02-21T09:06:27.0339453Z %c4096_i32 = arith.constant 4096 : i32 
2026-02-21T09:06:27.0341143Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:06:27.0341456Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:06:27.0347685Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:06:27.0349212Z %cst_2 = arith.constant dense<4> : tensor<64x128xi8> 2026-02-21T09:06:27.0349702Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:06:27.0349942Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:06:27.0350162Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:06:27.0350428Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x128xf32> 2026-02-21T09:06:27.0350669Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:06:27.0350851Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:06:27.0351040Z %c56_i32 = arith.constant 56 : i32 2026-02-21T09:06:27.0351235Z %0 = tt.get_program_id x : i32 2026-02-21T09:06:27.0351416Z %1 = arith.remsi %0, %c56_i32 : i32 2026-02-21T09:06:27.0351683Z %2 = arith.divsi %0, %c56_i32 : i32 2026-02-21T09:06:27.0351862Z %3 = arith.muli %1, %c128_i32 : i32 2026-02-21T09:06:27.0352103Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:27.0352360Z %5 = tt.splat %3 : i32 -> tensor<128xi32> 2026-02-21T09:06:27.0352610Z %6 = arith.addi %5, %4 : tensor<128xi32> 2026-02-21T09:06:27.0352799Z %7 = arith.muli %2, %c64_i32 : i32 2026-02-21T09:06:27.0353042Z %8 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:06:27.0353301Z %9 = tt.splat %7 : i32 -> tensor<64xi32> 2026-02-21T09:06:27.0353496Z %10 = arith.addi %9, %8 : tensor<64xi32> 2026-02-21T09:06:27.0353696Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T09:06:27.0353886Z %c192_i32 = arith.constant 192 : i32 2026-02-21T09:06:27.0354220Z %11 = scf.for %arg3 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x128xf32>) : i32 { 2026-02-21T09:06:27.0354566Z %64 = tt.splat %arg3 : i32 -> tensor<64xi32> 
2026-02-21T09:06:27.0354794Z %65 = arith.addi %64, %8 : tensor<64xi32> 2026-02-21T09:06:27.0355005Z %66 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:06:27.0355205Z %67 = tt.splat %66 : i32 -> tensor<128xi32> 2026-02-21T09:06:27.0355419Z %68 = arith.addi %67, %4 : tensor<128xi32> 2026-02-21T09:06:27.0355677Z %69 = tt.expand_dims %10 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:27.0355960Z %70 = arith.muli %69, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:27.0356220Z %71 = tt.expand_dims %68 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:27.0356522Z %72 = tt.broadcast %70 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0356797Z %73 = tt.broadcast %71 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0357041Z %74 = arith.addi %72, %73 : tensor<64x128xi32> 2026-02-21T09:06:27.0357297Z %75 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0357586Z %76 = tt.addptr %75, %74 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:27.0357897Z %77 = tt.load %76 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0358197Z %78 = arith.extf %77 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:27.0358535Z %79 = tt.expand_dims %65 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:27.0358801Z %80 = arith.muli %79, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:27.0359066Z %81 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:27.0359355Z %82 = tt.broadcast %80 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0359624Z %83 = tt.broadcast %81 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0360119Z %84 = arith.addi %82, %83 : tensor<64x128xi32> 2026-02-21T09:06:27.0360359Z %85 = tt.splat %arg1 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0360633Z %86 = tt.addptr %85, %84 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:27.0360926Z %87 = 
tt.load %86 evictionPolicy = evict_first : tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0361202Z %88 = arith.shli %87, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0361454Z %89 = arith.shrsi %88, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0361723Z %90 = arith.shrsi %87, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0361994Z %91 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:27.0362277Z %92 = tt.expand_dims %91 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:27.0362587Z %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:27.0362907Z %94 = tt.expand_dims %89 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:27.0363239Z %95 = tt.expand_dims %90 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:27.0363517Z %96 = arith.cmpi eq, %93, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:27.0363771Z %97 = tt.broadcast %96 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:27.0364071Z %98 = tt.broadcast %94 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:27.0364356Z %99 = arith.select %97, %98, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:27.0364632Z %100 = arith.cmpi eq, %93, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:27.0364890Z %101 = tt.broadcast %95 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:27.0365172Z %102 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:27.0365457Z %103 = arith.select %102, %101, %99 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:27.0365751Z %104 = tt.reshape %103 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:06:27.0366035Z %105 = arith.sitofp %104 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:06:27.0366407Z %106 = tt.dot %78, %105, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:06:27.0366740Z %c1_i32 = 
arith.constant 1 : i32 2026-02-21T09:06:27.0366933Z %107 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:06:27.0367136Z %108 = arith.addi %arg3, %107 : i32 2026-02-21T09:06:27.0367336Z %109 = tt.splat %108 : i32 -> tensor<64xi32> 2026-02-21T09:06:27.0367540Z %110 = arith.addi %109, %8 : tensor<64xi32> 2026-02-21T09:06:27.0367742Z %111 = arith.muli %108, %c2_i32 : i32 2026-02-21T09:06:27.0367940Z %112 = tt.splat %111 : i32 -> tensor<128xi32> 2026-02-21T09:06:27.0368152Z %113 = arith.addi %112, %4 : tensor<128xi32> 2026-02-21T09:06:27.0368405Z %114 = tt.expand_dims %10 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:27.0368688Z %115 = arith.muli %114, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:27.0368958Z %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:27.0369259Z %117 = tt.broadcast %115 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0369532Z %118 = tt.broadcast %116 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0369775Z %119 = arith.addi %117, %118 : tensor<64x128xi32> 2026-02-21T09:06:27.0370028Z %120 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0370323Z %121 = tt.addptr %120, %119 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:27.0370645Z %122 = tt.load %121 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0370945Z %123 = arith.extf %122 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:27.0371229Z %124 = tt.expand_dims %110 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:27.0371565Z %125 = arith.muli %124, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:27.0371827Z %126 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:27.0372126Z %127 = tt.broadcast %125 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0372393Z %128 = tt.broadcast %126 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0372643Z 
%129 = arith.addi %127, %128 : tensor<64x128xi32> 2026-02-21T09:06:27.0372918Z %130 = tt.splat %arg1 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0373221Z %131 = tt.addptr %130, %129 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:27.0373539Z %132 = tt.load %131 evictionPolicy = evict_first : tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0373812Z %133 = arith.shli %132, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0374047Z %134 = arith.shrsi %133, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0374279Z %135 = arith.shrsi %132, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0374522Z %136 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:27.0374817Z %137 = tt.expand_dims %136 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:27.0375128Z %138 = tt.expand_dims %137 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:27.0375505Z %139 = tt.expand_dims %134 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:27.0375837Z %140 = tt.expand_dims %135 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:27.0376133Z %141 = arith.cmpi eq, %138, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:27.0376391Z %142 = tt.broadcast %141 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:27.0376664Z %143 = tt.broadcast %139 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:27.0376967Z %144 = arith.select %142, %143, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:27.0377243Z %145 = arith.cmpi eq, %138, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:27.0377503Z %146 = tt.broadcast %140 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:27.0377777Z %147 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:27.0378067Z %148 = arith.select %147, %146, %144 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:27.0378367Z %149 = tt.reshape %148 : tensor<64x2x128xi8> -> 
tensor<128x128xi8> 2026-02-21T09:06:27.0378644Z %150 = arith.sitofp %149 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:06:27.0379024Z %151 = tt.dot %123, %150, %106, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:06:27.0379351Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:06:27.0379557Z %152 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:06:27.0379766Z %153 = arith.addi %arg3, %152 : i32 2026-02-21T09:06:27.0379970Z %154 = tt.splat %153 : i32 -> tensor<64xi32> 2026-02-21T09:06:27.0380186Z %155 = arith.addi %154, %8 : tensor<64xi32> 2026-02-21T09:06:27.0380386Z %156 = arith.muli %153, %c2_i32 : i32 2026-02-21T09:06:27.0380593Z %157 = tt.splat %156 : i32 -> tensor<128xi32> 2026-02-21T09:06:27.0380803Z %158 = arith.addi %157, %4 : tensor<128xi32> 2026-02-21T09:06:27.0381094Z %159 = tt.expand_dims %10 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:27.0381375Z %160 = arith.muli %159, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:27.0381688Z %161 = tt.expand_dims %158 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:27.0382010Z %162 = tt.broadcast %160 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0382290Z %163 = tt.broadcast %161 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0382552Z %164 = arith.addi %162, %163 : tensor<64x128xi32> 2026-02-21T09:06:27.0382842Z %165 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0383164Z %166 = tt.addptr %165, %164 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:27.0383505Z %167 = tt.load %166 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0383824Z %168 = arith.extf %167 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:27.0384148Z %169 = tt.expand_dims %155 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:27.0384455Z %170 = arith.muli %169, %cst_3 : tensor<64x1xi32> 
2026-02-21T09:06:27.0384759Z %171 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:27.0385060Z %172 = tt.broadcast %170 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0385347Z %173 = tt.broadcast %171 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0385603Z %174 = arith.addi %172, %173 : tensor<64x128xi32> 2026-02-21T09:06:27.0385850Z %175 = tt.splat %arg1 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0386150Z %176 = tt.addptr %175, %174 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:27.0386472Z %177 = tt.load %176 evictionPolicy = evict_first : tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0386767Z %178 = arith.shli %177, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0387044Z %179 = arith.shrsi %178, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0387285Z %180 = arith.shrsi %177, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0387548Z %181 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:27.0387852Z %182 = tt.expand_dims %181 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:27.0388182Z %183 = tt.expand_dims %182 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:27.0388519Z %184 = tt.expand_dims %179 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:27.0388872Z %185 = tt.expand_dims %180 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:27.0389180Z %186 = arith.cmpi eq, %183, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:27.0389443Z %187 = tt.broadcast %186 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:27.0389742Z %188 = tt.broadcast %184 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:27.0390061Z %189 = arith.select %187, %188, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:27.0390349Z %190 = arith.cmpi eq, %183, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:27.0390604Z %191 = tt.broadcast %185 : 
tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:27.0390882Z %192 = tt.broadcast %190 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:27.0391172Z %193 = arith.select %192, %191, %189 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:27.0391455Z %194 = tt.reshape %193 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:06:27.0391772Z %195 = arith.sitofp %194 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:06:27.0392138Z %196 = tt.dot %168, %195, %151, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:06:27.0392468Z scf.yield %196 : tensor<64x128xf32> 2026-02-21T09:06:27.0392646Z } 2026-02-21T09:06:27.0392802Z %12 = tt.splat %c4032_i32 : i32 -> tensor<64xi32> 2026-02-21T09:06:27.0393018Z %13 = arith.addi %12, %8 : tensor<64xi32> 2026-02-21T09:06:27.0393215Z %14 = arith.muli %c4032_i32, %c2_i32 : i32 2026-02-21T09:06:27.0393422Z %15 = tt.splat %14 : i32 -> tensor<128xi32> 2026-02-21T09:06:27.0393617Z %16 = arith.addi %15, %4 : tensor<128xi32> 2026-02-21T09:06:27.0393868Z %17 = tt.expand_dims %10 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:27.0394129Z %18 = arith.muli %17, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:27.0394425Z %19 = tt.expand_dims %16 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:27.0394719Z %20 = tt.broadcast %18 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0394985Z %21 = tt.broadcast %19 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0395227Z %22 = arith.addi %20, %21 : tensor<64x128xi32> 2026-02-21T09:06:27.0395464Z %23 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0395777Z %24 = tt.addptr %23, %22 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:27.0396077Z %25 = tt.load %24 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0396400Z %26 = arith.extf %25 : tensor<64x128xbf16> to tensor<64x128xf32> 
2026-02-21T09:06:27.0396688Z %27 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:27.0396941Z %28 = arith.muli %27, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:27.0397198Z %29 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:27.0397480Z %30 = tt.broadcast %28 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0397747Z %31 = tt.broadcast %29 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0397984Z %32 = arith.addi %30, %31 : tensor<64x128xi32> 2026-02-21T09:06:27.0398213Z %33 = tt.splat %arg1 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0398516Z %34 = tt.addptr %33, %32 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:27.0398812Z %35 = tt.load %34 evictionPolicy = evict_first : tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0399082Z %36 = arith.shli %35, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0399299Z %37 = arith.shrsi %36, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0399521Z %38 = arith.shrsi %35, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:27.0399766Z %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:27.0400046Z %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:27.0400352Z %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:27.0400662Z %42 = tt.expand_dims %37 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:27.0400985Z %43 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:27.0401261Z %44 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:27.0401510Z %45 = tt.broadcast %44 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:27.0401824Z %46 = tt.broadcast %42 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:27.0402105Z %47 = arith.select %45, %46, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 
2026-02-21T09:06:27.0402377Z %48 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:27.0402619Z %49 = tt.broadcast %43 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:27.0402896Z %50 = tt.broadcast %48 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:27.0403165Z %51 = arith.select %50, %49, %47 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:27.0403437Z %52 = tt.reshape %51 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:06:27.0403703Z %53 = arith.sitofp %52 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:06:27.0404054Z %54 = tt.dot %26, %53, %11, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:06:27.0404417Z %55 = arith.truncf %54 : tensor<64x128xf32> to tensor<64x128xbf16> 2026-02-21T09:06:27.0404699Z %56 = tt.expand_dims %10 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:27.0404965Z %57 = arith.muli %56, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:27.0405227Z %58 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:27.0405510Z %59 = tt.broadcast %57 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0405803Z %60 = tt.broadcast %58 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:27.0406033Z %61 = arith.addi %59, %60 : tensor<64x128xi32> 2026-02-21T09:06:27.0406282Z %62 = tt.splat %arg2 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0406558Z %63 = tt.addptr %62, %61 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:27.0406824Z tt.store %63, %55 : tensor<64x128x!tt.ptr> 2026-02-21T09:06:27.0407049Z tt.return 2026-02-21T09:06:27.0407181Z } 2026-02-21T09:06:27.0407311Z } 2026-02-21T09:06:27.0407384Z 2026-02-21T09:06:27.0407437Z {-# 2026-02-21T09:06:27.0407605Z external_resources: { 2026-02-21T09:06:27.0407767Z mlir_reproducer: { 2026-02-21T09:06:27.0412139Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false 
num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:06:27.0416572Z disable_threading: false, 2026-02-21T09:06:27.0416755Z verify_each: true 2026-02-21T09:06:27.0416917Z } 2026-02-21T09:06:27.0417047Z } 2026-02-21T09:06:27.0417179Z #-} 2026-02-21T09:06:27.0417604Z 
/tmp/torchinductor_root/4i/c4ioef7sql5ijkuzpzlnvmnngkyp4hpus5d5joux3fmufnzmii5i.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:06:27.0418630Z /tmp/torchinductor_root/4i/c4ioef7sql5ijkuzpzlnvmnngkyp4hpus5d5joux3fmufnzmii5i.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:06:27.0419444Z [352s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:06:27.0420474Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:06:27.0421407Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:06:27.0421692Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:06:27.0422198Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 30/30 15.5 configs/s 2026-02-21T09:06:28.7608192Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 761.0 2026-02-21T09:06:28.7612258Z configs/s 2026-02-21T09:06:28.8288564Z [354s] Generation 19 complete: 2026-02-21T09:06:28.8289809Z error=5 2026-02-21T09:06:28.8289976Z ok=26 2026-02-21T09:06:28.8290130Z min=0.1034 2026-02-21T09:06:28.8290557Z mid=0.2180 2026-02-21T09:06:28.8290686Z max=2.6409 2026-02-21T09:06:28.8290845Z best={'block_sizes': [32, 64, 16], 2026-02-21T09:06:28.8291092Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:06:28.8291408Z 'l2_groupings': [16], 2026-02-21T09:06:28.8291799Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:06:28.8292006Z 'loop_orders': [[1, 0]], 2026-02-21T09:06:28.8292179Z 'num_stages': 2, 2026-02-21T09:06:28.8292330Z 'num_warps': 1, 2026-02-21T09:06:28.8292487Z 'pid_type': 'flat', 2026-02-21T09:06:28.8292666Z 'range_flattens': [None, True], 2026-02-21T09:06:28.8292864Z 'range_multi_buffers': [None, False], 2026-02-21T09:06:28.8293058Z 'range_num_stages': [0, 3], 2026-02-21T09:06:28.8293241Z 'range_unroll_factors': [0, 1], 2026-02-21T09:06:28.8293431Z 'range_warp_specializes': [None, None]} 2026-02-21T09:06:28.8320837Z [354s] Fitting surrogate: 1271 points, 1271 targets 2026-02-21T09:06:29.3229166Z [354s] Generation 20 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:06:45.7778488Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 1.1 configs/s 2026-02-21T09:06:46.4782761Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:06:46.4784563Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:46.4784888Z ^ 2026-02-21T09:06:46.4785310Z /tmp/torchinductor_root/ei/ceilltriaqoc6u54j2j3ljj4d2p6itdi4pbonim5ohsf3wvfmfjq.py:78:36: note: called 
from 2026-02-21T09:06:46.4785774Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:06:46.4785994Z ^ 2026-02-21T09:06:46.4786396Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:06:46.4786869Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:06:46.4787121Z ^ 2026-02-21T09:06:46.4787292Z module { 2026-02-21T09:06:46.4789036Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:06:46.4789628Z %cst = arith.constant dense<0> : tensor<64x2x128xi8> 2026-02-21T09:06:46.4789866Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:06:46.4790060Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:06:46.4790288Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:06:46.4790519Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:06:46.4790740Z %cst_2 = arith.constant dense<4> : tensor<64x128xi8> 2026-02-21T09:06:46.4790975Z %cst_3 = arith.constant dense<7168> : tensor<64x1xi32> 2026-02-21T09:06:46.4791210Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:06:46.4791465Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x128xf32> 2026-02-21T09:06:46.4792348Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:06:46.4792540Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:06:46.4792728Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:06:46.4792913Z %c56_i32 = arith.constant 56 : i32 2026-02-21T09:06:46.4793101Z %0 = tt.get_program_id x : i32 2026-02-21T09:06:46.4793279Z %1 = arith.divsi %0, %c2_i32 : i32 2026-02-21T09:06:46.4793463Z %2 = arith.muli %1, %c2_i32 : i32 2026-02-21T09:06:46.4793894Z %3 = arith.subi %c56_i32, %2 : i32 2026-02-21T09:06:46.4794077Z %4 = arith.minsi %3, %c2_i32 : i32 2026-02-21T09:06:46.4794249Z 
%5 = arith.remsi %0, %c2_i32 : i32 2026-02-21T09:06:46.4794430Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:06:46.4794604Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:06:46.4794779Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:06:46.4794960Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:06:46.4795290Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:06:46.4795556Z %11 = tt.splat %9 : i32 -> tensor<128xi32> 2026-02-21T09:06:46.4795760Z %12 = arith.addi %11, %10 : tensor<128xi32> 2026-02-21T09:06:46.4796017Z %13 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:06:46.4796252Z %14 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:06:46.4796509Z %15 = tt.splat %13 : i32 -> tensor<64xi32> 2026-02-21T09:06:46.4796714Z %16 = arith.addi %15, %14 : tensor<64xi32> 2026-02-21T09:06:46.4796909Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:06:46.4797231Z %17 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x128xf32>) : i32 { 2026-02-21T09:06:46.4797597Z %27 = tt.splat %arg3 : i32 -> tensor<64xi32> 2026-02-21T09:06:46.4797808Z %28 = arith.addi %27, %14 : tensor<64xi32> 2026-02-21T09:06:46.4798092Z %29 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:06:46.4798297Z %30 = tt.splat %29 : i32 -> tensor<128xi32> 2026-02-21T09:06:46.4798496Z %31 = arith.addi %30, %10 : tensor<128xi32> 2026-02-21T09:06:46.4798760Z %32 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:46.4799032Z %33 = arith.muli %32, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:46.4799296Z %34 = tt.expand_dims %31 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:46.4799588Z %35 = tt.broadcast %33 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4799860Z %36 = tt.broadcast %34 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4800107Z %37 = arith.addi %35, %36 : tensor<64x128xi32> 2026-02-21T09:06:46.4800346Z %38 = 
tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4800644Z %39 = tt.addptr %38, %37 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:46.4800950Z %40 = tt.load %39 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4801246Z %41 = arith.extf %40 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:46.4801529Z %42 = tt.expand_dims %28 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:46.4801858Z %43 = arith.muli %42, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:46.4802114Z %44 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:46.4802444Z %45 = tt.broadcast %43 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4802702Z %46 = tt.broadcast %44 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4802948Z %47 = arith.addi %45, %46 : tensor<64x128xi32> 2026-02-21T09:06:46.4803180Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4803458Z %49 = tt.addptr %48, %47 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:46.4803761Z %50 = tt.load %49 evictionPolicy = evict_first : tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4804030Z %51 = arith.shli %50, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4804252Z %52 = arith.shrsi %51, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4804470Z %53 = arith.shrsi %50, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4804722Z %54 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:46.4805009Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:46.4805323Z %56 = tt.expand_dims %55 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:46.4805690Z %57 = tt.expand_dims %52 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:46.4806013Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:46.4806299Z %59 
= arith.cmpi eq, %56, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:46.4806538Z %60 = tt.broadcast %59 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:46.4806853Z %61 = tt.broadcast %57 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:46.4807143Z %62 = arith.select %60, %61, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:46.4807440Z %63 = arith.cmpi eq, %56, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:46.4807705Z %64 = tt.broadcast %58 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:46.4807985Z %65 = tt.broadcast %63 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:46.4808281Z %66 = arith.select %65, %64, %62 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:46.4808571Z %67 = tt.reshape %66 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:06:46.4808857Z %68 = arith.sitofp %67 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:06:46.4809246Z %69 = tt.dot %41, %68, %arg4, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:06:46.4809623Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:06:46.4809836Z %70 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T09:06:46.4810040Z %71 = arith.addi %arg3, %70 : i32 2026-02-21T09:06:46.4810255Z %72 = tt.splat %71 : i32 -> tensor<64xi32> 2026-02-21T09:06:46.4810471Z %73 = arith.addi %72, %14 : tensor<64xi32> 2026-02-21T09:06:46.4810688Z %74 = arith.muli %71, %c2_i32 : i32 2026-02-21T09:06:46.4810900Z %75 = tt.splat %74 : i32 -> tensor<128xi32> 2026-02-21T09:06:46.4811111Z %76 = arith.addi %75, %10 : tensor<128xi32> 2026-02-21T09:06:46.4811383Z %77 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:46.4811693Z %78 = arith.muli %77, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:46.4811962Z %79 = tt.expand_dims %76 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:46.4812261Z %80 = tt.broadcast %78 : tensor<64x1xi32> -> tensor<64x128xi32> 
2026-02-21T09:06:46.4812537Z %81 = tt.broadcast %79 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4812790Z %82 = arith.addi %80, %81 : tensor<64x128xi32> 2026-02-21T09:06:46.4813043Z %83 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4813344Z %84 = tt.addptr %83, %82 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:46.4813660Z %85 = tt.load %84 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4813974Z %86 = arith.extf %85 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:46.4814275Z %87 = tt.expand_dims %73 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:46.4814549Z %88 = arith.muli %87, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:46.4814820Z %89 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:46.4815116Z %90 = tt.broadcast %88 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4815400Z %91 = tt.broadcast %89 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4815644Z %92 = arith.addi %90, %91 : tensor<64x128xi32> 2026-02-21T09:06:46.4815891Z %93 = tt.splat %arg1 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4816179Z %94 = tt.addptr %93, %92 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:46.4816487Z %95 = tt.load %94 evictionPolicy = evict_first : tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4816772Z %96 = arith.shli %95, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4817029Z %97 = arith.shrsi %96, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4817259Z %98 = arith.shrsi %95, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4817510Z %99 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:46.4817824Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:46.4818158Z %101 = tt.expand_dims %100 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:46.4818526Z %102 = tt.expand_dims 
%97 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:46.4818887Z %103 = tt.expand_dims %98 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:46.4819174Z %104 = arith.cmpi eq, %101, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:46.4819436Z %105 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:46.4819719Z %106 = tt.broadcast %102 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:46.4820015Z %107 = arith.select %105, %106, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:46.4820301Z %108 = arith.cmpi eq, %101, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:46.4820555Z %109 = tt.broadcast %103 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:46.4820835Z %110 = tt.broadcast %108 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:46.4821147Z %111 = arith.select %110, %109, %107 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:46.4821443Z %112 = tt.reshape %111 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:06:46.4821756Z %113 = arith.sitofp %112 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:06:46.4822126Z %114 = tt.dot %86, %113, %69, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:06:46.4822459Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:06:46.4822652Z %115 = arith.muli %c64_i32, %c2_i32_6 : i32 2026-02-21T09:06:46.4822857Z %116 = arith.addi %arg3, %115 : i32 2026-02-21T09:06:46.4823051Z %117 = tt.splat %116 : i32 -> tensor<64xi32> 2026-02-21T09:06:46.4823263Z %118 = arith.addi %117, %14 : tensor<64xi32> 2026-02-21T09:06:46.4823464Z %119 = arith.muli %116, %c2_i32 : i32 2026-02-21T09:06:46.4823661Z %120 = tt.splat %119 : i32 -> tensor<128xi32> 2026-02-21T09:06:46.4823879Z %121 = arith.addi %120, %10 : tensor<128xi32> 2026-02-21T09:06:46.4824141Z %122 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:46.4824431Z 
%123 = arith.muli %122, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:46.4824702Z %124 = tt.expand_dims %121 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:46.4825012Z %125 = tt.broadcast %123 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4825292Z %126 = tt.broadcast %124 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4825537Z %127 = arith.addi %125, %126 : tensor<64x128xi32> 2026-02-21T09:06:46.4825792Z %128 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4826085Z %129 = tt.addptr %128, %127 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:46.4826408Z %130 = tt.load %129 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4826715Z %131 = arith.extf %130 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:46.4827005Z %132 = tt.expand_dims %118 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:46.4827284Z %133 = arith.muli %132, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:46.4827544Z %134 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:46.4827844Z %135 = tt.broadcast %133 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4828110Z %136 = tt.broadcast %134 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4828387Z %137 = arith.addi %135, %136 : tensor<64x128xi32> 2026-02-21T09:06:46.4828641Z %138 = tt.splat %arg1 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4828921Z %139 = tt.addptr %138, %137 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:46.4829237Z %140 = tt.load %139 evictionPolicy = evict_first : tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4829514Z %141 = arith.shli %140, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4829771Z %142 = arith.shrsi %141, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4829996Z %143 = arith.shrsi %140, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4830301Z %144 = tt.make_range {end = 2 : i32, start = 0 
: i32} : tensor<2xi32> 2026-02-21T09:06:46.4830607Z %145 = tt.expand_dims %144 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:46.4830924Z %146 = tt.expand_dims %145 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:46.4831263Z %147 = tt.expand_dims %142 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:46.4831614Z %148 = tt.expand_dims %143 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:46.4831908Z %149 = arith.cmpi eq, %146, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:46.4832167Z %150 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:46.4832476Z %151 = tt.broadcast %147 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:46.4832782Z %152 = arith.select %150, %151, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:46.4833059Z %153 = arith.cmpi eq, %146, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:46.4833318Z %154 = tt.broadcast %148 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:46.4833590Z %155 = tt.broadcast %153 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:46.4833884Z %156 = arith.select %155, %154, %152 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:46.4834179Z %157 = tt.reshape %156 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:06:46.4834454Z %158 = arith.sitofp %157 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:06:46.4834831Z %159 = tt.dot %131, %158, %114, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:06:46.4835161Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:06:46.4835364Z %160 = arith.muli %c64_i32, %c3_i32 : i32 2026-02-21T09:06:46.4835559Z %161 = arith.addi %arg3, %160 : i32 2026-02-21T09:06:46.4835762Z %162 = tt.splat %161 : i32 -> tensor<64xi32> 2026-02-21T09:06:46.4835973Z %163 = arith.addi %162, %14 : tensor<64xi32> 2026-02-21T09:06:46.4836166Z %164 = arith.muli %161, %c2_i32 
: i32 2026-02-21T09:06:46.4836368Z %165 = tt.splat %164 : i32 -> tensor<128xi32> 2026-02-21T09:06:46.4836570Z %166 = arith.addi %165, %10 : tensor<128xi32> 2026-02-21T09:06:46.4836831Z %167 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:46.4837099Z %168 = arith.muli %167, %cst_4 : tensor<64x1xi32> 2026-02-21T09:06:46.4837371Z %169 = tt.expand_dims %166 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:46.4837681Z %170 = tt.broadcast %168 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4837956Z %171 = tt.broadcast %169 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4838211Z %172 = arith.addi %170, %171 : tensor<64x128xi32> 2026-02-21T09:06:46.4838457Z %173 = tt.splat %arg0 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4838762Z %174 = tt.addptr %173, %172 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:46.4839084Z %175 = tt.load %174 evictionPolicy = evict_last : tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4839383Z %176 = arith.extf %175 : tensor<64x128xbf16> to tensor<64x128xf32> 2026-02-21T09:06:46.4839713Z %177 = tt.expand_dims %163 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:46.4839983Z %178 = arith.muli %177, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:46.4840245Z %179 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:46.4840538Z %180 = tt.broadcast %178 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4840816Z %181 = tt.broadcast %179 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4841095Z %182 = arith.addi %180, %181 : tensor<64x128xi32> 2026-02-21T09:06:46.4841335Z %183 = tt.splat %arg1 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4841679Z %184 = tt.addptr %183, %182 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:46.4841993Z %185 = tt.load %184 evictionPolicy = evict_first : tensor<64x128x!tt.ptr> 
2026-02-21T09:06:46.4842277Z %186 = arith.shli %185, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4842500Z %187 = arith.shrsi %186, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4842729Z %188 = arith.shrsi %185, %cst_2 : tensor<64x128xi8> 2026-02-21T09:06:46.4842985Z %189 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:06:46.4843280Z %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:06:46.4843628Z %191 = tt.expand_dims %190 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:06:46.4843957Z %192 = tt.expand_dims %187 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:46.4844299Z %193 = tt.expand_dims %188 {axis = 1 : i32} : tensor<64x128xi8> -> tensor<64x1x128xi8> 2026-02-21T09:06:46.4844599Z %194 = arith.cmpi eq, %191, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:06:46.4844870Z %195 = tt.broadcast %194 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:46.4845157Z %196 = tt.broadcast %192 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:46.4845450Z %197 = arith.select %195, %196, %cst : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:46.4845732Z %198 = arith.cmpi eq, %191, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:06:46.4845991Z %199 = tt.broadcast %193 : tensor<64x1x128xi8> -> tensor<64x2x128xi8> 2026-02-21T09:06:46.4846266Z %200 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<64x2x128xi1> 2026-02-21T09:06:46.4846557Z %201 = arith.select %200, %199, %197 : tensor<64x2x128xi1>, tensor<64x2x128xi8> 2026-02-21T09:06:46.4846844Z %202 = tt.reshape %201 : tensor<64x2x128xi8> -> tensor<128x128xi8> 2026-02-21T09:06:46.4847121Z %203 = arith.sitofp %202 : tensor<128x128xi8> to tensor<128x128xf32> 2026-02-21T09:06:46.4847498Z %204 = tt.dot %176, %203, %159, inputPrecision = tf32 : tensor<64x128xf32> * tensor<128x128xf32> -> tensor<64x128xf32> 2026-02-21T09:06:46.4847825Z scf.yield %204 : tensor<64x128xf32> 
2026-02-21T09:06:46.4848021Z } {tt.disallow_acc_multi_buffer} 2026-02-21T09:06:46.4848247Z %18 = arith.truncf %17 : tensor<64x128xf32> to tensor<64x128xbf16> 2026-02-21T09:06:46.4848541Z %19 = tt.expand_dims %16 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:06:46.4848802Z %20 = arith.muli %19, %cst_3 : tensor<64x1xi32> 2026-02-21T09:06:46.4849063Z %21 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:06:46.4849353Z %22 = tt.broadcast %20 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4849614Z %23 = tt.broadcast %21 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:06:46.4849854Z %24 = arith.addi %22, %23 : tensor<64x128xi32> 2026-02-21T09:06:46.4850095Z %25 = tt.splat %arg2 : !tt.ptr -> tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4850391Z %26 = tt.addptr %25, %24 : tensor<64x128x!tt.ptr>, tensor<64x128xi32> 2026-02-21T09:06:46.4850662Z tt.store %26, %18 : tensor<64x128x!tt.ptr> 2026-02-21T09:06:46.4850904Z tt.return 2026-02-21T09:06:46.4851051Z } 2026-02-21T09:06:46.4851183Z } 2026-02-21T09:06:46.4851259Z 2026-02-21T09:06:46.4851337Z {-# 2026-02-21T09:06:46.4851476Z external_resources: { 2026-02-21T09:06:46.4851681Z mlir_reproducer: { 2026-02-21T09:06:46.4856383Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, 
tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:06:46.4860882Z disable_threading: false, 2026-02-21T09:06:46.4861051Z verify_each: true 2026-02-21T09:06:46.4861203Z } 2026-02-21T09:06:46.4861325Z } 2026-02-21T09:06:46.4861445Z #-} 2026-02-21T09:06:46.4861904Z /tmp/torchinductor_root/ei/ceilltriaqoc6u54j2j3ljj4d2p6itdi4pbonim5ohsf3wvfmfjq.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:06:46.4862931Z /tmp/torchinductor_root/ei/ceilltriaqoc6u54j2j3ljj4d2p6itdi4pbonim5ohsf3wvfmfjq.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:06:46.4863755Z [371s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:06:46.4864801Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:06:46.4865718Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:06:46.4865985Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:06:47.5296724Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 11.5 configs/s 2026-02-21T09:06:48.0254564Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1906.8 2026-02-21T09:06:48.0258770Z configs/s 2026-02-21T09:06:48.0730872Z [373s] Generation 20 complete: 2026-02-21T09:06:48.0735238Z error=1 2026-02-21T09:06:48.0736602Z ok=20 2026-02-21T09:06:48.0736813Z min=0.1036 2026-02-21T09:06:48.0736968Z mid=0.2755 2026-02-21T09:06:48.0737127Z max=26.3097 2026-02-21T09:06:48.0737556Z best={'block_sizes': [32, 64, 16], 2026-02-21T09:06:48.0737826Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:06:48.0738096Z 'l2_groupings': [16], 2026-02-21T09:06:48.0738277Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:06:48.0738518Z 'loop_orders': [[1, 0]], 2026-02-21T09:06:48.0738691Z 'num_stages': 2, 2026-02-21T09:06:48.0738873Z 'num_warps': 1, 2026-02-21T09:06:48.0739057Z 'pid_type': 'flat', 2026-02-21T09:06:48.0739323Z 'range_flattens': [None, True], 2026-02-21T09:06:48.0739535Z 'range_multi_buffers': [None, False], 2026-02-21T09:06:48.0739754Z 'range_num_stages': [0, 3], 2026-02-21T09:06:48.0740022Z 'range_unroll_factors': [0, 1], 2026-02-21T09:06:48.0740244Z 'range_warp_specializes': [None, None]} 2026-02-21T09:06:48.0775193Z [373s] Fitting surrogate: 1292 points, 1292 targets 
2026-02-21T09:06:48.3775085Z [373s] Autotuning complete in 373.8s after searching 1214 configs. 2026-02-21T09:06:48.3775507Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:06:48.3776763Z @helion.kernel(config=helion.Config(block_sizes=[32, 64, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:06:48.3777660Z 2026-02-21T09:06:48.3777911Z [373s] Code of selected kernel: /tmp/torchinductor_root/6d/c6dezladprzugnwlmiafqhi2fblnfqaxgswu7hmfxgturlvtk752.py 2026-02-21T09:06:48.4000304Z from __future__ import annotations 2026-02-21T09:06:48.4000508Z 2026-02-21T09:06:48.4005445Z import torch 2026-02-21T09:06:48.4006646Z import triton 2026-02-21T09:06:48.4006822Z import triton.language as tl 2026-02-21T09:06:48.4007075Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:06:48.4007265Z 2026-02-21T09:06:48.4007368Z _BLOCK_SIZE_2 = tl.constexpr(16) 2026-02-21T09:06:48.4012566Z _BLOCK_SIZE_1 = tl.constexpr(64) 2026-02-21T09:06:48.4012846Z _BLOCK_SIZE_0 = tl.constexpr(32) 2026-02-21T09:06:48.4017746Z 2026-02-21T09:06:48.4021858Z @triton.jit 2026-02-21T09:06:48.4025920Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr): 2026-02-21T09:06:48.4026355Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:06:48.4026611Z num_pid_m = tl.cdiv(7168, _BLOCK_SIZE_2) 2026-02-21T09:06:48.4026822Z num_pid_n = tl.cdiv(64, _BLOCK_SIZE_1) 2026-02-21T09:06:48.4027018Z inner_2d_pid = tl.program_id(0) 2026-02-21T09:06:48.4027232Z num_pid_in_group = 16 * num_pid_n 2026-02-21T09:06:48.4027439Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T09:06:48.4027653Z 
first_pid_m = group_id * 16 2026-02-21T09:06:48.4027863Z group_size_m = min(num_pid_m - first_pid_m, 16) 2026-02-21T09:06:48.4028138Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T09:06:48.4028447Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T09:06:48.4028664Z offset_2 = pid_0 * _BLOCK_SIZE_2 2026-02-21T09:06:48.4028911Z indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T09:06:48.4029146Z offset_1 = pid_1 * _BLOCK_SIZE_1 2026-02-21T09:06:48.4029372Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T09:06:48.4029679Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:06:48.4029973Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T09:06:48.4030315Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:06:48.4030742Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:06:48.4031155Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:06:48.4031725Z # src[int4_gemm.py:60-89]: ... 
2026-02-21T09:06:48.4032112Z for offset_3 in tl.range(0, 4096, _BLOCK_SIZE_0, loop_unroll_factor=1, num_stages=3, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T09:06:48.4032558Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T09:06:48.4032797Z acc_copy = acc 2026-02-21T09:06:48.4032974Z acc_copy_0 = acc_copy 2026-02-21T09:06:48.4033250Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T09:06:48.4033493Z mul = 2 * offset_3 2026-02-21T09:06:48.4033789Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:06:48.4034096Z iota = mul + tl.arange(0, mul_1) 2026-02-21T09:06:48.4034422Z load = tl.load(A + (indices_1[:, None] * 8192 + iota[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T09:06:48.4034827Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:06:48.4035142Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T09:06:48.4035379Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T09:06:48.4035616Z v_0 = tl.cast(load, tl.float32) 2026-02-21T09:06:48.4035901Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T09:06:48.4036394Z b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T09:06:48.4036824Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:06:48.4037116Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:06:48.4037299Z v_2 = b_tile << v_1 2026-02-21T09:06:48.4037477Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:06:48.4037652Z v_4 = v_2 >> v_3 2026-02-21T09:06:48.4037908Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:06:48.4038194Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:06:48.4038368Z v_6 = b_tile >> v_5 2026-02-21T09:06:48.4038587Z # 
src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 2026-02-21T09:06:48.4038828Z stack_idx = tl.arange(0, 2) 2026-02-21T09:06:48.4039028Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:06:48.4039231Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:06:48.4039433Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:06:48.4039633Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:06:48.4039839Z mask_0 = broadcast_idx == 0 2026-02-21T09:06:48.4040060Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T09:06:48.4040300Z mask_1 = broadcast_idx == 1 2026-02-21T09:06:48.4040523Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T09:06:48.4040789Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:06:48.4041074Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:06:48.4041332Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:06:48.4041628Z view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2]) 2026-02-21T09:06:48.4041865Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:06:48.4042135Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:06:48.4042410Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:06:48.4042624Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:06:48.4042858Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:06:48.4043140Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:06:48.4043437Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:06:48.4043629Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:06:48.4043866Z acc = acc_copy_0 + sum_1 2026-02-21T09:06:48.4044091Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:06:48.4044323Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:06:48.4044565Z tl.store(C + (indices_1[:, None] * 7168 + indices_2[None, :] * 1), 
v_10, None) 2026-02-21T09:06:48.4044750Z 2026-02-21T09:06:48.4044884Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T09:06:48.4045161Z """ 2026-02-21T09:06:48.4045335Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:06:48.4045498Z 2026-02-21T09:06:48.4045622Z This kernel performs matrix multiplication where: 2026-02-21T09:06:48.4045858Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:06:48.4046109Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:06:48.4046370Z (two 4-bit values packed into each int8) 2026-02-21T09:06:48.4046504Z 2026-02-21T09:06:48.4046561Z Args: 2026-02-21T09:06:48.4046750Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:06:48.4047028Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T09:06:48.4047207Z 2026-02-21T09:06:48.4047265Z Returns: 2026-02-21T09:06:48.4047453Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 
2026-02-21T09:06:48.4047691Z """ 2026-02-21T09:06:48.4047846Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:06:48.4048022Z M, K = A.shape 2026-02-21T09:06:48.4048179Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:06:48.4048348Z _, N = B.shape 2026-02-21T09:06:48.4048580Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:06:48.4048896Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:06:48.4049160Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:06:48.4049386Z _BLOCK_SIZE_2 = 16 2026-02-21T09:06:48.4049536Z _BLOCK_SIZE_1 = 64 2026-02-21T09:06:48.4049788Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:06:48.4050190Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:06:48.4050588Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:06:48.4050867Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:06:48.4051036Z _BLOCK_SIZE_0 = 32 2026-02-21T09:06:48.4051231Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:06:48.4051504Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:06:48.4051809Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:06:48.4052001Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:06:48.4052224Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:06:48.4052518Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:06:48.4052773Z # src[int4_gemm.py:57-91]: ... 
2026-02-21T09:06:48.4052986Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:06:48.4053458Z _launcher(_helion_matmul_bf16_int4, (triton.cdiv(7168, _BLOCK_SIZE_2) * triton.cdiv(64, _BLOCK_SIZE_1),), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=1, num_stages=2) 2026-02-21T09:06:48.4053901Z # src[int4_gemm.py:93]: return C 2026-02-21T09:06:48.4054066Z return C 2026-02-21T09:06:49.0627039Z WARNING:tritonbench.utils.triton_op:Completed input ID 14: 2026-02-21T09:06:49.0631346Z x_val 2026-02-21T09:06:49.0636036Z ------------------- 2026-02-21T09:06:49.0639943Z (64, 1, 7168, 8192) 2026-02-21T09:06:49.0644251Z 2026-02-21T09:06:49.0649282Z 50%|█████ | 5/10 [22:49<23:38, 283.74s/it]WARNING:tritonbench.utils.triton_op:Running input ID 17: 2026-02-21T09:06:49.0653357Z x_val 2026-02-21T09:06:49.0654967Z --------------------- 2026-02-21T09:06:49.0655163Z (1, 4096, 8192, 1024) 2026-02-21T09:06:49.0655490Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:06:50.0534530Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:06:51.4202283Z INFO:tritonbench.utils.triton_op:Took 0.11ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T09:06:52.5889299Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:06:52.5893891Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:06:52.5898258Z 'dtype': 'torch.bfloat16', 2026-02-21T09:06:52.5902127Z 'shape': (1, 4096, 1024), 2026-02-21T09:06:52.5903282Z 'stride': (4194304, 1024, 1)}, 2026-02-21T09:06:52.5903520Z { 'device': 'cuda:0', 2026-02-21T09:06:52.5903727Z 'dtype': 'torch.int32', 2026-02-21T09:06:52.5903927Z 'shape': (1024, 8192), 2026-02-21T09:06:52.5904140Z 'stride': (8192, 1)}), 2026-02-21T09:06:52.5904318Z 'kwargs': {}} 2026-02-21T09:06:52.5927128Z INFO:tritonbench.utils.triton_op:Took 4.10ms to get benchmark function for helion_int4_gemm_tritonbench 
2026-02-21T09:06:52.8574509Z [0s] Autotune random seed: 2136913670 2026-02-21T09:06:52.8866585Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:07:25.5459131Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 1, 1024], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T09:07:28.5392354Z [35s] Timeout after 30s compiling Config(block_sizes=[128, 1024, 4], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=4, num_stages=7, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 0], range_warp_specializes=[None, True]) 2026-02-21T09:07:28.5408903Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s 2026-02-21T09:07:33.0586466Z [40s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_stages=7, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]) 2026-02-21T09:07:33.0587670Z Tensor-likes are not close! 
2026-02-21T09:07:33.0587797Z 2026-02-21T09:07:33.0587913Z Mismatched elements: 33553778 / 33554432 (100.0%) 2026-02-21T09:07:33.0588227Z Greatest absolute difference: 2560.0 at index (190, 5930) (up to 0.01 allowed) 2026-02-21T09:07:33.0588628Z Greatest relative difference: 3.03125 at index (44, 6176) (up to 0.01 allowed) 2026-02-21T09:07:33.0588968Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:07:33.0589139Z 2026-02-21T09:07:33.0718313Z [40s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 512], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[2, 1], range_warp_specializes=[False, True]) 2026-02-21T09:07:33.0719463Z Tensor-likes are not close! 2026-02-21T09:07:33.0722639Z 2026-02-21T09:07:33.0725207Z Mismatched elements: 33443492 / 33554432 (99.7%) 2026-02-21T09:07:33.0725919Z Greatest absolute difference: 1424.0 at index (124, 4251) (up to 0.01 allowed) 2026-02-21T09:07:33.0731070Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:07:33.0735816Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:07:33.0736080Z 2026-02-21T09:07:44.7511339Z [51s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 256], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=1, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[4, 2], range_unroll_factors=[3, 1], range_warp_specializes=[None, True]) 2026-02-21T09:07:44.7512834Z Tensor-likes are not close! 
2026-02-21T09:07:44.7516085Z 2026-02-21T09:07:44.7520675Z Mismatched elements: 33484606 / 33554432 (99.8%) 2026-02-21T09:07:44.7525151Z Greatest absolute difference: 2512.0 at index (696, 6537) (up to 0.01 allowed) 2026-02-21T09:07:44.7526575Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:07:44.7526939Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:07:44.7527116Z 2026-02-21T09:07:45.0962996Z [52s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 64, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=4, num_stages=7, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[4, 1], range_warp_specializes=[False, True]) 2026-02-21T09:07:45.0964124Z Tensor-likes are not close! 2026-02-21T09:07:45.0968988Z 2026-02-21T09:07:45.0973715Z Mismatched elements: 33503946 / 33554432 (99.8%) 2026-02-21T09:07:45.0977557Z Greatest absolute difference: 3104.0 at index (306, 6642) (up to 0.01 allowed) 2026-02-21T09:07:45.0982121Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:07:45.0984014Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:07:45.0984269Z 2026-02-21T09:07:45.1149456Z [52s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 16, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=32, num_stages=6, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[3, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]) 2026-02-21T09:07:45.1150574Z Tensor-likes are not close! 
2026-02-21T09:07:45.1155967Z 2026-02-21T09:07:45.1160583Z Mismatched elements: 33528286 / 33554432 (99.9%) 2026-02-21T09:07:45.1165024Z Greatest absolute difference: 3168.0 at index (2110, 1576) (up to 0.01 allowed) 2026-02-21T09:07:45.1168378Z Greatest relative difference: 90177536.0 at index (1456, 6433) (up to 0.01 allowed) 2026-02-21T09:07:45.1168816Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:07:45.1169034Z 2026-02-21T09:07:48.4607852Z [55s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 256, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=32, num_stages=6, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[1, 4], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]) 2026-02-21T09:07:48.4609040Z Tensor-likes are not close! 2026-02-21T09:07:48.4609158Z 2026-02-21T09:07:48.4609250Z Mismatched elements: 33448005 / 33554432 (99.7%) 2026-02-21T09:07:48.4609533Z Greatest absolute difference: 1552.0 at index (2458, 4018) (up to 0.01 allowed) 2026-02-21T09:07:48.4610200Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:07:48.4610506Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:07:48.4610684Z 2026-02-21T09:07:50.3094734Z [57s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 32, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=16, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[1, 1], range_warp_specializes=[False, False]) 2026-02-21T09:07:50.3096031Z Tensor-likes are not close! 2026-02-21T09:07:50.3096157Z 2026-02-21T09:07:50.3096250Z Mismatched elements: 33506303 / 33554432 (99.9%) 2026-02-21T09:07:50.3096555Z Greatest absolute difference: 3088.0 at index (2763, 2791) (up to 0.01 allowed) 2026-02-21T09:07:50.3096970Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:07:50.3097320Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:07:50.3097526Z 2026-02-21T09:07:52.5371954Z [59s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=1, num_stages=1, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[2, 1], range_warp_specializes=[None, None]) 2026-02-21T09:07:52.5373098Z Tensor-likes are not close! 2026-02-21T09:07:52.5373224Z 2026-02-21T09:07:52.5373310Z Mismatched elements: 33444241 / 33554432 (99.7%) 2026-02-21T09:07:52.5373589Z Greatest absolute difference: 1408.0 at index (3439, 1611) (up to 0.01 allowed) 2026-02-21T09:07:52.5373936Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:07:52.5374245Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:07:52.5374411Z 2026-02-21T09:08:00.0916623Z [67s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 32, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=256, num_sm_multiplier=8, num_stages=5, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[4, 1], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T09:08:00.0917716Z Tensor-likes are not close! 2026-02-21T09:08:00.0917844Z 2026-02-21T09:08:00.0917931Z Mismatched elements: 33515109 / 33554432 (99.9%) 2026-02-21T09:08:00.0918216Z Greatest absolute difference: 3488.0 at index (190, 5938) (up to 0.01 allowed) 2026-02-21T09:08:00.0918547Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:08:00.0918859Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:08:00.0919023Z 2026-02-21T09:08:02.2238594Z [69s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 32, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=4, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]) 2026-02-21T09:08:02.2240310Z Tensor-likes are not close! 2026-02-21T09:08:02.2240444Z 2026-02-21T09:08:02.2240536Z Mismatched elements: 33529003 / 33554432 (99.9%) 2026-02-21T09:08:02.2240831Z Greatest absolute difference: 3008.0 at index (990, 5647) (up to 0.01 allowed) 2026-02-21T09:08:02.2241185Z Greatest relative difference: 98566144.0 at index (1905, 2353) (up to 0.01 allowed) 2026-02-21T09:08:02.2241905Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:08:02.2242076Z 2026-02-21T09:08:08.3963047Z [75s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 256, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[4], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 0], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T09:08:08.3964423Z Tensor-likes are not close! 2026-02-21T09:08:08.3970043Z 2026-02-21T09:08:08.3971518Z Mismatched elements: 33484116 / 33554432 (99.8%) 2026-02-21T09:08:08.3972020Z Greatest absolute difference: 2592.0 at index (1672, 2372) (up to 0.01 allowed) 2026-02-21T09:08:08.3972358Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:08:08.3972675Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:08:08.3972846Z 2026-02-21T09:08:13.0757928Z [80s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 32, 512], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T09:08:13.0759148Z Tensor-likes are not close! 2026-02-21T09:08:13.0759273Z 2026-02-21T09:08:13.0759367Z Mismatched elements: 33504614 / 33554432 (99.9%) 2026-02-21T09:08:13.0759718Z Greatest absolute difference: 3088.0 at index (2626, 2884) (up to 0.01 allowed) 2026-02-21T09:08:13.0764151Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:08:13.0765771Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:08:13.0766014Z 2026-02-21T09:08:16.2768077Z [83s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[1], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]) 2026-02-21T09:08:16.2769376Z Tensor-likes are not close! 2026-02-21T09:08:16.2769504Z 2026-02-21T09:08:16.2769602Z Mismatched elements: 33480610 / 33554432 (99.8%) 2026-02-21T09:08:16.2769916Z Greatest absolute difference: 2432.0 at index (2056, 3359) (up to 0.01 allowed) 2026-02-21T09:08:16.2770297Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:08:16.2770619Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:08:16.2770798Z 2026-02-21T09:08:17.1147358Z [84s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 16, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[1], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]) 2026-02-21T09:08:17.1148342Z Tensor-likes are not close! 2026-02-21T09:08:17.1148460Z 2026-02-21T09:08:17.1148552Z Mismatched elements: 33476612 / 33554432 (99.8%) 2026-02-21T09:08:17.1148834Z Greatest absolute difference: 2400.0 at index (2675, 7632) (up to 0.01 allowed) 2026-02-21T09:08:17.1149180Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:08:17.1149483Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:08:17.1149651Z 2026-02-21T09:08:30.2712896Z [97s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=2, num_stages=8, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:08:30.2714501Z Tensor-likes are not close! 2026-02-21T09:08:30.2714623Z 2026-02-21T09:08:30.2714713Z Mismatched elements: 33476457 / 33554432 (99.8%) 2026-02-21T09:08:30.2715112Z Greatest absolute difference: 2400.0 at index (2394, 3954) (up to 0.01 allowed) 2026-02-21T09:08:30.2715500Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:08:30.2715853Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:08:30.2716062Z 2026-02-21T09:08:36.9623941Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 1.6 configs/s 2026-02-21T09:08:36.9637572Z [104s] Adaptive compile timeout: 30s (90% percentile=5.5s, bounds=[30.0s, 30s]) 2026-02-21T09:08:37.2015366Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 219/219 501.6 configs/s 2026-02-21T09:08:37.4079061Z [104s] Initial random population of 100, 5 starting points: 2026-02-21T09:08:37.4080391Z error=22 2026-02-21T09:08:37.4080556Z timeout=2 2026-02-21T09:08:37.4080929Z ok=76 2026-02-21T09:08:37.4081076Z min=0.9124 2026-02-21T09:08:37.4081221Z mid=54.8782 2026-02-21T09:08:37.4081348Z max=1231.2319 2026-02-21T09:08:37.4081514Z best={'block_sizes': [8, 512, 16], 2026-02-21T09:08:37.4081919Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:08:37.4082154Z 'l2_groupings': [8], 2026-02-21T09:08:37.4082335Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:08:37.4082528Z 'loop_orders': [[1, 0]], 2026-02-21T09:08:37.4082692Z 'num_stages': 1, 2026-02-21T09:08:37.4082835Z 'num_warps': 8, 2026-02-21T09:08:37.4082991Z 'pid_type': 'flat', 2026-02-21T09:08:37.4083147Z 'range_flattens': [None, None], 2026-02-21T09:08:37.4083330Z 'range_multi_buffers': [None, None], 2026-02-21T09:08:37.4083511Z 'range_num_stages': [0, 4], 2026-02-21T09:08:37.4083685Z 'range_unroll_factors': [0, 0], 2026-02-21T09:08:37.4083865Z 'range_warp_specializes': [None, None]} 2026-02-21T09:08:37.4097870Z [104s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:08:38.7040737Z [105s] Generation 1 starting: 98 neighbors, 5 active search path(s) 2026-02-21T09:08:50.9249113Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 1.3 configs/s 2026-02-21T09:08:54.2316573Z [121s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 16, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'last'], 
loop_orders=[[1, 0]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]) 2026-02-21T09:08:54.2317623Z Tensor-likes are not close! 2026-02-21T09:08:54.2317742Z 2026-02-21T09:08:54.2317829Z Mismatched elements: 33441839 / 33554432 (99.7%) 2026-02-21T09:08:54.2318117Z Greatest absolute difference: 1408.0 at index (3279, 4843) (up to 0.01 allowed) 2026-02-21T09:08:54.2318465Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:08:54.2318780Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:08:54.2318943Z 2026-02-21T09:09:01.2087386Z 2026-02-21T09:09:01.2089646Z [128s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:09:01.2089927Z 2026-02-21T09:09:01.2090027Z ================================================================ 2026-02-21T09:09:01.2090267Z Internal Triton PTX codegen error 2026-02-21T09:09:01.2090451Z `ptxas` stderr: 2026-02-21T09:09:01.2090914Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 294 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:01.2091796Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:01.2091961Z 2026-02-21T09:09:01.2092375Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpuh_pwmue.ptx -o /tmp/tmpuh_pwmue.ptx.o 2026-02-21T09:09:01.2092935Z 2026-02-21T09:09:01.2092940Z 2026-02-21T09:09:01.2093007Z // 2026-02-21T09:09:01.2093175Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:01.2093359Z // 2026-02-21T09:09:01.2093433Z 2026-02-21T09:09:01.2093578Z .version 8.7 2026-02-21T09:09:01.2093744Z .target sm_100a 2026-02-21T09:09:01.2093902Z .address_size 64 2026-02-21T09:09:01.2093988Z 2026-02-21T09:09:01.2094139Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:01.2094434Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:01.2094666Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:01.2094881Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:01.2095132Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:01.2095408Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:01.2095752Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:01.2096030Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:01.2096314Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:01.2096539Z ) 2026-02-21T09:09:01.2096658Z .reqntid 256 2026-02-21T09:09:01.2096793Z .maxnreg 32 2026-02-21T09:09:01.2096915Z { 2026-02-21T09:09:01.2097047Z .reg .pred %p<155>; 2026-02-21T09:09:01.2097196Z .reg .b16 %rs<169>; 2026-02-21T09:09:01.2097344Z .reg .b32 %r<813>; 2026-02-21T09:09:01.2097483Z .reg .b64 %rd<281>; 2026-02-21T09:09:01.2097632Z $L__func_begin0: 2026-02-21T09:09:01.2097714Z 2026-02-21T09:09:01.2097767Z // %bb.0: 2026-02-21T09:09:01.2098009Z .loc 1 19 0 // 
cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:19 2026-02-21T09:09:01.2098302Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:01.2098493Z ld.param.b64 %rd23, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:01.2098719Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:01.2098918Z ld.param.b64 %rd41, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:01.2099141Z mov.b32 %r47, global_smem; 2026-02-21T09:09:01.2099300Z // begin inline asm 2026-02-21T09:09:01.2099554Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r47], 256; 2026-02-21T09:09:01.2099802Z // end inline asm 2026-02-21T09:09:01.2099989Z ld.param.b64 %rd58, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:01.2100198Z bar.sync 0; 2026-02-21T09:09:01.2100342Z ld.shared.b32 %r806, [global_smem]; 2026-02-21T09:09:01.2100519Z bar.sync 0; 2026-02-21T09:09:01.2100652Z // begin inline asm 2026-02-21T09:09:01.2100866Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:01.2101090Z // end inline asm 2026-02-21T09:09:01.2101345Z .loc 1 21 66 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:21:66 2026-02-21T09:09:01.2101688Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:09:01.2101848Z mov.u32 %r64, %ctaid.y; 2026-02-21T09:09:01.2102005Z mov.u32 %r65, %ctaid.z; 2026-02-21T09:09:01.2102157Z mov.u32 %r66, %nctaid.x; 2026-02-21T09:09:01.2102318Z mov.u32 %r67, %nctaid.y; 2026-02-21T09:09:01.2102477Z mad.lo.s32 %r68, %r65, %r67, %r64; 2026-02-21T09:09:01.2103806Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 512, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[2, 3], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:01.2104963Z Error: 
PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:01.2105204Z `ptxas` stderr: 2026-02-21T09:09:01.2105633Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 294 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:01.2106163Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:01.2106309Z 2026-02-21T09:09:01.2106712Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpuh_pwmue.ptx -o /tmp/tmpuh_pwmue.ptx.o 2026-02-21T09:09:01.2107138Z 2026-02-21T09:09:01.2107261Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:01.2107515Z mad.lo.s32 %r69, %r68, %r66, %r3; 2026-02-21T09:09:01.2107689Z shl.b32 %r70, %r69, 8; 2026-02-21T09:09:01.2107852Z cvt.s64.s32 %rd59, %r70; 2026-02-21T09:09:01.2108009Z add.s64 %rd37, %rd58, %rd59; 2026-02-21T09:09:01.2108172Z shl.b32 %r71, %r1, 2; 2026-02-21T09:09:01.2108318Z add.s32 %r48, %r47, %r71; 2026-02-21T09:09:01.2108475Z mov.b32 %r57, 0; 2026-02-21T09:09:01.2108613Z // begin inline asm 2026-02-21T09:09:01.2108775Z @%p1 st.shared.b32 [ %r48 + 0 ], %r57; 2026-02-21T09:09:01.2108991Z // end inline asm 2026-02-21T09:09:01.2109133Z bar.warp.sync -1; 2026-02-21T09:09:01.2109292Z setp.eq.b32 %p147, %r1, 0; 2026-02-21T09:09:01.2109458Z cvt.u64.u32 %rd22, %r47; 2026-02-21T09:09:01.2109618Z // begin inline asm 2026-02-21T09:09:01.2109881Z @%p147 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd22 + 0 ], %rd23; 2026-02-21T09:09:01.2110181Z // end inline asm 2026-02-21T09:09:01.2110321Z // begin inline asm 2026-02-21T09:09:01.2110556Z @%p147 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x1; 2026-02-21T09:09:01.2110820Z // end inline asm 2026-02-21T09:09:01.2110959Z mov.b32 %r50, 32; 2026-02-21T09:09:01.2111109Z // begin inline asm 2026-02-21T09:09:01.2111352Z 
@%p147 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0, %r50; 2026-02-21T09:09:01.2111660Z // end inline asm 2026-02-21T09:09:01.2111790Z mov.b32 %r145, 16; 2026-02-21T09:09:01.2111935Z // begin inline asm 2026-02-21T09:09:01.2112171Z @%p147 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x1, %r145; 2026-02-21T09:09:01.2112451Z // end inline asm 2026-02-21T09:09:01.2112599Z mov.b32 %r52, 8192; 2026-02-21T09:09:01.2112741Z // begin inline asm 2026-02-21T09:09:01.2112990Z @%p147 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0, %r52; 2026-02-21T09:09:01.2113269Z // end inline asm 2026-02-21T09:09:01.2113420Z mov.b32 %r53, 512; 2026-02-21T09:09:01.2113560Z // begin inline asm 2026-02-21T09:09:01.2113818Z @%p147 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x1, %r53; 2026-02-21T09:09:01.2114093Z // end inline asm 2026-02-21T09:09:01.2114239Z mov.b64 %rd30, 8192; 2026-02-21T09:09:01.2114387Z // begin inline asm 2026-02-21T09:09:01.2114644Z @%p147 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd22 + 0 ], 0x0, %rd30; 2026-02-21T09:09:01.2114935Z // end inline asm 2026-02-21T09:09:01.2115066Z mov.b32 %r54, 1; 2026-02-21T09:09:01.2115204Z // begin inline asm 2026-02-21T09:09:01.2115461Z @%p147 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0, %r54; 2026-02-21T09:09:01.2115751Z // end inline asm 2026-02-21T09:09:01.2115890Z // begin inline asm 2026-02-21T09:09:01.2116141Z @%p147 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x1, %r54; 2026-02-21T09:09:01.2116429Z // end inline asm 2026-02-21T09:09:01.2116560Z // begin inline asm 2026-02-21T09:09:01.2116802Z @%p147 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0; 2026-02-21T09:09:01.2117089Z // end inline asm 2026-02-21T09:09:01.2117230Z // begin inline asm 2026-02-21T09:09:01.2117480Z @%p147 
tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0; 2026-02-21T09:09:01.2117767Z // end inline asm 2026-02-21T09:09:01.2117903Z // begin inline asm 2026-02-21T09:09:01.2118136Z @%p147 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x1; 2026-02-21T09:09:01.2118404Z // end inline asm 2026-02-21T09:09:01.2118564Z // begin inline asm 2026-02-21T09:09:01.2118792Z @%p147 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0; 2026-02-21T09:09:01.2119051Z // end inline asm 2026-02-21T09:09:01.2119188Z // begin inline asm 2026-02-21T09:09:01.2119553Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd37 + 0 ], [ %rd22 + 0 ], 0x80; 2026-02-21T09:09:01.2119918Z // end inline asm 2026-02-21T09:09:01.2120056Z // begin inline asm 2026-02-21T09:09:01.2120262Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd37 + 0 ], 0x80; 2026-02-21T09:09:01.2120519Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:01.2120707Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:01.2120890Z // end inline asm 2026-02-21T09:09:01.2121027Z bar.sync 0; 2026-02-21T09:09:01.2121169Z cvta.global.u64 %rd115, %rd37; 2026-02-21T09:09:01.2121485Z .loc 1 23 67 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:23:67 2026-02-21T09:09:01.2121811Z add.s64 %rd55, %rd37, 128; 2026-02-21T09:09:01.2121973Z bar.sync 0; 2026-02-21T09:09:01.2122103Z // begin inline asm 2026-02-21T09:09:01.2122261Z @%p1 st.shared.b32 [ %r48 + 0 ], %r57; 2026-02-21T09:09:01.2122433Z // end inline asm 2026-02-21T09:09:01.2122584Z bar.warp.sync -1; 2026-02-21T09:09:01.2122727Z // begin inline asm 2026-02-21T09:09:01.2122988Z @%p147 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd22 + 0 ], %rd41; 2026-02-21T09:09:01.2123281Z // end inline asm 2026-02-21T09:09:01.2123422Z // begin inline asm 2026-02-21T09:09:01.2123665Z @%p147 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd22 + 
0 ], 0x1; 2026-02-21T09:09:01.2123925Z // end inline asm 2026-02-21T09:09:01.2124067Z // begin inline asm 2026-02-21T09:09:01.2124301Z @%p147 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0, %r50; 2026-02-21T09:09:01.2124576Z // end inline asm 2026-02-21T09:09:01.2124714Z mov.b32 %r59, 256; 2026-02-21T09:09:01.2124852Z // begin inline asm 2026-02-21T09:09:01.2125091Z @%p147 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x1, %r59; 2026-02-21T09:09:01.2125355Z // end inline asm 2026-02-21T09:09:01.2125496Z // begin inline asm 2026-02-21T09:09:01.2125736Z @%p147 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0, %r52; 2026-02-21T09:09:01.2126012Z // end inline asm 2026-02-21T09:09:01.2126146Z mov.b32 %r61, 4096; 2026-02-21T09:09:01.2126296Z // begin inline asm 2026-02-21T09:09:01.2126543Z @%p147 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x1, %r61; 2026-02-21T09:09:01.2126835Z // end inline asm 2026-02-21T09:09:01.2126979Z mov.b64 %rd48, 16384; 2026-02-21T09:09:01.2127125Z // begin inline asm 2026-02-21T09:09:01.2127391Z @%p147 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd22 + 0 ], 0x0, %rd48; 2026-02-21T09:09:01.2127688Z // end inline asm 2026-02-21T09:09:01.2127835Z // begin inline asm 2026-02-21T09:09:01.2128107Z @%p147 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0, %r54; 2026-02-21T09:09:01.2128396Z // end inline asm 2026-02-21T09:09:01.2128543Z // begin inline asm 2026-02-21T09:09:01.2128805Z @%p147 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x1, %r54; 2026-02-21T09:09:01.2129101Z // end inline asm 2026-02-21T09:09:01.2129239Z // begin inline asm 2026-02-21T09:09:01.2129493Z @%p147 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd22 + 0 ], 0xa; 2026-02-21T09:09:01.2129808Z // end inline asm 2026-02-21T09:09:01.2129946Z // begin inline asm 2026-02-21T09:09:01.2130214Z @%p147 
tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0; 2026-02-21T09:09:01.2130506Z // end inline asm 2026-02-21T09:09:01.2130653Z // begin inline asm 2026-02-21T09:09:01.2130897Z @%p147 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x2; 2026-02-21T09:09:01.2131177Z // end inline asm 2026-02-21T09:09:01.2131314Z // begin inline asm 2026-02-21T09:09:01.2131612Z @%p147 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd22 + 0 ], 0x0; 2026-02-21T09:09:01.2131887Z // end inline asm 2026-02-21T09:09:01.2132028Z // begin inline asm 2026-02-21T09:09:01.2132438Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd55 + 0 ], [ %rd22 + 0 ], 0x80; 2026-02-21T09:09:01.2132824Z // end inline asm 2026-02-21T09:09:01.2132974Z // begin inline asm 2026-02-21T09:09:01.2133190Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd55 + 0 ], 0x80; 2026-02-21T09:09:01.2133456Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:01.2133660Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:01.2133845Z // end inline asm 2026-02-21T09:09:01.2133995Z bar.sync 0; 2026-02-21T09:09:01.2134146Z cvta.global.u64 %rd150, %rd55; 2026-02-21T09:09:01.2134492Z .loc 1 31 88 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:31:88 2026-02-21T09:09:01.2134800Z setp.gt.u32 %p39, %r3, 2047; 2026-02-21T09:09:01.2134981Z @%p39 bra $L__BB0_8; 2026-02-21T09:09:01.2135150Z // %bb.1: // %.lr.ph 2026-02-21T09:09:01.2135467Z .loc 1 0 88 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:0:88 2026-02-21T09:09:01.2135817Z ld.param.b64 %rd21, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:01.2136041Z and.b32 %r4, %r1, 32; 2026-02-21T09:09:01.2136304Z .loc 1 81 38 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:81:38 2026-02-21T09:09:01.2136590Z setp.eq.b32 %p56, %r4, 0; 2026-02-21T09:09:01.2136857Z .loc 1 57 38 // 
cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:57:38 2026-02-21T09:09:01.2137142Z shl.b32 %r256, %r1, 3; 2026-02-21T09:09:01.2137302Z and.b32 %r257, %r256, 24; 2026-02-21T09:09:01.2137568Z .loc 1 44 45 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:44:45 2026-02-21T09:09:01.2137847Z shr.u32 %r258, %r1, 2; 2026-02-21T09:09:01.2138007Z bfe.u32 %r5, %r1, 2, 6; 2026-02-21T09:09:01.2138156Z shr.u32 %r259, %r1, 5; 2026-02-21T09:09:01.2138307Z and.b32 %r6, %r1, 255; 2026-02-21T09:09:01.2138451Z shl.b32 %r260, %r6, 4; 2026-02-21T09:09:01.2138608Z add.s32 %r401, %r47, %r260; 2026-02-21T09:09:01.2138767Z add.s32 %r403, %r401, 4096; 2026-02-21T09:09:01.2138929Z add.s32 %r405, %r401, 8192; 2026-02-21T09:09:01.2139091Z add.s32 %r407, %r401, 12288; 2026-02-21T09:09:01.2139248Z add.s32 %r409, %r401, 16384; 2026-02-21T09:09:01.2139412Z add.s32 %r411, %r401, 20480; 2026-02-21T09:09:01.2139562Z add.s32 %r413, %r401, 24576; 2026-02-21T09:09:01.2139720Z add.s32 %r415, %r401, 28672; 2026-02-21T09:09:01.2139874Z add.s32 %r351, %r806, 128; 2026-02-21T09:09:01.2140033Z shl.b32 %r16, %r6, 6; 2026-02-21T09:09:01.2140178Z and.b32 %r262, %r1, 31; 2026-02-21T09:09:01.2140331Z shr.u32 %r263, %r1, 1; 2026-02-21T09:09:01.2140477Z and.b32 %r264, %r263, 96; 2026-02-21T09:09:01.2140639Z or.b32 %r17, %r264, %r262; 2026-02-21T09:09:01.2140797Z shl.b32 %r265, %r262, 7; 2026-02-21T09:09:01.2140945Z shl.b32 %r266, %r1, 4; 2026-02-21T09:09:01.2141097Z and.b32 %r267, %r266, 112; 2026-02-21T09:09:01.2141248Z shr.u32 %r268, %r1, 3; 2026-02-21T09:09:01.2141400Z and.b32 %r269, %r268, 28; 2026-02-21T09:09:01.2141568Z xor.b32 %r270, %r267, %r269; 2026-02-21T09:09:01.2141733Z or.b32 %r271, %r270, %r265; 2026-02-21T09:09:01.2141886Z add.s32 %r272, %r47, 98304; 2026-02-21T09:09:01.2142045Z add.s32 %r18, %r272, %r271; 2026-02-21T09:09:01.2142241Z xor.b32 %r273, %r271, 32; 2026-02-21T09:09:01.2142397Z add.s32 %r19, %r272, %r273; 2026-02-21T09:09:01.2142556Z xor.b32 %r274, %r271, 64; 
2026-02-21T09:09:01.2142707Z add.s32 %r20, %r272, %r274; 2026-02-21T09:09:01.2142865Z xor.b32 %r275, %r271, 96; 2026-02-21T09:09:01.2143014Z add.s32 %r21, %r272, %r275; 2026-02-21T09:09:01.2143184Z bfe.u32 %r276, %r272, 4, 14; 2026-02-21T09:09:01.2143343Z cvt.u64.u32 %rd78, %r276; 2026-02-21T09:09:01.2143555Z or.b64 %rd90, %rd78, 4611686293322072064; 2026-02-21T09:09:01.2143733Z add.s32 %r277, %r47, 98336; 2026-02-21T09:09:01.2143892Z bfe.u32 %r278, %r277, 4, 14; 2026-02-21T09:09:01.2144083Z cvt.u64.u32 %rd79, %r278; 2026-02-21T09:09:01.2144243Z or.b64 %rd91, %rd79, 4611686293322072064; 2026-02-21T09:09:01.2144424Z add.s32 %r279, %r47, 98368; 2026-02-21T09:09:01.2144576Z bfe.u32 %r280, %r279, 4, 14; 2026-02-21T09:09:01.2144735Z cvt.u64.u32 %rd80, %r280; 2026-02-21T09:09:01.2144892Z or.b64 %rd92, %rd80, 4611686293322072064; 2026-02-21T09:09:01.2145069Z add.s32 %r281, %r47, 98400; 2026-02-21T09:09:01.2145222Z bfe.u32 %r282, %r281, 4, 14; 2026-02-21T09:09:01.2145381Z cvt.u64.u32 %rd81, %r282; 2026-02-21T09:09:01.2145537Z or.b64 %rd93, %rd81, 4611686293322072064; 2026-02-21T09:09:01.2145718Z or.b32 %r22, %r257, 64; 2026-02-21T09:09:01.2145872Z and.b32 %r283, %r256, 48; 2026-02-21T09:09:01.2146024Z or.b32 %r284, %r16, %r283; 2026-02-21T09:09:01.2146211Z add.s32 %r285, %r47, 65536; 2026-02-21T09:09:01.2146368Z xor.b32 %r286, %r284, 16; 2026-02-21T09:09:01.2146526Z xor.b32 %r287, %r284, 32; 2026-02-21T09:09:01.2146672Z xor.b32 %r288, %r284, 48; 2026-02-21T09:09:01.2146934Z .loc 1 31 88 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:31:88 2026-02-21T09:09:01.2147219Z cvt.u64.u32 %rd82, %r257; 2026-02-21T09:09:01.2147382Z mad.lo.s32 %r289, %r6, 48, %r401; 2026-02-21T09:09:01.2147559Z xor.b32 %r27, %r17, 16; 2026-02-21T09:09:01.2147708Z add.s32 %r161, %r47, 102400; 2026-02-21T09:09:01.2147871Z add.s32 %r290, %r161, %r27; 2026-02-21T09:09:01.2148022Z add.s32 %r291, %r161, %r17; 2026-02-21T09:09:01.2148182Z add.s32 %r165, %r401, 32768; 2026-02-21T09:09:01.2148334Z 
add.s32 %r179, %r401, 61440; 2026-02-21T09:09:01.2148490Z add.s32 %r177, %r401, 57344; 2026-02-21T09:09:01.2148640Z add.s32 %r175, %r401, 53248; 2026-02-21T09:09:01.2148797Z add.s32 %r173, %r401, 49152; 2026-02-21T09:09:01.2148945Z add.s32 %r171, %r401, 45056; 2026-02-21T09:09:01.2149102Z add.s32 %r169, %r401, 40960; 2026-02-21T09:09:01.2149261Z add.s32 %r167, %r401, 36864; 2026-02-21T09:09:01.2149521Z .loc 1 38 33 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:38:33 2026-02-21T09:09:01.2149808Z shr.u32 %r292, %r3, 3; 2026-02-21T09:09:01.2149957Z and.b32 %r293, %r292, 240; 2026-02-21T09:09:01.2150222Z .loc 1 40 64 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:40:64 2026-02-21T09:09:01.2150505Z and.b32 %r294, %r3, 15; 2026-02-21T09:09:01.2150765Z .loc 1 40 30 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:40:30 2026-02-21T09:09:01.2151050Z or.b32 %r295, %r293, %r294; 2026-02-21T09:09:01.2151307Z .loc 1 42 27 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:42:27 2026-02-21T09:09:01.2151620Z shl.b32 %r701, %r295, 5; 2026-02-21T09:09:01.2151880Z .loc 1 43 27 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:43:27 2026-02-21T09:09:01.2152168Z shl.b32 %r296, %r3, 5; 2026-02-21T09:09:01.2152317Z and.b32 %r29, %r296, 3584; 2026-02-21T09:09:01.2152590Z .loc 1 44 32 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:44:32 2026-02-21T09:09:01.2152883Z or.b32 %r297, %r29, %r5; 2026-02-21T09:09:01.2153038Z or.b32 %r298, %r258, %r29; 2026-02-21T09:09:01.2153308Z .loc 1 58 53 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:53 2026-02-21T09:09:01.2153628Z shl.b32 %r299, %r297, 10; 2026-02-21T09:09:01.2153797Z shl.b32 %r300, %r298, 10; 2026-02-21T09:09:01.2153949Z or.b32 %r301, %r300, 196608; 2026-02-21T09:09:01.2154110Z or.b32 %r302, %r300, 458752; 2026-02-21T09:09:01.2154255Z $L__tmp0: 2026-02-21T09:09:01.2154546Z .loc 2 291 36 // standard.py:291:36 @[ 
cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2154906Z shfl.sync.idx.b32 %r30, %r259, 0, 31, -1; 2026-02-21T09:09:01.2155120Z shl.b32 %r303, %r30, 21; 2026-02-21T09:09:01.2155283Z and.b32 %r304, %r303, 6291456; 2026-02-21T09:09:01.2155443Z shl.b32 %r305, %r30, 3; 2026-02-21T09:09:01.2155640Z and.b32 %r306, %r305, 32; 2026-02-21T09:09:01.2155794Z or.b32 %r307, %r306, %r304; 2026-02-21T09:09:01.2155960Z add.s32 %r700, %r307, %r806; 2026-02-21T09:09:01.2156119Z mov.pred %p63, -1; 2026-02-21T09:09:01.2156275Z // begin inline asm 2026-02-21T09:09:01.2156617Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r700 + 0], {%r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57}; 2026-02-21T09:09:01.2156982Z // end inline asm 2026-02-21T09:09:01.2157127Z // begin inline asm 2026-02-21T09:09:01.2157450Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r700 + 16], {%r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57}; 2026-02-21T09:09:01.2157815Z // end inline asm 2026-02-21T09:09:01.2157983Z // begin inline asm 2026-02-21T09:09:01.2158316Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r700 + 64], {%r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57}; 2026-02-21T09:09:01.2158672Z // end inline asm 2026-02-21T09:09:01.2158807Z // begin inline asm 2026-02-21T09:09:01.2159131Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r700 + 80], {%r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57}; 2026-02-21T09:09:01.2159482Z // end inline asm 2026-02-21T09:09:01.2159624Z // begin inline asm 2026-02-21T09:09:01.2159777Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:01.2159946Z // end inline asm 2026-02-21T09:09:01.2160082Z bar.sync 0; 2026-02-21T09:09:01.2160208Z $L__tmp1: 2026-02-21T09:09:01.2160451Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 
2026-02-21T09:09:01.2160733Z add.s32 %r808, %r47, 103424; 2026-02-21T09:09:01.2160892Z // begin inline asm 2026-02-21T09:09:01.2161059Z @%p147 mbarrier.init.shared::cta.b64 [%r808], 1; 2026-02-21T09:09:01.2161253Z // end inline asm 2026-02-21T09:09:01.2161383Z bar.sync 0; 2026-02-21T09:09:01.2161523Z add.s32 %r141, %r47, 103432; 2026-02-21T09:09:01.2161700Z // begin inline asm 2026-02-21T09:09:01.2161876Z @%p147 mbarrier.init.shared::cta.b64 [%r141], 1; 2026-02-21T09:09:01.2162071Z // end inline asm 2026-02-21T09:09:01.2162207Z add.s32 %r417, %r47, 103440; 2026-02-21T09:09:01.2162367Z // begin inline asm 2026-02-21T09:09:01.2162531Z @%p147 mbarrier.init.shared::cta.b64 [%r417], 1; 2026-02-21T09:09:01.2162724Z // end inline asm 2026-02-21T09:09:01.2162856Z bar.sync 0; 2026-02-21T09:09:01.2162996Z add.s32 %r143, %r47, 103448; 2026-02-21T09:09:01.2163150Z // begin inline asm 2026-02-21T09:09:01.2163323Z @%p147 mbarrier.init.shared::cta.b64 [%r143], 1; 2026-02-21T09:09:01.2163520Z // end inline asm 2026-02-21T09:09:01.2163773Z .loc 1 58 60 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:60 2026-02-21T09:09:01.2164073Z or.b32 %r308, %r299, %r257; 2026-02-21T09:09:01.2164233Z or.b32 %r309, %r301, %r257; 2026-02-21T09:09:01.2164393Z or.b32 %r310, %r302, %r257; 2026-02-21T09:09:01.2164659Z .loc 1 58 32 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:32 2026-02-21T09:09:01.2164955Z mad.wide.u32 %rd60, %r308, 2, %rd21; 2026-02-21T09:09:01.2165135Z cvt.u64.u32 %rd7, %r299; 2026-02-21T09:09:01.2165300Z or.b64 %rd83, %rd7, %rd82; 2026-02-21T09:09:01.2165499Z shl.b64 %rd84, %rd83, 1; 2026-02-21T09:09:01.2165650Z add.s64 %rd8, %rd21, %rd84; 2026-02-21T09:09:01.2165813Z add.s64 %rd61, %rd8, 131072; 2026-02-21T09:09:01.2165966Z add.s64 %rd62, %rd8, 262144; 2026-02-21T09:09:01.2166135Z mad.wide.u32 %rd63, %r309, 2, %rd21; 2026-02-21T09:09:01.2166307Z add.s64 %rd64, %rd8, 524288; 2026-02-21T09:09:01.2166467Z add.s64 %rd65, %rd8, 655360; 
2026-02-21T09:09:01.2166621Z add.s64 %rd66, %rd8, 786432; 2026-02-21T09:09:01.2166814Z mad.wide.u32 %rd67, %r310, 2, %rd21; 2026-02-21T09:09:01.2167098Z .loc 1 58 80 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:80 2026-02-21T09:09:01.2167378Z // begin inline asm 2026-02-21T09:09:01.2167622Z cp.async.cg.shared.global [ %r401 + 0 ], [ %rd60 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2167850Z // end inline asm 2026-02-21T09:09:01.2167991Z // begin inline asm 2026-02-21T09:09:01.2168186Z cp.async.cg.shared.global [ %r403 + 0 ], [ %rd61 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2168414Z // end inline asm 2026-02-21T09:09:01.2168549Z // begin inline asm 2026-02-21T09:09:01.2168746Z cp.async.cg.shared.global [ %r405 + 0 ], [ %rd62 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2168969Z // end inline asm 2026-02-21T09:09:01.2169099Z // begin inline asm 2026-02-21T09:09:01.2169294Z cp.async.cg.shared.global [ %r407 + 0 ], [ %rd63 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2169508Z // end inline asm 2026-02-21T09:09:01.2169676Z // begin inline asm 2026-02-21T09:09:01.2169869Z cp.async.cg.shared.global [ %r409 + 0 ], [ %rd64 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2170089Z // end inline asm 2026-02-21T09:09:01.2170220Z // begin inline asm 2026-02-21T09:09:01.2170418Z cp.async.cg.shared.global [ %r411 + 0 ], [ %rd65 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2170637Z // end inline asm 2026-02-21T09:09:01.2170769Z // begin inline asm 2026-02-21T09:09:01.2170965Z cp.async.cg.shared.global [ %r413 + 0 ], [ %rd66 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2171189Z // end inline asm 2026-02-21T09:09:01.2171335Z // begin inline asm 2026-02-21T09:09:01.2171530Z cp.async.cg.shared.global [ %r415 + 0 ], [ %rd67 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2171812Z // end inline asm 2026-02-21T09:09:01.2171961Z cp.async.commit_group; 2026-02-21T09:09:01.2172245Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2172543Z bar.sync 0; 
2026-02-21T09:09:01.2172681Z // begin inline asm 2026-02-21T09:09:01.2172892Z @%p147 mbarrier.arrive.expect_tx.shared.b64 _, [%r417], 512; 2026-02-21T09:09:01.2173123Z // end inline asm 2026-02-21T09:09:01.2173391Z .loc 1 64 33 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:64:33 2026-02-21T09:09:01.2173703Z bar.sync 0; 2026-02-21T09:09:01.2173870Z elect.sync %r311|%p58, -1; 2026-02-21T09:09:01.2174053Z and.pred %p49, %p1, %p58; 2026-02-21T09:09:01.2174226Z // begin inline asm 2026-02-21T09:09:01.2174573Z @%p49 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r161], [%rd115, {%r701, %r57}], [%r417]; 2026-02-21T09:09:01.2174940Z // end inline asm 2026-02-21T09:09:01.2175198Z .loc 1 58 32 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:32 2026-02-21T09:09:01.2175498Z add.s64 %rd69, %rd8, 64; 2026-02-21T09:09:01.2175664Z or.b32 %r312, %r308, 32; 2026-02-21T09:09:01.2175829Z mad.wide.u32 %rd85, %r312, 2, %rd21; 2026-02-21T09:09:01.2176023Z add.s64 %rd70, %rd85, 131072; 2026-02-21T09:09:01.2176193Z add.s64 %rd71, %rd85, 262144; 2026-02-21T09:09:01.2176367Z cvt.u64.u32 %rd9, %r301; 2026-02-21T09:09:01.2176534Z or.b64 %rd86, %rd9, %rd82; 2026-02-21T09:09:01.2176698Z shl.b64 %rd87, %rd86, 1; 2026-02-21T09:09:01.2176866Z add.s64 %rd10, %rd21, %rd87; 2026-02-21T09:09:01.2177031Z add.s64 %rd72, %rd10, 64; 2026-02-21T09:09:01.2177197Z add.s64 %rd73, %rd85, 524288; 2026-02-21T09:09:01.2177361Z add.s64 %rd74, %rd85, 655360; 2026-02-21T09:09:01.2177530Z add.s64 %rd75, %rd85, 786432; 2026-02-21T09:09:01.2177726Z cvt.u64.u32 %rd11, %r302; 2026-02-21T09:09:01.2177896Z or.b64 %rd88, %rd11, %rd82; 2026-02-21T09:09:01.2178066Z shl.b64 %rd89, %rd88, 1; 2026-02-21T09:09:01.2178225Z add.s64 %rd12, %rd21, %rd89; 2026-02-21T09:09:01.2178395Z add.s64 %rd76, %rd12, 64; 2026-02-21T09:09:01.2178669Z .loc 1 58 80 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:80 2026-02-21T09:09:01.2178954Z // begin inline asm 
2026-02-21T09:09:01.2179209Z cp.async.cg.shared.global [ %r165 + 0 ], [ %rd69 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2179435Z // end inline asm 2026-02-21T09:09:01.2179571Z // begin inline asm 2026-02-21T09:09:01.2179795Z cp.async.cg.shared.global [ %r167 + 0 ], [ %rd70 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2180020Z // end inline asm 2026-02-21T09:09:01.2180152Z // begin inline asm 2026-02-21T09:09:01.2180349Z cp.async.cg.shared.global [ %r169 + 0 ], [ %rd71 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2180563Z // end inline asm 2026-02-21T09:09:01.2180707Z // begin inline asm 2026-02-21T09:09:01.2180895Z cp.async.cg.shared.global [ %r171 + 0 ], [ %rd72 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2181118Z // end inline asm 2026-02-21T09:09:01.2181249Z // begin inline asm 2026-02-21T09:09:01.2181449Z cp.async.cg.shared.global [ %r173 + 0 ], [ %rd73 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2181705Z // end inline asm 2026-02-21T09:09:01.2181841Z // begin inline asm 2026-02-21T09:09:01.2182065Z cp.async.cg.shared.global [ %r175 + 0 ], [ %rd74 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2182285Z // end inline asm 2026-02-21T09:09:01.2182432Z // begin inline asm 2026-02-21T09:09:01.2182629Z cp.async.cg.shared.global [ %r177 + 0 ], [ %rd75 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2182858Z // end inline asm 2026-02-21T09:09:01.2182999Z // begin inline asm 2026-02-21T09:09:01.2183207Z cp.async.cg.shared.global [ %r179 + 0 ], [ %rd76 + 0 ], 0x10, %r145; 2026-02-21T09:09:01.2183432Z // end inline asm 2026-02-21T09:09:01.2183592Z cp.async.commit_group; 2026-02-21T09:09:01.2183865Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2184147Z bar.sync 0; 2026-02-21T09:09:01.2184290Z // begin inline asm 2026-02-21T09:09:01.2184487Z @%p147 mbarrier.arrive.expect_tx.shared.b64 _, [%r143], 512; 2026-02-21T09:09:01.2184715Z // end inline asm 2026-02-21T09:09:01.2184954Z .loc 1 64 33 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:64:33 
2026-02-21T09:09:01.2185247Z bar.sync 0; 2026-02-21T09:09:01.2185398Z elect.sync %r313|%p59, -1; 2026-02-21T09:09:01.2185570Z and.pred %p51, %p1, %p59; 2026-02-21T09:09:01.2185740Z add.s32 %r182, %r47, 102912; 2026-02-21T09:09:01.2185900Z // begin inline asm 2026-02-21T09:09:01.2186234Z @%p51 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r182], [%rd115, {%r701, %r145}], [%r143]; 2026-02-21T09:09:01.2186584Z // end inline asm 2026-02-21T09:09:01.2186838Z .loc 1 58 80 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:80 2026-02-21T09:09:01.2187129Z cp.async.wait_group 1; 2026-02-21T09:09:01.2187290Z bar.sync 0; 2026-02-21T09:09:01.2187470Z ld.shared.v4.b32 {%r314, %r315, %r316, %r317}, [%r289]; 2026-02-21T09:09:01.2187683Z mov.b32 {%rs1, %rs2}, %r317; 2026-02-21T09:09:01.2187857Z mov.b32 {%rs3, %rs4}, %r316; 2026-02-21T09:09:01.2188017Z mov.b32 {%rs5, %rs6}, %r315; 2026-02-21T09:09:01.2188182Z mov.b32 {%rs7, %rs8}, %r314; 2026-02-21T09:09:01.2188379Z ld.shared.v4.b32 {%r318, %r319, %r320, %r321}, [%r289+16]; 2026-02-21T09:09:01.2188598Z mov.b32 {%rs9, %rs10}, %r321; 2026-02-21T09:09:01.2188764Z mov.b32 {%rs11, %rs12}, %r320; 2026-02-21T09:09:01.2188942Z mov.b32 {%rs13, %rs14}, %r319; 2026-02-21T09:09:01.2189114Z mov.b32 {%rs15, %rs16}, %r318; 2026-02-21T09:09:01.2189311Z ld.shared.v4.b32 {%r322, %r323, %r324, %r325}, [%r289+32]; 2026-02-21T09:09:01.2189524Z mov.b32 {%rs17, %rs18}, %r325; 2026-02-21T09:09:01.2189710Z mov.b32 {%rs19, %rs20}, %r324; 2026-02-21T09:09:01.2189873Z mov.b32 {%rs21, %rs22}, %r323; 2026-02-21T09:09:01.2190026Z mov.b32 {%rs23, %rs24}, %r322; 2026-02-21T09:09:01.2190223Z ld.shared.v4.b32 {%r326, %r327, %r328, %r329}, [%r289+48]; 2026-02-21T09:09:01.2190425Z mov.b32 {%rs25, %rs26}, %r329; 2026-02-21T09:09:01.2190588Z mov.b32 {%rs27, %rs28}, %r328; 2026-02-21T09:09:01.2190752Z mov.b32 {%rs29, %rs30}, %r327; 2026-02-21T09:09:01.2190909Z mov.b32 {%rs31, %rs32}, %r326; 2026-02-21T09:09:01.2191143Z 
ld.shared.v4.b32 {%r330, %r331, %r332, %r333}, [%r289+16384]; 2026-02-21T09:09:01.2191354Z mov.b32 {%rs33, %rs34}, %r333; 2026-02-21T09:09:01.2191520Z mov.b32 {%rs35, %rs36}, %r332; 2026-02-21T09:09:01.2191754Z mov.b32 {%rs37, %rs38}, %r331; 2026-02-21T09:09:01.2191921Z mov.b32 {%rs39, %rs40}, %r330; 2026-02-21T09:09:01.2192115Z ld.shared.v4.b32 {%r334, %r335, %r336, %r337}, [%r289+16400]; 2026-02-21T09:09:01.2192332Z mov.b32 {%rs41, %rs42}, %r337; 2026-02-21T09:09:01.2192498Z mov.b32 {%rs43, %rs44}, %r336; 2026-02-21T09:09:01.2192657Z mov.b32 {%rs45, %rs46}, %r335; 2026-02-21T09:09:01.2192822Z mov.b32 {%rs47, %rs48}, %r334; 2026-02-21T09:09:01.2193015Z ld.shared.v4.b32 {%r338, %r339, %r340, %r341}, [%r289+16416]; 2026-02-21T09:09:01.2193237Z mov.b32 {%rs49, %rs50}, %r341; 2026-02-21T09:09:01.2193399Z mov.b32 {%rs51, %rs52}, %r340; 2026-02-21T09:09:01.2193572Z mov.b32 {%rs53, %rs54}, %r339; 2026-02-21T09:09:01.2193752Z mov.b32 {%rs55, %rs56}, %r338; 2026-02-21T09:09:01.2193957Z ld.shared.v4.b32 {%r342, %r343, %r344, %r345}, [%r289+16432]; 2026-02-21T09:09:01.2194170Z mov.b32 {%rs57, %rs58}, %r345; 2026-02-21T09:09:01.2194326Z mov.b32 {%rs59, %rs60}, %r344; 2026-02-21T09:09:01.2194491Z mov.b32 {%rs61, %rs62}, %r343; 2026-02-21T09:09:01.2194647Z mov.b32 {%rs63, %rs64}, %r342; 2026-02-21T09:09:01.2194920Z .loc 1 62 32 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:62:32 2026-02-21T09:09:01.2195208Z cvt.f32.bf16 %r187, %rs7; 2026-02-21T09:09:01.2195374Z cvt.f32.bf16 %r188, %rs8; 2026-02-21T09:09:01.2195528Z cvt.f32.bf16 %r189, %rs5; 2026-02-21T09:09:01.2195685Z cvt.f32.bf16 %r190, %rs6; 2026-02-21T09:09:01.2195839Z cvt.f32.bf16 %r191, %rs3; 2026-02-21T09:09:01.2195989Z cvt.f32.bf16 %r192, %rs4; 2026-02-21T09:09:01.2196146Z cvt.f32.bf16 %r193, %rs1; 2026-02-21T09:09:01.2196294Z cvt.f32.bf16 %r194, %rs2; 2026-02-21T09:09:01.2196453Z cvt.f32.bf16 %r195, %rs15; 2026-02-21T09:09:01.2196608Z cvt.f32.bf16 %r196, %rs16; 2026-02-21T09:09:01.2196770Z cvt.f32.bf16 
%r197, %rs13; 2026-02-21T09:09:01.2196920Z cvt.f32.bf16 %r198, %rs14; 2026-02-21T09:09:01.2197079Z cvt.f32.bf16 %r199, %rs11; 2026-02-21T09:09:01.2197230Z cvt.f32.bf16 %r200, %rs12; 2026-02-21T09:09:01.2197388Z cvt.f32.bf16 %r201, %rs9; 2026-02-21T09:09:01.2197543Z cvt.f32.bf16 %r202, %rs10; 2026-02-21T09:09:01.2197692Z cvt.f32.bf16 %r204, %rs23; 2026-02-21T09:09:01.2197848Z cvt.f32.bf16 %r205, %rs24; 2026-02-21T09:09:01.2197996Z cvt.f32.bf16 %r206, %rs21; 2026-02-21T09:09:01.2198153Z cvt.f32.bf16 %r207, %rs22; 2026-02-21T09:09:01.2198298Z cvt.f32.bf16 %r208, %rs19; 2026-02-21T09:09:01.2198453Z cvt.f32.bf16 %r209, %rs20; 2026-02-21T09:09:01.2198601Z cvt.f32.bf16 %r210, %rs17; 2026-02-21T09:09:01.2198758Z cvt.f32.bf16 %r211, %rs18; 2026-02-21T09:09:01.2198905Z cvt.f32.bf16 %r212, %rs31; 2026-02-21T09:09:01.2199058Z cvt.f32.bf16 %r213, %rs32; 2026-02-21T09:09:01.2199214Z cvt.f32.bf16 %r214, %rs29; 2026-02-21T09:09:01.2199363Z cvt.f32.bf16 %r215, %rs30; 2026-02-21T09:09:01.2199519Z cvt.f32.bf16 %r216, %rs27; 2026-02-21T09:09:01.2199668Z cvt.f32.bf16 %r217, %rs28; 2026-02-21T09:09:01.2199825Z cvt.f32.bf16 %r218, %rs25; 2026-02-21T09:09:01.2199976Z cvt.f32.bf16 %r219, %rs26; 2026-02-21T09:09:01.2200132Z cvt.f32.bf16 %r221, %rs39; 2026-02-21T09:09:01.2200280Z cvt.f32.bf16 %r222, %rs40; 2026-02-21T09:09:01.2200437Z cvt.f32.bf16 %r223, %rs37; 2026-02-21T09:09:01.2200584Z cvt.f32.bf16 %r224, %rs38; 2026-02-21T09:09:01.2200774Z cvt.f32.bf16 %r225, %rs35; 2026-02-21T09:09:01.2200932Z cvt.f32.bf16 %r226, %rs36; 2026-02-21T09:09:01.2201081Z cvt.f32.bf16 %r227, %rs33; 2026-02-21T09:09:01.2201241Z cvt.f32.bf16 %r228, %rs34; 2026-02-21T09:09:01.2201394Z cvt.f32.bf16 %r229, %rs47; 2026-02-21T09:09:01.2201578Z cvt.f32.bf16 %r230, %rs48; 2026-02-21T09:09:01.2201734Z cvt.f32.bf16 %r231, %rs45; 2026-02-21T09:09:01.2201892Z cvt.f32.bf16 %r232, %rs46; 2026-02-21T09:09:01.2202044Z cvt.f32.bf16 %r233, %rs43; 2026-02-21T09:09:01.2202230Z cvt.f32.bf16 %r234, %rs44; 
2026-02-21T09:09:01.2202379Z cvt.f32.bf16 %r235, %rs41; 2026-02-21T09:09:01.2202535Z cvt.f32.bf16 %r236, %rs42; 2026-02-21T09:09:01.2202726Z cvt.f32.bf16 %r238, %rs55; 2026-02-21T09:09:01.2202880Z cvt.f32.bf16 %r239, %rs56; 2026-02-21T09:09:01.2203039Z cvt.f32.bf16 %r240, %rs53; 2026-02-21T09:09:01.2203190Z cvt.f32.bf16 %r241, %rs54; 2026-02-21T09:09:01.2203349Z cvt.f32.bf16 %r242, %rs51; 2026-02-21T09:09:01.2203498Z cvt.f32.bf16 %r243, %rs52; 2026-02-21T09:09:01.2203655Z cvt.f32.bf16 %r244, %rs49; 2026-02-21T09:09:01.2203806Z cvt.f32.bf16 %r245, %rs50; 2026-02-21T09:09:01.2203964Z cvt.f32.bf16 %r246, %rs63; 2026-02-21T09:09:01.2204115Z cvt.f32.bf16 %r247, %rs64; 2026-02-21T09:09:01.2204274Z cvt.f32.bf16 %r248, %rs61; 2026-02-21T09:09:01.2204431Z cvt.f32.bf16 %r249, %rs62; 2026-02-21T09:09:01.2204582Z cvt.f32.bf16 %r250, %rs59; 2026-02-21T09:09:01.2204767Z cvt.f32.bf16 %r251, %rs60; 2026-02-21T09:09:01.2204920Z cvt.f32.bf16 %r252, %rs57; 2026-02-21T09:09:01.2205078Z cvt.f32.bf16 %r253, %rs58; 2026-02-21T09:09:01.2205222Z $L__tmp2: 2026-02-21T09:09:01.2205512Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2205843Z add.s32 %r186, %r307, %r351; 2026-02-21T09:09:01.2206003Z // begin inline asm 2026-02-21T09:09:01.2206372Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r186 + 0], {%r187, %r188, %r189, %r190, %r191, %r192, %r193, %r194, %r195, %r196, %r197, %r198, %r199, %r200, %r201, %r202}; 2026-02-21T09:09:01.2206752Z // end inline asm 2026-02-21T09:09:01.2206893Z // begin inline asm 2026-02-21T09:09:01.2207242Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r186 + 16], {%r204, %r205, %r206, %r207, %r208, %r209, %r210, %r211, %r212, %r213, %r214, %r215, %r216, %r217, %r218, %r219}; 2026-02-21T09:09:01.2207624Z // end inline asm 2026-02-21T09:09:01.2207758Z // begin inline asm 2026-02-21T09:09:01.2208110Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r186 + 64], {%r221, %r222, %r223, %r224, 
%r225, %r226, %r227, %r228, %r229, %r230, %r231, %r232, %r233, %r234, %r235, %r236}; 2026-02-21T09:09:01.2208496Z // end inline asm 2026-02-21T09:09:01.2208629Z // begin inline asm 2026-02-21T09:09:01.2208977Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r186 + 80], {%r238, %r239, %r240, %r241, %r242, %r243, %r244, %r245, %r246, %r247, %r248, %r249, %r250, %r251, %r252, %r253}; 2026-02-21T09:09:01.2209352Z // end inline asm 2026-02-21T09:09:01.2209495Z // begin inline asm 2026-02-21T09:09:01.2209647Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:01.2209815Z // end inline asm 2026-02-21T09:09:01.2209952Z bar.sync 0; 2026-02-21T09:09:01.2210080Z $L__tmp3: 2026-02-21T09:09:01.2210323Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2210603Z // begin inline asm 2026-02-21T09:09:01.2210744Z 2026-02-21T09:09:01.2210860Z { 2026-02-21T09:09:01.2210994Z .reg .pred complete; 2026-02-21T09:09:01.2211142Z waitLoop: 2026-02-21T09:09:01.2211338Z mbarrier.try_wait.parity.shared.b64 complete, [%r417], %r57; 2026-02-21T09:09:01.2211600Z @!complete bra.uni waitLoop; 2026-02-21T09:09:01.2211777Z } 2026-02-21T09:09:01.2211844Z 2026-02-21T09:09:01.2211907Z // end inline asm 2026-02-21T09:09:01.2212151Z .loc 1 64 33 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:64:33 2026-02-21T09:09:01.2212446Z ld.shared.b8 %rs65, [%r291]; 2026-02-21T09:09:01.2212646Z ld.shared.b8 %rs66, [%r291+256]; 2026-02-21T09:09:01.2212826Z ld.shared.b8 %rs67, [%r290+128]; 2026-02-21T09:09:01.2212993Z ld.shared.b8 %rs68, [%r290+384]; 2026-02-21T09:09:01.2213269Z .loc 1 67 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:67:28 2026-02-21T09:09:01.2213560Z shl.b16 %rs69, %rs65, 4; 2026-02-21T09:09:01.2213715Z shl.b16 %rs70, %rs67, 4; 2026-02-21T09:09:01.2213873Z shl.b16 %rs71, %rs66, 4; 2026-02-21T09:09:01.2214057Z shl.b16 %rs72, %rs68, 4; 2026-02-21T09:09:01.2214316Z .loc 1 82 58 // 
cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:82:58 2026-02-21T09:09:01.2214634Z selp.b16 %rs73, %rs69, %rs65, %p56; 2026-02-21T09:09:01.2214819Z cvt.s16.s8 %rs74, %rs73; 2026-02-21T09:09:01.2214966Z shr.s16 %rs75, %rs74, 4; 2026-02-21T09:09:01.2215137Z selp.b16 %rs76, %rs70, %rs67, %p56; 2026-02-21T09:09:01.2215324Z cvt.s16.s8 %rs77, %rs76; 2026-02-21T09:09:01.2215478Z shr.s16 %rs78, %rs77, 4; 2026-02-21T09:09:01.2215651Z selp.b16 %rs79, %rs71, %rs66, %p56; 2026-02-21T09:09:01.2215827Z cvt.s16.s8 %rs80, %rs79; 2026-02-21T09:09:01.2215989Z shr.s16 %rs81, %rs80, 4; 2026-02-21T09:09:01.2216148Z selp.b16 %rs82, %rs72, %rs68, %p56; 2026-02-21T09:09:01.2216330Z cvt.s16.s8 %rs83, %rs82; 2026-02-21T09:09:01.2216480Z shr.s16 %rs84, %rs83, 4; 2026-02-21T09:09:01.2216775Z .loc 1 87 32 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:87:32 2026-02-21T09:09:01.2217077Z cvt.rn.f32.s16 %r346, %rs75; 2026-02-21T09:09:01.2217245Z cvt.rn.f32.s16 %r347, %rs78; 2026-02-21T09:09:01.2217414Z cvt.rn.f32.s16 %r348, %rs81; 2026-02-21T09:09:01.2217577Z cvt.rn.f32.s16 %r349, %rs84; 2026-02-21T09:09:01.2217744Z st.shared.b32 [%r18], %r346; 2026-02-21T09:09:01.2217904Z st.shared.b32 [%r19], %r347; 2026-02-21T09:09:01.2218070Z st.shared.b32 [%r20], %r348; 2026-02-21T09:09:01.2218231Z st.shared.b32 [%r21], %r349; 2026-02-21T09:09:01.2218394Z $L__tmp4: 2026-02-21T09:09:01.2218693Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2219040Z // begin inline asm 2026-02-21T09:09:01.2219213Z fence.proxy.async.shared::cta; 2026-02-21T09:09:01.2219386Z // end inline asm 2026-02-21T09:09:01.2219533Z bar.sync 0; 2026-02-21T09:09:01.2219676Z setp.ne.b32 %p60, %r30, 0; 2026-02-21T09:09:01.2219849Z @%p60 bra $L__BB0_3; 2026-02-21T09:09:01.2220001Z // %bb.2: 2026-02-21T09:09:01.2220151Z elect.sync %r398|%p62, -1; 2026-02-21T09:09:01.2220317Z mov.b32 %r352, 134744336; 2026-02-21T09:09:01.2220483Z mov.pred %p61, 0; 
2026-02-21T09:09:01.2220639Z // begin inline asm 2026-02-21T09:09:01.2220892Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 0 ], [ %r351 + 0 ], %rd90, %r352, %p61; 2026-02-21T09:09:01.2221180Z // end inline asm 2026-02-21T09:09:01.2221325Z // begin inline asm 2026-02-21T09:09:01.2221604Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 0 ], [ %r351 + 8 ], %rd91, %r352, %p63; 2026-02-21T09:09:01.2221878Z // end inline asm 2026-02-21T09:09:01.2222027Z // begin inline asm 2026-02-21T09:09:01.2222266Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 0 ], [ %r351 + 16 ], %rd92, %r352, %p63; 2026-02-21T09:09:01.2222543Z // end inline asm 2026-02-21T09:09:01.2222690Z // begin inline asm 2026-02-21T09:09:01.2222928Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 0 ], [ %r351 + 24 ], %rd93, %r352, %p63; 2026-02-21T09:09:01.2223207Z // end inline asm 2026-02-21T09:09:01.2223340Z // begin inline asm 2026-02-21T09:09:01.2223574Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 32 ], [ %r351 + 32 ], %rd90, %r352, %p61; 2026-02-21T09:09:01.2223834Z // end inline asm 2026-02-21T09:09:01.2223976Z // begin inline asm 2026-02-21T09:09:01.2224207Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 32 ], [ %r351 + 40 ], %rd91, %r352, %p63; 2026-02-21T09:09:01.2224461Z // end inline asm 2026-02-21T09:09:01.2224602Z // begin inline asm 2026-02-21T09:09:01.2224876Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 32 ], [ %r351 + 48 ], %rd92, %r352, %p63; 2026-02-21T09:09:01.2225139Z // end inline asm 2026-02-21T09:09:01.2225274Z // begin inline asm 2026-02-21T09:09:01.2225502Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 32 ], [ %r351 + 56 ], %rd93, %r352, %p63; 2026-02-21T09:09:01.2225766Z // end inline asm 2026-02-21T09:09:01.2225897Z // begin inline asm 2026-02-21T09:09:01.2226129Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 64 ], [ %r351 + 64 ], %rd90, %r352, %p61; 2026-02-21T09:09:01.2226414Z // end inline asm 2026-02-21T09:09:01.2226555Z // 
begin inline asm 2026-02-21T09:09:01.2226803Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 64 ], [ %r351 + 72 ], %rd91, %r352, %p63; 2026-02-21T09:09:01.2227064Z // end inline asm 2026-02-21T09:09:01.2227205Z // begin inline asm 2026-02-21T09:09:01.2227430Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 64 ], [ %r351 + 80 ], %rd92, %r352, %p63; 2026-02-21T09:09:01.2227696Z // end inline asm 2026-02-21T09:09:01.2227829Z // begin inline asm 2026-02-21T09:09:01.2228056Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 64 ], [ %r351 + 88 ], %rd93, %r352, %p63; 2026-02-21T09:09:01.2228309Z // end inline asm 2026-02-21T09:09:01.2228449Z // begin inline asm 2026-02-21T09:09:01.2228678Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 96 ], [ %r351 + 96 ], %rd90, %r352, %p61; 2026-02-21T09:09:01.2228965Z // end inline asm 2026-02-21T09:09:01.2229108Z // begin inline asm 2026-02-21T09:09:01.2229338Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 96 ], [ %r351 + 104 ], %rd91, %r352, %p63; 2026-02-21T09:09:01.2229608Z // end inline asm 2026-02-21T09:09:01.2229741Z // begin inline asm 2026-02-21T09:09:01.2229978Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 96 ], [ %r351 + 112 ], %rd92, %r352, %p63; 2026-02-21T09:09:01.2230236Z // end inline asm 2026-02-21T09:09:01.2230377Z // begin inline asm 2026-02-21T09:09:01.2230608Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 96 ], [ %r351 + 120 ], %rd93, %r352, %p63; 2026-02-21T09:09:01.2230867Z // end inline asm 2026-02-21T09:09:01.2231014Z add.s32 %r400, %r47, 103424; 2026-02-21T09:09:01.2231174Z cvt.u64.u32 %rd106, %r400; 2026-02-21T09:09:01.2231333Z // begin inline asm 2026-02-21T09:09:01.2231567Z @%p62 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd106]; 2026-02-21T09:09:01.2231807Z // end inline asm 2026-02-21T09:09:01.2231951Z $L__tmp5: 2026-02-21T09:09:01.2232079Z $L__BB0_3: 2026-02-21T09:09:01.2232248Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:01.2232447Z add.s32 %r23, 
%r285, %r284; 2026-02-21T09:09:01.2232611Z add.s32 %r24, %r285, %r286; 2026-02-21T09:09:01.2232765Z add.s32 %r25, %r285, %r287; 2026-02-21T09:09:01.2232924Z add.s32 %r26, %r285, %r288; 2026-02-21T09:09:01.2233178Z .loc 1 58 32 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:32 2026-02-21T09:09:01.2233471Z add.s64 %rd107, %rd8, 128; 2026-02-21T09:09:01.2233631Z cvt.u64.u32 %rd117, %r22; 2026-02-21T09:09:01.2233799Z add.s64 %rd118, %rd7, %rd117; 2026-02-21T09:09:01.2233965Z shl.b64 %rd119, %rd118, 1; 2026-02-21T09:09:01.2234124Z add.s64 %rd120, %rd21, %rd119; 2026-02-21T09:09:01.2234297Z add.s64 %rd108, %rd120, 131072; 2026-02-21T09:09:01.2234461Z add.s64 %rd109, %rd120, 262144; 2026-02-21T09:09:01.2234631Z add.s64 %rd110, %rd10, 128; 2026-02-21T09:09:01.2234788Z add.s64 %rd111, %rd120, 524288; 2026-02-21T09:09:01.2234956Z add.s64 %rd112, %rd120, 655360; 2026-02-21T09:09:01.2235112Z add.s64 %rd113, %rd120, 786432; 2026-02-21T09:09:01.2235276Z add.s64 %rd114, %rd12, 128; 2026-02-21T09:09:01.2235438Z mov.b32 %r402, 16; 2026-02-21T09:09:01.2235687Z .loc 1 58 80 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:80 2026-02-21T09:09:01.2235976Z // begin inline asm 2026-02-21T09:09:01.2236179Z cp.async.cg.shared.global [ %r401 + 0 ], [ %rd107 + 0 ], 0x10, %r402; 2026-02-21T09:09:01.2236445Z // end inline asm 2026-02-21T09:09:01.2236581Z // begin inline asm 2026-02-21T09:09:01.2236785Z cp.async.cg.shared.global [ %r403 + 0 ], [ %rd108 + 0 ], 0x10, %r402; 2026-02-21T09:09:01.2237006Z // end inline asm 2026-02-21T09:09:01.2237146Z // begin inline asm 2026-02-21T09:09:01.2237345Z cp.async.cg.shared.global [ %r405 + 0 ], [ %rd109 + 0 ], 0x10, %r402; 2026-02-21T09:09:01.2237564Z // end inline asm 2026-02-21T09:09:01.2237702Z // begin inline asm 2026-02-21T09:09:01.2237914Z cp.async.cg.shared.global [ %r407 + 0 ], [ %rd110 + 0 ], 0x10, %r402; 2026-02-21T09:09:01.2238141Z // end inline asm 2026-02-21T09:09:01.2238272Z // begin inline asm 
2026-02-21T09:09:01.2238495Z cp.async.cg.shared.global [ %r409 + 0 ], [ %rd111 + 0 ], 0x10, %r402; 2026-02-21T09:09:01.2238711Z // end inline asm 2026-02-21T09:09:01.2238849Z // begin inline asm 2026-02-21T09:09:01.2239048Z cp.async.cg.shared.global [ %r411 + 0 ], [ %rd112 + 0 ], 0x10, %r402; 2026-02-21T09:09:01.2239267Z // end inline asm 2026-02-21T09:09:01.2239408Z // begin inline asm 2026-02-21T09:09:01.2239597Z cp.async.cg.shared.global [ %r413 + 0 ], [ %rd113 + 0 ], 0x10, %r402; 2026-02-21T09:09:01.2239817Z // end inline asm 2026-02-21T09:09:01.2239950Z // begin inline asm 2026-02-21T09:09:01.2240148Z cp.async.cg.shared.global [ %r415 + 0 ], [ %rd114 + 0 ], 0x10, %r402; 2026-02-21T09:09:01.2240366Z // end inline asm 2026-02-21T09:09:01.2240514Z cp.async.commit_group; 2026-02-21T09:09:01.2240806Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2241088Z // begin inline asm 2026-02-21T09:09:01.2241288Z @%p147 mbarrier.arrive.expect_tx.shared.b64 _, [%r417], 512; 2026-02-21T09:09:01.2241510Z // end inline asm 2026-02-21T09:09:01.2241787Z .loc 1 64 33 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:64:33 2026-02-21T09:09:01.2242066Z bar.sync 0; 2026-02-21T09:09:01.2242213Z elect.sync %r426|%p97, -1; 2026-02-21T09:09:01.2242382Z and.pred %p95, %p1, %p97; 2026-02-21T09:09:01.2242548Z mov.b32 %r420, 32; 2026-02-21T09:09:01.2242691Z // begin inline asm 2026-02-21T09:09:01.2243016Z @%p95 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r161], [%rd115, {%r701, %r420}], [%r417]; 2026-02-21T09:09:01.2243377Z // end inline asm 2026-02-21T09:09:01.2243622Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2243915Z shl.b64 %rd121, %rd11, 1; 2026-02-21T09:09:01.2244073Z add.s64 %rd13, %rd121, 192; 2026-02-21T09:09:01.2244239Z and.b32 %r427, %r1, 3; 2026-02-21T09:09:01.2244408Z mad.wide.u32 %rd279, %r427, 16, %rd21; 
2026-02-21T09:09:01.2244587Z shl.b64 %rd122, %rd9, 1; 2026-02-21T09:09:01.2244748Z add.s64 %rd15, %rd122, 192; 2026-02-21T09:09:01.2244905Z shl.b32 %r428, %r3, 15; 2026-02-21T09:09:01.2245066Z and.b32 %r429, %r428, 3670016; 2026-02-21T09:09:01.2245226Z shl.b32 %r430, %r5, 10; 2026-02-21T09:09:01.2245384Z or.b32 %r431, %r429, %r430; 2026-02-21T09:09:01.2245546Z mul.wide.u32 %rd16, %r431, 2; 2026-02-21T09:09:01.2245712Z mov.b32 %r811, 1; 2026-02-21T09:09:01.2245846Z mov.b32 %r807, 0; 2026-02-21T09:09:01.2245989Z mov.b64 %rd280, 0; 2026-02-21T09:09:01.2246134Z mov.b32 %r809, %r807; 2026-02-21T09:09:01.2246278Z mov.b32 %r810, %r807; 2026-02-21T09:09:01.2246425Z mov.b32 %r812, %r807; 2026-02-21T09:09:01.2246565Z bra.uni $L__BB0_4; 2026-02-21T09:09:01.2246757Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:01.2247076Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2247370Z setp.lt.u64 %p140, %rd280, 464; 2026-02-21T09:09:01.2247531Z $L__tmp6: 2026-02-21T09:09:01.2247820Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2248153Z add.s32 %r623, %r811, 1; 2026-02-21T09:09:01.2248310Z setp.gt.s32 %p143, %r623, 1; 2026-02-21T09:09:01.2248515Z selp.b32 %r811, 0, %r623, %p143; 2026-02-21T09:09:01.2248682Z selp.b32 %r624, 1, 0, %p143; 2026-02-21T09:09:01.2248845Z xor.b32 %r46, %r812, %r624; 2026-02-21T09:09:01.2248994Z $L__tmp7: 2026-02-21T09:09:01.2249229Z .loc 1 58 32 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:32 2026-02-21T09:09:01.2249511Z add.s64 %rd149, %rd279, %rd16; 2026-02-21T09:09:01.2249682Z add.s64 %rd140, %rd149, 192; 2026-02-21T09:09:01.2249876Z add.s64 %rd141, %rd149, 131264; 2026-02-21T09:09:01.2250040Z add.s64 %rd142, %rd149, 262336; 2026-02-21T09:09:01.2250211Z add.s64 %rd143, %rd279, %rd15; 2026-02-21T09:09:01.2250399Z add.s64 %rd144, %rd149, 524480; 2026-02-21T09:09:01.2250571Z add.s64 %rd145, %rd149, 
655552; 2026-02-21T09:09:01.2250735Z add.s64 %rd146, %rd149, 786624; 2026-02-21T09:09:01.2251015Z .loc 1 58 80 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:80 2026-02-21T09:09:01.2251314Z add.s64 %rd147, %rd279, %rd13; 2026-02-21T09:09:01.2251507Z mad.lo.s32 %r602, %r6, -48, %r40; 2026-02-21T09:09:01.2251719Z selp.b32 %r603, 16, 0, %p140; 2026-02-21T09:09:01.2251881Z // begin inline asm 2026-02-21T09:09:01.2252092Z cp.async.cg.shared.global [ %r602 + 0 ], [ %rd140 + 0 ], 0x10, %r603; 2026-02-21T09:09:01.2252319Z // end inline asm 2026-02-21T09:09:01.2252463Z add.s32 %r604, %r602, 4096; 2026-02-21T09:09:01.2252653Z // begin inline asm 2026-02-21T09:09:01.2252870Z cp.async.cg.shared.global [ %r604 + 0 ], [ %rd141 + 0 ], 0x10, %r603; 2026-02-21T09:09:01.2253091Z // end inline asm 2026-02-21T09:09:01.2253236Z add.s32 %r606, %r602, 8192; 2026-02-21T09:09:01.2253397Z // begin inline asm 2026-02-21T09:09:01.2253591Z cp.async.cg.shared.global [ %r606 + 0 ], [ %rd142 + 0 ], 0x10, %r603; 2026-02-21T09:09:01.2253822Z // end inline asm 2026-02-21T09:09:01.2253959Z add.s32 %r608, %r602, 12288; 2026-02-21T09:09:01.2254119Z // begin inline asm 2026-02-21T09:09:01.2254313Z cp.async.cg.shared.global [ %r608 + 0 ], [ %rd143 + 0 ], 0x10, %r603; 2026-02-21T09:09:01.2254539Z // end inline asm 2026-02-21T09:09:01.2254675Z add.s32 %r610, %r602, 16384; 2026-02-21T09:09:01.2254836Z // begin inline asm 2026-02-21T09:09:01.2255029Z cp.async.cg.shared.global [ %r610 + 0 ], [ %rd144 + 0 ], 0x10, %r603; 2026-02-21T09:09:01.2255260Z // end inline asm 2026-02-21T09:09:01.2255403Z add.s32 %r612, %r602, 20480; 2026-02-21T09:09:01.2255555Z // begin inline asm 2026-02-21T09:09:01.2255759Z cp.async.cg.shared.global [ %r612 + 0 ], [ %rd145 + 0 ], 0x10, %r603; 2026-02-21T09:09:01.2255977Z // end inline asm 2026-02-21T09:09:01.2256117Z add.s32 %r614, %r602, 24576; 2026-02-21T09:09:01.2256270Z // begin inline asm 2026-02-21T09:09:01.2256464Z cp.async.cg.shared.global [ %r614 + 0 ], [ 
%rd146 + 0 ], 0x10, %r603; 2026-02-21T09:09:01.2256680Z // end inline asm 2026-02-21T09:09:01.2256821Z add.s32 %r616, %r602, 28672; 2026-02-21T09:09:01.2256975Z // begin inline asm 2026-02-21T09:09:01.2257166Z cp.async.cg.shared.global [ %r616 + 0 ], [ %rd147 + 0 ], 0x10, %r603; 2026-02-21T09:09:01.2257391Z // end inline asm 2026-02-21T09:09:01.2257530Z cp.async.commit_group; 2026-02-21T09:09:01.2257791Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2258075Z and.pred %p138, %p147, %p140; 2026-02-21T09:09:01.2258241Z // begin inline asm 2026-02-21T09:09:01.2258435Z @%p138 mbarrier.arrive.expect_tx.shared.b64 _, [%r618], 512; 2026-02-21T09:09:01.2258674Z // end inline asm 2026-02-21T09:09:01.2258934Z .loc 1 64 33 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:64:33 2026-02-21T09:09:01.2259224Z bar.sync 0; 2026-02-21T09:09:01.2259378Z elect.sync %r625|%p144, -1; 2026-02-21T09:09:01.2259550Z and.pred %p145, %p140, %p144; 2026-02-21T09:09:01.2259730Z and.pred %p139, %p1, %p145; 2026-02-21T09:09:01.2259896Z cvt.u32.u64 %r626, %rd280; 2026-02-21T09:09:01.2260069Z add.s32 %r621, %r626, 48; 2026-02-21T09:09:01.2260272Z // begin inline asm 2026-02-21T09:09:01.2260624Z @%p139 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r619], [%rd115, {%r701, %r621}], [%r618]; 2026-02-21T09:09:01.2261008Z // end inline asm 2026-02-21T09:09:01.2261261Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2261592Z add.s64 %rd279, %rd279, 64; 2026-02-21T09:09:01.2261810Z setp.lt.u64 %p146, %rd280, 480; 2026-02-21T09:09:01.2261993Z add.s64 %rd280, %rd280, 16; 2026-02-21T09:09:01.2262152Z mov.b32 %r807, %r812; 2026-02-21T09:09:01.2262310Z mov.b32 %r812, %r46; 2026-02-21T09:09:01.2262501Z @%p146 bra $L__BB0_4; 2026-02-21T09:09:01.2262657Z bra.uni $L__BB0_7; 2026-02-21T09:09:01.2262865Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 
2026-02-21T09:09:01.2263095Z add.s32 %r504, %r810, 1; 2026-02-21T09:09:01.2263272Z setp.gt.s32 %p104, %r504, 1; 2026-02-21T09:09:01.2263453Z selp.b32 %r810, 0, %r504, %p104; 2026-02-21T09:09:01.2263748Z .loc 1 58 80 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:58:80 2026-02-21T09:09:01.2264048Z cp.async.wait_group 1; 2026-02-21T09:09:01.2264218Z bar.sync 0; 2026-02-21T09:09:01.2264372Z shl.b32 %r505, %r810, 15; 2026-02-21T09:09:01.2264539Z add.s32 %r507, %r47, %r505; 2026-02-21T09:09:01.2264716Z add.s32 %r40, %r507, %r16; 2026-02-21T09:09:01.2264967Z ld.shared.v4.b32 {%r508, %r509, %r510, %r511}, [%r40]; 2026-02-21T09:09:01.2265196Z mov.b32 {%rs85, %rs86}, %r511; 2026-02-21T09:09:01.2265368Z mov.b32 {%rs87, %rs88}, %r510; 2026-02-21T09:09:01.2265539Z mov.b32 {%rs89, %rs90}, %r509; 2026-02-21T09:09:01.2265702Z mov.b32 {%rs91, %rs92}, %r508; 2026-02-21T09:09:01.2265911Z ld.shared.v4.b32 {%r512, %r513, %r514, %r515}, [%r40+16]; 2026-02-21T09:09:01.2266123Z mov.b32 {%rs93, %rs94}, %r515; 2026-02-21T09:09:01.2266293Z mov.b32 {%rs95, %rs96}, %r514; 2026-02-21T09:09:01.2266462Z mov.b32 {%rs97, %rs98}, %r513; 2026-02-21T09:09:01.2266626Z mov.b32 {%rs99, %rs100}, %r512; 2026-02-21T09:09:01.2266839Z ld.shared.v4.b32 {%r516, %r517, %r518, %r519}, [%r40+32]; 2026-02-21T09:09:01.2267058Z mov.b32 {%rs101, %rs102}, %r519; 2026-02-21T09:09:01.2267234Z mov.b32 {%rs103, %rs104}, %r518; 2026-02-21T09:09:01.2267395Z mov.b32 {%rs105, %rs106}, %r517; 2026-02-21T09:09:01.2267559Z mov.b32 {%rs107, %rs108}, %r516; 2026-02-21T09:09:01.2267747Z ld.shared.v4.b32 {%r520, %r521, %r522, %r523}, [%r40+48]; 2026-02-21T09:09:01.2267953Z mov.b32 {%rs109, %rs110}, %r523; 2026-02-21T09:09:01.2268115Z mov.b32 {%rs111, %rs112}, %r522; 2026-02-21T09:09:01.2268273Z mov.b32 {%rs113, %rs114}, %r521; 2026-02-21T09:09:01.2268438Z mov.b32 {%rs115, %rs116}, %r520; 2026-02-21T09:09:01.2268633Z ld.shared.v4.b32 {%r524, %r525, %r526, %r527}, [%r40+16384]; 2026-02-21T09:09:01.2268844Z mov.b32 {%rs117, 
%rs118}, %r527; 2026-02-21T09:09:01.2269000Z mov.b32 {%rs119, %rs120}, %r526; 2026-02-21T09:09:01.2269168Z mov.b32 {%rs121, %rs122}, %r525; 2026-02-21T09:09:01.2269324Z mov.b32 {%rs123, %rs124}, %r524; 2026-02-21T09:09:01.2269527Z ld.shared.v4.b32 {%r528, %r529, %r530, %r531}, [%r40+16400]; 2026-02-21T09:09:01.2269741Z mov.b32 {%rs125, %rs126}, %r531; 2026-02-21T09:09:01.2269901Z mov.b32 {%rs127, %rs128}, %r530; 2026-02-21T09:09:01.2270065Z mov.b32 {%rs129, %rs130}, %r529; 2026-02-21T09:09:01.2270222Z mov.b32 {%rs131, %rs132}, %r528; 2026-02-21T09:09:01.2270422Z ld.shared.v4.b32 {%r532, %r533, %r534, %r535}, [%r40+16416]; 2026-02-21T09:09:01.2270625Z mov.b32 {%rs133, %rs134}, %r535; 2026-02-21T09:09:01.2270796Z mov.b32 {%rs135, %rs136}, %r534; 2026-02-21T09:09:01.2270955Z mov.b32 {%rs137, %rs138}, %r533; 2026-02-21T09:09:01.2271122Z mov.b32 {%rs139, %rs140}, %r532; 2026-02-21T09:09:01.2271318Z ld.shared.v4.b32 {%r536, %r537, %r538, %r539}, [%r40+16432]; 2026-02-21T09:09:01.2271519Z mov.b32 {%rs141, %rs142}, %r539; 2026-02-21T09:09:01.2271710Z mov.b32 {%rs143, %rs144}, %r538; 2026-02-21T09:09:01.2271898Z mov.b32 {%rs145, %rs146}, %r537; 2026-02-21T09:09:01.2272060Z mov.b32 {%rs147, %rs148}, %r536; 2026-02-21T09:09:01.2272327Z .loc 1 62 32 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:62:32 2026-02-21T09:09:01.2272617Z cvt.f32.bf16 %r435, %rs91; 2026-02-21T09:09:01.2272781Z cvt.f32.bf16 %r436, %rs92; 2026-02-21T09:09:01.2272936Z cvt.f32.bf16 %r437, %rs89; 2026-02-21T09:09:01.2273099Z cvt.f32.bf16 %r438, %rs90; 2026-02-21T09:09:01.2273282Z cvt.f32.bf16 %r439, %rs87; 2026-02-21T09:09:01.2273441Z cvt.f32.bf16 %r440, %rs88; 2026-02-21T09:09:01.2273590Z cvt.f32.bf16 %r441, %rs85; 2026-02-21T09:09:01.2273773Z cvt.f32.bf16 %r442, %rs86; 2026-02-21T09:09:01.2273923Z cvt.f32.bf16 %r443, %rs99; 2026-02-21T09:09:01.2274085Z cvt.f32.bf16 %r444, %rs100; 2026-02-21T09:09:01.2274241Z cvt.f32.bf16 %r445, %rs97; 2026-02-21T09:09:01.2274399Z cvt.f32.bf16 %r446, %rs98; 
2026-02-21T09:09:01.2274558Z cvt.f32.bf16 %r447, %rs95; 2026-02-21T09:09:01.2274710Z cvt.f32.bf16 %r448, %rs96; 2026-02-21T09:09:01.2274867Z cvt.f32.bf16 %r449, %rs93; 2026-02-21T09:09:01.2275017Z cvt.f32.bf16 %r450, %rs94; 2026-02-21T09:09:01.2275173Z cvt.f32.bf16 %r452, %rs107; 2026-02-21T09:09:01.2275326Z cvt.f32.bf16 %r453, %rs108; 2026-02-21T09:09:01.2275482Z cvt.f32.bf16 %r454, %rs105; 2026-02-21T09:09:01.2275631Z cvt.f32.bf16 %r455, %rs106; 2026-02-21T09:09:01.2275817Z cvt.f32.bf16 %r456, %rs103; 2026-02-21T09:09:01.2275972Z cvt.f32.bf16 %r457, %rs104; 2026-02-21T09:09:01.2276131Z cvt.f32.bf16 %r458, %rs101; 2026-02-21T09:09:01.2276286Z cvt.f32.bf16 %r459, %rs102; 2026-02-21T09:09:01.2276435Z cvt.f32.bf16 %r460, %rs115; 2026-02-21T09:09:01.2276590Z cvt.f32.bf16 %r461, %rs116; 2026-02-21T09:09:01.2276739Z cvt.f32.bf16 %r462, %rs113; 2026-02-21T09:09:01.2276896Z cvt.f32.bf16 %r463, %rs114; 2026-02-21T09:09:01.2277045Z cvt.f32.bf16 %r464, %rs111; 2026-02-21T09:09:01.2277201Z cvt.f32.bf16 %r465, %rs112; 2026-02-21T09:09:01.2277352Z cvt.f32.bf16 %r466, %rs109; 2026-02-21T09:09:01.2277507Z cvt.f32.bf16 %r467, %rs110; 2026-02-21T09:09:01.2277654Z cvt.f32.bf16 %r469, %rs123; 2026-02-21T09:09:01.2277812Z cvt.f32.bf16 %r470, %rs124; 2026-02-21T09:09:01.2277967Z cvt.f32.bf16 %r471, %rs121; 2026-02-21T09:09:01.2278118Z cvt.f32.bf16 %r472, %rs122; 2026-02-21T09:09:01.2278275Z cvt.f32.bf16 %r473, %rs119; 2026-02-21T09:09:01.2278425Z cvt.f32.bf16 %r474, %rs120; 2026-02-21T09:09:01.2278583Z cvt.f32.bf16 %r475, %rs117; 2026-02-21T09:09:01.2278735Z cvt.f32.bf16 %r476, %rs118; 2026-02-21T09:09:01.2278896Z cvt.f32.bf16 %r477, %rs131; 2026-02-21T09:09:01.2279049Z cvt.f32.bf16 %r478, %rs132; 2026-02-21T09:09:01.2279213Z cvt.f32.bf16 %r479, %rs129; 2026-02-21T09:09:01.2279370Z cvt.f32.bf16 %r480, %rs130; 2026-02-21T09:09:01.2279520Z cvt.f32.bf16 %r481, %rs127; 2026-02-21T09:09:01.2279677Z cvt.f32.bf16 %r482, %rs128; 2026-02-21T09:09:01.2279827Z cvt.f32.bf16 %r483, %rs125; 
2026-02-21T09:09:01.2279983Z cvt.f32.bf16 %r484, %rs126; 2026-02-21T09:09:01.2280133Z cvt.f32.bf16 %r486, %rs139; 2026-02-21T09:09:01.2280288Z cvt.f32.bf16 %r487, %rs140; 2026-02-21T09:09:01.2280437Z cvt.f32.bf16 %r488, %rs137; 2026-02-21T09:09:01.2280594Z cvt.f32.bf16 %r489, %rs138; 2026-02-21T09:09:01.2280743Z cvt.f32.bf16 %r490, %rs135; 2026-02-21T09:09:01.2280901Z cvt.f32.bf16 %r491, %rs136; 2026-02-21T09:09:01.2281057Z cvt.f32.bf16 %r492, %rs133; 2026-02-21T09:09:01.2281209Z cvt.f32.bf16 %r493, %rs134; 2026-02-21T09:09:01.2281369Z cvt.f32.bf16 %r494, %rs147; 2026-02-21T09:09:01.2281518Z cvt.f32.bf16 %r495, %rs148; 2026-02-21T09:09:01.2281706Z cvt.f32.bf16 %r496, %rs145; 2026-02-21T09:09:01.2281858Z cvt.f32.bf16 %r497, %rs146; 2026-02-21T09:09:01.2282015Z cvt.f32.bf16 %r498, %rs143; 2026-02-21T09:09:01.2282166Z cvt.f32.bf16 %r499, %rs144; 2026-02-21T09:09:01.2282326Z cvt.f32.bf16 %r500, %rs141; 2026-02-21T09:09:01.2282477Z cvt.f32.bf16 %r501, %rs142; 2026-02-21T09:09:01.2282628Z $L__tmp8: 2026-02-21T09:09:01.2282920Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2283283Z // begin inline asm 2026-02-21T09:09:01.2283424Z 2026-02-21T09:09:01.2283537Z { 2026-02-21T09:09:01.2283665Z .reg .pred complete; 2026-02-21T09:09:01.2283810Z waitLoop: 2026-02-21T09:09:01.2284003Z mbarrier.try_wait.parity.shared.b64 complete, [%r808], %r807; 2026-02-21T09:09:01.2284236Z @!complete bra.uni waitLoop; 2026-02-21T09:09:01.2284424Z } 2026-02-21T09:09:01.2284490Z 2026-02-21T09:09:01.2284553Z // end inline asm 2026-02-21T09:09:01.2284686Z $L__tmp9: 2026-02-21T09:09:01.2284957Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2285243Z selp.b32 %r540, 1, 0, %p104; 2026-02-21T09:09:01.2285409Z xor.b32 %r809, %r809, %r540; 2026-02-21T09:09:01.2285566Z mov.pred %p105, -1; 2026-02-21T09:09:01.2285715Z $L__tmp10: 2026-02-21T09:09:01.2285994Z .loc 2 291 36 // 
standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2286333Z // begin inline asm 2026-02-21T09:09:01.2286709Z @%p105 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r186 + 0], {%r435, %r436, %r437, %r438, %r439, %r440, %r441, %r442, %r443, %r444, %r445, %r446, %r447, %r448, %r449, %r450}; 2026-02-21T09:09:01.2287092Z // end inline asm 2026-02-21T09:09:01.2287237Z // begin inline asm 2026-02-21T09:09:01.2287621Z @%p105 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r186 + 16], {%r452, %r453, %r454, %r455, %r456, %r457, %r458, %r459, %r460, %r461, %r462, %r463, %r464, %r465, %r466, %r467}; 2026-02-21T09:09:01.2288014Z // end inline asm 2026-02-21T09:09:01.2288160Z // begin inline asm 2026-02-21T09:09:01.2288508Z @%p105 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r186 + 64], {%r469, %r470, %r471, %r472, %r473, %r474, %r475, %r476, %r477, %r478, %r479, %r480, %r481, %r482, %r483, %r484}; 2026-02-21T09:09:01.2288900Z // end inline asm 2026-02-21T09:09:01.2289038Z // begin inline asm 2026-02-21T09:09:01.2289397Z @%p105 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r186 + 80], {%r486, %r487, %r488, %r489, %r490, %r491, %r492, %r493, %r494, %r495, %r496, %r497, %r498, %r499, %r500, %r501}; 2026-02-21T09:09:01.2289777Z // end inline asm 2026-02-21T09:09:01.2289919Z // begin inline asm 2026-02-21T09:09:01.2290078Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:01.2290241Z // end inline asm 2026-02-21T09:09:01.2290381Z bar.sync 0; 2026-02-21T09:09:01.2290509Z $L__tmp11: 2026-02-21T09:09:01.2290753Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2291036Z shl.b32 %r541, %r810, 3; 2026-02-21T09:09:01.2291197Z add.s32 %r542, %r47, %r541; 2026-02-21T09:09:01.2291354Z add.s32 %r618, %r542, 103440; 2026-02-21T09:09:01.2291516Z // begin inline asm 2026-02-21T09:09:01.2291692Z 2026-02-21T09:09:01.2291813Z { 2026-02-21T09:09:01.2291944Z .reg .pred complete; 2026-02-21T09:09:01.2292088Z waitLoop: 
2026-02-21T09:09:01.2292278Z mbarrier.try_wait.parity.shared.b64 complete, [%r618], %r809; 2026-02-21T09:09:01.2292507Z @!complete bra.uni waitLoop; 2026-02-21T09:09:01.2292665Z } 2026-02-21T09:09:01.2292730Z 2026-02-21T09:09:01.2292785Z // end inline asm 2026-02-21T09:09:01.2293036Z .loc 1 64 33 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:64:33 2026-02-21T09:09:01.2293318Z shl.b32 %r543, %r810, 9; 2026-02-21T09:09:01.2293484Z add.s32 %r544, %r47, %r543; 2026-02-21T09:09:01.2293649Z add.s32 %r619, %r544, 102400; 2026-02-21T09:09:01.2293804Z add.s32 %r545, %r619, %r17; 2026-02-21T09:09:01.2293969Z ld.shared.b8 %rs149, [%r545]; 2026-02-21T09:09:01.2294135Z ld.shared.b8 %rs150, [%r545+256]; 2026-02-21T09:09:01.2294313Z add.s32 %r546, %r619, %r27; 2026-02-21T09:09:01.2294468Z ld.shared.b8 %rs151, [%r546+128]; 2026-02-21T09:09:01.2294643Z ld.shared.b8 %rs152, [%r546+384]; 2026-02-21T09:09:01.2294912Z .loc 1 67 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:67:28 2026-02-21T09:09:01.2295223Z shl.b16 %rs153, %rs149, 4; 2026-02-21T09:09:01.2295387Z shl.b16 %rs154, %rs151, 4; 2026-02-21T09:09:01.2295540Z shl.b16 %rs155, %rs150, 4; 2026-02-21T09:09:01.2295698Z shl.b16 %rs156, %rs152, 4; 2026-02-21T09:09:01.2295953Z .loc 1 82 58 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:82:58 2026-02-21T09:09:01.2296248Z selp.b16 %rs157, %rs153, %rs149, %p56; 2026-02-21T09:09:01.2296456Z cvt.s16.s8 %rs158, %rs157; 2026-02-21T09:09:01.2296525Z shr.s16 %rs159, %rs158, 4; 2026-02-21T09:09:01.2296597Z selp.b16 %rs160, %rs154, %rs151, %p56; 2026-02-21T09:09:01.2296686Z cvt.s16.s8 %rs161, %rs160; 2026-02-21T09:09:01.2296747Z shr.s16 %rs162, %rs161, 4; 2026-02-21T09:09:01.2296821Z selp.b16 %rs163, %rs155, %rs150, %p56; 2026-02-21T09:09:01.2296880Z cvt.s16.s8 %rs164, %rs163; 2026-02-21T09:09:01.2296938Z shr.s16 %rs165, %rs164, 4; 2026-02-21T09:09:01.2297011Z selp.b16 %rs166, %rs156, %rs152, %p56; 2026-02-21T09:09:01.2297071Z cvt.s16.s8 %rs167, %rs166; 
2026-02-21T09:09:01.2297128Z shr.s16 %rs168, %rs167, 4; 2026-02-21T09:09:01.2297299Z .loc 1 87 32 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:87:32 2026-02-21T09:09:01.2297360Z cvt.rn.f32.s16 %r547, %rs159; 2026-02-21T09:09:01.2297419Z cvt.rn.f32.s16 %r548, %rs162; 2026-02-21T09:09:01.2297503Z cvt.rn.f32.s16 %r549, %rs165; 2026-02-21T09:09:01.2297572Z cvt.rn.f32.s16 %r550, %rs168; 2026-02-21T09:09:01.2297636Z st.shared.b32 [%r18], %r547; 2026-02-21T09:09:01.2297698Z st.shared.b32 [%r19], %r548; 2026-02-21T09:09:01.2297765Z st.shared.b32 [%r20], %r549; 2026-02-21T09:09:01.2297828Z st.shared.b32 [%r21], %r550; 2026-02-21T09:09:01.2297996Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2298064Z shl.b32 %r551, %r811, 3; 2026-02-21T09:09:01.2298126Z add.s32 %r552, %r47, %r551; 2026-02-21T09:09:01.2298187Z add.s32 %r808, %r552, 103424; 2026-02-21T09:09:01.2298241Z $L__tmp12: 2026-02-21T09:09:01.2298464Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2298524Z // begin inline asm 2026-02-21T09:09:01.2298597Z fence.proxy.async.shared::cta; 2026-02-21T09:09:01.2298660Z // end inline asm 2026-02-21T09:09:01.2298715Z bar.sync 0; 2026-02-21T09:09:01.2298774Z @%p60 bra $L__BB0_6; 2026-02-21T09:09:01.2298876Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:01.2298951Z elect.sync %r601|%p106, -1; 2026-02-21T09:09:01.2299011Z mov.b32 %r555, 134744336; 2026-02-21T09:09:01.2299068Z // begin inline asm 2026-02-21T09:09:01.2299233Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 0 ], [ %r351 + 0 ], %rd90, %r555, %p105; 2026-02-21T09:09:01.2299289Z // end inline asm 2026-02-21T09:09:01.2299346Z // begin inline asm 2026-02-21T09:09:01.2299505Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 0 ], [ %r351 + 8 ], %rd91, %r555, %p105; 2026-02-21T09:09:01.2299562Z // end inline asm 2026-02-21T09:09:01.2299620Z // begin inline asm 
2026-02-21T09:09:01.2299771Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 0 ], [ %r351 + 16 ], %rd92, %r555, %p105; 2026-02-21T09:09:01.2299835Z // end inline asm 2026-02-21T09:09:01.2299890Z // begin inline asm 2026-02-21T09:09:01.2300037Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 0 ], [ %r351 + 24 ], %rd93, %r555, %p105; 2026-02-21T09:09:01.2300099Z // end inline asm 2026-02-21T09:09:01.2300156Z // begin inline asm 2026-02-21T09:09:01.2300305Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 32 ], [ %r351 + 32 ], %rd90, %r555, %p105; 2026-02-21T09:09:01.2300367Z // end inline asm 2026-02-21T09:09:01.2300422Z // begin inline asm 2026-02-21T09:09:01.2300568Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 32 ], [ %r351 + 40 ], %rd91, %r555, %p105; 2026-02-21T09:09:01.2300629Z // end inline asm 2026-02-21T09:09:01.2300706Z // begin inline asm 2026-02-21T09:09:01.2300851Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 32 ], [ %r351 + 48 ], %rd92, %r555, %p105; 2026-02-21T09:09:01.2300908Z // end inline asm 2026-02-21T09:09:01.2300976Z // begin inline asm 2026-02-21T09:09:01.2301126Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 32 ], [ %r351 + 56 ], %rd93, %r555, %p105; 2026-02-21T09:09:01.2301182Z // end inline asm 2026-02-21T09:09:01.2301272Z // begin inline asm 2026-02-21T09:09:01.2301423Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 64 ], [ %r351 + 64 ], %rd90, %r555, %p105; 2026-02-21T09:09:01.2301480Z // end inline asm 2026-02-21T09:09:01.2301619Z // begin inline asm 2026-02-21T09:09:01.2301771Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 64 ], [ %r351 + 72 ], %rd91, %r555, %p105; 2026-02-21T09:09:01.2301828Z // end inline asm 2026-02-21T09:09:01.2301894Z // begin inline asm 2026-02-21T09:09:01.2302043Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 64 ], [ %r351 + 80 ], %rd92, %r555, %p105; 2026-02-21T09:09:01.2302102Z // end inline asm 2026-02-21T09:09:01.2302160Z // begin inline asm 
2026-02-21T09:09:01.2302317Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 64 ], [ %r351 + 88 ], %rd93, %r555, %p105; 2026-02-21T09:09:01.2302375Z // end inline asm 2026-02-21T09:09:01.2302433Z // begin inline asm 2026-02-21T09:09:01.2302614Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 96 ], [ %r351 + 96 ], %rd90, %r555, %p105; 2026-02-21T09:09:01.2302672Z // end inline asm 2026-02-21T09:09:01.2302730Z // begin inline asm 2026-02-21T09:09:01.2302898Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 96 ], [ %r351 + 104 ], %rd91, %r555, %p105; 2026-02-21T09:09:01.2302954Z // end inline asm 2026-02-21T09:09:01.2303013Z // begin inline asm 2026-02-21T09:09:01.2303179Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 96 ], [ %r351 + 112 ], %rd92, %r555, %p105; 2026-02-21T09:09:01.2303237Z // end inline asm 2026-02-21T09:09:01.2303297Z // begin inline asm 2026-02-21T09:09:01.2303453Z @%p106 tcgen05.mma.cta_group::1.kind::tf32 [ %r806 + 96 ], [ %r351 + 120 ], %rd93, %r555, %p105; 2026-02-21T09:09:01.2303521Z // end inline asm 2026-02-21T09:09:01.2303586Z cvt.u64.u32 %rd139, %r808; 2026-02-21T09:09:01.2303644Z // begin inline asm 2026-02-21T09:09:01.2303786Z @%p106 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd139]; 2026-02-21T09:09:01.2303845Z // end inline asm 2026-02-21T09:09:01.2303906Z bra.uni $L__BB0_6; 2026-02-21T09:09:01.2304015Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:09:01.2304110Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:01.2304178Z setp.lt.u32 %p152, %r1, 64; 2026-02-21T09:09:01.2304234Z mov.b32 %r628, 1; 2026-02-21T09:09:01.2304463Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2304522Z // begin inline asm 2026-02-21T09:09:01.2304575Z 2026-02-21T09:09:01.2304635Z { 2026-02-21T09:09:01.2304698Z .reg .pred complete; 2026-02-21T09:09:01.2304754Z waitLoop: 2026-02-21T09:09:01.2304873Z mbarrier.try_wait.parity.shared.b64 complete, 
[%r808], %r628; 2026-02-21T09:09:01.2304947Z @!complete bra.uni waitLoop; 2026-02-21T09:09:01.2304999Z } 2026-02-21T09:09:01.2305003Z 2026-02-21T09:09:01.2305061Z // end inline asm 2026-02-21T09:09:01.2305142Z $L__tmp13: 2026-02-21T09:09:01.2305316Z .loc 1 51 70 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:51:70 2026-02-21T09:09:01.2305384Z cp.async.wait_group 0; 2026-02-21T09:09:01.2305447Z bar.sync 0; 2026-02-21T09:09:01.2305504Z // begin inline asm 2026-02-21T09:09:01.2305596Z @%p147 mbarrier.inval.shared::cta.b64 [%r417]; 2026-02-21T09:09:01.2305652Z // end inline asm 2026-02-21T09:09:01.2305715Z bar.sync 0; 2026-02-21T09:09:01.2305773Z // begin inline asm 2026-02-21T09:09:01.2305861Z @%p147 mbarrier.inval.shared::cta.b64 [%r143]; 2026-02-21T09:09:01.2305953Z // end inline asm 2026-02-21T09:09:01.2306017Z add.s32 %r631, %r47, 103424; 2026-02-21T09:09:01.2306076Z // begin inline asm 2026-02-21T09:09:01.2306160Z @%p147 mbarrier.inval.shared::cta.b64 [%r631]; 2026-02-21T09:09:01.2306226Z // end inline asm 2026-02-21T09:09:01.2306284Z bar.sync 0; 2026-02-21T09:09:01.2306344Z // begin inline asm 2026-02-21T09:09:01.2306435Z @%p147 mbarrier.inval.shared::cta.b64 [%r141]; 2026-02-21T09:09:01.2306496Z // end inline asm 2026-02-21T09:09:01.2306582Z $L__tmp14: 2026-02-21T09:09:01.2306817Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2306900Z // begin inline asm 2026-02-21T09:09:01.2307187Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r633, %r634, %r635, %r636, %r637, %r638, %r639, %r640, %r641, %r642, %r643, %r644, %r645, %r646, %r647, %r648}, [%r700 + 0]; 2026-02-21T09:09:01.2307245Z // end inline asm 2026-02-21T09:09:01.2307313Z // begin inline asm 2026-02-21T09:09:01.2307592Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r650, %r651, %r652, %r653, %r654, %r655, %r656, %r657, %r658, %r659, %r660, %r661, %r662, %r663, %r664, %r665}, [%r700 + 16]; 2026-02-21T09:09:01.2307650Z // end inline 
asm 2026-02-21T09:09:01.2307718Z // begin inline asm 2026-02-21T09:09:01.2308010Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r667, %r668, %r669, %r670, %r671, %r672, %r673, %r674, %r675, %r676, %r677, %r678, %r679, %r680, %r681, %r682}, [%r700 + 64]; 2026-02-21T09:09:01.2308070Z // end inline asm 2026-02-21T09:09:01.2308135Z // begin inline asm 2026-02-21T09:09:01.2308410Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r684, %r685, %r686, %r687, %r688, %r689, %r690, %r691, %r692, %r693, %r694, %r695, %r696, %r697, %r698, %r699}, [%r700 + 80]; 2026-02-21T09:09:01.2308467Z // end inline asm 2026-02-21T09:09:01.2308534Z // begin inline asm 2026-02-21T09:09:01.2308606Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:01.2308675Z // end inline asm 2026-02-21T09:09:01.2308737Z cvt.u64.u32 %rd151, %r633; 2026-02-21T09:09:01.2308807Z cvt.u64.u32 %rd152, %r634; 2026-02-21T09:09:01.2308867Z shl.b64 %rd153, %rd152, 32; 2026-02-21T09:09:01.2308929Z or.b64 %rd154, %rd151, %rd153; 2026-02-21T09:09:01.2308989Z $L__tmp15: 2026-02-21T09:09:01.2309156Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2309220Z mov.b64 {%r705, %r706}, %rd154; 2026-02-21T09:09:01.2309291Z cvt.rn.bf16x2.f32 %r707, %r706, %r705; 2026-02-21T09:09:01.2309353Z $L__tmp16: 2026-02-21T09:09:01.2309567Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2309630Z cvt.u64.u32 %rd155, %r635; 2026-02-21T09:09:01.2309698Z cvt.u64.u32 %rd156, %r636; 2026-02-21T09:09:01.2309758Z shl.b64 %rd157, %rd156, 32; 2026-02-21T09:09:01.2309818Z or.b64 %rd158, %rd155, %rd157; 2026-02-21T09:09:01.2309877Z $L__tmp17: 2026-02-21T09:09:01.2310043Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2310106Z mov.b64 {%r708, %r709}, %rd158; 2026-02-21T09:09:01.2310174Z cvt.rn.bf16x2.f32 %r710, %r709, %r708; 2026-02-21T09:09:01.2310234Z $L__tmp18: 2026-02-21T09:09:01.2310446Z .loc 
2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2310509Z cvt.u64.u32 %rd159, %r637; 2026-02-21T09:09:01.2310576Z cvt.u64.u32 %rd160, %r638; 2026-02-21T09:09:01.2310635Z shl.b64 %rd161, %rd160, 32; 2026-02-21T09:09:01.2310695Z or.b64 %rd162, %rd159, %rd161; 2026-02-21T09:09:01.2310747Z $L__tmp19: 2026-02-21T09:09:01.2310914Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2310975Z mov.b64 {%r711, %r712}, %rd162; 2026-02-21T09:09:01.2311042Z cvt.rn.bf16x2.f32 %r713, %r712, %r711; 2026-02-21T09:09:01.2311101Z $L__tmp20: 2026-02-21T09:09:01.2311337Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2311397Z cvt.u64.u32 %rd163, %r639; 2026-02-21T09:09:01.2311462Z cvt.u64.u32 %rd164, %r640; 2026-02-21T09:09:01.2311522Z shl.b64 %rd165, %rd164, 32; 2026-02-21T09:09:01.2311613Z or.b64 %rd166, %rd163, %rd165; 2026-02-21T09:09:01.2311667Z $L__tmp21: 2026-02-21T09:09:01.2311843Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2311930Z mov.b64 {%r714, %r715}, %rd166; 2026-02-21T09:09:01.2311995Z cvt.rn.bf16x2.f32 %r716, %r715, %r714; 2026-02-21T09:09:01.2312052Z $L__tmp22: 2026-02-21T09:09:01.2312287Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2312346Z cvt.u64.u32 %rd167, %r641; 2026-02-21T09:09:01.2312410Z cvt.u64.u32 %rd168, %r642; 2026-02-21T09:09:01.2312469Z shl.b64 %rd169, %rd168, 32; 2026-02-21T09:09:01.2312528Z or.b64 %rd170, %rd167, %rd169; 2026-02-21T09:09:01.2312579Z $L__tmp23: 2026-02-21T09:09:01.2312748Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2312807Z mov.b64 {%r717, %r718}, %rd170; 2026-02-21T09:09:01.2312872Z cvt.rn.bf16x2.f32 %r719, %r718, %r717; 2026-02-21T09:09:01.2312931Z $L__tmp24: 
2026-02-21T09:09:01.2313167Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2313228Z cvt.u64.u32 %rd171, %r643; 2026-02-21T09:09:01.2313292Z cvt.u64.u32 %rd172, %r644; 2026-02-21T09:09:01.2313352Z shl.b64 %rd173, %rd172, 32; 2026-02-21T09:09:01.2313411Z or.b64 %rd174, %rd171, %rd173; 2026-02-21T09:09:01.2313462Z $L__tmp25: 2026-02-21T09:09:01.2313629Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2313690Z mov.b64 {%r720, %r721}, %rd174; 2026-02-21T09:09:01.2313756Z cvt.rn.bf16x2.f32 %r722, %r721, %r720; 2026-02-21T09:09:01.2313814Z $L__tmp26: 2026-02-21T09:09:01.2314022Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2314081Z cvt.u64.u32 %rd175, %r645; 2026-02-21T09:09:01.2314147Z cvt.u64.u32 %rd176, %r646; 2026-02-21T09:09:01.2314207Z shl.b64 %rd177, %rd176, 32; 2026-02-21T09:09:01.2314267Z or.b64 %rd178, %rd175, %rd177; 2026-02-21T09:09:01.2314318Z $L__tmp27: 2026-02-21T09:09:01.2314489Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2314550Z mov.b64 {%r723, %r724}, %rd178; 2026-02-21T09:09:01.2314616Z cvt.rn.bf16x2.f32 %r725, %r724, %r723; 2026-02-21T09:09:01.2314675Z $L__tmp28: 2026-02-21T09:09:01.2314879Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2314940Z cvt.u64.u32 %rd179, %r647; 2026-02-21T09:09:01.2314997Z cvt.u64.u32 %rd180, %r648; 2026-02-21T09:09:01.2315063Z shl.b64 %rd181, %rd180, 32; 2026-02-21T09:09:01.2315121Z or.b64 %rd182, %rd179, %rd181; 2026-02-21T09:09:01.2315173Z $L__tmp29: 2026-02-21T09:09:01.2315341Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2315403Z mov.b64 {%r726, %r727}, %rd182; 2026-02-21T09:09:01.2315472Z cvt.rn.bf16x2.f32 %r728, %r727, %r726; 
2026-02-21T09:09:01.2315531Z $L__tmp30: 2026-02-21T09:09:01.2315739Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2315799Z cvt.u64.u32 %rd183, %r650; 2026-02-21T09:09:01.2315858Z cvt.u64.u32 %rd184, %r651; 2026-02-21T09:09:01.2315927Z shl.b64 %rd185, %rd184, 32; 2026-02-21T09:09:01.2315988Z or.b64 %rd186, %rd183, %rd185; 2026-02-21T09:09:01.2316042Z $L__tmp31: 2026-02-21T09:09:01.2316245Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2316306Z mov.b64 {%r729, %r730}, %rd186; 2026-02-21T09:09:01.2316372Z cvt.rn.bf16x2.f32 %r731, %r730, %r729; 2026-02-21T09:09:01.2316432Z $L__tmp32: 2026-02-21T09:09:01.2316643Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2316722Z cvt.u64.u32 %rd187, %r652; 2026-02-21T09:09:01.2316780Z cvt.u64.u32 %rd188, %r653; 2026-02-21T09:09:01.2316847Z shl.b64 %rd189, %rd188, 32; 2026-02-21T09:09:01.2316907Z or.b64 %rd190, %rd187, %rd189; 2026-02-21T09:09:01.2316982Z $L__tmp33: 2026-02-21T09:09:01.2317155Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2317215Z mov.b64 {%r732, %r733}, %rd190; 2026-02-21T09:09:01.2317281Z cvt.rn.bf16x2.f32 %r734, %r733, %r732; 2026-02-21T09:09:01.2317335Z $L__tmp34: 2026-02-21T09:09:01.2317552Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2317611Z cvt.u64.u32 %rd191, %r654; 2026-02-21T09:09:01.2317668Z cvt.u64.u32 %rd192, %r655; 2026-02-21T09:09:01.2317736Z shl.b64 %rd193, %rd192, 32; 2026-02-21T09:09:01.2317795Z or.b64 %rd194, %rd191, %rd193; 2026-02-21T09:09:01.2317868Z $L__tmp35: 2026-02-21T09:09:01.2318043Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2318103Z mov.b64 {%r735, %r736}, %rd194; 2026-02-21T09:09:01.2318170Z 
cvt.rn.bf16x2.f32 %r737, %r736, %r735; 2026-02-21T09:09:01.2318223Z $L__tmp36: 2026-02-21T09:09:01.2318438Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2318497Z cvt.u64.u32 %rd195, %r656; 2026-02-21T09:09:01.2318555Z cvt.u64.u32 %rd196, %r657; 2026-02-21T09:09:01.2318624Z shl.b64 %rd197, %rd196, 32; 2026-02-21T09:09:01.2318682Z or.b64 %rd198, %rd195, %rd197; 2026-02-21T09:09:01.2318735Z $L__tmp37: 2026-02-21T09:09:01.2318904Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2318963Z mov.b64 {%r738, %r739}, %rd198; 2026-02-21T09:09:01.2319030Z cvt.rn.bf16x2.f32 %r740, %r739, %r738; 2026-02-21T09:09:01.2319085Z $L__tmp38: 2026-02-21T09:09:01.2319300Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2319358Z cvt.u64.u32 %rd199, %r658; 2026-02-21T09:09:01.2319416Z cvt.u64.u32 %rd200, %r659; 2026-02-21T09:09:01.2319484Z shl.b64 %rd201, %rd200, 32; 2026-02-21T09:09:01.2319543Z or.b64 %rd202, %rd199, %rd201; 2026-02-21T09:09:01.2319594Z $L__tmp39: 2026-02-21T09:09:01.2319765Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2319826Z mov.b64 {%r741, %r742}, %rd202; 2026-02-21T09:09:01.2319890Z cvt.rn.bf16x2.f32 %r743, %r742, %r741; 2026-02-21T09:09:01.2319942Z $L__tmp40: 2026-02-21T09:09:01.2320158Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2320216Z cvt.u64.u32 %rd203, %r660; 2026-02-21T09:09:01.2320276Z cvt.u64.u32 %rd204, %r661; 2026-02-21T09:09:01.2320343Z shl.b64 %rd205, %rd204, 32; 2026-02-21T09:09:01.2320400Z or.b64 %rd206, %rd203, %rd205; 2026-02-21T09:09:01.2320452Z $L__tmp41: 2026-02-21T09:09:01.2320617Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2320683Z mov.b64 {%r744, %r745}, 
%rd206; 2026-02-21T09:09:01.2320748Z cvt.rn.bf16x2.f32 %r746, %r745, %r744; 2026-02-21T09:09:01.2320799Z $L__tmp42: 2026-02-21T09:09:01.2321017Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2321102Z cvt.u64.u32 %rd207, %r662; 2026-02-21T09:09:01.2321159Z cvt.u64.u32 %rd208, %r663; 2026-02-21T09:09:01.2321223Z shl.b64 %rd209, %rd208, 32; 2026-02-21T09:09:01.2321281Z or.b64 %rd210, %rd207, %rd209; 2026-02-21T09:09:01.2321333Z $L__tmp43: 2026-02-21T09:09:01.2321496Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2321626Z mov.b64 {%r747, %r748}, %rd210; 2026-02-21T09:09:01.2321692Z cvt.rn.bf16x2.f32 %r749, %r748, %r747; 2026-02-21T09:09:01.2321744Z $L__tmp44: 2026-02-21T09:09:01.2321981Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2322039Z cvt.u64.u32 %rd211, %r664; 2026-02-21T09:09:01.2322097Z cvt.u64.u32 %rd212, %r665; 2026-02-21T09:09:01.2322162Z shl.b64 %rd213, %rd212, 32; 2026-02-21T09:09:01.2322220Z or.b64 %rd214, %rd211, %rd213; 2026-02-21T09:09:01.2322274Z $L__tmp45: 2026-02-21T09:09:01.2322443Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2322510Z mov.b64 {%r750, %r751}, %rd214; 2026-02-21T09:09:01.2322574Z cvt.rn.bf16x2.f32 %r752, %r751, %r750; 2026-02-21T09:09:01.2322628Z $L__tmp46: 2026-02-21T09:09:01.2322867Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2322930Z cvt.u64.u32 %rd215, %r667; 2026-02-21T09:09:01.2322988Z cvt.u64.u32 %rd216, %r668; 2026-02-21T09:09:01.2323056Z shl.b64 %rd217, %rd216, 32; 2026-02-21T09:09:01.2323116Z or.b64 %rd218, %rd215, %rd217; 2026-02-21T09:09:01.2323170Z $L__tmp47: 2026-02-21T09:09:01.2323338Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 
2026-02-21T09:09:01.2323408Z mov.b64 {%r753, %r754}, %rd218; 2026-02-21T09:09:01.2323475Z cvt.rn.bf16x2.f32 %r755, %r754, %r753; 2026-02-21T09:09:01.2323528Z $L__tmp48: 2026-02-21T09:09:01.2323744Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2323806Z cvt.u64.u32 %rd219, %r669; 2026-02-21T09:09:01.2323864Z cvt.u64.u32 %rd220, %r670; 2026-02-21T09:09:01.2323924Z shl.b64 %rd221, %rd220, 32; 2026-02-21T09:09:01.2323992Z or.b64 %rd222, %rd219, %rd221; 2026-02-21T09:09:01.2324044Z $L__tmp49: 2026-02-21T09:09:01.2324209Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2324278Z mov.b64 {%r756, %r757}, %rd222; 2026-02-21T09:09:01.2324344Z cvt.rn.bf16x2.f32 %r758, %r757, %r756; 2026-02-21T09:09:01.2324396Z $L__tmp50: 2026-02-21T09:09:01.2324618Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2324677Z cvt.u64.u32 %rd223, %r671; 2026-02-21T09:09:01.2324736Z cvt.u64.u32 %rd224, %r672; 2026-02-21T09:09:01.2324796Z shl.b64 %rd225, %rd224, 32; 2026-02-21T09:09:01.2324871Z or.b64 %rd226, %rd223, %rd225; 2026-02-21T09:09:01.2324923Z $L__tmp51: 2026-02-21T09:09:01.2325084Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2325150Z mov.b64 {%r759, %r760}, %rd226; 2026-02-21T09:09:01.2325217Z cvt.rn.bf16x2.f32 %r761, %r760, %r759; 2026-02-21T09:09:01.2325270Z $L__tmp52: 2026-02-21T09:09:01.2325490Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2325550Z cvt.u64.u32 %rd227, %r673; 2026-02-21T09:09:01.2325608Z cvt.u64.u32 %rd228, %r674; 2026-02-21T09:09:01.2325666Z shl.b64 %rd229, %rd228, 32; 2026-02-21T09:09:01.2325734Z or.b64 %rd230, %rd227, %rd229; 2026-02-21T09:09:01.2325785Z $L__tmp53: 2026-02-21T09:09:01.2325951Z .loc 1 97 28 // 
cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2326043Z mov.b64 {%r762, %r763}, %rd230; 2026-02-21T09:09:01.2326108Z cvt.rn.bf16x2.f32 %r764, %r763, %r762; 2026-02-21T09:09:01.2326158Z $L__tmp54: 2026-02-21T09:09:01.2326372Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2326431Z cvt.u64.u32 %rd231, %r675; 2026-02-21T09:09:01.2326491Z cvt.u64.u32 %rd232, %r676; 2026-02-21T09:09:01.2326579Z shl.b64 %rd233, %rd232, 32; 2026-02-21T09:09:01.2326648Z or.b64 %rd234, %rd231, %rd233; 2026-02-21T09:09:01.2326699Z $L__tmp55: 2026-02-21T09:09:01.2326892Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2326958Z mov.b64 {%r765, %r766}, %rd234; 2026-02-21T09:09:01.2327023Z cvt.rn.bf16x2.f32 %r767, %r766, %r765; 2026-02-21T09:09:01.2327075Z $L__tmp56: 2026-02-21T09:09:01.2327288Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2327354Z cvt.u64.u32 %rd235, %r677; 2026-02-21T09:09:01.2327411Z cvt.u64.u32 %rd236, %r678; 2026-02-21T09:09:01.2327469Z shl.b64 %rd237, %rd236, 32; 2026-02-21T09:09:01.2327537Z or.b64 %rd238, %rd235, %rd237; 2026-02-21T09:09:01.2327589Z $L__tmp57: 2026-02-21T09:09:01.2327777Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2327846Z mov.b64 {%r768, %r769}, %rd238; 2026-02-21T09:09:01.2327910Z cvt.rn.bf16x2.f32 %r770, %r769, %r768; 2026-02-21T09:09:01.2327962Z $L__tmp58: 2026-02-21T09:09:01.2328170Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2328237Z cvt.u64.u32 %rd239, %r679; 2026-02-21T09:09:01.2328293Z cvt.u64.u32 %rd240, %r680; 2026-02-21T09:09:01.2328351Z shl.b64 %rd241, %rd240, 32; 2026-02-21T09:09:01.2328417Z or.b64 %rd242, %rd239, %rd241; 2026-02-21T09:09:01.2328468Z $L__tmp59: 
2026-02-21T09:09:01.2328633Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2328699Z mov.b64 {%r771, %r772}, %rd242; 2026-02-21T09:09:01.2328764Z cvt.rn.bf16x2.f32 %r773, %r772, %r771; 2026-02-21T09:09:01.2328815Z $L__tmp60: 2026-02-21T09:09:01.2329026Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2329092Z cvt.u64.u32 %rd243, %r681; 2026-02-21T09:09:01.2329150Z cvt.u64.u32 %rd244, %r682; 2026-02-21T09:09:01.2329208Z shl.b64 %rd245, %rd244, 32; 2026-02-21T09:09:01.2329273Z or.b64 %rd246, %rd243, %rd245; 2026-02-21T09:09:01.2329324Z $L__tmp61: 2026-02-21T09:09:01.2329487Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2329554Z mov.b64 {%r774, %r775}, %rd246; 2026-02-21T09:09:01.2329620Z cvt.rn.bf16x2.f32 %r776, %r775, %r774; 2026-02-21T09:09:01.2329671Z $L__tmp62: 2026-02-21T09:09:01.2329875Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2329941Z cvt.u64.u32 %rd247, %r684; 2026-02-21T09:09:01.2329998Z cvt.u64.u32 %rd248, %r685; 2026-02-21T09:09:01.2330056Z shl.b64 %rd249, %rd248, 32; 2026-02-21T09:09:01.2330122Z or.b64 %rd250, %rd247, %rd249; 2026-02-21T09:09:01.2330174Z $L__tmp63: 2026-02-21T09:09:01.2330336Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2330397Z mov.b64 {%r777, %r778}, %rd250; 2026-02-21T09:09:01.2330467Z cvt.rn.bf16x2.f32 %r779, %r778, %r777; 2026-02-21T09:09:01.2330518Z $L__tmp64: 2026-02-21T09:09:01.2330723Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2330815Z cvt.u64.u32 %rd251, %r686; 2026-02-21T09:09:01.2330872Z cvt.u64.u32 %rd252, %r687; 2026-02-21T09:09:01.2330931Z shl.b64 %rd253, %rd252, 32; 2026-02-21T09:09:01.2330995Z or.b64 %rd254, %rd251, %rd253; 
2026-02-21T09:09:01.2331046Z $L__tmp65: 2026-02-21T09:09:01.2331211Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2331271Z mov.b64 {%r780, %r781}, %rd254; 2026-02-21T09:09:01.2331371Z cvt.rn.bf16x2.f32 %r782, %r781, %r780; 2026-02-21T09:09:01.2331422Z $L__tmp66: 2026-02-21T09:09:01.2331696Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2331767Z cvt.u64.u32 %rd255, %r688; 2026-02-21T09:09:01.2331825Z cvt.u64.u32 %rd256, %r689; 2026-02-21T09:09:01.2331883Z shl.b64 %rd257, %rd256, 32; 2026-02-21T09:09:01.2331948Z or.b64 %rd258, %rd255, %rd257; 2026-02-21T09:09:01.2332000Z $L__tmp67: 2026-02-21T09:09:01.2332164Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2332223Z mov.b64 {%r783, %r784}, %rd258; 2026-02-21T09:09:01.2332296Z cvt.rn.bf16x2.f32 %r785, %r784, %r783; 2026-02-21T09:09:01.2332348Z $L__tmp68: 2026-02-21T09:09:01.2332560Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2332655Z cvt.u64.u32 %rd259, %r690; 2026-02-21T09:09:01.2332716Z cvt.u64.u32 %rd260, %r691; 2026-02-21T09:09:01.2332775Z shl.b64 %rd261, %rd260, 32; 2026-02-21T09:09:01.2332842Z or.b64 %rd262, %rd259, %rd261; 2026-02-21T09:09:01.2332894Z $L__tmp69: 2026-02-21T09:09:01.2333063Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2333124Z mov.b64 {%r786, %r787}, %rd262; 2026-02-21T09:09:01.2333200Z cvt.rn.bf16x2.f32 %r788, %r787, %r786; 2026-02-21T09:09:01.2333253Z $L__tmp70: 2026-02-21T09:09:01.2333466Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2333536Z cvt.u64.u32 %rd263, %r692; 2026-02-21T09:09:01.2333597Z cvt.u64.u32 %rd264, %r693; 2026-02-21T09:09:01.2333659Z shl.b64 %rd265, %rd264, 32; 2026-02-21T09:09:01.2333721Z 
or.b64 %rd266, %rd263, %rd265; 2026-02-21T09:09:01.2333784Z $L__tmp71: 2026-02-21T09:09:01.2333951Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2334012Z mov.b64 {%r789, %r790}, %rd266; 2026-02-21T09:09:01.2334084Z cvt.rn.bf16x2.f32 %r791, %r790, %r789; 2026-02-21T09:09:01.2334138Z $L__tmp72: 2026-02-21T09:09:01.2334345Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2334412Z cvt.u64.u32 %rd267, %r694; 2026-02-21T09:09:01.2334469Z cvt.u64.u32 %rd268, %r695; 2026-02-21T09:09:01.2334528Z shl.b64 %rd269, %rd268, 32; 2026-02-21T09:09:01.2334586Z or.b64 %rd270, %rd267, %rd269; 2026-02-21T09:09:01.2334646Z $L__tmp73: 2026-02-21T09:09:01.2334812Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2334872Z mov.b64 {%r792, %r793}, %rd270; 2026-02-21T09:09:01.2334946Z cvt.rn.bf16x2.f32 %r794, %r793, %r792; 2026-02-21T09:09:01.2334998Z $L__tmp74: 2026-02-21T09:09:01.2335207Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2335275Z cvt.u64.u32 %rd271, %r696; 2026-02-21T09:09:01.2335333Z cvt.u64.u32 %rd272, %r697; 2026-02-21T09:09:01.2335392Z shl.b64 %rd273, %rd272, 32; 2026-02-21T09:09:01.2335451Z or.b64 %rd274, %rd271, %rd273; 2026-02-21T09:09:01.2335509Z $L__tmp75: 2026-02-21T09:09:01.2335676Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2335764Z mov.b64 {%r795, %r796}, %rd274; 2026-02-21T09:09:01.2335838Z cvt.rn.bf16x2.f32 %r797, %r796, %r795; 2026-02-21T09:09:01.2335890Z $L__tmp76: 2026-02-21T09:09:01.2336096Z .loc 2 291 36 // standard.py:291:36 @[ cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:94:40 ] 2026-02-21T09:09:01.2336163Z cvt.u64.u32 %rd275, %r698; 2026-02-21T09:09:01.2336221Z cvt.u64.u32 %rd276, %r699; 2026-02-21T09:09:01.2336281Z shl.b64 %rd277, %rd276, 
32; 2026-02-21T09:09:01.2336364Z or.b64 %rd278, %rd275, %rd277; 2026-02-21T09:09:01.2336424Z $L__tmp77: 2026-02-21T09:09:01.2336609Z .loc 1 97 28 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:97:28 2026-02-21T09:09:01.2336669Z mov.b64 {%r798, %r799}, %rd278; 2026-02-21T09:09:01.2336742Z cvt.rn.bf16x2.f32 %r800, %r799, %r798; 2026-02-21T09:09:01.2336904Z .loc 1 98 43 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:98:43 2026-02-21T09:09:01.2336976Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:01.2337032Z bar.sync 0; 2026-02-21T09:09:01.2337133Z st.shared.v4.b32 [%r23], {%r707, %r710, %r713, %r716}; 2026-02-21T09:09:01.2337233Z st.shared.v4.b32 [%r23+16384], {%r755, %r758, %r761, %r764}; 2026-02-21T09:09:01.2337322Z st.shared.v4.b32 [%r24], {%r719, %r722, %r725, %r728}; 2026-02-21T09:09:01.2337426Z st.shared.v4.b32 [%r24+16384], {%r767, %r770, %r773, %r776}; 2026-02-21T09:09:01.2337532Z st.shared.v4.b32 [%r25], {%r731, %r734, %r737, %r740}; 2026-02-21T09:09:01.2337626Z st.shared.v4.b32 [%r25+16384], {%r779, %r782, %r785, %r788}; 2026-02-21T09:09:01.2337717Z st.shared.v4.b32 [%r26], {%r743, %r746, %r749, %r752}; 2026-02-21T09:09:01.2337811Z st.shared.v4.b32 [%r26+16384], {%r791, %r794, %r797, %r800}; 2026-02-21T09:09:01.2337868Z // begin inline asm 2026-02-21T09:09:01.2337948Z fence.proxy.async.shared::cta; 2026-02-21T09:09:01.2338004Z // end inline asm 2026-02-21T09:09:01.2338059Z bar.sync 0; 2026-02-21T09:09:01.2338125Z elect.sync %r801|%p153, -1; 2026-02-21T09:09:01.2338198Z and.pred %p151, %p152, %p153; 2026-02-21T09:09:01.2338257Z and.b32 %r802, %r30, 1; 2026-02-21T09:09:01.2338315Z shl.b32 %r803, %r802, 14; 2026-02-21T09:09:01.2338379Z add.s32 %r804, %r47, %r803; 2026-02-21T09:09:01.2338436Z add.s32 %r703, %r804, 65536; 2026-02-21T09:09:01.2338496Z shl.b32 %r805, %r802, 8; 2026-02-21T09:09:01.2338553Z or.b32 %r702, %r805, %r29; 2026-02-21T09:09:01.2338617Z // begin inline asm 2026-02-21T09:09:01.2338798Z @%p151 
cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd150, {%r701, %r702}], [%r703]; 2026-02-21T09:09:01.2338853Z // end inline asm 2026-02-21T09:09:01.2338926Z cp.async.bulk.commit_group; 2026-02-21T09:09:01.2339010Z $L__BB0_8: // %._crit_edge 2026-02-21T09:09:01.2339174Z .loc 1 31 88 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:31:88 2026-02-21T09:09:01.2339252Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:01.2339307Z bar.sync 0; 2026-02-21T09:09:01.2339470Z .loc 1 31 4 // cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py:31:4 2026-02-21T09:09:01.2339524Z bar.sync 0; 2026-02-21T09:09:01.2339589Z // begin inline asm 2026-02-21T09:09:01.2339706Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r806, 256; 2026-02-21T09:09:01.2339759Z // end inline asm 2026-02-21T09:09:01.2339818Z ret; 2026-02-21T09:09:01.2339870Z $L__tmp78: 2026-02-21T09:09:01.2339928Z $L__func_end0: 2026-02-21T09:09:01.2340010Z // -- End function 2026-02-21T09:09:01.2340068Z } 2026-02-21T09:09:01.2340275Z .file 1 "/tmp/torchinductor_root/bg/cbgnw6kaq4nhzamthzglcqyoxls3u25255jvbaq2tqp24otoptla.py" 2026-02-21T09:09:01.2340445Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:01.2340514Z .section .debug_abbrev 2026-02-21T09:09:01.2340565Z { 2026-02-21T09:09:01.2340652Z .b8 1 // Abbreviation Code 2026-02-21T09:09:01.2340782Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:01.2340864Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:01.2340946Z .b8 37 // DW_AT_producer 2026-02-21T09:09:01.2341025Z .b8 8 // DW_FORM_string 2026-02-21T09:09:01.2341108Z .b8 19 // DW_AT_language 2026-02-21T09:09:01.2341208Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:01.2341283Z .b8 3 // DW_AT_name 2026-02-21T09:09:01.2341365Z .b8 8 // DW_FORM_string 2026-02-21T09:09:01.2341463Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:01.2341568Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:01.2341655Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:01.2341728Z .b8 8 // 
DW_FORM_string 2026-02-21T09:09:01.2341802Z .b8 0 // EOM(1) 2026-02-21T09:09:01.2341870Z .b8 0 // EOM(2) 2026-02-21T09:09:01.2341963Z .b8 2 // Abbreviation Code 2026-02-21T09:09:01.2342047Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:01.2342147Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:01.2342233Z .b8 3 // DW_AT_name 2026-02-21T09:09:01.2342305Z .b8 8 // DW_FORM_string 2026-02-21T09:09:01.2342381Z .b8 32 // DW_AT_inline 2026-02-21T09:09:01.2342464Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:01.2342532Z .b8 0 // EOM(1) 2026-02-21T09:09:01.2342598Z .b8 0 // EOM(2) 2026-02-21T09:09:01.2342687Z .b8 3 // Abbreviation Code 2026-02-21T09:09:01.2342771Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:01.2342852Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:01.2342936Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:01.2343025Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:01.2343106Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:01.2343183Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:01.2343280Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:01.2343355Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:01.2343427Z .b8 0 // EOM(1) 2026-02-21T09:09:01.2343503Z .b8 0 // EOM(2) 2026-02-21T09:09:01.2343585Z .b8 4 // Abbreviation Code 2026-02-21T09:09:01.2343681Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:01.2343760Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:01.2343853Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:01.2343929Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:01.2344004Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:01.2344087Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:01.2344169Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:01.2344244Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:01.2344333Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:01.2344411Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:01.2344490Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:01.2344567Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:01.2344682Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:01.2344760Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:01.2344832Z .b8 0 
// EOM(1) 2026-02-21T09:09:01.2344910Z .b8 0 // EOM(2) 2026-02-21T09:09:01.2344979Z .b8 0 // EOM(3) 2026-02-21T09:09:01.2345035Z } 2026-02-21T09:09:01.2345108Z .section .debug_info 2026-02-21T09:09:01.2345189Z { 2026-02-21T09:09:01.2345278Z .b32 178 // Length of Unit 2026-02-21T09:09:01.2345365Z .b8 2 // DWARF version number 2026-02-21T09:09:01.2345427Z .b8 0 2026-02-21T09:09:01.2345574Z .b32 .debug_abbrev // Offset Into Abbrev. Section 2026-02-21T09:09:01.2345666Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:01.2345778Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:01.2345864Z .b8 116 // DW_AT_producer 2026-02-21T09:09:01.2345921Z .b8 114 2026-02-21T09:09:01.2345982Z .b8 105 2026-02-21T09:09:01.2346034Z .b8 116 2026-02-21T09:09:01.2346087Z .b8 111 2026-02-21T09:09:01.2346140Z .b8 110 2026-02-21T09:09:01.2346200Z .b8 0 2026-02-21T09:09:01.2346277Z .b8 2 // DW_AT_language 2026-02-21T09:09:01.2346330Z .b8 0 2026-02-21T09:09:01.2346436Z .b8 99 // DW_AT_name 2026-02-21T09:09:01.2346493Z .b8 98 2026-02-21T09:09:01.2346546Z .b8 103 2026-02-21T09:09:01.2346598Z .b8 110 2026-02-21T09:09:01.2346658Z .b8 119 2026-02-21T09:09:01.2346712Z .b8 54 2026-02-21T09:09:01.2346766Z .b8 107 2026-02-21T09:09:01.2346819Z .b8 97 2026-02-21T09:09:01.2346880Z .b8 113 2026-02-21T09:09:01.2346932Z .b8 52 2026-02-21T09:09:01.2346985Z .b8 110 2026-02-21T09:09:01.2347043Z .b8 104 2026-02-21T09:09:01.2347096Z .b8 122 2026-02-21T09:09:01.2347147Z .b8 97 2026-02-21T09:09:01.2347199Z .b8 109 2026-02-21T09:09:01.2347260Z .b8 116 2026-02-21T09:09:01.2347313Z .b8 104 2026-02-21T09:09:01.2347365Z .b8 122 2026-02-21T09:09:01.2347424Z .b8 103 2026-02-21T09:09:01.2347477Z .b8 108 2026-02-21T09:09:01.2347528Z .b8 99 2026-02-21T09:09:01.2347581Z .b8 113 2026-02-21T09:09:01.2347641Z .b8 121 2026-02-21T09:09:01.2347693Z .b8 111 2026-02-21T09:09:01.2347746Z .b8 120 2026-02-21T09:09:01.2347804Z .b8 108 2026-02-21T09:09:01.2347856Z .b8 115 2026-02-21T09:09:01.2347910Z .b8 51 
2026-02-21T09:09:01.2347964Z .b8 117 2026-02-21T09:09:01.2348024Z .b8 50 2026-02-21T09:09:01.2348077Z .b8 53 2026-02-21T09:09:01.2348129Z .b8 50 2026-02-21T09:09:01.2348182Z .b8 53 2026-02-21T09:09:01.2348243Z .b8 53 2026-02-21T09:09:01.2348297Z .b8 106 2026-02-21T09:09:01.2348350Z .b8 118 2026-02-21T09:09:01.2348411Z .b8 98 2026-02-21T09:09:01.2348464Z .b8 97 2026-02-21T09:09:01.2348516Z .b8 113 2026-02-21T09:09:01.2348569Z .b8 50 2026-02-21T09:09:01.2348632Z .b8 116 2026-02-21T09:09:01.2348686Z .b8 113 2026-02-21T09:09:01.2348742Z .b8 112 2026-02-21T09:09:01.2348807Z .b8 50 2026-02-21T09:09:01.2348863Z .b8 52 2026-02-21T09:09:01.2348917Z .b8 111 2026-02-21T09:09:01.2348970Z .b8 116 2026-02-21T09:09:01.2349030Z .b8 111 2026-02-21T09:09:01.2349084Z .b8 112 2026-02-21T09:09:01.2349135Z .b8 116 2026-02-21T09:09:01.2349186Z .b8 108 2026-02-21T09:09:01.2349247Z .b8 97 2026-02-21T09:09:01.2349298Z .b8 46 2026-02-21T09:09:01.2349351Z .b8 112 2026-02-21T09:09:01.2349410Z .b8 121 2026-02-21T09:09:01.2349463Z .b8 0 2026-02-21T09:09:01.2349556Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:01.2349633Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:01.2349693Z .b8 116 2026-02-21T09:09:01.2349745Z .b8 109 2026-02-21T09:09:01.2349797Z .b8 112 2026-02-21T09:09:01.2349858Z .b8 47 2026-02-21T09:09:01.2349911Z .b8 116 2026-02-21T09:09:01.2349963Z .b8 111 2026-02-21T09:09:01.2350014Z .b8 114 2026-02-21T09:09:01.2350075Z .b8 99 2026-02-21T09:09:01.2350126Z .b8 104 2026-02-21T09:09:01.2350179Z .b8 105 2026-02-21T09:09:01.2350262Z .b8 110 2026-02-21T09:09:01.2350314Z .b8 100 2026-02-21T09:09:01.2350365Z .b8 117 2026-02-21T09:09:01.2350417Z .b8 99 2026-02-21T09:09:01.2350478Z .b8 116 2026-02-21T09:09:01.2350530Z .b8 111 2026-02-21T09:09:01.2350583Z .b8 114 2026-02-21T09:09:01.2350642Z .b8 95 2026-02-21T09:09:01.2350694Z .b8 114 2026-02-21T09:09:01.2350747Z .b8 111 2026-02-21T09:09:01.2350798Z .b8 111 2026-02-21T09:09:01.2350858Z .b8 116 2026-02-21T09:09:01.2350922Z .b8 47 
2026-02-21T09:09:01.2350994Z .b8 98 2026-02-21T09:09:01.2351046Z .b8 103 2026-02-21T09:09:01.2351104Z .b8 0 2026-02-21T09:09:01.2351203Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:01.2351296Z .b8 95 // DW_AT_name 2026-02-21T09:09:01.2351354Z .b8 104 2026-02-21T09:09:01.2351403Z .b8 101 2026-02-21T09:09:01.2351454Z .b8 108 2026-02-21T09:09:01.2351503Z .b8 105 2026-02-21T09:09:01.2351590Z .b8 111 2026-02-21T09:09:01.2351643Z .b8 110 2026-02-21T09:09:01.2351693Z .b8 95 2026-02-21T09:09:01.2351749Z .b8 109 2026-02-21T09:09:01.2351799Z .b8 97 2026-02-21T09:09:01.2351849Z .b8 116 2026-02-21T09:09:01.2351898Z .b8 109 2026-02-21T09:09:01.2351957Z .b8 117 2026-02-21T09:09:01.2352006Z .b8 108 2026-02-21T09:09:01.2352055Z .b8 95 2026-02-21T09:09:01.2352112Z .b8 98 2026-02-21T09:09:01.2352162Z .b8 102 2026-02-21T09:09:01.2352212Z .b8 49 2026-02-21T09:09:01.2352262Z .b8 54 2026-02-21T09:09:01.2352319Z .b8 95 2026-02-21T09:09:01.2352394Z .b8 105 2026-02-21T09:09:01.2352445Z .b8 110 2026-02-21T09:09:01.2352505Z .b8 116 2026-02-21T09:09:01.2352556Z .b8 52 2026-02-21T09:09:01.2352606Z .b8 0 2026-02-21T09:09:01.2352679Z .b8 1 // DW_AT_inline 2026-02-21T09:09:01.2352785Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:01.2352872Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:01.2352958Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:01.2353056Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:01.2353167Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:01.2353255Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:01.2353343Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:01.2353427Z .b64 $L__tmp77 // DW_AT_high_pc 2026-02-21T09:09:01.2353505Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:01.2353580Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:01.2353666Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:01.2353750Z .b8 0 // End Of Children Mark 2026-02-21T09:09:01.2353830Z .b8 0 // End Of Children Mark 
2026-02-21T09:09:01.2353888Z } 2026-02-21T09:09:01.2353954Z .section .debug_macinfo { } 2026-02-21T09:09:01.2353959Z 2026-02-21T09:09:01.2354037Z ================================================================ 2026-02-21T09:09:01.2354147Z please share the reproducer above with Triton project. 2026-02-21T09:09:01.8999835Z [129s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:09:01.9000173Z 2026-02-21T09:09:01.9000177Z 2026-02-21T09:09:01.9000180Z 2026-02-21T09:09:01.9000280Z ================================================================ 2026-02-21T09:09:01.9000533Z Internal Triton PTX codegen error 2026-02-21T09:09:01.9000726Z `ptxas` stderr: 2026-02-21T09:09:01.9001294Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 301 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:01.9001937Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:01.9002118Z 2026-02-21T09:09:01.9002594Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpirdpca_z.ptx -o /tmp/tmpirdpca_z.ptx.o 2026-02-21T09:09:01.9003242Z 2026-02-21T09:09:01.9003247Z 2026-02-21T09:09:01.9003315Z // 2026-02-21T09:09:01.9003469Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:01.9003667Z // 2026-02-21T09:09:01.9003740Z 2026-02-21T09:09:01.9003804Z .version 8.7 2026-02-21T09:09:01.9003974Z .target sm_100a 2026-02-21T09:09:01.9004113Z .address_size 64 2026-02-21T09:09:01.9004211Z 2026-02-21T09:09:01.9004407Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:01.9004697Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:01.9004947Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:01.9005172Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:01.9005414Z .param .u64 .ptr .global .align 1 
_helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:01.9005700Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:01.9005969Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:01.9006245Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:01.9006521Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:01.9006737Z ) 2026-02-21T09:09:01.9006862Z .reqntid 256 2026-02-21T09:09:01.9006988Z .maxnreg 32 2026-02-21T09:09:01.9007115Z { 2026-02-21T09:09:01.9007299Z .reg .pred %p<116>; 2026-02-21T09:09:01.9007467Z .reg .b16 %rs<145>; 2026-02-21T09:09:01.9007607Z .reg .b32 %r<468>; 2026-02-21T09:09:01.9007754Z .reg .b64 %rd<155>; 2026-02-21T09:09:01.9007891Z $L__func_begin0: 2026-02-21T09:09:01.9007981Z 2026-02-21T09:09:01.9008038Z // %bb.0: 2026-02-21T09:09:01.9008281Z .loc 1 19 0 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:19 2026-02-21T09:09:01.9008565Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:01.9008763Z ld.param.b64 %rd24, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:01.9008978Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:01.9009184Z ld.param.b64 %rd42, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:01.9009398Z mov.b32 %r42, global_smem; 2026-02-21T09:09:01.9009564Z // begin inline asm 2026-02-21T09:09:01.9009795Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r42], 128; 2026-02-21T09:09:01.9010041Z // end inline asm 2026-02-21T09:09:01.9010224Z ld.param.b64 %rd59, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:01.9010427Z bar.sync 0; 2026-02-21T09:09:01.9010579Z ld.shared.b32 %r461, [global_smem]; 2026-02-21T09:09:01.9010747Z bar.sync 0; 2026-02-21T09:09:01.9010883Z // begin inline asm 2026-02-21T09:09:01.9011085Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:01.9011317Z // end inline asm 2026-02-21T09:09:01.9011617Z .loc 1 21 66 // 
cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:21:66 2026-02-21T09:09:01.9011921Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:09:01.9012091Z mov.u32 %r59, %ctaid.y; 2026-02-21T09:09:01.9012244Z mov.u32 %r60, %ctaid.z; 2026-02-21T09:09:01.9012402Z mov.u32 %r61, %nctaid.x; 2026-02-21T09:09:01.9012557Z mov.u32 %r62, %nctaid.y; 2026-02-21T09:09:01.9012731Z mad.lo.s32 %r63, %r60, %r62, %r59; 2026-02-21T09:09:01.9012911Z mad.lo.s32 %r64, %r63, %r61, %r3; 2026-02-21T09:09:01.9013097Z shl.b32 %r65, %r64, 8; 2026-02-21T09:09:01.9013256Z cvt.s64.s32 %rd60, %r65; 2026-02-21T09:09:01.9013423Z add.s64 %rd38, %rd59, %rd60; 2026-02-21T09:09:01.9013589Z shl.b32 %r66, %r1, 2; 2026-02-21T09:09:01.9013738Z add.s32 %r43, %r42, %r66; 2026-02-21T09:09:01.9013895Z mov.b32 %r52, 0; 2026-02-21T09:09:01.9014033Z // begin inline asm 2026-02-21T09:09:01.9014194Z @%p1 st.shared.b32 [ %r43 + 0 ], %r52; 2026-02-21T09:09:01.9014366Z // end inline asm 2026-02-21T09:09:01.9014517Z bar.warp.sync -1; 2026-02-21T09:09:01.9014666Z setp.eq.b32 %p108, %r1, 0; 2026-02-21T09:09:01.9014833Z cvt.u64.u32 %rd23, %r42; 2026-02-21T09:09:01.9015020Z // begin inline asm 2026-02-21T09:09:01.9015280Z @%p108 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd23 + 0 ], %rd24; 2026-02-21T09:09:01.9015569Z // end inline asm 2026-02-21T09:09:01.9015703Z // begin inline asm 2026-02-21T09:09:01.9015934Z @%p108 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1; 2026-02-21T09:09:01.9016185Z // end inline asm 2026-02-21T09:09:01.9016326Z mov.b32 %r45, 32; 2026-02-21T09:09:01.9016491Z // begin inline asm 2026-02-21T09:09:01.9016736Z @%p108 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0, %r45; 2026-02-21T09:09:01.9017006Z // end inline asm 2026-02-21T09:09:01.9017170Z // begin inline asm 2026-02-21T09:09:01.9017410Z @%p108 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1, %r45; 2026-02-21T09:09:01.9017671Z // end inline asm 
2026-02-21T09:09:01.9017812Z mov.b32 %r47, 8192; 2026-02-21T09:09:01.9017953Z // begin inline asm 2026-02-21T09:09:01.9018196Z @%p108 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0, %r47; 2026-02-21T09:09:01.9018464Z // end inline asm 2026-02-21T09:09:01.9018599Z mov.b32 %r48, 512; 2026-02-21T09:09:01.9018740Z // begin inline asm 2026-02-21T09:09:01.9018973Z @%p108 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1, %r48; 2026-02-21T09:09:01.9019245Z // end inline asm 2026-02-21T09:09:01.9019378Z mov.b64 %rd31, 8192; 2026-02-21T09:09:01.9019556Z // begin inline asm 2026-02-21T09:09:01.9019808Z @%p108 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd23 + 0 ], 0x0, %rd31; 2026-02-21T09:09:01.9020096Z // end inline asm 2026-02-21T09:09:01.9020227Z mov.b32 %r49, 1; 2026-02-21T09:09:01.9020368Z // begin inline asm 2026-02-21T09:09:01.9020626Z @%p108 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0, %r49; 2026-02-21T09:09:01.9020907Z // end inline asm 2026-02-21T09:09:01.9021049Z // begin inline asm 2026-02-21T09:09:01.9021308Z @%p108 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1, %r49; 2026-02-21T09:09:01.9021643Z // end inline asm 2026-02-21T09:09:01.9021787Z // begin inline asm 2026-02-21T09:09:01.9022041Z @%p108 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0; 2026-02-21T09:09:01.9022321Z // end inline asm 2026-02-21T09:09:01.9022460Z // begin inline asm 2026-02-21T09:09:01.9023728Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 128, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 0], 
range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:01.9024875Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:01.9025119Z `ptxas` stderr: 2026-02-21T09:09:01.9025571Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 301 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:01.9026070Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:01.9026228Z 2026-02-21T09:09:01.9026616Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpirdpca_z.ptx -o /tmp/tmpirdpca_z.ptx.o 2026-02-21T09:09:01.9027053Z 2026-02-21T09:09:01.9027191Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:01.9027561Z @%p108 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0; 2026-02-21T09:09:01.9027864Z // end inline asm 2026-02-21T09:09:01.9028006Z // begin inline asm 2026-02-21T09:09:01.9028264Z @%p108 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1; 2026-02-21T09:09:01.9028582Z // end inline asm 2026-02-21T09:09:01.9028726Z // begin inline asm 2026-02-21T09:09:01.9028976Z @%p108 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0; 2026-02-21T09:09:01.9029246Z // end inline asm 2026-02-21T09:09:01.9029394Z // begin inline asm 2026-02-21T09:09:01.9029751Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd38 + 0 ], [ %rd23 + 0 ], 0x80; 2026-02-21T09:09:01.9030191Z // end inline asm 2026-02-21T09:09:01.9030344Z // begin inline asm 2026-02-21T09:09:01.9030561Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd38 + 0 ], 0x80; 2026-02-21T09:09:01.9030830Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:01.9031068Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:01.9031254Z // end 
inline asm 2026-02-21T09:09:01.9031387Z bar.sync 0; 2026-02-21T09:09:01.9031569Z cvta.global.u64 %rd98, %rd38; 2026-02-21T09:09:01.9031858Z .loc 1 23 67 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:23:67 2026-02-21T09:09:01.9032158Z add.s64 %rd56, %rd38, 128; 2026-02-21T09:09:01.9032320Z bar.sync 0; 2026-02-21T09:09:01.9032449Z // begin inline asm 2026-02-21T09:09:01.9032607Z @%p1 st.shared.b32 [ %r43 + 0 ], %r52; 2026-02-21T09:09:01.9032777Z // end inline asm 2026-02-21T09:09:01.9032921Z bar.warp.sync -1; 2026-02-21T09:09:01.9033062Z // begin inline asm 2026-02-21T09:09:01.9033348Z @%p108 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd23 + 0 ], %rd42; 2026-02-21T09:09:01.9033627Z // end inline asm 2026-02-21T09:09:01.9033768Z // begin inline asm 2026-02-21T09:09:01.9033998Z @%p108 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1; 2026-02-21T09:09:01.9034245Z // end inline asm 2026-02-21T09:09:01.9034384Z // begin inline asm 2026-02-21T09:09:01.9034621Z @%p108 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0, %r45; 2026-02-21T09:09:01.9034893Z // end inline asm 2026-02-21T09:09:01.9035027Z mov.b32 %r54, 128; 2026-02-21T09:09:01.9035175Z // begin inline asm 2026-02-21T09:09:01.9035410Z @%p108 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1, %r54; 2026-02-21T09:09:01.9035684Z // end inline asm 2026-02-21T09:09:01.9035826Z // begin inline asm 2026-02-21T09:09:01.9036069Z @%p108 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0, %r47; 2026-02-21T09:09:01.9036350Z // end inline asm 2026-02-21T09:09:01.9036487Z mov.b32 %r56, 4096; 2026-02-21T09:09:01.9036635Z // begin inline asm 2026-02-21T09:09:01.9036869Z @%p108 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1, %r56; 2026-02-21T09:09:01.9037145Z // end inline asm 2026-02-21T09:09:01.9037287Z mov.b64 %rd49, 16384; 2026-02-21T09:09:01.9037433Z // begin inline asm 
2026-02-21T09:09:01.9037692Z @%p108 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd23 + 0 ], 0x0, %rd49; 2026-02-21T09:09:01.9037972Z // end inline asm 2026-02-21T09:09:01.9038113Z // begin inline asm 2026-02-21T09:09:01.9038364Z @%p108 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0, %r49; 2026-02-21T09:09:01.9038655Z // end inline asm 2026-02-21T09:09:01.9038789Z // begin inline asm 2026-02-21T09:09:01.9039044Z @%p108 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1, %r49; 2026-02-21T09:09:01.9039332Z // end inline asm 2026-02-21T09:09:01.9039464Z // begin inline asm 2026-02-21T09:09:01.9039706Z @%p108 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd23 + 0 ], 0xa; 2026-02-21T09:09:01.9039964Z // end inline asm 2026-02-21T09:09:01.9040104Z // begin inline asm 2026-02-21T09:09:01.9040352Z @%p108 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0; 2026-02-21T09:09:01.9040636Z // end inline asm 2026-02-21T09:09:01.9040776Z // begin inline asm 2026-02-21T09:09:01.9041011Z @%p108 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x2; 2026-02-21T09:09:01.9041309Z // end inline asm 2026-02-21T09:09:01.9041447Z // begin inline asm 2026-02-21T09:09:01.9041701Z @%p108 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0; 2026-02-21T09:09:01.9041956Z // end inline asm 2026-02-21T09:09:01.9042093Z // begin inline asm 2026-02-21T09:09:01.9042434Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd56 + 0 ], [ %rd23 + 0 ], 0x80; 2026-02-21T09:09:01.9042802Z // end inline asm 2026-02-21T09:09:01.9042975Z // begin inline asm 2026-02-21T09:09:01.9043179Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd56 + 0 ], 0x80; 2026-02-21T09:09:01.9043438Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:01.9043661Z @%p1 cp.async.bulk.wait_group.read 0 ; 
2026-02-21T09:09:01.9043837Z // end inline asm 2026-02-21T09:09:01.9043975Z bar.sync 0; 2026-02-21T09:09:01.9044114Z cvta.global.u64 %rd120, %rd56; 2026-02-21T09:09:01.9044399Z .loc 1 31 88 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:31:88 2026-02-21T09:09:01.9044702Z setp.gt.u32 %p39, %r3, 8191; 2026-02-21T09:09:01.9044864Z @%p39 bra $L__BB0_8; 2026-02-21T09:09:01.9045034Z // %bb.1: // %.lr.ph 2026-02-21T09:09:01.9045333Z .loc 1 0 88 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:0:88 2026-02-21T09:09:01.9045669Z ld.param.b64 %rd22, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:01.9045903Z and.b32 %r4, %r1, 32; 2026-02-21T09:09:01.9046168Z .loc 1 81 38 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:81:38 2026-02-21T09:09:01.9046457Z setp.eq.b32 %p51, %r4, 0; 2026-02-21T09:09:01.9046726Z .loc 1 57 38 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:57:38 2026-02-21T09:09:01.9047013Z and.b32 %r150, %r1, 7; 2026-02-21T09:09:01.9047166Z shl.b32 %r151, %r150, 3; 2026-02-21T09:09:01.9047431Z .loc 1 44 45 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:44:45 2026-02-21T09:09:01.9047711Z shr.u32 %r152, %r1, 3; 2026-02-21T09:09:01.9047868Z bfe.u32 %r5, %r1, 3, 5; 2026-02-21T09:09:01.9048017Z shr.u32 %r153, %r1, 5; 2026-02-21T09:09:01.9048168Z shl.b32 %r154, %r1, 4; 2026-02-21T09:09:01.9048314Z and.b32 %r6, %r154, 4080; 2026-02-21T09:09:01.9048474Z add.s32 %r270, %r42, %r6; 2026-02-21T09:09:01.9048633Z add.s32 %r272, %r270, 4096; 2026-02-21T09:09:01.9048791Z add.s32 %r274, %r270, 8192; 2026-02-21T09:09:01.9048959Z or.b32 %r10, %r154, 12288; 2026-02-21T09:09:01.9049112Z add.s32 %r276, %r42, %r10; 2026-02-21T09:09:01.9049269Z add.s32 %r244, %r461, 32; 2026-02-21T09:09:01.9049416Z and.b32 %r156, %r1, 127; 2026-02-21T09:09:01.9049570Z shl.b32 %r157, %r156, 7; 2026-02-21T09:09:01.9049718Z and.b32 %r158, %r1, 128; 2026-02-21T09:09:01.9049871Z shr.u32 %r159, %r158, 1; 2026-02-21T09:09:01.9050024Z or.b32 %r13, 
%r157, %r159; 2026-02-21T09:09:01.9050176Z and.b32 %r160, %r1, 31; 2026-02-21T09:09:01.9050330Z shr.u32 %r161, %r1, 1; 2026-02-21T09:09:01.9050474Z and.b32 %r162, %r161, 96; 2026-02-21T09:09:01.9050628Z or.b32 %r14, %r162, %r160; 2026-02-21T09:09:01.9050777Z shl.b32 %r163, %r160, 7; 2026-02-21T09:09:01.9050927Z shl.b32 %r164, %r150, 4; 2026-02-21T09:09:01.9051072Z and.b32 %r165, %r152, 28; 2026-02-21T09:09:01.9051226Z or.b32 %r166, %r163, %r164; 2026-02-21T09:09:01.9051384Z xor.b32 %r167, %r166, %r165; 2026-02-21T09:09:01.9051571Z add.s32 %r168, %r42, 32768; 2026-02-21T09:09:01.9051735Z add.s32 %r15, %r168, %r167; 2026-02-21T09:09:01.9051889Z xor.b32 %r169, %r167, 32; 2026-02-21T09:09:01.9052046Z add.s32 %r16, %r168, %r169; 2026-02-21T09:09:01.9052202Z xor.b32 %r170, %r167, 64; 2026-02-21T09:09:01.9052364Z add.s32 %r17, %r168, %r170; 2026-02-21T09:09:01.9052519Z xor.b32 %r171, %r167, 96; 2026-02-21T09:09:01.9052683Z add.s32 %r18, %r168, %r171; 2026-02-21T09:09:01.9052842Z bfe.u32 %r172, %r168, 4, 14; 2026-02-21T09:09:01.9053005Z cvt.u64.u32 %rd71, %r172; 2026-02-21T09:09:01.9053201Z or.b64 %rd85, %rd71, 4611686293322072064; 2026-02-21T09:09:01.9053384Z add.s32 %r173, %r42, 32800; 2026-02-21T09:09:01.9053546Z bfe.u32 %r174, %r173, 4, 14; 2026-02-21T09:09:01.9053701Z cvt.u64.u32 %rd72, %r174; 2026-02-21T09:09:01.9053869Z or.b64 %rd86, %rd72, 4611686293322072064; 2026-02-21T09:09:01.9054045Z add.s32 %r175, %r42, 32832; 2026-02-21T09:09:01.9054203Z bfe.u32 %r176, %r175, 4, 14; 2026-02-21T09:09:01.9054356Z cvt.u64.u32 %rd73, %r176; 2026-02-21T09:09:01.9054551Z or.b64 %rd87, %rd73, 4611686293322072064; 2026-02-21T09:09:01.9054728Z add.s32 %r177, %r42, 32864; 2026-02-21T09:09:01.9054887Z bfe.u32 %r178, %r177, 4, 14; 2026-02-21T09:09:01.9055092Z cvt.u64.u32 %rd74, %r178; 2026-02-21T09:09:01.9055252Z or.b64 %rd88, %rd74, 4611686293322072064; 2026-02-21T09:09:01.9055432Z add.s32 %r179, %r42, 36864; 2026-02-21T09:09:01.9055584Z bfe.u32 %r180, %r179, 4, 14; 
2026-02-21T09:09:01.9055744Z cvt.u64.u32 %rd75, %r180; 2026-02-21T09:09:01.9055900Z or.b64 %rd89, %rd75, 4611686293322072064; 2026-02-21T09:09:01.9056081Z add.s32 %r181, %r42, 36896; 2026-02-21T09:09:01.9056231Z bfe.u32 %r182, %r181, 4, 14; 2026-02-21T09:09:01.9056390Z cvt.u64.u32 %rd76, %r182; 2026-02-21T09:09:01.9056544Z or.b64 %rd90, %rd76, 4611686293322072064; 2026-02-21T09:09:01.9056723Z add.s32 %r183, %r42, 36928; 2026-02-21T09:09:01.9056885Z bfe.u32 %r184, %r183, 4, 14; 2026-02-21T09:09:01.9057066Z cvt.u64.u32 %rd77, %r184; 2026-02-21T09:09:01.9057234Z or.b64 %rd91, %rd77, 4611686293322072064; 2026-02-21T09:09:01.9057405Z add.s32 %r185, %r42, 36960; 2026-02-21T09:09:01.9057569Z bfe.u32 %r186, %r185, 4, 14; 2026-02-21T09:09:01.9057726Z cvt.u64.u32 %rd78, %r186; 2026-02-21T09:09:01.9057894Z or.b64 %rd92, %rd78, 4611686293322072064; 2026-02-21T09:09:01.9058069Z or.b32 %r19, %r151, 128; 2026-02-21T09:09:01.9058230Z shl.b32 %r187, %r156, 6; 2026-02-21T09:09:01.9058391Z shl.b32 %r188, %r1, 3; 2026-02-21T09:09:01.9058540Z and.b32 %r189, %r188, 48; 2026-02-21T09:09:01.9058702Z shr.u32 %r190, %r158, 2; 2026-02-21T09:09:01.9058855Z xor.b32 %r191, %r189, %r190; 2026-02-21T09:09:01.9059020Z or.b32 %r192, %r191, %r187; 2026-02-21T09:09:01.9059176Z xor.b32 %r193, %r192, 16; 2026-02-21T09:09:01.9059450Z .loc 1 31 88 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:31:88 2026-02-21T09:09:01.9059737Z cvt.u64.u32 %rd79, %r151; 2026-02-21T09:09:01.9059904Z add.s32 %r194, %r42, %r13; 2026-02-21T09:09:01.9060068Z xor.b32 %r22, %r14, 16; 2026-02-21T09:09:01.9060232Z add.s32 %r97, %r42, 40960; 2026-02-21T09:09:01.9060394Z add.s32 %r195, %r97, %r22; 2026-02-21T09:09:01.9060551Z add.s32 %r196, %r97, %r14; 2026-02-21T09:09:01.9060718Z add.s32 %r197, %r42, 16384; 2026-02-21T09:09:01.9060882Z add.s32 %r107, %r197, %r10; 2026-02-21T09:09:01.9061051Z add.s32 %r101, %r197, %r6; 2026-02-21T09:09:01.9061210Z add.s32 %r105, %r101, 8192; 2026-02-21T09:09:01.9061377Z add.s32 %r103, 
%r101, 4096; 2026-02-21T09:09:01.9061669Z .loc 1 42 27 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:42:27 2026-02-21T09:09:01.9061966Z shl.b32 %r198, %r3, 5; 2026-02-21T09:09:01.9062120Z and.b32 %r199, %r198, 480; 2026-02-21T09:09:01.9062270Z and.b32 %r200, %r3, 7680; 2026-02-21T09:09:01.9062428Z or.b32 %r433, %r199, %r200; 2026-02-21T09:09:01.9062690Z .loc 1 43 27 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:43:27 2026-02-21T09:09:01.9062982Z shl.b32 %r201, %r3, 3; 2026-02-21T09:09:01.9063129Z and.b32 %r434, %r201, 3968; 2026-02-21T09:09:01.9063400Z .loc 1 44 32 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:44:32 2026-02-21T09:09:01.9063683Z or.b32 %r202, %r434, %r5; 2026-02-21T09:09:01.9063842Z or.b32 %r203, %r152, %r434; 2026-02-21T09:09:01.9064111Z .loc 1 58 53 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:53 2026-02-21T09:09:01.9064391Z shl.b32 %r204, %r202, 10; 2026-02-21T09:09:01.9064585Z shl.b32 %r205, %r203, 10; 2026-02-21T09:09:01.9064751Z or.b32 %r206, %r205, 98304; 2026-02-21T09:09:01.9064921Z $L__tmp0: 2026-02-21T09:09:01.9065233Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9065613Z shfl.sync.idx.b32 %r25, %r153, 0, 31, -1; 2026-02-21T09:09:01.9065815Z shl.b32 %r207, %r25, 21; 2026-02-21T09:09:01.9065981Z and.b32 %r208, %r207, 6291456; 2026-02-21T09:09:01.9066193Z add.s32 %r209, %r208, %r461; 2026-02-21T09:09:01.9066353Z and.b32 %r210, %r25, 4; 2026-02-21T09:09:01.9066517Z shl.b32 %r211, %r210, 2; 2026-02-21T09:09:01.9066675Z add.s32 %r432, %r209, %r211; 2026-02-21T09:09:01.9066872Z mov.pred %p58, -1; 2026-02-21T09:09:01.9067025Z // begin inline asm 2026-02-21T09:09:01.9067385Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r432 + 0], {%r52, %r52, %r52, %r52, %r52, %r52, %r52, %r52, %r52, %r52, %r52, %r52, %r52, %r52, %r52, %r52}; 2026-02-21T09:09:01.9067767Z // end inline asm 2026-02-21T09:09:01.9067909Z // begin 
inline asm 2026-02-21T09:09:01.9068079Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:01.9068251Z // end inline asm 2026-02-21T09:09:01.9068398Z bar.sync 0; 2026-02-21T09:09:01.9068533Z $L__tmp1: 2026-02-21T09:09:01.9068792Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9069094Z add.s32 %r463, %r42, 43008; 2026-02-21T09:09:01.9069292Z // begin inline asm 2026-02-21T09:09:01.9069480Z @%p108 mbarrier.init.shared::cta.b64 [%r463], 1; 2026-02-21T09:09:01.9069681Z // end inline asm 2026-02-21T09:09:01.9069829Z bar.sync 0; 2026-02-21T09:09:01.9069970Z add.s32 %r85, %r42, 43016; 2026-02-21T09:09:01.9070138Z // begin inline asm 2026-02-21T09:09:01.9070313Z @%p108 mbarrier.init.shared::cta.b64 [%r85], 1; 2026-02-21T09:09:01.9070516Z // end inline asm 2026-02-21T09:09:01.9070659Z add.s32 %r278, %r42, 43024; 2026-02-21T09:09:01.9070829Z // begin inline asm 2026-02-21T09:09:01.9071004Z @%p108 mbarrier.init.shared::cta.b64 [%r278], 1; 2026-02-21T09:09:01.9071212Z // end inline asm 2026-02-21T09:09:01.9071361Z bar.sync 0; 2026-02-21T09:09:01.9071503Z add.s32 %r87, %r42, 43032; 2026-02-21T09:09:01.9071761Z // begin inline asm 2026-02-21T09:09:01.9071931Z @%p108 mbarrier.init.shared::cta.b64 [%r87], 1; 2026-02-21T09:09:01.9072136Z // end inline asm 2026-02-21T09:09:01.9072431Z .loc 1 58 60 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:60 2026-02-21T09:09:01.9072740Z or.b32 %r212, %r204, %r151; 2026-02-21T09:09:01.9072894Z or.b32 %r213, %r206, %r151; 2026-02-21T09:09:01.9073166Z .loc 1 58 32 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:32 2026-02-21T09:09:01.9073465Z mad.wide.u32 %rd61, %r212, 2, %rd22; 2026-02-21T09:09:01.9073642Z cvt.u64.u32 %rd11, %r204; 2026-02-21T09:09:01.9073804Z or.b64 %rd80, %rd11, %rd79; 2026-02-21T09:09:01.9073960Z shl.b64 %rd81, %rd80, 1; 2026-02-21T09:09:01.9074126Z add.s64 %rd12, %rd22, %rd81; 2026-02-21T09:09:01.9074284Z add.s64 %rd62, %rd12, 65536; 
2026-02-21T09:09:01.9074450Z add.s64 %rd63, %rd12, 131072; 2026-02-21T09:09:01.9074617Z mad.wide.u32 %rd64, %r213, 2, %rd22; 2026-02-21T09:09:01.9074793Z mov.b32 %r271, 16; 2026-02-21T09:09:01.9075054Z .loc 1 58 80 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:80 2026-02-21T09:09:01.9075337Z // begin inline asm 2026-02-21T09:09:01.9075552Z cp.async.cg.shared.global [ %r270 + 0 ], [ %rd61 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9075776Z // end inline asm 2026-02-21T09:09:01.9075919Z // begin inline asm 2026-02-21T09:09:01.9076115Z cp.async.cg.shared.global [ %r272 + 0 ], [ %rd62 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9076340Z // end inline asm 2026-02-21T09:09:01.9076473Z // begin inline asm 2026-02-21T09:09:01.9076673Z cp.async.cg.shared.global [ %r274 + 0 ], [ %rd63 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9076895Z // end inline asm 2026-02-21T09:09:01.9077058Z // begin inline asm 2026-02-21T09:09:01.9077251Z cp.async.cg.shared.global [ %r276 + 0 ], [ %rd64 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9077467Z // end inline asm 2026-02-21T09:09:01.9077613Z cp.async.commit_group; 2026-02-21T09:09:01.9077870Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9078163Z bar.sync 0; 2026-02-21T09:09:01.9078292Z // begin inline asm 2026-02-21T09:09:01.9078522Z @%p108 mbarrier.arrive.expect_tx.shared.b64 _, [%r278], 1024; 2026-02-21T09:09:01.9078747Z // end inline asm 2026-02-21T09:09:01.9079016Z .loc 1 64 33 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:64:33 2026-02-21T09:09:01.9079309Z bar.sync 0; 2026-02-21T09:09:01.9079448Z elect.sync %r214|%p53, -1; 2026-02-21T09:09:01.9079619Z and.pred %p46, %p1, %p53; 2026-02-21T09:09:01.9079775Z // begin inline asm 2026-02-21T09:09:01.9080102Z @%p46 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r97], [%rd98, {%r433, %r52}], [%r278]; 2026-02-21T09:09:01.9080452Z // end inline asm 2026-02-21T09:09:01.9080692Z .loc 1 58 32 // 
cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:32 2026-02-21T09:09:01.9080989Z add.s64 %rd66, %rd12, 128; 2026-02-21T09:09:01.9081144Z or.b32 %r215, %r212, 64; 2026-02-21T09:09:01.9081307Z mad.wide.u32 %rd82, %r215, 2, %rd22; 2026-02-21T09:09:01.9081504Z add.s64 %rd67, %rd82, 65536; 2026-02-21T09:09:01.9081706Z add.s64 %rd68, %rd82, 131072; 2026-02-21T09:09:01.9081870Z cvt.u64.u32 %rd13, %r206; 2026-02-21T09:09:01.9092204Z or.b64 %rd83, %rd13, %rd79; 2026-02-21T09:09:01.9092464Z shl.b64 %rd84, %rd83, 1; 2026-02-21T09:09:01.9092639Z add.s64 %rd14, %rd22, %rd84; 2026-02-21T09:09:01.9092819Z add.s64 %rd69, %rd14, 128; 2026-02-21T09:09:01.9093112Z .loc 1 58 80 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:80 2026-02-21T09:09:01.9093430Z // begin inline asm 2026-02-21T09:09:01.9093652Z cp.async.cg.shared.global [ %r101 + 0 ], [ %rd66 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9093904Z // end inline asm 2026-02-21T09:09:01.9094057Z // begin inline asm 2026-02-21T09:09:01.9094265Z cp.async.cg.shared.global [ %r103 + 0 ], [ %rd67 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9094500Z // end inline asm 2026-02-21T09:09:01.9094640Z // begin inline asm 2026-02-21T09:09:01.9094850Z cp.async.cg.shared.global [ %r105 + 0 ], [ %rd68 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9095072Z // end inline asm 2026-02-21T09:09:01.9095219Z // begin inline asm 2026-02-21T09:09:01.9095416Z cp.async.cg.shared.global [ %r107 + 0 ], [ %rd69 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9095645Z // end inline asm 2026-02-21T09:09:01.9095800Z cp.async.commit_group; 2026-02-21T09:09:01.9096068Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9096360Z bar.sync 0; 2026-02-21T09:09:01.9096496Z // begin inline asm 2026-02-21T09:09:01.9096704Z @%p108 mbarrier.arrive.expect_tx.shared.b64 _, [%r87], 1024; 2026-02-21T09:09:01.9096923Z // end inline asm 2026-02-21T09:09:01.9097179Z .loc 1 64 33 // 
cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:64:33 2026-02-21T09:09:01.9097460Z bar.sync 0; 2026-02-21T09:09:01.9097614Z elect.sync %r216|%p54, -1; 2026-02-21T09:09:01.9097795Z and.pred %p48, %p1, %p54; 2026-02-21T09:09:01.9097961Z add.s32 %r110, %r42, 41984; 2026-02-21T09:09:01.9098131Z // begin inline asm 2026-02-21T09:09:01.9098465Z @%p48 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r110], [%rd98, {%r433, %r45}], [%r87]; 2026-02-21T09:09:01.9098825Z // end inline asm 2026-02-21T09:09:01.9099072Z .loc 1 58 80 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:80 2026-02-21T09:09:01.9099376Z cp.async.wait_group 1; 2026-02-21T09:09:01.9099538Z bar.sync 0; 2026-02-21T09:09:01.9099712Z ld.shared.v4.b32 {%r217, %r218, %r219, %r220}, [%r194]; 2026-02-21T09:09:01.9100040Z mov.b32 {%rs1, %rs2}, %r220; 2026-02-21T09:09:01.9100209Z mov.b32 {%rs3, %rs4}, %r219; 2026-02-21T09:09:01.9100382Z mov.b32 {%rs5, %rs6}, %r218; 2026-02-21T09:09:01.9100540Z mov.b32 {%rs7, %rs8}, %r217; 2026-02-21T09:09:01.9100748Z ld.shared.v4.b32 {%r221, %r222, %r223, %r224}, [%r194+16]; 2026-02-21T09:09:01.9100962Z mov.b32 {%rs9, %rs10}, %r224; 2026-02-21T09:09:01.9101147Z mov.b32 {%rs11, %rs12}, %r223; 2026-02-21T09:09:01.9101355Z mov.b32 {%rs13, %rs14}, %r222; 2026-02-21T09:09:01.9101525Z mov.b32 {%rs15, %rs16}, %r221; 2026-02-21T09:09:01.9101785Z ld.shared.v4.b32 {%r225, %r226, %r227, %r228}, [%r194+32]; 2026-02-21T09:09:01.9102026Z mov.b32 {%rs17, %rs18}, %r228; 2026-02-21T09:09:01.9102202Z mov.b32 {%rs19, %rs20}, %r227; 2026-02-21T09:09:01.9102365Z mov.b32 {%rs21, %rs22}, %r226; 2026-02-21T09:09:01.9102540Z mov.b32 {%rs23, %rs24}, %r225; 2026-02-21T09:09:01.9102734Z ld.shared.v4.b32 {%r229, %r230, %r231, %r232}, [%r194+48]; 2026-02-21T09:09:01.9102951Z mov.b32 {%rs25, %rs26}, %r232; 2026-02-21T09:09:01.9103109Z mov.b32 {%rs27, %rs28}, %r231; 2026-02-21T09:09:01.9103278Z mov.b32 {%rs29, %rs30}, %r230; 2026-02-21T09:09:01.9103445Z mov.b32 
{%rs31, %rs32}, %r229; 2026-02-21T09:09:01.9103716Z .loc 1 62 32 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:62:32 2026-02-21T09:09:01.9104017Z cvt.f32.bf16 %r115, %rs7; 2026-02-21T09:09:01.9104212Z cvt.f32.bf16 %r116, %rs8; 2026-02-21T09:09:01.9104387Z cvt.f32.bf16 %r117, %rs5; 2026-02-21T09:09:01.9104543Z cvt.f32.bf16 %r118, %rs6; 2026-02-21T09:09:01.9104706Z cvt.f32.bf16 %r119, %rs3; 2026-02-21T09:09:01.9104859Z cvt.f32.bf16 %r120, %rs4; 2026-02-21T09:09:01.9105021Z cvt.f32.bf16 %r121, %rs1; 2026-02-21T09:09:01.9105181Z cvt.f32.bf16 %r122, %rs2; 2026-02-21T09:09:01.9105336Z cvt.f32.bf16 %r123, %rs15; 2026-02-21T09:09:01.9105506Z cvt.f32.bf16 %r124, %rs16; 2026-02-21T09:09:01.9105661Z cvt.f32.bf16 %r125, %rs13; 2026-02-21T09:09:01.9105830Z cvt.f32.bf16 %r126, %rs14; 2026-02-21T09:09:01.9105987Z cvt.f32.bf16 %r127, %rs11; 2026-02-21T09:09:01.9106150Z cvt.f32.bf16 %r128, %rs12; 2026-02-21T09:09:01.9106304Z cvt.f32.bf16 %r129, %rs9; 2026-02-21T09:09:01.9106466Z cvt.f32.bf16 %r130, %rs10; 2026-02-21T09:09:01.9106618Z cvt.f32.bf16 %r132, %rs23; 2026-02-21T09:09:01.9106778Z cvt.f32.bf16 %r133, %rs24; 2026-02-21T09:09:01.9106939Z cvt.f32.bf16 %r134, %rs21; 2026-02-21T09:09:01.9107093Z cvt.f32.bf16 %r135, %rs22; 2026-02-21T09:09:01.9107254Z cvt.f32.bf16 %r136, %rs19; 2026-02-21T09:09:01.9107406Z cvt.f32.bf16 %r137, %rs20; 2026-02-21T09:09:01.9107567Z cvt.f32.bf16 %r138, %rs17; 2026-02-21T09:09:01.9107719Z cvt.f32.bf16 %r139, %rs18; 2026-02-21T09:09:01.9107883Z cvt.f32.bf16 %r140, %rs31; 2026-02-21T09:09:01.9108034Z cvt.f32.bf16 %r141, %rs32; 2026-02-21T09:09:01.9108193Z cvt.f32.bf16 %r142, %rs29; 2026-02-21T09:09:01.9108353Z cvt.f32.bf16 %r143, %rs30; 2026-02-21T09:09:01.9108523Z cvt.f32.bf16 %r144, %rs27; 2026-02-21T09:09:01.9108693Z cvt.f32.bf16 %r145, %rs28; 2026-02-21T09:09:01.9108854Z cvt.f32.bf16 %r146, %rs25; 2026-02-21T09:09:01.9109032Z cvt.f32.bf16 %r147, %rs26; 2026-02-21T09:09:01.9109190Z $L__tmp2: 2026-02-21T09:09:01.9109507Z .loc 2 291 36 // 
standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9109859Z add.s32 %r233, %r208, %r244; 2026-02-21T09:09:01.9110042Z shl.b32 %r234, %r210, 3; 2026-02-21T09:09:01.9110211Z add.s32 %r114, %r233, %r234; 2026-02-21T09:09:01.9110388Z // begin inline asm 2026-02-21T09:09:01.9110799Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r114 + 0], {%r115, %r116, %r117, %r118, %r119, %r120, %r121, %r122, %r123, %r124, %r125, %r126, %r127, %r128, %r129, %r130}; 2026-02-21T09:09:01.9111211Z // end inline asm 2026-02-21T09:09:01.9111369Z // begin inline asm 2026-02-21T09:09:01.9111782Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r114 + 16], {%r132, %r133, %r134, %r135, %r136, %r137, %r138, %r139, %r140, %r141, %r142, %r143, %r144, %r145, %r146, %r147}; 2026-02-21T09:09:01.9112228Z // end inline asm 2026-02-21T09:09:01.9112375Z // begin inline asm 2026-02-21T09:09:01.9112549Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:01.9112733Z // end inline asm 2026-02-21T09:09:01.9112876Z bar.sync 0; 2026-02-21T09:09:01.9113021Z $L__tmp3: 2026-02-21T09:09:01.9113275Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9113621Z // begin inline asm 2026-02-21T09:09:01.9113766Z 2026-02-21T09:09:01.9113898Z { 2026-02-21T09:09:01.9114035Z .reg .pred complete; 2026-02-21T09:09:01.9114204Z waitLoop: 2026-02-21T09:09:01.9114430Z mbarrier.try_wait.parity.shared.b64 complete, [%r278], %r52; 2026-02-21T09:09:01.9114689Z @!complete bra.uni waitLoop; 2026-02-21T09:09:01.9114861Z } 2026-02-21T09:09:01.9114933Z 2026-02-21T09:09:01.9114996Z // end inline asm 2026-02-21T09:09:01.9115266Z .loc 1 64 33 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:64:33 2026-02-21T09:09:01.9115569Z ld.shared.b8 %rs33, [%r196]; 2026-02-21T09:09:01.9115756Z ld.shared.b8 %rs34, [%r196+256]; 2026-02-21T09:09:01.9115940Z ld.shared.b8 %rs35, [%r196+512]; 2026-02-21T09:09:01.9116127Z ld.shared.b8 %rs36, [%r196+768]; 
2026-02-21T09:09:01.9116301Z ld.shared.b8 %rs37, [%r195+128]; 2026-02-21T09:09:01.9116401Z ld.shared.b8 %rs38, [%r195+384]; 2026-02-21T09:09:01.9116472Z ld.shared.b8 %rs39, [%r195+640]; 2026-02-21T09:09:01.9116536Z ld.shared.b8 %rs40, [%r195+896]; 2026-02-21T09:09:01.9116726Z .loc 1 67 28 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:67:28 2026-02-21T09:09:01.9116802Z shl.b16 %rs41, %rs33, 4; 2026-02-21T09:09:01.9116863Z shl.b16 %rs42, %rs37, 4; 2026-02-21T09:09:01.9116930Z shl.b16 %rs43, %rs34, 4; 2026-02-21T09:09:01.9116988Z shl.b16 %rs44, %rs38, 4; 2026-02-21T09:09:01.9117043Z shl.b16 %rs45, %rs35, 4; 2026-02-21T09:09:01.9117103Z shl.b16 %rs46, %rs39, 4; 2026-02-21T09:09:01.9117171Z shl.b16 %rs47, %rs36, 4; 2026-02-21T09:09:01.9117227Z shl.b16 %rs48, %rs40, 4; 2026-02-21T09:09:01.9117401Z .loc 1 82 58 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:82:58 2026-02-21T09:09:01.9117481Z selp.b16 %rs49, %rs41, %rs33, %p51; 2026-02-21T09:09:01.9117541Z cvt.s16.s8 %rs50, %rs49; 2026-02-21T09:09:01.9117601Z shr.s16 %rs51, %rs50, 4; 2026-02-21T09:09:01.9117677Z selp.b16 %rs52, %rs42, %rs37, %p51; 2026-02-21T09:09:01.9117734Z cvt.s16.s8 %rs53, %rs52; 2026-02-21T09:09:01.9117793Z shr.s16 %rs54, %rs53, 4; 2026-02-21T09:09:01.9117858Z selp.b16 %rs55, %rs43, %rs34, %p51; 2026-02-21T09:09:01.9117926Z cvt.s16.s8 %rs56, %rs55; 2026-02-21T09:09:01.9117984Z shr.s16 %rs57, %rs56, 4; 2026-02-21T09:09:01.9118049Z selp.b16 %rs58, %rs44, %rs38, %p51; 2026-02-21T09:09:01.9118117Z cvt.s16.s8 %rs59, %rs58; 2026-02-21T09:09:01.9118178Z shr.s16 %rs60, %rs59, 4; 2026-02-21T09:09:01.9118245Z selp.b16 %rs61, %rs45, %rs35, %p51; 2026-02-21T09:09:01.9118304Z cvt.s16.s8 %rs62, %rs61; 2026-02-21T09:09:01.9118372Z shr.s16 %rs63, %rs62, 4; 2026-02-21T09:09:01.9118434Z selp.b16 %rs64, %rs46, %rs39, %p51; 2026-02-21T09:09:01.9118494Z cvt.s16.s8 %rs65, %rs64; 2026-02-21T09:09:01.9118560Z shr.s16 %rs66, %rs65, 4; 2026-02-21T09:09:01.9118623Z selp.b16 %rs67, %rs47, %rs36, %p51; 
2026-02-21T09:09:01.9118683Z cvt.s16.s8 %rs68, %rs67; 2026-02-21T09:09:01.9118741Z shr.s16 %rs69, %rs68, 4; 2026-02-21T09:09:01.9118816Z selp.b16 %rs70, %rs48, %rs40, %p51; 2026-02-21T09:09:01.9118875Z cvt.s16.s8 %rs71, %rs70; 2026-02-21T09:09:01.9118933Z shr.s16 %rs72, %rs71, 4; 2026-02-21T09:09:01.9119109Z .loc 1 87 32 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:87:32 2026-02-21T09:09:01.9119175Z cvt.rn.f32.s16 %r235, %rs51; 2026-02-21T09:09:01.9119238Z cvt.rn.f32.s16 %r236, %rs54; 2026-02-21T09:09:01.9119308Z cvt.rn.f32.s16 %r237, %rs57; 2026-02-21T09:09:01.9119394Z cvt.rn.f32.s16 %r238, %rs60; 2026-02-21T09:09:01.9119455Z cvt.rn.f32.s16 %r239, %rs63; 2026-02-21T09:09:01.9119517Z cvt.rn.f32.s16 %r240, %rs66; 2026-02-21T09:09:01.9119593Z cvt.rn.f32.s16 %r241, %rs69; 2026-02-21T09:09:01.9119654Z cvt.rn.f32.s16 %r242, %rs72; 2026-02-21T09:09:01.9119718Z st.shared.b32 [%r15], %r235; 2026-02-21T09:09:01.9119791Z st.shared.b32 [%r15+4096], %r239; 2026-02-21T09:09:01.9119853Z st.shared.b32 [%r16], %r236; 2026-02-21T09:09:01.9119940Z st.shared.b32 [%r16+4096], %r240; 2026-02-21T09:09:01.9120000Z st.shared.b32 [%r17], %r237; 2026-02-21T09:09:01.9120073Z st.shared.b32 [%r17+4096], %r241; 2026-02-21T09:09:01.9120133Z st.shared.b32 [%r18], %r238; 2026-02-21T09:09:01.9120218Z st.shared.b32 [%r18+4096], %r242; 2026-02-21T09:09:01.9120284Z $L__tmp4: 2026-02-21T09:09:01.9120510Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9120570Z // begin inline asm 2026-02-21T09:09:01.9120658Z fence.proxy.async.shared::cta; 2026-02-21T09:09:01.9120716Z // end inline asm 2026-02-21T09:09:01.9120772Z bar.sync 0; 2026-02-21T09:09:01.9120839Z setp.ne.b32 %p55, %r25, 0; 2026-02-21T09:09:01.9120909Z @%p55 bra $L__BB0_3; 2026-02-21T09:09:01.9120962Z // %bb.2: 2026-02-21T09:09:01.9121031Z elect.sync %r267|%p57, -1; 2026-02-21T09:09:01.9121101Z mov.b32 %r245, 134744336; 2026-02-21T09:09:01.9121181Z mov.pred 
%p56, 0; 2026-02-21T09:09:01.9121243Z // begin inline asm 2026-02-21T09:09:01.9121403Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 0 ], %rd85, %r245, %p56; 2026-02-21T09:09:01.9121469Z // end inline asm 2026-02-21T09:09:01.9121531Z // begin inline asm 2026-02-21T09:09:01.9121710Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 8 ], %rd86, %r245, %p58; 2026-02-21T09:09:01.9121777Z // end inline asm 2026-02-21T09:09:01.9121836Z // begin inline asm 2026-02-21T09:09:01.9121984Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 16 ], %rd87, %r245, %p58; 2026-02-21T09:09:01.9122052Z // end inline asm 2026-02-21T09:09:01.9122112Z // begin inline asm 2026-02-21T09:09:01.9122255Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 24 ], %rd88, %r245, %p58; 2026-02-21T09:09:01.9122313Z // end inline asm 2026-02-21T09:09:01.9122380Z // begin inline asm 2026-02-21T09:09:01.9122526Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 32 ], %rd89, %r245, %p58; 2026-02-21T09:09:01.9122585Z // end inline asm 2026-02-21T09:09:01.9122646Z // begin inline asm 2026-02-21T09:09:01.9122793Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 40 ], %rd90, %r245, %p58; 2026-02-21T09:09:01.9122849Z // end inline asm 2026-02-21T09:09:01.9122904Z // begin inline asm 2026-02-21T09:09:01.9123043Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 48 ], %rd91, %r245, %p58; 2026-02-21T09:09:01.9123106Z // end inline asm 2026-02-21T09:09:01.9123161Z // begin inline asm 2026-02-21T09:09:01.9123302Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 56 ], %rd92, %r245, %p58; 2026-02-21T09:09:01.9123365Z // end inline asm 2026-02-21T09:09:01.9123428Z add.s32 %r269, %r42, 43008; 2026-02-21T09:09:01.9123490Z cvt.u64.u32 %rd93, %r269; 2026-02-21T09:09:01.9123555Z // begin inline asm 2026-02-21T09:09:01.9123680Z @%p57 
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd93]; 2026-02-21T09:09:01.9123737Z // end inline asm 2026-02-21T09:09:01.9123794Z $L__tmp5: 2026-02-21T09:09:01.9123855Z $L__BB0_3: 2026-02-21T09:09:01.9123946Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:01.9124007Z add.s32 %r20, %r42, %r192; 2026-02-21T09:09:01.9124073Z add.s32 %r21, %r42, %r193; 2026-02-21T09:09:01.9124246Z .loc 1 58 32 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:32 2026-02-21T09:09:01.9124306Z add.s64 %rd94, %rd12, 256; 2026-02-21T09:09:01.9124365Z cvt.u64.u32 %rd100, %r19; 2026-02-21T09:09:01.9124464Z add.s64 %rd101, %rd11, %rd100; 2026-02-21T09:09:01.9124522Z shl.b64 %rd102, %rd101, 1; 2026-02-21T09:09:01.9124582Z add.s64 %rd103, %rd22, %rd102; 2026-02-21T09:09:01.9124650Z add.s64 %rd95, %rd103, 65536; 2026-02-21T09:09:01.9124708Z add.s64 %rd96, %rd103, 131072; 2026-02-21T09:09:01.9124767Z add.s64 %rd97, %rd14, 256; 2026-02-21T09:09:01.9124944Z .loc 1 58 80 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:80 2026-02-21T09:09:01.9125028Z // begin inline asm 2026-02-21T09:09:01.9125144Z cp.async.cg.shared.global [ %r270 + 0 ], [ %rd94 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9125198Z // end inline asm 2026-02-21T09:09:01.9125304Z // begin inline asm 2026-02-21T09:09:01.9125418Z cp.async.cg.shared.global [ %r272 + 0 ], [ %rd95 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9125473Z // end inline asm 2026-02-21T09:09:01.9125535Z // begin inline asm 2026-02-21T09:09:01.9125645Z cp.async.cg.shared.global [ %r274 + 0 ], [ %rd96 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9125700Z // end inline asm 2026-02-21T09:09:01.9125755Z // begin inline asm 2026-02-21T09:09:01.9125873Z cp.async.cg.shared.global [ %r276 + 0 ], [ %rd97 + 0 ], 0x10, %r271; 2026-02-21T09:09:01.9125927Z // end inline asm 2026-02-21T09:09:01.9125991Z cp.async.commit_group; 2026-02-21T09:09:01.9126164Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 
2026-02-21T09:09:01.9126244Z // begin inline asm 2026-02-21T09:09:01.9126360Z @%p108 mbarrier.arrive.expect_tx.shared.b64 _, [%r278], 1024; 2026-02-21T09:09:01.9126421Z // end inline asm 2026-02-21T09:09:01.9126591Z .loc 1 64 33 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:64:33 2026-02-21T09:09:01.9126647Z bar.sync 0; 2026-02-21T09:09:01.9126710Z elect.sync %r287|%p76, -1; 2026-02-21T09:09:01.9126783Z and.pred %p74, %p1, %p76; 2026-02-21T09:09:01.9126838Z mov.b32 %r281, 64; 2026-02-21T09:09:01.9126893Z // begin inline asm 2026-02-21T09:09:01.9127145Z @%p74 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r97], [%rd98, {%r433, %r281}], [%r278]; 2026-02-21T09:09:01.9127202Z // end inline asm 2026-02-21T09:09:01.9127365Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9127434Z shl.b64 %rd104, %rd13, 1; 2026-02-21T09:09:01.9127495Z add.s64 %rd15, %rd104, 384; 2026-02-21T09:09:01.9127567Z mad.wide.u32 %rd153, %r150, 16, %rd22; 2026-02-21T09:09:01.9127628Z shl.b32 %r289, %r3, 13; 2026-02-21T09:09:01.9127696Z and.b32 %r290, %r289, 4063232; 2026-02-21T09:09:01.9127756Z shl.b32 %r291, %r5, 10; 2026-02-21T09:09:01.9127819Z or.b32 %r292, %r290, %r291; 2026-02-21T09:09:01.9127890Z mul.wide.u32 %rd17, %r292, 2; 2026-02-21T09:09:01.9127946Z mov.b32 %r466, 1; 2026-02-21T09:09:01.9128001Z mov.b32 %r462, 0; 2026-02-21T09:09:01.9128057Z mov.b64 %rd154, 0; 2026-02-21T09:09:01.9128126Z mov.b32 %r464, %r462; 2026-02-21T09:09:01.9128186Z mov.b32 %r465, %r462; 2026-02-21T09:09:01.9128242Z mov.b32 %r467, %r462; 2026-02-21T09:09:01.9128305Z bra.uni $L__BB0_4; 2026-02-21T09:09:01.9128410Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:01.9128572Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9128645Z setp.lt.u64 %p101, %rd154, 416; 2026-02-21T09:09:01.9128699Z $L__tmp6: 2026-02-21T09:09:01.9128917Z .loc 2 291 36 // standard.py:291:36 
@[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9128976Z add.s32 %r406, %r466, 1; 2026-02-21T09:09:01.9129050Z setp.gt.s32 %p104, %r406, 1; 2026-02-21T09:09:01.9129112Z selp.b32 %r466, 0, %r406, %p104; 2026-02-21T09:09:01.9129182Z selp.b32 %r407, 1, 0, %p104; 2026-02-21T09:09:01.9129250Z xor.b32 %r41, %r467, %r407; 2026-02-21T09:09:01.9129303Z $L__tmp7: 2026-02-21T09:09:01.9129470Z .loc 1 58 32 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:32 2026-02-21T09:09:01.9129567Z add.s64 %rd119, %rd153, %rd17; 2026-02-21T09:09:01.9129624Z add.s64 %rd114, %rd119, 384; 2026-02-21T09:09:01.9129683Z add.s64 %rd115, %rd119, 65920; 2026-02-21T09:09:01.9129745Z add.s64 %rd116, %rd119, 131456; 2026-02-21T09:09:01.9129915Z .loc 1 58 80 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:80 2026-02-21T09:09:01.9130001Z add.s64 %rd117, %rd153, %rd15; 2026-02-21T09:09:01.9130060Z add.s32 %r393, %r35, %r6; 2026-02-21T09:09:01.9130128Z selp.b32 %r394, 16, 0, %p101; 2026-02-21T09:09:01.9130185Z // begin inline asm 2026-02-21T09:09:01.9130328Z cp.async.cg.shared.global [ %r393 + 0 ], [ %rd114 + 0 ], 0x10, %r394; 2026-02-21T09:09:01.9130392Z // end inline asm 2026-02-21T09:09:01.9130451Z add.s32 %r395, %r393, 4096; 2026-02-21T09:09:01.9130507Z // begin inline asm 2026-02-21T09:09:01.9130623Z cp.async.cg.shared.global [ %r395 + 0 ], [ %rd115 + 0 ], 0x10, %r394; 2026-02-21T09:09:01.9130686Z // end inline asm 2026-02-21T09:09:01.9130744Z add.s32 %r397, %r393, 8192; 2026-02-21T09:09:01.9130800Z // begin inline asm 2026-02-21T09:09:01.9130920Z cp.async.cg.shared.global [ %r397 + 0 ], [ %rd116 + 0 ], 0x10, %r394; 2026-02-21T09:09:01.9130974Z // end inline asm 2026-02-21T09:09:01.9131034Z add.s32 %r399, %r35, %r10; 2026-02-21T09:09:01.9131089Z // begin inline asm 2026-02-21T09:09:01.9131229Z cp.async.cg.shared.global [ %r399 + 0 ], [ %rd117 + 0 ], 0x10, %r394; 2026-02-21T09:09:01.9131286Z // end inline asm 
2026-02-21T09:09:01.9131350Z cp.async.commit_group; 2026-02-21T09:09:01.9131520Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9131618Z and.pred %p99, %p108, %p101; 2026-02-21T09:09:01.9131674Z // begin inline asm 2026-02-21T09:09:01.9131792Z @%p99 mbarrier.arrive.expect_tx.shared.b64 _, [%r401], 1024; 2026-02-21T09:09:01.9131847Z // end inline asm 2026-02-21T09:09:01.9132009Z .loc 1 64 33 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:64:33 2026-02-21T09:09:01.9132065Z bar.sync 0; 2026-02-21T09:09:01.9132136Z elect.sync %r408|%p105, -1; 2026-02-21T09:09:01.9132200Z and.pred %p106, %p101, %p105; 2026-02-21T09:09:01.9132263Z and.pred %p100, %p1, %p106; 2026-02-21T09:09:01.9132330Z cvt.u32.u64 %r409, %rd154; 2026-02-21T09:09:01.9132390Z add.s32 %r404, %r409, 96; 2026-02-21T09:09:01.9132448Z // begin inline asm 2026-02-21T09:09:01.9132699Z @%p100 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r402], [%rd98, {%r433, %r404}], [%r401]; 2026-02-21T09:09:01.9132754Z // end inline asm 2026-02-21T09:09:01.9132920Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9132979Z add.s64 %rd153, %rd153, 128; 2026-02-21T09:09:01.9133052Z setp.lt.u64 %p107, %rd154, 448; 2026-02-21T09:09:01.9133110Z add.s64 %rd154, %rd154, 32; 2026-02-21T09:09:01.9133169Z mov.b32 %r462, %r467; 2026-02-21T09:09:01.9133233Z mov.b32 %r467, %r41; 2026-02-21T09:09:01.9133293Z @%p107 bra $L__BB0_4; 2026-02-21T09:09:01.9133350Z bra.uni $L__BB0_7; 2026-02-21T09:09:01.9133454Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:09:01.9133521Z add.s32 %r331, %r465, 1; 2026-02-21T09:09:01.9133582Z setp.gt.s32 %p81, %r331, 1; 2026-02-21T09:09:01.9133645Z selp.b32 %r465, 0, %r331, %p81; 2026-02-21T09:09:01.9133819Z .loc 1 58 80 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:58:80 2026-02-21T09:09:01.9133884Z cp.async.wait_group 1; 
2026-02-21T09:09:01.9133939Z bar.sync 0; 2026-02-21T09:09:01.9134002Z shl.b32 %r332, %r465, 14; 2026-02-21T09:09:01.9134061Z add.s32 %r35, %r42, %r332; 2026-02-21T09:09:01.9134117Z add.s32 %r334, %r35, %r13; 2026-02-21T09:09:01.9134212Z ld.shared.v4.b32 {%r335, %r336, %r337, %r338}, [%r334]; 2026-02-21T09:09:01.9134281Z mov.b32 {%rs73, %rs74}, %r338; 2026-02-21T09:09:01.9134377Z mov.b32 {%rs75, %rs76}, %r337; 2026-02-21T09:09:01.9134435Z mov.b32 {%rs77, %rs78}, %r336; 2026-02-21T09:09:01.9134499Z mov.b32 {%rs79, %rs80}, %r335; 2026-02-21T09:09:01.9134595Z ld.shared.v4.b32 {%r339, %r340, %r341, %r342}, [%r334+16]; 2026-02-21T09:09:01.9134652Z mov.b32 {%rs81, %rs82}, %r342; 2026-02-21T09:09:01.9134709Z mov.b32 {%rs83, %rs84}, %r341; 2026-02-21T09:09:01.9134773Z mov.b32 {%rs85, %rs86}, %r340; 2026-02-21T09:09:01.9134855Z mov.b32 {%rs87, %rs88}, %r339; 2026-02-21T09:09:01.9134949Z ld.shared.v4.b32 {%r343, %r344, %r345, %r346}, [%r334+32]; 2026-02-21T09:09:01.9135014Z mov.b32 {%rs89, %rs90}, %r346; 2026-02-21T09:09:01.9135093Z mov.b32 {%rs91, %rs92}, %r345; 2026-02-21T09:09:01.9135152Z mov.b32 {%rs93, %rs94}, %r344; 2026-02-21T09:09:01.9135215Z mov.b32 {%rs95, %rs96}, %r343; 2026-02-21T09:09:01.9135307Z ld.shared.v4.b32 {%r347, %r348, %r349, %r350}, [%r334+48]; 2026-02-21T09:09:01.9135366Z mov.b32 {%rs97, %rs98}, %r350; 2026-02-21T09:09:01.9135428Z mov.b32 {%rs99, %rs100}, %r349; 2026-02-21T09:09:01.9135499Z mov.b32 {%rs101, %rs102}, %r348; 2026-02-21T09:09:01.9135561Z mov.b32 {%rs103, %rs104}, %r347; 2026-02-21T09:09:01.9135726Z .loc 1 62 32 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:62:32 2026-02-21T09:09:01.9135794Z cvt.f32.bf16 %r296, %rs79; 2026-02-21T09:09:01.9135850Z cvt.f32.bf16 %r297, %rs80; 2026-02-21T09:09:01.9135942Z cvt.f32.bf16 %r298, %rs77; 2026-02-21T09:09:01.9136000Z cvt.f32.bf16 %r299, %rs78; 2026-02-21T09:09:01.9136066Z cvt.f32.bf16 %r300, %rs75; 2026-02-21T09:09:01.9136125Z cvt.f32.bf16 %r301, %rs76; 2026-02-21T09:09:01.9136184Z 
cvt.f32.bf16 %r302, %rs73; 2026-02-21T09:09:01.9136249Z cvt.f32.bf16 %r303, %rs74; 2026-02-21T09:09:01.9136305Z cvt.f32.bf16 %r304, %rs87; 2026-02-21T09:09:01.9136362Z cvt.f32.bf16 %r305, %rs88; 2026-02-21T09:09:01.9136418Z cvt.f32.bf16 %r306, %rs85; 2026-02-21T09:09:01.9136482Z cvt.f32.bf16 %r307, %rs86; 2026-02-21T09:09:01.9136542Z cvt.f32.bf16 %r308, %rs83; 2026-02-21T09:09:01.9136600Z cvt.f32.bf16 %r309, %rs84; 2026-02-21T09:09:01.9136666Z cvt.f32.bf16 %r310, %rs81; 2026-02-21T09:09:01.9136723Z cvt.f32.bf16 %r311, %rs82; 2026-02-21T09:09:01.9136782Z cvt.f32.bf16 %r313, %rs95; 2026-02-21T09:09:01.9136846Z cvt.f32.bf16 %r314, %rs96; 2026-02-21T09:09:01.9136902Z cvt.f32.bf16 %r315, %rs93; 2026-02-21T09:09:01.9136960Z cvt.f32.bf16 %r316, %rs94; 2026-02-21T09:09:01.9137017Z cvt.f32.bf16 %r317, %rs91; 2026-02-21T09:09:01.9137081Z cvt.f32.bf16 %r318, %rs92; 2026-02-21T09:09:01.9137137Z cvt.f32.bf16 %r319, %rs89; 2026-02-21T09:09:01.9137193Z cvt.f32.bf16 %r320, %rs90; 2026-02-21T09:09:01.9137262Z cvt.f32.bf16 %r321, %rs103; 2026-02-21T09:09:01.9137321Z cvt.f32.bf16 %r322, %rs104; 2026-02-21T09:09:01.9137379Z cvt.f32.bf16 %r323, %rs101; 2026-02-21T09:09:01.9137436Z cvt.f32.bf16 %r324, %rs102; 2026-02-21T09:09:01.9137501Z cvt.f32.bf16 %r325, %rs99; 2026-02-21T09:09:01.9137559Z cvt.f32.bf16 %r326, %rs100; 2026-02-21T09:09:01.9137616Z cvt.f32.bf16 %r327, %rs97; 2026-02-21T09:09:01.9137679Z cvt.f32.bf16 %r328, %rs98; 2026-02-21T09:09:01.9137732Z $L__tmp8: 2026-02-21T09:09:01.9137945Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9138001Z // begin inline asm 2026-02-21T09:09:01.9138060Z 2026-02-21T09:09:01.9138111Z { 2026-02-21T09:09:01.9138175Z .reg .pred complete; 2026-02-21T09:09:01.9138237Z waitLoop: 2026-02-21T09:09:01.9138353Z mbarrier.try_wait.parity.shared.b64 complete, [%r463], %r462; 2026-02-21T09:09:01.9138416Z @!complete bra.uni waitLoop; 2026-02-21T09:09:01.9138469Z } 
2026-02-21T09:09:01.9138481Z 2026-02-21T09:09:01.9138537Z // end inline asm 2026-02-21T09:09:01.9138588Z $L__tmp9: 2026-02-21T09:09:01.9138753Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9138821Z selp.b32 %r351, 1, 0, %p81; 2026-02-21T09:09:01.9138904Z xor.b32 %r464, %r464, %r351; 2026-02-21T09:09:01.9138966Z mov.pred %p82, -1; 2026-02-21T09:09:01.9139025Z $L__tmp10: 2026-02-21T09:09:01.9139239Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9139296Z // begin inline asm 2026-02-21T09:09:01.9139579Z @%p82 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r114 + 0], {%r296, %r297, %r298, %r299, %r300, %r301, %r302, %r303, %r304, %r305, %r306, %r307, %r308, %r309, %r310, %r311}; 2026-02-21T09:09:01.9139663Z // end inline asm 2026-02-21T09:09:01.9139719Z // begin inline asm 2026-02-21T09:09:01.9140011Z @%p82 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r114 + 16], {%r313, %r314, %r315, %r316, %r317, %r318, %r319, %r320, %r321, %r322, %r323, %r324, %r325, %r326, %r327, %r328}; 2026-02-21T09:09:01.9140076Z // end inline asm 2026-02-21T09:09:01.9140132Z // begin inline asm 2026-02-21T09:09:01.9140204Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:01.9140265Z // end inline asm 2026-02-21T09:09:01.9140320Z bar.sync 0; 2026-02-21T09:09:01.9140372Z $L__tmp11: 2026-02-21T09:09:01.9140536Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9140602Z shl.b32 %r352, %r465, 3; 2026-02-21T09:09:01.9140663Z add.s32 %r353, %r42, %r352; 2026-02-21T09:09:01.9140721Z add.s32 %r401, %r353, 43024; 2026-02-21T09:09:01.9140807Z // begin inline asm 2026-02-21T09:09:01.9140858Z 2026-02-21T09:09:01.9140908Z { 2026-02-21T09:09:01.9140969Z .reg .pred complete; 2026-02-21T09:09:01.9141028Z waitLoop: 2026-02-21T09:09:01.9141143Z mbarrier.try_wait.parity.shared.b64 complete, [%r401], %r464; 2026-02-21T09:09:01.9141207Z @!complete 
bra.uni waitLoop; 2026-02-21T09:09:01.9141264Z } 2026-02-21T09:09:01.9141268Z 2026-02-21T09:09:01.9141323Z // end inline asm 2026-02-21T09:09:01.9141490Z .loc 1 64 33 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:64:33 2026-02-21T09:09:01.9141597Z shl.b32 %r354, %r465, 10; 2026-02-21T09:09:01.9141657Z add.s32 %r355, %r42, %r354; 2026-02-21T09:09:01.9141716Z add.s32 %r402, %r355, 40960; 2026-02-21T09:09:01.9141773Z add.s32 %r356, %r402, %r14; 2026-02-21T09:09:01.9141843Z ld.shared.b8 %rs105, [%r356]; 2026-02-21T09:09:01.9141908Z ld.shared.b8 %rs106, [%r356+256]; 2026-02-21T09:09:01.9141971Z ld.shared.b8 %rs107, [%r356+512]; 2026-02-21T09:09:01.9142039Z ld.shared.b8 %rs108, [%r356+768]; 2026-02-21T09:09:01.9142099Z add.s32 %r357, %r402, %r22; 2026-02-21T09:09:01.9142159Z ld.shared.b8 %rs109, [%r357+128]; 2026-02-21T09:09:01.9142227Z ld.shared.b8 %rs110, [%r357+384]; 2026-02-21T09:09:01.9142289Z ld.shared.b8 %rs111, [%r357+640]; 2026-02-21T09:09:01.9142348Z ld.shared.b8 %rs112, [%r357+896]; 2026-02-21T09:09:01.9142514Z .loc 1 67 28 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:67:28 2026-02-21T09:09:01.9142584Z shl.b16 %rs113, %rs105, 4; 2026-02-21T09:09:01.9142645Z shl.b16 %rs114, %rs109, 4; 2026-02-21T09:09:01.9142703Z shl.b16 %rs115, %rs106, 4; 2026-02-21T09:09:01.9142768Z shl.b16 %rs116, %rs110, 4; 2026-02-21T09:09:01.9142827Z shl.b16 %rs117, %rs107, 4; 2026-02-21T09:09:01.9142884Z shl.b16 %rs118, %rs111, 4; 2026-02-21T09:09:01.9142940Z shl.b16 %rs119, %rs108, 4; 2026-02-21T09:09:01.9143005Z shl.b16 %rs120, %rs112, 4; 2026-02-21T09:09:01.9143172Z .loc 1 82 58 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:82:58 2026-02-21T09:09:01.9143246Z selp.b16 %rs121, %rs113, %rs105, %p51; 2026-02-21T09:09:01.9143314Z cvt.s16.s8 %rs122, %rs121; 2026-02-21T09:09:01.9143371Z shr.s16 %rs123, %rs122, 4; 2026-02-21T09:09:01.9143440Z selp.b16 %rs124, %rs114, %rs109, %p51; 2026-02-21T09:09:01.9143507Z cvt.s16.s8 %rs125, %rs124; 
2026-02-21T09:09:01.9143564Z shr.s16 %rs126, %rs125, 4; 2026-02-21T09:09:01.9143629Z selp.b16 %rs127, %rs115, %rs106, %p51; 2026-02-21T09:09:01.9143687Z cvt.s16.s8 %rs128, %rs127; 2026-02-21T09:09:01.9143778Z shr.s16 %rs129, %rs128, 4; 2026-02-21T09:09:01.9143844Z selp.b16 %rs130, %rs116, %rs110, %p51; 2026-02-21T09:09:01.9143902Z cvt.s16.s8 %rs131, %rs130; 2026-02-21T09:09:01.9143966Z shr.s16 %rs132, %rs131, 4; 2026-02-21T09:09:01.9144032Z selp.b16 %rs133, %rs117, %rs107, %p51; 2026-02-21T09:09:01.9144092Z cvt.s16.s8 %rs134, %rs133; 2026-02-21T09:09:01.9144150Z shr.s16 %rs135, %rs134, 4; 2026-02-21T09:09:01.9144227Z selp.b16 %rs136, %rs118, %rs111, %p51; 2026-02-21T09:09:01.9144326Z cvt.s16.s8 %rs137, %rs136; 2026-02-21T09:09:01.9144384Z shr.s16 %rs138, %rs137, 4; 2026-02-21T09:09:01.9144456Z selp.b16 %rs139, %rs119, %rs108, %p51; 2026-02-21T09:09:01.9144545Z cvt.s16.s8 %rs140, %rs139; 2026-02-21T09:09:01.9144604Z shr.s16 %rs141, %rs140, 4; 2026-02-21T09:09:01.9144667Z selp.b16 %rs142, %rs120, %rs112, %p51; 2026-02-21T09:09:01.9144731Z cvt.s16.s8 %rs143, %rs142; 2026-02-21T09:09:01.9144787Z shr.s16 %rs144, %rs143, 4; 2026-02-21T09:09:01.9144955Z .loc 1 87 32 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:87:32 2026-02-21T09:09:01.9145023Z cvt.rn.f32.s16 %r358, %rs123; 2026-02-21T09:09:01.9145084Z cvt.rn.f32.s16 %r359, %rs126; 2026-02-21T09:09:01.9145142Z cvt.rn.f32.s16 %r360, %rs129; 2026-02-21T09:09:01.9145206Z cvt.rn.f32.s16 %r361, %rs132; 2026-02-21T09:09:01.9145263Z cvt.rn.f32.s16 %r362, %rs135; 2026-02-21T09:09:01.9145343Z cvt.rn.f32.s16 %r363, %rs138; 2026-02-21T09:09:01.9145402Z cvt.rn.f32.s16 %r364, %rs141; 2026-02-21T09:09:01.9145466Z cvt.rn.f32.s16 %r365, %rs144; 2026-02-21T09:09:01.9145527Z st.shared.b32 [%r15], %r358; 2026-02-21T09:09:01.9145589Z st.shared.b32 [%r15+4096], %r362; 2026-02-21T09:09:01.9145660Z st.shared.b32 [%r16], %r359; 2026-02-21T09:09:01.9145721Z st.shared.b32 [%r16+4096], %r363; 2026-02-21T09:09:01.9145779Z st.shared.b32 
[%r17], %r360; 2026-02-21T09:09:01.9145841Z st.shared.b32 [%r17+4096], %r364; 2026-02-21T09:09:01.9145907Z st.shared.b32 [%r18], %r361; 2026-02-21T09:09:01.9145968Z st.shared.b32 [%r18+4096], %r365; 2026-02-21T09:09:01.9146132Z .loc 1 51 70 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9146199Z shl.b32 %r366, %r466, 3; 2026-02-21T09:09:01.9146259Z add.s32 %r367, %r42, %r366; 2026-02-21T09:09:01.9146315Z add.s32 %r463, %r367, 43008; 2026-02-21T09:09:01.9146373Z $L__tmp12: 2026-02-21T09:09:01.9146589Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9146647Z // begin inline asm 2026-02-21T09:09:01.9146718Z fence.proxy.async.shared::cta; 2026-02-21T09:09:01.9146781Z // end inline asm 2026-02-21T09:09:01.9146834Z bar.sync 0; 2026-02-21T09:09:01.9146892Z @%p55 bra $L__BB0_6; 2026-02-21T09:09:01.9147001Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:01.9147065Z elect.sync %r392|%p83, -1; 2026-02-21T09:09:01.9147123Z mov.b32 %r370, 134744336; 2026-02-21T09:09:01.9147181Z // begin inline asm 2026-02-21T09:09:01.9147344Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 0 ], %rd85, %r370, %p82; 2026-02-21T09:09:01.9147399Z // end inline asm 2026-02-21T09:09:01.9147454Z // begin inline asm 2026-02-21T09:09:01.9147608Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 8 ], %rd86, %r370, %p82; 2026-02-21T09:09:01.9147663Z // end inline asm 2026-02-21T09:09:01.9147718Z // begin inline asm 2026-02-21T09:09:01.9147871Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 16 ], %rd87, %r370, %p82; 2026-02-21T09:09:01.9147924Z // end inline asm 2026-02-21T09:09:01.9147979Z // begin inline asm 2026-02-21T09:09:01.9148127Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 24 ], %rd88, %r370, %p82; 2026-02-21T09:09:01.9148181Z // end inline asm 2026-02-21T09:09:01.9148236Z // begin inline asm 
2026-02-21T09:09:01.9148375Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 32 ], %rd89, %r370, %p82; 2026-02-21T09:09:01.9148473Z // end inline asm 2026-02-21T09:09:01.9148529Z // begin inline asm 2026-02-21T09:09:01.9148672Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 40 ], %rd90, %r370, %p82; 2026-02-21T09:09:01.9148732Z // end inline asm 2026-02-21T09:09:01.9148787Z // begin inline asm 2026-02-21T09:09:01.9148929Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 48 ], %rd91, %r370, %p82; 2026-02-21T09:09:01.9149013Z // end inline asm 2026-02-21T09:09:01.9149068Z // begin inline asm 2026-02-21T09:09:01.9149206Z @%p83 tcgen05.mma.cta_group::1.kind::tf32 [ %r461 + 0 ], [ %r244 + 56 ], %rd92, %r370, %p82; 2026-02-21T09:09:01.9149287Z // end inline asm 2026-02-21T09:09:01.9149348Z cvt.u64.u32 %rd113, %r463; 2026-02-21T09:09:01.9149403Z // begin inline asm 2026-02-21T09:09:01.9149527Z @%p83 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd113]; 2026-02-21T09:09:01.9149587Z // end inline asm 2026-02-21T09:09:01.9149643Z bra.uni $L__BB0_6; 2026-02-21T09:09:01.9149740Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:09:01.9149836Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:01.9149892Z mov.b32 %r411, 1; 2026-02-21T09:09:01.9150109Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9150184Z // begin inline asm 2026-02-21T09:09:01.9150241Z 2026-02-21T09:09:01.9150292Z { 2026-02-21T09:09:01.9150352Z .reg .pred complete; 2026-02-21T09:09:01.9150412Z waitLoop: 2026-02-21T09:09:01.9150528Z mbarrier.try_wait.parity.shared.b64 complete, [%r463], %r411; 2026-02-21T09:09:01.9150592Z @!complete bra.uni waitLoop; 2026-02-21T09:09:01.9150642Z } 2026-02-21T09:09:01.9150652Z 2026-02-21T09:09:01.9150706Z // end inline asm 2026-02-21T09:09:01.9150759Z $L__tmp13: 2026-02-21T09:09:01.9150925Z .loc 1 51 70 // 
cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:51:70 2026-02-21T09:09:01.9150996Z cp.async.wait_group 0; 2026-02-21T09:09:01.9151050Z bar.sync 0; 2026-02-21T09:09:01.9151106Z // begin inline asm 2026-02-21T09:09:01.9151200Z @%p108 mbarrier.inval.shared::cta.b64 [%r278]; 2026-02-21T09:09:01.9151254Z // end inline asm 2026-02-21T09:09:01.9151308Z bar.sync 0; 2026-02-21T09:09:01.9151363Z // begin inline asm 2026-02-21T09:09:01.9151456Z @%p108 mbarrier.inval.shared::cta.b64 [%r87]; 2026-02-21T09:09:01.9151512Z // end inline asm 2026-02-21T09:09:01.9151599Z add.s32 %r414, %r42, 43008; 2026-02-21T09:09:01.9151663Z // begin inline asm 2026-02-21T09:09:01.9151747Z @%p108 mbarrier.inval.shared::cta.b64 [%r414]; 2026-02-21T09:09:01.9151805Z // end inline asm 2026-02-21T09:09:01.9151870Z bar.sync 0; 2026-02-21T09:09:01.9151926Z // begin inline asm 2026-02-21T09:09:01.9152009Z @%p108 mbarrier.inval.shared::cta.b64 [%r85]; 2026-02-21T09:09:01.9152067Z // end inline asm 2026-02-21T09:09:01.9152132Z $L__tmp14: 2026-02-21T09:09:01.9152358Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9152416Z // begin inline asm 2026-02-21T09:09:01.9152701Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r416, %r417, %r418, %r419, %r420, %r421, %r422, %r423, %r424, %r425, %r426, %r427, %r428, %r429, %r430, %r431}, [%r432 + 0]; 2026-02-21T09:09:01.9152758Z // end inline asm 2026-02-21T09:09:01.9152818Z // begin inline asm 2026-02-21T09:09:01.9152897Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:01.9152952Z // end inline asm 2026-02-21T09:09:01.9153014Z cvt.u64.u32 %rd121, %r416; 2026-02-21T09:09:01.9153076Z cvt.u64.u32 %rd122, %r417; 2026-02-21T09:09:01.9153144Z shl.b64 %rd123, %rd122, 32; 2026-02-21T09:09:01.9153208Z or.b64 %rd124, %rd121, %rd123; 2026-02-21T09:09:01.9153263Z $L__tmp15: 2026-02-21T09:09:01.9153452Z .loc 1 97 28 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:97:28 
2026-02-21T09:09:01.9153546Z mov.b64 {%r436, %r437}, %rd124; 2026-02-21T09:09:01.9153620Z cvt.rn.bf16x2.f32 %r438, %r437, %r436; 2026-02-21T09:09:01.9153673Z $L__tmp16: 2026-02-21T09:09:01.9153902Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9153963Z cvt.u64.u32 %rd125, %r418; 2026-02-21T09:09:01.9154024Z cvt.u64.u32 %rd126, %r419; 2026-02-21T09:09:01.9154093Z shl.b64 %rd127, %rd126, 32; 2026-02-21T09:09:01.9154179Z or.b64 %rd128, %rd125, %rd127; 2026-02-21T09:09:01.9154235Z $L__tmp17: 2026-02-21T09:09:01.9154441Z .loc 1 97 28 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:97:28 2026-02-21T09:09:01.9154507Z mov.b64 {%r439, %r440}, %rd128; 2026-02-21T09:09:01.9154579Z cvt.rn.bf16x2.f32 %r441, %r440, %r439; 2026-02-21T09:09:01.9154632Z $L__tmp18: 2026-02-21T09:09:01.9154859Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9154922Z cvt.u64.u32 %rd129, %r420; 2026-02-21T09:09:01.9154981Z cvt.u64.u32 %rd130, %r421; 2026-02-21T09:09:01.9155048Z shl.b64 %rd131, %rd130, 32; 2026-02-21T09:09:01.9155109Z or.b64 %rd132, %rd129, %rd131; 2026-02-21T09:09:01.9155163Z $L__tmp19: 2026-02-21T09:09:01.9155337Z .loc 1 97 28 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:97:28 2026-02-21T09:09:01.9155428Z mov.b64 {%r442, %r443}, %rd132; 2026-02-21T09:09:01.9155502Z cvt.rn.bf16x2.f32 %r444, %r443, %r442; 2026-02-21T09:09:01.9155555Z $L__tmp20: 2026-02-21T09:09:01.9155783Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9155844Z cvt.u64.u32 %rd133, %r422; 2026-02-21T09:09:01.9155901Z cvt.u64.u32 %rd134, %r423; 2026-02-21T09:09:01.9155969Z shl.b64 %rd135, %rd134, 32; 2026-02-21T09:09:01.9156028Z or.b64 %rd136, %rd133, %rd135; 2026-02-21T09:09:01.9156083Z $L__tmp21: 2026-02-21T09:09:01.9156255Z .loc 1 97 28 // 
cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:97:28 2026-02-21T09:09:01.9156320Z mov.b64 {%r445, %r446}, %rd136; 2026-02-21T09:09:01.9156387Z cvt.rn.bf16x2.f32 %r447, %r446, %r445; 2026-02-21T09:09:01.9156439Z $L__tmp22: 2026-02-21T09:09:01.9156663Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9156725Z cvt.u64.u32 %rd137, %r424; 2026-02-21T09:09:01.9156783Z cvt.u64.u32 %rd138, %r425; 2026-02-21T09:09:01.9156846Z shl.b64 %rd139, %rd138, 32; 2026-02-21T09:09:01.9156907Z or.b64 %rd140, %rd137, %rd139; 2026-02-21T09:09:01.9156960Z $L__tmp23: 2026-02-21T09:09:01.9157134Z .loc 1 97 28 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:97:28 2026-02-21T09:09:01.9157201Z mov.b64 {%r448, %r449}, %rd140; 2026-02-21T09:09:01.9157268Z cvt.rn.bf16x2.f32 %r450, %r449, %r448; 2026-02-21T09:09:01.9157323Z $L__tmp24: 2026-02-21T09:09:01.9157543Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9157603Z cvt.u64.u32 %rd141, %r426; 2026-02-21T09:09:01.9157661Z cvt.u64.u32 %rd142, %r427; 2026-02-21T09:09:01.9157724Z shl.b64 %rd143, %rd142, 32; 2026-02-21T09:09:01.9157784Z or.b64 %rd144, %rd141, %rd143; 2026-02-21T09:09:01.9157838Z $L__tmp25: 2026-02-21T09:09:01.9158011Z .loc 1 97 28 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:97:28 2026-02-21T09:09:01.9158078Z mov.b64 {%r451, %r452}, %rd144; 2026-02-21T09:09:01.9158147Z cvt.rn.bf16x2.f32 %r453, %r452, %r451; 2026-02-21T09:09:01.9158199Z $L__tmp26: 2026-02-21T09:09:01.9158425Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9158484Z cvt.u64.u32 %rd145, %r428; 2026-02-21T09:09:01.9158570Z cvt.u64.u32 %rd146, %r429; 2026-02-21T09:09:01.9158630Z shl.b64 %rd147, %rd146, 32; 2026-02-21T09:09:01.9158697Z or.b64 %rd148, %rd145, %rd147; 2026-02-21T09:09:01.9158749Z $L__tmp27: 
2026-02-21T09:09:01.9158921Z .loc 1 97 28 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:97:28 2026-02-21T09:09:01.9158989Z mov.b64 {%r454, %r455}, %rd148; 2026-02-21T09:09:01.9159058Z cvt.rn.bf16x2.f32 %r456, %r455, %r454; 2026-02-21T09:09:01.9159137Z $L__tmp28: 2026-02-21T09:09:01.9159358Z .loc 2 291 36 // standard.py:291:36 @[ cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:94:40 ] 2026-02-21T09:09:01.9159416Z cvt.u64.u32 %rd149, %r430; 2026-02-21T09:09:01.9159496Z cvt.u64.u32 %rd150, %r431; 2026-02-21T09:09:01.9159560Z shl.b64 %rd151, %rd150, 32; 2026-02-21T09:09:01.9159627Z or.b64 %rd152, %rd149, %rd151; 2026-02-21T09:09:01.9159679Z $L__tmp29: 2026-02-21T09:09:01.9159852Z .loc 1 97 28 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:97:28 2026-02-21T09:09:01.9159923Z mov.b64 {%r457, %r458}, %rd152; 2026-02-21T09:09:01.9159990Z cvt.rn.bf16x2.f32 %r459, %r458, %r457; 2026-02-21T09:09:01.9160183Z .loc 1 98 43 // cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:98:43 2026-02-21T09:09:01.9160282Z st.shared.v4.b32 [%r20], {%r438, %r441, %r444, %r447}; 2026-02-21T09:09:01.9160393Z st.shared.v4.b32 [%r21], {%r450, %r453, %r456, %r459}; 2026-02-21T09:09:01.9160451Z // begin inline asm 2026-02-21T09:09:01.9160522Z fence.proxy.async.shared::cta; 2026-02-21T09:09:01.9160581Z // end inline asm 2026-02-21T09:09:01.9160633Z bar.sync 0; 2026-02-21T09:09:01.9160700Z elect.sync %r460|%p114, -1; 2026-02-21T09:09:01.9160771Z and.pred %p112, %p1, %p114; 2026-02-21T09:09:01.9160827Z // begin inline asm 2026-02-21T09:09:01.9161009Z @%p112 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd120, {%r433, %r434}], [%r42]; 2026-02-21T09:09:01.9161069Z // end inline asm 2026-02-21T09:09:01.9161134Z cp.async.bulk.commit_group; 2026-02-21T09:09:01.9161200Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:01.9161249Z bar.sync 0; 2026-02-21T09:09:01.9161331Z $L__BB0_8: // %._crit_edge 2026-02-21T09:09:01.9161491Z .loc 1 31 4 // 
cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py:31:4 2026-02-21T09:09:01.9161566Z bar.sync 0; 2026-02-21T09:09:01.9161625Z // begin inline asm 2026-02-21T09:09:01.9161741Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r461, 128; 2026-02-21T09:09:01.9161793Z // end inline asm 2026-02-21T09:09:01.9161845Z ret; 2026-02-21T09:09:01.9161906Z $L__tmp30: 2026-02-21T09:09:01.9161962Z $L__func_end0: 2026-02-21T09:09:01.9162046Z // -- End function 2026-02-21T09:09:01.9162105Z } 2026-02-21T09:09:01.9162311Z .file 1 "/tmp/torchinductor_root/c6/cc6apfebfrb5hwnfzepezjic56f7xif6cbfxl5riawknkcbqbgav.py" 2026-02-21T09:09:01.9162486Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:01.9162559Z .section .debug_abbrev 2026-02-21T09:09:01.9162610Z { 2026-02-21T09:09:01.9162699Z .b8 1 // Abbreviation Code 2026-02-21T09:09:01.9162786Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:01.9162875Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:01.9162958Z .b8 37 // DW_AT_producer 2026-02-21T09:09:01.9163033Z .b8 8 // DW_FORM_string 2026-02-21T09:09:01.9163111Z .b8 19 // DW_AT_language 2026-02-21T09:09:01.9163188Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:01.9163263Z .b8 3 // DW_AT_name 2026-02-21T09:09:01.9163339Z .b8 8 // DW_FORM_string 2026-02-21T09:09:01.9163413Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:01.9163526Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:01.9163601Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:01.9163675Z .b8 8 // DW_FORM_string 2026-02-21T09:09:01.9163744Z .b8 0 // EOM(1) 2026-02-21T09:09:01.9163811Z .b8 0 // EOM(2) 2026-02-21T09:09:01.9163899Z .b8 2 // Abbreviation Code 2026-02-21T09:09:01.9164008Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:01.9164080Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:01.9164183Z .b8 3 // DW_AT_name 2026-02-21T09:09:01.9164255Z .b8 8 // DW_FORM_string 2026-02-21T09:09:01.9164329Z .b8 32 // DW_AT_inline 2026-02-21T09:09:01.9164405Z .b8 11 // DW_FORM_data1 
2026-02-21T09:09:01.9164477Z .b8 0 // EOM(1) 2026-02-21T09:09:01.9164541Z .b8 0 // EOM(2) 2026-02-21T09:09:01.9164618Z .b8 3 // Abbreviation Code 2026-02-21T09:09:01.9164700Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:01.9164776Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:01.9164872Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:01.9164952Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:01.9165025Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:01.9165099Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:01.9165183Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:01.9165256Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:01.9165320Z .b8 0 // EOM(1) 2026-02-21T09:09:01.9165386Z .b8 0 // EOM(2) 2026-02-21T09:09:01.9165467Z .b8 4 // Abbreviation Code 2026-02-21T09:09:01.9165559Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:01.9165633Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:01.9165724Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:01.9165797Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:01.9165869Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:01.9165946Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:01.9166021Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:01.9166090Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:01.9166166Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:01.9166245Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:01.9166320Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:01.9166391Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:01.9166475Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:01.9166546Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:01.9166615Z .b8 0 // EOM(1) 2026-02-21T09:09:01.9166690Z .b8 0 // EOM(2) 2026-02-21T09:09:01.9166756Z .b8 0 // EOM(3) 2026-02-21T09:09:01.9166808Z } 2026-02-21T09:09:01.9166867Z .section .debug_info 2026-02-21T09:09:01.9166927Z { 2026-02-21T09:09:01.9167008Z .b32 178 // Length of Unit 2026-02-21T09:09:01.9167092Z .b8 2 // DWARF version number 2026-02-21T09:09:01.9167154Z .b8 0 2026-02-21T09:09:01.9167271Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:01.9167381Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:01.9167488Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:01.9167569Z .b8 116 // DW_AT_producer 2026-02-21T09:09:01.9167623Z .b8 114 2026-02-21T09:09:01.9167676Z .b8 105 2026-02-21T09:09:01.9167735Z .b8 116 2026-02-21T09:09:01.9167786Z .b8 111 2026-02-21T09:09:01.9167861Z .b8 110 2026-02-21T09:09:01.9167920Z .b8 0 2026-02-21T09:09:01.9167993Z .b8 2 // DW_AT_language 2026-02-21T09:09:01.9168043Z .b8 0 2026-02-21T09:09:01.9168137Z .b8 99 // DW_AT_name 2026-02-21T09:09:01.9168198Z .b8 99 2026-02-21T09:09:01.9168251Z .b8 54 2026-02-21T09:09:01.9168302Z .b8 97 2026-02-21T09:09:01.9168362Z .b8 112 2026-02-21T09:09:01.9168413Z .b8 102 2026-02-21T09:09:01.9168465Z .b8 101 2026-02-21T09:09:01.9168518Z .b8 98 2026-02-21T09:09:01.9168580Z .b8 102 2026-02-21T09:09:01.9168632Z .b8 114 2026-02-21T09:09:01.9168682Z .b8 98 2026-02-21T09:09:01.9168740Z .b8 53 2026-02-21T09:09:01.9168790Z .b8 104 2026-02-21T09:09:01.9168841Z .b8 119 2026-02-21T09:09:01.9168891Z .b8 110 2026-02-21T09:09:01.9168947Z .b8 102 2026-02-21T09:09:01.9168996Z .b8 122 2026-02-21T09:09:01.9169046Z .b8 101 2026-02-21T09:09:01.9169096Z .b8 112 2026-02-21T09:09:01.9169152Z .b8 101 2026-02-21T09:09:01.9169202Z .b8 122 2026-02-21T09:09:01.9169275Z .b8 106 2026-02-21T09:09:01.9169333Z .b8 105 2026-02-21T09:09:01.9169385Z .b8 99 2026-02-21T09:09:01.9169434Z .b8 53 2026-02-21T09:09:01.9169484Z .b8 54 2026-02-21T09:09:01.9169540Z .b8 102 2026-02-21T09:09:01.9169590Z .b8 55 2026-02-21T09:09:01.9169641Z .b8 120 2026-02-21T09:09:01.9169697Z .b8 105 2026-02-21T09:09:01.9169746Z .b8 102 2026-02-21T09:09:01.9169795Z .b8 54 2026-02-21T09:09:01.9169845Z .b8 99 2026-02-21T09:09:01.9169902Z .b8 98 2026-02-21T09:09:01.9169951Z .b8 102 2026-02-21T09:09:01.9170000Z .b8 120 2026-02-21T09:09:01.9170052Z .b8 108 2026-02-21T09:09:01.9170109Z .b8 53 2026-02-21T09:09:01.9170158Z .b8 114 2026-02-21T09:09:01.9170208Z .b8 105 
2026-02-21T09:09:01.9170265Z .b8 97 2026-02-21T09:09:01.9170314Z .b8 119 2026-02-21T09:09:01.9170364Z .b8 107 2026-02-21T09:09:01.9170413Z .b8 110 2026-02-21T09:09:01.9170471Z .b8 107 2026-02-21T09:09:01.9170520Z .b8 99 2026-02-21T09:09:01.9170570Z .b8 98 2026-02-21T09:09:01.9170626Z .b8 113 2026-02-21T09:09:01.9170676Z .b8 98 2026-02-21T09:09:01.9170729Z .b8 103 2026-02-21T09:09:01.9170779Z .b8 97 2026-02-21T09:09:01.9170837Z .b8 118 2026-02-21T09:09:01.9170886Z .b8 46 2026-02-21T09:09:01.9170935Z .b8 112 2026-02-21T09:09:01.9170991Z .b8 121 2026-02-21T09:09:01.9171040Z .b8 0 2026-02-21T09:09:01.9171130Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:01.9171205Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:01.9171264Z .b8 116 2026-02-21T09:09:01.9171313Z .b8 109 2026-02-21T09:09:01.9171362Z .b8 112 2026-02-21T09:09:01.9171421Z .b8 47 2026-02-21T09:09:01.9171469Z .b8 116 2026-02-21T09:09:01.9171517Z .b8 111 2026-02-21T09:09:01.9171683Z .b8 114 2026-02-21T09:09:01.9171741Z .b8 99 2026-02-21T09:09:01.9171791Z .b8 104 2026-02-21T09:09:01.9171842Z .b8 105 2026-02-21T09:09:01.9171893Z .b8 110 2026-02-21T09:09:01.9171950Z .b8 100 2026-02-21T09:09:01.9172001Z .b8 117 2026-02-21T09:09:01.9172050Z .b8 99 2026-02-21T09:09:01.9172107Z .b8 116 2026-02-21T09:09:01.9172158Z .b8 111 2026-02-21T09:09:01.9172211Z .b8 114 2026-02-21T09:09:01.9172263Z .b8 95 2026-02-21T09:09:01.9172319Z .b8 114 2026-02-21T09:09:01.9172370Z .b8 111 2026-02-21T09:09:01.9172419Z .b8 111 2026-02-21T09:09:01.9172477Z .b8 116 2026-02-21T09:09:01.9172527Z .b8 47 2026-02-21T09:09:01.9172579Z .b8 99 2026-02-21T09:09:01.9172630Z .b8 54 2026-02-21T09:09:01.9172688Z .b8 0 2026-02-21T09:09:01.9172789Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:01.9172863Z .b8 95 // DW_AT_name 2026-02-21T09:09:01.9172950Z .b8 104 2026-02-21T09:09:01.9173001Z .b8 101 2026-02-21T09:09:01.9173051Z .b8 108 2026-02-21T09:09:01.9173099Z .b8 105 2026-02-21T09:09:01.9173157Z .b8 111 2026-02-21T09:09:01.9173207Z 
.b8 110 2026-02-21T09:09:01.9173257Z .b8 95 2026-02-21T09:09:01.9173314Z .b8 109 2026-02-21T09:09:01.9173365Z .b8 97 2026-02-21T09:09:01.9173413Z .b8 116 2026-02-21T09:09:01.9173464Z .b8 109 2026-02-21T09:09:01.9173522Z .b8 117 2026-02-21T09:09:01.9173574Z .b8 108 2026-02-21T09:09:01.9173652Z .b8 95 2026-02-21T09:09:01.9173702Z .b8 98 2026-02-21T09:09:01.9173761Z .b8 102 2026-02-21T09:09:01.9173810Z .b8 49 2026-02-21T09:09:01.9173861Z .b8 54 2026-02-21T09:09:01.9173919Z .b8 95 2026-02-21T09:09:01.9173972Z .b8 105 2026-02-21T09:09:01.9174057Z .b8 110 2026-02-21T09:09:01.9174110Z .b8 116 2026-02-21T09:09:01.9174172Z .b8 52 2026-02-21T09:09:01.9174223Z .b8 0 2026-02-21T09:09:01.9174298Z .b8 1 // DW_AT_inline 2026-02-21T09:09:01.9174401Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:01.9174486Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:01.9174573Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:01.9174661Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:01.9174779Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:01.9174897Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:01.9174981Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:01.9175071Z .b64 $L__tmp29 // DW_AT_high_pc 2026-02-21T09:09:01.9175149Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:01.9175226Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:01.9175312Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:01.9175395Z .b8 0 // End Of Children Mark 2026-02-21T09:09:01.9175476Z .b8 0 // End Of Children Mark 2026-02-21T09:09:01.9175533Z } 2026-02-21T09:09:01.9175599Z .section .debug_macinfo { } 2026-02-21T09:09:01.9175603Z 2026-02-21T09:09:01.9175676Z ================================================================ 2026-02-21T09:09:01.9175777Z please share the reproducer above with Triton project. 2026-02-21T09:09:02.2062145Z [129s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:02.2062422Z 2026-02-21T09:09:02.2064868Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:02.2066062Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:02.2066306Z `ptxas` stderr: 2026-02-21T09:09:02.2066756Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 193 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:02.2067255Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:02.2067408Z 2026-02-21T09:09:02.2067800Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpw8ez7rlg.ptx -o /tmp/tmpw8ez7rlg.ptx.o 2026-02-21T09:09:02.2068257Z 2026-02-21T09:09:02.2068391Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:02.2068666Z 2026-02-21T09:09:02.2068669Z 2026-02-21T09:09:02.2068771Z ================================================================ 2026-02-21T09:09:02.2069281Z Internal Triton PTX codegen error 2026-02-21T09:09:02.2069472Z `ptxas` stderr: 2026-02-21T09:09:02.2069893Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 193 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:02.2070376Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:02.2070521Z 2026-02-21T09:09:02.2070900Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpw8ez7rlg.ptx -o /tmp/tmpw8ez7rlg.ptx.o 2026-02-21T09:09:02.2071393Z 2026-02-21T09:09:02.2071396Z 2026-02-21T09:09:02.2071451Z // 2026-02-21T09:09:02.2071698Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:02.2071878Z // 2026-02-21T09:09:02.2071945Z 2026-02-21T09:09:02.2072003Z .version 8.7 2026-02-21T09:09:02.2072149Z .target sm_100a 2026-02-21T09:09:02.2072286Z .address_size 64 2026-02-21T09:09:02.2072369Z 2026-02-21T09:09:02.2072520Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:02.2072805Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:02.2073029Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:02.2073243Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:02.2073492Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:02.2073829Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:02.2074108Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:02.2074386Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:02.2074661Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:02.2074884Z ) 2026-02-21T09:09:02.2075002Z .reqntid 256 2026-02-21T09:09:02.2075138Z .maxnreg 32 2026-02-21T09:09:02.2075260Z { 2026-02-21T09:09:02.2075394Z .reg .pred %p<47>; 2026-02-21T09:09:02.2075550Z .reg .b16 %rs<19>; 2026-02-21T09:09:02.2075689Z .reg .b32 %r<215>; 2026-02-21T09:09:02.2075831Z .reg .b64 %rd<86>; 2026-02-21T09:09:02.2075968Z $L__func_begin0: 2026-02-21T09:09:02.2076049Z 2026-02-21T09:09:02.2076107Z // %bb.0: 2026-02-21T09:09:02.2076342Z .loc 1 19 0 // 
cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:19 2026-02-21T09:09:02.2076639Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:02.2076793Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:02.2077006Z ld.param.b64 %rd15, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:02.2077230Z mov.b32 %r33, global_smem; 2026-02-21T09:09:02.2077386Z // begin inline asm 2026-02-21T09:09:02.2077623Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r33], 64; 2026-02-21T09:09:02.2077870Z // end inline asm 2026-02-21T09:09:02.2078054Z ld.param.b64 %rd32, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:02.2078256Z bar.sync 0; 2026-02-21T09:09:02.2078407Z ld.shared.b32 %r209, [global_smem]; 2026-02-21T09:09:02.2078580Z bar.sync 0; 2026-02-21T09:09:02.2078717Z // begin inline asm 2026-02-21T09:09:02.2078925Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:02.2079153Z // end inline asm 2026-02-21T09:09:02.2079415Z .loc 1 21 67 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:21:67 2026-02-21T09:09:02.2079709Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:09:02.2079879Z mov.u32 %r42, %ctaid.y; 2026-02-21T09:09:02.2080034Z mov.u32 %r43, %ctaid.z; 2026-02-21T09:09:02.2080193Z mov.u32 %r44, %nctaid.x; 2026-02-21T09:09:02.2080347Z mov.u32 %r45, %nctaid.y; 2026-02-21T09:09:02.2080512Z mad.lo.s32 %r46, %r43, %r45, %r42; 2026-02-21T09:09:02.2080699Z mad.lo.s32 %r47, %r46, %r44, %r3; 2026-02-21T09:09:02.2080868Z shl.b32 %r48, %r47, 7; 2026-02-21T09:09:02.2081027Z cvt.s64.s32 %rd33, %r48; 2026-02-21T09:09:02.2081184Z add.s64 %rd29, %rd32, %rd33; 2026-02-21T09:09:02.2081350Z shl.b32 %r49, %r1, 2; 2026-02-21T09:09:02.2081585Z add.s32 %r34, %r33, %r49; 2026-02-21T09:09:02.2081747Z mov.b32 %r51, 0; 2026-02-21T09:09:02.2081885Z // begin inline asm 2026-02-21T09:09:02.2082049Z @%p1 st.shared.b32 [ %r34 + 0 ], %r51; 2026-02-21T09:09:02.2082222Z // end inline asm 2026-02-21T09:09:02.2082381Z bar.warp.sync -1; 2026-02-21T09:09:02.2082538Z setp.eq.b32 %p41, 
%r1, 0; 2026-02-21T09:09:02.2082702Z cvt.u64.u32 %rd14, %r33; 2026-02-21T09:09:02.2082871Z // begin inline asm 2026-02-21T09:09:02.2083164Z @%p41 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd14 + 0 ], %rd15; 2026-02-21T09:09:02.2083449Z // end inline asm 2026-02-21T09:09:02.2083585Z // begin inline asm 2026-02-21T09:09:02.2083842Z @%p41 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x1; 2026-02-21T09:09:02.2084101Z // end inline asm 2026-02-21T09:09:02.2084235Z mov.b32 %r36, 32; 2026-02-21T09:09:02.2084379Z // begin inline asm 2026-02-21T09:09:02.2084613Z @%p41 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x0, %r36; 2026-02-21T09:09:02.2084880Z // end inline asm 2026-02-21T09:09:02.2085014Z mov.b32 %r37, 128; 2026-02-21T09:09:02.2085158Z // begin inline asm 2026-02-21T09:09:02.2085390Z @%p41 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x1, %r37; 2026-02-21T09:09:02.2085648Z // end inline asm 2026-02-21T09:09:02.2085790Z mov.b32 %r38, 8192; 2026-02-21T09:09:02.2085930Z // begin inline asm 2026-02-21T09:09:02.2086203Z @%p41 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x0, %r38; 2026-02-21T09:09:02.2086476Z // end inline asm 2026-02-21T09:09:02.2086616Z mov.b32 %r39, 4096; 2026-02-21T09:09:02.2086759Z // begin inline asm 2026-02-21T09:09:02.2087008Z @%p41 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x1, %r39; 2026-02-21T09:09:02.2087287Z // end inline asm 2026-02-21T09:09:02.2087429Z mov.b64 %rd22, 16384; 2026-02-21T09:09:02.2087583Z // begin inline asm 2026-02-21T09:09:02.2087836Z @%p41 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd14 + 0 ], 0x0, %rd22; 2026-02-21T09:09:02.2088130Z // end inline asm 2026-02-21T09:09:02.2088264Z mov.b32 %r40, 1; 2026-02-21T09:09:02.2088408Z // begin inline asm 2026-02-21T09:09:02.2088665Z @%p41 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x0, %r40; 
2026-02-21T09:09:02.2088958Z // end inline asm 2026-02-21T09:09:02.2089102Z // begin inline asm 2026-02-21T09:09:02.2089357Z @%p41 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x1, %r40; 2026-02-21T09:09:02.2089650Z // end inline asm 2026-02-21T09:09:02.2089785Z // begin inline asm 2026-02-21T09:09:02.2090028Z @%p41 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd14 + 0 ], 0xa; 2026-02-21T09:09:02.2090290Z // end inline asm 2026-02-21T09:09:02.2090434Z // begin inline asm 2026-02-21T09:09:02.2090712Z @%p41 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x0; 2026-02-21T09:09:02.2091010Z // end inline asm 2026-02-21T09:09:02.2091163Z // begin inline asm 2026-02-21T09:09:02.2091412Z @%p41 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x2; 2026-02-21T09:09:02.2091735Z // end inline asm 2026-02-21T09:09:02.2091874Z // begin inline asm 2026-02-21T09:09:02.2092117Z @%p41 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd14 + 0 ], 0x0; 2026-02-21T09:09:02.2092394Z // end inline asm 2026-02-21T09:09:02.2092539Z // begin inline asm 2026-02-21T09:09:02.2092907Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd29 + 0 ], [ %rd14 + 0 ], 0x80; 2026-02-21T09:09:02.2093297Z // end inline asm 2026-02-21T09:09:02.2093445Z // begin inline asm 2026-02-21T09:09:02.2093661Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd29 + 0 ], 0x80; 2026-02-21T09:09:02.2093928Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:02.2094125Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:02.2094323Z // end inline asm 2026-02-21T09:09:02.2094496Z bar.sync 0; 2026-02-21T09:09:02.2094644Z cvta.global.u64 %rd50, %rd29; 2026-02-21T09:09:02.2094939Z .loc 1 29 88 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:29:88 2026-02-21T09:09:02.2095245Z setp.gt.u32 %p21, %r3, 8191; 2026-02-21T09:09:02.2095422Z @%p21 bra $L__BB0_8; 
2026-02-21T09:09:02.2095593Z // %bb.1: // %.lr.ph 2026-02-21T09:09:02.2095908Z .loc 1 0 88 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:0:88 2026-02-21T09:09:02.2096273Z ld.param.b64 %rd13, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:02.2096529Z ld.param.b64 %rd12, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:02.2096775Z and.b32 %r4, %r1, 32; 2026-02-21T09:09:02.2097038Z .loc 1 81 38 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:81:38 2026-02-21T09:09:02.2097343Z setp.eq.b32 %p26, %r4, 0; 2026-02-21T09:09:02.2097615Z .loc 1 57 38 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:57:38 2026-02-21T09:09:02.2097910Z and.b32 %r5, %r49, 4; 2026-02-21T09:09:02.2098173Z .loc 1 43 45 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:43:45 2026-02-21T09:09:02.2098459Z bfe.u32 %r6, %r1, 1, 7; 2026-02-21T09:09:02.2098724Z .loc 1 41 45 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:41:45 2026-02-21T09:09:02.2099043Z and.b32 %r7, %r1, 31; 2026-02-21T09:09:02.2099196Z shr.u32 %r79, %r1, 5; 2026-02-21T09:09:02.2099337Z shl.b32 %r80, %r1, 3; 2026-02-21T09:09:02.2099483Z and.b32 %r8, %r80, 2040; 2026-02-21T09:09:02.2099635Z add.s32 %r124, %r33, %r8; 2026-02-21T09:09:02.2099793Z add.s32 %r119, %r209, 32; 2026-02-21T09:09:02.2099947Z and.b32 %r82, %r1, 127; 2026-02-21T09:09:02.2100095Z shl.b32 %r83, %r82, 4; 2026-02-21T09:09:02.2100248Z and.b32 %r84, %r1, 128; 2026-02-21T09:09:02.2100392Z shr.u32 %r85, %r84, 4; 2026-02-21T09:09:02.2100547Z or.b32 %r11, %r83, %r85; 2026-02-21T09:09:02.2100695Z shl.b32 %r86, %r1, 5; 2026-02-21T09:09:02.2100843Z and.b32 %r87, %r86, 864; 2026-02-21T09:09:02.2100987Z shr.u32 %r88, %r1, 3; 2026-02-21T09:09:02.2101136Z and.b32 %r89, %r88, 28; 2026-02-21T09:09:02.2101285Z bfe.s32 %r90, %r1, 2, 1; 2026-02-21T09:09:02.2101438Z and.b32 %r91, %r90, 144; 2026-02-21T09:09:02.2101632Z xor.b32 %r92, %r91, %r89; 2026-02-21T09:09:02.2101786Z add.s32 %r93, %r33, 4096; 2026-02-21T09:09:02.2101944Z 
add.s32 %r94, %r93, %r87; 2026-02-21T09:09:02.2102092Z add.s32 %r12, %r94, %r92; 2026-02-21T09:09:02.2102248Z bfe.u32 %r95, %r93, 4, 14; 2026-02-21T09:09:02.2102404Z cvt.u64.u32 %rd37, %r95; 2026-02-21T09:09:02.2102578Z or.b64 %rd39, %rd37, -4611685949703716864; 2026-02-21T09:09:02.2102762Z shl.b32 %r96, %r82, 6; 2026-02-21T09:09:02.2102921Z and.b32 %r97, %r80, 48; 2026-02-21T09:09:02.2103071Z shr.u32 %r98, %r84, 2; 2026-02-21T09:09:02.2103232Z xor.b32 %r99, %r97, %r98; 2026-02-21T09:09:02.2103391Z or.b32 %r100, %r99, %r96; 2026-02-21T09:09:02.2103542Z xor.b32 %r101, %r100, 16; 2026-02-21T09:09:02.2103697Z add.s32 %r102, %r33, %r11; 2026-02-21T09:09:02.2103849Z shl.b32 %r103, %r1, 7; 2026-02-21T09:09:02.2104003Z and.b32 %r15, %r103, 24576; 2026-02-21T09:09:02.2104157Z add.s32 %r71, %r124, 2048; 2026-02-21T09:09:02.2104416Z .loc 1 40 27 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:40:27 2026-02-21T09:09:02.2104693Z shl.b32 %r104, %r3, 5; 2026-02-21T09:09:02.2104846Z and.b32 %r16, %r104, 480; 2026-02-21T09:09:02.2104994Z and.b32 %r17, %r3, 7680; 2026-02-21T09:09:02.2105146Z or.b32 %r181, %r16, %r17; 2026-02-21T09:09:02.2105404Z .loc 1 41 32 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:41:32 2026-02-21T09:09:02.2105676Z or.b32 %r105, %r181, %r7; 2026-02-21T09:09:02.2105931Z .loc 1 42 27 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:42:27 2026-02-21T09:09:02.2106205Z shl.b32 %r106, %r3, 3; 2026-02-21T09:09:02.2106387Z and.b32 %r182, %r106, 3968; 2026-02-21T09:09:02.2106642Z .loc 1 43 32 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:43:32 2026-02-21T09:09:02.2106926Z or.b32 %r107, %r182, %r6; 2026-02-21T09:09:02.2107180Z .loc 1 58 53 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:53 2026-02-21T09:09:02.2107456Z shl.b32 %r108, %r107, 10; 2026-02-21T09:09:02.2107610Z $L__tmp0: 2026-02-21T09:09:02.2107919Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 
] 2026-02-21T09:09:02.2108263Z shfl.sync.idx.b32 %r20, %r79, 0, 31, -1; 2026-02-21T09:09:02.2108484Z shl.b32 %r109, %r20, 21; 2026-02-21T09:09:02.2108641Z and.b32 %r110, %r109, 6291456; 2026-02-21T09:09:02.2108806Z add.s32 %r111, %r110, %r209; 2026-02-21T09:09:02.2108957Z and.b32 %r112, %r20, 4; 2026-02-21T09:09:02.2109106Z shl.b32 %r113, %r112, 2; 2026-02-21T09:09:02.2109252Z add.s32 %r180, %r111, %r113; 2026-02-21T09:09:02.2109410Z mov.pred %p22, -1; 2026-02-21T09:09:02.2109548Z // begin inline asm 2026-02-21T09:09:02.2109884Z @%p22 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r180 + 0], {%r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51}; 2026-02-21T09:09:02.2110232Z // end inline asm 2026-02-21T09:09:02.2110372Z // begin inline asm 2026-02-21T09:09:02.2110554Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:02.2110719Z // end inline asm 2026-02-21T09:09:02.2110854Z bar.sync 0; 2026-02-21T09:09:02.2110978Z $L__tmp1: 2026-02-21T09:09:02.2111220Z .loc 1 50 103 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:50:103 2026-02-21T09:09:02.2111509Z add.s32 %r211, %r33, 5120; 2026-02-21T09:09:02.2111697Z // begin inline asm 2026-02-21T09:09:02.2111858Z @%p41 mbarrier.init.shared::cta.b64 [%r211], 1; 2026-02-21T09:09:02.2112051Z // end inline asm 2026-02-21T09:09:02.2112182Z bar.sync 0; 2026-02-21T09:09:02.2112316Z add.s32 %r68, %r33, 5128; 2026-02-21T09:09:02.2112466Z // begin inline asm 2026-02-21T09:09:02.2112625Z @%p41 mbarrier.init.shared::cta.b64 [%r68], 1; 2026-02-21T09:09:02.2112808Z // end inline asm 2026-02-21T09:09:02.2113044Z .loc 1 58 60 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:60 2026-02-21T09:09:02.2113328Z or.b32 %r114, %r108, %r5; 2026-02-21T09:09:02.2113581Z .loc 1 58 32 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:32 2026-02-21T09:09:02.2113872Z mad.wide.u32 %rd34, %r114, 2, %rd12; 2026-02-21T09:09:02.2114038Z mov.b32 %r70, 8; 2026-02-21T09:09:02.2114275Z .loc 1 58 
80 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:80 2026-02-21T09:09:02.2114547Z // begin inline asm 2026-02-21T09:09:02.2114742Z cp.async.ca.shared.global [ %r124 + 0 ], [ %rd34 + 0 ], 0x8, %r70; 2026-02-21T09:09:02.2114962Z // end inline asm 2026-02-21T09:09:02.2115099Z cp.async.commit_group; 2026-02-21T09:09:02.2115352Z .loc 1 58 32 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:32 2026-02-21T09:09:02.2115625Z add.s64 %rd35, %rd34, 16; 2026-02-21T09:09:02.2115877Z .loc 1 58 80 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:80 2026-02-21T09:09:02.2116148Z bar.sync 0; 2026-02-21T09:09:02.2116269Z // begin inline asm 2026-02-21T09:09:02.2116460Z cp.async.ca.shared.global [ %r71 + 0 ], [ %rd35 + 0 ], 0x8, %r70; 2026-02-21T09:09:02.2116676Z // end inline asm 2026-02-21T09:09:02.2116808Z cp.async.commit_group; 2026-02-21T09:09:02.2116953Z cp.async.wait_group 1; 2026-02-21T09:09:02.2117099Z bar.sync 0; 2026-02-21T09:09:02.2117252Z ld.shared.v4.b16 {%rs2, %rs3, %rs4, %rs5}, [%r102]; 2026-02-21T09:09:02.2117551Z .loc 1 62 32 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:62:32 2026-02-21T09:09:02.2117828Z cvt.f32.bf16 %r74, %rs2; 2026-02-21T09:09:02.2118007Z cvt.f32.bf16 %r75, %rs3; 2026-02-21T09:09:02.2118154Z cvt.f32.bf16 %r76, %rs4; 2026-02-21T09:09:02.2118298Z cvt.f32.bf16 %r77, %rs5; 2026-02-21T09:09:02.2118436Z $L__tmp2: 2026-02-21T09:09:02.2118710Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2119035Z add.s32 %r115, %r112, %r119; 2026-02-21T09:09:02.2119188Z add.s32 %r140, %r115, %r110; 2026-02-21T09:09:02.2119364Z // begin inline asm 2026-02-21T09:09:02.2119587Z @%p22 tcgen05.st.sync.aligned.32x32b.x4.b32 [%r140 + 0], {%r74, %r75, %r76, %r77}; 2026-02-21T09:09:02.2119830Z // end inline asm 2026-02-21T09:09:02.2119985Z // begin inline asm 2026-02-21T09:09:02.2120134Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:02.2120289Z // 
end inline asm 2026-02-21T09:09:02.2120412Z bar.sync 0; 2026-02-21T09:09:02.2120532Z $L__tmp3: 2026-02-21T09:09:02.2120754Z .loc 1 64 62 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:64:62 2026-02-21T09:09:02.2121024Z or.b32 %r116, %r105, %r15; 2026-02-21T09:09:02.2121276Z .loc 1 64 34 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:64:34 2026-02-21T09:09:02.2121577Z cvt.u64.u32 %rd38, %r116; 2026-02-21T09:09:02.2121731Z add.s64 %rd36, %rd13, %rd38; 2026-02-21T09:09:02.2122002Z .loc 1 64 87 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:64:87 2026-02-21T09:09:02.2122277Z // begin inline asm 2026-02-21T09:09:02.2122415Z mov.u16 %rs1, 0x0; 2026-02-21T09:09:02.2122565Z ld.global.b8 { %rs1 }, [ %rd36 + 0 ]; 2026-02-21T09:09:02.2122734Z // end inline asm 2026-02-21T09:09:02.2122967Z .loc 1 67 28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:67:28 2026-02-21T09:09:02.2123238Z shl.b16 %rs6, %rs1, 4; 2026-02-21T09:09:02.2123493Z .loc 1 82 58 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:82:58 2026-02-21T09:09:02.2123795Z selp.b16 %rs7, %rs6, %rs1, %p26; 2026-02-21T09:09:02.2123965Z cvt.s16.s8 %rs8, %rs7; 2026-02-21T09:09:02.2124123Z shr.s16 %rs9, %rs8, 4; 2026-02-21T09:09:02.2124372Z .loc 1 87 32 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:87:32 2026-02-21T09:09:02.2124658Z cvt.rn.f32.s16 %r117, %rs9; 2026-02-21T09:09:02.2124831Z st.shared.b32 [%r12], %r117; 2026-02-21T09:09:02.2124982Z $L__tmp4: 2026-02-21T09:09:02.2125272Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2125595Z // begin inline asm 2026-02-21T09:09:02.2125759Z fence.proxy.async.shared::cta; 2026-02-21T09:09:02.2125924Z // end inline asm 2026-02-21T09:09:02.2126067Z bar.sync 0; 2026-02-21T09:09:02.2126204Z setp.ne.b32 %p27, %r20, 0; 2026-02-21T09:09:02.2126368Z @%p27 bra $L__BB0_3; 2026-02-21T09:09:02.2126515Z // %bb.2: 2026-02-21T09:09:02.2126649Z 
elect.sync %r121|%p29, -1; 2026-02-21T09:09:02.2126817Z mov.b32 %r120, 134744336; 2026-02-21T09:09:02.2126969Z mov.pred %p28, 0; 2026-02-21T09:09:02.2127116Z // begin inline asm 2026-02-21T09:09:02.2127353Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r209 + 0 ], [ %r119 + 0 ], %rd39, %r120, %p28; 2026-02-21T09:09:02.2127627Z // end inline asm 2026-02-21T09:09:02.2127763Z add.s32 %r123, %r33, 5120; 2026-02-21T09:09:02.2127925Z cvt.u64.u32 %rd40, %r123; 2026-02-21T09:09:02.2128082Z // begin inline asm 2026-02-21T09:09:02.2128292Z @%p29 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd40]; 2026-02-21T09:09:02.2128526Z // end inline asm 2026-02-21T09:09:02.2128656Z $L__tmp5: 2026-02-21T09:09:02.2128783Z $L__BB0_3: 2026-02-21T09:09:02.2128938Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:02.2129139Z add.s32 %r13, %r33, %r100; 2026-02-21T09:09:02.2129290Z add.s32 %r14, %r33, %r101; 2026-02-21T09:09:02.2129556Z .loc 1 58 32 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:32 2026-02-21T09:09:02.2129867Z add.s64 %rd41, %rd34, 32; 2026-02-21T09:09:02.2130121Z .loc 1 58 80 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:80 2026-02-21T09:09:02.2130405Z // begin inline asm 2026-02-21T09:09:02.2130602Z cp.async.ca.shared.global [ %r124 + 0 ], [ %rd41 + 0 ], 0x8, %r70; 2026-02-21T09:09:02.2130825Z // end inline asm 2026-02-21T09:09:02.2130965Z cp.async.commit_group; 2026-02-21T09:09:02.2131230Z .loc 1 50 103 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:50:103 2026-02-21T09:09:02.2131569Z shl.b32 %r129, %r3, 13; 2026-02-21T09:09:02.2131735Z and.b32 %r130, %r129, 4063232; 2026-02-21T09:09:02.2131927Z shl.b32 %r131, %r6, 10; 2026-02-21T09:09:02.2132080Z or.b32 %r132, %r130, %r131; 2026-02-21T09:09:02.2132245Z or.b32 %r133, %r132, %r5; 2026-02-21T09:09:02.2132405Z mad.wide.u32 %rd43, %r133, 2, %rd12; 2026-02-21T09:09:02.2132583Z add.s64 %rd84, %rd43, 48; 2026-02-21T09:09:02.2132735Z add.s32 %r134, %r15, %r17; 
2026-02-21T09:09:02.2132899Z add.s32 %r135, %r134, %r16; 2026-02-21T09:09:02.2133059Z add.s32 %r136, %r135, %r7; 2026-02-21T09:09:02.2133227Z cvt.u64.u32 %rd44, %r136; 2026-02-21T09:09:02.2133397Z add.s64 %rd45, %rd44, %rd13; 2026-02-21T09:09:02.2133561Z add.s64 %rd83, %rd45, 32768; 2026-02-21T09:09:02.2133727Z mov.b32 %r213, 1; 2026-02-21T09:09:02.2133869Z mov.b32 %r210, 0; 2026-02-21T09:09:02.2134049Z mov.b64 %rd85, -4; 2026-02-21T09:09:02.2134200Z mov.b32 %r212, %r210; 2026-02-21T09:09:02.2134356Z mov.b32 %r214, %r210; 2026-02-21T09:09:02.2134503Z bra.uni $L__BB0_4; 2026-02-21T09:09:02.2134704Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:02.2135045Z .loc 1 50 103 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:50:103 2026-02-21T09:09:02.2135355Z add.s64 %rd85, %rd85, 4; 2026-02-21T09:09:02.2135530Z setp.lt.u64 %p38, %rd85, 500; 2026-02-21T09:09:02.2135690Z $L__tmp6: 2026-02-21T09:09:02.2135993Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2136336Z add.s32 %r158, %r213, 1; 2026-02-21T09:09:02.2136505Z setp.gt.s32 %p39, %r158, 1; 2026-02-21T09:09:02.2136678Z selp.b32 %r213, 0, %r158, %p39; 2026-02-21T09:09:02.2136860Z selp.b32 %r159, 1, 0, %p39; 2026-02-21T09:09:02.2137025Z xor.b32 %r214, %r161, %r159; 2026-02-21T09:09:02.2137190Z $L__tmp7: 2026-02-21T09:09:02.2137438Z .loc 1 58 80 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:80 2026-02-21T09:09:02.2137728Z add.s32 %r156, %r29, %r8; 2026-02-21T09:09:02.2137894Z selp.b32 %r157, 8, 0, %p38; 2026-02-21T09:09:02.2138055Z // begin inline asm 2026-02-21T09:09:02.2138266Z cp.async.ca.shared.global [ %r156 + 0 ], [ %rd84 + 0 ], 0x8, %r157; 2026-02-21T09:09:02.2138497Z // end inline asm 2026-02-21T09:09:02.2138651Z cp.async.commit_group; 2026-02-21T09:09:02.2138921Z .loc 1 50 103 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:50:103 2026-02-21T09:09:02.2139223Z add.s64 %rd84, %rd84, 16; 
2026-02-21T09:09:02.2139389Z add.s64 %rd83, %rd83, 32768; 2026-02-21T09:09:02.2139555Z setp.lt.u64 %p40, %rd85, 504; 2026-02-21T09:09:02.2139728Z mov.b32 %r210, %r161; 2026-02-21T09:09:02.2139880Z @%p40 bra $L__BB0_4; 2026-02-21T09:09:02.2140036Z bra.uni $L__BB0_7; 2026-02-21T09:09:02.2140230Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:09:02.2140575Z .loc 1 0 103 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:0:103 2026-02-21T09:09:02.2140871Z mov.b32 %r161, %r214; 2026-02-21T09:09:02.2141134Z .loc 1 50 103 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:50:103 2026-02-21T09:09:02.2141438Z add.s32 %r145, %r212, 1; 2026-02-21T09:09:02.2141642Z setp.gt.s32 %p34, %r145, 1; 2026-02-21T09:09:02.2141813Z selp.b32 %r212, 0, %r145, %p34; 2026-02-21T09:09:02.2142107Z .loc 1 58 80 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:58:80 2026-02-21T09:09:02.2142394Z cp.async.wait_group 1; 2026-02-21T09:09:02.2142545Z bar.sync 0; 2026-02-21T09:09:02.2142690Z shl.b32 %r146, %r212, 11; 2026-02-21T09:09:02.2142853Z add.s32 %r29, %r33, %r146; 2026-02-21T09:09:02.2143011Z add.s32 %r148, %r29, %r11; 2026-02-21T09:09:02.2143215Z ld.shared.v4.b16 {%rs11, %rs12, %rs13, %rs14}, [%r148]; 2026-02-21T09:09:02.2143552Z .loc 1 62 32 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:62:32 2026-02-21T09:09:02.2143842Z cvt.f32.bf16 %r141, %rs11; 2026-02-21T09:09:02.2144024Z cvt.f32.bf16 %r142, %rs12; 2026-02-21T09:09:02.2144186Z cvt.f32.bf16 %r143, %rs13; 2026-02-21T09:09:02.2144336Z cvt.f32.bf16 %r144, %rs14; 2026-02-21T09:09:02.2144490Z $L__tmp8: 2026-02-21T09:09:02.2144780Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2145099Z // begin inline asm 2026-02-21T09:09:02.2145243Z 2026-02-21T09:09:02.2145357Z { 2026-02-21T09:09:02.2145485Z .reg .pred complete; 2026-02-21T09:09:02.2145629Z waitLoop: 2026-02-21T09:09:02.2145822Z mbarrier.try_wait.parity.shared.b64 
complete, [%r211], %r210; 2026-02-21T09:09:02.2146053Z @!complete bra.uni waitLoop; 2026-02-21T09:09:02.2146211Z } 2026-02-21T09:09:02.2146277Z 2026-02-21T09:09:02.2146370Z // end inline asm 2026-02-21T09:09:02.2146516Z mov.pred %p31, -1; 2026-02-21T09:09:02.2146666Z // begin inline asm 2026-02-21T09:09:02.2146899Z @%p31 tcgen05.st.sync.aligned.32x32b.x4.b32 [%r140 + 0], {%r141, %r142, %r143, %r144}; 2026-02-21T09:09:02.2147169Z // end inline asm 2026-02-21T09:09:02.2147304Z // begin inline asm 2026-02-21T09:09:02.2147463Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:02.2147623Z // end inline asm 2026-02-21T09:09:02.2147760Z bar.sync 0; 2026-02-21T09:09:02.2147883Z $L__tmp9: 2026-02-21T09:09:02.2148118Z .loc 1 64 87 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:64:87 2026-02-21T09:09:02.2148400Z // begin inline asm 2026-02-21T09:09:02.2148540Z mov.u16 %rs10, 0x0; 2026-02-21T09:09:02.2148696Z ld.global.b8 { %rs10 }, [ %rd83 + 0 ]; 2026-02-21T09:09:02.2148867Z // end inline asm 2026-02-21T09:09:02.2149106Z .loc 1 67 28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:67:28 2026-02-21T09:09:02.2149382Z shl.b16 %rs15, %rs10, 4; 2026-02-21T09:09:02.2149642Z .loc 1 82 58 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:82:58 2026-02-21T09:09:02.2149931Z selp.b16 %rs16, %rs15, %rs10, %p26; 2026-02-21T09:09:02.2150105Z cvt.s16.s8 %rs17, %rs16; 2026-02-21T09:09:02.2150263Z shr.s16 %rs18, %rs17, 4; 2026-02-21T09:09:02.2150508Z .loc 1 87 32 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:87:32 2026-02-21T09:09:02.2150793Z cvt.rn.f32.s16 %r149, %rs18; 2026-02-21T09:09:02.2150956Z st.shared.b32 [%r12], %r149; 2026-02-21T09:09:02.2151226Z .loc 1 50 103 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:50:103 2026-02-21T09:09:02.2151503Z shl.b32 %r150, %r213, 3; 2026-02-21T09:09:02.2151689Z add.s32 %r151, %r33, %r150; 2026-02-21T09:09:02.2151852Z add.s32 %r211, %r151, 5120; 2026-02-21T09:09:02.2151999Z $L__tmp10: 
2026-02-21T09:09:02.2152293Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2152619Z // begin inline asm 2026-02-21T09:09:02.2152782Z fence.proxy.async.shared::cta; 2026-02-21T09:09:02.2152947Z // end inline asm 2026-02-21T09:09:02.2153091Z bar.sync 0; 2026-02-21T09:09:02.2153225Z @%p27 bra $L__BB0_6; 2026-02-21T09:09:02.2153423Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:02.2153655Z elect.sync %r155|%p36, -1; 2026-02-21T09:09:02.2153824Z mov.b32 %r154, 134744336; 2026-02-21T09:09:02.2154022Z // begin inline asm 2026-02-21T09:09:02.2154256Z @%p36 tcgen05.mma.cta_group::1.kind::tf32 [ %r209 + 0 ], [ %r119 + 0 ], %rd39, %r154, %p31; 2026-02-21T09:09:02.2154526Z // end inline asm 2026-02-21T09:09:02.2154664Z cvt.u64.u32 %rd48, %r211; 2026-02-21T09:09:02.2154820Z // begin inline asm 2026-02-21T09:09:02.2155023Z @%p36 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd48]; 2026-02-21T09:09:02.2155260Z // end inline asm 2026-02-21T09:09:02.2155445Z bra.uni $L__BB0_6; 2026-02-21T09:09:02.2155624Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:09:02.2156021Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2156343Z // begin inline asm 2026-02-21T09:09:02.2156481Z 2026-02-21T09:09:02.2156594Z { 2026-02-21T09:09:02.2156725Z .reg .pred complete; 2026-02-21T09:09:02.2156866Z waitLoop: 2026-02-21T09:09:02.2157060Z mbarrier.try_wait.parity.shared.b64 complete, [%r211], %r161; 2026-02-21T09:09:02.2157303Z @!complete bra.uni waitLoop; 2026-02-21T09:09:02.2157455Z } 2026-02-21T09:09:02.2157519Z 2026-02-21T09:09:02.2157583Z // end inline asm 2026-02-21T09:09:02.2157714Z $L__tmp11: 2026-02-21T09:09:02.2157955Z .loc 1 50 103 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:50:103 2026-02-21T09:09:02.2158243Z cp.async.wait_group 0; 2026-02-21T09:09:02.2158425Z bar.sync 0; 2026-02-21T09:09:02.2158561Z 
add.s32 %r162, %r33, 5120; 2026-02-21T09:09:02.2158720Z // begin inline asm 2026-02-21T09:09:02.2158890Z @%p41 mbarrier.inval.shared::cta.b64 [%r162]; 2026-02-21T09:09:02.2159078Z // end inline asm 2026-02-21T09:09:02.2159215Z bar.sync 0; 2026-02-21T09:09:02.2159341Z // begin inline asm 2026-02-21T09:09:02.2159506Z @%p41 mbarrier.inval.shared::cta.b64 [%r68]; 2026-02-21T09:09:02.2159687Z // end inline asm 2026-02-21T09:09:02.2159819Z $L__tmp12: 2026-02-21T09:09:02.2160094Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2160420Z // begin inline asm 2026-02-21T09:09:02.2160771Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r164, %r165, %r166, %r167, %r168, %r169, %r170, %r171, %r172, %r173, %r174, %r175, %r176, %r177, %r178, %r179}, [%r180 + 0]; 2026-02-21T09:09:02.2161143Z // end inline asm 2026-02-21T09:09:02.2161284Z // begin inline asm 2026-02-21T09:09:02.2161436Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:02.2161656Z // end inline asm 2026-02-21T09:09:02.2161857Z cvt.u64.u32 %rd51, %r164; 2026-02-21T09:09:02.2162024Z cvt.u64.u32 %rd52, %r165; 2026-02-21T09:09:02.2162182Z shl.b64 %rd53, %rd52, 32; 2026-02-21T09:09:02.2162345Z or.b64 %rd54, %rd51, %rd53; 2026-02-21T09:09:02.2162509Z $L__tmp13: 2026-02-21T09:09:02.2162746Z .loc 1 97 28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:97:28 2026-02-21T09:09:02.2163040Z mov.b64 {%r184, %r185}, %rd54; 2026-02-21T09:09:02.2163217Z cvt.rn.bf16x2.f32 %r186, %r185, %r184; 2026-02-21T09:09:02.2163407Z $L__tmp14: 2026-02-21T09:09:02.2163685Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2164020Z cvt.u64.u32 %rd55, %r166; 2026-02-21T09:09:02.2164183Z cvt.u64.u32 %rd56, %r167; 2026-02-21T09:09:02.2164335Z shl.b64 %rd57, %rd56, 32; 2026-02-21T09:09:02.2164498Z or.b64 %rd58, %rd55, %rd57; 2026-02-21T09:09:02.2164650Z $L__tmp15: 2026-02-21T09:09:02.2164887Z .loc 1 97 
28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:97:28 2026-02-21T09:09:02.2165167Z mov.b64 {%r187, %r188}, %rd58; 2026-02-21T09:09:02.2165346Z cvt.rn.bf16x2.f32 %r189, %r188, %r187; 2026-02-21T09:09:02.2165515Z $L__tmp16: 2026-02-21T09:09:02.2165800Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2166129Z cvt.u64.u32 %rd59, %r168; 2026-02-21T09:09:02.2166312Z cvt.u64.u32 %rd60, %r169; 2026-02-21T09:09:02.2166472Z shl.b64 %rd61, %rd60, 32; 2026-02-21T09:09:02.2166623Z or.b64 %rd62, %rd59, %rd61; 2026-02-21T09:09:02.2166780Z $L__tmp17: 2026-02-21T09:09:02.2167006Z .loc 1 97 28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:97:28 2026-02-21T09:09:02.2167292Z mov.b64 {%r190, %r191}, %rd62; 2026-02-21T09:09:02.2167462Z cvt.rn.bf16x2.f32 %r192, %r191, %r190; 2026-02-21T09:09:02.2167664Z $L__tmp18: 2026-02-21T09:09:02.2167943Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2168283Z cvt.u64.u32 %rd63, %r170; 2026-02-21T09:09:02.2168444Z cvt.u64.u32 %rd64, %r171; 2026-02-21T09:09:02.2168592Z shl.b64 %rd65, %rd64, 32; 2026-02-21T09:09:02.2168747Z or.b64 %rd66, %rd63, %rd65; 2026-02-21T09:09:02.2168894Z $L__tmp19: 2026-02-21T09:09:02.2169126Z .loc 1 97 28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:97:28 2026-02-21T09:09:02.2169404Z mov.b64 {%r193, %r194}, %rd66; 2026-02-21T09:09:02.2169574Z cvt.rn.bf16x2.f32 %r195, %r194, %r193; 2026-02-21T09:09:02.2169746Z $L__tmp20: 2026-02-21T09:09:02.2170018Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2170343Z cvt.u64.u32 %rd67, %r172; 2026-02-21T09:09:02.2170426Z cvt.u64.u32 %rd68, %r173; 2026-02-21T09:09:02.2170485Z shl.b64 %rd69, %rd68, 32; 2026-02-21T09:09:02.2170549Z or.b64 %rd70, %rd67, %rd69; 2026-02-21T09:09:02.2170600Z $L__tmp21: 2026-02-21T09:09:02.2170767Z 
.loc 1 97 28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:97:28 2026-02-21T09:09:02.2170827Z mov.b64 {%r196, %r197}, %rd70; 2026-02-21T09:09:02.2170898Z cvt.rn.bf16x2.f32 %r198, %r197, %r196; 2026-02-21T09:09:02.2170949Z $L__tmp22: 2026-02-21T09:09:02.2171154Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2171220Z cvt.u64.u32 %rd71, %r174; 2026-02-21T09:09:02.2171276Z cvt.u64.u32 %rd72, %r175; 2026-02-21T09:09:02.2171334Z shl.b64 %rd73, %rd72, 32; 2026-02-21T09:09:02.2171398Z or.b64 %rd74, %rd71, %rd73; 2026-02-21T09:09:02.2171449Z $L__tmp23: 2026-02-21T09:09:02.2171655Z .loc 1 97 28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:97:28 2026-02-21T09:09:02.2171719Z mov.b64 {%r199, %r200}, %rd74; 2026-02-21T09:09:02.2171792Z cvt.rn.bf16x2.f32 %r201, %r200, %r199; 2026-02-21T09:09:02.2171843Z $L__tmp24: 2026-02-21T09:09:02.2172049Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2172114Z cvt.u64.u32 %rd75, %r176; 2026-02-21T09:09:02.2172171Z cvt.u64.u32 %rd76, %r177; 2026-02-21T09:09:02.2172228Z shl.b64 %rd77, %rd76, 32; 2026-02-21T09:09:02.2172295Z or.b64 %rd78, %rd75, %rd77; 2026-02-21T09:09:02.2172347Z $L__tmp25: 2026-02-21T09:09:02.2172507Z .loc 1 97 28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:97:28 2026-02-21T09:09:02.2172567Z mov.b64 {%r202, %r203}, %rd78; 2026-02-21T09:09:02.2172639Z cvt.rn.bf16x2.f32 %r204, %r203, %r202; 2026-02-21T09:09:02.2172690Z $L__tmp26: 2026-02-21T09:09:02.2172898Z .loc 2 291 36 // standard.py:291:36 @[ cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:94:40 ] 2026-02-21T09:09:02.2172969Z cvt.u64.u32 %rd79, %r178; 2026-02-21T09:09:02.2173028Z cvt.u64.u32 %rd80, %r179; 2026-02-21T09:09:02.2173087Z shl.b64 %rd81, %rd80, 32; 2026-02-21T09:09:02.2173156Z or.b64 %rd82, %rd79, %rd81; 2026-02-21T09:09:02.2173211Z $L__tmp27: 
2026-02-21T09:09:02.2173378Z .loc 1 97 28 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:97:28 2026-02-21T09:09:02.2173438Z mov.b64 {%r205, %r206}, %rd82; 2026-02-21T09:09:02.2173539Z cvt.rn.bf16x2.f32 %r207, %r206, %r205; 2026-02-21T09:09:02.2173700Z .loc 1 98 43 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:98:43 2026-02-21T09:09:02.2173755Z bar.sync 0; 2026-02-21T09:09:02.2173856Z st.shared.v4.b32 [%r13], {%r186, %r189, %r192, %r195}; 2026-02-21T09:09:02.2173946Z st.shared.v4.b32 [%r14], {%r198, %r201, %r204, %r207}; 2026-02-21T09:09:02.2174005Z // begin inline asm 2026-02-21T09:09:02.2174079Z fence.proxy.async.shared::cta; 2026-02-21T09:09:02.2174173Z // end inline asm 2026-02-21T09:09:02.2174227Z bar.sync 0; 2026-02-21T09:09:02.2174295Z elect.sync %r208|%p45, -1; 2026-02-21T09:09:02.2174368Z and.pred %p43, %p1, %p45; 2026-02-21T09:09:02.2174446Z // begin inline asm 2026-02-21T09:09:02.2174625Z @%p43 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd50, {%r181, %r182}], [%r33]; 2026-02-21T09:09:02.2174689Z // end inline asm 2026-02-21T09:09:02.2174756Z cp.async.bulk.commit_group; 2026-02-21T09:09:02.2174827Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:02.2174883Z bar.sync 0; 2026-02-21T09:09:02.2174972Z $L__BB0_8: // %._crit_edge 2026-02-21T09:09:02.2175171Z .loc 1 29 4 // cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py:29:4 2026-02-21T09:09:02.2175224Z bar.sync 0; 2026-02-21T09:09:02.2175290Z // begin inline asm 2026-02-21T09:09:02.2175432Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r209, 64; 2026-02-21T09:09:02.2175492Z // end inline asm 2026-02-21T09:09:02.2175553Z ret; 2026-02-21T09:09:02.2175607Z $L__tmp28: 2026-02-21T09:09:02.2175662Z $L__func_end0: 2026-02-21T09:09:02.2175749Z // -- End function 2026-02-21T09:09:02.2175810Z } 2026-02-21T09:09:02.2176014Z .file 1 "/tmp/torchinductor_root/za/cza7obdk7gcf4ec67x4fotn7zned6weh6jmnm2odv4ocjg4za7ro.py" 2026-02-21T09:09:02.2176193Z .file 2 
"/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:02.2176266Z .section .debug_abbrev 2026-02-21T09:09:02.2176319Z { 2026-02-21T09:09:02.2176410Z .b8 1 // Abbreviation Code 2026-02-21T09:09:02.2176498Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:02.2176587Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:02.2176670Z .b8 37 // DW_AT_producer 2026-02-21T09:09:02.2176752Z .b8 8 // DW_FORM_string 2026-02-21T09:09:02.2176839Z .b8 19 // DW_AT_language 2026-02-21T09:09:02.2176919Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:02.2176997Z .b8 3 // DW_AT_name 2026-02-21T09:09:02.2177080Z .b8 8 // DW_FORM_string 2026-02-21T09:09:02.2177160Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:02.2177237Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:02.2177315Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:02.2177397Z .b8 8 // DW_FORM_string 2026-02-21T09:09:02.2177470Z .b8 0 // EOM(1) 2026-02-21T09:09:02.2177541Z .b8 0 // EOM(2) 2026-02-21T09:09:02.2177633Z .b8 2 // Abbreviation Code 2026-02-21T09:09:02.2177718Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:02.2177794Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:02.2177875Z .b8 3 // DW_AT_name 2026-02-21T09:09:02.2177952Z .b8 8 // DW_FORM_string 2026-02-21T09:09:02.2178030Z .b8 32 // DW_AT_inline 2026-02-21T09:09:02.2178115Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:02.2178185Z .b8 0 // EOM(1) 2026-02-21T09:09:02.2178283Z .b8 0 // EOM(2) 2026-02-21T09:09:02.2178365Z .b8 3 // Abbreviation Code 2026-02-21T09:09:02.2178455Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:02.2178534Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:02.2178611Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:02.2178693Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:02.2178793Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:02.2178869Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:02.2178981Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:02.2179057Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:02.2179127Z .b8 0 // EOM(1) 2026-02-21T09:09:02.2179197Z .b8 0 // EOM(2) 
2026-02-21T09:09:02.2179287Z .b8 4 // Abbreviation Code 2026-02-21T09:09:02.2179382Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:02.2179461Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:02.2179556Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:02.2179630Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:02.2179725Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:02.2179807Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:02.2179886Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:02.2179960Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:02.2180048Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:02.2180123Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:02.2180200Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:02.2180277Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:02.2180366Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:02.2180440Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:02.2180510Z .b8 0 // EOM(1) 2026-02-21T09:09:02.2180586Z .b8 0 // EOM(2) 2026-02-21T09:09:02.2180658Z .b8 0 // EOM(3) 2026-02-21T09:09:02.2180712Z } 2026-02-21T09:09:02.2180775Z .section .debug_info 2026-02-21T09:09:02.2180835Z { 2026-02-21T09:09:02.2180922Z .b32 178 // Length of Unit 2026-02-21T09:09:02.2181010Z .b8 2 // DWARF version number 2026-02-21T09:09:02.2181073Z .b8 0 2026-02-21T09:09:02.2181194Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:02.2181286Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:02.2181400Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:02.2181484Z .b8 116 // DW_AT_producer 2026-02-21T09:09:02.2181584Z .b8 114 2026-02-21T09:09:02.2181643Z .b8 105 2026-02-21T09:09:02.2181710Z .b8 116 2026-02-21T09:09:02.2181763Z .b8 111 2026-02-21T09:09:02.2181816Z .b8 110 2026-02-21T09:09:02.2181877Z .b8 0 2026-02-21T09:09:02.2181957Z .b8 2 // DW_AT_language 2026-02-21T09:09:02.2182010Z .b8 0 2026-02-21T09:09:02.2182087Z .b8 99 // DW_AT_name 2026-02-21T09:09:02.2182148Z .b8 122 2026-02-21T09:09:02.2182203Z .b8 97 2026-02-21T09:09:02.2182257Z .b8 55 2026-02-21T09:09:02.2182315Z .b8 111 2026-02-21T09:09:02.2182369Z .b8 98 2026-02-21T09:09:02.2182422Z .b8 100 2026-02-21T09:09:02.2182475Z .b8 107 2026-02-21T09:09:02.2182537Z .b8 55 2026-02-21T09:09:02.2182590Z .b8 103 2026-02-21T09:09:02.2182642Z .b8 99 2026-02-21T09:09:02.2182750Z .b8 102 2026-02-21T09:09:02.2182805Z .b8 52 2026-02-21T09:09:02.2182858Z .b8 101 2026-02-21T09:09:02.2182911Z .b8 99 2026-02-21T09:09:02.2182972Z .b8 54 2026-02-21T09:09:02.2183024Z .b8 55 2026-02-21T09:09:02.2183078Z .b8 120 2026-02-21T09:09:02.2183131Z .b8 52 2026-02-21T09:09:02.2183192Z .b8 102 2026-02-21T09:09:02.2183245Z .b8 111 2026-02-21T09:09:02.2183298Z .b8 116 2026-02-21T09:09:02.2183358Z .b8 110 2026-02-21T09:09:02.2183411Z .b8 55 2026-02-21T09:09:02.2183509Z .b8 122 2026-02-21T09:09:02.2183560Z .b8 110 2026-02-21T09:09:02.2183619Z .b8 101 2026-02-21T09:09:02.2183671Z .b8 100 2026-02-21T09:09:02.2183721Z .b8 54 2026-02-21T09:09:02.2183778Z .b8 119 2026-02-21T09:09:02.2183852Z .b8 101 2026-02-21T09:09:02.2183904Z .b8 104 2026-02-21T09:09:02.2183953Z .b8 54 2026-02-21T09:09:02.2184011Z .b8 106 2026-02-21T09:09:02.2184061Z .b8 109 2026-02-21T09:09:02.2184111Z .b8 110 2026-02-21T09:09:02.2184167Z .b8 109 2026-02-21T09:09:02.2184216Z .b8 50 2026-02-21T09:09:02.2184266Z .b8 111 2026-02-21T09:09:02.2184317Z .b8 100 
2026-02-21T09:09:02.2184375Z .b8 118 2026-02-21T09:09:02.2184424Z .b8 52 2026-02-21T09:09:02.2184474Z .b8 111 2026-02-21T09:09:02.2184524Z .b8 99 2026-02-21T09:09:02.2184582Z .b8 106 2026-02-21T09:09:02.2184632Z .b8 103 2026-02-21T09:09:02.2184682Z .b8 52 2026-02-21T09:09:02.2184739Z .b8 122 2026-02-21T09:09:02.2184789Z .b8 97 2026-02-21T09:09:02.2184838Z .b8 55 2026-02-21T09:09:02.2184888Z .b8 114 2026-02-21T09:09:02.2184975Z .b8 111 2026-02-21T09:09:02.2185031Z .b8 46 2026-02-21T09:09:02.2185083Z .b8 112 2026-02-21T09:09:02.2185142Z .b8 121 2026-02-21T09:09:02.2185194Z .b8 0 2026-02-21T09:09:02.2185284Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:02.2185362Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:02.2185422Z .b8 116 2026-02-21T09:09:02.2185475Z .b8 109 2026-02-21T09:09:02.2185525Z .b8 112 2026-02-21T09:09:02.2185585Z .b8 47 2026-02-21T09:09:02.2185637Z .b8 116 2026-02-21T09:09:02.2185688Z .b8 111 2026-02-21T09:09:02.2185742Z .b8 114 2026-02-21T09:09:02.2185800Z .b8 99 2026-02-21T09:09:02.2185852Z .b8 104 2026-02-21T09:09:02.2185903Z .b8 105 2026-02-21T09:09:02.2185954Z .b8 110 2026-02-21T09:09:02.2186014Z .b8 100 2026-02-21T09:09:02.2186067Z .b8 117 2026-02-21T09:09:02.2186120Z .b8 99 2026-02-21T09:09:02.2186179Z .b8 116 2026-02-21T09:09:02.2186231Z .b8 111 2026-02-21T09:09:02.2186285Z .b8 114 2026-02-21T09:09:02.2186336Z .b8 95 2026-02-21T09:09:02.2186397Z .b8 114 2026-02-21T09:09:02.2186451Z .b8 111 2026-02-21T09:09:02.2186505Z .b8 111 2026-02-21T09:09:02.2186564Z .b8 116 2026-02-21T09:09:02.2186617Z .b8 47 2026-02-21T09:09:02.2186670Z .b8 122 2026-02-21T09:09:02.2186722Z .b8 97 2026-02-21T09:09:02.2186783Z .b8 0 2026-02-21T09:09:02.2186884Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:02.2186959Z .b8 95 // DW_AT_name 2026-02-21T09:09:02.2187021Z .b8 104 2026-02-21T09:09:02.2187073Z .b8 101 2026-02-21T09:09:02.2187130Z .b8 108 2026-02-21T09:09:02.2187183Z .b8 105 2026-02-21T09:09:02.2187245Z .b8 111 2026-02-21T09:09:02.2187300Z 
.b8 110 2026-02-21T09:09:02.2187354Z .b8 95 2026-02-21T09:09:02.2187418Z .b8 109 2026-02-21T09:09:02.2187472Z .b8 97 2026-02-21T09:09:02.2187525Z .b8 116 2026-02-21T09:09:02.2187576Z .b8 109 2026-02-21T09:09:02.2187637Z .b8 117 2026-02-21T09:09:02.2187691Z .b8 108 2026-02-21T09:09:02.2187742Z .b8 95 2026-02-21T09:09:02.2187795Z .b8 98 2026-02-21T09:09:02.2187855Z .b8 102 2026-02-21T09:09:02.2187909Z .b8 49 2026-02-21T09:09:02.2187963Z .b8 54 2026-02-21T09:09:02.2188022Z .b8 95 2026-02-21T09:09:02.2188075Z .b8 105 2026-02-21T09:09:02.2188127Z .b8 110 2026-02-21T09:09:02.2188179Z .b8 116 2026-02-21T09:09:02.2188240Z .b8 52 2026-02-21T09:09:02.2188293Z .b8 0 2026-02-21T09:09:02.2188368Z .b8 1 // DW_AT_inline 2026-02-21T09:09:02.2188473Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:02.2188561Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:02.2188669Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:02.2188764Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:02.2188874Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:02.2188963Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:02.2189044Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:02.2189156Z .b64 $L__tmp27 // DW_AT_high_pc 2026-02-21T09:09:02.2189232Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:02.2189326Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:02.2189412Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:02.2189493Z .b8 0 // End Of Children Mark 2026-02-21T09:09:02.2189573Z .b8 0 // End Of Children Mark 2026-02-21T09:09:02.2189633Z } 2026-02-21T09:09:02.2189700Z .section .debug_macinfo { } 2026-02-21T09:09:02.2189704Z 2026-02-21T09:09:02.2189778Z ================================================================ 2026-02-21T09:09:02.2189879Z please share the reproducer above with Triton project. 
2026-02-21T09:09:02.3935705Z 2026-02-21T09:09:02.3938013Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 101/101 8.8 configs/s 2026-02-21T09:09:05.6008302Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━ 425/425 127.9 configs/s 2026-02-21T09:09:05.7553822Z [132s] Generation 1 complete: 2026-02-21T09:09:05.7558141Z error=14 2026-02-21T09:09:05.7559875Z ok=90 2026-02-21T09:09:05.7560090Z min=0.4721 2026-02-21T09:09:05.7564884Z mid=1.5207 2026-02-21T09:09:05.7569487Z max=238.1691 2026-02-21T09:09:05.7573938Z best={'block_sizes': [16, 64, 64], 2026-02-21T09:09:05.7575643Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:09:05.7575980Z 'l2_groupings': [16], 2026-02-21T09:09:05.7578365Z 'load_eviction_policies': ['first', ''], 2026-02-21T09:09:05.7578604Z 'loop_orders': [[0, 1]], 2026-02-21T09:09:05.7578787Z 'num_stages': 4, 2026-02-21T09:09:05.7578946Z 'num_warps': 1, 2026-02-21T09:09:05.7579093Z 'pid_type': 'flat', 2026-02-21T09:09:05.7579260Z 'range_flattens': [None, None], 2026-02-21T09:09:05.7579441Z 'range_multi_buffers': [None, None], 2026-02-21T09:09:05.7579643Z 'range_num_stages': [0, 0], 2026-02-21T09:09:05.7579817Z 'range_unroll_factors': [0, 1], 2026-02-21T09:09:05.7580003Z 'range_warp_specializes': [None, None]} 2026-02-21T09:09:05.7580219Z [132s] Fitting surrogate: 204 points, 204 targets 2026-02-21T09:09:07.0444323Z [134s] Generation 2 starting: 97 neighbors, 5 active search path(s) 2026-02-21T09:09:42.2680710Z [169s] Timeout after 30s compiling Config(block_sizes=[32, 512, 8], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[1, 1], range_unroll_factors=[4, 0], range_warp_specializes=[False, True]) 2026-02-21T09:09:42.5012834Z [169s] Timeout after 30s compiling 
Config(block_sizes=[64, 512, 16], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[False, False]) 2026-02-21T09:09:42.6186702Z [169s] Timeout after 30s compiling Config(block_sizes=[32, 1024, 16], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[False, None]) 2026-02-21T09:09:42.7383315Z [169s] Timeout after 30s compiling Config(block_sizes=[32, 512, 8], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[4, 3], range_warp_specializes=[False, None]) 2026-02-21T09:09:42.8743169Z [169s] Timeout after 30s compiling Config(block_sizes=[32, 1024, 16], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=6, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[False, None]) 2026-02-21T09:09:42.9826028Z [170s] Timeout after 30s compiling Config(block_sizes=[32, 512, 8], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[16], 
load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=5, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[False, None]) 2026-02-21T09:09:42.9840677Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 0.5 configs/s 2026-02-21T09:09:43.2327615Z [170s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 512, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:09:43.2328676Z Tensor-likes are not close! 2026-02-21T09:09:43.2328823Z 2026-02-21T09:09:43.2334171Z Mismatched elements: 33486438 / 33554432 (99.8%) 2026-02-21T09:09:43.2335573Z Greatest absolute difference: 2592.0 at index (1672, 2372) (up to 0.01 allowed) 2026-02-21T09:09:43.2335945Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:09:43.2336276Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:09:43.2336449Z 2026-02-21T09:09:43.6890487Z [170s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:09:43.6891716Z Tensor-likes are not close! 
2026-02-21T09:09:43.6895963Z 2026-02-21T09:09:43.6900024Z Mismatched elements: 33451375 / 33554432 (99.7%) 2026-02-21T09:09:43.6900442Z Greatest absolute difference: 1560.0 at index (1672, 2372) (up to 0.01 allowed) 2026-02-21T09:09:43.6900901Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:09:43.6901297Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:09:43.6906552Z 2026-02-21T09:09:43.7115007Z [170s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]) 2026-02-21T09:09:43.7115969Z Tensor-likes are not close! 2026-02-21T09:09:43.7117307Z 2026-02-21T09:09:43.7117709Z Mismatched elements: 33451375 / 33554432 (99.7%) 2026-02-21T09:09:43.7118045Z Greatest absolute difference: 1560.0 at index (1672, 2372) (up to 0.01 allowed) 2026-02-21T09:09:43.7118412Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:09:43.7118721Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:09:43.7118883Z 2026-02-21T09:09:46.3566023Z [173s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]) 2026-02-21T09:09:46.3567160Z Tensor-likes are not close! 
2026-02-21T09:09:46.3571308Z 2026-02-21T09:09:46.3575481Z Mismatched elements: 33447377 / 33554432 (99.7%) 2026-02-21T09:09:46.3580078Z Greatest absolute difference: 1408.0 at index (3279, 4843) (up to 0.01 allowed) 2026-02-21T09:09:46.3584363Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:09:46.3586533Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:09:46.3586771Z 2026-02-21T09:09:53.1748233Z [180s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:09:53.1748886Z 2026-02-21T09:09:53.1749258Z 2026-02-21T09:09:53.1749408Z ================================================================ 2026-02-21T09:09:53.1749652Z Internal Triton PTX codegen error 2026-02-21T09:09:53.1750872Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 256, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:53.1752133Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:53.1752395Z `ptxas` stderr: 2026-02-21T09:09:53.1752886Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 246 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:53.1753384Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:53.1753530Z 2026-02-21T09:09:53.1753933Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpp79l5bqr.ptx -o /tmp/tmpp79l5bqr.ptx.o 2026-02-21T09:09:53.1754388Z 2026-02-21T09:09:53.1754522Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:53.1754789Z `ptxas` stderr: 2026-02-21T09:09:53.1755218Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 246 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:53.1755702Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:53.1755851Z 2026-02-21T09:09:53.1756222Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpp79l5bqr.ptx -o /tmp/tmpp79l5bqr.ptx.o 2026-02-21T09:09:53.1756731Z 2026-02-21T09:09:53.1756734Z 2026-02-21T09:09:53.1756805Z // 2026-02-21T09:09:53.1756950Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:53.1757133Z // 2026-02-21T09:09:53.1757204Z 2026-02-21T09:09:53.1757266Z .version 8.7 2026-02-21T09:09:53.1757417Z .target sm_100a 2026-02-21T09:09:53.1757557Z .address_size 64 2026-02-21T09:09:53.1757653Z 2026-02-21T09:09:53.1757800Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:53.1758083Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:53.1758404Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:53.1758630Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:53.1758872Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:53.1759163Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:53.1759439Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 
2026-02-21T09:09:53.1759720Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:53.1759997Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:53.1760224Z ) 2026-02-21T09:09:53.1760407Z .reqntid 128 2026-02-21T09:09:53.1760539Z .maxnreg 32 2026-02-21T09:09:53.1760669Z { 2026-02-21T09:09:53.1760794Z .reg .pred %p<84>; 2026-02-21T09:09:53.1760951Z .reg .b16 %rs<212>; 2026-02-21T09:09:53.1761094Z .reg .b32 %r<741>; 2026-02-21T09:09:53.1761238Z .reg .b64 %rd<248>; 2026-02-21T09:09:53.1761380Z $L__func_begin0: 2026-02-21T09:09:53.1761468Z 2026-02-21T09:09:53.1761522Z // %bb.0: 2026-02-21T09:09:53.1761793Z .loc 1 19 0 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:19 2026-02-21T09:09:53.1762083Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:53.1762242Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:53.1762442Z ld.param.b64 %rd24, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:53.1762701Z mov.b32 %r53, global_smem; 2026-02-21T09:09:53.1762863Z // begin inline asm 2026-02-21T09:09:53.1763108Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r53], 128; 2026-02-21T09:09:53.1763352Z // end inline asm 2026-02-21T09:09:53.1763538Z ld.param.b64 %rd41, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:53.1763738Z bar.sync 0; 2026-02-21T09:09:53.1763889Z ld.shared.b32 %r735, [global_smem]; 2026-02-21T09:09:53.1764067Z bar.sync 0; 2026-02-21T09:09:53.1764197Z // begin inline asm 2026-02-21T09:09:53.1764408Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:53.1764632Z // end inline asm 2026-02-21T09:09:53.1764884Z .loc 1 21 67 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:21:67 2026-02-21T09:09:53.1765164Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:09:53.1765322Z mov.u32 %r62, %ctaid.y; 2026-02-21T09:09:53.1765470Z mov.u32 %r63, %ctaid.z; 2026-02-21T09:09:53.1765629Z mov.u32 %r64, %nctaid.x; 2026-02-21T09:09:53.1765790Z mov.u32 %r65, %nctaid.y; 
2026-02-21T09:09:53.1765945Z mad.lo.s32 %r66, %r63, %r65, %r62; 2026-02-21T09:09:53.1766128Z mad.lo.s32 %r67, %r66, %r64, %r3; 2026-02-21T09:09:53.1766294Z shl.b32 %r68, %r67, 7; 2026-02-21T09:09:53.1766455Z cvt.s64.s32 %rd42, %r68; 2026-02-21T09:09:53.1766613Z add.s64 %rd38, %rd41, %rd42; 2026-02-21T09:09:53.1766780Z shl.b32 %r69, %r1, 2; 2026-02-21T09:09:53.1766931Z add.s32 %r54, %r53, %r69; 2026-02-21T09:09:53.1767085Z mov.b32 %r71, 0; 2026-02-21T09:09:53.1767219Z // begin inline asm 2026-02-21T09:09:53.1767379Z @%p1 st.shared.b32 [ %r54 + 0 ], %r71; 2026-02-21T09:09:53.1767556Z // end inline asm 2026-02-21T09:09:53.1767694Z bar.warp.sync -1; 2026-02-21T09:09:53.1767846Z setp.eq.b32 %p78, %r1, 0; 2026-02-21T09:09:53.1768003Z cvt.u64.u32 %rd23, %r53; 2026-02-21T09:09:53.1768158Z // begin inline asm 2026-02-21T09:09:53.1768409Z @%p78 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd23 + 0 ], %rd24; 2026-02-21T09:09:53.1768735Z // end inline asm 2026-02-21T09:09:53.1768869Z // begin inline asm 2026-02-21T09:09:53.1769101Z @%p78 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1; 2026-02-21T09:09:53.1769357Z // end inline asm 2026-02-21T09:09:53.1769493Z mov.b32 %r56, 32; 2026-02-21T09:09:53.1769636Z // begin inline asm 2026-02-21T09:09:53.1769868Z @%p78 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0, %r56; 2026-02-21T09:09:53.1770141Z // end inline asm 2026-02-21T09:09:53.1770275Z mov.b32 %r57, 256; 2026-02-21T09:09:53.1770464Z // begin inline asm 2026-02-21T09:09:53.1770703Z @%p78 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1, %r57; 2026-02-21T09:09:53.1770977Z // end inline asm 2026-02-21T09:09:53.1771125Z mov.b32 %r58, 8192; 2026-02-21T09:09:53.1771270Z // begin inline asm 2026-02-21T09:09:53.1771526Z @%p78 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0, %r58; 2026-02-21T09:09:53.1771836Z // end inline asm 2026-02-21T09:09:53.1771983Z mov.b32 %r59, 
4096; 2026-02-21T09:09:53.1772127Z // begin inline asm 2026-02-21T09:09:53.1772377Z @%p78 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1, %r59; 2026-02-21T09:09:53.1772703Z // end inline asm 2026-02-21T09:09:53.1772848Z mov.b64 %rd31, 16384; 2026-02-21T09:09:53.1773009Z // begin inline asm 2026-02-21T09:09:53.1773269Z @%p78 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd23 + 0 ], 0x0, %rd31; 2026-02-21T09:09:53.1773568Z // end inline asm 2026-02-21T09:09:53.1773706Z mov.b32 %r60, 1; 2026-02-21T09:09:53.1773850Z // begin inline asm 2026-02-21T09:09:53.1774107Z @%p78 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0, %r60; 2026-02-21T09:09:53.1774407Z // end inline asm 2026-02-21T09:09:53.1774554Z // begin inline asm 2026-02-21T09:09:53.1774812Z @%p78 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x1, %r60; 2026-02-21T09:09:53.1775140Z // end inline asm 2026-02-21T09:09:53.1775284Z // begin inline asm 2026-02-21T09:09:53.1775531Z @%p78 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd23 + 0 ], 0xa; 2026-02-21T09:09:53.1775804Z // end inline asm 2026-02-21T09:09:53.1775954Z // begin inline asm 2026-02-21T09:09:53.1776221Z @%p78 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0; 2026-02-21T09:09:53.1776511Z // end inline asm 2026-02-21T09:09:53.1776659Z // begin inline asm 2026-02-21T09:09:53.1776909Z @%p78 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x2; 2026-02-21T09:09:53.1777220Z // end inline asm 2026-02-21T09:09:53.1777362Z // begin inline asm 2026-02-21T09:09:53.1777608Z @%p78 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd23 + 0 ], 0x0; 2026-02-21T09:09:53.1777874Z // end inline asm 2026-02-21T09:09:53.1778020Z // begin inline asm 2026-02-21T09:09:53.1778375Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd38 + 0 ], [ %rd23 + 0 
], 0x80; 2026-02-21T09:09:53.1778740Z // end inline asm 2026-02-21T09:09:53.1778878Z // begin inline asm 2026-02-21T09:09:53.1779084Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd38 + 0 ], 0x80; 2026-02-21T09:09:53.1779340Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:53.1779528Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:53.1779708Z // end inline asm 2026-02-21T09:09:53.1779847Z bar.sync 0; 2026-02-21T09:09:53.1779987Z cvta.global.u64 %rd116, %rd38; 2026-02-21T09:09:53.1780268Z .loc 1 29 88 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:29:88 2026-02-21T09:09:53.1780557Z setp.gt.u32 %p21, %r3, 4095; 2026-02-21T09:09:53.1780727Z @%p21 bra $L__BB0_8; 2026-02-21T09:09:53.1780891Z // %bb.1: // %.lr.ph 2026-02-21T09:09:53.1781187Z .loc 1 0 88 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:0:88 2026-02-21T09:09:53.1781502Z ld.param.b64 %rd22, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:53.1781818Z ld.param.b64 %rd21, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:53.1782028Z and.b32 %r4, %r1, 32; 2026-02-21T09:09:53.1782283Z .loc 1 81 38 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:81:38 2026-02-21T09:09:53.1782803Z setp.eq.b32 %p32, %r4, 0; 2026-02-21T09:09:53.1783052Z .loc 1 43 45 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:43:45 2026-02-21T09:09:53.1783339Z shr.u32 %r244, %r1, 2; 2026-02-21T09:09:53.1783530Z bfe.u32 %r5, %r1, 2, 5; 2026-02-21T09:09:53.1783778Z .loc 1 41 45 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:41:45 2026-02-21T09:09:53.1784065Z and.b32 %r245, %r1, 7; 2026-02-21T09:09:53.1784215Z shl.b32 %r6, %r245, 2; 2026-02-21T09:09:53.1784366Z shl.b32 %r246, %r1, 3; 2026-02-21T09:09:53.1784512Z and.b32 %r247, %r246, 24; 2026-02-21T09:09:53.1784670Z shr.u32 %r248, %r1, 5; 2026-02-21T09:09:53.1784917Z .loc 1 51 48 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:51:48 2026-02-21T09:09:53.1785200Z shr.u32 %r249, %r1, 3; 
2026-02-21T09:09:53.1785350Z and.b32 %r7, %r1, 127; 2026-02-21T09:09:53.1785521Z shl.b32 %r250, %r7, 4; 2026-02-21T09:09:53.1785679Z add.s32 %r367, %r53, %r250; 2026-02-21T09:09:53.1785839Z add.s32 %r369, %r367, 2048; 2026-02-21T09:09:53.1786004Z add.s32 %r371, %r367, 4096; 2026-02-21T09:09:53.1786155Z add.s32 %r373, %r367, 6144; 2026-02-21T09:09:53.1786312Z add.s32 %r375, %r367, 8192; 2026-02-21T09:09:53.1786466Z add.s32 %r377, %r367, 10240; 2026-02-21T09:09:53.1786627Z add.s32 %r379, %r367, 12288; 2026-02-21T09:09:53.1786779Z add.s32 %r381, %r367, 14336; 2026-02-21T09:09:53.1786939Z add.s32 %r341, %r735, 64; 2026-02-21T09:09:53.1787101Z shl.b32 %r252, %r1, 10; 2026-02-21T09:09:53.1787252Z and.b32 %r17, %r252, 122880; 2026-02-21T09:09:53.1787417Z shl.b32 %r18, %r7, 2; 2026-02-21T09:09:53.1787592Z add.s32 %r253, %r53, 36864; 2026-02-21T09:09:53.1787756Z add.s32 %r383, %r253, %r18; 2026-02-21T09:09:53.1787907Z shl.b32 %r20, %r7, 6; 2026-02-21T09:09:53.1788058Z and.b32 %r254, %r1, 31; 2026-02-21T09:09:53.1788203Z shr.u32 %r255, %r1, 1; 2026-02-21T09:09:53.1788356Z and.b32 %r256, %r255, 32; 2026-02-21T09:09:53.1788508Z or.b32 %r21, %r256, %r254; 2026-02-21T09:09:53.1788669Z shl.b32 %r257, %r254, 7; 2026-02-21T09:09:53.1788826Z shl.b32 %r258, %r245, 4; 2026-02-21T09:09:53.1788975Z and.b32 %r259, %r249, 12; 2026-02-21T09:09:53.1789134Z or.b32 %r260, %r257, %r258; 2026-02-21T09:09:53.1789286Z or.b32 %r261, %r260, %r259; 2026-02-21T09:09:53.1789444Z add.s32 %r262, %r53, 32768; 2026-02-21T09:09:53.1789594Z add.s32 %r22, %r262, %r261; 2026-02-21T09:09:53.1789750Z xor.b32 %r263, %r261, 16; 2026-02-21T09:09:53.1789898Z add.s32 %r23, %r262, %r263; 2026-02-21T09:09:53.1790056Z xor.b32 %r264, %r261, 32; 2026-02-21T09:09:53.1790203Z add.s32 %r24, %r262, %r264; 2026-02-21T09:09:53.1790361Z xor.b32 %r265, %r261, 48; 2026-02-21T09:09:53.1790515Z add.s32 %r25, %r262, %r265; 2026-02-21T09:09:53.1790663Z xor.b32 %r266, %r261, 64; 2026-02-21T09:09:53.1790820Z add.s32 %r26, %r262, %r266; 
2026-02-21T09:09:53.1790970Z xor.b32 %r267, %r261, 80; 2026-02-21T09:09:53.1791124Z add.s32 %r27, %r262, %r267; 2026-02-21T09:09:53.1791273Z xor.b32 %r268, %r261, 96; 2026-02-21T09:09:53.1791427Z add.s32 %r28, %r262, %r268; 2026-02-21T09:09:53.1791609Z xor.b32 %r269, %r261, 112; 2026-02-21T09:09:53.1791768Z add.s32 %r29, %r262, %r269; 2026-02-21T09:09:53.1791928Z bfe.u32 %r270, %r262, 4, 14; 2026-02-21T09:09:53.1792088Z cvt.u64.u32 %rd61, %r270; 2026-02-21T09:09:53.1792257Z or.b64 %rd72, %rd61, 4611686293322072064; 2026-02-21T09:09:53.1792435Z add.s32 %r271, %r53, 32800; 2026-02-21T09:09:53.1792594Z bfe.u32 %r272, %r271, 4, 14; 2026-02-21T09:09:53.1792747Z cvt.u64.u32 %rd62, %r272; 2026-02-21T09:09:53.1792914Z or.b64 %rd73, %rd62, 4611686293322072064; 2026-02-21T09:09:53.1793092Z add.s32 %r273, %r53, 32832; 2026-02-21T09:09:53.1793286Z bfe.u32 %r274, %r273, 4, 14; 2026-02-21T09:09:53.1793439Z cvt.u64.u32 %rd63, %r274; 2026-02-21T09:09:53.1793605Z or.b64 %rd74, %rd63, 4611686293322072064; 2026-02-21T09:09:53.1793785Z add.s32 %r275, %r53, 32864; 2026-02-21T09:09:53.1793938Z bfe.u32 %r276, %r275, 4, 14; 2026-02-21T09:09:53.1794097Z cvt.u64.u32 %rd64, %r276; 2026-02-21T09:09:53.1794252Z or.b64 %rd75, %rd64, 4611686293322072064; 2026-02-21T09:09:53.1794433Z or.b32 %r30, %r247, 64; 2026-02-21T09:09:53.1794581Z and.b32 %r277, %r246, 48; 2026-02-21T09:09:53.1794801Z or.b32 %r278, %r20, %r277; 2026-02-21T09:09:53.1794956Z xor.b32 %r279, %r278, 16; 2026-02-21T09:09:53.1795115Z xor.b32 %r280, %r278, 32; 2026-02-21T09:09:53.1795266Z xor.b32 %r281, %r278, 48; 2026-02-21T09:09:53.1795530Z .loc 1 29 88 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:29:88 2026-02-21T09:09:53.1795824Z cvt.u64.u32 %rd65, %r247; 2026-02-21T09:09:53.1795984Z mad.lo.s32 %r282, %r7, 48, %r367; 2026-02-21T09:09:53.1796162Z add.s32 %r283, %r253, %r21; 2026-02-21T09:09:53.1796317Z add.s32 %r284, %r53, %r18; 2026-02-21T09:09:53.1796478Z add.s32 %r174, %r284, 37376; 2026-02-21T09:09:53.1796668Z 
add.s32 %r158, %r367, 16384; 2026-02-21T09:09:53.1796829Z add.s32 %r172, %r367, 30720; 2026-02-21T09:09:53.1796979Z add.s32 %r170, %r367, 28672; 2026-02-21T09:09:53.1797134Z add.s32 %r168, %r367, 26624; 2026-02-21T09:09:53.1797290Z add.s32 %r166, %r367, 24576; 2026-02-21T09:09:53.1797441Z add.s32 %r164, %r367, 22528; 2026-02-21T09:09:53.1797597Z add.s32 %r162, %r367, 20480; 2026-02-21T09:09:53.1797748Z add.s32 %r160, %r367, 18432; 2026-02-21T09:09:53.1798007Z .loc 1 36 33 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:36:33 2026-02-21T09:09:53.1798282Z shr.u32 %r35, %r3, 4; 2026-02-21T09:09:53.1798436Z and.b32 %r285, %r35, 240; 2026-02-21T09:09:53.1798713Z .loc 1 38 64 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:38:64 2026-02-21T09:09:53.1798998Z and.b32 %r36, %r3, 15; 2026-02-21T09:09:53.1799250Z .loc 1 38 30 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:38:30 2026-02-21T09:09:53.1799526Z or.b32 %r286, %r285, %r36; 2026-02-21T09:09:53.1799789Z .loc 1 40 27 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:40:27 2026-02-21T09:09:53.1800063Z shl.b32 %r635, %r286, 5; 2026-02-21T09:09:53.1800320Z .loc 1 41 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:41:32 2026-02-21T09:09:53.1800603Z or.b32 %r287, %r635, %r6; 2026-02-21T09:09:53.1800850Z .loc 1 42 27 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:42:27 2026-02-21T09:09:53.1801127Z shl.b32 %r288, %r3, 4; 2026-02-21T09:09:53.1801274Z and.b32 %r636, %r288, 3840; 2026-02-21T09:09:53.1801568Z .loc 1 43 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:43:32 2026-02-21T09:09:53.1801842Z or.b32 %r289, %r636, %r5; 2026-02-21T09:09:53.1801999Z or.b32 %r290, %r244, %r636; 2026-02-21T09:09:53.1802252Z .loc 1 58 53 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:53 2026-02-21T09:09:53.1802534Z shl.b32 %r291, %r289, 10; 2026-02-21T09:09:53.1802690Z shl.b32 %r292, %r290, 10; 2026-02-21T09:09:53.1802839Z or.b32 %r293, %r292, 
229376; 2026-02-21T09:09:53.1802997Z $L__tmp0: 2026-02-21T09:09:53.1803281Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1803636Z shfl.sync.idx.b32 %r39, %r248, 0, 31, -1; 2026-02-21T09:09:53.1803820Z shl.b32 %r294, %r39, 21; 2026-02-21T09:09:53.1803985Z and.b32 %r295, %r294, 6291456; 2026-02-21T09:09:53.1804148Z add.s32 %r634, %r295, %r735; 2026-02-21T09:09:53.1804317Z mov.pred %p36, -1; 2026-02-21T09:09:53.1804473Z // begin inline asm 2026-02-21T09:09:53.1804815Z @%p36 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r634 + 0], {%r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71}; 2026-02-21T09:09:53.1805208Z // end inline asm 2026-02-21T09:09:53.1805349Z // begin inline asm 2026-02-21T09:09:53.1805688Z @%p36 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r634 + 16], {%r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71}; 2026-02-21T09:09:53.1806039Z // end inline asm 2026-02-21T09:09:53.1806185Z // begin inline asm 2026-02-21T09:09:53.1806539Z @%p36 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r634 + 32], {%r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71}; 2026-02-21T09:09:53.1806890Z // end inline asm 2026-02-21T09:09:53.1807036Z // begin inline asm 2026-02-21T09:09:53.1807350Z @%p36 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r634 + 48], {%r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71, %r71}; 2026-02-21T09:09:53.1807707Z // end inline asm 2026-02-21T09:09:53.1807843Z // begin inline asm 2026-02-21T09:09:53.1808005Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:53.1808173Z // end inline asm 2026-02-21T09:09:53.1808305Z bar.sync 0; 2026-02-21T09:09:53.1808470Z $L__tmp1: 2026-02-21T09:09:53.1808714Z .loc 1 50 125 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:50:125 2026-02-21T09:09:53.1809011Z add.s32 %r737, %r53, 37888; 
2026-02-21T09:09:53.1809168Z // begin inline asm 2026-02-21T09:09:53.1809342Z @%p78 mbarrier.init.shared::cta.b64 [%r737], 1; 2026-02-21T09:09:53.1809533Z // end inline asm 2026-02-21T09:09:53.1809672Z bar.sync 0; 2026-02-21T09:09:53.1809814Z add.s32 %r139, %r53, 37896; 2026-02-21T09:09:53.1809967Z // begin inline asm 2026-02-21T09:09:53.1810143Z @%p78 mbarrier.init.shared::cta.b64 [%r139], 1; 2026-02-21T09:09:53.1810328Z // end inline asm 2026-02-21T09:09:53.1810612Z .loc 1 58 60 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:60 2026-02-21T09:09:53.1810892Z or.b32 %r296, %r291, %r247; 2026-02-21T09:09:53.1811057Z or.b32 %r297, %r293, %r247; 2026-02-21T09:09:53.1811318Z .loc 1 58 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:32 2026-02-21T09:09:53.1811637Z mad.wide.u32 %rd43, %r296, 2, %rd21; 2026-02-21T09:09:53.1811823Z cvt.u64.u32 %rd6, %r291; 2026-02-21T09:09:53.1811983Z or.b64 %rd66, %rd6, %rd65; 2026-02-21T09:09:53.1812147Z shl.b64 %rd67, %rd66, 1; 2026-02-21T09:09:53.1812301Z add.s64 %rd7, %rd21, %rd67; 2026-02-21T09:09:53.1812463Z add.s64 %rd44, %rd7, 65536; 2026-02-21T09:09:53.1812617Z add.s64 %rd45, %rd7, 131072; 2026-02-21T09:09:53.1812783Z add.s64 %rd46, %rd7, 196608; 2026-02-21T09:09:53.1812937Z add.s64 %rd47, %rd7, 262144; 2026-02-21T09:09:53.1813097Z add.s64 %rd48, %rd7, 327680; 2026-02-21T09:09:53.1813255Z add.s64 %rd49, %rd7, 393216; 2026-02-21T09:09:53.1813415Z mad.wide.u32 %rd50, %r297, 2, %rd21; 2026-02-21T09:09:53.1813597Z mov.b32 %r368, 16; 2026-02-21T09:09:53.1813846Z .loc 1 58 80 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:80 2026-02-21T09:09:53.1814129Z // begin inline asm 2026-02-21T09:09:53.1814333Z cp.async.cg.shared.global [ %r367 + 0 ], [ %rd43 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1814566Z // end inline asm 2026-02-21T09:09:53.1814700Z // begin inline asm 2026-02-21T09:09:53.1814904Z cp.async.cg.shared.global [ %r369 + 0 ], [ %rd44 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1815144Z // 
end inline asm 2026-02-21T09:09:53.1815288Z // begin inline asm 2026-02-21T09:09:53.1815500Z cp.async.cg.shared.global [ %r371 + 0 ], [ %rd45 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1815731Z // end inline asm 2026-02-21T09:09:53.1815878Z // begin inline asm 2026-02-21T09:09:53.1816075Z cp.async.cg.shared.global [ %r373 + 0 ], [ %rd46 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1816312Z // end inline asm 2026-02-21T09:09:53.1816452Z // begin inline asm 2026-02-21T09:09:53.1816671Z cp.async.cg.shared.global [ %r375 + 0 ], [ %rd47 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1816943Z // end inline asm 2026-02-21T09:09:53.1817082Z // begin inline asm 2026-02-21T09:09:53.1817286Z cp.async.cg.shared.global [ %r377 + 0 ], [ %rd48 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1817515Z // end inline asm 2026-02-21T09:09:53.1817662Z // begin inline asm 2026-02-21T09:09:53.1817859Z cp.async.cg.shared.global [ %r379 + 0 ], [ %rd49 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1818094Z // end inline asm 2026-02-21T09:09:53.1818232Z // begin inline asm 2026-02-21T09:09:53.1818471Z cp.async.cg.shared.global [ %r381 + 0 ], [ %rd50 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1818704Z // end inline asm 2026-02-21T09:09:53.1818860Z cp.async.commit_group; 2026-02-21T09:09:53.1819132Z .loc 1 64 62 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:62 2026-02-21T09:09:53.1819426Z or.b32 %r298, %r287, %r17; 2026-02-21T09:09:53.1819708Z .loc 1 64 34 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:34 2026-02-21T09:09:53.1820003Z cvt.u64.u32 %rd68, %r298; 2026-02-21T09:09:53.1820178Z add.s64 %rd51, %rd22, %rd68; 2026-02-21T09:09:53.1820340Z mov.b32 %r157, 4; 2026-02-21T09:09:53.1820631Z .loc 1 64 87 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:87 2026-02-21T09:09:53.1820925Z // begin inline asm 2026-02-21T09:09:53.1821135Z cp.async.ca.shared.global [ %r383 + 0 ], [ %rd51 + 0 ], 0x4, %r157; 2026-02-21T09:09:53.1821377Z // end inline asm 2026-02-21T09:09:53.1821529Z 
cp.async.commit_group; 2026-02-21T09:09:53.1821829Z .loc 1 58 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:32 2026-02-21T09:09:53.1822122Z add.s64 %rd52, %rd7, 64; 2026-02-21T09:09:53.1822288Z or.b32 %r299, %r296, 32; 2026-02-21T09:09:53.1822452Z mad.wide.u32 %rd69, %r299, 2, %rd21; 2026-02-21T09:09:53.1822639Z add.s64 %rd53, %rd69, 65536; 2026-02-21T09:09:53.1822840Z add.s64 %rd54, %rd69, 131072; 2026-02-21T09:09:53.1823011Z add.s64 %rd55, %rd69, 196608; 2026-02-21T09:09:53.1823185Z add.s64 %rd56, %rd69, 262144; 2026-02-21T09:09:53.1823345Z add.s64 %rd57, %rd69, 327680; 2026-02-21T09:09:53.1823512Z add.s64 %rd58, %rd69, 393216; 2026-02-21T09:09:53.1823681Z cvt.u64.u32 %rd9, %r293; 2026-02-21T09:09:53.1823839Z or.b64 %rd70, %rd9, %rd65; 2026-02-21T09:09:53.1823993Z shl.b64 %rd71, %rd70, 1; 2026-02-21T09:09:53.1824150Z add.s64 %rd10, %rd21, %rd71; 2026-02-21T09:09:53.1824305Z add.s64 %rd59, %rd10, 64; 2026-02-21T09:09:53.1824568Z .loc 1 58 80 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:80 2026-02-21T09:09:53.1824849Z bar.sync 0; 2026-02-21T09:09:53.1824981Z // begin inline asm 2026-02-21T09:09:53.1825187Z cp.async.cg.shared.global [ %r158 + 0 ], [ %rd52 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1825407Z // end inline asm 2026-02-21T09:09:53.1825548Z // begin inline asm 2026-02-21T09:09:53.1825743Z cp.async.cg.shared.global [ %r160 + 0 ], [ %rd53 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1825970Z // end inline asm 2026-02-21T09:09:53.1826104Z // begin inline asm 2026-02-21T09:09:53.1826304Z cp.async.cg.shared.global [ %r162 + 0 ], [ %rd54 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1826531Z // end inline asm 2026-02-21T09:09:53.1826668Z // begin inline asm 2026-02-21T09:09:53.1826870Z cp.async.cg.shared.global [ %r164 + 0 ], [ %rd55 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1827086Z // end inline asm 2026-02-21T09:09:53.1827225Z // begin inline asm 2026-02-21T09:09:53.1827413Z cp.async.cg.shared.global [ %r166 + 0 ], [ %rd56 + 0 ], 0x10, 
%r368; 2026-02-21T09:09:53.1827638Z // end inline asm 2026-02-21T09:09:53.1827769Z // begin inline asm 2026-02-21T09:09:53.1827968Z cp.async.cg.shared.global [ %r168 + 0 ], [ %rd57 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1828192Z // end inline asm 2026-02-21T09:09:53.1828323Z // begin inline asm 2026-02-21T09:09:53.1828521Z cp.async.cg.shared.global [ %r170 + 0 ], [ %rd58 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1828768Z // end inline asm 2026-02-21T09:09:53.1828909Z // begin inline asm 2026-02-21T09:09:53.1829099Z cp.async.cg.shared.global [ %r172 + 0 ], [ %rd59 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1829323Z // end inline asm 2026-02-21T09:09:53.1829463Z cp.async.commit_group; 2026-02-21T09:09:53.1829723Z .loc 1 64 34 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:34 2026-02-21T09:09:53.1830007Z add.s64 %rd60, %rd51, 131072; 2026-02-21T09:09:53.1830262Z .loc 1 64 87 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:87 2026-02-21T09:09:53.1830573Z // begin inline asm 2026-02-21T09:09:53.1830770Z cp.async.ca.shared.global [ %r174 + 0 ], [ %rd60 + 0 ], 0x4, %r157; 2026-02-21T09:09:53.1831000Z // end inline asm 2026-02-21T09:09:53.1831140Z cp.async.commit_group; 2026-02-21T09:09:53.1831393Z .loc 1 58 80 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:80 2026-02-21T09:09:53.1831714Z cp.async.wait_group 2; 2026-02-21T09:09:53.1831868Z bar.sync 0; 2026-02-21T09:09:53.1832046Z ld.shared.v4.b32 {%r300, %r301, %r302, %r303}, [%r282]; 2026-02-21T09:09:53.1832254Z mov.b32 {%rs1, %rs2}, %r303; 2026-02-21T09:09:53.1832460Z mov.b32 {%rs3, %rs4}, %r302; 2026-02-21T09:09:53.1832620Z mov.b32 {%rs5, %rs6}, %r301; 2026-02-21T09:09:53.1832781Z mov.b32 {%rs7, %rs8}, %r300; 2026-02-21T09:09:53.1832972Z ld.shared.v4.b32 {%r304, %r305, %r306, %r307}, [%r282+16]; 2026-02-21T09:09:53.1833191Z mov.b32 {%rs9, %rs10}, %r307; 2026-02-21T09:09:53.1833364Z mov.b32 {%rs11, %rs12}, %r306; 2026-02-21T09:09:53.1833532Z mov.b32 {%rs13, %rs14}, %r305; 
2026-02-21T09:09:53.1833703Z mov.b32 {%rs15, %rs16}, %r304; 2026-02-21T09:09:53.1833895Z ld.shared.v4.b32 {%r308, %r309, %r310, %r311}, [%r282+32]; 2026-02-21T09:09:53.1834106Z mov.b32 {%rs17, %rs18}, %r311; 2026-02-21T09:09:53.1834260Z mov.b32 {%rs19, %rs20}, %r310; 2026-02-21T09:09:53.1834448Z mov.b32 {%rs21, %rs22}, %r309; 2026-02-21T09:09:53.1834610Z mov.b32 {%rs23, %rs24}, %r308; 2026-02-21T09:09:53.1834812Z ld.shared.v4.b32 {%r312, %r313, %r314, %r315}, [%r282+48]; 2026-02-21T09:09:53.1835024Z mov.b32 {%rs25, %rs26}, %r315; 2026-02-21T09:09:53.1835186Z mov.b32 {%rs27, %rs28}, %r314; 2026-02-21T09:09:53.1835353Z mov.b32 {%rs29, %rs30}, %r313; 2026-02-21T09:09:53.1835512Z mov.b32 {%rs31, %rs32}, %r312; 2026-02-21T09:09:53.1835719Z ld.shared.v4.b32 {%r316, %r317, %r318, %r319}, [%r282+8192]; 2026-02-21T09:09:53.1835931Z mov.b32 {%rs33, %rs34}, %r319; 2026-02-21T09:09:53.1836104Z mov.b32 {%rs35, %rs36}, %r318; 2026-02-21T09:09:53.1836265Z mov.b32 {%rs37, %rs38}, %r317; 2026-02-21T09:09:53.1836437Z mov.b32 {%rs39, %rs40}, %r316; 2026-02-21T09:09:53.1836636Z ld.shared.v4.b32 {%r320, %r321, %r322, %r323}, [%r282+8208]; 2026-02-21T09:09:53.1836853Z mov.b32 {%rs41, %rs42}, %r323; 2026-02-21T09:09:53.1837021Z mov.b32 {%rs43, %rs44}, %r322; 2026-02-21T09:09:53.1837181Z mov.b32 {%rs45, %rs46}, %r321; 2026-02-21T09:09:53.1837347Z mov.b32 {%rs47, %rs48}, %r320; 2026-02-21T09:09:53.1837539Z ld.shared.v4.b32 {%r324, %r325, %r326, %r327}, [%r282+8224]; 2026-02-21T09:09:53.1837752Z mov.b32 {%rs49, %rs50}, %r327; 2026-02-21T09:09:53.1837911Z mov.b32 {%rs51, %rs52}, %r326; 2026-02-21T09:09:53.1838077Z mov.b32 {%rs53, %rs54}, %r325; 2026-02-21T09:09:53.1838234Z mov.b32 {%rs55, %rs56}, %r324; 2026-02-21T09:09:53.1838437Z ld.shared.v4.b32 {%r328, %r329, %r330, %r331}, [%r282+8240]; 2026-02-21T09:09:53.1838652Z mov.b32 {%rs57, %rs58}, %r331; 2026-02-21T09:09:53.1838812Z mov.b32 {%rs59, %rs60}, %r330; 2026-02-21T09:09:53.1838981Z mov.b32 {%rs61, %rs62}, %r329; 
2026-02-21T09:09:53.1839140Z mov.b32 {%rs63, %rs64}, %r328; 2026-02-21T09:09:53.1839410Z .loc 1 62 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:62:32 2026-02-21T09:09:53.1839694Z cvt.f32.bf16 %r177, %rs7; 2026-02-21T09:09:53.1839863Z cvt.f32.bf16 %r178, %rs8; 2026-02-21T09:09:53.1840024Z cvt.f32.bf16 %r179, %rs5; 2026-02-21T09:09:53.1840241Z cvt.f32.bf16 %r180, %rs6; 2026-02-21T09:09:53.1840400Z cvt.f32.bf16 %r181, %rs3; 2026-02-21T09:09:53.1840549Z cvt.f32.bf16 %r182, %rs4; 2026-02-21T09:09:53.1840706Z cvt.f32.bf16 %r183, %rs1; 2026-02-21T09:09:53.1840854Z cvt.f32.bf16 %r184, %rs2; 2026-02-21T09:09:53.1841013Z cvt.f32.bf16 %r185, %rs15; 2026-02-21T09:09:53.1841167Z cvt.f32.bf16 %r186, %rs16; 2026-02-21T09:09:53.1841327Z cvt.f32.bf16 %r187, %rs13; 2026-02-21T09:09:53.1841480Z cvt.f32.bf16 %r188, %rs14; 2026-02-21T09:09:53.1841706Z cvt.f32.bf16 %r189, %rs11; 2026-02-21T09:09:53.1841856Z cvt.f32.bf16 %r190, %rs12; 2026-02-21T09:09:53.1842011Z cvt.f32.bf16 %r191, %rs9; 2026-02-21T09:09:53.1842165Z cvt.f32.bf16 %r192, %rs10; 2026-02-21T09:09:53.1842312Z cvt.f32.bf16 %r194, %rs23; 2026-02-21T09:09:53.1842469Z cvt.f32.bf16 %r195, %rs24; 2026-02-21T09:09:53.1842619Z cvt.f32.bf16 %r196, %rs21; 2026-02-21T09:09:53.1842774Z cvt.f32.bf16 %r197, %rs22; 2026-02-21T09:09:53.1842925Z cvt.f32.bf16 %r198, %rs19; 2026-02-21T09:09:53.1843082Z cvt.f32.bf16 %r199, %rs20; 2026-02-21T09:09:53.1843231Z cvt.f32.bf16 %r200, %rs17; 2026-02-21T09:09:53.1843387Z cvt.f32.bf16 %r201, %rs18; 2026-02-21T09:09:53.1843567Z cvt.f32.bf16 %r202, %rs31; 2026-02-21T09:09:53.1843729Z cvt.f32.bf16 %r203, %rs32; 2026-02-21T09:09:53.1843885Z cvt.f32.bf16 %r204, %rs29; 2026-02-21T09:09:53.1844035Z cvt.f32.bf16 %r205, %rs30; 2026-02-21T09:09:53.1844194Z cvt.f32.bf16 %r206, %rs27; 2026-02-21T09:09:53.1844346Z cvt.f32.bf16 %r207, %rs28; 2026-02-21T09:09:53.1844504Z cvt.f32.bf16 %r208, %rs25; 2026-02-21T09:09:53.1844654Z cvt.f32.bf16 %r209, %rs26; 2026-02-21T09:09:53.1844813Z cvt.f32.bf16 %r211, 
%rs39; 2026-02-21T09:09:53.1844966Z cvt.f32.bf16 %r212, %rs40; 2026-02-21T09:09:53.1845130Z cvt.f32.bf16 %r213, %rs37; 2026-02-21T09:09:53.1845280Z cvt.f32.bf16 %r214, %rs38; 2026-02-21T09:09:53.1845434Z cvt.f32.bf16 %r215, %rs35; 2026-02-21T09:09:53.1845589Z cvt.f32.bf16 %r216, %rs36; 2026-02-21T09:09:53.1845768Z cvt.f32.bf16 %r217, %rs33; 2026-02-21T09:09:53.1845930Z cvt.f32.bf16 %r218, %rs34; 2026-02-21T09:09:53.1846079Z cvt.f32.bf16 %r219, %rs47; 2026-02-21T09:09:53.1846236Z cvt.f32.bf16 %r220, %rs48; 2026-02-21T09:09:53.1846387Z cvt.f32.bf16 %r221, %rs45; 2026-02-21T09:09:53.1846545Z cvt.f32.bf16 %r222, %rs46; 2026-02-21T09:09:53.1846694Z cvt.f32.bf16 %r223, %rs43; 2026-02-21T09:09:53.1846853Z cvt.f32.bf16 %r224, %rs44; 2026-02-21T09:09:53.1847011Z cvt.f32.bf16 %r225, %rs41; 2026-02-21T09:09:53.1847159Z cvt.f32.bf16 %r226, %rs42; 2026-02-21T09:09:53.1847320Z cvt.f32.bf16 %r228, %rs55; 2026-02-21T09:09:53.1847472Z cvt.f32.bf16 %r229, %rs56; 2026-02-21T09:09:53.1847630Z cvt.f32.bf16 %r230, %rs53; 2026-02-21T09:09:53.1847779Z cvt.f32.bf16 %r231, %rs54; 2026-02-21T09:09:53.1847936Z cvt.f32.bf16 %r232, %rs51; 2026-02-21T09:09:53.1848084Z cvt.f32.bf16 %r233, %rs52; 2026-02-21T09:09:53.1848240Z cvt.f32.bf16 %r234, %rs49; 2026-02-21T09:09:53.1848391Z cvt.f32.bf16 %r235, %rs50; 2026-02-21T09:09:53.1848550Z cvt.f32.bf16 %r236, %rs63; 2026-02-21T09:09:53.1848707Z cvt.f32.bf16 %r237, %rs64; 2026-02-21T09:09:53.1848856Z cvt.f32.bf16 %r238, %rs61; 2026-02-21T09:09:53.1849011Z cvt.f32.bf16 %r239, %rs62; 2026-02-21T09:09:53.1849163Z cvt.f32.bf16 %r240, %rs59; 2026-02-21T09:09:53.1849321Z cvt.f32.bf16 %r241, %rs60; 2026-02-21T09:09:53.1849469Z cvt.f32.bf16 %r242, %rs57; 2026-02-21T09:09:53.1849624Z cvt.f32.bf16 %r243, %rs58; 2026-02-21T09:09:53.1849770Z $L__tmp2: 2026-02-21T09:09:53.1850067Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1850399Z add.s32 %r176, %r295, %r341; 
2026-02-21T09:09:53.1850560Z // begin inline asm 2026-02-21T09:09:53.1850942Z @%p36 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r176 + 0], {%r177, %r178, %r179, %r180, %r181, %r182, %r183, %r184, %r185, %r186, %r187, %r188, %r189, %r190, %r191, %r192}; 2026-02-21T09:09:53.1851334Z // end inline asm 2026-02-21T09:09:53.1851514Z // begin inline asm 2026-02-21T09:09:53.1851893Z @%p36 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r176 + 16], {%r194, %r195, %r196, %r197, %r198, %r199, %r200, %r201, %r202, %r203, %r204, %r205, %r206, %r207, %r208, %r209}; 2026-02-21T09:09:53.1852278Z // end inline asm 2026-02-21T09:09:53.1852422Z // begin inline asm 2026-02-21T09:09:53.1852772Z @%p36 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r176 + 32], {%r211, %r212, %r213, %r214, %r215, %r216, %r217, %r218, %r219, %r220, %r221, %r222, %r223, %r224, %r225, %r226}; 2026-02-21T09:09:53.1853197Z // end inline asm 2026-02-21T09:09:53.1853332Z // begin inline asm 2026-02-21T09:09:53.1853687Z @%p36 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r176 + 48], {%r228, %r229, %r230, %r231, %r232, %r233, %r234, %r235, %r236, %r237, %r238, %r239, %r240, %r241, %r242, %r243}; 2026-02-21T09:09:53.1854064Z // end inline asm 2026-02-21T09:09:53.1854209Z // begin inline asm 2026-02-21T09:09:53.1854373Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:53.1854538Z // end inline asm 2026-02-21T09:09:53.1854683Z bar.sync 0; 2026-02-21T09:09:53.1854813Z $L__tmp3: 2026-02-21T09:09:53.1855075Z .loc 1 64 87 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:87 2026-02-21T09:09:53.1855391Z ld.shared.b8 %rs65, [%r283]; 2026-02-21T09:09:53.1855567Z ld.shared.b8 %rs66, [%r283+64]; 2026-02-21T09:09:53.1855742Z ld.shared.b8 %rs67, [%r283+128]; 2026-02-21T09:09:53.1855922Z ld.shared.b8 %rs68, [%r283+192]; 2026-02-21T09:09:53.1856097Z ld.shared.b8 %rs69, [%r283+256]; 2026-02-21T09:09:53.1856262Z ld.shared.b8 %rs70, [%r283+320]; 2026-02-21T09:09:53.1856430Z ld.shared.b8 %rs71, [%r283+384]; 2026-02-21T09:09:53.1856592Z 
ld.shared.b8 %rs72, [%r283+448]; 2026-02-21T09:09:53.1856868Z .loc 1 67 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:67:28 2026-02-21T09:09:53.1857148Z shl.b16 %rs73, %rs65, 4; 2026-02-21T09:09:53.1857311Z shl.b16 %rs74, %rs66, 4; 2026-02-21T09:09:53.1857492Z shl.b16 %rs75, %rs67, 4; 2026-02-21T09:09:53.1857653Z shl.b16 %rs76, %rs68, 4; 2026-02-21T09:09:53.1857802Z shl.b16 %rs77, %rs69, 4; 2026-02-21T09:09:53.1857957Z shl.b16 %rs78, %rs70, 4; 2026-02-21T09:09:53.1858116Z shl.b16 %rs79, %rs71, 4; 2026-02-21T09:09:53.1858269Z shl.b16 %rs80, %rs72, 4; 2026-02-21T09:09:53.1858540Z .loc 1 82 58 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:82:58 2026-02-21T09:09:53.1858843Z selp.b16 %rs81, %rs73, %rs65, %p32; 2026-02-21T09:09:53.1859029Z cvt.s16.s8 %rs82, %rs81; 2026-02-21T09:09:53.1859184Z shr.s16 %rs83, %rs82, 4; 2026-02-21T09:09:53.1859356Z selp.b16 %rs84, %rs74, %rs66, %p32; 2026-02-21T09:09:53.1859531Z cvt.s16.s8 %rs85, %rs84; 2026-02-21T09:09:53.1859693Z shr.s16 %rs86, %rs85, 4; 2026-02-21T09:09:53.1859859Z selp.b16 %rs87, %rs75, %rs67, %p32; 2026-02-21T09:09:53.1860034Z cvt.s16.s8 %rs88, %rs87; 2026-02-21T09:09:53.1860194Z shr.s16 %rs89, %rs88, 4; 2026-02-21T09:09:53.1860352Z selp.b16 %rs90, %rs76, %rs68, %p32; 2026-02-21T09:09:53.1860534Z cvt.s16.s8 %rs91, %rs90; 2026-02-21T09:09:53.1860686Z shr.s16 %rs92, %rs91, 4; 2026-02-21T09:09:53.1860853Z selp.b16 %rs93, %rs77, %rs69, %p32; 2026-02-21T09:09:53.1861026Z cvt.s16.s8 %rs94, %rs93; 2026-02-21T09:09:53.1861187Z shr.s16 %rs95, %rs94, 4; 2026-02-21T09:09:53.1861351Z selp.b16 %rs96, %rs78, %rs70, %p32; 2026-02-21T09:09:53.1861525Z cvt.s16.s8 %rs97, %rs96; 2026-02-21T09:09:53.1861715Z shr.s16 %rs98, %rs97, 4; 2026-02-21T09:09:53.1861875Z selp.b16 %rs99, %rs79, %rs71, %p32; 2026-02-21T09:09:53.1862059Z cvt.s16.s8 %rs100, %rs99; 2026-02-21T09:09:53.1862225Z shr.s16 %rs101, %rs100, 4; 2026-02-21T09:09:53.1862405Z selp.b16 %rs102, %rs80, %rs72, %p32; 2026-02-21T09:09:53.1862586Z cvt.s16.s8 
%rs103, %rs102; 2026-02-21T09:09:53.1862756Z shr.s16 %rs104, %rs103, 4; 2026-02-21T09:09:53.1863027Z .loc 1 87 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:87:32 2026-02-21T09:09:53.1863333Z cvt.rn.f32.s16 %r332, %rs83; 2026-02-21T09:09:53.1863540Z cvt.rn.f32.s16 %r333, %rs86; 2026-02-21T09:09:53.1863705Z cvt.rn.f32.s16 %r334, %rs89; 2026-02-21T09:09:53.1863876Z cvt.rn.f32.s16 %r335, %rs92; 2026-02-21T09:09:53.1864041Z cvt.rn.f32.s16 %r336, %rs95; 2026-02-21T09:09:53.1864211Z cvt.rn.f32.s16 %r337, %rs98; 2026-02-21T09:09:53.1864380Z cvt.rn.f32.s16 %r338, %rs101; 2026-02-21T09:09:53.1864558Z cvt.rn.f32.s16 %r339, %rs104; 2026-02-21T09:09:53.1864725Z st.shared.b32 [%r22], %r332; 2026-02-21T09:09:53.1864899Z st.shared.b32 [%r23], %r333; 2026-02-21T09:09:53.1865096Z st.shared.b32 [%r24], %r334; 2026-02-21T09:09:53.1865259Z st.shared.b32 [%r25], %r335; 2026-02-21T09:09:53.1865427Z st.shared.b32 [%r26], %r336; 2026-02-21T09:09:53.1865593Z st.shared.b32 [%r27], %r337; 2026-02-21T09:09:53.1865763Z st.shared.b32 [%r28], %r338; 2026-02-21T09:09:53.1865926Z st.shared.b32 [%r29], %r339; 2026-02-21T09:09:53.1866100Z $L__tmp4: 2026-02-21T09:09:53.1866387Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1866724Z // begin inline asm 2026-02-21T09:09:53.1866893Z fence.proxy.async.shared::cta; 2026-02-21T09:09:53.1867102Z // end inline asm 2026-02-21T09:09:53.1867247Z bar.sync 0; 2026-02-21T09:09:53.1867389Z setp.ne.b32 %p33, %r39, 0; 2026-02-21T09:09:53.1867558Z @%p33 bra $L__BB0_3; 2026-02-21T09:09:53.1867704Z // %bb.2: 2026-02-21T09:09:53.1867852Z elect.sync %r364|%p35, -1; 2026-02-21T09:09:53.1868015Z mov.b32 %r342, 134744336; 2026-02-21T09:09:53.1868183Z mov.pred %p34, 0; 2026-02-21T09:09:53.1868329Z // begin inline asm 2026-02-21T09:09:53.1868580Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 0 ], [ %r341 + 0 ], %rd72, %r342, %p34; 2026-02-21T09:09:53.1868856Z // end inline asm 
2026-02-21T09:09:53.1869000Z // begin inline asm 2026-02-21T09:09:53.1869269Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 0 ], [ %r341 + 8 ], %rd73, %r342, %p36; 2026-02-21T09:09:53.1869523Z // end inline asm 2026-02-21T09:09:53.1869665Z // begin inline asm 2026-02-21T09:09:53.1869892Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 0 ], [ %r341 + 16 ], %rd74, %r342, %p36; 2026-02-21T09:09:53.1870153Z // end inline asm 2026-02-21T09:09:53.1870291Z // begin inline asm 2026-02-21T09:09:53.1870511Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 0 ], [ %r341 + 24 ], %rd75, %r342, %p36; 2026-02-21T09:09:53.1870772Z // end inline asm 2026-02-21T09:09:53.1870905Z // begin inline asm 2026-02-21T09:09:53.1871140Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 32 ], [ %r341 + 32 ], %rd72, %r342, %p34; 2026-02-21T09:09:53.1871393Z // end inline asm 2026-02-21T09:09:53.1871558Z // begin inline asm 2026-02-21T09:09:53.1871791Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 32 ], [ %r341 + 40 ], %rd73, %r342, %p36; 2026-02-21T09:09:53.1872049Z // end inline asm 2026-02-21T09:09:53.1872188Z // begin inline asm 2026-02-21T09:09:53.1872411Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 32 ], [ %r341 + 48 ], %rd74, %r342, %p36; 2026-02-21T09:09:53.1872675Z // end inline asm 2026-02-21T09:09:53.1872808Z // begin inline asm 2026-02-21T09:09:53.1873042Z @%p35 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 32 ], [ %r341 + 56 ], %rd75, %r342, %p36; 2026-02-21T09:09:53.1873303Z // end inline asm 2026-02-21T09:09:53.1873443Z add.s32 %r366, %r53, 37888; 2026-02-21T09:09:53.1873615Z cvt.u64.u32 %rd80, %r366; 2026-02-21T09:09:53.1873773Z // begin inline asm 2026-02-21T09:09:53.1873992Z @%p35 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd80]; 2026-02-21T09:09:53.1874227Z // end inline asm 2026-02-21T09:09:53.1874376Z $L__tmp5: 2026-02-21T09:09:53.1874509Z $L__BB0_3: 2026-02-21T09:09:53.1874677Z .loc 2 0 36 // standard.py:0:36 
2026-02-21T09:09:53.1874875Z add.s32 %r31, %r53, %r278; 2026-02-21T09:09:53.1875037Z add.s32 %r32, %r53, %r279; 2026-02-21T09:09:53.1875194Z add.s32 %r33, %r53, %r280; 2026-02-21T09:09:53.1875344Z add.s32 %r34, %r53, %r281; 2026-02-21T09:09:53.1875643Z .loc 1 58 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:32 2026-02-21T09:09:53.1875931Z add.s64 %rd81, %rd7, 128; 2026-02-21T09:09:53.1876099Z cvt.u64.u32 %rd91, %r30; 2026-02-21T09:09:53.1876257Z add.s64 %rd92, %rd6, %rd91; 2026-02-21T09:09:53.1876427Z shl.b64 %rd93, %rd92, 1; 2026-02-21T09:09:53.1876586Z add.s64 %rd94, %rd21, %rd93; 2026-02-21T09:09:53.1876756Z add.s64 %rd82, %rd94, 65536; 2026-02-21T09:09:53.1876926Z add.s64 %rd83, %rd94, 131072; 2026-02-21T09:09:53.1877111Z add.s64 %rd84, %rd94, 196608; 2026-02-21T09:09:53.1877276Z add.s64 %rd85, %rd94, 262144; 2026-02-21T09:09:53.1877434Z add.s64 %rd86, %rd94, 327680; 2026-02-21T09:09:53.1877601Z add.s64 %rd87, %rd94, 393216; 2026-02-21T09:09:53.1877762Z add.s64 %rd88, %rd10, 128; 2026-02-21T09:09:53.1878028Z .loc 1 58 80 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:80 2026-02-21T09:09:53.1878307Z // begin inline asm 2026-02-21T09:09:53.1878526Z cp.async.cg.shared.global [ %r367 + 0 ], [ %rd81 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1878765Z // end inline asm 2026-02-21T09:09:53.1878906Z // begin inline asm 2026-02-21T09:09:53.1879141Z cp.async.cg.shared.global [ %r369 + 0 ], [ %rd82 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1879364Z // end inline asm 2026-02-21T09:09:53.1879508Z // begin inline asm 2026-02-21T09:09:53.1879707Z cp.async.cg.shared.global [ %r371 + 0 ], [ %rd83 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1879934Z // end inline asm 2026-02-21T09:09:53.1880074Z // begin inline asm 2026-02-21T09:09:53.1880277Z cp.async.cg.shared.global [ %r373 + 0 ], [ %rd84 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1880506Z // end inline asm 2026-02-21T09:09:53.1880646Z // begin inline asm 2026-02-21T09:09:53.1880844Z 
cp.async.cg.shared.global [ %r375 + 0 ], [ %rd85 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1881064Z // end inline asm 2026-02-21T09:09:53.1881257Z // begin inline asm 2026-02-21T09:09:53.1881453Z cp.async.cg.shared.global [ %r377 + 0 ], [ %rd86 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1881710Z // end inline asm 2026-02-21T09:09:53.1881842Z // begin inline asm 2026-02-21T09:09:53.1882040Z cp.async.cg.shared.global [ %r379 + 0 ], [ %rd87 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1882255Z // end inline asm 2026-02-21T09:09:53.1882398Z // begin inline asm 2026-02-21T09:09:53.1882594Z cp.async.cg.shared.global [ %r381 + 0 ], [ %rd88 + 0 ], 0x10, %r368; 2026-02-21T09:09:53.1882810Z // end inline asm 2026-02-21T09:09:53.1882960Z cp.async.commit_group; 2026-02-21T09:09:53.1883213Z .loc 1 64 34 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:34 2026-02-21T09:09:53.1883499Z add.s64 %rd89, %rd51, 262144; 2026-02-21T09:09:53.1883757Z .loc 1 64 87 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:87 2026-02-21T09:09:53.1884051Z // begin inline asm 2026-02-21T09:09:53.1884261Z cp.async.ca.shared.global [ %r383 + 0 ], [ %rd89 + 0 ], 0x4, %r157; 2026-02-21T09:09:53.1884482Z // end inline asm 2026-02-21T09:09:53.1884631Z cp.async.commit_group; 2026-02-21T09:09:53.1884892Z .loc 1 50 125 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:50:125 2026-02-21T09:09:53.1885181Z cvt.u16.u32 %rs105, %r3; 2026-02-21T09:09:53.1885336Z shr.u16 %rs106, %rs105, 8; 2026-02-21T09:09:53.1885502Z and.b16 %rs107, %rs106, 15; 2026-02-21T09:09:53.1885668Z mad.wide.u16 %r388, %rs107, 512, %r17; 2026-02-21T09:09:53.1885850Z shl.b32 %r389, %r36, 5; 2026-02-21T09:09:53.1886009Z add.s32 %r390, %r388, %r389; 2026-02-21T09:09:53.1886164Z add.s32 %r391, %r390, %r6; 2026-02-21T09:09:53.1886325Z add.s32 %r392, %r391, 393216; 2026-02-21T09:09:53.1886481Z cvt.u64.u32 %rd95, %r392; 2026-02-21T09:09:53.1886642Z add.s64 %rd246, %rd22, %rd95; 2026-02-21T09:09:53.1886799Z shl.b64 %rd96, 
%rd9, 1; 2026-02-21T09:09:53.1886956Z add.s64 %rd12, %rd96, 192; 2026-02-21T09:09:53.1887108Z and.b32 %r393, %r1, 3; 2026-02-21T09:09:53.1887309Z mad.wide.u32 %rd245, %r393, 16, %rd21; 2026-02-21T09:09:53.1887483Z and.b32 %r394, %r35, 15; 2026-02-21T09:09:53.1887644Z shl.b32 %r395, %r394, 18; 2026-02-21T09:09:53.1887800Z shl.b32 %r396, %r5, 10; 2026-02-21T09:09:53.1887951Z or.b32 %r397, %r395, %r396; 2026-02-21T09:09:53.1888118Z mul.wide.u32 %rd14, %r397, 2; 2026-02-21T09:09:53.1888273Z mov.b32 %r739, 1; 2026-02-21T09:09:53.1888415Z mov.b32 %r736, 0; 2026-02-21T09:09:53.1888555Z mov.b64 %rd247, -16; 2026-02-21T09:09:53.1888708Z mov.b32 %r738, %r736; 2026-02-21T09:09:53.1888883Z mov.b32 %r740, %r736; 2026-02-21T09:09:53.1889031Z bra.uni $L__BB0_4; 2026-02-21T09:09:53.1889217Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:53.1889545Z .loc 1 50 125 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:50:125 2026-02-21T09:09:53.1889838Z add.s64 %rd247, %rd247, 16; 2026-02-21T09:09:53.1890002Z setp.lt.u64 %p75, %rd247, 464; 2026-02-21T09:09:53.1890168Z $L__tmp6: 2026-02-21T09:09:53.1890445Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1890807Z add.s32 %r561, %r739, 1; 2026-02-21T09:09:53.1890962Z setp.gt.s32 %p76, %r561, 1; 2026-02-21T09:09:53.1891131Z selp.b32 %r739, 0, %r561, %p76; 2026-02-21T09:09:53.1891308Z selp.b32 %r562, 1, 0, %p76; 2026-02-21T09:09:53.1891462Z xor.b32 %r52, %r740, %r562; 2026-02-21T09:09:53.1891648Z $L__tmp7: 2026-02-21T09:09:53.1891877Z .loc 1 58 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:32 2026-02-21T09:09:53.1892167Z add.s64 %rd115, %rd245, %rd14; 2026-02-21T09:09:53.1892330Z add.s64 %rd106, %rd115, 192; 2026-02-21T09:09:53.1892499Z add.s64 %rd107, %rd115, 65728; 2026-02-21T09:09:53.1892661Z add.s64 %rd108, %rd115, 131264; 2026-02-21T09:09:53.1892834Z add.s64 %rd109, %rd115, 196800; 2026-02-21T09:09:53.1893033Z add.s64 
%rd110, %rd115, 262336; 2026-02-21T09:09:53.1893199Z add.s64 %rd111, %rd115, 327872; 2026-02-21T09:09:53.1893375Z add.s64 %rd112, %rd115, 393408; 2026-02-21T09:09:53.1893636Z .loc 1 58 80 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:80 2026-02-21T09:09:53.1893921Z add.s64 %rd113, %rd245, %rd12; 2026-02-21T09:09:53.1894087Z mad.lo.s32 %r543, %r7, -48, %r48; 2026-02-21T09:09:53.1894269Z selp.b32 %r544, 16, 0, %p75; 2026-02-21T09:09:53.1894425Z // begin inline asm 2026-02-21T09:09:53.1894638Z cp.async.cg.shared.global [ %r543 + 0 ], [ %rd106 + 0 ], 0x10, %r544; 2026-02-21T09:09:53.1894872Z // end inline asm 2026-02-21T09:09:53.1895011Z add.s32 %r545, %r543, 2048; 2026-02-21T09:09:53.1895171Z // begin inline asm 2026-02-21T09:09:53.1895369Z cp.async.cg.shared.global [ %r545 + 0 ], [ %rd107 + 0 ], 0x10, %r544; 2026-02-21T09:09:53.1895596Z // end inline asm 2026-02-21T09:09:53.1895731Z add.s32 %r547, %r543, 4096; 2026-02-21T09:09:53.1895891Z // begin inline asm 2026-02-21T09:09:53.1896087Z cp.async.cg.shared.global [ %r547 + 0 ], [ %rd108 + 0 ], 0x10, %r544; 2026-02-21T09:09:53.1896317Z // end inline asm 2026-02-21T09:09:53.1896461Z add.s32 %r549, %r543, 6144; 2026-02-21T09:09:53.1896615Z // begin inline asm 2026-02-21T09:09:53.1896816Z cp.async.cg.shared.global [ %r549 + 0 ], [ %rd109 + 0 ], 0x10, %r544; 2026-02-21T09:09:53.1897038Z // end inline asm 2026-02-21T09:09:53.1897181Z add.s32 %r551, %r543, 8192; 2026-02-21T09:09:53.1897334Z // begin inline asm 2026-02-21T09:09:53.1897533Z cp.async.cg.shared.global [ %r551 + 0 ], [ %rd110 + 0 ], 0x10, %r544; 2026-02-21T09:09:53.1897754Z // end inline asm 2026-02-21T09:09:53.1897900Z add.s32 %r553, %r543, 10240; 2026-02-21T09:09:53.1898053Z // begin inline asm 2026-02-21T09:09:53.1898254Z cp.async.cg.shared.global [ %r553 + 0 ], [ %rd111 + 0 ], 0x10, %r544; 2026-02-21T09:09:53.1898484Z // end inline asm 2026-02-21T09:09:53.1898619Z add.s32 %r555, %r543, 12288; 2026-02-21T09:09:53.1898777Z // begin inline asm 
2026-02-21T09:09:53.1898997Z cp.async.cg.shared.global [ %r555 + 0 ], [ %rd112 + 0 ], 0x10, %r544; 2026-02-21T09:09:53.1899224Z // end inline asm 2026-02-21T09:09:53.1899359Z add.s32 %r557, %r543, 14336; 2026-02-21T09:09:53.1899518Z // begin inline asm 2026-02-21T09:09:53.1899709Z cp.async.cg.shared.global [ %r557 + 0 ], [ %rd113 + 0 ], 0x10, %r544; 2026-02-21T09:09:53.1899934Z // end inline asm 2026-02-21T09:09:53.1900079Z cp.async.commit_group; 2026-02-21T09:09:53.1900329Z .loc 1 64 87 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:87 2026-02-21T09:09:53.1900643Z add.s32 %r559, %r49, %r18; 2026-02-21T09:09:53.1900809Z selp.b32 %r560, 4, 0, %p75; 2026-02-21T09:09:53.1900974Z // begin inline asm 2026-02-21T09:09:53.1901172Z cp.async.ca.shared.global [ %r559 + 0 ], [ %rd246 + 0 ], 0x4, %r560; 2026-02-21T09:09:53.1901401Z // end inline asm 2026-02-21T09:09:53.1901564Z cp.async.commit_group; 2026-02-21T09:09:53.1901844Z .loc 1 50 125 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:50:125 2026-02-21T09:09:53.1902157Z add.s64 %rd246, %rd246, 131072; 2026-02-21T09:09:53.1902330Z add.s64 %rd245, %rd245, 64; 2026-02-21T09:09:53.1902541Z setp.lt.u64 %p77, %rd247, 480; 2026-02-21T09:09:53.1902717Z mov.b32 %r736, %r740; 2026-02-21T09:09:53.1902879Z mov.b32 %r740, %r52; 2026-02-21T09:09:53.1903032Z @%p77 bra $L__BB0_4; 2026-02-21T09:09:53.1903189Z bra.uni $L__BB0_7; 2026-02-21T09:09:53.1903383Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:09:53.1903618Z add.s32 %r469, %r738, 1; 2026-02-21T09:09:53.1903791Z setp.gt.s32 %p57, %r469, 1; 2026-02-21T09:09:53.1903961Z selp.b32 %r738, 0, %r469, %p57; 2026-02-21T09:09:53.1904245Z .loc 1 58 80 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:58:80 2026-02-21T09:09:53.1904543Z cp.async.wait_group 2; 2026-02-21T09:09:53.1904707Z bar.sync 0; 2026-02-21T09:09:53.1904876Z shl.b32 %r470, %r738, 14; 2026-02-21T09:09:53.1905050Z add.s32 %r472, %r53, %r470; 2026-02-21T09:09:53.1905213Z 
add.s32 %r48, %r472, %r20; 2026-02-21T09:09:53.1905418Z ld.shared.v4.b32 {%r473, %r474, %r475, %r476}, [%r48]; 2026-02-21T09:09:53.1905645Z mov.b32 {%rs108, %rs109}, %r476; 2026-02-21T09:09:53.1905823Z mov.b32 {%rs110, %rs111}, %r475; 2026-02-21T09:09:53.1906002Z mov.b32 {%rs112, %rs113}, %r474; 2026-02-21T09:09:53.1906170Z mov.b32 {%rs114, %rs115}, %r473; 2026-02-21T09:09:53.1906384Z ld.shared.v4.b32 {%r477, %r478, %r479, %r480}, [%r48+16]; 2026-02-21T09:09:53.1906605Z mov.b32 {%rs116, %rs117}, %r480; 2026-02-21T09:09:53.1906780Z mov.b32 {%rs118, %rs119}, %r479; 2026-02-21T09:09:53.1906945Z mov.b32 {%rs120, %rs121}, %r478; 2026-02-21T09:09:53.1907121Z mov.b32 {%rs122, %rs123}, %r477; 2026-02-21T09:09:53.1907332Z ld.shared.v4.b32 {%r481, %r482, %r483, %r484}, [%r48+32]; 2026-02-21T09:09:53.1907542Z mov.b32 {%rs124, %rs125}, %r484; 2026-02-21T09:09:53.1907715Z mov.b32 {%rs126, %rs127}, %r483; 2026-02-21T09:09:53.1907882Z mov.b32 {%rs128, %rs129}, %r482; 2026-02-21T09:09:53.1908056Z mov.b32 {%rs130, %rs131}, %r481; 2026-02-21T09:09:53.1908255Z ld.shared.v4.b32 {%r485, %r486, %r487, %r488}, [%r48+48]; 2026-02-21T09:09:53.1908472Z mov.b32 {%rs132, %rs133}, %r488; 2026-02-21T09:09:53.1908636Z mov.b32 {%rs134, %rs135}, %r487; 2026-02-21T09:09:53.1908809Z mov.b32 {%rs136, %rs137}, %r486; 2026-02-21T09:09:53.1908983Z mov.b32 {%rs138, %rs139}, %r485; 2026-02-21T09:09:53.1909191Z ld.shared.v4.b32 {%r489, %r490, %r491, %r492}, [%r48+8192]; 2026-02-21T09:09:53.1909412Z mov.b32 {%rs140, %rs141}, %r492; 2026-02-21T09:09:53.1909578Z mov.b32 {%rs142, %rs143}, %r491; 2026-02-21T09:09:53.1909764Z mov.b32 {%rs144, %rs145}, %r490; 2026-02-21T09:09:53.1909923Z mov.b32 {%rs146, %rs147}, %r489; 2026-02-21T09:09:53.1910124Z ld.shared.v4.b32 {%r493, %r494, %r495, %r496}, [%r48+8208]; 2026-02-21T09:09:53.1910328Z mov.b32 {%rs148, %rs149}, %r496; 2026-02-21T09:09:53.1910494Z mov.b32 {%rs150, %rs151}, %r495; 2026-02-21T09:09:53.1910695Z mov.b32 {%rs152, %rs153}, %r494; 
2026-02-21T09:09:53.1910853Z mov.b32 {%rs154, %rs155}, %r493; 2026-02-21T09:09:53.1911055Z ld.shared.v4.b32 {%r497, %r498, %r499, %r500}, [%r48+8224]; 2026-02-21T09:09:53.1911261Z mov.b32 {%rs156, %rs157}, %r500; 2026-02-21T09:09:53.1911426Z mov.b32 {%rs158, %rs159}, %r499; 2026-02-21T09:09:53.1911621Z mov.b32 {%rs160, %rs161}, %r498; 2026-02-21T09:09:53.1911789Z mov.b32 {%rs162, %rs163}, %r497; 2026-02-21T09:09:53.1911985Z ld.shared.v4.b32 {%r501, %r502, %r503, %r504}, [%r48+8240]; 2026-02-21T09:09:53.1912235Z mov.b32 {%rs164, %rs165}, %r504; 2026-02-21T09:09:53.1912409Z mov.b32 {%rs166, %rs167}, %r503; 2026-02-21T09:09:53.1912569Z mov.b32 {%rs168, %rs169}, %r502; 2026-02-21T09:09:53.1912742Z mov.b32 {%rs170, %rs171}, %r501; 2026-02-21T09:09:53.1913010Z .loc 1 62 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:62:32 2026-02-21T09:09:53.1913308Z cvt.f32.bf16 %r402, %rs114; 2026-02-21T09:09:53.1913468Z cvt.f32.bf16 %r403, %rs115; 2026-02-21T09:09:53.1913633Z cvt.f32.bf16 %r404, %rs112; 2026-02-21T09:09:53.1913786Z cvt.f32.bf16 %r405, %rs113; 2026-02-21T09:09:53.1913945Z cvt.f32.bf16 %r406, %rs110; 2026-02-21T09:09:53.1914130Z cvt.f32.bf16 %r407, %rs111; 2026-02-21T09:09:53.1914282Z cvt.f32.bf16 %r408, %rs108; 2026-02-21T09:09:53.1914442Z cvt.f32.bf16 %r409, %rs109; 2026-02-21T09:09:53.1914593Z cvt.f32.bf16 %r410, %rs122; 2026-02-21T09:09:53.1914750Z cvt.f32.bf16 %r411, %rs123; 2026-02-21T09:09:53.1914903Z cvt.f32.bf16 %r412, %rs120; 2026-02-21T09:09:53.1915062Z cvt.f32.bf16 %r413, %rs121; 2026-02-21T09:09:53.1915212Z cvt.f32.bf16 %r414, %rs118; 2026-02-21T09:09:53.1915372Z cvt.f32.bf16 %r415, %rs119; 2026-02-21T09:09:53.1915522Z cvt.f32.bf16 %r416, %rs116; 2026-02-21T09:09:53.1915680Z cvt.f32.bf16 %r417, %rs117; 2026-02-21T09:09:53.1915838Z cvt.f32.bf16 %r419, %rs130; 2026-02-21T09:09:53.1916014Z cvt.f32.bf16 %r420, %rs131; 2026-02-21T09:09:53.1916176Z cvt.f32.bf16 %r421, %rs128; 2026-02-21T09:09:53.1916328Z cvt.f32.bf16 %r422, %rs129; 
2026-02-21T09:09:53.1916486Z cvt.f32.bf16 %r423, %rs126; 2026-02-21T09:09:53.1916637Z cvt.f32.bf16 %r424, %rs127; 2026-02-21T09:09:53.1916795Z cvt.f32.bf16 %r425, %rs124; 2026-02-21T09:09:53.1916944Z cvt.f32.bf16 %r426, %rs125; 2026-02-21T09:09:53.1917101Z cvt.f32.bf16 %r427, %rs138; 2026-02-21T09:09:53.1917255Z cvt.f32.bf16 %r428, %rs139; 2026-02-21T09:09:53.1917405Z cvt.f32.bf16 %r429, %rs136; 2026-02-21T09:09:53.1917563Z cvt.f32.bf16 %r430, %rs137; 2026-02-21T09:09:53.1917714Z cvt.f32.bf16 %r431, %rs134; 2026-02-21T09:09:53.1917873Z cvt.f32.bf16 %r432, %rs135; 2026-02-21T09:09:53.1918020Z cvt.f32.bf16 %r433, %rs132; 2026-02-21T09:09:53.1918175Z cvt.f32.bf16 %r434, %rs133; 2026-02-21T09:09:53.1918324Z cvt.f32.bf16 %r436, %rs146; 2026-02-21T09:09:53.1918481Z cvt.f32.bf16 %r437, %rs147; 2026-02-21T09:09:53.1918629Z cvt.f32.bf16 %r438, %rs144; 2026-02-21T09:09:53.1918785Z cvt.f32.bf16 %r439, %rs145; 2026-02-21T09:09:53.1918942Z cvt.f32.bf16 %r440, %rs142; 2026-02-21T09:09:53.1919091Z cvt.f32.bf16 %r441, %rs143; 2026-02-21T09:09:53.1919247Z cvt.f32.bf16 %r442, %rs140; 2026-02-21T09:09:53.1919398Z cvt.f32.bf16 %r443, %rs141; 2026-02-21T09:09:53.1919556Z cvt.f32.bf16 %r444, %rs154; 2026-02-21T09:09:53.1919705Z cvt.f32.bf16 %r445, %rs155; 2026-02-21T09:09:53.1919863Z cvt.f32.bf16 %r446, %rs152; 2026-02-21T09:09:53.1920012Z cvt.f32.bf16 %r447, %rs153; 2026-02-21T09:09:53.1920172Z cvt.f32.bf16 %r448, %rs150; 2026-02-21T09:09:53.1920326Z cvt.f32.bf16 %r449, %rs151; 2026-02-21T09:09:53.1920487Z cvt.f32.bf16 %r450, %rs148; 2026-02-21T09:09:53.1920648Z cvt.f32.bf16 %r451, %rs149; 2026-02-21T09:09:53.1920798Z cvt.f32.bf16 %r453, %rs162; 2026-02-21T09:09:53.1920957Z cvt.f32.bf16 %r454, %rs163; 2026-02-21T09:09:53.1921106Z cvt.f32.bf16 %r455, %rs160; 2026-02-21T09:09:53.1921262Z cvt.f32.bf16 %r456, %rs161; 2026-02-21T09:09:53.1921411Z cvt.f32.bf16 %r457, %rs158; 2026-02-21T09:09:53.1921633Z cvt.f32.bf16 %r458, %rs159; 2026-02-21T09:09:53.1921780Z cvt.f32.bf16 %r459, %rs156; 
2026-02-21T09:09:53.1921936Z cvt.f32.bf16 %r460, %rs157; 2026-02-21T09:09:53.1922091Z cvt.f32.bf16 %r461, %rs170; 2026-02-21T09:09:53.1922243Z cvt.f32.bf16 %r462, %rs171; 2026-02-21T09:09:53.1922402Z cvt.f32.bf16 %r463, %rs168; 2026-02-21T09:09:53.1922553Z cvt.f32.bf16 %r464, %rs169; 2026-02-21T09:09:53.1922710Z cvt.f32.bf16 %r465, %rs166; 2026-02-21T09:09:53.1922861Z cvt.f32.bf16 %r466, %rs167; 2026-02-21T09:09:53.1923049Z cvt.f32.bf16 %r467, %rs164; 2026-02-21T09:09:53.1923200Z cvt.f32.bf16 %r468, %rs165; 2026-02-21T09:09:53.1923355Z $L__tmp8: 2026-02-21T09:09:53.1923637Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1923970Z // begin inline asm 2026-02-21T09:09:53.1924115Z 2026-02-21T09:09:53.1924228Z { 2026-02-21T09:09:53.1924360Z .reg .pred complete; 2026-02-21T09:09:53.1924507Z waitLoop: 2026-02-21T09:09:53.1924703Z mbarrier.try_wait.parity.shared.b64 complete, [%r737], %r736; 2026-02-21T09:09:53.1924933Z @!complete bra.uni waitLoop; 2026-02-21T09:09:53.1925091Z } 2026-02-21T09:09:53.1925155Z 2026-02-21T09:09:53.1925263Z // end inline asm 2026-02-21T09:09:53.1925417Z mov.pred %p58, -1; 2026-02-21T09:09:53.1925562Z // begin inline asm 2026-02-21T09:09:53.1925926Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r176 + 0], {%r402, %r403, %r404, %r405, %r406, %r407, %r408, %r409, %r410, %r411, %r412, %r413, %r414, %r415, %r416, %r417}; 2026-02-21T09:09:53.1926310Z // end inline asm 2026-02-21T09:09:53.1926446Z // begin inline asm 2026-02-21T09:09:53.1926800Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r176 + 16], {%r419, %r420, %r421, %r422, %r423, %r424, %r425, %r426, %r427, %r428, %r429, %r430, %r431, %r432, %r433, %r434}; 2026-02-21T09:09:53.1927175Z // end inline asm 2026-02-21T09:09:53.1927318Z // begin inline asm 2026-02-21T09:09:53.1927699Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r176 + 32], {%r436, %r437, %r438, %r439, %r440, %r441, %r442, %r443, %r444, %r445, %r446, 
%r447, %r448, %r449, %r450, %r451}; 2026-02-21T09:09:53.1928077Z // end inline asm 2026-02-21T09:09:53.1928219Z // begin inline asm 2026-02-21T09:09:53.1928556Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r176 + 48], {%r453, %r454, %r455, %r456, %r457, %r458, %r459, %r460, %r461, %r462, %r463, %r464, %r465, %r466, %r467, %r468}; 2026-02-21T09:09:53.1928935Z // end inline asm 2026-02-21T09:09:53.1929069Z // begin inline asm 2026-02-21T09:09:53.1929233Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:53.1929402Z // end inline asm 2026-02-21T09:09:53.1929536Z bar.sync 0; 2026-02-21T09:09:53.1929670Z $L__tmp9: 2026-02-21T09:09:53.1929906Z .loc 1 64 87 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:64:87 2026-02-21T09:09:53.1930196Z shl.b32 %r505, %r738, 9; 2026-02-21T09:09:53.1930352Z add.s32 %r506, %r53, %r505; 2026-02-21T09:09:53.1930528Z add.s32 %r49, %r506, 36864; 2026-02-21T09:09:53.1930686Z add.s32 %r507, %r49, %r21; 2026-02-21T09:09:53.1930853Z ld.shared.b8 %rs172, [%r507]; 2026-02-21T09:09:53.1931022Z ld.shared.b8 %rs173, [%r507+64]; 2026-02-21T09:09:53.1931203Z ld.shared.b8 %rs174, [%r507+128]; 2026-02-21T09:09:53.1931386Z ld.shared.b8 %rs175, [%r507+192]; 2026-02-21T09:09:53.1931579Z ld.shared.b8 %rs176, [%r507+256]; 2026-02-21T09:09:53.1931754Z ld.shared.b8 %rs177, [%r507+320]; 2026-02-21T09:09:53.1931919Z ld.shared.b8 %rs178, [%r507+384]; 2026-02-21T09:09:53.1932090Z ld.shared.b8 %rs179, [%r507+448]; 2026-02-21T09:09:53.1932353Z .loc 1 67 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:67:28 2026-02-21T09:09:53.1932641Z shl.b16 %rs180, %rs172, 4; 2026-02-21T09:09:53.1932796Z shl.b16 %rs181, %rs173, 4; 2026-02-21T09:09:53.1932957Z shl.b16 %rs182, %rs174, 4; 2026-02-21T09:09:53.1933115Z shl.b16 %rs183, %rs175, 4; 2026-02-21T09:09:53.1933266Z shl.b16 %rs184, %rs176, 4; 2026-02-21T09:09:53.1933458Z shl.b16 %rs185, %rs177, 4; 2026-02-21T09:09:53.1933609Z shl.b16 %rs186, %rs178, 4; 2026-02-21T09:09:53.1933765Z shl.b16 %rs187, %rs179, 
4; 2026-02-21T09:09:53.1934024Z .loc 1 82 58 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:82:58 2026-02-21T09:09:53.1934322Z selp.b16 %rs188, %rs180, %rs172, %p32; 2026-02-21T09:09:53.1934500Z cvt.s16.s8 %rs189, %rs188; 2026-02-21T09:09:53.1934659Z shr.s16 %rs190, %rs189, 4; 2026-02-21T09:09:53.1934826Z selp.b16 %rs191, %rs181, %rs173, %p32; 2026-02-21T09:09:53.1935029Z cvt.s16.s8 %rs192, %rs191; 2026-02-21T09:09:53.1935186Z shr.s16 %rs193, %rs192, 4; 2026-02-21T09:09:53.1935343Z selp.b16 %rs194, %rs182, %rs174, %p32; 2026-02-21T09:09:53.1935518Z cvt.s16.s8 %rs195, %rs194; 2026-02-21T09:09:53.1935667Z shr.s16 %rs196, %rs195, 4; 2026-02-21T09:09:53.1935831Z selp.b16 %rs197, %rs183, %rs175, %p32; 2026-02-21T09:09:53.1936000Z cvt.s16.s8 %rs198, %rs197; 2026-02-21T09:09:53.1936155Z shr.s16 %rs199, %rs198, 4; 2026-02-21T09:09:53.1936320Z selp.b16 %rs200, %rs184, %rs176, %p32; 2026-02-21T09:09:53.1936488Z cvt.s16.s8 %rs201, %rs200; 2026-02-21T09:09:53.1936647Z shr.s16 %rs202, %rs201, 4; 2026-02-21T09:09:53.1936837Z selp.b16 %rs203, %rs185, %rs177, %p32; 2026-02-21T09:09:53.1937017Z cvt.s16.s8 %rs204, %rs203; 2026-02-21T09:09:53.1937165Z shr.s16 %rs205, %rs204, 4; 2026-02-21T09:09:53.1937328Z selp.b16 %rs206, %rs186, %rs178, %p32; 2026-02-21T09:09:53.1937497Z cvt.s16.s8 %rs207, %rs206; 2026-02-21T09:09:53.1937656Z shr.s16 %rs208, %rs207, 4; 2026-02-21T09:09:53.1937812Z selp.b16 %rs209, %rs187, %rs179, %p32; 2026-02-21T09:09:53.1937992Z cvt.s16.s8 %rs210, %rs209; 2026-02-21T09:09:53.1938149Z shr.s16 %rs211, %rs210, 4; 2026-02-21T09:09:53.1938402Z .loc 1 87 32 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:87:32 2026-02-21T09:09:53.1938687Z cvt.rn.f32.s16 %r508, %rs190; 2026-02-21T09:09:53.1938880Z cvt.rn.f32.s16 %r509, %rs193; 2026-02-21T09:09:53.1939051Z cvt.rn.f32.s16 %r510, %rs196; 2026-02-21T09:09:53.1939211Z cvt.rn.f32.s16 %r511, %rs199; 2026-02-21T09:09:53.1939378Z cvt.rn.f32.s16 %r512, %rs202; 2026-02-21T09:09:53.1939536Z cvt.rn.f32.s16 
%r513, %rs205; 2026-02-21T09:09:53.1939700Z cvt.rn.f32.s16 %r514, %rs208; 2026-02-21T09:09:53.1939862Z cvt.rn.f32.s16 %r515, %rs211; 2026-02-21T09:09:53.1940020Z st.shared.b32 [%r22], %r508; 2026-02-21T09:09:53.1940189Z st.shared.b32 [%r23], %r509; 2026-02-21T09:09:53.1940348Z st.shared.b32 [%r24], %r510; 2026-02-21T09:09:53.1940514Z st.shared.b32 [%r25], %r511; 2026-02-21T09:09:53.1940668Z st.shared.b32 [%r26], %r512; 2026-02-21T09:09:53.1940828Z st.shared.b32 [%r27], %r513; 2026-02-21T09:09:53.1940984Z st.shared.b32 [%r28], %r514; 2026-02-21T09:09:53.1941147Z st.shared.b32 [%r29], %r515; 2026-02-21T09:09:53.1941420Z .loc 1 50 125 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:50:125 2026-02-21T09:09:53.1941731Z shl.b32 %r516, %r739, 3; 2026-02-21T09:09:53.1941895Z add.s32 %r517, %r53, %r516; 2026-02-21T09:09:53.1942051Z add.s32 %r737, %r517, 37888; 2026-02-21T09:09:53.1942207Z $L__tmp10: 2026-02-21T09:09:53.1942496Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1942829Z // begin inline asm 2026-02-21T09:09:53.1942988Z fence.proxy.async.shared::cta; 2026-02-21T09:09:53.1943161Z // end inline asm 2026-02-21T09:09:53.1943301Z bar.sync 0; 2026-02-21T09:09:53.1943436Z @%p33 bra $L__BB0_6; 2026-02-21T09:09:53.1943632Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:53.1943853Z elect.sync %r542|%p59, -1; 2026-02-21T09:09:53.1944024Z mov.b32 %r520, 134744336; 2026-02-21T09:09:53.1944175Z // begin inline asm 2026-02-21T09:09:53.1944417Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 0 ], [ %r341 + 0 ], %rd72, %r520, %p58; 2026-02-21T09:09:53.1944680Z // end inline asm 2026-02-21T09:09:53.1944851Z // begin inline asm 2026-02-21T09:09:53.1945081Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 0 ], [ %r341 + 8 ], %rd73, %r520, %p58; 2026-02-21T09:09:53.1945335Z // end inline asm 2026-02-21T09:09:53.1945477Z // begin inline asm 2026-02-21T09:09:53.1945712Z @%p59 
tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 0 ], [ %r341 + 16 ], %rd74, %r520, %p58; 2026-02-21T09:09:53.1945988Z // end inline asm 2026-02-21T09:09:53.1946127Z // begin inline asm 2026-02-21T09:09:53.1946366Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 0 ], [ %r341 + 24 ], %rd75, %r520, %p58; 2026-02-21T09:09:53.1946667Z // end inline asm 2026-02-21T09:09:53.1946808Z // begin inline asm 2026-02-21T09:09:53.1947053Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 32 ], [ %r341 + 32 ], %rd72, %r520, %p58; 2026-02-21T09:09:53.1947326Z // end inline asm 2026-02-21T09:09:53.1947474Z // begin inline asm 2026-02-21T09:09:53.1947712Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 32 ], [ %r341 + 40 ], %rd73, %r520, %p58; 2026-02-21T09:09:53.1947990Z // end inline asm 2026-02-21T09:09:53.1948130Z // begin inline asm 2026-02-21T09:09:53.1948401Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 32 ], [ %r341 + 48 ], %rd74, %r520, %p58; 2026-02-21T09:09:53.1948682Z // end inline asm 2026-02-21T09:09:53.1948822Z // begin inline asm 2026-02-21T09:09:53.1949067Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r735 + 32 ], [ %r341 + 56 ], %rd75, %r520, %p58; 2026-02-21T09:09:53.1949344Z // end inline asm 2026-02-21T09:09:53.1949505Z cvt.u64.u32 %rd105, %r737; 2026-02-21T09:09:53.1949670Z // begin inline asm 2026-02-21T09:09:53.1949900Z @%p59 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd105]; 2026-02-21T09:09:53.1950150Z // end inline asm 2026-02-21T09:09:53.1950290Z bra.uni $L__BB0_6; 2026-02-21T09:09:53.1950481Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:09:53.1950758Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:53.1950974Z mov.b32 %r564, 1; 2026-02-21T09:09:53.1951275Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1951645Z // begin inline asm 2026-02-21T09:09:53.1951794Z 2026-02-21T09:09:53.1951915Z { 2026-02-21T09:09:53.1952054Z .reg .pred complete; 
2026-02-21T09:09:53.1952204Z waitLoop: 2026-02-21T09:09:53.1952330Z mbarrier.try_wait.parity.shared.b64 complete, [%r737], %r564; 2026-02-21T09:09:53.1952408Z @!complete bra.uni waitLoop; 2026-02-21T09:09:53.1952463Z } 2026-02-21T09:09:53.1952468Z 2026-02-21T09:09:53.1952526Z // end inline asm 2026-02-21T09:09:53.1952591Z $L__tmp11: 2026-02-21T09:09:53.1952770Z .loc 1 50 125 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:50:125 2026-02-21T09:09:53.1952837Z cp.async.wait_group 0; 2026-02-21T09:09:53.1952895Z bar.sync 0; 2026-02-21T09:09:53.1952967Z add.s32 %r565, %r53, 37888; 2026-02-21T09:09:53.1953027Z // begin inline asm 2026-02-21T09:09:53.1953120Z @%p78 mbarrier.inval.shared::cta.b64 [%r565]; 2026-02-21T09:09:53.1953185Z // end inline asm 2026-02-21T09:09:53.1953243Z bar.sync 0; 2026-02-21T09:09:53.1953302Z // begin inline asm 2026-02-21T09:09:53.1953397Z @%p78 mbarrier.inval.shared::cta.b64 [%r139]; 2026-02-21T09:09:53.1953455Z // end inline asm 2026-02-21T09:09:53.1953511Z $L__tmp12: 2026-02-21T09:09:53.1953737Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1953802Z // begin inline asm 2026-02-21T09:09:53.1954075Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r567, %r568, %r569, %r570, %r571, %r572, %r573, %r574, %r575, %r576, %r577, %r578, %r579, %r580, %r581, %r582}, [%r634 + 0]; 2026-02-21T09:09:53.1954130Z // end inline asm 2026-02-21T09:09:53.1954194Z // begin inline asm 2026-02-21T09:09:53.1954457Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r584, %r585, %r586, %r587, %r588, %r589, %r590, %r591, %r592, %r593, %r594, %r595, %r596, %r597, %r598, %r599}, [%r634 + 16]; 2026-02-21T09:09:53.1954542Z // end inline asm 2026-02-21T09:09:53.1954605Z // begin inline asm 2026-02-21T09:09:53.1954864Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r601, %r602, %r603, %r604, %r605, %r606, %r607, %r608, %r609, %r610, %r611, %r612, %r613, %r614, %r615, %r616}, [%r634 + 32]; 
2026-02-21T09:09:53.1954920Z // end inline asm 2026-02-21T09:09:53.1954982Z // begin inline asm 2026-02-21T09:09:53.1955238Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r618, %r619, %r620, %r621, %r622, %r623, %r624, %r625, %r626, %r627, %r628, %r629, %r630, %r631, %r632, %r633}, [%r634 + 48]; 2026-02-21T09:09:53.1955319Z // end inline asm 2026-02-21T09:09:53.1955374Z // begin inline asm 2026-02-21T09:09:53.1955454Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:53.1955509Z // end inline asm 2026-02-21T09:09:53.1955570Z cvt.u64.u32 %rd117, %r567; 2026-02-21T09:09:53.1955638Z cvt.u64.u32 %rd118, %r568; 2026-02-21T09:09:53.1955700Z shl.b64 %rd119, %rd118, 32; 2026-02-21T09:09:53.1955766Z or.b64 %rd120, %rd117, %rd119; 2026-02-21T09:09:53.1955818Z $L__tmp13: 2026-02-21T09:09:53.1956015Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1956080Z mov.b64 {%r638, %r639}, %rd120; 2026-02-21T09:09:53.1956150Z cvt.rn.bf16x2.f32 %r640, %r639, %r638; 2026-02-21T09:09:53.1956210Z $L__tmp14: 2026-02-21T09:09:53.1956417Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1956479Z cvt.u64.u32 %rd121, %r569; 2026-02-21T09:09:53.1956544Z cvt.u64.u32 %rd122, %r570; 2026-02-21T09:09:53.1956603Z shl.b64 %rd123, %rd122, 32; 2026-02-21T09:09:53.1956664Z or.b64 %rd124, %rd121, %rd123; 2026-02-21T09:09:53.1956715Z $L__tmp15: 2026-02-21T09:09:53.1956886Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1956973Z mov.b64 {%r641, %r642}, %rd124; 2026-02-21T09:09:53.1957044Z cvt.rn.bf16x2.f32 %r643, %r642, %r641; 2026-02-21T09:09:53.1957105Z $L__tmp16: 2026-02-21T09:09:53.1957311Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1957370Z cvt.u64.u32 %rd125, %r571; 2026-02-21T09:09:53.1957434Z cvt.u64.u32 %rd126, %r572; 2026-02-21T09:09:53.1957493Z shl.b64 
%rd127, %rd126, 32; 2026-02-21T09:09:53.1957554Z or.b64 %rd128, %rd125, %rd127; 2026-02-21T09:09:53.1957607Z $L__tmp17: 2026-02-21T09:09:53.1957776Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1957836Z mov.b64 {%r644, %r645}, %rd128; 2026-02-21T09:09:53.1957904Z cvt.rn.bf16x2.f32 %r646, %r645, %r644; 2026-02-21T09:09:53.1957963Z $L__tmp18: 2026-02-21T09:09:53.1958167Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1958226Z cvt.u64.u32 %rd129, %r573; 2026-02-21T09:09:53.1958292Z cvt.u64.u32 %rd130, %r574; 2026-02-21T09:09:53.1958350Z shl.b64 %rd131, %rd130, 32; 2026-02-21T09:09:53.1958410Z or.b64 %rd132, %rd129, %rd131; 2026-02-21T09:09:53.1958463Z $L__tmp19: 2026-02-21T09:09:53.1958633Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1958695Z mov.b64 {%r647, %r648}, %rd132; 2026-02-21T09:09:53.1958761Z cvt.rn.bf16x2.f32 %r649, %r648, %r647; 2026-02-21T09:09:53.1958824Z $L__tmp20: 2026-02-21T09:09:53.1959029Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1959088Z cvt.u64.u32 %rd133, %r575; 2026-02-21T09:09:53.1959149Z cvt.u64.u32 %rd134, %r576; 2026-02-21T09:09:53.1959216Z shl.b64 %rd135, %rd134, 32; 2026-02-21T09:09:53.1959274Z or.b64 %rd136, %rd133, %rd135; 2026-02-21T09:09:53.1959327Z $L__tmp21: 2026-02-21T09:09:53.1959514Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1959573Z mov.b64 {%r650, %r651}, %rd136; 2026-02-21T09:09:53.1959641Z cvt.rn.bf16x2.f32 %r652, %r651, %r650; 2026-02-21T09:09:53.1959699Z $L__tmp22: 2026-02-21T09:09:53.1959902Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1959960Z cvt.u64.u32 %rd137, %r577; 2026-02-21T09:09:53.1960016Z cvt.u64.u32 %rd138, %r578; 
2026-02-21T09:09:53.1960124Z shl.b64 %rd139, %rd138, 32; 2026-02-21T09:09:53.1960184Z or.b64 %rd140, %rd137, %rd139; 2026-02-21T09:09:53.1960236Z $L__tmp23: 2026-02-21T09:09:53.1960400Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1960460Z mov.b64 {%r653, %r654}, %rd140; 2026-02-21T09:09:53.1960526Z cvt.rn.bf16x2.f32 %r655, %r654, %r653; 2026-02-21T09:09:53.1960585Z $L__tmp24: 2026-02-21T09:09:53.1960790Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1960849Z cvt.u64.u32 %rd141, %r579; 2026-02-21T09:09:53.1960936Z cvt.u64.u32 %rd142, %r580; 2026-02-21T09:09:53.1961006Z shl.b64 %rd143, %rd142, 32; 2026-02-21T09:09:53.1961066Z or.b64 %rd144, %rd141, %rd143; 2026-02-21T09:09:53.1961118Z $L__tmp25: 2026-02-21T09:09:53.1961284Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1961346Z mov.b64 {%r656, %r657}, %rd144; 2026-02-21T09:09:53.1961412Z cvt.rn.bf16x2.f32 %r658, %r657, %r656; 2026-02-21T09:09:53.1961464Z $L__tmp26: 2026-02-21T09:09:53.1961693Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1961752Z cvt.u64.u32 %rd145, %r581; 2026-02-21T09:09:53.1961837Z cvt.u64.u32 %rd146, %r582; 2026-02-21T09:09:53.1961906Z shl.b64 %rd147, %rd146, 32; 2026-02-21T09:09:53.1961966Z or.b64 %rd148, %rd145, %rd147; 2026-02-21T09:09:53.1962018Z $L__tmp27: 2026-02-21T09:09:53.1962186Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1962245Z mov.b64 {%r659, %r660}, %rd148; 2026-02-21T09:09:53.1962311Z cvt.rn.bf16x2.f32 %r661, %r660, %r659; 2026-02-21T09:09:53.1962363Z $L__tmp28: 2026-02-21T09:09:53.1962572Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1962631Z cvt.u64.u32 %rd149, %r584; 
2026-02-21T09:09:53.1962689Z cvt.u64.u32 %rd150, %r585; 2026-02-21T09:09:53.1962755Z shl.b64 %rd151, %rd150, 32; 2026-02-21T09:09:53.1962815Z or.b64 %rd152, %rd149, %rd151; 2026-02-21T09:09:53.1962866Z $L__tmp29: 2026-02-21T09:09:53.1963032Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1963093Z mov.b64 {%r662, %r663}, %rd152; 2026-02-21T09:09:53.1963158Z cvt.rn.bf16x2.f32 %r664, %r663, %r662; 2026-02-21T09:09:53.1963209Z $L__tmp30: 2026-02-21T09:09:53.1963419Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1963477Z cvt.u64.u32 %rd153, %r586; 2026-02-21T09:09:53.1963535Z cvt.u64.u32 %rd154, %r587; 2026-02-21T09:09:53.1963602Z shl.b64 %rd155, %rd154, 32; 2026-02-21T09:09:53.1963662Z or.b64 %rd156, %rd153, %rd155; 2026-02-21T09:09:53.1963716Z $L__tmp31: 2026-02-21T09:09:53.1963882Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1963941Z mov.b64 {%r665, %r666}, %rd156; 2026-02-21T09:09:53.1964006Z cvt.rn.bf16x2.f32 %r667, %r666, %r665; 2026-02-21T09:09:53.1964057Z $L__tmp32: 2026-02-21T09:09:53.1964264Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1964351Z cvt.u64.u32 %rd157, %r588; 2026-02-21T09:09:53.1964408Z cvt.u64.u32 %rd158, %r589; 2026-02-21T09:09:53.1964472Z shl.b64 %rd159, %rd158, 32; 2026-02-21T09:09:53.1964533Z or.b64 %rd160, %rd157, %rd159; 2026-02-21T09:09:53.1964586Z $L__tmp33: 2026-02-21T09:09:53.1964747Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1964813Z mov.b64 {%r668, %r669}, %rd160; 2026-02-21T09:09:53.1964907Z cvt.rn.bf16x2.f32 %r670, %r669, %r668; 2026-02-21T09:09:53.1964958Z $L__tmp34: 2026-02-21T09:09:53.1965166Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 
2026-02-21T09:09:53.1965224Z cvt.u64.u32 %rd161, %r590; 2026-02-21T09:09:53.1965281Z cvt.u64.u32 %rd162, %r591; 2026-02-21T09:09:53.1965345Z shl.b64 %rd163, %rd162, 32; 2026-02-21T09:09:53.1965405Z or.b64 %rd164, %rd161, %rd163; 2026-02-21T09:09:53.1965458Z $L__tmp35: 2026-02-21T09:09:53.1965619Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1965685Z mov.b64 {%r671, %r672}, %rd164; 2026-02-21T09:09:53.1965775Z cvt.rn.bf16x2.f32 %r673, %r672, %r671; 2026-02-21T09:09:53.1965827Z $L__tmp36: 2026-02-21T09:09:53.1966029Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1966089Z cvt.u64.u32 %rd165, %r592; 2026-02-21T09:09:53.1966148Z cvt.u64.u32 %rd166, %r593; 2026-02-21T09:09:53.1966214Z shl.b64 %rd167, %rd166, 32; 2026-02-21T09:09:53.1966273Z or.b64 %rd168, %rd165, %rd167; 2026-02-21T09:09:53.1966325Z $L__tmp37: 2026-02-21T09:09:53.1966482Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1966551Z mov.b64 {%r674, %r675}, %rd168; 2026-02-21T09:09:53.1966638Z cvt.rn.bf16x2.f32 %r676, %r675, %r674; 2026-02-21T09:09:53.1966692Z $L__tmp38: 2026-02-21T09:09:53.1966897Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1966955Z cvt.u64.u32 %rd169, %r594; 2026-02-21T09:09:53.1967013Z cvt.u64.u32 %rd170, %r595; 2026-02-21T09:09:53.1967079Z shl.b64 %rd171, %rd170, 32; 2026-02-21T09:09:53.1967138Z or.b64 %rd172, %rd169, %rd171; 2026-02-21T09:09:53.1967191Z $L__tmp39: 2026-02-21T09:09:53.1967350Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1967420Z mov.b64 {%r677, %r678}, %rd172; 2026-02-21T09:09:53.1967487Z cvt.rn.bf16x2.f32 %r679, %r678, %r677; 2026-02-21T09:09:53.1967541Z $L__tmp40: 2026-02-21T09:09:53.1967756Z .loc 2 291 36 // standard.py:291:36 @[ 
c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1967817Z cvt.u64.u32 %rd173, %r596; 2026-02-21T09:09:53.1967877Z cvt.u64.u32 %rd174, %r597; 2026-02-21T09:09:53.1967936Z shl.b64 %rd175, %rd174, 32; 2026-02-21T09:09:53.1968004Z or.b64 %rd176, %rd173, %rd175; 2026-02-21T09:09:53.1968056Z $L__tmp41: 2026-02-21T09:09:53.1968217Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1968284Z mov.b64 {%r680, %r681}, %rd176; 2026-02-21T09:09:53.1968351Z cvt.rn.bf16x2.f32 %r682, %r681, %r680; 2026-02-21T09:09:53.1968401Z $L__tmp42: 2026-02-21T09:09:53.1968610Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1968670Z cvt.u64.u32 %rd177, %r598; 2026-02-21T09:09:53.1968726Z cvt.u64.u32 %rd178, %r599; 2026-02-21T09:09:53.1968785Z shl.b64 %rd179, %rd178, 32; 2026-02-21T09:09:53.1968852Z or.b64 %rd180, %rd177, %rd179; 2026-02-21T09:09:53.1968905Z $L__tmp43: 2026-02-21T09:09:53.1969059Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1969149Z mov.b64 {%r683, %r684}, %rd180; 2026-02-21T09:09:53.1969217Z cvt.rn.bf16x2.f32 %r685, %r684, %r683; 2026-02-21T09:09:53.1969269Z $L__tmp44: 2026-02-21T09:09:53.1969483Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1969542Z cvt.u64.u32 %rd181, %r601; 2026-02-21T09:09:53.1969599Z cvt.u64.u32 %rd182, %r602; 2026-02-21T09:09:53.1969658Z shl.b64 %rd183, %rd182, 32; 2026-02-21T09:09:53.1969751Z or.b64 %rd184, %rd181, %rd183; 2026-02-21T09:09:53.1969803Z $L__tmp45: 2026-02-21T09:09:53.1969964Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1970033Z mov.b64 {%r686, %r687}, %rd184; 2026-02-21T09:09:53.1970100Z cvt.rn.bf16x2.f32 %r688, %r687, %r686; 2026-02-21T09:09:53.1970153Z $L__tmp46: 2026-02-21T09:09:53.1970362Z .loc 2 
291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1970422Z cvt.u64.u32 %rd185, %r603; 2026-02-21T09:09:53.1970481Z cvt.u64.u32 %rd186, %r604; 2026-02-21T09:09:53.1970560Z shl.b64 %rd187, %rd186, 32; 2026-02-21T09:09:53.1970629Z or.b64 %rd188, %rd185, %rd187; 2026-02-21T09:09:53.1970680Z $L__tmp47: 2026-02-21T09:09:53.1970840Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1970907Z mov.b64 {%r689, %r690}, %rd188; 2026-02-21T09:09:53.1970975Z cvt.rn.bf16x2.f32 %r691, %r690, %r689; 2026-02-21T09:09:53.1971028Z $L__tmp48: 2026-02-21T09:09:53.1971229Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1971294Z cvt.u64.u32 %rd189, %r605; 2026-02-21T09:09:53.1971352Z cvt.u64.u32 %rd190, %r606; 2026-02-21T09:09:53.1971432Z shl.b64 %rd191, %rd190, 32; 2026-02-21T09:09:53.1971500Z or.b64 %rd192, %rd189, %rd191; 2026-02-21T09:09:53.1971573Z $L__tmp49: 2026-02-21T09:09:53.1971733Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1971802Z mov.b64 {%r692, %r693}, %rd192; 2026-02-21T09:09:53.1971867Z cvt.rn.bf16x2.f32 %r694, %r693, %r692; 2026-02-21T09:09:53.1971919Z $L__tmp50: 2026-02-21T09:09:53.1972122Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1972190Z cvt.u64.u32 %rd193, %r607; 2026-02-21T09:09:53.1972249Z cvt.u64.u32 %rd194, %r608; 2026-02-21T09:09:53.1972307Z shl.b64 %rd195, %rd194, 32; 2026-02-21T09:09:53.1972373Z or.b64 %rd196, %rd193, %rd195; 2026-02-21T09:09:53.1972426Z $L__tmp51: 2026-02-21T09:09:53.1972584Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1972651Z mov.b64 {%r695, %r696}, %rd196; 2026-02-21T09:09:53.1972716Z cvt.rn.bf16x2.f32 %r697, %r696, %r695; 2026-02-21T09:09:53.1972767Z $L__tmp52: 
2026-02-21T09:09:53.1972971Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1973036Z cvt.u64.u32 %rd197, %r609; 2026-02-21T09:09:53.1973093Z cvt.u64.u32 %rd198, %r610; 2026-02-21T09:09:53.1973151Z shl.b64 %rd199, %rd198, 32; 2026-02-21T09:09:53.1973217Z or.b64 %rd200, %rd197, %rd199; 2026-02-21T09:09:53.1973270Z $L__tmp53: 2026-02-21T09:09:53.1973430Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1973498Z mov.b64 {%r698, %r699}, %rd200; 2026-02-21T09:09:53.1973564Z cvt.rn.bf16x2.f32 %r700, %r699, %r698; 2026-02-21T09:09:53.1973617Z $L__tmp54: 2026-02-21T09:09:53.1973824Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1973919Z cvt.u64.u32 %rd201, %r611; 2026-02-21T09:09:53.1973979Z cvt.u64.u32 %rd202, %r612; 2026-02-21T09:09:53.1974038Z shl.b64 %rd203, %rd202, 32; 2026-02-21T09:09:53.1974104Z or.b64 %rd204, %rd201, %rd203; 2026-02-21T09:09:53.1974158Z $L__tmp55: 2026-02-21T09:09:53.1974317Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1974378Z mov.b64 {%r701, %r702}, %rd204; 2026-02-21T09:09:53.1974453Z cvt.rn.bf16x2.f32 %r703, %r702, %r701; 2026-02-21T09:09:53.1974532Z $L__tmp56: 2026-02-21T09:09:53.1974732Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1974798Z cvt.u64.u32 %rd205, %r613; 2026-02-21T09:09:53.1974856Z cvt.u64.u32 %rd206, %r614; 2026-02-21T09:09:53.1974914Z shl.b64 %rd207, %rd206, 32; 2026-02-21T09:09:53.1974980Z or.b64 %rd208, %rd205, %rd207; 2026-02-21T09:09:53.1975031Z $L__tmp57: 2026-02-21T09:09:53.1975192Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1975253Z mov.b64 {%r704, %r705}, %rd208; 2026-02-21T09:09:53.1975327Z cvt.rn.bf16x2.f32 %r706, %r705, %r704; 
2026-02-21T09:09:53.1975406Z $L__tmp58: 2026-02-21T09:09:53.1975606Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1975673Z cvt.u64.u32 %rd209, %r615; 2026-02-21T09:09:53.1975732Z cvt.u64.u32 %rd210, %r616; 2026-02-21T09:09:53.1975793Z shl.b64 %rd211, %rd210, 32; 2026-02-21T09:09:53.1975863Z or.b64 %rd212, %rd209, %rd211; 2026-02-21T09:09:53.1975916Z $L__tmp59: 2026-02-21T09:09:53.1976075Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1976138Z mov.b64 {%r707, %r708}, %rd212; 2026-02-21T09:09:53.1976216Z cvt.rn.bf16x2.f32 %r709, %r708, %r707; 2026-02-21T09:09:53.1976296Z $L__tmp60: 2026-02-21T09:09:53.1976500Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1976565Z cvt.u64.u32 %rd213, %r618; 2026-02-21T09:09:53.1976624Z cvt.u64.u32 %rd214, %r619; 2026-02-21T09:09:53.1976682Z shl.b64 %rd215, %rd214, 32; 2026-02-21T09:09:53.1976740Z or.b64 %rd216, %rd213, %rd215; 2026-02-21T09:09:53.1976798Z $L__tmp61: 2026-02-21T09:09:53.1976960Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1977021Z mov.b64 {%r710, %r711}, %rd216; 2026-02-21T09:09:53.1977095Z cvt.rn.bf16x2.f32 %r712, %r711, %r710; 2026-02-21T09:09:53.1977147Z $L__tmp62: 2026-02-21T09:09:53.1977349Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1977414Z cvt.u64.u32 %rd217, %r620; 2026-02-21T09:09:53.1977472Z cvt.u64.u32 %rd218, %r621; 2026-02-21T09:09:53.1977531Z shl.b64 %rd219, %rd218, 32; 2026-02-21T09:09:53.1977592Z or.b64 %rd220, %rd217, %rd219; 2026-02-21T09:09:53.1977650Z $L__tmp63: 2026-02-21T09:09:53.1977811Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1977871Z mov.b64 {%r713, %r714}, %rd220; 2026-02-21T09:09:53.1977944Z 
cvt.rn.bf16x2.f32 %r715, %r714, %r713; 2026-02-21T09:09:53.1977996Z $L__tmp64: 2026-02-21T09:09:53.1978199Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1978265Z cvt.u64.u32 %rd221, %r622; 2026-02-21T09:09:53.1978322Z cvt.u64.u32 %rd222, %r623; 2026-02-21T09:09:53.1978381Z shl.b64 %rd223, %rd222, 32; 2026-02-21T09:09:53.1978438Z or.b64 %rd224, %rd221, %rd223; 2026-02-21T09:09:53.1978497Z $L__tmp65: 2026-02-21T09:09:53.1978661Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1978757Z mov.b64 {%r716, %r717}, %rd224; 2026-02-21T09:09:53.1978829Z cvt.rn.bf16x2.f32 %r718, %r717, %r716; 2026-02-21T09:09:53.1978881Z $L__tmp66: 2026-02-21T09:09:53.1979085Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1979152Z cvt.u64.u32 %rd225, %r624; 2026-02-21T09:09:53.1979209Z cvt.u64.u32 %rd226, %r625; 2026-02-21T09:09:53.1979268Z shl.b64 %rd227, %rd226, 32; 2026-02-21T09:09:53.1979327Z or.b64 %rd228, %rd225, %rd227; 2026-02-21T09:09:53.1979409Z $L__tmp67: 2026-02-21T09:09:53.1979569Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1979629Z mov.b64 {%r719, %r720}, %rd228; 2026-02-21T09:09:53.1979703Z cvt.rn.bf16x2.f32 %r721, %r720, %r719; 2026-02-21T09:09:53.1979754Z $L__tmp68: 2026-02-21T09:09:53.1979957Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1980023Z cvt.u64.u32 %rd229, %r626; 2026-02-21T09:09:53.1980082Z cvt.u64.u32 %rd230, %r627; 2026-02-21T09:09:53.1980141Z shl.b64 %rd231, %rd230, 32; 2026-02-21T09:09:53.1980222Z or.b64 %rd232, %rd229, %rd231; 2026-02-21T09:09:53.1980283Z $L__tmp69: 2026-02-21T09:09:53.1980443Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1980502Z mov.b64 {%r722, %r723}, 
%rd232; 2026-02-21T09:09:53.1980574Z cvt.rn.bf16x2.f32 %r724, %r723, %r722; 2026-02-21T09:09:53.1980626Z $L__tmp70: 2026-02-21T09:09:53.1980829Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1980887Z cvt.u64.u32 %rd233, %r628; 2026-02-21T09:09:53.1980951Z cvt.u64.u32 %rd234, %r629; 2026-02-21T09:09:53.1981009Z shl.b64 %rd235, %rd234, 32; 2026-02-21T09:09:53.1981068Z or.b64 %rd236, %rd233, %rd235; 2026-02-21T09:09:53.1981146Z $L__tmp71: 2026-02-21T09:09:53.1981307Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1981366Z mov.b64 {%r725, %r726}, %rd236; 2026-02-21T09:09:53.1981439Z cvt.rn.bf16x2.f32 %r727, %r726, %r725; 2026-02-21T09:09:53.1981490Z $L__tmp72: 2026-02-21T09:09:53.1981734Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1981792Z cvt.u64.u32 %rd237, %r630; 2026-02-21T09:09:53.1981858Z cvt.u64.u32 %rd238, %r631; 2026-02-21T09:09:53.1981916Z shl.b64 %rd239, %rd238, 32; 2026-02-21T09:09:53.1981975Z or.b64 %rd240, %rd237, %rd239; 2026-02-21T09:09:53.1982034Z $L__tmp73: 2026-02-21T09:09:53.1982191Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 2026-02-21T09:09:53.1982252Z mov.b64 {%r728, %r729}, %rd240; 2026-02-21T09:09:53.1982324Z cvt.rn.bf16x2.f32 %r730, %r729, %r728; 2026-02-21T09:09:53.1982376Z $L__tmp74: 2026-02-21T09:09:53.1982577Z .loc 2 291 36 // standard.py:291:36 @[ c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:94:40 ] 2026-02-21T09:09:53.1982638Z cvt.u64.u32 %rd241, %r632; 2026-02-21T09:09:53.1982705Z cvt.u64.u32 %rd242, %r633; 2026-02-21T09:09:53.1982764Z shl.b64 %rd243, %rd242, 32; 2026-02-21T09:09:53.1982822Z or.b64 %rd244, %rd241, %rd243; 2026-02-21T09:09:53.1982880Z $L__tmp75: 2026-02-21T09:09:53.1983040Z .loc 1 97 28 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:97:28 
2026-02-21T09:09:53.1983101Z mov.b64 {%r731, %r732}, %rd244; 2026-02-21T09:09:53.1983165Z cvt.rn.bf16x2.f32 %r733, %r732, %r731; 2026-02-21T09:09:53.1983329Z .loc 1 98 43 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:98:43 2026-02-21T09:09:53.1983425Z st.shared.v4.b32 [%r31], {%r640, %r643, %r646, %r649}; 2026-02-21T09:09:53.1983527Z st.shared.v4.b32 [%r31+8192], {%r688, %r691, %r694, %r697}; 2026-02-21T09:09:53.1983659Z st.shared.v4.b32 [%r32], {%r652, %r655, %r658, %r661}; 2026-02-21T09:09:53.1983754Z st.shared.v4.b32 [%r32+8192], {%r700, %r703, %r706, %r709}; 2026-02-21T09:09:53.1983840Z st.shared.v4.b32 [%r33], {%r664, %r667, %r670, %r673}; 2026-02-21T09:09:53.1983940Z st.shared.v4.b32 [%r33+8192], {%r712, %r715, %r718, %r721}; 2026-02-21T09:09:53.1984024Z st.shared.v4.b32 [%r34], {%r676, %r679, %r682, %r685}; 2026-02-21T09:09:53.1984115Z st.shared.v4.b32 [%r34+8192], {%r724, %r727, %r730, %r733}; 2026-02-21T09:09:53.1984207Z // begin inline asm 2026-02-21T09:09:53.1984285Z fence.proxy.async.shared::cta; 2026-02-21T09:09:53.1984341Z // end inline asm 2026-02-21T09:09:53.1984398Z bar.sync 0; 2026-02-21T09:09:53.1984477Z elect.sync %r734|%p82, -1; 2026-02-21T09:09:53.1984544Z and.pred %p80, %p1, %p82; 2026-02-21T09:09:53.1984601Z // begin inline asm 2026-02-21T09:09:53.1984794Z @%p80 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd116, {%r635, %r636}], [%r53]; 2026-02-21T09:09:53.1984854Z // end inline asm 2026-02-21T09:09:53.1984922Z cp.async.bulk.commit_group; 2026-02-21T09:09:53.1984995Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:53.1985057Z bar.sync 0; 2026-02-21T09:09:53.1985162Z $L__BB0_8: // %._crit_edge 2026-02-21T09:09:53.1985325Z .loc 1 29 4 // c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py:29:4 2026-02-21T09:09:53.1985388Z bar.sync 0; 2026-02-21T09:09:53.1985445Z // begin inline asm 2026-02-21T09:09:53.1985563Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r735, 128; 2026-02-21T09:09:53.1985624Z // end 
inline asm 2026-02-21T09:09:53.1985676Z ret; 2026-02-21T09:09:53.1985729Z $L__tmp76: 2026-02-21T09:09:53.1985784Z $L__func_end0: 2026-02-21T09:09:53.1985872Z // -- End function 2026-02-21T09:09:53.1985924Z } 2026-02-21T09:09:53.1986141Z .file 1 "/tmp/torchinductor_root/56/c567wqjpjje6345chpe6haf5d2fguczbby5mp745pndex5wxwhx4.py" 2026-02-21T09:09:53.1986323Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:53.1986385Z .section .debug_abbrev 2026-02-21T09:09:53.1986436Z { 2026-02-21T09:09:53.1986524Z .b8 1 // Abbreviation Code 2026-02-21T09:09:53.1986617Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:53.1986697Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:53.1986777Z .b8 37 // DW_AT_producer 2026-02-21T09:09:53.1986861Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.1986935Z .b8 19 // DW_AT_language 2026-02-21T09:09:53.1987008Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:53.1987090Z .b8 3 // DW_AT_name 2026-02-21T09:09:53.1987164Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.1987242Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:53.1987321Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:53.1987395Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:53.1987466Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.1987536Z .b8 0 // EOM(1) 2026-02-21T09:09:53.1987612Z .b8 0 // EOM(2) 2026-02-21T09:09:53.1987694Z .b8 2 // Abbreviation Code 2026-02-21T09:09:53.1987775Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:53.1987856Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:53.1987930Z .b8 3 // DW_AT_name 2026-02-21T09:09:53.1988001Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.1988085Z .b8 32 // DW_AT_inline 2026-02-21T09:09:53.1988182Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.1988251Z .b8 0 // EOM(1) 2026-02-21T09:09:53.1988318Z .b8 0 // EOM(2) 2026-02-21T09:09:53.1988406Z .b8 3 // Abbreviation Code 2026-02-21T09:09:53.1988486Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:53.1988564Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:53.1988647Z .b8 17 // 
DW_AT_low_pc 2026-02-21T09:09:53.1988747Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.1988824Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:53.1988904Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.1988989Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:53.1989061Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:53.1989138Z .b8 0 // EOM(1) 2026-02-21T09:09:53.1989204Z .b8 0 // EOM(2) 2026-02-21T09:09:53.1989284Z .b8 4 // Abbreviation Code 2026-02-21T09:09:53.1989402Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:53.1989488Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:53.1989573Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:53.1989649Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:53.1989731Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:53.1989804Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.1989882Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:53.1989962Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.1990065Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:53.1990146Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.1990224Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:53.1990308Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.1990391Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:53.1990467Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.1990544Z .b8 0 // EOM(1) 2026-02-21T09:09:53.1990615Z .b8 0 // EOM(2) 2026-02-21T09:09:53.1990687Z .b8 0 // EOM(3) 2026-02-21T09:09:53.1990747Z } 2026-02-21T09:09:53.1990811Z .section .debug_info 2026-02-21T09:09:53.1990865Z { 2026-02-21T09:09:53.1990952Z .b32 178 // Length of Unit 2026-02-21T09:09:53.1991047Z .b8 2 // DWARF version number 2026-02-21T09:09:53.1991103Z .b8 0 2026-02-21T09:09:53.1991226Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:53.1991323Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:53.1991429Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:53.1991514Z .b8 116 // DW_AT_producer 2026-02-21T09:09:53.1991605Z .b8 114 2026-02-21T09:09:53.1991662Z .b8 105 2026-02-21T09:09:53.1991716Z .b8 116 2026-02-21T09:09:53.1991769Z .b8 111 2026-02-21T09:09:53.1991833Z .b8 110 2026-02-21T09:09:53.1991889Z .b8 0 2026-02-21T09:09:53.1991969Z .b8 2 // DW_AT_language 2026-02-21T09:09:53.1992031Z .b8 0 2026-02-21T09:09:53.1992110Z .b8 99 // DW_AT_name 2026-02-21T09:09:53.1992166Z .b8 53 2026-02-21T09:09:53.1992221Z .b8 54 2026-02-21T09:09:53.1992283Z .b8 55 2026-02-21T09:09:53.1992339Z .b8 119 2026-02-21T09:09:53.1992395Z .b8 113 2026-02-21T09:09:53.1992452Z .b8 106 2026-02-21T09:09:53.1992541Z .b8 112 2026-02-21T09:09:53.1992596Z .b8 106 2026-02-21T09:09:53.1992652Z .b8 106 2026-02-21T09:09:53.1992713Z .b8 101 2026-02-21T09:09:53.1992766Z .b8 54 2026-02-21T09:09:53.1992820Z .b8 51 2026-02-21T09:09:53.1992874Z .b8 52 2026-02-21T09:09:53.1992936Z .b8 53 2026-02-21T09:09:53.1992989Z .b8 99 2026-02-21T09:09:53.1993041Z .b8 104 2026-02-21T09:09:53.1993101Z .b8 112 2026-02-21T09:09:53.1993153Z .b8 101 2026-02-21T09:09:53.1993205Z .b8 54 2026-02-21T09:09:53.1993259Z .b8 104 2026-02-21T09:09:53.1993349Z .b8 97 2026-02-21T09:09:53.1993403Z .b8 102 2026-02-21T09:09:53.1993456Z .b8 53 2026-02-21T09:09:53.1993516Z .b8 100 2026-02-21T09:09:53.1993570Z .b8 50 2026-02-21T09:09:53.1993624Z .b8 102 2026-02-21T09:09:53.1993677Z .b8 103 2026-02-21T09:09:53.1993740Z .b8 117 2026-02-21T09:09:53.1993793Z .b8 99 2026-02-21T09:09:53.1993846Z .b8 122 2026-02-21T09:09:53.1993899Z .b8 98 2026-02-21T09:09:53.1993959Z .b8 98 2026-02-21T09:09:53.1994012Z .b8 121 2026-02-21T09:09:53.1994068Z .b8 53 2026-02-21T09:09:53.1994129Z .b8 109 2026-02-21T09:09:53.1994183Z .b8 112 2026-02-21T09:09:53.1994236Z .b8 55 2026-02-21T09:09:53.1994290Z .b8 52 2026-02-21T09:09:53.1994352Z .b8 53 
2026-02-21T09:09:53.1994406Z .b8 112 2026-02-21T09:09:53.1994489Z .b8 110 2026-02-21T09:09:53.1994562Z .b8 100 2026-02-21T09:09:53.1994627Z .b8 101 2026-02-21T09:09:53.1994692Z .b8 120 2026-02-21T09:09:53.1994765Z .b8 53 2026-02-21T09:09:53.1994836Z .b8 119 2026-02-21T09:09:53.1994899Z .b8 120 2026-02-21T09:09:53.1994959Z .b8 119 2026-02-21T09:09:53.1995036Z .b8 104 2026-02-21T09:09:53.1995107Z .b8 120 2026-02-21T09:09:53.1995171Z .b8 52 2026-02-21T09:09:53.1995246Z .b8 46 2026-02-21T09:09:53.1995316Z .b8 112 2026-02-21T09:09:53.1995380Z .b8 121 2026-02-21T09:09:53.1995440Z .b8 0 2026-02-21T09:09:53.1995558Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:53.1995644Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:53.1995709Z .b8 116 2026-02-21T09:09:53.1995842Z .b8 109 2026-02-21T09:09:53.1995916Z .b8 112 2026-02-21T09:09:53.1995981Z .b8 47 2026-02-21T09:09:53.1996040Z .b8 116 2026-02-21T09:09:53.1996113Z .b8 111 2026-02-21T09:09:53.1996184Z .b8 114 2026-02-21T09:09:53.1996247Z .b8 99 2026-02-21T09:09:53.1996323Z .b8 104 2026-02-21T09:09:53.1996395Z .b8 105 2026-02-21T09:09:53.1996457Z .b8 110 2026-02-21T09:09:53.1996516Z .b8 100 2026-02-21T09:09:53.1996590Z .b8 117 2026-02-21T09:09:53.1996661Z .b8 99 2026-02-21T09:09:53.1996725Z .b8 116 2026-02-21T09:09:53.1996784Z .b8 111 2026-02-21T09:09:53.1996865Z .b8 114 2026-02-21T09:09:53.1996931Z .b8 95 2026-02-21T09:09:53.1996994Z .b8 114 2026-02-21T09:09:53.1997069Z .b8 111 2026-02-21T09:09:53.1997139Z .b8 111 2026-02-21T09:09:53.1997203Z .b8 116 2026-02-21T09:09:53.1997263Z .b8 47 2026-02-21T09:09:53.1997338Z .b8 53 2026-02-21T09:09:53.1997408Z .b8 54 2026-02-21T09:09:53.1997471Z .b8 0 2026-02-21T09:09:53.1997587Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:53.1997673Z .b8 95 // DW_AT_name 2026-02-21T09:09:53.1997729Z .b8 104 2026-02-21T09:09:53.1997802Z .b8 101 2026-02-21T09:09:53.1997876Z .b8 108 2026-02-21T09:09:53.1997948Z .b8 105 2026-02-21T09:09:53.1998011Z .b8 111 2026-02-21T09:09:53.1998089Z 
.b8 110 2026-02-21T09:09:53.1998160Z .b8 95 2026-02-21T09:09:53.1998225Z .b8 109 2026-02-21T09:09:53.1998284Z .b8 97 2026-02-21T09:09:53.1998358Z .b8 116 2026-02-21T09:09:53.1998429Z .b8 109 2026-02-21T09:09:53.1998494Z .b8 117 2026-02-21T09:09:53.1998568Z .b8 108 2026-02-21T09:09:53.1998641Z .b8 95 2026-02-21T09:09:53.1998705Z .b8 98 2026-02-21T09:09:53.1998779Z .b8 102 2026-02-21T09:09:53.1998844Z .b8 49 2026-02-21T09:09:53.1998919Z .b8 54 2026-02-21T09:09:53.1998984Z .b8 95 2026-02-21T09:09:53.1999060Z .b8 105 2026-02-21T09:09:53.1999125Z .b8 110 2026-02-21T09:09:53.1999198Z .b8 116 2026-02-21T09:09:53.1999277Z .b8 52 2026-02-21T09:09:53.1999345Z .b8 0 2026-02-21T09:09:53.1999434Z .b8 1 // DW_AT_inline 2026-02-21T09:09:53.1999560Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:53.1999660Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:53.1999756Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:53.1999847Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:53.1999958Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:53.2000045Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:53.2000154Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:53.2000238Z .b64 $L__tmp75 // DW_AT_high_pc 2026-02-21T09:09:53.2000314Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:53.2000398Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:53.2000478Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:53.2000562Z .b8 0 // End Of Children Mark 2026-02-21T09:09:53.2000650Z .b8 0 // End Of Children Mark 2026-02-21T09:09:53.2000700Z } 2026-02-21T09:09:53.2000791Z .section .debug_macinfo { } 2026-02-21T09:09:53.2000796Z 2026-02-21T09:09:53.2000878Z ================================================================ 2026-02-21T09:09:53.2000983Z please share the reproducer above with Triton project. 
2026-02-21T09:09:53.2001986Z [180s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]) 2026-02-21T09:09:53.2002096Z Tensor-likes are not close! 2026-02-21T09:09:53.2002102Z 2026-02-21T09:09:53.2002186Z Mismatched elements: 33338520 / 33554432 (99.4%) 2026-02-21T09:09:53.2002329Z Greatest absolute difference: 776.0 at index (1687, 6957) (up to 0.01 allowed) 2026-02-21T09:09:53.2002477Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:09:53.2002585Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:09:53.2002589Z 2026-02-21T09:09:53.4211425Z [180s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:53.4211461Z 2026-02-21T09:09:53.4216145Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:53.4216170Z 2026-02-21T09:09:53.4217768Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:53.4217791Z 2026-02-21T09:09:53.4222836Z `ptxas` stderr: 2026-02-21T09:09:53.4222949Z ================================================================ 2026-02-21T09:09:53.4224607Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 251 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:53.4224736Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:53.4224742Z 2026-02-21T09:09:53.4225204Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpik52i3mn.ptx -o /tmp/tmpik52i3mn.ptx.o 2026-02-21T09:09:53.4225209Z 2026-02-21T09:09:53.4225349Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:53.4225649Z Internal Triton PTX codegen error 2026-02-21T09:09:53.4225721Z `ptxas` stderr: 2026-02-21T09:09:53.4226109Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 251 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:53.4226205Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:53.4226210Z 2026-02-21T09:09:53.4226612Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpik52i3mn.ptx -o /tmp/tmpik52i3mn.ptx.o 2026-02-21T09:09:53.4226673Z 2026-02-21T09:09:53.4226677Z 2026-02-21T09:09:53.4226738Z // 2026-02-21T09:09:53.4226814Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:53.4226874Z // 2026-02-21T09:09:53.4226878Z 2026-02-21T09:09:53.4226937Z .version 8.7 2026-02-21T09:09:53.4226993Z .target sm_100a 2026-02-21T09:09:53.4227057Z .address_size 64 2026-02-21T09:09:53.4227063Z 2026-02-21T09:09:53.4227216Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:53.4227301Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:53.4227438Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:53.4227518Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:53.4227641Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:53.4227753Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:53.4227870Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:53.4227978Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:53.4228091Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:53.4228175Z ) 2026-02-21T09:09:53.4228232Z .reqntid 128 2026-02-21T09:09:53.4228290Z .maxnreg 32 2026-02-21T09:09:53.4228354Z { 2026-02-21T09:09:53.4228472Z .reg .pred %p<60>; 2026-02-21T09:09:53.4228541Z .reg .b16 %rs<120>; 2026-02-21T09:09:53.4228601Z .reg .b32 %r<351>; 2026-02-21T09:09:53.4228674Z .reg .b64 %rd<120>; 2026-02-21T09:09:53.4228736Z $L__func_begin0: 2026-02-21T09:09:53.4228741Z 2026-02-21T09:09:53.4228799Z // %bb.0: 2026-02-21T09:09:53.4229000Z .loc 1 19 0 // 
cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:19 2026-02-21T09:09:53.4229062Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:53.4229131Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:53.4229241Z ld.param.b64 %rd21, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:53.4229314Z mov.b32 %r62, global_smem; 2026-02-21T09:09:53.4229383Z // begin inline asm 2026-02-21T09:09:53.4229534Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r62], 64; 2026-02-21T09:09:53.4229600Z // end inline asm 2026-02-21T09:09:53.4229698Z ld.param.b64 %rd38, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:53.4229904Z bar.sync 0; 2026-02-21T09:09:53.4229987Z ld.shared.b32 %r344, [global_smem]; 2026-02-21T09:09:53.4230045Z bar.sync 0; 2026-02-21T09:09:53.4230104Z // begin inline asm 2026-02-21T09:09:53.4230225Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:53.4230292Z // end inline asm 2026-02-21T09:09:53.4230466Z .loc 1 21 67 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:21:67 2026-02-21T09:09:53.4230528Z mov.u32 %r71, %ctaid.x; 2026-02-21T09:09:53.4230597Z mov.u32 %r72, %ctaid.y; 2026-02-21T09:09:53.4230653Z mov.u32 %r73, %ctaid.z; 2026-02-21T09:09:53.4230714Z mov.u32 %r74, %nctaid.x; 2026-02-21T09:09:53.4230780Z mov.u32 %r75, %nctaid.y; 2026-02-21T09:09:53.4230849Z mad.lo.s32 %r76, %r73, %r75, %r72; 2026-02-21T09:09:53.4230915Z mad.lo.s32 %r77, %r76, %r74, %r71; 2026-02-21T09:09:53.4230976Z shl.b32 %r78, %r77, 7; 2026-02-21T09:09:53.4231050Z cvt.s64.s32 %rd39, %r78; 2026-02-21T09:09:53.4231117Z add.s64 %rd35, %rd38, %rd39; 2026-02-21T09:09:53.4231177Z shl.b32 %r79, %r1, 2; 2026-02-21T09:09:53.4231278Z add.s32 %r63, %r62, %r79; 2026-02-21T09:09:53.4231332Z mov.b32 %r64, 0; 2026-02-21T09:09:53.4231388Z // begin inline asm 2026-02-21T09:09:53.4231458Z @%p1 st.shared.b32 [ %r63 + 0 ], %r64; 2026-02-21T09:09:53.4231523Z // end inline asm 2026-02-21T09:09:53.4231661Z bar.warp.sync -1; 2026-02-21T09:09:53.4231725Z setp.eq.b32 %p4, 
%r1, 0; 2026-02-21T09:09:53.4231791Z cvt.u64.u32 %rd20, %r62; 2026-02-21T09:09:53.4231848Z // begin inline asm 2026-02-21T09:09:53.4232015Z @%p4 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd20 + 0 ], %rd21; 2026-02-21T09:09:53.4232114Z // end inline asm 2026-02-21T09:09:53.4232172Z // begin inline asm 2026-02-21T09:09:53.4232313Z @%p4 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x1; 2026-02-21T09:09:53.4232370Z // end inline asm 2026-02-21T09:09:53.4232435Z mov.b32 %r65, 32; 2026-02-21T09:09:53.4232492Z // begin inline asm 2026-02-21T09:09:53.4232645Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x0, %r65; 2026-02-21T09:09:53.4232709Z // end inline asm 2026-02-21T09:09:53.4232764Z mov.b32 %r66, 64; 2026-02-21T09:09:53.4232820Z // begin inline asm 2026-02-21T09:09:53.4232999Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x1, %r66; 2026-02-21T09:09:53.4233061Z // end inline asm 2026-02-21T09:09:53.4233119Z mov.b32 %r67, 8192; 2026-02-21T09:09:53.4233175Z // begin inline asm 2026-02-21T09:09:53.4233337Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x0, %r67; 2026-02-21T09:09:53.4233393Z // end inline asm 2026-02-21T09:09:53.4233449Z mov.b32 %r68, 4096; 2026-02-21T09:09:53.4233513Z // begin inline asm 2026-02-21T09:09:53.4233662Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x1, %r68; 2026-02-21T09:09:53.4233716Z // end inline asm 2026-02-21T09:09:53.4233773Z mov.b64 %rd28, 16384; 2026-02-21T09:09:53.4233835Z // begin inline asm 2026-02-21T09:09:53.4234032Z @%p4 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd20 + 0 ], 0x0, %rd28; 2026-02-21T09:09:53.4234090Z // end inline asm 2026-02-21T09:09:53.4234150Z mov.b32 %r69, 1; 2026-02-21T09:09:53.4234206Z // begin inline asm 2026-02-21T09:09:53.4234375Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x0, %r69; 
2026-02-21T09:09:53.4234437Z // end inline asm 2026-02-21T09:09:53.4234491Z // begin inline asm 2026-02-21T09:09:53.4234653Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x1, %r69; 2026-02-21T09:09:53.4234715Z // end inline asm 2026-02-21T09:09:53.4234772Z // begin inline asm 2026-02-21T09:09:53.4234916Z @%p4 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd20 + 0 ], 0xa; 2026-02-21T09:09:53.4234971Z // end inline asm 2026-02-21T09:09:53.4235033Z // begin inline asm 2026-02-21T09:09:53.4235194Z @%p4 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x0; 2026-02-21T09:09:53.4235250Z // end inline asm 2026-02-21T09:09:53.4235313Z // begin inline asm 2026-02-21T09:09:53.4235464Z @%p4 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x2; 2026-02-21T09:09:53.4235518Z // end inline asm 2026-02-21T09:09:53.4235581Z // begin inline asm 2026-02-21T09:09:53.4235721Z @%p4 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd20 + 0 ], 0x0; 2026-02-21T09:09:53.4235776Z // end inline asm 2026-02-21T09:09:53.4235831Z // begin inline asm 2026-02-21T09:09:53.4236098Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd35 + 0 ], [ %rd20 + 0 ], 0x80; 2026-02-21T09:09:53.4236155Z // end inline asm 2026-02-21T09:09:53.4236210Z // begin inline asm 2026-02-21T09:09:53.4236344Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd35 + 0 ], 0x80; 2026-02-21T09:09:53.4236417Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:53.4236492Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:53.4236557Z // end inline asm 2026-02-21T09:09:53.4236646Z bar.sync 0; 2026-02-21T09:09:53.4236711Z cvta.global.u64 %rd84, %rd35; 2026-02-21T09:09:53.4236887Z .loc 1 27 35 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:27:35 2026-02-21T09:09:53.4236956Z shl.b32 %r345, %r71, 1; 2026-02-21T09:09:53.4237125Z .loc 1 28 37 // 
cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:28:37 2026-02-21T09:09:53.4237184Z add.s32 %r80, %r345, 2; 2026-02-21T09:09:53.4237356Z .loc 1 28 49 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:28:49 2026-02-21T09:09:53.4237438Z min.s32 %r4, %r80, 16384; 2026-02-21T09:09:53.4237602Z .loc 1 29 88 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:29:88 2026-02-21T09:09:53.4237674Z setp.ge.s32 %p21, %r345, %r4; 2026-02-21T09:09:53.4237734Z @%p21 bra $L__BB0_9; 2026-02-21T09:09:53.4237812Z // %bb.1: // %.lr.ph 2026-02-21T09:09:53.4237986Z .loc 1 0 88 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:0:88 2026-02-21T09:09:53.4238088Z ld.param.b64 %rd19, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:53.4238228Z ld.param.b64 %rd18, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:53.4238286Z shr.u32 %r5, %r1, 5; 2026-02-21T09:09:53.4238351Z shl.b32 %r6, %r1, 3; 2026-02-21T09:09:53.4238410Z and.b32 %r7, %r6, 24; 2026-02-21T09:09:53.4238467Z and.b32 %r8, %r1, 7; 2026-02-21T09:09:53.4238531Z shl.b32 %r9, %r8, 2; 2026-02-21T09:09:53.4238597Z bfe.u32 %r10, %r1, 2, 5; 2026-02-21T09:09:53.4238656Z and.b32 %r11, %r1, 32; 2026-02-21T09:09:53.4238713Z and.b32 %r81, %r1, 127; 2026-02-21T09:09:53.4238784Z shl.b32 %r12, %r81, 4; 2026-02-21T09:09:53.4238845Z add.s32 %r216, %r62, %r12; 2026-02-21T09:09:53.4238906Z add.s32 %r218, %r216, 2048; 2026-02-21T09:09:53.4238973Z add.s32 %r202, %r344, 32; 2026-02-21T09:09:53.4239029Z shl.b32 %r83, %r1, 10; 2026-02-21T09:09:53.4239115Z and.b32 %r16, %r83, 122880; 2026-02-21T09:09:53.4239177Z shl.b32 %r17, %r81, 2; 2026-02-21T09:09:53.4239250Z add.s32 %r84, %r62, 12288; 2026-02-21T09:09:53.4239308Z add.s32 %r220, %r84, %r17; 2026-02-21T09:09:53.4239365Z or.b32 %r19, %r7, 32; 2026-02-21T09:09:53.4239429Z add.s32 %r148, %r216, 4096; 2026-02-21T09:09:53.4239486Z add.s32 %r150, %r216, 6144; 2026-02-21T09:09:53.4239543Z add.s32 %r85, %r62, %r17; 2026-02-21T09:09:53.4239601Z add.s32 %r152, %r85, 12800; 
2026-02-21T09:09:53.4239664Z shl.b32 %r86, %r1, 6; 2026-02-21T09:09:53.4239724Z and.b32 %r87, %r86, 960; 2026-02-21T09:09:53.4239779Z and.b32 %r88, %r1, 96; 2026-02-21T09:09:53.4239842Z shl.b32 %r89, %r88, 5; 2026-02-21T09:09:53.4239897Z shl.b32 %r90, %r1, 1; 2026-02-21T09:09:53.4239953Z and.b32 %r91, %r90, 32; 2026-02-21T09:09:53.4240009Z or.b32 %r92, %r87, %r89; 2026-02-21T09:09:53.4240072Z or.b32 %r23, %r92, %r91; 2026-02-21T09:09:53.4240130Z add.s32 %r24, %r62, %r23; 2026-02-21T09:09:53.4240186Z and.b32 %r93, %r1, 31; 2026-02-21T09:09:53.4240251Z shr.u32 %r94, %r1, 1; 2026-02-21T09:09:53.4240308Z and.b32 %r95, %r94, 32; 2026-02-21T09:09:53.4240364Z or.b32 %r25, %r95, %r93; 2026-02-21T09:09:53.4240430Z add.s32 %r26, %r84, %r25; 2026-02-21T09:09:53.4240488Z shl.b32 %r96, %r93, 7; 2026-02-21T09:09:53.4240544Z shl.b32 %r97, %r8, 4; 2026-02-21T09:09:53.4240600Z shr.u32 %r98, %r88, 3; 2026-02-21T09:09:53.4240667Z or.b32 %r99, %r96, %r97; 2026-02-21T09:09:53.4240724Z or.b32 %r100, %r99, %r98; 2026-02-21T09:09:53.4240782Z add.s32 %r101, %r62, 8192; 2026-02-21T09:09:53.4240848Z add.s32 %r27, %r101, %r100; 2026-02-21T09:09:53.4240905Z xor.b32 %r102, %r100, 16; 2026-02-21T09:09:53.4240964Z add.s32 %r28, %r101, %r102; 2026-02-21T09:09:53.4241021Z xor.b32 %r103, %r100, 32; 2026-02-21T09:09:53.4241086Z add.s32 %r29, %r101, %r103; 2026-02-21T09:09:53.4241142Z xor.b32 %r104, %r100, 48; 2026-02-21T09:09:53.4241199Z add.s32 %r30, %r101, %r104; 2026-02-21T09:09:53.4241264Z xor.b32 %r105, %r100, 64; 2026-02-21T09:09:53.4241368Z add.s32 %r31, %r101, %r105; 2026-02-21T09:09:53.4241424Z xor.b32 %r106, %r100, 80; 2026-02-21T09:09:53.4241482Z add.s32 %r32, %r101, %r106; 2026-02-21T09:09:53.4241606Z xor.b32 %r107, %r100, 96; 2026-02-21T09:09:53.4241667Z add.s32 %r33, %r101, %r107; 2026-02-21T09:09:53.4241724Z xor.b32 %r108, %r100, 112; 2026-02-21T09:09:53.4241787Z add.s32 %r34, %r101, %r108; 2026-02-21T09:09:53.4241848Z bfe.u32 %r109, %r101, 4, 14; 2026-02-21T09:09:53.4241908Z 
cvt.u64.u32 %rd40, %r109; 2026-02-21T09:09:53.4242013Z or.b64 %rd62, %rd40, 4611686293322072064; 2026-02-21T09:09:53.4242081Z add.s32 %r110, %r62, 8224; 2026-02-21T09:09:53.4242144Z bfe.u32 %r111, %r110, 4, 14; 2026-02-21T09:09:53.4242204Z cvt.u64.u32 %rd41, %r111; 2026-02-21T09:09:53.4242280Z or.b64 %rd63, %rd41, 4611686293322072064; 2026-02-21T09:09:53.4242340Z add.s32 %r112, %r62, 8256; 2026-02-21T09:09:53.4242399Z bfe.u32 %r113, %r112, 4, 14; 2026-02-21T09:09:53.4242460Z cvt.u64.u32 %rd42, %r113; 2026-02-21T09:09:53.4242535Z or.b64 %rd64, %rd42, 4611686293322072064; 2026-02-21T09:09:53.4242593Z add.s32 %r114, %r62, 8288; 2026-02-21T09:09:53.4242652Z bfe.u32 %r115, %r114, 4, 14; 2026-02-21T09:09:53.4242718Z cvt.u64.u32 %rd43, %r115; 2026-02-21T09:09:53.4242814Z or.b64 %rd65, %rd43, 4611686293322072064; 2026-02-21T09:09:53.4242873Z or.b32 %r35, %r7, 64; 2026-02-21T09:09:53.4242936Z and.b32 %r116, %r6, 48; 2026-02-21T09:09:53.4242993Z or.b32 %r117, %r89, %r116; 2026-02-21T09:09:53.4243052Z xor.b32 %r118, %r117, %r91; 2026-02-21T09:09:53.4243110Z or.b32 %r119, %r118, %r87; 2026-02-21T09:09:53.4243176Z add.s32 %r36, %r62, %r119; 2026-02-21T09:09:53.4243233Z xor.b32 %r120, %r119, 16; 2026-02-21T09:09:53.4243289Z add.s32 %r37, %r62, %r120; 2026-02-21T09:09:53.4243473Z .loc 1 50 125 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:50:125 2026-02-21T09:09:53.4243531Z or.b32 %r121, %r16, %r9; 2026-02-21T09:09:53.4243617Z or.b32 %r38, %r121, 393216; 2026-02-21T09:09:53.4243785Z .loc 1 29 88 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:29:88 2026-02-21T09:09:53.4243851Z and.b32 %r122, %r1, 3; 2026-02-21T09:09:53.4243920Z mad.wide.u32 %rd44, %r122, 16, %rd18; 2026-02-21T09:09:53.4243980Z add.s64 %rd6, %rd44, 65728; 2026-02-21T09:09:53.4244046Z shl.b32 %r39, %r10, 10; 2026-02-21T09:09:53.4244103Z bra.uni $L__BB0_2; 2026-02-21T09:09:53.4244208Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:53.4244381Z .loc 1 0 88 // 
cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:0:88 2026-02-21T09:09:53.4244448Z mov.b32 %r296, 1; 2026-02-21T09:09:53.4244503Z $L__tmp0: 2026-02-21T09:09:53.4244724Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4244788Z // begin inline asm 2026-02-21T09:09:53.4244839Z 2026-02-21T09:09:53.4244892Z { 2026-02-21T09:09:53.4244962Z .reg .pred complete; 2026-02-21T09:09:53.4245019Z waitLoop: 2026-02-21T09:09:53.4245140Z mbarrier.try_wait.parity.shared.b64 complete, [%r347], %r296; 2026-02-21T09:09:53.4245205Z @!complete bra.uni waitLoop; 2026-02-21T09:09:53.4245265Z } 2026-02-21T09:09:53.4245269Z 2026-02-21T09:09:53.4245328Z // end inline asm 2026-02-21T09:09:53.4245384Z $L__tmp1: 2026-02-21T09:09:53.4245567Z .loc 1 50 125 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:50:125 2026-02-21T09:09:53.4245631Z cp.async.wait_group 0; 2026-02-21T09:09:53.4245691Z bar.sync 0; 2026-02-21T09:09:53.4245760Z add.s32 %r297, %r62, 13312; 2026-02-21T09:09:53.4245818Z // begin inline asm 2026-02-21T09:09:53.4245905Z @%p4 mbarrier.inval.shared::cta.b64 [%r297]; 2026-02-21T09:09:53.4245961Z // end inline asm 2026-02-21T09:09:53.4246023Z bar.sync 0; 2026-02-21T09:09:53.4246080Z // begin inline asm 2026-02-21T09:09:53.4246163Z @%p4 mbarrier.inval.shared::cta.b64 [%r141]; 2026-02-21T09:09:53.4246256Z // end inline asm 2026-02-21T09:09:53.4246312Z $L__tmp2: 2026-02-21T09:09:53.4246540Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4246609Z // begin inline asm 2026-02-21T09:09:53.4246913Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r299, %r300, %r301, %r302, %r303, %r304, %r305, %r306, %r307, %r308, %r309, %r310, %r311, %r312, %r313, %r314}, [%r315 + 0], 16; 2026-02-21T09:09:53.4246972Z // end inline asm 2026-02-21T09:09:53.4247055Z // begin inline asm 2026-02-21T09:09:53.4247139Z tcgen05.wait::ld.sync.aligned; 
2026-02-21T09:09:53.4247197Z // end inline asm 2026-02-21T09:09:53.4247262Z cvt.u64.u32 %rd85, %r299; 2026-02-21T09:09:53.4247331Z cvt.u64.u32 %rd86, %r300; 2026-02-21T09:09:53.4247393Z shl.b64 %rd87, %rd86, 32; 2026-02-21T09:09:53.4247454Z or.b64 %rd88, %rd85, %rd87; 2026-02-21T09:09:53.4247509Z $L__tmp3: 2026-02-21T09:09:53.4247696Z .loc 1 97 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:97:28 2026-02-21T09:09:53.4247763Z mov.b64 {%r319, %r320}, %rd88; 2026-02-21T09:09:53.4247837Z cvt.rn.bf16x2.f32 %r321, %r320, %r319; 2026-02-21T09:09:53.4247898Z $L__tmp4: 2026-02-21T09:09:53.4248147Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4248209Z cvt.u64.u32 %rd89, %r301; 2026-02-21T09:09:53.4248277Z cvt.u64.u32 %rd90, %r302; 2026-02-21T09:09:53.4248338Z shl.b64 %rd91, %rd90, 32; 2026-02-21T09:09:53.4248402Z or.b64 %rd92, %rd89, %rd91; 2026-02-21T09:09:53.4248456Z $L__tmp5: 2026-02-21T09:09:53.4248635Z .loc 1 97 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:97:28 2026-02-21T09:09:53.4248700Z mov.b64 {%r322, %r323}, %rd92; 2026-02-21T09:09:53.4248773Z cvt.rn.bf16x2.f32 %r324, %r323, %r322; 2026-02-21T09:09:53.4248834Z $L__tmp6: 2026-02-21T09:09:53.4249080Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4249144Z cvt.u64.u32 %rd93, %r303; 2026-02-21T09:09:53.4249213Z cvt.u64.u32 %rd94, %r304; 2026-02-21T09:09:53.4249276Z shl.b64 %rd95, %rd94, 32; 2026-02-21T09:09:53.4249337Z or.b64 %rd96, %rd93, %rd95; 2026-02-21T09:09:53.4249391Z $L__tmp7: 2026-02-21T09:09:53.4249571Z .loc 1 97 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:97:28 2026-02-21T09:09:53.4249634Z mov.b64 {%r325, %r326}, %rd96; 2026-02-21T09:09:53.4249705Z cvt.rn.bf16x2.f32 %r327, %r326, %r325; 2026-02-21T09:09:53.4249767Z $L__tmp8: 2026-02-21T09:09:53.4249982Z .loc 2 291 36 // standard.py:291:36 @[ 
cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4250044Z cvt.u64.u32 %rd97, %r305; 2026-02-21T09:09:53.4250111Z cvt.u64.u32 %rd98, %r306; 2026-02-21T09:09:53.4250171Z shl.b64 %rd99, %rd98, 32; 2026-02-21T09:09:53.4250235Z or.b64 %rd100, %rd97, %rd99; 2026-02-21T09:09:53.4250291Z $L__tmp9: 2026-02-21T09:09:53.4250471Z .loc 1 97 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:97:28 2026-02-21T09:09:53.4250537Z mov.b64 {%r328, %r329}, %rd100; 2026-02-21T09:09:53.4250605Z cvt.rn.bf16x2.f32 %r330, %r329, %r328; 2026-02-21T09:09:53.4250670Z $L__tmp10: 2026-02-21T09:09:53.4250891Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4250957Z cvt.u64.u32 %rd101, %r307; 2026-02-21T09:09:53.4251018Z cvt.u64.u32 %rd102, %r308; 2026-02-21T09:09:53.4251086Z shl.b64 %rd103, %rd102, 32; 2026-02-21T09:09:53.4251148Z or.b64 %rd104, %rd101, %rd103; 2026-02-21T09:09:53.4251204Z $L__tmp11: 2026-02-21T09:09:53.4251382Z .loc 1 97 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:97:28 2026-02-21T09:09:53.4251444Z mov.b64 {%r331, %r332}, %rd104; 2026-02-21T09:09:53.4251515Z cvt.rn.bf16x2.f32 %r333, %r332, %r331; 2026-02-21T09:09:53.4251629Z $L__tmp12: 2026-02-21T09:09:53.4251849Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4251913Z cvt.u64.u32 %rd105, %r309; 2026-02-21T09:09:53.4251973Z cvt.u64.u32 %rd106, %r310; 2026-02-21T09:09:53.4252043Z shl.b64 %rd107, %rd106, 32; 2026-02-21T09:09:53.4252105Z or.b64 %rd108, %rd105, %rd107; 2026-02-21T09:09:53.4252159Z $L__tmp13: 2026-02-21T09:09:53.4252334Z .loc 1 97 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:97:28 2026-02-21T09:09:53.4252427Z mov.b64 {%r334, %r335}, %rd108; 2026-02-21T09:09:53.4252497Z cvt.rn.bf16x2.f32 %r336, %r335, %r334; 2026-02-21T09:09:53.4252559Z $L__tmp14: 2026-02-21T09:09:53.4252781Z .loc 2 291 36 
// standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4252845Z cvt.u64.u32 %rd109, %r311; 2026-02-21T09:09:53.4252907Z cvt.u64.u32 %rd110, %r312; 2026-02-21T09:09:53.4252977Z shl.b64 %rd111, %rd110, 32; 2026-02-21T09:09:53.4253039Z or.b64 %rd112, %rd109, %rd111; 2026-02-21T09:09:53.4253093Z $L__tmp15: 2026-02-21T09:09:53.4253304Z .loc 1 97 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:97:28 2026-02-21T09:09:53.4253369Z mov.b64 {%r337, %r338}, %rd112; 2026-02-21T09:09:53.4253438Z cvt.rn.bf16x2.f32 %r339, %r338, %r337; 2026-02-21T09:09:53.4253493Z $L__tmp16: 2026-02-21T09:09:53.4253719Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4253782Z cvt.u64.u32 %rd113, %r313; 2026-02-21T09:09:53.4253843Z cvt.u64.u32 %rd114, %r314; 2026-02-21T09:09:53.4253913Z shl.b64 %rd115, %rd114, 32; 2026-02-21T09:09:53.4253975Z or.b64 %rd116, %rd113, %rd115; 2026-02-21T09:09:53.4254028Z $L__tmp17: 2026-02-21T09:09:53.4254235Z .loc 1 97 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:97:28 2026-02-21T09:09:53.4254301Z mov.b64 {%r340, %r341}, %rd116; 2026-02-21T09:09:53.4254372Z cvt.rn.bf16x2.f32 %r342, %r341, %r340; 2026-02-21T09:09:53.4254546Z .loc 1 98 43 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:98:43 2026-02-21T09:09:53.4254662Z st.shared.v4.b32 [%r36], {%r321, %r324, %r327, %r330}; 2026-02-21T09:09:53.4254770Z st.shared.v4.b32 [%r37], {%r333, %r336, %r339, %r342}; 2026-02-21T09:09:53.4254829Z // begin inline asm 2026-02-21T09:09:53.4254927Z fence.proxy.async.shared::cta; 2026-02-21T09:09:53.4254985Z // end inline asm 2026-02-21T09:09:53.4255040Z bar.sync 0; 2026-02-21T09:09:53.4255114Z elect.sync %r343|%p57, -1; 2026-02-21T09:09:53.4255179Z and.pred %p55, %p1, %p57; 2026-02-21T09:09:53.4255235Z // begin inline asm 2026-02-21T09:09:53.4255415Z @%p55 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd84, 
{%r316, %r317}], [%r62]; 2026-02-21T09:09:53.4255480Z // end inline asm 2026-02-21T09:09:53.4255546Z cp.async.bulk.commit_group; 2026-02-21T09:09:53.4255618Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:53.4255679Z bar.sync 0; 2026-02-21T09:09:53.4255846Z .loc 1 29 88 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:29:88 2026-02-21T09:09:53.4255905Z add.s32 %r345, %r345, 1; 2026-02-21T09:09:53.4255978Z setp.ne.b32 %p58, %r345, %r4; 2026-02-21T09:09:53.4256036Z @%p58 bra $L__BB0_2; 2026-02-21T09:09:53.4256093Z bra.uni $L__BB0_9; 2026-02-21T09:09:53.4256196Z $L__BB0_2: // =>This Loop Header: Depth=1 2026-02-21T09:09:53.4256296Z // Child Loop BB0_5 Depth 2 2026-02-21T09:09:53.4256461Z .loc 1 81 38 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:81:38 2026-02-21T09:09:53.4256522Z setp.eq.b32 %p26, %r11, 0; 2026-02-21T09:09:53.4256692Z .loc 1 35 35 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:35:35 2026-02-21T09:09:53.4256781Z shr.s32 %r171, %r345, 31; 2026-02-21T09:09:53.4256841Z shr.u32 %r172, %r171, 22; 2026-02-21T09:09:53.4256909Z add.s32 %r173, %r345, %r172; 2026-02-21T09:09:53.4256967Z shr.s32 %r174, %r173, 10; 2026-02-21T09:09:53.4257133Z .loc 1 38 45 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:38:45 2026-02-21T09:09:53.4257193Z and.b32 %r175, %r173, 64512; 2026-02-21T09:09:53.4257260Z sub.s32 %r176, %r345, %r175; 2026-02-21T09:09:53.4257464Z .loc 1 38 64 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:38:64 2026-02-21T09:09:53.4257525Z cvt.u16.u32 %rs1, %r176; 2026-02-21T09:09:53.4257698Z .loc 1 39 51 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:39:51 2026-02-21T09:09:53.4257759Z shr.s16 %rs2, %rs1, 15; 2026-02-21T09:09:53.4257819Z shr.u16 %rs3, %rs2, 12; 2026-02-21T09:09:53.4257887Z add.s16 %rs4, %rs1, %rs3; 2026-02-21T09:09:53.4257945Z shr.s16 %rs5, %rs4, 4; 2026-02-21T09:09:53.4258106Z .loc 1 38 64 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:38:64 
2026-02-21T09:09:53.4258167Z and.b16 %rs6, %rs4, -16; 2026-02-21T09:09:53.4258256Z sub.s16 %rs7, %rs1, %rs6; 2026-02-21T09:09:53.4258422Z .loc 1 40 27 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:40:27 2026-02-21T09:09:53.4258480Z shl.b32 %r43, %r174, 9; 2026-02-21T09:09:53.4258550Z mul.wide.s16 %r44, %rs7, 32; 2026-02-21T09:09:53.4258611Z add.s32 %r316, %r44, %r43; 2026-02-21T09:09:53.4258776Z .loc 1 41 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:41:32 2026-02-21T09:09:53.4258840Z or.b32 %r177, %r316, %r9; 2026-02-21T09:09:53.4259003Z .loc 1 42 27 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:42:27 2026-02-21T09:09:53.4259066Z mul.wide.s16 %r317, %rs5, 64; 2026-02-21T09:09:53.4259252Z .loc 1 43 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:43:32 2026-02-21T09:09:53.4259322Z or.b32 %r178, %r317, %r10; 2026-02-21T09:09:53.4259486Z .loc 1 58 53 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:53 2026-02-21T09:09:53.4259544Z shl.b32 %r179, %r178, 10; 2026-02-21T09:09:53.4259602Z $L__tmp18: 2026-02-21T09:09:53.4259817Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4259893Z shfl.sync.idx.b32 %r47, %r5, 0, 31, -1; 2026-02-21T09:09:53.4259958Z shl.b32 %r180, %r47, 21; 2026-02-21T09:09:53.4260019Z and.b32 %r181, %r180, 6291456; 2026-02-21T09:09:53.4260078Z add.s32 %r315, %r181, %r344; 2026-02-21T09:09:53.4260139Z mov.pred %p30, -1; 2026-02-21T09:09:53.4260202Z mov.b32 %r346, 0; 2026-02-21T09:09:53.4260258Z // begin inline asm 2026-02-21T09:09:53.4260554Z @%p30 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r315 + 0], 16, {%r346, %r346, %r346, %r346, %r346, %r346, %r346, %r346, %r346, %r346, %r346, %r346, %r346, %r346, %r346, %r346}; 2026-02-21T09:09:53.4260618Z // end inline asm 2026-02-21T09:09:53.4260675Z // begin inline asm 2026-02-21T09:09:53.4260747Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:53.4260808Z // end inline 
asm 2026-02-21T09:09:53.4260862Z bar.sync 0; 2026-02-21T09:09:53.4260914Z $L__tmp19: 2026-02-21T09:09:53.4261089Z .loc 1 50 125 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:50:125 2026-02-21T09:09:53.4261157Z add.s32 %r347, %r62, 13312; 2026-02-21T09:09:53.4261213Z // begin inline asm 2026-02-21T09:09:53.4261301Z @%p4 mbarrier.init.shared::cta.b64 [%r347], 1; 2026-02-21T09:09:53.4261363Z // end inline asm 2026-02-21T09:09:53.4261417Z bar.sync 0; 2026-02-21T09:09:53.4261476Z add.s32 %r141, %r62, 13320; 2026-02-21T09:09:53.4261563Z // begin inline asm 2026-02-21T09:09:53.4261657Z @%p4 mbarrier.init.shared::cta.b64 [%r141], 1; 2026-02-21T09:09:53.4261716Z // end inline asm 2026-02-21T09:09:53.4261920Z .loc 1 58 60 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:60 2026-02-21T09:09:53.4261988Z or.b32 %r183, %r179, %r7; 2026-02-21T09:09:53.4262153Z .loc 1 58 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:32 2026-02-21T09:09:53.4262222Z mad.wide.s32 %rd45, %r183, 2, %rd18; 2026-02-21T09:09:53.4262290Z cvt.u64.u32 %rd51, %r7; 2026-02-21T09:09:53.4262349Z cvt.s64.s32 %rd7, %r179; 2026-02-21T09:09:53.4262409Z or.b64 %rd52, %rd7, %rd51; 2026-02-21T09:09:53.4262499Z shl.b64 %rd53, %rd52, 1; 2026-02-21T09:09:53.4262567Z add.s64 %rd8, %rd18, %rd53; 2026-02-21T09:09:53.4262626Z add.s64 %rd46, %rd8, 65536; 2026-02-21T09:09:53.4262682Z mov.b32 %r217, 16; 2026-02-21T09:09:53.4262856Z .loc 1 58 80 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:80 2026-02-21T09:09:53.4262914Z // begin inline asm 2026-02-21T09:09:53.4263036Z cp.async.cg.shared.global [ %r216 + 0 ], [ %rd45 + 0 ], 0x10, %r217; 2026-02-21T09:09:53.4263101Z // end inline asm 2026-02-21T09:09:53.4263157Z // begin inline asm 2026-02-21T09:09:53.4263307Z cp.async.cg.shared.global [ %r218 + 0 ], [ %rd46 + 0 ], 0x10, %r217; 2026-02-21T09:09:53.4263366Z // end inline asm 2026-02-21T09:09:53.4263439Z cp.async.commit_group; 2026-02-21T09:09:53.4263604Z .loc 1 64 62 // 
cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:62 2026-02-21T09:09:53.4263666Z add.s32 %r184, %r177, %r16; 2026-02-21T09:09:53.4263842Z .loc 1 64 34 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:34 2026-02-21T09:09:53.4263906Z cvt.s64.s32 %rd54, %r184; 2026-02-21T09:09:53.4263969Z add.s64 %rd47, %rd19, %rd54; 2026-02-21T09:09:53.4264026Z mov.b32 %r147, 4; 2026-02-21T09:09:53.4264196Z .loc 1 64 87 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:87 2026-02-21T09:09:53.4264281Z // begin inline asm 2026-02-21T09:09:53.4264397Z cp.async.ca.shared.global [ %r220 + 0 ], [ %rd47 + 0 ], 0x4, %r147; 2026-02-21T09:09:53.4264462Z // end inline asm 2026-02-21T09:09:53.4264524Z cp.async.commit_group; 2026-02-21T09:09:53.4264689Z .loc 1 58 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:32 2026-02-21T09:09:53.4264756Z add.s64 %rd48, %rd8, 64; 2026-02-21T09:09:53.4264815Z cvt.u64.u32 %rd55, %r19; 2026-02-21T09:09:53.4264874Z or.b64 %rd56, %rd7, %rd55; 2026-02-21T09:09:53.4264932Z shl.b64 %rd57, %rd56, 1; 2026-02-21T09:09:53.4264999Z add.s64 %rd58, %rd18, %rd57; 2026-02-21T09:09:53.4265057Z add.s64 %rd49, %rd58, 65536; 2026-02-21T09:09:53.4265225Z .loc 1 58 80 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:80 2026-02-21T09:09:53.4265286Z bar.sync 0; 2026-02-21T09:09:53.4265343Z // begin inline asm 2026-02-21T09:09:53.4265454Z cp.async.cg.shared.global [ %r148 + 0 ], [ %rd48 + 0 ], 0x10, %r217; 2026-02-21T09:09:53.4265519Z // end inline asm 2026-02-21T09:09:53.4265575Z // begin inline asm 2026-02-21T09:09:53.4265685Z cp.async.cg.shared.global [ %r150 + 0 ], [ %rd49 + 0 ], 0x10, %r217; 2026-02-21T09:09:53.4265740Z // end inline asm 2026-02-21T09:09:53.4265810Z cp.async.commit_group; 2026-02-21T09:09:53.4265974Z .loc 1 64 34 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:34 2026-02-21T09:09:53.4266034Z cvt.s64.s32 %rd59, %r177; 2026-02-21T09:09:53.4266100Z cvt.u64.u32 %rd60, %r16; 
2026-02-21T09:09:53.4266160Z add.s64 %rd61, %rd60, %rd59; 2026-02-21T09:09:53.4266219Z add.s64 %rd9, %rd19, %rd61; 2026-02-21T09:09:53.4266277Z add.s64 %rd50, %rd9, 131072; 2026-02-21T09:09:53.4266449Z .loc 1 64 87 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:87 2026-02-21T09:09:53.4266504Z // begin inline asm 2026-02-21T09:09:53.4266617Z cp.async.ca.shared.global [ %r152 + 0 ], [ %rd50 + 0 ], 0x4, %r147; 2026-02-21T09:09:53.4266681Z // end inline asm 2026-02-21T09:09:53.4266770Z cp.async.commit_group; 2026-02-21T09:09:53.4266929Z .loc 1 58 80 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:80 2026-02-21T09:09:53.4267000Z cp.async.wait_group 2; 2026-02-21T09:09:53.4267054Z bar.sync 0; 2026-02-21T09:09:53.4267148Z ld.shared.v4.b32 {%r185, %r186, %r187, %r188}, [%r24]; 2026-02-21T09:09:53.4267208Z mov.b32 {%rs8, %rs9}, %r188; 2026-02-21T09:09:53.4267278Z mov.b32 {%rs10, %rs11}, %r187; 2026-02-21T09:09:53.4267337Z mov.b32 {%rs12, %rs13}, %r186; 2026-02-21T09:09:53.4267421Z mov.b32 {%rs14, %rs15}, %r185; 2026-02-21T09:09:53.4267528Z ld.shared.v4.b32 {%r189, %r190, %r191, %r192}, [%r24+16]; 2026-02-21T09:09:53.4267587Z mov.b32 {%rs16, %rs17}, %r192; 2026-02-21T09:09:53.4267644Z mov.b32 {%rs18, %rs19}, %r191; 2026-02-21T09:09:53.4267703Z mov.b32 {%rs20, %rs21}, %r190; 2026-02-21T09:09:53.4267768Z mov.b32 {%rs22, %rs23}, %r189; 2026-02-21T09:09:53.4267932Z .loc 1 62 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:62:32 2026-02-21T09:09:53.4267994Z cvt.f32.bf16 %r155, %rs14; 2026-02-21T09:09:53.4268061Z cvt.f32.bf16 %r156, %rs15; 2026-02-21T09:09:53.4268119Z cvt.f32.bf16 %r157, %rs12; 2026-02-21T09:09:53.4268198Z cvt.f32.bf16 %r158, %rs13; 2026-02-21T09:09:53.4268264Z cvt.f32.bf16 %r159, %rs10; 2026-02-21T09:09:53.4268322Z cvt.f32.bf16 %r160, %rs11; 2026-02-21T09:09:53.4268380Z cvt.f32.bf16 %r161, %rs8; 2026-02-21T09:09:53.4268438Z cvt.f32.bf16 %r162, %rs9; 2026-02-21T09:09:53.4268504Z cvt.f32.bf16 %r163, %rs22; 
2026-02-21T09:09:53.4268561Z cvt.f32.bf16 %r164, %rs23; 2026-02-21T09:09:53.4268619Z cvt.f32.bf16 %r165, %rs20; 2026-02-21T09:09:53.4268683Z cvt.f32.bf16 %r166, %rs21; 2026-02-21T09:09:53.4268740Z cvt.f32.bf16 %r167, %rs18; 2026-02-21T09:09:53.4268797Z cvt.f32.bf16 %r168, %rs19; 2026-02-21T09:09:53.4268854Z cvt.f32.bf16 %r169, %rs16; 2026-02-21T09:09:53.4268919Z cvt.f32.bf16 %r170, %rs17; 2026-02-21T09:09:53.4268994Z $L__tmp20: 2026-02-21T09:09:53.4269207Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4269274Z add.s32 %r232, %r181, %r202; 2026-02-21T09:09:53.4269331Z // begin inline asm 2026-02-21T09:09:53.4269618Z @%p30 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r232 + 0], 16, {%r155, %r156, %r157, %r158, %r159, %r160, %r161, %r162, %r163, %r164, %r165, %r166, %r167, %r168, %r169, %r170}; 2026-02-21T09:09:53.4269680Z // end inline asm 2026-02-21T09:09:53.4269739Z // begin inline asm 2026-02-21T09:09:53.4269809Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:53.4269864Z // end inline asm 2026-02-21T09:09:53.4269927Z bar.sync 0; 2026-02-21T09:09:53.4269980Z $L__tmp21: 2026-02-21T09:09:53.4270142Z .loc 1 64 87 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:87 2026-02-21T09:09:53.4270212Z ld.shared.b8 %rs24, [%r26]; 2026-02-21T09:09:53.4270277Z ld.shared.b8 %rs25, [%r26+64]; 2026-02-21T09:09:53.4270344Z ld.shared.b8 %rs26, [%r26+128]; 2026-02-21T09:09:53.4270407Z ld.shared.b8 %rs27, [%r26+192]; 2026-02-21T09:09:53.4270476Z ld.shared.b8 %rs28, [%r26+256]; 2026-02-21T09:09:53.4270539Z ld.shared.b8 %rs29, [%r26+320]; 2026-02-21T09:09:53.4270599Z ld.shared.b8 %rs30, [%r26+384]; 2026-02-21T09:09:53.4270666Z ld.shared.b8 %rs31, [%r26+448]; 2026-02-21T09:09:53.4270829Z .loc 1 67 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:67:28 2026-02-21T09:09:53.4270888Z shl.b16 %rs32, %rs24, 4; 2026-02-21T09:09:53.4270956Z shl.b16 %rs33, %rs25, 4; 2026-02-21T09:09:53.4271016Z 
shl.b16 %rs34, %rs26, 4; 2026-02-21T09:09:53.4271074Z shl.b16 %rs35, %rs27, 4; 2026-02-21T09:09:53.4271132Z shl.b16 %rs36, %rs28, 4; 2026-02-21T09:09:53.4271197Z shl.b16 %rs37, %rs29, 4; 2026-02-21T09:09:53.4271255Z shl.b16 %rs38, %rs30, 4; 2026-02-21T09:09:53.4271312Z shl.b16 %rs39, %rs31, 4; 2026-02-21T09:09:53.4271484Z .loc 1 82 58 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:82:58 2026-02-21T09:09:53.4271609Z selp.b16 %rs40, %rs32, %rs24, %p26; 2026-02-21T09:09:53.4271671Z cvt.s16.s8 %rs41, %rs40; 2026-02-21T09:09:53.4271740Z shr.s16 %rs42, %rs41, 4; 2026-02-21T09:09:53.4271812Z selp.b16 %rs43, %rs33, %rs25, %p26; 2026-02-21T09:09:53.4271873Z cvt.s16.s8 %rs44, %rs43; 2026-02-21T09:09:53.4271931Z shr.s16 %rs45, %rs44, 4; 2026-02-21T09:09:53.4272004Z selp.b16 %rs46, %rs34, %rs26, %p26; 2026-02-21T09:09:53.4272062Z cvt.s16.s8 %rs47, %rs46; 2026-02-21T09:09:53.4272150Z shr.s16 %rs48, %rs47, 4; 2026-02-21T09:09:53.4272225Z selp.b16 %rs49, %rs35, %rs27, %p26; 2026-02-21T09:09:53.4272286Z cvt.s16.s8 %rs50, %rs49; 2026-02-21T09:09:53.4272346Z shr.s16 %rs51, %rs50, 4; 2026-02-21T09:09:53.4272413Z selp.b16 %rs52, %rs36, %rs28, %p26; 2026-02-21T09:09:53.4272486Z cvt.s16.s8 %rs53, %rs52; 2026-02-21T09:09:53.4272545Z shr.s16 %rs54, %rs53, 4; 2026-02-21T09:09:53.4272611Z selp.b16 %rs55, %rs37, %rs29, %p26; 2026-02-21T09:09:53.4272683Z cvt.s16.s8 %rs56, %rs55; 2026-02-21T09:09:53.4272743Z shr.s16 %rs57, %rs56, 4; 2026-02-21T09:09:53.4272808Z selp.b16 %rs58, %rs38, %rs30, %p26; 2026-02-21T09:09:53.4272866Z cvt.s16.s8 %rs59, %rs58; 2026-02-21T09:09:53.4272956Z shr.s16 %rs60, %rs59, 4; 2026-02-21T09:09:53.4273021Z selp.b16 %rs61, %rs39, %rs31, %p26; 2026-02-21T09:09:53.4273079Z cvt.s16.s8 %rs62, %rs61; 2026-02-21T09:09:53.4273143Z shr.s16 %rs63, %rs62, 4; 2026-02-21T09:09:53.4273305Z .loc 1 87 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:87:32 2026-02-21T09:09:53.4273369Z cvt.rn.f32.s16 %r193, %rs42; 2026-02-21T09:09:53.4273437Z cvt.rn.f32.s16 
%r194, %rs45; 2026-02-21T09:09:53.4273496Z cvt.rn.f32.s16 %r195, %rs48; 2026-02-21T09:09:53.4273555Z cvt.rn.f32.s16 %r196, %rs51; 2026-02-21T09:09:53.4273613Z cvt.rn.f32.s16 %r197, %rs54; 2026-02-21T09:09:53.4273679Z cvt.rn.f32.s16 %r198, %rs57; 2026-02-21T09:09:53.4273765Z cvt.rn.f32.s16 %r199, %rs60; 2026-02-21T09:09:53.4273825Z cvt.rn.f32.s16 %r200, %rs63; 2026-02-21T09:09:53.4273892Z st.shared.b32 [%r27], %r193; 2026-02-21T09:09:53.4273951Z st.shared.b32 [%r28], %r194; 2026-02-21T09:09:53.4274010Z st.shared.b32 [%r29], %r195; 2026-02-21T09:09:53.4274070Z st.shared.b32 [%r30], %r196; 2026-02-21T09:09:53.4274136Z st.shared.b32 [%r31], %r197; 2026-02-21T09:09:53.4274194Z st.shared.b32 [%r32], %r198; 2026-02-21T09:09:53.4274252Z st.shared.b32 [%r33], %r199; 2026-02-21T09:09:53.4274315Z st.shared.b32 [%r34], %r200; 2026-02-21T09:09:53.4274369Z $L__tmp22: 2026-02-21T09:09:53.4274582Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4274640Z // begin inline asm 2026-02-21T09:09:53.4274720Z fence.proxy.async.shared::cta; 2026-02-21T09:09:53.4274775Z // end inline asm 2026-02-21T09:09:53.4274829Z bar.sync 0; 2026-02-21T09:09:53.4274899Z setp.ne.b32 %p27, %r47, 0; 2026-02-21T09:09:53.4274959Z @%p27 bra $L__BB0_4; 2026-02-21T09:09:53.4275062Z // %bb.3: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:53.4275134Z elect.sync %r213|%p29, -1; 2026-02-21T09:09:53.4275191Z mov.b32 %r203, 67635472; 2026-02-21T09:09:53.4275249Z mov.pred %p28, 0; 2026-02-21T09:09:53.4275305Z // begin inline asm 2026-02-21T09:09:53.4275467Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r344 + 0 ], [ %r202 + 0 ], %rd62, %r203, %p28; 2026-02-21T09:09:53.4275523Z // end inline asm 2026-02-21T09:09:53.4275579Z // begin inline asm 2026-02-21T09:09:53.4275731Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r344 + 0 ], [ %r202 + 8 ], %rd63, %r203, %p30; 2026-02-21T09:09:53.4275786Z // end inline asm 2026-02-21T09:09:53.4275841Z // 
begin inline asm 2026-02-21T09:09:53.4275993Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r344 + 0 ], [ %r202 + 16 ], %rd64, %r203, %p30; 2026-02-21T09:09:53.4276048Z // end inline asm 2026-02-21T09:09:53.4276103Z // begin inline asm 2026-02-21T09:09:53.4276291Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r344 + 0 ], [ %r202 + 24 ], %rd65, %r203, %p30; 2026-02-21T09:09:53.4276353Z // end inline asm 2026-02-21T09:09:53.4276414Z add.s32 %r215, %r62, 13312; 2026-02-21T09:09:53.4276475Z cvt.u64.u32 %rd66, %r215; 2026-02-21T09:09:53.4276539Z // begin inline asm 2026-02-21T09:09:53.4276661Z @%p29 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd66]; 2026-02-21T09:09:53.4276715Z // end inline asm 2026-02-21T09:09:53.4276774Z $L__tmp23: 2026-02-21T09:09:53.4276873Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:53.4277063Z .loc 1 0 0 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:0 2026-02-21T09:09:53.4277123Z cvt.s32.s16 %r42, %rs5; 2026-02-21T09:09:53.4277297Z .loc 1 58 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:32 2026-02-21T09:09:53.4277357Z add.s64 %rd67, %rd8, 128; 2026-02-21T09:09:53.4277417Z cvt.u64.u32 %rd71, %r35; 2026-02-21T09:09:53.4277485Z add.s64 %rd72, %rd7, %rd71; 2026-02-21T09:09:53.4277542Z shl.b64 %rd73, %rd72, 1; 2026-02-21T09:09:53.4277601Z add.s64 %rd74, %rd18, %rd73; 2026-02-21T09:09:53.4277667Z add.s64 %rd68, %rd74, 65536; 2026-02-21T09:09:53.4277852Z .loc 1 58 80 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:80 2026-02-21T09:09:53.4277911Z // begin inline asm 2026-02-21T09:09:53.4278026Z cp.async.cg.shared.global [ %r216 + 0 ], [ %rd67 + 0 ], 0x10, %r217; 2026-02-21T09:09:53.4278093Z // end inline asm 2026-02-21T09:09:53.4278150Z // begin inline asm 2026-02-21T09:09:53.4278263Z cp.async.cg.shared.global [ %r218 + 0 ], [ %rd68 + 0 ], 0x10, %r217; 2026-02-21T09:09:53.4278327Z // end inline asm 2026-02-21T09:09:53.4278391Z cp.async.commit_group; 2026-02-21T09:09:53.4278555Z 
.loc 1 64 34 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:34 2026-02-21T09:09:53.4278641Z add.s64 %rd69, %rd9, 262144; 2026-02-21T09:09:53.4278804Z .loc 1 64 87 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:87 2026-02-21T09:09:53.4278862Z // begin inline asm 2026-02-21T09:09:53.4278977Z cp.async.ca.shared.global [ %r220 + 0 ], [ %rd69 + 0 ], 0x4, %r147; 2026-02-21T09:09:53.4279040Z // end inline asm 2026-02-21T09:09:53.4279104Z cp.async.commit_group; 2026-02-21T09:09:53.4279275Z .loc 1 50 125 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:50:125 2026-02-21T09:09:53.4279344Z add.s32 %r225, %r38, %r43; 2026-02-21T09:09:53.4279406Z add.s32 %r226, %r225, %r44; 2026-02-21T09:09:53.4279467Z cvt.s64.s32 %rd75, %r226; 2026-02-21T09:09:53.4279530Z add.s64 %rd118, %rd19, %rd75; 2026-02-21T09:09:53.4279598Z shl.b32 %r227, %r42, 16; 2026-02-21T09:09:53.4279659Z or.b32 %r228, %r39, %r227; 2026-02-21T09:09:53.4279729Z mad.wide.s32 %rd117, %r228, 2, %rd6; 2026-02-21T09:09:53.4279793Z mov.b32 %r349, 1; 2026-02-21T09:09:53.4279852Z mov.b64 %rd119, -16; 2026-02-21T09:09:53.4279911Z mov.b32 %r348, %r346; 2026-02-21T09:09:53.4279976Z mov.b32 %r350, %r346; 2026-02-21T09:09:53.4280033Z bra.uni $L__BB0_5; 2026-02-21T09:09:53.4280130Z $L__BB0_7: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:09:53.4280304Z .loc 1 50 125 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:50:125 2026-02-21T09:09:53.4280371Z add.s64 %rd119, %rd119, 16; 2026-02-21T09:09:53.4280436Z setp.lt.u64 %p50, %rd119, 464; 2026-02-21T09:09:53.4280489Z $L__tmp24: 2026-02-21T09:09:53.4280715Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4280773Z add.s32 %r293, %r349, 1; 2026-02-21T09:09:53.4280835Z setp.gt.s32 %p51, %r293, 1; 2026-02-21T09:09:53.4280897Z selp.b32 %r349, 0, %r293, %p51; 2026-02-21T09:09:53.4280963Z selp.b32 %r294, 1, 0, %p51; 2026-02-21T09:09:53.4281022Z xor.b32 %r60, %r350, 
%r294; 2026-02-21T09:09:53.4281095Z $L__tmp25: 2026-02-21T09:09:53.4281265Z .loc 1 58 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:32 2026-02-21T09:09:53.4281327Z add.s64 %rd81, %rd117, -65536; 2026-02-21T09:09:53.4281489Z .loc 1 58 80 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:80 2026-02-21T09:09:53.4281586Z add.s32 %r287, %r56, %r12; 2026-02-21T09:09:53.4281650Z selp.b32 %r288, 16, 0, %p50; 2026-02-21T09:09:53.4281706Z // begin inline asm 2026-02-21T09:09:53.4281820Z cp.async.cg.shared.global [ %r287 + 0 ], [ %rd81 + 0 ], 0x10, %r288; 2026-02-21T09:09:53.4281913Z // end inline asm 2026-02-21T09:09:53.4281970Z add.s32 %r289, %r287, 2048; 2026-02-21T09:09:53.4282028Z // begin inline asm 2026-02-21T09:09:53.4282152Z cp.async.cg.shared.global [ %r289 + 0 ], [ %rd117 + 0 ], 0x10, %r288; 2026-02-21T09:09:53.4282208Z // end inline asm 2026-02-21T09:09:53.4282272Z cp.async.commit_group; 2026-02-21T09:09:53.4282446Z .loc 1 64 87 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:87 2026-02-21T09:09:53.4282507Z add.s32 %r291, %r57, %r17; 2026-02-21T09:09:53.4282568Z selp.b32 %r292, 4, 0, %p50; 2026-02-21T09:09:53.4282648Z // begin inline asm 2026-02-21T09:09:53.4282769Z cp.async.ca.shared.global [ %r291 + 0 ], [ %rd118 + 0 ], 0x4, %r292; 2026-02-21T09:09:53.4282825Z // end inline asm 2026-02-21T09:09:53.4282888Z cp.async.commit_group; 2026-02-21T09:09:53.4283064Z .loc 1 50 125 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:50:125 2026-02-21T09:09:53.4283130Z add.s64 %rd118, %rd118, 131072; 2026-02-21T09:09:53.4283190Z add.s64 %rd117, %rd117, 64; 2026-02-21T09:09:53.4283252Z setp.lt.u64 %p52, %rd119, 480; 2026-02-21T09:09:53.4283318Z mov.b32 %r346, %r350; 2026-02-21T09:09:53.4283375Z mov.b32 %r350, %r60; 2026-02-21T09:09:53.4283432Z @%p52 bra $L__BB0_5; 2026-02-21T09:09:53.4283496Z bra.uni $L__BB0_8; 2026-02-21T09:09:53.4283618Z $L__BB0_5: // Parent Loop BB0_2 Depth=1 2026-02-21T09:09:53.4283715Z // => This Inner 
Loop Header: Depth=2 2026-02-21T09:09:53.4283781Z add.s32 %r249, %r348, 1; 2026-02-21T09:09:53.4283842Z setp.gt.s32 %p40, %r249, 1; 2026-02-21T09:09:53.4283902Z selp.b32 %r348, 0, %r249, %p40; 2026-02-21T09:09:53.4284064Z .loc 1 58 80 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:58:80 2026-02-21T09:09:53.4284136Z cp.async.wait_group 2; 2026-02-21T09:09:53.4284191Z bar.sync 0; 2026-02-21T09:09:53.4284250Z shl.b32 %r250, %r348, 12; 2026-02-21T09:09:53.4284315Z add.s32 %r56, %r62, %r250; 2026-02-21T09:09:53.4284374Z add.s32 %r252, %r56, %r23; 2026-02-21T09:09:53.4284469Z ld.shared.v4.b32 {%r253, %r254, %r255, %r256}, [%r252]; 2026-02-21T09:09:53.4284528Z mov.b32 {%rs64, %rs65}, %r256; 2026-02-21T09:09:53.4284594Z mov.b32 {%rs66, %rs67}, %r255; 2026-02-21T09:09:53.4284655Z mov.b32 {%rs68, %rs69}, %r254; 2026-02-21T09:09:53.4284713Z mov.b32 {%rs70, %rs71}, %r253; 2026-02-21T09:09:53.4284817Z ld.shared.v4.b32 {%r257, %r258, %r259, %r260}, [%r252+16]; 2026-02-21T09:09:53.4284875Z mov.b32 {%rs72, %rs73}, %r260; 2026-02-21T09:09:53.4284935Z mov.b32 {%rs74, %rs75}, %r259; 2026-02-21T09:09:53.4285000Z mov.b32 {%rs76, %rs77}, %r258; 2026-02-21T09:09:53.4285056Z mov.b32 {%rs78, %rs79}, %r257; 2026-02-21T09:09:53.4285220Z .loc 1 62 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:62:32 2026-02-21T09:09:53.4285278Z cvt.f32.bf16 %r233, %rs70; 2026-02-21T09:09:53.4285344Z cvt.f32.bf16 %r234, %rs71; 2026-02-21T09:09:53.4285403Z cvt.f32.bf16 %r235, %rs68; 2026-02-21T09:09:53.4285460Z cvt.f32.bf16 %r236, %rs69; 2026-02-21T09:09:53.4285523Z cvt.f32.bf16 %r237, %rs66; 2026-02-21T09:09:53.4285580Z cvt.f32.bf16 %r238, %rs67; 2026-02-21T09:09:53.4285637Z cvt.f32.bf16 %r239, %rs64; 2026-02-21T09:09:53.4285694Z cvt.f32.bf16 %r240, %rs65; 2026-02-21T09:09:53.4285760Z cvt.f32.bf16 %r241, %rs78; 2026-02-21T09:09:53.4285846Z cvt.f32.bf16 %r242, %rs79; 2026-02-21T09:09:53.4285903Z cvt.f32.bf16 %r243, %rs76; 2026-02-21T09:09:53.4285968Z cvt.f32.bf16 %r244, %rs77; 
2026-02-21T09:09:53.4286027Z cvt.f32.bf16 %r245, %rs74; 2026-02-21T09:09:53.4286084Z cvt.f32.bf16 %r246, %rs75; 2026-02-21T09:09:53.4286140Z cvt.f32.bf16 %r247, %rs72; 2026-02-21T09:09:53.4286206Z cvt.f32.bf16 %r248, %rs73; 2026-02-21T09:09:53.4286258Z $L__tmp26: 2026-02-21T09:09:53.4286471Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4286560Z // begin inline asm 2026-02-21T09:09:53.4286611Z 2026-02-21T09:09:53.4286662Z { 2026-02-21T09:09:53.4286730Z .reg .pred complete; 2026-02-21T09:09:53.4286785Z waitLoop: 2026-02-21T09:09:53.4286906Z mbarrier.try_wait.parity.shared.b64 complete, [%r347], %r346; 2026-02-21T09:09:53.4286971Z @!complete bra.uni waitLoop; 2026-02-21T09:09:53.4287030Z } 2026-02-21T09:09:53.4287036Z 2026-02-21T09:09:53.4287093Z // end inline asm 2026-02-21T09:09:53.4287156Z mov.pred %p41, -1; 2026-02-21T09:09:53.4287221Z // begin inline asm 2026-02-21T09:09:53.4287537Z @%p41 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r232 + 0], 16, {%r233, %r234, %r235, %r236, %r237, %r238, %r239, %r240, %r241, %r242, %r243, %r244, %r245, %r246, %r247, %r248}; 2026-02-21T09:09:53.4287602Z // end inline asm 2026-02-21T09:09:53.4287666Z // begin inline asm 2026-02-21T09:09:53.4287736Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:53.4287793Z // end inline asm 2026-02-21T09:09:53.4287849Z bar.sync 0; 2026-02-21T09:09:53.4287910Z $L__tmp27: 2026-02-21T09:09:53.4288079Z .loc 1 64 87 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:64:87 2026-02-21T09:09:53.4288138Z shl.b32 %r261, %r348, 9; 2026-02-21T09:09:53.4288205Z add.s32 %r262, %r62, %r261; 2026-02-21T09:09:53.4288263Z add.s32 %r57, %r262, 12288; 2026-02-21T09:09:53.4288352Z add.s32 %r263, %r57, %r25; 2026-02-21T09:09:53.4288417Z ld.shared.b8 %rs80, [%r263]; 2026-02-21T09:09:53.4288488Z ld.shared.b8 %rs81, [%r263+64]; 2026-02-21T09:09:53.4288553Z ld.shared.b8 %rs82, [%r263+128]; 2026-02-21T09:09:53.4288618Z ld.shared.b8 %rs83, 
[%r263+192]; 2026-02-21T09:09:53.4288691Z ld.shared.b8 %rs84, [%r263+256]; 2026-02-21T09:09:53.4288754Z ld.shared.b8 %rs85, [%r263+320]; 2026-02-21T09:09:53.4288816Z ld.shared.b8 %rs86, [%r263+384]; 2026-02-21T09:09:53.4288878Z ld.shared.b8 %rs87, [%r263+448]; 2026-02-21T09:09:53.4289057Z .loc 1 67 28 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:67:28 2026-02-21T09:09:53.4289122Z shl.b16 %rs88, %rs80, 4; 2026-02-21T09:09:53.4289183Z shl.b16 %rs89, %rs81, 4; 2026-02-21T09:09:53.4289252Z shl.b16 %rs90, %rs82, 4; 2026-02-21T09:09:53.4289312Z shl.b16 %rs91, %rs83, 4; 2026-02-21T09:09:53.4289371Z shl.b16 %rs92, %rs84, 4; 2026-02-21T09:09:53.4289437Z shl.b16 %rs93, %rs85, 4; 2026-02-21T09:09:53.4289497Z shl.b16 %rs94, %rs86, 4; 2026-02-21T09:09:53.4289558Z shl.b16 %rs95, %rs87, 4; 2026-02-21T09:09:53.4289730Z .loc 1 82 58 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:82:58 2026-02-21T09:09:53.4289810Z selp.b16 %rs96, %rs88, %rs80, %p26; 2026-02-21T09:09:53.4289871Z cvt.s16.s8 %rs97, %rs96; 2026-02-21T09:09:53.4289930Z shr.s16 %rs98, %rs97, 4; 2026-02-21T09:09:53.4290006Z selp.b16 %rs99, %rs89, %rs81, %p26; 2026-02-21T09:09:53.4290070Z cvt.s16.s8 %rs100, %rs99; 2026-02-21T09:09:53.4290133Z shr.s16 %rs101, %rs100, 4; 2026-02-21T09:09:53.4290206Z selp.b16 %rs102, %rs90, %rs82, %p26; 2026-02-21T09:09:53.4290275Z cvt.s16.s8 %rs103, %rs102; 2026-02-21T09:09:53.4290335Z shr.s16 %rs104, %rs103, 4; 2026-02-21T09:09:53.4290403Z selp.b16 %rs105, %rs91, %rs83, %p26; 2026-02-21T09:09:53.4290472Z cvt.s16.s8 %rs106, %rs105; 2026-02-21T09:09:53.4290532Z shr.s16 %rs107, %rs106, 4; 2026-02-21T09:09:53.4290598Z selp.b16 %rs108, %rs92, %rs84, %p26; 2026-02-21T09:09:53.4290666Z cvt.s16.s8 %rs109, %rs108; 2026-02-21T09:09:53.4290751Z shr.s16 %rs110, %rs109, 4; 2026-02-21T09:09:53.4290817Z selp.b16 %rs111, %rs93, %rs85, %p26; 2026-02-21T09:09:53.4290877Z cvt.s16.s8 %rs112, %rs111; 2026-02-21T09:09:53.4290944Z shr.s16 %rs113, %rs112, 4; 2026-02-21T09:09:53.4291009Z 
selp.b16 %rs114, %rs94, %rs86, %p26; 2026-02-21T09:09:53.4291069Z cvt.s16.s8 %rs115, %rs114; 2026-02-21T09:09:53.4291135Z shr.s16 %rs116, %rs115, 4; 2026-02-21T09:09:53.4291200Z selp.b16 %rs117, %rs95, %rs87, %p26; 2026-02-21T09:09:53.4291259Z cvt.s16.s8 %rs118, %rs117; 2026-02-21T09:09:53.4291344Z shr.s16 %rs119, %rs118, 4; 2026-02-21T09:09:53.4291521Z .loc 1 87 32 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:87:32 2026-02-21T09:09:53.4291617Z cvt.rn.f32.s16 %r264, %rs98; 2026-02-21T09:09:53.4291683Z cvt.rn.f32.s16 %r265, %rs101; 2026-02-21T09:09:53.4291755Z cvt.rn.f32.s16 %r266, %rs104; 2026-02-21T09:09:53.4291818Z cvt.rn.f32.s16 %r267, %rs107; 2026-02-21T09:09:53.4291881Z cvt.rn.f32.s16 %r268, %rs110; 2026-02-21T09:09:53.4291949Z cvt.rn.f32.s16 %r269, %rs113; 2026-02-21T09:09:53.4292010Z cvt.rn.f32.s16 %r270, %rs116; 2026-02-21T09:09:53.4292070Z cvt.rn.f32.s16 %r271, %rs119; 2026-02-21T09:09:53.4292164Z st.shared.b32 [%r27], %r264; 2026-02-21T09:09:53.4292237Z st.shared.b32 [%r28], %r265; 2026-02-21T09:09:53.4292300Z st.shared.b32 [%r29], %r266; 2026-02-21T09:09:53.4292362Z st.shared.b32 [%r30], %r267; 2026-02-21T09:09:53.4292430Z st.shared.b32 [%r31], %r268; 2026-02-21T09:09:53.4292493Z st.shared.b32 [%r32], %r269; 2026-02-21T09:09:53.4292553Z st.shared.b32 [%r33], %r270; 2026-02-21T09:09:53.4292614Z st.shared.b32 [%r34], %r271; 2026-02-21T09:09:53.4292801Z .loc 1 50 125 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:50:125 2026-02-21T09:09:53.4292862Z shl.b32 %r272, %r349, 3; 2026-02-21T09:09:53.4292924Z add.s32 %r273, %r62, %r272; 2026-02-21T09:09:53.4293040Z add.s32 %r347, %r273, 13312; 2026-02-21T09:09:53.4293096Z $L__tmp28: 2026-02-21T09:09:53.4293319Z .loc 2 291 36 // standard.py:291:36 @[ cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:94:40 ] 2026-02-21T09:09:53.4293388Z // begin inline asm 2026-02-21T09:09:53.4293465Z fence.proxy.async.shared::cta; 2026-02-21T09:09:53.4293523Z // end inline asm 
2026-02-21T09:09:53.4293581Z bar.sync 0; 2026-02-21T09:09:53.4293650Z @%p27 bra $L__BB0_7; 2026-02-21T09:09:53.4293752Z // %bb.6: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:09:53.4293821Z elect.sync %r286|%p42, -1; 2026-02-21T09:09:53.4293889Z mov.b32 %r276, 67635472; 2026-02-21T09:09:53.4293948Z // begin inline asm 2026-02-21T09:09:53.4294110Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r344 + 0 ], [ %r202 + 0 ], %rd62, %r276, %p41; 2026-02-21T09:09:53.4294167Z // end inline asm 2026-02-21T09:09:53.4294234Z // begin inline asm 2026-02-21T09:09:53.4294390Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r344 + 0 ], [ %r202 + 8 ], %rd63, %r276, %p41; 2026-02-21T09:09:53.4294448Z // end inline asm 2026-02-21T09:09:53.4294515Z // begin inline asm 2026-02-21T09:09:53.4294667Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r344 + 0 ], [ %r202 + 16 ], %rd64, %r276, %p41; 2026-02-21T09:09:53.4294726Z // end inline asm 2026-02-21T09:09:53.4294792Z // begin inline asm 2026-02-21T09:09:53.4294942Z @%p42 tcgen05.mma.cta_group::1.kind::tf32 [ %r344 + 0 ], [ %r202 + 24 ], %rd65, %r276, %p41; 2026-02-21T09:09:53.4295001Z // end inline asm 2026-02-21T09:09:53.4295075Z cvt.u64.u32 %rd80, %r347; 2026-02-21T09:09:53.4295137Z // begin inline asm 2026-02-21T09:09:53.4295266Z @%p42 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd80]; 2026-02-21T09:09:53.4295326Z // end inline asm 2026-02-21T09:09:53.4295392Z bra.uni $L__BB0_7; 2026-02-21T09:09:53.4295448Z $L__tmp29: 2026-02-21T09:09:53.4295535Z $L__BB0_9: // %._crit_edge 2026-02-21T09:09:53.4295718Z .loc 1 29 4 // cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py:29:4 2026-02-21T09:09:53.4295804Z bar.sync 0; 2026-02-21T09:09:53.4295863Z // begin inline asm 2026-02-21T09:09:53.4295985Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r344, 64; 2026-02-21T09:09:53.4296050Z // end inline asm 2026-02-21T09:09:53.4296105Z ret; 2026-02-21T09:09:53.4296163Z $L__tmp30: 2026-02-21T09:09:53.4296230Z $L__func_end0: 
2026-02-21T09:09:53.4296317Z // -- End function 2026-02-21T09:09:53.4296370Z } 2026-02-21T09:09:53.4296627Z .file 1 "/tmp/torchinductor_root/q5/cq5gwsosnlj6eu4rxucyovxo7qprxbpa4aqzxjr5h7qnve4633ii.py" 2026-02-21T09:09:53.4296797Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:53.4296858Z .section .debug_abbrev 2026-02-21T09:09:53.4296909Z { 2026-02-21T09:09:53.4297005Z .b8 1 // Abbreviation Code 2026-02-21T09:09:53.4297092Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:53.4297174Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:53.4297261Z .b8 37 // DW_AT_producer 2026-02-21T09:09:53.4297357Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.4297434Z .b8 19 // DW_AT_language 2026-02-21T09:09:53.4297520Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:53.4297595Z .b8 3 // DW_AT_name 2026-02-21T09:09:53.4297671Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.4297748Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:53.4297832Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:53.4297907Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:53.4297979Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.4298146Z .b8 0 // EOM(1) 2026-02-21T09:09:53.4298220Z .b8 0 // EOM(2) 2026-02-21T09:09:53.4298304Z .b8 2 // Abbreviation Code 2026-02-21T09:09:53.4298394Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:53.4298469Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:53.4298543Z .b8 3 // DW_AT_name 2026-02-21T09:09:53.4298618Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.4298705Z .b8 32 // DW_AT_inline 2026-02-21T09:09:53.4298782Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.4298851Z .b8 0 // EOM(1) 2026-02-21T09:09:53.4298926Z .b8 0 // EOM(2) 2026-02-21T09:09:53.4299007Z .b8 3 // Abbreviation Code 2026-02-21T09:09:53.4299089Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:53.4299175Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:53.4299251Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:53.4299327Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.4299413Z .b8 18 // DW_AT_high_pc 
2026-02-21T09:09:53.4299486Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.4299574Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:53.4299647Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:53.4299724Z .b8 0 // EOM(1) 2026-02-21T09:09:53.4299791Z .b8 0 // EOM(2) 2026-02-21T09:09:53.4299871Z .b8 4 // Abbreviation Code 2026-02-21T09:09:53.4299970Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:53.4300047Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:53.4300156Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:53.4300236Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:53.4300308Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:53.4300377Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.4300452Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:53.4300529Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.4300630Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:53.4300703Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.4300785Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:53.4300857Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.4300934Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:53.4301013Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.4301082Z .b8 0 // EOM(1) 2026-02-21T09:09:53.4301150Z .b8 0 // EOM(2) 2026-02-21T09:09:53.4301239Z .b8 0 // EOM(3) 2026-02-21T09:09:53.4301299Z } 2026-02-21T09:09:53.4301360Z .section .debug_info 2026-02-21T09:09:53.4301413Z { 2026-02-21T09:09:53.4301501Z .b32 178 // Length of Unit 2026-02-21T09:09:53.4301618Z .b8 2 // DWARF version number 2026-02-21T09:09:53.4301673Z .b8 0 2026-02-21T09:09:53.4301794Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:53.4301883Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:53.4301984Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:53.4302088Z .b8 116 // DW_AT_producer 2026-02-21T09:09:53.4302154Z .b8 114 2026-02-21T09:09:53.4302208Z .b8 105 2026-02-21T09:09:53.4302260Z .b8 116 2026-02-21T09:09:53.4302319Z .b8 111 2026-02-21T09:09:53.4302370Z .b8 110 2026-02-21T09:09:53.4302422Z .b8 0 2026-02-21T09:09:53.4302497Z .b8 2 // DW_AT_language 2026-02-21T09:09:53.4302555Z .b8 0 2026-02-21T09:09:53.4302628Z .b8 99 // DW_AT_name 2026-02-21T09:09:53.4302681Z .b8 113 2026-02-21T09:09:53.4302743Z .b8 53 2026-02-21T09:09:53.4302797Z .b8 103 2026-02-21T09:09:53.4302850Z .b8 119 2026-02-21T09:09:53.4302905Z .b8 115 2026-02-21T09:09:53.4302966Z .b8 111 2026-02-21T09:09:53.4303019Z .b8 115 2026-02-21T09:09:53.4303072Z .b8 110 2026-02-21T09:09:53.4303137Z .b8 108 2026-02-21T09:09:53.4303190Z .b8 106 2026-02-21T09:09:53.4303244Z .b8 54 2026-02-21T09:09:53.4303295Z .b8 101 2026-02-21T09:09:53.4303355Z .b8 117 2026-02-21T09:09:53.4303405Z .b8 52 2026-02-21T09:09:53.4303455Z .b8 114 2026-02-21T09:09:53.4303506Z .b8 120 2026-02-21T09:09:53.4303568Z .b8 117 2026-02-21T09:09:53.4303618Z .b8 99 2026-02-21T09:09:53.4303669Z .b8 121 2026-02-21T09:09:53.4303725Z .b8 111 2026-02-21T09:09:53.4303775Z .b8 118 2026-02-21T09:09:53.4303826Z .b8 120 2026-02-21T09:09:53.4303876Z .b8 111 2026-02-21T09:09:53.4303935Z .b8 55 2026-02-21T09:09:53.4303986Z .b8 113 2026-02-21T09:09:53.4304035Z .b8 112 2026-02-21T09:09:53.4304094Z .b8 114 2026-02-21T09:09:53.4304145Z .b8 120 2026-02-21T09:09:53.4304195Z .b8 98 2026-02-21T09:09:53.4304245Z .b8 112 2026-02-21T09:09:53.4304304Z .b8 97 2026-02-21T09:09:53.4304355Z .b8 52 2026-02-21T09:09:53.4304406Z .b8 97 2026-02-21T09:09:53.4304457Z .b8 113 2026-02-21T09:09:53.4304514Z .b8 122 2026-02-21T09:09:53.4304564Z .b8 120 2026-02-21T09:09:53.4304614Z .b8 106 2026-02-21T09:09:53.4304674Z .b8 114 2026-02-21T09:09:53.4304723Z .b8 53 
2026-02-21T09:09:53.4304774Z .b8 104 2026-02-21T09:09:53.4304823Z .b8 55 2026-02-21T09:09:53.4304881Z .b8 113 2026-02-21T09:09:53.4304931Z .b8 110 2026-02-21T09:09:53.4304982Z .b8 118 2026-02-21T09:09:53.4305069Z .b8 101 2026-02-21T09:09:53.4305119Z .b8 52 2026-02-21T09:09:53.4305170Z .b8 54 2026-02-21T09:09:53.4305219Z .b8 51 2026-02-21T09:09:53.4305276Z .b8 51 2026-02-21T09:09:53.4305326Z .b8 105 2026-02-21T09:09:53.4305376Z .b8 105 2026-02-21T09:09:53.4305434Z .b8 46 2026-02-21T09:09:53.4305484Z .b8 112 2026-02-21T09:09:53.4305535Z .b8 121 2026-02-21T09:09:53.4305586Z .b8 0 2026-02-21T09:09:53.4305684Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:53.4305758Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:53.4305848Z .b8 116 2026-02-21T09:09:53.4305905Z .b8 109 2026-02-21T09:09:53.4305955Z .b8 112 2026-02-21T09:09:53.4306005Z .b8 47 2026-02-21T09:09:53.4306055Z .b8 116 2026-02-21T09:09:53.4306111Z .b8 111 2026-02-21T09:09:53.4306159Z .b8 114 2026-02-21T09:09:53.4306208Z .b8 99 2026-02-21T09:09:53.4306258Z .b8 104 2026-02-21T09:09:53.4306314Z .b8 105 2026-02-21T09:09:53.4306363Z .b8 110 2026-02-21T09:09:53.4306414Z .b8 100 2026-02-21T09:09:53.4306471Z .b8 117 2026-02-21T09:09:53.4306522Z .b8 99 2026-02-21T09:09:53.4306572Z .b8 116 2026-02-21T09:09:53.4306621Z .b8 111 2026-02-21T09:09:53.4306678Z .b8 114 2026-02-21T09:09:53.4306728Z .b8 95 2026-02-21T09:09:53.4306778Z .b8 114 2026-02-21T09:09:53.4306863Z .b8 111 2026-02-21T09:09:53.4306915Z .b8 111 2026-02-21T09:09:53.4306966Z .b8 116 2026-02-21T09:09:53.4307017Z .b8 47 2026-02-21T09:09:53.4307076Z .b8 113 2026-02-21T09:09:53.4307126Z .b8 53 2026-02-21T09:09:53.4307177Z .b8 0 2026-02-21T09:09:53.4307281Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:53.4307354Z .b8 95 // DW_AT_name 2026-02-21T09:09:53.4307404Z .b8 104 2026-02-21T09:09:53.4307454Z .b8 101 2026-02-21T09:09:53.4307511Z .b8 108 2026-02-21T09:09:53.4307561Z .b8 105 2026-02-21T09:09:53.4307611Z .b8 111 2026-02-21T09:09:53.4307667Z 
.b8 110 2026-02-21T09:09:53.4307717Z .b8 95 2026-02-21T09:09:53.4307769Z .b8 109 2026-02-21T09:09:53.4307842Z .b8 97 2026-02-21T09:09:53.4307904Z .b8 116 2026-02-21T09:09:53.4307954Z .b8 109 2026-02-21T09:09:53.4308004Z .b8 117 2026-02-21T09:09:53.4308054Z .b8 108 2026-02-21T09:09:53.4308114Z .b8 95 2026-02-21T09:09:53.4308165Z .b8 98 2026-02-21T09:09:53.4308217Z .b8 102 2026-02-21T09:09:53.4308275Z .b8 49 2026-02-21T09:09:53.4308327Z .b8 54 2026-02-21T09:09:53.4308378Z .b8 95 2026-02-21T09:09:53.4308429Z .b8 105 2026-02-21T09:09:53.4308489Z .b8 110 2026-02-21T09:09:53.4308541Z .b8 116 2026-02-21T09:09:53.4308594Z .b8 52 2026-02-21T09:09:53.4308656Z .b8 0 2026-02-21T09:09:53.4308730Z .b8 1 // DW_AT_inline 2026-02-21T09:09:53.4308825Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:53.4308912Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:53.4309006Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:53.4309096Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:53.4309210Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:53.4309305Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:53.4309386Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:53.4309471Z .b64 $L__tmp29 // DW_AT_high_pc 2026-02-21T09:09:53.4309552Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:53.4309628Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:53.4309708Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:53.4309796Z .b8 0 // End Of Children Mark 2026-02-21T09:09:53.4309877Z .b8 0 // End Of Children Mark 2026-02-21T09:09:53.4309926Z } 2026-02-21T09:09:53.4309991Z .section .debug_macinfo { } 2026-02-21T09:09:53.4309995Z 2026-02-21T09:09:53.4310080Z ================================================================ 2026-02-21T09:09:53.4310204Z please share the reproducer above with Triton project. 2026-02-21T09:09:53.5897345Z [180s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:53.5897359Z 2026-02-21T09:09:53.5898430Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 64], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:53.5898765Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:53.5898835Z `ptxas` stderr: 2026-02-21T09:09:53.5899188Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 285 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:53.5899283Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:53.5899288Z 2026-02-21T09:09:53.5899743Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpz_fyxg0u.ptx -o /tmp/tmpz_fyxg0u.ptx.o 2026-02-21T09:09:53.5899749Z 2026-02-21T09:09:53.5899880Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:53.5899916Z 2026-02-21T09:09:53.5903461Z 2026-02-21T09:09:53.5908512Z ================================================================ 2026-02-21T09:09:53.5910104Z Internal Triton PTX codegen error 2026-02-21T09:09:53.5910212Z `ptxas` stderr: 2026-02-21T09:09:53.5914728Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 285 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:53.5919177Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:53.5923189Z 2026-02-21T09:09:53.5925310Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpz_fyxg0u.ptx -o /tmp/tmpz_fyxg0u.ptx.o 2026-02-21T09:09:53.5925359Z 2026-02-21T09:09:53.5928555Z 2026-02-21T09:09:53.5928677Z // 2026-02-21T09:09:53.5928869Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:53.5929063Z // 2026-02-21T09:09:53.5929145Z 2026-02-21T09:09:53.5929205Z .version 8.7 2026-02-21T09:09:53.5929359Z .target sm_100a 2026-02-21T09:09:53.5929505Z .address_size 64 2026-02-21T09:09:53.5929586Z 2026-02-21T09:09:53.5929755Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:53.5930039Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:53.5930273Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:53.5930493Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:53.5930756Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:53.5931040Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:53.5931354Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:53.5936674Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:53.5936974Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:53.5937207Z ) 2026-02-21T09:09:53.5937341Z .reqntid 128 2026-02-21T09:09:53.5937483Z .maxnreg 32 2026-02-21T09:09:53.5937613Z { 2026-02-21T09:09:53.5937740Z .reg .pred %p<99>; 2026-02-21T09:09:53.5937898Z .reg .b16 %rs<193>; 2026-02-21T09:09:53.5938041Z .reg .b32 %r<450>; 2026-02-21T09:09:53.5938185Z .reg .b64 %rd<156>; 2026-02-21T09:09:53.5938324Z $L__func_begin0: 2026-02-21T09:09:53.5938418Z 2026-02-21T09:09:53.5938475Z // %bb.0: 2026-02-21T09:09:53.5938721Z .loc 1 19 0 // 
cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:19 2026-02-21T09:09:53.5939165Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:53.5939360Z ld.param.b64 %rd16, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:53.5939589Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:53.5939797Z ld.param.b64 %rd34, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:53.5940011Z mov.b32 %r47, global_smem; 2026-02-21T09:09:53.5940178Z // begin inline asm 2026-02-21T09:09:53.5940407Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r47], 128; 2026-02-21T09:09:53.5940703Z // end inline asm 2026-02-21T09:09:53.5940879Z ld.param.b64 %rd51, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:53.5941083Z bar.sync 0; 2026-02-21T09:09:53.5941234Z ld.shared.b32 %r443, [global_smem]; 2026-02-21T09:09:53.5941403Z bar.sync 0; 2026-02-21T09:09:53.5941617Z // begin inline asm 2026-02-21T09:09:53.5941825Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:53.5942061Z // end inline asm 2026-02-21T09:09:53.5942314Z .loc 1 21 66 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:21:66 2026-02-21T09:09:53.5942611Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:09:53.5942799Z mov.u32 %r64, %ctaid.y; 2026-02-21T09:09:53.5942960Z mov.u32 %r65, %ctaid.z; 2026-02-21T09:09:53.5943115Z mov.u32 %r66, %nctaid.x; 2026-02-21T09:09:53.5943265Z mov.u32 %r67, %nctaid.y; 2026-02-21T09:09:53.5943428Z mad.lo.s32 %r68, %r65, %r67, %r64; 2026-02-21T09:09:53.5943605Z mad.lo.s32 %r69, %r68, %r66, %r3; 2026-02-21T09:09:53.5943784Z shl.b32 %r70, %r69, 8; 2026-02-21T09:09:53.5943934Z cvt.s64.s32 %rd52, %r70; 2026-02-21T09:09:53.5944096Z add.s64 %rd30, %rd51, %rd52; 2026-02-21T09:09:53.5944253Z shl.b32 %r71, %r1, 2; 2026-02-21T09:09:53.5944411Z add.s32 %r48, %r47, %r71; 2026-02-21T09:09:53.5944563Z mov.b32 %r57, 0; 2026-02-21T09:09:53.5944923Z // begin inline asm 2026-02-21T09:09:53.5945087Z @%p1 st.shared.b32 [ %r48 + 0 ], %r57; 2026-02-21T09:09:53.5945297Z // end inline asm 
2026-02-21T09:09:53.5945449Z bar.warp.sync -1; 2026-02-21T09:09:53.5945599Z setp.eq.b32 %p91, %r1, 0; 2026-02-21T09:09:53.5945761Z cvt.u64.u32 %rd15, %r47; 2026-02-21T09:09:53.5945909Z // begin inline asm 2026-02-21T09:09:53.5946171Z @%p91 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd16; 2026-02-21T09:09:53.5946449Z // end inline asm 2026-02-21T09:09:53.5946592Z // begin inline asm 2026-02-21T09:09:53.5946825Z @%p91 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T09:09:53.5947077Z // end inline asm 2026-02-21T09:09:53.5947218Z mov.b32 %r50, 64; 2026-02-21T09:09:53.5947358Z // begin inline asm 2026-02-21T09:09:53.5947602Z @%p91 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r50; 2026-02-21T09:09:53.5947865Z // end inline asm 2026-02-21T09:09:53.5948005Z mov.b32 %r111, 16; 2026-02-21T09:09:53.5948142Z // begin inline asm 2026-02-21T09:09:53.5948381Z @%p91 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r111; 2026-02-21T09:09:53.5948653Z // end inline asm 2026-02-21T09:09:53.5948786Z mov.b32 %r52, 8192; 2026-02-21T09:09:53.5948931Z // begin inline asm 2026-02-21T09:09:53.5949168Z @%p91 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r52; 2026-02-21T09:09:53.5949440Z // end inline asm 2026-02-21T09:09:53.5949571Z mov.b32 %r53, 512; 2026-02-21T09:09:53.5949711Z // begin inline asm 2026-02-21T09:09:53.5949942Z @%p91 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r53; 2026-02-21T09:09:53.5950213Z // end inline asm 2026-02-21T09:09:53.5950352Z mov.b64 %rd23, 8192; 2026-02-21T09:09:53.5950492Z // begin inline asm 2026-02-21T09:09:53.5950747Z @%p91 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd23; 2026-02-21T09:09:53.5951024Z // end inline asm 2026-02-21T09:09:53.5951161Z mov.b32 %r54, 1; 2026-02-21T09:09:53.5951293Z // begin inline asm 2026-02-21T09:09:53.5951628Z 
@%p91 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r54; 2026-02-21T09:09:53.5951919Z // end inline asm 2026-02-21T09:09:53.5952054Z // begin inline asm 2026-02-21T09:09:53.5952312Z @%p91 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r54; 2026-02-21T09:09:53.5952590Z // end inline asm 2026-02-21T09:09:53.5952730Z // begin inline asm 2026-02-21T09:09:53.5952961Z @%p91 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:53.5953258Z // end inline asm 2026-02-21T09:09:53.5953393Z // begin inline asm 2026-02-21T09:09:53.5953657Z @%p91 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:53.5953948Z // end inline asm 2026-02-21T09:09:53.5954081Z // begin inline asm 2026-02-21T09:09:53.5954325Z @%p91 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x2; 2026-02-21T09:09:53.5954601Z // end inline asm 2026-02-21T09:09:53.5954748Z // begin inline asm 2026-02-21T09:09:53.5954976Z @%p91 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:53.5955245Z // end inline asm 2026-02-21T09:09:53.5955385Z // begin inline asm 2026-02-21T09:09:53.5955754Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd30 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T09:09:53.5956133Z // end inline asm 2026-02-21T09:09:53.5956266Z // begin inline asm 2026-02-21T09:09:53.5956480Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd30 + 0 ], 0x80; 2026-02-21T09:09:53.5956729Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:53.5956928Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:53.5957110Z // end inline asm 2026-02-21T09:09:53.5957242Z bar.sync 0; 2026-02-21T09:09:53.5957387Z cvta.global.u64 %rd71, %rd30; 2026-02-21T09:09:53.5957695Z .loc 1 23 67 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:23:67 
2026-02-21T09:09:53.5957993Z add.s64 %rd48, %rd30, 128; 2026-02-21T09:09:53.5958146Z bar.sync 0; 2026-02-21T09:09:53.5958283Z // begin inline asm 2026-02-21T09:09:53.5958434Z @%p1 st.shared.b32 [ %r48 + 0 ], %r57; 2026-02-21T09:09:53.5958615Z // end inline asm 2026-02-21T09:09:53.5958753Z bar.warp.sync -1; 2026-02-21T09:09:53.5958901Z // begin inline asm 2026-02-21T09:09:53.5959151Z @%p91 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd34; 2026-02-21T09:09:53.5959424Z // end inline asm 2026-02-21T09:09:53.5959568Z // begin inline asm 2026-02-21T09:09:53.5959789Z @%p91 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T09:09:53.5960047Z // end inline asm 2026-02-21T09:09:53.5960180Z // begin inline asm 2026-02-21T09:09:53.5960418Z @%p91 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r50; 2026-02-21T09:09:53.5960686Z // end inline asm 2026-02-21T09:09:53.5960818Z // begin inline asm 2026-02-21T09:09:53.5961054Z @%p91 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r50; 2026-02-21T09:09:53.5961313Z // end inline asm 2026-02-21T09:09:53.5961451Z // begin inline asm 2026-02-21T09:09:53.5961723Z @%p91 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r52; 2026-02-21T09:09:53.5961998Z // end inline asm 2026-02-21T09:09:53.5962132Z mov.b32 %r61, 4096; 2026-02-21T09:09:53.5962282Z // begin inline asm 2026-02-21T09:09:53.5962525Z @%p91 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r61; 2026-02-21T09:09:53.5962794Z // end inline asm 2026-02-21T09:09:53.5962937Z mov.b64 %rd41, 16384; 2026-02-21T09:09:53.5963084Z // begin inline asm 2026-02-21T09:09:53.5963340Z @%p91 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd41; 2026-02-21T09:09:53.5963622Z // end inline asm 2026-02-21T09:09:53.5963764Z // begin inline asm 2026-02-21T09:09:53.5964027Z @%p91 
tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r54; 2026-02-21T09:09:53.5964340Z // end inline asm 2026-02-21T09:09:53.5964480Z // begin inline asm 2026-02-21T09:09:53.5964730Z @%p91 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r54; 2026-02-21T09:09:53.5965017Z // end inline asm 2026-02-21T09:09:53.5965151Z // begin inline asm 2026-02-21T09:09:53.5965392Z @%p91 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0xa; 2026-02-21T09:09:53.5965665Z // end inline asm 2026-02-21T09:09:53.5965799Z // begin inline asm 2026-02-21T09:09:53.5966082Z @%p91 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:53.5966362Z // end inline asm 2026-02-21T09:09:53.5966507Z // begin inline asm 2026-02-21T09:09:53.5966748Z @%p91 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x3; 2026-02-21T09:09:53.5967030Z // end inline asm 2026-02-21T09:09:53.5967168Z // begin inline asm 2026-02-21T09:09:53.5967406Z @%p91 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:53.5967677Z // end inline asm 2026-02-21T09:09:53.5967815Z // begin inline asm 2026-02-21T09:09:53.5968187Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd48 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T09:09:53.5968555Z // end inline asm 2026-02-21T09:09:53.5968695Z // begin inline asm 2026-02-21T09:09:53.5968900Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd48 + 0 ], 0x80; 2026-02-21T09:09:53.5969158Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:53.5969351Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:53.5969524Z // end inline asm 2026-02-21T09:09:53.5969666Z bar.sync 0; 2026-02-21T09:09:53.5969806Z cvta.global.u64 %rd89, %rd48; 2026-02-21T09:09:53.5970091Z .loc 1 31 88 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:31:88 2026-02-21T09:09:53.5970406Z 
setp.gt.u32 %p39, %r3, 8191; 2026-02-21T09:09:53.5970583Z @%p39 bra $L__BB0_8; 2026-02-21T09:09:53.5970744Z // %bb.1: // %.lr.ph 2026-02-21T09:09:53.5971049Z .loc 1 0 88 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:0:88 2026-02-21T09:09:53.5971387Z ld.param.b64 %rd14, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:53.5971625Z and.b32 %r4, %r1, 64; 2026-02-21T09:09:53.5971886Z .loc 1 81 38 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:81:38 2026-02-21T09:09:53.5972172Z setp.eq.b32 %p51, %r4, 0; 2026-02-21T09:09:53.5972439Z .loc 1 57 38 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:57:38 2026-02-21T09:09:53.5972720Z shl.b32 %r147, %r1, 3; 2026-02-21T09:09:53.5972879Z and.b32 %r148, %r147, 24; 2026-02-21T09:09:53.5973145Z .loc 1 44 45 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:44:45 2026-02-21T09:09:53.5973426Z bfe.u32 %r5, %r1, 2, 5; 2026-02-21T09:09:53.5973596Z shr.u32 %r149, %r1, 5; 2026-02-21T09:09:53.5973748Z shl.b32 %r150, %r1, 4; 2026-02-21T09:09:53.5973909Z and.b32 %r6, %r150, 2032; 2026-02-21T09:09:53.5974072Z add.s32 %r152, %r47, 8192; 2026-02-21T09:09:53.5974245Z add.s32 %r245, %r152, %r6; 2026-02-21T09:09:53.5974409Z add.s32 %r247, %r245, 2048; 2026-02-21T09:09:53.5974582Z add.s32 %r231, %r443, 64; 2026-02-21T09:09:53.5974747Z and.b32 %r153, %r1, 15; 2026-02-21T09:09:53.5974905Z shl.b32 %r154, %r153, 6; 2026-02-21T09:09:53.5975071Z and.b32 %r155, %r1, 96; 2026-02-21T09:09:53.5975226Z shl.b32 %r156, %r155, 5; 2026-02-21T09:09:53.5975391Z and.b32 %r157, %r1, 16; 2026-02-21T09:09:53.5975545Z shl.b32 %r158, %r157, 1; 2026-02-21T09:09:53.5975713Z or.b32 %r159, %r154, %r156; 2026-02-21T09:09:53.5975879Z or.b32 %r10, %r159, %r158; 2026-02-21T09:09:53.5976054Z and.b32 %r11, %r1, 63; 2026-02-21T09:09:53.5976220Z shl.b32 %r160, %r11, 7; 2026-02-21T09:09:53.5976383Z and.b32 %r161, %r150, 112; 2026-02-21T09:09:53.5976552Z shr.u32 %r162, %r4, 4; 2026-02-21T09:09:53.5976742Z or.b32 %r163, %r160, %r162; 
2026-02-21T09:09:53.5976911Z or.b32 %r164, %r163, %r161; 2026-02-21T09:09:53.5977070Z add.s32 %r12, %r47, %r164; 2026-02-21T09:09:53.5977240Z xor.b32 %r165, %r164, 16; 2026-02-21T09:09:53.5977400Z add.s32 %r13, %r47, %r165; 2026-02-21T09:09:53.5977569Z xor.b32 %r166, %r164, 32; 2026-02-21T09:09:53.5977725Z add.s32 %r14, %r47, %r166; 2026-02-21T09:09:53.5977893Z xor.b32 %r167, %r164, 48; 2026-02-21T09:09:53.5978051Z add.s32 %r15, %r47, %r167; 2026-02-21T09:09:53.5978272Z xor.b32 %r168, %r164, 64; 2026-02-21T09:09:53.5978440Z add.s32 %r16, %r47, %r168; 2026-02-21T09:09:53.5978599Z xor.b32 %r169, %r164, 80; 2026-02-21T09:09:53.5978765Z add.s32 %r17, %r47, %r169; 2026-02-21T09:09:53.5978923Z xor.b32 %r170, %r164, 96; 2026-02-21T09:09:53.5979089Z add.s32 %r18, %r47, %r170; 2026-02-21T09:09:53.5979249Z xor.b32 %r171, %r164, 112; 2026-02-21T09:09:53.5979418Z add.s32 %r19, %r47, %r171; 2026-02-21T09:09:53.5979585Z bfe.u32 %r172, %r47, 4, 14; 2026-02-21T09:09:53.5979757Z cvt.u64.u32 %rd59, %r172; 2026-02-21T09:09:53.5979934Z or.b64 %rd64, %rd59, 4611686293338849280; 2026-02-21T09:09:53.5980131Z add.s32 %r173, %r47, 32; 2026-02-21T09:09:53.5980327Z bfe.u32 %r174, %r173, 4, 14; 2026-02-21T09:09:53.5980494Z cvt.u64.u32 %rd60, %r174; 2026-02-21T09:09:53.5980671Z or.b64 %rd65, %rd60, 4611686293338849280; 2026-02-21T09:09:53.5980857Z add.s32 %r175, %r47, 64; 2026-02-21T09:09:53.5981021Z bfe.u32 %r176, %r175, 4, 14; 2026-02-21T09:09:53.5981185Z cvt.u64.u32 %rd61, %r176; 2026-02-21T09:09:53.5981357Z or.b64 %rd66, %rd61, 4611686293338849280; 2026-02-21T09:09:53.5981570Z add.s32 %r177, %r47, 96; 2026-02-21T09:09:53.5981736Z bfe.u32 %r178, %r177, 4, 14; 2026-02-21T09:09:53.5981906Z cvt.u64.u32 %rd62, %r178; 2026-02-21T09:09:53.5982070Z or.b64 %rd67, %rd62, 4611686293338849280; 2026-02-21T09:09:53.5982263Z or.b32 %r20, %r148, 64; 2026-02-21T09:09:53.5982445Z shl.b32 %r179, %r153, 7; 2026-02-21T09:09:53.5982613Z shl.b32 %r180, %r155, 6; 2026-02-21T09:09:53.5982765Z shl.b32 %r181, %r157, 
2; 2026-02-21T09:09:53.5982929Z or.b32 %r182, %r180, %r161; 2026-02-21T09:09:53.5983094Z xor.b32 %r183, %r182, %r181; 2026-02-21T09:09:53.5983268Z or.b32 %r184, %r183, %r179; 2026-02-21T09:09:53.5983430Z xor.b32 %r185, %r184, 16; 2026-02-21T09:09:53.5983593Z xor.b32 %r186, %r184, 32; 2026-02-21T09:09:53.5983755Z xor.b32 %r187, %r184, 48; 2026-02-21T09:09:53.5983923Z add.s32 %r188, %r152, %r10; 2026-02-21T09:09:53.5984085Z xor.b32 %r25, %r11, 48; 2026-02-21T09:09:53.5984235Z add.s32 %r115, %r47, 16384; 2026-02-21T09:09:53.5984398Z add.s32 %r189, %r115, %r25; 2026-02-21T09:09:53.5984553Z xor.b32 %r26, %r11, 32; 2026-02-21T09:09:53.5984710Z add.s32 %r190, %r115, %r26; 2026-02-21T09:09:53.5984859Z xor.b32 %r27, %r11, 16; 2026-02-21T09:09:53.5985013Z add.s32 %r191, %r115, %r27; 2026-02-21T09:09:53.5985163Z add.s32 %r192, %r115, %r11; 2026-02-21T09:09:53.5985326Z add.s32 %r193, %r47, %r6; 2026-02-21T09:09:53.5985489Z add.s32 %r119, %r193, 12288; 2026-02-21T09:09:53.5985644Z add.s32 %r121, %r193, 14336; 2026-02-21T09:09:53.5985912Z .loc 1 42 27 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:42:27 2026-02-21T09:09:53.5986194Z shl.b32 %r194, %r3, 6; 2026-02-21T09:09:53.5986350Z and.b32 %r195, %r194, 960; 2026-02-21T09:09:53.5986503Z and.b32 %r196, %r3, 7168; 2026-02-21T09:09:53.5986660Z or.b32 %r391, %r195, %r196; 2026-02-21T09:09:53.5986916Z .loc 1 43 27 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:43:27 2026-02-21T09:09:53.5987208Z shl.b32 %r197, %r3, 2; 2026-02-21T09:09:53.5987362Z and.b32 %r392, %r197, 4032; 2026-02-21T09:09:53.5987619Z .loc 1 44 32 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:44:32 2026-02-21T09:09:53.5987902Z or.b32 %r198, %r392, %r5; 2026-02-21T09:09:53.5988157Z .loc 1 58 53 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:53 2026-02-21T09:09:53.5988470Z shl.b32 %r199, %r198, 10; 2026-02-21T09:09:53.5988619Z $L__tmp0: 2026-02-21T09:09:53.5988917Z .loc 2 291 36 // standard.py:291:36 @[ 
cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.5989271Z shfl.sync.idx.b32 %r30, %r149, 0, 31, -1; 2026-02-21T09:09:53.5989452Z shl.b32 %r200, %r30, 21; 2026-02-21T09:09:53.5989615Z and.b32 %r201, %r200, 6291456; 2026-02-21T09:09:53.5989775Z add.s32 %r390, %r201, %r443; 2026-02-21T09:09:53.5989966Z mov.pred %p58, -1; 2026-02-21T09:09:53.5990114Z // begin inline asm 2026-02-21T09:09:53.5990472Z @%p58 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r390 + 0], 32, {%r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57}; 2026-02-21T09:09:53.5990858Z // end inline asm 2026-02-21T09:09:53.5990999Z // begin inline asm 2026-02-21T09:09:53.5991352Z @%p58 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r390 + 16], 32, {%r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57, %r57}; 2026-02-21T09:09:53.5991749Z // end inline asm 2026-02-21T09:09:53.5991891Z // begin inline asm 2026-02-21T09:09:53.5992066Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:53.5992240Z // end inline asm 2026-02-21T09:09:53.5992373Z bar.sync 0; 2026-02-21T09:09:53.5992509Z $L__tmp1: 2026-02-21T09:09:53.5992755Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.5993045Z add.s32 %r445, %r47, 18432; 2026-02-21T09:09:53.5993210Z // begin inline asm 2026-02-21T09:09:53.5993377Z @%p91 mbarrier.init.shared::cta.b64 [%r445], 1; 2026-02-21T09:09:53.5993577Z // end inline asm 2026-02-21T09:09:53.5993710Z bar.sync 0; 2026-02-21T09:09:53.5993848Z add.s32 %r107, %r47, 18440; 2026-02-21T09:09:53.5994002Z // begin inline asm 2026-02-21T09:09:53.5994176Z @%p91 mbarrier.init.shared::cta.b64 [%r107], 1; 2026-02-21T09:09:53.5994410Z // end inline asm 2026-02-21T09:09:53.5994555Z add.s32 %r249, %r47, 18448; 2026-02-21T09:09:53.5994719Z // begin inline asm 2026-02-21T09:09:53.5994884Z @%p91 mbarrier.init.shared::cta.b64 [%r249], 1; 2026-02-21T09:09:53.5995084Z // end inline 
asm 2026-02-21T09:09:53.5995218Z bar.sync 0; 2026-02-21T09:09:53.5995359Z add.s32 %r109, %r47, 18456; 2026-02-21T09:09:53.5995509Z // begin inline asm 2026-02-21T09:09:53.5995677Z @%p91 mbarrier.init.shared::cta.b64 [%r109], 1; 2026-02-21T09:09:53.5995862Z // end inline asm 2026-02-21T09:09:53.5996118Z .loc 1 58 60 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:60 2026-02-21T09:09:53.5996412Z or.b32 %r202, %r199, %r148; 2026-02-21T09:09:53.5996669Z .loc 1 58 32 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:32 2026-02-21T09:09:53.5996966Z mad.wide.u32 %rd53, %r202, 2, %rd14; 2026-02-21T09:09:53.5997145Z cvt.u64.u32 %rd7, %r199; 2026-02-21T09:09:53.5997310Z add.s64 %rd54, %rd53, 65536; 2026-02-21T09:09:53.5997569Z .loc 1 58 80 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:80 2026-02-21T09:09:53.5997855Z // begin inline asm 2026-02-21T09:09:53.5998066Z cp.async.cg.shared.global [ %r245 + 0 ], [ %rd53 + 0 ], 0x10, %r111; 2026-02-21T09:09:53.5998295Z // end inline asm 2026-02-21T09:09:53.5998437Z // begin inline asm 2026-02-21T09:09:53.5998636Z cp.async.cg.shared.global [ %r247 + 0 ], [ %rd54 + 0 ], 0x10, %r111; 2026-02-21T09:09:53.5998865Z // end inline asm 2026-02-21T09:09:53.5999010Z cp.async.commit_group; 2026-02-21T09:09:53.5999270Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.5999543Z bar.sync 0; 2026-02-21T09:09:53.5999678Z // begin inline asm 2026-02-21T09:09:53.5999875Z @%p91 mbarrier.arrive.expect_tx.shared.b64 _, [%r249], 1024; 2026-02-21T09:09:53.6000090Z // end inline asm 2026-02-21T09:09:53.6000337Z .loc 1 64 33 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:64:33 2026-02-21T09:09:53.6000636Z bar.sync 0; 2026-02-21T09:09:53.6000780Z elect.sync %r203|%p53, -1; 2026-02-21T09:09:53.6000944Z and.pred %p47, %p1, %p53; 2026-02-21T09:09:53.6001106Z // begin inline asm 2026-02-21T09:09:53.6001429Z @%p47 
cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r115], [%rd71, {%r391, %r57}], [%r249]; 2026-02-21T09:09:53.6001808Z // end inline asm 2026-02-21T09:09:53.6002051Z .loc 1 58 32 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:32 2026-02-21T09:09:53.6002361Z add.s64 %rd56, %rd53, 64; 2026-02-21T09:09:53.6002525Z or.b32 %r204, %r202, 32; 2026-02-21T09:09:53.6002685Z mad.wide.u32 %rd63, %r204, 2, %rd14; 2026-02-21T09:09:53.6002867Z add.s64 %rd57, %rd63, 65536; 2026-02-21T09:09:53.6003132Z .loc 1 58 80 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:80 2026-02-21T09:09:53.6003421Z // begin inline asm 2026-02-21T09:09:53.6003627Z cp.async.cg.shared.global [ %r119 + 0 ], [ %rd56 + 0 ], 0x10, %r111; 2026-02-21T09:09:53.6003854Z // end inline asm 2026-02-21T09:09:53.6003998Z // begin inline asm 2026-02-21T09:09:53.6004223Z cp.async.cg.shared.global [ %r121 + 0 ], [ %rd57 + 0 ], 0x10, %r111; 2026-02-21T09:09:53.6004453Z // end inline asm 2026-02-21T09:09:53.6004595Z cp.async.commit_group; 2026-02-21T09:09:53.6004858Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6005132Z bar.sync 0; 2026-02-21T09:09:53.6005272Z // begin inline asm 2026-02-21T09:09:53.6005473Z @%p91 mbarrier.arrive.expect_tx.shared.b64 _, [%r109], 1024; 2026-02-21T09:09:53.6005693Z // end inline asm 2026-02-21T09:09:53.6005945Z .loc 1 64 33 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:64:33 2026-02-21T09:09:53.6006219Z bar.sync 0; 2026-02-21T09:09:53.6006366Z elect.sync %r205|%p54, -1; 2026-02-21T09:09:53.6006568Z and.pred %p49, %p1, %p54; 2026-02-21T09:09:53.6006736Z add.s32 %r124, %r47, 17408; 2026-02-21T09:09:53.6006890Z // begin inline asm 2026-02-21T09:09:53.6007223Z @%p49 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r124], [%rd71, {%r391, %r111}], [%r109]; 2026-02-21T09:09:53.6007578Z // end inline asm 2026-02-21T09:09:53.6007822Z .loc 1 58 80 // 
cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:80 2026-02-21T09:09:53.6008120Z cp.async.wait_group 1; 2026-02-21T09:09:53.6008271Z bar.sync 0; 2026-02-21T09:09:53.6008448Z ld.shared.v4.b32 {%r206, %r207, %r208, %r209}, [%r188]; 2026-02-21T09:09:53.6008657Z mov.b32 {%rs1, %rs2}, %r209; 2026-02-21T09:09:53.6008824Z mov.b32 {%rs3, %rs4}, %r208; 2026-02-21T09:09:53.6008986Z mov.b32 {%rs5, %rs6}, %r207; 2026-02-21T09:09:53.6009140Z mov.b32 {%rs7, %rs8}, %r206; 2026-02-21T09:09:53.6009339Z ld.shared.v4.b32 {%r210, %r211, %r212, %r213}, [%r188+16]; 2026-02-21T09:09:53.6009550Z mov.b32 {%rs9, %rs10}, %r213; 2026-02-21T09:09:53.6009725Z mov.b32 {%rs11, %rs12}, %r212; 2026-02-21T09:09:53.6009890Z mov.b32 {%rs13, %rs14}, %r211; 2026-02-21T09:09:53.6010056Z mov.b32 {%rs15, %rs16}, %r210; 2026-02-21T09:09:53.6010324Z .loc 1 62 32 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:62:32 2026-02-21T09:09:53.6010621Z cvt.f32.bf16 %r129, %rs7; 2026-02-21T09:09:53.6010784Z cvt.f32.bf16 %r130, %rs8; 2026-02-21T09:09:53.6010937Z cvt.f32.bf16 %r131, %rs5; 2026-02-21T09:09:53.6011092Z cvt.f32.bf16 %r132, %rs6; 2026-02-21T09:09:53.6011241Z cvt.f32.bf16 %r133, %rs3; 2026-02-21T09:09:53.6011396Z cvt.f32.bf16 %r134, %rs4; 2026-02-21T09:09:53.6011571Z cvt.f32.bf16 %r135, %rs1; 2026-02-21T09:09:53.6011727Z cvt.f32.bf16 %r136, %rs2; 2026-02-21T09:09:53.6011877Z cvt.f32.bf16 %r137, %rs15; 2026-02-21T09:09:53.6012040Z cvt.f32.bf16 %r138, %rs16; 2026-02-21T09:09:53.6012191Z cvt.f32.bf16 %r139, %rs13; 2026-02-21T09:09:53.6012348Z cvt.f32.bf16 %r140, %rs14; 2026-02-21T09:09:53.6012538Z cvt.f32.bf16 %r141, %rs11; 2026-02-21T09:09:53.6012688Z cvt.f32.bf16 %r142, %rs12; 2026-02-21T09:09:53.6012845Z cvt.f32.bf16 %r143, %rs9; 2026-02-21T09:09:53.6012994Z cvt.f32.bf16 %r144, %rs10; 2026-02-21T09:09:53.6013149Z $L__tmp2: 2026-02-21T09:09:53.6013438Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6013780Z 
add.s32 %r266, %r201, %r231; 2026-02-21T09:09:53.6013935Z // begin inline asm 2026-02-21T09:09:53.6014347Z @%p58 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r266 + 0], 16, {%r129, %r130, %r131, %r132, %r133, %r134, %r135, %r136, %r137, %r138, %r139, %r140, %r141, %r142, %r143, %r144}; 2026-02-21T09:09:53.6014758Z // end inline asm 2026-02-21T09:09:53.6014896Z // begin inline asm 2026-02-21T09:09:53.6015057Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:53.6015221Z // end inline asm 2026-02-21T09:09:53.6015361Z bar.sync 0; 2026-02-21T09:09:53.6015492Z $L__tmp3: 2026-02-21T09:09:53.6015741Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6016030Z // begin inline asm 2026-02-21T09:09:53.6016172Z 2026-02-21T09:09:53.6016321Z { 2026-02-21T09:09:53.6016447Z .reg .pred complete; 2026-02-21T09:09:53.6016600Z waitLoop: 2026-02-21T09:09:53.6016792Z mbarrier.try_wait.parity.shared.b64 complete, [%r249], %r57; 2026-02-21T09:09:53.6017048Z @!complete bra.uni waitLoop; 2026-02-21T09:09:53.6017208Z } 2026-02-21T09:09:53.6017289Z 2026-02-21T09:09:53.6017348Z // end inline asm 2026-02-21T09:09:53.6017604Z .loc 1 64 33 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:64:33 2026-02-21T09:09:53.6017914Z ld.shared.b8 %rs17, [%r192]; 2026-02-21T09:09:53.6018096Z ld.shared.b8 %rs18, [%r192+64]; 2026-02-21T09:09:53.6018280Z ld.shared.b8 %rs19, [%r192+512]; 2026-02-21T09:09:53.6018468Z ld.shared.b8 %rs20, [%r192+576]; 2026-02-21T09:09:53.6018672Z ld.shared.b8 %rs21, [%r191+128]; 2026-02-21T09:09:53.6018857Z ld.shared.b8 %rs22, [%r191+192]; 2026-02-21T09:09:53.6019027Z ld.shared.b8 %rs23, [%r191+640]; 2026-02-21T09:09:53.6019205Z ld.shared.b8 %rs24, [%r191+704]; 2026-02-21T09:09:53.6019376Z ld.shared.b8 %rs25, [%r190+256]; 2026-02-21T09:09:53.6019555Z ld.shared.b8 %rs26, [%r190+320]; 2026-02-21T09:09:53.6019735Z ld.shared.b8 %rs27, [%r190+768]; 2026-02-21T09:09:53.6019904Z ld.shared.b8 %rs28, [%r190+832]; 
2026-02-21T09:09:53.6020083Z ld.shared.b8 %rs29, [%r189+384]; 2026-02-21T09:09:53.6020255Z ld.shared.b8 %rs30, [%r189+448]; 2026-02-21T09:09:53.6020432Z ld.shared.b8 %rs31, [%r189+896]; 2026-02-21T09:09:53.6020601Z ld.shared.b8 %rs32, [%r189+960]; 2026-02-21T09:09:53.6020884Z .loc 1 67 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:67:28 2026-02-21T09:09:53.6021182Z shl.b16 %rs33, %rs17, 4; 2026-02-21T09:09:53.6021351Z shl.b16 %rs34, %rs18, 4; 2026-02-21T09:09:53.6021515Z shl.b16 %rs35, %rs21, 4; 2026-02-21T09:09:53.6021698Z shl.b16 %rs36, %rs22, 4; 2026-02-21T09:09:53.6021864Z shl.b16 %rs37, %rs25, 4; 2026-02-21T09:09:53.6022017Z shl.b16 %rs38, %rs26, 4; 2026-02-21T09:09:53.6022179Z shl.b16 %rs39, %rs29, 4; 2026-02-21T09:09:53.6022334Z shl.b16 %rs40, %rs30, 4; 2026-02-21T09:09:53.6022493Z shl.b16 %rs41, %rs19, 4; 2026-02-21T09:09:53.6022645Z shl.b16 %rs42, %rs20, 4; 2026-02-21T09:09:53.6022806Z shl.b16 %rs43, %rs23, 4; 2026-02-21T09:09:53.6022958Z shl.b16 %rs44, %rs24, 4; 2026-02-21T09:09:53.6023120Z shl.b16 %rs45, %rs27, 4; 2026-02-21T09:09:53.6023285Z shl.b16 %rs46, %rs28, 4; 2026-02-21T09:09:53.6023437Z shl.b16 %rs47, %rs31, 4; 2026-02-21T09:09:53.6023601Z shl.b16 %rs48, %rs32, 4; 2026-02-21T09:09:53.6023869Z .loc 1 82 58 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:82:58 2026-02-21T09:09:53.6024181Z selp.b16 %rs49, %rs33, %rs17, %p51; 2026-02-21T09:09:53.6024366Z cvt.s16.s8 %rs50, %rs49; 2026-02-21T09:09:53.6024530Z shr.s16 %rs51, %rs50, 4; 2026-02-21T09:09:53.6024749Z selp.b16 %rs52, %rs34, %rs18, %p51; 2026-02-21T09:09:53.6024941Z cvt.s16.s8 %rs53, %rs52; 2026-02-21T09:09:53.6025109Z shr.s16 %rs54, %rs53, 4; 2026-02-21T09:09:53.6025276Z selp.b16 %rs55, %rs35, %rs21, %p51; 2026-02-21T09:09:53.6025475Z cvt.s16.s8 %rs56, %rs55; 2026-02-21T09:09:53.6025622Z shr.s16 %rs57, %rs56, 4; 2026-02-21T09:09:53.6025784Z selp.b16 %rs58, %rs36, %rs22, %p51; 2026-02-21T09:09:53.6025950Z cvt.s16.s8 %rs59, %rs58; 2026-02-21T09:09:53.6026110Z 
shr.s16 %rs60, %rs59, 4; 2026-02-21T09:09:53.6026299Z selp.b16 %rs61, %rs37, %rs25, %p51; 2026-02-21T09:09:53.6026473Z cvt.s16.s8 %rs62, %rs61; 2026-02-21T09:09:53.6026619Z shr.s16 %rs63, %rs62, 4; 2026-02-21T09:09:53.6026781Z selp.b16 %rs64, %rs38, %rs26, %p51; 2026-02-21T09:09:53.6026955Z cvt.s16.s8 %rs65, %rs64; 2026-02-21T09:09:53.6027102Z shr.s16 %rs66, %rs65, 4; 2026-02-21T09:09:53.6027264Z selp.b16 %rs67, %rs39, %rs29, %p51; 2026-02-21T09:09:53.6027432Z cvt.s16.s8 %rs68, %rs67; 2026-02-21T09:09:53.6027588Z shr.s16 %rs69, %rs68, 4; 2026-02-21T09:09:53.6027740Z selp.b16 %rs70, %rs40, %rs30, %p51; 2026-02-21T09:09:53.6027915Z cvt.s16.s8 %rs71, %rs70; 2026-02-21T09:09:53.6028059Z shr.s16 %rs72, %rs71, 4; 2026-02-21T09:09:53.6028244Z selp.b16 %rs73, %rs41, %rs19, %p51; 2026-02-21T09:09:53.6028419Z cvt.s16.s8 %rs74, %rs73; 2026-02-21T09:09:53.6028566Z shr.s16 %rs75, %rs74, 4; 2026-02-21T09:09:53.6028723Z selp.b16 %rs76, %rs42, %rs20, %p51; 2026-02-21T09:09:53.6028889Z cvt.s16.s8 %rs77, %rs76; 2026-02-21T09:09:53.6029045Z shr.s16 %rs78, %rs77, 4; 2026-02-21T09:09:53.6029195Z selp.b16 %rs79, %rs43, %rs23, %p51; 2026-02-21T09:09:53.6029365Z cvt.s16.s8 %rs80, %rs79; 2026-02-21T09:09:53.6029510Z shr.s16 %rs81, %rs80, 4; 2026-02-21T09:09:53.6029668Z selp.b16 %rs82, %rs44, %rs24, %p51; 2026-02-21T09:09:53.6029832Z cvt.s16.s8 %rs83, %rs82; 2026-02-21T09:09:53.6029984Z shr.s16 %rs84, %rs83, 4; 2026-02-21T09:09:53.6030173Z selp.b16 %rs85, %rs45, %rs27, %p51; 2026-02-21T09:09:53.6030341Z cvt.s16.s8 %rs86, %rs85; 2026-02-21T09:09:53.6030492Z shr.s16 %rs87, %rs86, 4; 2026-02-21T09:09:53.6030644Z selp.b16 %rs88, %rs46, %rs28, %p51; 2026-02-21T09:09:53.6030816Z cvt.s16.s8 %rs89, %rs88; 2026-02-21T09:09:53.6030966Z shr.s16 %rs90, %rs89, 4; 2026-02-21T09:09:53.6031127Z selp.b16 %rs91, %rs47, %rs31, %p51; 2026-02-21T09:09:53.6031293Z cvt.s16.s8 %rs92, %rs91; 2026-02-21T09:09:53.6031447Z shr.s16 %rs93, %rs92, 4; 2026-02-21T09:09:53.6031630Z selp.b16 %rs94, %rs48, %rs32, %p51; 
2026-02-21T09:09:53.6031806Z cvt.s16.s8 %rs95, %rs94; 2026-02-21T09:09:53.6031961Z shr.s16 %rs96, %rs95, 4; 2026-02-21T09:09:53.6032216Z .loc 1 87 32 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:87:32 2026-02-21T09:09:53.6032512Z cvt.rn.f32.s16 %r214, %rs51; 2026-02-21T09:09:53.6032675Z cvt.rn.f32.s16 %r215, %rs54; 2026-02-21T09:09:53.6032840Z cvt.rn.f32.s16 %r216, %rs57; 2026-02-21T09:09:53.6032998Z cvt.rn.f32.s16 %r217, %rs60; 2026-02-21T09:09:53.6033164Z cvt.rn.f32.s16 %r218, %rs63; 2026-02-21T09:09:53.6033224Z cvt.rn.f32.s16 %r219, %rs66; 2026-02-21T09:09:53.6033292Z cvt.rn.f32.s16 %r220, %rs69; 2026-02-21T09:09:53.6033352Z cvt.rn.f32.s16 %r221, %rs72; 2026-02-21T09:09:53.6033414Z cvt.rn.f32.s16 %r222, %rs75; 2026-02-21T09:09:53.6033475Z cvt.rn.f32.s16 %r223, %rs78; 2026-02-21T09:09:53.6033550Z cvt.rn.f32.s16 %r224, %rs81; 2026-02-21T09:09:53.6033610Z cvt.rn.f32.s16 %r225, %rs84; 2026-02-21T09:09:53.6033669Z cvt.rn.f32.s16 %r226, %rs87; 2026-02-21T09:09:53.6033737Z cvt.rn.f32.s16 %r227, %rs90; 2026-02-21T09:09:53.6033795Z cvt.rn.f32.s16 %r228, %rs93; 2026-02-21T09:09:53.6033852Z cvt.rn.f32.s16 %r229, %rs96; 2026-02-21T09:09:53.6033910Z st.shared.b32 [%r12], %r214; 2026-02-21T09:09:53.6033982Z st.shared.b32 [%r12+8], %r215; 2026-02-21T09:09:53.6034041Z st.shared.b32 [%r13], %r216; 2026-02-21T09:09:53.6034103Z st.shared.b32 [%r13+8], %r217; 2026-02-21T09:09:53.6034171Z st.shared.b32 [%r14], %r218; 2026-02-21T09:09:53.6034259Z st.shared.b32 [%r14+8], %r219; 2026-02-21T09:09:53.6034317Z st.shared.b32 [%r15], %r220; 2026-02-21T09:09:53.6034376Z st.shared.b32 [%r15+8], %r221; 2026-02-21T09:09:53.6034442Z st.shared.b32 [%r16], %r222; 2026-02-21T09:09:53.6034502Z st.shared.b32 [%r16+8], %r223; 2026-02-21T09:09:53.6034560Z st.shared.b32 [%r17], %r224; 2026-02-21T09:09:53.6034626Z st.shared.b32 [%r17+8], %r225; 2026-02-21T09:09:53.6034684Z st.shared.b32 [%r18], %r226; 2026-02-21T09:09:53.6034742Z st.shared.b32 [%r18+8], %r227; 2026-02-21T09:09:53.6034837Z 
st.shared.b32 [%r19], %r228; 2026-02-21T09:09:53.6034905Z st.shared.b32 [%r19+8], %r229; 2026-02-21T09:09:53.6034959Z $L__tmp4: 2026-02-21T09:09:53.6035182Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6035249Z // begin inline asm 2026-02-21T09:09:53.6035324Z fence.proxy.async.shared::cta; 2026-02-21T09:09:53.6035382Z // end inline asm 2026-02-21T09:09:53.6035445Z bar.sync 0; 2026-02-21T09:09:53.6035510Z setp.ne.b32 %p55, %r30, 0; 2026-02-21T09:09:53.6035570Z @%p55 bra $L__BB0_3; 2026-02-21T09:09:53.6035623Z // %bb.2: 2026-02-21T09:09:53.6035720Z elect.sync %r242|%p57, -1; 2026-02-21T09:09:53.6035783Z mov.b32 %r232, 68159760; 2026-02-21T09:09:53.6035842Z mov.pred %p56, 0; 2026-02-21T09:09:53.6035906Z // begin inline asm 2026-02-21T09:09:53.6036067Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r443 + 0 ], [ %r231 + 0 ], %rd64, %r232, %p56; 2026-02-21T09:09:53.6036126Z // end inline asm 2026-02-21T09:09:53.6036183Z // begin inline asm 2026-02-21T09:09:53.6036343Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r443 + 0 ], [ %r231 + 8 ], %rd65, %r232, %p58; 2026-02-21T09:09:53.6036398Z // end inline asm 2026-02-21T09:09:53.6036453Z // begin inline asm 2026-02-21T09:09:53.6036608Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r443 + 0 ], [ %r231 + 16 ], %rd66, %r232, %p58; 2026-02-21T09:09:53.6036692Z // end inline asm 2026-02-21T09:09:53.6036749Z // begin inline asm 2026-02-21T09:09:53.6036900Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r443 + 0 ], [ %r231 + 24 ], %rd67, %r232, %p58; 2026-02-21T09:09:53.6036957Z // end inline asm 2026-02-21T09:09:53.6037019Z add.s32 %r244, %r47, 18432; 2026-02-21T09:09:53.6037087Z cvt.u64.u32 %rd68, %r244; 2026-02-21T09:09:53.6037142Z // begin inline asm 2026-02-21T09:09:53.6037265Z @%p57 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd68]; 2026-02-21T09:09:53.6037318Z // end inline asm 2026-02-21T09:09:53.6037380Z $L__tmp5: 
2026-02-21T09:09:53.6037434Z $L__BB0_3: 2026-02-21T09:09:53.6037522Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:53.6037590Z add.s32 %r21, %r47, %r184; 2026-02-21T09:09:53.6037649Z add.s32 %r22, %r47, %r185; 2026-02-21T09:09:53.6037706Z add.s32 %r23, %r47, %r186; 2026-02-21T09:09:53.6037762Z add.s32 %r24, %r47, %r187; 2026-02-21T09:09:53.6037940Z .loc 1 58 32 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:32 2026-02-21T09:09:53.6038001Z add.s64 %rd69, %rd53, 128; 2026-02-21T09:09:53.6038061Z cvt.u64.u32 %rd73, %r20; 2026-02-21T09:09:53.6038130Z add.s64 %rd74, %rd7, %rd73; 2026-02-21T09:09:53.6038189Z shl.b64 %rd75, %rd74, 1; 2026-02-21T09:09:53.6038248Z add.s64 %rd76, %rd14, %rd75; 2026-02-21T09:09:53.6038308Z add.s64 %rd70, %rd76, 65536; 2026-02-21T09:09:53.6038370Z mov.b32 %r246, 16; 2026-02-21T09:09:53.6038532Z .loc 1 58 80 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:80 2026-02-21T09:09:53.6038590Z // begin inline asm 2026-02-21T09:09:53.6038715Z cp.async.cg.shared.global [ %r245 + 0 ], [ %rd69 + 0 ], 0x10, %r246; 2026-02-21T09:09:53.6038770Z // end inline asm 2026-02-21T09:09:53.6038827Z // begin inline asm 2026-02-21T09:09:53.6038949Z cp.async.cg.shared.global [ %r247 + 0 ], [ %rd70 + 0 ], 0x10, %r246; 2026-02-21T09:09:53.6039004Z // end inline asm 2026-02-21T09:09:53.6039069Z cp.async.commit_group; 2026-02-21T09:09:53.6039250Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6039313Z // begin inline asm 2026-02-21T09:09:53.6039423Z @%p91 mbarrier.arrive.expect_tx.shared.b64 _, [%r249], 1024; 2026-02-21T09:09:53.6039478Z // end inline asm 2026-02-21T09:09:53.6039648Z .loc 1 64 33 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:64:33 2026-02-21T09:09:53.6039725Z bar.sync 0; 2026-02-21T09:09:53.6039790Z elect.sync %r258|%p68, -1; 2026-02-21T09:09:53.6039886Z and.pred %p66, %p1, %p68; 2026-02-21T09:09:53.6039943Z mov.b32 %r252, 32; 2026-02-21T09:09:53.6039998Z // 
begin inline asm 2026-02-21T09:09:53.6040241Z @%p66 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r115], [%rd71, {%r391, %r252}], [%r249]; 2026-02-21T09:09:53.6040305Z // end inline asm 2026-02-21T09:09:53.6040468Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6040528Z and.b32 %r259, %r1, 3; 2026-02-21T09:09:53.6040597Z mul.wide.u32 %rd77, %r259, 16; 2026-02-21T09:09:53.6040656Z shl.b32 %r260, %r3, 12; 2026-02-21T09:09:53.6040717Z and.b32 %r261, %r260, 4128768; 2026-02-21T09:09:53.6040802Z shl.b32 %r262, %r5, 10; 2026-02-21T09:09:53.6040864Z or.b32 %r263, %r261, %r262; 2026-02-21T09:09:53.6040927Z mul.wide.u32 %rd78, %r263, 2; 2026-02-21T09:09:53.6040987Z or.b64 %rd79, %rd77, %rd78; 2026-02-21T09:09:53.6041055Z add.s64 %rd80, %rd79, %rd14; 2026-02-21T09:09:53.6041120Z add.s64 %rd154, %rd80, 65728; 2026-02-21T09:09:53.6041175Z mov.b32 %r448, 1; 2026-02-21T09:09:53.6041238Z mov.b32 %r444, 0; 2026-02-21T09:09:53.6041295Z mov.b64 %rd155, 0; 2026-02-21T09:09:53.6041352Z mov.b32 %r446, %r444; 2026-02-21T09:09:53.6041409Z mov.b32 %r447, %r444; 2026-02-21T09:09:53.6041472Z mov.b32 %r449, %r444; 2026-02-21T09:09:53.6041527Z bra.uni $L__BB0_4; 2026-02-21T09:09:53.6041688Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:53.6041859Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6041924Z setp.lt.u64 %p84, %rd155, 464; 2026-02-21T09:09:53.6041979Z $L__tmp6: 2026-02-21T09:09:53.6042204Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6042263Z add.s32 %r347, %r448, 1; 2026-02-21T09:09:53.6042327Z setp.gt.s32 %p87, %r347, 1; 2026-02-21T09:09:53.6042393Z selp.b32 %r448, 0, %r347, %p87; 2026-02-21T09:09:53.6042462Z selp.b32 %r348, 1, 0, %p87; 2026-02-21T09:09:53.6042520Z xor.b32 %r46, %r449, %r348; 2026-02-21T09:09:53.6042572Z $L__tmp7: 
2026-02-21T09:09:53.6042738Z .loc 1 58 32 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:32 2026-02-21T09:09:53.6042800Z add.s64 %rd86, %rd154, -65536; 2026-02-21T09:09:53.6042960Z .loc 1 58 80 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:80 2026-02-21T09:09:53.6043028Z add.s32 %r338, %r40, %r6; 2026-02-21T09:09:53.6043090Z selp.b32 %r339, 16, 0, %p84; 2026-02-21T09:09:53.6043147Z // begin inline asm 2026-02-21T09:09:53.6043261Z cp.async.cg.shared.global [ %r338 + 0 ], [ %rd86 + 0 ], 0x10, %r339; 2026-02-21T09:09:53.6043326Z // end inline asm 2026-02-21T09:09:53.6043385Z add.s32 %r340, %r338, 2048; 2026-02-21T09:09:53.6043442Z // begin inline asm 2026-02-21T09:09:53.6043568Z cp.async.cg.shared.global [ %r340 + 0 ], [ %rd154 + 0 ], 0x10, %r339; 2026-02-21T09:09:53.6043626Z // end inline asm 2026-02-21T09:09:53.6043690Z cp.async.commit_group; 2026-02-21T09:09:53.6043850Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6043926Z and.pred %p82, %p91, %p84; 2026-02-21T09:09:53.6043983Z // begin inline asm 2026-02-21T09:09:53.6044093Z @%p82 mbarrier.arrive.expect_tx.shared.b64 _, [%r342], 1024; 2026-02-21T09:09:53.6044183Z // end inline asm 2026-02-21T09:09:53.6044346Z .loc 1 64 33 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:64:33 2026-02-21T09:09:53.6044400Z bar.sync 0; 2026-02-21T09:09:53.6044471Z elect.sync %r349|%p88, -1; 2026-02-21T09:09:53.6044535Z and.pred %p89, %p84, %p88; 2026-02-21T09:09:53.6044598Z and.pred %p83, %p1, %p89; 2026-02-21T09:09:53.6044658Z cvt.u32.u64 %r350, %rd155; 2026-02-21T09:09:53.6044726Z add.s32 %r345, %r350, 48; 2026-02-21T09:09:53.6044783Z // begin inline asm 2026-02-21T09:09:53.6045043Z @%p83 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r343], [%rd71, {%r391, %r345}], [%r342]; 2026-02-21T09:09:53.6045106Z // end inline asm 2026-02-21T09:09:53.6045266Z .loc 1 51 92 // 
cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6045327Z add.s64 %rd154, %rd154, 64; 2026-02-21T09:09:53.6045399Z setp.lt.u64 %p90, %rd155, 480; 2026-02-21T09:09:53.6045459Z add.s64 %rd155, %rd155, 16; 2026-02-21T09:09:53.6045514Z mov.b32 %r444, %r449; 2026-02-21T09:09:53.6045571Z mov.b32 %r449, %r46; 2026-02-21T09:09:53.6045637Z @%p90 bra $L__BB0_4; 2026-02-21T09:09:53.6045716Z bra.uni $L__BB0_7; 2026-02-21T09:09:53.6045820Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:09:53.6045887Z add.s32 %r285, %r447, 1; 2026-02-21T09:09:53.6045947Z setp.gt.s32 %p72, %r285, 1; 2026-02-21T09:09:53.6046010Z selp.b32 %r447, 0, %r285, %p72; 2026-02-21T09:09:53.6046173Z .loc 1 58 80 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:58:80 2026-02-21T09:09:53.6046243Z cp.async.wait_group 1; 2026-02-21T09:09:53.6046296Z bar.sync 0; 2026-02-21T09:09:53.6046352Z shl.b32 %r286, %r447, 12; 2026-02-21T09:09:53.6046419Z add.s32 %r288, %r47, %r286; 2026-02-21T09:09:53.6046476Z add.s32 %r40, %r288, 8192; 2026-02-21T09:09:53.6046534Z add.s32 %r289, %r40, %r10; 2026-02-21T09:09:53.6046661Z ld.shared.v4.b32 {%r290, %r291, %r292, %r293}, [%r289]; 2026-02-21T09:09:53.6046732Z mov.b32 {%rs97, %rs98}, %r293; 2026-02-21T09:09:53.6046794Z mov.b32 {%rs99, %rs100}, %r292; 2026-02-21T09:09:53.6046860Z mov.b32 {%rs101, %rs102}, %r291; 2026-02-21T09:09:53.6046930Z mov.b32 {%rs103, %rs104}, %r290; 2026-02-21T09:09:53.6047027Z ld.shared.v4.b32 {%r294, %r295, %r296, %r297}, [%r289+16]; 2026-02-21T09:09:53.6047089Z mov.b32 {%rs105, %rs106}, %r297; 2026-02-21T09:09:53.6047155Z mov.b32 {%rs107, %rs108}, %r296; 2026-02-21T09:09:53.6047216Z mov.b32 {%rs109, %rs110}, %r295; 2026-02-21T09:09:53.6047274Z mov.b32 {%rs111, %rs112}, %r294; 2026-02-21T09:09:53.6047436Z .loc 1 62 32 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:62:32 2026-02-21T09:09:53.6047505Z cvt.f32.bf16 %r267, %rs103; 2026-02-21T09:09:53.6047566Z cvt.f32.bf16 %r268, %rs104; 
2026-02-21T09:09:53.6047624Z cvt.f32.bf16 %r269, %rs101; 2026-02-21T09:09:53.6047690Z cvt.f32.bf16 %r270, %rs102; 2026-02-21T09:09:53.6047750Z cvt.f32.bf16 %r271, %rs99; 2026-02-21T09:09:53.6047808Z cvt.f32.bf16 %r272, %rs100; 2026-02-21T09:09:53.6047873Z cvt.f32.bf16 %r273, %rs97; 2026-02-21T09:09:53.6047931Z cvt.f32.bf16 %r274, %rs98; 2026-02-21T09:09:53.6047989Z cvt.f32.bf16 %r275, %rs111; 2026-02-21T09:09:53.6048047Z cvt.f32.bf16 %r276, %rs112; 2026-02-21T09:09:53.6048111Z cvt.f32.bf16 %r277, %rs109; 2026-02-21T09:09:53.6048167Z cvt.f32.bf16 %r278, %rs110; 2026-02-21T09:09:53.6048225Z cvt.f32.bf16 %r279, %rs107; 2026-02-21T09:09:53.6048292Z cvt.f32.bf16 %r280, %rs108; 2026-02-21T09:09:53.6048349Z cvt.f32.bf16 %r281, %rs105; 2026-02-21T09:09:53.6048406Z cvt.f32.bf16 %r282, %rs106; 2026-02-21T09:09:53.6048458Z $L__tmp8: 2026-02-21T09:09:53.6048680Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6048740Z // begin inline asm 2026-02-21T09:09:53.6048793Z 2026-02-21T09:09:53.6048856Z { 2026-02-21T09:09:53.6048940Z .reg .pred complete; 2026-02-21T09:09:53.6048997Z waitLoop: 2026-02-21T09:09:53.6049121Z mbarrier.try_wait.parity.shared.b64 complete, [%r445], %r444; 2026-02-21T09:09:53.6049196Z @!complete bra.uni waitLoop; 2026-02-21T09:09:53.6049248Z } 2026-02-21T09:09:53.6049252Z 2026-02-21T09:09:53.6049309Z // end inline asm 2026-02-21T09:09:53.6049368Z $L__tmp9: 2026-02-21T09:09:53.6049532Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6049591Z selp.b32 %r298, 1, 0, %p72; 2026-02-21T09:09:53.6049680Z xor.b32 %r446, %r446, %r298; 2026-02-21T09:09:53.6049740Z mov.pred %p73, -1; 2026-02-21T09:09:53.6049794Z $L__tmp10: 2026-02-21T09:09:53.6050005Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6050069Z // begin inline asm 2026-02-21T09:09:53.6050364Z @%p73 
tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r266 + 0], 16, {%r267, %r268, %r269, %r270, %r271, %r272, %r273, %r274, %r275, %r276, %r277, %r278, %r279, %r280, %r281, %r282}; 2026-02-21T09:09:53.6050422Z // end inline asm 2026-02-21T09:09:53.6050485Z // begin inline asm 2026-02-21T09:09:53.6050574Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:53.6050631Z // end inline asm 2026-02-21T09:09:53.6050692Z bar.sync 0; 2026-02-21T09:09:53.6050745Z $L__tmp11: 2026-02-21T09:09:53.6050907Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6050966Z shl.b32 %r299, %r447, 3; 2026-02-21T09:09:53.6051032Z add.s32 %r300, %r47, %r299; 2026-02-21T09:09:53.6051090Z add.s32 %r342, %r300, 18448; 2026-02-21T09:09:53.6051147Z // begin inline asm 2026-02-21T09:09:53.6051207Z 2026-02-21T09:09:53.6051257Z { 2026-02-21T09:09:53.6051316Z .reg .pred complete; 2026-02-21T09:09:53.6051368Z waitLoop: 2026-02-21T09:09:53.6051489Z mbarrier.try_wait.parity.shared.b64 complete, [%r342], %r446; 2026-02-21T09:09:53.6051609Z @!complete bra.uni waitLoop; 2026-02-21T09:09:53.6051661Z } 2026-02-21T09:09:53.6051665Z 2026-02-21T09:09:53.6051727Z // end inline asm 2026-02-21T09:09:53.6051889Z .loc 1 64 33 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:64:33 2026-02-21T09:09:53.6051949Z shl.b32 %r301, %r447, 10; 2026-02-21T09:09:53.6052015Z add.s32 %r302, %r47, %r301; 2026-02-21T09:09:53.6052074Z add.s32 %r343, %r302, 16384; 2026-02-21T09:09:53.6052132Z add.s32 %r303, %r343, %r11; 2026-02-21T09:09:53.6052195Z ld.shared.b8 %rs113, [%r303]; 2026-02-21T09:09:53.6052270Z ld.shared.b8 %rs114, [%r303+64]; 2026-02-21T09:09:53.6052337Z ld.shared.b8 %rs115, [%r303+512]; 2026-02-21T09:09:53.6052403Z ld.shared.b8 %rs116, [%r303+576]; 2026-02-21T09:09:53.6052467Z add.s32 %r304, %r343, %r27; 2026-02-21T09:09:53.6052529Z ld.shared.b8 %rs117, [%r304+128]; 2026-02-21T09:09:53.6052590Z ld.shared.b8 %rs118, [%r304+192]; 2026-02-21T09:09:53.6052653Z ld.shared.b8 %rs119, 
[%r304+640]; 2026-02-21T09:09:53.6052723Z ld.shared.b8 %rs120, [%r304+704]; 2026-02-21T09:09:53.6052782Z add.s32 %r305, %r343, %r26; 2026-02-21T09:09:53.6052841Z ld.shared.b8 %rs121, [%r305+256]; 2026-02-21T09:09:53.6052908Z ld.shared.b8 %rs122, [%r305+320]; 2026-02-21T09:09:53.6052968Z ld.shared.b8 %rs123, [%r305+768]; 2026-02-21T09:09:53.6053029Z ld.shared.b8 %rs124, [%r305+832]; 2026-02-21T09:09:53.6053088Z add.s32 %r306, %r343, %r25; 2026-02-21T09:09:53.6053156Z ld.shared.b8 %rs125, [%r306+384]; 2026-02-21T09:09:53.6053216Z ld.shared.b8 %rs126, [%r306+448]; 2026-02-21T09:09:53.6053278Z ld.shared.b8 %rs127, [%r306+896]; 2026-02-21T09:09:53.6053345Z ld.shared.b8 %rs128, [%r306+960]; 2026-02-21T09:09:53.6053506Z .loc 1 67 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:67:28 2026-02-21T09:09:53.6053566Z shl.b16 %rs129, %rs113, 4; 2026-02-21T09:09:53.6053630Z shl.b16 %rs130, %rs114, 4; 2026-02-21T09:09:53.6053691Z shl.b16 %rs131, %rs117, 4; 2026-02-21T09:09:53.6053774Z shl.b16 %rs132, %rs118, 4; 2026-02-21T09:09:53.6053832Z shl.b16 %rs133, %rs121, 4; 2026-02-21T09:09:53.6053897Z shl.b16 %rs134, %rs122, 4; 2026-02-21T09:09:53.6053953Z shl.b16 %rs135, %rs125, 4; 2026-02-21T09:09:53.6054010Z shl.b16 %rs136, %rs126, 4; 2026-02-21T09:09:53.6054073Z shl.b16 %rs137, %rs115, 4; 2026-02-21T09:09:53.6054129Z shl.b16 %rs138, %rs116, 4; 2026-02-21T09:09:53.6054186Z shl.b16 %rs139, %rs119, 4; 2026-02-21T09:09:53.6054241Z shl.b16 %rs140, %rs120, 4; 2026-02-21T09:09:53.6054306Z shl.b16 %rs141, %rs123, 4; 2026-02-21T09:09:53.6054387Z shl.b16 %rs142, %rs124, 4; 2026-02-21T09:09:53.6054444Z shl.b16 %rs143, %rs127, 4; 2026-02-21T09:09:53.6054508Z shl.b16 %rs144, %rs128, 4; 2026-02-21T09:09:53.6054670Z .loc 1 82 58 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:82:58 2026-02-21T09:09:53.6054742Z selp.b16 %rs145, %rs129, %rs113, %p51; 2026-02-21T09:09:53.6054809Z cvt.s16.s8 %rs146, %rs145; 2026-02-21T09:09:53.6054868Z shr.s16 %rs147, %rs146, 4; 
2026-02-21T09:09:53.6054939Z selp.b16 %rs148, %rs130, %rs114, %p51; 2026-02-21T09:09:53.6054999Z cvt.s16.s8 %rs149, %rs148; 2026-02-21T09:09:53.6055063Z shr.s16 %rs150, %rs149, 4; 2026-02-21T09:09:53.6055319Z selp.b16 %rs151, %rs131, %rs117, %p51; 2026-02-21T09:09:53.6055380Z cvt.s16.s8 %rs152, %rs151; 2026-02-21T09:09:53.6055447Z shr.s16 %rs153, %rs152, 4; 2026-02-21T09:09:53.6055514Z selp.b16 %rs154, %rs132, %rs118, %p51; 2026-02-21T09:09:53.6055573Z cvt.s16.s8 %rs155, %rs154; 2026-02-21T09:09:53.6055631Z shr.s16 %rs156, %rs155, 4; 2026-02-21T09:09:53.6055705Z selp.b16 %rs157, %rs133, %rs121, %p51; 2026-02-21T09:09:53.6055762Z cvt.s16.s8 %rs158, %rs157; 2026-02-21T09:09:53.6055821Z shr.s16 %rs159, %rs158, 4; 2026-02-21T09:09:53.6055893Z selp.b16 %rs160, %rs134, %rs122, %p51; 2026-02-21T09:09:53.6055951Z cvt.s16.s8 %rs161, %rs160; 2026-02-21T09:09:53.6056009Z shr.s16 %rs162, %rs161, 4; 2026-02-21T09:09:53.6056111Z selp.b16 %rs163, %rs135, %rs125, %p51; 2026-02-21T09:09:53.6056185Z cvt.s16.s8 %rs164, %rs163; 2026-02-21T09:09:53.6056245Z shr.s16 %rs165, %rs164, 4; 2026-02-21T09:09:53.6056312Z selp.b16 %rs166, %rs136, %rs126, %p51; 2026-02-21T09:09:53.6056385Z cvt.s16.s8 %rs167, %rs166; 2026-02-21T09:09:53.6056443Z shr.s16 %rs168, %rs167, 4; 2026-02-21T09:09:53.6056511Z selp.b16 %rs169, %rs137, %rs115, %p51; 2026-02-21T09:09:53.6056573Z cvt.s16.s8 %rs170, %rs169; 2026-02-21T09:09:53.6056638Z shr.s16 %rs171, %rs170, 4; 2026-02-21T09:09:53.6056702Z selp.b16 %rs172, %rs138, %rs116, %p51; 2026-02-21T09:09:53.6056761Z cvt.s16.s8 %rs173, %rs172; 2026-02-21T09:09:53.6056826Z shr.s16 %rs174, %rs173, 4; 2026-02-21T09:09:53.6056890Z selp.b16 %rs175, %rs139, %rs119, %p51; 2026-02-21T09:09:53.6056948Z cvt.s16.s8 %rs176, %rs175; 2026-02-21T09:09:53.6057012Z shr.s16 %rs177, %rs176, 4; 2026-02-21T09:09:53.6057076Z selp.b16 %rs178, %rs140, %rs120, %p51; 2026-02-21T09:09:53.6057132Z cvt.s16.s8 %rs179, %rs178; 2026-02-21T09:09:53.6057190Z shr.s16 %rs180, %rs179, 4; 
2026-02-21T09:09:53.6057265Z selp.b16 %rs181, %rs141, %rs123, %p51; 2026-02-21T09:09:53.6057321Z cvt.s16.s8 %rs182, %rs181; 2026-02-21T09:09:53.6057377Z shr.s16 %rs183, %rs182, 4; 2026-02-21T09:09:53.6057448Z selp.b16 %rs184, %rs142, %rs124, %p51; 2026-02-21T09:09:53.6057506Z cvt.s16.s8 %rs185, %rs184; 2026-02-21T09:09:53.6057565Z shr.s16 %rs186, %rs185, 4; 2026-02-21T09:09:53.6057627Z selp.b16 %rs187, %rs143, %rs127, %p51; 2026-02-21T09:09:53.6057692Z cvt.s16.s8 %rs188, %rs187; 2026-02-21T09:09:53.6057751Z shr.s16 %rs189, %rs188, 4; 2026-02-21T09:09:53.6057814Z selp.b16 %rs190, %rs144, %rs128, %p51; 2026-02-21T09:09:53.6057879Z cvt.s16.s8 %rs191, %rs190; 2026-02-21T09:09:53.6057936Z shr.s16 %rs192, %rs191, 4; 2026-02-21T09:09:53.6058100Z .loc 1 87 32 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:87:32 2026-02-21T09:09:53.6058163Z cvt.rn.f32.s16 %r307, %rs147; 2026-02-21T09:09:53.6058232Z cvt.rn.f32.s16 %r308, %rs150; 2026-02-21T09:09:53.6058315Z cvt.rn.f32.s16 %r309, %rs153; 2026-02-21T09:09:53.6058374Z cvt.rn.f32.s16 %r310, %rs156; 2026-02-21T09:09:53.6058440Z cvt.rn.f32.s16 %r311, %rs159; 2026-02-21T09:09:53.6058501Z cvt.rn.f32.s16 %r312, %rs162; 2026-02-21T09:09:53.6058560Z cvt.rn.f32.s16 %r313, %rs165; 2026-02-21T09:09:53.6058624Z cvt.rn.f32.s16 %r314, %rs168; 2026-02-21T09:09:53.6058683Z cvt.rn.f32.s16 %r315, %rs171; 2026-02-21T09:09:53.6058742Z cvt.rn.f32.s16 %r316, %rs174; 2026-02-21T09:09:53.6058800Z cvt.rn.f32.s16 %r317, %rs177; 2026-02-21T09:09:53.6058897Z cvt.rn.f32.s16 %r318, %rs180; 2026-02-21T09:09:53.6058961Z cvt.rn.f32.s16 %r319, %rs183; 2026-02-21T09:09:53.6059022Z cvt.rn.f32.s16 %r320, %rs186; 2026-02-21T09:09:53.6059091Z cvt.rn.f32.s16 %r321, %rs189; 2026-02-21T09:09:53.6059150Z cvt.rn.f32.s16 %r322, %rs192; 2026-02-21T09:09:53.6059215Z st.shared.b32 [%r12], %r307; 2026-02-21T09:09:53.6059279Z st.shared.b32 [%r12+8], %r308; 2026-02-21T09:09:53.6059352Z st.shared.b32 [%r13], %r309; 2026-02-21T09:09:53.6059418Z st.shared.b32 [%r13+8], 
%r310; 2026-02-21T09:09:53.6059482Z st.shared.b32 [%r14], %r311; 2026-02-21T09:09:53.6059551Z st.shared.b32 [%r14+8], %r312; 2026-02-21T09:09:53.6059637Z st.shared.b32 [%r15], %r313; 2026-02-21T09:09:53.6059701Z st.shared.b32 [%r15+8], %r314; 2026-02-21T09:09:53.6059764Z st.shared.b32 [%r16], %r315; 2026-02-21T09:09:53.6059833Z st.shared.b32 [%r16+8], %r316; 2026-02-21T09:09:53.6059894Z st.shared.b32 [%r17], %r317; 2026-02-21T09:09:53.6059955Z st.shared.b32 [%r17+8], %r318; 2026-02-21T09:09:53.6060025Z st.shared.b32 [%r18], %r319; 2026-02-21T09:09:53.6060086Z st.shared.b32 [%r18+8], %r320; 2026-02-21T09:09:53.6060148Z st.shared.b32 [%r19], %r321; 2026-02-21T09:09:53.6060216Z st.shared.b32 [%r19+8], %r322; 2026-02-21T09:09:53.6060386Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6060449Z shl.b32 %r323, %r448, 3; 2026-02-21T09:09:53.6060532Z add.s32 %r324, %r47, %r323; 2026-02-21T09:09:53.6060604Z add.s32 %r445, %r324, 18432; 2026-02-21T09:09:53.6060659Z $L__tmp12: 2026-02-21T09:09:53.6060885Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6060951Z // begin inline asm 2026-02-21T09:09:53.6061027Z fence.proxy.async.shared::cta; 2026-02-21T09:09:53.6061084Z // end inline asm 2026-02-21T09:09:53.6061142Z bar.sync 0; 2026-02-21T09:09:53.6061209Z @%p55 bra $L__BB0_6; 2026-02-21T09:09:53.6061317Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:53.6061385Z elect.sync %r337|%p74, -1; 2026-02-21T09:09:53.6061454Z mov.b32 %r327, 68159760; 2026-02-21T09:09:53.6061513Z // begin inline asm 2026-02-21T09:09:53.6061706Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r443 + 0 ], [ %r231 + 0 ], %rd64, %r327, %p73; 2026-02-21T09:09:53.6061773Z // end inline asm 2026-02-21T09:09:53.6061834Z // begin inline asm 2026-02-21T09:09:53.6061989Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r443 + 0 ], [ %r231 + 8 ], %rd65, %r327, %p73; 
2026-02-21T09:09:53.6062045Z // end inline asm 2026-02-21T09:09:53.6062111Z // begin inline asm 2026-02-21T09:09:53.6062266Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r443 + 0 ], [ %r231 + 16 ], %rd66, %r327, %p73; 2026-02-21T09:09:53.6062323Z // end inline asm 2026-02-21T09:09:53.6062389Z // begin inline asm 2026-02-21T09:09:53.6062541Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r443 + 0 ], [ %r231 + 24 ], %rd67, %r327, %p73; 2026-02-21T09:09:53.6062600Z // end inline asm 2026-02-21T09:09:53.6062671Z cvt.u64.u32 %rd85, %r445; 2026-02-21T09:09:53.6062730Z // begin inline asm 2026-02-21T09:09:53.6062857Z @%p74 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd85]; 2026-02-21T09:09:53.6062921Z // end inline asm 2026-02-21T09:09:53.6062979Z bra.uni $L__BB0_6; 2026-02-21T09:09:53.6063080Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:09:53.6063207Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:53.6063276Z mov.b32 %r352, 1; 2026-02-21T09:09:53.6063499Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6063559Z // begin inline asm 2026-02-21T09:09:53.6063622Z 2026-02-21T09:09:53.6063676Z { 2026-02-21T09:09:53.6063742Z .reg .pred complete; 2026-02-21T09:09:53.6063800Z waitLoop: 2026-02-21T09:09:53.6063931Z mbarrier.try_wait.parity.shared.b64 complete, [%r445], %r352; 2026-02-21T09:09:53.6064028Z @!complete bra.uni waitLoop; 2026-02-21T09:09:53.6064084Z } 2026-02-21T09:09:53.6064088Z 2026-02-21T09:09:53.6064157Z // end inline asm 2026-02-21T09:09:53.6064212Z $L__tmp13: 2026-02-21T09:09:53.6064387Z .loc 1 51 92 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:51:92 2026-02-21T09:09:53.6064461Z cp.async.wait_group 0; 2026-02-21T09:09:53.6064520Z bar.sync 0; 2026-02-21T09:09:53.6064580Z // begin inline asm 2026-02-21T09:09:53.6064670Z @%p91 mbarrier.inval.shared::cta.b64 [%r249]; 2026-02-21T09:09:53.6064740Z // end inline asm 2026-02-21T09:09:53.6064799Z bar.sync 0; 
2026-02-21T09:09:53.6064897Z // begin inline asm 2026-02-21T09:09:53.6064994Z @%p91 mbarrier.inval.shared::cta.b64 [%r109]; 2026-02-21T09:09:53.6065051Z // end inline asm 2026-02-21T09:09:53.6065113Z add.s32 %r355, %r47, 18432; 2026-02-21T09:09:53.6065172Z // begin inline asm 2026-02-21T09:09:53.6065263Z @%p91 mbarrier.inval.shared::cta.b64 [%r355]; 2026-02-21T09:09:53.6065322Z // end inline asm 2026-02-21T09:09:53.6065378Z bar.sync 0; 2026-02-21T09:09:53.6065443Z // begin inline asm 2026-02-21T09:09:53.6065523Z @%p91 mbarrier.inval.shared::cta.b64 [%r107]; 2026-02-21T09:09:53.6065581Z // end inline asm 2026-02-21T09:09:53.6065637Z $L__tmp14: 2026-02-21T09:09:53.6065890Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6065952Z // begin inline asm 2026-02-21T09:09:53.6066253Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r357, %r358, %r359, %r360, %r361, %r362, %r363, %r364, %r365, %r366, %r367, %r368, %r369, %r370, %r371, %r372}, [%r390 + 0], 32; 2026-02-21T09:09:53.6066319Z // end inline asm 2026-02-21T09:09:53.6066377Z // begin inline asm 2026-02-21T09:09:53.6066672Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r374, %r375, %r376, %r377, %r378, %r379, %r380, %r381, %r382, %r383, %r384, %r385, %r386, %r387, %r388, %r389}, [%r390 + 16], 32; 2026-02-21T09:09:53.6066738Z // end inline asm 2026-02-21T09:09:53.6066793Z // begin inline asm 2026-02-21T09:09:53.6066862Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:53.6066923Z // end inline asm 2026-02-21T09:09:53.6066985Z cvt.u64.u32 %rd90, %r357; 2026-02-21T09:09:53.6067044Z cvt.u64.u32 %rd91, %r358; 2026-02-21T09:09:53.6067103Z shl.b64 %rd92, %rd91, 32; 2026-02-21T09:09:53.6067171Z or.b64 %rd93, %rd90, %rd92; 2026-02-21T09:09:53.6067226Z $L__tmp15: 2026-02-21T09:09:53.6067398Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6067466Z mov.b64 {%r394, %r395}, %rd93; 2026-02-21T09:09:53.6067535Z 
cvt.rn.bf16x2.f32 %r396, %r395, %r394; 2026-02-21T09:09:53.6067588Z $L__tmp16: 2026-02-21T09:09:53.6067801Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6067867Z cvt.u64.u32 %rd94, %r359; 2026-02-21T09:09:53.6067928Z cvt.u64.u32 %rd95, %r360; 2026-02-21T09:09:53.6067986Z shl.b64 %rd96, %rd95, 32; 2026-02-21T09:09:53.6068053Z or.b64 %rd97, %rd94, %rd96; 2026-02-21T09:09:53.6068106Z $L__tmp17: 2026-02-21T09:09:53.6068270Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6068337Z mov.b64 {%r397, %r398}, %rd97; 2026-02-21T09:09:53.6068408Z cvt.rn.bf16x2.f32 %r399, %r398, %r397; 2026-02-21T09:09:53.6068485Z $L__tmp18: 2026-02-21T09:09:53.6068692Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6068756Z cvt.u64.u32 %rd98, %r361; 2026-02-21T09:09:53.6068816Z cvt.u64.u32 %rd99, %r362; 2026-02-21T09:09:53.6068874Z shl.b64 %rd100, %rd99, 32; 2026-02-21T09:09:53.6068941Z or.b64 %rd101, %rd98, %rd100; 2026-02-21T09:09:53.6068992Z $L__tmp19: 2026-02-21T09:09:53.6069155Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6069245Z mov.b64 {%r400, %r401}, %rd101; 2026-02-21T09:09:53.6069310Z cvt.rn.bf16x2.f32 %r402, %r401, %r400; 2026-02-21T09:09:53.6069362Z $L__tmp20: 2026-02-21T09:09:53.6069568Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6069636Z cvt.u64.u32 %rd102, %r363; 2026-02-21T09:09:53.6069695Z cvt.u64.u32 %rd103, %r364; 2026-02-21T09:09:53.6069756Z shl.b64 %rd104, %rd103, 32; 2026-02-21T09:09:53.6069824Z or.b64 %rd105, %rd102, %rd104; 2026-02-21T09:09:53.6069875Z $L__tmp21: 2026-02-21T09:09:53.6070060Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6070130Z mov.b64 {%r403, %r404}, %rd105; 
2026-02-21T09:09:53.6070196Z cvt.rn.bf16x2.f32 %r405, %r404, %r403; 2026-02-21T09:09:53.6070248Z $L__tmp22: 2026-02-21T09:09:53.6070455Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6070522Z cvt.u64.u32 %rd106, %r365; 2026-02-21T09:09:53.6070580Z cvt.u64.u32 %rd107, %r366; 2026-02-21T09:09:53.6070638Z shl.b64 %rd108, %rd107, 32; 2026-02-21T09:09:53.6070704Z or.b64 %rd109, %rd106, %rd108; 2026-02-21T09:09:53.6070756Z $L__tmp23: 2026-02-21T09:09:53.6070938Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6071001Z mov.b64 {%r406, %r407}, %rd109; 2026-02-21T09:09:53.6071073Z cvt.rn.bf16x2.f32 %r408, %r407, %r406; 2026-02-21T09:09:53.6071125Z $L__tmp24: 2026-02-21T09:09:53.6071330Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6071398Z cvt.u64.u32 %rd110, %r367; 2026-02-21T09:09:53.6071456Z cvt.u64.u32 %rd111, %r368; 2026-02-21T09:09:53.6071515Z shl.b64 %rd112, %rd111, 32; 2026-02-21T09:09:53.6071616Z or.b64 %rd113, %rd110, %rd112; 2026-02-21T09:09:53.6071670Z $L__tmp25: 2026-02-21T09:09:53.6071830Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6071889Z mov.b64 {%r409, %r410}, %rd113; 2026-02-21T09:09:53.6071963Z cvt.rn.bf16x2.f32 %r411, %r410, %r409; 2026-02-21T09:09:53.6072015Z $L__tmp26: 2026-02-21T09:09:53.6072222Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6072292Z cvt.u64.u32 %rd114, %r369; 2026-02-21T09:09:53.6072351Z cvt.u64.u32 %rd115, %r370; 2026-02-21T09:09:53.6072411Z shl.b64 %rd116, %rd115, 32; 2026-02-21T09:09:53.6072482Z or.b64 %rd117, %rd114, %rd116; 2026-02-21T09:09:53.6072535Z $L__tmp27: 2026-02-21T09:09:53.6072700Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6072760Z 
mov.b64 {%r412, %r413}, %rd117; 2026-02-21T09:09:53.6072839Z cvt.rn.bf16x2.f32 %r414, %r413, %r412; 2026-02-21T09:09:53.6072892Z $L__tmp28: 2026-02-21T09:09:53.6073097Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6073164Z cvt.u64.u32 %rd118, %r371; 2026-02-21T09:09:53.6073221Z cvt.u64.u32 %rd119, %r372; 2026-02-21T09:09:53.6073281Z shl.b64 %rd120, %rd119, 32; 2026-02-21T09:09:53.6073343Z or.b64 %rd121, %rd118, %rd120; 2026-02-21T09:09:53.6073433Z $L__tmp29: 2026-02-21T09:09:53.6073599Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6073659Z mov.b64 {%r415, %r416}, %rd121; 2026-02-21T09:09:53.6073733Z cvt.rn.bf16x2.f32 %r417, %r416, %r415; 2026-02-21T09:09:53.6073784Z $L__tmp30: 2026-02-21T09:09:53.6073991Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6074057Z cvt.u64.u32 %rd122, %r374; 2026-02-21T09:09:53.6074140Z cvt.u64.u32 %rd123, %r375; 2026-02-21T09:09:53.6074199Z shl.b64 %rd124, %rd123, 32; 2026-02-21T09:09:53.6074258Z or.b64 %rd125, %rd122, %rd124; 2026-02-21T09:09:53.6074316Z $L__tmp31: 2026-02-21T09:09:53.6074478Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6074537Z mov.b64 {%r418, %r419}, %rd125; 2026-02-21T09:09:53.6074612Z cvt.rn.bf16x2.f32 %r420, %r419, %r418; 2026-02-21T09:09:53.6074666Z $L__tmp32: 2026-02-21T09:09:53.6074873Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6074959Z cvt.u64.u32 %rd126, %r376; 2026-02-21T09:09:53.6075020Z cvt.u64.u32 %rd127, %r377; 2026-02-21T09:09:53.6075079Z shl.b64 %rd128, %rd127, 32; 2026-02-21T09:09:53.6075137Z or.b64 %rd129, %rd126, %rd128; 2026-02-21T09:09:53.6075197Z $L__tmp33: 2026-02-21T09:09:53.6075358Z .loc 1 97 28 // 
cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6075420Z mov.b64 {%r421, %r422}, %rd129; 2026-02-21T09:09:53.6075494Z cvt.rn.bf16x2.f32 %r423, %r422, %r421; 2026-02-21T09:09:53.6075546Z $L__tmp34: 2026-02-21T09:09:53.6075753Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6075844Z cvt.u64.u32 %rd130, %r378; 2026-02-21T09:09:53.6075905Z cvt.u64.u32 %rd131, %r379; 2026-02-21T09:09:53.6075963Z shl.b64 %rd132, %rd131, 32; 2026-02-21T09:09:53.6076021Z or.b64 %rd133, %rd130, %rd132; 2026-02-21T09:09:53.6076079Z $L__tmp35: 2026-02-21T09:09:53.6076243Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6076303Z mov.b64 {%r424, %r425}, %rd133; 2026-02-21T09:09:53.6076376Z cvt.rn.bf16x2.f32 %r426, %r425, %r424; 2026-02-21T09:09:53.6076427Z $L__tmp36: 2026-02-21T09:09:53.6076634Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6076702Z cvt.u64.u32 %rd134, %r380; 2026-02-21T09:09:53.6076759Z cvt.u64.u32 %rd135, %r381; 2026-02-21T09:09:53.6076819Z shl.b64 %rd136, %rd135, 32; 2026-02-21T09:09:53.6076877Z or.b64 %rd137, %rd134, %rd136; 2026-02-21T09:09:53.6076935Z $L__tmp37: 2026-02-21T09:09:53.6077098Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6077157Z mov.b64 {%r427, %r428}, %rd137; 2026-02-21T09:09:53.6077228Z cvt.rn.bf16x2.f32 %r429, %r428, %r427; 2026-02-21T09:09:53.6077278Z $L__tmp38: 2026-02-21T09:09:53.6077492Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6077549Z cvt.u64.u32 %rd138, %r382; 2026-02-21T09:09:53.6077613Z cvt.u64.u32 %rd139, %r383; 2026-02-21T09:09:53.6077671Z shl.b64 %rd140, %rd139, 32; 2026-02-21T09:09:53.6077730Z or.b64 %rd141, %rd138, %rd140; 2026-02-21T09:09:53.6077788Z $L__tmp39: 
2026-02-21T09:09:53.6077955Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6078014Z mov.b64 {%r430, %r431}, %rd141; 2026-02-21T09:09:53.6078084Z cvt.rn.bf16x2.f32 %r432, %r431, %r430; 2026-02-21T09:09:53.6078136Z $L__tmp40: 2026-02-21T09:09:53.6078341Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6078425Z cvt.u64.u32 %rd142, %r384; 2026-02-21T09:09:53.6078490Z cvt.u64.u32 %rd143, %r385; 2026-02-21T09:09:53.6078549Z shl.b64 %rd144, %rd143, 32; 2026-02-21T09:09:53.6078606Z or.b64 %rd145, %rd142, %rd144; 2026-02-21T09:09:53.6078664Z $L__tmp41: 2026-02-21T09:09:53.6078830Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6078889Z mov.b64 {%r433, %r434}, %rd145; 2026-02-21T09:09:53.6078986Z cvt.rn.bf16x2.f32 %r435, %r434, %r433; 2026-02-21T09:09:53.6079038Z $L__tmp42: 2026-02-21T09:09:53.6079244Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6079303Z cvt.u64.u32 %rd146, %r386; 2026-02-21T09:09:53.6079369Z cvt.u64.u32 %rd147, %r387; 2026-02-21T09:09:53.6079428Z shl.b64 %rd148, %rd147, 32; 2026-02-21T09:09:53.6079487Z or.b64 %rd149, %rd146, %rd148; 2026-02-21T09:09:53.6079545Z $L__tmp43: 2026-02-21T09:09:53.6079706Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6079807Z mov.b64 {%r436, %r437}, %rd149; 2026-02-21T09:09:53.6079875Z cvt.rn.bf16x2.f32 %r438, %r437, %r436; 2026-02-21T09:09:53.6079936Z $L__tmp44: 2026-02-21T09:09:53.6080140Z .loc 2 291 36 // standard.py:291:36 @[ cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:94:40 ] 2026-02-21T09:09:53.6080201Z cvt.u64.u32 %rd150, %r388; 2026-02-21T09:09:53.6080266Z cvt.u64.u32 %rd151, %r389; 2026-02-21T09:09:53.6080326Z shl.b64 %rd152, %rd151, 32; 2026-02-21T09:09:53.6080384Z or.b64 %rd153, %rd150, %rd152; 
2026-02-21T09:09:53.6080443Z $L__tmp45: 2026-02-21T09:09:53.6080607Z .loc 1 97 28 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:97:28 2026-02-21T09:09:53.6080687Z mov.b64 {%r439, %r440}, %rd153; 2026-02-21T09:09:53.6080755Z cvt.rn.bf16x2.f32 %r441, %r440, %r439; 2026-02-21T09:09:53.6080920Z .loc 1 98 43 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:98:43 2026-02-21T09:09:53.6081016Z st.shared.v4.b32 [%r21], {%r396, %r399, %r402, %r405}; 2026-02-21T09:09:53.6081109Z st.shared.v4.b32 [%r22], {%r408, %r411, %r414, %r417}; 2026-02-21T09:09:53.6081207Z st.shared.v4.b32 [%r23], {%r420, %r423, %r426, %r429}; 2026-02-21T09:09:53.6081290Z st.shared.v4.b32 [%r24], {%r432, %r435, %r438, %r441}; 2026-02-21T09:09:53.6081352Z // begin inline asm 2026-02-21T09:09:53.6081445Z fence.proxy.async.shared::cta; 2026-02-21T09:09:53.6091987Z // end inline asm 2026-02-21T09:09:53.6092096Z bar.sync 0; 2026-02-21T09:09:53.6092188Z elect.sync %r442|%p97, -1; 2026-02-21T09:09:53.6092257Z and.pred %p95, %p1, %p97; 2026-02-21T09:09:53.6092319Z // begin inline asm 2026-02-21T09:09:53.6092535Z @%p95 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd89, {%r391, %r392}], [%r47]; 2026-02-21T09:09:53.6092599Z // end inline asm 2026-02-21T09:09:53.6092672Z cp.async.bulk.commit_group; 2026-02-21T09:09:53.6092748Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:53.6092815Z bar.sync 0; 2026-02-21T09:09:53.6092907Z $L__BB0_8: // %._crit_edge 2026-02-21T09:09:53.6093085Z .loc 1 31 4 // cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py:31:4 2026-02-21T09:09:53.6093152Z bar.sync 0; 2026-02-21T09:09:53.6093215Z // begin inline asm 2026-02-21T09:09:53.6093343Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r443, 128; 2026-02-21T09:09:53.6093412Z // end inline asm 2026-02-21T09:09:53.6093468Z ret; 2026-02-21T09:09:53.6093524Z $L__tmp46: 2026-02-21T09:09:53.6093583Z $L__func_end0: 2026-02-21T09:09:53.6093680Z // -- End function 2026-02-21T09:09:53.6093736Z } 
2026-02-21T09:09:53.6093944Z .file 1 "/tmp/torchinductor_root/fn/cfnaieex2kqnslbi25h4kp4ktb76w2sozgk6aryla743p3xbvv4h.py" 2026-02-21T09:09:53.6094226Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:53.6094292Z .section .debug_abbrev 2026-02-21T09:09:53.6094350Z { 2026-02-21T09:09:53.6094443Z .b8 1 // Abbreviation Code 2026-02-21T09:09:53.6094545Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:53.6094628Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:53.6094711Z .b8 37 // DW_AT_producer 2026-02-21T09:09:53.6094838Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.6094916Z .b8 19 // DW_AT_language 2026-02-21T09:09:53.6094994Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:53.6095082Z .b8 3 // DW_AT_name 2026-02-21T09:09:53.6095160Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.6095239Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:53.6095315Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:53.6095399Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:53.6095501Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.6095573Z .b8 0 // EOM(1) 2026-02-21T09:09:53.6095654Z .b8 0 // EOM(2) 2026-02-21T09:09:53.6095736Z .b8 2 // Abbreviation Code 2026-02-21T09:09:53.6095821Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:53.6095905Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:53.6095979Z .b8 3 // DW_AT_name 2026-02-21T09:09:53.6096052Z .b8 8 // DW_FORM_string 2026-02-21T09:09:53.6096173Z .b8 32 // DW_AT_inline 2026-02-21T09:09:53.6096253Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.6096326Z .b8 0 // EOM(1) 2026-02-21T09:09:53.6096394Z .b8 0 // EOM(2) 2026-02-21T09:09:53.6096484Z .b8 3 // Abbreviation Code 2026-02-21T09:09:53.6096566Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:53.6096643Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:53.6096731Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:53.6096807Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.6096885Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:53.6096967Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.6097057Z .b8 49 // 
DW_AT_abstract_origin 2026-02-21T09:09:53.6097131Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:53.6097202Z .b8 0 // EOM(1) 2026-02-21T09:09:53.6097280Z .b8 0 // EOM(2) 2026-02-21T09:09:53.6097360Z .b8 4 // Abbreviation Code 2026-02-21T09:09:53.6097457Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:53.6097542Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:53.6097627Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:53.6097704Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:53.6097789Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:53.6097862Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.6097940Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:53.6098013Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:53.6098105Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:53.6098211Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.6098288Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:53.6098374Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.6098456Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:53.6098531Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:53.6098610Z .b8 0 // EOM(1) 2026-02-21T09:09:53.6098703Z .b8 0 // EOM(2) 2026-02-21T09:09:53.6098772Z .b8 0 // EOM(3) 2026-02-21T09:09:53.6098828Z } 2026-02-21T09:09:53.6098903Z .section .debug_info 2026-02-21T09:09:53.6098956Z { 2026-02-21T09:09:53.6099042Z .b32 178 // Length of Unit 2026-02-21T09:09:53.6099144Z .b8 2 // DWARF version number 2026-02-21T09:09:53.6099198Z .b8 0 2026-02-21T09:09:53.6099318Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:53.6099415Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:53.6099537Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:53.6099620Z .b8 116 // DW_AT_producer 2026-02-21T09:09:53.6099676Z .b8 114 2026-02-21T09:09:53.6099741Z .b8 105 2026-02-21T09:09:53.6099795Z .b8 116 2026-02-21T09:09:53.6099848Z .b8 111 2026-02-21T09:09:53.6099911Z .b8 110 2026-02-21T09:09:53.6099965Z .b8 0 2026-02-21T09:09:53.6100039Z .b8 2 // DW_AT_language 2026-02-21T09:09:53.6100092Z .b8 0 2026-02-21T09:09:53.6100177Z .b8 99 // DW_AT_name 2026-02-21T09:09:53.6100231Z .b8 102 2026-02-21T09:09:53.6100283Z .b8 110 2026-02-21T09:09:53.6100346Z .b8 97 2026-02-21T09:09:53.6100397Z .b8 105 2026-02-21T09:09:53.6100449Z .b8 101 2026-02-21T09:09:53.6100524Z .b8 101 2026-02-21T09:09:53.6100589Z .b8 120 2026-02-21T09:09:53.6100644Z .b8 50 2026-02-21T09:09:53.6100695Z .b8 107 2026-02-21T09:09:53.6100757Z .b8 113 2026-02-21T09:09:53.6100810Z .b8 110 2026-02-21T09:09:53.6100862Z .b8 115 2026-02-21T09:09:53.6100915Z .b8 108 2026-02-21T09:09:53.6100976Z .b8 98 2026-02-21T09:09:53.6101027Z .b8 105 2026-02-21T09:09:53.6101081Z .b8 50 2026-02-21T09:09:53.6101131Z .b8 53 2026-02-21T09:09:53.6101195Z .b8 104 2026-02-21T09:09:53.6101248Z .b8 52 2026-02-21T09:09:53.6101300Z .b8 107 2026-02-21T09:09:53.6101363Z .b8 112 2026-02-21T09:09:53.6101417Z .b8 52 2026-02-21T09:09:53.6101468Z .b8 107 2026-02-21T09:09:53.6101521Z .b8 116 2026-02-21T09:09:53.6101626Z .b8 98 2026-02-21T09:09:53.6101679Z .b8 55 2026-02-21T09:09:53.6101734Z .b8 54 2026-02-21T09:09:53.6101799Z .b8 119 2026-02-21T09:09:53.6101851Z .b8 50 2026-02-21T09:09:53.6101906Z .b8 115 2026-02-21T09:09:53.6101957Z .b8 111 2026-02-21T09:09:53.6102019Z .b8 122 2026-02-21T09:09:53.6102073Z .b8 103 2026-02-21T09:09:53.6102127Z .b8 107 2026-02-21T09:09:53.6102191Z .b8 54 2026-02-21T09:09:53.6102246Z .b8 97 2026-02-21T09:09:53.6102297Z .b8 114 2026-02-21T09:09:53.6102350Z .b8 121 2026-02-21T09:09:53.6102417Z .b8 108 
2026-02-21T09:09:53.6102469Z .b8 97 2026-02-21T09:09:53.6102524Z .b8 55 2026-02-21T09:09:53.6102578Z .b8 52 2026-02-21T09:09:53.6102644Z .b8 51 2026-02-21T09:09:53.6102699Z .b8 112 2026-02-21T09:09:53.6102750Z .b8 51 2026-02-21T09:09:53.6102812Z .b8 120 2026-02-21T09:09:53.6102865Z .b8 98 2026-02-21T09:09:53.6102917Z .b8 118 2026-02-21T09:09:53.6102974Z .b8 118 2026-02-21T09:09:53.6103035Z .b8 52 2026-02-21T09:09:53.6103088Z .b8 104 2026-02-21T09:09:53.6103141Z .b8 46 2026-02-21T09:09:53.6103202Z .b8 112 2026-02-21T09:09:53.6103254Z .b8 121 2026-02-21T09:09:53.6103306Z .b8 0 2026-02-21T09:09:53.6103398Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:53.6103483Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:53.6103536Z .b8 116 2026-02-21T09:09:53.6103589Z .b8 109 2026-02-21T09:09:53.6103676Z .b8 112 2026-02-21T09:09:53.6103730Z .b8 47 2026-02-21T09:09:53.6103782Z .b8 116 2026-02-21T09:09:53.6103834Z .b8 111 2026-02-21T09:09:53.6103896Z .b8 114 2026-02-21T09:09:53.6103947Z .b8 99 2026-02-21T09:09:53.6104001Z .b8 104 2026-02-21T09:09:53.6104054Z .b8 105 2026-02-21T09:09:53.6104119Z .b8 110 2026-02-21T09:09:53.6104174Z .b8 100 2026-02-21T09:09:53.6104228Z .b8 117 2026-02-21T09:09:53.6104291Z .b8 99 2026-02-21T09:09:53.6104345Z .b8 116 2026-02-21T09:09:53.6104399Z .b8 111 2026-02-21T09:09:53.6104454Z .b8 114 2026-02-21T09:09:53.6104547Z .b8 95 2026-02-21T09:09:53.6104601Z .b8 114 2026-02-21T09:09:53.6104658Z .b8 111 2026-02-21T09:09:53.6104722Z .b8 111 2026-02-21T09:09:53.6104775Z .b8 116 2026-02-21T09:09:53.6104831Z .b8 47 2026-02-21T09:09:53.6104884Z .b8 102 2026-02-21T09:09:53.6104951Z .b8 110 2026-02-21T09:09:53.6105008Z .b8 0 2026-02-21T09:09:53.6105112Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:53.6105204Z .b8 95 // DW_AT_name 2026-02-21T09:09:53.6105262Z .b8 104 2026-02-21T09:09:53.6105316Z .b8 101 2026-02-21T09:09:53.6105370Z .b8 108 2026-02-21T09:09:53.6105437Z .b8 105 2026-02-21T09:09:53.6105492Z .b8 111 2026-02-21T09:09:53.6105576Z 
.b8 110 2026-02-21T09:09:53.6105641Z .b8 95 2026-02-21T09:09:53.6105695Z .b8 109 2026-02-21T09:09:53.6105750Z .b8 97 2026-02-21T09:09:53.6105803Z .b8 116 2026-02-21T09:09:53.6105868Z .b8 109 2026-02-21T09:09:53.6105922Z .b8 117 2026-02-21T09:09:53.6105975Z .b8 108 2026-02-21T09:09:53.6106031Z .b8 95 2026-02-21T09:09:53.6106093Z .b8 98 2026-02-21T09:09:53.6106148Z .b8 102 2026-02-21T09:09:53.6106201Z .b8 49 2026-02-21T09:09:53.6106265Z .b8 54 2026-02-21T09:09:53.6106318Z .b8 95 2026-02-21T09:09:53.6106372Z .b8 105 2026-02-21T09:09:53.6106425Z .b8 110 2026-02-21T09:09:53.6106489Z .b8 116 2026-02-21T09:09:53.6106544Z .b8 52 2026-02-21T09:09:53.6106597Z .b8 0 2026-02-21T09:09:53.6106712Z .b8 1 // DW_AT_inline 2026-02-21T09:09:53.6106819Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:53.6106909Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:53.6107001Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:53.6107107Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:53.6107223Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:53.6107316Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:53.6107415Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:53.6107502Z .b64 $L__tmp45 // DW_AT_high_pc 2026-02-21T09:09:53.6107584Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:53.6107672Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:53.6107755Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:53.6107845Z .b8 0 // End Of Children Mark 2026-02-21T09:09:53.6107939Z .b8 0 // End Of Children Mark 2026-02-21T09:09:53.6107994Z } 2026-02-21T09:09:53.6108067Z .section .debug_macinfo { } 2026-02-21T09:09:53.6108073Z 2026-02-21T09:09:53.6108153Z ================================================================ 2026-02-21T09:09:53.6108271Z please share the reproducer above with Triton project. 2026-02-21T09:09:54.7768410Z [181s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:54.7768694Z 2026-02-21T09:09:54.7771099Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 64, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[2, 3], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:54.7772723Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:54.7772981Z `ptxas` stderr: 2026-02-21T09:09:54.7773423Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 196 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:54.7773927Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:54.7774142Z 2026-02-21T09:09:54.7774555Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpmjfrw8es.ptx -o /tmp/tmpmjfrw8es.ptx.o 2026-02-21T09:09:54.7774991Z 2026-02-21T09:09:54.7775121Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:54.7775339Z 2026-02-21T09:09:54.7775342Z 2026-02-21T09:09:54.7775432Z ================================================================ 2026-02-21T09:09:54.7775666Z Internal Triton PTX codegen error 2026-02-21T09:09:54.7775844Z `ptxas` stderr: 2026-02-21T09:09:54.7776318Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 196 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:54.7776797Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:54.7776950Z 2026-02-21T09:09:54.7777316Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpmjfrw8es.ptx -o /tmp/tmpmjfrw8es.ptx.o 2026-02-21T09:09:54.7777739Z 2026-02-21T09:09:54.7777743Z 2026-02-21T09:09:54.7777806Z // 2026-02-21T09:09:54.7777954Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:54.7778131Z // 2026-02-21T09:09:54.7778198Z 2026-02-21T09:09:54.7778254Z .version 8.7 2026-02-21T09:09:54.7778397Z .target sm_100a 2026-02-21T09:09:54.7778573Z .address_size 64 2026-02-21T09:09:54.7778668Z 2026-02-21T09:09:54.7778813Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:54.7779092Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:54.7779320Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:54.7779542Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:54.7779785Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:54.7780074Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:54.7780347Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:54.7780625Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:54.7780900Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:54.7781123Z ) 2026-02-21T09:09:54.7781251Z .reqntid 128 2026-02-21T09:09:54.7781377Z .maxnreg 32 2026-02-21T09:09:54.7781518Z { 2026-02-21T09:09:54.7781683Z .reg .pred %p<51>; 2026-02-21T09:09:54.7781836Z .reg .b16 %rs<42>; 2026-02-21T09:09:54.7781973Z .reg .b32 %r<281>; 2026-02-21T09:09:54.7782113Z .reg .b64 %rd<89>; 2026-02-21T09:09:54.7782247Z $L__func_begin0: 2026-02-21T09:09:54.7782337Z 2026-02-21T09:09:54.7782391Z // %bb.0: 2026-02-21T09:09:54.7782628Z .loc 1 19 0 // 
cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:19 2026-02-21T09:09:54.7782922Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:54.7783084Z setp.lt.u32 %p2, %r1, 32; 2026-02-21T09:09:54.7783288Z ld.param.b64 %rd17, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:54.7783513Z mov.b32 %r33, global_smem; 2026-02-21T09:09:54.7783671Z // begin inline asm 2026-02-21T09:09:54.7783908Z @%p2 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r33], 64; 2026-02-21T09:09:54.7784154Z // end inline asm 2026-02-21T09:09:54.7784348Z ld.param.b64 %rd34, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:54.7784564Z bar.sync 0; 2026-02-21T09:09:54.7784751Z ld.shared.b32 %r279, [global_smem]; 2026-02-21T09:09:54.7785077Z bar.sync 0; 2026-02-21T09:09:54.7785214Z // begin inline asm 2026-02-21T09:09:54.7785432Z @%p2 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:54.7785671Z // end inline asm 2026-02-21T09:09:54.7785944Z .loc 1 21 67 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:21:67 2026-02-21T09:09:54.7786252Z mov.u32 %r42, %ctaid.x; 2026-02-21T09:09:54.7786405Z mov.u32 %r43, %ctaid.y; 2026-02-21T09:09:54.7786596Z mov.u32 %r44, %ctaid.z; 2026-02-21T09:09:54.7786745Z mov.u32 %r45, %nctaid.x; 2026-02-21T09:09:54.7786905Z mov.u32 %r46, %nctaid.y; 2026-02-21T09:09:54.7787063Z mad.lo.s32 %r47, %r44, %r46, %r43; 2026-02-21T09:09:54.7787242Z mad.lo.s32 %r48, %r47, %r45, %r42; 2026-02-21T09:09:54.7787409Z shl.b32 %r49, %r48, 7; 2026-02-21T09:09:54.7787568Z cvt.s64.s32 %rd35, %r49; 2026-02-21T09:09:54.7787726Z add.s64 %rd31, %rd34, %rd35; 2026-02-21T09:09:54.7787893Z shl.b32 %r50, %r1, 2; 2026-02-21T09:09:54.7788047Z add.s32 %r34, %r33, %r50; 2026-02-21T09:09:54.7788196Z mov.b32 %r35, 0; 2026-02-21T09:09:54.7788342Z // begin inline asm 2026-02-21T09:09:54.7788527Z @%p2 st.shared.b32 [ %r34 + 0 ], %r35; 2026-02-21T09:09:54.7788708Z // end inline asm 2026-02-21T09:09:54.7788847Z bar.warp.sync -1; 2026-02-21T09:09:54.7788999Z setp.eq.b32 %p5, 
%r1, 0; 2026-02-21T09:09:54.7789150Z cvt.u64.u32 %rd16, %r33; 2026-02-21T09:09:54.7789306Z // begin inline asm 2026-02-21T09:09:54.7789552Z @%p5 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd16 + 0 ], %rd17; 2026-02-21T09:09:54.7789837Z // end inline asm 2026-02-21T09:09:54.7789974Z // begin inline asm 2026-02-21T09:09:54.7790192Z @%p5 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1; 2026-02-21T09:09:54.7790444Z // end inline asm 2026-02-21T09:09:54.7790576Z mov.b32 %r36, 32; 2026-02-21T09:09:54.7790721Z // begin inline asm 2026-02-21T09:09:54.7790996Z @%p5 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r36; 2026-02-21T09:09:54.7791264Z // end inline asm 2026-02-21T09:09:54.7791400Z mov.b32 %r37, 64; 2026-02-21T09:09:54.7791557Z // begin inline asm 2026-02-21T09:09:54.7791794Z @%p5 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r37; 2026-02-21T09:09:54.7792055Z // end inline asm 2026-02-21T09:09:54.7792196Z mov.b32 %r38, 8192; 2026-02-21T09:09:54.7792335Z // begin inline asm 2026-02-21T09:09:54.7792577Z @%p5 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r38; 2026-02-21T09:09:54.7792843Z // end inline asm 2026-02-21T09:09:54.7792983Z mov.b32 %r39, 4096; 2026-02-21T09:09:54.7793128Z // begin inline asm 2026-02-21T09:09:54.7793360Z @%p5 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r39; 2026-02-21T09:09:54.7793635Z // end inline asm 2026-02-21T09:09:54.7793770Z mov.b64 %rd24, 16384; 2026-02-21T09:09:54.7793920Z // begin inline asm 2026-02-21T09:09:54.7794164Z @%p5 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd16 + 0 ], 0x0, %rd24; 2026-02-21T09:09:54.7794454Z // end inline asm 2026-02-21T09:09:54.7794588Z mov.b32 %r40, 1; 2026-02-21T09:09:54.7794734Z // begin inline asm 2026-02-21T09:09:54.7794990Z @%p5 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r40; 
2026-02-21T09:09:54.7795266Z // end inline asm 2026-02-21T09:09:54.7795408Z // begin inline asm 2026-02-21T09:09:54.7795648Z @%p5 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r40; 2026-02-21T09:09:54.7795941Z // end inline asm 2026-02-21T09:09:54.7796078Z // begin inline asm 2026-02-21T09:09:54.7796321Z @%p5 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd16 + 0 ], 0xa; 2026-02-21T09:09:54.7796598Z // end inline asm 2026-02-21T09:09:54.7796736Z // begin inline asm 2026-02-21T09:09:54.7797000Z @%p5 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:54.7797314Z // end inline asm 2026-02-21T09:09:54.7797457Z // begin inline asm 2026-02-21T09:09:54.7797696Z @%p5 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x2; 2026-02-21T09:09:54.7797979Z // end inline asm 2026-02-21T09:09:54.7798119Z // begin inline asm 2026-02-21T09:09:54.7798358Z @%p5 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:54.7798626Z // end inline asm 2026-02-21T09:09:54.7798764Z // begin inline asm 2026-02-21T09:09:54.7799153Z @%p2 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd31 + 0 ], [ %rd16 + 0 ], 0x80; 2026-02-21T09:09:54.7799534Z // end inline asm 2026-02-21T09:09:54.7799679Z // begin inline asm 2026-02-21T09:09:54.7799893Z @%p2 fence.proxy.tensormap::generic.acquire.gpu [ %rd31 + 0 ], 0x80; 2026-02-21T09:09:54.7800158Z @%p2 cp.async.bulk.commit_group ; 2026-02-21T09:09:54.7800361Z @%p2 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:54.7800545Z // end inline asm 2026-02-21T09:09:54.7800688Z bar.sync 0; 2026-02-21T09:09:54.7800833Z cvta.global.u64 %rd52, %rd31; 2026-02-21T09:09:54.7801155Z .loc 1 27 35 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:27:35 2026-02-21T09:09:54.7801460Z shl.b32 %r280, %r42, 1; 2026-02-21T09:09:54.7801766Z .loc 1 28 37 // 
cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:28:37 2026-02-21T09:09:54.7802070Z add.s32 %r51, %r280, 2; 2026-02-21T09:09:54.7802335Z .loc 1 28 49 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:28:49 2026-02-21T09:09:54.7802647Z min.s32 %r4, %r51, 16384; 2026-02-21T09:09:54.7802916Z .loc 1 29 88 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:29:88 2026-02-21T09:09:54.7803226Z setp.ge.s32 %p22, %r280, %r4; 2026-02-21T09:09:54.7803397Z @%p22 bra $L__BB0_9; 2026-02-21T09:09:54.7803600Z // %bb.1: // %.lr.ph 2026-02-21T09:09:54.7803922Z .loc 1 0 88 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:0:88 2026-02-21T09:09:54.7804263Z ld.param.b64 %rd15, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:54.7804529Z ld.param.b64 %rd14, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:54.7804742Z shr.u32 %r5, %r1, 5; 2026-02-21T09:09:54.7804907Z and.b32 %r6, %r1, 31; 2026-02-21T09:09:54.7805055Z bfe.u32 %r7, %r1, 1, 6; 2026-02-21T09:09:54.7805207Z and.b32 %r8, %r50, 4; 2026-02-21T09:09:54.7805350Z and.b32 %r9, %r1, 32; 2026-02-21T09:09:54.7805607Z .loc 1 51 48 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:51:48 2026-02-21T09:09:54.7805894Z and.b32 %r52, %r1, 96; 2026-02-21T09:09:54.7806050Z add.s32 %r168, %r279, 1048576; 2026-02-21T09:09:54.7806220Z shl.b32 %r53, %r1, 1; 2026-02-21T09:09:54.7806364Z and.b32 %r54, %r53, 30; 2026-02-21T09:09:54.7806526Z bfe.u32 %r55, %r1, 4, 1; 2026-02-21T09:09:54.7806677Z or.b32 %r11, %r54, %r55; 2026-02-21T09:09:54.7806835Z shl.b32 %r56, %r6, 2; 2026-02-21T09:09:54.7806975Z shr.u32 %r57, %r9, 4; 2026-02-21T09:09:54.7807123Z and.b32 %r58, %r1, 64; 2026-02-21T09:09:54.7807270Z bfe.u32 %r59, %r1, 6, 1; 2026-02-21T09:09:54.7807427Z add.s32 %r61, %r33, 4096; 2026-02-21T09:09:54.7807587Z add.s32 %r62, %r61, %r57; 2026-02-21T09:09:54.7807736Z add.s32 %r63, %r62, %r59; 2026-02-21T09:09:54.7807893Z add.s32 %r12, %r63, %r56; 2026-02-21T09:09:54.7808039Z shr.u32 %r64, %r58, 5; 
2026-02-21T09:09:54.7808196Z add.s32 %r65, %r61, %r56; 2026-02-21T09:09:54.7808342Z add.s32 %r13, %r65, %r64; 2026-02-21T09:09:54.7808496Z shl.b32 %r66, %r1, 5; 2026-02-21T09:09:54.7808639Z and.b32 %r67, %r66, 864; 2026-02-21T09:09:54.7808794Z shr.u32 %r68, %r52, 3; 2026-02-21T09:09:54.7808940Z bfe.s32 %r69, %r1, 2, 1; 2026-02-21T09:09:54.7809094Z and.b32 %r70, %r69, 144; 2026-02-21T09:09:54.7809248Z or.b32 %r71, %r68, %r67; 2026-02-21T09:09:54.7809397Z or.b32 %r72, %r71, %r70; 2026-02-21T09:09:54.7809578Z add.s32 %r14, %r61, %r72; 2026-02-21T09:09:54.7809727Z xor.b32 %r73, %r72, 16; 2026-02-21T09:09:54.7809879Z add.s32 %r15, %r61, %r73; 2026-02-21T09:09:54.7810032Z add.s32 %r169, %r279, 1048608; 2026-02-21T09:09:54.7810201Z bfe.u32 %r74, %r61, 4, 14; 2026-02-21T09:09:54.7810356Z cvt.u64.u32 %rd36, %r74; 2026-02-21T09:09:54.7810530Z or.b64 %rd50, %rd36, -4611685949703716864; 2026-02-21T09:09:54.7810709Z add.s32 %r225, %r279, 32; 2026-02-21T09:09:54.7810862Z shl.b32 %r75, %r1, 6; 2026-02-21T09:09:54.7811065Z and.b32 %r76, %r75, 960; 2026-02-21T09:09:54.7811211Z shl.b32 %r77, %r52, 5; 2026-02-21T09:09:54.7811360Z shl.b32 %r78, %r1, 3; 2026-02-21T09:09:54.7811501Z and.b32 %r79, %r78, 48; 2026-02-21T09:09:54.7811690Z and.b32 %r80, %r53, 32; 2026-02-21T09:09:54.7811838Z or.b32 %r81, %r77, %r79; 2026-02-21T09:09:54.7811992Z xor.b32 %r82, %r81, %r80; 2026-02-21T09:09:54.7812139Z or.b32 %r83, %r82, %r76; 2026-02-21T09:09:54.7812293Z add.s32 %r18, %r33, %r83; 2026-02-21T09:09:54.7812444Z xor.b32 %r84, %r83, 16; 2026-02-21T09:09:54.7812596Z add.s32 %r19, %r33, %r84; 2026-02-21T09:09:54.7812864Z .loc 1 29 88 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:29:88 2026-02-21T09:09:54.7813180Z shl.b32 %r85, %r52, 8; 2026-02-21T09:09:54.7813451Z .loc 1 50 125 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:50:125 2026-02-21T09:09:54.7813733Z or.b32 %r20, %r85, %r6; 2026-02-21T09:09:54.7813889Z or.b32 %r21, %r20, 32768; 2026-02-21T09:09:54.7814147Z .loc 1 29 
88 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:29:88 2026-02-21T09:09:54.7814433Z shl.b32 %r86, %r7, 10; 2026-02-21T09:09:54.7814699Z .loc 1 50 125 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:50:125 2026-02-21T09:09:54.7814985Z or.b32 %r22, %r86, %r8; 2026-02-21T09:09:54.7815143Z bra.uni $L__BB0_2; 2026-02-21T09:09:54.7815359Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:54.7815584Z $L__tmp0: 2026-02-21T09:09:54.7815870Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7816206Z // begin inline asm 2026-02-21T09:09:54.7816581Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r234, %r235, %r236, %r237, %r238, %r239, %r240, %r241, %r242, %r243, %r244, %r245, %r246, %r247, %r248, %r249}, [%r250 + 0], 16; 2026-02-21T09:09:54.7816964Z // end inline asm 2026-02-21T09:09:54.7817111Z // begin inline asm 2026-02-21T09:09:54.7817263Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:54.7817431Z // end inline asm 2026-02-21T09:09:54.7817571Z cvt.u64.u32 %rd53, %r234; 2026-02-21T09:09:54.7817733Z cvt.u64.u32 %rd54, %r235; 2026-02-21T09:09:54.7817884Z shl.b64 %rd55, %rd54, 32; 2026-02-21T09:09:54.7818044Z or.b64 %rd56, %rd53, %rd55; 2026-02-21T09:09:54.7818201Z $L__tmp1: 2026-02-21T09:09:54.7818439Z .loc 1 97 28 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:97:28 2026-02-21T09:09:54.7818734Z mov.b64 {%r254, %r255}, %rd56; 2026-02-21T09:09:54.7818906Z cvt.rn.bf16x2.f32 %r256, %r255, %r254; 2026-02-21T09:09:54.7819083Z $L__tmp2: 2026-02-21T09:09:54.7819362Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7819695Z cvt.u64.u32 %rd57, %r236; 2026-02-21T09:09:54.7819847Z cvt.u64.u32 %rd58, %r237; 2026-02-21T09:09:54.7820005Z shl.b64 %rd59, %rd58, 32; 2026-02-21T09:09:54.7820163Z or.b64 %rd60, %rd57, %rd59; 2026-02-21T09:09:54.7820310Z $L__tmp3: 2026-02-21T09:09:54.7820548Z .loc 1 97 28 // 
cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:97:28 2026-02-21T09:09:54.7820826Z mov.b64 {%r257, %r258}, %rd60; 2026-02-21T09:09:54.7820998Z cvt.rn.bf16x2.f32 %r259, %r258, %r257; 2026-02-21T09:09:54.7821163Z $L__tmp4: 2026-02-21T09:09:54.7821446Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7821830Z cvt.u64.u32 %rd61, %r238; 2026-02-21T09:09:54.7821982Z cvt.u64.u32 %rd62, %r239; 2026-02-21T09:09:54.7822137Z shl.b64 %rd63, %rd62, 32; 2026-02-21T09:09:54.7822287Z or.b64 %rd64, %rd61, %rd63; 2026-02-21T09:09:54.7822442Z $L__tmp5: 2026-02-21T09:09:54.7822673Z .loc 1 97 28 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:97:28 2026-02-21T09:09:54.7822962Z mov.b64 {%r260, %r261}, %rd64; 2026-02-21T09:09:54.7823158Z cvt.rn.bf16x2.f32 %r262, %r261, %r260; 2026-02-21T09:09:54.7823333Z $L__tmp6: 2026-02-21T09:09:54.7823609Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7823936Z cvt.u64.u32 %rd65, %r240; 2026-02-21T09:09:54.7824092Z cvt.u64.u32 %rd66, %r241; 2026-02-21T09:09:54.7824242Z shl.b64 %rd67, %rd66, 32; 2026-02-21T09:09:54.7824399Z or.b64 %rd68, %rd65, %rd67; 2026-02-21T09:09:54.7824547Z $L__tmp7: 2026-02-21T09:09:54.7824781Z .loc 1 97 28 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:97:28 2026-02-21T09:09:54.7825084Z mov.b64 {%r263, %r264}, %rd68; 2026-02-21T09:09:54.7825258Z cvt.rn.bf16x2.f32 %r265, %r264, %r263; 2026-02-21T09:09:54.7825431Z $L__tmp8: 2026-02-21T09:09:54.7825712Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7826055Z cvt.u64.u32 %rd69, %r242; 2026-02-21T09:09:54.7826206Z cvt.u64.u32 %rd70, %r243; 2026-02-21T09:09:54.7826362Z shl.b64 %rd71, %rd70, 32; 2026-02-21T09:09:54.7826513Z or.b64 %rd72, %rd69, %rd71; 2026-02-21T09:09:54.7826670Z $L__tmp9: 2026-02-21T09:09:54.7826895Z .loc 1 97 28 
// cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:97:28 2026-02-21T09:09:54.7827181Z mov.b64 {%r266, %r267}, %rd72; 2026-02-21T09:09:54.7827381Z cvt.rn.bf16x2.f32 %r268, %r267, %r266; 2026-02-21T09:09:54.7827550Z $L__tmp10: 2026-02-21T09:09:54.7827837Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7828161Z cvt.u64.u32 %rd73, %r244; 2026-02-21T09:09:54.7828315Z cvt.u64.u32 %rd74, %r245; 2026-02-21T09:09:54.7828465Z shl.b64 %rd75, %rd74, 32; 2026-02-21T09:09:54.7828623Z or.b64 %rd76, %rd73, %rd75; 2026-02-21T09:09:54.7828771Z $L__tmp11: 2026-02-21T09:09:54.7829012Z .loc 1 97 28 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:97:28 2026-02-21T09:09:54.7829303Z mov.b64 {%r269, %r270}, %rd76; 2026-02-21T09:09:54.7829470Z cvt.rn.bf16x2.f32 %r271, %r270, %r269; 2026-02-21T09:09:54.7829645Z $L__tmp12: 2026-02-21T09:09:54.7829918Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7830241Z cvt.u64.u32 %rd77, %r246; 2026-02-21T09:09:54.7830391Z cvt.u64.u32 %rd78, %r247; 2026-02-21T09:09:54.7830547Z shl.b64 %rd79, %rd78, 32; 2026-02-21T09:09:54.7830695Z or.b64 %rd80, %rd77, %rd79; 2026-02-21T09:09:54.7830848Z $L__tmp13: 2026-02-21T09:09:54.7831083Z .loc 1 97 28 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:97:28 2026-02-21T09:09:54.7831361Z mov.b64 {%r272, %r273}, %rd80; 2026-02-21T09:09:54.7831579Z cvt.rn.bf16x2.f32 %r274, %r273, %r272; 2026-02-21T09:09:54.7831748Z $L__tmp14: 2026-02-21T09:09:54.7832028Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7832347Z cvt.u64.u32 %rd81, %r248; 2026-02-21T09:09:54.7832503Z cvt.u64.u32 %rd82, %r249; 2026-02-21T09:09:54.7832659Z shl.b64 %rd83, %rd82, 32; 2026-02-21T09:09:54.7832808Z or.b64 %rd84, %rd81, %rd83; 2026-02-21T09:09:54.7832963Z $L__tmp15: 2026-02-21T09:09:54.7833196Z .loc 
1 97 28 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:97:28 2026-02-21T09:09:54.7833517Z mov.b64 {%r275, %r276}, %rd84; 2026-02-21T09:09:54.7833683Z cvt.rn.bf16x2.f32 %r277, %r276, %r275; 2026-02-21T09:09:54.7833970Z .loc 1 98 43 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:98:43 2026-02-21T09:09:54.7834270Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:54.7834446Z bar.sync 0; 2026-02-21T09:09:54.7834620Z st.shared.v4.b32 [%r18], {%r256, %r259, %r262, %r265}; 2026-02-21T09:09:54.7834852Z st.shared.v4.b32 [%r19], {%r268, %r271, %r274, %r277}; 2026-02-21T09:09:54.7835086Z // begin inline asm 2026-02-21T09:09:54.7835244Z fence.proxy.async.shared::cta; 2026-02-21T09:09:54.7835415Z // end inline asm 2026-02-21T09:09:54.7835548Z bar.sync 0; 2026-02-21T09:09:54.7835697Z elect.sync %r278|%p47, -1; 2026-02-21T09:09:54.7835864Z and.pred %p45, %p2, %p47; 2026-02-21T09:09:54.7836030Z // begin inline asm 2026-02-21T09:09:54.7836298Z @%p45 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd52, {%r25, %r26}], [%r33]; 2026-02-21T09:09:54.7836577Z // end inline asm 2026-02-21T09:09:54.7836732Z cp.async.bulk.commit_group; 2026-02-21T09:09:54.7837025Z .loc 1 29 88 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:29:88 2026-02-21T09:09:54.7837315Z add.s32 %r280, %r280, 1; 2026-02-21T09:09:54.7837471Z setp.ne.b32 %p48, %r280, %r4; 2026-02-21T09:09:54.7837642Z @%p48 bra $L__BB0_2; 2026-02-21T09:09:54.7837786Z bra.uni $L__BB0_9; 2026-02-21T09:09:54.7837979Z $L__BB0_2: // =>This Loop Header: Depth=1 2026-02-21T09:09:54.7838224Z // Child Loop BB0_3 Depth 2 2026-02-21T09:09:54.7838530Z .loc 1 35 35 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:35:35 2026-02-21T09:09:54.7838833Z shr.s32 %r104, %r280, 31; 2026-02-21T09:09:54.7838989Z shr.u32 %r105, %r104, 22; 2026-02-21T09:09:54.7839185Z add.s32 %r106, %r280, %r105; 2026-02-21T09:09:54.7839353Z shr.s32 %r107, %r106, 10; 2026-02-21T09:09:54.7839626Z .loc 1 38 45 // 
cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:38:45 2026-02-21T09:09:54.7839930Z and.b32 %r108, %r106, 64512; 2026-02-21T09:09:54.7840094Z sub.s32 %r109, %r280, %r108; 2026-02-21T09:09:54.7840371Z .loc 1 38 64 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:38:64 2026-02-21T09:09:54.7840668Z cvt.u16.u32 %rs1, %r109; 2026-02-21T09:09:54.7840944Z .loc 1 39 51 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:39:51 2026-02-21T09:09:54.7841234Z shr.s16 %rs2, %rs1, 15; 2026-02-21T09:09:54.7841399Z shr.u16 %rs3, %rs2, 12; 2026-02-21T09:09:54.7841583Z add.s16 %rs4, %rs1, %rs3; 2026-02-21T09:09:54.7841752Z shr.s16 %rs5, %rs4, 4; 2026-02-21T09:09:54.7842028Z .loc 1 38 64 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:38:64 2026-02-21T09:09:54.7842322Z and.b16 %rs6, %rs4, -16; 2026-02-21T09:09:54.7842489Z sub.s16 %rs7, %rs1, %rs6; 2026-02-21T09:09:54.7842749Z .loc 1 39 51 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:39:51 2026-02-21T09:09:54.7843047Z cvt.u32.u16 %r110, %rs5; 2026-02-21T09:09:54.7843310Z .loc 1 40 27 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:40:27 2026-02-21T09:09:54.7843606Z shl.b32 %r111, %r107, 9; 2026-02-21T09:09:54.7843771Z mul.wide.s16 %r112, %rs7, 32; 2026-02-21T09:09:54.7843939Z add.s32 %r25, %r112, %r111; 2026-02-21T09:09:54.7844217Z .loc 1 42 27 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:42:27 2026-02-21T09:09:54.7844509Z mul.wide.s16 %r26, %rs5, 64; 2026-02-21T09:09:54.7844675Z $L__tmp16: 2026-02-21T09:09:54.7844971Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7845333Z shfl.sync.idx.b32 %r27, %r5, 0, 31, -1; 2026-02-21T09:09:54.7845558Z shl.b32 %r113, %r27, 21; 2026-02-21T09:09:54.7845717Z and.b32 %r114, %r113, 6291456; 2026-02-21T09:09:54.7845891Z add.s32 %r250, %r114, %r279; 2026-02-21T09:09:54.7846057Z mov.pred %p25, -1; 2026-02-21T09:09:54.7846214Z mov.b32 %r88, 0; 
2026-02-21T09:09:54.7846358Z // begin inline asm 2026-02-21T09:09:54.7846728Z @%p25 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r250 + 0], 16, {%r88, %r88, %r88, %r88, %r88, %r88, %r88, %r88, %r88, %r88, %r88, %r88, %r88, %r88, %r88, %r88}; 2026-02-21T09:09:54.7847123Z // end inline asm 2026-02-21T09:09:54.7847270Z // begin inline asm 2026-02-21T09:09:54.7847435Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:54.7847600Z // end inline asm 2026-02-21T09:09:54.7847744Z bar.sync 0; 2026-02-21T09:09:54.7847878Z add.s32 %r195, %r114, %r168; 2026-02-21T09:09:54.7848041Z add.s32 %r157, %r114, %r169; 2026-02-21T09:09:54.7848193Z add.s32 %r213, %r114, %r225; 2026-02-21T09:09:54.7848346Z $L__tmp17: 2026-02-21T09:09:54.7848585Z .loc 1 50 125 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:50:125 2026-02-21T09:09:54.7848885Z add.s32 %r115, %r21, %r111; 2026-02-21T09:09:54.7849040Z add.s32 %r116, %r115, %r112; 2026-02-21T09:09:54.7849226Z cvt.s64.s32 %rd38, %r116; 2026-02-21T09:09:54.7849389Z add.s64 %rd87, %rd15, %rd38; 2026-02-21T09:09:54.7849543Z shl.b32 %r117, %r110, 16; 2026-02-21T09:09:54.7849703Z or.b32 %r118, %r22, %r117; 2026-02-21T09:09:54.7849866Z mad.wide.s32 %rd86, %r118, 2, %rd14; 2026-02-21T09:09:54.7850051Z add.s32 %r119, %r20, %r111; 2026-02-21T09:09:54.7850206Z add.s32 %r120, %r119, %r112; 2026-02-21T09:09:54.7850366Z cvt.s64.s32 %rd39, %r120; 2026-02-21T09:09:54.7850516Z add.s64 %rd85, %rd15, %rd39; 2026-02-21T09:09:54.7850676Z mov.pred %p50, 0; 2026-02-21T09:09:54.7850818Z mov.b64 %rd88, -8; 2026-02-21T09:09:54.7850965Z bra.uni $L__BB0_3; 2026-02-21T09:09:54.7851182Z $L__BB0_7: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:09:54.7851393Z $L__tmp18: 2026-02-21T09:09:54.7851715Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7852042Z // begin inline asm 2026-02-21T09:09:54.7852184Z 2026-02-21T09:09:54.7852298Z { 2026-02-21T09:09:54.7852429Z .reg .pred complete; 
2026-02-21T09:09:54.7852572Z waitLoop: 2026-02-21T09:09:54.7852760Z mbarrier.try_wait.parity.shared.b64 complete, [%r230], %r175; 2026-02-21T09:09:54.7853001Z @!complete bra.uni waitLoop; 2026-02-21T09:09:54.7853154Z } 2026-02-21T09:09:54.7853219Z 2026-02-21T09:09:54.7853282Z // end inline asm 2026-02-21T09:09:54.7853416Z bar.sync 0; 2026-02-21T09:09:54.7853555Z // begin inline asm 2026-02-21T09:09:54.7853721Z @%p5 mbarrier.inval.shared::cta.b64 [%r230]; 2026-02-21T09:09:54.7853915Z // end inline asm 2026-02-21T09:09:54.7854046Z $L__tmp19: 2026-02-21T09:09:54.7854296Z .loc 1 50 125 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:50:125 2026-02-21T09:09:54.7854591Z add.s64 %rd88, %rd88, 8; 2026-02-21T09:09:54.7854745Z add.s64 %rd87, %rd87, 65536; 2026-02-21T09:09:54.7854910Z add.s64 %rd86, %rd86, 32; 2026-02-21T09:09:54.7855063Z add.s64 %rd85, %rd85, 65536; 2026-02-21T09:09:54.7855232Z setp.lt.u64 %p44, %rd88, 504; 2026-02-21T09:09:54.7855394Z @%p44 bra $L__BB0_3; 2026-02-21T09:09:54.7855544Z bra.uni $L__BB0_8; 2026-02-21T09:09:54.7855724Z $L__BB0_3: // Parent Loop BB0_2 Depth=1 2026-02-21T09:09:54.7855974Z // => This Inner Loop Header: Depth=2 2026-02-21T09:09:54.7856306Z .loc 1 0 125 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:0:125 2026-02-21T09:09:54.7856599Z setp.ne.b32 %p28, %r27, 0; 2026-02-21T09:09:54.7856876Z .loc 1 81 38 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:81:38 2026-02-21T09:09:54.7857163Z setp.eq.b32 %p29, %r9, 0; 2026-02-21T09:09:54.7857432Z .loc 1 58 80 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:58:80 2026-02-21T09:09:54.7857765Z // begin inline asm 2026-02-21T09:09:54.7857913Z mov.u64 %rd40, 0x0; 2026-02-21T09:09:54.7858112Z createpolicy.fractional.L2::evict_first.b64 %rd40, 1.0; 2026-02-21T09:09:54.7858325Z // end inline asm 2026-02-21T09:09:54.7858469Z // begin inline asm 2026-02-21T09:09:54.7858606Z mov.u32 %r121, 0x0; 2026-02-21T09:09:54.7858746Z mov.u32 %r122, 0x0; 
2026-02-21T09:09:54.7858974Z ld.global.L1::evict_first.L2::cache_hint.v2.b32 { %r121, %r122 }, [ %rd86 + 0 ], %rd40; 2026-02-21T09:09:54.7859272Z // end inline asm 2026-02-21T09:09:54.7859405Z $L__tmp20: 2026-02-21T09:09:54.7859698Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7860055Z shfl.sync.idx.b32 %r163, %r121, %r11, 31, -1; 2026-02-21T09:09:54.7860264Z shfl.sync.idx.b32 %r164, %r122, %r11, 31, -1; 2026-02-21T09:09:54.7860466Z mov.b32 {%rs9, %rs10}, %r163; 2026-02-21T09:09:54.7860634Z mov.b32 {%rs11, %rs12}, %r164; 2026-02-21T09:09:54.7860798Z $L__tmp21: 2026-02-21T09:09:54.7861067Z .loc 1 62 32 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:62:32 2026-02-21T09:09:54.7861358Z cvt.f32.bf16 %r158, %rs9; 2026-02-21T09:09:54.7861514Z cvt.f32.bf16 %r159, %rs10; 2026-02-21T09:09:54.7861703Z cvt.f32.bf16 %r160, %rs11; 2026-02-21T09:09:54.7861864Z cvt.f32.bf16 %r161, %rs12; 2026-02-21T09:09:54.7862123Z .loc 1 64 87 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:64:87 2026-02-21T09:09:54.7862414Z // begin inline asm 2026-02-21T09:09:54.7862555Z mov.u16 %rs8, 0x0; 2026-02-21T09:09:54.7862716Z ld.global.b8 { %rs8 }, [ %rd85 + 0 ]; 2026-02-21T09:09:54.7862884Z // end inline asm 2026-02-21T09:09:54.7863031Z cvt.s16.s8 %rs13, %rs8; 2026-02-21T09:09:54.7863313Z .loc 1 69 25 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:69:25 2026-02-21T09:09:54.7863601Z shl.b16 %rs14, %rs8, 12; 2026-02-21T09:09:54.7863764Z shr.s16 %rs15, %rs14, 12; 2026-02-21T09:09:54.7864025Z .loc 1 72 28 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:72:28 2026-02-21T09:09:54.7864313Z shr.u16 %rs16, %rs13, 4; 2026-02-21T09:09:54.7864570Z .loc 1 80 58 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:80:58 2026-02-21T09:09:54.7864858Z st.shared.b8 [%r12], %rs15; 2026-02-21T09:09:54.7865017Z bar.sync 0; 2026-02-21T09:09:54.7865173Z ld.shared.v2.b8 {%rs17, %rs18}, [%r13]; 
2026-02-21T09:09:54.7865462Z .loc 1 82 58 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:82:58 2026-02-21T09:09:54.7865732Z bar.sync 0; 2026-02-21T09:09:54.7865873Z st.shared.b8 [%r12], %rs16; 2026-02-21T09:09:54.7866027Z bar.sync 0; 2026-02-21T09:09:54.7866175Z ld.shared.v2.b8 {%rs19, %rs20}, [%r13]; 2026-02-21T09:09:54.7866364Z selp.b16 %rs21, %rs17, %rs19, %p29; 2026-02-21T09:09:54.7866547Z cvt.s16.s8 %rs22, %rs21; 2026-02-21T09:09:54.7866710Z selp.b16 %rs23, %rs18, %rs20, %p29; 2026-02-21T09:09:54.7866888Z cvt.s16.s8 %rs24, %rs23; 2026-02-21T09:09:54.7867153Z .loc 1 87 32 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:87:32 2026-02-21T09:09:54.7867445Z cvt.rn.f32.s16 %r165, %rs22; 2026-02-21T09:09:54.7867621Z cvt.rn.f32.s16 %r166, %rs24; 2026-02-21T09:09:54.7867777Z bar.sync 0; 2026-02-21T09:09:54.7867921Z st.shared.b32 [%r14], %r165; 2026-02-21T09:09:54.7868082Z st.shared.b32 [%r15], %r166; 2026-02-21T09:09:54.7868241Z $L__tmp22: 2026-02-21T09:09:54.7868528Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7868863Z // begin inline asm 2026-02-21T09:09:54.7869240Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r123, %r124, %r125, %r126, %r127, %r128, %r129, %r130, %r131, %r132, %r133, %r134, %r135, %r136, %r137, %r138}, [%r250 + 0], 16; 2026-02-21T09:09:54.7869678Z // end inline asm 2026-02-21T09:09:54.7869822Z // begin inline asm 2026-02-21T09:09:54.7869974Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:54.7870143Z // end inline asm 2026-02-21T09:09:54.7870277Z // begin inline asm 2026-02-21T09:09:54.7870658Z @%p25 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r195 + 0], 16, {%r123, %r124, %r125, %r126, %r127, %r128, %r129, %r130, %r131, %r132, %r133, %r134, %r135, %r136, %r137, %r138}; 2026-02-21T09:09:54.7871059Z // end inline asm 2026-02-21T09:09:54.7871226Z // begin inline asm 2026-02-21T09:09:54.7871384Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:54.7871575Z 
// end inline asm 2026-02-21T09:09:54.7871716Z bar.sync 0; 2026-02-21T09:09:54.7871845Z // begin inline asm 2026-02-21T09:09:54.7872093Z @%p25 tcgen05.st.sync.aligned.16x32bx2.x4.b32 [%r157 + 0], 4, {%r158, %r159, %r160, %r161}; 2026-02-21T09:09:54.7872365Z // end inline asm 2026-02-21T09:09:54.7872508Z // begin inline asm 2026-02-21T09:09:54.7872661Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:54.7872817Z // end inline asm 2026-02-21T09:09:54.7872954Z bar.sync 0; 2026-02-21T09:09:54.7873080Z // begin inline asm 2026-02-21T09:09:54.7873262Z fence.proxy.async.shared::cta; 2026-02-21T09:09:54.7873425Z // end inline asm 2026-02-21T09:09:54.7873571Z add.s32 %r230, %r33, 5120; 2026-02-21T09:09:54.7873727Z // begin inline asm 2026-02-21T09:09:54.7873902Z @%p5 mbarrier.init.shared::cta.b64 [%r230], 1; 2026-02-21T09:09:54.7874094Z // end inline asm 2026-02-21T09:09:54.7874237Z bar.sync 0; 2026-02-21T09:09:54.7874378Z @%p28 bra $L__BB0_5; 2026-02-21T09:09:54.7874568Z // %bb.4: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:09:54.7874799Z elect.sync %r171|%p31, -1; 2026-02-21T09:09:54.7874965Z mov.b32 %r170, 67635472; 2026-02-21T09:09:54.7875126Z // begin inline asm 2026-02-21T09:09:54.7875397Z @%p31 tcgen05.mma.cta_group::1.kind::tf32 [ %r168 + 0 ], [ %r169 + 0 ], %rd50, %r170, %p50; 2026-02-21T09:09:54.7875667Z // end inline asm 2026-02-21T09:09:54.7875805Z cvt.u64.u32 %rd45, %r230; 2026-02-21T09:09:54.7875966Z // begin inline asm 2026-02-21T09:09:54.7876181Z @%p31 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd45]; 2026-02-21T09:09:54.7876406Z // end inline asm 2026-02-21T09:09:54.7876590Z $L__BB0_5: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:09:54.7876834Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:54.7877036Z mov.b32 %r175, 0; 2026-02-21T09:09:54.7877335Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7877672Z // begin inline asm 2026-02-21T09:09:54.7877817Z 
2026-02-21T09:09:54.7877934Z { 2026-02-21T09:09:54.7878068Z .reg .pred complete; 2026-02-21T09:09:54.7878212Z waitLoop: 2026-02-21T09:09:54.7878404Z mbarrier.try_wait.parity.shared.b64 complete, [%r230], %r175; 2026-02-21T09:09:54.7878640Z @!complete bra.uni waitLoop; 2026-02-21T09:09:54.7878805Z } 2026-02-21T09:09:54.7878869Z 2026-02-21T09:09:54.7878925Z // end inline asm 2026-02-21T09:09:54.7879065Z bar.sync 0; 2026-02-21T09:09:54.7879195Z // begin inline asm 2026-02-21T09:09:54.7879366Z @%p5 mbarrier.inval.shared::cta.b64 [%r230]; 2026-02-21T09:09:54.7879558Z // end inline asm 2026-02-21T09:09:54.7879689Z $L__tmp23: 2026-02-21T09:09:54.7879936Z .loc 1 58 80 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:58:80 2026-02-21T09:09:54.7880219Z add.s64 %rd47, %rd86, 16; 2026-02-21T09:09:54.7880381Z // begin inline asm 2026-02-21T09:09:54.7880519Z mov.u64 %rd46, 0x0; 2026-02-21T09:09:54.7880718Z createpolicy.fractional.L2::evict_first.b64 %rd46, 1.0; 2026-02-21T09:09:54.7880928Z // end inline asm 2026-02-21T09:09:54.7881073Z // begin inline asm 2026-02-21T09:09:54.7881210Z mov.u32 %r177, 0x0; 2026-02-21T09:09:54.7881351Z mov.u32 %r178, 0x0; 2026-02-21T09:09:54.7881608Z ld.global.L1::evict_first.L2::cache_hint.v2.b32 { %r177, %r178 }, [ %rd47 + 0 ], %rd46; 2026-02-21T09:09:54.7881897Z // end inline asm 2026-02-21T09:09:54.7882033Z $L__tmp24: 2026-02-21T09:09:54.7882324Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7882697Z shfl.sync.idx.b32 %r220, %r177, %r11, 31, -1; 2026-02-21T09:09:54.7882915Z shfl.sync.idx.b32 %r221, %r178, %r11, 31, -1; 2026-02-21T09:09:54.7883121Z mov.b32 {%rs26, %rs27}, %r220; 2026-02-21T09:09:54.7883302Z mov.b32 {%rs28, %rs29}, %r221; 2026-02-21T09:09:54.7883490Z $L__tmp25: 2026-02-21T09:09:54.7883741Z .loc 1 62 32 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:62:32 2026-02-21T09:09:54.7884039Z cvt.f32.bf16 %r214, %rs26; 
2026-02-21T09:09:54.7884209Z cvt.f32.bf16 %r215, %rs27; 2026-02-21T09:09:54.7884366Z cvt.f32.bf16 %r216, %rs28; 2026-02-21T09:09:54.7884528Z cvt.f32.bf16 %r217, %rs29; 2026-02-21T09:09:54.7884797Z .loc 1 64 87 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:64:87 2026-02-21T09:09:54.7885100Z // begin inline asm 2026-02-21T09:09:54.7885252Z mov.u16 %rs25, 0x0; 2026-02-21T09:09:54.7885441Z ld.global.b8 { %rs25 }, [ %rd87 + 0 ]; 2026-02-21T09:09:54.7885631Z // end inline asm 2026-02-21T09:09:54.7885778Z cvt.s16.s8 %rs30, %rs25; 2026-02-21T09:09:54.7886057Z .loc 1 69 25 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:69:25 2026-02-21T09:09:54.7886349Z shl.b16 %rs31, %rs25, 12; 2026-02-21T09:09:54.7886519Z shr.s16 %rs32, %rs31, 12; 2026-02-21T09:09:54.7886782Z .loc 1 72 28 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:72:28 2026-02-21T09:09:54.7887081Z shr.u16 %rs33, %rs30, 4; 2026-02-21T09:09:54.7887351Z .loc 1 80 58 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:80:58 2026-02-21T09:09:54.7887650Z st.shared.b8 [%r12], %rs32; 2026-02-21T09:09:54.7887854Z bar.sync 0; 2026-02-21T09:09:54.7888012Z ld.shared.v2.b8 {%rs34, %rs35}, [%r13]; 2026-02-21T09:09:54.7888347Z .loc 1 82 58 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:82:58 2026-02-21T09:09:54.7888631Z bar.sync 0; 2026-02-21T09:09:54.7888778Z st.shared.b8 [%r12], %rs33; 2026-02-21T09:09:54.7888946Z bar.sync 0; 2026-02-21T09:09:54.7889093Z ld.shared.v2.b8 {%rs36, %rs37}, [%r13]; 2026-02-21T09:09:54.7889294Z selp.b16 %rs38, %rs34, %rs36, %p29; 2026-02-21T09:09:54.7889473Z cvt.s16.s8 %rs39, %rs38; 2026-02-21T09:09:54.7889650Z selp.b16 %rs40, %rs35, %rs37, %p29; 2026-02-21T09:09:54.7889827Z cvt.s16.s8 %rs41, %rs40; 2026-02-21T09:09:54.7890104Z .loc 1 87 32 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:87:32 2026-02-21T09:09:54.7890384Z cvt.rn.f32.s16 %r222, %rs39; 2026-02-21T09:09:54.7890552Z cvt.rn.f32.s16 %r223, %rs41; 2026-02-21T09:09:54.7890709Z 
bar.sync 0; 2026-02-21T09:09:54.7890843Z st.shared.b32 [%r14], %r222; 2026-02-21T09:09:54.7891008Z st.shared.b32 [%r15], %r223; 2026-02-21T09:09:54.7891156Z $L__tmp26: 2026-02-21T09:09:54.7891440Z .loc 2 291 36 // standard.py:291:36 @[ cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:94:40 ] 2026-02-21T09:09:54.7891788Z // begin inline asm 2026-02-21T09:09:54.7892158Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r179, %r180, %r181, %r182, %r183, %r184, %r185, %r186, %r187, %r188, %r189, %r190, %r191, %r192, %r193, %r194}, [%r195 + 0], 16; 2026-02-21T09:09:54.7892552Z // end inline asm 2026-02-21T09:09:54.7892691Z // begin inline asm 2026-02-21T09:09:54.7892852Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:54.7893017Z // end inline asm 2026-02-21T09:09:54.7893162Z mov.pred %p50, -1; 2026-02-21T09:09:54.7893304Z // begin inline asm 2026-02-21T09:09:54.7893674Z @%p50 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r250 + 0], 16, {%r179, %r180, %r181, %r182, %r183, %r184, %r185, %r186, %r187, %r188, %r189, %r190, %r191, %r192, %r193, %r194}; 2026-02-21T09:09:54.7894092Z // end inline asm 2026-02-21T09:09:54.7894232Z // begin inline asm 2026-02-21T09:09:54.7894388Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:54.7894549Z // end inline asm 2026-02-21T09:09:54.7894684Z bar.sync 0; 2026-02-21T09:09:54.7894810Z // begin inline asm 2026-02-21T09:09:54.7895057Z @%p50 tcgen05.st.sync.aligned.16x32bx2.x4.b32 [%r213 + 0], 4, {%r214, %r215, %r216, %r217}; 2026-02-21T09:09:54.7895331Z // end inline asm 2026-02-21T09:09:54.7895472Z // begin inline asm 2026-02-21T09:09:54.7895650Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:54.7895816Z // end inline asm 2026-02-21T09:09:54.7895948Z bar.sync 0; 2026-02-21T09:09:54.7896081Z // begin inline asm 2026-02-21T09:09:54.7896245Z fence.proxy.async.shared::cta; 2026-02-21T09:09:54.7896414Z // end inline asm 2026-02-21T09:09:54.7896561Z // begin inline asm 2026-02-21T09:09:54.7896731Z @%p5 mbarrier.init.shared::cta.b64 [%r230], 1; 
2026-02-21T09:09:54.7896936Z // end inline asm 2026-02-21T09:09:54.7897074Z bar.sync 0; 2026-02-21T09:09:54.7897219Z @%p28 bra $L__BB0_7; 2026-02-21T09:09:54.7897407Z // %bb.6: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:09:54.7897641Z elect.sync %r227|%p40, -1; 2026-02-21T09:09:54.7897840Z mov.b32 %r226, 67635472; 2026-02-21T09:09:54.7898008Z mov.pred %p39, -1; 2026-02-21T09:09:54.7898161Z // begin inline asm 2026-02-21T09:09:54.7898397Z @%p40 tcgen05.mma.cta_group::1.kind::tf32 [ %r279 + 0 ], [ %r225 + 0 ], %rd50, %r226, %p39; 2026-02-21T09:09:54.7898672Z // end inline asm 2026-02-21T09:09:54.7898814Z cvt.u64.u32 %rd51, %r230; 2026-02-21T09:09:54.7898974Z // begin inline asm 2026-02-21T09:09:54.7899179Z @%p40 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd51]; 2026-02-21T09:09:54.7899415Z // end inline asm 2026-02-21T09:09:54.7899555Z bra.uni $L__BB0_7; 2026-02-21T09:09:54.7899690Z $L__tmp27: 2026-02-21T09:09:54.7899848Z $L__BB0_9: // %._crit_edge 2026-02-21T09:09:54.7900198Z .loc 1 29 88 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:29:88 2026-02-21T09:09:54.7900506Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:54.7900675Z bar.sync 0; 2026-02-21T09:09:54.7900920Z .loc 1 29 4 // cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py:29:4 2026-02-21T09:09:54.7901196Z bar.sync 0; 2026-02-21T09:09:54.7901332Z // begin inline asm 2026-02-21T09:09:54.7901566Z @%p2 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r279, 64; 2026-02-21T09:09:54.7901790Z // end inline asm 2026-02-21T09:09:54.7901931Z ret; 2026-02-21T09:09:54.7902052Z $L__tmp28: 2026-02-21T09:09:54.7902184Z $L__func_end0: 2026-02-21T09:09:54.7902338Z // -- End function 2026-02-21T09:09:54.7902526Z } 2026-02-21T09:09:54.7902788Z .file 1 "/tmp/torchinductor_root/dr/cdrjhvc5zpiwvndd5pn3ysg7dlkmj6cszwebi4wg52756c4o4rbc.py" 2026-02-21T09:09:54.7903225Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:54.7903522Z .section 
.debug_abbrev 2026-02-21T09:09:54.7903665Z { 2026-02-21T09:09:54.7903823Z .b8 1 // Abbreviation Code 2026-02-21T09:09:54.7904047Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:54.7904271Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:54.7904482Z .b8 37 // DW_AT_producer 2026-02-21T09:09:54.7904692Z .b8 8 // DW_FORM_string 2026-02-21T09:09:54.7904896Z .b8 19 // DW_AT_language 2026-02-21T09:09:54.7905104Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:54.7905310Z .b8 3 // DW_AT_name 2026-02-21T09:09:54.7905508Z .b8 8 // DW_FORM_string 2026-02-21T09:09:54.7905714Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:54.7905918Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:54.7906157Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:54.7906359Z .b8 8 // DW_FORM_string 2026-02-21T09:09:54.7906563Z .b8 0 // EOM(1) 2026-02-21T09:09:54.7906759Z .b8 0 // EOM(2) 2026-02-21T09:09:54.7906959Z .b8 2 // Abbreviation Code 2026-02-21T09:09:54.7907183Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:54.7907416Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:54.7907621Z .b8 3 // DW_AT_name 2026-02-21T09:09:54.7907815Z .b8 8 // DW_FORM_string 2026-02-21T09:09:54.7908023Z .b8 32 // DW_AT_inline 2026-02-21T09:09:54.7908234Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:54.7908432Z .b8 0 // EOM(1) 2026-02-21T09:09:54.7908617Z .b8 0 // EOM(2) 2026-02-21T09:09:54.7908805Z .b8 3 // Abbreviation Code 2026-02-21T09:09:54.7909036Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:54.7909235Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:54.7909433Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:54.7909640Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:54.7909848Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:54.7910059Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:54.7910273Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:54.7910491Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:54.7910685Z .b8 0 // EOM(1) 2026-02-21T09:09:54.7910908Z .b8 0 // EOM(2) 2026-02-21T09:09:54.7911111Z .b8 4 // Abbreviation Code 2026-02-21T09:09:54.7911338Z .b8 29 // DW_TAG_inlined_subroutine 
2026-02-21T09:09:54.7911589Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:54.7911802Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:54.7912013Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:54.7912209Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:54.7912413Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:54.7912608Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:54.7912812Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:54.7913016Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:54.7913217Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:54.7913424Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:54.7913620Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:54.7913827Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:54.7914027Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:54.7914225Z .b8 0 // EOM(1) 2026-02-21T09:09:54.7914413Z .b8 0 // EOM(2) 2026-02-21T09:09:54.7914595Z .b8 0 // EOM(3) 2026-02-21T09:09:54.7914769Z } 2026-02-21T09:09:54.7914892Z .section .debug_info 2026-02-21T09:09:54.7915034Z { 2026-02-21T09:09:54.7915177Z .b32 178 // Length of Unit 2026-02-21T09:09:54.7915399Z .b8 2 // DWARF version number 2026-02-21T09:09:54.7915588Z .b8 0 2026-02-21T09:09:54.7915776Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:54.7916062Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:54.7916298Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:54.7916541Z .b8 116 // DW_AT_producer 2026-02-21T09:09:54.7916725Z .b8 114 2026-02-21T09:09:54.7916851Z .b8 105 2026-02-21T09:09:54.7916969Z .b8 116 2026-02-21T09:09:54.7917089Z .b8 111 2026-02-21T09:09:54.7917203Z .b8 110 2026-02-21T09:09:54.7917324Z .b8 0 2026-02-21T09:09:54.7917464Z .b8 2 // DW_AT_language 2026-02-21T09:09:54.7917681Z .b8 0 2026-02-21T09:09:54.7917827Z .b8 99 // DW_AT_name 2026-02-21T09:09:54.7918003Z .b8 100 2026-02-21T09:09:54.7918125Z .b8 114 2026-02-21T09:09:54.7918235Z .b8 106 2026-02-21T09:09:54.7918355Z .b8 104 2026-02-21T09:09:54.7918467Z .b8 118 2026-02-21T09:09:54.7918584Z .b8 99 2026-02-21T09:09:54.7918697Z .b8 53 2026-02-21T09:09:54.7918819Z .b8 122 2026-02-21T09:09:54.7918931Z .b8 112 2026-02-21T09:09:54.7919052Z .b8 105 2026-02-21T09:09:54.7919164Z .b8 119 2026-02-21T09:09:54.7919283Z .b8 118 2026-02-21T09:09:54.7919401Z .b8 110 2026-02-21T09:09:54.7919512Z .b8 100 2026-02-21T09:09:54.7919632Z .b8 100 2026-02-21T09:09:54.7919768Z .b8 53 2026-02-21T09:09:54.7919891Z .b8 112 2026-02-21T09:09:54.7920002Z .b8 110 2026-02-21T09:09:54.7920121Z .b8 51 2026-02-21T09:09:54.7920234Z .b8 121 2026-02-21T09:09:54.7920351Z .b8 115 2026-02-21T09:09:54.7920462Z .b8 103 2026-02-21T09:09:54.7920581Z .b8 55 2026-02-21T09:09:54.7920695Z .b8 100 2026-02-21T09:09:54.7920813Z .b8 108 2026-02-21T09:09:54.7920923Z .b8 107 2026-02-21T09:09:54.7921039Z .b8 109 2026-02-21T09:09:54.7921149Z .b8 106 2026-02-21T09:09:54.7921266Z .b8 54 2026-02-21T09:09:54.7921383Z .b8 99 2026-02-21T09:09:54.7921493Z .b8 115 2026-02-21T09:09:54.7921637Z .b8 122 2026-02-21T09:09:54.7921748Z .b8 119 2026-02-21T09:09:54.7921868Z .b8 101 2026-02-21T09:09:54.7921979Z .b8 98 2026-02-21T09:09:54.7922124Z .b8 105 2026-02-21T09:09:54.7922240Z .b8 52 2026-02-21T09:09:54.7922356Z .b8 119 2026-02-21T09:09:54.7922466Z .b8 103 
2026-02-21T09:09:54.7922585Z .b8 53 2026-02-21T09:09:54.7922697Z .b8 50 2026-02-21T09:09:54.7922817Z .b8 55 2026-02-21T09:09:54.7922929Z .b8 53 2026-02-21T09:09:54.7923048Z .b8 54 2026-02-21T09:09:54.7923168Z .b8 99 2026-02-21T09:09:54.7923281Z .b8 52 2026-02-21T09:09:54.7923399Z .b8 111 2026-02-21T09:09:54.7923512Z .b8 52 2026-02-21T09:09:54.7923632Z .b8 114 2026-02-21T09:09:54.7923745Z .b8 98 2026-02-21T09:09:54.7923869Z .b8 99 2026-02-21T09:09:54.7923983Z .b8 46 2026-02-21T09:09:54.7924101Z .b8 112 2026-02-21T09:09:54.7924211Z .b8 121 2026-02-21T09:09:54.7924329Z .b8 0 2026-02-21T09:09:54.7924480Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:54.7924703Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:54.7924884Z .b8 116 2026-02-21T09:09:54.7924996Z .b8 109 2026-02-21T09:09:54.7925114Z .b8 112 2026-02-21T09:09:54.7925224Z .b8 47 2026-02-21T09:09:54.7925343Z .b8 116 2026-02-21T09:09:54.7925461Z .b8 111 2026-02-21T09:09:54.7925584Z .b8 114 2026-02-21T09:09:54.7925700Z .b8 99 2026-02-21T09:09:54.7925826Z .b8 104 2026-02-21T09:09:54.7925942Z .b8 105 2026-02-21T09:09:54.7926066Z .b8 110 2026-02-21T09:09:54.7926183Z .b8 100 2026-02-21T09:09:54.7926308Z .b8 117 2026-02-21T09:09:54.7926425Z .b8 99 2026-02-21T09:09:54.7926549Z .b8 116 2026-02-21T09:09:54.7926664Z .b8 111 2026-02-21T09:09:54.7926786Z .b8 114 2026-02-21T09:09:54.7926907Z .b8 95 2026-02-21T09:09:54.7927023Z .b8 114 2026-02-21T09:09:54.7927147Z .b8 111 2026-02-21T09:09:54.7927262Z .b8 111 2026-02-21T09:09:54.7927386Z .b8 116 2026-02-21T09:09:54.7927499Z .b8 47 2026-02-21T09:09:54.7927620Z .b8 100 2026-02-21T09:09:54.7927735Z .b8 114 2026-02-21T09:09:54.7927857Z .b8 0 2026-02-21T09:09:54.7928020Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:54.7928264Z .b8 95 // DW_AT_name 2026-02-21T09:09:54.7928446Z .b8 104 2026-02-21T09:09:54.7928609Z .b8 101 2026-02-21T09:09:54.7928732Z .b8 108 2026-02-21T09:09:54.7928849Z .b8 105 2026-02-21T09:09:54.7928972Z .b8 111 2026-02-21T09:09:54.7929088Z .b8 
110 2026-02-21T09:09:54.7929213Z .b8 95 2026-02-21T09:09:54.7929333Z .b8 109 2026-02-21T09:09:54.7929458Z .b8 97 2026-02-21T09:09:54.7929576Z .b8 116 2026-02-21T09:09:54.7929701Z .b8 109 2026-02-21T09:09:54.7929819Z .b8 117 2026-02-21T09:09:54.7929946Z .b8 108 2026-02-21T09:09:54.7930064Z .b8 95 2026-02-21T09:09:54.7930193Z .b8 98 2026-02-21T09:09:54.7930316Z .b8 102 2026-02-21T09:09:54.7930475Z .b8 49 2026-02-21T09:09:54.7930601Z .b8 54 2026-02-21T09:09:54.7930717Z .b8 95 2026-02-21T09:09:54.7930839Z .b8 105 2026-02-21T09:09:54.7930955Z .b8 110 2026-02-21T09:09:54.7931080Z .b8 116 2026-02-21T09:09:54.7931199Z .b8 52 2026-02-21T09:09:54.7931322Z .b8 0 2026-02-21T09:09:54.7931465Z .b8 1 // DW_AT_inline 2026-02-21T09:09:54.7931738Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:54.7931991Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:54.7932235Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:54.7932509Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:54.7932776Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:54.7933057Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:54.7933294Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:54.7933530Z .b64 $L__tmp27 // DW_AT_high_pc 2026-02-21T09:09:54.7933753Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:54.7933980Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:54.7934196Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:54.7934435Z .b8 0 // End Of Children Mark 2026-02-21T09:09:54.7934665Z .b8 0 // End Of Children Mark 2026-02-21T09:09:54.7934851Z } 2026-02-21T09:09:54.7934987Z .section .debug_macinfo { } 2026-02-21T09:09:54.7935092Z 2026-02-21T09:09:54.7935169Z ================================================================ 2026-02-21T09:09:54.7935408Z please share the reproducer above with Triton project. 2026-02-21T09:09:54.9319409Z [182s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:54.9319728Z 2026-02-21T09:09:54.9322157Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:54.9323182Z 2026-02-21T09:09:54.9323187Z 2026-02-21T09:09:54.9323285Z ================================================================ 2026-02-21T09:09:54.9323503Z Internal Triton PTX codegen error 2026-02-21T09:09:54.9323698Z `ptxas` stderr: 2026-02-21T09:09:54.9324134Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 308 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:54.9324624Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:54.9324775Z 2026-02-21T09:09:54.9325187Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp5uz00x7s.ptx -o /tmp/tmp5uz00x7s.ptx.o 2026-02-21T09:09:54.9325622Z 2026-02-21T09:09:54.9325626Z 2026-02-21T09:09:54.9325682Z // 2026-02-21T09:09:54.9325835Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:54.9326013Z // 2026-02-21T09:09:54.9326093Z 2026-02-21T09:09:54.9326376Z .version 8.7 2026-02-21T09:09:54.9326516Z .target sm_100a 2026-02-21T09:09:54.9326664Z .address_size 64 2026-02-21T09:09:54.9326747Z 2026-02-21T09:09:54.9326927Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:54.9327215Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:54.9327440Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:54.9327655Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:54.9327903Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:54.9328234Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:54.9328513Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:54.9328787Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:54.9329065Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:54.9329289Z ) 2026-02-21T09:09:54.9329409Z .reqntid 128 2026-02-21T09:09:54.9329543Z .maxnreg 32 2026-02-21T09:09:54.9329663Z { 2026-02-21T09:09:54.9330053Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:54.9330369Z `ptxas` stderr: 2026-02-21T09:09:54.9330786Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 308 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:54.9331273Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:54.9331416Z 2026-02-21T09:09:54.9331831Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp5uz00x7s.ptx -o /tmp/tmp5uz00x7s.ptx.o 2026-02-21T09:09:54.9332266Z 2026-02-21T09:09:54.9332386Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:54.9332631Z .reg .pred %p<99>; 2026-02-21T09:09:54.9332833Z .reg .b16 %rs<120>; 2026-02-21T09:09:54.9332991Z .reg .b32 %r<379>; 2026-02-21T09:09:54.9333132Z .reg .b64 %rd<128>; 2026-02-21T09:09:54.9333276Z $L__func_begin0: 2026-02-21T09:09:54.9333357Z 2026-02-21T09:09:54.9333410Z // %bb.0: 2026-02-21T09:09:54.9333650Z .loc 1 19 0 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:19 2026-02-21T09:09:54.9333942Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:54.9334134Z ld.param.b64 %rd17, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:54.9334359Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:54.9334559Z ld.param.b64 %rd35, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:54.9334779Z mov.b32 %r58, global_smem; 2026-02-21T09:09:54.9334938Z // begin inline asm 2026-02-21T09:09:54.9335175Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r58], 64; 2026-02-21T09:09:54.9335417Z // end inline asm 2026-02-21T09:09:54.9335598Z ld.param.b64 %rd52, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:54.9335805Z bar.sync 0; 2026-02-21T09:09:54.9335952Z ld.shared.b32 %r371, [global_smem]; 2026-02-21T09:09:54.9336129Z bar.sync 0; 2026-02-21T09:09:54.9336258Z // begin inline asm 2026-02-21T09:09:54.9336469Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:54.9336695Z // end inline asm 2026-02-21T09:09:54.9336951Z .loc 1 21 66 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:21:66 2026-02-21T09:09:54.9337240Z mov.u32 %r75, %ctaid.x; 2026-02-21T09:09:54.9337398Z 
mov.u32 %r76, %ctaid.y; 2026-02-21T09:09:54.9337556Z mov.u32 %r77, %ctaid.z; 2026-02-21T09:09:54.9337708Z mov.u32 %r78, %nctaid.x; 2026-02-21T09:09:54.9337871Z mov.u32 %r79, %nctaid.y; 2026-02-21T09:09:54.9338031Z mad.lo.s32 %r80, %r77, %r79, %r76; 2026-02-21T09:09:54.9338220Z mad.lo.s32 %r81, %r80, %r78, %r75; 2026-02-21T09:09:54.9338388Z shl.b32 %r82, %r81, 8; 2026-02-21T09:09:54.9338548Z cvt.s64.s32 %rd53, %r82; 2026-02-21T09:09:54.9338707Z add.s64 %rd31, %rd52, %rd53; 2026-02-21T09:09:54.9338918Z shl.b32 %r83, %r1, 2; 2026-02-21T09:09:54.9339066Z add.s32 %r59, %r58, %r83; 2026-02-21T09:09:54.9339221Z mov.b32 %r68, 0; 2026-02-21T09:09:54.9339363Z // begin inline asm 2026-02-21T09:09:54.9339515Z @%p1 st.shared.b32 [ %r59 + 0 ], %r68; 2026-02-21T09:09:54.9339696Z // end inline asm 2026-02-21T09:09:54.9339835Z bar.warp.sync -1; 2026-02-21T09:09:54.9339988Z setp.eq.b32 %p4, %r1, 0; 2026-02-21T09:09:54.9340144Z cvt.u64.u32 %rd16, %r58; 2026-02-21T09:09:54.9340299Z // begin inline asm 2026-02-21T09:09:54.9340547Z @%p4 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd16 + 0 ], %rd17; 2026-02-21T09:09:54.9340865Z // end inline asm 2026-02-21T09:09:54.9341001Z // begin inline asm 2026-02-21T09:09:54.9341224Z @%p4 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1; 2026-02-21T09:09:54.9341478Z // end inline asm 2026-02-21T09:09:54.9341654Z mov.b32 %r61, 32; 2026-02-21T09:09:54.9341799Z // begin inline asm 2026-02-21T09:09:54.9342032Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r61; 2026-02-21T09:09:54.9342300Z // end inline asm 2026-02-21T09:09:54.9342431Z mov.b32 %r62, 16; 2026-02-21T09:09:54.9342571Z // begin inline asm 2026-02-21T09:09:54.9342856Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r62; 2026-02-21T09:09:54.9343115Z // end inline asm 2026-02-21T09:09:54.9343255Z mov.b32 %r63, 8192; 2026-02-21T09:09:54.9343394Z // begin inline asm 2026-02-21T09:09:54.9343631Z 
@%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r63; 2026-02-21T09:09:54.9343897Z // end inline asm 2026-02-21T09:09:54.9344036Z mov.b32 %r64, 512; 2026-02-21T09:09:54.9344170Z // begin inline asm 2026-02-21T09:09:54.9344410Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r64; 2026-02-21T09:09:54.9344681Z // end inline asm 2026-02-21T09:09:54.9344816Z mov.b64 %rd24, 8192; 2026-02-21T09:09:54.9345002Z // begin inline asm 2026-02-21T09:09:54.9345251Z @%p4 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd16 + 0 ], 0x0, %rd24; 2026-02-21T09:09:54.9345539Z // end inline asm 2026-02-21T09:09:54.9345673Z mov.b32 %r65, 1; 2026-02-21T09:09:54.9345820Z // begin inline asm 2026-02-21T09:09:54.9346076Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r65; 2026-02-21T09:09:54.9346367Z // end inline asm 2026-02-21T09:09:54.9346513Z // begin inline asm 2026-02-21T09:09:54.9346761Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r65; 2026-02-21T09:09:54.9347061Z // end inline asm 2026-02-21T09:09:54.9347205Z // begin inline asm 2026-02-21T09:09:54.9347461Z @%p4 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:54.9347740Z // end inline asm 2026-02-21T09:09:54.9347898Z // begin inline asm 2026-02-21T09:09:54.9348176Z @%p4 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:54.9348473Z // end inline asm 2026-02-21T09:09:54.9348625Z // begin inline asm 2026-02-21T09:09:54.9348875Z @%p4 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1; 2026-02-21T09:09:54.9349158Z // end inline asm 2026-02-21T09:09:54.9349302Z // begin inline asm 2026-02-21T09:09:54.9349547Z @%p4 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:54.9349823Z // end inline asm 2026-02-21T09:09:54.9349968Z // 
begin inline asm 2026-02-21T09:09:54.9350331Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd31 + 0 ], [ %rd16 + 0 ], 0x80; 2026-02-21T09:09:54.9350723Z // end inline asm 2026-02-21T09:09:54.9350876Z // begin inline asm 2026-02-21T09:09:54.9351096Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd31 + 0 ], 0x80; 2026-02-21T09:09:54.9351370Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:54.9351603Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:54.9351824Z // end inline asm 2026-02-21T09:09:54.9351970Z bar.sync 0; 2026-02-21T09:09:54.9352118Z cvta.global.u64 %rd79, %rd31; 2026-02-21T09:09:54.9352415Z .loc 1 23 67 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:23:67 2026-02-21T09:09:54.9352720Z add.s64 %rd49, %rd31, 128; 2026-02-21T09:09:54.9352891Z bar.sync 0; 2026-02-21T09:09:54.9353027Z // begin inline asm 2026-02-21T09:09:54.9353187Z @%p1 st.shared.b32 [ %r59 + 0 ], %r68; 2026-02-21T09:09:54.9353367Z // end inline asm 2026-02-21T09:09:54.9353544Z bar.warp.sync -1; 2026-02-21T09:09:54.9353699Z // begin inline asm 2026-02-21T09:09:54.9353954Z @%p4 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd16 + 0 ], %rd35; 2026-02-21T09:09:54.9354241Z // end inline asm 2026-02-21T09:09:54.9354381Z // begin inline asm 2026-02-21T09:09:54.9354622Z @%p4 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1; 2026-02-21T09:09:54.9354862Z // end inline asm 2026-02-21T09:09:54.9355003Z // begin inline asm 2026-02-21T09:09:54.9355232Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r61; 2026-02-21T09:09:54.9355495Z // end inline asm 2026-02-21T09:09:54.9355631Z mov.b32 %r70, 64; 2026-02-21T09:09:54.9355796Z // begin inline asm 2026-02-21T09:09:54.9356032Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r70; 2026-02-21T09:09:54.9356287Z // end inline asm 2026-02-21T09:09:54.9356426Z // begin inline asm 
2026-02-21T09:09:54.9356655Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r63; 2026-02-21T09:09:54.9356924Z // end inline asm 2026-02-21T09:09:54.9357064Z mov.b32 %r72, 4096; 2026-02-21T09:09:54.9357204Z // begin inline asm 2026-02-21T09:09:54.9357442Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r72; 2026-02-21T09:09:54.9357701Z // end inline asm 2026-02-21T09:09:54.9357844Z mov.b64 %rd42, 16384; 2026-02-21T09:09:54.9358020Z // begin inline asm 2026-02-21T09:09:54.9358275Z @%p4 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd16 + 0 ], 0x0, %rd42; 2026-02-21T09:09:54.9358549Z // end inline asm 2026-02-21T09:09:54.9358694Z // begin inline asm 2026-02-21T09:09:54.9358950Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r65; 2026-02-21T09:09:54.9359222Z // end inline asm 2026-02-21T09:09:54.9359362Z // begin inline asm 2026-02-21T09:09:54.9359605Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r65; 2026-02-21T09:09:54.9359883Z // end inline asm 2026-02-21T09:09:54.9360014Z // begin inline asm 2026-02-21T09:09:54.9360244Z @%p4 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd16 + 0 ], 0xa; 2026-02-21T09:09:54.9360503Z // end inline asm 2026-02-21T09:09:54.9360634Z // begin inline asm 2026-02-21T09:09:54.9360881Z @%p4 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:54.9361154Z // end inline asm 2026-02-21T09:09:54.9361296Z // begin inline asm 2026-02-21T09:09:54.9361521Z @%p4 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x2; 2026-02-21T09:09:54.9361816Z // end inline asm 2026-02-21T09:09:54.9361873Z // begin inline asm 2026-02-21T09:09:54.9362022Z @%p4 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:54.9362078Z // end inline asm 2026-02-21T09:09:54.9362134Z // begin inline 
asm 2026-02-21T09:09:54.9362381Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd49 + 0 ], [ %rd16 + 0 ], 0x80; 2026-02-21T09:09:54.9362445Z // end inline asm 2026-02-21T09:09:54.9362500Z // begin inline asm 2026-02-21T09:09:54.9362625Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd49 + 0 ], 0x80; 2026-02-21T09:09:54.9362702Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:54.9362774Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:54.9362827Z // end inline asm 2026-02-21T09:09:54.9362930Z bar.sync 0; 2026-02-21T09:09:54.9362997Z cvta.global.u64 %rd93, %rd49; 2026-02-21T09:09:54.9363169Z .loc 1 29 35 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:29:35 2026-02-21T09:09:54.9363231Z shl.b32 %r372, %r75, 1; 2026-02-21T09:09:54.9363401Z .loc 1 30 37 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:30:37 2026-02-21T09:09:54.9363462Z add.s32 %r84, %r372, 2; 2026-02-21T09:09:54.9363621Z .loc 1 30 49 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:30:49 2026-02-21T09:09:54.9363724Z min.s32 %r4, %r84, 16384; 2026-02-21T09:09:54.9363880Z .loc 1 31 88 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:31:88 2026-02-21T09:09:54.9363945Z setp.ge.s32 %p39, %r372, %r4; 2026-02-21T09:09:54.9364011Z @%p39 bra $L__BB0_9; 2026-02-21T09:09:54.9364087Z // %bb.1: // %.lr.ph 2026-02-21T09:09:54.9364247Z .loc 1 0 88 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:0:88 2026-02-21T09:09:54.9364358Z ld.param.b64 %rd15, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:54.9364415Z shr.u32 %r5, %r1, 5; 2026-02-21T09:09:54.9364507Z bfe.u32 %r6, %r1, 2, 5; 2026-02-21T09:09:54.9364566Z shl.b32 %r7, %r1, 3; 2026-02-21T09:09:54.9364630Z and.b32 %r8, %r7, 24; 2026-02-21T09:09:54.9364686Z and.b32 %r9, %r1, 32; 2026-02-21T09:09:54.9364743Z shl.b32 %r85, %r1, 4; 2026-02-21T09:09:54.9364809Z and.b32 %r10, %r85, 2032; 2026-02-21T09:09:54.9364872Z add.s32 %r228, %r58, %r10; 
2026-02-21T09:09:54.9364932Z add.s32 %r230, %r228, 2048; 2026-02-21T09:09:54.9364990Z add.s32 %r214, %r371, 32; 2026-02-21T09:09:54.9365052Z or.b32 %r14, %r8, 32; 2026-02-21T09:09:54.9365111Z add.s32 %r154, %r228, 4096; 2026-02-21T09:09:54.9365168Z add.s32 %r156, %r228, 6144; 2026-02-21T09:09:54.9365230Z shl.b32 %r87, %r1, 6; 2026-02-21T09:09:54.9365318Z and.b32 %r88, %r87, 960; 2026-02-21T09:09:54.9365378Z and.b32 %r89, %r1, 96; 2026-02-21T09:09:54.9365436Z shl.b32 %r90, %r89, 5; 2026-02-21T09:09:54.9365497Z shl.b32 %r91, %r1, 1; 2026-02-21T09:09:54.9365554Z and.b32 %r92, %r91, 32; 2026-02-21T09:09:54.9365614Z or.b32 %r93, %r88, %r90; 2026-02-21T09:09:54.9365677Z or.b32 %r17, %r93, %r92; 2026-02-21T09:09:54.9365736Z add.s32 %r18, %r58, %r17; 2026-02-21T09:09:54.9365791Z and.b32 %r94, %r1, 31; 2026-02-21T09:09:54.9365847Z shr.u32 %r95, %r1, 1; 2026-02-21T09:09:54.9365913Z and.b32 %r96, %r95, 32; 2026-02-21T09:09:54.9365971Z or.b32 %r19, %r96, %r94; 2026-02-21T09:09:54.9366029Z add.s32 %r97, %r58, 12288; 2026-02-21T09:09:54.9366092Z add.s32 %r20, %r97, %r19; 2026-02-21T09:09:54.9366148Z xor.b32 %r21, %r19, 16; 2026-02-21T09:09:54.9366204Z add.s32 %r22, %r97, %r21; 2026-02-21T09:09:54.9366259Z shl.b32 %r98, %r94, 7; 2026-02-21T09:09:54.9366322Z and.b32 %r99, %r85, 112; 2026-02-21T09:09:54.9366379Z shr.u32 %r100, %r89, 3; 2026-02-21T09:09:54.9366437Z or.b32 %r101, %r98, %r99; 2026-02-21T09:09:54.9366503Z or.b32 %r102, %r101, %r100; 2026-02-21T09:09:54.9366562Z add.s32 %r103, %r58, 8192; 2026-02-21T09:09:54.9366620Z add.s32 %r23, %r103, %r102; 2026-02-21T09:09:54.9366677Z xor.b32 %r104, %r102, 16; 2026-02-21T09:09:54.9366739Z add.s32 %r24, %r103, %r104; 2026-02-21T09:09:54.9366795Z xor.b32 %r105, %r102, 32; 2026-02-21T09:09:54.9366851Z add.s32 %r25, %r103, %r105; 2026-02-21T09:09:54.9366911Z xor.b32 %r106, %r102, 48; 2026-02-21T09:09:54.9366968Z add.s32 %r26, %r103, %r106; 2026-02-21T09:09:54.9367028Z xor.b32 %r107, %r102, 64; 2026-02-21T09:09:54.9367090Z add.s32 %r27, 
%r103, %r107; 2026-02-21T09:09:54.9367145Z xor.b32 %r108, %r102, 80; 2026-02-21T09:09:54.9367211Z add.s32 %r28, %r103, %r108; 2026-02-21T09:09:54.9367269Z xor.b32 %r109, %r102, 96; 2026-02-21T09:09:54.9367336Z add.s32 %r29, %r103, %r109; 2026-02-21T09:09:54.9367395Z xor.b32 %r110, %r102, 112; 2026-02-21T09:09:54.9367452Z add.s32 %r30, %r103, %r110; 2026-02-21T09:09:54.9367551Z bfe.u32 %r111, %r103, 4, 14; 2026-02-21T09:09:54.9367612Z cvt.u64.u32 %rd54, %r111; 2026-02-21T09:09:54.9367682Z or.b64 %rd72, %rd54, 4611686293322072064; 2026-02-21T09:09:54.9367739Z add.s32 %r112, %r58, 8224; 2026-02-21T09:09:54.9367807Z bfe.u32 %r113, %r112, 4, 14; 2026-02-21T09:09:54.9367865Z cvt.u64.u32 %rd55, %r113; 2026-02-21T09:09:54.9367933Z or.b64 %rd73, %rd55, 4611686293322072064; 2026-02-21T09:09:54.9367998Z add.s32 %r114, %r58, 8256; 2026-02-21T09:09:54.9368056Z bfe.u32 %r115, %r114, 4, 14; 2026-02-21T09:09:54.9368155Z cvt.u64.u32 %rd56, %r115; 2026-02-21T09:09:54.9368221Z or.b64 %rd74, %rd56, 4611686293322072064; 2026-02-21T09:09:54.9368285Z add.s32 %r116, %r58, 8288; 2026-02-21T09:09:54.9368342Z bfe.u32 %r117, %r116, 4, 14; 2026-02-21T09:09:54.9368399Z cvt.u64.u32 %rd57, %r117; 2026-02-21T09:09:54.9368471Z or.b64 %rd75, %rd57, 4611686293322072064; 2026-02-21T09:09:54.9368528Z or.b32 %r31, %r8, 64; 2026-02-21T09:09:54.9368586Z and.b32 %r118, %r7, 48; 2026-02-21T09:09:54.9368645Z or.b32 %r119, %r90, %r118; 2026-02-21T09:09:54.9368711Z xor.b32 %r120, %r119, %r92; 2026-02-21T09:09:54.9368768Z or.b32 %r121, %r120, %r88; 2026-02-21T09:09:54.9368826Z add.s32 %r32, %r58, %r121; 2026-02-21T09:09:54.9368915Z xor.b32 %r122, %r121, 16; 2026-02-21T09:09:54.9368974Z add.s32 %r33, %r58, %r122; 2026-02-21T09:09:54.9369138Z .loc 1 31 88 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:31:88 2026-02-21T09:09:54.9369204Z and.b32 %r123, %r1, 3; 2026-02-21T09:09:54.9369273Z mad.wide.u32 %rd58, %r123, 16, %rd15; 2026-02-21T09:09:54.9369332Z add.s64 %rd7, %rd58, 65728; 
2026-02-21T09:09:54.9369389Z shl.b32 %r34, %r6, 10; 2026-02-21T09:09:54.9369452Z bra.uni $L__BB0_2; 2026-02-21T09:09:54.9369554Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:54.9369721Z .loc 1 0 88 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:0:88 2026-02-21T09:09:54.9369805Z mov.b32 %r321, 1; 2026-02-21T09:09:54.9369864Z $L__tmp0: 2026-02-21T09:09:54.9370083Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9370149Z // begin inline asm 2026-02-21T09:09:54.9370200Z 2026-02-21T09:09:54.9370252Z { 2026-02-21T09:09:54.9370312Z .reg .pred complete; 2026-02-21T09:09:54.9370376Z waitLoop: 2026-02-21T09:09:54.9370496Z mbarrier.try_wait.parity.shared.b64 complete, [%r374], %r321; 2026-02-21T09:09:54.9370561Z @!complete bra.uni waitLoop; 2026-02-21T09:09:54.9370620Z } 2026-02-21T09:09:54.9370624Z 2026-02-21T09:09:54.9370679Z // end inline asm 2026-02-21T09:09:54.9370733Z $L__tmp1: 2026-02-21T09:09:54.9370895Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9370964Z cp.async.wait_group 0; 2026-02-21T09:09:54.9371017Z bar.sync 0; 2026-02-21T09:09:54.9371074Z // begin inline asm 2026-02-21T09:09:54.9371165Z @%p4 mbarrier.inval.shared::cta.b64 [%r232]; 2026-02-21T09:09:54.9371221Z // end inline asm 2026-02-21T09:09:54.9371274Z bar.sync 0; 2026-02-21T09:09:54.9371329Z // begin inline asm 2026-02-21T09:09:54.9371415Z @%p4 mbarrier.inval.shared::cta.b64 [%r144]; 2026-02-21T09:09:54.9371469Z // end inline asm 2026-02-21T09:09:54.9371527Z add.s32 %r324, %r58, 13312; 2026-02-21T09:09:54.9371622Z // begin inline asm 2026-02-21T09:09:54.9371699Z @%p4 mbarrier.inval.shared::cta.b64 [%r324]; 2026-02-21T09:09:54.9371754Z // end inline asm 2026-02-21T09:09:54.9371815Z bar.sync 0; 2026-02-21T09:09:54.9371870Z // begin inline asm 2026-02-21T09:09:54.9371945Z @%p4 mbarrier.inval.shared::cta.b64 [%r142]; 2026-02-21T09:09:54.9371999Z // end 
inline asm 2026-02-21T09:09:54.9372058Z $L__tmp2: 2026-02-21T09:09:54.9372269Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9372326Z // begin inline asm 2026-02-21T09:09:54.9372647Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r326, %r327, %r328, %r329, %r330, %r331, %r332, %r333, %r334, %r335, %r336, %r337, %r338, %r339, %r340, %r341}, [%r342 + 0], 16; 2026-02-21T09:09:54.9372703Z // end inline asm 2026-02-21T09:09:54.9372760Z // begin inline asm 2026-02-21T09:09:54.9372840Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:54.9372895Z // end inline asm 2026-02-21T09:09:54.9372956Z cvt.u64.u32 %rd94, %r326; 2026-02-21T09:09:54.9373017Z cvt.u64.u32 %rd95, %r327; 2026-02-21T09:09:54.9373084Z shl.b64 %rd96, %rd95, 32; 2026-02-21T09:09:54.9373172Z or.b64 %rd97, %rd94, %rd96; 2026-02-21T09:09:54.9373225Z $L__tmp3: 2026-02-21T09:09:54.9373397Z .loc 1 97 28 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:97:28 2026-02-21T09:09:54.9373460Z mov.b64 {%r346, %r347}, %rd97; 2026-02-21T09:09:54.9373530Z cvt.rn.bf16x2.f32 %r348, %r347, %r346; 2026-02-21T09:09:54.9373582Z $L__tmp4: 2026-02-21T09:09:54.9373799Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9373860Z cvt.u64.u32 %rd98, %r328; 2026-02-21T09:09:54.9373918Z cvt.u64.u32 %rd99, %r329; 2026-02-21T09:09:54.9374011Z shl.b64 %rd100, %rd99, 32; 2026-02-21T09:09:54.9374073Z or.b64 %rd101, %rd98, %rd100; 2026-02-21T09:09:54.9374126Z $L__tmp5: 2026-02-21T09:09:54.9374296Z .loc 1 97 28 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:97:28 2026-02-21T09:09:54.9374359Z mov.b64 {%r349, %r350}, %rd101; 2026-02-21T09:09:54.9374432Z cvt.rn.bf16x2.f32 %r351, %r350, %r349; 2026-02-21T09:09:54.9374485Z $L__tmp6: 2026-02-21T09:09:54.9374702Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 
2026-02-21T09:09:54.9374765Z cvt.u64.u32 %rd102, %r330; 2026-02-21T09:09:54.9374825Z cvt.u64.u32 %rd103, %r331; 2026-02-21T09:09:54.9374896Z shl.b64 %rd104, %rd103, 32; 2026-02-21T09:09:54.9374987Z or.b64 %rd105, %rd102, %rd104; 2026-02-21T09:09:54.9375041Z $L__tmp7: 2026-02-21T09:09:54.9375208Z .loc 1 97 28 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:97:28 2026-02-21T09:09:54.9375271Z mov.b64 {%r352, %r353}, %rd105; 2026-02-21T09:09:54.9375338Z cvt.rn.bf16x2.f32 %r354, %r353, %r352; 2026-02-21T09:09:54.9375390Z $L__tmp8: 2026-02-21T09:09:54.9375601Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9375662Z cvt.u64.u32 %rd106, %r332; 2026-02-21T09:09:54.9375719Z cvt.u64.u32 %rd107, %r333; 2026-02-21T09:09:54.9375785Z shl.b64 %rd108, %rd107, 32; 2026-02-21T09:09:54.9375845Z or.b64 %rd109, %rd106, %rd108; 2026-02-21T09:09:54.9375895Z $L__tmp9: 2026-02-21T09:09:54.9376057Z .loc 1 97 28 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:97:28 2026-02-21T09:09:54.9376126Z mov.b64 {%r355, %r356}, %rd109; 2026-02-21T09:09:54.9376194Z cvt.rn.bf16x2.f32 %r357, %r356, %r355; 2026-02-21T09:09:54.9376249Z $L__tmp10: 2026-02-21T09:09:54.9376460Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9376520Z cvt.u64.u32 %rd110, %r334; 2026-02-21T09:09:54.9376578Z cvt.u64.u32 %rd111, %r335; 2026-02-21T09:09:54.9376644Z shl.b64 %rd112, %rd111, 32; 2026-02-21T09:09:54.9376703Z or.b64 %rd113, %rd110, %rd112; 2026-02-21T09:09:54.9376757Z $L__tmp11: 2026-02-21T09:09:54.9376922Z .loc 1 97 28 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:97:28 2026-02-21T09:09:54.9376988Z mov.b64 {%r358, %r359}, %rd113; 2026-02-21T09:09:54.9377054Z cvt.rn.bf16x2.f32 %r360, %r359, %r358; 2026-02-21T09:09:54.9377107Z $L__tmp12: 2026-02-21T09:09:54.9377312Z .loc 2 291 36 // standard.py:291:36 @[ 
ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9377374Z cvt.u64.u32 %rd114, %r336; 2026-02-21T09:09:54.9377460Z cvt.u64.u32 %rd115, %r337; 2026-02-21T09:09:54.9377527Z shl.b64 %rd116, %rd115, 32; 2026-02-21T09:09:54.9377586Z or.b64 %rd117, %rd114, %rd116; 2026-02-21T09:09:54.9377638Z $L__tmp13: 2026-02-21T09:09:54.9377802Z .loc 1 97 28 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:97:28 2026-02-21T09:09:54.9377870Z mov.b64 {%r361, %r362}, %rd117; 2026-02-21T09:09:54.9377935Z cvt.rn.bf16x2.f32 %r363, %r362, %r361; 2026-02-21T09:09:54.9377988Z $L__tmp14: 2026-02-21T09:09:54.9378227Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9378285Z cvt.u64.u32 %rd118, %r338; 2026-02-21T09:09:54.9378343Z cvt.u64.u32 %rd119, %r339; 2026-02-21T09:09:54.9378410Z shl.b64 %rd120, %rd119, 32; 2026-02-21T09:09:54.9378470Z or.b64 %rd121, %rd118, %rd120; 2026-02-21T09:09:54.9378522Z $L__tmp15: 2026-02-21T09:09:54.9378686Z .loc 1 97 28 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:97:28 2026-02-21T09:09:54.9378755Z mov.b64 {%r364, %r365}, %rd121; 2026-02-21T09:09:54.9378820Z cvt.rn.bf16x2.f32 %r366, %r365, %r364; 2026-02-21T09:09:54.9378893Z $L__tmp16: 2026-02-21T09:09:54.9379106Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9379164Z cvt.u64.u32 %rd122, %r340; 2026-02-21T09:09:54.9379223Z cvt.u64.u32 %rd123, %r341; 2026-02-21T09:09:54.9379290Z shl.b64 %rd124, %rd123, 32; 2026-02-21T09:09:54.9379349Z or.b64 %rd125, %rd122, %rd124; 2026-02-21T09:09:54.9379400Z $L__tmp17: 2026-02-21T09:09:54.9379562Z .loc 1 97 28 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:97:28 2026-02-21T09:09:54.9379629Z mov.b64 {%r367, %r368}, %rd125; 2026-02-21T09:09:54.9379693Z cvt.rn.bf16x2.f32 %r369, %r368, %r367; 2026-02-21T09:09:54.9379874Z .loc 1 98 43 // 
ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:98:43 2026-02-21T09:09:54.9379977Z st.shared.v4.b32 [%r32], {%r348, %r351, %r354, %r357}; 2026-02-21T09:09:54.9380070Z st.shared.v4.b32 [%r33], {%r360, %r363, %r366, %r369}; 2026-02-21T09:09:54.9380130Z // begin inline asm 2026-02-21T09:09:54.9380209Z fence.proxy.async.shared::cta; 2026-02-21T09:09:54.9380265Z // end inline asm 2026-02-21T09:09:54.9380320Z bar.sync 0; 2026-02-21T09:09:54.9380385Z elect.sync %r370|%p96, -1; 2026-02-21T09:09:54.9380458Z and.pred %p94, %p1, %p96; 2026-02-21T09:09:54.9380516Z // begin inline asm 2026-02-21T09:09:54.9380694Z @%p94 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd93, {%r343, %r344}], [%r58]; 2026-02-21T09:09:54.9380756Z // end inline asm 2026-02-21T09:09:54.9380823Z cp.async.bulk.commit_group; 2026-02-21T09:09:54.9380894Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:54.9380948Z bar.sync 0; 2026-02-21T09:09:54.9381123Z .loc 1 31 88 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:31:88 2026-02-21T09:09:54.9381185Z add.s32 %r372, %r372, 1; 2026-02-21T09:09:54.9381250Z setp.ne.b32 %p97, %r372, %r4; 2026-02-21T09:09:54.9381316Z @%p97 bra $L__BB0_2; 2026-02-21T09:09:54.9381375Z bra.uni $L__BB0_9; 2026-02-21T09:09:54.9381475Z $L__BB0_2: // =>This Loop Header: Depth=1 2026-02-21T09:09:54.9381638Z // Child Loop BB0_5 Depth 2 2026-02-21T09:09:54.9381804Z .loc 1 81 38 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:81:38 2026-02-21T09:09:54.9381869Z setp.eq.b32 %p50, %r9, 0; 2026-02-21T09:09:54.9382030Z .loc 1 37 35 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:37:35 2026-02-21T09:09:54.9382096Z shr.s32 %r182, %r372, 31; 2026-02-21T09:09:54.9382153Z shr.u32 %r183, %r182, 22; 2026-02-21T09:09:54.9382215Z add.s32 %r184, %r372, %r183; 2026-02-21T09:09:54.9382282Z shr.s32 %r185, %r184, 10; 2026-02-21T09:09:54.9382470Z .loc 1 40 45 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:40:45 2026-02-21T09:09:54.9382531Z and.b32 
%r186, %r184, 64512; 2026-02-21T09:09:54.9382599Z sub.s32 %r187, %r372, %r186; 2026-02-21T09:09:54.9382764Z .loc 1 40 64 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:40:64 2026-02-21T09:09:54.9382825Z cvt.u16.u32 %rs1, %r187; 2026-02-21T09:09:54.9382989Z .loc 1 41 51 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:41:51 2026-02-21T09:09:54.9383083Z shr.s16 %rs2, %rs1, 15; 2026-02-21T09:09:54.9383142Z shr.u16 %rs3, %rs2, 12; 2026-02-21T09:09:54.9383201Z add.s16 %rs4, %rs1, %rs3; 2026-02-21T09:09:54.9383266Z shr.s16 %rs5, %rs4, 4; 2026-02-21T09:09:54.9383427Z .loc 1 40 64 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:40:64 2026-02-21T09:09:54.9383489Z and.b16 %rs6, %rs4, -16; 2026-02-21T09:09:54.9383557Z sub.s16 %rs7, %rs1, %rs6; 2026-02-21T09:09:54.9383717Z .loc 1 42 27 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:42:27 2026-02-21T09:09:54.9383777Z shl.b32 %r188, %r185, 9; 2026-02-21T09:09:54.9383879Z mad.wide.s16 %r343, %rs7, 32, %r188; 2026-02-21T09:09:54.9384049Z .loc 1 43 27 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:43:27 2026-02-21T09:09:54.9384113Z mul.wide.s16 %r344, %rs5, 64; 2026-02-21T09:09:54.9384273Z .loc 1 44 32 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:44:32 2026-02-21T09:09:54.9384339Z or.b32 %r189, %r344, %r6; 2026-02-21T09:09:54.9384497Z .loc 1 58 53 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:53 2026-02-21T09:09:54.9384554Z shl.b32 %r190, %r189, 10; 2026-02-21T09:09:54.9384614Z $L__tmp18: 2026-02-21T09:09:54.9384845Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9384921Z shfl.sync.idx.b32 %r40, %r5, 0, 31, -1; 2026-02-21T09:09:54.9384980Z shl.b32 %r191, %r40, 21; 2026-02-21T09:09:54.9385047Z and.b32 %r192, %r191, 6291456; 2026-02-21T09:09:54.9385106Z add.s32 %r342, %r192, %r371; 2026-02-21T09:09:54.9385166Z mov.pred %p57, -1; 2026-02-21T09:09:54.9385229Z mov.b32 %r373, 
0; 2026-02-21T09:09:54.9385286Z // begin inline asm 2026-02-21T09:09:54.9385581Z @%p57 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r342 + 0], 16, {%r373, %r373, %r373, %r373, %r373, %r373, %r373, %r373, %r373, %r373, %r373, %r373, %r373, %r373, %r373, %r373}; 2026-02-21T09:09:54.9385645Z // end inline asm 2026-02-21T09:09:54.9385702Z // begin inline asm 2026-02-21T09:09:54.9385771Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:54.9385825Z // end inline asm 2026-02-21T09:09:54.9385887Z bar.sync 0; 2026-02-21T09:09:54.9385940Z $L__tmp19: 2026-02-21T09:09:54.9386106Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9386174Z add.s32 %r374, %r58, 13312; 2026-02-21T09:09:54.9386229Z // begin inline asm 2026-02-21T09:09:54.9386315Z @%p4 mbarrier.init.shared::cta.b64 [%r374], 1; 2026-02-21T09:09:54.9386376Z // end inline asm 2026-02-21T09:09:54.9386430Z bar.sync 0; 2026-02-21T09:09:54.9386489Z add.s32 %r142, %r58, 13320; 2026-02-21T09:09:54.9386545Z // begin inline asm 2026-02-21T09:09:54.9386638Z @%p4 mbarrier.init.shared::cta.b64 [%r142], 1; 2026-02-21T09:09:54.9386693Z // end inline asm 2026-02-21T09:09:54.9386752Z add.s32 %r232, %r58, 13328; 2026-02-21T09:09:54.9386815Z // begin inline asm 2026-02-21T09:09:54.9386894Z @%p4 mbarrier.init.shared::cta.b64 [%r232], 1; 2026-02-21T09:09:54.9386948Z // end inline asm 2026-02-21T09:09:54.9387001Z bar.sync 0; 2026-02-21T09:09:54.9387067Z add.s32 %r144, %r58, 13336; 2026-02-21T09:09:54.9387122Z // begin inline asm 2026-02-21T09:09:54.9387200Z @%p4 mbarrier.init.shared::cta.b64 [%r144], 1; 2026-02-21T09:09:54.9387262Z // end inline asm 2026-02-21T09:09:54.9387463Z .loc 1 58 60 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:60 2026-02-21T09:09:54.9387521Z or.b32 %r194, %r190, %r8; 2026-02-21T09:09:54.9387691Z .loc 1 58 32 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:32 2026-02-21T09:09:54.9387760Z mad.wide.s32 %rd59, %r194, 2, %rd15; 
2026-02-21T09:09:54.9387821Z cvt.u64.u32 %rd65, %r8; 2026-02-21T09:09:54.9387881Z cvt.s64.s32 %rd8, %r190; 2026-02-21T09:09:54.9387949Z or.b64 %rd66, %rd8, %rd65; 2026-02-21T09:09:54.9388028Z shl.b64 %rd67, %rd66, 1; 2026-02-21T09:09:54.9388089Z add.s64 %rd9, %rd15, %rd67; 2026-02-21T09:09:54.9388153Z add.s64 %rd60, %rd9, 65536; 2026-02-21T09:09:54.9388208Z mov.b32 %r229, 16; 2026-02-21T09:09:54.9388370Z .loc 1 58 80 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:80 2026-02-21T09:09:54.9388428Z // begin inline asm 2026-02-21T09:09:54.9388555Z cp.async.cg.shared.global [ %r228 + 0 ], [ %rd59 + 0 ], 0x10, %r229; 2026-02-21T09:09:54.9388611Z // end inline asm 2026-02-21T09:09:54.9388667Z // begin inline asm 2026-02-21T09:09:54.9388810Z cp.async.cg.shared.global [ %r230 + 0 ], [ %rd60 + 0 ], 0x10, %r229; 2026-02-21T09:09:54.9388865Z // end inline asm 2026-02-21T09:09:54.9388926Z cp.async.commit_group; 2026-02-21T09:09:54.9389091Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9389144Z bar.sync 0; 2026-02-21T09:09:54.9389200Z // begin inline asm 2026-02-21T09:09:54.9389306Z @%p4 mbarrier.arrive.expect_tx.shared.b64 _, [%r232], 512; 2026-02-21T09:09:54.9389366Z // end inline asm 2026-02-21T09:09:54.9389525Z .loc 1 64 33 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:64:33 2026-02-21T09:09:54.9389577Z bar.sync 0; 2026-02-21T09:09:54.9389648Z elect.sync %r195|%p52, -1; 2026-02-21T09:09:54.9389712Z and.pred %p46, %p1, %p52; 2026-02-21T09:09:54.9389790Z // begin inline asm 2026-02-21T09:09:54.9390039Z @%p46 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r97], [%rd79, {%r343, %r373}], [%r232]; 2026-02-21T09:09:54.9390094Z // end inline asm 2026-02-21T09:09:54.9390254Z .loc 1 58 32 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:32 2026-02-21T09:09:54.9390314Z add.s64 %rd62, %rd9, 64; 2026-02-21T09:09:54.9390382Z cvt.u64.u32 %rd68, %r14; 
2026-02-21T09:09:54.9390441Z or.b64 %rd69, %rd8, %rd68; 2026-02-21T09:09:54.9390500Z shl.b64 %rd70, %rd69, 1; 2026-02-21T09:09:54.9390568Z add.s64 %rd71, %rd15, %rd70; 2026-02-21T09:09:54.9390631Z add.s64 %rd63, %rd71, 65536; 2026-02-21T09:09:54.9390799Z .loc 1 58 80 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:80 2026-02-21T09:09:54.9390865Z // begin inline asm 2026-02-21T09:09:54.9390984Z cp.async.cg.shared.global [ %r154 + 0 ], [ %rd62 + 0 ], 0x10, %r229; 2026-02-21T09:09:54.9391043Z // end inline asm 2026-02-21T09:09:54.9391102Z // begin inline asm 2026-02-21T09:09:54.9391226Z cp.async.cg.shared.global [ %r156 + 0 ], [ %rd63 + 0 ], 0x10, %r229; 2026-02-21T09:09:54.9391285Z // end inline asm 2026-02-21T09:09:54.9391352Z cp.async.commit_group; 2026-02-21T09:09:54.9391530Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9391615Z bar.sync 0; 2026-02-21T09:09:54.9391674Z // begin inline asm 2026-02-21T09:09:54.9391783Z @%p4 mbarrier.arrive.expect_tx.shared.b64 _, [%r144], 512; 2026-02-21T09:09:54.9391849Z // end inline asm 2026-02-21T09:09:54.9392015Z .loc 1 64 33 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:64:33 2026-02-21T09:09:54.9392072Z bar.sync 0; 2026-02-21T09:09:54.9392149Z elect.sync %r196|%p53, -1; 2026-02-21T09:09:54.9392216Z and.pred %p48, %p1, %p53; 2026-02-21T09:09:54.9392280Z add.s32 %r159, %r58, 12800; 2026-02-21T09:09:54.9392351Z // begin inline asm 2026-02-21T09:09:54.9392627Z @%p48 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r159], [%rd79, {%r343, %r229}], [%r144]; 2026-02-21T09:09:54.9392685Z // end inline asm 2026-02-21T09:09:54.9392855Z .loc 1 58 80 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:80 2026-02-21T09:09:54.9392931Z cp.async.wait_group 1; 2026-02-21T09:09:54.9392987Z bar.sync 0; 2026-02-21T09:09:54.9393084Z ld.shared.v4.b32 {%r197, %r198, %r199, %r200}, [%r18]; 2026-02-21T09:09:54.9393192Z mov.b32 {%rs8, 
%rs9}, %r200; 2026-02-21T09:09:54.9393262Z mov.b32 {%rs10, %rs11}, %r199; 2026-02-21T09:09:54.9393328Z mov.b32 {%rs12, %rs13}, %r198; 2026-02-21T09:09:54.9393405Z mov.b32 {%rs14, %rs15}, %r197; 2026-02-21T09:09:54.9393508Z ld.shared.v4.b32 {%r201, %r202, %r203, %r204}, [%r18+16]; 2026-02-21T09:09:54.9393571Z mov.b32 {%rs16, %rs17}, %r204; 2026-02-21T09:09:54.9393631Z mov.b32 {%rs18, %rs19}, %r203; 2026-02-21T09:09:54.9393701Z mov.b32 {%rs20, %rs21}, %r202; 2026-02-21T09:09:54.9393762Z mov.b32 {%rs22, %rs23}, %r201; 2026-02-21T09:09:54.9393926Z .loc 1 62 32 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:62:32 2026-02-21T09:09:54.9394022Z cvt.f32.bf16 %r164, %rs14; 2026-02-21T09:09:54.9394085Z cvt.f32.bf16 %r165, %rs15; 2026-02-21T09:09:54.9394147Z cvt.f32.bf16 %r166, %rs12; 2026-02-21T09:09:54.9394207Z cvt.f32.bf16 %r167, %rs13; 2026-02-21T09:09:54.9394274Z cvt.f32.bf16 %r168, %rs10; 2026-02-21T09:09:54.9394333Z cvt.f32.bf16 %r169, %rs11; 2026-02-21T09:09:54.9394397Z cvt.f32.bf16 %r170, %rs8; 2026-02-21T09:09:54.9394466Z cvt.f32.bf16 %r171, %rs9; 2026-02-21T09:09:54.9394526Z cvt.f32.bf16 %r172, %rs22; 2026-02-21T09:09:54.9394586Z cvt.f32.bf16 %r173, %rs23; 2026-02-21T09:09:54.9394652Z cvt.f32.bf16 %r174, %rs20; 2026-02-21T09:09:54.9394712Z cvt.f32.bf16 %r175, %rs21; 2026-02-21T09:09:54.9394772Z cvt.f32.bf16 %r176, %rs18; 2026-02-21T09:09:54.9394861Z cvt.f32.bf16 %r177, %rs19; 2026-02-21T09:09:54.9394932Z cvt.f32.bf16 %r178, %rs16; 2026-02-21T09:09:54.9394994Z cvt.f32.bf16 %r179, %rs17; 2026-02-21T09:09:54.9395049Z $L__tmp20: 2026-02-21T09:09:54.9395279Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9395341Z add.s32 %r246, %r192, %r214; 2026-02-21T09:09:54.9395399Z // begin inline asm 2026-02-21T09:09:54.9395702Z @%p57 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r246 + 0], 16, {%r164, %r165, %r166, %r167, %r168, %r169, %r170, %r171, %r172, %r173, %r174, %r175, %r176, %r177, %r178, 
%r179}; 2026-02-21T09:09:54.9395769Z // end inline asm 2026-02-21T09:09:54.9395828Z // begin inline asm 2026-02-21T09:09:54.9395901Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:54.9395965Z // end inline asm 2026-02-21T09:09:54.9396022Z bar.sync 0; 2026-02-21T09:09:54.9396077Z $L__tmp21: 2026-02-21T09:09:54.9396256Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9396317Z // begin inline asm 2026-02-21T09:09:54.9396370Z 2026-02-21T09:09:54.9396423Z { 2026-02-21T09:09:54.9396495Z .reg .pred complete; 2026-02-21T09:09:54.9396552Z waitLoop: 2026-02-21T09:09:54.9396680Z mbarrier.try_wait.parity.shared.b64 complete, [%r232], %r373; 2026-02-21T09:09:54.9396758Z @!complete bra.uni waitLoop; 2026-02-21T09:09:54.9396811Z } 2026-02-21T09:09:54.9396815Z 2026-02-21T09:09:54.9396871Z // end inline asm 2026-02-21T09:09:54.9397044Z .loc 1 64 33 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:64:33 2026-02-21T09:09:54.9397116Z ld.shared.b8 %rs24, [%r20]; 2026-02-21T09:09:54.9397183Z ld.shared.b8 %rs25, [%r20+64]; 2026-02-21T09:09:54.9397253Z ld.shared.b8 %rs26, [%r20+256]; 2026-02-21T09:09:54.9397327Z ld.shared.b8 %rs27, [%r20+320]; 2026-02-21T09:09:54.9397390Z ld.shared.b8 %rs28, [%r22+128]; 2026-02-21T09:09:54.9397452Z ld.shared.b8 %rs29, [%r22+192]; 2026-02-21T09:09:54.9397521Z ld.shared.b8 %rs30, [%r22+384]; 2026-02-21T09:09:54.9397605Z ld.shared.b8 %rs31, [%r22+448]; 2026-02-21T09:09:54.9397776Z .loc 1 67 28 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:67:28 2026-02-21T09:09:54.9397841Z shl.b16 %rs32, %rs24, 4; 2026-02-21T09:09:54.9397909Z shl.b16 %rs33, %rs25, 4; 2026-02-21T09:09:54.9397970Z shl.b16 %rs34, %rs28, 4; 2026-02-21T09:09:54.9398029Z shl.b16 %rs35, %rs29, 4; 2026-02-21T09:09:54.9398094Z shl.b16 %rs36, %rs26, 4; 2026-02-21T09:09:54.9398154Z shl.b16 %rs37, %rs27, 4; 2026-02-21T09:09:54.9398234Z shl.b16 %rs38, %rs30, 4; 2026-02-21T09:09:54.9398294Z shl.b16 %rs39, %rs31, 4; 
2026-02-21T09:09:54.9398477Z .loc 1 82 58 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:82:58 2026-02-21T09:09:54.9398545Z selp.b16 %rs40, %rs32, %rs24, %p50; 2026-02-21T09:09:54.9398603Z cvt.s16.s8 %rs41, %rs40; 2026-02-21T09:09:54.9398667Z shr.s16 %rs42, %rs41, 4; 2026-02-21T09:09:54.9398736Z selp.b16 %rs43, %rs33, %rs25, %p50; 2026-02-21T09:09:54.9398796Z cvt.s16.s8 %rs44, %rs43; 2026-02-21T09:09:54.9398859Z shr.s16 %rs45, %rs44, 4; 2026-02-21T09:09:54.9398924Z selp.b16 %rs46, %rs34, %rs28, %p50; 2026-02-21T09:09:54.9399002Z cvt.s16.s8 %rs47, %rs46; 2026-02-21T09:09:54.9399062Z shr.s16 %rs48, %rs47, 4; 2026-02-21T09:09:54.9399134Z selp.b16 %rs49, %rs35, %rs29, %p50; 2026-02-21T09:09:54.9399192Z cvt.s16.s8 %rs50, %rs49; 2026-02-21T09:09:54.9399249Z shr.s16 %rs51, %rs50, 4; 2026-02-21T09:09:54.9399321Z selp.b16 %rs52, %rs36, %rs26, %p50; 2026-02-21T09:09:54.9399380Z cvt.s16.s8 %rs53, %rs52; 2026-02-21T09:09:54.9399437Z shr.s16 %rs54, %rs53, 4; 2026-02-21T09:09:54.9399500Z selp.b16 %rs55, %rs37, %rs27, %p50; 2026-02-21T09:09:54.9399568Z cvt.s16.s8 %rs56, %rs55; 2026-02-21T09:09:54.9399624Z shr.s16 %rs57, %rs56, 4; 2026-02-21T09:09:54.9399687Z selp.b16 %rs58, %rs38, %rs30, %p50; 2026-02-21T09:09:54.9399753Z cvt.s16.s8 %rs59, %rs58; 2026-02-21T09:09:54.9399832Z shr.s16 %rs60, %rs59, 4; 2026-02-21T09:09:54.9399900Z selp.b16 %rs61, %rs39, %rs31, %p50; 2026-02-21T09:09:54.9399960Z cvt.s16.s8 %rs62, %rs61; 2026-02-21T09:09:54.9400025Z shr.s16 %rs63, %rs62, 4; 2026-02-21T09:09:54.9400189Z .loc 1 87 32 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:87:32 2026-02-21T09:09:54.9400253Z cvt.rn.f32.s16 %r205, %rs42; 2026-02-21T09:09:54.9400331Z cvt.rn.f32.s16 %r206, %rs45; 2026-02-21T09:09:54.9400391Z cvt.rn.f32.s16 %r207, %rs48; 2026-02-21T09:09:54.9400450Z cvt.rn.f32.s16 %r208, %rs51; 2026-02-21T09:09:54.9400516Z cvt.rn.f32.s16 %r209, %rs54; 2026-02-21T09:09:54.9400575Z cvt.rn.f32.s16 %r210, %rs57; 2026-02-21T09:09:54.9400632Z cvt.rn.f32.s16 %r211, 
%rs60; 2026-02-21T09:09:54.9400689Z cvt.rn.f32.s16 %r212, %rs63; 2026-02-21T09:09:54.9400756Z st.shared.b32 [%r23], %r205; 2026-02-21T09:09:54.9400816Z st.shared.b32 [%r24], %r206; 2026-02-21T09:09:54.9400874Z st.shared.b32 [%r25], %r207; 2026-02-21T09:09:54.9400939Z st.shared.b32 [%r26], %r208; 2026-02-21T09:09:54.9400999Z st.shared.b32 [%r27], %r209; 2026-02-21T09:09:54.9401057Z st.shared.b32 [%r28], %r210; 2026-02-21T09:09:54.9401116Z st.shared.b32 [%r29], %r211; 2026-02-21T09:09:54.9401183Z st.shared.b32 [%r30], %r212; 2026-02-21T09:09:54.9401235Z $L__tmp22: 2026-02-21T09:09:54.9401445Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9401510Z // begin inline asm 2026-02-21T09:09:54.9401620Z fence.proxy.async.shared::cta; 2026-02-21T09:09:54.9401677Z // end inline asm 2026-02-21T09:09:54.9401740Z bar.sync 0; 2026-02-21T09:09:54.9401805Z setp.ne.b32 %p54, %r40, 0; 2026-02-21T09:09:54.9401862Z @%p54 bra $L__BB0_4; 2026-02-21T09:09:54.9401965Z // %bb.3: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:54.9402038Z elect.sync %r225|%p56, -1; 2026-02-21T09:09:54.9402095Z mov.b32 %r215, 67635472; 2026-02-21T09:09:54.9402154Z mov.pred %p55, 0; 2026-02-21T09:09:54.9402242Z // begin inline asm 2026-02-21T09:09:54.9402398Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r371 + 0 ], [ %r214 + 0 ], %rd72, %r215, %p55; 2026-02-21T09:09:54.9402453Z // end inline asm 2026-02-21T09:09:54.9402511Z // begin inline asm 2026-02-21T09:09:54.9402665Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r371 + 0 ], [ %r214 + 8 ], %rd73, %r215, %p57; 2026-02-21T09:09:54.9402721Z // end inline asm 2026-02-21T09:09:54.9402778Z // begin inline asm 2026-02-21T09:09:54.9402933Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r371 + 0 ], [ %r214 + 16 ], %rd74, %r215, %p57; 2026-02-21T09:09:54.9403016Z // end inline asm 2026-02-21T09:09:54.9403072Z // begin inline asm 2026-02-21T09:09:54.9403222Z @%p56 
tcgen05.mma.cta_group::1.kind::tf32 [ %r371 + 0 ], [ %r214 + 24 ], %rd75, %r215, %p57; 2026-02-21T09:09:54.9403276Z // end inline asm 2026-02-21T09:09:54.9403338Z add.s32 %r227, %r58, 13312; 2026-02-21T09:09:54.9403407Z cvt.u64.u32 %rd76, %r227; 2026-02-21T09:09:54.9403465Z // begin inline asm 2026-02-21T09:09:54.9403589Z @%p56 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd76]; 2026-02-21T09:09:54.9403644Z // end inline asm 2026-02-21T09:09:54.9403706Z $L__tmp23: 2026-02-21T09:09:54.9403831Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:54.9403992Z .loc 1 0 0 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:0 2026-02-21T09:09:54.9404062Z cvt.s32.s16 %r37, %rs5; 2026-02-21T09:09:54.9404225Z .loc 1 58 32 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:32 2026-02-21T09:09:54.9404287Z add.s64 %rd77, %rd9, 128; 2026-02-21T09:09:54.9404354Z cvt.u64.u32 %rd81, %r31; 2026-02-21T09:09:54.9404415Z add.s64 %rd82, %rd8, %rd81; 2026-02-21T09:09:54.9404472Z shl.b64 %rd83, %rd82, 1; 2026-02-21T09:09:54.9404531Z add.s64 %rd84, %rd15, %rd83; 2026-02-21T09:09:54.9404596Z add.s64 %rd78, %rd84, 65536; 2026-02-21T09:09:54.9404797Z .loc 1 58 80 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:80 2026-02-21T09:09:54.9404856Z // begin inline asm 2026-02-21T09:09:54.9404976Z cp.async.cg.shared.global [ %r228 + 0 ], [ %rd77 + 0 ], 0x10, %r229; 2026-02-21T09:09:54.9405033Z // end inline asm 2026-02-21T09:09:54.9405088Z // begin inline asm 2026-02-21T09:09:54.9405199Z cp.async.cg.shared.global [ %r230 + 0 ], [ %rd78 + 0 ], 0x10, %r229; 2026-02-21T09:09:54.9405260Z // end inline asm 2026-02-21T09:09:54.9405323Z cp.async.commit_group; 2026-02-21T09:09:54.9405484Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9405546Z // begin inline asm 2026-02-21T09:09:54.9405647Z @%p4 mbarrier.arrive.expect_tx.shared.b64 _, [%r232], 512; 2026-02-21T09:09:54.9405700Z // end inline asm 
2026-02-21T09:09:54.9405862Z .loc 1 64 33 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:64:33 2026-02-21T09:09:54.9405916Z bar.sync 0; 2026-02-21T09:09:54.9405982Z elect.sync %r241|%p67, -1; 2026-02-21T09:09:54.9406044Z and.pred %p65, %p1, %p67; 2026-02-21T09:09:54.9406107Z mov.b32 %r235, 32; 2026-02-21T09:09:54.9406163Z // begin inline asm 2026-02-21T09:09:54.9406400Z @%p65 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r97], [%rd79, {%r343, %r235}], [%r232]; 2026-02-21T09:09:54.9406462Z // end inline asm 2026-02-21T09:09:54.9406620Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9406681Z shl.b32 %r242, %r37, 16; 2026-02-21T09:09:54.9406746Z or.b32 %r243, %r34, %r242; 2026-02-21T09:09:54.9406815Z mad.wide.s32 %rd126, %r243, 2, %rd7; 2026-02-21T09:09:54.9406870Z mov.b32 %r377, 1; 2026-02-21T09:09:54.9406926Z mov.b64 %rd127, 0; 2026-02-21T09:09:54.9406990Z mov.b32 %r375, %r373; 2026-02-21T09:09:54.9407046Z mov.b32 %r376, %r373; 2026-02-21T09:09:54.9407102Z mov.b32 %r378, %r373; 2026-02-21T09:09:54.9407192Z bra.uni $L__BB0_5; 2026-02-21T09:09:54.9407291Z $L__BB0_7: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:09:54.9407456Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9407528Z setp.lt.u64 %p83, %rd127, 464; 2026-02-21T09:09:54.9407581Z $L__tmp24: 2026-02-21T09:09:54.9407792Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9407888Z add.s32 %r316, %r377, 1; 2026-02-21T09:09:54.9407959Z setp.gt.s32 %p86, %r316, 1; 2026-02-21T09:09:54.9408024Z selp.b32 %r377, 0, %r316, %p86; 2026-02-21T09:09:54.9408087Z selp.b32 %r317, 1, 0, %p86; 2026-02-21T09:09:54.9408152Z xor.b32 %r56, %r378, %r317; 2026-02-21T09:09:54.9408206Z $L__tmp25: 2026-02-21T09:09:54.9408371Z .loc 1 58 32 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:32 
2026-02-21T09:09:54.9408437Z add.s64 %rd90, %rd126, -65536; 2026-02-21T09:09:54.9408610Z .loc 1 58 80 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:80 2026-02-21T09:09:54.9408669Z add.s32 %r307, %r50, %r10; 2026-02-21T09:09:54.9408750Z selp.b32 %r308, 16, 0, %p83; 2026-02-21T09:09:54.9408815Z // begin inline asm 2026-02-21T09:09:54.9408929Z cp.async.cg.shared.global [ %r307 + 0 ], [ %rd90 + 0 ], 0x10, %r308; 2026-02-21T09:09:54.9408983Z // end inline asm 2026-02-21T09:09:54.9409049Z add.s32 %r309, %r307, 2048; 2026-02-21T09:09:54.9409106Z // begin inline asm 2026-02-21T09:09:54.9409221Z cp.async.cg.shared.global [ %r309 + 0 ], [ %rd126 + 0 ], 0x10, %r308; 2026-02-21T09:09:54.9409275Z // end inline asm 2026-02-21T09:09:54.9409346Z cp.async.commit_group; 2026-02-21T09:09:54.9409507Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9409570Z and.pred %p81, %p4, %p83; 2026-02-21T09:09:54.9409656Z // begin inline asm 2026-02-21T09:09:54.9409769Z @%p81 mbarrier.arrive.expect_tx.shared.b64 _, [%r311], 512; 2026-02-21T09:09:54.9409824Z // end inline asm 2026-02-21T09:09:54.9409992Z .loc 1 64 33 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:64:33 2026-02-21T09:09:54.9410046Z bar.sync 0; 2026-02-21T09:09:54.9410110Z elect.sync %r318|%p87, -1; 2026-02-21T09:09:54.9410171Z and.pred %p88, %p83, %p87; 2026-02-21T09:09:54.9410241Z and.pred %p82, %p1, %p88; 2026-02-21T09:09:54.9410299Z cvt.u32.u64 %r319, %rd127; 2026-02-21T09:09:54.9410359Z add.s32 %r314, %r319, 48; 2026-02-21T09:09:54.9410422Z // begin inline asm 2026-02-21T09:09:54.9410658Z @%p82 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r312], [%rd79, {%r343, %r314}], [%r311]; 2026-02-21T09:09:54.9410713Z // end inline asm 2026-02-21T09:09:54.9410882Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9410943Z add.s64 %rd126, %rd126, 64; 2026-02-21T09:09:54.9411008Z 
setp.lt.u64 %p89, %rd127, 480; 2026-02-21T09:09:54.9411065Z add.s64 %rd127, %rd127, 16; 2026-02-21T09:09:54.9411131Z mov.b32 %r373, %r378; 2026-02-21T09:09:54.9411188Z mov.b32 %r378, %r56; 2026-02-21T09:09:54.9411247Z @%p89 bra $L__BB0_5; 2026-02-21T09:09:54.9411310Z bra.uni $L__BB0_8; 2026-02-21T09:09:54.9411409Z $L__BB0_5: // Parent Loop BB0_2 Depth=1 2026-02-21T09:09:54.9411505Z // => This Inner Loop Header: Depth=2 2026-02-21T09:09:54.9411595Z add.s32 %r265, %r376, 1; 2026-02-21T09:09:54.9411666Z setp.gt.s32 %p71, %r265, 1; 2026-02-21T09:09:54.9411728Z selp.b32 %r376, 0, %r265, %p71; 2026-02-21T09:09:54.9411891Z .loc 1 58 80 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:58:80 2026-02-21T09:09:54.9411962Z cp.async.wait_group 1; 2026-02-21T09:09:54.9412017Z bar.sync 0; 2026-02-21T09:09:54.9412075Z shl.b32 %r266, %r376, 12; 2026-02-21T09:09:54.9412169Z add.s32 %r50, %r58, %r266; 2026-02-21T09:09:54.9412227Z add.s32 %r268, %r50, %r17; 2026-02-21T09:09:54.9412320Z ld.shared.v4.b32 {%r269, %r270, %r271, %r272}, [%r268]; 2026-02-21T09:09:54.9412383Z mov.b32 {%rs64, %rs65}, %r272; 2026-02-21T09:09:54.9412450Z mov.b32 {%rs66, %rs67}, %r271; 2026-02-21T09:09:54.9412509Z mov.b32 {%rs68, %rs69}, %r270; 2026-02-21T09:09:54.9412567Z mov.b32 {%rs70, %rs71}, %r269; 2026-02-21T09:09:54.9412673Z ld.shared.v4.b32 {%r273, %r274, %r275, %r276}, [%r268+16]; 2026-02-21T09:09:54.9412756Z mov.b32 {%rs72, %rs73}, %r276; 2026-02-21T09:09:54.9412815Z mov.b32 {%rs74, %rs75}, %r275; 2026-02-21T09:09:54.9412876Z mov.b32 {%rs76, %rs77}, %r274; 2026-02-21T09:09:54.9412943Z mov.b32 {%rs78, %rs79}, %r273; 2026-02-21T09:09:54.9413107Z .loc 1 62 32 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:62:32 2026-02-21T09:09:54.9413166Z cvt.f32.bf16 %r247, %rs70; 2026-02-21T09:09:54.9413233Z cvt.f32.bf16 %r248, %rs71; 2026-02-21T09:09:54.9413291Z cvt.f32.bf16 %r249, %rs68; 2026-02-21T09:09:54.9413347Z cvt.f32.bf16 %r250, %rs69; 2026-02-21T09:09:54.9413410Z cvt.f32.bf16 %r251, 
%rs66; 2026-02-21T09:09:54.9413467Z cvt.f32.bf16 %r252, %rs67; 2026-02-21T09:09:54.9413551Z cvt.f32.bf16 %r253, %rs64; 2026-02-21T09:09:54.9413611Z cvt.f32.bf16 %r254, %rs65; 2026-02-21T09:09:54.9413676Z cvt.f32.bf16 %r255, %rs78; 2026-02-21T09:09:54.9413732Z cvt.f32.bf16 %r256, %rs79; 2026-02-21T09:09:54.9413790Z cvt.f32.bf16 %r257, %rs76; 2026-02-21T09:09:54.9413855Z cvt.f32.bf16 %r258, %rs77; 2026-02-21T09:09:54.9413913Z cvt.f32.bf16 %r259, %rs74; 2026-02-21T09:09:54.9413969Z cvt.f32.bf16 %r260, %rs75; 2026-02-21T09:09:54.9414026Z cvt.f32.bf16 %r261, %rs72; 2026-02-21T09:09:54.9414091Z cvt.f32.bf16 %r262, %rs73; 2026-02-21T09:09:54.9414144Z $L__tmp26: 2026-02-21T09:09:54.9414378Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9414445Z // begin inline asm 2026-02-21T09:09:54.9414498Z 2026-02-21T09:09:54.9414548Z { 2026-02-21T09:09:54.9414611Z .reg .pred complete; 2026-02-21T09:09:54.9414672Z waitLoop: 2026-02-21T09:09:54.9414791Z mbarrier.try_wait.parity.shared.b64 complete, [%r374], %r373; 2026-02-21T09:09:54.9414858Z @!complete bra.uni waitLoop; 2026-02-21T09:09:54.9414917Z } 2026-02-21T09:09:54.9414921Z 2026-02-21T09:09:54.9414976Z // end inline asm 2026-02-21T09:09:54.9415027Z $L__tmp27: 2026-02-21T09:09:54.9415198Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9415261Z selp.b32 %r277, 1, 0, %p71; 2026-02-21T09:09:54.9415321Z xor.b32 %r375, %r375, %r277; 2026-02-21T09:09:54.9415380Z mov.pred %p72, -1; 2026-02-21T09:09:54.9415440Z $L__tmp28: 2026-02-21T09:09:54.9415649Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9415707Z // begin inline asm 2026-02-21T09:09:54.9416008Z @%p72 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r246 + 0], 16, {%r247, %r248, %r249, %r250, %r251, %r252, %r253, %r254, %r255, %r256, %r257, %r258, %r259, %r260, %r261, %r262}; 
2026-02-21T09:09:54.9416066Z // end inline asm 2026-02-21T09:09:54.9416124Z // begin inline asm 2026-02-21T09:09:54.9416205Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:54.9416261Z // end inline asm 2026-02-21T09:09:54.9416317Z bar.sync 0; 2026-02-21T09:09:54.9416368Z $L__tmp29: 2026-02-21T09:09:54.9416540Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9416601Z shl.b32 %r278, %r376, 3; 2026-02-21T09:09:54.9416661Z add.s32 %r279, %r58, %r278; 2026-02-21T09:09:54.9416728Z add.s32 %r311, %r279, 13328; 2026-02-21T09:09:54.9416784Z // begin inline asm 2026-02-21T09:09:54.9416835Z 2026-02-21T09:09:54.9416886Z { 2026-02-21T09:09:54.9416953Z .reg .pred complete; 2026-02-21T09:09:54.9417007Z waitLoop: 2026-02-21T09:09:54.9417161Z mbarrier.try_wait.parity.shared.b64 complete, [%r311], %r375; 2026-02-21T09:09:54.9417231Z @!complete bra.uni waitLoop; 2026-02-21T09:09:54.9417282Z } 2026-02-21T09:09:54.9417285Z 2026-02-21T09:09:54.9417341Z // end inline asm 2026-02-21T09:09:54.9417502Z .loc 1 64 33 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:64:33 2026-02-21T09:09:54.9417566Z shl.b32 %r280, %r376, 9; 2026-02-21T09:09:54.9417624Z add.s32 %r281, %r58, %r280; 2026-02-21T09:09:54.9417683Z add.s32 %r312, %r281, 12288; 2026-02-21T09:09:54.9417780Z add.s32 %r282, %r312, %r19; 2026-02-21T09:09:54.9417843Z ld.shared.b8 %rs80, [%r282]; 2026-02-21T09:09:54.9417909Z ld.shared.b8 %rs81, [%r282+64]; 2026-02-21T09:09:54.9417983Z ld.shared.b8 %rs82, [%r282+256]; 2026-02-21T09:09:54.9418048Z ld.shared.b8 %rs83, [%r282+320]; 2026-02-21T09:09:54.9418106Z add.s32 %r283, %r312, %r21; 2026-02-21T09:09:54.9418168Z ld.shared.b8 %rs84, [%r283+128]; 2026-02-21T09:09:54.9418242Z ld.shared.b8 %rs85, [%r283+192]; 2026-02-21T09:09:54.9418303Z ld.shared.b8 %rs86, [%r283+384]; 2026-02-21T09:09:54.9418364Z ld.shared.b8 %rs87, [%r283+448]; 2026-02-21T09:09:54.9418556Z .loc 1 67 28 // 
ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:67:28 2026-02-21T09:09:54.9418618Z shl.b16 %rs88, %rs80, 4; 2026-02-21T09:09:54.9418678Z shl.b16 %rs89, %rs81, 4; 2026-02-21T09:09:54.9418736Z shl.b16 %rs90, %rs84, 4; 2026-02-21T09:09:54.9418801Z shl.b16 %rs91, %rs85, 4; 2026-02-21T09:09:54.9418860Z shl.b16 %rs92, %rs82, 4; 2026-02-21T09:09:54.9418918Z shl.b16 %rs93, %rs83, 4; 2026-02-21T09:09:54.9418982Z shl.b16 %rs94, %rs86, 4; 2026-02-21T09:09:54.9419039Z shl.b16 %rs95, %rs87, 4; 2026-02-21T09:09:54.9419203Z .loc 1 82 58 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:82:58 2026-02-21T09:09:54.9419277Z selp.b16 %rs96, %rs88, %rs80, %p50; 2026-02-21T09:09:54.9419358Z cvt.s16.s8 %rs97, %rs96; 2026-02-21T09:09:54.9419418Z shr.s16 %rs98, %rs97, 4; 2026-02-21T09:09:54.9419484Z selp.b16 %rs99, %rs89, %rs81, %p50; 2026-02-21T09:09:54.9419551Z cvt.s16.s8 %rs100, %rs99; 2026-02-21T09:09:54.9419610Z shr.s16 %rs101, %rs100, 4; 2026-02-21T09:09:54.9419679Z selp.b16 %rs102, %rs90, %rs84, %p50; 2026-02-21T09:09:54.9419747Z cvt.s16.s8 %rs103, %rs102; 2026-02-21T09:09:54.9419804Z shr.s16 %rs104, %rs103, 4; 2026-02-21T09:09:54.9419868Z selp.b16 %rs105, %rs91, %rs85, %p50; 2026-02-21T09:09:54.9419926Z cvt.s16.s8 %rs106, %rs105; 2026-02-21T09:09:54.9419992Z shr.s16 %rs107, %rs106, 4; 2026-02-21T09:09:54.9420055Z selp.b16 %rs108, %rs92, %rs82, %p50; 2026-02-21T09:09:54.9420113Z cvt.s16.s8 %rs109, %rs108; 2026-02-21T09:09:54.9420176Z shr.s16 %rs110, %rs109, 4; 2026-02-21T09:09:54.9420237Z selp.b16 %rs111, %rs93, %rs83, %p50; 2026-02-21T09:09:54.9420295Z cvt.s16.s8 %rs112, %rs111; 2026-02-21T09:09:54.9420352Z shr.s16 %rs113, %rs112, 4; 2026-02-21T09:09:54.9420421Z selp.b16 %rs114, %rs94, %rs86, %p50; 2026-02-21T09:09:54.9420479Z cvt.s16.s8 %rs115, %rs114; 2026-02-21T09:09:54.9420535Z shr.s16 %rs116, %rs115, 4; 2026-02-21T09:09:54.9420605Z selp.b16 %rs117, %rs95, %rs87, %p50; 2026-02-21T09:09:54.9420662Z cvt.s16.s8 %rs118, %rs117; 2026-02-21T09:09:54.9420719Z shr.s16 
%rs119, %rs118, 4; 2026-02-21T09:09:54.9420891Z .loc 1 87 32 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:87:32 2026-02-21T09:09:54.9420951Z cvt.rn.f32.s16 %r284, %rs98; 2026-02-21T09:09:54.9421012Z cvt.rn.f32.s16 %r285, %rs101; 2026-02-21T09:09:54.9421075Z cvt.rn.f32.s16 %r286, %rs104; 2026-02-21T09:09:54.9421141Z cvt.rn.f32.s16 %r287, %rs107; 2026-02-21T09:09:54.9421200Z cvt.rn.f32.s16 %r288, %rs110; 2026-02-21T09:09:54.9421259Z cvt.rn.f32.s16 %r289, %rs113; 2026-02-21T09:09:54.9421324Z cvt.rn.f32.s16 %r290, %rs116; 2026-02-21T09:09:54.9421383Z cvt.rn.f32.s16 %r291, %rs119; 2026-02-21T09:09:54.9421444Z st.shared.b32 [%r23], %r284; 2026-02-21T09:09:54.9421506Z st.shared.b32 [%r24], %r285; 2026-02-21T09:09:54.9421622Z st.shared.b32 [%r25], %r286; 2026-02-21T09:09:54.9421682Z st.shared.b32 [%r26], %r287; 2026-02-21T09:09:54.9421741Z st.shared.b32 [%r27], %r288; 2026-02-21T09:09:54.9421809Z st.shared.b32 [%r28], %r289; 2026-02-21T09:09:54.9421871Z st.shared.b32 [%r29], %r290; 2026-02-21T09:09:54.9421931Z st.shared.b32 [%r30], %r291; 2026-02-21T09:09:54.9422103Z .loc 1 51 92 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:51:92 2026-02-21T09:09:54.9422161Z shl.b32 %r292, %r377, 3; 2026-02-21T09:09:54.9422249Z add.s32 %r293, %r58, %r292; 2026-02-21T09:09:54.9422307Z add.s32 %r374, %r293, 13312; 2026-02-21T09:09:54.9422366Z $L__tmp30: 2026-02-21T09:09:54.9422582Z .loc 2 291 36 // standard.py:291:36 @[ ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:94:40 ] 2026-02-21T09:09:54.9422638Z // begin inline asm 2026-02-21T09:09:54.9422717Z fence.proxy.async.shared::cta; 2026-02-21T09:09:54.9422774Z // end inline asm 2026-02-21T09:09:54.9422829Z bar.sync 0; 2026-02-21T09:09:54.9422888Z @%p54 bra $L__BB0_7; 2026-02-21T09:09:54.9422995Z // %bb.6: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:09:54.9423103Z elect.sync %r306|%p73, -1; 2026-02-21T09:09:54.9423164Z mov.b32 %r296, 67635472; 2026-02-21T09:09:54.9423230Z // begin inline asm 
2026-02-21T09:09:54.9423385Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r371 + 0 ], [ %r214 + 0 ], %rd72, %r296, %p72; 2026-02-21T09:09:54.9423442Z // end inline asm 2026-02-21T09:09:54.9423511Z // begin inline asm 2026-02-21T09:09:54.9423663Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r371 + 0 ], [ %r214 + 8 ], %rd73, %r296, %p72; 2026-02-21T09:09:54.9423719Z // end inline asm 2026-02-21T09:09:54.9423777Z // begin inline asm 2026-02-21T09:09:54.9423933Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r371 + 0 ], [ %r214 + 16 ], %rd74, %r296, %p72; 2026-02-21T09:09:54.9423989Z // end inline asm 2026-02-21T09:09:54.9424071Z // begin inline asm 2026-02-21T09:09:54.9424224Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r371 + 0 ], [ %r214 + 24 ], %rd75, %r296, %p72; 2026-02-21T09:09:54.9424280Z // end inline asm 2026-02-21T09:09:54.9424341Z cvt.u64.u32 %rd89, %r374; 2026-02-21T09:09:54.9424407Z // begin inline asm 2026-02-21T09:09:54.9424531Z @%p73 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd89]; 2026-02-21T09:09:54.9424586Z // end inline asm 2026-02-21T09:09:54.9424648Z bra.uni $L__BB0_7; 2026-02-21T09:09:54.9424701Z $L__tmp31: 2026-02-21T09:09:54.9424786Z $L__BB0_9: // %._crit_edge 2026-02-21T09:09:54.9424955Z .loc 1 31 4 // ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py:31:4 2026-02-21T09:09:54.9425017Z bar.sync 0; 2026-02-21T09:09:54.9425072Z // begin inline asm 2026-02-21T09:09:54.9425183Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r371, 64; 2026-02-21T09:09:54.9425246Z // end inline asm 2026-02-21T09:09:54.9425299Z ret; 2026-02-21T09:09:54.9425353Z $L__tmp32: 2026-02-21T09:09:54.9425409Z $L__func_end0: 2026-02-21T09:09:54.9425499Z // -- End function 2026-02-21T09:09:54.9425551Z } 2026-02-21T09:09:54.9425753Z .file 1 "/tmp/torchinductor_root/e3/ce3smaqqfa5a5wbd2sufakmgqm264x6hak7nslfel4deroch6f5w.py" 2026-02-21T09:09:54.9425931Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 
2026-02-21T09:09:54.9425993Z .section .debug_abbrev 2026-02-21T09:09:54.9426045Z { 2026-02-21T09:09:54.9426141Z .b8 1 // Abbreviation Code 2026-02-21T09:09:54.9426227Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:54.9426306Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:54.9426385Z .b8 37 // DW_AT_producer 2026-02-21T09:09:54.9426471Z .b8 8 // DW_FORM_string 2026-02-21T09:09:54.9426546Z .b8 19 // DW_AT_language 2026-02-21T09:09:54.9426654Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:54.9426737Z .b8 3 // DW_AT_name 2026-02-21T09:09:54.9426810Z .b8 8 // DW_FORM_string 2026-02-21T09:09:54.9426888Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:54.9426971Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:54.9427047Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:54.9427139Z .b8 8 // DW_FORM_string 2026-02-21T09:09:54.9427212Z .b8 0 // EOM(1) 2026-02-21T09:09:54.9427290Z .b8 0 // EOM(2) 2026-02-21T09:09:54.9427372Z .b8 2 // Abbreviation Code 2026-02-21T09:09:54.9427454Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:54.9427538Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:54.9427614Z .b8 3 // DW_AT_name 2026-02-21T09:09:54.9427686Z .b8 8 // DW_FORM_string 2026-02-21T09:09:54.9427788Z .b8 32 // DW_AT_inline 2026-02-21T09:09:54.9427862Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:54.9427929Z .b8 0 // EOM(1) 2026-02-21T09:09:54.9427993Z .b8 0 // EOM(2) 2026-02-21T09:09:54.9428079Z .b8 3 // Abbreviation Code 2026-02-21T09:09:54.9428156Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:54.9428232Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:54.9428312Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:54.9428384Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:54.9428480Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:54.9428559Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:54.9428645Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:54.9428718Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:54.9428792Z .b8 0 // EOM(1) 2026-02-21T09:09:54.9428856Z .b8 0 // EOM(2) 2026-02-21T09:09:54.9428935Z .b8 4 // Abbreviation Code 2026-02-21T09:09:54.9429028Z .b8 
29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:54.9429109Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:54.9429193Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:54.9429264Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:54.9429343Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:54.9429414Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:54.9429490Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:54.9429566Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:54.9429643Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:54.9429716Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:54.9429790Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:54.9429869Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:54.9429947Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:54.9430019Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:54.9430093Z .b8 0 // EOM(1) 2026-02-21T09:09:54.9430159Z .b8 0 // EOM(2) 2026-02-21T09:09:54.9430225Z .b8 0 // EOM(3) 2026-02-21T09:09:54.9430284Z } 2026-02-21T09:09:54.9430373Z .section .debug_info 2026-02-21T09:09:54.9430424Z { 2026-02-21T09:09:54.9430506Z .b32 178 // Length of Unit 2026-02-21T09:09:54.9430599Z .b8 2 // DWARF version number 2026-02-21T09:09:54.9430652Z .b8 0 2026-02-21T09:09:54.9430765Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:54.9430860Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:54.9430958Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:54.9431059Z .b8 116 // DW_AT_producer 2026-02-21T09:09:54.9431121Z .b8 114 2026-02-21T09:09:54.9431174Z .b8 105 2026-02-21T09:09:54.9431225Z .b8 116 2026-02-21T09:09:54.9431276Z .b8 111 2026-02-21T09:09:54.9431335Z .b8 110 2026-02-21T09:09:54.9431388Z .b8 0 2026-02-21T09:09:54.9431465Z .b8 2 // DW_AT_language 2026-02-21T09:09:54.9431527Z .b8 0 2026-02-21T09:09:54.9431632Z .b8 99 // DW_AT_name 2026-02-21T09:09:54.9431693Z .b8 101 2026-02-21T09:09:54.9431747Z .b8 51 2026-02-21T09:09:54.9431806Z .b8 115 2026-02-21T09:09:54.9431857Z .b8 109 2026-02-21T09:09:54.9431934Z .b8 97 2026-02-21T09:09:54.9431986Z .b8 113 2026-02-21T09:09:54.9432045Z .b8 113 2026-02-21T09:09:54.9432094Z .b8 102 2026-02-21T09:09:54.9432144Z .b8 97 2026-02-21T09:09:54.9432203Z .b8 53 2026-02-21T09:09:54.9432254Z .b8 97 2026-02-21T09:09:54.9432305Z .b8 53 2026-02-21T09:09:54.9432357Z .b8 119 2026-02-21T09:09:54.9432419Z .b8 98 2026-02-21T09:09:54.9432471Z .b8 100 2026-02-21T09:09:54.9432521Z .b8 50 2026-02-21T09:09:54.9432579Z .b8 115 2026-02-21T09:09:54.9432631Z .b8 117 2026-02-21T09:09:54.9432682Z .b8 102 2026-02-21T09:09:54.9432732Z .b8 97 2026-02-21T09:09:54.9432789Z .b8 107 2026-02-21T09:09:54.9432839Z .b8 109 2026-02-21T09:09:54.9432888Z .b8 103 2026-02-21T09:09:54.9432945Z .b8 113 2026-02-21T09:09:54.9432995Z .b8 109 2026-02-21T09:09:54.9433070Z .b8 50 2026-02-21T09:09:54.9433122Z .b8 54 2026-02-21T09:09:54.9433181Z .b8 52 2026-02-21T09:09:54.9433232Z .b8 120 2026-02-21T09:09:54.9433281Z .b8 54 2026-02-21T09:09:54.9433332Z .b8 104 2026-02-21T09:09:54.9433391Z .b8 97 2026-02-21T09:09:54.9433442Z .b8 107 2026-02-21T09:09:54.9433492Z .b8 55 2026-02-21T09:09:54.9433549Z .b8 110 2026-02-21T09:09:54.9433600Z .b8 115 2026-02-21T09:09:54.9433650Z .b8 108 2026-02-21T09:09:54.9433700Z .b8 102 2026-02-21T09:09:54.9433757Z .b8 101 
2026-02-21T09:09:54.9433806Z .b8 108 2026-02-21T09:09:54.9433859Z .b8 52 2026-02-21T09:09:54.9433916Z .b8 100 2026-02-21T09:09:54.9433966Z .b8 101 2026-02-21T09:09:54.9434016Z .b8 114 2026-02-21T09:09:54.9434066Z .b8 111 2026-02-21T09:09:54.9434123Z .b8 99 2026-02-21T09:09:54.9434172Z .b8 104 2026-02-21T09:09:54.9434222Z .b8 54 2026-02-21T09:09:54.9434270Z .b8 102 2026-02-21T09:09:54.9434327Z .b8 53 2026-02-21T09:09:54.9434377Z .b8 119 2026-02-21T09:09:54.9434426Z .b8 46 2026-02-21T09:09:54.9434483Z .b8 112 2026-02-21T09:09:54.9434536Z .b8 121 2026-02-21T09:09:54.9434586Z .b8 0 2026-02-21T09:09:54.9434676Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:54.9434757Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:54.9434808Z .b8 116 2026-02-21T09:09:54.9434858Z .b8 109 2026-02-21T09:09:54.9434914Z .b8 112 2026-02-21T09:09:54.9434962Z .b8 47 2026-02-21T09:09:54.9435011Z .b8 116 2026-02-21T09:09:54.9435061Z .b8 111 2026-02-21T09:09:54.9435118Z .b8 114 2026-02-21T09:09:54.9435170Z .b8 99 2026-02-21T09:09:54.9435223Z .b8 104 2026-02-21T09:09:54.9435282Z .b8 105 2026-02-21T09:09:54.9435333Z .b8 110 2026-02-21T09:09:54.9435384Z .b8 100 2026-02-21T09:09:54.9435436Z .b8 117 2026-02-21T09:09:54.9435494Z .b8 99 2026-02-21T09:09:54.9435546Z .b8 116 2026-02-21T09:09:54.9435597Z .b8 111 2026-02-21T09:09:54.9435657Z .b8 114 2026-02-21T09:09:54.9435710Z .b8 95 2026-02-21T09:09:54.9435763Z .b8 114 2026-02-21T09:09:54.9435814Z .b8 111 2026-02-21T09:09:54.9435874Z .b8 111 2026-02-21T09:09:54.9435956Z .b8 116 2026-02-21T09:09:54.9436008Z .b8 47 2026-02-21T09:09:54.9436061Z .b8 101 2026-02-21T09:09:54.9436121Z .b8 51 2026-02-21T09:09:54.9436173Z .b8 0 2026-02-21T09:09:54.9436278Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:54.9436361Z .b8 95 // DW_AT_name 2026-02-21T09:09:54.9436413Z .b8 104 2026-02-21T09:09:54.9436465Z .b8 101 2026-02-21T09:09:54.9436516Z .b8 108 2026-02-21T09:09:54.9436576Z .b8 105 2026-02-21T09:09:54.9436628Z .b8 111 2026-02-21T09:09:54.9436710Z 
.b8 110 2026-02-21T09:09:54.9436772Z .b8 95 2026-02-21T09:09:54.9436824Z .b8 109 2026-02-21T09:09:54.9436877Z .b8 97 2026-02-21T09:09:54.9436931Z .b8 116 2026-02-21T09:09:54.9436993Z .b8 109 2026-02-21T09:09:54.9437048Z .b8 117 2026-02-21T09:09:54.9437101Z .b8 108 2026-02-21T09:09:54.9437165Z .b8 95 2026-02-21T09:09:54.9437220Z .b8 98 2026-02-21T09:09:54.9437275Z .b8 102 2026-02-21T09:09:54.9437328Z .b8 49 2026-02-21T09:09:54.9437390Z .b8 54 2026-02-21T09:09:54.9437444Z .b8 95 2026-02-21T09:09:54.9437496Z .b8 105 2026-02-21T09:09:54.9437554Z .b8 110 2026-02-21T09:09:54.9437608Z .b8 116 2026-02-21T09:09:54.9437660Z .b8 52 2026-02-21T09:09:54.9437715Z .b8 0 2026-02-21T09:09:54.9437822Z .b8 1 // DW_AT_inline 2026-02-21T09:09:54.9437925Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:54.9438014Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:54.9438112Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:54.9438207Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:54.9438322Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:54.9438412Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:54.9438504Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:54.9438614Z .b64 $L__tmp31 // DW_AT_high_pc 2026-02-21T09:09:54.9438696Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:54.9438783Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:54.9438866Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:54.9438954Z .b8 0 // End Of Children Mark 2026-02-21T09:09:54.9439047Z .b8 0 // End Of Children Mark 2026-02-21T09:09:54.9439100Z } 2026-02-21T09:09:54.9439168Z .section .debug_macinfo { } 2026-02-21T09:09:54.9439174Z 2026-02-21T09:09:54.9439260Z ================================================================ 2026-02-21T09:09:54.9439367Z please share the reproducer above with Triton project. 2026-02-21T09:09:55.3602277Z [182s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:55.3602610Z 2026-02-21T09:09:55.3604478Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 16], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:55.3605628Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:55.3605876Z `ptxas` stderr: 2026-02-21T09:09:55.3606322Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 519 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:55.3606815Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:55.3606965Z 2026-02-21T09:09:55.3607369Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp5fpl6k31.ptx -o /tmp/tmp5fpl6k31.ptx.o 2026-02-21T09:09:55.3608035Z 2026-02-21T09:09:55.3608168Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:55.3608367Z 2026-02-21T09:09:55.3608376Z 2026-02-21T09:09:55.3608462Z ================================================================ 2026-02-21T09:09:55.3608697Z Internal Triton PTX codegen error 2026-02-21T09:09:55.3608872Z `ptxas` stderr: 2026-02-21T09:09:55.3609305Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 519 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:55.3609846Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:55.3609999Z 2026-02-21T09:09:55.3610364Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp5fpl6k31.ptx -o /tmp/tmp5fpl6k31.ptx.o 2026-02-21T09:09:55.3610783Z 2026-02-21T09:09:55.3610791Z 2026-02-21T09:09:55.3610861Z // 2026-02-21T09:09:55.3611006Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:55.3611193Z // 2026-02-21T09:09:55.3611262Z 2026-02-21T09:09:55.3611323Z .version 8.7 2026-02-21T09:09:55.3611517Z .target sm_100a 2026-02-21T09:09:55.3611729Z .address_size 64 2026-02-21T09:09:55.3611820Z 2026-02-21T09:09:55.3611967Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:55.3612257Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:55.3612475Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:55.3612696Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:55.3612938Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:55.3613227Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:55.3613507Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:55.3613827Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:55.3614116Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:55.3614331Z ) 2026-02-21T09:09:55.3614458Z .reqntid 128 2026-02-21T09:09:55.3614587Z .maxnreg 32 2026-02-21T09:09:55.3614715Z { 2026-02-21T09:09:55.3614840Z .reg .pred %p<99>; 2026-02-21T09:09:55.3614996Z .reg .b16 %rs<80>; 2026-02-21T09:09:55.3615134Z .reg .b32 %r<331>; 2026-02-21T09:09:55.3615279Z .reg .b64 %rd<112>; 2026-02-21T09:09:55.3615577Z $L__func_begin0: 2026-02-21T09:09:55.3615659Z 2026-02-21T09:09:55.3615713Z // %bb.0: 2026-02-21T09:09:55.3615955Z .loc 1 19 0 // 
ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:19 2026-02-21T09:09:55.3616243Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:55.3616436Z ld.param.b64 %rd17, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:55.3616652Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:55.3616858Z ld.param.b64 %rd35, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:55.3617077Z mov.b32 %r50, global_smem; 2026-02-21T09:09:55.3617235Z // begin inline asm 2026-02-21T09:09:55.3617479Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r50], 64; 2026-02-21T09:09:55.3617725Z // end inline asm 2026-02-21T09:09:55.3617905Z ld.param.b64 %rd52, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:55.3618104Z bar.sync 0; 2026-02-21T09:09:55.3618253Z ld.shared.b32 %r323, [global_smem]; 2026-02-21T09:09:55.3618421Z bar.sync 0; 2026-02-21T09:09:55.3618557Z // begin inline asm 2026-02-21T09:09:55.3618760Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:55.3618991Z // end inline asm 2026-02-21T09:09:55.3619248Z .loc 1 21 66 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:21:66 2026-02-21T09:09:55.3619536Z mov.u32 %r67, %ctaid.x; 2026-02-21T09:09:55.3619694Z mov.u32 %r68, %ctaid.y; 2026-02-21T09:09:55.3619844Z mov.u32 %r69, %ctaid.z; 2026-02-21T09:09:55.3620005Z mov.u32 %r70, %nctaid.x; 2026-02-21T09:09:55.3620213Z mov.u32 %r71, %nctaid.y; 2026-02-21T09:09:55.3620384Z mad.lo.s32 %r72, %r69, %r71, %r68; 2026-02-21T09:09:55.3620558Z mad.lo.s32 %r73, %r72, %r70, %r67; 2026-02-21T09:09:55.3620736Z shl.b32 %r74, %r73, 8; 2026-02-21T09:09:55.3620894Z cvt.s64.s32 %rd53, %r74; 2026-02-21T09:09:55.3621050Z add.s64 %rd31, %rd52, %rd53; 2026-02-21T09:09:55.3621223Z shl.b32 %r75, %r1, 2; 2026-02-21T09:09:55.3621377Z add.s32 %r51, %r50, %r75; 2026-02-21T09:09:55.3621587Z mov.b32 %r60, 0; 2026-02-21T09:09:55.3621763Z // begin inline asm 2026-02-21T09:09:55.3621924Z @%p1 st.shared.b32 [ %r51 + 0 ], %r60; 2026-02-21T09:09:55.3622096Z // end inline asm 
2026-02-21T09:09:55.3622246Z bar.warp.sync -1; 2026-02-21T09:09:55.3622397Z setp.eq.b32 %p4, %r1, 0; 2026-02-21T09:09:55.3622565Z cvt.u64.u32 %rd16, %r50; 2026-02-21T09:09:55.3622726Z // begin inline asm 2026-02-21T09:09:55.3622983Z @%p4 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd16 + 0 ], %rd17; 2026-02-21T09:09:55.3623273Z // end inline asm 2026-02-21T09:09:55.3623414Z // begin inline asm 2026-02-21T09:09:55.3623652Z @%p4 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1; 2026-02-21T09:09:55.3623934Z // end inline asm 2026-02-21T09:09:55.3624078Z mov.b32 %r53, 16; 2026-02-21T09:09:55.3624214Z // begin inline asm 2026-02-21T09:09:55.3624449Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r53; 2026-02-21T09:09:55.3624715Z // end inline asm 2026-02-21T09:09:55.3624849Z // begin inline asm 2026-02-21T09:09:55.3625091Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r53; 2026-02-21T09:09:55.3625355Z // end inline asm 2026-02-21T09:09:55.3625500Z mov.b32 %r55, 8192; 2026-02-21T09:09:55.3625643Z // begin inline asm 2026-02-21T09:09:55.3625891Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r55; 2026-02-21T09:09:55.3626174Z // end inline asm 2026-02-21T09:09:55.3626347Z mov.b32 %r56, 512; 2026-02-21T09:09:55.3626500Z // begin inline asm 2026-02-21T09:09:55.3626742Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r56; 2026-02-21T09:09:55.3627029Z // end inline asm 2026-02-21T09:09:55.3627177Z mov.b64 %rd24, 8192; 2026-02-21T09:09:55.3627336Z // begin inline asm 2026-02-21T09:09:55.3627603Z @%p4 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd16 + 0 ], 0x0, %rd24; 2026-02-21T09:09:55.3627902Z // end inline asm 2026-02-21T09:09:55.3628051Z mov.b32 %r57, 1; 2026-02-21T09:09:55.3628197Z // begin inline asm 2026-02-21T09:09:55.3628466Z @%p4 
tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r57; 2026-02-21T09:09:55.3628764Z // end inline asm 2026-02-21T09:09:55.3628913Z // begin inline asm 2026-02-21T09:09:55.3629172Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r57; 2026-02-21T09:09:55.3629473Z // end inline asm 2026-02-21T09:09:55.3629624Z // begin inline asm 2026-02-21T09:09:55.3629873Z @%p4 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:55.3630156Z // end inline asm 2026-02-21T09:09:55.3630302Z // begin inline asm 2026-02-21T09:09:55.3630576Z @%p4 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:55.3630869Z // end inline asm 2026-02-21T09:09:55.3631018Z // begin inline asm 2026-02-21T09:09:55.3631298Z @%p4 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:55.3631625Z // end inline asm 2026-02-21T09:09:55.3631764Z // begin inline asm 2026-02-21T09:09:55.3632002Z @%p4 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:55.3632260Z // end inline asm 2026-02-21T09:09:55.3632404Z // begin inline asm 2026-02-21T09:09:55.3632754Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd31 + 0 ], [ %rd16 + 0 ], 0x80; 2026-02-21T09:09:55.3633176Z // end inline asm 2026-02-21T09:09:55.3633325Z // begin inline asm 2026-02-21T09:09:55.3633538Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd31 + 0 ], 0x80; 2026-02-21T09:09:55.3633810Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:55.3634014Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:55.3634195Z // end inline asm 2026-02-21T09:09:55.3634328Z bar.sync 0; 2026-02-21T09:09:55.3634473Z cvta.global.u64 %rd79, %rd31; 2026-02-21T09:09:55.3634751Z .loc 1 23 67 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:23:67 2026-02-21T09:09:55.3635079Z 
add.s64 %rd49, %rd31, 128; 2026-02-21T09:09:55.3635242Z bar.sync 0; 2026-02-21T09:09:55.3635373Z // begin inline asm 2026-02-21T09:09:55.3635530Z @%p1 st.shared.b32 [ %r51 + 0 ], %r60; 2026-02-21T09:09:55.3635698Z // end inline asm 2026-02-21T09:09:55.3635845Z bar.warp.sync -1; 2026-02-21T09:09:55.3635985Z // begin inline asm 2026-02-21T09:09:55.3636237Z @%p4 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd16 + 0 ], %rd35; 2026-02-21T09:09:55.3636510Z // end inline asm 2026-02-21T09:09:55.3636650Z // begin inline asm 2026-02-21T09:09:55.3636902Z @%p4 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1; 2026-02-21T09:09:55.3637145Z // end inline asm 2026-02-21T09:09:55.3637285Z // begin inline asm 2026-02-21T09:09:55.3637511Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r53; 2026-02-21T09:09:55.3637771Z // end inline asm 2026-02-21T09:09:55.3637904Z mov.b32 %r62, 64; 2026-02-21T09:09:55.3638046Z // begin inline asm 2026-02-21T09:09:55.3638269Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r62; 2026-02-21T09:09:55.3638529Z // end inline asm 2026-02-21T09:09:55.3638667Z // begin inline asm 2026-02-21T09:09:55.3638897Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r55; 2026-02-21T09:09:55.3639197Z // end inline asm 2026-02-21T09:09:55.3639336Z mov.b32 %r64, 4096; 2026-02-21T09:09:55.3639484Z // begin inline asm 2026-02-21T09:09:55.3639718Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r64; 2026-02-21T09:09:55.3639986Z // end inline asm 2026-02-21T09:09:55.3640128Z mov.b64 %rd42, 16384; 2026-02-21T09:09:55.3640272Z // begin inline asm 2026-02-21T09:09:55.3640523Z @%p4 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd16 + 0 ], 0x0, %rd42; 2026-02-21T09:09:55.3640799Z // end inline asm 2026-02-21T09:09:55.3640941Z // begin inline asm 2026-02-21T09:09:55.3641190Z @%p4 
tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0, %r57; 2026-02-21T09:09:55.3641477Z // end inline asm 2026-02-21T09:09:55.3641668Z // begin inline asm 2026-02-21T09:09:55.3641928Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1, %r57; 2026-02-21T09:09:55.3642213Z // end inline asm 2026-02-21T09:09:55.3642347Z // begin inline asm 2026-02-21T09:09:55.3642581Z @%p4 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd16 + 0 ], 0xa; 2026-02-21T09:09:55.3642838Z // end inline asm 2026-02-21T09:09:55.3642980Z // begin inline asm 2026-02-21T09:09:55.3643226Z @%p4 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:55.3643514Z // end inline asm 2026-02-21T09:09:55.3643653Z // begin inline asm 2026-02-21T09:09:55.3643888Z @%p4 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x1; 2026-02-21T09:09:55.3644156Z // end inline asm 2026-02-21T09:09:55.3644288Z // begin inline asm 2026-02-21T09:09:55.3644518Z @%p4 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd16 + 0 ], 0x0; 2026-02-21T09:09:55.3644765Z // end inline asm 2026-02-21T09:09:55.3644906Z // begin inline asm 2026-02-21T09:09:55.3645241Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd49 + 0 ], [ %rd16 + 0 ], 0x80; 2026-02-21T09:09:55.3645600Z // end inline asm 2026-02-21T09:09:55.3645772Z // begin inline asm 2026-02-21T09:09:55.3645977Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd49 + 0 ], 0x80; 2026-02-21T09:09:55.3646230Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:55.3646420Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:55.3646602Z // end inline asm 2026-02-21T09:09:55.3646735Z bar.sync 0; 2026-02-21T09:09:55.3646881Z cvta.global.u64 %rd93, %rd49; 2026-02-21T09:09:55.3647162Z .loc 1 29 35 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:29:35 2026-02-21T09:09:55.3647474Z 
shl.b32 %r324, %r67, 2; 2026-02-21T09:09:55.3647741Z .loc 1 30 37 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:30:37 2026-02-21T09:09:55.3648026Z add.s32 %r76, %r324, 4; 2026-02-21T09:09:55.3648290Z .loc 1 30 49 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:30:49 2026-02-21T09:09:55.3648581Z min.s32 %r4, %r76, 32768; 2026-02-21T09:09:55.3648856Z .loc 1 31 88 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:31:88 2026-02-21T09:09:55.3649152Z setp.ge.s32 %p39, %r324, %r4; 2026-02-21T09:09:55.3649319Z @%p39 bra $L__BB0_9; 2026-02-21T09:09:55.3649551Z // %bb.1: // %.lr.ph 2026-02-21T09:09:55.3649853Z .loc 1 0 88 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:0:88 2026-02-21T09:09:55.3650189Z ld.param.b64 %rd15, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:55.3650397Z shr.u32 %r5, %r1, 5; 2026-02-21T09:09:55.3650548Z shr.u32 %r6, %r1, 2; 2026-02-21T09:09:55.3650699Z bfe.u32 %r7, %r1, 2, 5; 2026-02-21T09:09:55.3650847Z shl.b32 %r77, %r1, 3; 2026-02-21T09:09:55.3650999Z and.b32 %r8, %r77, 24; 2026-02-21T09:09:55.3651147Z and.b32 %r9, %r1, 16; 2026-02-21T09:09:55.3651298Z shl.b32 %r78, %r1, 4; 2026-02-21T09:09:55.3651444Z and.b32 %r10, %r78, 2032; 2026-02-21T09:09:55.3651670Z add.s32 %r205, %r50, %r10; 2026-02-21T09:09:55.3651833Z add.s32 %r207, %r205, 2048; 2026-02-21T09:09:55.3652000Z add.s32 %r191, %r323, 16; 2026-02-21T09:09:55.3652152Z or.b32 %r14, %r8, 32; 2026-02-21T09:09:55.3652308Z add.s32 %r135, %r205, 4096; 2026-02-21T09:09:55.3652475Z add.s32 %r137, %r205, 6144; 2026-02-21T09:09:55.3652633Z and.b32 %r80, %r1, 15; 2026-02-21T09:09:55.3652788Z shl.b32 %r81, %r80, 6; 2026-02-21T09:09:55.3652933Z and.b32 %r82, %r1, 96; 2026-02-21T09:09:55.3653085Z shl.b32 %r83, %r82, 5; 2026-02-21T09:09:55.3653228Z shl.b32 %r84, %r9, 1; 2026-02-21T09:09:55.3653383Z or.b32 %r85, %r81, %r83; 2026-02-21T09:09:55.3653534Z or.b32 %r17, %r85, %r84; 2026-02-21T09:09:55.3653694Z add.s32 %r18, %r50, %r17; 2026-02-21T09:09:55.3653843Z 
shr.u32 %r86, %r82, 1; 2026-02-21T09:09:55.3653993Z or.b32 %r19, %r86, %r80; 2026-02-21T09:09:55.3654147Z add.s32 %r87, %r50, %r19; 2026-02-21T09:09:55.3654300Z add.s32 %r20, %r87, 10240; 2026-02-21T09:09:55.3654461Z shl.b32 %r88, %r80, 7; 2026-02-21T09:09:55.3654608Z and.b32 %r89, %r78, 112; 2026-02-21T09:09:55.3654767Z and.b32 %r90, %r6, 28; 2026-02-21T09:09:55.3654913Z xor.b32 %r91, %r89, %r90; 2026-02-21T09:09:55.3655068Z or.b32 %r92, %r91, %r88; 2026-02-21T09:09:55.3655218Z add.s32 %r93, %r50, 8192; 2026-02-21T09:09:55.3655376Z add.s32 %r21, %r93, %r92; 2026-02-21T09:09:55.3655525Z xor.b32 %r94, %r92, 32; 2026-02-21T09:09:55.3655680Z add.s32 %r22, %r93, %r94; 2026-02-21T09:09:55.3655833Z xor.b32 %r95, %r92, 64; 2026-02-21T09:09:55.3655979Z add.s32 %r23, %r93, %r95; 2026-02-21T09:09:55.3656136Z xor.b32 %r96, %r92, 96; 2026-02-21T09:09:55.3656278Z add.s32 %r24, %r93, %r96; 2026-02-21T09:09:55.3656437Z bfe.u32 %r97, %r93, 4, 14; 2026-02-21T09:09:55.3656592Z cvt.u64.u32 %rd54, %r97; 2026-02-21T09:09:55.3656759Z or.b64 %rd72, %rd54, 4611686293313683456; 2026-02-21T09:09:55.3656938Z add.s32 %r98, %r50, 8224; 2026-02-21T09:09:55.3657096Z bfe.u32 %r99, %r98, 4, 14; 2026-02-21T09:09:55.3657249Z cvt.u64.u32 %rd55, %r99; 2026-02-21T09:09:55.3657449Z or.b64 %rd73, %rd55, 4611686293313683456; 2026-02-21T09:09:55.3657630Z add.s32 %r100, %r50, 8256; 2026-02-21T09:09:55.3657785Z bfe.u32 %r101, %r100, 4, 14; 2026-02-21T09:09:55.3657948Z cvt.u64.u32 %rd56, %r101; 2026-02-21T09:09:55.3658102Z or.b64 %rd74, %rd56, 4611686293313683456; 2026-02-21T09:09:55.3658282Z add.s32 %r102, %r50, 8288; 2026-02-21T09:09:55.3658436Z bfe.u32 %r103, %r102, 4, 14; 2026-02-21T09:09:55.3658595Z cvt.u64.u32 %rd57, %r103; 2026-02-21T09:09:55.3658750Z or.b64 %rd75, %rd57, 4611686293313683456; 2026-02-21T09:09:55.3658964Z or.b32 %r25, %r8, 64; 2026-02-21T09:09:55.3659108Z shl.b32 %r104, %r1, 5; 2026-02-21T09:09:55.3659261Z and.b32 %r105, %r104, 352; 2026-02-21T09:09:55.3659419Z shl.b32 %r106, %r82, 4; 
2026-02-21T09:09:55.3659567Z bfe.s32 %r107, %r1, 2, 1; 2026-02-21T09:09:55.3659725Z and.b32 %r108, %r107, 144; 2026-02-21T09:09:55.3659879Z or.b32 %r109, %r108, %r106; 2026-02-21T09:09:55.3660042Z xor.b32 %r110, %r109, %r9; 2026-02-21T09:09:55.3660202Z add.s32 %r111, %r50, %r105; 2026-02-21T09:09:55.3660368Z add.s32 %r26, %r111, %r110; 2026-02-21T09:09:55.3660641Z .loc 1 31 88 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:31:88 2026-02-21T09:09:55.3660972Z and.b32 %r112, %r1, 3; 2026-02-21T09:09:55.3661138Z mad.wide.u32 %rd58, %r112, 16, %rd15; 2026-02-21T09:09:55.3661310Z add.s64 %rd7, %rd58, 65728; 2026-02-21T09:09:55.3661471Z shl.b32 %r27, %r7, 10; 2026-02-21T09:09:55.3661643Z bra.uni $L__BB0_2; 2026-02-21T09:09:55.3661836Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:55.3662163Z .loc 1 0 88 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:0:88 2026-02-21T09:09:55.3662457Z mov.b32 %r293, 1; 2026-02-21T09:09:55.3662595Z $L__tmp0: 2026-02-21T09:09:55.3662897Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3663272Z // begin inline asm 2026-02-21T09:09:55.3663412Z 2026-02-21T09:09:55.3663538Z { 2026-02-21T09:09:55.3663663Z .reg .pred complete; 2026-02-21T09:09:55.3663819Z waitLoop: 2026-02-21T09:09:55.3664008Z mbarrier.try_wait.parity.shared.b64 complete, [%r326], %r293; 2026-02-21T09:09:55.3664251Z @!complete bra.uni waitLoop; 2026-02-21T09:09:55.3664403Z } 2026-02-21T09:09:55.3664475Z 2026-02-21T09:09:55.3664531Z // end inline asm 2026-02-21T09:09:55.3664672Z $L__tmp1: 2026-02-21T09:09:55.3664909Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3665206Z cp.async.wait_group 0; 2026-02-21T09:09:55.3665354Z bar.sync 0; 2026-02-21T09:09:55.3665493Z // begin inline asm 2026-02-21T09:09:55.3665657Z @%p4 mbarrier.inval.shared::cta.b64 [%r209]; 2026-02-21T09:09:55.3665847Z // end inline asm 
2026-02-21T09:09:55.3665977Z bar.sync 0; 2026-02-21T09:09:55.3666110Z // begin inline asm 2026-02-21T09:09:55.3666271Z @%p4 mbarrier.inval.shared::cta.b64 [%r125]; 2026-02-21T09:09:55.3666463Z // end inline asm 2026-02-21T09:09:55.3666606Z add.s32 %r296, %r50, 10752; 2026-02-21T09:09:55.3666755Z // begin inline asm 2026-02-21T09:09:55.3666920Z @%p4 mbarrier.inval.shared::cta.b64 [%r296]; 2026-02-21T09:09:55.3667097Z // end inline asm 2026-02-21T09:09:55.3667235Z bar.sync 0; 2026-02-21T09:09:55.3667362Z // begin inline asm 2026-02-21T09:09:55.3667545Z @%p4 mbarrier.inval.shared::cta.b64 [%r123]; 2026-02-21T09:09:55.3667733Z // end inline asm 2026-02-21T09:09:55.3667880Z $L__tmp2: 2026-02-21T09:09:55.3668193Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3668539Z // begin inline asm 2026-02-21T09:09:55.3668845Z tcgen05.ld.sync.aligned.16x32bx2.x8.b32 {%r298, %r299, %r300, %r301, %r302, %r303, %r304, %r305}, [%r306 + 0], 8; 2026-02-21T09:09:55.3669165Z // end inline asm 2026-02-21T09:09:55.3669314Z // begin inline asm 2026-02-21T09:09:55.3669474Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:55.3669687Z // end inline asm 2026-02-21T09:09:55.3669835Z cvt.u64.u32 %rd94, %r298; 2026-02-21T09:09:55.3670012Z cvt.u64.u32 %rd95, %r299; 2026-02-21T09:09:55.3670189Z shl.b64 %rd96, %rd95, 32; 2026-02-21T09:09:55.3670367Z or.b64 %rd97, %rd94, %rd96; 2026-02-21T09:09:55.3670534Z $L__tmp3: 2026-02-21T09:09:55.3670786Z .loc 1 97 28 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:97:28 2026-02-21T09:09:55.3671096Z mov.b64 {%r310, %r311}, %rd97; 2026-02-21T09:09:55.3671312Z cvt.rn.bf16x2.f32 %r312, %r311, %r310; 2026-02-21T09:09:55.3671501Z $L__tmp4: 2026-02-21T09:09:55.3671834Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3672189Z cvt.u64.u32 %rd98, %r300; 2026-02-21T09:09:55.3672354Z cvt.u64.u32 %rd99, %r301; 
2026-02-21T09:09:55.3672516Z shl.b64 %rd100, %rd99, 32; 2026-02-21T09:09:55.3672692Z or.b64 %rd101, %rd98, %rd100; 2026-02-21T09:09:55.3672853Z $L__tmp5: 2026-02-21T09:09:55.3673113Z .loc 1 97 28 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:97:28 2026-02-21T09:09:55.3673451Z mov.b64 {%r313, %r314}, %rd101; 2026-02-21T09:09:55.3673645Z cvt.rn.bf16x2.f32 %r315, %r314, %r313; 2026-02-21T09:09:55.3673819Z $L__tmp6: 2026-02-21T09:09:55.3674115Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3674467Z cvt.u64.u32 %rd102, %r302; 2026-02-21T09:09:55.3674630Z cvt.u64.u32 %rd103, %r303; 2026-02-21T09:09:55.3674801Z shl.b64 %rd104, %rd103, 32; 2026-02-21T09:09:55.3674968Z or.b64 %rd105, %rd102, %rd104; 2026-02-21T09:09:55.3675135Z $L__tmp7: 2026-02-21T09:09:55.3675380Z .loc 1 97 28 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:97:28 2026-02-21T09:09:55.3675698Z mov.b64 {%r316, %r317}, %rd105; 2026-02-21T09:09:55.3675901Z cvt.rn.bf16x2.f32 %r318, %r317, %r316; 2026-02-21T09:09:55.3676074Z $L__tmp8: 2026-02-21T09:09:55.3676357Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3676684Z cvt.u64.u32 %rd106, %r304; 2026-02-21T09:09:55.3676844Z cvt.u64.u32 %rd107, %r305; 2026-02-21T09:09:55.3676996Z shl.b64 %rd108, %rd107, 32; 2026-02-21T09:09:55.3677161Z or.b64 %rd109, %rd106, %rd108; 2026-02-21T09:09:55.3677313Z $L__tmp9: 2026-02-21T09:09:55.3677549Z .loc 1 97 28 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:97:28 2026-02-21T09:09:55.3677842Z mov.b64 {%r319, %r320}, %rd109; 2026-02-21T09:09:55.3678010Z cvt.rn.bf16x2.f32 %r321, %r320, %r319; 2026-02-21T09:09:55.3678297Z .loc 1 98 43 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:98:43 2026-02-21T09:09:55.3678608Z st.shared.v4.b32 [%r26], {%r312, %r315, %r318, %r321}; 2026-02-21T09:09:55.3678818Z // begin inline asm 
2026-02-21T09:09:55.3678977Z fence.proxy.async.shared::cta; 2026-02-21T09:09:55.3679147Z // end inline asm 2026-02-21T09:09:55.3679279Z bar.sync 0; 2026-02-21T09:09:55.3679425Z elect.sync %r322|%p96, -1; 2026-02-21T09:09:55.3679597Z and.pred %p94, %p1, %p96; 2026-02-21T09:09:55.3679754Z // begin inline asm 2026-02-21T09:09:55.3680018Z @%p94 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd93, {%r307, %r308}], [%r50]; 2026-02-21T09:09:55.3680302Z // end inline asm 2026-02-21T09:09:55.3680455Z cp.async.bulk.commit_group; 2026-02-21T09:09:55.3680632Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:55.3680811Z bar.sync 0; 2026-02-21T09:09:55.3681052Z .loc 1 31 88 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:31:88 2026-02-21T09:09:55.3681347Z add.s32 %r324, %r324, 1; 2026-02-21T09:09:55.3681513Z setp.ne.b32 %p97, %r324, %r4; 2026-02-21T09:09:55.3681725Z @%p97 bra $L__BB0_2; 2026-02-21T09:09:55.3681877Z bra.uni $L__BB0_9; 2026-02-21T09:09:55.3682096Z $L__BB0_2: // =>This Loop Header: Depth=1 2026-02-21T09:09:55.3682340Z // Child Loop BB0_5 Depth 2 2026-02-21T09:09:55.3682649Z .loc 1 81 38 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:81:38 2026-02-21T09:09:55.3682945Z setp.eq.b32 %p50, %r9, 0; 2026-02-21T09:09:55.3683212Z .loc 1 37 35 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:37:35 2026-02-21T09:09:55.3683529Z shr.s32 %r163, %r324, 31; 2026-02-21T09:09:55.3683691Z shr.u32 %r164, %r163, 22; 2026-02-21T09:09:55.3683844Z add.s32 %r165, %r324, %r164; 2026-02-21T09:09:55.3684008Z shr.s32 %r166, %r165, 10; 2026-02-21T09:09:55.3684267Z .loc 1 40 45 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:40:45 2026-02-21T09:09:55.3684559Z and.b32 %r167, %r165, 64512; 2026-02-21T09:09:55.3684716Z sub.s32 %r168, %r324, %r167; 2026-02-21T09:09:55.3684991Z .loc 1 40 64 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:40:64 2026-02-21T09:09:55.3685284Z cvt.u16.u32 %rs1, %r168; 2026-02-21T09:09:55.3685574Z .loc 1 41 
51 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:41:51 2026-02-21T09:09:55.3685863Z shr.s16 %rs2, %rs1, 15; 2026-02-21T09:09:55.3686019Z shr.u16 %rs3, %rs2, 12; 2026-02-21T09:09:55.3686181Z add.s16 %rs4, %rs1, %rs3; 2026-02-21T09:09:55.3686337Z shr.s16 %rs5, %rs4, 4; 2026-02-21T09:09:55.3686606Z .loc 1 40 64 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:40:64 2026-02-21T09:09:55.3686898Z and.b16 %rs6, %rs4, -16; 2026-02-21T09:09:55.3687052Z sub.s16 %rs7, %rs1, %rs6; 2026-02-21T09:09:55.3687315Z .loc 1 42 27 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:42:27 2026-02-21T09:09:55.3687598Z shl.b32 %r169, %r166, 8; 2026-02-21T09:09:55.3687796Z mad.wide.s16 %r307, %rs7, 16, %r169; 2026-02-21T09:09:55.3688074Z .loc 1 43 27 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:43:27 2026-02-21T09:09:55.3688365Z mul.wide.s16 %r308, %rs5, 64; 2026-02-21T09:09:55.3688628Z .loc 1 44 32 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:44:32 2026-02-21T09:09:55.3688916Z or.b32 %r170, %r308, %r7; 2026-02-21T09:09:55.3689171Z .loc 1 58 53 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:53 2026-02-21T09:09:55.3689447Z shl.b32 %r171, %r170, 10; 2026-02-21T09:09:55.3689599Z $L__tmp10: 2026-02-21T09:09:55.3689877Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3690219Z shfl.sync.idx.b32 %r32, %r5, 0, 31, -1; 2026-02-21T09:09:55.3690399Z shl.b32 %r172, %r32, 21; 2026-02-21T09:09:55.3690559Z and.b32 %r173, %r172, 6291456; 2026-02-21T09:09:55.3690728Z add.s32 %r306, %r173, %r323; 2026-02-21T09:09:55.3690886Z mov.pred %p57, -1; 2026-02-21T09:09:55.3691035Z mov.b32 %r325, 0; 2026-02-21T09:09:55.3691172Z // begin inline asm 2026-02-21T09:09:55.3691472Z @%p57 tcgen05.st.sync.aligned.16x32bx2.x8.b32 [%r306 + 0], 8, {%r325, %r325, %r325, %r325, %r325, %r325, %r325, %r325}; 2026-02-21T09:09:55.3691813Z // end inline asm 2026-02-21T09:09:55.3691957Z // 
begin inline asm 2026-02-21T09:09:55.3692114Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:55.3692286Z // end inline asm 2026-02-21T09:09:55.3692431Z bar.sync 0; 2026-02-21T09:09:55.3692562Z $L__tmp11: 2026-02-21T09:09:55.3692807Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3693097Z add.s32 %r326, %r50, 10752; 2026-02-21T09:09:55.3693259Z // begin inline asm 2026-02-21T09:09:55.3693423Z @%p4 mbarrier.init.shared::cta.b64 [%r326], 1; 2026-02-21T09:09:55.3693618Z // end inline asm 2026-02-21T09:09:55.3693749Z bar.sync 0; 2026-02-21T09:09:55.3693888Z add.s32 %r123, %r50, 10760; 2026-02-21T09:09:55.3694081Z // begin inline asm 2026-02-21T09:09:55.3694246Z @%p4 mbarrier.init.shared::cta.b64 [%r123], 1; 2026-02-21T09:09:55.3694440Z // end inline asm 2026-02-21T09:09:55.3694581Z add.s32 %r209, %r50, 10768; 2026-02-21T09:09:55.3694743Z // begin inline asm 2026-02-21T09:09:55.3694902Z @%p4 mbarrier.init.shared::cta.b64 [%r209], 1; 2026-02-21T09:09:55.3695093Z // end inline asm 2026-02-21T09:09:55.3695224Z bar.sync 0; 2026-02-21T09:09:55.3695364Z add.s32 %r125, %r50, 10776; 2026-02-21T09:09:55.3695565Z // begin inline asm 2026-02-21T09:09:55.3695729Z @%p4 mbarrier.init.shared::cta.b64 [%r125], 1; 2026-02-21T09:09:55.3695918Z // end inline asm 2026-02-21T09:09:55.3696163Z .loc 1 58 60 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:60 2026-02-21T09:09:55.3696451Z or.b32 %r175, %r171, %r8; 2026-02-21T09:09:55.3696711Z .loc 1 58 32 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:32 2026-02-21T09:09:55.3697006Z mad.wide.s32 %rd59, %r175, 2, %rd15; 2026-02-21T09:09:55.3697180Z cvt.u64.u32 %rd65, %r8; 2026-02-21T09:09:55.3697339Z cvt.s64.s32 %rd8, %r171; 2026-02-21T09:09:55.3697498Z or.b64 %rd66, %rd8, %rd65; 2026-02-21T09:09:55.3697682Z shl.b64 %rd67, %rd66, 1; 2026-02-21T09:09:55.3697846Z add.s64 %rd9, %rd15, %rd67; 2026-02-21T09:09:55.3698002Z add.s64 %rd60, %rd9, 65536; 
2026-02-21T09:09:55.3698160Z mov.b32 %r206, 16; 2026-02-21T09:09:55.3698402Z .loc 1 58 80 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:80 2026-02-21T09:09:55.3698692Z // begin inline asm 2026-02-21T09:09:55.3698891Z cp.async.cg.shared.global [ %r205 + 0 ], [ %rd59 + 0 ], 0x10, %r206; 2026-02-21T09:09:55.3699118Z // end inline asm 2026-02-21T09:09:55.3699258Z // begin inline asm 2026-02-21T09:09:55.3699452Z cp.async.cg.shared.global [ %r207 + 0 ], [ %rd60 + 0 ], 0x10, %r206; 2026-02-21T09:09:55.3699679Z // end inline asm 2026-02-21T09:09:55.3699846Z cp.async.commit_group; 2026-02-21T09:09:55.3700115Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3700389Z bar.sync 0; 2026-02-21T09:09:55.3700528Z // begin inline asm 2026-02-21T09:09:55.3700712Z @%p4 mbarrier.arrive.expect_tx.shared.b64 _, [%r209], 256; 2026-02-21T09:09:55.3700931Z // end inline asm 2026-02-21T09:09:55.3701176Z .loc 1 64 33 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:64:33 2026-02-21T09:09:55.3701452Z bar.sync 0; 2026-02-21T09:09:55.3701627Z elect.sync %r176|%p52, -1; 2026-02-21T09:09:55.3701798Z and.pred %p46, %p1, %p52; 2026-02-21T09:09:55.3701967Z add.s32 %r131, %r50, 10240; 2026-02-21T09:09:55.3702122Z // begin inline asm 2026-02-21T09:09:55.3702457Z @%p46 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r131], [%rd79, {%r307, %r325}], [%r209]; 2026-02-21T09:09:55.3702806Z // end inline asm 2026-02-21T09:09:55.3703065Z .loc 1 58 32 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:32 2026-02-21T09:09:55.3703358Z add.s64 %rd62, %rd9, 64; 2026-02-21T09:09:55.3703512Z cvt.u64.u32 %rd68, %r14; 2026-02-21T09:09:55.3703675Z or.b64 %rd69, %rd8, %rd68; 2026-02-21T09:09:55.3703831Z shl.b64 %rd70, %rd69, 1; 2026-02-21T09:09:55.3703994Z add.s64 %rd71, %rd15, %rd70; 2026-02-21T09:09:55.3704154Z add.s64 %rd63, %rd71, 65536; 2026-02-21T09:09:55.3704427Z .loc 1 58 80 // 
ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:80 2026-02-21T09:09:55.3704707Z // begin inline asm 2026-02-21T09:09:55.3704913Z cp.async.cg.shared.global [ %r135 + 0 ], [ %rd62 + 0 ], 0x10, %r206; 2026-02-21T09:09:55.3705147Z // end inline asm 2026-02-21T09:09:55.3705283Z // begin inline asm 2026-02-21T09:09:55.3705485Z cp.async.cg.shared.global [ %r137 + 0 ], [ %rd63 + 0 ], 0x10, %r206; 2026-02-21T09:09:55.3705709Z // end inline asm 2026-02-21T09:09:55.3705858Z cp.async.commit_group; 2026-02-21T09:09:55.3706156Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3706443Z bar.sync 0; 2026-02-21T09:09:55.3706571Z // begin inline asm 2026-02-21T09:09:55.3706765Z @%p4 mbarrier.arrive.expect_tx.shared.b64 _, [%r125], 256; 2026-02-21T09:09:55.3706982Z // end inline asm 2026-02-21T09:09:55.3707227Z .loc 1 64 33 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:64:33 2026-02-21T09:09:55.3707513Z bar.sync 0; 2026-02-21T09:09:55.3707699Z elect.sync %r177|%p53, -1; 2026-02-21T09:09:55.3707875Z and.pred %p48, %p1, %p53; 2026-02-21T09:09:55.3708036Z add.s32 %r140, %r50, 10496; 2026-02-21T09:09:55.3708201Z // begin inline asm 2026-02-21T09:09:55.3708529Z @%p48 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r140], [%rd79, {%r307, %r206}], [%r125]; 2026-02-21T09:09:55.3708881Z // end inline asm 2026-02-21T09:09:55.3709138Z .loc 1 58 80 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:80 2026-02-21T09:09:55.3709433Z cp.async.wait_group 1; 2026-02-21T09:09:55.3709593Z bar.sync 0; 2026-02-21T09:09:55.3709796Z ld.shared.v4.b32 {%r178, %r179, %r180, %r181}, [%r18]; 2026-02-21T09:09:55.3710010Z mov.b32 {%rs8, %rs9}, %r181; 2026-02-21T09:09:55.3710172Z mov.b32 {%rs10, %rs11}, %r180; 2026-02-21T09:09:55.3710347Z mov.b32 {%rs12, %rs13}, %r179; 2026-02-21T09:09:55.3710514Z mov.b32 {%rs14, %rs15}, %r178; 2026-02-21T09:09:55.3710610Z ld.shared.v4.b32 {%r182, %r183, %r184, %r185}, 
[%r18+16]; 2026-02-21T09:09:55.3710672Z mov.b32 {%rs16, %rs17}, %r185; 2026-02-21T09:09:55.3710730Z mov.b32 {%rs18, %rs19}, %r184; 2026-02-21T09:09:55.3710797Z mov.b32 {%rs20, %rs21}, %r183; 2026-02-21T09:09:55.3710856Z mov.b32 {%rs22, %rs23}, %r182; 2026-02-21T09:09:55.3711029Z .loc 1 62 32 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:62:32 2026-02-21T09:09:55.3711135Z cvt.f32.bf16 %r145, %rs14; 2026-02-21T09:09:55.3711203Z cvt.f32.bf16 %r146, %rs15; 2026-02-21T09:09:55.3711264Z cvt.f32.bf16 %r147, %rs12; 2026-02-21T09:09:55.3711333Z cvt.f32.bf16 %r148, %rs13; 2026-02-21T09:09:55.3711394Z cvt.f32.bf16 %r149, %rs10; 2026-02-21T09:09:55.3711456Z cvt.f32.bf16 %r150, %rs11; 2026-02-21T09:09:55.3711519Z cvt.f32.bf16 %r151, %rs8; 2026-02-21T09:09:55.3711623Z cvt.f32.bf16 %r152, %rs9; 2026-02-21T09:09:55.3711684Z cvt.f32.bf16 %r153, %rs22; 2026-02-21T09:09:55.3711745Z cvt.f32.bf16 %r154, %rs23; 2026-02-21T09:09:55.3711816Z cvt.f32.bf16 %r155, %rs20; 2026-02-21T09:09:55.3711878Z cvt.f32.bf16 %r156, %rs21; 2026-02-21T09:09:55.3711939Z cvt.f32.bf16 %r157, %rs18; 2026-02-21T09:09:55.3712002Z cvt.f32.bf16 %r158, %rs19; 2026-02-21T09:09:55.3712075Z cvt.f32.bf16 %r159, %rs16; 2026-02-21T09:09:55.3712137Z cvt.f32.bf16 %r160, %rs17; 2026-02-21T09:09:55.3712194Z $L__tmp12: 2026-02-21T09:09:55.3712442Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3712507Z add.s32 %r223, %r173, %r191; 2026-02-21T09:09:55.3712568Z // begin inline asm 2026-02-21T09:09:55.3712882Z @%p57 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r223 + 0], 16, {%r145, %r146, %r147, %r148, %r149, %r150, %r151, %r152, %r153, %r154, %r155, %r156, %r157, %r158, %r159, %r160}; 2026-02-21T09:09:55.3712951Z // end inline asm 2026-02-21T09:09:55.3713012Z // begin inline asm 2026-02-21T09:09:55.3713088Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:55.3713162Z // end inline asm 2026-02-21T09:09:55.3713224Z bar.sync 0; 
2026-02-21T09:09:55.3713283Z $L__tmp13: 2026-02-21T09:09:55.3713472Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3713532Z // begin inline asm 2026-02-21T09:09:55.3713585Z 2026-02-21T09:09:55.3713637Z { 2026-02-21T09:09:55.3713710Z .reg .pred complete; 2026-02-21T09:09:55.3713767Z waitLoop: 2026-02-21T09:09:55.3713892Z mbarrier.try_wait.parity.shared.b64 complete, [%r209], %r325; 2026-02-21T09:09:55.3714000Z @!complete bra.uni waitLoop; 2026-02-21T09:09:55.3714053Z } 2026-02-21T09:09:55.3714056Z 2026-02-21T09:09:55.3714113Z // end inline asm 2026-02-21T09:09:55.3714289Z .loc 1 64 33 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:64:33 2026-02-21T09:09:55.3714362Z ld.shared.b8 %rs24, [%r20]; 2026-02-21T09:09:55.3714431Z ld.shared.b8 %rs25, [%r20+64]; 2026-02-21T09:09:55.3714500Z ld.shared.b8 %rs26, [%r20+128]; 2026-02-21T09:09:55.3714604Z ld.shared.b8 %rs27, [%r20+192]; 2026-02-21T09:09:55.3714781Z .loc 1 67 28 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:67:28 2026-02-21T09:09:55.3714846Z shl.b16 %rs28, %rs24, 4; 2026-02-21T09:09:55.3714917Z shl.b16 %rs29, %rs25, 4; 2026-02-21T09:09:55.3714978Z shl.b16 %rs30, %rs26, 4; 2026-02-21T09:09:55.3715038Z shl.b16 %rs31, %rs27, 4; 2026-02-21T09:09:55.3715214Z .loc 1 82 58 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:82:58 2026-02-21T09:09:55.3715294Z selp.b16 %rs32, %rs28, %rs24, %p50; 2026-02-21T09:09:55.3715356Z cvt.s16.s8 %rs33, %rs32; 2026-02-21T09:09:55.3715417Z shr.s16 %rs34, %rs33, 4; 2026-02-21T09:09:55.3715524Z selp.b16 %rs35, %rs29, %rs25, %p50; 2026-02-21T09:09:55.3715588Z cvt.s16.s8 %rs36, %rs35; 2026-02-21T09:09:55.3715648Z shr.s16 %rs37, %rs36, 4; 2026-02-21T09:09:55.3715724Z selp.b16 %rs38, %rs30, %rs26, %p50; 2026-02-21T09:09:55.3715786Z cvt.s16.s8 %rs39, %rs38; 2026-02-21T09:09:55.3715847Z shr.s16 %rs40, %rs39, 4; 2026-02-21T09:09:55.3715914Z selp.b16 %rs41, %rs31, %rs27, %p50; 2026-02-21T09:09:55.3715984Z 
cvt.s16.s8 %rs42, %rs41; 2026-02-21T09:09:55.3716045Z shr.s16 %rs43, %rs42, 4; 2026-02-21T09:09:55.3716220Z .loc 1 87 32 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:87:32 2026-02-21T09:09:55.3716294Z cvt.rn.f32.s16 %r186, %rs34; 2026-02-21T09:09:55.3716390Z cvt.rn.f32.s16 %r187, %rs37; 2026-02-21T09:09:55.3716457Z cvt.rn.f32.s16 %r188, %rs40; 2026-02-21T09:09:55.3716519Z cvt.rn.f32.s16 %r189, %rs43; 2026-02-21T09:09:55.3716591Z st.shared.b32 [%r21], %r186; 2026-02-21T09:09:55.3716654Z st.shared.b32 [%r22], %r187; 2026-02-21T09:09:55.3716719Z st.shared.b32 [%r23], %r188; 2026-02-21T09:09:55.3716788Z st.shared.b32 [%r24], %r189; 2026-02-21T09:09:55.3716844Z $L__tmp14: 2026-02-21T09:09:55.3717072Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3717143Z // begin inline asm 2026-02-21T09:09:55.3717219Z fence.proxy.async.shared::cta; 2026-02-21T09:09:55.3717277Z // end inline asm 2026-02-21T09:09:55.3717332Z bar.sync 0; 2026-02-21T09:09:55.3717405Z setp.ne.b32 %p54, %r32, 0; 2026-02-21T09:09:55.3717468Z @%p54 bra $L__BB0_4; 2026-02-21T09:09:55.3717575Z // %bb.3: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:55.3717652Z elect.sync %r202|%p56, -1; 2026-02-21T09:09:55.3717714Z mov.b32 %r192, 67373328; 2026-02-21T09:09:55.3717774Z mov.pred %p55, 0; 2026-02-21T09:09:55.3717833Z // begin inline asm 2026-02-21T09:09:55.3718009Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r323 + 0 ], [ %r191 + 0 ], %rd72, %r192, %p55; 2026-02-21T09:09:55.3718068Z // end inline asm 2026-02-21T09:09:55.3718126Z // begin inline asm 2026-02-21T09:09:55.3718287Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r323 + 0 ], [ %r191 + 8 ], %rd73, %r192, %p57; 2026-02-21T09:09:55.3718346Z // end inline asm 2026-02-21T09:09:55.3718406Z // begin inline asm 2026-02-21T09:09:55.3718568Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r323 + 0 ], [ %r191 + 16 ], %rd74, %r192, %p57; 2026-02-21T09:09:55.3718626Z // end 
inline asm 2026-02-21T09:09:55.3718685Z // begin inline asm 2026-02-21T09:09:55.3718843Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r323 + 0 ], [ %r191 + 24 ], %rd75, %r192, %p57; 2026-02-21T09:09:55.3718901Z // end inline asm 2026-02-21T09:09:55.3718966Z add.s32 %r204, %r50, 10752; 2026-02-21T09:09:55.3719056Z cvt.u64.u32 %rd76, %r204; 2026-02-21T09:09:55.3719123Z // begin inline asm 2026-02-21T09:09:55.3719253Z @%p56 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd76]; 2026-02-21T09:09:55.3719312Z // end inline asm 2026-02-21T09:09:55.3719374Z $L__tmp15: 2026-02-21T09:09:55.3719481Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:55.3719663Z .loc 1 0 0 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:0 2026-02-21T09:09:55.3719749Z cvt.s32.s16 %r29, %rs5; 2026-02-21T09:09:55.3719926Z .loc 1 58 32 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:32 2026-02-21T09:09:55.3719987Z add.s64 %rd77, %rd9, 128; 2026-02-21T09:09:55.3720047Z cvt.u64.u32 %rd81, %r25; 2026-02-21T09:09:55.3720116Z add.s64 %rd82, %rd8, %rd81; 2026-02-21T09:09:55.3720174Z shl.b64 %rd83, %rd82, 1; 2026-02-21T09:09:55.3720236Z add.s64 %rd84, %rd15, %rd83; 2026-02-21T09:09:55.3720304Z add.s64 %rd78, %rd84, 65536; 2026-02-21T09:09:55.3720474Z .loc 1 58 80 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:80 2026-02-21T09:09:55.3720553Z // begin inline asm 2026-02-21T09:09:55.3720671Z cp.async.cg.shared.global [ %r205 + 0 ], [ %rd77 + 0 ], 0x10, %r206; 2026-02-21T09:09:55.3720736Z // end inline asm 2026-02-21T09:09:55.3720795Z // begin inline asm 2026-02-21T09:09:55.3720910Z cp.async.cg.shared.global [ %r207 + 0 ], [ %rd78 + 0 ], 0x10, %r206; 2026-02-21T09:09:55.3720979Z // end inline asm 2026-02-21T09:09:55.3721044Z cp.async.commit_group; 2026-02-21T09:09:55.3721207Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3721271Z // begin inline asm 2026-02-21T09:09:55.3721375Z @%p4 
mbarrier.arrive.expect_tx.shared.b64 _, [%r209], 256; 2026-02-21T09:09:55.3721430Z // end inline asm 2026-02-21T09:09:55.3721665Z .loc 1 64 33 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:64:33 2026-02-21T09:09:55.3721730Z bar.sync 0; 2026-02-21T09:09:55.3721796Z elect.sync %r218|%p67, -1; 2026-02-21T09:09:55.3721859Z and.pred %p65, %p1, %p67; 2026-02-21T09:09:55.3721923Z mov.b32 %r212, 32; 2026-02-21T09:09:55.3721980Z // begin inline asm 2026-02-21T09:09:55.3722216Z @%p65 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r131], [%rd79, {%r307, %r212}], [%r209]; 2026-02-21T09:09:55.3722277Z // end inline asm 2026-02-21T09:09:55.3722446Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3722503Z shl.b32 %r219, %r29, 16; 2026-02-21T09:09:55.3722562Z or.b32 %r220, %r27, %r219; 2026-02-21T09:09:55.3722638Z mad.wide.s32 %rd110, %r220, 2, %rd7; 2026-02-21T09:09:55.3722694Z mov.b32 %r329, 1; 2026-02-21T09:09:55.3722751Z mov.b64 %rd111, 0; 2026-02-21T09:09:55.3722817Z mov.b32 %r327, %r325; 2026-02-21T09:09:55.3722876Z mov.b32 %r328, %r325; 2026-02-21T09:09:55.3722934Z mov.b32 %r330, %r325; 2026-02-21T09:09:55.3722991Z bra.uni $L__BB0_5; 2026-02-21T09:09:55.3723098Z $L__BB0_7: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:09:55.3723268Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3723336Z setp.lt.u64 %p83, %rd111, 464; 2026-02-21T09:09:55.3723398Z $L__tmp16: 2026-02-21T09:09:55.3723613Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3723674Z add.s32 %r288, %r329, 1; 2026-02-21T09:09:55.3723744Z setp.gt.s32 %p86, %r288, 1; 2026-02-21T09:09:55.3723809Z selp.b32 %r329, 0, %r288, %p86; 2026-02-21T09:09:55.3723870Z selp.b32 %r289, 1, 0, %p86; 2026-02-21T09:09:55.3723928Z xor.b32 %r48, %r330, %r289; 2026-02-21T09:09:55.3723989Z $L__tmp17: 
2026-02-21T09:09:55.3724157Z .loc 1 58 32 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:32 2026-02-21T09:09:55.3724270Z add.s64 %rd90, %rd110, -65536; 2026-02-21T09:09:55.3724443Z .loc 1 58 80 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:80 2026-02-21T09:09:55.3724504Z add.s32 %r279, %r42, %r10; 2026-02-21T09:09:55.3724566Z selp.b32 %r280, 16, 0, %p83; 2026-02-21T09:09:55.3724629Z // begin inline asm 2026-02-21T09:09:55.3724743Z cp.async.cg.shared.global [ %r279 + 0 ], [ %rd90 + 0 ], 0x10, %r280; 2026-02-21T09:09:55.3724826Z // end inline asm 2026-02-21T09:09:55.3724884Z add.s32 %r281, %r279, 2048; 2026-02-21T09:09:55.3724947Z // begin inline asm 2026-02-21T09:09:55.3725065Z cp.async.cg.shared.global [ %r281 + 0 ], [ %rd110 + 0 ], 0x10, %r280; 2026-02-21T09:09:55.3725120Z // end inline asm 2026-02-21T09:09:55.3725187Z cp.async.commit_group; 2026-02-21T09:09:55.3725355Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3725421Z and.pred %p81, %p4, %p83; 2026-02-21T09:09:55.3725483Z // begin inline asm 2026-02-21T09:09:55.3725593Z @%p81 mbarrier.arrive.expect_tx.shared.b64 _, [%r283], 256; 2026-02-21T09:09:55.3725649Z // end inline asm 2026-02-21T09:09:55.3725843Z .loc 1 64 33 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:64:33 2026-02-21T09:09:55.3725908Z bar.sync 0; 2026-02-21T09:09:55.3725973Z elect.sync %r290|%p87, -1; 2026-02-21T09:09:55.3726035Z and.pred %p88, %p83, %p87; 2026-02-21T09:09:55.3726104Z and.pred %p82, %p1, %p88; 2026-02-21T09:09:55.3726164Z cvt.u32.u64 %r291, %rd111; 2026-02-21T09:09:55.3726221Z add.s32 %r286, %r291, 48; 2026-02-21T09:09:55.3726277Z // begin inline asm 2026-02-21T09:09:55.3726517Z @%p82 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r284], [%rd79, {%r307, %r286}], [%r283]; 2026-02-21T09:09:55.3726572Z // end inline asm 2026-02-21T09:09:55.3726757Z .loc 1 51 92 // 
ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3726828Z add.s64 %rd110, %rd110, 64; 2026-02-21T09:09:55.3726894Z setp.lt.u64 %p89, %rd111, 480; 2026-02-21T09:09:55.3726956Z add.s64 %rd111, %rd111, 16; 2026-02-21T09:09:55.3727020Z mov.b32 %r325, %r330; 2026-02-21T09:09:55.3727078Z mov.b32 %r330, %r48; 2026-02-21T09:09:55.3727137Z @%p89 bra $L__BB0_5; 2026-02-21T09:09:55.3727194Z bra.uni $L__BB0_8; 2026-02-21T09:09:55.3727301Z $L__BB0_5: // Parent Loop BB0_2 Depth=1 2026-02-21T09:09:55.3727400Z // => This Inner Loop Header: Depth=2 2026-02-21T09:09:55.3727458Z add.s32 %r242, %r328, 1; 2026-02-21T09:09:55.3727529Z setp.gt.s32 %p71, %r242, 1; 2026-02-21T09:09:55.3727592Z selp.b32 %r328, 0, %r242, %p71; 2026-02-21T09:09:55.3727763Z .loc 1 58 80 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:58:80 2026-02-21T09:09:55.3727835Z cp.async.wait_group 1; 2026-02-21T09:09:55.3727892Z bar.sync 0; 2026-02-21T09:09:55.3727950Z shl.b32 %r243, %r328, 12; 2026-02-21T09:09:55.3728010Z add.s32 %r42, %r50, %r243; 2026-02-21T09:09:55.3728077Z add.s32 %r245, %r42, %r17; 2026-02-21T09:09:55.3728174Z ld.shared.v4.b32 {%r246, %r247, %r248, %r249}, [%r245]; 2026-02-21T09:09:55.3728238Z mov.b32 {%rs44, %rs45}, %r249; 2026-02-21T09:09:55.3728304Z mov.b32 {%rs46, %rs47}, %r248; 2026-02-21T09:09:55.3728364Z mov.b32 {%rs48, %rs49}, %r247; 2026-02-21T09:09:55.3728423Z mov.b32 {%rs50, %rs51}, %r246; 2026-02-21T09:09:55.3728523Z ld.shared.v4.b32 {%r250, %r251, %r252, %r253}, [%r245+16]; 2026-02-21T09:09:55.3728590Z mov.b32 {%rs52, %rs53}, %r253; 2026-02-21T09:09:55.3728651Z mov.b32 {%rs54, %rs55}, %r252; 2026-02-21T09:09:55.3728710Z mov.b32 {%rs56, %rs57}, %r251; 2026-02-21T09:09:55.3728775Z mov.b32 {%rs58, %rs59}, %r250; 2026-02-21T09:09:55.3728943Z .loc 1 62 32 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:62:32 2026-02-21T09:09:55.3729024Z cvt.f32.bf16 %r224, %rs50; 2026-02-21T09:09:55.3729094Z cvt.f32.bf16 %r225, %rs51; 
2026-02-21T09:09:55.3729153Z cvt.f32.bf16 %r226, %rs48; 2026-02-21T09:09:55.3729213Z cvt.f32.bf16 %r227, %rs49; 2026-02-21T09:09:55.3729272Z cvt.f32.bf16 %r228, %rs46; 2026-02-21T09:09:55.3729344Z cvt.f32.bf16 %r229, %rs47; 2026-02-21T09:09:55.3729402Z cvt.f32.bf16 %r230, %rs44; 2026-02-21T09:09:55.3729459Z cvt.f32.bf16 %r231, %rs45; 2026-02-21T09:09:55.3729524Z cvt.f32.bf16 %r232, %rs58; 2026-02-21T09:09:55.3729607Z cvt.f32.bf16 %r233, %rs59; 2026-02-21T09:09:55.3729664Z cvt.f32.bf16 %r234, %rs56; 2026-02-21T09:09:55.3729723Z cvt.f32.bf16 %r235, %rs57; 2026-02-21T09:09:55.3729790Z cvt.f32.bf16 %r236, %rs54; 2026-02-21T09:09:55.3729848Z cvt.f32.bf16 %r237, %rs55; 2026-02-21T09:09:55.3729904Z cvt.f32.bf16 %r238, %rs52; 2026-02-21T09:09:55.3729968Z cvt.f32.bf16 %r239, %rs53; 2026-02-21T09:09:55.3730020Z $L__tmp18: 2026-02-21T09:09:55.3730238Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3730295Z // begin inline asm 2026-02-21T09:09:55.3730355Z 2026-02-21T09:09:55.3730405Z { 2026-02-21T09:09:55.3730491Z .reg .pred complete; 2026-02-21T09:09:55.3730552Z waitLoop: 2026-02-21T09:09:55.3730670Z mbarrier.try_wait.parity.shared.b64 complete, [%r326], %r325; 2026-02-21T09:09:55.3730734Z @!complete bra.uni waitLoop; 2026-02-21T09:09:55.3730784Z } 2026-02-21T09:09:55.3730794Z 2026-02-21T09:09:55.3730851Z // end inline asm 2026-02-21T09:09:55.3730906Z $L__tmp19: 2026-02-21T09:09:55.3731068Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3731137Z selp.b32 %r254, 1, 0, %p71; 2026-02-21T09:09:55.3731198Z xor.b32 %r327, %r327, %r254; 2026-02-21T09:09:55.3731257Z mov.pred %p72, -1; 2026-02-21T09:09:55.3731317Z $L__tmp20: 2026-02-21T09:09:55.3731602Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3731662Z // begin inline asm 2026-02-21T09:09:55.3731954Z @%p72 
tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r223 + 0], 16, {%r224, %r225, %r226, %r227, %r228, %r229, %r230, %r231, %r232, %r233, %r234, %r235, %r236, %r237, %r238, %r239}; 2026-02-21T09:09:55.3732017Z // end inline asm 2026-02-21T09:09:55.3732074Z // begin inline asm 2026-02-21T09:09:55.3732145Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:55.3732207Z // end inline asm 2026-02-21T09:09:55.3732263Z bar.sync 0; 2026-02-21T09:09:55.3732315Z $L__tmp21: 2026-02-21T09:09:55.3732489Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3732548Z shl.b32 %r255, %r328, 3; 2026-02-21T09:09:55.3732608Z add.s32 %r256, %r50, %r255; 2026-02-21T09:09:55.3732667Z add.s32 %r283, %r256, 10768; 2026-02-21T09:09:55.3732731Z // begin inline asm 2026-02-21T09:09:55.3732783Z 2026-02-21T09:09:55.3732834Z { 2026-02-21T09:09:55.3732902Z .reg .pred complete; 2026-02-21T09:09:55.3732956Z waitLoop: 2026-02-21T09:09:55.3733071Z mbarrier.try_wait.parity.shared.b64 complete, [%r283], %r327; 2026-02-21T09:09:55.3733136Z @!complete bra.uni waitLoop; 2026-02-21T09:09:55.3733196Z } 2026-02-21T09:09:55.3733199Z 2026-02-21T09:09:55.3733255Z // end inline asm 2026-02-21T09:09:55.3733420Z .loc 1 64 33 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:64:33 2026-02-21T09:09:55.3735150Z shl.b32 %r257, %r328, 8; 2026-02-21T09:09:55.3735221Z add.s32 %r258, %r50, %r257; 2026-02-21T09:09:55.3735281Z add.s32 %r284, %r258, 10240; 2026-02-21T09:09:55.3735340Z add.s32 %r259, %r284, %r19; 2026-02-21T09:09:55.3735412Z ld.shared.b8 %rs60, [%r259]; 2026-02-21T09:09:55.3735478Z ld.shared.b8 %rs61, [%r259+64]; 2026-02-21T09:09:55.3735546Z ld.shared.b8 %rs62, [%r259+128]; 2026-02-21T09:09:55.3735610Z ld.shared.b8 %rs63, [%r259+192]; 2026-02-21T09:09:55.3735793Z .loc 1 67 28 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:67:28 2026-02-21T09:09:55.3735894Z shl.b16 %rs64, %rs60, 4; 2026-02-21T09:09:55.3735953Z shl.b16 %rs65, %rs61, 4; 
2026-02-21T09:09:55.3736021Z shl.b16 %rs66, %rs62, 4; 2026-02-21T09:09:55.3736080Z shl.b16 %rs67, %rs63, 4; 2026-02-21T09:09:55.3736244Z .loc 1 82 58 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:82:58 2026-02-21T09:09:55.3736318Z selp.b16 %rs68, %rs64, %rs60, %p50; 2026-02-21T09:09:55.3736402Z cvt.s16.s8 %rs69, %rs68; 2026-02-21T09:09:55.3736457Z shr.s16 %rs70, %rs69, 4; 2026-02-21T09:09:55.3736522Z selp.b16 %rs71, %rs65, %rs61, %p50; 2026-02-21T09:09:55.3736587Z cvt.s16.s8 %rs72, %rs71; 2026-02-21T09:09:55.3736644Z shr.s16 %rs73, %rs72, 4; 2026-02-21T09:09:55.3736707Z selp.b16 %rs74, %rs66, %rs62, %p50; 2026-02-21T09:09:55.3736772Z cvt.s16.s8 %rs75, %rs74; 2026-02-21T09:09:55.3736832Z shr.s16 %rs76, %rs75, 4; 2026-02-21T09:09:55.3736897Z selp.b16 %rs77, %rs67, %rs63, %p50; 2026-02-21T09:09:55.3736956Z cvt.s16.s8 %rs78, %rs77; 2026-02-21T09:09:55.3737024Z shr.s16 %rs79, %rs78, 4; 2026-02-21T09:09:55.3737224Z .loc 1 87 32 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:87:32 2026-02-21T09:09:55.3737290Z cvt.rn.f32.s16 %r260, %rs70; 2026-02-21T09:09:55.3737358Z cvt.rn.f32.s16 %r261, %rs73; 2026-02-21T09:09:55.3737416Z cvt.rn.f32.s16 %r262, %rs76; 2026-02-21T09:09:55.3737474Z cvt.rn.f32.s16 %r263, %rs79; 2026-02-21T09:09:55.3737534Z st.shared.b32 [%r21], %r260; 2026-02-21T09:09:55.3737602Z st.shared.b32 [%r22], %r261; 2026-02-21T09:09:55.3737662Z st.shared.b32 [%r23], %r262; 2026-02-21T09:09:55.3737720Z st.shared.b32 [%r24], %r263; 2026-02-21T09:09:55.3737895Z .loc 1 51 92 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:51:92 2026-02-21T09:09:55.3737953Z shl.b32 %r264, %r329, 3; 2026-02-21T09:09:55.3738011Z add.s32 %r265, %r50, %r264; 2026-02-21T09:09:55.3738099Z add.s32 %r326, %r265, 10752; 2026-02-21T09:09:55.3738153Z $L__tmp22: 2026-02-21T09:09:55.3738367Z .loc 2 291 36 // standard.py:291:36 @[ ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:94:40 ] 2026-02-21T09:09:55.3738427Z // begin inline asm 
2026-02-21T09:09:55.3738511Z fence.proxy.async.shared::cta; 2026-02-21T09:09:55.3738567Z // end inline asm 2026-02-21T09:09:55.3738623Z bar.sync 0; 2026-02-21T09:09:55.3738694Z @%p54 bra $L__BB0_7; 2026-02-21T09:09:55.3738795Z // %bb.6: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:09:55.3738864Z elect.sync %r278|%p73, -1; 2026-02-21T09:09:55.3738922Z mov.b32 %r268, 67373328; 2026-02-21T09:09:55.3738987Z // begin inline asm 2026-02-21T09:09:55.3739139Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r323 + 0 ], [ %r191 + 0 ], %rd72, %r268, %p72; 2026-02-21T09:09:55.3739194Z // end inline asm 2026-02-21T09:09:55.3739258Z // begin inline asm 2026-02-21T09:09:55.3739403Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r323 + 0 ], [ %r191 + 8 ], %rd73, %r268, %p72; 2026-02-21T09:09:55.3739460Z // end inline asm 2026-02-21T09:09:55.3739522Z // begin inline asm 2026-02-21T09:09:55.3739670Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r323 + 0 ], [ %r191 + 16 ], %rd74, %r268, %p72; 2026-02-21T09:09:55.3739723Z // end inline asm 2026-02-21T09:09:55.3739784Z // begin inline asm 2026-02-21T09:09:55.3739931Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r323 + 0 ], [ %r191 + 24 ], %rd75, %r268, %p72; 2026-02-21T09:09:55.3740057Z // end inline asm 2026-02-21T09:09:55.3740119Z cvt.u64.u32 %rd89, %r326; 2026-02-21T09:09:55.3740183Z // begin inline asm 2026-02-21T09:09:55.3740306Z @%p73 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd89]; 2026-02-21T09:09:55.3740361Z // end inline asm 2026-02-21T09:09:55.3740423Z bra.uni $L__BB0_7; 2026-02-21T09:09:55.3740476Z $L__tmp23: 2026-02-21T09:09:55.3740559Z $L__BB0_9: // %._crit_edge 2026-02-21T09:09:55.3740736Z .loc 1 31 4 // ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py:31:4 2026-02-21T09:09:55.3740815Z bar.sync 0; 2026-02-21T09:09:55.3740872Z // begin inline asm 2026-02-21T09:09:55.3740987Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r323, 64; 2026-02-21T09:09:55.3741051Z // end inline asm 
2026-02-21T09:09:55.3741104Z ret; 2026-02-21T09:09:55.3741157Z $L__tmp24: 2026-02-21T09:09:55.3741220Z $L__func_end0: 2026-02-21T09:09:55.3741303Z // -- End function 2026-02-21T09:09:55.3741358Z } 2026-02-21T09:09:55.3741584Z .file 1 "/tmp/torchinductor_root/tm/ctmn6mz4ffskfhdxicvntnniiknsymk4iowi3mw3op5qnmxs2kfb.py" 2026-02-21T09:09:55.3741763Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:55.3741825Z .section .debug_abbrev 2026-02-21T09:09:55.3741876Z { 2026-02-21T09:09:55.3741971Z .b8 1 // Abbreviation Code 2026-02-21T09:09:55.3742060Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:55.3742142Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:55.3742230Z .b8 37 // DW_AT_producer 2026-02-21T09:09:55.3742339Z .b8 8 // DW_FORM_string 2026-02-21T09:09:55.3742415Z .b8 19 // DW_AT_language 2026-02-21T09:09:55.3742494Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:55.3742578Z .b8 3 // DW_AT_name 2026-02-21T09:09:55.3742653Z .b8 8 // DW_FORM_string 2026-02-21T09:09:55.3742731Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:55.3742816Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:55.3742891Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:55.3743020Z .b8 8 // DW_FORM_string 2026-02-21T09:09:55.3743101Z .b8 0 // EOM(1) 2026-02-21T09:09:55.3743171Z .b8 0 // EOM(2) 2026-02-21T09:09:55.3743252Z .b8 2 // Abbreviation Code 2026-02-21T09:09:55.3743333Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:55.3743414Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:55.3743487Z .b8 3 // DW_AT_name 2026-02-21T09:09:55.3743558Z .b8 8 // DW_FORM_string 2026-02-21T09:09:55.3743641Z .b8 32 // DW_AT_inline 2026-02-21T09:09:55.3743715Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:55.3743786Z .b8 0 // EOM(1) 2026-02-21T09:09:55.3743859Z .b8 0 // EOM(2) 2026-02-21T09:09:55.3743938Z .b8 3 // Abbreviation Code 2026-02-21T09:09:55.3744019Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:55.3744102Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:55.3744178Z .b8 17 // DW_AT_low_pc 
2026-02-21T09:09:55.3744253Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:55.3744329Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:55.3744409Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:55.3744493Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:55.3744599Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:55.3744675Z .b8 0 // EOM(1) 2026-02-21T09:09:55.3744743Z .b8 0 // EOM(2) 2026-02-21T09:09:55.3744820Z .b8 4 // Abbreviation Code 2026-02-21T09:09:55.3744919Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:55.3745020Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:55.3745104Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:55.3745179Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:55.3745257Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:55.3745326Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:55.3745401Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:55.3745479Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:55.3745556Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:55.3745628Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:55.3745711Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:55.3745783Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:55.3745861Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:55.3745934Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:55.3746010Z .b8 0 // EOM(1) 2026-02-21T09:09:55.3746099Z .b8 0 // EOM(2) 2026-02-21T09:09:55.3746168Z .b8 0 // EOM(3) 2026-02-21T09:09:55.3746226Z } 2026-02-21T09:09:55.3746286Z .section .debug_info 2026-02-21T09:09:55.3746337Z { 2026-02-21T09:09:55.3746428Z .b32 178 // Length of Unit 2026-02-21T09:09:55.3746517Z .b8 2 // DWARF version number 2026-02-21T09:09:55.3746570Z .b8 0 2026-02-21T09:09:55.3746686Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:55.3746784Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:55.3746884Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:55.3747002Z .b8 116 // DW_AT_producer 2026-02-21T09:09:55.3747067Z .b8 114 2026-02-21T09:09:55.3747121Z .b8 105 2026-02-21T09:09:55.3747172Z .b8 116 2026-02-21T09:09:55.3747223Z .b8 111 2026-02-21T09:09:55.3747282Z .b8 110 2026-02-21T09:09:55.3747336Z .b8 0 2026-02-21T09:09:55.3747410Z .b8 2 // DW_AT_language 2026-02-21T09:09:55.3747468Z .b8 0 2026-02-21T09:09:55.3747541Z .b8 99 // DW_AT_name 2026-02-21T09:09:55.3747592Z .b8 116 2026-02-21T09:09:55.3747645Z .b8 109 2026-02-21T09:09:55.3747702Z .b8 110 2026-02-21T09:09:55.3747753Z .b8 54 2026-02-21T09:09:55.3747804Z .b8 109 2026-02-21T09:09:55.3747862Z .b8 122 2026-02-21T09:09:55.3747913Z .b8 52 2026-02-21T09:09:55.3747964Z .b8 102 2026-02-21T09:09:55.3748014Z .b8 102 2026-02-21T09:09:55.3748075Z .b8 115 2026-02-21T09:09:55.3748125Z .b8 107 2026-02-21T09:09:55.3748176Z .b8 102 2026-02-21T09:09:55.3748231Z .b8 104 2026-02-21T09:09:55.3748281Z .b8 100 2026-02-21T09:09:55.3748332Z .b8 120 2026-02-21T09:09:55.3748385Z .b8 105 2026-02-21T09:09:55.3748442Z .b8 99 2026-02-21T09:09:55.3748493Z .b8 118 2026-02-21T09:09:55.3748542Z .b8 110 2026-02-21T09:09:55.3748598Z .b8 116 2026-02-21T09:09:55.3748649Z .b8 110 2026-02-21T09:09:55.3748700Z .b8 110 2026-02-21T09:09:55.3748750Z .b8 105 2026-02-21T09:09:55.3748807Z .b8 105 2026-02-21T09:09:55.3748856Z .b8 107 2026-02-21T09:09:55.3748907Z .b8 110 2026-02-21T09:09:55.3748957Z .b8 115 2026-02-21T09:09:55.3749016Z .b8 121 2026-02-21T09:09:55.3749067Z .b8 109 2026-02-21T09:09:55.3749142Z .b8 107 2026-02-21T09:09:55.3749200Z .b8 52 2026-02-21T09:09:55.3749253Z .b8 105 2026-02-21T09:09:55.3749304Z .b8 111 2026-02-21T09:09:55.3749354Z .b8 119 2026-02-21T09:09:55.3749413Z .b8 105 2026-02-21T09:09:55.3749464Z .b8 51 2026-02-21T09:09:55.3749514Z .b8 109 2026-02-21T09:09:55.3749572Z .b8 119 2026-02-21T09:09:55.3749622Z .b8 51 
2026-02-21T09:09:55.3749671Z .b8 111 2026-02-21T09:09:55.3749720Z .b8 112 2026-02-21T09:09:55.3749777Z .b8 53 2026-02-21T09:09:55.3749828Z .b8 113 2026-02-21T09:09:55.3749900Z .b8 110 2026-02-21T09:09:55.3749950Z .b8 109 2026-02-21T09:09:55.3750005Z .b8 120 2026-02-21T09:09:55.3750056Z .b8 115 2026-02-21T09:09:55.3750105Z .b8 50 2026-02-21T09:09:55.3750162Z .b8 107 2026-02-21T09:09:55.3750211Z .b8 102 2026-02-21T09:09:55.3750261Z .b8 98 2026-02-21T09:09:55.3750311Z .b8 46 2026-02-21T09:09:55.3750367Z .b8 112 2026-02-21T09:09:55.3750417Z .b8 121 2026-02-21T09:09:55.3750467Z .b8 0 2026-02-21T09:09:55.3750563Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:55.3750640Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:55.3750690Z .b8 116 2026-02-21T09:09:55.3750740Z .b8 109 2026-02-21T09:09:55.3750797Z .b8 112 2026-02-21T09:09:55.3750848Z .b8 47 2026-02-21T09:09:55.3750898Z .b8 116 2026-02-21T09:09:55.3750954Z .b8 111 2026-02-21T09:09:55.3751005Z .b8 114 2026-02-21T09:09:55.3751056Z .b8 99 2026-02-21T09:09:55.3751105Z .b8 104 2026-02-21T09:09:55.3751163Z .b8 105 2026-02-21T09:09:55.3751215Z .b8 110 2026-02-21T09:09:55.3751265Z .b8 100 2026-02-21T09:09:55.3751323Z .b8 117 2026-02-21T09:09:55.3751373Z .b8 99 2026-02-21T09:09:55.3751423Z .b8 116 2026-02-21T09:09:55.3751474Z .b8 111 2026-02-21T09:09:55.3751531Z .b8 114 2026-02-21T09:09:55.3751651Z .b8 95 2026-02-21T09:09:55.3751703Z .b8 114 2026-02-21T09:09:55.3751753Z .b8 111 2026-02-21T09:09:55.3751812Z .b8 111 2026-02-21T09:09:55.3751862Z .b8 116 2026-02-21T09:09:55.3751913Z .b8 47 2026-02-21T09:09:55.3751971Z .b8 116 2026-02-21T09:09:55.3752022Z .b8 109 2026-02-21T09:09:55.3752074Z .b8 0 2026-02-21T09:09:55.3752174Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:55.3752257Z .b8 95 // DW_AT_name 2026-02-21T09:09:55.3752313Z .b8 104 2026-02-21T09:09:55.3752366Z .b8 101 2026-02-21T09:09:55.3752427Z .b8 108 2026-02-21T09:09:55.3752479Z .b8 105 2026-02-21T09:09:55.3752532Z .b8 111 
2026-02-21T09:09:55.3752585Z .b8 110 2026-02-21T09:09:55.3752651Z .b8 95 2026-02-21T09:09:55.3752736Z .b8 109 2026-02-21T09:09:55.3752792Z .b8 97 2026-02-21T09:09:55.3752850Z .b8 116 2026-02-21T09:09:55.3752900Z .b8 109 2026-02-21T09:09:55.3752949Z .b8 117 2026-02-21T09:09:55.3752999Z .b8 108 2026-02-21T09:09:55.3753058Z .b8 95 2026-02-21T09:09:55.3753109Z .b8 98 2026-02-21T09:09:55.3753161Z .b8 102 2026-02-21T09:09:55.3753218Z .b8 49 2026-02-21T09:09:55.3753269Z .b8 54 2026-02-21T09:09:55.3753320Z .b8 95 2026-02-21T09:09:55.3753372Z .b8 105 2026-02-21T09:09:55.3753429Z .b8 110 2026-02-21T09:09:55.3753481Z .b8 116 2026-02-21T09:09:55.3753533Z .b8 52 2026-02-21T09:09:55.3753582Z .b8 0 2026-02-21T09:09:55.3753665Z .b8 1 // DW_AT_inline 2026-02-21T09:09:55.3753762Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:55.3753851Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:55.3753947Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:55.3754038Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:55.3754153Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:55.3754249Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:55.3754332Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:55.3754416Z .b64 $L__tmp23 // DW_AT_high_pc 2026-02-21T09:09:55.3754501Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:55.3754611Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:55.3754694Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:55.3754777Z .b8 0 // End Of Children Mark 2026-02-21T09:09:55.3754866Z .b8 0 // End Of Children Mark 2026-02-21T09:09:55.3754918Z } 2026-02-21T09:09:55.3754988Z .section .debug_macinfo { } 2026-02-21T09:09:55.3754992Z 2026-02-21T09:09:55.3755093Z ================================================================ 2026-02-21T09:09:55.3755232Z please share the reproducer above with Triton project. 2026-02-21T09:09:56.0388166Z [183s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:56.0388469Z 2026-02-21T09:09:56.0390454Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:56.0391976Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:56.0392227Z `ptxas` stderr: 2026-02-21T09:09:56.0392677Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 238 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:56.0393168Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:56.0393556Z 2026-02-21T09:09:56.0393961Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp_pe3dxh2.ptx -o /tmp/tmp_pe3dxh2.ptx.o 2026-02-21T09:09:56.0394399Z 2026-02-21T09:09:56.0394543Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:56.0394735Z 2026-02-21T09:09:56.0394737Z 2026-02-21T09:09:56.0394821Z ================================================================ 2026-02-21T09:09:56.0395036Z Internal Triton PTX codegen error 2026-02-21T09:09:56.0395207Z `ptxas` stderr: 2026-02-21T09:09:56.0395728Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 238 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:56.0396227Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:56.0396382Z 2026-02-21T09:09:56.0396763Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp_pe3dxh2.ptx -o /tmp/tmp_pe3dxh2.ptx.o 2026-02-21T09:09:56.0397207Z 2026-02-21T09:09:56.0397210Z 2026-02-21T09:09:56.0397269Z // 2026-02-21T09:09:56.0397415Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:56.0397602Z // 2026-02-21T09:09:56.0397671Z 2026-02-21T09:09:56.0397737Z .version 8.7 2026-02-21T09:09:56.0397878Z .target sm_100a 2026-02-21T09:09:56.0398026Z .address_size 64 2026-02-21T09:09:56.0398113Z 2026-02-21T09:09:56.0398266Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:56.0398565Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:56.0398792Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:56.0399025Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:56.0399278Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:56.0399579Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:56.0399874Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:56.0400157Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:56.0400450Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:56.0400731Z ) 2026-02-21T09:09:56.0400867Z .reqntid 128 2026-02-21T09:09:56.0401005Z .maxnreg 32 2026-02-21T09:09:56.0401147Z { 2026-02-21T09:09:56.0401278Z .reg .pred %p<52>; 2026-02-21T09:09:56.0401441Z .reg .b16 %rs<72>; 2026-02-21T09:09:56.0401627Z .reg .b32 %r<292>; 2026-02-21T09:09:56.0401770Z .reg .b64 %rd<90>; 2026-02-21T09:09:56.0401922Z $L__func_begin0: 2026-02-21T09:09:56.0402008Z 2026-02-21T09:09:56.0402065Z // %bb.0: 2026-02-21T09:09:56.0402412Z .loc 1 19 0 // 
cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:19 2026-02-21T09:09:56.0402716Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:56.0402886Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:56.0403102Z ld.param.b64 %rd18, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:56.0403335Z mov.b32 %r51, global_smem; 2026-02-21T09:09:56.0403507Z // begin inline asm 2026-02-21T09:09:56.0403905Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r51], 64; 2026-02-21T09:09:56.0404156Z // end inline asm 2026-02-21T09:09:56.0404332Z ld.param.b64 %rd35, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:56.0404539Z bar.sync 0; 2026-02-21T09:09:56.0404682Z ld.shared.b32 %r285, [global_smem]; 2026-02-21T09:09:56.0404860Z bar.sync 0; 2026-02-21T09:09:56.0404989Z // begin inline asm 2026-02-21T09:09:56.0405200Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:56.0405427Z // end inline asm 2026-02-21T09:09:56.0405684Z .loc 1 21 67 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:21:67 2026-02-21T09:09:56.0405986Z mov.u32 %r60, %ctaid.x; 2026-02-21T09:09:56.0406137Z mov.u32 %r61, %ctaid.y; 2026-02-21T09:09:56.0406323Z mov.u32 %r62, %ctaid.z; 2026-02-21T09:09:56.0406474Z mov.u32 %r63, %nctaid.x; 2026-02-21T09:09:56.0406633Z mov.u32 %r64, %nctaid.y; 2026-02-21T09:09:56.0406793Z mad.lo.s32 %r65, %r62, %r64, %r61; 2026-02-21T09:09:56.0406971Z mad.lo.s32 %r66, %r65, %r63, %r60; 2026-02-21T09:09:56.0407145Z shl.b32 %r67, %r66, 7; 2026-02-21T09:09:56.0407299Z cvt.s64.s32 %rd36, %r67; 2026-02-21T09:09:56.0407463Z add.s64 %rd32, %rd35, %rd36; 2026-02-21T09:09:56.0407621Z shl.b32 %r68, %r1, 2; 2026-02-21T09:09:56.0407774Z add.s32 %r52, %r51, %r68; 2026-02-21T09:09:56.0407923Z mov.b32 %r53, 0; 2026-02-21T09:09:56.0408063Z // begin inline asm 2026-02-21T09:09:56.0408215Z @%p1 st.shared.b32 [ %r52 + 0 ], %r53; 2026-02-21T09:09:56.0408431Z // end inline asm 2026-02-21T09:09:56.0408576Z bar.warp.sync -1; 2026-02-21T09:09:56.0408733Z setp.eq.b32 %p4, 
%r1, 0; 2026-02-21T09:09:56.0408896Z cvt.u64.u32 %rd17, %r51; 2026-02-21T09:09:56.0409048Z // begin inline asm 2026-02-21T09:09:56.0409312Z @%p4 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd17 + 0 ], %rd18; 2026-02-21T09:09:56.0409592Z // end inline asm 2026-02-21T09:09:56.0409738Z // begin inline asm 2026-02-21T09:09:56.0409966Z @%p4 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1; 2026-02-21T09:09:56.0410229Z // end inline asm 2026-02-21T09:09:56.0410368Z mov.b32 %r54, 32; 2026-02-21T09:09:56.0410519Z // begin inline asm 2026-02-21T09:09:56.0410775Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0, %r54; 2026-02-21T09:09:56.0411047Z // end inline asm 2026-02-21T09:09:56.0411198Z mov.b32 %r55, 64; 2026-02-21T09:09:56.0411338Z // begin inline asm 2026-02-21T09:09:56.0411604Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1, %r55; 2026-02-21T09:09:56.0411863Z // end inline asm 2026-02-21T09:09:56.0412006Z mov.b32 %r56, 8192; 2026-02-21T09:09:56.0412145Z // begin inline asm 2026-02-21T09:09:56.0412388Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0, %r56; 2026-02-21T09:09:56.0412663Z // end inline asm 2026-02-21T09:09:56.0412798Z mov.b32 %r57, 4096; 2026-02-21T09:09:56.0412947Z // begin inline asm 2026-02-21T09:09:56.0413182Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1, %r57; 2026-02-21T09:09:56.0413492Z // end inline asm 2026-02-21T09:09:56.0413628Z mov.b64 %rd25, 16384; 2026-02-21T09:09:56.0413780Z // begin inline asm 2026-02-21T09:09:56.0414035Z @%p4 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd17 + 0 ], 0x0, %rd25; 2026-02-21T09:09:56.0414309Z // end inline asm 2026-02-21T09:09:56.0414447Z mov.b32 %r58, 1; 2026-02-21T09:09:56.0414577Z // begin inline asm 2026-02-21T09:09:56.0414830Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0, %r58; 
2026-02-21T09:09:56.0415139Z // end inline asm 2026-02-21T09:09:56.0415285Z // begin inline asm 2026-02-21T09:09:56.0415537Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1, %r58; 2026-02-21T09:09:56.0415824Z // end inline asm 2026-02-21T09:09:56.0415966Z // begin inline asm 2026-02-21T09:09:56.0416195Z @%p4 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd17 + 0 ], 0xa; 2026-02-21T09:09:56.0416458Z // end inline asm 2026-02-21T09:09:56.0416595Z // begin inline asm 2026-02-21T09:09:56.0416886Z @%p4 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0; 2026-02-21T09:09:56.0417178Z // end inline asm 2026-02-21T09:09:56.0417323Z // begin inline asm 2026-02-21T09:09:56.0417562Z @%p4 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x2; 2026-02-21T09:09:56.0417830Z // end inline asm 2026-02-21T09:09:56.0417970Z // begin inline asm 2026-02-21T09:09:56.0418206Z @%p4 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0; 2026-02-21T09:09:56.0418466Z // end inline asm 2026-02-21T09:09:56.0418601Z // begin inline asm 2026-02-21T09:09:56.0418980Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd32 + 0 ], [ %rd17 + 0 ], 0x80; 2026-02-21T09:09:56.0419345Z // end inline asm 2026-02-21T09:09:56.0419489Z // begin inline asm 2026-02-21T09:09:56.0419695Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd32 + 0 ], 0x80; 2026-02-21T09:09:56.0419950Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:56.0420144Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:56.0420319Z // end inline asm 2026-02-21T09:09:56.0420461Z bar.sync 0; 2026-02-21T09:09:56.0420600Z cvta.global.u64 %rd54, %rd32; 2026-02-21T09:09:56.0420888Z .loc 1 27 35 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:27:35 2026-02-21T09:09:56.0421213Z shl.b32 %r286, %r60, 1; 2026-02-21T09:09:56.0421487Z .loc 1 28 37 // 
cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:28:37 2026-02-21T09:09:56.0421815Z add.s32 %r69, %r286, 2; 2026-02-21T09:09:56.0422078Z .loc 1 28 49 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:28:49 2026-02-21T09:09:56.0422372Z min.s32 %r4, %r69, 16384; 2026-02-21T09:09:56.0422633Z .loc 1 29 88 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:29:88 2026-02-21T09:09:56.0422936Z setp.ge.s32 %p21, %r286, %r4; 2026-02-21T09:09:56.0423099Z @%p21 bra $L__BB0_9; 2026-02-21T09:09:56.0423268Z // %bb.1: // %.lr.ph 2026-02-21T09:09:56.0423576Z .loc 1 0 88 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:0:88 2026-02-21T09:09:56.0423905Z ld.param.b64 %rd16, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:56.0424155Z ld.param.b64 %rd15, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:56.0424356Z shr.u32 %r5, %r1, 5; 2026-02-21T09:09:56.0424505Z and.b32 %r6, %r1, 15; 2026-02-21T09:09:56.0424648Z shl.b32 %r7, %r6, 1; 2026-02-21T09:09:56.0424799Z bfe.u32 %r8, %r1, 1, 6; 2026-02-21T09:09:56.0424945Z shl.b32 %r9, %r1, 3; 2026-02-21T09:09:56.0425094Z and.b32 %r10, %r9, 8; 2026-02-21T09:09:56.0425239Z and.b32 %r11, %r1, 32; 2026-02-21T09:09:56.0425392Z shl.b32 %r70, %r1, 4; 2026-02-21T09:09:56.0425541Z and.b32 %r12, %r70, 2032; 2026-02-21T09:09:56.0425696Z add.s32 %r182, %r51, %r12; 2026-02-21T09:09:56.0425889Z add.s32 %r174, %r285, 32; 2026-02-21T09:09:56.0426046Z add.s32 %r132, %r182, 2048; 2026-02-21T09:09:56.0426210Z shl.b32 %r72, %r6, 5; 2026-02-21T09:09:56.0426353Z and.b32 %r73, %r1, 96; 2026-02-21T09:09:56.0426507Z shl.b32 %r74, %r73, 4; 2026-02-21T09:09:56.0426659Z bfe.s32 %r75, %r1, 4, 1; 2026-02-21T09:09:56.0426815Z and.b32 %r76, %r1, 16; 2026-02-21T09:09:56.0426963Z or.b32 %r77, %r72, %r74; 2026-02-21T09:09:56.0427124Z or.b32 %r16, %r77, %r76; 2026-02-21T09:09:56.0427321Z add.s32 %r17, %r51, %r16; 2026-02-21T09:09:56.0427477Z shl.b32 %r78, %r1, 9; 2026-02-21T09:09:56.0427636Z and.b32 %r18, %r78, 57344; 
2026-02-21T09:09:56.0427788Z shl.b32 %r79, %r6, 3; 2026-02-21T09:09:56.0427936Z bfe.u32 %r80, %r1, 5, 2; 2026-02-21T09:09:56.0428083Z and.b32 %r81, %r75, 132; 2026-02-21T09:09:56.0428233Z or.b32 %r82, %r79, %r80; 2026-02-21T09:09:56.0428377Z or.b32 %r83, %r82, %r81; 2026-02-21T09:09:56.0428528Z add.s32 %r84, %r51, 6144; 2026-02-21T09:09:56.0428679Z add.s32 %r19, %r84, %r83; 2026-02-21T09:09:56.0428835Z xor.b32 %r85, %r83, 4; 2026-02-21T09:09:56.0428985Z add.s32 %r20, %r84, %r85; 2026-02-21T09:09:56.0429133Z and.b32 %r86, %r1, 31; 2026-02-21T09:09:56.0429285Z shl.b32 %r87, %r86, 2; 2026-02-21T09:09:56.0429430Z bfe.s32 %r88, %r1, 6, 1; 2026-02-21T09:09:56.0429583Z and.b32 %r89, %r88, 132; 2026-02-21T09:09:56.0429728Z xor.b32 %r90, %r89, %r87; 2026-02-21T09:09:56.0429886Z add.s32 %r21, %r84, %r90; 2026-02-21T09:09:56.0430035Z shl.b32 %r91, %r86, 6; 2026-02-21T09:09:56.0430184Z and.b32 %r92, %r9, 48; 2026-02-21T09:09:56.0430325Z shr.u32 %r93, %r73, 3; 2026-02-21T09:09:56.0430477Z or.b32 %r94, %r91, %r93; 2026-02-21T09:09:56.0430663Z or.b32 %r95, %r94, %r92; 2026-02-21T09:09:56.0430812Z add.s32 %r96, %r51, 4096; 2026-02-21T09:09:56.0430968Z add.s32 %r22, %r96, %r95; 2026-02-21T09:09:56.0431122Z xor.b32 %r97, %r95, 16; 2026-02-21T09:09:56.0431279Z add.s32 %r23, %r96, %r97; 2026-02-21T09:09:56.0431430Z xor.b32 %r98, %r95, 32; 2026-02-21T09:09:56.0431637Z add.s32 %r24, %r96, %r98; 2026-02-21T09:09:56.0431786Z xor.b32 %r99, %r95, 48; 2026-02-21T09:09:56.0431940Z add.s32 %r25, %r96, %r99; 2026-02-21T09:09:56.0432091Z bfe.u32 %r100, %r96, 4, 14; 2026-02-21T09:09:56.0432258Z cvt.u64.u32 %rd37, %r100; 2026-02-21T09:09:56.0432428Z or.b64 %rd43, %rd37, -9223371899407433728; 2026-02-21T09:09:56.0432608Z add.s32 %r101, %r51, 4128; 2026-02-21T09:09:56.0432806Z bfe.u32 %r102, %r101, 4, 14; 2026-02-21T09:09:56.0432962Z cvt.u64.u32 %rd38, %r102; 2026-02-21T09:09:56.0433132Z or.b64 %rd44, %rd38, -9223371899407433728; 2026-02-21T09:09:56.0433312Z shl.b32 %r103, %r6, 6; 
2026-02-21T09:09:56.0433465Z shl.b32 %r104, %r73, 5; 2026-02-21T09:09:56.0433612Z shl.b32 %r105, %r76, 1; 2026-02-21T09:09:56.0433768Z or.b32 %r106, %r104, %r92; 2026-02-21T09:09:56.0433932Z xor.b32 %r107, %r106, %r105; 2026-02-21T09:09:56.0434089Z or.b32 %r108, %r107, %r103; 2026-02-21T09:09:56.0434254Z add.s32 %r26, %r51, %r108; 2026-02-21T09:09:56.0434408Z xor.b32 %r109, %r108, 16; 2026-02-21T09:09:56.0434566Z add.s32 %r27, %r51, %r109; 2026-02-21T09:09:56.0434832Z .loc 1 29 88 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:29:88 2026-02-21T09:09:56.0435126Z add.s64 %rd4, %rd15, 96; 2026-02-21T09:09:56.0435279Z shl.b32 %r110, %r8, 10; 2026-02-21T09:09:56.0435563Z .loc 1 50 125 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:50:125 2026-02-21T09:09:56.0435862Z or.b32 %r28, %r110, %r10; 2026-02-21T09:09:56.0436121Z .loc 1 29 88 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:29:88 2026-02-21T09:09:56.0436414Z add.s64 %rd5, %rd16, 65536; 2026-02-21T09:09:56.0436689Z .loc 1 50 125 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:50:125 2026-02-21T09:09:56.0436983Z or.b32 %r29, %r18, %r7; 2026-02-21T09:09:56.0437128Z bra.uni $L__BB0_2; 2026-02-21T09:09:56.0437324Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:56.0437721Z .loc 1 0 125 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:0:125 2026-02-21T09:09:56.0438009Z mov.b32 %r237, 1; 2026-02-21T09:09:56.0438156Z $L__tmp0: 2026-02-21T09:09:56.0438455Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0438806Z // begin inline asm 2026-02-21T09:09:56.0438981Z 2026-02-21T09:09:56.0439110Z { 2026-02-21T09:09:56.0439241Z .reg .pred complete; 2026-02-21T09:09:56.0439404Z waitLoop: 2026-02-21T09:09:56.0439604Z mbarrier.try_wait.parity.shared.b64 complete, [%r288], %r237; 2026-02-21T09:09:56.0439857Z @!complete bra.uni waitLoop; 2026-02-21T09:09:56.0440026Z } 
2026-02-21T09:09:56.0440093Z 2026-02-21T09:09:56.0440152Z // end inline asm 2026-02-21T09:09:56.0440302Z $L__tmp1: 2026-02-21T09:09:56.0440557Z .loc 1 50 125 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:50:125 2026-02-21T09:09:56.0440878Z cp.async.wait_group 0; 2026-02-21T09:09:56.0441034Z bar.sync 0; 2026-02-21T09:09:56.0441182Z add.s32 %r238, %r51, 6400; 2026-02-21T09:09:56.0441342Z // begin inline asm 2026-02-21T09:09:56.0441524Z @%p4 mbarrier.inval.shared::cta.b64 [%r238]; 2026-02-21T09:09:56.0441751Z // end inline asm 2026-02-21T09:09:56.0441891Z bar.sync 0; 2026-02-21T09:09:56.0442032Z // begin inline asm 2026-02-21T09:09:56.0442204Z @%p4 mbarrier.inval.shared::cta.b64 [%r129]; 2026-02-21T09:09:56.0442405Z // end inline asm 2026-02-21T09:09:56.0442543Z $L__tmp2: 2026-02-21T09:09:56.0442882Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0443237Z // begin inline asm 2026-02-21T09:09:56.0443628Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r240, %r241, %r242, %r243, %r244, %r245, %r246, %r247, %r248, %r249, %r250, %r251, %r252, %r253, %r254, %r255}, [%r256 + 0], 16; 2026-02-21T09:09:56.0444047Z // end inline asm 2026-02-21T09:09:56.0444186Z // begin inline asm 2026-02-21T09:09:56.0444350Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:56.0444519Z // end inline asm 2026-02-21T09:09:56.0444671Z cvt.u64.u32 %rd55, %r240; 2026-02-21T09:09:56.0444832Z cvt.u64.u32 %rd56, %r241; 2026-02-21T09:09:56.0445001Z shl.b64 %rd57, %rd56, 32; 2026-02-21T09:09:56.0445161Z or.b64 %rd58, %rd55, %rd57; 2026-02-21T09:09:56.0445360Z $L__tmp3: 2026-02-21T09:09:56.0445622Z .loc 1 97 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:97:28 2026-02-21T09:09:56.0445926Z mov.b64 {%r260, %r261}, %rd58; 2026-02-21T09:09:56.0446117Z cvt.rn.bf16x2.f32 %r262, %r261, %r260; 2026-02-21T09:09:56.0446300Z $L__tmp4: 2026-02-21T09:09:56.0446613Z .loc 2 291 36 // standard.py:291:36 @[ 
cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0446993Z cvt.u64.u32 %rd59, %r242; 2026-02-21T09:09:56.0447153Z cvt.u64.u32 %rd60, %r243; 2026-02-21T09:09:56.0447311Z shl.b64 %rd61, %rd60, 32; 2026-02-21T09:09:56.0447463Z or.b64 %rd62, %rd59, %rd61; 2026-02-21T09:09:56.0447621Z $L__tmp5: 2026-02-21T09:09:56.0447861Z .loc 1 97 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:97:28 2026-02-21T09:09:56.0448162Z mov.b64 {%r263, %r264}, %rd62; 2026-02-21T09:09:56.0448331Z cvt.rn.bf16x2.f32 %r265, %r264, %r263; 2026-02-21T09:09:56.0448510Z $L__tmp6: 2026-02-21T09:09:56.0448790Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0449123Z cvt.u64.u32 %rd63, %r244; 2026-02-21T09:09:56.0449279Z cvt.u64.u32 %rd64, %r245; 2026-02-21T09:09:56.0449428Z shl.b64 %rd65, %rd64, 32; 2026-02-21T09:09:56.0449587Z or.b64 %rd66, %rd63, %rd65; 2026-02-21T09:09:56.0449735Z $L__tmp7: 2026-02-21T09:09:56.0449975Z .loc 1 97 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:97:28 2026-02-21T09:09:56.0450292Z mov.b64 {%r266, %r267}, %rd66; 2026-02-21T09:09:56.0450466Z cvt.rn.bf16x2.f32 %r268, %r267, %r266; 2026-02-21T09:09:56.0450632Z $L__tmp8: 2026-02-21T09:09:56.0450921Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0451258Z cvt.u64.u32 %rd67, %r246; 2026-02-21T09:09:56.0451407Z cvt.u64.u32 %rd68, %r247; 2026-02-21T09:09:56.0451591Z shl.b64 %rd69, %rd68, 32; 2026-02-21T09:09:56.0451775Z or.b64 %rd70, %rd67, %rd69; 2026-02-21T09:09:56.0451929Z $L__tmp9: 2026-02-21T09:09:56.0452163Z .loc 1 97 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:97:28 2026-02-21T09:09:56.0452459Z mov.b64 {%r269, %r270}, %rd70; 2026-02-21T09:09:56.0452623Z cvt.rn.bf16x2.f32 %r271, %r270, %r269; 2026-02-21T09:09:56.0452798Z $L__tmp10: 2026-02-21T09:09:56.0453089Z .loc 2 291 36 // standard.py:291:36 
@[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0453414Z cvt.u64.u32 %rd71, %r248; 2026-02-21T09:09:56.0453570Z cvt.u64.u32 %rd72, %r249; 2026-02-21T09:09:56.0453718Z shl.b64 %rd73, %rd72, 32; 2026-02-21T09:09:56.0453877Z or.b64 %rd74, %rd71, %rd73; 2026-02-21T09:09:56.0454026Z $L__tmp11: 2026-02-21T09:09:56.0454270Z .loc 1 97 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:97:28 2026-02-21T09:09:56.0454566Z mov.b64 {%r272, %r273}, %rd74; 2026-02-21T09:09:56.0454731Z cvt.rn.bf16x2.f32 %r274, %r273, %r272; 2026-02-21T09:09:56.0454903Z $L__tmp12: 2026-02-21T09:09:56.0455220Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0455553Z cvt.u64.u32 %rd75, %r250; 2026-02-21T09:09:56.0455702Z cvt.u64.u32 %rd76, %r251; 2026-02-21T09:09:56.0455857Z shl.b64 %rd77, %rd76, 32; 2026-02-21T09:09:56.0456007Z or.b64 %rd78, %rd75, %rd77; 2026-02-21T09:09:56.0456162Z $L__tmp13: 2026-02-21T09:09:56.0456404Z .loc 1 97 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:97:28 2026-02-21T09:09:56.0456687Z mov.b64 {%r275, %r276}, %rd78; 2026-02-21T09:09:56.0456861Z cvt.rn.bf16x2.f32 %r277, %r276, %r275; 2026-02-21T09:09:56.0457029Z $L__tmp14: 2026-02-21T09:09:56.0457362Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0457689Z cvt.u64.u32 %rd79, %r252; 2026-02-21T09:09:56.0457847Z cvt.u64.u32 %rd80, %r253; 2026-02-21T09:09:56.0457994Z shl.b64 %rd81, %rd80, 32; 2026-02-21T09:09:56.0458150Z or.b64 %rd82, %rd79, %rd81; 2026-02-21T09:09:56.0458303Z $L__tmp15: 2026-02-21T09:09:56.0458540Z .loc 1 97 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:97:28 2026-02-21T09:09:56.0458835Z mov.b64 {%r278, %r279}, %rd82; 2026-02-21T09:09:56.0458998Z cvt.rn.bf16x2.f32 %r280, %r279, %r278; 2026-02-21T09:09:56.0459174Z $L__tmp16: 2026-02-21T09:09:56.0459455Z .loc 2 291 36 // 
standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0459792Z cvt.u64.u32 %rd83, %r254; 2026-02-21T09:09:56.0459942Z cvt.u64.u32 %rd84, %r255; 2026-02-21T09:09:56.0460097Z shl.b64 %rd85, %rd84, 32; 2026-02-21T09:09:56.0460254Z or.b64 %rd86, %rd83, %rd85; 2026-02-21T09:09:56.0460404Z $L__tmp17: 2026-02-21T09:09:56.0460649Z .loc 1 97 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:97:28 2026-02-21T09:09:56.0460938Z mov.b64 {%r281, %r282}, %rd86; 2026-02-21T09:09:56.0461109Z cvt.rn.bf16x2.f32 %r283, %r282, %r281; 2026-02-21T09:09:56.0461394Z .loc 1 98 43 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:98:43 2026-02-21T09:09:56.0461752Z st.shared.v4.b32 [%r26], {%r262, %r265, %r268, %r271}; 2026-02-21T09:09:56.0461992Z st.shared.v4.b32 [%r27], {%r274, %r277, %r280, %r283}; 2026-02-21T09:09:56.0462223Z // begin inline asm 2026-02-21T09:09:56.0462391Z fence.proxy.async.shared::cta; 2026-02-21T09:09:56.0462555Z // end inline asm 2026-02-21T09:09:56.0462695Z bar.sync 0; 2026-02-21T09:09:56.0462833Z elect.sync %r284|%p49, -1; 2026-02-21T09:09:56.0463002Z and.pred %p47, %p1, %p49; 2026-02-21T09:09:56.0463153Z // begin inline asm 2026-02-21T09:09:56.0463420Z @%p47 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd54, {%r257, %r258}], [%r51]; 2026-02-21T09:09:56.0463739Z // end inline asm 2026-02-21T09:09:56.0463887Z cp.async.bulk.commit_group; 2026-02-21T09:09:56.0464076Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:56.0464249Z bar.sync 0; 2026-02-21T09:09:56.0464509Z .loc 1 29 88 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:29:88 2026-02-21T09:09:56.0464804Z add.s32 %r286, %r286, 1; 2026-02-21T09:09:56.0464974Z setp.ne.b32 %p50, %r286, %r4; 2026-02-21T09:09:56.0465140Z @%p50 bra $L__BB0_2; 2026-02-21T09:09:56.0465297Z bra.uni $L__BB0_9; 2026-02-21T09:09:56.0465482Z $L__BB0_2: // =>This Loop Header: Depth=1 2026-02-21T09:09:56.0465734Z // Child Loop BB0_5 Depth 2 
2026-02-21T09:09:56.0466056Z .loc 1 81 38 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:81:38 2026-02-21T09:09:56.0466351Z setp.eq.b32 %p26, %r11, 0; 2026-02-21T09:09:56.0466630Z .loc 1 35 35 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:35:35 2026-02-21T09:09:56.0466931Z shr.s32 %r143, %r286, 31; 2026-02-21T09:09:56.0467097Z shr.u32 %r144, %r143, 22; 2026-02-21T09:09:56.0467255Z add.s32 %r145, %r286, %r144; 2026-02-21T09:09:56.0467455Z shr.s32 %r146, %r145, 10; 2026-02-21T09:09:56.0467722Z .loc 1 38 45 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:38:45 2026-02-21T09:09:56.0468011Z and.b32 %r147, %r145, 64512; 2026-02-21T09:09:56.0468177Z sub.s32 %r148, %r286, %r147; 2026-02-21T09:09:56.0468437Z .loc 1 38 64 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:38:64 2026-02-21T09:09:56.0468730Z cvt.u16.u32 %rs2, %r148; 2026-02-21T09:09:56.0468992Z .loc 1 39 51 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:39:51 2026-02-21T09:09:56.0469281Z shr.s16 %rs3, %rs2, 15; 2026-02-21T09:09:56.0469440Z shr.u16 %rs4, %rs3, 12; 2026-02-21T09:09:56.0469617Z add.s16 %rs5, %rs2, %rs4; 2026-02-21T09:09:56.0469780Z shr.s16 %rs6, %rs5, 4; 2026-02-21T09:09:56.0470039Z .loc 1 38 64 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:38:64 2026-02-21T09:09:56.0470331Z and.b16 %rs7, %rs5, -16; 2026-02-21T09:09:56.0470484Z sub.s16 %rs8, %rs2, %rs7; 2026-02-21T09:09:56.0470750Z .loc 1 40 27 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:40:27 2026-02-21T09:09:56.0471038Z shl.b32 %r33, %r146, 9; 2026-02-21T09:09:56.0471190Z mul.wide.s16 %r34, %rs8, 32; 2026-02-21T09:09:56.0471356Z add.s32 %r257, %r34, %r33; 2026-02-21T09:09:56.0471659Z .loc 1 41 32 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:41:32 2026-02-21T09:09:56.0471949Z or.b32 %r149, %r257, %r7; 2026-02-21T09:09:56.0472207Z .loc 1 42 27 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:42:27 2026-02-21T09:09:56.0472502Z mul.wide.s16 
%r258, %rs6, 64; 2026-02-21T09:09:56.0472769Z .loc 1 43 32 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:43:32 2026-02-21T09:09:56.0473057Z or.b32 %r150, %r258, %r8; 2026-02-21T09:09:56.0473320Z .loc 1 58 53 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:53 2026-02-21T09:09:56.0473602Z shl.b32 %r151, %r150, 10; 2026-02-21T09:09:56.0473751Z $L__tmp18: 2026-02-21T09:09:56.0474037Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0474423Z shfl.sync.idx.b32 %r37, %r5, 0, 31, -1; 2026-02-21T09:09:56.0474603Z shl.b32 %r152, %r37, 21; 2026-02-21T09:09:56.0474761Z and.b32 %r153, %r152, 6291456; 2026-02-21T09:09:56.0474926Z add.s32 %r256, %r153, %r285; 2026-02-21T09:09:56.0475082Z mov.pred %p22, -1; 2026-02-21T09:09:56.0475232Z mov.b32 %r287, 0; 2026-02-21T09:09:56.0475369Z // begin inline asm 2026-02-21T09:09:56.0475750Z @%p22 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r256 + 0], 16, {%r287, %r287, %r287, %r287, %r287, %r287, %r287, %r287, %r287, %r287, %r287, %r287, %r287, %r287, %r287, %r287}; 2026-02-21T09:09:56.0476172Z // end inline asm 2026-02-21T09:09:56.0476316Z // begin inline asm 2026-02-21T09:09:56.0476471Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:56.0476643Z // end inline asm 2026-02-21T09:09:56.0476783Z bar.sync 0; 2026-02-21T09:09:56.0476909Z $L__tmp19: 2026-02-21T09:09:56.0477169Z .loc 1 50 125 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:50:125 2026-02-21T09:09:56.0477466Z add.s32 %r288, %r51, 6400; 2026-02-21T09:09:56.0477626Z // begin inline asm 2026-02-21T09:09:56.0477794Z @%p4 mbarrier.init.shared::cta.b64 [%r288], 1; 2026-02-21T09:09:56.0477991Z // end inline asm 2026-02-21T09:09:56.0478123Z bar.sync 0; 2026-02-21T09:09:56.0478262Z add.s32 %r129, %r51, 6408; 2026-02-21T09:09:56.0478420Z // begin inline asm 2026-02-21T09:09:56.0478590Z @%p4 mbarrier.init.shared::cta.b64 [%r129], 1; 2026-02-21T09:09:56.0478789Z // end inline asm 
2026-02-21T09:09:56.0479043Z .loc 1 58 60 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:60 2026-02-21T09:09:56.0479348Z or.b32 %r155, %r151, %r10; 2026-02-21T09:09:56.0479646Z .loc 1 58 32 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:32 2026-02-21T09:09:56.0479949Z mad.wide.s32 %rd39, %r155, 2, %rd15; 2026-02-21T09:09:56.0480120Z mov.b32 %r131, 16; 2026-02-21T09:09:56.0480374Z .loc 1 58 80 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:80 2026-02-21T09:09:56.0480663Z // begin inline asm 2026-02-21T09:09:56.0480862Z cp.async.cg.shared.global [ %r182 + 0 ], [ %rd39 + 0 ], 0x10, %r131; 2026-02-21T09:09:56.0481101Z // end inline asm 2026-02-21T09:09:56.0481245Z cp.async.commit_group; 2026-02-21T09:09:56.0481523Z .loc 1 58 32 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:32 2026-02-21T09:09:56.0481901Z add.s64 %rd40, %rd39, 32; 2026-02-21T09:09:56.0482188Z .loc 1 58 80 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:80 2026-02-21T09:09:56.0482486Z bar.sync 0; 2026-02-21T09:09:56.0482623Z // begin inline asm 2026-02-21T09:09:56.0482835Z cp.async.cg.shared.global [ %r132 + 0 ], [ %rd40 + 0 ], 0x10, %r131; 2026-02-21T09:09:56.0483066Z // end inline asm 2026-02-21T09:09:56.0483217Z cp.async.commit_group; 2026-02-21T09:09:56.0483382Z cp.async.wait_group 1; 2026-02-21T09:09:56.0483546Z bar.sync 0; 2026-02-21T09:09:56.0483717Z ld.shared.v4.b32 {%r156, %r157, %r158, %r159}, [%r17]; 2026-02-21T09:09:56.0483940Z mov.b32 {%rs9, %rs10}, %r159; 2026-02-21T09:09:56.0484122Z mov.b32 {%rs11, %rs12}, %r158; 2026-02-21T09:09:56.0484295Z mov.b32 {%rs13, %rs14}, %r157; 2026-02-21T09:09:56.0484470Z mov.b32 {%rs15, %rs16}, %r156; 2026-02-21T09:09:56.0484755Z .loc 1 62 32 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:62:32 2026-02-21T09:09:56.0485072Z cvt.f32.bf16 %r135, %rs15; 2026-02-21T09:09:56.0485234Z cvt.f32.bf16 %r136, %rs16; 2026-02-21T09:09:56.0485399Z cvt.f32.bf16 %r137, %rs13; 
2026-02-21T09:09:56.0485555Z cvt.f32.bf16 %r138, %rs14; 2026-02-21T09:09:56.0485719Z cvt.f32.bf16 %r139, %rs11; 2026-02-21T09:09:56.0485873Z cvt.f32.bf16 %r140, %rs12; 2026-02-21T09:09:56.0486038Z cvt.f32.bf16 %r141, %rs9; 2026-02-21T09:09:56.0486200Z cvt.f32.bf16 %r142, %rs10; 2026-02-21T09:09:56.0486350Z $L__tmp20: 2026-02-21T09:09:56.0486662Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0487050Z add.s32 %r194, %r153, %r174; 2026-02-21T09:09:56.0487218Z // begin inline asm 2026-02-21T09:09:56.0487523Z @%p22 tcgen05.st.sync.aligned.16x32bx2.x8.b32 [%r194 + 0], 8, {%r135, %r136, %r137, %r138, %r139, %r140, %r141, %r142}; 2026-02-21T09:09:56.0487864Z // end inline asm 2026-02-21T09:09:56.0488011Z // begin inline asm 2026-02-21T09:09:56.0488204Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:56.0488378Z // end inline asm 2026-02-21T09:09:56.0488516Z bar.sync 0; 2026-02-21T09:09:56.0488667Z $L__tmp21: 2026-02-21T09:09:56.0488912Z .loc 1 64 62 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:64:62 2026-02-21T09:09:56.0489212Z add.s32 %r160, %r149, %r18; 2026-02-21T09:09:56.0489482Z .loc 1 64 34 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:64:34 2026-02-21T09:09:56.0489783Z cvt.s64.s32 %rd42, %r160; 2026-02-21T09:09:56.0489952Z add.s64 %rd41, %rd16, %rd42; 2026-02-21T09:09:56.0490223Z .loc 1 64 87 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:64:87 2026-02-21T09:09:56.0490521Z // begin inline asm 2026-02-21T09:09:56.0490661Z mov.u16 %rs1, 0x0; 2026-02-21T09:09:56.0490820Z ld.global.b16 { %rs1 }, [ %rd41 + 0 ]; 2026-02-21T09:09:56.0490993Z // end inline asm 2026-02-21T09:09:56.0491141Z cvt.s16.s8 %rs17, %rs1; 2026-02-21T09:09:56.0491405Z .loc 1 69 25 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:69:25 2026-02-21T09:09:56.0491726Z shl.b16 %rs18, %rs1, 12; 2026-02-21T09:09:56.0491889Z shr.s16 %rs19, %rs18, 12; 2026-02-21T09:09:56.0492069Z shl.b16 
%rs20, %rs1, 4; 2026-02-21T09:09:56.0492229Z shr.s16 %rs21, %rs20, 12; 2026-02-21T09:09:56.0492493Z .loc 1 72 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:72:28 2026-02-21T09:09:56.0492790Z shr.u16 %rs22, %rs17, 4; 2026-02-21T09:09:56.0492943Z shr.s16 %rs23, %rs1, 12; 2026-02-21T09:09:56.0493211Z .loc 1 76 45 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:76:45 2026-02-21T09:09:56.0493501Z st.shared.b8 [%r19], %rs19; 2026-02-21T09:09:56.0493670Z st.shared.b8 [%r20], %rs21; 2026-02-21T09:09:56.0493829Z bar.sync 0; 2026-02-21T09:09:56.0493966Z ld.shared.b32 %r161, [%r21]; 2026-02-21T09:09:56.0494134Z cvt.u16.u32 %rs24, %r161; 2026-02-21T09:09:56.0494321Z prmt.b32 %r162, %r161, 0, 0x7771U; 2026-02-21T09:09:56.0494501Z cvt.u16.u32 %rs25, %r162; 2026-02-21T09:09:56.0494657Z prmt.b32 %r163, %r161, 0, 0x7772U; 2026-02-21T09:09:56.0494833Z cvt.u16.u32 %rs26, %r163; 2026-02-21T09:09:56.0494986Z prmt.b32 %r164, %r161, 0, 0x7773U; 2026-02-21T09:09:56.0495160Z cvt.u16.u32 %rs27, %r164; 2026-02-21T09:09:56.0495426Z .loc 1 77 45 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:77:45 2026-02-21T09:09:56.0495704Z bar.sync 0; 2026-02-21T09:09:56.0495846Z st.shared.b8 [%r19], %rs22; 2026-02-21T09:09:56.0496005Z st.shared.b8 [%r20], %rs23; 2026-02-21T09:09:56.0496164Z bar.sync 0; 2026-02-21T09:09:56.0496296Z ld.shared.b32 %r165, [%r21]; 2026-02-21T09:09:56.0496457Z cvt.u16.u32 %rs28, %r165; 2026-02-21T09:09:56.0496609Z prmt.b32 %r166, %r165, 0, 0x7771U; 2026-02-21T09:09:56.0496780Z cvt.u16.u32 %rs29, %r166; 2026-02-21T09:09:56.0496930Z prmt.b32 %r167, %r165, 0, 0x7772U; 2026-02-21T09:09:56.0497103Z cvt.u16.u32 %rs30, %r167; 2026-02-21T09:09:56.0497261Z prmt.b32 %r168, %r165, 0, 0x7773U; 2026-02-21T09:09:56.0497423Z cvt.u16.u32 %rs31, %r168; 2026-02-21T09:09:56.0497687Z .loc 1 82 58 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:82:58 2026-02-21T09:09:56.0497978Z selp.b16 %rs32, %rs24, %rs28, %p26; 2026-02-21T09:09:56.0498158Z 
cvt.s16.s8 %rs33, %rs32; 2026-02-21T09:09:56.0498225Z selp.b16 %rs34, %rs25, %rs29, %p26; 2026-02-21T09:09:56.0498285Z cvt.s16.s8 %rs35, %rs34; 2026-02-21T09:09:56.0498390Z selp.b16 %rs36, %rs26, %rs30, %p26; 2026-02-21T09:09:56.0498451Z cvt.s16.s8 %rs37, %rs36; 2026-02-21T09:09:56.0498515Z selp.b16 %rs38, %rs27, %rs31, %p26; 2026-02-21T09:09:56.0498582Z cvt.s16.s8 %rs39, %rs38; 2026-02-21T09:09:56.0498757Z .loc 1 87 32 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:87:32 2026-02-21T09:09:56.0498819Z cvt.rn.f32.s16 %r169, %rs33; 2026-02-21T09:09:56.0498881Z cvt.rn.f32.s16 %r170, %rs35; 2026-02-21T09:09:56.0498981Z cvt.rn.f32.s16 %r171, %rs37; 2026-02-21T09:09:56.0499041Z cvt.rn.f32.s16 %r172, %rs39; 2026-02-21T09:09:56.0499100Z st.shared.b32 [%r22], %r169; 2026-02-21T09:09:56.0499168Z st.shared.b32 [%r23], %r170; 2026-02-21T09:09:56.0499229Z st.shared.b32 [%r24], %r171; 2026-02-21T09:09:56.0499288Z st.shared.b32 [%r25], %r172; 2026-02-21T09:09:56.0499351Z $L__tmp22: 2026-02-21T09:09:56.0499578Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0499643Z // begin inline asm 2026-02-21T09:09:56.0499716Z fence.proxy.async.shared::cta; 2026-02-21T09:09:56.0499780Z // end inline asm 2026-02-21T09:09:56.0499835Z bar.sync 0; 2026-02-21T09:09:56.0499899Z setp.ne.b32 %p27, %r37, 0; 2026-02-21T09:09:56.0499963Z @%p27 bra $L__BB0_4; 2026-02-21T09:09:56.0500062Z // %bb.3: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:56.0500131Z elect.sync %r179|%p29, -1; 2026-02-21T09:09:56.0500192Z mov.b32 %r175, 67635472; 2026-02-21T09:09:56.0500259Z mov.pred %p28, 0; 2026-02-21T09:09:56.0500316Z // begin inline asm 2026-02-21T09:09:56.0500495Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r285 + 0 ], [ %r174 + 0 ], %rd43, %r175, %p28; 2026-02-21T09:09:56.0500562Z // end inline asm 2026-02-21T09:09:56.0500619Z // begin inline asm 2026-02-21T09:09:56.0500768Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ 
%r285 + 0 ], [ %r174 + 8 ], %rd44, %r175, %p22; 2026-02-21T09:09:56.0500831Z // end inline asm 2026-02-21T09:09:56.0500892Z add.s32 %r181, %r51, 6400; 2026-02-21T09:09:56.0500952Z cvt.u64.u32 %rd45, %r181; 2026-02-21T09:09:56.0501009Z // begin inline asm 2026-02-21T09:09:56.0501143Z @%p29 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd45]; 2026-02-21T09:09:56.0501198Z // end inline asm 2026-02-21T09:09:56.0501252Z $L__tmp23: 2026-02-21T09:09:56.0501359Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:09:56.0501582Z .loc 1 0 0 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:0 2026-02-21T09:09:56.0501647Z cvt.s32.s16 %r32, %rs6; 2026-02-21T09:09:56.0501827Z .loc 1 58 32 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:32 2026-02-21T09:09:56.0501886Z add.s64 %rd46, %rd39, 64; 2026-02-21T09:09:56.0502053Z .loc 1 58 80 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:80 2026-02-21T09:09:56.0502117Z // begin inline asm 2026-02-21T09:09:56.0502234Z cp.async.cg.shared.global [ %r182 + 0 ], [ %rd46 + 0 ], 0x10, %r131; 2026-02-21T09:09:56.0502288Z // end inline asm 2026-02-21T09:09:56.0502351Z cp.async.commit_group; 2026-02-21T09:09:56.0502532Z .loc 1 50 125 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:50:125 2026-02-21T09:09:56.0502590Z shl.b32 %r187, %r32, 16; 2026-02-21T09:09:56.0502649Z or.b32 %r188, %r28, %r187; 2026-02-21T09:09:56.0502724Z mad.wide.s32 %rd88, %r188, 2, %rd4; 2026-02-21T09:09:56.0502785Z add.s32 %r189, %r29, %r33; 2026-02-21T09:09:56.0502844Z add.s32 %r190, %r189, %r34; 2026-02-21T09:09:56.0502904Z cvt.s64.s32 %rd48, %r190; 2026-02-21T09:09:56.0502972Z add.s64 %rd87, %rd5, %rd48; 2026-02-21T09:09:56.0503028Z mov.b32 %r290, 1; 2026-02-21T09:09:56.0503085Z mov.b64 %rd89, -8; 2026-02-21T09:09:56.0503149Z mov.b32 %r289, %r287; 2026-02-21T09:09:56.0503208Z mov.b32 %r291, %r287; 2026-02-21T09:09:56.0503264Z bra.uni $L__BB0_5; 2026-02-21T09:09:56.0503397Z $L__BB0_7: // in Loop: 
Header=BB0_5 Depth=2 2026-02-21T09:09:56.0503579Z .loc 1 50 125 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:50:125 2026-02-21T09:09:56.0503637Z add.s64 %rd89, %rd89, 8; 2026-02-21T09:09:56.0503702Z setp.lt.u64 %p42, %rd89, 488; 2026-02-21T09:09:56.0503763Z $L__tmp24: 2026-02-21T09:09:56.0503989Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0504074Z add.s32 %r234, %r290, 1; 2026-02-21T09:09:56.0504142Z setp.gt.s32 %p43, %r234, 1; 2026-02-21T09:09:56.0504207Z selp.b32 %r290, 0, %r234, %p43; 2026-02-21T09:09:56.0504267Z selp.b32 %r235, 1, 0, %p43; 2026-02-21T09:09:56.0504324Z xor.b32 %r49, %r291, %r235; 2026-02-21T09:09:56.0504382Z $L__tmp25: 2026-02-21T09:09:56.0504552Z .loc 1 58 80 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:80 2026-02-21T09:09:56.0504611Z add.s32 %r232, %r46, %r12; 2026-02-21T09:09:56.0504678Z selp.b32 %r233, 16, 0, %p42; 2026-02-21T09:09:56.0504734Z // begin inline asm 2026-02-21T09:09:56.0504846Z cp.async.cg.shared.global [ %r232 + 0 ], [ %rd88 + 0 ], 0x10, %r233; 2026-02-21T09:09:56.0504907Z // end inline asm 2026-02-21T09:09:56.0504970Z cp.async.commit_group; 2026-02-21T09:09:56.0505147Z .loc 1 50 125 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:50:125 2026-02-21T09:09:56.0505207Z add.s64 %rd88, %rd88, 32; 2026-02-21T09:09:56.0505274Z add.s64 %rd87, %rd87, 65536; 2026-02-21T09:09:56.0505339Z setp.lt.u64 %p44, %rd89, 496; 2026-02-21T09:09:56.0505396Z mov.b32 %r287, %r291; 2026-02-21T09:09:56.0505499Z mov.b32 %r291, %r49; 2026-02-21T09:09:56.0505559Z @%p44 bra $L__BB0_5; 2026-02-21T09:09:56.0505616Z bra.uni $L__BB0_8; 2026-02-21T09:09:56.0505714Z $L__BB0_5: // Parent Loop BB0_2 Depth=1 2026-02-21T09:09:56.0505814Z // => This Inner Loop Header: Depth=2 2026-02-21T09:09:56.0505874Z add.s32 %r203, %r289, 1; 2026-02-21T09:09:56.0505935Z setp.gt.s32 %p36, %r203, 1; 2026-02-21T09:09:56.0506005Z selp.b32 %r289, 0, %r203, %p36; 
2026-02-21T09:09:56.0506176Z .loc 1 58 80 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:58:80 2026-02-21T09:09:56.0506240Z cp.async.wait_group 1; 2026-02-21T09:09:56.0506302Z bar.sync 0; 2026-02-21T09:09:56.0506385Z shl.b32 %r204, %r289, 11; 2026-02-21T09:09:56.0506446Z add.s32 %r46, %r51, %r204; 2026-02-21T09:09:56.0506505Z add.s32 %r206, %r46, %r16; 2026-02-21T09:09:56.0506610Z ld.shared.v4.b32 {%r207, %r208, %r209, %r210}, [%r206]; 2026-02-21T09:09:56.0506673Z mov.b32 {%rs41, %rs42}, %r210; 2026-02-21T09:09:56.0506735Z mov.b32 {%rs43, %rs44}, %r209; 2026-02-21T09:09:56.0506800Z mov.b32 {%rs45, %rs46}, %r208; 2026-02-21T09:09:56.0506859Z mov.b32 {%rs47, %rs48}, %r207; 2026-02-21T09:09:56.0507024Z .loc 1 62 32 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:62:32 2026-02-21T09:09:56.0507087Z cvt.f32.bf16 %r195, %rs47; 2026-02-21T09:09:56.0507153Z cvt.f32.bf16 %r196, %rs48; 2026-02-21T09:09:56.0507210Z cvt.f32.bf16 %r197, %rs45; 2026-02-21T09:09:56.0507268Z cvt.f32.bf16 %r198, %rs46; 2026-02-21T09:09:56.0507336Z cvt.f32.bf16 %r199, %rs43; 2026-02-21T09:09:56.0507393Z cvt.f32.bf16 %r200, %rs44; 2026-02-21T09:09:56.0507453Z cvt.f32.bf16 %r201, %rs41; 2026-02-21T09:09:56.0507523Z cvt.f32.bf16 %r202, %rs42; 2026-02-21T09:09:56.0507577Z $L__tmp26: 2026-02-21T09:09:56.0507796Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0507856Z // begin inline asm 2026-02-21T09:09:56.0507917Z 2026-02-21T09:09:56.0507968Z { 2026-02-21T09:09:56.0508031Z .reg .pred complete; 2026-02-21T09:09:56.0508092Z waitLoop: 2026-02-21T09:09:56.0508209Z mbarrier.try_wait.parity.shared.b64 complete, [%r288], %r287; 2026-02-21T09:09:56.0508298Z @!complete bra.uni waitLoop; 2026-02-21T09:09:56.0508348Z } 2026-02-21T09:09:56.0508353Z 2026-02-21T09:09:56.0508415Z // end inline asm 2026-02-21T09:09:56.0508474Z mov.pred %p37, -1; 2026-02-21T09:09:56.0508530Z // begin inline asm 2026-02-21T09:09:56.0508742Z 
@%p37 tcgen05.st.sync.aligned.16x32bx2.x8.b32 [%r194 + 0], 8, {%r195, %r196, %r197, %r198, %r199, %r200, %r201, %r202}; 2026-02-21T09:09:56.0508797Z // end inline asm 2026-02-21T09:09:56.0508853Z // begin inline asm 2026-02-21T09:09:56.0508951Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:56.0509005Z // end inline asm 2026-02-21T09:09:56.0509060Z bar.sync 0; 2026-02-21T09:09:56.0509111Z $L__tmp27: 2026-02-21T09:09:56.0509285Z .loc 1 64 87 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:64:87 2026-02-21T09:09:56.0509340Z // begin inline asm 2026-02-21T09:09:56.0509396Z mov.u16 %rs40, 0x0; 2026-02-21T09:09:56.0509474Z ld.global.b16 { %rs40 }, [ %rd87 + 0 ]; 2026-02-21T09:09:56.0509529Z // end inline asm 2026-02-21T09:09:56.0509587Z cvt.s16.s8 %rs49, %rs40; 2026-02-21T09:09:56.0509752Z .loc 1 69 25 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:69:25 2026-02-21T09:09:56.0509821Z shl.b16 %rs50, %rs40, 12; 2026-02-21T09:09:56.0509881Z shr.s16 %rs51, %rs50, 12; 2026-02-21T09:09:56.0509939Z shl.b16 %rs52, %rs40, 4; 2026-02-21T09:09:56.0510003Z shr.s16 %rs53, %rs52, 12; 2026-02-21T09:09:56.0510173Z .loc 1 72 28 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:72:28 2026-02-21T09:09:56.0510232Z shr.u16 %rs54, %rs49, 4; 2026-02-21T09:09:56.0510296Z shr.s16 %rs55, %rs40, 12; 2026-02-21T09:09:56.0510500Z .loc 1 76 45 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:76:45 2026-02-21T09:09:56.0510564Z st.shared.b8 [%r19], %rs51; 2026-02-21T09:09:56.0510625Z st.shared.b8 [%r20], %rs53; 2026-02-21T09:09:56.0510688Z bar.sync 0; 2026-02-21T09:09:56.0510749Z ld.shared.b32 %r211, [%r21]; 2026-02-21T09:09:56.0510811Z cvt.u16.u32 %rs56, %r211; 2026-02-21T09:09:56.0510881Z prmt.b32 %r212, %r211, 0, 0x7771U; 2026-02-21T09:09:56.0510939Z cvt.u16.u32 %rs57, %r212; 2026-02-21T09:09:56.0511001Z prmt.b32 %r213, %r211, 0, 0x7772U; 2026-02-21T09:09:56.0511058Z cvt.u16.u32 %rs58, %r213; 2026-02-21T09:09:56.0511125Z prmt.b32 %r214, %r211, 0, 0x7773U; 
2026-02-21T09:09:56.0511182Z cvt.u16.u32 %rs59, %r214; 2026-02-21T09:09:56.0511374Z .loc 1 77 45 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:77:45 2026-02-21T09:09:56.0511439Z bar.sync 0; 2026-02-21T09:09:56.0511499Z st.shared.b8 [%r19], %rs54; 2026-02-21T09:09:56.0511594Z st.shared.b8 [%r20], %rs55; 2026-02-21T09:09:56.0511655Z bar.sync 0; 2026-02-21T09:09:56.0511716Z ld.shared.b32 %r215, [%r21]; 2026-02-21T09:09:56.0511772Z cvt.u16.u32 %rs60, %r215; 2026-02-21T09:09:56.0511831Z prmt.b32 %r216, %r215, 0, 0x7771U; 2026-02-21T09:09:56.0511896Z cvt.u16.u32 %rs61, %r216; 2026-02-21T09:09:56.0511957Z prmt.b32 %r217, %r215, 0, 0x7772U; 2026-02-21T09:09:56.0512013Z cvt.u16.u32 %rs62, %r217; 2026-02-21T09:09:56.0512076Z prmt.b32 %r218, %r215, 0, 0x7773U; 2026-02-21T09:09:56.0512133Z cvt.u16.u32 %rs63, %r218; 2026-02-21T09:09:56.0512299Z .loc 1 82 58 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:82:58 2026-02-21T09:09:56.0512365Z selp.b16 %rs64, %rs56, %rs60, %p26; 2026-02-21T09:09:56.0512433Z cvt.s16.s8 %rs65, %rs64; 2026-02-21T09:09:56.0512500Z selp.b16 %rs66, %rs57, %rs61, %p26; 2026-02-21T09:09:56.0512557Z cvt.s16.s8 %rs67, %rs66; 2026-02-21T09:09:56.0512627Z selp.b16 %rs68, %rs58, %rs62, %p26; 2026-02-21T09:09:56.0512686Z cvt.s16.s8 %rs69, %rs68; 2026-02-21T09:09:56.0512748Z selp.b16 %rs70, %rs59, %rs63, %p26; 2026-02-21T09:09:56.0512805Z cvt.s16.s8 %rs71, %rs70; 2026-02-21T09:09:56.0512977Z .loc 1 87 32 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:87:32 2026-02-21T09:09:56.0513070Z cvt.rn.f32.s16 %r219, %rs65; 2026-02-21T09:09:56.0513130Z cvt.rn.f32.s16 %r220, %rs67; 2026-02-21T09:09:56.0513199Z cvt.rn.f32.s16 %r221, %rs69; 2026-02-21T09:09:56.0513259Z cvt.rn.f32.s16 %r222, %rs71; 2026-02-21T09:09:56.0513320Z st.shared.b32 [%r22], %r219; 2026-02-21T09:09:56.0513386Z st.shared.b32 [%r23], %r220; 2026-02-21T09:09:56.0513444Z st.shared.b32 [%r24], %r221; 2026-02-21T09:09:56.0513501Z st.shared.b32 [%r25], %r222; 
2026-02-21T09:09:56.0513681Z .loc 1 50 125 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:50:125 2026-02-21T09:09:56.0513773Z shl.b32 %r223, %r290, 3; 2026-02-21T09:09:56.0513831Z add.s32 %r224, %r51, %r223; 2026-02-21T09:09:56.0513890Z add.s32 %r288, %r224, 6400; 2026-02-21T09:09:56.0513951Z $L__tmp28: 2026-02-21T09:09:56.0514169Z .loc 2 291 36 // standard.py:291:36 @[ cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:94:40 ] 2026-02-21T09:09:56.0514227Z // begin inline asm 2026-02-21T09:09:56.0514309Z fence.proxy.async.shared::cta; 2026-02-21T09:09:56.0514364Z // end inline asm 2026-02-21T09:09:56.0514418Z bar.sync 0; 2026-02-21T09:09:56.0514476Z @%p27 bra $L__BB0_7; 2026-02-21T09:09:56.0514581Z // %bb.6: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:09:56.0514649Z elect.sync %r231|%p38, -1; 2026-02-21T09:09:56.0514707Z mov.b32 %r227, 67635472; 2026-02-21T09:09:56.0514771Z // begin inline asm 2026-02-21T09:09:56.0514928Z @%p38 tcgen05.mma.cta_group::1.kind::tf32 [ %r285 + 0 ], [ %r174 + 0 ], %rd43, %r227, %p37; 2026-02-21T09:09:56.0514985Z // end inline asm 2026-02-21T09:09:56.0515041Z // begin inline asm 2026-02-21T09:09:56.0515226Z @%p38 tcgen05.mma.cta_group::1.kind::tf32 [ %r285 + 0 ], [ %r174 + 8 ], %rd44, %r227, %p37; 2026-02-21T09:09:56.0515284Z // end inline asm 2026-02-21T09:09:56.0515345Z cvt.u64.u32 %rd52, %r288; 2026-02-21T09:09:56.0515412Z // begin inline asm 2026-02-21T09:09:56.0515535Z @%p38 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd52]; 2026-02-21T09:09:56.0515593Z // end inline asm 2026-02-21T09:09:56.0515657Z bra.uni $L__BB0_7; 2026-02-21T09:09:56.0515710Z $L__tmp29: 2026-02-21T09:09:56.0515792Z $L__BB0_9: // %._crit_edge 2026-02-21T09:09:56.0515963Z .loc 1 29 4 // cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py:29:4 2026-02-21T09:09:56.0516024Z bar.sync 0; 2026-02-21T09:09:56.0516116Z // begin inline asm 2026-02-21T09:09:56.0516231Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r285, 64; 
2026-02-21T09:09:56.0516296Z // end inline asm 2026-02-21T09:09:56.0516349Z ret; 2026-02-21T09:09:56.0516402Z $L__tmp30: 2026-02-21T09:09:56.0516466Z $L__func_end0: 2026-02-21T09:09:56.0516548Z // -- End function 2026-02-21T09:09:56.0516600Z } 2026-02-21T09:09:56.0516813Z .file 1 "/tmp/torchinductor_root/xt/cxtrrhatqhklpaincjpuglld7sdbspq7yjhmfdhjvr4vvlwzzhj4.py" 2026-02-21T09:09:56.0516986Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:56.0517049Z .section .debug_abbrev 2026-02-21T09:09:56.0517101Z { 2026-02-21T09:09:56.0517196Z .b8 1 // Abbreviation Code 2026-02-21T09:09:56.0517282Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:56.0517361Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:56.0517449Z .b8 37 // DW_AT_producer 2026-02-21T09:09:56.0517528Z .b8 8 // DW_FORM_string 2026-02-21T09:09:56.0517601Z .b8 19 // DW_AT_language 2026-02-21T09:09:56.0517678Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:56.0517762Z .b8 3 // DW_AT_name 2026-02-21T09:09:56.0517835Z .b8 8 // DW_FORM_string 2026-02-21T09:09:56.0517911Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:56.0518020Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:56.0518095Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:56.0518165Z .b8 8 // DW_FORM_string 2026-02-21T09:09:56.0518242Z .b8 0 // EOM(1) 2026-02-21T09:09:56.0518310Z .b8 0 // EOM(2) 2026-02-21T09:09:56.0518393Z .b8 2 // Abbreviation Code 2026-02-21T09:09:56.0518496Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:56.0518576Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:56.0518651Z .b8 3 // DW_AT_name 2026-02-21T09:09:56.0518723Z .b8 8 // DW_FORM_string 2026-02-21T09:09:56.0518807Z .b8 32 // DW_AT_inline 2026-02-21T09:09:56.0518883Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:56.0518952Z .b8 0 // EOM(1) 2026-02-21T09:09:56.0519026Z .b8 0 // EOM(2) 2026-02-21T09:09:56.0519106Z .b8 3 // Abbreviation Code 2026-02-21T09:09:56.0519186Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:56.0519263Z .b8 1 // DW_CHILDREN_yes 
2026-02-21T09:09:56.0519347Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:56.0519421Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:56.0519498Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:56.0519600Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:56.0519686Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:56.0519757Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:56.0519829Z .b8 0 // EOM(1) 2026-02-21T09:09:56.0519896Z .b8 0 // EOM(2) 2026-02-21T09:09:56.0519975Z .b8 4 // Abbreviation Code 2026-02-21T09:09:56.0520072Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:56.0520144Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:56.0520227Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:56.0520318Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:56.0520400Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:56.0520470Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:56.0520548Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:56.0520624Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:56.0520701Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:56.0520774Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:56.0520857Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:56.0520930Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:56.0521007Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:56.0521080Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:56.0521155Z .b8 0 // EOM(1) 2026-02-21T09:09:56.0521223Z .b8 0 // EOM(2) 2026-02-21T09:09:56.0521292Z .b8 0 // EOM(3) 2026-02-21T09:09:56.0521352Z } 2026-02-21T09:09:56.0521413Z .section .debug_info 2026-02-21T09:09:56.0521466Z { 2026-02-21T09:09:56.0521585Z .b32 178 // Length of Unit 2026-02-21T09:09:56.0521671Z .b8 2 // DWARF version number 2026-02-21T09:09:56.0521724Z .b8 0 2026-02-21T09:09:56.0521839Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:56.0521962Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:56.0522062Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:56.0522142Z .b8 116 // DW_AT_producer 2026-02-21T09:09:56.0522204Z .b8 114 2026-02-21T09:09:56.0522258Z .b8 105 2026-02-21T09:09:56.0522310Z .b8 116 2026-02-21T09:09:56.0522364Z .b8 111 2026-02-21T09:09:56.0522452Z .b8 110 2026-02-21T09:09:56.0522503Z .b8 0 2026-02-21T09:09:56.0522578Z .b8 2 // DW_AT_language 2026-02-21T09:09:56.0522638Z .b8 0 2026-02-21T09:09:56.0522713Z .b8 99 // DW_AT_name 2026-02-21T09:09:56.0522764Z .b8 120 2026-02-21T09:09:56.0522815Z .b8 116 2026-02-21T09:09:56.0522874Z .b8 114 2026-02-21T09:09:56.0522928Z .b8 114 2026-02-21T09:09:56.0522981Z .b8 104 2026-02-21T09:09:56.0523042Z .b8 97 2026-02-21T09:09:56.0523094Z .b8 116 2026-02-21T09:09:56.0523147Z .b8 113 2026-02-21T09:09:56.0523198Z .b8 104 2026-02-21T09:09:56.0523261Z .b8 107 2026-02-21T09:09:56.0523311Z .b8 108 2026-02-21T09:09:56.0523361Z .b8 112 2026-02-21T09:09:56.0523420Z .b8 97 2026-02-21T09:09:56.0523470Z .b8 105 2026-02-21T09:09:56.0523520Z .b8 110 2026-02-21T09:09:56.0523570Z .b8 99 2026-02-21T09:09:56.0523629Z .b8 106 2026-02-21T09:09:56.0523678Z .b8 112 2026-02-21T09:09:56.0523728Z .b8 117 2026-02-21T09:09:56.0523777Z .b8 103 2026-02-21T09:09:56.0523835Z .b8 108 2026-02-21T09:09:56.0523888Z .b8 108 2026-02-21T09:09:56.0523938Z .b8 100 2026-02-21T09:09:56.0523996Z .b8 55 2026-02-21T09:09:56.0524045Z .b8 115 2026-02-21T09:09:56.0524094Z .b8 100 2026-02-21T09:09:56.0524144Z .b8 98 2026-02-21T09:09:56.0524230Z .b8 115 2026-02-21T09:09:56.0524282Z .b8 112 2026-02-21T09:09:56.0524332Z .b8 113 2026-02-21T09:09:56.0524389Z .b8 55 2026-02-21T09:09:56.0524441Z .b8 121 2026-02-21T09:09:56.0524501Z .b8 106 2026-02-21T09:09:56.0524552Z .b8 104 2026-02-21T09:09:56.0524612Z .b8 109 2026-02-21T09:09:56.0524667Z .b8 102 2026-02-21T09:09:56.0524720Z .b8 100 2026-02-21T09:09:56.0524780Z .b8 104 2026-02-21T09:09:56.0524833Z .b8 106 
2026-02-21T09:09:56.0524886Z .b8 118 2026-02-21T09:09:56.0524938Z .b8 114 2026-02-21T09:09:56.0524999Z .b8 52 2026-02-21T09:09:56.0525052Z .b8 118 2026-02-21T09:09:56.0525104Z .b8 118 2026-02-21T09:09:56.0525156Z .b8 108 2026-02-21T09:09:56.0525216Z .b8 119 2026-02-21T09:09:56.0525268Z .b8 122 2026-02-21T09:09:56.0525319Z .b8 122 2026-02-21T09:09:56.0525407Z .b8 104 2026-02-21T09:09:56.0525461Z .b8 106 2026-02-21T09:09:56.0525515Z .b8 52 2026-02-21T09:09:56.0525567Z .b8 46 2026-02-21T09:09:56.0525626Z .b8 112 2026-02-21T09:09:56.0525678Z .b8 121 2026-02-21T09:09:56.0525731Z .b8 0 2026-02-21T09:09:56.0525834Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:56.0525911Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:56.0525963Z .b8 116 2026-02-21T09:09:56.0526016Z .b8 109 2026-02-21T09:09:56.0526075Z .b8 112 2026-02-21T09:09:56.0526129Z .b8 47 2026-02-21T09:09:56.0526180Z .b8 116 2026-02-21T09:09:56.0526239Z .b8 111 2026-02-21T09:09:56.0526290Z .b8 114 2026-02-21T09:09:56.0526342Z .b8 99 2026-02-21T09:09:56.0526393Z .b8 104 2026-02-21T09:09:56.0526453Z .b8 105 2026-02-21T09:09:56.0526504Z .b8 110 2026-02-21T09:09:56.0526556Z .b8 100 2026-02-21T09:09:56.0526607Z .b8 117 2026-02-21T09:09:56.0526665Z .b8 99 2026-02-21T09:09:56.0526717Z .b8 116 2026-02-21T09:09:56.0526768Z .b8 111 2026-02-21T09:09:56.0526827Z .b8 114 2026-02-21T09:09:56.0526881Z .b8 95 2026-02-21T09:09:56.0526932Z .b8 114 2026-02-21T09:09:56.0526983Z .b8 111 2026-02-21T09:09:56.0527042Z .b8 111 2026-02-21T09:09:56.0527095Z .b8 116 2026-02-21T09:09:56.0527148Z .b8 47 2026-02-21T09:09:56.0527208Z .b8 120 2026-02-21T09:09:56.0527261Z .b8 116 2026-02-21T09:09:56.0527312Z .b8 0 2026-02-21T09:09:56.0527416Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:56.0527501Z .b8 95 // DW_AT_name 2026-02-21T09:09:56.0527580Z .b8 104 2026-02-21T09:09:56.0527633Z .b8 101 2026-02-21T09:09:56.0527692Z .b8 108 2026-02-21T09:09:56.0527744Z .b8 105 2026-02-21T09:09:56.0527797Z .b8 111 
2026-02-21T09:09:56.0527849Z .b8 110 2026-02-21T09:09:56.0527911Z .b8 95 2026-02-21T09:09:56.0527963Z .b8 109 2026-02-21T09:09:56.0528016Z .b8 97 2026-02-21T09:09:56.0528075Z .b8 116 2026-02-21T09:09:56.0528129Z .b8 109 2026-02-21T09:09:56.0528181Z .b8 117 2026-02-21T09:09:56.0528234Z .b8 108 2026-02-21T09:09:56.0528339Z .b8 95 2026-02-21T09:09:56.0528392Z .b8 98 2026-02-21T09:09:56.0528444Z .b8 102 2026-02-21T09:09:56.0528497Z .b8 49 2026-02-21T09:09:56.0528557Z .b8 54 2026-02-21T09:09:56.0528611Z .b8 95 2026-02-21T09:09:56.0528664Z .b8 105 2026-02-21T09:09:56.0528726Z .b8 110 2026-02-21T09:09:56.0528781Z .b8 116 2026-02-21T09:09:56.0528834Z .b8 52 2026-02-21T09:09:56.0528894Z .b8 0 2026-02-21T09:09:56.0528979Z .b8 1 // DW_AT_inline 2026-02-21T09:09:56.0529080Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:56.0530704Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:56.0530808Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:56.0530923Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:56.0531045Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:56.0531144Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:56.0531245Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:56.0531335Z .b64 $L__tmp29 // DW_AT_high_pc 2026-02-21T09:09:56.0531416Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:56.0531497Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:56.0531633Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:56.0531722Z .b8 0 // End Of Children Mark 2026-02-21T09:09:56.0531810Z .b8 0 // End Of Children Mark 2026-02-21T09:09:56.0531900Z } 2026-02-21T09:09:56.0531971Z .section .debug_macinfo { } 2026-02-21T09:09:56.0531976Z 2026-02-21T09:09:56.0532058Z ================================================================ 2026-02-21T09:09:56.0532171Z please share the reproducer above with Triton project. 2026-02-21T09:09:57.5747609Z [184s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:57.5747954Z 2026-02-21T09:09:57.5749901Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 256, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:57.5751137Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:57.5751394Z `ptxas` stderr: 2026-02-21T09:09:57.5752054Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 297 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:57.5752583Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:57.5752746Z 2026-02-21T09:09:57.5753170Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpgud_jw8t.ptx -o /tmp/tmpgud_jw8t.ptx.o 2026-02-21T09:09:57.5753609Z 2026-02-21T09:09:57.5753746Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:57.5753963Z 2026-02-21T09:09:57.5753968Z 2026-02-21T09:09:57.5754057Z ================================================================ 2026-02-21T09:09:57.5754479Z Internal Triton PTX codegen error 2026-02-21T09:09:57.5754662Z `ptxas` stderr: 2026-02-21T09:09:57.5755097Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 297 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:57.5755568Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:57.5755721Z 2026-02-21T09:09:57.5756079Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpgud_jw8t.ptx -o /tmp/tmpgud_jw8t.ptx.o 2026-02-21T09:09:57.5756572Z 2026-02-21T09:09:57.5756575Z 2026-02-21T09:09:57.5756638Z // 2026-02-21T09:09:57.5756779Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:57.5756957Z // 2026-02-21T09:09:57.5757024Z 2026-02-21T09:09:57.5757081Z .version 8.7 2026-02-21T09:09:57.5757222Z .target sm_100a 2026-02-21T09:09:57.5757356Z .address_size 64 2026-02-21T09:09:57.5757446Z 2026-02-21T09:09:57.5757594Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:57.5757971Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:57.5758192Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:57.5758414Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:57.5758657Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:57.5759223Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:57.5759502Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:57.5759778Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:57.5760055Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:57.5760275Z ) 2026-02-21T09:09:57.5760412Z .reqntid 128 2026-02-21T09:09:57.5760546Z .maxnreg 32 2026-02-21T09:09:57.5760675Z { 2026-02-21T09:09:57.5760799Z .reg .pred %p<123>; 2026-02-21T09:09:57.5760956Z .reg .b16 %rs<209>; 2026-02-21T09:09:57.5761128Z .reg .b32 %r<770>; 2026-02-21T09:09:57.5761268Z .reg .b64 %rd<259>; 2026-02-21T09:09:57.5761410Z $L__func_begin0: 2026-02-21T09:09:57.5761490Z 2026-02-21T09:09:57.5761597Z // %bb.0: 2026-02-21T09:09:57.5761838Z .loc 1 19 0 // 
co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:19 2026-02-21T09:09:57.5762136Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:57.5762380Z ld.param.b64 %rd20, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:57.5762610Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:57.5762809Z ld.param.b64 %rd38, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:57.5763029Z mov.b32 %r52, global_smem; 2026-02-21T09:09:57.5763194Z // begin inline asm 2026-02-21T09:09:57.5763435Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r52], 128; 2026-02-21T09:09:57.5763689Z // end inline asm 2026-02-21T09:09:57.5763866Z ld.param.b64 %rd55, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:57.5764076Z bar.sync 0; 2026-02-21T09:09:57.5764220Z ld.shared.b32 %r763, [global_smem]; 2026-02-21T09:09:57.5764400Z bar.sync 0; 2026-02-21T09:09:57.5764531Z // begin inline asm 2026-02-21T09:09:57.5764743Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:57.5764972Z // end inline asm 2026-02-21T09:09:57.5765227Z .loc 1 21 66 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:21:66 2026-02-21T09:09:57.5765526Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:09:57.5765680Z mov.u32 %r69, %ctaid.y; 2026-02-21T09:09:57.5765836Z mov.u32 %r70, %ctaid.z; 2026-02-21T09:09:57.5765985Z mov.u32 %r71, %nctaid.x; 2026-02-21T09:09:57.5766143Z mov.u32 %r72, %nctaid.y; 2026-02-21T09:09:57.5766298Z mad.lo.s32 %r73, %r70, %r72, %r69; 2026-02-21T09:09:57.5766480Z mad.lo.s32 %r74, %r73, %r71, %r3; 2026-02-21T09:09:57.5766644Z shl.b32 %r75, %r74, 8; 2026-02-21T09:09:57.5766800Z cvt.s64.s32 %rd56, %r75; 2026-02-21T09:09:57.5767001Z add.s64 %rd34, %rd55, %rd56; 2026-02-21T09:09:57.5767156Z shl.b32 %r76, %r1, 2; 2026-02-21T09:09:57.5767312Z add.s32 %r53, %r52, %r76; 2026-02-21T09:09:57.5767460Z mov.b32 %r62, 0; 2026-02-21T09:09:57.5767601Z // begin inline asm 2026-02-21T09:09:57.5767752Z @%p1 st.shared.b32 [ %r53 + 0 ], %r62; 2026-02-21T09:09:57.5767930Z // end inline asm 
2026-02-21T09:09:57.5768068Z bar.warp.sync -1; 2026-02-21T09:09:57.5768223Z setp.eq.b32 %p115, %r1, 0; 2026-02-21T09:09:57.5768383Z cvt.u64.u32 %rd19, %r52; 2026-02-21T09:09:57.5768576Z // begin inline asm 2026-02-21T09:09:57.5768836Z @%p115 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd19 + 0 ], %rd20; 2026-02-21T09:09:57.5769117Z // end inline asm 2026-02-21T09:09:57.5769257Z // begin inline asm 2026-02-21T09:09:57.5769482Z @%p115 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1; 2026-02-21T09:09:57.5769737Z // end inline asm 2026-02-21T09:09:57.5769869Z mov.b32 %r55, 32; 2026-02-21T09:09:57.5770016Z // begin inline asm 2026-02-21T09:09:57.5770262Z @%p115 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0, %r55; 2026-02-21T09:09:57.5770572Z // end inline asm 2026-02-21T09:09:57.5770719Z mov.b32 %r150, 16; 2026-02-21T09:09:57.5770860Z // begin inline asm 2026-02-21T09:09:57.5771109Z @%p115 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1, %r150; 2026-02-21T09:09:57.5771378Z // end inline asm 2026-02-21T09:09:57.5771524Z mov.b32 %r57, 8192; 2026-02-21T09:09:57.5771697Z // begin inline asm 2026-02-21T09:09:57.5771954Z @%p115 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0, %r57; 2026-02-21T09:09:57.5772238Z // end inline asm 2026-02-21T09:09:57.5772371Z mov.b32 %r58, 512; 2026-02-21T09:09:57.5772514Z // begin inline asm 2026-02-21T09:09:57.5772748Z @%p115 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1, %r58; 2026-02-21T09:09:57.5773025Z // end inline asm 2026-02-21T09:09:57.5773157Z mov.b64 %rd27, 8192; 2026-02-21T09:09:57.5773309Z // begin inline asm 2026-02-21T09:09:57.5773564Z @%p115 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd19 + 0 ], 0x0, %rd27; 2026-02-21T09:09:57.5773851Z // end inline asm 2026-02-21T09:09:57.5773990Z mov.b32 %r59, 1; 2026-02-21T09:09:57.5774121Z // begin inline asm 
2026-02-21T09:09:57.5774382Z @%p115 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0, %r59; 2026-02-21T09:09:57.5774689Z // end inline asm 2026-02-21T09:09:57.5774833Z // begin inline asm 2026-02-21T09:09:57.5775079Z @%p115 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1, %r59; 2026-02-21T09:09:57.5775364Z // end inline asm 2026-02-21T09:09:57.5775504Z // begin inline asm 2026-02-21T09:09:57.5775734Z @%p115 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0; 2026-02-21T09:09:57.5776000Z // end inline asm 2026-02-21T09:09:57.5776130Z // begin inline asm 2026-02-21T09:09:57.5776383Z @%p115 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0; 2026-02-21T09:09:57.5776660Z // end inline asm 2026-02-21T09:09:57.5776800Z // begin inline asm 2026-02-21T09:09:57.5777040Z @%p115 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1; 2026-02-21T09:09:57.5777310Z // end inline asm 2026-02-21T09:09:57.5777448Z // begin inline asm 2026-02-21T09:09:57.5777675Z @%p115 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0; 2026-02-21T09:09:57.5777940Z // end inline asm 2026-02-21T09:09:57.5778070Z // begin inline asm 2026-02-21T09:09:57.5778411Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd34 + 0 ], [ %rd19 + 0 ], 0x80; 2026-02-21T09:09:57.5778779Z // end inline asm 2026-02-21T09:09:57.5778912Z // begin inline asm 2026-02-21T09:09:57.5779124Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd34 + 0 ], 0x80; 2026-02-21T09:09:57.5779371Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:57.5779599Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:57.5779771Z // end inline asm 2026-02-21T09:09:57.5779910Z bar.sync 0; 2026-02-21T09:09:57.5780052Z cvta.global.u64 %rd102, %rd34; 2026-02-21T09:09:57.5780339Z .loc 1 23 67 // 
co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:23:67 2026-02-21T09:09:57.5780642Z add.s64 %rd52, %rd34, 128; 2026-02-21T09:09:57.5780799Z bar.sync 0; 2026-02-21T09:09:57.5780941Z // begin inline asm 2026-02-21T09:09:57.5781137Z @%p1 st.shared.b32 [ %r53 + 0 ], %r62; 2026-02-21T09:09:57.5781330Z // end inline asm 2026-02-21T09:09:57.5781480Z bar.warp.sync -1; 2026-02-21T09:09:57.5781660Z // begin inline asm 2026-02-21T09:09:57.5781922Z @%p115 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd19 + 0 ], %rd38; 2026-02-21T09:09:57.5782219Z // end inline asm 2026-02-21T09:09:57.5782359Z // begin inline asm 2026-02-21T09:09:57.5782599Z @%p115 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1; 2026-02-21T09:09:57.5782872Z // end inline asm 2026-02-21T09:09:57.5783011Z // begin inline asm 2026-02-21T09:09:57.5783291Z @%p115 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0, %r55; 2026-02-21T09:09:57.5783569Z // end inline asm 2026-02-21T09:09:57.5783715Z mov.b32 %r64, 256; 2026-02-21T09:09:57.5783857Z // begin inline asm 2026-02-21T09:09:57.5784108Z @%p115 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1, %r64; 2026-02-21T09:09:57.5784395Z // end inline asm 2026-02-21T09:09:57.5784537Z // begin inline asm 2026-02-21T09:09:57.5784797Z @%p115 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0, %r57; 2026-02-21T09:09:57.5785078Z // end inline asm 2026-02-21T09:09:57.5785224Z mov.b32 %r66, 4096; 2026-02-21T09:09:57.5785371Z // begin inline asm 2026-02-21T09:09:57.5785626Z @%p115 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1, %r66; 2026-02-21T09:09:57.5785905Z // end inline asm 2026-02-21T09:09:57.5786055Z mov.b64 %rd45, 16384; 2026-02-21T09:09:57.5786217Z // begin inline asm 2026-02-21T09:09:57.5786480Z @%p115 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd19 + 0 ], 0x0, %rd45; 2026-02-21T09:09:57.5786779Z // end inline 
asm 2026-02-21T09:09:57.5786916Z // begin inline asm 2026-02-21T09:09:57.5787183Z @%p115 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0, %r59; 2026-02-21T09:09:57.5787503Z // end inline asm 2026-02-21T09:09:57.5787652Z // begin inline asm 2026-02-21T09:09:57.5787914Z @%p115 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1, %r59; 2026-02-21T09:09:57.5788208Z // end inline asm 2026-02-21T09:09:57.5788350Z // begin inline asm 2026-02-21T09:09:57.5788592Z @%p115 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd19 + 0 ], 0xa; 2026-02-21T09:09:57.5788873Z // end inline asm 2026-02-21T09:09:57.5789003Z // begin inline asm 2026-02-21T09:09:57.5789253Z @%p115 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0; 2026-02-21T09:09:57.5789535Z // end inline asm 2026-02-21T09:09:57.5789667Z // begin inline asm 2026-02-21T09:09:57.5789903Z @%p115 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x2; 2026-02-21T09:09:57.5790166Z // end inline asm 2026-02-21T09:09:57.5790304Z // begin inline asm 2026-02-21T09:09:57.5790527Z @%p115 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0; 2026-02-21T09:09:57.5790791Z // end inline asm 2026-02-21T09:09:57.5790922Z // begin inline asm 2026-02-21T09:09:57.5791259Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd52 + 0 ], [ %rd19 + 0 ], 0x80; 2026-02-21T09:09:57.5791664Z // end inline asm 2026-02-21T09:09:57.5791799Z // begin inline asm 2026-02-21T09:09:57.5792012Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd52 + 0 ], 0x80; 2026-02-21T09:09:57.5792261Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:57.5792528Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:57.5792708Z // end inline asm 2026-02-21T09:09:57.5792849Z bar.sync 0; 2026-02-21T09:09:57.5792996Z cvta.global.u64 %rd128, %rd52; 2026-02-21T09:09:57.5793273Z .loc 1 28 98 
// co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:28:98 2026-02-21T09:09:57.5793579Z setp.gt.u32 %p39, %r3, 4095; 2026-02-21T09:09:57.5793741Z @%p39 bra $L__BB0_8; 2026-02-21T09:09:57.5793911Z // %bb.1: // %.lr.ph 2026-02-21T09:09:57.5794238Z .loc 1 0 98 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:0:98 2026-02-21T09:09:57.5794579Z ld.param.b64 %rd18, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:57.5794791Z and.b32 %r4, %r1, 32; 2026-02-21T09:09:57.5795059Z .loc 1 78 38 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:78:38 2026-02-21T09:09:57.5795358Z setp.eq.b32 %p56, %r4, 0; 2026-02-21T09:09:57.5795629Z .loc 1 54 38 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:54:38 2026-02-21T09:09:57.5795951Z shl.b32 %r261, %r1, 3; 2026-02-21T09:09:57.5796109Z and.b32 %r262, %r261, 24; 2026-02-21T09:09:57.5796385Z .loc 1 41 45 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:41:45 2026-02-21T09:09:57.5796671Z shr.u32 %r263, %r1, 2; 2026-02-21T09:09:57.5796836Z bfe.u32 %r5, %r1, 2, 5; 2026-02-21T09:09:57.5797000Z shr.u32 %r264, %r1, 5; 2026-02-21T09:09:57.5797154Z and.b32 %r265, %r1, 127; 2026-02-21T09:09:57.5797314Z shl.b32 %r6, %r265, 4; 2026-02-21T09:09:57.5797465Z add.s32 %r383, %r52, %r6; 2026-02-21T09:09:57.5797632Z add.s32 %r385, %r383, 2048; 2026-02-21T09:09:57.5797793Z add.s32 %r387, %r383, 4096; 2026-02-21T09:09:57.5797955Z add.s32 %r389, %r383, 6144; 2026-02-21T09:09:57.5798110Z add.s32 %r391, %r383, 8192; 2026-02-21T09:09:57.5798275Z add.s32 %r393, %r383, 10240; 2026-02-21T09:09:57.5798435Z add.s32 %r395, %r383, 12288; 2026-02-21T09:09:57.5798600Z add.s32 %r397, %r383, 14336; 2026-02-21T09:09:57.5798764Z add.s32 %r357, %r763, 64; 2026-02-21T09:09:57.5798919Z add.s32 %r170, %r383, 16384; 2026-02-21T09:09:57.5799080Z add.s32 %r172, %r383, 18432; 2026-02-21T09:09:57.5799234Z add.s32 %r174, %r383, 20480; 2026-02-21T09:09:57.5799393Z add.s32 %r176, %r383, 22528; 2026-02-21T09:09:57.5799546Z add.s32 %r178, 
%r383, 24576; 2026-02-21T09:09:57.5799732Z add.s32 %r180, %r383, 26624; 2026-02-21T09:09:57.5799886Z add.s32 %r182, %r383, 28672; 2026-02-21T09:09:57.5800042Z add.s32 %r184, %r383, 30720; 2026-02-21T09:09:57.5800200Z shl.b32 %r16, %r265, 6; 2026-02-21T09:09:57.5800355Z mad.lo.s32 %r267, %r265, 48, %r383; 2026-02-21T09:09:57.5800531Z and.b32 %r268, %r1, 31; 2026-02-21T09:09:57.5800678Z shr.u32 %r269, %r1, 1; 2026-02-21T09:09:57.5800832Z and.b32 %r270, %r269, 32; 2026-02-21T09:09:57.5800982Z or.b32 %r17, %r270, %r268; 2026-02-21T09:09:57.5801144Z add.s32 %r166, %r52, 36864; 2026-02-21T09:09:57.5801299Z add.s32 %r271, %r166, %r17; 2026-02-21T09:09:57.5801460Z xor.b32 %r18, %r17, 16; 2026-02-21T09:09:57.5801657Z add.s32 %r272, %r166, %r18; 2026-02-21T09:09:57.5801831Z shl.b32 %r273, %r268, 7; 2026-02-21T09:09:57.5801991Z shl.b32 %r274, %r1, 4; 2026-02-21T09:09:57.5802139Z and.b32 %r275, %r274, 112; 2026-02-21T09:09:57.5802296Z shr.u32 %r276, %r1, 3; 2026-02-21T09:09:57.5802441Z and.b32 %r277, %r276, 12; 2026-02-21T09:09:57.5802597Z or.b32 %r278, %r273, %r275; 2026-02-21T09:09:57.5802749Z or.b32 %r279, %r278, %r277; 2026-02-21T09:09:57.5802908Z add.s32 %r280, %r52, 32768; 2026-02-21T09:09:57.5803060Z add.s32 %r19, %r280, %r279; 2026-02-21T09:09:57.5803219Z xor.b32 %r281, %r279, 16; 2026-02-21T09:09:57.5803366Z add.s32 %r20, %r280, %r281; 2026-02-21T09:09:57.5803526Z xor.b32 %r282, %r279, 32; 2026-02-21T09:09:57.5803682Z add.s32 %r21, %r280, %r282; 2026-02-21T09:09:57.5803832Z xor.b32 %r283, %r279, 48; 2026-02-21T09:09:57.5803987Z add.s32 %r22, %r280, %r283; 2026-02-21T09:09:57.5804171Z xor.b32 %r284, %r279, 64; 2026-02-21T09:09:57.5804327Z add.s32 %r23, %r280, %r284; 2026-02-21T09:09:57.5804481Z xor.b32 %r285, %r279, 80; 2026-02-21T09:09:57.5804636Z add.s32 %r24, %r280, %r285; 2026-02-21T09:09:57.5804787Z xor.b32 %r286, %r279, 96; 2026-02-21T09:09:57.5804941Z add.s32 %r25, %r280, %r286; 2026-02-21T09:09:57.5805091Z xor.b32 %r287, %r279, 112; 2026-02-21T09:09:57.5805249Z 
add.s32 %r26, %r280, %r287; 2026-02-21T09:09:57.5805412Z bfe.u32 %r288, %r280, 4, 14; 2026-02-21T09:09:57.5805601Z cvt.u64.u32 %rd75, %r288; 2026-02-21T09:09:57.5805775Z or.b64 %rd85, %rd75, 4611686293322072064; 2026-02-21T09:09:57.5805952Z add.s32 %r289, %r52, 32800; 2026-02-21T09:09:57.5806109Z bfe.u32 %r290, %r289, 4, 14; 2026-02-21T09:09:57.5806260Z cvt.u64.u32 %rd76, %r290; 2026-02-21T09:09:57.5806427Z or.b64 %rd86, %rd76, 4611686293322072064; 2026-02-21T09:09:57.5806600Z add.s32 %r291, %r52, 32832; 2026-02-21T09:09:57.5806756Z bfe.u32 %r292, %r291, 4, 14; 2026-02-21T09:09:57.5806915Z cvt.u64.u32 %rd77, %r292; 2026-02-21T09:09:57.5807074Z or.b64 %rd87, %rd77, 4611686293322072064; 2026-02-21T09:09:57.5807280Z add.s32 %r293, %r52, 32864; 2026-02-21T09:09:57.5807434Z bfe.u32 %r294, %r293, 4, 14; 2026-02-21T09:09:57.5807595Z cvt.u64.u32 %rd78, %r294; 2026-02-21T09:09:57.5807753Z or.b64 %rd88, %rd78, 4611686293322072064; 2026-02-21T09:09:57.5807932Z or.b32 %r27, %r262, 64; 2026-02-21T09:09:57.5808083Z and.b32 %r295, %r261, 48; 2026-02-21T09:09:57.5808244Z or.b32 %r296, %r16, %r295; 2026-02-21T09:09:57.5808396Z xor.b32 %r297, %r296, 16; 2026-02-21T09:09:57.5808551Z xor.b32 %r298, %r296, 32; 2026-02-21T09:09:57.5808708Z xor.b32 %r299, %r296, 48; 2026-02-21T09:09:57.5808975Z .loc 1 28 98 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:28:98 2026-02-21T09:09:57.5809264Z cvt.u64.u32 %rd79, %r262; 2026-02-21T09:09:57.5809524Z .loc 1 35 33 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:35:33 2026-02-21T09:09:57.5809817Z shr.u32 %r32, %r3, 4; 2026-02-21T09:09:57.5809967Z and.b32 %r300, %r32, 240; 2026-02-21T09:09:57.5810235Z .loc 1 37 64 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:37:64 2026-02-21T09:09:57.5810530Z and.b32 %r301, %r3, 15; 2026-02-21T09:09:57.5810784Z .loc 1 37 30 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:37:30 2026-02-21T09:09:57.5811101Z or.b32 %r302, %r300, %r301; 2026-02-21T09:09:57.5811365Z .loc 1 
39 27 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:39:27 2026-02-21T09:09:57.5811691Z shl.b32 %r663, %r302, 5; 2026-02-21T09:09:57.5811951Z .loc 1 40 27 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:40:27 2026-02-21T09:09:57.5812246Z shl.b32 %r303, %r3, 4; 2026-02-21T09:09:57.5812404Z and.b32 %r664, %r303, 3840; 2026-02-21T09:09:57.5812667Z .loc 1 41 32 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:41:32 2026-02-21T09:09:57.5812962Z or.b32 %r304, %r664, %r5; 2026-02-21T09:09:57.5813114Z or.b32 %r305, %r263, %r664; 2026-02-21T09:09:57.5813386Z .loc 1 55 53 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:53 2026-02-21T09:09:57.5813675Z shl.b32 %r306, %r304, 10; 2026-02-21T09:09:57.5813832Z shl.b32 %r307, %r305, 10; 2026-02-21T09:09:57.5813985Z or.b32 %r308, %r307, 229376; 2026-02-21T09:09:57.5814147Z $L__tmp0: 2026-02-21T09:09:57.5814452Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5814808Z shfl.sync.idx.b32 %r35, %r264, 0, 31, -1; 2026-02-21T09:09:57.5814997Z shl.b32 %r309, %r35, 21; 2026-02-21T09:09:57.5815150Z and.b32 %r310, %r309, 6291456; 2026-02-21T09:09:57.5815321Z add.s32 %r662, %r310, %r763; 2026-02-21T09:09:57.5815477Z mov.pred %p63, -1; 2026-02-21T09:09:57.5815630Z // begin inline asm 2026-02-21T09:09:57.5816008Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r662 + 0], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:09:57.5816367Z // end inline asm 2026-02-21T09:09:57.5816511Z // begin inline asm 2026-02-21T09:09:57.5816834Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r662 + 16], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:09:57.5817193Z // end inline asm 2026-02-21T09:09:57.5817355Z // begin inline asm 2026-02-21T09:09:57.5817680Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r662 + 32], 
{%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:09:57.5818037Z // end inline asm 2026-02-21T09:09:57.5818171Z // begin inline asm 2026-02-21T09:09:57.5818494Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r662 + 48], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:09:57.5818843Z // end inline asm 2026-02-21T09:09:57.5818985Z // begin inline asm 2026-02-21T09:09:57.5819173Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:57.5819345Z // end inline asm 2026-02-21T09:09:57.5819477Z bar.sync 0; 2026-02-21T09:09:57.5819614Z $L__tmp1: 2026-02-21T09:09:57.5819868Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5820165Z add.s32 %r765, %r52, 37888; 2026-02-21T09:09:57.5820329Z // begin inline asm 2026-02-21T09:09:57.5820495Z @%p115 mbarrier.init.shared::cta.b64 [%r765], 1; 2026-02-21T09:09:57.5820692Z // end inline asm 2026-02-21T09:09:57.5820821Z bar.sync 0; 2026-02-21T09:09:57.5820962Z add.s32 %r146, %r52, 37896; 2026-02-21T09:09:57.5821115Z // begin inline asm 2026-02-21T09:09:57.5821292Z @%p115 mbarrier.init.shared::cta.b64 [%r146], 1; 2026-02-21T09:09:57.5821493Z // end inline asm 2026-02-21T09:09:57.5821687Z add.s32 %r399, %r52, 37904; 2026-02-21T09:09:57.5821855Z // begin inline asm 2026-02-21T09:09:57.5822021Z @%p115 mbarrier.init.shared::cta.b64 [%r399], 1; 2026-02-21T09:09:57.5822218Z // end inline asm 2026-02-21T09:09:57.5822350Z bar.sync 0; 2026-02-21T09:09:57.5822488Z add.s32 %r148, %r52, 37912; 2026-02-21T09:09:57.5822639Z // begin inline asm 2026-02-21T09:09:57.5822811Z @%p115 mbarrier.init.shared::cta.b64 [%r148], 1; 2026-02-21T09:09:57.5823001Z // end inline asm 2026-02-21T09:09:57.5823291Z .loc 1 55 60 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:60 2026-02-21T09:09:57.5823591Z or.b32 %r311, %r306, %r262; 2026-02-21T09:09:57.5823748Z or.b32 %r312, %r308, %r262; 
2026-02-21T09:09:57.5824018Z .loc 1 55 32 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:32 2026-02-21T09:09:57.5824316Z mad.wide.u32 %rd57, %r311, 2, %rd18; 2026-02-21T09:09:57.5824499Z cvt.u64.u32 %rd7, %r306; 2026-02-21T09:09:57.5824653Z or.b64 %rd80, %rd7, %rd79; 2026-02-21T09:09:57.5824817Z shl.b64 %rd81, %rd80, 1; 2026-02-21T09:09:57.5824976Z add.s64 %rd8, %rd18, %rd81; 2026-02-21T09:09:57.5825137Z add.s64 %rd58, %rd8, 65536; 2026-02-21T09:09:57.5825307Z add.s64 %rd59, %rd8, 131072; 2026-02-21T09:09:57.5825471Z add.s64 %rd60, %rd8, 196608; 2026-02-21T09:09:57.5825642Z add.s64 %rd61, %rd8, 262144; 2026-02-21T09:09:57.5825802Z add.s64 %rd62, %rd8, 327680; 2026-02-21T09:09:57.5825966Z add.s64 %rd63, %rd8, 393216; 2026-02-21T09:09:57.5826132Z mad.wide.u32 %rd64, %r312, 2, %rd18; 2026-02-21T09:09:57.5826434Z .loc 1 55 80 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:80 2026-02-21T09:09:57.5826732Z // begin inline asm 2026-02-21T09:09:57.5826942Z cp.async.cg.shared.global [ %r383 + 0 ], [ %rd57 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5827185Z // end inline asm 2026-02-21T09:09:57.5827325Z // begin inline asm 2026-02-21T09:09:57.5827534Z cp.async.cg.shared.global [ %r385 + 0 ], [ %rd58 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5827763Z // end inline asm 2026-02-21T09:09:57.5827947Z // begin inline asm 2026-02-21T09:09:57.5828148Z cp.async.cg.shared.global [ %r387 + 0 ], [ %rd59 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5828382Z // end inline asm 2026-02-21T09:09:57.5828529Z // begin inline asm 2026-02-21T09:09:57.5828726Z cp.async.cg.shared.global [ %r389 + 0 ], [ %rd60 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5828959Z // end inline asm 2026-02-21T09:09:57.5829096Z // begin inline asm 2026-02-21T09:09:57.5829301Z cp.async.cg.shared.global [ %r391 + 0 ], [ %rd61 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5829568Z // end inline asm 2026-02-21T09:09:57.5829721Z // begin inline asm 2026-02-21T09:09:57.5829926Z cp.async.cg.shared.global [ %r393 + 0 
], [ %rd62 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5830161Z // end inline asm 2026-02-21T09:09:57.5830304Z // begin inline asm 2026-02-21T09:09:57.5830515Z cp.async.cg.shared.global [ %r395 + 0 ], [ %rd63 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5830754Z // end inline asm 2026-02-21T09:09:57.5830899Z // begin inline asm 2026-02-21T09:09:57.5831111Z cp.async.cg.shared.global [ %r397 + 0 ], [ %rd64 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5831372Z // end inline asm 2026-02-21T09:09:57.5831565Z cp.async.commit_group; 2026-02-21T09:09:57.5831855Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5832167Z bar.sync 0; 2026-02-21T09:09:57.5832302Z // begin inline asm 2026-02-21T09:09:57.5832511Z @%p115 mbarrier.arrive.expect_tx.shared.b64 _, [%r399], 512; 2026-02-21T09:09:57.5832746Z // end inline asm 2026-02-21T09:09:57.5833005Z .loc 1 61 33 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:61:33 2026-02-21T09:09:57.5833308Z bar.sync 0; 2026-02-21T09:09:57.5833458Z elect.sync %r313|%p58, -1; 2026-02-21T09:09:57.5833650Z and.pred %p49, %p1, %p58; 2026-02-21T09:09:57.5833807Z // begin inline asm 2026-02-21T09:09:57.5834143Z @%p49 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r166], [%rd102, {%r663, %r62}], [%r399]; 2026-02-21T09:09:57.5834504Z // end inline asm 2026-02-21T09:09:57.5834756Z .loc 1 55 32 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:32 2026-02-21T09:09:57.5835048Z add.s64 %rd66, %rd8, 64; 2026-02-21T09:09:57.5835204Z or.b32 %r314, %r311, 32; 2026-02-21T09:09:57.5835367Z mad.wide.u32 %rd82, %r314, 2, %rd18; 2026-02-21T09:09:57.5835586Z add.s64 %rd67, %rd82, 65536; 2026-02-21T09:09:57.5835760Z add.s64 %rd68, %rd82, 131072; 2026-02-21T09:09:57.5835921Z add.s64 %rd69, %rd82, 196608; 2026-02-21T09:09:57.5836088Z add.s64 %rd70, %rd82, 262144; 2026-02-21T09:09:57.5836251Z add.s64 %rd71, %rd82, 327680; 2026-02-21T09:09:57.5836407Z add.s64 %rd72, %rd82, 393216; 
2026-02-21T09:09:57.5836569Z cvt.u64.u32 %rd9, %r308; 2026-02-21T09:09:57.5836721Z or.b64 %rd83, %rd9, %rd79; 2026-02-21T09:09:57.5836883Z shl.b64 %rd84, %rd83, 1; 2026-02-21T09:09:57.5837033Z add.s64 %rd10, %rd18, %rd84; 2026-02-21T09:09:57.5837196Z add.s64 %rd73, %rd10, 64; 2026-02-21T09:09:57.5837458Z .loc 1 55 80 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:80 2026-02-21T09:09:57.5837743Z // begin inline asm 2026-02-21T09:09:57.5837948Z cp.async.cg.shared.global [ %r170 + 0 ], [ %rd66 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5838167Z // end inline asm 2026-02-21T09:09:57.5838307Z // begin inline asm 2026-02-21T09:09:57.5838501Z cp.async.cg.shared.global [ %r172 + 0 ], [ %rd67 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5838727Z // end inline asm 2026-02-21T09:09:57.5838861Z // begin inline asm 2026-02-21T09:09:57.5839061Z cp.async.cg.shared.global [ %r174 + 0 ], [ %rd68 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5839276Z // end inline asm 2026-02-21T09:09:57.5839419Z // begin inline asm 2026-02-21T09:09:57.5839614Z cp.async.cg.shared.global [ %r176 + 0 ], [ %rd69 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5839830Z // end inline asm 2026-02-21T09:09:57.5839970Z // begin inline asm 2026-02-21T09:09:57.5840193Z cp.async.cg.shared.global [ %r178 + 0 ], [ %rd70 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5840419Z // end inline asm 2026-02-21T09:09:57.5840553Z // begin inline asm 2026-02-21T09:09:57.5840750Z cp.async.cg.shared.global [ %r180 + 0 ], [ %rd71 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5840969Z // end inline asm 2026-02-21T09:09:57.5841110Z // begin inline asm 2026-02-21T09:09:57.5841302Z cp.async.cg.shared.global [ %r182 + 0 ], [ %rd72 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5841585Z // end inline asm 2026-02-21T09:09:57.5841732Z // begin inline asm 2026-02-21T09:09:57.5841936Z cp.async.cg.shared.global [ %r184 + 0 ], [ %rd73 + 0 ], 0x10, %r150; 2026-02-21T09:09:57.5842160Z // end inline asm 2026-02-21T09:09:57.5842300Z cp.async.commit_group; 
2026-02-21T09:09:57.5842575Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5842858Z bar.sync 0; 2026-02-21T09:09:57.5842997Z // begin inline asm 2026-02-21T09:09:57.5843188Z @%p115 mbarrier.arrive.expect_tx.shared.b64 _, [%r148], 512; 2026-02-21T09:09:57.5843411Z // end inline asm 2026-02-21T09:09:57.5843688Z .loc 1 61 33 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:61:33 2026-02-21T09:09:57.5843973Z bar.sync 0; 2026-02-21T09:09:57.5844121Z elect.sync %r315|%p59, -1; 2026-02-21T09:09:57.5844287Z and.pred %p51, %p1, %p59; 2026-02-21T09:09:57.5844452Z add.s32 %r187, %r52, 37376; 2026-02-21T09:09:57.5844607Z // begin inline asm 2026-02-21T09:09:57.5844941Z @%p51 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r187], [%rd102, {%r663, %r150}], [%r148]; 2026-02-21T09:09:57.5845297Z // end inline asm 2026-02-21T09:09:57.5845548Z .loc 1 55 80 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:80 2026-02-21T09:09:57.5845845Z cp.async.wait_group 1; 2026-02-21T09:09:57.5845994Z bar.sync 0; 2026-02-21T09:09:57.5846167Z ld.shared.v4.b32 {%r316, %r317, %r318, %r319}, [%r267]; 2026-02-21T09:09:57.5846375Z mov.b32 {%rs1, %rs2}, %r319; 2026-02-21T09:09:57.5846541Z mov.b32 {%rs3, %rs4}, %r318; 2026-02-21T09:09:57.5846695Z mov.b32 {%rs5, %rs6}, %r317; 2026-02-21T09:09:57.5846855Z mov.b32 {%rs7, %rs8}, %r316; 2026-02-21T09:09:57.5847051Z ld.shared.v4.b32 {%r320, %r321, %r322, %r323}, [%r267+16]; 2026-02-21T09:09:57.5847257Z mov.b32 {%rs9, %rs10}, %r323; 2026-02-21T09:09:57.5847457Z mov.b32 {%rs11, %rs12}, %r322; 2026-02-21T09:09:57.5847624Z mov.b32 {%rs13, %rs14}, %r321; 2026-02-21T09:09:57.5847788Z mov.b32 {%rs15, %rs16}, %r320; 2026-02-21T09:09:57.5847977Z ld.shared.v4.b32 {%r324, %r325, %r326, %r327}, [%r267+32]; 2026-02-21T09:09:57.5848184Z mov.b32 {%rs17, %rs18}, %r327; 2026-02-21T09:09:57.5848338Z mov.b32 {%rs19, %rs20}, %r326; 2026-02-21T09:09:57.5848497Z mov.b32 {%rs21, 
%rs22}, %r325; 2026-02-21T09:09:57.5848655Z mov.b32 {%rs23, %rs24}, %r324; 2026-02-21T09:09:57.5848840Z ld.shared.v4.b32 {%r328, %r329, %r330, %r331}, [%r267+48]; 2026-02-21T09:09:57.5849049Z mov.b32 {%rs25, %rs26}, %r331; 2026-02-21T09:09:57.5849203Z mov.b32 {%rs27, %rs28}, %r330; 2026-02-21T09:09:57.5849365Z mov.b32 {%rs29, %rs30}, %r329; 2026-02-21T09:09:57.5849520Z mov.b32 {%rs31, %rs32}, %r328; 2026-02-21T09:09:57.5849719Z ld.shared.v4.b32 {%r332, %r333, %r334, %r335}, [%r267+8192]; 2026-02-21T09:09:57.5849928Z mov.b32 {%rs33, %rs34}, %r335; 2026-02-21T09:09:57.5850090Z mov.b32 {%rs35, %rs36}, %r334; 2026-02-21T09:09:57.5850257Z mov.b32 {%rs37, %rs38}, %r333; 2026-02-21T09:09:57.5850410Z mov.b32 {%rs39, %rs40}, %r332; 2026-02-21T09:09:57.5850604Z ld.shared.v4.b32 {%r336, %r337, %r338, %r339}, [%r267+8208]; 2026-02-21T09:09:57.5850806Z mov.b32 {%rs41, %rs42}, %r339; 2026-02-21T09:09:57.5850969Z mov.b32 {%rs43, %rs44}, %r338; 2026-02-21T09:09:57.5851124Z mov.b32 {%rs45, %rs46}, %r337; 2026-02-21T09:09:57.5851288Z mov.b32 {%rs47, %rs48}, %r336; 2026-02-21T09:09:57.5851481Z ld.shared.v4.b32 {%r340, %r341, %r342, %r343}, [%r267+8224]; 2026-02-21T09:09:57.5851760Z mov.b32 {%rs49, %rs50}, %r343; 2026-02-21T09:09:57.5851922Z mov.b32 {%rs51, %rs52}, %r342; 2026-02-21T09:09:57.5852078Z mov.b32 {%rs53, %rs54}, %r341; 2026-02-21T09:09:57.5852242Z mov.b32 {%rs55, %rs56}, %r340; 2026-02-21T09:09:57.5852432Z ld.shared.v4.b32 {%r344, %r345, %r346, %r347}, [%r267+8240]; 2026-02-21T09:09:57.5852643Z mov.b32 {%rs57, %rs58}, %r347; 2026-02-21T09:09:57.5852802Z mov.b32 {%rs59, %rs60}, %r346; 2026-02-21T09:09:57.5852971Z mov.b32 {%rs61, %rs62}, %r345; 2026-02-21T09:09:57.5853160Z mov.b32 {%rs63, %rs64}, %r344; 2026-02-21T09:09:57.5853440Z .loc 1 59 32 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:59:32 2026-02-21T09:09:57.5853735Z cvt.f32.bf16 %r192, %rs7; 2026-02-21T09:09:57.5853894Z cvt.f32.bf16 %r193, %rs8; 2026-02-21T09:09:57.5854053Z cvt.f32.bf16 %r194, %rs5; 
2026-02-21T09:09:57.5854204Z cvt.f32.bf16 %r195, %rs6; 2026-02-21T09:09:57.5854360Z cvt.f32.bf16 %r196, %rs3; 2026-02-21T09:09:57.5854509Z cvt.f32.bf16 %r197, %rs4; 2026-02-21T09:09:57.5854666Z cvt.f32.bf16 %r198, %rs1; 2026-02-21T09:09:57.5854842Z cvt.f32.bf16 %r199, %rs2; 2026-02-21T09:09:57.5855004Z cvt.f32.bf16 %r200, %rs15; 2026-02-21T09:09:57.5855158Z cvt.f32.bf16 %r201, %rs16; 2026-02-21T09:09:57.5855317Z cvt.f32.bf16 %r202, %rs13; 2026-02-21T09:09:57.5855475Z cvt.f32.bf16 %r203, %rs14; 2026-02-21T09:09:57.5855623Z cvt.f32.bf16 %r204, %rs11; 2026-02-21T09:09:57.5855780Z cvt.f32.bf16 %r205, %rs12; 2026-02-21T09:09:57.5855929Z cvt.f32.bf16 %r206, %rs9; 2026-02-21T09:09:57.5856082Z cvt.f32.bf16 %r207, %rs10; 2026-02-21T09:09:57.5856228Z cvt.f32.bf16 %r209, %rs23; 2026-02-21T09:09:57.5856383Z cvt.f32.bf16 %r210, %rs24; 2026-02-21T09:09:57.5856530Z cvt.f32.bf16 %r211, %rs21; 2026-02-21T09:09:57.5856685Z cvt.f32.bf16 %r212, %rs22; 2026-02-21T09:09:57.5856834Z cvt.f32.bf16 %r213, %rs19; 2026-02-21T09:09:57.5856989Z cvt.f32.bf16 %r214, %rs20; 2026-02-21T09:09:57.5857145Z cvt.f32.bf16 %r215, %rs17; 2026-02-21T09:09:57.5857294Z cvt.f32.bf16 %r216, %rs18; 2026-02-21T09:09:57.5857449Z cvt.f32.bf16 %r217, %rs31; 2026-02-21T09:09:57.5857597Z cvt.f32.bf16 %r218, %rs32; 2026-02-21T09:09:57.5857751Z cvt.f32.bf16 %r219, %rs29; 2026-02-21T09:09:57.5857897Z cvt.f32.bf16 %r220, %rs30; 2026-02-21T09:09:57.5858053Z cvt.f32.bf16 %r221, %rs27; 2026-02-21T09:09:57.5858199Z cvt.f32.bf16 %r222, %rs28; 2026-02-21T09:09:57.5858355Z cvt.f32.bf16 %r223, %rs25; 2026-02-21T09:09:57.5858530Z cvt.f32.bf16 %r224, %rs26; 2026-02-21T09:09:57.5858694Z cvt.f32.bf16 %r226, %rs39; 2026-02-21T09:09:57.5858853Z cvt.f32.bf16 %r227, %rs40; 2026-02-21T09:09:57.5859002Z cvt.f32.bf16 %r228, %rs37; 2026-02-21T09:09:57.5859158Z cvt.f32.bf16 %r229, %rs38; 2026-02-21T09:09:57.5859306Z cvt.f32.bf16 %r230, %rs35; 2026-02-21T09:09:57.5859464Z cvt.f32.bf16 %r231, %rs36; 2026-02-21T09:09:57.5859617Z cvt.f32.bf16 
%r232, %rs33; 2026-02-21T09:09:57.5859780Z cvt.f32.bf16 %r233, %rs34; 2026-02-21T09:09:57.5859931Z cvt.f32.bf16 %r234, %rs47; 2026-02-21T09:09:57.5860089Z cvt.f32.bf16 %r235, %rs48; 2026-02-21T09:09:57.5860239Z cvt.f32.bf16 %r236, %rs45; 2026-02-21T09:09:57.5860407Z cvt.f32.bf16 %r237, %rs46; 2026-02-21T09:09:57.5860570Z cvt.f32.bf16 %r238, %rs43; 2026-02-21T09:09:57.5860720Z cvt.f32.bf16 %r239, %rs44; 2026-02-21T09:09:57.5860875Z cvt.f32.bf16 %r240, %rs41; 2026-02-21T09:09:57.5861023Z cvt.f32.bf16 %r241, %rs42; 2026-02-21T09:09:57.5861180Z cvt.f32.bf16 %r243, %rs55; 2026-02-21T09:09:57.5861330Z cvt.f32.bf16 %r244, %rs56; 2026-02-21T09:09:57.5861488Z cvt.f32.bf16 %r245, %rs53; 2026-02-21T09:09:57.5861669Z cvt.f32.bf16 %r246, %rs54; 2026-02-21T09:09:57.5861825Z cvt.f32.bf16 %r247, %rs51; 2026-02-21T09:09:57.5861976Z cvt.f32.bf16 %r248, %rs52; 2026-02-21T09:09:57.5862134Z cvt.f32.bf16 %r249, %rs49; 2026-02-21T09:09:57.5862288Z cvt.f32.bf16 %r250, %rs50; 2026-02-21T09:09:57.5862437Z cvt.f32.bf16 %r251, %rs63; 2026-02-21T09:09:57.5862593Z cvt.f32.bf16 %r252, %rs64; 2026-02-21T09:09:57.5862774Z cvt.f32.bf16 %r253, %rs61; 2026-02-21T09:09:57.5862931Z cvt.f32.bf16 %r254, %rs62; 2026-02-21T09:09:57.5863083Z cvt.f32.bf16 %r255, %rs59; 2026-02-21T09:09:57.5863241Z cvt.f32.bf16 %r256, %rs60; 2026-02-21T09:09:57.5863390Z cvt.f32.bf16 %r257, %rs57; 2026-02-21T09:09:57.5863547Z cvt.f32.bf16 %r258, %rs58; 2026-02-21T09:09:57.5863692Z $L__tmp2: 2026-02-21T09:09:57.5863993Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5864360Z add.s32 %r191, %r310, %r357; 2026-02-21T09:09:57.5864519Z // begin inline asm 2026-02-21T09:09:57.5864893Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r191 + 0], {%r192, %r193, %r194, %r195, %r196, %r197, %r198, %r199, %r200, %r201, %r202, %r203, %r204, %r205, %r206, %r207}; 2026-02-21T09:09:57.5865282Z // end inline asm 2026-02-21T09:09:57.5865432Z // begin inline asm 
2026-02-21T09:09:57.5865799Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r191 + 16], {%r209, %r210, %r211, %r212, %r213, %r214, %r215, %r216, %r217, %r218, %r219, %r220, %r221, %r222, %r223, %r224}; 2026-02-21T09:09:57.5866215Z // end inline asm 2026-02-21T09:09:57.5866365Z // begin inline asm 2026-02-21T09:09:57.5866712Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r191 + 32], {%r226, %r227, %r228, %r229, %r230, %r231, %r232, %r233, %r234, %r235, %r236, %r237, %r238, %r239, %r240, %r241}; 2026-02-21T09:09:57.5867112Z // end inline asm 2026-02-21T09:09:57.5867251Z // begin inline asm 2026-02-21T09:09:57.5867619Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r191 + 48], {%r243, %r244, %r245, %r246, %r247, %r248, %r249, %r250, %r251, %r252, %r253, %r254, %r255, %r256, %r257, %r258}; 2026-02-21T09:09:57.5868022Z // end inline asm 2026-02-21T09:09:57.5868168Z // begin inline asm 2026-02-21T09:09:57.5868339Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:57.5868514Z // end inline asm 2026-02-21T09:09:57.5868664Z bar.sync 0; 2026-02-21T09:09:57.5868804Z $L__tmp3: 2026-02-21T09:09:57.5869078Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5869388Z // begin inline asm 2026-02-21T09:09:57.5869539Z 2026-02-21T09:09:57.5869675Z { 2026-02-21T09:09:57.5869815Z .reg .pred complete; 2026-02-21T09:09:57.5869987Z waitLoop: 2026-02-21T09:09:57.5870191Z mbarrier.try_wait.parity.shared.b64 complete, [%r399], %r62; 2026-02-21T09:09:57.5870446Z @!complete bra.uni waitLoop; 2026-02-21T09:09:57.5870641Z } 2026-02-21T09:09:57.5870721Z 2026-02-21T09:09:57.5870780Z // end inline asm 2026-02-21T09:09:57.5871036Z .loc 1 61 33 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:61:33 2026-02-21T09:09:57.5871352Z ld.shared.b8 %rs65, [%r271]; 2026-02-21T09:09:57.5871566Z ld.shared.b8 %rs66, [%r271+64]; 2026-02-21T09:09:57.5871751Z ld.shared.b8 %rs67, [%r271+256]; 2026-02-21T09:09:57.5871938Z ld.shared.b8 %rs68, [%r271+320]; 
2026-02-21T09:09:57.5872111Z ld.shared.b8 %rs69, [%r272+128]; 2026-02-21T09:09:57.5872290Z ld.shared.b8 %rs70, [%r272+192]; 2026-02-21T09:09:57.5872458Z ld.shared.b8 %rs71, [%r272+384]; 2026-02-21T09:09:57.5872632Z ld.shared.b8 %rs72, [%r272+448]; 2026-02-21T09:09:57.5872912Z .loc 1 64 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:64:28 2026-02-21T09:09:57.5873216Z shl.b16 %rs73, %rs65, 4; 2026-02-21T09:09:57.5873386Z shl.b16 %rs74, %rs66, 4; 2026-02-21T09:09:57.5873546Z shl.b16 %rs75, %rs69, 4; 2026-02-21T09:09:57.5873714Z shl.b16 %rs76, %rs70, 4; 2026-02-21T09:09:57.5873868Z shl.b16 %rs77, %rs67, 4; 2026-02-21T09:09:57.5874029Z shl.b16 %rs78, %rs68, 4; 2026-02-21T09:09:57.5874181Z shl.b16 %rs79, %rs71, 4; 2026-02-21T09:09:57.5874342Z shl.b16 %rs80, %rs72, 4; 2026-02-21T09:09:57.5874605Z .loc 1 79 58 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:79:58 2026-02-21T09:09:57.5874918Z selp.b16 %rs81, %rs73, %rs65, %p56; 2026-02-21T09:09:57.5875107Z cvt.s16.s8 %rs82, %rs81; 2026-02-21T09:09:57.5875291Z shr.s16 %rs83, %rs82, 4; 2026-02-21T09:09:57.5875467Z selp.b16 %rs84, %rs74, %rs66, %p56; 2026-02-21T09:09:57.5875657Z cvt.s16.s8 %rs85, %rs84; 2026-02-21T09:09:57.5875818Z shr.s16 %rs86, %rs85, 4; 2026-02-21T09:09:57.5875976Z selp.b16 %rs87, %rs75, %rs69, %p56; 2026-02-21T09:09:57.5876156Z cvt.s16.s8 %rs88, %rs87; 2026-02-21T09:09:57.5876305Z shr.s16 %rs89, %rs88, 4; 2026-02-21T09:09:57.5876470Z selp.b16 %rs90, %rs76, %rs70, %p56; 2026-02-21T09:09:57.5876664Z cvt.s16.s8 %rs91, %rs90; 2026-02-21T09:09:57.5876817Z shr.s16 %rs92, %rs91, 4; 2026-02-21T09:09:57.5876974Z selp.b16 %rs93, %rs77, %rs67, %p56; 2026-02-21T09:09:57.5877139Z cvt.s16.s8 %rs94, %rs93; 2026-02-21T09:09:57.5877295Z shr.s16 %rs95, %rs94, 4; 2026-02-21T09:09:57.5877446Z selp.b16 %rs96, %rs78, %rs68, %p56; 2026-02-21T09:09:57.5877620Z cvt.s16.s8 %rs97, %rs96; 2026-02-21T09:09:57.5877767Z shr.s16 %rs98, %rs97, 4; 2026-02-21T09:09:57.5877926Z selp.b16 %rs99, %rs79, %rs71, %p56; 
2026-02-21T09:09:57.5878094Z cvt.s16.s8 %rs100, %rs99; 2026-02-21T09:09:57.5878258Z shr.s16 %rs101, %rs100, 4; 2026-02-21T09:09:57.5878470Z selp.b16 %rs102, %rs80, %rs72, %p56; 2026-02-21T09:09:57.5878651Z cvt.s16.s8 %rs103, %rs102; 2026-02-21T09:09:57.5878811Z shr.s16 %rs104, %rs103, 4; 2026-02-21T09:09:57.5879075Z .loc 1 84 32 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:84:32 2026-02-21T09:09:57.5879378Z cvt.rn.f32.s16 %r348, %rs83; 2026-02-21T09:09:57.5879543Z cvt.rn.f32.s16 %r349, %rs86; 2026-02-21T09:09:57.5891473Z cvt.rn.f32.s16 %r350, %rs89; 2026-02-21T09:09:57.5891727Z cvt.rn.f32.s16 %r351, %rs92; 2026-02-21T09:09:57.5891906Z cvt.rn.f32.s16 %r352, %rs95; 2026-02-21T09:09:57.5892077Z cvt.rn.f32.s16 %r353, %rs98; 2026-02-21T09:09:57.5892246Z cvt.rn.f32.s16 %r354, %rs101; 2026-02-21T09:09:57.5892422Z cvt.rn.f32.s16 %r355, %rs104; 2026-02-21T09:09:57.5892588Z st.shared.b32 [%r19], %r348; 2026-02-21T09:09:57.5892763Z st.shared.b32 [%r20], %r349; 2026-02-21T09:09:57.5892930Z st.shared.b32 [%r21], %r350; 2026-02-21T09:09:57.5893099Z st.shared.b32 [%r22], %r351; 2026-02-21T09:09:57.5893262Z st.shared.b32 [%r23], %r352; 2026-02-21T09:09:57.5893433Z st.shared.b32 [%r24], %r353; 2026-02-21T09:09:57.5893599Z st.shared.b32 [%r25], %r354; 2026-02-21T09:09:57.5893758Z st.shared.b32 [%r26], %r355; 2026-02-21T09:09:57.5893926Z $L__tmp4: 2026-02-21T09:09:57.5894340Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5894709Z // begin inline asm 2026-02-21T09:09:57.5894884Z fence.proxy.async.shared::cta; 2026-02-21T09:09:57.5895066Z // end inline asm 2026-02-21T09:09:57.5895206Z bar.sync 0; 2026-02-21T09:09:57.5895364Z setp.ne.b32 %p60, %r35, 0; 2026-02-21T09:09:57.5895526Z @%p60 bra $L__BB0_3; 2026-02-21T09:09:57.5895682Z // %bb.2: 2026-02-21T09:09:57.5895833Z elect.sync %r380|%p62, -1; 2026-02-21T09:09:57.5895998Z mov.b32 %r358, 134744336; 2026-02-21T09:09:57.5896169Z mov.pred %p61, 0; 
2026-02-21T09:09:57.5896315Z // begin inline asm 2026-02-21T09:09:57.5896571Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 0 ], [ %r357 + 0 ], %rd85, %r358, %p61; 2026-02-21T09:09:57.5896845Z // end inline asm 2026-02-21T09:09:57.5896995Z // begin inline asm 2026-02-21T09:09:57.5897224Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 0 ], [ %r357 + 8 ], %rd86, %r358, %p63; 2026-02-21T09:09:57.5897495Z // end inline asm 2026-02-21T09:09:57.5897648Z // begin inline asm 2026-02-21T09:09:57.5897878Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 0 ], [ %r357 + 16 ], %rd87, %r358, %p63; 2026-02-21T09:09:57.5898150Z // end inline asm 2026-02-21T09:09:57.5898288Z // begin inline asm 2026-02-21T09:09:57.5898526Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 0 ], [ %r357 + 24 ], %rd88, %r358, %p63; 2026-02-21T09:09:57.5898787Z // end inline asm 2026-02-21T09:09:57.5898934Z // begin inline asm 2026-02-21T09:09:57.5899173Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 32 ], [ %r357 + 32 ], %rd85, %r358, %p61; 2026-02-21T09:09:57.5899481Z // end inline asm 2026-02-21T09:09:57.5899638Z // begin inline asm 2026-02-21T09:09:57.5899867Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 32 ], [ %r357 + 40 ], %rd86, %r358, %p63; 2026-02-21T09:09:57.5900139Z // end inline asm 2026-02-21T09:09:57.5900275Z // begin inline asm 2026-02-21T09:09:57.5900511Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 32 ], [ %r357 + 48 ], %rd87, %r358, %p63; 2026-02-21T09:09:57.5900822Z // end inline asm 2026-02-21T09:09:57.5900958Z // begin inline asm 2026-02-21T09:09:57.5901193Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 32 ], [ %r357 + 56 ], %rd88, %r358, %p63; 2026-02-21T09:09:57.5901450Z // end inline asm 2026-02-21T09:09:57.5901638Z add.s32 %r382, %r52, 37888; 2026-02-21T09:09:57.5901807Z cvt.u64.u32 %rd93, %r382; 2026-02-21T09:09:57.5901972Z // begin inline asm 2026-02-21T09:09:57.5902188Z @%p62 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 
[%rd93]; 2026-02-21T09:09:57.5902434Z // end inline asm 2026-02-21T09:09:57.5902580Z $L__tmp5: 2026-02-21T09:09:57.5902742Z $L__BB0_3: 2026-02-21T09:09:57.5902920Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:57.5903127Z add.s32 %r28, %r52, %r296; 2026-02-21T09:09:57.5903297Z add.s32 %r29, %r52, %r297; 2026-02-21T09:09:57.5903453Z add.s32 %r30, %r52, %r298; 2026-02-21T09:09:57.5903620Z add.s32 %r31, %r52, %r299; 2026-02-21T09:09:57.5903899Z .loc 1 55 32 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:32 2026-02-21T09:09:57.5904204Z add.s64 %rd94, %rd8, 128; 2026-02-21T09:09:57.5904371Z cvt.u64.u32 %rd104, %r27; 2026-02-21T09:09:57.5904530Z add.s64 %rd105, %rd7, %rd104; 2026-02-21T09:09:57.5904699Z shl.b64 %rd106, %rd105, 1; 2026-02-21T09:09:57.5904858Z add.s64 %rd107, %rd18, %rd106; 2026-02-21T09:09:57.5905031Z add.s64 %rd95, %rd107, 65536; 2026-02-21T09:09:57.5905193Z add.s64 %rd96, %rd107, 131072; 2026-02-21T09:09:57.5905362Z add.s64 %rd97, %rd107, 196608; 2026-02-21T09:09:57.5905521Z add.s64 %rd98, %rd107, 262144; 2026-02-21T09:09:57.5905692Z add.s64 %rd99, %rd107, 327680; 2026-02-21T09:09:57.5905855Z add.s64 %rd100, %rd107, 393216; 2026-02-21T09:09:57.5906030Z add.s64 %rd101, %rd10, 128; 2026-02-21T09:09:57.5906196Z mov.b32 %r384, 16; 2026-02-21T09:09:57.5906482Z .loc 1 55 80 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:80 2026-02-21T09:09:57.5906781Z // begin inline asm 2026-02-21T09:09:57.5906989Z cp.async.cg.shared.global [ %r383 + 0 ], [ %rd94 + 0 ], 0x10, %r384; 2026-02-21T09:09:57.5907229Z // end inline asm 2026-02-21T09:09:57.5907368Z // begin inline asm 2026-02-21T09:09:57.5907578Z cp.async.cg.shared.global [ %r385 + 0 ], [ %rd95 + 0 ], 0x10, %r384; 2026-02-21T09:09:57.5907812Z // end inline asm 2026-02-21T09:09:57.5907950Z // begin inline asm 2026-02-21T09:09:57.5908156Z cp.async.cg.shared.global [ %r387 + 0 ], [ %rd96 + 0 ], 0x10, %r384; 2026-02-21T09:09:57.5908376Z // end inline asm 2026-02-21T09:09:57.5908523Z // 
begin inline asm 2026-02-21T09:09:57.5908721Z cp.async.cg.shared.global [ %r389 + 0 ], [ %rd97 + 0 ], 0x10, %r384; 2026-02-21T09:09:57.5908948Z // end inline asm 2026-02-21T09:09:57.5909084Z // begin inline asm 2026-02-21T09:09:57.5909293Z cp.async.cg.shared.global [ %r391 + 0 ], [ %rd98 + 0 ], 0x10, %r384; 2026-02-21T09:09:57.5909518Z // end inline asm 2026-02-21T09:09:57.5909669Z // begin inline asm 2026-02-21T09:09:57.5909876Z cp.async.cg.shared.global [ %r393 + 0 ], [ %rd99 + 0 ], 0x10, %r384; 2026-02-21T09:09:57.5910096Z // end inline asm 2026-02-21T09:09:57.5910243Z // begin inline asm 2026-02-21T09:09:57.5910441Z cp.async.cg.shared.global [ %r395 + 0 ], [ %rd100 + 0 ], 0x10, %r384; 2026-02-21T09:09:57.5910694Z // end inline asm 2026-02-21T09:09:57.5910835Z // begin inline asm 2026-02-21T09:09:57.5911050Z cp.async.cg.shared.global [ %r397 + 0 ], [ %rd101 + 0 ], 0x10, %r384; 2026-02-21T09:09:57.5911317Z // end inline asm 2026-02-21T09:09:57.5911481Z cp.async.commit_group; 2026-02-21T09:09:57.5911815Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5912130Z // begin inline asm 2026-02-21T09:09:57.5912349Z @%p115 mbarrier.arrive.expect_tx.shared.b64 _, [%r399], 512; 2026-02-21T09:09:57.5912584Z // end inline asm 2026-02-21T09:09:57.5912862Z .loc 1 61 33 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:61:33 2026-02-21T09:09:57.5913192Z bar.sync 0; 2026-02-21T09:09:57.5913357Z elect.sync %r408|%p81, -1; 2026-02-21T09:09:57.5913537Z and.pred %p79, %p1, %p81; 2026-02-21T09:09:57.5913710Z mov.b32 %r402, 32; 2026-02-21T09:09:57.5913868Z // begin inline asm 2026-02-21T09:09:57.5914217Z @%p79 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r166], [%rd102, {%r663, %r402}], [%r399]; 2026-02-21T09:09:57.5914597Z // end inline asm 2026-02-21T09:09:57.5914865Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5915211Z shl.b64 %rd108, 
%rd9, 1; 2026-02-21T09:09:57.5915381Z add.s64 %rd11, %rd108, 192; 2026-02-21T09:09:57.5915558Z and.b32 %r409, %r1, 3; 2026-02-21T09:09:57.5915738Z mad.wide.u32 %rd257, %r409, 16, %rd18; 2026-02-21T09:09:57.5915926Z and.b32 %r410, %r32, 15; 2026-02-21T09:09:57.5916096Z shl.b32 %r411, %r410, 18; 2026-02-21T09:09:57.5916260Z shl.b32 %r412, %r5, 10; 2026-02-21T09:09:57.5916431Z or.b32 %r413, %r411, %r412; 2026-02-21T09:09:57.5916602Z mul.wide.u32 %rd13, %r413, 2; 2026-02-21T09:09:57.5916777Z mov.b32 %r768, 1; 2026-02-21T09:09:57.5916922Z mov.b32 %r764, 0; 2026-02-21T09:09:57.5917077Z mov.b64 %rd258, 0; 2026-02-21T09:09:57.5917229Z mov.b32 %r766, %r764; 2026-02-21T09:09:57.5917394Z mov.b32 %r767, %r764; 2026-02-21T09:09:57.5917555Z mov.b32 %r769, %r764; 2026-02-21T09:09:57.5917707Z bra.uni $L__BB0_4; 2026-02-21T09:09:57.5917915Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:57.5918269Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5918592Z setp.lt.u64 %p108, %rd258, 464; 2026-02-21T09:09:57.5918765Z $L__tmp6: 2026-02-21T09:09:57.5919076Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5919470Z add.s32 %r585, %r768, 1; 2026-02-21T09:09:57.5919638Z setp.gt.s32 %p111, %r585, 1; 2026-02-21T09:09:57.5919815Z selp.b32 %r768, 0, %r585, %p111; 2026-02-21T09:09:57.5919990Z selp.b32 %r586, 1, 0, %p111; 2026-02-21T09:09:57.5920172Z xor.b32 %r51, %r769, %r586; 2026-02-21T09:09:57.5920326Z $L__tmp7: 2026-02-21T09:09:57.5920577Z .loc 1 55 32 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:32 2026-02-21T09:09:57.5920868Z add.s64 %rd127, %rd257, %rd13; 2026-02-21T09:09:57.5921044Z add.s64 %rd118, %rd127, 192; 2026-02-21T09:09:57.5921202Z add.s64 %rd119, %rd127, 65728; 2026-02-21T09:09:57.5921377Z add.s64 %rd120, %rd127, 131264; 2026-02-21T09:09:57.5921622Z add.s64 %rd121, %rd127, 196800; 2026-02-21T09:09:57.5921789Z add.s64 
%rd122, %rd127, 262336; 2026-02-21T09:09:57.5921959Z add.s64 %rd123, %rd127, 327872; 2026-02-21T09:09:57.5922121Z add.s64 %rd124, %rd127, 393408; 2026-02-21T09:09:57.5922402Z .loc 1 55 80 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:80 2026-02-21T09:09:57.5922691Z add.s64 %rd125, %rd257, %rd11; 2026-02-21T09:09:57.5922863Z add.s32 %r564, %r45, %r6; 2026-02-21T09:09:57.5923024Z selp.b32 %r565, 16, 0, %p108; 2026-02-21T09:09:57.5923197Z // begin inline asm 2026-02-21T09:09:57.5923413Z cp.async.cg.shared.global [ %r564 + 0 ], [ %rd118 + 0 ], 0x10, %r565; 2026-02-21T09:09:57.5923643Z // end inline asm 2026-02-21T09:09:57.5923795Z add.s32 %r566, %r564, 2048; 2026-02-21T09:09:57.5923952Z // begin inline asm 2026-02-21T09:09:57.5924187Z cp.async.cg.shared.global [ %r566 + 0 ], [ %rd119 + 0 ], 0x10, %r565; 2026-02-21T09:09:57.5924416Z // end inline asm 2026-02-21T09:09:57.5924554Z add.s32 %r568, %r564, 4096; 2026-02-21T09:09:57.5924712Z // begin inline asm 2026-02-21T09:09:57.5924915Z cp.async.cg.shared.global [ %r568 + 0 ], [ %rd120 + 0 ], 0x10, %r565; 2026-02-21T09:09:57.5925133Z // end inline asm 2026-02-21T09:09:57.5925276Z add.s32 %r570, %r564, 6144; 2026-02-21T09:09:57.5925427Z // begin inline asm 2026-02-21T09:09:57.5925659Z cp.async.cg.shared.global [ %r570 + 0 ], [ %rd121 + 0 ], 0x10, %r565; 2026-02-21T09:09:57.5925880Z // end inline asm 2026-02-21T09:09:57.5926024Z add.s32 %r572, %r564, 8192; 2026-02-21T09:09:57.5926177Z // begin inline asm 2026-02-21T09:09:57.5926374Z cp.async.cg.shared.global [ %r572 + 0 ], [ %rd122 + 0 ], 0x10, %r565; 2026-02-21T09:09:57.5926596Z // end inline asm 2026-02-21T09:09:57.5926735Z add.s32 %r574, %r564, 10240; 2026-02-21T09:09:57.5926893Z // begin inline asm 2026-02-21T09:09:57.5927087Z cp.async.cg.shared.global [ %r574 + 0 ], [ %rd123 + 0 ], 0x10, %r565; 2026-02-21T09:09:57.5927345Z // end inline asm 2026-02-21T09:09:57.5927484Z add.s32 %r576, %r564, 12288; 2026-02-21T09:09:57.5927644Z // begin inline asm 
2026-02-21T09:09:57.5927836Z cp.async.cg.shared.global [ %r576 + 0 ], [ %rd124 + 0 ], 0x10, %r565; 2026-02-21T09:09:57.5928066Z // end inline asm 2026-02-21T09:09:57.5928203Z add.s32 %r578, %r564, 14336; 2026-02-21T09:09:57.5928361Z // begin inline asm 2026-02-21T09:09:57.5928559Z cp.async.cg.shared.global [ %r578 + 0 ], [ %rd125 + 0 ], 0x10, %r565; 2026-02-21T09:09:57.5928777Z // end inline asm 2026-02-21T09:09:57.5928925Z cp.async.commit_group; 2026-02-21T09:09:57.5929191Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5929493Z and.pred %p106, %p115, %p108; 2026-02-21T09:09:57.5929653Z // begin inline asm 2026-02-21T09:09:57.5929851Z @%p106 mbarrier.arrive.expect_tx.shared.b64 _, [%r580], 512; 2026-02-21T09:09:57.5930066Z // end inline asm 2026-02-21T09:09:57.5930320Z .loc 1 61 33 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:61:33 2026-02-21T09:09:57.5930612Z bar.sync 0; 2026-02-21T09:09:57.5930756Z elect.sync %r587|%p112, -1; 2026-02-21T09:09:57.5930931Z and.pred %p113, %p108, %p112; 2026-02-21T09:09:57.5931094Z and.pred %p107, %p1, %p113; 2026-02-21T09:09:57.5931285Z cvt.u32.u64 %r588, %rd258; 2026-02-21T09:09:57.5931445Z add.s32 %r583, %r588, 48; 2026-02-21T09:09:57.5931643Z // begin inline asm 2026-02-21T09:09:57.5931977Z @%p107 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r581], [%rd102, {%r663, %r583}], [%r580]; 2026-02-21T09:09:57.5932330Z // end inline asm 2026-02-21T09:09:57.5932591Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5932883Z add.s64 %rd257, %rd257, 64; 2026-02-21T09:09:57.5933056Z setp.lt.u64 %p114, %rd258, 480; 2026-02-21T09:09:57.5933220Z add.s64 %rd258, %rd258, 16; 2026-02-21T09:09:57.5933379Z mov.b32 %r764, %r769; 2026-02-21T09:09:57.5933523Z mov.b32 %r769, %r51; 2026-02-21T09:09:57.5933676Z @%p114 bra $L__BB0_4; 2026-02-21T09:09:57.5933831Z bra.uni $L__BB0_7; 
2026-02-21T09:09:57.5934017Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:09:57.5934240Z add.s32 %r486, %r767, 1; 2026-02-21T09:09:57.5934397Z setp.gt.s32 %p88, %r486, 1; 2026-02-21T09:09:57.5934569Z selp.b32 %r767, 0, %r486, %p88; 2026-02-21T09:09:57.5934841Z .loc 1 55 80 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:55:80 2026-02-21T09:09:57.5935137Z cp.async.wait_group 1; 2026-02-21T09:09:57.5935287Z bar.sync 0; 2026-02-21T09:09:57.5935429Z shl.b32 %r487, %r767, 14; 2026-02-21T09:09:57.5935589Z add.s32 %r45, %r52, %r487; 2026-02-21T09:09:57.5935744Z add.s32 %r489, %r45, %r16; 2026-02-21T09:09:57.5935939Z ld.shared.v4.b32 {%r490, %r491, %r492, %r493}, [%r489]; 2026-02-21T09:09:57.5936178Z mov.b32 {%rs105, %rs106}, %r493; 2026-02-21T09:09:57.5936355Z mov.b32 {%rs107, %rs108}, %r492; 2026-02-21T09:09:57.5936516Z mov.b32 {%rs109, %rs110}, %r491; 2026-02-21T09:09:57.5936681Z mov.b32 {%rs111, %rs112}, %r490; 2026-02-21T09:09:57.5936878Z ld.shared.v4.b32 {%r494, %r495, %r496, %r497}, [%r489+16]; 2026-02-21T09:09:57.5937091Z mov.b32 {%rs113, %rs114}, %r497; 2026-02-21T09:09:57.5937256Z mov.b32 {%rs115, %rs116}, %r496; 2026-02-21T09:09:57.5937464Z mov.b32 {%rs117, %rs118}, %r495; 2026-02-21T09:09:57.5937629Z mov.b32 {%rs119, %rs120}, %r494; 2026-02-21T09:09:57.5937819Z ld.shared.v4.b32 {%r498, %r499, %r500, %r501}, [%r489+32]; 2026-02-21T09:09:57.5938025Z mov.b32 {%rs121, %rs122}, %r501; 2026-02-21T09:09:57.5938181Z mov.b32 {%rs123, %rs124}, %r500; 2026-02-21T09:09:57.5938344Z mov.b32 {%rs125, %rs126}, %r499; 2026-02-21T09:09:57.5938501Z mov.b32 {%rs127, %rs128}, %r498; 2026-02-21T09:09:57.5938701Z ld.shared.v4.b32 {%r502, %r503, %r504, %r505}, [%r489+48]; 2026-02-21T09:09:57.5938908Z mov.b32 {%rs129, %rs130}, %r505; 2026-02-21T09:09:57.5939095Z mov.b32 {%rs131, %rs132}, %r504; 2026-02-21T09:09:57.5939265Z mov.b32 {%rs133, %rs134}, %r503; 2026-02-21T09:09:57.5939428Z mov.b32 {%rs135, %rs136}, %r502; 2026-02-21T09:09:57.5939635Z 
ld.shared.v4.b32 {%r506, %r507, %r508, %r509}, [%r489+8192]; 2026-02-21T09:09:57.5939846Z mov.b32 {%rs137, %rs138}, %r509; 2026-02-21T09:09:57.5940016Z mov.b32 {%rs139, %rs140}, %r508; 2026-02-21T09:09:57.5940181Z mov.b32 {%rs141, %rs142}, %r507; 2026-02-21T09:09:57.5940351Z mov.b32 {%rs143, %rs144}, %r506; 2026-02-21T09:09:57.5940555Z ld.shared.v4.b32 {%r510, %r511, %r512, %r513}, [%r489+8208]; 2026-02-21T09:09:57.5940765Z mov.b32 {%rs145, %rs146}, %r513; 2026-02-21T09:09:57.5940938Z mov.b32 {%rs147, %rs148}, %r512; 2026-02-21T09:09:57.5941103Z mov.b32 {%rs149, %rs150}, %r511; 2026-02-21T09:09:57.5941275Z mov.b32 {%rs151, %rs152}, %r510; 2026-02-21T09:09:57.5941474Z ld.shared.v4.b32 {%r514, %r515, %r516, %r517}, [%r489+8224]; 2026-02-21T09:09:57.5941718Z mov.b32 {%rs153, %rs154}, %r517; 2026-02-21T09:09:57.5941877Z mov.b32 {%rs155, %rs156}, %r516; 2026-02-21T09:09:57.5942039Z mov.b32 {%rs157, %rs158}, %r515; 2026-02-21T09:09:57.5942202Z mov.b32 {%rs159, %rs160}, %r514; 2026-02-21T09:09:57.5942391Z ld.shared.v4.b32 {%r518, %r519, %r520, %r521}, [%r489+8240]; 2026-02-21T09:09:57.5942626Z mov.b32 {%rs161, %rs162}, %r521; 2026-02-21T09:09:57.5942789Z mov.b32 {%rs163, %rs164}, %r520; 2026-02-21T09:09:57.5942952Z mov.b32 {%rs165, %rs166}, %r519; 2026-02-21T09:09:57.5943111Z mov.b32 {%rs167, %rs168}, %r518; 2026-02-21T09:09:57.5943388Z .loc 1 59 32 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:59:32 2026-02-21T09:09:57.5943688Z cvt.f32.bf16 %r417, %rs111; 2026-02-21T09:09:57.5943846Z cvt.f32.bf16 %r418, %rs112; 2026-02-21T09:09:57.5944011Z cvt.f32.bf16 %r419, %rs109; 2026-02-21T09:09:57.5944167Z cvt.f32.bf16 %r420, %rs110; 2026-02-21T09:09:57.5944325Z cvt.f32.bf16 %r421, %rs107; 2026-02-21T09:09:57.5944475Z cvt.f32.bf16 %r422, %rs108; 2026-02-21T09:09:57.5944634Z cvt.f32.bf16 %r423, %rs105; 2026-02-21T09:09:57.5944782Z cvt.f32.bf16 %r424, %rs106; 2026-02-21T09:09:57.5944937Z cvt.f32.bf16 %r425, %rs119; 2026-02-21T09:09:57.5945086Z cvt.f32.bf16 %r426, %rs120; 
2026-02-21T09:09:57.5945242Z cvt.f32.bf16 %r427, %rs117; 2026-02-21T09:09:57.5945399Z cvt.f32.bf16 %r428, %rs118; 2026-02-21T09:09:57.5945550Z cvt.f32.bf16 %r429, %rs115; 2026-02-21T09:09:57.5945707Z cvt.f32.bf16 %r430, %rs116; 2026-02-21T09:09:57.5945856Z cvt.f32.bf16 %r431, %rs113; 2026-02-21T09:09:57.5946012Z cvt.f32.bf16 %r432, %rs114; 2026-02-21T09:09:57.5946159Z cvt.f32.bf16 %r434, %rs127; 2026-02-21T09:09:57.5946313Z cvt.f32.bf16 %r435, %rs128; 2026-02-21T09:09:57.5946461Z cvt.f32.bf16 %r436, %rs125; 2026-02-21T09:09:57.5946615Z cvt.f32.bf16 %r437, %rs126; 2026-02-21T09:09:57.5946762Z cvt.f32.bf16 %r438, %rs123; 2026-02-21T09:09:57.5946948Z cvt.f32.bf16 %r439, %rs124; 2026-02-21T09:09:57.5947107Z cvt.f32.bf16 %r440, %rs121; 2026-02-21T09:09:57.5947260Z cvt.f32.bf16 %r441, %rs122; 2026-02-21T09:09:57.5947418Z cvt.f32.bf16 %r442, %rs135; 2026-02-21T09:09:57.5947570Z cvt.f32.bf16 %r443, %rs136; 2026-02-21T09:09:57.5947727Z cvt.f32.bf16 %r444, %rs133; 2026-02-21T09:09:57.5947878Z cvt.f32.bf16 %r445, %rs134; 2026-02-21T09:09:57.5948034Z cvt.f32.bf16 %r446, %rs131; 2026-02-21T09:09:57.5948211Z cvt.f32.bf16 %r447, %rs132; 2026-02-21T09:09:57.5948371Z cvt.f32.bf16 %r448, %rs129; 2026-02-21T09:09:57.5948529Z cvt.f32.bf16 %r449, %rs130; 2026-02-21T09:09:57.5948679Z cvt.f32.bf16 %r451, %rs143; 2026-02-21T09:09:57.5948837Z cvt.f32.bf16 %r452, %rs144; 2026-02-21T09:09:57.5948989Z cvt.f32.bf16 %r453, %rs141; 2026-02-21T09:09:57.5949149Z cvt.f32.bf16 %r454, %rs142; 2026-02-21T09:09:57.5949301Z cvt.f32.bf16 %r455, %rs139; 2026-02-21T09:09:57.5949463Z cvt.f32.bf16 %r456, %rs140; 2026-02-21T09:09:57.5949619Z cvt.f32.bf16 %r457, %rs137; 2026-02-21T09:09:57.5949775Z cvt.f32.bf16 %r458, %rs138; 2026-02-21T09:09:57.5949956Z cvt.f32.bf16 %r459, %rs151; 2026-02-21T09:09:57.5950118Z cvt.f32.bf16 %r460, %rs152; 2026-02-21T09:09:57.5950271Z cvt.f32.bf16 %r461, %rs149; 2026-02-21T09:09:57.5950423Z cvt.f32.bf16 %r462, %rs150; 2026-02-21T09:09:57.5950581Z cvt.f32.bf16 %r463, %rs147; 
2026-02-21T09:09:57.5950730Z cvt.f32.bf16 %r464, %rs148; 2026-02-21T09:09:57.5950888Z cvt.f32.bf16 %r465, %rs145; 2026-02-21T09:09:57.5951040Z cvt.f32.bf16 %r466, %rs146; 2026-02-21T09:09:57.5951196Z cvt.f32.bf16 %r468, %rs159; 2026-02-21T09:09:57.5951344Z cvt.f32.bf16 %r469, %rs160; 2026-02-21T09:09:57.5951502Z cvt.f32.bf16 %r470, %rs157; 2026-02-21T09:09:57.5951690Z cvt.f32.bf16 %r471, %rs158; 2026-02-21T09:09:57.5951849Z cvt.f32.bf16 %r472, %rs155; 2026-02-21T09:09:57.5952007Z cvt.f32.bf16 %r473, %rs156; 2026-02-21T09:09:57.5952155Z cvt.f32.bf16 %r474, %rs153; 2026-02-21T09:09:57.5952315Z cvt.f32.bf16 %r475, %rs154; 2026-02-21T09:09:57.5952464Z cvt.f32.bf16 %r476, %rs167; 2026-02-21T09:09:57.5952621Z cvt.f32.bf16 %r477, %rs168; 2026-02-21T09:09:57.5952768Z cvt.f32.bf16 %r478, %rs165; 2026-02-21T09:09:57.5952925Z cvt.f32.bf16 %r479, %rs166; 2026-02-21T09:09:57.5953072Z cvt.f32.bf16 %r480, %rs163; 2026-02-21T09:09:57.5953225Z cvt.f32.bf16 %r481, %rs164; 2026-02-21T09:09:57.5953405Z cvt.f32.bf16 %r482, %rs161; 2026-02-21T09:09:57.5953594Z cvt.f32.bf16 %r483, %rs162; 2026-02-21T09:09:57.5953759Z $L__tmp8: 2026-02-21T09:09:57.5954060Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5954412Z // begin inline asm 2026-02-21T09:09:57.5954555Z 2026-02-21T09:09:57.5954680Z { 2026-02-21T09:09:57.5954808Z .reg .pred complete; 2026-02-21T09:09:57.5954969Z waitLoop: 2026-02-21T09:09:57.5955162Z mbarrier.try_wait.parity.shared.b64 complete, [%r765], %r764; 2026-02-21T09:09:57.5955413Z @!complete bra.uni waitLoop; 2026-02-21T09:09:57.5955576Z } 2026-02-21T09:09:57.5955645Z 2026-02-21T09:09:57.5955703Z // end inline asm 2026-02-21T09:09:57.5955850Z $L__tmp9: 2026-02-21T09:09:57.5956101Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5956415Z selp.b32 %r522, 1, 0, %p88; 2026-02-21T09:09:57.5956577Z xor.b32 %r766, %r766, %r522; 2026-02-21T09:09:57.5956753Z 
mov.pred %p89, -1; 2026-02-21T09:09:57.5956900Z $L__tmp10: 2026-02-21T09:09:57.5957205Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5957552Z // begin inline asm 2026-02-21T09:09:57.5957932Z @%p89 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r191 + 0], {%r417, %r418, %r419, %r420, %r421, %r422, %r423, %r424, %r425, %r426, %r427, %r428, %r429, %r430, %r431, %r432}; 2026-02-21T09:09:57.5958343Z // end inline asm 2026-02-21T09:09:57.5958484Z // begin inline asm 2026-02-21T09:09:57.5958897Z @%p89 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r191 + 16], {%r434, %r435, %r436, %r437, %r438, %r439, %r440, %r441, %r442, %r443, %r444, %r445, %r446, %r447, %r448, %r449}; 2026-02-21T09:09:57.5959305Z // end inline asm 2026-02-21T09:09:57.5959450Z // begin inline asm 2026-02-21T09:09:57.5959828Z @%p89 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r191 + 32], {%r451, %r452, %r453, %r454, %r455, %r456, %r457, %r458, %r459, %r460, %r461, %r462, %r463, %r464, %r465, %r466}; 2026-02-21T09:09:57.5960260Z // end inline asm 2026-02-21T09:09:57.5960409Z // begin inline asm 2026-02-21T09:09:57.5960768Z @%p89 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r191 + 48], {%r468, %r469, %r470, %r471, %r472, %r473, %r474, %r475, %r476, %r477, %r478, %r479, %r480, %r481, %r482, %r483}; 2026-02-21T09:09:57.5961156Z // end inline asm 2026-02-21T09:09:57.5961296Z // begin inline asm 2026-02-21T09:09:57.5961447Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:57.5961636Z // end inline asm 2026-02-21T09:09:57.5961767Z bar.sync 0; 2026-02-21T09:09:57.5961902Z $L__tmp11: 2026-02-21T09:09:57.5962179Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5962476Z shl.b32 %r523, %r767, 3; 2026-02-21T09:09:57.5962631Z add.s32 %r524, %r52, %r523; 2026-02-21T09:09:57.5962797Z add.s32 %r580, %r524, 37904; 2026-02-21T09:09:57.5962947Z // begin inline asm 2026-02-21T09:09:57.5963086Z 
2026-02-21T09:09:57.5963207Z { 2026-02-21T09:09:57.5963329Z .reg .pred complete; 2026-02-21T09:09:57.5963479Z waitLoop: 2026-02-21T09:09:57.5963661Z mbarrier.try_wait.parity.shared.b64 complete, [%r580], %r766; 2026-02-21T09:09:57.5963899Z @!complete bra.uni waitLoop; 2026-02-21T09:09:57.5964048Z } 2026-02-21T09:09:57.5964119Z 2026-02-21T09:09:57.5964173Z // end inline asm 2026-02-21T09:09:57.5964420Z .loc 1 61 33 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:61:33 2026-02-21T09:09:57.5964710Z shl.b32 %r525, %r767, 9; 2026-02-21T09:09:57.5964869Z add.s32 %r526, %r52, %r525; 2026-02-21T09:09:57.5965023Z add.s32 %r581, %r526, 36864; 2026-02-21T09:09:57.5965181Z add.s32 %r527, %r581, %r17; 2026-02-21T09:09:57.5965336Z ld.shared.b8 %rs169, [%r527]; 2026-02-21T09:09:57.5965505Z ld.shared.b8 %rs170, [%r527+64]; 2026-02-21T09:09:57.5965675Z ld.shared.b8 %rs171, [%r527+256]; 2026-02-21T09:09:57.5965852Z ld.shared.b8 %rs172, [%r527+320]; 2026-02-21T09:09:57.5966047Z add.s32 %r528, %r581, %r18; 2026-02-21T09:09:57.5966215Z ld.shared.b8 %rs173, [%r528+128]; 2026-02-21T09:09:57.5966382Z ld.shared.b8 %rs174, [%r528+192]; 2026-02-21T09:09:57.5966551Z ld.shared.b8 %rs175, [%r528+384]; 2026-02-21T09:09:57.5966721Z ld.shared.b8 %rs176, [%r528+448]; 2026-02-21T09:09:57.5966995Z .loc 1 64 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:64:28 2026-02-21T09:09:57.5967294Z shl.b16 %rs177, %rs169, 4; 2026-02-21T09:09:57.5967450Z shl.b16 %rs178, %rs170, 4; 2026-02-21T09:09:57.5967610Z shl.b16 %rs179, %rs173, 4; 2026-02-21T09:09:57.5967667Z shl.b16 %rs180, %rs174, 4; 2026-02-21T09:09:57.5967725Z shl.b16 %rs181, %rs171, 4; 2026-02-21T09:09:57.5967789Z shl.b16 %rs182, %rs172, 4; 2026-02-21T09:09:57.5967844Z shl.b16 %rs183, %rs175, 4; 2026-02-21T09:09:57.5967900Z shl.b16 %rs184, %rs176, 4; 2026-02-21T09:09:57.5968076Z .loc 1 79 58 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:79:58 2026-02-21T09:09:57.5968147Z selp.b16 %rs185, %rs177, %rs169, %p56; 
2026-02-21T09:09:57.5968208Z cvt.s16.s8 %rs186, %rs185; 2026-02-21T09:09:57.5968267Z shr.s16 %rs187, %rs186, 4; 2026-02-21T09:09:57.5968342Z selp.b16 %rs188, %rs178, %rs170, %p56; 2026-02-21T09:09:57.5968400Z cvt.s16.s8 %rs189, %rs188; 2026-02-21T09:09:57.5968457Z shr.s16 %rs190, %rs189, 4; 2026-02-21T09:09:57.5968530Z selp.b16 %rs191, %rs179, %rs173, %p56; 2026-02-21T09:09:57.5968591Z cvt.s16.s8 %rs192, %rs191; 2026-02-21T09:09:57.5968650Z shr.s16 %rs193, %rs192, 4; 2026-02-21T09:09:57.5968751Z selp.b16 %rs194, %rs180, %rs174, %p56; 2026-02-21T09:09:57.5968819Z cvt.s16.s8 %rs195, %rs194; 2026-02-21T09:09:57.5968878Z shr.s16 %rs196, %rs195, 4; 2026-02-21T09:09:57.5968942Z selp.b16 %rs197, %rs181, %rs171, %p56; 2026-02-21T09:09:57.5969008Z cvt.s16.s8 %rs198, %rs197; 2026-02-21T09:09:57.5969066Z shr.s16 %rs199, %rs198, 4; 2026-02-21T09:09:57.5969129Z selp.b16 %rs200, %rs182, %rs172, %p56; 2026-02-21T09:09:57.5969194Z cvt.s16.s8 %rs201, %rs200; 2026-02-21T09:09:57.5969279Z shr.s16 %rs202, %rs201, 4; 2026-02-21T09:09:57.5969342Z selp.b16 %rs203, %rs183, %rs175, %p56; 2026-02-21T09:09:57.5969399Z cvt.s16.s8 %rs204, %rs203; 2026-02-21T09:09:57.5969464Z shr.s16 %rs205, %rs204, 4; 2026-02-21T09:09:57.5969526Z selp.b16 %rs206, %rs184, %rs176, %p56; 2026-02-21T09:09:57.5969584Z cvt.s16.s8 %rs207, %rs206; 2026-02-21T09:09:57.5969647Z shr.s16 %rs208, %rs207, 4; 2026-02-21T09:09:57.5969820Z .loc 1 84 32 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:84:32 2026-02-21T09:09:57.5969884Z cvt.rn.f32.s16 %r529, %rs187; 2026-02-21T09:09:57.5969966Z cvt.rn.f32.s16 %r530, %rs190; 2026-02-21T09:09:57.5970033Z cvt.rn.f32.s16 %r531, %rs193; 2026-02-21T09:09:57.5970092Z cvt.rn.f32.s16 %r532, %rs196; 2026-02-21T09:09:57.5970150Z cvt.rn.f32.s16 %r533, %rs199; 2026-02-21T09:09:57.5970216Z cvt.rn.f32.s16 %r534, %rs202; 2026-02-21T09:09:57.5970273Z cvt.rn.f32.s16 %r535, %rs205; 2026-02-21T09:09:57.5970331Z cvt.rn.f32.s16 %r536, %rs208; 2026-02-21T09:09:57.5970403Z st.shared.b32 
[%r19], %r529; 2026-02-21T09:09:57.5970464Z st.shared.b32 [%r20], %r530; 2026-02-21T09:09:57.5970524Z st.shared.b32 [%r21], %r531; 2026-02-21T09:09:57.5970584Z st.shared.b32 [%r22], %r532; 2026-02-21T09:09:57.5970651Z st.shared.b32 [%r23], %r533; 2026-02-21T09:09:57.5970709Z st.shared.b32 [%r24], %r534; 2026-02-21T09:09:57.5970769Z st.shared.b32 [%r25], %r535; 2026-02-21T09:09:57.5970834Z st.shared.b32 [%r26], %r536; 2026-02-21T09:09:57.5971009Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5971071Z shl.b32 %r537, %r768, 3; 2026-02-21T09:09:57.5971132Z add.s32 %r538, %r52, %r537; 2026-02-21T09:09:57.5971196Z add.s32 %r765, %r538, 37888; 2026-02-21T09:09:57.5971249Z $L__tmp12: 2026-02-21T09:09:57.5971493Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5971647Z // begin inline asm 2026-02-21T09:09:57.5971723Z fence.proxy.async.shared::cta; 2026-02-21T09:09:57.5971778Z // end inline asm 2026-02-21T09:09:57.5971841Z bar.sync 0; 2026-02-21T09:09:57.5971901Z @%p60 bra $L__BB0_6; 2026-02-21T09:09:57.5972004Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:57.5972070Z elect.sync %r563|%p90, -1; 2026-02-21T09:09:57.5972139Z mov.b32 %r541, 134744336; 2026-02-21T09:09:57.5972196Z // begin inline asm 2026-02-21T09:09:57.5972352Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 0 ], [ %r357 + 0 ], %rd85, %r541, %p89; 2026-02-21T09:09:57.5972418Z // end inline asm 2026-02-21T09:09:57.5972477Z // begin inline asm 2026-02-21T09:09:57.5972623Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 0 ], [ %r357 + 8 ], %rd86, %r541, %p89; 2026-02-21T09:09:57.5972687Z // end inline asm 2026-02-21T09:09:57.5972745Z // begin inline asm 2026-02-21T09:09:57.5972892Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 0 ], [ %r357 + 16 ], %rd87, %r541, %p89; 2026-02-21T09:09:57.5972948Z // end inline asm 2026-02-21T09:09:57.5973013Z // begin inline 
asm 2026-02-21T09:09:57.5973156Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 0 ], [ %r357 + 24 ], %rd88, %r541, %p89; 2026-02-21T09:09:57.5973212Z // end inline asm 2026-02-21T09:09:57.5973276Z // begin inline asm 2026-02-21T09:09:57.5973420Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 32 ], [ %r357 + 32 ], %rd85, %r541, %p89; 2026-02-21T09:09:57.5973474Z // end inline asm 2026-02-21T09:09:57.5973582Z // begin inline asm 2026-02-21T09:09:57.5973725Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 32 ], [ %r357 + 40 ], %rd86, %r541, %p89; 2026-02-21T09:09:57.5973780Z // end inline asm 2026-02-21T09:09:57.5973841Z // begin inline asm 2026-02-21T09:09:57.5973982Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 32 ], [ %r357 + 48 ], %rd87, %r541, %p89; 2026-02-21T09:09:57.5974057Z // end inline asm 2026-02-21T09:09:57.5974113Z // begin inline asm 2026-02-21T09:09:57.5974290Z @%p90 tcgen05.mma.cta_group::1.kind::tf32 [ %r763 + 32 ], [ %r357 + 56 ], %rd88, %r541, %p89; 2026-02-21T09:09:57.5974346Z // end inline asm 2026-02-21T09:09:57.5974407Z cvt.u64.u32 %rd117, %r765; 2026-02-21T09:09:57.5974471Z // begin inline asm 2026-02-21T09:09:57.5974596Z @%p90 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd117]; 2026-02-21T09:09:57.5974651Z // end inline asm 2026-02-21T09:09:57.5974716Z bra.uni $L__BB0_6; 2026-02-21T09:09:57.5974812Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:09:57.5974933Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:57.5974991Z mov.b32 %r590, 1; 2026-02-21T09:09:57.5975218Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5975274Z // begin inline asm 2026-02-21T09:09:57.5975323Z 2026-02-21T09:09:57.5975380Z { 2026-02-21T09:09:57.5975443Z .reg .pred complete; 2026-02-21T09:09:57.5975498Z waitLoop: 2026-02-21T09:09:57.5975613Z mbarrier.try_wait.parity.shared.b64 complete, [%r765], %r590; 2026-02-21T09:09:57.5975686Z @!complete bra.uni waitLoop; 
2026-02-21T09:09:57.5975736Z } 2026-02-21T09:09:57.5975740Z 2026-02-21T09:09:57.5975796Z // end inline asm 2026-02-21T09:09:57.5975856Z $L__tmp13: 2026-02-21T09:09:57.5976026Z .loc 1 48 125 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:48:125 2026-02-21T09:09:57.5976089Z cp.async.wait_group 0; 2026-02-21T09:09:57.5976151Z bar.sync 0; 2026-02-21T09:09:57.5976206Z // begin inline asm 2026-02-21T09:09:57.5976293Z @%p115 mbarrier.inval.shared::cta.b64 [%r399]; 2026-02-21T09:09:57.5976349Z // end inline asm 2026-02-21T09:09:57.5976412Z bar.sync 0; 2026-02-21T09:09:57.5976468Z // begin inline asm 2026-02-21T09:09:57.5976554Z @%p115 mbarrier.inval.shared::cta.b64 [%r148]; 2026-02-21T09:09:57.5976617Z // end inline asm 2026-02-21T09:09:57.5976704Z add.s32 %r593, %r52, 37888; 2026-02-21T09:09:57.5976763Z // begin inline asm 2026-02-21T09:09:57.5976844Z @%p115 mbarrier.inval.shared::cta.b64 [%r593]; 2026-02-21T09:09:57.5976906Z // end inline asm 2026-02-21T09:09:57.5976958Z bar.sync 0; 2026-02-21T09:09:57.5977012Z // begin inline asm 2026-02-21T09:09:57.5977099Z @%p115 mbarrier.inval.shared::cta.b64 [%r146]; 2026-02-21T09:09:57.5977152Z // end inline asm 2026-02-21T09:09:57.5977204Z $L__tmp14: 2026-02-21T09:09:57.5977427Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5977485Z // begin inline asm 2026-02-21T09:09:57.5977758Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r595, %r596, %r597, %r598, %r599, %r600, %r601, %r602, %r603, %r604, %r605, %r606, %r607, %r608, %r609, %r610}, [%r662 + 0]; 2026-02-21T09:09:57.5977813Z // end inline asm 2026-02-21T09:09:57.5977875Z // begin inline asm 2026-02-21T09:09:57.5978135Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r612, %r613, %r614, %r615, %r616, %r617, %r618, %r619, %r620, %r621, %r622, %r623, %r624, %r625, %r626, %r627}, [%r662 + 16]; 2026-02-21T09:09:57.5978190Z // end inline asm 2026-02-21T09:09:57.5978255Z // begin inline asm 
2026-02-21T09:09:57.5978507Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r629, %r630, %r631, %r632, %r633, %r634, %r635, %r636, %r637, %r638, %r639, %r640, %r641, %r642, %r643, %r644}, [%r662 + 32]; 2026-02-21T09:09:57.5978561Z // end inline asm 2026-02-21T09:09:57.5978624Z // begin inline asm 2026-02-21T09:09:57.5978903Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r646, %r647, %r648, %r649, %r650, %r651, %r652, %r653, %r654, %r655, %r656, %r657, %r658, %r659, %r660, %r661}, [%r662 + 48]; 2026-02-21T09:09:57.5978958Z // end inline asm 2026-02-21T09:09:57.5979020Z // begin inline asm 2026-02-21T09:09:57.5979089Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:57.5979143Z // end inline asm 2026-02-21T09:09:57.5979205Z cvt.u64.u32 %rd129, %r595; 2026-02-21T09:09:57.5979274Z cvt.u64.u32 %rd130, %r596; 2026-02-21T09:09:57.5979355Z shl.b64 %rd131, %rd130, 32; 2026-02-21T09:09:57.5979417Z or.b64 %rd132, %rd129, %rd131; 2026-02-21T09:09:57.5979478Z $L__tmp15: 2026-02-21T09:09:57.5979652Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5979715Z mov.b64 {%r666, %r667}, %rd132; 2026-02-21T09:09:57.5979784Z cvt.rn.bf16x2.f32 %r668, %r667, %r666; 2026-02-21T09:09:57.5979844Z $L__tmp16: 2026-02-21T09:09:57.5980056Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5980143Z cvt.u64.u32 %rd133, %r597; 2026-02-21T09:09:57.5980210Z cvt.u64.u32 %rd134, %r598; 2026-02-21T09:09:57.5980270Z shl.b64 %rd135, %rd134, 32; 2026-02-21T09:09:57.5980332Z or.b64 %rd136, %rd133, %rd135; 2026-02-21T09:09:57.5980391Z $L__tmp17: 2026-02-21T09:09:57.5980564Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5980628Z mov.b64 {%r669, %r670}, %rd136; 2026-02-21T09:09:57.5980696Z cvt.rn.bf16x2.f32 %r671, %r670, %r669; 2026-02-21T09:09:57.5980757Z $L__tmp18: 2026-02-21T09:09:57.5980968Z .loc 2 291 36 // standard.py:291:36 @[ 
co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5981027Z cvt.u64.u32 %rd137, %r599; 2026-02-21T09:09:57.5981091Z cvt.u64.u32 %rd138, %r600; 2026-02-21T09:09:57.5981150Z shl.b64 %rd139, %rd138, 32; 2026-02-21T09:09:57.5981211Z or.b64 %rd140, %rd137, %rd139; 2026-02-21T09:09:57.5981269Z $L__tmp19: 2026-02-21T09:09:57.5981442Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5981502Z mov.b64 {%r672, %r673}, %rd140; 2026-02-21T09:09:57.5981601Z cvt.rn.bf16x2.f32 %r674, %r673, %r672; 2026-02-21T09:09:57.5981661Z $L__tmp20: 2026-02-21T09:09:57.5981896Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5981957Z cvt.u64.u32 %rd141, %r601; 2026-02-21T09:09:57.5982023Z cvt.u64.u32 %rd142, %r602; 2026-02-21T09:09:57.5982080Z shl.b64 %rd143, %rd142, 32; 2026-02-21T09:09:57.5982140Z or.b64 %rd144, %rd141, %rd143; 2026-02-21T09:09:57.5982191Z $L__tmp21: 2026-02-21T09:09:57.5982364Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5982423Z mov.b64 {%r675, %r676}, %rd144; 2026-02-21T09:09:57.5982489Z cvt.rn.bf16x2.f32 %r677, %r676, %r675; 2026-02-21T09:09:57.5982548Z $L__tmp22: 2026-02-21T09:09:57.5982760Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5982819Z cvt.u64.u32 %rd145, %r603; 2026-02-21T09:09:57.5982883Z cvt.u64.u32 %rd146, %r604; 2026-02-21T09:09:57.5982941Z shl.b64 %rd147, %rd146, 32; 2026-02-21T09:09:57.5983000Z or.b64 %rd148, %rd145, %rd147; 2026-02-21T09:09:57.5983052Z $L__tmp23: 2026-02-21T09:09:57.5983227Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5983286Z mov.b64 {%r678, %r679}, %rd148; 2026-02-21T09:09:57.5983352Z cvt.rn.bf16x2.f32 %r680, %r679, %r678; 2026-02-21T09:09:57.5983413Z $L__tmp24: 2026-02-21T09:09:57.5983626Z .loc 2 
291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5983711Z cvt.u64.u32 %rd149, %r605; 2026-02-21T09:09:57.5983775Z cvt.u64.u32 %rd150, %r606; 2026-02-21T09:09:57.5983835Z shl.b64 %rd151, %rd150, 32; 2026-02-21T09:09:57.5983893Z or.b64 %rd152, %rd149, %rd151; 2026-02-21T09:09:57.5983946Z $L__tmp25: 2026-02-21T09:09:57.5984123Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5984182Z mov.b64 {%r681, %r682}, %rd152; 2026-02-21T09:09:57.5984248Z cvt.rn.bf16x2.f32 %r683, %r682, %r681; 2026-02-21T09:09:57.5984334Z $L__tmp26: 2026-02-21T09:09:57.5984541Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5984600Z cvt.u64.u32 %rd153, %r607; 2026-02-21T09:09:57.5984664Z cvt.u64.u32 %rd154, %r608; 2026-02-21T09:09:57.5984725Z shl.b64 %rd155, %rd154, 32; 2026-02-21T09:09:57.5984783Z or.b64 %rd156, %rd153, %rd155; 2026-02-21T09:09:57.5984835Z $L__tmp27: 2026-02-21T09:09:57.5985010Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5985094Z mov.b64 {%r684, %r685}, %rd156; 2026-02-21T09:09:57.5985162Z cvt.rn.bf16x2.f32 %r686, %r685, %r684; 2026-02-21T09:09:57.5985222Z $L__tmp28: 2026-02-21T09:09:57.5985430Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5985491Z cvt.u64.u32 %rd157, %r609; 2026-02-21T09:09:57.5985560Z cvt.u64.u32 %rd158, %r610; 2026-02-21T09:09:57.5985621Z shl.b64 %rd159, %rd158, 32; 2026-02-21T09:09:57.5985679Z or.b64 %rd160, %rd157, %rd159; 2026-02-21T09:09:57.5985734Z $L__tmp29: 2026-02-21T09:09:57.5985909Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5985968Z mov.b64 {%r687, %r688}, %rd160; 2026-02-21T09:09:57.5986034Z cvt.rn.bf16x2.f32 %r689, %r688, %r687; 2026-02-21T09:09:57.5986091Z $L__tmp30: 
2026-02-21T09:09:57.5986303Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5986360Z cvt.u64.u32 %rd161, %r612; 2026-02-21T09:09:57.5986419Z cvt.u64.u32 %rd162, %r613; 2026-02-21T09:09:57.5986484Z shl.b64 %rd163, %rd162, 32; 2026-02-21T09:09:57.5986542Z or.b64 %rd164, %rd161, %rd163; 2026-02-21T09:09:57.5986593Z $L__tmp31: 2026-02-21T09:09:57.5986786Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5986849Z mov.b64 {%r690, %r691}, %rd164; 2026-02-21T09:09:57.5986915Z cvt.rn.bf16x2.f32 %r692, %r691, %r690; 2026-02-21T09:09:57.5986974Z $L__tmp32: 2026-02-21T09:09:57.5987186Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5987244Z cvt.u64.u32 %rd165, %r614; 2026-02-21T09:09:57.5987299Z cvt.u64.u32 %rd166, %r615; 2026-02-21T09:09:57.5987366Z shl.b64 %rd167, %rd166, 32; 2026-02-21T09:09:57.5987425Z or.b64 %rd168, %rd165, %rd167; 2026-02-21T09:09:57.5987475Z $L__tmp33: 2026-02-21T09:09:57.5987650Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5987709Z mov.b64 {%r693, %r694}, %rd168; 2026-02-21T09:09:57.5987775Z cvt.rn.bf16x2.f32 %r695, %r694, %r693; 2026-02-21T09:09:57.5987834Z $L__tmp34: 2026-02-21T09:09:57.5988113Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5988464Z cvt.u64.u32 %rd169, %r616; 2026-02-21T09:09:57.5988632Z cvt.u64.u32 %rd170, %r617; 2026-02-21T09:09:57.5988806Z shl.b64 %rd171, %rd170, 32; 2026-02-21T09:09:57.5988970Z or.b64 %rd172, %rd169, %rd171; 2026-02-21T09:09:57.5989135Z $L__tmp35: 2026-02-21T09:09:57.5989378Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5989695Z mov.b64 {%r696, %r697}, %rd172; 2026-02-21T09:09:57.5989868Z cvt.rn.bf16x2.f32 %r698, %r697, %r696; 
2026-02-21T09:09:57.5990044Z $L__tmp36: 2026-02-21T09:09:57.5990330Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5990671Z cvt.u64.u32 %rd173, %r618; 2026-02-21T09:09:57.5990837Z cvt.u64.u32 %rd174, %r619; 2026-02-21T09:09:57.5990997Z shl.b64 %rd175, %rd174, 32; 2026-02-21T09:09:57.5991188Z or.b64 %rd176, %rd173, %rd175; 2026-02-21T09:09:57.5991347Z $L__tmp37: 2026-02-21T09:09:57.5991631Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5991914Z mov.b64 {%r699, %r700}, %rd176; 2026-02-21T09:09:57.5992096Z cvt.rn.bf16x2.f32 %r701, %r700, %r699; 2026-02-21T09:09:57.5992273Z $L__tmp38: 2026-02-21T09:09:57.5992558Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5992904Z cvt.u64.u32 %rd177, %r620; 2026-02-21T09:09:57.5993087Z cvt.u64.u32 %rd178, %r621; 2026-02-21T09:09:57.5993252Z shl.b64 %rd179, %rd178, 32; 2026-02-21T09:09:57.5993408Z or.b64 %rd180, %rd177, %rd179; 2026-02-21T09:09:57.5993574Z $L__tmp39: 2026-02-21T09:09:57.5993815Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5994144Z mov.b64 {%r702, %r703}, %rd180; 2026-02-21T09:09:57.5994320Z cvt.rn.bf16x2.f32 %r704, %r703, %r702; 2026-02-21T09:09:57.5994488Z $L__tmp40: 2026-02-21T09:09:57.5994773Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5995100Z cvt.u64.u32 %rd181, %r622; 2026-02-21T09:09:57.5995258Z cvt.u64.u32 %rd182, %r623; 2026-02-21T09:09:57.5995408Z shl.b64 %rd183, %rd182, 32; 2026-02-21T09:09:57.5995568Z or.b64 %rd184, %rd181, %rd183; 2026-02-21T09:09:57.5995720Z $L__tmp41: 2026-02-21T09:09:57.5995964Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5996253Z mov.b64 {%r705, %r706}, %rd184; 2026-02-21T09:09:57.5996430Z 
cvt.rn.bf16x2.f32 %r707, %r706, %r705; 2026-02-21T09:09:57.5996604Z $L__tmp42: 2026-02-21T09:09:57.5996925Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5997290Z cvt.u64.u32 %rd185, %r624; 2026-02-21T09:09:57.5997450Z cvt.u64.u32 %rd186, %r625; 2026-02-21T09:09:57.5997621Z shl.b64 %rd187, %rd186, 32; 2026-02-21T09:09:57.5997793Z or.b64 %rd188, %rd185, %rd187; 2026-02-21T09:09:57.5997953Z $L__tmp43: 2026-02-21T09:09:57.5998208Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.5998506Z mov.b64 {%r708, %r709}, %rd188; 2026-02-21T09:09:57.5998691Z cvt.rn.bf16x2.f32 %r710, %r709, %r708; 2026-02-21T09:09:57.5998868Z $L__tmp44: 2026-02-21T09:09:57.5999178Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.5999524Z cvt.u64.u32 %rd189, %r626; 2026-02-21T09:09:57.5999694Z cvt.u64.u32 %rd190, %r627; 2026-02-21T09:09:57.5999866Z shl.b64 %rd191, %rd190, 32; 2026-02-21T09:09:57.6000032Z or.b64 %rd192, %rd189, %rd191; 2026-02-21T09:09:57.6000207Z $L__tmp45: 2026-02-21T09:09:57.6000460Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6000763Z mov.b64 {%r711, %r712}, %rd192; 2026-02-21T09:09:57.6000938Z cvt.rn.bf16x2.f32 %r713, %r712, %r711; 2026-02-21T09:09:57.6001118Z $L__tmp46: 2026-02-21T09:09:57.6001409Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6001791Z cvt.u64.u32 %rd193, %r629; 2026-02-21T09:09:57.6001954Z cvt.u64.u32 %rd194, %r630; 2026-02-21T09:09:57.6002144Z shl.b64 %rd195, %rd194, 32; 2026-02-21T09:09:57.6002316Z or.b64 %rd196, %rd193, %rd195; 2026-02-21T09:09:57.6002476Z $L__tmp47: 2026-02-21T09:09:57.6002736Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6003033Z mov.b64 {%r714, %r715}, 
%rd196; 2026-02-21T09:09:57.6003217Z cvt.rn.bf16x2.f32 %r716, %r715, %r714; 2026-02-21T09:09:57.6003393Z $L__tmp48: 2026-02-21T09:09:57.6003722Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6004074Z cvt.u64.u32 %rd197, %r631; 2026-02-21T09:09:57.6004240Z cvt.u64.u32 %rd198, %r632; 2026-02-21T09:09:57.6004405Z shl.b64 %rd199, %rd198, 32; 2026-02-21T09:09:57.6004572Z or.b64 %rd200, %rd197, %rd199; 2026-02-21T09:09:57.6004730Z $L__tmp49: 2026-02-21T09:09:57.6004970Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6005249Z mov.b64 {%r717, %r718}, %rd200; 2026-02-21T09:09:57.6005448Z cvt.rn.bf16x2.f32 %r719, %r718, %r717; 2026-02-21T09:09:57.6005614Z $L__tmp50: 2026-02-21T09:09:57.6005900Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6006222Z cvt.u64.u32 %rd201, %r633; 2026-02-21T09:09:57.6006384Z cvt.u64.u32 %rd202, %r634; 2026-02-21T09:09:57.6006537Z shl.b64 %rd203, %rd202, 32; 2026-02-21T09:09:57.6006699Z or.b64 %rd204, %rd201, %rd203; 2026-02-21T09:09:57.6006857Z $L__tmp51: 2026-02-21T09:09:57.6007091Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6007383Z mov.b64 {%r720, %r721}, %rd204; 2026-02-21T09:09:57.6007550Z cvt.rn.bf16x2.f32 %r722, %r721, %r720; 2026-02-21T09:09:57.6007723Z $L__tmp52: 2026-02-21T09:09:57.6007998Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6008332Z cvt.u64.u32 %rd205, %r635; 2026-02-21T09:09:57.6008491Z cvt.u64.u32 %rd206, %r636; 2026-02-21T09:09:57.6008644Z shl.b64 %rd207, %rd206, 32; 2026-02-21T09:09:57.6008804Z or.b64 %rd208, %rd205, %rd207; 2026-02-21T09:09:57.6008952Z $L__tmp53: 2026-02-21T09:09:57.6009217Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 
2026-02-21T09:09:57.6009503Z mov.b64 {%r723, %r724}, %rd208; 2026-02-21T09:09:57.6009676Z cvt.rn.bf16x2.f32 %r725, %r724, %r723; 2026-02-21T09:09:57.6009840Z $L__tmp54: 2026-02-21T09:09:57.6010128Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6010461Z cvt.u64.u32 %rd209, %r637; 2026-02-21T09:09:57.6010613Z cvt.u64.u32 %rd210, %r638; 2026-02-21T09:09:57.6010770Z shl.b64 %rd211, %rd210, 32; 2026-02-21T09:09:57.6010924Z or.b64 %rd212, %rd209, %rd211; 2026-02-21T09:09:57.6011083Z $L__tmp55: 2026-02-21T09:09:57.6011317Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6011645Z mov.b64 {%r726, %r727}, %rd212; 2026-02-21T09:09:57.6011811Z cvt.rn.bf16x2.f32 %r728, %r727, %r726; 2026-02-21T09:09:57.6011984Z $L__tmp56: 2026-02-21T09:09:57.6012276Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6012604Z cvt.u64.u32 %rd213, %r639; 2026-02-21T09:09:57.6012767Z cvt.u64.u32 %rd214, %r640; 2026-02-21T09:09:57.6012919Z shl.b64 %rd215, %rd214, 32; 2026-02-21T09:09:57.6013080Z or.b64 %rd216, %rd213, %rd215; 2026-02-21T09:09:57.6013233Z $L__tmp57: 2026-02-21T09:09:57.6013476Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6013764Z mov.b64 {%r729, %r730}, %rd216; 2026-02-21T09:09:57.6013975Z cvt.rn.bf16x2.f32 %r731, %r730, %r729; 2026-02-21T09:09:57.6014151Z $L__tmp58: 2026-02-21T09:09:57.6014430Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6014768Z cvt.u64.u32 %rd217, %r641; 2026-02-21T09:09:57.6014923Z cvt.u64.u32 %rd218, %r642; 2026-02-21T09:09:57.6015083Z shl.b64 %rd219, %rd218, 32; 2026-02-21T09:09:57.6015237Z or.b64 %rd220, %rd217, %rd219; 2026-02-21T09:09:57.6015421Z $L__tmp59: 2026-02-21T09:09:57.6015658Z .loc 1 94 28 // 
co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6015949Z mov.b64 {%r732, %r733}, %rd220; 2026-02-21T09:09:57.6016123Z cvt.rn.bf16x2.f32 %r734, %r733, %r732; 2026-02-21T09:09:57.6016289Z $L__tmp60: 2026-02-21T09:09:57.6016576Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6016904Z cvt.u64.u32 %rd221, %r643; 2026-02-21T09:09:57.6017064Z cvt.u64.u32 %rd222, %r644; 2026-02-21T09:09:57.6017214Z shl.b64 %rd223, %rd222, 32; 2026-02-21T09:09:57.6017402Z or.b64 %rd224, %rd221, %rd223; 2026-02-21T09:09:57.6017561Z $L__tmp61: 2026-02-21T09:09:57.6017795Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6018085Z mov.b64 {%r735, %r736}, %rd224; 2026-02-21T09:09:57.6018254Z cvt.rn.bf16x2.f32 %r737, %r736, %r735; 2026-02-21T09:09:57.6018429Z $L__tmp62: 2026-02-21T09:09:57.6018710Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6019042Z cvt.u64.u32 %rd225, %r646; 2026-02-21T09:09:57.6019193Z cvt.u64.u32 %rd226, %r647; 2026-02-21T09:09:57.6019351Z shl.b64 %rd227, %rd226, 32; 2026-02-21T09:09:57.6019510Z or.b64 %rd228, %rd225, %rd227; 2026-02-21T09:09:57.6019661Z $L__tmp63: 2026-02-21T09:09:57.6019899Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6020180Z mov.b64 {%r738, %r739}, %rd228; 2026-02-21T09:09:57.6020352Z cvt.rn.bf16x2.f32 %r740, %r739, %r738; 2026-02-21T09:09:57.6020515Z $L__tmp64: 2026-02-21T09:09:57.6020803Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6021126Z cvt.u64.u32 %rd229, %r648; 2026-02-21T09:09:57.6021330Z cvt.u64.u32 %rd230, %r649; 2026-02-21T09:09:57.6021492Z shl.b64 %rd231, %rd230, 32; 2026-02-21T09:09:57.6021677Z or.b64 %rd232, %rd229, %rd231; 2026-02-21T09:09:57.6021837Z $L__tmp65: 
2026-02-21T09:09:57.6022067Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6022358Z mov.b64 {%r741, %r742}, %rd232; 2026-02-21T09:09:57.6022523Z cvt.rn.bf16x2.f32 %r743, %r742, %r741; 2026-02-21T09:09:57.6022695Z $L__tmp66: 2026-02-21T09:09:57.6022978Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6023313Z cvt.u64.u32 %rd233, %r650; 2026-02-21T09:09:57.6023472Z cvt.u64.u32 %rd234, %r651; 2026-02-21T09:09:57.6023621Z shl.b64 %rd235, %rd234, 32; 2026-02-21T09:09:57.6023782Z or.b64 %rd236, %rd233, %rd235; 2026-02-21T09:09:57.6023934Z $L__tmp67: 2026-02-21T09:09:57.6024178Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6024465Z mov.b64 {%r744, %r745}, %rd236; 2026-02-21T09:09:57.6024640Z cvt.rn.bf16x2.f32 %r746, %r745, %r744; 2026-02-21T09:09:57.6024811Z $L__tmp68: 2026-02-21T09:09:57.6025094Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6025433Z cvt.u64.u32 %rd237, %r652; 2026-02-21T09:09:57.6025590Z cvt.u64.u32 %rd238, %r653; 2026-02-21T09:09:57.6025748Z shl.b64 %rd239, %rd238, 32; 2026-02-21T09:09:57.6025932Z or.b64 %rd240, %rd237, %rd239; 2026-02-21T09:09:57.6026089Z $L__tmp69: 2026-02-21T09:09:57.6026324Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6026611Z mov.b64 {%r747, %r748}, %rd240; 2026-02-21T09:09:57.6026784Z cvt.rn.bf16x2.f32 %r749, %r748, %r747; 2026-02-21T09:09:57.6026951Z $L__tmp70: 2026-02-21T09:09:57.6027237Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6027592Z cvt.u64.u32 %rd241, %r654; 2026-02-21T09:09:57.6027754Z cvt.u64.u32 %rd242, %r655; 2026-02-21T09:09:57.6027904Z shl.b64 %rd243, %rd242, 32; 2026-02-21T09:09:57.6028068Z or.b64 %rd244, %rd241, %rd243; 
2026-02-21T09:09:57.6028221Z $L__tmp71: 2026-02-21T09:09:57.6028461Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6028752Z mov.b64 {%r750, %r751}, %rd244; 2026-02-21T09:09:57.6028922Z cvt.rn.bf16x2.f32 %r752, %r751, %r750; 2026-02-21T09:09:57.6029094Z $L__tmp72: 2026-02-21T09:09:57.6029404Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6029733Z cvt.u64.u32 %rd245, %r656; 2026-02-21T09:09:57.6029886Z cvt.u64.u32 %rd246, %r657; 2026-02-21T09:09:57.6030046Z shl.b64 %rd247, %rd246, 32; 2026-02-21T09:09:57.6030200Z or.b64 %rd248, %rd245, %rd247; 2026-02-21T09:09:57.6030357Z $L__tmp73: 2026-02-21T09:09:57.6030597Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6030880Z mov.b64 {%r753, %r754}, %rd248; 2026-02-21T09:09:57.6031053Z cvt.rn.bf16x2.f32 %r755, %r754, %r753; 2026-02-21T09:09:57.6031216Z $L__tmp74: 2026-02-21T09:09:57.6031500Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6031865Z cvt.u64.u32 %rd249, %r658; 2026-02-21T09:09:57.6032022Z cvt.u64.u32 %rd250, %r659; 2026-02-21T09:09:57.6032180Z shl.b64 %rd251, %rd250, 32; 2026-02-21T09:09:57.6032334Z or.b64 %rd252, %rd249, %rd251; 2026-02-21T09:09:57.6032493Z $L__tmp75: 2026-02-21T09:09:57.6032729Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6033020Z mov.b64 {%r756, %r757}, %rd252; 2026-02-21T09:09:57.6033217Z cvt.rn.bf16x2.f32 %r758, %r757, %r756; 2026-02-21T09:09:57.6033395Z $L__tmp76: 2026-02-21T09:09:57.6033676Z .loc 2 291 36 // standard.py:291:36 @[ co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:91:40 ] 2026-02-21T09:09:57.6034010Z cvt.u64.u32 %rd253, %r660; 2026-02-21T09:09:57.6034166Z cvt.u64.u32 %rd254, %r661; 2026-02-21T09:09:57.6034316Z shl.b64 %rd255, %rd254, 32; 2026-02-21T09:09:57.6034478Z 
or.b64 %rd256, %rd253, %rd255; 2026-02-21T09:09:57.6034628Z $L__tmp77: 2026-02-21T09:09:57.6034869Z .loc 1 94 28 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:94:28 2026-02-21T09:09:57.6035153Z mov.b64 {%r759, %r760}, %rd256; 2026-02-21T09:09:57.6035327Z cvt.rn.bf16x2.f32 %r761, %r760, %r759; 2026-02-21T09:09:57.6035603Z .loc 1 95 43 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:95:43 2026-02-21T09:09:57.6035930Z st.shared.v4.b32 [%r28], {%r668, %r671, %r674, %r677}; 2026-02-21T09:09:57.6036181Z st.shared.v4.b32 [%r28+8192], {%r716, %r719, %r722, %r725}; 2026-02-21T09:09:57.6036421Z st.shared.v4.b32 [%r29], {%r680, %r683, %r686, %r689}; 2026-02-21T09:09:57.6036659Z st.shared.v4.b32 [%r29+8192], {%r728, %r731, %r734, %r737}; 2026-02-21T09:09:57.6036887Z st.shared.v4.b32 [%r30], {%r692, %r695, %r698, %r701}; 2026-02-21T09:09:57.6037122Z st.shared.v4.b32 [%r30+8192], {%r740, %r743, %r746, %r749}; 2026-02-21T09:09:57.6037348Z st.shared.v4.b32 [%r31], {%r704, %r707, %r710, %r713}; 2026-02-21T09:09:57.6037609Z st.shared.v4.b32 [%r31+8192], {%r752, %r755, %r758, %r761}; 2026-02-21T09:09:57.6037823Z // begin inline asm 2026-02-21T09:09:57.6037985Z fence.proxy.async.shared::cta; 2026-02-21T09:09:57.6038158Z // end inline asm 2026-02-21T09:09:57.6038294Z bar.sync 0; 2026-02-21T09:09:57.6038443Z elect.sync %r762|%p121, -1; 2026-02-21T09:09:57.6038607Z and.pred %p119, %p1, %p121; 2026-02-21T09:09:57.6038768Z // begin inline asm 2026-02-21T09:09:57.6039039Z @%p119 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd128, {%r663, %r664}], [%r52]; 2026-02-21T09:09:57.6039362Z // end inline asm 2026-02-21T09:09:57.6039519Z cp.async.bulk.commit_group; 2026-02-21T09:09:57.6039692Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:57.6039867Z bar.sync 0; 2026-02-21T09:09:57.6040023Z $L__BB0_8: // %._crit_edge 2026-02-21T09:09:57.6040325Z .loc 1 28 4 // co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py:28:4 2026-02-21T09:09:57.6040606Z bar.sync 0; 
2026-02-21T09:09:57.6040750Z // begin inline asm 2026-02-21T09:09:57.6041003Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r763, 128; 2026-02-21T09:09:57.6041241Z // end inline asm 2026-02-21T09:09:57.6041384Z ret; 2026-02-21T09:09:57.6041509Z $L__tmp78: 2026-02-21T09:09:57.6041669Z $L__func_end0: 2026-02-21T09:09:57.6041836Z // -- End function 2026-02-21T09:09:57.6042041Z } 2026-02-21T09:09:57.6042326Z .file 1 "/tmp/torchinductor_root/o2/co2feyighg5ywo6sd7yyhzhqw3tjsqmonidjxqrp3fbaxioziabv.py" 2026-02-21T09:09:57.6042789Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:57.6043096Z .section .debug_abbrev 2026-02-21T09:09:57.6043256Z { 2026-02-21T09:09:57.6043420Z .b8 1 // Abbreviation Code 2026-02-21T09:09:57.6043654Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:57.6043891Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:57.6044114Z .b8 37 // DW_AT_producer 2026-02-21T09:09:57.6044339Z .b8 8 // DW_FORM_string 2026-02-21T09:09:57.6044551Z .b8 19 // DW_AT_language 2026-02-21T09:09:57.6044772Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:57.6044989Z .b8 3 // DW_AT_name 2026-02-21T09:09:57.6045227Z .b8 8 // DW_FORM_string 2026-02-21T09:09:57.6045448Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:57.6045662Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:57.6045878Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:57.6046089Z .b8 8 // DW_FORM_string 2026-02-21T09:09:57.6046306Z .b8 0 // EOM(1) 2026-02-21T09:09:57.6046515Z .b8 0 // EOM(2) 2026-02-21T09:09:57.6046728Z .b8 2 // Abbreviation Code 2026-02-21T09:09:57.6046971Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:57.6047186Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:57.6047401Z .b8 3 // DW_AT_name 2026-02-21T09:09:57.6047604Z .b8 8 // DW_FORM_string 2026-02-21T09:09:57.6047821Z .b8 32 // DW_AT_inline 2026-02-21T09:09:57.6048032Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:57.6048239Z .b8 0 // EOM(1) 2026-02-21T09:09:57.6048446Z .b8 0 // EOM(2) 2026-02-21T09:09:57.6048645Z .b8 3 // 
Abbreviation Code 2026-02-21T09:09:57.6048863Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:57.6049070Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:57.6049309Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:57.6049510Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:57.6049718Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:57.6049925Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:57.6050131Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:57.6050349Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:57.6050581Z .b8 0 // EOM(1) 2026-02-21T09:09:57.6050776Z .b8 0 // EOM(2) 2026-02-21T09:09:57.6050971Z .b8 4 // Abbreviation Code 2026-02-21T09:09:57.6051198Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:57.6051418Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:57.6051652Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:57.6051871Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:57.6052092Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:57.6052290Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:57.6052485Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:57.6052693Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:57.6052897Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:57.6053095Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:57.6053305Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:57.6053507Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:57.6053714Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:57.6053932Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:57.6054131Z .b8 0 // EOM(1) 2026-02-21T09:09:57.6054324Z .b8 0 // EOM(2) 2026-02-21T09:09:57.6054509Z .b8 0 // EOM(3) 2026-02-21T09:09:57.6054685Z } 2026-02-21T09:09:57.6054808Z .section .debug_info 2026-02-21T09:09:57.6054953Z { 2026-02-21T09:09:57.6055099Z .b32 178 // Length of Unit 2026-02-21T09:09:57.6055348Z .b8 2 // DWARF version number 2026-02-21T09:09:57.6055542Z .b8 0 2026-02-21T09:09:57.6055730Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:57.6055993Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:57.6056228Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:57.6056471Z .b8 116 // DW_AT_producer 2026-02-21T09:09:57.6056656Z .b8 114 2026-02-21T09:09:57.6056787Z .b8 105 2026-02-21T09:09:57.6056906Z .b8 116 2026-02-21T09:09:57.6057033Z .b8 111 2026-02-21T09:09:57.6057144Z .b8 110 2026-02-21T09:09:57.6057264Z .b8 0 2026-02-21T09:09:57.6057404Z .b8 2 // DW_AT_language 2026-02-21T09:09:57.6057589Z .b8 0 2026-02-21T09:09:57.6057732Z .b8 99 // DW_AT_name 2026-02-21T09:09:57.6057904Z .b8 111 2026-02-21T09:09:57.6058023Z .b8 50 2026-02-21T09:09:57.6058140Z .b8 102 2026-02-21T09:09:57.6058260Z .b8 101 2026-02-21T09:09:57.6058373Z .b8 121 2026-02-21T09:09:57.6058492Z .b8 105 2026-02-21T09:09:57.6058604Z .b8 103 2026-02-21T09:09:57.6058724Z .b8 104 2026-02-21T09:09:57.6058835Z .b8 103 2026-02-21T09:09:57.6058956Z .b8 53 2026-02-21T09:09:57.6059069Z .b8 121 2026-02-21T09:09:57.6059189Z .b8 119 2026-02-21T09:09:57.6059299Z .b8 111 2026-02-21T09:09:57.6059416Z .b8 54 2026-02-21T09:09:57.6059535Z .b8 115 2026-02-21T09:09:57.6059645Z .b8 100 2026-02-21T09:09:57.6059762Z .b8 55 2026-02-21T09:09:57.6059873Z .b8 121 2026-02-21T09:09:57.6060020Z .b8 121 2026-02-21T09:09:57.6060133Z .b8 104 2026-02-21T09:09:57.6060249Z .b8 122 2026-02-21T09:09:57.6060358Z .b8 104 2026-02-21T09:09:57.6060477Z .b8 113 2026-02-21T09:09:57.6060588Z .b8 119 2026-02-21T09:09:57.6060704Z .b8 51 2026-02-21T09:09:57.6060815Z .b8 116 2026-02-21T09:09:57.6060932Z .b8 106 2026-02-21T09:09:57.6061043Z .b8 115 2026-02-21T09:09:57.6061160Z .b8 113 2026-02-21T09:09:57.6061270Z .b8 109 2026-02-21T09:09:57.6061388Z .b8 111 2026-02-21T09:09:57.6061507Z .b8 110 2026-02-21T09:09:57.6061687Z .b8 105 2026-02-21T09:09:57.6061806Z .b8 100 2026-02-21T09:09:57.6061920Z .b8 106 2026-02-21T09:09:57.6062040Z .b8 120 2026-02-21T09:09:57.6062152Z .b8 113 2026-02-21T09:09:57.6062272Z .b8 114 2026-02-21T09:09:57.6062384Z .b8 112 
2026-02-21T09:09:57.6062506Z .b8 51 2026-02-21T09:09:57.6062619Z .b8 102 2026-02-21T09:09:57.6062738Z .b8 98 2026-02-21T09:09:57.6062853Z .b8 97 2026-02-21T09:09:57.6062980Z .b8 120 2026-02-21T09:09:57.6063099Z .b8 105 2026-02-21T09:09:57.6063224Z .b8 111 2026-02-21T09:09:57.6063345Z .b8 122 2026-02-21T09:09:57.6063457Z .b8 105 2026-02-21T09:09:57.6063576Z .b8 97 2026-02-21T09:09:57.6063687Z .b8 98 2026-02-21T09:09:57.6063835Z .b8 118 2026-02-21T09:09:57.6063948Z .b8 46 2026-02-21T09:09:57.6064069Z .b8 112 2026-02-21T09:09:57.6064181Z .b8 121 2026-02-21T09:09:57.6064300Z .b8 0 2026-02-21T09:09:57.6064457Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:57.6064680Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:57.6064860Z .b8 116 2026-02-21T09:09:57.6064978Z .b8 109 2026-02-21T09:09:57.6065099Z .b8 112 2026-02-21T09:09:57.6065211Z .b8 47 2026-02-21T09:09:57.6065332Z .b8 116 2026-02-21T09:09:57.6065443Z .b8 111 2026-02-21T09:09:57.6065563Z .b8 114 2026-02-21T09:09:57.6065674Z .b8 99 2026-02-21T09:09:57.6065793Z .b8 104 2026-02-21T09:09:57.6065904Z .b8 105 2026-02-21T09:09:57.6066021Z .b8 110 2026-02-21T09:09:57.6066131Z .b8 100 2026-02-21T09:09:57.6066248Z .b8 117 2026-02-21T09:09:57.6066358Z .b8 99 2026-02-21T09:09:57.6066479Z .b8 116 2026-02-21T09:09:57.6066589Z .b8 111 2026-02-21T09:09:57.6066708Z .b8 114 2026-02-21T09:09:57.6066818Z .b8 95 2026-02-21T09:09:57.6066936Z .b8 114 2026-02-21T09:09:57.6067054Z .b8 111 2026-02-21T09:09:57.6067162Z .b8 111 2026-02-21T09:09:57.6067277Z .b8 116 2026-02-21T09:09:57.6067387Z .b8 47 2026-02-21T09:09:57.6067503Z .b8 111 2026-02-21T09:09:57.6067614Z .b8 50 2026-02-21T09:09:57.6067735Z .b8 0 2026-02-21T09:09:57.6067920Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:57.6068158Z .b8 95 // DW_AT_name 2026-02-21T09:09:57.6068333Z .b8 104 2026-02-21T09:09:57.6068448Z .b8 101 2026-02-21T09:09:57.6068558Z .b8 108 2026-02-21T09:09:57.6068674Z .b8 105 2026-02-21T09:09:57.6068790Z .b8 111 2026-02-21T09:09:57.6068901Z 
.b8 110 2026-02-21T09:09:57.6069020Z .b8 95 2026-02-21T09:09:57.6069129Z .b8 109 2026-02-21T09:09:57.6069247Z .b8 97 2026-02-21T09:09:57.6069352Z .b8 116 2026-02-21T09:09:57.6069470Z .b8 109 2026-02-21T09:09:57.6069581Z .b8 117 2026-02-21T09:09:57.6069697Z .b8 108 2026-02-21T09:09:57.6069807Z .b8 95 2026-02-21T09:09:57.6069923Z .b8 98 2026-02-21T09:09:57.6070032Z .b8 102 2026-02-21T09:09:57.6070142Z .b8 49 2026-02-21T09:09:57.6070251Z .b8 54 2026-02-21T09:09:57.6070361Z .b8 95 2026-02-21T09:09:57.6070470Z .b8 105 2026-02-21T09:09:57.6070576Z .b8 110 2026-02-21T09:09:57.6070685Z .b8 116 2026-02-21T09:09:57.6070789Z .b8 52 2026-02-21T09:09:57.6070902Z .b8 0 2026-02-21T09:09:57.6071035Z .b8 1 // DW_AT_inline 2026-02-21T09:09:57.6071256Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:57.6071491Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:57.6071740Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:57.6071964Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:57.6072215Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:57.6072502Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:57.6072723Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:57.6072944Z .b64 $L__tmp77 // DW_AT_high_pc 2026-02-21T09:09:57.6073153Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:57.6073360Z .b8 91 // DW_AT_call_line 2026-02-21T09:09:57.6073576Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:57.6073819Z .b8 0 // End Of Children Mark 2026-02-21T09:09:57.6074043Z .b8 0 // End Of Children Mark 2026-02-21T09:09:57.6074227Z } 2026-02-21T09:09:57.6074363Z .section .debug_macinfo { } 2026-02-21T09:09:57.6074469Z 2026-02-21T09:09:57.6074545Z ================================================================ 2026-02-21T09:09:57.6074784Z please share the reproducer above with Triton project. 2026-02-21T09:09:57.7579240Z 2026-02-21T09:09:57.7581020Z [184s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:57.7583950Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:57.7585125Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:09:57.7585384Z `ptxas` stderr: 2026-02-21T09:09:57.7585826Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 297 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:09:57.7586318Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:57.7586473Z 2026-02-21T09:09:57.7586863Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp8i6luajm.ptx -o /tmp/tmp8i6luajm.ptx.o 2026-02-21T09:09:57.7587299Z 2026-02-21T09:09:57.7587428Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:57.7587939Z 2026-02-21T09:09:57.7587953Z 2026-02-21T09:09:57.7588045Z ================================================================ 2026-02-21T09:09:57.7588266Z Internal Triton PTX codegen error 2026-02-21T09:09:57.7588456Z `ptxas` stderr: 2026-02-21T09:09:57.7588901Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 297 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:09:57.7589401Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:09:57.7589553Z 2026-02-21T09:09:57.7589947Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp8i6luajm.ptx -o /tmp/tmp8i6luajm.ptx.o 2026-02-21T09:09:57.7590386Z 2026-02-21T09:09:57.7590389Z 2026-02-21T09:09:57.7590447Z // 2026-02-21T09:09:57.7590602Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:09:57.7590789Z // 2026-02-21T09:09:57.7590861Z 2026-02-21T09:09:57.7590922Z .version 8.7 2026-02-21T09:09:57.7591070Z .target sm_100a 2026-02-21T09:09:57.7591209Z .address_size 64 2026-02-21T09:09:57.7591301Z 2026-02-21T09:09:57.7591450Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:09:57.7591782Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:09:57.7592018Z // @_helion_matmul_bf16_int4 2026-02-21T09:09:57.7592249Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:09:57.7592501Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:09:57.7592845Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:09:57.7593136Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:09:57.7593434Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:09:57.7593726Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:09:57.7593966Z ) 2026-02-21T09:09:57.7594098Z .reqntid 128 2026-02-21T09:09:57.7594296Z .maxnreg 32 2026-02-21T09:09:57.7594429Z { 2026-02-21T09:09:57.7594559Z .reg .pred %p<99>; 2026-02-21T09:09:57.7594721Z .reg .b16 %rs<193>; 2026-02-21T09:09:57.7594874Z .reg .b32 %r<620>; 2026-02-21T09:09:57.7595028Z .reg .b64 %rd<220>; 2026-02-21T09:09:57.7595174Z $L__func_begin0: 2026-02-21T09:09:57.7595268Z 2026-02-21T09:09:57.7595325Z // %bb.0: 2026-02-21T09:09:57.7595572Z .loc 1 19 0 // 
c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:19 2026-02-21T09:09:57.7595882Z mov.u32 %r1, %tid.x; 2026-02-21T09:09:57.7596088Z ld.param.b64 %rd16, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:09:57.7596343Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:09:57.7596564Z ld.param.b64 %rd34, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:09:57.7596787Z mov.b32 %r54, global_smem; 2026-02-21T09:09:57.7596966Z // begin inline asm 2026-02-21T09:09:57.7597212Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r54], 256; 2026-02-21T09:09:57.7597477Z // end inline asm 2026-02-21T09:09:57.7597578Z ld.param.b64 %rd51, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:09:57.7597637Z bar.sync 0; 2026-02-21T09:09:57.7597720Z ld.shared.b32 %r613, [global_smem]; 2026-02-21T09:09:57.7597777Z bar.sync 0; 2026-02-21T09:09:57.7597849Z // begin inline asm 2026-02-21T09:09:57.7597975Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:09:57.7598033Z // end inline asm 2026-02-21T09:09:57.7598201Z .loc 1 21 66 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:21:66 2026-02-21T09:09:57.7598272Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:09:57.7598334Z mov.u32 %r71, %ctaid.y; 2026-02-21T09:09:57.7598394Z mov.u32 %r72, %ctaid.z; 2026-02-21T09:09:57.7598455Z mov.u32 %r73, %nctaid.x; 2026-02-21T09:09:57.7598522Z mov.u32 %r74, %nctaid.y; 2026-02-21T09:09:57.7598589Z mad.lo.s32 %r75, %r72, %r74, %r71; 2026-02-21T09:09:57.7598654Z mad.lo.s32 %r76, %r75, %r73, %r3; 2026-02-21T09:09:57.7598752Z shl.b32 %r77, %r76, 8; 2026-02-21T09:09:57.7598814Z cvt.s64.s32 %rd52, %r77; 2026-02-21T09:09:57.7598876Z add.s64 %rd30, %rd51, %rd52; 2026-02-21T09:09:57.7598934Z shl.b32 %r78, %r1, 2; 2026-02-21T09:09:57.7599002Z add.s32 %r55, %r54, %r78; 2026-02-21T09:09:57.7599057Z mov.b32 %r64, 0; 2026-02-21T09:09:57.7599113Z // begin inline asm 2026-02-21T09:09:57.7599189Z @%p1 st.shared.b32 [ %r55 + 0 ], %r64; 2026-02-21T09:09:57.7599245Z // end inline asm 
2026-02-21T09:09:57.7599306Z bar.warp.sync -1; 2026-02-21T09:09:57.7599369Z setp.eq.b32 %p91, %r1, 0; 2026-02-21T09:09:57.7599437Z cvt.u64.u32 %rd15, %r54; 2026-02-21T09:09:57.7599495Z // begin inline asm 2026-02-21T09:09:57.7599665Z @%p91 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd16; 2026-02-21T09:09:57.7599727Z // end inline asm 2026-02-21T09:09:57.7599783Z // begin inline asm 2026-02-21T09:09:57.7599927Z @%p91 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T09:09:57.7599992Z // end inline asm 2026-02-21T09:09:57.7600048Z mov.b32 %r57, 128; 2026-02-21T09:09:57.7600103Z // begin inline asm 2026-02-21T09:09:57.7600255Z @%p91 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r57; 2026-02-21T09:09:57.7600318Z // end inline asm 2026-02-21T09:09:57.7600372Z mov.b32 %r152, 16; 2026-02-21T09:09:57.7600427Z // begin inline asm 2026-02-21T09:09:57.7600585Z @%p91 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r152; 2026-02-21T09:09:57.7600667Z // end inline asm 2026-02-21T09:09:57.7600724Z mov.b32 %r59, 8192; 2026-02-21T09:09:57.7600789Z // begin inline asm 2026-02-21T09:09:57.7600946Z @%p91 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r59; 2026-02-21T09:09:57.7601000Z // end inline asm 2026-02-21T09:09:57.7601054Z mov.b32 %r60, 512; 2026-02-21T09:09:57.7601115Z // begin inline asm 2026-02-21T09:09:57.7601267Z @%p91 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r60; 2026-02-21T09:09:57.7601348Z // end inline asm 2026-02-21T09:09:57.7601410Z mov.b64 %rd23, 8192; 2026-02-21T09:09:57.7601465Z // begin inline asm 2026-02-21T09:09:57.7601662Z @%p91 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd23; 2026-02-21T09:09:57.7601724Z // end inline asm 2026-02-21T09:09:57.7601777Z mov.b32 %r61, 1; 2026-02-21T09:09:57.7601833Z // begin inline asm 
2026-02-21T09:09:57.7602006Z @%p91 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r61; 2026-02-21T09:09:57.7602069Z // end inline asm 2026-02-21T09:09:57.7602125Z // begin inline asm 2026-02-21T09:09:57.7602319Z @%p91 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r61; 2026-02-21T09:09:57.7602383Z // end inline asm 2026-02-21T09:09:57.7602438Z // begin inline asm 2026-02-21T09:09:57.7602587Z @%p91 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:57.7602650Z // end inline asm 2026-02-21T09:09:57.7602706Z // begin inline asm 2026-02-21T09:09:57.7602870Z @%p91 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:57.7602924Z // end inline asm 2026-02-21T09:09:57.7602985Z // begin inline asm 2026-02-21T09:09:57.7603136Z @%p91 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x3; 2026-02-21T09:09:57.7603190Z // end inline asm 2026-02-21T09:09:57.7603253Z // begin inline asm 2026-02-21T09:09:57.7603397Z @%p91 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:57.7603454Z // end inline asm 2026-02-21T09:09:57.7603518Z // begin inline asm 2026-02-21T09:09:57.7603780Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd30 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T09:09:57.7603836Z // end inline asm 2026-02-21T09:09:57.7603899Z // begin inline asm 2026-02-21T09:09:57.7604053Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd30 + 0 ], 0x80; 2026-02-21T09:09:57.7604130Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:57.7604205Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:57.7604268Z // end inline asm 2026-02-21T09:09:57.7604322Z bar.sync 0; 2026-02-21T09:09:57.7604387Z cvta.global.u64 %rd71, %rd30; 2026-02-21T09:09:57.7604571Z .loc 1 23 67 // 
c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:23:67 2026-02-21T09:09:57.7604634Z add.s64 %rd48, %rd30, 128; 2026-02-21T09:09:57.7604690Z bar.sync 0; 2026-02-21T09:09:57.7604747Z // begin inline asm 2026-02-21T09:09:57.7604823Z @%p1 st.shared.b32 [ %r55 + 0 ], %r64; 2026-02-21T09:09:57.7604879Z // end inline asm 2026-02-21T09:09:57.7604938Z bar.warp.sync -1; 2026-02-21T09:09:57.7605002Z // begin inline asm 2026-02-21T09:09:57.7605163Z @%p91 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd34; 2026-02-21T09:09:57.7605219Z // end inline asm 2026-02-21T09:09:57.7605284Z // begin inline asm 2026-02-21T09:09:57.7605426Z @%p91 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T09:09:57.7605482Z // end inline asm 2026-02-21T09:09:57.7605539Z mov.b32 %r65, 64; 2026-02-21T09:09:57.7605604Z // begin inline asm 2026-02-21T09:09:57.7605753Z @%p91 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r65; 2026-02-21T09:09:57.7605810Z // end inline asm 2026-02-21T09:09:57.7605874Z // begin inline asm 2026-02-21T09:09:57.7606020Z @%p91 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r65; 2026-02-21T09:09:57.7606100Z // end inline asm 2026-02-21T09:09:57.7606165Z // begin inline asm 2026-02-21T09:09:57.7606321Z @%p91 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r59; 2026-02-21T09:09:57.7606377Z // end inline asm 2026-02-21T09:09:57.7606433Z mov.b32 %r68, 4096; 2026-02-21T09:09:57.7606496Z // begin inline asm 2026-02-21T09:09:57.7606650Z @%p91 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r68; 2026-02-21T09:09:57.7606730Z // end inline asm 2026-02-21T09:09:57.7606796Z mov.b64 %rd41, 16384; 2026-02-21T09:09:57.7606850Z // begin inline asm 2026-02-21T09:09:57.7607017Z @%p91 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd41; 2026-02-21T09:09:57.7607078Z // end inline asm 
2026-02-21T09:09:57.7607134Z // begin inline asm 2026-02-21T09:09:57.7607303Z @%p91 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r61; 2026-02-21T09:09:57.7607359Z // end inline asm 2026-02-21T09:09:57.7607423Z // begin inline asm 2026-02-21T09:09:57.7607610Z @%p91 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r61; 2026-02-21T09:09:57.7607665Z // end inline asm 2026-02-21T09:09:57.7607730Z // begin inline asm 2026-02-21T09:09:57.7607879Z @%p91 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0xa; 2026-02-21T09:09:57.7607935Z // end inline asm 2026-02-21T09:09:57.7608000Z // begin inline asm 2026-02-21T09:09:57.7608166Z @%p91 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:57.7608221Z // end inline asm 2026-02-21T09:09:57.7608282Z // begin inline asm 2026-02-21T09:09:57.7608434Z @%p91 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x3; 2026-02-21T09:09:57.7608490Z // end inline asm 2026-02-21T09:09:57.7608544Z // begin inline asm 2026-02-21T09:09:57.7608695Z @%p91 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:09:57.7608752Z // end inline asm 2026-02-21T09:09:57.7608809Z // begin inline asm 2026-02-21T09:09:57.7609072Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd48 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T09:09:57.7609127Z // end inline asm 2026-02-21T09:09:57.7609182Z // begin inline asm 2026-02-21T09:09:57.7609365Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd48 + 0 ], 0x80; 2026-02-21T09:09:57.7609439Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:09:57.7609512Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:09:57.7609566Z // end inline asm 2026-02-21T09:09:57.7609627Z bar.sync 0; 2026-02-21T09:09:57.7609693Z cvta.global.u64 %rd89, %rd48; 2026-02-21T09:09:57.7609863Z .loc 1 31 88 // 
c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:31:88 2026-02-21T09:09:57.7609935Z setp.gt.u32 %p39, %r3, 4095; 2026-02-21T09:09:57.7609994Z @%p39 bra $L__BB0_8; 2026-02-21T09:09:57.7610075Z // %bb.1: // %.lr.ph 2026-02-21T09:09:57.7610254Z .loc 1 0 88 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:0:88 2026-02-21T09:09:57.7610355Z ld.param.b64 %rd14, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:09:57.7610522Z .loc 1 57 38 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:57:38 2026-02-21T09:09:57.7610591Z shl.b32 %r188, %r1, 3; 2026-02-21T09:09:57.7610652Z and.b32 %r189, %r188, 24; 2026-02-21T09:09:57.7610816Z .loc 1 44 45 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:44:45 2026-02-21T09:09:57.7610876Z bfe.u32 %r4, %r1, 2, 5; 2026-02-21T09:09:57.7610943Z shr.u32 %r190, %r1, 5; 2026-02-21T09:09:57.7611002Z and.b32 %r5, %r1, 127; 2026-02-21T09:09:57.7611057Z shl.b32 %r6, %r5, 4; 2026-02-21T09:09:57.7611125Z add.s32 %r192, %r54, 16384; 2026-02-21T09:09:57.7611187Z add.s32 %r309, %r192, %r6; 2026-02-21T09:09:57.7611273Z add.s32 %r311, %r309, 2048; 2026-02-21T09:09:57.7611332Z add.s32 %r295, %r613, 128; 2026-02-21T09:09:57.7611398Z and.b32 %r193, %r1, 15; 2026-02-21T09:09:57.7611457Z shl.b32 %r194, %r193, 6; 2026-02-21T09:09:57.7611513Z and.b32 %r195, %r1, 96; 2026-02-21T09:09:57.7611605Z shl.b32 %r196, %r195, 5; 2026-02-21T09:09:57.7611664Z and.b32 %r197, %r1, 16; 2026-02-21T09:09:57.7611720Z shl.b32 %r198, %r197, 1; 2026-02-21T09:09:57.7611779Z or.b32 %r199, %r194, %r196; 2026-02-21T09:09:57.7611873Z or.b32 %r10, %r199, %r198; 2026-02-21T09:09:57.7611929Z shl.b32 %r200, %r5, 7; 2026-02-21T09:09:57.7611986Z shl.b32 %r201, %r1, 4; 2026-02-21T09:09:57.7612051Z and.b32 %r202, %r201, 112; 2026-02-21T09:09:57.7612110Z or.b32 %r203, %r200, %r202; 2026-02-21T09:09:57.7612168Z add.s32 %r11, %r54, %r203; 2026-02-21T09:09:57.7612226Z xor.b32 %r204, %r203, 16; 2026-02-21T09:09:57.7612292Z add.s32 %r12, %r54, %r204; 
2026-02-21T09:09:57.7612350Z xor.b32 %r205, %r203, 32; 2026-02-21T09:09:57.7612408Z add.s32 %r13, %r54, %r205; 2026-02-21T09:09:57.7612473Z xor.b32 %r206, %r203, 48; 2026-02-21T09:09:57.7612555Z add.s32 %r14, %r54, %r206; 2026-02-21T09:09:57.7612614Z xor.b32 %r207, %r203, 64; 2026-02-21T09:09:57.7612671Z add.s32 %r15, %r54, %r207; 2026-02-21T09:09:57.7612735Z xor.b32 %r208, %r203, 80; 2026-02-21T09:09:57.7612792Z add.s32 %r16, %r54, %r208; 2026-02-21T09:09:57.7612849Z xor.b32 %r209, %r203, 96; 2026-02-21T09:09:57.7612916Z add.s32 %r17, %r54, %r209; 2026-02-21T09:09:57.7612974Z xor.b32 %r210, %r203, 112; 2026-02-21T09:09:57.7613031Z add.s32 %r18, %r54, %r210; 2026-02-21T09:09:57.7613099Z bfe.u32 %r211, %r54, 4, 14; 2026-02-21T09:09:57.7613159Z cvt.u64.u32 %rd59, %r211; 2026-02-21T09:09:57.7613232Z or.b64 %rd64, %rd59, 4611686293372403712; 2026-02-21T09:09:57.7613291Z add.s32 %r212, %r54, 32; 2026-02-21T09:09:57.7613365Z bfe.u32 %r213, %r212, 4, 14; 2026-02-21T09:09:57.7613424Z cvt.u64.u32 %rd60, %r213; 2026-02-21T09:09:57.7613493Z or.b64 %rd65, %rd60, 4611686293372403712; 2026-02-21T09:09:57.7613564Z add.s32 %r214, %r54, 64; 2026-02-21T09:09:57.7613625Z bfe.u32 %r215, %r214, 4, 14; 2026-02-21T09:09:57.7613686Z cvt.u64.u32 %rd61, %r215; 2026-02-21T09:09:57.7613752Z or.b64 %rd66, %rd61, 4611686293372403712; 2026-02-21T09:09:57.7613818Z add.s32 %r216, %r54, 96; 2026-02-21T09:09:57.7613877Z bfe.u32 %r217, %r216, 4, 14; 2026-02-21T09:09:57.7613935Z cvt.u64.u32 %rd62, %r217; 2026-02-21T09:09:57.7614033Z or.b64 %rd67, %rd62, 4611686293372403712; 2026-02-21T09:09:57.7614095Z or.b32 %r19, %r189, 64; 2026-02-21T09:09:57.7614152Z shl.b32 %r218, %r193, 7; 2026-02-21T09:09:57.7614210Z shl.b32 %r219, %r195, 6; 2026-02-21T09:09:57.7614274Z shl.b32 %r220, %r197, 9; 2026-02-21T09:09:57.7614331Z or.b32 %r221, %r218, %r219; 2026-02-21T09:09:57.7614388Z or.b32 %r222, %r202, %r220; 2026-02-21T09:09:57.7614454Z or.b32 %r223, %r221, %r222; 2026-02-21T09:09:57.7614512Z xor.b32 %r224, 
%r223, 16; 2026-02-21T09:09:57.7614568Z xor.b32 %r225, %r223, 32; 2026-02-21T09:09:57.7614627Z xor.b32 %r226, %r223, 48; 2026-02-21T09:09:57.7614690Z xor.b32 %r227, %r223, 64; 2026-02-21T09:09:57.7614748Z xor.b32 %r228, %r223, 80; 2026-02-21T09:09:57.7614805Z xor.b32 %r229, %r223, 96; 2026-02-21T09:09:57.7614870Z xor.b32 %r230, %r223, 112; 2026-02-21T09:09:57.7614928Z add.s32 %r231, %r192, %r10; 2026-02-21T09:09:57.7614985Z xor.b32 %r28, %r5, 112; 2026-02-21T09:09:57.7615050Z add.s32 %r156, %r54, 24576; 2026-02-21T09:09:57.7615109Z add.s32 %r232, %r156, %r28; 2026-02-21T09:09:57.7615167Z xor.b32 %r29, %r5, 96; 2026-02-21T09:09:57.7615223Z add.s32 %r233, %r156, %r29; 2026-02-21T09:09:57.7615289Z xor.b32 %r30, %r5, 80; 2026-02-21T09:09:57.7615346Z add.s32 %r234, %r156, %r30; 2026-02-21T09:09:57.7615403Z xor.b32 %r31, %r5, 64; 2026-02-21T09:09:57.7615467Z add.s32 %r235, %r156, %r31; 2026-02-21T09:09:57.7615523Z xor.b32 %r32, %r5, 48; 2026-02-21T09:09:57.7615580Z add.s32 %r236, %r156, %r32; 2026-02-21T09:09:57.7615636Z xor.b32 %r33, %r5, 32; 2026-02-21T09:09:57.7615729Z add.s32 %r237, %r156, %r33; 2026-02-21T09:09:57.7615786Z xor.b32 %r34, %r5, 16; 2026-02-21T09:09:57.7615845Z add.s32 %r238, %r156, %r34; 2026-02-21T09:09:57.7615912Z add.s32 %r239, %r156, %r5; 2026-02-21T09:09:57.7615970Z add.s32 %r240, %r54, %r6; 2026-02-21T09:09:57.7616031Z add.s32 %r160, %r240, 20480; 2026-02-21T09:09:57.7616091Z add.s32 %r162, %r240, 22528; 2026-02-21T09:09:57.7616266Z .loc 1 38 33 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:38:33 2026-02-21T09:09:57.7616347Z shr.u32 %r241, %r3, 6; 2026-02-21T09:09:57.7616405Z and.b32 %r242, %r241, 48; 2026-02-21T09:09:57.7616574Z .loc 1 40 64 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:40:64 2026-02-21T09:09:57.7616632Z and.b32 %r243, %r3, 15; 2026-02-21T09:09:57.7616794Z .loc 1 40 30 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:40:30 2026-02-21T09:09:57.7616857Z or.b32 %r244, %r242, %r243; 
2026-02-21T09:09:57.7617040Z .loc 1 42 27 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:42:27 2026-02-21T09:09:57.7617099Z shl.b32 %r315, %r244, 7; 2026-02-21T09:09:57.7617262Z .loc 1 43 27 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:43:27 2026-02-21T09:09:57.7617325Z shl.b32 %r245, %r3, 2; 2026-02-21T09:09:57.7617383Z and.b32 %r510, %r245, 4032; 2026-02-21T09:09:57.7617552Z .loc 1 44 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:44:32 2026-02-21T09:09:57.7617618Z or.b32 %r246, %r510, %r4; 2026-02-21T09:09:57.7617780Z .loc 1 58 53 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:53 2026-02-21T09:09:57.7617837Z shl.b32 %r247, %r246, 10; 2026-02-21T09:09:57.7617899Z $L__tmp0: 2026-02-21T09:09:57.7618123Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7618199Z shfl.sync.idx.b32 %r37, %r190, 0, 31, -1; 2026-02-21T09:09:57.7618263Z shl.b32 %r248, %r37, 21; 2026-02-21T09:09:57.7618324Z and.b32 %r249, %r248, 6291456; 2026-02-21T09:09:57.7618382Z add.s32 %r508, %r249, %r613; 2026-02-21T09:09:57.7618443Z mov.pred %p59, -1; 2026-02-21T09:09:57.7618507Z // begin inline asm 2026-02-21T09:09:57.7618799Z @%p59 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r508 + 0], 64, {%r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64}; 2026-02-21T09:09:57.7618860Z // end inline asm 2026-02-21T09:09:57.7618923Z // begin inline asm 2026-02-21T09:09:57.7619181Z @%p59 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r508 + 16], 64, {%r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64}; 2026-02-21T09:09:57.7619237Z // end inline asm 2026-02-21T09:09:57.7619299Z // begin inline asm 2026-02-21T09:09:57.7619548Z @%p59 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r508 + 32], 64, {%r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64}; 
2026-02-21T09:09:57.7619607Z // end inline asm 2026-02-21T09:09:57.7619664Z // begin inline asm 2026-02-21T09:09:57.7619919Z @%p59 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r508 + 48], 64, {%r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64, %r64}; 2026-02-21T09:09:57.7619974Z // end inline asm 2026-02-21T09:09:57.7620032Z // begin inline asm 2026-02-21T09:09:57.7620114Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:57.7620168Z // end inline asm 2026-02-21T09:09:57.7620223Z bar.sync 0; 2026-02-21T09:09:57.7620283Z $L__tmp1: 2026-02-21T09:09:57.7620447Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7620507Z add.s32 %r615, %r54, 28672; 2026-02-21T09:09:57.7620563Z // begin inline asm 2026-02-21T09:09:57.7620657Z @%p91 mbarrier.init.shared::cta.b64 [%r615], 1; 2026-02-21T09:09:57.7620754Z // end inline asm 2026-02-21T09:09:57.7620809Z bar.sync 0; 2026-02-21T09:09:57.7620875Z add.s32 %r148, %r54, 28680; 2026-02-21T09:09:57.7620933Z // begin inline asm 2026-02-21T09:09:57.7621018Z @%p91 mbarrier.init.shared::cta.b64 [%r148], 1; 2026-02-21T09:09:57.7621073Z // end inline asm 2026-02-21T09:09:57.7621139Z add.s32 %r313, %r54, 28688; 2026-02-21T09:09:57.7621195Z // begin inline asm 2026-02-21T09:09:57.7621279Z @%p91 mbarrier.init.shared::cta.b64 [%r313], 1; 2026-02-21T09:09:57.7621369Z // end inline asm 2026-02-21T09:09:57.7621425Z bar.sync 0; 2026-02-21T09:09:57.7621487Z add.s32 %r150, %r54, 28696; 2026-02-21T09:09:57.7621580Z // begin inline asm 2026-02-21T09:09:57.7621684Z @%p91 mbarrier.init.shared::cta.b64 [%r150], 1; 2026-02-21T09:09:57.7621739Z // end inline asm 2026-02-21T09:09:57.7621910Z .loc 1 58 60 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:60 2026-02-21T09:09:57.7621974Z or.b32 %r250, %r247, %r189; 2026-02-21T09:09:57.7622144Z .loc 1 58 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:32 2026-02-21T09:09:57.7622243Z mad.wide.u32 %rd53, 
%r250, 2, %rd14; 2026-02-21T09:09:57.7622320Z cvt.u64.u32 %rd7, %r247; 2026-02-21T09:09:57.7622384Z add.s64 %rd54, %rd53, 65536; 2026-02-21T09:09:57.7622552Z .loc 1 58 80 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:80 2026-02-21T09:09:57.7622621Z // begin inline asm 2026-02-21T09:09:57.7622747Z cp.async.cg.shared.global [ %r309 + 0 ], [ %rd53 + 0 ], 0x10, %r152; 2026-02-21T09:09:57.7622805Z // end inline asm 2026-02-21T09:09:57.7622865Z // begin inline asm 2026-02-21T09:09:57.7622996Z cp.async.cg.shared.global [ %r311 + 0 ], [ %rd54 + 0 ], 0x10, %r152; 2026-02-21T09:09:57.7623051Z // end inline asm 2026-02-21T09:09:57.7623117Z cp.async.commit_group; 2026-02-21T09:09:57.7623286Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7623343Z bar.sync 0; 2026-02-21T09:09:57.7623398Z // begin inline asm 2026-02-21T09:09:57.7623512Z @%p91 mbarrier.arrive.expect_tx.shared.b64 _, [%r313], 2048; 2026-02-21T09:09:57.7623573Z // end inline asm 2026-02-21T09:09:57.7623735Z .loc 1 64 33 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:64:33 2026-02-21T09:09:57.7623790Z bar.sync 0; 2026-02-21T09:09:57.7623864Z elect.sync %r251|%p54, -1; 2026-02-21T09:09:57.7623957Z and.pred %p49, %p1, %p54; 2026-02-21T09:09:57.7624017Z // begin inline asm 2026-02-21T09:09:57.7624272Z @%p49 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r156], [%rd71, {%r315, %r64}], [%r313]; 2026-02-21T09:09:57.7624327Z // end inline asm 2026-02-21T09:09:57.7624492Z .loc 1 58 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:32 2026-02-21T09:09:57.7624559Z add.s64 %rd56, %rd53, 64; 2026-02-21T09:09:57.7624618Z or.b32 %r252, %r250, 32; 2026-02-21T09:09:57.7624688Z mad.wide.u32 %rd63, %r252, 2, %rd14; 2026-02-21T09:09:57.7624748Z add.s64 %rd57, %rd63, 65536; 2026-02-21T09:09:57.7624920Z .loc 1 58 80 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:80 2026-02-21T09:09:57.7624977Z // 
begin inline asm 2026-02-21T09:09:57.7625091Z cp.async.cg.shared.global [ %r160 + 0 ], [ %rd56 + 0 ], 0x10, %r152; 2026-02-21T09:09:57.7625153Z // end inline asm 2026-02-21T09:09:57.7625210Z // begin inline asm 2026-02-21T09:09:57.7625323Z cp.async.cg.shared.global [ %r162 + 0 ], [ %rd57 + 0 ], 0x10, %r152; 2026-02-21T09:09:57.7625376Z // end inline asm 2026-02-21T09:09:57.7625447Z cp.async.commit_group; 2026-02-21T09:09:57.7625610Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7625665Z bar.sync 0; 2026-02-21T09:09:57.7625727Z // begin inline asm 2026-02-21T09:09:57.7625834Z @%p91 mbarrier.arrive.expect_tx.shared.b64 _, [%r150], 2048; 2026-02-21T09:09:57.7625916Z // end inline asm 2026-02-21T09:09:57.7626084Z .loc 1 64 33 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:64:33 2026-02-21T09:09:57.7626138Z bar.sync 0; 2026-02-21T09:09:57.7626201Z elect.sync %r253|%p55, -1; 2026-02-21T09:09:57.7626266Z and.pred %p51, %p1, %p55; 2026-02-21T09:09:57.7626332Z add.s32 %r165, %r54, 26624; 2026-02-21T09:09:57.7626388Z // begin inline asm 2026-02-21T09:09:57.7626628Z @%p51 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r165], [%rd71, {%r315, %r152}], [%r150]; 2026-02-21T09:09:57.7626716Z // end inline asm 2026-02-21T09:09:57.7626877Z .loc 1 58 80 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:80 2026-02-21T09:09:57.7626941Z cp.async.wait_group 1; 2026-02-21T09:09:57.7627001Z bar.sync 0; 2026-02-21T09:09:57.7627098Z ld.shared.v4.b32 {%r254, %r255, %r256, %r257}, [%r231]; 2026-02-21T09:09:57.7627160Z mov.b32 {%rs1, %rs2}, %r257; 2026-02-21T09:09:57.7627222Z mov.b32 {%rs3, %rs4}, %r256; 2026-02-21T09:09:57.7627288Z mov.b32 {%rs5, %rs6}, %r255; 2026-02-21T09:09:57.7627386Z mov.b32 {%rs7, %rs8}, %r254; 2026-02-21T09:09:57.7627486Z ld.shared.v4.b32 {%r258, %r259, %r260, %r261}, [%r231+16]; 2026-02-21T09:09:57.7627556Z mov.b32 {%rs9, %rs10}, %r261; 
2026-02-21T09:09:57.7627620Z mov.b32 {%rs11, %rs12}, %r260; 2026-02-21T09:09:57.7627681Z mov.b32 {%rs13, %rs14}, %r259; 2026-02-21T09:09:57.7627741Z mov.b32 {%rs15, %rs16}, %r258; 2026-02-21T09:09:57.7627916Z .loc 1 62 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:62:32 2026-02-21T09:09:57.7627977Z cvt.f32.bf16 %r170, %rs7; 2026-02-21T09:09:57.7628038Z cvt.f32.bf16 %r171, %rs8; 2026-02-21T09:09:57.7628105Z cvt.f32.bf16 %r172, %rs5; 2026-02-21T09:09:57.7628163Z cvt.f32.bf16 %r173, %rs6; 2026-02-21T09:09:57.7628221Z cvt.f32.bf16 %r174, %rs3; 2026-02-21T09:09:57.7628284Z cvt.f32.bf16 %r175, %rs4; 2026-02-21T09:09:57.7628341Z cvt.f32.bf16 %r176, %rs1; 2026-02-21T09:09:57.7628400Z cvt.f32.bf16 %r177, %rs2; 2026-02-21T09:09:57.7628460Z cvt.f32.bf16 %r178, %rs15; 2026-02-21T09:09:57.7628527Z cvt.f32.bf16 %r179, %rs16; 2026-02-21T09:09:57.7628586Z cvt.f32.bf16 %r180, %rs13; 2026-02-21T09:09:57.7628643Z cvt.f32.bf16 %r181, %rs14; 2026-02-21T09:09:57.7628707Z cvt.f32.bf16 %r182, %rs11; 2026-02-21T09:09:57.7628764Z cvt.f32.bf16 %r183, %rs12; 2026-02-21T09:09:57.7628821Z cvt.f32.bf16 %r184, %rs9; 2026-02-21T09:09:57.7628902Z cvt.f32.bf16 %r185, %rs10; 2026-02-21T09:09:57.7628967Z $L__tmp2: 2026-02-21T09:09:57.7629183Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7629242Z add.s32 %r330, %r249, %r295; 2026-02-21T09:09:57.7629306Z // begin inline asm 2026-02-21T09:09:57.7629595Z @%p59 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r330 + 0], 16, {%r170, %r171, %r172, %r173, %r174, %r175, %r176, %r177, %r178, %r179, %r180, %r181, %r182, %r183, %r184, %r185}; 2026-02-21T09:09:57.7629653Z // end inline asm 2026-02-21T09:09:57.7629716Z // begin inline asm 2026-02-21T09:09:57.7629788Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:57.7629844Z // end inline asm 2026-02-21T09:09:57.7629899Z bar.sync 0; 2026-02-21T09:09:57.7629963Z $L__tmp3: 2026-02-21T09:09:57.7630131Z .loc 1 51 92 // 
c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7630190Z // begin inline asm 2026-02-21T09:09:57.7630255Z 2026-02-21T09:09:57.7630306Z { 2026-02-21T09:09:57.7630368Z .reg .pred complete; 2026-02-21T09:09:57.7630424Z waitLoop: 2026-02-21T09:09:57.7630550Z mbarrier.try_wait.parity.shared.b64 complete, [%r313], %r64; 2026-02-21T09:09:57.7630616Z @!complete bra.uni waitLoop; 2026-02-21T09:09:57.7630666Z } 2026-02-21T09:09:57.7630670Z 2026-02-21T09:09:57.7630732Z // end inline asm 2026-02-21T09:09:57.7630915Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7631004Z ld.shared.s8 %rs17, [%r239]; 2026-02-21T09:09:57.7631181Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7631244Z shl.b16 %rs18, %rs17, 4; 2026-02-21T09:09:57.7631413Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7631484Z ld.shared.s8 %rs19, [%r238+128]; 2026-02-21T09:09:57.7631696Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7631786Z shl.b16 %rs20, %rs19, 4; 2026-02-21T09:09:57.7631953Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7632031Z ld.shared.s8 %rs21, [%r237+256]; 2026-02-21T09:09:57.7632200Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7632260Z shl.b16 %rs22, %rs21, 4; 2026-02-21T09:09:57.7632436Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7632528Z ld.shared.s8 %rs23, [%r236+384]; 2026-02-21T09:09:57.7632697Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7632765Z shl.b16 %rs24, %rs23, 4; 2026-02-21T09:09:57.7632940Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7633009Z ld.shared.s8 %rs25, [%r235+512]; 
2026-02-21T09:09:57.7633184Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7633247Z shl.b16 %rs26, %rs25, 4; 2026-02-21T09:09:57.7633417Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7633483Z ld.shared.s8 %rs27, [%r234+640]; 2026-02-21T09:09:57.7633661Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7633724Z shl.b16 %rs28, %rs27, 4; 2026-02-21T09:09:57.7633898Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7633972Z ld.shared.s8 %rs29, [%r233+768]; 2026-02-21T09:09:57.7634141Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7634230Z shl.b16 %rs30, %rs29, 4; 2026-02-21T09:09:57.7634407Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7634473Z ld.shared.s8 %rs31, [%r232+896]; 2026-02-21T09:09:57.7634640Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7634707Z shl.b16 %rs32, %rs31, 4; 2026-02-21T09:09:57.7634875Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7634946Z ld.shared.s8 %rs33, [%r239+1024]; 2026-02-21T09:09:57.7635113Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7635182Z shl.b16 %rs34, %rs33, 4; 2026-02-21T09:09:57.7635347Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7635415Z ld.shared.s8 %rs35, [%r238+1152]; 2026-02-21T09:09:57.7635590Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7635651Z shl.b16 %rs36, %rs35, 4; 2026-02-21T09:09:57.7635817Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7635890Z ld.shared.s8 %rs37, [%r237+1280]; 
2026-02-21T09:09:57.7636058Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7636119Z shl.b16 %rs38, %rs37, 4; 2026-02-21T09:09:57.7636323Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7636389Z ld.shared.s8 %rs39, [%r236+1408]; 2026-02-21T09:09:57.7636562Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7636621Z shl.b16 %rs40, %rs39, 4; 2026-02-21T09:09:57.7636799Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7636885Z ld.shared.s8 %rs41, [%r235+1536]; 2026-02-21T09:09:57.7637053Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7637121Z shl.b16 %rs42, %rs41, 4; 2026-02-21T09:09:57.7637286Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7637349Z ld.shared.s8 %rs43, [%r234+1664]; 2026-02-21T09:09:57.7637523Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7637585Z shl.b16 %rs44, %rs43, 4; 2026-02-21T09:09:57.7637773Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7637847Z ld.shared.s8 %rs45, [%r233+1792]; 2026-02-21T09:09:57.7638020Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7638083Z shl.b16 %rs46, %rs45, 4; 2026-02-21T09:09:57.7638251Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7638322Z ld.shared.s8 %rs47, [%r232+1920]; 2026-02-21T09:09:57.7638489Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7638551Z shl.b16 %rs48, %rs47, 4; 2026-02-21T09:09:57.7638729Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7638793Z cvt.s16.s8 %rs49, %rs18; 
2026-02-21T09:09:57.7638852Z shr.s16 %rs50, %rs49, 4; 2026-02-21T09:09:57.7638921Z cvt.s16.s8 %rs51, %rs20; 2026-02-21T09:09:57.7638980Z shr.s16 %rs52, %rs51, 4; 2026-02-21T09:09:57.7639037Z shr.s16 %rs53, %rs17, 4; 2026-02-21T09:09:57.7639095Z shr.s16 %rs54, %rs19, 4; 2026-02-21T09:09:57.7639294Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7639361Z cvt.rn.f32.s16 %r262, %rs54; 2026-02-21T09:09:57.7639424Z cvt.rn.f32.s16 %r263, %rs53; 2026-02-21T09:09:57.7639494Z cvt.rn.f32.s16 %r264, %rs52; 2026-02-21T09:09:57.7639555Z cvt.rn.f32.s16 %r265, %rs50; 2026-02-21T09:09:57.7639730Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7639796Z cvt.s16.s8 %rs55, %rs22; 2026-02-21T09:09:57.7639854Z shr.s16 %rs56, %rs55, 4; 2026-02-21T09:09:57.7639911Z cvt.s16.s8 %rs57, %rs24; 2026-02-21T09:09:57.7639970Z shr.s16 %rs58, %rs57, 4; 2026-02-21T09:09:57.7640035Z shr.s16 %rs59, %rs21, 4; 2026-02-21T09:09:57.7640093Z shr.s16 %rs60, %rs23, 4; 2026-02-21T09:09:57.7640251Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7640317Z cvt.rn.f32.s16 %r266, %rs60; 2026-02-21T09:09:57.7640376Z cvt.rn.f32.s16 %r267, %rs59; 2026-02-21T09:09:57.7640436Z cvt.rn.f32.s16 %r268, %rs58; 2026-02-21T09:09:57.7640496Z cvt.rn.f32.s16 %r269, %rs56; 2026-02-21T09:09:57.7640665Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7640725Z cvt.s16.s8 %rs61, %rs26; 2026-02-21T09:09:57.7640784Z shr.s16 %rs62, %rs61, 4; 2026-02-21T09:09:57.7640852Z cvt.s16.s8 %rs63, %rs28; 2026-02-21T09:09:57.7640908Z shr.s16 %rs64, %rs63, 4; 2026-02-21T09:09:57.7640965Z shr.s16 %rs65, %rs25, 4; 2026-02-21T09:09:57.7641027Z shr.s16 %rs66, %rs27, 4; 2026-02-21T09:09:57.7641203Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7641263Z cvt.rn.f32.s16 %r270, %rs66; 
2026-02-21T09:09:57.7641322Z cvt.rn.f32.s16 %r271, %rs65; 2026-02-21T09:09:57.7641390Z cvt.rn.f32.s16 %r272, %rs64; 2026-02-21T09:09:57.7641447Z cvt.rn.f32.s16 %r273, %rs62; 2026-02-21T09:09:57.7641641Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7641734Z cvt.s16.s8 %rs67, %rs30; 2026-02-21T09:09:57.7641791Z shr.s16 %rs68, %rs67, 4; 2026-02-21T09:09:57.7641849Z cvt.s16.s8 %rs69, %rs32; 2026-02-21T09:09:57.7641914Z shr.s16 %rs70, %rs69, 4; 2026-02-21T09:09:57.7641972Z shr.s16 %rs71, %rs29, 4; 2026-02-21T09:09:57.7642028Z shr.s16 %rs72, %rs31, 4; 2026-02-21T09:09:57.7642190Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7642258Z cvt.rn.f32.s16 %r274, %rs72; 2026-02-21T09:09:57.7642318Z cvt.rn.f32.s16 %r275, %rs71; 2026-02-21T09:09:57.7642376Z cvt.rn.f32.s16 %r276, %rs70; 2026-02-21T09:09:57.7642464Z cvt.rn.f32.s16 %r277, %rs68; 2026-02-21T09:09:57.7642621Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7642679Z cvt.s16.s8 %rs73, %rs34; 2026-02-21T09:09:57.7642737Z shr.s16 %rs74, %rs73, 4; 2026-02-21T09:09:57.7642802Z cvt.s16.s8 %rs75, %rs36; 2026-02-21T09:09:57.7642860Z shr.s16 %rs76, %rs75, 4; 2026-02-21T09:09:57.7642919Z shr.s16 %rs77, %rs33, 4; 2026-02-21T09:09:57.7642984Z shr.s16 %rs78, %rs35, 4; 2026-02-21T09:09:57.7643146Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7643205Z cvt.rn.f32.s16 %r278, %rs78; 2026-02-21T09:09:57.7643271Z cvt.rn.f32.s16 %r279, %rs77; 2026-02-21T09:09:57.7643330Z cvt.rn.f32.s16 %r280, %rs76; 2026-02-21T09:09:57.7643387Z cvt.rn.f32.s16 %r281, %rs74; 2026-02-21T09:09:57.7643546Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7643614Z cvt.s16.s8 %rs79, %rs38; 2026-02-21T09:09:57.7643673Z shr.s16 %rs80, %rs79, 4; 2026-02-21T09:09:57.7643731Z cvt.s16.s8 %rs81, %rs40; 
2026-02-21T09:09:57.7643794Z shr.s16 %rs82, %rs81, 4; 2026-02-21T09:09:57.7643851Z shr.s16 %rs83, %rs37, 4; 2026-02-21T09:09:57.7643907Z shr.s16 %rs84, %rs39, 4; 2026-02-21T09:09:57.7644090Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7644158Z cvt.rn.f32.s16 %r282, %rs84; 2026-02-21T09:09:57.7644217Z cvt.rn.f32.s16 %r283, %rs83; 2026-02-21T09:09:57.7644276Z cvt.rn.f32.s16 %r284, %rs82; 2026-02-21T09:09:57.7644341Z cvt.rn.f32.s16 %r285, %rs80; 2026-02-21T09:09:57.7644500Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7644557Z cvt.s16.s8 %rs85, %rs42; 2026-02-21T09:09:57.7644621Z shr.s16 %rs86, %rs85, 4; 2026-02-21T09:09:57.7644679Z cvt.s16.s8 %rs87, %rs44; 2026-02-21T09:09:57.7644736Z shr.s16 %rs88, %rs87, 4; 2026-02-21T09:09:57.7644791Z shr.s16 %rs89, %rs41, 4; 2026-02-21T09:09:57.7644854Z shr.s16 %rs90, %rs43, 4; 2026-02-21T09:09:57.7645018Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7645077Z cvt.rn.f32.s16 %r286, %rs90; 2026-02-21T09:09:57.7645143Z cvt.rn.f32.s16 %r287, %rs89; 2026-02-21T09:09:57.7645202Z cvt.rn.f32.s16 %r288, %rs88; 2026-02-21T09:09:57.7645260Z cvt.rn.f32.s16 %r289, %rs86; 2026-02-21T09:09:57.7645424Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7645481Z cvt.s16.s8 %rs91, %rs46; 2026-02-21T09:09:57.7645538Z shr.s16 %rs92, %rs91, 4; 2026-02-21T09:09:57.7645594Z cvt.s16.s8 %rs93, %rs48; 2026-02-21T09:09:57.7645656Z shr.s16 %rs94, %rs93, 4; 2026-02-21T09:09:57.7645710Z shr.s16 %rs95, %rs45, 4; 2026-02-21T09:09:57.7645791Z shr.s16 %rs96, %rs47, 4; 2026-02-21T09:09:57.7645959Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7646018Z cvt.rn.f32.s16 %r290, %rs96; 2026-02-21T09:09:57.7646075Z cvt.rn.f32.s16 %r291, %rs95; 2026-02-21T09:09:57.7646134Z cvt.rn.f32.s16 %r292, %rs94; 
2026-02-21T09:09:57.7646198Z cvt.rn.f32.s16 %r293, %rs92; 2026-02-21T09:09:57.7646292Z st.shared.v4.b32 [%r11], {%r265, %r263, %r264, %r262}; 2026-02-21T09:09:57.7646419Z st.shared.v4.b32 [%r12], {%r269, %r267, %r268, %r266}; 2026-02-21T09:09:57.7646515Z st.shared.v4.b32 [%r13], {%r273, %r271, %r272, %r270}; 2026-02-21T09:09:57.7646602Z st.shared.v4.b32 [%r14], {%r277, %r275, %r276, %r274}; 2026-02-21T09:09:57.7646686Z st.shared.v4.b32 [%r15], {%r281, %r279, %r280, %r278}; 2026-02-21T09:09:57.7646776Z st.shared.v4.b32 [%r16], {%r285, %r283, %r284, %r282}; 2026-02-21T09:09:57.7646860Z st.shared.v4.b32 [%r17], {%r289, %r287, %r288, %r286}; 2026-02-21T09:09:57.7646944Z st.shared.v4.b32 [%r18], {%r293, %r291, %r292, %r290}; 2026-02-21T09:09:57.7647020Z $L__tmp4: 2026-02-21T09:09:57.7647248Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7647307Z // begin inline asm 2026-02-21T09:09:57.7647383Z fence.proxy.async.shared::cta; 2026-02-21T09:09:57.7647446Z // end inline asm 2026-02-21T09:09:57.7647503Z bar.sync 0; 2026-02-21T09:09:57.7647567Z setp.ne.b32 %p56, %r37, 0; 2026-02-21T09:09:57.7647633Z @%p56 bra $L__BB0_3; 2026-02-21T09:09:57.7647685Z // %bb.2: 2026-02-21T09:09:57.7647752Z elect.sync %r306|%p58, -1; 2026-02-21T09:09:57.7647811Z mov.b32 %r296, 69208336; 2026-02-21T09:09:57.7647879Z mov.pred %p57, 0; 2026-02-21T09:09:57.7647937Z // begin inline asm 2026-02-21T09:09:57.7648096Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r613 + 0 ], [ %r295 + 0 ], %rd64, %r296, %p57; 2026-02-21T09:09:57.7648160Z // end inline asm 2026-02-21T09:09:57.7648216Z // begin inline asm 2026-02-21T09:09:57.7648367Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r613 + 0 ], [ %r295 + 8 ], %rd65, %r296, %p59; 2026-02-21T09:09:57.7648432Z // end inline asm 2026-02-21T09:09:57.7648491Z // begin inline asm 2026-02-21T09:09:57.7648639Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r613 + 0 ], [ %r295 + 16 ], %rd66, %r296, 
%p59; 2026-02-21T09:09:57.7648694Z // end inline asm 2026-02-21T09:09:57.7648788Z // begin inline asm 2026-02-21T09:09:57.7648936Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r613 + 0 ], [ %r295 + 24 ], %rd67, %r296, %p59; 2026-02-21T09:09:57.7648991Z // end inline asm 2026-02-21T09:09:57.7649062Z add.s32 %r308, %r54, 28672; 2026-02-21T09:09:57.7649124Z cvt.u64.u32 %rd68, %r308; 2026-02-21T09:09:57.7649180Z // begin inline asm 2026-02-21T09:09:57.7649311Z @%p58 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd68]; 2026-02-21T09:09:57.7649366Z // end inline asm 2026-02-21T09:09:57.7649421Z $L__tmp5: 2026-02-21T09:09:57.7649475Z $L__BB0_3: 2026-02-21T09:09:57.7649575Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:57.7649635Z add.s32 %r20, %r54, %r223; 2026-02-21T09:09:57.7649696Z add.s32 %r21, %r54, %r224; 2026-02-21T09:09:57.7649760Z add.s32 %r22, %r54, %r225; 2026-02-21T09:09:57.7649817Z add.s32 %r23, %r54, %r226; 2026-02-21T09:09:57.7649875Z add.s32 %r24, %r54, %r227; 2026-02-21T09:09:57.7649932Z add.s32 %r25, %r54, %r228; 2026-02-21T09:09:57.7649997Z add.s32 %r26, %r54, %r229; 2026-02-21T09:09:57.7650055Z add.s32 %r27, %r54, %r230; 2026-02-21T09:09:57.7650223Z .loc 1 58 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:32 2026-02-21T09:09:57.7650288Z add.s64 %rd69, %rd53, 128; 2026-02-21T09:09:57.7650348Z cvt.u64.u32 %rd73, %r19; 2026-02-21T09:09:57.7650410Z add.s64 %rd74, %rd7, %rd73; 2026-02-21T09:09:57.7650467Z shl.b64 %rd75, %rd74, 1; 2026-02-21T09:09:57.7650534Z add.s64 %rd76, %rd14, %rd75; 2026-02-21T09:09:57.7650614Z add.s64 %rd70, %rd76, 65536; 2026-02-21T09:09:57.7650669Z mov.b32 %r310, 16; 2026-02-21T09:09:57.7650843Z .loc 1 58 80 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:80 2026-02-21T09:09:57.7650900Z // begin inline asm 2026-02-21T09:09:57.7651017Z cp.async.cg.shared.global [ %r309 + 0 ], [ %rd69 + 0 ], 0x10, %r310; 2026-02-21T09:09:57.7651079Z // end inline asm 2026-02-21T09:09:57.7651137Z // begin 
inline asm 2026-02-21T09:09:57.7651271Z cp.async.cg.shared.global [ %r311 + 0 ], [ %rd70 + 0 ], 0x10, %r310; 2026-02-21T09:09:57.7651326Z // end inline asm 2026-02-21T09:09:57.7651397Z cp.async.commit_group; 2026-02-21T09:09:57.7651596Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7651654Z // begin inline asm 2026-02-21T09:09:57.7651772Z @%p91 mbarrier.arrive.expect_tx.shared.b64 _, [%r313], 2048; 2026-02-21T09:09:57.7651827Z // end inline asm 2026-02-21T09:09:57.7651993Z .loc 1 64 33 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:64:33 2026-02-21T09:09:57.7652081Z bar.sync 0; 2026-02-21T09:09:57.7652148Z elect.sync %r322|%p69, -1; 2026-02-21T09:09:57.7652214Z and.pred %p67, %p1, %p69; 2026-02-21T09:09:57.7652269Z mov.b32 %r316, 32; 2026-02-21T09:09:57.7652333Z // begin inline asm 2026-02-21T09:09:57.7652572Z @%p67 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r156], [%rd71, {%r315, %r316}], [%r313]; 2026-02-21T09:09:57.7652627Z // end inline asm 2026-02-21T09:09:57.7652798Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7652857Z and.b32 %r323, %r1, 3; 2026-02-21T09:09:57.7652922Z mul.wide.u32 %rd77, %r323, 16; 2026-02-21T09:09:57.7652989Z shl.b32 %r324, %r3, 12; 2026-02-21T09:09:57.7653052Z and.b32 %r325, %r324, 4128768; 2026-02-21T09:09:57.7653112Z shl.b32 %r326, %r4, 10; 2026-02-21T09:09:57.7653175Z or.b32 %r327, %r325, %r326; 2026-02-21T09:09:57.7653246Z mul.wide.u32 %rd78, %r327, 2; 2026-02-21T09:09:57.7653309Z or.b64 %rd79, %rd77, %rd78; 2026-02-21T09:09:57.7653368Z add.s64 %rd80, %rd79, %rd14; 2026-02-21T09:09:57.7653438Z add.s64 %rd218, %rd80, 65728; 2026-02-21T09:09:57.7653494Z mov.b32 %r618, 1; 2026-02-21T09:09:57.7653550Z mov.b32 %r614, 0; 2026-02-21T09:09:57.7653607Z mov.b64 %rd219, 0; 2026-02-21T09:09:57.7653697Z mov.b32 %r616, %r614; 2026-02-21T09:09:57.7653757Z mov.b32 %r617, %r614; 
2026-02-21T09:09:57.7653814Z mov.b32 %r619, %r614; 2026-02-21T09:09:57.7653876Z bra.uni $L__BB0_4; 2026-02-21T09:09:57.7653979Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:57.7654139Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7654204Z setp.lt.u64 %p84, %rd219, 464; 2026-02-21T09:09:57.7654264Z $L__tmp6: 2026-02-21T09:09:57.7654477Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7654538Z add.s32 %r431, %r618, 1; 2026-02-21T09:09:57.7654608Z setp.gt.s32 %p87, %r431, 1; 2026-02-21T09:09:57.7654671Z selp.b32 %r618, 0, %r431, %p87; 2026-02-21T09:09:57.7654731Z selp.b32 %r432, 1, 0, %p87; 2026-02-21T09:09:57.7654797Z xor.b32 %r53, %r619, %r432; 2026-02-21T09:09:57.7654850Z $L__tmp7: 2026-02-21T09:09:57.7655012Z .loc 1 58 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:32 2026-02-21T09:09:57.7655076Z add.s64 %rd86, %rd218, -65536; 2026-02-21T09:09:57.7655246Z .loc 1 58 80 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:80 2026-02-21T09:09:57.7655305Z add.s32 %r422, %r47, %r6; 2026-02-21T09:09:57.7655366Z selp.b32 %r423, 16, 0, %p84; 2026-02-21T09:09:57.7655429Z // begin inline asm 2026-02-21T09:09:57.7655542Z cp.async.cg.shared.global [ %r422 + 0 ], [ %rd86 + 0 ], 0x10, %r423; 2026-02-21T09:09:57.7655628Z // end inline asm 2026-02-21T09:09:57.7655694Z add.s32 %r424, %r422, 2048; 2026-02-21T09:09:57.7655753Z // begin inline asm 2026-02-21T09:09:57.7655870Z cp.async.cg.shared.global [ %r424 + 0 ], [ %rd218 + 0 ], 0x10, %r423; 2026-02-21T09:09:57.7655926Z // end inline asm 2026-02-21T09:09:57.7655997Z cp.async.commit_group; 2026-02-21T09:09:57.7656156Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7656245Z and.pred %p82, %p91, %p84; 2026-02-21T09:09:57.7656310Z // begin inline asm 2026-02-21T09:09:57.7656417Z @%p82 
mbarrier.arrive.expect_tx.shared.b64 _, [%r426], 2048; 2026-02-21T09:09:57.7656472Z // end inline asm 2026-02-21T09:09:57.7656631Z .loc 1 64 33 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:64:33 2026-02-21T09:09:57.7656696Z bar.sync 0; 2026-02-21T09:09:57.7656760Z elect.sync %r433|%p88, -1; 2026-02-21T09:09:57.7656824Z and.pred %p89, %p84, %p88; 2026-02-21T09:09:57.7656906Z and.pred %p83, %p1, %p89; 2026-02-21T09:09:57.7656966Z cvt.u32.u64 %r434, %rd219; 2026-02-21T09:09:57.7657043Z add.s32 %r429, %r434, 48; 2026-02-21T09:09:57.7657108Z // begin inline asm 2026-02-21T09:09:57.7657344Z @%p83 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r427], [%rd71, {%r315, %r429}], [%r426]; 2026-02-21T09:09:57.7657399Z // end inline asm 2026-02-21T09:09:57.7657563Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7657631Z add.s64 %rd218, %rd218, 64; 2026-02-21T09:09:57.7657695Z setp.lt.u64 %p90, %rd219, 480; 2026-02-21T09:09:57.7657756Z add.s64 %rd219, %rd219, 16; 2026-02-21T09:09:57.7657821Z mov.b32 %r614, %r619; 2026-02-21T09:09:57.7657878Z mov.b32 %r619, %r53; 2026-02-21T09:09:57.7657936Z @%p90 bra $L__BB0_4; 2026-02-21T09:09:57.7658000Z bra.uni $L__BB0_7; 2026-02-21T09:09:57.7658104Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:09:57.7658164Z add.s32 %r349, %r617, 1; 2026-02-21T09:09:57.7658225Z setp.gt.s32 %p72, %r349, 1; 2026-02-21T09:09:57.7658298Z selp.b32 %r617, 0, %r349, %p72; 2026-02-21T09:09:57.7658459Z .loc 1 58 80 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:58:80 2026-02-21T09:09:57.7658523Z cp.async.wait_group 1; 2026-02-21T09:09:57.7658585Z bar.sync 0; 2026-02-21T09:09:57.7658666Z shl.b32 %r350, %r617, 11; 2026-02-21T09:09:57.7658726Z shl.b32 %r351, %r617, 12; 2026-02-21T09:09:57.7658785Z add.s32 %r353, %r54, %r351; 2026-02-21T09:09:57.7658851Z add.s32 %r47, %r353, 16384; 2026-02-21T09:09:57.7658910Z add.s32 %r354, %r47, %r10; 
2026-02-21T09:09:57.7659005Z ld.shared.v4.b32 {%r355, %r356, %r357, %r358}, [%r354]; 2026-02-21T09:09:57.7659074Z mov.b32 {%rs97, %rs98}, %r358; 2026-02-21T09:09:57.7659137Z mov.b32 {%rs99, %rs100}, %r357; 2026-02-21T09:09:57.7659203Z mov.b32 {%rs101, %rs102}, %r356; 2026-02-21T09:09:57.7659276Z mov.b32 {%rs103, %rs104}, %r355; 2026-02-21T09:09:57.7659374Z ld.shared.v4.b32 {%r359, %r360, %r361, %r362}, [%r354+16]; 2026-02-21T09:09:57.7659436Z mov.b32 {%rs105, %rs106}, %r362; 2026-02-21T09:09:57.7659496Z mov.b32 {%rs107, %rs108}, %r361; 2026-02-21T09:09:57.7659563Z mov.b32 {%rs109, %rs110}, %r360; 2026-02-21T09:09:57.7659622Z mov.b32 {%rs111, %rs112}, %r359; 2026-02-21T09:09:57.7659787Z .loc 1 62 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:62:32 2026-02-21T09:09:57.7659857Z cvt.f32.bf16 %r331, %rs103; 2026-02-21T09:09:57.7659917Z cvt.f32.bf16 %r332, %rs104; 2026-02-21T09:09:57.7659977Z cvt.f32.bf16 %r333, %rs101; 2026-02-21T09:09:57.7660035Z cvt.f32.bf16 %r334, %rs102; 2026-02-21T09:09:57.7660103Z cvt.f32.bf16 %r335, %rs99; 2026-02-21T09:09:57.7660160Z cvt.f32.bf16 %r336, %rs100; 2026-02-21T09:09:57.7660219Z cvt.f32.bf16 %r337, %rs97; 2026-02-21T09:09:57.7660284Z cvt.f32.bf16 %r338, %rs98; 2026-02-21T09:09:57.7660363Z cvt.f32.bf16 %r339, %rs111; 2026-02-21T09:09:57.7660421Z cvt.f32.bf16 %r340, %rs112; 2026-02-21T09:09:57.7660487Z cvt.f32.bf16 %r341, %rs109; 2026-02-21T09:09:57.7660546Z cvt.f32.bf16 %r342, %rs110; 2026-02-21T09:09:57.7660605Z cvt.f32.bf16 %r343, %rs107; 2026-02-21T09:09:57.7660663Z cvt.f32.bf16 %r344, %rs108; 2026-02-21T09:09:57.7660728Z cvt.f32.bf16 %r345, %rs105; 2026-02-21T09:09:57.7660783Z cvt.f32.bf16 %r346, %rs106; 2026-02-21T09:09:57.7660838Z $L__tmp8: 2026-02-21T09:09:57.7661079Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7661136Z // begin inline asm 2026-02-21T09:09:57.7661187Z 2026-02-21T09:09:57.7661237Z { 2026-02-21T09:09:57.7661307Z .reg 
.pred complete; 2026-02-21T09:09:57.7661363Z waitLoop: 2026-02-21T09:09:57.7661483Z mbarrier.try_wait.parity.shared.b64 complete, [%r615], %r614; 2026-02-21T09:09:57.7661581Z @!complete bra.uni waitLoop; 2026-02-21T09:09:57.7661632Z } 2026-02-21T09:09:57.7661638Z 2026-02-21T09:09:57.7661693Z // end inline asm 2026-02-21T09:09:57.7661746Z $L__tmp9: 2026-02-21T09:09:57.7661939Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7662001Z selp.b32 %r363, 1, 0, %p72; 2026-02-21T09:09:57.7662060Z xor.b32 %r616, %r616, %r363; 2026-02-21T09:09:57.7662126Z mov.pred %p73, -1; 2026-02-21T09:09:57.7662179Z $L__tmp10: 2026-02-21T09:09:57.7662392Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7662458Z // begin inline asm 2026-02-21T09:09:57.7662752Z @%p73 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r330 + 0], 16, {%r331, %r332, %r333, %r334, %r335, %r336, %r337, %r338, %r339, %r340, %r341, %r342, %r343, %r344, %r345, %r346}; 2026-02-21T09:09:57.7662809Z // end inline asm 2026-02-21T09:09:57.7662872Z // begin inline asm 2026-02-21T09:09:57.7662941Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:09:57.7662998Z // end inline asm 2026-02-21T09:09:57.7663052Z bar.sync 0; 2026-02-21T09:09:57.7663115Z $L__tmp11: 2026-02-21T09:09:57.7663283Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7663342Z shl.b32 %r364, %r617, 3; 2026-02-21T09:09:57.7663409Z add.s32 %r365, %r54, %r364; 2026-02-21T09:09:57.7663469Z add.s32 %r426, %r365, 28688; 2026-02-21T09:09:57.7663553Z // begin inline asm 2026-02-21T09:09:57.7663608Z 2026-02-21T09:09:57.7663666Z { 2026-02-21T09:09:57.7663729Z .reg .pred complete; 2026-02-21T09:09:57.7663783Z waitLoop: 2026-02-21T09:09:57.7663908Z mbarrier.try_wait.parity.shared.b64 complete, [%r426], %r616; 2026-02-21T09:09:57.7663972Z @!complete bra.uni waitLoop; 2026-02-21T09:09:57.7664024Z } 
2026-02-21T09:09:57.7664027Z 2026-02-21T09:09:57.7664083Z // end inline asm 2026-02-21T09:09:57.7664255Z .loc 1 64 33 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:64:33 2026-02-21T09:09:57.7664316Z add.s32 %r366, %r54, %r350; 2026-02-21T09:09:57.7664376Z add.s32 %r427, %r366, 24576; 2026-02-21T09:09:57.7664446Z add.s32 %r367, %r427, %r5; 2026-02-21T09:09:57.7664506Z add.s32 %r368, %r427, %r34; 2026-02-21T09:09:57.7664564Z add.s32 %r369, %r427, %r33; 2026-02-21T09:09:57.7664632Z add.s32 %r370, %r427, %r32; 2026-02-21T09:09:57.7664690Z add.s32 %r371, %r427, %r31; 2026-02-21T09:09:57.7664748Z add.s32 %r372, %r427, %r30; 2026-02-21T09:09:57.7664808Z add.s32 %r373, %r427, %r29; 2026-02-21T09:09:57.7664874Z add.s32 %r374, %r427, %r28; 2026-02-21T09:09:57.7665035Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7665099Z ld.shared.s8 %rs113, [%r367]; 2026-02-21T09:09:57.7665267Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7665327Z shl.b16 %rs114, %rs113, 4; 2026-02-21T09:09:57.7665488Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7665602Z ld.shared.s8 %rs115, [%r368+128]; 2026-02-21T09:09:57.7665762Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7665821Z shl.b16 %rs116, %rs115, 4; 2026-02-21T09:09:57.7665983Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7666080Z ld.shared.s8 %rs117, [%r369+256]; 2026-02-21T09:09:57.7666241Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7666300Z shl.b16 %rs118, %rs117, 4; 2026-02-21T09:09:57.7666471Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7666536Z ld.shared.s8 %rs119, [%r370+384]; 2026-02-21T09:09:57.7666696Z .loc 1 67 28 // 
c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7666763Z shl.b16 %rs120, %rs119, 4; 2026-02-21T09:09:57.7666957Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7667023Z ld.shared.s8 %rs121, [%r371+512]; 2026-02-21T09:09:57.7667192Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7667251Z shl.b16 %rs122, %rs121, 4; 2026-02-21T09:09:57.7667412Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7667474Z ld.shared.s8 %rs123, [%r372+640]; 2026-02-21T09:09:57.7667643Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7667700Z shl.b16 %rs124, %rs123, 4; 2026-02-21T09:09:57.7667859Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7667929Z ld.shared.s8 %rs125, [%r373+768]; 2026-02-21T09:09:57.7668091Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7668150Z shl.b16 %rs126, %rs125, 4; 2026-02-21T09:09:57.7668316Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7668378Z ld.shared.s8 %rs127, [%r374+896]; 2026-02-21T09:09:57.7668563Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7668632Z shl.b16 %rs128, %rs127, 4; 2026-02-21T09:09:57.7668793Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7668857Z ld.shared.s8 %rs129, [%r367+1024]; 2026-02-21T09:09:57.7669020Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7669085Z shl.b16 %rs130, %rs129, 4; 2026-02-21T09:09:57.7669247Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7669314Z ld.shared.s8 %rs131, [%r368+1152]; 2026-02-21T09:09:57.7669479Z .loc 1 67 28 // 
c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7669539Z shl.b16 %rs132, %rs131, 4; 2026-02-21T09:09:57.7669701Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7669772Z ld.shared.s8 %rs133, [%r369+1280]; 2026-02-21T09:09:57.7669934Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7669992Z shl.b16 %rs134, %rs133, 4; 2026-02-21T09:09:57.7670159Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7670220Z ld.shared.s8 %rs135, [%r370+1408]; 2026-02-21T09:09:57.7670380Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7670464Z shl.b16 %rs136, %rs135, 4; 2026-02-21T09:09:57.7670633Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7670694Z ld.shared.s8 %rs137, [%r371+1536]; 2026-02-21T09:09:57.7670852Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7670919Z shl.b16 %rs138, %rs137, 4; 2026-02-21T09:09:57.7671100Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7671160Z ld.shared.s8 %rs139, [%r372+1664]; 2026-02-21T09:09:57.7671327Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7671385Z shl.b16 %rs140, %rs139, 4; 2026-02-21T09:09:57.7671573Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7671645Z ld.shared.s8 %rs141, [%r373+1792]; 2026-02-21T09:09:57.7671835Z .loc 1 67 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7671893Z shl.b16 %rs142, %rs141, 4; 2026-02-21T09:09:57.7672055Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7672123Z ld.shared.s8 %rs143, [%r374+1920]; 2026-02-21T09:09:57.7672287Z .loc 1 67 
28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:67:28 2026-02-21T09:09:57.7672347Z shl.b16 %rs144, %rs143, 4; 2026-02-21T09:09:57.7672517Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7672577Z cvt.s16.s8 %rs145, %rs114; 2026-02-21T09:09:57.7672634Z shr.s16 %rs146, %rs145, 4; 2026-02-21T09:09:57.7672701Z cvt.s16.s8 %rs147, %rs116; 2026-02-21T09:09:57.7672760Z shr.s16 %rs148, %rs147, 4; 2026-02-21T09:09:57.7672819Z shr.s16 %rs149, %rs113, 4; 2026-02-21T09:09:57.7672876Z shr.s16 %rs150, %rs115, 4; 2026-02-21T09:09:57.7673044Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7673106Z cvt.rn.f32.s16 %r375, %rs150; 2026-02-21T09:09:57.7673169Z cvt.rn.f32.s16 %r376, %rs149; 2026-02-21T09:09:57.7673239Z cvt.rn.f32.s16 %r377, %rs148; 2026-02-21T09:09:57.7673324Z cvt.rn.f32.s16 %r378, %rs146; 2026-02-21T09:09:57.7673489Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7673556Z cvt.s16.s8 %rs151, %rs118; 2026-02-21T09:09:57.7673613Z shr.s16 %rs152, %rs151, 4; 2026-02-21T09:09:57.7673671Z cvt.s16.s8 %rs153, %rs120; 2026-02-21T09:09:57.7673728Z shr.s16 %rs154, %rs153, 4; 2026-02-21T09:09:57.7673794Z shr.s16 %rs155, %rs117, 4; 2026-02-21T09:09:57.7673852Z shr.s16 %rs156, %rs119, 4; 2026-02-21T09:09:57.7674013Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7674083Z cvt.rn.f32.s16 %r379, %rs156; 2026-02-21T09:09:57.7674143Z cvt.rn.f32.s16 %r380, %rs155; 2026-02-21T09:09:57.7674203Z cvt.rn.f32.s16 %r381, %rs154; 2026-02-21T09:09:57.7674268Z cvt.rn.f32.s16 %r382, %rs152; 2026-02-21T09:09:57.7674430Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7674490Z cvt.s16.s8 %rs157, %rs122; 2026-02-21T09:09:57.7674551Z shr.s16 %rs158, %rs157, 4; 2026-02-21T09:09:57.7674618Z cvt.s16.s8 %rs159, %rs124; 2026-02-21T09:09:57.7674676Z 
shr.s16 %rs160, %rs159, 4; 2026-02-21T09:09:57.7674734Z shr.s16 %rs161, %rs121, 4; 2026-02-21T09:09:57.7674802Z shr.s16 %rs162, %rs123, 4; 2026-02-21T09:09:57.7674974Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7675035Z cvt.rn.f32.s16 %r383, %rs162; 2026-02-21T09:09:57.7675097Z cvt.rn.f32.s16 %r384, %rs161; 2026-02-21T09:09:57.7675214Z cvt.rn.f32.s16 %r385, %rs160; 2026-02-21T09:09:57.7675277Z cvt.rn.f32.s16 %r386, %rs158; 2026-02-21T09:09:57.7675446Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7675515Z cvt.s16.s8 %rs163, %rs126; 2026-02-21T09:09:57.7675574Z shr.s16 %rs164, %rs163, 4; 2026-02-21T09:09:57.7675634Z cvt.s16.s8 %rs165, %rs128; 2026-02-21T09:09:57.7675701Z shr.s16 %rs166, %rs165, 4; 2026-02-21T09:09:57.7675790Z shr.s16 %rs167, %rs125, 4; 2026-02-21T09:09:57.7675849Z shr.s16 %rs168, %rs127, 4; 2026-02-21T09:09:57.7676021Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7676091Z cvt.rn.f32.s16 %r387, %rs168; 2026-02-21T09:09:57.7676153Z cvt.rn.f32.s16 %r388, %rs167; 2026-02-21T09:09:57.7676215Z cvt.rn.f32.s16 %r389, %rs166; 2026-02-21T09:09:57.7676284Z cvt.rn.f32.s16 %r390, %rs164; 2026-02-21T09:09:57.7676457Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7676539Z cvt.s16.s8 %rs169, %rs130; 2026-02-21T09:09:57.7676611Z shr.s16 %rs170, %rs169, 4; 2026-02-21T09:09:57.7676670Z cvt.s16.s8 %rs171, %rs132; 2026-02-21T09:09:57.7676731Z shr.s16 %rs172, %rs171, 4; 2026-02-21T09:09:57.7676792Z shr.s16 %rs173, %rs129, 4; 2026-02-21T09:09:57.7676861Z shr.s16 %rs174, %rs131, 4; 2026-02-21T09:09:57.7677038Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7677101Z cvt.rn.f32.s16 %r391, %rs174; 2026-02-21T09:09:57.7677171Z cvt.rn.f32.s16 %r392, %rs173; 2026-02-21T09:09:57.7677232Z cvt.rn.f32.s16 %r393, %rs172; 
2026-02-21T09:09:57.7677291Z cvt.rn.f32.s16 %r394, %rs170; 2026-02-21T09:09:57.7677458Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7677525Z cvt.s16.s8 %rs175, %rs134; 2026-02-21T09:09:57.7677588Z shr.s16 %rs176, %rs175, 4; 2026-02-21T09:09:57.7677648Z cvt.s16.s8 %rs177, %rs136; 2026-02-21T09:09:57.7677718Z shr.s16 %rs178, %rs177, 4; 2026-02-21T09:09:57.7677778Z shr.s16 %rs179, %rs133, 4; 2026-02-21T09:09:57.7677838Z shr.s16 %rs180, %rs135, 4; 2026-02-21T09:09:57.7678013Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7678100Z cvt.rn.f32.s16 %r395, %rs180; 2026-02-21T09:09:57.7678166Z cvt.rn.f32.s16 %r396, %rs179; 2026-02-21T09:09:57.7678226Z cvt.rn.f32.s16 %r397, %rs178; 2026-02-21T09:09:57.7678295Z cvt.rn.f32.s16 %r398, %rs176; 2026-02-21T09:09:57.7678465Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7678526Z cvt.s16.s8 %rs181, %rs138; 2026-02-21T09:09:57.7678594Z shr.s16 %rs182, %rs181, 4; 2026-02-21T09:09:57.7678654Z cvt.s16.s8 %rs183, %rs140; 2026-02-21T09:09:57.7678714Z shr.s16 %rs184, %rs183, 4; 2026-02-21T09:09:57.7678776Z shr.s16 %rs185, %rs137, 4; 2026-02-21T09:09:57.7678843Z shr.s16 %rs186, %rs139, 4; 2026-02-21T09:09:57.7679014Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7679076Z cvt.rn.f32.s16 %r399, %rs186; 2026-02-21T09:09:57.7679144Z cvt.rn.f32.s16 %r400, %rs185; 2026-02-21T09:09:57.7679204Z cvt.rn.f32.s16 %r401, %rs184; 2026-02-21T09:09:57.7679264Z cvt.rn.f32.s16 %r402, %rs182; 2026-02-21T09:09:57.7679439Z .loc 1 69 25 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:69:25 2026-02-21T09:09:57.7679500Z cvt.s16.s8 %rs187, %rs142; 2026-02-21T09:09:57.7679559Z shr.s16 %rs188, %rs187, 4; 2026-02-21T09:09:57.7679619Z cvt.s16.s8 %rs189, %rs144; 2026-02-21T09:09:57.7679685Z shr.s16 %rs190, %rs189, 4; 2026-02-21T09:09:57.7679743Z shr.s16 
%rs191, %rs141, 4; 2026-02-21T09:09:57.7679803Z shr.s16 %rs192, %rs143, 4; 2026-02-21T09:09:57.7679975Z .loc 1 87 32 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:87:32 2026-02-21T09:09:57.7680060Z cvt.rn.f32.s16 %r403, %rs192; 2026-02-21T09:09:57.7680121Z cvt.rn.f32.s16 %r404, %rs191; 2026-02-21T09:09:57.7680189Z cvt.rn.f32.s16 %r405, %rs190; 2026-02-21T09:09:57.7680250Z cvt.rn.f32.s16 %r406, %rs188; 2026-02-21T09:09:57.7680347Z st.shared.v4.b32 [%r11], {%r378, %r376, %r377, %r375}; 2026-02-21T09:09:57.7680441Z st.shared.v4.b32 [%r12], {%r382, %r380, %r381, %r379}; 2026-02-21T09:09:57.7680565Z st.shared.v4.b32 [%r13], {%r386, %r384, %r385, %r383}; 2026-02-21T09:09:57.7680653Z st.shared.v4.b32 [%r14], {%r390, %r388, %r389, %r387}; 2026-02-21T09:09:57.7680740Z st.shared.v4.b32 [%r15], {%r394, %r392, %r393, %r391}; 2026-02-21T09:09:57.7680834Z st.shared.v4.b32 [%r16], {%r398, %r396, %r397, %r395}; 2026-02-21T09:09:57.7680919Z st.shared.v4.b32 [%r17], {%r402, %r400, %r401, %r399}; 2026-02-21T09:09:57.7681005Z st.shared.v4.b32 [%r18], {%r406, %r404, %r405, %r403}; 2026-02-21T09:09:57.7681185Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7681268Z shl.b32 %r407, %r618, 3; 2026-02-21T09:09:57.7681335Z add.s32 %r408, %r54, %r407; 2026-02-21T09:09:57.7681396Z add.s32 %r615, %r408, 28672; 2026-02-21T09:09:57.7681459Z $L__tmp12: 2026-02-21T09:09:57.7681712Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7681774Z // begin inline asm 2026-02-21T09:09:57.7681859Z fence.proxy.async.shared::cta; 2026-02-21T09:09:57.7681916Z // end inline asm 2026-02-21T09:09:57.7681974Z bar.sync 0; 2026-02-21T09:09:57.7682037Z @%p56 bra $L__BB0_6; 2026-02-21T09:09:57.7682151Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:09:57.7682218Z elect.sync %r421|%p74, -1; 2026-02-21T09:09:57.7682279Z mov.b32 %r411, 69208336; 2026-02-21T09:09:57.7682345Z // 
begin inline asm 2026-02-21T09:09:57.7682510Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r613 + 0 ], [ %r295 + 0 ], %rd64, %r411, %p73; 2026-02-21T09:09:57.7682569Z // end inline asm 2026-02-21T09:09:57.7682635Z // begin inline asm 2026-02-21T09:09:57.7682790Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r613 + 0 ], [ %r295 + 8 ], %rd65, %r411, %p73; 2026-02-21T09:09:57.7682849Z // end inline asm 2026-02-21T09:09:57.7682916Z // begin inline asm 2026-02-21T09:09:57.7683113Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r613 + 0 ], [ %r295 + 16 ], %rd66, %r411, %p73; 2026-02-21T09:09:57.7683173Z // end inline asm 2026-02-21T09:09:57.7683233Z // begin inline asm 2026-02-21T09:09:57.7683388Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r613 + 0 ], [ %r295 + 24 ], %rd67, %r411, %p73; 2026-02-21T09:09:57.7683443Z // end inline asm 2026-02-21T09:09:57.7683504Z cvt.u64.u32 %rd85, %r615; 2026-02-21T09:09:57.7683569Z // begin inline asm 2026-02-21T09:09:57.7683690Z @%p74 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd85]; 2026-02-21T09:09:57.7683746Z // end inline asm 2026-02-21T09:09:57.7683809Z bra.uni $L__BB0_6; 2026-02-21T09:09:57.7683907Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:09:57.7683996Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:09:57.7684059Z setp.lt.u32 %p96, %r1, 64; 2026-02-21T09:09:57.7684120Z mov.b32 %r436, 1; 2026-02-21T09:09:57.7684336Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7684394Z // begin inline asm 2026-02-21T09:09:57.7684452Z 2026-02-21T09:09:57.7684502Z { 2026-02-21T09:09:57.7684564Z .reg .pred complete; 2026-02-21T09:09:57.7684618Z waitLoop: 2026-02-21T09:09:57.7684744Z mbarrier.try_wait.parity.shared.b64 complete, [%r615], %r436; 2026-02-21T09:09:57.7684810Z @!complete bra.uni waitLoop; 2026-02-21T09:09:57.7684861Z } 2026-02-21T09:09:57.7684864Z 2026-02-21T09:09:57.7684926Z // end inline asm 2026-02-21T09:09:57.7685006Z $L__tmp13: 
2026-02-21T09:09:57.7685168Z .loc 1 51 92 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:51:92 2026-02-21T09:09:57.7685238Z cp.async.wait_group 0; 2026-02-21T09:09:57.7685292Z bar.sync 0; 2026-02-21T09:09:57.7685348Z // begin inline asm 2026-02-21T09:09:57.7685434Z @%p91 mbarrier.inval.shared::cta.b64 [%r313]; 2026-02-21T09:09:57.7685497Z // end inline asm 2026-02-21T09:09:57.7685552Z bar.sync 0; 2026-02-21T09:09:57.7685633Z // begin inline asm 2026-02-21T09:09:57.7685725Z @%p91 mbarrier.inval.shared::cta.b64 [%r150]; 2026-02-21T09:09:57.7685779Z // end inline asm 2026-02-21T09:09:57.7685840Z add.s32 %r439, %r54, 28672; 2026-02-21T09:09:57.7685896Z // begin inline asm 2026-02-21T09:09:57.7685984Z @%p91 mbarrier.inval.shared::cta.b64 [%r439]; 2026-02-21T09:09:57.7686039Z // end inline asm 2026-02-21T09:09:57.7686093Z bar.sync 0; 2026-02-21T09:09:57.7686154Z // begin inline asm 2026-02-21T09:09:57.7686230Z @%p91 mbarrier.inval.shared::cta.b64 [%r148]; 2026-02-21T09:09:57.7686286Z // end inline asm 2026-02-21T09:09:57.7686337Z $L__tmp14: 2026-02-21T09:09:57.7686578Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7686636Z // begin inline asm 2026-02-21T09:09:57.7686915Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r441, %r442, %r443, %r444, %r445, %r446, %r447, %r448, %r449, %r450, %r451, %r452, %r453, %r454, %r455, %r456}, [%r508 + 0], 64; 2026-02-21T09:09:57.7686979Z // end inline asm 2026-02-21T09:09:57.7687035Z // begin inline asm 2026-02-21T09:09:57.7687305Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r458, %r459, %r460, %r461, %r462, %r463, %r464, %r465, %r466, %r467, %r468, %r469, %r470, %r471, %r472, %r473}, [%r508 + 16], 64; 2026-02-21T09:09:57.7687368Z // end inline asm 2026-02-21T09:09:57.7687422Z // begin inline asm 2026-02-21T09:09:57.7687691Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r475, %r476, %r477, %r478, %r479, %r480, %r481, %r482, %r483, %r484, %r485, %r486, 
%r487, %r488, %r489, %r490}, [%r508 + 32], 64; 2026-02-21T09:09:57.7687756Z // end inline asm 2026-02-21T09:09:57.7687812Z // begin inline asm 2026-02-21T09:09:57.7688080Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r492, %r493, %r494, %r495, %r496, %r497, %r498, %r499, %r500, %r501, %r502, %r503, %r504, %r505, %r506, %r507}, [%r508 + 48], 64; 2026-02-21T09:09:57.7688140Z // end inline asm 2026-02-21T09:09:57.7688216Z // begin inline asm 2026-02-21T09:09:57.7688288Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:09:57.7688343Z // end inline asm 2026-02-21T09:09:57.7688411Z cvt.u64.u32 %rd90, %r441; 2026-02-21T09:09:57.7688471Z cvt.u64.u32 %rd91, %r442; 2026-02-21T09:09:57.7688530Z shl.b64 %rd92, %rd91, 32; 2026-02-21T09:09:57.7688596Z or.b64 %rd93, %rd90, %rd92; 2026-02-21T09:09:57.7688648Z $L__tmp15: 2026-02-21T09:09:57.7688816Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7688879Z mov.b64 {%r513, %r514}, %rd93; 2026-02-21T09:09:57.7688956Z cvt.rn.bf16x2.f32 %r515, %r514, %r513; 2026-02-21T09:09:57.7689008Z $L__tmp16: 2026-02-21T09:09:57.7689222Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7689289Z cvt.u64.u32 %rd94, %r443; 2026-02-21T09:09:57.7689346Z cvt.u64.u32 %rd95, %r444; 2026-02-21T09:09:57.7689405Z shl.b64 %rd96, %rd95, 32; 2026-02-21T09:09:57.7689472Z or.b64 %rd97, %rd94, %rd96; 2026-02-21T09:09:57.7689524Z $L__tmp17: 2026-02-21T09:09:57.7689690Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7689751Z mov.b64 {%r516, %r517}, %rd97; 2026-02-21T09:09:57.7689828Z cvt.rn.bf16x2.f32 %r518, %r517, %r516; 2026-02-21T09:09:57.7689879Z $L__tmp18: 2026-02-21T09:09:57.7690090Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7690181Z cvt.u64.u32 %rd98, %r445; 2026-02-21T09:09:57.7690240Z cvt.u64.u32 %rd99, %r446; 
2026-02-21T09:09:57.7690301Z shl.b64 %rd100, %rd99, 32; 2026-02-21T09:09:57.7690369Z or.b64 %rd101, %rd98, %rd100; 2026-02-21T09:09:57.7690421Z $L__tmp19: 2026-02-21T09:09:57.7690584Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7690648Z mov.b64 {%r519, %r520}, %rd101; 2026-02-21T09:09:57.7690743Z cvt.rn.bf16x2.f32 %r521, %r520, %r519; 2026-02-21T09:09:57.7690794Z $L__tmp20: 2026-02-21T09:09:57.7691003Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7691069Z cvt.u64.u32 %rd102, %r447; 2026-02-21T09:09:57.7691129Z cvt.u64.u32 %rd103, %r448; 2026-02-21T09:09:57.7691188Z shl.b64 %rd104, %rd103, 32; 2026-02-21T09:09:57.7691248Z or.b64 %rd105, %rd102, %rd104; 2026-02-21T09:09:57.7691308Z $L__tmp21: 2026-02-21T09:09:57.7691494Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7691581Z mov.b64 {%r522, %r523}, %rd105; 2026-02-21T09:09:57.7691659Z cvt.rn.bf16x2.f32 %r524, %r523, %r522; 2026-02-21T09:09:57.7691712Z $L__tmp22: 2026-02-21T09:09:57.7691924Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7691994Z cvt.u64.u32 %rd106, %r449; 2026-02-21T09:09:57.7692059Z cvt.u64.u32 %rd107, %r450; 2026-02-21T09:09:57.7692118Z shl.b64 %rd108, %rd107, 32; 2026-02-21T09:09:57.7692178Z or.b64 %rd109, %rd106, %rd108; 2026-02-21T09:09:57.7692239Z $L__tmp23: 2026-02-21T09:09:57.7692401Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7692460Z mov.b64 {%r525, %r526}, %rd109; 2026-02-21T09:09:57.7692533Z cvt.rn.bf16x2.f32 %r527, %r526, %r525; 2026-02-21T09:09:57.7692587Z $L__tmp24: 2026-02-21T09:09:57.7692792Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7692857Z cvt.u64.u32 %rd110, %r451; 
2026-02-21T09:09:57.7692914Z cvt.u64.u32 %rd111, %r452; 2026-02-21T09:09:57.7692973Z shl.b64 %rd112, %rd111, 32; 2026-02-21T09:09:57.7693032Z or.b64 %rd113, %rd110, %rd112; 2026-02-21T09:09:57.7693093Z $L__tmp25: 2026-02-21T09:09:57.7693281Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7693344Z mov.b64 {%r528, %r529}, %rd113; 2026-02-21T09:09:57.7693416Z cvt.rn.bf16x2.f32 %r530, %r529, %r528; 2026-02-21T09:09:57.7693469Z $L__tmp26: 2026-02-21T09:09:57.7693681Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7693747Z cvt.u64.u32 %rd114, %r453; 2026-02-21T09:09:57.7693805Z cvt.u64.u32 %rd115, %r454; 2026-02-21T09:09:57.7693864Z shl.b64 %rd116, %rd115, 32; 2026-02-21T09:09:57.7693924Z or.b64 %rd117, %rd114, %rd116; 2026-02-21T09:09:57.7693984Z $L__tmp27: 2026-02-21T09:09:57.7694146Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7694204Z mov.b64 {%r531, %r532}, %rd117; 2026-02-21T09:09:57.7694276Z cvt.rn.bf16x2.f32 %r533, %r532, %r531; 2026-02-21T09:09:57.7694329Z $L__tmp28: 2026-02-21T09:09:57.7694538Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7694597Z cvt.u64.u32 %rd118, %r455; 2026-02-21T09:09:57.7694663Z cvt.u64.u32 %rd119, %r456; 2026-02-21T09:09:57.7694723Z shl.b64 %rd120, %rd119, 32; 2026-02-21T09:09:57.7694781Z or.b64 %rd121, %rd118, %rd120; 2026-02-21T09:09:57.7694840Z $L__tmp29: 2026-02-21T09:09:57.7695004Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7695092Z mov.b64 {%r534, %r535}, %rd121; 2026-02-21T09:09:57.7695167Z cvt.rn.bf16x2.f32 %r536, %r535, %r534; 2026-02-21T09:09:57.7695219Z $L__tmp30: 2026-02-21T09:09:57.7695423Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 
2026-02-21T09:09:57.7695481Z cvt.u64.u32 %rd122, %r458; 2026-02-21T09:09:57.7695547Z cvt.u64.u32 %rd123, %r459; 2026-02-21T09:09:57.7695633Z shl.b64 %rd124, %rd123, 32; 2026-02-21T09:09:57.7695692Z or.b64 %rd125, %rd122, %rd124; 2026-02-21T09:09:57.7695752Z $L__tmp31: 2026-02-21T09:09:57.7695919Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7695979Z mov.b64 {%r537, %r538}, %rd125; 2026-02-21T09:09:57.7696051Z cvt.rn.bf16x2.f32 %r539, %r538, %r537; 2026-02-21T09:09:57.7696102Z $L__tmp32: 2026-02-21T09:09:57.7696308Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7696368Z cvt.u64.u32 %rd126, %r460; 2026-02-21T09:09:57.7696457Z cvt.u64.u32 %rd127, %r461; 2026-02-21T09:09:57.7696516Z shl.b64 %rd128, %rd127, 32; 2026-02-21T09:09:57.7696575Z or.b64 %rd129, %rd126, %rd128; 2026-02-21T09:09:57.7696633Z $L__tmp33: 2026-02-21T09:09:57.7696793Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7696854Z mov.b64 {%r540, %r541}, %rd129; 2026-02-21T09:09:57.7696926Z cvt.rn.bf16x2.f32 %r542, %r541, %r540; 2026-02-21T09:09:57.7696977Z $L__tmp34: 2026-02-21T09:09:57.7697183Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7697241Z cvt.u64.u32 %rd130, %r462; 2026-02-21T09:09:57.7697305Z cvt.u64.u32 %rd131, %r463; 2026-02-21T09:09:57.7697363Z shl.b64 %rd132, %rd131, 32; 2026-02-21T09:09:57.7697422Z or.b64 %rd133, %rd130, %rd132; 2026-02-21T09:09:57.7697481Z $L__tmp35: 2026-02-21T09:09:57.7697642Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7697699Z mov.b64 {%r543, %r544}, %rd133; 2026-02-21T09:09:57.7697765Z cvt.rn.bf16x2.f32 %r545, %r544, %r543; 2026-02-21T09:09:57.7697824Z $L__tmp36: 2026-02-21T09:09:57.7698048Z .loc 2 291 36 // standard.py:291:36 @[ 
c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7698109Z cvt.u64.u32 %rd134, %r464; 2026-02-21T09:09:57.7698175Z cvt.u64.u32 %rd135, %r465; 2026-02-21T09:09:57.7698233Z shl.b64 %rd136, %rd135, 32; 2026-02-21T09:09:57.7698290Z or.b64 %rd137, %rd134, %rd136; 2026-02-21T09:09:57.7698349Z $L__tmp37: 2026-02-21T09:09:57.7698513Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7698571Z mov.b64 {%r546, %r547}, %rd137; 2026-02-21T09:09:57.7698638Z cvt.rn.bf16x2.f32 %r548, %r547, %r546; 2026-02-21T09:09:57.7698697Z $L__tmp38: 2026-02-21T09:09:57.7698903Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7698961Z cvt.u64.u32 %rd138, %r466; 2026-02-21T09:09:57.7699026Z cvt.u64.u32 %rd139, %r467; 2026-02-21T09:09:57.7699084Z shl.b64 %rd140, %rd139, 32; 2026-02-21T09:09:57.7699143Z or.b64 %rd141, %rd138, %rd140; 2026-02-21T09:09:57.7699204Z $L__tmp39: 2026-02-21T09:09:57.7699367Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7699426Z mov.b64 {%r549, %r550}, %rd141; 2026-02-21T09:09:57.7699490Z cvt.rn.bf16x2.f32 %r551, %r550, %r549; 2026-02-21T09:09:57.7699550Z $L__tmp40: 2026-02-21T09:09:57.7699756Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7699814Z cvt.u64.u32 %rd142, %r468; 2026-02-21T09:09:57.7699906Z cvt.u64.u32 %rd143, %r469; 2026-02-21T09:09:57.7699965Z shl.b64 %rd144, %rd143, 32; 2026-02-21T09:09:57.7700025Z or.b64 %rd145, %rd142, %rd144; 2026-02-21T09:09:57.7700077Z $L__tmp41: 2026-02-21T09:09:57.7700251Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7700314Z mov.b64 {%r552, %r553}, %rd145; 2026-02-21T09:09:57.7700383Z cvt.rn.bf16x2.f32 %r554, %r553, %r552; 2026-02-21T09:09:57.7700469Z $L__tmp42: 2026-02-21T09:09:57.7700677Z .loc 2 
291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7700736Z cvt.u64.u32 %rd146, %r470; 2026-02-21T09:09:57.7700802Z cvt.u64.u32 %rd147, %r471; 2026-02-21T09:09:57.7700861Z shl.b64 %rd148, %rd147, 32; 2026-02-21T09:09:57.7700920Z or.b64 %rd149, %rd146, %rd148; 2026-02-21T09:09:57.7700972Z $L__tmp43: 2026-02-21T09:09:57.7701141Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7701201Z mov.b64 {%r555, %r556}, %rd149; 2026-02-21T09:09:57.7701300Z cvt.rn.bf16x2.f32 %r557, %r556, %r555; 2026-02-21T09:09:57.7701360Z $L__tmp44: 2026-02-21T09:09:57.7701600Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7701660Z cvt.u64.u32 %rd150, %r472; 2026-02-21T09:09:57.7701728Z cvt.u64.u32 %rd151, %r473; 2026-02-21T09:09:57.7701788Z shl.b64 %rd152, %rd151, 32; 2026-02-21T09:09:57.7701846Z or.b64 %rd153, %rd150, %rd152; 2026-02-21T09:09:57.7701898Z $L__tmp45: 2026-02-21T09:09:57.7702069Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7702128Z mov.b64 {%r558, %r559}, %rd153; 2026-02-21T09:09:57.7702194Z cvt.rn.bf16x2.f32 %r560, %r559, %r558; 2026-02-21T09:09:57.7702255Z $L__tmp46: 2026-02-21T09:09:57.7702460Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7702522Z cvt.u64.u32 %rd154, %r475; 2026-02-21T09:09:57.7702588Z cvt.u64.u32 %rd155, %r476; 2026-02-21T09:09:57.7702647Z shl.b64 %rd156, %rd155, 32; 2026-02-21T09:09:57.7702707Z or.b64 %rd157, %rd154, %rd156; 2026-02-21T09:09:57.7702759Z $L__tmp47: 2026-02-21T09:09:57.7702954Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7703017Z mov.b64 {%r561, %r562}, %rd157; 2026-02-21T09:09:57.7703084Z cvt.rn.bf16x2.f32 %r563, %r562, %r561; 2026-02-21T09:09:57.7703145Z $L__tmp48: 
2026-02-21T09:09:57.7703350Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7703410Z cvt.u64.u32 %rd158, %r477; 2026-02-21T09:09:57.7703477Z cvt.u64.u32 %rd159, %r478; 2026-02-21T09:09:57.7703537Z shl.b64 %rd160, %rd159, 32; 2026-02-21T09:09:57.7703598Z or.b64 %rd161, %rd158, %rd160; 2026-02-21T09:09:57.7703651Z $L__tmp49: 2026-02-21T09:09:57.7703825Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7703885Z mov.b64 {%r564, %r565}, %rd161; 2026-02-21T09:09:57.7703950Z cvt.rn.bf16x2.f32 %r566, %r565, %r564; 2026-02-21T09:09:57.7704011Z $L__tmp50: 2026-02-21T09:09:57.7704221Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7704280Z cvt.u64.u32 %rd162, %r479; 2026-02-21T09:09:57.7704338Z cvt.u64.u32 %rd163, %r480; 2026-02-21T09:09:57.7704402Z shl.b64 %rd164, %rd163, 32; 2026-02-21T09:09:57.7704461Z or.b64 %rd165, %rd162, %rd164; 2026-02-21T09:09:57.7704513Z $L__tmp51: 2026-02-21T09:09:57.7704685Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7704745Z mov.b64 {%r567, %r568}, %rd165; 2026-02-21T09:09:57.7704835Z cvt.rn.bf16x2.f32 %r569, %r568, %r567; 2026-02-21T09:09:57.7704891Z $L__tmp52: 2026-02-21T09:09:57.7705103Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7705161Z cvt.u64.u32 %rd166, %r481; 2026-02-21T09:09:57.7705218Z cvt.u64.u32 %rd167, %r482; 2026-02-21T09:09:57.7705284Z shl.b64 %rd168, %rd167, 32; 2026-02-21T09:09:57.7705343Z or.b64 %rd169, %rd166, %rd168; 2026-02-21T09:09:57.7705426Z $L__tmp53: 2026-02-21T09:09:57.7705595Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7705655Z mov.b64 {%r570, %r571}, %rd169; 2026-02-21T09:09:57.7705719Z cvt.rn.bf16x2.f32 %r572, %r571, %r570; 
2026-02-21T09:09:57.7705778Z $L__tmp54: 2026-02-21T09:09:57.7705977Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7706036Z cvt.u64.u32 %rd170, %r483; 2026-02-21T09:09:57.7706092Z cvt.u64.u32 %rd171, %r484; 2026-02-21T09:09:57.7706182Z shl.b64 %rd172, %rd171, 32; 2026-02-21T09:09:57.7706242Z or.b64 %rd173, %rd170, %rd172; 2026-02-21T09:09:57.7706292Z $L__tmp55: 2026-02-21T09:09:57.7706460Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7706519Z mov.b64 {%r573, %r574}, %rd173; 2026-02-21T09:09:57.7706586Z cvt.rn.bf16x2.f32 %r575, %r574, %r573; 2026-02-21T09:09:57.7706639Z $L__tmp56: 2026-02-21T09:09:57.7706848Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7706907Z cvt.u64.u32 %rd174, %r485; 2026-02-21T09:09:57.7706963Z cvt.u64.u32 %rd175, %r486; 2026-02-21T09:09:57.7707029Z shl.b64 %rd176, %rd175, 32; 2026-02-21T09:09:57.7707087Z or.b64 %rd177, %rd174, %rd176; 2026-02-21T09:09:57.7707139Z $L__tmp57: 2026-02-21T09:09:57.7707309Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7707369Z mov.b64 {%r576, %r577}, %rd177; 2026-02-21T09:09:57.7707435Z cvt.rn.bf16x2.f32 %r578, %r577, %r576; 2026-02-21T09:09:57.7707485Z $L__tmp58: 2026-02-21T09:09:57.7707696Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7707781Z cvt.u64.u32 %rd178, %r487; 2026-02-21T09:09:57.7707841Z cvt.u64.u32 %rd179, %r488; 2026-02-21T09:09:57.7707907Z shl.b64 %rd180, %rd179, 32; 2026-02-21T09:09:57.7707966Z or.b64 %rd181, %rd178, %rd180; 2026-02-21T09:09:57.7708018Z $L__tmp59: 2026-02-21T09:09:57.7708190Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7708250Z mov.b64 {%r579, %r580}, %rd181; 2026-02-21T09:09:57.7708315Z 
cvt.rn.bf16x2.f32 %r581, %r580, %r579; 2026-02-21T09:09:57.7708365Z $L__tmp60: 2026-02-21T09:09:57.7708581Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7708638Z cvt.u64.u32 %rd182, %r489; 2026-02-21T09:09:57.7708694Z cvt.u64.u32 %rd183, %r490; 2026-02-21T09:09:57.7708763Z shl.b64 %rd184, %rd183, 32; 2026-02-21T09:09:57.7708822Z or.b64 %rd185, %rd182, %rd184; 2026-02-21T09:09:57.7708873Z $L__tmp61: 2026-02-21T09:09:57.7709045Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7709107Z mov.b64 {%r582, %r583}, %rd185; 2026-02-21T09:09:57.7709175Z cvt.rn.bf16x2.f32 %r584, %r583, %r582; 2026-02-21T09:09:57.7709229Z $L__tmp62: 2026-02-21T09:09:57.7709449Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7709507Z cvt.u64.u32 %rd186, %r492; 2026-02-21T09:09:57.7709565Z cvt.u64.u32 %rd187, %r493; 2026-02-21T09:09:57.7709656Z shl.b64 %rd188, %rd187, 32; 2026-02-21T09:09:57.7709715Z or.b64 %rd189, %rd186, %rd188; 2026-02-21T09:09:57.7709768Z $L__tmp63: 2026-02-21T09:09:57.7709931Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7709997Z mov.b64 {%r585, %r586}, %rd189; 2026-02-21T09:09:57.7710061Z cvt.rn.bf16x2.f32 %r587, %r586, %r585; 2026-02-21T09:09:57.7710114Z $L__tmp64: 2026-02-21T09:09:57.7710324Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7710406Z cvt.u64.u32 %rd190, %r494; 2026-02-21T09:09:57.7710464Z cvt.u64.u32 %rd191, %r495; 2026-02-21T09:09:57.7710531Z shl.b64 %rd192, %rd191, 32; 2026-02-21T09:09:57.7710589Z or.b64 %rd193, %rd190, %rd192; 2026-02-21T09:09:57.7710641Z $L__tmp65: 2026-02-21T09:09:57.7710808Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7710876Z mov.b64 {%r588, %r589}, 
%rd193; 2026-02-21T09:09:57.7710942Z cvt.rn.bf16x2.f32 %r590, %r589, %r588; 2026-02-21T09:09:57.7711015Z $L__tmp66: 2026-02-21T09:09:57.7711227Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7711285Z cvt.u64.u32 %rd194, %r496; 2026-02-21T09:09:57.7711342Z cvt.u64.u32 %rd195, %r497; 2026-02-21T09:09:57.7711409Z shl.b64 %rd196, %rd195, 32; 2026-02-21T09:09:57.7711471Z or.b64 %rd197, %rd194, %rd196; 2026-02-21T09:09:57.7711522Z $L__tmp67: 2026-02-21T09:09:57.7711707Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7711775Z mov.b64 {%r591, %r592}, %rd197; 2026-02-21T09:09:57.7711840Z cvt.rn.bf16x2.f32 %r593, %r592, %r591; 2026-02-21T09:09:57.7711891Z $L__tmp68: 2026-02-21T09:09:57.7712100Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7712160Z cvt.u64.u32 %rd198, %r498; 2026-02-21T09:09:57.7712217Z cvt.u64.u32 %rd199, %r499; 2026-02-21T09:09:57.7712283Z shl.b64 %rd200, %rd199, 32; 2026-02-21T09:09:57.7712341Z or.b64 %rd201, %rd198, %rd200; 2026-02-21T09:09:57.7712393Z $L__tmp69: 2026-02-21T09:09:57.7712553Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7712649Z mov.b64 {%r594, %r595}, %rd201; 2026-02-21T09:09:57.7712719Z cvt.rn.bf16x2.f32 %r596, %r595, %r594; 2026-02-21T09:09:57.7712770Z $L__tmp70: 2026-02-21T09:09:57.7712985Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7713042Z cvt.u64.u32 %rd202, %r500; 2026-02-21T09:09:57.7713099Z cvt.u64.u32 %rd203, %r501; 2026-02-21T09:09:57.7713157Z shl.b64 %rd204, %rd203, 32; 2026-02-21T09:09:57.7713222Z or.b64 %rd205, %rd202, %rd204; 2026-02-21T09:09:57.7713276Z $L__tmp71: 2026-02-21T09:09:57.7713442Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 
2026-02-21T09:09:57.7713507Z mov.b64 {%r597, %r598}, %rd205; 2026-02-21T09:09:57.7713571Z cvt.rn.bf16x2.f32 %r599, %r598, %r597; 2026-02-21T09:09:57.7713622Z $L__tmp72: 2026-02-21T09:09:57.7713835Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7713893Z cvt.u64.u32 %rd206, %r502; 2026-02-21T09:09:57.7713950Z cvt.u64.u32 %rd207, %r503; 2026-02-21T09:09:57.7714007Z shl.b64 %rd208, %rd207, 32; 2026-02-21T09:09:57.7714073Z or.b64 %rd209, %rd206, %rd208; 2026-02-21T09:09:57.7714124Z $L__tmp73: 2026-02-21T09:09:57.7714287Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7714353Z mov.b64 {%r600, %r601}, %rd209; 2026-02-21T09:09:57.7714418Z cvt.rn.bf16x2.f32 %r602, %r601, %r600; 2026-02-21T09:09:57.7714497Z $L__tmp74: 2026-02-21T09:09:57.7714709Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7714767Z cvt.u64.u32 %rd210, %r504; 2026-02-21T09:09:57.7714824Z cvt.u64.u32 %rd211, %r505; 2026-02-21T09:09:57.7714881Z shl.b64 %rd212, %rd211, 32; 2026-02-21T09:09:57.7714946Z or.b64 %rd213, %rd210, %rd212; 2026-02-21T09:09:57.7714998Z $L__tmp75: 2026-02-21T09:09:57.7715188Z .loc 1 97 28 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7715253Z mov.b64 {%r603, %r604}, %rd213; 2026-02-21T09:09:57.7715318Z cvt.rn.bf16x2.f32 %r605, %r604, %r603; 2026-02-21T09:09:57.7715371Z $L__tmp76: 2026-02-21T09:09:57.7715583Z .loc 2 291 36 // standard.py:291:36 @[ c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:94:40 ] 2026-02-21T09:09:57.7715640Z cvt.u64.u32 %rd214, %r506; 2026-02-21T09:09:57.7715698Z cvt.u64.u32 %rd215, %r507; 2026-02-21T09:09:57.7715756Z shl.b64 %rd216, %rd215, 32; 2026-02-21T09:09:57.7715845Z or.b64 %rd217, %rd214, %rd216; 2026-02-21T09:09:57.7715899Z $L__tmp77: 2026-02-21T09:09:57.7716060Z .loc 1 97 28 // 
c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:97:28 2026-02-21T09:09:57.7716124Z mov.b64 {%r606, %r607}, %rd217; 2026-02-21T09:09:57.7716189Z cvt.rn.bf16x2.f32 %r608, %r607, %r606; 2026-02-21T09:09:57.7716353Z .loc 1 98 43 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:98:43 2026-02-21T09:09:57.7716454Z st.shared.v4.b32 [%r20], {%r515, %r518, %r521, %r524}; 2026-02-21T09:09:57.7716546Z st.shared.v4.b32 [%r21], {%r527, %r530, %r533, %r536}; 2026-02-21T09:09:57.7716633Z st.shared.v4.b32 [%r22], {%r539, %r542, %r545, %r548}; 2026-02-21T09:09:57.7716715Z st.shared.v4.b32 [%r23], {%r551, %r554, %r557, %r560}; 2026-02-21T09:09:57.7716806Z st.shared.v4.b32 [%r24], {%r563, %r566, %r569, %r572}; 2026-02-21T09:09:57.7716890Z st.shared.v4.b32 [%r25], {%r575, %r578, %r581, %r584}; 2026-02-21T09:09:57.7716973Z st.shared.v4.b32 [%r26], {%r587, %r590, %r593, %r596}; 2026-02-21T09:09:57.7717062Z st.shared.v4.b32 [%r27], {%r599, %r602, %r605, %r608}; 2026-02-21T09:09:57.7717119Z // begin inline asm 2026-02-21T09:09:57.7717194Z fence.proxy.async.shared::cta; 2026-02-21T09:09:57.7717250Z // end inline asm 2026-02-21T09:09:57.7717311Z bar.sync 0; 2026-02-21T09:09:57.7717400Z elect.sync %r609|%p97, -1; 2026-02-21T09:09:57.7717467Z and.pred %p95, %p96, %p97; 2026-02-21T09:09:57.7717537Z and.b32 %r610, %r37, 1; 2026-02-21T09:09:57.7717597Z shl.b32 %r611, %r610, 13; 2026-02-21T09:09:57.7717658Z add.s32 %r511, %r54, %r611; 2026-02-21T09:09:57.7717729Z shl.b32 %r612, %r610, 6; 2026-02-21T09:09:57.7717790Z or.b32 %r509, %r612, %r315; 2026-02-21T09:09:57.7717846Z // begin inline asm 2026-02-21T09:09:57.7718025Z @%p95 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd89, {%r509, %r510}], [%r511]; 2026-02-21T09:09:57.7718092Z // end inline asm 2026-02-21T09:09:57.7718159Z cp.async.bulk.commit_group; 2026-02-21T09:09:57.7718233Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:09:57.7718292Z bar.sync 0; 2026-02-21T09:09:57.7718375Z $L__BB0_8: // %._crit_edge 
2026-02-21T09:09:57.7718542Z .loc 1 31 4 // c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py:31:4 2026-02-21T09:09:57.7718604Z bar.sync 0; 2026-02-21T09:09:57.7718663Z // begin inline asm 2026-02-21T09:09:57.7718787Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r613, 256; 2026-02-21T09:09:57.7718845Z // end inline asm 2026-02-21T09:09:57.7718907Z ret; 2026-02-21T09:09:57.7718963Z $L__tmp78: 2026-02-21T09:09:57.7719021Z $L__func_end0: 2026-02-21T09:09:57.7719114Z // -- End function 2026-02-21T09:09:57.7719168Z } 2026-02-21T09:09:57.7719377Z .file 1 "/tmp/torchinductor_root/55/c55vgw7dfmhrcxbail5gf25xdnwsqn4bme6qxhft6uxjh4gnc5pv.py" 2026-02-21T09:09:57.7719579Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:09:57.7719655Z .section .debug_abbrev 2026-02-21T09:09:57.7719709Z { 2026-02-21T09:09:57.7719803Z .b8 1 // Abbreviation Code 2026-02-21T09:09:57.7719904Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:09:57.7719991Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:57.7720113Z .b8 37 // DW_AT_producer 2026-02-21T09:09:57.7720202Z .b8 8 // DW_FORM_string 2026-02-21T09:09:57.7720281Z .b8 19 // DW_AT_language 2026-02-21T09:09:57.7720361Z .b8 5 // DW_FORM_data2 2026-02-21T09:09:57.7720441Z .b8 3 // DW_AT_name 2026-02-21T09:09:57.7720528Z .b8 8 // DW_FORM_string 2026-02-21T09:09:57.7720611Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:09:57.7720712Z .b8 6 // DW_FORM_data4 2026-02-21T09:09:57.7720800Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:09:57.7720876Z .b8 8 // DW_FORM_string 2026-02-21T09:09:57.7720952Z .b8 0 // EOM(1) 2026-02-21T09:09:57.7721034Z .b8 0 // EOM(2) 2026-02-21T09:09:57.7721122Z .b8 2 // Abbreviation Code 2026-02-21T09:09:57.7721207Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:57.7721283Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:57.7721367Z .b8 3 // DW_AT_name 2026-02-21T09:09:57.7721441Z .b8 8 // DW_FORM_string 2026-02-21T09:09:57.7721520Z .b8 32 // DW_AT_inline 2026-02-21T09:09:57.7721637Z .b8 
11 // DW_FORM_data1 2026-02-21T09:09:57.7721711Z .b8 0 // EOM(1) 2026-02-21T09:09:57.7721781Z .b8 0 // EOM(2) 2026-02-21T09:09:57.7721873Z .b8 3 // Abbreviation Code 2026-02-21T09:09:57.7721956Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:09:57.7722064Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:09:57.7722155Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:57.7722232Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:57.7722312Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:57.7722387Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:57.7722482Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:57.7722556Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:57.7722627Z .b8 0 // EOM(1) 2026-02-21T09:09:57.7722704Z .b8 0 // EOM(2) 2026-02-21T09:09:57.7722784Z .b8 4 // Abbreviation Code 2026-02-21T09:09:57.7722879Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:09:57.7722964Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:09:57.7723054Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:09:57.7723130Z .b8 19 // DW_FORM_ref4 2026-02-21T09:09:57.7723206Z .b8 17 // DW_AT_low_pc 2026-02-21T09:09:57.7723287Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:57.7723367Z .b8 18 // DW_AT_high_pc 2026-02-21T09:09:57.7723442Z .b8 1 // DW_FORM_addr 2026-02-21T09:09:57.7723530Z .b8 88 // DW_AT_call_file 2026-02-21T09:09:57.7723632Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:57.7723713Z .b8 89 // DW_AT_call_line 2026-02-21T09:09:57.7723797Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:57.7723878Z .b8 87 // DW_AT_call_column 2026-02-21T09:09:57.7723954Z .b8 11 // DW_FORM_data1 2026-02-21T09:09:57.7724026Z .b8 0 // EOM(1) 2026-02-21T09:09:57.7724133Z .b8 0 // EOM(2) 2026-02-21T09:09:57.7724202Z .b8 0 // EOM(3) 2026-02-21T09:09:57.7724255Z } 2026-02-21T09:09:57.7724324Z .section .debug_info 2026-02-21T09:09:57.7724377Z { 2026-02-21T09:09:57.7724464Z .b32 178 // Length of Unit 2026-02-21T09:09:57.7724558Z .b8 2 // DWARF version number 2026-02-21T09:09:57.7724615Z .b8 0 2026-02-21T09:09:57.7724735Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:09:57.7724855Z .b8 8 // Address Size (in bytes) 2026-02-21T09:09:57.7724969Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:09:57.7725052Z .b8 116 // DW_AT_producer 2026-02-21T09:09:57.7725109Z .b8 114 2026-02-21T09:09:57.7725173Z .b8 105 2026-02-21T09:09:57.7725229Z .b8 116 2026-02-21T09:09:57.7725284Z .b8 111 2026-02-21T09:09:57.7725338Z .b8 110 2026-02-21T09:09:57.7725400Z .b8 0 2026-02-21T09:09:57.7725478Z .b8 2 // DW_AT_language 2026-02-21T09:09:57.7725532Z .b8 0 2026-02-21T09:09:57.7725618Z .b8 99 // DW_AT_name 2026-02-21T09:09:57.7725674Z .b8 53 2026-02-21T09:09:57.7725730Z .b8 53 2026-02-21T09:09:57.7725784Z .b8 118 2026-02-21T09:09:57.7725854Z .b8 103 2026-02-21T09:09:57.7725908Z .b8 119 2026-02-21T09:09:57.7725965Z .b8 55 2026-02-21T09:09:57.7726027Z .b8 100 2026-02-21T09:09:57.7726081Z .b8 102 2026-02-21T09:09:57.7726134Z .b8 109 2026-02-21T09:09:57.7726189Z .b8 104 2026-02-21T09:09:57.7726250Z .b8 114 2026-02-21T09:09:57.7726303Z .b8 99 2026-02-21T09:09:57.7726357Z .b8 120 2026-02-21T09:09:57.7726427Z .b8 98 2026-02-21T09:09:57.7726478Z .b8 97 2026-02-21T09:09:57.7726528Z .b8 105 2026-02-21T09:09:57.7726579Z .b8 108 2026-02-21T09:09:57.7726637Z .b8 53 2026-02-21T09:09:57.7726706Z .b8 103 2026-02-21T09:09:57.7726758Z .b8 102 2026-02-21T09:09:57.7726810Z .b8 50 2026-02-21T09:09:57.7726870Z .b8 53 2026-02-21T09:09:57.7726921Z .b8 120 2026-02-21T09:09:57.7726971Z .b8 100 2026-02-21T09:09:57.7727028Z .b8 110 2026-02-21T09:09:57.7727077Z .b8 119 2026-02-21T09:09:57.7727127Z .b8 115 2026-02-21T09:09:57.7727177Z .b8 113 2026-02-21T09:09:57.7727236Z .b8 110 2026-02-21T09:09:57.7727287Z .b8 52 2026-02-21T09:09:57.7727338Z .b8 98 2026-02-21T09:09:57.7727395Z .b8 109 2026-02-21T09:09:57.7727445Z .b8 101 2026-02-21T09:09:57.7727497Z .b8 54 2026-02-21T09:09:57.7727547Z .b8 113 2026-02-21T09:09:57.7727606Z .b8 120 2026-02-21T09:09:57.7727656Z .b8 104 2026-02-21T09:09:57.7727707Z .b8 102 2026-02-21T09:09:57.7727766Z .b8 116 
2026-02-21T09:09:57.7727816Z .b8 54 2026-02-21T09:09:57.7727867Z .b8 117 2026-02-21T09:09:57.7727918Z .b8 120 2026-02-21T09:09:57.7727976Z .b8 106 2026-02-21T09:09:57.7728026Z .b8 104 2026-02-21T09:09:57.7728076Z .b8 52 2026-02-21T09:09:57.7728126Z .b8 103 2026-02-21T09:09:57.7728184Z .b8 110 2026-02-21T09:09:57.7728236Z .b8 99 2026-02-21T09:09:57.7728287Z .b8 53 2026-02-21T09:09:57.7728343Z .b8 112 2026-02-21T09:09:57.7728393Z .b8 118 2026-02-21T09:09:57.7728443Z .b8 46 2026-02-21T09:09:57.7728493Z .b8 112 2026-02-21T09:09:57.7728549Z .b8 121 2026-02-21T09:09:57.7728599Z .b8 0 2026-02-21T09:09:57.7728688Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:09:57.7728767Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:09:57.7728817Z .b8 116 2026-02-21T09:09:57.7728867Z .b8 109 2026-02-21T09:09:57.7728938Z .b8 112 2026-02-21T09:09:57.7728994Z .b8 47 2026-02-21T09:09:57.7729044Z .b8 116 2026-02-21T09:09:57.7729096Z .b8 111 2026-02-21T09:09:57.7729152Z .b8 114 2026-02-21T09:09:57.7729201Z .b8 99 2026-02-21T09:09:57.7729250Z .b8 104 2026-02-21T09:09:57.7729299Z .b8 105 2026-02-21T09:09:57.7729356Z .b8 110 2026-02-21T09:09:57.7729406Z .b8 100 2026-02-21T09:09:57.7729455Z .b8 117 2026-02-21T09:09:57.7729505Z .b8 99 2026-02-21T09:09:57.7729565Z .b8 116 2026-02-21T09:09:57.7729637Z .b8 111 2026-02-21T09:09:57.7729686Z .b8 114 2026-02-21T09:09:57.7729745Z .b8 95 2026-02-21T09:09:57.7729793Z .b8 114 2026-02-21T09:09:57.7729843Z .b8 111 2026-02-21T09:09:57.7729893Z .b8 111 2026-02-21T09:09:57.7729951Z .b8 116 2026-02-21T09:09:57.7730001Z .b8 47 2026-02-21T09:09:57.7730052Z .b8 53 2026-02-21T09:09:57.7730110Z .b8 53 2026-02-21T09:09:57.7730161Z .b8 0 2026-02-21T09:09:57.7730259Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:09:57.7730335Z .b8 95 // DW_AT_name 2026-02-21T09:09:57.7730395Z .b8 104 2026-02-21T09:09:57.7730445Z .b8 101 2026-02-21T09:09:57.7730517Z .b8 108 2026-02-21T09:09:57.7730575Z .b8 105 2026-02-21T09:09:57.7730624Z .b8 111 2026-02-21T09:09:57.7730674Z 
.b8 110 2026-02-21T09:09:57.7730724Z .b8 95 2026-02-21T09:09:57.7730782Z .b8 109 2026-02-21T09:09:57.7730833Z .b8 97 2026-02-21T09:09:57.7730882Z .b8 116 2026-02-21T09:09:57.7730939Z .b8 109 2026-02-21T09:09:57.7730991Z .b8 117 2026-02-21T09:09:57.7731044Z .b8 108 2026-02-21T09:09:57.7731095Z .b8 95 2026-02-21T09:09:57.7731156Z .b8 98 2026-02-21T09:09:57.7731207Z .b8 102 2026-02-21T09:09:57.7731259Z .b8 49 2026-02-21T09:09:57.7731309Z .b8 54 2026-02-21T09:09:57.7731371Z .b8 95 2026-02-21T09:09:57.7731421Z .b8 105 2026-02-21T09:09:57.7731471Z .b8 110 2026-02-21T09:09:57.7731528Z .b8 116 2026-02-21T09:09:57.7731634Z .b8 52 2026-02-21T09:09:57.7731686Z .b8 0 2026-02-21T09:09:57.7731762Z .b8 1 // DW_AT_inline 2026-02-21T09:09:57.7731867Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:09:57.7731955Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:09:57.7732041Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:09:57.7732139Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:57.7732287Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:09:57.7732378Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:09:57.7732468Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:09:57.7732554Z .b64 $L__tmp77 // DW_AT_high_pc 2026-02-21T09:09:57.7732633Z .b8 1 // DW_AT_call_file 2026-02-21T09:09:57.7732716Z .b8 94 // DW_AT_call_line 2026-02-21T09:09:57.7732796Z .b8 40 // DW_AT_call_column 2026-02-21T09:09:57.7732882Z .b8 0 // End Of Children Mark 2026-02-21T09:09:57.7732963Z .b8 0 // End Of Children Mark 2026-02-21T09:09:57.7733022Z } 2026-02-21T09:09:57.7733089Z .section .debug_macinfo { } 2026-02-21T09:09:57.7733093Z 2026-02-21T09:09:57.7733167Z ================================================================ 2026-02-21T09:09:57.7733279Z please share the reproducer above with Triton project. 
2026-02-21T09:09:58.7818807Z 2026-02-21T09:09:58.7824590Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 102/102 6.5 configs/s 2026-02-21T09:10:00.6596244Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━ 680/680 348.6 configs/s 2026-02-21T09:10:00.7654082Z [187s] Generation 2 complete: 2026-02-21T09:10:00.7655602Z error=22 2026-02-21T09:10:00.7655798Z timeout=6 2026-02-21T09:10:00.7655968Z ok=75 2026-02-21T09:10:00.7656127Z min=0.2959 2026-02-21T09:10:00.7656283Z mid=1.4941 2026-02-21T09:10:00.7656441Z max=302.2316 2026-02-21T09:10:00.7656938Z best={'block_sizes': [8, 128, 128], 2026-02-21T09:10:00.7657239Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:10:00.7657499Z 'l2_groupings': [2], 2026-02-21T09:10:00.7657711Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:10:00.7657925Z 'loop_orders': [[0, 1]], 2026-02-21T09:10:00.7658098Z 'num_stages': 3, 2026-02-21T09:10:00.7658278Z 'num_warps': 4, 2026-02-21T09:10:00.7658470Z 'pid_type': 'flat', 2026-02-21T09:10:00.7658775Z 'range_flattens': [None, False], 2026-02-21T09:10:00.7659003Z 'range_multi_buffers': [None, False], 2026-02-21T09:10:00.7659244Z 'range_num_stages': [0, 4], 2026-02-21T09:10:00.7659458Z 'range_unroll_factors': [0, 0], 2026-02-21T09:10:00.7659672Z 'range_warp_specializes': [None, None]} 2026-02-21T09:10:00.7672944Z [187s] Fitting surrogate: 307 points, 307 targets 2026-02-21T09:10:02.1466944Z [189s] Generation 3 starting: 102 neighbors, 5 active search path(s) 2026-02-21T09:10:38.2517457Z [225s] Timeout after 30s compiling Config(block_sizes=[32, 128, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[False, False]) 
2026-02-21T09:10:38.2532431Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 104/104 0.4 configs/s 2026-02-21T09:10:41.0402977Z [228s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:10:41.0404098Z Tensor-likes are not close! 2026-02-21T09:10:41.0407855Z 2026-02-21T09:10:41.0411273Z Mismatched elements: 33485243 / 33554432 (99.8%) 2026-02-21T09:10:41.0414727Z Greatest absolute difference: 2432.0 at index (2056, 3359) (up to 0.01 allowed) 2026-02-21T09:10:41.0418480Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:10:41.0422977Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:10:41.0423534Z 2026-02-21T09:10:45.4170423Z 2026-02-21T09:10:45.4175801Z 2026-02-21T09:10:45.4178282Z ================================================================ 2026-02-21T09:10:45.4178628Z Internal Triton PTX codegen error 2026-02-21T09:10:45.4182412Z `ptxas` stderr: 2026-02-21T09:10:45.4187272Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 326 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:10:45.4191297Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:45.4192649Z 2026-02-21T09:10:45.4193176Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp3r3a8lyp.ptx -o /tmp/tmp3r3a8lyp.ptx.o 2026-02-21T09:10:45.4193642Z 2026-02-21T09:10:45.4193646Z 2026-02-21T09:10:45.4193752Z // 2026-02-21T09:10:45.4198435Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:10:45.4199851Z // 2026-02-21T09:10:45.4199995Z 2026-02-21T09:10:45.4204562Z .version 8.7 2026-02-21T09:10:45.4209571Z .target sm_100a 2026-02-21T09:10:45.4209846Z .address_size 64 2026-02-21T09:10:45.4209971Z 2026-02-21T09:10:45.4210214Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:10:45.4215175Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:10:45.4219271Z // @_helion_matmul_bf16_int4 2026-02-21T09:10:45.4222817Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:10:45.4226360Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:10:45.4227540Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:10:45.4227845Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:10:45.4228140Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:10:45.4228422Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:10:45.4228662Z ) 2026-02-21T09:10:45.4228897Z .reqntid 256 2026-02-21T09:10:45.4229057Z .maxnreg 32 2026-02-21T09:10:45.4229187Z { 2026-02-21T09:10:45.4229335Z .reg .pred %p<332>; 2026-02-21T09:10:45.4229503Z .reg .b16 %rs<568>; 2026-02-21T09:10:45.4230038Z .reg .b32 %r<1891>; 2026-02-21T09:10:45.4230205Z .reg .b64 %rd<584>; 2026-02-21T09:10:45.4230468Z .loc 1 19 0 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:19:0 2026-02-21T09:10:45.4230761Z $L__func_begin0: 
2026-02-21T09:10:45.4231006Z .loc 1 19 0 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:19:0 2026-02-21T09:10:45.4231251Z 2026-02-21T09:10:45.4231355Z // %bb.0: 2026-02-21T09:10:45.4231635Z ld.param.b64 %rd52, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:10:45.4231851Z $L__tmp0: 2026-02-21T09:10:45.4232098Z .loc 1 19 0 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:19 2026-02-21T09:10:45.4232373Z mov.u32 %r1, %tid.x; 2026-02-21T09:10:45.4232622Z ld.param.b64 %rd54, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:10:45.4232843Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:10:45.4233046Z ld.param.b64 %rd72, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:10:45.4233257Z mov.b32 %r179, global_smem; 2026-02-21T09:10:45.4233426Z // begin inline asm 2026-02-21T09:10:45.4233689Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r179], 128; 2026-02-21T09:10:45.4233943Z // end inline asm 2026-02-21T09:10:45.4234128Z ld.param.b64 %rd89, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:10:45.4234330Z bar.sync 0; 2026-02-21T09:10:45.4234485Z ld.shared.b32 %r1846, [global_smem]; 2026-02-21T09:10:45.4234658Z bar.sync 0; 2026-02-21T09:10:45.4234796Z // begin inline asm 2026-02-21T09:10:45.4235004Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:10:45.4235235Z // end inline asm 2026-02-21T09:10:45.4235496Z .loc 1 21 66 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:21:66 2026-02-21T09:10:45.4235780Z mov.u32 %r1855, %ctaid.x; 2026-02-21T09:10:45.4235949Z mov.u32 %r196, %ctaid.y; 2026-02-21T09:10:45.4236103Z mov.u32 %r197, %ctaid.z; 2026-02-21T09:10:45.4236260Z mov.u32 %r198, %nctaid.x; 2026-02-21T09:10:45.4236413Z mov.u32 %r199, %nctaid.y; 2026-02-21T09:10:45.4236580Z mad.lo.s32 %r200, %r197, %r199, %r196; 2026-02-21T09:10:45.4236762Z mad.lo.s32 %r201, %r200, %r198, %r1855; 2026-02-21T09:10:45.4236942Z shl.b32 %r202, %r201, 8; 2026-02-21T09:10:45.4237101Z cvt.s64.s32 %rd90, %r202; 
2026-02-21T09:10:45.4237261Z add.s64 %rd68, %rd89, %rd90; 2026-02-21T09:10:45.4237427Z shl.b32 %r203, %r1, 2; 2026-02-21T09:10:45.4237580Z add.s32 %r180, %r179, %r203; 2026-02-21T09:10:45.4237758Z mov.b32 %r189, 0; 2026-02-21T09:10:45.4237902Z // begin inline asm 2026-02-21T09:10:45.4238070Z @%p1 st.shared.b32 [ %r180 + 0 ], %r189; 2026-02-21T09:10:45.4238255Z // end inline asm 2026-02-21T09:10:45.4238413Z bar.warp.sync -1; 2026-02-21T09:10:45.4238569Z setp.eq.b32 %p273, %r1, 0; 2026-02-21T09:10:45.4238748Z cvt.u64.u32 %rd53, %r179; 2026-02-21T09:10:45.4238915Z // begin inline asm 2026-02-21T09:10:45.4239187Z @%p273 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd53 + 0 ], %rd54; 2026-02-21T09:10:45.4239492Z // end inline asm 2026-02-21T09:10:45.4239637Z // begin inline asm 2026-02-21T09:10:45.4239889Z @%p273 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x1; 2026-02-21T09:10:45.4240164Z // end inline asm 2026-02-21T09:10:45.4240312Z mov.b32 %r182, 64; 2026-02-21T09:10:45.4240505Z // begin inline asm 2026-02-21T09:10:45.4240763Z @%p273 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0, %r182; 2026-02-21T09:10:45.4241055Z // end inline asm 2026-02-21T09:10:45.4241194Z mov.b32 %r183, 16; 2026-02-21T09:10:45.4241356Z // begin inline asm 2026-02-21T09:10:45.4241643Z @%p273 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x1, %r183; 2026-02-21T09:10:45.4241927Z // end inline asm 2026-02-21T09:10:45.4242114Z mov.b32 %r184, 8192; 2026-02-21T09:10:45.4242274Z // begin inline asm 2026-02-21T09:10:45.4242539Z @%p273 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0, %r184; 2026-02-21T09:10:45.4242827Z // end inline asm 2026-02-21T09:10:45.4242992Z mov.b32 %r185, 512; 2026-02-21T09:10:45.4243138Z // begin inline asm 2026-02-21T09:10:45.4243395Z @%p273 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x1, %r185; 2026-02-21T09:10:45.4243677Z // end inline 
asm 2026-02-21T09:10:45.4243831Z mov.b64 %rd61, 8192; 2026-02-21T09:10:45.4243980Z // begin inline asm 2026-02-21T09:10:45.4244287Z @%p273 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd53 + 0 ], 0x0, %rd61; 2026-02-21T09:10:45.4244595Z // end inline asm 2026-02-21T09:10:45.4244733Z mov.b32 %r186, 1; 2026-02-21T09:10:45.4244882Z // begin inline asm 2026-02-21T09:10:45.4245180Z @%p273 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0, %r186; 2026-02-21T09:10:45.4245493Z // end inline asm 2026-02-21T09:10:45.4245630Z // begin inline asm 2026-02-21T09:10:45.4245899Z @%p273 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x1, %r186; 2026-02-21T09:10:45.4246201Z // end inline asm 2026-02-21T09:10:45.4246343Z // begin inline asm 2026-02-21T09:10:45.4246617Z @%p273 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0; 2026-02-21T09:10:45.4246911Z // end inline asm 2026-02-21T09:10:45.4247061Z // begin inline asm 2026-02-21T09:10:45.4247318Z @%p273 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0; 2026-02-21T09:10:45.4247625Z // end inline asm 2026-02-21T09:10:45.4247782Z // begin inline asm 2026-02-21T09:10:45.4248027Z @%p273 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x2; 2026-02-21T09:10:45.4248306Z // end inline asm 2026-02-21T09:10:45.4248438Z // begin inline asm 2026-02-21T09:10:45.4248675Z @%p273 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0; 2026-02-21T09:10:45.4248933Z // end inline asm 2026-02-21T09:10:45.4249073Z // begin inline asm 2026-02-21T09:10:45.4249416Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd68 + 0 ], [ %rd53 + 0 ], 0x80; 2026-02-21T09:10:45.4249778Z // end inline asm 2026-02-21T09:10:45.4249918Z // begin inline asm 2026-02-21T09:10:45.4250124Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd68 + 0 ], 0x80; 
2026-02-21T09:10:45.4250386Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:45.4250577Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:45.4250761Z // end inline asm 2026-02-21T09:10:45.4250895Z bar.sync 0; 2026-02-21T09:10:45.4251046Z cvta.global.u64 %rd497, %rd68; 2026-02-21T09:10:45.4251330Z .loc 1 23 67 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:23:67 2026-02-21T09:10:45.4251642Z add.s64 %rd86, %rd68, 128; 2026-02-21T09:10:45.4251806Z bar.sync 0; 2026-02-21T09:10:45.4251939Z // begin inline asm 2026-02-21T09:10:45.4252098Z @%p1 st.shared.b32 [ %r180 + 0 ], %r189; 2026-02-21T09:10:45.4252270Z // end inline asm 2026-02-21T09:10:45.4252418Z bar.warp.sync -1; 2026-02-21T09:10:45.4252559Z // begin inline asm 2026-02-21T09:10:45.4252814Z @%p273 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd53 + 0 ], %rd72; 2026-02-21T09:10:45.4253098Z // end inline asm 2026-02-21T09:10:45.4253229Z // begin inline asm 2026-02-21T09:10:45.4253461Z @%p273 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x1; 2026-02-21T09:10:45.4253743Z // end inline asm 2026-02-21T09:10:45.4253883Z // begin inline asm 2026-02-21T09:10:45.4254118Z @%p273 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0, %r182; 2026-02-21T09:10:45.4254388Z // end inline asm 2026-02-21T09:10:45.4254527Z mov.b32 %r191, 128; 2026-02-21T09:10:45.4254669Z // begin inline asm 2026-02-21T09:10:45.4254911Z @%p273 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x1, %r191; 2026-02-21T09:10:45.4255209Z // end inline asm 2026-02-21T09:10:45.4255351Z // begin inline asm 2026-02-21T09:10:45.4255590Z @%p273 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0, %r184; 2026-02-21T09:10:45.4255867Z // end inline asm 2026-02-21T09:10:45.4255999Z mov.b32 %r193, 4096; 2026-02-21T09:10:45.4256148Z // begin inline asm 2026-02-21T09:10:45.4256395Z @%p273 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd53 
+ 0 ], 0x1, %r193; 2026-02-21T09:10:45.4256667Z // end inline asm 2026-02-21T09:10:45.4256814Z mov.b64 %rd79, 16384; 2026-02-21T09:10:45.4256986Z // begin inline asm 2026-02-21T09:10:45.4257245Z @%p273 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd53 + 0 ], 0x0, %rd79; 2026-02-21T09:10:45.4257523Z // end inline asm 2026-02-21T09:10:45.4257664Z // begin inline asm 2026-02-21T09:10:45.4257958Z @%p273 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0, %r186; 2026-02-21T09:10:45.4258246Z // end inline asm 2026-02-21T09:10:45.4258393Z // begin inline asm 2026-02-21T09:10:45.4258642Z @%p273 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x1, %r186; 2026-02-21T09:10:45.4258928Z // end inline asm 2026-02-21T09:10:45.4259060Z // begin inline asm 2026-02-21T09:10:45.4259296Z @%p273 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd53 + 0 ], 0xa; 2026-02-21T09:10:45.4259553Z // end inline asm 2026-02-21T09:10:45.4259693Z // begin inline asm 2026-02-21T09:10:45.4259949Z @%p273 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0; 2026-02-21T09:10:45.4260227Z // end inline asm 2026-02-21T09:10:45.4260370Z // begin inline asm 2026-02-21T09:10:45.4260602Z @%p273 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x3; 2026-02-21T09:10:45.4260874Z // end inline asm 2026-02-21T09:10:45.4261005Z // begin inline asm 2026-02-21T09:10:45.4261237Z @%p273 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd53 + 0 ], 0x0; 2026-02-21T09:10:45.4261497Z // end inline asm 2026-02-21T09:10:45.4261668Z // begin inline asm 2026-02-21T09:10:45.4262010Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd86 + 0 ], [ %rd53 + 0 ], 0x80; 2026-02-21T09:10:45.4262370Z // end inline asm 2026-02-21T09:10:45.4262511Z // begin inline asm 2026-02-21T09:10:45.4262719Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd86 + 
0 ], 0x80; 2026-02-21T09:10:45.4262974Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:45.4263168Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:45.4263345Z // end inline asm 2026-02-21T09:10:45.4263485Z bar.sync 0; 2026-02-21T09:10:45.4263627Z cvta.global.u64 %rd507, %rd86; 2026-02-21T09:10:45.4263912Z .loc 1 30 49 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:30:49 2026-02-21T09:10:45.4264201Z min.u32 %r4, %r1855, 4095; 2026-02-21T09:10:45.4264474Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4264759Z sub.s32 %r204, %r4, %r1855; 2026-02-21T09:10:45.4264923Z add.s32 %r205, %r204, 1; 2026-02-21T09:10:45.4265081Z shr.s32 %r206, %r205, 31; 2026-02-21T09:10:45.4265233Z shr.u32 %r207, %r206, 30; 2026-02-21T09:10:45.4265395Z add.s32 %r208, %r205, %r207; 2026-02-21T09:10:45.4265551Z and.b32 %r209, %r208, -4; 2026-02-21T09:10:45.4265713Z add.s32 %r1884, %r209, %r1855; 2026-02-21T09:10:45.4265975Z .loc 1 44 45 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:44:45 2026-02-21T09:10:45.4266288Z shr.u32 %r6, %r1, 5; 2026-02-21T09:10:45.4266433Z bfe.u32 %r7, %r1, 2, 6; 2026-02-21T09:10:45.4266590Z or.b32 %r8, %r7, 64; 2026-02-21T09:10:45.4266841Z .loc 1 57 38 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:57:38 2026-02-21T09:10:45.4267118Z shl.b32 %r210, %r1, 3; 2026-02-21T09:10:45.4267280Z and.b32 %r9, %r210, 24; 2026-02-21T09:10:45.4267591Z .loc 1 75 38 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:75:38 2026-02-21T09:10:45.4267877Z and.b32 %r10, %r1, 64; 2026-02-21T09:10:45.4268127Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4268424Z setp.ge.s32 %p39, %r1855, %r1884; 2026-02-21T09:10:45.4268603Z shl.b32 %r1847, %r1, 4; 2026-02-21T09:10:45.4268758Z add.s32 %r1644, %r1846, 64; 2026-02-21T09:10:45.4268930Z and.b32 %r1849, %r1, 127; 2026-02-21T09:10:45.4269098Z and.b32 %r1850, %r1, 128; 
2026-02-21T09:10:45.4269294Z and.b32 %r1851, %r1, 63; 2026-02-21T09:10:45.4269456Z shr.u32 %r1852, %r1, 4; 2026-02-21T09:10:45.4269617Z and.b32 %r1853, %r1, 3; 2026-02-21T09:10:45.4269762Z shl.b32 %r1854, %r7, 10; 2026-02-21T09:10:45.4269924Z setp.eq.b32 %p331, %r10, 0; 2026-02-21T09:10:45.4270086Z cvt.u64.u32 %rd572, %r9; 2026-02-21T09:10:45.4270270Z @%p39 bra $L__BB0_27; 2026-02-21T09:10:45.4270447Z // %bb.1: // %.lr.ph 2026-02-21T09:10:45.4270737Z .loc 1 0 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:0:97 2026-02-21T09:10:45.4271018Z and.b32 %r11, %r1847, 4080; 2026-02-21T09:10:45.4271175Z add.s32 %r12, %r179, %r11; 2026-02-21T09:10:45.4271338Z add.s32 %r13, %r12, 4096; 2026-02-21T09:10:45.4271488Z add.s32 %r15, %r12, 8192; 2026-02-21T09:10:45.4271681Z add.s32 %r16, %r12, 12288; 2026-02-21T09:10:45.4271835Z shl.b32 %r214, %r1849, 6; 2026-02-21T09:10:45.4271996Z shr.u32 %r216, %r1850, 2; 2026-02-21T09:10:45.4272153Z or.b32 %r17, %r214, %r216; 2026-02-21T09:10:45.4272308Z add.s32 %r18, %r179, %r17; 2026-02-21T09:10:45.4272467Z shr.u32 %r218, %r1850, 1; 2026-02-21T09:10:45.4272617Z or.b32 %r19, %r218, %r1851; 2026-02-21T09:10:45.4272782Z add.s32 %r219, %r179, 24576; 2026-02-21T09:10:45.4272936Z add.s32 %r20, %r219, %r19; 2026-02-21T09:10:45.4273097Z xor.b32 %r21, %r19, 16; 2026-02-21T09:10:45.4273246Z add.s32 %r22, %r219, %r21; 2026-02-21T09:10:45.4273408Z xor.b32 %r23, %r19, 32; 2026-02-21T09:10:45.4273558Z add.s32 %r24, %r219, %r23; 2026-02-21T09:10:45.4273717Z xor.b32 %r25, %r19, 48; 2026-02-21T09:10:45.4273869Z add.s32 %r26, %r219, %r25; 2026-02-21T09:10:45.4274019Z shl.b32 %r220, %r1851, 7; 2026-02-21T09:10:45.4274178Z and.b32 %r221, %r1847, 112; 2026-02-21T09:10:45.4274332Z and.b32 %r223, %r1852, 12; 2026-02-21T09:10:45.4274490Z or.b32 %r224, %r220, %r223; 2026-02-21T09:10:45.4274642Z or.b32 %r225, %r224, %r221; 2026-02-21T09:10:45.4274804Z add.s32 %r226, %r179, 16384; 2026-02-21T09:10:45.4274959Z add.s32 %r27, %r226, %r225; 
2026-02-21T09:10:45.4275119Z xor.b32 %r227, %r225, 16; 2026-02-21T09:10:45.4275268Z add.s32 %r28, %r226, %r227; 2026-02-21T09:10:45.4275428Z xor.b32 %r228, %r225, 32; 2026-02-21T09:10:45.4275583Z add.s32 %r29, %r226, %r228; 2026-02-21T09:10:45.4275735Z xor.b32 %r229, %r225, 48; 2026-02-21T09:10:45.4275892Z add.s32 %r30, %r226, %r229; 2026-02-21T09:10:45.4276048Z xor.b32 %r230, %r225, 64; 2026-02-21T09:10:45.4276203Z add.s32 %r31, %r226, %r230; 2026-02-21T09:10:45.4276355Z xor.b32 %r231, %r225, 80; 2026-02-21T09:10:45.4276510Z add.s32 %r32, %r226, %r231; 2026-02-21T09:10:45.4276661Z xor.b32 %r232, %r225, 96; 2026-02-21T09:10:45.4276818Z add.s32 %r33, %r226, %r232; 2026-02-21T09:10:45.4276969Z xor.b32 %r233, %r225, 112; 2026-02-21T09:10:45.4277129Z add.s32 %r34, %r226, %r233; 2026-02-21T09:10:45.4277304Z bfe.u32 %r234, %r226, 4, 14; 2026-02-21T09:10:45.4277492Z cvt.u64.u32 %rd91, %r234; 2026-02-21T09:10:45.4277662Z or.b64 %rd3, %rd91, 4611686293338849280; 2026-02-21T09:10:45.4277841Z add.s32 %r235, %r179, 16416; 2026-02-21T09:10:45.4278001Z bfe.u32 %r236, %r235, 4, 14; 2026-02-21T09:10:45.4278152Z cvt.u64.u32 %rd92, %r236; 2026-02-21T09:10:45.4278318Z or.b64 %rd4, %rd92, 4611686293338849280; 2026-02-21T09:10:45.4278492Z add.s32 %r237, %r179, 16448; 2026-02-21T09:10:45.4278652Z bfe.u32 %r238, %r237, 4, 14; 2026-02-21T09:10:45.4278841Z cvt.u64.u32 %rd93, %r238; 2026-02-21T09:10:45.4279000Z or.b64 %rd5, %rd93, 4611686293338849280; 2026-02-21T09:10:45.4279177Z add.s32 %r239, %r179, 16480; 2026-02-21T09:10:45.4279327Z bfe.u32 %r240, %r239, 4, 14; 2026-02-21T09:10:45.4279485Z cvt.u64.u32 %rd94, %r240; 2026-02-21T09:10:45.4279640Z or.b64 %rd6, %rd94, 4611686293338849280; 2026-02-21T09:10:45.4279816Z shl.b32 %r241, %r1849, 7; 2026-02-21T09:10:45.4279964Z xor.b32 %r242, %r221, %r218; 2026-02-21T09:10:45.4280124Z or.b32 %r243, %r242, %r241; 2026-02-21T09:10:45.4280283Z add.s32 %r35, %r179, %r243; 2026-02-21T09:10:45.4280449Z xor.b32 %r244, %r243, 16; 
2026-02-21T09:10:45.4280642Z add.s32 %r36, %r179, %r244; 2026-02-21T09:10:45.4280805Z xor.b32 %r245, %r243, 32; 2026-02-21T09:10:45.4280967Z add.s32 %r37, %r179, %r245; 2026-02-21T09:10:45.4281123Z xor.b32 %r246, %r243, 48; 2026-02-21T09:10:45.4281284Z add.s32 %r38, %r179, %r246; 2026-02-21T09:10:45.4281618Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4281932Z mad.wide.u32 %rd95, %r1853, 16, %rd52; 2026-02-21T09:10:45.4282117Z add.s64 %rd7, %rd95, 192; 2026-02-21T09:10:45.4282407Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4282711Z or.b32 %r248, %r1854, %r9; 2026-02-21T09:10:45.4282873Z or.b32 %r40, %r248, 65632; 2026-02-21T09:10:45.4283038Z bra.uni $L__BB0_2; 2026-02-21T09:10:45.4283234Z $L__BB0_26: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4283584Z .loc 1 0 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:0:125 2026-02-21T09:10:45.4283877Z mov.b32 %r1401, 1; 2026-02-21T09:10:45.4284032Z $L__tmp1: 2026-02-21T09:10:45.4284339Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4284684Z // begin inline asm 2026-02-21T09:10:45.4284840Z 2026-02-21T09:10:45.4284962Z { 2026-02-21T09:10:45.4285101Z .reg .pred complete; 2026-02-21T09:10:45.4285253Z waitLoop: 2026-02-21T09:10:45.4285461Z mbarrier.try_wait.parity.shared.b64 complete, [%r1879], %r1401; 2026-02-21T09:10:45.4285713Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4285881Z } 2026-02-21T09:10:45.4285951Z 2026-02-21T09:10:45.4286018Z // end inline asm 2026-02-21T09:10:45.4286158Z $L__tmp2: 2026-02-21T09:10:45.4286413Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4286723Z cp.async.wait_group 0; 2026-02-21T09:10:45.4286896Z bar.sync 0; 2026-02-21T09:10:45.4287042Z // begin inline asm 2026-02-21T09:10:45.4287232Z @%p273 mbarrier.inval.shared::cta.b64 
[%r1310]; 2026-02-21T09:10:45.4287435Z // end inline asm 2026-02-21T09:10:45.4287586Z bar.sync 0; 2026-02-21T09:10:45.4287722Z // begin inline asm 2026-02-21T09:10:45.4287905Z @%p273 mbarrier.inval.shared::cta.b64 [%r1094]; 2026-02-21T09:10:45.4288134Z // end inline asm 2026-02-21T09:10:45.4288277Z add.s32 %r1404, %r179, 26624; 2026-02-21T09:10:45.4288444Z // begin inline asm 2026-02-21T09:10:45.4288610Z @%p273 mbarrier.inval.shared::cta.b64 [%r1404]; 2026-02-21T09:10:45.4288804Z // end inline asm 2026-02-21T09:10:45.4288936Z bar.sync 0; 2026-02-21T09:10:45.4289073Z // begin inline asm 2026-02-21T09:10:45.4289234Z @%p273 mbarrier.inval.shared::cta.b64 [%r1096]; 2026-02-21T09:10:45.4289426Z // end inline asm 2026-02-21T09:10:45.4289599Z $L__tmp3: 2026-02-21T09:10:45.4289882Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4290217Z // begin inline asm 2026-02-21T09:10:45.4290600Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1406, %r1407, %r1408, %r1409, %r1410, %r1411, %r1412, %r1413, %r1414, %r1415, %r1416, %r1417, %r1418, %r1419, %r1420, %r1421}, [%r1439 + 0]; 2026-02-21T09:10:45.4291017Z // end inline asm 2026-02-21T09:10:45.4291182Z // begin inline asm 2026-02-21T09:10:45.4291585Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1423, %r1424, %r1425, %r1426, %r1427, %r1428, %r1429, %r1430, %r1431, %r1432, %r1433, %r1434, %r1435, %r1436, %r1437, %r1438}, [%r1439 + 16]; 2026-02-21T09:10:45.4291997Z // end inline asm 2026-02-21T09:10:45.4292135Z // begin inline asm 2026-02-21T09:10:45.4292295Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:45.4292459Z // end inline asm 2026-02-21T09:10:45.4292605Z cvt.u64.u32 %rd408, %r1406; 2026-02-21T09:10:45.4292768Z cvt.u64.u32 %rd409, %r1407; 2026-02-21T09:10:45.4292932Z shl.b64 %rd410, %rd409, 32; 2026-02-21T09:10:45.4293117Z or.b64 %rd411, %rd408, %rd410; 2026-02-21T09:10:45.4293286Z $L__tmp4: 2026-02-21T09:10:45.4293528Z .loc 1 97 28 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4293816Z mov.b64 {%r1443, %r1444}, %rd411; 2026-02-21T09:10:45.4294032Z cvt.rn.bf16x2.f32 %r1445, %r1444, %r1443; 2026-02-21T09:10:45.4294210Z $L__tmp5: 2026-02-21T09:10:45.4294503Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4294832Z cvt.u64.u32 %rd412, %r1408; 2026-02-21T09:10:45.4294997Z cvt.u64.u32 %rd413, %r1409; 2026-02-21T09:10:45.4295153Z shl.b64 %rd414, %rd413, 32; 2026-02-21T09:10:45.4295319Z or.b64 %rd415, %rd412, %rd414; 2026-02-21T09:10:45.4295487Z $L__tmp6: 2026-02-21T09:10:45.4295723Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4296019Z mov.b64 {%r1446, %r1447}, %rd415; 2026-02-21T09:10:45.4296200Z cvt.rn.bf16x2.f32 %r1448, %r1447, %r1446; 2026-02-21T09:10:45.4296384Z $L__tmp7: 2026-02-21T09:10:45.4296660Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4296995Z cvt.u64.u32 %rd416, %r1410; 2026-02-21T09:10:45.4297160Z cvt.u64.u32 %rd417, %r1411; 2026-02-21T09:10:45.4297319Z shl.b64 %rd418, %rd417, 32; 2026-02-21T09:10:45.4297489Z or.b64 %rd419, %rd416, %rd418; 2026-02-21T09:10:45.4297647Z $L__tmp8: 2026-02-21T09:10:45.4297889Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4298180Z mov.b64 {%r1449, %r1450}, %rd419; 2026-02-21T09:10:45.4298365Z cvt.rn.bf16x2.f32 %r1451, %r1450, %r1449; 2026-02-21T09:10:45.4298537Z $L__tmp9: 2026-02-21T09:10:45.4298823Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4299156Z cvt.u64.u32 %rd420, %r1412; 2026-02-21T09:10:45.4299309Z cvt.u64.u32 %rd421, %r1413; 2026-02-21T09:10:45.4299471Z shl.b64 %rd422, %rd421, 32; 2026-02-21T09:10:45.4299627Z or.b64 %rd423, %rd420, %rd422; 2026-02-21T09:10:45.4299789Z 
$L__tmp10: 2026-02-21T09:10:45.4300024Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4300318Z mov.b64 {%r1452, %r1453}, %rd423; 2026-02-21T09:10:45.4300495Z cvt.rn.bf16x2.f32 %r1454, %r1453, %r1452; 2026-02-21T09:10:45.4300672Z $L__tmp11: 2026-02-21T09:10:45.4300959Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4301282Z cvt.u64.u32 %rd424, %r1414; 2026-02-21T09:10:45.4301444Z cvt.u64.u32 %rd425, %r1415; 2026-02-21T09:10:45.4301625Z shl.b64 %rd426, %rd425, 32; 2026-02-21T09:10:45.4301821Z or.b64 %rd427, %rd424, %rd426; 2026-02-21T09:10:45.4301975Z $L__tmp12: 2026-02-21T09:10:45.4302219Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4302501Z mov.b64 {%r1455, %r1456}, %rd427; 2026-02-21T09:10:45.4302686Z cvt.rn.bf16x2.f32 %r1457, %r1456, %r1455; 2026-02-21T09:10:45.4302865Z $L__tmp13: 2026-02-21T09:10:45.4303137Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4303504Z cvt.u64.u32 %rd428, %r1416; 2026-02-21T09:10:45.4303658Z cvt.u64.u32 %rd429, %r1417; 2026-02-21T09:10:45.4303815Z shl.b64 %rd430, %rd429, 32; 2026-02-21T09:10:45.4303969Z or.b64 %rd431, %rd428, %rd430; 2026-02-21T09:10:45.4304128Z $L__tmp14: 2026-02-21T09:10:45.4304364Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4304644Z mov.b64 {%r1458, %r1459}, %rd431; 2026-02-21T09:10:45.4304826Z cvt.rn.bf16x2.f32 %r1460, %r1459, %r1458; 2026-02-21T09:10:45.4305052Z $L__tmp15: 2026-02-21T09:10:45.4305338Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4305659Z cvt.u64.u32 %rd432, %r1418; 2026-02-21T09:10:45.4305825Z cvt.u64.u32 %rd433, %r1419; 2026-02-21T09:10:45.4306008Z shl.b64 %rd434, %rd433, 32; 2026-02-21T09:10:45.4306175Z or.b64 
%rd435, %rd432, %rd434; 2026-02-21T09:10:45.4306335Z $L__tmp16: 2026-02-21T09:10:45.4306564Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4306854Z mov.b64 {%r1461, %r1462}, %rd435; 2026-02-21T09:10:45.4307028Z cvt.rn.bf16x2.f32 %r1463, %r1462, %r1461; 2026-02-21T09:10:45.4307206Z $L__tmp17: 2026-02-21T09:10:45.4307479Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4307805Z cvt.u64.u32 %rd436, %r1420; 2026-02-21T09:10:45.4307961Z cvt.u64.u32 %rd437, %r1421; 2026-02-21T09:10:45.4308126Z shl.b64 %rd438, %rd437, 32; 2026-02-21T09:10:45.4308293Z or.b64 %rd439, %rd436, %rd438; 2026-02-21T09:10:45.4308454Z $L__tmp18: 2026-02-21T09:10:45.4308692Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4308974Z mov.b64 {%r1464, %r1465}, %rd439; 2026-02-21T09:10:45.4309156Z cvt.rn.bf16x2.f32 %r1466, %r1465, %r1464; 2026-02-21T09:10:45.4309325Z $L__tmp19: 2026-02-21T09:10:45.4309604Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4309937Z cvt.u64.u32 %rd440, %r1423; 2026-02-21T09:10:45.4310093Z cvt.u64.u32 %rd441, %r1424; 2026-02-21T09:10:45.4310253Z shl.b64 %rd442, %rd441, 32; 2026-02-21T09:10:45.4310410Z or.b64 %rd443, %rd440, %rd442; 2026-02-21T09:10:45.4310571Z $L__tmp20: 2026-02-21T09:10:45.4310799Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4311086Z mov.b64 {%r1467, %r1468}, %rd443; 2026-02-21T09:10:45.4311259Z cvt.rn.bf16x2.f32 %r1469, %r1468, %r1467; 2026-02-21T09:10:45.4311436Z $L__tmp21: 2026-02-21T09:10:45.4311741Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4312066Z cvt.u64.u32 %rd444, %r1425; 2026-02-21T09:10:45.4312231Z cvt.u64.u32 %rd445, %r1426; 2026-02-21T09:10:45.4312384Z shl.b64 
%rd446, %rd445, 32; 2026-02-21T09:10:45.4312545Z or.b64 %rd447, %rd444, %rd446; 2026-02-21T09:10:45.4312699Z $L__tmp22: 2026-02-21T09:10:45.4312934Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4313213Z mov.b64 {%r1470, %r1471}, %rd447; 2026-02-21T09:10:45.4313394Z cvt.rn.bf16x2.f32 %r1472, %r1471, %r1470; 2026-02-21T09:10:45.4313625Z $L__tmp23: 2026-02-21T09:10:45.4313902Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4314237Z cvt.u64.u32 %rd448, %r1427; 2026-02-21T09:10:45.4314391Z cvt.u64.u32 %rd449, %r1428; 2026-02-21T09:10:45.4314550Z shl.b64 %rd450, %rd449, 32; 2026-02-21T09:10:45.4314703Z or.b64 %rd451, %rd448, %rd450; 2026-02-21T09:10:45.4314866Z $L__tmp24: 2026-02-21T09:10:45.4315126Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4315414Z mov.b64 {%r1473, %r1474}, %rd451; 2026-02-21T09:10:45.4315597Z cvt.rn.bf16x2.f32 %r1475, %r1474, %r1473; 2026-02-21T09:10:45.4315766Z $L__tmp25: 2026-02-21T09:10:45.4316046Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4316363Z cvt.u64.u32 %rd452, %r1429; 2026-02-21T09:10:45.4316527Z cvt.u64.u32 %rd453, %r1430; 2026-02-21T09:10:45.4316679Z shl.b64 %rd454, %rd453, 32; 2026-02-21T09:10:45.4316869Z or.b64 %rd455, %rd452, %rd454; 2026-02-21T09:10:45.4317032Z $L__tmp26: 2026-02-21T09:10:45.4317261Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4317549Z mov.b64 {%r1476, %r1477}, %rd455; 2026-02-21T09:10:45.4317749Z cvt.rn.bf16x2.f32 %r1478, %r1477, %r1476; 2026-02-21T09:10:45.4317931Z $L__tmp27: 2026-02-21T09:10:45.4318203Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4318528Z cvt.u64.u32 %rd456, %r1431; 2026-02-21T09:10:45.4318683Z cvt.u64.u32 
%rd457, %r1432; 2026-02-21T09:10:45.4318848Z shl.b64 %rd458, %rd457, 32; 2026-02-21T09:10:45.4319017Z or.b64 %rd459, %rd456, %rd458; 2026-02-21T09:10:45.4319171Z $L__tmp28: 2026-02-21T09:10:45.4319410Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4319690Z mov.b64 {%r1479, %r1480}, %rd459; 2026-02-21T09:10:45.4319871Z cvt.rn.bf16x2.f32 %r1481, %r1480, %r1479; 2026-02-21T09:10:45.4320041Z $L__tmp29: 2026-02-21T09:10:45.4320323Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4320652Z cvt.u64.u32 %rd460, %r1433; 2026-02-21T09:10:45.4320806Z cvt.u64.u32 %rd461, %r1434; 2026-02-21T09:10:45.4320971Z shl.b64 %rd462, %rd461, 32; 2026-02-21T09:10:45.4321129Z or.b64 %rd463, %rd460, %rd462; 2026-02-21T09:10:45.4321289Z $L__tmp30: 2026-02-21T09:10:45.4321517Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4321828Z mov.b64 {%r1482, %r1483}, %rd463; 2026-02-21T09:10:45.4322002Z cvt.rn.bf16x2.f32 %r1484, %r1483, %r1482; 2026-02-21T09:10:45.4322182Z $L__tmp31: 2026-02-21T09:10:45.4322466Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4322790Z cvt.u64.u32 %rd464, %r1435; 2026-02-21T09:10:45.4322952Z cvt.u64.u32 %rd465, %r1436; 2026-02-21T09:10:45.4323104Z shl.b64 %rd466, %rd465, 32; 2026-02-21T09:10:45.4323266Z or.b64 %rd467, %rd464, %rd466; 2026-02-21T09:10:45.4323417Z $L__tmp32: 2026-02-21T09:10:45.4323656Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4323939Z mov.b64 {%r1485, %r1486}, %rd467; 2026-02-21T09:10:45.4324138Z cvt.rn.bf16x2.f32 %r1487, %r1486, %r1485; 2026-02-21T09:10:45.4324325Z $L__tmp33: 2026-02-21T09:10:45.4324614Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4324960Z cvt.u64.u32 
%rd468, %r1437; 2026-02-21T09:10:45.4325120Z cvt.u64.u32 %rd469, %r1438; 2026-02-21T09:10:45.4325285Z shl.b64 %rd470, %rd469, 32; 2026-02-21T09:10:45.4325479Z or.b64 %rd471, %rd468, %rd470; 2026-02-21T09:10:45.4325644Z $L__tmp34: 2026-02-21T09:10:45.4325884Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4326181Z mov.b64 {%r1488, %r1489}, %rd471; 2026-02-21T09:10:45.4326366Z cvt.rn.bf16x2.f32 %r1490, %r1489, %r1488; 2026-02-21T09:10:45.4326657Z .loc 1 98 43 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:98:43 2026-02-21T09:10:45.4327029Z st.shared.v4.b32 [%r35], {%r1445, %r1448, %r1451, %r1454}; 2026-02-21T09:10:45.4327282Z st.shared.v4.b32 [%r36], {%r1457, %r1460, %r1463, %r1466}; 2026-02-21T09:10:45.4327536Z st.shared.v4.b32 [%r37], {%r1469, %r1472, %r1475, %r1478}; 2026-02-21T09:10:45.4327781Z st.shared.v4.b32 [%r38], {%r1481, %r1484, %r1487, %r1490}; 2026-02-21T09:10:45.4327996Z // begin inline asm 2026-02-21T09:10:45.4328170Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4328345Z // end inline asm 2026-02-21T09:10:45.4328497Z bar.sync 0; 2026-02-21T09:10:45.4328644Z elect.sync %r1491|%p268, -1; 2026-02-21T09:10:45.4328858Z and.pred %p266, %p1, %p268; 2026-02-21T09:10:45.4329026Z // begin inline asm 2026-02-21T09:10:45.4329315Z @%p266 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd507, {%r1440, %r1441}], [%r179]; 2026-02-21T09:10:45.4329625Z // end inline asm 2026-02-21T09:10:45.4329794Z cp.async.bulk.commit_group; 2026-02-21T09:10:45.4330019Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:45.4330201Z bar.sync 0; 2026-02-21T09:10:45.4330470Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4330766Z add.s32 %r1855, %r1855, 4; 2026-02-21T09:10:45.4330949Z setp.lt.s32 %p269, %r1855, %r1884; 2026-02-21T09:10:45.4331133Z @%p269 bra $L__BB0_2; 2026-02-21T09:10:45.4331299Z bra.uni $L__BB0_27; 2026-02-21T09:10:45.4331499Z $L__BB0_2: // 
=>This Loop Header: Depth=1 2026-02-21T09:10:45.4331782Z // Child Loop BB0_5 Depth 2 2026-02-21T09:10:45.4332054Z // Child Loop BB0_11 Depth 2 2026-02-21T09:10:45.4332279Z // Child Loop BB0_17 Depth 2 2026-02-21T09:10:45.4332505Z // Child Loop BB0_23 Depth 2 2026-02-21T09:10:45.4332804Z .loc 1 38 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:38:33 2026-02-21T09:10:45.4333098Z shr.u32 %r324, %r1855, 5; 2026-02-21T09:10:45.4333259Z and.b32 %r325, %r324, 67108848; 2026-02-21T09:10:45.4333534Z .loc 1 39 39 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:39:39 2026-02-21T09:10:45.4333824Z sub.s32 %r326, 128, %r325; 2026-02-21T09:10:45.4334083Z .loc 1 39 52 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:39:52 2026-02-21T09:10:45.4334370Z min.s32 %r327, %r326, 16; 2026-02-21T09:10:45.4334628Z .loc 1 40 45 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:45 2026-02-21T09:10:45.4334914Z and.b32 %r328, %r1855, 511; 2026-02-21T09:10:45.4335169Z .loc 1 41 51 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:41:51 2026-02-21T09:10:45.4335459Z div.s32 %r73, %r328, %r327; 2026-02-21T09:10:45.4335729Z .loc 1 40 64 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:64 2026-02-21T09:10:45.4336013Z mul.lo.s32 %r329, %r73, %r327; 2026-02-21T09:10:45.4336182Z sub.s32 %r330, %r328, %r329; 2026-02-21T09:10:45.4336437Z .loc 1 40 30 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:30 2026-02-21T09:10:45.4336727Z add.s32 %r331, %r330, %r325; 2026-02-21T09:10:45.4336985Z .loc 1 42 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:42:27 2026-02-21T09:10:45.4337272Z shl.b32 %r513, %r331, 6; 2026-02-21T09:10:45.4337567Z .loc 1 43 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:43:27 2026-02-21T09:10:45.4337844Z shl.b32 %r514, %r73, 7; 2026-02-21T09:10:45.4338100Z .loc 1 44 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:44:32 2026-02-21T09:10:45.4338372Z or.b32 %r332, 
%r514, %r7; 2026-02-21T09:10:45.4338532Z or.b32 %r333, %r514, %r8; 2026-02-21T09:10:45.4338780Z .loc 1 58 53 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:53 2026-02-21T09:10:45.4339089Z shl.b32 %r334, %r332, 10; 2026-02-21T09:10:45.4339248Z shl.b32 %r335, %r333, 10; 2026-02-21T09:10:45.4339390Z $L__tmp35: 2026-02-21T09:10:45.4339685Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4340023Z shfl.sync.idx.b32 %r76, %r6, 0, 31, -1; 2026-02-21T09:10:45.4340215Z shl.b32 %r336, %r76, 21; 2026-02-21T09:10:45.4340372Z and.b32 %r337, %r336, 6291456; 2026-02-21T09:10:45.4340544Z add.s32 %r338, %r337, %r1846; 2026-02-21T09:10:45.4340738Z and.b32 %r339, %r76, 4; 2026-02-21T09:10:45.4340890Z shl.b32 %r340, %r339, 3; 2026-02-21T09:10:45.4341049Z add.s32 %r1439, %r338, %r340; 2026-02-21T09:10:45.4341211Z mov.pred %p58, -1; 2026-02-21T09:10:45.4341367Z mov.b32 %r1857, 0; 2026-02-21T09:10:45.4341509Z // begin inline asm 2026-02-21T09:10:45.4341972Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1439 + 0], {%r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857}; 2026-02-21T09:10:45.4342391Z // end inline asm 2026-02-21T09:10:45.4342539Z // begin inline asm 2026-02-21T09:10:45.4342928Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1439 + 16], {%r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857, %r1857}; 2026-02-21T09:10:45.4343339Z // end inline asm 2026-02-21T09:10:45.4343484Z // begin inline asm 2026-02-21T09:10:45.4343636Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4343806Z // end inline asm 2026-02-21T09:10:45.4343939Z bar.sync 0; 2026-02-21T09:10:45.4344072Z $L__tmp36: 2026-02-21T09:10:45.4344312Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4344603Z add.s32 %r1858, %r179, 26624; 
2026-02-21T09:10:45.4344767Z // begin inline asm 2026-02-21T09:10:45.4344937Z @%p273 mbarrier.init.shared::cta.b64 [%r1858], 1; 2026-02-21T09:10:45.4345137Z // end inline asm 2026-02-21T09:10:45.4345268Z bar.sync 0; 2026-02-21T09:10:45.4345408Z add.s32 %r1096, %r179, 26632; 2026-02-21T09:10:45.4345562Z // begin inline asm 2026-02-21T09:10:45.4345732Z @%p273 mbarrier.init.shared::cta.b64 [%r1096], 1; 2026-02-21T09:10:45.4345920Z // end inline asm 2026-02-21T09:10:45.4346064Z add.s32 %r1310, %r179, 26640; 2026-02-21T09:10:45.4346223Z // begin inline asm 2026-02-21T09:10:45.4346386Z @%p273 mbarrier.init.shared::cta.b64 [%r1310], 1; 2026-02-21T09:10:45.4346581Z // end inline asm 2026-02-21T09:10:45.4346710Z bar.sync 0; 2026-02-21T09:10:45.4346848Z add.s32 %r1094, %r179, 26648; 2026-02-21T09:10:45.4347000Z // begin inline asm 2026-02-21T09:10:45.4347167Z @%p273 mbarrier.init.shared::cta.b64 [%r1094], 1; 2026-02-21T09:10:45.4347349Z // end inline asm 2026-02-21T09:10:45.4347595Z .loc 1 58 60 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:60 2026-02-21T09:10:45.4347879Z or.b32 %r342, %r334, %r9; 2026-02-21T09:10:45.4348034Z or.b32 %r343, %r335, %r9; 2026-02-21T09:10:45.4348292Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4348577Z mad.wide.s32 %rd96, %r342, 2, %rd52; 2026-02-21T09:10:45.4348763Z mad.wide.s32 %rd97, %r343, 2, %rd52; 2026-02-21T09:10:45.4348928Z mov.b32 %r380, 16; 2026-02-21T09:10:45.4349174Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4349487Z // begin inline asm 2026-02-21T09:10:45.4349689Z cp.async.cg.shared.global [ %r12 + 0 ], [ %rd96 + 0 ], 0x10, %r380; 2026-02-21T09:10:45.4349920Z // end inline asm 2026-02-21T09:10:45.4350055Z // begin inline asm 2026-02-21T09:10:45.4350257Z cp.async.cg.shared.global [ %r13 + 0 ], [ %rd97 + 0 ], 0x10, %r380; 2026-02-21T09:10:45.4350477Z // end inline asm 2026-02-21T09:10:45.4350627Z 
cp.async.commit_group; 2026-02-21T09:10:45.4350913Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4351202Z bar.sync 0; 2026-02-21T09:10:45.4351334Z // begin inline asm 2026-02-21T09:10:45.4351558Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1310], 1024; 2026-02-21T09:10:45.4351797Z // end inline asm 2026-02-21T09:10:45.4352037Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4352333Z bar.sync 0; 2026-02-21T09:10:45.4352477Z elect.sync %r344|%p53, -1; 2026-02-21T09:10:45.4352651Z and.pred %p47, %p1, %p53; 2026-02-21T09:10:45.4352834Z // begin inline asm 2026-02-21T09:10:45.4353167Z @%p47 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r219], [%rd497, {%r513, %r1857}], [%r1310]; 2026-02-21T09:10:45.4353531Z // end inline asm 2026-02-21T09:10:45.4353822Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4354113Z cvt.s64.s32 %rd102, %r334; 2026-02-21T09:10:45.4354275Z or.b64 %rd103, %rd102, %rd572; 2026-02-21T09:10:45.4354447Z shl.b64 %rd104, %rd103, 1; 2026-02-21T09:10:45.4354604Z add.s64 %rd14, %rd52, %rd104; 2026-02-21T09:10:45.4354774Z add.s64 %rd99, %rd14, 64; 2026-02-21T09:10:45.4354928Z cvt.s64.s32 %rd105, %r335; 2026-02-21T09:10:45.4355094Z or.b64 %rd106, %rd105, %rd572; 2026-02-21T09:10:45.4355260Z shl.b64 %rd107, %rd106, 1; 2026-02-21T09:10:45.4355414Z add.s64 %rd15, %rd52, %rd107; 2026-02-21T09:10:45.4355577Z add.s64 %rd100, %rd15, 64; 2026-02-21T09:10:45.4355832Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4356113Z // begin inline asm 2026-02-21T09:10:45.4356313Z cp.async.cg.shared.global [ %r15 + 0 ], [ %rd99 + 0 ], 0x10, %r380; 2026-02-21T09:10:45.4356543Z // end inline asm 2026-02-21T09:10:45.4356680Z // begin inline asm 2026-02-21T09:10:45.4356889Z cp.async.cg.shared.global [ %r16 + 0 ], [ %rd100 + 0 ], 0x10, %r380; 
2026-02-21T09:10:45.4357123Z // end inline asm 2026-02-21T09:10:45.4357264Z cp.async.commit_group; 2026-02-21T09:10:45.4357528Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4357807Z bar.sync 0; 2026-02-21T09:10:45.4357943Z // begin inline asm 2026-02-21T09:10:45.4358134Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1094], 1024; 2026-02-21T09:10:45.4358361Z // end inline asm 2026-02-21T09:10:45.4358600Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4358878Z bar.sync 0; 2026-02-21T09:10:45.4359020Z elect.sync %r345|%p54, -1; 2026-02-21T09:10:45.4359184Z and.pred %p49, %p1, %p54; 2026-02-21T09:10:45.4359352Z add.s32 %r301, %r179, 25600; 2026-02-21T09:10:45.4359508Z // begin inline asm 2026-02-21T09:10:45.4359840Z @%p49 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r301], [%rd497, {%r513, %r380}], [%r1094]; 2026-02-21T09:10:45.4360197Z // end inline asm 2026-02-21T09:10:45.4360443Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4360730Z cp.async.wait_group 1; 2026-02-21T09:10:45.4360881Z bar.sync 0; 2026-02-21T09:10:45.4361058Z ld.shared.v4.b32 {%r346, %r347, %r348, %r349}, [%r18]; 2026-02-21T09:10:45.4361263Z mov.b32 {%rs1, %rs2}, %r349; 2026-02-21T09:10:45.4361463Z mov.b32 {%rs3, %rs4}, %r348; 2026-02-21T09:10:45.4361646Z mov.b32 {%rs5, %rs6}, %r347; 2026-02-21T09:10:45.4361810Z mov.b32 {%rs7, %rs8}, %r346; 2026-02-21T09:10:45.4362003Z ld.shared.v4.b32 {%r350, %r351, %r352, %r353}, [%r18+16]; 2026-02-21T09:10:45.4362223Z mov.b32 {%rs9, %rs10}, %r353; 2026-02-21T09:10:45.4362399Z mov.b32 {%rs11, %rs12}, %r352; 2026-02-21T09:10:45.4362565Z mov.b32 {%rs13, %rs14}, %r351; 2026-02-21T09:10:45.4362739Z mov.b32 {%rs15, %rs16}, %r350; 2026-02-21T09:10:45.4363030Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 2026-02-21T09:10:45.4363319Z cvt.f32.bf16 
%r306, %rs7; 2026-02-21T09:10:45.4363475Z cvt.f32.bf16 %r307, %rs8; 2026-02-21T09:10:45.4363635Z cvt.f32.bf16 %r308, %rs5; 2026-02-21T09:10:45.4363785Z cvt.f32.bf16 %r309, %rs6; 2026-02-21T09:10:45.4363943Z cvt.f32.bf16 %r310, %rs3; 2026-02-21T09:10:45.4364100Z cvt.f32.bf16 %r311, %rs4; 2026-02-21T09:10:45.4364249Z cvt.f32.bf16 %r312, %rs1; 2026-02-21T09:10:45.4364405Z cvt.f32.bf16 %r313, %rs2; 2026-02-21T09:10:45.4364556Z cvt.f32.bf16 %r314, %rs15; 2026-02-21T09:10:45.4364745Z cvt.f32.bf16 %r315, %rs16; 2026-02-21T09:10:45.4364903Z cvt.f32.bf16 %r316, %rs13; 2026-02-21T09:10:45.4365064Z cvt.f32.bf16 %r317, %rs14; 2026-02-21T09:10:45.4365214Z cvt.f32.bf16 %r318, %rs11; 2026-02-21T09:10:45.4365372Z cvt.f32.bf16 %r319, %rs12; 2026-02-21T09:10:45.4365523Z cvt.f32.bf16 %r320, %rs9; 2026-02-21T09:10:45.4365709Z cvt.f32.bf16 %r321, %rs10; 2026-02-21T09:10:45.4365867Z $L__tmp37: 2026-02-21T09:10:45.4366157Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4366492Z add.s32 %r354, %r337, %r1644; 2026-02-21T09:10:45.4366654Z shl.b32 %r355, %r339, 2; 2026-02-21T09:10:45.4366817Z add.s32 %r1190, %r354, %r355; 2026-02-21T09:10:45.4366972Z // begin inline asm 2026-02-21T09:10:45.4367348Z @%p58 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1190 + 0], {%r306, %r307, %r308, %r309, %r310, %r311, %r312, %r313, %r314, %r315, %r316, %r317, %r318, %r319, %r320, %r321}; 2026-02-21T09:10:45.4367767Z // end inline asm 2026-02-21T09:10:45.4367911Z // begin inline asm 2026-02-21T09:10:45.4368079Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4368246Z // end inline asm 2026-02-21T09:10:45.4368394Z bar.sync 0; 2026-02-21T09:10:45.4368525Z $L__tmp38: 2026-02-21T09:10:45.4368785Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4369085Z // begin inline asm 2026-02-21T09:10:45.4369233Z 2026-02-21T09:10:45.4369353Z { 2026-02-21T09:10:45.4369493Z .reg .pred complete; 
2026-02-21T09:10:45.4369650Z waitLoop: 2026-02-21T09:10:45.4369851Z mbarrier.try_wait.parity.shared.b64 complete, [%r1310], %r1857; 2026-02-21T09:10:45.4370107Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4370272Z } 2026-02-21T09:10:45.4370343Z 2026-02-21T09:10:45.4370409Z // end inline asm 2026-02-21T09:10:45.4370674Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4370983Z ld.shared.b8 %rs17, [%r20]; 2026-02-21T09:10:45.4371158Z ld.shared.b8 %rs18, [%r20+512]; 2026-02-21T09:10:45.4371344Z ld.shared.b8 %rs19, [%r22+128]; 2026-02-21T09:10:45.4371523Z ld.shared.b8 %rs20, [%r22+640]; 2026-02-21T09:10:45.4371714Z ld.shared.b8 %rs21, [%r24+256]; 2026-02-21T09:10:45.4371898Z ld.shared.b8 %rs22, [%r24+768]; 2026-02-21T09:10:45.4372073Z ld.shared.b8 %rs23, [%r26+384]; 2026-02-21T09:10:45.4372254Z ld.shared.b8 %rs24, [%r26+896]; 2026-02-21T09:10:45.4372538Z .loc 1 67 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4372841Z shl.b16 %rs25, %rs17, 4; 2026-02-21T09:10:45.4373006Z shl.b16 %rs26, %rs19, 4; 2026-02-21T09:10:45.4373171Z shl.b16 %rs27, %rs21, 4; 2026-02-21T09:10:45.4373334Z shl.b16 %rs28, %rs23, 4; 2026-02-21T09:10:45.4373489Z shl.b16 %rs29, %rs18, 4; 2026-02-21T09:10:45.4373685Z shl.b16 %rs30, %rs20, 4; 2026-02-21T09:10:45.4373838Z shl.b16 %rs31, %rs22, 4; 2026-02-21T09:10:45.4373998Z shl.b16 %rs32, %rs24, 4; 2026-02-21T09:10:45.4374262Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4374572Z selp.b16 %rs33, %rs25, %rs17, %p331; 2026-02-21T09:10:45.4374753Z cvt.s16.s8 %rs34, %rs33; 2026-02-21T09:10:45.4374917Z shr.s16 %rs35, %rs34, 4; 2026-02-21T09:10:45.4375118Z selp.b16 %rs36, %rs26, %rs19, %p331; 2026-02-21T09:10:45.4375299Z cvt.s16.s8 %rs37, %rs36; 2026-02-21T09:10:45.4375466Z shr.s16 %rs38, %rs37, 4; 2026-02-21T09:10:45.4375620Z selp.b16 %rs39, %rs27, %rs21, %p331; 2026-02-21T09:10:45.4375798Z cvt.s16.s8 %rs40, 
%rs39; 2026-02-21T09:10:45.4375946Z shr.s16 %rs41, %rs40, 4; 2026-02-21T09:10:45.4376108Z selp.b16 %rs42, %rs28, %rs23, %p331; 2026-02-21T09:10:45.4376274Z cvt.s16.s8 %rs43, %rs42; 2026-02-21T09:10:45.4376427Z shr.s16 %rs44, %rs43, 4; 2026-02-21T09:10:45.4376582Z selp.b16 %rs45, %rs29, %rs18, %p331; 2026-02-21T09:10:45.4376754Z cvt.s16.s8 %rs46, %rs45; 2026-02-21T09:10:45.4376933Z shr.s16 %rs47, %rs46, 4; 2026-02-21T09:10:45.4377090Z selp.b16 %rs48, %rs30, %rs20, %p331; 2026-02-21T09:10:45.4377262Z cvt.s16.s8 %rs49, %rs48; 2026-02-21T09:10:45.4377408Z shr.s16 %rs50, %rs49, 4; 2026-02-21T09:10:45.4377568Z selp.b16 %rs51, %rs31, %rs22, %p331; 2026-02-21T09:10:45.4377758Z cvt.s16.s8 %rs52, %rs51; 2026-02-21T09:10:45.4377914Z shr.s16 %rs53, %rs52, 4; 2026-02-21T09:10:45.4378066Z selp.b16 %rs54, %rs32, %rs24, %p331; 2026-02-21T09:10:45.4378237Z cvt.s16.s8 %rs55, %rs54; 2026-02-21T09:10:45.4378382Z shr.s16 %rs56, %rs55, 4; 2026-02-21T09:10:45.4378638Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4378925Z cvt.rn.f32.s16 %r356, %rs35; 2026-02-21T09:10:45.4379084Z cvt.rn.f32.s16 %r357, %rs38; 2026-02-21T09:10:45.4379246Z cvt.rn.f32.s16 %r358, %rs41; 2026-02-21T09:10:45.4379403Z cvt.rn.f32.s16 %r359, %rs44; 2026-02-21T09:10:45.4379562Z cvt.rn.f32.s16 %r360, %rs47; 2026-02-21T09:10:45.4379714Z cvt.rn.f32.s16 %r361, %rs50; 2026-02-21T09:10:45.4379871Z cvt.rn.f32.s16 %r362, %rs53; 2026-02-21T09:10:45.4380024Z cvt.rn.f32.s16 %r363, %rs56; 2026-02-21T09:10:45.4380186Z st.shared.b32 [%r27], %r356; 2026-02-21T09:10:45.4380349Z st.shared.b32 [%r28], %r357; 2026-02-21T09:10:45.4380504Z st.shared.b32 [%r29], %r358; 2026-02-21T09:10:45.4380667Z st.shared.b32 [%r30], %r359; 2026-02-21T09:10:45.4380821Z st.shared.b32 [%r31], %r360; 2026-02-21T09:10:45.4380983Z st.shared.b32 [%r32], %r361; 2026-02-21T09:10:45.4381138Z st.shared.b32 [%r33], %r362; 2026-02-21T09:10:45.4381301Z st.shared.b32 [%r34], %r363; 2026-02-21T09:10:45.4381446Z 
$L__tmp39: 2026-02-21T09:10:45.4381794Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4382126Z // begin inline asm 2026-02-21T09:10:45.4382289Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4382461Z // end inline asm 2026-02-21T09:10:45.4382596Z bar.sync 0; 2026-02-21T09:10:45.4382742Z setp.ne.b32 %p55, %r76, 0; 2026-02-21T09:10:45.4382900Z @%p55 bra $L__BB0_4; 2026-02-21T09:10:45.4383093Z // %bb.3: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4383314Z elect.sync %r376|%p57, -1; 2026-02-21T09:10:45.4383486Z mov.b32 %r366, 135268624; 2026-02-21T09:10:45.4383649Z mov.pred %p56, 0; 2026-02-21T09:10:45.4383792Z // begin inline asm 2026-02-21T09:10:45.4384046Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd3, %r366, %p56; 2026-02-21T09:10:45.4384312Z // end inline asm 2026-02-21T09:10:45.4384454Z // begin inline asm 2026-02-21T09:10:45.4384683Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 8 ], %rd4, %r366, %p58; 2026-02-21T09:10:45.4384956Z // end inline asm 2026-02-21T09:10:45.4385092Z // begin inline asm 2026-02-21T09:10:45.4385360Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 16 ], %rd5, %r366, %p58; 2026-02-21T09:10:45.4385625Z // end inline asm 2026-02-21T09:10:45.4385762Z // begin inline asm 2026-02-21T09:10:45.4385992Z @%p57 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd6, %r366, %p58; 2026-02-21T09:10:45.4386245Z // end inline asm 2026-02-21T09:10:45.4386390Z add.s32 %r378, %r179, 26624; 2026-02-21T09:10:45.4386549Z cvt.u64.u32 %rd112, %r378; 2026-02-21T09:10:45.4386737Z // begin inline asm 2026-02-21T09:10:45.4386946Z @%p57 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd112]; 2026-02-21T09:10:45.4387176Z // end inline asm 2026-02-21T09:10:45.4387314Z $L__tmp40: 2026-02-21T09:10:45.4387486Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 
2026-02-21T09:10:45.4387807Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4388087Z add.s64 %rd113, %rd14, 128; 2026-02-21T09:10:45.4388258Z add.s64 %rd114, %rd15, 128; 2026-02-21T09:10:45.4388561Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4388841Z // begin inline asm 2026-02-21T09:10:45.4389048Z cp.async.cg.shared.global [ %r12 + 0 ], [ %rd113 + 0 ], 0x10, %r380; 2026-02-21T09:10:45.4389270Z // end inline asm 2026-02-21T09:10:45.4389411Z // begin inline asm 2026-02-21T09:10:45.4389645Z cp.async.cg.shared.global [ %r13 + 0 ], [ %rd114 + 0 ], 0x10, %r380; 2026-02-21T09:10:45.4389871Z // end inline asm 2026-02-21T09:10:45.4390010Z cp.async.commit_group; 2026-02-21T09:10:45.4390278Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4390568Z // begin inline asm 2026-02-21T09:10:45.4390764Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1310], 1024; 2026-02-21T09:10:45.4390993Z // end inline asm 2026-02-21T09:10:45.4391230Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4391517Z bar.sync 0; 2026-02-21T09:10:45.4391703Z elect.sync %r392|%p68, -1; 2026-02-21T09:10:45.4391886Z and.pred %p66, %p1, %p68; 2026-02-21T09:10:45.4392043Z mov.b32 %r386, 32; 2026-02-21T09:10:45.4392188Z // begin inline asm 2026-02-21T09:10:45.4392528Z @%p66 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r219], [%rd497, {%r513, %r386}], [%r1310]; 2026-02-21T09:10:45.4392882Z // end inline asm 2026-02-21T09:10:45.4393137Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4393422Z shl.b32 %r393, %r73, 17; 2026-02-21T09:10:45.4393584Z or.b32 %r394, %r1854, %r393; 2026-02-21T09:10:45.4393748Z mad.wide.s32 %rd573, %r394, 2, %rd7; 2026-02-21T09:10:45.4393928Z or.b32 %r1856, %r40, %r393; 2026-02-21T09:10:45.4394080Z 
mov.b32 %r1861, 1; 2026-02-21T09:10:45.4394226Z mov.b64 %rd574, 0; 2026-02-21T09:10:45.4394376Z mov.b32 %r1859, %r1857; 2026-02-21T09:10:45.4394528Z mov.b32 %r1860, %r1857; 2026-02-21T09:10:45.4394685Z mov.b32 %r1862, %r1857; 2026-02-21T09:10:45.4394832Z bra.uni $L__BB0_5; 2026-02-21T09:10:45.4395022Z $L__BB0_7: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:10:45.4395347Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4395649Z setp.lt.u64 %p84, %rd574, 464; 2026-02-21T09:10:45.4395811Z $L__tmp41: 2026-02-21T09:10:45.4396103Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4396437Z add.s32 %r469, %r1861, 1; 2026-02-21T09:10:45.4396596Z setp.gt.s32 %p87, %r469, 1; 2026-02-21T09:10:45.4396770Z selp.b32 %r1861, 0, %r469, %p87; 2026-02-21T09:10:45.4396941Z selp.b32 %r470, 1, 0, %p87; 2026-02-21T09:10:45.4397106Z xor.b32 %r94, %r1862, %r470; 2026-02-21T09:10:45.4397284Z $L__tmp42: 2026-02-21T09:10:45.4397523Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4397816Z mad.wide.s32 %rd123, %r1856, 2, %rd52; 2026-02-21T09:10:45.4398095Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4398379Z add.s32 %r460, %r88, %r11; 2026-02-21T09:10:45.4398536Z selp.b32 %r461, 16, 0, %p84; 2026-02-21T09:10:45.4398729Z // begin inline asm 2026-02-21T09:10:45.4398931Z cp.async.cg.shared.global [ %r460 + 0 ], [ %rd573 + 0 ], 0x10, %r461; 2026-02-21T09:10:45.4399159Z // end inline asm 2026-02-21T09:10:45.4399295Z add.s32 %r462, %r460, 4096; 2026-02-21T09:10:45.4399454Z // begin inline asm 2026-02-21T09:10:45.4399658Z cp.async.cg.shared.global [ %r462 + 0 ], [ %rd123 + 0 ], 0x10, %r461; 2026-02-21T09:10:45.4399880Z // end inline asm 2026-02-21T09:10:45.4400026Z cp.async.commit_group; 2026-02-21T09:10:45.4400288Z .loc 1 51 125 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4400627Z and.pred %p82, %p273, %p84; 2026-02-21T09:10:45.4400786Z // begin inline asm 2026-02-21T09:10:45.4400980Z @%p82 mbarrier.arrive.expect_tx.shared.b64 _, [%r464], 1024; 2026-02-21T09:10:45.4401194Z // end inline asm 2026-02-21T09:10:45.4401461Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4401778Z bar.sync 0; 2026-02-21T09:10:45.4401921Z elect.sync %r471|%p88, -1; 2026-02-21T09:10:45.4402100Z and.pred %p89, %p84, %p88; 2026-02-21T09:10:45.4402267Z and.pred %p83, %p1, %p89; 2026-02-21T09:10:45.4402437Z cvt.u32.u64 %r472, %rd574; 2026-02-21T09:10:45.4402593Z add.s32 %r467, %r472, 48; 2026-02-21T09:10:45.4402753Z // begin inline asm 2026-02-21T09:10:45.4403078Z @%p83 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r465], [%rd497, {%r513, %r467}], [%r464]; 2026-02-21T09:10:45.4403438Z // end inline asm 2026-02-21T09:10:45.4403699Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4403992Z add.s64 %rd573, %rd573, 64; 2026-02-21T09:10:45.4404162Z add.s32 %r1856, %r1856, 32; 2026-02-21T09:10:45.4404324Z setp.lt.u64 %p90, %rd574, 480; 2026-02-21T09:10:45.4404499Z add.s64 %rd574, %rd574, 16; 2026-02-21T09:10:45.4404652Z mov.b32 %r1857, %r1862; 2026-02-21T09:10:45.4404810Z mov.b32 %r1862, %r94; 2026-02-21T09:10:45.4404956Z @%p90 bra $L__BB0_5; 2026-02-21T09:10:45.4405108Z bra.uni $L__BB0_8; 2026-02-21T09:10:45.4405296Z $L__BB0_5: // Parent Loop BB0_2 Depth=1 2026-02-21T09:10:45.4405537Z // => This Inner Loop Header: Depth=2 2026-02-21T09:10:45.4405755Z add.s32 %r416, %r1860, 1; 2026-02-21T09:10:45.4405911Z setp.gt.s32 %p72, %r416, 1; 2026-02-21T09:10:45.4406081Z selp.b32 %r1860, 0, %r416, %p72; 2026-02-21T09:10:45.4406358Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4406656Z cp.async.wait_group 1; 
2026-02-21T09:10:45.4406815Z bar.sync 0; 2026-02-21T09:10:45.4406947Z shl.b32 %r417, %r1860, 13; 2026-02-21T09:10:45.4407107Z add.s32 %r88, %r179, %r417; 2026-02-21T09:10:45.4407261Z add.s32 %r419, %r88, %r17; 2026-02-21T09:10:45.4407459Z ld.shared.v4.b32 {%r420, %r421, %r422, %r423}, [%r419]; 2026-02-21T09:10:45.4407667Z mov.b32 {%rs57, %rs58}, %r423; 2026-02-21T09:10:45.4407835Z mov.b32 {%rs59, %rs60}, %r422; 2026-02-21T09:10:45.4407994Z mov.b32 {%rs61, %rs62}, %r421; 2026-02-21T09:10:45.4408156Z mov.b32 {%rs63, %rs64}, %r420; 2026-02-21T09:10:45.4408346Z ld.shared.v4.b32 {%r424, %r425, %r426, %r427}, [%r419+16]; 2026-02-21T09:10:45.4408558Z mov.b32 {%rs65, %rs66}, %r427; 2026-02-21T09:10:45.4408718Z mov.b32 {%rs67, %rs68}, %r426; 2026-02-21T09:10:45.4408873Z mov.b32 {%rs69, %rs70}, %r425; 2026-02-21T09:10:45.4409068Z mov.b32 {%rs71, %rs72}, %r424; 2026-02-21T09:10:45.4409329Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 2026-02-21T09:10:45.4409619Z cvt.f32.bf16 %r398, %rs63; 2026-02-21T09:10:45.4409775Z cvt.f32.bf16 %r399, %rs64; 2026-02-21T09:10:45.4409938Z cvt.f32.bf16 %r400, %rs61; 2026-02-21T09:10:45.4410089Z cvt.f32.bf16 %r401, %rs62; 2026-02-21T09:10:45.4410246Z cvt.f32.bf16 %r402, %rs59; 2026-02-21T09:10:45.4410404Z cvt.f32.bf16 %r403, %rs60; 2026-02-21T09:10:45.4410582Z cvt.f32.bf16 %r404, %rs57; 2026-02-21T09:10:45.4410739Z cvt.f32.bf16 %r405, %rs58; 2026-02-21T09:10:45.4410888Z cvt.f32.bf16 %r406, %rs71; 2026-02-21T09:10:45.4411047Z cvt.f32.bf16 %r407, %rs72; 2026-02-21T09:10:45.4411197Z cvt.f32.bf16 %r408, %rs69; 2026-02-21T09:10:45.4411357Z cvt.f32.bf16 %r409, %rs70; 2026-02-21T09:10:45.4411512Z cvt.f32.bf16 %r410, %rs67; 2026-02-21T09:10:45.4411719Z cvt.f32.bf16 %r411, %rs68; 2026-02-21T09:10:45.4411901Z cvt.f32.bf16 %r412, %rs65; 2026-02-21T09:10:45.4412070Z cvt.f32.bf16 %r413, %rs66; 2026-02-21T09:10:45.4412227Z $L__tmp43: 2026-02-21T09:10:45.4412547Z .loc 2 291 36 // standard.py:291:36 @[ 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4412894Z // begin inline asm 2026-02-21T09:10:45.4413037Z 2026-02-21T09:10:45.4413164Z { 2026-02-21T09:10:45.4413293Z .reg .pred complete; 2026-02-21T09:10:45.4413481Z waitLoop: 2026-02-21T09:10:45.4413683Z mbarrier.try_wait.parity.shared.b64 complete, [%r1858], %r1857; 2026-02-21T09:10:45.4413940Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4414108Z } 2026-02-21T09:10:45.4414175Z 2026-02-21T09:10:45.4414234Z // end inline asm 2026-02-21T09:10:45.4414379Z $L__tmp44: 2026-02-21T09:10:45.4414628Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4414942Z selp.b32 %r428, 1, 0, %p72; 2026-02-21T09:10:45.4415110Z xor.b32 %r1859, %r1859, %r428; 2026-02-21T09:10:45.4415289Z mov.pred %p73, -1; 2026-02-21T09:10:45.4415434Z $L__tmp45: 2026-02-21T09:10:45.4415732Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4416082Z // begin inline asm 2026-02-21T09:10:45.4416464Z @%p73 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1190 + 0], {%r398, %r399, %r400, %r401, %r402, %r403, %r404, %r405, %r406, %r407, %r408, %r409, %r410, %r411, %r412, %r413}; 2026-02-21T09:10:45.4416887Z // end inline asm 2026-02-21T09:10:45.4417028Z // begin inline asm 2026-02-21T09:10:45.4417195Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4417365Z // end inline asm 2026-02-21T09:10:45.4417512Z bar.sync 0; 2026-02-21T09:10:45.4417643Z $L__tmp46: 2026-02-21T09:10:45.4417901Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4418208Z shl.b32 %r429, %r1860, 3; 2026-02-21T09:10:45.4418375Z add.s32 %r430, %r179, %r429; 2026-02-21T09:10:45.4418549Z add.s32 %r464, %r430, 26640; 2026-02-21T09:10:45.4418710Z // begin inline asm 2026-02-21T09:10:45.4418855Z 2026-02-21T09:10:45.4418973Z { 2026-02-21T09:10:45.4419107Z .reg .pred complete; 
2026-02-21T09:10:45.4419257Z waitLoop: 2026-02-21T09:10:45.4419459Z mbarrier.try_wait.parity.shared.b64 complete, [%r464], %r1859; 2026-02-21T09:10:45.4419704Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4419873Z } 2026-02-21T09:10:45.4419940Z 2026-02-21T09:10:45.4420001Z // end inline asm 2026-02-21T09:10:45.4420243Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4420532Z shl.b32 %r431, %r1860, 10; 2026-02-21T09:10:45.4420688Z add.s32 %r432, %r179, %r431; 2026-02-21T09:10:45.4420852Z add.s32 %r465, %r432, 24576; 2026-02-21T09:10:45.4421006Z add.s32 %r433, %r465, %r19; 2026-02-21T09:10:45.4421173Z ld.shared.b8 %rs73, [%r433]; 2026-02-21T09:10:45.4421345Z ld.shared.b8 %rs74, [%r433+512]; 2026-02-21T09:10:45.4421585Z add.s32 %r434, %r465, %r21; 2026-02-21T09:10:45.4421761Z ld.shared.b8 %rs75, [%r434+128]; 2026-02-21T09:10:45.4421932Z ld.shared.b8 %rs76, [%r434+640]; 2026-02-21T09:10:45.4422105Z add.s32 %r435, %r465, %r23; 2026-02-21T09:10:45.4422262Z ld.shared.b8 %rs77, [%r435+256]; 2026-02-21T09:10:45.4422433Z ld.shared.b8 %rs78, [%r435+768]; 2026-02-21T09:10:45.4422593Z add.s32 %r436, %r465, %r25; 2026-02-21T09:10:45.4422757Z ld.shared.b8 %rs79, [%r436+384]; 2026-02-21T09:10:45.4422962Z ld.shared.b8 %rs80, [%r436+896]; 2026-02-21T09:10:45.4423231Z .loc 1 67 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4423517Z shl.b16 %rs81, %rs73, 4; 2026-02-21T09:10:45.4423674Z shl.b16 %rs82, %rs75, 4; 2026-02-21T09:10:45.4423833Z shl.b16 %rs83, %rs77, 4; 2026-02-21T09:10:45.4423980Z shl.b16 %rs84, %rs79, 4; 2026-02-21T09:10:45.4424134Z shl.b16 %rs85, %rs74, 4; 2026-02-21T09:10:45.4424280Z shl.b16 %rs86, %rs76, 4; 2026-02-21T09:10:45.4424436Z shl.b16 %rs87, %rs78, 4; 2026-02-21T09:10:45.4424582Z shl.b16 %rs88, %rs80, 4; 2026-02-21T09:10:45.4424873Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4425171Z selp.b16 %rs89, %rs81, %rs73, 
%p331; 2026-02-21T09:10:45.4425344Z cvt.s16.s8 %rs90, %rs89; 2026-02-21T09:10:45.4425501Z shr.s16 %rs91, %rs90, 4; 2026-02-21T09:10:45.4425688Z selp.b16 %rs92, %rs82, %rs75, %p331; 2026-02-21T09:10:45.4425870Z cvt.s16.s8 %rs93, %rs92; 2026-02-21T09:10:45.4426016Z shr.s16 %rs94, %rs93, 4; 2026-02-21T09:10:45.4426174Z selp.b16 %rs95, %rs83, %rs77, %p331; 2026-02-21T09:10:45.4426344Z cvt.s16.s8 %rs96, %rs95; 2026-02-21T09:10:45.4426499Z shr.s16 %rs97, %rs96, 4; 2026-02-21T09:10:45.4426657Z selp.b16 %rs98, %rs84, %rs79, %p331; 2026-02-21T09:10:45.4426827Z cvt.s16.s8 %rs99, %rs98; 2026-02-21T09:10:45.4426983Z shr.s16 %rs100, %rs99, 4; 2026-02-21T09:10:45.4427143Z selp.b16 %rs101, %rs85, %rs74, %p331; 2026-02-21T09:10:45.4427328Z cvt.s16.s8 %rs102, %rs101; 2026-02-21T09:10:45.4427485Z shr.s16 %rs103, %rs102, 4; 2026-02-21T09:10:45.4427653Z selp.b16 %rs104, %rs86, %rs76, %p331; 2026-02-21T09:10:45.4427825Z cvt.s16.s8 %rs105, %rs104; 2026-02-21T09:10:45.4427984Z shr.s16 %rs106, %rs105, 4; 2026-02-21T09:10:45.4428144Z selp.b16 %rs107, %rs87, %rs78, %p331; 2026-02-21T09:10:45.4428322Z cvt.s16.s8 %rs108, %rs107; 2026-02-21T09:10:45.4428492Z shr.s16 %rs109, %rs108, 4; 2026-02-21T09:10:45.4428653Z selp.b16 %rs110, %rs88, %rs80, %p331; 2026-02-21T09:10:45.4428831Z cvt.s16.s8 %rs111, %rs110; 2026-02-21T09:10:45.4428981Z shr.s16 %rs112, %rs111, 4; 2026-02-21T09:10:45.4429244Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4429527Z cvt.rn.f32.s16 %r437, %rs91; 2026-02-21T09:10:45.4429699Z cvt.rn.f32.s16 %r438, %rs94; 2026-02-21T09:10:45.4429858Z cvt.rn.f32.s16 %r439, %rs97; 2026-02-21T09:10:45.4430027Z cvt.rn.f32.s16 %r440, %rs100; 2026-02-21T09:10:45.4430196Z cvt.rn.f32.s16 %r441, %rs103; 2026-02-21T09:10:45.4430359Z cvt.rn.f32.s16 %r442, %rs106; 2026-02-21T09:10:45.4430530Z cvt.rn.f32.s16 %r443, %rs109; 2026-02-21T09:10:45.4430693Z cvt.rn.f32.s16 %r444, %rs112; 2026-02-21T09:10:45.4430862Z st.shared.b32 [%r27], %r437; 
2026-02-21T09:10:45.4431019Z st.shared.b32 [%r28], %r438; 2026-02-21T09:10:45.4431184Z st.shared.b32 [%r29], %r439; 2026-02-21T09:10:45.4431340Z st.shared.b32 [%r30], %r440; 2026-02-21T09:10:45.4431505Z st.shared.b32 [%r31], %r441; 2026-02-21T09:10:45.4431713Z st.shared.b32 [%r32], %r442; 2026-02-21T09:10:45.4431868Z st.shared.b32 [%r33], %r443; 2026-02-21T09:10:45.4432033Z st.shared.b32 [%r34], %r444; 2026-02-21T09:10:45.4432299Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4432592Z shl.b32 %r445, %r1861, 3; 2026-02-21T09:10:45.4432747Z add.s32 %r446, %r179, %r445; 2026-02-21T09:10:45.4432939Z add.s32 %r1858, %r446, 26624; 2026-02-21T09:10:45.4433091Z $L__tmp47: 2026-02-21T09:10:45.4433387Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4433725Z // begin inline asm 2026-02-21T09:10:45.4433889Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4434060Z // end inline asm 2026-02-21T09:10:45.4434117Z bar.sync 0; 2026-02-21T09:10:45.4434179Z @%p55 bra $L__BB0_7; 2026-02-21T09:10:45.4434313Z // %bb.6: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:10:45.4434380Z elect.sync %r459|%p74, -1; 2026-02-21T09:10:45.4434441Z mov.b32 %r449, 135268624; 2026-02-21T09:10:45.4434498Z // begin inline asm 2026-02-21T09:10:45.4434663Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd3, %r449, %p73; 2026-02-21T09:10:45.4434722Z // end inline asm 2026-02-21T09:10:45.4434778Z // begin inline asm 2026-02-21T09:10:45.4434931Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 8 ], %rd4, %r449, %p73; 2026-02-21T09:10:45.4435012Z // end inline asm 2026-02-21T09:10:45.4435070Z // begin inline asm 2026-02-21T09:10:45.4435226Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 16 ], %rd5, %r449, %p73; 2026-02-21T09:10:45.4435281Z // end inline asm 2026-02-21T09:10:45.4435338Z // begin inline asm 
2026-02-21T09:10:45.4435511Z @%p74 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd6, %r449, %p73; 2026-02-21T09:10:45.4435573Z // end inline asm 2026-02-21T09:10:45.4435637Z cvt.u64.u32 %rd121, %r1858; 2026-02-21T09:10:45.4435692Z // begin inline asm 2026-02-21T09:10:45.4435825Z @%p74 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd121]; 2026-02-21T09:10:45.4435879Z // end inline asm 2026-02-21T09:10:45.4435935Z bra.uni $L__BB0_7; 2026-02-21T09:10:45.4436043Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4436135Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:45.4436191Z mov.b32 %r1868, 1; 2026-02-21T09:10:45.4436400Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4436462Z // begin inline asm 2026-02-21T09:10:45.4436513Z 2026-02-21T09:10:45.4436563Z { 2026-02-21T09:10:45.4436633Z .reg .pred complete; 2026-02-21T09:10:45.4436689Z waitLoop: 2026-02-21T09:10:45.4436809Z mbarrier.try_wait.parity.shared.b64 complete, [%r1858], %r1868; 2026-02-21T09:10:45.4436882Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4436932Z } 2026-02-21T09:10:45.4436936Z 2026-02-21T09:10:45.4436990Z // end inline asm 2026-02-21T09:10:45.4437044Z $L__tmp48: 2026-02-21T09:10:45.4437223Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4437287Z cp.async.wait_group 0; 2026-02-21T09:10:45.4437343Z bar.sync 0; 2026-02-21T09:10:45.4437408Z // begin inline asm 2026-02-21T09:10:45.4437497Z @%p273 mbarrier.inval.shared::cta.b64 [%r1310]; 2026-02-21T09:10:45.4437554Z // end inline asm 2026-02-21T09:10:45.4437609Z bar.sync 0; 2026-02-21T09:10:45.4437672Z // begin inline asm 2026-02-21T09:10:45.4437757Z @%p273 mbarrier.inval.shared::cta.b64 [%r1094]; 2026-02-21T09:10:45.4437811Z // end inline asm 2026-02-21T09:10:45.4437879Z add.s32 %r1865, %r179, 26624; 2026-02-21T09:10:45.4437935Z // begin inline asm 
2026-02-21T09:10:45.4438019Z @%p273 mbarrier.inval.shared::cta.b64 [%r1865]; 2026-02-21T09:10:45.4438080Z // end inline asm 2026-02-21T09:10:45.4438134Z bar.sync 0; 2026-02-21T09:10:45.4438190Z // begin inline asm 2026-02-21T09:10:45.4438269Z @%p273 mbarrier.inval.shared::cta.b64 [%r1096]; 2026-02-21T09:10:45.4438333Z // end inline asm 2026-02-21T09:10:45.4438387Z $L__tmp49: 2026-02-21T09:10:45.4438591Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4438678Z // begin inline asm 2026-02-21T09:10:45.4438957Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r479, %r480, %r481, %r482, %r483, %r484, %r485, %r486, %r487, %r488, %r489, %r490, %r491, %r492, %r493, %r494}, [%r1439 + 0]; 2026-02-21T09:10:45.4439013Z // end inline asm 2026-02-21T09:10:45.4439078Z // begin inline asm 2026-02-21T09:10:45.4439352Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r496, %r497, %r498, %r499, %r500, %r501, %r502, %r503, %r504, %r505, %r506, %r507, %r508, %r509, %r510, %r511}, [%r1439 + 16]; 2026-02-21T09:10:45.4439449Z // end inline asm 2026-02-21T09:10:45.4439508Z // begin inline asm 2026-02-21T09:10:45.4439589Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:45.4439646Z // end inline asm 2026-02-21T09:10:45.4439708Z cvt.u64.u32 %rd132, %r479; 2026-02-21T09:10:45.4439777Z cvt.u64.u32 %rd133, %r480; 2026-02-21T09:10:45.4439839Z shl.b64 %rd134, %rd133, 32; 2026-02-21T09:10:45.4439902Z or.b64 %rd135, %rd132, %rd134; 2026-02-21T09:10:45.4439956Z $L__tmp50: 2026-02-21T09:10:45.4440154Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4440219Z mov.b64 {%r591, %r592}, %rd135; 2026-02-21T09:10:45.4440289Z cvt.rn.bf16x2.f32 %r593, %r592, %r591; 2026-02-21T09:10:45.4440350Z $L__tmp51: 2026-02-21T09:10:45.4440577Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4440640Z cvt.u64.u32 %rd136, %r481; 
2026-02-21T09:10:45.4440708Z cvt.u64.u32 %rd137, %r482; 2026-02-21T09:10:45.4440770Z shl.b64 %rd138, %rd137, 32; 2026-02-21T09:10:45.4440832Z or.b64 %rd139, %rd136, %rd138; 2026-02-21T09:10:45.4440883Z $L__tmp52: 2026-02-21T09:10:45.4441054Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4441116Z mov.b64 {%r594, %r595}, %rd139; 2026-02-21T09:10:45.4441188Z cvt.rn.bf16x2.f32 %r596, %r595, %r594; 2026-02-21T09:10:45.4441249Z $L__tmp53: 2026-02-21T09:10:45.4441454Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4441514Z cvt.u64.u32 %rd140, %r483; 2026-02-21T09:10:45.4441617Z cvt.u64.u32 %rd141, %r484; 2026-02-21T09:10:45.4441679Z shl.b64 %rd142, %rd141, 32; 2026-02-21T09:10:45.4441739Z or.b64 %rd143, %rd140, %rd142; 2026-02-21T09:10:45.4441790Z $L__tmp54: 2026-02-21T09:10:45.4441961Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4442023Z mov.b64 {%r597, %r598}, %rd143; 2026-02-21T09:10:45.4442091Z cvt.rn.bf16x2.f32 %r599, %r598, %r597; 2026-02-21T09:10:45.4442152Z $L__tmp55: 2026-02-21T09:10:45.4442355Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4442416Z cvt.u64.u32 %rd144, %r485; 2026-02-21T09:10:45.4442482Z cvt.u64.u32 %rd145, %r486; 2026-02-21T09:10:45.4442542Z shl.b64 %rd146, %rd145, 32; 2026-02-21T09:10:45.4442603Z or.b64 %rd147, %rd144, %rd146; 2026-02-21T09:10:45.4442655Z $L__tmp56: 2026-02-21T09:10:45.4442823Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4442884Z mov.b64 {%r600, %r601}, %rd147; 2026-02-21T09:10:45.4442951Z cvt.rn.bf16x2.f32 %r602, %r601, %r600; 2026-02-21T09:10:45.4443011Z $L__tmp57: 2026-02-21T09:10:45.4443218Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 
2026-02-21T09:10:45.4443276Z cvt.u64.u32 %rd148, %r487; 2026-02-21T09:10:45.4443334Z cvt.u64.u32 %rd149, %r488; 2026-02-21T09:10:45.4443401Z shl.b64 %rd150, %rd149, 32; 2026-02-21T09:10:45.4443459Z or.b64 %rd151, %rd148, %rd150; 2026-02-21T09:10:45.4443511Z $L__tmp58: 2026-02-21T09:10:45.4443680Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4443768Z mov.b64 {%r603, %r604}, %rd151; 2026-02-21T09:10:45.4443836Z cvt.rn.bf16x2.f32 %r605, %r604, %r603; 2026-02-21T09:10:45.4443895Z $L__tmp59: 2026-02-21T09:10:45.4444100Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4444158Z cvt.u64.u32 %rd152, %r489; 2026-02-21T09:10:45.4444216Z cvt.u64.u32 %rd153, %r490; 2026-02-21T09:10:45.4444313Z shl.b64 %rd154, %rd153, 32; 2026-02-21T09:10:45.4444374Z or.b64 %rd155, %rd152, %rd154; 2026-02-21T09:10:45.4444425Z $L__tmp60: 2026-02-21T09:10:45.4444597Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4444656Z mov.b64 {%r606, %r607}, %rd155; 2026-02-21T09:10:45.4444724Z cvt.rn.bf16x2.f32 %r608, %r607, %r606; 2026-02-21T09:10:45.4444782Z $L__tmp61: 2026-02-21T09:10:45.4444986Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4445045Z cvt.u64.u32 %rd156, %r491; 2026-02-21T09:10:45.4445129Z cvt.u64.u32 %rd157, %r492; 2026-02-21T09:10:45.4445196Z shl.b64 %rd158, %rd157, 32; 2026-02-21T09:10:45.4445256Z or.b64 %rd159, %rd156, %rd158; 2026-02-21T09:10:45.4445310Z $L__tmp62: 2026-02-21T09:10:45.4445506Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4445568Z mov.b64 {%r609, %r610}, %rd159; 2026-02-21T09:10:45.4445633Z cvt.rn.bf16x2.f32 %r611, %r610, %r609; 2026-02-21T09:10:45.4445685Z $L__tmp63: 2026-02-21T09:10:45.4445894Z .loc 2 291 36 // standard.py:291:36 @[ 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4445951Z cvt.u64.u32 %rd160, %r493; 2026-02-21T09:10:45.4446009Z cvt.u64.u32 %rd161, %r494; 2026-02-21T09:10:45.4446074Z shl.b64 %rd162, %rd161, 32; 2026-02-21T09:10:45.4446134Z or.b64 %rd163, %rd160, %rd162; 2026-02-21T09:10:45.4446187Z $L__tmp64: 2026-02-21T09:10:45.4446355Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4446414Z mov.b64 {%r612, %r613}, %rd163; 2026-02-21T09:10:45.4446479Z cvt.rn.bf16x2.f32 %r614, %r613, %r612; 2026-02-21T09:10:45.4446532Z $L__tmp65: 2026-02-21T09:10:45.4446742Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4446800Z cvt.u64.u32 %rd164, %r496; 2026-02-21T09:10:45.4446857Z cvt.u64.u32 %rd165, %r497; 2026-02-21T09:10:45.4446925Z shl.b64 %rd166, %rd165, 32; 2026-02-21T09:10:45.4446983Z or.b64 %rd167, %rd164, %rd166; 2026-02-21T09:10:45.4447035Z $L__tmp66: 2026-02-21T09:10:45.4447203Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4447263Z mov.b64 {%r615, %r616}, %rd167; 2026-02-21T09:10:45.4447328Z cvt.rn.bf16x2.f32 %r617, %r616, %r615; 2026-02-21T09:10:45.4447381Z $L__tmp67: 2026-02-21T09:10:45.4447596Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4447655Z cvt.u64.u32 %rd168, %r498; 2026-02-21T09:10:45.4447713Z cvt.u64.u32 %rd169, %r499; 2026-02-21T09:10:45.4447780Z shl.b64 %rd170, %rd169, 32; 2026-02-21T09:10:45.4447840Z or.b64 %rd171, %rd168, %rd170; 2026-02-21T09:10:45.4447895Z $L__tmp68: 2026-02-21T09:10:45.4448065Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4448126Z mov.b64 {%r618, %r619}, %rd171; 2026-02-21T09:10:45.4448195Z cvt.rn.bf16x2.f32 %r620, %r619, %r618; 2026-02-21T09:10:45.4448249Z $L__tmp69: 2026-02-21T09:10:45.4448458Z .loc 2 
291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4448516Z cvt.u64.u32 %rd172, %r500; 2026-02-21T09:10:45.4448597Z cvt.u64.u32 %rd173, %r501; 2026-02-21T09:10:45.4448663Z shl.b64 %rd174, %rd173, 32; 2026-02-21T09:10:45.4448722Z or.b64 %rd175, %rd172, %rd174; 2026-02-21T09:10:45.4448773Z $L__tmp70: 2026-02-21T09:10:45.4448934Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4449001Z mov.b64 {%r621, %r622}, %rd175; 2026-02-21T09:10:45.4449068Z cvt.rn.bf16x2.f32 %r623, %r622, %r621; 2026-02-21T09:10:45.4449144Z $L__tmp71: 2026-02-21T09:10:45.4449357Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4449417Z cvt.u64.u32 %rd176, %r502; 2026-02-21T09:10:45.4449474Z cvt.u64.u32 %rd177, %r503; 2026-02-21T09:10:45.4449539Z shl.b64 %rd178, %rd177, 32; 2026-02-21T09:10:45.4449597Z or.b64 %rd179, %rd176, %rd178; 2026-02-21T09:10:45.4449649Z $L__tmp72: 2026-02-21T09:10:45.4449812Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4449879Z mov.b64 {%r624, %r625}, %rd179; 2026-02-21T09:10:45.4449968Z cvt.rn.bf16x2.f32 %r626, %r625, %r624; 2026-02-21T09:10:45.4450021Z $L__tmp73: 2026-02-21T09:10:45.4450231Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4450290Z cvt.u64.u32 %rd180, %r504; 2026-02-21T09:10:45.4450368Z cvt.u64.u32 %rd181, %r505; 2026-02-21T09:10:45.4450439Z shl.b64 %rd182, %rd181, 32; 2026-02-21T09:10:45.4450497Z or.b64 %rd183, %rd180, %rd182; 2026-02-21T09:10:45.4450550Z $L__tmp74: 2026-02-21T09:10:45.4450710Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4450775Z mov.b64 {%r627, %r628}, %rd183; 2026-02-21T09:10:45.4450841Z cvt.rn.bf16x2.f32 %r629, %r628, %r627; 2026-02-21T09:10:45.4450892Z $L__tmp75: 
2026-02-21T09:10:45.4451103Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4451164Z cvt.u64.u32 %rd184, %r506; 2026-02-21T09:10:45.4451221Z cvt.u64.u32 %rd185, %r507; 2026-02-21T09:10:45.4451287Z shl.b64 %rd186, %rd185, 32; 2026-02-21T09:10:45.4451346Z or.b64 %rd187, %rd184, %rd186; 2026-02-21T09:10:45.4451398Z $L__tmp76: 2026-02-21T09:10:45.4451597Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4451666Z mov.b64 {%r630, %r631}, %rd187; 2026-02-21T09:10:45.4451732Z cvt.rn.bf16x2.f32 %r632, %r631, %r630; 2026-02-21T09:10:45.4451783Z $L__tmp77: 2026-02-21T09:10:45.4451995Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4452053Z cvt.u64.u32 %rd188, %r508; 2026-02-21T09:10:45.4452110Z cvt.u64.u32 %rd189, %r509; 2026-02-21T09:10:45.4452170Z shl.b64 %rd190, %rd189, 32; 2026-02-21T09:10:45.4452237Z or.b64 %rd191, %rd188, %rd190; 2026-02-21T09:10:45.4452288Z $L__tmp78: 2026-02-21T09:10:45.4452448Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4452514Z mov.b64 {%r633, %r634}, %rd191; 2026-02-21T09:10:45.4452579Z cvt.rn.bf16x2.f32 %r635, %r634, %r633; 2026-02-21T09:10:45.4452630Z $L__tmp79: 2026-02-21T09:10:45.4452839Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4452899Z cvt.u64.u32 %rd192, %r510; 2026-02-21T09:10:45.4452957Z cvt.u64.u32 %rd193, %r511; 2026-02-21T09:10:45.4453015Z shl.b64 %rd194, %rd193, 32; 2026-02-21T09:10:45.4453080Z or.b64 %rd195, %rd192, %rd194; 2026-02-21T09:10:45.4453132Z $L__tmp80: 2026-02-21T09:10:45.4453294Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4453358Z mov.b64 {%r636, %r637}, %rd195; 2026-02-21T09:10:45.4453453Z cvt.rn.bf16x2.f32 %r638, %r637, %r636; 
2026-02-21T09:10:45.4453615Z .loc 1 98 43 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:98:43 2026-02-21T09:10:45.4453718Z st.shared.v4.b32 [%r35], {%r593, %r596, %r599, %r602}; 2026-02-21T09:10:45.4453818Z st.shared.v4.b32 [%r36], {%r605, %r608, %r611, %r614}; 2026-02-21T09:10:45.4453911Z st.shared.v4.b32 [%r37], {%r617, %r620, %r623, %r626}; 2026-02-21T09:10:45.4454027Z st.shared.v4.b32 [%r38], {%r629, %r632, %r635, %r638}; 2026-02-21T09:10:45.4454096Z // begin inline asm 2026-02-21T09:10:45.4454173Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4454231Z // end inline asm 2026-02-21T09:10:45.4454295Z bar.sync 0; 2026-02-21T09:10:45.4454364Z elect.sync %r639|%p110, -1; 2026-02-21T09:10:45.4454432Z and.pred %p95, %p1, %p110; 2026-02-21T09:10:45.4454498Z // begin inline asm 2026-02-21T09:10:45.4454686Z @%p95 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd507, {%r513, %r514}], [%r179]; 2026-02-21T09:10:45.4454747Z // end inline asm 2026-02-21T09:10:45.4454815Z cp.async.bulk.commit_group; 2026-02-21T09:10:45.4454923Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:45.4454981Z bar.sync 0; 2026-02-21T09:10:45.4455150Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4455221Z add.s32 %r640, %r1855, 1; 2026-02-21T09:10:45.4455413Z .loc 1 38 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:38:33 2026-02-21T09:10:45.4455480Z shr.u32 %r641, %r640, 5; 2026-02-21T09:10:45.4455551Z and.b32 %r642, %r641, 67108848; 2026-02-21T09:10:45.4455719Z .loc 1 39 39 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:39:39 2026-02-21T09:10:45.4455781Z sub.s32 %r643, 128, %r642; 2026-02-21T09:10:45.4455946Z .loc 1 39 52 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:39:52 2026-02-21T09:10:45.4456019Z min.s32 %r644, %r643, 16; 2026-02-21T09:10:45.4456190Z .loc 1 40 45 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:45 2026-02-21T09:10:45.4456251Z and.b32 %r645, %r640, 
511; 2026-02-21T09:10:45.4456424Z .loc 1 41 51 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:41:51 2026-02-21T09:10:45.4456486Z div.s32 %r96, %r645, %r644; 2026-02-21T09:10:45.4456655Z .loc 1 40 64 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:64 2026-02-21T09:10:45.4456728Z mul.lo.s32 %r646, %r96, %r644; 2026-02-21T09:10:45.4456793Z sub.s32 %r647, %r645, %r646; 2026-02-21T09:10:45.4456961Z .loc 1 40 30 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:30 2026-02-21T09:10:45.4457035Z add.s32 %r648, %r647, %r642; 2026-02-21T09:10:45.4457200Z .loc 1 42 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:42:27 2026-02-21T09:10:45.4457265Z shl.b32 %r822, %r648, 6; 2026-02-21T09:10:45.4457433Z .loc 1 43 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:43:27 2026-02-21T09:10:45.4457506Z shl.b32 %r823, %r96, 7; 2026-02-21T09:10:45.4457673Z .loc 1 44 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:44:32 2026-02-21T09:10:45.4457736Z or.b32 %r649, %r823, %r7; 2026-02-21T09:10:45.4457806Z or.b32 %r650, %r823, %r8; 2026-02-21T09:10:45.4457974Z .loc 1 58 53 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:53 2026-02-21T09:10:45.4458036Z shl.b32 %r651, %r649, 10; 2026-02-21T09:10:45.4458105Z shl.b32 %r652, %r650, 10; 2026-02-21T09:10:45.4458169Z mov.pred %p115, -1; 2026-02-21T09:10:45.4458227Z mov.b32 %r1864, 0; 2026-02-21T09:10:45.4458282Z $L__tmp81: 2026-02-21T09:10:45.4458507Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4458603Z // begin inline asm 2026-02-21T09:10:45.4458932Z @%p115 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1439 + 0], {%r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864}; 2026-02-21T09:10:45.4458999Z // end inline asm 2026-02-21T09:10:45.4459057Z // begin inline asm 2026-02-21T09:10:45.4459384Z @%p115 
tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1439 + 16], {%r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864, %r1864}; 2026-02-21T09:10:45.4459474Z // end inline asm 2026-02-21T09:10:45.4459532Z // begin inline asm 2026-02-21T09:10:45.4459605Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4459662Z // end inline asm 2026-02-21T09:10:45.4459726Z bar.sync 0; 2026-02-21T09:10:45.4459781Z $L__tmp82: 2026-02-21T09:10:45.4459956Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4460022Z // begin inline asm 2026-02-21T09:10:45.4460116Z @%p273 mbarrier.init.shared::cta.b64 [%r1865], 1; 2026-02-21T09:10:45.4460197Z // end inline asm 2026-02-21T09:10:45.4460262Z bar.sync 0; 2026-02-21T09:10:45.4460321Z // begin inline asm 2026-02-21T09:10:45.4460409Z @%p273 mbarrier.init.shared::cta.b64 [%r1096], 1; 2026-02-21T09:10:45.4460467Z // end inline asm 2026-02-21T09:10:45.4460534Z // begin inline asm 2026-02-21T09:10:45.4460642Z @%p273 mbarrier.init.shared::cta.b64 [%r1310], 1; 2026-02-21T09:10:45.4460703Z // end inline asm 2026-02-21T09:10:45.4460766Z bar.sync 0; 2026-02-21T09:10:45.4460825Z // begin inline asm 2026-02-21T09:10:45.4460909Z @%p273 mbarrier.init.shared::cta.b64 [%r1094], 1; 2026-02-21T09:10:45.4460967Z // end inline asm 2026-02-21T09:10:45.4461141Z .loc 1 58 60 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:60 2026-02-21T09:10:45.4461203Z or.b32 %r653, %r651, %r9; 2026-02-21T09:10:45.4461263Z or.b32 %r654, %r652, %r9; 2026-02-21T09:10:45.4461439Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4461512Z mad.wide.s32 %rd126, %r653, 2, %rd52; 2026-02-21T09:10:45.4461620Z mad.wide.s32 %rd127, %r654, 2, %rd52; 2026-02-21T09:10:45.4461688Z mov.b32 %r689, 16; 2026-02-21T09:10:45.4461865Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 
2026-02-21T09:10:45.4461924Z // begin inline asm 2026-02-21T09:10:45.4462043Z cp.async.cg.shared.global [ %r12 + 0 ], [ %rd126 + 0 ], 0x10, %r689; 2026-02-21T09:10:45.4462103Z // end inline asm 2026-02-21T09:10:45.4462159Z // begin inline asm 2026-02-21T09:10:45.4462272Z cp.async.cg.shared.global [ %r13 + 0 ], [ %rd127 + 0 ], 0x10, %r689; 2026-02-21T09:10:45.4462335Z // end inline asm 2026-02-21T09:10:45.4462399Z cp.async.commit_group; 2026-02-21T09:10:45.4462563Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4462625Z bar.sync 0; 2026-02-21T09:10:45.4462680Z // begin inline asm 2026-02-21T09:10:45.4462791Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1310], 1024; 2026-02-21T09:10:45.4462845Z // end inline asm 2026-02-21T09:10:45.4463008Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4463062Z bar.sync 0; 2026-02-21T09:10:45.4463125Z elect.sync %r655|%p111, -1; 2026-02-21T09:10:45.4463197Z and.pred %p103, %p1, %p111; 2026-02-21T09:10:45.4463255Z // begin inline asm 2026-02-21T09:10:45.4463496Z @%p103 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r219], [%rd497, {%r822, %r1864}], [%r1310]; 2026-02-21T09:10:45.4463558Z // end inline asm 2026-02-21T09:10:45.4463716Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4463777Z cvt.s64.s32 %rd196, %r651; 2026-02-21T09:10:45.4463838Z or.b64 %rd197, %rd196, %rd572; 2026-02-21T09:10:45.4463930Z shl.b64 %rd198, %rd197, 1; 2026-02-21T09:10:45.4463993Z add.s64 %rd21, %rd52, %rd198; 2026-02-21T09:10:45.4464055Z add.s64 %rd129, %rd21, 64; 2026-02-21T09:10:45.4464121Z cvt.s64.s32 %rd199, %r652; 2026-02-21T09:10:45.4464180Z or.b64 %rd200, %rd199, %rd572; 2026-02-21T09:10:45.4464237Z shl.b64 %rd201, %rd200, 1; 2026-02-21T09:10:45.4464304Z add.s64 %rd22, %rd52, %rd201; 2026-02-21T09:10:45.4464363Z add.s64 %rd130, %rd22, 64; 
2026-02-21T09:10:45.4464551Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4464608Z // begin inline asm 2026-02-21T09:10:45.4464730Z cp.async.cg.shared.global [ %r15 + 0 ], [ %rd129 + 0 ], 0x10, %r689; 2026-02-21T09:10:45.4464785Z // end inline asm 2026-02-21T09:10:45.4464840Z // begin inline asm 2026-02-21T09:10:45.4464960Z cp.async.cg.shared.global [ %r16 + 0 ], [ %rd130 + 0 ], 0x10, %r689; 2026-02-21T09:10:45.4465015Z // end inline asm 2026-02-21T09:10:45.4465082Z cp.async.commit_group; 2026-02-21T09:10:45.4465273Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4465336Z bar.sync 0; 2026-02-21T09:10:45.4465394Z // begin inline asm 2026-02-21T09:10:45.4465504Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1094], 1024; 2026-02-21T09:10:45.4465567Z // end inline asm 2026-02-21T09:10:45.4465748Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4465806Z bar.sync 0; 2026-02-21T09:10:45.4465879Z elect.sync %r656|%p112, -1; 2026-02-21T09:10:45.4465944Z and.pred %p105, %p1, %p112; 2026-02-21T09:10:45.4466001Z // begin inline asm 2026-02-21T09:10:45.4466237Z @%p105 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r301], [%rd497, {%r822, %r689}], [%r1094]; 2026-02-21T09:10:45.4466302Z // end inline asm 2026-02-21T09:10:45.4466461Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4466528Z cp.async.wait_group 1; 2026-02-21T09:10:45.4466600Z bar.sync 0; 2026-02-21T09:10:45.4466692Z ld.shared.v4.b32 {%r657, %r658, %r659, %r660}, [%r18]; 2026-02-21T09:10:45.4466758Z mov.b32 {%rs113, %rs114}, %r660; 2026-02-21T09:10:45.4466828Z mov.b32 {%rs115, %rs116}, %r659; 2026-02-21T09:10:45.4466888Z mov.b32 {%rs117, %rs118}, %r658; 2026-02-21T09:10:45.4466949Z mov.b32 {%rs119, %rs120}, %r657; 2026-02-21T09:10:45.4467048Z ld.shared.v4.b32 {%r661, %r662, %r663, 
%r664}, [%r18+16]; 2026-02-21T09:10:45.4467114Z mov.b32 {%rs121, %rs122}, %r664; 2026-02-21T09:10:45.4467172Z mov.b32 {%rs123, %rs124}, %r663; 2026-02-21T09:10:45.4467230Z mov.b32 {%rs125, %rs126}, %r662; 2026-02-21T09:10:45.4467296Z mov.b32 {%rs127, %rs128}, %r661; 2026-02-21T09:10:45.4467452Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 2026-02-21T09:10:45.4467514Z cvt.f32.bf16 %r573, %rs119; 2026-02-21T09:10:45.4467583Z cvt.f32.bf16 %r574, %rs120; 2026-02-21T09:10:45.4467640Z cvt.f32.bf16 %r575, %rs117; 2026-02-21T09:10:45.4467699Z cvt.f32.bf16 %r576, %rs118; 2026-02-21T09:10:45.4467757Z cvt.f32.bf16 %r577, %rs115; 2026-02-21T09:10:45.4467821Z cvt.f32.bf16 %r578, %rs116; 2026-02-21T09:10:45.4467879Z cvt.f32.bf16 %r579, %rs113; 2026-02-21T09:10:45.4467936Z cvt.f32.bf16 %r580, %rs114; 2026-02-21T09:10:45.4468000Z cvt.f32.bf16 %r581, %rs127; 2026-02-21T09:10:45.4468058Z cvt.f32.bf16 %r582, %rs128; 2026-02-21T09:10:45.4468115Z cvt.f32.bf16 %r583, %rs125; 2026-02-21T09:10:45.4468172Z cvt.f32.bf16 %r584, %rs126; 2026-02-21T09:10:45.4468237Z cvt.f32.bf16 %r585, %rs123; 2026-02-21T09:10:45.4468310Z cvt.f32.bf16 %r586, %rs124; 2026-02-21T09:10:45.4468367Z cvt.f32.bf16 %r587, %rs121; 2026-02-21T09:10:45.4468432Z cvt.f32.bf16 %r588, %rs122; 2026-02-21T09:10:45.4468484Z $L__tmp83: 2026-02-21T09:10:45.4468692Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4468782Z // begin inline asm 2026-02-21T09:10:45.4469060Z @%p115 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1190 + 0], {%r573, %r574, %r575, %r576, %r577, %r578, %r579, %r580, %r581, %r582, %r583, %r584, %r585, %r586, %r587, %r588}; 2026-02-21T09:10:45.4469117Z // end inline asm 2026-02-21T09:10:45.4469174Z // begin inline asm 2026-02-21T09:10:45.4469251Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4469329Z // end inline asm 2026-02-21T09:10:45.4469382Z bar.sync 0; 2026-02-21T09:10:45.4469442Z $L__tmp84: 
2026-02-21T09:10:45.4469607Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4469664Z // begin inline asm 2026-02-21T09:10:45.4469716Z 2026-02-21T09:10:45.4469773Z { 2026-02-21T09:10:45.4469835Z .reg .pred complete; 2026-02-21T09:10:45.4469889Z waitLoop: 2026-02-21T09:10:45.4470016Z mbarrier.try_wait.parity.shared.b64 complete, [%r1310], %r1864; 2026-02-21T09:10:45.4470080Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4470131Z } 2026-02-21T09:10:45.4470163Z 2026-02-21T09:10:45.4470225Z // end inline asm 2026-02-21T09:10:45.4470385Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4470449Z ld.shared.b8 %rs129, [%r20]; 2026-02-21T09:10:45.4470512Z ld.shared.b8 %rs130, [%r20+512]; 2026-02-21T09:10:45.4470610Z ld.shared.b8 %rs131, [%r22+128]; 2026-02-21T09:10:45.4470673Z ld.shared.b8 %rs132, [%r22+640]; 2026-02-21T09:10:45.4470732Z ld.shared.b8 %rs133, [%r24+256]; 2026-02-21T09:10:45.4470797Z ld.shared.b8 %rs134, [%r24+768]; 2026-02-21T09:10:45.4470856Z ld.shared.b8 %rs135, [%r26+384]; 2026-02-21T09:10:45.4470915Z ld.shared.b8 %rs136, [%r26+896]; 2026-02-21T09:10:45.4471081Z .loc 1 67 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4471141Z shl.b16 %rs137, %rs129, 4; 2026-02-21T09:10:45.4471202Z shl.b16 %rs138, %rs131, 4; 2026-02-21T09:10:45.4471260Z shl.b16 %rs139, %rs133, 4; 2026-02-21T09:10:45.4471326Z shl.b16 %rs140, %rs135, 4; 2026-02-21T09:10:45.4471383Z shl.b16 %rs141, %rs130, 4; 2026-02-21T09:10:45.4471440Z shl.b16 %rs142, %rs132, 4; 2026-02-21T09:10:45.4471503Z shl.b16 %rs143, %rs134, 4; 2026-02-21T09:10:45.4471597Z shl.b16 %rs144, %rs136, 4; 2026-02-21T09:10:45.4471762Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4471834Z selp.b16 %rs145, %rs137, %rs129, %p331; 2026-02-21T09:10:45.4471901Z cvt.s16.s8 %rs146, %rs145; 2026-02-21T09:10:45.4471958Z shr.s16 %rs147, 
%rs146, 4; 2026-02-21T09:10:45.4472028Z selp.b16 %rs148, %rs138, %rs131, %p331; 2026-02-21T09:10:45.4472095Z cvt.s16.s8 %rs149, %rs148; 2026-02-21T09:10:45.4472151Z shr.s16 %rs150, %rs149, 4; 2026-02-21T09:10:45.4472219Z selp.b16 %rs151, %rs139, %rs133, %p331; 2026-02-21T09:10:45.4472277Z cvt.s16.s8 %rs152, %rs151; 2026-02-21T09:10:45.4472343Z shr.s16 %rs153, %rs152, 4; 2026-02-21T09:10:45.4472409Z selp.b16 %rs154, %rs140, %rs135, %p331; 2026-02-21T09:10:45.4472468Z cvt.s16.s8 %rs155, %rs154; 2026-02-21T09:10:45.4472534Z shr.s16 %rs156, %rs155, 4; 2026-02-21T09:10:45.4472599Z selp.b16 %rs157, %rs141, %rs130, %p331; 2026-02-21T09:10:45.4472657Z cvt.s16.s8 %rs158, %rs157; 2026-02-21T09:10:45.4472723Z shr.s16 %rs159, %rs158, 4; 2026-02-21T09:10:45.4472790Z selp.b16 %rs160, %rs142, %rs132, %p331; 2026-02-21T09:10:45.4472852Z cvt.s16.s8 %rs161, %rs160; 2026-02-21T09:10:45.4472911Z shr.s16 %rs162, %rs161, 4; 2026-02-21T09:10:45.4472985Z selp.b16 %rs163, %rs143, %rs134, %p331; 2026-02-21T09:10:45.4473045Z cvt.s16.s8 %rs164, %rs163; 2026-02-21T09:10:45.4473103Z shr.s16 %rs165, %rs164, 4; 2026-02-21T09:10:45.4473174Z selp.b16 %rs166, %rs144, %rs136, %p331; 2026-02-21T09:10:45.4473234Z cvt.s16.s8 %rs167, %rs166; 2026-02-21T09:10:45.4473293Z shr.s16 %rs168, %rs167, 4; 2026-02-21T09:10:45.4473481Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4473553Z cvt.rn.f32.s16 %r665, %rs147; 2026-02-21T09:10:45.4473615Z cvt.rn.f32.s16 %r666, %rs150; 2026-02-21T09:10:45.4473676Z cvt.rn.f32.s16 %r667, %rs153; 2026-02-21T09:10:45.4473744Z cvt.rn.f32.s16 %r668, %rs156; 2026-02-21T09:10:45.4473803Z cvt.rn.f32.s16 %r669, %rs159; 2026-02-21T09:10:45.4473864Z cvt.rn.f32.s16 %r670, %rs162; 2026-02-21T09:10:45.4473961Z cvt.rn.f32.s16 %r671, %rs165; 2026-02-21T09:10:45.4474022Z cvt.rn.f32.s16 %r672, %rs168; 2026-02-21T09:10:45.4474086Z st.shared.b32 [%r27], %r665; 2026-02-21T09:10:45.4474148Z st.shared.b32 [%r28], %r666; 
2026-02-21T09:10:45.4474218Z st.shared.b32 [%r29], %r667; 2026-02-21T09:10:45.4474279Z st.shared.b32 [%r30], %r668; 2026-02-21T09:10:45.4474339Z st.shared.b32 [%r31], %r669; 2026-02-21T09:10:45.4474404Z st.shared.b32 [%r32], %r670; 2026-02-21T09:10:45.4474463Z st.shared.b32 [%r33], %r671; 2026-02-21T09:10:45.4474523Z st.shared.b32 [%r34], %r672; 2026-02-21T09:10:45.4474577Z $L__tmp85: 2026-02-21T09:10:45.4474821Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4474880Z // begin inline asm 2026-02-21T09:10:45.4474951Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4475014Z // end inline asm 2026-02-21T09:10:45.4475069Z bar.sync 0; 2026-02-21T09:10:45.4475173Z @%p55 bra $L__BB0_10; 2026-02-21T09:10:45.4475274Z // %bb.9: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4475346Z elect.sync %r685|%p114, -1; 2026-02-21T09:10:45.4475405Z mov.b32 %r675, 135268624; 2026-02-21T09:10:45.4475466Z mov.pred %p113, 0; 2026-02-21T09:10:45.4475530Z // begin inline asm 2026-02-21T09:10:45.4475689Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd3, %r675, %p113; 2026-02-21T09:10:45.4475745Z // end inline asm 2026-02-21T09:10:45.4475811Z // begin inline asm 2026-02-21T09:10:45.4475961Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 8 ], %rd4, %r675, %p115; 2026-02-21T09:10:45.4476018Z // end inline asm 2026-02-21T09:10:45.4476074Z // begin inline asm 2026-02-21T09:10:45.4476229Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 16 ], %rd5, %r675, %p115; 2026-02-21T09:10:45.4476284Z // end inline asm 2026-02-21T09:10:45.4476343Z // begin inline asm 2026-02-21T09:10:45.4476500Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd6, %r675, %p115; 2026-02-21T09:10:45.4476555Z // end inline asm 2026-02-21T09:10:45.4476615Z add.s32 %r687, %r179, 26624; 2026-02-21T09:10:45.4476684Z cvt.u64.u32 %rd206, %r687; 
2026-02-21T09:10:45.4476739Z // begin inline asm 2026-02-21T09:10:45.4476864Z @%p114 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd206]; 2026-02-21T09:10:45.4476926Z // end inline asm 2026-02-21T09:10:45.4476979Z $L__tmp86: 2026-02-21T09:10:45.4477081Z $L__BB0_10: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4477244Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4477313Z add.s64 %rd207, %rd21, 128; 2026-02-21T09:10:45.4477373Z add.s64 %rd208, %rd22, 128; 2026-02-21T09:10:45.4477532Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4477596Z // begin inline asm 2026-02-21T09:10:45.4477712Z cp.async.cg.shared.global [ %r12 + 0 ], [ %rd207 + 0 ], 0x10, %r689; 2026-02-21T09:10:45.4477766Z // end inline asm 2026-02-21T09:10:45.4477829Z // begin inline asm 2026-02-21T09:10:45.4477941Z cp.async.cg.shared.global [ %r13 + 0 ], [ %rd208 + 0 ], 0x10, %r689; 2026-02-21T09:10:45.4477996Z // end inline asm 2026-02-21T09:10:45.4478058Z cp.async.commit_group; 2026-02-21T09:10:45.4478229Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4478306Z // begin inline asm 2026-02-21T09:10:45.4478418Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1310], 1024; 2026-02-21T09:10:45.4478479Z // end inline asm 2026-02-21T09:10:45.4478640Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4478695Z bar.sync 0; 2026-02-21T09:10:45.4478758Z elect.sync %r701|%p125, -1; 2026-02-21T09:10:45.4478828Z and.pred %p123, %p1, %p125; 2026-02-21T09:10:45.4478906Z mov.b32 %r695, 32; 2026-02-21T09:10:45.4478961Z // begin inline asm 2026-02-21T09:10:45.4479212Z @%p123 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r219], [%rd497, {%r822, %r695}], [%r1310]; 2026-02-21T09:10:45.4479266Z // end inline asm 2026-02-21T09:10:45.4479430Z .loc 1 51 125 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4479497Z shl.b32 %r702, %r96, 17; 2026-02-21T09:10:45.4479559Z or.b32 %r703, %r1854, %r702; 2026-02-21T09:10:45.4479627Z mad.wide.s32 %rd575, %r703, 2, %rd7; 2026-02-21T09:10:45.4479711Z or.b32 %r1863, %r40, %r702; 2026-02-21T09:10:45.4479776Z mov.b64 %rd576, 0; 2026-02-21T09:10:45.4479834Z mov.b32 %r1866, %r1864; 2026-02-21T09:10:45.4479892Z mov.b32 %r1867, %r1864; 2026-02-21T09:10:45.4479956Z mov.b32 %r1869, %r1864; 2026-02-21T09:10:45.4480014Z bra.uni $L__BB0_11; 2026-02-21T09:10:45.4480135Z $L__BB0_13: // in Loop: Header=BB0_11 Depth=2 2026-02-21T09:10:45.4480312Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4480379Z setp.lt.u64 %p141, %rd576, 464; 2026-02-21T09:10:45.4480433Z $L__tmp87: 2026-02-21T09:10:45.4480646Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4480712Z add.s32 %r778, %r1868, 1; 2026-02-21T09:10:45.4480776Z setp.gt.s32 %p144, %r778, 1; 2026-02-21T09:10:45.4480842Z selp.b32 %r1868, 0, %r778, %p144; 2026-02-21T09:10:45.4480910Z selp.b32 %r779, 1, 0, %p144; 2026-02-21T09:10:45.4480969Z xor.b32 %r114, %r1869, %r779; 2026-02-21T09:10:45.4481021Z $L__tmp88: 2026-02-21T09:10:45.4481188Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4481257Z mad.wide.s32 %rd217, %r1863, 2, %rd52; 2026-02-21T09:10:45.4481422Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4481482Z add.s32 %r769, %r108, %r11; 2026-02-21T09:10:45.4481593Z selp.b32 %r770, 16, 0, %p141; 2026-02-21T09:10:45.4481654Z // begin inline asm 2026-02-21T09:10:45.4481772Z cp.async.cg.shared.global [ %r769 + 0 ], [ %rd575 + 0 ], 0x10, %r770; 2026-02-21T09:10:45.4481835Z // end inline asm 2026-02-21T09:10:45.4481892Z add.s32 %r771, %r769, 4096; 2026-02-21T09:10:45.4481948Z // begin inline 
asm 2026-02-21T09:10:45.4482074Z cp.async.cg.shared.global [ %r771 + 0 ], [ %rd217 + 0 ], 0x10, %r770; 2026-02-21T09:10:45.4482131Z // end inline asm 2026-02-21T09:10:45.4482198Z cp.async.commit_group; 2026-02-21T09:10:45.4482365Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4482441Z and.pred %p139, %p273, %p141; 2026-02-21T09:10:45.4482500Z // begin inline asm 2026-02-21T09:10:45.4482614Z @%p139 mbarrier.arrive.expect_tx.shared.b64 _, [%r773], 1024; 2026-02-21T09:10:45.4482678Z // end inline asm 2026-02-21T09:10:45.4482838Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4482893Z bar.sync 0; 2026-02-21T09:10:45.4482956Z elect.sync %r780|%p145, -1; 2026-02-21T09:10:45.4483027Z and.pred %p146, %p141, %p145; 2026-02-21T09:10:45.4483089Z and.pred %p140, %p1, %p146; 2026-02-21T09:10:45.4483149Z cvt.u32.u64 %r781, %rd576; 2026-02-21T09:10:45.4483217Z add.s32 %r776, %r781, 48; 2026-02-21T09:10:45.4483301Z // begin inline asm 2026-02-21T09:10:45.4483546Z @%p140 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r774], [%rd497, {%r822, %r776}], [%r773]; 2026-02-21T09:10:45.4483609Z // end inline asm 2026-02-21T09:10:45.4483772Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4483831Z add.s64 %rd575, %rd575, 64; 2026-02-21T09:10:45.4483890Z add.s32 %r1863, %r1863, 32; 2026-02-21T09:10:45.4483989Z setp.lt.u64 %p147, %rd576, 480; 2026-02-21T09:10:45.4484047Z add.s64 %rd576, %rd576, 16; 2026-02-21T09:10:45.4484105Z mov.b32 %r1864, %r1869; 2026-02-21T09:10:45.4484171Z mov.b32 %r1869, %r114; 2026-02-21T09:10:45.4484232Z @%p147 bra $L__BB0_11; 2026-02-21T09:10:45.4484290Z bra.uni $L__BB0_14; 2026-02-21T09:10:45.4484396Z $L__BB0_11: // Parent Loop BB0_2 Depth=1 2026-02-21T09:10:45.4484492Z // => This Inner Loop Header: Depth=2 2026-02-21T09:10:45.4484553Z add.s32 %r725, %r1867, 1; 
2026-02-21T09:10:45.4484638Z setp.gt.s32 %p129, %r725, 1; 2026-02-21T09:10:45.4484712Z selp.b32 %r1867, 0, %r725, %p129; 2026-02-21T09:10:45.4484874Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4484937Z cp.async.wait_group 1; 2026-02-21T09:10:45.4484999Z bar.sync 0; 2026-02-21T09:10:45.4485081Z shl.b32 %r726, %r1867, 13; 2026-02-21T09:10:45.4485143Z add.s32 %r108, %r179, %r726; 2026-02-21T09:10:45.4485202Z add.s32 %r728, %r108, %r17; 2026-02-21T09:10:45.4485306Z ld.shared.v4.b32 {%r729, %r730, %r731, %r732}, [%r728]; 2026-02-21T09:10:45.4485371Z mov.b32 {%rs169, %rs170}, %r732; 2026-02-21T09:10:45.4485434Z mov.b32 {%rs171, %rs172}, %r731; 2026-02-21T09:10:45.4485501Z mov.b32 {%rs173, %rs174}, %r730; 2026-02-21T09:10:45.4485561Z mov.b32 {%rs175, %rs176}, %r729; 2026-02-21T09:10:45.4485659Z ld.shared.v4.b32 {%r733, %r734, %r735, %r736}, [%r728+16]; 2026-02-21T09:10:45.4485727Z mov.b32 {%rs177, %rs178}, %r736; 2026-02-21T09:10:45.4485788Z mov.b32 {%rs179, %rs180}, %r735; 2026-02-21T09:10:45.4485846Z mov.b32 {%rs181, %rs182}, %r734; 2026-02-21T09:10:45.4485904Z mov.b32 {%rs183, %rs184}, %r733; 2026-02-21T09:10:45.4486075Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 2026-02-21T09:10:45.4486138Z cvt.f32.bf16 %r707, %rs175; 2026-02-21T09:10:45.4486198Z cvt.f32.bf16 %r708, %rs176; 2026-02-21T09:10:45.4486264Z cvt.f32.bf16 %r709, %rs173; 2026-02-21T09:10:45.4486321Z cvt.f32.bf16 %r710, %rs174; 2026-02-21T09:10:45.4486380Z cvt.f32.bf16 %r711, %rs171; 2026-02-21T09:10:45.4486445Z cvt.f32.bf16 %r712, %rs172; 2026-02-21T09:10:45.4486502Z cvt.f32.bf16 %r713, %rs169; 2026-02-21T09:10:45.4486559Z cvt.f32.bf16 %r714, %rs170; 2026-02-21T09:10:45.4486617Z cvt.f32.bf16 %r715, %rs183; 2026-02-21T09:10:45.4486678Z cvt.f32.bf16 %r716, %rs184; 2026-02-21T09:10:45.4486737Z cvt.f32.bf16 %r717, %rs181; 2026-02-21T09:10:45.4486794Z cvt.f32.bf16 %r718, %rs182; 2026-02-21T09:10:45.4486859Z cvt.f32.bf16 
%r719, %rs179; 2026-02-21T09:10:45.4486916Z cvt.f32.bf16 %r720, %rs180; 2026-02-21T09:10:45.4486972Z cvt.f32.bf16 %r721, %rs177; 2026-02-21T09:10:45.4487030Z cvt.f32.bf16 %r722, %rs178; 2026-02-21T09:10:45.4487089Z $L__tmp89: 2026-02-21T09:10:45.4487298Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4487357Z // begin inline asm 2026-02-21T09:10:45.4487414Z 2026-02-21T09:10:45.4487465Z { 2026-02-21T09:10:45.4487525Z .reg .pred complete; 2026-02-21T09:10:45.4487580Z waitLoop: 2026-02-21T09:10:45.4487705Z mbarrier.try_wait.parity.shared.b64 complete, [%r1865], %r1864; 2026-02-21T09:10:45.4487770Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4487821Z } 2026-02-21T09:10:45.4487825Z 2026-02-21T09:10:45.4487888Z // end inline asm 2026-02-21T09:10:45.4487963Z $L__tmp90: 2026-02-21T09:10:45.4488128Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4488196Z selp.b32 %r737, 1, 0, %p129; 2026-02-21T09:10:45.4488258Z xor.b32 %r1866, %r1866, %r737; 2026-02-21T09:10:45.4488318Z mov.pred %p130, -1; 2026-02-21T09:10:45.4488370Z $L__tmp91: 2026-02-21T09:10:45.4488584Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4488671Z // begin inline asm 2026-02-21T09:10:45.4488953Z @%p130 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1190 + 0], {%r707, %r708, %r709, %r710, %r711, %r712, %r713, %r714, %r715, %r716, %r717, %r718, %r719, %r720, %r721, %r722}; 2026-02-21T09:10:45.4489017Z // end inline asm 2026-02-21T09:10:45.4489074Z // begin inline asm 2026-02-21T09:10:45.4489143Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4489205Z // end inline asm 2026-02-21T09:10:45.4489261Z bar.sync 0; 2026-02-21T09:10:45.4489314Z $L__tmp92: 2026-02-21T09:10:45.4489501Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4489570Z shl.b32 %r738, %r1867, 
3; 2026-02-21T09:10:45.4489630Z add.s32 %r739, %r179, %r738; 2026-02-21T09:10:45.4489688Z add.s32 %r773, %r739, 26640; 2026-02-21T09:10:45.4489753Z // begin inline asm 2026-02-21T09:10:45.4489804Z 2026-02-21T09:10:45.4489855Z { 2026-02-21T09:10:45.4489940Z .reg .pred complete; 2026-02-21T09:10:45.4490007Z waitLoop: 2026-02-21T09:10:45.4490125Z mbarrier.try_wait.parity.shared.b64 complete, [%r773], %r1866; 2026-02-21T09:10:45.4490191Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4490251Z } 2026-02-21T09:10:45.4490254Z 2026-02-21T09:10:45.4490312Z // end inline asm 2026-02-21T09:10:45.4490471Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4490540Z shl.b32 %r740, %r1867, 10; 2026-02-21T09:10:45.4490600Z add.s32 %r741, %r179, %r740; 2026-02-21T09:10:45.4490659Z add.s32 %r774, %r741, 24576; 2026-02-21T09:10:45.4490719Z add.s32 %r742, %r774, %r19; 2026-02-21T09:10:45.4490791Z ld.shared.b8 %rs185, [%r742]; 2026-02-21T09:10:45.4490856Z ld.shared.b8 %rs186, [%r742+512]; 2026-02-21T09:10:45.4490914Z add.s32 %r743, %r774, %r21; 2026-02-21T09:10:45.4490985Z ld.shared.b8 %rs187, [%r743+128]; 2026-02-21T09:10:45.4491049Z ld.shared.b8 %rs188, [%r743+640]; 2026-02-21T09:10:45.4491108Z add.s32 %r744, %r774, %r23; 2026-02-21T09:10:45.4491169Z ld.shared.b8 %rs189, [%r744+256]; 2026-02-21T09:10:45.4491236Z ld.shared.b8 %rs190, [%r744+768]; 2026-02-21T09:10:45.4491294Z add.s32 %r745, %r774, %r25; 2026-02-21T09:10:45.4491354Z ld.shared.b8 %rs191, [%r745+384]; 2026-02-21T09:10:45.4491422Z ld.shared.b8 %rs192, [%r745+896]; 2026-02-21T09:10:45.4491621Z .loc 1 67 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4491685Z shl.b16 %rs193, %rs185, 4; 2026-02-21T09:10:45.4491754Z shl.b16 %rs194, %rs187, 4; 2026-02-21T09:10:45.4491814Z shl.b16 %rs195, %rs189, 4; 2026-02-21T09:10:45.4491872Z shl.b16 %rs196, %rs191, 4; 2026-02-21T09:10:45.4491931Z shl.b16 %rs197, %rs186, 4; 2026-02-21T09:10:45.4491997Z 
shl.b16 %rs198, %rs188, 4; 2026-02-21T09:10:45.4492053Z shl.b16 %rs199, %rs190, 4; 2026-02-21T09:10:45.4492111Z shl.b16 %rs200, %rs192, 4; 2026-02-21T09:10:45.4492281Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4492353Z selp.b16 %rs201, %rs193, %rs185, %p331; 2026-02-21T09:10:45.4492413Z cvt.s16.s8 %rs202, %rs201; 2026-02-21T09:10:45.4492471Z shr.s16 %rs203, %rs202, 4; 2026-02-21T09:10:45.4492547Z selp.b16 %rs204, %rs194, %rs187, %p331; 2026-02-21T09:10:45.4492606Z cvt.s16.s8 %rs205, %rs204; 2026-02-21T09:10:45.4492663Z shr.s16 %rs206, %rs205, 4; 2026-02-21T09:10:45.4492737Z selp.b16 %rs207, %rs195, %rs189, %p331; 2026-02-21T09:10:45.4492823Z cvt.s16.s8 %rs208, %rs207; 2026-02-21T09:10:45.4492880Z shr.s16 %rs209, %rs208, 4; 2026-02-21T09:10:45.4492948Z selp.b16 %rs210, %rs196, %rs191, %p331; 2026-02-21T09:10:45.4493013Z cvt.s16.s8 %rs211, %rs210; 2026-02-21T09:10:45.4493071Z shr.s16 %rs212, %rs211, 4; 2026-02-21T09:10:45.4493135Z selp.b16 %rs213, %rs197, %rs186, %p331; 2026-02-21T09:10:45.4493201Z cvt.s16.s8 %rs214, %rs213; 2026-02-21T09:10:45.4493258Z shr.s16 %rs215, %rs214, 4; 2026-02-21T09:10:45.4493362Z selp.b16 %rs216, %rs198, %rs188, %p331; 2026-02-21T09:10:45.4493426Z cvt.s16.s8 %rs217, %rs216; 2026-02-21T09:10:45.4493483Z shr.s16 %rs218, %rs217, 4; 2026-02-21T09:10:45.4493547Z selp.b16 %rs219, %rs199, %rs190, %p331; 2026-02-21T09:10:45.4493605Z cvt.s16.s8 %rs220, %rs219; 2026-02-21T09:10:45.4493670Z shr.s16 %rs221, %rs220, 4; 2026-02-21T09:10:45.4493734Z selp.b16 %rs222, %rs200, %rs192, %p331; 2026-02-21T09:10:45.4493792Z cvt.s16.s8 %rs223, %rs222; 2026-02-21T09:10:45.4493855Z shr.s16 %rs224, %rs223, 4; 2026-02-21T09:10:45.4494065Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4494128Z cvt.rn.f32.s16 %r746, %rs203; 2026-02-21T09:10:45.4494189Z cvt.rn.f32.s16 %r747, %rs206; 2026-02-21T09:10:45.4494256Z cvt.rn.f32.s16 %r748, %rs209; 
2026-02-21T09:10:45.4494315Z cvt.rn.f32.s16 %r749, %rs212; 2026-02-21T09:10:45.4494374Z cvt.rn.f32.s16 %r750, %rs215; 2026-02-21T09:10:45.4494465Z cvt.rn.f32.s16 %r751, %rs218; 2026-02-21T09:10:45.4494525Z cvt.rn.f32.s16 %r752, %rs221; 2026-02-21T09:10:45.4494584Z cvt.rn.f32.s16 %r753, %rs224; 2026-02-21T09:10:45.4494651Z st.shared.b32 [%r27], %r746; 2026-02-21T09:10:45.4494715Z st.shared.b32 [%r28], %r747; 2026-02-21T09:10:45.4494775Z st.shared.b32 [%r29], %r748; 2026-02-21T09:10:45.4494836Z st.shared.b32 [%r30], %r749; 2026-02-21T09:10:45.4494902Z st.shared.b32 [%r31], %r750; 2026-02-21T09:10:45.4494961Z st.shared.b32 [%r32], %r751; 2026-02-21T09:10:45.4495020Z st.shared.b32 [%r33], %r752; 2026-02-21T09:10:45.4495088Z st.shared.b32 [%r34], %r753; 2026-02-21T09:10:45.4495255Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4495316Z shl.b32 %r754, %r1868, 3; 2026-02-21T09:10:45.4495374Z add.s32 %r755, %r179, %r754; 2026-02-21T09:10:45.4495439Z add.s32 %r1865, %r755, 26624; 2026-02-21T09:10:45.4495490Z $L__tmp93: 2026-02-21T09:10:45.4495699Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4495766Z // begin inline asm 2026-02-21T09:10:45.4495837Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4495892Z // end inline asm 2026-02-21T09:10:45.4495964Z bar.sync 0; 2026-02-21T09:10:45.4496026Z @%p55 bra $L__BB0_13; 2026-02-21T09:10:45.4496129Z // %bb.12: // in Loop: Header=BB0_11 Depth=2 2026-02-21T09:10:45.4496196Z elect.sync %r768|%p131, -1; 2026-02-21T09:10:45.4496265Z mov.b32 %r758, 135268624; 2026-02-21T09:10:45.4496325Z // begin inline asm 2026-02-21T09:10:45.4496493Z @%p131 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd3, %r758, %p130; 2026-02-21T09:10:45.4496558Z // end inline asm 2026-02-21T09:10:45.4496616Z // begin inline asm 2026-02-21T09:10:45.4496778Z @%p131 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 
], [ %r1644 + 8 ], %rd4, %r758, %p130; 2026-02-21T09:10:45.4496843Z // end inline asm 2026-02-21T09:10:45.4496903Z // begin inline asm 2026-02-21T09:10:45.4497061Z @%p131 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 16 ], %rd5, %r758, %p130; 2026-02-21T09:10:45.4497119Z // end inline asm 2026-02-21T09:10:45.4497186Z // begin inline asm 2026-02-21T09:10:45.4497342Z @%p131 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd6, %r758, %p130; 2026-02-21T09:10:45.4497399Z // end inline asm 2026-02-21T09:10:45.4497472Z cvt.u64.u32 %rd215, %r1865; 2026-02-21T09:10:45.4497556Z // begin inline asm 2026-02-21T09:10:45.4497694Z @%p131 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd215]; 2026-02-21T09:10:45.4497764Z // end inline asm 2026-02-21T09:10:45.4497827Z bra.uni $L__BB0_13; 2026-02-21T09:10:45.4497940Z $L__BB0_14: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4498035Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:45.4498104Z mov.b32 %r1875, 1; 2026-02-21T09:10:45.4498332Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4498412Z // begin inline asm 2026-02-21T09:10:45.4498475Z 2026-02-21T09:10:45.4498527Z { 2026-02-21T09:10:45.4498591Z .reg .pred complete; 2026-02-21T09:10:45.4498647Z waitLoop: 2026-02-21T09:10:45.4498779Z mbarrier.try_wait.parity.shared.b64 complete, [%r1865], %r1875; 2026-02-21T09:10:45.4498848Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4498900Z } 2026-02-21T09:10:45.4498905Z 2026-02-21T09:10:45.4498971Z // end inline asm 2026-02-21T09:10:45.4499026Z $L__tmp94: 2026-02-21T09:10:45.4499229Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4499304Z cp.async.wait_group 0; 2026-02-21T09:10:45.4499361Z bar.sync 0; 2026-02-21T09:10:45.4499418Z // begin inline asm 2026-02-21T09:10:45.4499510Z @%p273 mbarrier.inval.shared::cta.b64 [%r1310]; 2026-02-21T09:10:45.4499601Z 
// end inline asm 2026-02-21T09:10:45.4499660Z bar.sync 0; 2026-02-21T09:10:45.4499718Z // begin inline asm 2026-02-21T09:10:45.4499816Z @%p273 mbarrier.inval.shared::cta.b64 [%r1094]; 2026-02-21T09:10:45.4499874Z // end inline asm 2026-02-21T09:10:45.4499937Z add.s32 %r1872, %r179, 26624; 2026-02-21T09:10:45.4499996Z // begin inline asm 2026-02-21T09:10:45.4500088Z @%p273 mbarrier.inval.shared::cta.b64 [%r1872]; 2026-02-21T09:10:45.4500147Z // end inline asm 2026-02-21T09:10:45.4500204Z bar.sync 0; 2026-02-21T09:10:45.4500272Z // begin inline asm 2026-02-21T09:10:45.4500355Z @%p273 mbarrier.inval.shared::cta.b64 [%r1096]; 2026-02-21T09:10:45.4500413Z // end inline asm 2026-02-21T09:10:45.4500475Z $L__tmp95: 2026-02-21T09:10:45.4500696Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4500756Z // begin inline asm 2026-02-21T09:10:45.4501035Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r788, %r789, %r790, %r791, %r792, %r793, %r794, %r795, %r796, %r797, %r798, %r799, %r800, %r801, %r802, %r803}, [%r1439 + 0]; 2026-02-21T09:10:45.4501103Z // end inline asm 2026-02-21T09:10:45.4501161Z // begin inline asm 2026-02-21T09:10:45.4501444Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r805, %r806, %r807, %r808, %r809, %r810, %r811, %r812, %r813, %r814, %r815, %r816, %r817, %r818, %r819, %r820}, [%r1439 + 16]; 2026-02-21T09:10:45.4501511Z // end inline asm 2026-02-21T09:10:45.4501607Z // begin inline asm 2026-02-21T09:10:45.4501681Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:45.4501747Z // end inline asm 2026-02-21T09:10:45.4501813Z cvt.u64.u32 %rd226, %r788; 2026-02-21T09:10:45.4501876Z cvt.u64.u32 %rd227, %r789; 2026-02-21T09:10:45.4501940Z shl.b64 %rd228, %rd227, 32; 2026-02-21T09:10:45.4502011Z or.b64 %rd229, %rd226, %rd228; 2026-02-21T09:10:45.4502066Z $L__tmp96: 2026-02-21T09:10:45.4502244Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4502317Z 
mov.b64 {%r900, %r901}, %rd229; 2026-02-21T09:10:45.4502389Z cvt.rn.bf16x2.f32 %r902, %r901, %r900; 2026-02-21T09:10:45.4502443Z $L__tmp97: 2026-02-21T09:10:45.4502669Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4502732Z cvt.u64.u32 %rd230, %r790; 2026-02-21T09:10:45.4502792Z cvt.u64.u32 %rd231, %r791; 2026-02-21T09:10:45.4502854Z shl.b64 %rd232, %rd231, 32; 2026-02-21T09:10:45.4502929Z or.b64 %rd233, %rd230, %rd232; 2026-02-21T09:10:45.4503014Z $L__tmp98: 2026-02-21T09:10:45.4503183Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4503256Z mov.b64 {%r903, %r904}, %rd233; 2026-02-21T09:10:45.4503329Z cvt.rn.bf16x2.f32 %r905, %r904, %r903; 2026-02-21T09:10:45.4503384Z $L__tmp99: 2026-02-21T09:10:45.4503603Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4503713Z cvt.u64.u32 %rd234, %r792; 2026-02-21T09:10:45.4503774Z cvt.u64.u32 %rd235, %r793; 2026-02-21T09:10:45.4503835Z shl.b64 %rd236, %rd235, 32; 2026-02-21T09:10:45.4503905Z or.b64 %rd237, %rd234, %rd236; 2026-02-21T09:10:45.4503972Z $L__tmp100: 2026-02-21T09:10:45.4504135Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4504204Z mov.b64 {%r906, %r907}, %rd237; 2026-02-21T09:10:45.4504272Z cvt.rn.bf16x2.f32 %r908, %r907, %r906; 2026-02-21T09:10:45.4504325Z $L__tmp101: 2026-02-21T09:10:45.4504558Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4504626Z cvt.u64.u32 %rd238, %r794; 2026-02-21T09:10:45.4504684Z cvt.u64.u32 %rd239, %r795; 2026-02-21T09:10:45.4504742Z shl.b64 %rd240, %rd239, 32; 2026-02-21T09:10:45.4504836Z or.b64 %rd241, %rd238, %rd240; 2026-02-21T09:10:45.4504890Z $L__tmp102: 2026-02-21T09:10:45.4505054Z .loc 1 97 28 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4505121Z mov.b64 {%r909, %r910}, %rd241; 2026-02-21T09:10:45.4505187Z cvt.rn.bf16x2.f32 %r911, %r910, %r909; 2026-02-21T09:10:45.4505240Z $L__tmp103: 2026-02-21T09:10:45.4505448Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4505514Z cvt.u64.u32 %rd242, %r796; 2026-02-21T09:10:45.4505573Z cvt.u64.u32 %rd243, %r797; 2026-02-21T09:10:45.4505632Z shl.b64 %rd244, %rd243, 32; 2026-02-21T09:10:45.4505700Z or.b64 %rd245, %rd242, %rd244; 2026-02-21T09:10:45.4505752Z $L__tmp104: 2026-02-21T09:10:45.4505913Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4505980Z mov.b64 {%r912, %r913}, %rd245; 2026-02-21T09:10:45.4506047Z cvt.rn.bf16x2.f32 %r914, %r913, %r912; 2026-02-21T09:10:45.4506101Z $L__tmp105: 2026-02-21T09:10:45.4506303Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4506372Z cvt.u64.u32 %rd246, %r798; 2026-02-21T09:10:45.4506431Z cvt.u64.u32 %rd247, %r799; 2026-02-21T09:10:45.4506491Z shl.b64 %rd248, %rd247, 32; 2026-02-21T09:10:45.4506561Z or.b64 %rd249, %rd246, %rd248; 2026-02-21T09:10:45.4506616Z $L__tmp106: 2026-02-21T09:10:45.4506779Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4506850Z mov.b64 {%r915, %r916}, %rd249; 2026-02-21T09:10:45.4506917Z cvt.rn.bf16x2.f32 %r917, %r916, %r915; 2026-02-21T09:10:45.4506970Z $L__tmp107: 2026-02-21T09:10:45.4507173Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4507242Z cvt.u64.u32 %rd250, %r800; 2026-02-21T09:10:45.4507301Z cvt.u64.u32 %rd251, %r801; 2026-02-21T09:10:45.4507359Z shl.b64 %rd252, %rd251, 32; 2026-02-21T09:10:45.4507424Z or.b64 %rd253, %rd250, %rd252; 2026-02-21T09:10:45.4507476Z $L__tmp108: 
2026-02-21T09:10:45.4507636Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4507695Z mov.b64 {%r918, %r919}, %rd253; 2026-02-21T09:10:45.4507766Z cvt.rn.bf16x2.f32 %r920, %r919, %r918; 2026-02-21T09:10:45.4507819Z $L__tmp109: 2026-02-21T09:10:45.4508046Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4508113Z cvt.u64.u32 %rd254, %r802; 2026-02-21T09:10:45.4508171Z cvt.u64.u32 %rd255, %r803; 2026-02-21T09:10:45.4508229Z shl.b64 %rd256, %rd255, 32; 2026-02-21T09:10:45.4508295Z or.b64 %rd257, %rd254, %rd256; 2026-02-21T09:10:45.4508347Z $L__tmp110: 2026-02-21T09:10:45.4508510Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4508605Z mov.b64 {%r921, %r922}, %rd257; 2026-02-21T09:10:45.4508679Z cvt.rn.bf16x2.f32 %r923, %r922, %r921; 2026-02-21T09:10:45.4508731Z $L__tmp111: 2026-02-21T09:10:45.4508936Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4509003Z cvt.u64.u32 %rd258, %r805; 2026-02-21T09:10:45.4509061Z cvt.u64.u32 %rd259, %r806; 2026-02-21T09:10:45.4509120Z shl.b64 %rd260, %rd259, 32; 2026-02-21T09:10:45.4509187Z or.b64 %rd261, %rd258, %rd260; 2026-02-21T09:10:45.4509239Z $L__tmp112: 2026-02-21T09:10:45.4509425Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4509485Z mov.b64 {%r924, %r925}, %rd261; 2026-02-21T09:10:45.4509560Z cvt.rn.bf16x2.f32 %r926, %r925, %r924; 2026-02-21T09:10:45.4509614Z $L__tmp113: 2026-02-21T09:10:45.4509840Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4509909Z cvt.u64.u32 %rd262, %r807; 2026-02-21T09:10:45.4509967Z cvt.u64.u32 %rd263, %r808; 2026-02-21T09:10:45.4510025Z shl.b64 %rd264, %rd263, 32; 2026-02-21T09:10:45.4510092Z or.b64 %rd265, %rd262, %rd264; 
2026-02-21T09:10:45.4510145Z $L__tmp114: 2026-02-21T09:10:45.4510306Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4510368Z mov.b64 {%r927, %r928}, %rd265; 2026-02-21T09:10:45.4510442Z cvt.rn.bf16x2.f32 %r929, %r928, %r927; 2026-02-21T09:10:45.4510495Z $L__tmp115: 2026-02-21T09:10:45.4510695Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4510761Z cvt.u64.u32 %rd266, %r809; 2026-02-21T09:10:45.4510818Z cvt.u64.u32 %rd267, %r810; 2026-02-21T09:10:45.4510877Z shl.b64 %rd268, %rd267, 32; 2026-02-21T09:10:45.4510937Z or.b64 %rd269, %rd266, %rd268; 2026-02-21T09:10:45.4510994Z $L__tmp116: 2026-02-21T09:10:45.4511152Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4511210Z mov.b64 {%r930, %r931}, %rd269; 2026-02-21T09:10:45.4511281Z cvt.rn.bf16x2.f32 %r932, %r931, %r930; 2026-02-21T09:10:45.4511333Z $L__tmp117: 2026-02-21T09:10:45.4511580Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4511648Z cvt.u64.u32 %rd270, %r811; 2026-02-21T09:10:45.4511705Z cvt.u64.u32 %rd271, %r812; 2026-02-21T09:10:45.4511765Z shl.b64 %rd272, %rd271, 32; 2026-02-21T09:10:45.4511823Z or.b64 %rd273, %rd270, %rd272; 2026-02-21T09:10:45.4511880Z $L__tmp118: 2026-02-21T09:10:45.4512038Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4512098Z mov.b64 {%r933, %r934}, %rd273; 2026-02-21T09:10:45.4512172Z cvt.rn.bf16x2.f32 %r935, %r934, %r933; 2026-02-21T09:10:45.4512224Z $L__tmp119: 2026-02-21T09:10:45.4512424Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4512489Z cvt.u64.u32 %rd274, %r813; 2026-02-21T09:10:45.4512546Z cvt.u64.u32 %rd275, %r814; 2026-02-21T09:10:45.4512605Z shl.b64 %rd276, %rd275, 32; 
2026-02-21T09:10:45.4512664Z or.b64 %rd277, %rd274, %rd276; 2026-02-21T09:10:45.4512752Z $L__tmp120: 2026-02-21T09:10:45.4512914Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4512973Z mov.b64 {%r936, %r937}, %rd277; 2026-02-21T09:10:45.4513046Z cvt.rn.bf16x2.f32 %r938, %r937, %r936; 2026-02-21T09:10:45.4513099Z $L__tmp121: 2026-02-21T09:10:45.4513304Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4513419Z cvt.u64.u32 %rd278, %r815; 2026-02-21T09:10:45.4513476Z cvt.u64.u32 %rd279, %r816; 2026-02-21T09:10:45.4513535Z shl.b64 %rd280, %rd279, 32; 2026-02-21T09:10:45.4513594Z or.b64 %rd281, %rd278, %rd280; 2026-02-21T09:10:45.4513653Z $L__tmp122: 2026-02-21T09:10:45.4513811Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4513871Z mov.b64 {%r939, %r940}, %rd281; 2026-02-21T09:10:45.4513944Z cvt.rn.bf16x2.f32 %r941, %r940, %r939; 2026-02-21T09:10:45.4513998Z $L__tmp123: 2026-02-21T09:10:45.4514223Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4514283Z cvt.u64.u32 %rd282, %r817; 2026-02-21T09:10:45.4514350Z cvt.u64.u32 %rd283, %r818; 2026-02-21T09:10:45.4514409Z shl.b64 %rd284, %rd283, 32; 2026-02-21T09:10:45.4514467Z or.b64 %rd285, %rd282, %rd284; 2026-02-21T09:10:45.4514526Z $L__tmp124: 2026-02-21T09:10:45.4514711Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4514772Z mov.b64 {%r942, %r943}, %rd285; 2026-02-21T09:10:45.4514844Z cvt.rn.bf16x2.f32 %r944, %r943, %r942; 2026-02-21T09:10:45.4514898Z $L__tmp125: 2026-02-21T09:10:45.4515100Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4515159Z cvt.u64.u32 %rd286, %r819; 2026-02-21T09:10:45.4515229Z cvt.u64.u32 %rd287, %r820; 
2026-02-21T09:10:45.4515289Z shl.b64 %rd288, %rd287, 32; 2026-02-21T09:10:45.4515351Z or.b64 %rd289, %rd286, %rd288; 2026-02-21T09:10:45.4515412Z $L__tmp126: 2026-02-21T09:10:45.4515573Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4515634Z mov.b64 {%r945, %r946}, %rd289; 2026-02-21T09:10:45.4515705Z cvt.rn.bf16x2.f32 %r947, %r946, %r945; 2026-02-21T09:10:45.4515868Z .loc 1 98 43 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:98:43 2026-02-21T09:10:45.4515964Z st.shared.v4.b32 [%r35], {%r902, %r905, %r908, %r911}; 2026-02-21T09:10:45.4516057Z st.shared.v4.b32 [%r36], {%r914, %r917, %r920, %r923}; 2026-02-21T09:10:45.4516152Z st.shared.v4.b32 [%r37], {%r926, %r929, %r932, %r935}; 2026-02-21T09:10:45.4516235Z st.shared.v4.b32 [%r38], {%r938, %r941, %r944, %r947}; 2026-02-21T09:10:45.4516292Z // begin inline asm 2026-02-21T09:10:45.4516375Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4516432Z // end inline asm 2026-02-21T09:10:45.4516485Z bar.sync 0; 2026-02-21T09:10:45.4516558Z elect.sync %r948|%p167, -1; 2026-02-21T09:10:45.4516620Z and.pred %p152, %p1, %p167; 2026-02-21T09:10:45.4516678Z // begin inline asm 2026-02-21T09:10:45.4516858Z @%p152 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd507, {%r822, %r823}], [%r179]; 2026-02-21T09:10:45.4516921Z // end inline asm 2026-02-21T09:10:45.4516990Z cp.async.bulk.commit_group; 2026-02-21T09:10:45.4517062Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:45.4517123Z bar.sync 0; 2026-02-21T09:10:45.4517286Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4517346Z add.s32 %r949, %r1855, 2; 2026-02-21T09:10:45.4517517Z .loc 1 38 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:38:33 2026-02-21T09:10:45.4517577Z shr.u32 %r950, %r949, 5; 2026-02-21T09:10:45.4517637Z and.b32 %r951, %r950, 67108848; 2026-02-21T09:10:45.4517819Z .loc 1 39 39 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:39:39 2026-02-21T09:10:45.4517887Z sub.s32 %r952, 128, %r951; 2026-02-21T09:10:45.4518044Z .loc 1 39 52 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:39:52 2026-02-21T09:10:45.4518104Z min.s32 %r953, %r952, 16; 2026-02-21T09:10:45.4518274Z .loc 1 40 45 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:45 2026-02-21T09:10:45.4518355Z and.b32 %r954, %r949, 511; 2026-02-21T09:10:45.4518515Z .loc 1 41 51 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:41:51 2026-02-21T09:10:45.4518583Z div.s32 %r116, %r954, %r953; 2026-02-21T09:10:45.4518740Z .loc 1 40 64 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:64 2026-02-21T09:10:45.4518803Z mul.lo.s32 %r955, %r116, %r953; 2026-02-21T09:10:45.4518862Z sub.s32 %r956, %r954, %r955; 2026-02-21T09:10:45.4519050Z .loc 1 40 30 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:30 2026-02-21T09:10:45.4519109Z add.s32 %r957, %r956, %r951; 2026-02-21T09:10:45.4519266Z .loc 1 42 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:42:27 2026-02-21T09:10:45.4519334Z shl.b32 %r1131, %r957, 6; 2026-02-21T09:10:45.4519514Z .loc 1 43 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:43:27 2026-02-21T09:10:45.4519574Z shl.b32 %r1132, %r116, 7; 2026-02-21T09:10:45.4519736Z .loc 1 44 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:44:32 2026-02-21T09:10:45.4519795Z or.b32 %r958, %r1132, %r7; 2026-02-21T09:10:45.4519851Z or.b32 %r959, %r1132, %r8; 2026-02-21T09:10:45.4520017Z .loc 1 58 53 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:53 2026-02-21T09:10:45.4520074Z shl.b32 %r960, %r958, 10; 2026-02-21T09:10:45.4520132Z shl.b32 %r961, %r959, 10; 2026-02-21T09:10:45.4520193Z mov.pred %p172, -1; 2026-02-21T09:10:45.4520259Z mov.b32 %r1871, 0; 2026-02-21T09:10:45.4520314Z $L__tmp127: 2026-02-21T09:10:45.4520521Z .loc 2 291 36 // standard.py:291:36 @[ 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4520585Z // begin inline asm 2026-02-21T09:10:45.4520894Z @%p172 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1439 + 0], {%r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871}; 2026-02-21T09:10:45.4520952Z // end inline asm 2026-02-21T09:10:45.4521014Z // begin inline asm 2026-02-21T09:10:45.4521316Z @%p172 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1439 + 16], {%r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871, %r1871}; 2026-02-21T09:10:45.4521372Z // end inline asm 2026-02-21T09:10:45.4521429Z // begin inline asm 2026-02-21T09:10:45.4521507Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4521607Z // end inline asm 2026-02-21T09:10:45.4521667Z bar.sync 0; 2026-02-21T09:10:45.4521730Z $L__tmp128: 2026-02-21T09:10:45.4521899Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4521957Z // begin inline asm 2026-02-21T09:10:45.4522057Z @%p273 mbarrier.init.shared::cta.b64 [%r1872], 1; 2026-02-21T09:10:45.4522115Z // end inline asm 2026-02-21T09:10:45.4522171Z bar.sync 0; 2026-02-21T09:10:45.4522229Z // begin inline asm 2026-02-21T09:10:45.4522327Z @%p273 mbarrier.init.shared::cta.b64 [%r1096], 1; 2026-02-21T09:10:45.4522384Z // end inline asm 2026-02-21T09:10:45.4522441Z // begin inline asm 2026-02-21T09:10:45.4522531Z @%p273 mbarrier.init.shared::cta.b64 [%r1310], 1; 2026-02-21T09:10:45.4522588Z // end inline asm 2026-02-21T09:10:45.4522643Z bar.sync 0; 2026-02-21T09:10:45.4522701Z // begin inline asm 2026-02-21T09:10:45.4522819Z @%p273 mbarrier.init.shared::cta.b64 [%r1094], 1; 2026-02-21T09:10:45.4522874Z // end inline asm 2026-02-21T09:10:45.4523040Z .loc 1 58 60 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:60 2026-02-21T09:10:45.4523110Z or.b32 %r962, %r960, %r9; 
2026-02-21T09:10:45.4523171Z or.b32 %r963, %r961, %r9; 2026-02-21T09:10:45.4523332Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4523436Z mad.wide.s32 %rd220, %r962, 2, %rd52; 2026-02-21T09:10:45.4523502Z mad.wide.s32 %rd221, %r963, 2, %rd52; 2026-02-21T09:10:45.4523558Z mov.b32 %r998, 16; 2026-02-21T09:10:45.4523720Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4523784Z // begin inline asm 2026-02-21T09:10:45.4523906Z cp.async.cg.shared.global [ %r12 + 0 ], [ %rd220 + 0 ], 0x10, %r998; 2026-02-21T09:10:45.4523960Z // end inline asm 2026-02-21T09:10:45.4524025Z // begin inline asm 2026-02-21T09:10:45.4524166Z cp.async.cg.shared.global [ %r13 + 0 ], [ %rd221 + 0 ], 0x10, %r998; 2026-02-21T09:10:45.4524222Z // end inline asm 2026-02-21T09:10:45.4524298Z cp.async.commit_group; 2026-02-21T09:10:45.4524465Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4524522Z bar.sync 0; 2026-02-21T09:10:45.4524580Z // begin inline asm 2026-02-21T09:10:45.4524729Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1310], 1024; 2026-02-21T09:10:45.4524789Z // end inline asm 2026-02-21T09:10:45.4524946Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4525008Z bar.sync 0; 2026-02-21T09:10:45.4525073Z elect.sync %r964|%p168, -1; 2026-02-21T09:10:45.4525138Z and.pred %p160, %p1, %p168; 2026-02-21T09:10:45.4525193Z // begin inline asm 2026-02-21T09:10:45.4525448Z @%p160 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r219], [%rd497, {%r1131, %r1871}], [%r1310]; 2026-02-21T09:10:45.4525506Z // end inline asm 2026-02-21T09:10:45.4525665Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4525734Z cvt.s64.s32 %rd290, %r960; 2026-02-21T09:10:45.4525797Z or.b64 %rd291, %rd290, %rd572; 2026-02-21T09:10:45.4525856Z 
shl.b64 %rd292, %rd291, 1; 2026-02-21T09:10:45.4525927Z add.s64 %rd28, %rd52, %rd292; 2026-02-21T09:10:45.4525989Z add.s64 %rd223, %rd28, 64; 2026-02-21T09:10:45.4526048Z cvt.s64.s32 %rd293, %r961; 2026-02-21T09:10:45.4526117Z or.b64 %rd294, %rd293, %rd572; 2026-02-21T09:10:45.4526174Z shl.b64 %rd295, %rd294, 1; 2026-02-21T09:10:45.4526235Z add.s64 %rd29, %rd52, %rd295; 2026-02-21T09:10:45.4526295Z add.s64 %rd224, %rd29, 64; 2026-02-21T09:10:45.4526458Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4526516Z // begin inline asm 2026-02-21T09:10:45.4526629Z cp.async.cg.shared.global [ %r15 + 0 ], [ %rd223 + 0 ], 0x10, %r998; 2026-02-21T09:10:45.4526693Z // end inline asm 2026-02-21T09:10:45.4526750Z // begin inline asm 2026-02-21T09:10:45.4526863Z cp.async.cg.shared.global [ %r16 + 0 ], [ %rd224 + 0 ], 0x10, %r998; 2026-02-21T09:10:45.4526918Z // end inline asm 2026-02-21T09:10:45.4526991Z cp.async.commit_group; 2026-02-21T09:10:45.4527157Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4527212Z bar.sync 0; 2026-02-21T09:10:45.4527275Z // begin inline asm 2026-02-21T09:10:45.4527388Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1094], 1024; 2026-02-21T09:10:45.4527443Z // end inline asm 2026-02-21T09:10:45.4527605Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4527659Z bar.sync 0; 2026-02-21T09:10:45.4527722Z elect.sync %r965|%p169, -1; 2026-02-21T09:10:45.4527808Z and.pred %p162, %p1, %p169; 2026-02-21T09:10:45.4527871Z // begin inline asm 2026-02-21T09:10:45.4528114Z @%p162 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r301], [%rd497, {%r1131, %r998}], [%r1094]; 2026-02-21T09:10:45.4528170Z // end inline asm 2026-02-21T09:10:45.4528337Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4528402Z cp.async.wait_group 1; 
2026-02-21T09:10:45.4528474Z bar.sync 0; 2026-02-21T09:10:45.4528576Z ld.shared.v4.b32 {%r966, %r967, %r968, %r969}, [%r18]; 2026-02-21T09:10:45.4528640Z mov.b32 {%rs225, %rs226}, %r969; 2026-02-21T09:10:45.4528702Z mov.b32 {%rs227, %rs228}, %r968; 2026-02-21T09:10:45.4528762Z mov.b32 {%rs229, %rs230}, %r967; 2026-02-21T09:10:45.4528828Z mov.b32 {%rs231, %rs232}, %r966; 2026-02-21T09:10:45.4528925Z ld.shared.v4.b32 {%r970, %r971, %r972, %r973}, [%r18+16]; 2026-02-21T09:10:45.4528985Z mov.b32 {%rs233, %rs234}, %r973; 2026-02-21T09:10:45.4529052Z mov.b32 {%rs235, %rs236}, %r972; 2026-02-21T09:10:45.4529133Z mov.b32 {%rs237, %rs238}, %r971; 2026-02-21T09:10:45.4529193Z mov.b32 {%rs239, %rs240}, %r970; 2026-02-21T09:10:45.4529362Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 2026-02-21T09:10:45.4529425Z cvt.f32.bf16 %r882, %rs231; 2026-02-21T09:10:45.4529484Z cvt.f32.bf16 %r883, %rs232; 2026-02-21T09:10:45.4529566Z cvt.f32.bf16 %r884, %rs229; 2026-02-21T09:10:45.4529635Z cvt.f32.bf16 %r885, %rs230; 2026-02-21T09:10:45.4529694Z cvt.f32.bf16 %r886, %rs227; 2026-02-21T09:10:45.4529753Z cvt.f32.bf16 %r887, %rs228; 2026-02-21T09:10:45.4529817Z cvt.f32.bf16 %r888, %rs225; 2026-02-21T09:10:45.4529874Z cvt.f32.bf16 %r889, %rs226; 2026-02-21T09:10:45.4529931Z cvt.f32.bf16 %r890, %rs239; 2026-02-21T09:10:45.4529989Z cvt.f32.bf16 %r891, %rs240; 2026-02-21T09:10:45.4530053Z cvt.f32.bf16 %r892, %rs237; 2026-02-21T09:10:45.4530111Z cvt.f32.bf16 %r893, %rs238; 2026-02-21T09:10:45.4530170Z cvt.f32.bf16 %r894, %rs235; 2026-02-21T09:10:45.4530233Z cvt.f32.bf16 %r895, %rs236; 2026-02-21T09:10:45.4530291Z cvt.f32.bf16 %r896, %rs233; 2026-02-21T09:10:45.4530350Z cvt.f32.bf16 %r897, %rs234; 2026-02-21T09:10:45.4530403Z $L__tmp129: 2026-02-21T09:10:45.4530623Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4530681Z // begin inline asm 2026-02-21T09:10:45.4530959Z @%p172 
tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1190 + 0], {%r882, %r883, %r884, %r885, %r886, %r887, %r888, %r889, %r890, %r891, %r892, %r893, %r894, %r895, %r896, %r897}; 2026-02-21T09:10:45.4531024Z // end inline asm 2026-02-21T09:10:45.4531080Z // begin inline asm 2026-02-21T09:10:45.4531151Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4531212Z // end inline asm 2026-02-21T09:10:45.4531265Z bar.sync 0; 2026-02-21T09:10:45.4531319Z $L__tmp130: 2026-02-21T09:10:45.4531487Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4531586Z // begin inline asm 2026-02-21T09:10:45.4531640Z 2026-02-21T09:10:45.4531692Z { 2026-02-21T09:10:45.4531762Z .reg .pred complete; 2026-02-21T09:10:45.4531817Z waitLoop: 2026-02-21T09:10:45.4531939Z mbarrier.try_wait.parity.shared.b64 complete, [%r1310], %r1871; 2026-02-21T09:10:45.4532007Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4532067Z } 2026-02-21T09:10:45.4532072Z 2026-02-21T09:10:45.4532128Z // end inline asm 2026-02-21T09:10:45.4532291Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4532364Z ld.shared.b8 %rs241, [%r20]; 2026-02-21T09:10:45.4532429Z ld.shared.b8 %rs242, [%r20+512]; 2026-02-21T09:10:45.4532493Z ld.shared.b8 %rs243, [%r22+128]; 2026-02-21T09:10:45.4532563Z ld.shared.b8 %rs244, [%r22+640]; 2026-02-21T09:10:45.4532626Z ld.shared.b8 %rs245, [%r24+256]; 2026-02-21T09:10:45.4532738Z ld.shared.b8 %rs246, [%r24+768]; 2026-02-21T09:10:45.4532800Z ld.shared.b8 %rs247, [%r26+384]; 2026-02-21T09:10:45.4532872Z ld.shared.b8 %rs248, [%r26+896]; 2026-02-21T09:10:45.4533040Z .loc 1 67 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4533100Z shl.b16 %rs249, %rs241, 4; 2026-02-21T09:10:45.4533168Z shl.b16 %rs250, %rs243, 4; 2026-02-21T09:10:45.4533228Z shl.b16 %rs251, %rs245, 4; 2026-02-21T09:10:45.4533314Z shl.b16 %rs252, %rs247, 4; 2026-02-21T09:10:45.4533378Z shl.b16 %rs253, %rs242, 
4; 2026-02-21T09:10:45.4533434Z shl.b16 %rs254, %rs244, 4; 2026-02-21T09:10:45.4533491Z shl.b16 %rs255, %rs246, 4; 2026-02-21T09:10:45.4533546Z shl.b16 %rs256, %rs248, 4; 2026-02-21T09:10:45.4533716Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4533786Z selp.b16 %rs257, %rs249, %rs241, %p331; 2026-02-21T09:10:45.4533845Z cvt.s16.s8 %rs258, %rs257; 2026-02-21T09:10:45.4533916Z shr.s16 %rs259, %rs258, 4; 2026-02-21T09:10:45.4534010Z selp.b16 %rs260, %rs250, %rs243, %p331; 2026-02-21T09:10:45.4534071Z cvt.s16.s8 %rs261, %rs260; 2026-02-21T09:10:45.4534129Z shr.s16 %rs262, %rs261, 4; 2026-02-21T09:10:45.4534202Z selp.b16 %rs263, %rs251, %rs245, %p331; 2026-02-21T09:10:45.4534261Z cvt.s16.s8 %rs264, %rs263; 2026-02-21T09:10:45.4534319Z shr.s16 %rs265, %rs264, 4; 2026-02-21T09:10:45.4534420Z selp.b16 %rs266, %rs252, %rs247, %p331; 2026-02-21T09:10:45.4534480Z cvt.s16.s8 %rs267, %rs266; 2026-02-21T09:10:45.4534538Z shr.s16 %rs268, %rs267, 4; 2026-02-21T09:10:45.4534611Z selp.b16 %rs269, %rs253, %rs242, %p331; 2026-02-21T09:10:45.4534670Z cvt.s16.s8 %rs270, %rs269; 2026-02-21T09:10:45.4534727Z shr.s16 %rs271, %rs270, 4; 2026-02-21T09:10:45.4534792Z selp.b16 %rs272, %rs254, %rs244, %p331; 2026-02-21T09:10:45.4534857Z cvt.s16.s8 %rs273, %rs272; 2026-02-21T09:10:45.4534915Z shr.s16 %rs274, %rs273, 4; 2026-02-21T09:10:45.4534983Z selp.b16 %rs275, %rs255, %rs246, %p331; 2026-02-21T09:10:45.4535049Z cvt.s16.s8 %rs276, %rs275; 2026-02-21T09:10:45.4535107Z shr.s16 %rs277, %rs276, 4; 2026-02-21T09:10:45.4535172Z selp.b16 %rs278, %rs256, %rs248, %p331; 2026-02-21T09:10:45.4535230Z cvt.s16.s8 %rs279, %rs278; 2026-02-21T09:10:45.4535296Z shr.s16 %rs280, %rs279, 4; 2026-02-21T09:10:45.4535459Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4535523Z cvt.rn.f32.s16 %r974, %rs259; 2026-02-21T09:10:45.4535591Z cvt.rn.f32.s16 %r975, %rs262; 2026-02-21T09:10:45.4535651Z cvt.rn.f32.s16 
%r976, %rs265; 2026-02-21T09:10:45.4535709Z cvt.rn.f32.s16 %r977, %rs268; 2026-02-21T09:10:45.4535775Z cvt.rn.f32.s16 %r978, %rs271; 2026-02-21T09:10:45.4535833Z cvt.rn.f32.s16 %r979, %rs274; 2026-02-21T09:10:45.4535891Z cvt.rn.f32.s16 %r980, %rs277; 2026-02-21T09:10:45.4535948Z cvt.rn.f32.s16 %r981, %rs280; 2026-02-21T09:10:45.4536018Z st.shared.b32 [%r27], %r974; 2026-02-21T09:10:45.4536081Z st.shared.b32 [%r28], %r975; 2026-02-21T09:10:45.4536143Z st.shared.b32 [%r29], %r976; 2026-02-21T09:10:45.4536210Z st.shared.b32 [%r30], %r977; 2026-02-21T09:10:45.4536270Z st.shared.b32 [%r31], %r978; 2026-02-21T09:10:45.4536329Z st.shared.b32 [%r32], %r979; 2026-02-21T09:10:45.4536387Z st.shared.b32 [%r33], %r980; 2026-02-21T09:10:45.4536453Z st.shared.b32 [%r34], %r981; 2026-02-21T09:10:45.4536508Z $L__tmp131: 2026-02-21T09:10:45.4536719Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4536784Z // begin inline asm 2026-02-21T09:10:45.4536853Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4536908Z // end inline asm 2026-02-21T09:10:45.4536960Z bar.sync 0; 2026-02-21T09:10:45.4537026Z @%p55 bra $L__BB0_16; 2026-02-21T09:10:45.4537124Z // %bb.15: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4537213Z elect.sync %r994|%p171, -1; 2026-02-21T09:10:45.4537280Z mov.b32 %r984, 135268624; 2026-02-21T09:10:45.4537338Z mov.pred %p170, 0; 2026-02-21T09:10:45.4537395Z // begin inline asm 2026-02-21T09:10:45.4537559Z @%p171 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd3, %r984, %p170; 2026-02-21T09:10:45.4537614Z // end inline asm 2026-02-21T09:10:45.4537669Z // begin inline asm 2026-02-21T09:10:45.4537821Z @%p171 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 8 ], %rd4, %r984, %p172; 2026-02-21T09:10:45.4537910Z // end inline asm 2026-02-21T09:10:45.4537965Z // begin inline asm 2026-02-21T09:10:45.4538115Z @%p171 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 
0 ], [ %r1644 + 16 ], %rd5, %r984, %p172; 2026-02-21T09:10:45.4538178Z // end inline asm 2026-02-21T09:10:45.4538232Z // begin inline asm 2026-02-21T09:10:45.4538379Z @%p171 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd6, %r984, %p172; 2026-02-21T09:10:45.4538440Z // end inline asm 2026-02-21T09:10:45.4538500Z add.s32 %r996, %r179, 26624; 2026-02-21T09:10:45.4538560Z cvt.u64.u32 %rd300, %r996; 2026-02-21T09:10:45.4538637Z // begin inline asm 2026-02-21T09:10:45.4538775Z @%p171 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd300]; 2026-02-21T09:10:45.4538831Z // end inline asm 2026-02-21T09:10:45.4538885Z $L__tmp132: 2026-02-21T09:10:45.4538994Z $L__BB0_16: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4539187Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4539250Z add.s64 %rd301, %rd28, 128; 2026-02-21T09:10:45.4539319Z add.s64 %rd302, %rd29, 128; 2026-02-21T09:10:45.4539480Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4539536Z // begin inline asm 2026-02-21T09:10:45.4539657Z cp.async.cg.shared.global [ %r12 + 0 ], [ %rd301 + 0 ], 0x10, %r998; 2026-02-21T09:10:45.4539714Z // end inline asm 2026-02-21T09:10:45.4539771Z // begin inline asm 2026-02-21T09:10:45.4539884Z cp.async.cg.shared.global [ %r13 + 0 ], [ %rd302 + 0 ], 0x10, %r998; 2026-02-21T09:10:45.4539948Z // end inline asm 2026-02-21T09:10:45.4540025Z cp.async.commit_group; 2026-02-21T09:10:45.4540198Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4540267Z // begin inline asm 2026-02-21T09:10:45.4540388Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1310], 1024; 2026-02-21T09:10:45.4540448Z // end inline asm 2026-02-21T09:10:45.4540620Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4540689Z bar.sync 0; 2026-02-21T09:10:45.4540757Z elect.sync 
%r1010|%p182, -1; 2026-02-21T09:10:45.4540826Z and.pred %p180, %p1, %p182; 2026-02-21T09:10:45.4540895Z mov.b32 %r1004, 32; 2026-02-21T09:10:45.4540954Z // begin inline asm 2026-02-21T09:10:45.4541215Z @%p180 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r219], [%rd497, {%r1131, %r1004}], [%r1310]; 2026-02-21T09:10:45.4541283Z // end inline asm 2026-02-21T09:10:45.4541457Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4541521Z shl.b32 %r1011, %r116, 17; 2026-02-21T09:10:45.4541629Z or.b32 %r1012, %r1854, %r1011; 2026-02-21T09:10:45.4541709Z mad.wide.s32 %rd577, %r1012, 2, %rd7; 2026-02-21T09:10:45.4541773Z or.b32 %r1870, %r40, %r1011; 2026-02-21T09:10:45.4541832Z mov.b64 %rd578, 0; 2026-02-21T09:10:45.4541902Z mov.b32 %r1873, %r1871; 2026-02-21T09:10:45.4541964Z mov.b32 %r1874, %r1871; 2026-02-21T09:10:45.4542023Z mov.b32 %r1876, %r1871; 2026-02-21T09:10:45.4542092Z bra.uni $L__BB0_17; 2026-02-21T09:10:45.4542202Z $L__BB0_19: // in Loop: Header=BB0_17 Depth=2 2026-02-21T09:10:45.4542375Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4542475Z setp.lt.u64 %p198, %rd578, 464; 2026-02-21T09:10:45.4542540Z $L__tmp133: 2026-02-21T09:10:45.4542761Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4542822Z add.s32 %r1087, %r1875, 1; 2026-02-21T09:10:45.4542896Z setp.gt.s32 %p201, %r1087, 1; 2026-02-21T09:10:45.4542966Z selp.b32 %r1875, 0, %r1087, %p201; 2026-02-21T09:10:45.4543061Z selp.b32 %r1088, 1, 0, %p201; 2026-02-21T09:10:45.4543136Z xor.b32 %r134, %r1876, %r1088; 2026-02-21T09:10:45.4543191Z $L__tmp134: 2026-02-21T09:10:45.4543358Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4543433Z mad.wide.s32 %rd311, %r1870, 2, %rd52; 2026-02-21T09:10:45.4543610Z .loc 1 58 80 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4543673Z add.s32 %r1078, %r128, %r11; 2026-02-21T09:10:45.4543739Z selp.b32 %r1079, 16, 0, %p198; 2026-02-21T09:10:45.4543805Z // begin inline asm 2026-02-21T09:10:45.4543956Z cp.async.cg.shared.global [ %r1078 + 0 ], [ %rd577 + 0 ], 0x10, %r1079; 2026-02-21T09:10:45.4544014Z // end inline asm 2026-02-21T09:10:45.4544083Z add.s32 %r1080, %r1078, 4096; 2026-02-21T09:10:45.4544143Z // begin inline asm 2026-02-21T09:10:45.4544291Z cp.async.cg.shared.global [ %r1080 + 0 ], [ %rd311 + 0 ], 0x10, %r1079; 2026-02-21T09:10:45.4544352Z // end inline asm 2026-02-21T09:10:45.4544427Z cp.async.commit_group; 2026-02-21T09:10:45.4544598Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4544665Z and.pred %p196, %p273, %p198; 2026-02-21T09:10:45.4544733Z // begin inline asm 2026-02-21T09:10:45.4544853Z @%p196 mbarrier.arrive.expect_tx.shared.b64 _, [%r1082], 1024; 2026-02-21T09:10:45.4544910Z // end inline asm 2026-02-21T09:10:45.4545076Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4545141Z bar.sync 0; 2026-02-21T09:10:45.4545209Z elect.sync %r1089|%p202, -1; 2026-02-21T09:10:45.4545275Z and.pred %p203, %p198, %p202; 2026-02-21T09:10:45.4545349Z and.pred %p197, %p1, %p203; 2026-02-21T09:10:45.4545411Z cvt.u32.u64 %r1090, %rd578; 2026-02-21T09:10:45.4545472Z add.s32 %r1085, %r1090, 48; 2026-02-21T09:10:45.4545538Z // begin inline asm 2026-02-21T09:10:45.4545804Z @%p197 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1083], [%rd497, {%r1131, %r1085}], [%r1082]; 2026-02-21T09:10:45.4545862Z // end inline asm 2026-02-21T09:10:45.4546035Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4546103Z add.s64 %rd577, %rd577, 64; 2026-02-21T09:10:45.4546161Z add.s32 %r1870, %r1870, 32; 2026-02-21T09:10:45.4546228Z setp.lt.u64 
%p204, %rd578, 480; 2026-02-21T09:10:45.4546297Z add.s64 %rd578, %rd578, 16; 2026-02-21T09:10:45.4546358Z mov.b32 %r1871, %r1876; 2026-02-21T09:10:45.4546421Z mov.b32 %r1876, %r134; 2026-02-21T09:10:45.4546491Z @%p204 bra $L__BB0_17; 2026-02-21T09:10:45.4546550Z bra.uni $L__BB0_20; 2026-02-21T09:10:45.4546656Z $L__BB0_17: // Parent Loop BB0_2 Depth=1 2026-02-21T09:10:45.4546756Z // => This Inner Loop Header: Depth=2 2026-02-21T09:10:45.4546826Z add.s32 %r1034, %r1874, 1; 2026-02-21T09:10:45.4546892Z setp.gt.s32 %p186, %r1034, 1; 2026-02-21T09:10:45.4546959Z selp.b32 %r1874, 0, %r1034, %p186; 2026-02-21T09:10:45.4547134Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4547200Z cp.async.wait_group 1; 2026-02-21T09:10:45.4547257Z bar.sync 0; 2026-02-21T09:10:45.4547318Z shl.b32 %r1035, %r1874, 13; 2026-02-21T09:10:45.4547394Z add.s32 %r128, %r179, %r1035; 2026-02-21T09:10:45.4547453Z add.s32 %r1037, %r128, %r17; 2026-02-21T09:10:45.4547577Z ld.shared.v4.b32 {%r1038, %r1039, %r1040, %r1041}, [%r1037]; 2026-02-21T09:10:45.4547650Z mov.b32 {%rs281, %rs282}, %r1041; 2026-02-21T09:10:45.4547712Z mov.b32 {%rs283, %rs284}, %r1040; 2026-02-21T09:10:45.4547772Z mov.b32 {%rs285, %rs286}, %r1039; 2026-02-21T09:10:45.4547837Z mov.b32 {%rs287, %rs288}, %r1038; 2026-02-21T09:10:45.4547943Z ld.shared.v4.b32 {%r1042, %r1043, %r1044, %r1045}, [%r1037+16]; 2026-02-21T09:10:45.4548033Z mov.b32 {%rs289, %rs290}, %r1045; 2026-02-21T09:10:45.4548091Z mov.b32 {%rs291, %rs292}, %r1044; 2026-02-21T09:10:45.4548156Z mov.b32 {%rs293, %rs294}, %r1043; 2026-02-21T09:10:45.4548216Z mov.b32 {%rs295, %rs296}, %r1042; 2026-02-21T09:10:45.4548379Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 2026-02-21T09:10:45.4548449Z cvt.f32.bf16 %r1016, %rs287; 2026-02-21T09:10:45.4548508Z cvt.f32.bf16 %r1017, %rs288; 2026-02-21T09:10:45.4548567Z cvt.f32.bf16 %r1018, %rs285; 2026-02-21T09:10:45.4548634Z cvt.f32.bf16 %r1019, %rs286; 
2026-02-21T09:10:45.4548715Z cvt.f32.bf16 %r1020, %rs283; 2026-02-21T09:10:45.4548775Z cvt.f32.bf16 %r1021, %rs284; 2026-02-21T09:10:45.4548832Z cvt.f32.bf16 %r1022, %rs281; 2026-02-21T09:10:45.4548900Z cvt.f32.bf16 %r1023, %rs282; 2026-02-21T09:10:45.4548959Z cvt.f32.bf16 %r1024, %rs295; 2026-02-21T09:10:45.4549017Z cvt.f32.bf16 %r1025, %rs296; 2026-02-21T09:10:45.4549118Z cvt.f32.bf16 %r1026, %rs293; 2026-02-21T09:10:45.4549180Z cvt.f32.bf16 %r1027, %rs294; 2026-02-21T09:10:45.4549239Z cvt.f32.bf16 %r1028, %rs291; 2026-02-21T09:10:45.4549299Z cvt.f32.bf16 %r1029, %rs292; 2026-02-21T09:10:45.4549366Z cvt.f32.bf16 %r1030, %rs289;
2026-02-21T09:10:45.4549823Z [232s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:10:45.4550825Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 3], range_unroll_factors=[4, 1], range_warp_specializes=[False, None]), static_shapes=True)
2026-02-21T09:10:45.4550953Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error
2026-02-21T09:10:45.4551013Z `ptxas` stderr:
2026-02-21T09:10:45.4551386Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 326 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher.
2026-02-21T09:10:45.4551482Z ptxas fatal : Ptx assembly aborted due to errors
2026-02-21T09:10:45.4551487Z
2026-02-21T09:10:45.4551920Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp3r3a8lyp.ptx -o /tmp/tmp3r3a8lyp.ptx.o
2026-02-21T09:10:45.4551928Z
2026-02-21T09:10:45.4552063Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:10:45.4552124Z cvt.f32.bf16 %r1031, %rs290; 2026-02-21T09:10:45.4552178Z $L__tmp135: 2026-02-21T09:10:45.4552401Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4552459Z // begin inline asm 2026-02-21T09:10:45.4552512Z 2026-02-21T09:10:45.4552565Z { 2026-02-21T09:10:45.4552634Z .reg .pred complete; 2026-02-21T09:10:45.4552690Z waitLoop: 2026-02-21T09:10:45.4552812Z mbarrier.try_wait.parity.shared.b64 complete, [%r1872], %r1871; 2026-02-21T09:10:45.4552886Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4552938Z } 2026-02-21T09:10:45.4552941Z 2026-02-21T09:10:45.4552997Z // end inline asm 2026-02-21T09:10:45.4553050Z $L__tmp136: 2026-02-21T09:10:45.4553232Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4553325Z selp.b32 %r1046, 1, 0, %p186; 2026-02-21T09:10:45.4553388Z xor.b32 %r1873, %r1873, %r1046; 2026-02-21T09:10:45.4553459Z mov.pred %p187, -1; 2026-02-21T09:10:45.4553513Z $L__tmp137: 2026-02-21T09:10:45.4553723Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4553788Z // begin inline asm 2026-02-21T09:10:45.4554099Z @%p187 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1190 + 0], {%r1016, %r1017, %r1018, %r1019, %r1020, %r1021, %r1022, %r1023, %r1024, %r1025, %r1026, %r1027, %r1028, %r1029, %r1030, %r1031}; 2026-02-21T09:10:45.4554183Z // end inline asm 2026-02-21T09:10:45.4554241Z // begin inline asm
2026-02-21T09:10:45.4554317Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4554371Z // end inline asm 2026-02-21T09:10:45.4554424Z bar.sync 0; 2026-02-21T09:10:45.4554485Z $L__tmp138: 2026-02-21T09:10:45.4554650Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4554711Z shl.b32 %r1047, %r1874, 3; 2026-02-21T09:10:45.4554810Z add.s32 %r1048, %r179, %r1047; 2026-02-21T09:10:45.4554873Z add.s32 %r1082, %r1048, 26640; 2026-02-21T09:10:45.4554930Z // begin inline asm 2026-02-21T09:10:45.4554981Z 2026-02-21T09:10:45.4555040Z { 2026-02-21T09:10:45.4555102Z .reg .pred complete; 2026-02-21T09:10:45.4555158Z waitLoop: 2026-02-21T09:10:45.4555308Z mbarrier.try_wait.parity.shared.b64 complete, [%r1082], %r1873; 2026-02-21T09:10:45.4555375Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4555425Z } 2026-02-21T09:10:45.4555428Z 2026-02-21T09:10:45.4555483Z // end inline asm 2026-02-21T09:10:45.4555650Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4555708Z shl.b32 %r1049, %r1874, 10; 2026-02-21T09:10:45.4555769Z add.s32 %r1050, %r179, %r1049; 2026-02-21T09:10:45.4555833Z add.s32 %r1083, %r1050, 24576; 2026-02-21T09:10:45.4555894Z add.s32 %r1051, %r1083, %r19; 2026-02-21T09:10:45.4555956Z ld.shared.b8 %rs297, [%r1051]; 2026-02-21T09:10:45.4556027Z ld.shared.b8 %rs298, [%r1051+512]; 2026-02-21T09:10:45.4556085Z add.s32 %r1052, %r1083, %r21; 2026-02-21T09:10:45.4556148Z ld.shared.b8 %rs299, [%r1052+128]; 2026-02-21T09:10:45.4556209Z ld.shared.b8 %rs300, [%r1052+640]; 2026-02-21T09:10:45.4556274Z add.s32 %r1053, %r1083, %r23; 2026-02-21T09:10:45.4556334Z ld.shared.b8 %rs301, [%r1053+256]; 2026-02-21T09:10:45.4556395Z ld.shared.b8 %rs302, [%r1053+768]; 2026-02-21T09:10:45.4556459Z add.s32 %r1054, %r1083, %r25; 2026-02-21T09:10:45.4556519Z ld.shared.b8 %rs303, [%r1054+384]; 2026-02-21T09:10:45.4556578Z ld.shared.b8 %rs304, [%r1054+896]; 2026-02-21T09:10:45.4556740Z .loc 1 67 28 
// c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4556807Z shl.b16 %rs305, %rs297, 4; 2026-02-21T09:10:45.4556866Z shl.b16 %rs306, %rs299, 4; 2026-02-21T09:10:45.4556924Z shl.b16 %rs307, %rs301, 4; 2026-02-21T09:10:45.4556988Z shl.b16 %rs308, %rs303, 4; 2026-02-21T09:10:45.4557047Z shl.b16 %rs309, %rs298, 4; 2026-02-21T09:10:45.4557104Z shl.b16 %rs310, %rs300, 4; 2026-02-21T09:10:45.4557167Z shl.b16 %rs311, %rs302, 4; 2026-02-21T09:10:45.4557223Z shl.b16 %rs312, %rs304, 4; 2026-02-21T09:10:45.4557381Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4557453Z selp.b16 %rs313, %rs305, %rs297, %p331; 2026-02-21T09:10:45.4557521Z cvt.s16.s8 %rs314, %rs313; 2026-02-21T09:10:45.4557578Z shr.s16 %rs315, %rs314, 4; 2026-02-21T09:10:45.4557647Z selp.b16 %rs316, %rs306, %rs299, %p331; 2026-02-21T09:10:45.4557712Z cvt.s16.s8 %rs317, %rs316; 2026-02-21T09:10:45.4557770Z shr.s16 %rs318, %rs317, 4; 2026-02-21T09:10:45.4557837Z selp.b16 %rs319, %rs307, %rs301, %p331; 2026-02-21T09:10:45.4557895Z cvt.s16.s8 %rs320, %rs319; 2026-02-21T09:10:45.4557961Z shr.s16 %rs321, %rs320, 4; 2026-02-21T09:10:45.4558049Z selp.b16 %rs322, %rs308, %rs303, %p331; 2026-02-21T09:10:45.4558107Z cvt.s16.s8 %rs323, %rs322; 2026-02-21T09:10:45.4558173Z shr.s16 %rs324, %rs323, 4; 2026-02-21T09:10:45.4558238Z selp.b16 %rs325, %rs309, %rs298, %p331; 2026-02-21T09:10:45.4558296Z cvt.s16.s8 %rs326, %rs325; 2026-02-21T09:10:45.4558353Z shr.s16 %rs327, %rs326, 4; 2026-02-21T09:10:45.4558430Z selp.b16 %rs328, %rs310, %rs300, %p331; 2026-02-21T09:10:45.4558489Z cvt.s16.s8 %rs329, %rs328; 2026-02-21T09:10:45.4558571Z shr.s16 %rs330, %rs329, 4; 2026-02-21T09:10:45.4558646Z selp.b16 %rs331, %rs311, %rs302, %p331; 2026-02-21T09:10:45.4558706Z cvt.s16.s8 %rs332, %rs331; 2026-02-21T09:10:45.4558763Z shr.s16 %rs333, %rs332, 4; 2026-02-21T09:10:45.4558834Z selp.b16 %rs334, %rs312, %rs304, %p331; 2026-02-21T09:10:45.4558891Z cvt.s16.s8 
%rs335, %rs334; 2026-02-21T09:10:45.4558948Z shr.s16 %rs336, %rs335, 4; 2026-02-21T09:10:45.4559112Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4559185Z cvt.rn.f32.s16 %r1055, %rs315; 2026-02-21T09:10:45.4559267Z cvt.rn.f32.s16 %r1056, %rs318; 2026-02-21T09:10:45.4559327Z cvt.rn.f32.s16 %r1057, %rs321; 2026-02-21T09:10:45.4559395Z cvt.rn.f32.s16 %r1058, %rs324; 2026-02-21T09:10:45.4559453Z cvt.rn.f32.s16 %r1059, %rs327; 2026-02-21T09:10:45.4559512Z cvt.rn.f32.s16 %r1060, %rs330; 2026-02-21T09:10:45.4559593Z cvt.rn.f32.s16 %r1061, %rs333; 2026-02-21T09:10:45.4559661Z cvt.rn.f32.s16 %r1062, %rs336; 2026-02-21T09:10:45.4559722Z st.shared.b32 [%r27], %r1055; 2026-02-21T09:10:45.4559783Z st.shared.b32 [%r28], %r1056; 2026-02-21T09:10:45.4559851Z st.shared.b32 [%r29], %r1057; 2026-02-21T09:10:45.4559910Z st.shared.b32 [%r30], %r1058; 2026-02-21T09:10:45.4559969Z st.shared.b32 [%r31], %r1059; 2026-02-21T09:10:45.4560036Z st.shared.b32 [%r32], %r1060; 2026-02-21T09:10:45.4560095Z st.shared.b32 [%r33], %r1061; 2026-02-21T09:10:45.4560154Z st.shared.b32 [%r34], %r1062; 2026-02-21T09:10:45.4560322Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4560389Z shl.b32 %r1063, %r1875, 3; 2026-02-21T09:10:45.4560447Z add.s32 %r1064, %r179, %r1063; 2026-02-21T09:10:45.4560506Z add.s32 %r1872, %r1064, 26624; 2026-02-21T09:10:45.4560565Z $L__tmp139: 2026-02-21T09:10:45.4560775Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4560835Z // begin inline asm 2026-02-21T09:10:45.4560906Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4560968Z // end inline asm 2026-02-21T09:10:45.4561023Z bar.sync 0; 2026-02-21T09:10:45.4561082Z @%p55 bra $L__BB0_19; 2026-02-21T09:10:45.4561187Z // %bb.18: // in Loop: Header=BB0_17 Depth=2 2026-02-21T09:10:45.4561253Z elect.sync %r1077|%p188, -1; 2026-02-21T09:10:45.4561311Z 
mov.b32 %r1067, 135268624; 2026-02-21T09:10:45.4561375Z // begin inline asm 2026-02-21T09:10:45.4561576Z @%p188 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd3, %r1067, %p187; 2026-02-21T09:10:45.4561634Z // end inline asm 2026-02-21T09:10:45.4561691Z // begin inline asm 2026-02-21T09:10:45.4561849Z @%p188 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 8 ], %rd4, %r1067, %p187; 2026-02-21T09:10:45.4561904Z // end inline asm 2026-02-21T09:10:45.4561959Z // begin inline asm 2026-02-21T09:10:45.4562119Z @%p188 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 16 ], %rd5, %r1067, %p187; 2026-02-21T09:10:45.4562176Z // end inline asm 2026-02-21T09:10:45.4562231Z // begin inline asm 2026-02-21T09:10:45.4562387Z @%p188 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd6, %r1067, %p187; 2026-02-21T09:10:45.4562442Z // end inline asm 2026-02-21T09:10:45.4562502Z cvt.u64.u32 %rd309, %r1872; 2026-02-21T09:10:45.4562557Z // begin inline asm 2026-02-21T09:10:45.4562690Z @%p188 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd309]; 2026-02-21T09:10:45.4562772Z // end inline asm 2026-02-21T09:10:45.4562832Z bra.uni $L__BB0_19; 2026-02-21T09:10:45.4562940Z $L__BB0_20: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4563031Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:45.4563088Z mov.b32 %r1882, 1; 2026-02-21T09:10:45.4563304Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4563384Z // begin inline asm 2026-02-21T09:10:45.4563435Z 2026-02-21T09:10:45.4563483Z { 2026-02-21T09:10:45.4563550Z .reg .pred complete; 2026-02-21T09:10:45.4563604Z waitLoop: 2026-02-21T09:10:45.4563724Z mbarrier.try_wait.parity.shared.b64 complete, [%r1872], %r1882; 2026-02-21T09:10:45.4563797Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4563846Z } 2026-02-21T09:10:45.4563849Z 2026-02-21T09:10:45.4563903Z // end inline asm 
2026-02-21T09:10:45.4563962Z $L__tmp140: 2026-02-21T09:10:45.4564155Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4564220Z cp.async.wait_group 0; 2026-02-21T09:10:45.4564274Z bar.sync 0; 2026-02-21T09:10:45.4564337Z // begin inline asm 2026-02-21T09:10:45.4564424Z @%p273 mbarrier.inval.shared::cta.b64 [%r1310]; 2026-02-21T09:10:45.4564479Z // end inline asm 2026-02-21T09:10:45.4564568Z bar.sync 0; 2026-02-21T09:10:45.4564626Z // begin inline asm 2026-02-21T09:10:45.4564711Z @%p273 mbarrier.inval.shared::cta.b64 [%r1094]; 2026-02-21T09:10:45.4564766Z // end inline asm 2026-02-21T09:10:45.4564832Z add.s32 %r1879, %r179, 26624; 2026-02-21T09:10:45.4564888Z // begin inline asm 2026-02-21T09:10:45.4564968Z @%p273 mbarrier.inval.shared::cta.b64 [%r1879]; 2026-02-21T09:10:45.4565029Z // end inline asm 2026-02-21T09:10:45.4565082Z bar.sync 0; 2026-02-21T09:10:45.4565137Z // begin inline asm 2026-02-21T09:10:45.4565222Z @%p273 mbarrier.inval.shared::cta.b64 [%r1096]; 2026-02-21T09:10:45.4565277Z // end inline asm 2026-02-21T09:10:45.4565329Z $L__tmp141: 2026-02-21T09:10:45.4565540Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4565604Z // begin inline asm 2026-02-21T09:10:45.4565907Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1097, %r1098, %r1099, %r1100, %r1101, %r1102, %r1103, %r1104, %r1105, %r1106, %r1107, %r1108, %r1109, %r1110, %r1111, %r1112}, [%r1439 + 0]; 2026-02-21T09:10:45.4565964Z // end inline asm 2026-02-21T09:10:45.4566027Z // begin inline asm 2026-02-21T09:10:45.4566320Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1114, %r1115, %r1116, %r1117, %r1118, %r1119, %r1120, %r1121, %r1122, %r1123, %r1124, %r1125, %r1126, %r1127, %r1128, %r1129}, [%r1439 + 16]; 2026-02-21T09:10:45.4566375Z // end inline asm 2026-02-21T09:10:45.4566441Z // begin inline asm 2026-02-21T09:10:45.4566510Z tcgen05.wait::ld.sync.aligned; 
2026-02-21T09:10:45.4566568Z // end inline asm 2026-02-21T09:10:45.4566630Z cvt.u64.u32 %rd320, %r1097; 2026-02-21T09:10:45.4566700Z cvt.u64.u32 %rd321, %r1098; 2026-02-21T09:10:45.4566767Z shl.b64 %rd322, %rd321, 32; 2026-02-21T09:10:45.4566828Z or.b64 %rd323, %rd320, %rd322; 2026-02-21T09:10:45.4566889Z $L__tmp142: 2026-02-21T09:10:45.4567054Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4567118Z mov.b64 {%r1209, %r1210}, %rd323; 2026-02-21T09:10:45.4567201Z cvt.rn.bf16x2.f32 %r1211, %r1210, %r1209; 2026-02-21T09:10:45.4567254Z $L__tmp143: 2026-02-21T09:10:45.4567462Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4567520Z cvt.u64.u32 %rd324, %r1099; 2026-02-21T09:10:45.4567585Z cvt.u64.u32 %rd325, %r1100; 2026-02-21T09:10:45.4567643Z shl.b64 %rd326, %rd325, 32; 2026-02-21T09:10:45.4567704Z or.b64 %rd327, %rd324, %rd326; 2026-02-21T09:10:45.4567786Z $L__tmp144: 2026-02-21T09:10:45.4567947Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4568009Z mov.b64 {%r1212, %r1213}, %rd327; 2026-02-21T09:10:45.4568089Z cvt.rn.bf16x2.f32 %r1214, %r1213, %r1212; 2026-02-21T09:10:45.4568141Z $L__tmp145: 2026-02-21T09:10:45.4568346Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4568426Z cvt.u64.u32 %rd328, %r1101; 2026-02-21T09:10:45.4568492Z cvt.u64.u32 %rd329, %r1102; 2026-02-21T09:10:45.4568550Z shl.b64 %rd330, %rd329, 32; 2026-02-21T09:10:45.4568611Z or.b64 %rd331, %rd328, %rd330; 2026-02-21T09:10:45.4568671Z $L__tmp146: 2026-02-21T09:10:45.4568834Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4568894Z mov.b64 {%r1215, %r1216}, %rd331; 2026-02-21T09:10:45.4568964Z cvt.rn.bf16x2.f32 %r1217, %r1216, %r1215; 2026-02-21T09:10:45.4569025Z $L__tmp147: 
2026-02-21T09:10:45.4569264Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4569323Z cvt.u64.u32 %rd332, %r1103; 2026-02-21T09:10:45.4569388Z cvt.u64.u32 %rd333, %r1104; 2026-02-21T09:10:45.4569448Z shl.b64 %rd334, %rd333, 32; 2026-02-21T09:10:45.4569533Z or.b64 %rd335, %rd332, %rd334; 2026-02-21T09:10:45.4569594Z $L__tmp148: 2026-02-21T09:10:45.4569755Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4569816Z mov.b64 {%r1218, %r1219}, %rd335; 2026-02-21T09:10:45.4569885Z cvt.rn.bf16x2.f32 %r1220, %r1219, %r1218; 2026-02-21T09:10:45.4569946Z $L__tmp149: 2026-02-21T09:10:45.4570148Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4570207Z cvt.u64.u32 %rd336, %r1105; 2026-02-21T09:10:45.4570274Z cvt.u64.u32 %rd337, %r1106; 2026-02-21T09:10:45.4570331Z shl.b64 %rd338, %rd337, 32; 2026-02-21T09:10:45.4570392Z or.b64 %rd339, %rd336, %rd338; 2026-02-21T09:10:45.4570450Z $L__tmp150: 2026-02-21T09:10:45.4570608Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4570669Z mov.b64 {%r1221, %r1222}, %rd339; 2026-02-21T09:10:45.4570738Z cvt.rn.bf16x2.f32 %r1223, %r1222, %r1221; 2026-02-21T09:10:45.4570798Z $L__tmp151: 2026-02-21T09:10:45.4570999Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4571058Z cvt.u64.u32 %rd340, %r1107; 2026-02-21T09:10:45.4571123Z cvt.u64.u32 %rd341, %r1108; 2026-02-21T09:10:45.4571180Z shl.b64 %rd342, %rd341, 32; 2026-02-21T09:10:45.4571240Z or.b64 %rd343, %rd340, %rd342; 2026-02-21T09:10:45.4571297Z $L__tmp152: 2026-02-21T09:10:45.4571456Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4571519Z mov.b64 {%r1224, %r1225}, %rd343; 2026-02-21T09:10:45.4571621Z cvt.rn.bf16x2.f32 
%r1226, %r1225, %r1224; 2026-02-21T09:10:45.4571683Z $L__tmp153: 2026-02-21T09:10:45.4571885Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4571944Z cvt.u64.u32 %rd344, %r1109; 2026-02-21T09:10:45.4572011Z cvt.u64.u32 %rd345, %r1110; 2026-02-21T09:10:45.4572069Z shl.b64 %rd346, %rd345, 32; 2026-02-21T09:10:45.4572127Z or.b64 %rd347, %rd344, %rd346; 2026-02-21T09:10:45.4572178Z $L__tmp154: 2026-02-21T09:10:45.4572342Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4572403Z mov.b64 {%r1227, %r1228}, %rd347; 2026-02-21T09:10:45.4572471Z cvt.rn.bf16x2.f32 %r1229, %r1228, %r1227; 2026-02-21T09:10:45.4572530Z $L__tmp155: 2026-02-21T09:10:45.4572769Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4572829Z cvt.u64.u32 %rd348, %r1111; 2026-02-21T09:10:45.4572893Z cvt.u64.u32 %rd349, %r1112; 2026-02-21T09:10:45.4572951Z shl.b64 %rd350, %rd349, 32; 2026-02-21T09:10:45.4573011Z or.b64 %rd351, %rd348, %rd350; 2026-02-21T09:10:45.4573062Z $L__tmp156: 2026-02-21T09:10:45.4573236Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4573322Z mov.b64 {%r1230, %r1231}, %rd351; 2026-02-21T09:10:45.4573389Z cvt.rn.bf16x2.f32 %r1232, %r1231, %r1230; 2026-02-21T09:10:45.4573449Z $L__tmp157: 2026-02-21T09:10:45.4573654Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4573714Z cvt.u64.u32 %rd352, %r1114; 2026-02-21T09:10:45.4573779Z cvt.u64.u32 %rd353, %r1115; 2026-02-21T09:10:45.4573838Z shl.b64 %rd354, %rd353, 32; 2026-02-21T09:10:45.4573897Z or.b64 %rd355, %rd352, %rd354; 2026-02-21T09:10:45.4573976Z $L__tmp158: 2026-02-21T09:10:45.4574144Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4574203Z mov.b64 {%r1233, 
%r1234}, %rd355; 2026-02-21T09:10:45.4574271Z cvt.rn.bf16x2.f32 %r1235, %r1234, %r1233; 2026-02-21T09:10:45.4574331Z $L__tmp159: 2026-02-21T09:10:45.4574566Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4574628Z cvt.u64.u32 %rd356, %r1116; 2026-02-21T09:10:45.4574693Z cvt.u64.u32 %rd357, %r1117; 2026-02-21T09:10:45.4574752Z shl.b64 %rd358, %rd357, 32; 2026-02-21T09:10:45.4574811Z or.b64 %rd359, %rd356, %rd358; 2026-02-21T09:10:45.4574863Z $L__tmp160: 2026-02-21T09:10:45.4575036Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4575097Z mov.b64 {%r1236, %r1237}, %rd359; 2026-02-21T09:10:45.4575168Z cvt.rn.bf16x2.f32 %r1238, %r1237, %r1236; 2026-02-21T09:10:45.4575232Z $L__tmp161: 2026-02-21T09:10:45.4575436Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4575497Z cvt.u64.u32 %rd360, %r1118; 2026-02-21T09:10:45.4575563Z cvt.u64.u32 %rd361, %r1119; 2026-02-21T09:10:45.4575622Z shl.b64 %rd362, %rd361, 32; 2026-02-21T09:10:45.4575683Z or.b64 %rd363, %rd360, %rd362; 2026-02-21T09:10:45.4575735Z $L__tmp162: 2026-02-21T09:10:45.4575905Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4575963Z mov.b64 {%r1239, %r1240}, %rd363; 2026-02-21T09:10:45.4576031Z cvt.rn.bf16x2.f32 %r1241, %r1240, %r1239; 2026-02-21T09:10:45.4576088Z $L__tmp163: 2026-02-21T09:10:45.4576292Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4576351Z cvt.u64.u32 %rd364, %r1120; 2026-02-21T09:10:45.4576417Z cvt.u64.u32 %rd365, %r1121; 2026-02-21T09:10:45.4576474Z shl.b64 %rd366, %rd365, 32; 2026-02-21T09:10:45.4576534Z or.b64 %rd367, %rd364, %rd366; 2026-02-21T09:10:45.4576585Z $L__tmp164: 2026-02-21T09:10:45.4576758Z .loc 1 97 28 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4576820Z mov.b64 {%r1242, %r1243}, %rd367; 2026-02-21T09:10:45.4576887Z cvt.rn.bf16x2.f32 %r1244, %r1243, %r1242; 2026-02-21T09:10:45.4576946Z $L__tmp165: 2026-02-21T09:10:45.4577155Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4577213Z cvt.u64.u32 %rd368, %r1122; 2026-02-21T09:10:45.4577270Z cvt.u64.u32 %rd369, %r1123; 2026-02-21T09:10:45.4577337Z shl.b64 %rd370, %rd369, 32; 2026-02-21T09:10:45.4577397Z or.b64 %rd371, %rd368, %rd370; 2026-02-21T09:10:45.4577475Z $L__tmp166: 2026-02-21T09:10:45.4577643Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4577704Z mov.b64 {%r1245, %r1246}, %rd371; 2026-02-21T09:10:45.4577771Z cvt.rn.bf16x2.f32 %r1247, %r1246, %r1245; 2026-02-21T09:10:45.4577831Z $L__tmp167: 2026-02-21T09:10:45.4578036Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4578118Z cvt.u64.u32 %rd372, %r1124; 2026-02-21T09:10:45.4578176Z cvt.u64.u32 %rd373, %r1125; 2026-02-21T09:10:45.4578242Z shl.b64 %rd374, %rd373, 32; 2026-02-21T09:10:45.4578302Z or.b64 %rd375, %rd372, %rd374; 2026-02-21T09:10:45.4578353Z $L__tmp168: 2026-02-21T09:10:45.4578521Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4578581Z mov.b64 {%r1248, %r1249}, %rd375; 2026-02-21T09:10:45.4578649Z cvt.rn.bf16x2.f32 %r1250, %r1249, %r1248; 2026-02-21T09:10:45.4578708Z $L__tmp169: 2026-02-21T09:10:45.4578937Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4578998Z cvt.u64.u32 %rd376, %r1126; 2026-02-21T09:10:45.4579054Z cvt.u64.u32 %rd377, %r1127; 2026-02-21T09:10:45.4579119Z shl.b64 %rd378, %rd377, 32; 2026-02-21T09:10:45.4579203Z or.b64 %rd379, %rd376, %rd378; 
2026-02-21T09:10:45.4579257Z $L__tmp170: 2026-02-21T09:10:45.4579423Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4579482Z mov.b64 {%r1251, %r1252}, %rd379; 2026-02-21T09:10:45.4579548Z cvt.rn.bf16x2.f32 %r1253, %r1252, %r1251; 2026-02-21T09:10:45.4579607Z $L__tmp171: 2026-02-21T09:10:45.4579807Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4579866Z cvt.u64.u32 %rd380, %r1128; 2026-02-21T09:10:45.4579923Z cvt.u64.u32 %rd381, %r1129; 2026-02-21T09:10:45.4579987Z shl.b64 %rd382, %rd381, 32; 2026-02-21T09:10:45.4580046Z or.b64 %rd383, %rd380, %rd382; 2026-02-21T09:10:45.4580097Z $L__tmp172: 2026-02-21T09:10:45.4580261Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4580320Z mov.b64 {%r1254, %r1255}, %rd383; 2026-02-21T09:10:45.4580386Z cvt.rn.bf16x2.f32 %r1256, %r1255, %r1254; 2026-02-21T09:10:45.4580549Z .loc 1 98 43 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:98:43 2026-02-21T09:10:45.4580647Z st.shared.v4.b32 [%r35], {%r1211, %r1214, %r1217, %r1220}; 2026-02-21T09:10:45.4580741Z st.shared.v4.b32 [%r36], {%r1223, %r1226, %r1229, %r1232}; 2026-02-21T09:10:45.4580830Z st.shared.v4.b32 [%r37], {%r1235, %r1238, %r1241, %r1244}; 2026-02-21T09:10:45.4580928Z st.shared.v4.b32 [%r38], {%r1247, %r1250, %r1253, %r1256}; 2026-02-21T09:10:45.4580987Z // begin inline asm 2026-02-21T09:10:45.4581058Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4581121Z // end inline asm 2026-02-21T09:10:45.4581176Z bar.sync 0; 2026-02-21T09:10:45.4581242Z elect.sync %r1257|%p224, -1; 2026-02-21T09:10:45.4581306Z and.pred %p209, %p1, %p224; 2026-02-21T09:10:45.4581368Z // begin inline asm 2026-02-21T09:10:45.4581593Z @%p209 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd507, {%r1131, %r1132}], [%r179]; 2026-02-21T09:10:45.4581652Z // end inline asm 2026-02-21T09:10:45.4581727Z 
cp.async.bulk.commit_group; 2026-02-21T09:10:45.4581798Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:45.4581851Z bar.sync 0; 2026-02-21T09:10:45.4582017Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4582076Z add.s32 %r1258, %r1855, 3; 2026-02-21T09:10:45.4582232Z .loc 1 38 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:38:33 2026-02-21T09:10:45.4582319Z shr.u32 %r1259, %r1258, 5; 2026-02-21T09:10:45.4582387Z and.b32 %r1260, %r1259, 67108848; 2026-02-21T09:10:45.4582548Z .loc 1 39 39 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:39:39 2026-02-21T09:10:45.4582609Z sub.s32 %r1261, 128, %r1260; 2026-02-21T09:10:45.4582776Z .loc 1 39 52 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:39:52 2026-02-21T09:10:45.4582836Z min.s32 %r1262, %r1261, 16; 2026-02-21T09:10:45.4583022Z .loc 1 40 45 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:45 2026-02-21T09:10:45.4583091Z and.b32 %r1263, %r1258, 511; 2026-02-21T09:10:45.4583251Z .loc 1 41 51 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:41:51 2026-02-21T09:10:45.4583314Z div.s32 %r136, %r1263, %r1262; 2026-02-21T09:10:45.4583519Z .loc 1 40 64 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:64 2026-02-21T09:10:45.4583586Z mul.lo.s32 %r1264, %r136, %r1262; 2026-02-21T09:10:45.4583675Z sub.s32 %r1265, %r1263, %r1264; 2026-02-21T09:10:45.4583834Z .loc 1 40 30 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:30 2026-02-21T09:10:45.4583905Z add.s32 %r1266, %r1265, %r1260; 2026-02-21T09:10:45.4584066Z .loc 1 42 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:42:27 2026-02-21T09:10:45.4584162Z shl.b32 %r1440, %r1266, 6; 2026-02-21T09:10:45.4584343Z .loc 1 43 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:43:27 2026-02-21T09:10:45.4584408Z shl.b32 %r1441, %r136, 7; 2026-02-21T09:10:45.4584576Z .loc 1 44 32 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:44:32 2026-02-21T09:10:45.4584648Z or.b32 %r1267, %r1441, %r7; 2026-02-21T09:10:45.4584710Z or.b32 %r1268, %r1441, %r8; 2026-02-21T09:10:45.4584879Z .loc 1 58 53 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:53 2026-02-21T09:10:45.4584949Z shl.b32 %r1269, %r1267, 10; 2026-02-21T09:10:45.4585011Z shl.b32 %r1270, %r1268, 10; 2026-02-21T09:10:45.4585074Z mov.pred %p229, -1; 2026-02-21T09:10:45.4585132Z mov.b32 %r1878, 0; 2026-02-21T09:10:45.4585194Z $L__tmp173: 2026-02-21T09:10:45.4585413Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4585474Z // begin inline asm 2026-02-21T09:10:45.4585805Z @%p229 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1439 + 0], {%r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878}; 2026-02-21T09:10:45.4585863Z // end inline asm 2026-02-21T09:10:45.4585921Z // begin inline asm 2026-02-21T09:10:45.4586253Z @%p229 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1439 + 16], {%r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878, %r1878}; 2026-02-21T09:10:45.4586313Z // end inline asm 2026-02-21T09:10:45.4586373Z // begin inline asm 2026-02-21T09:10:45.4586446Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4586509Z // end inline asm 2026-02-21T09:10:45.4586565Z bar.sync 0; 2026-02-21T09:10:45.4586621Z $L__tmp174: 2026-02-21T09:10:45.4586805Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4586866Z // begin inline asm 2026-02-21T09:10:45.4586958Z @%p273 mbarrier.init.shared::cta.b64 [%r1879], 1; 2026-02-21T09:10:45.4587023Z // end inline asm 2026-02-21T09:10:45.4587079Z bar.sync 0; 2026-02-21T09:10:45.4587138Z // begin inline asm 2026-02-21T09:10:45.4587227Z @%p273 mbarrier.init.shared::cta.b64 [%r1096], 1; 
2026-02-21T09:10:45.4587293Z // end inline asm 2026-02-21T09:10:45.4587351Z // begin inline asm 2026-02-21T09:10:45.4587437Z @%p273 mbarrier.init.shared::cta.b64 [%r1310], 1; 2026-02-21T09:10:45.4587527Z // end inline asm 2026-02-21T09:10:45.4587585Z bar.sync 0; 2026-02-21T09:10:45.4587642Z // begin inline asm 2026-02-21T09:10:45.4587728Z @%p273 mbarrier.init.shared::cta.b64 [%r1094], 1; 2026-02-21T09:10:45.4587793Z // end inline asm 2026-02-21T09:10:45.4587960Z .loc 1 58 60 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:60 2026-02-21T09:10:45.4588022Z or.b32 %r1271, %r1269, %r9; 2026-02-21T09:10:45.4588091Z or.b32 %r1272, %r1270, %r9; 2026-02-21T09:10:45.4588299Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4588373Z mad.wide.s32 %rd314, %r1271, 2, %rd52; 2026-02-21T09:10:45.4588452Z mad.wide.s32 %rd315, %r1272, 2, %rd52; 2026-02-21T09:10:45.4588513Z mov.b32 %r1307, 16; 2026-02-21T09:10:45.4588680Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4588739Z // begin inline asm 2026-02-21T09:10:45.4588875Z cp.async.cg.shared.global [ %r12 + 0 ], [ %rd314 + 0 ], 0x10, %r1307; 2026-02-21T09:10:45.4588932Z // end inline asm 2026-02-21T09:10:45.4589011Z // begin inline asm 2026-02-21T09:10:45.4589142Z cp.async.cg.shared.global [ %r13 + 0 ], [ %rd315 + 0 ], 0x10, %r1307; 2026-02-21T09:10:45.4589201Z // end inline asm 2026-02-21T09:10:45.4589266Z cp.async.commit_group; 2026-02-21T09:10:45.4589462Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4589520Z bar.sync 0; 2026-02-21T09:10:45.4589577Z // begin inline asm 2026-02-21T09:10:45.4589695Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1310], 1024; 2026-02-21T09:10:45.4589760Z // end inline asm 2026-02-21T09:10:45.4589929Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4589986Z bar.sync 0; 
2026-02-21T09:10:45.4590059Z elect.sync %r1273|%p225, -1; 2026-02-21T09:10:45.4590126Z and.pred %p217, %p1, %p225; 2026-02-21T09:10:45.4590185Z // begin inline asm 2026-02-21T09:10:45.4590450Z @%p217 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r219], [%rd497, {%r1440, %r1878}], [%r1310]; 2026-02-21T09:10:45.4590508Z // end inline asm 2026-02-21T09:10:45.4590672Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4590736Z cvt.s64.s32 %rd384, %r1269; 2026-02-21T09:10:45.4590809Z or.b64 %rd385, %rd384, %rd572; 2026-02-21T09:10:45.4590873Z shl.b64 %rd386, %rd385, 1; 2026-02-21T09:10:45.4590939Z add.s64 %rd35, %rd52, %rd386; 2026-02-21T09:10:45.4591009Z add.s64 %rd317, %rd35, 64; 2026-02-21T09:10:45.4591069Z cvt.s64.s32 %rd387, %r1270; 2026-02-21T09:10:45.4591131Z or.b64 %rd388, %rd387, %rd572; 2026-02-21T09:10:45.4591191Z shl.b64 %rd389, %rd388, 1; 2026-02-21T09:10:45.4591262Z add.s64 %rd36, %rd52, %rd389; 2026-02-21T09:10:45.4591321Z add.s64 %rd318, %rd36, 64; 2026-02-21T09:10:45.4591491Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4591602Z // begin inline asm 2026-02-21T09:10:45.4591724Z cp.async.cg.shared.global [ %r15 + 0 ], [ %rd317 + 0 ], 0x10, %r1307; 2026-02-21T09:10:45.4591781Z // end inline asm 2026-02-21T09:10:45.4591847Z // begin inline asm 2026-02-21T09:10:45.4591964Z cp.async.cg.shared.global [ %r16 + 0 ], [ %rd318 + 0 ], 0x10, %r1307; 2026-02-21T09:10:45.4592022Z // end inline asm 2026-02-21T09:10:45.4592088Z cp.async.commit_group; 2026-02-21T09:10:45.4592266Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4592323Z bar.sync 0; 2026-02-21T09:10:45.4592381Z // begin inline asm 2026-02-21T09:10:45.4592507Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1094], 1024; 2026-02-21T09:10:45.4592564Z // end inline asm 2026-02-21T09:10:45.4592734Z .loc 1 64 33 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4592825Z bar.sync 0; 2026-02-21T09:10:45.4592906Z elect.sync %r1274|%p226, -1; 2026-02-21T09:10:45.4592970Z and.pred %p219, %p1, %p226; 2026-02-21T09:10:45.4593027Z // begin inline asm 2026-02-21T09:10:45.4593278Z @%p219 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r301], [%rd497, {%r1440, %r1307}], [%r1094]; 2026-02-21T09:10:45.4593333Z // end inline asm 2026-02-21T09:10:45.4593489Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4593592Z cp.async.wait_group 1; 2026-02-21T09:10:45.4593648Z bar.sync 0; 2026-02-21T09:10:45.4593749Z ld.shared.v4.b32 {%r1275, %r1276, %r1277, %r1278}, [%r18]; 2026-02-21T09:10:45.4593823Z mov.b32 {%rs337, %rs338}, %r1278; 2026-02-21T09:10:45.4593886Z mov.b32 {%rs339, %rs340}, %r1277; 2026-02-21T09:10:45.4593949Z mov.b32 {%rs341, %rs342}, %r1276; 2026-02-21T09:10:45.4594010Z mov.b32 {%rs343, %rs344}, %r1275; 2026-02-21T09:10:45.4594123Z ld.shared.v4.b32 {%r1279, %r1280, %r1281, %r1282}, [%r18+16]; 2026-02-21T09:10:45.4594208Z mov.b32 {%rs345, %rs346}, %r1282; 2026-02-21T09:10:45.4594267Z mov.b32 {%rs347, %rs348}, %r1281; 2026-02-21T09:10:45.4594333Z mov.b32 {%rs349, %rs350}, %r1280; 2026-02-21T09:10:45.4594391Z mov.b32 {%rs351, %rs352}, %r1279; 2026-02-21T09:10:45.4594574Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 2026-02-21T09:10:45.4594648Z cvt.f32.bf16 %r1191, %rs343; 2026-02-21T09:10:45.4594708Z cvt.f32.bf16 %r1192, %rs344; 2026-02-21T09:10:45.4594766Z cvt.f32.bf16 %r1193, %rs341; 2026-02-21T09:10:45.4594824Z cvt.f32.bf16 %r1194, %rs342; 2026-02-21T09:10:45.4594888Z cvt.f32.bf16 %r1195, %rs339; 2026-02-21T09:10:45.4594946Z cvt.f32.bf16 %r1196, %rs340; 2026-02-21T09:10:45.4595004Z cvt.f32.bf16 %r1197, %rs337; 2026-02-21T09:10:45.4595067Z cvt.f32.bf16 %r1198, %rs338; 2026-02-21T09:10:45.4595126Z cvt.f32.bf16 %r1199, %rs351; 
2026-02-21T09:10:45.4595184Z cvt.f32.bf16 %r1200, %rs352; 2026-02-21T09:10:45.4595240Z cvt.f32.bf16 %r1201, %rs349; 2026-02-21T09:10:45.4595306Z cvt.f32.bf16 %r1202, %rs350; 2026-02-21T09:10:45.4595363Z cvt.f32.bf16 %r1203, %rs347; 2026-02-21T09:10:45.4595420Z cvt.f32.bf16 %r1204, %rs348; 2026-02-21T09:10:45.4595484Z cvt.f32.bf16 %r1205, %rs345; 2026-02-21T09:10:45.4595540Z cvt.f32.bf16 %r1206, %rs346; 2026-02-21T09:10:45.4595595Z $L__tmp175: 2026-02-21T09:10:45.4595810Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4595874Z // begin inline asm 2026-02-21T09:10:45.4596182Z @%p229 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1190 + 0], {%r1191, %r1192, %r1193, %r1194, %r1195, %r1196, %r1197, %r1198, %r1199, %r1200, %r1201, %r1202, %r1203, %r1204, %r1205, %r1206}; 2026-02-21T09:10:45.4596236Z // end inline asm 2026-02-21T09:10:45.4596301Z // begin inline asm 2026-02-21T09:10:45.4596371Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4596426Z // end inline asm 2026-02-21T09:10:45.4596487Z bar.sync 0; 2026-02-21T09:10:45.4596542Z $L__tmp176: 2026-02-21T09:10:45.4596708Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4596765Z // begin inline asm 2026-02-21T09:10:45.4596822Z 2026-02-21T09:10:45.4596872Z { 2026-02-21T09:10:45.4596933Z .reg .pred complete; 2026-02-21T09:10:45.4596997Z waitLoop: 2026-02-21T09:10:45.4597119Z mbarrier.try_wait.parity.shared.b64 complete, [%r1310], %r1878; 2026-02-21T09:10:45.4597184Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4597234Z } 2026-02-21T09:10:45.4597244Z 2026-02-21T09:10:45.4597298Z // end inline asm 2026-02-21T09:10:45.4597460Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4597521Z ld.shared.b8 %rs353, [%r20]; 2026-02-21T09:10:45.4597593Z ld.shared.b8 %rs354, [%r20+512]; 2026-02-21T09:10:45.4597681Z ld.shared.b8 %rs355, [%r22+128]; 
2026-02-21T09:10:45.4597743Z ld.shared.b8 %rs356, [%r22+640]; 2026-02-21T09:10:45.4597811Z ld.shared.b8 %rs357, [%r24+256]; 2026-02-21T09:10:45.4597871Z ld.shared.b8 %rs358, [%r24+768]; 2026-02-21T09:10:45.4597931Z ld.shared.b8 %rs359, [%r26+384]; 2026-02-21T09:10:45.4597992Z ld.shared.b8 %rs360, [%r26+896]; 2026-02-21T09:10:45.4598163Z .loc 1 67 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4598247Z shl.b16 %rs361, %rs353, 4; 2026-02-21T09:10:45.4598310Z shl.b16 %rs362, %rs355, 4; 2026-02-21T09:10:45.4598375Z shl.b16 %rs363, %rs357, 4; 2026-02-21T09:10:45.4598432Z shl.b16 %rs364, %rs359, 4; 2026-02-21T09:10:45.4598488Z shl.b16 %rs365, %rs354, 4; 2026-02-21T09:10:45.4598550Z shl.b16 %rs366, %rs356, 4; 2026-02-21T09:10:45.4598606Z shl.b16 %rs367, %rs358, 4; 2026-02-21T09:10:45.4598663Z shl.b16 %rs368, %rs360, 4; 2026-02-21T09:10:45.4598825Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4598922Z selp.b16 %rs369, %rs361, %rs353, %p331; 2026-02-21T09:10:45.4598982Z cvt.s16.s8 %rs370, %rs369; 2026-02-21T09:10:45.4599038Z shr.s16 %rs371, %rs370, 4; 2026-02-21T09:10:45.4599114Z selp.b16 %rs372, %rs362, %rs355, %p331; 2026-02-21T09:10:45.4599172Z cvt.s16.s8 %rs373, %rs372; 2026-02-21T09:10:45.4599229Z shr.s16 %rs374, %rs373, 4; 2026-02-21T09:10:45.4599317Z selp.b16 %rs375, %rs363, %rs357, %p331; 2026-02-21T09:10:45.4599385Z cvt.s16.s8 %rs376, %rs375; 2026-02-21T09:10:45.4599442Z shr.s16 %rs377, %rs376, 4; 2026-02-21T09:10:45.4599507Z selp.b16 %rs378, %rs364, %rs359, %p331; 2026-02-21T09:10:45.4599574Z cvt.s16.s8 %rs379, %rs378; 2026-02-21T09:10:45.4599632Z shr.s16 %rs380, %rs379, 4; 2026-02-21T09:10:45.4599696Z selp.b16 %rs381, %rs365, %rs354, %p331; 2026-02-21T09:10:45.4599761Z cvt.s16.s8 %rs382, %rs381; 2026-02-21T09:10:45.4599818Z shr.s16 %rs383, %rs382, 4; 2026-02-21T09:10:45.4599883Z selp.b16 %rs384, %rs366, %rs356, %p331; 2026-02-21T09:10:45.4599941Z cvt.s16.s8 %rs385, 
%rs384; 2026-02-21T09:10:45.4600007Z shr.s16 %rs386, %rs385, 4; 2026-02-21T09:10:45.4600071Z selp.b16 %rs387, %rs367, %rs358, %p331; 2026-02-21T09:10:45.4600130Z cvt.s16.s8 %rs388, %rs387; 2026-02-21T09:10:45.4600194Z shr.s16 %rs389, %rs388, 4; 2026-02-21T09:10:45.4600258Z selp.b16 %rs390, %rs368, %rs360, %p331; 2026-02-21T09:10:45.4600316Z cvt.s16.s8 %rs391, %rs390; 2026-02-21T09:10:45.4600375Z shr.s16 %rs392, %rs391, 4; 2026-02-21T09:10:45.4600548Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4600612Z cvt.rn.f32.s16 %r1283, %rs371; 2026-02-21T09:10:45.4600676Z cvt.rn.f32.s16 %r1284, %rs374; 2026-02-21T09:10:45.4600744Z cvt.rn.f32.s16 %r1285, %rs377; 2026-02-21T09:10:45.4600804Z cvt.rn.f32.s16 %r1286, %rs380; 2026-02-21T09:10:45.4600862Z cvt.rn.f32.s16 %r1287, %rs383; 2026-02-21T09:10:45.4600930Z cvt.rn.f32.s16 %r1288, %rs386; 2026-02-21T09:10:45.4600990Z cvt.rn.f32.s16 %r1289, %rs389; 2026-02-21T09:10:45.4601052Z cvt.rn.f32.s16 %r1290, %rs392; 2026-02-21T09:10:45.4601115Z st.shared.b32 [%r27], %r1283; 2026-02-21T09:10:45.4601188Z st.shared.b32 [%r28], %r1284; 2026-02-21T09:10:45.4601258Z st.shared.b32 [%r29], %r1285; 2026-02-21T09:10:45.4613629Z st.shared.b32 [%r30], %r1286; 2026-02-21T09:10:45.4613754Z st.shared.b32 [%r31], %r1287; 2026-02-21T09:10:45.4613830Z st.shared.b32 [%r32], %r1288; 2026-02-21T09:10:45.4613898Z st.shared.b32 [%r33], %r1289; 2026-02-21T09:10:45.4613968Z st.shared.b32 [%r34], %r1290; 2026-02-21T09:10:45.4614030Z $L__tmp177: 2026-02-21T09:10:45.4614276Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4614347Z // begin inline asm 2026-02-21T09:10:45.4614430Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4614492Z // end inline asm 2026-02-21T09:10:45.4614666Z bar.sync 0; 2026-02-21T09:10:45.4614738Z @%p55 bra $L__BB0_22; 2026-02-21T09:10:45.4614850Z // %bb.21: // in Loop: Header=BB0_2 Depth=1 
2026-02-21T09:10:45.4614925Z elect.sync %r1303|%p228, -1; 2026-02-21T09:10:45.4614999Z mov.b32 %r1293, 135268624; 2026-02-21T09:10:45.4615062Z mov.pred %p227, 0; 2026-02-21T09:10:45.4615121Z // begin inline asm 2026-02-21T09:10:45.4615296Z @%p228 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd3, %r1293, %p227; 2026-02-21T09:10:45.4615422Z // end inline asm 2026-02-21T09:10:45.4615483Z // begin inline asm 2026-02-21T09:10:45.4615637Z @%p228 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 8 ], %rd4, %r1293, %p229; 2026-02-21T09:10:45.4615704Z // end inline asm 2026-02-21T09:10:45.4615762Z // begin inline asm 2026-02-21T09:10:45.4615918Z @%p228 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 16 ], %rd5, %r1293, %p229; 2026-02-21T09:10:45.4615985Z // end inline asm 2026-02-21T09:10:45.4616046Z // begin inline asm 2026-02-21T09:10:45.4616229Z @%p228 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd6, %r1293, %p229; 2026-02-21T09:10:45.4616299Z // end inline asm 2026-02-21T09:10:45.4616366Z add.s32 %r1305, %r179, 26624; 2026-02-21T09:10:45.4616432Z cvt.u64.u32 %rd394, %r1305; 2026-02-21T09:10:45.4616492Z // begin inline asm 2026-02-21T09:10:45.4616682Z @%p228 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd394]; 2026-02-21T09:10:45.4616741Z // end inline asm 2026-02-21T09:10:45.4616802Z $L__tmp178: 2026-02-21T09:10:45.4616914Z $L__BB0_22: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:45.4617090Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4617155Z add.s64 %rd395, %rd35, 128; 2026-02-21T09:10:45.4617223Z add.s64 %rd396, %rd36, 128; 2026-02-21T09:10:45.4617386Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4617447Z // begin inline asm 2026-02-21T09:10:45.4617571Z cp.async.cg.shared.global [ %r12 + 0 ], [ %rd395 + 0 ], 0x10, %r1307; 2026-02-21T09:10:45.4617637Z // end inline asm 
2026-02-21T09:10:45.4617695Z // begin inline asm 2026-02-21T09:10:45.4617807Z cp.async.cg.shared.global [ %r13 + 0 ], [ %rd396 + 0 ], 0x10, %r1307; 2026-02-21T09:10:45.4617871Z // end inline asm 2026-02-21T09:10:45.4617939Z cp.async.commit_group; 2026-02-21T09:10:45.4618113Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4618179Z // begin inline asm 2026-02-21T09:10:45.4618299Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1310], 1024; 2026-02-21T09:10:45.4618356Z // end inline asm 2026-02-21T09:10:45.4618524Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4618591Z bar.sync 0; 2026-02-21T09:10:45.4618660Z elect.sync %r1319|%p239, -1; 2026-02-21T09:10:45.4618731Z and.pred %p237, %p1, %p239; 2026-02-21T09:10:45.4618800Z mov.b32 %r1313, 32; 2026-02-21T09:10:45.4618859Z // begin inline asm 2026-02-21T09:10:45.4619114Z @%p237 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r219], [%rd497, {%r1440, %r1313}], [%r1310]; 2026-02-21T09:10:45.4619181Z // end inline asm 2026-02-21T09:10:45.4619351Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4619417Z shl.b32 %r1320, %r136, 17; 2026-02-21T09:10:45.4619483Z or.b32 %r1321, %r1854, %r1320; 2026-02-21T09:10:45.4619566Z mad.wide.s32 %rd579, %r1321, 2, %rd7; 2026-02-21T09:10:45.4619631Z or.b32 %r1877, %r40, %r1320; 2026-02-21T09:10:45.4619688Z mov.b64 %rd580, 0; 2026-02-21T09:10:45.4619756Z mov.b32 %r1880, %r1878; 2026-02-21T09:10:45.4619816Z mov.b32 %r1881, %r1878; 2026-02-21T09:10:45.4619874Z mov.b32 %r1883, %r1878; 2026-02-21T09:10:45.4619935Z bra.uni $L__BB0_23; 2026-02-21T09:10:45.4620071Z $L__BB0_25: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:10:45.4620239Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4620308Z setp.lt.u64 %p255, %rd580, 464; 2026-02-21T09:10:45.4620375Z 
$L__tmp179: 2026-02-21T09:10:45.4620593Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4620678Z add.s32 %r1396, %r1882, 1; 2026-02-21T09:10:45.4620753Z setp.gt.s32 %p258, %r1396, 1; 2026-02-21T09:10:45.4620823Z selp.b32 %r1882, 0, %r1396, %p258; 2026-02-21T09:10:45.4620885Z selp.b32 %r1397, 1, 0, %p258; 2026-02-21T09:10:45.4620947Z xor.b32 %r154, %r1883, %r1397; 2026-02-21T09:10:45.4621013Z $L__tmp180: 2026-02-21T09:10:45.4621177Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4621253Z mad.wide.s32 %rd405, %r1877, 2, %rd52; 2026-02-21T09:10:45.4621454Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4621520Z add.s32 %r1387, %r148, %r11; 2026-02-21T09:10:45.4621635Z selp.b32 %r1388, 16, 0, %p255; 2026-02-21T09:10:45.4621710Z // begin inline asm 2026-02-21T09:10:45.4621838Z cp.async.cg.shared.global [ %r1387 + 0 ], [ %rd579 + 0 ], 0x10, %r1388; 2026-02-21T09:10:45.4621925Z // end inline asm 2026-02-21T09:10:45.4621995Z add.s32 %r1389, %r1387, 4096; 2026-02-21T09:10:45.4622068Z // begin inline asm 2026-02-21T09:10:45.4622188Z cp.async.cg.shared.global [ %r1389 + 0 ], [ %rd405 + 0 ], 0x10, %r1388; 2026-02-21T09:10:45.4622248Z // end inline asm 2026-02-21T09:10:45.4622328Z cp.async.commit_group; 2026-02-21T09:10:45.4622499Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4622569Z and.pred %p253, %p273, %p255; 2026-02-21T09:10:45.4622642Z // begin inline asm 2026-02-21T09:10:45.4622758Z @%p253 mbarrier.arrive.expect_tx.shared.b64 _, [%r1391], 1024; 2026-02-21T09:10:45.4622819Z // end inline asm 2026-02-21T09:10:45.4622985Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4623055Z bar.sync 0; 2026-02-21T09:10:45.4623122Z elect.sync %r1398|%p259, -1; 2026-02-21T09:10:45.4623186Z and.pred %p260, 
%p255, %p259; 2026-02-21T09:10:45.4623266Z and.pred %p254, %p1, %p260; 2026-02-21T09:10:45.4623327Z cvt.u32.u64 %r1399, %rd580; 2026-02-21T09:10:45.4623388Z add.s32 %r1394, %r1399, 48; 2026-02-21T09:10:45.4623446Z // begin inline asm 2026-02-21T09:10:45.4623711Z @%p254 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1392], [%rd497, {%r1440, %r1394}], [%r1391]; 2026-02-21T09:10:45.4623768Z // end inline asm 2026-02-21T09:10:45.4623937Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4624010Z add.s64 %rd579, %rd579, 64; 2026-02-21T09:10:45.4624068Z add.s32 %r1877, %r1877, 32; 2026-02-21T09:10:45.4624138Z setp.lt.u64 %p261, %rd580, 480; 2026-02-21T09:10:45.4624208Z add.s64 %rd580, %rd580, 16; 2026-02-21T09:10:45.4624268Z mov.b32 %r1878, %r1883; 2026-02-21T09:10:45.4624328Z mov.b32 %r1883, %r154; 2026-02-21T09:10:45.4624390Z @%p261 bra $L__BB0_23; 2026-02-21T09:10:45.4624459Z bra.uni $L__BB0_26; 2026-02-21T09:10:45.4624566Z $L__BB0_23: // Parent Loop BB0_2 Depth=1 2026-02-21T09:10:45.4624663Z // => This Inner Loop Header: Depth=2 2026-02-21T09:10:45.4624736Z add.s32 %r1343, %r1881, 1; 2026-02-21T09:10:45.4624801Z setp.gt.s32 %p243, %r1343, 1; 2026-02-21T09:10:45.4624868Z selp.b32 %r1881, 0, %r1343, %p243; 2026-02-21T09:10:45.4625045Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4625140Z cp.async.wait_group 1; 2026-02-21T09:10:45.4625196Z bar.sync 0; 2026-02-21T09:10:45.4625256Z shl.b32 %r1344, %r1881, 13; 2026-02-21T09:10:45.4625331Z add.s32 %r148, %r179, %r1344; 2026-02-21T09:10:45.4625391Z add.s32 %r1346, %r148, %r17; 2026-02-21T09:10:45.4625496Z ld.shared.v4.b32 {%r1347, %r1348, %r1349, %r1350}, [%r1346]; 2026-02-21T09:10:45.4625571Z mov.b32 {%rs393, %rs394}, %r1350; 2026-02-21T09:10:45.4625634Z mov.b32 {%rs395, %rs396}, %r1349; 2026-02-21T09:10:45.4625725Z mov.b32 {%rs397, %rs398}, %r1348; 2026-02-21T09:10:45.4625784Z mov.b32 
{%rs399, %rs400}, %r1347; 2026-02-21T09:10:45.4625900Z ld.shared.v4.b32 {%r1351, %r1352, %r1353, %r1354}, [%r1346+16]; 2026-02-21T09:10:45.4625960Z mov.b32 {%rs401, %rs402}, %r1354; 2026-02-21T09:10:45.4626019Z mov.b32 {%rs403, %rs404}, %r1353; 2026-02-21T09:10:45.4626087Z mov.b32 {%rs405, %rs406}, %r1352; 2026-02-21T09:10:45.4626148Z mov.b32 {%rs407, %rs408}, %r1351; 2026-02-21T09:10:45.4626315Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 2026-02-21T09:10:45.4626389Z cvt.f32.bf16 %r1325, %rs399; 2026-02-21T09:10:45.4626475Z cvt.f32.bf16 %r1326, %rs400; 2026-02-21T09:10:45.4626536Z cvt.f32.bf16 %r1327, %rs397; 2026-02-21T09:10:45.4626596Z cvt.f32.bf16 %r1328, %rs398; 2026-02-21T09:10:45.4626664Z cvt.f32.bf16 %r1329, %rs395; 2026-02-21T09:10:45.4626722Z cvt.f32.bf16 %r1330, %rs396; 2026-02-21T09:10:45.4626803Z cvt.f32.bf16 %r1331, %rs393; 2026-02-21T09:10:45.4626872Z cvt.f32.bf16 %r1332, %rs394; 2026-02-21T09:10:45.4626929Z cvt.f32.bf16 %r1333, %rs407; 2026-02-21T09:10:45.4626987Z cvt.f32.bf16 %r1334, %rs408; 2026-02-21T09:10:45.4627045Z cvt.f32.bf16 %r1335, %rs405; 2026-02-21T09:10:45.4627113Z cvt.f32.bf16 %r1336, %rs406; 2026-02-21T09:10:45.4627172Z cvt.f32.bf16 %r1337, %rs403; 2026-02-21T09:10:45.4627255Z cvt.f32.bf16 %r1338, %rs404; 2026-02-21T09:10:45.4627326Z cvt.f32.bf16 %r1339, %rs401; 2026-02-21T09:10:45.4627387Z cvt.f32.bf16 %r1340, %rs402; 2026-02-21T09:10:45.4627445Z $L__tmp181: 2026-02-21T09:10:45.4627684Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4627747Z // begin inline asm 2026-02-21T09:10:45.4627803Z 2026-02-21T09:10:45.4627858Z { 2026-02-21T09:10:45.4627936Z .reg .pred complete; 2026-02-21T09:10:45.4627996Z waitLoop: 2026-02-21T09:10:45.4628128Z mbarrier.try_wait.parity.shared.b64 complete, [%r1879], %r1878; 2026-02-21T09:10:45.4628207Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4628263Z } 2026-02-21T09:10:45.4628269Z 
2026-02-21T09:10:45.4628330Z // end inline asm 2026-02-21T09:10:45.4628386Z $L__tmp182: 2026-02-21T09:10:45.4628571Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4628639Z selp.b32 %r1355, 1, 0, %p243; 2026-02-21T09:10:45.4628704Z xor.b32 %r1880, %r1880, %r1355; 2026-02-21T09:10:45.4628778Z mov.pred %p244, -1; 2026-02-21T09:10:45.4628838Z $L__tmp183: 2026-02-21T09:10:45.4629056Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4629128Z // begin inline asm 2026-02-21T09:10:45.4629463Z @%p244 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1190 + 0], {%r1325, %r1326, %r1327, %r1328, %r1329, %r1330, %r1331, %r1332, %r1333, %r1334, %r1335, %r1336, %r1337, %r1338, %r1339, %r1340}; 2026-02-21T09:10:45.4629524Z // end inline asm 2026-02-21T09:10:45.4629589Z // begin inline asm 2026-02-21T09:10:45.4629672Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4629731Z // end inline asm 2026-02-21T09:10:45.4629793Z bar.sync 0; 2026-02-21T09:10:45.4629863Z $L__tmp184: 2026-02-21T09:10:45.4630040Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4630108Z shl.b32 %r1356, %r1881, 3; 2026-02-21T09:10:45.4630181Z add.s32 %r1357, %r179, %r1356; 2026-02-21T09:10:45.4630275Z add.s32 %r1391, %r1357, 26640; 2026-02-21T09:10:45.4630336Z // begin inline asm 2026-02-21T09:10:45.4630390Z 2026-02-21T09:10:45.4630455Z { 2026-02-21T09:10:45.4630522Z .reg .pred complete; 2026-02-21T09:10:45.4630579Z waitLoop: 2026-02-21T09:10:45.4630717Z mbarrier.try_wait.parity.shared.b64 complete, [%r1391], %r1880; 2026-02-21T09:10:45.4630787Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4630840Z } 2026-02-21T09:10:45.4630844Z 2026-02-21T09:10:45.4630904Z // end inline asm 2026-02-21T09:10:45.4631111Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4631174Z shl.b32 %r1358, %r1881, 10; 
2026-02-21T09:10:45.4631239Z add.s32 %r1359, %r179, %r1358; 2026-02-21T09:10:45.4631312Z add.s32 %r1392, %r1359, 24576; 2026-02-21T09:10:45.4631376Z add.s32 %r1360, %r1392, %r19; 2026-02-21T09:10:45.4631444Z ld.shared.b8 %rs409, [%r1360]; 2026-02-21T09:10:45.4631522Z ld.shared.b8 %rs410, [%r1360+512]; 2026-02-21T09:10:45.4631645Z add.s32 %r1361, %r1392, %r21; 2026-02-21T09:10:45.4631718Z ld.shared.b8 %rs411, [%r1361+128]; 2026-02-21T09:10:45.4631836Z ld.shared.b8 %rs412, [%r1361+640]; 2026-02-21T09:10:45.4631912Z add.s32 %r1362, %r1392, %r23; 2026-02-21T09:10:45.4631979Z ld.shared.b8 %rs413, [%r1362+256]; 2026-02-21T09:10:45.4632043Z ld.shared.b8 %rs414, [%r1362+768]; 2026-02-21T09:10:45.4632117Z add.s32 %r1363, %r1392, %r25; 2026-02-21T09:10:45.4632210Z ld.shared.b8 %rs415, [%r1363+384]; 2026-02-21T09:10:45.4632276Z ld.shared.b8 %rs416, [%r1363+896]; 2026-02-21T09:10:45.4632456Z .loc 1 67 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4632535Z shl.b16 %rs417, %rs409, 4; 2026-02-21T09:10:45.4632598Z shl.b16 %rs418, %rs411, 4; 2026-02-21T09:10:45.4632662Z shl.b16 %rs419, %rs413, 4; 2026-02-21T09:10:45.4632736Z shl.b16 %rs420, %rs415, 4; 2026-02-21T09:10:45.4632798Z shl.b16 %rs421, %rs410, 4; 2026-02-21T09:10:45.4632861Z shl.b16 %rs422, %rs412, 4; 2026-02-21T09:10:45.4632933Z shl.b16 %rs423, %rs414, 4; 2026-02-21T09:10:45.4632995Z shl.b16 %rs424, %rs416, 4; 2026-02-21T09:10:45.4633170Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4633248Z selp.b16 %rs425, %rs417, %rs409, %p331; 2026-02-21T09:10:45.4633323Z cvt.s16.s8 %rs426, %rs425; 2026-02-21T09:10:45.4633386Z shr.s16 %rs427, %rs426, 4; 2026-02-21T09:10:45.4633461Z selp.b16 %rs428, %rs418, %rs411, %p331; 2026-02-21T09:10:45.4633537Z cvt.s16.s8 %rs429, %rs428; 2026-02-21T09:10:45.4633601Z shr.s16 %rs430, %rs429, 4; 2026-02-21T09:10:45.4633673Z selp.b16 %rs431, %rs419, %rs413, %p331; 2026-02-21T09:10:45.4633739Z cvt.s16.s8 
%rs432, %rs431; 2026-02-21T09:10:45.4633812Z shr.s16 %rs433, %rs432, 4; 2026-02-21T09:10:45.4633882Z selp.b16 %rs434, %rs420, %rs415, %p331; 2026-02-21T09:10:45.4633945Z cvt.s16.s8 %rs435, %rs434; 2026-02-21T09:10:45.4634016Z shr.s16 %rs436, %rs435, 4; 2026-02-21T09:10:45.4634086Z selp.b16 %rs437, %rs421, %rs410, %p331; 2026-02-21T09:10:45.4634147Z cvt.s16.s8 %rs438, %rs437; 2026-02-21T09:10:45.4634209Z shr.s16 %rs439, %rs438, 4; 2026-02-21T09:10:45.4634287Z selp.b16 %rs440, %rs422, %rs412, %p331; 2026-02-21T09:10:45.4634349Z cvt.s16.s8 %rs441, %rs440; 2026-02-21T09:10:45.4634410Z shr.s16 %rs442, %rs441, 4; 2026-02-21T09:10:45.4634488Z selp.b16 %rs443, %rs423, %rs414, %p331; 2026-02-21T09:10:45.4634550Z cvt.s16.s8 %rs444, %rs443; 2026-02-21T09:10:45.4634613Z shr.s16 %rs445, %rs444, 4; 2026-02-21T09:10:45.4634691Z selp.b16 %rs446, %rs424, %rs416, %p331; 2026-02-21T09:10:45.4634754Z cvt.s16.s8 %rs447, %rs446; 2026-02-21T09:10:45.4634815Z shr.s16 %rs448, %rs447, 4; 2026-02-21T09:10:45.4634987Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4635061Z cvt.rn.f32.s16 %r1364, %rs427; 2026-02-21T09:10:45.4635127Z cvt.rn.f32.s16 %r1365, %rs430; 2026-02-21T09:10:45.4635200Z cvt.rn.f32.s16 %r1366, %rs433; 2026-02-21T09:10:45.4635297Z cvt.rn.f32.s16 %r1367, %rs436; 2026-02-21T09:10:45.4635358Z cvt.rn.f32.s16 %r1368, %rs439; 2026-02-21T09:10:45.4635420Z cvt.rn.f32.s16 %r1369, %rs442; 2026-02-21T09:10:45.4635479Z cvt.rn.f32.s16 %r1370, %rs445; 2026-02-21T09:10:45.4635547Z cvt.rn.f32.s16 %r1371, %rs448; 2026-02-21T09:10:45.4635611Z st.shared.b32 [%r27], %r1364; 2026-02-21T09:10:45.4635672Z st.shared.b32 [%r28], %r1365; 2026-02-21T09:10:45.4635744Z st.shared.b32 [%r29], %r1366; 2026-02-21T09:10:45.4635834Z st.shared.b32 [%r30], %r1367; 2026-02-21T09:10:45.4635894Z st.shared.b32 [%r31], %r1368; 2026-02-21T09:10:45.4635963Z st.shared.b32 [%r32], %r1369; 2026-02-21T09:10:45.4636025Z st.shared.b32 [%r33], %r1370; 
2026-02-21T09:10:45.4636084Z st.shared.b32 [%r34], %r1371; 2026-02-21T09:10:45.4636258Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4636330Z shl.b32 %r1372, %r1882, 3; 2026-02-21T09:10:45.4636393Z add.s32 %r1373, %r179, %r1372; 2026-02-21T09:10:45.4636453Z add.s32 %r1879, %r1373, 26624; 2026-02-21T09:10:45.4636538Z $L__tmp185: 2026-02-21T09:10:45.4636754Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4636813Z // begin inline asm 2026-02-21T09:10:45.4636886Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4636975Z // end inline asm 2026-02-21T09:10:45.4637036Z bar.sync 0; 2026-02-21T09:10:45.4637097Z @%p55 bra $L__BB0_25; 2026-02-21T09:10:45.4637210Z // %bb.24: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:10:45.4637275Z elect.sync %r1386|%p245, -1; 2026-02-21T09:10:45.4637334Z mov.b32 %r1376, 135268624; 2026-02-21T09:10:45.4637401Z // begin inline asm 2026-02-21T09:10:45.4637560Z @%p245 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd3, %r1376, %p244; 2026-02-21T09:10:45.4637618Z // end inline asm 2026-02-21T09:10:45.4637679Z // begin inline asm 2026-02-21T09:10:45.4637844Z @%p245 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 8 ], %rd4, %r1376, %p244; 2026-02-21T09:10:45.4637902Z // end inline asm 2026-02-21T09:10:45.4637961Z // begin inline asm 2026-02-21T09:10:45.4638120Z @%p245 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 16 ], %rd5, %r1376, %p244; 2026-02-21T09:10:45.4638177Z // end inline asm 2026-02-21T09:10:45.4638237Z // begin inline asm 2026-02-21T09:10:45.4638397Z @%p245 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd6, %r1376, %p244; 2026-02-21T09:10:45.4638452Z // end inline asm 2026-02-21T09:10:45.4638513Z cvt.u64.u32 %rd403, %r1879; 2026-02-21T09:10:45.4638571Z // begin inline asm 2026-02-21T09:10:45.4638710Z @%p245 
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd403]; 2026-02-21T09:10:45.4638766Z // end inline asm 2026-02-21T09:10:45.4638825Z bra.uni $L__BB0_25; 2026-02-21T09:10:45.4638888Z $L__tmp186: 2026-02-21T09:10:45.4638976Z $L__BB0_27: // %.preheader 2026-02-21T09:10:45.4639142Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4639217Z setp.gt.s32 %p270, %r1884, %r4; 2026-02-21T09:10:45.4639278Z @%p270 bra $L__BB0_36; 2026-02-21T09:10:45.4639335Z bra.uni $L__BB0_28; 2026-02-21T09:10:45.4639420Z $L__BB0_36: // %._crit_edge 2026-02-21T09:10:45.4639594Z .loc 1 31 4 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:4 2026-02-21T09:10:45.4639651Z bar.sync 0; 2026-02-21T09:10:45.4639710Z // begin inline asm 2026-02-21T09:10:45.4639836Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r1846, 128; 2026-02-21T09:10:45.4639893Z // end inline asm 2026-02-21T09:10:45.4639950Z ret; 2026-02-21T09:10:45.4640042Z $L__BB0_28: // %.lr.ph343 2026-02-21T09:10:45.4640203Z .loc 1 0 4 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:0:4 2026-02-21T09:10:45.4640291Z and.b32 %r41, %r1847, 4080; 2026-02-21T09:10:45.4640354Z add.s32 %r42, %r179, %r41; 2026-02-21T09:10:45.4640426Z add.s32 %r43, %r42, 4096; 2026-02-21T09:10:45.4640486Z add.s32 %r1576, %r42, 8192; 2026-02-21T09:10:45.4640547Z add.s32 %r1578, %r42, 12288; 2026-02-21T09:10:45.4640616Z shl.b32 %r1495, %r1849, 6; 2026-02-21T09:10:45.4640677Z shr.u32 %r1497, %r1850, 2; 2026-02-21T09:10:45.4640738Z or.b32 %r47, %r1495, %r1497; 2026-02-21T09:10:45.4640821Z add.s32 %r48, %r179, %r47; 2026-02-21T09:10:45.4640884Z shr.u32 %r1499, %r1850, 1; 2026-02-21T09:10:45.4640950Z or.b32 %r49, %r1499, %r1851; 2026-02-21T09:10:45.4641009Z add.s32 %r1500, %r179, 24576; 2026-02-21T09:10:45.4641067Z add.s32 %r50, %r1500, %r49; 2026-02-21T09:10:45.4641126Z xor.b32 %r51, %r49, 16; 2026-02-21T09:10:45.4641190Z add.s32 %r52, %r1500, %r51; 2026-02-21T09:10:45.4641248Z xor.b32 
%r53, %r49, 32; 2026-02-21T09:10:45.4641306Z add.s32 %r54, %r1500, %r53; 2026-02-21T09:10:45.4641371Z xor.b32 %r55, %r49, 48; 2026-02-21T09:10:45.4641428Z add.s32 %r56, %r1500, %r55; 2026-02-21T09:10:45.4641506Z shl.b32 %r1501, %r1851, 7; 2026-02-21T09:10:45.4641602Z and.b32 %r1502, %r1847, 112; 2026-02-21T09:10:45.4641668Z and.b32 %r1504, %r1852, 12; 2026-02-21T09:10:45.4641727Z or.b32 %r1505, %r1501, %r1504; 2026-02-21T09:10:45.4641785Z or.b32 %r1506, %r1505, %r1502; 2026-02-21T09:10:45.4641877Z add.s32 %r1507, %r179, 16384; 2026-02-21T09:10:45.4641940Z add.s32 %r57, %r1507, %r1506; 2026-02-21T09:10:45.4641998Z xor.b32 %r1508, %r1506, 16; 2026-02-21T09:10:45.4642061Z add.s32 %r58, %r1507, %r1508; 2026-02-21T09:10:45.4642118Z xor.b32 %r1509, %r1506, 32; 2026-02-21T09:10:45.4642176Z add.s32 %r59, %r1507, %r1509; 2026-02-21T09:10:45.4642232Z xor.b32 %r1510, %r1506, 48; 2026-02-21T09:10:45.4642297Z add.s32 %r60, %r1507, %r1510; 2026-02-21T09:10:45.4642352Z xor.b32 %r1511, %r1506, 64; 2026-02-21T09:10:45.4642409Z add.s32 %r61, %r1507, %r1511; 2026-02-21T09:10:45.4642475Z xor.b32 %r1512, %r1506, 80; 2026-02-21T09:10:45.4642533Z add.s32 %r62, %r1507, %r1512; 2026-02-21T09:10:45.4642591Z xor.b32 %r1513, %r1506, 96; 2026-02-21T09:10:45.4642647Z add.s32 %r63, %r1507, %r1513; 2026-02-21T09:10:45.4642714Z xor.b32 %r1514, %r1506, 112; 2026-02-21T09:10:45.4642771Z add.s32 %r64, %r1507, %r1514; 2026-02-21T09:10:45.4642830Z bfe.u32 %r1515, %r1507, 4, 14; 2026-02-21T09:10:45.4642896Z cvt.u64.u32 %rd472, %r1515; 2026-02-21T09:10:45.4642966Z or.b64 %rd8, %rd472, 4611686293338849280; 2026-02-21T09:10:45.4643023Z add.s32 %r1516, %r179, 16416; 2026-02-21T09:10:45.4643082Z bfe.u32 %r1517, %r1516, 4, 14; 2026-02-21T09:10:45.4643148Z cvt.u64.u32 %rd473, %r1517; 2026-02-21T09:10:45.4643214Z or.b64 %rd9, %rd473, 4611686293338849280; 2026-02-21T09:10:45.4643273Z add.s32 %r1518, %r179, 16448; 2026-02-21T09:10:45.4643338Z bfe.u32 %r1519, %r1518, 4, 14; 2026-02-21T09:10:45.4643396Z cvt.u64.u32 
%rd474, %r1519; 2026-02-21T09:10:45.4643468Z or.b64 %rd10, %rd474, 4611686293338849280; 2026-02-21T09:10:45.4643532Z add.s32 %r1520, %r179, 16480; 2026-02-21T09:10:45.4643591Z bfe.u32 %r1521, %r1520, 4, 14; 2026-02-21T09:10:45.4643649Z cvt.u64.u32 %rd475, %r1521; 2026-02-21T09:10:45.4643717Z or.b64 %rd11, %rd475, 4611686293338849280; 2026-02-21T09:10:45.4643782Z shl.b32 %r1522, %r1849, 7; 2026-02-21T09:10:45.4643842Z xor.b32 %r1523, %r1502, %r1499; 2026-02-21T09:10:45.4643902Z or.b32 %r1524, %r1523, %r1522; 2026-02-21T09:10:45.4643967Z add.s32 %r65, %r179, %r1524; 2026-02-21T09:10:45.4644023Z xor.b32 %r1525, %r1524, 16; 2026-02-21T09:10:45.4644080Z add.s32 %r66, %r179, %r1525; 2026-02-21T09:10:45.4644135Z xor.b32 %r1526, %r1524, 32; 2026-02-21T09:10:45.4644199Z add.s32 %r67, %r179, %r1526; 2026-02-21T09:10:45.4644255Z xor.b32 %r1527, %r1524, 48; 2026-02-21T09:10:45.4644312Z add.s32 %r68, %r179, %r1527; 2026-02-21T09:10:45.4644485Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4644578Z mad.wide.u32 %rd476, %r1853, 16, %rd52; 2026-02-21T09:10:45.4644637Z add.s64 %rd12, %rd476, 192; 2026-02-21T09:10:45.4644693Z bra.uni $L__BB0_29; 2026-02-21T09:10:45.4644802Z $L__BB0_35: // in Loop: Header=BB0_29 Depth=1 2026-02-21T09:10:45.4644967Z .loc 1 0 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:0:97 2026-02-21T09:10:45.4645024Z mov.b32 %r1755, 1; 2026-02-21T09:10:45.4645115Z $L__tmp187: 2026-02-21T09:10:45.4645323Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4645383Z // begin inline asm 2026-02-21T09:10:45.4645447Z 2026-02-21T09:10:45.4645499Z { 2026-02-21T09:10:45.4645560Z .reg .pred complete; 2026-02-21T09:10:45.4645615Z waitLoop: 2026-02-21T09:10:45.4645742Z mbarrier.try_wait.parity.shared.b64 complete, [%r1886], %r1755; 2026-02-21T09:10:45.4645806Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4645858Z } 
2026-02-21T09:10:45.4645862Z 2026-02-21T09:10:45.4645924Z // end inline asm 2026-02-21T09:10:45.4646003Z $L__tmp188: 2026-02-21T09:10:45.4646172Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4646242Z cp.async.wait_group 0; 2026-02-21T09:10:45.4646297Z bar.sync 0; 2026-02-21T09:10:45.4646360Z // begin inline asm 2026-02-21T09:10:45.4646471Z @%p273 mbarrier.inval.shared::cta.b64 [%r1662]; 2026-02-21T09:10:45.4646537Z // end inline asm 2026-02-21T09:10:45.4646591Z bar.sync 0; 2026-02-21T09:10:45.4646647Z // begin inline asm 2026-02-21T09:10:45.4646740Z @%p273 mbarrier.inval.shared::cta.b64 [%r1566]; 2026-02-21T09:10:45.4646796Z // end inline asm 2026-02-21T09:10:45.4646856Z add.s32 %r1758, %r179, 26624; 2026-02-21T09:10:45.4646918Z // begin inline asm 2026-02-21T09:10:45.4646998Z @%p273 mbarrier.inval.shared::cta.b64 [%r1758]; 2026-02-21T09:10:45.4647051Z // end inline asm 2026-02-21T09:10:45.4647106Z bar.sync 0; 2026-02-21T09:10:45.4647171Z // begin inline asm 2026-02-21T09:10:45.4647251Z @%p273 mbarrier.inval.shared::cta.b64 [%r1564]; 2026-02-21T09:10:45.4647306Z // end inline asm 2026-02-21T09:10:45.4647365Z $L__tmp189: 2026-02-21T09:10:45.4647576Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4647631Z // begin inline asm 2026-02-21T09:10:45.4647942Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1760, %r1761, %r1762, %r1763, %r1764, %r1765, %r1766, %r1767, %r1768, %r1769, %r1770, %r1771, %r1772, %r1773, %r1774, %r1775}, [%r1793 + 0]; 2026-02-21T09:10:45.4647998Z // end inline asm 2026-02-21T09:10:45.4648054Z // begin inline asm 2026-02-21T09:10:45.4648342Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1777, %r1778, %r1779, %r1780, %r1781, %r1782, %r1783, %r1784, %r1785, %r1786, %r1787, %r1788, %r1789, %r1790, %r1791, %r1792}, [%r1793 + 16]; 2026-02-21T09:10:45.4648405Z // end inline asm 2026-02-21T09:10:45.4648460Z // begin inline asm 
2026-02-21T09:10:45.4648530Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:45.4648592Z // end inline asm 2026-02-21T09:10:45.4648652Z cvt.u64.u32 %rd508, %r1760; 2026-02-21T09:10:45.4648710Z cvt.u64.u32 %rd509, %r1761; 2026-02-21T09:10:45.4648776Z shl.b64 %rd510, %rd509, 32; 2026-02-21T09:10:45.4648837Z or.b64 %rd511, %rd508, %rd510; 2026-02-21T09:10:45.4648890Z $L__tmp190: 2026-02-21T09:10:45.4649054Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4649126Z mov.b64 {%r1797, %r1798}, %rd511; 2026-02-21T09:10:45.4649197Z cvt.rn.bf16x2.f32 %r1799, %r1798, %r1797; 2026-02-21T09:10:45.4649248Z $L__tmp191: 2026-02-21T09:10:45.4649459Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4649517Z cvt.u64.u32 %rd512, %r1762; 2026-02-21T09:10:45.4649575Z cvt.u64.u32 %rd513, %r1763; 2026-02-21T09:10:45.4649660Z shl.b64 %rd514, %rd513, 32; 2026-02-21T09:10:45.4649729Z or.b64 %rd515, %rd512, %rd514; 2026-02-21T09:10:45.4649781Z $L__tmp192: 2026-02-21T09:10:45.4649945Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4650015Z mov.b64 {%r1800, %r1801}, %rd515; 2026-02-21T09:10:45.4650085Z cvt.rn.bf16x2.f32 %r1802, %r1801, %r1800; 2026-02-21T09:10:45.4650139Z $L__tmp193: 2026-02-21T09:10:45.4650392Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4650450Z cvt.u64.u32 %rd516, %r1764; 2026-02-21T09:10:45.4650508Z cvt.u64.u32 %rd517, %r1765; 2026-02-21T09:10:45.4650568Z shl.b64 %rd518, %rd517, 32; 2026-02-21T09:10:45.4650633Z or.b64 %rd519, %rd516, %rd518; 2026-02-21T09:10:45.4650685Z $L__tmp194: 2026-02-21T09:10:45.4650844Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4650912Z mov.b64 {%r1803, %r1804}, %rd519; 2026-02-21T09:10:45.4651003Z cvt.rn.bf16x2.f32 %r1805, %r1804, %r1803; 
2026-02-21T09:10:45.4651056Z $L__tmp195: 2026-02-21T09:10:45.4651259Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4651316Z cvt.u64.u32 %rd520, %r1766; 2026-02-21T09:10:45.4651395Z cvt.u64.u32 %rd521, %r1767; 2026-02-21T09:10:45.4651455Z shl.b64 %rd522, %rd521, 32; 2026-02-21T09:10:45.4651521Z or.b64 %rd523, %rd520, %rd522; 2026-02-21T09:10:45.4651620Z $L__tmp196: 2026-02-21T09:10:45.4651785Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4651851Z mov.b64 {%r1806, %r1807}, %rd523; 2026-02-21T09:10:45.4651918Z cvt.rn.bf16x2.f32 %r1808, %r1807, %r1806; 2026-02-21T09:10:45.4651969Z $L__tmp197: 2026-02-21T09:10:45.4652177Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4652237Z cvt.u64.u32 %rd524, %r1768; 2026-02-21T09:10:45.4652294Z cvt.u64.u32 %rd525, %r1769; 2026-02-21T09:10:45.4652351Z shl.b64 %rd526, %rd525, 32; 2026-02-21T09:10:45.4652417Z or.b64 %rd527, %rd524, %rd526; 2026-02-21T09:10:45.4652469Z $L__tmp198: 2026-02-21T09:10:45.4652631Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4652700Z mov.b64 {%r1809, %r1810}, %rd527; 2026-02-21T09:10:45.4652768Z cvt.rn.bf16x2.f32 %r1811, %r1810, %r1809; 2026-02-21T09:10:45.4652821Z $L__tmp199: 2026-02-21T09:10:45.4653030Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4653088Z cvt.u64.u32 %rd528, %r1770; 2026-02-21T09:10:45.4653146Z cvt.u64.u32 %rd529, %r1771; 2026-02-21T09:10:45.4653204Z shl.b64 %rd530, %rd529, 32; 2026-02-21T09:10:45.4653274Z or.b64 %rd531, %rd528, %rd530; 2026-02-21T09:10:45.4653326Z $L__tmp200: 2026-02-21T09:10:45.4653488Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4653557Z mov.b64 {%r1812, %r1813}, %rd531; 
2026-02-21T09:10:45.4653626Z cvt.rn.bf16x2.f32 %r1814, %r1813, %r1812; 2026-02-21T09:10:45.4653680Z $L__tmp201: 2026-02-21T09:10:45.4653887Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4653953Z cvt.u64.u32 %rd532, %r1772; 2026-02-21T09:10:45.4654010Z cvt.u64.u32 %rd533, %r1773; 2026-02-21T09:10:45.4654069Z shl.b64 %rd534, %rd533, 32; 2026-02-21T09:10:45.4654137Z or.b64 %rd535, %rd532, %rd534; 2026-02-21T09:10:45.4654189Z $L__tmp202: 2026-02-21T09:10:45.4654348Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4654416Z mov.b64 {%r1815, %r1816}, %rd535; 2026-02-21T09:10:45.4654512Z cvt.rn.bf16x2.f32 %r1817, %r1816, %r1815; 2026-02-21T09:10:45.4654565Z $L__tmp203: 2026-02-21T09:10:45.4654772Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4654839Z cvt.u64.u32 %rd536, %r1774; 2026-02-21T09:10:45.4654898Z cvt.u64.u32 %rd537, %r1775; 2026-02-21T09:10:45.4654958Z shl.b64 %rd538, %rd537, 32; 2026-02-21T09:10:45.4655026Z or.b64 %rd539, %rd536, %rd538; 2026-02-21T09:10:45.4655106Z $L__tmp204: 2026-02-21T09:10:45.4655265Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4655331Z mov.b64 {%r1818, %r1819}, %rd539; 2026-02-21T09:10:45.4655398Z cvt.rn.bf16x2.f32 %r1820, %r1819, %r1818; 2026-02-21T09:10:45.4655450Z $L__tmp205: 2026-02-21T09:10:45.4655651Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4655719Z cvt.u64.u32 %rd540, %r1777; 2026-02-21T09:10:45.4655775Z cvt.u64.u32 %rd541, %r1778; 2026-02-21T09:10:45.4655857Z shl.b64 %rd542, %rd541, 32; 2026-02-21T09:10:45.4655925Z or.b64 %rd543, %rd540, %rd542; 2026-02-21T09:10:45.4655977Z $L__tmp206: 2026-02-21T09:10:45.4656140Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 
2026-02-21T09:10:45.4656232Z mov.b64 {%r1821, %r1822}, %rd543; 2026-02-21T09:10:45.4656302Z cvt.rn.bf16x2.f32 %r1823, %r1822, %r1821; 2026-02-21T09:10:45.4656354Z $L__tmp207: 2026-02-21T09:10:45.4656555Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4656621Z cvt.u64.u32 %rd544, %r1779; 2026-02-21T09:10:45.4656678Z cvt.u64.u32 %rd545, %r1780; 2026-02-21T09:10:45.4656736Z shl.b64 %rd546, %rd545, 32; 2026-02-21T09:10:45.4656801Z or.b64 %rd547, %rd544, %rd546; 2026-02-21T09:10:45.4656857Z $L__tmp208: 2026-02-21T09:10:45.4657016Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4657076Z mov.b64 {%r1824, %r1825}, %rd547; 2026-02-21T09:10:45.4657150Z cvt.rn.bf16x2.f32 %r1826, %r1825, %r1824; 2026-02-21T09:10:45.4657203Z $L__tmp209: 2026-02-21T09:10:45.4657407Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4657473Z cvt.u64.u32 %rd548, %r1781; 2026-02-21T09:10:45.4657532Z cvt.u64.u32 %rd549, %r1782; 2026-02-21T09:10:45.4657588Z shl.b64 %rd550, %rd549, 32; 2026-02-21T09:10:45.4657653Z or.b64 %rd551, %rd548, %rd550; 2026-02-21T09:10:45.4657704Z $L__tmp210: 2026-02-21T09:10:45.4657865Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4657924Z mov.b64 {%r1827, %r1828}, %rd551; 2026-02-21T09:10:45.4657997Z cvt.rn.bf16x2.f32 %r1829, %r1828, %r1827; 2026-02-21T09:10:45.4658051Z $L__tmp211: 2026-02-21T09:10:45.4658255Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4658321Z cvt.u64.u32 %rd552, %r1783; 2026-02-21T09:10:45.4658379Z cvt.u64.u32 %rd553, %r1784; 2026-02-21T09:10:45.4658436Z shl.b64 %rd554, %rd553, 32; 2026-02-21T09:10:45.4658501Z or.b64 %rd555, %rd552, %rd554; 2026-02-21T09:10:45.4658555Z $L__tmp212: 2026-02-21T09:10:45.4658722Z .loc 1 97 28 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4658783Z mov.b64 {%r1830, %r1831}, %rd555; 2026-02-21T09:10:45.4658857Z cvt.rn.bf16x2.f32 %r1832, %r1831, %r1830; 2026-02-21T09:10:45.4658909Z $L__tmp213: 2026-02-21T09:10:45.4659113Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4659177Z cvt.u64.u32 %rd556, %r1785; 2026-02-21T09:10:45.4659257Z cvt.u64.u32 %rd557, %r1786; 2026-02-21T09:10:45.4659314Z shl.b64 %rd558, %rd557, 32; 2026-02-21T09:10:45.4659378Z or.b64 %rd559, %rd556, %rd558; 2026-02-21T09:10:45.4659430Z $L__tmp214: 2026-02-21T09:10:45.4659590Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4659648Z mov.b64 {%r1833, %r1834}, %rd559; 2026-02-21T09:10:45.4659724Z cvt.rn.bf16x2.f32 %r1835, %r1834, %r1833; 2026-02-21T09:10:45.4659806Z $L__tmp215: 2026-02-21T09:10:45.4660006Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4660070Z cvt.u64.u32 %rd560, %r1787; 2026-02-21T09:10:45.4660126Z cvt.u64.u32 %rd561, %r1788; 2026-02-21T09:10:45.4660183Z shl.b64 %rd562, %rd561, 32; 2026-02-21T09:10:45.4660248Z or.b64 %rd563, %rd560, %rd562; 2026-02-21T09:10:45.4660299Z $L__tmp216: 2026-02-21T09:10:45.4660458Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4660519Z mov.b64 {%r1836, %r1837}, %rd563; 2026-02-21T09:10:45.4660612Z cvt.rn.bf16x2.f32 %r1838, %r1837, %r1836; 2026-02-21T09:10:45.4660666Z $L__tmp217: 2026-02-21T09:10:45.4660865Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4660931Z cvt.u64.u32 %rd564, %r1789; 2026-02-21T09:10:45.4661008Z cvt.u64.u32 %rd565, %r1790; 2026-02-21T09:10:45.4661069Z shl.b64 %rd566, %rd565, 32; 2026-02-21T09:10:45.4661129Z or.b64 %rd567, %rd564, %rd566; 
2026-02-21T09:10:45.4661189Z $L__tmp218: 2026-02-21T09:10:45.4661349Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4661408Z mov.b64 {%r1839, %r1840}, %rd567; 2026-02-21T09:10:45.4661482Z cvt.rn.bf16x2.f32 %r1841, %r1840, %r1839; 2026-02-21T09:10:45.4661575Z $L__tmp219: 2026-02-21T09:10:45.4661780Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4661847Z cvt.u64.u32 %rd568, %r1791; 2026-02-21T09:10:45.4661906Z cvt.u64.u32 %rd569, %r1792; 2026-02-21T09:10:45.4661964Z shl.b64 %rd570, %rd569, 32; 2026-02-21T09:10:45.4662023Z or.b64 %rd571, %rd568, %rd570; 2026-02-21T09:10:45.4662083Z $L__tmp220: 2026-02-21T09:10:45.4662245Z .loc 1 97 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:97:28 2026-02-21T09:10:45.4662305Z mov.b64 {%r1842, %r1843}, %rd571; 2026-02-21T09:10:45.4662379Z cvt.rn.bf16x2.f32 %r1844, %r1843, %r1842; 2026-02-21T09:10:45.4662537Z .loc 1 98 43 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:98:43 2026-02-21T09:10:45.4662636Z st.shared.v4.b32 [%r65], {%r1799, %r1802, %r1805, %r1808}; 2026-02-21T09:10:45.4662739Z st.shared.v4.b32 [%r66], {%r1811, %r1814, %r1817, %r1820}; 2026-02-21T09:10:45.4662831Z st.shared.v4.b32 [%r67], {%r1823, %r1826, %r1829, %r1832}; 2026-02-21T09:10:45.4662922Z st.shared.v4.b32 [%r68], {%r1835, %r1838, %r1841, %r1844}; 2026-02-21T09:10:45.4662981Z // begin inline asm 2026-02-21T09:10:45.4663061Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4663119Z // end inline asm 2026-02-21T09:10:45.4663175Z bar.sync 0; 2026-02-21T09:10:45.4663252Z elect.sync %r1845|%p328, -1; 2026-02-21T09:10:45.4663317Z and.pred %p326, %p1, %p328; 2026-02-21T09:10:45.4663378Z // begin inline asm 2026-02-21T09:10:45.4663578Z @%p326 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd507, {%r1794, %r1795}], [%r179]; 2026-02-21T09:10:45.4663637Z // end inline asm 2026-02-21T09:10:45.4663707Z 
cp.async.bulk.commit_group; 2026-02-21T09:10:45.4663783Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:45.4663849Z bar.sync 0; 2026-02-21T09:10:45.4664010Z .loc 1 31 97 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:31:97 2026-02-21T09:10:45.4664069Z add.s32 %r178, %r1884, 1; 2026-02-21T09:10:45.4664170Z setp.lt.s32 %p329, %r1884, %r4; 2026-02-21T09:10:45.4664230Z mov.b32 %r1884, %r178; 2026-02-21T09:10:45.4664290Z @%p329 bra $L__BB0_29; 2026-02-21T09:10:45.4664347Z bra.uni $L__BB0_36; 2026-02-21T09:10:45.4664454Z $L__BB0_29: // =>This Loop Header: Depth=1 2026-02-21T09:10:45.4664545Z // Child Loop BB0_32 Depth 2 2026-02-21T09:10:45.4664709Z .loc 1 37 35 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:37:35 2026-02-21T09:10:45.4664800Z shr.s32 %r1604, %r1884, 31; 2026-02-21T09:10:45.4664859Z shr.u32 %r1605, %r1604, 23; 2026-02-21T09:10:45.4664920Z add.s32 %r1606, %r1884, %r1605; 2026-02-21T09:10:45.4665087Z .loc 1 40 45 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:45 2026-02-21T09:10:45.4665148Z and.b32 %r1607, %r1606, 65024; 2026-02-21T09:10:45.4665209Z sub.s32 %r1608, %r1884, %r1607; 2026-02-21T09:10:45.4665373Z .loc 1 40 64 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:64 2026-02-21T09:10:45.4665457Z cvt.u16.u32 %rs449, %r1608; 2026-02-21T09:10:45.4665615Z .loc 1 41 51 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:41:51 2026-02-21T09:10:45.4665674Z shr.s16 %rs450, %rs449, 15; 2026-02-21T09:10:45.4665739Z shr.u16 %rs451, %rs450, 12; 2026-02-21T09:10:45.4665823Z add.s16 %rs452, %rs449, %rs451; 2026-02-21T09:10:45.4665885Z shr.s16 %rs453, %rs452, 4; 2026-02-21T09:10:45.4666052Z .loc 1 40 64 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:40:64 2026-02-21T09:10:45.4666115Z and.b16 %rs454, %rs452, -16; 2026-02-21T09:10:45.4666175Z sub.s16 %rs455, %rs449, %rs454; 2026-02-21T09:10:45.4666343Z .loc 1 42 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:42:27 
2026-02-21T09:10:45.4666401Z shl.b32 %r1609, %r1606, 1; 2026-02-21T09:10:45.4666460Z and.b32 %r1610, %r1609, -1024; 2026-02-21T09:10:45.4666532Z mad.wide.s16 %r1794, %rs455, 64, %r1610; 2026-02-21T09:10:45.4666704Z .loc 1 43 27 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:43:27 2026-02-21T09:10:45.4666767Z mul.wide.s16 %r1795, %rs453, 128; 2026-02-21T09:10:45.4666925Z .loc 1 44 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:44:32 2026-02-21T09:10:45.4666992Z or.b32 %r1611, %r1795, %r7; 2026-02-21T09:10:45.4667050Z or.b32 %r1612, %r1795, %r8; 2026-02-21T09:10:45.4667207Z .loc 1 58 53 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:53 2026-02-21T09:10:45.4667269Z shl.b32 %r1613, %r1611, 10; 2026-02-21T09:10:45.4667324Z shl.b32 %r1614, %r1612, 10; 2026-02-21T09:10:45.4667376Z $L__tmp221: 2026-02-21T09:10:45.4667583Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4667664Z shfl.sync.idx.b32 %r161, %r6, 0, 31, -1; 2026-02-21T09:10:45.4667722Z shl.b32 %r1615, %r161, 21; 2026-02-21T09:10:45.4667784Z and.b32 %r1616, %r1615, 6291456; 2026-02-21T09:10:45.4667848Z add.s32 %r1617, %r1616, %r1846; 2026-02-21T09:10:45.4667907Z and.b32 %r1618, %r161, 4; 2026-02-21T09:10:45.4667965Z shl.b32 %r1619, %r1618, 3; 2026-02-21T09:10:45.4668029Z add.s32 %r1793, %r1617, %r1619; 2026-02-21T09:10:45.4668090Z mov.pred %p289, -1; 2026-02-21T09:10:45.4668146Z mov.b32 %r1885, 0; 2026-02-21T09:10:45.4668205Z // begin inline asm 2026-02-21T09:10:45.4668531Z @%p289 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1793 + 0], {%r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885}; 2026-02-21T09:10:45.4668588Z // end inline asm 2026-02-21T09:10:45.4668644Z // begin inline asm 2026-02-21T09:10:45.4668953Z @%p289 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1793 + 16], {%r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, 
%r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885, %r1885}; 2026-02-21T09:10:45.4669042Z // end inline asm 2026-02-21T09:10:45.4669099Z // begin inline asm 2026-02-21T09:10:45.4669176Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4669230Z // end inline asm 2026-02-21T09:10:45.4669283Z bar.sync 0; 2026-02-21T09:10:45.4669335Z $L__tmp222: 2026-02-21T09:10:45.4669509Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4669589Z add.s32 %r1886, %r179, 26624; 2026-02-21T09:10:45.4669645Z // begin inline asm 2026-02-21T09:10:45.4669740Z @%p273 mbarrier.init.shared::cta.b64 [%r1886], 1; 2026-02-21T09:10:45.4669794Z // end inline asm 2026-02-21T09:10:45.4669848Z bar.sync 0; 2026-02-21T09:10:45.4669907Z add.s32 %r1564, %r179, 26632; 2026-02-21T09:10:45.4669968Z // begin inline asm 2026-02-21T09:10:45.4670054Z @%p273 mbarrier.init.shared::cta.b64 [%r1564], 1; 2026-02-21T09:10:45.4670108Z // end inline asm 2026-02-21T09:10:45.4670174Z add.s32 %r1662, %r179, 26640; 2026-02-21T09:10:45.4670229Z // begin inline asm 2026-02-21T09:10:45.4670332Z @%p273 mbarrier.init.shared::cta.b64 [%r1662], 1; 2026-02-21T09:10:45.4670396Z // end inline asm 2026-02-21T09:10:45.4670450Z bar.sync 0; 2026-02-21T09:10:45.4670509Z add.s32 %r1566, %r179, 26648; 2026-02-21T09:10:45.4670565Z // begin inline asm 2026-02-21T09:10:45.4670652Z @%p273 mbarrier.init.shared::cta.b64 [%r1566], 1; 2026-02-21T09:10:45.4670731Z // end inline asm 2026-02-21T09:10:45.4670896Z .loc 1 58 60 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:60 2026-02-21T09:10:45.4670963Z or.b32 %r1621, %r1613, %r9; 2026-02-21T09:10:45.4671021Z or.b32 %r1622, %r1614, %r9; 2026-02-21T09:10:45.4671188Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4671266Z mad.wide.s32 %rd477, %r1621, 2, %rd52; 2026-02-21T09:10:45.4671338Z mad.wide.s32 %rd478, %r1622, 2, %rd52; 2026-02-21T09:10:45.4671400Z mov.b32 %r1659, 16; 
2026-02-21T09:10:45.4671603Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4671674Z // begin inline asm 2026-02-21T09:10:45.4671802Z cp.async.cg.shared.global [ %r42 + 0 ], [ %rd477 + 0 ], 0x10, %r1659; 2026-02-21T09:10:45.4671861Z // end inline asm 2026-02-21T09:10:45.4671927Z // begin inline asm 2026-02-21T09:10:45.4672054Z cp.async.cg.shared.global [ %r43 + 0 ], [ %rd478 + 0 ], 0x10, %r1659; 2026-02-21T09:10:45.4672115Z // end inline asm 2026-02-21T09:10:45.4672181Z cp.async.commit_group; 2026-02-21T09:10:45.4672364Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4672424Z bar.sync 0; 2026-02-21T09:10:45.4672483Z // begin inline asm 2026-02-21T09:10:45.4672610Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1662], 1024; 2026-02-21T09:10:45.4672667Z // end inline asm 2026-02-21T09:10:45.4672835Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4672901Z bar.sync 0; 2026-02-21T09:10:45.4672969Z elect.sync %r1623|%p284, -1; 2026-02-21T09:10:45.4673037Z and.pred %p278, %p1, %p284; 2026-02-21T09:10:45.4673096Z // begin inline asm 2026-02-21T09:10:45.4673367Z @%p278 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1500], [%rd497, {%r1794, %r1885}], [%r1662]; 2026-02-21T09:10:45.4673426Z // end inline asm 2026-02-21T09:10:45.4673592Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4673668Z cvt.s64.s32 %rd483, %r1613; 2026-02-21T09:10:45.4673734Z or.b64 %rd485, %rd483, %rd572; 2026-02-21T09:10:45.4673798Z shl.b64 %rd486, %rd485, 1; 2026-02-21T09:10:45.4673871Z add.s64 %rd42, %rd52, %rd486; 2026-02-21T09:10:45.4673934Z add.s64 %rd480, %rd42, 64; 2026-02-21T09:10:45.4673995Z cvt.s64.s32 %rd487, %r1614; 2026-02-21T09:10:45.4674091Z or.b64 %rd488, %rd487, %rd572; 2026-02-21T09:10:45.4674158Z shl.b64 %rd489, %rd488, 1; 
2026-02-21T09:10:45.4674221Z add.s64 %rd43, %rd52, %rd489; 2026-02-21T09:10:45.4674282Z add.s64 %rd481, %rd43, 64; 2026-02-21T09:10:45.4674461Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4674520Z // begin inline asm 2026-02-21T09:10:45.4674647Z cp.async.cg.shared.global [ %r1576 + 0 ], [ %rd480 + 0 ], 0x10, %r1659; 2026-02-21T09:10:45.4674746Z // end inline asm 2026-02-21T09:10:45.4674804Z // begin inline asm 2026-02-21T09:10:45.4674926Z cp.async.cg.shared.global [ %r1578 + 0 ], [ %rd481 + 0 ], 0x10, %r1659; 2026-02-21T09:10:45.4674984Z // end inline asm 2026-02-21T09:10:45.4675056Z cp.async.commit_group; 2026-02-21T09:10:45.4675233Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4675290Z bar.sync 0; 2026-02-21T09:10:45.4675356Z // begin inline asm 2026-02-21T09:10:45.4675473Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1566], 1024; 2026-02-21T09:10:45.4675558Z // end inline asm 2026-02-21T09:10:45.4675733Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4675788Z bar.sync 0; 2026-02-21T09:10:45.4675855Z elect.sync %r1624|%p285, -1; 2026-02-21T09:10:45.4675922Z and.pred %p280, %p1, %p285; 2026-02-21T09:10:45.4676015Z add.s32 %r1581, %r179, 25600; 2026-02-21T09:10:45.4676076Z // begin inline asm 2026-02-21T09:10:45.4676338Z @%p280 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1581], [%rd497, {%r1794, %r1659}], [%r1566]; 2026-02-21T09:10:45.4676403Z // end inline asm 2026-02-21T09:10:45.4676572Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4676639Z cp.async.wait_group 1; 2026-02-21T09:10:45.4676701Z bar.sync 0; 2026-02-21T09:10:45.4676804Z ld.shared.v4.b32 {%r1625, %r1626, %r1627, %r1628}, [%r48]; 2026-02-21T09:10:45.4676871Z mov.b32 {%rs456, %rs457}, %r1628; 2026-02-21T09:10:45.4676934Z mov.b32 {%rs458, %rs459}, %r1627; 
2026-02-21T09:10:45.4677002Z mov.b32 {%rs460, %rs461}, %r1626; 2026-02-21T09:10:45.4677061Z mov.b32 {%rs462, %rs463}, %r1625; 2026-02-21T09:10:45.4677166Z ld.shared.v4.b32 {%r1629, %r1630, %r1631, %r1632}, [%r48+16]; 2026-02-21T09:10:45.4677234Z mov.b32 {%rs464, %rs465}, %r1632; 2026-02-21T09:10:45.4677296Z mov.b32 {%rs466, %rs467}, %r1631; 2026-02-21T09:10:45.4677354Z mov.b32 {%rs468, %rs469}, %r1630; 2026-02-21T09:10:45.4677414Z mov.b32 {%rs470, %rs471}, %r1629; 2026-02-21T09:10:45.4677589Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 2026-02-21T09:10:45.4677654Z cvt.f32.bf16 %r1586, %rs462; 2026-02-21T09:10:45.4677718Z cvt.f32.bf16 %r1587, %rs463; 2026-02-21T09:10:45.4677785Z cvt.f32.bf16 %r1588, %rs460; 2026-02-21T09:10:45.4677846Z cvt.f32.bf16 %r1589, %rs461; 2026-02-21T09:10:45.4677904Z cvt.f32.bf16 %r1590, %rs458; 2026-02-21T09:10:45.4677969Z cvt.f32.bf16 %r1591, %rs459; 2026-02-21T09:10:45.4678029Z cvt.f32.bf16 %r1592, %rs456; 2026-02-21T09:10:45.4678087Z cvt.f32.bf16 %r1593, %rs457; 2026-02-21T09:10:45.4678146Z cvt.f32.bf16 %r1594, %rs470; 2026-02-21T09:10:45.4678212Z cvt.f32.bf16 %r1595, %rs471; 2026-02-21T09:10:45.4678271Z cvt.f32.bf16 %r1596, %rs468; 2026-02-21T09:10:45.4678331Z cvt.f32.bf16 %r1597, %rs469; 2026-02-21T09:10:45.4678397Z cvt.f32.bf16 %r1598, %rs466; 2026-02-21T09:10:45.4678454Z cvt.f32.bf16 %r1599, %rs467; 2026-02-21T09:10:45.4678510Z cvt.f32.bf16 %r1600, %rs464; 2026-02-21T09:10:45.4678569Z cvt.f32.bf16 %r1601, %rs465; 2026-02-21T09:10:45.4678630Z $L__tmp223: 2026-02-21T09:10:45.4678854Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4678917Z add.s32 %r1633, %r1616, %r1644; 2026-02-21T09:10:45.4679013Z shl.b32 %r1634, %r1618, 2; 2026-02-21T09:10:45.4679075Z add.s32 %r1678, %r1633, %r1634; 2026-02-21T09:10:45.4679134Z // begin inline asm 2026-02-21T09:10:45.4679481Z @%p289 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1678 + 0], 
{%r1586, %r1587, %r1588, %r1589, %r1590, %r1591, %r1592, %r1593, %r1594, %r1595, %r1596, %r1597, %r1598, %r1599, %r1600, %r1601}; 2026-02-21T09:10:45.4679535Z // end inline asm 2026-02-21T09:10:45.4679590Z // begin inline asm 2026-02-21T09:10:45.4679686Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4679744Z // end inline asm 2026-02-21T09:10:45.4679796Z bar.sync 0; 2026-02-21T09:10:45.4679848Z $L__tmp224: 2026-02-21T09:10:45.4680022Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4680079Z // begin inline asm 2026-02-21T09:10:45.4680130Z 2026-02-21T09:10:45.4680182Z { 2026-02-21T09:10:45.4680247Z .reg .pred complete; 2026-02-21T09:10:45.4680301Z waitLoop: 2026-02-21T09:10:45.4680420Z mbarrier.try_wait.parity.shared.b64 complete, [%r1662], %r1885; 2026-02-21T09:10:45.4680512Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4680564Z } 2026-02-21T09:10:45.4680569Z 2026-02-21T09:10:45.4680624Z // end inline asm 2026-02-21T09:10:45.4680790Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4680852Z ld.shared.b8 %rs472, [%r50]; 2026-02-21T09:10:45.4680947Z ld.shared.b8 %rs473, [%r50+512]; 2026-02-21T09:10:45.4681013Z ld.shared.b8 %rs474, [%r52+128]; 2026-02-21T09:10:45.4681085Z ld.shared.b8 %rs475, [%r52+640]; 2026-02-21T09:10:45.4681145Z ld.shared.b8 %rs476, [%r54+256]; 2026-02-21T09:10:45.4681204Z ld.shared.b8 %rs477, [%r54+768]; 2026-02-21T09:10:45.4681271Z ld.shared.b8 %rs478, [%r56+384]; 2026-02-21T09:10:45.4681331Z ld.shared.b8 %rs479, [%r56+896]; 2026-02-21T09:10:45.4681494Z .loc 1 67 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4681597Z shl.b16 %rs480, %rs472, 4; 2026-02-21T09:10:45.4681660Z shl.b16 %rs481, %rs474, 4; 2026-02-21T09:10:45.4681719Z shl.b16 %rs482, %rs476, 4; 2026-02-21T09:10:45.4681776Z shl.b16 %rs483, %rs478, 4; 2026-02-21T09:10:45.4681840Z shl.b16 %rs484, %rs473, 4; 2026-02-21T09:10:45.4681898Z 
shl.b16 %rs485, %rs475, 4; 2026-02-21T09:10:45.4681955Z shl.b16 %rs486, %rs477, 4; 2026-02-21T09:10:45.4682020Z shl.b16 %rs487, %rs479, 4; 2026-02-21T09:10:45.4682183Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4682255Z selp.b16 %rs488, %rs480, %rs472, %p331; 2026-02-21T09:10:45.4682314Z cvt.s16.s8 %rs489, %rs488; 2026-02-21T09:10:45.4682378Z shr.s16 %rs490, %rs489, 4; 2026-02-21T09:10:45.4682447Z selp.b16 %rs491, %rs481, %rs474, %p331; 2026-02-21T09:10:45.4682505Z cvt.s16.s8 %rs492, %rs491; 2026-02-21T09:10:45.4682570Z shr.s16 %rs493, %rs492, 4; 2026-02-21T09:10:45.4682635Z selp.b16 %rs494, %rs482, %rs476, %p331; 2026-02-21T09:10:45.4682694Z cvt.s16.s8 %rs495, %rs494; 2026-02-21T09:10:45.4682758Z shr.s16 %rs496, %rs495, 4; 2026-02-21T09:10:45.4682826Z selp.b16 %rs497, %rs483, %rs478, %p331; 2026-02-21T09:10:45.4682885Z cvt.s16.s8 %rs498, %rs497; 2026-02-21T09:10:45.4682942Z shr.s16 %rs499, %rs498, 4; 2026-02-21T09:10:45.4683014Z selp.b16 %rs500, %rs484, %rs473, %p331; 2026-02-21T09:10:45.4683073Z cvt.s16.s8 %rs501, %rs500; 2026-02-21T09:10:45.4683130Z shr.s16 %rs502, %rs501, 4; 2026-02-21T09:10:45.4683202Z selp.b16 %rs503, %rs485, %rs475, %p331; 2026-02-21T09:10:45.4683259Z cvt.s16.s8 %rs504, %rs503; 2026-02-21T09:10:45.4683316Z shr.s16 %rs505, %rs504, 4; 2026-02-21T09:10:45.4683380Z selp.b16 %rs506, %rs486, %rs477, %p331; 2026-02-21T09:10:45.4683446Z cvt.s16.s8 %rs507, %rs506; 2026-02-21T09:10:45.4683503Z shr.s16 %rs508, %rs507, 4; 2026-02-21T09:10:45.4683566Z selp.b16 %rs509, %rs487, %rs479, %p331; 2026-02-21T09:10:45.4683631Z cvt.s16.s8 %rs510, %rs509; 2026-02-21T09:10:45.4683719Z shr.s16 %rs511, %rs510, 4; 2026-02-21T09:10:45.4683886Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4683949Z cvt.rn.f32.s16 %r1635, %rs490; 2026-02-21T09:10:45.4684020Z cvt.rn.f32.s16 %r1636, %rs493; 2026-02-21T09:10:45.4684079Z cvt.rn.f32.s16 %r1637, %rs496; 
2026-02-21T09:10:45.4684138Z cvt.rn.f32.s16 %r1638, %rs499; 2026-02-21T09:10:45.4684206Z cvt.rn.f32.s16 %r1639, %rs502; 2026-02-21T09:10:45.4684288Z cvt.rn.f32.s16 %r1640, %rs505; 2026-02-21T09:10:45.4684346Z cvt.rn.f32.s16 %r1641, %rs508; 2026-02-21T09:10:45.4684410Z cvt.rn.f32.s16 %r1642, %rs511; 2026-02-21T09:10:45.4684471Z st.shared.b32 [%r57], %r1635; 2026-02-21T09:10:45.4684531Z st.shared.b32 [%r58], %r1636; 2026-02-21T09:10:45.4684590Z st.shared.b32 [%r59], %r1637; 2026-02-21T09:10:45.4684655Z st.shared.b32 [%r60], %r1638; 2026-02-21T09:10:45.4684713Z st.shared.b32 [%r61], %r1639; 2026-02-21T09:10:45.4684771Z st.shared.b32 [%r62], %r1640; 2026-02-21T09:10:45.4684838Z st.shared.b32 [%r63], %r1641; 2026-02-21T09:10:45.4684896Z st.shared.b32 [%r64], %r1642; 2026-02-21T09:10:45.4684973Z $L__tmp225: 2026-02-21T09:10:45.4685187Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4685251Z // begin inline asm 2026-02-21T09:10:45.4685322Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4685414Z // end inline asm 2026-02-21T09:10:45.4685478Z bar.sync 0; 2026-02-21T09:10:45.4685544Z setp.ne.b32 %p286, %r161, 0; 2026-02-21T09:10:45.4685603Z @%p286 bra $L__BB0_31; 2026-02-21T09:10:45.4685707Z // %bb.30: // in Loop: Header=BB0_29 Depth=1 2026-02-21T09:10:45.4685773Z elect.sync %r1655|%p288, -1; 2026-02-21T09:10:45.4685831Z mov.b32 %r1645, 135268624; 2026-02-21T09:10:45.4685890Z mov.pred %p287, 0; 2026-02-21T09:10:45.4685953Z // begin inline asm 2026-02-21T09:10:45.4686113Z @%p288 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd8, %r1645, %p287; 2026-02-21T09:10:45.4686168Z // end inline asm 2026-02-21T09:10:45.4686231Z // begin inline asm 2026-02-21T09:10:45.4686384Z @%p288 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 8 ], %rd9, %r1645, %p289; 2026-02-21T09:10:45.4686438Z // end inline asm 2026-02-21T09:10:45.4686500Z // begin inline asm 
2026-02-21T09:10:45.4686655Z @%p288 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 16 ], %rd10, %r1645, %p289; 2026-02-21T09:10:45.4686711Z // end inline asm 2026-02-21T09:10:45.4686767Z // begin inline asm 2026-02-21T09:10:45.4686924Z @%p288 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd11, %r1645, %p289; 2026-02-21T09:10:45.4686979Z // end inline asm 2026-02-21T09:10:45.4687038Z add.s32 %r1657, %r179, 26624; 2026-02-21T09:10:45.4687109Z cvt.u64.u32 %rd494, %r1657; 2026-02-21T09:10:45.4687164Z // begin inline asm 2026-02-21T09:10:45.4687292Z @%p288 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd494]; 2026-02-21T09:10:45.4687355Z // end inline asm 2026-02-21T09:10:45.4687409Z $L__tmp226: 2026-02-21T09:10:45.4687511Z $L__BB0_31: // in Loop: Header=BB0_29 Depth=1 2026-02-21T09:10:45.4687668Z .loc 1 0 0 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:0 2026-02-21T09:10:45.4687736Z cvt.s32.s16 %r158, %rs453; 2026-02-21T09:10:45.4687899Z .loc 1 58 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:32 2026-02-21T09:10:45.4687961Z add.s64 %rd495, %rd42, 128; 2026-02-21T09:10:45.4688031Z add.s64 %rd496, %rd43, 128; 2026-02-21T09:10:45.4688191Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4688249Z // begin inline asm 2026-02-21T09:10:45.4688376Z cp.async.cg.shared.global [ %r42 + 0 ], [ %rd495 + 0 ], 0x10, %r1659; 2026-02-21T09:10:45.4688433Z // end inline asm 2026-02-21T09:10:45.4688518Z // begin inline asm 2026-02-21T09:10:45.4688635Z cp.async.cg.shared.global [ %r43 + 0 ], [ %rd496 + 0 ], 0x10, %r1659; 2026-02-21T09:10:45.4688698Z // end inline asm 2026-02-21T09:10:45.4688761Z cp.async.commit_group; 2026-02-21T09:10:45.4688927Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4688989Z // begin inline asm 2026-02-21T09:10:45.4689105Z @%p273 mbarrier.arrive.expect_tx.shared.b64 _, [%r1662], 
1024; 2026-02-21T09:10:45.4689182Z // end inline asm 2026-02-21T09:10:45.4689347Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4689400Z bar.sync 0; 2026-02-21T09:10:45.4689466Z elect.sync %r1671|%p299, -1; 2026-02-21T09:10:45.4689530Z and.pred %p297, %p1, %p299; 2026-02-21T09:10:45.4689593Z mov.b32 %r1665, 32; 2026-02-21T09:10:45.4689649Z // begin inline asm 2026-02-21T09:10:45.4689898Z @%p297 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1500], [%rd497, {%r1794, %r1665}], [%r1662]; 2026-02-21T09:10:45.4689961Z // end inline asm 2026-02-21T09:10:45.4690144Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4690207Z add.s32 %r1672, %r8, %r1795; 2026-02-21T09:10:45.4690273Z shl.b32 %r1673, %r1672, 10; 2026-02-21T09:10:45.4690343Z mad.wide.s32 %rd582, %r1673, 2, %rd12; 2026-02-21T09:10:45.4690425Z shl.b32 %r1674, %r158, 17; 2026-02-21T09:10:45.4690489Z or.b32 %r1675, %r1854, %r1674; 2026-02-21T09:10:45.4690563Z mad.wide.s32 %rd581, %r1675, 2, %rd12; 2026-02-21T09:10:45.4690618Z mov.b32 %r1889, 1; 2026-02-21T09:10:45.4690673Z mov.b64 %rd583, 0; 2026-02-21T09:10:45.4690739Z mov.b32 %r1887, %r1885; 2026-02-21T09:10:45.4690796Z mov.b32 %r1888, %r1885; 2026-02-21T09:10:45.4690853Z mov.b32 %r1890, %r1885; 2026-02-21T09:10:45.4690911Z bra.uni $L__BB0_32; 2026-02-21T09:10:45.4691016Z $L__BB0_34: // in Loop: Header=BB0_32 Depth=2 2026-02-21T09:10:45.4691185Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4691250Z setp.lt.u64 %p315, %rd583, 464; 2026-02-21T09:10:45.4691312Z $L__tmp227: 2026-02-21T09:10:45.4691524Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4691628Z add.s32 %r1750, %r1889, 1; 2026-02-21T09:10:45.4691703Z setp.gt.s32 %p318, %r1750, 1; 2026-02-21T09:10:45.4691769Z selp.b32 %r1889, 0, %r1750, %p318; 
2026-02-21T09:10:45.4691832Z selp.b32 %r1751, 1, 0, %p318; 2026-02-21T09:10:45.4691892Z xor.b32 %r177, %r1890, %r1751; 2026-02-21T09:10:45.4691952Z $L__tmp228: 2026-02-21T09:10:45.4692113Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4692173Z add.s32 %r1741, %r171, %r41; 2026-02-21T09:10:45.4692240Z selp.b32 %r1742, 16, 0, %p315; 2026-02-21T09:10:45.4692298Z // begin inline asm 2026-02-21T09:10:45.4692420Z cp.async.cg.shared.global [ %r1741 + 0 ], [ %rd581 + 0 ], 0x10, %r1742; 2026-02-21T09:10:45.4692482Z // end inline asm 2026-02-21T09:10:45.4692540Z add.s32 %r1743, %r1741, 4096; 2026-02-21T09:10:45.4692596Z // begin inline asm 2026-02-21T09:10:45.4692714Z cp.async.cg.shared.global [ %r1743 + 0 ], [ %rd582 + 0 ], 0x10, %r1742; 2026-02-21T09:10:45.4692779Z // end inline asm 2026-02-21T09:10:45.4692842Z cp.async.commit_group; 2026-02-21T09:10:45.4693006Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4693077Z and.pred %p313, %p273, %p315; 2026-02-21T09:10:45.4693132Z // begin inline asm 2026-02-21T09:10:45.4693245Z @%p313 mbarrier.arrive.expect_tx.shared.b64 _, [%r1745], 1024; 2026-02-21T09:10:45.4693301Z // end inline asm 2026-02-21T09:10:45.4693473Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4693553Z bar.sync 0; 2026-02-21T09:10:45.4693616Z elect.sync %r1752|%p319, -1; 2026-02-21T09:10:45.4693687Z and.pred %p320, %p315, %p319; 2026-02-21T09:10:45.4693750Z and.pred %p314, %p1, %p320; 2026-02-21T09:10:45.4693809Z cvt.u32.u64 %r1753, %rd583; 2026-02-21T09:10:45.4693876Z add.s32 %r1748, %r1753, 48; 2026-02-21T09:10:45.4693933Z // begin inline asm 2026-02-21T09:10:45.4694182Z @%p314 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1746], [%rd497, {%r1794, %r1748}], [%r1745]; 2026-02-21T09:10:45.4694266Z // end inline asm 2026-02-21T09:10:45.4694433Z .loc 1 51 125 // 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4694493Z add.s64 %rd582, %rd582, 64; 2026-02-21T09:10:45.4694550Z add.s64 %rd581, %rd581, 64; 2026-02-21T09:10:45.4694621Z setp.lt.u64 %p321, %rd583, 480; 2026-02-21T09:10:45.4694679Z add.s64 %rd583, %rd583, 16; 2026-02-21T09:10:45.4694739Z mov.b32 %r1885, %r1890; 2026-02-21T09:10:45.4694804Z mov.b32 %r1890, %r177; 2026-02-21T09:10:45.4694863Z @%p321 bra $L__BB0_32; 2026-02-21T09:10:45.4694947Z bra.uni $L__BB0_35; 2026-02-21T09:10:45.4695050Z $L__BB0_32: // Parent Loop BB0_29 Depth=1 2026-02-21T09:10:45.4695154Z // => This Inner Loop Header: Depth=2 2026-02-21T09:10:45.4695214Z add.s32 %r1697, %r1888, 1; 2026-02-21T09:10:45.4695301Z setp.gt.s32 %p303, %r1697, 1; 2026-02-21T09:10:45.4695376Z selp.b32 %r1888, 0, %r1697, %p303; 2026-02-21T09:10:45.4695535Z .loc 1 58 80 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:58:80 2026-02-21T09:10:45.4695598Z cp.async.wait_group 1; 2026-02-21T09:10:45.4695657Z bar.sync 0; 2026-02-21T09:10:45.4695714Z shl.b32 %r1698, %r1888, 13; 2026-02-21T09:10:45.4695772Z add.s32 %r171, %r179, %r1698; 2026-02-21T09:10:45.4695830Z add.s32 %r1700, %r171, %r47; 2026-02-21T09:10:45.4695941Z ld.shared.v4.b32 {%r1701, %r1702, %r1703, %r1704}, [%r1700]; 2026-02-21T09:10:45.4696005Z mov.b32 {%rs512, %rs513}, %r1704; 2026-02-21T09:10:45.4696069Z mov.b32 {%rs514, %rs515}, %r1703; 2026-02-21T09:10:45.4696135Z mov.b32 {%rs516, %rs517}, %r1702; 2026-02-21T09:10:45.4696195Z mov.b32 {%rs518, %rs519}, %r1701; 2026-02-21T09:10:45.4696300Z ld.shared.v4.b32 {%r1705, %r1706, %r1707, %r1708}, [%r1700+16]; 2026-02-21T09:10:45.4696361Z mov.b32 {%rs520, %rs521}, %r1708; 2026-02-21T09:10:45.4696430Z mov.b32 {%rs522, %rs523}, %r1707; 2026-02-21T09:10:45.4696489Z mov.b32 {%rs524, %rs525}, %r1706; 2026-02-21T09:10:45.4696549Z mov.b32 {%rs526, %rs527}, %r1705; 2026-02-21T09:10:45.4696718Z .loc 1 62 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:62:32 
2026-02-21T09:10:45.4696781Z cvt.f32.bf16 %r1679, %rs518; 2026-02-21T09:10:45.4696841Z cvt.f32.bf16 %r1680, %rs519; 2026-02-21T09:10:45.4696911Z cvt.f32.bf16 %r1681, %rs516; 2026-02-21T09:10:45.4696969Z cvt.f32.bf16 %r1682, %rs517; 2026-02-21T09:10:45.4697028Z cvt.f32.bf16 %r1683, %rs514; 2026-02-21T09:10:45.4697085Z cvt.f32.bf16 %r1684, %rs515; 2026-02-21T09:10:45.4697150Z cvt.f32.bf16 %r1685, %rs512; 2026-02-21T09:10:45.4697207Z cvt.f32.bf16 %r1686, %rs513; 2026-02-21T09:10:45.4697263Z cvt.f32.bf16 %r1687, %rs526; 2026-02-21T09:10:45.4697326Z cvt.f32.bf16 %r1688, %rs527; 2026-02-21T09:10:45.4697383Z cvt.f32.bf16 %r1689, %rs524; 2026-02-21T09:10:45.4697442Z cvt.f32.bf16 %r1690, %rs525; 2026-02-21T09:10:45.4697501Z cvt.f32.bf16 %r1691, %rs522; 2026-02-21T09:10:45.4697566Z cvt.f32.bf16 %r1692, %rs523; 2026-02-21T09:10:45.4697621Z cvt.f32.bf16 %r1693, %rs520; 2026-02-21T09:10:45.4697677Z cvt.f32.bf16 %r1694, %rs521; 2026-02-21T09:10:45.4697737Z $L__tmp229: 2026-02-21T09:10:45.4697948Z .loc 2 291 36 // standard.py:291:36 @[ c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4698006Z // begin inline asm 2026-02-21T09:10:45.4698056Z 2026-02-21T09:10:45.4698136Z { 2026-02-21T09:10:45.4698198Z .reg .pred complete; 2026-02-21T09:10:45.4698253Z waitLoop: 2026-02-21T09:10:45.4698382Z mbarrier.try_wait.parity.shared.b64 complete, [%r1886], %r1885; 2026-02-21T09:10:45.4698447Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4698496Z } 2026-02-21T09:10:45.4698500Z 2026-02-21T09:10:45.4698564Z // end inline asm 2026-02-21T09:10:45.4698617Z $L__tmp230: 2026-02-21T09:10:45.4698784Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4698868Z selp.b32 %r1709, 1, 0, %p303; 2026-02-21T09:10:45.4698936Z xor.b32 %r1887, %r1887, %r1709; 2026-02-21T09:10:45.4698998Z mov.pred %p304, -1; 2026-02-21T09:10:45.4699051Z $L__tmp231: 2026-02-21T09:10:45.4699268Z .loc 2 291 36 // standard.py:291:36 @[ 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4699325Z // begin inline asm 2026-02-21T09:10:45.4699655Z @%p304 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1678 + 0], {%r1679, %r1680, %r1681, %r1682, %r1683, %r1684, %r1685, %r1686, %r1687, %r1688, %r1689, %r1690, %r1691, %r1692, %r1693, %r1694}; 2026-02-21T09:10:45.4699721Z // end inline asm 2026-02-21T09:10:45.4699778Z // begin inline asm 2026-02-21T09:10:45.4699846Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.4699899Z // end inline asm 2026-02-21T09:10:45.4699961Z bar.sync 0; 2026-02-21T09:10:45.4700014Z $L__tmp232: 2026-02-21T09:10:45.4700203Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4700274Z shl.b32 %r1710, %r1888, 3; 2026-02-21T09:10:45.4700336Z add.s32 %r1711, %r179, %r1710; 2026-02-21T09:10:45.4700396Z add.s32 %r1745, %r1711, 26640; 2026-02-21T09:10:45.4700452Z // begin inline asm 2026-02-21T09:10:45.4700510Z 2026-02-21T09:10:45.4700560Z { 2026-02-21T09:10:45.4700621Z .reg .pred complete; 2026-02-21T09:10:45.4700684Z waitLoop: 2026-02-21T09:10:45.4700800Z mbarrier.try_wait.parity.shared.b64 complete, [%r1745], %r1887; 2026-02-21T09:10:45.4700866Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.4700918Z } 2026-02-21T09:10:45.4700929Z 2026-02-21T09:10:45.4700984Z // end inline asm 2026-02-21T09:10:45.4701145Z .loc 1 64 33 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:64:33 2026-02-21T09:10:45.4701205Z shl.b32 %r1712, %r1888, 10; 2026-02-21T09:10:45.4701272Z add.s32 %r1713, %r179, %r1712; 2026-02-21T09:10:45.4701332Z add.s32 %r1746, %r1713, 24576; 2026-02-21T09:10:45.4701390Z add.s32 %r1714, %r1746, %r49; 2026-02-21T09:10:45.4701456Z ld.shared.b8 %rs528, [%r1714]; 2026-02-21T09:10:45.4701520Z ld.shared.b8 %rs529, [%r1714+512]; 2026-02-21T09:10:45.4701627Z add.s32 %r1715, %r1746, %r51; 2026-02-21T09:10:45.4701695Z ld.shared.b8 %rs530, [%r1715+128]; 2026-02-21T09:10:45.4701762Z ld.shared.b8 %rs531, 
[%r1715+640]; 2026-02-21T09:10:45.4701821Z add.s32 %r1716, %r1746, %r53; 2026-02-21T09:10:45.4701882Z ld.shared.b8 %rs532, [%r1716+256]; 2026-02-21T09:10:45.4701950Z ld.shared.b8 %rs533, [%r1716+768]; 2026-02-21T09:10:45.4702009Z add.s32 %r1717, %r1746, %r55; 2026-02-21T09:10:45.4702068Z ld.shared.b8 %rs534, [%r1717+384]; 2026-02-21T09:10:45.4702134Z ld.shared.b8 %rs535, [%r1717+896]; 2026-02-21T09:10:45.4702295Z .loc 1 67 28 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:67:28 2026-02-21T09:10:45.4702355Z shl.b16 %rs536, %rs528, 4; 2026-02-21T09:10:45.4702416Z shl.b16 %rs537, %rs530, 4; 2026-02-21T09:10:45.4702483Z shl.b16 %rs538, %rs532, 4; 2026-02-21T09:10:45.4702541Z shl.b16 %rs539, %rs534, 4; 2026-02-21T09:10:45.4702598Z shl.b16 %rs540, %rs529, 4; 2026-02-21T09:10:45.4702662Z shl.b16 %rs541, %rs531, 4; 2026-02-21T09:10:45.4702720Z shl.b16 %rs542, %rs533, 4; 2026-02-21T09:10:45.4702776Z shl.b16 %rs543, %rs535, 4; 2026-02-21T09:10:45.4702942Z .loc 1 82 58 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:82:58 2026-02-21T09:10:45.4703048Z selp.b16 %rs544, %rs536, %rs528, %p331; 2026-02-21T09:10:45.4703107Z cvt.s16.s8 %rs545, %rs544; 2026-02-21T09:10:45.4703166Z shr.s16 %rs546, %rs545, 4; 2026-02-21T09:10:45.4703243Z selp.b16 %rs547, %rs537, %rs530, %p331; 2026-02-21T09:10:45.4703303Z cvt.s16.s8 %rs548, %rs547; 2026-02-21T09:10:45.4703361Z shr.s16 %rs549, %rs548, 4; 2026-02-21T09:10:45.4703435Z selp.b16 %rs550, %rs538, %rs532, %p331; 2026-02-21T09:10:45.4703495Z cvt.s16.s8 %rs551, %rs550; 2026-02-21T09:10:45.4703579Z shr.s16 %rs552, %rs551, 4; 2026-02-21T09:10:45.4703644Z selp.b16 %rs553, %rs539, %rs534, %p331; 2026-02-21T09:10:45.4703710Z cvt.s16.s8 %rs554, %rs553; 2026-02-21T09:10:45.4703768Z shr.s16 %rs555, %rs554, 4; 2026-02-21T09:10:45.4703833Z selp.b16 %rs556, %rs540, %rs529, %p331; 2026-02-21T09:10:45.4703900Z cvt.s16.s8 %rs557, %rs556; 2026-02-21T09:10:45.4703957Z shr.s16 %rs558, %rs557, 4; 2026-02-21T09:10:45.4704021Z selp.b16 
%rs559, %rs541, %rs531, %p331; 2026-02-21T09:10:45.4704080Z cvt.s16.s8 %rs560, %rs559; 2026-02-21T09:10:45.4704150Z shr.s16 %rs561, %rs560, 4; 2026-02-21T09:10:45.4704250Z selp.b16 %rs562, %rs542, %rs533, %p331; 2026-02-21T09:10:45.4704313Z cvt.s16.s8 %rs563, %rs562; 2026-02-21T09:10:45.4704380Z shr.s16 %rs564, %rs563, 4; 2026-02-21T09:10:45.4704445Z selp.b16 %rs565, %rs543, %rs535, %p331; 2026-02-21T09:10:45.4704503Z cvt.s16.s8 %rs566, %rs565; 2026-02-21T09:10:45.4704561Z shr.s16 %rs567, %rs566, 4; 2026-02-21T09:10:45.4704750Z .loc 1 87 32 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:87:32 2026-02-21T09:10:45.4704816Z cvt.rn.f32.s16 %r1718, %rs546; 2026-02-21T09:10:45.4704878Z cvt.rn.f32.s16 %r1719, %rs549; 2026-02-21T09:10:45.4704947Z cvt.rn.f32.s16 %r1720, %rs552; 2026-02-21T09:10:45.4705007Z cvt.rn.f32.s16 %r1721, %rs555; 2026-02-21T09:10:45.4705067Z cvt.rn.f32.s16 %r1722, %rs558; 2026-02-21T09:10:45.4705131Z cvt.rn.f32.s16 %r1723, %rs561; 2026-02-21T09:10:45.4705189Z cvt.rn.f32.s16 %r1724, %rs564; 2026-02-21T09:10:45.4705250Z cvt.rn.f32.s16 %r1725, %rs567; 2026-02-21T09:10:45.4705311Z st.shared.b32 [%r57], %r1718; 2026-02-21T09:10:45.4705381Z st.shared.b32 [%r58], %r1719; 2026-02-21T09:10:45.4705441Z st.shared.b32 [%r59], %r1720; 2026-02-21T09:10:45.4705500Z st.shared.b32 [%r60], %r1721; 2026-02-21T09:10:45.4705564Z st.shared.b32 [%r61], %r1722; 2026-02-21T09:10:45.4705622Z st.shared.b32 [%r62], %r1723; 2026-02-21T09:10:45.4705681Z st.shared.b32 [%r63], %r1724; 2026-02-21T09:10:45.4705741Z st.shared.b32 [%r64], %r1725; 2026-02-21T09:10:45.4705917Z .loc 1 51 125 // c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:51:125 2026-02-21T09:10:45.4705976Z shl.b32 %r1726, %r1889, 3; 2026-02-21T09:10:45.4706034Z add.s32 %r1727, %r179, %r1726; 2026-02-21T09:10:45.4706099Z add.s32 %r1886, %r1727, 26624; 2026-02-21T09:10:45.4706151Z $L__tmp233: 2026-02-21T09:10:45.4706363Z .loc 2 291 36 // standard.py:291:36 @[ 
c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py:94:40 ] 2026-02-21T09:10:45.4706429Z // begin inline asm 2026-02-21T09:10:45.4706501Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.4706557Z // end inline asm 2026-02-21T09:10:45.4706612Z bar.sync 0; 2026-02-21T09:10:45.4706679Z @%p286 bra $L__BB0_34; 2026-02-21T09:10:45.4706774Z // %bb.33: // in Loop: Header=BB0_32 Depth=2 2026-02-21T09:10:45.4706839Z elect.sync %r1740|%p305, -1; 2026-02-21T09:10:45.4706904Z mov.b32 %r1730, 135268624; 2026-02-21T09:10:45.4706961Z // begin inline asm 2026-02-21T09:10:45.4707118Z @%p305 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 0 ], %rd8, %r1730, %p304; 2026-02-21T09:10:45.4707180Z // end inline asm 2026-02-21T09:10:45.4707236Z // begin inline asm 2026-02-21T09:10:45.4707385Z @%p305 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 8 ], %rd9, %r1730, %p304; 2026-02-21T09:10:45.4707440Z // end inline asm 2026-02-21T09:10:45.4707502Z // begin inline asm 2026-02-21T09:10:45.4707679Z @%p305 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 16 ], %rd10, %r1730, %p304; 2026-02-21T09:10:45.4707734Z // end inline asm 2026-02-21T09:10:45.4707798Z // begin inline asm 2026-02-21T09:10:45.4707947Z @%p305 tcgen05.mma.cta_group::1.kind::tf32 [ %r1846 + 0 ], [ %r1644 + 24 ], %rd11, %r1730, %p304; 2026-02-21T09:10:45.4708000Z // end inline asm 2026-02-21T09:10:45.4708066Z cvt.u64.u32 %rd503, %r1886; 2026-02-21T09:10:45.4708122Z // begin inline asm 2026-02-21T09:10:45.4708272Z @%p305 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd503]; 2026-02-21T09:10:45.4708326Z // end inline asm 2026-02-21T09:10:45.4708391Z bra.uni $L__BB0_34; 2026-02-21T09:10:45.4708442Z $L__tmp234: 2026-02-21T09:10:45.4708497Z $L__func_end0: 2026-02-21T09:10:45.4708587Z // -- End function 2026-02-21T09:10:45.4708638Z } 2026-02-21T09:10:45.4708835Z .file 1 "/tmp/torchinductor_root/4t/c4thj64opnhz2eo2fdijcor76z6aty2phwz6q2kiqs5xjdkocjcx.py" 
2026-02-21T09:10:45.4709013Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:10:45.4709103Z .section .debug_abbrev 2026-02-21T09:10:45.4709155Z { 2026-02-21T09:10:45.4709244Z .b8 1 // Abbreviation Code 2026-02-21T09:10:45.4709338Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:10:45.4709456Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:45.4709540Z .b8 37 // DW_AT_producer 2026-02-21T09:10:45.4709625Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.4709700Z .b8 19 // DW_AT_language 2026-02-21T09:10:45.4709778Z .b8 5 // DW_FORM_data2 2026-02-21T09:10:45.4709858Z .b8 3 // DW_AT_name 2026-02-21T09:10:45.4709932Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.4710010Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:10:45.4710087Z .b8 6 // DW_FORM_data4 2026-02-21T09:10:45.4710170Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:10:45.4710242Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.4710313Z .b8 0 // EOM(1) 2026-02-21T09:10:45.4710391Z .b8 0 // EOM(2) 2026-02-21T09:10:45.4710476Z .b8 2 // Abbreviation Code 2026-02-21T09:10:45.4710559Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:45.4710641Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:45.4710714Z .b8 3 // DW_AT_name 2026-02-21T09:10:45.4710787Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.4710864Z .b8 32 // DW_AT_inline 2026-02-21T09:10:45.4710947Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.4711016Z .b8 0 // EOM(1) 2026-02-21T09:10:45.4711085Z .b8 0 // EOM(2) 2026-02-21T09:10:45.4711173Z .b8 3 // Abbreviation Code 2026-02-21T09:10:45.4711254Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:45.4711332Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:45.4711417Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:45.4711490Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.4711618Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:45.4711693Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.4711786Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:45.4711858Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:45.4711957Z .b8 0 // EOM(1) 
2026-02-21T09:10:45.4712035Z .b8 0 // EOM(2) 2026-02-21T09:10:45.4712114Z .b8 4 // Abbreviation Code 2026-02-21T09:10:45.4712207Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:10:45.4712297Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:45.4712383Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:45.4712486Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:45.4712572Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:45.4712647Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.4712724Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:45.4712799Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.4712892Z .b8 88 // DW_AT_call_file 2026-02-21T09:10:45.4712966Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.4713063Z .b8 89 // DW_AT_call_line 2026-02-21T09:10:45.4713143Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.4713223Z .b8 87 // DW_AT_call_column 2026-02-21T09:10:45.4713294Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.4713394Z .b8 0 // EOM(1) 2026-02-21T09:10:45.4713465Z .b8 0 // EOM(2) 2026-02-21T09:10:45.4713530Z .b8 0 // EOM(3) 2026-02-21T09:10:45.4713582Z } 2026-02-21T09:10:45.4713650Z .section .debug_info 2026-02-21T09:10:45.4713701Z { 2026-02-21T09:10:45.4713785Z .b32 178 // Length of Unit 2026-02-21T09:10:45.4713876Z .b8 2 // DWARF version number 2026-02-21T09:10:45.4713928Z .b8 0 2026-02-21T09:10:45.4714043Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:10:45.4714136Z .b8 8 // Address Size (in bytes) 2026-02-21T09:10:45.4714236Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:10:45.4714315Z .b8 116 // DW_AT_producer 2026-02-21T09:10:45.4714369Z .b8 114 2026-02-21T09:10:45.4714430Z .b8 105 2026-02-21T09:10:45.4714482Z .b8 116 2026-02-21T09:10:45.4714535Z .b8 111 2026-02-21T09:10:45.4714592Z .b8 110 2026-02-21T09:10:45.4714642Z .b8 0 2026-02-21T09:10:45.4714717Z .b8 2 // DW_AT_language 2026-02-21T09:10:45.4714770Z .b8 0 2026-02-21T09:10:45.4714854Z .b8 99 // DW_AT_name 2026-02-21T09:10:45.4714909Z .b8 52 2026-02-21T09:10:45.4714963Z .b8 116 2026-02-21T09:10:45.4715023Z .b8 104 2026-02-21T09:10:45.4715075Z .b8 106 2026-02-21T09:10:45.4715128Z .b8 54 2026-02-21T09:10:45.4715181Z .b8 52 2026-02-21T09:10:45.4715242Z .b8 111 2026-02-21T09:10:45.4715295Z .b8 112 2026-02-21T09:10:45.4715349Z .b8 110 2026-02-21T09:10:45.4715409Z .b8 104 2026-02-21T09:10:45.4715464Z .b8 122 2026-02-21T09:10:45.4715516Z .b8 50 2026-02-21T09:10:45.4715568Z .b8 101 2026-02-21T09:10:45.4715626Z .b8 111 2026-02-21T09:10:45.4715679Z .b8 50 2026-02-21T09:10:45.4715730Z .b8 102 2026-02-21T09:10:45.4715782Z .b8 100 2026-02-21T09:10:45.4715842Z .b8 105 2026-02-21T09:10:45.4715895Z .b8 106 2026-02-21T09:10:45.4715949Z .b8 99 2026-02-21T09:10:45.4716006Z .b8 111 2026-02-21T09:10:45.4716058Z .b8 114 2026-02-21T09:10:45.4716110Z .b8 55 2026-02-21T09:10:45.4716162Z .b8 54 2026-02-21T09:10:45.4716222Z .b8 122 2026-02-21T09:10:45.4716274Z .b8 54 2026-02-21T09:10:45.4716325Z .b8 97 2026-02-21T09:10:45.4716384Z .b8 116 2026-02-21T09:10:45.4716436Z .b8 121 2026-02-21T09:10:45.4716488Z .b8 50 2026-02-21T09:10:45.4716540Z .b8 112 2026-02-21T09:10:45.4716598Z .b8 104 2026-02-21T09:10:45.4716651Z .b8 119 2026-02-21T09:10:45.4716702Z .b8 122 2026-02-21T09:10:45.4716783Z .b8 54 2026-02-21T09:10:45.4716844Z .b8 113 2026-02-21T09:10:45.4716897Z .b8 50 2026-02-21T09:10:45.4716951Z .b8 107 2026-02-21T09:10:45.4717010Z .b8 105 
2026-02-21T09:10:45.4717063Z .b8 113 2026-02-21T09:10:45.4717116Z .b8 115 2026-02-21T09:10:45.4717167Z .b8 53 2026-02-21T09:10:45.4717227Z .b8 120 2026-02-21T09:10:45.4717279Z .b8 106 2026-02-21T09:10:45.4717332Z .b8 100 2026-02-21T09:10:45.4717393Z .b8 107 2026-02-21T09:10:45.4717447Z .b8 111 2026-02-21T09:10:45.4717527Z .b8 99 2026-02-21T09:10:45.4717579Z .b8 106 2026-02-21T09:10:45.4717640Z .b8 99 2026-02-21T09:10:45.4717693Z .b8 120 2026-02-21T09:10:45.4717746Z .b8 46 2026-02-21T09:10:45.4717798Z .b8 112 2026-02-21T09:10:45.4717858Z .b8 121 2026-02-21T09:10:45.4717911Z .b8 0 2026-02-21T09:10:45.4718006Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:10:45.4718092Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:10:45.4718145Z .b8 116 2026-02-21T09:10:45.4718198Z .b8 109 2026-02-21T09:10:45.4718263Z .b8 112 2026-02-21T09:10:45.4718317Z .b8 47 2026-02-21T09:10:45.4718370Z .b8 116 2026-02-21T09:10:45.4718446Z .b8 111 2026-02-21T09:10:45.4718511Z .b8 114 2026-02-21T09:10:45.4718564Z .b8 99 2026-02-21T09:10:45.4718618Z .b8 104 2026-02-21T09:10:45.4718672Z .b8 105 2026-02-21T09:10:45.4718733Z .b8 110 2026-02-21T09:10:45.4718785Z .b8 100 2026-02-21T09:10:45.4718837Z .b8 117 2026-02-21T09:10:45.4718897Z .b8 99 2026-02-21T09:10:45.4718951Z .b8 116 2026-02-21T09:10:45.4719030Z .b8 111 2026-02-21T09:10:45.4719083Z .b8 114 2026-02-21T09:10:45.4719144Z .b8 95 2026-02-21T09:10:45.4719196Z .b8 114 2026-02-21T09:10:45.4719250Z .b8 111 2026-02-21T09:10:45.4719308Z .b8 111 2026-02-21T09:10:45.4719360Z .b8 116 2026-02-21T09:10:45.4719412Z .b8 47 2026-02-21T09:10:45.4719465Z .b8 52 2026-02-21T09:10:45.4719524Z .b8 116 2026-02-21T09:10:45.4719576Z .b8 0 2026-02-21T09:10:45.4719678Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:10:45.4719760Z .b8 95 // DW_AT_name 2026-02-21T09:10:45.4719814Z .b8 104 2026-02-21T09:10:45.4719866Z .b8 101 2026-02-21T09:10:45.4719920Z .b8 108 2026-02-21T09:10:45.4719980Z .b8 105 2026-02-21T09:10:45.4720032Z .b8 111 2026-02-21T09:10:45.4720084Z 
.b8 110 2026-02-21T09:10:45.4720136Z .b8 95 2026-02-21T09:10:45.4720197Z .b8 109 2026-02-21T09:10:45.4720248Z .b8 97 2026-02-21T09:10:45.4720300Z .b8 116 2026-02-21T09:10:45.4720361Z .b8 109 2026-02-21T09:10:45.4720414Z .b8 117 2026-02-21T09:10:45.4720467Z .b8 108 2026-02-21T09:10:45.4720520Z .b8 95 2026-02-21T09:10:45.4720580Z .b8 98 2026-02-21T09:10:45.4720632Z .b8 102 2026-02-21T09:10:45.4720685Z .b8 49 2026-02-21T09:10:45.4720743Z .b8 54 2026-02-21T09:10:45.4720796Z .b8 95 2026-02-21T09:10:45.4720849Z .b8 105 2026-02-21T09:10:45.4720902Z .b8 110 2026-02-21T09:10:45.4720961Z .b8 116 2026-02-21T09:10:45.4721013Z .b8 52 2026-02-21T09:10:45.4721065Z .b8 0 2026-02-21T09:10:45.4721148Z .b8 1 // DW_AT_inline 2026-02-21T09:10:45.4721248Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:10:45.4721338Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:10:45.4721429Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:10:45.4721529Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:45.4721689Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:10:45.4721782Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:45.4721874Z .b64 $L__tmp1 // DW_AT_low_pc 2026-02-21T09:10:45.4721961Z .b64 $L__tmp234 // DW_AT_high_pc 2026-02-21T09:10:45.4722041Z .b8 1 // DW_AT_call_file 2026-02-21T09:10:45.4722128Z .b8 94 // DW_AT_call_line 2026-02-21T09:10:45.4722211Z .b8 40 // DW_AT_call_column 2026-02-21T09:10:45.4722328Z .b8 0 // End Of Children Mark 2026-02-21T09:10:45.4722421Z .b8 0 // End Of Children Mark 2026-02-21T09:10:45.4722477Z } 2026-02-21T09:10:45.4722547Z .section .debug_macinfo { } 2026-02-21T09:10:45.4722552Z 2026-02-21T09:10:45.4722630Z ================================================================ 2026-02-21T09:10:45.4722750Z please share the reproducer above with Triton project. 2026-02-21T09:10:45.5672746Z [232s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:10:45.5672913Z 2026-02-21T09:10:45.5678430Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:10:45.5678649Z 2026-02-21T09:10:45.5678833Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:10:45.5679072Z 2026-02-21T09:10:45.5679180Z ================================================================ 2026-02-21T09:10:45.5679256Z Internal Triton PTX codegen error 2026-02-21T09:10:45.5679369Z `ptxas` stderr: 2026-02-21T09:10:45.5679759Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 276 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:10:45.5683100Z `ptxas` stderr: 2026-02-21T09:10:45.5683235Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:45.5685072Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 276 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:10:45.5685197Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:45.5685204Z 2026-02-21T09:10:45.5685681Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp3uhuichs.ptx -o /tmp/tmp3uhuichs.ptx.o 2026-02-21T09:10:45.5685686Z 2026-02-21T09:10:45.5685833Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:10:45.5685916Z 2026-02-21T09:10:45.5690519Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp3uhuichs.ptx -o /tmp/tmp3uhuichs.ptx.o 2026-02-21T09:10:45.5693911Z 2026-02-21T09:10:45.5693969Z 2026-02-21T09:10:45.5694115Z // 2026-02-21T09:10:45.5694248Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:10:45.5694325Z // 2026-02-21T09:10:45.5694330Z 2026-02-21T09:10:45.5694410Z .version 8.7 2026-02-21T09:10:45.5694481Z .target sm_100a 2026-02-21T09:10:45.5694548Z .address_size 64 2026-02-21T09:10:45.5694572Z 2026-02-21T09:10:45.5694777Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:10:45.5694880Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:10:45.5694975Z // @_helion_matmul_bf16_int4 2026-02-21T09:10:45.5695058Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:10:45.5695213Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:10:45.5695340Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:10:45.5695464Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:10:45.5695594Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:10:45.5695720Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:10:45.5695774Z ) 2026-02-21T09:10:45.5695831Z .reqntid 256 2026-02-21T09:10:45.5695897Z .maxnreg 32 2026-02-21T09:10:45.5695949Z { 2026-02-21T09:10:45.5696207Z .reg .pred %p<98>; 2026-02-21T09:10:45.5696279Z .reg .b16 %rs<76>; 2026-02-21T09:10:45.5696341Z .reg .b32 %r<353>; 2026-02-21T09:10:45.5696404Z .reg .b64 %rd<128>; 2026-02-21T09:10:45.5696467Z $L__func_begin0: 2026-02-21T09:10:45.5696471Z 2026-02-21T09:10:45.5696549Z // %bb.0: 2026-02-21T09:10:45.5696739Z .loc 1 19 0 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:19 2026-02-21T09:10:45.5696801Z 
mov.u32 %r1, %tid.x; 2026-02-21T09:10:45.5696974Z ld.param.b64 %rd19, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:10:45.5697045Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:10:45.5697146Z ld.param.b64 %rd37, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:10:45.5697219Z mov.b32 %r41, global_smem; 2026-02-21T09:10:45.5697278Z // begin inline asm 2026-02-21T09:10:45.5697618Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r41], 64; 2026-02-21T09:10:45.5697682Z // end inline asm 2026-02-21T09:10:45.5697778Z ld.param.b64 %rd54, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:10:45.5697835Z bar.sync 0; 2026-02-21T09:10:45.5697943Z ld.shared.b32 %r346, [global_smem]; 2026-02-21T09:10:45.5698007Z bar.sync 0; 2026-02-21T09:10:45.5698064Z // begin inline asm 2026-02-21T09:10:45.5698186Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:10:45.5698249Z // end inline asm 2026-02-21T09:10:45.5698461Z .loc 1 21 66 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:21:66 2026-02-21T09:10:45.5698526Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:10:45.5698587Z mov.u32 %r58, %ctaid.y; 2026-02-21T09:10:45.5698653Z mov.u32 %r59, %ctaid.z; 2026-02-21T09:10:45.5698714Z mov.u32 %r60, %nctaid.x; 2026-02-21T09:10:45.5698772Z mov.u32 %r61, %nctaid.y; 2026-02-21T09:10:45.5698848Z mad.lo.s32 %r62, %r59, %r61, %r58; 2026-02-21T09:10:45.5698913Z mad.lo.s32 %r63, %r62, %r60, %r3; 2026-02-21T09:10:45.5698969Z shl.b32 %r64, %r63, 8; 2026-02-21T09:10:45.5699033Z cvt.s64.s32 %rd55, %r64; 2026-02-21T09:10:45.5699097Z add.s64 %rd33, %rd54, %rd55; 2026-02-21T09:10:45.5699154Z shl.b32 %r65, %r1, 2; 2026-02-21T09:10:45.5699214Z add.s32 %r42, %r41, %r65; 2026-02-21T09:10:45.5699275Z mov.b32 %r51, 0; 2026-02-21T09:10:45.5699331Z // begin inline asm 2026-02-21T09:10:45.5699401Z @%p1 st.shared.b32 [ %r42 + 0 ], %r51; 2026-02-21T09:10:45.5699461Z // end inline asm 2026-02-21T09:10:45.5699523Z bar.warp.sync -1; 2026-02-21T09:10:45.5699586Z setp.eq.b32 %p90, %r1, 
0; 2026-02-21T09:10:45.5699647Z cvt.u64.u32 %rd18, %r41; 2026-02-21T09:10:45.5699711Z // begin inline asm 2026-02-21T09:10:45.5699885Z @%p90 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd18 + 0 ], %rd19; 2026-02-21T09:10:45.5699942Z // end inline asm 2026-02-21T09:10:45.5700002Z // begin inline asm 2026-02-21T09:10:45.5700144Z @%p90 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1; 2026-02-21T09:10:45.5700199Z // end inline asm 2026-02-21T09:10:45.5700253Z mov.b32 %r44, 32; 2026-02-21T09:10:45.5700318Z // begin inline asm 2026-02-21T09:10:45.5700471Z @%p90 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r44; 2026-02-21T09:10:45.5700526Z // end inline asm 2026-02-21T09:10:45.5700589Z mov.b32 %r88, 16; 2026-02-21T09:10:45.5700645Z // begin inline asm 2026-02-21T09:10:45.5700792Z @%p90 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r88; 2026-02-21T09:10:45.5700854Z // end inline asm 2026-02-21T09:10:45.5700912Z mov.b32 %r46, 8192; 2026-02-21T09:10:45.5700968Z // begin inline asm 2026-02-21T09:10:45.5701124Z @%p90 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r46; 2026-02-21T09:10:45.5701185Z // end inline asm 2026-02-21T09:10:45.5701240Z mov.b32 %r47, 512; 2026-02-21T09:10:45.5701296Z // begin inline asm 2026-02-21T09:10:45.5701458Z @%p90 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r47; 2026-02-21T09:10:45.5701513Z // end inline asm 2026-02-21T09:10:45.5701895Z mov.b64 %rd26, 8192; 2026-02-21T09:10:45.5701960Z // begin inline asm 2026-02-21T09:10:45.5702132Z @%p90 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd18 + 0 ], 0x0, %rd26; 2026-02-21T09:10:45.5702188Z // end inline asm 2026-02-21T09:10:45.5702242Z mov.b32 %r48, 1; 2026-02-21T09:10:45.5702307Z // begin inline asm 2026-02-21T09:10:45.5702481Z @%p90 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r48; 
2026-02-21T09:10:45.5702571Z // end inline asm 2026-02-21T09:10:45.5702635Z // begin inline asm 2026-02-21T09:10:45.5702805Z @%p90 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r48; 2026-02-21T09:10:45.5702863Z // end inline asm 2026-02-21T09:10:45.5702929Z // begin inline asm 2026-02-21T09:10:45.5703081Z @%p90 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:45.5703137Z // end inline asm 2026-02-21T09:10:45.5703204Z // begin inline asm 2026-02-21T09:10:45.5703372Z @%p90 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:45.5703455Z // end inline asm 2026-02-21T09:10:45.5703512Z // begin inline asm 2026-02-21T09:10:45.5703676Z @%p90 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1; 2026-02-21T09:10:45.5703733Z // end inline asm 2026-02-21T09:10:45.5703793Z // begin inline asm 2026-02-21T09:10:45.5703990Z @%p90 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:45.5704048Z // end inline asm 2026-02-21T09:10:45.5704105Z // begin inline asm 2026-02-21T09:10:45.5704369Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd33 + 0 ], [ %rd18 + 0 ], 0x80; 2026-02-21T09:10:45.5704425Z // end inline asm 2026-02-21T09:10:45.5704480Z // begin inline asm 2026-02-21T09:10:45.5704606Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd33 + 0 ], 0x80; 2026-02-21T09:10:45.5704687Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:45.5704763Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:45.5704820Z // end inline asm 2026-02-21T09:10:45.5704883Z bar.sync 0; 2026-02-21T09:10:45.5704949Z cvta.global.u64 %rd74, %rd33; 2026-02-21T09:10:45.5705127Z .loc 1 23 67 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:23:67 2026-02-21T09:10:45.5705196Z add.s64 %rd51, %rd33, 128; 2026-02-21T09:10:45.5705252Z bar.sync 0; 
2026-02-21T09:10:45.5705310Z // begin inline asm 2026-02-21T09:10:45.5705378Z @%p1 st.shared.b32 [ %r42 + 0 ], %r51; 2026-02-21T09:10:45.5705440Z // end inline asm 2026-02-21T09:10:45.5705501Z bar.warp.sync -1; 2026-02-21T09:10:45.5705558Z // begin inline asm 2026-02-21T09:10:45.5705725Z @%p90 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd18 + 0 ], %rd37; 2026-02-21T09:10:45.5705781Z // end inline asm 2026-02-21T09:10:45.5705836Z // begin inline asm 2026-02-21T09:10:45.5705984Z @%p90 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1; 2026-02-21T09:10:45.5706041Z // end inline asm 2026-02-21T09:10:45.5706096Z // begin inline asm 2026-02-21T09:10:45.5706245Z @%p90 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r44; 2026-02-21T09:10:45.5706308Z // end inline asm 2026-02-21T09:10:45.5706363Z mov.b32 %r53, 128; 2026-02-21T09:10:45.5706418Z // begin inline asm 2026-02-21T09:10:45.5706572Z @%p90 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r53; 2026-02-21T09:10:45.5706628Z // end inline asm 2026-02-21T09:10:45.5706684Z // begin inline asm 2026-02-21T09:10:45.5706845Z @%p90 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r46; 2026-02-21T09:10:45.5706901Z // end inline asm 2026-02-21T09:10:45.5706957Z mov.b32 %r55, 4096; 2026-02-21T09:10:45.5707012Z // begin inline asm 2026-02-21T09:10:45.5707174Z @%p90 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r55; 2026-02-21T09:10:45.5707258Z // end inline asm 2026-02-21T09:10:45.5707317Z mov.b64 %rd44, 16384; 2026-02-21T09:10:45.5707378Z // begin inline asm 2026-02-21T09:10:45.5707543Z @%p90 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd18 + 0 ], 0x0, %rd44; 2026-02-21T09:10:45.5707597Z // end inline asm 2026-02-21T09:10:45.5707658Z // begin inline asm 2026-02-21T09:10:45.5707821Z @%p90 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, 
%r48; 2026-02-21T09:10:45.5707875Z // end inline asm 2026-02-21T09:10:45.5707956Z // begin inline asm 2026-02-21T09:10:45.5708123Z @%p90 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r48; 2026-02-21T09:10:45.5708178Z // end inline asm 2026-02-21T09:10:45.5708233Z // begin inline asm 2026-02-21T09:10:45.5708388Z @%p90 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd18 + 0 ], 0xa; 2026-02-21T09:10:45.5708441Z // end inline asm 2026-02-21T09:10:45.5708495Z // begin inline asm 2026-02-21T09:10:45.5708663Z @%p90 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:45.5708717Z // end inline asm 2026-02-21T09:10:45.5708799Z // begin inline asm 2026-02-21T09:10:45.5708949Z @%p90 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x2; 2026-02-21T09:10:45.5709011Z // end inline asm 2026-02-21T09:10:45.5709066Z // begin inline asm 2026-02-21T09:10:45.5709233Z @%p90 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:45.5709298Z // end inline asm 2026-02-21T09:10:45.5709353Z // begin inline asm 2026-02-21T09:10:45.5709608Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd51 + 0 ], [ %rd18 + 0 ], 0x80; 2026-02-21T09:10:45.5709670Z // end inline asm 2026-02-21T09:10:45.5709725Z // begin inline asm 2026-02-21T09:10:45.5709847Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd51 + 0 ], 0x80; 2026-02-21T09:10:45.5709926Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:45.5710000Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:45.5710055Z // end inline asm 2026-02-21T09:10:45.5710111Z bar.sync 0; 2026-02-21T09:10:45.5710182Z cvta.global.u64 %rd92, %rd51; 2026-02-21T09:10:45.5710352Z .loc 1 31 88 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:31:88 2026-02-21T09:10:45.5710417Z setp.gt.u32 %p39, %r3, 8191; 2026-02-21T09:10:45.5710484Z @%p39 bra $L__BB0_8; 
2026-02-21T09:10:45.5710563Z // %bb.1: // %.lr.ph 2026-02-21T09:10:45.5710735Z .loc 1 0 88 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:0:88 2026-02-21T09:10:45.5710840Z ld.param.b64 %rd17, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:10:45.5710900Z and.b32 %r4, %r1, 32; 2026-02-21T09:10:45.5711063Z .loc 1 81 38 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:81:38 2026-02-21T09:10:45.5711126Z setp.eq.b32 %p50, %r4, 0; 2026-02-21T09:10:45.5711303Z .loc 1 57 38 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:57:38 2026-02-21T09:10:45.5711363Z shl.b32 %r124, %r1, 3; 2026-02-21T09:10:45.5711423Z and.b32 %r5, %r124, 24; 2026-02-21T09:10:45.5711651Z .loc 1 43 45 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:43:45 2026-02-21T09:10:45.5711711Z shr.u32 %r125, %r1, 2; 2026-02-21T09:10:45.5711770Z bfe.u32 %r6, %r1, 2, 6; 2026-02-21T09:10:45.5711837Z shr.u32 %r126, %r1, 5; 2026-02-21T09:10:45.5711896Z shl.b32 %r127, %r1, 4; 2026-02-21T09:10:45.5711958Z and.b32 %r7, %r127, 4080; 2026-02-21T09:10:45.5712018Z add.s32 %r201, %r41, %r7; 2026-02-21T09:10:45.5712091Z add.s32 %r203, %r201, 4096; 2026-02-21T09:10:45.5712160Z add.s32 %r187, %r346, 32; 2026-02-21T09:10:45.5712221Z shl.b32 %r129, %r1, 6; 2026-02-21T09:10:45.5712294Z and.b32 %r130, %r129, 8128; 2026-02-21T09:10:45.5712354Z and.b32 %r131, %r125, 32; 2026-02-21T09:10:45.5712417Z or.b32 %r11, %r130, %r131; 2026-02-21T09:10:45.5712524Z and.b32 %r132, %r1, 31; 2026-02-21T09:10:45.5712587Z shr.u32 %r133, %r1, 1; 2026-02-21T09:10:45.5712645Z and.b32 %r134, %r133, 96; 2026-02-21T09:10:45.5712705Z or.b32 %r12, %r134, %r132; 2026-02-21T09:10:45.5712770Z shl.b32 %r135, %r132, 7; 2026-02-21T09:10:45.5712828Z and.b32 %r136, %r127, 112; 2026-02-21T09:10:45.5712884Z shr.u32 %r137, %r1, 3; 2026-02-21T09:10:45.5712942Z and.b32 %r138, %r137, 28; 2026-02-21T09:10:45.5713011Z xor.b32 %r139, %r136, %r138; 2026-02-21T09:10:45.5713099Z or.b32 %r140, %r139, %r135; 2026-02-21T09:10:45.5713158Z 
add.s32 %r141, %r41, 16384; 2026-02-21T09:10:45.5713223Z add.s32 %r13, %r141, %r140; 2026-02-21T09:10:45.5713282Z xor.b32 %r142, %r140, 32; 2026-02-21T09:10:45.5713339Z add.s32 %r14, %r141, %r142; 2026-02-21T09:10:45.5713397Z xor.b32 %r143, %r140, 64; 2026-02-21T09:10:45.5713461Z add.s32 %r15, %r141, %r143; 2026-02-21T09:10:45.5713517Z xor.b32 %r144, %r140, 96; 2026-02-21T09:10:45.5713575Z add.s32 %r16, %r141, %r144; 2026-02-21T09:10:45.5713646Z bfe.u32 %r145, %r141, 4, 14; 2026-02-21T09:10:45.5713705Z cvt.u64.u32 %rd62, %r145; 2026-02-21T09:10:45.5713806Z or.b64 %rd67, %rd62, 4611686293322072064; 2026-02-21T09:10:45.5713866Z add.s32 %r146, %r41, 16416; 2026-02-21T09:10:45.5713933Z bfe.u32 %r147, %r146, 4, 14; 2026-02-21T09:10:45.5713992Z cvt.u64.u32 %rd63, %r147; 2026-02-21T09:10:45.5714059Z or.b64 %rd68, %rd63, 4611686293322072064; 2026-02-21T09:10:45.5714153Z add.s32 %r148, %r41, 16448; 2026-02-21T09:10:45.5714212Z bfe.u32 %r149, %r148, 4, 14; 2026-02-21T09:10:45.5714270Z cvt.u64.u32 %rd64, %r149; 2026-02-21T09:10:45.5714344Z or.b64 %rd69, %rd64, 4611686293322072064; 2026-02-21T09:10:45.5714401Z add.s32 %r150, %r41, 16480; 2026-02-21T09:10:45.5714458Z bfe.u32 %r151, %r150, 4, 14; 2026-02-21T09:10:45.5714516Z cvt.u64.u32 %rd65, %r151; 2026-02-21T09:10:45.5714588Z or.b64 %rd70, %rd65, 4611686293322072064; 2026-02-21T09:10:45.5714645Z or.b32 %r17, %r5, 64; 2026-02-21T09:10:45.5714703Z and.b32 %r152, %r124, 48; 2026-02-21T09:10:45.5714768Z xor.b32 %r153, %r152, %r131; 2026-02-21T09:10:45.5714827Z or.b32 %r154, %r153, %r130; 2026-02-21T09:10:45.5714883Z xor.b32 %r155, %r154, 16; 2026-02-21T09:10:45.5714941Z add.s32 %r156, %r41, %r11; 2026-02-21T09:10:45.5715008Z xor.b32 %r20, %r12, 16; 2026-02-21T09:10:45.5715067Z add.s32 %r92, %r41, 20480; 2026-02-21T09:10:45.5715125Z add.s32 %r157, %r92, %r20; 2026-02-21T09:10:45.5715190Z add.s32 %r158, %r92, %r12; 2026-02-21T09:10:45.5715250Z add.s32 %r96, %r201, 8192; 2026-02-21T09:10:45.5715310Z add.s32 %r98, %r201, 12288; 
2026-02-21T09:10:45.5715490Z .loc 1 38 33 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:38:33 2026-02-21T09:10:45.5715559Z shr.u32 %r159, %r3, 8; 2026-02-21T09:10:45.5715617Z and.b32 %r160, %r159, 16; 2026-02-21T09:10:45.5715796Z .loc 1 40 64 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:40:64 2026-02-21T09:10:45.5715866Z and.b32 %r21, %r3, 15; 2026-02-21T09:10:45.5716042Z .loc 1 40 30 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:40:30 2026-02-21T09:10:45.5716102Z or.b32 %r161, %r160, %r21; 2026-02-21T09:10:45.5716283Z .loc 1 42 27 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:42:27 2026-02-21T09:10:45.5716345Z shl.b32 %r319, %r161, 7; 2026-02-21T09:10:45.5716523Z .loc 1 43 32 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:43:32 2026-02-21T09:10:45.5716591Z or.b32 %r162, %r319, %r6; 2026-02-21T09:10:45.5716765Z .loc 1 44 27 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:44:27 2026-02-21T09:10:45.5716824Z shl.b32 %r163, %r3, 1; 2026-02-21T09:10:45.5716885Z and.b32 %r318, %r163, 8160; 2026-02-21T09:10:45.5717067Z .loc 1 58 53 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:53 2026-02-21T09:10:45.5717126Z shl.b32 %r164, %r162, 10; 2026-02-21T09:10:45.5717212Z $L__tmp0: 2026-02-21T09:10:45.5717457Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5717536Z shfl.sync.idx.b32 %r24, %r126, 0, 31, -1; 2026-02-21T09:10:45.5717596Z shl.b32 %r165, %r24, 21; 2026-02-21T09:10:45.5717669Z and.b32 %r166, %r165, 6291456; 2026-02-21T09:10:45.5717728Z shl.b32 %r167, %r24, 2; 2026-02-21T09:10:45.5717789Z and.b32 %r168, %r167, 16; 2026-02-21T09:10:45.5717873Z or.b32 %r169, %r168, %r166; 2026-02-21T09:10:45.5717943Z add.s32 %r317, %r169, %r346; 2026-02-21T09:10:45.5718008Z mov.pred %p57, -1; 2026-02-21T09:10:45.5718066Z // begin inline asm 2026-02-21T09:10:45.5718342Z @%p57 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r317 + 0], 
{%r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51, %r51}; 2026-02-21T09:10:45.5718401Z // end inline asm 2026-02-21T09:10:45.5718461Z // begin inline asm 2026-02-21T09:10:45.5718544Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.5718601Z // end inline asm 2026-02-21T09:10:45.5718658Z bar.sync 0; 2026-02-21T09:10:45.5718737Z $L__tmp1: 2026-02-21T09:10:45.5718931Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5718992Z add.s32 %r348, %r41, 21504; 2026-02-21T09:10:45.5719051Z // begin inline asm 2026-02-21T09:10:45.5719173Z @%p90 mbarrier.init.shared::cta.b64 [%r348], 1; 2026-02-21T09:10:45.5719235Z // end inline asm 2026-02-21T09:10:45.5719292Z bar.sync 0; 2026-02-21T09:10:45.5719353Z add.s32 %r84, %r41, 21512; 2026-02-21T09:10:45.5719420Z // begin inline asm 2026-02-21T09:10:45.5719510Z @%p90 mbarrier.init.shared::cta.b64 [%r84], 1; 2026-02-21T09:10:45.5719568Z // end inline asm 2026-02-21T09:10:45.5719636Z add.s32 %r205, %r41, 21520; 2026-02-21T09:10:45.5719695Z // begin inline asm 2026-02-21T09:10:45.5719782Z @%p90 mbarrier.init.shared::cta.b64 [%r205], 1; 2026-02-21T09:10:45.5719841Z // end inline asm 2026-02-21T09:10:45.5719909Z bar.sync 0; 2026-02-21T09:10:45.5719972Z add.s32 %r86, %r41, 21528; 2026-02-21T09:10:45.5720033Z // begin inline asm 2026-02-21T09:10:45.5720124Z @%p90 mbarrier.init.shared::cta.b64 [%r86], 1; 2026-02-21T09:10:45.5720181Z // end inline asm 2026-02-21T09:10:45.5720357Z .loc 1 58 60 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:60 2026-02-21T09:10:45.5720425Z or.b32 %r170, %r164, %r5; 2026-02-21T09:10:45.5720600Z .loc 1 58 32 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:32 2026-02-21T09:10:45.5720671Z mad.wide.u32 %rd56, %r170, 2, %rd17; 2026-02-21T09:10:45.5720735Z cvt.u64.u32 %rd7, %r164; 2026-02-21T09:10:45.5720806Z add.s64 %rd57, %rd56, 131072; 2026-02-21T09:10:45.5720974Z .loc 1 58 80 // 
cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:80 2026-02-21T09:10:45.5721033Z // begin inline asm 2026-02-21T09:10:45.5721163Z cp.async.cg.shared.global [ %r201 + 0 ], [ %rd56 + 0 ], 0x10, %r88; 2026-02-21T09:10:45.5721223Z // end inline asm 2026-02-21T09:10:45.5721284Z // begin inline asm 2026-02-21T09:10:45.5721407Z cp.async.cg.shared.global [ %r203 + 0 ], [ %rd57 + 0 ], 0x10, %r88; 2026-02-21T09:10:45.5721465Z // end inline asm 2026-02-21T09:10:45.5721530Z cp.async.commit_group; 2026-02-21T09:10:45.5721765Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5721832Z bar.sync 0; 2026-02-21T09:10:45.5721891Z // begin inline asm 2026-02-21T09:10:45.5722008Z @%p90 mbarrier.arrive.expect_tx.shared.b64 _, [%r205], 512; 2026-02-21T09:10:45.5722073Z // end inline asm 2026-02-21T09:10:45.5722250Z .loc 1 64 33 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:64:33 2026-02-21T09:10:45.5722306Z bar.sync 0; 2026-02-21T09:10:45.5722376Z elect.sync %r171|%p52, -1; 2026-02-21T09:10:45.5722452Z and.pred %p46, %p1, %p52; 2026-02-21T09:10:45.5722548Z // begin inline asm 2026-02-21T09:10:45.5722804Z @%p46 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r92], [%rd74, {%r318, %r51}], [%r205]; 2026-02-21T09:10:45.5722875Z // end inline asm 2026-02-21T09:10:45.5723050Z .loc 1 58 32 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:32 2026-02-21T09:10:45.5723113Z add.s64 %rd59, %rd56, 64; 2026-02-21T09:10:45.5723185Z or.b32 %r172, %r170, 32; 2026-02-21T09:10:45.5723289Z mad.wide.u32 %rd66, %r172, 2, %rd17; 2026-02-21T09:10:45.5723354Z add.s64 %rd60, %rd66, 131072; 2026-02-21T09:10:45.5723539Z .loc 1 58 80 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:80 2026-02-21T09:10:45.5723599Z // begin inline asm 2026-02-21T09:10:45.5723719Z cp.async.cg.shared.global [ %r96 + 0 ], [ %rd59 + 0 ], 0x10, %r88; 2026-02-21T09:10:45.5723777Z // end inline asm 
2026-02-21T09:10:45.5723843Z // begin inline asm 2026-02-21T09:10:45.5723962Z cp.async.cg.shared.global [ %r98 + 0 ], [ %rd60 + 0 ], 0x10, %r88; 2026-02-21T09:10:45.5724020Z // end inline asm 2026-02-21T09:10:45.5724118Z cp.async.commit_group; 2026-02-21T09:10:45.5724310Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5724365Z bar.sync 0; 2026-02-21T09:10:45.5724420Z // begin inline asm 2026-02-21T09:10:45.5724563Z @%p90 mbarrier.arrive.expect_tx.shared.b64 _, [%r86], 512; 2026-02-21T09:10:45.5724621Z // end inline asm 2026-02-21T09:10:45.5724784Z .loc 1 64 33 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:64:33 2026-02-21T09:10:45.5724844Z bar.sync 0; 2026-02-21T09:10:45.5724908Z elect.sync %r173|%p53, -1; 2026-02-21T09:10:45.5724971Z and.pred %p48, %p1, %p53; 2026-02-21T09:10:45.5725036Z add.s32 %r101, %r41, 20992; 2026-02-21T09:10:45.5725091Z // begin inline asm 2026-02-21T09:10:45.5725327Z @%p48 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r101], [%rd74, {%r318, %r88}], [%r86]; 2026-02-21T09:10:45.5725382Z // end inline asm 2026-02-21T09:10:45.5725556Z .loc 1 58 80 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:80 2026-02-21T09:10:45.5725619Z cp.async.wait_group 1; 2026-02-21T09:10:45.5725671Z bar.sync 0; 2026-02-21T09:10:45.5725770Z ld.shared.v4.b32 {%r174, %r175, %r176, %r177}, [%r156]; 2026-02-21T09:10:45.5725833Z mov.b32 {%rs1, %rs2}, %r177; 2026-02-21T09:10:45.5725894Z mov.b32 {%rs3, %rs4}, %r176; 2026-02-21T09:10:45.5725958Z mov.b32 {%rs5, %rs6}, %r175; 2026-02-21T09:10:45.5726017Z mov.b32 {%rs7, %rs8}, %r174; 2026-02-21T09:10:45.5726115Z ld.shared.v4.b32 {%r178, %r179, %r180, %r181}, [%r156+16]; 2026-02-21T09:10:45.5726175Z mov.b32 {%rs9, %rs10}, %r181; 2026-02-21T09:10:45.5726244Z mov.b32 {%rs11, %rs12}, %r180; 2026-02-21T09:10:45.5726303Z mov.b32 {%rs13, %rs14}, %r179; 2026-02-21T09:10:45.5726361Z mov.b32 {%rs15, %rs16}, %r178; 
2026-02-21T09:10:45.5726533Z .loc 1 62 32 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:62:32 2026-02-21T09:10:45.5726594Z cvt.f32.bf16 %r106, %rs7; 2026-02-21T09:10:45.5726652Z cvt.f32.bf16 %r107, %rs8; 2026-02-21T09:10:45.5726717Z cvt.f32.bf16 %r108, %rs5; 2026-02-21T09:10:45.5726774Z cvt.f32.bf16 %r109, %rs6; 2026-02-21T09:10:45.5726831Z cvt.f32.bf16 %r110, %rs3; 2026-02-21T09:10:45.5726888Z cvt.f32.bf16 %r111, %rs4; 2026-02-21T09:10:45.5726954Z cvt.f32.bf16 %r112, %rs1; 2026-02-21T09:10:45.5727012Z cvt.f32.bf16 %r113, %rs2; 2026-02-21T09:10:45.5727070Z cvt.f32.bf16 %r114, %rs15; 2026-02-21T09:10:45.5727135Z cvt.f32.bf16 %r115, %rs16; 2026-02-21T09:10:45.5727193Z cvt.f32.bf16 %r116, %rs13; 2026-02-21T09:10:45.5727250Z cvt.f32.bf16 %r117, %rs14; 2026-02-21T09:10:45.5727306Z cvt.f32.bf16 %r118, %rs11; 2026-02-21T09:10:45.5727370Z cvt.f32.bf16 %r119, %rs12; 2026-02-21T09:10:45.5727427Z cvt.f32.bf16 %r120, %rs9; 2026-02-21T09:10:45.5727484Z cvt.f32.bf16 %r121, %rs10; 2026-02-21T09:10:45.5727573Z $L__tmp2: 2026-02-21T09:10:45.5727796Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5727855Z add.s32 %r225, %r169, %r187; 2026-02-21T09:10:45.5727911Z // begin inline asm 2026-02-21T09:10:45.5728204Z @%p57 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r225 + 0], {%r106, %r107, %r108, %r109, %r110, %r111, %r112, %r113, %r114, %r115, %r116, %r117, %r118, %r119, %r120, %r121}; 2026-02-21T09:10:45.5728288Z // end inline asm 2026-02-21T09:10:45.5728345Z // begin inline asm 2026-02-21T09:10:45.5728423Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.5728479Z // end inline asm 2026-02-21T09:10:45.5728535Z bar.sync 0; 2026-02-21T09:10:45.5728597Z $L__tmp3: 2026-02-21T09:10:45.5728777Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5728837Z // begin inline asm 2026-02-21T09:10:45.5728891Z 2026-02-21T09:10:45.5728952Z { 
2026-02-21T09:10:45.5729016Z .reg .pred complete; 2026-02-21T09:10:45.5729074Z waitLoop: 2026-02-21T09:10:45.5729222Z mbarrier.try_wait.parity.shared.b64 complete, [%r205], %r51; 2026-02-21T09:10:45.5729289Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.5729340Z } 2026-02-21T09:10:45.5729345Z 2026-02-21T09:10:45.5729400Z // end inline asm 2026-02-21T09:10:45.5729612Z .loc 1 64 33 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:64:33 2026-02-21T09:10:45.5729678Z ld.shared.b8 %rs17, [%r158]; 2026-02-21T09:10:45.5729746Z ld.shared.b8 %rs18, [%r158+256]; 2026-02-21T09:10:45.5729819Z ld.shared.b8 %rs19, [%r157+128]; 2026-02-21T09:10:45.5729881Z ld.shared.b8 %rs20, [%r157+384]; 2026-02-21T09:10:45.5730053Z .loc 1 67 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:67:28 2026-02-21T09:10:45.5730121Z shl.b16 %rs21, %rs17, 4; 2026-02-21T09:10:45.5730181Z shl.b16 %rs22, %rs19, 4; 2026-02-21T09:10:45.5730241Z shl.b16 %rs23, %rs18, 4; 2026-02-21T09:10:45.5730298Z shl.b16 %rs24, %rs20, 4; 2026-02-21T09:10:45.5730475Z .loc 1 82 58 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:82:58 2026-02-21T09:10:45.5730545Z selp.b16 %rs25, %rs21, %rs17, %p50; 2026-02-21T09:10:45.5730604Z cvt.s16.s8 %rs26, %rs25; 2026-02-21T09:10:45.5730668Z shr.s16 %rs27, %rs26, 4; 2026-02-21T09:10:45.5730735Z selp.b16 %rs28, %rs22, %rs19, %p50; 2026-02-21T09:10:45.5730795Z cvt.s16.s8 %rs29, %rs28; 2026-02-21T09:10:45.5730861Z shr.s16 %rs30, %rs29, 4; 2026-02-21T09:10:45.5730924Z selp.b16 %rs31, %rs23, %rs18, %p50; 2026-02-21T09:10:45.5730983Z cvt.s16.s8 %rs32, %rs31; 2026-02-21T09:10:45.5731039Z shr.s16 %rs33, %rs32, 4; 2026-02-21T09:10:45.5731111Z selp.b16 %rs34, %rs24, %rs20, %p50; 2026-02-21T09:10:45.5731169Z cvt.s16.s8 %rs35, %rs34; 2026-02-21T09:10:45.5731225Z shr.s16 %rs36, %rs35, 4; 2026-02-21T09:10:45.5731400Z .loc 1 87 32 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:87:32 2026-02-21T09:10:45.5731462Z cvt.rn.f32.s16 %r182, %rs27; 
2026-02-21T09:10:45.5731525Z cvt.rn.f32.s16 %r183, %rs30; 2026-02-21T09:10:45.5731625Z cvt.rn.f32.s16 %r184, %rs33; 2026-02-21T09:10:45.5731692Z cvt.rn.f32.s16 %r185, %rs36; 2026-02-21T09:10:45.5731752Z st.shared.b32 [%r13], %r182; 2026-02-21T09:10:45.5731811Z st.shared.b32 [%r14], %r183; 2026-02-21T09:10:45.5731877Z st.shared.b32 [%r15], %r184; 2026-02-21T09:10:45.5731937Z st.shared.b32 [%r16], %r185; 2026-02-21T09:10:45.5731991Z $L__tmp4: 2026-02-21T09:10:45.5732219Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5732275Z // begin inline asm 2026-02-21T09:10:45.5732347Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.5732402Z // end inline asm 2026-02-21T09:10:45.5732462Z bar.sync 0; 2026-02-21T09:10:45.5732524Z setp.ne.b32 %p54, %r24, 0; 2026-02-21T09:10:45.5732582Z @%p54 bra $L__BB0_3; 2026-02-21T09:10:45.5732668Z // %bb.2: 2026-02-21T09:10:45.5732733Z elect.sync %r198|%p56, -1; 2026-02-21T09:10:45.5732794Z mov.b32 %r188, 134744336; 2026-02-21T09:10:45.5732854Z mov.pred %p55, 0; 2026-02-21T09:10:45.5732919Z // begin inline asm 2026-02-21T09:10:45.5733077Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r346 + 0 ], [ %r187 + 0 ], %rd67, %r188, %p55; 2026-02-21T09:10:45.5733133Z // end inline asm 2026-02-21T09:10:45.5733196Z // begin inline asm 2026-02-21T09:10:45.5733373Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r346 + 0 ], [ %r187 + 8 ], %rd68, %r188, %p57; 2026-02-21T09:10:45.5733427Z // end inline asm 2026-02-21T09:10:45.5733489Z // begin inline asm 2026-02-21T09:10:45.5733635Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r346 + 0 ], [ %r187 + 16 ], %rd69, %r188, %p57; 2026-02-21T09:10:45.5733690Z // end inline asm 2026-02-21T09:10:45.5733747Z // begin inline asm 2026-02-21T09:10:45.5733897Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r346 + 0 ], [ %r187 + 24 ], %rd70, %r188, %p57; 2026-02-21T09:10:45.5733953Z // end inline asm 2026-02-21T09:10:45.5734014Z add.s32 %r200, %r41, 
21504; 2026-02-21T09:10:45.5734109Z cvt.u64.u32 %rd71, %r200; 2026-02-21T09:10:45.5734167Z // begin inline asm 2026-02-21T09:10:45.5734289Z @%p56 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd71]; 2026-02-21T09:10:45.5734350Z // end inline asm 2026-02-21T09:10:45.5734401Z $L__tmp5: 2026-02-21T09:10:45.5734455Z $L__BB0_3: 2026-02-21T09:10:45.5734569Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:45.5734640Z add.s32 %r18, %r41, %r154; 2026-02-21T09:10:45.5734698Z add.s32 %r19, %r41, %r155; 2026-02-21T09:10:45.5734874Z .loc 1 58 32 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:32 2026-02-21T09:10:45.5734939Z add.s64 %rd72, %rd56, 128; 2026-02-21T09:10:45.5734998Z cvt.u64.u32 %rd76, %r17; 2026-02-21T09:10:45.5735059Z add.s64 %rd77, %rd7, %rd76; 2026-02-21T09:10:45.5735117Z shl.b64 %rd78, %rd77, 1; 2026-02-21T09:10:45.5735186Z add.s64 %rd79, %rd17, %rd78; 2026-02-21T09:10:45.5735246Z add.s64 %rd73, %rd79, 131072; 2026-02-21T09:10:45.5735304Z mov.b32 %r202, 16; 2026-02-21T09:10:45.5735477Z .loc 1 58 80 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:80 2026-02-21T09:10:45.5735534Z // begin inline asm 2026-02-21T09:10:45.5735651Z cp.async.cg.shared.global [ %r201 + 0 ], [ %rd72 + 0 ], 0x10, %r202; 2026-02-21T09:10:45.5735713Z // end inline asm 2026-02-21T09:10:45.5735769Z // begin inline asm 2026-02-21T09:10:45.5735883Z cp.async.cg.shared.global [ %r203 + 0 ], [ %rd73 + 0 ], 0x10, %r202; 2026-02-21T09:10:45.5735937Z // end inline asm 2026-02-21T09:10:45.5736007Z cp.async.commit_group; 2026-02-21T09:10:45.5736188Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5736244Z // begin inline asm 2026-02-21T09:10:45.5736361Z @%p90 mbarrier.arrive.expect_tx.shared.b64 _, [%r205], 512; 2026-02-21T09:10:45.5736418Z // end inline asm 2026-02-21T09:10:45.5736590Z .loc 1 64 33 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:64:33 2026-02-21T09:10:45.5736654Z bar.sync 0; 
2026-02-21T09:10:45.5736718Z elect.sync %r214|%p67, -1; 2026-02-21T09:10:45.5736784Z and.pred %p65, %p1, %p67; 2026-02-21T09:10:45.5736842Z mov.b32 %r208, 32; 2026-02-21T09:10:45.5736908Z // begin inline asm 2026-02-21T09:10:45.5737147Z @%p65 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r92], [%rd74, {%r318, %r208}], [%r205]; 2026-02-21T09:10:45.5737204Z // end inline asm 2026-02-21T09:10:45.5737381Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5737441Z cvt.u16.u32 %rs37, %r3; 2026-02-21T09:10:45.5737501Z shr.u16 %rs38, %rs37, 12; 2026-02-21T09:10:45.5737568Z and.b16 %rs39, %rs38, 1; 2026-02-21T09:10:45.5737632Z mul.wide.u16 %r215, %rs39, 2048; 2026-02-21T09:10:45.5737714Z shl.b32 %r216, %r21, 7; 2026-02-21T09:10:45.5737774Z or.b32 %r217, %r215, %r216; 2026-02-21T09:10:45.5737842Z or.b32 %r218, %r217, %r6; 2026-02-21T09:10:45.5737900Z shl.b32 %r219, %r218, 10; 2026-02-21T09:10:45.5737957Z or.b32 %r220, %r219, %r5; 2026-02-21T09:10:45.5738022Z or.b32 %r221, %r220, 65632; 2026-02-21T09:10:45.5738090Z mad.wide.u32 %rd126, %r221, 2, %rd17; 2026-02-21T09:10:45.5738154Z mul.wide.u32 %rd80, %r218, 2048; 2026-02-21T09:10:45.5738214Z and.b32 %r222, %r1, 3; 2026-02-21T09:10:45.5738307Z mul.wide.u32 %rd81, %r222, 16; 2026-02-21T09:10:45.5738366Z or.b64 %rd82, %rd80, %rd81; 2026-02-21T09:10:45.5738425Z add.s64 %rd83, %rd82, %rd17; 2026-02-21T09:10:45.5738491Z add.s64 %rd125, %rd83, 192; 2026-02-21T09:10:45.5738546Z mov.b32 %r351, 1; 2026-02-21T09:10:45.5738601Z mov.b32 %r347, 0; 2026-02-21T09:10:45.5738657Z mov.b64 %rd127, 0; 2026-02-21T09:10:45.5738722Z mov.b32 %r349, %r347; 2026-02-21T09:10:45.5738780Z mov.b32 %r350, %r347; 2026-02-21T09:10:45.5738837Z mov.b32 %r352, %r347; 2026-02-21T09:10:45.5738900Z bra.uni $L__BB0_4; 2026-02-21T09:10:45.5739023Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:10:45.5739196Z .loc 1 51 125 // 
cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5739268Z setp.lt.u64 %p83, %rd127, 464; 2026-02-21T09:10:45.5739321Z $L__tmp6: 2026-02-21T09:10:45.5739560Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5739623Z add.s32 %r291, %r351, 1; 2026-02-21T09:10:45.5739693Z setp.gt.s32 %p86, %r291, 1; 2026-02-21T09:10:45.5739756Z selp.b32 %r351, 0, %r291, %p86; 2026-02-21T09:10:45.5739816Z selp.b32 %r292, 1, 0, %p86; 2026-02-21T09:10:45.5739881Z xor.b32 %r40, %r352, %r292; 2026-02-21T09:10:45.5739933Z $L__tmp7: 2026-02-21T09:10:45.5740107Z .loc 1 58 80 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:80 2026-02-21T09:10:45.5740167Z add.s32 %r282, %r34, %r7; 2026-02-21T09:10:45.5740234Z selp.b32 %r283, 16, 0, %p83; 2026-02-21T09:10:45.5740291Z // begin inline asm 2026-02-21T09:10:45.5740409Z cp.async.cg.shared.global [ %r282 + 0 ], [ %rd125 + 0 ], 0x10, %r283; 2026-02-21T09:10:45.5740471Z // end inline asm 2026-02-21T09:10:45.5740530Z add.s32 %r284, %r282, 4096; 2026-02-21T09:10:45.5740586Z // begin inline asm 2026-02-21T09:10:45.5740709Z cp.async.cg.shared.global [ %r284 + 0 ], [ %rd126 + 0 ], 0x10, %r283; 2026-02-21T09:10:45.5740766Z // end inline asm 2026-02-21T09:10:45.5740827Z cp.async.commit_group; 2026-02-21T09:10:45.5741000Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5741073Z and.pred %p81, %p90, %p83; 2026-02-21T09:10:45.5741128Z // begin inline asm 2026-02-21T09:10:45.5741235Z @%p81 mbarrier.arrive.expect_tx.shared.b64 _, [%r286], 512; 2026-02-21T09:10:45.5741295Z // end inline asm 2026-02-21T09:10:45.5741467Z .loc 1 64 33 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:64:33 2026-02-21T09:10:45.5741523Z bar.sync 0; 2026-02-21T09:10:45.5741614Z elect.sync %r293|%p87, -1; 2026-02-21T09:10:45.5741683Z and.pred %p88, %p83, %p87; 2026-02-21T09:10:45.5741744Z and.pred %p82, %p1, 
%p88; 2026-02-21T09:10:45.5741803Z cvt.u32.u64 %r294, %rd127; 2026-02-21T09:10:45.5741867Z add.s32 %r289, %r294, 48; 2026-02-21T09:10:45.5741924Z // begin inline asm 2026-02-21T09:10:45.5742159Z @%p82 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r287], [%rd74, {%r318, %r289}], [%r286]; 2026-02-21T09:10:45.5742220Z // end inline asm 2026-02-21T09:10:45.5742389Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5742448Z add.s64 %rd126, %rd126, 64; 2026-02-21T09:10:45.5742506Z add.s64 %rd125, %rd125, 64; 2026-02-21T09:10:45.5742577Z setp.lt.u64 %p89, %rd127, 480; 2026-02-21T09:10:45.5742662Z add.s64 %rd127, %rd127, 16; 2026-02-21T09:10:45.5742719Z mov.b32 %r347, %r352; 2026-02-21T09:10:45.5742786Z mov.b32 %r352, %r40; 2026-02-21T09:10:45.5742844Z @%p89 bra $L__BB0_4; 2026-02-21T09:10:45.5742901Z bra.uni $L__BB0_7; 2026-02-21T09:10:45.5743013Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:10:45.5743074Z add.s32 %r244, %r350, 1; 2026-02-21T09:10:45.5743137Z setp.gt.s32 %p71, %r244, 1; 2026-02-21T09:10:45.5743244Z selp.b32 %r350, 0, %r244, %p71; 2026-02-21T09:10:45.5743418Z .loc 1 58 80 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:58:80 2026-02-21T09:10:45.5743481Z cp.async.wait_group 1; 2026-02-21T09:10:45.5743535Z bar.sync 0; 2026-02-21T09:10:45.5743599Z shl.b32 %r245, %r350, 13; 2026-02-21T09:10:45.5743658Z add.s32 %r34, %r41, %r245; 2026-02-21T09:10:45.5743716Z add.s32 %r247, %r34, %r11; 2026-02-21T09:10:45.5743809Z ld.shared.v4.b32 {%r248, %r249, %r250, %r251}, [%r247]; 2026-02-21T09:10:45.5743879Z mov.b32 {%rs40, %rs41}, %r251; 2026-02-21T09:10:45.5743971Z mov.b32 {%rs42, %rs43}, %r250; 2026-02-21T09:10:45.5744031Z mov.b32 {%rs44, %rs45}, %r249; 2026-02-21T09:10:45.5744098Z mov.b32 {%rs46, %rs47}, %r248; 2026-02-21T09:10:45.5744195Z ld.shared.v4.b32 {%r252, %r253, %r254, %r255}, [%r247+16]; 2026-02-21T09:10:45.5744254Z mov.b32 {%rs48, %rs49}, %r255; 
2026-02-21T09:10:45.5744347Z mov.b32 {%rs50, %rs51}, %r254; 2026-02-21T09:10:45.5744409Z mov.b32 {%rs52, %rs53}, %r253; 2026-02-21T09:10:45.5744469Z mov.b32 {%rs54, %rs55}, %r252; 2026-02-21T09:10:45.5744642Z .loc 1 62 32 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:62:32 2026-02-21T09:10:45.5744712Z cvt.f32.bf16 %r226, %rs46; 2026-02-21T09:10:45.5744774Z cvt.f32.bf16 %r227, %rs47; 2026-02-21T09:10:45.5744839Z cvt.f32.bf16 %r228, %rs44; 2026-02-21T09:10:45.5744924Z cvt.f32.bf16 %r229, %rs45; 2026-02-21T09:10:45.5744982Z cvt.f32.bf16 %r230, %rs42; 2026-02-21T09:10:45.5745040Z cvt.f32.bf16 %r231, %rs43; 2026-02-21T09:10:45.5745096Z cvt.f32.bf16 %r232, %rs40; 2026-02-21T09:10:45.5745164Z cvt.f32.bf16 %r233, %rs41; 2026-02-21T09:10:45.5745220Z cvt.f32.bf16 %r234, %rs54; 2026-02-21T09:10:45.5745276Z cvt.f32.bf16 %r235, %rs55; 2026-02-21T09:10:45.5745339Z cvt.f32.bf16 %r236, %rs52; 2026-02-21T09:10:45.5745396Z cvt.f32.bf16 %r237, %rs53; 2026-02-21T09:10:45.5745453Z cvt.f32.bf16 %r238, %rs50; 2026-02-21T09:10:45.5745511Z cvt.f32.bf16 %r239, %rs51; 2026-02-21T09:10:45.5745575Z cvt.f32.bf16 %r240, %rs48; 2026-02-21T09:10:45.5745633Z cvt.f32.bf16 %r241, %rs49; 2026-02-21T09:10:45.5745686Z $L__tmp8: 2026-02-21T09:10:45.5745911Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5745967Z // begin inline asm 2026-02-21T09:10:45.5746019Z 2026-02-21T09:10:45.5746076Z { 2026-02-21T09:10:45.5746138Z .reg .pred complete; 2026-02-21T09:10:45.5746194Z waitLoop: 2026-02-21T09:10:45.5746311Z mbarrier.try_wait.parity.shared.b64 complete, [%r348], %r347; 2026-02-21T09:10:45.5746386Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.5746437Z } 2026-02-21T09:10:45.5746441Z 2026-02-21T09:10:45.5746496Z // end inline asm 2026-02-21T09:10:45.5746556Z $L__tmp9: 2026-02-21T09:10:45.5746734Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5746795Z selp.b32 
%r256, 1, 0, %p71; 2026-02-21T09:10:45.5746855Z xor.b32 %r349, %r349, %r256; 2026-02-21T09:10:45.5746921Z mov.pred %p72, -1; 2026-02-21T09:10:45.5746975Z $L__tmp10: 2026-02-21T09:10:45.5747195Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5747259Z // begin inline asm 2026-02-21T09:10:45.5747537Z @%p72 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r225 + 0], {%r226, %r227, %r228, %r229, %r230, %r231, %r232, %r233, %r234, %r235, %r236, %r237, %r238, %r239, %r240, %r241}; 2026-02-21T09:10:45.5747620Z // end inline asm 2026-02-21T09:10:45.5747684Z // begin inline asm 2026-02-21T09:10:45.5747757Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.5747812Z // end inline asm 2026-02-21T09:10:45.5747866Z bar.sync 0; 2026-02-21T09:10:45.5747926Z $L__tmp11: 2026-02-21T09:10:45.5748098Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5748159Z shl.b32 %r257, %r350, 3; 2026-02-21T09:10:45.5748249Z add.s32 %r258, %r41, %r257; 2026-02-21T09:10:45.5748310Z add.s32 %r286, %r258, 21520; 2026-02-21T09:10:45.5748366Z // begin inline asm 2026-02-21T09:10:45.5748424Z 2026-02-21T09:10:45.5748475Z { 2026-02-21T09:10:45.5748536Z .reg .pred complete; 2026-02-21T09:10:45.5748590Z waitLoop: 2026-02-21T09:10:45.5748714Z mbarrier.try_wait.parity.shared.b64 complete, [%r286], %r349; 2026-02-21T09:10:45.5748779Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.5748829Z } 2026-02-21T09:10:45.5748834Z 2026-02-21T09:10:45.5748895Z // end inline asm 2026-02-21T09:10:45.5749101Z .loc 1 64 33 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:64:33 2026-02-21T09:10:45.5749161Z shl.b32 %r259, %r350, 9; 2026-02-21T09:10:45.5749219Z add.s32 %r260, %r41, %r259; 2026-02-21T09:10:45.5749285Z add.s32 %r287, %r260, 20480; 2026-02-21T09:10:45.5749342Z add.s32 %r261, %r287, %r12; 2026-02-21T09:10:45.5749425Z ld.shared.b8 %rs56, [%r261]; 2026-02-21T09:10:45.5749498Z ld.shared.b8 
%rs57, [%r261+256]; 2026-02-21T09:10:45.5749557Z add.s32 %r262, %r287, %r20; 2026-02-21T09:10:45.5749620Z ld.shared.b8 %rs58, [%r262+128]; 2026-02-21T09:10:45.5749681Z ld.shared.b8 %rs59, [%r262+384]; 2026-02-21T09:10:45.5749856Z .loc 1 67 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:67:28 2026-02-21T09:10:45.5749914Z shl.b16 %rs60, %rs56, 4; 2026-02-21T09:10:45.5749971Z shl.b16 %rs61, %rs58, 4; 2026-02-21T09:10:45.5750040Z shl.b16 %rs62, %rs57, 4; 2026-02-21T09:10:45.5750096Z shl.b16 %rs63, %rs59, 4; 2026-02-21T09:10:45.5750266Z .loc 1 82 58 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:82:58 2026-02-21T09:10:45.5750340Z selp.b16 %rs64, %rs60, %rs56, %p50; 2026-02-21T09:10:45.5750400Z cvt.s16.s8 %rs65, %rs64; 2026-02-21T09:10:45.5750457Z shr.s16 %rs66, %rs65, 4; 2026-02-21T09:10:45.5750523Z selp.b16 %rs67, %rs61, %rs58, %p50; 2026-02-21T09:10:45.5750591Z cvt.s16.s8 %rs68, %rs67; 2026-02-21T09:10:45.5750648Z shr.s16 %rs69, %rs68, 4; 2026-02-21T09:10:45.5750711Z selp.b16 %rs70, %rs62, %rs57, %p50; 2026-02-21T09:10:45.5750777Z cvt.s16.s8 %rs71, %rs70; 2026-02-21T09:10:45.5750834Z shr.s16 %rs72, %rs71, 4; 2026-02-21T09:10:45.5750895Z selp.b16 %rs73, %rs63, %rs59, %p50; 2026-02-21T09:10:45.5750952Z cvt.s16.s8 %rs74, %rs73; 2026-02-21T09:10:45.5751015Z shr.s16 %rs75, %rs74, 4; 2026-02-21T09:10:45.5751184Z .loc 1 87 32 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:87:32 2026-02-21T09:10:45.5751247Z cvt.rn.f32.s16 %r263, %rs66; 2026-02-21T09:10:45.5751318Z cvt.rn.f32.s16 %r264, %rs69; 2026-02-21T09:10:45.5751378Z cvt.rn.f32.s16 %r265, %rs72; 2026-02-21T09:10:45.5751437Z cvt.rn.f32.s16 %r266, %rs75; 2026-02-21T09:10:45.5751505Z st.shared.b32 [%r13], %r263; 2026-02-21T09:10:45.5751602Z st.shared.b32 [%r14], %r264; 2026-02-21T09:10:45.5751663Z st.shared.b32 [%r15], %r265; 2026-02-21T09:10:45.5751722Z st.shared.b32 [%r16], %r266; 2026-02-21T09:10:45.5751904Z .loc 1 51 125 // 
cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5751962Z shl.b32 %r267, %r351, 3; 2026-02-21T09:10:45.5752020Z add.s32 %r268, %r41, %r267; 2026-02-21T09:10:45.5752088Z add.s32 %r348, %r268, 21504; 2026-02-21T09:10:45.5752143Z $L__tmp12: 2026-02-21T09:10:45.5752363Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5752467Z // begin inline asm 2026-02-21T09:10:45.5752542Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.5752602Z // end inline asm 2026-02-21T09:10:45.5752659Z bar.sync 0; 2026-02-21T09:10:45.5752731Z @%p54 bra $L__BB0_6; 2026-02-21T09:10:45.5752835Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:10:45.5752902Z elect.sync %r281|%p73, -1; 2026-02-21T09:10:45.5752977Z mov.b32 %r271, 134744336; 2026-02-21T09:10:45.5753063Z // begin inline asm 2026-02-21T09:10:45.5753215Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r346 + 0 ], [ %r187 + 0 ], %rd67, %r271, %p72; 2026-02-21T09:10:45.5753276Z // end inline asm 2026-02-21T09:10:45.5753332Z // begin inline asm 2026-02-21T09:10:45.5753476Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r346 + 0 ], [ %r187 + 8 ], %rd68, %r271, %p72; 2026-02-21T09:10:45.5753532Z // end inline asm 2026-02-21T09:10:45.5753594Z // begin inline asm 2026-02-21T09:10:45.5753740Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r346 + 0 ], [ %r187 + 16 ], %rd69, %r271, %p72; 2026-02-21T09:10:45.5753797Z // end inline asm 2026-02-21T09:10:45.5753887Z // begin inline asm 2026-02-21T09:10:45.5754032Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r346 + 0 ], [ %r187 + 24 ], %rd70, %r271, %p72; 2026-02-21T09:10:45.5754087Z // end inline asm 2026-02-21T09:10:45.5754155Z cvt.u64.u32 %rd88, %r348; 2026-02-21T09:10:45.5754212Z // begin inline asm 2026-02-21T09:10:45.5754359Z @%p73 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd88]; 2026-02-21T09:10:45.5754416Z // end inline asm 2026-02-21T09:10:45.5754478Z bra.uni 
$L__BB0_6; 2026-02-21T09:10:45.5754573Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:10:45.5754660Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:45.5754723Z mov.b32 %r296, 1; 2026-02-21T09:10:45.5754941Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5754998Z // begin inline asm 2026-02-21T09:10:45.5755054Z 2026-02-21T09:10:45.5755105Z { 2026-02-21T09:10:45.5755165Z .reg .pred complete; 2026-02-21T09:10:45.5755219Z waitLoop: 2026-02-21T09:10:45.5755341Z mbarrier.try_wait.parity.shared.b64 complete, [%r348], %r296; 2026-02-21T09:10:45.5755406Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.5755458Z } 2026-02-21T09:10:45.5755461Z 2026-02-21T09:10:45.5755523Z // end inline asm 2026-02-21T09:10:45.5755577Z $L__tmp13: 2026-02-21T09:10:45.5755751Z .loc 1 51 125 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:51:125 2026-02-21T09:10:45.5755814Z cp.async.wait_group 0; 2026-02-21T09:10:45.5755876Z bar.sync 0; 2026-02-21T09:10:45.5755932Z // begin inline asm 2026-02-21T09:10:45.5756018Z @%p90 mbarrier.inval.shared::cta.b64 [%r205]; 2026-02-21T09:10:45.5756078Z // end inline asm 2026-02-21T09:10:45.5756132Z bar.sync 0; 2026-02-21T09:10:45.5756186Z // begin inline asm 2026-02-21T09:10:45.5756276Z @%p90 mbarrier.inval.shared::cta.b64 [%r86]; 2026-02-21T09:10:45.5756331Z // end inline asm 2026-02-21T09:10:45.5756391Z add.s32 %r299, %r41, 21504; 2026-02-21T09:10:45.5756446Z // begin inline asm 2026-02-21T09:10:45.5756533Z @%p90 mbarrier.inval.shared::cta.b64 [%r299]; 2026-02-21T09:10:45.5756587Z // end inline asm 2026-02-21T09:10:45.5756640Z bar.sync 0; 2026-02-21T09:10:45.5756703Z // begin inline asm 2026-02-21T09:10:45.5756780Z @%p90 mbarrier.inval.shared::cta.b64 [%r84]; 2026-02-21T09:10:45.5756835Z // end inline asm 2026-02-21T09:10:45.5756888Z $L__tmp14: 2026-02-21T09:10:45.5757112Z .loc 2 291 36 // standard.py:291:36 @[ 
cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5757167Z // begin inline asm 2026-02-21T09:10:45.5757437Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r301, %r302, %r303, %r304, %r305, %r306, %r307, %r308, %r309, %r310, %r311, %r312, %r313, %r314, %r315, %r316}, [%r317 + 0]; 2026-02-21T09:10:45.5757500Z // end inline asm 2026-02-21T09:10:45.5757579Z // begin inline asm 2026-02-21T09:10:45.5757648Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:45.5757709Z // end inline asm 2026-02-21T09:10:45.5757769Z cvt.u64.u32 %rd93, %r301; 2026-02-21T09:10:45.5757827Z cvt.u64.u32 %rd94, %r302; 2026-02-21T09:10:45.5757885Z shl.b64 %rd95, %rd94, 32; 2026-02-21T09:10:45.5757951Z or.b64 %rd96, %rd93, %rd95; 2026-02-21T09:10:45.5758002Z $L__tmp15: 2026-02-21T09:10:45.5758178Z .loc 1 97 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:97:28 2026-02-21T09:10:45.5758269Z mov.b64 {%r321, %r322}, %rd96; 2026-02-21T09:10:45.5758338Z cvt.rn.bf16x2.f32 %r323, %r322, %r321; 2026-02-21T09:10:45.5758391Z $L__tmp16: 2026-02-21T09:10:45.5758601Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5758668Z cvt.u64.u32 %rd97, %r303; 2026-02-21T09:10:45.5758727Z cvt.u64.u32 %rd98, %r304; 2026-02-21T09:10:45.5758784Z shl.b64 %rd99, %rd98, 32; 2026-02-21T09:10:45.5758851Z or.b64 %rd100, %rd97, %rd99; 2026-02-21T09:10:45.5758925Z $L__tmp17: 2026-02-21T09:10:45.5759112Z .loc 1 97 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:97:28 2026-02-21T09:10:45.5759182Z mov.b64 {%r324, %r325}, %rd100; 2026-02-21T09:10:45.5759254Z cvt.rn.bf16x2.f32 %r326, %r325, %r324; 2026-02-21T09:10:45.5759307Z $L__tmp18: 2026-02-21T09:10:45.5759540Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5759611Z cvt.u64.u32 %rd101, %r305; 2026-02-21T09:10:45.5759673Z cvt.u64.u32 %rd102, %r306; 2026-02-21T09:10:45.5759735Z shl.b64 
%rd103, %rd102, 32; 2026-02-21T09:10:45.5759807Z or.b64 %rd104, %rd101, %rd103; 2026-02-21T09:10:45.5759862Z $L__tmp19: 2026-02-21T09:10:45.5760036Z .loc 1 97 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:97:28 2026-02-21T09:10:45.5760111Z mov.b64 {%r327, %r328}, %rd104; 2026-02-21T09:10:45.5760182Z cvt.rn.bf16x2.f32 %r329, %r328, %r327; 2026-02-21T09:10:45.5760236Z $L__tmp20: 2026-02-21T09:10:45.5760459Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5760533Z cvt.u64.u32 %rd105, %r307; 2026-02-21T09:10:45.5760595Z cvt.u64.u32 %rd106, %r308; 2026-02-21T09:10:45.5760660Z shl.b64 %rd107, %rd106, 32; 2026-02-21T09:10:45.5760734Z or.b64 %rd108, %rd105, %rd107; 2026-02-21T09:10:45.5760789Z $L__tmp21: 2026-02-21T09:10:45.5760970Z .loc 1 97 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:97:28 2026-02-21T09:10:45.5761044Z mov.b64 {%r330, %r331}, %rd108; 2026-02-21T09:10:45.5761112Z cvt.rn.bf16x2.f32 %r332, %r331, %r330; 2026-02-21T09:10:45.5761167Z $L__tmp22: 2026-02-21T09:10:45.5761387Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5761459Z cvt.u64.u32 %rd109, %r309; 2026-02-21T09:10:45.5761521Z cvt.u64.u32 %rd110, %r310; 2026-02-21T09:10:45.5761612Z shl.b64 %rd111, %rd110, 32; 2026-02-21T09:10:45.5761684Z or.b64 %rd112, %rd109, %rd111; 2026-02-21T09:10:45.5761738Z $L__tmp23: 2026-02-21T09:10:45.5761916Z .loc 1 97 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:97:28 2026-02-21T09:10:45.5761980Z mov.b64 {%r333, %r334}, %rd112; 2026-02-21T09:10:45.5762056Z cvt.rn.bf16x2.f32 %r335, %r334, %r333; 2026-02-21T09:10:45.5762111Z $L__tmp24: 2026-02-21T09:10:45.5762332Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5762400Z cvt.u64.u32 %rd113, %r311; 2026-02-21T09:10:45.5762461Z cvt.u64.u32 %rd114, %r312; 
2026-02-21T09:10:45.5762523Z shl.b64 %rd115, %rd114, 32; 2026-02-21T09:10:45.5762591Z or.b64 %rd116, %rd113, %rd115; 2026-02-21T09:10:45.5762677Z $L__tmp25: 2026-02-21T09:10:45.5762853Z .loc 1 97 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:97:28 2026-02-21T09:10:45.5762916Z mov.b64 {%r336, %r337}, %rd116; 2026-02-21T09:10:45.5762994Z cvt.rn.bf16x2.f32 %r338, %r337, %r336; 2026-02-21T09:10:45.5763048Z $L__tmp26: 2026-02-21T09:10:45.5763267Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5763367Z cvt.u64.u32 %rd117, %r313; 2026-02-21T09:10:45.5763429Z cvt.u64.u32 %rd118, %r314; 2026-02-21T09:10:45.5763491Z shl.b64 %rd119, %rd118, 32; 2026-02-21T09:10:45.5763560Z or.b64 %rd120, %rd117, %rd119; 2026-02-21T09:10:45.5763615Z $L__tmp27: 2026-02-21T09:10:45.5763792Z .loc 1 97 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:97:28 2026-02-21T09:10:45.5763856Z mov.b64 {%r339, %r340}, %rd120; 2026-02-21T09:10:45.5763933Z cvt.rn.bf16x2.f32 %r341, %r340, %r339; 2026-02-21T09:10:45.5763988Z $L__tmp28: 2026-02-21T09:10:45.5764236Z .loc 2 291 36 // standard.py:291:36 @[ cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:94:40 ] 2026-02-21T09:10:45.5764309Z cvt.u64.u32 %rd121, %r315; 2026-02-21T09:10:45.5764369Z cvt.u64.u32 %rd122, %r316; 2026-02-21T09:10:45.5764430Z shl.b64 %rd123, %rd122, 32; 2026-02-21T09:10:45.5764493Z or.b64 %rd124, %rd121, %rd123; 2026-02-21T09:10:45.5764581Z $L__tmp29: 2026-02-21T09:10:45.5764764Z .loc 1 97 28 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:97:28 2026-02-21T09:10:45.5764828Z mov.b64 {%r342, %r343}, %rd124; 2026-02-21T09:10:45.5764905Z cvt.rn.bf16x2.f32 %r344, %r343, %r342; 2026-02-21T09:10:45.5765082Z .loc 1 98 43 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:98:43 2026-02-21T09:10:45.5765182Z st.shared.v4.b32 [%r18], {%r323, %r326, %r329, %r332}; 2026-02-21T09:10:45.5765288Z st.shared.v4.b32 [%r19], {%r335, 
%r338, %r341, %r344}; 2026-02-21T09:10:45.5765349Z // begin inline asm 2026-02-21T09:10:45.5765427Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.5765486Z // end inline asm 2026-02-21T09:10:45.5765551Z bar.sync 0; 2026-02-21T09:10:45.5765619Z elect.sync %r345|%p96, -1; 2026-02-21T09:10:45.5765686Z and.pred %p94, %p1, %p96; 2026-02-21T09:10:45.5765752Z // begin inline asm 2026-02-21T09:10:45.5765936Z @%p94 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd92, {%r318, %r319}], [%r41]; 2026-02-21T09:10:45.5765995Z // end inline asm 2026-02-21T09:10:45.5766071Z cp.async.bulk.commit_group; 2026-02-21T09:10:45.5766146Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:45.5766202Z bar.sync 0; 2026-02-21T09:10:45.5766287Z $L__BB0_8: // %._crit_edge 2026-02-21T09:10:45.5766473Z .loc 1 31 4 // cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py:31:4 2026-02-21T09:10:45.5766531Z bar.sync 0; 2026-02-21T09:10:45.5766590Z // begin inline asm 2026-02-21T09:10:45.5766718Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r346, 64; 2026-02-21T09:10:45.5766776Z // end inline asm 2026-02-21T09:10:45.5766830Z ret; 2026-02-21T09:10:45.5766892Z $L__tmp30: 2026-02-21T09:10:45.5766950Z $L__func_end0: 2026-02-21T09:10:45.5767034Z // -- End function 2026-02-21T09:10:45.5767087Z } 2026-02-21T09:10:45.5767321Z .file 1 "/tmp/torchinductor_root/yj/cyjhwawlddnapp4t75prbzuwhfxye4oocbngylemdhxbaqj3eqdq.py" 2026-02-21T09:10:45.5767494Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:10:45.5767556Z .section .debug_abbrev 2026-02-21T09:10:45.5767612Z { 2026-02-21T09:10:45.5767700Z .b8 1 // Abbreviation Code 2026-02-21T09:10:45.5767787Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:10:45.5767866Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:45.5767954Z .b8 37 // DW_AT_producer 2026-02-21T09:10:45.5768055Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.5768129Z .b8 19 // DW_AT_language 2026-02-21T09:10:45.5768213Z .b8 5 // 
DW_FORM_data2 2026-02-21T09:10:45.5768286Z .b8 3 // DW_AT_name 2026-02-21T09:10:45.5768361Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.5768483Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:10:45.5768558Z .b8 6 // DW_FORM_data4 2026-02-21T09:10:45.5768633Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:10:45.5768712Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.5768783Z .b8 0 // EOM(1) 2026-02-21T09:10:45.5768852Z .b8 0 // EOM(2) 2026-02-21T09:10:45.5768932Z .b8 2 // Abbreviation Code 2026-02-21T09:10:45.5769043Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:45.5769118Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:45.5769191Z .b8 3 // DW_AT_name 2026-02-21T09:10:45.5769272Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.5769369Z .b8 32 // DW_AT_inline 2026-02-21T09:10:45.5769447Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.5769524Z .b8 0 // EOM(1) 2026-02-21T09:10:45.5769594Z .b8 0 // EOM(2) 2026-02-21T09:10:45.5769674Z .b8 3 // Abbreviation Code 2026-02-21T09:10:45.5769753Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:45.5769843Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:45.5769922Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:45.5769997Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.5770083Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:45.5770157Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.5770241Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:45.5770322Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:45.5770392Z .b8 0 // EOM(1) 2026-02-21T09:10:45.5770459Z .b8 0 // EOM(2) 2026-02-21T09:10:45.5770538Z .b8 4 // Abbreviation Code 2026-02-21T09:10:45.5770635Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:10:45.5770710Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:45.5770795Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:45.5770876Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:45.5770951Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:45.5771023Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.5771106Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:45.5771177Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.5771256Z .b8 88 // 
DW_AT_call_file 2026-02-21T09:10:45.5771340Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.5771416Z .b8 89 // DW_AT_call_line 2026-02-21T09:10:45.5771491Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.5771600Z .b8 87 // DW_AT_call_column 2026-02-21T09:10:45.5771683Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.5771751Z .b8 0 // EOM(1) 2026-02-21T09:10:45.5771846Z .b8 0 // EOM(2) 2026-02-21T09:10:45.5771924Z .b8 0 // EOM(3) 2026-02-21T09:10:45.5771977Z } 2026-02-21T09:10:45.5772039Z .section .debug_info 2026-02-21T09:10:45.5772090Z { 2026-02-21T09:10:45.5772179Z .b32 178 // Length of Unit 2026-02-21T09:10:45.5772264Z .b8 2 // DWARF version number 2026-02-21T09:10:45.5772317Z .b8 0 2026-02-21T09:10:45.5772466Z .b32 .debug_abbrev // Offset Into Abbrev. Section 2026-02-21T09:10:45.5772553Z .b8 8 // Address Size (in bytes) 2026-02-21T09:10:45.5772651Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:10:45.5772737Z .b8 116 // DW_AT_producer 2026-02-21T09:10:45.5772792Z .b8 114 2026-02-21T09:10:45.5772846Z .b8 105 2026-02-21T09:10:45.5772900Z .b8 116 2026-02-21T09:10:45.5772962Z .b8 111 2026-02-21T09:10:45.5773014Z .b8 110 2026-02-21T09:10:45.5773065Z .b8 0 2026-02-21T09:10:45.5773174Z .b8 2 // DW_AT_language 2026-02-21T09:10:45.5773225Z .b8 0 2026-02-21T09:10:45.5773300Z .b8 99 // DW_AT_name 2026-02-21T09:10:45.5773351Z .b8 121 2026-02-21T09:10:45.5773408Z .b8 106 2026-02-21T09:10:45.5773460Z .b8 104 2026-02-21T09:10:45.5773510Z .b8 119 2026-02-21T09:10:45.5773567Z .b8 97 2026-02-21T09:10:45.5773642Z .b8 119 2026-02-21T09:10:45.5773692Z .b8 108 2026-02-21T09:10:45.5773742Z .b8 100 2026-02-21T09:10:45.5773800Z .b8 100 2026-02-21T09:10:45.5773849Z .b8 110 2026-02-21T09:10:45.5773899Z .b8 97 2026-02-21T09:10:45.5773955Z .b8 112 2026-02-21T09:10:45.5774006Z .b8 112 2026-02-21T09:10:45.5774056Z .b8 52 2026-02-21T09:10:45.5774106Z .b8 116 2026-02-21T09:10:45.5774161Z .b8 55 2026-02-21T09:10:45.5774212Z .b8 53 2026-02-21T09:10:45.5774262Z .b8 112 
2026-02-21T09:10:45.5774312Z .b8 114 2026-02-21T09:10:45.5774368Z .b8 98 2026-02-21T09:10:45.5774420Z .b8 122 2026-02-21T09:10:45.5774470Z .b8 117 2026-02-21T09:10:45.5774527Z .b8 119 2026-02-21T09:10:45.5774579Z .b8 104 2026-02-21T09:10:45.5774630Z .b8 102 2026-02-21T09:10:45.5774679Z .b8 120 2026-02-21T09:10:45.5774737Z .b8 121 2026-02-21T09:10:45.5774787Z .b8 101 2026-02-21T09:10:45.5774837Z .b8 52 2026-02-21T09:10:45.5774895Z .b8 111 2026-02-21T09:10:45.5774945Z .b8 111 2026-02-21T09:10:45.5774996Z .b8 99 2026-02-21T09:10:45.5775048Z .b8 98 2026-02-21T09:10:45.5775107Z .b8 110 2026-02-21T09:10:45.5775157Z .b8 103 2026-02-21T09:10:45.5775207Z .b8 121 2026-02-21T09:10:45.5775264Z .b8 108 2026-02-21T09:10:45.5775318Z .b8 101 2026-02-21T09:10:45.5775368Z .b8 109 2026-02-21T09:10:45.5775418Z .b8 100 2026-02-21T09:10:45.5775478Z .b8 104 2026-02-21T09:10:45.5775528Z .b8 120 2026-02-21T09:10:45.5775577Z .b8 98 2026-02-21T09:10:45.5775626Z .b8 97 2026-02-21T09:10:45.5775685Z .b8 113 2026-02-21T09:10:45.5775736Z .b8 106 2026-02-21T09:10:45.5775787Z .b8 51 2026-02-21T09:10:45.5775845Z .b8 101 2026-02-21T09:10:45.5775898Z .b8 113 2026-02-21T09:10:45.5775949Z .b8 100 2026-02-21T09:10:45.5776000Z .b8 113 2026-02-21T09:10:45.5776060Z .b8 46 2026-02-21T09:10:45.5776113Z .b8 112 2026-02-21T09:10:45.5776170Z .b8 121 2026-02-21T09:10:45.5776228Z .b8 0 2026-02-21T09:10:45.5776318Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:10:45.5776392Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:10:45.5776443Z .b8 116 2026-02-21T09:10:45.5776504Z .b8 109 2026-02-21T09:10:45.5776555Z .b8 112 2026-02-21T09:10:45.5776605Z .b8 47 2026-02-21T09:10:45.5776663Z .b8 116 2026-02-21T09:10:45.5776715Z .b8 111 2026-02-21T09:10:45.5776765Z .b8 114 2026-02-21T09:10:45.5776817Z .b8 99 2026-02-21T09:10:45.5776874Z .b8 104 2026-02-21T09:10:45.5776924Z .b8 105 2026-02-21T09:10:45.5776974Z .b8 110 2026-02-21T09:10:45.5777023Z .b8 100 2026-02-21T09:10:45.5777080Z .b8 117 2026-02-21T09:10:45.5777130Z .b8 99 
2026-02-21T09:10:45.5777180Z .b8 116 2026-02-21T09:10:45.5777237Z .b8 111 2026-02-21T09:10:45.5777313Z .b8 114 2026-02-21T09:10:45.5777363Z .b8 95 2026-02-21T09:10:45.5777414Z .b8 114 2026-02-21T09:10:45.5777474Z .b8 111 2026-02-21T09:10:45.5777524Z .b8 111 2026-02-21T09:10:45.5777574Z .b8 116 2026-02-21T09:10:45.5777631Z .b8 47 2026-02-21T09:10:45.5777682Z .b8 121 2026-02-21T09:10:45.5777731Z .b8 106 2026-02-21T09:10:45.5777781Z .b8 0 2026-02-21T09:10:45.5777890Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:10:45.5777988Z .b8 95 // DW_AT_name 2026-02-21T09:10:45.5778037Z .b8 104 2026-02-21T09:10:45.5778094Z .b8 101 2026-02-21T09:10:45.5778144Z .b8 108 2026-02-21T09:10:45.5778195Z .b8 105 2026-02-21T09:10:45.5778245Z .b8 111 2026-02-21T09:10:45.5778303Z .b8 110 2026-02-21T09:10:45.5778353Z .b8 95 2026-02-21T09:10:45.5778403Z .b8 109 2026-02-21T09:10:45.5778462Z .b8 97 2026-02-21T09:10:45.5778512Z .b8 116 2026-02-21T09:10:45.5778563Z .b8 109 2026-02-21T09:10:45.5778614Z .b8 117 2026-02-21T09:10:45.5778673Z .b8 108 2026-02-21T09:10:45.5778723Z .b8 95 2026-02-21T09:10:45.5778773Z .b8 98 2026-02-21T09:10:45.5778822Z .b8 102 2026-02-21T09:10:45.5778901Z .b8 49 2026-02-21T09:10:45.5778952Z .b8 54 2026-02-21T09:10:45.5779002Z .b8 95 2026-02-21T09:10:45.5779057Z .b8 105 2026-02-21T09:10:45.5779107Z .b8 110 2026-02-21T09:10:45.5779156Z .b8 116 2026-02-21T09:10:45.5779205Z .b8 52 2026-02-21T09:10:45.5779262Z .b8 0 2026-02-21T09:10:45.5779355Z .b8 1 // DW_AT_inline 2026-02-21T09:10:45.5779453Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:10:45.5779546Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:10:45.5779635Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:10:45.5779722Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:45.5779839Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:10:45.5779928Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:45.5780010Z .b64 $L__tmp0 // DW_AT_low_pc 
2026-02-21T09:10:45.5780093Z .b64 $L__tmp29 // DW_AT_high_pc 2026-02-21T09:10:45.5780176Z .b8 1 // DW_AT_call_file 2026-02-21T09:10:45.5780251Z .b8 94 // DW_AT_call_line 2026-02-21T09:10:45.5780330Z .b8 40 // DW_AT_call_column 2026-02-21T09:10:45.5780421Z .b8 0 // End Of Children Mark 2026-02-21T09:10:45.5780500Z .b8 0 // End Of Children Mark 2026-02-21T09:10:45.5780551Z } 2026-02-21T09:10:45.5780624Z .section .debug_macinfo { } 2026-02-21T09:10:45.5780627Z 2026-02-21T09:10:45.5780702Z ================================================================ 2026-02-21T09:10:45.5780804Z please share the reproducer above with Triton project. 2026-02-21T09:10:45.7849669Z 2026-02-21T09:10:45.7854169Z 2026-02-21T09:10:45.7858062Z 2026-02-21T09:10:45.7863108Z ================================================================ 2026-02-21T09:10:45.7866618Z Internal Triton PTX codegen error 2026-02-21T09:10:45.7869821Z `ptxas` stderr: 2026-02-21T09:10:45.7874115Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 216 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:10:45.7874725Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:45.7874934Z 2026-02-21T09:10:45.7875333Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp4k_7r5ue.ptx -o /tmp/tmp4k_7r5ue.ptx.o 2026-02-21T09:10:45.7875763Z 2026-02-21T09:10:45.7875767Z 2026-02-21T09:10:45.7875838Z // 2026-02-21T09:10:45.7875984Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:10:45.7876168Z // 2026-02-21T09:10:45.7876237Z 2026-02-21T09:10:45.7876474Z .version 8.7 2026-02-21T09:10:45.7876619Z .target sm_100a 2026-02-21T09:10:45.7876761Z .address_size 64 2026-02-21T09:10:45.7876845Z 2026-02-21T09:10:45.7876995Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:10:45.7877288Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:10:45.7877508Z // @_helion_matmul_bf16_int4 2026-02-21T09:10:45.7877735Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:10:45.7878040Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:10:45.7878329Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:10:45.7878608Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:10:45.7878878Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:10:45.7879173Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:10:45.7879396Z ) 2026-02-21T09:10:45.7879537Z .reqntid 128 2026-02-21T09:10:45.7879672Z .maxnreg 32 2026-02-21T09:10:45.7879801Z { 2026-02-21T09:10:45.7879965Z .reg .pred %p<66>; 2026-02-21T09:10:45.7880121Z .reg .b16 %rs<228>; 2026-02-21T09:10:45.7880270Z .reg .b32 %r<847>; 2026-02-21T09:10:45.7880410Z .reg .b64 %rd<345>; 2026-02-21T09:10:45.7880556Z $L__func_begin0: 2026-02-21T09:10:45.7880639Z 2026-02-21T09:10:45.7880692Z // %bb.0: 2026-02-21T09:10:45.7880979Z .loc 1 19 0 // 
cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:19 2026-02-21T09:10:45.7881271Z mov.u32 %r1, %tid.x; 2026-02-21T09:10:45.7881432Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:10:45.7881745Z ld.param.b64 %rd20, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:10:45.7881969Z mov.b32 %r44, global_smem; 2026-02-21T09:10:45.7882136Z // begin inline asm 2026-02-21T09:10:45.7882373Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r44], 256; 2026-02-21T09:10:45.7882627Z // end inline asm 2026-02-21T09:10:45.7882813Z ld.param.b64 %rd37, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:10:45.7883026Z bar.sync 0; 2026-02-21T09:10:45.7883173Z ld.shared.b32 %r841, [global_smem]; 2026-02-21T09:10:45.7883354Z bar.sync 0; 2026-02-21T09:10:45.7883487Z // begin inline asm 2026-02-21T09:10:45.7883700Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:10:45.7883929Z // end inline asm 2026-02-21T09:10:45.7884191Z .loc 1 21 67 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:21:67 2026-02-21T09:10:45.7884494Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:10:45.7884648Z mov.u32 %r53, %ctaid.y; 2026-02-21T09:10:45.7884804Z mov.u32 %r54, %ctaid.z; 2026-02-21T09:10:45.7884957Z mov.u32 %r55, %nctaid.x; 2026-02-21T09:10:45.7885120Z mov.u32 %r56, %nctaid.y; 2026-02-21T09:10:45.7885279Z mad.lo.s32 %r57, %r54, %r56, %r53; 2026-02-21T09:10:45.7885462Z mad.lo.s32 %r58, %r57, %r55, %r3; 2026-02-21T09:10:45.7885636Z shl.b32 %r59, %r58, 7; 2026-02-21T09:10:45.7885792Z cvt.s64.s32 %rd38, %r59; 2026-02-21T09:10:45.7885957Z add.s64 %rd34, %rd37, %rd38; 2026-02-21T09:10:45.7886118Z shl.b32 %r60, %r1, 2; 2026-02-21T09:10:45.7886275Z add.s32 %r45, %r44, %r60; 2026-02-21T09:10:45.7886425Z mov.b32 %r62, 0; 2026-02-21T09:10:45.7886570Z // begin inline asm 2026-02-21T09:10:45.7886723Z @%p1 st.shared.b32 [ %r45 + 0 ], %r62; 2026-02-21T09:10:45.7886903Z // end inline asm 2026-02-21T09:10:45.7887043Z bar.warp.sync -1; 2026-02-21T09:10:45.7887208Z setp.eq.b32 %p60, 
%r1, 0; 2026-02-21T09:10:45.7887368Z cvt.u64.u32 %rd19, %r44; 2026-02-21T09:10:45.7887532Z // begin inline asm 2026-02-21T09:10:45.7887796Z @%p60 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd19 + 0 ], %rd20; 2026-02-21T09:10:45.7888078Z // end inline asm 2026-02-21T09:10:45.7888221Z // begin inline asm 2026-02-21T09:10:45.7888446Z @%p60 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1; 2026-02-21T09:10:45.7888703Z // end inline asm 2026-02-21T09:10:45.7888882Z mov.b32 %r47, 64; 2026-02-21T09:10:45.7889028Z // begin inline asm 2026-02-21T09:10:45.7889270Z @%p60 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0, %r47; 2026-02-21T09:10:45.7889536Z // end inline asm 2026-02-21T09:10:45.7889676Z mov.b32 %r48, 128; 2026-02-21T09:10:45.7889813Z // begin inline asm 2026-02-21T09:10:45.7890053Z @%p60 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1, %r48; 2026-02-21T09:10:45.7890311Z // end inline asm 2026-02-21T09:10:45.7890486Z mov.b32 %r49, 8192; 2026-02-21T09:10:45.7890626Z // begin inline asm 2026-02-21T09:10:45.7890873Z @%p60 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0, %r49; 2026-02-21T09:10:45.7891145Z // end inline asm 2026-02-21T09:10:45.7891278Z mov.b32 %r50, 4096; 2026-02-21T09:10:45.7891423Z // begin inline asm 2026-02-21T09:10:45.7891694Z @%p60 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1, %r50; 2026-02-21T09:10:45.7891981Z // end inline asm 2026-02-21T09:10:45.7892116Z mov.b64 %rd27, 16384; 2026-02-21T09:10:45.7892269Z // begin inline asm 2026-02-21T09:10:45.7892547Z @%p60 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd19 + 0 ], 0x0, %rd27; 2026-02-21T09:10:45.7892831Z // end inline asm 2026-02-21T09:10:45.7892971Z mov.b32 %r51, 1; 2026-02-21T09:10:45.7893105Z // begin inline asm 2026-02-21T09:10:45.7893395Z @%p60 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0, %r51; 
2026-02-21T09:10:45.7893674Z // end inline asm 2026-02-21T09:10:45.7893811Z // begin inline asm 2026-02-21T09:10:45.7894053Z @%p60 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x1, %r51; 2026-02-21T09:10:45.7894332Z // end inline asm 2026-02-21T09:10:45.7894469Z // begin inline asm 2026-02-21T09:10:45.7894696Z @%p60 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd19 + 0 ], 0xa; 2026-02-21T09:10:45.7894960Z // end inline asm 2026-02-21T09:10:45.7895092Z // begin inline asm 2026-02-21T09:10:45.7895344Z @%p60 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0; 2026-02-21T09:10:45.7895613Z // end inline asm 2026-02-21T09:10:45.7896122Z [232s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:10:45.7897297Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:10:45.7898453Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:10:45.7898685Z `ptxas` stderr: 2026-02-21T09:10:45.7899112Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 216 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:10:45.7899599Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:45.7899743Z 2026-02-21T09:10:45.7900115Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp4k_7r5ue.ptx -o /tmp/tmp4k_7r5ue.ptx.o 2026-02-21T09:10:45.7900546Z 2026-02-21T09:10:45.7900675Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:10:45.7900910Z // begin inline asm 2026-02-21T09:10:45.7901162Z @%p60 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x3; 2026-02-21T09:10:45.7901432Z // end inline asm 2026-02-21T09:10:45.7901600Z // begin inline asm 2026-02-21T09:10:45.7901850Z @%p60 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd19 + 0 ], 0x0; 2026-02-21T09:10:45.7902107Z // end inline asm 2026-02-21T09:10:45.7902289Z // begin inline asm 2026-02-21T09:10:45.7902627Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd34 + 0 ], [ %rd19 + 0 ], 0x80; 2026-02-21T09:10:45.7903003Z // end inline asm 2026-02-21T09:10:45.7903144Z // begin inline asm 2026-02-21T09:10:45.7903349Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd34 + 0 ], 0x80; 2026-02-21T09:10:45.7903601Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:45.7903791Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:45.7904008Z // end inline asm 2026-02-21T09:10:45.7904142Z bar.sync 0; 2026-02-21T09:10:45.7904292Z cvta.global.u64 %rd85, %rd34; 2026-02-21T09:10:45.7904572Z .loc 1 29 74 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:29:74 2026-02-21T09:10:45.7904872Z setp.gt.u32 %p21, %r3, 2047; 2026-02-21T09:10:45.7905053Z @%p21 bra $L__BB0_8; 2026-02-21T09:10:45.7905229Z // %bb.1: // %.lr.ph 2026-02-21T09:10:45.7905539Z .loc 1 0 74 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:0:74 2026-02-21T09:10:45.7905919Z ld.param.b64 %rd18, [_helion_matmul_bf16_int4_param_1]; 
2026-02-21T09:10:45.7906184Z ld.param.b64 %rd17, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:10:45.7906513Z .loc 1 57 38 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:57:38 2026-02-21T09:10:45.7906827Z shl.b32 %r253, %r1, 3; 2026-02-21T09:10:45.7907014Z and.b32 %r254, %r253, 24; 2026-02-21T09:10:45.7907298Z .loc 1 41 45 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:41:45 2026-02-21T09:10:45.7907592Z shl.b32 %r255, %r1, 4; 2026-02-21T09:10:45.7907760Z and.b32 %r4, %r255, 112; 2026-02-21T09:10:45.7907928Z bfe.u32 %r5, %r1, 2, 5; 2026-02-21T09:10:45.7908083Z shr.u32 %r256, %r1, 5; 2026-02-21T09:10:45.7908243Z and.b32 %r6, %r1, 127; 2026-02-21T09:10:45.7908394Z shl.b32 %r7, %r6, 4; 2026-02-21T09:10:45.7908554Z add.s32 %r258, %r44, %r7; 2026-02-21T09:10:45.7908719Z add.s32 %r356, %r258, 16384; 2026-02-21T09:10:45.7908890Z add.s32 %r358, %r258, 18432; 2026-02-21T09:10:45.7909053Z add.s32 %r360, %r258, 20480; 2026-02-21T09:10:45.7909222Z add.s32 %r362, %r258, 22528; 2026-02-21T09:10:45.7909382Z add.s32 %r342, %r841, 128; 2026-02-21T09:10:45.7909554Z shl.b32 %r259, %r1, 10; 2026-02-21T09:10:45.7909716Z and.b32 %r13, %r259, 122880; 2026-02-21T09:10:45.7909879Z add.s32 %r260, %r44, 32768; 2026-02-21T09:10:45.7910051Z add.s32 %r364, %r260, %r7; 2026-02-21T09:10:45.7910209Z or.b32 %r16, %r1, 896; 2026-02-21T09:10:45.7910369Z or.b32 %r17, %r1, 1920; 2026-02-21T09:10:45.7910520Z shl.b32 %r261, %r6, 7; 2026-02-21T09:10:45.7910678Z or.b32 %r262, %r261, %r4; 2026-02-21T09:10:45.7910836Z add.s32 %r18, %r44, %r262; 2026-02-21T09:10:45.7911004Z xor.b32 %r263, %r262, 16; 2026-02-21T09:10:45.7911159Z add.s32 %r19, %r44, %r263; 2026-02-21T09:10:45.7911324Z xor.b32 %r264, %r262, 32; 2026-02-21T09:10:45.7911486Z add.s32 %r20, %r44, %r264; 2026-02-21T09:10:45.7911672Z xor.b32 %r265, %r262, 48; 2026-02-21T09:10:45.7911836Z add.s32 %r21, %r44, %r265; 2026-02-21T09:10:45.7911993Z xor.b32 %r266, %r262, 64; 2026-02-21T09:10:45.7912156Z add.s32 %r22, %r44, %r266; 
2026-02-21T09:10:45.7912313Z xor.b32 %r267, %r262, 80; 2026-02-21T09:10:45.7912477Z add.s32 %r23, %r44, %r267; 2026-02-21T09:10:45.7912634Z xor.b32 %r268, %r262, 96; 2026-02-21T09:10:45.7912798Z add.s32 %r24, %r44, %r268; 2026-02-21T09:10:45.7912953Z xor.b32 %r269, %r262, 112; 2026-02-21T09:10:45.7913120Z add.s32 %r25, %r44, %r269; 2026-02-21T09:10:45.7913287Z bfe.u32 %r270, %r44, 4, 14; 2026-02-21T09:10:45.7913452Z cvt.u64.u32 %rd49, %r270; 2026-02-21T09:10:45.7913635Z or.b64 %rd55, %rd49, 4611686293372403712; 2026-02-21T09:10:45.7913825Z add.s32 %r271, %r44, 32; 2026-02-21T09:10:45.7913998Z bfe.u32 %r272, %r271, 4, 14; 2026-02-21T09:10:45.7914163Z cvt.u64.u32 %rd50, %r272; 2026-02-21T09:10:45.7914350Z or.b64 %rd56, %rd50, 4611686293372403712; 2026-02-21T09:10:45.7914564Z add.s32 %r273, %r44, 64; 2026-02-21T09:10:45.7914728Z bfe.u32 %r274, %r273, 4, 14; 2026-02-21T09:10:45.7914890Z cvt.u64.u32 %rd51, %r274; 2026-02-21T09:10:45.7915059Z or.b64 %rd57, %rd51, 4611686293372403712; 2026-02-21T09:10:45.7915245Z add.s32 %r275, %r44, 96; 2026-02-21T09:10:45.7915399Z bfe.u32 %r276, %r275, 4, 14; 2026-02-21T09:10:45.7915564Z cvt.u64.u32 %rd52, %r276; 2026-02-21T09:10:45.7915727Z or.b64 %rd58, %rd52, 4611686293372403712; 2026-02-21T09:10:45.7915960Z or.b32 %r26, %r254, 64; 2026-02-21T09:10:45.7916114Z mad.lo.s32 %r277, %r6, 48, %r356; 2026-02-21T09:10:45.7916291Z add.s32 %r278, %r260, %r17; 2026-02-21T09:10:45.7916445Z add.s32 %r279, %r260, %r6; 2026-02-21T09:10:45.7916607Z add.s32 %r280, %r260, %r16; 2026-02-21T09:10:45.7916767Z add.s32 %r217, %r258, 34816; 2026-02-21T09:10:45.7916919Z add.s32 %r209, %r258, 24576; 2026-02-21T09:10:45.7917076Z add.s32 %r215, %r258, 30720; 2026-02-21T09:10:45.7917226Z add.s32 %r213, %r258, 28672; 2026-02-21T09:10:45.7917386Z add.s32 %r211, %r258, 26624; 2026-02-21T09:10:45.7917679Z .loc 1 36 33 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:36:33 2026-02-21T09:10:45.7917980Z shr.u32 %r281, %r3, 5; 2026-02-21T09:10:45.7918130Z 
and.b32 %r282, %r281, 48; 2026-02-21T09:10:45.7918400Z .loc 1 38 64 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:38:64 2026-02-21T09:10:45.7918723Z and.b32 %r27, %r3, 15; 2026-02-21T09:10:45.7918984Z .loc 1 38 30 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:38:30 2026-02-21T09:10:45.7919271Z or.b32 %r283, %r282, %r27; 2026-02-21T09:10:45.7919530Z .loc 1 40 27 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:40:27 2026-02-21T09:10:45.7919817Z shl.b32 %r28, %r283, 7; 2026-02-21T09:10:45.7920070Z .loc 1 41 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:41:32 2026-02-21T09:10:45.7920361Z or.b32 %r284, %r28, %r4; 2026-02-21T09:10:45.7920627Z .loc 1 42 27 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:42:27 2026-02-21T09:10:45.7920909Z shl.b32 %r285, %r3, 3; 2026-02-21T09:10:45.7921064Z and.b32 %r642, %r285, 3968; 2026-02-21T09:10:45.7921326Z .loc 1 43 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:43:32 2026-02-21T09:10:45.7921642Z or.b32 %r286, %r642, %r5; 2026-02-21T09:10:45.7921904Z .loc 1 58 53 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:53 2026-02-21T09:10:45.7922196Z shl.b32 %r287, %r286, 10; 2026-02-21T09:10:45.7922351Z $L__tmp0: 2026-02-21T09:10:45.7922646Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.7923004Z shfl.sync.idx.b32 %r30, %r256, 0, 31, -1; 2026-02-21T09:10:45.7923186Z shl.b32 %r288, %r30, 21; 2026-02-21T09:10:45.7923346Z and.b32 %r289, %r288, 6291456; 2026-02-21T09:10:45.7923512Z add.s32 %r640, %r289, %r841; 2026-02-21T09:10:45.7923677Z mov.pred %p37, -1; 2026-02-21T09:10:45.7923826Z // begin inline asm 2026-02-21T09:10:45.7924168Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r640 + 0], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:10:45.7924535Z // end inline asm 2026-02-21T09:10:45.7924675Z // begin inline 
asm 2026-02-21T09:10:45.7925019Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r640 + 16], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:10:45.7925371Z // end inline asm 2026-02-21T09:10:45.7925519Z // begin inline asm 2026-02-21T09:10:45.7925835Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r640 + 32], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:10:45.7926190Z // end inline asm 2026-02-21T09:10:45.7926332Z // begin inline asm 2026-02-21T09:10:45.7926672Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r640 + 48], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:10:45.7927021Z // end inline asm 2026-02-21T09:10:45.7927155Z // begin inline asm 2026-02-21T09:10:45.7927480Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r640 + 64], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:10:45.7927860Z // end inline asm 2026-02-21T09:10:45.7927993Z // begin inline asm 2026-02-21T09:10:45.7928309Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r640 + 80], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:10:45.7928651Z // end inline asm 2026-02-21T09:10:45.7928792Z // begin inline asm 2026-02-21T09:10:45.7929104Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r640 + 96], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:10:45.7929454Z // end inline asm 2026-02-21T09:10:45.7929589Z // begin inline asm 2026-02-21T09:10:45.7929944Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r640 + 112], {%r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62, %r62}; 2026-02-21T09:10:45.7930299Z // end inline asm 2026-02-21T09:10:45.7930434Z // begin inline asm 
2026-02-21T09:10:45.7930620Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.7930788Z // end inline asm 2026-02-21T09:10:45.7930928Z bar.sync 0; 2026-02-21T09:10:45.7931057Z $L__tmp1: 2026-02-21T09:10:45.7931314Z .loc 1 50 125 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:50:125 2026-02-21T09:10:45.7931641Z add.s32 %r843, %r44, 36864; 2026-02-21T09:10:45.7931801Z // begin inline asm 2026-02-21T09:10:45.7931974Z @%p60 mbarrier.init.shared::cta.b64 [%r843], 1; 2026-02-21T09:10:45.7932162Z // end inline asm 2026-02-21T09:10:45.7932300Z bar.sync 0; 2026-02-21T09:10:45.7932434Z add.s32 %r198, %r44, 36872; 2026-02-21T09:10:45.7932593Z // begin inline asm 2026-02-21T09:10:45.7932758Z @%p60 mbarrier.init.shared::cta.b64 [%r198], 1; 2026-02-21T09:10:45.7932952Z // end inline asm 2026-02-21T09:10:45.7933207Z .loc 1 58 60 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:60 2026-02-21T09:10:45.7933492Z or.b32 %r290, %r287, %r254; 2026-02-21T09:10:45.7933765Z .loc 1 58 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:32 2026-02-21T09:10:45.7934062Z mad.wide.u32 %rd39, %r290, 2, %rd17; 2026-02-21T09:10:45.7934246Z cvt.u64.u32 %rd6, %r287; 2026-02-21T09:10:45.7934403Z add.s64 %rd40, %rd39, 65536; 2026-02-21T09:10:45.7934571Z add.s64 %rd41, %rd39, 131072; 2026-02-21T09:10:45.7934734Z add.s64 %rd42, %rd39, 196608; 2026-02-21T09:10:45.7934898Z mov.b32 %r357, 16; 2026-02-21T09:10:45.7935155Z .loc 1 58 80 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:80 2026-02-21T09:10:45.7935440Z // begin inline asm 2026-02-21T09:10:45.7935654Z cp.async.cg.shared.global [ %r356 + 0 ], [ %rd39 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7935878Z // end inline asm 2026-02-21T09:10:45.7936019Z // begin inline asm 2026-02-21T09:10:45.7936218Z cp.async.cg.shared.global [ %r358 + 0 ], [ %rd40 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7936450Z // end inline asm 2026-02-21T09:10:45.7936586Z // begin inline asm 2026-02-21T09:10:45.7936789Z 
cp.async.cg.shared.global [ %r360 + 0 ], [ %rd41 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7937015Z // end inline asm 2026-02-21T09:10:45.7937148Z // begin inline asm 2026-02-21T09:10:45.7937347Z cp.async.cg.shared.global [ %r362 + 0 ], [ %rd42 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7937563Z // end inline asm 2026-02-21T09:10:45.7937710Z cp.async.commit_group; 2026-02-21T09:10:45.7937975Z .loc 1 64 62 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:64:62 2026-02-21T09:10:45.7938270Z or.b32 %r291, %r284, %r13; 2026-02-21T09:10:45.7938570Z .loc 1 64 34 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:64:34 2026-02-21T09:10:45.7938863Z cvt.u64.u32 %rd53, %r291; 2026-02-21T09:10:45.7939030Z add.s64 %rd43, %rd18, %rd53; 2026-02-21T09:10:45.7939292Z .loc 1 64 87 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:64:87 2026-02-21T09:10:45.7939581Z // begin inline asm 2026-02-21T09:10:45.7939775Z cp.async.cg.shared.global [ %r364 + 0 ], [ %rd43 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7940046Z // end inline asm 2026-02-21T09:10:45.7940185Z cp.async.commit_group; 2026-02-21T09:10:45.7940457Z .loc 1 58 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:32 2026-02-21T09:10:45.7940748Z add.s64 %rd44, %rd39, 64; 2026-02-21T09:10:45.7940902Z or.b32 %r292, %r290, 32; 2026-02-21T09:10:45.7941068Z mad.wide.u32 %rd54, %r292, 2, %rd17; 2026-02-21T09:10:45.7941242Z add.s64 %rd45, %rd54, 65536; 2026-02-21T09:10:45.7941410Z add.s64 %rd46, %rd54, 131072; 2026-02-21T09:10:45.7941601Z add.s64 %rd47, %rd54, 196608; 2026-02-21T09:10:45.7941897Z .loc 1 58 80 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:80 2026-02-21T09:10:45.7942172Z bar.sync 0; 2026-02-21T09:10:45.7942310Z // begin inline asm 2026-02-21T09:10:45.7942510Z cp.async.cg.shared.global [ %r209 + 0 ], [ %rd44 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7942756Z // end inline asm 2026-02-21T09:10:45.7942904Z // begin inline asm 2026-02-21T09:10:45.7943099Z 
cp.async.cg.shared.global [ %r211 + 0 ], [ %rd45 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7943327Z // end inline asm 2026-02-21T09:10:45.7943461Z // begin inline asm 2026-02-21T09:10:45.7943658Z cp.async.cg.shared.global [ %r213 + 0 ], [ %rd46 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7943874Z // end inline asm 2026-02-21T09:10:45.7944014Z // begin inline asm 2026-02-21T09:10:45.7944211Z cp.async.cg.shared.global [ %r215 + 0 ], [ %rd47 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7944425Z // end inline asm 2026-02-21T09:10:45.7944571Z cp.async.commit_group; 2026-02-21T09:10:45.7944829Z .loc 1 64 34 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:64:34 2026-02-21T09:10:45.7945119Z add.s64 %rd48, %rd43, 131072; 2026-02-21T09:10:45.7945381Z .loc 1 64 87 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:64:87 2026-02-21T09:10:45.7945673Z // begin inline asm 2026-02-21T09:10:45.7945865Z cp.async.cg.shared.global [ %r217 + 0 ], [ %rd48 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7946094Z // end inline asm 2026-02-21T09:10:45.7946241Z cp.async.commit_group; 2026-02-21T09:10:45.7946495Z .loc 1 58 80 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:80 2026-02-21T09:10:45.7946792Z cp.async.wait_group 2; 2026-02-21T09:10:45.7946942Z bar.sync 0; 2026-02-21T09:10:45.7947121Z ld.shared.v4.b32 {%r293, %r294, %r295, %r296}, [%r277]; 2026-02-21T09:10:45.7947336Z mov.b32 {%rs1, %rs2}, %r296; 2026-02-21T09:10:45.7947506Z mov.b32 {%rs3, %rs4}, %r295; 2026-02-21T09:10:45.7947663Z mov.b32 {%rs5, %rs6}, %r294; 2026-02-21T09:10:45.7947824Z mov.b32 {%rs7, %rs8}, %r293; 2026-02-21T09:10:45.7948021Z ld.shared.v4.b32 {%r297, %r298, %r299, %r300}, [%r277+16]; 2026-02-21T09:10:45.7948227Z mov.b32 {%rs9, %rs10}, %r300; 2026-02-21T09:10:45.7948394Z mov.b32 {%rs11, %rs12}, %r299; 2026-02-21T09:10:45.7948559Z mov.b32 {%rs13, %rs14}, %r298; 2026-02-21T09:10:45.7948728Z mov.b32 {%rs15, %rs16}, %r297; 2026-02-21T09:10:45.7948921Z ld.shared.v4.b32 {%r301, %r302, %r303, %r304}, 
[%r277+32]; 2026-02-21T09:10:45.7949131Z mov.b32 {%rs17, %rs18}, %r304; 2026-02-21T09:10:45.7949293Z mov.b32 {%rs19, %rs20}, %r303; 2026-02-21T09:10:45.7949466Z mov.b32 {%rs21, %rs22}, %r302; 2026-02-21T09:10:45.7949633Z mov.b32 {%rs23, %rs24}, %r301; 2026-02-21T09:10:45.7949831Z ld.shared.v4.b32 {%r305, %r306, %r307, %r308}, [%r277+48]; 2026-02-21T09:10:45.7950050Z mov.b32 {%rs25, %rs26}, %r308; 2026-02-21T09:10:45.7950244Z mov.b32 {%rs27, %rs28}, %r307; 2026-02-21T09:10:45.7950414Z mov.b32 {%rs29, %rs30}, %r306; 2026-02-21T09:10:45.7950577Z mov.b32 {%rs31, %rs32}, %r305; 2026-02-21T09:10:45.7950857Z .loc 1 62 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:62:32 2026-02-21T09:10:45.7951159Z cvt.f32.bf16 %r220, %rs7; 2026-02-21T09:10:45.7951331Z cvt.f32.bf16 %r221, %rs8; 2026-02-21T09:10:45.7951497Z cvt.f32.bf16 %r222, %rs5; 2026-02-21T09:10:45.7951709Z cvt.f32.bf16 %r223, %rs6; 2026-02-21T09:10:45.7951872Z cvt.f32.bf16 %r224, %rs3; 2026-02-21T09:10:45.7952026Z cvt.f32.bf16 %r225, %rs4; 2026-02-21T09:10:45.7952187Z cvt.f32.bf16 %r226, %rs1; 2026-02-21T09:10:45.7952340Z cvt.f32.bf16 %r227, %rs2; 2026-02-21T09:10:45.7952504Z cvt.f32.bf16 %r228, %rs15; 2026-02-21T09:10:45.7952664Z cvt.f32.bf16 %r229, %rs16; 2026-02-21T09:10:45.7952831Z cvt.f32.bf16 %r230, %rs13; 2026-02-21T09:10:45.7952987Z cvt.f32.bf16 %r231, %rs14; 2026-02-21T09:10:45.7953152Z cvt.f32.bf16 %r232, %rs11; 2026-02-21T09:10:45.7953315Z cvt.f32.bf16 %r233, %rs12; 2026-02-21T09:10:45.7953500Z cvt.f32.bf16 %r234, %rs9; 2026-02-21T09:10:45.7953664Z cvt.f32.bf16 %r235, %rs10; 2026-02-21T09:10:45.7953816Z cvt.f32.bf16 %r237, %rs23; 2026-02-21T09:10:45.7953977Z cvt.f32.bf16 %r238, %rs24; 2026-02-21T09:10:45.7954133Z cvt.f32.bf16 %r239, %rs21; 2026-02-21T09:10:45.7954327Z cvt.f32.bf16 %r240, %rs22; 2026-02-21T09:10:45.7954485Z cvt.f32.bf16 %r241, %rs19; 2026-02-21T09:10:45.7954648Z cvt.f32.bf16 %r242, %rs20; 2026-02-21T09:10:45.7954803Z cvt.f32.bf16 %r243, %rs17; 2026-02-21T09:10:45.7954967Z 
cvt.f32.bf16 %r244, %rs18; 2026-02-21T09:10:45.7955131Z cvt.f32.bf16 %r245, %rs31; 2026-02-21T09:10:45.7955287Z cvt.f32.bf16 %r246, %rs32; 2026-02-21T09:10:45.7955451Z cvt.f32.bf16 %r247, %rs29; 2026-02-21T09:10:45.7955609Z cvt.f32.bf16 %r248, %rs30; 2026-02-21T09:10:45.7955778Z cvt.f32.bf16 %r249, %rs27; 2026-02-21T09:10:45.7955939Z cvt.f32.bf16 %r250, %rs28; 2026-02-21T09:10:45.7956107Z cvt.f32.bf16 %r251, %rs25; 2026-02-21T09:10:45.7956264Z cvt.f32.bf16 %r252, %rs26; 2026-02-21T09:10:45.7956428Z $L__tmp2: 2026-02-21T09:10:45.7956745Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.7957093Z add.s32 %r219, %r289, %r342; 2026-02-21T09:10:45.7957261Z // begin inline asm 2026-02-21T09:10:45.7957645Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r219 + 0], {%r220, %r221, %r222, %r223, %r224, %r225, %r226, %r227, %r228, %r229, %r230, %r231, %r232, %r233, %r234, %r235}; 2026-02-21T09:10:45.7958038Z // end inline asm 2026-02-21T09:10:45.7958174Z // begin inline asm 2026-02-21T09:10:45.7958536Z @%p37 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r219 + 16], {%r237, %r238, %r239, %r240, %r241, %r242, %r243, %r244, %r245, %r246, %r247, %r248, %r249, %r250, %r251, %r252}; 2026-02-21T09:10:45.7958923Z // end inline asm 2026-02-21T09:10:45.7959061Z // begin inline asm 2026-02-21T09:10:45.7959221Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.7959383Z // end inline asm 2026-02-21T09:10:45.7959524Z bar.sync 0; 2026-02-21T09:10:45.7959654Z $L__tmp3: 2026-02-21T09:10:45.7959897Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7960189Z ld.shared.s8 %rs33, [%r279]; 2026-02-21T09:10:45.7960464Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7960757Z shl.b16 %rs34, %rs33, 4; 2026-02-21T09:10:45.7961013Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7961306Z 
ld.shared.s8 %rs35, [%r279+128]; 2026-02-21T09:10:45.7961613Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7961906Z shl.b16 %rs36, %rs35, 4; 2026-02-21T09:10:45.7962166Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7962501Z ld.shared.s8 %rs37, [%r279+256]; 2026-02-21T09:10:45.7962781Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7963060Z shl.b16 %rs38, %rs37, 4; 2026-02-21T09:10:45.7963325Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7963637Z ld.shared.s8 %rs39, [%r279+384]; 2026-02-21T09:10:45.7963914Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7964202Z shl.b16 %rs40, %rs39, 4; 2026-02-21T09:10:45.7964459Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7964752Z ld.shared.s8 %rs41, [%r279+512]; 2026-02-21T09:10:45.7965020Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7965306Z shl.b16 %rs42, %rs41, 4; 2026-02-21T09:10:45.7965582Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7965872Z ld.shared.s8 %rs43, [%r279+640]; 2026-02-21T09:10:45.7966148Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7966427Z shl.b16 %rs44, %rs43, 4; 2026-02-21T09:10:45.7966717Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7966998Z ld.shared.s8 %rs45, [%r279+768]; 2026-02-21T09:10:45.7967273Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7967552Z shl.b16 %rs46, %rs45, 4; 2026-02-21T09:10:45.7967820Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7968114Z ld.shared.s8 %rs47, 
[%r280]; 2026-02-21T09:10:45.7968381Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7968671Z shl.b16 %rs48, %rs47, 4; 2026-02-21T09:10:45.7968929Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7969232Z ld.shared.s8 %rs49, [%r279+1024]; 2026-02-21T09:10:45.7969514Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7969806Z shl.b16 %rs50, %rs49, 4; 2026-02-21T09:10:45.7970069Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7970356Z ld.shared.s8 %rs51, [%r279+1152]; 2026-02-21T09:10:45.7970639Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7970917Z shl.b16 %rs52, %rs51, 4; 2026-02-21T09:10:45.7971178Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7971465Z ld.shared.s8 %rs53, [%r279+1280]; 2026-02-21T09:10:45.7971767Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7972057Z shl.b16 %rs54, %rs53, 4; 2026-02-21T09:10:45.7972320Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7972621Z ld.shared.s8 %rs55, [%r279+1408]; 2026-02-21T09:10:45.7972897Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7973191Z shl.b16 %rs56, %rs55, 4; 2026-02-21T09:10:45.7973453Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7973750Z ld.shared.s8 %rs57, [%r279+1536]; 2026-02-21T09:10:45.7974031Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7974346Z shl.b16 %rs58, %rs57, 4; 2026-02-21T09:10:45.7974619Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7974913Z ld.shared.s8 %rs59, [%r279+1664]; 
2026-02-21T09:10:45.7975198Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7975495Z shl.b16 %rs60, %rs59, 4; 2026-02-21T09:10:45.7975785Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7976082Z ld.shared.s8 %rs61, [%r279+1792]; 2026-02-21T09:10:45.7976353Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7976639Z shl.b16 %rs62, %rs61, 4; 2026-02-21T09:10:45.7976895Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7977192Z ld.shared.s8 %rs63, [%r278]; 2026-02-21T09:10:45.7977492Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.7977771Z shl.b16 %rs64, %rs63, 4; 2026-02-21T09:10:45.7978033Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7978311Z cvt.s16.s8 %rs65, %rs34; 2026-02-21T09:10:45.7978464Z shr.s16 %rs66, %rs65, 4; 2026-02-21T09:10:45.7978637Z cvt.s16.s8 %rs67, %rs36; 2026-02-21T09:10:45.7978797Z shr.s16 %rs68, %rs67, 4; 2026-02-21T09:10:45.7978945Z shr.s16 %rs69, %rs33, 4; 2026-02-21T09:10:45.7979098Z shr.s16 %rs70, %rs35, 4; 2026-02-21T09:10:45.7979361Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.7979648Z cvt.rn.f32.s16 %r309, %rs70; 2026-02-21T09:10:45.7979815Z cvt.rn.f32.s16 %r310, %rs69; 2026-02-21T09:10:45.7979972Z cvt.rn.f32.s16 %r311, %rs68; 2026-02-21T09:10:45.7980134Z cvt.rn.f32.s16 %r312, %rs66; 2026-02-21T09:10:45.7980395Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7980684Z cvt.s16.s8 %rs71, %rs38; 2026-02-21T09:10:45.7980833Z shr.s16 %rs72, %rs71, 4; 2026-02-21T09:10:45.7980988Z cvt.s16.s8 %rs73, %rs40; 2026-02-21T09:10:45.7981144Z shr.s16 %rs74, %rs73, 4; 2026-02-21T09:10:45.7981289Z shr.s16 %rs75, %rs37, 4; 2026-02-21T09:10:45.7981450Z 
shr.s16 %rs76, %rs39, 4; 2026-02-21T09:10:45.7981742Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.7982042Z cvt.rn.f32.s16 %r313, %rs76; 2026-02-21T09:10:45.7982199Z cvt.rn.f32.s16 %r314, %rs75; 2026-02-21T09:10:45.7982363Z cvt.rn.f32.s16 %r315, %rs74; 2026-02-21T09:10:45.7982519Z cvt.rn.f32.s16 %r316, %rs72; 2026-02-21T09:10:45.7982787Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7983077Z cvt.s16.s8 %rs77, %rs42; 2026-02-21T09:10:45.7983224Z shr.s16 %rs78, %rs77, 4; 2026-02-21T09:10:45.7983379Z cvt.s16.s8 %rs79, %rs44; 2026-02-21T09:10:45.7983525Z shr.s16 %rs80, %rs79, 4; 2026-02-21T09:10:45.7983678Z shr.s16 %rs81, %rs41, 4; 2026-02-21T09:10:45.7983823Z shr.s16 %rs82, %rs43, 4; 2026-02-21T09:10:45.7984084Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.7984380Z cvt.rn.f32.s16 %r317, %rs82; 2026-02-21T09:10:45.7984540Z cvt.rn.f32.s16 %r318, %rs81; 2026-02-21T09:10:45.7984702Z cvt.rn.f32.s16 %r319, %rs80; 2026-02-21T09:10:45.7984855Z cvt.rn.f32.s16 %r320, %rs78; 2026-02-21T09:10:45.7985125Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7985413Z cvt.s16.s8 %rs83, %rs46; 2026-02-21T09:10:45.7985568Z shr.s16 %rs84, %rs83, 4; 2026-02-21T09:10:45.7985715Z cvt.s16.s8 %rs85, %rs48; 2026-02-21T09:10:45.7985869Z shr.s16 %rs86, %rs85, 4; 2026-02-21T09:10:45.7986046Z shr.s16 %rs87, %rs45, 4; 2026-02-21T09:10:45.7986111Z shr.s16 %rs88, %rs47, 4; 2026-02-21T09:10:45.7986279Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.7986339Z cvt.rn.f32.s16 %r321, %rs88; 2026-02-21T09:10:45.7986405Z cvt.rn.f32.s16 %r322, %rs87; 2026-02-21T09:10:45.7986464Z cvt.rn.f32.s16 %r323, %rs86; 2026-02-21T09:10:45.7986524Z cvt.rn.f32.s16 %r324, %rs84; 2026-02-21T09:10:45.7986722Z .loc 1 69 25 // 
cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7986780Z cvt.s16.s8 %rs89, %rs50; 2026-02-21T09:10:45.7986837Z shr.s16 %rs90, %rs89, 4; 2026-02-21T09:10:45.7986895Z cvt.s16.s8 %rs91, %rs52; 2026-02-21T09:10:45.7986959Z shr.s16 %rs92, %rs91, 4; 2026-02-21T09:10:45.7987016Z shr.s16 %rs93, %rs49, 4; 2026-02-21T09:10:45.7987072Z shr.s16 %rs94, %rs51, 4; 2026-02-21T09:10:45.7987242Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.7987326Z cvt.rn.f32.s16 %r325, %rs94; 2026-02-21T09:10:45.7987385Z cvt.rn.f32.s16 %r326, %rs93; 2026-02-21T09:10:45.7987450Z cvt.rn.f32.s16 %r327, %rs92; 2026-02-21T09:10:45.7987507Z cvt.rn.f32.s16 %r328, %rs90; 2026-02-21T09:10:45.7987672Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7987755Z cvt.s16.s8 %rs95, %rs54; 2026-02-21T09:10:45.7987823Z shr.s16 %rs96, %rs95, 4; 2026-02-21T09:10:45.7987880Z cvt.s16.s8 %rs97, %rs56; 2026-02-21T09:10:45.7987936Z shr.s16 %rs98, %rs97, 4; 2026-02-21T09:10:45.7987999Z shr.s16 %rs99, %rs53, 4; 2026-02-21T09:10:45.7988058Z shr.s16 %rs100, %rs55, 4; 2026-02-21T09:10:45.7988227Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.7988289Z cvt.rn.f32.s16 %r329, %rs100; 2026-02-21T09:10:45.7988355Z cvt.rn.f32.s16 %r330, %rs99; 2026-02-21T09:10:45.7988414Z cvt.rn.f32.s16 %r331, %rs98; 2026-02-21T09:10:45.7988473Z cvt.rn.f32.s16 %r332, %rs96; 2026-02-21T09:10:45.7988651Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7988712Z cvt.s16.s8 %rs101, %rs58; 2026-02-21T09:10:45.7988774Z shr.s16 %rs102, %rs101, 4; 2026-02-21T09:10:45.7988840Z cvt.s16.s8 %rs103, %rs60; 2026-02-21T09:10:45.7988901Z shr.s16 %rs104, %rs103, 4; 2026-02-21T09:10:45.7988960Z shr.s16 %rs105, %rs57, 4; 2026-02-21T09:10:45.7989016Z shr.s16 %rs106, %rs59, 4; 2026-02-21T09:10:45.7989189Z .loc 1 87 32 // 
cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.7989252Z cvt.rn.f32.s16 %r333, %rs106; 2026-02-21T09:10:45.7989311Z cvt.rn.f32.s16 %r334, %rs105; 2026-02-21T09:10:45.7989380Z cvt.rn.f32.s16 %r335, %rs104; 2026-02-21T09:10:45.7989438Z cvt.rn.f32.s16 %r336, %rs102; 2026-02-21T09:10:45.7989604Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.7989674Z cvt.s16.s8 %rs107, %rs62; 2026-02-21T09:10:45.7989733Z shr.s16 %rs108, %rs107, 4; 2026-02-21T09:10:45.7989791Z cvt.s16.s8 %rs109, %rs64; 2026-02-21T09:10:45.7989851Z shr.s16 %rs110, %rs109, 4; 2026-02-21T09:10:45.7989917Z shr.s16 %rs111, %rs61, 4; 2026-02-21T09:10:45.7989975Z shr.s16 %rs112, %rs63, 4; 2026-02-21T09:10:45.7990145Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.7990215Z cvt.rn.f32.s16 %r337, %rs112; 2026-02-21T09:10:45.7990276Z cvt.rn.f32.s16 %r338, %rs111; 2026-02-21T09:10:45.7990335Z cvt.rn.f32.s16 %r339, %rs110; 2026-02-21T09:10:45.7990394Z cvt.rn.f32.s16 %r340, %rs108; 2026-02-21T09:10:45.7990503Z st.shared.v4.b32 [%r18], {%r312, %r310, %r311, %r309}; 2026-02-21T09:10:45.7990593Z st.shared.v4.b32 [%r19], {%r316, %r314, %r315, %r313}; 2026-02-21T09:10:45.7990680Z st.shared.v4.b32 [%r20], {%r320, %r318, %r319, %r317}; 2026-02-21T09:10:45.7990799Z st.shared.v4.b32 [%r21], {%r324, %r322, %r323, %r321}; 2026-02-21T09:10:45.7990883Z st.shared.v4.b32 [%r22], {%r328, %r326, %r327, %r325}; 2026-02-21T09:10:45.7990967Z st.shared.v4.b32 [%r23], {%r332, %r330, %r331, %r329}; 2026-02-21T09:10:45.7991056Z st.shared.v4.b32 [%r24], {%r336, %r334, %r335, %r333}; 2026-02-21T09:10:45.7991138Z st.shared.v4.b32 [%r25], {%r340, %r338, %r339, %r337}; 2026-02-21T09:10:45.7991213Z $L__tmp4: 2026-02-21T09:10:45.7991434Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.7991499Z // begin inline asm 
2026-02-21T09:10:45.7991602Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.7991660Z // end inline asm 2026-02-21T09:10:45.7991723Z bar.sync 0; 2026-02-21T09:10:45.7991785Z setp.ne.b32 %p34, %r30, 0; 2026-02-21T09:10:45.7991844Z @%p34 bra $L__BB0_3; 2026-02-21T09:10:45.7991904Z // %bb.2: 2026-02-21T09:10:45.7991971Z elect.sync %r353|%p36, -1; 2026-02-21T09:10:45.7992031Z mov.b32 %r343, 136317200; 2026-02-21T09:10:45.7992115Z mov.pred %p35, 0; 2026-02-21T09:10:45.7992183Z // begin inline asm 2026-02-21T09:10:45.7992345Z @%p36 tcgen05.mma.cta_group::1.kind::tf32 [ %r841 + 0 ], [ %r342 + 0 ], %rd55, %r343, %p35; 2026-02-21T09:10:45.7992401Z // end inline asm 2026-02-21T09:10:45.7992466Z // begin inline asm 2026-02-21T09:10:45.7992656Z @%p36 tcgen05.mma.cta_group::1.kind::tf32 [ %r841 + 0 ], [ %r342 + 8 ], %rd56, %r343, %p37; 2026-02-21T09:10:45.7992715Z // end inline asm 2026-02-21T09:10:45.7992772Z // begin inline asm 2026-02-21T09:10:45.7992927Z @%p36 tcgen05.mma.cta_group::1.kind::tf32 [ %r841 + 0 ], [ %r342 + 16 ], %rd57, %r343, %p37; 2026-02-21T09:10:45.7992982Z // end inline asm 2026-02-21T09:10:45.7993038Z // begin inline asm 2026-02-21T09:10:45.7993190Z @%p36 tcgen05.mma.cta_group::1.kind::tf32 [ %r841 + 0 ], [ %r342 + 24 ], %rd58, %r343, %p37; 2026-02-21T09:10:45.7993247Z // end inline asm 2026-02-21T09:10:45.7993309Z add.s32 %r355, %r44, 36864; 2026-02-21T09:10:45.7993379Z cvt.u64.u32 %rd59, %r355; 2026-02-21T09:10:45.7993435Z // begin inline asm 2026-02-21T09:10:45.7993560Z @%p36 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd59]; 2026-02-21T09:10:45.7993622Z // end inline asm 2026-02-21T09:10:45.7993675Z $L__tmp5: 2026-02-21T09:10:45.7993738Z $L__BB0_3: 2026-02-21T09:10:45.7993832Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:45.7993902Z shl.b32 %r15, %r6, 6; 2026-02-21T09:10:45.7994087Z .loc 1 58 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:32 2026-02-21T09:10:45.7994150Z add.s64 %rd60, %rd39, 128; 
2026-02-21T09:10:45.7994219Z cvt.u64.u32 %rd66, %r26; 2026-02-21T09:10:45.7994284Z add.s64 %rd67, %rd6, %rd66; 2026-02-21T09:10:45.7994344Z shl.b64 %rd68, %rd67, 1; 2026-02-21T09:10:45.7994408Z add.s64 %rd69, %rd17, %rd68; 2026-02-21T09:10:45.7994477Z add.s64 %rd61, %rd69, 65536; 2026-02-21T09:10:45.7994541Z add.s64 %rd62, %rd69, 131072; 2026-02-21T09:10:45.7994602Z add.s64 %rd63, %rd69, 196608; 2026-02-21T09:10:45.7994784Z .loc 1 58 80 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:80 2026-02-21T09:10:45.7994842Z // begin inline asm 2026-02-21T09:10:45.7994967Z cp.async.cg.shared.global [ %r356 + 0 ], [ %rd60 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7995028Z // end inline asm 2026-02-21T09:10:45.7995088Z // begin inline asm 2026-02-21T09:10:45.7995210Z cp.async.cg.shared.global [ %r358 + 0 ], [ %rd61 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7995266Z // end inline asm 2026-02-21T09:10:45.7995331Z // begin inline asm 2026-02-21T09:10:45.7995445Z cp.async.cg.shared.global [ %r360 + 0 ], [ %rd62 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7995501Z // end inline asm 2026-02-21T09:10:45.7995564Z // begin inline asm 2026-02-21T09:10:45.7995677Z cp.async.cg.shared.global [ %r362 + 0 ], [ %rd63 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7995763Z // end inline asm 2026-02-21T09:10:45.7995828Z cp.async.commit_group; 2026-02-21T09:10:45.7996012Z .loc 1 64 34 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:64:34 2026-02-21T09:10:45.7996075Z add.s64 %rd64, %rd43, 262144; 2026-02-21T09:10:45.7996248Z .loc 1 64 87 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:64:87 2026-02-21T09:10:45.7996313Z // begin inline asm 2026-02-21T09:10:45.7996425Z cp.async.cg.shared.global [ %r364 + 0 ], [ %rd64 + 0 ], 0x10, %r357; 2026-02-21T09:10:45.7996516Z // end inline asm 2026-02-21T09:10:45.7996588Z cp.async.commit_group; 2026-02-21T09:10:45.7996766Z .loc 1 50 125 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:50:125 2026-02-21T09:10:45.7996829Z 
cvt.u16.u32 %rs113, %r3; 2026-02-21T09:10:45.7996893Z shr.u16 %rs114, %rs113, 9; 2026-02-21T09:10:45.7996964Z and.b16 %rs115, %rs114, 3; 2026-02-21T09:10:45.7997039Z mad.wide.u16 %r369, %rs115, 2048, %r13; 2026-02-21T09:10:45.7997105Z shl.b32 %r370, %r27, 7; 2026-02-21T09:10:45.7997175Z add.s32 %r371, %r369, %r370; 2026-02-21T09:10:45.7997260Z add.s32 %r372, %r371, %r4; 2026-02-21T09:10:45.7997324Z add.s32 %r373, %r372, 393216; 2026-02-21T09:10:45.7997387Z cvt.u64.u32 %rd70, %r373; 2026-02-21T09:10:45.7997456Z add.s64 %rd343, %rd18, %rd70; 2026-02-21T09:10:45.7997517Z and.b32 %r374, %r1, 3; 2026-02-21T09:10:45.7997606Z mul.wide.u32 %rd71, %r374, 16; 2026-02-21T09:10:45.7997677Z shl.b32 %r375, %r3, 13; 2026-02-21T09:10:45.7997741Z and.b32 %r376, %r375, 4063232; 2026-02-21T09:10:45.7997802Z shl.b32 %r377, %r5, 10; 2026-02-21T09:10:45.7997874Z or.b32 %r378, %r376, %r377; 2026-02-21T09:10:45.7997937Z mul.wide.u32 %rd72, %r378, 2; 2026-02-21T09:10:45.7998001Z or.b64 %rd73, %rd71, %rd72; 2026-02-21T09:10:45.7998064Z add.s64 %rd74, %rd73, %rd17; 2026-02-21T09:10:45.7998135Z add.s64 %rd342, %rd74, 196800; 2026-02-21T09:10:45.7998195Z mov.b32 %r845, 1; 2026-02-21T09:10:45.7998255Z mov.b32 %r842, 0; 2026-02-21T09:10:45.7998326Z mov.b64 %rd344, -16; 2026-02-21T09:10:45.7998387Z mov.b32 %r844, %r842; 2026-02-21T09:10:45.7998449Z mov.b32 %r846, %r842; 2026-02-21T09:10:45.7998510Z bra.uni $L__BB0_4; 2026-02-21T09:10:45.7998630Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:10:45.7998816Z .loc 1 50 125 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:50:125 2026-02-21T09:10:45.7998882Z add.s64 %rd344, %rd344, 16; 2026-02-21T09:10:45.7998958Z setp.lt.u64 %p57, %rd344, 464; 2026-02-21T09:10:45.7999013Z $L__tmp6: 2026-02-21T09:10:45.7999241Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.7999309Z add.s32 %r499, %r845, 1; 2026-02-21T09:10:45.7999373Z setp.gt.s32 %p58, %r499, 1; 
2026-02-21T09:10:45.7999439Z selp.b32 %r845, 0, %r499, %p58; 2026-02-21T09:10:45.7999503Z selp.b32 %r500, 1, 0, %p58; 2026-02-21T09:10:45.7999574Z xor.b32 %r43, %r846, %r500; 2026-02-21T09:10:45.7999629Z $L__tmp7: 2026-02-21T09:10:45.7999808Z .loc 1 58 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:32 2026-02-21T09:10:45.7999883Z add.s64 %rd80, %rd342, -196608; 2026-02-21T09:10:45.7999947Z add.s64 %rd81, %rd342, -131072; 2026-02-21T09:10:45.8000011Z add.s64 %rd82, %rd342, -65536; 2026-02-21T09:10:45.8000187Z .loc 1 58 80 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:80 2026-02-21T09:10:45.8000257Z add.s32 %r489, %r39, %r7; 2026-02-21T09:10:45.8000322Z selp.b32 %r490, 16, 0, %p57; 2026-02-21T09:10:45.8000382Z // begin inline asm 2026-02-21T09:10:45.8000506Z cp.async.cg.shared.global [ %r489 + 0 ], [ %rd80 + 0 ], 0x10, %r490; 2026-02-21T09:10:45.8000565Z // end inline asm 2026-02-21T09:10:45.8000626Z add.s32 %r491, %r489, 2048; 2026-02-21T09:10:45.8000690Z // begin inline asm 2026-02-21T09:10:45.8000814Z cp.async.cg.shared.global [ %r491 + 0 ], [ %rd81 + 0 ], 0x10, %r490; 2026-02-21T09:10:45.8000894Z // end inline asm 2026-02-21T09:10:45.8000953Z add.s32 %r493, %r489, 4096; 2026-02-21T09:10:45.8001018Z // begin inline asm 2026-02-21T09:10:45.8001126Z cp.async.cg.shared.global [ %r493 + 0 ], [ %rd82 + 0 ], 0x10, %r490; 2026-02-21T09:10:45.8001180Z // end inline asm 2026-02-21T09:10:45.8001246Z add.s32 %r495, %r489, 6144; 2026-02-21T09:10:45.8001304Z // begin inline asm 2026-02-21T09:10:45.8001420Z cp.async.cg.shared.global [ %r495 + 0 ], [ %rd342 + 0 ], 0x10, %r490; 2026-02-21T09:10:45.8001499Z // end inline asm 2026-02-21T09:10:45.8001595Z cp.async.commit_group; 2026-02-21T09:10:45.8001767Z .loc 1 64 87 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:64:87 2026-02-21T09:10:45.8001833Z mad.lo.s32 %r497, %r6, 15, %r40; 2026-02-21T09:10:45.8001898Z // begin inline asm 2026-02-21T09:10:45.8002012Z cp.async.cg.shared.global [ 
%r497 + 0 ], [ %rd343 + 0 ], 0x10, %r490; 2026-02-21T09:10:45.8002069Z // end inline asm 2026-02-21T09:10:45.8002138Z cp.async.commit_group; 2026-02-21T09:10:45.8002335Z .loc 1 50 125 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:50:125 2026-02-21T09:10:45.8002399Z add.s64 %rd343, %rd343, 131072; 2026-02-21T09:10:45.8002459Z add.s64 %rd342, %rd342, 64; 2026-02-21T09:10:45.8002530Z setp.lt.u64 %p59, %rd344, 480; 2026-02-21T09:10:45.8002588Z mov.b32 %r842, %r846; 2026-02-21T09:10:45.8002672Z mov.b32 %r846, %r43; 2026-02-21T09:10:45.8002739Z @%p59 bra $L__BB0_4; 2026-02-21T09:10:45.8002797Z bra.uni $L__BB0_7; 2026-02-21T09:10:45.8002901Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:10:45.8002961Z add.s32 %r416, %r844, 1; 2026-02-21T09:10:45.8003030Z setp.gt.s32 %p47, %r416, 1; 2026-02-21T09:10:45.8003093Z selp.b32 %r844, 0, %r416, %p47; 2026-02-21T09:10:45.8003261Z .loc 1 58 80 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:58:80 2026-02-21T09:10:45.8003334Z cp.async.wait_group 2; 2026-02-21T09:10:45.8003389Z bar.sync 0; 2026-02-21T09:10:45.8003448Z shl.b32 %r417, %r844, 13; 2026-02-21T09:10:45.8003513Z add.s32 %r419, %r44, %r417; 2026-02-21T09:10:45.8003570Z add.s32 %r39, %r419, 16384; 2026-02-21T09:10:45.8003627Z add.s32 %r420, %r39, %r15; 2026-02-21T09:10:45.8003720Z ld.shared.v4.b32 {%r421, %r422, %r423, %r424}, [%r420]; 2026-02-21T09:10:45.8003793Z mov.b32 {%rs116, %rs117}, %r424; 2026-02-21T09:10:45.8003857Z mov.b32 {%rs118, %rs119}, %r423; 2026-02-21T09:10:45.8003915Z mov.b32 {%rs120, %rs121}, %r422; 2026-02-21T09:10:45.8003979Z mov.b32 {%rs122, %rs123}, %r421; 2026-02-21T09:10:45.8004078Z ld.shared.v4.b32 {%r425, %r426, %r427, %r428}, [%r420+16]; 2026-02-21T09:10:45.8004136Z mov.b32 {%rs124, %rs125}, %r428; 2026-02-21T09:10:45.8004194Z mov.b32 {%rs126, %rs127}, %r427; 2026-02-21T09:10:45.8004257Z mov.b32 {%rs128, %rs129}, %r426; 2026-02-21T09:10:45.8004315Z mov.b32 {%rs130, %rs131}, %r425; 2026-02-21T09:10:45.8004410Z 
ld.shared.v4.b32 {%r429, %r430, %r431, %r432}, [%r420+32]; 2026-02-21T09:10:45.8004478Z mov.b32 {%rs132, %rs133}, %r432; 2026-02-21T09:10:45.8004536Z mov.b32 {%rs134, %rs135}, %r431; 2026-02-21T09:10:45.8004594Z mov.b32 {%rs136, %rs137}, %r430; 2026-02-21T09:10:45.8004659Z mov.b32 {%rs138, %rs139}, %r429; 2026-02-21T09:10:45.8004751Z ld.shared.v4.b32 {%r433, %r434, %r435, %r436}, [%r420+48]; 2026-02-21T09:10:45.8004810Z mov.b32 {%rs140, %rs141}, %r436; 2026-02-21T09:10:45.8004868Z mov.b32 {%rs142, %rs143}, %r435; 2026-02-21T09:10:45.8004933Z mov.b32 {%rs144, %rs145}, %r434; 2026-02-21T09:10:45.8004990Z mov.b32 {%rs146, %rs147}, %r433; 2026-02-21T09:10:45.8005159Z .loc 1 62 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:62:32 2026-02-21T09:10:45.8005227Z cvt.f32.bf16 %r383, %rs122; 2026-02-21T09:10:45.8005288Z cvt.f32.bf16 %r384, %rs123; 2026-02-21T09:10:45.8005346Z cvt.f32.bf16 %r385, %rs120; 2026-02-21T09:10:45.8005442Z cvt.f32.bf16 %r386, %rs121; 2026-02-21T09:10:45.8005500Z cvt.f32.bf16 %r387, %rs118; 2026-02-21T09:10:45.8005560Z cvt.f32.bf16 %r388, %rs119; 2026-02-21T09:10:45.8005617Z cvt.f32.bf16 %r389, %rs116; 2026-02-21T09:10:45.8005681Z cvt.f32.bf16 %r390, %rs117; 2026-02-21T09:10:45.8005738Z cvt.f32.bf16 %r391, %rs130; 2026-02-21T09:10:45.8005795Z cvt.f32.bf16 %r392, %rs131; 2026-02-21T09:10:45.8005860Z cvt.f32.bf16 %r393, %rs128; 2026-02-21T09:10:45.8005918Z cvt.f32.bf16 %r394, %rs129; 2026-02-21T09:10:45.8006004Z cvt.f32.bf16 %r395, %rs126; 2026-02-21T09:10:45.8006062Z cvt.f32.bf16 %r396, %rs127; 2026-02-21T09:10:45.8006130Z cvt.f32.bf16 %r397, %rs124; 2026-02-21T09:10:45.8006189Z cvt.f32.bf16 %r398, %rs125; 2026-02-21T09:10:45.8006248Z cvt.f32.bf16 %r400, %rs138; 2026-02-21T09:10:45.8006316Z cvt.f32.bf16 %r401, %rs139; 2026-02-21T09:10:45.8006374Z cvt.f32.bf16 %r402, %rs136; 2026-02-21T09:10:45.8006432Z cvt.f32.bf16 %r403, %rs137; 2026-02-21T09:10:45.8006489Z cvt.f32.bf16 %r404, %rs134; 2026-02-21T09:10:45.8006555Z cvt.f32.bf16 %r405, 
%rs135; 2026-02-21T09:10:45.8006612Z cvt.f32.bf16 %r406, %rs132; 2026-02-21T09:10:45.8006692Z cvt.f32.bf16 %r407, %rs133; 2026-02-21T09:10:45.8006760Z cvt.f32.bf16 %r408, %rs146; 2026-02-21T09:10:45.8006817Z cvt.f32.bf16 %r409, %rs147; 2026-02-21T09:10:45.8006875Z cvt.f32.bf16 %r410, %rs144; 2026-02-21T09:10:45.8006932Z cvt.f32.bf16 %r411, %rs145; 2026-02-21T09:10:45.8007016Z cvt.f32.bf16 %r412, %rs142; 2026-02-21T09:10:45.8007077Z cvt.f32.bf16 %r413, %rs143; 2026-02-21T09:10:45.8007134Z cvt.f32.bf16 %r414, %rs140; 2026-02-21T09:10:45.8007198Z cvt.f32.bf16 %r415, %rs141; 2026-02-21T09:10:45.8007250Z $L__tmp8: 2026-02-21T09:10:45.8007470Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8007534Z // begin inline asm 2026-02-21T09:10:45.8007586Z 2026-02-21T09:10:45.8007637Z { 2026-02-21T09:10:45.8007700Z .reg .pred complete; 2026-02-21T09:10:45.8007766Z waitLoop: 2026-02-21T09:10:45.8007884Z mbarrier.try_wait.parity.shared.b64 complete, [%r843], %r842; 2026-02-21T09:10:45.8007950Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.8008007Z } 2026-02-21T09:10:45.8008011Z 2026-02-21T09:10:45.8008066Z // end inline asm 2026-02-21T09:10:45.8008127Z mov.pred %p48, -1; 2026-02-21T09:10:45.8008183Z // begin inline asm 2026-02-21T09:10:45.8008475Z @%p48 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r219 + 0], {%r383, %r384, %r385, %r386, %r387, %r388, %r389, %r390, %r391, %r392, %r393, %r394, %r395, %r396, %r397, %r398}; 2026-02-21T09:10:45.8008532Z // end inline asm 2026-02-21T09:10:45.8008588Z // begin inline asm 2026-02-21T09:10:45.8008876Z @%p48 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r219 + 16], {%r400, %r401, %r402, %r403, %r404, %r405, %r406, %r407, %r408, %r409, %r410, %r411, %r412, %r413, %r414, %r415}; 2026-02-21T09:10:45.8008932Z // end inline asm 2026-02-21T09:10:45.8008988Z // begin inline asm 2026-02-21T09:10:45.8009067Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:45.8009120Z // end inline 
asm 2026-02-21T09:10:45.8009174Z bar.sync 0; 2026-02-21T09:10:45.8009229Z $L__tmp9: 2026-02-21T09:10:45.8009408Z .loc 1 64 87 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:64:87 2026-02-21T09:10:45.8009466Z shl.b32 %r437, %r844, 11; 2026-02-21T09:10:45.8009526Z add.s32 %r438, %r44, %r437; 2026-02-21T09:10:45.8009594Z add.s32 %r439, %r438, 32768; 2026-02-21T09:10:45.8009656Z add.s32 %r40, %r439, %r6; 2026-02-21T09:10:45.8009714Z add.s32 %r440, %r439, %r16; 2026-02-21T09:10:45.8009778Z add.s32 %r441, %r439, %r17; 2026-02-21T09:10:45.8009943Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8010007Z ld.shared.s8 %rs148, [%r40]; 2026-02-21T09:10:45.8010168Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8010235Z shl.b16 %rs149, %rs148, 4; 2026-02-21T09:10:45.8010418Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8010485Z ld.shared.s8 %rs150, [%r40+128]; 2026-02-21T09:10:45.8010654Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8010715Z shl.b16 %rs151, %rs150, 4; 2026-02-21T09:10:45.8010880Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8010974Z ld.shared.s8 %rs152, [%r40+256]; 2026-02-21T09:10:45.8011140Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8011201Z shl.b16 %rs153, %rs152, 4; 2026-02-21T09:10:45.8011372Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8011434Z ld.shared.s8 %rs154, [%r40+384]; 2026-02-21T09:10:45.8011625Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8011687Z shl.b16 %rs155, %rs154, 4; 2026-02-21T09:10:45.8011896Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8011960Z 
ld.shared.s8 %rs156, [%r40+512]; 2026-02-21T09:10:45.8012127Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8012221Z shl.b16 %rs157, %rs156, 4; 2026-02-21T09:10:45.8012392Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8012454Z ld.shared.s8 %rs158, [%r40+640]; 2026-02-21T09:10:45.8012627Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8012686Z shl.b16 %rs159, %rs158, 4; 2026-02-21T09:10:45.8012851Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8012923Z ld.shared.s8 %rs160, [%r40+768]; 2026-02-21T09:10:45.8013090Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8013149Z shl.b16 %rs161, %rs160, 4; 2026-02-21T09:10:45.8013320Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8013392Z ld.shared.s8 %rs162, [%r440]; 2026-02-21T09:10:45.8013557Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8013618Z shl.b16 %rs163, %rs162, 4; 2026-02-21T09:10:45.8013790Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8013856Z ld.shared.s8 %rs164, [%r40+1024]; 2026-02-21T09:10:45.8014022Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8014089Z shl.b16 %rs165, %rs164, 4; 2026-02-21T09:10:45.8014254Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8014320Z ld.shared.s8 %rs166, [%r40+1152]; 2026-02-21T09:10:45.8014493Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8014554Z shl.b16 %rs167, %rs166, 4; 2026-02-21T09:10:45.8014720Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8014785Z 
ld.shared.s8 %rs168, [%r40+1280]; 2026-02-21T09:10:45.8014954Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8015013Z shl.b16 %rs169, %rs168, 4; 2026-02-21T09:10:45.8015180Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8015249Z ld.shared.s8 %rs170, [%r40+1408]; 2026-02-21T09:10:45.8015416Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8015500Z shl.b16 %rs171, %rs170, 4; 2026-02-21T09:10:45.8015675Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8015737Z ld.shared.s8 %rs172, [%r40+1536]; 2026-02-21T09:10:45.8015905Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8015998Z shl.b16 %rs173, %rs172, 4; 2026-02-21T09:10:45.8016161Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8016225Z ld.shared.s8 %rs174, [%r40+1664]; 2026-02-21T09:10:45.8016390Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8016459Z shl.b16 %rs175, %rs174, 4; 2026-02-21T09:10:45.8016625Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8016688Z ld.shared.s8 %rs176, [%r40+1792]; 2026-02-21T09:10:45.8016879Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8016938Z shl.b16 %rs177, %rs176, 4; 2026-02-21T09:10:45.8017105Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8017196Z ld.shared.s8 %rs178, [%r441]; 2026-02-21T09:10:45.8017364Z .loc 1 67 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:67:28 2026-02-21T09:10:45.8017422Z shl.b16 %rs179, %rs178, 4; 2026-02-21T09:10:45.8017592Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8017651Z 
cvt.s16.s8 %rs180, %rs149; 2026-02-21T09:10:45.8017708Z shr.s16 %rs181, %rs180, 4; 2026-02-21T09:10:45.8017765Z cvt.s16.s8 %rs182, %rs151; 2026-02-21T09:10:45.8017829Z shr.s16 %rs183, %rs182, 4; 2026-02-21T09:10:45.8017887Z shr.s16 %rs184, %rs148, 4; 2026-02-21T09:10:45.8017945Z shr.s16 %rs185, %rs150, 4; 2026-02-21T09:10:45.8018120Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.8018182Z cvt.rn.f32.s16 %r442, %rs185; 2026-02-21T09:10:45.8018241Z cvt.rn.f32.s16 %r443, %rs184; 2026-02-21T09:10:45.8018308Z cvt.rn.f32.s16 %r444, %rs183; 2026-02-21T09:10:45.8018368Z cvt.rn.f32.s16 %r445, %rs181; 2026-02-21T09:10:45.8018532Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8018592Z cvt.s16.s8 %rs186, %rs153; 2026-02-21T09:10:45.8018657Z shr.s16 %rs187, %rs186, 4; 2026-02-21T09:10:45.8018715Z cvt.s16.s8 %rs188, %rs155; 2026-02-21T09:10:45.8018773Z shr.s16 %rs189, %rs188, 4; 2026-02-21T09:10:45.8018837Z shr.s16 %rs190, %rs152, 4; 2026-02-21T09:10:45.8018894Z shr.s16 %rs191, %rs154, 4; 2026-02-21T09:10:45.8019057Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.8019119Z cvt.rn.f32.s16 %r446, %rs191; 2026-02-21T09:10:45.8019185Z cvt.rn.f32.s16 %r447, %rs190; 2026-02-21T09:10:45.8019245Z cvt.rn.f32.s16 %r448, %rs189; 2026-02-21T09:10:45.8019304Z cvt.rn.f32.s16 %r449, %rs187; 2026-02-21T09:10:45.8019473Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8019535Z cvt.s16.s8 %rs192, %rs157; 2026-02-21T09:10:45.8019592Z shr.s16 %rs193, %rs192, 4; 2026-02-21T09:10:45.8019657Z cvt.s16.s8 %rs194, %rs159; 2026-02-21T09:10:45.8019714Z shr.s16 %rs195, %rs194, 4; 2026-02-21T09:10:45.8019771Z shr.s16 %rs196, %rs156, 4; 2026-02-21T09:10:45.8019828Z shr.s16 %rs197, %rs158, 4; 2026-02-21T09:10:45.8019996Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 
2026-02-21T09:10:45.8020055Z cvt.rn.f32.s16 %r450, %rs197; 2026-02-21T09:10:45.8020137Z cvt.rn.f32.s16 %r451, %rs196; 2026-02-21T09:10:45.8020201Z cvt.rn.f32.s16 %r452, %rs195; 2026-02-21T09:10:45.8020260Z cvt.rn.f32.s16 %r453, %rs193; 2026-02-21T09:10:45.8020427Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8020491Z cvt.s16.s8 %rs198, %rs161; 2026-02-21T09:10:45.8020547Z shr.s16 %rs199, %rs198, 4; 2026-02-21T09:10:45.8020605Z cvt.s16.s8 %rs200, %rs163; 2026-02-21T09:10:45.8020683Z shr.s16 %rs201, %rs200, 4; 2026-02-21T09:10:45.8020746Z shr.s16 %rs202, %rs160, 4; 2026-02-21T09:10:45.8020802Z shr.s16 %rs203, %rs162, 4; 2026-02-21T09:10:45.8020967Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.8021031Z cvt.rn.f32.s16 %r454, %rs203; 2026-02-21T09:10:45.8021090Z cvt.rn.f32.s16 %r455, %rs202; 2026-02-21T09:10:45.8021148Z cvt.rn.f32.s16 %r456, %rs201; 2026-02-21T09:10:45.8021205Z cvt.rn.f32.s16 %r457, %rs199; 2026-02-21T09:10:45.8021381Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8021461Z cvt.s16.s8 %rs204, %rs165; 2026-02-21T09:10:45.8021519Z shr.s16 %rs205, %rs204, 4; 2026-02-21T09:10:45.8021612Z cvt.s16.s8 %rs206, %rs167; 2026-02-21T09:10:45.8021671Z shr.s16 %rs207, %rs206, 4; 2026-02-21T09:10:45.8021727Z shr.s16 %rs208, %rs164, 4; 2026-02-21T09:10:45.8021820Z shr.s16 %rs209, %rs166, 4; 2026-02-21T09:10:45.8021988Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.8022046Z cvt.rn.f32.s16 %r458, %rs209; 2026-02-21T09:10:45.8022104Z cvt.rn.f32.s16 %r459, %rs208; 2026-02-21T09:10:45.8022169Z cvt.rn.f32.s16 %r460, %rs207; 2026-02-21T09:10:45.8022227Z cvt.rn.f32.s16 %r461, %rs205; 2026-02-21T09:10:45.8022395Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8022460Z cvt.s16.s8 %rs210, %rs169; 2026-02-21T09:10:45.8022519Z 
shr.s16 %rs211, %rs210, 4; 2026-02-21T09:10:45.8022577Z cvt.s16.s8 %rs212, %rs171; 2026-02-21T09:10:45.8022635Z shr.s16 %rs213, %rs212, 4; 2026-02-21T09:10:45.8022699Z shr.s16 %rs214, %rs168, 4; 2026-02-21T09:10:45.8022756Z shr.s16 %rs215, %rs170, 4; 2026-02-21T09:10:45.8022921Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.8022990Z cvt.rn.f32.s16 %r462, %rs215; 2026-02-21T09:10:45.8023049Z cvt.rn.f32.s16 %r463, %rs214; 2026-02-21T09:10:45.8023108Z cvt.rn.f32.s16 %r464, %rs213; 2026-02-21T09:10:45.8023172Z cvt.rn.f32.s16 %r465, %rs211; 2026-02-21T09:10:45.8023338Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8023398Z cvt.s16.s8 %rs216, %rs173; 2026-02-21T09:10:45.8023454Z shr.s16 %rs217, %rs216, 4; 2026-02-21T09:10:45.8023520Z cvt.s16.s8 %rs218, %rs175; 2026-02-21T09:10:45.8023579Z shr.s16 %rs219, %rs218, 4; 2026-02-21T09:10:45.8023636Z shr.s16 %rs220, %rs172, 4; 2026-02-21T09:10:45.8023702Z shr.s16 %rs221, %rs174, 4; 2026-02-21T09:10:45.8023866Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.8023926Z cvt.rn.f32.s16 %r466, %rs221; 2026-02-21T09:10:45.8023992Z cvt.rn.f32.s16 %r467, %rs220; 2026-02-21T09:10:45.8024050Z cvt.rn.f32.s16 %r468, %rs219; 2026-02-21T09:10:45.8024110Z cvt.rn.f32.s16 %r469, %rs217; 2026-02-21T09:10:45.8024279Z .loc 1 69 25 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:69:25 2026-02-21T09:10:45.8024347Z cvt.s16.s8 %rs222, %rs177; 2026-02-21T09:10:45.8024406Z shr.s16 %rs223, %rs222, 4; 2026-02-21T09:10:45.8024464Z cvt.s16.s8 %rs224, %rs179; 2026-02-21T09:10:45.8024540Z shr.s16 %rs225, %rs224, 4; 2026-02-21T09:10:45.8024597Z shr.s16 %rs226, %rs176, 4; 2026-02-21T09:10:45.8024654Z shr.s16 %rs227, %rs178, 4; 2026-02-21T09:10:45.8024847Z .loc 1 87 32 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:87:32 2026-02-21T09:10:45.8024917Z cvt.rn.f32.s16 %r470, %rs227; 
2026-02-21T09:10:45.8024976Z cvt.rn.f32.s16 %r471, %rs226; 2026-02-21T09:10:45.8025034Z cvt.rn.f32.s16 %r472, %rs225; 2026-02-21T09:10:45.8025097Z cvt.rn.f32.s16 %r473, %rs223; 2026-02-21T09:10:45.8025192Z st.shared.v4.b32 [%r18], {%r445, %r443, %r444, %r442}; 2026-02-21T09:10:45.8025283Z st.shared.v4.b32 [%r19], {%r449, %r447, %r448, %r446}; 2026-02-21T09:10:45.8025402Z st.shared.v4.b32 [%r20], {%r453, %r451, %r452, %r450}; 2026-02-21T09:10:45.8025489Z st.shared.v4.b32 [%r21], {%r457, %r455, %r456, %r454}; 2026-02-21T09:10:45.8025572Z st.shared.v4.b32 [%r22], {%r461, %r459, %r460, %r458}; 2026-02-21T09:10:45.8025654Z st.shared.v4.b32 [%r23], {%r465, %r463, %r464, %r462}; 2026-02-21T09:10:45.8025744Z st.shared.v4.b32 [%r24], {%r469, %r467, %r468, %r466}; 2026-02-21T09:10:45.8025827Z st.shared.v4.b32 [%r25], {%r473, %r471, %r472, %r470}; 2026-02-21T09:10:45.8026041Z .loc 1 50 125 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:50:125 2026-02-21T09:10:45.8026111Z shl.b32 %r474, %r845, 3; 2026-02-21T09:10:45.8026171Z add.s32 %r475, %r44, %r474; 2026-02-21T09:10:45.8026232Z add.s32 %r843, %r475, 36864; 2026-02-21T09:10:45.8026295Z $L__tmp10: 2026-02-21T09:10:45.8026538Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8026598Z // begin inline asm 2026-02-21T09:10:45.8026669Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.8026733Z // end inline asm 2026-02-21T09:10:45.8026788Z bar.sync 0; 2026-02-21T09:10:45.8026847Z @%p34 bra $L__BB0_6; 2026-02-21T09:10:45.8026954Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:10:45.8027014Z elect.sync %r488|%p49, -1; 2026-02-21T09:10:45.8027068Z mov.b32 %r478, 136317200; 2026-02-21T09:10:45.8027126Z // begin inline asm 2026-02-21T09:10:45.8027282Z @%p49 tcgen05.mma.cta_group::1.kind::tf32 [ %r841 + 0 ], [ %r342 + 0 ], %rd55, %r478, %p48; 2026-02-21T09:10:45.8027334Z // end inline asm 2026-02-21T09:10:45.8027391Z // begin inline asm 
2026-02-21T09:10:45.8027545Z @%p49 tcgen05.mma.cta_group::1.kind::tf32 [ %r841 + 0 ], [ %r342 + 8 ], %rd56, %r478, %p48; 2026-02-21T09:10:45.8027600Z // end inline asm 2026-02-21T09:10:45.8027655Z // begin inline asm 2026-02-21T09:10:45.8027808Z @%p49 tcgen05.mma.cta_group::1.kind::tf32 [ %r841 + 0 ], [ %r342 + 16 ], %rd57, %r478, %p48; 2026-02-21T09:10:45.8027865Z // end inline asm 2026-02-21T09:10:45.8027921Z // begin inline asm 2026-02-21T09:10:45.8028069Z @%p49 tcgen05.mma.cta_group::1.kind::tf32 [ %r841 + 0 ], [ %r342 + 24 ], %rd58, %r478, %p48; 2026-02-21T09:10:45.8028123Z // end inline asm 2026-02-21T09:10:45.8028183Z cvt.u64.u32 %rd79, %r843; 2026-02-21T09:10:45.8028239Z // begin inline asm 2026-02-21T09:10:45.8028368Z @%p49 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd79]; 2026-02-21T09:10:45.8028424Z // end inline asm 2026-02-21T09:10:45.8028478Z bra.uni $L__BB0_6; 2026-02-21T09:10:45.8028580Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:10:45.8028669Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:45.8028733Z setp.lt.u32 %p63, %r1, 64; 2026-02-21T09:10:45.8028794Z mov.b32 %r502, 1; 2026-02-21T09:10:45.8029010Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8029067Z // begin inline asm 2026-02-21T09:10:45.8029118Z 2026-02-21T09:10:45.8029173Z { 2026-02-21T09:10:45.8029234Z .reg .pred complete; 2026-02-21T09:10:45.8029289Z waitLoop: 2026-02-21T09:10:45.8029410Z mbarrier.try_wait.parity.shared.b64 complete, [%r843], %r502; 2026-02-21T09:10:45.8029476Z @!complete bra.uni waitLoop; 2026-02-21T09:10:45.8029527Z } 2026-02-21T09:10:45.8029531Z 2026-02-21T09:10:45.8029585Z // end inline asm 2026-02-21T09:10:45.8029669Z $L__tmp11: 2026-02-21T09:10:45.8029845Z .loc 1 50 125 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:50:125 2026-02-21T09:10:45.8029910Z cp.async.wait_group 0; 2026-02-21T09:10:45.8029971Z bar.sync 0; 2026-02-21T09:10:45.8030031Z add.s32 
%r503, %r44, 36864; 2026-02-21T09:10:45.8030087Z // begin inline asm 2026-02-21T09:10:45.8030179Z @%p60 mbarrier.inval.shared::cta.b64 [%r503]; 2026-02-21T09:10:45.8030268Z // end inline asm 2026-02-21T09:10:45.8030322Z bar.sync 0; 2026-02-21T09:10:45.8030376Z // begin inline asm 2026-02-21T09:10:45.8030468Z @%p60 mbarrier.inval.shared::cta.b64 [%r198]; 2026-02-21T09:10:45.8030522Z // end inline asm 2026-02-21T09:10:45.8030574Z $L__tmp12: 2026-02-21T09:10:45.8030793Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8030849Z // begin inline asm 2026-02-21T09:10:45.8031136Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r505, %r506, %r507, %r508, %r509, %r510, %r511, %r512, %r513, %r514, %r515, %r516, %r517, %r518, %r519, %r520}, [%r640 + 0]; 2026-02-21T09:10:45.8031200Z // end inline asm 2026-02-21T09:10:45.8031256Z // begin inline asm 2026-02-21T09:10:45.8031517Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r522, %r523, %r524, %r525, %r526, %r527, %r528, %r529, %r530, %r531, %r532, %r533, %r534, %r535, %r536, %r537}, [%r640 + 16]; 2026-02-21T09:10:45.8031648Z // end inline asm 2026-02-21T09:10:45.8031715Z // begin inline asm 2026-02-21T09:10:45.8031973Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r539, %r540, %r541, %r542, %r543, %r544, %r545, %r546, %r547, %r548, %r549, %r550, %r551, %r552, %r553, %r554}, [%r640 + 32]; 2026-02-21T09:10:45.8032028Z // end inline asm 2026-02-21T09:10:45.8032092Z // begin inline asm 2026-02-21T09:10:45.8032348Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r556, %r557, %r558, %r559, %r560, %r561, %r562, %r563, %r564, %r565, %r566, %r567, %r568, %r569, %r570, %r571}, [%r640 + 48]; 2026-02-21T09:10:45.8032407Z // end inline asm 2026-02-21T09:10:45.8032469Z // begin inline asm 2026-02-21T09:10:45.8032729Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r573, %r574, %r575, %r576, %r577, %r578, %r579, %r580, %r581, %r582, %r583, %r584, %r585, %r586, %r587, %r588}, [%r640 + 64]; 
2026-02-21T09:10:45.8032784Z // end inline asm 2026-02-21T09:10:45.8032847Z // begin inline asm 2026-02-21T09:10:45.8033105Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r590, %r591, %r592, %r593, %r594, %r595, %r596, %r597, %r598, %r599, %r600, %r601, %r602, %r603, %r604, %r605}, [%r640 + 80]; 2026-02-21T09:10:45.8033162Z // end inline asm 2026-02-21T09:10:45.8033218Z // begin inline asm 2026-02-21T09:10:45.8033486Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r607, %r608, %r609, %r610, %r611, %r612, %r613, %r614, %r615, %r616, %r617, %r618, %r619, %r620, %r621, %r622}, [%r640 + 96]; 2026-02-21T09:10:45.8033542Z // end inline asm 2026-02-21T09:10:45.8033598Z // begin inline asm 2026-02-21T09:10:45.8033871Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r624, %r625, %r626, %r627, %r628, %r629, %r630, %r631, %r632, %r633, %r634, %r635, %r636, %r637, %r638, %r639}, [%r640 + 112]; 2026-02-21T09:10:45.8033929Z // end inline asm 2026-02-21T09:10:45.8033983Z // begin inline asm 2026-02-21T09:10:45.8034067Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:45.8034127Z // end inline asm 2026-02-21T09:10:45.8034192Z cvt.u64.u32 %rd86, %r505; 2026-02-21T09:10:45.8034256Z cvt.u64.u32 %rd87, %r506; 2026-02-21T09:10:45.8034323Z shl.b64 %rd88, %rd87, 32; 2026-02-21T09:10:45.8034385Z or.b64 %rd89, %rd86, %rd88; 2026-02-21T09:10:45.8034438Z $L__tmp13: 2026-02-21T09:10:45.8034619Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8034682Z mov.b64 {%r645, %r646}, %rd89; 2026-02-21T09:10:45.8034753Z cvt.rn.bf16x2.f32 %r647, %r646, %r645; 2026-02-21T09:10:45.8034814Z $L__tmp14: 2026-02-21T09:10:45.8035030Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8035116Z cvt.u64.u32 %rd90, %r507; 2026-02-21T09:10:45.8035174Z cvt.u64.u32 %rd91, %r508; 2026-02-21T09:10:45.8035240Z shl.b64 %rd92, %rd91, 32; 2026-02-21T09:10:45.8035299Z or.b64 %rd93, %rd90, %rd92; 2026-02-21T09:10:45.8035351Z 
$L__tmp15: 2026-02-21T09:10:45.8035526Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8035614Z mov.b64 {%r648, %r649}, %rd93; 2026-02-21T09:10:45.8035685Z cvt.rn.bf16x2.f32 %r650, %r649, %r648; 2026-02-21T09:10:45.8035738Z $L__tmp16: 2026-02-21T09:10:45.8035956Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8036015Z cvt.u64.u32 %rd94, %r509; 2026-02-21T09:10:45.8036073Z cvt.u64.u32 %rd95, %r510; 2026-02-21T09:10:45.8036140Z shl.b64 %rd96, %rd95, 32; 2026-02-21T09:10:45.8036198Z or.b64 %rd97, %rd94, %rd96; 2026-02-21T09:10:45.8036250Z $L__tmp17: 2026-02-21T09:10:45.8036449Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8036511Z mov.b64 {%r651, %r652}, %rd97; 2026-02-21T09:10:45.8036577Z cvt.rn.bf16x2.f32 %r653, %r652, %r651; 2026-02-21T09:10:45.8036630Z $L__tmp18: 2026-02-21T09:10:45.8036869Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8036933Z cvt.u64.u32 %rd98, %r511; 2026-02-21T09:10:45.8036993Z cvt.u64.u32 %rd99, %r512; 2026-02-21T09:10:45.8037064Z shl.b64 %rd100, %rd99, 32; 2026-02-21T09:10:45.8037127Z or.b64 %rd101, %rd98, %rd100; 2026-02-21T09:10:45.8037181Z $L__tmp19: 2026-02-21T09:10:45.8037366Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8037432Z mov.b64 {%r654, %r655}, %rd101; 2026-02-21T09:10:45.8037501Z cvt.rn.bf16x2.f32 %r656, %r655, %r654; 2026-02-21T09:10:45.8037557Z $L__tmp20: 2026-02-21T09:10:45.8037790Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8037853Z cvt.u64.u32 %rd102, %r513; 2026-02-21T09:10:45.8037915Z cvt.u64.u32 %rd103, %r514; 2026-02-21T09:10:45.8037983Z shl.b64 %rd104, %rd103, 32; 2026-02-21T09:10:45.8038045Z or.b64 %rd105, %rd102, %rd104; 
2026-02-21T09:10:45.8038102Z $L__tmp21: 2026-02-21T09:10:45.8038285Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8038350Z mov.b64 {%r657, %r658}, %rd105; 2026-02-21T09:10:45.8038418Z cvt.rn.bf16x2.f32 %r659, %r658, %r657; 2026-02-21T09:10:45.8038472Z $L__tmp22: 2026-02-21T09:10:45.8038697Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8038759Z cvt.u64.u32 %rd106, %r515; 2026-02-21T09:10:45.8038820Z cvt.u64.u32 %rd107, %r516; 2026-02-21T09:10:45.8038890Z shl.b64 %rd108, %rd107, 32; 2026-02-21T09:10:45.8038955Z or.b64 %rd109, %rd106, %rd108; 2026-02-21T09:10:45.8039008Z $L__tmp23: 2026-02-21T09:10:45.8039180Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8039245Z mov.b64 {%r660, %r661}, %rd109; 2026-02-21T09:10:45.8039314Z cvt.rn.bf16x2.f32 %r662, %r661, %r660; 2026-02-21T09:10:45.8039366Z $L__tmp24: 2026-02-21T09:10:45.8039589Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8039646Z cvt.u64.u32 %rd110, %r517; 2026-02-21T09:10:45.8039705Z cvt.u64.u32 %rd111, %r518; 2026-02-21T09:10:45.8039769Z shl.b64 %rd112, %rd111, 32; 2026-02-21T09:10:45.8039827Z or.b64 %rd113, %rd110, %rd112; 2026-02-21T09:10:45.8039881Z $L__tmp25: 2026-02-21T09:10:45.8040050Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8040138Z mov.b64 {%r663, %r664}, %rd113; 2026-02-21T09:10:45.8040209Z cvt.rn.bf16x2.f32 %r665, %r664, %r663; 2026-02-21T09:10:45.8040260Z $L__tmp26: 2026-02-21T09:10:45.8040479Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8040536Z cvt.u64.u32 %rd114, %r519; 2026-02-21T09:10:45.8040594Z cvt.u64.u32 %rd115, %r520; 2026-02-21T09:10:45.8040680Z shl.b64 %rd116, %rd115, 32; 2026-02-21T09:10:45.8040743Z 
or.b64 %rd117, %rd114, %rd116; 2026-02-21T09:10:45.8040793Z $L__tmp27: 2026-02-21T09:10:45.8040970Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8041033Z mov.b64 {%r666, %r667}, %rd117; 2026-02-21T09:10:45.8041097Z cvt.rn.bf16x2.f32 %r668, %r667, %r666; 2026-02-21T09:10:45.8041148Z $L__tmp28: 2026-02-21T09:10:45.8041367Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8041458Z cvt.u64.u32 %rd118, %r522; 2026-02-21T09:10:45.8041521Z cvt.u64.u32 %rd119, %r523; 2026-02-21T09:10:45.8041622Z shl.b64 %rd120, %rd119, 32; 2026-02-21T09:10:45.8041680Z or.b64 %rd121, %rd118, %rd120; 2026-02-21T09:10:45.8041731Z $L__tmp29: 2026-02-21T09:10:45.8041935Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8042000Z mov.b64 {%r669, %r670}, %rd121; 2026-02-21T09:10:45.8042070Z cvt.rn.bf16x2.f32 %r671, %r670, %r669; 2026-02-21T09:10:45.8042121Z $L__tmp30: 2026-02-21T09:10:45.8042340Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8042399Z cvt.u64.u32 %rd122, %r524; 2026-02-21T09:10:45.8042461Z cvt.u64.u32 %rd123, %r525; 2026-02-21T09:10:45.8042519Z shl.b64 %rd124, %rd123, 32; 2026-02-21T09:10:45.8042584Z or.b64 %rd125, %rd122, %rd124; 2026-02-21T09:10:45.8042634Z $L__tmp31: 2026-02-21T09:10:45.8042815Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8042879Z mov.b64 {%r672, %r673}, %rd125; 2026-02-21T09:10:45.8042946Z cvt.rn.bf16x2.f32 %r674, %r673, %r672; 2026-02-21T09:10:45.8043005Z $L__tmp32: 2026-02-21T09:10:45.8043240Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8043308Z cvt.u64.u32 %rd126, %r526; 2026-02-21T09:10:45.8043373Z cvt.u64.u32 %rd127, %r527; 2026-02-21T09:10:45.8043439Z shl.b64 %rd128, %rd127, 
32; 2026-02-21T09:10:45.8043516Z or.b64 %rd129, %rd126, %rd128; 2026-02-21T09:10:45.8043570Z $L__tmp33: 2026-02-21T09:10:45.8043744Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8043815Z mov.b64 {%r675, %r676}, %rd129; 2026-02-21T09:10:45.8043886Z cvt.rn.bf16x2.f32 %r677, %r676, %r675; 2026-02-21T09:10:45.8043941Z $L__tmp34: 2026-02-21T09:10:45.8044168Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8044231Z cvt.u64.u32 %rd130, %r528; 2026-02-21T09:10:45.8044294Z cvt.u64.u32 %rd131, %r529; 2026-02-21T09:10:45.8044356Z shl.b64 %rd132, %rd131, 32; 2026-02-21T09:10:45.8044430Z or.b64 %rd133, %rd130, %rd132; 2026-02-21T09:10:45.8044486Z $L__tmp35: 2026-02-21T09:10:45.8044712Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8044781Z mov.b64 {%r678, %r679}, %rd133; 2026-02-21T09:10:45.8044848Z cvt.rn.bf16x2.f32 %r680, %r679, %r678; 2026-02-21T09:10:45.8044901Z $L__tmp36: 2026-02-21T09:10:45.8045113Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8045204Z cvt.u64.u32 %rd134, %r530; 2026-02-21T09:10:45.8045263Z cvt.u64.u32 %rd135, %r531; 2026-02-21T09:10:45.8045322Z shl.b64 %rd136, %rd135, 32; 2026-02-21T09:10:45.8045386Z or.b64 %rd137, %rd134, %rd136; 2026-02-21T09:10:45.8045439Z $L__tmp37: 2026-02-21T09:10:45.8045606Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8045672Z mov.b64 {%r681, %r682}, %rd137; 2026-02-21T09:10:45.8045737Z cvt.rn.bf16x2.f32 %r683, %r682, %r681; 2026-02-21T09:10:45.8045815Z $L__tmp38: 2026-02-21T09:10:45.8046018Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8046085Z cvt.u64.u32 %rd138, %r532; 2026-02-21T09:10:45.8046142Z cvt.u64.u32 %rd139, %r533; 
2026-02-21T09:10:45.8046201Z shl.b64 %rd140, %rd139, 32; 2026-02-21T09:10:45.8046268Z or.b64 %rd141, %rd138, %rd140; 2026-02-21T09:10:45.8046322Z $L__tmp39: 2026-02-21T09:10:45.8046488Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8046578Z mov.b64 {%r684, %r685}, %rd141; 2026-02-21T09:10:45.8046645Z cvt.rn.bf16x2.f32 %r686, %r685, %r684; 2026-02-21T09:10:45.8046696Z $L__tmp40: 2026-02-21T09:10:45.8046908Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8046994Z cvt.u64.u32 %rd142, %r534; 2026-02-21T09:10:45.8047053Z cvt.u64.u32 %rd143, %r535; 2026-02-21T09:10:45.8047111Z shl.b64 %rd144, %rd143, 32; 2026-02-21T09:10:45.8047173Z or.b64 %rd145, %rd142, %rd144; 2026-02-21T09:10:45.8047224Z $L__tmp41: 2026-02-21T09:10:45.8047389Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8047454Z mov.b64 {%r687, %r688}, %rd145; 2026-02-21T09:10:45.8047517Z cvt.rn.bf16x2.f32 %r689, %r688, %r687; 2026-02-21T09:10:45.8047567Z $L__tmp42: 2026-02-21T09:10:45.8047786Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8047852Z cvt.u64.u32 %rd146, %r536; 2026-02-21T09:10:45.8047909Z cvt.u64.u32 %rd147, %r537; 2026-02-21T09:10:45.8047966Z shl.b64 %rd148, %rd147, 32; 2026-02-21T09:10:45.8048030Z or.b64 %rd149, %rd146, %rd148; 2026-02-21T09:10:45.8048082Z $L__tmp43: 2026-02-21T09:10:45.8048252Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8048319Z mov.b64 {%r690, %r691}, %rd149; 2026-02-21T09:10:45.8048383Z cvt.rn.bf16x2.f32 %r692, %r691, %r690; 2026-02-21T09:10:45.8048434Z $L__tmp44: 2026-02-21T09:10:45.8048646Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8048711Z cvt.u64.u32 %rd150, %r539; 
2026-02-21T09:10:45.8048768Z cvt.u64.u32 %rd151, %r540; 2026-02-21T09:10:45.8048827Z shl.b64 %rd152, %rd151, 32; 2026-02-21T09:10:45.8048891Z or.b64 %rd153, %rd150, %rd152; 2026-02-21T09:10:45.8048943Z $L__tmp45: 2026-02-21T09:10:45.8049111Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8049170Z mov.b64 {%r693, %r694}, %rd153; 2026-02-21T09:10:45.8049242Z cvt.rn.bf16x2.f32 %r695, %r694, %r693; 2026-02-21T09:10:45.8049293Z $L__tmp46: 2026-02-21T09:10:45.8049508Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8049576Z cvt.u64.u32 %rd154, %r541; 2026-02-21T09:10:45.8049633Z cvt.u64.u32 %rd155, %r542; 2026-02-21T09:10:45.8049691Z shl.b64 %rd156, %rd155, 32; 2026-02-21T09:10:45.8049756Z or.b64 %rd157, %rd154, %rd156; 2026-02-21T09:10:45.8049807Z $L__tmp47: 2026-02-21T09:10:45.8049973Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8050070Z mov.b64 {%r696, %r697}, %rd157; 2026-02-21T09:10:45.8050143Z cvt.rn.bf16x2.f32 %r698, %r697, %r696; 2026-02-21T09:10:45.8050195Z $L__tmp48: 2026-02-21T09:10:45.8050407Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8050473Z cvt.u64.u32 %rd158, %r543; 2026-02-21T09:10:45.8050531Z cvt.u64.u32 %rd159, %r544; 2026-02-21T09:10:45.8050590Z shl.b64 %rd160, %rd159, 32; 2026-02-21T09:10:45.8050678Z or.b64 %rd161, %rd158, %rd160; 2026-02-21T09:10:45.8050730Z $L__tmp49: 2026-02-21T09:10:45.8050899Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8050958Z mov.b64 {%r699, %r700}, %rd161; 2026-02-21T09:10:45.8051031Z cvt.rn.bf16x2.f32 %r701, %r700, %r699; 2026-02-21T09:10:45.8051084Z $L__tmp50: 2026-02-21T09:10:45.8051295Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 
2026-02-21T09:10:45.8051366Z cvt.u64.u32 %rd162, %r545; 2026-02-21T09:10:45.8051446Z cvt.u64.u32 %rd163, %r546; 2026-02-21T09:10:45.8051508Z shl.b64 %rd164, %rd163, 32; 2026-02-21T09:10:45.8051623Z or.b64 %rd165, %rd162, %rd164; 2026-02-21T09:10:45.8051678Z $L__tmp51: 2026-02-21T09:10:45.8051843Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8051932Z mov.b64 {%r702, %r703}, %rd165; 2026-02-21T09:10:45.8052009Z cvt.rn.bf16x2.f32 %r704, %r703, %r702; 2026-02-21T09:10:45.8052061Z $L__tmp52: 2026-02-21T09:10:45.8052273Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8052339Z cvt.u64.u32 %rd166, %r547; 2026-02-21T09:10:45.8052398Z cvt.u64.u32 %rd167, %r548; 2026-02-21T09:10:45.8052457Z shl.b64 %rd168, %rd167, 32; 2026-02-21T09:10:45.8052515Z or.b64 %rd169, %rd166, %rd168; 2026-02-21T09:10:45.8052575Z $L__tmp53: 2026-02-21T09:10:45.8052748Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8052807Z mov.b64 {%r705, %r706}, %rd169; 2026-02-21T09:10:45.8052880Z cvt.rn.bf16x2.f32 %r707, %r706, %r705; 2026-02-21T09:10:45.8052933Z $L__tmp54: 2026-02-21T09:10:45.8053149Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8053217Z cvt.u64.u32 %rd170, %r549; 2026-02-21T09:10:45.8053276Z cvt.u64.u32 %rd171, %r550; 2026-02-21T09:10:45.8053337Z shl.b64 %rd172, %rd171, 32; 2026-02-21T09:10:45.8053397Z or.b64 %rd173, %rd170, %rd172; 2026-02-21T09:10:45.8053460Z $L__tmp55: 2026-02-21T09:10:45.8053628Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8053688Z mov.b64 {%r708, %r709}, %rd173; 2026-02-21T09:10:45.8053762Z cvt.rn.bf16x2.f32 %r710, %r709, %r708; 2026-02-21T09:10:45.8053815Z $L__tmp56: 2026-02-21T09:10:45.8054029Z .loc 2 291 36 // standard.py:291:36 @[ 
cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8054095Z cvt.u64.u32 %rd174, %r551; 2026-02-21T09:10:45.8054154Z cvt.u64.u32 %rd175, %r552; 2026-02-21T09:10:45.8054213Z shl.b64 %rd176, %rd175, 32; 2026-02-21T09:10:45.8054272Z or.b64 %rd177, %rd174, %rd176; 2026-02-21T09:10:45.8054332Z $L__tmp57: 2026-02-21T09:10:45.8054505Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8054565Z mov.b64 {%r711, %r712}, %rd177; 2026-02-21T09:10:45.8054637Z cvt.rn.bf16x2.f32 %r713, %r712, %r711; 2026-02-21T09:10:45.8054690Z $L__tmp58: 2026-02-21T09:10:45.8054907Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8054974Z cvt.u64.u32 %rd178, %r553; 2026-02-21T09:10:45.8055059Z cvt.u64.u32 %rd179, %r554; 2026-02-21T09:10:45.8055117Z shl.b64 %rd180, %rd179, 32; 2026-02-21T09:10:45.8055177Z or.b64 %rd181, %rd178, %rd180; 2026-02-21T09:10:45.8055238Z $L__tmp59: 2026-02-21T09:10:45.8055404Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8055463Z mov.b64 {%r714, %r715}, %rd181; 2026-02-21T09:10:45.8055536Z cvt.rn.bf16x2.f32 %r716, %r715, %r714; 2026-02-21T09:10:45.8055588Z $L__tmp60: 2026-02-21T09:10:45.8055828Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8055887Z cvt.u64.u32 %rd182, %r556; 2026-02-21T09:10:45.8055952Z cvt.u64.u32 %rd183, %r557; 2026-02-21T09:10:45.8056010Z shl.b64 %rd184, %rd183, 32; 2026-02-21T09:10:45.8056068Z or.b64 %rd185, %rd182, %rd184; 2026-02-21T09:10:45.8056127Z $L__tmp61: 2026-02-21T09:10:45.8056294Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8056354Z mov.b64 {%r717, %r718}, %rd185; 2026-02-21T09:10:45.8056451Z cvt.rn.bf16x2.f32 %r719, %r718, %r717; 2026-02-21T09:10:45.8056504Z $L__tmp62: 2026-02-21T09:10:45.8056709Z .loc 2 
291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8056768Z cvt.u64.u32 %rd186, %r558; 2026-02-21T09:10:45.8056852Z cvt.u64.u32 %rd187, %r559; 2026-02-21T09:10:45.8056914Z shl.b64 %rd188, %rd187, 32; 2026-02-21T09:10:45.8056973Z or.b64 %rd189, %rd186, %rd188; 2026-02-21T09:10:45.8057030Z $L__tmp63: 2026-02-21T09:10:45.8057199Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8057258Z mov.b64 {%r720, %r721}, %rd189; 2026-02-21T09:10:45.8057329Z cvt.rn.bf16x2.f32 %r722, %r721, %r720; 2026-02-21T09:10:45.8057381Z $L__tmp64: 2026-02-21T09:10:45.8057592Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8057652Z cvt.u64.u32 %rd190, %r560; 2026-02-21T09:10:45.8057719Z cvt.u64.u32 %rd191, %r561; 2026-02-21T09:10:45.8057778Z shl.b64 %rd192, %rd191, 32; 2026-02-21T09:10:45.8057836Z or.b64 %rd193, %rd190, %rd192; 2026-02-21T09:10:45.8057893Z $L__tmp65: 2026-02-21T09:10:45.8058063Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8058123Z mov.b64 {%r723, %r724}, %rd193; 2026-02-21T09:10:45.8058188Z cvt.rn.bf16x2.f32 %r725, %r724, %r723; 2026-02-21T09:10:45.8058247Z $L__tmp66: 2026-02-21T09:10:45.8058456Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8058515Z cvt.u64.u32 %rd194, %r562; 2026-02-21T09:10:45.8058579Z cvt.u64.u32 %rd195, %r563; 2026-02-21T09:10:45.8058638Z shl.b64 %rd196, %rd195, 32; 2026-02-21T09:10:45.8058698Z or.b64 %rd197, %rd194, %rd196; 2026-02-21T09:10:45.8058756Z $L__tmp67: 2026-02-21T09:10:45.8058924Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8058983Z mov.b64 {%r726, %r727}, %rd197; 2026-02-21T09:10:45.8059048Z cvt.rn.bf16x2.f32 %r728, %r727, %r726; 2026-02-21T09:10:45.8059109Z $L__tmp68: 
2026-02-21T09:10:45.8059324Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8059386Z cvt.u64.u32 %rd198, %r564; 2026-02-21T09:10:45.8059452Z cvt.u64.u32 %rd199, %r565; 2026-02-21T09:10:45.8059511Z shl.b64 %rd200, %rd199, 32; 2026-02-21T09:10:45.8059571Z or.b64 %rd201, %rd198, %rd200; 2026-02-21T09:10:45.8059629Z $L__tmp69: 2026-02-21T09:10:45.8059800Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8059859Z mov.b64 {%r729, %r730}, %rd201; 2026-02-21T09:10:45.8059954Z cvt.rn.bf16x2.f32 %r731, %r730, %r729; 2026-02-21T09:10:45.8060016Z $L__tmp70: 2026-02-21T09:10:45.8060226Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8060285Z cvt.u64.u32 %rd202, %r566; 2026-02-21T09:10:45.8060362Z cvt.u64.u32 %rd203, %r567; 2026-02-21T09:10:45.8060422Z shl.b64 %rd204, %rd203, 32; 2026-02-21T09:10:45.8060481Z or.b64 %rd205, %rd202, %rd204; 2026-02-21T09:10:45.8060564Z $L__tmp71: 2026-02-21T09:10:45.8060730Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8060789Z mov.b64 {%r732, %r733}, %rd205; 2026-02-21T09:10:45.8060854Z cvt.rn.bf16x2.f32 %r734, %r733, %r732; 2026-02-21T09:10:45.8060914Z $L__tmp72: 2026-02-21T09:10:45.8061123Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8061181Z cvt.u64.u32 %rd206, %r568; 2026-02-21T09:10:45.8061250Z cvt.u64.u32 %rd207, %r569; 2026-02-21T09:10:45.8061329Z shl.b64 %rd208, %rd207, 32; 2026-02-21T09:10:45.8061391Z or.b64 %rd209, %rd206, %rd208; 2026-02-21T09:10:45.8061445Z $L__tmp73: 2026-02-21T09:10:45.8061653Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8061714Z mov.b64 {%r735, %r736}, %rd209; 2026-02-21T09:10:45.8061815Z cvt.rn.bf16x2.f32 %r737, %r736, %r735; 
2026-02-21T09:10:45.8061879Z $L__tmp74: 2026-02-21T09:10:45.8062087Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8062147Z cvt.u64.u32 %rd210, %r570; 2026-02-21T09:10:45.8062213Z cvt.u64.u32 %rd211, %r571; 2026-02-21T09:10:45.8062272Z shl.b64 %rd212, %rd211, 32; 2026-02-21T09:10:45.8062332Z or.b64 %rd213, %rd210, %rd212; 2026-02-21T09:10:45.8062385Z $L__tmp75: 2026-02-21T09:10:45.8062560Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8062622Z mov.b64 {%r738, %r739}, %rd213; 2026-02-21T09:10:45.8062687Z cvt.rn.bf16x2.f32 %r740, %r739, %r738; 2026-02-21T09:10:45.8062747Z $L__tmp76: 2026-02-21T09:10:45.8062960Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8063022Z cvt.u64.u32 %rd214, %r573; 2026-02-21T09:10:45.8063092Z cvt.u64.u32 %rd215, %r574; 2026-02-21T09:10:45.8063153Z shl.b64 %rd216, %rd215, 32; 2026-02-21T09:10:45.8063214Z or.b64 %rd217, %rd214, %rd216; 2026-02-21T09:10:45.8063265Z $L__tmp77: 2026-02-21T09:10:45.8063440Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8063500Z mov.b64 {%r741, %r742}, %rd217; 2026-02-21T09:10:45.8063565Z cvt.rn.bf16x2.f32 %r743, %r742, %r741; 2026-02-21T09:10:45.8063625Z $L__tmp78: 2026-02-21T09:10:45.8063833Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8063893Z cvt.u64.u32 %rd218, %r575; 2026-02-21T09:10:45.8063960Z cvt.u64.u32 %rd219, %r576; 2026-02-21T09:10:45.8064018Z shl.b64 %rd220, %rd219, 32; 2026-02-21T09:10:45.8064076Z or.b64 %rd221, %rd218, %rd220; 2026-02-21T09:10:45.8064127Z $L__tmp79: 2026-02-21T09:10:45.8064303Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8064364Z mov.b64 {%r744, %r745}, %rd221; 2026-02-21T09:10:45.8064429Z 
cvt.rn.bf16x2.f32 %r746, %r745, %r744; 2026-02-21T09:10:45.8064486Z $L__tmp80: 2026-02-21T09:10:45.8064690Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8064748Z cvt.u64.u32 %rd222, %r577; 2026-02-21T09:10:45.8064812Z cvt.u64.u32 %rd223, %r578; 2026-02-21T09:10:45.8064871Z shl.b64 %rd224, %rd223, 32; 2026-02-21T09:10:45.8064956Z or.b64 %rd225, %rd222, %rd224; 2026-02-21T09:10:45.8065008Z $L__tmp81: 2026-02-21T09:10:45.8065185Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8065242Z mov.b64 {%r747, %r748}, %rd225; 2026-02-21T09:10:45.8065306Z cvt.rn.bf16x2.f32 %r749, %r748, %r747; 2026-02-21T09:10:45.8065364Z $L__tmp82: 2026-02-21T09:10:45.8065569Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8065657Z cvt.u64.u32 %rd226, %r579; 2026-02-21T09:10:45.8065714Z cvt.u64.u32 %rd227, %r580; 2026-02-21T09:10:45.8065778Z shl.b64 %rd228, %rd227, 32; 2026-02-21T09:10:45.8065836Z or.b64 %rd229, %rd226, %rd228; 2026-02-21T09:10:45.8065888Z $L__tmp83: 2026-02-21T09:10:45.8066069Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8066129Z mov.b64 {%r750, %r751}, %rd229; 2026-02-21T09:10:45.8066197Z cvt.rn.bf16x2.f32 %r752, %r751, %r750; 2026-02-21T09:10:45.8066279Z $L__tmp84: 2026-02-21T09:10:45.8066491Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8066549Z cvt.u64.u32 %rd230, %r581; 2026-02-21T09:10:45.8066608Z cvt.u64.u32 %rd231, %r582; 2026-02-21T09:10:45.8066673Z shl.b64 %rd232, %rd231, 32; 2026-02-21T09:10:45.8066766Z or.b64 %rd233, %rd230, %rd232; 2026-02-21T09:10:45.8066818Z $L__tmp85: 2026-02-21T09:10:45.8066991Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8067051Z mov.b64 {%r753, %r754}, 
%rd233; 2026-02-21T09:10:45.8067116Z cvt.rn.bf16x2.f32 %r755, %r754, %r753; 2026-02-21T09:10:45.8067174Z $L__tmp86: 2026-02-21T09:10:45.8067385Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8067445Z cvt.u64.u32 %rd234, %r583; 2026-02-21T09:10:45.8067504Z cvt.u64.u32 %rd235, %r584; 2026-02-21T09:10:45.8067571Z shl.b64 %rd236, %rd235, 32; 2026-02-21T09:10:45.8067628Z or.b64 %rd237, %rd234, %rd236; 2026-02-21T09:10:45.8067679Z $L__tmp87: 2026-02-21T09:10:45.8067856Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8067916Z mov.b64 {%r756, %r757}, %rd237; 2026-02-21T09:10:45.8067983Z cvt.rn.bf16x2.f32 %r758, %r757, %r756; 2026-02-21T09:10:45.8068034Z $L__tmp88: 2026-02-21T09:10:45.8068251Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8068308Z cvt.u64.u32 %rd238, %r585; 2026-02-21T09:10:45.8068366Z cvt.u64.u32 %rd239, %r586; 2026-02-21T09:10:45.8068433Z shl.b64 %rd240, %rd239, 32; 2026-02-21T09:10:45.8068491Z or.b64 %rd241, %rd238, %rd240; 2026-02-21T09:10:45.8068543Z $L__tmp89: 2026-02-21T09:10:45.8068721Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8068783Z mov.b64 {%r759, %r760}, %rd241; 2026-02-21T09:10:45.8068849Z cvt.rn.bf16x2.f32 %r761, %r760, %r759; 2026-02-21T09:10:45.8068902Z $L__tmp90: 2026-02-21T09:10:45.8069123Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8069183Z cvt.u64.u32 %rd242, %r587; 2026-02-21T09:10:45.8069243Z cvt.u64.u32 %rd243, %r588; 2026-02-21T09:10:45.8069309Z shl.b64 %rd244, %rd243, 32; 2026-02-21T09:10:45.8069366Z or.b64 %rd245, %rd242, %rd244; 2026-02-21T09:10:45.8069417Z $L__tmp91: 2026-02-21T09:10:45.8069590Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 
2026-02-21T09:10:45.8069649Z mov.b64 {%r762, %r763}, %rd245; 2026-02-21T09:10:45.8069714Z cvt.rn.bf16x2.f32 %r764, %r763, %r762; 2026-02-21T09:10:45.8069789Z $L__tmp92: 2026-02-21T09:10:45.8070003Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8070062Z cvt.u64.u32 %rd246, %r590; 2026-02-21T09:10:45.8070120Z cvt.u64.u32 %rd247, %r591; 2026-02-21T09:10:45.8070188Z shl.b64 %rd248, %rd247, 32; 2026-02-21T09:10:45.8070247Z or.b64 %rd249, %rd246, %rd248; 2026-02-21T09:10:45.8070299Z $L__tmp93: 2026-02-21T09:10:45.8070472Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8070555Z mov.b64 {%r765, %r766}, %rd249; 2026-02-21T09:10:45.8070621Z cvt.rn.bf16x2.f32 %r767, %r766, %r765; 2026-02-21T09:10:45.8070673Z $L__tmp94: 2026-02-21T09:10:45.8070892Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8070950Z cvt.u64.u32 %rd250, %r592; 2026-02-21T09:10:45.8071009Z cvt.u64.u32 %rd251, %r593; 2026-02-21T09:10:45.8071077Z shl.b64 %rd252, %rd251, 32; 2026-02-21T09:10:45.8071135Z or.b64 %rd253, %rd250, %rd252; 2026-02-21T09:10:45.8071206Z $L__tmp95: 2026-02-21T09:10:45.8071377Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8071442Z mov.b64 {%r768, %r769}, %rd253; 2026-02-21T09:10:45.8071507Z cvt.rn.bf16x2.f32 %r770, %r769, %r768; 2026-02-21T09:10:45.8071594Z $L__tmp96: 2026-02-21T09:10:45.8071839Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8071897Z cvt.u64.u32 %rd254, %r594; 2026-02-21T09:10:45.8071956Z cvt.u64.u32 %rd255, %r595; 2026-02-21T09:10:45.8072022Z shl.b64 %rd256, %rd255, 32; 2026-02-21T09:10:45.8072080Z or.b64 %rd257, %rd254, %rd256; 2026-02-21T09:10:45.8072132Z $L__tmp97: 2026-02-21T09:10:45.8072298Z .loc 1 97 28 // 
cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8072366Z mov.b64 {%r771, %r772}, %rd257; 2026-02-21T09:10:45.8072433Z cvt.rn.bf16x2.f32 %r773, %r772, %r771; 2026-02-21T09:10:45.8072484Z $L__tmp98: 2026-02-21T09:10:45.8072699Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8072756Z cvt.u64.u32 %rd258, %r596; 2026-02-21T09:10:45.8072813Z cvt.u64.u32 %rd259, %r597; 2026-02-21T09:10:45.8072879Z shl.b64 %rd260, %rd259, 32; 2026-02-21T09:10:45.8072939Z or.b64 %rd261, %rd258, %rd260; 2026-02-21T09:10:45.8072991Z $L__tmp99: 2026-02-21T09:10:45.8073160Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8073226Z mov.b64 {%r774, %r775}, %rd261; 2026-02-21T09:10:45.8073290Z cvt.rn.bf16x2.f32 %r776, %r775, %r774; 2026-02-21T09:10:45.8073343Z $L__tmp100: 2026-02-21T09:10:45.8073557Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8073617Z cvt.u64.u32 %rd262, %r598; 2026-02-21T09:10:45.8073677Z cvt.u64.u32 %rd263, %r599; 2026-02-21T09:10:45.8073744Z shl.b64 %rd264, %rd263, 32; 2026-02-21T09:10:45.8073802Z or.b64 %rd265, %rd262, %rd264; 2026-02-21T09:10:45.8073857Z $L__tmp101: 2026-02-21T09:10:45.8074024Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8074091Z mov.b64 {%r777, %r778}, %rd265; 2026-02-21T09:10:45.8074157Z cvt.rn.bf16x2.f32 %r779, %r778, %r777; 2026-02-21T09:10:45.8074209Z $L__tmp102: 2026-02-21T09:10:45.8074423Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8074481Z cvt.u64.u32 %rd266, %r600; 2026-02-21T09:10:45.8074538Z cvt.u64.u32 %rd267, %r601; 2026-02-21T09:10:45.8074596Z shl.b64 %rd268, %rd267, 32; 2026-02-21T09:10:45.8074662Z or.b64 %rd269, %rd266, %rd268; 2026-02-21T09:10:45.8074740Z $L__tmp103: 
2026-02-21T09:10:45.8074905Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8074973Z mov.b64 {%r780, %r781}, %rd269; 2026-02-21T09:10:45.8075037Z cvt.rn.bf16x2.f32 %r782, %r781, %r780; 2026-02-21T09:10:45.8075088Z $L__tmp104: 2026-02-21T09:10:45.8075302Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8075386Z cvt.u64.u32 %rd270, %r602; 2026-02-21T09:10:45.8075444Z cvt.u64.u32 %rd271, %r603; 2026-02-21T09:10:45.8075502Z shl.b64 %rd272, %rd271, 32; 2026-02-21T09:10:45.8075569Z or.b64 %rd273, %rd270, %rd272; 2026-02-21T09:10:45.8075621Z $L__tmp105: 2026-02-21T09:10:45.8075786Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8075854Z mov.b64 {%r783, %r784}, %rd273; 2026-02-21T09:10:45.8075918Z cvt.rn.bf16x2.f32 %r785, %r784, %r783; 2026-02-21T09:10:45.8075970Z $L__tmp106: 2026-02-21T09:10:45.8076206Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8076266Z cvt.u64.u32 %rd274, %r604; 2026-02-21T09:10:45.8076324Z cvt.u64.u32 %rd275, %r605; 2026-02-21T09:10:45.8076383Z shl.b64 %rd276, %rd275, 32; 2026-02-21T09:10:45.8076449Z or.b64 %rd277, %rd274, %rd276; 2026-02-21T09:10:45.8076521Z $L__tmp107: 2026-02-21T09:10:45.8076689Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8076755Z mov.b64 {%r786, %r787}, %rd277; 2026-02-21T09:10:45.8076819Z cvt.rn.bf16x2.f32 %r788, %r787, %r786; 2026-02-21T09:10:45.8076870Z $L__tmp108: 2026-02-21T09:10:45.8077089Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8077147Z cvt.u64.u32 %rd278, %r607; 2026-02-21T09:10:45.8077206Z cvt.u64.u32 %rd279, %r608; 2026-02-21T09:10:45.8077265Z shl.b64 %rd280, %rd279, 32; 2026-02-21T09:10:45.8077334Z or.b64 %rd281, %rd278, %rd280; 
2026-02-21T09:10:45.8077387Z $L__tmp109: 2026-02-21T09:10:45.8077552Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8077622Z mov.b64 {%r789, %r790}, %rd281; 2026-02-21T09:10:45.8077692Z cvt.rn.bf16x2.f32 %r791, %r790, %r789; 2026-02-21T09:10:45.8077747Z $L__tmp110: 2026-02-21T09:10:45.8077957Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8078023Z cvt.u64.u32 %rd282, %r609; 2026-02-21T09:10:45.8078081Z cvt.u64.u32 %rd283, %r610; 2026-02-21T09:10:45.8078140Z shl.b64 %rd284, %rd283, 32; 2026-02-21T09:10:45.8078204Z or.b64 %rd285, %rd282, %rd284; 2026-02-21T09:10:45.8078255Z $L__tmp111: 2026-02-21T09:10:45.8078418Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8078485Z mov.b64 {%r792, %r793}, %rd285; 2026-02-21T09:10:45.8078551Z cvt.rn.bf16x2.f32 %r794, %r793, %r792; 2026-02-21T09:10:45.8078604Z $L__tmp112: 2026-02-21T09:10:45.8078815Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8078880Z cvt.u64.u32 %rd286, %r611; 2026-02-21T09:10:45.8078938Z cvt.u64.u32 %rd287, %r612; 2026-02-21T09:10:45.8078998Z shl.b64 %rd288, %rd287, 32; 2026-02-21T09:10:45.8079063Z or.b64 %rd289, %rd286, %rd288; 2026-02-21T09:10:45.8079116Z $L__tmp113: 2026-02-21T09:10:45.8079284Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8079349Z mov.b64 {%r795, %r796}, %rd289; 2026-02-21T09:10:45.8079415Z cvt.rn.bf16x2.f32 %r797, %r796, %r795; 2026-02-21T09:10:45.8079467Z $L__tmp114: 2026-02-21T09:10:45.8079678Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8079769Z cvt.u64.u32 %rd290, %r613; 2026-02-21T09:10:45.8079827Z cvt.u64.u32 %rd291, %r614; 2026-02-21T09:10:45.8079885Z shl.b64 %rd292, %rd291, 32; 
2026-02-21T09:10:45.8079951Z or.b64 %rd293, %rd290, %rd292; 2026-02-21T09:10:45.8080004Z $L__tmp115: 2026-02-21T09:10:45.8080182Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8080269Z mov.b64 {%r798, %r799}, %rd293; 2026-02-21T09:10:45.8080346Z cvt.rn.bf16x2.f32 %r800, %r799, %r798; 2026-02-21T09:10:45.8080401Z $L__tmp116: 2026-02-21T09:10:45.8080620Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8080689Z cvt.u64.u32 %rd294, %r615; 2026-02-21T09:10:45.8080748Z cvt.u64.u32 %rd295, %r616; 2026-02-21T09:10:45.8080809Z shl.b64 %rd296, %rd295, 32; 2026-02-21T09:10:45.8080879Z or.b64 %rd297, %rd294, %rd296; 2026-02-21T09:10:45.8080932Z $L__tmp117: 2026-02-21T09:10:45.8081130Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8081195Z mov.b64 {%r801, %r802}, %rd297; 2026-02-21T09:10:45.8081269Z cvt.rn.bf16x2.f32 %r803, %r802, %r801; 2026-02-21T09:10:45.8081324Z $L__tmp118: 2026-02-21T09:10:45.8081603Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8081676Z cvt.u64.u32 %rd298, %r617; 2026-02-21T09:10:45.8081737Z cvt.u64.u32 %rd299, %r618; 2026-02-21T09:10:45.8081798Z shl.b64 %rd300, %rd299, 32; 2026-02-21T09:10:45.8081867Z or.b64 %rd301, %rd298, %rd300; 2026-02-21T09:10:45.8081921Z $L__tmp119: 2026-02-21T09:10:45.8082094Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8082156Z mov.b64 {%r804, %r805}, %rd301; 2026-02-21T09:10:45.8082234Z cvt.rn.bf16x2.f32 %r806, %r805, %r804; 2026-02-21T09:10:45.8082288Z $L__tmp120: 2026-02-21T09:10:45.8082510Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8082577Z cvt.u64.u32 %rd302, %r619; 2026-02-21T09:10:45.8082636Z cvt.u64.u32 %rd303, %r620; 
2026-02-21T09:10:45.8082696Z shl.b64 %rd304, %rd303, 32; 2026-02-21T09:10:45.8082765Z or.b64 %rd305, %rd302, %rd304; 2026-02-21T09:10:45.8082819Z $L__tmp121: 2026-02-21T09:10:45.8082993Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8083055Z mov.b64 {%r807, %r808}, %rd305; 2026-02-21T09:10:45.8083132Z cvt.rn.bf16x2.f32 %r809, %r808, %r807; 2026-02-21T09:10:45.8083187Z $L__tmp122: 2026-02-21T09:10:45.8083405Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8083477Z cvt.u64.u32 %rd306, %r621; 2026-02-21T09:10:45.8083537Z cvt.u64.u32 %rd307, %r622; 2026-02-21T09:10:45.8083600Z shl.b64 %rd308, %rd307, 32; 2026-02-21T09:10:45.8083661Z or.b64 %rd309, %rd306, %rd308; 2026-02-21T09:10:45.8083724Z $L__tmp123: 2026-02-21T09:10:45.8083898Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8083962Z mov.b64 {%r810, %r811}, %rd309; 2026-02-21T09:10:45.8084039Z cvt.rn.bf16x2.f32 %r812, %r811, %r810; 2026-02-21T09:10:45.8084094Z $L__tmp124: 2026-02-21T09:10:45.8084311Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8084378Z cvt.u64.u32 %rd310, %r624; 2026-02-21T09:10:45.8084437Z cvt.u64.u32 %rd311, %r625; 2026-02-21T09:10:45.8084498Z shl.b64 %rd312, %rd311, 32; 2026-02-21T09:10:45.8084559Z or.b64 %rd313, %rd310, %rd312; 2026-02-21T09:10:45.8084617Z $L__tmp125: 2026-02-21T09:10:45.8084823Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8084884Z mov.b64 {%r813, %r814}, %rd313; 2026-02-21T09:10:45.8084962Z cvt.rn.bf16x2.f32 %r815, %r814, %r813; 2026-02-21T09:10:45.8085016Z $L__tmp126: 2026-02-21T09:10:45.8085233Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8085303Z cvt.u64.u32 %rd314, %r626; 
2026-02-21T09:10:45.8085396Z cvt.u64.u32 %rd315, %r627; 2026-02-21T09:10:45.8085458Z shl.b64 %rd316, %rd315, 32; 2026-02-21T09:10:45.8085519Z or.b64 %rd317, %rd314, %rd316; 2026-02-21T09:10:45.8085578Z $L__tmp127: 2026-02-21T09:10:45.8085753Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8085814Z mov.b64 {%r816, %r817}, %rd317; 2026-02-21T09:10:45.8085888Z cvt.rn.bf16x2.f32 %r818, %r817, %r816; 2026-02-21T09:10:45.8085943Z $L__tmp128: 2026-02-21T09:10:45.8086210Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8086274Z cvt.u64.u32 %rd318, %r628; 2026-02-21T09:10:45.8086332Z cvt.u64.u32 %rd319, %r629; 2026-02-21T09:10:45.8086391Z shl.b64 %rd320, %rd319, 32; 2026-02-21T09:10:45.8086449Z or.b64 %rd321, %rd318, %rd320; 2026-02-21T09:10:45.8086504Z $L__tmp129: 2026-02-21T09:10:45.8086699Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8086761Z mov.b64 {%r819, %r820}, %rd321; 2026-02-21T09:10:45.8086832Z cvt.rn.bf16x2.f32 %r821, %r820, %r819; 2026-02-21T09:10:45.8086883Z $L__tmp130: 2026-02-21T09:10:45.8087099Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8087162Z cvt.u64.u32 %rd322, %r630; 2026-02-21T09:10:45.8087220Z cvt.u64.u32 %rd323, %r631; 2026-02-21T09:10:45.8087279Z shl.b64 %rd324, %rd323, 32; 2026-02-21T09:10:45.8087336Z or.b64 %rd325, %rd322, %rd324; 2026-02-21T09:10:45.8087392Z $L__tmp131: 2026-02-21T09:10:45.8087564Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8087622Z mov.b64 {%r822, %r823}, %rd325; 2026-02-21T09:10:45.8087689Z cvt.rn.bf16x2.f32 %r824, %r823, %r822; 2026-02-21T09:10:45.8087743Z $L__tmp132: 2026-02-21T09:10:45.8087961Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 
2026-02-21T09:10:45.8088020Z cvt.u64.u32 %rd326, %r632; 2026-02-21T09:10:45.8088084Z cvt.u64.u32 %rd327, %r633; 2026-02-21T09:10:45.8088144Z shl.b64 %rd328, %rd327, 32; 2026-02-21T09:10:45.8088213Z or.b64 %rd329, %rd326, %rd328; 2026-02-21T09:10:45.8088267Z $L__tmp133: 2026-02-21T09:10:45.8088434Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8088491Z mov.b64 {%r825, %r826}, %rd329; 2026-02-21T09:10:45.8088559Z cvt.rn.bf16x2.f32 %r827, %r826, %r825; 2026-02-21T09:10:45.8088608Z $L__tmp134: 2026-02-21T09:10:45.8088818Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8088874Z cvt.u64.u32 %rd330, %r634; 2026-02-21T09:10:45.8088931Z cvt.u64.u32 %rd331, %r635; 2026-02-21T09:10:45.8088987Z shl.b64 %rd332, %rd331, 32; 2026-02-21T09:10:45.8089044Z or.b64 %rd333, %rd330, %rd332; 2026-02-21T09:10:45.8089099Z $L__tmp135: 2026-02-21T09:10:45.8089265Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8089320Z mov.b64 {%r828, %r829}, %rd333; 2026-02-21T09:10:45.8089387Z cvt.rn.bf16x2.f32 %r830, %r829, %r828; 2026-02-21T09:10:45.8089435Z $L__tmp136: 2026-02-21T09:10:45.8089642Z .loc 2 291 36 // standard.py:291:36 @[ cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8089722Z cvt.u64.u32 %rd334, %r636; 2026-02-21T09:10:45.8089785Z cvt.u64.u32 %rd335, %r637; 2026-02-21T09:10:45.8089841Z shl.b64 %rd336, %rd335, 32; 2026-02-21T09:10:45.8089895Z or.b64 %rd337, %rd334, %rd336; 2026-02-21T09:10:45.8089950Z $L__tmp137: 2026-02-21T09:10:45.8090113Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8090197Z mov.b64 {%r831, %r832}, %rd337; 2026-02-21T09:10:45.8090259Z cvt.rn.bf16x2.f32 %r833, %r832, %r831; 2026-02-21T09:10:45.8090313Z $L__tmp138: 2026-02-21T09:10:45.8090516Z .loc 2 291 36 // standard.py:291:36 @[ 
cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:94:40 ] 2026-02-21T09:10:45.8090569Z cvt.u64.u32 %rd338, %r638; 2026-02-21T09:10:45.8090628Z cvt.u64.u32 %rd339, %r639; 2026-02-21T09:10:45.8090683Z shl.b64 %rd340, %rd339, 32; 2026-02-21T09:10:45.8090740Z or.b64 %rd341, %rd338, %rd340; 2026-02-21T09:10:45.8090792Z $L__tmp139: 2026-02-21T09:10:45.8090978Z .loc 1 97 28 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:97:28 2026-02-21T09:10:45.8091034Z mov.b64 {%r834, %r835}, %rd341; 2026-02-21T09:10:45.8091097Z cvt.rn.bf16x2.f32 %r836, %r835, %r834; 2026-02-21T09:10:45.8091264Z .loc 1 98 43 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:98:43 2026-02-21T09:10:45.8091377Z st.shared.v4.b32 [%r18], {%r647, %r650, %r653, %r656}; 2026-02-21T09:10:45.8091478Z st.shared.v4.b32 [%r18+16384], {%r743, %r746, %r749, %r752}; 2026-02-21T09:10:45.8091601Z st.shared.v4.b32 [%r19], {%r659, %r662, %r665, %r668}; 2026-02-21T09:10:45.8091694Z st.shared.v4.b32 [%r19+16384], {%r755, %r758, %r761, %r764}; 2026-02-21T09:10:45.8091778Z st.shared.v4.b32 [%r20], {%r671, %r674, %r677, %r680}; 2026-02-21T09:10:45.8091872Z st.shared.v4.b32 [%r20+16384], {%r767, %r770, %r773, %r776}; 2026-02-21T09:10:45.8091958Z st.shared.v4.b32 [%r21], {%r683, %r686, %r689, %r692}; 2026-02-21T09:10:45.8092050Z st.shared.v4.b32 [%r21+16384], {%r779, %r782, %r785, %r788}; 2026-02-21T09:10:45.8092137Z st.shared.v4.b32 [%r22], {%r695, %r698, %r701, %r704}; 2026-02-21T09:10:45.8092227Z st.shared.v4.b32 [%r22+16384], {%r791, %r794, %r797, %r800}; 2026-02-21T09:10:45.8092310Z st.shared.v4.b32 [%r23], {%r707, %r710, %r713, %r716}; 2026-02-21T09:10:45.8092399Z st.shared.v4.b32 [%r23+16384], {%r803, %r806, %r809, %r812}; 2026-02-21T09:10:45.8092487Z st.shared.v4.b32 [%r24], {%r719, %r722, %r725, %r728}; 2026-02-21T09:10:45.8092575Z st.shared.v4.b32 [%r24+16384], {%r815, %r818, %r821, %r824}; 2026-02-21T09:10:45.8092656Z st.shared.v4.b32 [%r25], {%r731, %r734, %r737, %r740}; 
2026-02-21T09:10:45.8092748Z st.shared.v4.b32 [%r25+16384], {%r827, %r830, %r833, %r836}; 2026-02-21T09:10:45.8092806Z // begin inline asm 2026-02-21T09:10:45.8092880Z fence.proxy.async.shared::cta; 2026-02-21T09:10:45.8092938Z // end inline asm 2026-02-21T09:10:45.8092990Z bar.sync 0; 2026-02-21T09:10:45.8093055Z elect.sync %r837|%p64, -1; 2026-02-21T09:10:45.8093120Z and.pred %p62, %p63, %p64; 2026-02-21T09:10:45.8093180Z and.b32 %r838, %r30, 1; 2026-02-21T09:10:45.8093238Z shl.b32 %r839, %r838, 14; 2026-02-21T09:10:45.8093294Z add.s32 %r643, %r44, %r839; 2026-02-21T09:10:45.8093355Z shl.b32 %r840, %r838, 6; 2026-02-21T09:10:45.8093412Z or.b32 %r641, %r840, %r28; 2026-02-21T09:10:45.8093467Z // begin inline asm 2026-02-21T09:10:45.8093647Z @%p62 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd85, {%r641, %r642}], [%r643]; 2026-02-21T09:10:45.8093709Z // end inline asm 2026-02-21T09:10:45.8093771Z cp.async.bulk.commit_group; 2026-02-21T09:10:45.8093840Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:45.8093898Z bar.sync 0; 2026-02-21T09:10:45.8093979Z $L__BB0_8: // %._crit_edge 2026-02-21T09:10:45.8094148Z .loc 1 29 4 // cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py:29:4 2026-02-21T09:10:45.8094235Z bar.sync 0; 2026-02-21T09:10:45.8094289Z // begin inline asm 2026-02-21T09:10:45.8094402Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r841, 256; 2026-02-21T09:10:45.8094457Z // end inline asm 2026-02-21T09:10:45.8094518Z ret; 2026-02-21T09:10:45.8094572Z $L__tmp140: 2026-02-21T09:10:45.8094625Z $L__func_end0: 2026-02-21T09:10:45.8094711Z // -- End function 2026-02-21T09:10:45.8094765Z } 2026-02-21T09:10:45.8095004Z .file 1 "/tmp/torchinductor_root/ww/cwwowvgjsd2sexvzz6wykvwkimrsbqv6r5ipurvcmziomdjbwpy7.py" 2026-02-21T09:10:45.8095176Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:10:45.8095256Z .section .debug_abbrev 2026-02-21T09:10:45.8095309Z { 2026-02-21T09:10:45.8095396Z .b8 
1 // Abbreviation Code 2026-02-21T09:10:45.8095487Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:10:45.8095567Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:45.8095676Z .b8 37 // DW_AT_producer 2026-02-21T09:10:45.8095761Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.8095835Z .b8 19 // DW_AT_language 2026-02-21T09:10:45.8095910Z .b8 5 // DW_FORM_data2 2026-02-21T09:10:45.8096010Z .b8 3 // DW_AT_name 2026-02-21T09:10:45.8096092Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.8096169Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:10:45.8096242Z .b8 6 // DW_FORM_data4 2026-02-21T09:10:45.8096323Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:10:45.8096395Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.8096466Z .b8 0 // EOM(1) 2026-02-21T09:10:45.8096544Z .b8 0 // EOM(2) 2026-02-21T09:10:45.8096626Z .b8 2 // Abbreviation Code 2026-02-21T09:10:45.8096707Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:45.8096787Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:45.8096861Z .b8 3 // DW_AT_name 2026-02-21T09:10:45.8096934Z .b8 8 // DW_FORM_string 2026-02-21T09:10:45.8097009Z .b8 32 // DW_AT_inline 2026-02-21T09:10:45.8097091Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.8097158Z .b8 0 // EOM(1) 2026-02-21T09:10:45.8097223Z .b8 0 // EOM(2) 2026-02-21T09:10:45.8097309Z .b8 3 // Abbreviation Code 2026-02-21T09:10:45.8097387Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:45.8097463Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:45.8097546Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:45.8097619Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.8097695Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:45.8097765Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.8097857Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:45.8097929Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:45.8097996Z .b8 0 // EOM(1) 2026-02-21T09:10:45.8098069Z .b8 0 // EOM(2) 2026-02-21T09:10:45.8098146Z .b8 4 // Abbreviation Code 2026-02-21T09:10:45.8098236Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:10:45.8098317Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:45.8098421Z 
.b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:45.8098494Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:45.8098573Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:45.8098642Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.8098718Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:45.8098788Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:45.8098895Z .b8 88 // DW_AT_call_file 2026-02-21T09:10:45.8098969Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.8099043Z .b8 89 // DW_AT_call_line 2026-02-21T09:10:45.8099122Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.8099198Z .b8 87 // DW_AT_call_column 2026-02-21T09:10:45.8099270Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:45.8099344Z .b8 0 // EOM(1) 2026-02-21T09:10:45.8099431Z .b8 0 // EOM(2) 2026-02-21T09:10:45.8099499Z .b8 0 // EOM(3) 2026-02-21T09:10:45.8099550Z } 2026-02-21T09:10:45.8099616Z .section .debug_info 2026-02-21T09:10:45.8099667Z { 2026-02-21T09:10:45.8099750Z .b32 178 // Length of Unit 2026-02-21T09:10:45.8099863Z .b8 2 // DWARF version number 2026-02-21T09:10:45.8099919Z .b8 0 2026-02-21T09:10:45.8100035Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:10:45.8100123Z .b8 8 // Address Size (in bytes) 2026-02-21T09:10:45.8100228Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:10:45.8100309Z .b8 116 // DW_AT_producer 2026-02-21T09:10:45.8100362Z .b8 114 2026-02-21T09:10:45.8100422Z .b8 105 2026-02-21T09:10:45.8100474Z .b8 116 2026-02-21T09:10:45.8100524Z .b8 111 2026-02-21T09:10:45.8100582Z .b8 110 2026-02-21T09:10:45.8100633Z .b8 0 2026-02-21T09:10:45.8100707Z .b8 2 // DW_AT_language 2026-02-21T09:10:45.8100759Z .b8 0 2026-02-21T09:10:45.8100838Z .b8 99 // DW_AT_name 2026-02-21T09:10:45.8100889Z .b8 119 2026-02-21T09:10:45.8100941Z .b8 119 2026-02-21T09:10:45.8100998Z .b8 111 2026-02-21T09:10:45.8101050Z .b8 119 2026-02-21T09:10:45.8101101Z .b8 118 2026-02-21T09:10:45.8101153Z .b8 103 2026-02-21T09:10:45.8101210Z .b8 106 2026-02-21T09:10:45.8101259Z .b8 115 2026-02-21T09:10:45.8101310Z .b8 100 2026-02-21T09:10:45.8101361Z .b8 50 2026-02-21T09:10:45.8101419Z .b8 115 2026-02-21T09:10:45.8101470Z .b8 101 2026-02-21T09:10:45.8101521Z .b8 120 2026-02-21T09:10:45.8101623Z .b8 118 2026-02-21T09:10:45.8101676Z .b8 122 2026-02-21T09:10:45.8101727Z .b8 122 2026-02-21T09:10:45.8101778Z .b8 54 2026-02-21T09:10:45.8101840Z .b8 119 2026-02-21T09:10:45.8101891Z .b8 121 2026-02-21T09:10:45.8101943Z .b8 107 2026-02-21T09:10:45.8102004Z .b8 118 2026-02-21T09:10:45.8102058Z .b8 119 2026-02-21T09:10:45.8102110Z .b8 107 2026-02-21T09:10:45.8102163Z .b8 105 2026-02-21T09:10:45.8102224Z .b8 109 2026-02-21T09:10:45.8102275Z .b8 114 2026-02-21T09:10:45.8102325Z .b8 115 2026-02-21T09:10:45.8102376Z .b8 98 2026-02-21T09:10:45.8102434Z .b8 113 2026-02-21T09:10:45.8102483Z .b8 118 2026-02-21T09:10:45.8102535Z .b8 54 2026-02-21T09:10:45.8102593Z .b8 114 2026-02-21T09:10:45.8102645Z .b8 53 2026-02-21T09:10:45.8102695Z .b8 105 2026-02-21T09:10:45.8102745Z .b8 112 2026-02-21T09:10:45.8102809Z .b8 117 2026-02-21T09:10:45.8102862Z .b8 114 2026-02-21T09:10:45.8102915Z .b8 118 2026-02-21T09:10:45.8102978Z .b8 99 
2026-02-21T09:10:45.8103028Z .b8 109 2026-02-21T09:10:45.8103077Z .b8 122 2026-02-21T09:10:45.8103128Z .b8 105 2026-02-21T09:10:45.8103186Z .b8 111 2026-02-21T09:10:45.8103235Z .b8 109 2026-02-21T09:10:45.8103285Z .b8 100 2026-02-21T09:10:45.8103379Z .b8 106 2026-02-21T09:10:45.8103430Z .b8 98 2026-02-21T09:10:45.8103481Z .b8 119 2026-02-21T09:10:45.8103532Z .b8 112 2026-02-21T09:10:45.8103591Z .b8 121 2026-02-21T09:10:45.8103641Z .b8 55 2026-02-21T09:10:45.8103692Z .b8 46 2026-02-21T09:10:45.8103742Z .b8 112 2026-02-21T09:10:45.8103798Z .b8 121 2026-02-21T09:10:45.8103849Z .b8 0 2026-02-21T09:10:45.8103940Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:10:45.8104022Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:10:45.8104113Z .b8 116 2026-02-21T09:10:45.8104165Z .b8 109 2026-02-21T09:10:45.8104214Z .b8 112 2026-02-21T09:10:45.8104273Z .b8 47 2026-02-21T09:10:45.8104323Z .b8 116 2026-02-21T09:10:45.8104374Z .b8 111 2026-02-21T09:10:45.8104433Z .b8 114 2026-02-21T09:10:45.8104483Z .b8 99 2026-02-21T09:10:45.8104534Z .b8 104 2026-02-21T09:10:45.8104584Z .b8 105 2026-02-21T09:10:45.8104641Z .b8 110 2026-02-21T09:10:45.8104690Z .b8 100 2026-02-21T09:10:45.8104741Z .b8 117 2026-02-21T09:10:45.8104797Z .b8 99 2026-02-21T09:10:45.8104847Z .b8 116 2026-02-21T09:10:45.8104897Z .b8 111 2026-02-21T09:10:45.8104947Z .b8 114 2026-02-21T09:10:45.8105029Z .b8 95 2026-02-21T09:10:45.8105080Z .b8 114 2026-02-21T09:10:45.8105130Z .b8 111 2026-02-21T09:10:45.8105180Z .b8 111 2026-02-21T09:10:45.8105235Z .b8 116 2026-02-21T09:10:45.8105286Z .b8 47 2026-02-21T09:10:45.8105336Z .b8 119 2026-02-21T09:10:45.8105394Z .b8 119 2026-02-21T09:10:45.8105445Z .b8 0 2026-02-21T09:10:45.8105568Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:10:45.8105645Z .b8 95 // DW_AT_name 2026-02-21T09:10:45.8105702Z .b8 104 2026-02-21T09:10:45.8105752Z .b8 101 2026-02-21T09:10:45.8105801Z .b8 108 2026-02-21T09:10:45.8105857Z .b8 105 2026-02-21T09:10:45.8105906Z .b8 111 
2026-02-21T09:10:45.8105955Z .b8 110 2026-02-21T09:10:45.8106004Z .b8 95 2026-02-21T09:10:45.8106061Z .b8 109 2026-02-21T09:10:45.8106111Z .b8 97 2026-02-21T09:10:45.8106160Z .b8 116 2026-02-21T09:10:45.8106218Z .b8 109 2026-02-21T09:10:45.8106268Z .b8 117 2026-02-21T09:10:45.8106318Z .b8 108 2026-02-21T09:10:45.8106368Z .b8 95 2026-02-21T09:10:45.8106425Z .b8 98 2026-02-21T09:10:45.8106475Z .b8 102 2026-02-21T09:10:45.8106525Z .b8 49 2026-02-21T09:10:45.8106581Z .b8 54 2026-02-21T09:10:45.8106631Z .b8 95 2026-02-21T09:10:45.8106681Z .b8 105 2026-02-21T09:10:45.8106731Z .b8 110 2026-02-21T09:10:45.8106787Z .b8 116 2026-02-21T09:10:45.8106838Z .b8 52 2026-02-21T09:10:45.8106890Z .b8 0 2026-02-21T09:10:45.8106964Z .b8 1 // DW_AT_inline 2026-02-21T09:10:45.8107069Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:10:45.8107154Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:10:45.8107243Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:10:45.8107340Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:45.8107450Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:10:45.8107543Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:45.8107631Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:10:45.8107717Z .b64 $L__tmp139 // DW_AT_high_pc 2026-02-21T09:10:45.8107796Z .b8 1 // DW_AT_call_file 2026-02-21T09:10:45.8107882Z .b8 94 // DW_AT_call_line 2026-02-21T09:10:45.8107963Z .b8 40 // DW_AT_call_column 2026-02-21T09:10:45.8108046Z .b8 0 // End Of Children Mark 2026-02-21T09:10:45.8108130Z .b8 0 // End Of Children Mark 2026-02-21T09:10:45.8108193Z } 2026-02-21T09:10:45.8108261Z .section .debug_macinfo { } 2026-02-21T09:10:45.8108265Z 2026-02-21T09:10:45.8108343Z ================================================================ 2026-02-21T09:10:45.8108476Z please share the reproducer above with Triton project. 
2026-02-21T09:10:45.8109375Z [232s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 128, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], num_stages=4, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]) 2026-02-21T09:10:45.8109476Z Tensor-likes are not close! 2026-02-21T09:10:45.8109480Z 2026-02-21T09:10:45.8109567Z Mismatched elements: 33394075 / 33554432 (99.5%) 2026-02-21T09:10:45.8109713Z Greatest absolute difference: 1288.0 at index (2056, 3359) (up to 0.01 allowed) 2026-02-21T09:10:45.8109852Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:10:45.8109974Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:10:45.8109979Z 2026-02-21T09:10:46.9196401Z [234s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:10:46.9196857Z 2026-02-21T09:10:46.9202048Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:10:46.9203132Z 2026-02-21T09:10:46.9203325Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:10:46.9203509Z 2026-02-21T09:10:46.9207823Z `ptxas` stderr: 2026-02-21T09:10:46.9208031Z ================================================================ 2026-02-21T09:10:46.9209618Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 298 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:10:46.9210137Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:46.9210296Z 2026-02-21T09:10:46.9210691Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp4qqxeb0l.ptx -o /tmp/tmp4qqxeb0l.ptx.o 2026-02-21T09:10:46.9211126Z 2026-02-21T09:10:46.9211268Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:10:46.9211525Z Internal Triton PTX codegen error 2026-02-21T09:10:46.9211777Z `ptxas` stderr: 2026-02-21T09:10:46.9212200Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 298 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:10:46.9212705Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:46.9212857Z 2026-02-21T09:10:46.9213231Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp4qqxeb0l.ptx -o /tmp/tmp4qqxeb0l.ptx.o 2026-02-21T09:10:46.9213672Z 2026-02-21T09:10:46.9213676Z 2026-02-21T09:10:46.9213735Z // 2026-02-21T09:10:46.9213888Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:10:46.9214070Z // 2026-02-21T09:10:46.9214146Z 2026-02-21T09:10:46.9214206Z .version 8.7 2026-02-21T09:10:46.9214343Z .target sm_100a 2026-02-21T09:10:46.9214482Z .address_size 64 2026-02-21T09:10:46.9214564Z 2026-02-21T09:10:46.9214715Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:10:46.9214992Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:10:46.9215215Z // @_helion_matmul_bf16_int4 2026-02-21T09:10:46.9215428Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:10:46.9215680Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:10:46.9216171Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:10:46.9216459Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:10:46.9216743Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:10:46.9217021Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:10:46.9217255Z ) 2026-02-21T09:10:46.9217383Z .reqntid 256 2026-02-21T09:10:46.9217590Z .maxnreg 32 2026-02-21T09:10:46.9217970Z { 2026-02-21T09:10:46.9218112Z .reg .pred %p<101>; 2026-02-21T09:10:46.9218258Z .reg .b16 %rs<193>; 2026-02-21T09:10:46.9218406Z .reg .b32 %r<589>; 2026-02-21T09:10:46.9218551Z .reg .b64 %rd<220>; 2026-02-21T09:10:46.9218690Z $L__func_begin0: 2026-02-21T09:10:46.9218772Z 2026-02-21T09:10:46.9218835Z // %bb.0: 2026-02-21T09:10:46.9219078Z .loc 1 19 0 // 
cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:19 2026-02-21T09:10:46.9219371Z mov.u32 %r1, %tid.x; 2026-02-21T09:10:46.9219564Z ld.param.b64 %rd16, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:10:46.9219822Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:10:46.9220021Z ld.param.b64 %rd34, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:10:46.9220242Z mov.b32 %r55, global_smem; 2026-02-21T09:10:46.9220407Z // begin inline asm 2026-02-21T09:10:46.9220668Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r55], 256; 2026-02-21T09:10:46.9220921Z // end inline asm 2026-02-21T09:10:46.9221096Z ld.param.b64 %rd51, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:10:46.9221300Z bar.sync 0; 2026-02-21T09:10:46.9221444Z ld.shared.b32 %r582, [global_smem]; 2026-02-21T09:10:46.9221673Z bar.sync 0; 2026-02-21T09:10:46.9221803Z // begin inline asm 2026-02-21T09:10:46.9222013Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:10:46.9222242Z // end inline asm 2026-02-21T09:10:46.9222502Z .loc 1 21 66 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:21:66 2026-02-21T09:10:46.9222810Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:10:46.9222999Z mov.u32 %r72, %ctaid.y; 2026-02-21T09:10:46.9223168Z mov.u32 %r73, %ctaid.z; 2026-02-21T09:10:46.9223341Z mov.u32 %r74, %nctaid.x; 2026-02-21T09:10:46.9223503Z mov.u32 %r75, %nctaid.y; 2026-02-21T09:10:46.9223674Z mad.lo.s32 %r76, %r73, %r75, %r72; 2026-02-21T09:10:46.9223861Z mad.lo.s32 %r77, %r76, %r74, %r3; 2026-02-21T09:10:46.9224045Z shl.b32 %r78, %r77, 8; 2026-02-21T09:10:46.9224200Z cvt.s64.s32 %rd52, %r78; 2026-02-21T09:10:46.9224373Z add.s64 %rd30, %rd51, %rd52; 2026-02-21T09:10:46.9224538Z shl.b32 %r79, %r1, 2; 2026-02-21T09:10:46.9224701Z add.s32 %r56, %r55, %r79; 2026-02-21T09:10:46.9224865Z mov.b32 %r65, 0; 2026-02-21T09:10:46.9225009Z // begin inline asm 2026-02-21T09:10:46.9225175Z @%p1 st.shared.b32 [ %r56 + 0 ], %r65; 2026-02-21T09:10:46.9225356Z // end inline asm 
2026-02-21T09:10:46.9225510Z bar.warp.sync -1; 2026-02-21T09:10:46.9225667Z setp.eq.b32 %p93, %r1, 0; 2026-02-21T09:10:46.9225838Z cvt.u64.u32 %rd15, %r55; 2026-02-21T09:10:46.9225996Z // begin inline asm 2026-02-21T09:10:46.9226268Z @%p93 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd16; 2026-02-21T09:10:46.9226571Z // end inline asm 2026-02-21T09:10:46.9226713Z // begin inline asm 2026-02-21T09:10:46.9226957Z @%p93 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T09:10:46.9227223Z // end inline asm 2026-02-21T09:10:46.9227368Z mov.b32 %r58, 128; 2026-02-21T09:10:46.9227511Z // begin inline asm 2026-02-21T09:10:46.9227763Z @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r58; 2026-02-21T09:10:46.9228040Z // end inline asm 2026-02-21T09:10:46.9228184Z mov.b32 %r153, 16; 2026-02-21T09:10:46.9228330Z // begin inline asm 2026-02-21T09:10:46.9228572Z @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r153; 2026-02-21T09:10:46.9228888Z // end inline asm 2026-02-21T09:10:46.9229029Z mov.b32 %r60, 8192; 2026-02-21T09:10:46.9229179Z // begin inline asm 2026-02-21T09:10:46.9229433Z @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r60; 2026-02-21T09:10:46.9229732Z // end inline asm 2026-02-21T09:10:46.9229864Z mov.b32 %r61, 512; 2026-02-21T09:10:46.9230009Z // begin inline asm 2026-02-21T09:10:46.9230255Z @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r61; 2026-02-21T09:10:46.9230554Z // end inline asm 2026-02-21T09:10:46.9230704Z mov.b64 %rd23, 8192; 2026-02-21T09:10:46.9230853Z // begin inline asm 2026-02-21T09:10:46.9231114Z @%p93 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd23; 2026-02-21T09:10:46.9231397Z // end inline asm 2026-02-21T09:10:46.9231574Z mov.b32 %r62, 1; 2026-02-21T09:10:46.9231712Z // begin inline asm 
2026-02-21T09:10:46.9231970Z @%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r62; 2026-02-21T09:10:46.9232260Z // end inline asm 2026-02-21T09:10:46.9232396Z // begin inline asm 2026-02-21T09:10:46.9232680Z @%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r62; 2026-02-21T09:10:46.9232964Z // end inline asm 2026-02-21T09:10:46.9233114Z // begin inline asm 2026-02-21T09:10:46.9233351Z @%p93 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:46.9233658Z // end inline asm 2026-02-21T09:10:46.9233804Z // begin inline asm 2026-02-21T09:10:46.9234053Z @%p93 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:46.9234337Z // end inline asm 2026-02-21T09:10:46.9234469Z // begin inline asm 2026-02-21T09:10:46.9234708Z @%p93 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x3; 2026-02-21T09:10:46.9234970Z // end inline asm 2026-02-21T09:10:46.9235112Z // begin inline asm 2026-02-21T09:10:46.9235344Z @%p93 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:46.9235597Z // end inline asm 2026-02-21T09:10:46.9235739Z // begin inline asm 2026-02-21T09:10:46.9236073Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd30 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T09:10:46.9236447Z // end inline asm 2026-02-21T09:10:46.9236581Z // begin inline asm 2026-02-21T09:10:46.9236796Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd30 + 0 ], 0x80; 2026-02-21T09:10:46.9237051Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:46.9237237Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:46.9237418Z // end inline asm 2026-02-21T09:10:46.9237549Z bar.sync 0; 2026-02-21T09:10:46.9237695Z cvta.global.u64 %rd71, %rd30; 2026-02-21T09:10:46.9237973Z .loc 1 23 67 // 
cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:23:67 2026-02-21T09:10:46.9238270Z add.s64 %rd48, %rd30, 128; 2026-02-21T09:10:46.9238423Z bar.sync 0; 2026-02-21T09:10:46.9238559Z // begin inline asm 2026-02-21T09:10:46.9238707Z @%p1 st.shared.b32 [ %r56 + 0 ], %r65; 2026-02-21T09:10:46.9238881Z // end inline asm 2026-02-21T09:10:46.9239023Z bar.warp.sync -1; 2026-02-21T09:10:46.9239163Z // begin inline asm 2026-02-21T09:10:46.9239413Z @%p93 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd34; 2026-02-21T09:10:46.9239687Z // end inline asm 2026-02-21T09:10:46.9239825Z // begin inline asm 2026-02-21T09:10:46.9240046Z @%p93 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T09:10:46.9240299Z // end inline asm 2026-02-21T09:10:46.9240431Z mov.b32 %r66, 64; 2026-02-21T09:10:46.9240573Z // begin inline asm 2026-02-21T09:10:46.9240809Z @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r66; 2026-02-21T09:10:46.9241069Z // end inline asm 2026-02-21T09:10:46.9241210Z // begin inline asm 2026-02-21T09:10:46.9241439Z @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r58; 2026-02-21T09:10:46.9241782Z // end inline asm 2026-02-21T09:10:46.9241915Z // begin inline asm 2026-02-21T09:10:46.9242161Z @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r60; 2026-02-21T09:10:46.9242437Z // end inline asm 2026-02-21T09:10:46.9242574Z mov.b32 %r69, 4096; 2026-02-21T09:10:46.9242727Z // begin inline asm 2026-02-21T09:10:46.9242965Z @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r69; 2026-02-21T09:10:46.9243275Z // end inline asm 2026-02-21T09:10:46.9243418Z mov.b64 %rd41, 16384; 2026-02-21T09:10:46.9243575Z // begin inline asm 2026-02-21T09:10:46.9243820Z @%p93 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd41; 2026-02-21T09:10:46.9244105Z // end inline asm 
2026-02-21T09:10:46.9244243Z // begin inline asm 2026-02-21T09:10:46.9244489Z @%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r62; 2026-02-21T09:10:46.9244773Z // end inline asm 2026-02-21T09:10:46.9244906Z // begin inline asm 2026-02-21T09:10:46.9245185Z @%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r62; 2026-02-21T09:10:46.9245459Z // end inline asm 2026-02-21T09:10:46.9245600Z // begin inline asm 2026-02-21T09:10:46.9245838Z @%p93 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0xa; 2026-02-21T09:10:46.9246118Z // end inline asm 2026-02-21T09:10:46.9246261Z // begin inline asm 2026-02-21T09:10:46.9246506Z @%p93 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:46.9246788Z // end inline asm 2026-02-21T09:10:46.9246920Z // begin inline asm 2026-02-21T09:10:46.9247155Z @%p93 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x3; 2026-02-21T09:10:46.9247419Z // end inline asm 2026-02-21T09:10:46.9247552Z // begin inline asm 2026-02-21T09:10:46.9247780Z @%p93 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:46.9248033Z // end inline asm 2026-02-21T09:10:46.9248174Z // begin inline asm 2026-02-21T09:10:46.9248501Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd48 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T09:10:46.9248870Z // end inline asm 2026-02-21T09:10:46.9249009Z // begin inline asm 2026-02-21T09:10:46.9249214Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd48 + 0 ], 0x80; 2026-02-21T09:10:46.9249463Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:46.9249648Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:46.9249829Z // end inline asm 2026-02-21T09:10:46.9249959Z bar.sync 0; 2026-02-21T09:10:46.9250103Z cvta.global.u64 %rd89, %rd48; 2026-02-21T09:10:46.9250374Z .loc 1 28 76 // 
cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:28:76 2026-02-21T09:10:46.9250666Z setp.gt.u32 %p39, %r3, 2047; 2026-02-21T09:10:46.9250837Z @%p39 bra $L__BB0_8; 2026-02-21T09:10:46.9251000Z // %bb.1: // %.lr.ph 2026-02-21T09:10:46.9251294Z .loc 1 0 76 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:0:76 2026-02-21T09:10:46.9251657Z ld.param.b64 %rd14, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:10:46.9251872Z and.b32 %r4, %r1, 128; 2026-02-21T09:10:46.9252134Z .loc 1 78 38 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:78:38 2026-02-21T09:10:46.9252430Z setp.eq.b32 %p53, %r4, 0; 2026-02-21T09:10:46.9252697Z .loc 1 54 38 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:54:38 2026-02-21T09:10:46.9252974Z shl.b32 %r189, %r1, 3; 2026-02-21T09:10:46.9253134Z and.b32 %r190, %r189, 24; 2026-02-21T09:10:46.9253387Z .loc 1 41 45 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:41:45 2026-02-21T09:10:46.9253670Z bfe.u32 %r5, %r1, 2, 6; 2026-02-21T09:10:46.9253821Z shr.u32 %r191, %r1, 5; 2026-02-21T09:10:46.9254011Z and.b32 %r192, %r1, 255; 2026-02-21T09:10:46.9254167Z shl.b32 %r6, %r192, 4; 2026-02-21T09:10:46.9254332Z add.s32 %r194, %r55, 49152; 2026-02-21T09:10:46.9254503Z add.s32 %r294, %r194, %r6; 2026-02-21T09:10:46.9254662Z add.s32 %r296, %r294, 4096; 2026-02-21T09:10:46.9254826Z add.s32 %r280, %r582, 128; 2026-02-21T09:10:46.9254981Z add.s32 %r195, %r55, %r6; 2026-02-21T09:10:46.9255146Z add.s32 %r161, %r195, 57344; 2026-02-21T09:10:46.9255351Z add.s32 %r163, %r195, 61440; 2026-02-21T09:10:46.9255514Z and.b32 %r10, %r1, 127; 2026-02-21T09:10:46.9255661Z shl.b32 %r196, %r10, 6; 2026-02-21T09:10:46.9255815Z shr.u32 %r197, %r4, 2; 2026-02-21T09:10:46.9255962Z or.b32 %r11, %r196, %r197; 2026-02-21T09:10:46.9256124Z add.s32 %r198, %r194, %r11; 2026-02-21T09:10:46.9256285Z add.s32 %r157, %r55, 65536; 2026-02-21T09:10:46.9256437Z add.s32 %r199, %r157, %r10; 2026-02-21T09:10:46.9256596Z xor.b32 %r12, %r10, 16; 
2026-02-21T09:10:46.9256746Z add.s32 %r200, %r157, %r12; 2026-02-21T09:10:46.9256909Z xor.b32 %r13, %r10, 32; 2026-02-21T09:10:46.9257080Z add.s32 %r201, %r157, %r13; 2026-02-21T09:10:46.9257244Z xor.b32 %r14, %r10, 48; 2026-02-21T09:10:46.9257395Z add.s32 %r202, %r157, %r14; 2026-02-21T09:10:46.9257558Z xor.b32 %r15, %r10, 64; 2026-02-21T09:10:46.9257708Z add.s32 %r203, %r157, %r15; 2026-02-21T09:10:46.9257872Z xor.b32 %r16, %r10, 80; 2026-02-21T09:10:46.9258052Z add.s32 %r204, %r157, %r16; 2026-02-21T09:10:46.9258206Z xor.b32 %r17, %r10, 96; 2026-02-21T09:10:46.9258359Z add.s32 %r205, %r157, %r17; 2026-02-21T09:10:46.9258515Z xor.b32 %r18, %r10, 112; 2026-02-21T09:10:46.9258675Z add.s32 %r206, %r157, %r18; 2026-02-21T09:10:46.9258828Z shl.b32 %r207, %r10, 7; 2026-02-21T09:10:46.9258988Z shl.b32 %r208, %r1, 4; 2026-02-21T09:10:46.9259144Z and.b32 %r209, %r208, 112; 2026-02-21T09:10:46.9259307Z shr.u32 %r210, %r4, 5; 2026-02-21T09:10:46.9259458Z or.b32 %r211, %r207, %r210; 2026-02-21T09:10:46.9259619Z or.b32 %r212, %r211, %r209; 2026-02-21T09:10:46.9259780Z add.s32 %r213, %r55, 32768; 2026-02-21T09:10:46.9259937Z add.s32 %r19, %r213, %r212; 2026-02-21T09:10:46.9260101Z xor.b32 %r214, %r212, 16; 2026-02-21T09:10:46.9260255Z add.s32 %r20, %r213, %r214; 2026-02-21T09:10:46.9260419Z xor.b32 %r215, %r212, 32; 2026-02-21T09:10:46.9260573Z add.s32 %r21, %r213, %r215; 2026-02-21T09:10:46.9260737Z xor.b32 %r216, %r212, 48; 2026-02-21T09:10:46.9260890Z add.s32 %r22, %r213, %r216; 2026-02-21T09:10:46.9261055Z xor.b32 %r217, %r212, 64; 2026-02-21T09:10:46.9261209Z add.s32 %r23, %r213, %r217; 2026-02-21T09:10:46.9261372Z xor.b32 %r218, %r212, 80; 2026-02-21T09:10:46.9261530Z add.s32 %r24, %r213, %r218; 2026-02-21T09:10:46.9261713Z xor.b32 %r219, %r212, 96; 2026-02-21T09:10:46.9261870Z add.s32 %r25, %r213, %r219; 2026-02-21T09:10:46.9262024Z xor.b32 %r220, %r212, 112; 2026-02-21T09:10:46.9262187Z add.s32 %r26, %r213, %r220; 2026-02-21T09:10:46.9262347Z bfe.u32 %r221, %r213, 4, 
14; 2026-02-21T09:10:46.9262523Z cvt.u64.u32 %rd59, %r221; 2026-02-21T09:10:46.9262684Z or.b64 %rd64, %rd59, 4611686293372403712; 2026-02-21T09:10:46.9262872Z add.s32 %r222, %r55, 32800; 2026-02-21T09:10:46.9263038Z bfe.u32 %r223, %r222, 4, 14; 2026-02-21T09:10:46.9263202Z cvt.u64.u32 %rd60, %r223; 2026-02-21T09:10:46.9263381Z or.b64 %rd65, %rd60, 4611686293372403712; 2026-02-21T09:10:46.9263560Z add.s32 %r224, %r55, 32832; 2026-02-21T09:10:46.9263722Z bfe.u32 %r225, %r224, 4, 14; 2026-02-21T09:10:46.9263876Z cvt.u64.u32 %rd61, %r225; 2026-02-21T09:10:46.9264041Z or.b64 %rd66, %rd61, 4611686293372403712; 2026-02-21T09:10:46.9264214Z add.s32 %r226, %r55, 32864; 2026-02-21T09:10:46.9264373Z bfe.u32 %r227, %r226, 4, 14; 2026-02-21T09:10:46.9264525Z cvt.u64.u32 %rd62, %r227; 2026-02-21T09:10:46.9264688Z or.b64 %rd67, %rd62, 4611686293372403712; 2026-02-21T09:10:46.9264868Z or.b32 %r27, %r190, 64; 2026-02-21T09:10:46.9265017Z shl.b32 %r228, %r192, 7; 2026-02-21T09:10:46.9265176Z or.b32 %r229, %r228, %r209; 2026-02-21T09:10:46.9265358Z xor.b32 %r230, %r229, 16; 2026-02-21T09:10:46.9265514Z xor.b32 %r231, %r229, 32; 2026-02-21T09:10:46.9265665Z xor.b32 %r232, %r229, 48; 2026-02-21T09:10:46.9265819Z xor.b32 %r233, %r229, 64; 2026-02-21T09:10:46.9265963Z xor.b32 %r234, %r229, 80; 2026-02-21T09:10:46.9266115Z xor.b32 %r235, %r229, 96; 2026-02-21T09:10:46.9266263Z xor.b32 %r236, %r229, 112; 2026-02-21T09:10:46.9266561Z .loc 1 35 33 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:35:33 2026-02-21T09:10:46.9266898Z shr.u32 %r237, %r3, 5; 2026-02-21T09:10:46.9267051Z and.b32 %r238, %r237, 48; 2026-02-21T09:10:46.9267329Z .loc 1 37 64 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:37:64 2026-02-21T09:10:46.9267625Z and.b32 %r239, %r3, 15; 2026-02-21T09:10:46.9267902Z .loc 1 37 30 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:37:30 2026-02-21T09:10:46.9268193Z or.b32 %r240, %r238, %r239; 2026-02-21T09:10:46.9268495Z .loc 1 39 27 // 
cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:39:27 2026-02-21T09:10:46.9268792Z shl.b32 %r300, %r240, 7; 2026-02-21T09:10:46.9269056Z .loc 1 40 27 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:40:27 2026-02-21T09:10:46.9269355Z shl.b32 %r241, %r3, 3; 2026-02-21T09:10:46.9269507Z and.b32 %r479, %r241, 3968; 2026-02-21T09:10:46.9269809Z .loc 1 41 32 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:41:32 2026-02-21T09:10:46.9270100Z or.b32 %r242, %r479, %r5; 2026-02-21T09:10:46.9270366Z .loc 1 55 53 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:53 2026-02-21T09:10:46.9270665Z shl.b32 %r243, %r242, 10; 2026-02-21T09:10:46.9270817Z $L__tmp0: 2026-02-21T09:10:46.9271122Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9271484Z shfl.sync.idx.b32 %r38, %r191, 0, 31, -1; 2026-02-21T09:10:46.9271727Z shl.b32 %r244, %r38, 21; 2026-02-21T09:10:46.9271890Z and.b32 %r245, %r244, 6291456; 2026-02-21T09:10:46.9272071Z add.s32 %r246, %r245, %r582; 2026-02-21T09:10:46.9272242Z and.b32 %r247, %r38, 4; 2026-02-21T09:10:46.9272401Z shl.b32 %r248, %r247, 4; 2026-02-21T09:10:46.9272573Z add.s32 %r477, %r246, %r248; 2026-02-21T09:10:46.9272741Z mov.pred %p60, -1; 2026-02-21T09:10:46.9272906Z // begin inline asm 2026-02-21T09:10:46.9273265Z @%p60 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r477 + 0], {%r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65}; 2026-02-21T09:10:46.9273650Z // end inline asm 2026-02-21T09:10:46.9273802Z // begin inline asm 2026-02-21T09:10:46.9274165Z @%p60 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r477 + 16], {%r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65}; 2026-02-21T09:10:46.9274562Z // end inline asm 2026-02-21T09:10:46.9274701Z // begin inline asm 2026-02-21T09:10:46.9275029Z @%p60 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r477 + 32], {%r65, %r65, 
%r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65}; 2026-02-21T09:10:46.9275378Z // end inline asm 2026-02-21T09:10:46.9275519Z // begin inline asm 2026-02-21T09:10:46.9275840Z @%p60 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r477 + 48], {%r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65, %r65}; 2026-02-21T09:10:46.9276199Z // end inline asm 2026-02-21T09:10:46.9276338Z // begin inline asm 2026-02-21T09:10:46.9276492Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:46.9276664Z // end inline asm 2026-02-21T09:10:46.9276795Z bar.sync 0; 2026-02-21T09:10:46.9276932Z $L__tmp1: 2026-02-21T09:10:46.9277184Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9277482Z add.s32 %r584, %r55, 69632; 2026-02-21T09:10:46.9277666Z // begin inline asm 2026-02-21T09:10:46.9277841Z @%p93 mbarrier.init.shared::cta.b64 [%r584], 1; 2026-02-21T09:10:46.9278037Z // end inline asm 2026-02-21T09:10:46.9278167Z bar.sync 0; 2026-02-21T09:10:46.9278307Z add.s32 %r149, %r55, 69640; 2026-02-21T09:10:46.9278459Z // begin inline asm 2026-02-21T09:10:46.9278628Z @%p93 mbarrier.init.shared::cta.b64 [%r149], 1; 2026-02-21T09:10:46.9278814Z // end inline asm 2026-02-21T09:10:46.9278954Z add.s32 %r298, %r55, 69648; 2026-02-21T09:10:46.9279138Z // begin inline asm 2026-02-21T09:10:46.9279312Z @%p93 mbarrier.init.shared::cta.b64 [%r298], 1; 2026-02-21T09:10:46.9279507Z // end inline asm 2026-02-21T09:10:46.9279642Z bar.sync 0; 2026-02-21T09:10:46.9279783Z add.s32 %r151, %r55, 69656; 2026-02-21T09:10:46.9279939Z // begin inline asm 2026-02-21T09:10:46.9280111Z @%p93 mbarrier.init.shared::cta.b64 [%r151], 1; 2026-02-21T09:10:46.9280299Z // end inline asm 2026-02-21T09:10:46.9280554Z .loc 1 55 60 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:60 2026-02-21T09:10:46.9280841Z or.b32 %r249, %r243, %r190; 2026-02-21T09:10:46.9281136Z .loc 1 55 32 // 
cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:32 2026-02-21T09:10:46.9281435Z mad.wide.u32 %rd53, %r249, 2, %rd14; 2026-02-21T09:10:46.9281655Z cvt.u64.u32 %rd7, %r243; 2026-02-21T09:10:46.9281822Z add.s64 %rd54, %rd53, 131072; 2026-02-21T09:10:46.9282109Z .loc 1 55 80 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:80 2026-02-21T09:10:46.9282401Z // begin inline asm 2026-02-21T09:10:46.9282607Z cp.async.cg.shared.global [ %r294 + 0 ], [ %rd53 + 0 ], 0x10, %r153; 2026-02-21T09:10:46.9282840Z // end inline asm 2026-02-21T09:10:46.9282975Z // begin inline asm 2026-02-21T09:10:46.9283184Z cp.async.cg.shared.global [ %r296 + 0 ], [ %rd54 + 0 ], 0x10, %r153; 2026-02-21T09:10:46.9283418Z // end inline asm 2026-02-21T09:10:46.9283565Z cp.async.commit_group; 2026-02-21T09:10:46.9283850Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9284141Z bar.sync 0; 2026-02-21T09:10:46.9284277Z // begin inline asm 2026-02-21T09:10:46.9284465Z @%p93 mbarrier.arrive.expect_tx.shared.b64 _, [%r298], 2048; 2026-02-21T09:10:46.9284688Z // end inline asm 2026-02-21T09:10:46.9284933Z .loc 1 61 33 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:61:33 2026-02-21T09:10:46.9285226Z bar.sync 0; 2026-02-21T09:10:46.9285373Z elect.sync %r250|%p55, -1; 2026-02-21T09:10:46.9285540Z and.pred %p49, %p1, %p55; 2026-02-21T09:10:46.9285702Z // begin inline asm 2026-02-21T09:10:46.9286025Z @%p49 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r157], [%rd71, {%r300, %r65}], [%r298]; 2026-02-21T09:10:46.9286379Z // end inline asm 2026-02-21T09:10:46.9286621Z .loc 1 55 32 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:32 2026-02-21T09:10:46.9286916Z add.s64 %rd56, %rd53, 64; 2026-02-21T09:10:46.9287076Z or.b32 %r251, %r249, 32; 2026-02-21T09:10:46.9287237Z mad.wide.u32 %rd63, %r251, 2, %rd14; 2026-02-21T09:10:46.9287421Z add.s64 %rd57, %rd63, 131072; 
2026-02-21T09:10:46.9287686Z .loc 1 55 80 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:80 2026-02-21T09:10:46.9287970Z // begin inline asm 2026-02-21T09:10:46.9288169Z cp.async.cg.shared.global [ %r161 + 0 ], [ %rd56 + 0 ], 0x10, %r153; 2026-02-21T09:10:46.9288400Z // end inline asm 2026-02-21T09:10:46.9288535Z // begin inline asm 2026-02-21T09:10:46.9288737Z cp.async.cg.shared.global [ %r163 + 0 ], [ %rd57 + 0 ], 0x10, %r153; 2026-02-21T09:10:46.9288962Z // end inline asm 2026-02-21T09:10:46.9289102Z cp.async.commit_group; 2026-02-21T09:10:46.9289372Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9289658Z bar.sync 0; 2026-02-21T09:10:46.9289816Z // begin inline asm 2026-02-21T09:10:46.9290003Z @%p93 mbarrier.arrive.expect_tx.shared.b64 _, [%r151], 2048; 2026-02-21T09:10:46.9290224Z // end inline asm 2026-02-21T09:10:46.9290460Z .loc 1 61 33 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:61:33 2026-02-21T09:10:46.9290736Z bar.sync 0; 2026-02-21T09:10:46.9290881Z elect.sync %r252|%p56, -1; 2026-02-21T09:10:46.9291046Z and.pred %p51, %p1, %p56; 2026-02-21T09:10:46.9291212Z add.s32 %r166, %r55, 67584; 2026-02-21T09:10:46.9291390Z // begin inline asm 2026-02-21T09:10:46.9291745Z @%p51 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r166], [%rd71, {%r300, %r153}], [%r151]; 2026-02-21T09:10:46.9292088Z // end inline asm 2026-02-21T09:10:46.9292339Z .loc 1 55 80 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:80 2026-02-21T09:10:46.9292629Z cp.async.wait_group 1; 2026-02-21T09:10:46.9292777Z bar.sync 0; 2026-02-21T09:10:46.9292953Z ld.shared.v4.b32 {%r253, %r254, %r255, %r256}, [%r198]; 2026-02-21T09:10:46.9293162Z mov.b32 {%rs1, %rs2}, %r256; 2026-02-21T09:10:46.9293354Z mov.b32 {%rs3, %rs4}, %r255; 2026-02-21T09:10:46.9293513Z mov.b32 {%rs5, %rs6}, %r254; 2026-02-21T09:10:46.9293676Z mov.b32 {%rs7, %rs8}, %r253; 2026-02-21T09:10:46.9293872Z 
ld.shared.v4.b32 {%r257, %r258, %r259, %r260}, [%r198+16]; 2026-02-21T09:10:46.9294095Z mov.b32 {%rs9, %rs10}, %r260; 2026-02-21T09:10:46.9294307Z mov.b32 {%rs11, %rs12}, %r259; 2026-02-21T09:10:46.9294480Z mov.b32 {%rs13, %rs14}, %r258; 2026-02-21T09:10:46.9294648Z mov.b32 {%rs15, %rs16}, %r257; 2026-02-21T09:10:46.9294911Z .loc 1 59 32 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:59:32 2026-02-21T09:10:46.9295203Z cvt.f32.bf16 %r171, %rs7; 2026-02-21T09:10:46.9295358Z cvt.f32.bf16 %r172, %rs8; 2026-02-21T09:10:46.9295516Z cvt.f32.bf16 %r173, %rs5; 2026-02-21T09:10:46.9295665Z cvt.f32.bf16 %r174, %rs6; 2026-02-21T09:10:46.9295823Z cvt.f32.bf16 %r175, %rs3; 2026-02-21T09:10:46.9295978Z cvt.f32.bf16 %r176, %rs4; 2026-02-21T09:10:46.9296127Z cvt.f32.bf16 %r177, %rs1; 2026-02-21T09:10:46.9296283Z cvt.f32.bf16 %r178, %rs2; 2026-02-21T09:10:46.9296434Z cvt.f32.bf16 %r179, %rs15; 2026-02-21T09:10:46.9296597Z cvt.f32.bf16 %r180, %rs16; 2026-02-21T09:10:46.9296752Z cvt.f32.bf16 %r181, %rs13; 2026-02-21T09:10:46.9296910Z cvt.f32.bf16 %r182, %rs14; 2026-02-21T09:10:46.9297060Z cvt.f32.bf16 %r183, %rs11; 2026-02-21T09:10:46.9297218Z cvt.f32.bf16 %r184, %rs12; 2026-02-21T09:10:46.9297367Z cvt.f32.bf16 %r185, %rs9; 2026-02-21T09:10:46.9297523Z cvt.f32.bf16 %r186, %rs10; 2026-02-21T09:10:46.9297676Z $L__tmp2: 2026-02-21T09:10:46.9297961Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9298294Z add.s32 %r261, %r245, %r280; 2026-02-21T09:10:46.9298448Z shl.b32 %r262, %r247, 2; 2026-02-21T09:10:46.9298607Z add.s32 %r315, %r261, %r262; 2026-02-21T09:10:46.9298760Z // begin inline asm 2026-02-21T09:10:46.9299131Z @%p60 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r315 + 0], {%r171, %r172, %r173, %r174, %r175, %r176, %r177, %r178, %r179, %r180, %r181, %r182, %r183, %r184, %r185, %r186}; 2026-02-21T09:10:46.9299520Z // end inline asm 2026-02-21T09:10:46.9299656Z // begin inline asm 
2026-02-21T09:10:46.9299816Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:46.9299980Z // end inline asm 2026-02-21T09:10:46.9300121Z bar.sync 0; 2026-02-21T09:10:46.9300247Z $L__tmp3: 2026-02-21T09:10:46.9300496Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9300778Z // begin inline asm 2026-02-21T09:10:46.9300918Z 2026-02-21T09:10:46.9301032Z { 2026-02-21T09:10:46.9301164Z .reg .pred complete; 2026-02-21T09:10:46.9301314Z waitLoop: 2026-02-21T09:10:46.9301499Z mbarrier.try_wait.parity.shared.b64 complete, [%r298], %r65; 2026-02-21T09:10:46.9301772Z @!complete bra.uni waitLoop; 2026-02-21T09:10:46.9301982Z } 2026-02-21T09:10:46.9302052Z 2026-02-21T09:10:46.9302119Z // end inline asm 2026-02-21T09:10:46.9302367Z .loc 1 61 33 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:61:33 2026-02-21T09:10:46.9302664Z ld.shared.b8 %rs17, [%r199]; 2026-02-21T09:10:46.9302834Z ld.shared.b8 %rs18, [%r199+1024]; 2026-02-21T09:10:46.9303026Z ld.shared.b8 %rs19, [%r200+128]; 2026-02-21T09:10:46.9303212Z ld.shared.b8 %rs20, [%r200+1152]; 2026-02-21T09:10:46.9303413Z ld.shared.b8 %rs21, [%r201+256]; 2026-02-21T09:10:46.9303597Z ld.shared.b8 %rs22, [%r201+1280]; 2026-02-21T09:10:46.9303769Z ld.shared.b8 %rs23, [%r202+384]; 2026-02-21T09:10:46.9303947Z ld.shared.b8 %rs24, [%r202+1408]; 2026-02-21T09:10:46.9304112Z ld.shared.b8 %rs25, [%r203+512]; 2026-02-21T09:10:46.9304285Z ld.shared.b8 %rs26, [%r203+1536]; 2026-02-21T09:10:46.9304449Z ld.shared.b8 %rs27, [%r204+640]; 2026-02-21T09:10:46.9304618Z ld.shared.b8 %rs28, [%r204+1664]; 2026-02-21T09:10:46.9304788Z ld.shared.b8 %rs29, [%r205+768]; 2026-02-21T09:10:46.9304950Z ld.shared.b8 %rs30, [%r205+1792]; 2026-02-21T09:10:46.9305143Z ld.shared.b8 %rs31, [%r206+896]; 2026-02-21T09:10:46.9305309Z ld.shared.b8 %rs32, [%r206+1920]; 2026-02-21T09:10:46.9305583Z .loc 1 64 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:64:28 2026-02-21T09:10:46.9305862Z shl.b16 
%rs33, %rs17, 4; 2026-02-21T09:10:46.9306047Z shl.b16 %rs34, %rs19, 4; 2026-02-21T09:10:46.9306200Z shl.b16 %rs35, %rs21, 4; 2026-02-21T09:10:46.9306358Z shl.b16 %rs36, %rs23, 4; 2026-02-21T09:10:46.9306512Z shl.b16 %rs37, %rs25, 4; 2026-02-21T09:10:46.9306657Z shl.b16 %rs38, %rs27, 4; 2026-02-21T09:10:46.9306809Z shl.b16 %rs39, %rs29, 4; 2026-02-21T09:10:46.9306954Z shl.b16 %rs40, %rs31, 4; 2026-02-21T09:10:46.9307106Z shl.b16 %rs41, %rs18, 4; 2026-02-21T09:10:46.9307251Z shl.b16 %rs42, %rs20, 4; 2026-02-21T09:10:46.9307403Z shl.b16 %rs43, %rs22, 4; 2026-02-21T09:10:46.9307548Z shl.b16 %rs44, %rs24, 4; 2026-02-21T09:10:46.9307700Z shl.b16 %rs45, %rs26, 4; 2026-02-21T09:10:46.9307847Z shl.b16 %rs46, %rs28, 4; 2026-02-21T09:10:46.9307997Z shl.b16 %rs47, %rs30, 4; 2026-02-21T09:10:46.9308146Z shl.b16 %rs48, %rs32, 4; 2026-02-21T09:10:46.9308398Z .loc 1 79 58 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:79:58 2026-02-21T09:10:46.9308689Z selp.b16 %rs49, %rs33, %rs17, %p53; 2026-02-21T09:10:46.9308864Z cvt.s16.s8 %rs50, %rs49; 2026-02-21T09:10:46.9309018Z shr.s16 %rs51, %rs50, 4; 2026-02-21T09:10:46.9309173Z selp.b16 %rs52, %rs34, %rs19, %p53; 2026-02-21T09:10:46.9309350Z cvt.s16.s8 %rs53, %rs52; 2026-02-21T09:10:46.9309494Z shr.s16 %rs54, %rs53, 4; 2026-02-21T09:10:46.9309652Z selp.b16 %rs55, %rs35, %rs21, %p53; 2026-02-21T09:10:46.9309824Z cvt.s16.s8 %rs56, %rs55; 2026-02-21T09:10:46.9309969Z shr.s16 %rs57, %rs56, 4; 2026-02-21T09:10:46.9310128Z selp.b16 %rs58, %rs36, %rs23, %p53; 2026-02-21T09:10:46.9310292Z cvt.s16.s8 %rs59, %rs58; 2026-02-21T09:10:46.9310443Z shr.s16 %rs60, %rs59, 4; 2026-02-21T09:10:46.9310596Z selp.b16 %rs61, %rs37, %rs25, %p53; 2026-02-21T09:10:46.9310766Z cvt.s16.s8 %rs62, %rs61; 2026-02-21T09:10:46.9310929Z shr.s16 %rs63, %rs62, 4; 2026-02-21T09:10:46.9311094Z selp.b16 %rs64, %rs38, %rs27, %p53; 2026-02-21T09:10:46.9311267Z cvt.s16.s8 %rs65, %rs64; 2026-02-21T09:10:46.9311428Z shr.s16 %rs66, %rs65, 4; 
2026-02-21T09:10:46.9311637Z selp.b16 %rs67, %rs39, %rs29, %p53; 2026-02-21T09:10:46.9311814Z cvt.s16.s8 %rs68, %rs67; 2026-02-21T09:10:46.9311978Z shr.s16 %rs69, %rs68, 4; 2026-02-21T09:10:46.9312141Z selp.b16 %rs70, %rs40, %rs31, %p53; 2026-02-21T09:10:46.9312325Z cvt.s16.s8 %rs71, %rs70; 2026-02-21T09:10:46.9312481Z shr.s16 %rs72, %rs71, 4; 2026-02-21T09:10:46.9312648Z selp.b16 %rs73, %rs41, %rs18, %p53; 2026-02-21T09:10:46.9312822Z cvt.s16.s8 %rs74, %rs73; 2026-02-21T09:10:46.9312983Z shr.s16 %rs75, %rs74, 4; 2026-02-21T09:10:46.9313158Z selp.b16 %rs76, %rs42, %rs20, %p53; 2026-02-21T09:10:46.9313369Z cvt.s16.s8 %rs77, %rs76; 2026-02-21T09:10:46.9313533Z shr.s16 %rs78, %rs77, 4; 2026-02-21T09:10:46.9313695Z selp.b16 %rs79, %rs43, %rs22, %p53; 2026-02-21T09:10:46.9313876Z cvt.s16.s8 %rs80, %rs79; 2026-02-21T09:10:46.9314030Z shr.s16 %rs81, %rs80, 4; 2026-02-21T09:10:46.9314195Z selp.b16 %rs82, %rs44, %rs24, %p53; 2026-02-21T09:10:46.9314368Z cvt.s16.s8 %rs83, %rs82; 2026-02-21T09:10:46.9314530Z shr.s16 %rs84, %rs83, 4; 2026-02-21T09:10:46.9314726Z selp.b16 %rs85, %rs45, %rs26, %p53; 2026-02-21T09:10:46.9314907Z cvt.s16.s8 %rs86, %rs85; 2026-02-21T09:10:46.9315069Z shr.s16 %rs87, %rs86, 4; 2026-02-21T09:10:46.9315229Z selp.b16 %rs88, %rs46, %rs28, %p53; 2026-02-21T09:10:46.9315410Z cvt.s16.s8 %rs89, %rs88; 2026-02-21T09:10:46.9315566Z shr.s16 %rs90, %rs89, 4; 2026-02-21T09:10:46.9315735Z selp.b16 %rs91, %rs47, %rs30, %p53; 2026-02-21T09:10:46.9315909Z cvt.s16.s8 %rs92, %rs91; 2026-02-21T09:10:46.9316071Z shr.s16 %rs93, %rs92, 4; 2026-02-21T09:10:46.9316232Z selp.b16 %rs94, %rs48, %rs32, %p53; 2026-02-21T09:10:46.9316410Z cvt.s16.s8 %rs95, %rs94; 2026-02-21T09:10:46.9316608Z shr.s16 %rs96, %rs95, 4; 2026-02-21T09:10:46.9316884Z .loc 1 84 32 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:84:32 2026-02-21T09:10:46.9317188Z cvt.rn.f32.s16 %r263, %rs51; 2026-02-21T09:10:46.9317355Z cvt.rn.f32.s16 %r264, %rs54; 2026-02-21T09:10:46.9317552Z cvt.rn.f32.s16 
%r265, %rs57; 2026-02-21T09:10:46.9317714Z cvt.rn.f32.s16 %r266, %rs60; 2026-02-21T09:10:46.9317881Z cvt.rn.f32.s16 %r267, %rs63; 2026-02-21T09:10:46.9318040Z cvt.rn.f32.s16 %r268, %rs66; 2026-02-21T09:10:46.9318207Z cvt.rn.f32.s16 %r269, %rs69; 2026-02-21T09:10:46.9318366Z cvt.rn.f32.s16 %r270, %rs72; 2026-02-21T09:10:46.9318534Z cvt.rn.f32.s16 %r271, %rs75; 2026-02-21T09:10:46.9318702Z cvt.rn.f32.s16 %r272, %rs78; 2026-02-21T09:10:46.9318876Z cvt.rn.f32.s16 %r273, %rs81; 2026-02-21T09:10:46.9319039Z cvt.rn.f32.s16 %r274, %rs84; 2026-02-21T09:10:46.9319195Z cvt.rn.f32.s16 %r275, %rs87; 2026-02-21T09:10:46.9319355Z cvt.rn.f32.s16 %r276, %rs90; 2026-02-21T09:10:46.9319507Z cvt.rn.f32.s16 %r277, %rs93; 2026-02-21T09:10:46.9319666Z cvt.rn.f32.s16 %r278, %rs96; 2026-02-21T09:10:46.9319820Z st.shared.b32 [%r19], %r263; 2026-02-21T09:10:46.9319986Z st.shared.b32 [%r19+8], %r264; 2026-02-21T09:10:46.9320157Z st.shared.b32 [%r20], %r265; 2026-02-21T09:10:46.9320316Z st.shared.b32 [%r20+8], %r266; 2026-02-21T09:10:46.9320488Z st.shared.b32 [%r21], %r267; 2026-02-21T09:10:46.9320647Z st.shared.b32 [%r21+8], %r268; 2026-02-21T09:10:46.9320820Z st.shared.b32 [%r22], %r269; 2026-02-21T09:10:46.9320978Z st.shared.b32 [%r22+8], %r270; 2026-02-21T09:10:46.9321151Z st.shared.b32 [%r23], %r271; 2026-02-21T09:10:46.9321308Z st.shared.b32 [%r23+8], %r272; 2026-02-21T09:10:46.9321476Z st.shared.b32 [%r24], %r273; 2026-02-21T09:10:46.9321668Z st.shared.b32 [%r24+8], %r274; 2026-02-21T09:10:46.9321835Z st.shared.b32 [%r25], %r275; 2026-02-21T09:10:46.9321999Z st.shared.b32 [%r25+8], %r276; 2026-02-21T09:10:46.9322161Z st.shared.b32 [%r26], %r277; 2026-02-21T09:10:46.9322325Z st.shared.b32 [%r26+8], %r278; 2026-02-21T09:10:46.9322478Z $L__tmp4: 2026-02-21T09:10:46.9322774Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9323104Z // begin inline asm 2026-02-21T09:10:46.9323273Z fence.proxy.async.shared::cta; 
2026-02-21T09:10:46.9323442Z // end inline asm 2026-02-21T09:10:46.9323584Z bar.sync 0; 2026-02-21T09:10:46.9323730Z setp.ne.b32 %p57, %r38, 0; 2026-02-21T09:10:46.9323888Z @%p57 bra $L__BB0_3; 2026-02-21T09:10:46.9324035Z // %bb.2: 2026-02-21T09:10:46.9324171Z elect.sync %r291|%p59, -1; 2026-02-21T09:10:46.9324337Z mov.b32 %r281, 136317200; 2026-02-21T09:10:46.9324490Z mov.pred %p58, 0; 2026-02-21T09:10:46.9324639Z // begin inline asm 2026-02-21T09:10:46.9324879Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r582 + 0 ], [ %r280 + 0 ], %rd64, %r281, %p58; 2026-02-21T09:10:46.9325183Z // end inline asm 2026-02-21T09:10:46.9325321Z // begin inline asm 2026-02-21T09:10:46.9325560Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r582 + 0 ], [ %r280 + 8 ], %rd65, %r281, %p60; 2026-02-21T09:10:46.9325823Z // end inline asm 2026-02-21T09:10:46.9325958Z // begin inline asm 2026-02-21T09:10:46.9326193Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r582 + 0 ], [ %r280 + 16 ], %rd66, %r281, %p60; 2026-02-21T09:10:46.9326476Z // end inline asm 2026-02-21T09:10:46.9326623Z // begin inline asm 2026-02-21T09:10:46.9326851Z @%p59 tcgen05.mma.cta_group::1.kind::tf32 [ %r582 + 0 ], [ %r280 + 24 ], %rd67, %r281, %p60; 2026-02-21T09:10:46.9327114Z // end inline asm 2026-02-21T09:10:46.9327261Z add.s32 %r293, %r55, 69632; 2026-02-21T09:10:46.9327424Z cvt.u64.u32 %rd68, %r293; 2026-02-21T09:10:46.9327587Z // begin inline asm 2026-02-21T09:10:46.9327794Z @%p59 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd68]; 2026-02-21T09:10:46.9328030Z // end inline asm 2026-02-21T09:10:46.9328166Z $L__tmp5: 2026-02-21T09:10:46.9328325Z $L__BB0_3: 2026-02-21T09:10:46.9328486Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:46.9328690Z add.s32 %r28, %r55, %r229; 2026-02-21T09:10:46.9328851Z add.s32 %r29, %r55, %r230; 2026-02-21T09:10:46.9329002Z add.s32 %r30, %r55, %r231; 2026-02-21T09:10:46.9329160Z add.s32 %r31, %r55, %r232; 2026-02-21T09:10:46.9329337Z add.s32 %r32, %r55, %r233; 
2026-02-21T09:10:46.9329498Z add.s32 %r33, %r55, %r234; 2026-02-21T09:10:46.9329646Z add.s32 %r34, %r55, %r235; 2026-02-21T09:10:46.9329804Z add.s32 %r35, %r55, %r236; 2026-02-21T09:10:46.9330064Z .loc 1 55 32 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:32 2026-02-21T09:10:46.9330355Z add.s64 %rd69, %rd53, 128; 2026-02-21T09:10:46.9330520Z cvt.u64.u32 %rd73, %r27; 2026-02-21T09:10:46.9330676Z add.s64 %rd74, %rd7, %rd73; 2026-02-21T09:10:46.9330841Z shl.b64 %rd75, %rd74, 1; 2026-02-21T09:10:46.9330993Z add.s64 %rd76, %rd14, %rd75; 2026-02-21T09:10:46.9331160Z add.s64 %rd70, %rd76, 131072; 2026-02-21T09:10:46.9331316Z mov.b32 %r295, 16; 2026-02-21T09:10:46.9331590Z .loc 1 55 80 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:80 2026-02-21T09:10:46.9331865Z // begin inline asm 2026-02-21T09:10:46.9332074Z cp.async.cg.shared.global [ %r294 + 0 ], [ %rd69 + 0 ], 0x10, %r295; 2026-02-21T09:10:46.9332301Z // end inline asm 2026-02-21T09:10:46.9332437Z // begin inline asm 2026-02-21T09:10:46.9332643Z cp.async.cg.shared.global [ %r296 + 0 ], [ %rd70 + 0 ], 0x10, %r295; 2026-02-21T09:10:46.9332861Z // end inline asm 2026-02-21T09:10:46.9333011Z cp.async.commit_group; 2026-02-21T09:10:46.9333271Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9333560Z // begin inline asm 2026-02-21T09:10:46.9333747Z @%p93 mbarrier.arrive.expect_tx.shared.b64 _, [%r298], 2048; 2026-02-21T09:10:46.9333971Z // end inline asm 2026-02-21T09:10:46.9334217Z .loc 1 61 33 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:61:33 2026-02-21T09:10:46.9334486Z bar.sync 0; 2026-02-21T09:10:46.9334633Z elect.sync %r307|%p70, -1; 2026-02-21T09:10:46.9334797Z and.pred %p68, %p1, %p70; 2026-02-21T09:10:46.9334960Z mov.b32 %r301, 32; 2026-02-21T09:10:46.9335097Z // begin inline asm 2026-02-21T09:10:46.9335420Z @%p68 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r157], [%rd71, {%r300, 
%r301}], [%r298]; 2026-02-21T09:10:46.9335761Z // end inline asm 2026-02-21T09:10:46.9336009Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9336294Z and.b32 %r308, %r1, 3; 2026-02-21T09:10:46.9336448Z mul.wide.u32 %rd77, %r308, 16; 2026-02-21T09:10:46.9336618Z shl.b32 %r309, %r3, 13; 2026-02-21T09:10:46.9336769Z and.b32 %r310, %r309, 4063232; 2026-02-21T09:10:46.9336962Z shl.b32 %r311, %r5, 10; 2026-02-21T09:10:46.9337110Z or.b32 %r312, %r310, %r311; 2026-02-21T09:10:46.9337276Z mul.wide.u32 %rd78, %r312, 2; 2026-02-21T09:10:46.9337437Z or.b64 %rd79, %rd77, %rd78; 2026-02-21T09:10:46.9337600Z add.s64 %rd80, %rd79, %rd14; 2026-02-21T09:10:46.9337764Z add.s64 %rd218, %rd80, 131264; 2026-02-21T09:10:46.9337920Z mov.b32 %r587, 1; 2026-02-21T09:10:46.9338064Z mov.b32 %r583, 0; 2026-02-21T09:10:46.9338226Z mov.b64 %rd219, 0; 2026-02-21T09:10:46.9338374Z mov.b32 %r585, %r583; 2026-02-21T09:10:46.9338520Z mov.b32 %r586, %r583; 2026-02-21T09:10:46.9338670Z mov.b32 %r588, %r583; 2026-02-21T09:10:46.9338811Z bra.uni $L__BB0_4; 2026-02-21T09:10:46.9339003Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:10:46.9339329Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9339628Z setp.lt.u64 %p86, %rd219, 464; 2026-02-21T09:10:46.9339798Z $L__tmp6: 2026-02-21T09:10:46.9340105Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9340443Z add.s32 %r400, %r587, 1; 2026-02-21T09:10:46.9340611Z setp.gt.s32 %p89, %r400, 1; 2026-02-21T09:10:46.9340788Z selp.b32 %r587, 0, %r400, %p89; 2026-02-21T09:10:46.9340964Z selp.b32 %r401, 1, 0, %p89; 2026-02-21T09:10:46.9341180Z xor.b32 %r54, %r588, %r401; 2026-02-21T09:10:46.9341339Z $L__tmp7: 2026-02-21T09:10:46.9341603Z .loc 1 55 32 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:32 2026-02-21T09:10:46.9341898Z add.s64 %rd86, %rd218, -131072; 
2026-02-21T09:10:46.9342168Z .loc 1 55 80 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:80 2026-02-21T09:10:46.9342454Z add.s32 %r391, %r48, %r6; 2026-02-21T09:10:46.9342611Z selp.b32 %r392, 16, 0, %p86; 2026-02-21T09:10:46.9342775Z // begin inline asm 2026-02-21T09:10:46.9342977Z cp.async.cg.shared.global [ %r391 + 0 ], [ %rd86 + 0 ], 0x10, %r392; 2026-02-21T09:10:46.9343214Z // end inline asm 2026-02-21T09:10:46.9343358Z add.s32 %r393, %r391, 4096; 2026-02-21T09:10:46.9343511Z // begin inline asm 2026-02-21T09:10:46.9343718Z cp.async.cg.shared.global [ %r393 + 0 ], [ %rd218 + 0 ], 0x10, %r392; 2026-02-21T09:10:46.9343940Z // end inline asm 2026-02-21T09:10:46.9344091Z cp.async.commit_group; 2026-02-21T09:10:46.9344356Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9344654Z and.pred %p84, %p93, %p86; 2026-02-21T09:10:46.9344810Z // begin inline asm 2026-02-21T09:10:46.9345004Z @%p84 mbarrier.arrive.expect_tx.shared.b64 _, [%r395], 2048; 2026-02-21T09:10:46.9345224Z // end inline asm 2026-02-21T09:10:46.9345464Z .loc 1 61 33 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:61:33 2026-02-21T09:10:46.9345744Z bar.sync 0; 2026-02-21T09:10:46.9345884Z elect.sync %r402|%p90, -1; 2026-02-21T09:10:46.9346050Z and.pred %p91, %p86, %p90; 2026-02-21T09:10:46.9346210Z and.pred %p85, %p1, %p91; 2026-02-21T09:10:46.9346372Z cvt.u32.u64 %r403, %rd219; 2026-02-21T09:10:46.9346524Z add.s32 %r398, %r403, 48; 2026-02-21T09:10:46.9346679Z // begin inline asm 2026-02-21T09:10:46.9347005Z @%p85 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r396], [%rd71, {%r300, %r398}], [%r395]; 2026-02-21T09:10:46.9347353Z // end inline asm 2026-02-21T09:10:46.9347602Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9347884Z add.s64 %rd218, %rd218, 64; 2026-02-21T09:10:46.9348052Z setp.lt.u64 %p92, %rd219, 480; 2026-02-21T09:10:46.9348214Z 
add.s64 %rd219, %rd219, 16; 2026-02-21T09:10:46.9348374Z mov.b32 %r583, %r588; 2026-02-21T09:10:46.9348524Z mov.b32 %r588, %r54; 2026-02-21T09:10:46.9348666Z @%p92 bra $L__BB0_4; 2026-02-21T09:10:46.9348845Z bra.uni $L__BB0_7; 2026-02-21T09:10:46.9349035Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:10:46.9349264Z add.s32 %r334, %r586, 1; 2026-02-21T09:10:46.9349417Z setp.gt.s32 %p74, %r334, 1; 2026-02-21T09:10:46.9349588Z selp.b32 %r586, 0, %r334, %p74; 2026-02-21T09:10:46.9349860Z .loc 1 55 80 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:55:80 2026-02-21T09:10:46.9350162Z cp.async.wait_group 1; 2026-02-21T09:10:46.9350348Z bar.sync 0; 2026-02-21T09:10:46.9350486Z shl.b32 %r335, %r586, 13; 2026-02-21T09:10:46.9350652Z add.s32 %r337, %r55, %r335; 2026-02-21T09:10:46.9350806Z add.s32 %r48, %r337, 49152; 2026-02-21T09:10:46.9350968Z add.s32 %r338, %r48, %r11; 2026-02-21T09:10:46.9351153Z ld.shared.v4.b32 {%r339, %r340, %r341, %r342}, [%r338]; 2026-02-21T09:10:46.9351365Z mov.b32 {%rs97, %rs98}, %r342; 2026-02-21T09:10:46.9351531Z mov.b32 {%rs99, %rs100}, %r341; 2026-02-21T09:10:46.9351732Z mov.b32 {%rs101, %rs102}, %r340; 2026-02-21T09:10:46.9351901Z mov.b32 {%rs103, %rs104}, %r339; 2026-02-21T09:10:46.9352133Z ld.shared.v4.b32 {%r343, %r344, %r345, %r346}, [%r338+16]; 2026-02-21T09:10:46.9352349Z mov.b32 {%rs105, %rs106}, %r346; 2026-02-21T09:10:46.9352512Z mov.b32 {%rs107, %rs108}, %r345; 2026-02-21T09:10:46.9352683Z mov.b32 {%rs109, %rs110}, %r344; 2026-02-21T09:10:46.9352845Z mov.b32 {%rs111, %rs112}, %r343; 2026-02-21T09:10:46.9353143Z .loc 1 59 32 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:59:32 2026-02-21T09:10:46.9353432Z cvt.f32.bf16 %r316, %rs103; 2026-02-21T09:10:46.9353598Z cvt.f32.bf16 %r317, %rs104; 2026-02-21T09:10:46.9353753Z cvt.f32.bf16 %r318, %rs101; 2026-02-21T09:10:46.9353915Z cvt.f32.bf16 %r319, %rs102; 2026-02-21T09:10:46.9354096Z cvt.f32.bf16 %r320, %rs99; 2026-02-21T09:10:46.9354255Z cvt.f32.bf16 
%r321, %rs100; 2026-02-21T09:10:46.9354423Z cvt.f32.bf16 %r322, %rs97; 2026-02-21T09:10:46.9354583Z cvt.f32.bf16 %r323, %rs98; 2026-02-21T09:10:46.9354748Z cvt.f32.bf16 %r324, %rs111; 2026-02-21T09:10:46.9354906Z cvt.f32.bf16 %r325, %rs112; 2026-02-21T09:10:46.9355073Z cvt.f32.bf16 %r326, %rs109; 2026-02-21T09:10:46.9355231Z cvt.f32.bf16 %r327, %rs110; 2026-02-21T09:10:46.9355394Z cvt.f32.bf16 %r328, %rs107; 2026-02-21T09:10:46.9355556Z cvt.f32.bf16 %r329, %rs108; 2026-02-21T09:10:46.9355712Z cvt.f32.bf16 %r330, %rs105; 2026-02-21T09:10:46.9355877Z cvt.f32.bf16 %r331, %rs106; 2026-02-21T09:10:46.9356030Z $L__tmp8: 2026-02-21T09:10:46.9356329Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9356667Z // begin inline asm 2026-02-21T09:10:46.9356816Z 2026-02-21T09:10:46.9356932Z { 2026-02-21T09:10:46.9357066Z .reg .pred complete; 2026-02-21T09:10:46.9357217Z waitLoop: 2026-02-21T09:10:46.9357419Z mbarrier.try_wait.parity.shared.b64 complete, [%r584], %r583; 2026-02-21T09:10:46.9357669Z @!complete bra.uni waitLoop; 2026-02-21T09:10:46.9357827Z } 2026-02-21T09:10:46.9357895Z 2026-02-21T09:10:46.9357961Z // end inline asm 2026-02-21T09:10:46.9358101Z $L__tmp9: 2026-02-21T09:10:46.9358358Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9358660Z selp.b32 %r347, 1, 0, %p74; 2026-02-21T09:10:46.9358833Z xor.b32 %r585, %r585, %r347; 2026-02-21T09:10:46.9358998Z mov.pred %p75, -1; 2026-02-21T09:10:46.9359156Z $L__tmp10: 2026-02-21T09:10:46.9359463Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9359804Z // begin inline asm 2026-02-21T09:10:46.9360194Z @%p75 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r315 + 0], {%r316, %r317, %r318, %r319, %r320, %r321, %r322, %r323, %r324, %r325, %r326, %r327, %r328, %r329, %r330, %r331}; 2026-02-21T09:10:46.9360602Z // end inline asm 
2026-02-21T09:10:46.9360757Z // begin inline asm 2026-02-21T09:10:46.9360949Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:46.9361130Z // end inline asm 2026-02-21T09:10:46.9361277Z bar.sync 0; 2026-02-21T09:10:46.9361408Z $L__tmp11: 2026-02-21T09:10:46.9361697Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9362023Z shl.b32 %r348, %r586, 3; 2026-02-21T09:10:46.9362185Z add.s32 %r349, %r55, %r348; 2026-02-21T09:10:46.9362343Z add.s32 %r395, %r349, 69648; 2026-02-21T09:10:46.9362532Z // begin inline asm 2026-02-21T09:10:46.9362665Z 2026-02-21T09:10:46.9362784Z { 2026-02-21T09:10:46.9362905Z .reg .pred complete; 2026-02-21T09:10:46.9363055Z waitLoop: 2026-02-21T09:10:46.9363247Z mbarrier.try_wait.parity.shared.b64 complete, [%r395], %r585; 2026-02-21T09:10:46.9363484Z @!complete bra.uni waitLoop; 2026-02-21T09:10:46.9363644Z } 2026-02-21T09:10:46.9363709Z 2026-02-21T09:10:46.9363765Z // end inline asm 2026-02-21T09:10:46.9364019Z .loc 1 61 33 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:61:33 2026-02-21T09:10:46.9364305Z shl.b32 %r350, %r586, 11; 2026-02-21T09:10:46.9364493Z add.s32 %r351, %r55, %r350; 2026-02-21T09:10:46.9364650Z add.s32 %r396, %r351, 65536; 2026-02-21T09:10:46.9364808Z add.s32 %r352, %r396, %r10; 2026-02-21T09:10:46.9364972Z ld.shared.b8 %rs113, [%r352]; 2026-02-21T09:10:46.9365140Z ld.shared.b8 %rs114, [%r352+1024]; 2026-02-21T09:10:46.9365340Z add.s32 %r353, %r396, %r12; 2026-02-21T09:10:46.9365501Z ld.shared.b8 %rs115, [%r353+128]; 2026-02-21T09:10:46.9365681Z ld.shared.b8 %rs116, [%r353+1152]; 2026-02-21T09:10:46.9365848Z add.s32 %r354, %r396, %r13; 2026-02-21T09:10:46.9366009Z ld.shared.b8 %rs117, [%r354+256]; 2026-02-21T09:10:46.9366177Z ld.shared.b8 %rs118, [%r354+1280]; 2026-02-21T09:10:46.9366346Z add.s32 %r355, %r396, %r14; 2026-02-21T09:10:46.9366506Z ld.shared.b8 %rs119, [%r355+384]; 2026-02-21T09:10:46.9366671Z ld.shared.b8 %rs120, [%r355+1408]; 
2026-02-21T09:10:46.9366839Z add.s32 %r356, %r396, %r15; 2026-02-21T09:10:46.9366994Z ld.shared.b8 %rs121, [%r356+512]; 2026-02-21T09:10:46.9367165Z ld.shared.b8 %rs122, [%r356+1536]; 2026-02-21T09:10:46.9367327Z add.s32 %r357, %r396, %r16; 2026-02-21T09:10:46.9367489Z ld.shared.b8 %rs123, [%r357+640]; 2026-02-21T09:10:46.9367654Z ld.shared.b8 %rs124, [%r357+1664]; 2026-02-21T09:10:46.9367825Z add.s32 %r358, %r396, %r17; 2026-02-21T09:10:46.9367980Z ld.shared.b8 %rs125, [%r358+768]; 2026-02-21T09:10:46.9368154Z ld.shared.b8 %rs126, [%r358+1792]; 2026-02-21T09:10:46.9368323Z add.s32 %r359, %r396, %r18; 2026-02-21T09:10:46.9368477Z ld.shared.b8 %rs127, [%r359+896]; 2026-02-21T09:10:46.9368647Z ld.shared.b8 %rs128, [%r359+1920]; 2026-02-21T09:10:46.9368920Z .loc 1 64 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:64:28 2026-02-21T09:10:46.9369212Z shl.b16 %rs129, %rs113, 4; 2026-02-21T09:10:46.9369369Z shl.b16 %rs130, %rs115, 4; 2026-02-21T09:10:46.9369533Z shl.b16 %rs131, %rs117, 4; 2026-02-21T09:10:46.9369692Z shl.b16 %rs132, %rs119, 4; 2026-02-21T09:10:46.9369854Z shl.b16 %rs133, %rs121, 4; 2026-02-21T09:10:46.9370011Z shl.b16 %rs134, %rs123, 4; 2026-02-21T09:10:46.9370161Z shl.b16 %rs135, %rs125, 4; 2026-02-21T09:10:46.9370315Z shl.b16 %rs136, %rs127, 4; 2026-02-21T09:10:46.9370465Z shl.b16 %rs137, %rs114, 4; 2026-02-21T09:10:46.9370620Z shl.b16 %rs138, %rs116, 4; 2026-02-21T09:10:46.9370770Z shl.b16 %rs139, %rs118, 4; 2026-02-21T09:10:46.9370926Z shl.b16 %rs140, %rs120, 4; 2026-02-21T09:10:46.9371074Z shl.b16 %rs141, %rs122, 4; 2026-02-21T09:10:46.9371231Z shl.b16 %rs142, %rs124, 4; 2026-02-21T09:10:46.9371383Z shl.b16 %rs143, %rs126, 4; 2026-02-21T09:10:46.9371574Z shl.b16 %rs144, %rs128, 4; 2026-02-21T09:10:46.9371835Z .loc 1 79 58 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:79:58 2026-02-21T09:10:46.9372124Z selp.b16 %rs145, %rs129, %rs113, %p53; 2026-02-21T09:10:46.9372309Z cvt.s16.s8 %rs146, %rs145; 2026-02-21T09:10:46.9372491Z 
shr.s16 %rs147, %rs146, 4; 2026-02-21T09:10:46.9372658Z selp.b16 %rs148, %rs130, %rs115, %p53; 2026-02-21T09:10:46.9372835Z cvt.s16.s8 %rs149, %rs148; 2026-02-21T09:10:46.9372993Z shr.s16 %rs150, %rs149, 4; 2026-02-21T09:10:46.9373153Z selp.b16 %rs151, %rs131, %rs117, %p53; 2026-02-21T09:10:46.9373333Z cvt.s16.s8 %rs152, %rs151; 2026-02-21T09:10:46.9373493Z shr.s16 %rs153, %rs152, 4; 2026-02-21T09:10:46.9373655Z selp.b16 %rs154, %rs132, %rs119, %p53; 2026-02-21T09:10:46.9373858Z cvt.s16.s8 %rs155, %rs154; 2026-02-21T09:10:46.9374008Z shr.s16 %rs156, %rs155, 4; 2026-02-21T09:10:46.9374169Z selp.b16 %rs157, %rs133, %rs121, %p53; 2026-02-21T09:10:46.9374339Z cvt.s16.s8 %rs158, %rs157; 2026-02-21T09:10:46.9374496Z shr.s16 %rs159, %rs158, 4; 2026-02-21T09:10:46.9374652Z selp.b16 %rs160, %rs134, %rs123, %p53; 2026-02-21T09:10:46.9374830Z cvt.s16.s8 %rs161, %rs160; 2026-02-21T09:10:46.9374986Z shr.s16 %rs162, %rs161, 4; 2026-02-21T09:10:46.9375143Z selp.b16 %rs163, %rs135, %rs125, %p53; 2026-02-21T09:10:46.9375321Z cvt.s16.s8 %rs164, %rs163; 2026-02-21T09:10:46.9375498Z shr.s16 %rs165, %rs164, 4; 2026-02-21T09:10:46.9375664Z selp.b16 %rs166, %rs136, %rs127, %p53; 2026-02-21T09:10:46.9375833Z cvt.s16.s8 %rs167, %rs166; 2026-02-21T09:10:46.9375991Z shr.s16 %rs168, %rs167, 4; 2026-02-21T09:10:46.9376147Z selp.b16 %rs169, %rs137, %rs114, %p53; 2026-02-21T09:10:46.9376325Z cvt.s16.s8 %rs170, %rs169; 2026-02-21T09:10:46.9376499Z shr.s16 %rs171, %rs170, 4; 2026-02-21T09:10:46.9376666Z selp.b16 %rs172, %rs138, %rs116, %p53; 2026-02-21T09:10:46.9376844Z cvt.s16.s8 %rs173, %rs172; 2026-02-21T09:10:46.9376996Z shr.s16 %rs174, %rs173, 4; 2026-02-21T09:10:46.9377160Z selp.b16 %rs175, %rs139, %rs118, %p53; 2026-02-21T09:10:46.9377330Z cvt.s16.s8 %rs176, %rs175; 2026-02-21T09:10:46.9377491Z shr.s16 %rs177, %rs176, 4; 2026-02-21T09:10:46.9377649Z selp.b16 %rs178, %rs140, %rs120, %p53; 2026-02-21T09:10:46.9377831Z cvt.s16.s8 %rs179, %rs178; 2026-02-21T09:10:46.9377985Z shr.s16 %rs180, 
%rs179, 4; 2026-02-21T09:10:46.9378156Z selp.b16 %rs181, %rs141, %rs122, %p53; 2026-02-21T09:10:46.9378325Z cvt.s16.s8 %rs182, %rs181; 2026-02-21T09:10:46.9378485Z shr.s16 %rs183, %rs182, 4; 2026-02-21T09:10:46.9378648Z selp.b16 %rs184, %rs142, %rs124, %p53; 2026-02-21T09:10:46.9378817Z cvt.s16.s8 %rs185, %rs184; 2026-02-21T09:10:46.9378973Z shr.s16 %rs186, %rs185, 4; 2026-02-21T09:10:46.9379132Z selp.b16 %rs187, %rs143, %rs126, %p53; 2026-02-21T09:10:46.9379307Z cvt.s16.s8 %rs188, %rs187; 2026-02-21T09:10:46.9379456Z shr.s16 %rs189, %rs188, 4; 2026-02-21T09:10:46.9379620Z selp.b16 %rs190, %rs144, %rs128, %p53; 2026-02-21T09:10:46.9379788Z cvt.s16.s8 %rs191, %rs190; 2026-02-21T09:10:46.9379947Z shr.s16 %rs192, %rs191, 4; 2026-02-21T09:10:46.9380211Z .loc 1 84 32 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:84:32 2026-02-21T09:10:46.9380492Z cvt.rn.f32.s16 %r360, %rs147; 2026-02-21T09:10:46.9380661Z cvt.rn.f32.s16 %r361, %rs150; 2026-02-21T09:10:46.9380820Z cvt.rn.f32.s16 %r362, %rs153; 2026-02-21T09:10:46.9380985Z cvt.rn.f32.s16 %r363, %rs156; 2026-02-21T09:10:46.9381139Z cvt.rn.f32.s16 %r364, %rs159; 2026-02-21T09:10:46.9381300Z cvt.rn.f32.s16 %r365, %rs162; 2026-02-21T09:10:46.9381453Z cvt.rn.f32.s16 %r366, %rs165; 2026-02-21T09:10:46.9381642Z cvt.rn.f32.s16 %r367, %rs168; 2026-02-21T09:10:46.9381801Z cvt.rn.f32.s16 %r368, %rs171; 2026-02-21T09:10:46.9381955Z cvt.rn.f32.s16 %r369, %rs174; 2026-02-21T09:10:46.9382116Z cvt.rn.f32.s16 %r370, %rs177; 2026-02-21T09:10:46.9382267Z cvt.rn.f32.s16 %r371, %rs180; 2026-02-21T09:10:46.9382426Z cvt.rn.f32.s16 %r372, %rs183; 2026-02-21T09:10:46.9382577Z cvt.rn.f32.s16 %r373, %rs186; 2026-02-21T09:10:46.9382736Z cvt.rn.f32.s16 %r374, %rs189; 2026-02-21T09:10:46.9382885Z cvt.rn.f32.s16 %r375, %rs192; 2026-02-21T09:10:46.9383048Z st.shared.b32 [%r19], %r360; 2026-02-21T09:10:46.9383209Z st.shared.b32 [%r19+8], %r361; 2026-02-21T09:10:46.9383429Z st.shared.b32 [%r20], %r362; 2026-02-21T09:10:46.9383594Z 
st.shared.b32 [%r20+8], %r363; 2026-02-21T09:10:46.9383758Z st.shared.b32 [%r21], %r364; 2026-02-21T09:10:46.9383923Z st.shared.b32 [%r21+8], %r365; 2026-02-21T09:10:46.9384082Z st.shared.b32 [%r22], %r366; 2026-02-21T09:10:46.9384245Z st.shared.b32 [%r22+8], %r367; 2026-02-21T09:10:46.9384403Z st.shared.b32 [%r23], %r368; 2026-02-21T09:10:46.9384566Z st.shared.b32 [%r23+8], %r369; 2026-02-21T09:10:46.9384762Z st.shared.b32 [%r24], %r370; 2026-02-21T09:10:46.9384928Z st.shared.b32 [%r24+8], %r371; 2026-02-21T09:10:46.9385096Z st.shared.b32 [%r25], %r372; 2026-02-21T09:10:46.9385253Z st.shared.b32 [%r25+8], %r373; 2026-02-21T09:10:46.9385418Z st.shared.b32 [%r26], %r374; 2026-02-21T09:10:46.9385574Z st.shared.b32 [%r26+8], %r375; 2026-02-21T09:10:46.9385853Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9386138Z shl.b32 %r376, %r587, 3; 2026-02-21T09:10:46.9386307Z add.s32 %r377, %r55, %r376; 2026-02-21T09:10:46.9386465Z add.s32 %r584, %r377, 69632; 2026-02-21T09:10:46.9386676Z $L__tmp12: 2026-02-21T09:10:46.9386971Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9387304Z // begin inline asm 2026-02-21T09:10:46.9387470Z fence.proxy.async.shared::cta; 2026-02-21T09:10:46.9387659Z // end inline asm 2026-02-21T09:10:46.9387803Z bar.sync 0; 2026-02-21T09:10:46.9387936Z @%p57 bra $L__BB0_6; 2026-02-21T09:10:46.9388138Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:10:46.9388357Z elect.sync %r390|%p76, -1; 2026-02-21T09:10:46.9388526Z mov.b32 %r380, 136317200; 2026-02-21T09:10:46.9388682Z // begin inline asm 2026-02-21T09:10:46.9388920Z @%p76 tcgen05.mma.cta_group::1.kind::tf32 [ %r582 + 0 ], [ %r280 + 0 ], %rd64, %r380, %p75; 2026-02-21T09:10:46.9389189Z // end inline asm 2026-02-21T09:10:46.9389326Z // begin inline asm 2026-02-21T09:10:46.9389560Z @%p76 tcgen05.mma.cta_group::1.kind::tf32 [ %r582 + 0 ], [ %r280 + 8 ], %rd65, %r380, 
%p75; 2026-02-21T09:10:46.9389815Z // end inline asm 2026-02-21T09:10:46.9389957Z // begin inline asm 2026-02-21T09:10:46.9390182Z @%p76 tcgen05.mma.cta_group::1.kind::tf32 [ %r582 + 0 ], [ %r280 + 16 ], %rd66, %r380, %p75; 2026-02-21T09:10:46.9390446Z // end inline asm 2026-02-21T09:10:46.9390589Z // begin inline asm 2026-02-21T09:10:46.9390812Z @%p76 tcgen05.mma.cta_group::1.kind::tf32 [ %r582 + 0 ], [ %r280 + 24 ], %rd67, %r380, %p75; 2026-02-21T09:10:46.9391075Z // end inline asm 2026-02-21T09:10:46.9391214Z cvt.u64.u32 %rd85, %r584; 2026-02-21T09:10:46.9391371Z // begin inline asm 2026-02-21T09:10:46.9391599Z @%p76 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd85]; 2026-02-21T09:10:46.9391837Z // end inline asm 2026-02-21T09:10:46.9391980Z bra.uni $L__BB0_6; 2026-02-21T09:10:46.9392156Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:10:46.9392396Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:46.9392596Z setp.lt.u32 %p98, %r1, 64; 2026-02-21T09:10:46.9392758Z mov.b32 %r405, 1; 2026-02-21T09:10:46.9393047Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9393375Z // begin inline asm 2026-02-21T09:10:46.9393508Z 2026-02-21T09:10:46.9393631Z { 2026-02-21T09:10:46.9393761Z .reg .pred complete; 2026-02-21T09:10:46.9393909Z waitLoop: 2026-02-21T09:10:46.9394102Z mbarrier.try_wait.parity.shared.b64 complete, [%r584], %r405; 2026-02-21T09:10:46.9394334Z @!complete bra.uni waitLoop; 2026-02-21T09:10:46.9394495Z } 2026-02-21T09:10:46.9394559Z 2026-02-21T09:10:46.9394615Z // end inline asm 2026-02-21T09:10:46.9394754Z $L__tmp13: 2026-02-21T09:10:46.9394992Z .loc 1 48 125 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:48:125 2026-02-21T09:10:46.9395330Z cp.async.wait_group 0; 2026-02-21T09:10:46.9395488Z bar.sync 0; 2026-02-21T09:10:46.9395618Z // begin inline asm 2026-02-21T09:10:46.9395793Z @%p93 mbarrier.inval.shared::cta.b64 [%r298]; 2026-02-21T09:10:46.9395981Z // 
end inline asm 2026-02-21T09:10:46.9396123Z bar.sync 0; 2026-02-21T09:10:46.9396255Z // begin inline asm 2026-02-21T09:10:46.9396431Z @%p93 mbarrier.inval.shared::cta.b64 [%r151]; 2026-02-21T09:10:46.9396618Z // end inline asm 2026-02-21T09:10:46.9396794Z add.s32 %r408, %r55, 69632; 2026-02-21T09:10:46.9396948Z // begin inline asm 2026-02-21T09:10:46.9397113Z @%p93 mbarrier.inval.shared::cta.b64 [%r408]; 2026-02-21T09:10:46.9397298Z // end inline asm 2026-02-21T09:10:46.9397429Z bar.sync 0; 2026-02-21T09:10:46.9397563Z // begin inline asm 2026-02-21T09:10:46.9397719Z @%p93 mbarrier.inval.shared::cta.b64 [%r149]; 2026-02-21T09:10:46.9397921Z // end inline asm 2026-02-21T09:10:46.9398058Z $L__tmp14: 2026-02-21T09:10:46.9398356Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9398724Z // begin inline asm 2026-02-21T09:10:46.9399102Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r410, %r411, %r412, %r413, %r414, %r415, %r416, %r417, %r418, %r419, %r420, %r421, %r422, %r423, %r424, %r425}, [%r477 + 0]; 2026-02-21T09:10:46.9399501Z // end inline asm 2026-02-21T09:10:46.9399642Z // begin inline asm 2026-02-21T09:10:46.9400031Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r427, %r428, %r429, %r430, %r431, %r432, %r433, %r434, %r435, %r436, %r437, %r438, %r439, %r440, %r441, %r442}, [%r477 + 16]; 2026-02-21T09:10:46.9400424Z // end inline asm 2026-02-21T09:10:46.9400572Z // begin inline asm 2026-02-21T09:10:46.9400922Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r444, %r445, %r446, %r447, %r448, %r449, %r450, %r451, %r452, %r453, %r454, %r455, %r456, %r457, %r458, %r459}, [%r477 + 32]; 2026-02-21T09:10:46.9401312Z // end inline asm 2026-02-21T09:10:46.9401461Z // begin inline asm 2026-02-21T09:10:46.9401859Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r461, %r462, %r463, %r464, %r465, %r466, %r467, %r468, %r469, %r470, %r471, %r472, %r473, %r474, %r475, %r476}, [%r477 + 48]; 2026-02-21T09:10:46.9402248Z // end 
inline asm 2026-02-21T09:10:46.9402387Z // begin inline asm 2026-02-21T09:10:46.9402551Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:46.9402719Z // end inline asm 2026-02-21T09:10:46.9402871Z cvt.u64.u32 %rd90, %r410; 2026-02-21T09:10:46.9403041Z cvt.u64.u32 %rd91, %r411; 2026-02-21T09:10:46.9403203Z shl.b64 %rd92, %rd91, 32; 2026-02-21T09:10:46.9403370Z or.b64 %rd93, %rd90, %rd92; 2026-02-21T09:10:46.9403528Z $L__tmp15: 2026-02-21T09:10:46.9403790Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9404096Z mov.b64 {%r482, %r483}, %rd93; 2026-02-21T09:10:46.9404285Z cvt.rn.bf16x2.f32 %r484, %r483, %r482; 2026-02-21T09:10:46.9404463Z $L__tmp16: 2026-02-21T09:10:46.9404768Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9405120Z cvt.u64.u32 %rd94, %r412; 2026-02-21T09:10:46.9405280Z cvt.u64.u32 %rd95, %r413; 2026-02-21T09:10:46.9405444Z shl.b64 %rd96, %rd95, 32; 2026-02-21T09:10:46.9405600Z or.b64 %rd97, %rd94, %rd96; 2026-02-21T09:10:46.9405762Z $L__tmp17: 2026-02-21T09:10:46.9406007Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9406332Z mov.b64 {%r485, %r486}, %rd97; 2026-02-21T09:10:46.9406499Z cvt.rn.bf16x2.f32 %r487, %r486, %r485; 2026-02-21T09:10:46.9406675Z $L__tmp18: 2026-02-21T09:10:46.9406962Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9407283Z cvt.u64.u32 %rd98, %r414; 2026-02-21T09:10:46.9407344Z cvt.u64.u32 %rd99, %r415; 2026-02-21T09:10:46.9407414Z shl.b64 %rd100, %rd99, 32; 2026-02-21T09:10:46.9407514Z or.b64 %rd101, %rd98, %rd100; 2026-02-21T09:10:46.9407567Z $L__tmp19: 2026-02-21T09:10:46.9407745Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9407809Z mov.b64 {%r488, %r489}, %rd101; 2026-02-21T09:10:46.9407877Z cvt.rn.bf16x2.f32 %r490, %r489, 
%r488; 2026-02-21T09:10:46.9407937Z $L__tmp20: 2026-02-21T09:10:46.9408146Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9408233Z cvt.u64.u32 %rd102, %r416; 2026-02-21T09:10:46.9408293Z cvt.u64.u32 %rd103, %r417; 2026-02-21T09:10:46.9408361Z shl.b64 %rd104, %rd103, 32; 2026-02-21T09:10:46.9408421Z or.b64 %rd105, %rd102, %rd104; 2026-02-21T09:10:46.9408473Z $L__tmp21: 2026-02-21T09:10:46.9408644Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9408706Z mov.b64 {%r491, %r492}, %rd105; 2026-02-21T09:10:46.9408774Z cvt.rn.bf16x2.f32 %r493, %r492, %r491; 2026-02-21T09:10:46.9408826Z $L__tmp22: 2026-02-21T09:10:46.9409060Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9409121Z cvt.u64.u32 %rd106, %r418; 2026-02-21T09:10:46.9409180Z cvt.u64.u32 %rd107, %r419; 2026-02-21T09:10:46.9409247Z shl.b64 %rd108, %rd107, 32; 2026-02-21T09:10:46.9409333Z or.b64 %rd109, %rd106, %rd108; 2026-02-21T09:10:46.9409389Z $L__tmp23: 2026-02-21T09:10:46.9409559Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9409618Z mov.b64 {%r494, %r495}, %rd109; 2026-02-21T09:10:46.9409685Z cvt.rn.bf16x2.f32 %r496, %r495, %r494; 2026-02-21T09:10:46.9409737Z $L__tmp24: 2026-02-21T09:10:46.9409948Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9410008Z cvt.u64.u32 %rd110, %r420; 2026-02-21T09:10:46.9410066Z cvt.u64.u32 %rd111, %r421; 2026-02-21T09:10:46.9410134Z shl.b64 %rd112, %rd111, 32; 2026-02-21T09:10:46.9410193Z or.b64 %rd113, %rd110, %rd112; 2026-02-21T09:10:46.9410245Z $L__tmp25: 2026-02-21T09:10:46.9410412Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9410471Z mov.b64 {%r497, %r498}, %rd113; 2026-02-21T09:10:46.9410539Z 
cvt.rn.bf16x2.f32 %r499, %r498, %r497; 2026-02-21T09:10:46.9410592Z $L__tmp26: 2026-02-21T09:10:46.9410802Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9410861Z cvt.u64.u32 %rd114, %r422; 2026-02-21T09:10:46.9410918Z cvt.u64.u32 %rd115, %r423; 2026-02-21T09:10:46.9410985Z shl.b64 %rd116, %rd115, 32; 2026-02-21T09:10:46.9411043Z or.b64 %rd117, %rd114, %rd116; 2026-02-21T09:10:46.9411095Z $L__tmp27: 2026-02-21T09:10:46.9411266Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9411326Z mov.b64 {%r500, %r501}, %rd117; 2026-02-21T09:10:46.9411392Z cvt.rn.bf16x2.f32 %r502, %r501, %r500; 2026-02-21T09:10:46.9411442Z $L__tmp28: 2026-02-21T09:10:46.9411683Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9411745Z cvt.u64.u32 %rd118, %r424; 2026-02-21T09:10:46.9411805Z cvt.u64.u32 %rd119, %r425; 2026-02-21T09:10:46.9411871Z shl.b64 %rd120, %rd119, 32; 2026-02-21T09:10:46.9411930Z or.b64 %rd121, %rd118, %rd120; 2026-02-21T09:10:46.9411982Z $L__tmp29: 2026-02-21T09:10:46.9412142Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9412208Z mov.b64 {%r503, %r504}, %rd121; 2026-02-21T09:10:46.9412273Z cvt.rn.bf16x2.f32 %r505, %r504, %r503; 2026-02-21T09:10:46.9412323Z $L__tmp30: 2026-02-21T09:10:46.9412567Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9412624Z cvt.u64.u32 %rd122, %r427; 2026-02-21T09:10:46.9412682Z cvt.u64.u32 %rd123, %r428; 2026-02-21T09:10:46.9412747Z shl.b64 %rd124, %rd123, 32; 2026-02-21T09:10:46.9412807Z or.b64 %rd125, %rd122, %rd124; 2026-02-21T09:10:46.9412859Z $L__tmp31: 2026-02-21T09:10:46.9413022Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9413114Z mov.b64 {%r506, %r507}, 
%rd125; 2026-02-21T09:10:46.9413179Z cvt.rn.bf16x2.f32 %r508, %r507, %r506; 2026-02-21T09:10:46.9413231Z $L__tmp32: 2026-02-21T09:10:46.9413442Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9413501Z cvt.u64.u32 %rd126, %r429; 2026-02-21T09:10:46.9413559Z cvt.u64.u32 %rd127, %r430; 2026-02-21T09:10:46.9413626Z shl.b64 %rd128, %rd127, 32; 2026-02-21T09:10:46.9413684Z or.b64 %rd129, %rd126, %rd128; 2026-02-21T09:10:46.9413736Z $L__tmp33: 2026-02-21T09:10:46.9413922Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9413991Z mov.b64 {%r509, %r510}, %rd129; 2026-02-21T09:10:46.9414057Z cvt.rn.bf16x2.f32 %r511, %r510, %r509; 2026-02-21T09:10:46.9414109Z $L__tmp34: 2026-02-21T09:10:46.9414350Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9414410Z cvt.u64.u32 %rd130, %r431; 2026-02-21T09:10:46.9414468Z cvt.u64.u32 %rd131, %r432; 2026-02-21T09:10:46.9414536Z shl.b64 %rd132, %rd131, 32; 2026-02-21T09:10:46.9414596Z or.b64 %rd133, %rd130, %rd132; 2026-02-21T09:10:46.9414648Z $L__tmp35: 2026-02-21T09:10:46.9414809Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9414877Z mov.b64 {%r512, %r513}, %rd133; 2026-02-21T09:10:46.9414942Z cvt.rn.bf16x2.f32 %r514, %r513, %r512; 2026-02-21T09:10:46.9414994Z $L__tmp36: 2026-02-21T09:10:46.9415206Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9415263Z cvt.u64.u32 %rd134, %r433; 2026-02-21T09:10:46.9415321Z cvt.u64.u32 %rd135, %r434; 2026-02-21T09:10:46.9415382Z shl.b64 %rd136, %rd135, 32; 2026-02-21T09:10:46.9415450Z or.b64 %rd137, %rd134, %rd136; 2026-02-21T09:10:46.9415502Z $L__tmp37: 2026-02-21T09:10:46.9415664Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 
2026-02-21T09:10:46.9415730Z mov.b64 {%r515, %r516}, %rd137; 2026-02-21T09:10:46.9415794Z cvt.rn.bf16x2.f32 %r517, %r516, %r515; 2026-02-21T09:10:46.9415847Z $L__tmp38: 2026-02-21T09:10:46.9416061Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9416122Z cvt.u64.u32 %rd138, %r435; 2026-02-21T09:10:46.9416183Z cvt.u64.u32 %rd139, %r436; 2026-02-21T09:10:46.9416243Z shl.b64 %rd140, %rd139, 32; 2026-02-21T09:10:46.9416312Z or.b64 %rd141, %rd138, %rd140; 2026-02-21T09:10:46.9416373Z $L__tmp39: 2026-02-21T09:10:46.9416537Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9416606Z mov.b64 {%r518, %r519}, %rd141; 2026-02-21T09:10:46.9416671Z cvt.rn.bf16x2.f32 %r520, %r519, %r518; 2026-02-21T09:10:46.9416722Z $L__tmp40: 2026-02-21T09:10:46.9416933Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9416990Z cvt.u64.u32 %rd142, %r437; 2026-02-21T09:10:46.9417048Z cvt.u64.u32 %rd143, %r438; 2026-02-21T09:10:46.9417106Z shl.b64 %rd144, %rd143, 32; 2026-02-21T09:10:46.9417172Z or.b64 %rd145, %rd142, %rd144; 2026-02-21T09:10:46.9417245Z $L__tmp41: 2026-02-21T09:10:46.9417407Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9417475Z mov.b64 {%r521, %r522}, %rd145; 2026-02-21T09:10:46.9417538Z cvt.rn.bf16x2.f32 %r523, %r522, %r521; 2026-02-21T09:10:46.9417591Z $L__tmp42: 2026-02-21T09:10:46.9417805Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9417903Z cvt.u64.u32 %rd146, %r439; 2026-02-21T09:10:46.9417962Z cvt.u64.u32 %rd147, %r440; 2026-02-21T09:10:46.9418021Z shl.b64 %rd148, %rd147, 32; 2026-02-21T09:10:46.9418088Z or.b64 %rd149, %rd146, %rd148; 2026-02-21T09:10:46.9418139Z $L__tmp43: 2026-02-21T09:10:46.9418304Z .loc 1 94 28 // 
cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9418371Z mov.b64 {%r524, %r525}, %rd149; 2026-02-21T09:10:46.9418435Z cvt.rn.bf16x2.f32 %r526, %r525, %r524; 2026-02-21T09:10:46.9418489Z $L__tmp44: 2026-02-21T09:10:46.9418718Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9418784Z cvt.u64.u32 %rd150, %r441; 2026-02-21T09:10:46.9418842Z cvt.u64.u32 %rd151, %r442; 2026-02-21T09:10:46.9418900Z shl.b64 %rd152, %rd151, 32; 2026-02-21T09:10:46.9418967Z or.b64 %rd153, %rd150, %rd152; 2026-02-21T09:10:46.9419018Z $L__tmp45: 2026-02-21T09:10:46.9419204Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9419271Z mov.b64 {%r527, %r528}, %rd153; 2026-02-21T09:10:46.9419337Z cvt.rn.bf16x2.f32 %r529, %r528, %r527; 2026-02-21T09:10:46.9419389Z $L__tmp46: 2026-02-21T09:10:46.9419595Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9419662Z cvt.u64.u32 %rd154, %r444; 2026-02-21T09:10:46.9419720Z cvt.u64.u32 %rd155, %r445; 2026-02-21T09:10:46.9419779Z shl.b64 %rd156, %rd155, 32; 2026-02-21T09:10:46.9419846Z or.b64 %rd157, %rd154, %rd156; 2026-02-21T09:10:46.9419898Z $L__tmp47: 2026-02-21T09:10:46.9420060Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9420127Z mov.b64 {%r530, %r531}, %rd157; 2026-02-21T09:10:46.9420192Z cvt.rn.bf16x2.f32 %r532, %r531, %r530; 2026-02-21T09:10:46.9420243Z $L__tmp48: 2026-02-21T09:10:46.9420451Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9420517Z cvt.u64.u32 %rd158, %r446; 2026-02-21T09:10:46.9420574Z cvt.u64.u32 %rd159, %r447; 2026-02-21T09:10:46.9420632Z shl.b64 %rd160, %rd159, 32; 2026-02-21T09:10:46.9420697Z or.b64 %rd161, %rd158, %rd160; 2026-02-21T09:10:46.9420748Z $L__tmp49: 
2026-02-21T09:10:46.9420911Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9420977Z mov.b64 {%r533, %r534}, %rd161; 2026-02-21T09:10:46.9421043Z cvt.rn.bf16x2.f32 %r535, %r534, %r533; 2026-02-21T09:10:46.9421094Z $L__tmp50: 2026-02-21T09:10:46.9421297Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9421362Z cvt.u64.u32 %rd162, %r448; 2026-02-21T09:10:46.9421421Z cvt.u64.u32 %rd163, %r449; 2026-02-21T09:10:46.9421480Z shl.b64 %rd164, %rd163, 32; 2026-02-21T09:10:46.9421592Z or.b64 %rd165, %rd162, %rd164; 2026-02-21T09:10:46.9421647Z $L__tmp51: 2026-02-21T09:10:46.9421807Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9421865Z mov.b64 {%r536, %r537}, %rd165; 2026-02-21T09:10:46.9421935Z cvt.rn.bf16x2.f32 %r538, %r537, %r536; 2026-02-21T09:10:46.9421987Z $L__tmp52: 2026-02-21T09:10:46.9422191Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9422286Z cvt.u64.u32 %rd166, %r450; 2026-02-21T09:10:46.9422346Z cvt.u64.u32 %rd167, %r451; 2026-02-21T09:10:46.9422405Z shl.b64 %rd168, %rd167, 32; 2026-02-21T09:10:46.9422469Z or.b64 %rd169, %rd166, %rd168; 2026-02-21T09:10:46.9422521Z $L__tmp53: 2026-02-21T09:10:46.9422686Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9422769Z mov.b64 {%r539, %r540}, %rd169; 2026-02-21T09:10:46.9422843Z cvt.rn.bf16x2.f32 %r541, %r540, %r539; 2026-02-21T09:10:46.9422894Z $L__tmp54: 2026-02-21T09:10:46.9423099Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9423165Z cvt.u64.u32 %rd170, %r452; 2026-02-21T09:10:46.9423223Z cvt.u64.u32 %rd171, %r453; 2026-02-21T09:10:46.9423282Z shl.b64 %rd172, %rd171, 32; 2026-02-21T09:10:46.9423348Z or.b64 %rd173, %rd170, %rd172; 
2026-02-21T09:10:46.9423400Z $L__tmp55: 2026-02-21T09:10:46.9423586Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9423647Z mov.b64 {%r542, %r543}, %rd173; 2026-02-21T09:10:46.9423721Z cvt.rn.bf16x2.f32 %r544, %r543, %r542; 2026-02-21T09:10:46.9423773Z $L__tmp56: 2026-02-21T09:10:46.9423999Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9424070Z cvt.u64.u32 %rd174, %r454; 2026-02-21T09:10:46.9424129Z cvt.u64.u32 %rd175, %r455; 2026-02-21T09:10:46.9424188Z shl.b64 %rd176, %rd175, 32; 2026-02-21T09:10:46.9424248Z or.b64 %rd177, %rd174, %rd176; 2026-02-21T09:10:46.9424308Z $L__tmp57: 2026-02-21T09:10:46.9424472Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9424531Z mov.b64 {%r545, %r546}, %rd177; 2026-02-21T09:10:46.9424611Z cvt.rn.bf16x2.f32 %r547, %r546, %r545; 2026-02-21T09:10:46.9424665Z $L__tmp58: 2026-02-21T09:10:46.9424870Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9424938Z cvt.u64.u32 %rd178, %r456; 2026-02-21T09:10:46.9424997Z cvt.u64.u32 %rd179, %r457; 2026-02-21T09:10:46.9425058Z shl.b64 %rd180, %rd179, 32; 2026-02-21T09:10:46.9425120Z or.b64 %rd181, %rd178, %rd180; 2026-02-21T09:10:46.9425181Z $L__tmp59: 2026-02-21T09:10:46.9425344Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9425403Z mov.b64 {%r548, %r549}, %rd181; 2026-02-21T09:10:46.9425475Z cvt.rn.bf16x2.f32 %r550, %r549, %r548; 2026-02-21T09:10:46.9425527Z $L__tmp60: 2026-02-21T09:10:46.9425732Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9425798Z cvt.u64.u32 %rd182, %r458; 2026-02-21T09:10:46.9425858Z cvt.u64.u32 %rd183, %r459; 2026-02-21T09:10:46.9425916Z shl.b64 %rd184, %rd183, 32; 2026-02-21T09:10:46.9425976Z 
or.b64 %rd185, %rd182, %rd184; 2026-02-21T09:10:46.9426036Z $L__tmp61: 2026-02-21T09:10:46.9426199Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9426259Z mov.b64 {%r551, %r552}, %rd185; 2026-02-21T09:10:46.9426333Z cvt.rn.bf16x2.f32 %r553, %r552, %r551; 2026-02-21T09:10:46.9426385Z $L__tmp62: 2026-02-21T09:10:46.9426589Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9426655Z cvt.u64.u32 %rd186, %r461; 2026-02-21T09:10:46.9426712Z cvt.u64.u32 %rd187, %r462; 2026-02-21T09:10:46.9426771Z shl.b64 %rd188, %rd187, 32; 2026-02-21T09:10:46.9426829Z or.b64 %rd189, %rd186, %rd188; 2026-02-21T09:10:46.9426889Z $L__tmp63: 2026-02-21T09:10:46.9427052Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9427131Z mov.b64 {%r554, %r555}, %rd189; 2026-02-21T09:10:46.9427203Z cvt.rn.bf16x2.f32 %r556, %r555, %r554; 2026-02-21T09:10:46.9427256Z $L__tmp64: 2026-02-21T09:10:46.9427466Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9427533Z cvt.u64.u32 %rd190, %r463; 2026-02-21T09:10:46.9427592Z cvt.u64.u32 %rd191, %r464; 2026-02-21T09:10:46.9427672Z shl.b64 %rd192, %rd191, 32; 2026-02-21T09:10:46.9427730Z or.b64 %rd193, %rd190, %rd192; 2026-02-21T09:10:46.9427789Z $L__tmp65: 2026-02-21T09:10:46.9427949Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9428007Z mov.b64 {%r557, %r558}, %rd193; 2026-02-21T09:10:46.9428078Z cvt.rn.bf16x2.f32 %r559, %r558, %r557; 2026-02-21T09:10:46.9428130Z $L__tmp66: 2026-02-21T09:10:46.9428333Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9428412Z cvt.u64.u32 %rd194, %r465; 2026-02-21T09:10:46.9428480Z cvt.u64.u32 %rd195, %r466; 2026-02-21T09:10:46.9428539Z shl.b64 %rd196, %rd195, 
32; 2026-02-21T09:10:46.9428597Z or.b64 %rd197, %rd194, %rd196; 2026-02-21T09:10:46.9428657Z $L__tmp67: 2026-02-21T09:10:46.9428840Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9428902Z mov.b64 {%r560, %r561}, %rd197; 2026-02-21T09:10:46.9428974Z cvt.rn.bf16x2.f32 %r562, %r561, %r560; 2026-02-21T09:10:46.9429026Z $L__tmp68: 2026-02-21T09:10:46.9429235Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9429292Z cvt.u64.u32 %rd198, %r467; 2026-02-21T09:10:46.9429358Z cvt.u64.u32 %rd199, %r468; 2026-02-21T09:10:46.9429416Z shl.b64 %rd200, %rd199, 32; 2026-02-21T09:10:46.9429476Z or.b64 %rd201, %rd198, %rd200; 2026-02-21T09:10:46.9429534Z $L__tmp69: 2026-02-21T09:10:46.9429698Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9429756Z mov.b64 {%r563, %r564}, %rd201; 2026-02-21T09:10:46.9429827Z cvt.rn.bf16x2.f32 %r565, %r564, %r563; 2026-02-21T09:10:46.9429879Z $L__tmp70: 2026-02-21T09:10:46.9430091Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9430151Z cvt.u64.u32 %rd202, %r469; 2026-02-21T09:10:46.9430215Z cvt.u64.u32 %rd203, %r470; 2026-02-21T09:10:46.9430273Z shl.b64 %rd204, %rd203, 32; 2026-02-21T09:10:46.9430330Z or.b64 %rd205, %rd202, %rd204; 2026-02-21T09:10:46.9430388Z $L__tmp71: 2026-02-21T09:10:46.9430550Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9430609Z mov.b64 {%r566, %r567}, %rd205; 2026-02-21T09:10:46.9430675Z cvt.rn.bf16x2.f32 %r568, %r567, %r566; 2026-02-21T09:10:46.9430733Z $L__tmp72: 2026-02-21T09:10:46.9430935Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9430994Z cvt.u64.u32 %rd206, %r471; 2026-02-21T09:10:46.9431059Z cvt.u64.u32 %rd207, %r472; 
2026-02-21T09:10:46.9431117Z shl.b64 %rd208, %rd207, 32; 2026-02-21T09:10:46.9431178Z or.b64 %rd209, %rd206, %rd208; 2026-02-21T09:10:46.9431238Z $L__tmp73: 2026-02-21T09:10:46.9431399Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9431457Z mov.b64 {%r569, %r570}, %rd209; 2026-02-21T09:10:46.9431521Z cvt.rn.bf16x2.f32 %r571, %r570, %r569; 2026-02-21T09:10:46.9431610Z $L__tmp74: 2026-02-21T09:10:46.9431816Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9431899Z cvt.u64.u32 %rd210, %r473; 2026-02-21T09:10:46.9431965Z cvt.u64.u32 %rd211, %r474; 2026-02-21T09:10:46.9432025Z shl.b64 %rd212, %rd211, 32; 2026-02-21T09:10:46.9432084Z or.b64 %rd213, %rd210, %rd212; 2026-02-21T09:10:46.9432144Z $L__tmp75: 2026-02-21T09:10:46.9432309Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9432368Z mov.b64 {%r572, %r573}, %rd213; 2026-02-21T09:10:46.9432459Z cvt.rn.bf16x2.f32 %r574, %r573, %r572; 2026-02-21T09:10:46.9432520Z $L__tmp76: 2026-02-21T09:10:46.9432730Z .loc 2 291 36 // standard.py:291:36 @[ cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:91:40 ] 2026-02-21T09:10:46.9432789Z cvt.u64.u32 %rd214, %r475; 2026-02-21T09:10:46.9432855Z cvt.u64.u32 %rd215, %r476; 2026-02-21T09:10:46.9432914Z shl.b64 %rd216, %rd215, 32; 2026-02-21T09:10:46.9432971Z or.b64 %rd217, %rd214, %rd216; 2026-02-21T09:10:46.9433023Z $L__tmp77: 2026-02-21T09:10:46.9433201Z .loc 1 94 28 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:94:28 2026-02-21T09:10:46.9433285Z mov.b64 {%r575, %r576}, %rd217; 2026-02-21T09:10:46.9433352Z cvt.rn.bf16x2.f32 %r577, %r576, %r575; 2026-02-21T09:10:46.9433527Z .loc 1 95 43 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:95:43 2026-02-21T09:10:46.9433604Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:46.9433684Z bar.sync 0; 2026-02-21T09:10:46.9433793Z 
st.shared.v4.b32 [%r28], {%r484, %r487, %r490, %r493}; 2026-02-21T09:10:46.9433889Z st.shared.v4.b32 [%r29], {%r496, %r499, %r502, %r505}; 2026-02-21T09:10:46.9433977Z st.shared.v4.b32 [%r30], {%r508, %r511, %r514, %r517}; 2026-02-21T09:10:46.9434069Z st.shared.v4.b32 [%r31], {%r520, %r523, %r526, %r529}; 2026-02-21T09:10:46.9434152Z st.shared.v4.b32 [%r32], {%r532, %r535, %r538, %r541}; 2026-02-21T09:10:46.9434233Z st.shared.v4.b32 [%r33], {%r544, %r547, %r550, %r553}; 2026-02-21T09:10:46.9434318Z st.shared.v4.b32 [%r34], {%r556, %r559, %r562, %r565}; 2026-02-21T09:10:46.9434409Z st.shared.v4.b32 [%r35], {%r568, %r571, %r574, %r577}; 2026-02-21T09:10:46.9434468Z // begin inline asm 2026-02-21T09:10:46.9434543Z fence.proxy.async.shared::cta; 2026-02-21T09:10:46.9434606Z // end inline asm 2026-02-21T09:10:46.9434660Z bar.sync 0; 2026-02-21T09:10:46.9434727Z elect.sync %r578|%p99, -1; 2026-02-21T09:10:46.9434791Z and.pred %p97, %p98, %p99; 2026-02-21T09:10:46.9434858Z and.b32 %r579, %r38, 1; 2026-02-21T09:10:46.9434917Z shl.b32 %r580, %r579, 14; 2026-02-21T09:10:46.9434978Z add.s32 %r480, %r55, %r580; 2026-02-21T09:10:46.9435043Z shl.b32 %r581, %r579, 6; 2026-02-21T09:10:46.9435102Z or.b32 %r478, %r581, %r300; 2026-02-21T09:10:46.9435160Z // begin inline asm 2026-02-21T09:10:46.9435345Z @%p97 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd89, {%r478, %r479}], [%r480]; 2026-02-21T09:10:46.9435402Z // end inline asm 2026-02-21T09:10:46.9435470Z cp.async.bulk.commit_group; 2026-02-21T09:10:46.9435554Z $L__BB0_8: // %._crit_edge 2026-02-21T09:10:46.9435725Z .loc 1 28 76 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:28:76 2026-02-21T09:10:46.9435798Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:46.9435853Z bar.sync 0; 2026-02-21T09:10:46.9436022Z .loc 1 28 4 // cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py:28:4 2026-02-21T09:10:46.9436079Z bar.sync 0; 2026-02-21T09:10:46.9436136Z // begin inline asm 2026-02-21T09:10:46.9436260Z @%p1 
tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r582, 256; 2026-02-21T09:10:46.9436317Z // end inline asm 2026-02-21T09:10:46.9436370Z ret; 2026-02-21T09:10:46.9436425Z $L__tmp78: 2026-02-21T09:10:46.9436488Z $L__func_end0: 2026-02-21T09:10:46.9436571Z // -- End function 2026-02-21T09:10:46.9436623Z } 2026-02-21T09:10:46.9436832Z .file 1 "/tmp/torchinductor_root/ek/cekc4xqbafuie7umlu53qoe5e7usqaqau4sgqwewqf22amkbmsqx.py" 2026-02-21T09:10:46.9437046Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:10:46.9437108Z .section .debug_abbrev 2026-02-21T09:10:46.9437160Z { 2026-02-21T09:10:46.9437254Z .b8 1 // Abbreviation Code 2026-02-21T09:10:46.9437341Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:10:46.9437421Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:46.9437533Z .b8 37 // DW_AT_producer 2026-02-21T09:10:46.9437609Z .b8 8 // DW_FORM_string 2026-02-21T09:10:46.9437681Z .b8 19 // DW_AT_language 2026-02-21T09:10:46.9437763Z .b8 5 // DW_FORM_data2 2026-02-21T09:10:46.9437835Z .b8 3 // DW_AT_name 2026-02-21T09:10:46.9437908Z .b8 8 // DW_FORM_string 2026-02-21T09:10:46.9437986Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:10:46.9438087Z .b8 6 // DW_FORM_data4 2026-02-21T09:10:46.9438161Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:10:46.9438233Z .b8 8 // DW_FORM_string 2026-02-21T09:10:46.9438310Z .b8 0 // EOM(1) 2026-02-21T09:10:46.9438399Z .b8 0 // EOM(2) 2026-02-21T09:10:46.9438485Z .b8 2 // Abbreviation Code 2026-02-21T09:10:46.9438575Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:46.9438651Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:46.9438724Z .b8 3 // DW_AT_name 2026-02-21T09:10:46.9438806Z .b8 8 // DW_FORM_string 2026-02-21T09:10:46.9438880Z .b8 32 // DW_AT_inline 2026-02-21T09:10:46.9438956Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:46.9439025Z .b8 0 // EOM(1) 2026-02-21T09:10:46.9439099Z .b8 0 // EOM(2) 2026-02-21T09:10:46.9439178Z .b8 3 // Abbreviation Code 2026-02-21T09:10:46.9439258Z .b8 46 // DW_TAG_subprogram 
2026-02-21T09:10:46.9439342Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:46.9439420Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:46.9439492Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:46.9439575Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:46.9439646Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:46.9439730Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:46.9439802Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:46.9439878Z .b8 0 // EOM(1) 2026-02-21T09:10:46.9439945Z .b8 0 // EOM(2) 2026-02-21T09:10:46.9440024Z .b8 4 // Abbreviation Code 2026-02-21T09:10:46.9440121Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:10:46.9440196Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:46.9440285Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:46.9440371Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:46.9440448Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:46.9440522Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:46.9440599Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:46.9440682Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:46.9440763Z .b8 88 // DW_AT_call_file 2026-02-21T09:10:46.9440864Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:46.9440953Z .b8 89 // DW_AT_call_line 2026-02-21T09:10:46.9441029Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:46.9441111Z .b8 87 // DW_AT_call_column 2026-02-21T09:10:46.9441197Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:46.9441292Z .b8 0 // EOM(1) 2026-02-21T09:10:46.9441362Z .b8 0 // EOM(2) 2026-02-21T09:10:46.9441442Z .b8 0 // EOM(3) 2026-02-21T09:10:46.9441498Z } 2026-02-21T09:10:46.9441596Z .section .debug_info 2026-02-21T09:10:46.9441653Z { 2026-02-21T09:10:46.9441750Z .b32 178 // Length of Unit 2026-02-21T09:10:46.9441840Z .b8 2 // DWARF version number 2026-02-21T09:10:46.9441898Z .b8 0 2026-02-21T09:10:46.9442028Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:10:46.9442143Z .b8 8 // Address Size (in bytes) 2026-02-21T09:10:46.9442251Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:10:46.9442336Z .b8 116 // DW_AT_producer 2026-02-21T09:10:46.9442400Z .b8 114 2026-02-21T09:10:46.9442488Z .b8 105 2026-02-21T09:10:46.9442545Z .b8 116 2026-02-21T09:10:46.9442608Z .b8 111 2026-02-21T09:10:46.9442661Z .b8 110 2026-02-21T09:10:46.9442715Z .b8 0 2026-02-21T09:10:46.9442792Z .b8 2 // DW_AT_language 2026-02-21T09:10:46.9442855Z .b8 0 2026-02-21T09:10:46.9442934Z .b8 99 // DW_AT_name 2026-02-21T09:10:46.9442988Z .b8 101 2026-02-21T09:10:46.9443050Z .b8 107 2026-02-21T09:10:46.9443105Z .b8 99 2026-02-21T09:10:46.9443159Z .b8 52 2026-02-21T09:10:46.9443213Z .b8 120 2026-02-21T09:10:46.9443276Z .b8 113 2026-02-21T09:10:46.9443330Z .b8 98 2026-02-21T09:10:46.9443383Z .b8 97 2026-02-21T09:10:46.9443445Z .b8 102 2026-02-21T09:10:46.9443497Z .b8 117 2026-02-21T09:10:46.9443551Z .b8 105 2026-02-21T09:10:46.9443604Z .b8 101 2026-02-21T09:10:46.9443664Z .b8 55 2026-02-21T09:10:46.9443716Z .b8 117 2026-02-21T09:10:46.9443769Z .b8 109 2026-02-21T09:10:46.9443829Z .b8 108 2026-02-21T09:10:46.9443882Z .b8 117 2026-02-21T09:10:46.9443936Z .b8 53 2026-02-21T09:10:46.9443991Z .b8 51 2026-02-21T09:10:46.9444051Z .b8 113 2026-02-21T09:10:46.9444103Z .b8 111 2026-02-21T09:10:46.9444155Z .b8 101 2026-02-21T09:10:46.9444208Z .b8 53 2026-02-21T09:10:46.9444270Z .b8 101 2026-02-21T09:10:46.9444323Z .b8 55 2026-02-21T09:10:46.9444375Z .b8 117 2026-02-21T09:10:46.9444435Z .b8 115 2026-02-21T09:10:46.9444487Z .b8 113 2026-02-21T09:10:46.9444540Z .b8 97 2026-02-21T09:10:46.9444592Z .b8 113 2026-02-21T09:10:46.9444650Z .b8 97 2026-02-21T09:10:46.9444703Z .b8 117 2026-02-21T09:10:46.9444756Z .b8 52 2026-02-21T09:10:46.9444815Z .b8 115 2026-02-21T09:10:46.9444867Z .b8 103 2026-02-21T09:10:46.9444920Z .b8 113 2026-02-21T09:10:46.9444972Z .b8 119 2026-02-21T09:10:46.9445032Z .b8 101 2026-02-21T09:10:46.9445084Z .b8 119 
2026-02-21T09:10:46.9445138Z .b8 113 2026-02-21T09:10:46.9445189Z .b8 102 2026-02-21T09:10:46.9445249Z .b8 50 2026-02-21T09:10:46.9445302Z .b8 50 2026-02-21T09:10:46.9445354Z .b8 97 2026-02-21T09:10:46.9445414Z .b8 109 2026-02-21T09:10:46.9445467Z .b8 107 2026-02-21T09:10:46.9445520Z .b8 98 2026-02-21T09:10:46.9445572Z .b8 109 2026-02-21T09:10:46.9445631Z .b8 115 2026-02-21T09:10:46.9445683Z .b8 113 2026-02-21T09:10:46.9445735Z .b8 120 2026-02-21T09:10:46.9445793Z .b8 46 2026-02-21T09:10:46.9445845Z .b8 112 2026-02-21T09:10:46.9445897Z .b8 121 2026-02-21T09:10:46.9445949Z .b8 0 2026-02-21T09:10:46.9446049Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:10:46.9446126Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:10:46.9446181Z .b8 116 2026-02-21T09:10:46.9446267Z .b8 109 2026-02-21T09:10:46.9446320Z .b8 112 2026-02-21T09:10:46.9446372Z .b8 47 2026-02-21T09:10:46.9446426Z .b8 116 2026-02-21T09:10:46.9446486Z .b8 111 2026-02-21T09:10:46.9446539Z .b8 114 2026-02-21T09:10:46.9446592Z .b8 99 2026-02-21T09:10:46.9446652Z .b8 104 2026-02-21T09:10:46.9446706Z .b8 105 2026-02-21T09:10:46.9446758Z .b8 110 2026-02-21T09:10:46.9446811Z .b8 100 2026-02-21T09:10:46.9446871Z .b8 117 2026-02-21T09:10:46.9446926Z .b8 99 2026-02-21T09:10:46.9447007Z .b8 116 2026-02-21T09:10:46.9447061Z .b8 111 2026-02-21T09:10:46.9447122Z .b8 114 2026-02-21T09:10:46.9447175Z .b8 95 2026-02-21T09:10:46.9447227Z .b8 114 2026-02-21T09:10:46.9447285Z .b8 111 2026-02-21T09:10:46.9447337Z .b8 111 2026-02-21T09:10:46.9447389Z .b8 116 2026-02-21T09:10:46.9447441Z .b8 47 2026-02-21T09:10:46.9447504Z .b8 101 2026-02-21T09:10:46.9447557Z .b8 107 2026-02-21T09:10:46.9447610Z .b8 0 2026-02-21T09:10:46.9447724Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:10:46.9447810Z .b8 95 // DW_AT_name 2026-02-21T09:10:46.9447863Z .b8 104 2026-02-21T09:10:46.9447936Z .b8 101 2026-02-21T09:10:46.9447999Z .b8 108 2026-02-21T09:10:46.9448053Z .b8 105 2026-02-21T09:10:46.9448114Z .b8 111 2026-02-21T09:10:46.9448171Z 
.b8 110 2026-02-21T09:10:46.9448222Z .b8 95 2026-02-21T09:10:46.9448272Z .b8 109 2026-02-21T09:10:46.9448321Z .b8 97 2026-02-21T09:10:46.9448379Z .b8 116 2026-02-21T09:10:46.9448453Z .b8 109 2026-02-21T09:10:46.9448506Z .b8 117 2026-02-21T09:10:46.9448564Z .b8 108 2026-02-21T09:10:46.9448614Z .b8 95 2026-02-21T09:10:46.9448665Z .b8 98 2026-02-21T09:10:46.9448715Z .b8 102 2026-02-21T09:10:46.9448774Z .b8 49 2026-02-21T09:10:46.9448825Z .b8 54 2026-02-21T09:10:46.9448875Z .b8 95 2026-02-21T09:10:46.9448926Z .b8 105 2026-02-21T09:10:46.9448984Z .b8 110 2026-02-21T09:10:46.9449035Z .b8 116 2026-02-21T09:10:46.9449085Z .b8 52 2026-02-21T09:10:46.9449142Z .b8 0 2026-02-21T09:10:46.9449216Z .b8 1 // DW_AT_inline 2026-02-21T09:10:46.9449314Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:10:46.9449402Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:10:46.9449496Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:10:46.9449584Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:46.9449695Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:10:46.9449792Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:46.9449874Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:10:46.9449959Z .b64 $L__tmp77 // DW_AT_high_pc 2026-02-21T09:10:46.9450044Z .b8 1 // DW_AT_call_file 2026-02-21T09:10:46.9450119Z .b8 91 // DW_AT_call_line 2026-02-21T09:10:46.9450198Z .b8 40 // DW_AT_call_column 2026-02-21T09:10:46.9450288Z .b8 0 // End Of Children Mark 2026-02-21T09:10:46.9450372Z .b8 0 // End Of Children Mark 2026-02-21T09:10:46.9450424Z } 2026-02-21T09:10:46.9450489Z .section .debug_macinfo { } 2026-02-21T09:10:46.9450493Z 2026-02-21T09:10:46.9450576Z ================================================================ 2026-02-21T09:10:46.9450682Z please share the reproducer above with Triton project. 2026-02-21T09:10:47.0887921Z [234s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:10:47.0887934Z 2026-02-21T09:10:47.0893080Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=64, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:10:47.0893265Z 2026-02-21T09:10:47.0893462Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:10:47.0893468Z 2026-02-21T09:10:47.0893585Z `ptxas` stderr: 2026-02-21T09:10:47.0893953Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 283 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:10:47.0894107Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:47.0894112Z 2026-02-21T09:10:47.0894527Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpd9dm9xj4.ptx -o /tmp/tmpd9dm9xj4.ptx.o 2026-02-21T09:10:47.0894531Z 2026-02-21T09:10:47.0894666Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:10:47.0894754Z ================================================================ 2026-02-21T09:10:47.0894875Z Internal Triton PTX codegen error 2026-02-21T09:10:47.0894941Z `ptxas` stderr: 2026-02-21T09:10:47.0895275Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 283 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:10:47.0895432Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:47.0895438Z 2026-02-21T09:10:47.0895807Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpd9dm9xj4.ptx -o /tmp/tmpd9dm9xj4.ptx.o 2026-02-21T09:10:47.0895811Z 2026-02-21T09:10:47.0895814Z 2026-02-21T09:10:47.0895875Z // 2026-02-21T09:10:47.0895948Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:10:47.0895999Z // 2026-02-21T09:10:47.0896002Z 2026-02-21T09:10:47.0896058Z .version 8.7 2026-02-21T09:10:47.0896122Z .target sm_100a 2026-02-21T09:10:47.0896178Z .address_size 64 2026-02-21T09:10:47.0896182Z 2026-02-21T09:10:47.0896332Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:10:47.0896422Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:10:47.0896511Z // @_helion_matmul_bf16_int4 2026-02-21T09:10:47.0896586Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:10:47.0896711Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:10:47.0896823Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:10:47.0896929Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:10:47.0897035Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:10:47.0897156Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:10:47.0897209Z ) 2026-02-21T09:10:47.0897265Z .reqntid 128 2026-02-21T09:10:47.0897331Z .maxnreg 32 2026-02-21T09:10:47.0897386Z { 2026-02-21T09:10:47.0897452Z .reg .pred %p<101>; 2026-02-21T09:10:47.0897512Z .reg .b16 %rs<148>; 2026-02-21T09:10:47.0897578Z .reg .b32 %r<492>; 2026-02-21T09:10:47.0897632Z .reg .b64 %rd<168>; 2026-02-21T09:10:47.0897687Z $L__func_begin0: 2026-02-21T09:10:47.0897691Z 2026-02-21T09:10:47.0897751Z // %bb.0: 2026-02-21T09:10:47.0897922Z .loc 1 19 0 // 
cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:19 2026-02-21T09:10:47.0897983Z mov.u32 %r1, %tid.x; 2026-02-21T09:10:47.0898093Z ld.param.b64 %rd19, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:10:47.0898160Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:10:47.0898255Z ld.param.b64 %rd37, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:10:47.0898319Z mov.b32 %r49, global_smem; 2026-02-21T09:10:47.0898385Z // begin inline asm 2026-02-21T09:10:47.0898530Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r49], 64; 2026-02-21T09:10:47.0898588Z // end inline asm 2026-02-21T09:10:47.0898718Z ld.param.b64 %rd54, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:10:47.0898774Z bar.sync 0; 2026-02-21T09:10:47.0898846Z ld.shared.b32 %r485, [global_smem]; 2026-02-21T09:10:47.0898906Z bar.sync 0; 2026-02-21T09:10:47.0898965Z // begin inline asm 2026-02-21T09:10:47.0899085Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:10:47.0899142Z // end inline asm 2026-02-21T09:10:47.0899323Z .loc 1 21 66 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:21:66 2026-02-21T09:10:47.0899414Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:10:47.0899476Z mov.u32 %r66, %ctaid.y; 2026-02-21T09:10:47.0899542Z mov.u32 %r67, %ctaid.z; 2026-02-21T09:10:47.0899603Z mov.u32 %r68, %nctaid.x; 2026-02-21T09:10:47.0899663Z mov.u32 %r69, %nctaid.y; 2026-02-21T09:10:47.0899728Z mad.lo.s32 %r70, %r67, %r69, %r66; 2026-02-21T09:10:47.0899801Z mad.lo.s32 %r71, %r70, %r68, %r3; 2026-02-21T09:10:47.0899861Z shl.b32 %r72, %r71, 8; 2026-02-21T09:10:47.0899923Z cvt.s64.s32 %rd55, %r72; 2026-02-21T09:10:47.0899996Z add.s64 %rd33, %rd54, %rd55; 2026-02-21T09:10:47.0900077Z shl.b32 %r73, %r1, 2; 2026-02-21T09:10:47.0900138Z add.s32 %r50, %r49, %r73; 2026-02-21T09:10:47.0900193Z mov.b32 %r59, 0; 2026-02-21T09:10:47.0900259Z // begin inline asm 2026-02-21T09:10:47.0900330Z @%p1 st.shared.b32 [ %r50 + 0 ], %r59; 2026-02-21T09:10:47.0900385Z // end inline asm 
2026-02-21T09:10:47.0900481Z bar.warp.sync -1; 2026-02-21T09:10:47.0900544Z setp.eq.b32 %p93, %r1, 0; 2026-02-21T09:10:47.0900603Z cvt.u64.u32 %rd18, %r49; 2026-02-21T09:10:47.0900660Z // begin inline asm 2026-02-21T09:10:47.0901023Z @%p93 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd18 + 0 ], %rd19; 2026-02-21T09:10:47.0901079Z // end inline asm 2026-02-21T09:10:47.0901136Z // begin inline asm 2026-02-21T09:10:47.0901288Z @%p93 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1; 2026-02-21T09:10:47.0901344Z // end inline asm 2026-02-21T09:10:47.0901401Z mov.b32 %r52, 32; 2026-02-21T09:10:47.0901463Z // begin inline asm 2026-02-21T09:10:47.0901778Z @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r52; 2026-02-21T09:10:47.0901836Z // end inline asm 2026-02-21T09:10:47.0901898Z mov.b32 %r113, 16; 2026-02-21T09:10:47.0901954Z // begin inline asm 2026-02-21T09:10:47.0902106Z @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r113; 2026-02-21T09:10:47.0902164Z // end inline asm 2026-02-21T09:10:47.0902231Z mov.b32 %r54, 8192; 2026-02-21T09:10:47.0902287Z // begin inline asm 2026-02-21T09:10:47.0902443Z @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r54; 2026-02-21T09:10:47.0902507Z // end inline asm 2026-02-21T09:10:47.0902564Z mov.b32 %r55, 512; 2026-02-21T09:10:47.0902620Z // begin inline asm 2026-02-21T09:10:47.0902781Z @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r55; 2026-02-21T09:10:47.0902838Z // end inline asm 2026-02-21T09:10:47.0902896Z mov.b64 %rd26, 8192; 2026-02-21T09:10:47.0902953Z // begin inline asm 2026-02-21T09:10:47.0903127Z @%p93 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd18 + 0 ], 0x0, %rd26; 2026-02-21T09:10:47.0903183Z // end inline asm 2026-02-21T09:10:47.0903239Z mov.b32 %r56, 1; 2026-02-21T09:10:47.0903302Z // begin inline asm 2026-02-21T09:10:47.0903472Z 
@%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r56; 2026-02-21T09:10:47.0903530Z // end inline asm 2026-02-21T09:10:47.0903592Z // begin inline asm 2026-02-21T09:10:47.0903754Z @%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r56; 2026-02-21T09:10:47.0903808Z // end inline asm 2026-02-21T09:10:47.0903863Z // begin inline asm 2026-02-21T09:10:47.0904019Z @%p93 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:47.0904072Z // end inline asm 2026-02-21T09:10:47.0904165Z // begin inline asm 2026-02-21T09:10:47.0904339Z @%p93 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:47.0904394Z // end inline asm 2026-02-21T09:10:47.0904449Z // begin inline asm 2026-02-21T09:10:47.0904607Z @%p93 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1; 2026-02-21T09:10:47.0904661Z // end inline asm 2026-02-21T09:10:47.0904716Z // begin inline asm 2026-02-21T09:10:47.0904869Z @%p93 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:47.0904954Z // end inline asm 2026-02-21T09:10:47.0905010Z // begin inline asm 2026-02-21T09:10:47.0905267Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd33 + 0 ], [ %rd18 + 0 ], 0x80; 2026-02-21T09:10:47.0905331Z // end inline asm 2026-02-21T09:10:47.0905388Z // begin inline asm 2026-02-21T09:10:47.0905513Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd33 + 0 ], 0x80; 2026-02-21T09:10:47.0905602Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:47.0905679Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:47.0905760Z // end inline asm 2026-02-21T09:10:47.0905825Z bar.sync 0; 2026-02-21T09:10:47.0905891Z cvta.global.u64 %rd80, %rd33; 2026-02-21T09:10:47.0906061Z .loc 1 23 67 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:23:67 
2026-02-21T09:10:47.0906150Z add.s64 %rd51, %rd33, 128; 2026-02-21T09:10:47.0906215Z bar.sync 0; 2026-02-21T09:10:47.0906272Z // begin inline asm 2026-02-21T09:10:47.0906339Z @%p1 st.shared.b32 [ %r50 + 0 ], %r59; 2026-02-21T09:10:47.0906399Z // end inline asm 2026-02-21T09:10:47.0906459Z bar.warp.sync -1; 2026-02-21T09:10:47.0906515Z // begin inline asm 2026-02-21T09:10:47.0906673Z @%p93 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd18 + 0 ], %rd37; 2026-02-21T09:10:47.0906734Z // end inline asm 2026-02-21T09:10:47.0906789Z // begin inline asm 2026-02-21T09:10:47.0906927Z @%p93 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1; 2026-02-21T09:10:47.0906989Z // end inline asm 2026-02-21T09:10:47.0907047Z // begin inline asm 2026-02-21T09:10:47.0907197Z @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r52; 2026-02-21T09:10:47.0907258Z // end inline asm 2026-02-21T09:10:47.0907313Z mov.b32 %r61, 128; 2026-02-21T09:10:47.0907367Z // begin inline asm 2026-02-21T09:10:47.0907512Z @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r61; 2026-02-21T09:10:47.0907575Z // end inline asm 2026-02-21T09:10:47.0907631Z // begin inline asm 2026-02-21T09:10:47.0907786Z @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r54; 2026-02-21T09:10:47.0907847Z // end inline asm 2026-02-21T09:10:47.0907903Z mov.b32 %r63, 4096; 2026-02-21T09:10:47.0907958Z // begin inline asm 2026-02-21T09:10:47.0908119Z @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r63; 2026-02-21T09:10:47.0908175Z // end inline asm 2026-02-21T09:10:47.0908233Z mov.b64 %rd44, 16384; 2026-02-21T09:10:47.0908290Z // begin inline asm 2026-02-21T09:10:47.0908462Z @%p93 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd18 + 0 ], 0x0, %rd44; 2026-02-21T09:10:47.0908517Z // end inline asm 2026-02-21T09:10:47.0908573Z // begin inline asm 
2026-02-21T09:10:47.0908746Z @%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0, %r56; 2026-02-21T09:10:47.0908802Z // end inline asm 2026-02-21T09:10:47.0908859Z // begin inline asm 2026-02-21T09:10:47.0909028Z @%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x1, %r56; 2026-02-21T09:10:47.0909082Z // end inline asm 2026-02-21T09:10:47.0909137Z // begin inline asm 2026-02-21T09:10:47.0909291Z @%p93 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd18 + 0 ], 0xa; 2026-02-21T09:10:47.0909346Z // end inline asm 2026-02-21T09:10:47.0909401Z // begin inline asm 2026-02-21T09:10:47.0909583Z @%p93 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:47.0909646Z // end inline asm 2026-02-21T09:10:47.0909701Z // begin inline asm 2026-02-21T09:10:47.0909852Z @%p93 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x2; 2026-02-21T09:10:47.0909914Z // end inline asm 2026-02-21T09:10:47.0909969Z // begin inline asm 2026-02-21T09:10:47.0910113Z @%p93 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd18 + 0 ], 0x0; 2026-02-21T09:10:47.0910194Z // end inline asm 2026-02-21T09:10:47.0910249Z // begin inline asm 2026-02-21T09:10:47.0910501Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd51 + 0 ], [ %rd18 + 0 ], 0x80; 2026-02-21T09:10:47.0910563Z // end inline asm 2026-02-21T09:10:47.0910618Z // begin inline asm 2026-02-21T09:10:47.0910740Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd51 + 0 ], 0x80; 2026-02-21T09:10:47.0910813Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:47.0910893Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:47.0910969Z // end inline asm 2026-02-21T09:10:47.0911024Z bar.sync 0; 2026-02-21T09:10:47.0911095Z cvta.global.u64 %rd100, %rd51; 2026-02-21T09:10:47.0911261Z .loc 1 31 74 // 
cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:31:74 2026-02-21T09:10:47.0911325Z setp.gt.u32 %p39, %r3, 8191; 2026-02-21T09:10:47.0911414Z @%p39 bra $L__BB0_8; 2026-02-21T09:10:47.0911499Z // %bb.1: // %.lr.ph 2026-02-21T09:10:47.0911721Z .loc 1 0 74 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:0:74 2026-02-21T09:10:47.0911821Z ld.param.b64 %rd17, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:10:47.0911888Z and.b32 %r4, %r1, 32; 2026-02-21T09:10:47.0912050Z .loc 1 81 38 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:81:38 2026-02-21T09:10:47.0912114Z setp.eq.b32 %p52, %r4, 0; 2026-02-21T09:10:47.0912285Z .loc 1 57 38 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:57:38 2026-02-21T09:10:47.0912345Z shl.b32 %r174, %r1, 3; 2026-02-21T09:10:47.0912404Z and.b32 %r5, %r174, 24; 2026-02-21T09:10:47.0912570Z .loc 1 43 45 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:43:45 2026-02-21T09:10:47.0912630Z bfe.u32 %r6, %r1, 2, 5; 2026-02-21T09:10:47.0912689Z shr.u32 %r175, %r1, 5; 2026-02-21T09:10:47.0912749Z and.b32 %r7, %r1, 127; 2026-02-21T09:10:47.0912813Z shl.b32 %r176, %r7, 4; 2026-02-21T09:10:47.0912875Z add.s32 %r262, %r49, %r176; 2026-02-21T09:10:47.0912937Z add.s32 %r264, %r262, 2048; 2026-02-21T09:10:47.0913004Z add.s32 %r266, %r262, 4096; 2026-02-21T09:10:47.0913061Z add.s32 %r268, %r262, 6144; 2026-02-21T09:10:47.0913120Z add.s32 %r248, %r485, 32; 2026-02-21T09:10:47.0913178Z shl.b32 %r13, %r7, 6; 2026-02-21T09:10:47.0913245Z and.b32 %r178, %r1, 31; 2026-02-21T09:10:47.0913303Z shr.u32 %r179, %r1, 1; 2026-02-21T09:10:47.0913362Z and.b32 %r180, %r179, 32; 2026-02-21T09:10:47.0913432Z or.b32 %r14, %r180, %r178; 2026-02-21T09:10:47.0913492Z shl.b32 %r181, %r178, 7; 2026-02-21T09:10:47.0913549Z shl.b32 %r182, %r1, 4; 2026-02-21T09:10:47.0913609Z and.b32 %r183, %r182, 112; 2026-02-21T09:10:47.0913676Z shr.u32 %r184, %r1, 3; 2026-02-21T09:10:47.0913733Z and.b32 %r185, %r184, 12; 2026-02-21T09:10:47.0913795Z 
or.b32 %r186, %r181, %r183; 2026-02-21T09:10:47.0913863Z or.b32 %r187, %r186, %r185; 2026-02-21T09:10:47.0913923Z add.s32 %r188, %r49, 16384; 2026-02-21T09:10:47.0913982Z add.s32 %r15, %r188, %r187; 2026-02-21T09:10:47.0914040Z xor.b32 %r189, %r187, 16; 2026-02-21T09:10:47.0914109Z add.s32 %r16, %r188, %r189; 2026-02-21T09:10:47.0914166Z xor.b32 %r190, %r187, 32; 2026-02-21T09:10:47.0914224Z add.s32 %r17, %r188, %r190; 2026-02-21T09:10:47.0914288Z xor.b32 %r191, %r187, 48; 2026-02-21T09:10:47.0914345Z add.s32 %r18, %r188, %r191; 2026-02-21T09:10:47.0914434Z xor.b32 %r192, %r187, 64; 2026-02-21T09:10:47.0914491Z add.s32 %r19, %r188, %r192; 2026-02-21T09:10:47.0914556Z xor.b32 %r193, %r187, 80; 2026-02-21T09:10:47.0914614Z add.s32 %r20, %r188, %r193; 2026-02-21T09:10:47.0914671Z xor.b32 %r194, %r187, 96; 2026-02-21T09:10:47.0914734Z add.s32 %r21, %r188, %r194; 2026-02-21T09:10:47.0914792Z xor.b32 %r195, %r187, 112; 2026-02-21T09:10:47.0914849Z add.s32 %r22, %r188, %r195; 2026-02-21T09:10:47.0914943Z bfe.u32 %r196, %r188, 4, 14; 2026-02-21T09:10:47.0915002Z cvt.u64.u32 %rd66, %r196; 2026-02-21T09:10:47.0915073Z or.b64 %rd71, %rd66, 4611686293322072064; 2026-02-21T09:10:47.0915131Z add.s32 %r197, %r49, 16416; 2026-02-21T09:10:47.0915197Z bfe.u32 %r198, %r197, 4, 14; 2026-02-21T09:10:47.0915255Z cvt.u64.u32 %rd67, %r198; 2026-02-21T09:10:47.0915322Z or.b64 %rd72, %rd67, 4611686293322072064; 2026-02-21T09:10:47.0915387Z add.s32 %r199, %r49, 16448; 2026-02-21T09:10:47.0915445Z bfe.u32 %r200, %r199, 4, 14; 2026-02-21T09:10:47.0915505Z cvt.u64.u32 %rd68, %r200; 2026-02-21T09:10:47.0915569Z or.b64 %rd73, %rd68, 4611686293322072064; 2026-02-21T09:10:47.0915675Z add.s32 %r201, %r49, 16480; 2026-02-21T09:10:47.0915737Z bfe.u32 %r202, %r201, 4, 14; 2026-02-21T09:10:47.0915796Z cvt.u64.u32 %rd69, %r202; 2026-02-21T09:10:47.0915869Z or.b64 %rd74, %rd69, 4611686293322072064; 2026-02-21T09:10:47.0915926Z or.b32 %r23, %r5, 64; 2026-02-21T09:10:47.0916009Z and.b32 %r203, %r174, 48; 
2026-02-21T09:10:47.0916071Z or.b32 %r204, %r13, %r203; 2026-02-21T09:10:47.0916135Z xor.b32 %r205, %r204, 16; 2026-02-21T09:10:47.0916192Z xor.b32 %r206, %r204, 32; 2026-02-21T09:10:47.0916247Z xor.b32 %r207, %r204, 48; 2026-02-21T09:10:47.0916320Z mad.lo.s32 %r208, %r7, 48, %r262; 2026-02-21T09:10:47.0916377Z xor.b32 %r28, %r14, 16; 2026-02-21T09:10:47.0916434Z add.s32 %r121, %r49, 20480; 2026-02-21T09:10:47.0916499Z add.s32 %r209, %r121, %r28; 2026-02-21T09:10:47.0916556Z add.s32 %r210, %r121, %r14; 2026-02-21T09:10:47.0916614Z add.s32 %r125, %r262, 8192; 2026-02-21T09:10:47.0916673Z add.s32 %r131, %r262, 14336; 2026-02-21T09:10:47.0916739Z add.s32 %r129, %r262, 12288; 2026-02-21T09:10:47.0916795Z add.s32 %r127, %r262, 10240; 2026-02-21T09:10:47.0916958Z .loc 1 38 33 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:38:33 2026-02-21T09:10:47.0917022Z shr.u32 %r211, %r3, 8; 2026-02-21T09:10:47.0917080Z and.b32 %r212, %r211, 16; 2026-02-21T09:10:47.0917240Z .loc 1 40 64 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:40:64 2026-02-21T09:10:47.0917298Z and.b32 %r29, %r3, 15; 2026-02-21T09:10:47.0917467Z .loc 1 40 30 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:40:30 2026-02-21T09:10:47.0917527Z or.b32 %r213, %r212, %r29; 2026-02-21T09:10:47.0917685Z .loc 1 42 27 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:42:27 2026-02-21T09:10:47.0917752Z shl.b32 %r434, %r213, 7; 2026-02-21T09:10:47.0917915Z .loc 1 43 32 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:43:32 2026-02-21T09:10:47.0917973Z or.b32 %r214, %r434, %r6; 2026-02-21T09:10:47.0918137Z .loc 1 44 27 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:44:27 2026-02-21T09:10:47.0918193Z shl.b32 %r215, %r3, 1; 2026-02-21T09:10:47.0918250Z and.b32 %r433, %r215, 8160; 2026-02-21T09:10:47.0918420Z .loc 1 58 53 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:53 2026-02-21T09:10:47.0918481Z shl.b32 %r216, %r214, 10; 
2026-02-21T09:10:47.0918538Z $L__tmp0: 2026-02-21T09:10:47.0918763Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0918848Z shfl.sync.idx.b32 %r32, %r175, 0, 31, -1; 2026-02-21T09:10:47.0918910Z shl.b32 %r217, %r32, 21; 2026-02-21T09:10:47.0918974Z and.b32 %r218, %r217, 6291456; 2026-02-21T09:10:47.0919066Z add.s32 %r432, %r218, %r485; 2026-02-21T09:10:47.0919131Z mov.pred %p59, -1; 2026-02-21T09:10:47.0919193Z // begin inline asm 2026-02-21T09:10:47.0919468Z @%p59 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r432 + 0], {%r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59}; 2026-02-21T09:10:47.0919528Z // end inline asm 2026-02-21T09:10:47.0919587Z // begin inline asm 2026-02-21T09:10:47.0919847Z @%p59 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r432 + 16], {%r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59, %r59}; 2026-02-21T09:10:47.0919937Z // end inline asm 2026-02-21T09:10:47.0919995Z // begin inline asm 2026-02-21T09:10:47.0920071Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.0920136Z // end inline asm 2026-02-21T09:10:47.0920195Z bar.sync 0; 2026-02-21T09:10:47.0920252Z $L__tmp1: 2026-02-21T09:10:47.0920435Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0920505Z add.s32 %r487, %r49, 21504; 2026-02-21T09:10:47.0920586Z // begin inline asm 2026-02-21T09:10:47.0920678Z @%p93 mbarrier.init.shared::cta.b64 [%r487], 1; 2026-02-21T09:10:47.0920744Z // end inline asm 2026-02-21T09:10:47.0920801Z bar.sync 0; 2026-02-21T09:10:47.0920863Z add.s32 %r109, %r49, 21512; 2026-02-21T09:10:47.0920928Z // begin inline asm 2026-02-21T09:10:47.0921041Z @%p93 mbarrier.init.shared::cta.b64 [%r109], 1; 2026-02-21T09:10:47.0921100Z // end inline asm 2026-02-21T09:10:47.0921160Z add.s32 %r270, %r49, 21520; 2026-02-21T09:10:47.0921227Z // begin inline asm 
2026-02-21T09:10:47.0921311Z @%p93 mbarrier.init.shared::cta.b64 [%r270], 1; 2026-02-21T09:10:47.0921368Z // end inline asm 2026-02-21T09:10:47.0921432Z bar.sync 0; 2026-02-21T09:10:47.0921494Z add.s32 %r111, %r49, 21528; 2026-02-21T09:10:47.0921632Z // begin inline asm 2026-02-21T09:10:47.0921718Z @%p93 mbarrier.init.shared::cta.b64 [%r111], 1; 2026-02-21T09:10:47.0921788Z // end inline asm 2026-02-21T09:10:47.0921961Z .loc 1 58 60 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:60 2026-02-21T09:10:47.0922026Z or.b32 %r219, %r216, %r5; 2026-02-21T09:10:47.0922210Z .loc 1 58 32 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:32 2026-02-21T09:10:47.0933138Z mad.wide.u32 %rd56, %r219, 2, %rd17; 2026-02-21T09:10:47.0933262Z cvt.u64.u32 %rd7, %r216; 2026-02-21T09:10:47.0933349Z add.s64 %rd57, %rd56, 65536; 2026-02-21T09:10:47.0933417Z add.s64 %rd58, %rd56, 131072; 2026-02-21T09:10:47.0933483Z add.s64 %rd59, %rd56, 196608; 2026-02-21T09:10:47.0933688Z .loc 1 58 80 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:80 2026-02-21T09:10:47.0933753Z // begin inline asm 2026-02-21T09:10:47.0933884Z cp.async.cg.shared.global [ %r262 + 0 ], [ %rd56 + 0 ], 0x10, %r113; 2026-02-21T09:10:47.0933956Z // end inline asm 2026-02-21T09:10:47.0934015Z // begin inline asm 2026-02-21T09:10:47.0934133Z cp.async.cg.shared.global [ %r264 + 0 ], [ %rd57 + 0 ], 0x10, %r113; 2026-02-21T09:10:47.0934193Z // end inline asm 2026-02-21T09:10:47.0934264Z // begin inline asm 2026-02-21T09:10:47.0934376Z cp.async.cg.shared.global [ %r266 + 0 ], [ %rd58 + 0 ], 0x10, %r113; 2026-02-21T09:10:47.0934433Z // end inline asm 2026-02-21T09:10:47.0934501Z // begin inline asm 2026-02-21T09:10:47.0934615Z cp.async.cg.shared.global [ %r268 + 0 ], [ %rd59 + 0 ], 0x10, %r113; 2026-02-21T09:10:47.0934673Z // end inline asm 2026-02-21T09:10:47.0934743Z cp.async.commit_group; 2026-02-21T09:10:47.0934931Z .loc 1 51 125 // 
cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0934991Z bar.sync 0; 2026-02-21T09:10:47.0935052Z // begin inline asm 2026-02-21T09:10:47.0935179Z @%p93 mbarrier.arrive.expect_tx.shared.b64 _, [%r270], 512; 2026-02-21T09:10:47.0935237Z // end inline asm 2026-02-21T09:10:47.0935404Z .loc 1 64 33 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:64:33 2026-02-21T09:10:47.0935569Z bar.sync 0; 2026-02-21T09:10:47.0935644Z elect.sync %r220|%p54, -1; 2026-02-21T09:10:47.0935716Z and.pred %p47, %p1, %p54; 2026-02-21T09:10:47.0935776Z // begin inline asm 2026-02-21T09:10:47.0936042Z @%p47 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r121], [%rd80, {%r433, %r59}], [%r270]; 2026-02-21T09:10:47.0936102Z // end inline asm 2026-02-21T09:10:47.0936307Z .loc 1 58 32 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:32 2026-02-21T09:10:47.0936380Z add.s64 %rd61, %rd56, 64; 2026-02-21T09:10:47.0936446Z or.b32 %r221, %r219, 32; 2026-02-21T09:10:47.0936518Z mad.wide.u32 %rd70, %r221, 2, %rd17; 2026-02-21T09:10:47.0936591Z add.s64 %rd62, %rd70, 65536; 2026-02-21T09:10:47.0936655Z add.s64 %rd63, %rd70, 131072; 2026-02-21T09:10:47.0936717Z add.s64 %rd64, %rd70, 196608; 2026-02-21T09:10:47.0936884Z .loc 1 58 80 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:80 2026-02-21T09:10:47.0936993Z // begin inline asm 2026-02-21T09:10:47.0937112Z cp.async.cg.shared.global [ %r125 + 0 ], [ %rd61 + 0 ], 0x10, %r113; 2026-02-21T09:10:47.0937169Z // end inline asm 2026-02-21T09:10:47.0937235Z // begin inline asm 2026-02-21T09:10:47.0937345Z cp.async.cg.shared.global [ %r127 + 0 ], [ %rd62 + 0 ], 0x10, %r113; 2026-02-21T09:10:47.0937435Z // end inline asm 2026-02-21T09:10:47.0937503Z // begin inline asm 2026-02-21T09:10:47.0937615Z cp.async.cg.shared.global [ %r129 + 0 ], [ %rd63 + 0 ], 0x10, %r113; 2026-02-21T09:10:47.0937673Z // end inline asm 2026-02-21T09:10:47.0937732Z // begin inline asm 
2026-02-21T09:10:47.0937853Z cp.async.cg.shared.global [ %r131 + 0 ], [ %rd64 + 0 ], 0x10, %r113; 2026-02-21T09:10:47.0937912Z // end inline asm 2026-02-21T09:10:47.0937978Z cp.async.commit_group; 2026-02-21T09:10:47.0938160Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0938221Z bar.sync 0; 2026-02-21T09:10:47.0938279Z // begin inline asm 2026-02-21T09:10:47.0938392Z @%p93 mbarrier.arrive.expect_tx.shared.b64 _, [%r111], 512; 2026-02-21T09:10:47.0938457Z // end inline asm 2026-02-21T09:10:47.0938622Z .loc 1 64 33 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:64:33 2026-02-21T09:10:47.0938680Z bar.sync 0; 2026-02-21T09:10:47.0938761Z elect.sync %r222|%p55, -1; 2026-02-21T09:10:47.0938830Z and.pred %p49, %p1, %p55; 2026-02-21T09:10:47.0938895Z add.s32 %r134, %r49, 20992; 2026-02-21T09:10:47.0938961Z // begin inline asm 2026-02-21T09:10:47.0939207Z @%p49 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r134], [%rd80, {%r433, %r113}], [%r111]; 2026-02-21T09:10:47.0939264Z // end inline asm 2026-02-21T09:10:47.0939434Z .loc 1 58 80 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:80 2026-02-21T09:10:47.0939502Z cp.async.wait_group 1; 2026-02-21T09:10:47.0939560Z bar.sync 0; 2026-02-21T09:10:47.0939660Z ld.shared.v4.b32 {%r223, %r224, %r225, %r226}, [%r208]; 2026-02-21T09:10:47.0939738Z mov.b32 {%rs1, %rs2}, %r226; 2026-02-21T09:10:47.0939803Z mov.b32 {%rs3, %rs4}, %r225; 2026-02-21T09:10:47.0939864Z mov.b32 {%rs5, %rs6}, %r224; 2026-02-21T09:10:47.0939932Z mov.b32 {%rs7, %rs8}, %r223; 2026-02-21T09:10:47.0940034Z ld.shared.v4.b32 {%r227, %r228, %r229, %r230}, [%r208+16]; 2026-02-21T09:10:47.0940100Z mov.b32 {%rs9, %rs10}, %r230; 2026-02-21T09:10:47.0940166Z mov.b32 {%rs11, %rs12}, %r229; 2026-02-21T09:10:47.0940239Z mov.b32 {%rs13, %rs14}, %r228; 2026-02-21T09:10:47.0940301Z mov.b32 {%rs15, %rs16}, %r227; 2026-02-21T09:10:47.0940397Z ld.shared.v4.b32 {%r231, 
%r232, %r233, %r234}, [%r208+32]; 2026-02-21T09:10:47.0940469Z mov.b32 {%rs17, %rs18}, %r234; 2026-02-21T09:10:47.0940531Z mov.b32 {%rs19, %rs20}, %r233; 2026-02-21T09:10:47.0940593Z mov.b32 {%rs21, %rs22}, %r232; 2026-02-21T09:10:47.0940679Z mov.b32 {%rs23, %rs24}, %r231; 2026-02-21T09:10:47.0940785Z ld.shared.v4.b32 {%r235, %r236, %r237, %r238}, [%r208+48]; 2026-02-21T09:10:47.0940845Z mov.b32 {%rs25, %rs26}, %r238; 2026-02-21T09:10:47.0940907Z mov.b32 {%rs27, %rs28}, %r237; 2026-02-21T09:10:47.0940977Z mov.b32 {%rs29, %rs30}, %r236; 2026-02-21T09:10:47.0941037Z mov.b32 {%rs31, %rs32}, %r235; 2026-02-21T09:10:47.0941203Z .loc 1 62 32 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:62:32 2026-02-21T09:10:47.0941296Z cvt.f32.bf16 %r139, %rs7; 2026-02-21T09:10:47.0941359Z cvt.f32.bf16 %r140, %rs8; 2026-02-21T09:10:47.0941419Z cvt.f32.bf16 %r141, %rs5; 2026-02-21T09:10:47.0941479Z cvt.f32.bf16 %r142, %rs6; 2026-02-21T09:10:47.0941594Z cvt.f32.bf16 %r143, %rs3; 2026-02-21T09:10:47.0941654Z cvt.f32.bf16 %r144, %rs4; 2026-02-21T09:10:47.0941714Z cvt.f32.bf16 %r145, %rs1; 2026-02-21T09:10:47.0941781Z cvt.f32.bf16 %r146, %rs2; 2026-02-21T09:10:47.0941847Z cvt.f32.bf16 %r147, %rs15; 2026-02-21T09:10:47.0941910Z cvt.f32.bf16 %r148, %rs16; 2026-02-21T09:10:47.0941970Z cvt.f32.bf16 %r149, %rs13; 2026-02-21T09:10:47.0942068Z cvt.f32.bf16 %r150, %rs14; 2026-02-21T09:10:47.0942134Z cvt.f32.bf16 %r151, %rs11; 2026-02-21T09:10:47.0942195Z cvt.f32.bf16 %r152, %rs12; 2026-02-21T09:10:47.0942266Z cvt.f32.bf16 %r153, %rs9; 2026-02-21T09:10:47.0942329Z cvt.f32.bf16 %r154, %rs10; 2026-02-21T09:10:47.0942415Z cvt.f32.bf16 %r156, %rs23; 2026-02-21T09:10:47.0942480Z cvt.f32.bf16 %r157, %rs24; 2026-02-21T09:10:47.0942548Z cvt.f32.bf16 %r158, %rs21; 2026-02-21T09:10:47.0942608Z cvt.f32.bf16 %r159, %rs22; 2026-02-21T09:10:47.0942669Z cvt.f32.bf16 %r160, %rs19; 2026-02-21T09:10:47.0942738Z cvt.f32.bf16 %r161, %rs20; 2026-02-21T09:10:47.0942799Z cvt.f32.bf16 %r162, %rs17; 
2026-02-21T09:10:47.0942860Z cvt.f32.bf16 %r163, %rs18; 2026-02-21T09:10:47.0942929Z cvt.f32.bf16 %r164, %rs31; 2026-02-21T09:10:47.0942988Z cvt.f32.bf16 %r165, %rs32; 2026-02-21T09:10:47.0943052Z cvt.f32.bf16 %r166, %rs29; 2026-02-21T09:10:47.0943114Z cvt.f32.bf16 %r167, %rs30; 2026-02-21T09:10:47.0943184Z cvt.f32.bf16 %r168, %rs27; 2026-02-21T09:10:47.0943245Z cvt.f32.bf16 %r169, %rs28; 2026-02-21T09:10:47.0943304Z cvt.f32.bf16 %r170, %rs25; 2026-02-21T09:10:47.0943374Z cvt.f32.bf16 %r171, %rs26; 2026-02-21T09:10:47.0943433Z $L__tmp2: 2026-02-21T09:10:47.0943661Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0943726Z add.s32 %r138, %r218, %r248; 2026-02-21T09:10:47.0943797Z // begin inline asm 2026-02-21T09:10:47.0944078Z @%p59 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r138 + 0], {%r139, %r140, %r141, %r142, %r143, %r144, %r145, %r146, %r147, %r148, %r149, %r150, %r151, %r152, %r153, %r154}; 2026-02-21T09:10:47.0944138Z // end inline asm 2026-02-21T09:10:47.0944206Z // begin inline asm 2026-02-21T09:10:47.0944477Z @%p59 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r138 + 16], {%r156, %r157, %r158, %r159, %r160, %r161, %r162, %r163, %r164, %r165, %r166, %r167, %r168, %r169, %r170, %r171}; 2026-02-21T09:10:47.0944539Z // end inline asm 2026-02-21T09:10:47.0944607Z // begin inline asm 2026-02-21T09:10:47.0944683Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.0944758Z // end inline asm 2026-02-21T09:10:47.0944814Z bar.sync 0; 2026-02-21T09:10:47.0944879Z $L__tmp3: 2026-02-21T09:10:47.0945054Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0945114Z // begin inline asm 2026-02-21T09:10:47.0945178Z 2026-02-21T09:10:47.0945233Z { 2026-02-21T09:10:47.0945298Z .reg .pred complete; 2026-02-21T09:10:47.0945357Z waitLoop: 2026-02-21T09:10:47.0945483Z mbarrier.try_wait.parity.shared.b64 complete, [%r270], %r59; 2026-02-21T09:10:47.0945551Z 
@!complete bra.uni waitLoop; 2026-02-21T09:10:47.0945605Z } 2026-02-21T09:10:47.0945611Z 2026-02-21T09:10:47.0945676Z // end inline asm 2026-02-21T09:10:47.0945865Z .loc 1 64 33 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:64:33 2026-02-21T09:10:47.0945932Z ld.shared.b8 %rs33, [%r210]; 2026-02-21T09:10:47.0946009Z ld.shared.b8 %rs34, [%r210+64]; 2026-02-21T09:10:47.0946077Z ld.shared.b8 %rs35, [%r210+256]; 2026-02-21T09:10:47.0946145Z ld.shared.b8 %rs36, [%r210+320]; 2026-02-21T09:10:47.0946207Z ld.shared.b8 %rs37, [%r209+128]; 2026-02-21T09:10:47.0946282Z ld.shared.b8 %rs38, [%r209+192]; 2026-02-21T09:10:47.0946369Z ld.shared.b8 %rs39, [%r209+384]; 2026-02-21T09:10:47.0946431Z ld.shared.b8 %rs40, [%r209+448]; 2026-02-21T09:10:47.0946604Z .loc 1 67 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:67:28 2026-02-21T09:10:47.0946669Z shl.b16 %rs41, %rs33, 4; 2026-02-21T09:10:47.0946729Z shl.b16 %rs42, %rs34, 4; 2026-02-21T09:10:47.0946797Z shl.b16 %rs43, %rs37, 4; 2026-02-21T09:10:47.0946857Z shl.b16 %rs44, %rs38, 4; 2026-02-21T09:10:47.0946912Z shl.b16 %rs45, %rs35, 4; 2026-02-21T09:10:47.0946973Z shl.b16 %rs46, %rs36, 4; 2026-02-21T09:10:47.0947041Z shl.b16 %rs47, %rs39, 4; 2026-02-21T09:10:47.0947123Z shl.b16 %rs48, %rs40, 4; 2026-02-21T09:10:47.0947285Z .loc 1 82 58 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:82:58 2026-02-21T09:10:47.0947364Z selp.b16 %rs49, %rs41, %rs33, %p52; 2026-02-21T09:10:47.0947426Z cvt.s16.s8 %rs50, %rs49; 2026-02-21T09:10:47.0947507Z shr.s16 %rs51, %rs50, 4; 2026-02-21T09:10:47.0947577Z selp.b16 %rs52, %rs42, %rs34, %p52; 2026-02-21T09:10:47.0947649Z cvt.s16.s8 %rs53, %rs52; 2026-02-21T09:10:47.0947707Z shr.s16 %rs54, %rs53, 4; 2026-02-21T09:10:47.0947770Z selp.b16 %rs55, %rs43, %rs37, %p52; 2026-02-21T09:10:47.0947837Z cvt.s16.s8 %rs56, %rs55; 2026-02-21T09:10:47.0947896Z shr.s16 %rs57, %rs56, 4; 2026-02-21T09:10:47.0947962Z selp.b16 %rs58, %rs44, %rs38, %p52; 2026-02-21T09:10:47.0948032Z 
cvt.s16.s8 %rs59, %rs58; 2026-02-21T09:10:47.0948091Z shr.s16 %rs60, %rs59, 4; 2026-02-21T09:10:47.0948158Z selp.b16 %rs61, %rs45, %rs35, %p52; 2026-02-21T09:10:47.0948217Z cvt.s16.s8 %rs62, %rs61; 2026-02-21T09:10:47.0948287Z shr.s16 %rs63, %rs62, 4; 2026-02-21T09:10:47.0948353Z selp.b16 %rs64, %rs46, %rs36, %p52; 2026-02-21T09:10:47.0948413Z cvt.s16.s8 %rs65, %rs64; 2026-02-21T09:10:47.0948482Z shr.s16 %rs66, %rs65, 4; 2026-02-21T09:10:47.0948547Z selp.b16 %rs67, %rs47, %rs39, %p52; 2026-02-21T09:10:47.0948607Z cvt.s16.s8 %rs68, %rs67; 2026-02-21T09:10:47.0948668Z shr.s16 %rs69, %rs68, 4; 2026-02-21T09:10:47.0948741Z selp.b16 %rs70, %rs48, %rs40, %p52; 2026-02-21T09:10:47.0948801Z cvt.s16.s8 %rs71, %rs70; 2026-02-21T09:10:47.0948859Z shr.s16 %rs72, %rs71, 4; 2026-02-21T09:10:47.0949031Z .loc 1 87 32 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:87:32 2026-02-21T09:10:47.0949094Z cvt.rn.f32.s16 %r239, %rs51; 2026-02-21T09:10:47.0949154Z cvt.rn.f32.s16 %r240, %rs54; 2026-02-21T09:10:47.0949214Z cvt.rn.f32.s16 %r241, %rs57; 2026-02-21T09:10:47.0949283Z cvt.rn.f32.s16 %r242, %rs60; 2026-02-21T09:10:47.0949342Z cvt.rn.f32.s16 %r243, %rs63; 2026-02-21T09:10:47.0949403Z cvt.rn.f32.s16 %r244, %rs66; 2026-02-21T09:10:47.0949472Z cvt.rn.f32.s16 %r245, %rs69; 2026-02-21T09:10:47.0949532Z cvt.rn.f32.s16 %r246, %rs72; 2026-02-21T09:10:47.0949592Z st.shared.b32 [%r15], %r239; 2026-02-21T09:10:47.0949660Z st.shared.b32 [%r16], %r240; 2026-02-21T09:10:47.0949721Z st.shared.b32 [%r17], %r241; 2026-02-21T09:10:47.0949782Z st.shared.b32 [%r18], %r242; 2026-02-21T09:10:47.0949840Z st.shared.b32 [%r19], %r243; 2026-02-21T09:10:47.0949907Z st.shared.b32 [%r20], %r244; 2026-02-21T09:10:47.0949968Z st.shared.b32 [%r21], %r245; 2026-02-21T09:10:47.0950027Z st.shared.b32 [%r22], %r246; 2026-02-21T09:10:47.0950091Z $L__tmp4: 2026-02-21T09:10:47.0950307Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 
2026-02-21T09:10:47.0950367Z // begin inline asm 2026-02-21T09:10:47.0950461Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.0950526Z // end inline asm 2026-02-21T09:10:47.0950583Z bar.sync 0; 2026-02-21T09:10:47.0950650Z setp.ne.b32 %p56, %r32, 0; 2026-02-21T09:10:47.0950719Z @%p56 bra $L__BB0_3; 2026-02-21T09:10:47.0950772Z // %bb.2: 2026-02-21T09:10:47.0950841Z elect.sync %r259|%p58, -1; 2026-02-21T09:10:47.0950902Z mov.b32 %r249, 134744336; 2026-02-21T09:10:47.0950971Z mov.pred %p57, 0; 2026-02-21T09:10:47.0951030Z // begin inline asm 2026-02-21T09:10:47.0951225Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r485 + 0 ], [ %r248 + 0 ], %rd71, %r249, %p57; 2026-02-21T09:10:47.0951294Z // end inline asm 2026-02-21T09:10:47.0951353Z // begin inline asm 2026-02-21T09:10:47.0951501Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r485 + 0 ], [ %r248 + 8 ], %rd72, %r249, %p59; 2026-02-21T09:10:47.0951599Z // end inline asm 2026-02-21T09:10:47.0951659Z // begin inline asm 2026-02-21T09:10:47.0951807Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r485 + 0 ], [ %r248 + 16 ], %rd73, %r249, %p59; 2026-02-21T09:10:47.0951866Z // end inline asm 2026-02-21T09:10:47.0951976Z // begin inline asm 2026-02-21T09:10:47.0952123Z @%p58 tcgen05.mma.cta_group::1.kind::tf32 [ %r485 + 0 ], [ %r248 + 24 ], %rd74, %r249, %p59; 2026-02-21T09:10:47.0952181Z // end inline asm 2026-02-21T09:10:47.0952251Z add.s32 %r261, %r49, 21504; 2026-02-21T09:10:47.0952314Z cvt.u64.u32 %rd75, %r261; 2026-02-21T09:10:47.0952397Z // begin inline asm 2026-02-21T09:10:47.0952533Z @%p58 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd75]; 2026-02-21T09:10:47.0952590Z // end inline asm 2026-02-21T09:10:47.0952646Z $L__tmp5: 2026-02-21T09:10:47.0952701Z $L__BB0_3: 2026-02-21T09:10:47.0952801Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:47.0952863Z add.s32 %r24, %r49, %r204; 2026-02-21T09:10:47.0952922Z add.s32 %r25, %r49, %r205; 2026-02-21T09:10:47.0952986Z add.s32 %r26, %r49, %r206; 
2026-02-21T09:10:47.0953045Z add.s32 %r27, %r49, %r207; 2026-02-21T09:10:47.0953220Z .loc 1 58 32 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:32 2026-02-21T09:10:47.0953292Z add.s64 %rd76, %rd56, 128; 2026-02-21T09:10:47.0953354Z cvt.u64.u32 %rd82, %r23; 2026-02-21T09:10:47.0953417Z add.s64 %rd83, %rd7, %rd82; 2026-02-21T09:10:47.0953478Z shl.b64 %rd84, %rd83, 1; 2026-02-21T09:10:47.0953548Z add.s64 %rd85, %rd17, %rd84; 2026-02-21T09:10:47.0953610Z add.s64 %rd77, %rd85, 65536; 2026-02-21T09:10:47.0953669Z add.s64 %rd78, %rd85, 131072; 2026-02-21T09:10:47.0953735Z add.s64 %rd79, %rd85, 196608; 2026-02-21T09:10:47.0953788Z mov.b32 %r263, 16; 2026-02-21T09:10:47.0953948Z .loc 1 58 80 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:80 2026-02-21T09:10:47.0954003Z // begin inline asm 2026-02-21T09:10:47.0954130Z cp.async.cg.shared.global [ %r262 + 0 ], [ %rd76 + 0 ], 0x10, %r263; 2026-02-21T09:10:47.0954187Z // end inline asm 2026-02-21T09:10:47.0954242Z // begin inline asm 2026-02-21T09:10:47.0954361Z cp.async.cg.shared.global [ %r264 + 0 ], [ %rd77 + 0 ], 0x10, %r263; 2026-02-21T09:10:47.0954420Z // end inline asm 2026-02-21T09:10:47.0954475Z // begin inline asm 2026-02-21T09:10:47.0954589Z cp.async.cg.shared.global [ %r266 + 0 ], [ %rd78 + 0 ], 0x10, %r263; 2026-02-21T09:10:47.0954648Z // end inline asm 2026-02-21T09:10:47.0954705Z // begin inline asm 2026-02-21T09:10:47.0954815Z cp.async.cg.shared.global [ %r268 + 0 ], [ %rd79 + 0 ], 0x10, %r263; 2026-02-21T09:10:47.0954876Z // end inline asm 2026-02-21T09:10:47.0954939Z cp.async.commit_group; 2026-02-21T09:10:47.0955107Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0955169Z // begin inline asm 2026-02-21T09:10:47.0955275Z @%p93 mbarrier.arrive.expect_tx.shared.b64 _, [%r270], 512; 2026-02-21T09:10:47.0955332Z // end inline asm 2026-02-21T09:10:47.0955493Z .loc 1 64 33 // 
cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:64:33 2026-02-21T09:10:47.0955589Z bar.sync 0; 2026-02-21T09:10:47.0955659Z elect.sync %r279|%p69, -1; 2026-02-21T09:10:47.0955724Z and.pred %p67, %p1, %p69; 2026-02-21T09:10:47.0955785Z mov.b32 %r273, 32; 2026-02-21T09:10:47.0955846Z // begin inline asm 2026-02-21T09:10:47.0956084Z @%p67 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r121], [%rd80, {%r433, %r273}], [%r270]; 2026-02-21T09:10:47.0956144Z // end inline asm 2026-02-21T09:10:47.0956338Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0956404Z cvt.u16.u32 %rs73, %r3; 2026-02-21T09:10:47.0956465Z shr.u16 %rs74, %rs73, 12; 2026-02-21T09:10:47.0956534Z and.b16 %rs75, %rs74, 1; 2026-02-21T09:10:47.0956598Z mul.wide.u16 %r280, %rs75, 2048; 2026-02-21T09:10:47.0956655Z shl.b32 %r281, %r29, 7; 2026-02-21T09:10:47.0956718Z or.b32 %r282, %r280, %r281; 2026-02-21T09:10:47.0956776Z or.b32 %r283, %r282, %r6; 2026-02-21T09:10:47.0956836Z shl.b32 %r284, %r283, 10; 2026-02-21T09:10:47.0956900Z or.b32 %r285, %r284, %r5; 2026-02-21T09:10:47.0956979Z or.b32 %r286, %r285, 98400; 2026-02-21T09:10:47.0957051Z mad.wide.u32 %rd166, %r286, 2, %rd17; 2026-02-21T09:10:47.0957115Z mul.wide.u32 %rd86, %r283, 2048; 2026-02-21T09:10:47.0957179Z and.b32 %r287, %r1, 3; 2026-02-21T09:10:47.0957241Z mul.wide.u32 %rd87, %r287, 16; 2026-02-21T09:10:47.0957325Z or.b64 %rd88, %rd86, %rd87; 2026-02-21T09:10:47.0957393Z add.s64 %rd89, %rd88, %rd17; 2026-02-21T09:10:47.0957455Z add.s64 %rd165, %rd89, 131264; 2026-02-21T09:10:47.0957508Z mov.b32 %r490, 1; 2026-02-21T09:10:47.0957560Z mov.b32 %r486, 0; 2026-02-21T09:10:47.0957620Z mov.b64 %rd167, 0; 2026-02-21T09:10:47.0957676Z mov.b32 %r488, %r486; 2026-02-21T09:10:47.0957734Z mov.b32 %r489, %r486; 2026-02-21T09:10:47.0957799Z mov.b32 %r491, %r486; 2026-02-21T09:10:47.0957856Z bra.uni $L__BB0_4; 2026-02-21T09:10:47.0957961Z $L__BB0_6: // in Loop: 
Header=BB0_4 Depth=1 2026-02-21T09:10:47.0958130Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0958198Z setp.lt.u64 %p86, %rd167, 464; 2026-02-21T09:10:47.0958249Z $L__tmp6: 2026-02-21T09:10:47.0958458Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0958523Z add.s32 %r389, %r490, 1; 2026-02-21T09:10:47.0958589Z setp.gt.s32 %p89, %r389, 1; 2026-02-21T09:10:47.0958650Z selp.b32 %r490, 0, %r389, %p89; 2026-02-21T09:10:47.0958719Z selp.b32 %r390, 1, 0, %p89; 2026-02-21T09:10:47.0958775Z xor.b32 %r48, %r491, %r390; 2026-02-21T09:10:47.0958825Z $L__tmp7: 2026-02-21T09:10:47.0958988Z .loc 1 58 32 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:32 2026-02-21T09:10:47.0959061Z add.s64 %rd95, %rd165, -131072; 2026-02-21T09:10:47.0959127Z add.s64 %rd96, %rd165, -65536; 2026-02-21T09:10:47.0959293Z .loc 1 58 80 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:80 2026-02-21T09:10:47.0959357Z mad.lo.s32 %r376, %r7, -48, %r42; 2026-02-21T09:10:47.0959416Z selp.b32 %r377, 16, 0, %p86; 2026-02-21T09:10:47.0959477Z // begin inline asm 2026-02-21T09:10:47.0959587Z cp.async.cg.shared.global [ %r376 + 0 ], [ %rd95 + 0 ], 0x10, %r377; 2026-02-21T09:10:47.0959641Z // end inline asm 2026-02-21T09:10:47.0959702Z add.s32 %r378, %r376, 2048; 2026-02-21T09:10:47.0959758Z // begin inline asm 2026-02-21T09:10:47.0959867Z cp.async.cg.shared.global [ %r378 + 0 ], [ %rd96 + 0 ], 0x10, %r377; 2026-02-21T09:10:47.0959923Z // end inline asm 2026-02-21T09:10:47.0959979Z add.s32 %r380, %r376, 4096; 2026-02-21T09:10:47.0960032Z // begin inline asm 2026-02-21T09:10:47.0960144Z cp.async.cg.shared.global [ %r380 + 0 ], [ %rd165 + 0 ], 0x10, %r377; 2026-02-21T09:10:47.0960204Z // end inline asm 2026-02-21T09:10:47.0960257Z add.s32 %r382, %r376, 6144; 2026-02-21T09:10:47.0960335Z // begin inline asm 2026-02-21T09:10:47.0960451Z cp.async.cg.shared.global 
[ %r382 + 0 ], [ %rd166 + 0 ], 0x10, %r377; 2026-02-21T09:10:47.0960502Z // end inline asm 2026-02-21T09:10:47.0960562Z cp.async.commit_group; 2026-02-21T09:10:47.0960726Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0960796Z and.pred %p84, %p93, %p86; 2026-02-21T09:10:47.0960851Z // begin inline asm 2026-02-21T09:10:47.0960976Z @%p84 mbarrier.arrive.expect_tx.shared.b64 _, [%r384], 512; 2026-02-21T09:10:47.0961034Z // end inline asm 2026-02-21T09:10:47.0961191Z .loc 1 64 33 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:64:33 2026-02-21T09:10:47.0961243Z bar.sync 0; 2026-02-21T09:10:47.0961307Z elect.sync %r391|%p90, -1; 2026-02-21T09:10:47.0961367Z and.pred %p91, %p86, %p90; 2026-02-21T09:10:47.0961427Z and.pred %p85, %p1, %p91; 2026-02-21T09:10:47.0961483Z cvt.u32.u64 %r392, %rd167; 2026-02-21T09:10:47.0961578Z add.s32 %r387, %r392, 48; 2026-02-21T09:10:47.0961632Z // begin inline asm 2026-02-21T09:10:47.0961893Z @%p85 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r385], [%rd80, {%r433, %r387}], [%r384]; 2026-02-21T09:10:47.0961951Z // end inline asm 2026-02-21T09:10:47.0962116Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0962201Z add.s64 %rd166, %rd166, 64; 2026-02-21T09:10:47.0962267Z add.s64 %rd165, %rd165, 64; 2026-02-21T09:10:47.0962330Z setp.lt.u64 %p92, %rd167, 480; 2026-02-21T09:10:47.0962385Z add.s64 %rd167, %rd167, 16; 2026-02-21T09:10:47.0962439Z mov.b32 %r486, %r491; 2026-02-21T09:10:47.0962496Z mov.b32 %r491, %r48; 2026-02-21T09:10:47.0962553Z @%p92 bra $L__BB0_4; 2026-02-21T09:10:47.0962606Z bra.uni $L__BB0_7; 2026-02-21T09:10:47.0962710Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:10:47.0962772Z add.s32 %r326, %r489, 1; 2026-02-21T09:10:47.0962830Z setp.gt.s32 %p74, %r326, 1; 2026-02-21T09:10:47.0962892Z selp.b32 %r489, 0, %r326, %p74; 2026-02-21T09:10:47.0963060Z .loc 1 58 80 
// cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:58:80 2026-02-21T09:10:47.0963125Z cp.async.wait_group 1; 2026-02-21T09:10:47.0963178Z bar.sync 0; 2026-02-21T09:10:47.0963238Z shl.b32 %r327, %r489, 13; 2026-02-21T09:10:47.0963296Z add.s32 %r329, %r49, %r327; 2026-02-21T09:10:47.0963355Z add.s32 %r42, %r329, %r13; 2026-02-21T09:10:47.0963453Z ld.shared.v4.b32 {%r330, %r331, %r332, %r333}, [%r42]; 2026-02-21T09:10:47.0963518Z mov.b32 {%rs76, %rs77}, %r333; 2026-02-21T09:10:47.0963580Z mov.b32 {%rs78, %rs79}, %r332; 2026-02-21T09:10:47.0963638Z mov.b32 {%rs80, %rs81}, %r331; 2026-02-21T09:10:47.0963703Z mov.b32 {%rs82, %rs83}, %r330; 2026-02-21T09:10:47.0963802Z ld.shared.v4.b32 {%r334, %r335, %r336, %r337}, [%r42+16]; 2026-02-21T09:10:47.0963862Z mov.b32 {%rs84, %rs85}, %r337; 2026-02-21T09:10:47.0963926Z mov.b32 {%rs86, %rs87}, %r336; 2026-02-21T09:10:47.0963986Z mov.b32 {%rs88, %rs89}, %r335; 2026-02-21T09:10:47.0964043Z mov.b32 {%rs90, %rs91}, %r334; 2026-02-21T09:10:47.0964141Z ld.shared.v4.b32 {%r338, %r339, %r340, %r341}, [%r42+32]; 2026-02-21T09:10:47.0964207Z mov.b32 {%rs92, %rs93}, %r341; 2026-02-21T09:10:47.0964267Z mov.b32 {%rs94, %rs95}, %r340; 2026-02-21T09:10:47.0964330Z mov.b32 {%rs96, %rs97}, %r339; 2026-02-21T09:10:47.0964399Z mov.b32 {%rs98, %rs99}, %r338; 2026-02-21T09:10:47.0964490Z ld.shared.v4.b32 {%r342, %r343, %r344, %r345}, [%r42+48]; 2026-02-21T09:10:47.0964554Z mov.b32 {%rs100, %rs101}, %r345; 2026-02-21T09:10:47.0964617Z mov.b32 {%rs102, %rs103}, %r344; 2026-02-21T09:10:47.0964682Z mov.b32 {%rs104, %rs105}, %r343; 2026-02-21T09:10:47.0964742Z mov.b32 {%rs106, %rs107}, %r342; 2026-02-21T09:10:47.0964911Z .loc 1 62 32 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:62:32 2026-02-21T09:10:47.0965006Z cvt.f32.bf16 %r291, %rs82; 2026-02-21T09:10:47.0965066Z cvt.f32.bf16 %r292, %rs83; 2026-02-21T09:10:47.0965126Z cvt.f32.bf16 %r293, %rs80; 2026-02-21T09:10:47.0965193Z cvt.f32.bf16 %r294, %rs81; 2026-02-21T09:10:47.0965249Z 
cvt.f32.bf16 %r295, %rs78; 2026-02-21T09:10:47.0965305Z cvt.f32.bf16 %r296, %rs79; 2026-02-21T09:10:47.0965361Z cvt.f32.bf16 %r297, %rs76; 2026-02-21T09:10:47.0965422Z cvt.f32.bf16 %r298, %rs77; 2026-02-21T09:10:47.0965506Z cvt.f32.bf16 %r299, %rs90; 2026-02-21T09:10:47.0965561Z cvt.f32.bf16 %r300, %rs91; 2026-02-21T09:10:47.0965620Z cvt.f32.bf16 %r301, %rs88; 2026-02-21T09:10:47.0965676Z cvt.f32.bf16 %r302, %rs89; 2026-02-21T09:10:47.0965731Z cvt.f32.bf16 %r303, %rs86; 2026-02-21T09:10:47.0965789Z cvt.f32.bf16 %r304, %rs87; 2026-02-21T09:10:47.0965846Z cvt.f32.bf16 %r305, %rs84; 2026-02-21T09:10:47.0965901Z cvt.f32.bf16 %r306, %rs85; 2026-02-21T09:10:47.0965955Z cvt.f32.bf16 %r308, %rs98; 2026-02-21T09:10:47.0966016Z cvt.f32.bf16 %r309, %rs99; 2026-02-21T09:10:47.0966071Z cvt.f32.bf16 %r310, %rs96; 2026-02-21T09:10:47.0966149Z cvt.f32.bf16 %r311, %rs97; 2026-02-21T09:10:47.0966209Z cvt.f32.bf16 %r312, %rs94; 2026-02-21T09:10:47.0966264Z cvt.f32.bf16 %r313, %rs95; 2026-02-21T09:10:47.0966321Z cvt.f32.bf16 %r314, %rs92; 2026-02-21T09:10:47.0966378Z cvt.f32.bf16 %r315, %rs93; 2026-02-21T09:10:47.0966442Z cvt.f32.bf16 %r316, %rs106; 2026-02-21T09:10:47.0966523Z cvt.f32.bf16 %r317, %rs107; 2026-02-21T09:10:47.0966582Z cvt.f32.bf16 %r318, %rs104; 2026-02-21T09:10:47.0966642Z cvt.f32.bf16 %r319, %rs105; 2026-02-21T09:10:47.0966698Z cvt.f32.bf16 %r320, %rs102; 2026-02-21T09:10:47.0966754Z cvt.f32.bf16 %r321, %rs103; 2026-02-21T09:10:47.0966810Z cvt.f32.bf16 %r322, %rs100; 2026-02-21T09:10:47.0966869Z cvt.f32.bf16 %r323, %rs101; 2026-02-21T09:10:47.0966919Z $L__tmp8: 2026-02-21T09:10:47.0967143Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0967205Z // begin inline asm 2026-02-21T09:10:47.0967255Z 2026-02-21T09:10:47.0967305Z { 2026-02-21T09:10:47.0967366Z .reg .pred complete; 2026-02-21T09:10:47.0967423Z waitLoop: 2026-02-21T09:10:47.0967544Z mbarrier.try_wait.parity.shared.b64 complete, [%r487], 
%r486; 2026-02-21T09:10:47.0967608Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.0967660Z } 2026-02-21T09:10:47.0967665Z 2026-02-21T09:10:47.0967721Z // end inline asm 2026-02-21T09:10:47.0967773Z $L__tmp9: 2026-02-21T09:10:47.0967948Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0968008Z selp.b32 %r346, 1, 0, %p74; 2026-02-21T09:10:47.0968068Z xor.b32 %r488, %r488, %r346; 2026-02-21T09:10:47.0968126Z mov.pred %p75, -1; 2026-02-21T09:10:47.0968189Z $L__tmp10: 2026-02-21T09:10:47.0968408Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0968467Z // begin inline asm 2026-02-21T09:10:47.0968768Z @%p75 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r138 + 0], {%r291, %r292, %r293, %r294, %r295, %r296, %r297, %r298, %r299, %r300, %r301, %r302, %r303, %r304, %r305, %r306}; 2026-02-21T09:10:47.0968822Z // end inline asm 2026-02-21T09:10:47.0968878Z // begin inline asm 2026-02-21T09:10:47.0969169Z @%p75 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r138 + 16], {%r308, %r309, %r310, %r311, %r312, %r313, %r314, %r315, %r316, %r317, %r318, %r319, %r320, %r321, %r322, %r323}; 2026-02-21T09:10:47.0969227Z // end inline asm 2026-02-21T09:10:47.0969282Z // begin inline asm 2026-02-21T09:10:47.0969354Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.0969415Z // end inline asm 2026-02-21T09:10:47.0969469Z bar.sync 0; 2026-02-21T09:10:47.0969521Z $L__tmp11: 2026-02-21T09:10:47.0969703Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0969764Z shl.b32 %r347, %r489, 3; 2026-02-21T09:10:47.0969848Z add.s32 %r348, %r49, %r347; 2026-02-21T09:10:47.0969911Z add.s32 %r384, %r348, 21520; 2026-02-21T09:10:47.0969967Z // begin inline asm 2026-02-21T09:10:47.0970017Z 2026-02-21T09:10:47.0970065Z { 2026-02-21T09:10:47.0970134Z .reg .pred complete; 2026-02-21T09:10:47.0970187Z waitLoop: 2026-02-21T09:10:47.0970306Z 
mbarrier.try_wait.parity.shared.b64 complete, [%r384], %r488; 2026-02-21T09:10:47.0970376Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.0970462Z } 2026-02-21T09:10:47.0970467Z 2026-02-21T09:10:47.0970523Z // end inline asm 2026-02-21T09:10:47.0970691Z .loc 1 64 33 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:64:33 2026-02-21T09:10:47.0970756Z shl.b32 %r349, %r489, 9; 2026-02-21T09:10:47.0970812Z add.s32 %r350, %r49, %r349; 2026-02-21T09:10:47.0970870Z add.s32 %r385, %r350, 20480; 2026-02-21T09:10:47.0970934Z add.s32 %r351, %r385, %r14; 2026-02-21T09:10:47.0970996Z ld.shared.b8 %rs108, [%r351]; 2026-02-21T09:10:47.0971066Z ld.shared.b8 %rs109, [%r351+64]; 2026-02-21T09:10:47.0971141Z ld.shared.b8 %rs110, [%r351+256]; 2026-02-21T09:10:47.0971229Z ld.shared.b8 %rs111, [%r351+320]; 2026-02-21T09:10:47.0971293Z add.s32 %r352, %r385, %r28; 2026-02-21T09:10:47.0971356Z ld.shared.b8 %rs112, [%r352+128]; 2026-02-21T09:10:47.0971428Z ld.shared.b8 %rs113, [%r352+192]; 2026-02-21T09:10:47.0971492Z ld.shared.b8 %rs114, [%r352+384]; 2026-02-21T09:10:47.0971620Z ld.shared.b8 %rs115, [%r352+448]; 2026-02-21T09:10:47.0971798Z .loc 1 67 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:67:28 2026-02-21T09:10:47.0971863Z shl.b16 %rs116, %rs108, 4; 2026-02-21T09:10:47.0971923Z shl.b16 %rs117, %rs109, 4; 2026-02-21T09:10:47.0971984Z shl.b16 %rs118, %rs112, 4; 2026-02-21T09:10:47.0972051Z shl.b16 %rs119, %rs113, 4; 2026-02-21T09:10:47.0972111Z shl.b16 %rs120, %rs110, 4; 2026-02-21T09:10:47.0972170Z shl.b16 %rs121, %rs111, 4; 2026-02-21T09:10:47.0972237Z shl.b16 %rs122, %rs114, 4; 2026-02-21T09:10:47.0972299Z shl.b16 %rs123, %rs115, 4; 2026-02-21T09:10:47.0972472Z .loc 1 82 58 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:82:58 2026-02-21T09:10:47.0972556Z selp.b16 %rs124, %rs116, %rs108, %p52; 2026-02-21T09:10:47.0972621Z cvt.s16.s8 %rs125, %rs124; 2026-02-21T09:10:47.0972684Z shr.s16 %rs126, %rs125, 4; 2026-02-21T09:10:47.0972756Z selp.b16 
%rs127, %rs117, %rs109, %p52; 2026-02-21T09:10:47.0972834Z cvt.s16.s8 %rs128, %rs127; 2026-02-21T09:10:47.0972898Z shr.s16 %rs129, %rs128, 4; 2026-02-21T09:10:47.0972983Z selp.b16 %rs130, %rs118, %rs112, %p52; 2026-02-21T09:10:47.0973055Z cvt.s16.s8 %rs131, %rs130; 2026-02-21T09:10:47.0973115Z shr.s16 %rs132, %rs131, 4; 2026-02-21T09:10:47.0973183Z selp.b16 %rs133, %rs119, %rs113, %p52; 2026-02-21T09:10:47.0973244Z cvt.s16.s8 %rs134, %rs133; 2026-02-21T09:10:47.0973317Z shr.s16 %rs135, %rs134, 4; 2026-02-21T09:10:47.0973384Z selp.b16 %rs136, %rs120, %rs110, %p52; 2026-02-21T09:10:47.0973446Z cvt.s16.s8 %rs137, %rs136; 2026-02-21T09:10:47.0973519Z shr.s16 %rs138, %rs137, 4; 2026-02-21T09:10:47.0973584Z selp.b16 %rs139, %rs121, %rs111, %p52; 2026-02-21T09:10:47.0973641Z cvt.s16.s8 %rs140, %rs139; 2026-02-21T09:10:47.0973698Z shr.s16 %rs141, %rs140, 4; 2026-02-21T09:10:47.0973769Z selp.b16 %rs142, %rs122, %rs114, %p52; 2026-02-21T09:10:47.0973826Z cvt.s16.s8 %rs143, %rs142; 2026-02-21T09:10:47.0973884Z shr.s16 %rs144, %rs143, 4; 2026-02-21T09:10:47.0973956Z selp.b16 %rs145, %rs123, %rs115, %p52; 2026-02-21T09:10:47.0974013Z cvt.s16.s8 %rs146, %rs145; 2026-02-21T09:10:47.0974070Z shr.s16 %rs147, %rs146, 4; 2026-02-21T09:10:47.0974236Z .loc 1 87 32 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:87:32 2026-02-21T09:10:47.0974299Z cvt.rn.f32.s16 %r353, %rs126; 2026-02-21T09:10:47.0974360Z cvt.rn.f32.s16 %r354, %rs129; 2026-02-21T09:10:47.0974417Z cvt.rn.f32.s16 %r355, %rs132; 2026-02-21T09:10:47.0974481Z cvt.rn.f32.s16 %r356, %rs135; 2026-02-21T09:10:47.0974573Z cvt.rn.f32.s16 %r357, %rs138; 2026-02-21T09:10:47.0974634Z cvt.rn.f32.s16 %r358, %rs141; 2026-02-21T09:10:47.0974699Z cvt.rn.f32.s16 %r359, %rs144; 2026-02-21T09:10:47.0974758Z cvt.rn.f32.s16 %r360, %rs147; 2026-02-21T09:10:47.0974819Z st.shared.b32 [%r15], %r353; 2026-02-21T09:10:47.0974880Z st.shared.b32 [%r16], %r354; 2026-02-21T09:10:47.0974948Z st.shared.b32 [%r17], %r355; 
2026-02-21T09:10:47.0975010Z st.shared.b32 [%r18], %r356; 2026-02-21T09:10:47.0975096Z st.shared.b32 [%r19], %r357; 2026-02-21T09:10:47.0975160Z st.shared.b32 [%r20], %r358; 2026-02-21T09:10:47.0975218Z st.shared.b32 [%r21], %r359; 2026-02-21T09:10:47.0975277Z st.shared.b32 [%r22], %r360; 2026-02-21T09:10:47.0975444Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0975510Z shl.b32 %r361, %r490, 3; 2026-02-21T09:10:47.0975569Z add.s32 %r362, %r49, %r361; 2026-02-21T09:10:47.0975627Z add.s32 %r487, %r362, 21504; 2026-02-21T09:10:47.0975687Z $L__tmp12: 2026-02-21T09:10:47.0975921Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0975979Z // begin inline asm 2026-02-21T09:10:47.0976059Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.0976115Z // end inline asm 2026-02-21T09:10:47.0976169Z bar.sync 0; 2026-02-21T09:10:47.0976254Z @%p56 bra $L__BB0_6; 2026-02-21T09:10:47.0976365Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:10:47.0976429Z elect.sync %r375|%p76, -1; 2026-02-21T09:10:47.0976488Z mov.b32 %r365, 134744336; 2026-02-21T09:10:47.0976551Z // begin inline asm 2026-02-21T09:10:47.0976706Z @%p76 tcgen05.mma.cta_group::1.kind::tf32 [ %r485 + 0 ], [ %r248 + 0 ], %rd71, %r365, %p75; 2026-02-21T09:10:47.0976760Z // end inline asm 2026-02-21T09:10:47.0976824Z // begin inline asm 2026-02-21T09:10:47.0976969Z @%p76 tcgen05.mma.cta_group::1.kind::tf32 [ %r485 + 0 ], [ %r248 + 8 ], %rd72, %r365, %p75; 2026-02-21T09:10:47.0977026Z // end inline asm 2026-02-21T09:10:47.0977082Z // begin inline asm 2026-02-21T09:10:47.0977234Z @%p76 tcgen05.mma.cta_group::1.kind::tf32 [ %r485 + 0 ], [ %r248 + 16 ], %rd73, %r365, %p75; 2026-02-21T09:10:47.0977289Z // end inline asm 2026-02-21T09:10:47.0977345Z // begin inline asm 2026-02-21T09:10:47.0977493Z @%p76 tcgen05.mma.cta_group::1.kind::tf32 [ %r485 + 0 ], [ %r248 + 24 ], %rd74, %r365, %p75; 
2026-02-21T09:10:47.0977549Z // end inline asm 2026-02-21T09:10:47.0977609Z cvt.u64.u32 %rd94, %r487; 2026-02-21T09:10:47.0977670Z // begin inline asm 2026-02-21T09:10:47.0977791Z @%p76 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd94]; 2026-02-21T09:10:47.0977845Z // end inline asm 2026-02-21T09:10:47.0977901Z bra.uni $L__BB0_6; 2026-02-21T09:10:47.0978003Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:10:47.0978091Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:47.0978147Z mov.b32 %r394, 1; 2026-02-21T09:10:47.0978367Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0978423Z // begin inline asm 2026-02-21T09:10:47.0978473Z 2026-02-21T09:10:47.0978531Z { 2026-02-21T09:10:47.0978593Z .reg .pred complete; 2026-02-21T09:10:47.0978646Z waitLoop: 2026-02-21T09:10:47.0978761Z mbarrier.try_wait.parity.shared.b64 complete, [%r487], %r394; 2026-02-21T09:10:47.0978834Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.0978884Z } 2026-02-21T09:10:47.0978887Z 2026-02-21T09:10:47.0978943Z // end inline asm 2026-02-21T09:10:47.0979000Z $L__tmp13: 2026-02-21T09:10:47.0979171Z .loc 1 51 125 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:51:125 2026-02-21T09:10:47.0979233Z cp.async.wait_group 0; 2026-02-21T09:10:47.0979287Z bar.sync 0; 2026-02-21T09:10:47.0979351Z // begin inline asm 2026-02-21T09:10:47.0979459Z @%p93 mbarrier.inval.shared::cta.b64 [%r270]; 2026-02-21T09:10:47.0979514Z // end inline asm 2026-02-21T09:10:47.0979578Z bar.sync 0; 2026-02-21T09:10:47.0979634Z // begin inline asm 2026-02-21T09:10:47.0979717Z @%p93 mbarrier.inval.shared::cta.b64 [%r111]; 2026-02-21T09:10:47.0979777Z // end inline asm 2026-02-21T09:10:47.0979836Z add.s32 %r397, %r49, 21504; 2026-02-21T09:10:47.0979893Z // begin inline asm 2026-02-21T09:10:47.0979975Z @%p93 mbarrier.inval.shared::cta.b64 [%r397]; 2026-02-21T09:10:47.0980064Z // end inline asm 2026-02-21T09:10:47.0980119Z bar.sync 0; 
2026-02-21T09:10:47.0980177Z // begin inline asm 2026-02-21T09:10:47.0980266Z @%p93 mbarrier.inval.shared::cta.b64 [%r109]; 2026-02-21T09:10:47.0980320Z // end inline asm 2026-02-21T09:10:47.0980372Z $L__tmp14: 2026-02-21T09:10:47.0980584Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0980647Z // begin inline asm 2026-02-21T09:10:47.0980934Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r399, %r400, %r401, %r402, %r403, %r404, %r405, %r406, %r407, %r408, %r409, %r410, %r411, %r412, %r413, %r414}, [%r432 + 0]; 2026-02-21T09:10:47.0980991Z // end inline asm 2026-02-21T09:10:47.0981054Z // begin inline asm 2026-02-21T09:10:47.0981332Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r416, %r417, %r418, %r419, %r420, %r421, %r422, %r423, %r424, %r425, %r426, %r427, %r428, %r429, %r430, %r431}, [%r432 + 16]; 2026-02-21T09:10:47.0981390Z // end inline asm 2026-02-21T09:10:47.0981452Z // begin inline asm 2026-02-21T09:10:47.0981523Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:47.0981633Z // end inline asm 2026-02-21T09:10:47.0981700Z cvt.u64.u32 %rd101, %r399; 2026-02-21T09:10:47.0981760Z cvt.u64.u32 %rd102, %r400; 2026-02-21T09:10:47.0981820Z shl.b64 %rd103, %rd102, 32; 2026-02-21T09:10:47.0981881Z or.b64 %rd104, %rd101, %rd103; 2026-02-21T09:10:47.0981941Z $L__tmp15: 2026-02-21T09:10:47.0982111Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0982177Z mov.b64 {%r436, %r437}, %rd104; 2026-02-21T09:10:47.0982253Z cvt.rn.bf16x2.f32 %r438, %r437, %r436; 2026-02-21T09:10:47.0982306Z $L__tmp16: 2026-02-21T09:10:47.0982514Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0982574Z cvt.u64.u32 %rd105, %r401; 2026-02-21T09:10:47.0982641Z cvt.u64.u32 %rd106, %r402; 2026-02-21T09:10:47.0982701Z shl.b64 %rd107, %rd106, 32; 2026-02-21T09:10:47.0982761Z or.b64 %rd108, %rd105, %rd107; 
2026-02-21T09:10:47.0982821Z $L__tmp17: 2026-02-21T09:10:47.0982985Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0983048Z mov.b64 {%r439, %r440}, %rd108; 2026-02-21T09:10:47.0983126Z cvt.rn.bf16x2.f32 %r441, %r440, %r439; 2026-02-21T09:10:47.0983178Z $L__tmp18: 2026-02-21T09:10:47.0983383Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0983447Z cvt.u64.u32 %rd109, %r403; 2026-02-21T09:10:47.0983513Z cvt.u64.u32 %rd110, %r404; 2026-02-21T09:10:47.0983574Z shl.b64 %rd111, %rd110, 32; 2026-02-21T09:10:47.0983634Z or.b64 %rd112, %rd109, %rd111; 2026-02-21T09:10:47.0983693Z $L__tmp19: 2026-02-21T09:10:47.0983856Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0983918Z mov.b64 {%r442, %r443}, %rd112; 2026-02-21T09:10:47.0983992Z cvt.rn.bf16x2.f32 %r444, %r443, %r442; 2026-02-21T09:10:47.0984043Z $L__tmp20: 2026-02-21T09:10:47.0984246Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0984304Z cvt.u64.u32 %rd113, %r405; 2026-02-21T09:10:47.0984369Z cvt.u64.u32 %rd114, %r406; 2026-02-21T09:10:47.0984428Z shl.b64 %rd115, %rd114, 32; 2026-02-21T09:10:47.0984512Z or.b64 %rd116, %rd113, %rd115; 2026-02-21T09:10:47.0984585Z $L__tmp21: 2026-02-21T09:10:47.0984747Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0984807Z mov.b64 {%r445, %r446}, %rd116; 2026-02-21T09:10:47.0984872Z cvt.rn.bf16x2.f32 %r447, %r446, %r445; 2026-02-21T09:10:47.0984929Z $L__tmp22: 2026-02-21T09:10:47.0985134Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0985216Z cvt.u64.u32 %rd117, %r407; 2026-02-21T09:10:47.0985281Z cvt.u64.u32 %rd118, %r408; 2026-02-21T09:10:47.0985339Z shl.b64 %rd119, %rd118, 32; 2026-02-21T09:10:47.0985397Z 
or.b64 %rd120, %rd117, %rd119; 2026-02-21T09:10:47.0985455Z $L__tmp23: 2026-02-21T09:10:47.0985616Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0985675Z mov.b64 {%r448, %r449}, %rd120; 2026-02-21T09:10:47.0985741Z cvt.rn.bf16x2.f32 %r450, %r449, %r448; 2026-02-21T09:10:47.0985800Z $L__tmp24: 2026-02-21T09:10:47.0986025Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0986083Z cvt.u64.u32 %rd121, %r409; 2026-02-21T09:10:47.0986147Z cvt.u64.u32 %rd122, %r410; 2026-02-21T09:10:47.0986206Z shl.b64 %rd123, %rd122, 32; 2026-02-21T09:10:47.0986288Z or.b64 %rd124, %rd121, %rd123; 2026-02-21T09:10:47.0986349Z $L__tmp25: 2026-02-21T09:10:47.0986509Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0986568Z mov.b64 {%r451, %r452}, %rd124; 2026-02-21T09:10:47.0986634Z cvt.rn.bf16x2.f32 %r453, %r452, %r451; 2026-02-21T09:10:47.0986694Z $L__tmp26: 2026-02-21T09:10:47.0986893Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0986951Z cvt.u64.u32 %rd125, %r411; 2026-02-21T09:10:47.0987015Z cvt.u64.u32 %rd126, %r412; 2026-02-21T09:10:47.0987075Z shl.b64 %rd127, %rd126, 32; 2026-02-21T09:10:47.0987133Z or.b64 %rd128, %rd125, %rd127; 2026-02-21T09:10:47.0987185Z $L__tmp27: 2026-02-21T09:10:47.0987354Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0987413Z mov.b64 {%r454, %r455}, %rd128; 2026-02-21T09:10:47.0987479Z cvt.rn.bf16x2.f32 %r456, %r455, %r454; 2026-02-21T09:10:47.0987538Z $L__tmp28: 2026-02-21T09:10:47.0987738Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0987795Z cvt.u64.u32 %rd129, %r413; 2026-02-21T09:10:47.0987860Z cvt.u64.u32 %rd130, %r414; 2026-02-21T09:10:47.0987918Z shl.b64 %rd131, %rd130, 
32; 2026-02-21T09:10:47.0987976Z or.b64 %rd132, %rd129, %rd131; 2026-02-21T09:10:47.0988026Z $L__tmp29: 2026-02-21T09:10:47.0988191Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0988253Z mov.b64 {%r457, %r458}, %rd132; 2026-02-21T09:10:47.0988318Z cvt.rn.bf16x2.f32 %r459, %r458, %r457; 2026-02-21T09:10:47.0988377Z $L__tmp30: 2026-02-21T09:10:47.0988577Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0988636Z cvt.u64.u32 %rd133, %r416; 2026-02-21T09:10:47.0988703Z cvt.u64.u32 %rd134, %r417; 2026-02-21T09:10:47.0988762Z shl.b64 %rd135, %rd134, 32; 2026-02-21T09:10:47.0988822Z or.b64 %rd136, %rd133, %rd135; 2026-02-21T09:10:47.0988873Z $L__tmp31: 2026-02-21T09:10:47.0989042Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0989104Z mov.b64 {%r460, %r461}, %rd136; 2026-02-21T09:10:47.0989169Z cvt.rn.bf16x2.f32 %r462, %r461, %r460; 2026-02-21T09:10:47.0989263Z $L__tmp32: 2026-02-21T09:10:47.0989472Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0989531Z cvt.u64.u32 %rd137, %r418; 2026-02-21T09:10:47.0989595Z cvt.u64.u32 %rd138, %r419; 2026-02-21T09:10:47.0989655Z shl.b64 %rd139, %rd138, 32; 2026-02-21T09:10:47.0989714Z or.b64 %rd140, %rd137, %rd139; 2026-02-21T09:10:47.0989764Z $L__tmp33: 2026-02-21T09:10:47.0989936Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0990012Z mov.b64 {%r463, %r464}, %rd140; 2026-02-21T09:10:47.0990075Z cvt.rn.bf16x2.f32 %r465, %r464, %r463; 2026-02-21T09:10:47.0990131Z $L__tmp34: 2026-02-21T09:10:47.0990334Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0990392Z cvt.u64.u32 %rd141, %r420; 2026-02-21T09:10:47.0990455Z cvt.u64.u32 %rd142, %r421; 
2026-02-21T09:10:47.0990514Z shl.b64 %rd143, %rd142, 32; 2026-02-21T09:10:47.0990572Z or.b64 %rd144, %rd141, %rd143; 2026-02-21T09:10:47.0990644Z $L__tmp35: 2026-02-21T09:10:47.0990814Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0990874Z mov.b64 {%r466, %r467}, %rd144; 2026-02-21T09:10:47.0990938Z cvt.rn.bf16x2.f32 %r468, %r467, %r466; 2026-02-21T09:10:47.0990996Z $L__tmp36: 2026-02-21T09:10:47.0991217Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0991278Z cvt.u64.u32 %rd145, %r422; 2026-02-21T09:10:47.0991335Z cvt.u64.u32 %rd146, %r423; 2026-02-21T09:10:47.0991399Z shl.b64 %rd147, %rd146, 32; 2026-02-21T09:10:47.0991457Z or.b64 %rd148, %rd145, %rd147; 2026-02-21T09:10:47.0991508Z $L__tmp37: 2026-02-21T09:10:47.0991706Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0991767Z mov.b64 {%r469, %r470}, %rd148; 2026-02-21T09:10:47.0991834Z cvt.rn.bf16x2.f32 %r471, %r470, %r469; 2026-02-21T09:10:47.0991892Z $L__tmp38: 2026-02-21T09:10:47.0992093Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0992151Z cvt.u64.u32 %rd149, %r424; 2026-02-21T09:10:47.0992207Z cvt.u64.u32 %rd150, %r425; 2026-02-21T09:10:47.0992274Z shl.b64 %rd151, %rd150, 32; 2026-02-21T09:10:47.0992335Z or.b64 %rd152, %rd149, %rd151; 2026-02-21T09:10:47.0992386Z $L__tmp39: 2026-02-21T09:10:47.0992547Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0992606Z mov.b64 {%r472, %r473}, %rd152; 2026-02-21T09:10:47.0992671Z cvt.rn.bf16x2.f32 %r474, %r473, %r472; 2026-02-21T09:10:47.0992727Z $L__tmp40: 2026-02-21T09:10:47.0992930Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0992989Z cvt.u64.u32 %rd153, %r426; 
2026-02-21T09:10:47.0993046Z cvt.u64.u32 %rd154, %r427; 2026-02-21T09:10:47.0993110Z shl.b64 %rd155, %rd154, 32; 2026-02-21T09:10:47.0993167Z or.b64 %rd156, %rd153, %rd155; 2026-02-21T09:10:47.0993219Z $L__tmp41: 2026-02-21T09:10:47.0993386Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0993447Z mov.b64 {%r475, %r476}, %rd156; 2026-02-21T09:10:47.0993513Z cvt.rn.bf16x2.f32 %r477, %r476, %r475; 2026-02-21T09:10:47.0993563Z $L__tmp42: 2026-02-21T09:10:47.0993770Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0993827Z cvt.u64.u32 %rd157, %r428; 2026-02-21T09:10:47.0993884Z cvt.u64.u32 %rd158, %r429; 2026-02-21T09:10:47.0993948Z shl.b64 %rd159, %rd158, 32; 2026-02-21T09:10:47.0994005Z or.b64 %rd160, %rd157, %rd159; 2026-02-21T09:10:47.0994082Z $L__tmp43: 2026-02-21T09:10:47.0994247Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0994305Z mov.b64 {%r478, %r479}, %rd160; 2026-02-21T09:10:47.0994369Z cvt.rn.bf16x2.f32 %r480, %r479, %r478; 2026-02-21T09:10:47.0994420Z $L__tmp44: 2026-02-21T09:10:47.0994627Z .loc 2 291 36 // standard.py:291:36 @[ cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:94:40 ] 2026-02-21T09:10:47.0994710Z cvt.u64.u32 %rd161, %r430; 2026-02-21T09:10:47.0994766Z cvt.u64.u32 %rd162, %r431; 2026-02-21T09:10:47.0994831Z shl.b64 %rd163, %rd162, 32; 2026-02-21T09:10:47.0994889Z or.b64 %rd164, %rd161, %rd163; 2026-02-21T09:10:47.0994940Z $L__tmp45: 2026-02-21T09:10:47.0995106Z .loc 1 97 28 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:97:28 2026-02-21T09:10:47.0995164Z mov.b64 {%r481, %r482}, %rd164; 2026-02-21T09:10:47.0995228Z cvt.rn.bf16x2.f32 %r483, %r482, %r481; 2026-02-21T09:10:47.0995416Z .loc 1 98 43 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:98:43 2026-02-21T09:10:47.0995521Z st.shared.v4.b32 [%r24], {%r438, %r441, %r444, %r447}; 
2026-02-21T09:10:47.0995614Z st.shared.v4.b32 [%r25], {%r450, %r453, %r456, %r459}; 2026-02-21T09:10:47.0995700Z st.shared.v4.b32 [%r26], {%r462, %r465, %r468, %r471}; 2026-02-21T09:10:47.0995820Z st.shared.v4.b32 [%r27], {%r474, %r477, %r480, %r483}; 2026-02-21T09:10:47.0995877Z // begin inline asm 2026-02-21T09:10:47.0995953Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.0996012Z // end inline asm 2026-02-21T09:10:47.0996065Z bar.sync 0; 2026-02-21T09:10:47.0996129Z elect.sync %r484|%p99, -1; 2026-02-21T09:10:47.0996195Z and.pred %p97, %p1, %p99; 2026-02-21T09:10:47.0996258Z // begin inline asm 2026-02-21T09:10:47.0996435Z @%p97 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd100, {%r433, %r434}], [%r49]; 2026-02-21T09:10:47.0996490Z // end inline asm 2026-02-21T09:10:47.0996564Z cp.async.bulk.commit_group; 2026-02-21T09:10:47.0996633Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:47.0996688Z bar.sync 0; 2026-02-21T09:10:47.0996770Z $L__BB0_8: // %._crit_edge 2026-02-21T09:10:47.0996937Z .loc 1 31 4 // cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py:31:4 2026-02-21T09:10:47.0996990Z bar.sync 0; 2026-02-21T09:10:47.0997046Z // begin inline asm 2026-02-21T09:10:47.0997167Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r485, 64; 2026-02-21T09:10:47.0997221Z // end inline asm 2026-02-21T09:10:47.0997274Z ret; 2026-02-21T09:10:47.0997334Z $L__tmp46: 2026-02-21T09:10:47.0997388Z $L__func_end0: 2026-02-21T09:10:47.0997476Z // -- End function 2026-02-21T09:10:47.0997527Z } 2026-02-21T09:10:47.0997739Z .file 1 "/tmp/torchinductor_root/gj/cgj37llvntcl32agbdc44ve7nnslyapx6ddzqpjsjds5cmz73wc4.py" 2026-02-21T09:10:47.0997908Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:10:47.0997971Z .section .debug_abbrev 2026-02-21T09:10:47.0998031Z { 2026-02-21T09:10:47.0998119Z .b8 1 // Abbreviation Code 2026-02-21T09:10:47.0998206Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:10:47.0998293Z .b8 
1 // DW_CHILDREN_yes 2026-02-21T09:10:47.0998373Z .b8 37 // DW_AT_producer 2026-02-21T09:10:47.0998454Z .b8 8 // DW_FORM_string 2026-02-21T09:10:47.0998530Z .b8 19 // DW_AT_language 2026-02-21T09:10:47.0998615Z .b8 5 // DW_FORM_data2 2026-02-21T09:10:47.0998689Z .b8 3 // DW_AT_name 2026-02-21T09:10:47.0998763Z .b8 8 // DW_FORM_string 2026-02-21T09:10:47.0998847Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:10:47.0998943Z .b8 6 // DW_FORM_data4 2026-02-21T09:10:47.0999020Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:10:47.0999099Z .b8 8 // DW_FORM_string 2026-02-21T09:10:47.0999171Z .b8 0 // EOM(1) 2026-02-21T09:10:47.0999239Z .b8 0 // EOM(2) 2026-02-21T09:10:47.0999323Z .b8 2 // Abbreviation Code 2026-02-21T09:10:47.0999433Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:47.0999507Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:47.0999581Z .b8 3 // DW_AT_name 2026-02-21T09:10:47.0999662Z .b8 8 // DW_FORM_string 2026-02-21T09:10:47.0999736Z .b8 32 // DW_AT_inline 2026-02-21T09:10:47.0999812Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:47.0999890Z .b8 0 // EOM(1) 2026-02-21T09:10:47.0999978Z .b8 0 // EOM(2) 2026-02-21T09:10:47.1000060Z .b8 3 // Abbreviation Code 2026-02-21T09:10:47.1000144Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:47.1000221Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:47.1000319Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:47.1000394Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:47.1000480Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:47.1000551Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:47.1000636Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:47.1000716Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:47.1000783Z .b8 0 // EOM(1) 2026-02-21T09:10:47.1000852Z .b8 0 // EOM(2) 2026-02-21T09:10:47.1000939Z .b8 4 // Abbreviation Code 2026-02-21T09:10:47.1001031Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:10:47.1001105Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:47.1001190Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:47.1001273Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:47.1001343Z .b8 
17 // DW_AT_low_pc 2026-02-21T09:10:47.1001413Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:47.1001497Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:47.1001601Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:47.1001678Z .b8 88 // DW_AT_call_file 2026-02-21T09:10:47.1001757Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:47.1001833Z .b8 89 // DW_AT_call_line 2026-02-21T09:10:47.1001906Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:47.1001982Z .b8 87 // DW_AT_call_column 2026-02-21T09:10:47.1002061Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:47.1002129Z .b8 0 // EOM(1) 2026-02-21T09:10:47.1002199Z .b8 0 // EOM(2) 2026-02-21T09:10:47.1002272Z .b8 0 // EOM(3) 2026-02-21T09:10:47.1002324Z } 2026-02-21T09:10:47.1002385Z .section .debug_info 2026-02-21T09:10:47.1002441Z { 2026-02-21T09:10:47.1002523Z .b32 178 // Length of Unit 2026-02-21T09:10:47.1002605Z .b8 2 // DWARF version number 2026-02-21T09:10:47.1002657Z .b8 0 2026-02-21T09:10:47.1002779Z .b32 .debug_abbrev // Offset Into Abbrev. Section 2026-02-21T09:10:47.1002897Z .b8 8 // Address Size (in bytes) 2026-02-21T09:10:47.1002999Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:10:47.1003087Z .b8 116 // DW_AT_producer 2026-02-21T09:10:47.1003141Z .b8 114 2026-02-21T09:10:47.1003194Z .b8 105 2026-02-21T09:10:47.1003252Z .b8 116 2026-02-21T09:10:47.1003305Z .b8 111 2026-02-21T09:10:47.1003382Z .b8 110 2026-02-21T09:10:47.1003434Z .b8 0 2026-02-21T09:10:47.1003516Z .b8 2 // DW_AT_language 2026-02-21T09:10:47.1003568Z .b8 0 2026-02-21T09:10:47.1003643Z .b8 99 // DW_AT_name 2026-02-21T09:10:47.1003700Z .b8 103 2026-02-21T09:10:47.1003752Z .b8 106 2026-02-21T09:10:47.1003803Z .b8 51 2026-02-21T09:10:47.1003855Z .b8 55 2026-02-21T09:10:47.1003912Z .b8 108 2026-02-21T09:10:47.1003961Z .b8 108 2026-02-21T09:10:47.1004010Z .b8 118 2026-02-21T09:10:47.1004061Z .b8 110 2026-02-21T09:10:47.1004119Z .b8 116 2026-02-21T09:10:47.1004169Z .b8 99 2026-02-21T09:10:47.1004240Z .b8 108 2026-02-21T09:10:47.1004301Z .b8 51 2026-02-21T09:10:47.1004356Z 
.b8 50 2026-02-21T09:10:47.1004412Z .b8 97 2026-02-21T09:10:47.1004465Z .b8 103 2026-02-21T09:10:47.1004527Z .b8 98 2026-02-21T09:10:47.1004580Z .b8 100 2026-02-21T09:10:47.1004634Z .b8 99 2026-02-21T09:10:47.1004698Z .b8 52 2026-02-21T09:10:47.1004751Z .b8 52 2026-02-21T09:10:47.1004846Z .b8 118 2026-02-21T09:10:47.1004900Z .b8 101 2026-02-21T09:10:47.1004958Z .b8 55 2026-02-21T09:10:47.1005009Z .b8 110 2026-02-21T09:10:47.1005058Z .b8 110 2026-02-21T09:10:47.1005108Z .b8 115 2026-02-21T09:10:47.1005166Z .b8 108 2026-02-21T09:10:47.1005217Z .b8 121 2026-02-21T09:10:47.1005266Z .b8 97 2026-02-21T09:10:47.1005323Z .b8 112 2026-02-21T09:10:47.1005372Z .b8 120 2026-02-21T09:10:47.1005422Z .b8 54 2026-02-21T09:10:47.1005471Z .b8 100 2026-02-21T09:10:47.1005529Z .b8 100 2026-02-21T09:10:47.1005580Z .b8 122 2026-02-21T09:10:47.1005632Z .b8 113 2026-02-21T09:10:47.1005688Z .b8 112 2026-02-21T09:10:47.1005738Z .b8 106 2026-02-21T09:10:47.1005790Z .b8 115 2026-02-21T09:10:47.1005840Z .b8 106 2026-02-21T09:10:47.1005897Z .b8 100 2026-02-21T09:10:47.1005947Z .b8 115 2026-02-21T09:10:47.1005997Z .b8 53 2026-02-21T09:10:47.1006054Z .b8 99 2026-02-21T09:10:47.1006105Z .b8 109 2026-02-21T09:10:47.1006155Z .b8 122 2026-02-21T09:10:47.1006205Z .b8 55 2026-02-21T09:10:47.1006262Z .b8 51 2026-02-21T09:10:47.1006316Z .b8 119 2026-02-21T09:10:47.1006366Z .b8 99 2026-02-21T09:10:47.1006416Z .b8 52 2026-02-21T09:10:47.1006474Z .b8 46 2026-02-21T09:10:47.1006524Z .b8 112 2026-02-21T09:10:47.1006573Z .b8 121 2026-02-21T09:10:47.1006631Z .b8 0 2026-02-21T09:10:47.1006720Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:10:47.1006794Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:10:47.1006845Z .b8 116 2026-02-21T09:10:47.1006902Z .b8 109 2026-02-21T09:10:47.1006953Z .b8 112 2026-02-21T09:10:47.1007003Z .b8 47 2026-02-21T09:10:47.1007059Z .b8 116 2026-02-21T09:10:47.1007108Z .b8 111 2026-02-21T09:10:47.1007158Z .b8 114 2026-02-21T09:10:47.1007209Z .b8 99 2026-02-21T09:10:47.1007278Z .b8 104 
2026-02-21T09:10:47.1007330Z .b8 105 2026-02-21T09:10:47.1007382Z .b8 110 2026-02-21T09:10:47.1007440Z .b8 100 2026-02-21T09:10:47.1007491Z .b8 117 2026-02-21T09:10:47.1007543Z .b8 99 2026-02-21T09:10:47.1007595Z .b8 116 2026-02-21T09:10:47.1007652Z .b8 111 2026-02-21T09:10:47.1007705Z .b8 114 2026-02-21T09:10:47.1007757Z .b8 95 2026-02-21T09:10:47.1007810Z .b8 114 2026-02-21T09:10:47.1007869Z .b8 111 2026-02-21T09:10:47.1007921Z .b8 111 2026-02-21T09:10:47.1007973Z .b8 116 2026-02-21T09:10:47.1008032Z .b8 47 2026-02-21T09:10:47.1008083Z .b8 103 2026-02-21T09:10:47.1008134Z .b8 106 2026-02-21T09:10:47.1008187Z .b8 0 2026-02-21T09:10:47.1008294Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:10:47.1008368Z .b8 95 // DW_AT_name 2026-02-21T09:10:47.1008444Z .b8 104 2026-02-21T09:10:47.1008503Z .b8 101 2026-02-21T09:10:47.1008555Z .b8 108 2026-02-21T09:10:47.1008608Z .b8 105 2026-02-21T09:10:47.1008660Z .b8 111 2026-02-21T09:10:47.1008719Z .b8 110 2026-02-21T09:10:47.1008771Z .b8 95 2026-02-21T09:10:47.1008822Z .b8 109 2026-02-21T09:10:47.1008882Z .b8 97 2026-02-21T09:10:47.1008933Z .b8 116 2026-02-21T09:10:47.1008985Z .b8 109 2026-02-21T09:10:47.1009037Z .b8 117 2026-02-21T09:10:47.1009096Z .b8 108 2026-02-21T09:10:47.1009171Z .b8 95 2026-02-21T09:10:47.1009225Z .b8 98 2026-02-21T09:10:47.1009283Z .b8 102 2026-02-21T09:10:47.1009336Z .b8 49 2026-02-21T09:10:47.1009388Z .b8 54 2026-02-21T09:10:47.1009441Z .b8 95 2026-02-21T09:10:47.1009501Z .b8 105 2026-02-21T09:10:47.1009554Z .b8 110 2026-02-21T09:10:47.1009606Z .b8 116 2026-02-21T09:10:47.1009658Z .b8 52 2026-02-21T09:10:47.1009718Z .b8 0 2026-02-21T09:10:47.1009794Z .b8 1 // DW_AT_inline 2026-02-21T09:10:47.1009894Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:10:47.1009994Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:10:47.1010110Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:10:47.1010206Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:47.1010331Z .b8 4 // Abbrev [4] 
0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:10:47.1010450Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:47.1010537Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:10:47.1010638Z .b64 $L__tmp45 // DW_AT_high_pc 2026-02-21T09:10:47.1010723Z .b8 1 // DW_AT_call_file 2026-02-21T09:10:47.1010805Z .b8 94 // DW_AT_call_line 2026-02-21T09:10:47.1010892Z .b8 40 // DW_AT_call_column 2026-02-21T09:10:47.1010990Z .b8 0 // End Of Children Mark 2026-02-21T09:10:47.1011077Z .b8 0 // End Of Children Mark 2026-02-21T09:10:47.1011132Z } 2026-02-21T09:10:47.1011208Z .section .debug_macinfo { } 2026-02-21T09:10:47.1011212Z 2026-02-21T09:10:47.1011291Z ================================================================ 2026-02-21T09:10:47.1011400Z please share the reproducer above with Triton project. 2026-02-21T09:10:47.6719326Z 2026-02-21T09:10:47.6723253Z 2026-02-21T09:10:47.6724629Z 2026-02-21T09:10:47.6724827Z ================================================================ 2026-02-21T09:10:47.6725081Z Internal Triton PTX codegen error 2026-02-21T09:10:47.6725267Z `ptxas` stderr: 2026-02-21T09:10:47.6725702Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 328 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:10:47.6726184Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:47.6726342Z 2026-02-21T09:10:47.6726743Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpnglp1mrc.ptx -o /tmp/tmpnglp1mrc.ptx.o 2026-02-21T09:10:47.6727176Z 2026-02-21T09:10:47.6727180Z 2026-02-21T09:10:47.6727233Z // 2026-02-21T09:10:47.6727378Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:10:47.6727550Z // 2026-02-21T09:10:47.6727622Z 2026-02-21T09:10:47.6727678Z .version 8.7 2026-02-21T09:10:47.6727823Z .target sm_100a 2026-02-21T09:10:47.6727960Z .address_size 64 2026-02-21T09:10:47.6728045Z 2026-02-21T09:10:47.6728203Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:10:47.6728483Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:10:47.6728711Z // @_helion_matmul_bf16_int4 2026-02-21T09:10:47.6728923Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:10:47.6729178Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:10:47.6729640Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:10:47.6729913Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:10:47.6730189Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:10:47.6730459Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:10:47.6730682Z ) 2026-02-21T09:10:47.6730808Z .reqntid 256 2026-02-21T09:10:47.6731012Z .maxnreg 32 2026-02-21T09:10:47.6731143Z { 2026-02-21T09:10:47.6731266Z .reg .pred %p<350>; 2026-02-21T09:10:47.6731422Z .reg .b16 %rs<561>; 2026-02-21T09:10:47.6731635Z .reg .b32 %r<2114>; 2026-02-21T09:10:47.6731784Z .reg .b64 %rd<626>; 2026-02-21T09:10:47.6732041Z .loc 1 19 0 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:19:0 2026-02-21T09:10:47.6732327Z $L__func_begin0: 
2026-02-21T09:10:47.6732567Z .loc 1 19 0 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:19:0 2026-02-21T09:10:47.6732808Z 2026-02-21T09:10:47.6732866Z // %bb.0: 2026-02-21T09:10:47.6733097Z ld.param.b64 %rd41, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:10:47.6733306Z $L__tmp0: 2026-02-21T09:10:47.6733542Z .loc 1 19 0 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:19 2026-02-21T09:10:47.6733826Z mov.u32 %r1, %tid.x; 2026-02-21T09:10:47.6734052Z ld.param.b64 %rd43, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:10:47.6734277Z setp.lt.u32 %p6, %r1, 32; 2026-02-21T09:10:47.6735315Z [234s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:10:47.6736516Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[1, 3], range_unroll_factors=[4, 1], range_warp_specializes=[False, None]), static_shapes=True) 2026-02-21T09:10:47.6737641Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:10:47.6737873Z `ptxas` stderr: 2026-02-21T09:10:47.6738299Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 328 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:10:47.6738788Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:47.6738933Z 2026-02-21T09:10:47.6739304Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpnglp1mrc.ptx -o /tmp/tmpnglp1mrc.ptx.o 2026-02-21T09:10:47.6739741Z 2026-02-21T09:10:47.6739864Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:10:47.6740165Z ld.param.b64 %rd61, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:10:47.6740394Z mov.b32 %r201, global_smem; 2026-02-21T09:10:47.6740571Z // begin inline asm 2026-02-21T09:10:47.6740814Z @%p6 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r201], 128; 2026-02-21T09:10:47.6741078Z // end inline asm 2026-02-21T09:10:47.6741262Z ld.param.b64 %rd78, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:10:47.6741479Z bar.sync 0; 2026-02-21T09:10:47.6741681Z ld.shared.b32 %r1917, [global_smem]; 2026-02-21T09:10:47.6741868Z bar.sync 0; 2026-02-21T09:10:47.6742002Z // begin inline asm 2026-02-21T09:10:47.6742223Z @%p6 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:10:47.6742461Z // end inline asm 2026-02-21T09:10:47.6742720Z .loc 1 21 66 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:21:66 2026-02-21T09:10:47.6743033Z mov.u32 %r2060, %ctaid.x; 2026-02-21T09:10:47.6743200Z mov.u32 %r218, %ctaid.y; 2026-02-21T09:10:47.6743407Z mov.u32 %r219, %ctaid.z; 2026-02-21T09:10:47.6743566Z mov.u32 %r220, %nctaid.x; 2026-02-21T09:10:47.6743737Z mov.u32 %r221, %nctaid.y; 2026-02-21T09:10:47.6743904Z mad.lo.s32 %r222, %r219, %r221, %r218; 2026-02-21T09:10:47.6744106Z mad.lo.s32 %r223, %r222, %r220, %r2060; 2026-02-21T09:10:47.6744295Z shl.b32 %r224, %r223, 8; 2026-02-21T09:10:47.6744453Z cvt.s64.s32 %rd79, %r224; 2026-02-21T09:10:47.6744624Z add.s64 %rd57, %rd78, %rd79; 2026-02-21T09:10:47.6744792Z shl.b32 %r225, %r1, 2; 2026-02-21T09:10:47.6744999Z add.s32 %r202, %r201, 
%r225; 2026-02-21T09:10:47.6745159Z mov.b32 %r211, 0; 2026-02-21T09:10:47.6745310Z // begin inline asm 2026-02-21T09:10:47.6745471Z @%p6 st.shared.b32 [ %r202 + 0 ], %r211; 2026-02-21T09:10:47.6745665Z // end inline asm 2026-02-21T09:10:47.6745813Z bar.warp.sync -1; 2026-02-21T09:10:47.6745977Z setp.eq.b32 %p9, %r1, 0; 2026-02-21T09:10:47.6746152Z cvt.u64.u32 %rd42, %r201; 2026-02-21T09:10:47.6746313Z // begin inline asm 2026-02-21T09:10:47.6746581Z @%p9 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd42 + 0 ], %rd43; 2026-02-21T09:10:47.6746908Z // end inline asm 2026-02-21T09:10:47.6747056Z // begin inline asm 2026-02-21T09:10:47.6747286Z @%p9 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x1; 2026-02-21T09:10:47.6747550Z // end inline asm 2026-02-21T09:10:47.6747690Z mov.b32 %r204, 64; 2026-02-21T09:10:47.6747838Z // begin inline asm 2026-02-21T09:10:47.6748122Z @%p9 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0, %r204; 2026-02-21T09:10:47.6748404Z // end inline asm 2026-02-21T09:10:47.6748544Z mov.b32 %r205, 16; 2026-02-21T09:10:47.6748678Z // begin inline asm 2026-02-21T09:10:47.6748911Z @%p9 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x1, %r205; 2026-02-21T09:10:47.6749171Z // end inline asm 2026-02-21T09:10:47.6749312Z mov.b32 %r206, 8192; 2026-02-21T09:10:47.6749452Z // begin inline asm 2026-02-21T09:10:47.6749700Z @%p9 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0, %r206; 2026-02-21T09:10:47.6749974Z // end inline asm 2026-02-21T09:10:47.6750110Z mov.b32 %r207, 512; 2026-02-21T09:10:47.6750259Z // begin inline asm 2026-02-21T09:10:47.6750492Z @%p9 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x1, %r207; 2026-02-21T09:10:47.6750766Z // end inline asm 2026-02-21T09:10:47.6750903Z mov.b64 %rd50, 8192; 2026-02-21T09:10:47.6751054Z // begin inline asm 2026-02-21T09:10:47.6751305Z @%p9 
tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd42 + 0 ], 0x0, %rd50; 2026-02-21T09:10:47.6751616Z // end inline asm 2026-02-21T09:10:47.6751755Z mov.b32 %r208, 1; 2026-02-21T09:10:47.6751887Z // begin inline asm 2026-02-21T09:10:47.6752140Z @%p9 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0, %r208; 2026-02-21T09:10:47.6752418Z // end inline asm 2026-02-21T09:10:47.6752556Z // begin inline asm 2026-02-21T09:10:47.6752795Z @%p9 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x1, %r208; 2026-02-21T09:10:47.6753076Z // end inline asm 2026-02-21T09:10:47.6753216Z // begin inline asm 2026-02-21T09:10:47.6753445Z @%p9 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0; 2026-02-21T09:10:47.6753706Z // end inline asm 2026-02-21T09:10:47.6753838Z // begin inline asm 2026-02-21T09:10:47.6754087Z @%p9 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0; 2026-02-21T09:10:47.6754363Z // end inline asm 2026-02-21T09:10:47.6754505Z // begin inline asm 2026-02-21T09:10:47.6754742Z @%p9 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x2; 2026-02-21T09:10:47.6755000Z // end inline asm 2026-02-21T09:10:47.6755140Z // begin inline asm 2026-02-21T09:10:47.6755359Z @%p9 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0; 2026-02-21T09:10:47.6755614Z // end inline asm 2026-02-21T09:10:47.6755746Z // begin inline asm 2026-02-21T09:10:47.6756115Z @%p6 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd57 + 0 ], [ %rd42 + 0 ], 0x80; 2026-02-21T09:10:47.6756487Z // end inline asm 2026-02-21T09:10:47.6756623Z // begin inline asm 2026-02-21T09:10:47.6756842Z @%p6 fence.proxy.tensormap::generic.acquire.gpu [ %rd57 + 0 ], 0x80; 2026-02-21T09:10:47.6757090Z @%p6 cp.async.bulk.commit_group ; 2026-02-21T09:10:47.6757285Z @%p6 cp.async.bulk.wait_group.read 0 ; 
2026-02-21T09:10:47.6757492Z // end inline asm 2026-02-21T09:10:47.6757632Z bar.sync 0; 2026-02-21T09:10:47.6757773Z cvta.global.u64 %rd463, %rd57; 2026-02-21T09:10:47.6758063Z .loc 1 23 67 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:23:67 2026-02-21T09:10:47.6758363Z add.s64 %rd75, %rd57, 128; 2026-02-21T09:10:47.6758516Z bar.sync 0; 2026-02-21T09:10:47.6758657Z // begin inline asm 2026-02-21T09:10:47.6758808Z @%p6 st.shared.b32 [ %r202 + 0 ], %r211; 2026-02-21T09:10:47.6758986Z // end inline asm 2026-02-21T09:10:47.6759126Z bar.warp.sync -1; 2026-02-21T09:10:47.6759272Z // begin inline asm 2026-02-21T09:10:47.6759558Z @%p9 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd42 + 0 ], %rd61; 2026-02-21T09:10:47.6759837Z // end inline asm 2026-02-21T09:10:47.6759978Z // begin inline asm 2026-02-21T09:10:47.6760194Z @%p9 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x1; 2026-02-21T09:10:47.6760443Z // end inline asm 2026-02-21T09:10:47.6760604Z // begin inline asm 2026-02-21T09:10:47.6760842Z @%p9 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0, %r204; 2026-02-21T09:10:47.6761111Z // end inline asm 2026-02-21T09:10:47.6761257Z mov.b32 %r213, 128; 2026-02-21T09:10:47.6761402Z // begin inline asm 2026-02-21T09:10:47.6761675Z @%p9 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x1, %r213; 2026-02-21T09:10:47.6761938Z // end inline asm 2026-02-21T09:10:47.6762071Z // begin inline asm 2026-02-21T09:10:47.6762313Z @%p9 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0, %r206; 2026-02-21T09:10:47.6762580Z // end inline asm 2026-02-21T09:10:47.6762720Z mov.b32 %r215, 4096; 2026-02-21T09:10:47.6762861Z // begin inline asm 2026-02-21T09:10:47.6763101Z @%p9 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x1, %r215; 2026-02-21T09:10:47.6763373Z // end inline asm 2026-02-21T09:10:47.6763511Z mov.b64 %rd68, 16384; 
2026-02-21T09:10:47.6763667Z // begin inline asm 2026-02-21T09:10:47.6763914Z @%p9 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd42 + 0 ], 0x0, %rd68; 2026-02-21T09:10:47.6764196Z // end inline asm 2026-02-21T09:10:47.6764329Z // begin inline asm 2026-02-21T09:10:47.6764585Z @%p9 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0, %r208; 2026-02-21T09:10:47.6764864Z // end inline asm 2026-02-21T09:10:47.6765006Z // begin inline asm 2026-02-21T09:10:47.6765256Z @%p9 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x1, %r208; 2026-02-21T09:10:47.6765533Z // end inline asm 2026-02-21T09:10:47.6765675Z // begin inline asm 2026-02-21T09:10:47.6765902Z @%p9 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd42 + 0 ], 0xa; 2026-02-21T09:10:47.6766164Z // end inline asm 2026-02-21T09:10:47.6766297Z // begin inline asm 2026-02-21T09:10:47.6766548Z @%p9 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0; 2026-02-21T09:10:47.6766833Z // end inline asm 2026-02-21T09:10:47.6766972Z // begin inline asm 2026-02-21T09:10:47.6767216Z @%p9 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x3; 2026-02-21T09:10:47.6767477Z // end inline asm 2026-02-21T09:10:47.6767616Z // begin inline asm 2026-02-21T09:10:47.6767843Z @%p9 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd42 + 0 ], 0x0; 2026-02-21T09:10:47.6768101Z // end inline asm 2026-02-21T09:10:47.6768233Z // begin inline asm 2026-02-21T09:10:47.6768570Z @%p6 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd75 + 0 ], [ %rd42 + 0 ], 0x80; 2026-02-21T09:10:47.6768988Z // end inline asm 2026-02-21T09:10:47.6769120Z // begin inline asm 2026-02-21T09:10:47.6769330Z @%p6 fence.proxy.tensormap::generic.acquire.gpu [ %rd75 + 0 ], 0x80; 2026-02-21T09:10:47.6769574Z @%p6 cp.async.bulk.commit_group ; 2026-02-21T09:10:47.6769769Z @%p6 
cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:47.6769946Z // end inline asm 2026-02-21T09:10:47.6770117Z bar.sync 0; 2026-02-21T09:10:47.6770266Z cvta.global.u64 %rd553, %rd75; 2026-02-21T09:10:47.6770543Z .loc 1 30 49 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:30:49 2026-02-21T09:10:47.6770841Z min.u32 %r226, %r2060, 4095; 2026-02-21T09:10:47.6771002Z add.s32 %r4, %r226, 1; 2026-02-21T09:10:47.6771269Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.6771588Z sub.s32 %r227, %r4, %r2060; 2026-02-21T09:10:47.6771763Z shr.s32 %r228, %r227, 31; 2026-02-21T09:10:47.6771918Z shr.u32 %r229, %r228, 30; 2026-02-21T09:10:47.6772103Z add.s32 %r230, %r227, %r229; 2026-02-21T09:10:47.6772268Z and.b32 %r231, %r230, -4; 2026-02-21T09:10:47.6772423Z add.s32 %r2107, %r231, %r2060; 2026-02-21T09:10:47.6772696Z .loc 1 44 45 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:44:45 2026-02-21T09:10:47.6773001Z shr.u32 %r6, %r1, 5; 2026-02-21T09:10:47.6773156Z bfe.u32 %r7, %r1, 2, 6; 2026-02-21T09:10:47.6773304Z or.b32 %r8, %r7, 64; 2026-02-21T09:10:47.6773562Z .loc 1 57 38 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:57:38 2026-02-21T09:10:47.6773842Z shl.b32 %r232, %r1, 3; 2026-02-21T09:10:47.6773998Z and.b32 %r9, %r232, 24; 2026-02-21T09:10:47.6774255Z .loc 1 75 38 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:75:38 2026-02-21T09:10:47.6774534Z and.b32 %r10, %r1, 64; 2026-02-21T09:10:47.6774802Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.6775101Z setp.lt.s32 %p44, %r2060, %r2107; 2026-02-21T09:10:47.6775280Z shl.b32 %r11, %r1, 4; 2026-02-21T09:10:47.6775427Z and.b32 %r12, %r11, 4080; 2026-02-21T09:10:47.6775586Z or.b32 %r2090, %r12, 4096; 2026-02-21T09:10:47.6775752Z add.s32 %r1918, %r1917, 64; 2026-02-21T09:10:47.6775909Z or.b32 %r2089, %r9, 32; 2026-02-21T09:10:47.6776066Z and.b32 %r2055, %r1, 127; 
2026-02-21T09:10:47.6776217Z and.b32 %r2056, %r1, 128; 2026-02-21T09:10:47.6776375Z and.b32 %r2057, %r1, 63; 2026-02-21T09:10:47.6776526Z and.b32 %r2058, %r11, 112; 2026-02-21T09:10:47.6776685Z shr.u32 %r2059, %r1, 4; 2026-02-21T09:10:47.6776836Z setp.eq.b32 %p346, %r10, 0; 2026-02-21T09:10:47.6777008Z @%p44 bra $L__BB0_1; 2026-02-21T09:10:47.6777151Z bra.uni $L__BB0_27; 2026-02-21T09:10:47.6777320Z $L__BB0_1: // %.lr.ph 2026-02-21T09:10:47.6777622Z .loc 1 0 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:0:111 2026-02-21T09:10:47.6777907Z add.s32 %r1327, %r201, %r12; 2026-02-21T09:10:47.6778076Z add.s32 %r1329, %r1327, 4096; 2026-02-21T09:10:47.6778236Z add.s32 %r1202, %r1327, 8192; 2026-02-21T09:10:47.6778401Z add.s32 %r1204, %r1327, 12288; 2026-02-21T09:10:47.6778559Z shl.b32 %r235, %r2055, 6; 2026-02-21T09:10:47.6778717Z shr.u32 %r237, %r2056, 2; 2026-02-21T09:10:47.6778867Z or.b32 %r22, %r235, %r237; 2026-02-21T09:10:47.6779030Z add.s32 %r23, %r201, %r22; 2026-02-21T09:10:47.6779181Z shr.u32 %r239, %r2056, 1; 2026-02-21T09:10:47.6779337Z or.b32 %r24, %r239, %r2057; 2026-02-21T09:10:47.6779499Z add.s32 %r240, %r201, 24576; 2026-02-21T09:10:47.6779653Z add.s32 %r25, %r240, %r24; 2026-02-21T09:10:47.6779811Z xor.b32 %r26, %r24, 16; 2026-02-21T09:10:47.6779959Z add.s32 %r27, %r240, %r26; 2026-02-21T09:10:47.6780116Z xor.b32 %r28, %r24, 32; 2026-02-21T09:10:47.6780263Z add.s32 %r29, %r240, %r28; 2026-02-21T09:10:47.6780464Z xor.b32 %r30, %r24, 48; 2026-02-21T09:10:47.6780613Z add.s32 %r31, %r240, %r30; 2026-02-21T09:10:47.6780769Z shl.b32 %r241, %r2057, 7; 2026-02-21T09:10:47.6780918Z and.b32 %r244, %r2059, 12; 2026-02-21T09:10:47.6781077Z or.b32 %r245, %r241, %r244; 2026-02-21T09:10:47.6781238Z or.b32 %r246, %r245, %r2058; 2026-02-21T09:10:47.6781391Z add.s32 %r247, %r201, 16384; 2026-02-21T09:10:47.6781593Z add.s32 %r32, %r247, %r246; 2026-02-21T09:10:47.6781772Z xor.b32 %r248, %r246, 16; 2026-02-21T09:10:47.6781928Z add.s32 %r33, %r247, %r248; 
2026-02-21T09:10:47.6782082Z xor.b32 %r249, %r246, 32; 2026-02-21T09:10:47.6782236Z add.s32 %r34, %r247, %r249; 2026-02-21T09:10:47.6782386Z xor.b32 %r250, %r246, 48; 2026-02-21T09:10:47.6782541Z add.s32 %r35, %r247, %r250; 2026-02-21T09:10:47.6782700Z xor.b32 %r251, %r246, 64; 2026-02-21T09:10:47.6782845Z add.s32 %r36, %r247, %r251; 2026-02-21T09:10:47.6783003Z xor.b32 %r252, %r246, 80; 2026-02-21T09:10:47.6783152Z add.s32 %r37, %r247, %r252; 2026-02-21T09:10:47.6783311Z xor.b32 %r253, %r246, 96; 2026-02-21T09:10:47.6783506Z add.s32 %r38, %r247, %r253; 2026-02-21T09:10:47.6783677Z xor.b32 %r254, %r246, 112; 2026-02-21T09:10:47.6783841Z add.s32 %r39, %r247, %r254; 2026-02-21T09:10:47.6784017Z bfe.u32 %r255, %r247, 4, 14; 2026-02-21T09:10:47.6784182Z cvt.u64.u32 %rd80, %r255; 2026-02-21T09:10:47.6784403Z or.b64 %rd379, %rd80, 4611686293338849280; 2026-02-21T09:10:47.6784604Z add.s32 %r256, %r201, 16416; 2026-02-21T09:10:47.6784768Z bfe.u32 %r257, %r256, 4, 14; 2026-02-21T09:10:47.6784941Z cvt.u64.u32 %rd81, %r257; 2026-02-21T09:10:47.6785112Z or.b64 %rd380, %rd81, 4611686293338849280; 2026-02-21T09:10:47.6785315Z add.s32 %r258, %r201, 16448; 2026-02-21T09:10:47.6785477Z bfe.u32 %r259, %r258, 4, 14; 2026-02-21T09:10:47.6785642Z cvt.u64.u32 %rd82, %r259; 2026-02-21T09:10:47.6785807Z or.b64 %rd381, %rd82, 4611686293338849280; 2026-02-21T09:10:47.6785997Z add.s32 %r260, %r201, 16480; 2026-02-21T09:10:47.6786164Z bfe.u32 %r261, %r260, 4, 14; 2026-02-21T09:10:47.6786324Z cvt.u64.u32 %rd83, %r261; 2026-02-21T09:10:47.6786499Z or.b64 %rd382, %rd83, 4611686293338849280; 2026-02-21T09:10:47.6786680Z shl.b32 %r262, %r2055, 7; 2026-02-21T09:10:47.6786846Z xor.b32 %r263, %r2058, %r239; 2026-02-21T09:10:47.6787011Z or.b32 %r264, %r263, %r262; 2026-02-21T09:10:47.6787180Z add.s32 %r40, %r201, %r264; 2026-02-21T09:10:47.6787340Z xor.b32 %r265, %r264, 16; 2026-02-21T09:10:47.6787504Z add.s32 %r41, %r201, %r265; 2026-02-21T09:10:47.6787664Z xor.b32 %r266, %r264, 32; 
2026-02-21T09:10:47.6787826Z add.s32 %r42, %r201, %r266; 2026-02-21T09:10:47.6787991Z xor.b32 %r267, %r264, 48; 2026-02-21T09:10:47.6788148Z add.s32 %r43, %r201, %r267; 2026-02-21T09:10:47.6788441Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.6788745Z and.b32 %r268, %r1, 3; 2026-02-21T09:10:47.6788919Z mad.wide.u32 %rd84, %r268, 16, %rd41; 2026-02-21T09:10:47.6789103Z add.s64 %rd7, %rd84, 192; 2026-02-21T09:10:47.6789268Z shl.b32 %r44, %r7, 10; 2026-02-21T09:10:47.6789544Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6789856Z or.b32 %r269, %r44, %r9; 2026-02-21T09:10:47.6790022Z or.b32 %r45, %r269, 65632; 2026-02-21T09:10:47.6790180Z bra.uni $L__BB0_2; 2026-02-21T09:10:47.6790385Z $L__BB0_26: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.6790726Z .loc 1 0 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:0:125 2026-02-21T09:10:47.6791028Z mov.b32 %r1422, 1; 2026-02-21T09:10:47.6791183Z $L__tmp1: 2026-02-21T09:10:47.6791476Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6791845Z // begin inline asm 2026-02-21T09:10:47.6791980Z 2026-02-21T09:10:47.6792101Z { 2026-02-21T09:10:47.6792281Z .reg .pred complete; 2026-02-21T09:10:47.6792435Z waitLoop: 2026-02-21T09:10:47.6792626Z mbarrier.try_wait.parity.shared.b64 complete, [%r2084], %r1422; 2026-02-21T09:10:47.6792867Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.6793019Z } 2026-02-21T09:10:47.6793093Z 2026-02-21T09:10:47.6793150Z // end inline asm 2026-02-21T09:10:47.6793285Z $L__tmp2: 2026-02-21T09:10:47.6793531Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6793862Z cp.async.wait_group 0; 2026-02-21T09:10:47.6794012Z bar.sync 0; 2026-02-21T09:10:47.6794152Z // begin inline asm 2026-02-21T09:10:47.6794323Z @%p9 mbarrier.inval.shared::cta.b64 [%r1331]; 
2026-02-21T09:10:47.6794520Z // end inline asm 2026-02-21T09:10:47.6794656Z bar.sync 0; 2026-02-21T09:10:47.6794794Z // begin inline asm 2026-02-21T09:10:47.6794961Z @%p9 mbarrier.inval.shared::cta.b64 [%r1115]; 2026-02-21T09:10:47.6795156Z // end inline asm 2026-02-21T09:10:47.6795304Z add.s32 %r1425, %r201, 26624; 2026-02-21T09:10:47.6795459Z // begin inline asm 2026-02-21T09:10:47.6795656Z @%p9 mbarrier.inval.shared::cta.b64 [%r1425]; 2026-02-21T09:10:47.6795837Z // end inline asm 2026-02-21T09:10:47.6795975Z bar.sync 0; 2026-02-21T09:10:47.6796103Z // begin inline asm 2026-02-21T09:10:47.6796265Z @%p9 mbarrier.inval.shared::cta.b64 [%r1117]; 2026-02-21T09:10:47.6796444Z // end inline asm 2026-02-21T09:10:47.6796621Z $L__tmp3: 2026-02-21T09:10:47.6796906Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6797242Z // begin inline asm 2026-02-21T09:10:47.6797632Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1427, %r1428, %r1429, %r1430, %r1431, %r1432, %r1433, %r1434, %r1435, %r1436, %r1437, %r1438, %r1439, %r1440, %r1441, %r1442}, [%r1460 + 0]; 2026-02-21T09:10:47.6798037Z // end inline asm 2026-02-21T09:10:47.6798177Z // begin inline asm 2026-02-21T09:10:47.6798551Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1444, %r1445, %r1446, %r1447, %r1448, %r1449, %r1450, %r1451, %r1452, %r1453, %r1454, %r1455, %r1456, %r1457, %r1458, %r1459}, [%r1460 + 16]; 2026-02-21T09:10:47.6798961Z // end inline asm 2026-02-21T09:10:47.6799102Z // begin inline asm 2026-02-21T09:10:47.6799251Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:47.6799422Z // end inline asm 2026-02-21T09:10:47.6799562Z cvt.u64.u32 %rd397, %r1427; 2026-02-21T09:10:47.6799729Z cvt.u64.u32 %rd398, %r1428; 2026-02-21T09:10:47.6799888Z shl.b64 %rd399, %rd398, 32; 2026-02-21T09:10:47.6800052Z or.b64 %rd400, %rd397, %rd399; 2026-02-21T09:10:47.6800206Z $L__tmp4: 2026-02-21T09:10:47.6800448Z .loc 1 97 28 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6800744Z mov.b64 {%r1464, %r1465}, %rd400; 2026-02-21T09:10:47.6800923Z cvt.rn.bf16x2.f32 %r1466, %r1465, %r1464; 2026-02-21T09:10:47.6801102Z $L__tmp5: 2026-02-21T09:10:47.6801378Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6801746Z cvt.u64.u32 %rd401, %r1429; 2026-02-21T09:10:47.6801903Z cvt.u64.u32 %rd402, %r1430; 2026-02-21T09:10:47.6802064Z shl.b64 %rd403, %rd402, 32; 2026-02-21T09:10:47.6802220Z or.b64 %rd404, %rd401, %rd403; 2026-02-21T09:10:47.6802382Z $L__tmp6: 2026-02-21T09:10:47.6802626Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6802916Z mov.b64 {%r1467, %r1468}, %rd404; 2026-02-21T09:10:47.6803102Z cvt.rn.bf16x2.f32 %r1469, %r1468, %r1467; 2026-02-21T09:10:47.6803275Z $L__tmp7: 2026-02-21T09:10:47.6803561Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6803883Z cvt.u64.u32 %rd405, %r1431; 2026-02-21T09:10:47.6804047Z cvt.u64.u32 %rd406, %r1432; 2026-02-21T09:10:47.6804201Z shl.b64 %rd407, %rd406, 32; 2026-02-21T09:10:47.6804433Z or.b64 %rd408, %rd405, %rd407; 2026-02-21T09:10:47.6804593Z $L__tmp8: 2026-02-21T09:10:47.6804826Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6805115Z mov.b64 {%r1470, %r1471}, %rd408; 2026-02-21T09:10:47.6805292Z cvt.rn.bf16x2.f32 %r1472, %r1471, %r1470; 2026-02-21T09:10:47.6805475Z $L__tmp9: 2026-02-21T09:10:47.6805760Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6806157Z cvt.u64.u32 %rd409, %r1433; 2026-02-21T09:10:47.6806333Z cvt.u64.u32 %rd410, %r1434; 2026-02-21T09:10:47.6806486Z shl.b64 %rd411, %rd410, 32; 2026-02-21T09:10:47.6806647Z or.b64 %rd412, %rd409, %rd411; 2026-02-21T09:10:47.6806801Z 
$L__tmp10: 2026-02-21T09:10:47.6807039Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6807319Z mov.b64 {%r1473, %r1474}, %rd412; 2026-02-21T09:10:47.6807503Z cvt.rn.bf16x2.f32 %r1475, %r1474, %r1473; 2026-02-21T09:10:47.6807673Z $L__tmp11: 2026-02-21T09:10:47.6807987Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6808325Z cvt.u64.u32 %rd413, %r1435; 2026-02-21T09:10:47.6808479Z cvt.u64.u32 %rd414, %r1436; 2026-02-21T09:10:47.6808640Z shl.b64 %rd415, %rd414, 32; 2026-02-21T09:10:47.6808829Z or.b64 %rd416, %rd413, %rd415; 2026-02-21T09:10:47.6808994Z $L__tmp12: 2026-02-21T09:10:47.6809230Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6809521Z mov.b64 {%r1476, %r1477}, %rd416; 2026-02-21T09:10:47.6809697Z cvt.rn.bf16x2.f32 %r1478, %r1477, %r1476; 2026-02-21T09:10:47.6809875Z $L__tmp13: 2026-02-21T09:10:47.6810158Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6810485Z cvt.u64.u32 %rd417, %r1437; 2026-02-21T09:10:47.6810646Z cvt.u64.u32 %rd418, %r1438; 2026-02-21T09:10:47.6810798Z shl.b64 %rd419, %rd418, 32; 2026-02-21T09:10:47.6810960Z or.b64 %rd420, %rd417, %rd419; 2026-02-21T09:10:47.6811112Z $L__tmp14: 2026-02-21T09:10:47.6811348Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6811654Z mov.b64 {%r1479, %r1480}, %rd420; 2026-02-21T09:10:47.6811834Z cvt.rn.bf16x2.f32 %r1481, %r1480, %r1479; 2026-02-21T09:10:47.6812009Z $L__tmp15: 2026-02-21T09:10:47.6812288Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6812621Z cvt.u64.u32 %rd421, %r1439; 2026-02-21T09:10:47.6812776Z cvt.u64.u32 %rd422, %r1440; 2026-02-21T09:10:47.6812933Z shl.b64 %rd423, %rd422, 32; 2026-02-21T09:10:47.6813086Z or.b64 
%rd424, %rd421, %rd423; 2026-02-21T09:10:47.6813248Z $L__tmp16: 2026-02-21T09:10:47.6813485Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6813775Z mov.b64 {%r1482, %r1483}, %rd424; 2026-02-21T09:10:47.6813955Z cvt.rn.bf16x2.f32 %r1484, %r1483, %r1482; 2026-02-21T09:10:47.6814126Z $L__tmp17: 2026-02-21T09:10:47.6814413Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6814743Z cvt.u64.u32 %rd425, %r1441; 2026-02-21T09:10:47.6814905Z cvt.u64.u32 %rd426, %r1442; 2026-02-21T09:10:47.6815056Z shl.b64 %rd427, %rd426, 32; 2026-02-21T09:10:47.6815220Z or.b64 %rd428, %rd425, %rd427; 2026-02-21T09:10:47.6815380Z $L__tmp18: 2026-02-21T09:10:47.6815615Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6815908Z mov.b64 {%r1485, %r1486}, %rd428; 2026-02-21T09:10:47.6816087Z cvt.rn.bf16x2.f32 %r1487, %r1486, %r1485; 2026-02-21T09:10:47.6816297Z $L__tmp19: 2026-02-21T09:10:47.6816583Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6816913Z cvt.u64.u32 %rd429, %r1444; 2026-02-21T09:10:47.6817067Z cvt.u64.u32 %rd430, %r1445; 2026-02-21T09:10:47.6817225Z shl.b64 %rd431, %rd430, 32; 2026-02-21T09:10:47.6817385Z or.b64 %rd432, %rd429, %rd431; 2026-02-21T09:10:47.6817539Z $L__tmp20: 2026-02-21T09:10:47.6817817Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6818104Z mov.b64 {%r1488, %r1489}, %rd432; 2026-02-21T09:10:47.6818285Z cvt.rn.bf16x2.f32 %r1490, %r1489, %r1488; 2026-02-21T09:10:47.6818454Z $L__tmp21: 2026-02-21T09:10:47.6818741Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6819069Z cvt.u64.u32 %rd433, %r1446; 2026-02-21T09:10:47.6819225Z cvt.u64.u32 %rd434, %r1447; 2026-02-21T09:10:47.6819386Z shl.b64 
%rd435, %rd434, 32; 2026-02-21T09:10:47.6819567Z or.b64 %rd436, %rd433, %rd435; 2026-02-21T09:10:47.6819730Z $L__tmp22: 2026-02-21T09:10:47.6819963Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6820255Z mov.b64 {%r1491, %r1492}, %rd436; 2026-02-21T09:10:47.6820429Z cvt.rn.bf16x2.f32 %r1493, %r1492, %r1491; 2026-02-21T09:10:47.6820631Z $L__tmp23: 2026-02-21T09:10:47.6820912Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6821229Z cvt.u64.u32 %rd437, %r1448; 2026-02-21T09:10:47.6821391Z cvt.u64.u32 %rd438, %r1449; 2026-02-21T09:10:47.6821577Z shl.b64 %rd439, %rd438, 32; 2026-02-21T09:10:47.6821743Z or.b64 %rd440, %rd437, %rd439; 2026-02-21T09:10:47.6821897Z $L__tmp24: 2026-02-21T09:10:47.6822139Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6822422Z mov.b64 {%r1494, %r1495}, %rd440; 2026-02-21T09:10:47.6822605Z cvt.rn.bf16x2.f32 %r1496, %r1495, %r1494; 2026-02-21T09:10:47.6822781Z $L__tmp25: 2026-02-21T09:10:47.6823060Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6823392Z cvt.u64.u32 %rd441, %r1450; 2026-02-21T09:10:47.6823547Z cvt.u64.u32 %rd442, %r1451; 2026-02-21T09:10:47.6823709Z shl.b64 %rd443, %rd442, 32; 2026-02-21T09:10:47.6823864Z or.b64 %rd444, %rd441, %rd443; 2026-02-21T09:10:47.6824026Z $L__tmp26: 2026-02-21T09:10:47.6824261Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6824550Z mov.b64 {%r1497, %r1498}, %rd444; 2026-02-21T09:10:47.6824733Z cvt.rn.bf16x2.f32 %r1499, %r1498, %r1497; 2026-02-21T09:10:47.6824905Z $L__tmp27: 2026-02-21T09:10:47.6825192Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6825527Z cvt.u64.u32 %rd445, %r1452; 2026-02-21T09:10:47.6825691Z cvt.u64.u32 
%rd446, %r1453; 2026-02-21T09:10:47.6825844Z shl.b64 %rd447, %rd446, 32; 2026-02-21T09:10:47.6826008Z or.b64 %rd448, %rd445, %rd447; 2026-02-21T09:10:47.6826170Z $L__tmp28: 2026-02-21T09:10:47.6826414Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6826726Z mov.b64 {%r1500, %r1501}, %rd448; 2026-02-21T09:10:47.6826912Z cvt.rn.bf16x2.f32 %r1502, %r1501, %r1500; 2026-02-21T09:10:47.6827104Z $L__tmp29: 2026-02-21T09:10:47.6827400Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6827751Z cvt.u64.u32 %rd449, %r1454; 2026-02-21T09:10:47.6827912Z cvt.u64.u32 %rd450, %r1455; 2026-02-21T09:10:47.6828080Z shl.b64 %rd451, %rd450, 32; 2026-02-21T09:10:47.6828275Z or.b64 %rd452, %rd449, %rd451; 2026-02-21T09:10:47.6828433Z $L__tmp30: 2026-02-21T09:10:47.6828685Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6828986Z mov.b64 {%r1503, %r1504}, %rd452; 2026-02-21T09:10:47.6829174Z cvt.rn.bf16x2.f32 %r1505, %r1504, %r1503; 2026-02-21T09:10:47.6829351Z $L__tmp31: 2026-02-21T09:10:47.6829646Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6830015Z cvt.u64.u32 %rd453, %r1456; 2026-02-21T09:10:47.6830176Z cvt.u64.u32 %rd454, %r1457; 2026-02-21T09:10:47.6830342Z shl.b64 %rd455, %rd454, 32; 2026-02-21T09:10:47.6830504Z or.b64 %rd456, %rd453, %rd455; 2026-02-21T09:10:47.6830669Z $L__tmp32: 2026-02-21T09:10:47.6830910Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6831213Z mov.b64 {%r1506, %r1507}, %rd456; 2026-02-21T09:10:47.6831393Z cvt.rn.bf16x2.f32 %r1508, %r1507, %r1506; 2026-02-21T09:10:47.6831615Z $L__tmp33: 2026-02-21T09:10:47.6831939Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6832275Z cvt.u64.u32 
%rd457, %r1458; 2026-02-21T09:10:47.6832443Z cvt.u64.u32 %rd458, %r1459; 2026-02-21T09:10:47.6832600Z shl.b64 %rd459, %rd458, 32; 2026-02-21T09:10:47.6832800Z or.b64 %rd460, %rd457, %rd459; 2026-02-21T09:10:47.6832963Z $L__tmp34: 2026-02-21T09:10:47.6833213Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6833507Z mov.b64 {%r1509, %r1510}, %rd460; 2026-02-21T09:10:47.6833694Z cvt.rn.bf16x2.f32 %r1511, %r1510, %r1509; 2026-02-21T09:10:47.6833994Z .loc 1 98 43 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:98:43 2026-02-21T09:10:47.6834321Z st.shared.v4.b32 [%r40], {%r1466, %r1469, %r1472, %r1475}; 2026-02-21T09:10:47.6834587Z st.shared.v4.b32 [%r41], {%r1478, %r1481, %r1484, %r1487}; 2026-02-21T09:10:47.6834820Z st.shared.v4.b32 [%r42], {%r1490, %r1493, %r1496, %r1499}; 2026-02-21T09:10:47.6835055Z st.shared.v4.b32 [%r43], {%r1502, %r1505, %r1508, %r1511}; 2026-02-21T09:10:47.6835249Z // begin inline asm 2026-02-21T09:10:47.6835417Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.6835591Z // end inline asm 2026-02-21T09:10:47.6835730Z bar.sync 0; 2026-02-21T09:10:47.6835879Z elect.sync %r1512|%p273, -1; 2026-02-21T09:10:47.6836043Z and.pred %p271, %p6, %p273; 2026-02-21T09:10:47.6836206Z // begin inline asm 2026-02-21T09:10:47.6836475Z @%p271 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd553, {%r1461, %r1462}], [%r201]; 2026-02-21T09:10:47.6836775Z // end inline asm 2026-02-21T09:10:47.6836923Z cp.async.bulk.commit_group; 2026-02-21T09:10:47.6837104Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:47.6837279Z bar.sync 0; 2026-02-21T09:10:47.6837527Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.6837826Z add.s32 %r2060, %r2060, 4; 2026-02-21T09:10:47.6837991Z setp.lt.s32 %p274, %r2060, %r2107; 2026-02-21T09:10:47.6838176Z @%p274 bra $L__BB0_2; 2026-02-21T09:10:47.6838329Z bra.uni $L__BB0_27; 2026-02-21T09:10:47.6838522Z $L__BB0_2: // 
=>This Loop Header: Depth=1 2026-02-21T09:10:47.6838760Z // Child Loop BB0_5 Depth 2 2026-02-21T09:10:47.6838992Z // Child Loop BB0_11 Depth 2 2026-02-21T09:10:47.6839223Z // Child Loop BB0_17 Depth 2 2026-02-21T09:10:47.6839438Z // Child Loop BB0_23 Depth 2 2026-02-21T09:10:47.6839741Z .loc 1 38 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:38:33 2026-02-21T09:10:47.6840051Z shr.u32 %r345, %r2060, 5; 2026-02-21T09:10:47.6840219Z and.b32 %r346, %r345, 67108848; 2026-02-21T09:10:47.6840488Z .loc 1 39 39 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:39 2026-02-21T09:10:47.6840774Z sub.s32 %r347, 128, %r346; 2026-02-21T09:10:47.6841038Z .loc 1 39 52 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:52 2026-02-21T09:10:47.6841314Z min.s32 %r348, %r347, 16; 2026-02-21T09:10:47.6841636Z .loc 1 40 45 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:45 2026-02-21T09:10:47.6841915Z and.b32 %r349, %r2060, 511; 2026-02-21T09:10:47.6842180Z .loc 1 41 51 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:41:51 2026-02-21T09:10:47.6842459Z div.s32 %r47, %r349, %r348; 2026-02-21T09:10:47.6842726Z .loc 1 40 64 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:64 2026-02-21T09:10:47.6843018Z mul.lo.s32 %r350, %r47, %r348; 2026-02-21T09:10:47.6843186Z sub.s32 %r351, %r349, %r350; 2026-02-21T09:10:47.6843473Z .loc 1 40 30 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:30 2026-02-21T09:10:47.6843752Z add.s32 %r352, %r351, %r346; 2026-02-21T09:10:47.6844015Z .loc 1 42 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:42:27 2026-02-21T09:10:47.6844292Z shl.b32 %r534, %r352, 6; 2026-02-21T09:10:47.6844577Z .loc 1 43 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:43:27 2026-02-21T09:10:47.6844867Z shl.b32 %r535, %r47, 7; 2026-02-21T09:10:47.6845124Z .loc 1 44 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:44:32 2026-02-21T09:10:47.6845408Z or.b32 %r353, 
%r535, %r7; 2026-02-21T09:10:47.6845564Z or.b32 %r354, %r535, %r8; 2026-02-21T09:10:47.6845831Z .loc 1 58 53 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:53 2026-02-21T09:10:47.6846110Z shl.b32 %r355, %r353, 10; 2026-02-21T09:10:47.6846271Z shl.b32 %r356, %r354, 10; 2026-02-21T09:10:47.6846427Z $L__tmp35: 2026-02-21T09:10:47.6846717Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6847077Z shfl.sync.idx.b32 %r50, %r6, 0, 31, -1; 2026-02-21T09:10:47.6847264Z shl.b32 %r357, %r50, 21; 2026-02-21T09:10:47.6847430Z and.b32 %r358, %r357, 6291456; 2026-02-21T09:10:47.6847600Z add.s32 %r359, %r358, %r1917; 2026-02-21T09:10:47.6847773Z and.b32 %r360, %r50, 4; 2026-02-21T09:10:47.6847929Z shl.b32 %r361, %r360, 3; 2026-02-21T09:10:47.6848094Z add.s32 %r1460, %r359, %r361; 2026-02-21T09:10:47.6848267Z mov.pred %p63, -1; 2026-02-21T09:10:47.6848418Z mov.b32 %r2062, 0; 2026-02-21T09:10:47.6848569Z // begin inline asm 2026-02-21T09:10:47.6848973Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1460 + 0], {%r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062}; 2026-02-21T09:10:47.6849410Z // end inline asm 2026-02-21T09:10:47.6849553Z // begin inline asm 2026-02-21T09:10:47.6849952Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1460 + 16], {%r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062, %r2062}; 2026-02-21T09:10:47.6850378Z // end inline asm 2026-02-21T09:10:47.6850523Z // begin inline asm 2026-02-21T09:10:47.6850691Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.6850856Z // end inline asm 2026-02-21T09:10:47.6851000Z bar.sync 0; 2026-02-21T09:10:47.6851133Z $L__tmp36: 2026-02-21T09:10:47.6851385Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6851713Z add.s32 %r2063, %r201, 26624; 
2026-02-21T09:10:47.6851878Z // begin inline asm 2026-02-21T09:10:47.6852052Z @%p9 mbarrier.init.shared::cta.b64 [%r2063], 1; 2026-02-21T09:10:47.6852292Z // end inline asm 2026-02-21T09:10:47.6852432Z bar.sync 0; 2026-02-21T09:10:47.6852566Z add.s32 %r1117, %r201, 26632; 2026-02-21T09:10:47.6852728Z // begin inline asm 2026-02-21T09:10:47.6852892Z @%p9 mbarrier.init.shared::cta.b64 [%r1117], 1; 2026-02-21T09:10:47.6853086Z // end inline asm 2026-02-21T09:10:47.6853222Z add.s32 %r1331, %r201, 26640; 2026-02-21T09:10:47.6853381Z // begin inline asm 2026-02-21T09:10:47.6853549Z @%p9 mbarrier.init.shared::cta.b64 [%r1331], 1; 2026-02-21T09:10:47.6853758Z // end inline asm 2026-02-21T09:10:47.6853898Z bar.sync 0; 2026-02-21T09:10:47.6854032Z add.s32 %r1115, %r201, 26648; 2026-02-21T09:10:47.6854194Z // begin inline asm 2026-02-21T09:10:47.6854353Z @%p9 mbarrier.init.shared::cta.b64 [%r1115], 1; 2026-02-21T09:10:47.6854544Z // end inline asm 2026-02-21T09:10:47.6854784Z .loc 1 58 60 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:60 2026-02-21T09:10:47.6855070Z or.b32 %r363, %r355, %r9; 2026-02-21T09:10:47.6855225Z or.b32 %r364, %r356, %r9; 2026-02-21T09:10:47.6855519Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.6855813Z mad.wide.s32 %rd85, %r363, 2, %rd41; 2026-02-21T09:10:47.6855994Z mad.wide.s32 %rd86, %r364, 2, %rd41; 2026-02-21T09:10:47.6856165Z mov.b32 %r401, 16; 2026-02-21T09:10:47.6856431Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6856715Z // begin inline asm 2026-02-21T09:10:47.6856917Z cp.async.cg.shared.global [ %r1327 + 0 ], [ %rd85 + 0 ], 0x10, %r401; 2026-02-21T09:10:47.6857147Z // end inline asm 2026-02-21T09:10:47.6857287Z // begin inline asm 2026-02-21T09:10:47.6857479Z cp.async.cg.shared.global [ %r1329 + 0 ], [ %rd86 + 0 ], 0x10, %r401; 2026-02-21T09:10:47.6857709Z // end inline asm 2026-02-21T09:10:47.6857853Z 
cp.async.commit_group; 2026-02-21T09:10:47.6858120Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6858398Z bar.sync 0; 2026-02-21T09:10:47.6858538Z // begin inline asm 2026-02-21T09:10:47.6858726Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1331], 1024; 2026-02-21T09:10:47.6858946Z // end inline asm 2026-02-21T09:10:47.6859190Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6859462Z bar.sync 0; 2026-02-21T09:10:47.6859614Z elect.sync %r365|%p58, -1; 2026-02-21T09:10:47.6859782Z and.pred %p52, %p6, %p58; 2026-02-21T09:10:47.6859945Z // begin inline asm 2026-02-21T09:10:47.6860275Z @%p52 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r240], [%rd463, {%r534, %r2062}], [%r1331]; 2026-02-21T09:10:47.6860648Z // end inline asm 2026-02-21T09:10:47.6860899Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.6861180Z cvt.s64.s32 %rd91, %r355; 2026-02-21T09:10:47.6861344Z cvt.u64.u32 %rd8, %r9; 2026-02-21T09:10:47.6861497Z or.b64 %rd92, %rd91, %rd8; 2026-02-21T09:10:47.6861689Z shl.b64 %rd93, %rd92, 1; 2026-02-21T09:10:47.6861844Z add.s64 %rd9, %rd41, %rd93; 2026-02-21T09:10:47.6862007Z add.s64 %rd88, %rd9, 64; 2026-02-21T09:10:47.6862159Z cvt.s64.s32 %rd94, %r356; 2026-02-21T09:10:47.6862322Z or.b64 %rd95, %rd94, %rd8; 2026-02-21T09:10:47.6862474Z shl.b64 %rd96, %rd95, 1; 2026-02-21T09:10:47.6862637Z add.s64 %rd10, %rd41, %rd96; 2026-02-21T09:10:47.6862802Z add.s64 %rd89, %rd10, 64; 2026-02-21T09:10:47.6863059Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6863347Z // begin inline asm 2026-02-21T09:10:47.6863546Z cp.async.cg.shared.global [ %r1202 + 0 ], [ %rd88 + 0 ], 0x10, %r401; 2026-02-21T09:10:47.6863775Z // end inline asm 2026-02-21T09:10:47.6863913Z // begin inline asm 2026-02-21T09:10:47.6864119Z cp.async.cg.shared.global [ %r1204 
+ 0 ], [ %rd89 + 0 ], 0x10, %r401; 2026-02-21T09:10:47.6864387Z // end inline asm 2026-02-21T09:10:47.6864532Z cp.async.commit_group; 2026-02-21T09:10:47.6864808Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6865093Z bar.sync 0; 2026-02-21T09:10:47.6865235Z // begin inline asm 2026-02-21T09:10:47.6865426Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1115], 1024; 2026-02-21T09:10:47.6865650Z // end inline asm 2026-02-21T09:10:47.6865934Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6866214Z bar.sync 0; 2026-02-21T09:10:47.6866357Z elect.sync %r366|%p59, -1; 2026-02-21T09:10:47.6866524Z and.pred %p54, %p6, %p59; 2026-02-21T09:10:47.6866687Z add.s32 %r322, %r201, 25600; 2026-02-21T09:10:47.6866839Z // begin inline asm 2026-02-21T09:10:47.6867171Z @%p54 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r322], [%rd463, {%r534, %r401}], [%r1115]; 2026-02-21T09:10:47.6867519Z // end inline asm 2026-02-21T09:10:47.6867797Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6868084Z cp.async.wait_group 1; 2026-02-21T09:10:47.6868246Z bar.sync 0; 2026-02-21T09:10:47.6868440Z ld.shared.v4.b32 {%r367, %r368, %r369, %r370}, [%r23]; 2026-02-21T09:10:47.6868656Z mov.b32 {%rs1, %rs2}, %r370; 2026-02-21T09:10:47.6868857Z mov.b32 {%rs3, %rs4}, %r369; 2026-02-21T09:10:47.6869023Z mov.b32 {%rs5, %rs6}, %r368; 2026-02-21T09:10:47.6869192Z mov.b32 {%rs7, %rs8}, %r367; 2026-02-21T09:10:47.6869389Z ld.shared.v4.b32 {%r371, %r372, %r373, %r374}, [%r23+16]; 2026-02-21T09:10:47.6869612Z mov.b32 {%rs9, %rs10}, %r374; 2026-02-21T09:10:47.6869782Z mov.b32 {%rs11, %rs12}, %r373; 2026-02-21T09:10:47.6869958Z mov.b32 {%rs13, %rs14}, %r372; 2026-02-21T09:10:47.6870133Z mov.b32 {%rs15, %rs16}, %r371; 2026-02-21T09:10:47.6870410Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 
2026-02-21T09:10:47.6870714Z cvt.f32.bf16 %r327, %rs7; 2026-02-21T09:10:47.6870878Z cvt.f32.bf16 %r328, %rs8; 2026-02-21T09:10:47.6871046Z cvt.f32.bf16 %r329, %rs5; 2026-02-21T09:10:47.6871204Z cvt.f32.bf16 %r330, %rs6; 2026-02-21T09:10:47.6871365Z cvt.f32.bf16 %r331, %rs3; 2026-02-21T09:10:47.6871519Z cvt.f32.bf16 %r332, %rs4; 2026-02-21T09:10:47.6871739Z cvt.f32.bf16 %r333, %rs1; 2026-02-21T09:10:47.6871903Z cvt.f32.bf16 %r334, %rs2; 2026-02-21T09:10:47.6872056Z cvt.f32.bf16 %r335, %rs15; 2026-02-21T09:10:47.6872224Z cvt.f32.bf16 %r336, %rs16; 2026-02-21T09:10:47.6872382Z cvt.f32.bf16 %r337, %rs13; 2026-02-21T09:10:47.6872548Z cvt.f32.bf16 %r338, %rs14; 2026-02-21T09:10:47.6872704Z cvt.f32.bf16 %r339, %rs11; 2026-02-21T09:10:47.6872868Z cvt.f32.bf16 %r340, %rs12; 2026-02-21T09:10:47.6873023Z cvt.f32.bf16 %r341, %rs9; 2026-02-21T09:10:47.6873184Z cvt.f32.bf16 %r342, %rs10; 2026-02-21T09:10:47.6873339Z $L__tmp37: 2026-02-21T09:10:47.6873647Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6873995Z add.s32 %r375, %r358, %r1918; 2026-02-21T09:10:47.6874163Z shl.b32 %r376, %r360, 2; 2026-02-21T09:10:47.6874332Z add.s32 %r1211, %r375, %r376; 2026-02-21T09:10:47.6874494Z // begin inline asm 2026-02-21T09:10:47.6874882Z @%p63 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1211 + 0], {%r327, %r328, %r329, %r330, %r331, %r332, %r333, %r334, %r335, %r336, %r337, %r338, %r339, %r340, %r341, %r342}; 2026-02-21T09:10:47.6875289Z // end inline asm 2026-02-21T09:10:47.6875438Z // begin inline asm 2026-02-21T09:10:47.6875604Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.6875772Z // end inline asm 2026-02-21T09:10:47.6875915Z bar.sync 0; 2026-02-21T09:10:47.6876047Z $L__tmp38: 2026-02-21T09:10:47.6876305Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6876647Z // begin inline asm 2026-02-21T09:10:47.6876787Z 2026-02-21T09:10:47.6876902Z { 
2026-02-21T09:10:47.6877033Z .reg .pred complete; 2026-02-21T09:10:47.6877176Z waitLoop: 2026-02-21T09:10:47.6877373Z mbarrier.try_wait.parity.shared.b64 complete, [%r1331], %r2062; 2026-02-21T09:10:47.6877619Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.6877773Z } 2026-02-21T09:10:47.6877840Z 2026-02-21T09:10:47.6877903Z // end inline asm 2026-02-21T09:10:47.6878146Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6878461Z ld.shared.b8 %rs17, [%r25]; 2026-02-21T09:10:47.6878626Z ld.shared.b8 %rs18, [%r25+512]; 2026-02-21T09:10:47.6878801Z ld.shared.b8 %rs19, [%r27+128]; 2026-02-21T09:10:47.6878963Z ld.shared.b8 %rs20, [%r27+640]; 2026-02-21T09:10:47.6879134Z ld.shared.b8 %rs21, [%r29+256]; 2026-02-21T09:10:47.6879301Z ld.shared.b8 %rs22, [%r29+768]; 2026-02-21T09:10:47.6879459Z ld.shared.b8 %rs23, [%r31+384]; 2026-02-21T09:10:47.6879626Z ld.shared.b8 %rs24, [%r31+896]; 2026-02-21T09:10:47.6879917Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.6880205Z shl.b16 %rs25, %rs17, 4; 2026-02-21T09:10:47.6880366Z shl.b16 %rs26, %rs19, 4; 2026-02-21T09:10:47.6880535Z shl.b16 %rs27, %rs21, 4; 2026-02-21T09:10:47.6880694Z shl.b16 %rs28, %rs23, 4; 2026-02-21T09:10:47.6880881Z shl.b16 %rs29, %rs18, 4; 2026-02-21T09:10:47.6881038Z shl.b16 %rs30, %rs20, 4; 2026-02-21T09:10:47.6881183Z shl.b16 %rs31, %rs22, 4; 2026-02-21T09:10:47.6881334Z shl.b16 %rs32, %rs24, 4; 2026-02-21T09:10:47.6881613Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.6881908Z selp.b16 %rs33, %rs25, %rs17, %p346; 2026-02-21T09:10:47.6882082Z cvt.s16.s8 %rs34, %rs33; 2026-02-21T09:10:47.6882239Z shr.s16 %rs35, %rs34, 4; 2026-02-21T09:10:47.6882394Z selp.b16 %rs36, %rs26, %rs19, %p346; 2026-02-21T09:10:47.6882572Z cvt.s16.s8 %rs37, %rs36; 2026-02-21T09:10:47.6882727Z shr.s16 %rs38, %rs37, 4; 2026-02-21T09:10:47.6882882Z selp.b16 %rs39, %rs27, %rs21, %p346; 
2026-02-21T09:10:47.6883058Z cvt.s16.s8 %rs40, %rs39; 2026-02-21T09:10:47.6883205Z shr.s16 %rs41, %rs40, 4; 2026-02-21T09:10:47.6883366Z selp.b16 %rs42, %rs28, %rs23, %p346; 2026-02-21T09:10:47.6883532Z cvt.s16.s8 %rs43, %rs42; 2026-02-21T09:10:47.6883687Z shr.s16 %rs44, %rs43, 4; 2026-02-21T09:10:47.6883840Z selp.b16 %rs45, %rs29, %rs18, %p346; 2026-02-21T09:10:47.6884015Z cvt.s16.s8 %rs46, %rs45; 2026-02-21T09:10:47.6884161Z shr.s16 %rs47, %rs46, 4; 2026-02-21T09:10:47.6884320Z selp.b16 %rs48, %rs30, %rs20, %p346; 2026-02-21T09:10:47.6884495Z cvt.s16.s8 %rs49, %rs48; 2026-02-21T09:10:47.6884643Z shr.s16 %rs50, %rs49, 4; 2026-02-21T09:10:47.6884802Z selp.b16 %rs51, %rs31, %rs22, %p346; 2026-02-21T09:10:47.6884968Z cvt.s16.s8 %rs52, %rs51; 2026-02-21T09:10:47.6885120Z shr.s16 %rs53, %rs52, 4; 2026-02-21T09:10:47.6885273Z selp.b16 %rs54, %rs32, %rs24, %p346; 2026-02-21T09:10:47.6885444Z cvt.s16.s8 %rs55, %rs54; 2026-02-21T09:10:47.6885587Z shr.s16 %rs56, %rs55, 4; 2026-02-21T09:10:47.6885843Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.6886131Z cvt.rn.f32.s16 %r377, %rs35; 2026-02-21T09:10:47.6886292Z cvt.rn.f32.s16 %r378, %rs38; 2026-02-21T09:10:47.6886457Z cvt.rn.f32.s16 %r379, %rs41; 2026-02-21T09:10:47.6886613Z cvt.rn.f32.s16 %r380, %rs44; 2026-02-21T09:10:47.6886771Z cvt.rn.f32.s16 %r381, %rs47; 2026-02-21T09:10:47.6886925Z cvt.rn.f32.s16 %r382, %rs50; 2026-02-21T09:10:47.6887082Z cvt.rn.f32.s16 %r383, %rs53; 2026-02-21T09:10:47.6887233Z cvt.rn.f32.s16 %r384, %rs56; 2026-02-21T09:10:47.6887395Z st.shared.b32 [%r32], %r377; 2026-02-21T09:10:47.6887558Z st.shared.b32 [%r33], %r378; 2026-02-21T09:10:47.6887710Z st.shared.b32 [%r34], %r379; 2026-02-21T09:10:47.6887870Z st.shared.b32 [%r35], %r380; 2026-02-21T09:10:47.6888070Z st.shared.b32 [%r36], %r381; 2026-02-21T09:10:47.6888232Z st.shared.b32 [%r37], %r382; 2026-02-21T09:10:47.6888386Z st.shared.b32 [%r38], %r383; 2026-02-21T09:10:47.6888547Z st.shared.b32 
[%r39], %r384; 2026-02-21T09:10:47.6888692Z $L__tmp39: 2026-02-21T09:10:47.6888990Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6889325Z // begin inline asm 2026-02-21T09:10:47.6889525Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.6889700Z // end inline asm 2026-02-21T09:10:47.6889834Z bar.sync 0; 2026-02-21T09:10:47.6889982Z setp.ne.b32 %p60, %r50, 0; 2026-02-21T09:10:47.6890140Z @%p60 bra $L__BB0_4; 2026-02-21T09:10:47.6890336Z // %bb.3: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.6890556Z elect.sync %r397|%p62, -1; 2026-02-21T09:10:47.6890722Z mov.b32 %r387, 135268624; 2026-02-21T09:10:47.6890874Z mov.pred %p61, 0; 2026-02-21T09:10:47.6891024Z // begin inline asm 2026-02-21T09:10:47.6891308Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd379, %r387, %p61; 2026-02-21T09:10:47.6891614Z // end inline asm 2026-02-21T09:10:47.6891759Z // begin inline asm 2026-02-21T09:10:47.6891991Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 8 ], %rd380, %r387, %p63; 2026-02-21T09:10:47.6892262Z // end inline asm 2026-02-21T09:10:47.6892443Z // begin inline asm 2026-02-21T09:10:47.6892686Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd381, %r387, %p63; 2026-02-21T09:10:47.6892950Z // end inline asm 2026-02-21T09:10:47.6893085Z // begin inline asm 2026-02-21T09:10:47.6893320Z @%p62 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd382, %r387, %p63; 2026-02-21T09:10:47.6893578Z // end inline asm 2026-02-21T09:10:47.6893725Z add.s32 %r399, %r201, 26624; 2026-02-21T09:10:47.6893883Z cvt.u64.u32 %rd101, %r399; 2026-02-21T09:10:47.6894046Z // begin inline asm 2026-02-21T09:10:47.6894252Z @%p62 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd101]; 2026-02-21T09:10:47.6894487Z // end inline asm 2026-02-21T09:10:47.6894625Z $L__tmp40: 2026-02-21T09:10:47.6894796Z $L__BB0_4: // in 
Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.6895126Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.6895420Z add.s64 %rd102, %rd9, 128; 2026-02-21T09:10:47.6895588Z add.s64 %rd103, %rd10, 128; 2026-02-21T09:10:47.6895846Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6896128Z // begin inline asm 2026-02-21T09:10:47.6896336Z cp.async.cg.shared.global [ %r1327 + 0 ], [ %rd102 + 0 ], 0x10, %r401; 2026-02-21T09:10:47.6896558Z // end inline asm 2026-02-21T09:10:47.6896698Z // begin inline asm 2026-02-21T09:10:47.6896890Z cp.async.cg.shared.global [ %r1329 + 0 ], [ %rd103 + 0 ], 0x10, %r401; 2026-02-21T09:10:47.6897117Z // end inline asm 2026-02-21T09:10:47.6897258Z cp.async.commit_group; 2026-02-21T09:10:47.6897525Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6897806Z // begin inline asm 2026-02-21T09:10:47.6898001Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1331], 1024; 2026-02-21T09:10:47.6898219Z // end inline asm 2026-02-21T09:10:47.6898456Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6898741Z bar.sync 0; 2026-02-21T09:10:47.6898881Z elect.sync %r413|%p73, -1; 2026-02-21T09:10:47.6899055Z and.pred %p71, %p6, %p73; 2026-02-21T09:10:47.6899208Z mov.b32 %r407, 32; 2026-02-21T09:10:47.6899356Z // begin inline asm 2026-02-21T09:10:47.6899676Z @%p71 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r240], [%rd463, {%r534, %r407}], [%r1331]; 2026-02-21T09:10:47.6900074Z // end inline asm 2026-02-21T09:10:47.6900327Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6900617Z shl.b32 %r414, %r47, 17; 2026-02-21T09:10:47.6900775Z or.b32 %r415, %r44, %r414; 2026-02-21T09:10:47.6900936Z mad.wide.s32 %rd618, %r415, 2, %rd7; 2026-02-21T09:10:47.6901116Z or.b32 %r2061, %r45, %r414; 
2026-02-21T09:10:47.6901268Z mov.b32 %r2066, 1; 2026-02-21T09:10:47.6901412Z mov.b64 %rd619, 0; 2026-02-21T09:10:47.6901618Z mov.b32 %r2064, %r2062; 2026-02-21T09:10:47.6901772Z mov.b32 %r2065, %r2062; 2026-02-21T09:10:47.6901928Z mov.b32 %r2067, %r2062; 2026-02-21T09:10:47.6902073Z bra.uni $L__BB0_5; 2026-02-21T09:10:47.6902261Z $L__BB0_7: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:10:47.6902580Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6902879Z setp.lt.u64 %p89, %rd619, 464; 2026-02-21T09:10:47.6903039Z $L__tmp41: 2026-02-21T09:10:47.6903361Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6903695Z add.s32 %r490, %r2066, 1; 2026-02-21T09:10:47.6903854Z setp.gt.s32 %p92, %r490, 1; 2026-02-21T09:10:47.6904028Z selp.b32 %r2066, 0, %r490, %p92; 2026-02-21T09:10:47.6904198Z selp.b32 %r491, 1, 0, %p92; 2026-02-21T09:10:47.6904392Z xor.b32 %r68, %r2067, %r491; 2026-02-21T09:10:47.6904546Z $L__tmp42: 2026-02-21T09:10:47.6904789Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.6905081Z mad.wide.s32 %rd112, %r2061, 2, %rd41; 2026-02-21T09:10:47.6905373Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6905662Z add.s32 %r481, %r62, %r12; 2026-02-21T09:10:47.6905819Z selp.b32 %r482, 16, 0, %p89; 2026-02-21T09:10:47.6905980Z // begin inline asm 2026-02-21T09:10:47.6906183Z cp.async.cg.shared.global [ %r481 + 0 ], [ %rd618 + 0 ], 0x10, %r482; 2026-02-21T09:10:47.6906413Z // end inline asm 2026-02-21T09:10:47.6906551Z add.s32 %r483, %r481, 4096; 2026-02-21T09:10:47.6906708Z // begin inline asm 2026-02-21T09:10:47.6906904Z cp.async.cg.shared.global [ %r483 + 0 ], [ %rd112 + 0 ], 0x10, %r482; 2026-02-21T09:10:47.6907130Z // end inline asm 2026-02-21T09:10:47.6907276Z cp.async.commit_group; 2026-02-21T09:10:47.6907538Z .loc 1 51 125 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6907837Z and.pred %p87, %p9, %p89; 2026-02-21T09:10:47.6907993Z // begin inline asm 2026-02-21T09:10:47.6908189Z @%p87 mbarrier.arrive.expect_tx.shared.b64 _, [%r485], 1024; 2026-02-21T09:10:47.6908403Z // end inline asm 2026-02-21T09:10:47.6908657Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6908944Z bar.sync 0; 2026-02-21T09:10:47.6909083Z elect.sync %r492|%p93, -1; 2026-02-21T09:10:47.6909253Z and.pred %p94, %p89, %p93; 2026-02-21T09:10:47.6909414Z and.pred %p88, %p6, %p94; 2026-02-21T09:10:47.6909579Z cvt.u32.u64 %r493, %rd619; 2026-02-21T09:10:47.6909734Z add.s32 %r488, %r493, 48; 2026-02-21T09:10:47.6909890Z // begin inline asm 2026-02-21T09:10:47.6910216Z @%p88 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r486], [%rd463, {%r534, %r488}], [%r485]; 2026-02-21T09:10:47.6910575Z // end inline asm 2026-02-21T09:10:47.6910833Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6911120Z add.s64 %rd618, %rd618, 64; 2026-02-21T09:10:47.6911284Z add.s32 %r2061, %r2061, 32; 2026-02-21T09:10:47.6911444Z setp.lt.u64 %p95, %rd619, 480; 2026-02-21T09:10:47.6911655Z add.s64 %rd619, %rd619, 16; 2026-02-21T09:10:47.6911807Z mov.b32 %r2062, %r2067; 2026-02-21T09:10:47.6911962Z mov.b32 %r2067, %r68; 2026-02-21T09:10:47.6912138Z @%p95 bra $L__BB0_5; 2026-02-21T09:10:47.6912293Z bra.uni $L__BB0_8; 2026-02-21T09:10:47.6912493Z $L__BB0_5: // Parent Loop BB0_2 Depth=1 2026-02-21T09:10:47.6912752Z // => This Inner Loop Header: Depth=2 2026-02-21T09:10:47.6912983Z add.s32 %r437, %r2065, 1; 2026-02-21T09:10:47.6913146Z setp.gt.s32 %p77, %r437, 1; 2026-02-21T09:10:47.6913329Z selp.b32 %r2065, 0, %r437, %p77; 2026-02-21T09:10:47.6913644Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6913948Z cp.async.wait_group 1; 
2026-02-21T09:10:47.6914106Z bar.sync 0; 2026-02-21T09:10:47.6914254Z shl.b32 %r438, %r2065, 13; 2026-02-21T09:10:47.6914421Z add.s32 %r62, %r201, %r438; 2026-02-21T09:10:47.6914581Z add.s32 %r440, %r62, %r22; 2026-02-21T09:10:47.6914781Z ld.shared.v4.b32 {%r441, %r442, %r443, %r444}, [%r440]; 2026-02-21T09:10:47.6914994Z mov.b32 {%rs57, %rs58}, %r444; 2026-02-21T09:10:47.6915172Z mov.b32 {%rs59, %rs60}, %r443; 2026-02-21T09:10:47.6915335Z mov.b32 {%rs61, %rs62}, %r442; 2026-02-21T09:10:47.6915529Z mov.b32 {%rs63, %rs64}, %r441; 2026-02-21T09:10:47.6915733Z ld.shared.v4.b32 {%r445, %r446, %r447, %r448}, [%r440+16]; 2026-02-21T09:10:47.6915952Z mov.b32 {%rs65, %rs66}, %r448; 2026-02-21T09:10:47.6916120Z mov.b32 {%rs67, %rs68}, %r447; 2026-02-21T09:10:47.6916311Z mov.b32 {%rs69, %rs70}, %r446; 2026-02-21T09:10:47.6916484Z mov.b32 {%rs71, %rs72}, %r445; 2026-02-21T09:10:47.6916755Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 2026-02-21T09:10:47.6917058Z cvt.f32.bf16 %r419, %rs63; 2026-02-21T09:10:47.6917218Z cvt.f32.bf16 %r420, %rs64; 2026-02-21T09:10:47.6917383Z cvt.f32.bf16 %r421, %rs61; 2026-02-21T09:10:47.6917539Z cvt.f32.bf16 %r422, %rs62; 2026-02-21T09:10:47.6917702Z cvt.f32.bf16 %r423, %rs59; 2026-02-21T09:10:47.6917856Z cvt.f32.bf16 %r424, %rs60; 2026-02-21T09:10:47.6918019Z cvt.f32.bf16 %r425, %rs57; 2026-02-21T09:10:47.6918183Z cvt.f32.bf16 %r426, %rs58; 2026-02-21T09:10:47.6918338Z cvt.f32.bf16 %r427, %rs71; 2026-02-21T09:10:47.6918503Z cvt.f32.bf16 %r428, %rs72; 2026-02-21T09:10:47.6918659Z cvt.f32.bf16 %r429, %rs69; 2026-02-21T09:10:47.6918822Z cvt.f32.bf16 %r430, %rs70; 2026-02-21T09:10:47.6918977Z cvt.f32.bf16 %r431, %rs67; 2026-02-21T09:10:47.6919142Z cvt.f32.bf16 %r432, %rs68; 2026-02-21T09:10:47.6919299Z cvt.f32.bf16 %r433, %rs65; 2026-02-21T09:10:47.6919466Z cvt.f32.bf16 %r434, %rs66; 2026-02-21T09:10:47.6919618Z $L__tmp43: 2026-02-21T09:10:47.6919935Z .loc 2 291 36 // standard.py:291:36 @[ 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6920268Z // begin inline asm 2026-02-21T09:10:47.6920403Z 2026-02-21T09:10:47.6920524Z { 2026-02-21T09:10:47.6920648Z .reg .pred complete; 2026-02-21T09:10:47.6920798Z waitLoop: 2026-02-21T09:10:47.6920983Z mbarrier.try_wait.parity.shared.b64 complete, [%r2063], %r2062; 2026-02-21T09:10:47.6921226Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.6921378Z } 2026-02-21T09:10:47.6921452Z 2026-02-21T09:10:47.6921508Z // end inline asm 2026-02-21T09:10:47.6921682Z $L__tmp44: 2026-02-21T09:10:47.6921919Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6922211Z selp.b32 %r449, 1, 0, %p77; 2026-02-21T09:10:47.6922370Z xor.b32 %r2064, %r2064, %r449; 2026-02-21T09:10:47.6922537Z mov.pred %p78, -1; 2026-02-21T09:10:47.6922676Z $L__tmp45: 2026-02-21T09:10:47.6922962Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6923277Z // begin inline asm 2026-02-21T09:10:47.6923646Z @%p78 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1211 + 0], {%r419, %r420, %r421, %r422, %r423, %r424, %r425, %r426, %r427, %r428, %r429, %r430, %r431, %r432, %r433, %r434}; 2026-02-21T09:10:47.6924073Z // end inline asm 2026-02-21T09:10:47.6924208Z // begin inline asm 2026-02-21T09:10:47.6924372Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.6924532Z // end inline asm 2026-02-21T09:10:47.6924674Z bar.sync 0; 2026-02-21T09:10:47.6924798Z $L__tmp46: 2026-02-21T09:10:47.6925041Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6925333Z shl.b32 %r450, %r2065, 3; 2026-02-21T09:10:47.6925488Z add.s32 %r451, %r201, %r450; 2026-02-21T09:10:47.6925680Z add.s32 %r485, %r451, 26640; 2026-02-21T09:10:47.6925833Z // begin inline asm 2026-02-21T09:10:47.6925971Z 2026-02-21T09:10:47.6926083Z { 2026-02-21T09:10:47.6926209Z .reg .pred complete; 
2026-02-21T09:10:47.6926352Z waitLoop: 2026-02-21T09:10:47.6926546Z mbarrier.try_wait.parity.shared.b64 complete, [%r485], %r2064; 2026-02-21T09:10:47.6926778Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.6926936Z } 2026-02-21T09:10:47.6927000Z 2026-02-21T09:10:47.6927063Z // end inline asm 2026-02-21T09:10:47.6927336Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6927628Z shl.b32 %r452, %r2065, 10; 2026-02-21T09:10:47.6927786Z add.s32 %r453, %r201, %r452; 2026-02-21T09:10:47.6927948Z add.s32 %r486, %r453, 24576; 2026-02-21T09:10:47.6928103Z add.s32 %r454, %r486, %r24; 2026-02-21T09:10:47.6928266Z ld.shared.b8 %rs73, [%r454]; 2026-02-21T09:10:47.6928457Z ld.shared.b8 %rs74, [%r454+512]; 2026-02-21T09:10:47.6928634Z add.s32 %r455, %r486, %r26; 2026-02-21T09:10:47.6928799Z ld.shared.b8 %rs75, [%r455+128]; 2026-02-21T09:10:47.6928969Z ld.shared.b8 %rs76, [%r455+640]; 2026-02-21T09:10:47.6929141Z add.s32 %r456, %r486, %r28; 2026-02-21T09:10:47.6929296Z ld.shared.b8 %rs77, [%r456+256]; 2026-02-21T09:10:47.6929475Z ld.shared.b8 %rs78, [%r456+768]; 2026-02-21T09:10:47.6929640Z add.s32 %r457, %r486, %r30; 2026-02-21T09:10:47.6929804Z ld.shared.b8 %rs79, [%r457+384]; 2026-02-21T09:10:47.6929967Z ld.shared.b8 %rs80, [%r457+896]; 2026-02-21T09:10:47.6930242Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.6930532Z shl.b16 %rs81, %rs73, 4; 2026-02-21T09:10:47.6930687Z shl.b16 %rs82, %rs75, 4; 2026-02-21T09:10:47.6930845Z shl.b16 %rs83, %rs77, 4; 2026-02-21T09:10:47.6930994Z shl.b16 %rs84, %rs79, 4; 2026-02-21T09:10:47.6931149Z shl.b16 %rs85, %rs74, 4; 2026-02-21T09:10:47.6931296Z shl.b16 %rs86, %rs76, 4; 2026-02-21T09:10:47.6931448Z shl.b16 %rs87, %rs78, 4; 2026-02-21T09:10:47.6931625Z shl.b16 %rs88, %rs80, 4; 2026-02-21T09:10:47.6931887Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.6932185Z selp.b16 %rs89, %rs81, %rs73, 
%p346; 2026-02-21T09:10:47.6932358Z cvt.s16.s8 %rs90, %rs89; 2026-02-21T09:10:47.6932513Z shr.s16 %rs91, %rs90, 4; 2026-02-21T09:10:47.6932669Z selp.b16 %rs92, %rs82, %rs75, %p346; 2026-02-21T09:10:47.6932850Z cvt.s16.s8 %rs93, %rs92; 2026-02-21T09:10:47.6932997Z shr.s16 %rs94, %rs93, 4; 2026-02-21T09:10:47.6933159Z selp.b16 %rs95, %rs83, %rs77, %p346; 2026-02-21T09:10:47.6933327Z cvt.s16.s8 %rs96, %rs95; 2026-02-21T09:10:47.6933481Z shr.s16 %rs97, %rs96, 4; 2026-02-21T09:10:47.6933632Z selp.b16 %rs98, %rs84, %rs79, %p346; 2026-02-21T09:10:47.6933806Z cvt.s16.s8 %rs99, %rs98; 2026-02-21T09:10:47.6933961Z shr.s16 %rs100, %rs99, 4; 2026-02-21T09:10:47.6934123Z selp.b16 %rs101, %rs85, %rs74, %p346; 2026-02-21T09:10:47.6934305Z cvt.s16.s8 %rs102, %rs101; 2026-02-21T09:10:47.6934460Z shr.s16 %rs103, %rs102, 4; 2026-02-21T09:10:47.6934628Z selp.b16 %rs104, %rs86, %rs76, %p346; 2026-02-21T09:10:47.6934799Z cvt.s16.s8 %rs105, %rs104; 2026-02-21T09:10:47.6934863Z shr.s16 %rs106, %rs105, 4; 2026-02-21T09:10:47.6934927Z selp.b16 %rs107, %rs87, %rs78, %p346; 2026-02-21T09:10:47.6934986Z cvt.s16.s8 %rs108, %rs107; 2026-02-21T09:10:47.6935045Z shr.s16 %rs109, %rs108, 4; 2026-02-21T09:10:47.6935144Z selp.b16 %rs110, %rs88, %rs80, %p346; 2026-02-21T09:10:47.6935203Z cvt.s16.s8 %rs111, %rs110; 2026-02-21T09:10:47.6935261Z shr.s16 %rs112, %rs111, 4; 2026-02-21T09:10:47.6935439Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.6935502Z cvt.rn.f32.s16 %r458, %rs91; 2026-02-21T09:10:47.6935562Z cvt.rn.f32.s16 %r459, %rs94; 2026-02-21T09:10:47.6935630Z cvt.rn.f32.s16 %r460, %rs97; 2026-02-21T09:10:47.6935717Z cvt.rn.f32.s16 %r461, %rs100; 2026-02-21T09:10:47.6935780Z cvt.rn.f32.s16 %r462, %rs103; 2026-02-21T09:10:47.6935840Z cvt.rn.f32.s16 %r463, %rs106; 2026-02-21T09:10:47.6935906Z cvt.rn.f32.s16 %r464, %rs109; 2026-02-21T09:10:47.6935965Z cvt.rn.f32.s16 %r465, %rs112; 2026-02-21T09:10:47.6936025Z st.shared.b32 [%r32], %r458; 
2026-02-21T09:10:47.6936093Z st.shared.b32 [%r33], %r459; 2026-02-21T09:10:47.6936152Z st.shared.b32 [%r34], %r460; 2026-02-21T09:10:47.6936210Z st.shared.b32 [%r35], %r461; 2026-02-21T09:10:47.6936271Z st.shared.b32 [%r36], %r462; 2026-02-21T09:10:47.6936336Z st.shared.b32 [%r37], %r463; 2026-02-21T09:10:47.6936431Z st.shared.b32 [%r38], %r464; 2026-02-21T09:10:47.6936492Z st.shared.b32 [%r39], %r465; 2026-02-21T09:10:47.6936673Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6936733Z shl.b32 %r466, %r2066, 3; 2026-02-21T09:10:47.6936816Z add.s32 %r467, %r201, %r466; 2026-02-21T09:10:47.6936886Z add.s32 %r2063, %r467, 26624; 2026-02-21T09:10:47.6936940Z $L__tmp47: 2026-02-21T09:10:47.6937156Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6937214Z // begin inline asm 2026-02-21T09:10:47.6937296Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.6937352Z // end inline asm 2026-02-21T09:10:47.6937407Z bar.sync 0; 2026-02-21T09:10:47.6937473Z @%p60 bra $L__BB0_7; 2026-02-21T09:10:47.6937575Z // %bb.6: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:10:47.6937642Z elect.sync %r480|%p79, -1; 2026-02-21T09:10:47.6937702Z mov.b32 %r470, 135268624; 2026-02-21T09:10:47.6937770Z // begin inline asm 2026-02-21T09:10:47.6937936Z @%p79 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd379, %r470, %p78; 2026-02-21T09:10:47.6937994Z // end inline asm 2026-02-21T09:10:47.6938064Z // begin inline asm 2026-02-21T09:10:47.6938219Z @%p79 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 8 ], %rd380, %r470, %p78; 2026-02-21T09:10:47.6938274Z // end inline asm 2026-02-21T09:10:47.6938337Z // begin inline asm 2026-02-21T09:10:47.6938484Z @%p79 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd381, %r470, %p78; 2026-02-21T09:10:47.6938538Z // end inline asm 2026-02-21T09:10:47.6938601Z // begin inline asm 
2026-02-21T09:10:47.6938745Z @%p79 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd382, %r470, %p78; 2026-02-21T09:10:47.6938801Z // end inline asm 2026-02-21T09:10:47.6938863Z cvt.u64.u32 %rd110, %r2063; 2026-02-21T09:10:47.6938927Z // begin inline asm 2026-02-21T09:10:47.6939051Z @%p79 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd110]; 2026-02-21T09:10:47.6939104Z // end inline asm 2026-02-21T09:10:47.6939166Z bra.uni $L__BB0_7; 2026-02-21T09:10:47.6939266Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.6939356Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:47.6939417Z mov.b32 %r2073, 1; 2026-02-21T09:10:47.6939631Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6939686Z // begin inline asm 2026-02-21T09:10:47.6939737Z 2026-02-21T09:10:47.6939796Z { 2026-02-21T09:10:47.6939856Z .reg .pred complete; 2026-02-21T09:10:47.6939910Z waitLoop: 2026-02-21T09:10:47.6940036Z mbarrier.try_wait.parity.shared.b64 complete, [%r2063], %r2073; 2026-02-21T09:10:47.6940123Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.6940175Z } 2026-02-21T09:10:47.6940178Z 2026-02-21T09:10:47.6940233Z // end inline asm 2026-02-21T09:10:47.6940292Z $L__tmp48: 2026-02-21T09:10:47.6940463Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6940527Z cp.async.wait_group 0; 2026-02-21T09:10:47.6940591Z bar.sync 0; 2026-02-21T09:10:47.6940670Z // begin inline asm 2026-02-21T09:10:47.6940756Z @%p9 mbarrier.inval.shared::cta.b64 [%r1331]; 2026-02-21T09:10:47.6940817Z // end inline asm 2026-02-21T09:10:47.6940872Z bar.sync 0; 2026-02-21T09:10:47.6940928Z // begin inline asm 2026-02-21T09:10:47.6941010Z @%p9 mbarrier.inval.shared::cta.b64 [%r1115]; 2026-02-21T09:10:47.6941071Z // end inline asm 2026-02-21T09:10:47.6941131Z add.s32 %r2070, %r201, 26624; 2026-02-21T09:10:47.6941185Z // begin inline asm 
2026-02-21T09:10:47.6941269Z @%p9 mbarrier.inval.shared::cta.b64 [%r2070]; 2026-02-21T09:10:47.6941322Z // end inline asm 2026-02-21T09:10:47.6941376Z bar.sync 0; 2026-02-21T09:10:47.6941452Z // begin inline asm 2026-02-21T09:10:47.6941568Z @%p9 mbarrier.inval.shared::cta.b64 [%r1117]; 2026-02-21T09:10:47.6941624Z // end inline asm 2026-02-21T09:10:47.6941677Z $L__tmp49: 2026-02-21T09:10:47.6941922Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6941979Z // begin inline asm 2026-02-21T09:10:47.6942251Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r500, %r501, %r502, %r503, %r504, %r505, %r506, %r507, %r508, %r509, %r510, %r511, %r512, %r513, %r514, %r515}, [%r1460 + 0]; 2026-02-21T09:10:47.6942311Z // end inline asm 2026-02-21T09:10:47.6942364Z // begin inline asm 2026-02-21T09:10:47.6942624Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r517, %r518, %r519, %r520, %r521, %r522, %r523, %r524, %r525, %r526, %r527, %r528, %r529, %r530, %r531, %r532}, [%r1460 + 16]; 2026-02-21T09:10:47.6942684Z // end inline asm 2026-02-21T09:10:47.6942740Z // begin inline asm 2026-02-21T09:10:47.6942807Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:47.6942862Z // end inline asm 2026-02-21T09:10:47.6942929Z cvt.u64.u32 %rd121, %r500; 2026-02-21T09:10:47.6942987Z cvt.u64.u32 %rd122, %r501; 2026-02-21T09:10:47.6943046Z shl.b64 %rd123, %rd122, 32; 2026-02-21T09:10:47.6943114Z or.b64 %rd124, %rd121, %rd123; 2026-02-21T09:10:47.6943166Z $L__tmp50: 2026-02-21T09:10:47.6943338Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6943400Z mov.b64 {%r612, %r613}, %rd124; 2026-02-21T09:10:47.6943476Z cvt.rn.bf16x2.f32 %r614, %r613, %r612; 2026-02-21T09:10:47.6943527Z $L__tmp51: 2026-02-21T09:10:47.6943743Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6943811Z cvt.u64.u32 %rd125, %r502; 
2026-02-21T09:10:47.6943871Z cvt.u64.u32 %rd126, %r503; 2026-02-21T09:10:47.6943928Z shl.b64 %rd127, %rd126, 32; 2026-02-21T09:10:47.6943995Z or.b64 %rd128, %rd125, %rd127; 2026-02-21T09:10:47.6944045Z $L__tmp52: 2026-02-21T09:10:47.6944213Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6944275Z mov.b64 {%r615, %r616}, %rd128; 2026-02-21T09:10:47.6944348Z cvt.rn.bf16x2.f32 %r617, %r616, %r615; 2026-02-21T09:10:47.6944399Z $L__tmp53: 2026-02-21T09:10:47.6944611Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6944675Z cvt.u64.u32 %rd129, %r504; 2026-02-21T09:10:47.6944733Z cvt.u64.u32 %rd130, %r505; 2026-02-21T09:10:47.6944792Z shl.b64 %rd131, %rd130, 32; 2026-02-21T09:10:47.6944851Z or.b64 %rd132, %rd129, %rd131; 2026-02-21T09:10:47.6944910Z $L__tmp54: 2026-02-21T09:10:47.6945073Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6945157Z mov.b64 {%r618, %r619}, %rd132; 2026-02-21T09:10:47.6945234Z cvt.rn.bf16x2.f32 %r620, %r619, %r618; 2026-02-21T09:10:47.6945288Z $L__tmp55: 2026-02-21T09:10:47.6945490Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6945555Z cvt.u64.u32 %rd133, %r506; 2026-02-21T09:10:47.6945614Z cvt.u64.u32 %rd134, %r507; 2026-02-21T09:10:47.6945697Z shl.b64 %rd135, %rd134, 32; 2026-02-21T09:10:47.6945755Z or.b64 %rd136, %rd133, %rd135; 2026-02-21T09:10:47.6945813Z $L__tmp56: 2026-02-21T09:10:47.6945977Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6946036Z mov.b64 {%r621, %r622}, %rd136; 2026-02-21T09:10:47.6946114Z cvt.rn.bf16x2.f32 %r623, %r622, %r621; 2026-02-21T09:10:47.6946162Z $L__tmp57: 2026-02-21T09:10:47.6946368Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 
2026-02-21T09:10:47.6946459Z cvt.u64.u32 %rd137, %r508; 2026-02-21T09:10:47.6946519Z cvt.u64.u32 %rd138, %r509; 2026-02-21T09:10:47.6946576Z shl.b64 %rd139, %rd138, 32; 2026-02-21T09:10:47.6946632Z or.b64 %rd140, %rd137, %rd139; 2026-02-21T09:10:47.6946690Z $L__tmp58: 2026-02-21T09:10:47.6946873Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6946934Z mov.b64 {%r624, %r625}, %rd140; 2026-02-21T09:10:47.6947001Z cvt.rn.bf16x2.f32 %r626, %r625, %r624; 2026-02-21T09:10:47.6947049Z $L__tmp59: 2026-02-21T09:10:47.6947255Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6947316Z cvt.u64.u32 %rd141, %r510; 2026-02-21T09:10:47.6947373Z cvt.u64.u32 %rd142, %r511; 2026-02-21T09:10:47.6947429Z shl.b64 %rd143, %rd142, 32; 2026-02-21T09:10:47.6947486Z or.b64 %rd144, %rd141, %rd143; 2026-02-21T09:10:47.6947541Z $L__tmp60: 2026-02-21T09:10:47.6947704Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6947762Z mov.b64 {%r627, %r628}, %rd144; 2026-02-21T09:10:47.6947830Z cvt.rn.bf16x2.f32 %r629, %r628, %r627; 2026-02-21T09:10:47.6947878Z $L__tmp61: 2026-02-21T09:10:47.6948083Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6948141Z cvt.u64.u32 %rd145, %r512; 2026-02-21T09:10:47.6948204Z cvt.u64.u32 %rd146, %r513; 2026-02-21T09:10:47.6948260Z shl.b64 %rd147, %rd146, 32; 2026-02-21T09:10:47.6948316Z or.b64 %rd148, %rd145, %rd147; 2026-02-21T09:10:47.6948367Z $L__tmp62: 2026-02-21T09:10:47.6948528Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6948585Z mov.b64 {%r630, %r631}, %rd148; 2026-02-21T09:10:47.6948655Z cvt.rn.bf16x2.f32 %r632, %r631, %r630; 2026-02-21T09:10:47.6948705Z $L__tmp63: 2026-02-21T09:10:47.6948914Z .loc 2 291 36 // standard.py:291:36 @[ 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6948971Z cvt.u64.u32 %rd149, %r514; 2026-02-21T09:10:47.6949034Z cvt.u64.u32 %rd150, %r515; 2026-02-21T09:10:47.6949091Z shl.b64 %rd151, %rd150, 32; 2026-02-21T09:10:47.6949149Z or.b64 %rd152, %rd149, %rd151; 2026-02-21T09:10:47.6949206Z $L__tmp64: 2026-02-21T09:10:47.6949363Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6949420Z mov.b64 {%r633, %r634}, %rd152; 2026-02-21T09:10:47.6949487Z cvt.rn.bf16x2.f32 %r635, %r634, %r633; 2026-02-21T09:10:47.6949538Z $L__tmp65: 2026-02-21T09:10:47.6949740Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6949818Z cvt.u64.u32 %rd153, %r517; 2026-02-21T09:10:47.6949878Z cvt.u64.u32 %rd154, %r518; 2026-02-21T09:10:47.6949935Z shl.b64 %rd155, %rd154, 32; 2026-02-21T09:10:47.6949991Z or.b64 %rd156, %rd153, %rd155; 2026-02-21T09:10:47.6950045Z $L__tmp66: 2026-02-21T09:10:47.6950210Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6950266Z mov.b64 {%r636, %r637}, %rd156; 2026-02-21T09:10:47.6950332Z cvt.rn.bf16x2.f32 %r638, %r637, %r636; 2026-02-21T09:10:47.6950405Z $L__tmp67: 2026-02-21T09:10:47.6950605Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6950661Z cvt.u64.u32 %rd157, %r519; 2026-02-21T09:10:47.6950720Z cvt.u64.u32 %rd158, %r520; 2026-02-21T09:10:47.6950774Z shl.b64 %rd159, %rd158, 32; 2026-02-21T09:10:47.6950830Z or.b64 %rd160, %rd157, %rd159; 2026-02-21T09:10:47.6950883Z $L__tmp68: 2026-02-21T09:10:47.6951041Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6951121Z mov.b64 {%r639, %r640}, %rd160; 2026-02-21T09:10:47.6951186Z cvt.rn.bf16x2.f32 %r641, %r640, %r639; 2026-02-21T09:10:47.6951238Z $L__tmp69: 2026-02-21T09:10:47.6951442Z .loc 2 
291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6951517Z cvt.u64.u32 %rd161, %r521; 2026-02-21T09:10:47.6951616Z cvt.u64.u32 %rd162, %r522; 2026-02-21T09:10:47.6951673Z shl.b64 %rd163, %rd162, 32; 2026-02-21T09:10:47.6951729Z or.b64 %rd164, %rd161, %rd163; 2026-02-21T09:10:47.6951782Z $L__tmp70: 2026-02-21T09:10:47.6951943Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6951998Z mov.b64 {%r642, %r643}, %rd164; 2026-02-21T09:10:47.6952061Z cvt.rn.bf16x2.f32 %r644, %r643, %r642; 2026-02-21T09:10:47.6952116Z $L__tmp71: 2026-02-21T09:10:47.6952323Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6952378Z cvt.u64.u32 %rd165, %r523; 2026-02-21T09:10:47.6952437Z cvt.u64.u32 %rd166, %r524; 2026-02-21T09:10:47.6952493Z shl.b64 %rd167, %rd166, 32; 2026-02-21T09:10:47.6952550Z or.b64 %rd168, %rd165, %rd167; 2026-02-21T09:10:47.6952604Z $L__tmp72: 2026-02-21T09:10:47.6952767Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6952826Z mov.b64 {%r645, %r646}, %rd168; 2026-02-21T09:10:47.6952887Z cvt.rn.bf16x2.f32 %r647, %r646, %r645; 2026-02-21T09:10:47.6952942Z $L__tmp73: 2026-02-21T09:10:47.6953147Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6953203Z cvt.u64.u32 %rd169, %r525; 2026-02-21T09:10:47.6953265Z cvt.u64.u32 %rd170, %r526; 2026-02-21T09:10:47.6953323Z shl.b64 %rd171, %rd170, 32; 2026-02-21T09:10:47.6953379Z or.b64 %rd172, %rd169, %rd171; 2026-02-21T09:10:47.6953429Z $L__tmp74: 2026-02-21T09:10:47.6953593Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6953652Z mov.b64 {%r648, %r649}, %rd172; 2026-02-21T09:10:47.6953715Z cvt.rn.bf16x2.f32 %r650, %r649, %r648; 2026-02-21T09:10:47.6953770Z $L__tmp75: 
2026-02-21T09:10:47.6953977Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6954035Z cvt.u64.u32 %rd173, %r527; 2026-02-21T09:10:47.6954092Z cvt.u64.u32 %rd174, %r528; 2026-02-21T09:10:47.6954150Z shl.b64 %rd175, %rd174, 32; 2026-02-21T09:10:47.6954207Z or.b64 %rd176, %rd173, %rd175; 2026-02-21T09:10:47.6954256Z $L__tmp76: 2026-02-21T09:10:47.6954425Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6954512Z mov.b64 {%r651, %r652}, %rd176; 2026-02-21T09:10:47.6954576Z cvt.rn.bf16x2.f32 %r653, %r652, %r651; 2026-02-21T09:10:47.6954637Z $L__tmp77: 2026-02-21T09:10:47.6954841Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6954897Z cvt.u64.u32 %rd177, %r529; 2026-02-21T09:10:47.6954958Z cvt.u64.u32 %rd178, %r530; 2026-02-21T09:10:47.6955016Z shl.b64 %rd179, %rd178, 32; 2026-02-21T09:10:47.6955110Z or.b64 %rd180, %rd177, %rd179; 2026-02-21T09:10:47.6955161Z $L__tmp78: 2026-02-21T09:10:47.6955328Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6955384Z mov.b64 {%r654, %r655}, %rd180; 2026-02-21T09:10:47.6955445Z cvt.rn.bf16x2.f32 %r656, %r655, %r654; 2026-02-21T09:10:47.6955502Z $L__tmp79: 2026-02-21T09:10:47.6955704Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6955762Z cvt.u64.u32 %rd181, %r531; 2026-02-21T09:10:47.6955846Z cvt.u64.u32 %rd182, %r532; 2026-02-21T09:10:47.6955904Z shl.b64 %rd183, %rd182, 32; 2026-02-21T09:10:47.6955961Z or.b64 %rd184, %rd181, %rd183; 2026-02-21T09:10:47.6956011Z $L__tmp80: 2026-02-21T09:10:47.6956172Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.6956254Z mov.b64 {%r657, %r658}, %rd184; 2026-02-21T09:10:47.6956321Z cvt.rn.bf16x2.f32 %r659, %r658, %r657; 
2026-02-21T09:10:47.6956486Z .loc 1 98 43 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:98:43 2026-02-21T09:10:47.6956576Z st.shared.v4.b32 [%r40], {%r614, %r617, %r620, %r623}; 2026-02-21T09:10:47.6956667Z st.shared.v4.b32 [%r41], {%r626, %r629, %r632, %r635}; 2026-02-21T09:10:47.6956762Z st.shared.v4.b32 [%r42], {%r638, %r641, %r644, %r647}; 2026-02-21T09:10:47.6956847Z st.shared.v4.b32 [%r43], {%r650, %r653, %r656, %r659}; 2026-02-21T09:10:47.6956906Z // begin inline asm 2026-02-21T09:10:47.6956981Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.6957041Z // end inline asm 2026-02-21T09:10:47.6957095Z bar.sync 0; 2026-02-21T09:10:47.6957163Z elect.sync %r660|%p115, -1; 2026-02-21T09:10:47.6957231Z and.pred %p100, %p6, %p115; 2026-02-21T09:10:47.6957288Z // begin inline asm 2026-02-21T09:10:47.6957478Z @%p100 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd553, {%r534, %r535}], [%r201]; 2026-02-21T09:10:47.6957533Z // end inline asm 2026-02-21T09:10:47.6957604Z cp.async.bulk.commit_group; 2026-02-21T09:10:47.6957676Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:47.6957731Z bar.sync 0; 2026-02-21T09:10:47.6957919Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.6957980Z add.s32 %r661, %r2060, 1; 2026-02-21T09:10:47.6958149Z .loc 1 38 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:38:33 2026-02-21T09:10:47.6958216Z shr.u32 %r662, %r661, 5; 2026-02-21T09:10:47.6958279Z and.b32 %r663, %r662, 67108848; 2026-02-21T09:10:47.6958447Z .loc 1 39 39 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:39 2026-02-21T09:10:47.6958507Z sub.s32 %r664, 128, %r663; 2026-02-21T09:10:47.6958683Z .loc 1 39 52 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:52 2026-02-21T09:10:47.6958744Z min.s32 %r665, %r664, 16; 2026-02-21T09:10:47.6958915Z .loc 1 40 45 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:45 2026-02-21T09:10:47.6958978Z and.b32 %r666, 
%r661, 511; 2026-02-21T09:10:47.6959143Z .loc 1 41 51 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:41:51 2026-02-21T09:10:47.6959201Z div.s32 %r70, %r666, %r665; 2026-02-21T09:10:47.6959372Z .loc 1 40 64 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:64 2026-02-21T09:10:47.6959457Z mul.lo.s32 %r667, %r70, %r665; 2026-02-21T09:10:47.6959520Z sub.s32 %r668, %r666, %r667; 2026-02-21T09:10:47.6959695Z .loc 1 40 30 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:30 2026-02-21T09:10:47.6959756Z add.s32 %r669, %r668, %r663; 2026-02-21T09:10:47.6959932Z .loc 1 42 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:42:27 2026-02-21T09:10:47.6959995Z shl.b32 %r843, %r669, 6; 2026-02-21T09:10:47.6960189Z .loc 1 43 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:43:27 2026-02-21T09:10:47.6960248Z shl.b32 %r844, %r70, 7; 2026-02-21T09:10:47.6960421Z .loc 1 44 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:44:32 2026-02-21T09:10:47.6960484Z or.b32 %r670, %r844, %r7; 2026-02-21T09:10:47.6960542Z or.b32 %r671, %r844, %r8; 2026-02-21T09:10:47.6960713Z .loc 1 58 53 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:53 2026-02-21T09:10:47.6960775Z shl.b32 %r672, %r670, 10; 2026-02-21T09:10:47.6960853Z shl.b32 %r673, %r671, 10; 2026-02-21T09:10:47.6960915Z mov.pred %p120, -1; 2026-02-21T09:10:47.6960972Z mov.b32 %r2069, 0; 2026-02-21T09:10:47.6961033Z $L__tmp81: 2026-02-21T09:10:47.6961274Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6961336Z // begin inline asm 2026-02-21T09:10:47.6961716Z @%p120 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1460 + 0], {%r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069}; 2026-02-21T09:10:47.6961777Z // end inline asm 2026-02-21T09:10:47.6961838Z // begin inline asm 2026-02-21T09:10:47.6962167Z @%p120 
tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1460 + 16], {%r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069, %r2069}; 2026-02-21T09:10:47.6962227Z // end inline asm 2026-02-21T09:10:47.6962289Z // begin inline asm 2026-02-21T09:10:47.6962370Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.6962426Z // end inline asm 2026-02-21T09:10:47.6962484Z bar.sync 0; 2026-02-21T09:10:47.6962539Z $L__tmp82: 2026-02-21T09:10:47.6962726Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6962785Z // begin inline asm 2026-02-21T09:10:47.6962875Z @%p9 mbarrier.init.shared::cta.b64 [%r2070], 1; 2026-02-21T09:10:47.6962936Z // end inline asm 2026-02-21T09:10:47.6962989Z bar.sync 0; 2026-02-21T09:10:47.6963044Z // begin inline asm 2026-02-21T09:10:47.6963129Z @%p9 mbarrier.init.shared::cta.b64 [%r1117], 1; 2026-02-21T09:10:47.6963189Z // end inline asm 2026-02-21T09:10:47.6963245Z // begin inline asm 2026-02-21T09:10:47.6963323Z @%p9 mbarrier.init.shared::cta.b64 [%r1331], 1; 2026-02-21T09:10:47.6963384Z // end inline asm 2026-02-21T09:10:47.6963441Z bar.sync 0; 2026-02-21T09:10:47.6963498Z // begin inline asm 2026-02-21T09:10:47.6963584Z @%p9 mbarrier.init.shared::cta.b64 [%r1115], 1; 2026-02-21T09:10:47.6963638Z // end inline asm 2026-02-21T09:10:47.6963807Z .loc 1 58 60 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:60 2026-02-21T09:10:47.6963865Z or.b32 %r674, %r672, %r9; 2026-02-21T09:10:47.6963928Z or.b32 %r675, %r673, %r9; 2026-02-21T09:10:47.6964100Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.6964170Z mad.wide.s32 %rd115, %r674, 2, %rd41; 2026-02-21T09:10:47.6964245Z mad.wide.s32 %rd116, %r675, 2, %rd41; 2026-02-21T09:10:47.6964299Z mov.b32 %r710, 16; 2026-02-21T09:10:47.6964468Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6964530Z // 
begin inline asm 2026-02-21T09:10:47.6964659Z cp.async.cg.shared.global [ %r1327 + 0 ], [ %rd115 + 0 ], 0x10, %r710; 2026-02-21T09:10:47.6964753Z // end inline asm 2026-02-21T09:10:47.6964805Z // begin inline asm 2026-02-21T09:10:47.6964926Z cp.async.cg.shared.global [ %r1329 + 0 ], [ %rd116 + 0 ], 0x10, %r710; 2026-02-21T09:10:47.6964976Z // end inline asm 2026-02-21T09:10:47.6965038Z cp.async.commit_group; 2026-02-21T09:10:47.6965211Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6965286Z bar.sync 0; 2026-02-21T09:10:47.6965341Z // begin inline asm 2026-02-21T09:10:47.6965448Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1331], 1024; 2026-02-21T09:10:47.6965507Z // end inline asm 2026-02-21T09:10:47.6965664Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6965718Z bar.sync 0; 2026-02-21T09:10:47.6965791Z elect.sync %r676|%p116, -1; 2026-02-21T09:10:47.6965856Z and.pred %p108, %p6, %p116; 2026-02-21T09:10:47.6965914Z // begin inline asm 2026-02-21T09:10:47.6966191Z @%p108 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r240], [%rd463, {%r843, %r2069}], [%r1331]; 2026-02-21T09:10:47.6966247Z // end inline asm 2026-02-21T09:10:47.6966408Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.6966478Z cvt.s64.s32 %rd185, %r672; 2026-02-21T09:10:47.6966564Z or.b64 %rd186, %rd185, %rd8; 2026-02-21T09:10:47.6966626Z shl.b64 %rd187, %rd186, 1; 2026-02-21T09:10:47.6966689Z add.s64 %rd16, %rd41, %rd187; 2026-02-21T09:10:47.6966756Z add.s64 %rd118, %rd16, 64; 2026-02-21T09:10:47.6966815Z cvt.s64.s32 %rd188, %r673; 2026-02-21T09:10:47.6966875Z or.b64 %rd189, %rd188, %rd8; 2026-02-21T09:10:47.6966942Z shl.b64 %rd190, %rd189, 1; 2026-02-21T09:10:47.6967002Z add.s64 %rd17, %rd41, %rd190; 2026-02-21T09:10:47.6967060Z add.s64 %rd119, %rd17, 64; 2026-02-21T09:10:47.6967223Z .loc 1 58 80 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6967289Z // begin inline asm 2026-02-21T09:10:47.6967405Z cp.async.cg.shared.global [ %r1202 + 0 ], [ %rd118 + 0 ], 0x10, %r710; 2026-02-21T09:10:47.6967461Z // end inline asm 2026-02-21T09:10:47.6967524Z // begin inline asm 2026-02-21T09:10:47.6967639Z cp.async.cg.shared.global [ %r1204 + 0 ], [ %rd119 + 0 ], 0x10, %r710; 2026-02-21T09:10:47.6967695Z // end inline asm 2026-02-21T09:10:47.6967766Z cp.async.commit_group; 2026-02-21T09:10:47.6967935Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6967989Z bar.sync 0; 2026-02-21T09:10:47.6968046Z // begin inline asm 2026-02-21T09:10:47.6968161Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1115], 1024; 2026-02-21T09:10:47.6968215Z // end inline asm 2026-02-21T09:10:47.6968379Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6968441Z bar.sync 0; 2026-02-21T09:10:47.6968505Z elect.sync %r677|%p117, -1; 2026-02-21T09:10:47.6968570Z and.pred %p110, %p6, %p117; 2026-02-21T09:10:47.6968625Z // begin inline asm 2026-02-21T09:10:47.6968871Z @%p110 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r322], [%rd463, {%r843, %r710}], [%r1115]; 2026-02-21T09:10:47.6968924Z // end inline asm 2026-02-21T09:10:47.6969091Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6969162Z cp.async.wait_group 1; 2026-02-21T09:10:47.6969215Z bar.sync 0; 2026-02-21T09:10:47.6969306Z ld.shared.v4.b32 {%r678, %r679, %r680, %r681}, [%r23]; 2026-02-21T09:10:47.6969377Z mov.b32 {%rs113, %rs114}, %r681; 2026-02-21T09:10:47.6969439Z mov.b32 {%rs115, %rs116}, %r680; 2026-02-21T09:10:47.6969498Z mov.b32 {%rs117, %rs118}, %r679; 2026-02-21T09:10:47.6969555Z mov.b32 {%rs119, %rs120}, %r678; 2026-02-21T09:10:47.6969657Z ld.shared.v4.b32 {%r682, %r683, %r684, %r685}, [%r23+16]; 
2026-02-21T09:10:47.6969749Z mov.b32 {%rs121, %rs122}, %r685; 2026-02-21T09:10:47.6969808Z mov.b32 {%rs123, %rs124}, %r684; 2026-02-21T09:10:47.6969873Z mov.b32 {%rs125, %rs126}, %r683; 2026-02-21T09:10:47.6969930Z mov.b32 {%rs127, %rs128}, %r682; 2026-02-21T09:10:47.6970089Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 2026-02-21T09:10:47.6970160Z cvt.f32.bf16 %r594, %rs119; 2026-02-21T09:10:47.6970242Z cvt.f32.bf16 %r595, %rs120; 2026-02-21T09:10:47.6970302Z cvt.f32.bf16 %r596, %rs117; 2026-02-21T09:10:47.6970359Z cvt.f32.bf16 %r597, %rs118; 2026-02-21T09:10:47.6970424Z cvt.f32.bf16 %r598, %rs115; 2026-02-21T09:10:47.6970482Z cvt.f32.bf16 %r599, %rs116; 2026-02-21T09:10:47.6970539Z cvt.f32.bf16 %r600, %rs113; 2026-02-21T09:10:47.6970603Z cvt.f32.bf16 %r601, %rs114; 2026-02-21T09:10:47.6970660Z cvt.f32.bf16 %r602, %rs127; 2026-02-21T09:10:47.6970717Z cvt.f32.bf16 %r603, %rs128; 2026-02-21T09:10:47.6970775Z cvt.f32.bf16 %r604, %rs125; 2026-02-21T09:10:47.6970840Z cvt.f32.bf16 %r605, %rs126; 2026-02-21T09:10:47.6970917Z cvt.f32.bf16 %r606, %rs123; 2026-02-21T09:10:47.6970978Z cvt.f32.bf16 %r607, %rs124; 2026-02-21T09:10:47.6971042Z cvt.f32.bf16 %r608, %rs121; 2026-02-21T09:10:47.6971099Z cvt.f32.bf16 %r609, %rs122; 2026-02-21T09:10:47.6971152Z $L__tmp83: 2026-02-21T09:10:47.6971386Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6971445Z // begin inline asm 2026-02-21T09:10:47.6971762Z @%p120 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1211 + 0], {%r594, %r595, %r596, %r597, %r598, %r599, %r600, %r601, %r602, %r603, %r604, %r605, %r606, %r607, %r608, %r609}; 2026-02-21T09:10:47.6971827Z // end inline asm 2026-02-21T09:10:47.6971883Z // begin inline asm 2026-02-21T09:10:47.6971954Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.6972010Z // end inline asm 2026-02-21T09:10:47.6972073Z bar.sync 0; 2026-02-21T09:10:47.6972128Z $L__tmp84: 
2026-02-21T09:10:47.6972302Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6972367Z // begin inline asm 2026-02-21T09:10:47.6972420Z 2026-02-21T09:10:47.6972480Z { 2026-02-21T09:10:47.6972543Z .reg .pred complete; 2026-02-21T09:10:47.6972607Z waitLoop: 2026-02-21T09:10:47.6972729Z mbarrier.try_wait.parity.shared.b64 complete, [%r1331], %r2069; 2026-02-21T09:10:47.6972796Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.6972857Z } 2026-02-21T09:10:47.6972860Z 2026-02-21T09:10:47.6972917Z // end inline asm 2026-02-21T09:10:47.6973084Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6973151Z ld.shared.b8 %rs129, [%r25]; 2026-02-21T09:10:47.6973230Z ld.shared.b8 %rs130, [%r25+512]; 2026-02-21T09:10:47.6973296Z ld.shared.b8 %rs131, [%r27+128]; 2026-02-21T09:10:47.6973360Z ld.shared.b8 %rs132, [%r27+640]; 2026-02-21T09:10:47.6973427Z ld.shared.b8 %rs133, [%r29+256]; 2026-02-21T09:10:47.6973488Z ld.shared.b8 %rs134, [%r29+768]; 2026-02-21T09:10:47.6973548Z ld.shared.b8 %rs135, [%r31+384]; 2026-02-21T09:10:47.6973614Z ld.shared.b8 %rs136, [%r31+896]; 2026-02-21T09:10:47.6973775Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.6973837Z shl.b16 %rs137, %rs129, 4; 2026-02-21T09:10:47.6973898Z shl.b16 %rs138, %rs131, 4; 2026-02-21T09:10:47.6973964Z shl.b16 %rs139, %rs133, 4; 2026-02-21T09:10:47.6974021Z shl.b16 %rs140, %rs135, 4; 2026-02-21T09:10:47.6974080Z shl.b16 %rs141, %rs130, 4; 2026-02-21T09:10:47.6974142Z shl.b16 %rs142, %rs132, 4; 2026-02-21T09:10:47.6974200Z shl.b16 %rs143, %rs134, 4; 2026-02-21T09:10:47.6974256Z shl.b16 %rs144, %rs136, 4; 2026-02-21T09:10:47.6974419Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.6974542Z selp.b16 %rs145, %rs137, %rs129, %p346; 2026-02-21T09:10:47.6974602Z cvt.s16.s8 %rs146, %rs145; 2026-02-21T09:10:47.6974660Z shr.s16 %rs147, 
%rs146, 4; 2026-02-21T09:10:47.6974736Z selp.b16 %rs148, %rs138, %rs131, %p346; 2026-02-21T09:10:47.6974794Z cvt.s16.s8 %rs149, %rs148; 2026-02-21T09:10:47.6974851Z shr.s16 %rs150, %rs149, 4; 2026-02-21T09:10:47.6974924Z selp.b16 %rs151, %rs139, %rs133, %p346; 2026-02-21T09:10:47.6974984Z cvt.s16.s8 %rs152, %rs151; 2026-02-21T09:10:47.6975066Z shr.s16 %rs153, %rs152, 4; 2026-02-21T09:10:47.6975132Z selp.b16 %rs154, %rs140, %rs135, %p346; 2026-02-21T09:10:47.6975199Z cvt.s16.s8 %rs155, %rs154; 2026-02-21T09:10:47.6975256Z shr.s16 %rs156, %rs155, 4; 2026-02-21T09:10:47.6975320Z selp.b16 %rs157, %rs141, %rs130, %p346; 2026-02-21T09:10:47.6975384Z cvt.s16.s8 %rs158, %rs157; 2026-02-21T09:10:47.6975440Z shr.s16 %rs159, %rs158, 4; 2026-02-21T09:10:47.6975504Z selp.b16 %rs160, %rs142, %rs132, %p346; 2026-02-21T09:10:47.6975564Z cvt.s16.s8 %rs161, %rs160; 2026-02-21T09:10:47.6975628Z shr.s16 %rs162, %rs161, 4; 2026-02-21T09:10:47.6975721Z selp.b16 %rs163, %rs143, %rs134, %p346; 2026-02-21T09:10:47.6975780Z cvt.s16.s8 %rs164, %rs163; 2026-02-21T09:10:47.6975843Z shr.s16 %rs165, %rs164, 4; 2026-02-21T09:10:47.6975907Z selp.b16 %rs166, %rs144, %rs136, %p346; 2026-02-21T09:10:47.6975964Z cvt.s16.s8 %rs167, %rs166; 2026-02-21T09:10:47.6976027Z shr.s16 %rs168, %rs167, 4; 2026-02-21T09:10:47.6976216Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.6976281Z cvt.rn.f32.s16 %r686, %rs147; 2026-02-21T09:10:47.6976342Z cvt.rn.f32.s16 %r687, %rs150; 2026-02-21T09:10:47.6976409Z cvt.rn.f32.s16 %r688, %rs153; 2026-02-21T09:10:47.6976467Z cvt.rn.f32.s16 %r689, %rs156; 2026-02-21T09:10:47.6976525Z cvt.rn.f32.s16 %r690, %rs159; 2026-02-21T09:10:47.6976590Z cvt.rn.f32.s16 %r691, %rs162; 2026-02-21T09:10:47.6976647Z cvt.rn.f32.s16 %r692, %rs165; 2026-02-21T09:10:47.6976705Z cvt.rn.f32.s16 %r693, %rs168; 2026-02-21T09:10:47.6976766Z st.shared.b32 [%r32], %r686; 2026-02-21T09:10:47.6976835Z st.shared.b32 [%r33], %r687; 
2026-02-21T09:10:47.6976895Z st.shared.b32 [%r34], %r688; 2026-02-21T09:10:47.6976954Z st.shared.b32 [%r35], %r689; 2026-02-21T09:10:47.6977020Z st.shared.b32 [%r36], %r690; 2026-02-21T09:10:47.6977079Z st.shared.b32 [%r37], %r691; 2026-02-21T09:10:47.6977139Z st.shared.b32 [%r38], %r692; 2026-02-21T09:10:47.6977200Z st.shared.b32 [%r39], %r693; 2026-02-21T09:10:47.6977259Z $L__tmp85: 2026-02-21T09:10:47.6977470Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6977528Z // begin inline asm 2026-02-21T09:10:47.6977608Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.6977662Z // end inline asm 2026-02-21T09:10:47.6977715Z bar.sync 0; 2026-02-21T09:10:47.6977782Z @%p60 bra $L__BB0_10; 2026-02-21T09:10:47.6977882Z // %bb.9: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.6977946Z elect.sync %r706|%p119, -1; 2026-02-21T09:10:47.6978006Z mov.b32 %r696, 135268624; 2026-02-21T09:10:47.6978071Z mov.pred %p118, 0; 2026-02-21T09:10:47.6978129Z // begin inline asm 2026-02-21T09:10:47.6978288Z @%p119 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd379, %r696, %p118; 2026-02-21T09:10:47.6978349Z // end inline asm 2026-02-21T09:10:47.6978406Z // begin inline asm 2026-02-21T09:10:47.6978560Z @%p119 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 8 ], %rd380, %r696, %p120; 2026-02-21T09:10:47.6978621Z // end inline asm 2026-02-21T09:10:47.6978676Z // begin inline asm 2026-02-21T09:10:47.6978825Z @%p119 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd381, %r696, %p120; 2026-02-21T09:10:47.6978879Z // end inline asm 2026-02-21T09:10:47.6978942Z // begin inline asm 2026-02-21T09:10:47.6979090Z @%p119 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd382, %r696, %p120; 2026-02-21T09:10:47.6979174Z // end inline asm 2026-02-21T09:10:47.6979242Z add.s32 %r708, %r201, 26624; 2026-02-21T09:10:47.6979303Z cvt.u64.u32 %rd195, 
%r708; 2026-02-21T09:10:47.6979359Z // begin inline asm 2026-02-21T09:10:47.6979495Z @%p119 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd195]; 2026-02-21T09:10:47.6979551Z // end inline asm 2026-02-21T09:10:47.6979606Z $L__tmp86: 2026-02-21T09:10:47.6979730Z $L__BB0_10: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.6979900Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.6979964Z add.s64 %rd196, %rd16, 128; 2026-02-21T09:10:47.6980024Z add.s64 %rd197, %rd17, 128; 2026-02-21T09:10:47.6980193Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6980249Z // begin inline asm 2026-02-21T09:10:47.6980367Z cp.async.cg.shared.global [ %r1327 + 0 ], [ %rd196 + 0 ], 0x10, %r710; 2026-02-21T09:10:47.6980428Z // end inline asm 2026-02-21T09:10:47.6980502Z // begin inline asm 2026-02-21T09:10:47.6980616Z cp.async.cg.shared.global [ %r1329 + 0 ], [ %rd197 + 0 ], 0x10, %r710; 2026-02-21T09:10:47.6980670Z // end inline asm 2026-02-21T09:10:47.6980740Z cp.async.commit_group; 2026-02-21T09:10:47.6980928Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6980986Z // begin inline asm 2026-02-21T09:10:47.6981100Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1331], 1024; 2026-02-21T09:10:47.6981155Z // end inline asm 2026-02-21T09:10:47.6981318Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6981380Z bar.sync 0; 2026-02-21T09:10:47.6981445Z elect.sync %r722|%p130, -1; 2026-02-21T09:10:47.6981508Z and.pred %p128, %p6, %p130; 2026-02-21T09:10:47.6981623Z mov.b32 %r716, 32; 2026-02-21T09:10:47.6981688Z // begin inline asm 2026-02-21T09:10:47.6981932Z @%p128 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r240], [%rd463, {%r843, %r716}], [%r1331]; 2026-02-21T09:10:47.6981987Z // end inline asm 2026-02-21T09:10:47.6982161Z .loc 1 51 125 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6982223Z shl.b32 %r723, %r70, 17; 2026-02-21T09:10:47.6982285Z or.b32 %r724, %r44, %r723; 2026-02-21T09:10:47.6982363Z mad.wide.s32 %rd620, %r724, 2, %rd7; 2026-02-21T09:10:47.6982423Z or.b32 %r2068, %r45, %r723; 2026-02-21T09:10:47.6982479Z mov.b64 %rd621, 0; 2026-02-21T09:10:47.6982538Z mov.b32 %r2071, %r2069; 2026-02-21T09:10:47.6982606Z mov.b32 %r2072, %r2069; 2026-02-21T09:10:47.6982663Z mov.b32 %r2074, %r2069; 2026-02-21T09:10:47.6982721Z bra.uni $L__BB0_11; 2026-02-21T09:10:47.6982829Z $L__BB0_13: // in Loop: Header=BB0_11 Depth=2 2026-02-21T09:10:47.6983003Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6983070Z setp.lt.u64 %p146, %rd621, 464; 2026-02-21T09:10:47.6983132Z $L__tmp87: 2026-02-21T09:10:47.6983347Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6983407Z add.s32 %r799, %r2073, 1; 2026-02-21T09:10:47.6983471Z setp.gt.s32 %p149, %r799, 1; 2026-02-21T09:10:47.6983545Z selp.b32 %r2073, 0, %r799, %p149; 2026-02-21T09:10:47.6983606Z selp.b32 %r800, 1, 0, %p149; 2026-02-21T09:10:47.6983664Z xor.b32 %r88, %r2074, %r800; 2026-02-21T09:10:47.6983721Z $L__tmp88: 2026-02-21T09:10:47.6983886Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.6983953Z mad.wide.s32 %rd206, %r2068, 2, %rd41; 2026-02-21T09:10:47.6984124Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6984213Z add.s32 %r790, %r82, %r12; 2026-02-21T09:10:47.6984276Z selp.b32 %r791, 16, 0, %p146; 2026-02-21T09:10:47.6984332Z // begin inline asm 2026-02-21T09:10:47.6984460Z cp.async.cg.shared.global [ %r790 + 0 ], [ %rd620 + 0 ], 0x10, %r791; 2026-02-21T09:10:47.6984513Z // end inline asm 2026-02-21T09:10:47.6984571Z add.s32 %r792, %r790, 4096; 2026-02-21T09:10:47.6984633Z // begin inline asm 
2026-02-21T09:10:47.6984772Z cp.async.cg.shared.global [ %r792 + 0 ], [ %rd206 + 0 ], 0x10, %r791; 2026-02-21T09:10:47.6984825Z // end inline asm 2026-02-21T09:10:47.6984888Z cp.async.commit_group; 2026-02-21T09:10:47.6985062Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6985125Z and.pred %p144, %p9, %p146; 2026-02-21T09:10:47.6985181Z // begin inline asm 2026-02-21T09:10:47.6985298Z @%p144 mbarrier.arrive.expect_tx.shared.b64 _, [%r794], 1024; 2026-02-21T09:10:47.6985354Z // end inline asm 2026-02-21T09:10:47.6985542Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6985603Z bar.sync 0; 2026-02-21T09:10:47.6985665Z elect.sync %r801|%p150, -1; 2026-02-21T09:10:47.6985728Z and.pred %p151, %p146, %p150; 2026-02-21T09:10:47.6985788Z and.pred %p145, %p6, %p151; 2026-02-21T09:10:47.6985855Z cvt.u32.u64 %r802, %rd621; 2026-02-21T09:10:47.6985938Z add.s32 %r797, %r802, 48; 2026-02-21T09:10:47.6985996Z // begin inline asm 2026-02-21T09:10:47.6986247Z @%p145 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r795], [%rd463, {%r843, %r797}], [%r794]; 2026-02-21T09:10:47.6986302Z // end inline asm 2026-02-21T09:10:47.6986470Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6986533Z add.s64 %rd620, %rd620, 64; 2026-02-21T09:10:47.6986589Z add.s32 %r2068, %r2068, 32; 2026-02-21T09:10:47.6986656Z setp.lt.u64 %p152, %rd621, 480; 2026-02-21T09:10:47.6986713Z add.s64 %rd621, %rd621, 16; 2026-02-21T09:10:47.6986778Z mov.b32 %r2069, %r2074; 2026-02-21T09:10:47.6986836Z mov.b32 %r2074, %r88; 2026-02-21T09:10:47.6986894Z @%p152 bra $L__BB0_11; 2026-02-21T09:10:47.6986958Z bra.uni $L__BB0_14; 2026-02-21T09:10:47.6987056Z $L__BB0_11: // Parent Loop BB0_2 Depth=1 2026-02-21T09:10:47.6987151Z // => This Inner Loop Header: Depth=2 2026-02-21T09:10:47.6987211Z add.s32 %r746, %r2072, 1; 
2026-02-21T09:10:47.6987279Z setp.gt.s32 %p134, %r746, 1; 2026-02-21T09:10:47.6987342Z selp.b32 %r2072, 0, %r746, %p134; 2026-02-21T09:10:47.6987503Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.6987574Z cp.async.wait_group 1; 2026-02-21T09:10:47.6987629Z bar.sync 0; 2026-02-21T09:10:47.6987689Z shl.b32 %r747, %r2072, 13; 2026-02-21T09:10:47.6987752Z add.s32 %r82, %r201, %r747; 2026-02-21T09:10:47.6987811Z add.s32 %r749, %r82, %r22; 2026-02-21T09:10:47.6987906Z ld.shared.v4.b32 {%r750, %r751, %r752, %r753}, [%r749]; 2026-02-21T09:10:47.6987969Z mov.b32 {%rs169, %rs170}, %r753; 2026-02-21T09:10:47.6988036Z mov.b32 {%rs171, %rs172}, %r752; 2026-02-21T09:10:47.6988096Z mov.b32 {%rs173, %rs174}, %r751; 2026-02-21T09:10:47.6988154Z mov.b32 {%rs175, %rs176}, %r750; 2026-02-21T09:10:47.6988262Z ld.shared.v4.b32 {%r754, %r755, %r756, %r757}, [%r749+16]; 2026-02-21T09:10:47.6988321Z mov.b32 {%rs177, %rs178}, %r757; 2026-02-21T09:10:47.6988380Z mov.b32 {%rs179, %rs180}, %r756; 2026-02-21T09:10:47.6988438Z mov.b32 {%rs181, %rs182}, %r755; 2026-02-21T09:10:47.6988500Z mov.b32 {%rs183, %rs184}, %r754; 2026-02-21T09:10:47.6988665Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 2026-02-21T09:10:47.6988724Z cvt.f32.bf16 %r728, %rs175; 2026-02-21T09:10:47.6988790Z cvt.f32.bf16 %r729, %rs176; 2026-02-21T09:10:47.6988867Z cvt.f32.bf16 %r730, %rs173; 2026-02-21T09:10:47.6988923Z cvt.f32.bf16 %r731, %rs174; 2026-02-21T09:10:47.6988987Z cvt.f32.bf16 %r732, %rs171; 2026-02-21T09:10:47.6989043Z cvt.f32.bf16 %r733, %rs172; 2026-02-21T09:10:47.6989099Z cvt.f32.bf16 %r734, %rs169; 2026-02-21T09:10:47.6989154Z cvt.f32.bf16 %r735, %rs170; 2026-02-21T09:10:47.6989218Z cvt.f32.bf16 %r736, %rs183; 2026-02-21T09:10:47.6989276Z cvt.f32.bf16 %r737, %rs184; 2026-02-21T09:10:47.6989353Z cvt.f32.bf16 %r738, %rs181; 2026-02-21T09:10:47.6989414Z cvt.f32.bf16 %r739, %rs182; 2026-02-21T09:10:47.6989470Z cvt.f32.bf16 %r740, 
%rs179; 2026-02-21T09:10:47.6989525Z cvt.f32.bf16 %r741, %rs180; 2026-02-21T09:10:47.6989582Z cvt.f32.bf16 %r742, %rs177; 2026-02-21T09:10:47.6989645Z cvt.f32.bf16 %r743, %rs178; 2026-02-21T09:10:47.6989698Z $L__tmp89: 2026-02-21T09:10:47.6989912Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6989975Z // begin inline asm 2026-02-21T09:10:47.6990025Z 2026-02-21T09:10:47.6990074Z { 2026-02-21T09:10:47.6990158Z .reg .pred complete; 2026-02-21T09:10:47.6990214Z waitLoop: 2026-02-21T09:10:47.6990335Z mbarrier.try_wait.parity.shared.b64 complete, [%r2070], %r2069; 2026-02-21T09:10:47.6990399Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.6990456Z } 2026-02-21T09:10:47.6990460Z 2026-02-21T09:10:47.6990514Z // end inline asm 2026-02-21T09:10:47.6990602Z $L__tmp90: 2026-02-21T09:10:47.6990776Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6990836Z selp.b32 %r758, 1, 0, %p134; 2026-02-21T09:10:47.6990898Z xor.b32 %r2071, %r2071, %r758; 2026-02-21T09:10:47.6990958Z mov.pred %p135, -1; 2026-02-21T09:10:47.6991017Z $L__tmp91: 2026-02-21T09:10:47.6991227Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6991283Z // begin inline asm 2026-02-21T09:10:47.6991604Z @%p135 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1211 + 0], {%r728, %r729, %r730, %r731, %r732, %r733, %r734, %r735, %r736, %r737, %r738, %r739, %r740, %r741, %r742, %r743}; 2026-02-21T09:10:47.6991663Z // end inline asm 2026-02-21T09:10:47.6991719Z // begin inline asm 2026-02-21T09:10:47.6991793Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.6991847Z // end inline asm 2026-02-21T09:10:47.6991901Z bar.sync 0; 2026-02-21T09:10:47.6991954Z $L__tmp92: 2026-02-21T09:10:47.6992126Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6992184Z shl.b32 %r759, %r2072, 3; 
2026-02-21T09:10:47.6992242Z add.s32 %r760, %r201, %r759; 2026-02-21T09:10:47.6992306Z add.s32 %r794, %r760, 26640; 2026-02-21T09:10:47.6992362Z // begin inline asm 2026-02-21T09:10:47.6992414Z 2026-02-21T09:10:47.6992463Z { 2026-02-21T09:10:47.6992530Z .reg .pred complete; 2026-02-21T09:10:47.6992584Z waitLoop: 2026-02-21T09:10:47.6992703Z mbarrier.try_wait.parity.shared.b64 complete, [%r794], %r2071; 2026-02-21T09:10:47.6992774Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.6992824Z } 2026-02-21T09:10:47.6992827Z 2026-02-21T09:10:47.6992881Z // end inline asm 2026-02-21T09:10:47.6993046Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.6993105Z shl.b32 %r761, %r2072, 10; 2026-02-21T09:10:47.6993166Z add.s32 %r762, %r201, %r761; 2026-02-21T09:10:47.6993224Z add.s32 %r795, %r762, 24576; 2026-02-21T09:10:47.6993289Z add.s32 %r763, %r795, %r24; 2026-02-21T09:10:47.6993351Z ld.shared.b8 %rs185, [%r763]; 2026-02-21T09:10:47.6993417Z ld.shared.b8 %rs186, [%r763+512]; 2026-02-21T09:10:47.6993481Z add.s32 %r764, %r795, %r26; 2026-02-21T09:10:47.6993543Z ld.shared.b8 %rs187, [%r764+128]; 2026-02-21T09:10:47.6993604Z ld.shared.b8 %rs188, [%r764+640]; 2026-02-21T09:10:47.6993661Z add.s32 %r765, %r795, %r28; 2026-02-21T09:10:47.6993761Z ld.shared.b8 %rs189, [%r765+256]; 2026-02-21T09:10:47.6993819Z ld.shared.b8 %rs190, [%r765+768]; 2026-02-21T09:10:47.6993878Z add.s32 %r766, %r795, %r30; 2026-02-21T09:10:47.6993943Z ld.shared.b8 %rs191, [%r766+384]; 2026-02-21T09:10:47.6994002Z ld.shared.b8 %rs192, [%r766+896]; 2026-02-21T09:10:47.6994164Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.6994230Z shl.b16 %rs193, %rs185, 4; 2026-02-21T09:10:47.6994313Z shl.b16 %rs194, %rs187, 4; 2026-02-21T09:10:47.6994371Z shl.b16 %rs195, %rs189, 4; 2026-02-21T09:10:47.6994428Z shl.b16 %rs196, %rs191, 4; 2026-02-21T09:10:47.6994492Z shl.b16 %rs197, %rs186, 4; 2026-02-21T09:10:47.6994550Z 
shl.b16 %rs198, %rs188, 4; 2026-02-21T09:10:47.6994607Z shl.b16 %rs199, %rs190, 4; 2026-02-21T09:10:47.6994671Z shl.b16 %rs200, %rs192, 4; 2026-02-21T09:10:47.6994835Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.6994906Z selp.b16 %rs201, %rs193, %rs185, %p346; 2026-02-21T09:10:47.6994994Z cvt.s16.s8 %rs202, %rs201; 2026-02-21T09:10:47.6995050Z shr.s16 %rs203, %rs202, 4; 2026-02-21T09:10:47.6995117Z selp.b16 %rs204, %rs194, %rs187, %p346; 2026-02-21T09:10:47.6995176Z cvt.s16.s8 %rs205, %rs204; 2026-02-21T09:10:47.6995241Z shr.s16 %rs206, %rs205, 4; 2026-02-21T09:10:47.6995305Z selp.b16 %rs207, %rs195, %rs189, %p346; 2026-02-21T09:10:47.6995384Z cvt.s16.s8 %rs208, %rs207; 2026-02-21T09:10:47.6995446Z shr.s16 %rs209, %rs208, 4; 2026-02-21T09:10:47.6995508Z selp.b16 %rs210, %rs196, %rs191, %p346; 2026-02-21T09:10:47.6995563Z cvt.s16.s8 %rs211, %rs210; 2026-02-21T09:10:47.6995616Z shr.s16 %rs212, %rs211, 4; 2026-02-21T09:10:47.6995681Z selp.b16 %rs213, %rs197, %rs186, %p346; 2026-02-21T09:10:47.6995734Z cvt.s16.s8 %rs214, %rs213; 2026-02-21T09:10:47.6995789Z shr.s16 %rs215, %rs214, 4; 2026-02-21T09:10:47.6995853Z selp.b16 %rs216, %rs198, %rs188, %p346; 2026-02-21T09:10:47.6995907Z cvt.s16.s8 %rs217, %rs216; 2026-02-21T09:10:47.6995960Z shr.s16 %rs218, %rs217, 4; 2026-02-21T09:10:47.6996021Z selp.b16 %rs219, %rs199, %rs190, %p346; 2026-02-21T09:10:47.6996080Z cvt.s16.s8 %rs220, %rs219; 2026-02-21T09:10:47.6996134Z shr.s16 %rs221, %rs220, 4; 2026-02-21T09:10:47.6996195Z selp.b16 %rs222, %rs200, %rs192, %p346; 2026-02-21T09:10:47.6996253Z cvt.s16.s8 %rs223, %rs222; 2026-02-21T09:10:47.6996307Z shr.s16 %rs224, %rs223, 4; 2026-02-21T09:10:47.6996467Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.6996527Z cvt.rn.f32.s16 %r767, %rs203; 2026-02-21T09:10:47.6996585Z cvt.rn.f32.s16 %r768, %rs206; 2026-02-21T09:10:47.6996640Z cvt.rn.f32.s16 %r769, %rs209; 
2026-02-21T09:10:47.6996693Z cvt.rn.f32.s16 %r770, %rs212; 2026-02-21T09:10:47.6996752Z cvt.rn.f32.s16 %r771, %rs215; 2026-02-21T09:10:47.6996807Z cvt.rn.f32.s16 %r772, %rs218; 2026-02-21T09:10:47.6996863Z cvt.rn.f32.s16 %r773, %rs221; 2026-02-21T09:10:47.6996924Z cvt.rn.f32.s16 %r774, %rs224; 2026-02-21T09:10:47.6996983Z st.shared.b32 [%r32], %r767; 2026-02-21T09:10:47.6997041Z st.shared.b32 [%r33], %r768; 2026-02-21T09:10:47.6997098Z st.shared.b32 [%r34], %r769; 2026-02-21T09:10:47.6997159Z st.shared.b32 [%r35], %r770; 2026-02-21T09:10:47.6997217Z st.shared.b32 [%r36], %r771; 2026-02-21T09:10:47.6997273Z st.shared.b32 [%r37], %r772; 2026-02-21T09:10:47.6997331Z st.shared.b32 [%r38], %r773; 2026-02-21T09:10:47.6997386Z st.shared.b32 [%r39], %r774; 2026-02-21T09:10:47.6997551Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.6997614Z shl.b32 %r775, %r2073, 3; 2026-02-21T09:10:47.6997672Z add.s32 %r776, %r201, %r775; 2026-02-21T09:10:47.6997726Z add.s32 %r2070, %r776, 26624; 2026-02-21T09:10:47.6997778Z $L__tmp93: 2026-02-21T09:10:47.6997991Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.6998073Z // begin inline asm 2026-02-21T09:10:47.6998143Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.6998201Z // end inline asm 2026-02-21T09:10:47.6998251Z bar.sync 0; 2026-02-21T09:10:47.6998306Z @%p60 bra $L__BB0_13; 2026-02-21T09:10:47.6998405Z // %bb.12: // in Loop: Header=BB0_11 Depth=2 2026-02-21T09:10:47.6998473Z elect.sync %r789|%p136, -1; 2026-02-21T09:10:47.6998553Z mov.b32 %r779, 135268624; 2026-02-21T09:10:47.6998606Z // begin inline asm 2026-02-21T09:10:47.6998767Z @%p136 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd379, %r779, %p135; 2026-02-21T09:10:47.6998822Z // end inline asm 2026-02-21T09:10:47.6998874Z // begin inline asm 2026-02-21T09:10:47.6999030Z @%p136 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 
0 ], [ %r1918 + 8 ], %rd380, %r779, %p135; 2026-02-21T09:10:47.6999082Z // end inline asm 2026-02-21T09:10:47.6999137Z // begin inline asm 2026-02-21T09:10:47.6999287Z @%p136 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd381, %r779, %p135; 2026-02-21T09:10:47.6999364Z // end inline asm 2026-02-21T09:10:47.6999419Z // begin inline asm 2026-02-21T09:10:47.6999566Z @%p136 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd382, %r779, %p135; 2026-02-21T09:10:47.6999626Z // end inline asm 2026-02-21T09:10:47.6999684Z cvt.u64.u32 %rd204, %r2070; 2026-02-21T09:10:47.6999755Z // begin inline asm 2026-02-21T09:10:47.6999884Z @%p136 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd204]; 2026-02-21T09:10:47.6999934Z // end inline asm 2026-02-21T09:10:47.6999991Z bra.uni $L__BB0_13; 2026-02-21T09:10:47.7000090Z $L__BB0_14: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.7000180Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:47.7000234Z mov.b32 %r2080, 1; 2026-02-21T09:10:47.7000445Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7000509Z // begin inline asm 2026-02-21T09:10:47.7000571Z 2026-02-21T09:10:47.7000621Z { 2026-02-21T09:10:47.7000689Z .reg .pred complete; 2026-02-21T09:10:47.7000742Z waitLoop: 2026-02-21T09:10:47.7000864Z mbarrier.try_wait.parity.shared.b64 complete, [%r2070], %r2080; 2026-02-21T09:10:47.7000929Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7000986Z } 2026-02-21T09:10:47.7000991Z 2026-02-21T09:10:47.7001046Z // end inline asm 2026-02-21T09:10:47.7001099Z $L__tmp94: 2026-02-21T09:10:47.7001276Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7001340Z cp.async.wait_group 0; 2026-02-21T09:10:47.7001397Z bar.sync 0; 2026-02-21T09:10:47.7001459Z // begin inline asm 2026-02-21T09:10:47.7001600Z @%p9 mbarrier.inval.shared::cta.b64 [%r1331]; 
2026-02-21T09:10:47.7001658Z // end inline asm 2026-02-21T09:10:47.7001717Z bar.sync 0; 2026-02-21T09:10:47.7001779Z // begin inline asm 2026-02-21T09:10:47.7001867Z @%p9 mbarrier.inval.shared::cta.b64 [%r1115]; 2026-02-21T09:10:47.7001920Z // end inline asm 2026-02-21T09:10:47.7001989Z add.s32 %r2077, %r201, 26624; 2026-02-21T09:10:47.7002044Z // begin inline asm 2026-02-21T09:10:47.7002123Z @%p9 mbarrier.inval.shared::cta.b64 [%r2077]; 2026-02-21T09:10:47.7002179Z // end inline asm 2026-02-21T09:10:47.7002240Z bar.sync 0; 2026-02-21T09:10:47.7002302Z // begin inline asm 2026-02-21T09:10:47.7002379Z @%p9 mbarrier.inval.shared::cta.b64 [%r1117]; 2026-02-21T09:10:47.7002436Z // end inline asm 2026-02-21T09:10:47.7002486Z $L__tmp95: 2026-02-21T09:10:47.7002704Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7002770Z // begin inline asm 2026-02-21T09:10:47.7003054Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r809, %r810, %r811, %r812, %r813, %r814, %r815, %r816, %r817, %r818, %r819, %r820, %r821, %r822, %r823, %r824}, [%r1460 + 0]; 2026-02-21T09:10:47.7003140Z // end inline asm 2026-02-21T09:10:47.7003199Z // begin inline asm 2026-02-21T09:10:47.7003487Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r826, %r827, %r828, %r829, %r830, %r831, %r832, %r833, %r834, %r835, %r836, %r837, %r838, %r839, %r840, %r841}, [%r1460 + 16]; 2026-02-21T09:10:47.7003547Z // end inline asm 2026-02-21T09:10:47.7003608Z // begin inline asm 2026-02-21T09:10:47.7003691Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:47.7003778Z // end inline asm 2026-02-21T09:10:47.7003842Z cvt.u64.u32 %rd215, %r809; 2026-02-21T09:10:47.7003908Z cvt.u64.u32 %rd216, %r810; 2026-02-21T09:10:47.7003970Z shl.b64 %rd217, %rd216, 32; 2026-02-21T09:10:47.7004032Z or.b64 %rd218, %rd215, %rd217; 2026-02-21T09:10:47.7004083Z $L__tmp96: 2026-02-21T09:10:47.7004267Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 
2026-02-21T09:10:47.7004331Z mov.b64 {%r921, %r922}, %rd218; 2026-02-21T09:10:47.7004403Z cvt.rn.bf16x2.f32 %r923, %r922, %r921; 2026-02-21T09:10:47.7004459Z $L__tmp97: 2026-02-21T09:10:47.7004704Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7004765Z cvt.u64.u32 %rd219, %r811; 2026-02-21T09:10:47.7004830Z cvt.u64.u32 %rd220, %r812; 2026-02-21T09:10:47.7004889Z shl.b64 %rd221, %rd220, 32; 2026-02-21T09:10:47.7004977Z or.b64 %rd222, %rd219, %rd221; 2026-02-21T09:10:47.7005032Z $L__tmp98: 2026-02-21T09:10:47.7005205Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7005266Z mov.b64 {%r924, %r925}, %rd222; 2026-02-21T09:10:47.7005335Z cvt.rn.bf16x2.f32 %r926, %r925, %r924; 2026-02-21T09:10:47.7005391Z $L__tmp99: 2026-02-21T09:10:47.7005604Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7005666Z cvt.u64.u32 %rd223, %r813; 2026-02-21T09:10:47.7005723Z cvt.u64.u32 %rd224, %r814; 2026-02-21T09:10:47.7005789Z shl.b64 %rd225, %rd224, 32; 2026-02-21T09:10:47.7005849Z or.b64 %rd226, %rd223, %rd225; 2026-02-21T09:10:47.7005903Z $L__tmp100: 2026-02-21T09:10:47.7006074Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7006132Z mov.b64 {%r927, %r928}, %rd226; 2026-02-21T09:10:47.7006201Z cvt.rn.bf16x2.f32 %r929, %r928, %r927; 2026-02-21T09:10:47.7006258Z $L__tmp101: 2026-02-21T09:10:47.7006467Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7006525Z cvt.u64.u32 %rd227, %r815; 2026-02-21T09:10:47.7006581Z cvt.u64.u32 %rd228, %r816; 2026-02-21T09:10:47.7006647Z shl.b64 %rd229, %rd228, 32; 2026-02-21T09:10:47.7006705Z or.b64 %rd230, %rd227, %rd229; 2026-02-21T09:10:47.7006758Z $L__tmp102: 2026-02-21T09:10:47.7006930Z .loc 1 97 28 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7006993Z mov.b64 {%r930, %r931}, %rd230; 2026-02-21T09:10:47.7007061Z cvt.rn.bf16x2.f32 %r932, %r931, %r930; 2026-02-21T09:10:47.7007119Z $L__tmp103: 2026-02-21T09:10:47.7007330Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7007389Z cvt.u64.u32 %rd231, %r817; 2026-02-21T09:10:47.7007449Z cvt.u64.u32 %rd232, %r818; 2026-02-21T09:10:47.7007514Z shl.b64 %rd233, %rd232, 32; 2026-02-21T09:10:47.7007574Z or.b64 %rd234, %rd231, %rd233; 2026-02-21T09:10:47.7007625Z $L__tmp104: 2026-02-21T09:10:47.7007797Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7007856Z mov.b64 {%r933, %r934}, %rd234; 2026-02-21T09:10:47.7007923Z cvt.rn.bf16x2.f32 %r935, %r934, %r933; 2026-02-21T09:10:47.7007975Z $L__tmp105: 2026-02-21T09:10:47.7008224Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7008283Z cvt.u64.u32 %rd235, %r819; 2026-02-21T09:10:47.7008341Z cvt.u64.u32 %rd236, %r820; 2026-02-21T09:10:47.7008406Z shl.b64 %rd237, %rd236, 32; 2026-02-21T09:10:47.7008470Z or.b64 %rd238, %rd235, %rd237; 2026-02-21T09:10:47.7008524Z $L__tmp106: 2026-02-21T09:10:47.7008698Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7008778Z mov.b64 {%r936, %r937}, %rd238; 2026-02-21T09:10:47.7008843Z cvt.rn.bf16x2.f32 %r938, %r937, %r936; 2026-02-21T09:10:47.7008891Z $L__tmp107: 2026-02-21T09:10:47.7009103Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7009157Z cvt.u64.u32 %rd239, %r821; 2026-02-21T09:10:47.7009211Z cvt.u64.u32 %rd240, %r822; 2026-02-21T09:10:47.7009272Z shl.b64 %rd241, %rd240, 32; 2026-02-21T09:10:47.7009326Z or.b64 %rd242, %rd239, %rd241; 2026-02-21T09:10:47.7009411Z $L__tmp108: 
2026-02-21T09:10:47.7009578Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7009636Z mov.b64 {%r939, %r940}, %rd242; 2026-02-21T09:10:47.7009699Z cvt.rn.bf16x2.f32 %r941, %r940, %r939; 2026-02-21T09:10:47.7009750Z $L__tmp109: 2026-02-21T09:10:47.7009990Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7010047Z cvt.u64.u32 %rd243, %r823; 2026-02-21T09:10:47.7010101Z cvt.u64.u32 %rd244, %r824; 2026-02-21T09:10:47.7010163Z shl.b64 %rd245, %rd244, 32; 2026-02-21T09:10:47.7010218Z or.b64 %rd246, %rd243, %rd245; 2026-02-21T09:10:47.7010266Z $L__tmp110: 2026-02-21T09:10:47.7010432Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7010492Z mov.b64 {%r942, %r943}, %rd246; 2026-02-21T09:10:47.7010554Z cvt.rn.bf16x2.f32 %r944, %r943, %r942; 2026-02-21T09:10:47.7010603Z $L__tmp111: 2026-02-21T09:10:47.7010813Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7010871Z cvt.u64.u32 %rd247, %r826; 2026-02-21T09:10:47.7010925Z cvt.u64.u32 %rd248, %r827; 2026-02-21T09:10:47.7010985Z shl.b64 %rd249, %rd248, 32; 2026-02-21T09:10:47.7011042Z or.b64 %rd250, %rd247, %rd249; 2026-02-21T09:10:47.7011092Z $L__tmp112: 2026-02-21T09:10:47.7011257Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7011317Z mov.b64 {%r945, %r946}, %rd250; 2026-02-21T09:10:47.7011381Z cvt.rn.bf16x2.f32 %r947, %r946, %r945; 2026-02-21T09:10:47.7011429Z $L__tmp113: 2026-02-21T09:10:47.7011670Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7011728Z cvt.u64.u32 %rd251, %r828; 2026-02-21T09:10:47.7011786Z cvt.u64.u32 %rd252, %r829; 2026-02-21T09:10:47.7011845Z shl.b64 %rd253, %rd252, 32; 2026-02-21T09:10:47.7011902Z or.b64 %rd254, %rd251, %rd253; 
2026-02-21T09:10:47.7011953Z $L__tmp114: 2026-02-21T09:10:47.7012113Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7012177Z mov.b64 {%r948, %r949}, %rd254; 2026-02-21T09:10:47.7012240Z cvt.rn.bf16x2.f32 %r950, %r949, %r948; 2026-02-21T09:10:47.7012289Z $L__tmp115: 2026-02-21T09:10:47.7012498Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7012555Z cvt.u64.u32 %rd255, %r830; 2026-02-21T09:10:47.7012610Z cvt.u64.u32 %rd256, %r831; 2026-02-21T09:10:47.7012671Z shl.b64 %rd257, %rd256, 32; 2026-02-21T09:10:47.7012728Z or.b64 %rd258, %rd255, %rd257; 2026-02-21T09:10:47.7012805Z $L__tmp116: 2026-02-21T09:10:47.7012966Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7013031Z mov.b64 {%r951, %r952}, %rd258; 2026-02-21T09:10:47.7013092Z cvt.rn.bf16x2.f32 %r953, %r952, %r951; 2026-02-21T09:10:47.7013143Z $L__tmp117: 2026-02-21T09:10:47.7013350Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7013436Z cvt.u64.u32 %rd259, %r832; 2026-02-21T09:10:47.7013491Z cvt.u64.u32 %rd260, %r833; 2026-02-21T09:10:47.7013556Z shl.b64 %rd261, %rd260, 32; 2026-02-21T09:10:47.7013614Z or.b64 %rd262, %rd259, %rd261; 2026-02-21T09:10:47.7013664Z $L__tmp118: 2026-02-21T09:10:47.7013826Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7013891Z mov.b64 {%r954, %r955}, %rd262; 2026-02-21T09:10:47.7013955Z cvt.rn.bf16x2.f32 %r956, %r955, %r954; 2026-02-21T09:10:47.7014007Z $L__tmp119: 2026-02-21T09:10:47.7014240Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7014296Z cvt.u64.u32 %rd263, %r834; 2026-02-21T09:10:47.7014350Z cvt.u64.u32 %rd264, %r835; 2026-02-21T09:10:47.7014408Z shl.b64 %rd265, %rd264, 32; 
2026-02-21T09:10:47.7014467Z or.b64 %rd266, %rd263, %rd265; 2026-02-21T09:10:47.7014542Z $L__tmp120: 2026-02-21T09:10:47.7014704Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7014765Z mov.b64 {%r957, %r958}, %rd266; 2026-02-21T09:10:47.7014829Z cvt.rn.bf16x2.f32 %r959, %r958, %r957; 2026-02-21T09:10:47.7014879Z $L__tmp121: 2026-02-21T09:10:47.7015088Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7015144Z cvt.u64.u32 %rd267, %r836; 2026-02-21T09:10:47.7015202Z cvt.u64.u32 %rd268, %r837; 2026-02-21T09:10:47.7015256Z shl.b64 %rd269, %rd268, 32; 2026-02-21T09:10:47.7015319Z or.b64 %rd270, %rd267, %rd269; 2026-02-21T09:10:47.7015368Z $L__tmp122: 2026-02-21T09:10:47.7015533Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7015596Z mov.b64 {%r960, %r961}, %rd270; 2026-02-21T09:10:47.7015661Z cvt.rn.bf16x2.f32 %r962, %r961, %r960; 2026-02-21T09:10:47.7015712Z $L__tmp123: 2026-02-21T09:10:47.7015920Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7015976Z cvt.u64.u32 %rd271, %r838; 2026-02-21T09:10:47.7016031Z cvt.u64.u32 %rd272, %r839; 2026-02-21T09:10:47.7016087Z shl.b64 %rd273, %rd272, 32; 2026-02-21T09:10:47.7016148Z or.b64 %rd274, %rd271, %rd273; 2026-02-21T09:10:47.7016196Z $L__tmp124: 2026-02-21T09:10:47.7016359Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7016422Z mov.b64 {%r963, %r964}, %rd274; 2026-02-21T09:10:47.7016485Z cvt.rn.bf16x2.f32 %r965, %r964, %r963; 2026-02-21T09:10:47.7016533Z $L__tmp125: 2026-02-21T09:10:47.7016740Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7016796Z cvt.u64.u32 %rd275, %r840; 2026-02-21T09:10:47.7016852Z cvt.u64.u32 %rd276, %r841; 
2026-02-21T09:10:47.7016909Z shl.b64 %rd277, %rd276, 32; 2026-02-21T09:10:47.7016971Z or.b64 %rd278, %rd275, %rd277; 2026-02-21T09:10:47.7017020Z $L__tmp126: 2026-02-21T09:10:47.7017183Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7017241Z mov.b64 {%r966, %r967}, %rd278; 2026-02-21T09:10:47.7017303Z cvt.rn.bf16x2.f32 %r968, %r967, %r966; 2026-02-21T09:10:47.7017463Z .loc 1 98 43 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:98:43 2026-02-21T09:10:47.7017579Z st.shared.v4.b32 [%r40], {%r923, %r926, %r929, %r932}; 2026-02-21T09:10:47.7017672Z st.shared.v4.b32 [%r41], {%r935, %r938, %r941, %r944}; 2026-02-21T09:10:47.7017754Z st.shared.v4.b32 [%r42], {%r947, %r950, %r953, %r956}; 2026-02-21T09:10:47.7017838Z st.shared.v4.b32 [%r43], {%r959, %r962, %r965, %r968}; 2026-02-21T09:10:47.7017898Z // begin inline asm 2026-02-21T09:10:47.7017967Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7018045Z // end inline asm 2026-02-21T09:10:47.7018101Z bar.sync 0; 2026-02-21T09:10:47.7018161Z elect.sync %r969|%p172, -1; 2026-02-21T09:10:47.7018220Z and.pred %p157, %p6, %p172; 2026-02-21T09:10:47.7018273Z // begin inline asm 2026-02-21T09:10:47.7018450Z @%p157 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd553, {%r843, %r844}], [%r201]; 2026-02-21T09:10:47.7018502Z // end inline asm 2026-02-21T09:10:47.7018563Z cp.async.bulk.commit_group; 2026-02-21T09:10:47.7018637Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:47.7018691Z bar.sync 0; 2026-02-21T09:10:47.7018874Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7018935Z add.s32 %r970, %r2060, 2; 2026-02-21T09:10:47.7019096Z .loc 1 38 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:38:33 2026-02-21T09:10:47.7019153Z shr.u32 %r971, %r970, 5; 2026-02-21T09:10:47.7019232Z and.b32 %r972, %r971, 67108848; 2026-02-21T09:10:47.7019399Z .loc 1 39 39 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:39 2026-02-21T09:10:47.7019453Z sub.s32 %r973, 128, %r972; 2026-02-21T09:10:47.7019615Z .loc 1 39 52 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:52 2026-02-21T09:10:47.7019674Z min.s32 %r974, %r973, 16; 2026-02-21T09:10:47.7019831Z .loc 1 40 45 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:45 2026-02-21T09:10:47.7019887Z and.b32 %r975, %r970, 511; 2026-02-21T09:10:47.7020051Z .loc 1 41 51 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:41:51 2026-02-21T09:10:47.7020106Z div.s32 %r90, %r975, %r974; 2026-02-21T09:10:47.7020265Z .loc 1 40 64 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:64 2026-02-21T09:10:47.7020326Z mul.lo.s32 %r976, %r90, %r974; 2026-02-21T09:10:47.7020385Z sub.s32 %r977, %r975, %r976; 2026-02-21T09:10:47.7020546Z .loc 1 40 30 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:30 2026-02-21T09:10:47.7020602Z add.s32 %r978, %r977, %r972; 2026-02-21T09:10:47.7020764Z .loc 1 42 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:42:27 2026-02-21T09:10:47.7020822Z shl.b32 %r1152, %r978, 6; 2026-02-21T09:10:47.7020977Z .loc 1 43 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:43:27 2026-02-21T09:10:47.7021038Z shl.b32 %r1153, %r90, 7; 2026-02-21T09:10:47.7021195Z .loc 1 44 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:44:32 2026-02-21T09:10:47.7021253Z or.b32 %r979, %r1153, %r7; 2026-02-21T09:10:47.7021313Z or.b32 %r980, %r1153, %r8; 2026-02-21T09:10:47.7021475Z .loc 1 58 53 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:53 2026-02-21T09:10:47.7021566Z shl.b32 %r981, %r979, 10; 2026-02-21T09:10:47.7021628Z shl.b32 %r982, %r980, 10; 2026-02-21T09:10:47.7021691Z mov.pred %p177, -1; 2026-02-21T09:10:47.7021745Z mov.b32 %r2076, 0; 2026-02-21T09:10:47.7021795Z $L__tmp127: 2026-02-21T09:10:47.7022011Z .loc 2 291 36 // standard.py:291:36 @[ 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7022066Z // begin inline asm 2026-02-21T09:10:47.7022371Z @%p177 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1460 + 0], {%r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076}; 2026-02-21T09:10:47.7022470Z // end inline asm 2026-02-21T09:10:47.7022525Z // begin inline asm 2026-02-21T09:10:47.7022832Z @%p177 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1460 + 16], {%r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076, %r2076}; 2026-02-21T09:10:47.7022892Z // end inline asm 2026-02-21T09:10:47.7022949Z // begin inline asm 2026-02-21T09:10:47.7023044Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.7023096Z // end inline asm 2026-02-21T09:10:47.7023153Z bar.sync 0; 2026-02-21T09:10:47.7023205Z $L__tmp128: 2026-02-21T09:10:47.7023371Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7023436Z // begin inline asm 2026-02-21T09:10:47.7023522Z @%p9 mbarrier.init.shared::cta.b64 [%r2077], 1; 2026-02-21T09:10:47.7023576Z // end inline asm 2026-02-21T09:10:47.7023630Z bar.sync 0; 2026-02-21T09:10:47.7023689Z // begin inline asm 2026-02-21T09:10:47.7023792Z @%p9 mbarrier.init.shared::cta.b64 [%r1117], 1; 2026-02-21T09:10:47.7023847Z // end inline asm 2026-02-21T09:10:47.7023906Z // begin inline asm 2026-02-21T09:10:47.7023983Z @%p9 mbarrier.init.shared::cta.b64 [%r1331], 1; 2026-02-21T09:10:47.7024035Z // end inline asm 2026-02-21T09:10:47.7024094Z bar.sync 0; 2026-02-21T09:10:47.7024173Z // begin inline asm 2026-02-21T09:10:47.7024249Z @%p9 mbarrier.init.shared::cta.b64 [%r1115], 1; 2026-02-21T09:10:47.7024301Z // end inline asm 2026-02-21T09:10:47.7024466Z .loc 1 58 60 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:60 2026-02-21T09:10:47.7024526Z or.b32 %r983, %r981, %r9; 
2026-02-21T09:10:47.7024583Z or.b32 %r984, %r982, %r9; 2026-02-21T09:10:47.7024748Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7024818Z mad.wide.s32 %rd209, %r983, 2, %rd41; 2026-02-21T09:10:47.7024884Z mad.wide.s32 %rd210, %r984, 2, %rd41; 2026-02-21T09:10:47.7024945Z mov.b32 %r1019, 16; 2026-02-21T09:10:47.7025104Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7025156Z // begin inline asm 2026-02-21T09:10:47.7025279Z cp.async.cg.shared.global [ %r1327 + 0 ], [ %rd209 + 0 ], 0x10, %r1019; 2026-02-21T09:10:47.7025336Z // end inline asm 2026-02-21T09:10:47.7025392Z // begin inline asm 2026-02-21T09:10:47.7025510Z cp.async.cg.shared.global [ %r1329 + 0 ], [ %rd210 + 0 ], 0x10, %r1019; 2026-02-21T09:10:47.7025568Z // end inline asm 2026-02-21T09:10:47.7025630Z cp.async.commit_group; 2026-02-21T09:10:47.7025794Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7025846Z bar.sync 0; 2026-02-21T09:10:47.7025905Z // begin inline asm 2026-02-21T09:10:47.7026013Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1331], 1024; 2026-02-21T09:10:47.7026065Z // end inline asm 2026-02-21T09:10:47.7026232Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7026284Z bar.sync 0; 2026-02-21T09:10:47.7026348Z elect.sync %r985|%p173, -1; 2026-02-21T09:10:47.7026419Z and.pred %p165, %p6, %p173; 2026-02-21T09:10:47.7026473Z // begin inline asm 2026-02-21T09:10:47.7026714Z @%p165 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r240], [%rd463, {%r1152, %r2076}], [%r1331]; 2026-02-21T09:10:47.7026773Z // end inline asm 2026-02-21T09:10:47.7026931Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7026989Z cvt.s64.s32 %rd279, %r981; 2026-02-21T09:10:47.7027050Z or.b64 %rd280, %rd279, %rd8; 
2026-02-21T09:10:47.7027116Z shl.b64 %rd281, %rd280, 1; 2026-02-21T09:10:47.7027176Z add.s64 %rd23, %rd41, %rd281; 2026-02-21T09:10:47.7027234Z add.s64 %rd212, %rd23, 64; 2026-02-21T09:10:47.7027323Z cvt.s64.s32 %rd282, %r982; 2026-02-21T09:10:47.7027383Z or.b64 %rd283, %rd282, %rd8; 2026-02-21T09:10:47.7027440Z shl.b64 %rd284, %rd283, 1; 2026-02-21T09:10:47.7027499Z add.s64 %rd24, %rd41, %rd284; 2026-02-21T09:10:47.7027564Z add.s64 %rd213, %rd24, 64; 2026-02-21T09:10:47.7027726Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7027782Z // begin inline asm 2026-02-21T09:10:47.7027941Z cp.async.cg.shared.global [ %r1202 + 0 ], [ %rd212 + 0 ], 0x10, %r1019; 2026-02-21T09:10:47.7027997Z // end inline asm 2026-02-21T09:10:47.7028054Z // begin inline asm 2026-02-21T09:10:47.7028174Z cp.async.cg.shared.global [ %r1204 + 0 ], [ %rd213 + 0 ], 0x10, %r1019; 2026-02-21T09:10:47.7028229Z // end inline asm 2026-02-21T09:10:47.7028291Z cp.async.commit_group; 2026-02-21T09:10:47.7028462Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7028525Z bar.sync 0; 2026-02-21T09:10:47.7028581Z // begin inline asm 2026-02-21T09:10:47.7028706Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1115], 1024; 2026-02-21T09:10:47.7028768Z // end inline asm 2026-02-21T09:10:47.7028935Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7028989Z bar.sync 0; 2026-02-21T09:10:47.7029073Z elect.sync %r986|%p174, -1; 2026-02-21T09:10:47.7029147Z and.pred %p167, %p6, %p174; 2026-02-21T09:10:47.7029205Z // begin inline asm 2026-02-21T09:10:47.7029451Z @%p167 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r322], [%rd463, {%r1152, %r1019}], [%r1115]; 2026-02-21T09:10:47.7029513Z // end inline asm 2026-02-21T09:10:47.7029674Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7029736Z 
cp.async.wait_group 1; 2026-02-21T09:10:47.7029798Z bar.sync 0; 2026-02-21T09:10:47.7029890Z ld.shared.v4.b32 {%r987, %r988, %r989, %r990}, [%r23]; 2026-02-21T09:10:47.7029955Z mov.b32 {%rs225, %rs226}, %r990; 2026-02-21T09:10:47.7030017Z mov.b32 {%rs227, %rs228}, %r989; 2026-02-21T09:10:47.7030081Z mov.b32 {%rs229, %rs230}, %r988; 2026-02-21T09:10:47.7030137Z mov.b32 {%rs231, %rs232}, %r987; 2026-02-21T09:10:47.7030231Z ld.shared.v4.b32 {%r991, %r992, %r993, %r994}, [%r23+16]; 2026-02-21T09:10:47.7030300Z mov.b32 {%rs233, %rs234}, %r994; 2026-02-21T09:10:47.7030359Z mov.b32 {%rs235, %rs236}, %r993; 2026-02-21T09:10:47.7030417Z mov.b32 {%rs237, %rs238}, %r992; 2026-02-21T09:10:47.7030482Z mov.b32 {%rs239, %rs240}, %r991; 2026-02-21T09:10:47.7030646Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 2026-02-21T09:10:47.7030708Z cvt.f32.bf16 %r903, %rs231; 2026-02-21T09:10:47.7030768Z cvt.f32.bf16 %r904, %rs232; 2026-02-21T09:10:47.7030835Z cvt.f32.bf16 %r905, %rs229; 2026-02-21T09:10:47.7030892Z cvt.f32.bf16 %r906, %rs230; 2026-02-21T09:10:47.7030949Z cvt.f32.bf16 %r907, %rs227; 2026-02-21T09:10:47.7031014Z cvt.f32.bf16 %r908, %rs228; 2026-02-21T09:10:47.7031071Z cvt.f32.bf16 %r909, %rs225; 2026-02-21T09:10:47.7031126Z cvt.f32.bf16 %r910, %rs226; 2026-02-21T09:10:47.7031182Z cvt.f32.bf16 %r911, %rs239; 2026-02-21T09:10:47.7031247Z cvt.f32.bf16 %r912, %rs240; 2026-02-21T09:10:47.7031303Z cvt.f32.bf16 %r913, %rs237; 2026-02-21T09:10:47.7031360Z cvt.f32.bf16 %r914, %rs238; 2026-02-21T09:10:47.7031426Z cvt.f32.bf16 %r915, %rs235; 2026-02-21T09:10:47.7031483Z cvt.f32.bf16 %r916, %rs236; 2026-02-21T09:10:47.7031575Z cvt.f32.bf16 %r917, %rs233; 2026-02-21T09:10:47.7031641Z cvt.f32.bf16 %r918, %rs234; 2026-02-21T09:10:47.7031693Z $L__tmp129: 2026-02-21T09:10:47.7031905Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7031962Z // begin inline asm 
2026-02-21T09:10:47.7032250Z @%p177 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1211 + 0], {%r903, %r904, %r905, %r906, %r907, %r908, %r909, %r910, %r911, %r912, %r913, %r914, %r915, %r916, %r917, %r918}; 2026-02-21T09:10:47.7032334Z // end inline asm 2026-02-21T09:10:47.7032390Z // begin inline asm 2026-02-21T09:10:47.7032469Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.7032523Z // end inline asm 2026-02-21T09:10:47.7032577Z bar.sync 0; 2026-02-21T09:10:47.7032637Z $L__tmp130: 2026-02-21T09:10:47.7032807Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7032889Z // begin inline asm 2026-02-21T09:10:47.7032941Z 2026-02-21T09:10:47.7033000Z { 2026-02-21T09:10:47.7033061Z .reg .pred complete; 2026-02-21T09:10:47.7033117Z waitLoop: 2026-02-21T09:10:47.7033247Z mbarrier.try_wait.parity.shared.b64 complete, [%r1331], %r2076; 2026-02-21T09:10:47.7033311Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7033362Z } 2026-02-21T09:10:47.7033366Z 2026-02-21T09:10:47.7033421Z // end inline asm 2026-02-21T09:10:47.7033615Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7033680Z ld.shared.b8 %rs241, [%r25]; 2026-02-21T09:10:47.7033743Z ld.shared.b8 %rs242, [%r25+512]; 2026-02-21T09:10:47.7033812Z ld.shared.b8 %rs243, [%r27+128]; 2026-02-21T09:10:47.7033872Z ld.shared.b8 %rs244, [%r27+640]; 2026-02-21T09:10:47.7033959Z ld.shared.b8 %rs245, [%r29+256]; 2026-02-21T09:10:47.7034028Z ld.shared.b8 %rs246, [%r29+768]; 2026-02-21T09:10:47.7034088Z ld.shared.b8 %rs247, [%r31+384]; 2026-02-21T09:10:47.7034147Z ld.shared.b8 %rs248, [%r31+896]; 2026-02-21T09:10:47.7034311Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.7034379Z shl.b16 %rs249, %rs241, 4; 2026-02-21T09:10:47.7034438Z shl.b16 %rs250, %rs243, 4; 2026-02-21T09:10:47.7034497Z shl.b16 %rs251, %rs245, 4; 2026-02-21T09:10:47.7034560Z shl.b16 %rs252, %rs247, 4; 
2026-02-21T09:10:47.7034618Z shl.b16 %rs253, %rs242, 4; 2026-02-21T09:10:47.7034674Z shl.b16 %rs254, %rs244, 4; 2026-02-21T09:10:47.7034730Z shl.b16 %rs255, %rs246, 4; 2026-02-21T09:10:47.7034793Z shl.b16 %rs256, %rs248, 4; 2026-02-21T09:10:47.7034952Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.7035022Z selp.b16 %rs257, %rs249, %rs241, %p346; 2026-02-21T09:10:47.7035090Z cvt.s16.s8 %rs258, %rs257; 2026-02-21T09:10:47.7035148Z shr.s16 %rs259, %rs258, 4; 2026-02-21T09:10:47.7035216Z selp.b16 %rs260, %rs250, %rs243, %p346; 2026-02-21T09:10:47.7035281Z cvt.s16.s8 %rs261, %rs260; 2026-02-21T09:10:47.7035337Z shr.s16 %rs262, %rs261, 4; 2026-02-21T09:10:47.7035402Z selp.b16 %rs263, %rs251, %rs245, %p346; 2026-02-21T09:10:47.7035459Z cvt.s16.s8 %rs264, %rs263; 2026-02-21T09:10:47.7035522Z shr.s16 %rs265, %rs264, 4; 2026-02-21T09:10:47.7035587Z selp.b16 %rs266, %rs252, %rs247, %p346; 2026-02-21T09:10:47.7035647Z cvt.s16.s8 %rs267, %rs266; 2026-02-21T09:10:47.7035710Z shr.s16 %rs268, %rs267, 4; 2026-02-21T09:10:47.7035775Z selp.b16 %rs269, %rs253, %rs242, %p346; 2026-02-21T09:10:47.7035834Z cvt.s16.s8 %rs270, %rs269; 2026-02-21T09:10:47.7035890Z shr.s16 %rs271, %rs270, 4; 2026-02-21T09:10:47.7035962Z selp.b16 %rs272, %rs254, %rs244, %p346; 2026-02-21T09:10:47.7036020Z cvt.s16.s8 %rs273, %rs272; 2026-02-21T09:10:47.7036077Z shr.s16 %rs274, %rs273, 4; 2026-02-21T09:10:47.7036150Z selp.b16 %rs275, %rs255, %rs246, %p346; 2026-02-21T09:10:47.7036209Z cvt.s16.s8 %rs276, %rs275; 2026-02-21T09:10:47.7036266Z shr.s16 %rs277, %rs276, 4; 2026-02-21T09:10:47.7036329Z selp.b16 %rs278, %rs256, %rs248, %p346; 2026-02-21T09:10:47.7036395Z cvt.s16.s8 %rs279, %rs278; 2026-02-21T09:10:47.7036451Z shr.s16 %rs280, %rs279, 4; 2026-02-21T09:10:47.7036612Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.7036681Z cvt.rn.f32.s16 %r995, %rs259; 2026-02-21T09:10:47.7036766Z cvt.rn.f32.s16 %r996, 
%rs262; 2026-02-21T09:10:47.7036829Z cvt.rn.f32.s16 %r997, %rs265; 2026-02-21T09:10:47.7036897Z cvt.rn.f32.s16 %r998, %rs268; 2026-02-21T09:10:47.7036957Z cvt.rn.f32.s16 %r999, %rs271; 2026-02-21T09:10:47.7037020Z cvt.rn.f32.s16 %r1000, %rs274; 2026-02-21T09:10:47.7037085Z cvt.rn.f32.s16 %r1001, %rs277; 2026-02-21T09:10:47.7037151Z cvt.rn.f32.s16 %r1002, %rs280; 2026-02-21T09:10:47.7037211Z st.shared.b32 [%r32], %r995; 2026-02-21T09:10:47.7037292Z st.shared.b32 [%r33], %r996; 2026-02-21T09:10:47.7037355Z st.shared.b32 [%r34], %r997; 2026-02-21T09:10:47.7037414Z st.shared.b32 [%r35], %r998; 2026-02-21T09:10:47.7037472Z st.shared.b32 [%r36], %r999; 2026-02-21T09:10:47.7037529Z st.shared.b32 [%r37], %r1000; 2026-02-21T09:10:47.7037591Z st.shared.b32 [%r38], %r1001; 2026-02-21T09:10:47.7037648Z st.shared.b32 [%r39], %r1002; 2026-02-21T09:10:47.7037697Z $L__tmp131: 2026-02-21T09:10:47.7037909Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7037985Z // begin inline asm 2026-02-21T09:10:47.7038057Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7038115Z // end inline asm 2026-02-21T09:10:47.7038167Z bar.sync 0; 2026-02-21T09:10:47.7038224Z @%p60 bra $L__BB0_16; 2026-02-21T09:10:47.7038320Z // %bb.15: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.7038409Z elect.sync %r1015|%p176, -1; 2026-02-21T09:10:47.7038467Z mov.b32 %r1005, 135268624; 2026-02-21T09:10:47.7038523Z mov.pred %p175, 0; 2026-02-21T09:10:47.7038581Z // begin inline asm 2026-02-21T09:10:47.7038738Z @%p176 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd379, %r1005, %p175; 2026-02-21T09:10:47.7038789Z // end inline asm 2026-02-21T09:10:47.7038849Z // begin inline asm 2026-02-21T09:10:47.7039000Z @%p176 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 8 ], %rd380, %r1005, %p177; 2026-02-21T09:10:47.7039055Z // end inline asm 2026-02-21T09:10:47.7039108Z // begin inline asm 
2026-02-21T09:10:47.7039263Z @%p176 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd381, %r1005, %p177; 2026-02-21T09:10:47.7039315Z // end inline asm 2026-02-21T09:10:47.7039369Z // begin inline asm 2026-02-21T09:10:47.7039524Z @%p176 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd382, %r1005, %p177; 2026-02-21T09:10:47.7039576Z // end inline asm 2026-02-21T09:10:47.7039634Z add.s32 %r1017, %r201, 26624; 2026-02-21T09:10:47.7039695Z cvt.u64.u32 %rd289, %r1017; 2026-02-21T09:10:47.7039748Z // begin inline asm 2026-02-21T09:10:47.7039873Z @%p176 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd289]; 2026-02-21T09:10:47.7039924Z // end inline asm 2026-02-21T09:10:47.7039979Z $L__tmp132: 2026-02-21T09:10:47.7040079Z $L__BB0_16: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.7040241Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7040307Z add.s64 %rd290, %rd23, 128; 2026-02-21T09:10:47.7040362Z add.s64 %rd291, %rd24, 128; 2026-02-21T09:10:47.7040523Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7040582Z // begin inline asm 2026-02-21T09:10:47.7040700Z cp.async.cg.shared.global [ %r1327 + 0 ], [ %rd290 + 0 ], 0x10, %r1019; 2026-02-21T09:10:47.7040754Z // end inline asm 2026-02-21T09:10:47.7040808Z // begin inline asm 2026-02-21T09:10:47.7040926Z cp.async.cg.shared.global [ %r1329 + 0 ], [ %rd291 + 0 ], 0x10, %r1019; 2026-02-21T09:10:47.7040979Z // end inline asm 2026-02-21T09:10:47.7041039Z cp.async.commit_group; 2026-02-21T09:10:47.7041209Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7041261Z // begin inline asm 2026-02-21T09:10:47.7041369Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1331], 1024; 2026-02-21T09:10:47.7041449Z // end inline asm 2026-02-21T09:10:47.7041638Z .loc 1 64 33 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7041694Z bar.sync 0; 2026-02-21T09:10:47.7041756Z elect.sync %r1031|%p187, -1; 2026-02-21T09:10:47.7041821Z and.pred %p185, %p6, %p187; 2026-02-21T09:10:47.7041875Z mov.b32 %r1025, 32; 2026-02-21T09:10:47.7041930Z // begin inline asm 2026-02-21T09:10:47.7042205Z @%p185 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r240], [%rd463, {%r1152, %r1025}], [%r1331]; 2026-02-21T09:10:47.7042257Z // end inline asm 2026-02-21T09:10:47.7042423Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7042485Z shl.b32 %r1032, %r90, 17; 2026-02-21T09:10:47.7042541Z or.b32 %r1033, %r44, %r1032; 2026-02-21T09:10:47.7042605Z mad.wide.s32 %rd622, %r1033, 2, %rd7; 2026-02-21T09:10:47.7042663Z or.b32 %r2075, %r45, %r1032; 2026-02-21T09:10:47.7042719Z mov.b64 %rd623, 0; 2026-02-21T09:10:47.7042799Z mov.b32 %r2078, %r2076; 2026-02-21T09:10:47.7042857Z mov.b32 %r2079, %r2076; 2026-02-21T09:10:47.7042914Z mov.b32 %r2081, %r2076; 2026-02-21T09:10:47.7042972Z bra.uni $L__BB0_17; 2026-02-21T09:10:47.7043070Z $L__BB0_19: // in Loop: Header=BB0_17 Depth=2 2026-02-21T09:10:47.7043273Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7043352Z setp.lt.u64 %p203, %rd623, 464; 2026-02-21T09:10:47.7043408Z $L__tmp133: 2026-02-21T09:10:47.7043629Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7043696Z add.s32 %r1108, %r2080, 1; 2026-02-21T09:10:47.7043757Z setp.gt.s32 %p206, %r1108, 1; 2026-02-21T09:10:47.7043824Z selp.b32 %r2080, 0, %r1108, %p206; 2026-02-21T09:10:47.7043891Z selp.b32 %r1109, 1, 0, %p206; 2026-02-21T09:10:47.7043953Z xor.b32 %r108, %r2081, %r1109; 2026-02-21T09:10:47.7044006Z $L__tmp134: 2026-02-21T09:10:47.7044182Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 
2026-02-21T09:10:47.7044255Z mad.wide.s32 %rd300, %r2075, 2, %rd41; 2026-02-21T09:10:47.7044428Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7044490Z add.s32 %r1099, %r102, %r12; 2026-02-21T09:10:47.7044560Z selp.b32 %r1100, 16, 0, %p203; 2026-02-21T09:10:47.7044617Z // begin inline asm 2026-02-21T09:10:47.7044738Z cp.async.cg.shared.global [ %r1099 + 0 ], [ %rd622 + 0 ], 0x10, %r1100; 2026-02-21T09:10:47.7044797Z // end inline asm 2026-02-21T09:10:47.7044859Z add.s32 %r1101, %r1099, 4096; 2026-02-21T09:10:47.7044915Z // begin inline asm 2026-02-21T09:10:47.7045034Z cp.async.cg.shared.global [ %r1101 + 0 ], [ %rd300 + 0 ], 0x10, %r1100; 2026-02-21T09:10:47.7045099Z // end inline asm 2026-02-21T09:10:47.7045162Z cp.async.commit_group; 2026-02-21T09:10:47.7045344Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7045418Z and.pred %p201, %p9, %p203; 2026-02-21T09:10:47.7045475Z // begin inline asm 2026-02-21T09:10:47.7045594Z @%p201 mbarrier.arrive.expect_tx.shared.b64 _, [%r1103], 1024; 2026-02-21T09:10:47.7045655Z // end inline asm 2026-02-21T09:10:47.7045828Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7045882Z bar.sync 0; 2026-02-21T09:10:47.7045946Z elect.sync %r1110|%p207, -1; 2026-02-21T09:10:47.7046016Z and.pred %p208, %p203, %p207; 2026-02-21T09:10:47.7046078Z and.pred %p202, %p6, %p208; 2026-02-21T09:10:47.7046136Z cvt.u32.u64 %r1111, %rd623; 2026-02-21T09:10:47.7046199Z add.s32 %r1106, %r1111, 48; 2026-02-21T09:10:47.7046255Z // begin inline asm 2026-02-21T09:10:47.7046517Z @%p202 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1104], [%rd463, {%r1152, %r1106}], [%r1103]; 2026-02-21T09:10:47.7046623Z // end inline asm 2026-02-21T09:10:47.7046795Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7046856Z add.s64 %rd622, 
%rd622, 64; 2026-02-21T09:10:47.7046913Z add.s32 %r2075, %r2075, 32; 2026-02-21T09:10:47.7046986Z setp.lt.u64 %p209, %rd623, 480; 2026-02-21T09:10:47.7047067Z add.s64 %rd623, %rd623, 16; 2026-02-21T09:10:47.7047125Z mov.b32 %r2076, %r2081; 2026-02-21T09:10:47.7047186Z mov.b32 %r2081, %r108; 2026-02-21T09:10:47.7047246Z @%p209 bra $L__BB0_17; 2026-02-21T09:10:47.7047302Z bra.uni $L__BB0_20; 2026-02-21T09:10:47.7047403Z $L__BB0_17: // Parent Loop BB0_2 Depth=1 2026-02-21T09:10:47.7047505Z // => This Inner Loop Header: Depth=2 2026-02-21T09:10:47.7047566Z add.s32 %r1055, %r2079, 1; 2026-02-21T09:10:47.7047630Z setp.gt.s32 %p191, %r1055, 1; 2026-02-21T09:10:47.7047701Z selp.b32 %r2079, 0, %r1055, %p191; 2026-02-21T09:10:47.7047890Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7047954Z cp.async.wait_group 1; 2026-02-21T09:10:47.7048013Z bar.sync 0; 2026-02-21T09:10:47.7048070Z shl.b32 %r1056, %r2079, 13; 2026-02-21T09:10:47.7048151Z add.s32 %r102, %r201, %r1056; 2026-02-21T09:10:47.7048215Z add.s32 %r1058, %r102, %r22; 2026-02-21T09:10:47.7048327Z ld.shared.v4.b32 {%r1059, %r1060, %r1061, %r1062}, [%r1058]; 2026-02-21T09:10:47.7048390Z mov.b32 {%rs281, %rs282}, %r1062; 2026-02-21T09:10:47.7048454Z mov.b32 {%rs283, %rs284}, %r1061; 2026-02-21T09:10:47.7048519Z mov.b32 {%rs285, %rs286}, %r1060; 2026-02-21T09:10:47.7048577Z mov.b32 {%rs287, %rs288}, %r1059; 2026-02-21T09:10:47.7048684Z ld.shared.v4.b32 {%r1063, %r1064, %r1065, %r1066}, [%r1058+16]; 2026-02-21T09:10:47.7048746Z mov.b32 {%rs289, %rs290}, %r1066; 2026-02-21T09:10:47.7048806Z mov.b32 {%rs291, %rs292}, %r1065; 2026-02-21T09:10:47.7048865Z mov.b32 {%rs293, %rs294}, %r1064; 2026-02-21T09:10:47.7048923Z mov.b32 {%rs295, %rs296}, %r1063; 2026-02-21T09:10:47.7049096Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 2026-02-21T09:10:47.7049158Z cvt.f32.bf16 %r1037, %rs287; 2026-02-21T09:10:47.7049219Z cvt.f32.bf16 %r1038, %rs288; 
2026-02-21T09:10:47.7049287Z cvt.f32.bf16 %r1039, %rs285; 2026-02-21T09:10:47.7049346Z cvt.f32.bf16 %r1040, %rs286; 2026-02-21T09:10:47.7049405Z cvt.f32.bf16 %r1041, %rs283; 2026-02-21T09:10:47.7049464Z cvt.f32.bf16 %r1042, %rs284; 2026-02-21T09:10:47.7049526Z cvt.f32.bf16 %r1043, %rs281; 2026-02-21T09:10:47.7049585Z cvt.f32.bf16 %r1044, %rs282; 2026-02-21T09:10:47.7049642Z cvt.f32.bf16 %r1045, %rs295; 2026-02-21T09:10:47.7049705Z cvt.f32.bf16 %r1046, %rs296; 2026-02-21T09:10:47.7049762Z cvt.f32.bf16 %r1047, %rs293; 2026-02-21T09:10:47.7049823Z cvt.f32.bf16 %r1048, %rs294; 2026-02-21T09:10:47.7049879Z cvt.f32.bf16 %r1049, %rs291; 2026-02-21T09:10:47.7049941Z cvt.f32.bf16 %r1050, %rs292; 2026-02-21T09:10:47.7049999Z cvt.f32.bf16 %r1051, %rs289; 2026-02-21T09:10:47.7050058Z cvt.f32.bf16 %r1052, %rs290; 2026-02-21T09:10:47.7050116Z $L__tmp135: 2026-02-21T09:10:47.7050341Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7050401Z // begin inline asm 2026-02-21T09:10:47.7050455Z 2026-02-21T09:10:47.7050504Z { 2026-02-21T09:10:47.7050566Z .reg .pred complete; 2026-02-21T09:10:47.7050619Z waitLoop: 2026-02-21T09:10:47.7050756Z mbarrier.try_wait.parity.shared.b64 complete, [%r2077], %r2076; 2026-02-21T09:10:47.7050822Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7050869Z } 2026-02-21T09:10:47.7050872Z 2026-02-21T09:10:47.7050931Z // end inline asm 2026-02-21T09:10:47.7050982Z $L__tmp136: 2026-02-21T09:10:47.7051150Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7051233Z selp.b32 %r1067, 1, 0, %p191; 2026-02-21T09:10:47.7051300Z xor.b32 %r2078, %r2078, %r1067; 2026-02-21T09:10:47.7051357Z mov.pred %p192, -1; 2026-02-21T09:10:47.7051407Z $L__tmp137: 2026-02-21T09:10:47.7051659Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7051740Z // begin inline asm 
2026-02-21T09:10:47.7052047Z @%p192 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1211 + 0], {%r1037, %r1038, %r1039, %r1040, %r1041, %r1042, %r1043, %r1044, %r1045, %r1046, %r1047, %r1048, %r1049, %r1050, %r1051, %r1052}; 2026-02-21T09:10:47.7052108Z // end inline asm 2026-02-21T09:10:47.7052161Z // begin inline asm 2026-02-21T09:10:47.7052228Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.7052284Z // end inline asm 2026-02-21T09:10:47.7052336Z bar.sync 0; 2026-02-21T09:10:47.7052385Z $L__tmp138: 2026-02-21T09:10:47.7052584Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7052650Z shl.b32 %r1068, %r2079, 3; 2026-02-21T09:10:47.7052710Z add.s32 %r1069, %r201, %r1068; 2026-02-21T09:10:47.7052767Z add.s32 %r1103, %r1069, 26640; 2026-02-21T09:10:47.7052826Z // begin inline asm 2026-02-21T09:10:47.7052874Z 2026-02-21T09:10:47.7052923Z { 2026-02-21T09:10:47.7053005Z .reg .pred complete; 2026-02-21T09:10:47.7053066Z waitLoop: 2026-02-21T09:10:47.7053186Z mbarrier.try_wait.parity.shared.b64 complete, [%r1103], %r2078; 2026-02-21T09:10:47.7053249Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7053305Z } 2026-02-21T09:10:47.7053308Z 2026-02-21T09:10:47.7053359Z // end inline asm 2026-02-21T09:10:47.7053514Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7053571Z shl.b32 %r1070, %r2079, 10; 2026-02-21T09:10:47.7053634Z add.s32 %r1071, %r201, %r1070; 2026-02-21T09:10:47.7053694Z add.s32 %r1104, %r1071, 24576; 2026-02-21T09:10:47.7053751Z add.s32 %r1072, %r1104, %r24; 2026-02-21T09:10:47.7053817Z ld.shared.b8 %rs297, [%r1072]; 2026-02-21T09:10:47.7053882Z ld.shared.b8 %rs298, [%r1072+512]; 2026-02-21T09:10:47.7053938Z add.s32 %r1073, %r1104, %r26; 2026-02-21T09:10:47.7054005Z ld.shared.b8 %rs299, [%r1073+128]; 2026-02-21T09:10:47.7054065Z ld.shared.b8 %rs300, [%r1073+640]; 2026-02-21T09:10:47.7054121Z add.s32 %r1074, %r1104, %r28; 2026-02-21T09:10:47.7054180Z ld.shared.b8 
%rs301, [%r1074+256]; 2026-02-21T09:10:47.7054245Z ld.shared.b8 %rs302, [%r1074+768]; 2026-02-21T09:10:47.7054301Z add.s32 %r1075, %r1104, %r30; 2026-02-21T09:10:47.7054359Z ld.shared.b8 %rs303, [%r1075+384]; 2026-02-21T09:10:47.7054419Z ld.shared.b8 %rs304, [%r1075+896]; 2026-02-21T09:10:47.7054581Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.7054641Z shl.b16 %rs305, %rs297, 4; 2026-02-21T09:10:47.7054699Z shl.b16 %rs306, %rs299, 4; 2026-02-21T09:10:47.7054757Z shl.b16 %rs307, %rs301, 4; 2026-02-21T09:10:47.7054814Z shl.b16 %rs308, %rs303, 4; 2026-02-21T09:10:47.7054869Z shl.b16 %rs309, %rs298, 4; 2026-02-21T09:10:47.7054930Z shl.b16 %rs310, %rs300, 4; 2026-02-21T09:10:47.7054984Z shl.b16 %rs311, %rs302, 4; 2026-02-21T09:10:47.7055038Z shl.b16 %rs312, %rs304, 4; 2026-02-21T09:10:47.7055200Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.7055269Z selp.b16 %rs313, %rs305, %rs297, %p346; 2026-02-21T09:10:47.7055327Z cvt.s16.s8 %rs314, %rs313; 2026-02-21T09:10:47.7055380Z shr.s16 %rs315, %rs314, 4; 2026-02-21T09:10:47.7055450Z selp.b16 %rs316, %rs306, %rs299, %p346; 2026-02-21T09:10:47.7055505Z cvt.s16.s8 %rs317, %rs316; 2026-02-21T09:10:47.7055560Z shr.s16 %rs318, %rs317, 4; 2026-02-21T09:10:47.7055628Z selp.b16 %rs319, %rs307, %rs301, %p346; 2026-02-21T09:10:47.7055712Z cvt.s16.s8 %rs320, %rs319; 2026-02-21T09:10:47.7055767Z shr.s16 %rs321, %rs320, 4; 2026-02-21T09:10:47.7055834Z selp.b16 %rs322, %rs308, %rs303, %p346; 2026-02-21T09:10:47.7055894Z cvt.s16.s8 %rs323, %rs322; 2026-02-21T09:10:47.7055948Z shr.s16 %rs324, %rs323, 4; 2026-02-21T09:10:47.7056008Z selp.b16 %rs325, %rs309, %rs298, %p346; 2026-02-21T09:10:47.7056066Z cvt.s16.s8 %rs326, %rs325; 2026-02-21T09:10:47.7056120Z shr.s16 %rs327, %rs326, 4; 2026-02-21T09:10:47.7056183Z selp.b16 %rs328, %rs310, %rs300, %p346; 2026-02-21T09:10:47.7056266Z cvt.s16.s8 %rs329, %rs328; 2026-02-21T09:10:47.7056326Z 
shr.s16 %rs330, %rs329, 4; 2026-02-21T09:10:47.7056389Z selp.b16 %rs331, %rs311, %rs302, %p346; 2026-02-21T09:10:47.7056444Z cvt.s16.s8 %rs332, %rs331; 2026-02-21T09:10:47.7056504Z shr.s16 %rs333, %rs332, 4; 2026-02-21T09:10:47.7056566Z selp.b16 %rs334, %rs312, %rs304, %p346; 2026-02-21T09:10:47.7056620Z cvt.s16.s8 %rs335, %rs334; 2026-02-21T09:10:47.7056679Z shr.s16 %rs336, %rs335, 4; 2026-02-21T09:10:47.7056843Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.7056932Z cvt.rn.f32.s16 %r1076, %rs315; 2026-02-21T09:10:47.7056993Z cvt.rn.f32.s16 %r1077, %rs318; 2026-02-21T09:10:47.7057057Z cvt.rn.f32.s16 %r1078, %rs321; 2026-02-21T09:10:47.7057113Z cvt.rn.f32.s16 %r1079, %rs324; 2026-02-21T09:10:47.7057168Z cvt.rn.f32.s16 %r1080, %rs327; 2026-02-21T09:10:47.7057249Z cvt.rn.f32.s16 %r1081, %rs330; 2026-02-21T09:10:47.7057307Z cvt.rn.f32.s16 %r1082, %rs333; 2026-02-21T09:10:47.7057363Z cvt.rn.f32.s16 %r1083, %rs336; 2026-02-21T09:10:47.7057421Z st.shared.b32 [%r32], %r1076; 2026-02-21T09:10:47.7057483Z st.shared.b32 [%r33], %r1077; 2026-02-21T09:10:47.7057540Z st.shared.b32 [%r34], %r1078; 2026-02-21T09:10:47.7057595Z st.shared.b32 [%r35], %r1079; 2026-02-21T09:10:47.7057655Z st.shared.b32 [%r36], %r1080; 2026-02-21T09:10:47.7057710Z st.shared.b32 [%r37], %r1081; 2026-02-21T09:10:47.7057764Z st.shared.b32 [%r38], %r1082; 2026-02-21T09:10:47.7057826Z st.shared.b32 [%r39], %r1083; 2026-02-21T09:10:47.7057994Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7058050Z shl.b32 %r1084, %r2080, 3; 2026-02-21T09:10:47.7058107Z add.s32 %r1085, %r201, %r1084; 2026-02-21T09:10:47.7058164Z add.s32 %r2077, %r1085, 26624; 2026-02-21T09:10:47.7058213Z $L__tmp139: 2026-02-21T09:10:47.7058424Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7058485Z // begin inline asm 2026-02-21T09:10:47.7058552Z 
fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7058603Z // end inline asm 2026-02-21T09:10:47.7058653Z bar.sync 0; 2026-02-21T09:10:47.7058713Z @%p60 bra $L__BB0_19; 2026-02-21T09:10:47.7058807Z // %bb.18: // in Loop: Header=BB0_17 Depth=2 2026-02-21T09:10:47.7058867Z elect.sync %r1098|%p193, -1; 2026-02-21T09:10:47.7058930Z mov.b32 %r1088, 135268624; 2026-02-21T09:10:47.7058983Z // begin inline asm 2026-02-21T09:10:47.7059138Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd379, %r1088, %p192; 2026-02-21T09:10:47.7059194Z // end inline asm 2026-02-21T09:10:47.7059248Z // begin inline asm 2026-02-21T09:10:47.7059398Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 8 ], %rd380, %r1088, %p192; 2026-02-21T09:10:47.7059450Z // end inline asm 2026-02-21T09:10:47.7059508Z // begin inline asm 2026-02-21T09:10:47.7059656Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd381, %r1088, %p192; 2026-02-21T09:10:47.7059709Z // end inline asm 2026-02-21T09:10:47.7059765Z // begin inline asm 2026-02-21T09:10:47.7059912Z @%p193 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd382, %r1088, %p192; 2026-02-21T09:10:47.7059965Z // end inline asm 2026-02-21T09:10:47.7060030Z cvt.u64.u32 %rd298, %r2077; 2026-02-21T09:10:47.7060104Z // begin inline asm 2026-02-21T09:10:47.7060231Z @%p193 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd298]; 2026-02-21T09:10:47.7060291Z // end inline asm 2026-02-21T09:10:47.7060347Z bra.uni $L__BB0_19; 2026-02-21T09:10:47.7060444Z $L__BB0_20: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.7060531Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:47.7060592Z mov.b32 %r2087, 1; 2026-02-21T09:10:47.7060828Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7060882Z // begin inline asm 2026-02-21T09:10:47.7060934Z 2026-02-21T09:10:47.7060983Z { 
2026-02-21T09:10:47.7061043Z .reg .pred complete; 2026-02-21T09:10:47.7061095Z waitLoop: 2026-02-21T09:10:47.7061216Z mbarrier.try_wait.parity.shared.b64 complete, [%r2077], %r2087; 2026-02-21T09:10:47.7061281Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7061329Z } 2026-02-21T09:10:47.7061332Z 2026-02-21T09:10:47.7061387Z // end inline asm 2026-02-21T09:10:47.7061437Z $L__tmp140: 2026-02-21T09:10:47.7061660Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7061726Z cp.async.wait_group 0; 2026-02-21T09:10:47.7061778Z bar.sync 0; 2026-02-21T09:10:47.7061831Z // begin inline asm 2026-02-21T09:10:47.7061954Z @%p9 mbarrier.inval.shared::cta.b64 [%r1331]; 2026-02-21T09:10:47.7062015Z // end inline asm 2026-02-21T09:10:47.7062068Z bar.sync 0; 2026-02-21T09:10:47.7062123Z // begin inline asm 2026-02-21T09:10:47.7062206Z @%p9 mbarrier.inval.shared::cta.b64 [%r1115]; 2026-02-21T09:10:47.7062258Z // end inline asm 2026-02-21T09:10:47.7062315Z add.s32 %r2084, %r201, 26624; 2026-02-21T09:10:47.7062370Z // begin inline asm 2026-02-21T09:10:47.7062449Z @%p9 mbarrier.inval.shared::cta.b64 [%r2084]; 2026-02-21T09:10:47.7062500Z // end inline asm 2026-02-21T09:10:47.7062553Z bar.sync 0; 2026-02-21T09:10:47.7062613Z // begin inline asm 2026-02-21T09:10:47.7062689Z @%p9 mbarrier.inval.shared::cta.b64 [%r1117]; 2026-02-21T09:10:47.7062744Z // end inline asm 2026-02-21T09:10:47.7062800Z $L__tmp141: 2026-02-21T09:10:47.7063006Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7063060Z // begin inline asm 2026-02-21T09:10:47.7063357Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1118, %r1119, %r1120, %r1121, %r1122, %r1123, %r1124, %r1125, %r1126, %r1127, %r1128, %r1129, %r1130, %r1131, %r1132, %r1133}, [%r1460 + 0]; 2026-02-21T09:10:47.7063415Z // end inline asm 2026-02-21T09:10:47.7063468Z // begin inline asm 2026-02-21T09:10:47.7063767Z 
tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1135, %r1136, %r1137, %r1138, %r1139, %r1140, %r1141, %r1142, %r1143, %r1144, %r1145, %r1146, %r1147, %r1148, %r1149, %r1150}, [%r1460 + 16]; 2026-02-21T09:10:47.7063824Z // end inline asm 2026-02-21T09:10:47.7063877Z // begin inline asm 2026-02-21T09:10:47.7063944Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:47.7063998Z // end inline asm 2026-02-21T09:10:47.7064056Z cvt.u64.u32 %rd309, %r1118; 2026-02-21T09:10:47.7064114Z cvt.u64.u32 %rd310, %r1119; 2026-02-21T09:10:47.7064170Z shl.b64 %rd311, %rd310, 32; 2026-02-21T09:10:47.7064237Z or.b64 %rd312, %rd309, %rd311; 2026-02-21T09:10:47.7064286Z $L__tmp142: 2026-02-21T09:10:47.7064449Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7064516Z mov.b64 {%r1230, %r1231}, %rd312; 2026-02-21T09:10:47.7064586Z cvt.rn.bf16x2.f32 %r1232, %r1231, %r1230; 2026-02-21T09:10:47.7064635Z $L__tmp143: 2026-02-21T09:10:47.7064841Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7064898Z cvt.u64.u32 %rd313, %r1120; 2026-02-21T09:10:47.7064954Z cvt.u64.u32 %rd314, %r1121; 2026-02-21T09:10:47.7065035Z shl.b64 %rd315, %rd314, 32; 2026-02-21T09:10:47.7065094Z or.b64 %rd316, %rd313, %rd315; 2026-02-21T09:10:47.7065144Z $L__tmp144: 2026-02-21T09:10:47.7065305Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7065369Z mov.b64 {%r1233, %r1234}, %rd316; 2026-02-21T09:10:47.7065439Z cvt.rn.bf16x2.f32 %r1235, %r1234, %r1233; 2026-02-21T09:10:47.7065488Z $L__tmp145: 2026-02-21T09:10:47.7065697Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7065780Z cvt.u64.u32 %rd317, %r1122; 2026-02-21T09:10:47.7065837Z cvt.u64.u32 %rd318, %r1123; 2026-02-21T09:10:47.7065891Z shl.b64 %rd319, %rd318, 32; 2026-02-21T09:10:47.7065954Z or.b64 %rd320, %rd317, %rd319; 
2026-02-21T09:10:47.7066003Z $L__tmp146: 2026-02-21T09:10:47.7066164Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7066227Z mov.b64 {%r1236, %r1237}, %rd320; 2026-02-21T09:10:47.7066294Z cvt.rn.bf16x2.f32 %r1238, %r1237, %r1236; 2026-02-21T09:10:47.7066360Z $L__tmp147: 2026-02-21T09:10:47.7066565Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7066624Z cvt.u64.u32 %rd321, %r1124; 2026-02-21T09:10:47.7066678Z cvt.u64.u32 %rd322, %r1125; 2026-02-21T09:10:47.7066757Z shl.b64 %rd323, %rd322, 32; 2026-02-21T09:10:47.7066821Z or.b64 %rd324, %rd321, %rd323; 2026-02-21T09:10:47.7066873Z $L__tmp148: 2026-02-21T09:10:47.7067030Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7067090Z mov.b64 {%r1239, %r1240}, %rd324; 2026-02-21T09:10:47.7067156Z cvt.rn.bf16x2.f32 %r1241, %r1240, %r1239; 2026-02-21T09:10:47.7067206Z $L__tmp149: 2026-02-21T09:10:47.7067412Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7067472Z cvt.u64.u32 %rd325, %r1126; 2026-02-21T09:10:47.7067528Z cvt.u64.u32 %rd326, %r1127; 2026-02-21T09:10:47.7067583Z shl.b64 %rd327, %rd326, 32; 2026-02-21T09:10:47.7067646Z or.b64 %rd328, %rd325, %rd327; 2026-02-21T09:10:47.7067693Z $L__tmp150: 2026-02-21T09:10:47.7067852Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7067916Z mov.b64 {%r1242, %r1243}, %rd328; 2026-02-21T09:10:47.7067984Z cvt.rn.bf16x2.f32 %r1244, %r1243, %r1242; 2026-02-21T09:10:47.7068036Z $L__tmp151: 2026-02-21T09:10:47.7068239Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7068298Z cvt.u64.u32 %rd329, %r1128; 2026-02-21T09:10:47.7068356Z cvt.u64.u32 %rd330, %r1129; 2026-02-21T09:10:47.7068411Z shl.b64 %rd331, %rd330, 32; 
2026-02-21T09:10:47.7068473Z or.b64 %rd332, %rd329, %rd331; 2026-02-21T09:10:47.7068528Z $L__tmp152: 2026-02-21T09:10:47.7068690Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7068752Z mov.b64 {%r1245, %r1246}, %rd332; 2026-02-21T09:10:47.7068819Z cvt.rn.bf16x2.f32 %r1247, %r1246, %r1245; 2026-02-21T09:10:47.7068869Z $L__tmp153: 2026-02-21T09:10:47.7069074Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7069136Z cvt.u64.u32 %rd333, %r1130; 2026-02-21T09:10:47.7069191Z cvt.u64.u32 %rd334, %r1131; 2026-02-21T09:10:47.7069246Z shl.b64 %rd335, %rd334, 32; 2026-02-21T09:10:47.7069307Z or.b64 %rd336, %rd333, %rd335; 2026-02-21T09:10:47.7069356Z $L__tmp154: 2026-02-21T09:10:47.7069516Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7069578Z mov.b64 {%r1248, %r1249}, %rd336; 2026-02-21T09:10:47.7069642Z cvt.rn.bf16x2.f32 %r1250, %r1249, %r1248; 2026-02-21T09:10:47.7069716Z $L__tmp155: 2026-02-21T09:10:47.7069917Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7069979Z cvt.u64.u32 %rd337, %r1132; 2026-02-21T09:10:47.7070034Z cvt.u64.u32 %rd338, %r1133; 2026-02-21T09:10:47.7070089Z shl.b64 %rd339, %rd338, 32; 2026-02-21T09:10:47.7070153Z or.b64 %rd340, %rd337, %rd339; 2026-02-21T09:10:47.7070223Z $L__tmp156: 2026-02-21T09:10:47.7070382Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7070440Z mov.b64 {%r1251, %r1252}, %rd340; 2026-02-21T09:10:47.7070508Z cvt.rn.bf16x2.f32 %r1253, %r1252, %r1251; 2026-02-21T09:10:47.7070556Z $L__tmp157: 2026-02-21T09:10:47.7070755Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7070817Z cvt.u64.u32 %rd341, %r1135; 2026-02-21T09:10:47.7070874Z cvt.u64.u32 %rd342, 
%r1136; 2026-02-21T09:10:47.7070928Z shl.b64 %rd343, %rd342, 32; 2026-02-21T09:10:47.7071006Z or.b64 %rd344, %rd341, %rd343; 2026-02-21T09:10:47.7071057Z $L__tmp158: 2026-02-21T09:10:47.7071216Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7071275Z mov.b64 {%r1254, %r1255}, %rd344; 2026-02-21T09:10:47.7071368Z cvt.rn.bf16x2.f32 %r1256, %r1255, %r1254; 2026-02-21T09:10:47.7071420Z $L__tmp159: 2026-02-21T09:10:47.7071644Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7071708Z cvt.u64.u32 %rd345, %r1137; 2026-02-21T09:10:47.7071762Z cvt.u64.u32 %rd346, %r1138; 2026-02-21T09:10:47.7071818Z shl.b64 %rd347, %rd346, 32; 2026-02-21T09:10:47.7071881Z or.b64 %rd348, %rd345, %rd347; 2026-02-21T09:10:47.7071931Z $L__tmp160: 2026-02-21T09:10:47.7072093Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7072152Z mov.b64 {%r1257, %r1258}, %rd348; 2026-02-21T09:10:47.7072221Z cvt.rn.bf16x2.f32 %r1259, %r1258, %r1257; 2026-02-21T09:10:47.7072269Z $L__tmp161: 2026-02-21T09:10:47.7072475Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7072540Z cvt.u64.u32 %rd349, %r1139; 2026-02-21T09:10:47.7072597Z cvt.u64.u32 %rd350, %r1140; 2026-02-21T09:10:47.7072650Z shl.b64 %rd351, %rd350, 32; 2026-02-21T09:10:47.7072711Z or.b64 %rd352, %rd349, %rd351; 2026-02-21T09:10:47.7072759Z $L__tmp162: 2026-02-21T09:10:47.7072918Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7072974Z mov.b64 {%r1260, %r1261}, %rd352; 2026-02-21T09:10:47.7073041Z cvt.rn.bf16x2.f32 %r1262, %r1261, %r1260; 2026-02-21T09:10:47.7073090Z $L__tmp163: 2026-02-21T09:10:47.7073297Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7073359Z cvt.u64.u32 
%rd353, %r1141; 2026-02-21T09:10:47.7073413Z cvt.u64.u32 %rd354, %r1142; 2026-02-21T09:10:47.7073467Z shl.b64 %rd355, %rd354, 32; 2026-02-21T09:10:47.7073525Z or.b64 %rd356, %rd353, %rd355; 2026-02-21T09:10:47.7073575Z $L__tmp164: 2026-02-21T09:10:47.7073737Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7073796Z mov.b64 {%r1263, %r1264}, %rd356; 2026-02-21T09:10:47.7073866Z cvt.rn.bf16x2.f32 %r1265, %r1264, %r1263; 2026-02-21T09:10:47.7073915Z $L__tmp165: 2026-02-21T09:10:47.7074122Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7074181Z cvt.u64.u32 %rd357, %r1143; 2026-02-21T09:10:47.7074237Z cvt.u64.u32 %rd358, %r1144; 2026-02-21T09:10:47.7074320Z shl.b64 %rd359, %rd358, 32; 2026-02-21T09:10:47.7074375Z or.b64 %rd360, %rd357, %rd359; 2026-02-21T09:10:47.7074431Z $L__tmp166: 2026-02-21T09:10:47.7074591Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7074647Z mov.b64 {%r1266, %r1267}, %rd360; 2026-02-21T09:10:47.7074714Z cvt.rn.bf16x2.f32 %r1268, %r1267, %r1266; 2026-02-21T09:10:47.7074765Z $L__tmp167: 2026-02-21T09:10:47.7074969Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7075059Z cvt.u64.u32 %rd361, %r1145; 2026-02-21T09:10:47.7075115Z cvt.u64.u32 %rd362, %r1146; 2026-02-21T09:10:47.7075169Z shl.b64 %rd363, %rd362, 32; 2026-02-21T09:10:47.7075227Z or.b64 %rd364, %rd361, %rd363; 2026-02-21T09:10:47.7075285Z $L__tmp168: 2026-02-21T09:10:47.7075445Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7075501Z mov.b64 {%r1269, %r1270}, %rd364; 2026-02-21T09:10:47.7075595Z cvt.rn.bf16x2.f32 %r1271, %r1270, %r1269; 2026-02-21T09:10:47.7075645Z $L__tmp169: 2026-02-21T09:10:47.7075847Z .loc 2 291 36 // standard.py:291:36 @[ 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7075904Z cvt.u64.u32 %rd365, %r1147; 2026-02-21T09:10:47.7075959Z cvt.u64.u32 %rd366, %r1148; 2026-02-21T09:10:47.7076038Z shl.b64 %rd367, %rd366, 32; 2026-02-21T09:10:47.7076099Z or.b64 %rd368, %rd365, %rd367; 2026-02-21T09:10:47.7076156Z $L__tmp170: 2026-02-21T09:10:47.7076313Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7076370Z mov.b64 {%r1272, %r1273}, %rd368; 2026-02-21T09:10:47.7076440Z cvt.rn.bf16x2.f32 %r1274, %r1273, %r1272; 2026-02-21T09:10:47.7076492Z $L__tmp171: 2026-02-21T09:10:47.7076694Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7076757Z cvt.u64.u32 %rd369, %r1149; 2026-02-21T09:10:47.7076812Z cvt.u64.u32 %rd370, %r1150; 2026-02-21T09:10:47.7076868Z shl.b64 %rd371, %rd370, 32; 2026-02-21T09:10:47.7076926Z or.b64 %rd372, %rd369, %rd371; 2026-02-21T09:10:47.7076979Z $L__tmp172: 2026-02-21T09:10:47.7077139Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7077197Z mov.b64 {%r1275, %r1276}, %rd372; 2026-02-21T09:10:47.7077267Z cvt.rn.bf16x2.f32 %r1277, %r1276, %r1275; 2026-02-21T09:10:47.7077424Z .loc 1 98 43 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:98:43 2026-02-21T09:10:47.7077519Z st.shared.v4.b32 [%r40], {%r1232, %r1235, %r1238, %r1241}; 2026-02-21T09:10:47.7077617Z st.shared.v4.b32 [%r41], {%r1244, %r1247, %r1250, %r1253}; 2026-02-21T09:10:47.7077706Z st.shared.v4.b32 [%r42], {%r1256, %r1259, %r1262, %r1265}; 2026-02-21T09:10:47.7077795Z st.shared.v4.b32 [%r43], {%r1268, %r1271, %r1274, %r1277}; 2026-02-21T09:10:47.7077849Z // begin inline asm 2026-02-21T09:10:47.7077928Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7077980Z // end inline asm 2026-02-21T09:10:47.7078031Z bar.sync 0; 2026-02-21T09:10:47.7078099Z elect.sync %r1278|%p229, -1; 
2026-02-21T09:10:47.7078159Z and.pred %p214, %p6, %p229; 2026-02-21T09:10:47.7078215Z // begin inline asm 2026-02-21T09:10:47.7078401Z @%p214 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd553, {%r1152, %r1153}], [%r201]; 2026-02-21T09:10:47.7078456Z // end inline asm 2026-02-21T09:10:47.7078521Z cp.async.bulk.commit_group; 2026-02-21T09:10:47.7078588Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:47.7078647Z bar.sync 0; 2026-02-21T09:10:47.7078817Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7078874Z add.s32 %r1279, %r2060, 3; 2026-02-21T09:10:47.7079044Z .loc 1 38 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:38:33 2026-02-21T09:10:47.7079126Z shr.u32 %r1280, %r1279, 5; 2026-02-21T09:10:47.7079184Z and.b32 %r1281, %r1280, 67108848; 2026-02-21T09:10:47.7079351Z .loc 1 39 39 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:39 2026-02-21T09:10:47.7079410Z sub.s32 %r1282, 128, %r1281; 2026-02-21T09:10:47.7079570Z .loc 1 39 52 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:52 2026-02-21T09:10:47.7079649Z min.s32 %r1283, %r1282, 16; 2026-02-21T09:10:47.7079812Z .loc 1 40 45 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:45 2026-02-21T09:10:47.7079868Z and.b32 %r1284, %r1279, 511; 2026-02-21T09:10:47.7080027Z .loc 1 41 51 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:41:51 2026-02-21T09:10:47.7080090Z div.s32 %r110, %r1284, %r1283; 2026-02-21T09:10:47.7080248Z .loc 1 40 64 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:64 2026-02-21T09:10:47.7080343Z mul.lo.s32 %r1285, %r110, %r1283; 2026-02-21T09:10:47.7080407Z sub.s32 %r1286, %r1284, %r1285; 2026-02-21T09:10:47.7080563Z .loc 1 40 30 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:30 2026-02-21T09:10:47.7080620Z add.s32 %r1287, %r1286, %r1281; 2026-02-21T09:10:47.7080801Z .loc 1 42 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:42:27 
2026-02-21T09:10:47.7080858Z shl.b32 %r1461, %r1287, 6; 2026-02-21T09:10:47.7081018Z .loc 1 43 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:43:27 2026-02-21T09:10:47.7081075Z shl.b32 %r1462, %r110, 7; 2026-02-21T09:10:47.7081238Z .loc 1 44 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:44:32 2026-02-21T09:10:47.7081295Z or.b32 %r1288, %r1462, %r7; 2026-02-21T09:10:47.7081350Z or.b32 %r1289, %r1462, %r8; 2026-02-21T09:10:47.7081517Z .loc 1 58 53 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:53 2026-02-21T09:10:47.7081614Z shl.b32 %r1290, %r1288, 10; 2026-02-21T09:10:47.7081673Z shl.b32 %r1291, %r1289, 10; 2026-02-21T09:10:47.7081736Z mov.pred %p234, -1; 2026-02-21T09:10:47.7081790Z mov.b32 %r2083, 0; 2026-02-21T09:10:47.7081840Z $L__tmp173: 2026-02-21T09:10:47.7082045Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7082107Z // begin inline asm 2026-02-21T09:10:47.7082404Z @%p234 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1460 + 0], {%r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083}; 2026-02-21T09:10:47.7082460Z // end inline asm 2026-02-21T09:10:47.7082519Z // begin inline asm 2026-02-21T09:10:47.7082822Z @%p234 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1460 + 16], {%r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083, %r2083}; 2026-02-21T09:10:47.7082876Z // end inline asm 2026-02-21T09:10:47.7082933Z // begin inline asm 2026-02-21T09:10:47.7083001Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.7083053Z // end inline asm 2026-02-21T09:10:47.7083105Z bar.sync 0; 2026-02-21T09:10:47.7083162Z $L__tmp174: 2026-02-21T09:10:47.7083330Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7083387Z // begin inline asm 2026-02-21T09:10:47.7083475Z @%p9 
mbarrier.init.shared::cta.b64 [%r2084], 1; 2026-02-21T09:10:47.7083527Z // end inline asm 2026-02-21T09:10:47.7083580Z bar.sync 0; 2026-02-21T09:10:47.7083638Z // begin inline asm 2026-02-21T09:10:47.7083720Z @%p9 mbarrier.init.shared::cta.b64 [%r1117], 1; 2026-02-21T09:10:47.7083773Z // end inline asm 2026-02-21T09:10:47.7083826Z // begin inline asm 2026-02-21T09:10:47.7083934Z @%p9 mbarrier.init.shared::cta.b64 [%r1331], 1; 2026-02-21T09:10:47.7083989Z // end inline asm 2026-02-21T09:10:47.7084041Z bar.sync 0; 2026-02-21T09:10:47.7084097Z // begin inline asm 2026-02-21T09:10:47.7084172Z @%p9 mbarrier.init.shared::cta.b64 [%r1115], 1; 2026-02-21T09:10:47.7084226Z // end inline asm 2026-02-21T09:10:47.7084386Z .loc 1 58 60 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:60 2026-02-21T09:10:47.7084450Z or.b32 %r1292, %r1290, %r9; 2026-02-21T09:10:47.7084532Z or.b32 %r1293, %r1291, %r9; 2026-02-21T09:10:47.7084692Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7084762Z mad.wide.s32 %rd303, %r1292, 2, %rd41; 2026-02-21T09:10:47.7084826Z mad.wide.s32 %rd304, %r1293, 2, %rd41; 2026-02-21T09:10:47.7084881Z mov.b32 %r1328, 16; 2026-02-21T09:10:47.7085041Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7085095Z // begin inline asm 2026-02-21T09:10:47.7085241Z cp.async.cg.shared.global [ %r1327 + 0 ], [ %rd303 + 0 ], 0x10, %r1328; 2026-02-21T09:10:47.7085296Z // end inline asm 2026-02-21T09:10:47.7085354Z // begin inline asm 2026-02-21T09:10:47.7085472Z cp.async.cg.shared.global [ %r1329 + 0 ], [ %rd304 + 0 ], 0x10, %r1328; 2026-02-21T09:10:47.7085525Z // end inline asm 2026-02-21T09:10:47.7085592Z cp.async.commit_group; 2026-02-21T09:10:47.7085780Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7085836Z bar.sync 0; 2026-02-21T09:10:47.7085896Z // begin inline asm 2026-02-21T09:10:47.7086001Z @%p9 
mbarrier.arrive.expect_tx.shared.b64 _, [%r1331], 1024; 2026-02-21T09:10:47.7086053Z // end inline asm 2026-02-21T09:10:47.7086210Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7086267Z bar.sync 0; 2026-02-21T09:10:47.7086330Z elect.sync %r1294|%p230, -1; 2026-02-21T09:10:47.7086391Z and.pred %p222, %p6, %p230; 2026-02-21T09:10:47.7086452Z // begin inline asm 2026-02-21T09:10:47.7086713Z @%p222 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r240], [%rd463, {%r1461, %r2083}], [%r1331]; 2026-02-21T09:10:47.7086766Z // end inline asm 2026-02-21T09:10:47.7086938Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7086998Z cvt.s64.s32 %rd373, %r1290; 2026-02-21T09:10:47.7087060Z or.b64 %rd374, %rd373, %rd8; 2026-02-21T09:10:47.7087120Z shl.b64 %rd375, %rd374, 1; 2026-02-21T09:10:47.7087187Z add.s64 %rd30, %rd41, %rd375; 2026-02-21T09:10:47.7087247Z add.s64 %rd306, %rd30, 64; 2026-02-21T09:10:47.7087307Z cvt.s64.s32 %rd376, %r1291; 2026-02-21T09:10:47.7087371Z or.b64 %rd377, %rd376, %rd8; 2026-02-21T09:10:47.7087430Z shl.b64 %rd378, %rd377, 1; 2026-02-21T09:10:47.7087492Z add.s64 %rd31, %rd41, %rd378; 2026-02-21T09:10:47.7087553Z add.s64 %rd307, %rd31, 64; 2026-02-21T09:10:47.7087726Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7087784Z // begin inline asm 2026-02-21T09:10:47.7087907Z cp.async.cg.shared.global [ %r1202 + 0 ], [ %rd306 + 0 ], 0x10, %r1328; 2026-02-21T09:10:47.7087969Z // end inline asm 2026-02-21T09:10:47.7088026Z // begin inline asm 2026-02-21T09:10:47.7088146Z cp.async.cg.shared.global [ %r1204 + 0 ], [ %rd307 + 0 ], 0x10, %r1328; 2026-02-21T09:10:47.7088206Z // end inline asm 2026-02-21T09:10:47.7088267Z cp.async.commit_group; 2026-02-21T09:10:47.7088442Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7088497Z bar.sync 
0; 2026-02-21T09:10:47.7088556Z // begin inline asm 2026-02-21T09:10:47.7088667Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1115], 1024; 2026-02-21T09:10:47.7088720Z // end inline asm 2026-02-21T09:10:47.7088922Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7088978Z bar.sync 0; 2026-02-21T09:10:47.7089042Z elect.sync %r1295|%p231, -1; 2026-02-21T09:10:47.7089111Z and.pred %p224, %p6, %p231; 2026-02-21T09:10:47.7089167Z // begin inline asm 2026-02-21T09:10:47.7089420Z @%p224 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r322], [%rd463, {%r1461, %r1328}], [%r1115]; 2026-02-21T09:10:47.7089496Z // end inline asm 2026-02-21T09:10:47.7089669Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7089733Z cp.async.wait_group 1; 2026-02-21T09:10:47.7089787Z bar.sync 0; 2026-02-21T09:10:47.7089891Z ld.shared.v4.b32 {%r1296, %r1297, %r1298, %r1299}, [%r23]; 2026-02-21T09:10:47.7089955Z mov.b32 {%rs337, %rs338}, %r1299; 2026-02-21T09:10:47.7090018Z mov.b32 {%rs339, %rs340}, %r1298; 2026-02-21T09:10:47.7090082Z mov.b32 {%rs341, %rs342}, %r1297; 2026-02-21T09:10:47.7090144Z mov.b32 {%rs343, %rs344}, %r1296; 2026-02-21T09:10:47.7090271Z ld.shared.v4.b32 {%r1300, %r1301, %r1302, %r1303}, [%r23+16]; 2026-02-21T09:10:47.7090332Z mov.b32 {%rs345, %rs346}, %r1303; 2026-02-21T09:10:47.7090397Z mov.b32 {%rs347, %rs348}, %r1302; 2026-02-21T09:10:47.7090457Z mov.b32 {%rs349, %rs350}, %r1301; 2026-02-21T09:10:47.7090516Z mov.b32 {%rs351, %rs352}, %r1300; 2026-02-21T09:10:47.7090711Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 2026-02-21T09:10:47.7090776Z cvt.f32.bf16 %r1212, %rs343; 2026-02-21T09:10:47.7090837Z cvt.f32.bf16 %r1213, %rs344; 2026-02-21T09:10:47.7090902Z cvt.f32.bf16 %r1214, %rs341; 2026-02-21T09:10:47.7090960Z cvt.f32.bf16 %r1215, %rs342; 2026-02-21T09:10:47.7091019Z cvt.f32.bf16 %r1216, %rs339; 
2026-02-21T09:10:47.7091076Z cvt.f32.bf16 %r1217, %rs340; 2026-02-21T09:10:47.7091138Z cvt.f32.bf16 %r1218, %rs337; 2026-02-21T09:10:47.7091195Z cvt.f32.bf16 %r1219, %rs338; 2026-02-21T09:10:47.7091252Z cvt.f32.bf16 %r1220, %rs351; 2026-02-21T09:10:47.7091314Z cvt.f32.bf16 %r1221, %rs352; 2026-02-21T09:10:47.7091372Z cvt.f32.bf16 %r1222, %rs349; 2026-02-21T09:10:47.7091428Z cvt.f32.bf16 %r1223, %rs350; 2026-02-21T09:10:47.7091484Z cvt.f32.bf16 %r1224, %rs347; 2026-02-21T09:10:47.7091582Z cvt.f32.bf16 %r1225, %rs348; 2026-02-21T09:10:47.7091640Z cvt.f32.bf16 %r1226, %rs345; 2026-02-21T09:10:47.7091699Z cvt.f32.bf16 %r1227, %rs346; 2026-02-21T09:10:47.7091756Z $L__tmp175: 2026-02-21T09:10:47.7091985Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7092041Z // begin inline asm 2026-02-21T09:10:47.7092357Z @%p234 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1211 + 0], {%r1212, %r1213, %r1214, %r1215, %r1216, %r1217, %r1218, %r1219, %r1220, %r1221, %r1222, %r1223, %r1224, %r1225, %r1226, %r1227}; 2026-02-21T09:10:47.7092411Z // end inline asm 2026-02-21T09:10:47.7092469Z // begin inline asm 2026-02-21T09:10:47.7092537Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.7092600Z // end inline asm 2026-02-21T09:10:47.7092653Z bar.sync 0; 2026-02-21T09:10:47.7092706Z $L__tmp176: 2026-02-21T09:10:47.7092888Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7092946Z // begin inline asm 2026-02-21T09:10:47.7092996Z 2026-02-21T09:10:47.7093048Z { 2026-02-21T09:10:47.7093115Z .reg .pred complete; 2026-02-21T09:10:47.7093170Z waitLoop: 2026-02-21T09:10:47.7093294Z mbarrier.try_wait.parity.shared.b64 complete, [%r1331], %r2083; 2026-02-21T09:10:47.7093365Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7093416Z } 2026-02-21T09:10:47.7093420Z 2026-02-21T09:10:47.7093474Z // end inline asm 2026-02-21T09:10:47.7093652Z .loc 1 64 33 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7093715Z ld.shared.b8 %rs353, [%r25]; 2026-02-21T09:10:47.7093810Z ld.shared.b8 %rs354, [%r25+512]; 2026-02-21T09:10:47.7093876Z ld.shared.b8 %rs355, [%r27+128]; 2026-02-21T09:10:47.7093943Z ld.shared.b8 %rs356, [%r27+640]; 2026-02-21T09:10:47.7094004Z ld.shared.b8 %rs357, [%r29+256]; 2026-02-21T09:10:47.7094067Z ld.shared.b8 %rs358, [%r29+768]; 2026-02-21T09:10:47.7094133Z ld.shared.b8 %rs359, [%r31+384]; 2026-02-21T09:10:47.7094197Z ld.shared.b8 %rs360, [%r31+896]; 2026-02-21T09:10:47.7094371Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.7094471Z shl.b16 %rs361, %rs353, 4; 2026-02-21T09:10:47.7094528Z shl.b16 %rs362, %rs355, 4; 2026-02-21T09:10:47.7094585Z shl.b16 %rs363, %rs357, 4; 2026-02-21T09:10:47.7094642Z shl.b16 %rs364, %rs359, 4; 2026-02-21T09:10:47.7094703Z shl.b16 %rs365, %rs354, 4; 2026-02-21T09:10:47.7094757Z shl.b16 %rs366, %rs356, 4; 2026-02-21T09:10:47.7094814Z shl.b16 %rs367, %rs358, 4; 2026-02-21T09:10:47.7094874Z shl.b16 %rs368, %rs360, 4; 2026-02-21T09:10:47.7095054Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.7095122Z selp.b16 %rs369, %rs361, %rs353, %p346; 2026-02-21T09:10:47.7095178Z cvt.s16.s8 %rs370, %rs369; 2026-02-21T09:10:47.7095238Z shr.s16 %rs371, %rs370, 4; 2026-02-21T09:10:47.7095305Z selp.b16 %rs372, %rs362, %rs355, %p346; 2026-02-21T09:10:47.7095385Z cvt.s16.s8 %rs373, %rs372; 2026-02-21T09:10:47.7095451Z shr.s16 %rs374, %rs373, 4; 2026-02-21T09:10:47.7095516Z selp.b16 %rs375, %rs363, %rs357, %p346; 2026-02-21T09:10:47.7095572Z cvt.s16.s8 %rs376, %rs375; 2026-02-21T09:10:47.7095629Z shr.s16 %rs377, %rs376, 4; 2026-02-21T09:10:47.7095697Z selp.b16 %rs378, %rs364, %rs359, %p346; 2026-02-21T09:10:47.7095753Z cvt.s16.s8 %rs379, %rs378; 2026-02-21T09:10:47.7095808Z shr.s16 %rs380, %rs379, 4; 2026-02-21T09:10:47.7095875Z selp.b16 %rs381, %rs365, 
%rs354, %p346; 2026-02-21T09:10:47.7095930Z cvt.s16.s8 %rs382, %rs381; 2026-02-21T09:10:47.7095987Z shr.s16 %rs383, %rs382, 4; 2026-02-21T09:10:47.7096058Z selp.b16 %rs384, %rs366, %rs356, %p346; 2026-02-21T09:10:47.7096114Z cvt.s16.s8 %rs385, %rs384; 2026-02-21T09:10:47.7096168Z shr.s16 %rs386, %rs385, 4; 2026-02-21T09:10:47.7096230Z selp.b16 %rs387, %rs367, %rs358, %p346; 2026-02-21T09:10:47.7096290Z cvt.s16.s8 %rs388, %rs387; 2026-02-21T09:10:47.7096346Z shr.s16 %rs389, %rs388, 4; 2026-02-21T09:10:47.7096409Z selp.b16 %rs390, %rs368, %rs360, %p346; 2026-02-21T09:10:47.7096472Z cvt.s16.s8 %rs391, %rs390; 2026-02-21T09:10:47.7096528Z shr.s16 %rs392, %rs391, 4; 2026-02-21T09:10:47.7096688Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.7096750Z cvt.rn.f32.s16 %r1304, %rs371; 2026-02-21T09:10:47.7096814Z cvt.rn.f32.s16 %r1305, %rs374; 2026-02-21T09:10:47.7096873Z cvt.rn.f32.s16 %r1306, %rs377; 2026-02-21T09:10:47.7096931Z cvt.rn.f32.s16 %r1307, %rs380; 2026-02-21T09:10:47.7096995Z cvt.rn.f32.s16 %r1308, %rs383; 2026-02-21T09:10:47.7097051Z cvt.rn.f32.s16 %r1309, %rs386; 2026-02-21T09:10:47.7097106Z cvt.rn.f32.s16 %r1310, %rs389; 2026-02-21T09:10:47.7097167Z cvt.rn.f32.s16 %r1311, %rs392; 2026-02-21T09:10:47.7097227Z st.shared.b32 [%r32], %r1304; 2026-02-21T09:10:47.7097287Z st.shared.b32 [%r33], %r1305; 2026-02-21T09:10:47.7097346Z st.shared.b32 [%r34], %r1306; 2026-02-21T09:10:47.7097410Z st.shared.b32 [%r35], %r1307; 2026-02-21T09:10:47.7097467Z st.shared.b32 [%r36], %r1308; 2026-02-21T09:10:47.7097525Z st.shared.b32 [%r37], %r1309; 2026-02-21T09:10:47.7097586Z st.shared.b32 [%r38], %r1310; 2026-02-21T09:10:47.7097645Z st.shared.b32 [%r39], %r1311; 2026-02-21T09:10:47.7097696Z $L__tmp177: 2026-02-21T09:10:47.7097906Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7097964Z // begin inline asm 2026-02-21T09:10:47.7098032Z 
fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7098108Z // end inline asm 2026-02-21T09:10:47.7098166Z bar.sync 0; 2026-02-21T09:10:47.7098225Z @%p60 bra $L__BB0_22; 2026-02-21T09:10:47.7098324Z // %bb.21: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.7098385Z elect.sync %r1324|%p233, -1; 2026-02-21T09:10:47.7098444Z mov.b32 %r1314, 135268624; 2026-02-21T09:10:47.7098500Z mov.pred %p232, 0; 2026-02-21T09:10:47.7098554Z // begin inline asm 2026-02-21T09:10:47.7098772Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd379, %r1314, %p232; 2026-02-21T09:10:47.7098825Z // end inline asm 2026-02-21T09:10:47.7098880Z // begin inline asm 2026-02-21T09:10:47.7099035Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 8 ], %rd380, %r1314, %p234; 2026-02-21T09:10:47.7099087Z // end inline asm 2026-02-21T09:10:47.7099140Z // begin inline asm 2026-02-21T09:10:47.7099292Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd381, %r1314, %p234; 2026-02-21T09:10:47.7099346Z // end inline asm 2026-02-21T09:10:47.7099400Z // begin inline asm 2026-02-21T09:10:47.7099566Z @%p233 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd382, %r1314, %p234; 2026-02-21T09:10:47.7099627Z // end inline asm 2026-02-21T09:10:47.7099686Z add.s32 %r1326, %r201, 26624; 2026-02-21T09:10:47.7099747Z cvt.u64.u32 %rd383, %r1326; 2026-02-21T09:10:47.7099830Z // begin inline asm 2026-02-21T09:10:47.7099956Z @%p233 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd383]; 2026-02-21T09:10:47.7100009Z // end inline asm 2026-02-21T09:10:47.7100064Z $L__tmp178: 2026-02-21T09:10:47.7100167Z $L__BB0_22: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:10:47.7100330Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7100388Z add.s64 %rd384, %rd30, 128; 2026-02-21T09:10:47.7100451Z add.s64 %rd385, %rd31, 128; 2026-02-21T09:10:47.7100614Z .loc 1 58 80 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7100672Z // begin inline asm 2026-02-21T09:10:47.7100797Z cp.async.cg.shared.global [ %r1327 + 0 ], [ %rd384 + 0 ], 0x10, %r1328; 2026-02-21T09:10:47.7100849Z // end inline asm 2026-02-21T09:10:47.7100903Z // begin inline asm 2026-02-21T09:10:47.7101021Z cp.async.cg.shared.global [ %r1329 + 0 ], [ %rd385 + 0 ], 0x10, %r1328; 2026-02-21T09:10:47.7101076Z // end inline asm 2026-02-21T09:10:47.7101137Z cp.async.commit_group; 2026-02-21T09:10:47.7101309Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7101370Z // begin inline asm 2026-02-21T09:10:47.7101477Z @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r1331], 1024; 2026-02-21T09:10:47.7101561Z // end inline asm 2026-02-21T09:10:47.7101732Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7101792Z bar.sync 0; 2026-02-21T09:10:47.7101853Z elect.sync %r1340|%p244, -1; 2026-02-21T09:10:47.7101921Z and.pred %p242, %p6, %p244; 2026-02-21T09:10:47.7101977Z mov.b32 %r1334, 32; 2026-02-21T09:10:47.7102032Z // begin inline asm 2026-02-21T09:10:47.7102272Z @%p242 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r240], [%rd463, {%r1461, %r1334}], [%r1331]; 2026-02-21T09:10:47.7102333Z // end inline asm 2026-02-21T09:10:47.7102504Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7102563Z shl.b32 %r1341, %r110, 17; 2026-02-21T09:10:47.7102627Z or.b32 %r1342, %r44, %r1341; 2026-02-21T09:10:47.7102692Z mad.wide.s32 %rd624, %r1342, 2, %rd7; 2026-02-21T09:10:47.7102750Z or.b32 %r2082, %r45, %r1341; 2026-02-21T09:10:47.7102808Z mov.b64 %rd625, 0; 2026-02-21T09:10:47.7102866Z mov.b32 %r2085, %r2083; 2026-02-21T09:10:47.7102923Z mov.b32 %r2086, %r2083; 2026-02-21T09:10:47.7103006Z mov.b32 %r2088, %r2083; 2026-02-21T09:10:47.7103068Z bra.uni $L__BB0_23; 
2026-02-21T09:10:47.7103170Z $L__BB0_25: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:10:47.7103343Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7103413Z setp.lt.u64 %p260, %rd625, 464; 2026-02-21T09:10:47.7103465Z $L__tmp179: 2026-02-21T09:10:47.7103673Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7103761Z add.s32 %r1417, %r2087, 1; 2026-02-21T09:10:47.7103823Z setp.gt.s32 %p263, %r1417, 1; 2026-02-21T09:10:47.7103888Z selp.b32 %r2087, 0, %r1417, %p263; 2026-02-21T09:10:47.7103948Z selp.b32 %r1418, 1, 0, %p263; 2026-02-21T09:10:47.7104009Z xor.b32 %r128, %r2088, %r1418; 2026-02-21T09:10:47.7104061Z $L__tmp180: 2026-02-21T09:10:47.7104225Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7104299Z mad.wide.s32 %rd394, %r2082, 2, %rd41; 2026-02-21T09:10:47.7104480Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7104539Z add.s32 %r1408, %r122, %r12; 2026-02-21T09:10:47.7104599Z selp.b32 %r1409, 16, 0, %p260; 2026-02-21T09:10:47.7104659Z // begin inline asm 2026-02-21T09:10:47.7104801Z cp.async.cg.shared.global [ %r1408 + 0 ], [ %rd624 + 0 ], 0x10, %r1409; 2026-02-21T09:10:47.7104855Z // end inline asm 2026-02-21T09:10:47.7104915Z add.s32 %r1410, %r1408, 4096; 2026-02-21T09:10:47.7104969Z // begin inline asm 2026-02-21T09:10:47.7105083Z cp.async.cg.shared.global [ %r1410 + 0 ], [ %rd394 + 0 ], 0x10, %r1409; 2026-02-21T09:10:47.7105141Z // end inline asm 2026-02-21T09:10:47.7105201Z cp.async.commit_group; 2026-02-21T09:10:47.7105367Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7105430Z and.pred %p258, %p9, %p260; 2026-02-21T09:10:47.7105488Z // begin inline asm 2026-02-21T09:10:47.7105601Z @%p258 mbarrier.arrive.expect_tx.shared.b64 _, [%r1412], 1024; 2026-02-21T09:10:47.7105655Z 
// end inline asm 2026-02-21T09:10:47.7105818Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7105870Z bar.sync 0; 2026-02-21T09:10:47.7105929Z elect.sync %r1419|%p264, -1; 2026-02-21T09:10:47.7105991Z and.pred %p265, %p260, %p264; 2026-02-21T09:10:47.7106054Z and.pred %p259, %p6, %p265; 2026-02-21T09:10:47.7106110Z cvt.u32.u64 %r1420, %rd625; 2026-02-21T09:10:47.7106165Z add.s32 %r1415, %r1420, 48; 2026-02-21T09:10:47.7106223Z // begin inline asm 2026-02-21T09:10:47.7106469Z @%p259 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1413], [%rd463, {%r1461, %r1415}], [%r1412]; 2026-02-21T09:10:47.7106524Z // end inline asm 2026-02-21T09:10:47.7106693Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7106751Z add.s64 %rd624, %rd624, 64; 2026-02-21T09:10:47.7106807Z add.s32 %r2082, %r2082, 32; 2026-02-21T09:10:47.7106874Z setp.lt.u64 %p266, %rd625, 480; 2026-02-21T09:10:47.7106930Z add.s64 %rd625, %rd625, 16; 2026-02-21T09:10:47.7106985Z mov.b32 %r2083, %r2088; 2026-02-21T09:10:47.7107040Z mov.b32 %r2088, %r128; 2026-02-21T09:10:47.7107103Z @%p266 bra $L__BB0_23; 2026-02-21T09:10:47.7107159Z bra.uni $L__BB0_26; 2026-02-21T09:10:47.7107254Z $L__BB0_23: // Parent Loop BB0_2 Depth=1 2026-02-21T09:10:47.7107350Z // => This Inner Loop Header: Depth=2 2026-02-21T09:10:47.7107406Z add.s32 %r1364, %r2086, 1; 2026-02-21T09:10:47.7107465Z setp.gt.s32 %p248, %r1364, 1; 2026-02-21T09:10:47.7107526Z selp.b32 %r2086, 0, %r1364, %p248; 2026-02-21T09:10:47.7107690Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7107771Z cp.async.wait_group 1; 2026-02-21T09:10:47.7107824Z bar.sync 0; 2026-02-21T09:10:47.7107886Z shl.b32 %r1365, %r2086, 13; 2026-02-21T09:10:47.7107942Z add.s32 %r122, %r201, %r1365; 2026-02-21T09:10:47.7107999Z add.s32 %r1367, %r122, %r22; 2026-02-21T09:10:47.7108106Z ld.shared.v4.b32 
{%r1368, %r1369, %r1370, %r1371}, [%r1367]; 2026-02-21T09:10:47.7108167Z mov.b32 {%rs393, %rs394}, %r1371; 2026-02-21T09:10:47.7108250Z mov.b32 {%rs395, %rs396}, %r1370; 2026-02-21T09:10:47.7108307Z mov.b32 {%rs397, %rs398}, %r1369; 2026-02-21T09:10:47.7108370Z mov.b32 {%rs399, %rs400}, %r1368; 2026-02-21T09:10:47.7108471Z ld.shared.v4.b32 {%r1372, %r1373, %r1374, %r1375}, [%r1367+16]; 2026-02-21T09:10:47.7108529Z mov.b32 {%rs401, %rs402}, %r1375; 2026-02-21T09:10:47.7108587Z mov.b32 {%rs403, %rs404}, %r1374; 2026-02-21T09:10:47.7108642Z mov.b32 {%rs405, %rs406}, %r1373; 2026-02-21T09:10:47.7108698Z mov.b32 {%rs407, %rs408}, %r1372; 2026-02-21T09:10:47.7108877Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 2026-02-21T09:10:47.7108942Z cvt.f32.bf16 %r1346, %rs399; 2026-02-21T09:10:47.7109002Z cvt.f32.bf16 %r1347, %rs400; 2026-02-21T09:10:47.7109058Z cvt.f32.bf16 %r1348, %rs397; 2026-02-21T09:10:47.7109120Z cvt.f32.bf16 %r1349, %rs398; 2026-02-21T09:10:47.7109176Z cvt.f32.bf16 %r1350, %rs395; 2026-02-21T09:10:47.7109253Z cvt.f32.bf16 %r1351, %rs396; 2026-02-21T09:10:47.7109317Z cvt.f32.bf16 %r1352, %rs393; 2026-02-21T09:10:47.7109373Z cvt.f32.bf16 %r1353, %rs394; 2026-02-21T09:10:47.7109428Z cvt.f32.bf16 %r1354, %rs407; 2026-02-21T09:10:47.7109486Z cvt.f32.bf16 %r1355, %rs408; 2026-02-21T09:10:47.7109547Z cvt.f32.bf16 %r1356, %rs405; 2026-02-21T09:10:47.7109601Z cvt.f32.bf16 %r1357, %rs406; 2026-02-21T09:10:47.7109658Z cvt.f32.bf16 %r1358, %rs403; 2026-02-21T09:10:47.7109717Z cvt.f32.bf16 %r1359, %rs404; 2026-02-21T09:10:47.7109771Z cvt.f32.bf16 %r1360, %rs401; 2026-02-21T09:10:47.7109828Z cvt.f32.bf16 %r1361, %rs402; 2026-02-21T09:10:47.7109877Z $L__tmp181: 2026-02-21T09:10:47.7110089Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7110143Z // begin inline asm 2026-02-21T09:10:47.7110192Z 2026-02-21T09:10:47.7110244Z { 2026-02-21T09:10:47.7110301Z .reg .pred 
complete; 2026-02-21T09:10:47.7110354Z waitLoop: 2026-02-21T09:10:47.7110473Z mbarrier.try_wait.parity.shared.b64 complete, [%r2084], %r2083; 2026-02-21T09:10:47.7110539Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7110586Z } 2026-02-21T09:10:47.7110589Z 2026-02-21T09:10:47.7110642Z // end inline asm 2026-02-21T09:10:47.7110697Z $L__tmp182: 2026-02-21T09:10:47.7110862Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7110923Z selp.b32 %r1376, 1, 0, %p248; 2026-02-21T09:10:47.7110985Z xor.b32 %r2085, %r2085, %r1376; 2026-02-21T09:10:47.7111045Z mov.pred %p249, -1; 2026-02-21T09:10:47.7111096Z $L__tmp183: 2026-02-21T09:10:47.7111303Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7111362Z // begin inline asm 2026-02-21T09:10:47.7111697Z @%p249 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1211 + 0], {%r1346, %r1347, %r1348, %r1349, %r1350, %r1351, %r1352, %r1353, %r1354, %r1355, %r1356, %r1357, %r1358, %r1359, %r1360, %r1361}; 2026-02-21T09:10:47.7111753Z // end inline asm 2026-02-21T09:10:47.7111812Z // begin inline asm 2026-02-21T09:10:47.7111877Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.7111929Z // end inline asm 2026-02-21T09:10:47.7111985Z bar.sync 0; 2026-02-21T09:10:47.7112034Z $L__tmp184: 2026-02-21T09:10:47.7112197Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7112255Z shl.b32 %r1377, %r2086, 3; 2026-02-21T09:10:47.7112345Z add.s32 %r1378, %r201, %r1377; 2026-02-21T09:10:47.7112405Z add.s32 %r1412, %r1378, 26640; 2026-02-21T09:10:47.7112459Z // begin inline asm 2026-02-21T09:10:47.7112511Z 2026-02-21T09:10:47.7112558Z { 2026-02-21T09:10:47.7112617Z .reg .pred complete; 2026-02-21T09:10:47.7112668Z waitLoop: 2026-02-21T09:10:47.7112789Z mbarrier.try_wait.parity.shared.b64 complete, [%r1412], %r2085; 2026-02-21T09:10:47.7112853Z @!complete bra.uni waitLoop; 
2026-02-21T09:10:47.7112933Z } 2026-02-21T09:10:47.7112936Z 2026-02-21T09:10:47.7112995Z // end inline asm 2026-02-21T09:10:47.7113162Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7113218Z shl.b32 %r1379, %r2086, 10; 2026-02-21T09:10:47.7113284Z add.s32 %r1380, %r201, %r1379; 2026-02-21T09:10:47.7113340Z add.s32 %r1413, %r1380, 24576; 2026-02-21T09:10:47.7113398Z add.s32 %r1381, %r1413, %r24; 2026-02-21T09:10:47.7113461Z ld.shared.b8 %rs409, [%r1381]; 2026-02-21T09:10:47.7113534Z ld.shared.b8 %rs410, [%r1381+512]; 2026-02-21T09:10:47.7113593Z add.s32 %r1382, %r1413, %r26; 2026-02-21T09:10:47.7113679Z ld.shared.b8 %rs411, [%r1382+128]; 2026-02-21T09:10:47.7113749Z ld.shared.b8 %rs412, [%r1382+640]; 2026-02-21T09:10:47.7113805Z add.s32 %r1383, %r1413, %r28; 2026-02-21T09:10:47.7113867Z ld.shared.b8 %rs413, [%r1383+256]; 2026-02-21T09:10:47.7113927Z ld.shared.b8 %rs414, [%r1383+768]; 2026-02-21T09:10:47.7114016Z add.s32 %r1384, %r1413, %r30; 2026-02-21T09:10:47.7114076Z ld.shared.b8 %rs415, [%r1384+384]; 2026-02-21T09:10:47.7114135Z ld.shared.b8 %rs416, [%r1384+896]; 2026-02-21T09:10:47.7114303Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.7114360Z shl.b16 %rs417, %rs409, 4; 2026-02-21T09:10:47.7114416Z shl.b16 %rs418, %rs411, 4; 2026-02-21T09:10:47.7114476Z shl.b16 %rs419, %rs413, 4; 2026-02-21T09:10:47.7114531Z shl.b16 %rs420, %rs415, 4; 2026-02-21T09:10:47.7114586Z shl.b16 %rs421, %rs410, 4; 2026-02-21T09:10:47.7114640Z shl.b16 %rs422, %rs412, 4; 2026-02-21T09:10:47.7114700Z shl.b16 %rs423, %rs414, 4; 2026-02-21T09:10:47.7114757Z shl.b16 %rs424, %rs416, 4; 2026-02-21T09:10:47.7114912Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.7114986Z selp.b16 %rs425, %rs417, %rs409, %p346; 2026-02-21T09:10:47.7115044Z cvt.s16.s8 %rs426, %rs425; 2026-02-21T09:10:47.7115101Z shr.s16 %rs427, %rs426, 4; 
2026-02-21T09:10:47.7115167Z selp.b16 %rs428, %rs418, %rs411, %p346; 2026-02-21T09:10:47.7115228Z cvt.s16.s8 %rs429, %rs428; 2026-02-21T09:10:47.7115286Z shr.s16 %rs430, %rs429, 4; 2026-02-21T09:10:47.7115350Z selp.b16 %rs431, %rs419, %rs413, %p346; 2026-02-21T09:10:47.7115411Z cvt.s16.s8 %rs432, %rs431; 2026-02-21T09:10:47.7115468Z shr.s16 %rs433, %rs432, 4; 2026-02-21T09:10:47.7115530Z selp.b16 %rs434, %rs420, %rs415, %p346; 2026-02-21T09:10:47.7115591Z cvt.s16.s8 %rs435, %rs434; 2026-02-21T09:10:47.7115648Z shr.s16 %rs436, %rs435, 4; 2026-02-21T09:10:47.7115711Z selp.b16 %rs437, %rs421, %rs410, %p346; 2026-02-21T09:10:47.7115767Z cvt.s16.s8 %rs438, %rs437; 2026-02-21T09:10:47.7115826Z shr.s16 %rs439, %rs438, 4; 2026-02-21T09:10:47.7115888Z selp.b16 %rs440, %rs422, %rs412, %p346; 2026-02-21T09:10:47.7115944Z cvt.s16.s8 %rs441, %rs440; 2026-02-21T09:10:47.7116006Z shr.s16 %rs442, %rs441, 4; 2026-02-21T09:10:47.7116068Z selp.b16 %rs443, %rs423, %rs414, %p346; 2026-02-21T09:10:47.7116125Z cvt.s16.s8 %rs444, %rs443; 2026-02-21T09:10:47.7116181Z shr.s16 %rs445, %rs444, 4; 2026-02-21T09:10:47.7116247Z selp.b16 %rs446, %rs424, %rs416, %p346; 2026-02-21T09:10:47.7116304Z cvt.s16.s8 %rs447, %rs446; 2026-02-21T09:10:47.7116362Z shr.s16 %rs448, %rs447, 4; 2026-02-21T09:10:47.7116528Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.7116588Z cvt.rn.f32.s16 %r1385, %rs427; 2026-02-21T09:10:47.7116685Z cvt.rn.f32.s16 %r1386, %rs430; 2026-02-21T09:10:47.7116748Z cvt.rn.f32.s16 %r1387, %rs433; 2026-02-21T09:10:47.7116807Z cvt.rn.f32.s16 %r1388, %rs436; 2026-02-21T09:10:47.7116863Z cvt.rn.f32.s16 %r1389, %rs439; 2026-02-21T09:10:47.7116921Z cvt.rn.f32.s16 %r1390, %rs442; 2026-02-21T09:10:47.7116982Z cvt.rn.f32.s16 %r1391, %rs445; 2026-02-21T09:10:47.7117037Z cvt.rn.f32.s16 %r1392, %rs448; 2026-02-21T09:10:47.7117096Z st.shared.b32 [%r32], %r1385; 2026-02-21T09:10:47.7117179Z st.shared.b32 [%r33], %r1386; 
2026-02-21T09:10:47.7117237Z st.shared.b32 [%r34], %r1387; 2026-02-21T09:10:47.7117295Z st.shared.b32 [%r35], %r1388; 2026-02-21T09:10:47.7117351Z st.shared.b32 [%r36], %r1389; 2026-02-21T09:10:47.7117412Z st.shared.b32 [%r37], %r1390; 2026-02-21T09:10:47.7117469Z st.shared.b32 [%r38], %r1391; 2026-02-21T09:10:47.7117527Z st.shared.b32 [%r39], %r1392; 2026-02-21T09:10:47.7117696Z .loc 1 51 125 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:51:125 2026-02-21T09:10:47.7117755Z shl.b32 %r1393, %r2087, 3; 2026-02-21T09:10:47.7117832Z add.s32 %r1394, %r201, %r1393; 2026-02-21T09:10:47.7117888Z add.s32 %r2084, %r1394, 26624; 2026-02-21T09:10:47.7117943Z $L__tmp185: 2026-02-21T09:10:47.7118153Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7118208Z // begin inline asm 2026-02-21T09:10:47.7118302Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7118357Z // end inline asm 2026-02-21T09:10:47.7118410Z bar.sync 0; 2026-02-21T09:10:47.7118473Z @%p60 bra $L__BB0_25; 2026-02-21T09:10:47.7118570Z // %bb.24: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:10:47.7118632Z elect.sync %r1407|%p250, -1; 2026-02-21T09:10:47.7118687Z mov.b32 %r1397, 135268624; 2026-02-21T09:10:47.7118744Z // begin inline asm 2026-02-21T09:10:47.7118901Z @%p250 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd379, %r1397, %p249; 2026-02-21T09:10:47.7118954Z // end inline asm 2026-02-21T09:10:47.7119014Z // begin inline asm 2026-02-21T09:10:47.7119168Z @%p250 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 8 ], %rd380, %r1397, %p249; 2026-02-21T09:10:47.7119221Z // end inline asm 2026-02-21T09:10:47.7119279Z // begin inline asm 2026-02-21T09:10:47.7119427Z @%p250 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd381, %r1397, %p249; 2026-02-21T09:10:47.7119479Z // end inline asm 2026-02-21T09:10:47.7119532Z // begin inline asm 2026-02-21T09:10:47.7119681Z 
@%p250 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd382, %r1397, %p249; 2026-02-21T09:10:47.7119733Z // end inline asm 2026-02-21T09:10:47.7119790Z cvt.u64.u32 %rd392, %r2084; 2026-02-21T09:10:47.7119847Z // begin inline asm 2026-02-21T09:10:47.7119971Z @%p250 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd392]; 2026-02-21T09:10:47.7120024Z // end inline asm 2026-02-21T09:10:47.7120085Z bra.uni $L__BB0_25; 2026-02-21T09:10:47.7120136Z $L__tmp186: 2026-02-21T09:10:47.7120220Z $L__BB0_27: // %._crit_edge 2026-02-21T09:10:47.7120387Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7120448Z sub.s32 %r1568, %r4, %r2107; 2026-02-21T09:10:47.7120504Z shl.b32 %r133, %r1568, 5; 2026-02-21T09:10:47.7120554Z $L__tmp187: 2026-02-21T09:10:47.7120767Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7120840Z shfl.sync.idx.b32 %r134, %r6, 0, 31, -1; 2026-02-21T09:10:47.7120896Z shl.b32 %r1569, %r134, 21; 2026-02-21T09:10:47.7120958Z and.b32 %r135, %r1569, 6291456; 2026-02-21T09:10:47.7121013Z add.s32 %r1570, %r135, %r1917; 2026-02-21T09:10:47.7121071Z and.b32 %r136, %r134, 4; 2026-02-21T09:10:47.7121127Z shl.b32 %r1571, %r136, 3; 2026-02-21T09:10:47.7121191Z add.s32 %r1977, %r1570, %r1571; 2026-02-21T09:10:47.7121269Z mov.pred %p275, -1; 2026-02-21T09:10:47.7121322Z mov.b32 %r2110, 0; 2026-02-21T09:10:47.7121381Z // begin inline asm 2026-02-21T09:10:47.7121706Z @%p275 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1977 + 0], {%r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110}; 2026-02-21T09:10:47.7121760Z // end inline asm 2026-02-21T09:10:47.7121818Z // begin inline asm 2026-02-21T09:10:47.7122147Z @%p275 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1977 + 16], {%r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, %r2110, 
%r2110, %r2110, %r2110, %r2110, %r2110}; 2026-02-21T09:10:47.7122199Z // end inline asm 2026-02-21T09:10:47.7122252Z // begin inline asm 2026-02-21T09:10:47.7122322Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.7122373Z // end inline asm 2026-02-21T09:10:47.7122424Z bar.sync 0; 2026-02-21T09:10:47.7122481Z $L__tmp188: 2026-02-21T09:10:47.7122649Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7122729Z add.s32 %r1547, %r201, 43008; 2026-02-21T09:10:47.7122787Z // begin inline asm 2026-02-21T09:10:47.7122872Z @%p9 mbarrier.init.shared::cta.b64 [%r1547], 1; 2026-02-21T09:10:47.7122923Z // end inline asm 2026-02-21T09:10:47.7122975Z bar.sync 0; 2026-02-21T09:10:47.7123042Z add.s32 %r1548, %r201, 43016; 2026-02-21T09:10:47.7123130Z // begin inline asm 2026-02-21T09:10:47.7123214Z @%p9 mbarrier.init.shared::cta.b64 [%r1548], 1; 2026-02-21T09:10:47.7123271Z // end inline asm 2026-02-21T09:10:47.7123334Z setp.gt.s32 %p285, %r133, 0; 2026-02-21T09:10:47.7123493Z .loc 1 37 35 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:37:35 2026-02-21T09:10:47.7123552Z shr.s32 %r1573, %r2107, 31; 2026-02-21T09:10:47.7123615Z shr.u32 %r1574, %r1573, 23; 2026-02-21T09:10:47.7123674Z add.s32 %r1575, %r2107, %r1574; 2026-02-21T09:10:47.7123732Z shr.s32 %r1576, %r1575, 9; 2026-02-21T09:10:47.7123903Z .loc 1 38 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:38:33 2026-02-21T09:10:47.7123959Z shl.b32 %r1577, %r1576, 4; 2026-02-21T09:10:47.7124117Z .loc 1 39 39 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:39 2026-02-21T09:10:47.7124181Z sub.s32 %r1578, 128, %r1577; 2026-02-21T09:10:47.7124341Z .loc 1 39 52 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:52 2026-02-21T09:10:47.7124398Z min.s32 %r1579, %r1578, 16; 2026-02-21T09:10:47.7124561Z .loc 1 40 45 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:45 2026-02-21T09:10:47.7124618Z and.b32 %r1580, %r1575, -512; 
2026-02-21T09:10:47.7124678Z sub.s32 %r1581, %r2107, %r1580; 2026-02-21T09:10:47.7124837Z .loc 1 41 51 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:41:51 2026-02-21T09:10:47.7124903Z div.s32 %r1582, %r1581, %r1579; 2026-02-21T09:10:47.7125065Z .loc 1 40 64 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:64 2026-02-21T09:10:47.7125126Z mul.lo.s32 %r1583, %r1582, %r1579; 2026-02-21T09:10:47.7125192Z sub.s32 %r1584, %r1581, %r1583; 2026-02-21T09:10:47.7125349Z .loc 1 40 30 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:30 2026-02-21T09:10:47.7125405Z add.s32 %r1585, %r1584, %r1577; 2026-02-21T09:10:47.7125570Z .loc 1 42 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:42:27 2026-02-21T09:10:47.7125624Z shl.b32 %r2091, %r1585, 6; 2026-02-21T09:10:47.7125780Z .loc 1 43 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:43:27 2026-02-21T09:10:47.7125841Z shl.b32 %r2092, %r1582, 7; 2026-02-21T09:10:47.7125999Z .loc 1 44 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:44:32 2026-02-21T09:10:47.7126101Z or.b32 %r2108, %r2092, %r7; 2026-02-21T09:10:47.7126157Z or.b32 %r2109, %r2092, %r8; 2026-02-21T09:10:47.7126323Z .loc 1 58 53 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:53 2026-02-21T09:10:47.7126380Z shl.b32 %r1586, %r2108, 10; 2026-02-21T09:10:47.7126435Z shl.b32 %r1587, %r2109, 10; 2026-02-21T09:10:47.7126600Z .loc 1 58 60 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:60 2026-02-21T09:10:47.7126677Z or.b32 %r1588, %r1586, %r9; 2026-02-21T09:10:47.7126733Z or.b32 %r1589, %r1587, %r9; 2026-02-21T09:10:47.7126894Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7126960Z mad.wide.s32 %rd461, %r1588, 2, %rd41; 2026-02-21T09:10:47.7127025Z mad.wide.s32 %rd462, %r1589, 2, %rd41; 2026-02-21T09:10:47.7127180Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7127244Z 
add.s32 %r1549, %r201, %r12; 2026-02-21T09:10:47.7127304Z selp.b32 %r1550, 16, 0, %p285; 2026-02-21T09:10:47.7127379Z // begin inline asm 2026-02-21T09:10:47.7127507Z cp.async.cg.shared.global [ %r1549 + 0 ], [ %rd461 + 0 ], 0x10, %r1550; 2026-02-21T09:10:47.7127559Z // end inline asm 2026-02-21T09:10:47.7127614Z add.s32 %r1551, %r201, %r2090; 2026-02-21T09:10:47.7127673Z // begin inline asm 2026-02-21T09:10:47.7127811Z cp.async.cg.shared.global [ %r1551 + 0 ], [ %rd462 + 0 ], 0x10, %r1550; 2026-02-21T09:10:47.7127868Z // end inline asm 2026-02-21T09:10:47.7127929Z cp.async.commit_group; 2026-02-21T09:10:47.7128097Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7128147Z bar.sync 0; 2026-02-21T09:10:47.7128209Z and.pred %p279, %p9, %p285; 2026-02-21T09:10:47.7128268Z // begin inline asm 2026-02-21T09:10:47.7128379Z @%p279 mbarrier.arrive.expect_tx.shared.b64 _, [%r1547], 1024; 2026-02-21T09:10:47.7128432Z // end inline asm 2026-02-21T09:10:47.7128589Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7128646Z bar.sync 0; 2026-02-21T09:10:47.7128706Z elect.sync %r1590|%p286, -1; 2026-02-21T09:10:47.7128770Z and.pred %p287, %p285, %p286; 2026-02-21T09:10:47.7128834Z and.pred %p280, %p6, %p287; 2026-02-21T09:10:47.7128890Z add.s32 %r1554, %r201, 40960; 2026-02-21T09:10:47.7128945Z // begin inline asm 2026-02-21T09:10:47.7129198Z @%p280 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1554], [%rd463, {%r2091, %r2110}], [%r1547]; 2026-02-21T09:10:47.7129251Z // end inline asm 2026-02-21T09:10:47.7129412Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7129473Z setp.gt.s32 %p288, %r133, 1; 2026-02-21T09:10:47.7129632Z .loc 1 58 60 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:60 2026-02-21T09:10:47.7129690Z or.b32 %r1591, %r1586, %r2089; 2026-02-21T09:10:47.7129746Z or.b32 
%r1592, %r1587, %r2089; 2026-02-21T09:10:47.7129910Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7129974Z mad.wide.s32 %rd464, %r1591, 2, %rd41; 2026-02-21T09:10:47.7130036Z mad.wide.s32 %rd465, %r1592, 2, %rd41; 2026-02-21T09:10:47.7130200Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7130259Z add.s32 %r1593, %r201, 8192; 2026-02-21T09:10:47.7130316Z add.s32 %r1558, %r1593, %r12; 2026-02-21T09:10:47.7130375Z selp.b32 %r1559, 16, 0, %p288; 2026-02-21T09:10:47.7130428Z // begin inline asm 2026-02-21T09:10:47.7130541Z cp.async.cg.shared.global [ %r1558 + 0 ], [ %rd464 + 0 ], 0x10, %r1559; 2026-02-21T09:10:47.7130593Z // end inline asm 2026-02-21T09:10:47.7130656Z add.s32 %r1560, %r1593, %r2090; 2026-02-21T09:10:47.7130711Z // begin inline asm 2026-02-21T09:10:47.7130849Z cp.async.cg.shared.global [ %r1560 + 0 ], [ %rd465 + 0 ], 0x10, %r1559; 2026-02-21T09:10:47.7130908Z // end inline asm 2026-02-21T09:10:47.7130972Z cp.async.commit_group; 2026-02-21T09:10:47.7131151Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7131205Z bar.sync 0; 2026-02-21T09:10:47.7131271Z and.pred %p281, %p9, %p288; 2026-02-21T09:10:47.7131330Z // begin inline asm 2026-02-21T09:10:47.7131471Z @%p281 mbarrier.arrive.expect_tx.shared.b64 _, [%r1548], 1024; 2026-02-21T09:10:47.7131530Z // end inline asm 2026-02-21T09:10:47.7131739Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7131792Z bar.sync 0; 2026-02-21T09:10:47.7131860Z elect.sync %r1594|%p289, -1; 2026-02-21T09:10:47.7131925Z and.pred %p290, %p288, %p289; 2026-02-21T09:10:47.7131987Z and.pred %p282, %p6, %p290; 2026-02-21T09:10:47.7132046Z add.s32 %r1563, %r201, 41984; 2026-02-21T09:10:47.7132110Z mov.b32 %r2094, 16; 2026-02-21T09:10:47.7132167Z // begin inline asm 2026-02-21T09:10:47.7132450Z @%p282 
cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1563], [%rd463, {%r2091, %r2094}], [%r1548]; 2026-02-21T09:10:47.7132515Z // end inline asm 2026-02-21T09:10:47.7132692Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7132796Z add.s32 %r142, %r133, -1; 2026-02-21T09:10:47.7132866Z setp.lt.s32 %p291, %r142, 1; 2026-02-21T09:10:47.7132925Z mov.pred %p348, 0; 2026-02-21T09:10:47.7132985Z shl.b32 %r2051, %r136, 2; 2026-02-21T09:10:47.7133047Z setp.ne.b32 %p344, %r134, 0; 2026-02-21T09:10:47.7133111Z mov.b32 %r2111, %r2110; 2026-02-21T09:10:47.7133171Z mov.pred %p349, %p348; 2026-02-21T09:10:47.7133229Z @%p291 bra $L__BB0_37; 2026-02-21T09:10:47.7133318Z // %bb.28: // %.lr.ph345 2026-02-21T09:10:47.7133490Z .loc 1 0 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:0:111 2026-02-21T09:10:47.7133552Z add.s32 %r143, %r133, -2; 2026-02-21T09:10:47.7133612Z shl.b32 %r1600, %r2055, 6; 2026-02-21T09:10:47.7133677Z shr.u32 %r1602, %r2056, 2; 2026-02-21T09:10:47.7133857Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7133920Z add.s32 %r1604, %r201, %r1600; 2026-02-21T09:10:47.7133995Z add.s32 %r144, %r1604, %r1602; 2026-02-21T09:10:47.7134055Z shr.u32 %r1606, %r2056, 1; 2026-02-21T09:10:47.7134118Z or.b32 %r145, %r1606, %r2057; 2026-02-21T09:10:47.7134182Z shl.b32 %r1607, %r2057, 7; 2026-02-21T09:10:47.7134243Z and.b32 %r1610, %r2059, 12; 2026-02-21T09:10:47.7134303Z or.b32 %r1611, %r1607, %r1610; 2026-02-21T09:10:47.7134361Z or.b32 %r1612, %r1611, %r2058; 2026-02-21T09:10:47.7134427Z add.s32 %r1613, %r201, 32768; 2026-02-21T09:10:47.7134485Z add.s32 %r146, %r1613, %r1612; 2026-02-21T09:10:47.7134544Z xor.b32 %r1614, %r1612, 16; 2026-02-21T09:10:47.7134607Z add.s32 %r147, %r1613, %r1614; 2026-02-21T09:10:47.7134666Z xor.b32 %r1615, %r1612, 32; 2026-02-21T09:10:47.7134724Z add.s32 %r148, %r1613, %r1615; 2026-02-21T09:10:47.7134781Z 
xor.b32 %r1616, %r1612, 48; 2026-02-21T09:10:47.7134843Z add.s32 %r149, %r1613, %r1616; 2026-02-21T09:10:47.7134901Z xor.b32 %r1617, %r1612, 64; 2026-02-21T09:10:47.7134958Z add.s32 %r150, %r1613, %r1617; 2026-02-21T09:10:47.7135024Z xor.b32 %r1618, %r1612, 80; 2026-02-21T09:10:47.7135080Z add.s32 %r151, %r1613, %r1618; 2026-02-21T09:10:47.7135136Z xor.b32 %r1619, %r1612, 96; 2026-02-21T09:10:47.7135193Z add.s32 %r152, %r1613, %r1619; 2026-02-21T09:10:47.7135257Z xor.b32 %r1620, %r1612, 112; 2026-02-21T09:10:47.7135314Z add.s32 %r153, %r1613, %r1620; 2026-02-21T09:10:47.7135372Z add.s32 %r1621, %r135, %r1918; 2026-02-21T09:10:47.7135437Z add.s32 %r1653, %r1621, %r2051; 2026-02-21T09:10:47.7135495Z bfe.u32 %r1623, %r1613, 4, 14; 2026-02-21T09:10:47.7135580Z cvt.u64.u32 %rd467, %r1623; 2026-02-21T09:10:47.7135660Z or.b64 %rd471, %rd467, 4611686293338849280; 2026-02-21T09:10:47.7135721Z add.s32 %r1624, %r201, 32800; 2026-02-21T09:10:47.7135779Z bfe.u32 %r1625, %r1624, 4, 14; 2026-02-21T09:10:47.7135839Z cvt.u64.u32 %rd468, %r1625; 2026-02-21T09:10:47.7135920Z or.b64 %rd472, %rd468, 4611686293338849280; 2026-02-21T09:10:47.7135978Z add.s32 %r1626, %r201, 32832; 2026-02-21T09:10:47.7136039Z bfe.u32 %r1627, %r1626, 4, 14; 2026-02-21T09:10:47.7136135Z cvt.u64.u32 %rd469, %r1627; 2026-02-21T09:10:47.7136206Z or.b64 %rd473, %rd469, 4611686293338849280; 2026-02-21T09:10:47.7136264Z add.s32 %r1628, %r201, 32864; 2026-02-21T09:10:47.7136322Z bfe.u32 %r1629, %r1628, 4, 14; 2026-02-21T09:10:47.7136386Z cvt.u64.u32 %rd470, %r1629; 2026-02-21T09:10:47.7136453Z or.b64 %rd474, %rd470, 4611686293338849280; 2026-02-21T09:10:47.7136510Z shl.b32 %r1630, %r2055, 7; 2026-02-21T09:10:47.7136576Z xor.b32 %r1631, %r2058, %r1606; 2026-02-21T09:10:47.7136635Z or.b32 %r1632, %r1631, %r1630; 2026-02-21T09:10:47.7136692Z add.s32 %r1633, %r201, 16384; 2026-02-21T09:10:47.7136769Z add.s32 %r156, %r1633, %r1632; 2026-02-21T09:10:47.7136836Z xor.b32 %r1634, %r1632, 16; 2026-02-21T09:10:47.7136895Z 
add.s32 %r157, %r1633, %r1634; 2026-02-21T09:10:47.7136952Z xor.b32 %r1635, %r1632, 32; 2026-02-21T09:10:47.7137016Z add.s32 %r158, %r1633, %r1635; 2026-02-21T09:10:47.7137096Z xor.b32 %r1636, %r1632, 48; 2026-02-21T09:10:47.7137157Z add.s32 %r159, %r1633, %r1636; 2026-02-21T09:10:47.7137221Z mov.pred %p348, 0; 2026-02-21T09:10:47.7137277Z mov.b32 %r2097, 1; 2026-02-21T09:10:47.7137333Z mov.b32 %r2096, -1; 2026-02-21T09:10:47.7137387Z mov.b32 %r2093, 0; 2026-02-21T09:10:47.7137451Z mov.b32 %r2111, %r2093; 2026-02-21T09:10:47.7137508Z mov.b32 %r2098, %r2092; 2026-02-21T09:10:47.7137566Z mov.b32 %r2099, %r2091; 2026-02-21T09:10:47.7137628Z mov.b32 %r2101, %r2097; 2026-02-21T09:10:47.7137684Z mov.b32 %r2102, %r2093; 2026-02-21T09:10:47.7137741Z bra.uni $L__BB0_29; 2026-02-21T09:10:47.7137848Z $L__BB0_35: // in Loop: Header=BB0_29 Depth=1 2026-02-21T09:10:47.7138067Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7138136Z add.s32 %r2102, %r2102, 1; 2026-02-21T09:10:47.7138200Z setp.ne.b32 %p320, %r142, %r2102; 2026-02-21T09:10:47.7138262Z mov.b32 %r2091, %r170; 2026-02-21T09:10:47.7138318Z mov.b32 %r2092, %r169; 2026-02-21T09:10:47.7138374Z mov.b32 %r2093, %r2101; 2026-02-21T09:10:47.7138428Z mov.b32 %r2101, %r176; 2026-02-21T09:10:47.7138490Z @%p320 bra $L__BB0_29; 2026-02-21T09:10:47.7138545Z bra.uni $L__BB0_36; 2026-02-21T09:10:47.7138645Z $L__BB0_29: // =>This Inner Loop Header: Depth=1 2026-02-21T09:10:47.7138815Z .loc 1 0 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:0:111 2026-02-21T09:10:47.7138871Z mov.b32 %r170, %r2099; 2026-02-21T09:10:47.7138927Z mov.b32 %r169, %r2098; 2026-02-21T09:10:47.7139099Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7139155Z add.s32 %r1637, %r2101, 1; 2026-02-21T09:10:47.7139216Z setp.eq.b32 %p349, %r2101, 31; 2026-02-21T09:10:47.7139278Z selp.b32 %r176, 0, %r1637, %p349; 2026-02-21T09:10:47.7139343Z 
setp.ne.b32 %p293, %r176, 0; 2026-02-21T09:10:47.7139399Z @%p293 bra $L__BB0_31; 2026-02-21T09:10:47.7139494Z // %bb.30: // in Loop: Header=BB0_29 Depth=1 2026-02-21T09:10:47.7139556Z add.s32 %r2107, %r2107, 1; 2026-02-21T09:10:47.7139718Z .loc 1 37 35 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:37:35 2026-02-21T09:10:47.7139774Z shr.s32 %r1638, %r2107, 31; 2026-02-21T09:10:47.7139835Z shr.u32 %r1639, %r1638, 23; 2026-02-21T09:10:47.7139895Z add.s32 %r1640, %r2107, %r1639; 2026-02-21T09:10:47.7139952Z shr.s32 %r1641, %r1640, 9; 2026-02-21T09:10:47.7140132Z .loc 1 38 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:38:33 2026-02-21T09:10:47.7140197Z shl.b32 %r1642, %r1641, 4; 2026-02-21T09:10:47.7140360Z .loc 1 39 39 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:39 2026-02-21T09:10:47.7140419Z sub.s32 %r1643, 128, %r1642; 2026-02-21T09:10:47.7140584Z .loc 1 39 52 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:39:52 2026-02-21T09:10:47.7140667Z min.s32 %r1644, %r1643, 16; 2026-02-21T09:10:47.7140825Z .loc 1 40 45 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:45 2026-02-21T09:10:47.7140889Z and.b32 %r1645, %r1640, -512; 2026-02-21T09:10:47.7140947Z sub.s32 %r1646, %r2107, %r1645; 2026-02-21T09:10:47.7141109Z .loc 1 41 51 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:41:51 2026-02-21T09:10:47.7141169Z div.s32 %r1647, %r1646, %r1644; 2026-02-21T09:10:47.7141336Z .loc 1 40 64 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:64 2026-02-21T09:10:47.7141423Z mul.lo.s32 %r1648, %r1647, %r1644; 2026-02-21T09:10:47.7141481Z sub.s32 %r1649, %r1646, %r1648; 2026-02-21T09:10:47.7141680Z .loc 1 40 30 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:40:30 2026-02-21T09:10:47.7141738Z add.s32 %r1650, %r1649, %r1642; 2026-02-21T09:10:47.7141925Z .loc 1 42 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:42:27 2026-02-21T09:10:47.7141987Z shl.b32 %r2099, %r1650, 6; 
2026-02-21T09:10:47.7142144Z .loc 1 43 27 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:43:27 2026-02-21T09:10:47.7142199Z shl.b32 %r2098, %r1647, 7; 2026-02-21T09:10:47.7142365Z .loc 1 44 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:44:32 2026-02-21T09:10:47.7142420Z or.b32 %r2108, %r2098, %r7; 2026-02-21T09:10:47.7142474Z or.b32 %r2109, %r2098, %r8; 2026-02-21T09:10:47.7142568Z $L__BB0_31: // in Loop: Header=BB0_29 Depth=1 2026-02-21T09:10:47.7142738Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7142791Z add.s32 %r1671, %r2096, 1; 2026-02-21T09:10:47.7142853Z setp.gt.s32 %p298, %r1671, 1; 2026-02-21T09:10:47.7142920Z selp.b32 %r2096, 0, %r1671, %p298; 2026-02-21T09:10:47.7142979Z selp.b32 %r1672, 1, 0, %p298; 2026-02-21T09:10:47.7143034Z xor.b32 %r2111, %r2111, %r1672; 2026-02-21T09:10:47.7143197Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7143259Z cp.async.wait_group 1; 2026-02-21T09:10:47.7143310Z bar.sync 0; 2026-02-21T09:10:47.7143364Z shl.b32 %r1673, %r2096, 13; 2026-02-21T09:10:47.7143422Z add.s32 %r1674, %r144, %r1673; 2026-02-21T09:10:47.7143522Z ld.shared.v4.b32 {%r1675, %r1676, %r1677, %r1678}, [%r1674]; 2026-02-21T09:10:47.7143584Z mov.b32 {%rs449, %rs450}, %r1678; 2026-02-21T09:10:47.7143647Z mov.b32 {%rs451, %rs452}, %r1677; 2026-02-21T09:10:47.7143703Z mov.b32 {%rs453, %rs454}, %r1676; 2026-02-21T09:10:47.7143759Z mov.b32 {%rs455, %rs456}, %r1675; 2026-02-21T09:10:47.7143866Z ld.shared.v4.b32 {%r1679, %r1680, %r1681, %r1682}, [%r1674+16]; 2026-02-21T09:10:47.7143923Z mov.b32 {%rs457, %rs458}, %r1682; 2026-02-21T09:10:47.7143983Z mov.b32 {%rs459, %rs460}, %r1681; 2026-02-21T09:10:47.7144042Z mov.b32 {%rs461, %rs462}, %r1680; 2026-02-21T09:10:47.7144105Z mov.b32 {%rs463, %rs464}, %r1679; 2026-02-21T09:10:47.7144265Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 
2026-02-21T09:10:47.7144325Z cvt.f32.bf16 %r1654, %rs455; 2026-02-21T09:10:47.7144392Z cvt.f32.bf16 %r1655, %rs456; 2026-02-21T09:10:47.7144451Z cvt.f32.bf16 %r1656, %rs453; 2026-02-21T09:10:47.7144507Z cvt.f32.bf16 %r1657, %rs454; 2026-02-21T09:10:47.7144590Z cvt.f32.bf16 %r1658, %rs451; 2026-02-21T09:10:47.7144651Z cvt.f32.bf16 %r1659, %rs452; 2026-02-21T09:10:47.7144708Z cvt.f32.bf16 %r1660, %rs449; 2026-02-21T09:10:47.7144766Z cvt.f32.bf16 %r1661, %rs450; 2026-02-21T09:10:47.7144827Z cvt.f32.bf16 %r1662, %rs463; 2026-02-21T09:10:47.7144880Z cvt.f32.bf16 %r1663, %rs464; 2026-02-21T09:10:47.7144933Z cvt.f32.bf16 %r1664, %rs461; 2026-02-21T09:10:47.7144992Z cvt.f32.bf16 %r1665, %rs462; 2026-02-21T09:10:47.7145047Z cvt.f32.bf16 %r1666, %rs459; 2026-02-21T09:10:47.7145128Z cvt.f32.bf16 %r1667, %rs460; 2026-02-21T09:10:47.7145184Z cvt.f32.bf16 %r1668, %rs457; 2026-02-21T09:10:47.7145245Z cvt.f32.bf16 %r1669, %rs458; 2026-02-21T09:10:47.7145411Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7145467Z shl.b32 %r1683, %r2096, 3; 2026-02-21T09:10:47.7145528Z add.s32 %r1685, %r201, %r1683; 2026-02-21T09:10:47.7145583Z add.s32 %r1651, %r1685, 43008; 2026-02-21T09:10:47.7145639Z // begin inline asm 2026-02-21T09:10:47.7145686Z 2026-02-21T09:10:47.7145737Z { 2026-02-21T09:10:47.7145824Z .reg .pred complete; 2026-02-21T09:10:47.7145877Z waitLoop: 2026-02-21T09:10:47.7145999Z mbarrier.try_wait.parity.shared.b64 complete, [%r1651], %r2111; 2026-02-21T09:10:47.7146061Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7146107Z } 2026-02-21T09:10:47.7146111Z 2026-02-21T09:10:47.7146163Z // end inline asm 2026-02-21T09:10:47.7146343Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7146399Z shl.b32 %r1686, %r2096, 10; 2026-02-21T09:10:47.7146454Z add.s32 %r1687, %r201, %r1686; 2026-02-21T09:10:47.7146513Z add.s32 %r1688, %r1687, 40960; 2026-02-21T09:10:47.7146567Z add.s32 %r1689, 
%r1688, %r145; 2026-02-21T09:10:47.7146624Z ld.shared.b8 %rs465, [%r1689]; 2026-02-21T09:10:47.7146689Z ld.shared.b8 %rs466, [%r1689+512]; 2026-02-21T09:10:47.7146745Z xor.b32 %r1690, %r145, 16; 2026-02-21T09:10:47.7146805Z add.s32 %r1691, %r1688, %r1690; 2026-02-21T09:10:47.7146867Z ld.shared.b8 %rs467, [%r1691+128]; 2026-02-21T09:10:47.7146933Z ld.shared.b8 %rs468, [%r1691+640]; 2026-02-21T09:10:47.7146989Z xor.b32 %r1692, %r145, 32; 2026-02-21T09:10:47.7147044Z add.s32 %r1693, %r1688, %r1692; 2026-02-21T09:10:47.7147106Z ld.shared.b8 %rs469, [%r1693+256]; 2026-02-21T09:10:47.7147165Z ld.shared.b8 %rs470, [%r1693+768]; 2026-02-21T09:10:47.7147222Z xor.b32 %r1694, %r145, 48; 2026-02-21T09:10:47.7147277Z add.s32 %r1695, %r1688, %r1694; 2026-02-21T09:10:47.7147340Z ld.shared.b8 %rs471, [%r1695+384]; 2026-02-21T09:10:47.7147397Z ld.shared.b8 %rs472, [%r1695+896]; 2026-02-21T09:10:47.7147561Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.7147624Z shl.b16 %rs473, %rs465, 4; 2026-02-21T09:10:47.7147679Z shl.b16 %rs474, %rs467, 4; 2026-02-21T09:10:47.7147736Z shl.b16 %rs475, %rs469, 4; 2026-02-21T09:10:47.7147799Z shl.b16 %rs476, %rs471, 4; 2026-02-21T09:10:47.7147853Z shl.b16 %rs477, %rs466, 4; 2026-02-21T09:10:47.7147909Z shl.b16 %rs478, %rs468, 4; 2026-02-21T09:10:47.7147964Z shl.b16 %rs479, %rs470, 4; 2026-02-21T09:10:47.7148025Z shl.b16 %rs480, %rs472, 4; 2026-02-21T09:10:47.7148184Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.7148254Z selp.b16 %rs481, %rs473, %rs465, %p346; 2026-02-21T09:10:47.7148318Z cvt.s16.s8 %rs482, %rs481; 2026-02-21T09:10:47.7148374Z shr.s16 %rs483, %rs482, 4; 2026-02-21T09:10:47.7148441Z selp.b16 %rs484, %rs474, %rs467, %p346; 2026-02-21T09:10:47.7148496Z cvt.s16.s8 %rs485, %rs484; 2026-02-21T09:10:47.7148557Z shr.s16 %rs486, %rs485, 4; 2026-02-21T09:10:47.7148622Z selp.b16 %rs487, %rs475, %rs469, %p346; 2026-02-21T09:10:47.7148679Z 
cvt.s16.s8 %rs488, %rs487; 2026-02-21T09:10:47.7148740Z shr.s16 %rs489, %rs488, 4; 2026-02-21T09:10:47.7148802Z selp.b16 %rs490, %rs476, %rs471, %p346; 2026-02-21T09:10:47.7148884Z cvt.s16.s8 %rs491, %rs490; 2026-02-21T09:10:47.7148943Z shr.s16 %rs492, %rs491, 4; 2026-02-21T09:10:47.7149006Z selp.b16 %rs493, %rs477, %rs466, %p346; 2026-02-21T09:10:47.7149061Z cvt.s16.s8 %rs494, %rs493; 2026-02-21T09:10:47.7149114Z shr.s16 %rs495, %rs494, 4; 2026-02-21T09:10:47.7149184Z selp.b16 %rs496, %rs478, %rs468, %p346; 2026-02-21T09:10:47.7149238Z cvt.s16.s8 %rs497, %rs496; 2026-02-21T09:10:47.7149292Z shr.s16 %rs498, %rs497, 4; 2026-02-21T09:10:47.7149381Z selp.b16 %rs499, %rs479, %rs470, %p346; 2026-02-21T09:10:47.7149435Z cvt.s16.s8 %rs500, %rs499; 2026-02-21T09:10:47.7149490Z shr.s16 %rs501, %rs500, 4; 2026-02-21T09:10:47.7149551Z selp.b16 %rs502, %rs480, %rs472, %p346; 2026-02-21T09:10:47.7149612Z cvt.s16.s8 %rs503, %rs502; 2026-02-21T09:10:47.7149668Z shr.s16 %rs504, %rs503, 4; 2026-02-21T09:10:47.7149828Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.7149896Z cvt.rn.f32.s16 %r1696, %rs483; 2026-02-21T09:10:47.7149954Z cvt.rn.f32.s16 %r1697, %rs486; 2026-02-21T09:10:47.7150047Z cvt.rn.f32.s16 %r1698, %rs489; 2026-02-21T09:10:47.7150104Z cvt.rn.f32.s16 %r1699, %rs492; 2026-02-21T09:10:47.7150166Z cvt.rn.f32.s16 %r1700, %rs495; 2026-02-21T09:10:47.7150223Z cvt.rn.f32.s16 %r1701, %rs498; 2026-02-21T09:10:47.7150280Z cvt.rn.f32.s16 %r1702, %rs501; 2026-02-21T09:10:47.7150368Z cvt.rn.f32.s16 %r1703, %rs504; 2026-02-21T09:10:47.7150430Z st.shared.b32 [%r146], %r1696; 2026-02-21T09:10:47.7150490Z st.shared.b32 [%r147], %r1697; 2026-02-21T09:10:47.7150550Z st.shared.b32 [%r148], %r1698; 2026-02-21T09:10:47.7150605Z st.shared.b32 [%r149], %r1699; 2026-02-21T09:10:47.7150661Z st.shared.b32 [%r150], %r1700; 2026-02-21T09:10:47.7150718Z st.shared.b32 [%r151], %r1701; 2026-02-21T09:10:47.7150779Z st.shared.b32 [%r152], %r1702; 
2026-02-21T09:10:47.7150837Z st.shared.b32 [%r153], %r1703; 2026-02-21T09:10:47.7150894Z mov.pred %p301, -1; 2026-02-21T09:10:47.7150951Z $L__tmp189: 2026-02-21T09:10:47.7151168Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7151224Z // begin inline asm 2026-02-21T09:10:47.7151560Z @%p301 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1653 + 0], {%r1654, %r1655, %r1656, %r1657, %r1658, %r1659, %r1660, %r1661, %r1662, %r1663, %r1664, %r1665, %r1666, %r1667, %r1668, %r1669}; 2026-02-21T09:10:47.7151617Z // end inline asm 2026-02-21T09:10:47.7151671Z // begin inline asm 2026-02-21T09:10:47.7151738Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.7151793Z // end inline asm 2026-02-21T09:10:47.7151846Z bar.sync 0; 2026-02-21T09:10:47.7151900Z // begin inline asm 2026-02-21T09:10:47.7151974Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7152025Z // end inline asm 2026-02-21T09:10:47.7152083Z add.s32 %r1719, %r201, 43024; 2026-02-21T09:10:47.7152136Z // begin inline asm 2026-02-21T09:10:47.7152224Z @%p9 mbarrier.init.shared::cta.b64 [%r1719], 1; 2026-02-21T09:10:47.7152278Z // end inline asm 2026-02-21T09:10:47.7152330Z bar.sync 0; 2026-02-21T09:10:47.7152392Z @%p344 bra $L__BB0_33; 2026-02-21T09:10:47.7152485Z // %bb.32: // in Loop: Header=BB0_29 Depth=1 2026-02-21T09:10:47.7152546Z elect.sync %r1716|%p300, -1; 2026-02-21T09:10:47.7152603Z mov.b32 %r1706, 135268624; 2026-02-21T09:10:47.7152661Z // begin inline asm 2026-02-21T09:10:47.7152818Z @%p300 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd471, %r1706, %p348; 2026-02-21T09:10:47.7152872Z // end inline asm 2026-02-21T09:10:47.7152932Z // begin inline asm 2026-02-21T09:10:47.7153084Z @%p300 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 8 ], %rd472, %r1706, %p301; 2026-02-21T09:10:47.7153136Z // end inline asm 2026-02-21T09:10:47.7153194Z // begin inline asm 2026-02-21T09:10:47.7153344Z @%p300 
tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd473, %r1706, %p301; 2026-02-21T09:10:47.7153433Z // end inline asm 2026-02-21T09:10:47.7153494Z // begin inline asm 2026-02-21T09:10:47.7153639Z @%p300 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd474, %r1706, %p301; 2026-02-21T09:10:47.7153691Z // end inline asm 2026-02-21T09:10:47.7153749Z cvt.u64.u32 %rd475, %r1719; 2026-02-21T09:10:47.7153808Z // begin inline asm 2026-02-21T09:10:47.7153933Z @%p300 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd475]; 2026-02-21T09:10:47.7154010Z // end inline asm 2026-02-21T09:10:47.7154068Z $L__tmp190: 2026-02-21T09:10:47.7154163Z $L__BB0_33: // in Loop: Header=BB0_29 Depth=1 2026-02-21T09:10:47.7154328Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7154393Z setp.eq.b32 %p311, %r176, 0; 2026-02-21T09:10:47.7154455Z setp.lt.s32 %p312, %r2102, %r143; 2026-02-21T09:10:47.7154506Z mov.b32 %r1720, 0; 2026-02-21T09:10:47.7154559Z $L__tmp191: 2026-02-21T09:10:47.7154799Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7154855Z // begin inline asm 2026-02-21T09:10:47.7154905Z 2026-02-21T09:10:47.7154959Z { 2026-02-21T09:10:47.7155018Z .reg .pred complete; 2026-02-21T09:10:47.7155071Z waitLoop: 2026-02-21T09:10:47.7155215Z mbarrier.try_wait.parity.shared.b64 complete, [%r1719], %r1720; 2026-02-21T09:10:47.7155284Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7155335Z } 2026-02-21T09:10:47.7155338Z 2026-02-21T09:10:47.7155390Z // end inline asm 2026-02-21T09:10:47.7155445Z bar.sync 0; 2026-02-21T09:10:47.7155498Z // begin inline asm 2026-02-21T09:10:47.7155582Z @%p9 mbarrier.inval.shared::cta.b64 [%r1719]; 2026-02-21T09:10:47.7155637Z // end inline asm 2026-02-21T09:10:47.7155690Z $L__tmp192: 2026-02-21T09:10:47.7155856Z .loc 1 31 111 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7155916Z add.s32 %r1732, %r2094, 16; 2026-02-21T09:10:47.7155983Z add.s32 %r1733, %r2097, 1; 2026-02-21T09:10:47.7156047Z setp.gt.s32 %p314, %r1733, 1; 2026-02-21T09:10:47.7156110Z selp.b32 %r2097, 0, %r1733, %p314; 2026-02-21T09:10:47.7156176Z selp.b32 %r2094, 0, %r1732, %p311; 2026-02-21T09:10:47.7156336Z .loc 1 55 22 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:55:22 2026-02-21T09:10:47.7156395Z shl.b32 %r1734, %r2094, 1; 2026-02-21T09:10:47.7156567Z .loc 1 57 25 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:57:25 2026-02-21T09:10:47.7156623Z add.s32 %r1735, %r1734, %r9; 2026-02-21T09:10:47.7156781Z .loc 1 58 53 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:53 2026-02-21T09:10:47.7156839Z shl.b32 %r1736, %r2108, 10; 2026-02-21T09:10:47.7156901Z shl.b32 %r1737, %r2109, 10; 2026-02-21T09:10:47.7157057Z .loc 1 58 60 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:60 2026-02-21T09:10:47.7157118Z add.s32 %r1738, %r1736, %r1735; 2026-02-21T09:10:47.7157186Z add.s32 %r1739, %r1737, %r1735; 2026-02-21T09:10:47.7157348Z .loc 1 58 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:32 2026-02-21T09:10:47.7157416Z mad.wide.s32 %rd476, %r1738, 2, %rd41; 2026-02-21T09:10:47.7157483Z mad.wide.s32 %rd477, %r1739, 2, %rd41; 2026-02-21T09:10:47.7157642Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7157696Z shl.b32 %r1740, %r2097, 13; 2026-02-21T09:10:47.7157753Z add.s32 %r1741, %r201, %r1740; 2026-02-21T09:10:47.7157816Z add.s32 %r1722, %r1741, %r12; 2026-02-21T09:10:47.7157876Z selp.b32 %r1723, 16, 0, %p312; 2026-02-21T09:10:47.7157929Z // begin inline asm 2026-02-21T09:10:47.7158049Z cp.async.cg.shared.global [ %r1722 + 0 ], [ %rd476 + 0 ], 0x10, %r1723; 2026-02-21T09:10:47.7158102Z // end inline asm 2026-02-21T09:10:47.7158185Z add.s32 %r1724, %r1741, %r2090; 
2026-02-21T09:10:47.7158247Z // begin inline asm 2026-02-21T09:10:47.7158364Z cp.async.cg.shared.global [ %r1724 + 0 ], [ %rd477 + 0 ], 0x10, %r1723; 2026-02-21T09:10:47.7158416Z // end inline asm 2026-02-21T09:10:47.7158475Z cp.async.commit_group; 2026-02-21T09:10:47.7158647Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7158706Z shl.b32 %r1742, %r2097, 3; 2026-02-21T09:10:47.7158787Z add.s32 %r1743, %r201, %r1742; 2026-02-21T09:10:47.7158848Z add.s32 %r1730, %r1743, 43008; 2026-02-21T09:10:47.7158908Z and.pred %p309, %p9, %p312; 2026-02-21T09:10:47.7158962Z // begin inline asm 2026-02-21T09:10:47.7159072Z @%p309 mbarrier.arrive.expect_tx.shared.b64 _, [%r1730], 1024; 2026-02-21T09:10:47.7159130Z // end inline asm 2026-02-21T09:10:47.7159289Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7159346Z shl.b32 %r1744, %r2097, 10; 2026-02-21T09:10:47.7159407Z add.s32 %r1745, %r201, %r1744; 2026-02-21T09:10:47.7159481Z add.s32 %r1727, %r1745, 40960; 2026-02-21T09:10:47.7159533Z bar.sync 0; 2026-02-21T09:10:47.7159598Z elect.sync %r1746|%p315, -1; 2026-02-21T09:10:47.7159659Z and.pred %p316, %p312, %p315; 2026-02-21T09:10:47.7159716Z and.pred %p310, %p6, %p316; 2026-02-21T09:10:47.7159771Z // begin inline asm 2026-02-21T09:10:47.7160041Z @%p310 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r1727], [%rd463, {%r2099, %r2094}], [%r1730]; 2026-02-21T09:10:47.7160097Z // end inline asm 2026-02-21T09:10:47.7160260Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7160325Z setp.ne.b32 %p348, %r2093, 31; 2026-02-21T09:10:47.7160382Z @%p348 bra $L__BB0_35; 2026-02-21T09:10:47.7160470Z // %bb.34: // in Loop: Header=BB0_29 Depth=1 2026-02-21T09:10:47.7160527Z $L__tmp193: 2026-02-21T09:10:47.7160734Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 
] 2026-02-21T09:10:47.7160789Z // begin inline asm 2026-02-21T09:10:47.7161086Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1747, %r1748, %r1749, %r1750, %r1751, %r1752, %r1753, %r1754, %r1755, %r1756, %r1757, %r1758, %r1759, %r1760, %r1761, %r1762}, [%r1977 + 0]; 2026-02-21T09:10:47.7161141Z // end inline asm 2026-02-21T09:10:47.7161196Z // begin inline asm 2026-02-21T09:10:47.7161480Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1764, %r1765, %r1766, %r1767, %r1768, %r1769, %r1770, %r1771, %r1772, %r1773, %r1774, %r1775, %r1776, %r1777, %r1778, %r1779}, [%r1977 + 16]; 2026-02-21T09:10:47.7161567Z // end inline asm 2026-02-21T09:10:47.7161622Z // begin inline asm 2026-02-21T09:10:47.7161690Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:47.7161746Z // end inline asm 2026-02-21T09:10:47.7161802Z cvt.u64.u32 %rd480, %r1747; 2026-02-21T09:10:47.7161861Z cvt.u64.u32 %rd481, %r1748; 2026-02-21T09:10:47.7161921Z shl.b64 %rd482, %rd481, 32; 2026-02-21T09:10:47.7161980Z or.b64 %rd483, %rd480, %rd482; 2026-02-21T09:10:47.7162031Z $L__tmp194: 2026-02-21T09:10:47.7162192Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7162258Z mov.b64 {%r1784, %r1785}, %rd483; 2026-02-21T09:10:47.7162330Z cvt.rn.bf16x2.f32 %r1786, %r1785, %r1784; 2026-02-21T09:10:47.7162380Z $L__tmp195: 2026-02-21T09:10:47.7162595Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7162655Z cvt.u64.u32 %rd484, %r1749; 2026-02-21T09:10:47.7162709Z cvt.u64.u32 %rd485, %r1750; 2026-02-21T09:10:47.7162766Z shl.b64 %rd486, %rd485, 32; 2026-02-21T09:10:47.7162828Z or.b64 %rd487, %rd484, %rd486; 2026-02-21T09:10:47.7162878Z $L__tmp196: 2026-02-21T09:10:47.7163037Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7163147Z mov.b64 {%r1787, %r1788}, %rd487; 2026-02-21T09:10:47.7163218Z cvt.rn.bf16x2.f32 %r1789, %r1788, %r1787; 
2026-02-21T09:10:47.7163271Z $L__tmp197: 2026-02-21T09:10:47.7163482Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7163540Z cvt.u64.u32 %rd488, %r1751; 2026-02-21T09:10:47.7163634Z cvt.u64.u32 %rd489, %r1752; 2026-02-21T09:10:47.7163691Z shl.b64 %rd490, %rd489, 32; 2026-02-21T09:10:47.7163752Z or.b64 %rd491, %rd488, %rd490; 2026-02-21T09:10:47.7163802Z $L__tmp198: 2026-02-21T09:10:47.7163961Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7164026Z mov.b64 {%r1790, %r1791}, %rd491; 2026-02-21T09:10:47.7164096Z cvt.rn.bf16x2.f32 %r1792, %r1791, %r1790; 2026-02-21T09:10:47.7164146Z $L__tmp199: 2026-02-21T09:10:47.7164396Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7164454Z cvt.u64.u32 %rd492, %r1753; 2026-02-21T09:10:47.7164509Z cvt.u64.u32 %rd493, %r1754; 2026-02-21T09:10:47.7164566Z shl.b64 %rd494, %rd493, 32; 2026-02-21T09:10:47.7164628Z or.b64 %rd495, %rd492, %rd494; 2026-02-21T09:10:47.7164679Z $L__tmp200: 2026-02-21T09:10:47.7164874Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7164947Z mov.b64 {%r1793, %r1794}, %rd495; 2026-02-21T09:10:47.7165016Z cvt.rn.bf16x2.f32 %r1795, %r1794, %r1793; 2026-02-21T09:10:47.7165069Z $L__tmp201: 2026-02-21T09:10:47.7165283Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7165342Z cvt.u64.u32 %rd496, %r1755; 2026-02-21T09:10:47.7165400Z cvt.u64.u32 %rd497, %r1756; 2026-02-21T09:10:47.7165458Z shl.b64 %rd498, %rd497, 32; 2026-02-21T09:10:47.7165524Z or.b64 %rd499, %rd496, %rd498; 2026-02-21T09:10:47.7165576Z $L__tmp202: 2026-02-21T09:10:47.7165737Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7165804Z mov.b64 {%r1796, %r1797}, %rd499; 
2026-02-21T09:10:47.7165871Z cvt.rn.bf16x2.f32 %r1798, %r1797, %r1796; 2026-02-21T09:10:47.7165923Z $L__tmp203: 2026-02-21T09:10:47.7166137Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7166196Z cvt.u64.u32 %rd500, %r1757; 2026-02-21T09:10:47.7166255Z cvt.u64.u32 %rd501, %r1758; 2026-02-21T09:10:47.7166313Z shl.b64 %rd502, %rd501, 32; 2026-02-21T09:10:47.7166379Z or.b64 %rd503, %rd500, %rd502; 2026-02-21T09:10:47.7166431Z $L__tmp204: 2026-02-21T09:10:47.7166590Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7166659Z mov.b64 {%r1799, %r1800}, %rd503; 2026-02-21T09:10:47.7166729Z cvt.rn.bf16x2.f32 %r1801, %r1800, %r1799; 2026-02-21T09:10:47.7166780Z $L__tmp205: 2026-02-21T09:10:47.7166984Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7167049Z cvt.u64.u32 %rd504, %r1759; 2026-02-21T09:10:47.7167106Z cvt.u64.u32 %rd505, %r1760; 2026-02-21T09:10:47.7167165Z shl.b64 %rd506, %rd505, 32; 2026-02-21T09:10:47.7167232Z or.b64 %rd507, %rd504, %rd506; 2026-02-21T09:10:47.7167283Z $L__tmp206: 2026-02-21T09:10:47.7167445Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7167512Z mov.b64 {%r1802, %r1803}, %rd507; 2026-02-21T09:10:47.7167578Z cvt.rn.bf16x2.f32 %r1804, %r1803, %r1802; 2026-02-21T09:10:47.7167629Z $L__tmp207: 2026-02-21T09:10:47.7167834Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7167928Z cvt.u64.u32 %rd508, %r1761; 2026-02-21T09:10:47.7167987Z cvt.u64.u32 %rd509, %r1762; 2026-02-21T09:10:47.7168043Z shl.b64 %rd510, %rd509, 32; 2026-02-21T09:10:47.7168108Z or.b64 %rd511, %rd508, %rd510; 2026-02-21T09:10:47.7168158Z $L__tmp208: 2026-02-21T09:10:47.7168321Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 
2026-02-21T09:10:47.7168423Z mov.b64 {%r1805, %r1806}, %rd511; 2026-02-21T09:10:47.7168490Z cvt.rn.bf16x2.f32 %r1807, %r1806, %r1805; 2026-02-21T09:10:47.7168542Z $L__tmp209: 2026-02-21T09:10:47.7168750Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7168813Z cvt.u64.u32 %rd512, %r1764; 2026-02-21T09:10:47.7168870Z cvt.u64.u32 %rd513, %r1765; 2026-02-21T09:10:47.7168927Z shl.b64 %rd514, %rd513, 32; 2026-02-21T09:10:47.7168993Z or.b64 %rd515, %rd512, %rd514; 2026-02-21T09:10:47.7169044Z $L__tmp210: 2026-02-21T09:10:47.7169228Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7169294Z mov.b64 {%r1808, %r1809}, %rd515; 2026-02-21T09:10:47.7169359Z cvt.rn.bf16x2.f32 %r1810, %r1809, %r1808; 2026-02-21T09:10:47.7169410Z $L__tmp211: 2026-02-21T09:10:47.7169636Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7169704Z cvt.u64.u32 %rd516, %r1766; 2026-02-21T09:10:47.7169762Z cvt.u64.u32 %rd517, %r1767; 2026-02-21T09:10:47.7169820Z shl.b64 %rd518, %rd517, 32; 2026-02-21T09:10:47.7169885Z or.b64 %rd519, %rd516, %rd518; 2026-02-21T09:10:47.7169936Z $L__tmp212: 2026-02-21T09:10:47.7170100Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7170159Z mov.b64 {%r1811, %r1812}, %rd519; 2026-02-21T09:10:47.7170234Z cvt.rn.bf16x2.f32 %r1813, %r1812, %r1811; 2026-02-21T09:10:47.7170286Z $L__tmp213: 2026-02-21T09:10:47.7170494Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7170558Z cvt.u64.u32 %rd520, %r1768; 2026-02-21T09:10:47.7170616Z cvt.u64.u32 %rd521, %r1769; 2026-02-21T09:10:47.7170674Z shl.b64 %rd522, %rd521, 32; 2026-02-21T09:10:47.7170741Z or.b64 %rd523, %rd520, %rd522; 2026-02-21T09:10:47.7170794Z $L__tmp214: 2026-02-21T09:10:47.7170958Z .loc 1 97 28 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7171018Z mov.b64 {%r1814, %r1815}, %rd523; 2026-02-21T09:10:47.7171092Z cvt.rn.bf16x2.f32 %r1816, %r1815, %r1814; 2026-02-21T09:10:47.7171143Z $L__tmp215: 2026-02-21T09:10:47.7171347Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7171416Z cvt.u64.u32 %rd524, %r1770; 2026-02-21T09:10:47.7171474Z cvt.u64.u32 %rd525, %r1771; 2026-02-21T09:10:47.7171579Z shl.b64 %rd526, %rd525, 32; 2026-02-21T09:10:47.7171649Z or.b64 %rd527, %rd524, %rd526; 2026-02-21T09:10:47.7171701Z $L__tmp216: 2026-02-21T09:10:47.7171864Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7171924Z mov.b64 {%r1817, %r1818}, %rd527; 2026-02-21T09:10:47.7172001Z cvt.rn.bf16x2.f32 %r1819, %r1818, %r1817; 2026-02-21T09:10:47.7172054Z $L__tmp217: 2026-02-21T09:10:47.7172264Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7172331Z cvt.u64.u32 %rd528, %r1772; 2026-02-21T09:10:47.7172387Z cvt.u64.u32 %rd529, %r1773; 2026-02-21T09:10:47.7172445Z shl.b64 %rd530, %rd529, 32; 2026-02-21T09:10:47.7172510Z or.b64 %rd531, %rd528, %rd530; 2026-02-21T09:10:47.7172562Z $L__tmp218: 2026-02-21T09:10:47.7172757Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7172819Z mov.b64 {%r1820, %r1821}, %rd531; 2026-02-21T09:10:47.7172896Z cvt.rn.bf16x2.f32 %r1822, %r1821, %r1820; 2026-02-21T09:10:47.7172951Z $L__tmp219: 2026-02-21T09:10:47.7173160Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7173230Z cvt.u64.u32 %rd532, %r1774; 2026-02-21T09:10:47.7173318Z cvt.u64.u32 %rd533, %r1775; 2026-02-21T09:10:47.7173379Z shl.b64 %rd534, %rd533, 32; 2026-02-21T09:10:47.7173451Z or.b64 %rd535, %rd532, %rd534; 
2026-02-21T09:10:47.7173505Z $L__tmp220: 2026-02-21T09:10:47.7173667Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7173727Z mov.b64 {%r1823, %r1824}, %rd535; 2026-02-21T09:10:47.7173802Z cvt.rn.bf16x2.f32 %r1825, %r1824, %r1823; 2026-02-21T09:10:47.7173856Z $L__tmp221: 2026-02-21T09:10:47.7174110Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7174180Z cvt.u64.u32 %rd536, %r1776; 2026-02-21T09:10:47.7174240Z cvt.u64.u32 %rd537, %r1777; 2026-02-21T09:10:47.7174300Z shl.b64 %rd538, %rd537, 32; 2026-02-21T09:10:47.7174362Z or.b64 %rd539, %rd536, %rd538; 2026-02-21T09:10:47.7174425Z $L__tmp222: 2026-02-21T09:10:47.7174624Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7174689Z mov.b64 {%r1826, %r1827}, %rd539; 2026-02-21T09:10:47.7174767Z cvt.rn.bf16x2.f32 %r1828, %r1827, %r1826; 2026-02-21T09:10:47.7174822Z $L__tmp223: 2026-02-21T09:10:47.7175038Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7175105Z cvt.u64.u32 %rd540, %r1778; 2026-02-21T09:10:47.7175165Z cvt.u64.u32 %rd541, %r1779; 2026-02-21T09:10:47.7175227Z shl.b64 %rd542, %rd541, 32; 2026-02-21T09:10:47.7175288Z or.b64 %rd543, %rd540, %rd542; 2026-02-21T09:10:47.7175351Z $L__tmp224: 2026-02-21T09:10:47.7175523Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7175585Z mov.b64 {%r1829, %r1830}, %rd543; 2026-02-21T09:10:47.7175662Z cvt.rn.bf16x2.f32 %r1831, %r1830, %r1829; 2026-02-21T09:10:47.7175834Z .loc 1 98 43 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:98:43 2026-02-21T09:10:47.7175910Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:47.7175974Z bar.sync 0; 2026-02-21T09:10:47.7176080Z st.shared.v4.b32 [%r156], {%r1786, %r1789, %r1792, %r1795}; 2026-02-21T09:10:47.7176181Z st.shared.v4.b32 
[%r157], {%r1798, %r1801, %r1804, %r1807}; 2026-02-21T09:10:47.7176276Z st.shared.v4.b32 [%r158], {%r1810, %r1813, %r1816, %r1819}; 2026-02-21T09:10:47.7176376Z st.shared.v4.b32 [%r159], {%r1822, %r1825, %r1828, %r1831}; 2026-02-21T09:10:47.7176438Z // begin inline asm 2026-02-21T09:10:47.7176513Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7176580Z // end inline asm 2026-02-21T09:10:47.7176636Z bar.sync 0; 2026-02-21T09:10:47.7176704Z elect.sync %r1832|%p319, -1; 2026-02-21T09:10:47.7176777Z and.pred %p317, %p6, %p319; 2026-02-21T09:10:47.7176836Z // begin inline asm 2026-02-21T09:10:47.7177027Z @%p317 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd553, {%r2091, %r2092}], [%r1633]; 2026-02-21T09:10:47.7177087Z // end inline asm 2026-02-21T09:10:47.7177164Z cp.async.bulk.commit_group; 2026-02-21T09:10:47.7177224Z bra.uni $L__BB0_35; 2026-02-21T09:10:47.7177326Z $L__BB0_36: // %._crit_edge346.loopexit 2026-02-21T09:10:47.7177509Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7177569Z add.s32 %r2110, %r2096, 1; 2026-02-21T09:10:47.7177630Z mov.b32 %r2092, %r169; 2026-02-21T09:10:47.7177727Z mov.b32 %r2091, %r170; 2026-02-21T09:10:47.7177819Z $L__BB0_37: // %._crit_edge346 2026-02-21T09:10:47.7177885Z setp.lt.s32 %p321, %r133, 1; 2026-02-21T09:10:47.7177958Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:47.7178022Z bar.sync 0; 2026-02-21T09:10:47.7178082Z @%p321 bra $L__BB0_42; 2026-02-21T09:10:47.7178136Z // %bb.38: 2026-02-21T09:10:47.7178208Z setp.gt.s32 %p325, %r2110, 1; 2026-02-21T09:10:47.7178276Z selp.b32 %r1854, 0, %r2110, %p325; 2026-02-21T09:10:47.7178369Z selp.b32 %r1855, 1, 0, %p325; 2026-02-21T09:10:47.7178431Z xor.b32 %r1835, %r2111, %r1855; 2026-02-21T09:10:47.7178611Z .loc 1 58 80 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:58:80 2026-02-21T09:10:47.7178677Z cp.async.wait_group 0; 2026-02-21T09:10:47.7178733Z bar.sync 0; 2026-02-21T09:10:47.7178800Z shl.b32 
%r1856, %r1854, 13; 2026-02-21T09:10:47.7178862Z add.s32 %r1858, %r201, %r1856; 2026-02-21T09:10:47.7178924Z shl.b32 %r1859, %r2055, 6; 2026-02-21T09:10:47.7178983Z shr.u32 %r1861, %r2056, 2; 2026-02-21T09:10:47.7179081Z add.s32 %r1862, %r1858, %r1859; 2026-02-21T09:10:47.7179145Z add.s32 %r1863, %r1862, %r1861; 2026-02-21T09:10:47.7179249Z ld.shared.v4.b32 {%r1864, %r1865, %r1866, %r1867}, [%r1863]; 2026-02-21T09:10:47.7179322Z mov.b32 {%rs505, %rs506}, %r1867; 2026-02-21T09:10:47.7179384Z mov.b32 {%rs507, %rs508}, %r1866; 2026-02-21T09:10:47.7179465Z mov.b32 {%rs509, %rs510}, %r1865; 2026-02-21T09:10:47.7179535Z mov.b32 {%rs511, %rs512}, %r1864; 2026-02-21T09:10:47.7179642Z ld.shared.v4.b32 {%r1868, %r1869, %r1870, %r1871}, [%r1863+16]; 2026-02-21T09:10:47.7179702Z mov.b32 {%rs513, %rs514}, %r1871; 2026-02-21T09:10:47.7179762Z mov.b32 {%rs515, %rs516}, %r1870; 2026-02-21T09:10:47.7179831Z mov.b32 {%rs517, %rs518}, %r1869; 2026-02-21T09:10:47.7179891Z mov.b32 {%rs519, %rs520}, %r1868; 2026-02-21T09:10:47.7180064Z .loc 1 62 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:62:32 2026-02-21T09:10:47.7180138Z cvt.f32.bf16 %r1837, %rs511; 2026-02-21T09:10:47.7180203Z cvt.f32.bf16 %r1838, %rs512; 2026-02-21T09:10:47.7180266Z cvt.f32.bf16 %r1839, %rs509; 2026-02-21T09:10:47.7180334Z cvt.f32.bf16 %r1840, %rs510; 2026-02-21T09:10:47.7180395Z cvt.f32.bf16 %r1841, %rs507; 2026-02-21T09:10:47.7180456Z cvt.f32.bf16 %r1842, %rs508; 2026-02-21T09:10:47.7180515Z cvt.f32.bf16 %r1843, %rs505; 2026-02-21T09:10:47.7180585Z cvt.f32.bf16 %r1844, %rs506; 2026-02-21T09:10:47.7180647Z cvt.f32.bf16 %r1845, %rs519; 2026-02-21T09:10:47.7180706Z cvt.f32.bf16 %r1846, %rs520; 2026-02-21T09:10:47.7180772Z cvt.f32.bf16 %r1847, %rs517; 2026-02-21T09:10:47.7180832Z cvt.f32.bf16 %r1848, %rs518; 2026-02-21T09:10:47.7180892Z cvt.f32.bf16 %r1849, %rs515; 2026-02-21T09:10:47.7180950Z cvt.f32.bf16 %r1850, %rs516; 2026-02-21T09:10:47.7181019Z cvt.f32.bf16 %r1851, %rs513; 
2026-02-21T09:10:47.7181080Z cvt.f32.bf16 %r1852, %rs514; 2026-02-21T09:10:47.7181258Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7181331Z shl.b32 %r1872, %r1854, 3; 2026-02-21T09:10:47.7181396Z add.s32 %r1873, %r201, %r1872; 2026-02-21T09:10:47.7181459Z add.s32 %r1834, %r1873, 43008; 2026-02-21T09:10:47.7181520Z // begin inline asm 2026-02-21T09:10:47.7181613Z 2026-02-21T09:10:47.7181668Z { 2026-02-21T09:10:47.7181733Z .reg .pred complete; 2026-02-21T09:10:47.7181801Z waitLoop: 2026-02-21T09:10:47.7181935Z mbarrier.try_wait.parity.shared.b64 complete, [%r1834], %r1835; 2026-02-21T09:10:47.7182000Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7182050Z } 2026-02-21T09:10:47.7182061Z 2026-02-21T09:10:47.7182115Z // end inline asm 2026-02-21T09:10:47.7182277Z .loc 1 64 33 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:64:33 2026-02-21T09:10:47.7182335Z shl.b32 %r1874, %r1854, 10; 2026-02-21T09:10:47.7182402Z add.s32 %r1875, %r201, %r1874; 2026-02-21T09:10:47.7182497Z add.s32 %r1876, %r1875, 40960; 2026-02-21T09:10:47.7182556Z shr.u32 %r198, %r2056, 1; 2026-02-21T09:10:47.7182620Z or.b32 %r1878, %r198, %r2057; 2026-02-21T09:10:47.7182680Z add.s32 %r1879, %r1876, %r1878; 2026-02-21T09:10:47.7182741Z ld.shared.b8 %rs521, [%r1879]; 2026-02-21T09:10:47.7182805Z ld.shared.b8 %rs522, [%r1879+512]; 2026-02-21T09:10:47.7182870Z xor.b32 %r1880, %r1878, 16; 2026-02-21T09:10:47.7182929Z add.s32 %r1881, %r1876, %r1880; 2026-02-21T09:10:47.7183022Z ld.shared.b8 %rs523, [%r1881+128]; 2026-02-21T09:10:47.7183091Z ld.shared.b8 %rs524, [%r1881+640]; 2026-02-21T09:10:47.7183149Z xor.b32 %r1882, %r1878, 32; 2026-02-21T09:10:47.7183207Z add.s32 %r1883, %r1876, %r1882; 2026-02-21T09:10:47.7183275Z ld.shared.b8 %rs525, [%r1883+256]; 2026-02-21T09:10:47.7183335Z ld.shared.b8 %rs526, [%r1883+768]; 2026-02-21T09:10:47.7183393Z xor.b32 %r1884, %r1878, 48; 2026-02-21T09:10:47.7183452Z add.s32 %r1885, %r1876, %r1884; 
2026-02-21T09:10:47.7183521Z ld.shared.b8 %rs527, [%r1885+384]; 2026-02-21T09:10:47.7183583Z ld.shared.b8 %rs528, [%r1885+896]; 2026-02-21T09:10:47.7183780Z .loc 1 67 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:67:28 2026-02-21T09:10:47.7183849Z shl.b16 %rs529, %rs521, 4; 2026-02-21T09:10:47.7183909Z shl.b16 %rs530, %rs523, 4; 2026-02-21T09:10:47.7183967Z shl.b16 %rs531, %rs525, 4; 2026-02-21T09:10:47.7184025Z shl.b16 %rs532, %rs527, 4; 2026-02-21T09:10:47.7184122Z shl.b16 %rs533, %rs522, 4; 2026-02-21T09:10:47.7184183Z shl.b16 %rs534, %rs524, 4; 2026-02-21T09:10:47.7184240Z shl.b16 %rs535, %rs526, 4; 2026-02-21T09:10:47.7184304Z shl.b16 %rs536, %rs528, 4; 2026-02-21T09:10:47.7184468Z .loc 1 82 58 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:82:58 2026-02-21T09:10:47.7184536Z selp.b16 %rs537, %rs529, %rs521, %p346; 2026-02-21T09:10:47.7184603Z cvt.s16.s8 %rs538, %rs537; 2026-02-21T09:10:47.7184661Z shr.s16 %rs539, %rs538, 4; 2026-02-21T09:10:47.7184731Z selp.b16 %rs540, %rs530, %rs523, %p346; 2026-02-21T09:10:47.7184789Z cvt.s16.s8 %rs541, %rs540; 2026-02-21T09:10:47.7184854Z shr.s16 %rs542, %rs541, 4; 2026-02-21T09:10:47.7184920Z selp.b16 %rs543, %rs531, %rs525, %p346; 2026-02-21T09:10:47.7184978Z cvt.s16.s8 %rs544, %rs543; 2026-02-21T09:10:47.7185042Z shr.s16 %rs545, %rs544, 4; 2026-02-21T09:10:47.7185108Z selp.b16 %rs546, %rs532, %rs527, %p346; 2026-02-21T09:10:47.7185166Z cvt.s16.s8 %rs547, %rs546; 2026-02-21T09:10:47.7185224Z shr.s16 %rs548, %rs547, 4; 2026-02-21T09:10:47.7185297Z selp.b16 %rs549, %rs533, %rs522, %p346; 2026-02-21T09:10:47.7185353Z cvt.s16.s8 %rs550, %rs549; 2026-02-21T09:10:47.7185410Z shr.s16 %rs551, %rs550, 4; 2026-02-21T09:10:47.7185481Z selp.b16 %rs552, %rs534, %rs524, %p346; 2026-02-21T09:10:47.7185539Z cvt.s16.s8 %rs553, %rs552; 2026-02-21T09:10:47.7185595Z shr.s16 %rs554, %rs553, 4; 2026-02-21T09:10:47.7185659Z selp.b16 %rs555, %rs535, %rs526, %p346; 2026-02-21T09:10:47.7185723Z cvt.s16.s8 %rs556, 
%rs555; 2026-02-21T09:10:47.7185780Z shr.s16 %rs557, %rs556, 4; 2026-02-21T09:10:47.7185845Z selp.b16 %rs558, %rs536, %rs528, %p346; 2026-02-21T09:10:47.7185910Z cvt.s16.s8 %rs559, %rs558; 2026-02-21T09:10:47.7185967Z shr.s16 %rs560, %rs559, 4; 2026-02-21T09:10:47.7186131Z .loc 1 87 32 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:87:32 2026-02-21T09:10:47.7186200Z cvt.rn.f32.s16 %r1886, %rs539; 2026-02-21T09:10:47.7186260Z cvt.rn.f32.s16 %r1887, %rs542; 2026-02-21T09:10:47.7186321Z cvt.rn.f32.s16 %r1888, %rs545; 2026-02-21T09:10:47.7186380Z cvt.rn.f32.s16 %r1889, %rs548; 2026-02-21T09:10:47.7186445Z cvt.rn.f32.s16 %r1890, %rs551; 2026-02-21T09:10:47.7186504Z cvt.rn.f32.s16 %r1891, %rs554; 2026-02-21T09:10:47.7186562Z cvt.rn.f32.s16 %r1892, %rs557; 2026-02-21T09:10:47.7186628Z cvt.rn.f32.s16 %r1893, %rs560; 2026-02-21T09:10:47.7186684Z shl.b32 %r1894, %r2057, 7; 2026-02-21T09:10:47.7186742Z and.b32 %r1896, %r2059, 12; 2026-02-21T09:10:47.7186842Z or.b32 %r1897, %r1894, %r1896; 2026-02-21T09:10:47.7186906Z or.b32 %r1898, %r1897, %r2058; 2026-02-21T09:10:47.7186966Z add.s32 %r1899, %r201, 16384; 2026-02-21T09:10:47.7187026Z add.s32 %r1900, %r1899, %r1898; 2026-02-21T09:10:47.7187094Z st.shared.b32 [%r1900], %r1886; 2026-02-21T09:10:47.7187151Z xor.b32 %r1901, %r1898, 16; 2026-02-21T09:10:47.7187210Z add.s32 %r1902, %r1899, %r1901; 2026-02-21T09:10:47.7187272Z st.shared.b32 [%r1902], %r1887; 2026-02-21T09:10:47.7187360Z xor.b32 %r1903, %r1898, 32; 2026-02-21T09:10:47.7187419Z add.s32 %r1904, %r1899, %r1903; 2026-02-21T09:10:47.7187479Z st.shared.b32 [%r1904], %r1888; 2026-02-21T09:10:47.7187544Z xor.b32 %r1905, %r1898, 48; 2026-02-21T09:10:47.7187602Z add.s32 %r1906, %r1899, %r1905; 2026-02-21T09:10:47.7187662Z st.shared.b32 [%r1906], %r1889; 2026-02-21T09:10:47.7187724Z xor.b32 %r1907, %r1898, 64; 2026-02-21T09:10:47.7187782Z add.s32 %r1908, %r1899, %r1907; 2026-02-21T09:10:47.7187842Z st.shared.b32 [%r1908], %r1890; 2026-02-21T09:10:47.7187901Z xor.b32 
%r1909, %r1898, 80; 2026-02-21T09:10:47.7187965Z add.s32 %r1910, %r1899, %r1909; 2026-02-21T09:10:47.7188046Z st.shared.b32 [%r1910], %r1891; 2026-02-21T09:10:47.7188107Z xor.b32 %r1911, %r1898, 96; 2026-02-21T09:10:47.7188173Z add.s32 %r1912, %r1899, %r1911; 2026-02-21T09:10:47.7188233Z st.shared.b32 [%r1912], %r1892; 2026-02-21T09:10:47.7188295Z xor.b32 %r1913, %r1898, 112; 2026-02-21T09:10:47.7188381Z add.s32 %r1914, %r1899, %r1913; 2026-02-21T09:10:47.7188458Z st.shared.b32 [%r1914], %r1893; 2026-02-21T09:10:47.7188512Z $L__tmp225: 2026-02-21T09:10:47.7188728Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7188795Z add.s32 %r1915, %r135, %r1918; 2026-02-21T09:10:47.7188853Z add.s32 %r1836, %r1915, %r2051; 2026-02-21T09:10:47.7188913Z mov.pred %p329, -1; 2026-02-21T09:10:47.7188975Z // begin inline asm 2026-02-21T09:10:47.7189280Z @%p329 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r1836 + 0], {%r1837, %r1838, %r1839, %r1840, %r1841, %r1842, %r1843, %r1844, %r1845, %r1846, %r1847, %r1848, %r1849, %r1850, %r1851, %r1852}; 2026-02-21T09:10:47.7189337Z // end inline asm 2026-02-21T09:10:47.7189394Z // begin inline asm 2026-02-21T09:10:47.7189472Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:47.7189526Z // end inline asm 2026-02-21T09:10:47.7189579Z bar.sync 0; 2026-02-21T09:10:47.7189642Z // begin inline asm 2026-02-21T09:10:47.7189717Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7189772Z // end inline asm 2026-02-21T09:10:47.7189829Z add.s32 %r1940, %r201, 24576; 2026-02-21T09:10:47.7189891Z // begin inline asm 2026-02-21T09:10:47.7189979Z @%p9 mbarrier.init.shared::cta.b64 [%r1940], 1; 2026-02-21T09:10:47.7190032Z // end inline asm 2026-02-21T09:10:47.7190093Z bar.sync 0; 2026-02-21T09:10:47.7190150Z @%p344 bra $L__BB0_40; 2026-02-21T09:10:47.7190201Z // %bb.39: 2026-02-21T09:10:47.7190271Z elect.sync %r1929|%p328, -1; 2026-02-21T09:10:47.7190333Z bfe.u32 %r1932, %r1899, 4, 14; 
2026-02-21T09:10:47.7190394Z cvt.u64.u32 %rd549, %r1932; 2026-02-21T09:10:47.7190468Z or.b64 %rd544, %rd549, 4611686293338849280; 2026-02-21T09:10:47.7190536Z mov.b32 %r1919, 135268624; 2026-02-21T09:10:47.7190593Z // begin inline asm 2026-02-21T09:10:47.7190753Z @%p328 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 0 ], %rd544, %r1919, %p348; 2026-02-21T09:10:47.7190816Z // end inline asm 2026-02-21T09:10:47.7190877Z add.s32 %r1933, %r201, 16416; 2026-02-21T09:10:47.7190936Z bfe.u32 %r1934, %r1933, 4, 14; 2026-02-21T09:10:47.7190996Z cvt.u64.u32 %rd550, %r1934; 2026-02-21T09:10:47.7191072Z or.b64 %rd545, %rd550, 4611686293338849280; 2026-02-21T09:10:47.7191128Z // begin inline asm 2026-02-21T09:10:47.7191281Z @%p328 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 8 ], %rd545, %r1919, %p329; 2026-02-21T09:10:47.7191344Z // end inline asm 2026-02-21T09:10:47.7191404Z add.s32 %r1935, %r201, 16448; 2026-02-21T09:10:47.7191484Z bfe.u32 %r1936, %r1935, 4, 14; 2026-02-21T09:10:47.7191583Z cvt.u64.u32 %rd551, %r1936; 2026-02-21T09:10:47.7191656Z or.b64 %rd546, %rd551, 4611686293338849280; 2026-02-21T09:10:47.7191713Z // begin inline asm 2026-02-21T09:10:47.7191865Z @%p328 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 16 ], %rd546, %r1919, %p329; 2026-02-21T09:10:47.7191927Z // end inline asm 2026-02-21T09:10:47.7191988Z add.s32 %r1937, %r201, 16480; 2026-02-21T09:10:47.7192089Z bfe.u32 %r1938, %r1937, 4, 14; 2026-02-21T09:10:47.7192153Z cvt.u64.u32 %rd552, %r1938; 2026-02-21T09:10:47.7192221Z or.b64 %rd547, %rd552, 4611686293338849280; 2026-02-21T09:10:47.7192278Z // begin inline asm 2026-02-21T09:10:47.7192436Z @%p328 tcgen05.mma.cta_group::1.kind::tf32 [ %r1917 + 0 ], [ %r1918 + 24 ], %rd547, %r1919, %p329; 2026-02-21T09:10:47.7192491Z // end inline asm 2026-02-21T09:10:47.7192549Z cvt.u64.u32 %rd548, %r1940; 2026-02-21T09:10:47.7192608Z // begin inline asm 2026-02-21T09:10:47.7192741Z @%p328 
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd548]; 2026-02-21T09:10:47.7192824Z // end inline asm 2026-02-21T09:10:47.7192879Z $L__BB0_40: 2026-02-21T09:10:47.7192973Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:47.7193029Z mov.b32 %r1941, 0; 2026-02-21T09:10:47.7193269Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7193334Z // begin inline asm 2026-02-21T09:10:47.7193384Z 2026-02-21T09:10:47.7193435Z { 2026-02-21T09:10:47.7193496Z .reg .pred complete; 2026-02-21T09:10:47.7193556Z waitLoop: 2026-02-21T09:10:47.7193674Z mbarrier.try_wait.parity.shared.b64 complete, [%r1940], %r1941; 2026-02-21T09:10:47.7193738Z @!complete bra.uni waitLoop; 2026-02-21T09:10:47.7193793Z } 2026-02-21T09:10:47.7193796Z 2026-02-21T09:10:47.7193849Z // end inline asm 2026-02-21T09:10:47.7193902Z bar.sync 0; 2026-02-21T09:10:47.7193958Z // begin inline asm 2026-02-21T09:10:47.7194052Z @%p9 mbarrier.inval.shared::cta.b64 [%r1940]; 2026-02-21T09:10:47.7194107Z // end inline asm 2026-02-21T09:10:47.7194161Z $L__tmp226: 2026-02-21T09:10:47.7194334Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7194394Z not.pred %p337, %p349; 2026-02-21T09:10:47.7194453Z @%p337 bra $L__BB0_42; 2026-02-21T09:10:47.7194509Z // %bb.41: 2026-02-21T09:10:47.7194563Z $L__tmp227: 2026-02-21T09:10:47.7194774Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7194829Z // begin inline asm 2026-02-21T09:10:47.7195128Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1944, %r1945, %r1946, %r1947, %r1948, %r1949, %r1950, %r1951, %r1952, %r1953, %r1954, %r1955, %r1956, %r1957, %r1958, %r1959}, [%r1977 + 0]; 2026-02-21T09:10:47.7195182Z // end inline asm 2026-02-21T09:10:47.7195237Z // begin inline asm 2026-02-21T09:10:47.7195537Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r1961, %r1962, %r1963, %r1964, %r1965, 
%r1966, %r1967, %r1968, %r1969, %r1970, %r1971, %r1972, %r1973, %r1974, %r1975, %r1976}, [%r1977 + 16]; 2026-02-21T09:10:47.7195591Z // end inline asm 2026-02-21T09:10:47.7195646Z // begin inline asm 2026-02-21T09:10:47.7195718Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:47.7195771Z // end inline asm 2026-02-21T09:10:47.7195831Z cvt.u64.u32 %rd554, %r1944; 2026-02-21T09:10:47.7195890Z cvt.u64.u32 %rd555, %r1945; 2026-02-21T09:10:47.7195956Z shl.b64 %rd556, %rd555, 32; 2026-02-21T09:10:47.7196015Z or.b64 %rd557, %rd554, %rd556; 2026-02-21T09:10:47.7196067Z $L__tmp228: 2026-02-21T09:10:47.7196239Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7196301Z mov.b64 {%r1981, %r1982}, %rd557; 2026-02-21T09:10:47.7196371Z cvt.rn.bf16x2.f32 %r1983, %r1982, %r1981; 2026-02-21T09:10:47.7196437Z $L__tmp229: 2026-02-21T09:10:47.7196683Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7196743Z cvt.u64.u32 %rd558, %r1946; 2026-02-21T09:10:47.7196802Z cvt.u64.u32 %rd559, %r1947; 2026-02-21T09:10:47.7196870Z shl.b64 %rd560, %rd559, 32; 2026-02-21T09:10:47.7196930Z or.b64 %rd561, %rd558, %rd560; 2026-02-21T09:10:47.7196982Z $L__tmp230: 2026-02-21T09:10:47.7197154Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7197237Z mov.b64 {%r1984, %r1985}, %rd561; 2026-02-21T09:10:47.7197307Z cvt.rn.bf16x2.f32 %r1986, %r1985, %r1984; 2026-02-21T09:10:47.7197360Z $L__tmp231: 2026-02-21T09:10:47.7197569Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7197628Z cvt.u64.u32 %rd562, %r1948; 2026-02-21T09:10:47.7197685Z cvt.u64.u32 %rd563, %r1949; 2026-02-21T09:10:47.7197753Z shl.b64 %rd564, %rd563, 32; 2026-02-21T09:10:47.7197811Z or.b64 %rd565, %rd562, %rd564; 2026-02-21T09:10:47.7197888Z $L__tmp232: 2026-02-21T09:10:47.7198061Z .loc 1 97 28 // 
cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7198122Z mov.b64 {%r1987, %r1988}, %rd565; 2026-02-21T09:10:47.7198190Z cvt.rn.bf16x2.f32 %r1989, %r1988, %r1987; 2026-02-21T09:10:47.7198242Z $L__tmp233: 2026-02-21T09:10:47.7198476Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7198538Z cvt.u64.u32 %rd566, %r1950; 2026-02-21T09:10:47.7198596Z cvt.u64.u32 %rd567, %r1951; 2026-02-21T09:10:47.7198662Z shl.b64 %rd568, %rd567, 32; 2026-02-21T09:10:47.7198721Z or.b64 %rd569, %rd566, %rd568; 2026-02-21T09:10:47.7198772Z $L__tmp234: 2026-02-21T09:10:47.7198940Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7199000Z mov.b64 {%r1990, %r1991}, %rd569; 2026-02-21T09:10:47.7199068Z cvt.rn.bf16x2.f32 %r1992, %r1991, %r1990; 2026-02-21T09:10:47.7199121Z $L__tmp235: 2026-02-21T09:10:47.7199331Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7199389Z cvt.u64.u32 %rd570, %r1952; 2026-02-21T09:10:47.7199447Z cvt.u64.u32 %rd571, %r1953; 2026-02-21T09:10:47.7199513Z shl.b64 %rd572, %rd571, 32; 2026-02-21T09:10:47.7199573Z or.b64 %rd573, %rd570, %rd572; 2026-02-21T09:10:47.7199624Z $L__tmp236: 2026-02-21T09:10:47.7199792Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7199851Z mov.b64 {%r1993, %r1994}, %rd573; 2026-02-21T09:10:47.7199919Z cvt.rn.bf16x2.f32 %r1995, %r1994, %r1993; 2026-02-21T09:10:47.7199970Z $L__tmp237: 2026-02-21T09:10:47.7200180Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7200239Z cvt.u64.u32 %rd574, %r1954; 2026-02-21T09:10:47.7200297Z cvt.u64.u32 %rd575, %r1955; 2026-02-21T09:10:47.7200362Z shl.b64 %rd576, %rd575, 32; 2026-02-21T09:10:47.7200420Z or.b64 %rd577, %rd574, %rd576; 
2026-02-21T09:10:47.7200472Z $L__tmp238: 2026-02-21T09:10:47.7200640Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7200700Z mov.b64 {%r1996, %r1997}, %rd577; 2026-02-21T09:10:47.7200767Z cvt.rn.bf16x2.f32 %r1998, %r1997, %r1996; 2026-02-21T09:10:47.7200818Z $L__tmp239: 2026-02-21T09:10:47.7201025Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7201082Z cvt.u64.u32 %rd578, %r1956; 2026-02-21T09:10:47.7201138Z cvt.u64.u32 %rd579, %r1957; 2026-02-21T09:10:47.7201202Z shl.b64 %rd580, %rd579, 32; 2026-02-21T09:10:47.7201260Z or.b64 %rd581, %rd578, %rd580; 2026-02-21T09:10:47.7201333Z $L__tmp240: 2026-02-21T09:10:47.7201494Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7201592Z mov.b64 {%r1999, %r2000}, %rd581; 2026-02-21T09:10:47.7201661Z cvt.rn.bf16x2.f32 %r2001, %r2000, %r1999; 2026-02-21T09:10:47.7201713Z $L__tmp241: 2026-02-21T09:10:47.7201928Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7202025Z cvt.u64.u32 %rd582, %r1958; 2026-02-21T09:10:47.7202083Z cvt.u64.u32 %rd583, %r1959; 2026-02-21T09:10:47.7202146Z shl.b64 %rd584, %rd583, 32; 2026-02-21T09:10:47.7202204Z or.b64 %rd585, %rd582, %rd584; 2026-02-21T09:10:47.7202256Z $L__tmp242: 2026-02-21T09:10:47.7202419Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7202485Z mov.b64 {%r2002, %r2003}, %rd585; 2026-02-21T09:10:47.7202553Z cvt.rn.bf16x2.f32 %r2004, %r2003, %r2002; 2026-02-21T09:10:47.7202605Z $L__tmp243: 2026-02-21T09:10:47.7202863Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7202923Z cvt.u64.u32 %rd586, %r1961; 2026-02-21T09:10:47.7202982Z cvt.u64.u32 %rd587, %r1962; 2026-02-21T09:10:47.7203047Z shl.b64 %rd588, %rd587, 32; 
2026-02-21T09:10:47.7203146Z or.b64 %rd589, %rd586, %rd588; 2026-02-21T09:10:47.7203200Z $L__tmp244: 2026-02-21T09:10:47.7203360Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7203429Z mov.b64 {%r2005, %r2006}, %rd589; 2026-02-21T09:10:47.7203495Z cvt.rn.bf16x2.f32 %r2007, %r2006, %r2005; 2026-02-21T09:10:47.7203546Z $L__tmp245: 2026-02-21T09:10:47.7203754Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7203814Z cvt.u64.u32 %rd590, %r1963; 2026-02-21T09:10:47.7203872Z cvt.u64.u32 %rd591, %r1964; 2026-02-21T09:10:47.7203939Z shl.b64 %rd592, %rd591, 32; 2026-02-21T09:10:47.7203998Z or.b64 %rd593, %rd590, %rd592; 2026-02-21T09:10:47.7204050Z $L__tmp246: 2026-02-21T09:10:47.7204208Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7204277Z mov.b64 {%r2008, %r2009}, %rd593; 2026-02-21T09:10:47.7204345Z cvt.rn.bf16x2.f32 %r2010, %r2009, %r2008; 2026-02-21T09:10:47.7204398Z $L__tmp247: 2026-02-21T09:10:47.7204606Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7204663Z cvt.u64.u32 %rd594, %r1965; 2026-02-21T09:10:47.7204721Z cvt.u64.u32 %rd595, %r1966; 2026-02-21T09:10:47.7204789Z shl.b64 %rd596, %rd595, 32; 2026-02-21T09:10:47.7204849Z or.b64 %rd597, %rd594, %rd596; 2026-02-21T09:10:47.7204902Z $L__tmp248: 2026-02-21T09:10:47.7205064Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7205138Z mov.b64 {%r2011, %r2012}, %rd597; 2026-02-21T09:10:47.7205207Z cvt.rn.bf16x2.f32 %r2013, %r2012, %r2011; 2026-02-21T09:10:47.7205259Z $L__tmp249: 2026-02-21T09:10:47.7205467Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7205525Z cvt.u64.u32 %rd598, %r1967; 2026-02-21T09:10:47.7205583Z cvt.u64.u32 %rd599, 
%r1968; 2026-02-21T09:10:47.7205647Z shl.b64 %rd600, %rd599, 32; 2026-02-21T09:10:47.7205705Z or.b64 %rd601, %rd598, %rd600; 2026-02-21T09:10:47.7205756Z $L__tmp250: 2026-02-21T09:10:47.7205914Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7205980Z mov.b64 {%r2014, %r2015}, %rd601; 2026-02-21T09:10:47.7206045Z cvt.rn.bf16x2.f32 %r2016, %r2015, %r2014; 2026-02-21T09:10:47.7206096Z $L__tmp251: 2026-02-21T09:10:47.7206346Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7206405Z cvt.u64.u32 %rd602, %r1969; 2026-02-21T09:10:47.7206462Z cvt.u64.u32 %rd603, %r1970; 2026-02-21T09:10:47.7206520Z shl.b64 %rd604, %rd603, 32; 2026-02-21T09:10:47.7206587Z or.b64 %rd605, %rd602, %rd604; 2026-02-21T09:10:47.7206639Z $L__tmp252: 2026-02-21T09:10:47.7206806Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7206893Z mov.b64 {%r2017, %r2018}, %rd605; 2026-02-21T09:10:47.7206960Z cvt.rn.bf16x2.f32 %r2019, %r2018, %r2017; 2026-02-21T09:10:47.7207011Z $L__tmp253: 2026-02-21T09:10:47.7207221Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7207278Z cvt.u64.u32 %rd606, %r1971; 2026-02-21T09:10:47.7207334Z cvt.u64.u32 %rd607, %r1972; 2026-02-21T09:10:47.7207393Z shl.b64 %rd608, %rd607, 32; 2026-02-21T09:10:47.7207460Z or.b64 %rd609, %rd606, %rd608; 2026-02-21T09:10:47.7207533Z $L__tmp254: 2026-02-21T09:10:47.7207697Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7207765Z mov.b64 {%r2020, %r2021}, %rd609; 2026-02-21T09:10:47.7207831Z cvt.rn.bf16x2.f32 %r2022, %r2021, %r2020; 2026-02-21T09:10:47.7207904Z $L__tmp255: 2026-02-21T09:10:47.7208120Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7208178Z cvt.u64.u32 
%rd610, %r1973; 2026-02-21T09:10:47.7208237Z cvt.u64.u32 %rd611, %r1974; 2026-02-21T09:10:47.7208294Z shl.b64 %rd612, %rd611, 32; 2026-02-21T09:10:47.7208360Z or.b64 %rd613, %rd610, %rd612; 2026-02-21T09:10:47.7208411Z $L__tmp256: 2026-02-21T09:10:47.7208577Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7208647Z mov.b64 {%r2023, %r2024}, %rd613; 2026-02-21T09:10:47.7208716Z cvt.rn.bf16x2.f32 %r2025, %r2024, %r2023; 2026-02-21T09:10:47.7208768Z $L__tmp257: 2026-02-21T09:10:47.7208985Z .loc 2 291 36 // standard.py:291:36 @[ cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:94:40 ] 2026-02-21T09:10:47.7209042Z cvt.u64.u32 %rd614, %r1975; 2026-02-21T09:10:47.7209101Z cvt.u64.u32 %rd615, %r1976; 2026-02-21T09:10:47.7209159Z shl.b64 %rd616, %rd615, 32; 2026-02-21T09:10:47.7209224Z or.b64 %rd617, %rd614, %rd616; 2026-02-21T09:10:47.7209275Z $L__tmp258: 2026-02-21T09:10:47.7209438Z .loc 1 97 28 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:97:28 2026-02-21T09:10:47.7209502Z mov.b64 {%r2026, %r2027}, %rd617; 2026-02-21T09:10:47.7209568Z cvt.rn.bf16x2.f32 %r2028, %r2027, %r2026; 2026-02-21T09:10:47.7209728Z .loc 1 98 43 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:98:43 2026-02-21T09:10:47.7209790Z bar.sync 0; 2026-02-21T09:10:47.7209849Z shl.b32 %r2029, %r2055, 7; 2026-02-21T09:10:47.7209909Z xor.b32 %r2030, %r2058, %r198; 2026-02-21T09:10:47.7209966Z or.b32 %r2031, %r2030, %r2029; 2026-02-21T09:10:47.7210031Z add.s32 %r2033, %r1899, %r2031; 2026-02-21T09:10:47.7210133Z st.shared.v4.b32 [%r2033], {%r1983, %r1986, %r1989, %r1992}; 2026-02-21T09:10:47.7210190Z xor.b32 %r2034, %r2031, 16; 2026-02-21T09:10:47.7210256Z add.s32 %r2035, %r1899, %r2034; 2026-02-21T09:10:47.7210357Z st.shared.v4.b32 [%r2035], {%r1995, %r1998, %r2001, %r2004}; 2026-02-21T09:10:47.7210415Z xor.b32 %r2036, %r2031, 32; 2026-02-21T09:10:47.7210475Z add.s32 %r2037, %r1899, %r2036; 2026-02-21T09:10:47.7210577Z 
st.shared.v4.b32 [%r2037], {%r2007, %r2010, %r2013, %r2016}; 2026-02-21T09:10:47.7210635Z xor.b32 %r2038, %r2031, 48; 2026-02-21T09:10:47.7210694Z add.s32 %r2039, %r1899, %r2038; 2026-02-21T09:10:47.7210791Z st.shared.v4.b32 [%r2039], {%r2019, %r2022, %r2025, %r2028}; 2026-02-21T09:10:47.7210873Z // begin inline asm 2026-02-21T09:10:47.7210947Z fence.proxy.async.shared::cta; 2026-02-21T09:10:47.7211010Z // end inline asm 2026-02-21T09:10:47.7211064Z bar.sync 0; 2026-02-21T09:10:47.7211128Z elect.sync %r2040|%p340, -1; 2026-02-21T09:10:47.7211190Z and.pred %p338, %p6, %p340; 2026-02-21T09:10:47.7211254Z // begin inline asm 2026-02-21T09:10:47.7211438Z @%p338 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd553, {%r2091, %r2092}], [%r1899]; 2026-02-21T09:10:47.7211513Z // end inline asm 2026-02-21T09:10:47.7211626Z cp.async.bulk.commit_group; 2026-02-21T09:10:47.7211697Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:47.7211751Z bar.sync 0; 2026-02-21T09:10:47.7211804Z $L__BB0_42: 2026-02-21T09:10:47.7211982Z .loc 1 31 111 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:111 2026-02-21T09:10:47.7212044Z cp.async.wait_group 0; 2026-02-21T09:10:47.7212098Z bar.sync 0; 2026-02-21T09:10:47.7212164Z // begin inline asm 2026-02-21T09:10:47.7212248Z @%p9 mbarrier.inval.shared::cta.b64 [%r1547]; 2026-02-21T09:10:47.7212336Z // end inline asm 2026-02-21T09:10:47.7212399Z bar.sync 0; 2026-02-21T09:10:47.7212456Z // begin inline asm 2026-02-21T09:10:47.7212539Z @%p9 mbarrier.inval.shared::cta.b64 [%r1548]; 2026-02-21T09:10:47.7212596Z // end inline asm 2026-02-21T09:10:47.7212794Z .loc 1 31 4 // cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py:31:4 2026-02-21T09:10:47.7212852Z bar.sync 0; 2026-02-21T09:10:47.7212909Z // begin inline asm 2026-02-21T09:10:47.7213035Z @%p6 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r1917, 128; 2026-02-21T09:10:47.7213092Z // end inline asm 2026-02-21T09:10:47.7213147Z ret; 2026-02-21T09:10:47.7213201Z 
$L__tmp259: 2026-02-21T09:10:47.7213268Z $L__func_end0: 2026-02-21T09:10:47.7213353Z // -- End function 2026-02-21T09:10:47.7213407Z } 2026-02-21T09:10:47.7213616Z .file 1 "/tmp/torchinductor_root/pe/cpeyvly5kekgexon2swvsr7owo33sg7nkvw6wfwtu5ix3rdg63ip.py" 2026-02-21T09:10:47.7213791Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:10:47.7213852Z .section .debug_abbrev 2026-02-21T09:10:47.7213912Z { 2026-02-21T09:10:47.7214001Z .b8 1 // Abbreviation Code 2026-02-21T09:10:47.7214087Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:10:47.7214169Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:47.7214257Z .b8 37 // DW_AT_producer 2026-02-21T09:10:47.7214333Z .b8 8 // DW_FORM_string 2026-02-21T09:10:47.7214407Z .b8 19 // DW_AT_language 2026-02-21T09:10:47.7214491Z .b8 5 // DW_FORM_data2 2026-02-21T09:10:47.7214564Z .b8 3 // DW_AT_name 2026-02-21T09:10:47.7214637Z .b8 8 // DW_FORM_string 2026-02-21T09:10:47.7214720Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:10:47.7214795Z .b8 6 // DW_FORM_data4 2026-02-21T09:10:47.7214870Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:10:47.7214942Z .b8 8 // DW_FORM_string 2026-02-21T09:10:47.7215020Z .b8 0 // EOM(1) 2026-02-21T09:10:47.7215088Z .b8 0 // EOM(2) 2026-02-21T09:10:47.7215171Z .b8 2 // Abbreviation Code 2026-02-21T09:10:47.7215262Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:47.7215334Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:47.7215407Z .b8 3 // DW_AT_name 2026-02-21T09:10:47.7215487Z .b8 8 // DW_FORM_string 2026-02-21T09:10:47.7215562Z .b8 32 // DW_AT_inline 2026-02-21T09:10:47.7215665Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:47.7215734Z .b8 0 // EOM(1) 2026-02-21T09:10:47.7215809Z .b8 0 // EOM(2) 2026-02-21T09:10:47.7215889Z .b8 3 // Abbreviation Code 2026-02-21T09:10:47.7215969Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:47.7216056Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:47.7216157Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:47.7216231Z .b8 1 // DW_FORM_addr 
2026-02-21T09:10:47.7216314Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:47.7216386Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:47.7216469Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:47.7216548Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:47.7216616Z .b8 0 // EOM(1) 2026-02-21T09:10:47.7216704Z .b8 0 // EOM(2) 2026-02-21T09:10:47.7216784Z .b8 4 // Abbreviation Code 2026-02-21T09:10:47.7216882Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:10:47.7216977Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:47.7217063Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:47.7217143Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:47.7217214Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:47.7217283Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:47.7217365Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:47.7217435Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:47.7217512Z .b8 88 // DW_AT_call_file 2026-02-21T09:10:47.7217585Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:47.7217667Z .b8 89 // DW_AT_call_line 2026-02-21T09:10:47.7217738Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:47.7217815Z .b8 87 // DW_AT_call_column 2026-02-21T09:10:47.7217896Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:47.7217964Z .b8 0 // EOM(1) 2026-02-21T09:10:47.7218033Z .b8 0 // EOM(2) 2026-02-21T09:10:47.7218109Z .b8 0 // EOM(3) 2026-02-21T09:10:47.7218163Z } 2026-02-21T09:10:47.7218227Z .section .debug_info 2026-02-21T09:10:47.7218279Z { 2026-02-21T09:10:47.7218370Z .b32 178 // Length of Unit 2026-02-21T09:10:47.7218457Z .b8 2 // DWARF version number 2026-02-21T09:10:47.7218513Z .b8 0 2026-02-21T09:10:47.7218643Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:10:47.7218733Z .b8 8 // Address Size (in bytes) 2026-02-21T09:10:47.7218836Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:10:47.7218926Z .b8 116 // DW_AT_producer 2026-02-21T09:10:47.7218984Z .b8 114 2026-02-21T09:10:47.7219039Z .b8 105 2026-02-21T09:10:47.7219093Z .b8 116 2026-02-21T09:10:47.7219155Z .b8 111 2026-02-21T09:10:47.7219208Z .b8 110 2026-02-21T09:10:47.7219262Z .b8 0 2026-02-21T09:10:47.7219346Z .b8 2 // DW_AT_language 2026-02-21T09:10:47.7219400Z .b8 0 2026-02-21T09:10:47.7219479Z .b8 99 // DW_AT_name 2026-02-21T09:10:47.7219533Z .b8 112 2026-02-21T09:10:47.7219593Z .b8 101 2026-02-21T09:10:47.7219645Z .b8 121 2026-02-21T09:10:47.7219698Z .b8 118 2026-02-21T09:10:47.7219781Z .b8 108 2026-02-21T09:10:47.7219834Z .b8 121 2026-02-21T09:10:47.7219887Z .b8 53 2026-02-21T09:10:47.7219940Z .b8 107 2026-02-21T09:10:47.7220000Z .b8 101 2026-02-21T09:10:47.7220052Z .b8 107 2026-02-21T09:10:47.7220105Z .b8 103 2026-02-21T09:10:47.7220157Z .b8 101 2026-02-21T09:10:47.7220218Z .b8 120 2026-02-21T09:10:47.7220271Z .b8 111 2026-02-21T09:10:47.7220324Z .b8 110 2026-02-21T09:10:47.7220386Z .b8 50 2026-02-21T09:10:47.7220439Z .b8 115 2026-02-21T09:10:47.7220517Z .b8 119 2026-02-21T09:10:47.7220572Z .b8 118 2026-02-21T09:10:47.7220634Z .b8 115 2026-02-21T09:10:47.7220688Z .b8 114 2026-02-21T09:10:47.7220744Z .b8 55 2026-02-21T09:10:47.7220804Z .b8 111 2026-02-21T09:10:47.7220857Z .b8 119 2026-02-21T09:10:47.7220909Z .b8 111 2026-02-21T09:10:47.7220961Z .b8 51 2026-02-21T09:10:47.7221019Z .b8 51 2026-02-21T09:10:47.7221071Z .b8 115 2026-02-21T09:10:47.7221123Z .b8 103 2026-02-21T09:10:47.7221181Z .b8 55 2026-02-21T09:10:47.7221232Z .b8 110 2026-02-21T09:10:47.7221284Z .b8 107 2026-02-21T09:10:47.7221338Z .b8 118 2026-02-21T09:10:47.7221396Z .b8 119 2026-02-21T09:10:47.7221449Z .b8 54 2026-02-21T09:10:47.7221573Z .b8 119 2026-02-21T09:10:47.7221628Z .b8 102 2026-02-21T09:10:47.7221690Z .b8 119 2026-02-21T09:10:47.7221743Z .b8 116 
2026-02-21T09:10:47.7221796Z .b8 117 2026-02-21T09:10:47.7221856Z .b8 53 2026-02-21T09:10:47.7221909Z .b8 105 2026-02-21T09:10:47.7221963Z .b8 120 2026-02-21T09:10:47.7222015Z .b8 51 2026-02-21T09:10:47.7222103Z .b8 114 2026-02-21T09:10:47.7222160Z .b8 100 2026-02-21T09:10:47.7222213Z .b8 103 2026-02-21T09:10:47.7222273Z .b8 54 2026-02-21T09:10:47.7222325Z .b8 51 2026-02-21T09:10:47.7222380Z .b8 105 2026-02-21T09:10:47.7222432Z .b8 112 2026-02-21T09:10:47.7222494Z .b8 46 2026-02-21T09:10:47.7222547Z .b8 112 2026-02-21T09:10:47.7222600Z .b8 121 2026-02-21T09:10:47.7222653Z .b8 0 2026-02-21T09:10:47.7222755Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:10:47.7222834Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:10:47.7222890Z .b8 116 2026-02-21T09:10:47.7222951Z .b8 109 2026-02-21T09:10:47.7223005Z .b8 112 2026-02-21T09:10:47.7223057Z .b8 47 2026-02-21T09:10:47.7223113Z .b8 116 2026-02-21T09:10:47.7223173Z .b8 111 2026-02-21T09:10:47.7223225Z .b8 114 2026-02-21T09:10:47.7223277Z .b8 99 2026-02-21T09:10:47.7223337Z .b8 104 2026-02-21T09:10:47.7223390Z .b8 105 2026-02-21T09:10:47.7223441Z .b8 110 2026-02-21T09:10:47.7223492Z .b8 100 2026-02-21T09:10:47.7223552Z .b8 117 2026-02-21T09:10:47.7223608Z .b8 99 2026-02-21T09:10:47.7223662Z .b8 116 2026-02-21T09:10:47.7223721Z .b8 111 2026-02-21T09:10:47.7223775Z .b8 114 2026-02-21T09:10:47.7223827Z .b8 95 2026-02-21T09:10:47.7223878Z .b8 114 2026-02-21T09:10:47.7223936Z .b8 111 2026-02-21T09:10:47.7223987Z .b8 111 2026-02-21T09:10:47.7224039Z .b8 116 2026-02-21T09:10:47.7224097Z .b8 47 2026-02-21T09:10:47.7224149Z .b8 112 2026-02-21T09:10:47.7224201Z .b8 101 2026-02-21T09:10:47.7224253Z .b8 0 2026-02-21T09:10:47.7224361Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:10:47.7224439Z .b8 95 // DW_AT_name 2026-02-21T09:10:47.7224493Z .b8 104 2026-02-21T09:10:47.7224553Z .b8 101 2026-02-21T09:10:47.7224605Z .b8 108 2026-02-21T09:10:47.7224657Z .b8 105 2026-02-21T09:10:47.7224710Z .b8 111 2026-02-21T09:10:47.7224769Z 
.b8 110 2026-02-21T09:10:47.7224821Z .b8 95 2026-02-21T09:10:47.7224875Z .b8 109 2026-02-21T09:10:47.7224927Z .b8 97 2026-02-21T09:10:47.7224989Z .b8 116 2026-02-21T09:10:47.7225042Z .b8 109 2026-02-21T09:10:47.7225094Z .b8 117 2026-02-21T09:10:47.7225153Z .b8 108 2026-02-21T09:10:47.7225205Z .b8 95 2026-02-21T09:10:47.7225257Z .b8 98 2026-02-21T09:10:47.7225310Z .b8 102 2026-02-21T09:10:47.7225375Z .b8 49 2026-02-21T09:10:47.7225424Z .b8 54 2026-02-21T09:10:47.7225474Z .b8 95 2026-02-21T09:10:47.7225532Z .b8 105 2026-02-21T09:10:47.7225582Z .b8 110 2026-02-21T09:10:47.7225631Z .b8 116 2026-02-21T09:10:47.7225681Z .b8 52 2026-02-21T09:10:47.7225738Z .b8 0 2026-02-21T09:10:47.7225842Z .b8 1 // DW_AT_inline 2026-02-21T09:10:47.7225938Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:10:47.7226032Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:10:47.7226119Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:10:47.7226209Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:47.7226320Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:10:47.7226440Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:47.7226523Z .b64 $L__tmp1 // DW_AT_low_pc 2026-02-21T09:10:47.7226608Z .b64 $L__tmp258 // DW_AT_high_pc 2026-02-21T09:10:47.7226694Z .b8 1 // DW_AT_call_file 2026-02-21T09:10:47.7226772Z .b8 94 // DW_AT_call_line 2026-02-21T09:10:47.7226861Z .b8 40 // DW_AT_call_column 2026-02-21T09:10:47.7226984Z .b8 0 // End Of Children Mark 2026-02-21T09:10:47.7227069Z .b8 0 // End Of Children Mark 2026-02-21T09:10:47.7227121Z } 2026-02-21T09:10:47.7227191Z .section .debug_macinfo { } 2026-02-21T09:10:47.7227195Z 2026-02-21T09:10:47.7227268Z ================================================================ 2026-02-21T09:10:47.7227392Z please share the reproducer above with Triton project. 2026-02-21T09:10:48.0874179Z 2026-02-21T09:10:48.0877749Z [235s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:10:48.0877763Z 2026-02-21T09:10:48.0883844Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 3], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:10:48.0883862Z 2026-02-21T09:10:48.0884376Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:10:48.0884453Z `ptxas` stderr: 2026-02-21T09:10:48.0884838Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 279 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 2026-02-21T09:10:48.0884938Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:48.0884945Z 2026-02-21T09:10:48.0885339Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpu3tk4vra.ptx -o /tmp/tmpu3tk4vra.ptx.o 2026-02-21T09:10:48.0885344Z 2026-02-21T09:10:48.0885488Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:10:48.0885581Z ================================================================ 2026-02-21T09:10:48.0885657Z Internal Triton PTX codegen error 2026-02-21T09:10:48.0885728Z `ptxas` stderr: 2026-02-21T09:10:48.0886056Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 279 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:10:48.0886145Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:10:48.0886151Z 2026-02-21T09:10:48.0886521Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpu3tk4vra.ptx -o /tmp/tmpu3tk4vra.ptx.o 2026-02-21T09:10:48.0886525Z 2026-02-21T09:10:48.0886528Z 2026-02-21T09:10:48.0886581Z // 2026-02-21T09:10:48.0886654Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:10:48.0886709Z // 2026-02-21T09:10:48.0886713Z 2026-02-21T09:10:48.0886769Z .version 8.7 2026-02-21T09:10:48.0887012Z .target sm_100a 2026-02-21T09:10:48.0887071Z .address_size 64 2026-02-21T09:10:48.0887075Z 2026-02-21T09:10:48.0887223Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:10:48.0887305Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:10:48.0887398Z // @_helion_matmul_bf16_int4 2026-02-21T09:10:48.0887472Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:10:48.0887869Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:10:48.0888075Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:10:48.0888180Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:10:48.0888286Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:10:48.0888393Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:10:48.0888451Z ) 2026-02-21T09:10:48.0888507Z .reqntid 128 2026-02-21T09:10:48.0888563Z .maxnreg 32 2026-02-21T09:10:48.0888622Z { 2026-02-21T09:10:48.0888684Z .reg .pred %p<105>; 2026-02-21T09:10:48.0888792Z .reg .b16 %rs<225>; 2026-02-21T09:10:48.0888848Z .reg .b32 %r<891>; 2026-02-21T09:10:48.0888912Z .reg .b64 %rd<356>; 2026-02-21T09:10:48.0888969Z $L__func_begin0: 2026-02-21T09:10:48.0888972Z 2026-02-21T09:10:48.0889025Z // %bb.0: 2026-02-21T09:10:48.0889241Z .loc 1 19 0 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:19 2026-02-21T09:10:48.0889305Z mov.u32 %r1, %tid.x; 2026-02-21T09:10:48.0889409Z ld.param.b64 %rd16, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:10:48.0889481Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:10:48.0889578Z ld.param.b64 %rd34, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:10:48.0889645Z mov.b32 %r48, global_smem; 2026-02-21T09:10:48.0889704Z // begin inline asm 2026-02-21T09:10:48.0889855Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r48], 256; 2026-02-21T09:10:48.0889914Z // end inline asm 2026-02-21T09:10:48.0890007Z ld.param.b64 %rd51, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:10:48.0890073Z bar.sync 0; 2026-02-21T09:10:48.0890144Z ld.shared.b32 %r884, [global_smem]; 2026-02-21T09:10:48.0890200Z bar.sync 0; 2026-02-21T09:10:48.0890259Z // begin inline asm 2026-02-21T09:10:48.0890386Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:10:48.0890445Z // end inline asm 2026-02-21T09:10:48.0890625Z .loc 1 21 66 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:21:66 2026-02-21T09:10:48.0890700Z mov.u32 %r3, %ctaid.x; 2026-02-21T09:10:48.0890762Z mov.u32 %r65, %ctaid.y; 2026-02-21T09:10:48.0890823Z mov.u32 %r66, %ctaid.z; 2026-02-21T09:10:48.0890894Z mov.u32 %r67, %nctaid.x; 2026-02-21T09:10:48.0890955Z mov.u32 %r68, %nctaid.y; 2026-02-21T09:10:48.0891021Z mad.lo.s32 %r69, %r66, %r68, %r65; 2026-02-21T09:10:48.0891081Z mad.lo.s32 %r70, %r69, %r67, %r3; 2026-02-21T09:10:48.0891145Z shl.b32 %r71, %r70, 8; 2026-02-21T09:10:48.0891204Z cvt.s64.s32 %rd52, %r71; 2026-02-21T09:10:48.0891264Z add.s64 %rd30, %rd51, %rd52; 2026-02-21T09:10:48.0891327Z shl.b32 %r72, %r1, 2; 2026-02-21T09:10:48.0891384Z add.s32 %r49, %r48, %r72; 2026-02-21T09:10:48.0891436Z mov.b32 %r58, 0; 2026-02-21T09:10:48.0891492Z // begin inline asm 2026-02-21T09:10:48.0891749Z @%p1 st.shared.b32 [ %r49 + 0 ], %r58; 2026-02-21T09:10:48.0891810Z // end inline asm 
2026-02-21T09:10:48.0891872Z bar.warp.sync -1; 2026-02-21T09:10:48.0891938Z setp.eq.b32 %p97, %r1, 0; 2026-02-21T09:10:48.0891996Z cvt.u64.u32 %rd15, %r48; 2026-02-21T09:10:48.0892053Z // begin inline asm 2026-02-21T09:10:48.0892229Z @%p97 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd16; 2026-02-21T09:10:48.0892284Z // end inline asm 2026-02-21T09:10:48.0892337Z // begin inline asm 2026-02-21T09:10:48.0892477Z @%p97 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T09:10:48.0892574Z // end inline asm 2026-02-21T09:10:48.0892631Z mov.b32 %r51, 128; 2026-02-21T09:10:48.0892687Z // begin inline asm 2026-02-21T09:10:48.0892844Z @%p97 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r51; 2026-02-21T09:10:48.0892898Z // end inline asm 2026-02-21T09:10:48.0892953Z mov.b32 %r214, 16; 2026-02-21T09:10:48.0893008Z // begin inline asm 2026-02-21T09:10:48.0893175Z @%p97 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r214; 2026-02-21T09:10:48.0893264Z // end inline asm 2026-02-21T09:10:48.0893320Z mov.b32 %r53, 8192; 2026-02-21T09:10:48.0893382Z // begin inline asm 2026-02-21T09:10:48.0893537Z @%p97 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r53; 2026-02-21T09:10:48.0893590Z // end inline asm 2026-02-21T09:10:48.0893650Z mov.b32 %r54, 512; 2026-02-21T09:10:48.0893705Z // begin inline asm 2026-02-21T09:10:48.0893858Z @%p97 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r54; 2026-02-21T09:10:48.0893920Z // end inline asm 2026-02-21T09:10:48.0893977Z mov.b64 %rd23, 8192; 2026-02-21T09:10:48.0894053Z // begin inline asm 2026-02-21T09:10:48.0894219Z @%p97 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd23; 2026-02-21T09:10:48.0894278Z // end inline asm 2026-02-21T09:10:48.0894333Z mov.b32 %r55, 1; 2026-02-21T09:10:48.0894386Z // begin inline asm 
2026-02-21T09:10:48.0894583Z @%p97 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r55; 2026-02-21T09:10:48.0894642Z // end inline asm 2026-02-21T09:10:48.0894698Z // begin inline asm 2026-02-21T09:10:48.0894866Z @%p97 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r55; 2026-02-21T09:10:48.0894919Z // end inline asm 2026-02-21T09:10:48.0894973Z // begin inline asm 2026-02-21T09:10:48.0895120Z @%p97 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:48.0895184Z // end inline asm 2026-02-21T09:10:48.0895237Z // begin inline asm 2026-02-21T09:10:48.0895399Z @%p97 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:48.0895455Z // end inline asm 2026-02-21T09:10:48.0895508Z // begin inline asm 2026-02-21T09:10:48.0895655Z @%p97 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x3; 2026-02-21T09:10:48.0895714Z // end inline asm 2026-02-21T09:10:48.0895769Z // begin inline asm 2026-02-21T09:10:48.0895914Z @%p97 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:48.0895966Z // end inline asm 2026-02-21T09:10:48.0896026Z // begin inline asm 2026-02-21T09:10:48.0896280Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd30 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T09:10:48.0896333Z // end inline asm 2026-02-21T09:10:48.0896392Z // begin inline asm 2026-02-21T09:10:48.0896514Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd30 + 0 ], 0x80; 2026-02-21T09:10:48.0896586Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:48.0896667Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:48.0896721Z // end inline asm 2026-02-21T09:10:48.0896772Z bar.sync 0; 2026-02-21T09:10:48.0896838Z cvta.global.u64 %rd77, %rd30; 2026-02-21T09:10:48.0897012Z .loc 1 23 67 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:23:67 2026-02-21T09:10:48.0897073Z add.s64 %rd48, %rd30, 128; 2026-02-21T09:10:48.0897128Z bar.sync 0; 2026-02-21T09:10:48.0897189Z // begin inline asm 2026-02-21T09:10:48.0897255Z @%p1 st.shared.b32 [ %r49 + 0 ], %r58; 2026-02-21T09:10:48.0897309Z // end inline asm 2026-02-21T09:10:48.0897374Z bar.warp.sync -1; 2026-02-21T09:10:48.0897429Z // begin inline asm 2026-02-21T09:10:48.0897588Z @%p97 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd15 + 0 ], %rd34; 2026-02-21T09:10:48.0897642Z // end inline asm 2026-02-21T09:10:48.0897704Z // begin inline asm 2026-02-21T09:10:48.0897869Z @%p97 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1; 2026-02-21T09:10:48.0897925Z // end inline asm 2026-02-21T09:10:48.0897987Z mov.b32 %r59, 64; 2026-02-21T09:10:48.0898042Z // begin inline asm 2026-02-21T09:10:48.0898190Z @%p97 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r59; 2026-02-21T09:10:48.0898254Z // end inline asm 2026-02-21T09:10:48.0898312Z // begin inline asm 2026-02-21T09:10:48.0898497Z @%p97 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r51; 2026-02-21T09:10:48.0898553Z // end inline asm 2026-02-21T09:10:48.0898615Z // begin inline asm 2026-02-21T09:10:48.0898772Z @%p97 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r53; 2026-02-21T09:10:48.0898825Z // end inline asm 2026-02-21T09:10:48.0898884Z mov.b32 %r62, 4096; 2026-02-21T09:10:48.0898939Z // begin inline asm 2026-02-21T09:10:48.0899089Z @%p97 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r62; 2026-02-21T09:10:48.0899152Z // end inline asm 2026-02-21T09:10:48.0899230Z mov.b64 %rd41, 16384; 2026-02-21T09:10:48.0899286Z // begin inline asm 2026-02-21T09:10:48.0899448Z @%p97 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd15 + 0 ], 0x0, %rd41; 2026-02-21T09:10:48.0899507Z // end inline asm 
2026-02-21T09:10:48.0899563Z // begin inline asm 2026-02-21T09:10:48.0899748Z @%p97 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0, %r55; 2026-02-21T09:10:48.0899809Z // end inline asm 2026-02-21T09:10:48.0899862Z // begin inline asm 2026-02-21T09:10:48.0900021Z @%p97 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x1, %r55; 2026-02-21T09:10:48.0900074Z // end inline asm 2026-02-21T09:10:48.0900129Z // begin inline asm 2026-02-21T09:10:48.0900276Z @%p97 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd15 + 0 ], 0xa; 2026-02-21T09:10:48.0900326Z // end inline asm 2026-02-21T09:10:48.0900383Z // begin inline asm 2026-02-21T09:10:48.0900546Z @%p97 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:48.0900597Z // end inline asm 2026-02-21T09:10:48.0900658Z // begin inline asm 2026-02-21T09:10:48.0900806Z @%p97 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x3; 2026-02-21T09:10:48.0900860Z // end inline asm 2026-02-21T09:10:48.0900918Z // begin inline asm 2026-02-21T09:10:48.0901061Z @%p97 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd15 + 0 ], 0x0; 2026-02-21T09:10:48.0901116Z // end inline asm 2026-02-21T09:10:48.0901179Z // begin inline asm 2026-02-21T09:10:48.0901430Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd48 + 0 ], [ %rd15 + 0 ], 0x80; 2026-02-21T09:10:48.0901486Z // end inline asm 2026-02-21T09:10:48.0901585Z // begin inline asm 2026-02-21T09:10:48.0901717Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd48 + 0 ], 0x80; 2026-02-21T09:10:48.0901791Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:10:48.0901864Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:10:48.0901928Z // end inline asm 2026-02-21T09:10:48.0901983Z bar.sync 0; 2026-02-21T09:10:48.0902048Z cvta.global.u64 %rd97, %rd48; 2026-02-21T09:10:48.0902227Z .loc 1 31 74 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:31:74 2026-02-21T09:10:48.0902294Z setp.gt.u32 %p39, %r3, 2047; 2026-02-21T09:10:48.0902365Z @%p39 bra $L__BB0_8; 2026-02-21T09:10:48.0902448Z // %bb.1: // %.lr.ph 2026-02-21T09:10:48.0902633Z .loc 1 0 74 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:0:74 2026-02-21T09:10:48.0902736Z ld.param.b64 %rd14, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:10:48.0902907Z .loc 1 57 38 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:57:38 2026-02-21T09:10:48.0902978Z shl.b32 %r275, %r1, 3; 2026-02-21T09:10:48.0903079Z and.b32 %r276, %r275, 24; 2026-02-21T09:10:48.0903250Z .loc 1 44 45 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:44:45 2026-02-21T09:10:48.0903323Z bfe.u32 %r4, %r1, 2, 5; 2026-02-21T09:10:48.0903385Z shr.u32 %r277, %r1, 5; 2026-02-21T09:10:48.0903445Z and.b32 %r5, %r1, 127; 2026-02-21T09:10:48.0903504Z shl.b32 %r6, %r5, 4; 2026-02-21T09:10:48.0903573Z add.s32 %r279, %r48, %r6; 2026-02-21T09:10:48.0903637Z add.s32 %r383, %r279, 16384; 2026-02-21T09:10:48.0903757Z add.s32 %r385, %r279, 18432; 2026-02-21T09:10:48.0903824Z add.s32 %r387, %r279, 20480; 2026-02-21T09:10:48.0903883Z add.s32 %r389, %r279, 22528; 2026-02-21T09:10:48.0903946Z add.s32 %r369, %r884, 128; 2026-02-21T09:10:48.0904006Z shl.b32 %r280, %r5, 7; 2026-02-21T09:10:48.0904074Z shl.b32 %r281, %r1, 4; 2026-02-21T09:10:48.0904137Z and.b32 %r282, %r281, 112; 2026-02-21T09:10:48.0904201Z or.b32 %r283, %r280, %r282; 2026-02-21T09:10:48.0904269Z add.s32 %r13, %r48, %r283; 2026-02-21T09:10:48.0904330Z xor.b32 %r284, %r283, 16; 2026-02-21T09:10:48.0908302Z add.s32 %r14, %r48, %r284; 2026-02-21T09:10:48.0908506Z xor.b32 %r285, %r283, 32; 2026-02-21T09:10:48.0908586Z add.s32 %r15, %r48, %r285; 2026-02-21T09:10:48.0908650Z xor.b32 %r286, %r283, 48; 2026-02-21T09:10:48.0908711Z add.s32 %r16, %r48, %r286; 2026-02-21T09:10:48.0908779Z xor.b32 %r287, %r283, 64; 2026-02-21T09:10:48.0908955Z add.s32 %r17, %r48, %r287; 
2026-02-21T09:10:48.0909028Z xor.b32 %r288, %r283, 80; 2026-02-21T09:10:48.0909097Z add.s32 %r18, %r48, %r288; 2026-02-21T09:10:48.0909157Z xor.b32 %r289, %r283, 96; 2026-02-21T09:10:48.0909218Z add.s32 %r19, %r48, %r289; 2026-02-21T09:10:48.0909277Z xor.b32 %r290, %r283, 112; 2026-02-21T09:10:48.0909347Z add.s32 %r20, %r48, %r290; 2026-02-21T09:10:48.0909414Z bfe.u32 %r291, %r48, 4, 14; 2026-02-21T09:10:48.0909478Z cvt.u64.u32 %rd63, %r291; 2026-02-21T09:10:48.0909569Z or.b64 %rd68, %rd63, 4611686293372403712; 2026-02-21T09:10:48.0909637Z add.s32 %r292, %r48, 32; 2026-02-21T09:10:48.0909704Z bfe.u32 %r293, %r292, 4, 14; 2026-02-21T09:10:48.0909769Z cvt.u64.u32 %rd64, %r293; 2026-02-21T09:10:48.0909853Z or.b64 %rd69, %rd64, 4611686293372403712; 2026-02-21T09:10:48.0909916Z add.s32 %r294, %r48, 64; 2026-02-21T09:10:48.0909977Z bfe.u32 %r295, %r294, 4, 14; 2026-02-21T09:10:48.0910044Z cvt.u64.u32 %rd65, %r295; 2026-02-21T09:10:48.0910114Z or.b64 %rd70, %rd65, 4611686293372403712; 2026-02-21T09:10:48.0910177Z add.s32 %r296, %r48, 96; 2026-02-21T09:10:48.0910238Z bfe.u32 %r297, %r296, 4, 14; 2026-02-21T09:10:48.0910308Z cvt.u64.u32 %rd66, %r297; 2026-02-21T09:10:48.0910376Z or.b64 %rd71, %rd66, 4611686293372403712; 2026-02-21T09:10:48.0910437Z or.b32 %r21, %r276, 64; 2026-02-21T09:10:48.0910518Z mad.lo.s32 %r298, %r5, 48, %r383; 2026-02-21T09:10:48.0910582Z xor.b32 %r22, %r5, 112; 2026-02-21T09:10:48.0910647Z add.s32 %r222, %r48, 32768; 2026-02-21T09:10:48.0910710Z add.s32 %r299, %r222, %r22; 2026-02-21T09:10:48.0910783Z xor.b32 %r23, %r5, 96; 2026-02-21T09:10:48.0910847Z add.s32 %r300, %r222, %r23; 2026-02-21T09:10:48.0910910Z xor.b32 %r24, %r5, 80; 2026-02-21T09:10:48.0910980Z add.s32 %r301, %r222, %r24; 2026-02-21T09:10:48.0911040Z xor.b32 %r25, %r5, 64; 2026-02-21T09:10:48.0911100Z add.s32 %r302, %r222, %r25; 2026-02-21T09:10:48.0911166Z xor.b32 %r26, %r5, 48; 2026-02-21T09:10:48.0911226Z add.s32 %r303, %r222, %r26; 2026-02-21T09:10:48.0911285Z xor.b32 %r27, %r5, 
32; 2026-02-21T09:10:48.0911347Z add.s32 %r304, %r222, %r27; 2026-02-21T09:10:48.0911413Z xor.b32 %r28, %r5, 16; 2026-02-21T09:10:48.0911473Z add.s32 %r305, %r222, %r28; 2026-02-21T09:10:48.0911625Z add.s32 %r306, %r222, %r5; 2026-02-21T09:10:48.0911706Z add.s32 %r226, %r279, 24576; 2026-02-21T09:10:48.0911769Z add.s32 %r232, %r279, 30720; 2026-02-21T09:10:48.0911831Z add.s32 %r230, %r279, 28672; 2026-02-21T09:10:48.0911893Z add.s32 %r228, %r279, 26624; 2026-02-21T09:10:48.0912121Z .loc 1 38 33 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:38:33 2026-02-21T09:10:48.0912237Z shr.u32 %r307, %r3, 5; 2026-02-21T09:10:48.0912300Z and.b32 %r308, %r307, 48; 2026-02-21T09:10:48.0912490Z .loc 1 40 64 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:40:64 2026-02-21T09:10:48.0912551Z and.b32 %r309, %r3, 15; 2026-02-21T09:10:48.0912728Z .loc 1 40 30 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:40:30 2026-02-21T09:10:48.0912836Z or.b32 %r310, %r308, %r309; 2026-02-21T09:10:48.0913008Z .loc 1 42 27 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:42:27 2026-02-21T09:10:48.0913070Z shl.b32 %r393, %r310, 7; 2026-02-21T09:10:48.0913237Z .loc 1 43 27 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:43:27 2026-02-21T09:10:48.0913306Z shl.b32 %r311, %r3, 3; 2026-02-21T09:10:48.0913366Z and.b32 %r685, %r311, 3968; 2026-02-21T09:10:48.0913534Z .loc 1 44 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:44:32 2026-02-21T09:10:48.0913633Z or.b32 %r312, %r685, %r4; 2026-02-21T09:10:48.0913806Z .loc 1 58 53 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:53 2026-02-21T09:10:48.0913868Z shl.b32 %r313, %r312, 10; 2026-02-21T09:10:48.0913936Z $L__tmp0: 2026-02-21T09:10:48.0914208Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.0914294Z shfl.sync.idx.b32 %r31, %r277, 0, 31, -1; 2026-02-21T09:10:48.0914357Z shl.b32 %r314, %r31, 21; 
2026-02-21T09:10:48.0914430Z and.b32 %r315, %r314, 6291456; 2026-02-21T09:10:48.0914500Z add.s32 %r683, %r315, %r884; 2026-02-21T09:10:48.0914567Z mov.pred %p64, -1; 2026-02-21T09:10:48.0914634Z // begin inline asm 2026-02-21T09:10:48.0914891Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r683 + 0], {%r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58}; 2026-02-21T09:10:48.0914951Z // end inline asm 2026-02-21T09:10:48.0915017Z // begin inline asm 2026-02-21T09:10:48.0915262Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r683 + 16], {%r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58}; 2026-02-21T09:10:48.0915316Z // end inline asm 2026-02-21T09:10:48.0915373Z // begin inline asm 2026-02-21T09:10:48.0915614Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r683 + 32], {%r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58}; 2026-02-21T09:10:48.0915671Z // end inline asm 2026-02-21T09:10:48.0915727Z // begin inline asm 2026-02-21T09:10:48.0915973Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r683 + 48], {%r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58}; 2026-02-21T09:10:48.0916027Z // end inline asm 2026-02-21T09:10:48.0916084Z // begin inline asm 2026-02-21T09:10:48.0916327Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r683 + 64], {%r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58}; 2026-02-21T09:10:48.0916384Z // end inline asm 2026-02-21T09:10:48.0916439Z // begin inline asm 2026-02-21T09:10:48.0916677Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r683 + 80], {%r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58}; 2026-02-21T09:10:48.0916733Z // end inline asm 2026-02-21T09:10:48.0916788Z // begin inline asm 2026-02-21T09:10:48.0917018Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r683 + 96], 
{%r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58}; 2026-02-21T09:10:48.0917078Z // end inline asm 2026-02-21T09:10:48.0917133Z // begin inline asm 2026-02-21T09:10:48.0917372Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r683 + 112], {%r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58, %r58}; 2026-02-21T09:10:48.0917454Z // end inline asm 2026-02-21T09:10:48.0917509Z // begin inline asm 2026-02-21T09:10:48.0917592Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:48.0917652Z // end inline asm 2026-02-21T09:10:48.0917709Z bar.sync 0; 2026-02-21T09:10:48.0917764Z $L__tmp1: 2026-02-21T09:10:48.0917942Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0918010Z add.s32 %r886, %r48, 36864; 2026-02-21T09:10:48.0918088Z // begin inline asm 2026-02-21T09:10:48.0918180Z @%p97 mbarrier.init.shared::cta.b64 [%r886], 1; 2026-02-21T09:10:48.0918242Z // end inline asm 2026-02-21T09:10:48.0918297Z bar.sync 0; 2026-02-21T09:10:48.0918357Z add.s32 %r210, %r48, 36872; 2026-02-21T09:10:48.0918419Z // begin inline asm 2026-02-21T09:10:48.0918502Z @%p97 mbarrier.init.shared::cta.b64 [%r210], 1; 2026-02-21T09:10:48.0918556Z // end inline asm 2026-02-21T09:10:48.0918614Z add.s32 %r391, %r48, 36880; 2026-02-21T09:10:48.0918680Z // begin inline asm 2026-02-21T09:10:48.0918761Z @%p97 mbarrier.init.shared::cta.b64 [%r391], 1; 2026-02-21T09:10:48.0918834Z // end inline asm 2026-02-21T09:10:48.0918898Z bar.sync 0; 2026-02-21T09:10:48.0918957Z add.s32 %r212, %r48, 36888; 2026-02-21T09:10:48.0919013Z // begin inline asm 2026-02-21T09:10:48.0919092Z @%p97 mbarrier.init.shared::cta.b64 [%r212], 1; 2026-02-21T09:10:48.0919156Z // end inline asm 2026-02-21T09:10:48.0919355Z .loc 1 58 60 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:60 2026-02-21T09:10:48.0919419Z or.b32 %r316, %r313, %r276; 2026-02-21T09:10:48.0919593Z .loc 1 58 32 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:32 2026-02-21T09:10:48.0919664Z mad.wide.u32 %rd53, %r316, 2, %rd14; 2026-02-21T09:10:48.0919725Z cvt.u64.u32 %rd7, %r313; 2026-02-21T09:10:48.0919793Z add.s64 %rd54, %rd53, 65536; 2026-02-21T09:10:48.0919855Z add.s64 %rd55, %rd53, 131072; 2026-02-21T09:10:48.0919916Z add.s64 %rd56, %rd53, 196608; 2026-02-21T09:10:48.0920083Z .loc 1 58 80 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:80 2026-02-21T09:10:48.0920149Z // begin inline asm 2026-02-21T09:10:48.0920273Z cp.async.cg.shared.global [ %r383 + 0 ], [ %rd53 + 0 ], 0x10, %r214; 2026-02-21T09:10:48.0920337Z // end inline asm 2026-02-21T09:10:48.0920399Z // begin inline asm 2026-02-21T09:10:48.0920514Z cp.async.cg.shared.global [ %r385 + 0 ], [ %rd54 + 0 ], 0x10, %r214; 2026-02-21T09:10:48.0920568Z // end inline asm 2026-02-21T09:10:48.0920630Z // begin inline asm 2026-02-21T09:10:48.0920738Z cp.async.cg.shared.global [ %r387 + 0 ], [ %rd55 + 0 ], 0x10, %r214; 2026-02-21T09:10:48.0920797Z // end inline asm 2026-02-21T09:10:48.0920855Z // begin inline asm 2026-02-21T09:10:48.0920977Z cp.async.cg.shared.global [ %r389 + 0 ], [ %rd56 + 0 ], 0x10, %r214; 2026-02-21T09:10:48.0921032Z // end inline asm 2026-02-21T09:10:48.0921098Z cp.async.commit_group; 2026-02-21T09:10:48.0921277Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0921332Z bar.sync 0; 2026-02-21T09:10:48.0921387Z // begin inline asm 2026-02-21T09:10:48.0921494Z @%p97 mbarrier.arrive.expect_tx.shared.b64 _, [%r391], 2048; 2026-02-21T09:10:48.0921601Z // end inline asm 2026-02-21T09:10:48.0921761Z .loc 1 64 33 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:64:33 2026-02-21T09:10:48.0921816Z bar.sync 0; 2026-02-21T09:10:48.0921889Z elect.sync %r317|%p59, -1; 2026-02-21T09:10:48.0921954Z and.pred %p53, %p1, %p59; 2026-02-21T09:10:48.0922008Z // begin inline asm 2026-02-21T09:10:48.0922255Z @%p53 
cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r222], [%rd77, {%r393, %r58}], [%r391]; 2026-02-21T09:10:48.0922310Z // end inline asm 2026-02-21T09:10:48.0922469Z .loc 1 58 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:32 2026-02-21T09:10:48.0922606Z add.s64 %rd58, %rd53, 64; 2026-02-21T09:10:48.0922664Z or.b32 %r318, %r316, 32; 2026-02-21T09:10:48.0922731Z mad.wide.u32 %rd67, %r318, 2, %rd14; 2026-02-21T09:10:48.0922790Z add.s64 %rd59, %rd67, 65536; 2026-02-21T09:10:48.0922857Z add.s64 %rd60, %rd67, 131072; 2026-02-21T09:10:48.0922915Z add.s64 %rd61, %rd67, 196608; 2026-02-21T09:10:48.0923080Z .loc 1 58 80 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:80 2026-02-21T09:10:48.0923173Z // begin inline asm 2026-02-21T09:10:48.0923285Z cp.async.cg.shared.global [ %r226 + 0 ], [ %rd58 + 0 ], 0x10, %r214; 2026-02-21T09:10:48.0923340Z // end inline asm 2026-02-21T09:10:48.0923396Z // begin inline asm 2026-02-21T09:10:48.0923509Z cp.async.cg.shared.global [ %r228 + 0 ], [ %rd59 + 0 ], 0x10, %r214; 2026-02-21T09:10:48.0923562Z // end inline asm 2026-02-21T09:10:48.0923618Z // begin inline asm 2026-02-21T09:10:48.0923732Z cp.async.cg.shared.global [ %r230 + 0 ], [ %rd60 + 0 ], 0x10, %r214; 2026-02-21T09:10:48.0923787Z // end inline asm 2026-02-21T09:10:48.0923843Z // begin inline asm 2026-02-21T09:10:48.0923983Z cp.async.cg.shared.global [ %r232 + 0 ], [ %rd61 + 0 ], 0x10, %r214; 2026-02-21T09:10:48.0924037Z // end inline asm 2026-02-21T09:10:48.0924100Z cp.async.commit_group; 2026-02-21T09:10:48.0924268Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0924351Z bar.sync 0; 2026-02-21T09:10:48.0924407Z // begin inline asm 2026-02-21T09:10:48.0924518Z @%p97 mbarrier.arrive.expect_tx.shared.b64 _, [%r212], 2048; 2026-02-21T09:10:48.0924578Z // end inline asm 2026-02-21T09:10:48.0924739Z .loc 1 64 33 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:64:33 2026-02-21T09:10:48.0924792Z bar.sync 0; 2026-02-21T09:10:48.0924855Z elect.sync %r319|%p60, -1; 2026-02-21T09:10:48.0924925Z and.pred %p55, %p1, %p60; 2026-02-21T09:10:48.0924983Z add.s32 %r235, %r48, 34816; 2026-02-21T09:10:48.0925040Z // begin inline asm 2026-02-21T09:10:48.0925290Z @%p55 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r235], [%rd77, {%r393, %r214}], [%r212]; 2026-02-21T09:10:48.0925344Z // end inline asm 2026-02-21T09:10:48.0925504Z .loc 1 58 80 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:80 2026-02-21T09:10:48.0925573Z cp.async.wait_group 1; 2026-02-21T09:10:48.0925625Z bar.sync 0; 2026-02-21T09:10:48.0925720Z ld.shared.v4.b32 {%r320, %r321, %r322, %r323}, [%r298]; 2026-02-21T09:10:48.0925781Z mov.b32 {%rs1, %rs2}, %r323; 2026-02-21T09:10:48.0925847Z mov.b32 {%rs3, %rs4}, %r322; 2026-02-21T09:10:48.0925905Z mov.b32 {%rs5, %rs6}, %r321; 2026-02-21T09:10:48.0925961Z mov.b32 {%rs7, %rs8}, %r320; 2026-02-21T09:10:48.0926063Z ld.shared.v4.b32 {%r324, %r325, %r326, %r327}, [%r298+16]; 2026-02-21T09:10:48.0926123Z mov.b32 {%rs9, %rs10}, %r327; 2026-02-21T09:10:48.0926184Z mov.b32 {%rs11, %rs12}, %r326; 2026-02-21T09:10:48.0926253Z mov.b32 {%rs13, %rs14}, %r325; 2026-02-21T09:10:48.0926311Z mov.b32 {%rs15, %rs16}, %r324; 2026-02-21T09:10:48.0926407Z ld.shared.v4.b32 {%r328, %r329, %r330, %r331}, [%r298+32]; 2026-02-21T09:10:48.0926461Z mov.b32 {%rs17, %rs18}, %r331; 2026-02-21T09:10:48.0926523Z mov.b32 {%rs19, %rs20}, %r330; 2026-02-21T09:10:48.0926580Z mov.b32 {%rs21, %rs22}, %r329; 2026-02-21T09:10:48.0926637Z mov.b32 {%rs23, %rs24}, %r328; 2026-02-21T09:10:48.0926735Z ld.shared.v4.b32 {%r332, %r333, %r334, %r335}, [%r298+48]; 2026-02-21T09:10:48.0926791Z mov.b32 {%rs25, %rs26}, %r335; 2026-02-21T09:10:48.0926848Z mov.b32 {%rs27, %rs28}, %r334; 2026-02-21T09:10:48.0926904Z mov.b32 {%rs29, %rs30}, %r333; 2026-02-21T09:10:48.0926965Z mov.b32 
{%rs31, %rs32}, %r332; 2026-02-21T09:10:48.0927126Z .loc 1 62 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:62:32 2026-02-21T09:10:48.0927186Z cvt.f32.bf16 %r240, %rs7; 2026-02-21T09:10:48.0927276Z cvt.f32.bf16 %r241, %rs8; 2026-02-21T09:10:48.0927334Z cvt.f32.bf16 %r242, %rs5; 2026-02-21T09:10:48.0927392Z cvt.f32.bf16 %r243, %rs6; 2026-02-21T09:10:48.0927456Z cvt.f32.bf16 %r244, %rs3; 2026-02-21T09:10:48.0927510Z cvt.f32.bf16 %r245, %rs4; 2026-02-21T09:10:48.0927567Z cvt.f32.bf16 %r246, %rs1; 2026-02-21T09:10:48.0927620Z cvt.f32.bf16 %r247, %rs2; 2026-02-21T09:10:48.0927686Z cvt.f32.bf16 %r248, %rs15; 2026-02-21T09:10:48.0927748Z cvt.f32.bf16 %r249, %rs16; 2026-02-21T09:10:48.0927828Z cvt.f32.bf16 %r250, %rs13; 2026-02-21T09:10:48.0927895Z cvt.f32.bf16 %r251, %rs14; 2026-02-21T09:10:48.0927952Z cvt.f32.bf16 %r252, %rs11; 2026-02-21T09:10:48.0928011Z cvt.f32.bf16 %r253, %rs12; 2026-02-21T09:10:48.0928068Z cvt.f32.bf16 %r254, %rs9; 2026-02-21T09:10:48.0928137Z cvt.f32.bf16 %r255, %rs10; 2026-02-21T09:10:48.0928194Z cvt.f32.bf16 %r257, %rs23; 2026-02-21T09:10:48.0928252Z cvt.f32.bf16 %r258, %rs24; 2026-02-21T09:10:48.0928312Z cvt.f32.bf16 %r259, %rs21; 2026-02-21T09:10:48.0928370Z cvt.f32.bf16 %r260, %rs22; 2026-02-21T09:10:48.0928427Z cvt.f32.bf16 %r261, %rs19; 2026-02-21T09:10:48.0928504Z cvt.f32.bf16 %r262, %rs20; 2026-02-21T09:10:48.0928566Z cvt.f32.bf16 %r263, %rs17; 2026-02-21T09:10:48.0928621Z cvt.f32.bf16 %r264, %rs18; 2026-02-21T09:10:48.0928676Z cvt.f32.bf16 %r265, %rs31; 2026-02-21T09:10:48.0928739Z cvt.f32.bf16 %r266, %rs32; 2026-02-21T09:10:48.0928794Z cvt.f32.bf16 %r267, %rs29; 2026-02-21T09:10:48.0928868Z cvt.f32.bf16 %r268, %rs30; 2026-02-21T09:10:48.0928928Z cvt.f32.bf16 %r269, %rs27; 2026-02-21T09:10:48.0928992Z cvt.f32.bf16 %r270, %rs28; 2026-02-21T09:10:48.0929051Z cvt.f32.bf16 %r271, %rs25; 2026-02-21T09:10:48.0929107Z cvt.f32.bf16 %r272, %rs26; 2026-02-21T09:10:48.0929166Z $L__tmp2: 2026-02-21T09:10:48.0929385Z .loc 2 291 36 // 
standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.0929444Z add.s32 %r239, %r315, %r369; 2026-02-21T09:10:48.0929510Z // begin inline asm 2026-02-21T09:10:48.0929792Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r239 + 0], {%r240, %r241, %r242, %r243, %r244, %r245, %r246, %r247, %r248, %r249, %r250, %r251, %r252, %r253, %r254, %r255}; 2026-02-21T09:10:48.0929847Z // end inline asm 2026-02-21T09:10:48.0929903Z // begin inline asm 2026-02-21T09:10:48.0930187Z @%p64 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r239 + 16], {%r257, %r258, %r259, %r260, %r261, %r262, %r263, %r264, %r265, %r266, %r267, %r268, %r269, %r270, %r271, %r272}; 2026-02-21T09:10:48.0930243Z // end inline asm 2026-02-21T09:10:48.0930297Z // begin inline asm 2026-02-21T09:10:48.0930370Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:48.0930426Z // end inline asm 2026-02-21T09:10:48.0930479Z bar.sync 0; 2026-02-21T09:10:48.0930538Z $L__tmp3: 2026-02-21T09:10:48.0930710Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0930764Z // begin inline asm 2026-02-21T09:10:48.0930819Z 2026-02-21T09:10:48.0930876Z { 2026-02-21T09:10:48.0930938Z .reg .pred complete; 2026-02-21T09:10:48.0930993Z waitLoop: 2026-02-21T09:10:48.0931119Z mbarrier.try_wait.parity.shared.b64 complete, [%r391], %r58; 2026-02-21T09:10:48.0931183Z @!complete bra.uni waitLoop; 2026-02-21T09:10:48.0931232Z } 2026-02-21T09:10:48.0931238Z 2026-02-21T09:10:48.0931291Z // end inline asm 2026-02-21T09:10:48.0931458Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0931522Z ld.shared.s8 %rs33, [%r306]; 2026-02-21T09:10:48.0931730Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0931798Z shl.b16 %rs34, %rs33, 4; 2026-02-21T09:10:48.0931954Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0932019Z 
ld.shared.s8 %rs35, [%r305+128]; 2026-02-21T09:10:48.0932182Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0932279Z shl.b16 %rs36, %rs35, 4; 2026-02-21T09:10:48.0932438Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0932508Z ld.shared.s8 %rs37, [%r304+256]; 2026-02-21T09:10:48.0932665Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0932724Z shl.b16 %rs38, %rs37, 4; 2026-02-21T09:10:48.0932921Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0932988Z ld.shared.s8 %rs39, [%r303+384]; 2026-02-21T09:10:48.0933147Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0933206Z shl.b16 %rs40, %rs39, 4; 2026-02-21T09:10:48.0933370Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0933432Z ld.shared.s8 %rs41, [%r302+512]; 2026-02-21T09:10:48.0933617Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0933683Z shl.b16 %rs42, %rs41, 4; 2026-02-21T09:10:48.0933841Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0933902Z ld.shared.s8 %rs43, [%r301+640]; 2026-02-21T09:10:48.0934093Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0934154Z shl.b16 %rs44, %rs43, 4; 2026-02-21T09:10:48.0934313Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0934373Z ld.shared.s8 %rs45, [%r300+768]; 2026-02-21T09:10:48.0934541Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0934599Z shl.b16 %rs46, %rs45, 4; 2026-02-21T09:10:48.0934761Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0934830Z ld.shared.s8 %rs47, 
[%r299+896]; 2026-02-21T09:10:48.0934989Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0935046Z shl.b16 %rs48, %rs47, 4; 2026-02-21T09:10:48.0935209Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0935275Z ld.shared.s8 %rs49, [%r306+1024]; 2026-02-21T09:10:48.0935434Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0935495Z shl.b16 %rs50, %rs49, 4; 2026-02-21T09:10:48.0935653Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0935716Z ld.shared.s8 %rs51, [%r305+1152]; 2026-02-21T09:10:48.0935886Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0935944Z shl.b16 %rs52, %rs51, 4; 2026-02-21T09:10:48.0936103Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0936165Z ld.shared.s8 %rs53, [%r304+1280]; 2026-02-21T09:10:48.0936332Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0936388Z shl.b16 %rs54, %rs53, 4; 2026-02-21T09:10:48.0936548Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0936615Z ld.shared.s8 %rs55, [%r303+1408]; 2026-02-21T09:10:48.0936776Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0936833Z shl.b16 %rs56, %rs55, 4; 2026-02-21T09:10:48.0937001Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0937084Z ld.shared.s8 %rs57, [%r302+1536]; 2026-02-21T09:10:48.0937244Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0937310Z shl.b16 %rs58, %rs57, 4; 2026-02-21T09:10:48.0937469Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0937532Z ld.shared.s8 %rs59, [%r301+1664]; 
2026-02-21T09:10:48.0937694Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0937798Z shl.b16 %rs60, %rs59, 4; 2026-02-21T09:10:48.0937961Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0938023Z ld.shared.s8 %rs61, [%r300+1792]; 2026-02-21T09:10:48.0938191Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0938248Z shl.b16 %rs62, %rs61, 4; 2026-02-21T09:10:48.0938408Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0938495Z ld.shared.s8 %rs63, [%r299+1920]; 2026-02-21T09:10:48.0938652Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0938710Z shl.b16 %rs64, %rs63, 4; 2026-02-21T09:10:48.0938915Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0938977Z cvt.s16.s8 %rs65, %rs34; 2026-02-21T09:10:48.0939032Z shr.s16 %rs66, %rs65, 4; 2026-02-21T09:10:48.0939091Z cvt.s16.s8 %rs67, %rs36; 2026-02-21T09:10:48.0939153Z shr.s16 %rs68, %rs67, 4; 2026-02-21T09:10:48.0939210Z shr.s16 %rs69, %rs33, 4; 2026-02-21T09:10:48.0939265Z shr.s16 %rs70, %rs35, 4; 2026-02-21T09:10:48.0939435Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0939497Z cvt.rn.f32.s16 %r336, %rs70; 2026-02-21T09:10:48.0939561Z cvt.rn.f32.s16 %r337, %rs69; 2026-02-21T09:10:48.0939623Z cvt.rn.f32.s16 %r338, %rs68; 2026-02-21T09:10:48.0939689Z cvt.rn.f32.s16 %r339, %rs66; 2026-02-21T09:10:48.0939848Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0939905Z cvt.s16.s8 %rs71, %rs38; 2026-02-21T09:10:48.0939968Z shr.s16 %rs72, %rs71, 4; 2026-02-21T09:10:48.0940026Z cvt.s16.s8 %rs73, %rs40; 2026-02-21T09:10:48.0940084Z shr.s16 %rs74, %rs73, 4; 2026-02-21T09:10:48.0940146Z shr.s16 %rs75, %rs37, 4; 2026-02-21T09:10:48.0940201Z 
shr.s16 %rs76, %rs39, 4; 2026-02-21T09:10:48.0940358Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0940417Z cvt.rn.f32.s16 %r340, %rs76; 2026-02-21T09:10:48.0940481Z cvt.rn.f32.s16 %r341, %rs75; 2026-02-21T09:10:48.0940536Z cvt.rn.f32.s16 %r342, %rs74; 2026-02-21T09:10:48.0940593Z cvt.rn.f32.s16 %r343, %rs72; 2026-02-21T09:10:48.0940761Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0940818Z cvt.s16.s8 %rs77, %rs42; 2026-02-21T09:10:48.0940872Z shr.s16 %rs78, %rs77, 4; 2026-02-21T09:10:48.0940928Z cvt.s16.s8 %rs79, %rs44; 2026-02-21T09:10:48.0940990Z shr.s16 %rs80, %rs79, 4; 2026-02-21T09:10:48.0941044Z shr.s16 %rs81, %rs41, 4; 2026-02-21T09:10:48.0941101Z shr.s16 %rs82, %rs43, 4; 2026-02-21T09:10:48.0941272Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0941328Z cvt.rn.f32.s16 %r344, %rs82; 2026-02-21T09:10:48.0941388Z cvt.rn.f32.s16 %r345, %rs81; 2026-02-21T09:10:48.0941452Z cvt.rn.f32.s16 %r346, %rs80; 2026-02-21T09:10:48.0941508Z cvt.rn.f32.s16 %r347, %rs78; 2026-02-21T09:10:48.0941723Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0941805Z cvt.s16.s8 %rs83, %rs46; 2026-02-21T09:10:48.0941868Z shr.s16 %rs84, %rs83, 4; 2026-02-21T09:10:48.0941925Z cvt.s16.s8 %rs85, %rs48; 2026-02-21T09:10:48.0941981Z shr.s16 %rs86, %rs85, 4; 2026-02-21T09:10:48.0942045Z shr.s16 %rs87, %rs45, 4; 2026-02-21T09:10:48.0942100Z shr.s16 %rs88, %rs47, 4; 2026-02-21T09:10:48.0942259Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0942324Z cvt.rn.f32.s16 %r348, %rs88; 2026-02-21T09:10:48.0942411Z cvt.rn.f32.s16 %r349, %rs87; 2026-02-21T09:10:48.0942469Z cvt.rn.f32.s16 %r350, %rs86; 2026-02-21T09:10:48.0942525Z cvt.rn.f32.s16 %r351, %rs84; 2026-02-21T09:10:48.0942689Z .loc 1 69 25 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0942746Z cvt.s16.s8 %rs89, %rs50; 2026-02-21T09:10:48.0942803Z shr.s16 %rs90, %rs89, 4; 2026-02-21T09:10:48.0942865Z cvt.s16.s8 %rs91, %rs52; 2026-02-21T09:10:48.0942923Z shr.s16 %rs92, %rs91, 4; 2026-02-21T09:10:48.0942982Z shr.s16 %rs93, %rs49, 4; 2026-02-21T09:10:48.0943038Z shr.s16 %rs94, %rs51, 4; 2026-02-21T09:10:48.0943244Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0943303Z cvt.rn.f32.s16 %r352, %rs94; 2026-02-21T09:10:48.0943362Z cvt.rn.f32.s16 %r353, %rs93; 2026-02-21T09:10:48.0943424Z cvt.rn.f32.s16 %r354, %rs92; 2026-02-21T09:10:48.0943507Z cvt.rn.f32.s16 %r355, %rs90; 2026-02-21T09:10:48.0943672Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0943735Z cvt.s16.s8 %rs95, %rs54; 2026-02-21T09:10:48.0943791Z shr.s16 %rs96, %rs95, 4; 2026-02-21T09:10:48.0943846Z cvt.s16.s8 %rs97, %rs56; 2026-02-21T09:10:48.0943903Z shr.s16 %rs98, %rs97, 4; 2026-02-21T09:10:48.0943964Z shr.s16 %rs99, %rs53, 4; 2026-02-21T09:10:48.0944023Z shr.s16 %rs100, %rs55, 4; 2026-02-21T09:10:48.0944185Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0944254Z cvt.rn.f32.s16 %r356, %rs100; 2026-02-21T09:10:48.0944312Z cvt.rn.f32.s16 %r357, %rs99; 2026-02-21T09:10:48.0944371Z cvt.rn.f32.s16 %r358, %rs98; 2026-02-21T09:10:48.0944427Z cvt.rn.f32.s16 %r359, %rs96; 2026-02-21T09:10:48.0944596Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0944656Z cvt.s16.s8 %rs101, %rs58; 2026-02-21T09:10:48.0944717Z shr.s16 %rs102, %rs101, 4; 2026-02-21T09:10:48.0944781Z cvt.s16.s8 %rs103, %rs60; 2026-02-21T09:10:48.0944839Z shr.s16 %rs104, %rs103, 4; 2026-02-21T09:10:48.0944897Z shr.s16 %rs105, %rs57, 4; 2026-02-21T09:10:48.0944955Z shr.s16 %rs106, %rs59, 4; 2026-02-21T09:10:48.0945117Z .loc 1 87 32 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0945177Z cvt.rn.f32.s16 %r360, %rs106; 2026-02-21T09:10:48.0945234Z cvt.rn.f32.s16 %r361, %rs105; 2026-02-21T09:10:48.0945299Z cvt.rn.f32.s16 %r362, %rs104; 2026-02-21T09:10:48.0945359Z cvt.rn.f32.s16 %r363, %rs102; 2026-02-21T09:10:48.0945520Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0945584Z cvt.s16.s8 %rs107, %rs62; 2026-02-21T09:10:48.0945643Z shr.s16 %rs108, %rs107, 4; 2026-02-21T09:10:48.0945701Z cvt.s16.s8 %rs109, %rs64; 2026-02-21T09:10:48.0945767Z shr.s16 %rs110, %rs109, 4; 2026-02-21T09:10:48.0945827Z shr.s16 %rs111, %rs61, 4; 2026-02-21T09:10:48.0945889Z shr.s16 %rs112, %rs63, 4; 2026-02-21T09:10:48.0946054Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0946122Z cvt.rn.f32.s16 %r364, %rs112; 2026-02-21T09:10:48.0946182Z cvt.rn.f32.s16 %r365, %rs111; 2026-02-21T09:10:48.0946240Z cvt.rn.f32.s16 %r366, %rs110; 2026-02-21T09:10:48.0946305Z cvt.rn.f32.s16 %r367, %rs108; 2026-02-21T09:10:48.0946401Z st.shared.v4.b32 [%r13], {%r339, %r337, %r338, %r336}; 2026-02-21T09:10:48.0946514Z st.shared.v4.b32 [%r14], {%r343, %r341, %r342, %r340}; 2026-02-21T09:10:48.0946601Z st.shared.v4.b32 [%r15], {%r347, %r345, %r346, %r344}; 2026-02-21T09:10:48.0946693Z st.shared.v4.b32 [%r16], {%r351, %r349, %r350, %r348}; 2026-02-21T09:10:48.0946777Z st.shared.v4.b32 [%r17], {%r355, %r353, %r354, %r352}; 2026-02-21T09:10:48.0946861Z st.shared.v4.b32 [%r18], {%r359, %r357, %r358, %r356}; 2026-02-21T09:10:48.0946973Z st.shared.v4.b32 [%r19], {%r363, %r361, %r362, %r360}; 2026-02-21T09:10:48.0947052Z st.shared.v4.b32 [%r20], {%r367, %r365, %r366, %r364}; 2026-02-21T09:10:48.0947104Z $L__tmp4: 2026-02-21T09:10:48.0947325Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.0947381Z // begin inline asm 
2026-02-21T09:10:48.0947452Z fence.proxy.async.shared::cta; 2026-02-21T09:10:48.0947505Z // end inline asm 2026-02-21T09:10:48.0947566Z bar.sync 0; 2026-02-21T09:10:48.0947629Z setp.ne.b32 %p61, %r31, 0; 2026-02-21T09:10:48.0947710Z @%p61 bra $L__BB0_3; 2026-02-21T09:10:48.0947770Z // %bb.2: 2026-02-21T09:10:48.0947834Z elect.sync %r380|%p63, -1; 2026-02-21T09:10:48.0947892Z mov.b32 %r370, 136317200; 2026-02-21T09:10:48.0947955Z mov.pred %p62, 0; 2026-02-21T09:10:48.0948011Z // begin inline asm 2026-02-21T09:10:48.0948206Z @%p63 tcgen05.mma.cta_group::1.kind::tf32 [ %r884 + 0 ], [ %r369 + 0 ], %rd68, %r370, %p62; 2026-02-21T09:10:48.0948265Z // end inline asm 2026-02-21T09:10:48.0948329Z // begin inline asm 2026-02-21T09:10:48.0948485Z @%p63 tcgen05.mma.cta_group::1.kind::tf32 [ %r884 + 0 ], [ %r369 + 8 ], %rd69, %r370, %p64; 2026-02-21T09:10:48.0948541Z // end inline asm 2026-02-21T09:10:48.0948605Z // begin inline asm 2026-02-21T09:10:48.0948760Z @%p63 tcgen05.mma.cta_group::1.kind::tf32 [ %r884 + 0 ], [ %r369 + 16 ], %rd70, %r370, %p64; 2026-02-21T09:10:48.0948817Z // end inline asm 2026-02-21T09:10:48.0948883Z // begin inline asm 2026-02-21T09:10:48.0949034Z @%p63 tcgen05.mma.cta_group::1.kind::tf32 [ %r884 + 0 ], [ %r369 + 24 ], %rd71, %r370, %p64; 2026-02-21T09:10:48.0949090Z // end inline asm 2026-02-21T09:10:48.0949151Z add.s32 %r382, %r48, 36864; 2026-02-21T09:10:48.0949218Z cvt.u64.u32 %rd72, %r382; 2026-02-21T09:10:48.0949276Z // begin inline asm 2026-02-21T09:10:48.0949409Z @%p63 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd72]; 2026-02-21T09:10:48.0949472Z // end inline asm 2026-02-21T09:10:48.0949528Z $L__tmp5: 2026-02-21T09:10:48.0949583Z $L__BB0_3: 2026-02-21T09:10:48.0949682Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:48.0949743Z shl.b32 %r12, %r5, 6; 2026-02-21T09:10:48.0949915Z .loc 1 58 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:32 2026-02-21T09:10:48.0949976Z add.s64 %rd73, %rd53, 128; 
2026-02-21T09:10:48.0950045Z cvt.u64.u32 %rd79, %r21; 2026-02-21T09:10:48.0950110Z add.s64 %rd80, %rd7, %rd79; 2026-02-21T09:10:48.0950170Z shl.b64 %rd81, %rd80, 1; 2026-02-21T09:10:48.0950236Z add.s64 %rd82, %rd14, %rd81; 2026-02-21T09:10:48.0950298Z add.s64 %rd74, %rd82, 65536; 2026-02-21T09:10:48.0950360Z add.s64 %rd75, %rd82, 131072; 2026-02-21T09:10:48.0950420Z add.s64 %rd76, %rd82, 196608; 2026-02-21T09:10:48.0950482Z mov.b32 %r384, 16; 2026-02-21T09:10:48.0950652Z .loc 1 58 80 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:80 2026-02-21T09:10:48.0950712Z // begin inline asm 2026-02-21T09:10:48.0950840Z cp.async.cg.shared.global [ %r383 + 0 ], [ %rd73 + 0 ], 0x10, %r384; 2026-02-21T09:10:48.0950896Z // end inline asm 2026-02-21T09:10:48.0950953Z // begin inline asm 2026-02-21T09:10:48.0951077Z cp.async.cg.shared.global [ %r385 + 0 ], [ %rd74 + 0 ], 0x10, %r384; 2026-02-21T09:10:48.0951134Z // end inline asm 2026-02-21T09:10:48.0951192Z // begin inline asm 2026-02-21T09:10:48.0951305Z cp.async.cg.shared.global [ %r387 + 0 ], [ %rd75 + 0 ], 0x10, %r384; 2026-02-21T09:10:48.0951395Z // end inline asm 2026-02-21T09:10:48.0951452Z // begin inline asm 2026-02-21T09:10:48.0951619Z cp.async.cg.shared.global [ %r389 + 0 ], [ %rd76 + 0 ], 0x10, %r384; 2026-02-21T09:10:48.0951683Z // end inline asm 2026-02-21T09:10:48.0951748Z cp.async.commit_group; 2026-02-21T09:10:48.0951922Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0951980Z // begin inline asm 2026-02-21T09:10:48.0952136Z @%p97 mbarrier.arrive.expect_tx.shared.b64 _, [%r391], 2048; 2026-02-21T09:10:48.0952192Z // end inline asm 2026-02-21T09:10:48.0952360Z .loc 1 64 33 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:64:33 2026-02-21T09:10:48.0952423Z bar.sync 0; 2026-02-21T09:10:48.0952490Z elect.sync %r400|%p74, -1; 2026-02-21T09:10:48.0952559Z and.pred %p72, %p1, %p74; 2026-02-21T09:10:48.0952622Z mov.b32 %r394, 32; 
2026-02-21T09:10:48.0952682Z // begin inline asm 2026-02-21T09:10:48.0952972Z @%p72 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r222], [%rd77, {%r393, %r394}], [%r391]; 2026-02-21T09:10:48.0953030Z // end inline asm 2026-02-21T09:10:48.0953211Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0953272Z and.b32 %r401, %r1, 3; 2026-02-21T09:10:48.0953336Z mul.wide.u32 %rd83, %r401, 16; 2026-02-21T09:10:48.0953435Z shl.b32 %r402, %r3, 13; 2026-02-21T09:10:48.0953503Z and.b32 %r403, %r402, 4063232; 2026-02-21T09:10:48.0953562Z shl.b32 %r404, %r4, 10; 2026-02-21T09:10:48.0953630Z or.b32 %r405, %r403, %r404; 2026-02-21T09:10:48.0953694Z mul.wide.u32 %rd84, %r405, 2; 2026-02-21T09:10:48.0953756Z or.b64 %rd85, %rd83, %rd84; 2026-02-21T09:10:48.0953818Z add.s64 %rd86, %rd85, %rd14; 2026-02-21T09:10:48.0953887Z add.s64 %rd354, %rd86, 196800; 2026-02-21T09:10:48.0953942Z mov.b32 %r889, 1; 2026-02-21T09:10:48.0954000Z mov.b32 %r885, 0; 2026-02-21T09:10:48.0954064Z mov.b64 %rd355, 0; 2026-02-21T09:10:48.0954125Z mov.b32 %r887, %r885; 2026-02-21T09:10:48.0954185Z mov.b32 %r888, %r885; 2026-02-21T09:10:48.0954244Z mov.b32 %r890, %r885; 2026-02-21T09:10:48.0954314Z bra.uni $L__BB0_4; 2026-02-21T09:10:48.0954421Z $L__BB0_6: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:10:48.0954604Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0954677Z setp.lt.u64 %p90, %rd355, 464; 2026-02-21T09:10:48.0954732Z $L__tmp6: 2026-02-21T09:10:48.0954957Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.0955026Z add.s32 %r538, %r889, 1; 2026-02-21T09:10:48.0955090Z setp.gt.s32 %p93, %r538, 1; 2026-02-21T09:10:48.0955157Z selp.b32 %r889, 0, %r538, %p93; 2026-02-21T09:10:48.0955219Z selp.b32 %r539, 1, 0, %p93; 2026-02-21T09:10:48.0955285Z xor.b32 %r47, %r890, %r539; 2026-02-21T09:10:48.0955341Z 
$L__tmp7: 2026-02-21T09:10:48.0955511Z .loc 1 58 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:32 2026-02-21T09:10:48.0955585Z add.s64 %rd92, %rd354, -196608; 2026-02-21T09:10:48.0955649Z add.s64 %rd93, %rd354, -131072; 2026-02-21T09:10:48.0955711Z add.s64 %rd94, %rd354, -65536; 2026-02-21T09:10:48.0955878Z .loc 1 58 80 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:80 2026-02-21T09:10:48.0955946Z add.s32 %r525, %r41, %r6; 2026-02-21T09:10:48.0956010Z selp.b32 %r526, 16, 0, %p90; 2026-02-21T09:10:48.0956069Z // begin inline asm 2026-02-21T09:10:48.0956199Z cp.async.cg.shared.global [ %r525 + 0 ], [ %rd92 + 0 ], 0x10, %r526; 2026-02-21T09:10:48.0956255Z // end inline asm 2026-02-21T09:10:48.0956311Z add.s32 %r527, %r525, 2048; 2026-02-21T09:10:48.0956372Z // begin inline asm 2026-02-21T09:10:48.0956481Z cp.async.cg.shared.global [ %r527 + 0 ], [ %rd93 + 0 ], 0x10, %r526; 2026-02-21T09:10:48.0956564Z // end inline asm 2026-02-21T09:10:48.0956622Z add.s32 %r529, %r525, 4096; 2026-02-21T09:10:48.0956686Z // begin inline asm 2026-02-21T09:10:48.0956796Z cp.async.cg.shared.global [ %r529 + 0 ], [ %rd94 + 0 ], 0x10, %r526; 2026-02-21T09:10:48.0956850Z // end inline asm 2026-02-21T09:10:48.0956911Z add.s32 %r531, %r525, 6144; 2026-02-21T09:10:48.0956966Z // begin inline asm 2026-02-21T09:10:48.0957081Z cp.async.cg.shared.global [ %r531 + 0 ], [ %rd354 + 0 ], 0x10, %r526; 2026-02-21T09:10:48.0957156Z // end inline asm 2026-02-21T09:10:48.0957225Z cp.async.commit_group; 2026-02-21T09:10:48.0957392Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0957456Z and.pred %p88, %p97, %p90; 2026-02-21T09:10:48.0957518Z // begin inline asm 2026-02-21T09:10:48.0957624Z @%p88 mbarrier.arrive.expect_tx.shared.b64 _, [%r533], 2048; 2026-02-21T09:10:48.0957676Z // end inline asm 2026-02-21T09:10:48.0957841Z .loc 1 64 33 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:64:33 
2026-02-21T09:10:48.0957925Z bar.sync 0; 2026-02-21T09:10:48.0957991Z elect.sync %r540|%p94, -1; 2026-02-21T09:10:48.0958053Z and.pred %p95, %p90, %p94; 2026-02-21T09:10:48.0958124Z and.pred %p89, %p1, %p95; 2026-02-21T09:10:48.0958184Z cvt.u32.u64 %r541, %rd355; 2026-02-21T09:10:48.0958249Z add.s32 %r536, %r541, 48; 2026-02-21T09:10:48.0958340Z // begin inline asm 2026-02-21T09:10:48.0958579Z @%p89 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r534], [%rd77, {%r393, %r536}], [%r533]; 2026-02-21T09:10:48.0958634Z // end inline asm 2026-02-21T09:10:48.0958809Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0958866Z add.s64 %rd354, %rd354, 64; 2026-02-21T09:10:48.0958927Z setp.lt.u64 %p96, %rd355, 480; 2026-02-21T09:10:48.0958984Z add.s64 %rd355, %rd355, 16; 2026-02-21T09:10:48.0959047Z mov.b32 %r885, %r890; 2026-02-21T09:10:48.0959104Z mov.b32 %r890, %r47; 2026-02-21T09:10:48.0959163Z @%p96 bra $L__BB0_4; 2026-02-21T09:10:48.0959223Z bra.uni $L__BB0_7; 2026-02-21T09:10:48.0959328Z $L__BB0_4: // =>This Inner Loop Header: Depth=1 2026-02-21T09:10:48.0959387Z add.s32 %r444, %r888, 1; 2026-02-21T09:10:48.0959449Z setp.gt.s32 %p78, %r444, 1; 2026-02-21T09:10:48.0959519Z selp.b32 %r888, 0, %r444, %p78; 2026-02-21T09:10:48.0959684Z .loc 1 58 80 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:58:80 2026-02-21T09:10:48.0959747Z cp.async.wait_group 1; 2026-02-21T09:10:48.0959806Z bar.sync 0; 2026-02-21T09:10:48.0959863Z shl.b32 %r445, %r888, 13; 2026-02-21T09:10:48.0959920Z add.s32 %r447, %r48, %r445; 2026-02-21T09:10:48.0959981Z add.s32 %r41, %r447, 16384; 2026-02-21T09:10:48.0960038Z add.s32 %r448, %r41, %r12; 2026-02-21T09:10:48.0960133Z ld.shared.v4.b32 {%r449, %r450, %r451, %r452}, [%r448]; 2026-02-21T09:10:48.0960198Z mov.b32 {%rs113, %rs114}, %r452; 2026-02-21T09:10:48.0960268Z mov.b32 {%rs115, %rs116}, %r451; 2026-02-21T09:10:48.0960328Z mov.b32 {%rs117, %rs118}, %r450; 
2026-02-21T09:10:48.0960386Z mov.b32 {%rs119, %rs120}, %r449; 2026-02-21T09:10:48.0960489Z ld.shared.v4.b32 {%r453, %r454, %r455, %r456}, [%r448+16]; 2026-02-21T09:10:48.0960546Z mov.b32 {%rs121, %rs122}, %r456; 2026-02-21T09:10:48.0960605Z mov.b32 {%rs123, %rs124}, %r455; 2026-02-21T09:10:48.0960664Z mov.b32 {%rs125, %rs126}, %r454; 2026-02-21T09:10:48.0960730Z mov.b32 {%rs127, %rs128}, %r453; 2026-02-21T09:10:48.0960821Z ld.shared.v4.b32 {%r457, %r458, %r459, %r460}, [%r448+32]; 2026-02-21T09:10:48.0960878Z mov.b32 {%rs129, %rs130}, %r460; 2026-02-21T09:10:48.0960943Z mov.b32 {%rs131, %rs132}, %r459; 2026-02-21T09:10:48.0960999Z mov.b32 {%rs133, %rs134}, %r458; 2026-02-21T09:10:48.0961057Z mov.b32 {%rs135, %rs136}, %r457; 2026-02-21T09:10:48.0961155Z ld.shared.v4.b32 {%r461, %r462, %r463, %r464}, [%r448+48]; 2026-02-21T09:10:48.0961237Z mov.b32 {%rs137, %rs138}, %r464; 2026-02-21T09:10:48.0961296Z mov.b32 {%rs139, %rs140}, %r463; 2026-02-21T09:10:48.0961354Z mov.b32 {%rs141, %rs142}, %r462; 2026-02-21T09:10:48.0961418Z mov.b32 {%rs143, %rs144}, %r461; 2026-02-21T09:10:48.0961622Z .loc 1 62 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:62:32 2026-02-21T09:10:48.0961687Z cvt.f32.bf16 %r409, %rs119; 2026-02-21T09:10:48.0961754Z cvt.f32.bf16 %r410, %rs120; 2026-02-21T09:10:48.0966358Z cvt.f32.bf16 %r411, %rs117; 2026-02-21T09:10:48.0966429Z cvt.f32.bf16 %r412, %rs118; 2026-02-21T09:10:48.0966495Z cvt.f32.bf16 %r413, %rs115; 2026-02-21T09:10:48.0966557Z cvt.f32.bf16 %r414, %rs116; 2026-02-21T09:10:48.0966616Z cvt.f32.bf16 %r415, %rs113; 2026-02-21T09:10:48.0966674Z cvt.f32.bf16 %r416, %rs114; 2026-02-21T09:10:48.0966741Z cvt.f32.bf16 %r417, %rs127; 2026-02-21T09:10:48.0966799Z cvt.f32.bf16 %r418, %rs128; 2026-02-21T09:10:48.0966870Z cvt.f32.bf16 %r419, %rs125; 2026-02-21T09:10:48.0966932Z cvt.f32.bf16 %r420, %rs126; 2026-02-21T09:10:48.0967086Z cvt.f32.bf16 %r421, %rs123; 2026-02-21T09:10:48.0967151Z cvt.f32.bf16 %r422, %rs124; 
2026-02-21T09:10:48.0967211Z cvt.f32.bf16 %r423, %rs121; 2026-02-21T09:10:48.0967283Z cvt.f32.bf16 %r424, %rs122; 2026-02-21T09:10:48.0967342Z cvt.f32.bf16 %r426, %rs135; 2026-02-21T09:10:48.0967401Z cvt.f32.bf16 %r427, %rs136; 2026-02-21T09:10:48.0967512Z cvt.f32.bf16 %r428, %rs133; 2026-02-21T09:10:48.0967574Z cvt.f32.bf16 %r429, %rs134; 2026-02-21T09:10:48.0967631Z cvt.f32.bf16 %r430, %rs131; 2026-02-21T09:10:48.0967688Z cvt.f32.bf16 %r431, %rs132; 2026-02-21T09:10:48.0967751Z cvt.f32.bf16 %r432, %rs129; 2026-02-21T09:10:48.0967808Z cvt.f32.bf16 %r433, %rs130; 2026-02-21T09:10:48.0967865Z cvt.f32.bf16 %r434, %rs143; 2026-02-21T09:10:48.0967929Z cvt.f32.bf16 %r435, %rs144; 2026-02-21T09:10:48.0967986Z cvt.f32.bf16 %r436, %rs141; 2026-02-21T09:10:48.0968041Z cvt.f32.bf16 %r437, %rs142; 2026-02-21T09:10:48.0968100Z cvt.f32.bf16 %r438, %rs139; 2026-02-21T09:10:48.0968163Z cvt.f32.bf16 %r439, %rs140; 2026-02-21T09:10:48.0968223Z cvt.f32.bf16 %r440, %rs137; 2026-02-21T09:10:48.0968282Z cvt.f32.bf16 %r441, %rs138; 2026-02-21T09:10:48.0968345Z $L__tmp8: 2026-02-21T09:10:48.0968603Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.0968665Z // begin inline asm 2026-02-21T09:10:48.0968726Z 2026-02-21T09:10:48.0968777Z { 2026-02-21T09:10:48.0968842Z .reg .pred complete; 2026-02-21T09:10:48.0968897Z waitLoop: 2026-02-21T09:10:48.0969029Z mbarrier.try_wait.parity.shared.b64 complete, [%r886], %r885; 2026-02-21T09:10:48.0969096Z @!complete bra.uni waitLoop; 2026-02-21T09:10:48.0969146Z } 2026-02-21T09:10:48.0969153Z 2026-02-21T09:10:48.0969215Z // end inline asm 2026-02-21T09:10:48.0969269Z $L__tmp9: 2026-02-21T09:10:48.0969448Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0969512Z selp.b32 %r465, 1, 0, %p78; 2026-02-21T09:10:48.0969583Z xor.b32 %r887, %r887, %r465; 2026-02-21T09:10:48.0969647Z mov.pred %p79, -1; 2026-02-21T09:10:48.0969698Z 
$L__tmp10: 2026-02-21T09:10:48.0969916Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.0969973Z // begin inline asm 2026-02-21T09:10:48.0970258Z @%p79 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r239 + 0], {%r409, %r410, %r411, %r412, %r413, %r414, %r415, %r416, %r417, %r418, %r419, %r420, %r421, %r422, %r423, %r424}; 2026-02-21T09:10:48.0970322Z // end inline asm 2026-02-21T09:10:48.0970377Z // begin inline asm 2026-02-21T09:10:48.0970650Z @%p79 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r239 + 16], {%r426, %r427, %r428, %r429, %r430, %r431, %r432, %r433, %r434, %r435, %r436, %r437, %r438, %r439, %r440, %r441}; 2026-02-21T09:10:48.0970709Z // end inline asm 2026-02-21T09:10:48.0970812Z // begin inline asm 2026-02-21T09:10:48.0970893Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:10:48.0970950Z // end inline asm 2026-02-21T09:10:48.0971009Z bar.sync 0; 2026-02-21T09:10:48.0971062Z $L__tmp11: 2026-02-21T09:10:48.0971241Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0971311Z shl.b32 %r466, %r888, 3; 2026-02-21T09:10:48.0971372Z add.s32 %r467, %r48, %r466; 2026-02-21T09:10:48.0971464Z add.s32 %r533, %r467, 36880; 2026-02-21T09:10:48.0971520Z // begin inline asm 2026-02-21T09:10:48.0971670Z 2026-02-21T09:10:48.0971721Z { 2026-02-21T09:10:48.0971781Z .reg .pred complete; 2026-02-21T09:10:48.0971836Z waitLoop: 2026-02-21T09:10:48.0971956Z mbarrier.try_wait.parity.shared.b64 complete, [%r533], %r887; 2026-02-21T09:10:48.0972020Z @!complete bra.uni waitLoop; 2026-02-21T09:10:48.0972069Z } 2026-02-21T09:10:48.0972080Z 2026-02-21T09:10:48.0972134Z // end inline asm 2026-02-21T09:10:48.0972303Z .loc 1 64 33 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:64:33 2026-02-21T09:10:48.0972394Z shl.b32 %r468, %r888, 11; 2026-02-21T09:10:48.0972459Z add.s32 %r469, %r48, %r468; 2026-02-21T09:10:48.0972517Z add.s32 %r534, %r469, 32768; 
2026-02-21T09:10:48.0972578Z add.s32 %r470, %r534, %r5; 2026-02-21T09:10:48.0972641Z add.s32 %r471, %r534, %r28; 2026-02-21T09:10:48.0972699Z add.s32 %r472, %r534, %r27; 2026-02-21T09:10:48.0972787Z add.s32 %r473, %r534, %r26; 2026-02-21T09:10:48.0972844Z add.s32 %r474, %r534, %r25; 2026-02-21T09:10:48.0972907Z add.s32 %r475, %r534, %r24; 2026-02-21T09:10:48.0972962Z add.s32 %r476, %r534, %r23; 2026-02-21T09:10:48.0973016Z add.s32 %r477, %r534, %r22; 2026-02-21T09:10:48.0973184Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0973247Z ld.shared.s8 %rs145, [%r470]; 2026-02-21T09:10:48.0973406Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0973476Z shl.b16 %rs146, %rs145, 4; 2026-02-21T09:10:48.0973639Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0973708Z ld.shared.s8 %rs147, [%r471+128]; 2026-02-21T09:10:48.0973864Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0973936Z shl.b16 %rs148, %rs147, 4; 2026-02-21T09:10:48.0974093Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0974157Z ld.shared.s8 %rs149, [%r472+256]; 2026-02-21T09:10:48.0974325Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0974383Z shl.b16 %rs150, %rs149, 4; 2026-02-21T09:10:48.0974539Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0974613Z ld.shared.s8 %rs151, [%r473+384]; 2026-02-21T09:10:48.0974772Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0974828Z shl.b16 %rs152, %rs151, 4; 2026-02-21T09:10:48.0974994Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0975058Z ld.shared.s8 %rs153, [%r474+512]; 2026-02-21T09:10:48.0975216Z .loc 1 67 28 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0975277Z shl.b16 %rs154, %rs153, 4; 2026-02-21T09:10:48.0975445Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0975506Z ld.shared.s8 %rs155, [%r475+640]; 2026-02-21T09:10:48.0975662Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0975731Z shl.b16 %rs156, %rs155, 4; 2026-02-21T09:10:48.0975919Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0975979Z ld.shared.s8 %rs157, [%r476+768]; 2026-02-21T09:10:48.0976145Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0976204Z shl.b16 %rs158, %rs157, 4; 2026-02-21T09:10:48.0976365Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0976458Z ld.shared.s8 %rs159, [%r477+896]; 2026-02-21T09:10:48.0976617Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0976674Z shl.b16 %rs160, %rs159, 4; 2026-02-21T09:10:48.0976834Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0976906Z ld.shared.s8 %rs161, [%r470+1024]; 2026-02-21T09:10:48.0977066Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0977144Z shl.b16 %rs162, %rs161, 4; 2026-02-21T09:10:48.0977309Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0977374Z ld.shared.s8 %rs163, [%r471+1152]; 2026-02-21T09:10:48.0977560Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0977627Z shl.b16 %rs164, %rs163, 4; 2026-02-21T09:10:48.0977784Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0977848Z ld.shared.s8 %rs165, [%r472+1280]; 2026-02-21T09:10:48.0978013Z .loc 1 67 28 
// cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0978071Z shl.b16 %rs166, %rs165, 4; 2026-02-21T09:10:48.0978230Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0978291Z ld.shared.s8 %rs167, [%r473+1408]; 2026-02-21T09:10:48.0978456Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0978514Z shl.b16 %rs168, %rs167, 4; 2026-02-21T09:10:48.0978670Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0978740Z ld.shared.s8 %rs169, [%r474+1536]; 2026-02-21T09:10:48.0978899Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0978957Z shl.b16 %rs170, %rs169, 4; 2026-02-21T09:10:48.0979119Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0979180Z ld.shared.s8 %rs171, [%r475+1664]; 2026-02-21T09:10:48.0979338Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0979402Z shl.b16 %rs172, %rs171, 4; 2026-02-21T09:10:48.0979559Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0979619Z ld.shared.s8 %rs173, [%r476+1792]; 2026-02-21T09:10:48.0979777Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0979841Z shl.b16 %rs174, %rs173, 4; 2026-02-21T09:10:48.0979996Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0980059Z ld.shared.s8 %rs175, [%r477+1920]; 2026-02-21T09:10:48.0980222Z .loc 1 67 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:67:28 2026-02-21T09:10:48.0980279Z shl.b16 %rs176, %rs175, 4; 2026-02-21T09:10:48.0980434Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0980501Z cvt.s16.s8 %rs177, %rs146; 2026-02-21T09:10:48.0980580Z shr.s16 %rs178, 
%rs177, 4; 2026-02-21T09:10:48.0980638Z cvt.s16.s8 %rs179, %rs148; 2026-02-21T09:10:48.0980700Z shr.s16 %rs180, %rs179, 4; 2026-02-21T09:10:48.0980755Z shr.s16 %rs181, %rs145, 4; 2026-02-21T09:10:48.0980811Z shr.s16 %rs182, %rs147, 4; 2026-02-21T09:10:48.0980970Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0981038Z cvt.rn.f32.s16 %r478, %rs182; 2026-02-21T09:10:48.0981098Z cvt.rn.f32.s16 %r479, %rs181; 2026-02-21T09:10:48.0981195Z cvt.rn.f32.s16 %r480, %rs180; 2026-02-21T09:10:48.0981260Z cvt.rn.f32.s16 %r481, %rs178; 2026-02-21T09:10:48.0981422Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0981479Z cvt.s16.s8 %rs183, %rs150; 2026-02-21T09:10:48.0981575Z shr.s16 %rs184, %rs183, 4; 2026-02-21T09:10:48.0981637Z cvt.s16.s8 %rs185, %rs152; 2026-02-21T09:10:48.0981693Z shr.s16 %rs186, %rs185, 4; 2026-02-21T09:10:48.0981749Z shr.s16 %rs187, %rs149, 4; 2026-02-21T09:10:48.0981810Z shr.s16 %rs188, %rs151, 4; 2026-02-21T09:10:48.0982001Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0982060Z cvt.rn.f32.s16 %r482, %rs188; 2026-02-21T09:10:48.0982122Z cvt.rn.f32.s16 %r483, %rs187; 2026-02-21T09:10:48.0982178Z cvt.rn.f32.s16 %r484, %rs186; 2026-02-21T09:10:48.0982261Z cvt.rn.f32.s16 %r485, %rs184; 2026-02-21T09:10:48.0982420Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0982486Z cvt.s16.s8 %rs189, %rs154; 2026-02-21T09:10:48.0982541Z shr.s16 %rs190, %rs189, 4; 2026-02-21T09:10:48.0982598Z cvt.s16.s8 %rs191, %rs156; 2026-02-21T09:10:48.0982661Z shr.s16 %rs192, %rs191, 4; 2026-02-21T09:10:48.0982719Z shr.s16 %rs193, %rs153, 4; 2026-02-21T09:10:48.0982772Z shr.s16 %rs194, %rs155, 4; 2026-02-21T09:10:48.0982933Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0982997Z cvt.rn.f32.s16 %r486, %rs194; 
2026-02-21T09:10:48.0983055Z cvt.rn.f32.s16 %r487, %rs193; 2026-02-21T09:10:48.0983112Z cvt.rn.f32.s16 %r488, %rs192; 2026-02-21T09:10:48.0983176Z cvt.rn.f32.s16 %r489, %rs190; 2026-02-21T09:10:48.0983336Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0983393Z cvt.s16.s8 %rs195, %rs158; 2026-02-21T09:10:48.0983457Z shr.s16 %rs196, %rs195, 4; 2026-02-21T09:10:48.0983512Z cvt.s16.s8 %rs197, %rs160; 2026-02-21T09:10:48.0983566Z shr.s16 %rs198, %rs197, 4; 2026-02-21T09:10:48.0983620Z shr.s16 %rs199, %rs157, 4; 2026-02-21T09:10:48.0983683Z shr.s16 %rs200, %rs159, 4; 2026-02-21T09:10:48.0983843Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0983901Z cvt.rn.f32.s16 %r490, %rs200; 2026-02-21T09:10:48.0983965Z cvt.rn.f32.s16 %r491, %rs199; 2026-02-21T09:10:48.0984025Z cvt.rn.f32.s16 %r492, %rs198; 2026-02-21T09:10:48.0984082Z cvt.rn.f32.s16 %r493, %rs196; 2026-02-21T09:10:48.0984247Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0984302Z cvt.s16.s8 %rs201, %rs162; 2026-02-21T09:10:48.0984359Z shr.s16 %rs202, %rs201, 4; 2026-02-21T09:10:48.0984415Z cvt.s16.s8 %rs203, %rs164; 2026-02-21T09:10:48.0984480Z shr.s16 %rs204, %rs203, 4; 2026-02-21T09:10:48.0984538Z shr.s16 %rs205, %rs161, 4; 2026-02-21T09:10:48.0984593Z shr.s16 %rs206, %rs163, 4; 2026-02-21T09:10:48.0984760Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0984816Z cvt.rn.f32.s16 %r494, %rs206; 2026-02-21T09:10:48.0984875Z cvt.rn.f32.s16 %r495, %rs205; 2026-02-21T09:10:48.0984933Z cvt.rn.f32.s16 %r496, %rs204; 2026-02-21T09:10:48.0984996Z cvt.rn.f32.s16 %r497, %rs202; 2026-02-21T09:10:48.0985180Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0985239Z cvt.s16.s8 %rs207, %rs166; 2026-02-21T09:10:48.0985302Z shr.s16 %rs208, %rs207, 4; 2026-02-21T09:10:48.0985358Z 
cvt.s16.s8 %rs209, %rs168; 2026-02-21T09:10:48.0985415Z shr.s16 %rs210, %rs209, 4; 2026-02-21T09:10:48.0985476Z shr.s16 %rs211, %rs165, 4; 2026-02-21T09:10:48.0985531Z shr.s16 %rs212, %rs167, 4; 2026-02-21T09:10:48.0985693Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0985773Z cvt.rn.f32.s16 %r498, %rs212; 2026-02-21T09:10:48.0985838Z cvt.rn.f32.s16 %r499, %rs211; 2026-02-21T09:10:48.0985896Z cvt.rn.f32.s16 %r500, %rs210; 2026-02-21T09:10:48.0985951Z cvt.rn.f32.s16 %r501, %rs208; 2026-02-21T09:10:48.0986115Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0986172Z cvt.s16.s8 %rs213, %rs170; 2026-02-21T09:10:48.0986228Z shr.s16 %rs214, %rs213, 4; 2026-02-21T09:10:48.0986290Z cvt.s16.s8 %rs215, %rs172; 2026-02-21T09:10:48.0986365Z shr.s16 %rs216, %rs215, 4; 2026-02-21T09:10:48.0986422Z shr.s16 %rs217, %rs169, 4; 2026-02-21T09:10:48.0986476Z shr.s16 %rs218, %rs171, 4; 2026-02-21T09:10:48.0986648Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0986726Z cvt.rn.f32.s16 %r502, %rs218; 2026-02-21T09:10:48.0986786Z cvt.rn.f32.s16 %r503, %rs217; 2026-02-21T09:10:48.0986848Z cvt.rn.f32.s16 %r504, %rs216; 2026-02-21T09:10:48.0986904Z cvt.rn.f32.s16 %r505, %rs214; 2026-02-21T09:10:48.0987064Z .loc 1 69 25 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:69:25 2026-02-21T09:10:48.0987122Z cvt.s16.s8 %rs219, %rs174; 2026-02-21T09:10:48.0987181Z shr.s16 %rs220, %rs219, 4; 2026-02-21T09:10:48.0987239Z cvt.s16.s8 %rs221, %rs176; 2026-02-21T09:10:48.0987295Z shr.s16 %rs222, %rs221, 4; 2026-02-21T09:10:48.0987357Z shr.s16 %rs223, %rs173, 4; 2026-02-21T09:10:48.0987411Z shr.s16 %rs224, %rs175, 4; 2026-02-21T09:10:48.0987574Z .loc 1 87 32 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:87:32 2026-02-21T09:10:48.0987636Z cvt.rn.f32.s16 %r506, %rs224; 2026-02-21T09:10:48.0987693Z cvt.rn.f32.s16 %r507, %rs223; 
2026-02-21T09:10:48.0987748Z cvt.rn.f32.s16 %r508, %rs222; 2026-02-21T09:10:48.0987807Z cvt.rn.f32.s16 %r509, %rs220; 2026-02-21T09:10:48.0987911Z st.shared.v4.b32 [%r13], {%r481, %r479, %r480, %r478}; 2026-02-21T09:10:48.0988002Z st.shared.v4.b32 [%r14], {%r485, %r483, %r484, %r482}; 2026-02-21T09:10:48.0988087Z st.shared.v4.b32 [%r15], {%r489, %r487, %r488, %r486}; 2026-02-21T09:10:48.0988175Z st.shared.v4.b32 [%r16], {%r493, %r491, %r492, %r490}; 2026-02-21T09:10:48.0988254Z st.shared.v4.b32 [%r17], {%r497, %r495, %r496, %r494}; 2026-02-21T09:10:48.0988335Z st.shared.v4.b32 [%r18], {%r501, %r499, %r500, %r498}; 2026-02-21T09:10:48.0988422Z st.shared.v4.b32 [%r19], {%r505, %r503, %r504, %r502}; 2026-02-21T09:10:48.0988503Z st.shared.v4.b32 [%r20], {%r509, %r507, %r508, %r506}; 2026-02-21T09:10:48.0988671Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0988728Z shl.b32 %r510, %r889, 3; 2026-02-21T09:10:48.0988792Z add.s32 %r511, %r48, %r510; 2026-02-21T09:10:48.0988850Z add.s32 %r886, %r511, 36864; 2026-02-21T09:10:48.0988903Z $L__tmp12: 2026-02-21T09:10:48.0989125Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.0989181Z // begin inline asm 2026-02-21T09:10:48.0989253Z fence.proxy.async.shared::cta; 2026-02-21T09:10:48.0989311Z // end inline asm 2026-02-21T09:10:48.0989365Z bar.sync 0; 2026-02-21T09:10:48.0989423Z @%p61 bra $L__BB0_6; 2026-02-21T09:10:48.0989525Z // %bb.5: // in Loop: Header=BB0_4 Depth=1 2026-02-21T09:10:48.0989620Z elect.sync %r524|%p80, -1; 2026-02-21T09:10:48.0989677Z mov.b32 %r514, 136317200; 2026-02-21T09:10:48.0989735Z // begin inline asm 2026-02-21T09:10:48.0989899Z @%p80 tcgen05.mma.cta_group::1.kind::tf32 [ %r884 + 0 ], [ %r369 + 0 ], %rd68, %r514, %p79; 2026-02-21T09:10:48.0989953Z // end inline asm 2026-02-21T09:10:48.0990008Z // begin inline asm 2026-02-21T09:10:48.0990156Z @%p80 
tcgen05.mma.cta_group::1.kind::tf32 [ %r884 + 0 ], [ %r369 + 8 ], %rd69, %r514, %p79; 2026-02-21T09:10:48.0990229Z // end inline asm 2026-02-21T09:10:48.0990286Z // begin inline asm 2026-02-21T09:10:48.0990434Z @%p80 tcgen05.mma.cta_group::1.kind::tf32 [ %r884 + 0 ], [ %r369 + 16 ], %rd70, %r514, %p79; 2026-02-21T09:10:48.0990493Z // end inline asm 2026-02-21T09:10:48.0990548Z // begin inline asm 2026-02-21T09:10:48.0990688Z @%p80 tcgen05.mma.cta_group::1.kind::tf32 [ %r884 + 0 ], [ %r369 + 24 ], %rd71, %r514, %p79; 2026-02-21T09:10:48.0990747Z // end inline asm 2026-02-21T09:10:48.0990806Z cvt.u64.u32 %rd91, %r886; 2026-02-21T09:10:48.0990860Z // begin inline asm 2026-02-21T09:10:48.0991009Z @%p80 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd91]; 2026-02-21T09:10:48.0991063Z // end inline asm 2026-02-21T09:10:48.0991117Z bra.uni $L__BB0_6; 2026-02-21T09:10:48.0991216Z $L__BB0_7: // %._crit_edge.loopexit 2026-02-21T09:10:48.0991311Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:10:48.0991394Z setp.lt.u32 %p102, %r1, 64; 2026-02-21T09:10:48.0991450Z mov.b32 %r543, 1; 2026-02-21T09:10:48.0991706Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.0991761Z // begin inline asm 2026-02-21T09:10:48.0991811Z 2026-02-21T09:10:48.0991859Z { 2026-02-21T09:10:48.0991923Z .reg .pred complete; 2026-02-21T09:10:48.0991976Z waitLoop: 2026-02-21T09:10:48.0992094Z mbarrier.try_wait.parity.shared.b64 complete, [%r886], %r543; 2026-02-21T09:10:48.0992164Z @!complete bra.uni waitLoop; 2026-02-21T09:10:48.0992212Z } 2026-02-21T09:10:48.0992216Z 2026-02-21T09:10:48.0992270Z // end inline asm 2026-02-21T09:10:48.0992328Z $L__tmp13: 2026-02-21T09:10:48.0992498Z .loc 1 51 125 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:51:125 2026-02-21T09:10:48.0992562Z cp.async.wait_group 0; 2026-02-21T09:10:48.0992617Z bar.sync 0; 2026-02-21T09:10:48.0992680Z // begin inline asm 
2026-02-21T09:10:48.0992769Z @%p97 mbarrier.inval.shared::cta.b64 [%r391]; 2026-02-21T09:10:48.1003115Z // end inline asm 2026-02-21T09:10:48.1003228Z bar.sync 0; 2026-02-21T09:10:48.1003300Z // begin inline asm 2026-02-21T09:10:48.1003424Z @%p97 mbarrier.inval.shared::cta.b64 [%r212]; 2026-02-21T09:10:48.1003497Z // end inline asm 2026-02-21T09:10:48.1003562Z add.s32 %r546, %r48, 36864; 2026-02-21T09:10:48.1003623Z // begin inline asm 2026-02-21T09:10:48.1003718Z @%p97 mbarrier.inval.shared::cta.b64 [%r546]; 2026-02-21T09:10:48.1003779Z // end inline asm 2026-02-21T09:10:48.1003838Z bar.sync 0; 2026-02-21T09:10:48.1003895Z // begin inline asm 2026-02-21T09:10:48.1003989Z @%p97 mbarrier.inval.shared::cta.b64 [%r210]; 2026-02-21T09:10:48.1004049Z // end inline asm 2026-02-21T09:10:48.1004105Z $L__tmp14: 2026-02-21T09:10:48.1004350Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1004415Z // begin inline asm 2026-02-21T09:10:48.1004688Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r548, %r549, %r550, %r551, %r552, %r553, %r554, %r555, %r556, %r557, %r558, %r559, %r560, %r561, %r562, %r563}, [%r683 + 0]; 2026-02-21T09:10:48.1004754Z // end inline asm 2026-02-21T09:10:48.1004815Z // begin inline asm 2026-02-21T09:10:48.1005081Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r565, %r566, %r567, %r568, %r569, %r570, %r571, %r572, %r573, %r574, %r575, %r576, %r577, %r578, %r579, %r580}, [%r683 + 16]; 2026-02-21T09:10:48.1005139Z // end inline asm 2026-02-21T09:10:48.1005297Z // begin inline asm 2026-02-21T09:10:48.1005555Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r582, %r583, %r584, %r585, %r586, %r587, %r588, %r589, %r590, %r591, %r592, %r593, %r594, %r595, %r596, %r597}, [%r683 + 32]; 2026-02-21T09:10:48.1005613Z // end inline asm 2026-02-21T09:10:48.1005683Z // begin inline asm 2026-02-21T09:10:48.1005942Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r599, %r600, %r601, %r602, %r603, %r604, %r605, %r606, 
%r607, %r608, %r609, %r610, %r611, %r612, %r613, %r614}, [%r683 + 48]; 2026-02-21T09:10:48.1006039Z // end inline asm 2026-02-21T09:10:48.1006107Z // begin inline asm 2026-02-21T09:10:48.1006366Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r616, %r617, %r618, %r619, %r620, %r621, %r622, %r623, %r624, %r625, %r626, %r627, %r628, %r629, %r630, %r631}, [%r683 + 64]; 2026-02-21T09:10:48.1006424Z // end inline asm 2026-02-21T09:10:48.1006492Z // begin inline asm 2026-02-21T09:10:48.1006747Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r633, %r634, %r635, %r636, %r637, %r638, %r639, %r640, %r641, %r642, %r643, %r644, %r645, %r646, %r647, %r648}, [%r683 + 80]; 2026-02-21T09:10:48.1006836Z // end inline asm 2026-02-21T09:10:48.1006895Z // begin inline asm 2026-02-21T09:10:48.1007163Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r650, %r651, %r652, %r653, %r654, %r655, %r656, %r657, %r658, %r659, %r660, %r661, %r662, %r663, %r664, %r665}, [%r683 + 96]; 2026-02-21T09:10:48.1007219Z // end inline asm 2026-02-21T09:10:48.1007307Z // begin inline asm 2026-02-21T09:10:48.1007577Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r667, %r668, %r669, %r670, %r671, %r672, %r673, %r674, %r675, %r676, %r677, %r678, %r679, %r680, %r681, %r682}, [%r683 + 112]; 2026-02-21T09:10:48.1007635Z // end inline asm 2026-02-21T09:10:48.1007692Z // begin inline asm 2026-02-21T09:10:48.1007777Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:10:48.1007834Z // end inline asm 2026-02-21T09:10:48.1007899Z cvt.u64.u32 %rd98, %r548; 2026-02-21T09:10:48.1007969Z cvt.u64.u32 %rd99, %r549; 2026-02-21T09:10:48.1008038Z shl.b64 %rd100, %rd99, 32; 2026-02-21T09:10:48.1008102Z or.b64 %rd101, %rd98, %rd100; 2026-02-21T09:10:48.1008160Z $L__tmp15: 2026-02-21T09:10:48.1008351Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1008418Z mov.b64 {%r688, %r689}, %rd101; 2026-02-21T09:10:48.1008491Z cvt.rn.bf16x2.f32 %r690, %r689, %r688; 2026-02-21T09:10:48.1008555Z $L__tmp16: 
2026-02-21T09:10:48.1008771Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1008838Z cvt.u64.u32 %rd102, %r550; 2026-02-21T09:10:48.1008897Z cvt.u64.u32 %rd103, %r551; 2026-02-21T09:10:48.1008969Z shl.b64 %rd104, %rd103, 32; 2026-02-21T09:10:48.1009035Z or.b64 %rd105, %rd102, %rd104; 2026-02-21T09:10:48.1009088Z $L__tmp17: 2026-02-21T09:10:48.1009267Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1009334Z mov.b64 {%r691, %r692}, %rd105; 2026-02-21T09:10:48.1009405Z cvt.rn.bf16x2.f32 %r693, %r692, %r691; 2026-02-21T09:10:48.1009470Z $L__tmp18: 2026-02-21T09:10:48.1009677Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1009739Z cvt.u64.u32 %rd106, %r552; 2026-02-21T09:10:48.1009800Z cvt.u64.u32 %rd107, %r553; 2026-02-21T09:10:48.1009872Z shl.b64 %rd108, %rd107, 32; 2026-02-21T09:10:48.1009936Z or.b64 %rd109, %rd106, %rd108; 2026-02-21T09:10:48.1009992Z $L__tmp19: 2026-02-21T09:10:48.1010168Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1010229Z mov.b64 {%r694, %r695}, %rd109; 2026-02-21T09:10:48.1010298Z cvt.rn.bf16x2.f32 %r696, %r695, %r694; 2026-02-21T09:10:48.1010362Z $L__tmp20: 2026-02-21T09:10:48.1010566Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1010650Z cvt.u64.u32 %rd110, %r554; 2026-02-21T09:10:48.1010712Z cvt.u64.u32 %rd111, %r555; 2026-02-21T09:10:48.1010783Z shl.b64 %rd112, %rd111, 32; 2026-02-21T09:10:48.1010846Z or.b64 %rd113, %rd110, %rd112; 2026-02-21T09:10:48.1010898Z $L__tmp21: 2026-02-21T09:10:48.1011070Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1011134Z mov.b64 {%r697, %r698}, %rd113; 2026-02-21T09:10:48.1011226Z cvt.rn.bf16x2.f32 %r699, %r698, %r697; 
2026-02-21T09:10:48.1011280Z $L__tmp22: 2026-02-21T09:10:48.1011500Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1011605Z cvt.u64.u32 %rd114, %r556; 2026-02-21T09:10:48.1011669Z cvt.u64.u32 %rd115, %r557; 2026-02-21T09:10:48.1011739Z shl.b64 %rd116, %rd115, 32; 2026-02-21T09:10:48.1011801Z or.b64 %rd117, %rd114, %rd116; 2026-02-21T09:10:48.1011856Z $L__tmp23: 2026-02-21T09:10:48.1012063Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1012127Z mov.b64 {%r700, %r701}, %rd117; 2026-02-21T09:10:48.1012197Z cvt.rn.bf16x2.f32 %r702, %r701, %r700; 2026-02-21T09:10:48.1012253Z $L__tmp24: 2026-02-21T09:10:48.1012503Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1012569Z cvt.u64.u32 %rd118, %r558; 2026-02-21T09:10:48.1012628Z cvt.u64.u32 %rd119, %r559; 2026-02-21T09:10:48.1012705Z shl.b64 %rd120, %rd119, 32; 2026-02-21T09:10:48.1012768Z or.b64 %rd121, %rd118, %rd120; 2026-02-21T09:10:48.1012823Z $L__tmp25: 2026-02-21T09:10:48.1012998Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1013061Z mov.b64 {%r703, %r704}, %rd121; 2026-02-21T09:10:48.1013126Z cvt.rn.bf16x2.f32 %r705, %r704, %r703; 2026-02-21T09:10:48.1013183Z $L__tmp26: 2026-02-21T09:10:48.1013401Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1013462Z cvt.u64.u32 %rd122, %r560; 2026-02-21T09:10:48.1013523Z cvt.u64.u32 %rd123, %r561; 2026-02-21T09:10:48.1013593Z shl.b64 %rd124, %rd123, 32; 2026-02-21T09:10:48.1013654Z or.b64 %rd125, %rd122, %rd124; 2026-02-21T09:10:48.1013710Z $L__tmp27: 2026-02-21T09:10:48.1013884Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1013946Z mov.b64 {%r706, %r707}, %rd125; 2026-02-21T09:10:48.1014015Z 
cvt.rn.bf16x2.f32 %r708, %r707, %r706; 2026-02-21T09:10:48.1014067Z $L__tmp28: 2026-02-21T09:10:48.1014281Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1014342Z cvt.u64.u32 %rd126, %r562; 2026-02-21T09:10:48.1014402Z cvt.u64.u32 %rd127, %r563; 2026-02-21T09:10:48.1014470Z shl.b64 %rd128, %rd127, 32; 2026-02-21T09:10:48.1014532Z or.b64 %rd129, %rd126, %rd128; 2026-02-21T09:10:48.1014586Z $L__tmp29: 2026-02-21T09:10:48.1014750Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1014818Z mov.b64 {%r709, %r710}, %rd129; 2026-02-21T09:10:48.1014884Z cvt.rn.bf16x2.f32 %r711, %r710, %r709; 2026-02-21T09:10:48.1014941Z $L__tmp30: 2026-02-21T09:10:48.1015161Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1015221Z cvt.u64.u32 %rd130, %r565; 2026-02-21T09:10:48.1015281Z cvt.u64.u32 %rd131, %r566; 2026-02-21T09:10:48.1015351Z shl.b64 %rd132, %rd131, 32; 2026-02-21T09:10:48.1015410Z or.b64 %rd133, %rd130, %rd132; 2026-02-21T09:10:48.1015464Z $L__tmp31: 2026-02-21T09:10:48.1015629Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1015733Z mov.b64 {%r712, %r713}, %rd133; 2026-02-21T09:10:48.1015801Z cvt.rn.bf16x2.f32 %r714, %r713, %r712; 2026-02-21T09:10:48.1015855Z $L__tmp32: 2026-02-21T09:10:48.1016066Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1016127Z cvt.u64.u32 %rd134, %r567; 2026-02-21T09:10:48.1016187Z cvt.u64.u32 %rd135, %r568; 2026-02-21T09:10:48.1016280Z shl.b64 %rd136, %rd135, 32; 2026-02-21T09:10:48.1016342Z or.b64 %rd137, %rd134, %rd136; 2026-02-21T09:10:48.1016394Z $L__tmp33: 2026-02-21T09:10:48.1016562Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1016630Z mov.b64 {%r715, %r716}, 
%rd137; 2026-02-21T09:10:48.1016696Z cvt.rn.bf16x2.f32 %r717, %r716, %r715; 2026-02-21T09:10:48.1016750Z $L__tmp34: 2026-02-21T09:10:48.1016964Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1017048Z cvt.u64.u32 %rd138, %r569; 2026-02-21T09:10:48.1017110Z cvt.u64.u32 %rd139, %r570; 2026-02-21T09:10:48.1017178Z shl.b64 %rd140, %rd139, 32; 2026-02-21T09:10:48.1017236Z or.b64 %rd141, %rd138, %rd140; 2026-02-21T09:10:48.1017289Z $L__tmp35: 2026-02-21T09:10:48.1017472Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1017544Z mov.b64 {%r718, %r719}, %rd141; 2026-02-21T09:10:48.1017612Z cvt.rn.bf16x2.f32 %r720, %r719, %r718; 2026-02-21T09:10:48.1017665Z $L__tmp36: 2026-02-21T09:10:48.1017882Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1017942Z cvt.u64.u32 %rd142, %r571; 2026-02-21T09:10:48.1018000Z cvt.u64.u32 %rd143, %r572; 2026-02-21T09:10:48.1018060Z shl.b64 %rd144, %rd143, 32; 2026-02-21T09:10:48.1018130Z or.b64 %rd145, %rd142, %rd144; 2026-02-21T09:10:48.1018182Z $L__tmp37: 2026-02-21T09:10:48.1018346Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1018415Z mov.b64 {%r721, %r722}, %rd145; 2026-02-21T09:10:48.1018479Z cvt.rn.bf16x2.f32 %r723, %r722, %r721; 2026-02-21T09:10:48.1018532Z $L__tmp38: 2026-02-21T09:10:48.1018749Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1018810Z cvt.u64.u32 %rd146, %r573; 2026-02-21T09:10:48.1018870Z cvt.u64.u32 %rd147, %r574; 2026-02-21T09:10:48.1018929Z shl.b64 %rd148, %rd147, 32; 2026-02-21T09:10:48.1018999Z or.b64 %rd149, %rd146, %rd148; 2026-02-21T09:10:48.1019054Z $L__tmp39: 2026-02-21T09:10:48.1019218Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 
2026-02-21T09:10:48.1019287Z mov.b64 {%r724, %r725}, %rd149; 2026-02-21T09:10:48.1019357Z cvt.rn.bf16x2.f32 %r726, %r725, %r724; 2026-02-21T09:10:48.1019407Z $L__tmp40: 2026-02-21T09:10:48.1019622Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1019683Z cvt.u64.u32 %rd150, %r575; 2026-02-21T09:10:48.1019740Z cvt.u64.u32 %rd151, %r576; 2026-02-21T09:10:48.1019801Z shl.b64 %rd152, %rd151, 32; 2026-02-21T09:10:48.1019872Z or.b64 %rd153, %rd150, %rd152; 2026-02-21T09:10:48.1019926Z $L__tmp41: 2026-02-21T09:10:48.1020089Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1020162Z mov.b64 {%r727, %r728}, %rd153; 2026-02-21T09:10:48.1020228Z cvt.rn.bf16x2.f32 %r729, %r728, %r727; 2026-02-21T09:10:48.1020283Z $L__tmp42: 2026-02-21T09:10:48.1020498Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1020579Z cvt.u64.u32 %rd154, %r577; 2026-02-21T09:10:48.1020638Z cvt.u64.u32 %rd155, %r578; 2026-02-21T09:10:48.1020700Z shl.b64 %rd156, %rd155, 32; 2026-02-21T09:10:48.1020769Z or.b64 %rd157, %rd154, %rd156; 2026-02-21T09:10:48.1020821Z $L__tmp43: 2026-02-21T09:10:48.1020985Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1021054Z mov.b64 {%r730, %r731}, %rd157; 2026-02-21T09:10:48.1021123Z cvt.rn.bf16x2.f32 %r732, %r731, %r730; 2026-02-21T09:10:48.1021199Z $L__tmp44: 2026-02-21T09:10:48.1021404Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1021471Z cvt.u64.u32 %rd158, %r579; 2026-02-21T09:10:48.1021529Z cvt.u64.u32 %rd159, %r580; 2026-02-21T09:10:48.1021636Z shl.b64 %rd160, %rd159, 32; 2026-02-21T09:10:48.1021704Z or.b64 %rd161, %rd158, %rd160; 2026-02-21T09:10:48.1021757Z $L__tmp45: 2026-02-21T09:10:48.1021921Z .loc 1 97 28 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1022036Z mov.b64 {%r733, %r734}, %rd161; 2026-02-21T09:10:48.1022103Z cvt.rn.bf16x2.f32 %r735, %r734, %r733; 2026-02-21T09:10:48.1022157Z $L__tmp46: 2026-02-21T09:10:48.1022366Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1022458Z cvt.u64.u32 %rd162, %r582; 2026-02-21T09:10:48.1022519Z cvt.u64.u32 %rd163, %r583; 2026-02-21T09:10:48.1022580Z shl.b64 %rd164, %rd163, 32; 2026-02-21T09:10:48.1022648Z or.b64 %rd165, %rd162, %rd164; 2026-02-21T09:10:48.1022702Z $L__tmp47: 2026-02-21T09:10:48.1022865Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1022930Z mov.b64 {%r736, %r737}, %rd165; 2026-02-21T09:10:48.1022996Z cvt.rn.bf16x2.f32 %r738, %r737, %r736; 2026-02-21T09:10:48.1023050Z $L__tmp48: 2026-02-21T09:10:48.1023258Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1023326Z cvt.u64.u32 %rd166, %r584; 2026-02-21T09:10:48.1023382Z cvt.u64.u32 %rd167, %r585; 2026-02-21T09:10:48.1023442Z shl.b64 %rd168, %rd167, 32; 2026-02-21T09:10:48.1023511Z or.b64 %rd169, %rd166, %rd168; 2026-02-21T09:10:48.1023562Z $L__tmp49: 2026-02-21T09:10:48.1023726Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1023794Z mov.b64 {%r739, %r740}, %rd169; 2026-02-21T09:10:48.1023860Z cvt.rn.bf16x2.f32 %r741, %r740, %r739; 2026-02-21T09:10:48.1023913Z $L__tmp50: 2026-02-21T09:10:48.1024114Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1024182Z cvt.u64.u32 %rd170, %r586; 2026-02-21T09:10:48.1024243Z cvt.u64.u32 %rd171, %r587; 2026-02-21T09:10:48.1024303Z shl.b64 %rd172, %rd171, 32; 2026-02-21T09:10:48.1024369Z or.b64 %rd173, %rd170, %rd172; 2026-02-21T09:10:48.1024421Z $L__tmp51: 
2026-02-21T09:10:48.1024583Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1024641Z mov.b64 {%r742, %r743}, %rd173; 2026-02-21T09:10:48.1024711Z cvt.rn.bf16x2.f32 %r744, %r743, %r742; 2026-02-21T09:10:48.1024762Z $L__tmp52: 2026-02-21T09:10:48.1024967Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1025034Z cvt.u64.u32 %rd174, %r588; 2026-02-21T09:10:48.1025091Z cvt.u64.u32 %rd175, %r589; 2026-02-21T09:10:48.1025150Z shl.b64 %rd176, %rd175, 32; 2026-02-21T09:10:48.1025215Z or.b64 %rd177, %rd174, %rd176; 2026-02-21T09:10:48.1025265Z $L__tmp53: 2026-02-21T09:10:48.1025427Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1025514Z mov.b64 {%r745, %r746}, %rd177; 2026-02-21T09:10:48.1025583Z cvt.rn.bf16x2.f32 %r747, %r746, %r745; 2026-02-21T09:10:48.1025636Z $L__tmp54: 2026-02-21T09:10:48.1025842Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1025910Z cvt.u64.u32 %rd178, %r590; 2026-02-21T09:10:48.1025968Z cvt.u64.u32 %rd179, %r591; 2026-02-21T09:10:48.1026027Z shl.b64 %rd180, %rd179, 32; 2026-02-21T09:10:48.1026119Z or.b64 %rd181, %rd178, %rd180; 2026-02-21T09:10:48.1026172Z $L__tmp55: 2026-02-21T09:10:48.1026334Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1026390Z mov.b64 {%r748, %r749}, %rd181; 2026-02-21T09:10:48.1026461Z cvt.rn.bf16x2.f32 %r750, %r749, %r748; 2026-02-21T09:10:48.1026509Z $L__tmp56: 2026-02-21T09:10:48.1026709Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1026774Z cvt.u64.u32 %rd182, %r592; 2026-02-21T09:10:48.1026831Z cvt.u64.u32 %rd183, %r593; 2026-02-21T09:10:48.1026919Z shl.b64 %rd184, %rd183, 32; 2026-02-21T09:10:48.1026976Z or.b64 %rd185, %rd182, %rd184; 
2026-02-21T09:10:48.1027032Z $L__tmp57: 2026-02-21T09:10:48.1027199Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1027278Z mov.b64 {%r751, %r752}, %rd185; 2026-02-21T09:10:48.1027351Z cvt.rn.bf16x2.f32 %r753, %r752, %r751; 2026-02-21T09:10:48.1027404Z $L__tmp58: 2026-02-21T09:10:48.1027605Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1027666Z cvt.u64.u32 %rd186, %r594; 2026-02-21T09:10:48.1027721Z cvt.u64.u32 %rd187, %r595; 2026-02-21T09:10:48.1027778Z shl.b64 %rd188, %rd187, 32; 2026-02-21T09:10:48.1027835Z or.b64 %rd189, %rd186, %rd188; 2026-02-21T09:10:48.1027891Z $L__tmp59: 2026-02-21T09:10:48.1028050Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1028107Z mov.b64 {%r754, %r755}, %rd189; 2026-02-21T09:10:48.1028179Z cvt.rn.bf16x2.f32 %r756, %r755, %r754; 2026-02-21T09:10:48.1028229Z $L__tmp60: 2026-02-21T09:10:48.1028438Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1028506Z cvt.u64.u32 %rd190, %r596; 2026-02-21T09:10:48.1028562Z cvt.u64.u32 %rd191, %r597; 2026-02-21T09:10:48.1028623Z shl.b64 %rd192, %rd191, 32; 2026-02-21T09:10:48.1028678Z or.b64 %rd193, %rd190, %rd192; 2026-02-21T09:10:48.1028733Z $L__tmp61: 2026-02-21T09:10:48.1028893Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1028952Z mov.b64 {%r757, %r758}, %rd193; 2026-02-21T09:10:48.1029024Z cvt.rn.bf16x2.f32 %r759, %r758, %r757; 2026-02-21T09:10:48.1029076Z $L__tmp62: 2026-02-21T09:10:48.1029280Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1029340Z cvt.u64.u32 %rd194, %r599; 2026-02-21T09:10:48.1029396Z cvt.u64.u32 %rd195, %r600; 2026-02-21T09:10:48.1029452Z shl.b64 %rd196, %rd195, 32; 2026-02-21T09:10:48.1029510Z 
or.b64 %rd197, %rd194, %rd196; 2026-02-21T09:10:48.1029568Z $L__tmp63: 2026-02-21T09:10:48.1029730Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1029792Z mov.b64 {%r760, %r761}, %rd197; 2026-02-21T09:10:48.1029860Z cvt.rn.bf16x2.f32 %r762, %r761, %r760; 2026-02-21T09:10:48.1029911Z $L__tmp64: 2026-02-21T09:10:48.1030115Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1030177Z cvt.u64.u32 %rd198, %r601; 2026-02-21T09:10:48.1030233Z cvt.u64.u32 %rd199, %r602; 2026-02-21T09:10:48.1030308Z shl.b64 %rd200, %rd199, 32; 2026-02-21T09:10:48.1030363Z or.b64 %rd201, %rd198, %rd200; 2026-02-21T09:10:48.1030417Z $L__tmp65: 2026-02-21T09:10:48.1030580Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1030638Z mov.b64 {%r763, %r764}, %rd201; 2026-02-21T09:10:48.1030708Z cvt.rn.bf16x2.f32 %r765, %r764, %r763; 2026-02-21T09:10:48.1030758Z $L__tmp66: 2026-02-21T09:10:48.1030979Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1031036Z cvt.u64.u32 %rd202, %r603; 2026-02-21T09:10:48.1031096Z cvt.u64.u32 %rd203, %r604; 2026-02-21T09:10:48.1031153Z shl.b64 %rd204, %rd203, 32; 2026-02-21T09:10:48.1031208Z or.b64 %rd205, %rd202, %rd204; 2026-02-21T09:10:48.1031260Z $L__tmp67: 2026-02-21T09:10:48.1031422Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1031483Z mov.b64 {%r766, %r767}, %rd205; 2026-02-21T09:10:48.1031613Z cvt.rn.bf16x2.f32 %r768, %r767, %r766; 2026-02-21T09:10:48.1031664Z $L__tmp68: 2026-02-21T09:10:48.1031866Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1031922Z cvt.u64.u32 %rd206, %r605; 2026-02-21T09:10:48.1031982Z cvt.u64.u32 %rd207, %r606; 2026-02-21T09:10:48.1032065Z shl.b64 %rd208, %rd207, 
32; 2026-02-21T09:10:48.1032131Z or.b64 %rd209, %rd206, %rd208; 2026-02-21T09:10:48.1032181Z $L__tmp69: 2026-02-21T09:10:48.1032345Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1032406Z mov.b64 {%r769, %r770}, %rd209; 2026-02-21T09:10:48.1032468Z cvt.rn.bf16x2.f32 %r771, %r770, %r769; 2026-02-21T09:10:48.1032517Z $L__tmp70: 2026-02-21T09:10:48.1032721Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1032778Z cvt.u64.u32 %rd210, %r607; 2026-02-21T09:10:48.1032833Z cvt.u64.u32 %rd211, %r608; 2026-02-21T09:10:48.1032893Z shl.b64 %rd212, %rd211, 32; 2026-02-21T09:10:48.1032947Z or.b64 %rd213, %rd210, %rd212; 2026-02-21T09:10:48.1032995Z $L__tmp71: 2026-02-21T09:10:48.1033152Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1033213Z mov.b64 {%r772, %r773}, %rd213; 2026-02-21T09:10:48.1033274Z cvt.rn.bf16x2.f32 %r774, %r773, %r772; 2026-02-21T09:10:48.1033322Z $L__tmp72: 2026-02-21T09:10:48.1033527Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1033581Z cvt.u64.u32 %rd214, %r609; 2026-02-21T09:10:48.1033634Z cvt.u64.u32 %rd215, %r610; 2026-02-21T09:10:48.1033689Z shl.b64 %rd216, %rd215, 32; 2026-02-21T09:10:48.1033747Z or.b64 %rd217, %rd214, %rd216; 2026-02-21T09:10:48.1033795Z $L__tmp73: 2026-02-21T09:10:48.1033956Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1034017Z mov.b64 {%r775, %r776}, %rd217; 2026-02-21T09:10:48.1034078Z cvt.rn.bf16x2.f32 %r777, %r776, %r775; 2026-02-21T09:10:48.1034125Z $L__tmp74: 2026-02-21T09:10:48.1034333Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1034389Z cvt.u64.u32 %rd218, %r611; 2026-02-21T09:10:48.1034443Z cvt.u64.u32 %rd219, %r612; 
2026-02-21T09:10:48.1034498Z shl.b64 %rd220, %rd219, 32; 2026-02-21T09:10:48.1034556Z or.b64 %rd221, %rd218, %rd220; 2026-02-21T09:10:48.1034603Z $L__tmp75: 2026-02-21T09:10:48.1034760Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1034818Z mov.b64 {%r778, %r779}, %rd221; 2026-02-21T09:10:48.1034879Z cvt.rn.bf16x2.f32 %r780, %r779, %r778; 2026-02-21T09:10:48.1034955Z $L__tmp76: 2026-02-21T09:10:48.1035161Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1035216Z cvt.u64.u32 %rd222, %r613; 2026-02-21T09:10:48.1035270Z cvt.u64.u32 %rd223, %r614; 2026-02-21T09:10:48.1035324Z shl.b64 %rd224, %rd223, 32; 2026-02-21T09:10:48.1035383Z or.b64 %rd225, %rd222, %rd224; 2026-02-21T09:10:48.1035457Z $L__tmp77: 2026-02-21T09:10:48.1035614Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1035673Z mov.b64 {%r781, %r782}, %rd225; 2026-02-21T09:10:48.1035736Z cvt.rn.bf16x2.f32 %r783, %r782, %r781; 2026-02-21T09:10:48.1035785Z $L__tmp78: 2026-02-21T09:10:48.1035989Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1036043Z cvt.u64.u32 %rd226, %r616; 2026-02-21T09:10:48.1036098Z cvt.u64.u32 %rd227, %r617; 2026-02-21T09:10:48.1036151Z shl.b64 %rd228, %rd227, 32; 2026-02-21T09:10:48.1036226Z or.b64 %rd229, %rd226, %rd228; 2026-02-21T09:10:48.1036275Z $L__tmp79: 2026-02-21T09:10:48.1036433Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1036492Z mov.b64 {%r784, %r785}, %rd229; 2026-02-21T09:10:48.1036573Z cvt.rn.bf16x2.f32 %r786, %r785, %r784; 2026-02-21T09:10:48.1036622Z $L__tmp80: 2026-02-21T09:10:48.1036818Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1036876Z cvt.u64.u32 %rd230, %r618; 
2026-02-21T09:10:48.1036930Z cvt.u64.u32 %rd231, %r619; 2026-02-21T09:10:48.1036984Z shl.b64 %rd232, %rd231, 32; 2026-02-21T09:10:48.1037043Z or.b64 %rd233, %rd230, %rd232; 2026-02-21T09:10:48.1037092Z $L__tmp81: 2026-02-21T09:10:48.1037248Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1037309Z mov.b64 {%r787, %r788}, %rd233; 2026-02-21T09:10:48.1037373Z cvt.rn.bf16x2.f32 %r789, %r788, %r787; 2026-02-21T09:10:48.1037422Z $L__tmp82: 2026-02-21T09:10:48.1037631Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1037694Z cvt.u64.u32 %rd234, %r620; 2026-02-21T09:10:48.1037754Z cvt.u64.u32 %rd235, %r621; 2026-02-21T09:10:48.1037813Z shl.b64 %rd236, %rd235, 32; 2026-02-21T09:10:48.1037874Z or.b64 %rd237, %rd234, %rd236; 2026-02-21T09:10:48.1037923Z $L__tmp83: 2026-02-21T09:10:48.1038091Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1038152Z mov.b64 {%r790, %r791}, %rd237; 2026-02-21T09:10:48.1038218Z cvt.rn.bf16x2.f32 %r792, %r791, %r790; 2026-02-21T09:10:48.1038268Z $L__tmp84: 2026-02-21T09:10:48.1038476Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1038540Z cvt.u64.u32 %rd238, %r622; 2026-02-21T09:10:48.1038604Z cvt.u64.u32 %rd239, %r623; 2026-02-21T09:10:48.1038661Z shl.b64 %rd240, %rd239, 32; 2026-02-21T09:10:48.1038724Z or.b64 %rd241, %rd238, %rd240; 2026-02-21T09:10:48.1038774Z $L__tmp85: 2026-02-21T09:10:48.1038942Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1039005Z mov.b64 {%r793, %r794}, %rd241; 2026-02-21T09:10:48.1039069Z cvt.rn.bf16x2.f32 %r795, %r794, %r793; 2026-02-21T09:10:48.1039119Z $L__tmp86: 2026-02-21T09:10:48.1039326Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 
2026-02-21T09:10:48.1039388Z cvt.u64.u32 %rd242, %r624; 2026-02-21T09:10:48.1039443Z cvt.u64.u32 %rd243, %r625; 2026-02-21T09:10:48.1039501Z shl.b64 %rd244, %rd243, 32; 2026-02-21T09:10:48.1039587Z or.b64 %rd245, %rd242, %rd244; 2026-02-21T09:10:48.1039638Z $L__tmp87: 2026-02-21T09:10:48.1039807Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1039868Z mov.b64 {%r796, %r797}, %rd245; 2026-02-21T09:10:48.1039932Z cvt.rn.bf16x2.f32 %r798, %r797, %r796; 2026-02-21T09:10:48.1039984Z $L__tmp88: 2026-02-21T09:10:48.1040194Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1040292Z cvt.u64.u32 %rd246, %r626; 2026-02-21T09:10:48.1040347Z cvt.u64.u32 %rd247, %r627; 2026-02-21T09:10:48.1040405Z shl.b64 %rd248, %rd247, 32; 2026-02-21T09:10:48.1040465Z or.b64 %rd249, %rd246, %rd248; 2026-02-21T09:10:48.1040516Z $L__tmp89: 2026-02-21T09:10:48.1040684Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1040745Z mov.b64 {%r799, %r800}, %rd249; 2026-02-21T09:10:48.1040812Z cvt.rn.bf16x2.f32 %r801, %r800, %r799; 2026-02-21T09:10:48.1040861Z $L__tmp90: 2026-02-21T09:10:48.1041099Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1041162Z cvt.u64.u32 %rd250, %r628; 2026-02-21T09:10:48.1041222Z cvt.u64.u32 %rd251, %r629; 2026-02-21T09:10:48.1041279Z shl.b64 %rd252, %rd251, 32; 2026-02-21T09:10:48.1041363Z or.b64 %rd253, %rd250, %rd252; 2026-02-21T09:10:48.1041414Z $L__tmp91: 2026-02-21T09:10:48.1041622Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1041682Z mov.b64 {%r802, %r803}, %rd253; 2026-02-21T09:10:48.1041749Z cvt.rn.bf16x2.f32 %r804, %r803, %r802; 2026-02-21T09:10:48.1041799Z $L__tmp92: 2026-02-21T09:10:48.1042010Z .loc 2 291 36 // standard.py:291:36 @[ 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1042072Z cvt.u64.u32 %rd254, %r630; 2026-02-21T09:10:48.1042127Z cvt.u64.u32 %rd255, %r631; 2026-02-21T09:10:48.1042185Z shl.b64 %rd256, %rd255, 32; 2026-02-21T09:10:48.1042244Z or.b64 %rd257, %rd254, %rd256; 2026-02-21T09:10:48.1042294Z $L__tmp93: 2026-02-21T09:10:48.1042459Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1042516Z mov.b64 {%r805, %r806}, %rd257; 2026-02-21T09:10:48.1042584Z cvt.rn.bf16x2.f32 %r807, %r806, %r805; 2026-02-21T09:10:48.1042635Z $L__tmp94: 2026-02-21T09:10:48.1042844Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1042904Z cvt.u64.u32 %rd258, %r633; 2026-02-21T09:10:48.1042960Z cvt.u64.u32 %rd259, %r634; 2026-02-21T09:10:48.1043017Z shl.b64 %rd260, %rd259, 32; 2026-02-21T09:10:48.1043075Z or.b64 %rd261, %rd258, %rd260; 2026-02-21T09:10:48.1043125Z $L__tmp95: 2026-02-21T09:10:48.1043295Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1043354Z mov.b64 {%r808, %r809}, %rd261; 2026-02-21T09:10:48.1043421Z cvt.rn.bf16x2.f32 %r810, %r809, %r808; 2026-02-21T09:10:48.1043471Z $L__tmp96: 2026-02-21T09:10:48.1043686Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1043748Z cvt.u64.u32 %rd262, %r635; 2026-02-21T09:10:48.1043805Z cvt.u64.u32 %rd263, %r636; 2026-02-21T09:10:48.1043861Z shl.b64 %rd264, %rd263, 32; 2026-02-21T09:10:48.1043920Z or.b64 %rd265, %rd262, %rd264; 2026-02-21T09:10:48.1043970Z $L__tmp97: 2026-02-21T09:10:48.1044137Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1044194Z mov.b64 {%r811, %r812}, %rd265; 2026-02-21T09:10:48.1044261Z cvt.rn.bf16x2.f32 %r813, %r812, %r811; 2026-02-21T09:10:48.1044310Z $L__tmp98: 2026-02-21T09:10:48.1044548Z .loc 2 
291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1044608Z cvt.u64.u32 %rd266, %r637; 2026-02-21T09:10:48.1044663Z cvt.u64.u32 %rd267, %r638; 2026-02-21T09:10:48.1044719Z shl.b64 %rd268, %rd267, 32; 2026-02-21T09:10:48.1044775Z or.b64 %rd269, %rd266, %rd268; 2026-02-21T09:10:48.1044827Z $L__tmp99: 2026-02-21T09:10:48.1044992Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1045078Z mov.b64 {%r814, %r815}, %rd269; 2026-02-21T09:10:48.1045145Z cvt.rn.bf16x2.f32 %r816, %r815, %r814; 2026-02-21T09:10:48.1045197Z $L__tmp100: 2026-02-21T09:10:48.1045409Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1045479Z cvt.u64.u32 %rd270, %r639; 2026-02-21T09:10:48.1045532Z cvt.u64.u32 %rd271, %r640; 2026-02-21T09:10:48.1045587Z shl.b64 %rd272, %rd271, 32; 2026-02-21T09:10:48.1045641Z or.b64 %rd273, %rd270, %rd272; 2026-02-21T09:10:48.1045715Z $L__tmp101: 2026-02-21T09:10:48.1045876Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1045931Z mov.b64 {%r817, %r818}, %rd273; 2026-02-21T09:10:48.1045996Z cvt.rn.bf16x2.f32 %r819, %r818, %r817; 2026-02-21T09:10:48.1046044Z $L__tmp102: 2026-02-21T09:10:48.1046267Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1046326Z cvt.u64.u32 %rd274, %r641; 2026-02-21T09:10:48.1046380Z cvt.u64.u32 %rd275, %r642; 2026-02-21T09:10:48.1046437Z shl.b64 %rd276, %rd275, 32; 2026-02-21T09:10:48.1046495Z or.b64 %rd277, %rd274, %rd276; 2026-02-21T09:10:48.1046554Z $L__tmp103: 2026-02-21T09:10:48.1046718Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1046778Z mov.b64 {%r820, %r821}, %rd277; 2026-02-21T09:10:48.1046852Z cvt.rn.bf16x2.f32 %r822, %r821, %r820; 2026-02-21T09:10:48.1046906Z $L__tmp104: 
2026-02-21T09:10:48.1047111Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1047175Z cvt.u64.u32 %rd278, %r643; 2026-02-21T09:10:48.1047231Z cvt.u64.u32 %rd279, %r644; 2026-02-21T09:10:48.1047289Z shl.b64 %rd280, %rd279, 32; 2026-02-21T09:10:48.1047348Z or.b64 %rd281, %rd278, %rd280; 2026-02-21T09:10:48.1047406Z $L__tmp105: 2026-02-21T09:10:48.1047564Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1047621Z mov.b64 {%r823, %r824}, %rd281; 2026-02-21T09:10:48.1047689Z cvt.rn.bf16x2.f32 %r825, %r824, %r823; 2026-02-21T09:10:48.1047741Z $L__tmp106: 2026-02-21T09:10:48.1047948Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1048007Z cvt.u64.u32 %rd282, %r645; 2026-02-21T09:10:48.1048070Z cvt.u64.u32 %rd283, %r646; 2026-02-21T09:10:48.1048128Z shl.b64 %rd284, %rd283, 32; 2026-02-21T09:10:48.1048185Z or.b64 %rd285, %rd282, %rd284; 2026-02-21T09:10:48.1048241Z $L__tmp107: 2026-02-21T09:10:48.1048403Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1048463Z mov.b64 {%r826, %r827}, %rd285; 2026-02-21T09:10:48.1048535Z cvt.rn.bf16x2.f32 %r828, %r827, %r826; 2026-02-21T09:10:48.1048586Z $L__tmp108: 2026-02-21T09:10:48.1048792Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1048849Z cvt.u64.u32 %rd286, %r647; 2026-02-21T09:10:48.1048912Z cvt.u64.u32 %rd287, %r648; 2026-02-21T09:10:48.1048970Z shl.b64 %rd288, %rd287, 32; 2026-02-21T09:10:48.1049027Z or.b64 %rd289, %rd286, %rd288; 2026-02-21T09:10:48.1049108Z $L__tmp109: 2026-02-21T09:10:48.1049271Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1049331Z mov.b64 {%r829, %r830}, %rd289; 2026-02-21T09:10:48.1049399Z cvt.rn.bf16x2.f32 %r831, %r830, %r829; 
2026-02-21T09:10:48.1049449Z $L__tmp110: 2026-02-21T09:10:48.1049650Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1049728Z cvt.u64.u32 %rd290, %r650; 2026-02-21T09:10:48.1049790Z cvt.u64.u32 %rd291, %r651; 2026-02-21T09:10:48.1049846Z shl.b64 %rd292, %rd291, 32; 2026-02-21T09:10:48.1049903Z or.b64 %rd293, %rd290, %rd292; 2026-02-21T09:10:48.1049960Z $L__tmp111: 2026-02-21T09:10:48.1050124Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1050182Z mov.b64 {%r832, %r833}, %rd293; 2026-02-21T09:10:48.1050250Z cvt.rn.bf16x2.f32 %r834, %r833, %r832; 2026-02-21T09:10:48.1050301Z $L__tmp112: 2026-02-21T09:10:48.1050521Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1050580Z cvt.u64.u32 %rd294, %r652; 2026-02-21T09:10:48.1050642Z cvt.u64.u32 %rd295, %r653; 2026-02-21T09:10:48.1050697Z shl.b64 %rd296, %rd295, 32; 2026-02-21T09:10:48.1050754Z or.b64 %rd297, %rd294, %rd296; 2026-02-21T09:10:48.1050830Z $L__tmp113: 2026-02-21T09:10:48.1050991Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1051047Z mov.b64 {%r835, %r836}, %rd297; 2026-02-21T09:10:48.1051112Z cvt.rn.bf16x2.f32 %r837, %r836, %r835; 2026-02-21T09:10:48.1051164Z $L__tmp114: 2026-02-21T09:10:48.1051364Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1051419Z cvt.u64.u32 %rd298, %r654; 2026-02-21T09:10:48.1051483Z cvt.u64.u32 %rd299, %r655; 2026-02-21T09:10:48.1051586Z shl.b64 %rd300, %rd299, 32; 2026-02-21T09:10:48.1051648Z or.b64 %rd301, %rd298, %rd300; 2026-02-21T09:10:48.1051704Z $L__tmp115: 2026-02-21T09:10:48.1051865Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1051923Z mov.b64 {%r838, %r839}, %rd301; 2026-02-21T09:10:48.1051989Z 
cvt.rn.bf16x2.f32 %r840, %r839, %r838; 2026-02-21T09:10:48.1052046Z $L__tmp116: 2026-02-21T09:10:48.1052248Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1052305Z cvt.u64.u32 %rd302, %r656; 2026-02-21T09:10:48.1052367Z cvt.u64.u32 %rd303, %r657; 2026-02-21T09:10:48.1052423Z shl.b64 %rd304, %rd303, 32; 2026-02-21T09:10:48.1052479Z or.b64 %rd305, %rd302, %rd304; 2026-02-21T09:10:48.1052536Z $L__tmp117: 2026-02-21T09:10:48.1052696Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1052755Z mov.b64 {%r841, %r842}, %rd305; 2026-02-21T09:10:48.1052820Z cvt.rn.bf16x2.f32 %r843, %r842, %r841; 2026-02-21T09:10:48.1052877Z $L__tmp118: 2026-02-21T09:10:48.1053080Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1053137Z cvt.u64.u32 %rd306, %r658; 2026-02-21T09:10:48.1053199Z cvt.u64.u32 %rd307, %r659; 2026-02-21T09:10:48.1053257Z shl.b64 %rd308, %rd307, 32; 2026-02-21T09:10:48.1053314Z or.b64 %rd309, %rd306, %rd308; 2026-02-21T09:10:48.1053362Z $L__tmp119: 2026-02-21T09:10:48.1053527Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1053585Z mov.b64 {%r844, %r845}, %rd309; 2026-02-21T09:10:48.1053649Z cvt.rn.bf16x2.f32 %r846, %r845, %r844; 2026-02-21T09:10:48.1053709Z $L__tmp120: 2026-02-21T09:10:48.1053913Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1053997Z cvt.u64.u32 %rd310, %r660; 2026-02-21T09:10:48.1054064Z cvt.u64.u32 %rd311, %r661; 2026-02-21T09:10:48.1054122Z shl.b64 %rd312, %rd311, 32; 2026-02-21T09:10:48.1054179Z or.b64 %rd313, %rd310, %rd312; 2026-02-21T09:10:48.1054229Z $L__tmp121: 2026-02-21T09:10:48.1054399Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1054482Z mov.b64 {%r847, %r848}, 
%rd313; 2026-02-21T09:10:48.1054546Z cvt.rn.bf16x2.f32 %r849, %r848, %r847; 2026-02-21T09:10:48.1054604Z $L__tmp122: 2026-02-21T09:10:48.1054806Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1054862Z cvt.u64.u32 %rd314, %r662; 2026-02-21T09:10:48.1054924Z cvt.u64.u32 %rd315, %r663; 2026-02-21T09:10:48.1054983Z shl.b64 %rd316, %rd315, 32; 2026-02-21T09:10:48.1055044Z or.b64 %rd317, %rd314, %rd316; 2026-02-21T09:10:48.1055096Z $L__tmp123: 2026-02-21T09:10:48.1055287Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1055348Z mov.b64 {%r850, %r851}, %rd317; 2026-02-21T09:10:48.1055413Z cvt.rn.bf16x2.f32 %r852, %r851, %r850; 2026-02-21T09:10:48.1055474Z $L__tmp124: 2026-02-21T09:10:48.1055699Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1055757Z cvt.u64.u32 %rd318, %r664; 2026-02-21T09:10:48.1055821Z cvt.u64.u32 %rd319, %r665; 2026-02-21T09:10:48.1055878Z shl.b64 %rd320, %rd319, 32; 2026-02-21T09:10:48.1055936Z or.b64 %rd321, %rd318, %rd320; 2026-02-21T09:10:48.1055986Z $L__tmp125: 2026-02-21T09:10:48.1056151Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1056209Z mov.b64 {%r853, %r854}, %rd321; 2026-02-21T09:10:48.1056274Z cvt.rn.bf16x2.f32 %r855, %r854, %r853; 2026-02-21T09:10:48.1056330Z $L__tmp126: 2026-02-21T09:10:48.1056537Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1056594Z cvt.u64.u32 %rd322, %r667; 2026-02-21T09:10:48.1056656Z cvt.u64.u32 %rd323, %r668; 2026-02-21T09:10:48.1056713Z shl.b64 %rd324, %rd323, 32; 2026-02-21T09:10:48.1056770Z or.b64 %rd325, %rd322, %rd324; 2026-02-21T09:10:48.1056820Z $L__tmp127: 2026-02-21T09:10:48.1056989Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 
2026-02-21T09:10:48.1057046Z mov.b64 {%r856, %r857}, %rd325; 2026-02-21T09:10:48.1057110Z cvt.rn.bf16x2.f32 %r858, %r857, %r856; 2026-02-21T09:10:48.1057165Z $L__tmp128: 2026-02-21T09:10:48.1057368Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1057425Z cvt.u64.u32 %rd326, %r669; 2026-02-21T09:10:48.1057481Z cvt.u64.u32 %rd327, %r670; 2026-02-21T09:10:48.1057547Z shl.b64 %rd328, %rd327, 32; 2026-02-21T09:10:48.1057602Z or.b64 %rd329, %rd326, %rd328; 2026-02-21T09:10:48.1057653Z $L__tmp129: 2026-02-21T09:10:48.1057817Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1057876Z mov.b64 {%r859, %r860}, %rd329; 2026-02-21T09:10:48.1057940Z cvt.rn.bf16x2.f32 %r861, %r860, %r859; 2026-02-21T09:10:48.1057996Z $L__tmp130: 2026-02-21T09:10:48.1058201Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1058256Z cvt.u64.u32 %rd330, %r671; 2026-02-21T09:10:48.1058310Z cvt.u64.u32 %rd331, %r672; 2026-02-21T09:10:48.1058371Z shl.b64 %rd332, %rd331, 32; 2026-02-21T09:10:48.1058428Z or.b64 %rd333, %rd330, %rd332; 2026-02-21T09:10:48.1058478Z $L__tmp131: 2026-02-21T09:10:48.1058677Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1058736Z mov.b64 {%r862, %r863}, %rd333; 2026-02-21T09:10:48.1058800Z cvt.rn.bf16x2.f32 %r864, %r863, %r862; 2026-02-21T09:10:48.1058854Z $L__tmp132: 2026-02-21T09:10:48.1059060Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1059119Z cvt.u64.u32 %rd334, %r673; 2026-02-21T09:10:48.1059207Z cvt.u64.u32 %rd335, %r674; 2026-02-21T09:10:48.1059267Z shl.b64 %rd336, %rd335, 32; 2026-02-21T09:10:48.1059322Z or.b64 %rd337, %rd334, %rd336; 2026-02-21T09:10:48.1059372Z $L__tmp133: 2026-02-21T09:10:48.1059538Z .loc 1 97 28 // 
cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1059595Z mov.b64 {%r865, %r866}, %rd337; 2026-02-21T09:10:48.1059660Z cvt.rn.bf16x2.f32 %r867, %r866, %r865; 2026-02-21T09:10:48.1059708Z $L__tmp134: 2026-02-21T09:10:48.1059941Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1060000Z cvt.u64.u32 %rd338, %r675; 2026-02-21T09:10:48.1060055Z cvt.u64.u32 %rd339, %r676; 2026-02-21T09:10:48.1060117Z shl.b64 %rd340, %rd339, 32; 2026-02-21T09:10:48.1060173Z or.b64 %rd341, %rd338, %rd340; 2026-02-21T09:10:48.1060222Z $L__tmp135: 2026-02-21T09:10:48.1060409Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1060469Z mov.b64 {%r868, %r869}, %rd341; 2026-02-21T09:10:48.1060531Z cvt.rn.bf16x2.f32 %r870, %r869, %r868; 2026-02-21T09:10:48.1060582Z $L__tmp136: 2026-02-21T09:10:48.1060794Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1060850Z cvt.u64.u32 %rd342, %r677; 2026-02-21T09:10:48.1060906Z cvt.u64.u32 %rd343, %r678; 2026-02-21T09:10:48.1060968Z shl.b64 %rd344, %rd343, 32; 2026-02-21T09:10:48.1061024Z or.b64 %rd345, %rd342, %rd344; 2026-02-21T09:10:48.1061075Z $L__tmp137: 2026-02-21T09:10:48.1061241Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1061298Z mov.b64 {%r871, %r872}, %rd345; 2026-02-21T09:10:48.1061361Z cvt.rn.bf16x2.f32 %r873, %r872, %r871; 2026-02-21T09:10:48.1061410Z $L__tmp138: 2026-02-21T09:10:48.1061657Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1061715Z cvt.u64.u32 %rd346, %r679; 2026-02-21T09:10:48.1061771Z cvt.u64.u32 %rd347, %r680; 2026-02-21T09:10:48.1061831Z shl.b64 %rd348, %rd347, 32; 2026-02-21T09:10:48.1061887Z or.b64 %rd349, %rd346, %rd348; 2026-02-21T09:10:48.1061937Z $L__tmp139: 
2026-02-21T09:10:48.1062104Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1062162Z mov.b64 {%r874, %r875}, %rd349; 2026-02-21T09:10:48.1062228Z cvt.rn.bf16x2.f32 %r876, %r875, %r874; 2026-02-21T09:10:48.1062278Z $L__tmp140: 2026-02-21T09:10:48.1062486Z .loc 2 291 36 // standard.py:291:36 @[ cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:94:40 ] 2026-02-21T09:10:48.1062541Z cvt.u64.u32 %rd350, %r681; 2026-02-21T09:10:48.1062596Z cvt.u64.u32 %rd351, %r682; 2026-02-21T09:10:48.1062657Z shl.b64 %rd352, %rd351, 32; 2026-02-21T09:10:48.1062715Z or.b64 %rd353, %rd350, %rd352; 2026-02-21T09:10:48.1062765Z $L__tmp141: 2026-02-21T09:10:48.1062925Z .loc 1 97 28 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:97:28 2026-02-21T09:10:48.1062989Z mov.b64 {%r877, %r878}, %rd353; 2026-02-21T09:10:48.1063053Z cvt.rn.bf16x2.f32 %r879, %r878, %r877; 2026-02-21T09:10:48.1063211Z .loc 1 98 43 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:98:43 2026-02-21T09:10:48.1063340Z st.shared.v4.b32 [%r13], {%r690, %r693, %r696, %r699}; 2026-02-21T09:10:48.1063444Z st.shared.v4.b32 [%r13+16384], {%r786, %r789, %r792, %r795}; 2026-02-21T09:10:48.1063536Z st.shared.v4.b32 [%r14], {%r702, %r705, %r708, %r711}; 2026-02-21T09:10:48.1063638Z st.shared.v4.b32 [%r14+16384], {%r798, %r801, %r804, %r807}; 2026-02-21T09:10:48.1063722Z st.shared.v4.b32 [%r15], {%r714, %r717, %r720, %r723}; 2026-02-21T09:10:48.1063815Z st.shared.v4.b32 [%r15+16384], {%r810, %r813, %r816, %r819}; 2026-02-21T09:10:48.1063931Z st.shared.v4.b32 [%r16], {%r726, %r729, %r732, %r735}; 2026-02-21T09:10:48.1064025Z st.shared.v4.b32 [%r16+16384], {%r822, %r825, %r828, %r831}; 2026-02-21T09:10:48.1064110Z st.shared.v4.b32 [%r17], {%r738, %r741, %r744, %r747}; 2026-02-21T09:10:48.1064198Z st.shared.v4.b32 [%r17+16384], {%r834, %r837, %r840, %r843}; 2026-02-21T09:10:48.1064285Z st.shared.v4.b32 [%r18], {%r750, %r753, %r756, %r759}; 
2026-02-21T09:10:48.1064376Z st.shared.v4.b32 [%r18+16384], {%r846, %r849, %r852, %r855}; 2026-02-21T09:10:48.1064459Z st.shared.v4.b32 [%r19], {%r762, %r765, %r768, %r771}; 2026-02-21T09:10:48.1064580Z st.shared.v4.b32 [%r19+16384], {%r858, %r861, %r864, %r867}; 2026-02-21T09:10:48.1064664Z st.shared.v4.b32 [%r20], {%r774, %r777, %r780, %r783}; 2026-02-21T09:10:48.1064754Z st.shared.v4.b32 [%r20+16384], {%r870, %r873, %r876, %r879}; 2026-02-21T09:10:48.1064816Z // begin inline asm 2026-02-21T09:10:48.1064913Z fence.proxy.async.shared::cta; 2026-02-21T09:10:48.1064970Z // end inline asm 2026-02-21T09:10:48.1065025Z bar.sync 0; 2026-02-21T09:10:48.1065097Z elect.sync %r880|%p103, -1; 2026-02-21T09:10:48.1065158Z and.pred %p101, %p102, %p103; 2026-02-21T09:10:48.1065215Z and.b32 %r881, %r31, 1; 2026-02-21T09:10:48.1065276Z shl.b32 %r882, %r881, 14; 2026-02-21T09:10:48.1065333Z add.s32 %r686, %r48, %r882; 2026-02-21T09:10:48.1065391Z shl.b32 %r883, %r881, 6; 2026-02-21T09:10:48.1065447Z or.b32 %r684, %r883, %r393; 2026-02-21T09:10:48.1065508Z // begin inline asm 2026-02-21T09:10:48.1065693Z @%p101 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd97, {%r684, %r685}], [%r686]; 2026-02-21T09:10:48.1065749Z // end inline asm 2026-02-21T09:10:48.1065819Z cp.async.bulk.commit_group; 2026-02-21T09:10:48.1065890Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:10:48.1065944Z bar.sync 0; 2026-02-21T09:10:48.1066030Z $L__BB0_8: // %._crit_edge 2026-02-21T09:10:48.1066193Z .loc 1 31 4 // cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py:31:4 2026-02-21T09:10:48.1066246Z bar.sync 0; 2026-02-21T09:10:48.1066300Z // begin inline asm 2026-02-21T09:10:48.1066419Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r884, 256; 2026-02-21T09:10:48.1066472Z // end inline asm 2026-02-21T09:10:48.1066522Z ret; 2026-02-21T09:10:48.1066577Z $L__tmp142: 2026-02-21T09:10:48.1066630Z $L__func_end0: 2026-02-21T09:10:48.1066711Z // -- End function 2026-02-21T09:10:48.1066763Z } 
2026-02-21T09:10:48.1066969Z .file 1 "/tmp/torchinductor_root/za/cza5tfrgmgno6wqn5dr4qvrnkuz3w5pgk3huzzl2g3a2t5nuzob2.py" 2026-02-21T09:10:48.1067140Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:10:48.1067201Z .section .debug_abbrev 2026-02-21T09:10:48.1067254Z { 2026-02-21T09:10:48.1067340Z .b8 1 // Abbreviation Code 2026-02-21T09:10:48.1067425Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:10:48.1067505Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:48.1067582Z .b8 37 // DW_AT_producer 2026-02-21T09:10:48.1067657Z .b8 8 // DW_FORM_string 2026-02-21T09:10:48.1067727Z .b8 19 // DW_AT_language 2026-02-21T09:10:48.1067804Z .b8 5 // DW_FORM_data2 2026-02-21T09:10:48.1067875Z .b8 3 // DW_AT_name 2026-02-21T09:10:48.1068020Z .b8 8 // DW_FORM_string 2026-02-21T09:10:48.1068096Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:10:48.1068170Z .b8 6 // DW_FORM_data4 2026-02-21T09:10:48.1068244Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:10:48.1068314Z .b8 8 // DW_FORM_string 2026-02-21T09:10:48.1068383Z .b8 0 // EOM(1) 2026-02-21T09:10:48.1068474Z .b8 0 // EOM(2) 2026-02-21T09:10:48.1068553Z .b8 2 // Abbreviation Code 2026-02-21T09:10:48.1068632Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:48.1068707Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:48.1068778Z .b8 3 // DW_AT_name 2026-02-21T09:10:48.1068849Z .b8 8 // DW_FORM_string 2026-02-21T09:10:48.1068926Z .b8 32 // DW_AT_inline 2026-02-21T09:10:48.1069016Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:48.1069082Z .b8 0 // EOM(1) 2026-02-21T09:10:48.1069145Z .b8 0 // EOM(2) 2026-02-21T09:10:48.1069224Z .b8 3 // Abbreviation Code 2026-02-21T09:10:48.1069323Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:10:48.1069400Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:10:48.1069477Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:48.1069544Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:48.1069617Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:48.1069689Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:48.1069772Z .b8 49 // 
DW_AT_abstract_origin 2026-02-21T09:10:48.1069841Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:48.1069909Z .b8 0 // EOM(1) 2026-02-21T09:10:48.1069974Z .b8 0 // EOM(2) 2026-02-21T09:10:48.1070048Z .b8 4 // Abbreviation Code 2026-02-21T09:10:48.1070137Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:10:48.1070214Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:10:48.1070293Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:10:48.1070362Z .b8 19 // DW_FORM_ref4 2026-02-21T09:10:48.1070435Z .b8 17 // DW_AT_low_pc 2026-02-21T09:10:48.1070502Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:48.1070572Z .b8 18 // DW_AT_high_pc 2026-02-21T09:10:48.1070645Z .b8 1 // DW_FORM_addr 2026-02-21T09:10:48.1070718Z .b8 88 // DW_AT_call_file 2026-02-21T09:10:48.1070786Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:48.1070857Z .b8 89 // DW_AT_call_line 2026-02-21T09:10:48.1070931Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:48.1071004Z .b8 87 // DW_AT_call_column 2026-02-21T09:10:48.1071074Z .b8 11 // DW_FORM_data1 2026-02-21T09:10:48.1071143Z .b8 0 // EOM(1) 2026-02-21T09:10:48.1071208Z .b8 0 // EOM(2) 2026-02-21T09:10:48.1071271Z .b8 0 // EOM(3) 2026-02-21T09:10:48.1071329Z } 2026-02-21T09:10:48.1071387Z .section .debug_info 2026-02-21T09:10:48.1071435Z { 2026-02-21T09:10:48.1071516Z .b32 178 // Length of Unit 2026-02-21T09:10:48.1071668Z .b8 2 // DWARF version number 2026-02-21T09:10:48.1071719Z .b8 0 2026-02-21T09:10:48.1071834Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:10:48.1071927Z .b8 8 // Address Size (in bytes) 2026-02-21T09:10:48.1072024Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:10:48.1072107Z .b8 116 // DW_AT_producer 2026-02-21T09:10:48.1072194Z .b8 114 2026-02-21T09:10:48.1072248Z .b8 105 2026-02-21T09:10:48.1072298Z .b8 116 2026-02-21T09:10:48.1072347Z .b8 111 2026-02-21T09:10:48.1072403Z .b8 110 2026-02-21T09:10:48.1072457Z .b8 0 2026-02-21T09:10:48.1072531Z .b8 2 // DW_AT_language 2026-02-21T09:10:48.1072587Z .b8 0 2026-02-21T09:10:48.1072661Z .b8 99 // DW_AT_name 2026-02-21T09:10:48.1072713Z .b8 122 2026-02-21T09:10:48.1072762Z .b8 97 2026-02-21T09:10:48.1072820Z .b8 53 2026-02-21T09:10:48.1072871Z .b8 116 2026-02-21T09:10:48.1072920Z .b8 102 2026-02-21T09:10:48.1072998Z .b8 114 2026-02-21T09:10:48.1073048Z .b8 103 2026-02-21T09:10:48.1073097Z .b8 109 2026-02-21T09:10:48.1073145Z .b8 103 2026-02-21T09:10:48.1073202Z .b8 110 2026-02-21T09:10:48.1073250Z .b8 111 2026-02-21T09:10:48.1073298Z .b8 54 2026-02-21T09:10:48.1073348Z .b8 119 2026-02-21T09:10:48.1073405Z .b8 113 2026-02-21T09:10:48.1073493Z .b8 110 2026-02-21T09:10:48.1073546Z .b8 53 2026-02-21T09:10:48.1073602Z .b8 100 2026-02-21T09:10:48.1073652Z .b8 114 2026-02-21T09:10:48.1073699Z .b8 52 2026-02-21T09:10:48.1073749Z .b8 113 2026-02-21T09:10:48.1073805Z .b8 118 2026-02-21T09:10:48.1073853Z .b8 114 2026-02-21T09:10:48.1073903Z .b8 110 2026-02-21T09:10:48.1073956Z .b8 107 2026-02-21T09:10:48.1074003Z .b8 117 2026-02-21T09:10:48.1074052Z .b8 122 2026-02-21T09:10:48.1074100Z .b8 51 2026-02-21T09:10:48.1074153Z .b8 119 2026-02-21T09:10:48.1074200Z .b8 53 2026-02-21T09:10:48.1074248Z .b8 112 2026-02-21T09:10:48.1074298Z .b8 103 2026-02-21T09:10:48.1074352Z .b8 107 2026-02-21T09:10:48.1074400Z .b8 51 2026-02-21T09:10:48.1074451Z .b8 104 2026-02-21T09:10:48.1074505Z .b8 117 2026-02-21T09:10:48.1074553Z .b8 122 2026-02-21T09:10:48.1074601Z .b8 122 2026-02-21T09:10:48.1074649Z .b8 108 2026-02-21T09:10:48.1074703Z .b8 50 
2026-02-21T09:10:48.1074752Z .b8 103 2026-02-21T09:10:48.1074800Z .b8 51 2026-02-21T09:10:48.1074854Z .b8 97 2026-02-21T09:10:48.1074904Z .b8 50 2026-02-21T09:10:48.1074953Z .b8 116 2026-02-21T09:10:48.1075001Z .b8 53 2026-02-21T09:10:48.1075055Z .b8 110 2026-02-21T09:10:48.1075103Z .b8 117 2026-02-21T09:10:48.1075151Z .b8 122 2026-02-21T09:10:48.1075206Z .b8 111 2026-02-21T09:10:48.1075255Z .b8 98 2026-02-21T09:10:48.1075305Z .b8 50 2026-02-21T09:10:48.1075352Z .b8 46 2026-02-21T09:10:48.1075406Z .b8 112 2026-02-21T09:10:48.1075455Z .b8 121 2026-02-21T09:10:48.1075505Z .b8 0 2026-02-21T09:10:48.1075592Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:10:48.1075669Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:10:48.1075719Z .b8 116 2026-02-21T09:10:48.1075770Z .b8 109 2026-02-21T09:10:48.1075825Z .b8 112 2026-02-21T09:10:48.1075874Z .b8 47 2026-02-21T09:10:48.1075922Z .b8 116 2026-02-21T09:10:48.1075970Z .b8 111 2026-02-21T09:10:48.1076026Z .b8 114 2026-02-21T09:10:48.1076073Z .b8 99 2026-02-21T09:10:48.1076122Z .b8 104 2026-02-21T09:10:48.1076176Z .b8 105 2026-02-21T09:10:48.1076225Z .b8 110 2026-02-21T09:10:48.1076277Z .b8 100 2026-02-21T09:10:48.1076328Z .b8 117 2026-02-21T09:10:48.1076384Z .b8 99 2026-02-21T09:10:48.1076433Z .b8 116 2026-02-21T09:10:48.1076481Z .b8 111 2026-02-21T09:10:48.1076533Z .b8 114 2026-02-21T09:10:48.1076582Z .b8 95 2026-02-21T09:10:48.1076632Z .b8 114 2026-02-21T09:10:48.1076681Z .b8 111 2026-02-21T09:10:48.1076738Z .b8 111 2026-02-21T09:10:48.1076789Z .b8 116 2026-02-21T09:10:48.1076837Z .b8 47 2026-02-21T09:10:48.1076888Z .b8 122 2026-02-21T09:10:48.1076943Z .b8 97 2026-02-21T09:10:48.1076993Z .b8 0 2026-02-21T09:10:48.1077116Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:10:48.1077198Z .b8 95 // DW_AT_name 2026-02-21T09:10:48.1077247Z .b8 104 2026-02-21T09:10:48.1077298Z .b8 101 2026-02-21T09:10:48.1077348Z .b8 108 2026-02-21T09:10:48.1077403Z .b8 105 2026-02-21T09:10:48.1077452Z .b8 111 2026-02-21T09:10:48.1077500Z 
.b8 110 2026-02-21T09:10:48.1077555Z .b8 95 2026-02-21T09:10:48.1077606Z .b8 109 2026-02-21T09:10:48.1077674Z .b8 97 2026-02-21T09:10:48.1077723Z .b8 116 2026-02-21T09:10:48.1077778Z .b8 109 2026-02-21T09:10:48.1077827Z .b8 117 2026-02-21T09:10:48.1077876Z .b8 108 2026-02-21T09:10:48.1077928Z .b8 95 2026-02-21T09:10:48.1077975Z .b8 98 2026-02-21T09:10:48.1078024Z .b8 102 2026-02-21T09:10:48.1078073Z .b8 49 2026-02-21T09:10:48.1078127Z .b8 54 2026-02-21T09:10:48.1078175Z .b8 95 2026-02-21T09:10:48.1078223Z .b8 105 2026-02-21T09:10:48.1078276Z .b8 110 2026-02-21T09:10:48.1078324Z .b8 116 2026-02-21T09:10:48.1078374Z .b8 52 2026-02-21T09:10:48.1078424Z .b8 0 2026-02-21T09:10:48.1078502Z .b8 1 // DW_AT_inline 2026-02-21T09:10:48.1078614Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:10:48.1078700Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:10:48.1078791Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:10:48.1078900Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:48.1079013Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:10:48.1079098Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:10:48.1079183Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:10:48.1079265Z .b64 $L__tmp141 // DW_AT_high_pc 2026-02-21T09:10:48.1079341Z .b8 1 // DW_AT_call_file 2026-02-21T09:10:48.1079422Z .b8 94 // DW_AT_call_line 2026-02-21T09:10:48.1079500Z .b8 40 // DW_AT_call_column 2026-02-21T09:10:48.1079582Z .b8 0 // End Of Children Mark 2026-02-21T09:10:48.1079667Z .b8 0 // End Of Children Mark 2026-02-21T09:10:48.1079716Z } 2026-02-21T09:10:48.1079781Z .section .debug_macinfo { } 2026-02-21T09:10:48.1079786Z 2026-02-21T09:10:48.1079868Z ================================================================ 2026-02-21T09:10:48.1079972Z please share the reproducer above with Triton project. 
2026-02-21T09:10:49.3283656Z 2026-02-21T09:10:49.3285506Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 104/104 9.3 configs/s 2026-02-21T09:10:51.5552250Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━ 824/824 360.9 configs/s 2026-02-21T09:10:51.6541866Z [238s] Generation 3 complete: 2026-02-21T09:10:51.6543744Z error=19 2026-02-21T09:10:51.6543908Z timeout=1 2026-02-21T09:10:51.6544061Z ok=87 2026-02-21T09:10:51.6544188Z min=0.2447 2026-02-21T09:10:51.6544319Z mid=0.7322 2026-02-21T09:10:51.6544446Z max=123.4115 2026-02-21T09:10:51.6544604Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:10:51.6544867Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:10:51.6545107Z 'l2_groupings': [2], 2026-02-21T09:10:51.6545289Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:10:51.6545491Z 'loop_orders': [[0, 1]], 2026-02-21T09:10:51.6545655Z 'maxnreg': 128, 2026-02-21T09:10:51.6545797Z 'num_sm_multiplier': 8, 2026-02-21T09:10:51.6545954Z 'num_stages': 3, 2026-02-21T09:10:51.6546086Z 'num_warps': 4, 2026-02-21T09:10:51.6546243Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:10:51.6546434Z 'range_flattens': [None, False], 2026-02-21T09:10:51.6546610Z 'range_multi_buffers': [True, False], 2026-02-21T09:10:51.6546794Z 'range_num_stages': [3, 4], 2026-02-21T09:10:51.6546961Z 'range_unroll_factors': [0, 0], 2026-02-21T09:10:51.6547141Z 'range_warp_specializes': [True, None]} 2026-02-21T09:10:51.6566340Z [238s] Fitting surrogate: 414 points, 414 targets 2026-02-21T09:10:53.2262876Z [240s] Generation 4 starting: 110 neighbors, 5 active search path(s) 2026-02-21T09:11:28.7456027Z [275s] Timeout after 30s compiling Config(block_sizes=[64, 128, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, 
None], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[False, None]) 2026-02-21T09:11:29.4297028Z [276s] Timeout after 30s compiling Config(block_sizes=[128, 128, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[False, None]) 2026-02-21T09:11:29.4316785Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 112/112 0.6 configs/s 2026-02-21T09:11:29.8679741Z [276s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['last', ''], loop_orders=[[1, 0]], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]) 2026-02-21T09:11:29.8680785Z Tensor-likes are not close! 2026-02-21T09:11:29.8680904Z 2026-02-21T09:11:29.8683683Z Mismatched elements: 33451375 / 33554432 (99.7%) 2026-02-21T09:11:29.8684076Z Greatest absolute difference: 1560.0 at index (1672, 2372) (up to 0.01 allowed) 2026-02-21T09:11:29.8684458Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:11:29.8684825Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:11:29.8685010Z 2026-02-21T09:11:34.8533863Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 112/112 20.5 configs/s 2026-02-21T09:11:42.6141195Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━ 838/838 107.5 configs/s 2026-02-21T09:11:42.7867266Z [289s] Generation 4 complete: 2026-02-21T09:11:42.7871494Z error=34 2026-02-21T09:11:42.7875815Z timeout=2 2026-02-21T09:11:42.7878954Z ok=79 2026-02-21T09:11:42.7883455Z min=0.2386 2026-02-21T09:11:42.7887573Z mid=0.4251 2026-02-21T09:11:42.7889619Z max=10.7347 2026-02-21T09:11:42.7889804Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:11:42.7890084Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:11:42.7890340Z 'l2_groupings': [2], 2026-02-21T09:11:42.7890528Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:11:42.7890741Z 'loop_orders': [[0, 1]], 2026-02-21T09:11:42.7890925Z 'maxnreg': 128, 2026-02-21T09:11:42.7891072Z 'num_sm_multiplier': 4, 2026-02-21T09:11:42.7891236Z 'num_stages': 3, 2026-02-21T09:11:42.7891386Z 'num_warps': 4, 2026-02-21T09:11:42.7891619Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:11:42.7891829Z 'range_flattens': [None, None], 2026-02-21T09:11:42.7892012Z 'range_multi_buffers': [True, False], 2026-02-21T09:11:42.7892205Z 'range_num_stages': [3, 4], 2026-02-21T09:11:42.7892377Z 'range_unroll_factors': [0, 0], 2026-02-21T09:11:42.7892573Z 'range_warp_specializes': [True, None]} 2026-02-21T09:11:42.7892796Z [289s] Fitting surrogate: 529 points, 529 targets 2026-02-21T09:11:44.3952821Z [291s] Generation 5 starting: 109 neighbors, 5 active search path(s) 2026-02-21T09:12:22.2842900Z [329s] Timeout after 30s compiling Config(block_sizes=[64, 256, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], 
range_multi_buffers=[False, None], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:12:22.2857654Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 110/110 0.7 configs/s 2026-02-21T09:12:24.7419492Z [331s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 256, 128], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=16, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[0, 1], range_warp_specializes=[False, None]) 2026-02-21T09:12:24.7421036Z Tensor-likes are not close! 2026-02-21T09:12:24.7425846Z 2026-02-21T09:12:24.7427430Z Mismatched elements: 33444714 / 33554432 (99.7%) 2026-02-21T09:12:24.7427772Z Greatest absolute difference: 1392.0 at index (957, 4101) (up to 0.01 allowed) 2026-02-21T09:12:24.7428331Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:12:24.7428669Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:12:24.7428841Z 2026-02-21T09:12:25.2230715Z [332s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 256, 128], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=16, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[3, 0], range_unroll_factors=[0, 1], range_warp_specializes=[False, None]) 2026-02-21T09:12:25.2231983Z Tensor-likes are not close! 
2026-02-21T09:12:25.2237078Z 2026-02-21T09:12:25.2238649Z Mismatched elements: 33444714 / 33554432 (99.7%) 2026-02-21T09:12:25.2238976Z Greatest absolute difference: 1392.0 at index (957, 4101) (up to 0.01 allowed) 2026-02-21T09:12:25.2239337Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:12:25.2239645Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:12:25.2239809Z 2026-02-21T09:12:27.6196759Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:12:27.6198675Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:12:27.6199006Z ^ 2026-02-21T09:12:27.6199424Z /tmp/torchinductor_root/uk/cukse4t4frgzo53t735r7gqhvhfhsehrahjmjfsgoviuspg2qejf.py:84:40: note: called from 2026-02-21T09:12:27.6199895Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:12:27.6200116Z ^ 2026-02-21T09:12:27.6200582Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:12:27.6201126Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:12:27.6201391Z ^ 2026-02-21T09:12:27.6207358Z module { 2026-02-21T09:12:27.6211862Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:12:27.6212494Z %cst = arith.constant dense<0> : tensor<64x2x32xi8> 2026-02-21T09:12:27.6212730Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:12:27.6212947Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:12:27.6216155Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:12:27.6216398Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T09:12:27.6216628Z %cst_0 = arith.constant dense<8192> : tensor<256x1xi32> 2026-02-21T09:12:27.6216882Z 
%cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:12:27.6217319Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:12:27.6217548Z %cst_3 = arith.constant dense<4> : tensor<64x32xi8> 2026-02-21T09:12:27.6217792Z %cst_4 = arith.constant dense<8192> : tensor<64x1xi32> 2026-02-21T09:12:27.6218030Z %cst_5 = arith.constant dense<1024> : tensor<256x1xi32> 2026-02-21T09:12:27.6218249Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:12:27.6218474Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<256x32xf32> 2026-02-21T09:12:27.6218775Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:12:27.6218958Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:12:27.6219154Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:12:27.6219347Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:12:27.6219535Z %0 = tt.get_program_id x : i32 2026-02-21T09:12:27.6219725Z %1 = arith.subi %c4096_i32, %0 : i32 2026-02-21T09:12:27.6219906Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:12:27.6220100Z %2 = arith.subi %c2368_i32, %c1_i32 : i32 2026-02-21T09:12:27.6220287Z %3 = arith.addi %1, %2 : i32 2026-02-21T09:12:27.6220507Z %4 = arith.divui %3, %c2368_i32 : i32 2026-02-21T09:12:27.6220693Z %c4_i32 = arith.constant 4 : i32 2026-02-21T09:12:27.6220879Z %5 = arith.remsi %4, %c4_i32 : i32 2026-02-21T09:12:27.6221060Z %6 = arith.subi %4, %5 : i32 2026-02-21T09:12:27.6221228Z %7 = arith.muli %6, %c2368_i32 : i32 2026-02-21T09:12:27.6221452Z %8 = arith.addi %0, %7 : i32 2026-02-21T09:12:27.6221727Z %9 = arith.muli %c2368_i32, %c4_i32 : i32 2026-02-21T09:12:27.6221938Z scf.for %arg3 = %0 to %8 step %9 : i32 { 2026-02-21T09:12:27.6222142Z %10 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:12:27.6222354Z %11 = arith.muli %10, %c16_i32 : i32 2026-02-21T09:12:27.6222537Z %12 = arith.subi %c256_i32, %11 : i32 2026-02-21T09:12:27.6222728Z %13 = arith.minsi %12, %c16_i32 : i32 2026-02-21T09:12:27.6222920Z %14 = arith.remsi %arg3, %c256_i32 : i32 
2026-02-21T09:12:27.6223107Z %15 = arith.remsi %14, %13 : i32 2026-02-21T09:12:27.6223292Z %16 = arith.addi %11, %15 : i32 2026-02-21T09:12:27.6223466Z %17 = arith.divsi %14, %13 : i32 2026-02-21T09:12:27.6223647Z %18 = arith.muli %16, %c32_i32 : i32 2026-02-21T09:12:27.6223879Z %19 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:12:27.6224139Z %20 = tt.splat %18 : i32 -> tensor<32xi32> 2026-02-21T09:12:27.6224348Z %21 = arith.addi %20, %19 : tensor<32xi32> 2026-02-21T09:12:27.6224540Z %22 = arith.muli %17, %c256_i32 : i32 2026-02-21T09:12:27.6224782Z %23 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:12:27.6225034Z %24 = tt.splat %22 : i32 -> tensor<256xi32> 2026-02-21T09:12:27.6225240Z %25 = arith.addi %24, %23 : tensor<256xi32> 2026-02-21T09:12:27.6225434Z %c256_i32_7 = arith.constant 256 : i32 2026-02-21T09:12:27.6225768Z %26 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c256_i32_7 iter_args(%arg5 = %cst_6) -> (tensor<256x32xf32>) : i32 { 2026-02-21T09:12:27.6226192Z %120 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6226455Z %121 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6226669Z %122 = arith.addi %121, %120 : tensor<64xi32> 2026-02-21T09:12:27.6226876Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:12:27.6227110Z %124 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6227367Z %125 = tt.splat %123 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6227581Z %126 = arith.addi %125, %124 : tensor<128xi32> 2026-02-21T09:12:27.6227845Z %127 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6228132Z %128 = arith.muli %127, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6228403Z %129 = tt.expand_dims %126 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6228752Z %130 = tt.broadcast %128 : tensor<256x1xi32> -> tensor<256x128xi32> 
2026-02-21T09:12:27.6229032Z %131 = tt.broadcast %129 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6229294Z %132 = arith.addi %130, %131 : tensor<256x128xi32> 2026-02-21T09:12:27.6229556Z %133 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6229858Z %134 = tt.addptr %133, %132 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6230226Z %135 = tt.load %134 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6230546Z %136 = arith.extf %135 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6230862Z %137 = tt.expand_dims %122 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6231148Z %138 = arith.muli %137, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6231413Z %139 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6231799Z %140 = tt.broadcast %138 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6232068Z %141 = tt.broadcast %139 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6232324Z %142 = arith.addi %140, %141 : tensor<64x32xi32> 2026-02-21T09:12:27.6232568Z %143 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6232886Z %144 = tt.addptr %143, %142 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6233214Z %145 = tt.load %144 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6233492Z %146 = arith.shli %145, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6233734Z %147 = arith.shrsi %146, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6233952Z %148 = arith.shrsi %145, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6234206Z %149 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6234501Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6234827Z %151 = tt.expand_dims %150 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 
2026-02-21T09:12:27.6235163Z %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6235487Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6235784Z %154 = arith.cmpi eq, %151, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6236039Z %155 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6236321Z %156 = tt.broadcast %152 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6236616Z %157 = arith.select %155, %156, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6236895Z %158 = arith.cmpi eq, %151, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6237156Z %159 = tt.broadcast %153 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6237427Z %160 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6237718Z %161 = arith.select %160, %159, %157 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6238004Z %162 = tt.reshape %161 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6238280Z %163 = arith.sitofp %162 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6238665Z %164 = tt.dot %136, %163, %arg5, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6239006Z %c1_i32_13 = arith.constant 1 : i32 2026-02-21T09:12:27.6239211Z %165 = arith.muli %c64_i32, %c1_i32_13 : i32 2026-02-21T09:12:27.6239407Z %166 = arith.addi %arg4, %165 : i32 2026-02-21T09:12:27.6239641Z %167 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6239940Z %168 = tt.splat %166 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6240148Z %169 = arith.addi %168, %167 : tensor<64xi32> 2026-02-21T09:12:27.6240351Z %170 = arith.muli %166, %c2_i32 : i32 2026-02-21T09:12:27.6240586Z %171 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6240847Z %172 = 
tt.splat %170 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6241052Z %173 = arith.addi %172, %171 : tensor<128xi32> 2026-02-21T09:12:27.6241364Z %174 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6241693Z %175 = arith.muli %174, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6241958Z %176 = tt.expand_dims %173 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6242263Z %177 = tt.broadcast %175 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6242540Z %178 = tt.broadcast %176 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6242796Z %179 = arith.addi %177, %178 : tensor<256x128xi32> 2026-02-21T09:12:27.6243076Z %180 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6243384Z %181 = tt.addptr %180, %179 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6243724Z %182 = tt.load %181 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6244082Z %183 = arith.extf %182 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6244408Z %184 = tt.expand_dims %169 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6244698Z %185 = arith.muli %184, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6244980Z %186 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6245294Z %187 = tt.broadcast %185 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6245578Z %188 = tt.broadcast %186 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6245848Z %189 = arith.addi %187, %188 : tensor<64x32xi32> 2026-02-21T09:12:27.6246100Z %190 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6246407Z %191 = tt.addptr %190, %189 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6246735Z %192 = tt.load %191 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6247026Z %193 = arith.shli %192, %cst_3 : 
tensor<64x32xi8> 2026-02-21T09:12:27.6247265Z %194 = arith.shrsi %193, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6247495Z %195 = arith.shrsi %192, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6247762Z %196 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6248069Z %197 = tt.expand_dims %196 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6248407Z %198 = tt.expand_dims %197 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6248756Z %199 = tt.expand_dims %194 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6249106Z %200 = tt.expand_dims %195 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6249413Z %201 = arith.cmpi eq, %198, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6249681Z %202 = tt.broadcast %201 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6249979Z %203 = tt.broadcast %199 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6250282Z %204 = arith.select %202, %203, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6250582Z %205 = arith.cmpi eq, %198, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6250849Z %206 = tt.broadcast %200 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6251131Z %207 = tt.broadcast %205 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6251506Z %208 = arith.select %207, %206, %204 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6251863Z %209 = tt.reshape %208 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6252153Z %210 = arith.sitofp %209 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6252544Z %211 = tt.dot %183, %210, %164, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6252879Z %c2_i32_14 = arith.constant 2 : i32 2026-02-21T09:12:27.6253132Z %212 = arith.muli %c64_i32, %c2_i32_14 : i32 2026-02-21T09:12:27.6253333Z 
%213 = arith.addi %arg4, %212 : i32 2026-02-21T09:12:27.6253569Z %214 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6253820Z %215 = tt.splat %213 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6254034Z %216 = arith.addi %215, %214 : tensor<64xi32> 2026-02-21T09:12:27.6254231Z %217 = arith.muli %213, %c2_i32 : i32 2026-02-21T09:12:27.6254474Z %218 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6254771Z %219 = tt.splat %217 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6254981Z %220 = arith.addi %219, %218 : tensor<128xi32> 2026-02-21T09:12:27.6255252Z %221 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6255564Z %222 = arith.muli %221, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6255845Z %223 = tt.expand_dims %220 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6256153Z %224 = tt.broadcast %222 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6256430Z %225 = tt.broadcast %223 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6256690Z %226 = arith.addi %224, %225 : tensor<256x128xi32> 2026-02-21T09:12:27.6256943Z %227 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6257249Z %228 = tt.addptr %227, %226 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6257576Z %229 = tt.load %228 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6257894Z %230 = arith.extf %229 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6258198Z %231 = tt.expand_dims %216 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6258469Z %232 = arith.muli %231, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6258739Z %233 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6259034Z %234 = tt.broadcast %232 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6259312Z 
%235 = tt.broadcast %233 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6259571Z %236 = arith.addi %234, %235 : tensor<64x32xi32> 2026-02-21T09:12:27.6259813Z %237 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6260121Z %238 = tt.addptr %237, %236 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6260434Z %239 = tt.load %238 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6260711Z %240 = arith.shli %239, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6260928Z %241 = arith.shrsi %240, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6261160Z %242 = arith.shrsi %239, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6261416Z %243 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6261749Z %244 = tt.expand_dims %243 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6262076Z %245 = tt.expand_dims %244 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6262404Z %246 = tt.expand_dims %241 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6262739Z %247 = tt.expand_dims %242 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6263080Z %248 = arith.cmpi eq, %245, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6263347Z %249 = tt.broadcast %248 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6263630Z %250 = tt.broadcast %246 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6263921Z %251 = arith.select %249, %250, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6264236Z %252 = arith.cmpi eq, %245, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6264490Z %253 = tt.broadcast %247 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6264771Z %254 = tt.broadcast %252 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6265065Z %255 = arith.select %254, %253, %251 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 
2026-02-21T09:12:27.6265350Z %256 = tt.reshape %255 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6265632Z %257 = arith.sitofp %256 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6266021Z %258 = tt.dot %230, %257, %211, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6266366Z %c3_i32_15 = arith.constant 3 : i32 2026-02-21T09:12:27.6266566Z %259 = arith.muli %c64_i32, %c3_i32_15 : i32 2026-02-21T09:12:27.6266791Z %260 = arith.addi %arg4, %259 : i32 2026-02-21T09:12:27.6267032Z %261 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6267282Z %262 = tt.splat %260 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6267496Z %263 = arith.addi %262, %261 : tensor<64xi32> 2026-02-21T09:12:27.6267689Z %264 = arith.muli %260, %c2_i32 : i32 2026-02-21T09:12:27.6267929Z %265 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6268179Z %266 = tt.splat %264 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6268392Z %267 = arith.addi %266, %265 : tensor<128xi32> 2026-02-21T09:12:27.6268655Z %268 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6268930Z %269 = arith.muli %268, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6269204Z %270 = tt.expand_dims %267 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6269507Z %271 = tt.broadcast %269 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6269791Z %272 = tt.broadcast %270 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6270045Z %273 = arith.addi %271, %272 : tensor<256x128xi32> 2026-02-21T09:12:27.6270297Z %274 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6270602Z %275 = tt.addptr %274, %273 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6270925Z %276 = tt.load %275 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 
2026-02-21T09:12:27.6271240Z %277 = arith.extf %276 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6271594Z %278 = tt.expand_dims %263 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6271871Z %279 = arith.muli %278, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6272139Z %280 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6272433Z %281 = tt.broadcast %279 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6272708Z %282 = tt.broadcast %280 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6272951Z %283 = arith.addi %281, %282 : tensor<64x32xi32> 2026-02-21T09:12:27.6273198Z %284 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6273480Z %285 = tt.addptr %284, %283 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6273796Z %286 = tt.load %285 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6274119Z %287 = arith.shli %286, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6274340Z %288 = arith.shrsi %287, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6274567Z %289 = arith.shrsi %286, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6274811Z %290 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6275117Z %291 = tt.expand_dims %290 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6275478Z %292 = tt.expand_dims %291 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6275800Z %293 = tt.expand_dims %288 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6276132Z %294 = tt.expand_dims %289 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6276417Z %295 = arith.cmpi eq, %292, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6276684Z %296 = tt.broadcast %295 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6276979Z %297 = tt.broadcast %293 : tensor<64x1x32xi8> -> 
tensor<64x2x32xi8> 2026-02-21T09:12:27.6277278Z %298 = arith.select %296, %297, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6277564Z %299 = arith.cmpi eq, %292, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6277847Z %300 = tt.broadcast %294 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6278129Z %301 = tt.broadcast %299 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6278408Z %302 = arith.select %301, %300, %298 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6278699Z %303 = tt.reshape %302 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6278974Z %304 = arith.sitofp %303 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6279343Z %305 = tt.dot %277, %304, %258, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6279680Z scf.yield %305 : tensor<256x32xf32> 2026-02-21T09:12:27.6279860Z } {tt.flatten} 2026-02-21T09:12:27.6280064Z %27 = arith.truncf %26 : tensor<256x32xf32> to tensor<256x32xbf16> 2026-02-21T09:12:27.6280351Z %28 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6280621Z %29 = arith.muli %28, %cst_0 : tensor<256x1xi32> 2026-02-21T09:12:27.6280884Z %30 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6281168Z %31 = tt.broadcast %29 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6281515Z %32 = tt.broadcast %30 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6281787Z %33 = arith.addi %31, %32 : tensor<256x32xi32> 2026-02-21T09:12:27.6282039Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6282335Z %35 = tt.addptr %34, %33 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:12:27.6282597Z tt.store %35, %27 : tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6282812Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:12:27.6283004Z %36 = arith.muli %c2368_i32, %c1_i32_8 
: i32 2026-02-21T09:12:27.6283209Z %37 = arith.addi %arg3, %36 : i32 2026-02-21T09:12:27.6283397Z %38 = arith.divsi %37, %c256_i32 : i32 2026-02-21T09:12:27.6283590Z %39 = arith.muli %38, %c16_i32 : i32 2026-02-21T09:12:27.6283771Z %40 = arith.subi %c256_i32, %39 : i32 2026-02-21T09:12:27.6283959Z %41 = arith.minsi %40, %c16_i32 : i32 2026-02-21T09:12:27.6284152Z %42 = arith.remsi %37, %c256_i32 : i32 2026-02-21T09:12:27.6284334Z %43 = arith.remsi %42, %41 : i32 2026-02-21T09:12:27.6284517Z %44 = arith.addi %39, %43 : i32 2026-02-21T09:12:27.6284691Z %45 = arith.divsi %42, %41 : i32 2026-02-21T09:12:27.6284876Z %46 = arith.muli %44, %c32_i32 : i32 2026-02-21T09:12:27.6285129Z %47 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:12:27.6285410Z %48 = tt.splat %46 : i32 -> tensor<32xi32> 2026-02-21T09:12:27.6285616Z %49 = arith.addi %48, %47 : tensor<32xi32> 2026-02-21T09:12:27.6285804Z %50 = arith.muli %45, %c256_i32 : i32 2026-02-21T09:12:27.6286039Z %51 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:12:27.6286284Z %52 = tt.splat %50 : i32 -> tensor<256xi32> 2026-02-21T09:12:27.6286517Z %53 = arith.addi %52, %51 : tensor<256xi32> 2026-02-21T09:12:27.6286709Z %c256_i32_9 = arith.constant 256 : i32 2026-02-21T09:12:27.6287033Z %54 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c256_i32_9 iter_args(%arg5 = %cst_6) -> (tensor<256x32xf32>) : i32 { 2026-02-21T09:12:27.6287407Z %120 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6287663Z %121 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6287890Z %122 = arith.addi %121, %120 : tensor<64xi32> 2026-02-21T09:12:27.6288099Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:12:27.6288373Z %124 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6288638Z %125 = tt.splat %123 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6288864Z %126 = arith.addi %125, %124 : 
tensor<128xi32> 2026-02-21T09:12:27.6289190Z %127 = tt.expand_dims %53 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6289478Z %128 = arith.muli %127, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6289765Z %129 = tt.expand_dims %126 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6290078Z %130 = tt.broadcast %128 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6290373Z %131 = tt.broadcast %129 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6290631Z %132 = arith.addi %130, %131 : tensor<256x128xi32> 2026-02-21T09:12:27.6290899Z %133 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6291219Z %134 = tt.addptr %133, %132 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6291616Z %135 = tt.load %134 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6291949Z %136 = arith.extf %135 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6292264Z %137 = tt.expand_dims %122 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6292557Z %138 = arith.muli %137, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6292836Z %139 = tt.expand_dims %49 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6293140Z %140 = tt.broadcast %138 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6293430Z %141 = tt.broadcast %139 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6293684Z %142 = arith.addi %140, %141 : tensor<64x32xi32> 2026-02-21T09:12:27.6293944Z %143 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6294237Z %144 = tt.addptr %143, %142 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6294564Z %145 = tt.load %144 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6294854Z %146 = arith.shli %145, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6295087Z %147 = arith.shrsi %146, %cst_3 : tensor<64x32xi8> 
2026-02-21T09:12:27.6295327Z %148 = arith.shrsi %145, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6295585Z %149 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6295899Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6296234Z %151 = tt.expand_dims %150 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6296591Z %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6296955Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6297250Z %154 = arith.cmpi eq, %151, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6297512Z %155 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6297788Z %156 = tt.broadcast %152 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6298108Z %157 = arith.select %155, %156, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6298389Z %158 = arith.cmpi eq, %151, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6298642Z %159 = tt.broadcast %153 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6298921Z %160 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6299199Z %161 = arith.select %160, %159, %157 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6299496Z %162 = tt.reshape %161 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6299796Z %163 = arith.sitofp %162 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6300183Z %164 = tt.dot %136, %163, %arg5, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6300521Z %c1_i32_13 = arith.constant 1 : i32 2026-02-21T09:12:27.6300749Z %165 = arith.muli %c64_i32, %c1_i32_13 : i32 2026-02-21T09:12:27.6300958Z %166 = arith.addi %arg4, %165 : i32 2026-02-21T09:12:27.6301189Z %167 = tt.make_range {end = 64 : 
i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6301451Z %168 = tt.splat %166 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6301716Z %169 = arith.addi %168, %167 : tensor<64xi32> 2026-02-21T09:12:27.6301914Z %170 = arith.muli %166, %c2_i32 : i32 2026-02-21T09:12:27.6302156Z %171 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6302410Z %172 = tt.splat %170 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6302625Z %173 = arith.addi %172, %171 : tensor<128xi32> 2026-02-21T09:12:27.6302883Z %174 = tt.expand_dims %53 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6303166Z %175 = arith.muli %174, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6303440Z %176 = tt.expand_dims %173 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6303744Z %177 = tt.broadcast %175 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6304027Z %178 = tt.broadcast %176 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6304277Z %179 = arith.addi %177, %178 : tensor<256x128xi32> 2026-02-21T09:12:27.6304535Z %180 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6304833Z %181 = tt.addptr %180, %179 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6305164Z %182 = tt.load %181 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6305476Z %183 = arith.extf %182 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6305771Z %184 = tt.expand_dims %169 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6306049Z %185 = arith.muli %184, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6306310Z %186 = tt.expand_dims %49 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6306614Z %187 = tt.broadcast %185 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6306884Z %188 = tt.broadcast %186 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6307124Z %189 
= arith.addi %187, %188 : tensor<64x32xi32> 2026-02-21T09:12:27.6307367Z %190 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6307644Z %191 = tt.addptr %190, %189 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6308001Z %192 = tt.load %191 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6308273Z %193 = arith.shli %192, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6308497Z %194 = arith.shrsi %193, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6308723Z %195 = arith.shrsi %192, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6308970Z %196 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6309298Z %197 = tt.expand_dims %196 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6309616Z %198 = tt.expand_dims %197 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6309955Z %199 = tt.expand_dims %194 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6310281Z %200 = tt.expand_dims %195 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6310575Z %201 = arith.cmpi eq, %198, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6310871Z %202 = tt.broadcast %201 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6311143Z %203 = tt.broadcast %199 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6311438Z %204 = arith.select %202, %203, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6311767Z %205 = arith.cmpi eq, %198, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6312034Z %206 = tt.broadcast %200 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6312315Z %207 = tt.broadcast %205 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6312596Z %208 = arith.select %207, %206, %204 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6312902Z %209 = tt.reshape %208 : tensor<64x2x32xi8> -> tensor<128x32xi8> 
2026-02-21T09:12:27.6313172Z %210 = arith.sitofp %209 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6313556Z %211 = tt.dot %183, %210, %164, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6313905Z %c2_i32_14 = arith.constant 2 : i32 2026-02-21T09:12:27.6314111Z %212 = arith.muli %c64_i32, %c2_i32_14 : i32 2026-02-21T09:12:27.6314318Z %213 = arith.addi %arg4, %212 : i32 2026-02-21T09:12:27.6314548Z %214 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6314807Z %215 = tt.splat %213 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6315015Z %216 = arith.addi %215, %214 : tensor<64xi32> 2026-02-21T09:12:27.6315218Z %217 = arith.muli %213, %c2_i32 : i32 2026-02-21T09:12:27.6315453Z %218 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6315714Z %219 = tt.splat %217 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6315930Z %220 = arith.addi %219, %218 : tensor<128xi32> 2026-02-21T09:12:27.6316190Z %221 = tt.expand_dims %53 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6316476Z %222 = arith.muli %221, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6316745Z %223 = tt.expand_dims %220 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6317057Z %224 = tt.broadcast %222 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6317342Z %225 = tt.broadcast %223 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6317595Z %226 = arith.addi %224, %225 : tensor<256x128xi32> 2026-02-21T09:12:27.6317859Z %227 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6318169Z %228 = tt.addptr %227, %226 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6318503Z %229 = tt.load %228 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6318811Z %230 = arith.extf %229 : tensor<256x128xbf16> to tensor<256x128xf32> 
2026-02-21T09:12:27.6319165Z %231 = tt.expand_dims %216 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6319444Z %232 = arith.muli %231, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6319705Z %233 = tt.expand_dims %49 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6320001Z %234 = tt.broadcast %232 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6320264Z %235 = tt.broadcast %233 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6320553Z %236 = arith.addi %234, %235 : tensor<64x32xi32> 2026-02-21T09:12:27.6320790Z %237 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6321077Z %238 = tt.addptr %237, %236 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6321389Z %239 = tt.load %238 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6321690Z %240 = arith.shli %239, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6321918Z %241 = arith.shrsi %240, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6322174Z %242 = arith.shrsi %239, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6322429Z %243 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6322730Z %244 = tt.expand_dims %243 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6323083Z %245 = tt.expand_dims %244 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6323418Z %246 = tt.expand_dims %241 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6323742Z %247 = tt.expand_dims %242 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6324034Z %248 = arith.cmpi eq, %245, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6324285Z %249 = tt.broadcast %248 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6324568Z %250 = tt.broadcast %246 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6324866Z %251 = arith.select %249, %250, %cst : 
tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6325140Z %252 = arith.cmpi eq, %245, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6325404Z %253 = tt.broadcast %247 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6325676Z %254 = tt.broadcast %252 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6325970Z %255 = arith.select %254, %253, %251 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6326266Z %256 = tt.reshape %255 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6326536Z %257 = arith.sitofp %256 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6326916Z %258 = tt.dot %230, %257, %211, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6327243Z %c3_i32_15 = arith.constant 3 : i32 2026-02-21T09:12:27.6327450Z %259 = arith.muli %c64_i32, %c3_i32_15 : i32 2026-02-21T09:12:27.6327645Z %260 = arith.addi %arg4, %259 : i32 2026-02-21T09:12:27.6327883Z %261 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6328139Z %262 = tt.splat %260 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6328346Z %263 = arith.addi %262, %261 : tensor<64xi32> 2026-02-21T09:12:27.6328552Z %264 = arith.muli %260, %c2_i32 : i32 2026-02-21T09:12:27.6328785Z %265 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6329045Z %266 = tt.splat %264 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6329254Z %267 = arith.addi %266, %265 : tensor<128xi32> 2026-02-21T09:12:27.6329516Z %268 = tt.expand_dims %53 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6329797Z %269 = arith.muli %268, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6330091Z %270 = tt.expand_dims %267 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6330395Z %271 = tt.broadcast %269 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6330674Z %272 = tt.broadcast %270 : 
tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6330932Z %273 = arith.addi %271, %272 : tensor<256x128xi32> 2026-02-21T09:12:27.6331195Z %274 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6331514Z %275 = tt.addptr %274, %273 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6331868Z %276 = tt.load %275 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6332191Z %277 = arith.extf %276 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6332506Z %278 = tt.expand_dims %263 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6332791Z %279 = arith.muli %278, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6333119Z %280 = tt.expand_dims %49 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6333431Z %281 = tt.broadcast %279 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6333709Z %282 = tt.broadcast %280 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6333969Z %283 = arith.addi %281, %282 : tensor<64x32xi32> 2026-02-21T09:12:27.6334251Z %284 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6334552Z %285 = tt.addptr %284, %283 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6334872Z %286 = tt.load %285 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6335161Z %287 = arith.shli %286, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6335397Z %288 = arith.shrsi %287, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6335628Z %289 = arith.shrsi %286, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6335888Z %290 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6336195Z %291 = tt.expand_dims %290 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6336531Z %292 = tt.expand_dims %291 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6336886Z %293 = tt.expand_dims %288 {axis = 1 : i32} : 
tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6337231Z %294 = tt.expand_dims %289 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6337539Z %295 = arith.cmpi eq, %292, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6337806Z %296 = tt.broadcast %295 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6338100Z %297 = tt.broadcast %293 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6338407Z %298 = arith.select %296, %297, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6338707Z %299 = arith.cmpi eq, %292, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6338983Z %300 = tt.broadcast %294 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6339270Z %301 = tt.broadcast %299 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6339577Z %302 = arith.select %301, %300, %298 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6339875Z %303 = tt.reshape %302 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6340177Z %304 = arith.sitofp %303 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6340555Z %305 = tt.dot %277, %304, %258, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6340877Z scf.yield %305 : tensor<256x32xf32> 2026-02-21T09:12:27.6341062Z } {tt.flatten} 2026-02-21T09:12:27.6341259Z %55 = arith.truncf %54 : tensor<256x32xf32> to tensor<256x32xbf16> 2026-02-21T09:12:27.6341594Z %56 = tt.expand_dims %53 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6341891Z %57 = arith.muli %56, %cst_0 : tensor<256x1xi32> 2026-02-21T09:12:27.6342153Z %58 = tt.expand_dims %49 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6342444Z %59 = tt.broadcast %57 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6342707Z %60 = tt.broadcast %58 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6342976Z %61 = arith.addi %59, %60 
: tensor<256x32xi32> 2026-02-21T09:12:27.6343220Z %62 = tt.splat %arg2 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6343512Z %63 = tt.addptr %62, %61 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:12:27.6343774Z tt.store %63, %55 : tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6343988Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:12:27.6344194Z %64 = arith.muli %c2368_i32, %c2_i32_10 : i32 2026-02-21T09:12:27.6344391Z %65 = arith.addi %arg3, %64 : i32 2026-02-21T09:12:27.6344584Z %66 = arith.divsi %65, %c256_i32 : i32 2026-02-21T09:12:27.6344794Z %67 = arith.muli %66, %c16_i32 : i32 2026-02-21T09:12:27.6344985Z %68 = arith.subi %c256_i32, %67 : i32 2026-02-21T09:12:27.6345165Z %69 = arith.minsi %68, %c16_i32 : i32 2026-02-21T09:12:27.6345354Z %70 = arith.remsi %65, %c256_i32 : i32 2026-02-21T09:12:27.6345568Z %71 = arith.remsi %70, %69 : i32 2026-02-21T09:12:27.6345744Z %72 = arith.addi %67, %71 : i32 2026-02-21T09:12:27.6345922Z %73 = arith.divsi %70, %69 : i32 2026-02-21T09:12:27.6346096Z %74 = arith.muli %72, %c32_i32 : i32 2026-02-21T09:12:27.6346324Z %75 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:12:27.6346567Z %76 = tt.splat %74 : i32 -> tensor<32xi32> 2026-02-21T09:12:27.6346773Z %77 = arith.addi %76, %75 : tensor<32xi32> 2026-02-21T09:12:27.6346957Z %78 = arith.muli %73, %c256_i32 : i32 2026-02-21T09:12:27.6347189Z %79 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:12:27.6347442Z %80 = tt.splat %78 : i32 -> tensor<256xi32> 2026-02-21T09:12:27.6347639Z %81 = arith.addi %80, %79 : tensor<256xi32> 2026-02-21T09:12:27.6347842Z %c256_i32_11 = arith.constant 256 : i32 2026-02-21T09:12:27.6348163Z %82 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c256_i32_11 iter_args(%arg5 = %cst_6) -> (tensor<256x32xf32>) : i32 { 2026-02-21T09:12:27.6348536Z %120 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6348787Z %121 = tt.splat %arg4 : 
i32 -> tensor<64xi32> 2026-02-21T09:12:27.6349000Z %122 = arith.addi %121, %120 : tensor<64xi32> 2026-02-21T09:12:27.6349204Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:12:27.6349437Z %124 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6349693Z %125 = tt.splat %123 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6349895Z %126 = arith.addi %125, %124 : tensor<128xi32> 2026-02-21T09:12:27.6350156Z %127 = tt.expand_dims %81 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6350428Z %128 = arith.muli %127, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6350702Z %129 = tt.expand_dims %126 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6351012Z %130 = tt.broadcast %128 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6351290Z %131 = tt.broadcast %129 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6351592Z %132 = arith.addi %130, %131 : tensor<256x128xi32> 2026-02-21T09:12:27.6351846Z %133 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6352159Z %134 = tt.addptr %133, %132 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6352495Z %135 = tt.load %134 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6352831Z %136 = arith.extf %135 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6353131Z %137 = tt.expand_dims %122 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6353402Z %138 = arith.muli %137, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6353668Z %139 = tt.expand_dims %77 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6353979Z %140 = tt.broadcast %138 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6354250Z %141 = tt.broadcast %139 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6354498Z %142 = arith.addi %140, %141 : tensor<64x32xi32> 2026-02-21T09:12:27.6354735Z %143 = tt.splat 
%arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6355019Z %144 = tt.addptr %143, %142 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6355319Z %145 = tt.load %144 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6355624Z %146 = arith.shli %145, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6355853Z %147 = arith.shrsi %146, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6356070Z %148 = arith.shrsi %145, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6356323Z %149 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6356641Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6356966Z %151 = tt.expand_dims %150 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6357291Z %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6357626Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6357918Z %154 = arith.cmpi eq, %151, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6358170Z %155 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6358449Z %156 = tt.broadcast %152 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6358740Z %157 = arith.select %155, %156, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6359024Z %158 = arith.cmpi eq, %151, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6359275Z %159 = tt.broadcast %153 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6359554Z %160 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6359837Z %161 = arith.select %160, %159, %157 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6360117Z %162 = tt.reshape %161 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6360391Z %163 = arith.sitofp %162 : tensor<128x32xi8> to tensor<128x32xf32> 
2026-02-21T09:12:27.6360765Z %164 = tt.dot %136, %163, %arg5, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6361105Z %c1_i32_13 = arith.constant 1 : i32 2026-02-21T09:12:27.6361309Z %165 = arith.muli %c64_i32, %c1_i32_13 : i32 2026-02-21T09:12:27.6361506Z %166 = arith.addi %arg4, %165 : i32 2026-02-21T09:12:27.6361817Z %167 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6362157Z %168 = tt.splat %166 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6362377Z %169 = arith.addi %168, %167 : tensor<64xi32> 2026-02-21T09:12:27.6362696Z %170 = arith.muli %166, %c2_i32 : i32 2026-02-21T09:12:27.6362938Z %171 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6363199Z %172 = tt.splat %170 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6363404Z %173 = arith.addi %172, %171 : tensor<128xi32> 2026-02-21T09:12:27.6363670Z %174 = tt.expand_dims %81 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6363981Z %175 = arith.muli %174, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6364253Z %176 = tt.expand_dims %173 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6364563Z %177 = tt.broadcast %175 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6364839Z %178 = tt.broadcast %176 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6365105Z %179 = arith.addi %177, %178 : tensor<256x128xi32> 2026-02-21T09:12:27.6365386Z %180 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6365705Z %181 = tt.addptr %180, %179 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6366035Z %182 = tt.load %181 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6366342Z %183 = arith.extf %182 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6366649Z %184 = tt.expand_dims %169 {axis = 1 : i32} : tensor<64xi32> -> 
tensor<64x1xi32> 2026-02-21T09:12:27.6366942Z %185 = arith.muli %184, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6367212Z %186 = tt.expand_dims %77 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6367515Z %187 = tt.broadcast %185 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6367805Z %188 = tt.broadcast %186 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6368058Z %189 = arith.addi %187, %188 : tensor<64x32xi32> 2026-02-21T09:12:27.6368296Z %190 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6368580Z %191 = tt.addptr %190, %189 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6368883Z %192 = tt.load %191 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6369160Z %193 = arith.shli %192, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6369383Z %194 = arith.shrsi %193, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6369603Z %195 = arith.shrsi %192, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6369857Z %196 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6370149Z %197 = tt.expand_dims %196 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6370474Z %198 = tt.expand_dims %197 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6370809Z %199 = tt.expand_dims %194 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6371136Z %200 = tt.expand_dims %195 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6371430Z %201 = arith.cmpi eq, %198, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6371736Z %202 = tt.broadcast %201 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6372014Z %203 = tt.broadcast %199 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6372306Z %204 = arith.select %202, %203, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6372590Z %205 = arith.cmpi eq, %198, %cst_1 : 
tensor<1x2x1xi32> 2026-02-21T09:12:27.6372848Z %206 = tt.broadcast %200 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6373117Z %207 = tt.broadcast %205 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6373410Z %208 = arith.select %207, %206, %204 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6373694Z %209 = tt.reshape %208 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6373973Z %210 = arith.sitofp %209 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6374351Z %211 = tt.dot %183, %210, %164, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6374678Z %c2_i32_14 = arith.constant 2 : i32 2026-02-21T09:12:27.6374885Z %212 = arith.muli %c64_i32, %c2_i32_14 : i32 2026-02-21T09:12:27.6375107Z %213 = arith.addi %arg4, %212 : i32 2026-02-21T09:12:27.6375344Z %214 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6375593Z %215 = tt.splat %213 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6375809Z %216 = arith.addi %215, %214 : tensor<64xi32> 2026-02-21T09:12:27.6376018Z %217 = arith.muli %213, %c2_i32 : i32 2026-02-21T09:12:27.6376262Z %218 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6376579Z %219 = tt.splat %217 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6376795Z %220 = arith.addi %219, %218 : tensor<128xi32> 2026-02-21T09:12:27.6377069Z %221 = tt.expand_dims %81 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6377358Z %222 = arith.muli %221, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6377645Z %223 = tt.expand_dims %220 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6377968Z %224 = tt.broadcast %222 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6378282Z %225 = tt.broadcast %223 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6378554Z %226 = arith.addi %224, %225 : 
tensor<256x128xi32> 2026-02-21T09:12:27.6378822Z %227 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6379187Z %228 = tt.addptr %227, %226 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6379533Z %229 = tt.load %228 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6379866Z %230 = arith.extf %229 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6380190Z %231 = tt.expand_dims %216 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6380477Z %232 = arith.muli %231, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6380753Z %233 = tt.expand_dims %77 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6381061Z %234 = tt.broadcast %232 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6381350Z %235 = tt.broadcast %233 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6381634Z %236 = arith.addi %234, %235 : tensor<64x32xi32> 2026-02-21T09:12:27.6381881Z %237 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6382176Z %238 = tt.addptr %237, %236 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6382498Z %239 = tt.load %238 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6382786Z %240 = arith.shli %239, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6383011Z %241 = arith.shrsi %240, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6383247Z %242 = arith.shrsi %239, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6383511Z %243 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6383816Z %244 = tt.expand_dims %243 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6384136Z %245 = tt.expand_dims %244 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6384460Z %246 = tt.expand_dims %241 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6384799Z %247 = tt.expand_dims %242 {axis = 
1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6385093Z %248 = arith.cmpi eq, %245, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6385342Z %249 = tt.broadcast %248 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6385620Z %250 = tt.broadcast %246 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6385909Z %251 = arith.select %249, %250, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6386187Z %252 = arith.cmpi eq, %245, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6386470Z %253 = tt.broadcast %247 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6386748Z %254 = tt.broadcast %252 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6387034Z %255 = arith.select %254, %253, %251 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6387316Z %256 = tt.reshape %255 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6387589Z %257 = arith.sitofp %256 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6387979Z %258 = tt.dot %230, %257, %211, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6388312Z %c3_i32_15 = arith.constant 3 : i32 2026-02-21T09:12:27.6388516Z %259 = arith.muli %c64_i32, %c3_i32_15 : i32 2026-02-21T09:12:27.6388708Z %260 = arith.addi %arg4, %259 : i32 2026-02-21T09:12:27.6388941Z %261 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6389186Z %262 = tt.splat %260 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6389421Z %263 = arith.addi %262, %261 : tensor<64xi32> 2026-02-21T09:12:27.6389619Z %264 = arith.muli %260, %c2_i32 : i32 2026-02-21T09:12:27.6389860Z %265 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6390113Z %266 = tt.splat %264 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6390347Z %267 = arith.addi %266, %265 : tensor<128xi32> 2026-02-21T09:12:27.6390616Z %268 = tt.expand_dims %81 {axis = 1 : 
i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6390890Z %269 = arith.muli %268, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6391161Z %270 = tt.expand_dims %267 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6391459Z %271 = tt.broadcast %269 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6391762Z %272 = tt.broadcast %270 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6392018Z %273 = arith.addi %271, %272 : tensor<256x128xi32> 2026-02-21T09:12:27.6392270Z %274 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6392585Z %275 = tt.addptr %274, %273 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6392911Z %276 = tt.load %275 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6393233Z %277 = arith.extf %276 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6393537Z %278 = tt.expand_dims %263 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6393818Z %279 = arith.muli %278, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6394088Z %280 = tt.expand_dims %77 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6394378Z %281 = tt.broadcast %279 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6394651Z %282 = tt.broadcast %280 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6394899Z %283 = arith.addi %281, %282 : tensor<64x32xi32> 2026-02-21T09:12:27.6395153Z %284 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6395445Z %285 = tt.addptr %284, %283 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6395746Z %286 = tt.load %285 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6396028Z %287 = arith.shli %286, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6396245Z %288 = arith.shrsi %287, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6396471Z %289 = arith.shrsi %286, %cst_3 : tensor<64x32xi8> 
2026-02-21T09:12:27.6396713Z %290 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6397013Z %291 = tt.expand_dims %290 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6397333Z %292 = tt.expand_dims %291 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6397693Z %293 = tt.expand_dims %288 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6398030Z %294 = tt.expand_dims %289 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6398313Z %295 = arith.cmpi eq, %292, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6398574Z %296 = tt.broadcast %295 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6398883Z %297 = tt.broadcast %293 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6399171Z %298 = arith.select %296, %297, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6399455Z %299 = arith.cmpi eq, %292, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6399706Z %300 = tt.broadcast %294 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6399985Z %301 = tt.broadcast %299 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6400263Z %302 = arith.select %301, %300, %298 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6400579Z %303 = tt.reshape %302 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6400855Z %304 = arith.sitofp %303 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6401221Z %305 = tt.dot %277, %304, %258, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6401626Z scf.yield %305 : tensor<256x32xf32> 2026-02-21T09:12:27.6401807Z } {tt.flatten} 2026-02-21T09:12:27.6402009Z %83 = arith.truncf %82 : tensor<256x32xf32> to tensor<256x32xbf16> 2026-02-21T09:12:27.6402299Z %84 = tt.expand_dims %81 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6402573Z %85 
= arith.muli %84, %cst_0 : tensor<256x1xi32> 2026-02-21T09:12:27.6402832Z %86 = tt.expand_dims %77 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6403118Z %87 = tt.broadcast %85 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6403387Z %88 = tt.broadcast %86 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6403619Z %89 = arith.addi %87, %88 : tensor<256x32xi32> 2026-02-21T09:12:27.6403867Z %90 = tt.splat %arg2 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6404160Z %91 = tt.addptr %90, %89 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:12:27.6404417Z tt.store %91, %83 : tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6404632Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:12:27.6404824Z %92 = arith.muli %c2368_i32, %c3_i32 : i32 2026-02-21T09:12:27.6405039Z %93 = arith.addi %arg3, %92 : i32 2026-02-21T09:12:27.6405222Z %94 = arith.divsi %93, %c256_i32 : i32 2026-02-21T09:12:27.6405415Z %95 = arith.muli %94, %c16_i32 : i32 2026-02-21T09:12:27.6405596Z %96 = arith.subi %c256_i32, %95 : i32 2026-02-21T09:12:27.6405784Z %97 = arith.minsi %96, %c16_i32 : i32 2026-02-21T09:12:27.6405975Z %98 = arith.remsi %93, %c256_i32 : i32 2026-02-21T09:12:27.6406158Z %99 = arith.remsi %98, %97 : i32 2026-02-21T09:12:27.6406346Z %100 = arith.addi %95, %99 : i32 2026-02-21T09:12:27.6406520Z %101 = arith.divsi %98, %97 : i32 2026-02-21T09:12:27.6406706Z %102 = arith.muli %100, %c32_i32 : i32 2026-02-21T09:12:27.6407012Z %103 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:12:27.6407270Z %104 = tt.splat %102 : i32 -> tensor<32xi32> 2026-02-21T09:12:27.6407481Z %105 = arith.addi %104, %103 : tensor<32xi32> 2026-02-21T09:12:27.6407675Z %106 = arith.muli %101, %c256_i32 : i32 2026-02-21T09:12:27.6407914Z %107 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:12:27.6408163Z %108 = tt.splat %106 : i32 -> tensor<256xi32> 2026-02-21T09:12:27.6408376Z %109 = 
arith.addi %108, %107 : tensor<256xi32> 2026-02-21T09:12:27.6408639Z %c256_i32_12 = arith.constant 256 : i32 2026-02-21T09:12:27.6408975Z %110 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c256_i32_12 iter_args(%arg5 = %cst_6) -> (tensor<256x32xf32>) : i32 { 2026-02-21T09:12:27.6409353Z %120 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6409606Z %121 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6409820Z %122 = arith.addi %121, %120 : tensor<64xi32> 2026-02-21T09:12:27.6410044Z %123 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:12:27.6410286Z %124 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6410539Z %125 = tt.splat %123 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6410750Z %126 = arith.addi %125, %124 : tensor<128xi32> 2026-02-21T09:12:27.6411019Z %127 = tt.expand_dims %109 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6411301Z %128 = arith.muli %127, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6411622Z %129 = tt.expand_dims %126 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6411927Z %130 = tt.broadcast %128 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6412210Z %131 = tt.broadcast %129 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6412460Z %132 = arith.addi %130, %131 : tensor<256x128xi32> 2026-02-21T09:12:27.6412754Z %133 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6413067Z %134 = tt.addptr %133, %132 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6413399Z %135 = tt.load %134 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6413715Z %136 = arith.extf %135 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6414015Z %137 = tt.expand_dims %122 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6414299Z %138 = arith.muli %137, %cst_4 : tensor<64x1xi32> 
2026-02-21T09:12:27.6414569Z %139 = tt.expand_dims %105 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6414865Z %140 = tt.broadcast %138 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6415142Z %141 = tt.broadcast %139 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6415386Z %142 = arith.addi %140, %141 : tensor<64x32xi32> 2026-02-21T09:12:27.6415630Z %143 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6415909Z %144 = tt.addptr %143, %142 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6416222Z %145 = tt.load %144 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6416496Z %146 = arith.shli %145, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6416716Z %147 = arith.shrsi %146, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6416940Z %148 = arith.shrsi %145, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6417189Z %149 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6417492Z %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6417809Z %151 = tt.expand_dims %150 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6418146Z %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6418491Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6418781Z %154 = arith.cmpi eq, %151, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6419045Z %155 = tt.broadcast %154 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6419334Z %156 = tt.broadcast %152 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6419660Z %157 = arith.select %155, %156, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6420014Z %158 = arith.cmpi eq, %151, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6420285Z %159 = tt.broadcast %153 : tensor<64x1x32xi8> -> 
tensor<64x2x32xi8> 2026-02-21T09:12:27.6420581Z %160 = tt.broadcast %158 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6420876Z %161 = arith.select %160, %159, %157 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6421183Z %162 = tt.reshape %161 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6421514Z %163 = arith.sitofp %162 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6421942Z %164 = tt.dot %136, %163, %arg5, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6422305Z %c1_i32_13 = arith.constant 1 : i32 2026-02-21T09:12:27.6422512Z %165 = arith.muli %c64_i32, %c1_i32_13 : i32 2026-02-21T09:12:27.6422727Z %166 = arith.addi %arg4, %165 : i32 2026-02-21T09:12:27.6422970Z %167 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6423279Z %168 = tt.splat %166 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6423511Z %169 = arith.addi %168, %167 : tensor<64xi32> 2026-02-21T09:12:27.6423721Z %170 = arith.muli %166, %c2_i32 : i32 2026-02-21T09:12:27.6424017Z %171 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6424285Z %172 = tt.splat %170 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6424507Z %173 = arith.addi %172, %171 : tensor<128xi32> 2026-02-21T09:12:27.6424782Z %174 = tt.expand_dims %109 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6425086Z %175 = arith.muli %174, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6425375Z %176 = tt.expand_dims %173 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6425689Z %177 = tt.broadcast %175 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6425987Z %178 = tt.broadcast %176 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6426250Z %179 = arith.addi %177, %178 : tensor<256x128xi32> 2026-02-21T09:12:27.6426516Z %180 = tt.splat %arg0 : !tt.ptr -> 
tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6426829Z %181 = tt.addptr %180, %179 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6427178Z %182 = tt.load %181 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6427505Z %183 = arith.extf %182 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6427797Z %184 = tt.expand_dims %169 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6428070Z %185 = arith.muli %184, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6428329Z %186 = tt.expand_dims %105 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6428626Z %187 = tt.broadcast %185 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6428900Z %188 = tt.broadcast %186 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6429138Z %189 = arith.addi %187, %188 : tensor<64x32xi32> 2026-02-21T09:12:27.6429380Z %190 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6429656Z %191 = tt.addptr %190, %189 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6429967Z %192 = tt.load %191 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6430237Z %193 = arith.shli %192, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6430461Z %194 = arith.shrsi %193, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6430682Z %195 = arith.shrsi %192, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6430926Z %196 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6431220Z %197 = tt.expand_dims %196 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6431607Z %198 = tt.expand_dims %197 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6431938Z %199 = tt.expand_dims %194 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6432271Z %200 = tt.expand_dims %195 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6432561Z 
%201 = arith.cmpi eq, %198, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6432865Z %202 = tt.broadcast %201 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6433143Z %203 = tt.broadcast %199 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6433452Z %204 = arith.select %202, %203, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6433739Z %205 = arith.cmpi eq, %198, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6434015Z %206 = tt.broadcast %200 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6434310Z %207 = tt.broadcast %205 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6434649Z %208 = arith.select %207, %206, %204 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6434938Z %209 = tt.reshape %208 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6435207Z %210 = arith.sitofp %209 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6435611Z %211 = tt.dot %183, %210, %164, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6435943Z %c2_i32_14 = arith.constant 2 : i32 2026-02-21T09:12:27.6436139Z %212 = arith.muli %c64_i32, %c2_i32_14 : i32 2026-02-21T09:12:27.6436342Z %213 = arith.addi %arg4, %212 : i32 2026-02-21T09:12:27.6436573Z %214 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6436830Z %215 = tt.splat %213 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6437034Z %216 = arith.addi %215, %214 : tensor<64xi32> 2026-02-21T09:12:27.6437236Z %217 = arith.muli %213, %c2_i32 : i32 2026-02-21T09:12:27.6437473Z %218 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6437732Z %219 = tt.splat %217 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6437944Z %220 = arith.addi %219, %218 : tensor<128xi32> 2026-02-21T09:12:27.6438204Z %221 = tt.expand_dims %109 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6438487Z %222 = 
arith.muli %221, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6438751Z %223 = tt.expand_dims %220 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6439060Z %224 = tt.broadcast %222 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6439342Z %225 = tt.broadcast %223 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6439588Z %226 = arith.addi %224, %225 : tensor<256x128xi32> 2026-02-21T09:12:27.6439849Z %227 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6440147Z %228 = tt.addptr %227, %226 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6440475Z %229 = tt.load %228 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6440781Z %230 = arith.extf %229 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6441088Z %231 = tt.expand_dims %216 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6441368Z %232 = arith.muli %231, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6441646Z %233 = tt.expand_dims %105 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6441945Z %234 = tt.broadcast %232 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6442209Z %235 = tt.broadcast %233 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6442500Z %236 = arith.addi %234, %235 : tensor<64x32xi32> 2026-02-21T09:12:27.6442747Z %237 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6443026Z %238 = tt.addptr %237, %236 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6443342Z %239 = tt.load %238 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6443614Z %240 = arith.shli %239, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6443874Z %241 = arith.shrsi %240, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6444092Z %242 = arith.shrsi %239, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6444344Z %243 = tt.make_range {end = 2 : i32, start = 0 : i32} : 
tensor<2xi32> 2026-02-21T09:12:27.6444644Z %244 = tt.expand_dims %243 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6444959Z %245 = tt.expand_dims %244 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6445291Z %246 = tt.expand_dims %241 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6445645Z %247 = tt.expand_dims %242 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6445941Z %248 = arith.cmpi eq, %245, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6446196Z %249 = tt.broadcast %248 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6446510Z %250 = tt.broadcast %246 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6446816Z %251 = arith.select %249, %250, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6447098Z %252 = arith.cmpi eq, %245, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6447366Z %253 = tt.broadcast %247 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6447646Z %254 = tt.broadcast %252 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6447942Z %255 = arith.select %254, %253, %251 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6448232Z %256 = tt.reshape %255 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6448504Z %257 = arith.sitofp %256 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6448887Z %258 = tt.dot %230, %257, %211, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6449217Z %c3_i32_15 = arith.constant 3 : i32 2026-02-21T09:12:27.6449421Z %259 = arith.muli %c64_i32, %c3_i32_15 : i32 2026-02-21T09:12:27.6449617Z %260 = arith.addi %arg4, %259 : i32 2026-02-21T09:12:27.6449853Z %261 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6450109Z %262 = tt.splat %260 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6450318Z %263 = arith.addi %262, 
%261 : tensor<64xi32> 2026-02-21T09:12:27.6450524Z %264 = arith.muli %260, %c2_i32 : i32 2026-02-21T09:12:27.6450757Z %265 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6451018Z %266 = tt.splat %264 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6451222Z %267 = arith.addi %266, %265 : tensor<128xi32> 2026-02-21T09:12:27.6451490Z %268 = tt.expand_dims %109 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6451798Z %269 = arith.muli %268, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6452064Z %270 = tt.expand_dims %267 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6452373Z %271 = tt.broadcast %269 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6452651Z %272 = tt.broadcast %270 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6452910Z %273 = arith.addi %271, %272 : tensor<256x128xi32> 2026-02-21T09:12:27.6453166Z %274 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6453467Z %275 = tt.addptr %274, %273 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6453832Z %276 = tt.load %275 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6454139Z %277 = arith.extf %276 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6454442Z %278 = tt.expand_dims %263 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6454714Z %279 = arith.muli %278, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6455005Z %280 = tt.expand_dims %105 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6455305Z %281 = tt.broadcast %279 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6455573Z %282 = tt.broadcast %280 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6455823Z %283 = arith.addi %281, %282 : tensor<64x32xi32> 2026-02-21T09:12:27.6456059Z %284 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6456346Z %285 
= tt.addptr %284, %283 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6456680Z %286 = tt.load %285 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6456953Z %287 = arith.shli %286, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6457175Z %288 = arith.shrsi %287, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6457394Z %289 = arith.shrsi %286, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6457689Z %290 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6457984Z %291 = tt.expand_dims %290 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6458309Z %292 = tt.expand_dims %291 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6458643Z %293 = tt.expand_dims %288 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6458968Z %294 = tt.expand_dims %289 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6459260Z %295 = arith.cmpi eq, %292, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6459517Z %296 = tt.broadcast %295 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6459798Z %297 = tt.broadcast %293 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6460090Z %298 = arith.select %296, %297, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6460375Z %299 = arith.cmpi eq, %292, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6460643Z %300 = tt.broadcast %294 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6460920Z %301 = tt.broadcast %299 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6461226Z %302 = arith.select %301, %300, %298 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6461525Z %303 = tt.reshape %302 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6461865Z %304 = arith.sitofp %303 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6462266Z %305 = tt.dot %277, %304, %258, inputPrecision = tf32 : 
tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6462612Z scf.yield %305 : tensor<256x32xf32> 2026-02-21T09:12:27.6462809Z } {tt.flatten} 2026-02-21T09:12:27.6463019Z %111 = arith.truncf %110 : tensor<256x32xf32> to tensor<256x32xbf16> 2026-02-21T09:12:27.6463341Z %112 = tt.expand_dims %109 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6463631Z %113 = arith.muli %112, %cst_0 : tensor<256x1xi32> 2026-02-21T09:12:27.6463912Z %114 = tt.expand_dims %105 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6464230Z %115 = tt.broadcast %113 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6464511Z %116 = tt.broadcast %114 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6464773Z %117 = arith.addi %115, %116 : tensor<256x32xi32> 2026-02-21T09:12:27.6465072Z %118 = tt.splat %arg2 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6465390Z %119 = tt.addptr %118, %117 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:12:27.6465681Z tt.store %119, %111 : tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6465944Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:12:27.6466220Z scf.for %arg3 = %8 to %c4096_i32 step %c2368_i32 : i32 { 2026-02-21T09:12:27.6466483Z %10 = arith.divsi %arg3, %c256_i32 : i32 2026-02-21T09:12:27.6466692Z %11 = arith.muli %10, %c16_i32 : i32 2026-02-21T09:12:27.6466887Z %12 = arith.subi %c256_i32, %11 : i32 2026-02-21T09:12:27.6467087Z %13 = arith.minsi %12, %c16_i32 : i32 2026-02-21T09:12:27.6467285Z %14 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T09:12:27.6467490Z %15 = arith.remsi %14, %13 : i32 2026-02-21T09:12:27.6467685Z %16 = arith.addi %11, %15 : i32 2026-02-21T09:12:27.6467865Z %17 = arith.divsi %14, %13 : i32 2026-02-21T09:12:27.6468057Z %18 = arith.muli %16, %c32_i32 : i32 2026-02-21T09:12:27.6468317Z %19 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:12:27.6468583Z 
%20 = tt.splat %18 : i32 -> tensor<32xi32> 2026-02-21T09:12:27.6468788Z %21 = arith.addi %20, %19 : tensor<32xi32> 2026-02-21T09:12:27.6468989Z %22 = arith.muli %17, %c256_i32 : i32 2026-02-21T09:12:27.6469251Z %23 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:12:27.6469519Z %24 = tt.splat %22 : i32 -> tensor<256xi32> 2026-02-21T09:12:27.6469724Z %25 = arith.addi %24, %23 : tensor<256xi32> 2026-02-21T09:12:27.6469919Z %c256_i32_7 = arith.constant 256 : i32 2026-02-21T09:12:27.6470246Z %26 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c256_i32_7 iter_args(%arg5 = %cst_6) -> (tensor<256x32xf32>) : i32 { 2026-02-21T09:12:27.6470605Z %36 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6470865Z %37 = tt.splat %arg4 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6471079Z %38 = arith.addi %37, %36 : tensor<64xi32> 2026-02-21T09:12:27.6471274Z %39 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:12:27.6471515Z %40 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6471780Z %41 = tt.splat %39 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6471991Z %42 = arith.addi %41, %40 : tensor<128xi32> 2026-02-21T09:12:27.6472249Z %43 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6472532Z %44 = arith.muli %43, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6472796Z %45 = tt.expand_dims %42 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6473089Z %46 = tt.broadcast %44 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6473365Z %47 = tt.broadcast %45 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6473612Z %48 = arith.addi %46, %47 : tensor<256x128xi32> 2026-02-21T09:12:27.6473869Z %49 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6474170Z %50 = tt.addptr %49, %48 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6474492Z %51 = 
tt.load %50 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6474798Z %52 = arith.extf %51 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6475087Z %53 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6475360Z %54 = arith.muli %53, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6475610Z %55 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6475899Z %56 = tt.broadcast %54 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6476162Z %57 = tt.broadcast %55 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6476437Z %58 = arith.addi %56, %57 : tensor<64x32xi32> 2026-02-21T09:12:27.6476682Z %59 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6476953Z %60 = tt.addptr %59, %58 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6477253Z %61 = tt.load %60 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6477517Z %62 = arith.shli %61, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6477762Z %63 = arith.shrsi %62, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6477982Z %64 = arith.shrsi %61, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6478223Z %65 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6478516Z %66 = tt.expand_dims %65 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6478816Z %67 = tt.expand_dims %66 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6479158Z %68 = tt.expand_dims %63 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6479476Z %69 = tt.expand_dims %64 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6479765Z %70 = arith.cmpi eq, %67, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6480022Z %71 = tt.broadcast %70 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6480317Z %72 = tt.broadcast %68 : 
tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6480608Z %73 = arith.select %71, %72, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6480877Z %74 = arith.cmpi eq, %67, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6481129Z %75 = tt.broadcast %69 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6481396Z %76 = tt.broadcast %74 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6481695Z %77 = arith.select %76, %75, %73 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6481978Z %78 = tt.reshape %77 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6482239Z %79 = arith.sitofp %78 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6482613Z %80 = tt.dot %52, %79, %arg5, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6482942Z %c1_i32_8 = arith.constant 1 : i32 2026-02-21T09:12:27.6483147Z %81 = arith.muli %c64_i32, %c1_i32_8 : i32 2026-02-21T09:12:27.6483348Z %82 = arith.addi %arg4, %81 : i32 2026-02-21T09:12:27.6483573Z %83 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6483826Z %84 = tt.splat %82 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6484027Z %85 = arith.addi %84, %83 : tensor<64xi32> 2026-02-21T09:12:27.6484228Z %86 = arith.muli %82, %c2_i32 : i32 2026-02-21T09:12:27.6484463Z %87 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6484723Z %88 = tt.splat %86 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6484933Z %89 = arith.addi %88, %87 : tensor<128xi32> 2026-02-21T09:12:27.6485185Z %90 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6485465Z %91 = arith.muli %90, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6485722Z %92 = tt.expand_dims %89 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6486023Z %93 = tt.broadcast %91 : tensor<256x1xi32> -> tensor<256x128xi32> 
2026-02-21T09:12:27.6486290Z %94 = tt.broadcast %92 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6486541Z %95 = arith.addi %93, %94 : tensor<256x128xi32> 2026-02-21T09:12:27.6486795Z %96 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6487087Z %97 = tt.addptr %96, %95 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6487448Z %98 = tt.load %97 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6487746Z %99 = arith.extf %98 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6488045Z %100 = tt.expand_dims %85 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6488326Z %101 = arith.muli %100, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6488583Z %102 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6488907Z %103 = tt.broadcast %101 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6489173Z %104 = tt.broadcast %102 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6489421Z %105 = arith.addi %103, %104 : tensor<64x32xi32> 2026-02-21T09:12:27.6489659Z %106 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6489947Z %107 = tt.addptr %106, %105 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6490289Z %108 = tt.load %107 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6490563Z %109 = arith.shli %108, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6490791Z %110 = arith.shrsi %109, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6491013Z %111 = arith.shrsi %108, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6491294Z %112 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6491633Z %113 = tt.expand_dims %112 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6491772Z %114 = tt.expand_dims %113 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6491907Z %115 
= tt.expand_dims %110 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6492040Z %116 = tt.expand_dims %111 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6492145Z %117 = arith.cmpi eq, %114, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6492253Z %118 = tt.broadcast %117 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6492362Z %119 = tt.broadcast %115 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6492495Z %120 = arith.select %118, %119, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6492589Z %121 = arith.cmpi eq, %114, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6492697Z %122 = tt.broadcast %116 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6492807Z %123 = tt.broadcast %121 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6492925Z %124 = arith.select %123, %122, %120 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6493028Z %125 = tt.reshape %124 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6493145Z %126 = arith.sitofp %125 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6493343Z %127 = tt.dot %99, %126, %80, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6493417Z %c2_i32_9 = arith.constant 2 : i32 2026-02-21T09:12:27.6493491Z %128 = arith.muli %c64_i32, %c2_i32_9 : i32 2026-02-21T09:12:27.6493565Z %129 = arith.addi %arg4, %128 : i32 2026-02-21T09:12:27.6493679Z %130 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6493759Z %131 = tt.splat %129 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6493845Z %132 = arith.addi %131, %130 : tensor<64xi32> 2026-02-21T09:12:27.6493916Z %133 = arith.muli %129, %c2_i32 : i32 2026-02-21T09:12:27.6494031Z %134 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6494115Z %135 = tt.splat %133 : i32 -> tensor<128xi32> 
2026-02-21T09:12:27.6494194Z %136 = arith.addi %135, %134 : tensor<128xi32> 2026-02-21T09:12:27.6494352Z %137 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6494444Z %138 = arith.muli %137, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6494574Z %139 = tt.expand_dims %136 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6494684Z %140 = tt.broadcast %138 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6494793Z %141 = tt.broadcast %139 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6494909Z %142 = arith.addi %140, %141 : tensor<256x128xi32> 2026-02-21T09:12:27.6495023Z %143 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6495149Z %144 = tt.addptr %143, %142 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6495290Z %145 = tt.load %144 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6495401Z %146 = arith.extf %145 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6495562Z %147 = tt.expand_dims %132 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6495655Z %148 = arith.muli %147, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6495780Z %149 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6495885Z %150 = tt.broadcast %148 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6496019Z %151 = tt.broadcast %149 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6496104Z %152 = arith.addi %150, %151 : tensor<64x32xi32> 2026-02-21T09:12:27.6496209Z %153 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6496334Z %154 = tt.addptr %153, %152 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6496461Z %155 = tt.load %154 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6496544Z %156 = arith.shli %155, %cst_3 : tensor<64x32xi8> 
2026-02-21T09:12:27.6496628Z %157 = arith.shrsi %156, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6496718Z %158 = arith.shrsi %155, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6496827Z %159 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6496950Z %160 = tt.expand_dims %159 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6497085Z %161 = tt.expand_dims %160 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6497216Z %162 = tt.expand_dims %157 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6497346Z %163 = tt.expand_dims %158 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6497447Z %164 = arith.cmpi eq, %161, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6497552Z %165 = tt.broadcast %164 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6497661Z %166 = tt.broadcast %162 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6497791Z %167 = arith.select %165, %166, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6497883Z %168 = arith.cmpi eq, %161, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6497990Z %169 = tt.broadcast %163 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6498100Z %170 = tt.broadcast %168 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6498222Z %171 = arith.select %170, %169, %167 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6498325Z %172 = tt.reshape %171 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6498432Z %173 = arith.sitofp %172 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6498640Z %174 = tt.dot %146, %173, %127, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6498713Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:12:27.6498814Z %175 = arith.muli %c64_i32, %c3_i32 : i32 2026-02-21T09:12:27.6498893Z %176 = arith.addi %arg4, 
%175 : i32 2026-02-21T09:12:27.6499009Z %177 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:12:27.6499094Z %178 = tt.splat %176 : i32 -> tensor<64xi32> 2026-02-21T09:12:27.6499185Z %179 = arith.addi %178, %177 : tensor<64xi32> 2026-02-21T09:12:27.6499255Z %180 = arith.muli %176, %c2_i32 : i32 2026-02-21T09:12:27.6499393Z %181 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:12:27.6499479Z %182 = tt.splat %180 : i32 -> tensor<128xi32> 2026-02-21T09:12:27.6499559Z %183 = arith.addi %182, %181 : tensor<128xi32> 2026-02-21T09:12:27.6499689Z %184 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6499771Z %185 = arith.muli %184, %cst_5 : tensor<256x1xi32> 2026-02-21T09:12:27.6499910Z %186 = tt.expand_dims %183 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:12:27.6500044Z %187 = tt.broadcast %185 : tensor<256x1xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6500154Z %188 = tt.broadcast %186 : tensor<1x128xi32> -> tensor<256x128xi32> 2026-02-21T09:12:27.6500246Z %189 = arith.addi %187, %188 : tensor<256x128xi32> 2026-02-21T09:12:27.6500384Z %190 = tt.splat %arg0 : !tt.ptr -> tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6500514Z %191 = tt.addptr %190, %189 : tensor<256x128x!tt.ptr>, tensor<256x128xi32> 2026-02-21T09:12:27.6500659Z %192 = tt.load %191 evictionPolicy = evict_last : tensor<256x128x!tt.ptr> 2026-02-21T09:12:27.6500772Z %193 = arith.extf %192 : tensor<256x128xbf16> to tensor<256x128xf32> 2026-02-21T09:12:27.6500897Z %194 = tt.expand_dims %179 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:12:27.6500988Z %195 = arith.muli %194, %cst_4 : tensor<64x1xi32> 2026-02-21T09:12:27.6501114Z %196 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6501221Z %197 = tt.broadcast %195 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6501332Z %198 = tt.broadcast %196 : 
tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:12:27.6501417Z %199 = arith.addi %197, %198 : tensor<64x32xi32> 2026-02-21T09:12:27.6501524Z %200 = tt.splat %arg1 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6501685Z %201 = tt.addptr %200, %199 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:12:27.6501819Z %202 = tt.load %201 evictionPolicy = evict_last : tensor<64x32x!tt.ptr> 2026-02-21T09:12:27.6501901Z %203 = arith.shli %202, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6501984Z %204 = arith.shrsi %203, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6502076Z %205 = arith.shrsi %202, %cst_3 : tensor<64x32xi8> 2026-02-21T09:12:27.6502187Z %206 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:12:27.6502313Z %207 = tt.expand_dims %206 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:12:27.6502450Z %208 = tt.expand_dims %207 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:12:27.6502581Z %209 = tt.expand_dims %204 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6502715Z %210 = tt.expand_dims %205 {axis = 1 : i32} : tensor<64x32xi8> -> tensor<64x1x32xi8> 2026-02-21T09:12:27.6502816Z %211 = arith.cmpi eq, %208, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6502925Z %212 = tt.broadcast %211 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6503035Z %213 = tt.broadcast %209 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6503168Z %214 = arith.select %212, %213, %cst : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6503299Z %215 = arith.cmpi eq, %208, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:12:27.6503407Z %216 = tt.broadcast %210 : tensor<64x1x32xi8> -> tensor<64x2x32xi8> 2026-02-21T09:12:27.6503511Z %217 = tt.broadcast %215 : tensor<1x2x1xi1> -> tensor<64x2x32xi1> 2026-02-21T09:12:27.6503635Z %218 = arith.select %217, %216, %214 : tensor<64x2x32xi1>, tensor<64x2x32xi8> 2026-02-21T09:12:27.6503736Z %219 = 
tt.reshape %218 : tensor<64x2x32xi8> -> tensor<128x32xi8> 2026-02-21T09:12:27.6503882Z %220 = arith.sitofp %219 : tensor<128x32xi8> to tensor<128x32xf32> 2026-02-21T09:12:27.6504089Z %221 = tt.dot %193, %220, %174, inputPrecision = tf32 : tensor<256x128xf32> * tensor<128x32xf32> -> tensor<256x32xf32> 2026-02-21T09:12:27.6504159Z scf.yield %221 : tensor<256x32xf32> 2026-02-21T09:12:27.6504219Z } {tt.flatten} 2026-02-21T09:12:27.6504341Z %27 = arith.truncf %26 : tensor<256x32xf32> to tensor<256x32xbf16> 2026-02-21T09:12:27.6504472Z %28 = tt.expand_dims %25 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:12:27.6504596Z %29 = arith.muli %28, %cst_0 : tensor<256x1xi32> 2026-02-21T09:12:27.6504733Z %30 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:12:27.6504844Z %31 = tt.broadcast %29 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6504952Z %32 = tt.broadcast %30 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:12:27.6505075Z %33 = arith.addi %31, %32 : tensor<256x32xi32> 2026-02-21T09:12:27.6505202Z %34 = tt.splat %arg2 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6505324Z %35 = tt.addptr %34, %33 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:12:27.6505413Z tt.store %35, %27 : tensor<256x32x!tt.ptr> 2026-02-21T09:12:27.6505529Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:12:27.6505589Z tt.return 2026-02-21T09:12:27.6505646Z } 2026-02-21T09:12:27.6505713Z } 2026-02-21T09:12:27.6505719Z 2026-02-21T09:12:27.6505772Z {-# 2026-02-21T09:12:27.6505839Z external_resources: { 2026-02-21T09:12:27.6505905Z mlir_reproducer: { 2026-02-21T09:12:27.6510479Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, 
tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:12:27.6510564Z disable_threading: false, 2026-02-21T09:12:27.6510657Z verify_each: true 2026-02-21T09:12:27.6510713Z } 2026-02-21T09:12:27.6510774Z } 2026-02-21T09:12:27.6510830Z #-} 2026-02-21T09:12:27.6511265Z /tmp/torchinductor_root/uk/cukse4t4frgzo53t735r7gqhvhfhsehrahjmjfsgoviuspg2qejf.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:12:27.6512077Z /tmp/torchinductor_root/uk/cukse4t4frgzo53t735r7gqhvhfhsehrahjmjfsgoviuspg2qejf.py:14:0: note: 
Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:12:27.6512312Z [334s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:12:27.6513311Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 256, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 0], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:12:27.6513406Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:12:27.6513544Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:12:30.0723266Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 110/110 14.4 configs/s 2026-02-21T09:12:39.9058863Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 838/838 85.0 configs/s 2026-02-21T09:12:40.1059635Z [347s] Generation 5 complete: 2026-02-21T09:12:40.1061749Z error=28 2026-02-21T09:12:40.1061949Z timeout=1 2026-02-21T09:12:40.1062098Z ok=85 2026-02-21T09:12:40.1062232Z min=0.2386 2026-02-21T09:12:40.1062356Z mid=0.3962 2026-02-21T09:12:40.1062488Z max=39.7751 2026-02-21T09:12:40.1062632Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:12:40.1062915Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:12:40.1063158Z 'l2_groupings': [2], 2026-02-21T09:12:40.1063350Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:12:40.1063555Z 'loop_orders': [[0, 1]], 2026-02-21T09:12:40.1063711Z 'maxnreg': 128, 2026-02-21T09:12:40.1063862Z 'num_sm_multiplier': 4, 2026-02-21T09:12:40.1064014Z 'num_stages': 3, 2026-02-21T09:12:40.1064169Z 
'num_warps': 4, 2026-02-21T09:12:40.1064337Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:12:40.1064650Z 'range_flattens': [None, None], 2026-02-21T09:12:40.1064832Z 'range_multi_buffers': [True, False], 2026-02-21T09:12:40.1065062Z 'range_num_stages': [3, 4], 2026-02-21T09:12:40.1069366Z 'range_unroll_factors': [0, 0], 2026-02-21T09:12:40.1072909Z 'range_warp_specializes': [True, None]} 2026-02-21T09:12:40.1090892Z [347s] Fitting surrogate: 643 points, 643 targets 2026-02-21T09:12:41.4385358Z [348s] Generation 6 starting: 95 neighbors, 5 active search path(s) 2026-02-21T09:13:17.1646094Z [384s] Timeout after 30s compiling Config(block_sizes=[64, 128, 8], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'first'], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, False]) 2026-02-21T09:13:17.1664586Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.5 configs/s 2026-02-21T09:13:17.6673551Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:13:17.6677825Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:13:17.6679805Z ^ 2026-02-21T09:13:17.6680232Z /tmp/torchinductor_root/3w/c3wd2lloww2nmydc7k363b5ll4kf7zy7k67zg6qsoxumx47lrq3i.py:91:40: note: called from 2026-02-21T09:13:17.6680936Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:13:17.6681170Z ^ 2026-02-21T09:13:17.6683398Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:13:17.6683912Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:13:17.6684179Z ^ 
2026-02-21T09:13:17.6684536Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T09:13:17.6685106Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:13:17.6685682Z %cst = arith.constant dense<0> : tensor<16x2x128xi8> 2026-02-21T09:13:17.6685914Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:13:17.6686120Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:13:17.6686316Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:13:17.6686565Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T09:13:17.6686795Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:13:17.6687032Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:13:17.6687270Z %cst_2 = arith.constant dense<4> : tensor<16x128xi8> 2026-02-21T09:13:17.6687573Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32> 2026-02-21T09:13:17.6687833Z %cst_4 = arith.constant dense<1024> : tensor<256x1xi32> 2026-02-21T09:13:17.6688053Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:13:17.6688292Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<256x128xf32> 2026-02-21T09:13:17.6688540Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:13:17.6688727Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:13:17.6688921Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:13:17.6689108Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:13:17.6689309Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:13:17.6689501Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T09:13:17.6689699Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:13:17.6690033Z %0 = tt.make_tensor_descriptor %arg2, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T09:13:17.6690384Z %1 = tt.get_program_id x : i32 2026-02-21T09:13:17.6690606Z scf.for %arg3 = %1 to %c1024_i32 step %c2368_i32 : i32 { 2026-02-21T09:13:17.6690834Z %2 = 
arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T09:13:17.6691033Z %3 = arith.muli %2, %c16_i32 : i32 2026-02-21T09:13:17.6691213Z %4 = arith.subi %c16_i32, %3 : i32 2026-02-21T09:13:17.6691397Z %5 = arith.minsi %4, %c16_i32 : i32 2026-02-21T09:13:17.6691618Z %6 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T09:13:17.6691811Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:13:17.6691986Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:13:17.6692150Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:13:17.6692334Z %10 = arith.muli %8, %c256_i32 : i32 2026-02-21T09:13:17.6692561Z %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:13:17.6692828Z %12 = tt.splat %10 : i32 -> tensor<256xi32> 2026-02-21T09:13:17.6693030Z %13 = arith.addi %12, %11 : tensor<256xi32> 2026-02-21T09:13:17.6693228Z %14 = arith.muli %9, %c128_i32 : i32 2026-02-21T09:13:17.6693456Z %15 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:13:17.6693714Z %16 = tt.splat %14 : i32 -> tensor<128xi32> 2026-02-21T09:13:17.6693919Z %17 = arith.addi %16, %15 : tensor<128xi32> 2026-02-21T09:13:17.6694110Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:13:17.6694429Z %18 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c32_i32 iter_args(%arg5 = %cst_5) -> (tensor<256x128xf32>) : i32 { 2026-02-21T09:13:17.6694794Z %20 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:13:17.6695100Z %21 = tt.splat %arg4 : i32 -> tensor<16xi32> 2026-02-21T09:13:17.6695317Z %22 = arith.addi %21, %20 : tensor<16xi32> 2026-02-21T09:13:17.6695516Z %23 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:13:17.6695753Z %24 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:13:17.6695997Z %25 = tt.splat %23 : i32 -> tensor<32xi32> 2026-02-21T09:13:17.6696235Z %26 = arith.addi %25, %24 : tensor<32xi32> 2026-02-21T09:13:17.6696489Z %27 = tt.expand_dims %13 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 
2026-02-21T09:13:17.6696775Z %28 = arith.muli %27, %cst_4 : tensor<256x1xi32> 2026-02-21T09:13:17.6697035Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:13:17.6697326Z %30 = tt.broadcast %28 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:13:17.6697599Z %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:13:17.6697837Z %32 = arith.addi %30, %31 : tensor<256x32xi32> 2026-02-21T09:13:17.6698120Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:13:17.6698414Z %34 = tt.addptr %33, %32 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:13:17.6698731Z %35 = tt.load %34 evictionPolicy = evict_first : tensor<256x32x!tt.ptr> 2026-02-21T09:13:17.6699071Z %36 = arith.extf %35 : tensor<256x32xbf16> to tensor<256x32xf32> 2026-02-21T09:13:17.6699354Z %37 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:13:17.6699626Z %38 = arith.muli %37, %cst_3 : tensor<16x1xi32> 2026-02-21T09:13:17.6699883Z %39 = tt.expand_dims %17 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:13:17.6700180Z %40 = tt.broadcast %38 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:13:17.6700440Z %41 = tt.broadcast %39 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:13:17.6700687Z %42 = arith.addi %40, %41 : tensor<16x128xi32> 2026-02-21T09:13:17.6700930Z %43 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:13:17.6701206Z %44 = tt.addptr %43, %42 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:13:17.6701509Z %45 = tt.load %44 evictionPolicy = evict_last : tensor<16x128x!tt.ptr> 2026-02-21T09:13:17.6701817Z %46 = arith.shli %45, %cst_2 : tensor<16x128xi8> 2026-02-21T09:13:17.6702048Z %47 = arith.shrsi %46, %cst_2 : tensor<16x128xi8> 2026-02-21T09:13:17.6702277Z %48 = arith.shrsi %45, %cst_2 : tensor<16x128xi8> 2026-02-21T09:13:17.6702522Z %49 = tt.make_range {end = 2 : i32, start = 0 : i32} : 
tensor<2xi32> 2026-02-21T09:13:17.6702818Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:13:17.6703125Z %51 = tt.expand_dims %50 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:13:17.6703461Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:13:17.6703790Z %53 = tt.expand_dims %48 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:13:17.6704091Z %54 = arith.cmpi eq, %51, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:13:17.6704348Z %55 = tt.broadcast %54 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:13:17.6704624Z %56 = tt.broadcast %52 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:13:17.6704930Z %57 = arith.select %55, %56, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:13:17.6705208Z %58 = arith.cmpi eq, %51, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:13:17.6705471Z %59 = tt.broadcast %53 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:13:17.6705747Z %60 = tt.broadcast %58 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:13:17.6706052Z %61 = arith.select %60, %59, %57 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:13:17.6706335Z %62 = tt.reshape %61 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:13:17.6706593Z %63 = arith.sitofp %62 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:13:17.6706961Z %64 = tt.dot %36, %63, %arg5, inputPrecision = tf32 : tensor<256x32xf32> * tensor<32x128xf32> -> tensor<256x128xf32> 2026-02-21T09:13:17.6707289Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:13:17.6707518Z %65 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:13:17.6707718Z %66 = arith.addi %arg4, %65 : i32 2026-02-21T09:13:17.6707945Z %67 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:13:17.6708202Z %68 = tt.splat %66 : i32 -> tensor<16xi32> 2026-02-21T09:13:17.6708402Z %69 = arith.addi %68, %67 : tensor<16xi32> 
2026-02-21T09:13:17.6708601Z %70 = arith.muli %66, %c2_i32 : i32 2026-02-21T09:13:17.6708827Z %71 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:13:17.6709102Z %72 = tt.splat %70 : i32 -> tensor<32xi32> 2026-02-21T09:13:17.6709303Z %73 = arith.addi %72, %71 : tensor<32xi32> 2026-02-21T09:13:17.6709551Z %74 = tt.expand_dims %13 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:13:17.6709825Z %75 = arith.muli %74, %cst_4 : tensor<256x1xi32> 2026-02-21T09:13:17.6710099Z %76 = tt.expand_dims %73 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:13:17.6710400Z %77 = tt.broadcast %75 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:13:17.6710659Z %78 = tt.broadcast %76 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:13:17.6710901Z %79 = arith.addi %77, %78 : tensor<256x32xi32> 2026-02-21T09:13:17.6711148Z %80 = tt.splat %arg0 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:13:17.6711434Z %81 = tt.addptr %80, %79 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:13:17.6711780Z %82 = tt.load %81 evictionPolicy = evict_first : tensor<256x32x!tt.ptr> 2026-02-21T09:13:17.6712080Z %83 = arith.extf %82 : tensor<256x32xbf16> to tensor<256x32xf32> 2026-02-21T09:13:17.6712370Z %84 = tt.expand_dims %69 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:13:17.6712636Z %85 = arith.muli %84, %cst_3 : tensor<16x1xi32> 2026-02-21T09:13:17.6712903Z %86 = tt.expand_dims %17 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:13:17.6713200Z %87 = tt.broadcast %85 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:13:17.6713460Z %88 = tt.broadcast %86 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:13:17.6713709Z %89 = arith.addi %87, %88 : tensor<16x128xi32> 2026-02-21T09:13:17.6713944Z %90 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:13:17.6714228Z %91 = tt.addptr %90, %89 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 
2026-02-21T09:13:17.6714538Z %92 = tt.load %91 evictionPolicy = evict_last : tensor<16x128x!tt.ptr> 2026-02-21T09:13:17.6714808Z %93 = arith.shli %92, %cst_2 : tensor<16x128xi8> 2026-02-21T09:13:17.6715035Z %94 = arith.shrsi %93, %cst_2 : tensor<16x128xi8> 2026-02-21T09:13:17.6715254Z %95 = arith.shrsi %92, %cst_2 : tensor<16x128xi8> 2026-02-21T09:13:17.6715507Z %96 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:13:17.6715796Z %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:13:17.6716108Z %98 = tt.expand_dims %97 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:13:17.6716437Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:13:17.6716790Z %100 = tt.expand_dims %95 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:13:17.6717120Z %101 = arith.cmpi eq, %98, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:13:17.6717394Z %102 = tt.broadcast %101 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:13:17.6717677Z %103 = tt.broadcast %99 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:13:17.6717984Z %104 = arith.select %102, %103, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:13:17.6718268Z %105 = arith.cmpi eq, %98, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:13:17.6718569Z %106 = tt.broadcast %100 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:13:17.6718857Z %107 = tt.broadcast %105 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:13:17.6719169Z %108 = arith.select %107, %106, %104 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:13:17.6719471Z %109 = tt.reshape %108 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:13:17.6719750Z %110 = arith.sitofp %109 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:13:17.6720179Z %111 = tt.dot %83, %110, %64, inputPrecision = tf32 : tensor<256x32xf32> * tensor<32x128xf32> -> tensor<256x128xf32> 
2026-02-21T09:13:17.6720509Z scf.yield %111 : tensor<256x128xf32> 2026-02-21T09:13:17.6720696Z } 2026-02-21T09:13:17.6720878Z %19 = arith.truncf %18 : tensor<256x128xf32> to tensor<256x128xbf16> 2026-02-21T09:13:17.6721249Z tt.descriptor_store %0[%10, %14], %19 : !tt.tensordesc>, tensor<256x128xbf16> 2026-02-21T09:13:17.6721626Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32} 2026-02-21T09:13:17.6721839Z tt.return 2026-02-21T09:13:17.6721979Z } 2026-02-21T09:13:17.6722102Z } 2026-02-21T09:13:17.6722179Z 2026-02-21T09:13:17.6722231Z {-# 2026-02-21T09:13:17.6722360Z external_resources: { 2026-02-21T09:13:17.6722526Z mlir_reproducer: { 2026-02-21T09:13:17.6727102Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, 
triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:13:17.6731827Z disable_threading: false, 2026-02-21T09:13:17.6732007Z verify_each: true 2026-02-21T09:13:17.6732152Z } 2026-02-21T09:13:17.6732284Z } 2026-02-21T09:13:17.6732403Z #-} 2026-02-21T09:13:17.6732834Z /tmp/torchinductor_root/3w/c3wd2lloww2nmydc7k363b5ll4kf7zy7k67zg6qsoxumx47lrq3i.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:13:17.6733931Z /tmp/torchinductor_root/3w/c3wd2lloww2nmydc7k363b5ll4kf7zy7k67zg6qsoxumx47lrq3i.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:13:17.6734780Z [384s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:13:17.6736046Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 256, 128], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[3, 0], range_unroll_factors=[0, 2], range_warp_specializes=[False, None]), static_shapes=True) 2026-02-21T09:13:17.6737180Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:13:17.6737453Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:13:22.5399809Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 17.8 configs/s 2026-02-21T09:13:31.6357016Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 845/845 92.7 configs/s 2026-02-21T09:13:31.8303023Z [398s] Generation 6 complete: 2026-02-21T09:13:31.8304865Z error=31 2026-02-21T09:13:31.8305268Z timeout=1 2026-02-21T09:13:31.8305466Z ok=68 2026-02-21T09:13:31.8305611Z min=0.2366 2026-02-21T09:13:31.8305749Z mid=0.4291 2026-02-21T09:13:31.8305893Z max=26.8421 2026-02-21T09:13:31.8306055Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:13:31.8306336Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:13:31.8306584Z 'l2_groupings': [2], 2026-02-21T09:13:31.8306773Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:13:31.8306976Z 'loop_orders': [[0, 1]], 2026-02-21T09:13:31.8307141Z 'maxnreg': 128, 2026-02-21T09:13:31.8307297Z 'num_sm_multiplier': 2, 2026-02-21T09:13:31.8307463Z 'num_stages': 3, 2026-02-21T09:13:31.8307619Z 'num_warps': 4, 2026-02-21T09:13:31.8307780Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:13:31.8307985Z 'range_flattens': [None, None], 2026-02-21T09:13:31.8308170Z 'range_multi_buffers': [True, False], 2026-02-21T09:13:31.8308366Z 'range_num_stages': [3, 4], 2026-02-21T09:13:31.8308541Z 'range_unroll_factors': [0, 0], 2026-02-21T09:13:31.8308738Z 'range_warp_specializes': [True, None]} 2026-02-21T09:13:31.8351857Z [398s] Fitting surrogate: 743 points, 743 targets 2026-02-21T09:13:33.7641892Z [400s] Generation 7 starting: 98 neighbors, 5 active search path(s) 2026-02-21T09:14:11.0634772Z [438s] Timeout after 30s compiling Config(block_sizes=[128, 128, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=16, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, False], 
range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:14:11.2432865Z [438s] Timeout after 30s compiling Config(block_sizes=[32, 128, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:14:11.2448338Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s 2026-02-21T09:14:13.0139660Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:14:13.0144428Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:14:13.0150322Z ^ 2026-02-21T09:14:13.0154556Z /tmp/torchinductor_root/z7/cz7calrzptiav2nmlvlgqwwlppa3pl7hco25am5d47nz4xwvtvmi.py:91:40: note: called from 2026-02-21T09:14:13.0159037Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:14:13.0160529Z ^ 2026-02-21T09:14:13.0161001Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:14:13.0161794Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:14:13.0162048Z ^ 2026-02-21T09:14:13.0162332Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T09:14:13.0167564Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:14:13.0174028Z %cst = arith.constant dense<0> : tensor<16x2x128xi8> 2026-02-21T09:14:13.0176641Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:14:13.0176859Z 
%c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:14:13.0177073Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:14:13.0177271Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:14:13.0177481Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:14:13.0177762Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T09:14:13.0178004Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:14:13.0178253Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:14:13.0178501Z %cst_2 = arith.constant dense<4> : tensor<16x128xi8> 2026-02-21T09:14:13.0178750Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32> 2026-02-21T09:14:13.0179003Z %cst_4 = arith.constant dense<1024> : tensor<128x1xi32> 2026-02-21T09:14:13.0179236Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:14:13.0179472Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<128x128xf32> 2026-02-21T09:14:13.0179726Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:14:13.0179933Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:14:13.0180133Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:14:13.0180328Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:14:13.0180512Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T09:14:13.0180701Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:14:13.0181024Z %0 = tt.make_tensor_descriptor %arg2, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T09:14:13.0181362Z %1 = tt.get_program_id x : i32 2026-02-21T09:14:13.0181662Z scf.for %arg3 = %1 to %c2048_i32 step %c2368_i32 : i32 { 2026-02-21T09:14:13.0181893Z %2 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T09:14:13.0182090Z %3 = arith.muli %2, %c16_i32 : i32 2026-02-21T09:14:13.0182268Z %4 = arith.subi %c32_i32, %3 : i32 2026-02-21T09:14:13.0182454Z %5 = arith.minsi %4, %c16_i32 : i32 2026-02-21T09:14:13.0182640Z %6 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T09:14:13.0182830Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:14:13.0183003Z %8 = arith.addi %3, %7 : i32 
2026-02-21T09:14:13.0183176Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:14:13.0183356Z %10 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:14:13.0183586Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:14:13.0183856Z %12 = tt.splat %10 : i32 -> tensor<128xi32> 2026-02-21T09:14:13.0184055Z %13 = arith.addi %12, %11 : tensor<128xi32> 2026-02-21T09:14:13.0184249Z %14 = arith.muli %9, %c128_i32 : i32 2026-02-21T09:14:13.0184436Z %15 = tt.splat %14 : i32 -> tensor<128xi32> 2026-02-21T09:14:13.0184637Z %16 = arith.addi %15, %11 : tensor<128xi32> 2026-02-21T09:14:13.0184838Z %c32_i32_6 = arith.constant 32 : i32 2026-02-21T09:14:13.0185160Z %17 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c32_i32_6 iter_args(%arg5 = %cst_5) -> (tensor<128x128xf32>) : i32 { 2026-02-21T09:14:13.0185590Z %19 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:14:13.0185850Z %20 = tt.splat %arg4 : i32 -> tensor<16xi32> 2026-02-21T09:14:13.0186068Z %21 = arith.addi %20, %19 : tensor<16xi32> 2026-02-21T09:14:13.0186265Z %22 = arith.muli %arg4, %c2_i32 : i32 2026-02-21T09:14:13.0186507Z %23 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:14:13.0186809Z %24 = tt.splat %22 : i32 -> tensor<32xi32> 2026-02-21T09:14:13.0187010Z %25 = arith.addi %24, %23 : tensor<32xi32> 2026-02-21T09:14:13.0187273Z %26 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:14:13.0187554Z %27 = arith.muli %26, %cst_4 : tensor<128x1xi32> 2026-02-21T09:14:13.0187827Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:14:13.0188126Z %29 = tt.broadcast %27 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:14:13.0188451Z %30 = tt.broadcast %28 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:14:13.0188704Z %31 = arith.addi %29, %30 : tensor<128x32xi32> 2026-02-21T09:14:13.0188954Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 
2026-02-21T09:14:13.0189288Z %33 = tt.addptr %32, %31 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:14:13.0189601Z %34 = tt.load %33 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:14:13.0189909Z %35 = arith.extf %34 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:14:13.0190201Z %36 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:14:13.0190468Z %37 = arith.muli %36, %cst_3 : tensor<16x1xi32> 2026-02-21T09:14:13.0190732Z %38 = tt.expand_dims %16 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:14:13.0191020Z %39 = tt.broadcast %37 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:14:13.0191286Z %40 = tt.broadcast %38 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:14:13.0191522Z %41 = arith.addi %39, %40 : tensor<16x128xi32> 2026-02-21T09:14:13.0191806Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:14:13.0192087Z %43 = tt.addptr %42, %41 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:14:13.0192410Z %44 = tt.load %43 evictionPolicy = evict_last : tensor<16x128x!tt.ptr> 2026-02-21T09:14:13.0192683Z %45 = arith.shli %44, %cst_2 : tensor<16x128xi8> 2026-02-21T09:14:13.0192909Z %46 = arith.shrsi %45, %cst_2 : tensor<16x128xi8> 2026-02-21T09:14:13.0193127Z %47 = arith.shrsi %44, %cst_2 : tensor<16x128xi8> 2026-02-21T09:14:13.0193379Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:14:13.0193673Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:14:13.0193984Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:14:13.0194307Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:14:13.0194631Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:14:13.0194921Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 
2026-02-21T09:14:13.0195167Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:14:13.0195441Z %55 = tt.broadcast %51 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:14:13.0195725Z %56 = arith.select %54, %55, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:14:13.0195996Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:14:13.0196248Z %58 = tt.broadcast %52 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:14:13.0196549Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:14:13.0196830Z %60 = arith.select %59, %58, %56 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:14:13.0197112Z %61 = tt.reshape %60 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:14:13.0197379Z %62 = arith.sitofp %61 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:14:13.0197771Z %63 = tt.dot %35, %62, %arg5, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> tensor<128x128xf32> 2026-02-21T09:14:13.0198103Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:14:13.0198301Z %64 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:14:13.0198493Z %65 = arith.addi %arg4, %64 : i32 2026-02-21T09:14:13.0198724Z %66 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:14:13.0198967Z %67 = tt.splat %65 : i32 -> tensor<16xi32> 2026-02-21T09:14:13.0199175Z %68 = arith.addi %67, %66 : tensor<16xi32> 2026-02-21T09:14:13.0199409Z %69 = arith.muli %65, %c2_i32 : i32 2026-02-21T09:14:13.0199642Z %70 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:14:13.0199895Z %71 = tt.splat %69 : i32 -> tensor<32xi32> 2026-02-21T09:14:13.0200092Z %72 = arith.addi %71, %70 : tensor<32xi32> 2026-02-21T09:14:13.0200377Z %73 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:14:13.0200650Z %74 = arith.muli %73, %cst_4 : tensor<128x1xi32> 2026-02-21T09:14:13.0200912Z %75 = tt.expand_dims %72 
{axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:14:13.0201210Z %76 = tt.broadcast %74 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:14:13.0201498Z %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:14:13.0201812Z %78 = arith.addi %76, %77 : tensor<128x32xi32> 2026-02-21T09:14:13.0202063Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:14:13.0202371Z %80 = tt.addptr %79, %78 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:14:13.0202691Z %81 = tt.load %80 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:14:13.0203010Z %82 = arith.extf %81 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:14:13.0203315Z %83 = tt.expand_dims %68 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:14:13.0203592Z %84 = arith.muli %83, %cst_3 : tensor<16x1xi32> 2026-02-21T09:14:13.0203865Z %85 = tt.expand_dims %16 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:14:13.0204164Z %86 = tt.broadcast %84 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:14:13.0204445Z %87 = tt.broadcast %85 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:14:13.0204691Z %88 = arith.addi %86, %87 : tensor<16x128xi32> 2026-02-21T09:14:13.0204942Z %89 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:14:13.0205234Z %90 = tt.addptr %89, %88 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:14:13.0205541Z %91 = tt.load %90 evictionPolicy = evict_last : tensor<16x128x!tt.ptr> 2026-02-21T09:14:13.0205825Z %92 = arith.shli %91, %cst_2 : tensor<16x128xi8> 2026-02-21T09:14:13.0206054Z %93 = arith.shrsi %92, %cst_2 : tensor<16x128xi8> 2026-02-21T09:14:13.0206289Z %94 = arith.shrsi %91, %cst_2 : tensor<16x128xi8> 2026-02-21T09:14:13.0206543Z %95 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:14:13.0206846Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 
2026-02-21T09:14:13.0207164Z %97 = tt.expand_dims %96 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:14:13.0207493Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:14:13.0207867Z %99 = tt.expand_dims %94 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:14:13.0208163Z %100 = arith.cmpi eq, %97, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:14:13.0208439Z %101 = tt.broadcast %100 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:14:13.0208736Z %102 = tt.broadcast %98 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:14:13.0209084Z %103 = arith.select %101, %102, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:14:13.0209367Z %104 = arith.cmpi eq, %97, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:14:13.0209618Z %105 = tt.broadcast %99 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:14:13.0209896Z %106 = tt.broadcast %104 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:14:13.0210185Z %107 = arith.select %106, %105, %103 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:14:13.0210482Z %108 = tt.reshape %107 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:14:13.0210787Z %109 = arith.sitofp %108 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:14:13.0211149Z %110 = tt.dot %82, %109, %63, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> tensor<128x128xf32> 2026-02-21T09:14:13.0211484Z scf.yield %110 : tensor<128x128xf32> 2026-02-21T09:14:13.0211711Z } 2026-02-21T09:14:13.0211920Z %18 = arith.truncf %17 : tensor<128x128xf32> to tensor<128x128xbf16> 2026-02-21T09:14:13.0212262Z tt.descriptor_store %0[%10, %14], %18 : !tt.tensordesc>, tensor<128x128xbf16> 2026-02-21T09:14:13.0212582Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32} 2026-02-21T09:14:13.0212798Z tt.return 2026-02-21T09:14:13.0212925Z } 2026-02-21T09:14:13.0213055Z } 2026-02-21T09:14:13.0213125Z 2026-02-21T09:14:13.0213177Z {-# 
2026-02-21T09:14:13.0213313Z external_resources: { 2026-02-21T09:14:13.0213478Z mlir_reproducer: { 2026-02-21T09:14:13.0217822Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:14:13.0222293Z disable_threading: false, 2026-02-21T09:14:13.0222480Z verify_each: true 
2026-02-21T09:14:13.0222630Z } 2026-02-21T09:14:13.0222770Z } 2026-02-21T09:14:13.0222945Z #-} 2026-02-21T09:14:13.0223376Z /tmp/torchinductor_root/z7/cz7calrzptiav2nmlvlgqwwlppa3pl7hco25am5d47nz4xwvtvmi.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:14:13.0224419Z /tmp/torchinductor_root/z7/cz7calrzptiav2nmlvlgqwwlppa3pl7hco25am5d47nz4xwvtvmi.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:14:13.0225269Z [440s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:14:13.0226487Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, None], range_num_stages=[3, 0], range_unroll_factors=[0, 2], range_warp_specializes=[False, None]), static_shapes=True) 2026-02-21T09:14:13.0227599Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:14:13.0227868Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:14:15.4583570Z 2026-02-21T09:14:15.4585006Z 2026-02-21T09:14:15.4585476Z ================================================================ 2026-02-21T09:14:15.4586025Z Internal Triton PTX codegen error 2026-02-21T09:14:15.4586255Z `ptxas` stderr: 2026-02-21T09:14:15.4586707Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 321 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:14:15.4587212Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:14:15.4587364Z 2026-02-21T09:14:15.4587765Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp7oryq3ns.ptx -o /tmp/tmp7oryq3ns.ptx.o 2026-02-21T09:14:15.4588216Z 2026-02-21T09:14:15.4588220Z 2026-02-21T09:14:15.4588286Z // 2026-02-21T09:14:15.4588447Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:14:15.4592008Z // 2026-02-21T09:14:15.4593873Z 2026-02-21T09:14:15.4594052Z .version 8.7 2026-02-21T09:14:15.4594230Z .target sm_100a 2026-02-21T09:14:15.4594390Z .address_size 64 2026-02-21T09:14:15.4594476Z 2026-02-21T09:14:15.4594651Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:14:15.4594950Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:14:15.4595167Z // @_helion_matmul_bf16_int4 2026-02-21T09:14:15.4595391Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:14:15.4595644Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:14:15.4595922Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:14:15.4596204Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:14:15.4596475Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:14:15.4596755Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:14:15.4596977Z ) 2026-02-21T09:14:15.4597107Z .reqntid 128 2026-02-21T09:14:15.4597238Z .maxnreg 32 2026-02-21T09:14:15.4597369Z { 2026-02-21T09:14:15.4597502Z .reg .pred %p<111>; 2026-02-21T09:14:15.4597648Z .reg .b16 %rs<228>; 2026-02-21T09:14:15.4597801Z .reg .b32 %r<933>; 2026-02-21T09:14:15.4597947Z .reg .b64 %rd<363>; 2026-02-21T09:14:15.4598104Z $L__func_begin0: 2026-02-21T09:14:15.4598189Z 2026-02-21T09:14:15.4598244Z // %bb.0: 2026-02-21T09:14:15.4598514Z .loc 1 19 0 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:19 2026-02-21T09:14:15.4598812Z mov.u32 %r1, %tid.x; 2026-02-21T09:14:15.4599030Z ld.param.b64 %rd18, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:14:15.4599505Z setp.lt.u32 %p1, %r1, 32; 2026-02-21T09:14:15.4599708Z ld.param.b64 %rd36, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:14:15.4599934Z mov.b32 %r81, global_smem; 2026-02-21T09:14:15.4600091Z // begin inline asm 2026-02-21T09:14:15.4600346Z @%p1 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r81], 256; 2026-02-21T09:14:15.4600595Z // end inline asm 2026-02-21T09:14:15.4600784Z ld.param.b64 %rd53, [_helion_matmul_bf16_int4_param_3]; 2026-02-21T09:14:15.4601002Z bar.sync 0; 2026-02-21T09:14:15.4601249Z ld.shared.b32 %r925, [global_smem]; 2026-02-21T09:14:15.4601429Z bar.sync 0; 2026-02-21T09:14:15.4601902Z // begin inline asm 2026-02-21T09:14:15.4602122Z @%p1 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:14:15.4602355Z // end inline asm 2026-02-21T09:14:15.4602624Z .loc 1 21 66 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:21:66 2026-02-21T09:14:15.4602929Z mov.u32 %r926, %ctaid.x; 2026-02-21T09:14:15.4603088Z mov.u32 %r98, %ctaid.y; 2026-02-21T09:14:15.4603248Z mov.u32 %r99, %ctaid.z; 2026-02-21T09:14:15.4603453Z mov.u32 %r100, %nctaid.x; 2026-02-21T09:14:15.4603620Z mov.u32 %r101, %nctaid.y; 2026-02-21T09:14:15.4603780Z mad.lo.s32 %r102, %r99, %r101, %r98; 2026-02-21T09:14:15.4603966Z mad.lo.s32 %r103, %r102, %r100, %r926; 2026-02-21T09:14:15.4604141Z shl.b32 %r104, %r103, 8; 2026-02-21T09:14:15.4604301Z cvt.s64.s32 %rd54, %r104; 2026-02-21T09:14:15.4604496Z add.s64 %rd32, %rd53, %rd54; 2026-02-21T09:14:15.4604668Z shl.b32 %r105, %r1, 2; 2026-02-21T09:14:15.4604827Z add.s32 %r82, %r81, %r105; 2026-02-21T09:14:15.4604978Z mov.b32 %r91, 0; 2026-02-21T09:14:15.4605121Z // begin inline asm 2026-02-21T09:14:15.4605272Z @%p1 st.shared.b32 [ %r82 + 0 ], %r91; 2026-02-21T09:14:15.4605448Z // end 
inline asm 2026-02-21T09:14:15.4605592Z bar.warp.sync -1; 2026-02-21T09:14:15.4605748Z setp.eq.b32 %p4, %r1, 0; 2026-02-21T09:14:15.4605904Z cvt.u64.u32 %rd17, %r81; 2026-02-21T09:14:15.4606063Z // begin inline asm 2026-02-21T09:14:15.4606320Z @%p4 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd17 + 0 ], %rd18; 2026-02-21T09:14:15.4606596Z // end inline asm 2026-02-21T09:14:15.4606737Z // begin inline asm 2026-02-21T09:14:15.4606955Z @%p4 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1; 2026-02-21T09:14:15.4607205Z // end inline asm 2026-02-21T09:14:15.4607338Z mov.b32 %r84, 128; 2026-02-21T09:14:15.4607489Z // begin inline asm 2026-02-21T09:14:15.4607734Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0, %r84; 2026-02-21T09:14:15.4608001Z // end inline asm 2026-02-21T09:14:15.4608141Z mov.b32 %r85, 16; 2026-02-21T09:14:15.4608276Z // begin inline asm 2026-02-21T09:14:15.4608509Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1, %r85; 2026-02-21T09:14:15.4608762Z // end inline asm 2026-02-21T09:14:15.4608903Z mov.b32 %r86, 8192; 2026-02-21T09:14:15.4609045Z // begin inline asm 2026-02-21T09:14:15.4609289Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0, %r86; 2026-02-21T09:14:15.4609552Z // end inline asm 2026-02-21T09:14:15.4609697Z mov.b32 %r87, 512; 2026-02-21T09:14:15.4609845Z // begin inline asm 2026-02-21T09:14:15.4610078Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1, %r87; 2026-02-21T09:14:15.4610347Z // end inline asm 2026-02-21T09:14:15.4610487Z mov.b64 %rd25, 8192; 2026-02-21T09:14:15.4610639Z // begin inline asm 2026-02-21T09:14:15.4610886Z @%p4 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd17 + 0 ], 0x0, %rd25; 2026-02-21T09:14:15.4611171Z // end inline asm 2026-02-21T09:14:15.4611309Z mov.b32 %r88, 1; 2026-02-21T09:14:15.4611441Z // begin inline asm 
2026-02-21T09:14:15.4611739Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0, %r88; 2026-02-21T09:14:15.4612015Z // end inline asm 2026-02-21T09:14:15.4612161Z // begin inline asm 2026-02-21T09:14:15.4612465Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1, %r88; 2026-02-21T09:14:15.4612755Z // end inline asm 2026-02-21T09:14:15.4612892Z // begin inline asm 2026-02-21T09:14:15.4613132Z @%p4 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0; 2026-02-21T09:14:15.4613397Z // end inline asm 2026-02-21T09:14:15.4613535Z // begin inline asm 2026-02-21T09:14:15.4613792Z @%p4 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0; 2026-02-21T09:14:15.4614104Z // end inline asm 2026-02-21T09:14:15.4614247Z // begin inline asm 2026-02-21T09:14:15.4614478Z @%p4 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x3; 2026-02-21T09:14:15.4614743Z // end inline asm 2026-02-21T09:14:15.4614879Z // begin inline asm 2026-02-21T09:14:15.4615098Z @%p4 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0; 2026-02-21T09:14:15.4615350Z // end inline asm 2026-02-21T09:14:15.4615484Z // begin inline asm 2026-02-21T09:14:15.4615851Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd32 + 0 ], [ %rd17 + 0 ], 0x80; 2026-02-21T09:14:15.4616219Z // end inline asm 2026-02-21T09:14:15.4616359Z // begin inline asm 2026-02-21T09:14:15.4616571Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd32 + 0 ], 0x80; 2026-02-21T09:14:15.4616817Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:14:15.4617041Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:14:15.4617219Z // end inline asm 2026-02-21T09:14:15.4617358Z bar.sync 0; 2026-02-21T09:14:15.4617501Z cvta.global.u64 %rd88, %rd32; 2026-02-21T09:14:15.4617789Z .loc 1 23 67 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:23:67 
2026-02-21T09:14:15.4618081Z add.s64 %rd50, %rd32, 128; 2026-02-21T09:14:15.4618240Z bar.sync 0; 2026-02-21T09:14:15.4618379Z // begin inline asm 2026-02-21T09:14:15.4618527Z @%p1 st.shared.b32 [ %r82 + 0 ], %r91; 2026-02-21T09:14:15.4618705Z // end inline asm 2026-02-21T09:14:15.4618845Z bar.warp.sync -1; 2026-02-21T09:14:15.4618995Z // begin inline asm 2026-02-21T09:14:15.4619235Z @%p4 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd17 + 0 ], %rd36; 2026-02-21T09:14:15.4619514Z // end inline asm 2026-02-21T09:14:15.4619647Z // begin inline asm 2026-02-21T09:14:15.4619875Z @%p4 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1; 2026-02-21T09:14:15.4620133Z // end inline asm 2026-02-21T09:14:15.4620269Z mov.b32 %r92, 64; 2026-02-21T09:14:15.4620413Z // begin inline asm 2026-02-21T09:14:15.4620639Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0, %r92; 2026-02-21T09:14:15.4620903Z // end inline asm 2026-02-21T09:14:15.4621035Z // begin inline asm 2026-02-21T09:14:15.4621264Z @%p4 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1, %r84; 2026-02-21T09:14:15.4621517Z // end inline asm 2026-02-21T09:14:15.4621693Z // begin inline asm 2026-02-21T09:14:15.4621934Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0, %r86; 2026-02-21T09:14:15.4622200Z // end inline asm 2026-02-21T09:14:15.4622343Z mov.b32 %r95, 4096; 2026-02-21T09:14:15.4622483Z // begin inline asm 2026-02-21T09:14:15.4622724Z @%p4 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1, %r95; 2026-02-21T09:14:15.4622986Z // end inline asm 2026-02-21T09:14:15.4623130Z mov.b64 %rd43, 16384; 2026-02-21T09:14:15.4623276Z // begin inline asm 2026-02-21T09:14:15.4623526Z @%p4 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd17 + 0 ], 0x0, %rd43; 2026-02-21T09:14:15.4623814Z // end inline asm 2026-02-21T09:14:15.4623947Z // begin inline asm 
2026-02-21T09:14:15.4624206Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0, %r88; 2026-02-21T09:14:15.4624487Z // end inline asm 2026-02-21T09:14:15.4624627Z // begin inline asm 2026-02-21T09:14:15.4624870Z @%p4 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x1, %r88; 2026-02-21T09:14:15.4625182Z // end inline asm 2026-02-21T09:14:15.4625324Z // begin inline asm 2026-02-21T09:14:15.4625583Z @%p4 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd17 + 0 ], 0xa; 2026-02-21T09:14:15.4625839Z // end inline asm 2026-02-21T09:14:15.4625981Z // begin inline asm 2026-02-21T09:14:15.4626229Z @%p4 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0; 2026-02-21T09:14:15.4626548Z // end inline asm 2026-02-21T09:14:15.4626689Z // begin inline asm 2026-02-21T09:14:15.4626923Z @%p4 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x3; 2026-02-21T09:14:15.4627190Z // end inline asm 2026-02-21T09:14:15.4627324Z // begin inline asm 2026-02-21T09:14:15.4627556Z @%p4 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd17 + 0 ], 0x0; 2026-02-21T09:14:15.4627816Z // end inline asm 2026-02-21T09:14:15.4627949Z // begin inline asm 2026-02-21T09:14:15.4628324Z @%p1 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd50 + 0 ], [ %rd17 + 0 ], 0x80; 2026-02-21T09:14:15.4628684Z // end inline asm 2026-02-21T09:14:15.4628826Z // begin inline asm 2026-02-21T09:14:15.4629031Z @%p1 fence.proxy.tensormap::generic.acquire.gpu [ %rd50 + 0 ], 0x80; 2026-02-21T09:14:15.4629288Z @%p1 cp.async.bulk.commit_group ; 2026-02-21T09:14:15.4629506Z @%p1 cp.async.bulk.wait_group.read 0 ; 2026-02-21T09:14:15.4629683Z // end inline asm 2026-02-21T09:14:15.4629822Z bar.sync 0; 2026-02-21T09:14:15.4629961Z cvta.global.u64 %rd104, %rd50; 2026-02-21T09:14:15.4630256Z .loc 1 28 111 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:28:111 2026-02-21T09:14:15.4630558Z setp.gt.u32 %p39, %r926, 2047; 2026-02-21T09:14:15.4630730Z @%p39 bra $L__BB0_9; 2026-02-21T09:14:15.4630890Z // %bb.1: // %.lr.ph 2026-02-21T09:14:15.4631199Z .loc 1 0 111 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:0:111 2026-02-21T09:14:15.4631583Z ld.param.b64 %rd16, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:14:15.4631791Z shr.u32 %r4, %r1, 5; 2026-02-21T09:14:15.4631946Z bfe.u32 %r5, %r1, 2, 5; 2026-02-21T09:14:15.4632096Z shl.b32 %r106, %r1, 3; 2026-02-21T09:14:15.4632253Z and.b32 %r6, %r106, 24; 2026-02-21T09:14:15.4632403Z and.b32 %r7, %r1, 127; 2026-02-21T09:14:15.4632555Z shl.b32 %r8, %r7, 4; 2026-02-21T09:14:15.4632701Z add.s32 %r108, %r81, %r8; 2026-02-21T09:14:15.4632866Z add.s32 %r427, %r108, 32768; 2026-02-21T09:14:15.4633025Z add.s32 %r429, %r108, 34816; 2026-02-21T09:14:15.4633185Z add.s32 %r431, %r108, 36864; 2026-02-21T09:14:15.4633342Z add.s32 %r433, %r108, 38912; 2026-02-21T09:14:15.4633498Z add.s32 %r413, %r925, 128; 2026-02-21T09:14:15.4633658Z or.b32 %r14, %r6, 32; 2026-02-21T09:14:15.4633806Z add.s32 %r285, %r108, 40960; 2026-02-21T09:14:15.4633968Z add.s32 %r287, %r108, 43008; 2026-02-21T09:14:15.4634120Z add.s32 %r289, %r108, 45056; 2026-02-21T09:14:15.4634279Z add.s32 %r291, %r108, 47104; 2026-02-21T09:14:15.4634432Z or.b32 %r19, %r6, 64; 2026-02-21T09:14:15.4634584Z add.s32 %r298, %r108, 49152; 2026-02-21T09:14:15.4634745Z add.s32 %r300, %r108, 51200; 2026-02-21T09:14:15.4634896Z add.s32 %r302, %r108, 53248; 2026-02-21T09:14:15.4635059Z add.s32 %r304, %r108, 55296; 2026-02-21T09:14:15.4635210Z shl.b32 %r24, %r7, 6; 2026-02-21T09:14:15.4635374Z mad.lo.s32 %r25, %r7, 48, %r427; 2026-02-21T09:14:15.4635545Z add.s32 %r109, %r81, 73728; 2026-02-21T09:14:15.4635708Z add.s32 %r26, %r109, %r7; 2026-02-21T09:14:15.4635865Z xor.b32 %r27, %r7, 16; 2026-02-21T09:14:15.4636028Z add.s32 %r28, %r109, %r27; 2026-02-21T09:14:15.4636188Z 
xor.b32 %r29, %r7, 32; 2026-02-21T09:14:15.4636350Z add.s32 %r30, %r109, %r29; 2026-02-21T09:14:15.4636515Z xor.b32 %r31, %r7, 48; 2026-02-21T09:14:15.4636667Z add.s32 %r32, %r109, %r31; 2026-02-21T09:14:15.4636832Z xor.b32 %r33, %r7, 64; 2026-02-21T09:14:15.4637015Z add.s32 %r34, %r109, %r33; 2026-02-21T09:14:15.4637180Z xor.b32 %r35, %r7, 80; 2026-02-21T09:14:15.4637332Z add.s32 %r36, %r109, %r35; 2026-02-21T09:14:15.4637497Z xor.b32 %r37, %r7, 96; 2026-02-21T09:14:15.4637648Z add.s32 %r38, %r109, %r37; 2026-02-21T09:14:15.4637814Z xor.b32 %r39, %r7, 112; 2026-02-21T09:14:15.4637969Z add.s32 %r40, %r109, %r39; 2026-02-21T09:14:15.4638133Z shl.b32 %r110, %r7, 7; 2026-02-21T09:14:15.4638293Z shl.b32 %r111, %r1, 4; 2026-02-21T09:14:15.4638479Z and.b32 %r112, %r111, 112; 2026-02-21T09:14:15.4638651Z or.b32 %r113, %r110, %r112; 2026-02-21T09:14:15.4638814Z add.s32 %r114, %r81, 57344; 2026-02-21T09:14:15.4638983Z add.s32 %r41, %r114, %r113; 2026-02-21T09:14:15.4639143Z xor.b32 %r115, %r113, 16; 2026-02-21T09:14:15.4639306Z add.s32 %r42, %r114, %r115; 2026-02-21T09:14:15.4639466Z xor.b32 %r116, %r113, 32; 2026-02-21T09:14:15.4639629Z add.s32 %r43, %r114, %r116; 2026-02-21T09:14:15.4639786Z xor.b32 %r117, %r113, 48; 2026-02-21T09:14:15.4639948Z add.s32 %r44, %r114, %r117; 2026-02-21T09:14:15.4640114Z xor.b32 %r118, %r113, 64; 2026-02-21T09:14:15.4640328Z add.s32 %r45, %r114, %r118; 2026-02-21T09:14:15.4640495Z xor.b32 %r119, %r113, 80; 2026-02-21T09:14:15.4640652Z add.s32 %r46, %r114, %r119; 2026-02-21T09:14:15.4640819Z xor.b32 %r120, %r113, 96; 2026-02-21T09:14:15.4640975Z add.s32 %r47, %r114, %r120; 2026-02-21T09:14:15.4641143Z xor.b32 %r121, %r113, 112; 2026-02-21T09:14:15.4641329Z add.s32 %r48, %r114, %r121; 2026-02-21T09:14:15.4641508Z bfe.u32 %r122, %r114, 4, 14; 2026-02-21T09:14:15.4641705Z cvt.u64.u32 %rd55, %r122; 2026-02-21T09:14:15.4641884Z or.b64 %rd79, %rd55, 4611686293372403712; 2026-02-21T09:14:15.4642077Z add.s32 %r123, %r81, 57376; 
2026-02-21T09:14:15.4642241Z bfe.u32 %r124, %r123, 4, 14; 2026-02-21T09:14:15.4642413Z cvt.u64.u32 %rd56, %r124; 2026-02-21T09:14:15.4642583Z or.b64 %rd80, %rd56, 4611686293372403712; 2026-02-21T09:14:15.4642777Z add.s32 %r125, %r81, 57408; 2026-02-21T09:14:15.4642942Z bfe.u32 %r126, %r125, 4, 14; 2026-02-21T09:14:15.4643112Z cvt.u64.u32 %rd57, %r126; 2026-02-21T09:14:15.4643281Z or.b64 %rd81, %rd57, 4611686293372403712; 2026-02-21T09:14:15.4643474Z add.s32 %r127, %r81, 57440; 2026-02-21T09:14:15.4643647Z bfe.u32 %r128, %r127, 4, 14; 2026-02-21T09:14:15.4643814Z cvt.u64.u32 %rd58, %r128; 2026-02-21T09:14:15.4643990Z or.b64 %rd82, %rd58, 4611686293372403712; 2026-02-21T09:14:15.4644178Z or.b32 %r49, %r6, 96; 2026-02-21T09:14:15.4644334Z add.s32 %r50, %r81, %r113; 2026-02-21T09:14:15.4644488Z add.s32 %r51, %r81, %r115; 2026-02-21T09:14:15.4644644Z add.s32 %r52, %r81, %r116; 2026-02-21T09:14:15.4644794Z add.s32 %r53, %r81, %r117; 2026-02-21T09:14:15.4644951Z add.s32 %r54, %r81, %r118; 2026-02-21T09:14:15.4645100Z add.s32 %r55, %r81, %r119; 2026-02-21T09:14:15.4645255Z add.s32 %r56, %r81, %r120; 2026-02-21T09:14:15.4645411Z add.s32 %r57, %r81, %r121; 2026-02-21T09:14:15.4645682Z .loc 1 28 111 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:28:111 2026-02-21T09:14:15.4645985Z cvt.u64.u32 %rd7, %r6; 2026-02-21T09:14:15.4646138Z and.b32 %r58, %r926, 1; 2026-02-21T09:14:15.4646296Z and.b32 %r129, %r1, 3; 2026-02-21T09:14:15.4646452Z mad.wide.u32 %rd59, %r129, 16, %rd16; 2026-02-21T09:14:15.4646638Z add.s64 %rd8, %rd59, 196864; 2026-02-21T09:14:15.4646795Z shl.b32 %r130, %r58, 7; 2026-02-21T09:14:15.4647071Z .loc 1 35 33 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:35:33 2026-02-21T09:14:15.4647375Z or.b32 %r59, %r130, %r5; 2026-02-21T09:14:15.4647527Z bra.uni $L__BB0_2; 2026-02-21T09:14:15.4647723Z $L__BB0_8: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:14:15.4648054Z .loc 1 0 33 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:0:33 
2026-02-21T09:14:15.4648352Z setp.lt.u32 %p107, %r1, 64; 2026-02-21T09:14:15.4648509Z mov.b32 %r582, 1; 2026-02-21T09:14:15.4648654Z $L__tmp0: 2026-02-21T09:14:15.4648978Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4649306Z // begin inline asm 2026-02-21T09:14:15.4649450Z 2026-02-21T09:14:15.4649566Z { 2026-02-21T09:14:15.4649699Z .reg .pred complete; 2026-02-21T09:14:15.4649842Z waitLoop: 2026-02-21T09:14:15.4650036Z mbarrier.try_wait.parity.shared.b64 complete, [%r928], %r582; 2026-02-21T09:14:15.4650270Z @!complete bra.uni waitLoop; 2026-02-21T09:14:15.4650466Z } 2026-02-21T09:14:15.4650532Z 2026-02-21T09:14:15.4650588Z // end inline asm 2026-02-21T09:14:15.4650729Z $L__tmp1: 2026-02-21T09:14:15.4650978Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4651272Z cp.async.wait_group 0; 2026-02-21T09:14:15.4651431Z bar.sync 0; 2026-02-21T09:14:15.4651596Z // begin inline asm 2026-02-21T09:14:15.4651776Z @%p4 mbarrier.inval.shared::cta.b64 [%r435]; 2026-02-21T09:14:15.4651962Z // end inline asm 2026-02-21T09:14:15.4652104Z bar.sync 0; 2026-02-21T09:14:15.4652233Z // begin inline asm 2026-02-21T09:14:15.4652433Z @%p4 mbarrier.inval.shared::cta.b64 [%r270]; 2026-02-21T09:14:15.4652628Z // end inline asm 2026-02-21T09:14:15.4652762Z bar.sync 0; 2026-02-21T09:14:15.4652904Z // begin inline asm 2026-02-21T09:14:15.4653066Z @%p4 mbarrier.inval.shared::cta.b64 [%r271]; 2026-02-21T09:14:15.4653262Z // end inline asm 2026-02-21T09:14:15.4653430Z add.s32 %r586, %r81, 79904; 2026-02-21T09:14:15.4653593Z // begin inline asm 2026-02-21T09:14:15.4653752Z @%p4 mbarrier.inval.shared::cta.b64 [%r586]; 2026-02-21T09:14:15.4653939Z // end inline asm 2026-02-21T09:14:15.4654069Z bar.sync 0; 2026-02-21T09:14:15.4654203Z // begin inline asm 2026-02-21T09:14:15.4654372Z @%p4 mbarrier.inval.shared::cta.b64 [%r268]; 2026-02-21T09:14:15.4654551Z // end inline asm 
2026-02-21T09:14:15.4654690Z $L__tmp2: 2026-02-21T09:14:15.4654978Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4655316Z // begin inline asm 2026-02-21T09:14:15.4655671Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r588, %r589, %r590, %r591, %r592, %r593, %r594, %r595, %r596, %r597, %r598, %r599, %r600, %r601, %r602, %r603}, [%r723 + 0]; 2026-02-21T09:14:15.4656053Z // end inline asm 2026-02-21T09:14:15.4656197Z // begin inline asm 2026-02-21T09:14:15.4656547Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r605, %r606, %r607, %r608, %r609, %r610, %r611, %r612, %r613, %r614, %r615, %r616, %r617, %r618, %r619, %r620}, [%r723 + 16]; 2026-02-21T09:14:15.4656928Z // end inline asm 2026-02-21T09:14:15.4657063Z // begin inline asm 2026-02-21T09:14:15.4657407Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r622, %r623, %r624, %r625, %r626, %r627, %r628, %r629, %r630, %r631, %r632, %r633, %r634, %r635, %r636, %r637}, [%r723 + 32]; 2026-02-21T09:14:15.4657775Z // end inline asm 2026-02-21T09:14:15.4657918Z // begin inline asm 2026-02-21T09:14:15.4658265Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r639, %r640, %r641, %r642, %r643, %r644, %r645, %r646, %r647, %r648, %r649, %r650, %r651, %r652, %r653, %r654}, [%r723 + 48]; 2026-02-21T09:14:15.4658632Z // end inline asm 2026-02-21T09:14:15.4658772Z // begin inline asm 2026-02-21T09:14:15.4659114Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r656, %r657, %r658, %r659, %r660, %r661, %r662, %r663, %r664, %r665, %r666, %r667, %r668, %r669, %r670, %r671}, [%r723 + 64]; 2026-02-21T09:14:15.4659490Z // end inline asm 2026-02-21T09:14:15.4659623Z // begin inline asm 2026-02-21T09:14:15.4659966Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r673, %r674, %r675, %r676, %r677, %r678, %r679, %r680, %r681, %r682, %r683, %r684, %r685, %r686, %r687, %r688}, [%r723 + 80]; 2026-02-21T09:14:15.4660335Z // end inline asm 2026-02-21T09:14:15.4660468Z // begin inline asm 2026-02-21T09:14:15.4660813Z 
tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r690, %r691, %r692, %r693, %r694, %r695, %r696, %r697, %r698, %r699, %r700, %r701, %r702, %r703, %r704, %r705}, [%r723 + 96]; 2026-02-21T09:14:15.4661209Z // end inline asm 2026-02-21T09:14:15.4661350Z // begin inline asm 2026-02-21T09:14:15.4661721Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r707, %r708, %r709, %r710, %r711, %r712, %r713, %r714, %r715, %r716, %r717, %r718, %r719, %r720, %r721, %r722}, [%r723 + 112]; 2026-02-21T09:14:15.4662102Z // end inline asm 2026-02-21T09:14:15.4662242Z // begin inline asm 2026-02-21T09:14:15.4662397Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:14:15.4662602Z // end inline asm 2026-02-21T09:14:15.4662742Z cvt.u64.u32 %rd105, %r588; 2026-02-21T09:14:15.4662906Z cvt.u64.u32 %rd106, %r589; 2026-02-21T09:14:15.4663066Z shl.b64 %rd107, %rd106, 32; 2026-02-21T09:14:15.4663238Z or.b64 %rd108, %rd105, %rd107; 2026-02-21T09:14:15.4663394Z $L__tmp3: 2026-02-21T09:14:15.4663650Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4663958Z mov.b64 {%r728, %r729}, %rd108; 2026-02-21T09:14:15.4664132Z cvt.rn.bf16x2.f32 %r730, %r729, %r728; 2026-02-21T09:14:15.4664308Z $L__tmp4: 2026-02-21T09:14:15.4664616Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4664958Z cvt.u64.u32 %rd109, %r590; 2026-02-21T09:14:15.4665117Z cvt.u64.u32 %rd110, %r591; 2026-02-21T09:14:15.4665296Z shl.b64 %rd111, %rd110, 32; 2026-02-21T09:14:15.4665488Z or.b64 %rd112, %rd109, %rd111; 2026-02-21T09:14:15.4665646Z $L__tmp5: 2026-02-21T09:14:15.4665886Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4666176Z mov.b64 {%r731, %r732}, %rd112; 2026-02-21T09:14:15.4666355Z cvt.rn.bf16x2.f32 %r733, %r732, %r731; 2026-02-21T09:14:15.4666524Z $L__tmp6: 2026-02-21T09:14:15.4666806Z .loc 2 291 36 // standard.py:291:36 @[ 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4667134Z cvt.u64.u32 %rd113, %r592; 2026-02-21T09:14:15.4667294Z cvt.u64.u32 %rd114, %r593; 2026-02-21T09:14:15.4667454Z shl.b64 %rd115, %rd114, 32; 2026-02-21T09:14:15.4667611Z or.b64 %rd116, %rd113, %rd115; 2026-02-21T09:14:15.4667774Z $L__tmp7: 2026-02-21T09:14:15.4668006Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4668303Z mov.b64 {%r734, %r735}, %rd116; 2026-02-21T09:14:15.4668477Z cvt.rn.bf16x2.f32 %r736, %r735, %r734; 2026-02-21T09:14:15.4668656Z $L__tmp8: 2026-02-21T09:14:15.4668927Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4669259Z cvt.u64.u32 %rd117, %r594; 2026-02-21T09:14:15.4669419Z cvt.u64.u32 %rd118, %r595; 2026-02-21T09:14:15.4669571Z shl.b64 %rd119, %rd118, 32; 2026-02-21T09:14:15.4669735Z or.b64 %rd120, %rd117, %rd119; 2026-02-21T09:14:15.4669886Z $L__tmp9: 2026-02-21T09:14:15.4670131Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4670416Z mov.b64 {%r737, %r738}, %rd120; 2026-02-21T09:14:15.4670591Z cvt.rn.bf16x2.f32 %r739, %r738, %r737; 2026-02-21T09:14:15.4670758Z $L__tmp10: 2026-02-21T09:14:15.4671042Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4671371Z cvt.u64.u32 %rd121, %r596; 2026-02-21T09:14:15.4671524Z cvt.u64.u32 %rd122, %r597; 2026-02-21T09:14:15.4671724Z shl.b64 %rd123, %rd122, 32; 2026-02-21T09:14:15.4671880Z or.b64 %rd124, %rd121, %rd123; 2026-02-21T09:14:15.4672038Z $L__tmp11: 2026-02-21T09:14:15.4672278Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4672568Z mov.b64 {%r740, %r741}, %rd124; 2026-02-21T09:14:15.4672743Z cvt.rn.bf16x2.f32 %r742, %r741, %r740; 2026-02-21T09:14:15.4672909Z $L__tmp12: 2026-02-21T09:14:15.4673232Z .loc 2 291 
36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4673557Z cvt.u64.u32 %rd125, %r598; 2026-02-21T09:14:15.4673718Z cvt.u64.u32 %rd126, %r599; 2026-02-21T09:14:15.4673873Z shl.b64 %rd127, %rd126, 32; 2026-02-21T09:14:15.4674036Z or.b64 %rd128, %rd125, %rd127; 2026-02-21T09:14:15.4674187Z $L__tmp13: 2026-02-21T09:14:15.4674432Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4674752Z mov.b64 {%r743, %r744}, %rd128; 2026-02-21T09:14:15.4674921Z cvt.rn.bf16x2.f32 %r745, %r744, %r743; 2026-02-21T09:14:15.4675095Z $L__tmp14: 2026-02-21T09:14:15.4675370Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4675708Z cvt.u64.u32 %rd129, %r600; 2026-02-21T09:14:15.4675864Z cvt.u64.u32 %rd130, %r601; 2026-02-21T09:14:15.4676030Z shl.b64 %rd131, %rd130, 32; 2026-02-21T09:14:15.4676185Z or.b64 %rd132, %rd129, %rd131; 2026-02-21T09:14:15.4676347Z $L__tmp15: 2026-02-21T09:14:15.4676614Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4676898Z mov.b64 {%r746, %r747}, %rd132; 2026-02-21T09:14:15.4677082Z cvt.rn.bf16x2.f32 %r748, %r747, %r746; 2026-02-21T09:14:15.4677252Z $L__tmp16: 2026-02-21T09:14:15.4677561Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4677892Z cvt.u64.u32 %rd133, %r602; 2026-02-21T09:14:15.4678053Z cvt.u64.u32 %rd134, %r603; 2026-02-21T09:14:15.4678206Z shl.b64 %rd135, %rd134, 32; 2026-02-21T09:14:15.4678370Z or.b64 %rd136, %rd133, %rd135; 2026-02-21T09:14:15.4678530Z $L__tmp17: 2026-02-21T09:14:15.4678762Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4679048Z mov.b64 {%r749, %r750}, %rd136; 2026-02-21T09:14:15.4679215Z cvt.rn.bf16x2.f32 %r751, %r750, %r749; 2026-02-21T09:14:15.4679390Z $L__tmp18: 
2026-02-21T09:14:15.4679666Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4680000Z cvt.u64.u32 %rd137, %r605; 2026-02-21T09:14:15.4680158Z cvt.u64.u32 %rd138, %r606; 2026-02-21T09:14:15.4680312Z shl.b64 %rd139, %rd138, 32; 2026-02-21T09:14:15.4680476Z or.b64 %rd140, %rd137, %rd139; 2026-02-21T09:14:15.4680629Z $L__tmp19: 2026-02-21T09:14:15.4680876Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4681173Z mov.b64 {%r752, %r753}, %rd140; 2026-02-21T09:14:15.4681354Z cvt.rn.bf16x2.f32 %r754, %r753, %r752; 2026-02-21T09:14:15.4681529Z $L__tmp20: 2026-02-21T09:14:15.4681859Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4682210Z cvt.u64.u32 %rd141, %r607; 2026-02-21T09:14:15.4682370Z cvt.u64.u32 %rd142, %r608; 2026-02-21T09:14:15.4682538Z shl.b64 %rd143, %rd142, 32; 2026-02-21T09:14:15.4682702Z or.b64 %rd144, %rd141, %rd143; 2026-02-21T09:14:15.4682868Z $L__tmp21: 2026-02-21T09:14:15.4683115Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4683421Z mov.b64 {%r755, %r756}, %rd144; 2026-02-21T09:14:15.4683596Z cvt.rn.bf16x2.f32 %r757, %r756, %r755; 2026-02-21T09:14:15.4683778Z $L__tmp22: 2026-02-21T09:14:15.4684076Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4684416Z cvt.u64.u32 %rd145, %r609; 2026-02-21T09:14:15.4684585Z cvt.u64.u32 %rd146, %r610; 2026-02-21T09:14:15.4684743Z shl.b64 %rd147, %rd146, 32; 2026-02-21T09:14:15.4684915Z or.b64 %rd148, %rd145, %rd147; 2026-02-21T09:14:15.4685111Z $L__tmp23: 2026-02-21T09:14:15.4685371Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4685670Z mov.b64 {%r758, %r759}, %rd148; 2026-02-21T09:14:15.4685863Z cvt.rn.bf16x2.f32 %r760, %r759, %r758; 
2026-02-21T09:14:15.4686048Z $L__tmp24: 2026-02-21T09:14:15.4686345Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4686744Z cvt.u64.u32 %rd149, %r611; 2026-02-21T09:14:15.4686906Z cvt.u64.u32 %rd150, %r612; 2026-02-21T09:14:15.4687078Z shl.b64 %rd151, %rd150, 32; 2026-02-21T09:14:15.4687240Z or.b64 %rd152, %rd149, %rd151; 2026-02-21T09:14:15.4687408Z $L__tmp25: 2026-02-21T09:14:15.4687654Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4687953Z mov.b64 {%r761, %r762}, %rd152; 2026-02-21T09:14:15.4688137Z cvt.rn.bf16x2.f32 %r763, %r762, %r761; 2026-02-21T09:14:15.4688313Z $L__tmp26: 2026-02-21T09:14:15.4688632Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4688963Z cvt.u64.u32 %rd153, %r613; 2026-02-21T09:14:15.4689124Z cvt.u64.u32 %rd154, %r614; 2026-02-21T09:14:15.4689278Z shl.b64 %rd155, %rd154, 32; 2026-02-21T09:14:15.4689443Z or.b64 %rd156, %rd153, %rd155; 2026-02-21T09:14:15.4689601Z $L__tmp27: 2026-02-21T09:14:15.4689861Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4690150Z mov.b64 {%r764, %r765}, %rd156; 2026-02-21T09:14:15.4690316Z cvt.rn.bf16x2.f32 %r766, %r765, %r764; 2026-02-21T09:14:15.4690490Z $L__tmp28: 2026-02-21T09:14:15.4690766Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4691096Z cvt.u64.u32 %rd157, %r615; 2026-02-21T09:14:15.4691250Z cvt.u64.u32 %rd158, %r616; 2026-02-21T09:14:15.4691410Z shl.b64 %rd159, %rd158, 32; 2026-02-21T09:14:15.4691593Z or.b64 %rd160, %rd157, %rd159; 2026-02-21T09:14:15.4691747Z $L__tmp29: 2026-02-21T09:14:15.4691984Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4692264Z mov.b64 {%r767, %r768}, %rd160; 2026-02-21T09:14:15.4692440Z 
cvt.rn.bf16x2.f32 %r769, %r768, %r767; 2026-02-21T09:14:15.4692605Z $L__tmp30: 2026-02-21T09:14:15.4692886Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4693217Z cvt.u64.u32 %rd161, %r617; 2026-02-21T09:14:15.4693368Z cvt.u64.u32 %rd162, %r618; 2026-02-21T09:14:15.4693528Z shl.b64 %rd163, %rd162, 32; 2026-02-21T09:14:15.4693682Z or.b64 %rd164, %rd161, %rd163; 2026-02-21T09:14:15.4693842Z $L__tmp31: 2026-02-21T09:14:15.4694072Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4694359Z mov.b64 {%r770, %r771}, %rd164; 2026-02-21T09:14:15.4694524Z cvt.rn.bf16x2.f32 %r772, %r771, %r770; 2026-02-21T09:14:15.4694698Z $L__tmp32: 2026-02-21T09:14:15.4694984Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4695306Z cvt.u64.u32 %rd165, %r619; 2026-02-21T09:14:15.4695467Z cvt.u64.u32 %rd166, %r620; 2026-02-21T09:14:15.4695622Z shl.b64 %rd167, %rd166, 32; 2026-02-21T09:14:15.4695787Z or.b64 %rd168, %rd165, %rd167; 2026-02-21T09:14:15.4695935Z $L__tmp33: 2026-02-21T09:14:15.4696173Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4696453Z mov.b64 {%r773, %r774}, %rd168; 2026-02-21T09:14:15.4696627Z cvt.rn.bf16x2.f32 %r775, %r774, %r773; 2026-02-21T09:14:15.4696799Z $L__tmp34: 2026-02-21T09:14:15.4697074Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4697434Z cvt.u64.u32 %rd169, %r622; 2026-02-21T09:14:15.4697592Z cvt.u64.u32 %rd170, %r623; 2026-02-21T09:14:15.4697756Z shl.b64 %rd171, %rd170, 32; 2026-02-21T09:14:15.4697913Z or.b64 %rd172, %rd169, %rd171; 2026-02-21T09:14:15.4698074Z $L__tmp35: 2026-02-21T09:14:15.4698305Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4698623Z mov.b64 {%r776, %r777}, 
%rd172; 2026-02-21T09:14:15.4698798Z cvt.rn.bf16x2.f32 %r778, %r777, %r776; 2026-02-21T09:14:15.4698963Z $L__tmp36: 2026-02-21T09:14:15.4699246Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4699569Z cvt.u64.u32 %rd173, %r624; 2026-02-21T09:14:15.4699727Z cvt.u64.u32 %rd174, %r625; 2026-02-21T09:14:15.4699878Z shl.b64 %rd175, %rd174, 32; 2026-02-21T09:14:15.4700043Z or.b64 %rd176, %rd173, %rd175; 2026-02-21T09:14:15.4700196Z $L__tmp37: 2026-02-21T09:14:15.4700462Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4700757Z mov.b64 {%r779, %r780}, %rd176; 2026-02-21T09:14:15.4700930Z cvt.rn.bf16x2.f32 %r781, %r780, %r779; 2026-02-21T09:14:15.4701108Z $L__tmp38: 2026-02-21T09:14:15.4701418Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4701818Z cvt.u64.u32 %rd177, %r626; 2026-02-21T09:14:15.4701969Z cvt.u64.u32 %rd178, %r627; 2026-02-21T09:14:15.4702130Z shl.b64 %rd179, %rd178, 32; 2026-02-21T09:14:15.4702293Z or.b64 %rd180, %rd177, %rd179; 2026-02-21T09:14:15.4702445Z $L__tmp39: 2026-02-21T09:14:15.4702687Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4702969Z mov.b64 {%r782, %r783}, %rd180; 2026-02-21T09:14:15.4703146Z cvt.rn.bf16x2.f32 %r784, %r783, %r782; 2026-02-21T09:14:15.4703313Z $L__tmp40: 2026-02-21T09:14:15.4703601Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4703923Z cvt.u64.u32 %rd181, %r628; 2026-02-21T09:14:15.4704084Z cvt.u64.u32 %rd182, %r629; 2026-02-21T09:14:15.4704244Z shl.b64 %rd183, %rd182, 32; 2026-02-21T09:14:15.4704403Z or.b64 %rd184, %rd181, %rd183; 2026-02-21T09:14:15.4704564Z $L__tmp41: 2026-02-21T09:14:15.4704797Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 
2026-02-21T09:14:15.4705089Z mov.b64 {%r785, %r786}, %rd184; 2026-02-21T09:14:15.4705257Z cvt.rn.bf16x2.f32 %r787, %r786, %r785; 2026-02-21T09:14:15.4705431Z $L__tmp42: 2026-02-21T09:14:15.4705708Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4706041Z cvt.u64.u32 %rd185, %r630; 2026-02-21T09:14:15.4706202Z cvt.u64.u32 %rd186, %r631; 2026-02-21T09:14:15.4706354Z shl.b64 %rd187, %rd186, 32; 2026-02-21T09:14:15.4706521Z or.b64 %rd188, %rd185, %rd187; 2026-02-21T09:14:15.4706669Z $L__tmp43: 2026-02-21T09:14:15.4706912Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4707195Z mov.b64 {%r788, %r789}, %rd188; 2026-02-21T09:14:15.4707370Z cvt.rn.bf16x2.f32 %r790, %r789, %r788; 2026-02-21T09:14:15.4707539Z $L__tmp44: 2026-02-21T09:14:15.4707831Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4708165Z cvt.u64.u32 %rd189, %r632; 2026-02-21T09:14:15.4708330Z cvt.u64.u32 %rd190, %r633; 2026-02-21T09:14:15.4708493Z shl.b64 %rd191, %rd190, 32; 2026-02-21T09:14:15.4708648Z or.b64 %rd192, %rd189, %rd191; 2026-02-21T09:14:15.4708807Z $L__tmp45: 2026-02-21T09:14:15.4709044Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4709372Z mov.b64 {%r791, %r792}, %rd192; 2026-02-21T09:14:15.4709549Z cvt.rn.bf16x2.f32 %r793, %r792, %r791; 2026-02-21T09:14:15.4709715Z $L__tmp46: 2026-02-21T09:14:15.4709994Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4710320Z cvt.u64.u32 %rd193, %r634; 2026-02-21T09:14:15.4710481Z cvt.u64.u32 %rd194, %r635; 2026-02-21T09:14:15.4710662Z shl.b64 %rd195, %rd194, 32; 2026-02-21T09:14:15.4710828Z or.b64 %rd196, %rd193, %rd195; 2026-02-21T09:14:15.4710981Z $L__tmp47: 2026-02-21T09:14:15.4711225Z .loc 1 94 28 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4711513Z mov.b64 {%r794, %r795}, %rd196; 2026-02-21T09:14:15.4711712Z cvt.rn.bf16x2.f32 %r796, %r795, %r794; 2026-02-21T09:14:15.4711886Z $L__tmp48: 2026-02-21T09:14:15.4712164Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4712530Z cvt.u64.u32 %rd197, %r636; 2026-02-21T09:14:15.4712686Z cvt.u64.u32 %rd198, %r637; 2026-02-21T09:14:15.4712849Z shl.b64 %rd199, %rd198, 32; 2026-02-21T09:14:15.4713006Z or.b64 %rd200, %rd197, %rd199; 2026-02-21T09:14:15.4713166Z $L__tmp49: 2026-02-21T09:14:15.4713434Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4713713Z mov.b64 {%r797, %r798}, %rd200; 2026-02-21T09:14:15.4713886Z cvt.rn.bf16x2.f32 %r799, %r798, %r797; 2026-02-21T09:14:15.4714050Z $L__tmp50: 2026-02-21T09:14:15.4714335Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4714654Z cvt.u64.u32 %rd201, %r639; 2026-02-21T09:14:15.4714816Z cvt.u64.u32 %rd202, %r640; 2026-02-21T09:14:15.4714970Z shl.b64 %rd203, %rd202, 32; 2026-02-21T09:14:15.4715136Z or.b64 %rd204, %rd201, %rd203; 2026-02-21T09:14:15.4715292Z $L__tmp51: 2026-02-21T09:14:15.4715525Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4715812Z mov.b64 {%r800, %r801}, %rd204; 2026-02-21T09:14:15.4715980Z cvt.rn.bf16x2.f32 %r802, %r801, %r800; 2026-02-21T09:14:15.4716152Z $L__tmp52: 2026-02-21T09:14:15.4716427Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4716759Z cvt.u64.u32 %rd205, %r641; 2026-02-21T09:14:15.4716919Z cvt.u64.u32 %rd206, %r642; 2026-02-21T09:14:15.4717071Z shl.b64 %rd207, %rd206, 32; 2026-02-21T09:14:15.4717235Z or.b64 %rd208, %rd205, %rd207; 2026-02-21T09:14:15.4717386Z $L__tmp53: 
2026-02-21T09:14:15.4717625Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4717907Z mov.b64 {%r803, %r804}, %rd208; 2026-02-21T09:14:15.4718084Z cvt.rn.bf16x2.f32 %r805, %r804, %r803; 2026-02-21T09:14:15.4718251Z $L__tmp54: 2026-02-21T09:14:15.4718537Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4718871Z cvt.u64.u32 %rd209, %r643; 2026-02-21T09:14:15.4719026Z cvt.u64.u32 %rd210, %r644; 2026-02-21T09:14:15.4719187Z shl.b64 %rd211, %rd210, 32; 2026-02-21T09:14:15.4719345Z or.b64 %rd212, %rd209, %rd211; 2026-02-21T09:14:15.4719508Z $L__tmp55: 2026-02-21T09:14:15.4719736Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4720021Z mov.b64 {%r806, %r807}, %rd212; 2026-02-21T09:14:15.4720187Z cvt.rn.bf16x2.f32 %r808, %r807, %r806; 2026-02-21T09:14:15.4720360Z $L__tmp56: 2026-02-21T09:14:15.4720647Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4721003Z cvt.u64.u32 %rd213, %r645; 2026-02-21T09:14:15.4721162Z cvt.u64.u32 %rd214, %r646; 2026-02-21T09:14:15.4721315Z shl.b64 %rd215, %rd214, 32; 2026-02-21T09:14:15.4721479Z or.b64 %rd216, %rd213, %rd215; 2026-02-21T09:14:15.4721676Z $L__tmp57: 2026-02-21T09:14:15.4721920Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4722205Z mov.b64 {%r809, %r810}, %rd216; 2026-02-21T09:14:15.4722412Z cvt.rn.bf16x2.f32 %r811, %r810, %r809; 2026-02-21T09:14:15.4722587Z $L__tmp58: 2026-02-21T09:14:15.4722868Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4723206Z cvt.u64.u32 %rd217, %r647; 2026-02-21T09:14:15.4723362Z cvt.u64.u32 %rd218, %r648; 2026-02-21T09:14:15.4723522Z shl.b64 %rd219, %rd218, 32; 2026-02-21T09:14:15.4723679Z or.b64 %rd220, %rd217, %rd219; 
2026-02-21T09:14:15.4723837Z $L__tmp59: 2026-02-21T09:14:15.4724087Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4724424Z mov.b64 {%r812, %r813}, %rd220; 2026-02-21T09:14:15.4724608Z cvt.rn.bf16x2.f32 %r814, %r813, %r812; 2026-02-21T09:14:15.4724782Z $L__tmp60: 2026-02-21T09:14:15.4725082Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4725448Z cvt.u64.u32 %rd221, %r649; 2026-02-21T09:14:15.4725613Z cvt.u64.u32 %rd222, %r650; 2026-02-21T09:14:15.4725771Z shl.b64 %rd223, %rd222, 32; 2026-02-21T09:14:15.4725940Z or.b64 %rd224, %rd221, %rd223; 2026-02-21T09:14:15.4726102Z $L__tmp61: 2026-02-21T09:14:15.4726345Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4726643Z mov.b64 {%r815, %r816}, %rd224; 2026-02-21T09:14:15.4726822Z cvt.rn.bf16x2.f32 %r817, %r816, %r815; 2026-02-21T09:14:15.4727004Z $L__tmp62: 2026-02-21T09:14:15.4727293Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4727639Z cvt.u64.u32 %rd225, %r651; 2026-02-21T09:14:15.4727798Z cvt.u64.u32 %rd226, %r652; 2026-02-21T09:14:15.4727963Z shl.b64 %rd227, %rd226, 32; 2026-02-21T09:14:15.4728132Z or.b64 %rd228, %rd225, %rd227; 2026-02-21T09:14:15.4728289Z $L__tmp63: 2026-02-21T09:14:15.4728537Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4728832Z mov.b64 {%r818, %r819}, %rd228; 2026-02-21T09:14:15.4729016Z cvt.rn.bf16x2.f32 %r820, %r819, %r818; 2026-02-21T09:14:15.4729189Z $L__tmp64: 2026-02-21T09:14:15.4729492Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4729841Z cvt.u64.u32 %rd229, %r653; 2026-02-21T09:14:15.4730004Z cvt.u64.u32 %rd230, %r654; 2026-02-21T09:14:15.4730172Z shl.b64 %rd231, %rd230, 32; 2026-02-21T09:14:15.4730335Z 
or.b64 %rd232, %rd229, %rd231; 2026-02-21T09:14:15.4730500Z $L__tmp65: 2026-02-21T09:14:15.4730744Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4731047Z mov.b64 {%r821, %r822}, %rd232; 2026-02-21T09:14:15.4731222Z cvt.rn.bf16x2.f32 %r823, %r822, %r821; 2026-02-21T09:14:15.4731402Z $L__tmp66: 2026-02-21T09:14:15.4731720Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4732069Z cvt.u64.u32 %rd233, %r656; 2026-02-21T09:14:15.4732238Z cvt.u64.u32 %rd234, %r657; 2026-02-21T09:14:15.4732398Z shl.b64 %rd235, %rd234, 32; 2026-02-21T09:14:15.4732569Z or.b64 %rd236, %rd233, %rd235; 2026-02-21T09:14:15.4732727Z $L__tmp67: 2026-02-21T09:14:15.4732980Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4733337Z mov.b64 {%r824, %r825}, %rd236; 2026-02-21T09:14:15.4733520Z cvt.rn.bf16x2.f32 %r826, %r825, %r824; 2026-02-21T09:14:15.4733695Z $L__tmp68: 2026-02-21T09:14:15.4733978Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4734319Z cvt.u64.u32 %rd237, %r658; 2026-02-21T09:14:15.4734472Z cvt.u64.u32 %rd238, %r659; 2026-02-21T09:14:15.4734633Z shl.b64 %rd239, %rd238, 32; 2026-02-21T09:14:15.4734816Z or.b64 %rd240, %rd237, %rd239; 2026-02-21T09:14:15.4734973Z $L__tmp69: 2026-02-21T09:14:15.4735206Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4735495Z mov.b64 {%r827, %r828}, %rd240; 2026-02-21T09:14:15.4735668Z cvt.rn.bf16x2.f32 %r829, %r828, %r827; 2026-02-21T09:14:15.4735832Z $L__tmp70: 2026-02-21T09:14:15.4736112Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4736435Z cvt.u64.u32 %rd241, %r660; 2026-02-21T09:14:15.4736625Z cvt.u64.u32 %rd242, %r661; 2026-02-21T09:14:15.4736777Z shl.b64 %rd243, %rd242, 
32; 2026-02-21T09:14:15.4736938Z or.b64 %rd244, %rd241, %rd243; 2026-02-21T09:14:15.4737089Z $L__tmp71: 2026-02-21T09:14:15.4737330Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4737650Z mov.b64 {%r830, %r831}, %rd244; 2026-02-21T09:14:15.4737820Z cvt.rn.bf16x2.f32 %r832, %r831, %r830; 2026-02-21T09:14:15.4737991Z $L__tmp72: 2026-02-21T09:14:15.4738270Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4738594Z cvt.u64.u32 %rd245, %r662; 2026-02-21T09:14:15.4738746Z cvt.u64.u32 %rd246, %r663; 2026-02-21T09:14:15.4738904Z shl.b64 %rd247, %rd246, 32; 2026-02-21T09:14:15.4739057Z or.b64 %rd248, %rd245, %rd247; 2026-02-21T09:14:15.4739217Z $L__tmp73: 2026-02-21T09:14:15.4739457Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4739736Z mov.b64 {%r833, %r834}, %rd248; 2026-02-21T09:14:15.4739911Z cvt.rn.bf16x2.f32 %r835, %r834, %r833; 2026-02-21T09:14:15.4740076Z $L__tmp74: 2026-02-21T09:14:15.4740366Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4740690Z cvt.u64.u32 %rd249, %r664; 2026-02-21T09:14:15.4740857Z cvt.u64.u32 %rd250, %r665; 2026-02-21T09:14:15.4741017Z shl.b64 %rd251, %rd250, 32; 2026-02-21T09:14:15.4741172Z or.b64 %rd252, %rd249, %rd251; 2026-02-21T09:14:15.4741331Z $L__tmp75: 2026-02-21T09:14:15.4741596Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4741884Z mov.b64 {%r836, %r837}, %rd252; 2026-02-21T09:14:15.4742052Z cvt.rn.bf16x2.f32 %r838, %r837, %r836; 2026-02-21T09:14:15.4742230Z $L__tmp76: 2026-02-21T09:14:15.4742508Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4742838Z cvt.u64.u32 %rd253, %r666; 2026-02-21T09:14:15.4742996Z cvt.u64.u32 %rd254, %r667; 
2026-02-21T09:14:15.4743150Z shl.b64 %rd255, %rd254, 32; 2026-02-21T09:14:15.4743317Z or.b64 %rd256, %rd253, %rd255; 2026-02-21T09:14:15.4743469Z $L__tmp77: 2026-02-21T09:14:15.4743711Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4743991Z mov.b64 {%r839, %r840}, %rd256; 2026-02-21T09:14:15.4744168Z cvt.rn.bf16x2.f32 %r841, %r840, %r839; 2026-02-21T09:14:15.4744338Z $L__tmp78: 2026-02-21T09:14:15.4744625Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4744949Z cvt.u64.u32 %rd257, %r668; 2026-02-21T09:14:15.4745138Z cvt.u64.u32 %rd258, %r669; 2026-02-21T09:14:15.4745297Z shl.b64 %rd259, %rd258, 32; 2026-02-21T09:14:15.4745453Z or.b64 %rd260, %rd257, %rd259; 2026-02-21T09:14:15.4745610Z $L__tmp79: 2026-02-21T09:14:15.4745844Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4746132Z mov.b64 {%r842, %r843}, %rd260; 2026-02-21T09:14:15.4746297Z cvt.rn.bf16x2.f32 %r844, %r843, %r842; 2026-02-21T09:14:15.4746468Z $L__tmp80: 2026-02-21T09:14:15.4746778Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4747100Z cvt.u64.u32 %rd261, %r670; 2026-02-21T09:14:15.4747258Z cvt.u64.u32 %rd262, %r671; 2026-02-21T09:14:15.4747411Z shl.b64 %rd263, %rd262, 32; 2026-02-21T09:14:15.4747571Z or.b64 %rd264, %rd261, %rd263; 2026-02-21T09:14:15.4747720Z $L__tmp81: 2026-02-21T09:14:15.4747961Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4748252Z mov.b64 {%r845, %r846}, %rd264; 2026-02-21T09:14:15.4748444Z cvt.rn.bf16x2.f32 %r847, %r846, %r845; 2026-02-21T09:14:15.4748621Z $L__tmp82: 2026-02-21T09:14:15.4748901Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4749231Z cvt.u64.u32 %rd265, %r673; 
2026-02-21T09:14:15.4749409Z cvt.u64.u32 %rd266, %r674; 2026-02-21T09:14:15.4749575Z shl.b64 %rd267, %rd266, 32; 2026-02-21T09:14:15.4749731Z or.b64 %rd268, %rd265, %rd267; 2026-02-21T09:14:15.4749893Z $L__tmp83: 2026-02-21T09:14:15.4750137Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4750422Z mov.b64 {%r848, %r849}, %rd268; 2026-02-21T09:14:15.4750598Z cvt.rn.bf16x2.f32 %r850, %r849, %r848; 2026-02-21T09:14:15.4750767Z $L__tmp84: 2026-02-21T09:14:15.4751059Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4751388Z cvt.u64.u32 %rd269, %r675; 2026-02-21T09:14:15.4751603Z cvt.u64.u32 %rd270, %r676; 2026-02-21T09:14:15.4751758Z shl.b64 %rd271, %rd270, 32; 2026-02-21T09:14:15.4751922Z or.b64 %rd272, %rd269, %rd271; 2026-02-21T09:14:15.4752082Z $L__tmp85: 2026-02-21T09:14:15.4752315Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4752610Z mov.b64 {%r851, %r852}, %rd272; 2026-02-21T09:14:15.4752778Z cvt.rn.bf16x2.f32 %r853, %r852, %r851; 2026-02-21T09:14:15.4752949Z $L__tmp86: 2026-02-21T09:14:15.4753228Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4753562Z cvt.u64.u32 %rd273, %r677; 2026-02-21T09:14:15.4753720Z cvt.u64.u32 %rd274, %r678; 2026-02-21T09:14:15.4753874Z shl.b64 %rd275, %rd274, 32; 2026-02-21T09:14:15.4754038Z or.b64 %rd276, %rd273, %rd275; 2026-02-21T09:14:15.4754190Z $L__tmp87: 2026-02-21T09:14:15.4754431Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4754710Z mov.b64 {%r854, %r855}, %rd276; 2026-02-21T09:14:15.4754885Z cvt.rn.bf16x2.f32 %r856, %r855, %r854; 2026-02-21T09:14:15.4755049Z $L__tmp88: 2026-02-21T09:14:15.4755335Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 
2026-02-21T09:14:15.4755670Z cvt.u64.u32 %rd277, %r679; 2026-02-21T09:14:15.4755822Z cvt.u64.u32 %rd278, %r680; 2026-02-21T09:14:15.4755980Z shl.b64 %rd279, %rd278, 32; 2026-02-21T09:14:15.4756134Z or.b64 %rd280, %rd277, %rd279; 2026-02-21T09:14:15.4756293Z $L__tmp89: 2026-02-21T09:14:15.4756526Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4756815Z mov.b64 {%r857, %r858}, %rd280; 2026-02-21T09:14:15.4757013Z cvt.rn.bf16x2.f32 %r859, %r858, %r857; 2026-02-21T09:14:15.4757184Z $L__tmp90: 2026-02-21T09:14:15.4757469Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4757793Z cvt.u64.u32 %rd281, %r681; 2026-02-21T09:14:15.4757950Z cvt.u64.u32 %rd282, %r682; 2026-02-21T09:14:15.4758099Z shl.b64 %rd283, %rd282, 32; 2026-02-21T09:14:15.4758261Z or.b64 %rd284, %rd281, %rd283; 2026-02-21T09:14:15.4758440Z $L__tmp91: 2026-02-21T09:14:15.4758680Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4758965Z mov.b64 {%r860, %r861}, %rd284; 2026-02-21T09:14:15.4759140Z cvt.rn.bf16x2.f32 %r862, %r861, %r860; 2026-02-21T09:14:15.4759314Z $L__tmp92: 2026-02-21T09:14:15.4759591Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4759926Z cvt.u64.u32 %rd285, %r683; 2026-02-21T09:14:15.4760080Z cvt.u64.u32 %rd286, %r684; 2026-02-21T09:14:15.4760268Z shl.b64 %rd287, %rd286, 32; 2026-02-21T09:14:15.4760427Z or.b64 %rd288, %rd285, %rd287; 2026-02-21T09:14:15.4760591Z $L__tmp93: 2026-02-21T09:14:15.4760829Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4761128Z mov.b64 {%r863, %r864}, %rd288; 2026-02-21T09:14:15.4761331Z cvt.rn.bf16x2.f32 %r865, %r864, %r863; 2026-02-21T09:14:15.4761503Z $L__tmp94: 2026-02-21T09:14:15.4761823Z .loc 2 291 36 // standard.py:291:36 @[ 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4762159Z cvt.u64.u32 %rd289, %r685; 2026-02-21T09:14:15.4762319Z cvt.u64.u32 %rd290, %r686; 2026-02-21T09:14:15.4762471Z shl.b64 %rd291, %rd290, 32; 2026-02-21T09:14:15.4762633Z or.b64 %rd292, %rd289, %rd291; 2026-02-21T09:14:15.4762791Z $L__tmp95: 2026-02-21T09:14:15.4763026Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4763317Z mov.b64 {%r866, %r867}, %rd292; 2026-02-21T09:14:15.4763487Z cvt.rn.bf16x2.f32 %r868, %r867, %r866; 2026-02-21T09:14:15.4763662Z $L__tmp96: 2026-02-21T09:14:15.4763943Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4764277Z cvt.u64.u32 %rd293, %r687; 2026-02-21T09:14:15.4764434Z cvt.u64.u32 %rd294, %r688; 2026-02-21T09:14:15.4764594Z shl.b64 %rd295, %rd294, 32; 2026-02-21T09:14:15.4764757Z or.b64 %rd296, %rd293, %rd295; 2026-02-21T09:14:15.4764909Z $L__tmp97: 2026-02-21T09:14:15.4765152Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4765437Z mov.b64 {%r869, %r870}, %rd296; 2026-02-21T09:14:15.4765612Z cvt.rn.bf16x2.f32 %r871, %r870, %r869; 2026-02-21T09:14:15.4765779Z $L__tmp98: 2026-02-21T09:14:15.4766065Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4766395Z cvt.u64.u32 %rd297, %r690; 2026-02-21T09:14:15.4766556Z cvt.u64.u32 %rd298, %r691; 2026-02-21T09:14:15.4766716Z shl.b64 %rd299, %rd298, 32; 2026-02-21T09:14:15.4766872Z or.b64 %rd300, %rd297, %rd299; 2026-02-21T09:14:15.4767030Z $L__tmp99: 2026-02-21T09:14:15.4767267Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4767560Z mov.b64 {%r872, %r873}, %rd300; 2026-02-21T09:14:15.4767726Z cvt.rn.bf16x2.f32 %r874, %r873, %r872; 2026-02-21T09:14:15.4767901Z $L__tmp100: 2026-02-21T09:14:15.4768198Z .loc 2 
291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4768546Z cvt.u64.u32 %rd301, %r692; 2026-02-21T09:14:15.4768713Z cvt.u64.u32 %rd302, %r693; 2026-02-21T09:14:15.4768871Z shl.b64 %rd303, %rd302, 32; 2026-02-21T09:14:15.4769073Z or.b64 %rd304, %rd301, %rd303; 2026-02-21T09:14:15.4769231Z $L__tmp101: 2026-02-21T09:14:15.4769488Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4769785Z mov.b64 {%r875, %r876}, %rd304; 2026-02-21T09:14:15.4769968Z cvt.rn.bf16x2.f32 %r877, %r876, %r875; 2026-02-21T09:14:15.4770150Z $L__tmp102: 2026-02-21T09:14:15.4770449Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4770824Z cvt.u64.u32 %rd305, %r694; 2026-02-21T09:14:15.4770989Z cvt.u64.u32 %rd306, %r695; 2026-02-21T09:14:15.4771163Z shl.b64 %rd307, %rd306, 32; 2026-02-21T09:14:15.4771331Z or.b64 %rd308, %rd305, %rd307; 2026-02-21T09:14:15.4771503Z $L__tmp103: 2026-02-21T09:14:15.4771775Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4772085Z mov.b64 {%r878, %r879}, %rd308; 2026-02-21T09:14:15.4772274Z cvt.rn.bf16x2.f32 %r880, %r879, %r878; 2026-02-21T09:14:15.4772478Z $L__tmp104: 2026-02-21T09:14:15.4772789Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4773137Z cvt.u64.u32 %rd309, %r696; 2026-02-21T09:14:15.4773307Z cvt.u64.u32 %rd310, %r697; 2026-02-21T09:14:15.4773520Z shl.b64 %rd311, %rd310, 32; 2026-02-21T09:14:15.4773697Z or.b64 %rd312, %rd309, %rd311; 2026-02-21T09:14:15.4773856Z $L__tmp105: 2026-02-21T09:14:15.4774110Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4774414Z mov.b64 {%r881, %r882}, %rd312; 2026-02-21T09:14:15.4774590Z cvt.rn.bf16x2.f32 %r883, %r882, %r881; 2026-02-21T09:14:15.4774773Z 
$L__tmp106: 2026-02-21T09:14:15.4775065Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4775416Z cvt.u64.u32 %rd313, %r698; 2026-02-21T09:14:15.4775577Z cvt.u64.u32 %rd314, %r699; 2026-02-21T09:14:15.4775745Z shl.b64 %rd315, %rd314, 32; 2026-02-21T09:14:15.4775909Z or.b64 %rd316, %rd313, %rd315; 2026-02-21T09:14:15.4776085Z $L__tmp107: 2026-02-21T09:14:15.4776328Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4776613Z mov.b64 {%r884, %r885}, %rd316; 2026-02-21T09:14:15.4776788Z cvt.rn.bf16x2.f32 %r886, %r885, %r884; 2026-02-21T09:14:15.4776956Z $L__tmp108: 2026-02-21T09:14:15.4777243Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4777568Z cvt.u64.u32 %rd317, %r700; 2026-02-21T09:14:15.4777730Z cvt.u64.u32 %rd318, %r701; 2026-02-21T09:14:15.4777890Z shl.b64 %rd319, %rd318, 32; 2026-02-21T09:14:15.4778046Z or.b64 %rd320, %rd317, %rd319; 2026-02-21T09:14:15.4778203Z $L__tmp109: 2026-02-21T09:14:15.4778441Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4778730Z mov.b64 {%r887, %r888}, %rd320; 2026-02-21T09:14:15.4778895Z cvt.rn.bf16x2.f32 %r889, %r888, %r887; 2026-02-21T09:14:15.4779068Z $L__tmp110: 2026-02-21T09:14:15.4779350Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4779681Z cvt.u64.u32 %rd321, %r702; 2026-02-21T09:14:15.4779840Z cvt.u64.u32 %rd322, %r703; 2026-02-21T09:14:15.4779990Z shl.b64 %rd323, %rd322, 32; 2026-02-21T09:14:15.4780149Z or.b64 %rd324, %rd321, %rd323; 2026-02-21T09:14:15.4780299Z $L__tmp111: 2026-02-21T09:14:15.4780542Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4780829Z mov.b64 {%r890, %r891}, %rd324; 2026-02-21T09:14:15.4781004Z cvt.rn.bf16x2.f32 %r892, 
%r891, %r890; 2026-02-21T09:14:15.4781197Z $L__tmp112: 2026-02-21T09:14:15.4781481Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4781843Z cvt.u64.u32 %rd325, %r704; 2026-02-21T09:14:15.4781995Z cvt.u64.u32 %rd326, %r705; 2026-02-21T09:14:15.4782157Z shl.b64 %rd327, %rd326, 32; 2026-02-21T09:14:15.4782313Z or.b64 %rd328, %rd325, %rd327; 2026-02-21T09:14:15.4782472Z $L__tmp113: 2026-02-21T09:14:15.4782735Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4783025Z mov.b64 {%r893, %r894}, %rd328; 2026-02-21T09:14:15.4783194Z cvt.rn.bf16x2.f32 %r895, %r894, %r893; 2026-02-21T09:14:15.4783372Z $L__tmp114: 2026-02-21T09:14:15.4783666Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4783998Z cvt.u64.u32 %rd329, %r707; 2026-02-21T09:14:15.4784160Z cvt.u64.u32 %rd330, %r708; 2026-02-21T09:14:15.4784315Z shl.b64 %rd331, %rd330, 32; 2026-02-21T09:14:15.4784506Z or.b64 %rd332, %rd329, %rd331; 2026-02-21T09:14:15.4784658Z $L__tmp115: 2026-02-21T09:14:15.4784897Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4785190Z mov.b64 {%r896, %r897}, %rd332; 2026-02-21T09:14:15.4785357Z cvt.rn.bf16x2.f32 %r898, %r897, %r896; 2026-02-21T09:14:15.4785565Z $L__tmp116: 2026-02-21T09:14:15.4785845Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4786174Z cvt.u64.u32 %rd333, %r709; 2026-02-21T09:14:15.4786325Z cvt.u64.u32 %rd334, %r710; 2026-02-21T09:14:15.4786483Z shl.b64 %rd335, %rd334, 32; 2026-02-21T09:14:15.4786638Z or.b64 %rd336, %rd333, %rd335; 2026-02-21T09:14:15.4786796Z $L__tmp117: 2026-02-21T09:14:15.4787032Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4787316Z mov.b64 {%r899, %r900}, %rd336; 
2026-02-21T09:14:15.4787491Z cvt.rn.bf16x2.f32 %r901, %r900, %r899; 2026-02-21T09:14:15.4787654Z $L__tmp118: 2026-02-21T09:14:15.4787938Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4788259Z cvt.u64.u32 %rd337, %r711; 2026-02-21T09:14:15.4788421Z cvt.u64.u32 %rd338, %r712; 2026-02-21T09:14:15.4788575Z shl.b64 %rd339, %rd338, 32; 2026-02-21T09:14:15.4788739Z or.b64 %rd340, %rd337, %rd339; 2026-02-21T09:14:15.4788894Z $L__tmp119: 2026-02-21T09:14:15.4789127Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4789413Z mov.b64 {%r902, %r903}, %rd340; 2026-02-21T09:14:15.4789578Z cvt.rn.bf16x2.f32 %r904, %r903, %r902; 2026-02-21T09:14:15.4789747Z $L__tmp120: 2026-02-21T09:14:15.4790020Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4790349Z cvt.u64.u32 %rd341, %r713; 2026-02-21T09:14:15.4790498Z cvt.u64.u32 %rd342, %r714; 2026-02-21T09:14:15.4790659Z shl.b64 %rd343, %rd342, 32; 2026-02-21T09:14:15.4790821Z or.b64 %rd344, %rd341, %rd343; 2026-02-21T09:14:15.4790972Z $L__tmp121: 2026-02-21T09:14:15.4791208Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4791489Z mov.b64 {%r905, %r906}, %rd344; 2026-02-21T09:14:15.4791705Z cvt.rn.bf16x2.f32 %r907, %r906, %r905; 2026-02-21T09:14:15.4791872Z $L__tmp122: 2026-02-21T09:14:15.4792159Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4792496Z cvt.u64.u32 %rd345, %r715; 2026-02-21T09:14:15.4792651Z cvt.u64.u32 %rd346, %r716; 2026-02-21T09:14:15.4792813Z shl.b64 %rd347, %rd346, 32; 2026-02-21T09:14:15.4793002Z or.b64 %rd348, %rd345, %rd347; 2026-02-21T09:14:15.4793163Z $L__tmp123: 2026-02-21T09:14:15.4793403Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 
2026-02-21T09:14:15.4793696Z mov.b64 {%r908, %r909}, %rd348; 2026-02-21T09:14:15.4793868Z cvt.rn.bf16x2.f32 %r910, %r909, %r908; 2026-02-21T09:14:15.4794051Z $L__tmp124: 2026-02-21T09:14:15.4794345Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4794707Z cvt.u64.u32 %rd349, %r717; 2026-02-21T09:14:15.4794868Z cvt.u64.u32 %rd350, %r718; 2026-02-21T09:14:15.4795020Z shl.b64 %rd351, %rd350, 32; 2026-02-21T09:14:15.4795183Z or.b64 %rd352, %rd349, %rd351; 2026-02-21T09:14:15.4795333Z $L__tmp125: 2026-02-21T09:14:15.4795570Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4795858Z mov.b64 {%r911, %r912}, %rd352; 2026-02-21T09:14:15.4796035Z cvt.rn.bf16x2.f32 %r913, %r912, %r911; 2026-02-21T09:14:15.4796209Z $L__tmp126: 2026-02-21T09:14:15.4796513Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4796851Z cvt.u64.u32 %rd353, %r719; 2026-02-21T09:14:15.4797004Z cvt.u64.u32 %rd354, %r720; 2026-02-21T09:14:15.4797166Z shl.b64 %rd355, %rd354, 32; 2026-02-21T09:14:15.4797348Z or.b64 %rd356, %rd353, %rd355; 2026-02-21T09:14:15.4797510Z $L__tmp127: 2026-02-21T09:14:15.4797747Z .loc 1 94 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4798039Z mov.b64 {%r914, %r915}, %rd356; 2026-02-21T09:14:15.4798212Z cvt.rn.bf16x2.f32 %r916, %r915, %r914; 2026-02-21T09:14:15.4798376Z $L__tmp128: 2026-02-21T09:14:15.4798659Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4798985Z cvt.u64.u32 %rd357, %r721; 2026-02-21T09:14:15.4799146Z cvt.u64.u32 %rd358, %r722; 2026-02-21T09:14:15.4799299Z shl.b64 %rd359, %rd358, 32; 2026-02-21T09:14:15.4799460Z or.b64 %rd360, %rd357, %rd359; 2026-02-21T09:14:15.4799611Z $L__tmp129: 2026-02-21T09:14:15.4799849Z .loc 1 94 28 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:94:28 2026-02-21T09:14:15.4800132Z mov.b64 {%r917, %r918}, %rd360; 2026-02-21T09:14:15.4800298Z cvt.rn.bf16x2.f32 %r919, %r918, %r917; 2026-02-21T09:14:15.4800586Z .loc 1 95 43 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:95:43 2026-02-21T09:14:15.4800880Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:14:15.4801056Z bar.sync 0; 2026-02-21T09:14:15.4801225Z st.shared.v4.b32 [%r50], {%r730, %r733, %r736, %r739}; 2026-02-21T09:14:15.4801472Z st.shared.v4.b32 [%r50+16384], {%r826, %r829, %r832, %r835}; 2026-02-21T09:14:15.4801740Z st.shared.v4.b32 [%r51], {%r742, %r745, %r748, %r751}; 2026-02-21T09:14:15.4801976Z st.shared.v4.b32 [%r51+16384], {%r838, %r841, %r844, %r847}; 2026-02-21T09:14:15.4802220Z st.shared.v4.b32 [%r52], {%r754, %r757, %r760, %r763}; 2026-02-21T09:14:15.4802451Z st.shared.v4.b32 [%r52+16384], {%r850, %r853, %r856, %r859}; 2026-02-21T09:14:15.4802690Z st.shared.v4.b32 [%r53], {%r766, %r769, %r772, %r775}; 2026-02-21T09:14:15.4802919Z st.shared.v4.b32 [%r53+16384], {%r862, %r865, %r868, %r871}; 2026-02-21T09:14:15.4803158Z st.shared.v4.b32 [%r54], {%r778, %r781, %r784, %r787}; 2026-02-21T09:14:15.4803396Z st.shared.v4.b32 [%r54+16384], {%r874, %r877, %r880, %r883}; 2026-02-21T09:14:15.4803627Z st.shared.v4.b32 [%r55], {%r790, %r793, %r796, %r799}; 2026-02-21T09:14:15.4803866Z st.shared.v4.b32 [%r55+16384], {%r886, %r889, %r892, %r895}; 2026-02-21T09:14:15.4804099Z st.shared.v4.b32 [%r56], {%r802, %r805, %r808, %r811}; 2026-02-21T09:14:15.4804335Z st.shared.v4.b32 [%r56+16384], {%r898, %r901, %r904, %r907}; 2026-02-21T09:14:15.4804567Z st.shared.v4.b32 [%r57], {%r814, %r817, %r820, %r823}; 2026-02-21T09:14:15.4804832Z st.shared.v4.b32 [%r57+16384], {%r910, %r913, %r916, %r919}; 2026-02-21T09:14:15.4805047Z // begin inline asm 2026-02-21T09:14:15.4805213Z fence.proxy.async.shared::cta; 2026-02-21T09:14:15.4805397Z // end inline asm 2026-02-21T09:14:15.4805537Z bar.sync 0; 
2026-02-21T09:14:15.4805685Z elect.sync %r920|%p108, -1; 2026-02-21T09:14:15.4805849Z and.pred %p106, %p107, %p108; 2026-02-21T09:14:15.4806020Z and.b32 %r921, %r63, 1; 2026-02-21T09:14:15.4806198Z shl.b32 %r922, %r921, 14; 2026-02-21T09:14:15.4806359Z add.s32 %r726, %r81, %r922; 2026-02-21T09:14:15.4806512Z and.b32 %r923, %r926, 126; 2026-02-21T09:14:15.4806668Z or.b32 %r924, %r921, %r923; 2026-02-21T09:14:15.4806826Z shl.b32 %r724, %r924, 6; 2026-02-21T09:14:15.4806973Z // begin inline asm 2026-02-21T09:14:15.4807246Z @%p106 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd104, {%r724, %r725}], [%r726]; 2026-02-21T09:14:15.4807536Z // end inline asm 2026-02-21T09:14:15.4807692Z cp.async.bulk.commit_group; 2026-02-21T09:14:15.4807991Z .loc 1 28 111 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:28:111 2026-02-21T09:14:15.4808293Z add.s32 %r80, %r926, 296; 2026-02-21T09:14:15.4808454Z setp.lt.u32 %p109, %r926, 1752; 2026-02-21T09:14:15.4808624Z mov.b32 %r926, %r80; 2026-02-21T09:14:15.4808684Z @%p109 bra $L__BB0_2; 2026-02-21T09:14:15.4808750Z bra.uni $L__BB0_9; 2026-02-21T09:14:15.4808878Z $L__BB0_2: // =>This Loop Header: Depth=1 2026-02-21T09:14:15.4808971Z // Child Loop BB0_5 Depth 2 2026-02-21T09:14:15.4809149Z .loc 1 35 33 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:35:33 2026-02-21T09:14:15.4809211Z cvt.u16.u32 %rs1, %r926; 2026-02-21T09:14:15.4809272Z shr.u16 %rs2, %rs1, 7; 2026-02-21T09:14:15.4809332Z and.b16 %rs3, %rs2, 15; 2026-02-21T09:14:15.4809405Z mul.wide.u16 %r347, %rs3, 256; 2026-02-21T09:14:15.4809466Z or.b32 %r348, %r59, %r347; 2026-02-21T09:14:15.4809524Z shr.u32 %r349, %r926, 6; 2026-02-21T09:14:15.4809592Z and.b32 %r350, %r349, 30; 2026-02-21T09:14:15.4809763Z .loc 1 37 30 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:37:30 2026-02-21T09:14:15.4809820Z or.b32 %r351, %r350, %r58; 2026-02-21T09:14:15.4809996Z .loc 1 39 27 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:39:27 
2026-02-21T09:14:15.4810055Z shl.b32 %r725, %r351, 7; 2026-02-21T09:14:15.4810227Z .loc 1 40 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:40:32 2026-02-21T09:14:15.4810289Z or.b32 %r352, %r725, %r5; 2026-02-21T09:14:15.4810469Z .loc 1 41 27 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:41:27 2026-02-21T09:14:15.4810527Z shl.b32 %r353, %r926, 6; 2026-02-21T09:14:15.4810588Z and.b32 %r437, %r353, 8064; 2026-02-21T09:14:15.4810766Z .loc 1 55 53 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:53 2026-02-21T09:14:15.4810829Z shl.b32 %r354, %r352, 10; 2026-02-21T09:14:15.4810884Z $L__tmp130: 2026-02-21T09:14:15.4811118Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4811194Z shfl.sync.idx.b32 %r63, %r4, 0, 31, -1; 2026-02-21T09:14:15.4811253Z shl.b32 %r355, %r63, 21; 2026-02-21T09:14:15.4811319Z and.b32 %r356, %r355, 6291456; 2026-02-21T09:14:15.4811390Z add.s32 %r723, %r356, %r925; 2026-02-21T09:14:15.4811453Z mov.pred %p68, -1; 2026-02-21T09:14:15.4811509Z mov.b32 %r927, 0; 2026-02-21T09:14:15.4811616Z // begin inline asm 2026-02-21T09:14:15.4811917Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r723 + 0], {%r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927}; 2026-02-21T09:14:15.4811977Z // end inline asm 2026-02-21T09:14:15.4812043Z // begin inline asm 2026-02-21T09:14:15.4812362Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r723 + 16], {%r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927}; 2026-02-21T09:14:15.4812419Z // end inline asm 2026-02-21T09:14:15.4812476Z // begin inline asm 2026-02-21T09:14:15.4812769Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r723 + 32], {%r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927}; 2026-02-21T09:14:15.4812862Z // end inline asm 
2026-02-21T09:14:15.4812920Z // begin inline asm 2026-02-21T09:14:15.4813209Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r723 + 48], {%r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927}; 2026-02-21T09:14:15.4813266Z // end inline asm 2026-02-21T09:14:15.4813325Z // begin inline asm 2026-02-21T09:14:15.4813609Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r723 + 64], {%r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927}; 2026-02-21T09:14:15.4813711Z // end inline asm 2026-02-21T09:14:15.4813770Z // begin inline asm 2026-02-21T09:14:15.4814051Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r723 + 80], {%r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927}; 2026-02-21T09:14:15.4814110Z // end inline asm 2026-02-21T09:14:15.4814192Z // begin inline asm 2026-02-21T09:14:15.4814473Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r723 + 96], {%r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927}; 2026-02-21T09:14:15.4814539Z // end inline asm 2026-02-21T09:14:15.4814599Z // begin inline asm 2026-02-21T09:14:15.4814888Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r723 + 112], {%r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927, %r927}; 2026-02-21T09:14:15.4814955Z // end inline asm 2026-02-21T09:14:15.4815015Z // begin inline asm 2026-02-21T09:14:15.4815091Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:14:15.4815160Z // end inline asm 2026-02-21T09:14:15.4815219Z bar.sync 0; 2026-02-21T09:14:15.4815277Z $L__tmp131: 2026-02-21T09:14:15.4815466Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4815541Z add.s32 %r928, %r81, 79904; 2026-02-21T09:14:15.4815603Z // begin inline asm 2026-02-21T09:14:15.4815698Z @%p4 
mbarrier.init.shared::cta.b64 [%r928], 1; 2026-02-21T09:14:15.4815769Z // end inline asm 2026-02-21T09:14:15.4815828Z bar.sync 0; 2026-02-21T09:14:15.4815890Z add.s32 %r268, %r81, 79912; 2026-02-21T09:14:15.4815955Z // begin inline asm 2026-02-21T09:14:15.4816042Z @%p4 mbarrier.init.shared::cta.b64 [%r268], 1; 2026-02-21T09:14:15.4816100Z // end inline asm 2026-02-21T09:14:15.4816161Z add.s32 %r435, %r81, 79872; 2026-02-21T09:14:15.4816227Z // begin inline asm 2026-02-21T09:14:15.4816311Z @%p4 mbarrier.init.shared::cta.b64 [%r435], 1; 2026-02-21T09:14:15.4816367Z // end inline asm 2026-02-21T09:14:15.4816431Z bar.sync 0; 2026-02-21T09:14:15.4816490Z add.s32 %r270, %r81, 79880; 2026-02-21T09:14:15.4816548Z // begin inline asm 2026-02-21T09:14:15.4816629Z @%p4 mbarrier.init.shared::cta.b64 [%r270], 1; 2026-02-21T09:14:15.4816696Z // end inline asm 2026-02-21T09:14:15.4816751Z bar.sync 0; 2026-02-21T09:14:15.4816812Z add.s32 %r271, %r81, 79888; 2026-02-21T09:14:15.4816880Z // begin inline asm 2026-02-21T09:14:15.4816962Z @%p4 mbarrier.init.shared::cta.b64 [%r271], 1; 2026-02-21T09:14:15.4817018Z // end inline asm 2026-02-21T09:14:15.4817202Z .loc 1 55 60 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:60 2026-02-21T09:14:15.4817264Z or.b32 %r358, %r354, %r6; 2026-02-21T09:14:15.4817434Z .loc 1 55 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:32 2026-02-21T09:14:15.4817536Z mad.wide.u32 %rd60, %r358, 2, %rd16; 2026-02-21T09:14:15.4817609Z cvt.u64.u32 %rd10, %r354; 2026-02-21T09:14:15.4817672Z or.b64 %rd75, %rd10, %rd7; 2026-02-21T09:14:15.4817733Z shl.b64 %rd76, %rd75, 1; 2026-02-21T09:14:15.4817804Z add.s64 %rd11, %rd16, %rd76; 2026-02-21T09:14:15.4817879Z add.s64 %rd61, %rd11, 65536; 2026-02-21T09:14:15.4817940Z add.s64 %rd62, %rd11, 131072; 2026-02-21T09:14:15.4818000Z add.s64 %rd63, %rd11, 196608; 2026-02-21T09:14:15.4818063Z mov.b32 %r428, 16; 2026-02-21T09:14:15.4818251Z .loc 1 55 80 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:80 2026-02-21T09:14:15.4818308Z // begin inline asm 2026-02-21T09:14:15.4818437Z cp.async.cg.shared.global [ %r427 + 0 ], [ %rd60 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4818493Z // end inline asm 2026-02-21T09:14:15.4818549Z // begin inline asm 2026-02-21T09:14:15.4818673Z cp.async.cg.shared.global [ %r429 + 0 ], [ %rd61 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4818727Z // end inline asm 2026-02-21T09:14:15.4818784Z // begin inline asm 2026-02-21T09:14:15.4818919Z cp.async.cg.shared.global [ %r431 + 0 ], [ %rd62 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4818983Z // end inline asm 2026-02-21T09:14:15.4819039Z // begin inline asm 2026-02-21T09:14:15.4819147Z cp.async.cg.shared.global [ %r433 + 0 ], [ %rd63 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4819209Z // end inline asm 2026-02-21T09:14:15.4819272Z cp.async.commit_group; 2026-02-21T09:14:15.4819462Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4819524Z bar.sync 0; 2026-02-21T09:14:15.4819580Z // begin inline asm 2026-02-21T09:14:15.4819685Z @%p4 mbarrier.arrive.expect_tx.shared.b64 _, [%r435], 2048; 2026-02-21T09:14:15.4819740Z // end inline asm 2026-02-21T09:14:15.4819908Z .loc 1 61 33 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:61:33 2026-02-21T09:14:15.4819961Z bar.sync 0; 2026-02-21T09:14:15.4820028Z elect.sync %r359|%p62, -1; 2026-02-21T09:14:15.4820099Z and.pred %p54, %p1, %p62; 2026-02-21T09:14:15.4820154Z // begin inline asm 2026-02-21T09:14:15.4820391Z @%p54 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r109], [%rd88, {%r437, %r927}], [%r435]; 2026-02-21T09:14:15.4820452Z // end inline asm 2026-02-21T09:14:15.4820610Z .loc 1 55 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:32 2026-02-21T09:14:15.4820672Z add.s64 %rd65, %rd11, 64; 2026-02-21T09:14:15.4820731Z or.b32 %r360, %r354, %r14; 2026-02-21T09:14:15.4820806Z mad.wide.u32 
%rd77, %r360, 2, %rd16; 2026-02-21T09:14:15.4820865Z add.s64 %rd66, %rd77, 65536; 2026-02-21T09:14:15.4820925Z add.s64 %rd67, %rd77, 131072; 2026-02-21T09:14:15.4820991Z add.s64 %rd68, %rd77, 196608; 2026-02-21T09:14:15.4821151Z .loc 1 55 80 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:80 2026-02-21T09:14:15.4821207Z // begin inline asm 2026-02-21T09:14:15.4821317Z cp.async.cg.shared.global [ %r285 + 0 ], [ %rd65 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4821379Z // end inline asm 2026-02-21T09:14:15.4821435Z // begin inline asm 2026-02-21T09:14:15.4821567Z cp.async.cg.shared.global [ %r287 + 0 ], [ %rd66 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4821633Z // end inline asm 2026-02-21T09:14:15.4821690Z // begin inline asm 2026-02-21T09:14:15.4821799Z cp.async.cg.shared.global [ %r289 + 0 ], [ %rd67 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4821862Z // end inline asm 2026-02-21T09:14:15.4821919Z // begin inline asm 2026-02-21T09:14:15.4822026Z cp.async.cg.shared.global [ %r291 + 0 ], [ %rd68 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4822082Z // end inline asm 2026-02-21T09:14:15.4822154Z cp.async.commit_group; 2026-02-21T09:14:15.4822323Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4822377Z bar.sync 0; 2026-02-21T09:14:15.4822440Z // begin inline asm 2026-02-21T09:14:15.4822585Z @%p4 mbarrier.arrive.expect_tx.shared.b64 _, [%r270], 2048; 2026-02-21T09:14:15.4822640Z // end inline asm 2026-02-21T09:14:15.4822812Z .loc 1 61 33 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:61:33 2026-02-21T09:14:15.4822868Z bar.sync 0; 2026-02-21T09:14:15.4822933Z elect.sync %r361|%p63, -1; 2026-02-21T09:14:15.4822998Z and.pred %p56, %p1, %p63; 2026-02-21T09:14:15.4823066Z add.s32 %r294, %r81, 75776; 2026-02-21T09:14:15.4823149Z // begin inline asm 2026-02-21T09:14:15.4823387Z @%p56 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r294], [%rd88, {%r437, %r428}], [%r270]; 
2026-02-21T09:14:15.4823450Z // end inline asm 2026-02-21T09:14:15.4823617Z .loc 1 55 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:32 2026-02-21T09:14:15.4823676Z add.s64 %rd70, %rd11, 128; 2026-02-21T09:14:15.4823744Z or.b32 %r362, %r354, %r19; 2026-02-21T09:14:15.4823810Z mad.wide.u32 %rd78, %r362, 2, %rd16; 2026-02-21T09:14:15.4823873Z add.s64 %rd71, %rd78, 65536; 2026-02-21T09:14:15.4823959Z add.s64 %rd72, %rd78, 131072; 2026-02-21T09:14:15.4824030Z add.s64 %rd73, %rd78, 196608; 2026-02-21T09:14:15.4824200Z .loc 1 55 80 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:80 2026-02-21T09:14:15.4824259Z // begin inline asm 2026-02-21T09:14:15.4824408Z cp.async.cg.shared.global [ %r298 + 0 ], [ %rd70 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4824468Z // end inline asm 2026-02-21T09:14:15.4824524Z // begin inline asm 2026-02-21T09:14:15.4824635Z cp.async.cg.shared.global [ %r300 + 0 ], [ %rd71 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4824698Z // end inline asm 2026-02-21T09:14:15.4824753Z // begin inline asm 2026-02-21T09:14:15.4824861Z cp.async.cg.shared.global [ %r302 + 0 ], [ %rd72 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4824923Z // end inline asm 2026-02-21T09:14:15.4824978Z // begin inline asm 2026-02-21T09:14:15.4825085Z cp.async.cg.shared.global [ %r304 + 0 ], [ %rd73 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4825149Z // end inline asm 2026-02-21T09:14:15.4825214Z cp.async.commit_group; 2026-02-21T09:14:15.4825383Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4825438Z bar.sync 0; 2026-02-21T09:14:15.4825501Z // begin inline asm 2026-02-21T09:14:15.4825606Z @%p4 mbarrier.arrive.expect_tx.shared.b64 _, [%r271], 2048; 2026-02-21T09:14:15.4825662Z // end inline asm 2026-02-21T09:14:15.4825836Z .loc 1 61 33 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:61:33 2026-02-21T09:14:15.4825890Z bar.sync 0; 2026-02-21T09:14:15.4825954Z elect.sync %r363|%p64, -1; 
2026-02-21T09:14:15.4826018Z and.pred %p58, %p1, %p64; 2026-02-21T09:14:15.4826086Z add.s32 %r307, %r81, 77824; 2026-02-21T09:14:15.4826141Z mov.b32 %r309, 32; 2026-02-21T09:14:15.4826196Z // begin inline asm 2026-02-21T09:14:15.4826441Z @%p58 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r307], [%rd88, {%r437, %r309}], [%r271]; 2026-02-21T09:14:15.4826498Z // end inline asm 2026-02-21T09:14:15.4826664Z .loc 1 55 80 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:80 2026-02-21T09:14:15.4826736Z cp.async.wait_group 2; 2026-02-21T09:14:15.4826791Z bar.sync 0; 2026-02-21T09:14:15.4826884Z ld.shared.v4.b32 {%r364, %r365, %r366, %r367}, [%r25]; 2026-02-21T09:14:15.4826947Z mov.b32 {%rs4, %rs5}, %r367; 2026-02-21T09:14:15.4827016Z mov.b32 {%rs6, %rs7}, %r366; 2026-02-21T09:14:15.4827076Z mov.b32 {%rs8, %rs9}, %r365; 2026-02-21T09:14:15.4827139Z mov.b32 {%rs10, %rs11}, %r364; 2026-02-21T09:14:15.4827243Z ld.shared.v4.b32 {%r368, %r369, %r370, %r371}, [%r25+16]; 2026-02-21T09:14:15.4827305Z mov.b32 {%rs12, %rs13}, %r371; 2026-02-21T09:14:15.4827366Z mov.b32 {%rs14, %rs15}, %r370; 2026-02-21T09:14:15.4827432Z mov.b32 {%rs16, %rs17}, %r369; 2026-02-21T09:14:15.4827489Z mov.b32 {%rs18, %rs19}, %r368; 2026-02-21T09:14:15.4827604Z ld.shared.v4.b32 {%r372, %r373, %r374, %r375}, [%r25+32]; 2026-02-21T09:14:15.4827664Z mov.b32 {%rs20, %rs21}, %r375; 2026-02-21T09:14:15.4827728Z mov.b32 {%rs22, %rs23}, %r374; 2026-02-21T09:14:15.4827785Z mov.b32 {%rs24, %rs25}, %r373; 2026-02-21T09:14:15.4827843Z mov.b32 {%rs26, %rs27}, %r372; 2026-02-21T09:14:15.4827940Z ld.shared.v4.b32 {%r376, %r377, %r378, %r379}, [%r25+48]; 2026-02-21T09:14:15.4827998Z mov.b32 {%rs28, %rs29}, %r379; 2026-02-21T09:14:15.4828079Z mov.b32 {%rs30, %rs31}, %r378; 2026-02-21T09:14:15.4828138Z mov.b32 {%rs32, %rs33}, %r377; 2026-02-21T09:14:15.4828202Z mov.b32 {%rs34, %rs35}, %r376; 2026-02-21T09:14:15.4828366Z .loc 1 59 32 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:59:32 2026-02-21T09:14:15.4828426Z cvt.f32.bf16 %r312, %rs10; 2026-02-21T09:14:15.4828492Z cvt.f32.bf16 %r313, %rs11; 2026-02-21T09:14:15.4828551Z cvt.f32.bf16 %r314, %rs8; 2026-02-21T09:14:15.4828608Z cvt.f32.bf16 %r315, %rs9; 2026-02-21T09:14:15.4828673Z cvt.f32.bf16 %r316, %rs6; 2026-02-21T09:14:15.4828728Z cvt.f32.bf16 %r317, %rs7; 2026-02-21T09:14:15.4828807Z cvt.f32.bf16 %r318, %rs4; 2026-02-21T09:14:15.4828865Z cvt.f32.bf16 %r319, %rs5; 2026-02-21T09:14:15.4828930Z cvt.f32.bf16 %r320, %rs18; 2026-02-21T09:14:15.4828988Z cvt.f32.bf16 %r321, %rs19; 2026-02-21T09:14:15.4829045Z cvt.f32.bf16 %r322, %rs16; 2026-02-21T09:14:15.4829128Z cvt.f32.bf16 %r323, %rs17; 2026-02-21T09:14:15.4829189Z cvt.f32.bf16 %r324, %rs14; 2026-02-21T09:14:15.4829247Z cvt.f32.bf16 %r325, %rs15; 2026-02-21T09:14:15.4829304Z cvt.f32.bf16 %r326, %rs12; 2026-02-21T09:14:15.4829369Z cvt.f32.bf16 %r327, %rs13; 2026-02-21T09:14:15.4829425Z cvt.f32.bf16 %r329, %rs26; 2026-02-21T09:14:15.4829481Z cvt.f32.bf16 %r330, %rs27; 2026-02-21T09:14:15.4829543Z cvt.f32.bf16 %r331, %rs24; 2026-02-21T09:14:15.4829600Z cvt.f32.bf16 %r332, %rs25; 2026-02-21T09:14:15.4829658Z cvt.f32.bf16 %r333, %rs22; 2026-02-21T09:14:15.4829715Z cvt.f32.bf16 %r334, %rs23; 2026-02-21T09:14:15.4829779Z cvt.f32.bf16 %r335, %rs20; 2026-02-21T09:14:15.4829838Z cvt.f32.bf16 %r336, %rs21; 2026-02-21T09:14:15.4829895Z cvt.f32.bf16 %r337, %rs34; 2026-02-21T09:14:15.4829958Z cvt.f32.bf16 %r338, %rs35; 2026-02-21T09:14:15.4830015Z cvt.f32.bf16 %r339, %rs32; 2026-02-21T09:14:15.4830072Z cvt.f32.bf16 %r340, %rs33; 2026-02-21T09:14:15.4830129Z cvt.f32.bf16 %r341, %rs30; 2026-02-21T09:14:15.4830195Z cvt.f32.bf16 %r342, %rs31; 2026-02-21T09:14:15.4830253Z cvt.f32.bf16 %r343, %rs28; 2026-02-21T09:14:15.4830311Z cvt.f32.bf16 %r344, %rs29; 2026-02-21T09:14:15.4830371Z $L__tmp132: 2026-02-21T09:14:15.4830588Z .loc 2 291 36 // standard.py:291:36 @[ 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4830648Z add.s32 %r311, %r356, %r413; 2026-02-21T09:14:15.4830714Z // begin inline asm 2026-02-21T09:14:15.4830994Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r311 + 0], {%r312, %r313, %r314, %r315, %r316, %r317, %r318, %r319, %r320, %r321, %r322, %r323, %r324, %r325, %r326, %r327}; 2026-02-21T09:14:15.4831053Z // end inline asm 2026-02-21T09:14:15.4831109Z // begin inline asm 2026-02-21T09:14:15.4831387Z @%p68 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r311 + 16], {%r329, %r330, %r331, %r332, %r333, %r334, %r335, %r336, %r337, %r338, %r339, %r340, %r341, %r342, %r343, %r344}; 2026-02-21T09:14:15.4831443Z // end inline asm 2026-02-21T09:14:15.4831500Z // begin inline asm 2026-02-21T09:14:15.4831611Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:14:15.4831669Z // end inline asm 2026-02-21T09:14:15.4831725Z bar.sync 0; 2026-02-21T09:14:15.4831792Z $L__tmp133: 2026-02-21T09:14:15.4831980Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4832036Z // begin inline asm 2026-02-21T09:14:15.4832088Z 2026-02-21T09:14:15.4832148Z { 2026-02-21T09:14:15.4832210Z .reg .pred complete; 2026-02-21T09:14:15.4832269Z waitLoop: 2026-02-21T09:14:15.4832427Z mbarrier.try_wait.parity.shared.b64 complete, [%r435], %r927; 2026-02-21T09:14:15.4832495Z @!complete bra.uni waitLoop; 2026-02-21T09:14:15.4832548Z } 2026-02-21T09:14:15.4832552Z 2026-02-21T09:14:15.4832607Z // end inline asm 2026-02-21T09:14:15.4832783Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4832848Z ld.shared.s8 %rs36, [%r26]; 2026-02-21T09:14:15.4833024Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4833140Z shl.b16 %rs37, %rs36, 4; 2026-02-21T09:14:15.4833304Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4833372Z ld.shared.s8 %rs38, 
[%r28+128]; 2026-02-21T09:14:15.4833539Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4833598Z shl.b16 %rs39, %rs38, 4; 2026-02-21T09:14:15.4833790Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4833863Z ld.shared.s8 %rs40, [%r30+256]; 2026-02-21T09:14:15.4834030Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4834088Z shl.b16 %rs41, %rs40, 4; 2026-02-21T09:14:15.4834281Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4834354Z ld.shared.s8 %rs42, [%r32+384]; 2026-02-21T09:14:15.4834516Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4834574Z shl.b16 %rs43, %rs42, 4; 2026-02-21T09:14:15.4834745Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4834807Z ld.shared.s8 %rs44, [%r34+512]; 2026-02-21T09:14:15.4834968Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4835035Z shl.b16 %rs45, %rs44, 4; 2026-02-21T09:14:15.4835197Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4835260Z ld.shared.s8 %rs46, [%r36+640]; 2026-02-21T09:14:15.4835428Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4835486Z shl.b16 %rs47, %rs46, 4; 2026-02-21T09:14:15.4835650Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4835718Z ld.shared.s8 %rs48, [%r38+768]; 2026-02-21T09:14:15.4835882Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4835940Z shl.b16 %rs49, %rs48, 4; 2026-02-21T09:14:15.4836103Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4836175Z ld.shared.s8 %rs50, [%r40+896]; 
2026-02-21T09:14:15.4836340Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4836399Z shl.b16 %rs51, %rs50, 4; 2026-02-21T09:14:15.4836568Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4836633Z ld.shared.s8 %rs52, [%r26+1024]; 2026-02-21T09:14:15.4836799Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4836864Z shl.b16 %rs53, %rs52, 4; 2026-02-21T09:14:15.4837028Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4837093Z ld.shared.s8 %rs54, [%r28+1152]; 2026-02-21T09:14:15.4837260Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4837318Z shl.b16 %rs55, %rs54, 4; 2026-02-21T09:14:15.4837505Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4837571Z ld.shared.s8 %rs56, [%r30+1280]; 2026-02-21T09:14:15.4837744Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4837802Z shl.b16 %rs57, %rs56, 4; 2026-02-21T09:14:15.4837973Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4838084Z ld.shared.s8 %rs58, [%r32+1408]; 2026-02-21T09:14:15.4838244Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4838302Z shl.b16 %rs59, %rs58, 4; 2026-02-21T09:14:15.4838467Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4838530Z ld.shared.s8 %rs60, [%r34+1536]; 2026-02-21T09:14:15.4838691Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4838757Z shl.b16 %rs61, %rs60, 4; 2026-02-21T09:14:15.4838940Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4839003Z ld.shared.s8 %rs62, [%r36+1664]; 
2026-02-21T09:14:15.4839168Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4839254Z shl.b16 %rs63, %rs62, 4; 2026-02-21T09:14:15.4839419Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4839481Z ld.shared.s8 %rs64, [%r38+1792]; 2026-02-21T09:14:15.4839653Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4839710Z shl.b16 %rs65, %rs64, 4; 2026-02-21T09:14:15.4839876Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4839946Z ld.shared.s8 %rs66, [%r40+1920]; 2026-02-21T09:14:15.4840111Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4840168Z shl.b16 %rs67, %rs66, 4; 2026-02-21T09:14:15.4840339Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4840400Z cvt.s16.s8 %rs68, %rs37; 2026-02-21T09:14:15.4840459Z shr.s16 %rs69, %rs68, 4; 2026-02-21T09:14:15.4840521Z cvt.s16.s8 %rs70, %rs39; 2026-02-21T09:14:15.4840587Z shr.s16 %rs71, %rs70, 4; 2026-02-21T09:14:15.4840645Z shr.s16 %rs72, %rs36, 4; 2026-02-21T09:14:15.4840703Z shr.s16 %rs73, %rs38, 4; 2026-02-21T09:14:15.4840877Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4840940Z cvt.rn.f32.s16 %r380, %rs73; 2026-02-21T09:14:15.4841003Z cvt.rn.f32.s16 %r381, %rs72; 2026-02-21T09:14:15.4841063Z cvt.rn.f32.s16 %r382, %rs71; 2026-02-21T09:14:15.4841131Z cvt.rn.f32.s16 %r383, %rs69; 2026-02-21T09:14:15.4841299Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4841357Z cvt.s16.s8 %rs74, %rs41; 2026-02-21T09:14:15.4841424Z shr.s16 %rs75, %rs74, 4; 2026-02-21T09:14:15.4841481Z cvt.s16.s8 %rs76, %rs43; 2026-02-21T09:14:15.4841560Z shr.s16 %rs77, %rs76, 4; 2026-02-21T09:14:15.4841627Z shr.s16 %rs78, %rs40, 4; 2026-02-21T09:14:15.4841688Z 
shr.s16 %rs79, %rs42, 4; 2026-02-21T09:14:15.4841854Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4841915Z cvt.rn.f32.s16 %r384, %rs79; 2026-02-21T09:14:15.4841984Z cvt.rn.f32.s16 %r385, %rs78; 2026-02-21T09:14:15.4842044Z cvt.rn.f32.s16 %r386, %rs77; 2026-02-21T09:14:15.4842103Z cvt.rn.f32.s16 %r387, %rs75; 2026-02-21T09:14:15.4842274Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4842358Z cvt.s16.s8 %rs80, %rs45; 2026-02-21T09:14:15.4842415Z shr.s16 %rs81, %rs80, 4; 2026-02-21T09:14:15.4842482Z cvt.s16.s8 %rs82, %rs47; 2026-02-21T09:14:15.4842538Z shr.s16 %rs83, %rs82, 4; 2026-02-21T09:14:15.4842595Z shr.s16 %rs84, %rs44, 4; 2026-02-21T09:14:15.4842651Z shr.s16 %rs85, %rs46, 4; 2026-02-21T09:14:15.4842828Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4842912Z cvt.rn.f32.s16 %r388, %rs85; 2026-02-21T09:14:15.4842971Z cvt.rn.f32.s16 %r389, %rs84; 2026-02-21T09:14:15.4843036Z cvt.rn.f32.s16 %r390, %rs83; 2026-02-21T09:14:15.4843094Z cvt.rn.f32.s16 %r391, %rs81; 2026-02-21T09:14:15.4843263Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4843321Z cvt.s16.s8 %rs86, %rs49; 2026-02-21T09:14:15.4843386Z shr.s16 %rs87, %rs86, 4; 2026-02-21T09:14:15.4843444Z cvt.s16.s8 %rs88, %rs51; 2026-02-21T09:14:15.4843503Z shr.s16 %rs89, %rs88, 4; 2026-02-21T09:14:15.4843568Z shr.s16 %rs90, %rs48, 4; 2026-02-21T09:14:15.4843651Z shr.s16 %rs91, %rs50, 4; 2026-02-21T09:14:15.4843819Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4843885Z cvt.rn.f32.s16 %r392, %rs91; 2026-02-21T09:14:15.4843943Z cvt.rn.f32.s16 %r393, %rs90; 2026-02-21T09:14:15.4844024Z cvt.rn.f32.s16 %r394, %rs89; 2026-02-21T09:14:15.4844086Z cvt.rn.f32.s16 %r395, %rs87; 2026-02-21T09:14:15.4844259Z .loc 1 66 25 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4844319Z cvt.s16.s8 %rs92, %rs53; 2026-02-21T09:14:15.4844376Z shr.s16 %rs93, %rs92, 4; 2026-02-21T09:14:15.4844441Z cvt.s16.s8 %rs94, %rs55; 2026-02-21T09:14:15.4844497Z shr.s16 %rs95, %rs94, 4; 2026-02-21T09:14:15.4844554Z shr.s16 %rs96, %rs52, 4; 2026-02-21T09:14:15.4844611Z shr.s16 %rs97, %rs54, 4; 2026-02-21T09:14:15.4844787Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4844848Z cvt.rn.f32.s16 %r396, %rs97; 2026-02-21T09:14:15.4844907Z cvt.rn.f32.s16 %r397, %rs96; 2026-02-21T09:14:15.4844974Z cvt.rn.f32.s16 %r398, %rs95; 2026-02-21T09:14:15.4845031Z cvt.rn.f32.s16 %r399, %rs93; 2026-02-21T09:14:15.4845200Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4845266Z cvt.s16.s8 %rs98, %rs57; 2026-02-21T09:14:15.4845322Z shr.s16 %rs99, %rs98, 4; 2026-02-21T09:14:15.4845382Z cvt.s16.s8 %rs100, %rs59; 2026-02-21T09:14:15.4845444Z shr.s16 %rs101, %rs100, 4; 2026-02-21T09:14:15.4845508Z shr.s16 %rs102, %rs56, 4; 2026-02-21T09:14:15.4845566Z shr.s16 %rs103, %rs58, 4; 2026-02-21T09:14:15.4845733Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4845801Z cvt.rn.f32.s16 %r400, %rs103; 2026-02-21T09:14:15.4845862Z cvt.rn.f32.s16 %r401, %rs102; 2026-02-21T09:14:15.4845921Z cvt.rn.f32.s16 %r402, %rs101; 2026-02-21T09:14:15.4845982Z cvt.rn.f32.s16 %r403, %rs99; 2026-02-21T09:14:15.4846155Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4846213Z cvt.s16.s8 %rs104, %rs61; 2026-02-21T09:14:15.4846272Z shr.s16 %rs105, %rs104, 4; 2026-02-21T09:14:15.4846340Z cvt.s16.s8 %rs106, %rs63; 2026-02-21T09:14:15.4846399Z shr.s16 %rs107, %rs106, 4; 2026-02-21T09:14:15.4846458Z shr.s16 %rs108, %rs60, 4; 2026-02-21T09:14:15.4846520Z shr.s16 %rs109, %rs62, 4; 2026-02-21T09:14:15.4846682Z .loc 1 84 32 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4846742Z cvt.rn.f32.s16 %r404, %rs109; 2026-02-21T09:14:15.4846801Z cvt.rn.f32.s16 %r405, %rs108; 2026-02-21T09:14:15.4846867Z cvt.rn.f32.s16 %r406, %rs107; 2026-02-21T09:14:15.4846924Z cvt.rn.f32.s16 %r407, %rs105; 2026-02-21T09:14:15.4847114Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4847180Z cvt.s16.s8 %rs110, %rs65; 2026-02-21T09:14:15.4847239Z shr.s16 %rs111, %rs110, 4; 2026-02-21T09:14:15.4847295Z cvt.s16.s8 %rs112, %rs67; 2026-02-21T09:14:15.4847359Z shr.s16 %rs113, %rs112, 4; 2026-02-21T09:14:15.4847415Z shr.s16 %rs114, %rs64, 4; 2026-02-21T09:14:15.4847472Z shr.s16 %rs115, %rs66, 4; 2026-02-21T09:14:15.4847660Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4847725Z cvt.rn.f32.s16 %r408, %rs115; 2026-02-21T09:14:15.4847784Z cvt.rn.f32.s16 %r409, %rs114; 2026-02-21T09:14:15.4847842Z cvt.rn.f32.s16 %r410, %rs113; 2026-02-21T09:14:15.4847906Z cvt.rn.f32.s16 %r411, %rs111; 2026-02-21T09:14:15.4847999Z st.shared.v4.b32 [%r41], {%r383, %r381, %r382, %r380}; 2026-02-21T09:14:15.4848090Z st.shared.v4.b32 [%r42], {%r387, %r385, %r386, %r384}; 2026-02-21T09:14:15.4848176Z st.shared.v4.b32 [%r43], {%r391, %r389, %r390, %r388}; 2026-02-21T09:14:15.4848290Z st.shared.v4.b32 [%r44], {%r395, %r393, %r394, %r392}; 2026-02-21T09:14:15.4848376Z st.shared.v4.b32 [%r45], {%r399, %r397, %r398, %r396}; 2026-02-21T09:14:15.4848459Z st.shared.v4.b32 [%r46], {%r403, %r401, %r402, %r400}; 2026-02-21T09:14:15.4848550Z st.shared.v4.b32 [%r47], {%r407, %r405, %r406, %r404}; 2026-02-21T09:14:15.4848653Z st.shared.v4.b32 [%r48], {%r411, %r409, %r410, %r408}; 2026-02-21T09:14:15.4848710Z $L__tmp134: 2026-02-21T09:14:15.4848932Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4848991Z // begin inline asm 
2026-02-21T09:14:15.4849066Z fence.proxy.async.shared::cta; 2026-02-21T09:14:15.4849129Z // end inline asm 2026-02-21T09:14:15.4849184Z bar.sync 0; 2026-02-21T09:14:15.4849247Z setp.ne.b32 %p65, %r63, 0; 2026-02-21T09:14:15.4849306Z @%p65 bra $L__BB0_4; 2026-02-21T09:14:15.4849417Z // %bb.3: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:14:15.4849484Z elect.sync %r424|%p67, -1; 2026-02-21T09:14:15.4849544Z mov.b32 %r414, 136317200; 2026-02-21T09:14:15.4849611Z mov.pred %p66, 0; 2026-02-21T09:14:15.4849670Z // begin inline asm 2026-02-21T09:14:15.4849834Z @%p67 tcgen05.mma.cta_group::1.kind::tf32 [ %r925 + 0 ], [ %r413 + 0 ], %rd79, %r414, %p66; 2026-02-21T09:14:15.4849892Z // end inline asm 2026-02-21T09:14:15.4849960Z // begin inline asm 2026-02-21T09:14:15.4850110Z @%p67 tcgen05.mma.cta_group::1.kind::tf32 [ %r925 + 0 ], [ %r413 + 8 ], %rd80, %r414, %p68; 2026-02-21T09:14:15.4850166Z // end inline asm 2026-02-21T09:14:15.4850235Z // begin inline asm 2026-02-21T09:14:15.4850383Z @%p67 tcgen05.mma.cta_group::1.kind::tf32 [ %r925 + 0 ], [ %r413 + 16 ], %rd81, %r414, %p68; 2026-02-21T09:14:15.4850438Z // end inline asm 2026-02-21T09:14:15.4850502Z // begin inline asm 2026-02-21T09:14:15.4850643Z @%p67 tcgen05.mma.cta_group::1.kind::tf32 [ %r925 + 0 ], [ %r413 + 24 ], %rd82, %r414, %p68; 2026-02-21T09:14:15.4850700Z // end inline asm 2026-02-21T09:14:15.4850761Z add.s32 %r426, %r81, 79904; 2026-02-21T09:14:15.4850828Z cvt.u64.u32 %rd83, %r426; 2026-02-21T09:14:15.4850884Z // begin inline asm 2026-02-21T09:14:15.4851010Z @%p67 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd83]; 2026-02-21T09:14:15.4851073Z // end inline asm 2026-02-21T09:14:15.4851127Z $L__tmp135: 2026-02-21T09:14:15.4851228Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:14:15.4851396Z .loc 1 0 0 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:0 2026-02-21T09:14:15.4851468Z mad.wide.u32 %rd361, %r348, 2048, %rd8; 2026-02-21T09:14:15.4851666Z .loc 1 55 32 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:32 2026-02-21T09:14:15.4851736Z add.s64 %rd84, %rd11, 192; 2026-02-21T09:14:15.4851798Z cvt.u64.u32 %rd90, %r49; 2026-02-21T09:14:15.4851906Z add.s64 %rd91, %rd10, %rd90; 2026-02-21T09:14:15.4851965Z shl.b64 %rd92, %rd91, 1; 2026-02-21T09:14:15.4852035Z add.s64 %rd93, %rd16, %rd92; 2026-02-21T09:14:15.4852094Z add.s64 %rd85, %rd93, 65536; 2026-02-21T09:14:15.4852153Z add.s64 %rd86, %rd93, 131072; 2026-02-21T09:14:15.4852219Z add.s64 %rd87, %rd93, 196608; 2026-02-21T09:14:15.4852390Z .loc 1 55 80 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:80 2026-02-21T09:14:15.4852473Z // begin inline asm 2026-02-21T09:14:15.4852594Z cp.async.cg.shared.global [ %r427 + 0 ], [ %rd84 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4852659Z // end inline asm 2026-02-21T09:14:15.4852716Z // begin inline asm 2026-02-21T09:14:15.4852832Z cp.async.cg.shared.global [ %r429 + 0 ], [ %rd85 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4852896Z // end inline asm 2026-02-21T09:14:15.4852952Z // begin inline asm 2026-02-21T09:14:15.4853064Z cp.async.cg.shared.global [ %r431 + 0 ], [ %rd86 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4853127Z // end inline asm 2026-02-21T09:14:15.4853183Z // begin inline asm 2026-02-21T09:14:15.4853316Z cp.async.cg.shared.global [ %r433 + 0 ], [ %rd87 + 0 ], 0x10, %r428; 2026-02-21T09:14:15.4853373Z // end inline asm 2026-02-21T09:14:15.4853446Z cp.async.commit_group; 2026-02-21T09:14:15.4853622Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4853704Z // begin inline asm 2026-02-21T09:14:15.4853821Z @%p4 mbarrier.arrive.expect_tx.shared.b64 _, [%r435], 2048; 2026-02-21T09:14:15.4853876Z // end inline asm 2026-02-21T09:14:15.4854043Z .loc 1 61 33 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:61:33 2026-02-21T09:14:15.4854098Z bar.sync 0; 2026-02-21T09:14:15.4854170Z elect.sync %r444|%p78, -1; 2026-02-21T09:14:15.4854234Z and.pred %p76, %p1, 
%p78; 2026-02-21T09:14:15.4854290Z mov.b32 %r438, 48; 2026-02-21T09:14:15.4854352Z // begin inline asm 2026-02-21T09:14:15.4854593Z @%p76 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r109], [%rd88, {%r437, %r438}], [%r435]; 2026-02-21T09:14:15.4854648Z // end inline asm 2026-02-21T09:14:15.4854709Z mov.b32 %r931, 1; 2026-02-21T09:14:15.4854764Z mov.b64 %rd362, 0; 2026-02-21T09:14:15.4854820Z mov.b32 %r929, %r927; 2026-02-21T09:14:15.4854876Z mov.b32 %r930, %r927; 2026-02-21T09:14:15.4854941Z mov.b32 %r932, %r927; 2026-02-21T09:14:15.4855000Z bra.uni $L__BB0_5; 2026-02-21T09:14:15.4855103Z $L__BB0_7: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:14:15.4855287Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4855358Z setp.lt.u64 %p94, %rd362, 448; 2026-02-21T09:14:15.4855412Z $L__tmp136: 2026-02-21T09:14:15.4855643Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4855706Z add.s32 %r577, %r931, 1; 2026-02-21T09:14:15.4855772Z setp.gt.s32 %p97, %r577, 1; 2026-02-21T09:14:15.4855840Z selp.b32 %r931, 0, %r577, %p97; 2026-02-21T09:14:15.4855911Z selp.b32 %r578, 1, 0, %p97; 2026-02-21T09:14:15.4855972Z xor.b32 %r79, %r932, %r578; 2026-02-21T09:14:15.4856027Z $L__tmp137: 2026-02-21T09:14:15.4856208Z .loc 1 55 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:32 2026-02-21T09:14:15.4856278Z add.s64 %rd99, %rd361, -196608; 2026-02-21T09:14:15.4856348Z add.s64 %rd100, %rd361, -131072; 2026-02-21T09:14:15.4856414Z add.s64 %rd101, %rd361, -65536; 2026-02-21T09:14:15.4856591Z .loc 1 55 80 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:80 2026-02-21T09:14:15.4856653Z add.s32 %r564, %r73, %r8; 2026-02-21T09:14:15.4856717Z selp.b32 %r565, 16, 0, %p94; 2026-02-21T09:14:15.4856782Z // begin inline asm 2026-02-21T09:14:15.4856900Z cp.async.cg.shared.global [ %r564 + 0 ], [ %rd99 + 0 ], 0x10, %r565; 
2026-02-21T09:14:15.4856982Z // end inline asm 2026-02-21T09:14:15.4857050Z add.s32 %r566, %r564, 2048; 2026-02-21T09:14:15.4857109Z // begin inline asm 2026-02-21T09:14:15.4857232Z cp.async.cg.shared.global [ %r566 + 0 ], [ %rd100 + 0 ], 0x10, %r565; 2026-02-21T09:14:15.4857290Z // end inline asm 2026-02-21T09:14:15.4857356Z add.s32 %r568, %r564, 4096; 2026-02-21T09:14:15.4857415Z // begin inline asm 2026-02-21T09:14:15.4857537Z cp.async.cg.shared.global [ %r568 + 0 ], [ %rd101 + 0 ], 0x10, %r565; 2026-02-21T09:14:15.4857623Z // end inline asm 2026-02-21T09:14:15.4857684Z add.s32 %r570, %r564, 6144; 2026-02-21T09:14:15.4857743Z // begin inline asm 2026-02-21T09:14:15.4857860Z cp.async.cg.shared.global [ %r570 + 0 ], [ %rd361 + 0 ], 0x10, %r565; 2026-02-21T09:14:15.4857925Z // end inline asm 2026-02-21T09:14:15.4857990Z cp.async.commit_group; 2026-02-21T09:14:15.4858171Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4858248Z and.pred %p92, %p4, %p94; 2026-02-21T09:14:15.4858307Z // begin inline asm 2026-02-21T09:14:15.4858448Z @%p92 mbarrier.arrive.expect_tx.shared.b64 _, [%r572], 2048; 2026-02-21T09:14:15.4858516Z // end inline asm 2026-02-21T09:14:15.4858689Z .loc 1 61 33 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:61:33 2026-02-21T09:14:15.4858747Z bar.sync 0; 2026-02-21T09:14:15.4858856Z elect.sync %r579|%p98, -1; 2026-02-21T09:14:15.4858937Z and.pred %p99, %p94, %p98; 2026-02-21T09:14:15.4859005Z and.pred %p93, %p1, %p99; 2026-02-21T09:14:15.4859068Z cvt.u32.u64 %r580, %rd362; 2026-02-21T09:14:15.4859138Z add.s32 %r575, %r580, 64; 2026-02-21T09:14:15.4859196Z // begin inline asm 2026-02-21T09:14:15.4859440Z @%p93 cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [%r573], [%rd88, {%r437, %r575}], [%r572]; 2026-02-21T09:14:15.4859505Z // end inline asm 2026-02-21T09:14:15.4859684Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 
2026-02-21T09:14:15.4859750Z add.s64 %rd361, %rd361, 64; 2026-02-21T09:14:15.4859817Z setp.lt.u64 %p100, %rd362, 480; 2026-02-21T09:14:15.4859887Z add.s64 %rd362, %rd362, 16; 2026-02-21T09:14:15.4859946Z mov.b32 %r927, %r932; 2026-02-21T09:14:15.4860005Z mov.b32 %r932, %r79; 2026-02-21T09:14:15.4860073Z @%p100 bra $L__BB0_5; 2026-02-21T09:14:15.4860132Z bra.uni $L__BB0_8; 2026-02-21T09:14:15.4860234Z $L__BB0_5: // Parent Loop BB0_2 Depth=1 2026-02-21T09:14:15.4860340Z // => This Inner Loop Header: Depth=2 2026-02-21T09:14:15.4860522Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4860584Z add.s32 %r483, %r930, 1; 2026-02-21T09:14:15.4860883Z [442s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:14:15.4861932Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=2, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:14:15.4862068Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:14:15.4862140Z `ptxas` stderr: 2026-02-21T09:14:15.4862504Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 321 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:14:15.4862602Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:14:15.4862607Z 2026-02-21T09:14:15.4863016Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp7oryq3ns.ptx -o /tmp/tmp7oryq3ns.ptx.o 2026-02-21T09:14:15.4863051Z 2026-02-21T09:14:15.4863183Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:14:15.4863251Z setp.gt.s32 %p82, %r483, 2; 2026-02-21T09:14:15.4863326Z selp.b32 %r930, 0, %r483, %p82; 2026-02-21T09:14:15.4863509Z .loc 1 55 80 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:55:80 2026-02-21T09:14:15.4863607Z cp.async.wait_group 2; 2026-02-21T09:14:15.4863672Z bar.sync 0; 2026-02-21T09:14:15.4863735Z shl.b32 %r484, %r930, 13; 2026-02-21T09:14:15.4863797Z add.s32 %r486, %r81, %r484; 2026-02-21T09:14:15.4863859Z add.s32 %r73, %r486, 32768; 2026-02-21T09:14:15.4863939Z add.s32 %r487, %r73, %r24; 2026-02-21T09:14:15.4864037Z ld.shared.v4.b32 {%r488, %r489, %r490, %r491}, [%r487]; 2026-02-21T09:14:15.4864104Z mov.b32 {%rs116, %rs117}, %r491; 2026-02-21T09:14:15.4864174Z mov.b32 {%rs118, %rs119}, %r490; 2026-02-21T09:14:15.4864235Z mov.b32 {%rs120, %rs121}, %r489; 2026-02-21T09:14:15.4864318Z mov.b32 {%rs122, %rs123}, %r488; 2026-02-21T09:14:15.4864419Z ld.shared.v4.b32 {%r492, %r493, %r494, %r495}, [%r487+16]; 2026-02-21T09:14:15.4864485Z mov.b32 {%rs124, %rs125}, %r495; 2026-02-21T09:14:15.4864544Z mov.b32 {%rs126, %rs127}, %r494; 2026-02-21T09:14:15.4864602Z mov.b32 {%rs128, %rs129}, %r493; 2026-02-21T09:14:15.4864695Z mov.b32 {%rs130, %rs131}, %r492; 2026-02-21T09:14:15.4864796Z ld.shared.v4.b32 {%r496, %r497, %r498, %r499}, [%r487+32]; 2026-02-21T09:14:15.4864855Z mov.b32 {%rs132, %rs133}, %r499; 2026-02-21T09:14:15.4864921Z mov.b32 {%rs134, %rs135}, %r498; 2026-02-21T09:14:15.4864979Z mov.b32 {%rs136, %rs137}, %r497; 2026-02-21T09:14:15.4865037Z mov.b32 {%rs138, %rs139}, %r496; 
2026-02-21T09:14:15.4865129Z ld.shared.v4.b32 {%r500, %r501, %r502, %r503}, [%r487+48]; 2026-02-21T09:14:15.4865195Z mov.b32 {%rs140, %rs141}, %r503; 2026-02-21T09:14:15.4865254Z mov.b32 {%rs142, %rs143}, %r502; 2026-02-21T09:14:15.4865312Z mov.b32 {%rs144, %rs145}, %r501; 2026-02-21T09:14:15.4865375Z mov.b32 {%rs146, %rs147}, %r500; 2026-02-21T09:14:15.4865548Z .loc 1 59 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:59:32 2026-02-21T09:14:15.4865610Z cvt.f32.bf16 %r448, %rs122; 2026-02-21T09:14:15.4865677Z cvt.f32.bf16 %r449, %rs123; 2026-02-21T09:14:15.4865737Z cvt.f32.bf16 %r450, %rs120; 2026-02-21T09:14:15.4865796Z cvt.f32.bf16 %r451, %rs121; 2026-02-21T09:14:15.4865854Z cvt.f32.bf16 %r452, %rs118; 2026-02-21T09:14:15.4865919Z cvt.f32.bf16 %r453, %rs119; 2026-02-21T09:14:15.4865975Z cvt.f32.bf16 %r454, %rs116; 2026-02-21T09:14:15.4866031Z cvt.f32.bf16 %r455, %rs117; 2026-02-21T09:14:15.4866094Z cvt.f32.bf16 %r456, %rs130; 2026-02-21T09:14:15.4866151Z cvt.f32.bf16 %r457, %rs131; 2026-02-21T09:14:15.4866206Z cvt.f32.bf16 %r458, %rs128; 2026-02-21T09:14:15.4866262Z cvt.f32.bf16 %r459, %rs129; 2026-02-21T09:14:15.4866328Z cvt.f32.bf16 %r460, %rs126; 2026-02-21T09:14:15.4866385Z cvt.f32.bf16 %r461, %rs127; 2026-02-21T09:14:15.4866445Z cvt.f32.bf16 %r462, %rs124; 2026-02-21T09:14:15.4866508Z cvt.f32.bf16 %r463, %rs125; 2026-02-21T09:14:15.4866565Z cvt.f32.bf16 %r465, %rs138; 2026-02-21T09:14:15.4866621Z cvt.f32.bf16 %r466, %rs139; 2026-02-21T09:14:15.4866678Z cvt.f32.bf16 %r467, %rs136; 2026-02-21T09:14:15.4866742Z cvt.f32.bf16 %r468, %rs137; 2026-02-21T09:14:15.4866800Z cvt.f32.bf16 %r469, %rs134; 2026-02-21T09:14:15.4866858Z cvt.f32.bf16 %r470, %rs135; 2026-02-21T09:14:15.4866924Z cvt.f32.bf16 %r471, %rs132; 2026-02-21T09:14:15.4866981Z cvt.f32.bf16 %r472, %rs133; 2026-02-21T09:14:15.4867038Z cvt.f32.bf16 %r473, %rs146; 2026-02-21T09:14:15.4867095Z cvt.f32.bf16 %r474, %rs147; 2026-02-21T09:14:15.4867158Z cvt.f32.bf16 %r475, %rs144; 
2026-02-21T09:14:15.4867216Z cvt.f32.bf16 %r476, %rs145; 2026-02-21T09:14:15.4867272Z cvt.f32.bf16 %r477, %rs142; 2026-02-21T09:14:15.4867336Z cvt.f32.bf16 %r478, %rs143; 2026-02-21T09:14:15.4867416Z cvt.f32.bf16 %r479, %rs140; 2026-02-21T09:14:15.4867476Z cvt.f32.bf16 %r480, %rs141; 2026-02-21T09:14:15.4867537Z $L__tmp138: 2026-02-21T09:14:15.4867755Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4867812Z // begin inline asm 2026-02-21T09:14:15.4867864Z 2026-02-21T09:14:15.4867928Z { 2026-02-21T09:14:15.4867991Z .reg .pred complete; 2026-02-21T09:14:15.4868075Z waitLoop: 2026-02-21T09:14:15.4868203Z mbarrier.try_wait.parity.shared.b64 complete, [%r928], %r927; 2026-02-21T09:14:15.4868279Z @!complete bra.uni waitLoop; 2026-02-21T09:14:15.4868330Z } 2026-02-21T09:14:15.4868334Z 2026-02-21T09:14:15.4868388Z // end inline asm 2026-02-21T09:14:15.4868449Z $L__tmp139: 2026-02-21T09:14:15.4868622Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4868682Z selp.b32 %r504, 1, 0, %p82; 2026-02-21T09:14:15.4868753Z xor.b32 %r929, %r929, %r504; 2026-02-21T09:14:15.4868812Z mov.pred %p83, -1; 2026-02-21T09:14:15.4868883Z $L__tmp140: 2026-02-21T09:14:15.4869106Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4869164Z // begin inline asm 2026-02-21T09:14:15.4869476Z @%p83 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r311 + 0], {%r448, %r449, %r450, %r451, %r452, %r453, %r454, %r455, %r456, %r457, %r458, %r459, %r460, %r461, %r462, %r463}; 2026-02-21T09:14:15.4869543Z // end inline asm 2026-02-21T09:14:15.4869599Z // begin inline asm 2026-02-21T09:14:15.4869868Z @%p83 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r311 + 16], {%r465, %r466, %r467, %r468, %r469, %r470, %r471, %r472, %r473, %r474, %r475, %r476, %r477, %r478, %r479, %r480}; 2026-02-21T09:14:15.4869922Z // end inline asm 
2026-02-21T09:14:15.4869985Z // begin inline asm 2026-02-21T09:14:15.4870056Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:14:15.4870111Z // end inline asm 2026-02-21T09:14:15.4870173Z bar.sync 0; 2026-02-21T09:14:15.4870227Z $L__tmp141: 2026-02-21T09:14:15.4870399Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4870459Z shl.b32 %r505, %r930, 3; 2026-02-21T09:14:15.4870524Z add.s32 %r506, %r81, %r505; 2026-02-21T09:14:15.4870584Z add.s32 %r572, %r506, 79872; 2026-02-21T09:14:15.4870640Z // begin inline asm 2026-02-21T09:14:15.4870699Z 2026-02-21T09:14:15.4870750Z { 2026-02-21T09:14:15.4870811Z .reg .pred complete; 2026-02-21T09:14:15.4870865Z waitLoop: 2026-02-21T09:14:15.4870987Z mbarrier.try_wait.parity.shared.b64 complete, [%r572], %r929; 2026-02-21T09:14:15.4871052Z @!complete bra.uni waitLoop; 2026-02-21T09:14:15.4871103Z } 2026-02-21T09:14:15.4871106Z 2026-02-21T09:14:15.4871168Z // end inline asm 2026-02-21T09:14:15.4871334Z .loc 1 61 33 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:61:33 2026-02-21T09:14:15.4871395Z shl.b32 %r507, %r930, 11; 2026-02-21T09:14:15.4871461Z add.s32 %r508, %r81, %r507; 2026-02-21T09:14:15.4871523Z add.s32 %r573, %r508, 73728; 2026-02-21T09:14:15.4871620Z add.s32 %r509, %r573, %r7; 2026-02-21T09:14:15.4871679Z add.s32 %r510, %r573, %r27; 2026-02-21T09:14:15.4871745Z add.s32 %r511, %r573, %r29; 2026-02-21T09:14:15.4871801Z add.s32 %r512, %r573, %r31; 2026-02-21T09:14:15.4871858Z add.s32 %r513, %r573, %r33; 2026-02-21T09:14:15.4871923Z add.s32 %r514, %r573, %r35; 2026-02-21T09:14:15.4871979Z add.s32 %r515, %r573, %r37; 2026-02-21T09:14:15.4872037Z add.s32 %r516, %r573, %r39; 2026-02-21T09:14:15.4872209Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4872273Z ld.shared.s8 %rs148, [%r509]; 2026-02-21T09:14:15.4872437Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 
2026-02-21T09:14:15.4872495Z shl.b16 %rs149, %rs148, 4; 2026-02-21T09:14:15.4872697Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4872765Z ld.shared.s8 %rs150, [%r510+128]; 2026-02-21T09:14:15.4872930Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4872998Z shl.b16 %rs151, %rs150, 4; 2026-02-21T09:14:15.4873163Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4873255Z ld.shared.s8 %rs152, [%r511+256]; 2026-02-21T09:14:15.4873431Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4873491Z shl.b16 %rs153, %rs152, 4; 2026-02-21T09:14:15.4873656Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4873726Z ld.shared.s8 %rs154, [%r512+384]; 2026-02-21T09:14:15.4873890Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4873975Z shl.b16 %rs155, %rs154, 4; 2026-02-21T09:14:15.4874137Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4874208Z ld.shared.s8 %rs156, [%r513+512]; 2026-02-21T09:14:15.4874394Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4874456Z shl.b16 %rs157, %rs156, 4; 2026-02-21T09:14:15.4874627Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4874689Z ld.shared.s8 %rs158, [%r514+640]; 2026-02-21T09:14:15.4874853Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4874919Z shl.b16 %rs159, %rs158, 4; 2026-02-21T09:14:15.4875081Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4875144Z ld.shared.s8 %rs160, [%r515+768]; 2026-02-21T09:14:15.4875315Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 
2026-02-21T09:14:15.4875373Z shl.b16 %rs161, %rs160, 4; 2026-02-21T09:14:15.4875534Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4875597Z ld.shared.s8 %rs162, [%r516+896]; 2026-02-21T09:14:15.4875771Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4875829Z shl.b16 %rs163, %rs162, 4; 2026-02-21T09:14:15.4875989Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4876062Z ld.shared.s8 %rs164, [%r509+1024]; 2026-02-21T09:14:15.4876223Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4876282Z shl.b16 %rs165, %rs164, 4; 2026-02-21T09:14:15.4876451Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4876516Z ld.shared.s8 %rs166, [%r510+1152]; 2026-02-21T09:14:15.4876677Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4876742Z shl.b16 %rs167, %rs166, 4; 2026-02-21T09:14:15.4876905Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4876970Z ld.shared.s8 %rs168, [%r511+1280]; 2026-02-21T09:14:15.4877131Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4877197Z shl.b16 %rs169, %rs168, 4; 2026-02-21T09:14:15.4877356Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4877418Z ld.shared.s8 %rs170, [%r512+1408]; 2026-02-21T09:14:15.4877613Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4877673Z shl.b16 %rs171, %rs170, 4; 2026-02-21T09:14:15.4877837Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4877908Z ld.shared.s8 %rs172, [%r513+1536]; 2026-02-21T09:14:15.4878073Z .loc 1 64 28 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4878163Z shl.b16 %rs173, %rs172, 4; 2026-02-21T09:14:15.4878338Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4878400Z ld.shared.s8 %rs174, [%r514+1664]; 2026-02-21T09:14:15.4878563Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4878621Z shl.b16 %rs175, %rs174, 4; 2026-02-21T09:14:15.4878794Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4878878Z ld.shared.s8 %rs176, [%r515+1792]; 2026-02-21T09:14:15.4879038Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4879104Z shl.b16 %rs177, %rs176, 4; 2026-02-21T09:14:15.4879263Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4879346Z ld.shared.s8 %rs178, [%r516+1920]; 2026-02-21T09:14:15.4879512Z .loc 1 64 28 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:64:28 2026-02-21T09:14:15.4879570Z shl.b16 %rs179, %rs178, 4; 2026-02-21T09:14:15.4879728Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4879794Z cvt.s16.s8 %rs180, %rs149; 2026-02-21T09:14:15.4879851Z shr.s16 %rs181, %rs180, 4; 2026-02-21T09:14:15.4879907Z cvt.s16.s8 %rs182, %rs151; 2026-02-21T09:14:15.4879965Z shr.s16 %rs183, %rs182, 4; 2026-02-21T09:14:15.4880030Z shr.s16 %rs184, %rs148, 4; 2026-02-21T09:14:15.4880088Z shr.s16 %rs185, %rs150, 4; 2026-02-21T09:14:15.4880247Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4880318Z cvt.rn.f32.s16 %r517, %rs185; 2026-02-21T09:14:15.4880378Z cvt.rn.f32.s16 %r518, %rs184; 2026-02-21T09:14:15.4880438Z cvt.rn.f32.s16 %r519, %rs183; 2026-02-21T09:14:15.4880497Z cvt.rn.f32.s16 %r520, %rs181; 2026-02-21T09:14:15.4880663Z .loc 1 66 25 // 
cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4880722Z cvt.s16.s8 %rs186, %rs153; 2026-02-21T09:14:15.4880778Z shr.s16 %rs187, %rs186, 4; 2026-02-21T09:14:15.4880843Z cvt.s16.s8 %rs188, %rs155; 2026-02-21T09:14:15.4880900Z shr.s16 %rs189, %rs188, 4; 2026-02-21T09:14:15.4880957Z shr.s16 %rs190, %rs152, 4; 2026-02-21T09:14:15.4881021Z shr.s16 %rs191, %rs154, 4; 2026-02-21T09:14:15.4881184Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4881244Z cvt.rn.f32.s16 %r521, %rs191; 2026-02-21T09:14:15.4881302Z cvt.rn.f32.s16 %r522, %rs190; 2026-02-21T09:14:15.4881369Z cvt.rn.f32.s16 %r523, %rs189; 2026-02-21T09:14:15.4881426Z cvt.rn.f32.s16 %r524, %rs187; 2026-02-21T09:14:15.4881619Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4881689Z cvt.s16.s8 %rs192, %rs157; 2026-02-21T09:14:15.4881747Z shr.s16 %rs193, %rs192, 4; 2026-02-21T09:14:15.4881804Z cvt.s16.s8 %rs194, %rs159; 2026-02-21T09:14:15.4881860Z shr.s16 %rs195, %rs194, 4; 2026-02-21T09:14:15.4881923Z shr.s16 %rs196, %rs156, 4; 2026-02-21T09:14:15.4881979Z shr.s16 %rs197, %rs158, 4; 2026-02-21T09:14:15.4882142Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4882237Z cvt.rn.f32.s16 %r525, %rs197; 2026-02-21T09:14:15.4882295Z cvt.rn.f32.s16 %r526, %rs196; 2026-02-21T09:14:15.4882354Z cvt.rn.f32.s16 %r527, %rs195; 2026-02-21T09:14:15.4882420Z cvt.rn.f32.s16 %r528, %rs193; 2026-02-21T09:14:15.4882585Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4882643Z cvt.s16.s8 %rs198, %rs161; 2026-02-21T09:14:15.4882700Z shr.s16 %rs199, %rs198, 4; 2026-02-21T09:14:15.4882765Z cvt.s16.s8 %rs200, %rs163; 2026-02-21T09:14:15.4882847Z shr.s16 %rs201, %rs200, 4; 2026-02-21T09:14:15.4882904Z shr.s16 %rs202, %rs160, 4; 2026-02-21T09:14:15.4882970Z shr.s16 %rs203, %rs162, 4; 2026-02-21T09:14:15.4883131Z 
.loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4883190Z cvt.rn.f32.s16 %r529, %rs203; 2026-02-21T09:14:15.4883257Z cvt.rn.f32.s16 %r530, %rs202; 2026-02-21T09:14:15.4883315Z cvt.rn.f32.s16 %r531, %rs201; 2026-02-21T09:14:15.4883373Z cvt.rn.f32.s16 %r532, %rs199; 2026-02-21T09:14:15.4883557Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4883625Z cvt.s16.s8 %rs204, %rs165; 2026-02-21T09:14:15.4883682Z shr.s16 %rs205, %rs204, 4; 2026-02-21T09:14:15.4883740Z cvt.s16.s8 %rs206, %rs167; 2026-02-21T09:14:15.4883807Z shr.s16 %rs207, %rs206, 4; 2026-02-21T09:14:15.4883863Z shr.s16 %rs208, %rs164, 4; 2026-02-21T09:14:15.4883947Z shr.s16 %rs209, %rs166, 4; 2026-02-21T09:14:15.4884107Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4884174Z cvt.rn.f32.s16 %r533, %rs209; 2026-02-21T09:14:15.4884232Z cvt.rn.f32.s16 %r534, %rs208; 2026-02-21T09:14:15.4884290Z cvt.rn.f32.s16 %r535, %rs207; 2026-02-21T09:14:15.4884357Z cvt.rn.f32.s16 %r536, %rs205; 2026-02-21T09:14:15.4884520Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4884581Z cvt.s16.s8 %rs210, %rs169; 2026-02-21T09:14:15.4884646Z shr.s16 %rs211, %rs210, 4; 2026-02-21T09:14:15.4884703Z cvt.s16.s8 %rs212, %rs171; 2026-02-21T09:14:15.4884759Z shr.s16 %rs213, %rs212, 4; 2026-02-21T09:14:15.4884816Z shr.s16 %rs214, %rs168, 4; 2026-02-21T09:14:15.4884880Z shr.s16 %rs215, %rs170, 4; 2026-02-21T09:14:15.4885042Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4885103Z cvt.rn.f32.s16 %r537, %rs215; 2026-02-21T09:14:15.4885171Z cvt.rn.f32.s16 %r538, %rs214; 2026-02-21T09:14:15.4885229Z cvt.rn.f32.s16 %r539, %rs213; 2026-02-21T09:14:15.4885287Z cvt.rn.f32.s16 %r540, %rs211; 2026-02-21T09:14:15.4885456Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 
2026-02-21T09:14:15.4885514Z cvt.s16.s8 %rs216, %rs173; 2026-02-21T09:14:15.4885571Z shr.s16 %rs217, %rs216, 4; 2026-02-21T09:14:15.4885629Z cvt.s16.s8 %rs218, %rs175; 2026-02-21T09:14:15.4885698Z shr.s16 %rs219, %rs218, 4; 2026-02-21T09:14:15.4885756Z shr.s16 %rs220, %rs172, 4; 2026-02-21T09:14:15.4885816Z shr.s16 %rs221, %rs174, 4; 2026-02-21T09:14:15.4885989Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4886048Z cvt.rn.f32.s16 %r541, %rs221; 2026-02-21T09:14:15.4886108Z cvt.rn.f32.s16 %r542, %rs220; 2026-02-21T09:14:15.4886169Z cvt.rn.f32.s16 %r543, %rs219; 2026-02-21T09:14:15.4886239Z cvt.rn.f32.s16 %r544, %rs217; 2026-02-21T09:14:15.4886404Z .loc 1 66 25 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:66:25 2026-02-21T09:14:15.4886464Z cvt.s16.s8 %rs222, %rs177; 2026-02-21T09:14:15.4886534Z shr.s16 %rs223, %rs222, 4; 2026-02-21T09:14:15.4886595Z cvt.s16.s8 %rs224, %rs179; 2026-02-21T09:14:15.4886652Z shr.s16 %rs225, %rs224, 4; 2026-02-21T09:14:15.4886718Z shr.s16 %rs226, %rs176, 4; 2026-02-21T09:14:15.4886775Z shr.s16 %rs227, %rs178, 4; 2026-02-21T09:14:15.4886969Z .loc 1 84 32 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:84:32 2026-02-21T09:14:15.4887029Z cvt.rn.f32.s16 %r545, %rs227; 2026-02-21T09:14:15.4887094Z cvt.rn.f32.s16 %r546, %rs226; 2026-02-21T09:14:15.4887152Z cvt.rn.f32.s16 %r547, %rs225; 2026-02-21T09:14:15.4887210Z cvt.rn.f32.s16 %r548, %rs223; 2026-02-21T09:14:15.4887313Z st.shared.v4.b32 [%r41], {%r520, %r518, %r519, %r517}; 2026-02-21T09:14:15.4887425Z st.shared.v4.b32 [%r42], {%r524, %r522, %r523, %r521}; 2026-02-21T09:14:15.4887511Z st.shared.v4.b32 [%r43], {%r528, %r526, %r527, %r525}; 2026-02-21T09:14:15.4887601Z st.shared.v4.b32 [%r44], {%r532, %r530, %r531, %r529}; 2026-02-21T09:14:15.4887684Z st.shared.v4.b32 [%r45], {%r536, %r534, %r535, %r533}; 2026-02-21T09:14:15.4887766Z st.shared.v4.b32 [%r46], {%r540, %r538, %r539, %r537}; 2026-02-21T09:14:15.4887847Z 
st.shared.v4.b32 [%r47], {%r544, %r542, %r543, %r541}; 2026-02-21T09:14:15.4887937Z st.shared.v4.b32 [%r48], {%r548, %r546, %r547, %r545}; 2026-02-21T09:14:15.4888155Z .loc 1 48 102 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:48:102 2026-02-21T09:14:15.4888218Z shl.b32 %r549, %r931, 3; 2026-02-21T09:14:15.4888288Z add.s32 %r550, %r81, %r549; 2026-02-21T09:14:15.4888349Z add.s32 %r928, %r550, 79904; 2026-02-21T09:14:15.4888402Z $L__tmp142: 2026-02-21T09:14:15.4888663Z .loc 2 291 36 // standard.py:291:36 @[ cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:91:40 ] 2026-02-21T09:14:15.4888723Z // begin inline asm 2026-02-21T09:14:15.4888797Z fence.proxy.async.shared::cta; 2026-02-21T09:14:15.4888853Z // end inline asm 2026-02-21T09:14:15.4888915Z bar.sync 0; 2026-02-21T09:14:15.4888973Z @%p65 bra $L__BB0_7; 2026-02-21T09:14:15.4889074Z // %bb.6: // in Loop: Header=BB0_5 Depth=2 2026-02-21T09:14:15.4889148Z elect.sync %r563|%p84, -1; 2026-02-21T09:14:15.4889206Z mov.b32 %r553, 136317200; 2026-02-21T09:14:15.4889266Z // begin inline asm 2026-02-21T09:14:15.4889421Z @%p84 tcgen05.mma.cta_group::1.kind::tf32 [ %r925 + 0 ], [ %r413 + 0 ], %rd79, %r553, %p83; 2026-02-21T09:14:15.4889484Z // end inline asm 2026-02-21T09:14:15.4889541Z // begin inline asm 2026-02-21T09:14:15.4889686Z @%p84 tcgen05.mma.cta_group::1.kind::tf32 [ %r925 + 0 ], [ %r413 + 8 ], %rd80, %r553, %p83; 2026-02-21T09:14:15.4889748Z // end inline asm 2026-02-21T09:14:15.4889806Z // begin inline asm 2026-02-21T09:14:15.4889953Z @%p84 tcgen05.mma.cta_group::1.kind::tf32 [ %r925 + 0 ], [ %r413 + 16 ], %rd81, %r553, %p83; 2026-02-21T09:14:15.4890017Z // end inline asm 2026-02-21T09:14:15.4890072Z // begin inline asm 2026-02-21T09:14:15.4890216Z @%p84 tcgen05.mma.cta_group::1.kind::tf32 [ %r925 + 0 ], [ %r413 + 24 ], %rd82, %r553, %p83; 2026-02-21T09:14:15.4890278Z // end inline asm 2026-02-21T09:14:15.4890338Z cvt.u64.u32 %rd98, %r928; 2026-02-21T09:14:15.4890394Z // begin inline asm 
2026-02-21T09:14:15.4890522Z @%p84 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd98]; 2026-02-21T09:14:15.4890583Z // end inline asm 2026-02-21T09:14:15.4890641Z bra.uni $L__BB0_7; 2026-02-21T09:14:15.4890694Z $L__tmp143: 2026-02-21T09:14:15.4890783Z $L__BB0_9: // %._crit_edge 2026-02-21T09:14:15.4890957Z .loc 1 28 111 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:28:111 2026-02-21T09:14:15.4891031Z cp.async.bulk.wait_group.read 0; 2026-02-21T09:14:15.4891087Z bar.sync 0; 2026-02-21T09:14:15.4891259Z .loc 1 28 4 // cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py:28:4 2026-02-21T09:14:15.4891313Z bar.sync 0; 2026-02-21T09:14:15.4891370Z // begin inline asm 2026-02-21T09:14:15.4891492Z @%p1 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r925, 256; 2026-02-21T09:14:15.4891583Z // end inline asm 2026-02-21T09:14:15.4891638Z ret; 2026-02-21T09:14:15.4891697Z $L__tmp144: 2026-02-21T09:14:15.4891753Z $L__func_end0: 2026-02-21T09:14:15.4891865Z // -- End function 2026-02-21T09:14:15.4891915Z } 2026-02-21T09:14:15.4892124Z .file 1 "/tmp/torchinductor_root/ed/cednyhuo3z6rsa4dta2modupk5juwkwkfcg3blg2cpgrdvgvc6lc.py" 2026-02-21T09:14:15.4892290Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:14:15.4892352Z .section .debug_abbrev 2026-02-21T09:14:15.4892412Z { 2026-02-21T09:14:15.4892498Z .b8 1 // Abbreviation Code 2026-02-21T09:14:15.4892613Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:14:15.4892701Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:14:15.4892781Z .b8 37 // DW_AT_producer 2026-02-21T09:14:15.4892859Z .b8 8 // DW_FORM_string 2026-02-21T09:14:15.4892932Z .b8 19 // DW_AT_language 2026-02-21T09:14:15.4893014Z .b8 5 // DW_FORM_data2 2026-02-21T09:14:15.4893090Z .b8 3 // DW_AT_name 2026-02-21T09:14:15.4893191Z .b8 8 // DW_FORM_string 2026-02-21T09:14:15.4893278Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:14:15.4893353Z .b8 6 // DW_FORM_data4 2026-02-21T09:14:15.4893452Z .b8 27 // 
DW_AT_comp_dir 2026-02-21T09:14:15.4893535Z .b8 8 // DW_FORM_string 2026-02-21T09:14:15.4893605Z .b8 0 // EOM(1) 2026-02-21T09:14:15.4893674Z .b8 0 // EOM(2) 2026-02-21T09:14:15.4893754Z .b8 2 // Abbreviation Code 2026-02-21T09:14:15.4893842Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:14:15.4893914Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:14:15.4893987Z .b8 3 // DW_AT_name 2026-02-21T09:14:15.4894070Z .b8 8 // DW_FORM_string 2026-02-21T09:14:15.4894147Z .b8 32 // DW_AT_inline 2026-02-21T09:14:15.4894221Z .b8 11 // DW_FORM_data1 2026-02-21T09:14:15.4894295Z .b8 0 // EOM(1) 2026-02-21T09:14:15.4894361Z .b8 0 // EOM(2) 2026-02-21T09:14:15.4894441Z .b8 3 // Abbreviation Code 2026-02-21T09:14:15.4894527Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:14:15.4894605Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:14:15.4894681Z .b8 17 // DW_AT_low_pc 2026-02-21T09:14:15.4894754Z .b8 1 // DW_FORM_addr 2026-02-21T09:14:15.4894841Z .b8 18 // DW_AT_high_pc 2026-02-21T09:14:15.4894914Z .b8 1 // DW_FORM_addr 2026-02-21T09:14:15.4895003Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:14:15.4895083Z .b8 19 // DW_FORM_ref4 2026-02-21T09:14:15.4895149Z .b8 0 // EOM(1) 2026-02-21T09:14:15.4895216Z .b8 0 // EOM(2) 2026-02-21T09:14:15.4895300Z .b8 4 // Abbreviation Code 2026-02-21T09:14:15.4895392Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:14:15.4895466Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:14:15.4895549Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:14:15.4895628Z .b8 19 // DW_FORM_ref4 2026-02-21T09:14:15.4895699Z .b8 17 // DW_AT_low_pc 2026-02-21T09:14:15.4895769Z .b8 1 // DW_FORM_addr 2026-02-21T09:14:15.4895874Z .b8 18 // DW_AT_high_pc 2026-02-21T09:14:15.4895945Z .b8 1 // DW_FORM_addr 2026-02-21T09:14:15.4896022Z .b8 88 // DW_AT_call_file 2026-02-21T09:14:15.4896103Z .b8 11 // DW_FORM_data1 2026-02-21T09:14:15.4896178Z .b8 89 // DW_AT_call_line 2026-02-21T09:14:15.4896252Z .b8 11 // DW_FORM_data1 2026-02-21T09:14:15.4896352Z .b8 87 // DW_AT_call_column 2026-02-21T09:14:15.4896432Z .b8 
11 // DW_FORM_data1 2026-02-21T09:14:15.4896499Z .b8 0 // EOM(1) 2026-02-21T09:14:15.4896566Z .b8 0 // EOM(2) 2026-02-21T09:14:15.4896641Z .b8 0 // EOM(3) 2026-02-21T09:14:15.4896694Z } 2026-02-21T09:14:15.4896754Z .section .debug_info 2026-02-21T09:14:15.4896812Z { 2026-02-21T09:14:15.4896895Z .b32 178 // Length of Unit 2026-02-21T09:14:15.4897004Z .b8 2 // DWARF version number 2026-02-21T09:14:15.4897059Z .b8 0 2026-02-21T09:14:15.4897184Z .b32 .debug_abbrev // Offset Into Abbrev. Section 2026-02-21T09:14:15.4897275Z .b8 8 // Address Size (in bytes) 2026-02-21T09:14:15.4897401Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:14:15.4897493Z .b8 116 // DW_AT_producer 2026-02-21T09:14:15.4897547Z .b8 114 2026-02-21T09:14:15.4897601Z .b8 105 2026-02-21T09:14:15.4897660Z .b8 116 2026-02-21T09:14:15.4897713Z .b8 111 2026-02-21T09:14:15.4897767Z .b8 110 2026-02-21T09:14:15.4897821Z .b8 0 2026-02-21T09:14:15.4897906Z .b8 2 // DW_AT_language 2026-02-21T09:14:15.4897959Z .b8 0 2026-02-21T09:14:15.4898035Z .b8 99 // DW_AT_name 2026-02-21T09:14:15.4898098Z .b8 101 2026-02-21T09:14:15.4898151Z .b8 100 2026-02-21T09:14:15.4898203Z .b8 110 2026-02-21T09:14:15.4898257Z .b8 121 2026-02-21T09:14:15.4898318Z .b8 104 2026-02-21T09:14:15.4898371Z .b8 117 2026-02-21T09:14:15.4898423Z .b8 111 2026-02-21T09:14:15.4898475Z .b8 51 2026-02-21T09:14:15.4898535Z .b8 122 2026-02-21T09:14:15.4898588Z .b8 54 2026-02-21T09:14:15.4898641Z .b8 114 2026-02-21T09:14:15.4898701Z .b8 115 2026-02-21T09:14:15.4898755Z .b8 97 2026-02-21T09:14:15.4898808Z .b8 52 2026-02-21T09:14:15.4898861Z .b8 100 2026-02-21T09:14:15.4898921Z .b8 116 2026-02-21T09:14:15.4898973Z .b8 97 2026-02-21T09:14:15.4899026Z .b8 50 2026-02-21T09:14:15.4899085Z .b8 109 2026-02-21T09:14:15.4899137Z .b8 111 2026-02-21T09:14:15.4899189Z .b8 100 2026-02-21T09:14:15.4899242Z .b8 117 2026-02-21T09:14:15.4899301Z .b8 112 2026-02-21T09:14:15.4899353Z .b8 107 2026-02-21T09:14:15.4899405Z .b8 53 2026-02-21T09:14:15.4899457Z .b8 
106 2026-02-21T09:14:15.4899516Z .b8 117 2026-02-21T09:14:15.4899571Z .b8 119 2026-02-21T09:14:15.4899624Z .b8 107 2026-02-21T09:14:15.4899681Z .b8 119 2026-02-21T09:14:15.4899735Z .b8 107 2026-02-21T09:14:15.4899786Z .b8 102 2026-02-21T09:14:15.4899839Z .b8 99 2026-02-21T09:14:15.4899900Z .b8 103 2026-02-21T09:14:15.4899951Z .b8 51 2026-02-21T09:14:15.4900004Z .b8 98 2026-02-21T09:14:15.4900064Z .b8 108 2026-02-21T09:14:15.4900117Z .b8 103 2026-02-21T09:14:15.4900170Z .b8 50 2026-02-21T09:14:15.4900221Z .b8 99 2026-02-21T09:14:15.4900284Z .b8 112 2026-02-21T09:14:15.4900338Z .b8 103 2026-02-21T09:14:15.4900391Z .b8 114 2026-02-21T09:14:15.4900453Z .b8 100 2026-02-21T09:14:15.4900506Z .b8 118 2026-02-21T09:14:15.4900561Z .b8 103 2026-02-21T09:14:15.4900613Z .b8 118 2026-02-21T09:14:15.4900674Z .b8 99 2026-02-21T09:14:15.4900727Z .b8 54 2026-02-21T09:14:15.4900780Z .b8 108 2026-02-21T09:14:15.4900833Z .b8 99 2026-02-21T09:14:15.4900896Z .b8 46 2026-02-21T09:14:15.4900948Z .b8 112 2026-02-21T09:14:15.4901003Z .b8 121 2026-02-21T09:14:15.4901065Z .b8 0 2026-02-21T09:14:15.4901187Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:14:15.4901270Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:14:15.4901325Z .b8 116 2026-02-21T09:14:15.4901396Z .b8 109 2026-02-21T09:14:15.4901448Z .b8 112 2026-02-21T09:14:15.4901501Z .b8 47 2026-02-21T09:14:15.4901587Z .b8 116 2026-02-21T09:14:15.4901641Z .b8 111 2026-02-21T09:14:15.4901693Z .b8 114 2026-02-21T09:14:15.4901746Z .b8 99 2026-02-21T09:14:15.4901808Z .b8 104 2026-02-21T09:14:15.4901890Z .b8 105 2026-02-21T09:14:15.4901943Z .b8 110 2026-02-21T09:14:15.4902003Z .b8 100 2026-02-21T09:14:15.4902055Z .b8 117 2026-02-21T09:14:15.4902107Z .b8 99 2026-02-21T09:14:15.4902159Z .b8 116 2026-02-21T09:14:15.4902220Z .b8 111 2026-02-21T09:14:15.4902273Z .b8 114 2026-02-21T09:14:15.4902326Z .b8 95 2026-02-21T09:14:15.4902379Z .b8 114 2026-02-21T09:14:15.4902439Z .b8 111 2026-02-21T09:14:15.4902492Z .b8 111 2026-02-21T09:14:15.4902544Z .b8 116 
2026-02-21T09:14:15.4902604Z .b8 47 2026-02-21T09:14:15.4902659Z .b8 101 2026-02-21T09:14:15.4902712Z .b8 100 2026-02-21T09:14:15.4902767Z .b8 0 2026-02-21T09:14:15.4902906Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:14:15.4902985Z .b8 95 // DW_AT_name 2026-02-21T09:14:15.4903038Z .b8 104 2026-02-21T09:14:15.4903100Z .b8 101 2026-02-21T09:14:15.4903155Z .b8 108 2026-02-21T09:14:15.4903208Z .b8 105 2026-02-21T09:14:15.4903290Z .b8 111 2026-02-21T09:14:15.4903356Z .b8 110 2026-02-21T09:14:15.4903409Z .b8 95 2026-02-21T09:14:15.4903462Z .b8 109 2026-02-21T09:14:15.4903522Z .b8 97 2026-02-21T09:14:15.4903574Z .b8 116 2026-02-21T09:14:15.4903626Z .b8 109 2026-02-21T09:14:15.4903679Z .b8 117 2026-02-21T09:14:15.4903739Z .b8 108 2026-02-21T09:14:15.4903791Z .b8 95 2026-02-21T09:14:15.4903843Z .b8 98 2026-02-21T09:14:15.4903902Z .b8 102 2026-02-21T09:14:15.4903954Z .b8 49 2026-02-21T09:14:15.4904007Z .b8 54 2026-02-21T09:14:15.4904060Z .b8 95 2026-02-21T09:14:15.4904120Z .b8 105 2026-02-21T09:14:15.4904172Z .b8 110 2026-02-21T09:14:15.4904224Z .b8 116 2026-02-21T09:14:15.4904276Z .b8 52 2026-02-21T09:14:15.4904337Z .b8 0 2026-02-21T09:14:15.4904414Z .b8 1 // DW_AT_inline 2026-02-21T09:14:15.4904514Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:14:15.4904610Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:14:15.4904703Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:14:15.4904797Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:14:15.4904915Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:14:15.4905007Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:14:15.4905090Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:14:15.4905183Z .b64 $L__tmp143 // DW_AT_high_pc 2026-02-21T09:14:15.4905261Z .b8 1 // DW_AT_call_file 2026-02-21T09:14:15.4905341Z .b8 91 // DW_AT_call_line 2026-02-21T09:14:15.4905424Z .b8 40 // DW_AT_call_column 2026-02-21T09:14:15.4905517Z .b8 0 // End Of Children Mark 2026-02-21T09:14:15.4905601Z .b8 
0 // End Of Children Mark 2026-02-21T09:14:15.4905656Z } 2026-02-21T09:14:15.4905733Z .section .debug_macinfo { } 2026-02-21T09:14:15.4905737Z 2026-02-21T09:14:15.4905815Z ================================================================ 2026-02-21T09:14:15.4905922Z please share the reproducer above with Triton project. 2026-02-21T09:14:17.2241371Z 2026-02-21T09:14:17.2243380Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 16.8 configs/s 2026-02-21T09:14:28.5856979Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 845/845 74.3 configs/s 2026-02-21T09:14:28.8254157Z [455s] Generation 7 complete: 2026-02-21T09:14:28.8256122Z error=25 2026-02-21T09:14:28.8256326Z timeout=2 2026-02-21T09:14:28.8261023Z ok=76 2026-02-21T09:14:28.8262567Z min=0.2366 2026-02-21T09:14:28.8262746Z mid=0.3287 2026-02-21T09:14:28.8262884Z max=26.8155 2026-02-21T09:14:28.8263067Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:14:28.8263342Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:14:28.8263615Z 'l2_groupings': [2], 2026-02-21T09:14:28.8267661Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:14:28.8272054Z 'loop_orders': [[0, 1]], 2026-02-21T09:14:28.8273745Z 'maxnreg': 128, 2026-02-21T09:14:28.8273946Z 'num_sm_multiplier': 2, 2026-02-21T09:14:28.8274116Z 'num_stages': 3, 2026-02-21T09:14:28.8274269Z 'num_warps': 4, 2026-02-21T09:14:28.8274426Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:14:28.8274632Z 'range_flattens': [None, None], 2026-02-21T09:14:28.8274819Z 'range_multi_buffers': [True, False], 2026-02-21T09:14:28.8275001Z 'range_num_stages': [3, 4], 2026-02-21T09:14:28.8275183Z 'range_unroll_factors': [0, 0], 2026-02-21T09:14:28.8275359Z 'range_warp_specializes': [True, None]} 2026-02-21T09:14:28.8290992Z [455s] Fitting surrogate: 846 points, 846 targets 2026-02-21T09:14:30.0900849Z [457s] Generation 8 starting: 81 neighbors, 4 active search path(s) 2026-02-21T09:15:04.0578354Z [491s] Timeout after 
30s compiling Config(block_sizes=[32, 128, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:15:04.2901195Z [491s] Timeout after 30s compiling Config(block_sizes=[32, 128, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:15:04.4605908Z [491s] Timeout after 30s compiling Config(block_sizes=[32, 128, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 0], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:15:04.5776054Z [491s] Timeout after 30s compiling Config(block_sizes=[32, 512, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:15:04.8481599Z [491s] Timeout after 30s compiling Config(block_sizes=[32, 128, 128], indexing=['tensor_descriptor', 'pointer', 
'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:15:04.9958659Z [492s] Timeout after 30s compiling Config(block_sizes=[32, 128, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=8, num_stages=6, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:15:05.1036436Z [492s] Timeout after 30s compiling Config(block_sizes=[32, 256, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:15:05.1049282Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.5 configs/s 2026-02-21T09:15:05.3274931Z [492s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 128, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[4], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]) 2026-02-21T09:15:05.3278238Z Tensor-likes are not close! 
2026-02-21T09:15:05.3282499Z 2026-02-21T09:15:05.3290414Z Mismatched elements: 33451473 / 33554432 (99.7%) 2026-02-21T09:15:05.3293939Z Greatest absolute difference: 1416.0 at index (551, 45) (up to 0.01 allowed) 2026-02-21T09:15:05.3299363Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:15:05.3306059Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:15:05.3307720Z 2026-02-21T09:15:05.4632220Z [492s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 128, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[8], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], num_stages=1, num_warps=32, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]) 2026-02-21T09:15:05.4634234Z Tensor-likes are not close! 2026-02-21T09:15:05.4634361Z 2026-02-21T09:15:05.4634458Z Mismatched elements: 33436465 / 33554432 (99.6%) 2026-02-21T09:15:05.4634745Z Greatest absolute difference: 1456.0 at index (2320, 6047) (up to 0.01 allowed) 2026-02-21T09:15:05.4635118Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:15:05.4635461Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:15:05.4635650Z 2026-02-21T09:15:08.6253247Z 2026-02-21T09:15:08.6257830Z 2026-02-21T09:15:08.6260246Z ================================================================ 2026-02-21T09:15:08.6260548Z Internal Triton PTX codegen error 2026-02-21T09:15:08.6260739Z `ptxas` stderr: 2026-02-21T09:15:08.6261217Z ptxas fatal : (C7602) Insufficient registers (40) to compile instruction at line 2007 in function _helion_matmul_bf16_int4. Try to compile with register target of 146 or higher. 
2026-02-21T09:15:08.6261999Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:15:08.6262160Z 2026-02-21T09:15:08.6262580Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpe1ctb5uf.ptx -o /tmp/tmpe1ctb5uf.ptx.o 2026-02-21T09:15:08.6263043Z 2026-02-21T09:15:08.6263047Z 2026-02-21T09:15:08.6263108Z // 2026-02-21T09:15:08.6263262Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:15:08.6263456Z // 2026-02-21T09:15:08.6263527Z 2026-02-21T09:15:08.6263594Z .version 8.7 2026-02-21T09:15:08.6263740Z .target sm_100a 2026-02-21T09:15:08.6263893Z .address_size 64 2026-02-21T09:15:08.6263982Z 2026-02-21T09:15:08.6264139Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:15:08.6264448Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:15:08.6264685Z // @_helion_matmul_bf16_int4 2026-02-21T09:15:08.6265189Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:15:08.6265472Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:15:08.6265807Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:15:08.6266099Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:15:08.6266373Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:15:08.6266662Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:15:08.6266973Z ) 2026-02-21T09:15:08.6267108Z .reqntid 1408 2026-02-21T09:15:08.6267243Z { 2026-02-21T09:15:08.6267382Z .reg .pred %p<353>; 2026-02-21T09:15:08.6267547Z .reg .b16 %rs<1163>; 2026-02-21T09:15:08.6267699Z .reg .b32 %r<2394>; 2026-02-21T09:15:08.6267853Z .reg .b64 %rd<447>; 2026-02-21T09:15:08.6267995Z $L__func_begin0: 2026-02-21T09:15:08.6268088Z 2026-02-21T09:15:08.6268144Z // %bb.0: 2026-02-21T09:15:08.6268394Z .loc 1 14 0 // 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:14 2026-02-21T09:15:08.6268702Z mov.u32 %r1, %tid.x; 2026-02-21T09:15:08.6268896Z shr.u32 %r2, %r1, 5; 2026-02-21T09:15:08.6269065Z shfl.sync.idx.b32 %r3, %r2, 0, 31, -1; 2026-02-21T09:15:08.6269264Z setp.lt.u32 %p1, %r3, 8; 2026-02-21T09:15:08.6269428Z @%p1 bra $L__BB0_32; 2026-02-21T09:15:08.6269587Z bra.uni $L__BB0_1; 2026-02-21T09:15:08.6269741Z $L__BB0_32: 2026-02-21T09:15:08.6269952Z setp.lt.u32 %p299, %r1, 32; 2026-02-21T09:15:08.6270130Z mov.b32 %r2270, global_smem; 2026-02-21T09:15:08.6281063Z // begin inline asm 2026-02-21T09:15:08.6281466Z @%p299 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r2270], 512; 2026-02-21T09:15:08.6281798Z // end inline asm 2026-02-21T09:15:08.6281951Z bar.sync 0, 256; 2026-02-21T09:15:08.6282107Z ld.shared.b32 %r2367, [global_smem]; 2026-02-21T09:15:08.6282295Z bar.sync 0, 256; 2026-02-21T09:15:08.6282435Z // begin inline asm 2026-02-21T09:15:08.6282656Z @%p299 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:15:08.6282918Z // end inline asm 2026-02-21T09:15:08.6283221Z .loc 1 19 46 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:46 2026-02-21T09:15:08.6283528Z mov.u32 %r186, %ctaid.x; 2026-02-21T09:15:08.6283820Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6284141Z setp.eq.b32 %p330, %r1, 0; 2026-02-21T09:15:08.6284317Z add.s32 %r2271, %r2270, 368688; 2026-02-21T09:15:08.6284492Z // begin inline asm 2026-02-21T09:15:08.6284664Z @%p330 mbarrier.init.shared::cta.b64 [%r2271], 1; 2026-02-21T09:15:08.6284865Z // end inline asm 2026-02-21T09:15:08.6284997Z bar.sync 0, 256; 2026-02-21T09:15:08.6285144Z add.s32 %r2272, %r2270, 368696; 2026-02-21T09:15:08.6285302Z // begin inline asm 2026-02-21T09:15:08.6285473Z @%p330 mbarrier.init.shared::cta.b64 [%r2272], 1; 2026-02-21T09:15:08.6285670Z // end inline asm 2026-02-21T09:15:08.6285812Z add.s32 %r2273, %r2270, 368704; 
2026-02-21T09:15:08.6285978Z // begin inline asm 2026-02-21T09:15:08.6286143Z @%p330 mbarrier.init.shared::cta.b64 [%r2273], 1; 2026-02-21T09:15:08.6286340Z // end inline asm 2026-02-21T09:15:08.6286473Z bar.sync 0, 256; 2026-02-21T09:15:08.6286631Z add.s32 %r2274, %r2270, 368712; 2026-02-21T09:15:08.6286791Z // begin inline asm 2026-02-21T09:15:08.6286965Z @%p330 mbarrier.init.shared::cta.b64 [%r2274], 1; 2026-02-21T09:15:08.6287164Z // end inline asm 2026-02-21T09:15:08.6287297Z $L__tmp0: 2026-02-21T09:15:08.6287602Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6287939Z add.s32 %r2296, %r2367, 64; 2026-02-21T09:15:08.6288110Z add.s32 %r2297, %r2367, 128; 2026-02-21T09:15:08.6288267Z add.s32 %r2298, %r2367, 192; 2026-02-21T09:15:08.6289236Z [495s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:15:08.6290523Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 128, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'first'], loop_orders=[[0, 1]], num_sm_multiplier=16, num_stages=5, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[2, 1], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:15:08.6291699Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:15:08.6291941Z `ptxas` stderr: 2026-02-21T09:15:08.6292381Z ptxas fatal : (C7602) Insufficient registers (40) to compile instruction at line 2007 in function _helion_matmul_bf16_int4. Try to compile with register target of 146 or higher. 
2026-02-21T09:15:08.6292862Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:15:08.6293016Z 2026-02-21T09:15:08.6293398Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpe1ctb5uf.ptx -o /tmp/tmpe1ctb5uf.ptx.o 2026-02-21T09:15:08.6293861Z 2026-02-21T09:15:08.6294000Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:15:08.6294234Z bar.sync 0, 256; 2026-02-21T09:15:08.6294382Z // begin inline asm 2026-02-21T09:15:08.6294559Z @%p330 mbarrier.arrive.shared::cta.b64 _, [%r2273]; 2026-02-21T09:15:08.6294807Z // end inline asm 2026-02-21T09:15:08.6294949Z bar.sync 0, 256; 2026-02-21T09:15:08.6295089Z // begin inline asm 2026-02-21T09:15:08.6295259Z @%p330 mbarrier.arrive.shared::cta.b64 _, [%r2274]; 2026-02-21T09:15:08.6295461Z // end inline asm 2026-02-21T09:15:08.6295597Z $L__tmp1: 2026-02-21T09:15:08.6295845Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6296143Z bar.sync 0, 256; 2026-02-21T09:15:08.6296283Z add.s32 %r2277, %r2270, 368720; 2026-02-21T09:15:08.6296452Z // begin inline asm 2026-02-21T09:15:08.6296613Z @%p330 mbarrier.init.shared::cta.b64 [%r2277], 1; 2026-02-21T09:15:08.6296810Z // end inline asm 2026-02-21T09:15:08.6296942Z bar.sync 0, 256; 2026-02-21T09:15:08.6297085Z add.s32 %r2278, %r2270, 368728; 2026-02-21T09:15:08.6297243Z // begin inline asm 2026-02-21T09:15:08.6297412Z @%p330 mbarrier.init.shared::cta.b64 [%r2278], 1; 2026-02-21T09:15:08.6297606Z // end inline asm 2026-02-21T09:15:08.6297742Z add.s32 %r2279, %r2270, 368736; 2026-02-21T09:15:08.6297906Z // begin inline asm 2026-02-21T09:15:08.6298066Z @%p330 mbarrier.init.shared::cta.b64 [%r2279], 1; 2026-02-21T09:15:08.6298260Z // end inline asm 2026-02-21T09:15:08.6298389Z bar.sync 0, 256; 2026-02-21T09:15:08.6298532Z add.s32 %r2280, %r2270, 368744; 2026-02-21T09:15:08.6298687Z // begin inline asm 
2026-02-21T09:15:08.6298856Z @%p330 mbarrier.init.shared::cta.b64 [%r2280], 1; 2026-02-21T09:15:08.6299050Z // end inline asm 2026-02-21T09:15:08.6299185Z add.s32 %r2281, %r2270, 368752; 2026-02-21T09:15:08.6299353Z // begin inline asm 2026-02-21T09:15:08.6299519Z @%p330 mbarrier.init.shared::cta.b64 [%r2281], 1; 2026-02-21T09:15:08.6299714Z // end inline asm 2026-02-21T09:15:08.6299845Z bar.sync 0, 256; 2026-02-21T09:15:08.6299992Z add.s32 %r2282, %r2270, 368760; 2026-02-21T09:15:08.6300153Z // begin inline asm 2026-02-21T09:15:08.6300328Z @%p330 mbarrier.init.shared::cta.b64 [%r2282], 1; 2026-02-21T09:15:08.6300529Z // end inline asm 2026-02-21T09:15:08.6300671Z add.s32 %r2283, %r2270, 368768; 2026-02-21T09:15:08.6300840Z // begin inline asm 2026-02-21T09:15:08.6301005Z @%p330 mbarrier.init.shared::cta.b64 [%r2283], 1; 2026-02-21T09:15:08.6301203Z // end inline asm 2026-02-21T09:15:08.6301339Z bar.sync 0, 256; 2026-02-21T09:15:08.6301488Z add.s32 %r2284, %r2270, 368776; 2026-02-21T09:15:08.6301694Z // begin inline asm 2026-02-21T09:15:08.6301872Z @%p330 mbarrier.init.shared::cta.b64 [%r2284], 1; 2026-02-21T09:15:08.6302076Z // end inline asm 2026-02-21T09:15:08.6302255Z add.s32 %r2285, %r2270, 368784; 2026-02-21T09:15:08.6302433Z // begin inline asm 2026-02-21T09:15:08.6302610Z @%p330 mbarrier.init.shared::cta.b64 [%r2285], 1; 2026-02-21T09:15:08.6302818Z // end inline asm 2026-02-21T09:15:08.6302963Z bar.sync 0, 256; 2026-02-21T09:15:08.6303118Z add.s32 %r2286, %r2270, 368792; 2026-02-21T09:15:08.6303287Z // begin inline asm 2026-02-21T09:15:08.6303471Z @%p330 mbarrier.init.shared::cta.b64 [%r2286], 1; 2026-02-21T09:15:08.6303669Z // end inline asm 2026-02-21T09:15:08.6303849Z add.s32 %r2287, %r2270, 368800; 2026-02-21T09:15:08.6304020Z // begin inline asm 2026-02-21T09:15:08.6304189Z @%p330 mbarrier.init.shared::cta.b64 [%r2287], 1; 2026-02-21T09:15:08.6304388Z // end inline asm 2026-02-21T09:15:08.6304527Z bar.sync 0, 256; 2026-02-21T09:15:08.6304675Z add.s32 
%r2288, %r2270, 368808; 2026-02-21T09:15:08.6304840Z // begin inline asm 2026-02-21T09:15:08.6305016Z @%p330 mbarrier.init.shared::cta.b64 [%r2288], 1; 2026-02-21T09:15:08.6305210Z // end inline asm 2026-02-21T09:15:08.6305353Z $L__tmp2: 2026-02-21T09:15:08.6305696Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6306055Z add.s32 %r2289, %r2270, 368816; 2026-02-21T09:15:08.6306227Z // begin inline asm 2026-02-21T09:15:08.6306396Z @%p330 mbarrier.init.shared::cta.b64 [%r2289], 1; 2026-02-21T09:15:08.6306599Z // end inline asm 2026-02-21T09:15:08.6306771Z add.s32 %r2290, %r2270, 368832; 2026-02-21T09:15:08.6306947Z // begin inline asm 2026-02-21T09:15:08.6307118Z @%p330 mbarrier.init.shared::cta.b64 [%r2290], 1; 2026-02-21T09:15:08.6307323Z // end inline asm 2026-02-21T09:15:08.6307478Z add.s32 %r2291, %r2270, 368848; 2026-02-21T09:15:08.6307652Z // begin inline asm 2026-02-21T09:15:08.6307825Z @%p330 mbarrier.init.shared::cta.b64 [%r2291], 1; 2026-02-21T09:15:08.6308013Z // end inline asm 2026-02-21T09:15:08.6308161Z add.s32 %r2292, %r2270, 368864; 2026-02-21T09:15:08.6308327Z // begin inline asm 2026-02-21T09:15:08.6308501Z @%p330 mbarrier.init.shared::cta.b64 [%r2292], 1; 2026-02-21T09:15:08.6308693Z // end inline asm 2026-02-21T09:15:08.6308841Z add.s32 %r2293, %r2270, 368880; 2026-02-21T09:15:08.6309007Z // begin inline asm 2026-02-21T09:15:08.6309173Z @%p330 mbarrier.init.shared::cta.b64 [%r2293], 1; 2026-02-21T09:15:08.6309369Z // end inline asm 2026-02-21T09:15:08.6309509Z add.s32 %r2294, %r2270, 368896; 2026-02-21T09:15:08.6309679Z // begin inline asm 2026-02-21T09:15:08.6309845Z @%p330 mbarrier.init.shared::cta.b64 [%r2294], 1; 2026-02-21T09:15:08.6310041Z // end inline asm 2026-02-21T09:15:08.6310175Z $L__tmp3: 2026-02-21T09:15:08.6310431Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6310792Z st.shared.v4.b32 
[global_smem+368640], {0, 0, 33686018, 33686018}; 2026-02-21T09:15:08.6311092Z st.shared.v4.b32 [global_smem+368656], {50529027, 50529027, 67372036, 67372036}; 2026-02-21T09:15:08.6311373Z st.shared.b32 [global_smem+368672], 100730117; 2026-02-21T09:15:08.6311621Z st.shared.b32 [global_smem], %r2367; 2026-02-21T09:15:08.6311816Z st.shared.b32 [global_smem+8], %r2296; 2026-02-21T09:15:08.6312003Z st.shared.b32 [global_smem+16], %r2297; 2026-02-21T09:15:08.6312199Z st.shared.b32 [global_smem+24], %r2298; 2026-02-21T09:15:08.6312384Z st.shared.b32 [global_smem+32], %r2367; 2026-02-21T09:15:08.6312565Z barrier.sync 1; 2026-02-21T09:15:08.6312717Z barrier.sync 1; 2026-02-21T09:15:08.6312862Z setp.gt.u32 %p325, %r186, 8191; 2026-02-21T09:15:08.6313031Z mov.b32 %r2392, 0; 2026-02-21T09:15:08.6313173Z mov.b32 %r2393, %r2392; 2026-02-21T09:15:08.6313334Z @%p325 bra $L__BB0_35; 2026-02-21T09:15:08.6313508Z // %bb.33: // %.lr.ph1000 2026-02-21T09:15:08.6313818Z .loc 1 0 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:0:112 2026-02-21T09:15:08.6314109Z shl.b32 %r2300, %r1, 7; 2026-02-21T09:15:08.6314323Z and.b32 %r2301, %r2300, 16256; 2026-02-21T09:15:08.6314490Z shr.u32 %r2302, %r1, 1; 2026-02-21T09:15:08.6314641Z and.b32 %r2303, %r2302, 64; 2026-02-21T09:15:08.6314807Z add.s32 %r2305, %r2270, %r2301; 2026-02-21T09:15:08.6314966Z add.s32 %r2306, %r2305, %r2303; 2026-02-21T09:15:08.6315133Z add.s32 %r187, %r2306, 294912; 2026-02-21T09:15:08.6315409Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6315750Z add.s32 %r2388, %r186, -2368; 2026-02-21T09:15:08.6315909Z mov.b32 %r2389, 0; 2026-02-21T09:15:08.6316055Z mov.b32 %r2393, %r2389; 2026-02-21T09:15:08.6316208Z mov.b32 %r2392, %r2389; 2026-02-21T09:15:08.6316400Z $L__BB0_34: // =>This Inner Loop Header: Depth=1 2026-02-21T09:15:08.6316619Z $L__tmp4: 2026-02-21T09:15:08.6316907Z .loc 2 291 36 // standard.py:291:36 @[ 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6317248Z shl.b32 %r2330, %r2392, 5; 2026-02-21T09:15:08.6317403Z add.s32 %r2331, %r2330, %r2367; 2026-02-21T09:15:08.6317592Z $L__tmp5: 2026-02-21T09:15:08.6317831Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6318130Z shl.b32 %r2332, %r2392, 3; 2026-02-21T09:15:08.6318301Z add.s32 %r2334, %r2270, %r2332; 2026-02-21T09:15:08.6318462Z add.s32 %r2329, %r2334, 368704; 2026-02-21T09:15:08.6318668Z add.s32 %r2307, %r2334, 368688; 2026-02-21T09:15:08.6318827Z $L__tmp6: 2026-02-21T09:15:08.6319118Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6319444Z bar.sync 0, 256; 2026-02-21T09:15:08.6319602Z // begin inline asm 2026-02-21T09:15:08.6319748Z 2026-02-21T09:15:08.6319863Z { 2026-02-21T09:15:08.6319995Z .reg .pred complete; 2026-02-21T09:15:08.6320140Z waitLoop: 2026-02-21T09:15:08.6320337Z mbarrier.try_wait.parity.shared.b64 complete, [%r2307], %r2393; 2026-02-21T09:15:08.6320579Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6320741Z } 2026-02-21T09:15:08.6320809Z 2026-02-21T09:15:08.6320865Z // end inline asm 2026-02-21T09:15:08.6321014Z xor.b32 %r2389, %r2389, 1; 2026-02-21T09:15:08.6321164Z bar.sync 0, 256; 2026-02-21T09:15:08.6321307Z // begin inline asm 2026-02-21T09:15:08.6321447Z 2026-02-21T09:15:08.6321618Z { 2026-02-21T09:15:08.6321752Z .reg .pred complete; 2026-02-21T09:15:08.6321898Z waitLoop: 2026-02-21T09:15:08.6322091Z mbarrier.try_wait.parity.shared.b64 complete, [%r2289], %r2389; 2026-02-21T09:15:08.6322329Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6322490Z } 2026-02-21T09:15:08.6322554Z 2026-02-21T09:15:08.6322610Z // end inline asm 2026-02-21T09:15:08.6322771Z shfl.sync.idx.b32 %r2335, %r2, 0, 31, -1; 2026-02-21T09:15:08.6322957Z shl.b32 %r2336, %r2335, 21; 2026-02-21T09:15:08.6323129Z and.b32 %r2337, %r2336, 6291456; 
2026-02-21T09:15:08.6323306Z add.s32 %r2338, %r2331, %r2337; 2026-02-21T09:15:08.6323471Z shl.b32 %r2339, %r2335, 2; 2026-02-21T09:15:08.6323634Z and.b32 %r2340, %r2339, 16; 2026-02-21T09:15:08.6323792Z add.s32 %r2327, %r2338, %r2340; 2026-02-21T09:15:08.6323957Z // begin inline asm 2026-02-21T09:15:08.6324345Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r2311, %r2312, %r2313, %r2314, %r2315, %r2316, %r2317, %r2318, %r2319, %r2320, %r2321, %r2322, %r2323, %r2324, %r2325, %r2326}, [%r2327 + 0]; 2026-02-21T09:15:08.6324762Z // end inline asm 2026-02-21T09:15:08.6324897Z // begin inline asm 2026-02-21T09:15:08.6325060Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:15:08.6325229Z // end inline asm 2026-02-21T09:15:08.6325406Z st.shared.v4.b32 [%r187], {%r2311, %r2312, %r2313, %r2314}; 2026-02-21T09:15:08.6325667Z st.shared.v4.b32 [%r187+16], {%r2315, %r2316, %r2317, %r2318}; 2026-02-21T09:15:08.6325917Z st.shared.v4.b32 [%r187+32], {%r2319, %r2320, %r2321, %r2322}; 2026-02-21T09:15:08.6326167Z st.shared.v4.b32 [%r187+48], {%r2323, %r2324, %r2325, %r2326}; 2026-02-21T09:15:08.6326409Z bar.sync 0, 256; 2026-02-21T09:15:08.6326553Z // begin inline asm 2026-02-21T09:15:08.6326737Z @%p330 mbarrier.arrive.shared::cta.b64 _, [%r2290]; 2026-02-21T09:15:08.6326933Z // end inline asm 2026-02-21T09:15:08.6327075Z bar.sync 0, 256; 2026-02-21T09:15:08.6327209Z // begin inline asm 2026-02-21T09:15:08.6327388Z @%p330 mbarrier.arrive.shared::cta.b64 _, [%r2329]; 2026-02-21T09:15:08.6327582Z // end inline asm 2026-02-21T09:15:08.6327728Z add.s32 %r2341, %r2392, 1; 2026-02-21T09:15:08.6327927Z setp.eq.b32 %p328, %r2341, 2; 2026-02-21T09:15:08.6328111Z selp.b32 %r2392, 0, %r2341, %p328; 2026-02-21T09:15:08.6328287Z selp.b32 %r2342, 1, 0, %p328; 2026-02-21T09:15:08.6328459Z xor.b32 %r2393, %r2393, %r2342; 2026-02-21T09:15:08.6328622Z $L__tmp7: 2026-02-21T09:15:08.6328865Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6329164Z add.s32 
%r2388, %r2388, 2368; 2026-02-21T09:15:08.6329327Z setp.lt.u32 %p329, %r2388, 5824; 2026-02-21T09:15:08.6329504Z @%p329 bra $L__BB0_34; 2026-02-21T09:15:08.6329731Z $L__BB0_35: // %._crit_edge1001 2026-02-21T09:15:08.6329937Z barrier.sync 1; 2026-02-21T09:15:08.6330069Z $L__tmp8: 2026-02-21T09:15:08.6330359Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6330721Z bar.sync 0, 256; 2026-02-21T09:15:08.6330864Z // begin inline asm 2026-02-21T09:15:08.6331039Z @%p330 mbarrier.inval.shared::cta.b64 [%r2289]; 2026-02-21T09:15:08.6331230Z // end inline asm 2026-02-21T09:15:08.6331371Z // begin inline asm 2026-02-21T09:15:08.6331567Z @%p330 mbarrier.inval.shared::cta.b64 [%r2290]; 2026-02-21T09:15:08.6331764Z // end inline asm 2026-02-21T09:15:08.6331899Z // begin inline asm 2026-02-21T09:15:08.6332067Z @%p330 mbarrier.inval.shared::cta.b64 [%r2291]; 2026-02-21T09:15:08.6332258Z // end inline asm 2026-02-21T09:15:08.6332392Z // begin inline asm 2026-02-21T09:15:08.6332560Z @%p330 mbarrier.inval.shared::cta.b64 [%r2292]; 2026-02-21T09:15:08.6332744Z // end inline asm 2026-02-21T09:15:08.6332882Z // begin inline asm 2026-02-21T09:15:08.6333041Z @%p330 mbarrier.inval.shared::cta.b64 [%r2293]; 2026-02-21T09:15:08.6333230Z // end inline asm 2026-02-21T09:15:08.6333363Z // begin inline asm 2026-02-21T09:15:08.6333530Z @%p330 mbarrier.inval.shared::cta.b64 [%r2294]; 2026-02-21T09:15:08.6333723Z // end inline asm 2026-02-21T09:15:08.6333852Z $L__tmp9: 2026-02-21T09:15:08.6334099Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6334386Z // begin inline asm 2026-02-21T09:15:08.6334552Z @%p330 mbarrier.inval.shared::cta.b64 [%r2287]; 2026-02-21T09:15:08.6334733Z // end inline asm 2026-02-21T09:15:08.6334872Z bar.sync 0, 256; 2026-02-21T09:15:08.6335002Z // begin inline asm 2026-02-21T09:15:08.6335166Z @%p330 mbarrier.inval.shared::cta.b64 [%r2288]; 
2026-02-21T09:15:08.6335354Z // end inline asm 2026-02-21T09:15:08.6335483Z // begin inline asm 2026-02-21T09:15:08.6335649Z @%p330 mbarrier.inval.shared::cta.b64 [%r2285]; 2026-02-21T09:15:08.6335830Z // end inline asm 2026-02-21T09:15:08.6335968Z bar.sync 0, 256; 2026-02-21T09:15:08.6336100Z // begin inline asm 2026-02-21T09:15:08.6336265Z @%p330 mbarrier.inval.shared::cta.b64 [%r2286]; 2026-02-21T09:15:08.6336444Z // end inline asm 2026-02-21T09:15:08.6336584Z // begin inline asm 2026-02-21T09:15:08.6336745Z @%p330 mbarrier.inval.shared::cta.b64 [%r2283]; 2026-02-21T09:15:08.6336933Z // end inline asm 2026-02-21T09:15:08.6337072Z bar.sync 0, 256; 2026-02-21T09:15:08.6337205Z // begin inline asm 2026-02-21T09:15:08.6337375Z @%p330 mbarrier.inval.shared::cta.b64 [%r2284]; 2026-02-21T09:15:08.6337560Z // end inline asm 2026-02-21T09:15:08.6337700Z // begin inline asm 2026-02-21T09:15:08.6337857Z @%p330 mbarrier.inval.shared::cta.b64 [%r2281]; 2026-02-21T09:15:08.6338044Z // end inline asm 2026-02-21T09:15:08.6338203Z bar.sync 0, 256; 2026-02-21T09:15:08.6338339Z // begin inline asm 2026-02-21T09:15:08.6338505Z @%p330 mbarrier.inval.shared::cta.b64 [%r2282]; 2026-02-21T09:15:08.6338686Z // end inline asm 2026-02-21T09:15:08.6338824Z // begin inline asm 2026-02-21T09:15:08.6338983Z @%p330 mbarrier.inval.shared::cta.b64 [%r2279]; 2026-02-21T09:15:08.6339171Z // end inline asm 2026-02-21T09:15:08.6339300Z bar.sync 0, 256; 2026-02-21T09:15:08.6339437Z // begin inline asm 2026-02-21T09:15:08.6339629Z @%p330 mbarrier.inval.shared::cta.b64 [%r2280]; 2026-02-21T09:15:08.6339820Z // end inline asm 2026-02-21T09:15:08.6339950Z // begin inline asm 2026-02-21T09:15:08.6340117Z @%p330 mbarrier.inval.shared::cta.b64 [%r2277]; 2026-02-21T09:15:08.6340307Z // end inline asm 2026-02-21T09:15:08.6340438Z bar.sync 0, 256; 2026-02-21T09:15:08.6340576Z // begin inline asm 2026-02-21T09:15:08.6340734Z @%p330 mbarrier.inval.shared::cta.b64 [%r2278]; 2026-02-21T09:15:08.6340925Z // end 
inline asm 2026-02-21T09:15:08.6341064Z shl.b32 %r2369, %r2392, 3; 2026-02-21T09:15:08.6341228Z add.s32 %r2361, %r2273, %r2369; 2026-02-21T09:15:08.6341418Z $L__tmp10: 2026-02-21T09:15:08.6341759Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6342093Z // begin inline asm 2026-02-21T09:15:08.6342226Z 2026-02-21T09:15:08.6342346Z { 2026-02-21T09:15:08.6342495Z .reg .pred complete; 2026-02-21T09:15:08.6342646Z waitLoop: 2026-02-21T09:15:08.6342831Z mbarrier.try_wait.parity.shared.b64 complete, [%r2361], %r2393; 2026-02-21T09:15:08.6343060Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6343210Z } 2026-02-21T09:15:08.6343280Z 2026-02-21T09:15:08.6343335Z // end inline asm 2026-02-21T09:15:08.6343462Z $L__tmp11: 2026-02-21T09:15:08.6343707Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6344001Z bar.sync 0, 256; 2026-02-21T09:15:08.6344139Z // begin inline asm 2026-02-21T09:15:08.6344313Z @%p330 mbarrier.inval.shared::cta.b64 [%r2273]; 2026-02-21T09:15:08.6344506Z // end inline asm 2026-02-21T09:15:08.6344647Z bar.sync 0, 256; 2026-02-21T09:15:08.6344784Z // begin inline asm 2026-02-21T09:15:08.6344959Z @%p330 mbarrier.inval.shared::cta.b64 [%r2274]; 2026-02-21T09:15:08.6345148Z // end inline asm 2026-02-21T09:15:08.6345294Z // begin inline asm 2026-02-21T09:15:08.6345460Z @%p330 mbarrier.inval.shared::cta.b64 [%r2271]; 2026-02-21T09:15:08.6345647Z // end inline asm 2026-02-21T09:15:08.6345785Z bar.sync 0, 256; 2026-02-21T09:15:08.6345916Z // begin inline asm 2026-02-21T09:15:08.6346084Z @%p330 mbarrier.inval.shared::cta.b64 [%r2272]; 2026-02-21T09:15:08.6346266Z // end inline asm 2026-02-21T09:15:08.6346519Z .loc 1 19 4 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:4 2026-02-21T09:15:08.6346810Z bar.sync 0, 256; 2026-02-21T09:15:08.6346945Z // begin inline asm 2026-02-21T09:15:08.6347155Z @%p299 
tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r2367, 512; 2026-02-21T09:15:08.6347382Z // end inline asm 2026-02-21T09:15:08.6347610Z st.shared.v4.b32 [global_smem+368640], {117901063, 117901063, 117901063, 117901063}; 2026-02-21T09:15:08.6347948Z st.shared.v4.b32 [global_smem+368656], {117901063, 117901063, 117901063, 117901063}; 2026-02-21T09:15:08.6348231Z st.shared.b32 [global_smem+368672], 117901063; 2026-02-21T09:15:08.6348424Z barrier.sync 1; 2026-02-21T09:15:08.6348589Z $L__BB0_36: // %common.ret 2026-02-21T09:15:08.6348897Z .loc 1 0 0 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:0 2026-02-21T09:15:08.6349169Z ret; 2026-02-21T09:15:08.6349339Z $L__BB0_1: // %.preheader.preheader 2026-02-21T09:15:08.6349597Z ld.param.b64 %rd35, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:15:08.6349865Z ld.param.b64 %rd34, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:15:08.6350140Z ld.param.b64 %rd33, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:15:08.6350365Z mov.b32 %r200, global_smem; 2026-02-21T09:15:08.6350542Z add.s32 %r201, %r200, %r3; 2026-02-21T09:15:08.6350705Z mov.u32 %r9, %ctaid.x; 2026-02-21T09:15:08.6350869Z add.s32 %r590, %r1, -256; 2026-02-21T09:15:08.6351032Z shr.u32 %r10, %r590, 5; 2026-02-21T09:15:08.6351197Z setp.gt.u32 %p23, %r9, 8191; 2026-02-21T09:15:08.6351370Z setp.lt.u32 %p24, %r9, 8192; 2026-02-21T09:15:08.6351586Z cvt.u16.u32 %rs4, %r9; 2026-02-21T09:15:08.6351748Z add.s16 %rs1, %rs4, 2368; 2026-02-21T09:15:08.6351917Z bra.uni $L__BB0_2; 2026-02-21T09:15:08.6352098Z $L__BB0_29: // %._crit_edge 2026-02-21T09:15:08.6352329Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6352657Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6352937Z barrier.sync 1; 2026-02-21T09:15:08.6353107Z $L__BB0_2: // %.preheader 2026-02-21T09:15:08.6353348Z // =>This Loop Header: Depth=1 2026-02-21T09:15:08.6353573Z // Child Loop BB0_28 Depth 2 2026-02-21T09:15:08.6353794Z // Child Loop 
BB0_24 Depth 2 2026-02-21T09:15:08.6354037Z // Child Loop BB0_20 Depth 2 2026-02-21T09:15:08.6354259Z // Child Loop BB0_7 Depth 2 2026-02-21T09:15:08.6354555Z .loc 1 14 0 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:14 2026-02-21T09:15:08.6354835Z barrier.sync 1; 2026-02-21T09:15:08.6354981Z ld.shared.b8 %r199, [%r201+368632]; 2026-02-21T09:15:08.6355157Z setp.gt.u32 %p2, %r199, 7; 2026-02-21T09:15:08.6355312Z @%p2 bra $L__BB0_4; 2026-02-21T09:15:08.6355477Z // %bb.3: // %.preheader 2026-02-21T09:15:08.6355694Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6355893Z $L_brx_0: .branchtargets 2026-02-21T09:15:08.6356041Z $L__BB0_5, 2026-02-21T09:15:08.6356164Z $L__BB0_17, 2026-02-21T09:15:08.6356291Z $L__BB0_18, 2026-02-21T09:15:08.6356410Z $L__BB0_22, 2026-02-21T09:15:08.6356548Z $L__BB0_26, 2026-02-21T09:15:08.6356668Z $L__BB0_30, 2026-02-21T09:15:08.6356794Z $L__BB0_31, 2026-02-21T09:15:08.6356914Z $L__BB0_36; 2026-02-21T09:15:08.6357053Z brx.idx %r199, $L_brx_0; 2026-02-21T09:15:08.6357246Z $L__BB0_5: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6357582Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6357888Z ld.shared.b32 %r4, [global_smem]; 2026-02-21T09:15:08.6358073Z ld.shared.b32 %r5, [global_smem+8]; 2026-02-21T09:15:08.6358265Z ld.shared.b32 %r6, [global_smem+16]; 2026-02-21T09:15:08.6358445Z ld.shared.b32 %r7, [global_smem+24]; 2026-02-21T09:15:08.6358629Z ld.shared.b32 %r8, [global_smem+32]; 2026-02-21T09:15:08.6358796Z barrier.sync 1; 2026-02-21T09:15:08.6359054Z .loc 1 31 45 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:31:45 2026-02-21T09:15:08.6359350Z bfe.u32 %r12, %r1, 5, 3; 2026-02-21T09:15:08.6359502Z or.b32 %r13, %r12, 8; 2026-02-21T09:15:08.6359658Z or.b32 %r14, %r12, 16; 2026-02-21T09:15:08.6359807Z or.b32 %r15, %r12, 24; 2026-02-21T09:15:08.6359963Z or.b32 %r16, %r12, 32; 2026-02-21T09:15:08.6360107Z or.b32 %r17, %r12, 40; 
2026-02-21T09:15:08.6360256Z or.b32 %r18, %r12, 48; 2026-02-21T09:15:08.6360397Z or.b32 %r19, %r12, 56; 2026-02-21T09:15:08.6360545Z or.b32 %r20, %r12, 64; 2026-02-21T09:15:08.6360686Z or.b32 %r21, %r12, 72; 2026-02-21T09:15:08.6360834Z or.b32 %r22, %r12, 80; 2026-02-21T09:15:08.6360979Z or.b32 %r23, %r12, 88; 2026-02-21T09:15:08.6361120Z or.b32 %r24, %r12, 96; 2026-02-21T09:15:08.6361271Z or.b32 %r25, %r12, 104; 2026-02-21T09:15:08.6361452Z or.b32 %r26, %r12, 112; 2026-02-21T09:15:08.6361651Z or.b32 %r27, %r12, 120; 2026-02-21T09:15:08.6361799Z shr.u32 %r28, %r1, 1; 2026-02-21T09:15:08.6362063Z .loc 1 33 45 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:33:45 2026-02-21T09:15:08.6362348Z shl.b32 %r29, %r1, 4; 2026-02-21T09:15:08.6362500Z and.b32 %r30, %r29, 16; 2026-02-21T09:15:08.6362766Z .loc 1 48 65 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:65 2026-02-21T09:15:08.6363078Z and.b32 %r31, %r1, 31; 2026-02-21T09:15:08.6363236Z shl.b32 %r32, %r31, 3; 2026-02-21T09:15:08.6363493Z .loc 1 54 55 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:55 2026-02-21T09:15:08.6363793Z shl.b32 %r591, %r28, 13; 2026-02-21T09:15:08.6363954Z and.b32 %r33, %r591, 1040384; 2026-02-21T09:15:08.6364128Z or.b32 %r35, %r591, 3145728; 2026-02-21T09:15:08.6364396Z .loc 1 26 33 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:26:33 2026-02-21T09:15:08.6364693Z shr.u32 %r592, %r9, 8; 2026-02-21T09:15:08.6364886Z and.b32 %r593, %r592, 8388592; 2026-02-21T09:15:08.6365157Z .loc 1 27 39 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:27:39 2026-02-21T09:15:08.6365455Z sub.s32 %r594, 32, %r593; 2026-02-21T09:15:08.6365740Z .loc 1 27 52 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:27:52 2026-02-21T09:15:08.6366039Z min.s32 %r595, %r594, 16; 2026-02-21T09:15:08.6366308Z .loc 1 28 45 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:28:45 2026-02-21T09:15:08.6366608Z and.b32 %r596, %r9, 4095; 
2026-02-21T09:15:08.6366872Z .loc 1 29 51 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:29:51 2026-02-21T09:15:08.6367159Z div.s32 %r597, %r596, %r595; 2026-02-21T09:15:08.6367428Z .loc 1 28 64 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:28:64 2026-02-21T09:15:08.6367714Z mul.lo.s32 %r598, %r597, %r595; 2026-02-21T09:15:08.6367888Z sub.s32 %r599, %r596, %r598; 2026-02-21T09:15:08.6368150Z .loc 1 28 30 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:28:30 2026-02-21T09:15:08.6368440Z add.s32 %r600, %r599, %r593; 2026-02-21T09:15:08.6368705Z .loc 1 30 27 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:30:27 2026-02-21T09:15:08.6368987Z shl.b32 %r601, %r600, 7; 2026-02-21T09:15:08.6369251Z .loc 1 31 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:31:32 2026-02-21T09:15:08.6369533Z or.b32 %r602, %r601, %r12; 2026-02-21T09:15:08.6369695Z or.b32 %r603, %r601, %r13; 2026-02-21T09:15:08.6369846Z or.b32 %r604, %r601, %r14; 2026-02-21T09:15:08.6370001Z or.b32 %r605, %r601, %r15; 2026-02-21T09:15:08.6370156Z or.b32 %r606, %r601, %r16; 2026-02-21T09:15:08.6370303Z or.b32 %r607, %r601, %r17; 2026-02-21T09:15:08.6370460Z or.b32 %r608, %r601, %r18; 2026-02-21T09:15:08.6370607Z or.b32 %r609, %r601, %r19; 2026-02-21T09:15:08.6370764Z or.b32 %r610, %r601, %r20; 2026-02-21T09:15:08.6370912Z or.b32 %r611, %r601, %r21; 2026-02-21T09:15:08.6371068Z or.b32 %r612, %r601, %r22; 2026-02-21T09:15:08.6371215Z or.b32 %r613, %r601, %r23; 2026-02-21T09:15:08.6371370Z or.b32 %r614, %r601, %r24; 2026-02-21T09:15:08.6371518Z or.b32 %r615, %r601, %r25; 2026-02-21T09:15:08.6371736Z or.b32 %r616, %r601, %r26; 2026-02-21T09:15:08.6371889Z or.b32 %r617, %r601, %r27; 2026-02-21T09:15:08.6372148Z .loc 1 32 27 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:32:27 2026-02-21T09:15:08.6372440Z shl.b32 %r618, %r597, 5; 2026-02-21T09:15:08.6372707Z .loc 1 33 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:33:32 
2026-02-21T09:15:08.6372999Z or.b32 %r619, %r618, %r30; 2026-02-21T09:15:08.6373258Z .loc 1 48 53 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:53 2026-02-21T09:15:08.6373578Z shl.b32 %r620, %r602, 10; 2026-02-21T09:15:08.6373739Z shl.b32 %r621, %r603, 10; 2026-02-21T09:15:08.6373889Z shl.b32 %r622, %r604, 10; 2026-02-21T09:15:08.6374045Z shl.b32 %r623, %r605, 10; 2026-02-21T09:15:08.6374191Z shl.b32 %r624, %r606, 10; 2026-02-21T09:15:08.6374344Z shl.b32 %r625, %r607, 10; 2026-02-21T09:15:08.6374489Z shl.b32 %r626, %r608, 10; 2026-02-21T09:15:08.6374699Z shl.b32 %r627, %r609, 10; 2026-02-21T09:15:08.6374845Z shl.b32 %r628, %r610, 10; 2026-02-21T09:15:08.6374997Z shl.b32 %r629, %r611, 10; 2026-02-21T09:15:08.6375144Z shl.b32 %r630, %r612, 10; 2026-02-21T09:15:08.6375299Z shl.b32 %r631, %r613, 10; 2026-02-21T09:15:08.6375452Z shl.b32 %r632, %r614, 10; 2026-02-21T09:15:08.6375599Z shl.b32 %r633, %r615, 10; 2026-02-21T09:15:08.6375752Z shl.b32 %r634, %r616, 10; 2026-02-21T09:15:08.6375898Z shl.b32 %r635, %r617, 10; 2026-02-21T09:15:08.6376167Z .loc 1 48 60 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:60 2026-02-21T09:15:08.6376484Z or.b32 %r636, %r620, %r32; 2026-02-21T09:15:08.6376650Z or.b32 %r637, %r621, %r32; 2026-02-21T09:15:08.6376803Z or.b32 %r638, %r622, %r32; 2026-02-21T09:15:08.6376960Z or.b32 %r639, %r623, %r32; 2026-02-21T09:15:08.6377115Z or.b32 %r640, %r624, %r32; 2026-02-21T09:15:08.6377260Z or.b32 %r641, %r625, %r32; 2026-02-21T09:15:08.6377439Z or.b32 %r642, %r626, %r32; 2026-02-21T09:15:08.6377591Z or.b32 %r643, %r627, %r32; 2026-02-21T09:15:08.6377744Z or.b32 %r644, %r628, %r32; 2026-02-21T09:15:08.6377893Z or.b32 %r645, %r629, %r32; 2026-02-21T09:15:08.6378047Z or.b32 %r646, %r630, %r32; 2026-02-21T09:15:08.6378195Z or.b32 %r647, %r631, %r32; 2026-02-21T09:15:08.6378351Z or.b32 %r648, %r632, %r32; 2026-02-21T09:15:08.6378499Z or.b32 %r649, %r633, %r32; 2026-02-21T09:15:08.6378653Z or.b32 %r650, %r634, %r32; 
2026-02-21T09:15:08.6378810Z or.b32 %r651, %r635, %r32; 2026-02-21T09:15:08.6379068Z .loc 1 48 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:32 2026-02-21T09:15:08.6379366Z mad.wide.s32 %rd70, %r636, 2, %rd33; 2026-02-21T09:15:08.6379548Z mad.wide.s32 %rd71, %r637, 2, %rd33; 2026-02-21T09:15:08.6379728Z mad.wide.s32 %rd72, %r638, 2, %rd33; 2026-02-21T09:15:08.6379899Z mad.wide.s32 %rd73, %r639, 2, %rd33; 2026-02-21T09:15:08.6380074Z mad.wide.s32 %rd74, %r640, 2, %rd33; 2026-02-21T09:15:08.6380248Z mad.wide.s32 %rd75, %r641, 2, %rd33; 2026-02-21T09:15:08.6380425Z mad.wide.s32 %rd76, %r642, 2, %rd33; 2026-02-21T09:15:08.6380600Z mad.wide.s32 %rd77, %r643, 2, %rd33; 2026-02-21T09:15:08.6380764Z mad.wide.s32 %rd78, %r644, 2, %rd33; 2026-02-21T09:15:08.6380937Z mad.wide.s32 %rd79, %r645, 2, %rd33; 2026-02-21T09:15:08.6381105Z mad.wide.s32 %rd80, %r646, 2, %rd33; 2026-02-21T09:15:08.6381277Z mad.wide.s32 %rd81, %r647, 2, %rd33; 2026-02-21T09:15:08.6381442Z mad.wide.s32 %rd82, %r648, 2, %rd33; 2026-02-21T09:15:08.6381655Z mad.wide.s32 %rd83, %r649, 2, %rd33; 2026-02-21T09:15:08.6381823Z mad.wide.s32 %rd84, %r650, 2, %rd33; 2026-02-21T09:15:08.6381997Z mad.wide.s32 %rd85, %r651, 2, %rd33; 2026-02-21T09:15:08.6382277Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6382558Z and.b32 %r652, %r29, 4080; 2026-02-21T09:15:08.6382723Z add.s32 %r2075, %r200, %r652; 2026-02-21T09:15:08.6382887Z selp.b32 %r454, 16, 0, %p24; 2026-02-21T09:15:08.6383055Z // begin inline asm 2026-02-21T09:15:08.6383260Z cp.async.cg.shared.global [ %r2075 + 0 ], [ %rd70 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6383497Z // end inline asm 2026-02-21T09:15:08.6383641Z add.s32 %r2077, %r2075, 4096; 2026-02-21T09:15:08.6383806Z // begin inline asm 2026-02-21T09:15:08.6384013Z cp.async.cg.shared.global [ %r2077 + 0 ], [ %rd71 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6384234Z // end inline asm 2026-02-21T09:15:08.6384383Z add.s32 %r2079, 
%r2075, 8192; 2026-02-21T09:15:08.6384566Z // begin inline asm 2026-02-21T09:15:08.6384768Z cp.async.cg.shared.global [ %r2079 + 0 ], [ %rd72 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6384989Z // end inline asm 2026-02-21T09:15:08.6385138Z or.b32 %r653, %r29, 12288; 2026-02-21T09:15:08.6385294Z add.s32 %r2081, %r200, %r653; 2026-02-21T09:15:08.6385457Z // begin inline asm 2026-02-21T09:15:08.6385663Z cp.async.cg.shared.global [ %r2081 + 0 ], [ %rd73 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6385886Z // end inline asm 2026-02-21T09:15:08.6386068Z add.s32 %r2083, %r2075, 16384; 2026-02-21T09:15:08.6386229Z // begin inline asm 2026-02-21T09:15:08.6386439Z cp.async.cg.shared.global [ %r2083 + 0 ], [ %rd74 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6386666Z // end inline asm 2026-02-21T09:15:08.6386822Z add.s32 %r2085, %r2075, 20480; 2026-02-21T09:15:08.6386979Z // begin inline asm 2026-02-21T09:15:08.6387179Z cp.async.cg.shared.global [ %r2085 + 0 ], [ %rd75 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6387410Z // end inline asm 2026-02-21T09:15:08.6387556Z add.s32 %r2087, %r2075, 24576; 2026-02-21T09:15:08.6387729Z // begin inline asm 2026-02-21T09:15:08.6387957Z cp.async.cg.shared.global [ %r2087 + 0 ], [ %rd76 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6388194Z // end inline asm 2026-02-21T09:15:08.6388341Z or.b32 %r654, %r29, 28672; 2026-02-21T09:15:08.6388513Z add.s32 %r2089, %r200, %r654; 2026-02-21T09:15:08.6388678Z // begin inline asm 2026-02-21T09:15:08.6388920Z cp.async.cg.shared.global [ %r2089 + 0 ], [ %rd77 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6389152Z // end inline asm 2026-02-21T09:15:08.6389306Z add.s32 %r2091, %r2075, 32768; 2026-02-21T09:15:08.6389478Z // begin inline asm 2026-02-21T09:15:08.6389679Z cp.async.cg.shared.global [ %r2091 + 0 ], [ %rd78 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6389913Z // end inline asm 2026-02-21T09:15:08.6390057Z add.s32 %r2093, %r2075, 36864; 2026-02-21T09:15:08.6390226Z // begin inline asm 2026-02-21T09:15:08.6390427Z 
cp.async.cg.shared.global [ %r2093 + 0 ], [ %rd79 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6390663Z // end inline asm 2026-02-21T09:15:08.6390806Z add.s32 %r2095, %r2075, 40960; 2026-02-21T09:15:08.6390976Z // begin inline asm 2026-02-21T09:15:08.6391184Z cp.async.cg.shared.global [ %r2095 + 0 ], [ %rd80 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6391412Z // end inline asm 2026-02-21T09:15:08.6391590Z or.b32 %r655, %r29, 45056; 2026-02-21T09:15:08.6391752Z add.s32 %r2097, %r200, %r655; 2026-02-21T09:15:08.6391922Z // begin inline asm 2026-02-21T09:15:08.6392123Z cp.async.cg.shared.global [ %r2097 + 0 ], [ %rd81 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6392359Z // end inline asm 2026-02-21T09:15:08.6392501Z add.s32 %r2099, %r2075, 49152; 2026-02-21T09:15:08.6392673Z // begin inline asm 2026-02-21T09:15:08.6392875Z cp.async.cg.shared.global [ %r2099 + 0 ], [ %rd82 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6393112Z // end inline asm 2026-02-21T09:15:08.6393262Z add.s32 %r2101, %r2075, 53248; 2026-02-21T09:15:08.6393424Z // begin inline asm 2026-02-21T09:15:08.6393637Z cp.async.cg.shared.global [ %r2101 + 0 ], [ %rd83 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6393868Z // end inline asm 2026-02-21T09:15:08.6394019Z add.s32 %r2103, %r2075, 57344; 2026-02-21T09:15:08.6394182Z // begin inline asm 2026-02-21T09:15:08.6394391Z cp.async.cg.shared.global [ %r2103 + 0 ], [ %rd84 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6394618Z // end inline asm 2026-02-21T09:15:08.6394772Z or.b32 %r656, %r29, 61440; 2026-02-21T09:15:08.6394945Z add.s32 %r2105, %r200, %r656; 2026-02-21T09:15:08.6395115Z // begin inline asm 2026-02-21T09:15:08.6395325Z cp.async.cg.shared.global [ %r2105 + 0 ], [ %rd85 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6395559Z // end inline asm 2026-02-21T09:15:08.6395713Z cp.async.commit_group; 2026-02-21T09:15:08.6395988Z .loc 1 54 62 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:62 2026-02-21T09:15:08.6396291Z add.s32 %r657, %r619, %r33; 
2026-02-21T09:15:08.6396558Z .loc 1 54 34 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:34 2026-02-21T09:15:08.6396881Z cvt.s64.s32 %rd138, %r657; 2026-02-21T09:15:08.6397051Z add.s64 %rd86, %rd34, %rd138; 2026-02-21T09:15:08.6397319Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6397610Z add.s32 %r485, %r2075, 352256; 2026-02-21T09:15:08.6397767Z // begin inline asm 2026-02-21T09:15:08.6397973Z cp.async.cg.shared.global [ %r485 + 0 ], [ %rd86 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6398221Z // end inline asm 2026-02-21T09:15:08.6398370Z cp.async.commit_group; 2026-02-21T09:15:08.6398635Z .loc 1 48 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:32 2026-02-21T09:15:08.6398921Z cvt.s64.s32 %rd139, %r620; 2026-02-21T09:15:08.6399086Z cvt.u64.u32 %rd140, %r32; 2026-02-21T09:15:08.6399245Z or.b64 %rd141, %rd139, %rd140; 2026-02-21T09:15:08.6399413Z shl.b64 %rd142, %rd141, 1; 2026-02-21T09:15:08.6399568Z add.s64 %rd143, %rd33, %rd142; 2026-02-21T09:15:08.6399735Z add.s64 %rd87, %rd143, 512; 2026-02-21T09:15:08.6399923Z cvt.s64.s32 %rd144, %r621; 2026-02-21T09:15:08.6400088Z or.b64 %rd145, %rd144, %rd140; 2026-02-21T09:15:08.6400252Z shl.b64 %rd146, %rd145, 1; 2026-02-21T09:15:08.6400406Z add.s64 %rd147, %rd33, %rd146; 2026-02-21T09:15:08.6400572Z add.s64 %rd88, %rd147, 512; 2026-02-21T09:15:08.6400726Z cvt.s64.s32 %rd148, %r622; 2026-02-21T09:15:08.6400917Z or.b64 %rd149, %rd148, %rd140; 2026-02-21T09:15:08.6401075Z shl.b64 %rd150, %rd149, 1; 2026-02-21T09:15:08.6401236Z add.s64 %rd151, %rd33, %rd150; 2026-02-21T09:15:08.6401392Z add.s64 %rd89, %rd151, 512; 2026-02-21T09:15:08.6401588Z cvt.s64.s32 %rd152, %r623; 2026-02-21T09:15:08.6401743Z or.b64 %rd153, %rd152, %rd140; 2026-02-21T09:15:08.6401905Z shl.b64 %rd154, %rd153, 1; 2026-02-21T09:15:08.6402065Z add.s64 %rd155, %rd33, %rd154; 2026-02-21T09:15:08.6402221Z add.s64 %rd90, %rd155, 512; 2026-02-21T09:15:08.6402383Z cvt.s64.s32 
%rd156, %r624; 2026-02-21T09:15:08.6402534Z or.b64 %rd157, %rd156, %rd140; 2026-02-21T09:15:08.6402700Z shl.b64 %rd158, %rd157, 1; 2026-02-21T09:15:08.6402853Z add.s64 %rd159, %rd33, %rd158; 2026-02-21T09:15:08.6403020Z add.s64 %rd91, %rd159, 512; 2026-02-21T09:15:08.6403173Z cvt.s64.s32 %rd160, %r625; 2026-02-21T09:15:08.6403334Z or.b64 %rd161, %rd160, %rd140; 2026-02-21T09:15:08.6403488Z shl.b64 %rd162, %rd161, 1; 2026-02-21T09:15:08.6403650Z add.s64 %rd163, %rd33, %rd162; 2026-02-21T09:15:08.6403816Z add.s64 %rd92, %rd163, 512; 2026-02-21T09:15:08.6403972Z cvt.s64.s32 %rd164, %r626; 2026-02-21T09:15:08.6404137Z or.b64 %rd165, %rd164, %rd140; 2026-02-21T09:15:08.6404295Z shl.b64 %rd166, %rd165, 1; 2026-02-21T09:15:08.6404460Z add.s64 %rd167, %rd33, %rd166; 2026-02-21T09:15:08.6404619Z add.s64 %rd93, %rd167, 512; 2026-02-21T09:15:08.6404780Z cvt.s64.s32 %rd168, %r627; 2026-02-21T09:15:08.6404933Z or.b64 %rd169, %rd168, %rd140; 2026-02-21T09:15:08.6405096Z shl.b64 %rd170, %rd169, 1; 2026-02-21T09:15:08.6405248Z add.s64 %rd171, %rd33, %rd170; 2026-02-21T09:15:08.6405415Z add.s64 %rd94, %rd171, 512; 2026-02-21T09:15:08.6405574Z cvt.s64.s32 %rd172, %r628; 2026-02-21T09:15:08.6405726Z or.b64 %rd173, %rd172, %rd140; 2026-02-21T09:15:08.6405889Z shl.b64 %rd174, %rd173, 1; 2026-02-21T09:15:08.6406039Z add.s64 %rd175, %rd33, %rd174; 2026-02-21T09:15:08.6406205Z add.s64 %rd95, %rd175, 512; 2026-02-21T09:15:08.6406359Z cvt.s64.s32 %rd176, %r629; 2026-02-21T09:15:08.6406522Z or.b64 %rd177, %rd176, %rd140; 2026-02-21T09:15:08.6406678Z shl.b64 %rd178, %rd177, 1; 2026-02-21T09:15:08.6406838Z add.s64 %rd179, %rd33, %rd178; 2026-02-21T09:15:08.6407003Z add.s64 %rd96, %rd179, 512; 2026-02-21T09:15:08.6407156Z cvt.s64.s32 %rd180, %r630; 2026-02-21T09:15:08.6407318Z or.b64 %rd181, %rd180, %rd140; 2026-02-21T09:15:08.6407474Z shl.b64 %rd182, %rd181, 1; 2026-02-21T09:15:08.6407634Z add.s64 %rd183, %rd33, %rd182; 2026-02-21T09:15:08.6407828Z add.s64 %rd97, %rd183, 512; 
2026-02-21T09:15:08.6407993Z cvt.s64.s32 %rd184, %r631; 2026-02-21T09:15:08.6408152Z or.b64 %rd185, %rd184, %rd140; 2026-02-21T09:15:08.6408321Z shl.b64 %rd186, %rd185, 1; 2026-02-21T09:15:08.6408478Z add.s64 %rd187, %rd33, %rd186; 2026-02-21T09:15:08.6408647Z add.s64 %rd98, %rd187, 512; 2026-02-21T09:15:08.6408814Z cvt.s64.s32 %rd188, %r632; 2026-02-21T09:15:08.6408969Z or.b64 %rd189, %rd188, %rd140; 2026-02-21T09:15:08.6409137Z shl.b64 %rd190, %rd189, 1; 2026-02-21T09:15:08.6409329Z add.s64 %rd191, %rd33, %rd190; 2026-02-21T09:15:08.6409498Z add.s64 %rd99, %rd191, 512; 2026-02-21T09:15:08.6409657Z cvt.s64.s32 %rd192, %r633; 2026-02-21T09:15:08.6409819Z or.b64 %rd193, %rd192, %rd140; 2026-02-21T09:15:08.6409979Z shl.b64 %rd194, %rd193, 1; 2026-02-21T09:15:08.6410142Z add.s64 %rd195, %rd33, %rd194; 2026-02-21T09:15:08.6410307Z add.s64 %rd100, %rd195, 512; 2026-02-21T09:15:08.6410477Z cvt.s64.s32 %rd196, %r634; 2026-02-21T09:15:08.6410640Z or.b64 %rd197, %rd196, %rd140; 2026-02-21T09:15:08.6410803Z shl.b64 %rd198, %rd197, 1; 2026-02-21T09:15:08.6410997Z add.s64 %rd199, %rd33, %rd198; 2026-02-21T09:15:08.6411159Z add.s64 %rd101, %rd199, 512; 2026-02-21T09:15:08.6411323Z cvt.s64.s32 %rd200, %r635; 2026-02-21T09:15:08.6411476Z or.b64 %rd201, %rd200, %rd140; 2026-02-21T09:15:08.6411664Z shl.b64 %rd202, %rd201, 1; 2026-02-21T09:15:08.6411818Z add.s64 %rd203, %rd33, %rd202; 2026-02-21T09:15:08.6412010Z add.s64 %rd102, %rd203, 512; 2026-02-21T09:15:08.6412288Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6412580Z add.s32 %r658, %r200, 65536; 2026-02-21T09:15:08.6412749Z add.s32 %r2109, %r658, %r652; 2026-02-21T09:15:08.6412914Z // begin inline asm 2026-02-21T09:15:08.6413132Z cp.async.cg.shared.global [ %r2109 + 0 ], [ %rd87 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6413360Z // end inline asm 2026-02-21T09:15:08.6413514Z add.s32 %r2111, %r2109, 4096; 2026-02-21T09:15:08.6413668Z // begin inline asm 
2026-02-21T09:15:08.6413874Z cp.async.cg.shared.global [ %r2111 + 0 ], [ %rd88 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6414100Z // end inline asm 2026-02-21T09:15:08.6414235Z add.s32 %r2113, %r2109, 8192; 2026-02-21T09:15:08.6414395Z // begin inline asm 2026-02-21T09:15:08.6414587Z cp.async.cg.shared.global [ %r2113 + 0 ], [ %rd89 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6414815Z // end inline asm 2026-02-21T09:15:08.6414952Z add.s32 %r2115, %r658, %r653; 2026-02-21T09:15:08.6415111Z // begin inline asm 2026-02-21T09:15:08.6415306Z cp.async.cg.shared.global [ %r2115 + 0 ], [ %rd90 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6415535Z // end inline asm 2026-02-21T09:15:08.6415679Z add.s32 %r2117, %r2109, 16384; 2026-02-21T09:15:08.6415834Z // begin inline asm 2026-02-21T09:15:08.6416032Z cp.async.cg.shared.global [ %r2117 + 0 ], [ %rd91 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6416253Z // end inline asm 2026-02-21T09:15:08.6416397Z add.s32 %r2119, %r2109, 20480; 2026-02-21T09:15:08.6416553Z // begin inline asm 2026-02-21T09:15:08.6416751Z cp.async.cg.shared.global [ %r2119 + 0 ], [ %rd92 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6416967Z // end inline asm 2026-02-21T09:15:08.6417108Z add.s32 %r2121, %r2109, 24576; 2026-02-21T09:15:08.6417262Z // begin inline asm 2026-02-21T09:15:08.6417460Z cp.async.cg.shared.global [ %r2121 + 0 ], [ %rd93 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6417684Z // end inline asm 2026-02-21T09:15:08.6417820Z add.s32 %r2123, %r658, %r654; 2026-02-21T09:15:08.6417979Z // begin inline asm 2026-02-21T09:15:08.6418169Z cp.async.cg.shared.global [ %r2123 + 0 ], [ %rd94 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6418392Z // end inline asm 2026-02-21T09:15:08.6418526Z add.s32 %r2125, %r2109, 32768; 2026-02-21T09:15:08.6418686Z // begin inline asm 2026-02-21T09:15:08.6418874Z cp.async.cg.shared.global [ %r2125 + 0 ], [ %rd95 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6419096Z // end inline asm 2026-02-21T09:15:08.6419239Z add.s32 %r2127, %r2109, 36864; 
2026-02-21T09:15:08.6419447Z // begin inline asm 2026-02-21T09:15:08.6419647Z cp.async.cg.shared.global [ %r2127 + 0 ], [ %rd96 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6419869Z // end inline asm 2026-02-21T09:15:08.6420013Z add.s32 %r2129, %r2109, 40960; 2026-02-21T09:15:08.6420168Z // begin inline asm 2026-02-21T09:15:08.6420368Z cp.async.cg.shared.global [ %r2129 + 0 ], [ %rd97 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6420589Z // end inline asm 2026-02-21T09:15:08.6420766Z add.s32 %r2131, %r658, %r655; 2026-02-21T09:15:08.6420927Z // begin inline asm 2026-02-21T09:15:08.6421121Z cp.async.cg.shared.global [ %r2131 + 0 ], [ %rd98 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6421349Z // end inline asm 2026-02-21T09:15:08.6421485Z add.s32 %r2133, %r2109, 49152; 2026-02-21T09:15:08.6421684Z // begin inline asm 2026-02-21T09:15:08.6421879Z cp.async.cg.shared.global [ %r2133 + 0 ], [ %rd99 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6422111Z // end inline asm 2026-02-21T09:15:08.6422250Z add.s32 %r2135, %r2109, 53248; 2026-02-21T09:15:08.6422418Z // begin inline asm 2026-02-21T09:15:08.6422642Z cp.async.cg.shared.global [ %r2135 + 0 ], [ %rd100 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6422870Z // end inline asm 2026-02-21T09:15:08.6423015Z add.s32 %r2137, %r2109, 57344; 2026-02-21T09:15:08.6423177Z // begin inline asm 2026-02-21T09:15:08.6423390Z cp.async.cg.shared.global [ %r2137 + 0 ], [ %rd101 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6423639Z // end inline asm 2026-02-21T09:15:08.6423786Z add.s32 %r2139, %r658, %r656; 2026-02-21T09:15:08.6423940Z // begin inline asm 2026-02-21T09:15:08.6424141Z cp.async.cg.shared.global [ %r2139 + 0 ], [ %rd102 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6424361Z // end inline asm 2026-02-21T09:15:08.6424511Z cp.async.commit_group; 2026-02-21T09:15:08.6424776Z .loc 1 54 34 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:34 2026-02-21T09:15:08.6425071Z cvt.s64.s32 %rd204, %r619; 2026-02-21T09:15:08.6425238Z cvt.u64.u32 %rd205, %r33; 
2026-02-21T09:15:08.6425397Z add.s64 %rd206, %rd204, %rd205; 2026-02-21T09:15:08.6425572Z add.s64 %rd207, %rd34, %rd206; 2026-02-21T09:15:08.6425737Z add.s64 %rd103, %rd207, 1048576; 2026-02-21T09:15:08.6426012Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6426299Z add.s32 %r519, %r2075, 356352; 2026-02-21T09:15:08.6426463Z // begin inline asm 2026-02-21T09:15:08.6426667Z cp.async.cg.shared.global [ %r519 + 0 ], [ %rd103 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6426884Z // end inline asm 2026-02-21T09:15:08.6427033Z cp.async.commit_group; 2026-02-21T09:15:08.6427287Z .loc 1 48 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:32 2026-02-21T09:15:08.6427585Z add.s64 %rd104, %rd143, 1024; 2026-02-21T09:15:08.6427741Z add.s64 %rd105, %rd147, 1024; 2026-02-21T09:15:08.6427902Z add.s64 %rd106, %rd151, 1024; 2026-02-21T09:15:08.6428061Z add.s64 %rd107, %rd155, 1024; 2026-02-21T09:15:08.6428216Z add.s64 %rd108, %rd159, 1024; 2026-02-21T09:15:08.6428377Z add.s64 %rd109, %rd163, 1024; 2026-02-21T09:15:08.6428529Z add.s64 %rd110, %rd167, 1024; 2026-02-21T09:15:08.6428689Z add.s64 %rd111, %rd171, 1024; 2026-02-21T09:15:08.6428839Z add.s64 %rd112, %rd175, 1024; 2026-02-21T09:15:08.6428998Z add.s64 %rd113, %rd179, 1024; 2026-02-21T09:15:08.6429149Z add.s64 %rd114, %rd183, 1024; 2026-02-21T09:15:08.6429309Z add.s64 %rd115, %rd187, 1024; 2026-02-21T09:15:08.6429460Z add.s64 %rd116, %rd191, 1024; 2026-02-21T09:15:08.6429619Z add.s64 %rd117, %rd195, 1024; 2026-02-21T09:15:08.6429776Z add.s64 %rd118, %rd199, 1024; 2026-02-21T09:15:08.6429935Z add.s64 %rd119, %rd203, 1024; 2026-02-21T09:15:08.6430214Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6430513Z add.s32 %r659, %r200, 131072; 2026-02-21T09:15:08.6430681Z add.s32 %r2143, %r659, %r652; 2026-02-21T09:15:08.6430870Z // begin inline asm 2026-02-21T09:15:08.6431086Z cp.async.cg.shared.global [ %r2143 + 0 ], [ 
%rd104 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6431326Z // end inline asm 2026-02-21T09:15:08.6431482Z add.s32 %r2145, %r2143, 4096; 2026-02-21T09:15:08.6431686Z // begin inline asm 2026-02-21T09:15:08.6431895Z cp.async.cg.shared.global [ %r2145 + 0 ], [ %rd105 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6432137Z // end inline asm 2026-02-21T09:15:08.6432281Z add.s32 %r2147, %r2143, 8192; 2026-02-21T09:15:08.6432476Z // begin inline asm 2026-02-21T09:15:08.6432687Z cp.async.cg.shared.global [ %r2147 + 0 ], [ %rd106 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6432936Z // end inline asm 2026-02-21T09:15:08.6433080Z add.s32 %r2149, %r659, %r653; 2026-02-21T09:15:08.6433248Z // begin inline asm 2026-02-21T09:15:08.6433458Z cp.async.cg.shared.global [ %r2149 + 0 ], [ %rd107 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6433687Z // end inline asm 2026-02-21T09:15:08.6433838Z add.s32 %r2151, %r2143, 16384; 2026-02-21T09:15:08.6434004Z // begin inline asm 2026-02-21T09:15:08.6434241Z cp.async.cg.shared.global [ %r2151 + 0 ], [ %rd108 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6434479Z // end inline asm 2026-02-21T09:15:08.6434628Z add.s32 %r2153, %r2143, 20480; 2026-02-21T09:15:08.6434790Z // begin inline asm 2026-02-21T09:15:08.6435004Z cp.async.cg.shared.global [ %r2153 + 0 ], [ %rd109 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6435242Z // end inline asm 2026-02-21T09:15:08.6435415Z add.s32 %r2155, %r2143, 24576; 2026-02-21T09:15:08.6435589Z // begin inline asm 2026-02-21T09:15:08.6435794Z cp.async.cg.shared.global [ %r2155 + 0 ], [ %rd110 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6436034Z // end inline asm 2026-02-21T09:15:08.6436175Z add.s32 %r2157, %r659, %r654; 2026-02-21T09:15:08.6436343Z // begin inline asm 2026-02-21T09:15:08.6436547Z cp.async.cg.shared.global [ %r2157 + 0 ], [ %rd111 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6436784Z // end inline asm 2026-02-21T09:15:08.6436929Z add.s32 %r2159, %r2143, 32768; 2026-02-21T09:15:08.6437099Z // begin inline asm 2026-02-21T09:15:08.6437309Z 
cp.async.cg.shared.global [ %r2159 + 0 ], [ %rd112 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6437541Z // end inline asm 2026-02-21T09:15:08.6437700Z add.s32 %r2161, %r2143, 36864; 2026-02-21T09:15:08.6437853Z // begin inline asm 2026-02-21T09:15:08.6438054Z cp.async.cg.shared.global [ %r2161 + 0 ], [ %rd113 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6438274Z // end inline asm 2026-02-21T09:15:08.6438421Z add.s32 %r2163, %r2143, 40960; 2026-02-21T09:15:08.6438574Z // begin inline asm 2026-02-21T09:15:08.6438774Z cp.async.cg.shared.global [ %r2163 + 0 ], [ %rd114 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6439003Z // end inline asm 2026-02-21T09:15:08.6439140Z add.s32 %r2165, %r659, %r655; 2026-02-21T09:15:08.6439298Z // begin inline asm 2026-02-21T09:15:08.6439492Z cp.async.cg.shared.global [ %r2165 + 0 ], [ %rd115 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6439724Z // end inline asm 2026-02-21T09:15:08.6439862Z add.s32 %r2167, %r2143, 49152; 2026-02-21T09:15:08.6440027Z // begin inline asm 2026-02-21T09:15:08.6440223Z cp.async.cg.shared.global [ %r2167 + 0 ], [ %rd116 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6440456Z // end inline asm 2026-02-21T09:15:08.6440601Z add.s32 %r2169, %r2143, 53248; 2026-02-21T09:15:08.6440759Z // begin inline asm 2026-02-21T09:15:08.6440965Z cp.async.cg.shared.global [ %r2169 + 0 ], [ %rd117 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6441195Z // end inline asm 2026-02-21T09:15:08.6441343Z add.s32 %r2171, %r2143, 57344; 2026-02-21T09:15:08.6441498Z // begin inline asm 2026-02-21T09:15:08.6441727Z cp.async.cg.shared.global [ %r2171 + 0 ], [ %rd118 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6441948Z // end inline asm 2026-02-21T09:15:08.6442094Z add.s32 %r2173, %r659, %r656; 2026-02-21T09:15:08.6442249Z // begin inline asm 2026-02-21T09:15:08.6442451Z cp.async.cg.shared.global [ %r2173 + 0 ], [ %rd119 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6442678Z // end inline asm 2026-02-21T09:15:08.6442848Z cp.async.commit_group; 2026-02-21T09:15:08.6443120Z .loc 1 54 
34 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:34 2026-02-21T09:15:08.6443411Z add.s64 %rd120, %rd207, 2097152; 2026-02-21T09:15:08.6443688Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6443975Z add.s32 %r553, %r2075, 360448; 2026-02-21T09:15:08.6444139Z // begin inline asm 2026-02-21T09:15:08.6444371Z cp.async.cg.shared.global [ %r553 + 0 ], [ %rd120 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6444592Z // end inline asm 2026-02-21T09:15:08.6444740Z cp.async.commit_group; 2026-02-21T09:15:08.6444996Z .loc 1 48 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:32 2026-02-21T09:15:08.6445293Z add.s64 %rd121, %rd143, 1536; 2026-02-21T09:15:08.6445454Z add.s64 %rd122, %rd147, 1536; 2026-02-21T09:15:08.6445618Z add.s64 %rd123, %rd151, 1536; 2026-02-21T09:15:08.6445773Z add.s64 %rd124, %rd155, 1536; 2026-02-21T09:15:08.6445934Z add.s64 %rd125, %rd159, 1536; 2026-02-21T09:15:08.6446120Z add.s64 %rd126, %rd163, 1536; 2026-02-21T09:15:08.6446277Z add.s64 %rd127, %rd167, 1536; 2026-02-21T09:15:08.6446438Z add.s64 %rd128, %rd171, 1536; 2026-02-21T09:15:08.6446592Z add.s64 %rd129, %rd175, 1536; 2026-02-21T09:15:08.6446753Z add.s64 %rd130, %rd179, 1536; 2026-02-21T09:15:08.6446906Z add.s64 %rd131, %rd183, 1536; 2026-02-21T09:15:08.6447092Z add.s64 %rd132, %rd187, 1536; 2026-02-21T09:15:08.6447250Z add.s64 %rd133, %rd191, 1536; 2026-02-21T09:15:08.6447412Z add.s64 %rd134, %rd195, 1536; 2026-02-21T09:15:08.6447566Z add.s64 %rd135, %rd199, 1536; 2026-02-21T09:15:08.6447726Z add.s64 %rd136, %rd203, 1536; 2026-02-21T09:15:08.6447998Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6448282Z add.s32 %r660, %r200, 196608; 2026-02-21T09:15:08.6448442Z add.s32 %r2177, %r660, %r652; 2026-02-21T09:15:08.6448596Z // begin inline asm 2026-02-21T09:15:08.6448804Z cp.async.cg.shared.global [ %r2177 + 0 ], [ %rd121 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6449030Z 
// end inline asm 2026-02-21T09:15:08.6449177Z add.s32 %r2179, %r2177, 4096; 2026-02-21T09:15:08.6449329Z // begin inline asm 2026-02-21T09:15:08.6449535Z cp.async.cg.shared.global [ %r2179 + 0 ], [ %rd122 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6449767Z // end inline asm 2026-02-21T09:15:08.6449905Z add.s32 %r2181, %r2177, 8192; 2026-02-21T09:15:08.6450063Z // begin inline asm 2026-02-21T09:15:08.6450264Z cp.async.cg.shared.global [ %r2181 + 0 ], [ %rd123 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6450502Z // end inline asm 2026-02-21T09:15:08.6450644Z add.s32 %r2183, %r660, %r653; 2026-02-21T09:15:08.6450810Z // begin inline asm 2026-02-21T09:15:08.6451005Z cp.async.cg.shared.global [ %r2183 + 0 ], [ %rd124 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6451242Z // end inline asm 2026-02-21T09:15:08.6451386Z add.s32 %r2185, %r2177, 16384; 2026-02-21T09:15:08.6451641Z // begin inline asm 2026-02-21T09:15:08.6451856Z cp.async.cg.shared.global [ %r2185 + 0 ], [ %rd125 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6452079Z // end inline asm 2026-02-21T09:15:08.6452231Z add.s32 %r2187, %r2177, 20480; 2026-02-21T09:15:08.6452388Z // begin inline asm 2026-02-21T09:15:08.6452592Z cp.async.cg.shared.global [ %r2187 + 0 ], [ %rd126 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6452817Z // end inline asm 2026-02-21T09:15:08.6452967Z add.s32 %r2189, %r2177, 24576; 2026-02-21T09:15:08.6453133Z // begin inline asm 2026-02-21T09:15:08.6453329Z cp.async.cg.shared.global [ %r2189 + 0 ], [ %rd127 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6453562Z // end inline asm 2026-02-21T09:15:08.6453703Z add.s32 %r2191, %r660, %r654; 2026-02-21T09:15:08.6453866Z // begin inline asm 2026-02-21T09:15:08.6454063Z cp.async.cg.shared.global [ %r2191 + 0 ], [ %rd128 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6454292Z // end inline asm 2026-02-21T09:15:08.6454462Z add.s32 %r2193, %r2177, 32768; 2026-02-21T09:15:08.6454629Z // begin inline asm 2026-02-21T09:15:08.6454837Z cp.async.cg.shared.global [ %r2193 + 0 ], [ %rd129 + 0 
], 0x10, %r454; 2026-02-21T09:15:08.6455061Z // end inline asm 2026-02-21T09:15:08.6455207Z add.s32 %r2195, %r2177, 36864; 2026-02-21T09:15:08.6455365Z // begin inline asm 2026-02-21T09:15:08.6455569Z cp.async.cg.shared.global [ %r2195 + 0 ], [ %rd130 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6455798Z // end inline asm 2026-02-21T09:15:08.6455971Z add.s32 %r2197, %r2177, 40960; 2026-02-21T09:15:08.6456123Z // begin inline asm 2026-02-21T09:15:08.6456323Z cp.async.cg.shared.global [ %r2197 + 0 ], [ %rd131 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6456545Z // end inline asm 2026-02-21T09:15:08.6456687Z add.s32 %r2199, %r660, %r655; 2026-02-21T09:15:08.6456847Z // begin inline asm 2026-02-21T09:15:08.6457041Z cp.async.cg.shared.global [ %r2199 + 0 ], [ %rd132 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6457271Z // end inline asm 2026-02-21T09:15:08.6457409Z add.s32 %r2201, %r2177, 49152; 2026-02-21T09:15:08.6457572Z // begin inline asm 2026-02-21T09:15:08.6457785Z cp.async.cg.shared.global [ %r2201 + 0 ], [ %rd133 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6458012Z // end inline asm 2026-02-21T09:15:08.6458149Z add.s32 %r2203, %r2177, 53248; 2026-02-21T09:15:08.6458308Z // begin inline asm 2026-02-21T09:15:08.6458507Z cp.async.cg.shared.global [ %r2203 + 0 ], [ %rd134 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6458770Z // end inline asm 2026-02-21T09:15:08.6458916Z add.s32 %r2205, %r2177, 57344; 2026-02-21T09:15:08.6459070Z // begin inline asm 2026-02-21T09:15:08.6459271Z cp.async.cg.shared.global [ %r2205 + 0 ], [ %rd135 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6459491Z // end inline asm 2026-02-21T09:15:08.6459638Z add.s32 %r2207, %r660, %r656; 2026-02-21T09:15:08.6459794Z // begin inline asm 2026-02-21T09:15:08.6459997Z cp.async.cg.shared.global [ %r2207 + 0 ], [ %rd136 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6460228Z // end inline asm 2026-02-21T09:15:08.6460371Z cp.async.commit_group; 2026-02-21T09:15:08.6460639Z .loc 1 54 62 // 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:62 2026-02-21T09:15:08.6460927Z add.s32 %r661, %r619, %r35; 2026-02-21T09:15:08.6461209Z .loc 1 54 34 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:34 2026-02-21T09:15:08.6461498Z cvt.s64.s32 %rd208, %r661; 2026-02-21T09:15:08.6461704Z add.s64 %rd137, %rd34, %rd208; 2026-02-21T09:15:08.6461969Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6462257Z add.s32 %r587, %r2075, 364544; 2026-02-21T09:15:08.6462420Z // begin inline asm 2026-02-21T09:15:08.6462620Z cp.async.cg.shared.global [ %r587 + 0 ], [ %rd137 + 0 ], 0x10, %r454; 2026-02-21T09:15:08.6462850Z // end inline asm 2026-02-21T09:15:08.6462990Z cp.async.commit_group; 2026-02-21T09:15:08.6463256Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6463547Z @%p23 bra $L__BB0_16; 2026-02-21T09:15:08.6463726Z // %bb.6: // %.lr.ph997 2026-02-21T09:15:08.6463956Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6464269Z .loc 1 0 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:0:112 2026-02-21T09:15:08.6464568Z and.b32 %r11, %r1, 224; 2026-02-21T09:15:08.6464724Z and.b32 %r34, %r1, 32; 2026-02-21T09:15:08.6464879Z shl.b32 %r663, %r1, 9; 2026-02-21T09:15:08.6465028Z and.b32 %r664, %r663, 65024; 2026-02-21T09:15:08.6465190Z shl.b32 %r665, %r1, 1; 2026-02-21T09:15:08.6465338Z and.b32 %r666, %r665, 256; 2026-02-21T09:15:08.6465502Z or.b32 %r667, %r664, %r666; 2026-02-21T09:15:08.6465664Z and.b32 %r668, %r28, 96; 2026-02-21T09:15:08.6465814Z or.b32 %r669, %r668, %r31; 2026-02-21T09:15:08.6465970Z shl.b32 %r670, %r31, 7; 2026-02-21T09:15:08.6466149Z and.b32 %r671, %r29, 112; 2026-02-21T09:15:08.6466307Z shr.u32 %r672, %r11, 3; 2026-02-21T09:15:08.6466457Z or.b32 %r673, %r670, %r671; 2026-02-21T09:15:08.6466620Z xor.b32 %r674, %r673, %r672; 2026-02-21T09:15:08.6466773Z add.s32 %r676, %r200, 262144; 2026-02-21T09:15:08.6466936Z add.s32 
%r104, %r676, %r674; 2026-02-21T09:15:08.6467091Z xor.b32 %r677, %r674, 32; 2026-02-21T09:15:08.6467250Z add.s32 %r105, %r676, %r677; 2026-02-21T09:15:08.6467410Z xor.b32 %r678, %r674, 64; 2026-02-21T09:15:08.6467588Z add.s32 %r106, %r676, %r678; 2026-02-21T09:15:08.6467746Z xor.b32 %r679, %r674, 96; 2026-02-21T09:15:08.6467895Z add.s32 %r107, %r676, %r679; 2026-02-21T09:15:08.6468056Z add.s32 %r1975, %r8, 256; 2026-02-21T09:15:08.6468208Z bfe.u32 %r680, %r676, 4, 14; 2026-02-21T09:15:08.6468370Z cvt.u64.u32 %rd209, %r680; 2026-02-21T09:15:08.6468540Z or.b64 %rd340, %rd209, 4611686293322072064; 2026-02-21T09:15:08.6468732Z add.s32 %r681, %r200, 262176; 2026-02-21T09:15:08.6468890Z bfe.u32 %r682, %r681, 4, 14; 2026-02-21T09:15:08.6469052Z cvt.u64.u32 %rd210, %r682; 2026-02-21T09:15:08.6469250Z or.b64 %rd341, %rd210, 4611686293322072064; 2026-02-21T09:15:08.6469433Z add.s32 %r683, %r200, 262208; 2026-02-21T09:15:08.6469602Z bfe.u32 %r684, %r683, 4, 14; 2026-02-21T09:15:08.6469761Z cvt.u64.u32 %rd211, %r684; 2026-02-21T09:15:08.6469932Z or.b64 %rd342, %rd211, 4611686293322072064; 2026-02-21T09:15:08.6470132Z add.s32 %r685, %r200, 262240; 2026-02-21T09:15:08.6470299Z bfe.u32 %r686, %r685, 4, 14; 2026-02-21T09:15:08.6470453Z cvt.u64.u32 %rd212, %r686; 2026-02-21T09:15:08.6470625Z or.b64 %rd343, %rd212, 4611686293322072064; 2026-02-21T09:15:08.6470812Z add.s32 %r687, %r200, 266240; 2026-02-21T09:15:08.6470872Z bfe.u32 %r688, %r687, 4, 14; 2026-02-21T09:15:08.6470931Z cvt.u64.u32 %rd213, %r688; 2026-02-21T09:15:08.6471000Z or.b64 %rd344, %rd213, 4611686293322072064; 2026-02-21T09:15:08.6471068Z add.s32 %r689, %r200, 266272; 2026-02-21T09:15:08.6471126Z bfe.u32 %r690, %r689, 4, 14; 2026-02-21T09:15:08.6471186Z cvt.u64.u32 %rd214, %r690; 2026-02-21T09:15:08.6471262Z or.b64 %rd345, %rd214, 4611686293322072064; 2026-02-21T09:15:08.6471321Z add.s32 %r691, %r200, 266304; 2026-02-21T09:15:08.6471379Z bfe.u32 %r692, %r691, 4, 14; 2026-02-21T09:15:08.6471439Z cvt.u64.u32 %rd215, 
%r692; 2026-02-21T09:15:08.6471513Z or.b64 %rd346, %rd215, 4611686293322072064; 2026-02-21T09:15:08.6471601Z add.s32 %r693, %r200, 266336; 2026-02-21T09:15:08.6471662Z bfe.u32 %r694, %r693, 4, 14; 2026-02-21T09:15:08.6471728Z cvt.u64.u32 %rd216, %r694; 2026-02-21T09:15:08.6471795Z or.b64 %rd347, %rd216, 4611686293322072064; 2026-02-21T09:15:08.6471854Z add.s32 %r695, %r200, 270336; 2026-02-21T09:15:08.6471918Z bfe.u32 %r696, %r695, 4, 14; 2026-02-21T09:15:08.6471976Z cvt.u64.u32 %rd217, %r696; 2026-02-21T09:15:08.6472044Z or.b64 %rd348, %rd217, 4611686293322072064; 2026-02-21T09:15:08.6472103Z add.s32 %r697, %r200, 270368; 2026-02-21T09:15:08.6472169Z bfe.u32 %r698, %r697, 4, 14; 2026-02-21T09:15:08.6472229Z cvt.u64.u32 %rd218, %r698; 2026-02-21T09:15:08.6472299Z or.b64 %rd349, %rd218, 4611686293322072064; 2026-02-21T09:15:08.6472365Z add.s32 %r699, %r200, 270400; 2026-02-21T09:15:08.6472425Z bfe.u32 %r700, %r699, 4, 14; 2026-02-21T09:15:08.6472487Z cvt.u64.u32 %rd219, %r700; 2026-02-21T09:15:08.6472556Z or.b64 %rd350, %rd219, 4611686293322072064; 2026-02-21T09:15:08.6472625Z add.s32 %r701, %r200, 270432; 2026-02-21T09:15:08.6472687Z bfe.u32 %r702, %r701, 4, 14; 2026-02-21T09:15:08.6472750Z cvt.u64.u32 %rd220, %r702; 2026-02-21T09:15:08.6472828Z or.b64 %rd351, %rd220, 4611686293322072064; 2026-02-21T09:15:08.6472888Z add.s32 %r703, %r200, 274432; 2026-02-21T09:15:08.6472948Z bfe.u32 %r704, %r703, 4, 14; 2026-02-21T09:15:08.6473010Z cvt.u64.u32 %rd221, %r704; 2026-02-21T09:15:08.6473086Z or.b64 %rd352, %rd221, 4611686293322072064; 2026-02-21T09:15:08.6473146Z add.s32 %r705, %r200, 274464; 2026-02-21T09:15:08.6473206Z bfe.u32 %r706, %r705, 4, 14; 2026-02-21T09:15:08.6473304Z cvt.u64.u32 %rd222, %r706; 2026-02-21T09:15:08.6473372Z or.b64 %rd353, %rd222, 4611686293322072064; 2026-02-21T09:15:08.6473433Z add.s32 %r707, %r200, 274496; 2026-02-21T09:15:08.6473500Z bfe.u32 %r708, %r707, 4, 14; 2026-02-21T09:15:08.6473560Z cvt.u64.u32 %rd223, %r708; 
2026-02-21T09:15:08.6473626Z or.b64 %rd354, %rd223, 4611686293322072064; 2026-02-21T09:15:08.6473687Z add.s32 %r709, %r200, 274528; 2026-02-21T09:15:08.6473754Z bfe.u32 %r710, %r709, 4, 14; 2026-02-21T09:15:08.6473843Z cvt.u64.u32 %rd224, %r710; 2026-02-21T09:15:08.6473912Z or.b64 %rd355, %rd224, 4611686293322072064; 2026-02-21T09:15:08.6473978Z add.s32 %r711, %r200, 278528; 2026-02-21T09:15:08.6474038Z bfe.u32 %r712, %r711, 4, 14; 2026-02-21T09:15:08.6474099Z cvt.u64.u32 %rd225, %r712; 2026-02-21T09:15:08.6474170Z or.b64 %rd356, %rd225, 4611686293322072064; 2026-02-21T09:15:08.6474237Z add.s32 %r713, %r200, 278560; 2026-02-21T09:15:08.6474296Z bfe.u32 %r714, %r713, 4, 14; 2026-02-21T09:15:08.6474358Z cvt.u64.u32 %rd226, %r714; 2026-02-21T09:15:08.6474433Z or.b64 %rd357, %rd226, 4611686293322072064; 2026-02-21T09:15:08.6474536Z add.s32 %r715, %r200, 278592; 2026-02-21T09:15:08.6474596Z bfe.u32 %r716, %r715, 4, 14; 2026-02-21T09:15:08.6474665Z cvt.u64.u32 %rd227, %r716; 2026-02-21T09:15:08.6474735Z or.b64 %rd358, %rd227, 4611686293322072064; 2026-02-21T09:15:08.6474795Z add.s32 %r717, %r200, 278624; 2026-02-21T09:15:08.6474880Z bfe.u32 %r718, %r717, 4, 14; 2026-02-21T09:15:08.6474952Z cvt.u64.u32 %rd228, %r718; 2026-02-21T09:15:08.6475021Z or.b64 %rd359, %rd228, 4611686293322072064; 2026-02-21T09:15:08.6475081Z add.s32 %r719, %r200, 282624; 2026-02-21T09:15:08.6475150Z bfe.u32 %r720, %r719, 4, 14; 2026-02-21T09:15:08.6475211Z cvt.u64.u32 %rd229, %r720; 2026-02-21T09:15:08.6475279Z or.b64 %rd360, %rd229, 4611686293322072064; 2026-02-21T09:15:08.6475341Z add.s32 %r721, %r200, 282656; 2026-02-21T09:15:08.6475409Z bfe.u32 %r722, %r721, 4, 14; 2026-02-21T09:15:08.6475472Z cvt.u64.u32 %rd230, %r722; 2026-02-21T09:15:08.6475542Z or.b64 %rd361, %rd230, 4611686293322072064; 2026-02-21T09:15:08.6475610Z add.s32 %r723, %r200, 282688; 2026-02-21T09:15:08.6475670Z bfe.u32 %r724, %r723, 4, 14; 2026-02-21T09:15:08.6475732Z cvt.u64.u32 %rd231, %r724; 2026-02-21T09:15:08.6475800Z 
or.b64 %rd362, %rd231, 4611686293322072064; 2026-02-21T09:15:08.6475871Z add.s32 %r725, %r200, 282720; 2026-02-21T09:15:08.6475936Z bfe.u32 %r726, %r725, 4, 14; 2026-02-21T09:15:08.6476000Z cvt.u64.u32 %rd232, %r726; 2026-02-21T09:15:08.6476080Z or.b64 %rd363, %rd232, 4611686293322072064; 2026-02-21T09:15:08.6476144Z add.s32 %r727, %r200, 286720; 2026-02-21T09:15:08.6476207Z bfe.u32 %r728, %r727, 4, 14; 2026-02-21T09:15:08.6476279Z cvt.u64.u32 %rd233, %r728; 2026-02-21T09:15:08.6476350Z or.b64 %rd364, %rd233, 4611686293322072064; 2026-02-21T09:15:08.6476412Z add.s32 %r729, %r200, 286752; 2026-02-21T09:15:08.6476474Z bfe.u32 %r730, %r729, 4, 14; 2026-02-21T09:15:08.6476547Z cvt.u64.u32 %rd234, %r730; 2026-02-21T09:15:08.6476616Z or.b64 %rd365, %rd234, 4611686293322072064; 2026-02-21T09:15:08.6476677Z add.s32 %r731, %r200, 286784; 2026-02-21T09:15:08.6476747Z bfe.u32 %r732, %r731, 4, 14; 2026-02-21T09:15:08.6476808Z cvt.u64.u32 %rd235, %r732; 2026-02-21T09:15:08.6476877Z or.b64 %rd366, %rd235, 4611686293322072064; 2026-02-21T09:15:08.6476936Z add.s32 %r733, %r200, 286816; 2026-02-21T09:15:08.6477004Z bfe.u32 %r734, %r733, 4, 14; 2026-02-21T09:15:08.6477066Z cvt.u64.u32 %rd236, %r734; 2026-02-21T09:15:08.6477137Z or.b64 %rd367, %rd236, 4611686293322072064; 2026-02-21T09:15:08.6477207Z add.s32 %r735, %r200, 290816; 2026-02-21T09:15:08.6477267Z bfe.u32 %r736, %r735, 4, 14; 2026-02-21T09:15:08.6477328Z cvt.u64.u32 %rd237, %r736; 2026-02-21T09:15:08.6477398Z or.b64 %rd368, %rd237, 4611686293322072064; 2026-02-21T09:15:08.6477466Z add.s32 %r737, %r200, 290848; 2026-02-21T09:15:08.6477526Z bfe.u32 %r738, %r737, 4, 14; 2026-02-21T09:15:08.6477587Z cvt.u64.u32 %rd238, %r738; 2026-02-21T09:15:08.6477687Z or.b64 %rd369, %rd238, 4611686293322072064; 2026-02-21T09:15:08.6477749Z add.s32 %r739, %r200, 290880; 2026-02-21T09:15:08.6477810Z bfe.u32 %r740, %r739, 4, 14; 2026-02-21T09:15:08.6477878Z cvt.u64.u32 %rd239, %r740; 2026-02-21T09:15:08.6477946Z or.b64 %rd370, %rd239, 
4611686293322072064; 2026-02-21T09:15:08.6478007Z add.s32 %r741, %r200, 290912; 2026-02-21T09:15:08.6478068Z bfe.u32 %r742, %r741, 4, 14; 2026-02-21T09:15:08.6478163Z cvt.u64.u32 %rd240, %r742; 2026-02-21T09:15:08.6478233Z or.b64 %rd371, %rd240, 4611686293322072064; 2026-02-21T09:15:08.6478293Z add.s32 %r109, %r200, %r667; 2026-02-21T09:15:08.6478360Z add.s32 %r743, %r200, %r669; 2026-02-21T09:15:08.6478419Z add.s32 %r110, %r743, 352256; 2026-02-21T09:15:08.6478479Z add.s32 %r111, %r109, 65536; 2026-02-21T09:15:08.6478540Z add.s32 %r112, %r743, 356352; 2026-02-21T09:15:08.6478607Z add.s32 %r113, %r109, 131072; 2026-02-21T09:15:08.6478666Z add.s32 %r114, %r743, 360448; 2026-02-21T09:15:08.6478728Z add.s32 %r115, %r109, 196608; 2026-02-21T09:15:08.6478795Z add.s32 %r116, %r743, 364544; 2026-02-21T09:15:08.6478877Z mov.b32 %r2370, 0; 2026-02-21T09:15:08.6478946Z setp.eq.b32 %p26, %r34, 0; 2026-02-21T09:15:08.6479007Z mov.b16 %rs1162, %rs1; 2026-02-21T09:15:08.6479075Z mov.b32 %r2371, %r2370; 2026-02-21T09:15:08.6479135Z mov.b32 %r2372, %r9; 2026-02-21T09:15:08.6479194Z bra.uni $L__BB0_7; 2026-02-21T09:15:08.6479356Z $L__BB0_15: // in Loop: Header=BB0_7 Depth=2 2026-02-21T09:15:08.6479428Z selp.b32 %r2371, 0, %r1307, %p96; 2026-02-21T09:15:08.6479492Z selp.b32 %r1308, 1, 0, %p96; 2026-02-21T09:15:08.6479564Z xor.b32 %r2370, %r2370, %r1308; 2026-02-21T09:15:08.6479757Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6479826Z setp.lt.u32 %p298, %r2372, 5824; 2026-02-21T09:15:08.6479886Z add.s32 %r2372, %r2372, 2368; 2026-02-21T09:15:08.6480086Z .loc 1 26 33 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:26:33 2026-02-21T09:15:08.6480144Z shr.u32 %r2211, %r2372, 8; 2026-02-21T09:15:08.6480204Z and.b32 %r2212, %r2211, 48; 2026-02-21T09:15:08.6480378Z .loc 1 27 39 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:27:39 2026-02-21T09:15:08.6480437Z sub.s32 %r2213, 32, %r2212; 
2026-02-21T09:15:08.6480605Z .loc 1 27 52 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:27:52 2026-02-21T09:15:08.6480671Z min.u32 %r2214, %r2213, 16; 2026-02-21T09:15:08.6480836Z .loc 1 28 64 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:28:64 2026-02-21T09:15:08.6480899Z and.b16 %rs1157, %rs1162, 4095; 2026-02-21T09:15:08.6480958Z cvt.u16.u32 %rs1158, %r2214; 2026-02-21T09:15:08.6481129Z .loc 1 29 51 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:29:51 2026-02-21T09:15:08.6481195Z div.u16 %rs1159, %rs1157, %rs1158; 2026-02-21T09:15:08.6481364Z .loc 1 28 64 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:28:64 2026-02-21T09:15:08.6481440Z mul.lo.s16 %rs1160, %rs1159, %rs1158; 2026-02-21T09:15:08.6481503Z sub.s16 %rs1161, %rs1157, %rs1160; 2026-02-21T09:15:08.6481583Z cvt.u32.u16 %r2215, %rs1161; 2026-02-21T09:15:08.6481755Z .loc 1 28 30 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:28:30 2026-02-21T09:15:08.6481816Z add.s32 %r2216, %r2212, %r2215; 2026-02-21T09:15:08.6481981Z .loc 1 30 27 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:30:27 2026-02-21T09:15:08.6482045Z shl.b32 %r2217, %r2216, 7; 2026-02-21T09:15:08.6482208Z .loc 1 31 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:31:32 2026-02-21T09:15:08.6482266Z or.b32 %r2218, %r2217, %r12; 2026-02-21T09:15:08.6482324Z or.b32 %r2219, %r2217, %r13; 2026-02-21T09:15:08.6482416Z or.b32 %r2220, %r2217, %r14; 2026-02-21T09:15:08.6482472Z or.b32 %r2221, %r2217, %r15; 2026-02-21T09:15:08.6482529Z or.b32 %r2222, %r2217, %r16; 2026-02-21T09:15:08.6482593Z or.b32 %r2223, %r2217, %r17; 2026-02-21T09:15:08.6482649Z or.b32 %r2224, %r2217, %r18; 2026-02-21T09:15:08.6482706Z or.b32 %r2225, %r2217, %r19; 2026-02-21T09:15:08.6482763Z or.b32 %r2226, %r2217, %r20; 2026-02-21T09:15:08.6482827Z or.b32 %r2227, %r2217, %r21; 2026-02-21T09:15:08.6482884Z or.b32 %r2228, %r2217, %r22; 2026-02-21T09:15:08.6482968Z or.b32 %r2229, %r2217, %r23; 
2026-02-21T09:15:08.6483035Z or.b32 %r2230, %r2217, %r24; 2026-02-21T09:15:08.6483092Z or.b32 %r2231, %r2217, %r25; 2026-02-21T09:15:08.6483151Z or.b32 %r2232, %r2217, %r26; 2026-02-21T09:15:08.6483214Z or.b32 %r2233, %r2217, %r27; 2026-02-21T09:15:08.6483377Z .loc 1 32 27 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:32:27 2026-02-21T09:15:08.6483443Z mul.wide.u16 %r2234, %rs1159, 32; 2026-02-21T09:15:08.6483612Z .loc 1 33 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:33:32 2026-02-21T09:15:08.6483712Z or.b32 %r2235, %r2234, %r30; 2026-02-21T09:15:08.6483881Z .loc 1 48 53 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:53 2026-02-21T09:15:08.6483939Z shl.b32 %r2236, %r2218, 10; 2026-02-21T09:15:08.6484004Z shl.b32 %r2237, %r2219, 10; 2026-02-21T09:15:08.6484087Z shl.b32 %r2238, %r2220, 10; 2026-02-21T09:15:08.6484148Z shl.b32 %r2239, %r2221, 10; 2026-02-21T09:15:08.6484215Z shl.b32 %r2240, %r2222, 10; 2026-02-21T09:15:08.6484274Z shl.b32 %r2241, %r2223, 10; 2026-02-21T09:15:08.6484331Z shl.b32 %r2242, %r2224, 10; 2026-02-21T09:15:08.6484390Z shl.b32 %r2243, %r2225, 10; 2026-02-21T09:15:08.6484461Z shl.b32 %r2244, %r2226, 10; 2026-02-21T09:15:08.6484519Z shl.b32 %r2245, %r2227, 10; 2026-02-21T09:15:08.6484576Z shl.b32 %r2246, %r2228, 10; 2026-02-21T09:15:08.6484641Z shl.b32 %r2247, %r2229, 10; 2026-02-21T09:15:08.6484698Z shl.b32 %r2248, %r2230, 10; 2026-02-21T09:15:08.6484754Z shl.b32 %r2249, %r2231, 10; 2026-02-21T09:15:08.6484812Z shl.b32 %r2250, %r2232, 10; 2026-02-21T09:15:08.6484874Z shl.b32 %r2251, %r2233, 10; 2026-02-21T09:15:08.6485038Z .loc 1 48 60 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:60 2026-02-21T09:15:08.6485094Z or.b32 %r2252, %r2236, %r32; 2026-02-21T09:15:08.6485161Z or.b32 %r2253, %r2237, %r32; 2026-02-21T09:15:08.6485218Z or.b32 %r2254, %r2238, %r32; 2026-02-21T09:15:08.6485275Z or.b32 %r2255, %r2239, %r32; 2026-02-21T09:15:08.6485339Z or.b32 %r2256, %r2240, %r32; 
2026-02-21T09:15:08.6485395Z or.b32 %r2257, %r2241, %r32; 2026-02-21T09:15:08.6485452Z or.b32 %r2258, %r2242, %r32; 2026-02-21T09:15:08.6485507Z or.b32 %r2259, %r2243, %r32; 2026-02-21T09:15:08.6485571Z or.b32 %r2260, %r2244, %r32; 2026-02-21T09:15:08.6485628Z or.b32 %r2261, %r2245, %r32; 2026-02-21T09:15:08.6485685Z or.b32 %r2262, %r2246, %r32; 2026-02-21T09:15:08.6485750Z or.b32 %r2263, %r2247, %r32; 2026-02-21T09:15:08.6485806Z or.b32 %r2264, %r2248, %r32; 2026-02-21T09:15:08.6485864Z or.b32 %r2265, %r2249, %r32; 2026-02-21T09:15:08.6485921Z or.b32 %r2266, %r2250, %r32; 2026-02-21T09:15:08.6485984Z or.b32 %r2267, %r2251, %r32; 2026-02-21T09:15:08.6486154Z .loc 1 48 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:32 2026-02-21T09:15:08.6486224Z mad.wide.u32 %rd373, %r2252, 2, %rd33; 2026-02-21T09:15:08.6486302Z mad.wide.u32 %rd374, %r2253, 2, %rd33; 2026-02-21T09:15:08.6486367Z mad.wide.u32 %rd375, %r2254, 2, %rd33; 2026-02-21T09:15:08.6486430Z mad.wide.u32 %rd376, %r2255, 2, %rd33; 2026-02-21T09:15:08.6486500Z mad.wide.u32 %rd377, %r2256, 2, %rd33; 2026-02-21T09:15:08.6486562Z mad.wide.u32 %rd378, %r2257, 2, %rd33; 2026-02-21T09:15:08.6486625Z mad.wide.u32 %rd379, %r2258, 2, %rd33; 2026-02-21T09:15:08.6486688Z mad.wide.u32 %rd380, %r2259, 2, %rd33; 2026-02-21T09:15:08.6486782Z mad.wide.u32 %rd381, %r2260, 2, %rd33; 2026-02-21T09:15:08.6486846Z mad.wide.u32 %rd382, %r2261, 2, %rd33; 2026-02-21T09:15:08.6486912Z mad.wide.u32 %rd383, %r2262, 2, %rd33; 2026-02-21T09:15:08.6486981Z mad.wide.u32 %rd384, %r2263, 2, %rd33; 2026-02-21T09:15:08.6487043Z mad.wide.u32 %rd385, %r2264, 2, %rd33; 2026-02-21T09:15:08.6487105Z mad.wide.u32 %rd386, %r2265, 2, %rd33; 2026-02-21T09:15:08.6487168Z mad.wide.u32 %rd387, %r2266, 2, %rd33; 2026-02-21T09:15:08.6487240Z mad.wide.u32 %rd388, %r2267, 2, %rd33; 2026-02-21T09:15:08.6487430Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6487493Z selp.b32 %r2076, 16, 0, %p298; 
2026-02-21T09:15:08.6487557Z // begin inline asm 2026-02-21T09:15:08.6487680Z cp.async.cg.shared.global [ %r2075 + 0 ], [ %rd373 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6487735Z // end inline asm 2026-02-21T09:15:08.6487799Z // begin inline asm 2026-02-21T09:15:08.6487917Z cp.async.cg.shared.global [ %r2077 + 0 ], [ %rd374 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6487974Z // end inline asm 2026-02-21T09:15:08.6488049Z // begin inline asm 2026-02-21T09:15:08.6488176Z cp.async.cg.shared.global [ %r2079 + 0 ], [ %rd375 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6488232Z // end inline asm 2026-02-21T09:15:08.6488288Z // begin inline asm 2026-02-21T09:15:08.6488409Z cp.async.cg.shared.global [ %r2081 + 0 ], [ %rd376 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6488481Z // end inline asm 2026-02-21T09:15:08.6488539Z // begin inline asm 2026-02-21T09:15:08.6488652Z cp.async.cg.shared.global [ %r2083 + 0 ], [ %rd377 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6488712Z // end inline asm 2026-02-21T09:15:08.6488767Z // begin inline asm 2026-02-21T09:15:08.6488880Z cp.async.cg.shared.global [ %r2085 + 0 ], [ %rd378 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6488940Z // end inline asm 2026-02-21T09:15:08.6488996Z // begin inline asm 2026-02-21T09:15:08.6489108Z cp.async.cg.shared.global [ %r2087 + 0 ], [ %rd379 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6489171Z // end inline asm 2026-02-21T09:15:08.6489227Z // begin inline asm 2026-02-21T09:15:08.6489342Z cp.async.cg.shared.global [ %r2089 + 0 ], [ %rd380 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6489396Z // end inline asm 2026-02-21T09:15:08.6489460Z // begin inline asm 2026-02-21T09:15:08.6489574Z cp.async.cg.shared.global [ %r2091 + 0 ], [ %rd381 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6489630Z // end inline asm 2026-02-21T09:15:08.6489693Z // begin inline asm 2026-02-21T09:15:08.6489806Z cp.async.cg.shared.global [ %r2093 + 0 ], [ %rd382 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6489861Z // end inline asm 2026-02-21T09:15:08.6489915Z 
// begin inline asm 2026-02-21T09:15:08.6490033Z cp.async.cg.shared.global [ %r2095 + 0 ], [ %rd383 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6490088Z // end inline asm 2026-02-21T09:15:08.6490143Z // begin inline asm 2026-02-21T09:15:08.6490263Z cp.async.cg.shared.global [ %r2097 + 0 ], [ %rd384 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6490320Z // end inline asm 2026-02-21T09:15:08.6490375Z // begin inline asm 2026-02-21T09:15:08.6490496Z cp.async.cg.shared.global [ %r2099 + 0 ], [ %rd385 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6490551Z // end inline asm 2026-02-21T09:15:08.6490607Z // begin inline asm 2026-02-21T09:15:08.6490721Z cp.async.cg.shared.global [ %r2101 + 0 ], [ %rd386 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6490785Z // end inline asm 2026-02-21T09:15:08.6490842Z // begin inline asm 2026-02-21T09:15:08.6490956Z cp.async.cg.shared.global [ %r2103 + 0 ], [ %rd387 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6491017Z // end inline asm 2026-02-21T09:15:08.6491072Z // begin inline asm 2026-02-21T09:15:08.6491186Z cp.async.cg.shared.global [ %r2105 + 0 ], [ %rd388 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6491241Z // end inline asm 2026-02-21T09:15:08.6491314Z cp.async.commit_group; 2026-02-21T09:15:08.6491487Z .loc 1 54 62 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:62 2026-02-21T09:15:08.6491610Z add.s32 %r2268, %r2235, %r33; 2026-02-21T09:15:08.6491789Z .loc 1 54 34 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:34 2026-02-21T09:15:08.6491852Z cvt.u64.u32 %rd441, %r2268; 2026-02-21T09:15:08.6491915Z add.s64 %rd389, %rd34, %rd441; 2026-02-21T09:15:08.6492094Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6492155Z // begin inline asm 2026-02-21T09:15:08.6492299Z cp.async.cg.shared.global [ %r485 + 0 ], [ %rd389 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6492353Z // end inline asm 2026-02-21T09:15:08.6492424Z cp.async.commit_group; 2026-02-21T09:15:08.6492593Z .loc 1 48 32 // 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:32 2026-02-21T09:15:08.6492655Z add.s64 %rd390, %rd373, 512; 2026-02-21T09:15:08.6492720Z add.s64 %rd391, %rd374, 512; 2026-02-21T09:15:08.6492778Z add.s64 %rd392, %rd375, 512; 2026-02-21T09:15:08.6492835Z add.s64 %rd393, %rd376, 512; 2026-02-21T09:15:08.6492900Z add.s64 %rd394, %rd377, 512; 2026-02-21T09:15:08.6492985Z add.s64 %rd395, %rd378, 512; 2026-02-21T09:15:08.6493048Z add.s64 %rd396, %rd379, 512; 2026-02-21T09:15:08.6493109Z add.s64 %rd397, %rd380, 512; 2026-02-21T09:15:08.6493179Z add.s64 %rd398, %rd381, 512; 2026-02-21T09:15:08.6493238Z add.s64 %rd399, %rd382, 512; 2026-02-21T09:15:08.6493320Z add.s64 %rd400, %rd383, 512; 2026-02-21T09:15:08.6493387Z add.s64 %rd401, %rd384, 512; 2026-02-21T09:15:08.6493445Z add.s64 %rd402, %rd385, 512; 2026-02-21T09:15:08.6493501Z add.s64 %rd403, %rd386, 512; 2026-02-21T09:15:08.6493558Z add.s64 %rd404, %rd387, 512; 2026-02-21T09:15:08.6493623Z add.s64 %rd405, %rd388, 512; 2026-02-21T09:15:08.6493794Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6493850Z // begin inline asm 2026-02-21T09:15:08.6493974Z cp.async.cg.shared.global [ %r2109 + 0 ], [ %rd390 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6494032Z // end inline asm 2026-02-21T09:15:08.6494088Z // begin inline asm 2026-02-21T09:15:08.6494209Z cp.async.cg.shared.global [ %r2111 + 0 ], [ %rd391 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6494264Z // end inline asm 2026-02-21T09:15:08.6494321Z // begin inline asm 2026-02-21T09:15:08.6494433Z cp.async.cg.shared.global [ %r2113 + 0 ], [ %rd392 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6494497Z // end inline asm 2026-02-21T09:15:08.6494555Z // begin inline asm 2026-02-21T09:15:08.6494668Z cp.async.cg.shared.global [ %r2115 + 0 ], [ %rd393 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6494729Z // end inline asm 2026-02-21T09:15:08.6494783Z // begin inline asm 2026-02-21T09:15:08.6494897Z cp.async.cg.shared.global [ %r2117 + 
0 ], [ %rd394 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6494951Z // end inline asm 2026-02-21T09:15:08.6495015Z // begin inline asm 2026-02-21T09:15:08.6495128Z cp.async.cg.shared.global [ %r2119 + 0 ], [ %rd395 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6495183Z // end inline asm 2026-02-21T09:15:08.6495246Z // begin inline asm 2026-02-21T09:15:08.6495359Z cp.async.cg.shared.global [ %r2121 + 0 ], [ %rd396 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6495413Z // end inline asm 2026-02-21T09:15:08.6495468Z // begin inline asm 2026-02-21T09:15:08.6495589Z cp.async.cg.shared.global [ %r2123 + 0 ], [ %rd397 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6495643Z // end inline asm 2026-02-21T09:15:08.6495700Z // begin inline asm 2026-02-21T09:15:08.6495819Z cp.async.cg.shared.global [ %r2125 + 0 ], [ %rd398 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6495874Z // end inline asm 2026-02-21T09:15:08.6495928Z // begin inline asm 2026-02-21T09:15:08.6496046Z cp.async.cg.shared.global [ %r2127 + 0 ], [ %rd399 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6496100Z // end inline asm 2026-02-21T09:15:08.6496154Z // begin inline asm 2026-02-21T09:15:08.6496266Z cp.async.cg.shared.global [ %r2129 + 0 ], [ %rd400 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6496352Z // end inline asm 2026-02-21T09:15:08.6496408Z // begin inline asm 2026-02-21T09:15:08.6496522Z cp.async.cg.shared.global [ %r2131 + 0 ], [ %rd401 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6496584Z // end inline asm 2026-02-21T09:15:08.6496639Z // begin inline asm 2026-02-21T09:15:08.6496751Z cp.async.cg.shared.global [ %r2133 + 0 ], [ %rd402 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6496804Z // end inline asm 2026-02-21T09:15:08.6496869Z // begin inline asm 2026-02-21T09:15:08.6497002Z cp.async.cg.shared.global [ %r2135 + 0 ], [ %rd403 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6497056Z // end inline asm 2026-02-21T09:15:08.6497120Z // begin inline asm 2026-02-21T09:15:08.6497231Z cp.async.cg.shared.global [ %r2137 + 0 ], [ %rd404 + 0 ], 0x10, 
%r2076; 2026-02-21T09:15:08.6497285Z // end inline asm 2026-02-21T09:15:08.6497347Z // begin inline asm 2026-02-21T09:15:08.6497460Z cp.async.cg.shared.global [ %r2139 + 0 ], [ %rd405 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6497513Z // end inline asm 2026-02-21T09:15:08.6497578Z cp.async.commit_group; 2026-02-21T09:15:08.6497788Z .loc 1 54 34 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:34 2026-02-21T09:15:08.6497850Z cvt.u64.u32 %rd442, %r2235; 2026-02-21T09:15:08.6497913Z add.s64 %rd444, %rd442, %rd205; 2026-02-21T09:15:08.6497982Z add.s64 %rd445, %rd34, %rd444; 2026-02-21T09:15:08.6498045Z add.s64 %rd406, %rd445, 1048576; 2026-02-21T09:15:08.6498235Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6498300Z // begin inline asm 2026-02-21T09:15:08.6498417Z cp.async.cg.shared.global [ %r519 + 0 ], [ %rd406 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6498471Z // end inline asm 2026-02-21T09:15:08.6498534Z cp.async.commit_group; 2026-02-21T09:15:08.6498707Z .loc 1 48 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:32 2026-02-21T09:15:08.6498768Z add.s64 %rd407, %rd373, 1024; 2026-02-21T09:15:08.6498829Z add.s64 %rd408, %rd374, 1024; 2026-02-21T09:15:08.6498895Z add.s64 %rd409, %rd375, 1024; 2026-02-21T09:15:08.6498953Z add.s64 %rd410, %rd376, 1024; 2026-02-21T09:15:08.6499012Z add.s64 %rd411, %rd377, 1024; 2026-02-21T09:15:08.6499070Z add.s64 %rd412, %rd378, 1024; 2026-02-21T09:15:08.6499134Z add.s64 %rd413, %rd379, 1024; 2026-02-21T09:15:08.6499190Z add.s64 %rd414, %rd380, 1024; 2026-02-21T09:15:08.6499249Z add.s64 %rd415, %rd381, 1024; 2026-02-21T09:15:08.6499316Z add.s64 %rd416, %rd382, 1024; 2026-02-21T09:15:08.6499373Z add.s64 %rd417, %rd383, 1024; 2026-02-21T09:15:08.6499431Z add.s64 %rd418, %rd384, 1024; 2026-02-21T09:15:08.6499497Z add.s64 %rd419, %rd385, 1024; 2026-02-21T09:15:08.6499556Z add.s64 %rd420, %rd386, 1024; 2026-02-21T09:15:08.6499616Z add.s64 %rd421, %rd387, 1024; 
2026-02-21T09:15:08.6499674Z add.s64 %rd422, %rd388, 1024; 2026-02-21T09:15:08.6499851Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6499910Z // begin inline asm 2026-02-21T09:15:08.6500026Z cp.async.cg.shared.global [ %r2143 + 0 ], [ %rd407 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6500087Z // end inline asm 2026-02-21T09:15:08.6500144Z // begin inline asm 2026-02-21T09:15:08.6500256Z cp.async.cg.shared.global [ %r2145 + 0 ], [ %rd408 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6500310Z // end inline asm 2026-02-21T09:15:08.6500372Z // begin inline asm 2026-02-21T09:15:08.6500484Z cp.async.cg.shared.global [ %r2147 + 0 ], [ %rd409 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6500539Z // end inline asm 2026-02-21T09:15:08.6500603Z // begin inline asm 2026-02-21T09:15:08.6500713Z cp.async.cg.shared.global [ %r2149 + 0 ], [ %rd410 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6500767Z // end inline asm 2026-02-21T09:15:08.6500830Z // begin inline asm 2026-02-21T09:15:08.6500940Z cp.async.cg.shared.global [ %r2151 + 0 ], [ %rd411 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6500995Z // end inline asm 2026-02-21T09:15:08.6501075Z // begin inline asm 2026-02-21T09:15:08.6501196Z cp.async.cg.shared.global [ %r2153 + 0 ], [ %rd412 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6501251Z // end inline asm 2026-02-21T09:15:08.6501307Z // begin inline asm 2026-02-21T09:15:08.6501424Z cp.async.cg.shared.global [ %r2155 + 0 ], [ %rd413 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6501478Z // end inline asm 2026-02-21T09:15:08.6501562Z // begin inline asm 2026-02-21T09:15:08.6501676Z cp.async.cg.shared.global [ %r2157 + 0 ], [ %rd414 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6501779Z // end inline asm 2026-02-21T09:15:08.6501835Z // begin inline asm 2026-02-21T09:15:08.6501946Z cp.async.cg.shared.global [ %r2159 + 0 ], [ %rd415 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6502007Z // end inline asm 2026-02-21T09:15:08.6502062Z // begin inline asm 
2026-02-21T09:15:08.6502173Z cp.async.cg.shared.global [ %r2161 + 0 ], [ %rd416 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6502234Z // end inline asm 2026-02-21T09:15:08.6502291Z // begin inline asm 2026-02-21T09:15:08.6502403Z cp.async.cg.shared.global [ %r2163 + 0 ], [ %rd417 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6502481Z // end inline asm 2026-02-21T09:15:08.6502547Z // begin inline asm 2026-02-21T09:15:08.6502662Z cp.async.cg.shared.global [ %r2165 + 0 ], [ %rd418 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6502716Z // end inline asm 2026-02-21T09:15:08.6502778Z // begin inline asm 2026-02-21T09:15:08.6502923Z cp.async.cg.shared.global [ %r2167 + 0 ], [ %rd419 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6502982Z // end inline asm 2026-02-21T09:15:08.6503037Z // begin inline asm 2026-02-21T09:15:08.6503158Z cp.async.cg.shared.global [ %r2169 + 0 ], [ %rd420 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6503212Z // end inline asm 2026-02-21T09:15:08.6503267Z // begin inline asm 2026-02-21T09:15:08.6503384Z cp.async.cg.shared.global [ %r2171 + 0 ], [ %rd421 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6503439Z // end inline asm 2026-02-21T09:15:08.6503494Z // begin inline asm 2026-02-21T09:15:08.6503607Z cp.async.cg.shared.global [ %r2173 + 0 ], [ %rd422 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6503668Z // end inline asm 2026-02-21T09:15:08.6503730Z cp.async.commit_group; 2026-02-21T09:15:08.6503896Z .loc 1 54 34 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:34 2026-02-21T09:15:08.6503967Z add.s64 %rd423, %rd445, 2097152; 2026-02-21T09:15:08.6504136Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6504194Z // begin inline asm 2026-02-21T09:15:08.6504313Z cp.async.cg.shared.global [ %r553 + 0 ], [ %rd423 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6504368Z // end inline asm 2026-02-21T09:15:08.6504429Z cp.async.commit_group; 2026-02-21T09:15:08.6504593Z .loc 1 48 32 // 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:32 2026-02-21T09:15:08.6504658Z add.s64 %rd424, %rd373, 1536; 2026-02-21T09:15:08.6504716Z add.s64 %rd425, %rd374, 1536; 2026-02-21T09:15:08.6504775Z add.s64 %rd426, %rd375, 1536; 2026-02-21T09:15:08.6504839Z add.s64 %rd427, %rd376, 1536; 2026-02-21T09:15:08.6504897Z add.s64 %rd428, %rd377, 1536; 2026-02-21T09:15:08.6504953Z add.s64 %rd429, %rd378, 1536; 2026-02-21T09:15:08.6505018Z add.s64 %rd430, %rd379, 1536; 2026-02-21T09:15:08.6505073Z add.s64 %rd431, %rd380, 1536; 2026-02-21T09:15:08.6505129Z add.s64 %rd432, %rd381, 1536; 2026-02-21T09:15:08.6505187Z add.s64 %rd433, %rd382, 1536; 2026-02-21T09:15:08.6505251Z add.s64 %rd434, %rd383, 1536; 2026-02-21T09:15:08.6505308Z add.s64 %rd435, %rd384, 1536; 2026-02-21T09:15:08.6505363Z add.s64 %rd436, %rd385, 1536; 2026-02-21T09:15:08.6505425Z add.s64 %rd437, %rd386, 1536; 2026-02-21T09:15:08.6505482Z add.s64 %rd438, %rd387, 1536; 2026-02-21T09:15:08.6505539Z add.s64 %rd439, %rd388, 1536; 2026-02-21T09:15:08.6505704Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6505792Z // begin inline asm 2026-02-21T09:15:08.6505905Z cp.async.cg.shared.global [ %r2177 + 0 ], [ %rd424 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6505960Z // end inline asm 2026-02-21T09:15:08.6506024Z // begin inline asm 2026-02-21T09:15:08.6506135Z cp.async.cg.shared.global [ %r2179 + 0 ], [ %rd425 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6506188Z // end inline asm 2026-02-21T09:15:08.6506251Z // begin inline asm 2026-02-21T09:15:08.6506362Z cp.async.cg.shared.global [ %r2181 + 0 ], [ %rd426 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6506438Z // end inline asm 2026-02-21T09:15:08.6506494Z // begin inline asm 2026-02-21T09:15:08.6506614Z cp.async.cg.shared.global [ %r2183 + 0 ], [ %rd427 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6506667Z // end inline asm 2026-02-21T09:15:08.6506723Z // begin inline asm 2026-02-21T09:15:08.6506841Z 
cp.async.cg.shared.global [ %r2185 + 0 ], [ %rd428 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6506895Z // end inline asm 2026-02-21T09:15:08.6506951Z // begin inline asm 2026-02-21T09:15:08.6507064Z cp.async.cg.shared.global [ %r2187 + 0 ], [ %rd429 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6507148Z // end inline asm 2026-02-21T09:15:08.6507208Z // begin inline asm 2026-02-21T09:15:08.6507322Z cp.async.cg.shared.global [ %r2189 + 0 ], [ %rd430 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6507388Z // end inline asm 2026-02-21T09:15:08.6507445Z // begin inline asm 2026-02-21T09:15:08.6507576Z cp.async.cg.shared.global [ %r2191 + 0 ], [ %rd431 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6507644Z // end inline asm 2026-02-21T09:15:08.6507702Z // begin inline asm 2026-02-21T09:15:08.6507814Z cp.async.cg.shared.global [ %r2193 + 0 ], [ %rd432 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6507868Z // end inline asm 2026-02-21T09:15:08.6507931Z // begin inline asm 2026-02-21T09:15:08.6508041Z cp.async.cg.shared.global [ %r2195 + 0 ], [ %rd433 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6508095Z // end inline asm 2026-02-21T09:15:08.6508159Z // begin inline asm 2026-02-21T09:15:08.6508271Z cp.async.cg.shared.global [ %r2197 + 0 ], [ %rd434 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6508324Z // end inline asm 2026-02-21T09:15:08.6508379Z // begin inline asm 2026-02-21T09:15:08.6508496Z cp.async.cg.shared.global [ %r2199 + 0 ], [ %rd435 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6508551Z // end inline asm 2026-02-21T09:15:08.6508605Z // begin inline asm 2026-02-21T09:15:08.6508725Z cp.async.cg.shared.global [ %r2201 + 0 ], [ %rd436 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6508780Z // end inline asm 2026-02-21T09:15:08.6508834Z // begin inline asm 2026-02-21T09:15:08.6508951Z cp.async.cg.shared.global [ %r2203 + 0 ], [ %rd437 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6509005Z // end inline asm 2026-02-21T09:15:08.6509060Z // begin inline asm 2026-02-21T09:15:08.6509170Z cp.async.cg.shared.global [ 
%r2205 + 0 ], [ %rd438 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6509232Z // end inline asm 2026-02-21T09:15:08.6509286Z // begin inline asm 2026-02-21T09:15:08.6509396Z cp.async.cg.shared.global [ %r2207 + 0 ], [ %rd439 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6509460Z // end inline asm 2026-02-21T09:15:08.6509523Z cp.async.commit_group; 2026-02-21T09:15:08.6509686Z .loc 1 54 62 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:62 2026-02-21T09:15:08.6509747Z add.s32 %r2269, %r2235, %r35; 2026-02-21T09:15:08.6509922Z .loc 1 54 34 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:34 2026-02-21T09:15:08.6509984Z cvt.u64.u32 %rd446, %r2269; 2026-02-21T09:15:08.6510044Z add.s64 %rd440, %rd34, %rd446; 2026-02-21T09:15:08.6510212Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6510269Z // begin inline asm 2026-02-21T09:15:08.6510381Z cp.async.cg.shared.global [ %r587 + 0 ], [ %rd440 + 0 ], 0x10, %r2076; 2026-02-21T09:15:08.6510443Z // end inline asm 2026-02-21T09:15:08.6510505Z cp.async.commit_group; 2026-02-21T09:15:08.6510676Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6510760Z add.s16 %rs1162, %rs1162, 2368; 2026-02-21T09:15:08.6510829Z @%p298 bra $L__BB0_7; 2026-02-21T09:15:08.6510888Z bra.uni $L__BB0_16; 2026-02-21T09:15:08.6510988Z $L__BB0_7: // Parent Loop BB0_2 Depth=1 2026-02-21T09:15:08.6511091Z // => This Inner Loop Header: Depth=2 2026-02-21T09:15:08.6511278Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6511342Z cp.async.wait_group 0; 2026-02-21T09:15:08.6511404Z bar.sync 2, 256; 2026-02-21T09:15:08.6511499Z ld.shared.v4.b32 {%r875, %r876, %r877, %r878}, [%r109]; 2026-02-21T09:15:08.6511579Z mov.b32 {%rs5, %rs6}, %r878; 2026-02-21T09:15:08.6511640Z mov.b32 {%rs7, %rs8}, %r877; 2026-02-21T09:15:08.6511707Z mov.b32 {%rs9, %rs10}, %r876; 2026-02-21T09:15:08.6511768Z mov.b32 
{%rs11, %rs12}, %r875; 2026-02-21T09:15:08.6511869Z ld.shared.v4.b32 {%r879, %r880, %r881, %r882}, [%r109+16]; 2026-02-21T09:15:08.6511964Z mov.b32 {%rs13, %rs14}, %r882; 2026-02-21T09:15:08.6512024Z mov.b32 {%rs15, %rs16}, %r881; 2026-02-21T09:15:08.6512081Z mov.b32 {%rs17, %rs18}, %r880; 2026-02-21T09:15:08.6512146Z mov.b32 {%rs19, %rs20}, %r879; 2026-02-21T09:15:08.6512240Z ld.shared.v4.b32 {%r883, %r884, %r885, %r886}, [%r109+32]; 2026-02-21T09:15:08.6512328Z mov.b32 {%rs21, %rs22}, %r886; 2026-02-21T09:15:08.6512388Z mov.b32 {%rs23, %rs24}, %r885; 2026-02-21T09:15:08.6512452Z mov.b32 {%rs25, %rs26}, %r884; 2026-02-21T09:15:08.6512510Z mov.b32 {%rs27, %rs28}, %r883; 2026-02-21T09:15:08.6512601Z ld.shared.v4.b32 {%r887, %r888, %r889, %r890}, [%r109+48]; 2026-02-21T09:15:08.6512668Z mov.b32 {%rs29, %rs30}, %r890; 2026-02-21T09:15:08.6512726Z mov.b32 {%rs31, %rs32}, %r889; 2026-02-21T09:15:08.6512783Z mov.b32 {%rs33, %rs34}, %r888; 2026-02-21T09:15:08.6512842Z mov.b32 {%rs35, %rs36}, %r887; 2026-02-21T09:15:08.6512941Z ld.shared.v4.b32 {%r891, %r892, %r893, %r894}, [%r109+64]; 2026-02-21T09:15:08.6513000Z mov.b32 {%rs37, %rs38}, %r894; 2026-02-21T09:15:08.6513058Z mov.b32 {%rs39, %rs40}, %r893; 2026-02-21T09:15:08.6513122Z mov.b32 {%rs41, %rs42}, %r892; 2026-02-21T09:15:08.6513180Z mov.b32 {%rs43, %rs44}, %r891; 2026-02-21T09:15:08.6513268Z ld.shared.v4.b32 {%r895, %r896, %r897, %r898}, [%r109+80]; 2026-02-21T09:15:08.6513334Z mov.b32 {%rs45, %rs46}, %r898; 2026-02-21T09:15:08.6513393Z mov.b32 {%rs47, %rs48}, %r897; 2026-02-21T09:15:08.6513451Z mov.b32 {%rs49, %rs50}, %r896; 2026-02-21T09:15:08.6513507Z mov.b32 {%rs51, %rs52}, %r895; 2026-02-21T09:15:08.6513604Z ld.shared.v4.b32 {%r899, %r900, %r901, %r902}, [%r109+96]; 2026-02-21T09:15:08.6513663Z mov.b32 {%rs53, %rs54}, %r902; 2026-02-21T09:15:08.6513719Z mov.b32 {%rs55, %rs56}, %r901; 2026-02-21T09:15:08.6513784Z mov.b32 {%rs57, %rs58}, %r900; 2026-02-21T09:15:08.6513842Z mov.b32 {%rs59, %rs60}, %r899; 
2026-02-21T09:15:08.6513938Z ld.shared.v4.b32 {%r903, %r904, %r905, %r906}, [%r109+112]; 2026-02-21T09:15:08.6513999Z mov.b32 {%rs61, %rs62}, %r906; 2026-02-21T09:15:08.6514064Z mov.b32 {%rs63, %rs64}, %r905; 2026-02-21T09:15:08.6514121Z mov.b32 {%rs65, %rs66}, %r904; 2026-02-21T09:15:08.6514180Z mov.b32 {%rs67, %rs68}, %r903; 2026-02-21T09:15:08.6514282Z ld.shared.v4.b32 {%r907, %r908, %r909, %r910}, [%r109+128]; 2026-02-21T09:15:08.6514340Z mov.b32 {%rs69, %rs70}, %r910; 2026-02-21T09:15:08.6514399Z mov.b32 {%rs71, %rs72}, %r909; 2026-02-21T09:15:08.6514463Z mov.b32 {%rs73, %rs74}, %r908; 2026-02-21T09:15:08.6514519Z mov.b32 {%rs75, %rs76}, %r907; 2026-02-21T09:15:08.6514611Z ld.shared.v4.b32 {%r911, %r912, %r913, %r914}, [%r109+144]; 2026-02-21T09:15:08.6514669Z mov.b32 {%rs77, %rs78}, %r914; 2026-02-21T09:15:08.6514735Z mov.b32 {%rs79, %rs80}, %r913; 2026-02-21T09:15:08.6514792Z mov.b32 {%rs81, %rs82}, %r912; 2026-02-21T09:15:08.6514852Z mov.b32 {%rs83, %rs84}, %r911; 2026-02-21T09:15:08.6514978Z ld.shared.v4.b32 {%r915, %r916, %r917, %r918}, [%r109+160]; 2026-02-21T09:15:08.6515038Z mov.b32 {%rs85, %rs86}, %r918; 2026-02-21T09:15:08.6515099Z mov.b32 {%rs87, %rs88}, %r917; 2026-02-21T09:15:08.6515159Z mov.b32 {%rs89, %rs90}, %r916; 2026-02-21T09:15:08.6515229Z mov.b32 {%rs91, %rs92}, %r915; 2026-02-21T09:15:08.6515320Z ld.shared.v4.b32 {%r919, %r920, %r921, %r922}, [%r109+176]; 2026-02-21T09:15:08.6515381Z mov.b32 {%rs93, %rs94}, %r922; 2026-02-21T09:15:08.6515474Z mov.b32 {%rs95, %rs96}, %r921; 2026-02-21T09:15:08.6515532Z mov.b32 {%rs97, %rs98}, %r920; 2026-02-21T09:15:08.6515595Z mov.b32 {%rs99, %rs100}, %r919; 2026-02-21T09:15:08.6515691Z ld.shared.v4.b32 {%r923, %r924, %r925, %r926}, [%r109+192]; 2026-02-21T09:15:08.6515756Z mov.b32 {%rs101, %rs102}, %r926; 2026-02-21T09:15:08.6515819Z mov.b32 {%rs103, %rs104}, %r925; 2026-02-21T09:15:08.6515877Z mov.b32 {%rs105, %rs106}, %r924; 2026-02-21T09:15:08.6515945Z mov.b32 {%rs107, %rs108}, %r923; 
2026-02-21T09:15:08.6516038Z ld.shared.v4.b32 {%r927, %r928, %r929, %r930}, [%r109+208]; 2026-02-21T09:15:08.6516127Z mov.b32 {%rs109, %rs110}, %r930; 2026-02-21T09:15:08.6516198Z mov.b32 {%rs111, %rs112}, %r929; 2026-02-21T09:15:08.6516262Z mov.b32 {%rs113, %rs114}, %r928; 2026-02-21T09:15:08.6516321Z mov.b32 {%rs115, %rs116}, %r927; 2026-02-21T09:15:08.6516416Z ld.shared.v4.b32 {%r931, %r932, %r933, %r934}, [%r109+224]; 2026-02-21T09:15:08.6516505Z mov.b32 {%rs117, %rs118}, %r934; 2026-02-21T09:15:08.6516569Z mov.b32 {%rs119, %rs120}, %r933; 2026-02-21T09:15:08.6516629Z mov.b32 {%rs121, %rs122}, %r932; 2026-02-21T09:15:08.6516696Z mov.b32 {%rs123, %rs124}, %r931; 2026-02-21T09:15:08.6516788Z ld.shared.v4.b32 {%r935, %r936, %r937, %r938}, [%r109+240]; 2026-02-21T09:15:08.6516849Z mov.b32 {%rs125, %rs126}, %r938; 2026-02-21T09:15:08.6516917Z mov.b32 {%rs127, %rs128}, %r937; 2026-02-21T09:15:08.6516978Z mov.b32 {%rs129, %rs130}, %r936; 2026-02-21T09:15:08.6517038Z mov.b32 {%rs131, %rs132}, %r935; 2026-02-21T09:15:08.6517220Z .loc 1 52 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:52:32 2026-02-21T09:15:08.6517295Z cvt.f32.bf16 %r747, %rs11; 2026-02-21T09:15:08.6517360Z cvt.f32.bf16 %r748, %rs12; 2026-02-21T09:15:08.6517423Z cvt.f32.bf16 %r749, %rs9; 2026-02-21T09:15:08.6517491Z cvt.f32.bf16 %r750, %rs10; 2026-02-21T09:15:08.6517553Z cvt.f32.bf16 %r751, %rs7; 2026-02-21T09:15:08.6517614Z cvt.f32.bf16 %r752, %rs8; 2026-02-21T09:15:08.6517677Z cvt.f32.bf16 %r753, %rs5; 2026-02-21T09:15:08.6517744Z cvt.f32.bf16 %r754, %rs6; 2026-02-21T09:15:08.6517805Z cvt.f32.bf16 %r755, %rs19; 2026-02-21T09:15:08.6517865Z cvt.f32.bf16 %r756, %rs20; 2026-02-21T09:15:08.6517933Z cvt.f32.bf16 %r757, %rs17; 2026-02-21T09:15:08.6517993Z cvt.f32.bf16 %r758, %rs18; 2026-02-21T09:15:08.6518053Z cvt.f32.bf16 %r759, %rs15; 2026-02-21T09:15:08.6518111Z cvt.f32.bf16 %r760, %rs16; 2026-02-21T09:15:08.6518178Z cvt.f32.bf16 %r761, %rs13; 2026-02-21T09:15:08.6518240Z cvt.f32.bf16 
%r762, %rs14; 2026-02-21T09:15:08.6518301Z cvt.f32.bf16 %r763, %rs27; 2026-02-21T09:15:08.6518368Z cvt.f32.bf16 %r764, %rs28; 2026-02-21T09:15:08.6518428Z cvt.f32.bf16 %r765, %rs25; 2026-02-21T09:15:08.6518487Z cvt.f32.bf16 %r766, %rs26; 2026-02-21T09:15:08.6518553Z cvt.f32.bf16 %r767, %rs23; 2026-02-21T09:15:08.6518613Z cvt.f32.bf16 %r768, %rs24; 2026-02-21T09:15:08.6518672Z cvt.f32.bf16 %r769, %rs21; 2026-02-21T09:15:08.6518735Z cvt.f32.bf16 %r770, %rs22; 2026-02-21T09:15:08.6518804Z cvt.f32.bf16 %r771, %rs35; 2026-02-21T09:15:08.6518863Z cvt.f32.bf16 %r772, %rs36; 2026-02-21T09:15:08.6518923Z cvt.f32.bf16 %r773, %rs33; 2026-02-21T09:15:08.6518990Z cvt.f32.bf16 %r774, %rs34; 2026-02-21T09:15:08.6519049Z cvt.f32.bf16 %r775, %rs31; 2026-02-21T09:15:08.6519108Z cvt.f32.bf16 %r776, %rs32; 2026-02-21T09:15:08.6519167Z cvt.f32.bf16 %r777, %rs29; 2026-02-21T09:15:08.6519236Z cvt.f32.bf16 %r778, %rs30; 2026-02-21T09:15:08.6519294Z cvt.f32.bf16 %r779, %rs43; 2026-02-21T09:15:08.6519392Z cvt.f32.bf16 %r780, %rs44; 2026-02-21T09:15:08.6519460Z cvt.f32.bf16 %r781, %rs41; 2026-02-21T09:15:08.6519521Z cvt.f32.bf16 %r782, %rs42; 2026-02-21T09:15:08.6519582Z cvt.f32.bf16 %r783, %rs39; 2026-02-21T09:15:08.6519642Z cvt.f32.bf16 %r784, %rs40; 2026-02-21T09:15:08.6519710Z cvt.f32.bf16 %r785, %rs37; 2026-02-21T09:15:08.6519770Z cvt.f32.bf16 %r786, %rs38; 2026-02-21T09:15:08.6519829Z cvt.f32.bf16 %r787, %rs51; 2026-02-21T09:15:08.6519896Z cvt.f32.bf16 %r788, %rs52; 2026-02-21T09:15:08.6519980Z cvt.f32.bf16 %r789, %rs49; 2026-02-21T09:15:08.6520039Z cvt.f32.bf16 %r790, %rs50; 2026-02-21T09:15:08.6520099Z cvt.f32.bf16 %r791, %rs47; 2026-02-21T09:15:08.6520168Z cvt.f32.bf16 %r792, %rs48; 2026-02-21T09:15:08.6520228Z cvt.f32.bf16 %r793, %rs45; 2026-02-21T09:15:08.6520288Z cvt.f32.bf16 %r794, %rs46; 2026-02-21T09:15:08.6520354Z cvt.f32.bf16 %r795, %rs59; 2026-02-21T09:15:08.6520414Z cvt.f32.bf16 %r796, %rs60; 2026-02-21T09:15:08.6520473Z cvt.f32.bf16 %r797, %rs57; 
2026-02-21T09:15:08.6520536Z cvt.f32.bf16 %r798, %rs58; 2026-02-21T09:15:08.6520603Z cvt.f32.bf16 %r799, %rs55; 2026-02-21T09:15:08.6520684Z cvt.f32.bf16 %r800, %rs56; 2026-02-21T09:15:08.6520746Z cvt.f32.bf16 %r801, %rs53; 2026-02-21T09:15:08.6520814Z cvt.f32.bf16 %r802, %rs54; 2026-02-21T09:15:08.6520874Z cvt.f32.bf16 %r803, %rs67; 2026-02-21T09:15:08.6520933Z cvt.f32.bf16 %r804, %rs68; 2026-02-21T09:15:08.6521017Z cvt.f32.bf16 %r805, %rs65; 2026-02-21T09:15:08.6521088Z cvt.f32.bf16 %r806, %rs66; 2026-02-21T09:15:08.6521149Z cvt.f32.bf16 %r807, %rs63; 2026-02-21T09:15:08.6521208Z cvt.f32.bf16 %r808, %rs64; 2026-02-21T09:15:08.6521276Z cvt.f32.bf16 %r809, %rs61; 2026-02-21T09:15:08.6521336Z cvt.f32.bf16 %r810, %rs62; 2026-02-21T09:15:08.6521395Z cvt.f32.bf16 %r811, %rs75; 2026-02-21T09:15:08.6521463Z cvt.f32.bf16 %r812, %rs76; 2026-02-21T09:15:08.6521524Z cvt.f32.bf16 %r813, %rs73; 2026-02-21T09:15:08.6521611Z cvt.f32.bf16 %r814, %rs74; 2026-02-21T09:15:08.6521674Z cvt.f32.bf16 %r815, %rs71; 2026-02-21T09:15:08.6521746Z cvt.f32.bf16 %r816, %rs72; 2026-02-21T09:15:08.6521809Z cvt.f32.bf16 %r817, %rs69; 2026-02-21T09:15:08.6521871Z cvt.f32.bf16 %r818, %rs70; 2026-02-21T09:15:08.6521943Z cvt.f32.bf16 %r819, %rs83; 2026-02-21T09:15:08.6522003Z cvt.f32.bf16 %r820, %rs84; 2026-02-21T09:15:08.6522063Z cvt.f32.bf16 %r821, %rs81; 2026-02-21T09:15:08.6522123Z cvt.f32.bf16 %r822, %rs82; 2026-02-21T09:15:08.6522191Z cvt.f32.bf16 %r823, %rs79; 2026-02-21T09:15:08.6522253Z cvt.f32.bf16 %r824, %rs80; 2026-02-21T09:15:08.6522312Z cvt.f32.bf16 %r825, %rs77; 2026-02-21T09:15:08.6522378Z cvt.f32.bf16 %r826, %rs78; 2026-02-21T09:15:08.6522437Z cvt.f32.bf16 %r827, %rs91; 2026-02-21T09:15:08.6522495Z cvt.f32.bf16 %r828, %rs92; 2026-02-21T09:15:08.6522554Z cvt.f32.bf16 %r829, %rs89; 2026-02-21T09:15:08.6522621Z cvt.f32.bf16 %r830, %rs90; 2026-02-21T09:15:08.6522680Z cvt.f32.bf16 %r831, %rs87; 2026-02-21T09:15:08.6522740Z cvt.f32.bf16 %r832, %rs88; 2026-02-21T09:15:08.6522807Z 
cvt.f32.bf16 %r833, %rs85; 2026-02-21T09:15:08.6522869Z cvt.f32.bf16 %r834, %rs86; 2026-02-21T09:15:08.6522930Z cvt.f32.bf16 %r835, %rs99; 2026-02-21T09:15:08.6522994Z cvt.f32.bf16 %r836, %rs100; 2026-02-21T09:15:08.6523063Z cvt.f32.bf16 %r837, %rs97; 2026-02-21T09:15:08.6523123Z cvt.f32.bf16 %r838, %rs98; 2026-02-21T09:15:08.6523183Z cvt.f32.bf16 %r839, %rs95; 2026-02-21T09:15:08.6523251Z cvt.f32.bf16 %r840, %rs96; 2026-02-21T09:15:08.6523313Z cvt.f32.bf16 %r841, %rs93; 2026-02-21T09:15:08.6523372Z cvt.f32.bf16 %r842, %rs94; 2026-02-21T09:15:08.6523434Z cvt.f32.bf16 %r843, %rs107; 2026-02-21T09:15:08.6523504Z cvt.f32.bf16 %r844, %rs108; 2026-02-21T09:15:08.6523565Z cvt.f32.bf16 %r845, %rs105; 2026-02-21T09:15:08.6523625Z cvt.f32.bf16 %r846, %rs106; 2026-02-21T09:15:08.6523693Z cvt.f32.bf16 %r847, %rs103; 2026-02-21T09:15:08.6523752Z cvt.f32.bf16 %r848, %rs104; 2026-02-21T09:15:08.6523811Z cvt.f32.bf16 %r849, %rs101; 2026-02-21T09:15:08.6523871Z cvt.f32.bf16 %r850, %rs102; 2026-02-21T09:15:08.6523967Z cvt.f32.bf16 %r851, %rs115; 2026-02-21T09:15:08.6524028Z cvt.f32.bf16 %r852, %rs116; 2026-02-21T09:15:08.6524088Z cvt.f32.bf16 %r853, %rs113; 2026-02-21T09:15:08.6524155Z cvt.f32.bf16 %r854, %rs114; 2026-02-21T09:15:08.6524214Z cvt.f32.bf16 %r855, %rs111; 2026-02-21T09:15:08.6524275Z cvt.f32.bf16 %r856, %rs112; 2026-02-21T09:15:08.6524343Z cvt.f32.bf16 %r857, %rs109; 2026-02-21T09:15:08.6524404Z cvt.f32.bf16 %r858, %rs110; 2026-02-21T09:15:08.6524491Z cvt.f32.bf16 %r859, %rs123; 2026-02-21T09:15:08.6524550Z cvt.f32.bf16 %r860, %rs124; 2026-02-21T09:15:08.6524618Z cvt.f32.bf16 %r861, %rs121; 2026-02-21T09:15:08.6524687Z cvt.f32.bf16 %r862, %rs122; 2026-02-21T09:15:08.6524745Z cvt.f32.bf16 %r863, %rs119; 2026-02-21T09:15:08.6524809Z cvt.f32.bf16 %r864, %rs120; 2026-02-21T09:15:08.6524865Z cvt.f32.bf16 %r865, %rs117; 2026-02-21T09:15:08.6524922Z cvt.f32.bf16 %r866, %rs118; 2026-02-21T09:15:08.6524980Z cvt.f32.bf16 %r867, %rs131; 2026-02-21T09:15:08.6525047Z cvt.f32.bf16 
%r868, %rs132; 2026-02-21T09:15:08.6525104Z cvt.f32.bf16 %r869, %rs129; 2026-02-21T09:15:08.6525186Z cvt.f32.bf16 %r870, %rs130; 2026-02-21T09:15:08.6525253Z cvt.f32.bf16 %r871, %rs127; 2026-02-21T09:15:08.6525310Z cvt.f32.bf16 %r872, %rs128; 2026-02-21T09:15:08.6525367Z cvt.f32.bf16 %r873, %rs125; 2026-02-21T09:15:08.6525424Z cvt.f32.bf16 %r874, %rs126; 2026-02-21T09:15:08.6525630Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6525698Z ld.shared.b8 %rs133, [%r110]; 2026-02-21T09:15:08.6525762Z ld.shared.b8 %rs134, [%r110+128]; 2026-02-21T09:15:08.6525832Z ld.shared.b8 %rs135, [%r110+256]; 2026-02-21T09:15:08.6525894Z ld.shared.b8 %rs136, [%r110+384]; 2026-02-21T09:15:08.6525954Z ld.shared.b8 %rs137, [%r110+512]; 2026-02-21T09:15:08.6526019Z ld.shared.b8 %rs138, [%r110+640]; 2026-02-21T09:15:08.6526077Z ld.shared.b8 %rs139, [%r110+768]; 2026-02-21T09:15:08.6526138Z ld.shared.b8 %rs140, [%r110+896]; 2026-02-21T09:15:08.6526202Z ld.shared.b8 %rs141, [%r110+1024]; 2026-02-21T09:15:08.6526273Z ld.shared.b8 %rs142, [%r110+1152]; 2026-02-21T09:15:08.6526334Z ld.shared.b8 %rs143, [%r110+1280]; 2026-02-21T09:15:08.6526395Z ld.shared.b8 %rs144, [%r110+1408]; 2026-02-21T09:15:08.6526463Z ld.shared.b8 %rs145, [%r110+1536]; 2026-02-21T09:15:08.6526522Z ld.shared.b8 %rs146, [%r110+1664]; 2026-02-21T09:15:08.6526583Z ld.shared.b8 %rs147, [%r110+1792]; 2026-02-21T09:15:08.6526645Z ld.shared.b8 %rs148, [%r110+1920]; 2026-02-21T09:15:08.6526711Z ld.shared.b8 %rs149, [%r110+2048]; 2026-02-21T09:15:08.6526771Z ld.shared.b8 %rs150, [%r110+2176]; 2026-02-21T09:15:08.6526830Z ld.shared.b8 %rs151, [%r110+2304]; 2026-02-21T09:15:08.6526897Z ld.shared.b8 %rs152, [%r110+2432]; 2026-02-21T09:15:08.6526956Z ld.shared.b8 %rs153, [%r110+2560]; 2026-02-21T09:15:08.6527016Z ld.shared.b8 %rs154, [%r110+2688]; 2026-02-21T09:15:08.6527081Z ld.shared.b8 %rs155, [%r110+2816]; 2026-02-21T09:15:08.6527142Z ld.shared.b8 %rs156, [%r110+2944]; 
2026-02-21T09:15:08.6527204Z ld.shared.b8 %rs157, [%r110+3072]; 2026-02-21T09:15:08.6527264Z ld.shared.b8 %rs158, [%r110+3200]; 2026-02-21T09:15:08.6527331Z ld.shared.b8 %rs159, [%r110+3328]; 2026-02-21T09:15:08.6527389Z ld.shared.b8 %rs160, [%r110+3456]; 2026-02-21T09:15:08.6527451Z ld.shared.b8 %rs161, [%r110+3584]; 2026-02-21T09:15:08.6527518Z ld.shared.b8 %rs162, [%r110+3712]; 2026-02-21T09:15:08.6527581Z ld.shared.b8 %rs163, [%r110+3840]; 2026-02-21T09:15:08.6527640Z ld.shared.b8 %rs164, [%r110+3968]; 2026-02-21T09:15:08.6527809Z .loc 1 57 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:57:28 2026-02-21T09:15:08.6527877Z shl.b16 %rs165, %rs133, 4; 2026-02-21T09:15:08.6527936Z shl.b16 %rs166, %rs134, 4; 2026-02-21T09:15:08.6527994Z shl.b16 %rs167, %rs135, 4; 2026-02-21T09:15:08.6528059Z shl.b16 %rs168, %rs136, 4; 2026-02-21T09:15:08.6528117Z shl.b16 %rs169, %rs137, 4; 2026-02-21T09:15:08.6528196Z shl.b16 %rs170, %rs138, 4; 2026-02-21T09:15:08.6528264Z shl.b16 %rs171, %rs139, 4; 2026-02-21T09:15:08.6528325Z shl.b16 %rs172, %rs140, 4; 2026-02-21T09:15:08.6528384Z shl.b16 %rs173, %rs141, 4; 2026-02-21T09:15:08.6528448Z shl.b16 %rs174, %rs142, 4; 2026-02-21T09:15:08.6528514Z shl.b16 %rs175, %rs143, 4; 2026-02-21T09:15:08.6528571Z shl.b16 %rs176, %rs144, 4; 2026-02-21T09:15:08.6528628Z shl.b16 %rs177, %rs145, 4; 2026-02-21T09:15:08.6528693Z shl.b16 %rs178, %rs146, 4; 2026-02-21T09:15:08.6528769Z shl.b16 %rs179, %rs147, 4; 2026-02-21T09:15:08.6528826Z shl.b16 %rs180, %rs148, 4; 2026-02-21T09:15:08.6528883Z shl.b16 %rs181, %rs149, 4; 2026-02-21T09:15:08.6528948Z shl.b16 %rs182, %rs150, 4; 2026-02-21T09:15:08.6529005Z shl.b16 %rs183, %rs151, 4; 2026-02-21T09:15:08.6529063Z shl.b16 %rs184, %rs152, 4; 2026-02-21T09:15:08.6529127Z shl.b16 %rs185, %rs153, 4; 2026-02-21T09:15:08.6529185Z shl.b16 %rs186, %rs154, 4; 2026-02-21T09:15:08.6529243Z shl.b16 %rs187, %rs155, 4; 2026-02-21T09:15:08.6529302Z shl.b16 %rs188, %rs156, 4; 2026-02-21T09:15:08.6529368Z shl.b16 
%rs189, %rs157, 4; 2026-02-21T09:15:08.6529449Z shl.b16 %rs190, %rs158, 4; 2026-02-21T09:15:08.6529508Z shl.b16 %rs191, %rs159, 4; 2026-02-21T09:15:08.6529572Z shl.b16 %rs192, %rs160, 4; 2026-02-21T09:15:08.6529629Z shl.b16 %rs193, %rs161, 4; 2026-02-21T09:15:08.6529686Z shl.b16 %rs194, %rs162, 4; 2026-02-21T09:15:08.6529761Z shl.b16 %rs195, %rs163, 4; 2026-02-21T09:15:08.6529830Z shl.b16 %rs196, %rs164, 4; 2026-02-21T09:15:08.6530000Z .loc 1 72 58 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:72:58 2026-02-21T09:15:08.6530072Z selp.b16 %rs197, %rs165, %rs133, %p26; 2026-02-21T09:15:08.6530139Z cvt.s16.s8 %rs198, %rs197; 2026-02-21T09:15:08.6530196Z shr.s16 %rs199, %rs198, 4; 2026-02-21T09:15:08.6530265Z selp.b16 %rs200, %rs166, %rs134, %p26; 2026-02-21T09:15:08.6530332Z cvt.s16.s8 %rs201, %rs200; 2026-02-21T09:15:08.6530390Z shr.s16 %rs202, %rs201, 4; 2026-02-21T09:15:08.6530458Z selp.b16 %rs203, %rs167, %rs135, %p26; 2026-02-21T09:15:08.6530519Z cvt.s16.s8 %rs204, %rs203; 2026-02-21T09:15:08.6530588Z shr.s16 %rs205, %rs204, 4; 2026-02-21T09:15:08.6530654Z selp.b16 %rs206, %rs168, %rs136, %p26; 2026-02-21T09:15:08.6530712Z cvt.s16.s8 %rs207, %rs206; 2026-02-21T09:15:08.6530775Z shr.s16 %rs208, %rs207, 4; 2026-02-21T09:15:08.6530841Z selp.b16 %rs209, %rs169, %rs137, %p26; 2026-02-21T09:15:08.6530900Z cvt.s16.s8 %rs210, %rs209; 2026-02-21T09:15:08.6530960Z shr.s16 %rs211, %rs210, 4; 2026-02-21T09:15:08.6531033Z selp.b16 %rs212, %rs170, %rs138, %p26; 2026-02-21T09:15:08.6531090Z cvt.s16.s8 %rs213, %rs212; 2026-02-21T09:15:08.6531149Z shr.s16 %rs214, %rs213, 4; 2026-02-21T09:15:08.6531219Z selp.b16 %rs215, %rs171, %rs139, %p26; 2026-02-21T09:15:08.6531277Z cvt.s16.s8 %rs216, %rs215; 2026-02-21T09:15:08.6531334Z shr.s16 %rs217, %rs216, 4; 2026-02-21T09:15:08.6531399Z selp.b16 %rs218, %rs172, %rs140, %p26; 2026-02-21T09:15:08.6531466Z cvt.s16.s8 %rs219, %rs218; 2026-02-21T09:15:08.6531524Z shr.s16 %rs220, %rs219, 4; 2026-02-21T09:15:08.6531623Z selp.b16 
%rs221, %rs173, %rs141, %p26; 2026-02-21T09:15:08.6531688Z cvt.s16.s8 %rs222, %rs221; 2026-02-21T09:15:08.6531744Z shr.s16 %rs223, %rs222, 4; 2026-02-21T09:15:08.6531807Z selp.b16 %rs224, %rs174, %rs142, %p26; 2026-02-21T09:15:08.6531864Z cvt.s16.s8 %rs225, %rs224; 2026-02-21T09:15:08.6531928Z shr.s16 %rs226, %rs225, 4; 2026-02-21T09:15:08.6531992Z selp.b16 %rs227, %rs175, %rs143, %p26; 2026-02-21T09:15:08.6532050Z cvt.s16.s8 %rs228, %rs227; 2026-02-21T09:15:08.6532113Z shr.s16 %rs229, %rs228, 4; 2026-02-21T09:15:08.6532175Z selp.b16 %rs230, %rs176, %rs144, %p26; 2026-02-21T09:15:08.6532232Z cvt.s16.s8 %rs231, %rs230; 2026-02-21T09:15:08.6532297Z shr.s16 %rs232, %rs231, 4; 2026-02-21T09:15:08.6532359Z selp.b16 %rs233, %rs177, %rs145, %p26; 2026-02-21T09:15:08.6532417Z cvt.s16.s8 %rs234, %rs233; 2026-02-21T09:15:08.6532473Z shr.s16 %rs235, %rs234, 4; 2026-02-21T09:15:08.6532578Z selp.b16 %rs236, %rs178, %rs146, %p26; 2026-02-21T09:15:08.6532638Z cvt.s16.s8 %rs237, %rs236; 2026-02-21T09:15:08.6532697Z shr.s16 %rs238, %rs237, 4; 2026-02-21T09:15:08.6532769Z selp.b16 %rs239, %rs179, %rs147, %p26; 2026-02-21T09:15:08.6532828Z cvt.s16.s8 %rs240, %rs239; 2026-02-21T09:15:08.6532886Z shr.s16 %rs241, %rs240, 4; 2026-02-21T09:15:08.6532950Z selp.b16 %rs242, %rs180, %rs148, %p26; 2026-02-21T09:15:08.6533016Z cvt.s16.s8 %rs243, %rs242; 2026-02-21T09:15:08.6533097Z shr.s16 %rs244, %rs243, 4; 2026-02-21T09:15:08.6533162Z selp.b16 %rs245, %rs181, %rs149, %p26; 2026-02-21T09:15:08.6533226Z cvt.s16.s8 %rs246, %rs245; 2026-02-21T09:15:08.6533284Z shr.s16 %rs247, %rs246, 4; 2026-02-21T09:15:08.6533346Z selp.b16 %rs248, %rs182, %rs150, %p26; 2026-02-21T09:15:08.6533402Z cvt.s16.s8 %rs249, %rs248; 2026-02-21T09:15:08.6533467Z shr.s16 %rs250, %rs249, 4; 2026-02-21T09:15:08.6533531Z selp.b16 %rs251, %rs183, %rs151, %p26; 2026-02-21T09:15:08.6533591Z cvt.s16.s8 %rs252, %rs251; 2026-02-21T09:15:08.6533656Z shr.s16 %rs253, %rs252, 4; 2026-02-21T09:15:08.6533743Z selp.b16 %rs254, %rs184, 
%rs152, %p26; 2026-02-21T09:15:08.6533803Z cvt.s16.s8 %rs255, %rs254; 2026-02-21T09:15:08.6533862Z shr.s16 %rs256, %rs255, 4; 2026-02-21T09:15:08.6533933Z selp.b16 %rs257, %rs185, %rs153, %p26; 2026-02-21T09:15:08.6533990Z cvt.s16.s8 %rs258, %rs257; 2026-02-21T09:15:08.6534088Z shr.s16 %rs259, %rs258, 4; 2026-02-21T09:15:08.6534166Z selp.b16 %rs260, %rs186, %rs154, %p26; 2026-02-21T09:15:08.6534226Z cvt.s16.s8 %rs261, %rs260; 2026-02-21T09:15:08.6534283Z shr.s16 %rs262, %rs261, 4; 2026-02-21T09:15:08.6534353Z selp.b16 %rs263, %rs187, %rs155, %p26; 2026-02-21T09:15:08.6534410Z cvt.s16.s8 %rs264, %rs263; 2026-02-21T09:15:08.6534468Z shr.s16 %rs265, %rs264, 4; 2026-02-21T09:15:08.6534533Z selp.b16 %rs266, %rs188, %rs156, %p26; 2026-02-21T09:15:08.6534601Z cvt.s16.s8 %rs267, %rs266; 2026-02-21T09:15:08.6534661Z shr.s16 %rs268, %rs267, 4; 2026-02-21T09:15:08.6534727Z selp.b16 %rs269, %rs189, %rs157, %p26; 2026-02-21T09:15:08.6534795Z cvt.s16.s8 %rs270, %rs269; 2026-02-21T09:15:08.6534855Z shr.s16 %rs271, %rs270, 4; 2026-02-21T09:15:08.6534921Z selp.b16 %rs272, %rs190, %rs158, %p26; 2026-02-21T09:15:08.6534978Z cvt.s16.s8 %rs273, %rs272; 2026-02-21T09:15:08.6535044Z shr.s16 %rs274, %rs273, 4; 2026-02-21T09:15:08.6535108Z selp.b16 %rs275, %rs191, %rs159, %p26; 2026-02-21T09:15:08.6535166Z cvt.s16.s8 %rs276, %rs275; 2026-02-21T09:15:08.6535232Z shr.s16 %rs277, %rs276, 4; 2026-02-21T09:15:08.6535296Z selp.b16 %rs278, %rs192, %rs160, %p26; 2026-02-21T09:15:08.6535354Z cvt.s16.s8 %rs279, %rs278; 2026-02-21T09:15:08.6535413Z shr.s16 %rs280, %rs279, 4; 2026-02-21T09:15:08.6535484Z selp.b16 %rs281, %rs193, %rs161, %p26; 2026-02-21T09:15:08.6535541Z cvt.s16.s8 %rs282, %rs281; 2026-02-21T09:15:08.6535598Z shr.s16 %rs283, %rs282, 4; 2026-02-21T09:15:08.6535670Z selp.b16 %rs284, %rs194, %rs162, %p26; 2026-02-21T09:15:08.6535729Z cvt.s16.s8 %rs285, %rs284; 2026-02-21T09:15:08.6535786Z shr.s16 %rs286, %rs285, 4; 2026-02-21T09:15:08.6535858Z selp.b16 %rs287, %rs195, %rs163, %p26; 
2026-02-21T09:15:08.6535916Z cvt.s16.s8 %rs288, %rs287; 2026-02-21T09:15:08.6535973Z shr.s16 %rs289, %rs288, 4; 2026-02-21T09:15:08.6536036Z selp.b16 %rs290, %rs196, %rs164, %p26; 2026-02-21T09:15:08.6536101Z cvt.s16.s8 %rs291, %rs290; 2026-02-21T09:15:08.6536159Z shr.s16 %rs292, %rs291, 4; 2026-02-21T09:15:08.6536333Z .loc 1 77 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:77:32 2026-02-21T09:15:08.6536402Z cvt.rn.f32.s16 %r939, %rs199; 2026-02-21T09:15:08.6536464Z cvt.rn.f32.s16 %r940, %rs202; 2026-02-21T09:15:08.6536524Z cvt.rn.f32.s16 %r941, %rs205; 2026-02-21T09:15:08.6536584Z cvt.rn.f32.s16 %r942, %rs208; 2026-02-21T09:15:08.6536650Z cvt.rn.f32.s16 %r943, %rs211; 2026-02-21T09:15:08.6536708Z cvt.rn.f32.s16 %r944, %rs214; 2026-02-21T09:15:08.6536767Z cvt.rn.f32.s16 %r945, %rs217; 2026-02-21T09:15:08.6536857Z cvt.rn.f32.s16 %r946, %rs220; 2026-02-21T09:15:08.6536917Z cvt.rn.f32.s16 %r947, %rs223; 2026-02-21T09:15:08.6536977Z cvt.rn.f32.s16 %r948, %rs226; 2026-02-21T09:15:08.6537037Z cvt.rn.f32.s16 %r949, %rs229; 2026-02-21T09:15:08.6537103Z cvt.rn.f32.s16 %r950, %rs232; 2026-02-21T09:15:08.6537162Z cvt.rn.f32.s16 %r951, %rs235; 2026-02-21T09:15:08.6537219Z cvt.rn.f32.s16 %r952, %rs238; 2026-02-21T09:15:08.6537285Z cvt.rn.f32.s16 %r953, %rs241; 2026-02-21T09:15:08.6537365Z cvt.rn.f32.s16 %r954, %rs244; 2026-02-21T09:15:08.6537423Z cvt.rn.f32.s16 %r955, %rs247; 2026-02-21T09:15:08.6537488Z cvt.rn.f32.s16 %r956, %rs250; 2026-02-21T09:15:08.6537544Z cvt.rn.f32.s16 %r957, %rs253; 2026-02-21T09:15:08.6537603Z cvt.rn.f32.s16 %r958, %rs256; 2026-02-21T09:15:08.6537660Z cvt.rn.f32.s16 %r959, %rs259; 2026-02-21T09:15:08.6537724Z cvt.rn.f32.s16 %r960, %rs262; 2026-02-21T09:15:08.6537781Z cvt.rn.f32.s16 %r961, %rs265; 2026-02-21T09:15:08.6537840Z cvt.rn.f32.s16 %r962, %rs268; 2026-02-21T09:15:08.6537904Z cvt.rn.f32.s16 %r963, %rs271; 2026-02-21T09:15:08.6537984Z cvt.rn.f32.s16 %r964, %rs274; 2026-02-21T09:15:08.6538044Z cvt.rn.f32.s16 %r965, %rs277; 
2026-02-21T09:15:08.6538101Z cvt.rn.f32.s16 %r966, %rs280; 2026-02-21T09:15:08.6538165Z cvt.rn.f32.s16 %r967, %rs283; 2026-02-21T09:15:08.6538222Z cvt.rn.f32.s16 %r968, %rs286; 2026-02-21T09:15:08.6538280Z cvt.rn.f32.s16 %r969, %rs289; 2026-02-21T09:15:08.6538363Z cvt.rn.f32.s16 %r970, %rs292; 2026-02-21T09:15:08.6538423Z st.shared.b32 [%r104], %r939; 2026-02-21T09:15:08.6538485Z st.shared.b32 [%r104+4096], %r943; 2026-02-21T09:15:08.6538547Z st.shared.b32 [%r104+8192], %r947; 2026-02-21T09:15:08.6538620Z st.shared.b32 [%r104+12288], %r951; 2026-02-21T09:15:08.6538686Z st.shared.b32 [%r104+16384], %r955; 2026-02-21T09:15:08.6538748Z st.shared.b32 [%r104+20480], %r959; 2026-02-21T09:15:08.6538814Z st.shared.b32 [%r104+24576], %r963; 2026-02-21T09:15:08.6538874Z st.shared.b32 [%r104+28672], %r967; 2026-02-21T09:15:08.6538934Z st.shared.b32 [%r105], %r940; 2026-02-21T09:15:08.6539003Z st.shared.b32 [%r105+4096], %r944; 2026-02-21T09:15:08.6539064Z st.shared.b32 [%r105+8192], %r948; 2026-02-21T09:15:08.6539123Z st.shared.b32 [%r105+12288], %r952; 2026-02-21T09:15:08.6539182Z st.shared.b32 [%r105+16384], %r956; 2026-02-21T09:15:08.6539249Z st.shared.b32 [%r105+20480], %r960; 2026-02-21T09:15:08.6539310Z st.shared.b32 [%r105+24576], %r964; 2026-02-21T09:15:08.6539372Z st.shared.b32 [%r105+28672], %r968; 2026-02-21T09:15:08.6539437Z st.shared.b32 [%r106], %r941; 2026-02-21T09:15:08.6539496Z st.shared.b32 [%r106+4096], %r945; 2026-02-21T09:15:08.6539556Z st.shared.b32 [%r106+8192], %r949; 2026-02-21T09:15:08.6539616Z st.shared.b32 [%r106+12288], %r953; 2026-02-21T09:15:08.6539684Z st.shared.b32 [%r106+16384], %r957; 2026-02-21T09:15:08.6539745Z st.shared.b32 [%r106+20480], %r961; 2026-02-21T09:15:08.6539805Z st.shared.b32 [%r106+24576], %r965; 2026-02-21T09:15:08.6539874Z st.shared.b32 [%r106+28672], %r969; 2026-02-21T09:15:08.6539933Z st.shared.b32 [%r107], %r942; 2026-02-21T09:15:08.6539995Z st.shared.b32 [%r107+4096], %r946; 2026-02-21T09:15:08.6540062Z st.shared.b32 
[%r107+8192], %r950; 2026-02-21T09:15:08.6540123Z st.shared.b32 [%r107+12288], %r954; 2026-02-21T09:15:08.6540183Z st.shared.b32 [%r107+16384], %r958; 2026-02-21T09:15:08.6540243Z st.shared.b32 [%r107+20480], %r962; 2026-02-21T09:15:08.6540313Z st.shared.b32 [%r107+24576], %r966; 2026-02-21T09:15:08.6540375Z st.shared.b32 [%r107+28672], %r970; 2026-02-21T09:15:08.6540431Z $L__tmp12: 2026-02-21T09:15:08.6540662Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6540722Z shl.b32 %r120, %r2371, 5; 2026-02-21T09:15:08.6540778Z $L__tmp13: 2026-02-21T09:15:08.6540950Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6541039Z shl.b32 %r971, %r2371, 3; 2026-02-21T09:15:08.6541099Z add.s32 %r973, %r200, %r971; 2026-02-21T09:15:08.6541160Z add.s32 %r744, %r973, 368704; 2026-02-21T09:15:08.6541224Z $L__tmp14: 2026-02-21T09:15:08.6541448Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6541510Z // begin inline asm 2026-02-21T09:15:08.6541605Z 2026-02-21T09:15:08.6541659Z { 2026-02-21T09:15:08.6541755Z .reg .pred complete; 2026-02-21T09:15:08.6541811Z waitLoop: 2026-02-21T09:15:08.6541942Z mbarrier.try_wait.parity.shared.b64 complete, [%r744], %r2370; 2026-02-21T09:15:08.6542010Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6542062Z } 2026-02-21T09:15:08.6542066Z 2026-02-21T09:15:08.6542131Z // end inline asm 2026-02-21T09:15:08.6542208Z shfl.sync.idx.b32 %r121, %r10, 0, 31, -1; 2026-02-21T09:15:08.6542266Z shl.b32 %r974, %r121, 21; 2026-02-21T09:15:08.6542326Z and.b32 %r975, %r974, 6291456; 2026-02-21T09:15:08.6542393Z add.s32 %r976, %r975, %r1975; 2026-02-21T09:15:08.6542454Z shl.b32 %r977, %r121, 5; 2026-02-21T09:15:08.6542542Z and.b32 %r978, %r977, 128; 2026-02-21T09:15:08.6542615Z add.s32 %r1746, %r976, %r978; 2026-02-21T09:15:08.6542678Z mov.pred %p30, -1; 
2026-02-21T09:15:08.6542735Z // begin inline asm 2026-02-21T09:15:08.6544151Z @%p30 tcgen05.st.sync.aligned.32x32b.x128.b32 [%r1746 + 0], {%r747, %r748, %r749, %r750, %r751, %r752, %r753, %r754, %r755, %r756, %r757, %r758, %r759, %r760, %r761, %r762, %r763, %r764, %r765, %r766, %r767, %r768, %r769, %r770, %r771, %r772, %r773, %r774, %r775, %r776, %r777, %r778, %r779, %r780, %r781, %r782, %r783, %r784, %r785, %r786, %r787, %r788, %r789, %r790, %r791, %r792, %r793, %r794, %r795, %r796, %r797, %r798, %r799, %r800, %r801, %r802, %r803, %r804, %r805, %r806, %r807, %r808, %r809, %r810, %r811, %r812, %r813, %r814, %r815, %r816, %r817, %r818, %r819, %r820, %r821, %r822, %r823, %r824, %r825, %r826, %r827, %r828, %r829, %r830, %r831, %r832, %r833, %r834, %r835, %r836, %r837, %r838, %r839, %r840, %r841, %r842, %r843, %r844, %r845, %r846, %r847, %r848, %r849, %r850, %r851, %r852, %r853, %r854, %r855, %r856, %r857, %r858, %r859, %r860, %r861, %r862, %r863, %r864, %r865, %r866, %r867, %r868, %r869, %r870, %r871, %r872, %r873, %r874}; 2026-02-21T09:15:08.6544213Z // end inline asm 2026-02-21T09:15:08.6544276Z // begin inline asm 2026-02-21T09:15:08.6544348Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:15:08.6544405Z // end inline asm 2026-02-21T09:15:08.6544463Z bar.sync 2, 256; 2026-02-21T09:15:08.6544526Z // begin inline asm 2026-02-21T09:15:08.6544597Z fence.proxy.async.shared::cta; 2026-02-21T09:15:08.6544650Z // end inline asm 2026-02-21T09:15:08.6544710Z bar.sync 2, 256; 2026-02-21T09:15:08.6544773Z setp.ne.b32 %p27, %r121, 0; 2026-02-21T09:15:08.6544830Z @%p27 bra $L__BB0_9; 2026-02-21T09:15:08.6544883Z $L__tmp15: 2026-02-21T09:15:08.6544989Z // %bb.8: // in Loop: Header=BB0_7 Depth=2 2026-02-21T09:15:08.6545164Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6545229Z add.s32 %r1078, %r973, 368688; 2026-02-21T09:15:08.6545292Z $L__tmp16: 2026-02-21T09:15:08.6545510Z .loc 2 291 36 // standard.py:291:36 @[ 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6545572Z add.s32 %r979, %r120, %r4; 2026-02-21T09:15:08.6545644Z elect.sync %r1079|%p29, -1; 2026-02-21T09:15:08.6545724Z mov.b32 %r981, 134744336; 2026-02-21T09:15:08.6545781Z mov.pred %p28, 0; 2026-02-21T09:15:08.6545839Z // begin inline asm 2026-02-21T09:15:08.6546007Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 0 ], %rd340, %r981, %p28; 2026-02-21T09:15:08.6546064Z // end inline asm 2026-02-21T09:15:08.6546120Z // begin inline asm 2026-02-21T09:15:08.6546278Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 8 ], %rd341, %r981, %p30; 2026-02-21T09:15:08.6546360Z // end inline asm 2026-02-21T09:15:08.6546416Z // begin inline asm 2026-02-21T09:15:08.6546573Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 16 ], %rd342, %r981, %p30; 2026-02-21T09:15:08.6546629Z // end inline asm 2026-02-21T09:15:08.6546685Z // begin inline asm 2026-02-21T09:15:08.6546829Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 24 ], %rd343, %r981, %p30; 2026-02-21T09:15:08.6546891Z // end inline asm 2026-02-21T09:15:08.6546948Z // begin inline asm 2026-02-21T09:15:08.6547110Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 32 ], %rd344, %r981, %p30; 2026-02-21T09:15:08.6547172Z // end inline asm 2026-02-21T09:15:08.6547228Z // begin inline asm 2026-02-21T09:15:08.6547366Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 40 ], %rd345, %r981, %p30; 2026-02-21T09:15:08.6547428Z // end inline asm 2026-02-21T09:15:08.6547483Z // begin inline asm 2026-02-21T09:15:08.6547620Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 48 ], %rd346, %r981, %p30; 2026-02-21T09:15:08.6547682Z // end inline asm 2026-02-21T09:15:08.6547757Z // begin inline asm 2026-02-21T09:15:08.6547898Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 56 ], %rd347, %r981, %p30; 
2026-02-21T09:15:08.6547953Z // end inline asm 2026-02-21T09:15:08.6548016Z // begin inline asm 2026-02-21T09:15:08.6548179Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 64 ], %rd348, %r981, %p30; 2026-02-21T09:15:08.6548237Z // end inline asm 2026-02-21T09:15:08.6548302Z // begin inline asm 2026-02-21T09:15:08.6548442Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 72 ], %rd349, %r981, %p30; 2026-02-21T09:15:08.6548497Z // end inline asm 2026-02-21T09:15:08.6548558Z // begin inline asm 2026-02-21T09:15:08.6548699Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 80 ], %rd350, %r981, %p30; 2026-02-21T09:15:08.6548754Z // end inline asm 2026-02-21T09:15:08.6548813Z // begin inline asm 2026-02-21T09:15:08.6548960Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 88 ], %rd351, %r981, %p30; 2026-02-21T09:15:08.6549015Z // end inline asm 2026-02-21T09:15:08.6549070Z // begin inline asm 2026-02-21T09:15:08.6549216Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 96 ], %rd352, %r981, %p30; 2026-02-21T09:15:08.6549270Z // end inline asm 2026-02-21T09:15:08.6549326Z // begin inline asm 2026-02-21T09:15:08.6549481Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 104 ], %rd353, %r981, %p30; 2026-02-21T09:15:08.6549536Z // end inline asm 2026-02-21T09:15:08.6549593Z // begin inline asm 2026-02-21T09:15:08.6549743Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 112 ], %rd354, %r981, %p30; 2026-02-21T09:15:08.6549796Z // end inline asm 2026-02-21T09:15:08.6549852Z // begin inline asm 2026-02-21T09:15:08.6549993Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 120 ], %rd355, %r981, %p30; 2026-02-21T09:15:08.6550055Z // end inline asm 2026-02-21T09:15:08.6550111Z // begin inline asm 2026-02-21T09:15:08.6550254Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 128 ], %rd356, %r981, %p30; 
2026-02-21T09:15:08.6550319Z // end inline asm 2026-02-21T09:15:08.6550374Z // begin inline asm 2026-02-21T09:15:08.6550514Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 136 ], %rd357, %r981, %p30; 2026-02-21T09:15:08.6550579Z // end inline asm 2026-02-21T09:15:08.6550636Z // begin inline asm 2026-02-21T09:15:08.6550779Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 144 ], %rd358, %r981, %p30; 2026-02-21T09:15:08.6550843Z // end inline asm 2026-02-21T09:15:08.6550901Z // begin inline asm 2026-02-21T09:15:08.6551040Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 152 ], %rd359, %r981, %p30; 2026-02-21T09:15:08.6551096Z // end inline asm 2026-02-21T09:15:08.6551163Z // begin inline asm 2026-02-21T09:15:08.6551329Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 160 ], %rd360, %r981, %p30; 2026-02-21T09:15:08.6551385Z // end inline asm 2026-02-21T09:15:08.6551450Z // begin inline asm 2026-02-21T09:15:08.6551639Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 168 ], %rd361, %r981, %p30; 2026-02-21T09:15:08.6551695Z // end inline asm 2026-02-21T09:15:08.6551758Z // begin inline asm 2026-02-21T09:15:08.6551903Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 176 ], %rd362, %r981, %p30; 2026-02-21T09:15:08.6551989Z // end inline asm 2026-02-21T09:15:08.6552052Z // begin inline asm 2026-02-21T09:15:08.6552192Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 184 ], %rd363, %r981, %p30; 2026-02-21T09:15:08.6552247Z // end inline asm 2026-02-21T09:15:08.6552302Z // begin inline asm 2026-02-21T09:15:08.6552451Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 192 ], %rd364, %r981, %p30; 2026-02-21T09:15:08.6552507Z // end inline asm 2026-02-21T09:15:08.6552563Z // begin inline asm 2026-02-21T09:15:08.6552796Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 200 ], %rd365, %r981, %p30; 
2026-02-21T09:15:08.6552852Z // end inline asm 2026-02-21T09:15:08.6552910Z // begin inline asm 2026-02-21T09:15:08.6553057Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 208 ], %rd366, %r981, %p30; 2026-02-21T09:15:08.6553135Z // end inline asm 2026-02-21T09:15:08.6553195Z // begin inline asm 2026-02-21T09:15:08.6553342Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 216 ], %rd367, %r981, %p30; 2026-02-21T09:15:08.6553397Z // end inline asm 2026-02-21T09:15:08.6553452Z // begin inline asm 2026-02-21T09:15:08.6553593Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 224 ], %rd368, %r981, %p30; 2026-02-21T09:15:08.6553657Z // end inline asm 2026-02-21T09:15:08.6553713Z // begin inline asm 2026-02-21T09:15:08.6553855Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 232 ], %rd369, %r981, %p30; 2026-02-21T09:15:08.6553920Z // end inline asm 2026-02-21T09:15:08.6553977Z // begin inline asm 2026-02-21T09:15:08.6554117Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 240 ], %rd370, %r981, %p30; 2026-02-21T09:15:08.6554180Z // end inline asm 2026-02-21T09:15:08.6554236Z // begin inline asm 2026-02-21T09:15:08.6554379Z @%p29 tcgen05.mma.cta_group::1.kind::tf32 [ %r979 + 0 ], [ %r1975 + 248 ], %rd371, %r981, %p30; 2026-02-21T09:15:08.6554436Z // end inline asm 2026-02-21T09:15:08.6554508Z cvt.u64.u32 %rd273, %r1078; 2026-02-21T09:15:08.6554565Z // begin inline asm 2026-02-21T09:15:08.6554692Z @%p29 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd273]; 2026-02-21T09:15:08.6554756Z // end inline asm 2026-02-21T09:15:08.6554810Z $L__tmp17: 2026-02-21T09:15:08.6554912Z $L__BB0_9: // in Loop: Header=BB0_7 Depth=2 2026-02-21T09:15:08.6555095Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6555198Z ld.shared.v4.b32 {%r1211, %r1212, %r1213, %r1214}, [%r111]; 2026-02-21T09:15:08.6555264Z mov.b32 {%rs293, %rs294}, %r1214; 
2026-02-21T09:15:08.6555327Z mov.b32 {%rs295, %rs296}, %r1213; 2026-02-21T09:15:08.6555394Z mov.b32 {%rs297, %rs298}, %r1212; 2026-02-21T09:15:08.6555453Z mov.b32 {%rs299, %rs300}, %r1211; 2026-02-21T09:15:08.6555554Z ld.shared.v4.b32 {%r1215, %r1216, %r1217, %r1218}, [%r111+16]; 2026-02-21T09:15:08.6555623Z mov.b32 {%rs301, %rs302}, %r1218; 2026-02-21T09:15:08.6555681Z mov.b32 {%rs303, %rs304}, %r1217; 2026-02-21T09:15:08.6555739Z mov.b32 {%rs305, %rs306}, %r1216; 2026-02-21T09:15:08.6555802Z mov.b32 {%rs307, %rs308}, %r1215; 2026-02-21T09:15:08.6555898Z ld.shared.v4.b32 {%r1219, %r1220, %r1221, %r1222}, [%r111+32]; 2026-02-21T09:15:08.6555956Z mov.b32 {%rs309, %rs310}, %r1222; 2026-02-21T09:15:08.6556014Z mov.b32 {%rs311, %rs312}, %r1221; 2026-02-21T09:15:08.6556111Z mov.b32 {%rs313, %rs314}, %r1220; 2026-02-21T09:15:08.6556170Z mov.b32 {%rs315, %rs316}, %r1219; 2026-02-21T09:15:08.6556268Z ld.shared.v4.b32 {%r1223, %r1224, %r1225, %r1226}, [%r111+48]; 2026-02-21T09:15:08.6556335Z mov.b32 {%rs317, %rs318}, %r1226; 2026-02-21T09:15:08.6556393Z mov.b32 {%rs319, %rs320}, %r1225; 2026-02-21T09:15:08.6556450Z mov.b32 {%rs321, %rs322}, %r1224; 2026-02-21T09:15:08.6556515Z mov.b32 {%rs323, %rs324}, %r1223; 2026-02-21T09:15:08.6556635Z ld.shared.v4.b32 {%r1227, %r1228, %r1229, %r1230}, [%r111+64]; 2026-02-21T09:15:08.6556694Z mov.b32 {%rs325, %rs326}, %r1230; 2026-02-21T09:15:08.6556752Z mov.b32 {%rs327, %rs328}, %r1229; 2026-02-21T09:15:08.6556818Z mov.b32 {%rs329, %rs330}, %r1228; 2026-02-21T09:15:08.6556875Z mov.b32 {%rs331, %rs332}, %r1227; 2026-02-21T09:15:08.6556969Z ld.shared.v4.b32 {%r1231, %r1232, %r1233, %r1234}, [%r111+80]; 2026-02-21T09:15:08.6557033Z mov.b32 {%rs333, %rs334}, %r1234; 2026-02-21T09:15:08.6557092Z mov.b32 {%rs335, %rs336}, %r1233; 2026-02-21T09:15:08.6557150Z mov.b32 {%rs337, %rs338}, %r1232; 2026-02-21T09:15:08.6557230Z mov.b32 {%rs339, %rs340}, %r1231; 2026-02-21T09:15:08.6557330Z ld.shared.v4.b32 {%r1235, %r1236, %r1237, %r1238}, 
[%r111+96]; 2026-02-21T09:15:08.6557388Z mov.b32 {%rs341, %rs342}, %r1238; 2026-02-21T09:15:08.6557446Z mov.b32 {%rs343, %rs344}, %r1237; 2026-02-21T09:15:08.6557511Z mov.b32 {%rs345, %rs346}, %r1236; 2026-02-21T09:15:08.6557589Z mov.b32 {%rs347, %rs348}, %r1235; 2026-02-21T09:15:08.6557695Z ld.shared.v4.b32 {%r1239, %r1240, %r1241, %r1242}, [%r111+112]; 2026-02-21T09:15:08.6557762Z mov.b32 {%rs349, %rs350}, %r1242; 2026-02-21T09:15:08.6557820Z mov.b32 {%rs351, %rs352}, %r1241; 2026-02-21T09:15:08.6557879Z mov.b32 {%rs353, %rs354}, %r1240; 2026-02-21T09:15:08.6557936Z mov.b32 {%rs355, %rs356}, %r1239; 2026-02-21T09:15:08.6558044Z ld.shared.v4.b32 {%r1243, %r1244, %r1245, %r1246}, [%r111+128]; 2026-02-21T09:15:08.6558103Z mov.b32 {%rs357, %rs358}, %r1246; 2026-02-21T09:15:08.6558165Z mov.b32 {%rs359, %rs360}, %r1245; 2026-02-21T09:15:08.6558233Z mov.b32 {%rs361, %rs362}, %r1244; 2026-02-21T09:15:08.6558293Z mov.b32 {%rs363, %rs364}, %r1243; 2026-02-21T09:15:08.6558390Z ld.shared.v4.b32 {%r1247, %r1248, %r1249, %r1250}, [%r111+144]; 2026-02-21T09:15:08.6558449Z mov.b32 {%rs365, %rs366}, %r1250; 2026-02-21T09:15:08.6558516Z mov.b32 {%rs367, %rs368}, %r1249; 2026-02-21T09:15:08.6558575Z mov.b32 {%rs369, %rs370}, %r1248; 2026-02-21T09:15:08.6558634Z mov.b32 {%rs371, %rs372}, %r1247; 2026-02-21T09:15:08.6558743Z ld.shared.v4.b32 {%r1251, %r1252, %r1253, %r1254}, [%r111+160]; 2026-02-21T09:15:08.6558806Z mov.b32 {%rs373, %rs374}, %r1254; 2026-02-21T09:15:08.6558868Z mov.b32 {%rs375, %rs376}, %r1253; 2026-02-21T09:15:08.6558938Z mov.b32 {%rs377, %rs378}, %r1252; 2026-02-21T09:15:08.6559001Z mov.b32 {%rs379, %rs380}, %r1251; 2026-02-21T09:15:08.6559107Z ld.shared.v4.b32 {%r1255, %r1256, %r1257, %r1258}, [%r111+176]; 2026-02-21T09:15:08.6559172Z mov.b32 {%rs381, %rs382}, %r1258; 2026-02-21T09:15:08.6559242Z mov.b32 {%rs383, %rs384}, %r1257; 2026-02-21T09:15:08.6559305Z mov.b32 {%rs385, %rs386}, %r1256; 2026-02-21T09:15:08.6559366Z mov.b32 {%rs387, %rs388}, %r1255; 
2026-02-21T09:15:08.6559472Z ld.shared.v4.b32 {%r1259, %r1260, %r1261, %r1262}, [%r111+192]; 2026-02-21T09:15:08.6559535Z mov.b32 {%rs389, %rs390}, %r1262; 2026-02-21T09:15:08.6559598Z mov.b32 {%rs391, %rs392}, %r1261; 2026-02-21T09:15:08.6559669Z mov.b32 {%rs393, %rs394}, %r1260; 2026-02-21T09:15:08.6559730Z mov.b32 {%rs395, %rs396}, %r1259; 2026-02-21T09:15:08.6559828Z ld.shared.v4.b32 {%r1263, %r1264, %r1265, %r1266}, [%r111+208]; 2026-02-21T09:15:08.6559890Z mov.b32 {%rs397, %rs398}, %r1266; 2026-02-21T09:15:08.6559957Z mov.b32 {%rs399, %rs400}, %r1265; 2026-02-21T09:15:08.6560018Z mov.b32 {%rs401, %rs402}, %r1264; 2026-02-21T09:15:08.6560079Z mov.b32 {%rs403, %rs404}, %r1263; 2026-02-21T09:15:08.6560186Z ld.shared.v4.b32 {%r1267, %r1268, %r1269, %r1270}, [%r111+224]; 2026-02-21T09:15:08.6560270Z mov.b32 {%rs405, %rs406}, %r1270; 2026-02-21T09:15:08.6560333Z mov.b32 {%rs407, %rs408}, %r1269; 2026-02-21T09:15:08.6560394Z mov.b32 {%rs409, %rs410}, %r1268; 2026-02-21T09:15:08.6560462Z mov.b32 {%rs411, %rs412}, %r1267; 2026-02-21T09:15:08.6560560Z ld.shared.v4.b32 {%r1271, %r1272, %r1273, %r1274}, [%r111+240]; 2026-02-21T09:15:08.6560622Z mov.b32 {%rs413, %rs414}, %r1274; 2026-02-21T09:15:08.6560692Z mov.b32 {%rs415, %rs416}, %r1273; 2026-02-21T09:15:08.6560773Z mov.b32 {%rs417, %rs418}, %r1272; 2026-02-21T09:15:08.6560835Z mov.b32 {%rs419, %rs420}, %r1271; 2026-02-21T09:15:08.6561021Z .loc 1 52 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:52:32 2026-02-21T09:15:08.6561087Z cvt.f32.bf16 %r1083, %rs299; 2026-02-21T09:15:08.6561152Z cvt.f32.bf16 %r1084, %rs300; 2026-02-21T09:15:08.6561212Z cvt.f32.bf16 %r1085, %rs297; 2026-02-21T09:15:08.6561282Z cvt.f32.bf16 %r1086, %rs298; 2026-02-21T09:15:08.6561344Z cvt.f32.bf16 %r1087, %rs295; 2026-02-21T09:15:08.6561403Z cvt.f32.bf16 %r1088, %rs296; 2026-02-21T09:15:08.6561493Z cvt.f32.bf16 %r1089, %rs293; 2026-02-21T09:15:08.6561584Z cvt.f32.bf16 %r1090, %rs294; 2026-02-21T09:15:08.6561645Z cvt.f32.bf16 
%r1091, %rs307; 2026-02-21T09:15:08.6561706Z cvt.f32.bf16 %r1092, %rs308; 2026-02-21T09:15:08.6561773Z cvt.f32.bf16 %r1093, %rs305; 2026-02-21T09:15:08.6561833Z cvt.f32.bf16 %r1094, %rs306; 2026-02-21T09:15:08.6561919Z cvt.f32.bf16 %r1095, %rs303; 2026-02-21T09:15:08.6561988Z cvt.f32.bf16 %r1096, %rs304; 2026-02-21T09:15:08.6562049Z cvt.f32.bf16 %r1097, %rs301; 2026-02-21T09:15:08.6562108Z cvt.f32.bf16 %r1098, %rs302; 2026-02-21T09:15:08.6562176Z cvt.f32.bf16 %r1099, %rs315; 2026-02-21T09:15:08.6562237Z cvt.f32.bf16 %r1100, %rs316; 2026-02-21T09:15:08.6562296Z cvt.f32.bf16 %r1101, %rs313; 2026-02-21T09:15:08.6562356Z cvt.f32.bf16 %r1102, %rs314; 2026-02-21T09:15:08.6562424Z cvt.f32.bf16 %r1103, %rs311; 2026-02-21T09:15:08.6562485Z cvt.f32.bf16 %r1104, %rs312; 2026-02-21T09:15:08.6562544Z cvt.f32.bf16 %r1105, %rs309; 2026-02-21T09:15:08.6562613Z cvt.f32.bf16 %r1106, %rs310; 2026-02-21T09:15:08.6562676Z cvt.f32.bf16 %r1107, %rs323; 2026-02-21T09:15:08.6562735Z cvt.f32.bf16 %r1108, %rs324; 2026-02-21T09:15:08.6562794Z cvt.f32.bf16 %r1109, %rs321; 2026-02-21T09:15:08.6562863Z cvt.f32.bf16 %r1110, %rs322; 2026-02-21T09:15:08.6562923Z cvt.f32.bf16 %r1111, %rs319; 2026-02-21T09:15:08.6562985Z cvt.f32.bf16 %r1112, %rs320; 2026-02-21T09:15:08.6563056Z cvt.f32.bf16 %r1113, %rs317; 2026-02-21T09:15:08.6563117Z cvt.f32.bf16 %r1114, %rs318; 2026-02-21T09:15:08.6563175Z cvt.f32.bf16 %r1115, %rs331; 2026-02-21T09:15:08.6563234Z cvt.f32.bf16 %r1116, %rs332; 2026-02-21T09:15:08.6563302Z cvt.f32.bf16 %r1117, %rs329; 2026-02-21T09:15:08.6563361Z cvt.f32.bf16 %r1118, %rs330; 2026-02-21T09:15:08.6563420Z cvt.f32.bf16 %r1119, %rs327; 2026-02-21T09:15:08.6563487Z cvt.f32.bf16 %r1120, %rs328; 2026-02-21T09:15:08.6563548Z cvt.f32.bf16 %r1121, %rs325; 2026-02-21T09:15:08.6563607Z cvt.f32.bf16 %r1122, %rs326; 2026-02-21T09:15:08.6563668Z cvt.f32.bf16 %r1123, %rs339; 2026-02-21T09:15:08.6563736Z cvt.f32.bf16 %r1124, %rs340; 2026-02-21T09:15:08.6563796Z cvt.f32.bf16 %r1125, %rs337; 
2026-02-21T09:15:08.6563856Z cvt.f32.bf16 %r1126, %rs338; 2026-02-21T09:15:08.6563922Z cvt.f32.bf16 %r1127, %rs335; 2026-02-21T09:15:08.6563982Z cvt.f32.bf16 %r1128, %rs336; 2026-02-21T09:15:08.6564042Z cvt.f32.bf16 %r1129, %rs333; 2026-02-21T09:15:08.6564109Z cvt.f32.bf16 %r1130, %rs334; 2026-02-21T09:15:08.6564169Z cvt.f32.bf16 %r1131, %rs347; 2026-02-21T09:15:08.6564228Z cvt.f32.bf16 %r1132, %rs348; 2026-02-21T09:15:08.6564289Z cvt.f32.bf16 %r1133, %rs345; 2026-02-21T09:15:08.6564357Z cvt.f32.bf16 %r1134, %rs346; 2026-02-21T09:15:08.6564416Z cvt.f32.bf16 %r1135, %rs343; 2026-02-21T09:15:08.6564476Z cvt.f32.bf16 %r1136, %rs344; 2026-02-21T09:15:08.6564543Z cvt.f32.bf16 %r1137, %rs341; 2026-02-21T09:15:08.6564633Z cvt.f32.bf16 %r1138, %rs342; 2026-02-21T09:15:08.6564693Z cvt.f32.bf16 %r1139, %rs355; 2026-02-21T09:15:08.6564754Z cvt.f32.bf16 %r1140, %rs356; 2026-02-21T09:15:08.6564822Z cvt.f32.bf16 %r1141, %rs353; 2026-02-21T09:15:08.6564882Z cvt.f32.bf16 %r1142, %rs354; 2026-02-21T09:15:08.6564941Z cvt.f32.bf16 %r1143, %rs351; 2026-02-21T09:15:08.6565008Z cvt.f32.bf16 %r1144, %rs352; 2026-02-21T09:15:08.6565067Z cvt.f32.bf16 %r1145, %rs349; 2026-02-21T09:15:08.6565127Z cvt.f32.bf16 %r1146, %rs350; 2026-02-21T09:15:08.6565215Z cvt.f32.bf16 %r1147, %rs363; 2026-02-21T09:15:08.6565281Z cvt.f32.bf16 %r1148, %rs364; 2026-02-21T09:15:08.6565341Z cvt.f32.bf16 %r1149, %rs361; 2026-02-21T09:15:08.6565402Z cvt.f32.bf16 %r1150, %rs362; 2026-02-21T09:15:08.6565472Z cvt.f32.bf16 %r1151, %rs359; 2026-02-21T09:15:08.6565534Z cvt.f32.bf16 %r1152, %rs360; 2026-02-21T09:15:08.6565596Z cvt.f32.bf16 %r1153, %rs357; 2026-02-21T09:15:08.6565666Z cvt.f32.bf16 %r1154, %rs358; 2026-02-21T09:15:08.6565731Z cvt.f32.bf16 %r1155, %rs371; 2026-02-21T09:15:08.6565792Z cvt.f32.bf16 %r1156, %rs372; 2026-02-21T09:15:08.6565879Z cvt.f32.bf16 %r1157, %rs369; 2026-02-21T09:15:08.6565949Z cvt.f32.bf16 %r1158, %rs370; 2026-02-21T09:15:08.6566009Z cvt.f32.bf16 %r1159, %rs367; 
2026-02-21T09:15:08.6566069Z cvt.f32.bf16 %r1160, %rs368; 2026-02-21T09:15:08.6566147Z cvt.f32.bf16 %r1161, %rs365; 2026-02-21T09:15:08.6566204Z cvt.f32.bf16 %r1162, %rs366; 2026-02-21T09:15:08.6566284Z cvt.f32.bf16 %r1163, %rs379; 2026-02-21T09:15:08.6566346Z cvt.f32.bf16 %r1164, %rs380; 2026-02-21T09:15:08.6566410Z cvt.f32.bf16 %r1165, %rs377; 2026-02-21T09:15:08.6566468Z cvt.f32.bf16 %r1166, %rs378; 2026-02-21T09:15:08.6566527Z cvt.f32.bf16 %r1167, %rs375; 2026-02-21T09:15:08.6566591Z cvt.f32.bf16 %r1168, %rs376; 2026-02-21T09:15:08.6566648Z cvt.f32.bf16 %r1169, %rs373; 2026-02-21T09:15:08.6566705Z cvt.f32.bf16 %r1170, %rs374; 2026-02-21T09:15:08.6566762Z cvt.f32.bf16 %r1171, %rs387; 2026-02-21T09:15:08.6566830Z cvt.f32.bf16 %r1172, %rs388; 2026-02-21T09:15:08.6566888Z cvt.f32.bf16 %r1173, %rs385; 2026-02-21T09:15:08.6566947Z cvt.f32.bf16 %r1174, %rs386; 2026-02-21T09:15:08.6567013Z cvt.f32.bf16 %r1175, %rs383; 2026-02-21T09:15:08.6567072Z cvt.f32.bf16 %r1176, %rs384; 2026-02-21T09:15:08.6567130Z cvt.f32.bf16 %r1177, %rs381; 2026-02-21T09:15:08.6567187Z cvt.f32.bf16 %r1178, %rs382; 2026-02-21T09:15:08.6567253Z cvt.f32.bf16 %r1179, %rs395; 2026-02-21T09:15:08.6567312Z cvt.f32.bf16 %r1180, %rs396; 2026-02-21T09:15:08.6567372Z cvt.f32.bf16 %r1181, %rs393; 2026-02-21T09:15:08.6567438Z cvt.f32.bf16 %r1182, %rs394; 2026-02-21T09:15:08.6567496Z cvt.f32.bf16 %r1183, %rs391; 2026-02-21T09:15:08.6567552Z cvt.f32.bf16 %r1184, %rs392; 2026-02-21T09:15:08.6567617Z cvt.f32.bf16 %r1185, %rs389; 2026-02-21T09:15:08.6567674Z cvt.f32.bf16 %r1186, %rs390; 2026-02-21T09:15:08.6567731Z cvt.f32.bf16 %r1187, %rs403; 2026-02-21T09:15:08.6567791Z cvt.f32.bf16 %r1188, %rs404; 2026-02-21T09:15:08.6567857Z cvt.f32.bf16 %r1189, %rs401; 2026-02-21T09:15:08.6567915Z cvt.f32.bf16 %r1190, %rs402; 2026-02-21T09:15:08.6567975Z cvt.f32.bf16 %r1191, %rs399; 2026-02-21T09:15:08.6568041Z cvt.f32.bf16 %r1192, %rs400; 2026-02-21T09:15:08.6568099Z cvt.f32.bf16 %r1193, %rs397; 
2026-02-21T09:15:08.6568158Z cvt.f32.bf16 %r1194, %rs398; 2026-02-21T09:15:08.6568215Z cvt.f32.bf16 %r1195, %rs411; 2026-02-21T09:15:08.6568279Z cvt.f32.bf16 %r1196, %rs412; 2026-02-21T09:15:08.6568338Z cvt.f32.bf16 %r1197, %rs409; 2026-02-21T09:15:08.6568397Z cvt.f32.bf16 %r1198, %rs410; 2026-02-21T09:15:08.6568462Z cvt.f32.bf16 %r1199, %rs407; 2026-02-21T09:15:08.6568519Z cvt.f32.bf16 %r1200, %rs408; 2026-02-21T09:15:08.6568577Z cvt.f32.bf16 %r1201, %rs405; 2026-02-21T09:15:08.6568634Z cvt.f32.bf16 %r1202, %rs406; 2026-02-21T09:15:08.6568698Z cvt.f32.bf16 %r1203, %rs419; 2026-02-21T09:15:08.6568756Z cvt.f32.bf16 %r1204, %rs420; 2026-02-21T09:15:08.6568813Z cvt.f32.bf16 %r1205, %rs417; 2026-02-21T09:15:08.6568875Z cvt.f32.bf16 %r1206, %rs418; 2026-02-21T09:15:08.6568954Z cvt.f32.bf16 %r1207, %rs415; 2026-02-21T09:15:08.6569014Z cvt.f32.bf16 %r1208, %rs416; 2026-02-21T09:15:08.6569072Z cvt.f32.bf16 %r1209, %rs413; 2026-02-21T09:15:08.6569135Z cvt.f32.bf16 %r1210, %rs414; 2026-02-21T09:15:08.6569308Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6569372Z ld.shared.b8 %rs421, [%r112]; 2026-02-21T09:15:08.6569443Z ld.shared.b8 %rs422, [%r112+128]; 2026-02-21T09:15:08.6569537Z ld.shared.b8 %rs423, [%r112+256]; 2026-02-21T09:15:08.6569598Z ld.shared.b8 %rs424, [%r112+384]; 2026-02-21T09:15:08.6569663Z ld.shared.b8 %rs425, [%r112+512]; 2026-02-21T09:15:08.6569722Z ld.shared.b8 %rs426, [%r112+640]; 2026-02-21T09:15:08.6569782Z ld.shared.b8 %rs427, [%r112+768]; 2026-02-21T09:15:08.6569841Z ld.shared.b8 %rs428, [%r112+896]; 2026-02-21T09:15:08.6569911Z ld.shared.b8 %rs429, [%r112+1024]; 2026-02-21T09:15:08.6569974Z ld.shared.b8 %rs430, [%r112+1152]; 2026-02-21T09:15:08.6570038Z ld.shared.b8 %rs431, [%r112+1280]; 2026-02-21T09:15:08.6570124Z ld.shared.b8 %rs432, [%r112+1408]; 2026-02-21T09:15:08.6570185Z ld.shared.b8 %rs433, [%r112+1536]; 2026-02-21T09:15:08.6570245Z ld.shared.b8 %rs434, [%r112+1664]; 
2026-02-21T09:15:08.6570310Z ld.shared.b8 %rs435, [%r112+1792]; 2026-02-21T09:15:08.6570369Z ld.shared.b8 %rs436, [%r112+1920]; 2026-02-21T09:15:08.6570450Z ld.shared.b8 %rs437, [%r112+2048]; 2026-02-21T09:15:08.6570513Z ld.shared.b8 %rs438, [%r112+2176]; 2026-02-21T09:15:08.6570582Z ld.shared.b8 %rs439, [%r112+2304]; 2026-02-21T09:15:08.6570641Z ld.shared.b8 %rs440, [%r112+2432]; 2026-02-21T09:15:08.6570702Z ld.shared.b8 %rs441, [%r112+2560]; 2026-02-21T09:15:08.6570769Z ld.shared.b8 %rs442, [%r112+2688]; 2026-02-21T09:15:08.6570828Z ld.shared.b8 %rs443, [%r112+2816]; 2026-02-21T09:15:08.6570888Z ld.shared.b8 %rs444, [%r112+2944]; 2026-02-21T09:15:08.6570948Z ld.shared.b8 %rs445, [%r112+3072]; 2026-02-21T09:15:08.6571019Z ld.shared.b8 %rs446, [%r112+3200]; 2026-02-21T09:15:08.6571079Z ld.shared.b8 %rs447, [%r112+3328]; 2026-02-21T09:15:08.6571143Z ld.shared.b8 %rs448, [%r112+3456]; 2026-02-21T09:15:08.6571212Z ld.shared.b8 %rs449, [%r112+3584]; 2026-02-21T09:15:08.6571272Z ld.shared.b8 %rs450, [%r112+3712]; 2026-02-21T09:15:08.6571333Z ld.shared.b8 %rs451, [%r112+3840]; 2026-02-21T09:15:08.6571393Z ld.shared.b8 %rs452, [%r112+3968]; 2026-02-21T09:15:08.6571626Z .loc 1 57 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:57:28 2026-02-21T09:15:08.6571692Z shl.b16 %rs453, %rs421, 4; 2026-02-21T09:15:08.6571753Z shl.b16 %rs454, %rs422, 4; 2026-02-21T09:15:08.6571819Z shl.b16 %rs455, %rs423, 4; 2026-02-21T09:15:08.6571879Z shl.b16 %rs456, %rs424, 4; 2026-02-21T09:15:08.6571938Z shl.b16 %rs457, %rs425, 4; 2026-02-21T09:15:08.6572004Z shl.b16 %rs458, %rs426, 4; 2026-02-21T09:15:08.6572064Z shl.b16 %rs459, %rs427, 4; 2026-02-21T09:15:08.6572124Z shl.b16 %rs460, %rs428, 4; 2026-02-21T09:15:08.6572185Z shl.b16 %rs461, %rs429, 4; 2026-02-21T09:15:08.6572253Z shl.b16 %rs462, %rs430, 4; 2026-02-21T09:15:08.6572314Z shl.b16 %rs463, %rs431, 4; 2026-02-21T09:15:08.6572372Z shl.b16 %rs464, %rs432, 4; 2026-02-21T09:15:08.6572437Z shl.b16 %rs465, %rs433, 4; 
2026-02-21T09:15:08.6572494Z shl.b16 %rs466, %rs434, 4; 2026-02-21T09:15:08.6572550Z shl.b16 %rs467, %rs435, 4; 2026-02-21T09:15:08.6572609Z shl.b16 %rs468, %rs436, 4; 2026-02-21T09:15:08.6572681Z shl.b16 %rs469, %rs437, 4; 2026-02-21T09:15:08.6572742Z shl.b16 %rs470, %rs438, 4; 2026-02-21T09:15:08.6572802Z shl.b16 %rs471, %rs439, 4; 2026-02-21T09:15:08.6572871Z shl.b16 %rs472, %rs440, 4; 2026-02-21T09:15:08.6572932Z shl.b16 %rs473, %rs441, 4; 2026-02-21T09:15:08.6572992Z shl.b16 %rs474, %rs442, 4; 2026-02-21T09:15:08.6573052Z shl.b16 %rs475, %rs443, 4; 2026-02-21T09:15:08.6573124Z shl.b16 %rs476, %rs444, 4; 2026-02-21T09:15:08.6573183Z shl.b16 %rs477, %rs445, 4; 2026-02-21T09:15:08.6573268Z shl.b16 %rs478, %rs446, 4; 2026-02-21T09:15:08.6573333Z shl.b16 %rs479, %rs447, 4; 2026-02-21T09:15:08.6573392Z shl.b16 %rs480, %rs448, 4; 2026-02-21T09:15:08.6573449Z shl.b16 %rs481, %rs449, 4; 2026-02-21T09:15:08.6573506Z shl.b16 %rs482, %rs450, 4; 2026-02-21T09:15:08.6573571Z shl.b16 %rs483, %rs451, 4; 2026-02-21T09:15:08.6573629Z shl.b16 %rs484, %rs452, 4; 2026-02-21T09:15:08.6573795Z .loc 1 72 58 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:72:58 2026-02-21T09:15:08.6573899Z selp.b16 %rs485, %rs453, %rs421, %p26; 2026-02-21T09:15:08.6573960Z cvt.s16.s8 %rs486, %rs485; 2026-02-21T09:15:08.6574018Z shr.s16 %rs487, %rs486, 4; 2026-02-21T09:15:08.6574091Z selp.b16 %rs488, %rs454, %rs422, %p26; 2026-02-21T09:15:08.6574151Z cvt.s16.s8 %rs489, %rs488; 2026-02-21T09:15:08.6574209Z shr.s16 %rs490, %rs489, 4; 2026-02-21T09:15:08.6574276Z selp.b16 %rs491, %rs455, %rs423, %p26; 2026-02-21T09:15:08.6574342Z cvt.s16.s8 %rs492, %rs491; 2026-02-21T09:15:08.6574402Z shr.s16 %rs493, %rs492, 4; 2026-02-21T09:15:08.6574469Z selp.b16 %rs494, %rs456, %rs424, %p26; 2026-02-21T09:15:08.6574566Z cvt.s16.s8 %rs495, %rs494; 2026-02-21T09:15:08.6574626Z shr.s16 %rs496, %rs495, 4; 2026-02-21T09:15:08.6574690Z selp.b16 %rs497, %rs457, %rs425, %p26; 2026-02-21T09:15:08.6574748Z cvt.s16.s8 
%rs498, %rs497; 2026-02-21T09:15:08.6574813Z shr.s16 %rs499, %rs498, 4; 2026-02-21T09:15:08.6574901Z selp.b16 %rs500, %rs458, %rs426, %p26; 2026-02-21T09:15:08.6574962Z cvt.s16.s8 %rs501, %rs500; 2026-02-21T09:15:08.6575028Z shr.s16 %rs502, %rs501, 4; 2026-02-21T09:15:08.6575091Z selp.b16 %rs503, %rs459, %rs427, %p26; 2026-02-21T09:15:08.6575148Z cvt.s16.s8 %rs504, %rs503; 2026-02-21T09:15:08.6575206Z shr.s16 %rs505, %rs504, 4; 2026-02-21T09:15:08.6575277Z selp.b16 %rs506, %rs460, %rs428, %p26; 2026-02-21T09:15:08.6575335Z cvt.s16.s8 %rs507, %rs506; 2026-02-21T09:15:08.6575392Z shr.s16 %rs508, %rs507, 4; 2026-02-21T09:15:08.6575464Z selp.b16 %rs509, %rs461, %rs429, %p26; 2026-02-21T09:15:08.6575523Z cvt.s16.s8 %rs510, %rs509; 2026-02-21T09:15:08.6575580Z shr.s16 %rs511, %rs510, 4; 2026-02-21T09:15:08.6575652Z selp.b16 %rs512, %rs462, %rs430, %p26; 2026-02-21T09:15:08.6575710Z cvt.s16.s8 %rs513, %rs512; 2026-02-21T09:15:08.6575766Z shr.s16 %rs514, %rs513, 4; 2026-02-21T09:15:08.6575828Z selp.b16 %rs515, %rs463, %rs431, %p26; 2026-02-21T09:15:08.6575896Z cvt.s16.s8 %rs516, %rs515; 2026-02-21T09:15:08.6575954Z shr.s16 %rs517, %rs516, 4; 2026-02-21T09:15:08.6576019Z selp.b16 %rs518, %rs464, %rs432, %p26; 2026-02-21T09:15:08.6576082Z cvt.s16.s8 %rs519, %rs518; 2026-02-21T09:15:08.6576138Z shr.s16 %rs520, %rs519, 4; 2026-02-21T09:15:08.6576200Z selp.b16 %rs521, %rs465, %rs433, %p26; 2026-02-21T09:15:08.6576257Z cvt.s16.s8 %rs522, %rs521; 2026-02-21T09:15:08.6576319Z shr.s16 %rs523, %rs522, 4; 2026-02-21T09:15:08.6576381Z selp.b16 %rs524, %rs466, %rs434, %p26; 2026-02-21T09:15:08.6576437Z cvt.s16.s8 %rs525, %rs524; 2026-02-21T09:15:08.6576500Z shr.s16 %rs526, %rs525, 4; 2026-02-21T09:15:08.6576563Z selp.b16 %rs527, %rs467, %rs435, %p26; 2026-02-21T09:15:08.6576620Z cvt.s16.s8 %rs528, %rs527; 2026-02-21T09:15:08.6576676Z shr.s16 %rs529, %rs528, 4; 2026-02-21T09:15:08.6576746Z selp.b16 %rs530, %rs468, %rs436, %p26; 2026-02-21T09:15:08.6576803Z cvt.s16.s8 %rs531, %rs530; 
2026-02-21T09:15:08.6576861Z shr.s16 %rs532, %rs531, 4; 2026-02-21T09:15:08.6576933Z selp.b16 %rs533, %rs469, %rs437, %p26; 2026-02-21T09:15:08.6576991Z cvt.s16.s8 %rs534, %rs533; 2026-02-21T09:15:08.6577047Z shr.s16 %rs535, %rs534, 4; 2026-02-21T09:15:08.6577109Z selp.b16 %rs536, %rs470, %rs438, %p26; 2026-02-21T09:15:08.6577175Z cvt.s16.s8 %rs537, %rs536; 2026-02-21T09:15:08.6577232Z shr.s16 %rs538, %rs537, 4; 2026-02-21T09:15:08.6577294Z selp.b16 %rs539, %rs471, %rs439, %p26; 2026-02-21T09:15:08.6577358Z cvt.s16.s8 %rs540, %rs539; 2026-02-21T09:15:08.6577415Z shr.s16 %rs541, %rs540, 4; 2026-02-21T09:15:08.6577479Z selp.b16 %rs542, %rs472, %rs440, %p26; 2026-02-21T09:15:08.6577566Z cvt.s16.s8 %rs543, %rs542; 2026-02-21T09:15:08.6577626Z shr.s16 %rs544, %rs543, 4; 2026-02-21T09:15:08.6577691Z selp.b16 %rs545, %rs473, %rs441, %p26; 2026-02-21T09:15:08.6577750Z cvt.s16.s8 %rs546, %rs545; 2026-02-21T09:15:08.6577817Z shr.s16 %rs547, %rs546, 4; 2026-02-21T09:15:08.6577880Z selp.b16 %rs548, %rs474, %rs442, %p26; 2026-02-21T09:15:08.6577938Z cvt.s16.s8 %rs549, %rs548; 2026-02-21T09:15:08.6578003Z shr.s16 %rs550, %rs549, 4; 2026-02-21T09:15:08.6578087Z selp.b16 %rs551, %rs475, %rs443, %p26; 2026-02-21T09:15:08.6578147Z cvt.s16.s8 %rs552, %rs551; 2026-02-21T09:15:08.6578205Z shr.s16 %rs553, %rs552, 4; 2026-02-21T09:15:08.6578279Z selp.b16 %rs554, %rs476, %rs444, %p26; 2026-02-21T09:15:08.6578338Z cvt.s16.s8 %rs555, %rs554; 2026-02-21T09:15:08.6578396Z shr.s16 %rs556, %rs555, 4; 2026-02-21T09:15:08.6578470Z selp.b16 %rs557, %rs477, %rs445, %p26; 2026-02-21T09:15:08.6578528Z cvt.s16.s8 %rs558, %rs557; 2026-02-21T09:15:08.6578590Z shr.s16 %rs559, %rs558, 4; 2026-02-21T09:15:08.6578654Z selp.b16 %rs560, %rs478, %rs446, %p26; 2026-02-21T09:15:08.6578741Z cvt.s16.s8 %rs561, %rs560; 2026-02-21T09:15:08.6578801Z shr.s16 %rs562, %rs561, 4; 2026-02-21T09:15:08.6578865Z selp.b16 %rs563, %rs479, %rs447, %p26; 2026-02-21T09:15:08.6578931Z cvt.s16.s8 %rs564, %rs563; 
2026-02-21T09:15:08.6578988Z shr.s16 %rs565, %rs564, 4; 2026-02-21T09:15:08.6579072Z selp.b16 %rs566, %rs480, %rs448, %p26; 2026-02-21T09:15:08.6579138Z cvt.s16.s8 %rs567, %rs566; 2026-02-21T09:15:08.6579195Z shr.s16 %rs568, %rs567, 4; 2026-02-21T09:15:08.6579258Z selp.b16 %rs569, %rs481, %rs449, %p26; 2026-02-21T09:15:08.6579314Z cvt.s16.s8 %rs570, %rs569; 2026-02-21T09:15:08.6579380Z shr.s16 %rs571, %rs570, 4; 2026-02-21T09:15:08.6579445Z selp.b16 %rs572, %rs482, %rs450, %p26; 2026-02-21T09:15:08.6579501Z cvt.s16.s8 %rs573, %rs572; 2026-02-21T09:15:08.6579566Z shr.s16 %rs574, %rs573, 4; 2026-02-21T09:15:08.6579629Z selp.b16 %rs575, %rs483, %rs451, %p26; 2026-02-21T09:15:08.6579689Z cvt.s16.s8 %rs576, %rs575; 2026-02-21T09:15:08.6579745Z shr.s16 %rs577, %rs576, 4; 2026-02-21T09:15:08.6579818Z selp.b16 %rs578, %rs484, %rs452, %p26; 2026-02-21T09:15:08.6579877Z cvt.s16.s8 %rs579, %rs578; 2026-02-21T09:15:08.6579935Z shr.s16 %rs580, %rs579, 4; 2026-02-21T09:15:08.6580114Z .loc 1 77 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:77:32 2026-02-21T09:15:08.6580178Z cvt.rn.f32.s16 %r1275, %rs487; 2026-02-21T09:15:08.6580241Z cvt.rn.f32.s16 %r1276, %rs490; 2026-02-21T09:15:08.6580301Z cvt.rn.f32.s16 %r1277, %rs493; 2026-02-21T09:15:08.6580369Z cvt.rn.f32.s16 %r1278, %rs496; 2026-02-21T09:15:08.6580428Z cvt.rn.f32.s16 %r1279, %rs499; 2026-02-21T09:15:08.6580487Z cvt.rn.f32.s16 %r1280, %rs502; 2026-02-21T09:15:08.6580553Z cvt.rn.f32.s16 %r1281, %rs505; 2026-02-21T09:15:08.6580611Z cvt.rn.f32.s16 %r1282, %rs508; 2026-02-21T09:15:08.6580668Z cvt.rn.f32.s16 %r1283, %rs511; 2026-02-21T09:15:08.6580734Z cvt.rn.f32.s16 %r1284, %rs514; 2026-02-21T09:15:08.6580792Z cvt.rn.f32.s16 %r1285, %rs517; 2026-02-21T09:15:08.6580851Z cvt.rn.f32.s16 %r1286, %rs520; 2026-02-21T09:15:08.6580910Z cvt.rn.f32.s16 %r1287, %rs523; 2026-02-21T09:15:08.6580977Z cvt.rn.f32.s16 %r1288, %rs526; 2026-02-21T09:15:08.6581035Z cvt.rn.f32.s16 %r1289, %rs529; 2026-02-21T09:15:08.6581093Z 
cvt.rn.f32.s16 %r1290, %rs532; 2026-02-21T09:15:08.6581157Z cvt.rn.f32.s16 %r1291, %rs535; 2026-02-21T09:15:08.6581217Z cvt.rn.f32.s16 %r1292, %rs538; 2026-02-21T09:15:08.6581275Z cvt.rn.f32.s16 %r1293, %rs541; 2026-02-21T09:15:08.6581333Z cvt.rn.f32.s16 %r1294, %rs544; 2026-02-21T09:15:08.6581399Z cvt.rn.f32.s16 %r1295, %rs547; 2026-02-21T09:15:08.6581457Z cvt.rn.f32.s16 %r1296, %rs550; 2026-02-21T09:15:08.6581515Z cvt.rn.f32.s16 %r1297, %rs553; 2026-02-21T09:15:08.6581610Z cvt.rn.f32.s16 %r1298, %rs556; 2026-02-21T09:15:08.6581671Z cvt.rn.f32.s16 %r1299, %rs559; 2026-02-21T09:15:08.6581729Z cvt.rn.f32.s16 %r1300, %rs562; 2026-02-21T09:15:08.6581817Z cvt.rn.f32.s16 %r1301, %rs565; 2026-02-21T09:15:08.6581883Z cvt.rn.f32.s16 %r1302, %rs568; 2026-02-21T09:15:08.6581942Z cvt.rn.f32.s16 %r1303, %rs571; 2026-02-21T09:15:08.6582000Z cvt.rn.f32.s16 %r1304, %rs574; 2026-02-21T09:15:08.6582064Z cvt.rn.f32.s16 %r1305, %rs577; 2026-02-21T09:15:08.6582123Z cvt.rn.f32.s16 %r1306, %rs580; 2026-02-21T09:15:08.6582178Z bar.sync 2, 256; 2026-02-21T09:15:08.6582239Z st.shared.b32 [%r104], %r1275; 2026-02-21T09:15:08.6582337Z st.shared.b32 [%r104+4096], %r1279; 2026-02-21T09:15:08.6582401Z st.shared.b32 [%r104+8192], %r1283; 2026-02-21T09:15:08.6582468Z st.shared.b32 [%r104+12288], %r1287; 2026-02-21T09:15:08.6582537Z st.shared.b32 [%r104+16384], %r1291; 2026-02-21T09:15:08.6582600Z st.shared.b32 [%r104+20480], %r1295; 2026-02-21T09:15:08.6582664Z st.shared.b32 [%r104+24576], %r1299; 2026-02-21T09:15:08.6582736Z st.shared.b32 [%r104+28672], %r1303; 2026-02-21T09:15:08.6582797Z st.shared.b32 [%r105], %r1276; 2026-02-21T09:15:08.6582862Z st.shared.b32 [%r105+4096], %r1280; 2026-02-21T09:15:08.6582948Z st.shared.b32 [%r105+8192], %r1284; 2026-02-21T09:15:08.6583018Z st.shared.b32 [%r105+12288], %r1288; 2026-02-21T09:15:08.6583080Z st.shared.b32 [%r105+16384], %r1292; 2026-02-21T09:15:08.6583142Z st.shared.b32 [%r105+20480], %r1296; 2026-02-21T09:15:08.6583213Z st.shared.b32 
[%r105+24576], %r1300; 2026-02-21T09:15:08.6583299Z st.shared.b32 [%r105+28672], %r1304; 2026-02-21T09:15:08.6583360Z st.shared.b32 [%r106], %r1277; 2026-02-21T09:15:08.6583422Z st.shared.b32 [%r106+4096], %r1281; 2026-02-21T09:15:08.6583490Z st.shared.b32 [%r106+8192], %r1285; 2026-02-21T09:15:08.6583551Z st.shared.b32 [%r106+12288], %r1289; 2026-02-21T09:15:08.6583612Z st.shared.b32 [%r106+16384], %r1293; 2026-02-21T09:15:08.6583681Z st.shared.b32 [%r106+20480], %r1297; 2026-02-21T09:15:08.6583743Z st.shared.b32 [%r106+24576], %r1301; 2026-02-21T09:15:08.6583805Z st.shared.b32 [%r106+28672], %r1305; 2026-02-21T09:15:08.6583873Z st.shared.b32 [%r107], %r1278; 2026-02-21T09:15:08.6583935Z st.shared.b32 [%r107+4096], %r1282; 2026-02-21T09:15:08.6583998Z st.shared.b32 [%r107+8192], %r1286; 2026-02-21T09:15:08.6584058Z st.shared.b32 [%r107+12288], %r1290; 2026-02-21T09:15:08.6584128Z st.shared.b32 [%r107+16384], %r1294; 2026-02-21T09:15:08.6584190Z st.shared.b32 [%r107+20480], %r1298; 2026-02-21T09:15:08.6584250Z st.shared.b32 [%r107+24576], %r1302; 2026-02-21T09:15:08.6584322Z st.shared.b32 [%r107+28672], %r1306; 2026-02-21T09:15:08.6584499Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6584558Z add.s32 %r1080, %r973, 368720; 2026-02-21T09:15:08.6584620Z $L__tmp18: 2026-02-21T09:15:08.6584843Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6584902Z // begin inline asm 2026-02-21T09:15:08.6584955Z 2026-02-21T09:15:08.6585017Z { 2026-02-21T09:15:08.6585081Z .reg .pred complete; 2026-02-21T09:15:08.6585137Z waitLoop: 2026-02-21T09:15:08.6585268Z mbarrier.try_wait.parity.shared.b64 complete, [%r1080], %r2370; 2026-02-21T09:15:08.6585337Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6585390Z } 2026-02-21T09:15:08.6585394Z 2026-02-21T09:15:08.6585459Z // end inline asm 2026-02-21T09:15:08.6585523Z // begin inline asm 
2026-02-21T09:15:08.6587143Z @%p30 tcgen05.st.sync.aligned.32x32b.x128.b32 [%r1746 + 0], {%r1083, %r1084, %r1085, %r1086, %r1087, %r1088, %r1089, %r1090, %r1091, %r1092, %r1093, %r1094, %r1095, %r1096, %r1097, %r1098, %r1099, %r1100, %r1101, %r1102, %r1103, %r1104, %r1105, %r1106, %r1107, %r1108, %r1109, %r1110, %r1111, %r1112, %r1113, %r1114, %r1115, %r1116, %r1117, %r1118, %r1119, %r1120, %r1121, %r1122, %r1123, %r1124, %r1125, %r1126, %r1127, %r1128, %r1129, %r1130, %r1131, %r1132, %r1133, %r1134, %r1135, %r1136, %r1137, %r1138, %r1139, %r1140, %r1141, %r1142, %r1143, %r1144, %r1145, %r1146, %r1147, %r1148, %r1149, %r1150, %r1151, %r1152, %r1153, %r1154, %r1155, %r1156, %r1157, %r1158, %r1159, %r1160, %r1161, %r1162, %r1163, %r1164, %r1165, %r1166, %r1167, %r1168, %r1169, %r1170, %r1171, %r1172, %r1173, %r1174, %r1175, %r1176, %r1177, %r1178, %r1179, %r1180, %r1181, %r1182, %r1183, %r1184, %r1185, %r1186, %r1187, %r1188, %r1189, %r1190, %r1191, %r1192, %r1193, %r1194, %r1195, %r1196, %r1197, %r1198, %r1199, %r1200, %r1201, %r1202, %r1203, %r1204, %r1205, %r1206, %r1207, %r1208, %r1209, %r1210}; 2026-02-21T09:15:08.6587266Z // end inline asm 2026-02-21T09:15:08.6587323Z // begin inline asm 2026-02-21T09:15:08.6587396Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:15:08.6587460Z // end inline asm 2026-02-21T09:15:08.6587515Z bar.sync 2, 256; 2026-02-21T09:15:08.6587571Z // begin inline asm 2026-02-21T09:15:08.6587649Z fence.proxy.async.shared::cta; 2026-02-21T09:15:08.6587703Z // end inline asm 2026-02-21T09:15:08.6587757Z bar.sync 2, 256; 2026-02-21T09:15:08.6587816Z @%p27 bra $L__BB0_11; 2026-02-21T09:15:08.6587877Z $L__tmp19: 2026-02-21T09:15:08.6588002Z // %bb.10: // in Loop: Header=BB0_7 Depth=2 2026-02-21T09:15:08.6588180Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6588249Z add.s32 %r1411, %r973, 368736; 2026-02-21T09:15:08.6588303Z $L__tmp20: 2026-02-21T09:15:08.6588543Z .loc 2 291 36 // 
standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6588615Z add.s32 %r1312, %r120, %r5; 2026-02-21T09:15:08.6588682Z elect.sync %r1412|%p98, -1; 2026-02-21T09:15:08.6588740Z mov.b32 %r1314, 134744336; 2026-02-21T09:15:08.6588800Z mov.pred %p97, -1; 2026-02-21T09:15:08.6588865Z // begin inline asm 2026-02-21T09:15:08.6589022Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 0 ], %rd340, %r1314, %p97; 2026-02-21T09:15:08.6589078Z // end inline asm 2026-02-21T09:15:08.6589142Z // begin inline asm 2026-02-21T09:15:08.6589295Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 8 ], %rd341, %r1314, %p97; 2026-02-21T09:15:08.6589349Z // end inline asm 2026-02-21T09:15:08.6589412Z // begin inline asm 2026-02-21T09:15:08.6589560Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 16 ], %rd342, %r1314, %p97; 2026-02-21T09:15:08.6589615Z // end inline asm 2026-02-21T09:15:08.6589670Z // begin inline asm 2026-02-21T09:15:08.6589824Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 24 ], %rd343, %r1314, %p97; 2026-02-21T09:15:08.6589881Z // end inline asm 2026-02-21T09:15:08.6589937Z // begin inline asm 2026-02-21T09:15:08.6590087Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 32 ], %rd344, %r1314, %p97; 2026-02-21T09:15:08.6590142Z // end inline asm 2026-02-21T09:15:08.6590198Z // begin inline asm 2026-02-21T09:15:08.6590348Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 40 ], %rd345, %r1314, %p97; 2026-02-21T09:15:08.6590404Z // end inline asm 2026-02-21T09:15:08.6590461Z // begin inline asm 2026-02-21T09:15:08.6590610Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 48 ], %rd346, %r1314, %p97; 2026-02-21T09:15:08.6590665Z // end inline asm 2026-02-21T09:15:08.6590719Z // begin inline asm 2026-02-21T09:15:08.6590860Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ 
%r1975 + 56 ], %rd347, %r1314, %p97; 2026-02-21T09:15:08.6590922Z // end inline asm 2026-02-21T09:15:08.6590979Z // begin inline asm 2026-02-21T09:15:08.6591119Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 64 ], %rd348, %r1314, %p97; 2026-02-21T09:15:08.6591181Z // end inline asm 2026-02-21T09:15:08.6591236Z // begin inline asm 2026-02-21T09:15:08.6591375Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 72 ], %rd349, %r1314, %p97; 2026-02-21T09:15:08.6591435Z // end inline asm 2026-02-21T09:15:08.6591489Z // begin inline asm 2026-02-21T09:15:08.6591680Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 80 ], %rd350, %r1314, %p97; 2026-02-21T09:15:08.6591743Z // end inline asm 2026-02-21T09:15:08.6591798Z // begin inline asm 2026-02-21T09:15:08.6591938Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 88 ], %rd351, %r1314, %p97; 2026-02-21T09:15:08.6591993Z // end inline asm 2026-02-21T09:15:08.6592057Z // begin inline asm 2026-02-21T09:15:08.6592198Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 96 ], %rd352, %r1314, %p97; 2026-02-21T09:15:08.6592281Z // end inline asm 2026-02-21T09:15:08.6592344Z // begin inline asm 2026-02-21T09:15:08.6592491Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 104 ], %rd353, %r1314, %p97; 2026-02-21T09:15:08.6592546Z // end inline asm 2026-02-21T09:15:08.6592609Z // begin inline asm 2026-02-21T09:15:08.6592753Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 112 ], %rd354, %r1314, %p97; 2026-02-21T09:15:08.6592810Z // end inline asm 2026-02-21T09:15:08.6592865Z // begin inline asm 2026-02-21T09:15:08.6593041Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 120 ], %rd355, %r1314, %p97; 2026-02-21T09:15:08.6593096Z // end inline asm 2026-02-21T09:15:08.6593153Z // begin inline asm 2026-02-21T09:15:08.6593305Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], 
[ %r1975 + 128 ], %rd356, %r1314, %p97; 2026-02-21T09:15:08.6593383Z // end inline asm 2026-02-21T09:15:08.6593442Z // begin inline asm 2026-02-21T09:15:08.6593593Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 136 ], %rd357, %r1314, %p97; 2026-02-21T09:15:08.6593649Z // end inline asm 2026-02-21T09:15:08.6593707Z // begin inline asm 2026-02-21T09:15:08.6593857Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 144 ], %rd358, %r1314, %p97; 2026-02-21T09:15:08.6593912Z // end inline asm 2026-02-21T09:15:08.6593967Z // begin inline asm 2026-02-21T09:15:08.6594119Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 152 ], %rd359, %r1314, %p97; 2026-02-21T09:15:08.6594174Z // end inline asm 2026-02-21T09:15:08.6594230Z // begin inline asm 2026-02-21T09:15:08.6594371Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 160 ], %rd360, %r1314, %p97; 2026-02-21T09:15:08.6594434Z // end inline asm 2026-02-21T09:15:08.6594490Z // begin inline asm 2026-02-21T09:15:08.6594635Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 168 ], %rd361, %r1314, %p97; 2026-02-21T09:15:08.6594699Z // end inline asm 2026-02-21T09:15:08.6594755Z // begin inline asm 2026-02-21T09:15:08.6594896Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 176 ], %rd362, %r1314, %p97; 2026-02-21T09:15:08.6594960Z // end inline asm 2026-02-21T09:15:08.6595017Z // begin inline asm 2026-02-21T09:15:08.6595159Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 184 ], %rd363, %r1314, %p97; 2026-02-21T09:15:08.6595218Z // end inline asm 2026-02-21T09:15:08.6595283Z // begin inline asm 2026-02-21T09:15:08.6595427Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 192 ], %rd364, %r1314, %p97; 2026-02-21T09:15:08.6595483Z // end inline asm 2026-02-21T09:15:08.6595548Z // begin inline asm 2026-02-21T09:15:08.6595691Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 
+ 0 ], [ %r1975 + 200 ], %rd365, %r1314, %p97; 2026-02-21T09:15:08.6595747Z // end inline asm 2026-02-21T09:15:08.6595811Z // begin inline asm 2026-02-21T09:15:08.6595953Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 208 ], %rd366, %r1314, %p97; 2026-02-21T09:15:08.6596007Z // end inline asm 2026-02-21T09:15:08.6596068Z // begin inline asm 2026-02-21T09:15:08.6596210Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 216 ], %rd367, %r1314, %p97; 2026-02-21T09:15:08.6596264Z // end inline asm 2026-02-21T09:15:08.6596319Z // begin inline asm 2026-02-21T09:15:08.6596468Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 224 ], %rd368, %r1314, %p97; 2026-02-21T09:15:08.6596550Z // end inline asm 2026-02-21T09:15:08.6596606Z // begin inline asm 2026-02-21T09:15:08.6596760Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 232 ], %rd369, %r1314, %p97; 2026-02-21T09:15:08.6596813Z // end inline asm 2026-02-21T09:15:08.6596869Z // begin inline asm 2026-02-21T09:15:08.6597020Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 240 ], %rd370, %r1314, %p97; 2026-02-21T09:15:08.6597096Z // end inline asm 2026-02-21T09:15:08.6597151Z // begin inline asm 2026-02-21T09:15:08.6597303Z @%p98 tcgen05.mma.cta_group::1.kind::tf32 [ %r1312 + 0 ], [ %r1975 + 248 ], %rd371, %r1314, %p97; 2026-02-21T09:15:08.6597357Z // end inline asm 2026-02-21T09:15:08.6597418Z cvt.u64.u32 %rd306, %r1411; 2026-02-21T09:15:08.6597473Z // begin inline asm 2026-02-21T09:15:08.6597607Z @%p98 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd306]; 2026-02-21T09:15:08.6597663Z // end inline asm 2026-02-21T09:15:08.6597714Z $L__tmp21: 2026-02-21T09:15:08.6597844Z $L__BB0_11: // in Loop: Header=BB0_7 Depth=2 2026-02-21T09:15:08.6598019Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6598121Z ld.shared.v4.b32 {%r1544, %r1545, %r1546, %r1547}, [%r113]; 
2026-02-21T09:15:08.6598212Z mov.b32 {%rs581, %rs582}, %r1547; 2026-02-21T09:15:08.6598278Z mov.b32 {%rs583, %rs584}, %r1546; 2026-02-21T09:15:08.6598338Z mov.b32 {%rs585, %rs586}, %r1545; 2026-02-21T09:15:08.6598396Z mov.b32 {%rs587, %rs588}, %r1544; 2026-02-21T09:15:08.6598505Z ld.shared.v4.b32 {%r1548, %r1549, %r1550, %r1551}, [%r113+16]; 2026-02-21T09:15:08.6598564Z mov.b32 {%rs589, %rs590}, %r1551; 2026-02-21T09:15:08.6598624Z mov.b32 {%rs591, %rs592}, %r1550; 2026-02-21T09:15:08.6598691Z mov.b32 {%rs593, %rs594}, %r1549; 2026-02-21T09:15:08.6598749Z mov.b32 {%rs595, %rs596}, %r1548; 2026-02-21T09:15:08.6598847Z ld.shared.v4.b32 {%r1552, %r1553, %r1554, %r1555}, [%r113+32]; 2026-02-21T09:15:08.6598914Z mov.b32 {%rs597, %rs598}, %r1555; 2026-02-21T09:15:08.6598973Z mov.b32 {%rs599, %rs600}, %r1554; 2026-02-21T09:15:08.6599030Z mov.b32 {%rs601, %rs602}, %r1553; 2026-02-21T09:15:08.6599087Z mov.b32 {%rs603, %rs604}, %r1552; 2026-02-21T09:15:08.6599190Z ld.shared.v4.b32 {%r1556, %r1557, %r1558, %r1559}, [%r113+48]; 2026-02-21T09:15:08.6599249Z mov.b32 {%rs605, %rs606}, %r1559; 2026-02-21T09:15:08.6599309Z mov.b32 {%rs607, %rs608}, %r1558; 2026-02-21T09:15:08.6599372Z mov.b32 {%rs609, %rs610}, %r1557; 2026-02-21T09:15:08.6599428Z mov.b32 {%rs611, %rs612}, %r1556; 2026-02-21T09:15:08.6599520Z ld.shared.v4.b32 {%r1560, %r1561, %r1562, %r1563}, [%r113+64]; 2026-02-21T09:15:08.6599577Z mov.b32 {%rs613, %rs614}, %r1563; 2026-02-21T09:15:08.6599642Z mov.b32 {%rs615, %rs616}, %r1562; 2026-02-21T09:15:08.6599700Z mov.b32 {%rs617, %rs618}, %r1561; 2026-02-21T09:15:08.6599759Z mov.b32 {%rs619, %rs620}, %r1560; 2026-02-21T09:15:08.6599859Z ld.shared.v4.b32 {%r1564, %r1565, %r1566, %r1567}, [%r113+80]; 2026-02-21T09:15:08.6599918Z mov.b32 {%rs621, %rs622}, %r1567; 2026-02-21T09:15:08.6599976Z mov.b32 {%rs623, %rs624}, %r1566; 2026-02-21T09:15:08.6600039Z mov.b32 {%rs625, %rs626}, %r1565; 2026-02-21T09:15:08.6600097Z mov.b32 {%rs627, %rs628}, %r1564; 
2026-02-21T09:15:08.6600190Z ld.shared.v4.b32 {%r1568, %r1569, %r1570, %r1571}, [%r113+96]; 2026-02-21T09:15:08.6600249Z mov.b32 {%rs629, %rs630}, %r1571; 2026-02-21T09:15:08.6600313Z mov.b32 {%rs631, %rs632}, %r1570; 2026-02-21T09:15:08.6600369Z mov.b32 {%rs633, %rs634}, %r1569; 2026-02-21T09:15:08.6600426Z mov.b32 {%rs635, %rs636}, %r1568; 2026-02-21T09:15:08.6600532Z ld.shared.v4.b32 {%r1572, %r1573, %r1574, %r1575}, [%r113+112]; 2026-02-21T09:15:08.6600591Z mov.b32 {%rs637, %rs638}, %r1575; 2026-02-21T09:15:08.6600648Z mov.b32 {%rs639, %rs640}, %r1574; 2026-02-21T09:15:08.6600705Z mov.b32 {%rs641, %rs642}, %r1573; 2026-02-21T09:15:08.6600790Z mov.b32 {%rs643, %rs644}, %r1572; 2026-02-21T09:15:08.6600888Z ld.shared.v4.b32 {%r1576, %r1577, %r1578, %r1579}, [%r113+128]; 2026-02-21T09:15:08.6600946Z mov.b32 {%rs645, %rs646}, %r1579; 2026-02-21T09:15:08.6601012Z mov.b32 {%rs647, %rs648}, %r1578; 2026-02-21T09:15:08.6601069Z mov.b32 {%rs649, %rs650}, %r1577; 2026-02-21T09:15:08.6601127Z mov.b32 {%rs651, %rs652}, %r1576; 2026-02-21T09:15:08.6601230Z ld.shared.v4.b32 {%r1580, %r1581, %r1582, %r1583}, [%r113+144]; 2026-02-21T09:15:08.6601311Z mov.b32 {%rs653, %rs654}, %r1583; 2026-02-21T09:15:08.6601370Z mov.b32 {%rs655, %rs656}, %r1582; 2026-02-21T09:15:08.6601427Z mov.b32 {%rs657, %rs658}, %r1581; 2026-02-21T09:15:08.6601492Z mov.b32 {%rs659, %rs660}, %r1580; 2026-02-21T09:15:08.6601626Z ld.shared.v4.b32 {%r1584, %r1585, %r1586, %r1587}, [%r113+160]; 2026-02-21T09:15:08.6601684Z mov.b32 {%rs661, %rs662}, %r1587; 2026-02-21T09:15:08.6601751Z mov.b32 {%rs663, %rs664}, %r1586; 2026-02-21T09:15:08.6601811Z mov.b32 {%rs665, %rs666}, %r1585; 2026-02-21T09:15:08.6601869Z mov.b32 {%rs667, %rs668}, %r1584; 2026-02-21T09:15:08.6601995Z ld.shared.v4.b32 {%r1588, %r1589, %r1590, %r1591}, [%r113+176]; 2026-02-21T09:15:08.6602054Z mov.b32 {%rs669, %rs670}, %r1591; 2026-02-21T09:15:08.6602111Z mov.b32 {%rs671, %rs672}, %r1590; 2026-02-21T09:15:08.6602169Z mov.b32 {%rs673, %rs674}, 
%r1589; 2026-02-21T09:15:08.6602272Z mov.b32 {%rs675, %rs676}, %r1588; 2026-02-21T09:15:08.6602371Z ld.shared.v4.b32 {%r1592, %r1593, %r1594, %r1595}, [%r113+192]; 2026-02-21T09:15:08.6602430Z mov.b32 {%rs677, %rs678}, %r1595; 2026-02-21T09:15:08.6602495Z mov.b32 {%rs679, %rs680}, %r1594; 2026-02-21T09:15:08.6602554Z mov.b32 {%rs681, %rs682}, %r1593; 2026-02-21T09:15:08.6602616Z mov.b32 {%rs683, %rs684}, %r1592; 2026-02-21T09:15:08.6602717Z ld.shared.v4.b32 {%r1596, %r1597, %r1598, %r1599}, [%r113+208]; 2026-02-21T09:15:08.6602791Z mov.b32 {%rs685, %rs686}, %r1599; 2026-02-21T09:15:08.6602856Z mov.b32 {%rs687, %rs688}, %r1598; 2026-02-21T09:15:08.6602921Z mov.b32 {%rs689, %rs690}, %r1597; 2026-02-21T09:15:08.6603000Z mov.b32 {%rs691, %rs692}, %r1596; 2026-02-21T09:15:08.6603102Z ld.shared.v4.b32 {%r1600, %r1601, %r1602, %r1603}, [%r113+224]; 2026-02-21T09:15:08.6603166Z mov.b32 {%rs693, %rs694}, %r1603; 2026-02-21T09:15:08.6603234Z mov.b32 {%rs695, %rs696}, %r1602; 2026-02-21T09:15:08.6603295Z mov.b32 {%rs697, %rs698}, %r1601; 2026-02-21T09:15:08.6603357Z mov.b32 {%rs699, %rs700}, %r1600; 2026-02-21T09:15:08.6603458Z ld.shared.v4.b32 {%r1604, %r1605, %r1606, %r1607}, [%r113+240]; 2026-02-21T09:15:08.6603527Z mov.b32 {%rs701, %rs702}, %r1607; 2026-02-21T09:15:08.6603588Z mov.b32 {%rs703, %rs704}, %r1606; 2026-02-21T09:15:08.6603648Z mov.b32 {%rs705, %rs706}, %r1605; 2026-02-21T09:15:08.6603715Z mov.b32 {%rs707, %rs708}, %r1604; 2026-02-21T09:15:08.6603893Z .loc 1 52 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:52:32 2026-02-21T09:15:08.6603962Z cvt.f32.bf16 %r1416, %rs587; 2026-02-21T09:15:08.6604032Z cvt.f32.bf16 %r1417, %rs588; 2026-02-21T09:15:08.6604096Z cvt.f32.bf16 %r1418, %rs585; 2026-02-21T09:15:08.6604157Z cvt.f32.bf16 %r1419, %rs586; 2026-02-21T09:15:08.6604218Z cvt.f32.bf16 %r1420, %rs583; 2026-02-21T09:15:08.6604287Z cvt.f32.bf16 %r1421, %rs584; 2026-02-21T09:15:08.6604347Z cvt.f32.bf16 %r1422, %rs581; 2026-02-21T09:15:08.6604406Z 
cvt.f32.bf16 %r1423, %rs582; 2026-02-21T09:15:08.6604474Z cvt.f32.bf16 %r1424, %rs595; 2026-02-21T09:15:08.6604536Z cvt.f32.bf16 %r1425, %rs596; 2026-02-21T09:15:08.6604596Z cvt.f32.bf16 %r1426, %rs593; 2026-02-21T09:15:08.6604656Z cvt.f32.bf16 %r1427, %rs594; 2026-02-21T09:15:08.6604726Z cvt.f32.bf16 %r1428, %rs591; 2026-02-21T09:15:08.6604786Z cvt.f32.bf16 %r1429, %rs592; 2026-02-21T09:15:08.6604845Z cvt.f32.bf16 %r1430, %rs589; 2026-02-21T09:15:08.6604911Z cvt.f32.bf16 %r1431, %rs590; 2026-02-21T09:15:08.6604971Z cvt.f32.bf16 %r1432, %rs603; 2026-02-21T09:15:08.6605059Z cvt.f32.bf16 %r1433, %rs604; 2026-02-21T09:15:08.6605120Z cvt.f32.bf16 %r1434, %rs601; 2026-02-21T09:15:08.6605190Z cvt.f32.bf16 %r1435, %rs602; 2026-02-21T09:15:08.6605251Z cvt.f32.bf16 %r1436, %rs599; 2026-02-21T09:15:08.6605311Z cvt.f32.bf16 %r1437, %rs600; 2026-02-21T09:15:08.6605380Z cvt.f32.bf16 %r1438, %rs597; 2026-02-21T09:15:08.6605441Z cvt.f32.bf16 %r1439, %rs598; 2026-02-21T09:15:08.6605501Z cvt.f32.bf16 %r1440, %rs611; 2026-02-21T09:15:08.6605562Z cvt.f32.bf16 %r1441, %rs612; 2026-02-21T09:15:08.6605654Z cvt.f32.bf16 %r1442, %rs609; 2026-02-21T09:15:08.6605714Z cvt.f32.bf16 %r1443, %rs610; 2026-02-21T09:15:08.6605774Z cvt.f32.bf16 %r1444, %rs607; 2026-02-21T09:15:08.6605843Z cvt.f32.bf16 %r1445, %rs608; 2026-02-21T09:15:08.6605903Z cvt.f32.bf16 %r1446, %rs605; 2026-02-21T09:15:08.6605963Z cvt.f32.bf16 %r1447, %rs606; 2026-02-21T09:15:08.6606030Z cvt.f32.bf16 %r1448, %rs619; 2026-02-21T09:15:08.6606089Z cvt.f32.bf16 %r1449, %rs620; 2026-02-21T09:15:08.6606151Z cvt.f32.bf16 %r1450, %rs617; 2026-02-21T09:15:08.6606212Z cvt.f32.bf16 %r1451, %rs618; 2026-02-21T09:15:08.6606298Z cvt.f32.bf16 %r1452, %rs615; 2026-02-21T09:15:08.6606360Z cvt.f32.bf16 %r1453, %rs616; 2026-02-21T09:15:08.6606421Z cvt.f32.bf16 %r1454, %rs613; 2026-02-21T09:15:08.6606487Z cvt.f32.bf16 %r1455, %rs614; 2026-02-21T09:15:08.6606548Z cvt.f32.bf16 %r1456, %rs627; 2026-02-21T09:15:08.6606607Z cvt.f32.bf16 %r1457, 
%rs628; 2026-02-21T09:15:08.6606686Z cvt.f32.bf16 %r1458, %rs625; 2026-02-21T09:15:08.6606754Z cvt.f32.bf16 %r1459, %rs626; 2026-02-21T09:15:08.6606814Z cvt.f32.bf16 %r1460, %rs623; 2026-02-21T09:15:08.6606874Z cvt.f32.bf16 %r1461, %rs624; 2026-02-21T09:15:08.6606943Z cvt.f32.bf16 %r1462, %rs621; 2026-02-21T09:15:08.6607003Z cvt.f32.bf16 %r1463, %rs622; 2026-02-21T09:15:08.6607064Z cvt.f32.bf16 %r1464, %rs635; 2026-02-21T09:15:08.6607124Z cvt.f32.bf16 %r1465, %rs636; 2026-02-21T09:15:08.6607192Z cvt.f32.bf16 %r1466, %rs633; 2026-02-21T09:15:08.6607252Z cvt.f32.bf16 %r1467, %rs634; 2026-02-21T09:15:08.6607315Z cvt.f32.bf16 %r1468, %rs631; 2026-02-21T09:15:08.6607384Z cvt.f32.bf16 %r1469, %rs632; 2026-02-21T09:15:08.6607444Z cvt.f32.bf16 %r1470, %rs629; 2026-02-21T09:15:08.6607505Z cvt.f32.bf16 %r1471, %rs630; 2026-02-21T09:15:08.6607573Z cvt.f32.bf16 %r1472, %rs643; 2026-02-21T09:15:08.6607632Z cvt.f32.bf16 %r1473, %rs644; 2026-02-21T09:15:08.6607692Z cvt.f32.bf16 %r1474, %rs641; 2026-02-21T09:15:08.6607754Z cvt.f32.bf16 %r1475, %rs642; 2026-02-21T09:15:08.6607825Z cvt.f32.bf16 %r1476, %rs639; 2026-02-21T09:15:08.6607884Z cvt.f32.bf16 %r1477, %rs640; 2026-02-21T09:15:08.6607943Z cvt.f32.bf16 %r1478, %rs637; 2026-02-21T09:15:08.6608011Z cvt.f32.bf16 %r1479, %rs638; 2026-02-21T09:15:08.6608071Z cvt.f32.bf16 %r1480, %rs651; 2026-02-21T09:15:08.6608133Z cvt.f32.bf16 %r1481, %rs652; 2026-02-21T09:15:08.6608193Z cvt.f32.bf16 %r1482, %rs649; 2026-02-21T09:15:08.6608260Z cvt.f32.bf16 %r1483, %rs650; 2026-02-21T09:15:08.6608320Z cvt.f32.bf16 %r1484, %rs647; 2026-02-21T09:15:08.6608383Z cvt.f32.bf16 %r1485, %rs648; 2026-02-21T09:15:08.6608451Z cvt.f32.bf16 %r1486, %rs645; 2026-02-21T09:15:08.6608513Z cvt.f32.bf16 %r1487, %rs646; 2026-02-21T09:15:08.6608574Z cvt.f32.bf16 %r1488, %rs659; 2026-02-21T09:15:08.6608635Z cvt.f32.bf16 %r1489, %rs660; 2026-02-21T09:15:08.6608703Z cvt.f32.bf16 %r1490, %rs657; 2026-02-21T09:15:08.6608764Z cvt.f32.bf16 %r1491, %rs658; 
2026-02-21T09:15:08.6608826Z cvt.f32.bf16 %r1492, %rs655; 2026-02-21T09:15:08.6608895Z cvt.f32.bf16 %r1493, %rs656; 2026-02-21T09:15:08.6608955Z cvt.f32.bf16 %r1494, %rs653; 2026-02-21T09:15:08.6609017Z cvt.f32.bf16 %r1495, %rs654; 2026-02-21T09:15:08.6609077Z cvt.f32.bf16 %r1496, %rs667; 2026-02-21T09:15:08.6609145Z cvt.f32.bf16 %r1497, %rs668; 2026-02-21T09:15:08.6609208Z cvt.f32.bf16 %r1498, %rs665; 2026-02-21T09:15:08.6609270Z cvt.f32.bf16 %r1499, %rs666; 2026-02-21T09:15:08.6609342Z cvt.f32.bf16 %r1500, %rs663; 2026-02-21T09:15:08.6609404Z cvt.f32.bf16 %r1501, %rs664; 2026-02-21T09:15:08.6609489Z cvt.f32.bf16 %r1502, %rs661; 2026-02-21T09:15:08.6609560Z cvt.f32.bf16 %r1503, %rs662; 2026-02-21T09:15:08.6609622Z cvt.f32.bf16 %r1504, %rs675; 2026-02-21T09:15:08.6609682Z cvt.f32.bf16 %r1505, %rs676; 2026-02-21T09:15:08.6609743Z cvt.f32.bf16 %r1506, %rs673; 2026-02-21T09:15:08.6609813Z cvt.f32.bf16 %r1507, %rs674; 2026-02-21T09:15:08.6609872Z cvt.f32.bf16 %r1508, %rs671; 2026-02-21T09:15:08.6609933Z cvt.f32.bf16 %r1509, %rs672; 2026-02-21T09:15:08.6610022Z cvt.f32.bf16 %r1510, %rs669; 2026-02-21T09:15:08.6610080Z cvt.f32.bf16 %r1511, %rs670; 2026-02-21T09:15:08.6610140Z cvt.f32.bf16 %r1512, %rs683; 2026-02-21T09:15:08.6610199Z cvt.f32.bf16 %r1513, %rs684; 2026-02-21T09:15:08.6610265Z cvt.f32.bf16 %r1514, %rs681; 2026-02-21T09:15:08.6610326Z cvt.f32.bf16 %r1515, %rs682; 2026-02-21T09:15:08.6610385Z cvt.f32.bf16 %r1516, %rs679; 2026-02-21T09:15:08.6610451Z cvt.f32.bf16 %r1517, %rs680; 2026-02-21T09:15:08.6610509Z cvt.f32.bf16 %r1518, %rs677; 2026-02-21T09:15:08.6610570Z cvt.f32.bf16 %r1519, %rs678; 2026-02-21T09:15:08.6610629Z cvt.f32.bf16 %r1520, %rs691; 2026-02-21T09:15:08.6610717Z cvt.f32.bf16 %r1521, %rs692; 2026-02-21T09:15:08.6610778Z cvt.f32.bf16 %r1522, %rs689; 2026-02-21T09:15:08.6610838Z cvt.f32.bf16 %r1523, %rs690; 2026-02-21T09:15:08.6610906Z cvt.f32.bf16 %r1524, %rs687; 2026-02-21T09:15:08.6610965Z cvt.f32.bf16 %r1525, %rs688; 
2026-02-21T09:15:08.6611046Z cvt.f32.bf16 %r1526, %rs685; 2026-02-21T09:15:08.6611120Z cvt.f32.bf16 %r1527, %rs686; 2026-02-21T09:15:08.6611183Z cvt.f32.bf16 %r1528, %rs699; 2026-02-21T09:15:08.6611240Z cvt.f32.bf16 %r1529, %rs700; 2026-02-21T09:15:08.6611297Z cvt.f32.bf16 %r1530, %rs697; 2026-02-21T09:15:08.6611362Z cvt.f32.bf16 %r1531, %rs698; 2026-02-21T09:15:08.6611420Z cvt.f32.bf16 %r1532, %rs695; 2026-02-21T09:15:08.6611478Z cvt.f32.bf16 %r1533, %rs696; 2026-02-21T09:15:08.6611580Z cvt.f32.bf16 %r1534, %rs693; 2026-02-21T09:15:08.6611638Z cvt.f32.bf16 %r1535, %rs694; 2026-02-21T09:15:08.6611697Z cvt.f32.bf16 %r1536, %rs707; 2026-02-21T09:15:08.6611754Z cvt.f32.bf16 %r1537, %rs708; 2026-02-21T09:15:08.6611821Z cvt.f32.bf16 %r1538, %rs705; 2026-02-21T09:15:08.6611879Z cvt.f32.bf16 %r1539, %rs706; 2026-02-21T09:15:08.6611938Z cvt.f32.bf16 %r1540, %rs703; 2026-02-21T09:15:08.6612005Z cvt.f32.bf16 %r1541, %rs704; 2026-02-21T09:15:08.6612063Z cvt.f32.bf16 %r1542, %rs701; 2026-02-21T09:15:08.6612122Z cvt.f32.bf16 %r1543, %rs702; 2026-02-21T09:15:08.6612300Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6612372Z ld.shared.b8 %rs709, [%r114]; 2026-02-21T09:15:08.6612434Z ld.shared.b8 %rs710, [%r114+128]; 2026-02-21T09:15:08.6612497Z ld.shared.b8 %rs711, [%r114+256]; 2026-02-21T09:15:08.6612565Z ld.shared.b8 %rs712, [%r114+384]; 2026-02-21T09:15:08.6612626Z ld.shared.b8 %rs713, [%r114+512]; 2026-02-21T09:15:08.6612687Z ld.shared.b8 %rs714, [%r114+640]; 2026-02-21T09:15:08.6612758Z ld.shared.b8 %rs715, [%r114+768]; 2026-02-21T09:15:08.6612819Z ld.shared.b8 %rs716, [%r114+896]; 2026-02-21T09:15:08.6612886Z ld.shared.b8 %rs717, [%r114+1024]; 2026-02-21T09:15:08.6612950Z ld.shared.b8 %rs718, [%r114+1152]; 2026-02-21T09:15:08.6613021Z ld.shared.b8 %rs719, [%r114+1280]; 2026-02-21T09:15:08.6613082Z ld.shared.b8 %rs720, [%r114+1408]; 2026-02-21T09:15:08.6613144Z ld.shared.b8 %rs721, [%r114+1536]; 2026-02-21T09:15:08.6613213Z 
ld.shared.b8 %rs722, [%r114+1664]; 2026-02-21T09:15:08.6613275Z ld.shared.b8 %rs723, [%r114+1792]; 2026-02-21T09:15:08.6613336Z ld.shared.b8 %rs724, [%r114+1920]; 2026-02-21T09:15:08.6613396Z ld.shared.b8 %rs725, [%r114+2048]; 2026-02-21T09:15:08.6613465Z ld.shared.b8 %rs726, [%r114+2176]; 2026-02-21T09:15:08.6613525Z ld.shared.b8 %rs727, [%r114+2304]; 2026-02-21T09:15:08.6613585Z ld.shared.b8 %rs728, [%r114+2432]; 2026-02-21T09:15:08.6613654Z ld.shared.b8 %rs729, [%r114+2560]; 2026-02-21T09:15:08.6613714Z ld.shared.b8 %rs730, [%r114+2688]; 2026-02-21T09:15:08.6613802Z ld.shared.b8 %rs731, [%r114+2816]; 2026-02-21T09:15:08.6613864Z ld.shared.b8 %rs732, [%r114+2944]; 2026-02-21T09:15:08.6613931Z ld.shared.b8 %rs733, [%r114+3072]; 2026-02-21T09:15:08.6613991Z ld.shared.b8 %rs734, [%r114+3200]; 2026-02-21T09:15:08.6614052Z ld.shared.b8 %rs735, [%r114+3328]; 2026-02-21T09:15:08.6614119Z ld.shared.b8 %rs736, [%r114+3456]; 2026-02-21T09:15:08.6614179Z ld.shared.b8 %rs737, [%r114+3584]; 2026-02-21T09:15:08.6614275Z ld.shared.b8 %rs738, [%r114+3712]; 2026-02-21T09:15:08.6614343Z ld.shared.b8 %rs739, [%r114+3840]; 2026-02-21T09:15:08.6614403Z ld.shared.b8 %rs740, [%r114+3968]; 2026-02-21T09:15:08.6614570Z .loc 1 57 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:57:28 2026-02-21T09:15:08.6614632Z shl.b16 %rs741, %rs709, 4; 2026-02-21T09:15:08.6614700Z shl.b16 %rs742, %rs710, 4; 2026-02-21T09:15:08.6614759Z shl.b16 %rs743, %rs711, 4; 2026-02-21T09:15:08.6614819Z shl.b16 %rs744, %rs712, 4; 2026-02-21T09:15:08.6614888Z shl.b16 %rs745, %rs713, 4; 2026-02-21T09:15:08.6614946Z shl.b16 %rs746, %rs714, 4; 2026-02-21T09:15:08.6615031Z shl.b16 %rs747, %rs715, 4; 2026-02-21T09:15:08.6615092Z shl.b16 %rs748, %rs716, 4; 2026-02-21T09:15:08.6615157Z shl.b16 %rs749, %rs717, 4; 2026-02-21T09:15:08.6615216Z shl.b16 %rs750, %rs718, 4; 2026-02-21T09:15:08.6615273Z shl.b16 %rs751, %rs719, 4; 2026-02-21T09:15:08.6615363Z shl.b16 %rs752, %rs720, 4; 2026-02-21T09:15:08.6615424Z 
shl.b16 %rs753, %rs721, 4; 2026-02-21T09:15:08.6615483Z shl.b16 %rs754, %rs722, 4; 2026-02-21T09:15:08.6615540Z shl.b16 %rs755, %rs723, 4; 2026-02-21T09:15:08.6615605Z shl.b16 %rs756, %rs724, 4; 2026-02-21T09:15:08.6615662Z shl.b16 %rs757, %rs725, 4; 2026-02-21T09:15:08.6615720Z shl.b16 %rs758, %rs726, 4; 2026-02-21T09:15:08.6615785Z shl.b16 %rs759, %rs727, 4; 2026-02-21T09:15:08.6615842Z shl.b16 %rs760, %rs728, 4; 2026-02-21T09:15:08.6615900Z shl.b16 %rs761, %rs729, 4; 2026-02-21T09:15:08.6615968Z shl.b16 %rs762, %rs730, 4; 2026-02-21T09:15:08.6616028Z shl.b16 %rs763, %rs731, 4; 2026-02-21T09:15:08.6616089Z shl.b16 %rs764, %rs732, 4; 2026-02-21T09:15:08.6616153Z shl.b16 %rs765, %rs733, 4; 2026-02-21T09:15:08.6616220Z shl.b16 %rs766, %rs734, 4; 2026-02-21T09:15:08.6616278Z shl.b16 %rs767, %rs735, 4; 2026-02-21T09:15:08.6616334Z shl.b16 %rs768, %rs736, 4; 2026-02-21T09:15:08.6616398Z shl.b16 %rs769, %rs737, 4; 2026-02-21T09:15:08.6616456Z shl.b16 %rs770, %rs738, 4; 2026-02-21T09:15:08.6616513Z shl.b16 %rs771, %rs739, 4; 2026-02-21T09:15:08.6616570Z shl.b16 %rs772, %rs740, 4; 2026-02-21T09:15:08.6616751Z .loc 1 72 58 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:72:58 2026-02-21T09:15:08.6616821Z selp.b16 %rs773, %rs741, %rs709, %p26; 2026-02-21T09:15:08.6616880Z cvt.s16.s8 %rs774, %rs773; 2026-02-21T09:15:08.6616944Z shr.s16 %rs775, %rs774, 4; 2026-02-21T09:15:08.6617012Z selp.b16 %rs776, %rs742, %rs710, %p26; 2026-02-21T09:15:08.6617072Z cvt.s16.s8 %rs777, %rs776; 2026-02-21T09:15:08.6617130Z shr.s16 %rs778, %rs777, 4; 2026-02-21T09:15:08.6617203Z selp.b16 %rs779, %rs743, %rs711, %p26; 2026-02-21T09:15:08.6617262Z cvt.s16.s8 %rs780, %rs779; 2026-02-21T09:15:08.6617320Z shr.s16 %rs781, %rs780, 4; 2026-02-21T09:15:08.6617393Z selp.b16 %rs782, %rs744, %rs712, %p26; 2026-02-21T09:15:08.6617450Z cvt.s16.s8 %rs783, %rs782; 2026-02-21T09:15:08.6617507Z shr.s16 %rs784, %rs783, 4; 2026-02-21T09:15:08.6617579Z selp.b16 %rs785, %rs745, %rs713, %p26; 
2026-02-21T09:15:08.6617637Z cvt.s16.s8 %rs786, %rs785; 2026-02-21T09:15:08.6617693Z shr.s16 %rs787, %rs786, 4; 2026-02-21T09:15:08.6617756Z selp.b16 %rs788, %rs746, %rs714, %p26; 2026-02-21T09:15:08.6617822Z cvt.s16.s8 %rs789, %rs788; 2026-02-21T09:15:08.6617880Z shr.s16 %rs790, %rs789, 4; 2026-02-21T09:15:08.6617943Z selp.b16 %rs791, %rs747, %rs715, %p26; 2026-02-21T09:15:08.6618009Z cvt.s16.s8 %rs792, %rs791; 2026-02-21T09:15:08.6618067Z shr.s16 %rs793, %rs792, 4; 2026-02-21T09:15:08.6618155Z selp.b16 %rs794, %rs748, %rs716, %p26; 2026-02-21T09:15:08.6618213Z cvt.s16.s8 %rs795, %rs794; 2026-02-21T09:15:08.6618279Z shr.s16 %rs796, %rs795, 4; 2026-02-21T09:15:08.6618343Z selp.b16 %rs797, %rs749, %rs717, %p26; 2026-02-21T09:15:08.6618401Z cvt.s16.s8 %rs798, %rs797; 2026-02-21T09:15:08.6618466Z shr.s16 %rs799, %rs798, 4; 2026-02-21T09:15:08.6618531Z selp.b16 %rs800, %rs750, %rs718, %p26; 2026-02-21T09:15:08.6618590Z cvt.s16.s8 %rs801, %rs800; 2026-02-21T09:15:08.6618670Z shr.s16 %rs802, %rs801, 4; 2026-02-21T09:15:08.6618739Z selp.b16 %rs803, %rs751, %rs719, %p26; 2026-02-21T09:15:08.6618797Z cvt.s16.s8 %rs804, %rs803; 2026-02-21T09:15:08.6618853Z shr.s16 %rs805, %rs804, 4; 2026-02-21T09:15:08.6618924Z selp.b16 %rs806, %rs752, %rs720, %p26; 2026-02-21T09:15:08.6618982Z cvt.s16.s8 %rs807, %rs806; 2026-02-21T09:15:08.6619039Z shr.s16 %rs808, %rs807, 4; 2026-02-21T09:15:08.6619101Z selp.b16 %rs809, %rs753, %rs721, %p26; 2026-02-21T09:15:08.6619167Z cvt.s16.s8 %rs810, %rs809; 2026-02-21T09:15:08.6619223Z shr.s16 %rs811, %rs810, 4; 2026-02-21T09:15:08.6619321Z selp.b16 %rs812, %rs754, %rs722, %p26; 2026-02-21T09:15:08.6619386Z cvt.s16.s8 %rs813, %rs812; 2026-02-21T09:15:08.6619444Z shr.s16 %rs814, %rs813, 4; 2026-02-21T09:15:08.6619507Z selp.b16 %rs815, %rs755, %rs723, %p26; 2026-02-21T09:15:08.6619572Z cvt.s16.s8 %rs816, %rs815; 2026-02-21T09:15:08.6619629Z shr.s16 %rs817, %rs816, 4; 2026-02-21T09:15:08.6619711Z selp.b16 %rs818, %rs756, %rs724, %p26; 
2026-02-21T09:15:08.6619771Z cvt.s16.s8 %rs819, %rs818; 2026-02-21T09:15:08.6619836Z shr.s16 %rs820, %rs819, 4; 2026-02-21T09:15:08.6619898Z selp.b16 %rs821, %rs757, %rs725, %p26; 2026-02-21T09:15:08.6619956Z cvt.s16.s8 %rs822, %rs821; 2026-02-21T09:15:08.6620020Z shr.s16 %rs823, %rs822, 4; 2026-02-21T09:15:08.6620083Z selp.b16 %rs824, %rs758, %rs726, %p26; 2026-02-21T09:15:08.6620140Z cvt.s16.s8 %rs825, %rs824; 2026-02-21T09:15:08.6620197Z shr.s16 %rs826, %rs825, 4; 2026-02-21T09:15:08.6620270Z selp.b16 %rs827, %rs759, %rs727, %p26; 2026-02-21T09:15:08.6620327Z cvt.s16.s8 %rs828, %rs827; 2026-02-21T09:15:08.6620385Z shr.s16 %rs829, %rs828, 4; 2026-02-21T09:15:08.6620456Z selp.b16 %rs830, %rs760, %rs728, %p26; 2026-02-21T09:15:08.6620513Z cvt.s16.s8 %rs831, %rs830; 2026-02-21T09:15:08.6620569Z shr.s16 %rs832, %rs831, 4; 2026-02-21T09:15:08.6620631Z selp.b16 %rs833, %rs761, %rs729, %p26; 2026-02-21T09:15:08.6620697Z cvt.s16.s8 %rs834, %rs833; 2026-02-21T09:15:08.6620755Z shr.s16 %rs835, %rs834, 4; 2026-02-21T09:15:08.6620818Z selp.b16 %rs836, %rs762, %rs730, %p26; 2026-02-21T09:15:08.6620883Z cvt.s16.s8 %rs837, %rs836; 2026-02-21T09:15:08.6620940Z shr.s16 %rs838, %rs837, 4; 2026-02-21T09:15:08.6621003Z selp.b16 %rs839, %rs763, %rs731, %p26; 2026-02-21T09:15:08.6621068Z cvt.s16.s8 %rs840, %rs839; 2026-02-21T09:15:08.6621127Z shr.s16 %rs841, %rs840, 4; 2026-02-21T09:15:08.6621190Z selp.b16 %rs842, %rs764, %rs732, %p26; 2026-02-21T09:15:08.6621250Z cvt.s16.s8 %rs843, %rs842; 2026-02-21T09:15:08.6621314Z shr.s16 %rs844, %rs843, 4; 2026-02-21T09:15:08.6621380Z selp.b16 %rs845, %rs765, %rs733, %p26; 2026-02-21T09:15:08.6621437Z cvt.s16.s8 %rs846, %rs845; 2026-02-21T09:15:08.6621501Z shr.s16 %rs847, %rs846, 4; 2026-02-21T09:15:08.6621598Z selp.b16 %rs848, %rs766, %rs734, %p26; 2026-02-21T09:15:08.6621658Z cvt.s16.s8 %rs849, %rs848; 2026-02-21T09:15:08.6621718Z shr.s16 %rs850, %rs849, 4; 2026-02-21T09:15:08.6621793Z selp.b16 %rs851, %rs767, %rs735, %p26; 
2026-02-21T09:15:08.6621855Z cvt.s16.s8 %rs852, %rs851; 2026-02-21T09:15:08.6621916Z shr.s16 %rs853, %rs852, 4; 2026-02-21T09:15:08.6621990Z selp.b16 %rs854, %rs768, %rs736, %p26; 2026-02-21T09:15:08.6622050Z cvt.s16.s8 %rs855, %rs854; 2026-02-21T09:15:08.6622109Z shr.s16 %rs856, %rs855, 4; 2026-02-21T09:15:08.6622175Z selp.b16 %rs857, %rs769, %rs737, %p26; 2026-02-21T09:15:08.6622244Z cvt.s16.s8 %rs858, %rs857; 2026-02-21T09:15:08.6622305Z shr.s16 %rs859, %rs858, 4; 2026-02-21T09:15:08.6622398Z selp.b16 %rs860, %rs770, %rs738, %p26; 2026-02-21T09:15:08.6622468Z cvt.s16.s8 %rs861, %rs860; 2026-02-21T09:15:08.6622531Z shr.s16 %rs862, %rs861, 4; 2026-02-21T09:15:08.6622599Z selp.b16 %rs863, %rs771, %rs739, %p26; 2026-02-21T09:15:08.6622660Z cvt.s16.s8 %rs864, %rs863; 2026-02-21T09:15:08.6622731Z shr.s16 %rs865, %rs864, 4; 2026-02-21T09:15:08.6622800Z selp.b16 %rs866, %rs772, %rs740, %p26; 2026-02-21T09:15:08.6622862Z cvt.s16.s8 %rs867, %rs866; 2026-02-21T09:15:08.6622956Z shr.s16 %rs868, %rs867, 4; 2026-02-21T09:15:08.6623124Z .loc 1 77 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:77:32 2026-02-21T09:15:08.6623186Z cvt.rn.f32.s16 %r1608, %rs775; 2026-02-21T09:15:08.6623254Z cvt.rn.f32.s16 %r1609, %rs778; 2026-02-21T09:15:08.6623313Z cvt.rn.f32.s16 %r1610, %rs781; 2026-02-21T09:15:08.6623371Z cvt.rn.f32.s16 %r1611, %rs784; 2026-02-21T09:15:08.6623430Z cvt.rn.f32.s16 %r1612, %rs787; 2026-02-21T09:15:08.6623497Z cvt.rn.f32.s16 %r1613, %rs790; 2026-02-21T09:15:08.6623555Z cvt.rn.f32.s16 %r1614, %rs793; 2026-02-21T09:15:08.6623638Z cvt.rn.f32.s16 %r1615, %rs796; 2026-02-21T09:15:08.6623704Z cvt.rn.f32.s16 %r1616, %rs799; 2026-02-21T09:15:08.6623763Z cvt.rn.f32.s16 %r1617, %rs802; 2026-02-21T09:15:08.6623821Z cvt.rn.f32.s16 %r1618, %rs805; 2026-02-21T09:15:08.6623880Z cvt.rn.f32.s16 %r1619, %rs808; 2026-02-21T09:15:08.6623944Z cvt.rn.f32.s16 %r1620, %rs811; 2026-02-21T09:15:08.6624028Z cvt.rn.f32.s16 %r1621, %rs814; 2026-02-21T09:15:08.6624087Z 
cvt.rn.f32.s16 %r1622, %rs817; 2026-02-21T09:15:08.6624154Z cvt.rn.f32.s16 %r1623, %rs820; 2026-02-21T09:15:08.6624213Z cvt.rn.f32.s16 %r1624, %rs823; 2026-02-21T09:15:08.6624271Z cvt.rn.f32.s16 %r1625, %rs826; 2026-02-21T09:15:08.6624328Z cvt.rn.f32.s16 %r1626, %rs829; 2026-02-21T09:15:08.6624394Z cvt.rn.f32.s16 %r1627, %rs832; 2026-02-21T09:15:08.6624453Z cvt.rn.f32.s16 %r1628, %rs835; 2026-02-21T09:15:08.6624511Z cvt.rn.f32.s16 %r1629, %rs838; 2026-02-21T09:15:08.6624579Z cvt.rn.f32.s16 %r1630, %rs841; 2026-02-21T09:15:08.6624637Z cvt.rn.f32.s16 %r1631, %rs844; 2026-02-21T09:15:08.6624697Z cvt.rn.f32.s16 %r1632, %rs847; 2026-02-21T09:15:08.6624762Z cvt.rn.f32.s16 %r1633, %rs850; 2026-02-21T09:15:08.6624821Z cvt.rn.f32.s16 %r1634, %rs853; 2026-02-21T09:15:08.6624879Z cvt.rn.f32.s16 %r1635, %rs856; 2026-02-21T09:15:08.6624937Z cvt.rn.f32.s16 %r1636, %rs859; 2026-02-21T09:15:08.6625005Z cvt.rn.f32.s16 %r1637, %rs862; 2026-02-21T09:15:08.6625064Z cvt.rn.f32.s16 %r1638, %rs865; 2026-02-21T09:15:08.6625123Z cvt.rn.f32.s16 %r1639, %rs868; 2026-02-21T09:15:08.6625184Z bar.sync 2, 256; 2026-02-21T09:15:08.6625243Z st.shared.b32 [%r104], %r1608; 2026-02-21T09:15:08.6625310Z st.shared.b32 [%r104+4096], %r1612; 2026-02-21T09:15:08.6625376Z st.shared.b32 [%r104+8192], %r1616; 2026-02-21T09:15:08.6625449Z st.shared.b32 [%r104+12288], %r1620; 2026-02-21T09:15:08.6625513Z st.shared.b32 [%r104+16384], %r1624; 2026-02-21T09:15:08.6625576Z st.shared.b32 [%r104+20480], %r1628; 2026-02-21T09:15:08.6625646Z st.shared.b32 [%r104+24576], %r1632; 2026-02-21T09:15:08.6625707Z st.shared.b32 [%r104+28672], %r1636; 2026-02-21T09:15:08.6625768Z st.shared.b32 [%r105], %r1609; 2026-02-21T09:15:08.6625830Z st.shared.b32 [%r105+4096], %r1613; 2026-02-21T09:15:08.6625899Z st.shared.b32 [%r105+8192], %r1617; 2026-02-21T09:15:08.6625959Z st.shared.b32 [%r105+12288], %r1621; 2026-02-21T09:15:08.6626021Z st.shared.b32 [%r105+16384], %r1625; 2026-02-21T09:15:08.6626092Z st.shared.b32 [%r105+20480], 
%r1629; 2026-02-21T09:15:08.6626152Z st.shared.b32 [%r105+24576], %r1633; 2026-02-21T09:15:08.6626213Z st.shared.b32 [%r105+28672], %r1637; 2026-02-21T09:15:08.6626281Z st.shared.b32 [%r106], %r1610; 2026-02-21T09:15:08.6626344Z st.shared.b32 [%r106+4096], %r1614; 2026-02-21T09:15:08.6626404Z st.shared.b32 [%r106+8192], %r1618; 2026-02-21T09:15:08.6626466Z st.shared.b32 [%r106+12288], %r1622; 2026-02-21T09:15:08.6626535Z st.shared.b32 [%r106+16384], %r1626; 2026-02-21T09:15:08.6626621Z st.shared.b32 [%r106+20480], %r1630; 2026-02-21T09:15:08.6626684Z st.shared.b32 [%r106+24576], %r1634; 2026-02-21T09:15:08.6626752Z st.shared.b32 [%r106+28672], %r1638; 2026-02-21T09:15:08.6626812Z st.shared.b32 [%r107], %r1611; 2026-02-21T09:15:08.6626873Z st.shared.b32 [%r107+4096], %r1615; 2026-02-21T09:15:08.6626942Z st.shared.b32 [%r107+8192], %r1619; 2026-02-21T09:15:08.6627004Z st.shared.b32 [%r107+12288], %r1623; 2026-02-21T09:15:08.6627085Z st.shared.b32 [%r107+16384], %r1627; 2026-02-21T09:15:08.6627147Z st.shared.b32 [%r107+20480], %r1631; 2026-02-21T09:15:08.6627216Z st.shared.b32 [%r107+24576], %r1635; 2026-02-21T09:15:08.6627277Z st.shared.b32 [%r107+28672], %r1639; 2026-02-21T09:15:08.6627455Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6627522Z add.s32 %r1413, %r973, 368752; 2026-02-21T09:15:08.6627576Z $L__tmp22: 2026-02-21T09:15:08.6627801Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6627883Z // begin inline asm 2026-02-21T09:15:08.6627944Z 2026-02-21T09:15:08.6627994Z { 2026-02-21T09:15:08.6628057Z .reg .pred complete; 2026-02-21T09:15:08.6628119Z waitLoop: 2026-02-21T09:15:08.6628238Z mbarrier.try_wait.parity.shared.b64 complete, [%r1413], %r2370; 2026-02-21T09:15:08.6628325Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6628379Z } 2026-02-21T09:15:08.6628390Z 2026-02-21T09:15:08.6628447Z // end inline asm 
2026-02-21T09:15:08.6628510Z mov.pred %p165, -1; 2026-02-21T09:15:08.6628567Z // begin inline asm 2026-02-21T09:15:08.6630236Z @%p165 tcgen05.st.sync.aligned.32x32b.x128.b32 [%r1746 + 0], {%r1416, %r1417, %r1418, %r1419, %r1420, %r1421, %r1422, %r1423, %r1424, %r1425, %r1426, %r1427, %r1428, %r1429, %r1430, %r1431, %r1432, %r1433, %r1434, %r1435, %r1436, %r1437, %r1438, %r1439, %r1440, %r1441, %r1442, %r1443, %r1444, %r1445, %r1446, %r1447, %r1448, %r1449, %r1450, %r1451, %r1452, %r1453, %r1454, %r1455, %r1456, %r1457, %r1458, %r1459, %r1460, %r1461, %r1462, %r1463, %r1464, %r1465, %r1466, %r1467, %r1468, %r1469, %r1470, %r1471, %r1472, %r1473, %r1474, %r1475, %r1476, %r1477, %r1478, %r1479, %r1480, %r1481, %r1482, %r1483, %r1484, %r1485, %r1486, %r1487, %r1488, %r1489, %r1490, %r1491, %r1492, %r1493, %r1494, %r1495, %r1496, %r1497, %r1498, %r1499, %r1500, %r1501, %r1502, %r1503, %r1504, %r1505, %r1506, %r1507, %r1508, %r1509, %r1510, %r1511, %r1512, %r1513, %r1514, %r1515, %r1516, %r1517, %r1518, %r1519, %r1520, %r1521, %r1522, %r1523, %r1524, %r1525, %r1526, %r1527, %r1528, %r1529, %r1530, %r1531, %r1532, %r1533, %r1534, %r1535, %r1536, %r1537, %r1538, %r1539, %r1540, %r1541, %r1542, %r1543}; 2026-02-21T09:15:08.6630296Z // end inline asm 2026-02-21T09:15:08.6630362Z // begin inline asm 2026-02-21T09:15:08.6630435Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:15:08.6630490Z // end inline asm 2026-02-21T09:15:08.6630555Z bar.sync 2, 256; 2026-02-21T09:15:08.6630613Z // begin inline asm 2026-02-21T09:15:08.6630686Z fence.proxy.async.shared::cta; 2026-02-21T09:15:08.6630749Z // end inline asm 2026-02-21T09:15:08.6630806Z bar.sync 2, 256; 2026-02-21T09:15:08.6630868Z @%p27 bra $L__BB0_13; 2026-02-21T09:15:08.6630923Z $L__tmp23: 2026-02-21T09:15:08.6631031Z // %bb.12: // in Loop: Header=BB0_7 Depth=2 2026-02-21T09:15:08.6631203Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6631264Z add.s32 %r1742, %r973, 368768; 
2026-02-21T09:15:08.6631323Z $L__tmp24: 2026-02-21T09:15:08.6631565Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6631627Z add.s32 %r1643, %r120, %r6; 2026-02-21T09:15:08.6631692Z elect.sync %r1743|%p166, -1; 2026-02-21T09:15:08.6631759Z mov.b32 %r1645, 134744336; 2026-02-21T09:15:08.6631840Z // begin inline asm 2026-02-21T09:15:08.6632001Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 0 ], %rd340, %r1645, %p165; 2026-02-21T09:15:08.6632063Z // end inline asm 2026-02-21T09:15:08.6632119Z // begin inline asm 2026-02-21T09:15:08.6632271Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 8 ], %rd341, %r1645, %p165; 2026-02-21T09:15:08.6632333Z // end inline asm 2026-02-21T09:15:08.6632390Z // begin inline asm 2026-02-21T09:15:08.6632564Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 16 ], %rd342, %r1645, %p165; 2026-02-21T09:15:08.6632628Z // end inline asm 2026-02-21T09:15:08.6632683Z // begin inline asm 2026-02-21T09:15:08.6632832Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 24 ], %rd343, %r1645, %p165; 2026-02-21T09:15:08.6632887Z // end inline asm 2026-02-21T09:15:08.6632952Z // begin inline asm 2026-02-21T09:15:08.6633103Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 32 ], %rd344, %r1645, %p165; 2026-02-21T09:15:08.6633160Z // end inline asm 2026-02-21T09:15:08.6633227Z // begin inline asm 2026-02-21T09:15:08.6633397Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 40 ], %rd345, %r1645, %p165; 2026-02-21T09:15:08.6633453Z // end inline asm 2026-02-21T09:15:08.6633516Z // begin inline asm 2026-02-21T09:15:08.6633688Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 48 ], %rd346, %r1645, %p165; 2026-02-21T09:15:08.6633746Z // end inline asm 2026-02-21T09:15:08.6633812Z // begin inline asm 2026-02-21T09:15:08.6633958Z @%p166 
tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 56 ], %rd347, %r1645, %p165; 2026-02-21T09:15:08.6634013Z // end inline asm 2026-02-21T09:15:08.6634067Z // begin inline asm 2026-02-21T09:15:08.6634220Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 64 ], %rd348, %r1645, %p165; 2026-02-21T09:15:08.6634274Z // end inline asm 2026-02-21T09:15:08.6634329Z // begin inline asm 2026-02-21T09:15:08.6634485Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 72 ], %rd349, %r1645, %p165; 2026-02-21T09:15:08.6634541Z // end inline asm 2026-02-21T09:15:08.6634596Z // begin inline asm 2026-02-21T09:15:08.6634748Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 80 ], %rd350, %r1645, %p165; 2026-02-21T09:15:08.6634802Z // end inline asm 2026-02-21T09:15:08.6634857Z // begin inline asm 2026-02-21T09:15:08.6635014Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 88 ], %rd351, %r1645, %p165; 2026-02-21T09:15:08.6635069Z // end inline asm 2026-02-21T09:15:08.6635125Z // begin inline asm 2026-02-21T09:15:08.6635270Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 96 ], %rd352, %r1645, %p165; 2026-02-21T09:15:08.6635330Z // end inline asm 2026-02-21T09:15:08.6635387Z // begin inline asm 2026-02-21T09:15:08.6635536Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 104 ], %rd353, %r1645, %p165; 2026-02-21T09:15:08.6635599Z // end inline asm 2026-02-21T09:15:08.6635655Z // begin inline asm 2026-02-21T09:15:08.6635803Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 112 ], %rd354, %r1645, %p165; 2026-02-21T09:15:08.6635863Z // end inline asm 2026-02-21T09:15:08.6635918Z // begin inline asm 2026-02-21T09:15:08.6636068Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 120 ], %rd355, %r1645, %p165; 2026-02-21T09:15:08.6636131Z // end inline asm 2026-02-21T09:15:08.6636187Z // begin inline asm 
2026-02-21T09:15:08.6636331Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 128 ], %rd356, %r1645, %p165; 2026-02-21T09:15:08.6636385Z // end inline asm 2026-02-21T09:15:08.6636449Z // begin inline asm 2026-02-21T09:15:08.6636595Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 136 ], %rd357, %r1645, %p165; 2026-02-21T09:15:08.6636649Z // end inline asm 2026-02-21T09:15:08.6636712Z // begin inline asm 2026-02-21T09:15:08.6636879Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 144 ], %rd358, %r1645, %p165; 2026-02-21T09:15:08.6636933Z // end inline asm 2026-02-21T09:15:08.6636996Z // begin inline asm 2026-02-21T09:15:08.6637141Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 152 ], %rd359, %r1645, %p165; 2026-02-21T09:15:08.6637194Z // end inline asm 2026-02-21T09:15:08.6637250Z // begin inline asm 2026-02-21T09:15:08.6637439Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 160 ], %rd360, %r1645, %p165; 2026-02-21T09:15:08.6637494Z // end inline asm 2026-02-21T09:15:08.6637550Z // begin inline asm 2026-02-21T09:15:08.6637703Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 168 ], %rd361, %r1645, %p165; 2026-02-21T09:15:08.6637756Z // end inline asm 2026-02-21T09:15:08.6637813Z // begin inline asm 2026-02-21T09:15:08.6637968Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 176 ], %rd362, %r1645, %p165; 2026-02-21T09:15:08.6638023Z // end inline asm 2026-02-21T09:15:08.6638077Z // begin inline asm 2026-02-21T09:15:08.6638251Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 184 ], %rd363, %r1645, %p165; 2026-02-21T09:15:08.6638306Z // end inline asm 2026-02-21T09:15:08.6638363Z // begin inline asm 2026-02-21T09:15:08.6638532Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 192 ], %rd364, %r1645, %p165; 2026-02-21T09:15:08.6638590Z // end inline asm 
2026-02-21T09:15:08.6638647Z // begin inline asm 2026-02-21T09:15:08.6638794Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 200 ], %rd365, %r1645, %p165; 2026-02-21T09:15:08.6638857Z // end inline asm 2026-02-21T09:15:08.6638913Z // begin inline asm 2026-02-21T09:15:08.6639059Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 208 ], %rd366, %r1645, %p165; 2026-02-21T09:15:08.6639123Z // end inline asm 2026-02-21T09:15:08.6639181Z // begin inline asm 2026-02-21T09:15:08.6639327Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 216 ], %rd367, %r1645, %p165; 2026-02-21T09:15:08.6639390Z // end inline asm 2026-02-21T09:15:08.6639447Z // begin inline asm 2026-02-21T09:15:08.6639593Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 224 ], %rd368, %r1645, %p165; 2026-02-21T09:15:08.6639647Z // end inline asm 2026-02-21T09:15:08.6639714Z // begin inline asm 2026-02-21T09:15:08.6639864Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 232 ], %rd369, %r1645, %p165; 2026-02-21T09:15:08.6639921Z // end inline asm 2026-02-21T09:15:08.6639986Z // begin inline asm 2026-02-21T09:15:08.6640134Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 240 ], %rd370, %r1645, %p165; 2026-02-21T09:15:08.6640189Z // end inline asm 2026-02-21T09:15:08.6640252Z // begin inline asm 2026-02-21T09:15:08.6640396Z @%p166 tcgen05.mma.cta_group::1.kind::tf32 [ %r1643 + 0 ], [ %r1975 + 248 ], %rd371, %r1645, %p165; 2026-02-21T09:15:08.6640452Z // end inline asm 2026-02-21T09:15:08.6640522Z cvt.u64.u32 %rd339, %r1742; 2026-02-21T09:15:08.6640579Z // begin inline asm 2026-02-21T09:15:08.6640708Z @%p166 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd339]; 2026-02-21T09:15:08.6640761Z // end inline asm 2026-02-21T09:15:08.6640823Z $L__tmp25: 2026-02-21T09:15:08.6640925Z $L__BB0_13: // in Loop: Header=BB0_7 Depth=2 2026-02-21T09:15:08.6641090Z .loc 1 0 0 // 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:0 2026-02-21T09:15:08.6641161Z add.s32 %r1307, %r2371, 1; 2026-02-21T09:15:08.6641226Z setp.eq.b32 %p96, %r1307, 2; 2026-02-21T09:15:08.6641399Z .loc 1 48 80 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:48:80 2026-02-21T09:15:08.6641506Z ld.shared.v4.b32 {%r1875, %r1876, %r1877, %r1878}, [%r115]; 2026-02-21T09:15:08.6641598Z mov.b32 {%rs869, %rs870}, %r1878; 2026-02-21T09:15:08.6641688Z mov.b32 {%rs871, %rs872}, %r1877; 2026-02-21T09:15:08.6641746Z mov.b32 {%rs873, %rs874}, %r1876; 2026-02-21T09:15:08.6641815Z mov.b32 {%rs875, %rs876}, %r1875; 2026-02-21T09:15:08.6641917Z ld.shared.v4.b32 {%r1879, %r1880, %r1881, %r1882}, [%r115+16]; 2026-02-21T09:15:08.6641976Z mov.b32 {%rs877, %rs878}, %r1882; 2026-02-21T09:15:08.6642041Z mov.b32 {%rs879, %rs880}, %r1881; 2026-02-21T09:15:08.6642101Z mov.b32 {%rs881, %rs882}, %r1880; 2026-02-21T09:15:08.6642185Z mov.b32 {%rs883, %rs884}, %r1879; 2026-02-21T09:15:08.6642292Z ld.shared.v4.b32 {%r1883, %r1884, %r1885, %r1886}, [%r115+32]; 2026-02-21T09:15:08.6642351Z mov.b32 {%rs885, %rs886}, %r1886; 2026-02-21T09:15:08.6642408Z mov.b32 {%rs887, %rs888}, %r1885; 2026-02-21T09:15:08.6642466Z mov.b32 {%rs889, %rs890}, %r1884; 2026-02-21T09:15:08.6642532Z mov.b32 {%rs891, %rs892}, %r1883; 2026-02-21T09:15:08.6642629Z ld.shared.v4.b32 {%r1887, %r1888, %r1889, %r1890}, [%r115+48]; 2026-02-21T09:15:08.6642689Z mov.b32 {%rs893, %rs894}, %r1890; 2026-02-21T09:15:08.6642755Z mov.b32 {%rs895, %rs896}, %r1889; 2026-02-21T09:15:08.6642837Z mov.b32 {%rs897, %rs898}, %r1888; 2026-02-21T09:15:08.6642897Z mov.b32 {%rs899, %rs900}, %r1887; 2026-02-21T09:15:08.6642991Z ld.shared.v4.b32 {%r1891, %r1892, %r1893, %r1894}, [%r115+64]; 2026-02-21T09:15:08.6643060Z mov.b32 {%rs901, %rs902}, %r1894; 2026-02-21T09:15:08.6643120Z mov.b32 {%rs903, %rs904}, %r1893; 2026-02-21T09:15:08.6643203Z mov.b32 {%rs905, %rs906}, %r1892; 2026-02-21T09:15:08.6643274Z mov.b32 {%rs907, %rs908}, %r1891; 
2026-02-21T09:15:08.6643369Z ld.shared.v4.b32 {%r1895, %r1896, %r1897, %r1898}, [%r115+80]; 2026-02-21T09:15:08.6643430Z mov.b32 {%rs909, %rs910}, %r1898; 2026-02-21T09:15:08.6643496Z mov.b32 {%rs911, %rs912}, %r1897; 2026-02-21T09:15:08.6643555Z mov.b32 {%rs913, %rs914}, %r1896; 2026-02-21T09:15:08.6643613Z mov.b32 {%rs915, %rs916}, %r1895; 2026-02-21T09:15:08.6643705Z ld.shared.v4.b32 {%r1899, %r1900, %r1901, %r1902}, [%r115+96]; 2026-02-21T09:15:08.6643772Z mov.b32 {%rs917, %rs918}, %r1902; 2026-02-21T09:15:08.6643831Z mov.b32 {%rs919, %rs920}, %r1901; 2026-02-21T09:15:08.6643890Z mov.b32 {%rs921, %rs922}, %r1900; 2026-02-21T09:15:08.6643955Z mov.b32 {%rs923, %rs924}, %r1899; 2026-02-21T09:15:08.6644054Z ld.shared.v4.b32 {%r1903, %r1904, %r1905, %r1906}, [%r115+112]; 2026-02-21T09:15:08.6644112Z mov.b32 {%rs925, %rs926}, %r1906; 2026-02-21T09:15:08.6644172Z mov.b32 {%rs927, %rs928}, %r1905; 2026-02-21T09:15:08.6644240Z mov.b32 {%rs929, %rs930}, %r1904; 2026-02-21T09:15:08.6644298Z mov.b32 {%rs931, %rs932}, %r1903; 2026-02-21T09:15:08.6644395Z ld.shared.v4.b32 {%r1907, %r1908, %r1909, %r1910}, [%r115+128]; 2026-02-21T09:15:08.6644461Z mov.b32 {%rs933, %rs934}, %r1910; 2026-02-21T09:15:08.6644519Z mov.b32 {%rs935, %rs936}, %r1909; 2026-02-21T09:15:08.6644577Z mov.b32 {%rs937, %rs938}, %r1908; 2026-02-21T09:15:08.6644642Z mov.b32 {%rs939, %rs940}, %r1907; 2026-02-21T09:15:08.6644736Z ld.shared.v4.b32 {%r1911, %r1912, %r1913, %r1914}, [%r115+144]; 2026-02-21T09:15:08.6644795Z mov.b32 {%rs941, %rs942}, %r1914; 2026-02-21T09:15:08.6644855Z mov.b32 {%rs943, %rs944}, %r1913; 2026-02-21T09:15:08.6644919Z mov.b32 {%rs945, %rs946}, %r1912; 2026-02-21T09:15:08.6644978Z mov.b32 {%rs947, %rs948}, %r1911; 2026-02-21T09:15:08.6645072Z ld.shared.v4.b32 {%r1915, %r1916, %r1917, %r1918}, [%r115+160]; 2026-02-21T09:15:08.6645138Z mov.b32 {%rs949, %rs950}, %r1918; 2026-02-21T09:15:08.6645202Z mov.b32 {%rs951, %rs952}, %r1917; 2026-02-21T09:15:08.6645264Z mov.b32 {%rs953, %rs954}, 
%r1916; 2026-02-21T09:15:08.6645333Z mov.b32 {%rs955, %rs956}, %r1915; 2026-02-21T09:15:08.6645432Z ld.shared.v4.b32 {%r1919, %r1920, %r1921, %r1922}, [%r115+176]; 2026-02-21T09:15:08.6645493Z mov.b32 {%rs957, %rs958}, %r1922; 2026-02-21T09:15:08.6645554Z mov.b32 {%rs959, %rs960}, %r1921; 2026-02-21T09:15:08.6645622Z mov.b32 {%rs961, %rs962}, %r1920; 2026-02-21T09:15:08.6645684Z mov.b32 {%rs963, %rs964}, %r1919; 2026-02-21T09:15:08.6645810Z ld.shared.v4.b32 {%r1923, %r1924, %r1925, %r1926}, [%r115+192]; 2026-02-21T09:15:08.6645878Z mov.b32 {%rs965, %rs966}, %r1926; 2026-02-21T09:15:08.6645939Z mov.b32 {%rs967, %rs968}, %r1925; 2026-02-21T09:15:08.6646000Z mov.b32 {%rs969, %rs970}, %r1924; 2026-02-21T09:15:08.6646060Z mov.b32 {%rs971, %rs972}, %r1923; 2026-02-21T09:15:08.6646167Z ld.shared.v4.b32 {%r1927, %r1928, %r1929, %r1930}, [%r115+208]; 2026-02-21T09:15:08.6646232Z mov.b32 {%rs973, %rs974}, %r1930; 2026-02-21T09:15:08.6646315Z mov.b32 {%rs975, %rs976}, %r1929; 2026-02-21T09:15:08.6646384Z mov.b32 {%rs977, %rs978}, %r1928; 2026-02-21T09:15:08.6646444Z mov.b32 {%rs979, %rs980}, %r1927; 2026-02-21T09:15:08.6646543Z ld.shared.v4.b32 {%r1931, %r1932, %r1933, %r1934}, [%r115+224]; 2026-02-21T09:15:08.6646610Z mov.b32 {%rs981, %rs982}, %r1934; 2026-02-21T09:15:08.6646670Z mov.b32 {%rs983, %rs984}, %r1933; 2026-02-21T09:15:08.6646730Z mov.b32 {%rs985, %rs986}, %r1932; 2026-02-21T09:15:08.6646793Z mov.b32 {%rs987, %rs988}, %r1931; 2026-02-21T09:15:08.6646899Z ld.shared.v4.b32 {%r1935, %r1936, %r1937, %r1938}, [%r115+240]; 2026-02-21T09:15:08.6646980Z mov.b32 {%rs989, %rs990}, %r1938; 2026-02-21T09:15:08.6647044Z mov.b32 {%rs991, %rs992}, %r1937; 2026-02-21T09:15:08.6647115Z mov.b32 {%rs993, %rs994}, %r1936; 2026-02-21T09:15:08.6647177Z mov.b32 {%rs995, %rs996}, %r1935; 2026-02-21T09:15:08.6647378Z .loc 1 52 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:52:32 2026-02-21T09:15:08.6647461Z cvt.f32.bf16 %r1747, %rs875; 2026-02-21T09:15:08.6647527Z 
cvt.f32.bf16 %r1748, %rs876; 2026-02-21T09:15:08.6647589Z cvt.f32.bf16 %r1749, %rs873; 2026-02-21T09:15:08.6647649Z cvt.f32.bf16 %r1750, %rs874; 2026-02-21T09:15:08.6647718Z cvt.f32.bf16 %r1751, %rs871; 2026-02-21T09:15:08.6647777Z cvt.f32.bf16 %r1752, %rs872; 2026-02-21T09:15:08.6647837Z cvt.f32.bf16 %r1753, %rs869; 2026-02-21T09:15:08.6647903Z cvt.f32.bf16 %r1754, %rs870; 2026-02-21T09:15:08.6647965Z cvt.f32.bf16 %r1755, %rs883; 2026-02-21T09:15:08.6648025Z cvt.f32.bf16 %r1756, %rs884; 2026-02-21T09:15:08.6648086Z cvt.f32.bf16 %r1757, %rs881; 2026-02-21T09:15:08.6648154Z cvt.f32.bf16 %r1758, %rs882; 2026-02-21T09:15:08.6648213Z cvt.f32.bf16 %r1759, %rs879; 2026-02-21T09:15:08.6648274Z cvt.f32.bf16 %r1760, %rs880; 2026-02-21T09:15:08.6648342Z cvt.f32.bf16 %r1761, %rs877; 2026-02-21T09:15:08.6648402Z cvt.f32.bf16 %r1762, %rs878; 2026-02-21T09:15:08.6648462Z cvt.f32.bf16 %r1763, %rs891; 2026-02-21T09:15:08.6648523Z cvt.f32.bf16 %r1764, %rs892; 2026-02-21T09:15:08.6648590Z cvt.f32.bf16 %r1765, %rs889; 2026-02-21T09:15:08.6648650Z cvt.f32.bf16 %r1766, %rs890; 2026-02-21T09:15:08.6648709Z cvt.f32.bf16 %r1767, %rs887; 2026-02-21T09:15:08.6648777Z cvt.f32.bf16 %r1768, %rs888; 2026-02-21T09:15:08.6648837Z cvt.f32.bf16 %r1769, %rs885; 2026-02-21T09:15:08.6648896Z cvt.f32.bf16 %r1770, %rs886; 2026-02-21T09:15:08.6648963Z cvt.f32.bf16 %r1771, %rs899; 2026-02-21T09:15:08.6649025Z cvt.f32.bf16 %r1772, %rs900; 2026-02-21T09:15:08.6649086Z cvt.f32.bf16 %r1773, %rs897; 2026-02-21T09:15:08.6649148Z cvt.f32.bf16 %r1774, %rs898; 2026-02-21T09:15:08.6649216Z cvt.f32.bf16 %r1775, %rs895; 2026-02-21T09:15:08.6649276Z cvt.f32.bf16 %r1776, %rs896; 2026-02-21T09:15:08.6649335Z cvt.f32.bf16 %r1777, %rs893; 2026-02-21T09:15:08.6649400Z cvt.f32.bf16 %r1778, %rs894; 2026-02-21T09:15:08.6649459Z cvt.f32.bf16 %r1779, %rs907; 2026-02-21T09:15:08.6649520Z cvt.f32.bf16 %r1780, %rs908; 2026-02-21T09:15:08.6649581Z cvt.f32.bf16 %r1781, %rs905; 2026-02-21T09:15:08.6649650Z cvt.f32.bf16 %r1782, 
%rs906; 2026-02-21T09:15:08.6649711Z cvt.f32.bf16 %r1783, %rs903; 2026-02-21T09:15:08.6649771Z cvt.f32.bf16 %r1784, %rs904; 2026-02-21T09:15:08.6649838Z cvt.f32.bf16 %r1785, %rs901; 2026-02-21T09:15:08.6649898Z cvt.f32.bf16 %r1786, %rs902; 2026-02-21T09:15:08.6649957Z cvt.f32.bf16 %r1787, %rs915; 2026-02-21T09:15:08.6650016Z cvt.f32.bf16 %r1788, %rs916; 2026-02-21T09:15:08.6650108Z cvt.f32.bf16 %r1789, %rs913; 2026-02-21T09:15:08.6650167Z cvt.f32.bf16 %r1790, %rs914; 2026-02-21T09:15:08.6650228Z cvt.f32.bf16 %r1791, %rs911; 2026-02-21T09:15:08.6650297Z cvt.f32.bf16 %r1792, %rs912; 2026-02-21T09:15:08.6650356Z cvt.f32.bf16 %r1793, %rs909; 2026-02-21T09:15:08.6650415Z cvt.f32.bf16 %r1794, %rs910; 2026-02-21T09:15:08.6650475Z cvt.f32.bf16 %r1795, %rs923; 2026-02-21T09:15:08.6650542Z cvt.f32.bf16 %r1796, %rs924; 2026-02-21T09:15:08.6650602Z cvt.f32.bf16 %r1797, %rs921; 2026-02-21T09:15:08.6650683Z cvt.f32.bf16 %r1798, %rs922; 2026-02-21T09:15:08.6650751Z cvt.f32.bf16 %r1799, %rs919; 2026-02-21T09:15:08.6650812Z cvt.f32.bf16 %r1800, %rs920; 2026-02-21T09:15:08.6650872Z cvt.f32.bf16 %r1801, %rs917; 2026-02-21T09:15:08.6650937Z cvt.f32.bf16 %r1802, %rs918; 2026-02-21T09:15:08.6650997Z cvt.f32.bf16 %r1803, %rs931; 2026-02-21T09:15:08.6651057Z cvt.f32.bf16 %r1804, %rs932; 2026-02-21T09:15:08.6651115Z cvt.f32.bf16 %r1805, %rs929; 2026-02-21T09:15:08.6651181Z cvt.f32.bf16 %r1806, %rs930; 2026-02-21T09:15:08.6651243Z cvt.f32.bf16 %r1807, %rs927; 2026-02-21T09:15:08.6651321Z cvt.f32.bf16 %r1808, %rs928; 2026-02-21T09:15:08.6651390Z cvt.f32.bf16 %r1809, %rs925; 2026-02-21T09:15:08.6651449Z cvt.f32.bf16 %r1810, %rs926; 2026-02-21T09:15:08.6651508Z cvt.f32.bf16 %r1811, %rs939; 2026-02-21T09:15:08.6651603Z cvt.f32.bf16 %r1812, %rs940; 2026-02-21T09:15:08.6651673Z cvt.f32.bf16 %r1813, %rs937; 2026-02-21T09:15:08.6651760Z cvt.f32.bf16 %r1814, %rs938; 2026-02-21T09:15:08.6651823Z cvt.f32.bf16 %r1815, %rs935; 2026-02-21T09:15:08.6651890Z cvt.f32.bf16 %r1816, %rs936; 
2026-02-21T09:15:08.6651949Z cvt.f32.bf16 %r1817, %rs933; 2026-02-21T09:15:08.6652009Z cvt.f32.bf16 %r1818, %rs934; 2026-02-21T09:15:08.6652069Z cvt.f32.bf16 %r1819, %rs947; 2026-02-21T09:15:08.6652135Z cvt.f32.bf16 %r1820, %rs948; 2026-02-21T09:15:08.6652196Z cvt.f32.bf16 %r1821, %rs945; 2026-02-21T09:15:08.6652256Z cvt.f32.bf16 %r1822, %rs946; 2026-02-21T09:15:08.6652324Z cvt.f32.bf16 %r1823, %rs943; 2026-02-21T09:15:08.6652384Z cvt.f32.bf16 %r1824, %rs944; 2026-02-21T09:15:08.6652445Z cvt.f32.bf16 %r1825, %rs941; 2026-02-21T09:15:08.6652513Z cvt.f32.bf16 %r1826, %rs942; 2026-02-21T09:15:08.6652573Z cvt.f32.bf16 %r1827, %rs955; 2026-02-21T09:15:08.6652633Z cvt.f32.bf16 %r1828, %rs956; 2026-02-21T09:15:08.6652693Z cvt.f32.bf16 %r1829, %rs953; 2026-02-21T09:15:08.6652761Z cvt.f32.bf16 %r1830, %rs954; 2026-02-21T09:15:08.6652823Z cvt.f32.bf16 %r1831, %rs951; 2026-02-21T09:15:08.6652886Z cvt.f32.bf16 %r1832, %rs952; 2026-02-21T09:15:08.6652955Z cvt.f32.bf16 %r1833, %rs949; 2026-02-21T09:15:08.6653015Z cvt.f32.bf16 %r1834, %rs950; 2026-02-21T09:15:08.6653076Z cvt.f32.bf16 %r1835, %rs963; 2026-02-21T09:15:08.6653137Z cvt.f32.bf16 %r1836, %rs964; 2026-02-21T09:15:08.6653204Z cvt.f32.bf16 %r1837, %rs961; 2026-02-21T09:15:08.6653264Z cvt.f32.bf16 %r1838, %rs962; 2026-02-21T09:15:08.6653323Z cvt.f32.bf16 %r1839, %rs959; 2026-02-21T09:15:08.6653392Z cvt.f32.bf16 %r1840, %rs960; 2026-02-21T09:15:08.6653455Z cvt.f32.bf16 %r1841, %rs957; 2026-02-21T09:15:08.6653518Z cvt.f32.bf16 %r1842, %rs958; 2026-02-21T09:15:08.6653581Z cvt.f32.bf16 %r1843, %rs971; 2026-02-21T09:15:08.6653661Z cvt.f32.bf16 %r1844, %rs972; 2026-02-21T09:15:08.6653720Z cvt.f32.bf16 %r1845, %rs969; 2026-02-21T09:15:08.6653778Z cvt.f32.bf16 %r1846, %rs970; 2026-02-21T09:15:08.6653842Z cvt.f32.bf16 %r1847, %rs967; 2026-02-21T09:15:08.6653900Z cvt.f32.bf16 %r1848, %rs968; 2026-02-21T09:15:08.6653959Z cvt.f32.bf16 %r1849, %rs965; 2026-02-21T09:15:08.6654016Z cvt.f32.bf16 %r1850, %rs966; 
2026-02-21T09:15:08.6654080Z cvt.f32.bf16 %r1851, %rs979; 2026-02-21T09:15:08.6654137Z cvt.f32.bf16 %r1852, %rs980; 2026-02-21T09:15:08.6654195Z cvt.f32.bf16 %r1853, %rs977; 2026-02-21T09:15:08.6654260Z cvt.f32.bf16 %r1854, %rs978; 2026-02-21T09:15:08.6654317Z cvt.f32.bf16 %r1855, %rs975; 2026-02-21T09:15:08.6654374Z cvt.f32.bf16 %r1856, %rs976; 2026-02-21T09:15:08.6654439Z cvt.f32.bf16 %r1857, %rs973; 2026-02-21T09:15:08.6654537Z cvt.f32.bf16 %r1858, %rs974; 2026-02-21T09:15:08.6654595Z cvt.f32.bf16 %r1859, %rs987; 2026-02-21T09:15:08.6654654Z cvt.f32.bf16 %r1860, %rs988; 2026-02-21T09:15:08.6654719Z cvt.f32.bf16 %r1861, %rs985; 2026-02-21T09:15:08.6654776Z cvt.f32.bf16 %r1862, %rs986; 2026-02-21T09:15:08.6654834Z cvt.f32.bf16 %r1863, %rs983; 2026-02-21T09:15:08.6654899Z cvt.f32.bf16 %r1864, %rs984; 2026-02-21T09:15:08.6654958Z cvt.f32.bf16 %r1865, %rs981; 2026-02-21T09:15:08.6655040Z cvt.f32.bf16 %r1866, %rs982; 2026-02-21T09:15:08.6655098Z cvt.f32.bf16 %r1867, %rs995; 2026-02-21T09:15:08.6655164Z cvt.f32.bf16 %r1868, %rs996; 2026-02-21T09:15:08.6655220Z cvt.f32.bf16 %r1869, %rs993; 2026-02-21T09:15:08.6655277Z cvt.f32.bf16 %r1870, %rs994; 2026-02-21T09:15:08.6655340Z cvt.f32.bf16 %r1871, %rs991; 2026-02-21T09:15:08.6655396Z cvt.f32.bf16 %r1872, %rs992; 2026-02-21T09:15:08.6655455Z cvt.f32.bf16 %r1873, %rs989; 2026-02-21T09:15:08.6655513Z cvt.f32.bf16 %r1874, %rs990; 2026-02-21T09:15:08.6655721Z .loc 1 54 87 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:54:87 2026-02-21T09:15:08.6655787Z ld.shared.b8 %rs997, [%r116]; 2026-02-21T09:15:08.6655851Z ld.shared.b8 %rs998, [%r116+128]; 2026-02-21T09:15:08.6655921Z ld.shared.b8 %rs999, [%r116+256]; 2026-02-21T09:15:08.6655986Z ld.shared.b8 %rs1000, [%r116+384]; 2026-02-21T09:15:08.6656049Z ld.shared.b8 %rs1001, [%r116+512]; 2026-02-21T09:15:08.6656138Z ld.shared.b8 %rs1002, [%r116+640]; 2026-02-21T09:15:08.6656204Z ld.shared.b8 %rs1003, [%r116+768]; 2026-02-21T09:15:08.6656264Z ld.shared.b8 %rs1004, 
[%r116+896]; 2026-02-21T09:15:08.6656331Z ld.shared.b8 %rs1005, [%r116+1024]; 2026-02-21T09:15:08.6656403Z ld.shared.b8 %rs1006, [%r116+1152]; 2026-02-21T09:15:08.6656467Z ld.shared.b8 %rs1007, [%r116+1280]; 2026-02-21T09:15:08.6656528Z ld.shared.b8 %rs1008, [%r116+1408]; 2026-02-21T09:15:08.6656596Z ld.shared.b8 %rs1009, [%r116+1536]; 2026-02-21T09:15:08.6656658Z ld.shared.b8 %rs1010, [%r116+1664]; 2026-02-21T09:15:08.6656719Z ld.shared.b8 %rs1011, [%r116+1792]; 2026-02-21T09:15:08.6656783Z ld.shared.b8 %rs1012, [%r116+1920]; 2026-02-21T09:15:08.6656850Z ld.shared.b8 %rs1013, [%r116+2048]; 2026-02-21T09:15:08.6656912Z ld.shared.b8 %rs1014, [%r116+2176]; 2026-02-21T09:15:08.6656973Z ld.shared.b8 %rs1015, [%r116+2304]; 2026-02-21T09:15:08.6657041Z ld.shared.b8 %rs1016, [%r116+2432]; 2026-02-21T09:15:08.6657103Z ld.shared.b8 %rs1017, [%r116+2560]; 2026-02-21T09:15:08.6657166Z ld.shared.b8 %rs1018, [%r116+2688]; 2026-02-21T09:15:08.6657234Z ld.shared.b8 %rs1019, [%r116+2816]; 2026-02-21T09:15:08.6657294Z ld.shared.b8 %rs1020, [%r116+2944]; 2026-02-21T09:15:08.6657354Z ld.shared.b8 %rs1021, [%r116+3072]; 2026-02-21T09:15:08.6657414Z ld.shared.b8 %rs1022, [%r116+3200]; 2026-02-21T09:15:08.6657482Z ld.shared.b8 %rs1023, [%r116+3328]; 2026-02-21T09:15:08.6657542Z ld.shared.b8 %rs1024, [%r116+3456]; 2026-02-21T09:15:08.6657603Z ld.shared.b8 %rs1025, [%r116+3584]; 2026-02-21T09:15:08.6657673Z ld.shared.b8 %rs1026, [%r116+3712]; 2026-02-21T09:15:08.6657734Z ld.shared.b8 %rs1027, [%r116+3840]; 2026-02-21T09:15:08.6657793Z ld.shared.b8 %rs1028, [%r116+3968]; 2026-02-21T09:15:08.6657967Z .loc 1 57 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:57:28 2026-02-21T09:15:08.6658027Z shl.b16 %rs1029, %rs997, 4; 2026-02-21T09:15:08.6658089Z shl.b16 %rs1030, %rs998, 4; 2026-02-21T09:15:08.6658149Z shl.b16 %rs1031, %rs999, 4; 2026-02-21T09:15:08.6658218Z shl.b16 %rs1032, %rs1000, 4; 2026-02-21T09:15:08.6658278Z shl.b16 %rs1033, %rs1001, 4; 2026-02-21T09:15:08.6658335Z 
shl.b16 %rs1034, %rs1002, 4; 2026-02-21T09:15:08.6658398Z shl.b16 %rs1035, %rs1003, 4; 2026-02-21T09:15:08.6658455Z shl.b16 %rs1036, %rs1004, 4; 2026-02-21T09:15:08.6658512Z shl.b16 %rs1037, %rs1005, 4; 2026-02-21T09:15:08.6658569Z shl.b16 %rs1038, %rs1006, 4; 2026-02-21T09:15:08.6658633Z shl.b16 %rs1039, %rs1007, 4; 2026-02-21T09:15:08.6658712Z shl.b16 %rs1040, %rs1008, 4; 2026-02-21T09:15:08.6658769Z shl.b16 %rs1041, %rs1009, 4; 2026-02-21T09:15:08.6658835Z shl.b16 %rs1042, %rs1010, 4; 2026-02-21T09:15:08.6658892Z shl.b16 %rs1043, %rs1011, 4; 2026-02-21T09:15:08.6658948Z shl.b16 %rs1044, %rs1012, 4; 2026-02-21T09:15:08.6659004Z shl.b16 %rs1045, %rs1013, 4; 2026-02-21T09:15:08.6659068Z shl.b16 %rs1046, %rs1014, 4; 2026-02-21T09:15:08.6659125Z shl.b16 %rs1047, %rs1015, 4; 2026-02-21T09:15:08.6659205Z shl.b16 %rs1048, %rs1016, 4; 2026-02-21T09:15:08.6659270Z shl.b16 %rs1049, %rs1017, 4; 2026-02-21T09:15:08.6659327Z shl.b16 %rs1050, %rs1018, 4; 2026-02-21T09:15:08.6659383Z shl.b16 %rs1051, %rs1019, 4; 2026-02-21T09:15:08.6659440Z shl.b16 %rs1052, %rs1020, 4; 2026-02-21T09:15:08.6659505Z shl.b16 %rs1053, %rs1021, 4; 2026-02-21T09:15:08.6659561Z shl.b16 %rs1054, %rs1022, 4; 2026-02-21T09:15:08.6659618Z shl.b16 %rs1055, %rs1023, 4; 2026-02-21T09:15:08.6659682Z shl.b16 %rs1056, %rs1024, 4; 2026-02-21T09:15:08.6659741Z shl.b16 %rs1057, %rs1025, 4; 2026-02-21T09:15:08.6659799Z shl.b16 %rs1058, %rs1026, 4; 2026-02-21T09:15:08.6659884Z shl.b16 %rs1059, %rs1027, 4; 2026-02-21T09:15:08.6659943Z shl.b16 %rs1060, %rs1028, 4; 2026-02-21T09:15:08.6660111Z .loc 1 72 58 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:72:58 2026-02-21T09:15:08.6660185Z selp.b16 %rs1061, %rs1029, %rs997, %p26; 2026-02-21T09:15:08.6660293Z cvt.s16.s8 %rs1062, %rs1061; 2026-02-21T09:15:08.6660354Z shr.s16 %rs1063, %rs1062, 4; 2026-02-21T09:15:08.6660426Z selp.b16 %rs1064, %rs1030, %rs998, %p26; 2026-02-21T09:15:08.6660494Z cvt.s16.s8 %rs1065, %rs1064; 2026-02-21T09:15:08.6660551Z shr.s16 
%rs1066, %rs1065, 4; 2026-02-21T09:15:08.6660620Z selp.b16 %rs1067, %rs1031, %rs999, %p26; 2026-02-21T09:15:08.6660678Z cvt.s16.s8 %rs1068, %rs1067; 2026-02-21T09:15:08.6660744Z shr.s16 %rs1069, %rs1068, 4; 2026-02-21T09:15:08.6660813Z selp.b16 %rs1070, %rs1032, %rs1000, %p26; 2026-02-21T09:15:08.6660872Z cvt.s16.s8 %rs1071, %rs1070; 2026-02-21T09:15:08.6660935Z shr.s16 %rs1072, %rs1071, 4; 2026-02-21T09:15:08.6661004Z selp.b16 %rs1073, %rs1033, %rs1001, %p26; 2026-02-21T09:15:08.6661062Z cvt.s16.s8 %rs1074, %rs1073; 2026-02-21T09:15:08.6661127Z shr.s16 %rs1075, %rs1074, 4; 2026-02-21T09:15:08.6661193Z selp.b16 %rs1076, %rs1034, %rs1002, %p26; 2026-02-21T09:15:08.6661250Z cvt.s16.s8 %rs1077, %rs1076; 2026-02-21T09:15:08.6661309Z shr.s16 %rs1078, %rs1077, 4; 2026-02-21T09:15:08.6661385Z selp.b16 %rs1079, %rs1035, %rs1003, %p26; 2026-02-21T09:15:08.6661444Z cvt.s16.s8 %rs1080, %rs1079; 2026-02-21T09:15:08.6661501Z shr.s16 %rs1081, %rs1080, 4; 2026-02-21T09:15:08.6661606Z selp.b16 %rs1082, %rs1036, %rs1004, %p26; 2026-02-21T09:15:08.6661665Z cvt.s16.s8 %rs1083, %rs1082; 2026-02-21T09:15:08.6661722Z shr.s16 %rs1084, %rs1083, 4; 2026-02-21T09:15:08.6661789Z selp.b16 %rs1085, %rs1037, %rs1005, %p26; 2026-02-21T09:15:08.6661855Z cvt.s16.s8 %rs1086, %rs1085; 2026-02-21T09:15:08.6661914Z shr.s16 %rs1087, %rs1086, 4; 2026-02-21T09:15:08.6661980Z selp.b16 %rs1088, %rs1038, %rs1006, %p26; 2026-02-21T09:15:08.6662048Z cvt.s16.s8 %rs1089, %rs1088; 2026-02-21T09:15:08.6662107Z shr.s16 %rs1090, %rs1089, 4; 2026-02-21T09:15:08.6662174Z selp.b16 %rs1091, %rs1039, %rs1007, %p26; 2026-02-21T09:15:08.6662240Z cvt.s16.s8 %rs1092, %rs1091; 2026-02-21T09:15:08.6662297Z shr.s16 %rs1093, %rs1092, 4; 2026-02-21T09:15:08.6662364Z selp.b16 %rs1094, %rs1040, %rs1008, %p26; 2026-02-21T09:15:08.6662424Z cvt.s16.s8 %rs1095, %rs1094; 2026-02-21T09:15:08.6662490Z shr.s16 %rs1096, %rs1095, 4; 2026-02-21T09:15:08.6662556Z selp.b16 %rs1097, %rs1041, %rs1009, %p26; 2026-02-21T09:15:08.6662614Z 
cvt.s16.s8 %rs1098, %rs1097; 2026-02-21T09:15:08.6662679Z shr.s16 %rs1099, %rs1098, 4; 2026-02-21T09:15:08.6662745Z selp.b16 %rs1100, %rs1042, %rs1010, %p26; 2026-02-21T09:15:08.6662804Z cvt.s16.s8 %rs1101, %rs1100; 2026-02-21T09:15:08.6662862Z shr.s16 %rs1102, %rs1101, 4; 2026-02-21T09:15:08.6662969Z selp.b16 %rs1103, %rs1043, %rs1011, %p26; 2026-02-21T09:15:08.6663027Z cvt.s16.s8 %rs1104, %rs1103; 2026-02-21T09:15:08.6663085Z shr.s16 %rs1105, %rs1104, 4; 2026-02-21T09:15:08.6663159Z selp.b16 %rs1106, %rs1044, %rs1012, %p26; 2026-02-21T09:15:08.6663217Z cvt.s16.s8 %rs1107, %rs1106; 2026-02-21T09:15:08.6663273Z shr.s16 %rs1108, %rs1107, 4; 2026-02-21T09:15:08.6663340Z selp.b16 %rs1109, %rs1045, %rs1013, %p26; 2026-02-21T09:15:08.6663406Z cvt.s16.s8 %rs1110, %rs1109; 2026-02-21T09:15:08.6663488Z shr.s16 %rs1111, %rs1110, 4; 2026-02-21T09:15:08.6663555Z selp.b16 %rs1112, %rs1046, %rs1014, %p26; 2026-02-21T09:15:08.6663619Z cvt.s16.s8 %rs1113, %rs1112; 2026-02-21T09:15:08.6663675Z shr.s16 %rs1114, %rs1113, 4; 2026-02-21T09:15:08.6663741Z selp.b16 %rs1115, %rs1047, %rs1015, %p26; 2026-02-21T09:15:08.6663806Z cvt.s16.s8 %rs1116, %rs1115; 2026-02-21T09:15:08.6663862Z shr.s16 %rs1117, %rs1116, 4; 2026-02-21T09:15:08.6663929Z selp.b16 %rs1118, %rs1048, %rs1016, %p26; 2026-02-21T09:15:08.6663988Z cvt.s16.s8 %rs1119, %rs1118; 2026-02-21T09:15:08.6664053Z shr.s16 %rs1120, %rs1119, 4; 2026-02-21T09:15:08.6664145Z selp.b16 %rs1121, %rs1049, %rs1017, %p26; 2026-02-21T09:15:08.6664203Z cvt.s16.s8 %rs1122, %rs1121; 2026-02-21T09:15:08.6664267Z shr.s16 %rs1123, %rs1122, 4; 2026-02-21T09:15:08.6664334Z selp.b16 %rs1124, %rs1050, %rs1018, %p26; 2026-02-21T09:15:08.6664392Z cvt.s16.s8 %rs1125, %rs1124; 2026-02-21T09:15:08.6664475Z shr.s16 %rs1126, %rs1125, 4; 2026-02-21T09:15:08.6664550Z selp.b16 %rs1127, %rs1051, %rs1019, %p26; 2026-02-21T09:15:08.6664608Z cvt.s16.s8 %rs1128, %rs1127; 2026-02-21T09:15:08.6664666Z shr.s16 %rs1129, %rs1128, 4; 2026-02-21T09:15:08.6664739Z selp.b16 
%rs1130, %rs1052, %rs1020, %p26; 2026-02-21T09:15:08.6664796Z cvt.s16.s8 %rs1131, %rs1130; 2026-02-21T09:15:08.6664854Z shr.s16 %rs1132, %rs1131, 4; 2026-02-21T09:15:08.6664927Z selp.b16 %rs1133, %rs1053, %rs1021, %p26; 2026-02-21T09:15:08.6664985Z cvt.s16.s8 %rs1134, %rs1133; 2026-02-21T09:15:08.6665044Z shr.s16 %rs1135, %rs1134, 4; 2026-02-21T09:15:08.6665109Z selp.b16 %rs1136, %rs1054, %rs1022, %p26; 2026-02-21T09:15:08.6665176Z cvt.s16.s8 %rs1137, %rs1136; 2026-02-21T09:15:08.6665233Z shr.s16 %rs1138, %rs1137, 4; 2026-02-21T09:15:08.6665299Z selp.b16 %rs1139, %rs1055, %rs1023, %p26; 2026-02-21T09:15:08.6665364Z cvt.s16.s8 %rs1140, %rs1139; 2026-02-21T09:15:08.6665421Z shr.s16 %rs1141, %rs1140, 4; 2026-02-21T09:15:08.6665488Z selp.b16 %rs1142, %rs1056, %rs1024, %p26; 2026-02-21T09:15:08.6665548Z cvt.s16.s8 %rs1143, %rs1142; 2026-02-21T09:15:08.6665612Z shr.s16 %rs1144, %rs1143, 4; 2026-02-21T09:15:08.6665677Z selp.b16 %rs1145, %rs1057, %rs1025, %p26; 2026-02-21T09:15:08.6665735Z cvt.s16.s8 %rs1146, %rs1145; 2026-02-21T09:15:08.6665799Z shr.s16 %rs1147, %rs1146, 4; 2026-02-21T09:15:08.6665864Z selp.b16 %rs1148, %rs1058, %rs1026, %p26; 2026-02-21T09:15:08.6665922Z cvt.s16.s8 %rs1149, %rs1148; 2026-02-21T09:15:08.6665980Z shr.s16 %rs1150, %rs1149, 4; 2026-02-21T09:15:08.6666055Z selp.b16 %rs1151, %rs1059, %rs1027, %p26; 2026-02-21T09:15:08.6666112Z cvt.s16.s8 %rs1152, %rs1151; 2026-02-21T09:15:08.6666171Z shr.s16 %rs1153, %rs1152, 4; 2026-02-21T09:15:08.6666246Z selp.b16 %rs1154, %rs1060, %rs1028, %p26; 2026-02-21T09:15:08.6666305Z cvt.s16.s8 %rs1155, %rs1154; 2026-02-21T09:15:08.6666363Z shr.s16 %rs1156, %rs1155, 4; 2026-02-21T09:15:08.6666536Z .loc 1 77 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:77:32 2026-02-21T09:15:08.6666603Z cvt.rn.f32.s16 %r1939, %rs1063; 2026-02-21T09:15:08.6666666Z cvt.rn.f32.s16 %r1940, %rs1066; 2026-02-21T09:15:08.6666727Z cvt.rn.f32.s16 %r1941, %rs1069; 2026-02-21T09:15:08.6666798Z cvt.rn.f32.s16 %r1942, %rs1072; 
2026-02-21T09:15:08.6666860Z cvt.rn.f32.s16 %r1943, %rs1075; 2026-02-21T09:15:08.6666920Z cvt.rn.f32.s16 %r1944, %rs1078; 2026-02-21T09:15:08.6666986Z cvt.rn.f32.s16 %r1945, %rs1081; 2026-02-21T09:15:08.6667045Z cvt.rn.f32.s16 %r1946, %rs1084; 2026-02-21T09:15:08.6667128Z cvt.rn.f32.s16 %r1947, %rs1087; 2026-02-21T09:15:08.6667187Z cvt.rn.f32.s16 %r1948, %rs1090; 2026-02-21T09:15:08.6667255Z cvt.rn.f32.s16 %r1949, %rs1093; 2026-02-21T09:15:08.6667313Z cvt.rn.f32.s16 %r1950, %rs1096; 2026-02-21T09:15:08.6667371Z cvt.rn.f32.s16 %r1951, %rs1099; 2026-02-21T09:15:08.6667437Z cvt.rn.f32.s16 %r1952, %rs1102; 2026-02-21T09:15:08.6667494Z cvt.rn.f32.s16 %r1953, %rs1105; 2026-02-21T09:15:08.6667553Z cvt.rn.f32.s16 %r1954, %rs1108; 2026-02-21T09:15:08.6667640Z cvt.rn.f32.s16 %r1955, %rs1111; 2026-02-21T09:15:08.6667698Z cvt.rn.f32.s16 %r1956, %rs1114; 2026-02-21T09:15:08.6667757Z cvt.rn.f32.s16 %r1957, %rs1117; 2026-02-21T09:15:08.6667814Z cvt.rn.f32.s16 %r1958, %rs1120; 2026-02-21T09:15:08.6667881Z cvt.rn.f32.s16 %r1959, %rs1123; 2026-02-21T09:15:08.6667940Z cvt.rn.f32.s16 %r1960, %rs1126; 2026-02-21T09:15:08.6667998Z cvt.rn.f32.s16 %r1961, %rs1129; 2026-02-21T09:15:08.6668064Z cvt.rn.f32.s16 %r1962, %rs1132; 2026-02-21T09:15:08.6668124Z cvt.rn.f32.s16 %r1963, %rs1135; 2026-02-21T09:15:08.6668182Z cvt.rn.f32.s16 %r1964, %rs1138; 2026-02-21T09:15:08.6668260Z cvt.rn.f32.s16 %r1965, %rs1141; 2026-02-21T09:15:08.6668328Z cvt.rn.f32.s16 %r1966, %rs1144; 2026-02-21T09:15:08.6668387Z cvt.rn.f32.s16 %r1967, %rs1147; 2026-02-21T09:15:08.6668445Z cvt.rn.f32.s16 %r1968, %rs1150; 2026-02-21T09:15:08.6668510Z cvt.rn.f32.s16 %r1969, %rs1153; 2026-02-21T09:15:08.6668600Z cvt.rn.f32.s16 %r1970, %rs1156; 2026-02-21T09:15:08.6668659Z bar.sync 2, 256; 2026-02-21T09:15:08.6668730Z st.shared.b32 [%r104], %r1939; 2026-02-21T09:15:08.6668796Z st.shared.b32 [%r104+4096], %r1943; 2026-02-21T09:15:08.6668860Z st.shared.b32 [%r104+8192], %r1947; 2026-02-21T09:15:08.6668925Z st.shared.b32 
[%r104+12288], %r1951; 2026-02-21T09:15:08.6668996Z st.shared.b32 [%r104+16384], %r1955; 2026-02-21T09:15:08.6669058Z st.shared.b32 [%r104+20480], %r1959; 2026-02-21T09:15:08.6669119Z st.shared.b32 [%r104+24576], %r1963; 2026-02-21T09:15:08.6669189Z st.shared.b32 [%r104+28672], %r1967; 2026-02-21T09:15:08.6669250Z st.shared.b32 [%r105], %r1940; 2026-02-21T09:15:08.6669314Z st.shared.b32 [%r105+4096], %r1944; 2026-02-21T09:15:08.6669375Z st.shared.b32 [%r105+8192], %r1948; 2026-02-21T09:15:08.6669443Z st.shared.b32 [%r105+12288], %r1952; 2026-02-21T09:15:08.6669504Z st.shared.b32 [%r105+16384], %r1956; 2026-02-21T09:15:08.6669565Z st.shared.b32 [%r105+20480], %r1960; 2026-02-21T09:15:08.6669633Z st.shared.b32 [%r105+24576], %r1964; 2026-02-21T09:15:08.6669695Z st.shared.b32 [%r105+28672], %r1968; 2026-02-21T09:15:08.6669755Z st.shared.b32 [%r106], %r1941; 2026-02-21T09:15:08.6669822Z st.shared.b32 [%r106+4096], %r1945; 2026-02-21T09:15:08.6669882Z st.shared.b32 [%r106+8192], %r1949; 2026-02-21T09:15:08.6669943Z st.shared.b32 [%r106+12288], %r1953; 2026-02-21T09:15:08.6670003Z st.shared.b32 [%r106+16384], %r1957; 2026-02-21T09:15:08.6670070Z st.shared.b32 [%r106+20480], %r1961; 2026-02-21T09:15:08.6670130Z st.shared.b32 [%r106+24576], %r1965; 2026-02-21T09:15:08.6670191Z st.shared.b32 [%r106+28672], %r1969; 2026-02-21T09:15:08.6670258Z st.shared.b32 [%r107], %r1942; 2026-02-21T09:15:08.6670318Z st.shared.b32 [%r107+4096], %r1946; 2026-02-21T09:15:08.6670377Z st.shared.b32 [%r107+8192], %r1950; 2026-02-21T09:15:08.6670438Z st.shared.b32 [%r107+12288], %r1954; 2026-02-21T09:15:08.6670504Z st.shared.b32 [%r107+16384], %r1958; 2026-02-21T09:15:08.6670565Z st.shared.b32 [%r107+20480], %r1962; 2026-02-21T09:15:08.6670626Z st.shared.b32 [%r107+24576], %r1966; 2026-02-21T09:15:08.6670694Z st.shared.b32 [%r107+28672], %r1970; 2026-02-21T09:15:08.6670869Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6670928Z add.s32 
%r1744, %r973, 368784; 2026-02-21T09:15:08.6670988Z $L__tmp26: 2026-02-21T09:15:08.6671211Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6671291Z // begin inline asm 2026-02-21T09:15:08.6671341Z 2026-02-21T09:15:08.6671398Z { 2026-02-21T09:15:08.6671462Z .reg .pred complete; 2026-02-21T09:15:08.6671517Z waitLoop: 2026-02-21T09:15:08.6671677Z mbarrier.try_wait.parity.shared.b64 complete, [%r1744], %r2370; 2026-02-21T09:15:08.6671744Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6671796Z } 2026-02-21T09:15:08.6671799Z 2026-02-21T09:15:08.6671856Z // end inline asm 2026-02-21T09:15:08.6671921Z // begin inline asm 2026-02-21T09:15:08.6673616Z @%p165 tcgen05.st.sync.aligned.32x32b.x128.b32 [%r1746 + 0], {%r1747, %r1748, %r1749, %r1750, %r1751, %r1752, %r1753, %r1754, %r1755, %r1756, %r1757, %r1758, %r1759, %r1760, %r1761, %r1762, %r1763, %r1764, %r1765, %r1766, %r1767, %r1768, %r1769, %r1770, %r1771, %r1772, %r1773, %r1774, %r1775, %r1776, %r1777, %r1778, %r1779, %r1780, %r1781, %r1782, %r1783, %r1784, %r1785, %r1786, %r1787, %r1788, %r1789, %r1790, %r1791, %r1792, %r1793, %r1794, %r1795, %r1796, %r1797, %r1798, %r1799, %r1800, %r1801, %r1802, %r1803, %r1804, %r1805, %r1806, %r1807, %r1808, %r1809, %r1810, %r1811, %r1812, %r1813, %r1814, %r1815, %r1816, %r1817, %r1818, %r1819, %r1820, %r1821, %r1822, %r1823, %r1824, %r1825, %r1826, %r1827, %r1828, %r1829, %r1830, %r1831, %r1832, %r1833, %r1834, %r1835, %r1836, %r1837, %r1838, %r1839, %r1840, %r1841, %r1842, %r1843, %r1844, %r1845, %r1846, %r1847, %r1848, %r1849, %r1850, %r1851, %r1852, %r1853, %r1854, %r1855, %r1856, %r1857, %r1858, %r1859, %r1860, %r1861, %r1862, %r1863, %r1864, %r1865, %r1866, %r1867, %r1868, %r1869, %r1870, %r1871, %r1872, %r1873, %r1874}; 2026-02-21T09:15:08.6673685Z // end inline asm 2026-02-21T09:15:08.6673742Z // begin inline asm 2026-02-21T09:15:08.6673816Z tcgen05.wait::st.sync.aligned; 
2026-02-21T09:15:08.6673882Z // end inline asm 2026-02-21T09:15:08.6673938Z bar.sync 2, 256; 2026-02-21T09:15:08.6673995Z // begin inline asm 2026-02-21T09:15:08.6674076Z fence.proxy.async.shared::cta; 2026-02-21T09:15:08.6674131Z // end inline asm 2026-02-21T09:15:08.6674187Z bar.sync 2, 256; 2026-02-21T09:15:08.6674245Z @%p27 bra $L__BB0_15; 2026-02-21T09:15:08.6674308Z $L__tmp27: 2026-02-21T09:15:08.6674408Z // %bb.14: // in Loop: Header=BB0_7 Depth=2 2026-02-21T09:15:08.6674578Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6674646Z add.s32 %r2073, %r973, 368800; 2026-02-21T09:15:08.6674701Z $L__tmp28: 2026-02-21T09:15:08.6674918Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6674988Z add.s32 %r1974, %r120, %r7; 2026-02-21T09:15:08.6675053Z elect.sync %r2074|%p234, -1; 2026-02-21T09:15:08.6675114Z mov.b32 %r1976, 134744336; 2026-02-21T09:15:08.6675177Z mov.pred %p233, -1; 2026-02-21T09:15:08.6675244Z // begin inline asm 2026-02-21T09:15:08.6675407Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 0 ], %rd340, %r1976, %p233; 2026-02-21T09:15:08.6675471Z // end inline asm 2026-02-21T09:15:08.6675536Z // begin inline asm 2026-02-21T09:15:08.6675693Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 8 ], %rd341, %r1976, %p233; 2026-02-21T09:15:08.6675748Z // end inline asm 2026-02-21T09:15:08.6675812Z // begin inline asm 2026-02-21T09:15:08.6675963Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 16 ], %rd342, %r1976, %p233; 2026-02-21T09:15:08.6676019Z // end inline asm 2026-02-21T09:15:08.6676076Z // begin inline asm 2026-02-21T09:15:08.6676236Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 24 ], %rd343, %r1976, %p233; 2026-02-21T09:15:08.6676290Z // end inline asm 2026-02-21T09:15:08.6676346Z // begin inline asm 2026-02-21T09:15:08.6676501Z @%p234 
tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 32 ], %rd344, %r1976, %p233; 2026-02-21T09:15:08.6676555Z // end inline asm 2026-02-21T09:15:08.6676612Z // begin inline asm 2026-02-21T09:15:08.6676790Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 40 ], %rd345, %r1976, %p233; 2026-02-21T09:15:08.6676848Z // end inline asm 2026-02-21T09:15:08.6676904Z // begin inline asm 2026-02-21T09:15:08.6677056Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 48 ], %rd346, %r1976, %p233; 2026-02-21T09:15:08.6677110Z // end inline asm 2026-02-21T09:15:08.6677166Z // begin inline asm 2026-02-21T09:15:08.6677317Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 56 ], %rd347, %r1976, %p233; 2026-02-21T09:15:08.6677398Z // end inline asm 2026-02-21T09:15:08.6677455Z // begin inline asm 2026-02-21T09:15:08.6677599Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 64 ], %rd348, %r1976, %p233; 2026-02-21T09:15:08.6677661Z // end inline asm 2026-02-21T09:15:08.6677718Z // begin inline asm 2026-02-21T09:15:08.6677868Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 72 ], %rd349, %r1976, %p233; 2026-02-21T09:15:08.6677932Z // end inline asm 2026-02-21T09:15:08.6677988Z // begin inline asm 2026-02-21T09:15:08.6678153Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 80 ], %rd350, %r1976, %p233; 2026-02-21T09:15:08.6678217Z // end inline asm 2026-02-21T09:15:08.6678275Z // begin inline asm 2026-02-21T09:15:08.6678419Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 88 ], %rd351, %r1976, %p233; 2026-02-21T09:15:08.6678496Z // end inline asm 2026-02-21T09:15:08.6678563Z // begin inline asm 2026-02-21T09:15:08.6678707Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 96 ], %rd352, %r1976, %p233; 2026-02-21T09:15:08.6678763Z // end inline asm 2026-02-21T09:15:08.6678825Z // begin inline asm 
2026-02-21T09:15:08.6678976Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 104 ], %rd353, %r1976, %p233; 2026-02-21T09:15:08.6679032Z // end inline asm 2026-02-21T09:15:08.6679095Z // begin inline asm 2026-02-21T09:15:08.6679245Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 112 ], %rd354, %r1976, %p233; 2026-02-21T09:15:08.6679301Z // end inline asm 2026-02-21T09:15:08.6679364Z // begin inline asm 2026-02-21T09:15:08.6679512Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 120 ], %rd355, %r1976, %p233; 2026-02-21T09:15:08.6679567Z // end inline asm 2026-02-21T09:15:08.6679623Z // begin inline asm 2026-02-21T09:15:08.6679777Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 128 ], %rd356, %r1976, %p233; 2026-02-21T09:15:08.6679833Z // end inline asm 2026-02-21T09:15:08.6679888Z // begin inline asm 2026-02-21T09:15:08.6680047Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 136 ], %rd357, %r1976, %p233; 2026-02-21T09:15:08.6680102Z // end inline asm 2026-02-21T09:15:08.6680156Z // begin inline asm 2026-02-21T09:15:08.6680309Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 144 ], %rd358, %r1976, %p233; 2026-02-21T09:15:08.6680365Z // end inline asm 2026-02-21T09:15:08.6680420Z // begin inline asm 2026-02-21T09:15:08.6680576Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 152 ], %rd359, %r1976, %p233; 2026-02-21T09:15:08.6680630Z // end inline asm 2026-02-21T09:15:08.6680685Z // begin inline asm 2026-02-21T09:15:08.6680833Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 160 ], %rd360, %r1976, %p233; 2026-02-21T09:15:08.6680896Z // end inline asm 2026-02-21T09:15:08.6680952Z // begin inline asm 2026-02-21T09:15:08.6681098Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 168 ], %rd361, %r1976, %p233; 2026-02-21T09:15:08.6681158Z // end inline asm 
2026-02-21T09:15:08.6681213Z // begin inline asm 2026-02-21T09:15:08.6681361Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 176 ], %rd362, %r1976, %p233; 2026-02-21T09:15:08.6681422Z // end inline asm 2026-02-21T09:15:08.6681477Z // begin inline asm 2026-02-21T09:15:08.6681677Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 184 ], %rd363, %r1976, %p233; 2026-02-21T09:15:08.6681740Z // end inline asm 2026-02-21T09:15:08.6681795Z // begin inline asm 2026-02-21T09:15:08.6681946Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 192 ], %rd364, %r1976, %p233; 2026-02-21T09:15:08.6682001Z // end inline asm 2026-02-21T09:15:08.6682064Z // begin inline asm 2026-02-21T09:15:08.6682215Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 200 ], %rd365, %r1976, %p233; 2026-02-21T09:15:08.6682296Z // end inline asm 2026-02-21T09:15:08.6682360Z // begin inline asm 2026-02-21T09:15:08.6682507Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 208 ], %rd366, %r1976, %p233; 2026-02-21T09:15:08.6682562Z // end inline asm 2026-02-21T09:15:08.6682623Z // begin inline asm 2026-02-21T09:15:08.6682771Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 216 ], %rd367, %r1976, %p233; 2026-02-21T09:15:08.6682827Z // end inline asm 2026-02-21T09:15:08.6682890Z // begin inline asm 2026-02-21T09:15:08.6683063Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 224 ], %rd368, %r1976, %p233; 2026-02-21T09:15:08.6683120Z // end inline asm 2026-02-21T09:15:08.6683177Z // begin inline asm 2026-02-21T09:15:08.6683333Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 232 ], %rd369, %r1976, %p233; 2026-02-21T09:15:08.6683414Z // end inline asm 2026-02-21T09:15:08.6683473Z // begin inline asm 2026-02-21T09:15:08.6683631Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 240 ], %rd370, %r1976, %p233; 
2026-02-21T09:15:08.6683687Z // end inline asm 2026-02-21T09:15:08.6683743Z // begin inline asm 2026-02-21T09:15:08.6683899Z @%p234 tcgen05.mma.cta_group::1.kind::tf32 [ %r1974 + 0 ], [ %r1975 + 248 ], %rd371, %r1976, %p233; 2026-02-21T09:15:08.6683954Z // end inline asm 2026-02-21T09:15:08.6684020Z cvt.u64.u32 %rd372, %r2073; 2026-02-21T09:15:08.6684078Z // begin inline asm 2026-02-21T09:15:08.6684217Z @%p234 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd372]; 2026-02-21T09:15:08.6684274Z // end inline asm 2026-02-21T09:15:08.6684334Z bra.uni $L__BB0_15; 2026-02-21T09:15:08.6684394Z $L__tmp29: 2026-02-21T09:15:08.6684496Z $L__BB0_26: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6684672Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6684753Z ld.shared.b32 %r160, [global_smem+24]; 2026-02-21T09:15:08.6684809Z barrier.sync 1; 2026-02-21T09:15:08.6684977Z .loc 1 19 46 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:46 2026-02-21T09:15:08.6685045Z mov.u32 %r2387, %ctaid.x; 2026-02-21T09:15:08.6685215Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6685280Z setp.gt.u32 %p3, %r2387, 8191; 2026-02-21T09:15:08.6685340Z @%p3 bra $L__BB0_29; 2026-02-21T09:15:08.6685425Z // %bb.27: // %.lr.ph 2026-02-21T09:15:08.6685516Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6685688Z .loc 1 0 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:0:112 2026-02-21T09:15:08.6685755Z and.b32 %r202, %r1, 31; 2026-02-21T09:15:08.6685816Z or.b32 %r162, %r2, 134217696; 2026-02-21T09:15:08.6685876Z shl.b32 %r203, %r162, 5; 2026-02-21T09:15:08.6685943Z or.b32 %r163, %r203, %r202; 2026-02-21T09:15:08.6686001Z and.b32 %r164, %r163, 252; 2026-02-21T09:15:08.6686062Z bfe.u32 %r165, %r163, 2, 6; 2026-02-21T09:15:08.6686120Z or.b32 %r166, %r165, 64; 2026-02-21T09:15:08.6686186Z and.b32 %r167, %r1, 3; 2026-02-21T09:15:08.6686243Z shl.b32 %r168, 
%r167, 3; 2026-02-21T09:15:08.6686301Z and.b32 %r206, %r163, 127; 2026-02-21T09:15:08.6686364Z shl.b32 %r207, %r206, 7; 2026-02-21T09:15:08.6686446Z shr.u32 %r208, %r1, 1; 2026-02-21T09:15:08.6686505Z and.b32 %r209, %r208, 64; 2026-02-21T09:15:08.6686568Z add.s32 %r211, %r200, %r207; 2026-02-21T09:15:08.6686637Z add.s32 %r212, %r211, %r209; 2026-02-21T09:15:08.6686697Z add.s32 %r169, %r212, 327680; 2026-02-21T09:15:08.6686755Z shl.b32 %r213, %r1, 10; 2026-02-21T09:15:08.6686820Z and.b32 %r214, %r213, 6144; 2026-02-21T09:15:08.6686877Z shl.b32 %r215, %r206, 4; 2026-02-21T09:15:08.6686936Z xor.b32 %r216, %r215, %r209; 2026-02-21T09:15:08.6687015Z or.b32 %r217, %r216, %r214; 2026-02-21T09:15:08.6687081Z add.s32 %r218, %r200, 344064; 2026-02-21T09:15:08.6687140Z add.s32 %r170, %r218, %r217; 2026-02-21T09:15:08.6687198Z xor.b32 %r219, %r217, 32; 2026-02-21T09:15:08.6687262Z add.s32 %r171, %r218, %r219; 2026-02-21T09:15:08.6687322Z shl.b32 %r220, %r1, 8; 2026-02-21T09:15:08.6687380Z and.b32 %r221, %r220, 6144; 2026-02-21T09:15:08.6687439Z shl.b32 %r222, %r167, 5; 2026-02-21T09:15:08.6687506Z shl.b32 %r223, %r164, 2; 2026-02-21T09:15:08.6687567Z or.b32 %r224, %r221, %r222; 2026-02-21T09:15:08.6687626Z xor.b32 %r225, %r223, %r224; 2026-02-21T09:15:08.6687730Z add.s32 %r172, %r218, %r225; 2026-02-21T09:15:08.6687913Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6687973Z and.b32 %r173, %r2387, 15; 2026-02-21T09:15:08.6688040Z shl.b32 %r2383, %r2387, 1; 2026-02-21T09:15:08.6688118Z mov.b32 %r2385, 0; 2026-02-21T09:15:08.6688178Z mov.b32 %r2384, 1; 2026-02-21T09:15:08.6688239Z mov.b32 %r2386, %r2385; 2026-02-21T09:15:08.6688350Z $L__BB0_28: // Parent Loop BB0_2 Depth=1 2026-02-21T09:15:08.6688447Z // => This Inner Loop Header: Depth=2 2026-02-21T09:15:08.6688622Z .loc 1 26 33 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:26:33 2026-02-21T09:15:08.6688690Z shr.u32 %r274, %r2387, 8; 2026-02-21T09:15:08.6688749Z 
and.b32 %r275, %r274, 16; 2026-02-21T09:15:08.6688921Z .loc 1 28 30 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:28:30 2026-02-21T09:15:08.6688988Z or.b32 %r276, %r275, %r173; 2026-02-21T09:15:08.6689158Z .loc 1 30 27 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:30:27 2026-02-21T09:15:08.6689218Z shl.b32 %r277, %r276, 7; 2026-02-21T09:15:08.6689386Z .loc 1 31 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:31:32 2026-02-21T09:15:08.6689454Z or.b32 %r278, %r277, %r165; 2026-02-21T09:15:08.6689514Z or.b32 %r279, %r277, %r166; 2026-02-21T09:15:08.6689683Z .loc 1 32 27 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:32:27 2026-02-21T09:15:08.6689752Z and.b32 %r280, %r2383, 8160; 2026-02-21T09:15:08.6689919Z .loc 1 33 32 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:33:32 2026-02-21T09:15:08.6689980Z or.b32 %r281, %r280, %r168; 2026-02-21T09:15:08.6690044Z $L__tmp30: 2026-02-21T09:15:08.6690273Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6690333Z shl.b32 %r282, %r2386, 5; 2026-02-21T09:15:08.6690393Z add.s32 %r283, %r282, %r160; 2026-02-21T09:15:08.6690461Z xor.b32 %r2384, %r2384, 1; 2026-02-21T09:15:08.6690521Z add.s32 %r226, %r200, 368896; 2026-02-21T09:15:08.6690581Z // begin inline asm 2026-02-21T09:15:08.6690644Z 2026-02-21T09:15:08.6690697Z { 2026-02-21T09:15:08.6690762Z .reg .pred complete; 2026-02-21T09:15:08.6690818Z waitLoop: 2026-02-21T09:15:08.6690955Z mbarrier.try_wait.parity.shared.b64 complete, [%r226], %r2384; 2026-02-21T09:15:08.6691024Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6691077Z } 2026-02-21T09:15:08.6691081Z 2026-02-21T09:15:08.6691147Z // end inline asm 2026-02-21T09:15:08.6691246Z ld.shared.v4.b32 {%r230, %r231, %r232, %r233}, [%r169]; 2026-02-21T09:15:08.6691351Z ld.shared.v4.b32 {%r234, %r235, %r236, %r237}, [%r169+16]; 2026-02-21T09:15:08.6691480Z ld.shared.v4.b32 {%r238, %r239, %r240, 
%r241}, [%r169+32]; 2026-02-21T09:15:08.6691605Z ld.shared.v4.b32 {%r242, %r243, %r244, %r245}, [%r169+48]; 2026-02-21T09:15:08.6691667Z bar.sync 6, 256; 2026-02-21T09:15:08.6691731Z add.s32 %r228, %r200, 368880; 2026-02-21T09:15:08.6691802Z mov.pred %p4, 0; 2026-02-21T09:15:08.6691863Z // begin inline asm 2026-02-21T09:15:08.6691959Z @%p4 mbarrier.arrive.shared::cta.b64 _, [%r228]; 2026-02-21T09:15:08.6692053Z // end inline asm 2026-02-21T09:15:08.6692135Z shfl.sync.idx.b32 %r285, %r162, 0, 31, -1; 2026-02-21T09:15:08.6692198Z shl.b32 %r286, %r285, 21; 2026-02-21T09:15:08.6692272Z and.b32 %r287, %r286, 6291456; 2026-02-21T09:15:08.6692335Z add.s32 %r288, %r283, %r287; 2026-02-21T09:15:08.6692399Z shl.b32 %r289, %r285, 2; 2026-02-21T09:15:08.6692463Z and.b32 %r290, %r289, 16; 2026-02-21T09:15:08.6692534Z add.s32 %r229, %r288, %r290; 2026-02-21T09:15:08.6692599Z mov.pred %p5, -1; 2026-02-21T09:15:08.6692661Z // begin inline asm 2026-02-21T09:15:08.6692991Z @%p5 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r229 + 0], {%r230, %r231, %r232, %r233, %r234, %r235, %r236, %r237, %r238, %r239, %r240, %r241, %r242, %r243, %r244, %r245}; 2026-02-21T09:15:08.6693053Z // end inline asm 2026-02-21T09:15:08.6693115Z // begin inline asm 2026-02-21T09:15:08.6693190Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:15:08.6693282Z // end inline asm 2026-02-21T09:15:08.6693342Z bar.sync 6, 256; 2026-02-21T09:15:08.6693398Z $L__tmp31: 2026-02-21T09:15:08.6693588Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6693649Z shl.b32 %r291, %r2386, 3; 2026-02-21T09:15:08.6693710Z add.s32 %r292, %r200, %r291; 2026-02-21T09:15:08.6693777Z add.s32 %r246, %r292, 368784; 2026-02-21T09:15:08.6693832Z $L__tmp32: 2026-02-21T09:15:08.6694056Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6694117Z bar.sync 6, 256; 2026-02-21T09:15:08.6694186Z // begin inline asm 
2026-02-21T09:15:08.6694281Z @%p4 mbarrier.arrive.shared::cta.b64 _, [%r246]; 2026-02-21T09:15:08.6694339Z // end inline asm 2026-02-21T09:15:08.6694400Z $L__tmp33: 2026-02-21T09:15:08.6694576Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6694638Z add.s32 %r247, %r292, 368800; 2026-02-21T09:15:08.6694693Z $L__tmp34: 2026-02-21T09:15:08.6694919Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6694977Z bar.sync 6, 256; 2026-02-21T09:15:08.6695035Z // begin inline asm 2026-02-21T09:15:08.6695097Z 2026-02-21T09:15:08.6695150Z { 2026-02-21T09:15:08.6695214Z .reg .pred complete; 2026-02-21T09:15:08.6695278Z waitLoop: 2026-02-21T09:15:08.6695403Z mbarrier.try_wait.parity.shared.b64 complete, [%r247], %r2385; 2026-02-21T09:15:08.6695474Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6695537Z } 2026-02-21T09:15:08.6695542Z 2026-02-21T09:15:08.6695606Z // end inline asm 2026-02-21T09:15:08.6695665Z add.s32 %r293, %r2386, 1; 2026-02-21T09:15:08.6695728Z setp.eq.b32 %p7, %r293, 2; 2026-02-21T09:15:08.6695799Z selp.b32 %r2386, 0, %r293, %p7; 2026-02-21T09:15:08.6695860Z selp.b32 %r294, 1, 0, %p7; 2026-02-21T09:15:08.6695922Z xor.b32 %r2385, %r2385, %r294; 2026-02-21T09:15:08.6695976Z $L__tmp35: 2026-02-21T09:15:08.6696154Z .loc 1 88 43 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:88:43 2026-02-21T09:15:08.6696212Z shl.b32 %r295, %r278, 13; 2026-02-21T09:15:08.6696270Z shl.b32 %r296, %r279, 13; 2026-02-21T09:15:08.6696439Z .loc 1 88 50 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:88:50 2026-02-21T09:15:08.6696497Z or.b32 %r297, %r295, %r281; 2026-02-21T09:15:08.6696554Z or.b32 %r298, %r296, %r281; 2026-02-21T09:15:08.6696823Z .loc 1 88 22 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:88:22 2026-02-21T09:15:08.6696893Z mad.wide.u32 %rd36, %r297, 2, %rd35; 2026-02-21T09:15:08.6696960Z mad.wide.u32 %rd37, 
%r298, 2, %rd35; 2026-02-21T09:15:08.6697011Z $L__tmp36: 2026-02-21T09:15:08.6697237Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6697297Z // begin inline asm 2026-02-21T09:15:08.6697589Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r249, %r250, %r251, %r252, %r253, %r254, %r255, %r256, %r257, %r258, %r259, %r260, %r261, %r262, %r263, %r264}, [%r229 + 0]; 2026-02-21T09:15:08.6697654Z // end inline asm 2026-02-21T09:15:08.6697711Z // begin inline asm 2026-02-21T09:15:08.6697780Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:15:08.6697842Z // end inline asm 2026-02-21T09:15:08.6697903Z cvt.u64.u32 %rd38, %r249; 2026-02-21T09:15:08.6697960Z cvt.u64.u32 %rd39, %r250; 2026-02-21T09:15:08.6698021Z shl.b64 %rd40, %rd39, 32; 2026-02-21T09:15:08.6698088Z or.b64 %rd41, %rd38, %rd40; 2026-02-21T09:15:08.6698165Z $L__tmp37: 2026-02-21T09:15:08.6698338Z .loc 1 87 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:87:28 2026-02-21T09:15:08.6698408Z mov.b64 {%r299, %r300}, %rd41; 2026-02-21T09:15:08.6698479Z cvt.rn.bf16x2.f32 %r301, %r300, %r299; 2026-02-21T09:15:08.6698532Z $L__tmp38: 2026-02-21T09:15:08.6698772Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6698832Z cvt.u64.u32 %rd42, %r251; 2026-02-21T09:15:08.6698889Z cvt.u64.u32 %rd43, %r252; 2026-02-21T09:15:08.6698947Z shl.b64 %rd44, %rd43, 32; 2026-02-21T09:15:08.6699011Z or.b64 %rd45, %rd42, %rd44; 2026-02-21T09:15:08.6699062Z $L__tmp39: 2026-02-21T09:15:08.6699229Z .loc 1 87 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:87:28 2026-02-21T09:15:08.6699298Z mov.b64 {%r302, %r303}, %rd45; 2026-02-21T09:15:08.6699367Z cvt.rn.bf16x2.f32 %r304, %r303, %r302; 2026-02-21T09:15:08.6699421Z $L__tmp40: 2026-02-21T09:15:08.6699634Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 
2026-02-21T09:15:08.6699698Z cvt.u64.u32 %rd46, %r253; 2026-02-21T09:15:08.6699755Z cvt.u64.u32 %rd47, %r254; 2026-02-21T09:15:08.6699814Z shl.b64 %rd48, %rd47, 32; 2026-02-21T09:15:08.6699880Z or.b64 %rd49, %rd46, %rd48; 2026-02-21T09:15:08.6699933Z $L__tmp41: 2026-02-21T09:15:08.6700099Z .loc 1 87 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:87:28 2026-02-21T09:15:08.6700165Z mov.b64 {%r305, %r306}, %rd49; 2026-02-21T09:15:08.6700231Z cvt.rn.bf16x2.f32 %r307, %r306, %r305; 2026-02-21T09:15:08.6700283Z $L__tmp42: 2026-02-21T09:15:08.6700490Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6700557Z cvt.u64.u32 %rd50, %r255; 2026-02-21T09:15:08.6700614Z cvt.u64.u32 %rd51, %r256; 2026-02-21T09:15:08.6700672Z shl.b64 %rd52, %rd51, 32; 2026-02-21T09:15:08.6700738Z or.b64 %rd53, %rd50, %rd52; 2026-02-21T09:15:08.6700791Z $L__tmp43: 2026-02-21T09:15:08.6700959Z .loc 1 87 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:87:28 2026-02-21T09:15:08.6701029Z mov.b64 {%r308, %r309}, %rd53; 2026-02-21T09:15:08.6701097Z cvt.rn.bf16x2.f32 %r310, %r309, %r308; 2026-02-21T09:15:08.6701151Z $L__tmp44: 2026-02-21T09:15:08.6701365Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6701436Z cvt.u64.u32 %rd54, %r257; 2026-02-21T09:15:08.6701494Z cvt.u64.u32 %rd55, %r258; 2026-02-21T09:15:08.6701581Z shl.b64 %rd56, %rd55, 32; 2026-02-21T09:15:08.6701648Z or.b64 %rd57, %rd54, %rd56; 2026-02-21T09:15:08.6701734Z $L__tmp45: 2026-02-21T09:15:08.6701903Z .loc 1 87 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:87:28 2026-02-21T09:15:08.6701969Z mov.b64 {%r311, %r312}, %rd57; 2026-02-21T09:15:08.6702034Z cvt.rn.bf16x2.f32 %r313, %r312, %r311; 2026-02-21T09:15:08.6702087Z $L__tmp46: 2026-02-21T09:15:08.6702296Z .loc 2 291 36 // standard.py:291:36 @[ 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6702386Z cvt.u64.u32 %rd58, %r259; 2026-02-21T09:15:08.6702443Z cvt.u64.u32 %rd59, %r260; 2026-02-21T09:15:08.6702499Z shl.b64 %rd60, %rd59, 32; 2026-02-21T09:15:08.6702567Z or.b64 %rd61, %rd58, %rd60; 2026-02-21T09:15:08.6702620Z $L__tmp47: 2026-02-21T09:15:08.6702785Z .loc 1 87 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:87:28 2026-02-21T09:15:08.6702845Z mov.b64 {%r314, %r315}, %rd61; 2026-02-21T09:15:08.6702918Z cvt.rn.bf16x2.f32 %r316, %r315, %r314; 2026-02-21T09:15:08.6702970Z $L__tmp48: 2026-02-21T09:15:08.6703206Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6703275Z cvt.u64.u32 %rd62, %r261; 2026-02-21T09:15:08.6703333Z cvt.u64.u32 %rd63, %r262; 2026-02-21T09:15:08.6703389Z shl.b64 %rd64, %rd63, 32; 2026-02-21T09:15:08.6703456Z or.b64 %rd65, %rd62, %rd64; 2026-02-21T09:15:08.6703508Z $L__tmp49: 2026-02-21T09:15:08.6703709Z .loc 1 87 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:87:28 2026-02-21T09:15:08.6703771Z mov.b64 {%r317, %r318}, %rd65; 2026-02-21T09:15:08.6703845Z cvt.rn.bf16x2.f32 %r319, %r318, %r317; 2026-02-21T09:15:08.6703897Z $L__tmp50: 2026-02-21T09:15:08.6704105Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6704173Z cvt.u64.u32 %rd66, %r263; 2026-02-21T09:15:08.6704231Z cvt.u64.u32 %rd67, %r264; 2026-02-21T09:15:08.6704290Z shl.b64 %rd68, %rd67, 32; 2026-02-21T09:15:08.6704355Z or.b64 %rd69, %rd66, %rd68; 2026-02-21T09:15:08.6704410Z $L__tmp51: 2026-02-21T09:15:08.6704582Z .loc 1 87 28 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:87:28 2026-02-21T09:15:08.6704641Z mov.b64 {%r320, %r321}, %rd69; 2026-02-21T09:15:08.6704714Z cvt.rn.bf16x2.f32 %r322, %r321, %r320; 2026-02-21T09:15:08.6704878Z .loc 1 88 81 // 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:88:81 2026-02-21T09:15:08.6704975Z st.shared.v4.b32 [%r170], {%r301, %r304, %r307, %r310}; 2026-02-21T09:15:08.6705075Z st.shared.v4.b32 [%r171], {%r313, %r316, %r319, %r322}; 2026-02-21T09:15:08.6705132Z bar.sync 6, 256; 2026-02-21T09:15:08.6705232Z ld.shared.v4.b32 {%r270, %r271, %r272, %r273}, [%r172+1024]; 2026-02-21T09:15:08.6705327Z ld.shared.v4.b32 {%r266, %r267, %r268, %r269}, [%r172]; 2026-02-21T09:15:08.6705385Z // begin inline asm 2026-02-21T09:15:08.6705486Z st.global.v4.b32 [ %rd36 + 0 ], { %r266, %r267, %r268, %r269 }; 2026-02-21T09:15:08.6705542Z // end inline asm 2026-02-21T09:15:08.6705608Z // begin inline asm 2026-02-21T09:15:08.6705706Z st.global.v4.b32 [ %rd37 + 0 ], { %r270, %r271, %r272, %r273 }; 2026-02-21T09:15:08.6705761Z // end inline asm 2026-02-21T09:15:08.6705941Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6706004Z add.s32 %r183, %r2387, 2368; 2026-02-21T09:15:08.6706065Z add.s32 %r2383, %r2383, 4736; 2026-02-21T09:15:08.6706127Z setp.lt.u32 %p8, %r2387, 5824; 2026-02-21T09:15:08.6706192Z mov.b32 %r2387, %r183; 2026-02-21T09:15:08.6706250Z @%p8 bra $L__BB0_28; 2026-02-21T09:15:08.6706308Z bra.uni $L__BB0_29; 2026-02-21T09:15:08.6706417Z $L__BB0_18: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6706587Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6706679Z ld.shared.b32 %r126, [global_smem+8]; 2026-02-21T09:15:08.6706744Z barrier.sync 1; 2026-02-21T09:15:08.6706909Z .loc 1 19 46 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:46 2026-02-21T09:15:08.6706969Z mov.u32 %r127, %ctaid.x; 2026-02-21T09:15:08.6707138Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6707209Z setp.gt.u32 %p16, %r127, 8191; 2026-02-21T09:15:08.6707299Z @%p16 bra $L__BB0_21; 2026-02-21T09:15:08.6707382Z // %bb.19: // %.lr.ph994 
2026-02-21T09:15:08.6707481Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6707650Z .loc 1 0 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:0:112 2026-02-21T09:15:08.6707711Z add.s32 %r128, %r1, -512; 2026-02-21T09:15:08.6707778Z shr.u32 %r129, %r128, 5; 2026-02-21T09:15:08.6707835Z shl.b32 %r390, %r1, 7; 2026-02-21T09:15:08.6707894Z and.b32 %r391, %r390, 16256; 2026-02-21T09:15:08.6707953Z shr.u32 %r392, %r1, 1; 2026-02-21T09:15:08.6708037Z and.b32 %r393, %r392, 64; 2026-02-21T09:15:08.6708098Z or.b32 %r394, %r391, %r393; 2026-02-21T09:15:08.6708158Z add.s32 %r396, %r200, %r394; 2026-02-21T09:15:08.6708225Z add.s32 %r130, %r396, 294912; 2026-02-21T09:15:08.6708283Z add.s32 %r131, %r396, 311296; 2026-02-21T09:15:08.6708473Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6708541Z add.s32 %r2373, %r127, -2368; 2026-02-21T09:15:08.6708597Z mov.b32 %r2375, 1; 2026-02-21T09:15:08.6708652Z mov.b32 %r2374, 0; 2026-02-21T09:15:08.6708712Z mov.b32 %r2376, %r2374; 2026-02-21T09:15:08.6708778Z mov.b32 %r2377, %r2374; 2026-02-21T09:15:08.6708879Z $L__BB0_20: // Parent Loop BB0_2 Depth=1 2026-02-21T09:15:08.6708972Z // => This Inner Loop Header: Depth=2 2026-02-21T09:15:08.6709148Z .loc 1 0 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:0:112 2026-02-21T09:15:08.6709212Z setp.eq.b32 %p17, %r128, 0; 2026-02-21T09:15:08.6709265Z $L__tmp52: 2026-02-21T09:15:08.6709486Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6709545Z shl.b32 %r440, %r2377, 5; 2026-02-21T09:15:08.6709603Z add.s32 %r441, %r440, %r126; 2026-02-21T09:15:08.6709664Z xor.b32 %r2375, %r2375, 1; 2026-02-21T09:15:08.6709731Z bar.sync 4, 256; 2026-02-21T09:15:08.6709791Z add.s32 %r397, %r200, 368832; 2026-02-21T09:15:08.6709847Z // begin inline asm 2026-02-21T09:15:08.6709915Z 2026-02-21T09:15:08.6709966Z { 2026-02-21T09:15:08.6710027Z .reg .pred 
complete; 2026-02-21T09:15:08.6710082Z waitLoop: 2026-02-21T09:15:08.6710208Z mbarrier.try_wait.parity.shared.b64 complete, [%r397], %r2375; 2026-02-21T09:15:08.6710272Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6710323Z } 2026-02-21T09:15:08.6710327Z 2026-02-21T09:15:08.6710388Z // end inline asm 2026-02-21T09:15:08.6710479Z ld.shared.v4.b32 {%r401, %r402, %r403, %r404}, [%r130]; 2026-02-21T09:15:08.6710573Z ld.shared.v4.b32 {%r405, %r406, %r407, %r408}, [%r130+16]; 2026-02-21T09:15:08.6710672Z ld.shared.v4.b32 {%r409, %r410, %r411, %r412}, [%r130+32]; 2026-02-21T09:15:08.6710763Z ld.shared.v4.b32 {%r413, %r414, %r415, %r416}, [%r130+48]; 2026-02-21T09:15:08.6710820Z bar.sync 4, 256; 2026-02-21T09:15:08.6710880Z add.s32 %r399, %r200, 368816; 2026-02-21T09:15:08.6710944Z // begin inline asm 2026-02-21T09:15:08.6711034Z @%p17 mbarrier.arrive.shared::cta.b64 _, [%r399]; 2026-02-21T09:15:08.6711088Z // end inline asm 2026-02-21T09:15:08.6711170Z shfl.sync.idx.b32 %r443, %r129, 0, 31, -1; 2026-02-21T09:15:08.6711229Z shl.b32 %r444, %r443, 21; 2026-02-21T09:15:08.6711290Z and.b32 %r445, %r444, 6291456; 2026-02-21T09:15:08.6711349Z add.s32 %r446, %r441, %r445; 2026-02-21T09:15:08.6711439Z shl.b32 %r447, %r443, 2; 2026-02-21T09:15:08.6711496Z and.b32 %r448, %r447, 16; 2026-02-21T09:15:08.6711588Z add.s32 %r400, %r446, %r448; 2026-02-21T09:15:08.6711656Z mov.pred %p18, -1; 2026-02-21T09:15:08.6711713Z // begin inline asm 2026-02-21T09:15:08.6711996Z @%p18 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r400 + 0], {%r401, %r402, %r403, %r404, %r405, %r406, %r407, %r408, %r409, %r410, %r411, %r412, %r413, %r414, %r415, %r416}; 2026-02-21T09:15:08.6712086Z // end inline asm 2026-02-21T09:15:08.6712144Z // begin inline asm 2026-02-21T09:15:08.6712212Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:15:08.6712265Z // end inline asm 2026-02-21T09:15:08.6712327Z bar.sync 4, 256; 2026-02-21T09:15:08.6712380Z $L__tmp53: 2026-02-21T09:15:08.6712552Z .loc 1 19 112 // 
cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6712618Z shl.b32 %r449, %r2377, 3; 2026-02-21T09:15:08.6712675Z add.s32 %r450, %r200, %r449; 2026-02-21T09:15:08.6712735Z add.s32 %r417, %r450, 368720; 2026-02-21T09:15:08.6712788Z $L__tmp54: 2026-02-21T09:15:08.6713036Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6713093Z bar.sync 4, 256; 2026-02-21T09:15:08.6713150Z // begin inline asm 2026-02-21T09:15:08.6713248Z @%p17 mbarrier.arrive.shared::cta.b64 _, [%r417]; 2026-02-21T09:15:08.6713328Z // end inline asm 2026-02-21T09:15:08.6713385Z $L__tmp55: 2026-02-21T09:15:08.6713569Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6713628Z add.s32 %r418, %r450, 368736; 2026-02-21T09:15:08.6713680Z $L__tmp56: 2026-02-21T09:15:08.6713888Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6713950Z bar.sync 4, 256; 2026-02-21T09:15:08.6714007Z // begin inline asm 2026-02-21T09:15:08.6714059Z 2026-02-21T09:15:08.6714115Z { 2026-02-21T09:15:08.6714175Z .reg .pred complete; 2026-02-21T09:15:08.6714228Z waitLoop: 2026-02-21T09:15:08.6714348Z mbarrier.try_wait.parity.shared.b64 complete, [%r418], %r2376; 2026-02-21T09:15:08.6714420Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6714470Z } 2026-02-21T09:15:08.6714473Z 2026-02-21T09:15:08.6714526Z // end inline asm 2026-02-21T09:15:08.6714592Z xor.b32 %r2374, %r2374, 1; 2026-02-21T09:15:08.6714651Z add.s32 %r420, %r200, 368848; 2026-02-21T09:15:08.6714707Z // begin inline asm 2026-02-21T09:15:08.6714764Z 2026-02-21T09:15:08.6714813Z { 2026-02-21T09:15:08.6714873Z .reg .pred complete; 2026-02-21T09:15:08.6714926Z waitLoop: 2026-02-21T09:15:08.6715046Z mbarrier.try_wait.parity.shared.b64 complete, [%r420], %r2374; 2026-02-21T09:15:08.6715110Z @!complete bra.uni waitLoop; 
2026-02-21T09:15:08.6715160Z } 2026-02-21T09:15:08.6715163Z 2026-02-21T09:15:08.6715224Z // end inline asm 2026-02-21T09:15:08.6715279Z // begin inline asm 2026-02-21T09:15:08.6715545Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r422, %r423, %r424, %r425, %r426, %r427, %r428, %r429, %r430, %r431, %r432, %r433, %r434, %r435, %r436, %r437}, [%r400 + 0]; 2026-02-21T09:15:08.6715601Z // end inline asm 2026-02-21T09:15:08.6715664Z // begin inline asm 2026-02-21T09:15:08.6715731Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:15:08.6715784Z // end inline asm 2026-02-21T09:15:08.6715883Z st.shared.v4.b32 [%r131], {%r422, %r423, %r424, %r425}; 2026-02-21T09:15:08.6715978Z st.shared.v4.b32 [%r131+16], {%r426, %r427, %r428, %r429}; 2026-02-21T09:15:08.6716071Z st.shared.v4.b32 [%r131+32], {%r430, %r431, %r432, %r433}; 2026-02-21T09:15:08.6716167Z st.shared.v4.b32 [%r131+48], {%r434, %r435, %r436, %r437}; 2026-02-21T09:15:08.6716223Z bar.sync 4, 256; 2026-02-21T09:15:08.6716281Z add.s32 %r439, %r200, 368864; 2026-02-21T09:15:08.6716337Z // begin inline asm 2026-02-21T09:15:08.6716434Z @%p17 mbarrier.arrive.shared::cta.b64 _, [%r439]; 2026-02-21T09:15:08.6716515Z // end inline asm 2026-02-21T09:15:08.6716573Z add.s32 %r451, %r2377, 1; 2026-02-21T09:15:08.6716644Z setp.eq.b32 %p21, %r451, 2; 2026-02-21T09:15:08.6716711Z selp.b32 %r2377, 0, %r451, %p21; 2026-02-21T09:15:08.6716771Z selp.b32 %r452, 1, 0, %p21; 2026-02-21T09:15:08.6716831Z xor.b32 %r2376, %r2376, %r452; 2026-02-21T09:15:08.6716891Z $L__tmp57: 2026-02-21T09:15:08.6717067Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6717150Z add.s32 %r2373, %r2373, 2368; 2026-02-21T09:15:08.6717222Z setp.lt.u32 %p22, %r2373, 5824; 2026-02-21T09:15:08.6717282Z @%p22 bra $L__BB0_20; 2026-02-21T09:15:08.6717371Z $L__BB0_21: // %._crit_edge995 2026-02-21T09:15:08.6717473Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6717531Z barrier.sync 1; 2026-02-21T09:15:08.6717591Z bra.uni 
$L__BB0_2; 2026-02-21T09:15:08.6717689Z $L__BB0_22: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6717900Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6717974Z ld.shared.b32 %r143, [global_smem+16]; 2026-02-21T09:15:08.6718032Z barrier.sync 1; 2026-02-21T09:15:08.6718207Z .loc 1 19 46 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:46 2026-02-21T09:15:08.6718285Z mov.u32 %r144, %ctaid.x; 2026-02-21T09:15:08.6718457Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6718528Z setp.gt.u32 %p9, %r144, 8191; 2026-02-21T09:15:08.6718585Z @%p9 bra $L__BB0_25; 2026-02-21T09:15:08.6718667Z // %bb.23: // %.lr.ph991 2026-02-21T09:15:08.6718754Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6718932Z .loc 1 0 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:0:112 2026-02-21T09:15:08.6718993Z add.s32 %r145, %r1, -768; 2026-02-21T09:15:08.6719052Z shr.u32 %r146, %r145, 5; 2026-02-21T09:15:08.6719120Z shl.b32 %r325, %r1, 7; 2026-02-21T09:15:08.6719179Z and.b32 %r326, %r325, 16256; 2026-02-21T09:15:08.6719236Z shr.u32 %r327, %r1, 1; 2026-02-21T09:15:08.6719300Z and.b32 %r328, %r327, 64; 2026-02-21T09:15:08.6719359Z or.b32 %r329, %r326, %r328; 2026-02-21T09:15:08.6719419Z add.s32 %r331, %r200, %r329; 2026-02-21T09:15:08.6719479Z add.s32 %r147, %r331, 311296; 2026-02-21T09:15:08.6719545Z add.s32 %r148, %r331, 327680; 2026-02-21T09:15:08.6719715Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6719773Z add.s32 %r2378, %r144, -2368; 2026-02-21T09:15:08.6719836Z mov.b32 %r2380, 1; 2026-02-21T09:15:08.6719891Z mov.b32 %r2379, 0; 2026-02-21T09:15:08.6719949Z mov.b32 %r2381, %r2379; 2026-02-21T09:15:08.6720009Z mov.b32 %r2382, %r2379; 2026-02-21T09:15:08.6720115Z $L__BB0_24: // Parent Loop BB0_2 Depth=1 2026-02-21T09:15:08.6720209Z // => This Inner Loop Header: Depth=2 
2026-02-21T09:15:08.6720378Z .loc 1 0 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:0:112 2026-02-21T09:15:08.6720448Z setp.eq.b32 %p10, %r145, 0; 2026-02-21T09:15:08.6720501Z $L__tmp58: 2026-02-21T09:15:08.6720719Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6720787Z shl.b32 %r375, %r2382, 5; 2026-02-21T09:15:08.6720845Z add.s32 %r376, %r375, %r143; 2026-02-21T09:15:08.6720904Z xor.b32 %r2380, %r2380, 1; 2026-02-21T09:15:08.6720960Z bar.sync 5, 256; 2026-02-21T09:15:08.6721028Z add.s32 %r332, %r200, 368864; 2026-02-21T09:15:08.6721086Z // begin inline asm 2026-02-21T09:15:08.6721137Z 2026-02-21T09:15:08.6721194Z { 2026-02-21T09:15:08.6721253Z .reg .pred complete; 2026-02-21T09:15:08.6721328Z waitLoop: 2026-02-21T09:15:08.6721448Z mbarrier.try_wait.parity.shared.b64 complete, [%r332], %r2380; 2026-02-21T09:15:08.6721520Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6721596Z } 2026-02-21T09:15:08.6721600Z 2026-02-21T09:15:08.6721656Z // end inline asm 2026-02-21T09:15:08.6721756Z ld.shared.v4.b32 {%r336, %r337, %r338, %r339}, [%r147]; 2026-02-21T09:15:08.6721850Z ld.shared.v4.b32 {%r340, %r341, %r342, %r343}, [%r147+16]; 2026-02-21T09:15:08.6721973Z ld.shared.v4.b32 {%r344, %r345, %r346, %r347}, [%r147+32]; 2026-02-21T09:15:08.6722070Z ld.shared.v4.b32 {%r348, %r349, %r350, %r351}, [%r147+48]; 2026-02-21T09:15:08.6722126Z bar.sync 5, 256; 2026-02-21T09:15:08.6722184Z add.s32 %r334, %r200, 368848; 2026-02-21T09:15:08.6722241Z // begin inline asm 2026-02-21T09:15:08.6722336Z @%p10 mbarrier.arrive.shared::cta.b64 _, [%r334]; 2026-02-21T09:15:08.6722392Z // end inline asm 2026-02-21T09:15:08.6722467Z shfl.sync.idx.b32 %r378, %r146, 0, 31, -1; 2026-02-21T09:15:08.6722535Z shl.b32 %r379, %r378, 21; 2026-02-21T09:15:08.6722593Z and.b32 %r380, %r379, 6291456; 2026-02-21T09:15:08.6722681Z add.s32 %r381, %r376, %r380; 2026-02-21T09:15:08.6722746Z shl.b32 %r382, %r378, 2; 
2026-02-21T09:15:08.6722803Z and.b32 %r383, %r382, 16; 2026-02-21T09:15:08.6722861Z add.s32 %r335, %r381, %r383; 2026-02-21T09:15:08.6722921Z mov.pred %p11, -1; 2026-02-21T09:15:08.6722984Z // begin inline asm 2026-02-21T09:15:08.6723287Z @%p11 tcgen05.st.sync.aligned.32x32b.x16.b32 [%r335 + 0], {%r336, %r337, %r338, %r339, %r340, %r341, %r342, %r343, %r344, %r345, %r346, %r347, %r348, %r349, %r350, %r351}; 2026-02-21T09:15:08.6723346Z // end inline asm 2026-02-21T09:15:08.6723411Z // begin inline asm 2026-02-21T09:15:08.6723478Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:15:08.6723532Z // end inline asm 2026-02-21T09:15:08.6723587Z bar.sync 5, 256; 2026-02-21T09:15:08.6723645Z $L__tmp59: 2026-02-21T09:15:08.6723817Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6723876Z shl.b32 %r384, %r2382, 3; 2026-02-21T09:15:08.6723943Z add.s32 %r385, %r200, %r384; 2026-02-21T09:15:08.6724000Z add.s32 %r352, %r385, 368752; 2026-02-21T09:15:08.6724053Z $L__tmp60: 2026-02-21T09:15:08.6724274Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6724331Z bar.sync 5, 256; 2026-02-21T09:15:08.6724388Z // begin inline asm 2026-02-21T09:15:08.6724475Z @%p10 mbarrier.arrive.shared::cta.b64 _, [%r352]; 2026-02-21T09:15:08.6724538Z // end inline asm 2026-02-21T09:15:08.6724591Z $L__tmp61: 2026-02-21T09:15:08.6724760Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6724827Z add.s32 %r353, %r385, 368768; 2026-02-21T09:15:08.6724879Z $L__tmp62: 2026-02-21T09:15:08.6725096Z .loc 2 291 36 // standard.py:291:36 @[ cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:84:40 ] 2026-02-21T09:15:08.6725159Z bar.sync 5, 256; 2026-02-21T09:15:08.6725216Z // begin inline asm 2026-02-21T09:15:08.6725269Z 2026-02-21T09:15:08.6725320Z { 2026-02-21T09:15:08.6725388Z .reg .pred complete; 2026-02-21T09:15:08.6725443Z waitLoop: 
2026-02-21T09:15:08.6725561Z mbarrier.try_wait.parity.shared.b64 complete, [%r353], %r2381; 2026-02-21T09:15:08.6725636Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6725691Z } 2026-02-21T09:15:08.6725696Z 2026-02-21T09:15:08.6725752Z // end inline asm 2026-02-21T09:15:08.6725812Z xor.b32 %r2379, %r2379, 1; 2026-02-21T09:15:08.6725882Z add.s32 %r355, %r200, 368880; 2026-02-21T09:15:08.6725939Z // begin inline asm 2026-02-21T09:15:08.6725991Z 2026-02-21T09:15:08.6726048Z { 2026-02-21T09:15:08.6726108Z .reg .pred complete; 2026-02-21T09:15:08.6726163Z waitLoop: 2026-02-21T09:15:08.6726276Z mbarrier.try_wait.parity.shared.b64 complete, [%r355], %r2379; 2026-02-21T09:15:08.6726348Z @!complete bra.uni waitLoop; 2026-02-21T09:15:08.6726438Z } 2026-02-21T09:15:08.6726442Z 2026-02-21T09:15:08.6726497Z // end inline asm 2026-02-21T09:15:08.6726563Z // begin inline asm 2026-02-21T09:15:08.6726829Z tcgen05.ld.sync.aligned.32x32b.x16.b32 {%r357, %r358, %r359, %r360, %r361, %r362, %r363, %r364, %r365, %r366, %r367, %r368, %r369, %r370, %r371, %r372}, [%r335 + 0]; 2026-02-21T09:15:08.6726884Z // end inline asm 2026-02-21T09:15:08.6726949Z // begin inline asm 2026-02-21T09:15:08.6727038Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:15:08.6727093Z // end inline asm 2026-02-21T09:15:08.6727184Z st.shared.v4.b32 [%r148], {%r357, %r358, %r359, %r360}; 2026-02-21T09:15:08.6727284Z st.shared.v4.b32 [%r148+16], {%r361, %r362, %r363, %r364}; 2026-02-21T09:15:08.6727377Z st.shared.v4.b32 [%r148+32], {%r365, %r366, %r367, %r368}; 2026-02-21T09:15:08.6727465Z st.shared.v4.b32 [%r148+48], {%r369, %r370, %r371, %r372}; 2026-02-21T09:15:08.6727529Z bar.sync 5, 256; 2026-02-21T09:15:08.6727587Z add.s32 %r374, %r200, 368896; 2026-02-21T09:15:08.6727646Z // begin inline asm 2026-02-21T09:15:08.6727757Z @%p10 mbarrier.arrive.shared::cta.b64 _, [%r374]; 2026-02-21T09:15:08.6727814Z // end inline asm 2026-02-21T09:15:08.6727872Z add.s32 %r386, %r2382, 1; 2026-02-21T09:15:08.6727934Z setp.eq.b32 
%p14, %r386, 2; 2026-02-21T09:15:08.6728008Z selp.b32 %r2382, 0, %r386, %p14; 2026-02-21T09:15:08.6728069Z selp.b32 %r387, 1, 0, %p14; 2026-02-21T09:15:08.6728155Z xor.b32 %r2381, %r2381, %r387; 2026-02-21T09:15:08.6728217Z $L__tmp63: 2026-02-21T09:15:08.6728389Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6728447Z add.s32 %r2378, %r2378, 2368; 2026-02-21T09:15:08.6728511Z setp.lt.u32 %p15, %r2378, 5824; 2026-02-21T09:15:08.6728577Z @%p15 bra $L__BB0_24; 2026-02-21T09:15:08.6728664Z $L__BB0_25: // %._crit_edge992 2026-02-21T09:15:08.6728752Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6728818Z barrier.sync 1; 2026-02-21T09:15:08.6728874Z bra.uni $L__BB0_2; 2026-02-21T09:15:08.6728971Z $L__BB0_4: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6729141Z .loc 1 14 0 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:14 2026-02-21T09:15:08.6729200Z barrier.sync 1; 2026-02-21T09:15:08.6729257Z barrier.sync 1; 2026-02-21T09:15:08.6729314Z bra.uni $L__BB0_2; 2026-02-21T09:15:08.6729416Z $L__BB0_17: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6729586Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6729642Z barrier.sync 1; 2026-02-21T09:15:08.6729706Z barrier.sync 1; 2026-02-21T09:15:08.6729761Z bra.uni $L__BB0_2; 2026-02-21T09:15:08.6729851Z $L__BB0_30: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6730025Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6730081Z barrier.sync 1; 2026-02-21T09:15:08.6730137Z barrier.sync 1; 2026-02-21T09:15:08.6730192Z bra.uni $L__BB0_2; 2026-02-21T09:15:08.6730288Z $L__BB0_31: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6730454Z .loc 1 19 112 // cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py:19:112 2026-02-21T09:15:08.6730513Z barrier.sync 1; 2026-02-21T09:15:08.6730575Z barrier.sync 1; 2026-02-21T09:15:08.6730629Z bra.uni 
$L__BB0_2; 2026-02-21T09:15:08.6730711Z $L__BB0_16: // %._crit_edge998 2026-02-21T09:15:08.6730801Z // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:15:08.6730866Z cp.async.wait_group 0; 2026-02-21T09:15:08.6730921Z bar.sync 2, 256; 2026-02-21T09:15:08.6730975Z barrier.sync 1; 2026-02-21T09:15:08.6731036Z bra.uni $L__BB0_2; 2026-02-21T09:15:08.6731111Z $L__tmp64: 2026-02-21T09:15:08.6731167Z $L__func_end0: 2026-02-21T09:15:08.6731254Z // -- End function 2026-02-21T09:15:08.6731306Z } 2026-02-21T09:15:08.6731529Z .file 1 "/tmp/torchinductor_root/lj/cljbvfpefbejlzywn7wgfsl63kwsek3ypeidw4byyn77ymmismtc.py" 2026-02-21T09:15:08.6733291Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:15:08.6733361Z .section .debug_abbrev 2026-02-21T09:15:08.6733427Z { 2026-02-21T09:15:08.6737049Z .b8 1 // Abbreviation Code 2026-02-21T09:15:08.6737144Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:15:08.6737229Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:15:08.6737323Z .b8 37 // DW_AT_producer 2026-02-21T09:15:08.6737404Z .b8 8 // DW_FORM_string 2026-02-21T09:15:08.6737482Z .b8 19 // DW_AT_language 2026-02-21T09:15:08.6737571Z .b8 5 // DW_FORM_data2 2026-02-21T09:15:08.6737683Z .b8 3 // DW_AT_name 2026-02-21T09:15:08.6737766Z .b8 8 // DW_FORM_string 2026-02-21T09:15:08.6737846Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:15:08.6737959Z .b8 6 // DW_FORM_data4 2026-02-21T09:15:08.6738070Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:15:08.6738159Z .b8 8 // DW_FORM_string 2026-02-21T09:15:08.6738234Z .b8 0 // EOM(1) 2026-02-21T09:15:08.6738315Z .b8 0 // EOM(2) 2026-02-21T09:15:08.6738402Z .b8 2 // Abbreviation Code 2026-02-21T09:15:08.6738487Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:15:08.6738574Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:15:08.6738652Z .b8 3 // DW_AT_name 2026-02-21T09:15:08.6738729Z .b8 8 // DW_FORM_string 2026-02-21T09:15:08.6738819Z .b8 32 // DW_AT_inline 2026-02-21T09:15:08.6738900Z .b8 11 // DW_FORM_data1 
2026-02-21T09:15:08.6738973Z .b8 0 // EOM(1) 2026-02-21T09:15:08.6739045Z .b8 0 // EOM(2) 2026-02-21T09:15:08.6739137Z .b8 3 // Abbreviation Code 2026-02-21T09:15:08.6739219Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:15:08.6739300Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:15:08.6739398Z .b8 17 // DW_AT_low_pc 2026-02-21T09:15:08.6739471Z .b8 1 // DW_FORM_addr 2026-02-21T09:15:08.6739547Z .b8 18 // DW_AT_high_pc 2026-02-21T09:15:08.6739625Z .b8 1 // DW_FORM_addr 2026-02-21T09:15:08.6739712Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:15:08.6739783Z .b8 19 // DW_FORM_ref4 2026-02-21T09:15:08.6739858Z .b8 0 // EOM(1) 2026-02-21T09:15:08.6739925Z .b8 0 // EOM(2) 2026-02-21T09:15:08.6740004Z .b8 4 // Abbreviation Code 2026-02-21T09:15:08.6740098Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:15:08.6740182Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:15:08.6740265Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:15:08.6740337Z .b8 19 // DW_FORM_ref4 2026-02-21T09:15:08.6740417Z .b8 17 // DW_AT_low_pc 2026-02-21T09:15:08.6740487Z .b8 1 // DW_FORM_addr 2026-02-21T09:15:08.6740563Z .b8 18 // DW_AT_high_pc 2026-02-21T09:15:08.6740642Z .b8 1 // DW_FORM_addr 2026-02-21T09:15:08.6740719Z .b8 88 // DW_AT_call_file 2026-02-21T09:15:08.6740791Z .b8 11 // DW_FORM_data1 2026-02-21T09:15:08.6740941Z .b8 89 // DW_AT_call_line 2026-02-21T09:15:08.6741022Z .b8 11 // DW_FORM_data1 2026-02-21T09:15:08.6741131Z .b8 87 // DW_AT_call_column 2026-02-21T09:15:08.6741205Z .b8 11 // DW_FORM_data1 2026-02-21T09:15:08.6741281Z .b8 0 // EOM(1) 2026-02-21T09:15:08.6741349Z .b8 0 // EOM(2) 2026-02-21T09:15:08.6741416Z .b8 0 // EOM(3) 2026-02-21T09:15:08.6741475Z } 2026-02-21T09:15:08.6741603Z .section .debug_info 2026-02-21T09:15:08.6741657Z { 2026-02-21T09:15:08.6741767Z .b32 178 // Length of Unit 2026-02-21T09:15:08.6741865Z .b8 2 // DWARF version number 2026-02-21T09:15:08.6741920Z .b8 0 2026-02-21T09:15:08.6742036Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:15:08.6742178Z .b8 8 // Address Size (in bytes) 2026-02-21T09:15:08.6742281Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:15:08.6742362Z .b8 116 // DW_AT_producer 2026-02-21T09:15:08.6742427Z .b8 114 2026-02-21T09:15:08.6742482Z .b8 105 2026-02-21T09:15:08.6742534Z .b8 116 2026-02-21T09:15:08.6742586Z .b8 111 2026-02-21T09:15:08.6742645Z .b8 110 2026-02-21T09:15:08.6742697Z .b8 0 2026-02-21T09:15:08.6742770Z .b8 2 // DW_AT_language 2026-02-21T09:15:08.6742831Z .b8 0 2026-02-21T09:15:08.6742907Z .b8 99 // DW_AT_name 2026-02-21T09:15:08.6742958Z .b8 108 2026-02-21T09:15:08.6743009Z .b8 106 2026-02-21T09:15:08.6743070Z .b8 98 2026-02-21T09:15:08.6743122Z .b8 118 2026-02-21T09:15:08.6743172Z .b8 102 2026-02-21T09:15:08.6743223Z .b8 112 2026-02-21T09:15:08.6743279Z .b8 101 2026-02-21T09:15:08.6743330Z .b8 102 2026-02-21T09:15:08.6743381Z .b8 98 2026-02-21T09:15:08.6743437Z .b8 101 2026-02-21T09:15:08.6743487Z .b8 106 2026-02-21T09:15:08.6743537Z .b8 108 2026-02-21T09:15:08.6743587Z .b8 122 2026-02-21T09:15:08.6743644Z .b8 121 2026-02-21T09:15:08.6743694Z .b8 119 2026-02-21T09:15:08.6743744Z .b8 110 2026-02-21T09:15:08.6743800Z .b8 55 2026-02-21T09:15:08.6743851Z .b8 119 2026-02-21T09:15:08.6743900Z .b8 103 2026-02-21T09:15:08.6743950Z .b8 102 2026-02-21T09:15:08.6744008Z .b8 115 2026-02-21T09:15:08.6744058Z .b8 108 2026-02-21T09:15:08.6744108Z .b8 54 2026-02-21T09:15:08.6744164Z .b8 51 2026-02-21T09:15:08.6744214Z .b8 107 2026-02-21T09:15:08.6744264Z .b8 119 2026-02-21T09:15:08.6744314Z .b8 115 2026-02-21T09:15:08.6744372Z .b8 101 2026-02-21T09:15:08.6744421Z .b8 107 2026-02-21T09:15:08.6744472Z .b8 51 2026-02-21T09:15:08.6744522Z .b8 121 2026-02-21T09:15:08.6744580Z .b8 112 2026-02-21T09:15:08.6744632Z .b8 101 2026-02-21T09:15:08.6744682Z .b8 105 2026-02-21T09:15:08.6744743Z .b8 100 2026-02-21T09:15:08.6744792Z .b8 119 2026-02-21T09:15:08.6744843Z .b8 52 2026-02-21T09:15:08.6744894Z .b8 98 2026-02-21T09:15:08.6744953Z .b8 121 
2026-02-21T09:15:08.6745002Z .b8 121 2026-02-21T09:15:08.6745054Z .b8 110 2026-02-21T09:15:08.6745112Z .b8 55 2026-02-21T09:15:08.6745163Z .b8 55 2026-02-21T09:15:08.6745225Z .b8 121 2026-02-21T09:15:08.6745276Z .b8 109 2026-02-21T09:15:08.6745335Z .b8 109 2026-02-21T09:15:08.6745387Z .b8 105 2026-02-21T09:15:08.6745439Z .b8 115 2026-02-21T09:15:08.6745489Z .b8 109 2026-02-21T09:15:08.6745556Z .b8 116 2026-02-21T09:15:08.6745606Z .b8 99 2026-02-21T09:15:08.6745658Z .b8 46 2026-02-21T09:15:08.6745715Z .b8 112 2026-02-21T09:15:08.6745765Z .b8 121 2026-02-21T09:15:08.6745816Z .b8 0 2026-02-21T09:15:08.6745905Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:15:08.6745985Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:15:08.6746037Z .b8 116 2026-02-21T09:15:08.6746086Z .b8 109 2026-02-21T09:15:08.6746145Z .b8 112 2026-02-21T09:15:08.6746245Z .b8 47 2026-02-21T09:15:08.6746296Z .b8 116 2026-02-21T09:15:08.6746346Z .b8 111 2026-02-21T09:15:08.6746405Z .b8 114 2026-02-21T09:15:08.6746454Z .b8 99 2026-02-21T09:15:08.6746541Z .b8 104 2026-02-21T09:15:08.6746599Z .b8 105 2026-02-21T09:15:08.6746650Z .b8 110 2026-02-21T09:15:08.6746700Z .b8 100 2026-02-21T09:15:08.6746750Z .b8 117 2026-02-21T09:15:08.6746808Z .b8 99 2026-02-21T09:15:08.6746859Z .b8 116 2026-02-21T09:15:08.6746909Z .b8 111 2026-02-21T09:15:08.6746966Z .b8 114 2026-02-21T09:15:08.6747017Z .b8 95 2026-02-21T09:15:08.6747068Z .b8 114 2026-02-21T09:15:08.6747119Z .b8 111 2026-02-21T09:15:08.6747178Z .b8 111 2026-02-21T09:15:08.6747227Z .b8 116 2026-02-21T09:15:08.6747278Z .b8 47 2026-02-21T09:15:08.6747328Z .b8 108 2026-02-21T09:15:08.6747386Z .b8 106 2026-02-21T09:15:08.6747459Z .b8 0 2026-02-21T09:15:08.6747561Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:15:08.6747642Z .b8 95 // DW_AT_name 2026-02-21T09:15:08.6747696Z .b8 104 2026-02-21T09:15:08.6747747Z .b8 101 2026-02-21T09:15:08.6747797Z .b8 108 2026-02-21T09:15:08.6747875Z .b8 105 2026-02-21T09:15:08.6747927Z .b8 111 
2026-02-21T09:15:08.6747980Z .b8 110 2026-02-21T09:15:08.6748034Z .b8 95 2026-02-21T09:15:08.6748084Z .b8 109 2026-02-21T09:15:08.6748133Z .b8 97 2026-02-21T09:15:08.6748183Z .b8 116 2026-02-21T09:15:08.6748239Z .b8 109 2026-02-21T09:15:08.6748289Z .b8 117 2026-02-21T09:15:08.6748338Z .b8 108 2026-02-21T09:15:08.6748395Z .b8 95 2026-02-21T09:15:08.6748444Z .b8 98 2026-02-21T09:15:08.6748493Z .b8 102 2026-02-21T09:15:08.6748543Z .b8 49 2026-02-21T09:15:08.6748599Z .b8 54 2026-02-21T09:15:08.6748649Z .b8 95 2026-02-21T09:15:08.6748699Z .b8 105 2026-02-21T09:15:08.6748755Z .b8 110 2026-02-21T09:15:08.6748807Z .b8 116 2026-02-21T09:15:08.6748858Z .b8 52 2026-02-21T09:15:08.6748908Z .b8 0 2026-02-21T09:15:08.6748987Z .b8 1 // DW_AT_inline 2026-02-21T09:15:08.6749083Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:15:08.6749172Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:15:08.6749269Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:15:08.6749359Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:15:08.6749471Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:15:08.6749558Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:15:08.6749646Z .b64 $L__tmp0 // DW_AT_low_pc 2026-02-21T09:15:08.6749729Z .b64 $L__tmp63 // DW_AT_high_pc 2026-02-21T09:15:08.6749806Z .b8 1 // DW_AT_call_file 2026-02-21T09:15:08.6749892Z .b8 84 // DW_AT_call_line 2026-02-21T09:15:08.6749971Z .b8 40 // DW_AT_call_column 2026-02-21T09:15:08.6750052Z .b8 0 // End Of Children Mark 2026-02-21T09:15:08.6750141Z .b8 0 // End Of Children Mark 2026-02-21T09:15:08.6750193Z } 2026-02-21T09:15:08.6750261Z .section .debug_macinfo { } 2026-02-21T09:15:08.6750267Z 2026-02-21T09:15:08.6750349Z ================================================================ 2026-02-21T09:15:08.6750452Z please share the reproducer above with Triton project. 
2026-02-21T09:15:09.5651029Z [496s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]) 2026-02-21T09:15:09.5652303Z Tensor-likes are not close! 2026-02-21T09:15:09.5652421Z 2026-02-21T09:15:09.5652507Z Mismatched elements: 33485703 / 33554432 (99.8%) 2026-02-21T09:15:09.5653078Z Greatest absolute difference: 2400.0 at index (473, 5764) (up to 0.01 allowed) 2026-02-21T09:15:09.5653418Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:15:09.5653781Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:15:09.5653954Z 2026-02-21T09:15:09.5872716Z [496s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]) 2026-02-21T09:15:09.5873817Z Tensor-likes are not close! 2026-02-21T09:15:09.5879261Z 2026-02-21T09:15:09.5881606Z Mismatched elements: 33485703 / 33554432 (99.8%) 2026-02-21T09:15:09.5881956Z Greatest absolute difference: 2400.0 at index (473, 5764) (up to 0.01 allowed) 2026-02-21T09:15:09.5882505Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:15:09.5882818Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:15:09.5882986Z 2026-02-21T09:15:09.5938783Z [496s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T09:15:09.5939804Z Tensor-likes are not close! 2026-02-21T09:15:09.5939922Z 2026-02-21T09:15:09.5940008Z Mismatched elements: 33485703 / 33554432 (99.8%) 2026-02-21T09:15:09.5940290Z Greatest absolute difference: 2400.0 at index (473, 5764) (up to 0.01 allowed) 2026-02-21T09:15:09.5940626Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:15:09.5940934Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:15:09.5941101Z 2026-02-21T09:15:09.5956737Z 2026-02-21T09:15:09.5962071Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 18.2 configs/s 2026-02-21T09:15:17.3476571Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━ 845/845 108.8 configs/s 2026-02-21T09:15:17.5315216Z [504s] Generation 8 complete: 2026-02-21T09:15:17.5317176Z error=20 2026-02-21T09:15:17.5317416Z timeout=7 2026-02-21T09:15:17.5323583Z ok=59 2026-02-21T09:15:17.5327624Z min=0.2365 2026-02-21T09:15:17.5329070Z mid=0.3574 2026-02-21T09:15:17.5329237Z max=15.1972 2026-02-21T09:15:17.5329418Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:15:17.5329687Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:15:17.5329930Z 'l2_groupings': [2], 2026-02-21T09:15:17.5330125Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:15:17.5330330Z 'loop_orders': [[0, 1]], 2026-02-21T09:15:17.5330495Z 'maxnreg': 128, 2026-02-21T09:15:17.5330652Z 'num_sm_multiplier': 2, 
2026-02-21T09:15:17.5330814Z 'num_stages': 3, 2026-02-21T09:15:17.5330961Z 'num_warps': 4, 2026-02-21T09:15:17.5331114Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:15:17.5331312Z 'range_flattens': [None, None], 2026-02-21T09:15:17.5331490Z 'range_multi_buffers': [True, False], 2026-02-21T09:15:17.5331749Z 'range_num_stages': [3, 4], 2026-02-21T09:15:17.5331918Z 'range_unroll_factors': [0, 0], 2026-02-21T09:15:17.5332107Z 'range_warp_specializes': [True, None]} 2026-02-21T09:15:17.5350348Z [504s] Fitting surrogate: 932 points, 932 targets 2026-02-21T09:15:18.5031214Z [505s] Generation 9 starting: 58 neighbors, 3 active search path(s) 2026-02-21T09:15:51.9497118Z [539s] Timeout after 30s compiling Config(block_sizes=[32, 512, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:15:51.9523889Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 0.4 configs/s 2026-02-21T09:15:55.2949371Z [542s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 128, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]) 2026-02-21T09:15:55.2950571Z Tensor-likes are not close! 
2026-02-21T09:15:55.2954148Z 2026-02-21T09:15:55.2958104Z Mismatched elements: 33444483 / 33554432 (99.7%) 2026-02-21T09:15:55.2961938Z Greatest absolute difference: 1288.0 at index (3709, 4256) (up to 0.01 allowed) 2026-02-21T09:15:55.2966160Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:15:55.2970525Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:15:55.2974873Z 2026-02-21T09:15:56.0915443Z [543s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 128, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], num_stages=4, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]) 2026-02-21T09:15:56.0916445Z Tensor-likes are not close! 2026-02-21T09:15:56.0916578Z 2026-02-21T09:15:56.0916669Z Mismatched elements: 33444783 / 33554432 (99.7%) 2026-02-21T09:15:56.0916966Z Greatest absolute difference: 1456.0 at index (2320, 6047) (up to 0.01 allowed) 2026-02-21T09:15:56.0917309Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:15:56.0917628Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:15:56.0917790Z 2026-02-21T09:15:56.0931087Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 14.4 configs/s 2026-02-21T09:16:01.1187401Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━ 845/845 166.9 configs/s 2026-02-21T09:16:01.2605506Z [548s] Generation 9 complete: 2026-02-21T09:16:01.2610560Z error=10 2026-02-21T09:16:01.2615681Z timeout=1 2026-02-21T09:16:01.2620677Z ok=51 2026-02-21T09:16:01.2625102Z min=0.2366 2026-02-21T09:16:01.2626529Z mid=0.4515 2026-02-21T09:16:01.2626718Z max=21.6750 2026-02-21T09:16:01.2626880Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:16:01.2627137Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:16:01.2627394Z 'l2_groupings': [2], 2026-02-21T09:16:01.2627570Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:16:01.2627782Z 'loop_orders': [[0, 1]], 2026-02-21T09:16:01.2627937Z 'maxnreg': 128, 2026-02-21T09:16:01.2628097Z 'num_sm_multiplier': 2, 2026-02-21T09:16:01.2628256Z 'num_stages': 3, 2026-02-21T09:16:01.2628393Z 'num_warps': 4, 2026-02-21T09:16:01.2628556Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:16:01.2628746Z 'range_flattens': [None, None], 2026-02-21T09:16:01.2628927Z 'range_multi_buffers': [True, False], 2026-02-21T09:16:01.2629105Z 'range_num_stages': [3, 4], 2026-02-21T09:16:01.2629277Z 'range_unroll_factors': [0, 0], 2026-02-21T09:16:01.2629454Z 'range_warp_specializes': [True, None]} 2026-02-21T09:16:01.2636310Z [548s] Fitting surrogate: 994 points, 994 targets 2026-02-21T09:16:01.9680449Z [549s] Generation 10 starting: 35 neighbors, 2 active search path(s) 2026-02-21T09:16:33.5158975Z [580s] Timeout after 30s compiling Config(block_sizes=[32, 128, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=16, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], 
range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:16:33.5176833Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0.3 configs/s 2026-02-21T09:16:36.1673456Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 36/36 14.3 configs/s 2026-02-21T09:16:38.6562971Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━ 845/845 333.7 configs/s 2026-02-21T09:16:38.7579786Z [585s] Generation 10 complete: 2026-02-21T09:16:38.7581962Z error=5 2026-02-21T09:16:38.7582497Z timeout=1 2026-02-21T09:16:38.7582642Z ok=32 2026-02-21T09:16:38.7582774Z min=0.2367 2026-02-21T09:16:38.7582899Z mid=0.9708 2026-02-21T09:16:38.7583028Z max=20.3838 2026-02-21T09:16:38.7583182Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:16:38.7583524Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:16:38.7583777Z 'l2_groupings': [2], 2026-02-21T09:16:38.7583962Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:16:38.7584167Z 'loop_orders': [[0, 1]], 2026-02-21T09:16:38.7584324Z 'maxnreg': 128, 2026-02-21T09:16:38.7584477Z 'num_sm_multiplier': 2, 2026-02-21T09:16:38.7584645Z 'num_stages': 3, 2026-02-21T09:16:38.7584820Z 'num_warps': 4, 2026-02-21T09:16:38.7585034Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:16:38.7585333Z 'range_flattens': [None, None], 2026-02-21T09:16:38.7585607Z 'range_multi_buffers': [True, False], 2026-02-21T09:16:38.7585896Z 'range_num_stages': [3, 4], 2026-02-21T09:16:38.7586159Z 'range_unroll_factors': [0, 0], 2026-02-21T09:16:38.7586432Z 'range_warp_specializes': [True, None]} 2026-02-21T09:16:38.7607986Z [585s] Fitting surrogate: 1032 points, 1032 targets 2026-02-21T09:16:39.4543072Z [586s] Generation 11 starting: 34 neighbors, 2 active search path(s) 2026-02-21T09:17:11.0583960Z [618s] Timeout after 30s compiling Config(block_sizes=[16, 128, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], 
l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:17:11.0604431Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 0.3 configs/s 2026-02-21T09:17:11.7094921Z [618s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 128, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T09:17:11.7096087Z Tensor-likes are not close! 2026-02-21T09:17:11.7101920Z 2026-02-21T09:17:11.7102156Z Mismatched elements: 33451485 / 33554432 (99.7%) 2026-02-21T09:17:11.7102481Z Greatest absolute difference: 1408.0 at index (289, 7045) (up to 0.01 allowed) 2026-02-21T09:17:11.7102828Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:17:11.7103134Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:17:11.7103312Z 2026-02-21T09:17:12.8361294Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 20.3 configs/s 2026-02-21T09:17:15.0895169Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━ 845/845 367.5 configs/s 2026-02-21T09:17:15.1865215Z [622s] Generation 11 complete: 2026-02-21T09:17:15.1866523Z error=15 2026-02-21T09:17:15.1866777Z timeout=1 2026-02-21T09:17:15.1869589Z ok=21 2026-02-21T09:17:15.1869755Z min=0.2386 2026-02-21T09:17:15.1870121Z mid=0.6216 2026-02-21T09:17:15.1870244Z max=14.3678 2026-02-21T09:17:15.1870464Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:17:15.1870726Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:17:15.1871045Z 'l2_groupings': [2], 2026-02-21T09:17:15.1871225Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:17:15.1871433Z 'loop_orders': [[0, 1]], 2026-02-21T09:17:15.1871665Z 'maxnreg': 128, 2026-02-21T09:17:15.1871814Z 'num_sm_multiplier': 2, 2026-02-21T09:17:15.1871978Z 'num_stages': 3, 2026-02-21T09:17:15.1872116Z 'num_warps': 4, 2026-02-21T09:17:15.1872280Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:17:15.1872474Z 'range_flattens': [None, None], 2026-02-21T09:17:15.1872661Z 'range_multi_buffers': [True, False], 2026-02-21T09:17:15.1872843Z 'range_num_stages': [3, 4], 2026-02-21T09:17:15.1873016Z 'range_unroll_factors': [0, 0], 2026-02-21T09:17:15.1873199Z 'range_warp_specializes': [True, None]} 2026-02-21T09:17:15.1902398Z [622s] Fitting surrogate: 1069 points, 1069 targets 2026-02-21T09:17:15.6078640Z [622s] Generation 12 starting: 14 neighbors, 1 active search path(s) 2026-02-21T09:17:26.4795386Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 1.5 configs/s 2026-02-21T09:17:28.0869101Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 10.0 configs/s 2026-02-21T09:17:28.0873842Z [635s] Generation 12 complete: 2026-02-21T09:17:28.0874136Z error=1 2026-02-21T09:17:28.0874283Z ok=15 
2026-02-21T09:17:28.0874453Z min=0.2386 2026-02-21T09:17:28.0874601Z mid=9.7946 2026-02-21T09:17:28.0874757Z max=13.3428 2026-02-21T09:17:28.0875250Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:17:28.0875556Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:17:28.0875828Z 'l2_groupings': [2], 2026-02-21T09:17:28.0876017Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:17:28.0876221Z 'loop_orders': [[0, 1]], 2026-02-21T09:17:28.0876392Z 'maxnreg': 128, 2026-02-21T09:17:28.0876578Z 'num_sm_multiplier': 2, 2026-02-21T09:17:28.0876743Z 'num_stages': 3, 2026-02-21T09:17:28.0876890Z 'num_warps': 4, 2026-02-21T09:17:28.0877051Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:17:28.0877247Z 'range_flattens': [None, None], 2026-02-21T09:17:28.0877432Z 'range_multi_buffers': [True, False], 2026-02-21T09:17:28.0877612Z 'range_num_stages': [3, 4], 2026-02-21T09:17:28.0877786Z 'range_unroll_factors': [0, 0], 2026-02-21T09:17:28.0877974Z 'range_warp_specializes': [True, None]} 2026-02-21T09:17:28.0909211Z [635s] Fitting surrogate: 1085 points, 1085 targets 2026-02-21T09:17:28.5890934Z [635s] Generation 13 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:17:32.8112566Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 4.5 configs/s 2026-02-21T09:17:33.7234644Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 21.1 configs/s 2026-02-21T09:17:33.7240104Z [640s] Generation 13 complete: 2026-02-21T09:17:33.7244501Z error=9 2026-02-21T09:17:33.7245926Z ok=12 2026-02-21T09:17:33.7246098Z min=0.2386 2026-02-21T09:17:33.7246230Z mid=1.4489 2026-02-21T09:17:33.7246380Z max=4.6311 2026-02-21T09:17:33.7246524Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:17:33.7246805Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:17:33.7247048Z 'l2_groupings': [2], 2026-02-21T09:17:33.7247231Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:17:33.7247434Z 
'loop_orders': [[0, 1]], 2026-02-21T09:17:33.7247592Z 'maxnreg': 128, 2026-02-21T09:17:33.7247746Z 'num_sm_multiplier': 2, 2026-02-21T09:17:33.7247900Z 'num_stages': 3, 2026-02-21T09:17:33.7248045Z 'num_warps': 4, 2026-02-21T09:17:33.7248202Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:17:33.7248657Z 'range_flattens': [None, None], 2026-02-21T09:17:33.7248850Z 'range_multi_buffers': [True, False], 2026-02-21T09:17:33.7249040Z 'range_num_stages': [3, 4], 2026-02-21T09:17:33.7249211Z 'range_unroll_factors': [0, 0], 2026-02-21T09:17:33.7249483Z 'range_warp_specializes': [True, None]} 2026-02-21T09:17:33.7275968Z [640s] Fitting surrogate: 1106 points, 1106 targets 2026-02-21T09:17:34.1939947Z [641s] Generation 14 starting: 16 neighbors, 1 active search path(s) 2026-02-21T09:18:04.6863786Z [671s] Timeout after 30s compiling Config(block_sizes=[16, 64, 256], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:18:05.1355769Z [672s] Timeout after 30s compiling Config(block_sizes=[16, 256, 64], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]) 2026-02-21T09:18:05.1372680Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.4 configs/s 2026-02-21T09:18:06.3506423Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 13.5 configs/s 2026-02-21T09:18:06.3510572Z 
[673s] Generation 14 complete: 2026-02-21T09:18:06.3510860Z error=1 2026-02-21T09:18:06.3511041Z timeout=2 2026-02-21T09:18:06.3511170Z ok=15 2026-02-21T09:18:06.3511303Z min=0.2386 2026-02-21T09:18:06.3511428Z mid=1.1808 2026-02-21T09:18:06.3511624Z max=14.3575 2026-02-21T09:18:06.3511776Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:18:06.3512061Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:18:06.3512305Z 'l2_groupings': [2], 2026-02-21T09:18:06.3512504Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:18:06.3512707Z 'loop_orders': [[0, 1]], 2026-02-21T09:18:06.3512871Z 'maxnreg': 128, 2026-02-21T09:18:06.3513024Z 'num_sm_multiplier': 2, 2026-02-21T09:18:06.3513187Z 'num_stages': 3, 2026-02-21T09:18:06.3513339Z 'num_warps': 4, 2026-02-21T09:18:06.3513505Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:18:06.3513709Z 'range_flattens': [None, None], 2026-02-21T09:18:06.3513889Z 'range_multi_buffers': [True, False], 2026-02-21T09:18:06.3514081Z 'range_num_stages': [3, 4], 2026-02-21T09:18:06.3514253Z 'range_unroll_factors': [0, 0], 2026-02-21T09:18:06.3514433Z 'range_warp_specializes': [True, None]} 2026-02-21T09:18:06.3544905Z [673s] Fitting surrogate: 1124 points, 1124 targets 2026-02-21T09:18:06.7871924Z [673s] Generation 15 starting: 16 neighbors, 1 active search path(s) 2026-02-21T09:18:12.8986117Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 2.1 configs/s 2026-02-21T09:18:14.9521271Z 2026-02-21T09:18:14.9523736Z 2026-02-21T09:18:14.9526801Z ================================================================ 2026-02-21T09:18:14.9527105Z Internal Triton PTX codegen error 2026-02-21T09:18:14.9527285Z `ptxas` stderr: 2026-02-21T09:18:14.9527749Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 251 in function _helion_matmul_bf16_int4. Try to compile with register target of 34 or higher. 
2026-02-21T09:18:14.9528240Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:18:14.9528396Z 2026-02-21T09:18:14.9528788Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpxtuvssla.ptx -o /tmp/tmpxtuvssla.ptx.o 2026-02-21T09:18:14.9529220Z 2026-02-21T09:18:14.9529223Z 2026-02-21T09:18:14.9529289Z // 2026-02-21T09:18:14.9529428Z // Generated by LLVM NVPTX Back-End 2026-02-21T09:18:14.9529850Z // 2026-02-21T09:18:14.9529928Z 2026-02-21T09:18:14.9529987Z .version 8.7 2026-02-21T09:18:14.9530143Z .target sm_100a 2026-02-21T09:18:14.9530355Z .address_size 64 2026-02-21T09:18:14.9530598Z 2026-02-21T09:18:14.9530865Z // .globl _helion_matmul_bf16_int4 // -- Begin function _helion_matmul_bf16_int4 2026-02-21T09:18:14.9531245Z .extern .shared .align 16 .b8 global_smem[]; 2026-02-21T09:18:14.9535839Z // @_helion_matmul_bf16_int4 2026-02-21T09:18:14.9540090Z .visible .entry _helion_matmul_bf16_int4( 2026-02-21T09:18:14.9544582Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_0, 2026-02-21T09:18:14.9546661Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_1, 2026-02-21T09:18:14.9546989Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_2, 2026-02-21T09:18:14.9547269Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_3, 2026-02-21T09:18:14.9547564Z .param .u64 .ptr .global .align 1 _helion_matmul_bf16_int4_param_4 2026-02-21T09:18:14.9547785Z ) 2026-02-21T09:18:14.9547920Z .reqntid 128 2026-02-21T09:18:14.9548053Z .maxnreg 32 2026-02-21T09:18:14.9548187Z { 2026-02-21T09:18:14.9548321Z .reg .pred %p<394>; 2026-02-21T09:18:14.9548472Z .reg .b16 %rs<2568>; 2026-02-21T09:18:14.9548624Z .reg .b32 %r<7848>; 2026-02-21T09:18:14.9548766Z .reg .b64 %rd<1059>; 2026-02-21T09:18:14.9549037Z .loc 1 14 0 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:14:0 2026-02-21T09:18:14.9549324Z $L__func_begin0: 
2026-02-21T09:18:14.9549572Z .loc 1 14 0 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:14:0 2026-02-21T09:18:14.9549796Z 2026-02-21T09:18:14.9549851Z // %bb.0: 2026-02-21T09:18:14.9550035Z ld.param.b64 %rd57, [_helion_matmul_bf16_int4_param_2]; 2026-02-21T09:18:14.9550297Z ld.param.b64 %rd56, [_helion_matmul_bf16_int4_param_1]; 2026-02-21T09:18:14.9550544Z ld.param.b64 %rd55, [_helion_matmul_bf16_int4_param_0]; 2026-02-21T09:18:14.9550762Z $L__tmp0: 2026-02-21T09:18:14.9550998Z .loc 1 14 0 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:14 2026-02-21T09:18:14.9551292Z mov.u32 %r1, %tid.x; 2026-02-21T09:18:14.9551456Z setp.lt.u32 %p6, %r1, 32; 2026-02-21T09:18:14.9551733Z mov.b32 %r144, global_smem; 2026-02-21T09:18:14.9551906Z // begin inline asm 2026-02-21T09:18:14.9552159Z @%p6 tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%r144], 512; 2026-02-21T09:18:14.9552420Z // end inline asm 2026-02-21T09:18:14.9552562Z bar.sync 0; 2026-02-21T09:18:14.9552730Z ld.shared.b32 %r7822, [global_smem]; 2026-02-21T09:18:14.9552907Z bar.sync 0; 2026-02-21T09:18:14.9553052Z // begin inline asm 2026-02-21T09:18:14.9553267Z @%p6 tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned; 2026-02-21T09:18:14.9553498Z // end inline asm 2026-02-21T09:18:14.9553774Z .loc 1 19 46 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:46 2026-02-21T09:18:14.9554839Z mov.u32 %r7842, %ctaid.x; 2026-02-21T09:18:14.9555100Z .loc 1 0 0 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:0 2026-02-21T09:18:14.9555388Z sub.s32 %r145, 6463, %r7842; 2026-02-21T09:18:14.9555662Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:14.9555956Z shr.u32 %r146, %r145, 6; 2026-02-21T09:18:14.9556128Z mul.hi.u32 %r147, %r146, 116080198; 2026-02-21T09:18:14.9556302Z and.b32 %r148, %r147, 2097148; 2026-02-21T09:18:14.9556480Z mad.lo.s32 %r7847, %r148, 2368, %r7842; 2026-02-21T09:18:14.9556758Z .loc 1 31 45 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:31:45 2026-02-21T09:18:14.9557044Z shr.u32 %r5, %r1, 5; 2026-02-21T09:18:14.9557190Z and.b32 %r6, %r1, 124; 2026-02-21T09:18:14.9557350Z bfe.u32 %r7, %r1, 2, 5; 2026-02-21T09:18:14.9557535Z and.b32 %r8, %r1, 112; 2026-02-21T09:18:14.9557853Z bfe.u32 %r9, %r1, 4, 3; 2026-02-21T09:18:14.9558021Z or.b32 %r10, %r9, 8; 2026-02-21T09:18:14.9558172Z or.b32 %r11, %r9, 16; 2026-02-21T09:18:14.9558333Z or.b32 %r12, %r9, 24; 2026-02-21T09:18:14.9558479Z or.b32 %r13, %r9, 32; 2026-02-21T09:18:14.9558673Z or.b32 %r14, %r9, 40; 2026-02-21T09:18:14.9558821Z or.b32 %r15, %r9, 48; 2026-02-21T09:18:14.9559000Z or.b32 %r16, %r9, 56; 2026-02-21T09:18:14.9559256Z .loc 1 33 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:33:45 2026-02-21T09:18:14.9559557Z shl.b32 %r17, %r1, 4; 2026-02-21T09:18:14.9559713Z and.b32 %r18, %r17, 112; 2026-02-21T09:18:14.9559866Z and.b32 %r19, %r1, 15; 2026-02-21T09:18:14.9560022Z shl.b32 %r20, %r19, 3; 2026-02-21T09:18:14.9560272Z .loc 1 41 48 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:41:48 2026-02-21T09:18:14.9560563Z bfe.u32 %r21, %r1, 3, 4; 2026-02-21T09:18:14.9560822Z .loc 1 47 38 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:47:38 2026-02-21T09:18:14.9561094Z and.b32 %r22, %r1, 3; 2026-02-21T09:18:14.9561353Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:14.9561683Z setp.ge.s32 %p8, %r7842, %r7847; 2026-02-21T09:18:14.9561865Z add.s32 %r6675, %r7822, 1048576; 2026-02-21T09:18:14.9562031Z add.s32 %r6964, %r7822, 128; 2026-02-21T09:18:14.9562199Z add.s32 %r7253, %r7822, 1048704; 2026-02-21T09:18:14.9562367Z and.b32 %r7826, %r1, 24; 2026-02-21T09:18:14.9562518Z shl.b32 %r7827, %r22, 5; 2026-02-21T09:18:14.9562678Z shl.b32 %r7828, %r6, 2; 2026-02-21T09:18:14.9562829Z shl.b32 %r7829, %r1, 9; 2026-02-21T09:18:14.9562984Z shl.b32 %r7830, %r19, 4; 2026-02-21T09:18:14.9563132Z and.b32 %r7831, %r1, 96; 
2026-02-21T09:18:14.9563287Z shl.b32 %r7832, %r1, 2; 2026-02-21T09:18:14.9563437Z and.b32 %r7833, %r17, 496; 2026-02-21T09:18:14.9563599Z bfe.u32 %r7834, %r1, 5, 2; 2026-02-21T09:18:14.9563751Z and.b32 %r7835, %r1, 12; 2026-02-21T09:18:14.9563906Z shl.b32 %r7836, %r22, 2; 2026-02-21T09:18:14.9564058Z shl.b32 %r7837, %r1, 7; 2026-02-21T09:18:14.9564208Z add.s32 %r7254, %r7822, 1048832; 2026-02-21T09:18:14.9564384Z add.s32 %r7543, %r7822, 256; 2026-02-21T09:18:14.9564539Z shl.b32 %r7840, %r1, 5; 2026-02-21T09:18:14.9564699Z or.b32 %r7841, %r7, 32; 2026-02-21T09:18:14.9564855Z setp.eq.b32 %p366, %r1, 0; 2026-02-21T09:18:14.9565028Z @%p8 bra $L__BB0_43; 2026-02-21T09:18:14.9565199Z // %bb.1: // %.lr.ph 2026-02-21T09:18:14.9565526Z .loc 1 0 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:0:112 2026-02-21T09:18:14.9565830Z shl.b32 %r150, %r7826, 7; 2026-02-21T09:18:14.9565989Z or.b32 %r153, %r150, %r7827; 2026-02-21T09:18:14.9566160Z xor.b32 %r154, %r153, %r7828; 2026-02-21T09:18:14.9566326Z add.s32 %r26, %r144, %r154; 2026-02-21T09:18:14.9566497Z and.b32 %r157, %r7829, 3072; 2026-02-21T09:18:14.9566657Z shl.b32 %r160, %r7831, 3; 2026-02-21T09:18:14.9566827Z and.b32 %r162, %r7832, 64; 2026-02-21T09:18:14.9566985Z or.b32 %r163, %r7830, %r160; 2026-02-21T09:18:14.9567158Z xor.b32 %r164, %r163, %r162; 2026-02-21T09:18:14.9567331Z or.b32 %r165, %r164, %r157; 2026-02-21T09:18:14.9567494Z add.s32 %r27, %r144, %r165; 2026-02-21T09:18:14.9567666Z xor.b32 %r166, %r165, 32; 2026-02-21T09:18:14.9567821Z add.s32 %r28, %r144, %r166; 2026-02-21T09:18:14.9567992Z shr.u32 %r168, %r7826, 1; 2026-02-21T09:18:14.9568151Z or.b32 %r170, %r168, %r7834; 2026-02-21T09:18:14.9568324Z or.b32 %r171, %r170, %r7833; 2026-02-21T09:18:14.9568492Z add.s32 %r29, %r144, %r171; 2026-02-21T09:18:14.9568651Z xor.b32 %r172, %r171, 4; 2026-02-21T09:18:14.9568823Z add.s32 %r30, %r144, %r172; 2026-02-21T09:18:14.9568978Z xor.b32 %r173, %r171, 8; 2026-02-21T09:18:14.9569141Z add.s32 %r31, %r144, 
%r173; 2026-02-21T09:18:14.9569296Z xor.b32 %r174, %r171, 12; 2026-02-21T09:18:14.9569461Z add.s32 %r32, %r144, %r174; 2026-02-21T09:18:14.9569648Z xor.b32 %r175, %r171, 32; 2026-02-21T09:18:14.9569807Z add.s32 %r33, %r144, %r175; 2026-02-21T09:18:14.9569966Z xor.b32 %r176, %r171, 36; 2026-02-21T09:18:14.9570113Z add.s32 %r34, %r144, %r176; 2026-02-21T09:18:14.9570303Z xor.b32 %r177, %r171, 40; 2026-02-21T09:18:14.9570451Z add.s32 %r35, %r144, %r177; 2026-02-21T09:18:14.9570642Z xor.b32 %r178, %r171, 44; 2026-02-21T09:18:14.9570792Z add.s32 %r36, %r144, %r178; 2026-02-21T09:18:14.9570974Z xor.b32 %r179, %r171, 64; 2026-02-21T09:18:14.9571119Z add.s32 %r37, %r144, %r179; 2026-02-21T09:18:14.9571276Z xor.b32 %r180, %r171, 68; 2026-02-21T09:18:14.9571424Z add.s32 %r38, %r144, %r180; 2026-02-21T09:18:14.9571619Z xor.b32 %r181, %r171, 72; 2026-02-21T09:18:14.9571774Z add.s32 %r39, %r144, %r181; 2026-02-21T09:18:14.9571924Z xor.b32 %r182, %r171, 76; 2026-02-21T09:18:14.9572079Z add.s32 %r40, %r144, %r182; 2026-02-21T09:18:14.9572231Z xor.b32 %r183, %r171, 96; 2026-02-21T09:18:14.9572386Z add.s32 %r41, %r144, %r183; 2026-02-21T09:18:14.9572540Z xor.b32 %r184, %r171, 100; 2026-02-21T09:18:14.9572699Z add.s32 %r42, %r144, %r184; 2026-02-21T09:18:14.9572852Z xor.b32 %r185, %r171, 104; 2026-02-21T09:18:14.9573016Z add.s32 %r43, %r144, %r185; 2026-02-21T09:18:14.9573172Z xor.b32 %r186, %r171, 108; 2026-02-21T09:18:14.9573332Z add.s32 %r44, %r144, %r186; 2026-02-21T09:18:14.9573499Z mul.lo.s32 %r189, %r7835, 136; 2026-02-21T09:18:14.9573665Z or.b32 %r190, %r7836, %r8; 2026-02-21T09:18:14.9573827Z xor.b32 %r191, %r189, %r190; 2026-02-21T09:18:14.9573982Z add.s32 %r45, %r144, %r191; 2026-02-21T09:18:14.9574143Z xor.b32 %r192, %r191, 132; 2026-02-21T09:18:14.9574299Z add.s32 %r46, %r144, %r192; 2026-02-21T09:18:14.9574463Z xor.b32 %r193, %r191, 264; 2026-02-21T09:18:14.9574613Z add.s32 %r47, %r144, %r193; 2026-02-21T09:18:14.9574771Z xor.b32 %r194, %r191, 396; 
2026-02-21T09:18:14.9574927Z add.s32 %r48, %r144, %r194; 2026-02-21T09:18:14.9575080Z and.b32 %r196, %r7837, 16256; 2026-02-21T09:18:14.9575244Z or.b32 %r197, %r196, %r18; 2026-02-21T09:18:14.9575395Z add.s32 %r49, %r144, %r197; 2026-02-21T09:18:14.9575556Z xor.b32 %r198, %r197, 16; 2026-02-21T09:18:14.9575704Z add.s32 %r50, %r144, %r198; 2026-02-21T09:18:14.9575865Z xor.b32 %r199, %r197, 32; 2026-02-21T09:18:14.9576014Z add.s32 %r51, %r144, %r199; 2026-02-21T09:18:14.9576174Z xor.b32 %r200, %r197, 48; 2026-02-21T09:18:14.9576320Z add.s32 %r52, %r144, %r200; 2026-02-21T09:18:14.9576480Z xor.b32 %r201, %r197, 64; 2026-02-21T09:18:14.9576632Z add.s32 %r53, %r144, %r201; 2026-02-21T09:18:14.9576782Z xor.b32 %r202, %r197, 80; 2026-02-21T09:18:14.9576937Z add.s32 %r54, %r144, %r202; 2026-02-21T09:18:14.9577087Z xor.b32 %r203, %r197, 96; 2026-02-21T09:18:14.9577244Z add.s32 %r55, %r144, %r203; 2026-02-21T09:18:14.9577396Z xor.b32 %r204, %r197, 112; 2026-02-21T09:18:14.9577551Z add.s32 %r56, %r144, %r204; 2026-02-21T09:18:14.9577703Z bfe.u32 %r205, %r144, 4, 14; 2026-02-21T09:18:14.9577863Z cvt.u64.u32 %rd58, %r205; 2026-02-21T09:18:14.9578027Z or.b64 %rd1, %rd58, 4611686293372403712; 2026-02-21T09:18:14.9578212Z add.s32 %r206, %r144, 32; 2026-02-21T09:18:14.9578366Z bfe.u32 %r207, %r206, 4, 14; 2026-02-21T09:18:14.9578518Z cvt.u64.u32 %rd59, %r207; 2026-02-21T09:18:14.9578689Z or.b64 %rd2, %rd59, 4611686293372403712; 2026-02-21T09:18:14.9578864Z add.s32 %r208, %r144, 64; 2026-02-21T09:18:14.9579016Z bfe.u32 %r209, %r208, 4, 14; 2026-02-21T09:18:14.9579168Z cvt.u64.u32 %rd60, %r209; 2026-02-21T09:18:14.9579331Z or.b64 %rd3, %rd60, 4611686293372403712; 2026-02-21T09:18:14.9579501Z add.s32 %r210, %r144, 96; 2026-02-21T09:18:14.9579654Z bfe.u32 %r211, %r210, 4, 14; 2026-02-21T09:18:14.9579812Z cvt.u64.u32 %rd61, %r211; 2026-02-21T09:18:14.9579969Z or.b64 %rd4, %rd61, 4611686293372403712; 2026-02-21T09:18:14.9580151Z and.b32 %r213, %r7840, 3168; 2026-02-21T09:18:14.9580309Z 
shl.b32 %r214, %r7826, 4; 2026-02-21T09:18:14.9580472Z and.b32 %r215, %r7832, 16; 2026-02-21T09:18:14.9580669Z or.b32 %r216, %r213, %r214; 2026-02-21T09:18:14.9580841Z xor.b32 %r217, %r216, %r7831; 2026-02-21T09:18:14.9581005Z add.s32 %r218, %r144, %r215; 2026-02-21T09:18:14.9581175Z add.s32 %r1540, %r218, %r217; 2026-02-21T09:18:14.9581403Z add.s32 %r1545, %r1540, 512; 2026-02-21T09:18:14.9581765Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:14.9582080Z shl.b32 %r219, %r21, 13; 2026-02-21T09:18:14.9582390Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:14.9582696Z or.b32 %r61, %r219, %r18; 2026-02-21T09:18:14.9582973Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:14.9583300Z mad.wide.u32 %rd5, %r22, 16, %rd55; 2026-02-21T09:18:14.9583485Z shl.b32 %r63, %r7, 10; 2026-02-21T09:18:14.9583650Z bra.uni $L__BB0_2; 2026-02-21T09:18:14.9583855Z $L__BB0_42: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:18:14.9584191Z .loc 1 31 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:31:32 2026-02-21T09:18:14.9584492Z or.b32 %r6129, %r127, %r9; 2026-02-21T09:18:14.9584655Z or.b32 %r6130, %r127, %r10; 2026-02-21T09:18:14.9584823Z or.b32 %r6131, %r127, %r11; 2026-02-21T09:18:14.9584983Z or.b32 %r6132, %r127, %r12; 2026-02-21T09:18:14.9585150Z or.b32 %r6133, %r127, %r13; 2026-02-21T09:18:14.9585309Z or.b32 %r6134, %r127, %r14; 2026-02-21T09:18:14.9585473Z or.b32 %r6135, %r127, %r15; 2026-02-21T09:18:14.9585637Z or.b32 %r6136, %r127, %r16; 2026-02-21T09:18:14.9585903Z .loc 1 33 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:33:32 2026-02-21T09:18:14.9586198Z or.b32 %r6137, %r128, %r20; 2026-02-21T09:18:14.9586462Z .loc 1 88 43 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:43 2026-02-21T09:18:14.9586755Z shl.b32 %r6138, %r6129, 13; 2026-02-21T09:18:14.9586918Z shl.b32 %r6139, %r6130, 13; 
2026-02-21T09:18:14.9587093Z shl.b32 %r6140, %r6131, 13; 2026-02-21T09:18:14.9587252Z shl.b32 %r6141, %r6132, 13; 2026-02-21T09:18:14.9587400Z shl.b32 %r6142, %r6133, 13; 2026-02-21T09:18:14.9587556Z shl.b32 %r6143, %r6134, 13; 2026-02-21T09:18:14.9587704Z shl.b32 %r6144, %r6135, 13; 2026-02-21T09:18:14.9587861Z shl.b32 %r6145, %r6136, 13; 2026-02-21T09:18:14.9588110Z .loc 1 88 50 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:50 2026-02-21T09:18:14.9588392Z add.s32 %r6146, %r6138, %r6137; 2026-02-21T09:18:14.9588556Z add.s32 %r6147, %r6139, %r6137; 2026-02-21T09:18:14.9588723Z add.s32 %r6148, %r6140, %r6137; 2026-02-21T09:18:14.9588887Z add.s32 %r6149, %r6141, %r6137; 2026-02-21T09:18:14.9589044Z add.s32 %r6150, %r6142, %r6137; 2026-02-21T09:18:14.9589207Z add.s32 %r6151, %r6143, %r6137; 2026-02-21T09:18:14.9589360Z add.s32 %r6152, %r6144, %r6137; 2026-02-21T09:18:14.9589521Z add.s32 %r6153, %r6145, %r6137; 2026-02-21T09:18:14.9589779Z .loc 1 88 22 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:22 2026-02-21T09:18:14.9590070Z mad.wide.s32 %rd714, %r6146, 2, %rd57; 2026-02-21T09:18:14.9590255Z mad.wide.s32 %rd715, %r6147, 2, %rd57; 2026-02-21T09:18:14.9590443Z mad.wide.s32 %rd716, %r6148, 2, %rd57; 2026-02-21T09:18:14.9590626Z mad.wide.s32 %rd717, %r6149, 2, %rd57; 2026-02-21T09:18:14.9590802Z mad.wide.s32 %rd718, %r6150, 2, %rd57; 2026-02-21T09:18:14.9590984Z mad.wide.s32 %rd719, %r6151, 2, %rd57; 2026-02-21T09:18:14.9591154Z mad.wide.s32 %rd720, %r6152, 2, %rd57; 2026-02-21T09:18:14.9591334Z mad.wide.s32 %rd721, %r6153, 2, %rd57; 2026-02-21T09:18:14.9591501Z $L__tmp1: 2026-02-21T09:18:14.9591818Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9592148Z // begin inline asm 2026-02-21T09:18:14.9592585Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5989, %r5990, %r5991, %r5992, %r5993, %r5994, %r5995, %r5996, %r5997, %r5998, %r5999, %r6000, %r6001, %r6002, %r6003, 
%r6004}, [%r4673 + 0], 64; 2026-02-21T09:18:14.9593010Z // end inline asm 2026-02-21T09:18:14.9593180Z // begin inline asm 2026-02-21T09:18:14.9593621Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6006, %r6007, %r6008, %r6009, %r6010, %r6011, %r6012, %r6013, %r6014, %r6015, %r6016, %r6017, %r6018, %r6019, %r6020, %r6021}, [%r4673 + 16], 64; 2026-02-21T09:18:14.9594085Z // end inline asm 2026-02-21T09:18:14.9594240Z // begin inline asm 2026-02-21T09:18:14.9594636Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6023, %r6024, %r6025, %r6026, %r6027, %r6028, %r6029, %r6030, %r6031, %r6032, %r6033, %r6034, %r6035, %r6036, %r6037, %r6038}, [%r4673 + 32], 64; 2026-02-21T09:18:14.9595054Z // end inline asm 2026-02-21T09:18:14.9595201Z // begin inline asm 2026-02-21T09:18:14.9595585Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6040, %r6041, %r6042, %r6043, %r6044, %r6045, %r6046, %r6047, %r6048, %r6049, %r6050, %r6051, %r6052, %r6053, %r6054, %r6055}, [%r4673 + 48], 64; 2026-02-21T09:18:14.9596012Z // end inline asm 2026-02-21T09:18:14.9596152Z // begin inline asm 2026-02-21T09:18:14.9596320Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:14.9596501Z // end inline asm 2026-02-21T09:18:14.9596646Z cvt.u64.u32 %rd722, %r5989; 2026-02-21T09:18:14.9596821Z cvt.u64.u32 %rd723, %r5990; 2026-02-21T09:18:14.9596985Z shl.b64 %rd724, %rd723, 32; 2026-02-21T09:18:14.9597157Z or.b64 %rd725, %rd722, %rd724; 2026-02-21T09:18:14.9597321Z $L__tmp2: 2026-02-21T09:18:14.9597570Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9597862Z mov.b64 {%r6154, %r6155}, %rd725; 2026-02-21T09:18:14.9598060Z cvt.rn.bf16x2.f32 %r6156, %r6155, %r6154; 2026-02-21T09:18:14.9598248Z $L__tmp3: 2026-02-21T09:18:14.9598532Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9598872Z cvt.u64.u32 %rd726, %r5991; 2026-02-21T09:18:14.9599035Z cvt.u64.u32 %rd727, %r5992; 2026-02-21T09:18:14.9599200Z 
shl.b64 %rd728, %rd727, 32; 2026-02-21T09:18:14.9599360Z or.b64 %rd729, %rd726, %rd728; 2026-02-21T09:18:14.9599527Z $L__tmp4: 2026-02-21T09:18:14.9599762Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9600054Z mov.b64 {%r6157, %r6158}, %rd729; 2026-02-21T09:18:14.9600246Z cvt.rn.bf16x2.f32 %r6159, %r6158, %r6157; 2026-02-21T09:18:14.9600422Z $L__tmp5: 2026-02-21T09:18:14.9600706Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9601036Z cvt.u64.u32 %rd730, %r5993; 2026-02-21T09:18:14.9601202Z cvt.u64.u32 %rd731, %r5994; 2026-02-21T09:18:14.9601358Z shl.b64 %rd732, %rd731, 32; 2026-02-21T09:18:14.9601525Z or.b64 %rd733, %rd730, %rd732; 2026-02-21T09:18:14.9601725Z $L__tmp6: 2026-02-21T09:18:14.9601961Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9602245Z mov.b64 {%r6160, %r6161}, %rd733; 2026-02-21T09:18:14.9602422Z cvt.rn.bf16x2.f32 %r6162, %r6161, %r6160; 2026-02-21T09:18:14.9602601Z $L__tmp7: 2026-02-21T09:18:14.9602875Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9603201Z cvt.u64.u32 %rd734, %r5995; 2026-02-21T09:18:14.9603355Z cvt.u64.u32 %rd735, %r5996; 2026-02-21T09:18:14.9603514Z shl.b64 %rd736, %rd735, 32; 2026-02-21T09:18:14.9603677Z or.b64 %rd737, %rd734, %rd736; 2026-02-21T09:18:14.9603830Z $L__tmp8: 2026-02-21T09:18:14.9604070Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9604351Z mov.b64 {%r6163, %r6164}, %rd737; 2026-02-21T09:18:14.9604537Z cvt.rn.bf16x2.f32 %r6165, %r6164, %r6163; 2026-02-21T09:18:14.9604754Z $L__tmp9: 2026-02-21T09:18:14.9605039Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9605361Z cvt.u64.u32 %rd738, %r5997; 2026-02-21T09:18:14.9605558Z 
cvt.u64.u32 %rd739, %r5998; 2026-02-21T09:18:14.9605743Z shl.b64 %rd740, %rd739, 32; 2026-02-21T09:18:14.9605903Z or.b64 %rd741, %rd738, %rd740; 2026-02-21T09:18:14.9606091Z $L__tmp10: 2026-02-21T09:18:14.9606330Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9606633Z mov.b64 {%r6166, %r6167}, %rd741; 2026-02-21T09:18:14.9606820Z cvt.rn.bf16x2.f32 %r6168, %r6167, %r6166; 2026-02-21T09:18:14.9607005Z $L__tmp11: 2026-02-21T09:18:14.9607280Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9607610Z cvt.u64.u32 %rd742, %r5999; 2026-02-21T09:18:14.9607777Z cvt.u64.u32 %rd743, %r6000; 2026-02-21T09:18:14.9607935Z shl.b64 %rd744, %rd743, 32; 2026-02-21T09:18:14.9608105Z or.b64 %rd745, %rd742, %rd744; 2026-02-21T09:18:14.9608264Z $L__tmp12: 2026-02-21T09:18:14.9608504Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9608793Z mov.b64 {%r6169, %r6170}, %rd745; 2026-02-21T09:18:14.9608982Z cvt.rn.bf16x2.f32 %r6171, %r6170, %r6169; 2026-02-21T09:18:14.9609159Z $L__tmp13: 2026-02-21T09:18:14.9609442Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9609773Z cvt.u64.u32 %rd746, %r6001; 2026-02-21T09:18:14.9609930Z cvt.u64.u32 %rd747, %r6002; 2026-02-21T09:18:14.9610096Z shl.b64 %rd748, %rd747, 32; 2026-02-21T09:18:14.9610254Z or.b64 %rd749, %rd746, %rd748; 2026-02-21T09:18:14.9610417Z $L__tmp14: 2026-02-21T09:18:14.9610650Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9610940Z mov.b64 {%r6172, %r6173}, %rd749; 2026-02-21T09:18:14.9611126Z cvt.rn.bf16x2.f32 %r6174, %r6173, %r6172; 2026-02-21T09:18:14.9611300Z $L__tmp15: 2026-02-21T09:18:14.9611615Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9611933Z 
cvt.u64.u32 %rd750, %r6003; 2026-02-21T09:18:14.9612095Z cvt.u64.u32 %rd751, %r6004; 2026-02-21T09:18:14.9612247Z shl.b64 %rd752, %rd751, 32; 2026-02-21T09:18:14.9612408Z or.b64 %rd753, %rd750, %rd752; 2026-02-21T09:18:14.9612561Z $L__tmp16: 2026-02-21T09:18:14.9612797Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9613081Z mov.b64 {%r6175, %r6176}, %rd753; 2026-02-21T09:18:14.9613256Z cvt.rn.bf16x2.f32 %r6177, %r6176, %r6175; 2026-02-21T09:18:14.9613436Z $L__tmp17: 2026-02-21T09:18:14.9613706Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9614029Z cvt.u64.u32 %rd754, %r6006; 2026-02-21T09:18:14.9614183Z cvt.u64.u32 %rd755, %r6007; 2026-02-21T09:18:14.9614345Z shl.b64 %rd756, %rd755, 32; 2026-02-21T09:18:14.9614498Z or.b64 %rd757, %rd754, %rd756; 2026-02-21T09:18:14.9614659Z $L__tmp18: 2026-02-21T09:18:14.9614895Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9615172Z mov.b64 {%r6178, %r6179}, %rd757; 2026-02-21T09:18:14.9615354Z cvt.rn.bf16x2.f32 %r6180, %r6179, %r6178; 2026-02-21T09:18:14.9615526Z $L__tmp19: 2026-02-21T09:18:14.9615809Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9616130Z cvt.u64.u32 %rd758, %r6008; 2026-02-21T09:18:14.9616297Z cvt.u64.u32 %rd759, %r6009; 2026-02-21T09:18:14.9616456Z shl.b64 %rd760, %rd759, 32; 2026-02-21T09:18:14.9616637Z or.b64 %rd761, %rd758, %rd760; 2026-02-21T09:18:14.9616802Z $L__tmp20: 2026-02-21T09:18:14.9617034Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9617349Z mov.b64 {%r6181, %r6182}, %rd761; 2026-02-21T09:18:14.9617552Z cvt.rn.bf16x2.f32 %r6183, %r6182, %r6181; 2026-02-21T09:18:14.9617730Z $L__tmp21: 2026-02-21T09:18:14.9618033Z .loc 2 291 36 // standard.py:291:36 @[ 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9618362Z cvt.u64.u32 %rd762, %r6010; 2026-02-21T09:18:14.9618527Z cvt.u64.u32 %rd763, %r6011; 2026-02-21T09:18:14.9618680Z shl.b64 %rd764, %rd763, 32; 2026-02-21T09:18:14.9618845Z or.b64 %rd765, %rd762, %rd764; 2026-02-21T09:18:14.9618999Z $L__tmp22: 2026-02-21T09:18:14.9619235Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9619513Z mov.b64 {%r6184, %r6185}, %rd765; 2026-02-21T09:18:14.9619697Z cvt.rn.bf16x2.f32 %r6186, %r6185, %r6184; 2026-02-21T09:18:14.9619867Z $L__tmp23: 2026-02-21T09:18:14.9620147Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9620478Z cvt.u64.u32 %rd766, %r6012; 2026-02-21T09:18:14.9620631Z cvt.u64.u32 %rd767, %r6013; 2026-02-21T09:18:14.9620790Z shl.b64 %rd768, %rd767, 32; 2026-02-21T09:18:14.9620947Z or.b64 %rd769, %rd766, %rd768; 2026-02-21T09:18:14.9621109Z $L__tmp24: 2026-02-21T09:18:14.9621336Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9621657Z mov.b64 {%r6187, %r6188}, %rd769; 2026-02-21T09:18:14.9621837Z cvt.rn.bf16x2.f32 %r6189, %r6188, %r6187; 2026-02-21T09:18:14.9622005Z $L__tmp25: 2026-02-21T09:18:14.9622287Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9622604Z cvt.u64.u32 %rd770, %r6014; 2026-02-21T09:18:14.9622763Z cvt.u64.u32 %rd771, %r6015; 2026-02-21T09:18:14.9622913Z shl.b64 %rd772, %rd771, 32; 2026-02-21T09:18:14.9623074Z or.b64 %rd773, %rd770, %rd772; 2026-02-21T09:18:14.9623225Z $L__tmp26: 2026-02-21T09:18:14.9623464Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9623752Z mov.b64 {%r6190, %r6191}, %rd773; 2026-02-21T09:18:14.9623926Z cvt.rn.bf16x2.f32 %r6192, %r6191, %r6190; 2026-02-21T09:18:14.9624121Z $L__tmp27: 
2026-02-21T09:18:14.9624414Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9624757Z cvt.u64.u32 %rd774, %r6016; 2026-02-21T09:18:14.9624915Z cvt.u64.u32 %rd775, %r6017; 2026-02-21T09:18:14.9625083Z shl.b64 %rd776, %rd775, 32; 2026-02-21T09:18:14.9625243Z or.b64 %rd777, %rd774, %rd776; 2026-02-21T09:18:14.9625410Z $L__tmp28: 2026-02-21T09:18:14.9625662Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9625954Z mov.b64 {%r6193, %r6194}, %rd777; 2026-02-21T09:18:14.9626144Z cvt.rn.bf16x2.f32 %r6195, %r6194, %r6193; 2026-02-21T09:18:14.9626322Z $L__tmp29: 2026-02-21T09:18:14.9626620Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9626957Z cvt.u64.u32 %rd778, %r6018; 2026-02-21T09:18:14.9627134Z cvt.u64.u32 %rd779, %r6019; 2026-02-21T09:18:14.9627302Z shl.b64 %rd780, %rd779, 32; 2026-02-21T09:18:14.9627464Z or.b64 %rd781, %rd778, %rd780; 2026-02-21T09:18:14.9627629Z $L__tmp30: 2026-02-21T09:18:14.9627872Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9628169Z mov.b64 {%r6196, %r6197}, %rd781; 2026-02-21T09:18:14.9628350Z cvt.rn.bf16x2.f32 %r6198, %r6197, %r6196; 2026-02-21T09:18:14.9628582Z $L__tmp31: 2026-02-21T09:18:14.9628874Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9629218Z cvt.u64.u32 %rd782, %r6020; 2026-02-21T09:18:14.9629417Z cvt.u64.u32 %rd783, %r6021; 2026-02-21T09:18:14.9629611Z shl.b64 %rd784, %rd783, 32; 2026-02-21T09:18:14.9629784Z or.b64 %rd785, %rd782, %rd784; 2026-02-21T09:18:14.9629980Z $L__tmp32: 2026-02-21T09:18:14.9630237Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9630532Z mov.b64 {%r6199, %r6200}, %rd785; 2026-02-21T09:18:14.9630722Z cvt.rn.bf16x2.f32 %r6201, 
%r6200, %r6199; 2026-02-21T09:18:14.9630901Z $L__tmp33: 2026-02-21T09:18:14.9631199Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9631583Z cvt.u64.u32 %rd786, %r6023; 2026-02-21T09:18:14.9631749Z cvt.u64.u32 %rd787, %r6024; 2026-02-21T09:18:14.9631915Z shl.b64 %rd788, %rd787, 32; 2026-02-21T09:18:14.9632075Z or.b64 %rd789, %rd786, %rd788; 2026-02-21T09:18:14.9632244Z $L__tmp34: 2026-02-21T09:18:14.9632486Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9632781Z mov.b64 {%r6202, %r6203}, %rd789; 2026-02-21T09:18:14.9632973Z cvt.rn.bf16x2.f32 %r6204, %r6203, %r6202; 2026-02-21T09:18:14.9633150Z $L__tmp35: 2026-02-21T09:18:14.9633430Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9633747Z cvt.u64.u32 %rd790, %r6025; 2026-02-21T09:18:14.9633909Z cvt.u64.u32 %rd791, %r6026; 2026-02-21T09:18:14.9634064Z shl.b64 %rd792, %rd791, 32; 2026-02-21T09:18:14.9634224Z or.b64 %rd793, %rd790, %rd792; 2026-02-21T09:18:14.9634376Z $L__tmp36: 2026-02-21T09:18:14.9634611Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9634892Z mov.b64 {%r6205, %r6206}, %rd793; 2026-02-21T09:18:14.9635066Z cvt.rn.bf16x2.f32 %r6207, %r6206, %r6205; 2026-02-21T09:18:14.9635241Z $L__tmp37: 2026-02-21T09:18:14.9635513Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9635841Z cvt.u64.u32 %rd794, %r6027; 2026-02-21T09:18:14.9635996Z cvt.u64.u32 %rd795, %r6028; 2026-02-21T09:18:14.9636155Z shl.b64 %rd796, %rd795, 32; 2026-02-21T09:18:14.9636311Z or.b64 %rd797, %rd794, %rd796; 2026-02-21T09:18:14.9636472Z $L__tmp38: 2026-02-21T09:18:14.9636705Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9636984Z mov.b64 {%r6208, %r6209}, %rd797; 
2026-02-21T09:18:14.9637167Z cvt.rn.bf16x2.f32 %r6210, %r6209, %r6208; 2026-02-21T09:18:14.9637338Z $L__tmp39: 2026-02-21T09:18:14.9637624Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9637951Z cvt.u64.u32 %rd798, %r6029; 2026-02-21T09:18:14.9638112Z cvt.u64.u32 %rd799, %r6030; 2026-02-21T09:18:14.9638265Z shl.b64 %rd800, %rd799, 32; 2026-02-21T09:18:14.9638427Z or.b64 %rd801, %rd798, %rd800; 2026-02-21T09:18:14.9638586Z $L__tmp40: 2026-02-21T09:18:14.9638816Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9639099Z mov.b64 {%r6211, %r6212}, %rd801; 2026-02-21T09:18:14.9639274Z cvt.rn.bf16x2.f32 %r6213, %r6212, %r6211; 2026-02-21T09:18:14.9639452Z $L__tmp41: 2026-02-21T09:18:14.9639721Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9640048Z cvt.u64.u32 %rd802, %r6031; 2026-02-21T09:18:14.9640207Z cvt.u64.u32 %rd803, %r6032; 2026-02-21T09:18:14.9640358Z shl.b64 %rd804, %rd803, 32; 2026-02-21T09:18:14.9640548Z or.b64 %rd805, %rd802, %rd804; 2026-02-21T09:18:14.9640704Z $L__tmp42: 2026-02-21T09:18:14.9640941Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9641242Z mov.b64 {%r6214, %r6215}, %rd805; 2026-02-21T09:18:14.9641454Z cvt.rn.bf16x2.f32 %r6216, %r6215, %r6214; 2026-02-21T09:18:14.9641654Z $L__tmp43: 2026-02-21T09:18:14.9641960Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9642290Z cvt.u64.u32 %rd806, %r6033; 2026-02-21T09:18:14.9642444Z cvt.u64.u32 %rd807, %r6034; 2026-02-21T09:18:14.9642603Z shl.b64 %rd808, %rd807, 32; 2026-02-21T09:18:14.9642756Z or.b64 %rd809, %rd806, %rd808; 2026-02-21T09:18:14.9642915Z $L__tmp44: 2026-02-21T09:18:14.9643140Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 
2026-02-21T09:18:14.9643425Z mov.b64 {%r6217, %r6218}, %rd809; 2026-02-21T09:18:14.9643600Z cvt.rn.bf16x2.f32 %r6219, %r6218, %r6217; 2026-02-21T09:18:14.9643779Z $L__tmp45: 2026-02-21T09:18:14.9644054Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9644369Z cvt.u64.u32 %rd810, %r6035; 2026-02-21T09:18:14.9644529Z cvt.u64.u32 %rd811, %r6036; 2026-02-21T09:18:14.9644682Z shl.b64 %rd812, %rd811, 32; 2026-02-21T09:18:14.9644843Z or.b64 %rd813, %rd810, %rd812; 2026-02-21T09:18:14.9644993Z $L__tmp46: 2026-02-21T09:18:14.9645227Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9645511Z mov.b64 {%r6220, %r6221}, %rd813; 2026-02-21T09:18:14.9645683Z cvt.rn.bf16x2.f32 %r6222, %r6221, %r6220; 2026-02-21T09:18:14.9645862Z $L__tmp47: 2026-02-21T09:18:14.9646137Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9646464Z cvt.u64.u32 %rd814, %r6037; 2026-02-21T09:18:14.9646618Z cvt.u64.u32 %rd815, %r6038; 2026-02-21T09:18:14.9646778Z shl.b64 %rd816, %rd815, 32; 2026-02-21T09:18:14.9646930Z or.b64 %rd817, %rd814, %rd816; 2026-02-21T09:18:14.9647092Z $L__tmp48: 2026-02-21T09:18:14.9647330Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9647608Z mov.b64 {%r6223, %r6224}, %rd817; 2026-02-21T09:18:14.9647792Z cvt.rn.bf16x2.f32 %r6225, %r6224, %r6223; 2026-02-21T09:18:14.9647963Z $L__tmp49: 2026-02-21T09:18:14.9648249Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9648572Z cvt.u64.u32 %rd818, %r6040; 2026-02-21T09:18:14.9648735Z cvt.u64.u32 %rd819, %r6041; 2026-02-21T09:18:14.9648888Z shl.b64 %rd820, %rd819, 32; 2026-02-21T09:18:14.9649052Z or.b64 %rd821, %rd818, %rd820; 2026-02-21T09:18:14.9649212Z $L__tmp50: 2026-02-21T09:18:14.9649440Z .loc 1 87 28 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9649722Z mov.b64 {%r6226, %r6227}, %rd821; 2026-02-21T09:18:14.9649895Z cvt.rn.bf16x2.f32 %r6228, %r6227, %r6226; 2026-02-21T09:18:14.9650072Z $L__tmp51: 2026-02-21T09:18:14.9650339Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9650663Z cvt.u64.u32 %rd822, %r6042; 2026-02-21T09:18:14.9650824Z cvt.u64.u32 %rd823, %r6043; 2026-02-21T09:18:14.9650975Z shl.b64 %rd824, %rd823, 32; 2026-02-21T09:18:14.9651136Z or.b64 %rd825, %rd822, %rd824; 2026-02-21T09:18:14.9651288Z $L__tmp52: 2026-02-21T09:18:14.9651524Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9651839Z mov.b64 {%r6229, %r6230}, %rd825; 2026-02-21T09:18:14.9652022Z cvt.rn.bf16x2.f32 %r6231, %r6230, %r6229; 2026-02-21T09:18:14.9652216Z $L__tmp53: 2026-02-21T09:18:14.9652499Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9652821Z cvt.u64.u32 %rd826, %r6044; 2026-02-21T09:18:14.9653001Z cvt.u64.u32 %rd827, %r6045; 2026-02-21T09:18:14.9653187Z shl.b64 %rd828, %rd827, 32; 2026-02-21T09:18:14.9653344Z or.b64 %rd829, %rd826, %rd828; 2026-02-21T09:18:14.9653532Z $L__tmp54: 2026-02-21T09:18:14.9653761Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9654053Z mov.b64 {%r6232, %r6233}, %rd829; 2026-02-21T09:18:14.9654225Z cvt.rn.bf16x2.f32 %r6234, %r6233, %r6232; 2026-02-21T09:18:14.9654405Z $L__tmp55: 2026-02-21T09:18:14.9654683Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9655005Z cvt.u64.u32 %rd830, %r6046; 2026-02-21T09:18:14.9655165Z cvt.u64.u32 %rd831, %r6047; 2026-02-21T09:18:14.9655317Z shl.b64 %rd832, %rd831, 32; 2026-02-21T09:18:14.9655476Z or.b64 %rd833, %rd830, %rd832; 2026-02-21T09:18:14.9655629Z 
$L__tmp56: 2026-02-21T09:18:14.9655864Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9656147Z mov.b64 {%r6235, %r6236}, %rd833; 2026-02-21T09:18:14.9656327Z cvt.rn.bf16x2.f32 %r6237, %r6236, %r6235; 2026-02-21T09:18:14.9656506Z $L__tmp57: 2026-02-21T09:18:14.9656779Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9657109Z cvt.u64.u32 %rd834, %r6048; 2026-02-21T09:18:14.9657260Z cvt.u64.u32 %rd835, %r6049; 2026-02-21T09:18:14.9657417Z shl.b64 %rd836, %rd835, 32; 2026-02-21T09:18:14.9657574Z or.b64 %rd837, %rd834, %rd836; 2026-02-21T09:18:14.9657735Z $L__tmp58: 2026-02-21T09:18:14.9657967Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9658251Z mov.b64 {%r6238, %r6239}, %rd837; 2026-02-21T09:18:14.9658434Z cvt.rn.bf16x2.f32 %r6240, %r6239, %r6238; 2026-02-21T09:18:14.9658606Z $L__tmp59: 2026-02-21T09:18:14.9658889Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9659209Z cvt.u64.u32 %rd838, %r6050; 2026-02-21T09:18:14.9659375Z cvt.u64.u32 %rd839, %r6051; 2026-02-21T09:18:14.9659526Z shl.b64 %rd840, %rd839, 32; 2026-02-21T09:18:14.9659689Z or.b64 %rd841, %rd838, %rd840; 2026-02-21T09:18:14.9659849Z $L__tmp60: 2026-02-21T09:18:14.9660080Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9660368Z mov.b64 {%r6241, %r6242}, %rd841; 2026-02-21T09:18:14.9660544Z cvt.rn.bf16x2.f32 %r6243, %r6242, %r6241; 2026-02-21T09:18:14.9660721Z $L__tmp61: 2026-02-21T09:18:14.9660993Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9661320Z cvt.u64.u32 %rd842, %r6052; 2026-02-21T09:18:14.9661473Z cvt.u64.u32 %rd843, %r6053; 2026-02-21T09:18:14.9661660Z shl.b64 %rd844, %rd843, 32; 2026-02-21T09:18:14.9661825Z or.b64 
%rd845, %rd842, %rd844; 2026-02-21T09:18:14.9661981Z $L__tmp62: 2026-02-21T09:18:14.9662220Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9662501Z mov.b64 {%r6244, %r6245}, %rd845; 2026-02-21T09:18:14.9662682Z cvt.rn.bf16x2.f32 %r6246, %r6245, %r6244; 2026-02-21T09:18:14.9662852Z $L__tmp63: 2026-02-21T09:18:14.9663131Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9663456Z cvt.u64.u32 %rd846, %r6054; 2026-02-21T09:18:14.9663608Z cvt.u64.u32 %rd847, %r6055; 2026-02-21T09:18:14.9663767Z shl.b64 %rd848, %rd847, 32; 2026-02-21T09:18:14.9663951Z or.b64 %rd849, %rd846, %rd848; 2026-02-21T09:18:14.9664112Z $L__tmp64: 2026-02-21T09:18:14.9664338Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9664648Z mov.b64 {%r6247, %r6248}, %rd849; 2026-02-21T09:18:14.9664844Z cvt.rn.bf16x2.f32 %r6249, %r6248, %r6247; 2026-02-21T09:18:14.9665135Z .loc 1 88 81 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:81 2026-02-21T09:18:14.9665455Z bar.sync 0; 2026-02-21T09:18:14.9665626Z st.shared.v4.b32 [%r27], {%r6156, %r6168, %r6180, %r6192}; 2026-02-21T09:18:14.9665874Z st.shared.v4.b32 [%r28], {%r6204, %r6216, %r6228, %r6240}; 2026-02-21T09:18:14.9666071Z bar.sync 0; 2026-02-21T09:18:14.9666207Z // begin inline asm 2026-02-21T09:18:14.9666446Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r6057, %r6058, %r6059, %r6060}, [%r1540]; 2026-02-21T09:18:14.9666722Z // end inline asm 2026-02-21T09:18:14.9666858Z // begin inline asm 2026-02-21T09:18:14.9667101Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r6062, %r6063, %r6064, %r6065}, [%r1545]; 2026-02-21T09:18:14.9667382Z // end inline asm 2026-02-21T09:18:14.9667524Z bar.sync 0; 2026-02-21T09:18:14.9667709Z st.shared.v4.b32 [%r27], {%r6159, %r6171, %r6183, %r6195}; 2026-02-21T09:18:14.9667956Z st.shared.v4.b32 [%r28], {%r6207, %r6219, %r6231, %r6243}; 
2026-02-21T09:18:14.9668168Z bar.sync 0; 2026-02-21T09:18:14.9668304Z // begin inline asm 2026-02-21T09:18:14.9668551Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r6067, %r6068, %r6069, %r6070}, [%r1540]; 2026-02-21T09:18:14.9668826Z // end inline asm 2026-02-21T09:18:14.9668976Z // begin inline asm 2026-02-21T09:18:14.9669220Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r6072, %r6073, %r6074, %r6075}, [%r1545]; 2026-02-21T09:18:14.9669492Z // end inline asm 2026-02-21T09:18:14.9669642Z bar.sync 0; 2026-02-21T09:18:14.9692198Z st.shared.v4.b32 [%r27], {%r6162, %r6174, %r6186, %r6198}; 2026-02-21T09:18:14.9692475Z st.shared.v4.b32 [%r28], {%r6210, %r6222, %r6234, %r6246}; 2026-02-21T09:18:14.9692689Z bar.sync 0; 2026-02-21T09:18:14.9692839Z // begin inline asm 2026-02-21T09:18:14.9693092Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r6077, %r6078, %r6079, %r6080}, [%r1540]; 2026-02-21T09:18:14.9693382Z // end inline asm 2026-02-21T09:18:14.9693533Z // begin inline asm 2026-02-21T09:18:14.9693780Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r6082, %r6083, %r6084, %r6085}, [%r1545]; 2026-02-21T09:18:14.9694052Z // end inline asm 2026-02-21T09:18:14.9694192Z bar.sync 0; 2026-02-21T09:18:14.9694362Z st.shared.v4.b32 [%r27], {%r6165, %r6177, %r6189, %r6201}; 2026-02-21T09:18:14.9694606Z st.shared.v4.b32 [%r28], {%r6213, %r6225, %r6237, %r6249}; 2026-02-21T09:18:14.9694807Z bar.sync 0; 2026-02-21T09:18:14.9694935Z // begin inline asm 2026-02-21T09:18:14.9695177Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r6087, %r6088, %r6089, %r6090}, [%r1540]; 2026-02-21T09:18:14.9695436Z // end inline asm 2026-02-21T09:18:14.9695579Z // begin inline asm 2026-02-21T09:18:14.9695807Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r6092, %r6093, %r6094, %r6095}, [%r1545]; 2026-02-21T09:18:14.9696072Z // end inline asm 2026-02-21T09:18:14.9696206Z // begin inline asm 2026-02-21T09:18:14.9696402Z st.global.v4.b32 [ %rd714 + 0 ], { %r6057, %r6067, %r6077, %r6087 }; 2026-02-21T09:18:14.9696622Z // end 
inline asm 2026-02-21T09:18:14.9696756Z // begin inline asm 2026-02-21T09:18:14.9696948Z st.global.v4.b32 [ %rd715 + 0 ], { %r6058, %r6068, %r6078, %r6088 }; 2026-02-21T09:18:14.9697159Z // end inline asm 2026-02-21T09:18:14.9697298Z // begin inline asm 2026-02-21T09:18:14.9697480Z st.global.v4.b32 [ %rd716 + 0 ], { %r6059, %r6069, %r6079, %r6089 }; 2026-02-21T09:18:14.9697695Z // end inline asm 2026-02-21T09:18:14.9697828Z // begin inline asm 2026-02-21T09:18:14.9698012Z st.global.v4.b32 [ %rd717 + 0 ], { %r6060, %r6070, %r6080, %r6090 }; 2026-02-21T09:18:14.9698228Z // end inline asm 2026-02-21T09:18:14.9698361Z // begin inline asm 2026-02-21T09:18:14.9698642Z st.global.v4.b32 [ %rd718 + 0 ], { %r6062, %r6072, %r6082, %r6092 }; 2026-02-21T09:18:14.9698857Z // end inline asm 2026-02-21T09:18:14.9699001Z // begin inline asm 2026-02-21T09:18:14.9699224Z st.global.v4.b32 [ %rd719 + 0 ], { %r6063, %r6073, %r6083, %r6093 }; 2026-02-21T09:18:14.9699440Z // end inline asm 2026-02-21T09:18:14.9699611Z // begin inline asm 2026-02-21T09:18:14.9699831Z st.global.v4.b32 [ %rd720 + 0 ], { %r6064, %r6074, %r6084, %r6094 }; 2026-02-21T09:18:14.9700051Z // end inline asm 2026-02-21T09:18:14.9701529Z // begin inline asm 2026-02-21T09:18:14.9702710Z st.global.v4.b32 [ %rd721 + 0 ], { %r6065, %r6075, %r6085, %r6095 }; 2026-02-21T09:18:14.9702948Z // end inline asm 2026-02-21T09:18:14.9703215Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:14.9703533Z add.s32 %r7842, %r7842, 9472; 2026-02-21T09:18:14.9703713Z setp.lt.s32 %p309, %r7842, %r7847; 2026-02-21T09:18:14.9703907Z @%p309 bra $L__BB0_2; 2026-02-21T09:18:14.9704065Z bra.uni $L__BB0_43; 2026-02-21T09:18:14.9704267Z $L__BB0_2: // =>This Loop Header: Depth=1 2026-02-21T09:18:14.9704508Z // Child Loop BB0_3 Depth 2 2026-02-21T09:18:14.9704741Z // Child Loop BB0_13 Depth 2 2026-02-21T09:18:14.9704973Z // Child Loop BB0_23 Depth 2 2026-02-21T09:18:14.9705192Z // Child Loop BB0_33 
Depth 2 2026-02-21T09:18:14.9705527Z .loc 1 25 35 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:25:35 2026-02-21T09:18:14.9705813Z shr.s32 %r289, %r7842, 31; 2026-02-21T09:18:14.9705983Z shr.u32 %r290, %r289, 22; 2026-02-21T09:18:14.9706141Z add.s32 %r291, %r7842, %r290; 2026-02-21T09:18:14.9706308Z shr.s32 %r292, %r291, 10; 2026-02-21T09:18:14.9706563Z .loc 1 26 33 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:26:33 2026-02-21T09:18:14.9706848Z shl.b32 %r293, %r292, 4; 2026-02-21T09:18:14.9707095Z .loc 1 27 39 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:27:39 2026-02-21T09:18:14.9707378Z sub.s32 %r294, 64, %r293; 2026-02-21T09:18:14.9707628Z .loc 1 27 52 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:27:52 2026-02-21T09:18:14.9707898Z min.s32 %r295, %r294, 16; 2026-02-21T09:18:14.9708147Z .loc 1 28 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:45 2026-02-21T09:18:14.9708421Z and.b32 %r296, %r291, -1024; 2026-02-21T09:18:14.9708589Z sub.s32 %r297, %r7842, %r296; 2026-02-21T09:18:14.9708844Z .loc 1 29 51 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:29:51 2026-02-21T09:18:14.9709119Z div.s32 %r298, %r297, %r295; 2026-02-21T09:18:14.9709374Z .loc 1 28 64 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:64 2026-02-21T09:18:14.9709650Z mul.lo.s32 %r299, %r298, %r295; 2026-02-21T09:18:14.9709822Z sub.s32 %r300, %r297, %r299; 2026-02-21T09:18:14.9710069Z .loc 1 28 30 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:30 2026-02-21T09:18:14.9710350Z add.s32 %r301, %r300, %r293; 2026-02-21T09:18:14.9710600Z .loc 1 30 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:30:27 2026-02-21T09:18:14.9710877Z shl.b32 %r105, %r301, 6; 2026-02-21T09:18:14.9711145Z .loc 1 32 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:32:27 2026-02-21T09:18:14.9711431Z shl.b32 %r106, %r298, 7; 2026-02-21T09:18:14.9711627Z $L__tmp65: 2026-02-21T09:18:14.9711928Z 
.loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9712298Z shfl.sync.idx.b32 %r107, %r5, 0, 31, -1; 2026-02-21T09:18:14.9712491Z shl.b32 %r302, %r107, 21; 2026-02-21T09:18:14.9712664Z and.b32 %r303, %r302, 6291456; 2026-02-21T09:18:14.9712847Z add.s32 %r4673, %r303, %r7822; 2026-02-21T09:18:14.9713019Z mov.pred %p9, -1; 2026-02-21T09:18:14.9713216Z mov.b32 %r222, 0; 2026-02-21T09:18:14.9713363Z // begin inline asm 2026-02-21T09:18:14.9713770Z @%p9 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 0], 64, {%r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222}; 2026-02-21T09:18:14.9714224Z // end inline asm 2026-02-21T09:18:14.9714458Z // begin inline asm 2026-02-21T09:18:14.9714899Z @%p9 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 16], 64, {%r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222}; 2026-02-21T09:18:14.9715317Z // end inline asm 2026-02-21T09:18:14.9715469Z // begin inline asm 2026-02-21T09:18:14.9715845Z @%p9 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 32], 64, {%r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222}; 2026-02-21T09:18:14.9716262Z // end inline asm 2026-02-21T09:18:14.9716411Z // begin inline asm 2026-02-21T09:18:14.9716804Z @%p9 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 48], 64, {%r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222, %r222}; 2026-02-21T09:18:14.9717217Z // end inline asm 2026-02-21T09:18:14.9717360Z // begin inline asm 2026-02-21T09:18:14.9717536Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9717710Z // end inline asm 2026-02-21T09:18:14.9717857Z bar.sync 0; 2026-02-21T09:18:14.9718002Z add.s32 %r625, %r303, %r6675; 2026-02-21T09:18:14.9718179Z add.s32 %r5556, %r303, %r7254; 
2026-02-21T09:18:14.9718346Z add.s32 %r915, %r303, %r6964; 2026-02-21T09:18:14.9718521Z add.s32 %r5846, %r303, %r7543; 2026-02-21T09:18:14.9718703Z add.s32 %r1205, %r303, %r7253; 2026-02-21T09:18:14.9718854Z $L__tmp66: 2026-02-21T09:18:14.9719109Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:14.9719401Z add.s32 %r7843, %r61, %r106; 2026-02-21T09:18:14.9719566Z or.b32 %r304, %r7841, %r105; 2026-02-21T09:18:14.9719725Z shl.b32 %r305, %r304, 10; 2026-02-21T09:18:14.9719889Z mul.wide.s32 %rd12, %r305, 2; 2026-02-21T09:18:14.9720046Z shl.b32 %r306, %r301, 16; 2026-02-21T09:18:14.9720207Z or.b32 %r307, %r63, %r306; 2026-02-21T09:18:14.9720364Z mul.wide.s32 %rd13, %r307, 2; 2026-02-21T09:18:14.9720527Z mov.pred %p389, 0; 2026-02-21T09:18:14.9720678Z mov.b64 %rd1049, -64; 2026-02-21T09:18:14.9720828Z mov.b64 %rd1048, %rd5; 2026-02-21T09:18:14.9720985Z bra.uni $L__BB0_3; 2026-02-21T09:18:14.9721172Z $L__BB0_11: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:18:14.9721393Z $L__tmp67: 2026-02-21T09:18:14.9721710Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9722048Z // begin inline asm 2026-02-21T09:18:14.9722182Z 2026-02-21T09:18:14.9722303Z { 2026-02-21T09:18:14.9722437Z .reg .pred complete; 2026-02-21T09:18:14.9722584Z waitLoop: 2026-02-21T09:18:14.9722787Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r1175; 2026-02-21T09:18:14.9723032Z @!complete bra.uni waitLoop; 2026-02-21T09:18:14.9723196Z } 2026-02-21T09:18:14.9723263Z 2026-02-21T09:18:14.9723320Z // end inline asm 2026-02-21T09:18:14.9723463Z bar.sync 0; 2026-02-21T09:18:14.9723594Z // begin inline asm 2026-02-21T09:18:14.9723779Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:14.9723984Z // end inline asm 2026-02-21T09:18:14.9724118Z $L__tmp68: 2026-02-21T09:18:14.9724370Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 
2026-02-21T09:18:14.9724670Z add.s64 %rd1049, %rd1049, 64; 2026-02-21T09:18:14.9724844Z add.s32 %r7843, %r7843, 524288; 2026-02-21T09:18:14.9725015Z add.s64 %rd1048, %rd1048, 256; 2026-02-21T09:18:14.9725187Z setp.lt.u64 %p83, %rd1049, 448; 2026-02-21T09:18:14.9725356Z @%p83 bra $L__BB0_3; 2026-02-21T09:18:14.9725507Z bra.uni $L__BB0_12; 2026-02-21T09:18:14.9725734Z $L__BB0_3: // Parent Loop BB0_2 Depth=1 2026-02-21T09:18:14.9725983Z // => This Inner Loop Header: Depth=2 2026-02-21T09:18:14.9726334Z .loc 1 0 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:0:124 2026-02-21T09:18:14.9726656Z setp.ne.b32 %p20, %r107, 0; 2026-02-21T09:18:14.9726957Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:14.9727239Z add.s64 %rd64, %rd1048, %rd13; 2026-02-21T09:18:14.9727506Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:14.9727788Z add.s64 %rd67, %rd1048, %rd12; 2026-02-21T09:18:14.9727946Z // begin inline asm 2026-02-21T09:18:14.9728097Z mov.u64 %rd63, 0x0; 2026-02-21T09:18:14.9728292Z createpolicy.fractional.L2::evict_first.b64 %rd63, 1.0; 2026-02-21T09:18:14.9728513Z // end inline asm 2026-02-21T09:18:14.9728652Z // begin inline asm 2026-02-21T09:18:14.9728796Z mov.u32 %r308, 0x0; 2026-02-21T09:18:14.9728931Z mov.u32 %r309, 0x0; 2026-02-21T09:18:14.9729072Z mov.u32 %r310, 0x0; 2026-02-21T09:18:14.9729208Z mov.u32 %r311, 0x0; 2026-02-21T09:18:14.9729460Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r308, %r309, %r310, %r311 }, [ %rd64 + 0 ], %rd63; 2026-02-21T09:18:14.9729748Z // end inline asm 2026-02-21T09:18:14.9729881Z // begin inline asm 2026-02-21T09:18:14.9730025Z mov.u64 %rd66, 0x0; 2026-02-21T09:18:14.9730210Z createpolicy.fractional.L2::evict_first.b64 %rd66, 1.0; 2026-02-21T09:18:14.9730425Z // end inline asm 2026-02-21T09:18:14.9730556Z // begin inline asm 2026-02-21T09:18:14.9730697Z mov.u32 %r312, 0x0; 2026-02-21T09:18:14.9730831Z mov.u32 %r313, 
0x0; 2026-02-21T09:18:14.9730972Z mov.u32 %r314, 0x0; 2026-02-21T09:18:14.9731111Z mov.u32 %r315, 0x0; 2026-02-21T09:18:14.9731349Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r312, %r313, %r314, %r315 }, [ %rd67 + 0 ], %rd66; 2026-02-21T09:18:14.9731664Z // end inline asm 2026-02-21T09:18:14.9731794Z $L__tmp69: 2026-02-21T09:18:14.9732083Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9732406Z bar.sync 0; 2026-02-21T09:18:14.9732581Z st.shared.v4.b32 [%r26], {%r308, %r309, %r310, %r311}; 2026-02-21T09:18:14.9732825Z st.shared.v4.b32 [%r26+512], {%r312, %r313, %r314, %r315}; 2026-02-21T09:18:14.9733028Z bar.sync 0; 2026-02-21T09:18:14.9733196Z ld.shared.v4.b32 {%r474, %r475, %r476, %r477}, [%r27]; 2026-02-21T09:18:14.9733397Z mov.b32 {%rs1, %rs2}, %r477; 2026-02-21T09:18:14.9733566Z mov.b32 {%rs3, %rs4}, %r476; 2026-02-21T09:18:14.9733723Z mov.b32 {%rs5, %rs6}, %r475; 2026-02-21T09:18:14.9733886Z mov.b32 {%rs7, %rs8}, %r474; 2026-02-21T09:18:14.9734070Z ld.shared.v4.b32 {%r478, %r479, %r480, %r481}, [%r28]; 2026-02-21T09:18:14.9734277Z mov.b32 {%rs9, %rs10}, %r481; 2026-02-21T09:18:14.9734439Z mov.b32 {%rs11, %rs12}, %r480; 2026-02-21T09:18:14.9734614Z mov.b32 {%rs13, %rs14}, %r479; 2026-02-21T09:18:14.9734783Z mov.b32 {%rs15, %rs16}, %r478; 2026-02-21T09:18:14.9734937Z $L__tmp70: 2026-02-21T09:18:14.9735184Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:14.9735470Z cvt.f32.bf16 %r457, %rs7; 2026-02-21T09:18:14.9735636Z cvt.f32.bf16 %r458, %rs8; 2026-02-21T09:18:14.9735789Z cvt.f32.bf16 %r459, %rs5; 2026-02-21T09:18:14.9735947Z cvt.f32.bf16 %r460, %rs6; 2026-02-21T09:18:14.9736096Z cvt.f32.bf16 %r461, %rs3; 2026-02-21T09:18:14.9736258Z cvt.f32.bf16 %r462, %rs4; 2026-02-21T09:18:14.9736415Z cvt.f32.bf16 %r463, %rs1; 2026-02-21T09:18:14.9736564Z cvt.f32.bf16 %r464, %rs2; 2026-02-21T09:18:14.9736721Z cvt.f32.bf16 %r465, %rs15; 
2026-02-21T09:18:14.9736878Z cvt.f32.bf16 %r466, %rs16; 2026-02-21T09:18:14.9737036Z cvt.f32.bf16 %r467, %rs13; 2026-02-21T09:18:14.9737186Z cvt.f32.bf16 %r468, %rs14; 2026-02-21T09:18:14.9737396Z cvt.f32.bf16 %r469, %rs11; 2026-02-21T09:18:14.9737544Z cvt.f32.bf16 %r470, %rs12; 2026-02-21T09:18:14.9737700Z cvt.f32.bf16 %r471, %rs9; 2026-02-21T09:18:14.9737856Z cvt.f32.bf16 %r472, %rs10; 2026-02-21T09:18:14.9738138Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:14.9738441Z cvt.s64.s32 %rd72, %r7843; 2026-02-21T09:18:14.9738597Z add.s64 %rd70, %rd56, %rd72; 2026-02-21T09:18:14.9738883Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:14.9739158Z // begin inline asm 2026-02-21T09:18:14.9739307Z mov.u64 %rd69, 0x0; 2026-02-21T09:18:14.9739497Z createpolicy.fractional.L2::evict_first.b64 %rd69, 1.0; 2026-02-21T09:18:14.9739713Z // end inline asm 2026-02-21T09:18:14.9739856Z // begin inline asm 2026-02-21T09:18:14.9739995Z mov.u32 %r316, 0x0; 2026-02-21T09:18:14.9740139Z mov.u32 %r317, 0x0; 2026-02-21T09:18:14.9740270Z mov.u32 %r318, 0x0; 2026-02-21T09:18:14.9740411Z mov.u32 %r319, 0x0; 2026-02-21T09:18:14.9740651Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r316, %r317, %r318, %r319 }, [ %rd70 + 0 ], %rd69; 2026-02-21T09:18:14.9740929Z // end inline asm 2026-02-21T09:18:14.9741073Z prmt.b32 %r482, %r316, 0, 0x8880U; 2026-02-21T09:18:14.9741250Z cvt.u16.u32 %rs17, %r482; 2026-02-21T09:18:14.9741414Z prmt.b32 %r483, %r316, 0, 0x7770U; 2026-02-21T09:18:14.9741605Z cvt.u16.u32 %rs18, %r483; 2026-02-21T09:18:14.9741765Z prmt.b32 %r484, %r316, 0, 0x9991U; 2026-02-21T09:18:14.9741931Z cvt.u16.u32 %rs19, %r484; 2026-02-21T09:18:14.9742089Z prmt.b32 %r485, %r316, 0, 0x7771U; 2026-02-21T09:18:14.9742251Z cvt.u16.u32 %rs20, %r485; 2026-02-21T09:18:14.9742408Z prmt.b32 %r486, %r316, 0, 0xaaa2U; 2026-02-21T09:18:14.9742568Z cvt.u16.u32 %rs21, %r486; 2026-02-21T09:18:14.9742723Z 
prmt.b32 %r487, %r316, 0, 0x7772U; 2026-02-21T09:18:14.9742891Z cvt.u16.u32 %rs22, %r487; 2026-02-21T09:18:14.9743040Z prmt.b32 %r488, %r316, 0, 0xbbb3U; 2026-02-21T09:18:14.9743208Z cvt.u16.u32 %rs23, %r488; 2026-02-21T09:18:14.9743359Z prmt.b32 %r489, %r316, 0, 0x7773U; 2026-02-21T09:18:14.9743529Z cvt.u16.u32 %rs24, %r489; 2026-02-21T09:18:14.9743683Z prmt.b32 %r490, %r317, 0, 0x8880U; 2026-02-21T09:18:14.9743853Z cvt.u16.u32 %rs25, %r490; 2026-02-21T09:18:14.9744003Z prmt.b32 %r491, %r317, 0, 0x7770U; 2026-02-21T09:18:14.9744172Z cvt.u16.u32 %rs26, %r491; 2026-02-21T09:18:14.9744323Z prmt.b32 %r492, %r317, 0, 0x9991U; 2026-02-21T09:18:14.9744494Z cvt.u16.u32 %rs27, %r492; 2026-02-21T09:18:14.9744656Z prmt.b32 %r493, %r317, 0, 0x7771U; 2026-02-21T09:18:14.9744818Z cvt.u16.u32 %rs28, %r493; 2026-02-21T09:18:14.9744981Z prmt.b32 %r494, %r317, 0, 0xaaa2U; 2026-02-21T09:18:14.9745142Z cvt.u16.u32 %rs29, %r494; 2026-02-21T09:18:14.9745298Z prmt.b32 %r495, %r317, 0, 0x7772U; 2026-02-21T09:18:14.9745458Z cvt.u16.u32 %rs30, %r495; 2026-02-21T09:18:14.9745613Z prmt.b32 %r496, %r317, 0, 0xbbb3U; 2026-02-21T09:18:14.9745774Z cvt.u16.u32 %rs31, %r496; 2026-02-21T09:18:14.9745929Z prmt.b32 %r497, %r317, 0, 0x7773U; 2026-02-21T09:18:14.9746091Z cvt.u16.u32 %rs32, %r497; 2026-02-21T09:18:14.9746249Z prmt.b32 %r498, %r318, 0, 0x8880U; 2026-02-21T09:18:14.9746420Z cvt.u16.u32 %rs33, %r498; 2026-02-21T09:18:14.9746569Z prmt.b32 %r499, %r318, 0, 0x7770U; 2026-02-21T09:18:14.9746740Z cvt.u16.u32 %rs34, %r499; 2026-02-21T09:18:14.9746891Z prmt.b32 %r500, %r318, 0, 0x9991U; 2026-02-21T09:18:14.9747061Z cvt.u16.u32 %rs35, %r500; 2026-02-21T09:18:14.9747211Z prmt.b32 %r501, %r318, 0, 0x7771U; 2026-02-21T09:18:14.9747379Z cvt.u16.u32 %rs36, %r501; 2026-02-21T09:18:14.9747531Z prmt.b32 %r502, %r318, 0, 0xaaa2U; 2026-02-21T09:18:14.9747700Z cvt.u16.u32 %rs37, %r502; 2026-02-21T09:18:14.9747855Z prmt.b32 %r503, %r318, 0, 0x7772U; 2026-02-21T09:18:14.9748015Z cvt.u16.u32 %rs38, %r503; 
2026-02-21T09:18:14.9748172Z prmt.b32 %r504, %r318, 0, 0xbbb3U; 2026-02-21T09:18:14.9748332Z cvt.u16.u32 %rs39, %r504; 2026-02-21T09:18:14.9748523Z prmt.b32 %r505, %r318, 0, 0x7773U; 2026-02-21T09:18:14.9748685Z cvt.u16.u32 %rs40, %r505; 2026-02-21T09:18:14.9748843Z prmt.b32 %r506, %r319, 0, 0x8880U; 2026-02-21T09:18:14.9749034Z cvt.u16.u32 %rs41, %r506; 2026-02-21T09:18:14.9749191Z prmt.b32 %r507, %r319, 0, 0x7770U; 2026-02-21T09:18:14.9749353Z cvt.u16.u32 %rs42, %r507; 2026-02-21T09:18:14.9749536Z prmt.b32 %r508, %r319, 0, 0x9991U; 2026-02-21T09:18:14.9749730Z cvt.u16.u32 %rs43, %r508; 2026-02-21T09:18:14.9749881Z prmt.b32 %r509, %r319, 0, 0x7771U; 2026-02-21T09:18:14.9750048Z cvt.u16.u32 %rs44, %r509; 2026-02-21T09:18:14.9750196Z prmt.b32 %r510, %r319, 0, 0xaaa2U; 2026-02-21T09:18:14.9750361Z cvt.u16.u32 %rs45, %r510; 2026-02-21T09:18:14.9750510Z prmt.b32 %r511, %r319, 0, 0x7772U; 2026-02-21T09:18:14.9750677Z cvt.u16.u32 %rs46, %r511; 2026-02-21T09:18:14.9750825Z prmt.b32 %r512, %r319, 0, 0xbbb3U; 2026-02-21T09:18:14.9750995Z cvt.u16.u32 %rs47, %r512; 2026-02-21T09:18:14.9751150Z prmt.b32 %r513, %r319, 0, 0x7773U; 2026-02-21T09:18:14.9751313Z cvt.u16.u32 %rs48, %r513; 2026-02-21T09:18:14.9751711Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:14.9751995Z shl.b16 %rs49, %rs18, 12; 2026-02-21T09:18:14.9752155Z shr.s16 %rs50, %rs49, 12; 2026-02-21T09:18:14.9752306Z shl.b16 %rs51, %rs20, 12; 2026-02-21T09:18:14.9752466Z shr.s16 %rs52, %rs51, 12; 2026-02-21T09:18:14.9752616Z shl.b16 %rs53, %rs22, 12; 2026-02-21T09:18:14.9752773Z shr.s16 %rs54, %rs53, 12; 2026-02-21T09:18:14.9752923Z shl.b16 %rs55, %rs24, 12; 2026-02-21T09:18:14.9753083Z shr.s16 %rs56, %rs55, 12; 2026-02-21T09:18:14.9753246Z shl.b16 %rs57, %rs26, 12; 2026-02-21T09:18:14.9753399Z shr.s16 %rs58, %rs57, 12; 2026-02-21T09:18:14.9753564Z shl.b16 %rs59, %rs28, 12; 2026-02-21T09:18:14.9753717Z shr.s16 %rs60, %rs59, 12; 2026-02-21T09:18:14.9753870Z shl.b16 
%rs61, %rs30, 12; 2026-02-21T09:18:14.9754015Z shr.s16 %rs62, %rs61, 12; 2026-02-21T09:18:14.9754170Z shl.b16 %rs63, %rs32, 12; 2026-02-21T09:18:14.9754318Z shr.s16 %rs64, %rs63, 12; 2026-02-21T09:18:14.9754479Z shl.b16 %rs65, %rs34, 12; 2026-02-21T09:18:14.9754638Z shr.s16 %rs66, %rs65, 12; 2026-02-21T09:18:14.9754795Z shl.b16 %rs67, %rs36, 12; 2026-02-21T09:18:14.9754954Z shr.s16 %rs68, %rs67, 12; 2026-02-21T09:18:14.9755111Z shl.b16 %rs69, %rs38, 12; 2026-02-21T09:18:14.9755270Z shr.s16 %rs70, %rs69, 12; 2026-02-21T09:18:14.9755424Z shl.b16 %rs71, %rs40, 12; 2026-02-21T09:18:14.9755585Z shr.s16 %rs72, %rs71, 12; 2026-02-21T09:18:14.9755737Z shl.b16 %rs73, %rs42, 12; 2026-02-21T09:18:14.9755897Z shr.s16 %rs74, %rs73, 12; 2026-02-21T09:18:14.9756053Z shl.b16 %rs75, %rs44, 12; 2026-02-21T09:18:14.9756215Z shr.s16 %rs76, %rs75, 12; 2026-02-21T09:18:14.9756373Z shl.b16 %rs77, %rs46, 12; 2026-02-21T09:18:14.9756526Z shr.s16 %rs78, %rs77, 12; 2026-02-21T09:18:14.9756684Z shl.b16 %rs79, %rs48, 12; 2026-02-21T09:18:14.9756836Z shr.s16 %rs80, %rs79, 12; 2026-02-21T09:18:14.9757108Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:14.9757400Z shr.u16 %rs81, %rs17, 4; 2026-02-21T09:18:14.9757565Z shr.u16 %rs82, %rs19, 4; 2026-02-21T09:18:14.9757723Z shr.u16 %rs83, %rs21, 4; 2026-02-21T09:18:14.9757884Z shr.u16 %rs84, %rs23, 4; 2026-02-21T09:18:14.9758042Z shr.u16 %rs85, %rs25, 4; 2026-02-21T09:18:14.9758201Z shr.u16 %rs86, %rs27, 4; 2026-02-21T09:18:14.9758359Z shr.u16 %rs87, %rs29, 4; 2026-02-21T09:18:14.9758512Z shr.u16 %rs88, %rs31, 4; 2026-02-21T09:18:14.9758670Z shr.u16 %rs89, %rs33, 4; 2026-02-21T09:18:14.9758821Z shr.u16 %rs90, %rs35, 4; 2026-02-21T09:18:14.9758977Z shr.u16 %rs91, %rs37, 4; 2026-02-21T09:18:14.9759127Z shr.u16 %rs92, %rs39, 4; 2026-02-21T09:18:14.9759284Z shr.u16 %rs93, %rs41, 4; 2026-02-21T09:18:14.9759437Z shr.u16 %rs94, %rs43, 4; 2026-02-21T09:18:14.9759597Z shr.u16 %rs95, %rs45, 4; 
2026-02-21T09:18:14.9759749Z shr.u16 %rs96, %rs47, 4; 2026-02-21T09:18:14.9760048Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:14.9760344Z bar.sync 0; 2026-02-21T09:18:14.9760493Z st.shared.b8 [%r29], %rs50; 2026-02-21T09:18:14.9760699Z st.shared.b8 [%r30], %rs52; 2026-02-21T09:18:14.9760868Z st.shared.b8 [%r31], %rs54; 2026-02-21T09:18:14.9761067Z st.shared.b8 [%r32], %rs56; 2026-02-21T09:18:14.9761238Z st.shared.b8 [%r33+512], %rs58; 2026-02-21T09:18:14.9761450Z st.shared.b8 [%r34+512], %rs60; 2026-02-21T09:18:14.9761686Z st.shared.b8 [%r35+512], %rs62; 2026-02-21T09:18:14.9761866Z st.shared.b8 [%r36+512], %rs64; 2026-02-21T09:18:14.9762050Z st.shared.b8 [%r37+1024], %rs66; 2026-02-21T09:18:14.9762239Z st.shared.b8 [%r38+1024], %rs68; 2026-02-21T09:18:14.9762413Z st.shared.b8 [%r39+1024], %rs70; 2026-02-21T09:18:14.9762575Z st.shared.b8 [%r40+1024], %rs72; 2026-02-21T09:18:14.9762743Z st.shared.b8 [%r41+1536], %rs74; 2026-02-21T09:18:14.9762903Z st.shared.b8 [%r42+1536], %rs76; 2026-02-21T09:18:14.9763069Z st.shared.b8 [%r43+1536], %rs78; 2026-02-21T09:18:14.9763227Z st.shared.b8 [%r44+1536], %rs80; 2026-02-21T09:18:14.9763390Z bar.sync 0; 2026-02-21T09:18:14.9763541Z ld.shared.b32 %r514, [%r45]; 2026-02-21T09:18:14.9763705Z ld.shared.b32 %r515, [%r46]; 2026-02-21T09:18:14.9763871Z ld.shared.b32 %r516, [%r47]; 2026-02-21T09:18:14.9764028Z ld.shared.b32 %r517, [%r48]; 2026-02-21T09:18:14.9764294Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:14.9764569Z bar.sync 0; 2026-02-21T09:18:14.9764714Z st.shared.b8 [%r29], %rs81; 2026-02-21T09:18:14.9764872Z st.shared.b8 [%r30], %rs82; 2026-02-21T09:18:14.9765036Z st.shared.b8 [%r31], %rs83; 2026-02-21T09:18:14.9765196Z st.shared.b8 [%r32], %rs84; 2026-02-21T09:18:14.9765355Z st.shared.b8 [%r33+512], %rs85; 2026-02-21T09:18:14.9765526Z st.shared.b8 [%r34+512], %rs86; 2026-02-21T09:18:14.9765690Z st.shared.b8 [%r35+512], 
%rs87; 2026-02-21T09:18:14.9765859Z st.shared.b8 [%r36+512], %rs88; 2026-02-21T09:18:14.9766018Z st.shared.b8 [%r37+1024], %rs89; 2026-02-21T09:18:14.9766189Z st.shared.b8 [%r38+1024], %rs90; 2026-02-21T09:18:14.9766353Z st.shared.b8 [%r39+1024], %rs91; 2026-02-21T09:18:14.9766519Z st.shared.b8 [%r40+1024], %rs92; 2026-02-21T09:18:14.9766681Z st.shared.b8 [%r41+1536], %rs93; 2026-02-21T09:18:14.9766846Z st.shared.b8 [%r42+1536], %rs94; 2026-02-21T09:18:14.9767011Z st.shared.b8 [%r43+1536], %rs95; 2026-02-21T09:18:14.9767169Z st.shared.b8 [%r44+1536], %rs96; 2026-02-21T09:18:14.9767330Z bar.sync 0; 2026-02-21T09:18:14.9767465Z ld.shared.b32 %r518, [%r45]; 2026-02-21T09:18:14.9767629Z ld.shared.b32 %r519, [%r46]; 2026-02-21T09:18:14.9767785Z ld.shared.b32 %r520, [%r47]; 2026-02-21T09:18:14.9767946Z ld.shared.b32 %r521, [%r48]; 2026-02-21T09:18:14.9768207Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:14.9768491Z cvt.s8.s32 %rs97, %r515; 2026-02-21T09:18:14.9768653Z cvt.rn.f32.s16 %r522, %rs97; 2026-02-21T09:18:14.9768809Z cvt.s8.s32 %rs98, %r514; 2026-02-21T09:18:14.9768968Z cvt.rn.f32.s16 %r523, %rs98; 2026-02-21T09:18:14.9769123Z cvt.s8.s32 %rs99, %r519; 2026-02-21T09:18:14.9769278Z cvt.rn.f32.s16 %r524, %rs99; 2026-02-21T09:18:14.9769432Z cvt.s8.s32 %rs100, %r518; 2026-02-21T09:18:14.9769595Z cvt.rn.f32.s16 %r525, %rs100; 2026-02-21T09:18:14.9769758Z cvt.s8.s32 %rs101, %r517; 2026-02-21T09:18:14.9769918Z cvt.rn.f32.s16 %r526, %rs101; 2026-02-21T09:18:14.9770075Z cvt.s8.s32 %rs102, %r516; 2026-02-21T09:18:14.9770237Z cvt.rn.f32.s16 %r527, %rs102; 2026-02-21T09:18:14.9770402Z cvt.s8.s32 %rs103, %r521; 2026-02-21T09:18:14.9770559Z cvt.rn.f32.s16 %r528, %rs103; 2026-02-21T09:18:14.9770723Z cvt.s8.s32 %rs104, %r520; 2026-02-21T09:18:14.9770874Z cvt.rn.f32.s16 %r529, %rs104; 2026-02-21T09:18:14.9771041Z prmt.b32 %r530, %r515, 0, 0x9991U; 2026-02-21T09:18:14.9771213Z cvt.u16.u32 %rs105, %r530; 2026-02-21T09:18:14.9771407Z 
cvt.rn.f32.s16 %r531, %rs105; 2026-02-21T09:18:14.9771600Z prmt.b32 %r532, %r514, 0, 0x9991U; 2026-02-21T09:18:14.9771776Z cvt.u16.u32 %rs106, %r532; 2026-02-21T09:18:14.9771969Z cvt.rn.f32.s16 %r533, %rs106; 2026-02-21T09:18:14.9772126Z prmt.b32 %r534, %r519, 0, 0x9991U; 2026-02-21T09:18:14.9772301Z cvt.u16.u32 %rs107, %r534; 2026-02-21T09:18:14.9772482Z cvt.rn.f32.s16 %r535, %rs107; 2026-02-21T09:18:14.9772688Z prmt.b32 %r536, %r518, 0, 0x9991U; 2026-02-21T09:18:14.9772855Z cvt.u16.u32 %rs108, %r536; 2026-02-21T09:18:14.9773016Z cvt.rn.f32.s16 %r537, %rs108; 2026-02-21T09:18:14.9773173Z prmt.b32 %r538, %r517, 0, 0x9991U; 2026-02-21T09:18:14.9773346Z cvt.u16.u32 %rs109, %r538; 2026-02-21T09:18:14.9773496Z cvt.rn.f32.s16 %r539, %rs109; 2026-02-21T09:18:14.9773660Z prmt.b32 %r540, %r516, 0, 0x9991U; 2026-02-21T09:18:14.9773830Z cvt.u16.u32 %rs110, %r540; 2026-02-21T09:18:14.9773981Z cvt.rn.f32.s16 %r541, %rs110; 2026-02-21T09:18:14.9774042Z prmt.b32 %r542, %r521, 0, 0x9991U; 2026-02-21T09:18:14.9774109Z cvt.u16.u32 %rs111, %r542; 2026-02-21T09:18:14.9774167Z cvt.rn.f32.s16 %r543, %rs111; 2026-02-21T09:18:14.9774226Z prmt.b32 %r544, %r520, 0, 0x9991U; 2026-02-21T09:18:14.9774291Z cvt.u16.u32 %rs112, %r544; 2026-02-21T09:18:14.9774349Z cvt.rn.f32.s16 %r545, %rs112; 2026-02-21T09:18:14.9774410Z prmt.b32 %r546, %r515, 0, 0xaaa2U; 2026-02-21T09:18:14.9774468Z cvt.u16.u32 %rs113, %r546; 2026-02-21T09:18:14.9774532Z cvt.rn.f32.s16 %r547, %rs113; 2026-02-21T09:18:14.9774592Z prmt.b32 %r548, %r514, 0, 0xaaa2U; 2026-02-21T09:18:14.9774648Z cvt.u16.u32 %rs114, %r548; 2026-02-21T09:18:14.9774714Z cvt.rn.f32.s16 %r549, %rs114; 2026-02-21T09:18:14.9774773Z prmt.b32 %r550, %r519, 0, 0xaaa2U; 2026-02-21T09:18:14.9774831Z cvt.u16.u32 %rs115, %r550; 2026-02-21T09:18:14.9774888Z cvt.rn.f32.s16 %r551, %rs115; 2026-02-21T09:18:14.9774954Z prmt.b32 %r552, %r518, 0, 0xaaa2U; 2026-02-21T09:18:14.9775011Z cvt.u16.u32 %rs116, %r552; 2026-02-21T09:18:14.9775072Z cvt.rn.f32.s16 %r553, %rs116; 
2026-02-21T09:18:14.9775137Z prmt.b32 %r554, %r517, 0, 0xaaa2U; 2026-02-21T09:18:14.9775194Z cvt.u16.u32 %rs117, %r554; 2026-02-21T09:18:14.9775251Z cvt.rn.f32.s16 %r555, %rs117; 2026-02-21T09:18:14.9775319Z prmt.b32 %r556, %r516, 0, 0xaaa2U; 2026-02-21T09:18:14.9775376Z cvt.u16.u32 %rs118, %r556; 2026-02-21T09:18:14.9775436Z cvt.rn.f32.s16 %r557, %rs118; 2026-02-21T09:18:14.9775497Z prmt.b32 %r558, %r521, 0, 0xaaa2U; 2026-02-21T09:18:14.9775562Z cvt.u16.u32 %rs119, %r558; 2026-02-21T09:18:14.9775620Z cvt.rn.f32.s16 %r559, %rs119; 2026-02-21T09:18:14.9775680Z prmt.b32 %r560, %r520, 0, 0xaaa2U; 2026-02-21T09:18:14.9775745Z cvt.u16.u32 %rs120, %r560; 2026-02-21T09:18:14.9775802Z cvt.rn.f32.s16 %r561, %rs120; 2026-02-21T09:18:14.9775861Z prmt.b32 %r562, %r515, 0, 0xbbb3U; 2026-02-21T09:18:14.9775918Z cvt.u16.u32 %rs121, %r562; 2026-02-21T09:18:14.9775983Z cvt.rn.f32.s16 %r563, %rs121; 2026-02-21T09:18:14.9776042Z prmt.b32 %r564, %r514, 0, 0xbbb3U; 2026-02-21T09:18:14.9776102Z cvt.u16.u32 %rs122, %r564; 2026-02-21T09:18:14.9776165Z cvt.rn.f32.s16 %r565, %rs122; 2026-02-21T09:18:14.9776223Z prmt.b32 %r566, %r519, 0, 0xbbb3U; 2026-02-21T09:18:14.9776282Z cvt.u16.u32 %rs123, %r566; 2026-02-21T09:18:14.9776343Z cvt.rn.f32.s16 %r567, %rs123; 2026-02-21T09:18:14.9776410Z prmt.b32 %r568, %r518, 0, 0xbbb3U; 2026-02-21T09:18:14.9776468Z cvt.u16.u32 %rs124, %r568; 2026-02-21T09:18:14.9776528Z cvt.rn.f32.s16 %r569, %rs124; 2026-02-21T09:18:14.9776596Z prmt.b32 %r570, %r517, 0, 0xbbb3U; 2026-02-21T09:18:14.9776654Z cvt.u16.u32 %rs125, %r570; 2026-02-21T09:18:14.9776711Z cvt.rn.f32.s16 %r571, %rs125; 2026-02-21T09:18:14.9776777Z prmt.b32 %r572, %r516, 0, 0xbbb3U; 2026-02-21T09:18:14.9776834Z cvt.u16.u32 %rs126, %r572; 2026-02-21T09:18:14.9776893Z cvt.rn.f32.s16 %r573, %rs126; 2026-02-21T09:18:14.9776952Z prmt.b32 %r574, %r521, 0, 0xbbb3U; 2026-02-21T09:18:14.9777018Z cvt.u16.u32 %rs127, %r574; 2026-02-21T09:18:14.9777105Z cvt.rn.f32.s16 %r575, %rs127; 2026-02-21T09:18:14.9777164Z 
prmt.b32 %r576, %r520, 0, 0xbbb3U; 2026-02-21T09:18:14.9777231Z cvt.u16.u32 %rs128, %r576; 2026-02-21T09:18:14.9777290Z cvt.rn.f32.s16 %r577, %rs128; 2026-02-21T09:18:14.9777367Z bar.sync 0; 2026-02-21T09:18:14.9777464Z st.shared.v4.b32 [%r49], {%r523, %r525, %r522, %r524}; 2026-02-21T09:18:14.9777586Z st.shared.v4.b32 [%r50], {%r527, %r529, %r526, %r528}; 2026-02-21T09:18:14.9777695Z st.shared.v4.b32 [%r51], {%r533, %r537, %r531, %r535}; 2026-02-21T09:18:14.9777784Z st.shared.v4.b32 [%r52], {%r541, %r545, %r539, %r543}; 2026-02-21T09:18:14.9777878Z st.shared.v4.b32 [%r53], {%r549, %r553, %r547, %r551}; 2026-02-21T09:18:14.9777962Z st.shared.v4.b32 [%r54], {%r557, %r561, %r555, %r559}; 2026-02-21T09:18:14.9778047Z st.shared.v4.b32 [%r55], {%r565, %r569, %r563, %r567}; 2026-02-21T09:18:14.9778138Z st.shared.v4.b32 [%r56], {%r573, %r577, %r571, %r575}; 2026-02-21T09:18:14.9778193Z $L__tmp71: 2026-02-21T09:18:14.9778416Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9778476Z // begin inline asm 2026-02-21T09:18:14.9778771Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r320, %r321, %r322, %r323, %r324, %r325, %r326, %r327, %r328, %r329, %r330, %r331, %r332, %r333, %r334, %r335}, [%r4673 + 0], 64; 2026-02-21T09:18:14.9778829Z // end inline asm 2026-02-21T09:18:14.9778887Z // begin inline asm 2026-02-21T09:18:14.9779174Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r337, %r338, %r339, %r340, %r341, %r342, %r343, %r344, %r345, %r346, %r347, %r348, %r349, %r350, %r351, %r352}, [%r4673 + 16], 64; 2026-02-21T09:18:14.9779230Z // end inline asm 2026-02-21T09:18:14.9779288Z // begin inline asm 2026-02-21T09:18:14.9779567Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r354, %r355, %r356, %r357, %r358, %r359, %r360, %r361, %r362, %r363, %r364, %r365, %r366, %r367, %r368, %r369}, [%r4673 + 32], 64; 2026-02-21T09:18:14.9779622Z // end inline asm 2026-02-21T09:18:14.9779680Z // begin inline asm 
2026-02-21T09:18:14.9779954Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r371, %r372, %r373, %r374, %r375, %r376, %r377, %r378, %r379, %r380, %r381, %r382, %r383, %r384, %r385, %r386}, [%r4673 + 48], 64; 2026-02-21T09:18:14.9780012Z // end inline asm 2026-02-21T09:18:14.9780069Z // begin inline asm 2026-02-21T09:18:14.9780143Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:14.9780206Z // end inline asm 2026-02-21T09:18:14.9780262Z // begin inline asm 2026-02-21T09:18:14.9780542Z @%p9 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 0], 64, {%r320, %r321, %r322, %r323, %r324, %r325, %r326, %r327, %r328, %r329, %r330, %r331, %r332, %r333, %r334, %r335}; 2026-02-21T09:18:14.9780604Z // end inline asm 2026-02-21T09:18:14.9780658Z // begin inline asm 2026-02-21T09:18:14.9780943Z @%p9 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 16], 64, {%r337, %r338, %r339, %r340, %r341, %r342, %r343, %r344, %r345, %r346, %r347, %r348, %r349, %r350, %r351, %r352}; 2026-02-21T09:18:14.9781008Z // end inline asm 2026-02-21T09:18:14.9781064Z // begin inline asm 2026-02-21T09:18:14.9781334Z @%p9 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 32], 64, {%r354, %r355, %r356, %r357, %r358, %r359, %r360, %r361, %r362, %r363, %r364, %r365, %r366, %r367, %r368, %r369}; 2026-02-21T09:18:14.9781398Z // end inline asm 2026-02-21T09:18:14.9781455Z // begin inline asm 2026-02-21T09:18:14.9781750Z @%p9 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 48], 64, {%r371, %r372, %r373, %r374, %r375, %r376, %r377, %r378, %r379, %r380, %r381, %r382, %r383, %r384, %r385, %r386}; 2026-02-21T09:18:14.9781807Z // end inline asm 2026-02-21T09:18:14.9781869Z // begin inline asm 2026-02-21T09:18:14.9781938Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9781992Z // end inline asm 2026-02-21T09:18:14.9782055Z bar.sync 0; 2026-02-21T09:18:14.9782111Z // begin inline asm 2026-02-21T09:18:14.9782382Z @%p9 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5556 + 0], 16, {%r457, %r458, %r459, %r460, %r461, %r462, 
%r463, %r464, %r465, %r466, %r467, %r468, %r469, %r470, %r471, %r472}; 2026-02-21T09:18:14.9782475Z // end inline asm 2026-02-21T09:18:14.9782529Z // begin inline asm 2026-02-21T09:18:14.9782621Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9782675Z // end inline asm 2026-02-21T09:18:14.9782739Z bar.sync 0; 2026-02-21T09:18:14.9782822Z // begin inline asm 2026-02-21T09:18:14.9782897Z fence.proxy.async.shared::cta; 2026-02-21T09:18:14.9782980Z // end inline asm 2026-02-21T09:18:14.9783042Z add.s32 %r5985, %r144, 16384; 2026-02-21T09:18:14.9783097Z // begin inline asm 2026-02-21T09:18:14.9783191Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:14.9783251Z // end inline asm 2026-02-21T09:18:14.9783304Z bar.sync 0; 2026-02-21T09:18:14.9783364Z @%p20 bra $L__BB0_5; 2026-02-21T09:18:14.9783472Z // %bb.4: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:18:14.9783538Z elect.sync %r591|%p22, -1; 2026-02-21T09:18:14.9783599Z mov.b32 %r581, 69208336; 2026-02-21T09:18:14.9783662Z // begin inline asm 2026-02-21T09:18:14.9783819Z @%p22 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 0 ], %rd1, %r581, %p389; 2026-02-21T09:18:14.9783874Z // end inline asm 2026-02-21T09:18:14.9783935Z mov.pred %p23, -1; 2026-02-21T09:18:14.9783998Z // begin inline asm 2026-02-21T09:18:14.9784149Z @%p22 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 8 ], %rd2, %r581, %p23; 2026-02-21T09:18:14.9784204Z // end inline asm 2026-02-21T09:18:14.9784265Z // begin inline asm 2026-02-21T09:18:14.9784408Z @%p22 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 16 ], %rd3, %r581, %p23; 2026-02-21T09:18:14.9784462Z // end inline asm 2026-02-21T09:18:14.9784523Z // begin inline asm 2026-02-21T09:18:14.9784663Z @%p22 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 24 ], %rd4, %r581, %p23; 2026-02-21T09:18:14.9784717Z // end inline asm 2026-02-21T09:18:14.9784778Z cvt.u64.u32 %rd77, %r5985; 2026-02-21T09:18:14.9784842Z // begin 
inline asm 2026-02-21T09:18:14.9784965Z @%p22 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd77]; 2026-02-21T09:18:14.9785021Z // end inline asm 2026-02-21T09:18:14.9785130Z $L__BB0_5: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:18:14.9785221Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:14.9785279Z mov.b32 %r595, 0; 2026-02-21T09:18:14.9785501Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9785558Z // begin inline asm 2026-02-21T09:18:14.9785610Z 2026-02-21T09:18:14.9785662Z { 2026-02-21T09:18:14.9785734Z .reg .pred complete; 2026-02-21T09:18:14.9785790Z waitLoop: 2026-02-21T09:18:14.9785910Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r595; 2026-02-21T09:18:14.9785986Z @!complete bra.uni waitLoop; 2026-02-21T09:18:14.9786037Z } 2026-02-21T09:18:14.9786042Z 2026-02-21T09:18:14.9786098Z // end inline asm 2026-02-21T09:18:14.9786152Z bar.sync 0; 2026-02-21T09:18:14.9786215Z // begin inline asm 2026-02-21T09:18:14.9786301Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:14.9786358Z // end inline asm 2026-02-21T09:18:14.9786419Z $L__tmp72: 2026-02-21T09:18:14.9786584Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:14.9786648Z add.s64 %rd79, %rd64, 64; 2026-02-21T09:18:14.9786818Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:14.9786882Z add.s64 %rd82, %rd67, 64; 2026-02-21T09:18:14.9786940Z // begin inline asm 2026-02-21T09:18:14.9786997Z mov.u64 %rd78, 0x0; 2026-02-21T09:18:14.9787112Z createpolicy.fractional.L2::evict_first.b64 %rd78, 1.0; 2026-02-21T09:18:14.9787166Z // end inline asm 2026-02-21T09:18:14.9787221Z // begin inline asm 2026-02-21T09:18:14.9787283Z mov.u32 %r597, 0x0; 2026-02-21T09:18:14.9787360Z mov.u32 %r598, 0x0; 2026-02-21T09:18:14.9787414Z mov.u32 %r599, 0x0; 2026-02-21T09:18:14.9787469Z mov.u32 %r600, 0x0; 
2026-02-21T09:18:14.9787644Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r597, %r598, %r599, %r600 }, [ %rd79 + 0 ], %rd78; 2026-02-21T09:18:14.9787729Z // end inline asm 2026-02-21T09:18:14.9787785Z // begin inline asm 2026-02-21T09:18:14.9787868Z mov.u64 %rd81, 0x0; 2026-02-21T09:18:14.9787990Z createpolicy.fractional.L2::evict_first.b64 %rd81, 1.0; 2026-02-21T09:18:14.9788045Z // end inline asm 2026-02-21T09:18:14.9788109Z // begin inline asm 2026-02-21T09:18:14.9788163Z mov.u32 %r601, 0x0; 2026-02-21T09:18:14.9788218Z mov.u32 %r602, 0x0; 2026-02-21T09:18:14.9788272Z mov.u32 %r603, 0x0; 2026-02-21T09:18:14.9788335Z mov.u32 %r604, 0x0; 2026-02-21T09:18:14.9788494Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r601, %r602, %r603, %r604 }, [ %rd82 + 0 ], %rd81; 2026-02-21T09:18:14.9788549Z // end inline asm 2026-02-21T09:18:14.9788610Z $L__tmp73: 2026-02-21T09:18:14.9788823Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9788910Z st.shared.v4.b32 [%r26], {%r597, %r598, %r599, %r600}; 2026-02-21T09:18:14.9789014Z st.shared.v4.b32 [%r26+512], {%r601, %r602, %r603, %r604}; 2026-02-21T09:18:14.9789069Z bar.sync 0; 2026-02-21T09:18:14.9789156Z ld.shared.v4.b32 {%r764, %r765, %r766, %r767}, [%r27]; 2026-02-21T09:18:14.9789226Z mov.b32 {%rs129, %rs130}, %r767; 2026-02-21T09:18:14.9789295Z mov.b32 {%rs131, %rs132}, %r766; 2026-02-21T09:18:14.9789356Z mov.b32 {%rs133, %rs134}, %r765; 2026-02-21T09:18:14.9789415Z mov.b32 {%rs135, %rs136}, %r764; 2026-02-21T09:18:14.9789509Z ld.shared.v4.b32 {%r768, %r769, %r770, %r771}, [%r28]; 2026-02-21T09:18:14.9789569Z mov.b32 {%rs137, %rs138}, %r771; 2026-02-21T09:18:14.9789627Z mov.b32 {%rs139, %rs140}, %r770; 2026-02-21T09:18:14.9789692Z mov.b32 {%rs141, %rs142}, %r769; 2026-02-21T09:18:14.9789752Z mov.b32 {%rs143, %rs144}, %r768; 2026-02-21T09:18:14.9789804Z $L__tmp74: 2026-02-21T09:18:14.9789963Z .loc 1 52 32 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:14.9790035Z cvt.f32.bf16 %r746, %rs135; 2026-02-21T09:18:14.9790095Z cvt.f32.bf16 %r747, %rs136; 2026-02-21T09:18:14.9790156Z cvt.f32.bf16 %r748, %rs133; 2026-02-21T09:18:14.9790220Z cvt.f32.bf16 %r749, %rs134; 2026-02-21T09:18:14.9790279Z cvt.f32.bf16 %r750, %rs131; 2026-02-21T09:18:14.9790336Z cvt.f32.bf16 %r751, %rs132; 2026-02-21T09:18:14.9790393Z cvt.f32.bf16 %r752, %rs129; 2026-02-21T09:18:14.9790457Z cvt.f32.bf16 %r753, %rs130; 2026-02-21T09:18:14.9790513Z cvt.f32.bf16 %r754, %rs143; 2026-02-21T09:18:14.9790570Z cvt.f32.bf16 %r755, %rs144; 2026-02-21T09:18:14.9790634Z cvt.f32.bf16 %r756, %rs141; 2026-02-21T09:18:14.9790691Z cvt.f32.bf16 %r757, %rs142; 2026-02-21T09:18:14.9790747Z cvt.f32.bf16 %r758, %rs139; 2026-02-21T09:18:14.9790804Z cvt.f32.bf16 %r759, %rs140; 2026-02-21T09:18:14.9790869Z cvt.f32.bf16 %r760, %rs137; 2026-02-21T09:18:14.9790925Z cvt.f32.bf16 %r761, %rs138; 2026-02-21T09:18:14.9791083Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:14.9791154Z add.s32 %r772, %r7843, 131072; 2026-02-21T09:18:14.9791216Z cvt.s64.s32 %rd87, %r772; 2026-02-21T09:18:14.9791276Z add.s64 %rd85, %rd56, %rd87; 2026-02-21T09:18:14.9791441Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:14.9791498Z // begin inline asm 2026-02-21T09:18:14.9791573Z mov.u64 %rd84, 0x0; 2026-02-21T09:18:14.9791698Z createpolicy.fractional.L2::evict_first.b64 %rd84, 1.0; 2026-02-21T09:18:14.9791760Z // end inline asm 2026-02-21T09:18:14.9791815Z // begin inline asm 2026-02-21T09:18:14.9791870Z mov.u32 %r605, 0x0; 2026-02-21T09:18:14.9791930Z mov.u32 %r606, 0x0; 2026-02-21T09:18:14.9791983Z mov.u32 %r607, 0x0; 2026-02-21T09:18:14.9792069Z mov.u32 %r608, 0x0; 2026-02-21T09:18:14.9792235Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r605, %r606, %r607, %r608 }, [ %rd85 + 0 ], %rd84; 2026-02-21T09:18:14.9792290Z // 
end inline asm 2026-02-21T09:18:14.9792383Z prmt.b32 %r773, %r605, 0, 0x8880U; 2026-02-21T09:18:14.9792444Z cvt.u16.u32 %rs145, %r773; 2026-02-21T09:18:14.9792552Z prmt.b32 %r774, %r605, 0, 0x7770U; 2026-02-21T09:18:14.9792614Z cvt.u16.u32 %rs146, %r774; 2026-02-21T09:18:14.9792700Z prmt.b32 %r775, %r605, 0, 0x9991U; 2026-02-21T09:18:14.9792769Z cvt.u16.u32 %rs147, %r775; 2026-02-21T09:18:14.9792830Z prmt.b32 %r776, %r605, 0, 0x7771U; 2026-02-21T09:18:14.9792887Z cvt.u16.u32 %rs148, %r776; 2026-02-21T09:18:14.9792947Z prmt.b32 %r777, %r605, 0, 0xaaa2U; 2026-02-21T09:18:14.9793013Z cvt.u16.u32 %rs149, %r777; 2026-02-21T09:18:14.9793071Z prmt.b32 %r778, %r605, 0, 0x7772U; 2026-02-21T09:18:14.9793128Z cvt.u16.u32 %rs150, %r778; 2026-02-21T09:18:14.9793193Z prmt.b32 %r779, %r605, 0, 0xbbb3U; 2026-02-21T09:18:14.9793252Z cvt.u16.u32 %rs151, %r779; 2026-02-21T09:18:14.9793311Z prmt.b32 %r780, %r605, 0, 0x7773U; 2026-02-21T09:18:14.9793369Z cvt.u16.u32 %rs152, %r780; 2026-02-21T09:18:14.9793436Z prmt.b32 %r781, %r606, 0, 0x8880U; 2026-02-21T09:18:14.9793492Z cvt.u16.u32 %rs153, %r781; 2026-02-21T09:18:14.9793550Z prmt.b32 %r782, %r606, 0, 0x7770U; 2026-02-21T09:18:14.9793617Z cvt.u16.u32 %rs154, %r782; 2026-02-21T09:18:14.9793676Z prmt.b32 %r783, %r606, 0, 0x9991U; 2026-02-21T09:18:14.9793733Z cvt.u16.u32 %rs155, %r783; 2026-02-21T09:18:14.9793799Z prmt.b32 %r784, %r606, 0, 0x7771U; 2026-02-21T09:18:14.9793856Z cvt.u16.u32 %rs156, %r784; 2026-02-21T09:18:14.9793916Z prmt.b32 %r785, %r606, 0, 0xaaa2U; 2026-02-21T09:18:14.9793974Z cvt.u16.u32 %rs157, %r785; 2026-02-21T09:18:14.9794044Z prmt.b32 %r786, %r606, 0, 0x7772U; 2026-02-21T09:18:14.9794103Z cvt.u16.u32 %rs158, %r786; 2026-02-21T09:18:14.9794166Z prmt.b32 %r787, %r606, 0, 0xbbb3U; 2026-02-21T09:18:14.9794235Z cvt.u16.u32 %rs159, %r787; 2026-02-21T09:18:14.9794296Z prmt.b32 %r788, %r606, 0, 0x7773U; 2026-02-21T09:18:14.9794353Z cvt.u16.u32 %rs160, %r788; 2026-02-21T09:18:14.9794412Z prmt.b32 %r789, %r607, 0, 0x8880U; 
2026-02-21T09:18:14.9794478Z cvt.u16.u32 %rs161, %r789; 2026-02-21T09:18:14.9794537Z prmt.b32 %r790, %r607, 0, 0x7770U; 2026-02-21T09:18:14.9794595Z cvt.u16.u32 %rs162, %r790; 2026-02-21T09:18:14.9794661Z prmt.b32 %r791, %r607, 0, 0x9991U; 2026-02-21T09:18:14.9794718Z cvt.u16.u32 %rs163, %r791; 2026-02-21T09:18:14.9794775Z prmt.b32 %r792, %r607, 0, 0x7771U; 2026-02-21T09:18:14.9794832Z cvt.u16.u32 %rs164, %r792; 2026-02-21T09:18:14.9794898Z prmt.b32 %r793, %r607, 0, 0xaaa2U; 2026-02-21T09:18:14.9794954Z cvt.u16.u32 %rs165, %r793; 2026-02-21T09:18:14.9795013Z prmt.b32 %r794, %r607, 0, 0x7772U; 2026-02-21T09:18:14.9795077Z cvt.u16.u32 %rs166, %r794; 2026-02-21T09:18:14.9795137Z prmt.b32 %r795, %r607, 0, 0xbbb3U; 2026-02-21T09:18:14.9795194Z cvt.u16.u32 %rs167, %r795; 2026-02-21T09:18:14.9795256Z prmt.b32 %r796, %r607, 0, 0x7773U; 2026-02-21T09:18:14.9795321Z cvt.u16.u32 %rs168, %r796; 2026-02-21T09:18:14.9795379Z prmt.b32 %r797, %r608, 0, 0x8880U; 2026-02-21T09:18:14.9795436Z cvt.u16.u32 %rs169, %r797; 2026-02-21T09:18:14.9795500Z prmt.b32 %r798, %r608, 0, 0x7770U; 2026-02-21T09:18:14.9795557Z cvt.u16.u32 %rs170, %r798; 2026-02-21T09:18:14.9795617Z prmt.b32 %r799, %r608, 0, 0x9991U; 2026-02-21T09:18:14.9795682Z cvt.u16.u32 %rs171, %r799; 2026-02-21T09:18:14.9795742Z prmt.b32 %r800, %r608, 0, 0x7771U; 2026-02-21T09:18:14.9795799Z cvt.u16.u32 %rs172, %r800; 2026-02-21T09:18:14.9795858Z prmt.b32 %r801, %r608, 0, 0xaaa2U; 2026-02-21T09:18:14.9795922Z cvt.u16.u32 %rs173, %r801; 2026-02-21T09:18:14.9795980Z prmt.b32 %r802, %r608, 0, 0x7772U; 2026-02-21T09:18:14.9796038Z cvt.u16.u32 %rs174, %r802; 2026-02-21T09:18:14.9796102Z prmt.b32 %r803, %r608, 0, 0xbbb3U; 2026-02-21T09:18:14.9796159Z cvt.u16.u32 %rs175, %r803; 2026-02-21T09:18:14.9796218Z prmt.b32 %r804, %r608, 0, 0x7773U; 2026-02-21T09:18:14.9796297Z cvt.u16.u32 %rs176, %r804; 2026-02-21T09:18:14.9796468Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:14.9796554Z shl.b16 
%rs177, %rs146, 12; 2026-02-21T09:18:14.9796617Z shr.s16 %rs178, %rs177, 12; 2026-02-21T09:18:14.9796705Z shl.b16 %rs179, %rs148, 12; 2026-02-21T09:18:14.9796767Z shr.s16 %rs180, %rs179, 12; 2026-02-21T09:18:14.9796846Z shl.b16 %rs181, %rs150, 12; 2026-02-21T09:18:14.9796914Z shr.s16 %rs182, %rs181, 12; 2026-02-21T09:18:14.9796974Z shl.b16 %rs183, %rs152, 12; 2026-02-21T09:18:14.9797033Z shr.s16 %rs184, %rs183, 12; 2026-02-21T09:18:14.9797091Z shl.b16 %rs185, %rs154, 12; 2026-02-21T09:18:14.9797158Z shr.s16 %rs186, %rs185, 12; 2026-02-21T09:18:14.9797217Z shl.b16 %rs187, %rs156, 12; 2026-02-21T09:18:14.9797276Z shr.s16 %rs188, %rs187, 12; 2026-02-21T09:18:14.9797341Z shl.b16 %rs189, %rs158, 12; 2026-02-21T09:18:14.9797399Z shr.s16 %rs190, %rs189, 12; 2026-02-21T09:18:14.9797459Z shl.b16 %rs191, %rs160, 12; 2026-02-21T09:18:14.9797519Z shr.s16 %rs192, %rs191, 12; 2026-02-21T09:18:14.9797585Z shl.b16 %rs193, %rs162, 12; 2026-02-21T09:18:14.9797645Z shr.s16 %rs194, %rs193, 12; 2026-02-21T09:18:14.9797703Z shl.b16 %rs195, %rs164, 12; 2026-02-21T09:18:14.9797769Z shr.s16 %rs196, %rs195, 12; 2026-02-21T09:18:14.9797827Z shl.b16 %rs197, %rs166, 12; 2026-02-21T09:18:14.9797887Z shr.s16 %rs198, %rs197, 12; 2026-02-21T09:18:14.9797946Z shl.b16 %rs199, %rs168, 12; 2026-02-21T09:18:14.9798011Z shr.s16 %rs200, %rs199, 12; 2026-02-21T09:18:14.9798070Z shl.b16 %rs201, %rs170, 12; 2026-02-21T09:18:14.9798128Z shr.s16 %rs202, %rs201, 12; 2026-02-21T09:18:14.9798196Z shl.b16 %rs203, %rs172, 12; 2026-02-21T09:18:14.9798255Z shr.s16 %rs204, %rs203, 12; 2026-02-21T09:18:14.9798313Z shl.b16 %rs205, %rs174, 12; 2026-02-21T09:18:14.9798372Z shr.s16 %rs206, %rs205, 12; 2026-02-21T09:18:14.9798437Z shl.b16 %rs207, %rs176, 12; 2026-02-21T09:18:14.9798498Z shr.s16 %rs208, %rs207, 12; 2026-02-21T09:18:14.9798661Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:14.9798732Z shr.u16 %rs209, %rs145, 4; 2026-02-21T09:18:14.9798792Z shr.u16 %rs210, 
%rs147, 4; 2026-02-21T09:18:14.9798853Z shr.u16 %rs211, %rs149, 4; 2026-02-21T09:18:14.9798921Z shr.u16 %rs212, %rs151, 4; 2026-02-21T09:18:14.9798982Z shr.u16 %rs213, %rs153, 4; 2026-02-21T09:18:14.9799042Z shr.u16 %rs214, %rs155, 4; 2026-02-21T09:18:14.9799101Z shr.u16 %rs215, %rs157, 4; 2026-02-21T09:18:14.9799166Z shr.u16 %rs216, %rs159, 4; 2026-02-21T09:18:14.9799226Z shr.u16 %rs217, %rs161, 4; 2026-02-21T09:18:14.9799286Z shr.u16 %rs218, %rs163, 4; 2026-02-21T09:18:14.9799351Z shr.u16 %rs219, %rs165, 4; 2026-02-21T09:18:14.9799410Z shr.u16 %rs220, %rs167, 4; 2026-02-21T09:18:14.9799469Z shr.u16 %rs221, %rs169, 4; 2026-02-21T09:18:14.9799528Z shr.u16 %rs222, %rs171, 4; 2026-02-21T09:18:14.9799598Z shr.u16 %rs223, %rs173, 4; 2026-02-21T09:18:14.9799657Z shr.u16 %rs224, %rs175, 4; 2026-02-21T09:18:14.9799820Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:14.9799886Z bar.sync 0; 2026-02-21T09:18:14.9799952Z st.shared.b8 [%r29], %rs178; 2026-02-21T09:18:14.9800017Z st.shared.b8 [%r30], %rs180; 2026-02-21T09:18:14.9800079Z st.shared.b8 [%r31], %rs182; 2026-02-21T09:18:14.9800148Z st.shared.b8 [%r32], %rs184; 2026-02-21T09:18:14.9800213Z st.shared.b8 [%r33+512], %rs186; 2026-02-21T09:18:14.9800278Z st.shared.b8 [%r34+512], %rs188; 2026-02-21T09:18:14.9800349Z st.shared.b8 [%r35+512], %rs190; 2026-02-21T09:18:14.9800412Z st.shared.b8 [%r36+512], %rs192; 2026-02-21T09:18:14.9800480Z st.shared.b8 [%r37+1024], %rs194; 2026-02-21T09:18:14.9800553Z st.shared.b8 [%r38+1024], %rs196; 2026-02-21T09:18:14.9800617Z st.shared.b8 [%r39+1024], %rs198; 2026-02-21T09:18:14.9800681Z st.shared.b8 [%r40+1024], %rs200; 2026-02-21T09:18:14.9800774Z st.shared.b8 [%r41+1536], %rs202; 2026-02-21T09:18:14.9800846Z st.shared.b8 [%r42+1536], %rs204; 2026-02-21T09:18:14.9800909Z st.shared.b8 [%r43+1536], %rs206; 2026-02-21T09:18:14.9800991Z st.shared.b8 [%r44+1536], %rs208; 2026-02-21T09:18:14.9801055Z bar.sync 0; 2026-02-21T09:18:14.9801118Z 
ld.shared.b32 %r805, [%r45]; 2026-02-21T09:18:14.9801199Z ld.shared.b32 %r806, [%r46]; 2026-02-21T09:18:14.9801284Z ld.shared.b32 %r807, [%r47]; 2026-02-21T09:18:14.9801355Z ld.shared.b32 %r808, [%r48]; 2026-02-21T09:18:14.9801527Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:14.9801631Z bar.sync 0; 2026-02-21T09:18:14.9801702Z st.shared.b8 [%r29], %rs209; 2026-02-21T09:18:14.9801764Z st.shared.b8 [%r30], %rs210; 2026-02-21T09:18:14.9801825Z st.shared.b8 [%r31], %rs211; 2026-02-21T09:18:14.9801893Z st.shared.b8 [%r32], %rs212; 2026-02-21T09:18:14.9801957Z st.shared.b8 [%r33+512], %rs213; 2026-02-21T09:18:14.9802023Z st.shared.b8 [%r34+512], %rs214; 2026-02-21T09:18:14.9802086Z st.shared.b8 [%r35+512], %rs215; 2026-02-21T09:18:14.9802159Z st.shared.b8 [%r36+512], %rs216; 2026-02-21T09:18:14.9802223Z st.shared.b8 [%r37+1024], %rs217; 2026-02-21T09:18:14.9802286Z st.shared.b8 [%r38+1024], %rs218; 2026-02-21T09:18:14.9802357Z st.shared.b8 [%r39+1024], %rs219; 2026-02-21T09:18:14.9802420Z st.shared.b8 [%r40+1024], %rs220; 2026-02-21T09:18:14.9802485Z st.shared.b8 [%r41+1536], %rs221; 2026-02-21T09:18:14.9802547Z st.shared.b8 [%r42+1536], %rs222; 2026-02-21T09:18:14.9802618Z st.shared.b8 [%r43+1536], %rs223; 2026-02-21T09:18:14.9802680Z st.shared.b8 [%r44+1536], %rs224; 2026-02-21T09:18:14.9802738Z bar.sync 0; 2026-02-21T09:18:14.9802807Z ld.shared.b32 %r809, [%r45]; 2026-02-21T09:18:14.9802869Z ld.shared.b32 %r810, [%r46]; 2026-02-21T09:18:14.9802932Z ld.shared.b32 %r811, [%r47]; 2026-02-21T09:18:14.9802994Z ld.shared.b32 %r812, [%r48]; 2026-02-21T09:18:14.9803166Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:14.9803231Z cvt.s8.s32 %rs225, %r806; 2026-02-21T09:18:14.9803297Z cvt.rn.f32.s16 %r813, %rs225; 2026-02-21T09:18:14.9803370Z cvt.s8.s32 %rs226, %r805; 2026-02-21T09:18:14.9803435Z cvt.rn.f32.s16 %r814, %rs226; 2026-02-21T09:18:14.9803497Z cvt.s8.s32 %rs227, %r810; 
2026-02-21T09:18:14.9803566Z cvt.rn.f32.s16 %r815, %rs227; 2026-02-21T09:18:14.9803628Z cvt.s8.s32 %rs228, %r809; 2026-02-21T09:18:14.9803688Z cvt.rn.f32.s16 %r816, %rs228; 2026-02-21T09:18:14.9803750Z cvt.s8.s32 %rs229, %r808; 2026-02-21T09:18:14.9803817Z cvt.rn.f32.s16 %r817, %rs229; 2026-02-21T09:18:14.9803875Z cvt.s8.s32 %rs230, %r807; 2026-02-21T09:18:14.9803936Z cvt.rn.f32.s16 %r818, %rs230; 2026-02-21T09:18:14.9804002Z cvt.s8.s32 %rs231, %r812; 2026-02-21T09:18:14.9804061Z cvt.rn.f32.s16 %r819, %rs231; 2026-02-21T09:18:14.9804120Z cvt.s8.s32 %rs232, %r811; 2026-02-21T09:18:14.9804181Z cvt.rn.f32.s16 %r820, %rs232; 2026-02-21T09:18:14.9804254Z prmt.b32 %r821, %r806, 0, 0x9991U; 2026-02-21T09:18:14.9804316Z cvt.u16.u32 %rs233, %r821; 2026-02-21T09:18:14.9804376Z cvt.rn.f32.s16 %r822, %rs233; 2026-02-21T09:18:14.9804448Z prmt.b32 %r823, %r805, 0, 0x9991U; 2026-02-21T09:18:14.9804509Z cvt.u16.u32 %rs234, %r823; 2026-02-21T09:18:14.9804572Z cvt.rn.f32.s16 %r824, %rs234; 2026-02-21T09:18:14.9804634Z prmt.b32 %r825, %r810, 0, 0x9991U; 2026-02-21T09:18:14.9804703Z cvt.u16.u32 %rs235, %r825; 2026-02-21T09:18:14.9804773Z cvt.rn.f32.s16 %r826, %rs235; 2026-02-21T09:18:14.9804831Z prmt.b32 %r827, %r809, 0, 0x9991U; 2026-02-21T09:18:14.9804895Z cvt.u16.u32 %rs236, %r827; 2026-02-21T09:18:14.9804952Z cvt.rn.f32.s16 %r828, %rs236; 2026-02-21T09:18:14.9805012Z prmt.b32 %r829, %r808, 0, 0x9991U; 2026-02-21T09:18:14.9805075Z cvt.u16.u32 %rs237, %r829; 2026-02-21T09:18:14.9805133Z cvt.rn.f32.s16 %r830, %rs237; 2026-02-21T09:18:14.9805193Z prmt.b32 %r831, %r807, 0, 0x9991U; 2026-02-21T09:18:14.9805282Z cvt.u16.u32 %rs238, %r831; 2026-02-21T09:18:14.9805347Z cvt.rn.f32.s16 %r832, %rs238; 2026-02-21T09:18:14.9805406Z prmt.b32 %r833, %r812, 0, 0x9991U; 2026-02-21T09:18:14.9805464Z cvt.u16.u32 %rs239, %r833; 2026-02-21T09:18:14.9805556Z cvt.rn.f32.s16 %r834, %rs239; 2026-02-21T09:18:14.9805614Z prmt.b32 %r835, %r811, 0, 0x9991U; 2026-02-21T09:18:14.9805695Z cvt.u16.u32 %rs240, 
%r835; 2026-02-21T09:18:14.9805754Z cvt.rn.f32.s16 %r836, %rs240; 2026-02-21T09:18:14.9805845Z prmt.b32 %r837, %r806, 0, 0xaaa2U; 2026-02-21T09:18:14.9805904Z cvt.u16.u32 %rs241, %r837; 2026-02-21T09:18:14.9805963Z cvt.rn.f32.s16 %r838, %rs241; 2026-02-21T09:18:14.9806025Z prmt.b32 %r839, %r805, 0, 0xaaa2U; 2026-02-21T09:18:14.9806079Z cvt.u16.u32 %rs242, %r839; 2026-02-21T09:18:14.9806137Z cvt.rn.f32.s16 %r840, %rs242; 2026-02-21T09:18:14.9806196Z prmt.b32 %r841, %r810, 0, 0xaaa2U; 2026-02-21T09:18:14.9806255Z cvt.u16.u32 %rs243, %r841; 2026-02-21T09:18:14.9806312Z cvt.rn.f32.s16 %r842, %rs243; 2026-02-21T09:18:14.9806369Z prmt.b32 %r843, %r809, 0, 0xaaa2U; 2026-02-21T09:18:14.9806430Z cvt.u16.u32 %rs244, %r843; 2026-02-21T09:18:14.9806485Z cvt.rn.f32.s16 %r844, %rs244; 2026-02-21T09:18:14.9806543Z prmt.b32 %r845, %r808, 0, 0xaaa2U; 2026-02-21T09:18:14.9806600Z cvt.u16.u32 %rs245, %r845; 2026-02-21T09:18:14.9806663Z cvt.rn.f32.s16 %r846, %rs245; 2026-02-21T09:18:14.9806718Z prmt.b32 %r847, %r807, 0, 0xaaa2U; 2026-02-21T09:18:14.9806778Z cvt.u16.u32 %rs246, %r847; 2026-02-21T09:18:14.9806838Z cvt.rn.f32.s16 %r848, %rs246; 2026-02-21T09:18:14.9806892Z prmt.b32 %r849, %r812, 0, 0xaaa2U; 2026-02-21T09:18:14.9806948Z cvt.u16.u32 %rs247, %r849; 2026-02-21T09:18:14.9807010Z cvt.rn.f32.s16 %r850, %rs247; 2026-02-21T09:18:14.9807065Z prmt.b32 %r851, %r811, 0, 0xaaa2U; 2026-02-21T09:18:14.9807119Z cvt.u16.u32 %rs248, %r851; 2026-02-21T09:18:14.9807180Z cvt.rn.f32.s16 %r852, %rs248; 2026-02-21T09:18:14.9807240Z prmt.b32 %r853, %r806, 0, 0xbbb3U; 2026-02-21T09:18:14.9807295Z cvt.u16.u32 %rs249, %r853; 2026-02-21T09:18:14.9807354Z cvt.rn.f32.s16 %r854, %rs249; 2026-02-21T09:18:14.9807412Z prmt.b32 %r855, %r805, 0, 0xbbb3U; 2026-02-21T09:18:14.9807466Z cvt.u16.u32 %rs250, %r855; 2026-02-21T09:18:14.9807521Z cvt.rn.f32.s16 %r856, %rs250; 2026-02-21T09:18:14.9807579Z prmt.b32 %r857, %r810, 0, 0xbbb3U; 2026-02-21T09:18:14.9807639Z cvt.u16.u32 %rs251, %r857; 
2026-02-21T09:18:14.9807694Z cvt.rn.f32.s16 %r858, %rs251; 2026-02-21T09:18:14.9807750Z prmt.b32 %r859, %r809, 0, 0xbbb3U; 2026-02-21T09:18:14.9807807Z cvt.u16.u32 %rs252, %r859; 2026-02-21T09:18:14.9807864Z cvt.rn.f32.s16 %r860, %rs252; 2026-02-21T09:18:14.9807919Z prmt.b32 %r861, %r808, 0, 0xbbb3U; 2026-02-21T09:18:14.9807977Z cvt.u16.u32 %rs253, %r861; 2026-02-21T09:18:14.9808039Z cvt.rn.f32.s16 %r862, %rs253; 2026-02-21T09:18:14.9808094Z prmt.b32 %r863, %r807, 0, 0xbbb3U; 2026-02-21T09:18:14.9808147Z cvt.u16.u32 %rs254, %r863; 2026-02-21T09:18:14.9808209Z cvt.rn.f32.s16 %r864, %rs254; 2026-02-21T09:18:14.9808266Z prmt.b32 %r865, %r812, 0, 0xbbb3U; 2026-02-21T09:18:14.9808324Z cvt.u16.u32 %rs255, %r865; 2026-02-21T09:18:14.9808383Z cvt.rn.f32.s16 %r866, %rs255; 2026-02-21T09:18:14.9808443Z prmt.b32 %r867, %r811, 0, 0xbbb3U; 2026-02-21T09:18:14.9808497Z cvt.u16.u32 %rs256, %r867; 2026-02-21T09:18:14.9808551Z cvt.rn.f32.s16 %r868, %rs256; 2026-02-21T09:18:14.9808610Z bar.sync 0; 2026-02-21T09:18:14.9808702Z st.shared.v4.b32 [%r49], {%r814, %r816, %r813, %r815}; 2026-02-21T09:18:14.9808788Z st.shared.v4.b32 [%r50], {%r818, %r820, %r817, %r819}; 2026-02-21T09:18:14.9808874Z st.shared.v4.b32 [%r51], {%r824, %r828, %r822, %r826}; 2026-02-21T09:18:14.9808959Z st.shared.v4.b32 [%r52], {%r832, %r836, %r830, %r834}; 2026-02-21T09:18:14.9809039Z st.shared.v4.b32 [%r53], {%r840, %r844, %r838, %r842}; 2026-02-21T09:18:14.9809123Z st.shared.v4.b32 [%r54], {%r848, %r852, %r846, %r850}; 2026-02-21T09:18:14.9809204Z st.shared.v4.b32 [%r55], {%r856, %r860, %r854, %r858}; 2026-02-21T09:18:14.9809308Z st.shared.v4.b32 [%r56], {%r864, %r868, %r862, %r866}; 2026-02-21T09:18:14.9809361Z $L__tmp75: 2026-02-21T09:18:14.9809575Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9809667Z // begin inline asm 2026-02-21T09:18:14.9809978Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r609, %r610, %r611, %r612, %r613, 
%r614, %r615, %r616, %r617, %r618, %r619, %r620, %r621, %r622, %r623, %r624}, [%r625 + 0], 64; 2026-02-21T09:18:14.9810040Z // end inline asm 2026-02-21T09:18:14.9810093Z // begin inline asm 2026-02-21T09:18:14.9810365Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r626, %r627, %r628, %r629, %r630, %r631, %r632, %r633, %r634, %r635, %r636, %r637, %r638, %r639, %r640, %r641}, [%r625 + 16], 64; 2026-02-21T09:18:14.9810421Z // end inline asm 2026-02-21T09:18:14.9810478Z // begin inline asm 2026-02-21T09:18:14.9810745Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r643, %r644, %r645, %r646, %r647, %r648, %r649, %r650, %r651, %r652, %r653, %r654, %r655, %r656, %r657, %r658}, [%r625 + 32], 64; 2026-02-21T09:18:14.9810797Z // end inline asm 2026-02-21T09:18:14.9810857Z // begin inline asm 2026-02-21T09:18:14.9811124Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r660, %r661, %r662, %r663, %r664, %r665, %r666, %r667, %r668, %r669, %r670, %r671, %r672, %r673, %r674, %r675}, [%r625 + 48], 64; 2026-02-21T09:18:14.9811179Z // end inline asm 2026-02-21T09:18:14.9811239Z // begin inline asm 2026-02-21T09:18:14.9811307Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:14.9811357Z // end inline asm 2026-02-21T09:18:14.9811422Z mov.pred %p38, -1; 2026-02-21T09:18:14.9811473Z // begin inline asm 2026-02-21T09:18:14.9811781Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 0], 64, {%r609, %r610, %r611, %r612, %r613, %r614, %r615, %r616, %r617, %r618, %r619, %r620, %r621, %r622, %r623, %r624}; 2026-02-21T09:18:14.9811836Z // end inline asm 2026-02-21T09:18:14.9811891Z // begin inline asm 2026-02-21T09:18:14.9812168Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 16], 64, {%r626, %r627, %r628, %r629, %r630, %r631, %r632, %r633, %r634, %r635, %r636, %r637, %r638, %r639, %r640, %r641}; 2026-02-21T09:18:14.9812220Z // end inline asm 2026-02-21T09:18:14.9812279Z // begin inline asm 2026-02-21T09:18:14.9812551Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 32], 64, {%r643, 
%r644, %r645, %r646, %r647, %r648, %r649, %r650, %r651, %r652, %r653, %r654, %r655, %r656, %r657, %r658}; 2026-02-21T09:18:14.9812606Z // end inline asm 2026-02-21T09:18:14.9812660Z // begin inline asm 2026-02-21T09:18:14.9812926Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 48], 64, {%r660, %r661, %r662, %r663, %r664, %r665, %r666, %r667, %r668, %r669, %r670, %r671, %r672, %r673, %r674, %r675}; 2026-02-21T09:18:14.9812981Z // end inline asm 2026-02-21T09:18:14.9813035Z // begin inline asm 2026-02-21T09:18:14.9813103Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9813152Z // end inline asm 2026-02-21T09:18:14.9813207Z bar.sync 0; 2026-02-21T09:18:14.9813262Z // begin inline asm 2026-02-21T09:18:14.9813530Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5846 + 0], 16, {%r746, %r747, %r748, %r749, %r750, %r751, %r752, %r753, %r754, %r755, %r756, %r757, %r758, %r759, %r760, %r761}; 2026-02-21T09:18:14.9813587Z // end inline asm 2026-02-21T09:18:14.9813648Z // begin inline asm 2026-02-21T09:18:14.9813710Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9813761Z // end inline asm 2026-02-21T09:18:14.9813819Z bar.sync 0; 2026-02-21T09:18:14.9813870Z // begin inline asm 2026-02-21T09:18:14.9813942Z fence.proxy.async.shared::cta; 2026-02-21T09:18:14.9813992Z // end inline asm 2026-02-21T09:18:14.9814052Z // begin inline asm 2026-02-21T09:18:14.9814139Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:14.9814190Z // end inline asm 2026-02-21T09:18:14.9814247Z bar.sync 0; 2026-02-21T09:18:14.9814301Z @%p20 bra $L__BB0_7; 2026-02-21T09:18:14.9814400Z // %bb.6: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:18:14.9814491Z elect.sync %r881|%p39, -1; 2026-02-21T09:18:14.9814552Z mov.b32 %r871, 69208336; 2026-02-21T09:18:14.9814609Z // begin inline asm 2026-02-21T09:18:14.9814781Z @%p39 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 0 ], %rd1, %r871, %p38; 2026-02-21T09:18:14.9814863Z // end inline asm 
2026-02-21T09:18:14.9814916Z // begin inline asm 2026-02-21T09:18:14.9815089Z @%p39 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 8 ], %rd2, %r871, %p38; 2026-02-21T09:18:14.9815145Z // end inline asm 2026-02-21T09:18:14.9815202Z // begin inline asm 2026-02-21T09:18:14.9815343Z @%p39 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 16 ], %rd3, %r871, %p38; 2026-02-21T09:18:14.9815400Z // end inline asm 2026-02-21T09:18:14.9815452Z // begin inline asm 2026-02-21T09:18:14.9815594Z @%p39 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 24 ], %rd4, %r871, %p38; 2026-02-21T09:18:14.9815647Z // end inline asm 2026-02-21T09:18:14.9815715Z cvt.u64.u32 %rd92, %r5985; 2026-02-21T09:18:14.9815766Z // begin inline asm 2026-02-21T09:18:14.9815890Z @%p39 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd92]; 2026-02-21T09:18:14.9815947Z // end inline asm 2026-02-21T09:18:14.9816042Z $L__BB0_7: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:18:14.9816098Z // begin inline asm 2026-02-21T09:18:14.9816146Z 2026-02-21T09:18:14.9816201Z { 2026-02-21T09:18:14.9816262Z .reg .pred complete; 2026-02-21T09:18:14.9816312Z waitLoop: 2026-02-21T09:18:14.9816433Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r595; 2026-02-21T09:18:14.9816496Z @!complete bra.uni waitLoop; 2026-02-21T09:18:14.9816543Z } 2026-02-21T09:18:14.9816548Z 2026-02-21T09:18:14.9816604Z // end inline asm 2026-02-21T09:18:14.9816655Z bar.sync 0; 2026-02-21T09:18:14.9816711Z // begin inline asm 2026-02-21T09:18:14.9816793Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:14.9816851Z // end inline asm 2026-02-21T09:18:14.9816900Z $L__tmp76: 2026-02-21T09:18:14.9817063Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:14.9817124Z add.s64 %rd94, %rd64, 128; 2026-02-21T09:18:14.9817284Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:14.9817340Z add.s64 
%rd97, %rd67, 128; 2026-02-21T09:18:14.9817398Z // begin inline asm 2026-02-21T09:18:14.9817450Z mov.u64 %rd93, 0x0; 2026-02-21T09:18:14.9817550Z createpolicy.fractional.L2::evict_first.b64 %rd93, 1.0; 2026-02-21T09:18:14.9817605Z // end inline asm 2026-02-21T09:18:14.9817660Z // begin inline asm 2026-02-21T09:18:14.9817715Z mov.u32 %r887, 0x0; 2026-02-21T09:18:14.9817765Z mov.u32 %r888, 0x0; 2026-02-21T09:18:14.9817824Z mov.u32 %r889, 0x0; 2026-02-21T09:18:14.9817874Z mov.u32 %r890, 0x0; 2026-02-21T09:18:14.9818034Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r887, %r888, %r889, %r890 }, [ %rd94 + 0 ], %rd93; 2026-02-21T09:18:14.9818092Z // end inline asm 2026-02-21T09:18:14.9818150Z // begin inline asm 2026-02-21T09:18:14.9818201Z mov.u64 %rd96, 0x0; 2026-02-21T09:18:14.9818301Z createpolicy.fractional.L2::evict_first.b64 %rd96, 1.0; 2026-02-21T09:18:14.9818360Z // end inline asm 2026-02-21T09:18:14.9818412Z // begin inline asm 2026-02-21T09:18:14.9818467Z mov.u32 %r891, 0x0; 2026-02-21T09:18:14.9818518Z mov.u32 %r892, 0x0; 2026-02-21T09:18:14.9818575Z mov.u32 %r893, 0x0; 2026-02-21T09:18:14.9818629Z mov.u32 %r894, 0x0; 2026-02-21T09:18:14.9818788Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r891, %r892, %r893, %r894 }, [ %rd97 + 0 ], %rd96; 2026-02-21T09:18:14.9818852Z // end inline asm 2026-02-21T09:18:14.9818903Z $L__tmp77: 2026-02-21T09:18:14.9819109Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9819205Z st.shared.v4.b32 [%r26], {%r887, %r888, %r889, %r890}; 2026-02-21T09:18:14.9819323Z st.shared.v4.b32 [%r26+512], {%r891, %r892, %r893, %r894}; 2026-02-21T09:18:14.9819377Z bar.sync 0; 2026-02-21T09:18:14.9819472Z ld.shared.v4.b32 {%r1054, %r1055, %r1056, %r1057}, [%r27]; 2026-02-21T09:18:14.9819563Z mov.b32 {%rs257, %rs258}, %r1057; 2026-02-21T09:18:14.9819645Z mov.b32 {%rs259, %rs260}, %r1056; 2026-02-21T09:18:14.9819706Z mov.b32 {%rs261, %rs262}, %r1055; 
2026-02-21T09:18:14.9819793Z mov.b32 {%rs263, %rs264}, %r1054; 2026-02-21T09:18:14.9819886Z ld.shared.v4.b32 {%r1058, %r1059, %r1060, %r1061}, [%r28]; 2026-02-21T09:18:14.9819947Z mov.b32 {%rs265, %rs266}, %r1061; 2026-02-21T09:18:14.9820013Z mov.b32 {%rs267, %rs268}, %r1060; 2026-02-21T09:18:14.9820070Z mov.b32 {%rs269, %rs270}, %r1059; 2026-02-21T09:18:14.9820128Z mov.b32 {%rs271, %rs272}, %r1058; 2026-02-21T09:18:14.9820179Z $L__tmp78: 2026-02-21T09:18:14.9820349Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:14.9820413Z cvt.f32.bf16 %r1036, %rs263; 2026-02-21T09:18:14.9820474Z cvt.f32.bf16 %r1037, %rs264; 2026-02-21T09:18:14.9820543Z cvt.f32.bf16 %r1038, %rs261; 2026-02-21T09:18:14.9820604Z cvt.f32.bf16 %r1039, %rs262; 2026-02-21T09:18:14.9820661Z cvt.f32.bf16 %r1040, %rs259; 2026-02-21T09:18:14.9820719Z cvt.f32.bf16 %r1041, %rs260; 2026-02-21T09:18:14.9820783Z cvt.f32.bf16 %r1042, %rs257; 2026-02-21T09:18:14.9820841Z cvt.f32.bf16 %r1043, %rs258; 2026-02-21T09:18:14.9820899Z cvt.f32.bf16 %r1044, %rs271; 2026-02-21T09:18:14.9820965Z cvt.f32.bf16 %r1045, %rs272; 2026-02-21T09:18:14.9821024Z cvt.f32.bf16 %r1046, %rs269; 2026-02-21T09:18:14.9821080Z cvt.f32.bf16 %r1047, %rs270; 2026-02-21T09:18:14.9821144Z cvt.f32.bf16 %r1048, %rs267; 2026-02-21T09:18:14.9821201Z cvt.f32.bf16 %r1049, %rs268; 2026-02-21T09:18:14.9821258Z cvt.f32.bf16 %r1050, %rs265; 2026-02-21T09:18:14.9821317Z cvt.f32.bf16 %r1051, %rs266; 2026-02-21T09:18:14.9821483Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:14.9821575Z add.s32 %r1062, %r7843, 262144; 2026-02-21T09:18:14.9821638Z cvt.s64.s32 %rd102, %r1062; 2026-02-21T09:18:14.9821708Z add.s64 %rd100, %rd56, %rd102; 2026-02-21T09:18:14.9821865Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:14.9821923Z // begin inline asm 2026-02-21T09:18:14.9821987Z mov.u64 %rd99, 0x0; 2026-02-21T09:18:14.9822090Z 
createpolicy.fractional.L2::evict_first.b64 %rd99, 1.0; 2026-02-21T09:18:14.9822146Z // end inline asm 2026-02-21T09:18:14.9822203Z // begin inline asm 2026-02-21T09:18:14.9822265Z mov.u32 %r895, 0x0; 2026-02-21T09:18:14.9822320Z mov.u32 %r896, 0x0; 2026-02-21T09:18:14.9822373Z mov.u32 %r897, 0x0; 2026-02-21T09:18:14.9822435Z mov.u32 %r898, 0x0; 2026-02-21T09:18:14.9822598Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r895, %r896, %r897, %r898 }, [ %rd100 + 0 ], %rd99; 2026-02-21T09:18:14.9822655Z // end inline asm 2026-02-21T09:18:14.9822722Z prmt.b32 %r1063, %r895, 0, 0x8880U; 2026-02-21T09:18:14.9822792Z cvt.u16.u32 %rs273, %r1063; 2026-02-21T09:18:14.9822858Z prmt.b32 %r1064, %r895, 0, 0x7770U; 2026-02-21T09:18:14.9822922Z cvt.u16.u32 %rs274, %r1064; 2026-02-21T09:18:14.9822992Z prmt.b32 %r1065, %r895, 0, 0x9991U; 2026-02-21T09:18:14.9823054Z cvt.u16.u32 %rs275, %r1065; 2026-02-21T09:18:14.9823119Z prmt.b32 %r1066, %r895, 0, 0x7771U; 2026-02-21T09:18:14.9823190Z cvt.u16.u32 %rs276, %r1066; 2026-02-21T09:18:14.9823252Z prmt.b32 %r1067, %r895, 0, 0xaaa2U; 2026-02-21T09:18:14.9823313Z cvt.u16.u32 %rs277, %r1067; 2026-02-21T09:18:14.9823376Z prmt.b32 %r1068, %r895, 0, 0x7772U; 2026-02-21T09:18:14.9823448Z cvt.u16.u32 %rs278, %r1068; 2026-02-21T09:18:14.9823513Z prmt.b32 %r1069, %r895, 0, 0xbbb3U; 2026-02-21T09:18:14.9823575Z cvt.u16.u32 %rs279, %r1069; 2026-02-21T09:18:14.9823646Z prmt.b32 %r1070, %r895, 0, 0x7773U; 2026-02-21T09:18:14.9823703Z cvt.u16.u32 %rs280, %r1070; 2026-02-21T09:18:14.9823788Z prmt.b32 %r1071, %r896, 0, 0x8880U; 2026-02-21T09:18:14.9823846Z cvt.u16.u32 %rs281, %r1071; 2026-02-21T09:18:14.9823914Z prmt.b32 %r1072, %r896, 0, 0x7770U; 2026-02-21T09:18:14.9823997Z cvt.u16.u32 %rs282, %r1072; 2026-02-21T09:18:14.9824057Z prmt.b32 %r1073, %r896, 0, 0x9991U; 2026-02-21T09:18:14.9824154Z cvt.u16.u32 %rs283, %r1073; 2026-02-21T09:18:14.9824216Z prmt.b32 %r1074, %r896, 0, 0x7771U; 2026-02-21T09:18:14.9824306Z cvt.u16.u32 %rs284, %r1074; 
2026-02-21T09:18:14.9824367Z prmt.b32 %r1075, %r896, 0, 0xaaa2U; 2026-02-21T09:18:14.9824434Z cvt.u16.u32 %rs285, %r1075; 2026-02-21T09:18:14.9824495Z prmt.b32 %r1076, %r896, 0, 0x7772U; 2026-02-21T09:18:14.9824554Z cvt.u16.u32 %rs286, %r1076; 2026-02-21T09:18:14.9824622Z prmt.b32 %r1077, %r896, 0, 0xbbb3U; 2026-02-21T09:18:14.9824679Z cvt.u16.u32 %rs287, %r1077; 2026-02-21T09:18:14.9824739Z prmt.b32 %r1078, %r896, 0, 0x7773U; 2026-02-21T09:18:14.9824802Z cvt.u16.u32 %rs288, %r1078; 2026-02-21T09:18:14.9824864Z prmt.b32 %r1079, %r897, 0, 0x8880U; 2026-02-21T09:18:14.9824923Z cvt.u16.u32 %rs289, %r1079; 2026-02-21T09:18:14.9824982Z prmt.b32 %r1080, %r897, 0, 0x7770U; 2026-02-21T09:18:14.9825049Z cvt.u16.u32 %rs290, %r1080; 2026-02-21T09:18:14.9825110Z prmt.b32 %r1081, %r897, 0, 0x9991U; 2026-02-21T09:18:14.9825169Z cvt.u16.u32 %rs291, %r1081; 2026-02-21T09:18:14.9825236Z prmt.b32 %r1082, %r897, 0, 0x7771U; 2026-02-21T09:18:14.9825297Z cvt.u16.u32 %rs292, %r1082; 2026-02-21T09:18:14.9825358Z prmt.b32 %r1083, %r897, 0, 0xaaa2U; 2026-02-21T09:18:14.9825416Z cvt.u16.u32 %rs293, %r1083; 2026-02-21T09:18:14.9825484Z prmt.b32 %r1084, %r897, 0, 0x7772U; 2026-02-21T09:18:14.9825542Z cvt.u16.u32 %rs294, %r1084; 2026-02-21T09:18:14.9825600Z prmt.b32 %r1085, %r897, 0, 0xbbb3U; 2026-02-21T09:18:14.9825665Z cvt.u16.u32 %rs295, %r1085; 2026-02-21T09:18:14.9825727Z prmt.b32 %r1086, %r897, 0, 0x7773U; 2026-02-21T09:18:14.9825785Z cvt.u16.u32 %rs296, %r1086; 2026-02-21T09:18:14.9825854Z prmt.b32 %r1087, %r898, 0, 0x8880U; 2026-02-21T09:18:14.9825913Z cvt.u16.u32 %rs297, %r1087; 2026-02-21T09:18:14.9825973Z prmt.b32 %r1088, %r898, 0, 0x7770U; 2026-02-21T09:18:14.9826032Z cvt.u16.u32 %rs298, %r1088; 2026-02-21T09:18:14.9826098Z prmt.b32 %r1089, %r898, 0, 0x9991U; 2026-02-21T09:18:14.9826154Z cvt.u16.u32 %rs299, %r1089; 2026-02-21T09:18:14.9826216Z prmt.b32 %r1090, %r898, 0, 0x7771U; 2026-02-21T09:18:14.9826282Z cvt.u16.u32 %rs300, %r1090; 2026-02-21T09:18:14.9826341Z prmt.b32 %r1091, 
%r898, 0, 0xaaa2U; 2026-02-21T09:18:14.9826398Z cvt.u16.u32 %rs301, %r1091; 2026-02-21T09:18:14.9826458Z prmt.b32 %r1092, %r898, 0, 0x7772U; 2026-02-21T09:18:14.9826522Z cvt.u16.u32 %rs302, %r1092; 2026-02-21T09:18:14.9826581Z prmt.b32 %r1093, %r898, 0, 0xbbb3U; 2026-02-21T09:18:14.9826635Z cvt.u16.u32 %rs303, %r1093; 2026-02-21T09:18:14.9826702Z prmt.b32 %r1094, %r898, 0, 0x7773U; 2026-02-21T09:18:14.9826759Z cvt.u16.u32 %rs304, %r1094; 2026-02-21T09:18:14.9826921Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:14.9826986Z shl.b16 %rs305, %rs274, 12; 2026-02-21T09:18:14.9827044Z shr.s16 %rs306, %rs305, 12; 2026-02-21T09:18:14.9827102Z shl.b16 %rs307, %rs276, 12; 2026-02-21T09:18:14.9827160Z shr.s16 %rs308, %rs307, 12; 2026-02-21T09:18:14.9827225Z shl.b16 %rs309, %rs278, 12; 2026-02-21T09:18:14.9827282Z shr.s16 %rs310, %rs309, 12; 2026-02-21T09:18:14.9827339Z shl.b16 %rs311, %rs280, 12; 2026-02-21T09:18:14.9827401Z shr.s16 %rs312, %rs311, 12; 2026-02-21T09:18:14.9827457Z shl.b16 %rs313, %rs282, 12; 2026-02-21T09:18:14.9827514Z shr.s16 %rs314, %rs313, 12; 2026-02-21T09:18:14.9827569Z shl.b16 %rs315, %rs284, 12; 2026-02-21T09:18:14.9827636Z shr.s16 %rs316, %rs315, 12; 2026-02-21T09:18:14.9827692Z shl.b16 %rs317, %rs286, 12; 2026-02-21T09:18:14.9827749Z shr.s16 %rs318, %rs317, 12; 2026-02-21T09:18:14.9827813Z shl.b16 %rs319, %rs288, 12; 2026-02-21T09:18:14.9827870Z shr.s16 %rs320, %rs319, 12; 2026-02-21T09:18:14.9827964Z shl.b16 %rs321, %rs290, 12; 2026-02-21T09:18:14.9828020Z shr.s16 %rs322, %rs321, 12; 2026-02-21T09:18:14.9828084Z shl.b16 %rs323, %rs292, 12; 2026-02-21T09:18:14.9828162Z shr.s16 %rs324, %rs323, 12; 2026-02-21T09:18:14.9828219Z shl.b16 %rs325, %rs294, 12; 2026-02-21T09:18:14.9828311Z shr.s16 %rs326, %rs325, 12; 2026-02-21T09:18:14.9828370Z shl.b16 %rs327, %rs296, 12; 2026-02-21T09:18:14.9828453Z shr.s16 %rs328, %rs327, 12; 2026-02-21T09:18:14.9828511Z shl.b16 %rs329, %rs298, 12; 
2026-02-21T09:18:14.9828576Z shr.s16 %rs330, %rs329, 12; 2026-02-21T09:18:14.9828633Z shl.b16 %rs331, %rs300, 12; 2026-02-21T09:18:14.9828690Z shr.s16 %rs332, %rs331, 12; 2026-02-21T09:18:14.9828757Z shl.b16 %rs333, %rs302, 12; 2026-02-21T09:18:14.9828815Z shr.s16 %rs334, %rs333, 12; 2026-02-21T09:18:14.9828873Z shl.b16 %rs335, %rs304, 12; 2026-02-21T09:18:14.9828938Z shr.s16 %rs336, %rs335, 12; 2026-02-21T09:18:14.9829101Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:14.9829163Z shr.u16 %rs337, %rs273, 4; 2026-02-21T09:18:14.9829223Z shr.u16 %rs338, %rs275, 4; 2026-02-21T09:18:14.9829292Z shr.u16 %rs339, %rs277, 4; 2026-02-21T09:18:14.9829351Z shr.u16 %rs340, %rs279, 4; 2026-02-21T09:18:14.9829412Z shr.u16 %rs341, %rs281, 4; 2026-02-21T09:18:14.9829481Z shr.u16 %rs342, %rs283, 4; 2026-02-21T09:18:14.9829540Z shr.u16 %rs343, %rs285, 4; 2026-02-21T09:18:14.9829600Z shr.u16 %rs344, %rs287, 4; 2026-02-21T09:18:14.9829663Z shr.u16 %rs345, %rs289, 4; 2026-02-21T09:18:14.9829730Z shr.u16 %rs346, %rs291, 4; 2026-02-21T09:18:14.9829787Z shr.u16 %rs347, %rs293, 4; 2026-02-21T09:18:14.9829843Z shr.u16 %rs348, %rs295, 4; 2026-02-21T09:18:14.9829907Z shr.u16 %rs349, %rs297, 4; 2026-02-21T09:18:14.9829965Z shr.u16 %rs350, %rs299, 4; 2026-02-21T09:18:14.9830022Z shr.u16 %rs351, %rs301, 4; 2026-02-21T09:18:14.9830080Z shr.u16 %rs352, %rs303, 4; 2026-02-21T09:18:14.9830247Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:14.9830302Z bar.sync 0; 2026-02-21T09:18:14.9830365Z st.shared.b8 [%r29], %rs306; 2026-02-21T09:18:14.9830436Z st.shared.b8 [%r30], %rs308; 2026-02-21T09:18:14.9830496Z st.shared.b8 [%r31], %rs310; 2026-02-21T09:18:14.9830558Z st.shared.b8 [%r32], %rs312; 2026-02-21T09:18:14.9830631Z st.shared.b8 [%r33+512], %rs314; 2026-02-21T09:18:14.9830694Z st.shared.b8 [%r34+512], %rs316; 2026-02-21T09:18:14.9830754Z st.shared.b8 [%r35+512], %rs318; 2026-02-21T09:18:14.9830815Z 
st.shared.b8 [%r36+512], %rs320; 2026-02-21T09:18:14.9830886Z st.shared.b8 [%r37+1024], %rs322; 2026-02-21T09:18:14.9830948Z st.shared.b8 [%r38+1024], %rs324; 2026-02-21T09:18:14.9831009Z st.shared.b8 [%r39+1024], %rs326; 2026-02-21T09:18:14.9831076Z st.shared.b8 [%r40+1024], %rs328; 2026-02-21T09:18:14.9831136Z st.shared.b8 [%r41+1536], %rs330; 2026-02-21T09:18:14.9831195Z st.shared.b8 [%r42+1536], %rs332; 2026-02-21T09:18:14.9831256Z st.shared.b8 [%r43+1536], %rs334; 2026-02-21T09:18:14.9831325Z st.shared.b8 [%r44+1536], %rs336; 2026-02-21T09:18:14.9831380Z bar.sync 0; 2026-02-21T09:18:14.9831444Z ld.shared.b32 %r1095, [%r45]; 2026-02-21T09:18:14.9831513Z ld.shared.b32 %r1096, [%r46]; 2026-02-21T09:18:14.9831600Z ld.shared.b32 %r1097, [%r47]; 2026-02-21T09:18:14.9831662Z ld.shared.b32 %r1098, [%r48]; 2026-02-21T09:18:14.9831831Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:14.9831885Z bar.sync 0; 2026-02-21T09:18:14.9831946Z st.shared.b8 [%r29], %rs337; 2026-02-21T09:18:14.9832004Z st.shared.b8 [%r30], %rs338; 2026-02-21T09:18:14.9832072Z st.shared.b8 [%r31], %rs339; 2026-02-21T09:18:14.9832131Z st.shared.b8 [%r32], %rs340; 2026-02-21T09:18:14.9832193Z st.shared.b8 [%r33+512], %rs341; 2026-02-21T09:18:14.9832261Z st.shared.b8 [%r34+512], %rs342; 2026-02-21T09:18:14.9832322Z st.shared.b8 [%r35+512], %rs343; 2026-02-21T09:18:14.9832410Z st.shared.b8 [%r36+512], %rs344; 2026-02-21T09:18:14.9832470Z st.shared.b8 [%r37+1024], %rs345; 2026-02-21T09:18:14.9832539Z st.shared.b8 [%r38+1024], %rs346; 2026-02-21T09:18:14.9832623Z st.shared.b8 [%r39+1024], %rs347; 2026-02-21T09:18:14.9832682Z st.shared.b8 [%r40+1024], %rs348; 2026-02-21T09:18:14.9832773Z st.shared.b8 [%r41+1536], %rs349; 2026-02-21T09:18:14.9832833Z st.shared.b8 [%r42+1536], %rs350; 2026-02-21T09:18:14.9832919Z st.shared.b8 [%r43+1536], %rs351; 2026-02-21T09:18:14.9832978Z st.shared.b8 [%r44+1536], %rs352; 2026-02-21T09:18:14.9833039Z bar.sync 0; 
2026-02-21T09:18:14.9833100Z ld.shared.b32 %r1099, [%r45]; 2026-02-21T09:18:14.9833161Z ld.shared.b32 %r1100, [%r46]; 2026-02-21T09:18:14.9833226Z ld.shared.b32 %r1101, [%r47]; 2026-02-21T09:18:14.9833287Z ld.shared.b32 %r1102, [%r48]; 2026-02-21T09:18:14.9833447Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:14.9833517Z cvt.s8.s32 %rs353, %r1096; 2026-02-21T09:18:14.9833580Z cvt.rn.f32.s16 %r1103, %rs353; 2026-02-21T09:18:14.9833640Z cvt.s8.s32 %rs354, %r1095; 2026-02-21T09:18:14.9833703Z cvt.rn.f32.s16 %r1104, %rs354; 2026-02-21T09:18:14.9833768Z cvt.s8.s32 %rs355, %r1100; 2026-02-21T09:18:14.9833829Z cvt.rn.f32.s16 %r1105, %rs355; 2026-02-21T09:18:14.9833888Z cvt.s8.s32 %rs356, %r1099; 2026-02-21T09:18:14.9833957Z cvt.rn.f32.s16 %r1106, %rs356; 2026-02-21T09:18:14.9834015Z cvt.s8.s32 %rs357, %r1098; 2026-02-21T09:18:14.9834075Z cvt.rn.f32.s16 %r1107, %rs357; 2026-02-21T09:18:14.9834132Z cvt.s8.s32 %rs358, %r1097; 2026-02-21T09:18:14.9834198Z cvt.rn.f32.s16 %r1108, %rs358; 2026-02-21T09:18:14.9834255Z cvt.s8.s32 %rs359, %r1102; 2026-02-21T09:18:14.9834314Z cvt.rn.f32.s16 %r1109, %rs359; 2026-02-21T09:18:14.9834380Z cvt.s8.s32 %rs360, %r1101; 2026-02-21T09:18:14.9834439Z cvt.rn.f32.s16 %r1110, %rs360; 2026-02-21T09:18:14.9834504Z prmt.b32 %r1111, %r1096, 0, 0x9991U; 2026-02-21T09:18:14.9834566Z cvt.u16.u32 %rs361, %r1111; 2026-02-21T09:18:14.9834632Z cvt.rn.f32.s16 %r1112, %rs361; 2026-02-21T09:18:14.9834695Z prmt.b32 %r1113, %r1095, 0, 0x9991U; 2026-02-21T09:18:14.9834756Z cvt.u16.u32 %rs362, %r1113; 2026-02-21T09:18:14.9834822Z cvt.rn.f32.s16 %r1114, %rs362; 2026-02-21T09:18:14.9834885Z prmt.b32 %r1115, %r1100, 0, 0x9991U; 2026-02-21T09:18:14.9834944Z cvt.u16.u32 %rs363, %r1115; 2026-02-21T09:18:14.9835012Z cvt.rn.f32.s16 %r1116, %rs363; 2026-02-21T09:18:14.9835074Z prmt.b32 %r1117, %r1099, 0, 0x9991U; 2026-02-21T09:18:14.9835134Z cvt.u16.u32 %rs364, %r1117; 2026-02-21T09:18:14.9835192Z cvt.rn.f32.s16 %r1118, 
%rs364; 2026-02-21T09:18:14.9835261Z prmt.b32 %r1119, %r1098, 0, 0x9991U; 2026-02-21T09:18:14.9835321Z cvt.u16.u32 %rs365, %r1119; 2026-02-21T09:18:14.9835380Z cvt.rn.f32.s16 %r1120, %rs365; 2026-02-21T09:18:14.9835449Z prmt.b32 %r1121, %r1097, 0, 0x9991U; 2026-02-21T09:18:14.9835507Z cvt.u16.u32 %rs366, %r1121; 2026-02-21T09:18:14.9835567Z cvt.rn.f32.s16 %r1122, %rs366; 2026-02-21T09:18:14.9835629Z prmt.b32 %r1123, %r1102, 0, 0x9991U; 2026-02-21T09:18:14.9835696Z cvt.u16.u32 %rs367, %r1123; 2026-02-21T09:18:14.9835754Z cvt.rn.f32.s16 %r1124, %rs367; 2026-02-21T09:18:14.9835816Z prmt.b32 %r1125, %r1101, 0, 0x9991U; 2026-02-21T09:18:14.9835884Z cvt.u16.u32 %rs368, %r1125; 2026-02-21T09:18:14.9835945Z cvt.rn.f32.s16 %r1126, %rs368; 2026-02-21T09:18:14.9836009Z prmt.b32 %r1127, %r1096, 0, 0xaaa2U; 2026-02-21T09:18:14.9836068Z cvt.u16.u32 %rs369, %r1127; 2026-02-21T09:18:14.9836139Z cvt.rn.f32.s16 %r1128, %rs369; 2026-02-21T09:18:14.9836202Z prmt.b32 %r1129, %r1095, 0, 0xaaa2U; 2026-02-21T09:18:14.9836264Z cvt.u16.u32 %rs370, %r1129; 2026-02-21T09:18:14.9836330Z cvt.rn.f32.s16 %r1130, %rs370; 2026-02-21T09:18:14.9836391Z prmt.b32 %r1131, %r1100, 0, 0xaaa2U; 2026-02-21T09:18:14.9836449Z cvt.u16.u32 %rs371, %r1131; 2026-02-21T09:18:14.9836514Z cvt.rn.f32.s16 %r1132, %rs371; 2026-02-21T09:18:14.9836574Z prmt.b32 %r1133, %r1099, 0, 0xaaa2U; 2026-02-21T09:18:14.9836654Z cvt.u16.u32 %rs372, %r1133; 2026-02-21T09:18:14.9836712Z cvt.rn.f32.s16 %r1134, %rs372; 2026-02-21T09:18:14.9836782Z prmt.b32 %r1135, %r1098, 0, 0xaaa2U; 2026-02-21T09:18:14.9836860Z cvt.u16.u32 %rs373, %r1135; 2026-02-21T09:18:14.9836919Z cvt.rn.f32.s16 %r1136, %rs373; 2026-02-21T09:18:14.9837008Z prmt.b32 %r1137, %r1097, 0, 0xaaa2U; 2026-02-21T09:18:14.9837067Z cvt.u16.u32 %rs374, %r1137; 2026-02-21T09:18:14.9837151Z cvt.rn.f32.s16 %r1138, %rs374; 2026-02-21T09:18:14.9837213Z prmt.b32 %r1139, %r1102, 0, 0xaaa2U; 2026-02-21T09:18:14.9837279Z cvt.u16.u32 %rs375, %r1139; 2026-02-21T09:18:14.9837335Z 
cvt.rn.f32.s16 %r1140, %rs375; 2026-02-21T09:18:14.9837396Z prmt.b32 %r1141, %r1101, 0, 0xaaa2U; 2026-02-21T09:18:14.9837462Z cvt.u16.u32 %rs376, %r1141; 2026-02-21T09:18:14.9837521Z cvt.rn.f32.s16 %r1142, %rs376; 2026-02-21T09:18:14.9837581Z prmt.b32 %r1143, %r1096, 0, 0xbbb3U; 2026-02-21T09:18:14.9837646Z cvt.u16.u32 %rs377, %r1143; 2026-02-21T09:18:14.9837706Z cvt.rn.f32.s16 %r1144, %rs377; 2026-02-21T09:18:14.9837766Z prmt.b32 %r1145, %r1095, 0, 0xbbb3U; 2026-02-21T09:18:14.9837824Z cvt.u16.u32 %rs378, %r1145; 2026-02-21T09:18:14.9837892Z cvt.rn.f32.s16 %r1146, %rs378; 2026-02-21T09:18:14.9837953Z prmt.b32 %r1147, %r1100, 0, 0xbbb3U; 2026-02-21T09:18:14.9838012Z cvt.u16.u32 %rs379, %r1147; 2026-02-21T09:18:14.9838077Z cvt.rn.f32.s16 %r1148, %rs379; 2026-02-21T09:18:14.9838138Z prmt.b32 %r1149, %r1099, 0, 0xbbb3U; 2026-02-21T09:18:14.9838197Z cvt.u16.u32 %rs380, %r1149; 2026-02-21T09:18:14.9838255Z cvt.rn.f32.s16 %r1150, %rs380; 2026-02-21T09:18:14.9838324Z prmt.b32 %r1151, %r1098, 0, 0xbbb3U; 2026-02-21T09:18:14.9838382Z cvt.u16.u32 %rs381, %r1151; 2026-02-21T09:18:14.9838441Z cvt.rn.f32.s16 %r1152, %rs381; 2026-02-21T09:18:14.9838535Z prmt.b32 %r1153, %r1097, 0, 0xbbb3U; 2026-02-21T09:18:14.9838596Z cvt.u16.u32 %rs382, %r1153; 2026-02-21T09:18:14.9838657Z cvt.rn.f32.s16 %r1154, %rs382; 2026-02-21T09:18:14.9838722Z prmt.b32 %r1155, %r1102, 0, 0xbbb3U; 2026-02-21T09:18:14.9838789Z cvt.u16.u32 %rs383, %r1155; 2026-02-21T09:18:14.9838850Z cvt.rn.f32.s16 %r1156, %rs383; 2026-02-21T09:18:14.9838914Z prmt.b32 %r1157, %r1101, 0, 0xbbb3U; 2026-02-21T09:18:14.9838982Z cvt.u16.u32 %rs384, %r1157; 2026-02-21T09:18:14.9839043Z cvt.rn.f32.s16 %r1158, %rs384; 2026-02-21T09:18:14.9839101Z bar.sync 0; 2026-02-21T09:18:14.9839212Z st.shared.v4.b32 [%r49], {%r1104, %r1106, %r1103, %r1105}; 2026-02-21T09:18:14.9839313Z st.shared.v4.b32 [%r50], {%r1108, %r1110, %r1107, %r1109}; 2026-02-21T09:18:14.9839410Z st.shared.v4.b32 [%r51], {%r1114, %r1118, %r1112, %r1116}; 
2026-02-21T09:18:14.9839504Z st.shared.v4.b32 [%r52], {%r1122, %r1126, %r1120, %r1124}; 2026-02-21T09:18:14.9839601Z st.shared.v4.b32 [%r53], {%r1130, %r1134, %r1128, %r1132}; 2026-02-21T09:18:14.9839693Z st.shared.v4.b32 [%r54], {%r1138, %r1142, %r1136, %r1140}; 2026-02-21T09:18:14.9839786Z st.shared.v4.b32 [%r55], {%r1146, %r1150, %r1144, %r1148}; 2026-02-21T09:18:14.9839887Z st.shared.v4.b32 [%r56], {%r1154, %r1158, %r1152, %r1156}; 2026-02-21T09:18:14.9839944Z $L__tmp79: 2026-02-21T09:18:14.9840172Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9840240Z // begin inline asm 2026-02-21T09:18:14.9840539Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r899, %r900, %r901, %r902, %r903, %r904, %r905, %r906, %r907, %r908, %r909, %r910, %r911, %r912, %r913, %r914}, [%r915 + 0], 64; 2026-02-21T09:18:14.9840597Z // end inline asm 2026-02-21T09:18:14.9840657Z // begin inline asm 2026-02-21T09:18:14.9840957Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r916, %r917, %r918, %r919, %r920, %r921, %r922, %r923, %r924, %r925, %r926, %r927, %r928, %r929, %r930, %r931}, [%r915 + 16], 64; 2026-02-21T09:18:14.9841012Z // end inline asm 2026-02-21T09:18:14.9841071Z // begin inline asm 2026-02-21T09:18:14.9841360Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r933, %r934, %r935, %r936, %r937, %r938, %r939, %r940, %r941, %r942, %r943, %r944, %r945, %r946, %r947, %r948}, [%r915 + 32], 64; 2026-02-21T09:18:14.9841442Z // end inline asm 2026-02-21T09:18:14.9841522Z // begin inline asm 2026-02-21T09:18:14.9841858Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r950, %r951, %r952, %r953, %r954, %r955, %r956, %r957, %r958, %r959, %r960, %r961, %r962, %r963, %r964, %r965}, [%r915 + 48], 64; 2026-02-21T09:18:14.9841961Z // end inline asm 2026-02-21T09:18:14.9842021Z // begin inline asm 2026-02-21T09:18:14.9842103Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:14.9842158Z // end inline asm 2026-02-21T09:18:14.9842214Z // begin 
inline asm 2026-02-21T09:18:14.9842508Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 0], 64, {%r899, %r900, %r901, %r902, %r903, %r904, %r905, %r906, %r907, %r908, %r909, %r910, %r911, %r912, %r913, %r914}; 2026-02-21T09:18:14.9842565Z // end inline asm 2026-02-21T09:18:14.9842621Z // begin inline asm 2026-02-21T09:18:14.9842914Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 16], 64, {%r916, %r917, %r918, %r919, %r920, %r921, %r922, %r923, %r924, %r925, %r926, %r927, %r928, %r929, %r930, %r931}; 2026-02-21T09:18:14.9842977Z // end inline asm 2026-02-21T09:18:14.9843035Z // begin inline asm 2026-02-21T09:18:14.9843323Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 32], 64, {%r933, %r934, %r935, %r936, %r937, %r938, %r939, %r940, %r941, %r942, %r943, %r944, %r945, %r946, %r947, %r948}; 2026-02-21T09:18:14.9843383Z // end inline asm 2026-02-21T09:18:14.9843440Z // begin inline asm 2026-02-21T09:18:14.9843724Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 48], 64, {%r950, %r951, %r952, %r953, %r954, %r955, %r956, %r957, %r958, %r959, %r960, %r961, %r962, %r963, %r964, %r965}; 2026-02-21T09:18:14.9843782Z // end inline asm 2026-02-21T09:18:14.9843839Z // begin inline asm 2026-02-21T09:18:14.9843906Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9843963Z // end inline asm 2026-02-21T09:18:14.9844027Z bar.sync 0; 2026-02-21T09:18:14.9844085Z // begin inline asm 2026-02-21T09:18:14.9844409Z @%p38 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5556 + 0], 16, {%r1036, %r1037, %r1038, %r1039, %r1040, %r1041, %r1042, %r1043, %r1044, %r1045, %r1046, %r1047, %r1048, %r1049, %r1050, %r1051}; 2026-02-21T09:18:14.9844476Z // end inline asm 2026-02-21T09:18:14.9844532Z // begin inline asm 2026-02-21T09:18:14.9844598Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9844657Z // end inline asm 2026-02-21T09:18:14.9844711Z bar.sync 0; 2026-02-21T09:18:14.9844765Z // begin inline asm 2026-02-21T09:18:14.9844838Z 
fence.proxy.async.shared::cta; 2026-02-21T09:18:14.9844898Z // end inline asm 2026-02-21T09:18:14.9844952Z // begin inline asm 2026-02-21T09:18:14.9845044Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:14.9845101Z // end inline asm 2026-02-21T09:18:14.9845153Z bar.sync 0; 2026-02-21T09:18:14.9845210Z @%p20 bra $L__BB0_9; 2026-02-21T09:18:14.9845309Z // %bb.8: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:18:14.9845382Z elect.sync %r1171|%p56, -1; 2026-02-21T09:18:14.9845442Z mov.b32 %r1161, 69208336; 2026-02-21T09:18:14.9845502Z mov.pred %p55, -1; 2026-02-21T09:18:14.9845560Z // begin inline asm 2026-02-21T09:18:14.9845717Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 0 ], %rd1, %r1161, %p55; 2026-02-21T09:18:14.9845770Z // end inline asm 2026-02-21T09:18:14.9845834Z // begin inline asm 2026-02-21T09:18:14.9845990Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 8 ], %rd2, %r1161, %p55; 2026-02-21T09:18:14.9846047Z // end inline asm 2026-02-21T09:18:14.9846104Z // begin inline asm 2026-02-21T09:18:14.9846277Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 16 ], %rd3, %r1161, %p55; 2026-02-21T09:18:14.9846331Z // end inline asm 2026-02-21T09:18:14.9846388Z // begin inline asm 2026-02-21T09:18:14.9846542Z @%p56 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 24 ], %rd4, %r1161, %p55; 2026-02-21T09:18:14.9846626Z // end inline asm 2026-02-21T09:18:14.9846687Z cvt.u64.u32 %rd107, %r5985; 2026-02-21T09:18:14.9846750Z // begin inline asm 2026-02-21T09:18:14.9846902Z @%p56 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd107]; 2026-02-21T09:18:14.9846957Z // end inline asm 2026-02-21T09:18:14.9847082Z $L__BB0_9: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:18:14.9847193Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:14.9847251Z mov.b32 %r1175, 0; 2026-02-21T09:18:14.9847464Z .loc 2 291 36 // standard.py:291:36 @[ 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9847529Z // begin inline asm 2026-02-21T09:18:14.9847581Z 2026-02-21T09:18:14.9847631Z { 2026-02-21T09:18:14.9847702Z .reg .pred complete; 2026-02-21T09:18:14.9847756Z waitLoop: 2026-02-21T09:18:14.9847877Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r1175; 2026-02-21T09:18:14.9847944Z @!complete bra.uni waitLoop; 2026-02-21T09:18:14.9848001Z } 2026-02-21T09:18:14.9848005Z 2026-02-21T09:18:14.9848059Z // end inline asm 2026-02-21T09:18:14.9848114Z bar.sync 0; 2026-02-21T09:18:14.9848177Z // begin inline asm 2026-02-21T09:18:14.9848264Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:14.9848319Z // end inline asm 2026-02-21T09:18:14.9848379Z $L__tmp80: 2026-02-21T09:18:14.9848547Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:14.9848608Z add.s64 %rd109, %rd64, 192; 2026-02-21T09:18:14.9848767Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:14.9848832Z add.s64 %rd112, %rd67, 192; 2026-02-21T09:18:14.9848887Z // begin inline asm 2026-02-21T09:18:14.9848941Z mov.u64 %rd108, 0x0; 2026-02-21T09:18:14.9849055Z createpolicy.fractional.L2::evict_first.b64 %rd108, 1.0; 2026-02-21T09:18:14.9849107Z // end inline asm 2026-02-21T09:18:14.9849163Z // begin inline asm 2026-02-21T09:18:14.9849217Z mov.u32 %r1177, 0x0; 2026-02-21T09:18:14.9849277Z mov.u32 %r1178, 0x0; 2026-02-21T09:18:14.9849332Z mov.u32 %r1179, 0x0; 2026-02-21T09:18:14.9849384Z mov.u32 %r1180, 0x0; 2026-02-21T09:18:14.9849570Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r1177, %r1178, %r1179, %r1180 }, [ %rd109 + 0 ], %rd108; 2026-02-21T09:18:14.9849627Z // end inline asm 2026-02-21T09:18:14.9849680Z // begin inline asm 2026-02-21T09:18:14.9849739Z mov.u64 %rd111, 0x0; 2026-02-21T09:18:14.9849846Z createpolicy.fractional.L2::evict_first.b64 %rd111, 1.0; 2026-02-21T09:18:14.9849898Z // end inline asm 
2026-02-21T09:18:14.9849950Z // begin inline asm 2026-02-21T09:18:14.9850010Z mov.u32 %r1181, 0x0; 2026-02-21T09:18:14.9850062Z mov.u32 %r1182, 0x0; 2026-02-21T09:18:14.9850113Z mov.u32 %r1183, 0x0; 2026-02-21T09:18:14.9850169Z mov.u32 %r1184, 0x0; 2026-02-21T09:18:14.9850342Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r1181, %r1182, %r1183, %r1184 }, [ %rd112 + 0 ], %rd111; 2026-02-21T09:18:14.9850395Z // end inline asm 2026-02-21T09:18:14.9850450Z $L__tmp81: 2026-02-21T09:18:14.9850655Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9850749Z st.shared.v4.b32 [%r26], {%r1177, %r1178, %r1179, %r1180}; 2026-02-21T09:18:14.9850851Z st.shared.v4.b32 [%r26+512], {%r1181, %r1182, %r1183, %r1184}; 2026-02-21T09:18:14.9850906Z bar.sync 0; 2026-02-21T09:18:14.9850995Z ld.shared.v4.b32 {%r1344, %r1345, %r1346, %r1347}, [%r27]; 2026-02-21T09:18:14.9851059Z mov.b32 {%rs385, %rs386}, %r1347; 2026-02-21T09:18:14.9851127Z mov.b32 {%rs387, %rs388}, %r1346; 2026-02-21T09:18:14.9851185Z mov.b32 {%rs389, %rs390}, %r1345; 2026-02-21T09:18:14.9851244Z mov.b32 {%rs391, %rs392}, %r1344; 2026-02-21T09:18:14.9851336Z ld.shared.v4.b32 {%r1348, %r1349, %r1350, %r1351}, [%r28]; 2026-02-21T09:18:14.9851412Z mov.b32 {%rs393, %rs394}, %r1351; 2026-02-21T09:18:14.9851471Z mov.b32 {%rs395, %rs396}, %r1350; 2026-02-21T09:18:14.9851527Z mov.b32 {%rs397, %rs398}, %r1349; 2026-02-21T09:18:14.9851612Z mov.b32 {%rs399, %rs400}, %r1348; 2026-02-21T09:18:14.9851687Z $L__tmp82: 2026-02-21T09:18:14.9851870Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:14.9851939Z cvt.f32.bf16 %r1326, %rs391; 2026-02-21T09:18:14.9852017Z cvt.f32.bf16 %r1327, %rs392; 2026-02-21T09:18:14.9852077Z cvt.f32.bf16 %r1328, %rs389; 2026-02-21T09:18:14.9852137Z cvt.f32.bf16 %r1329, %rs390; 2026-02-21T09:18:14.9852197Z cvt.f32.bf16 %r1330, %rs387; 2026-02-21T09:18:14.9852254Z cvt.f32.bf16 %r1331, 
%rs388; 2026-02-21T09:18:14.9852309Z cvt.f32.bf16 %r1332, %rs385; 2026-02-21T09:18:14.9852367Z cvt.f32.bf16 %r1333, %rs386; 2026-02-21T09:18:14.9852423Z cvt.f32.bf16 %r1334, %rs399; 2026-02-21T09:18:14.9852478Z cvt.f32.bf16 %r1335, %rs400; 2026-02-21T09:18:14.9852538Z cvt.f32.bf16 %r1336, %rs397; 2026-02-21T09:18:14.9852592Z cvt.f32.bf16 %r1337, %rs398; 2026-02-21T09:18:14.9852647Z cvt.f32.bf16 %r1338, %rs395; 2026-02-21T09:18:14.9852704Z cvt.f32.bf16 %r1339, %rs396; 2026-02-21T09:18:14.9852769Z cvt.f32.bf16 %r1340, %rs393; 2026-02-21T09:18:14.9852824Z cvt.f32.bf16 %r1341, %rs394; 2026-02-21T09:18:14.9852985Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:14.9853051Z add.s32 %r1352, %r7843, 393216; 2026-02-21T09:18:14.9853108Z cvt.s64.s32 %rd117, %r1352; 2026-02-21T09:18:14.9853171Z add.s64 %rd115, %rd56, %rd117; 2026-02-21T09:18:14.9853324Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:14.9853386Z // begin inline asm 2026-02-21T09:18:14.9853439Z mov.u64 %rd114, 0x0; 2026-02-21T09:18:14.9853540Z createpolicy.fractional.L2::evict_first.b64 %rd114, 1.0; 2026-02-21T09:18:14.9853601Z // end inline asm 2026-02-21T09:18:14.9853660Z // begin inline asm 2026-02-21T09:18:14.9853711Z mov.u32 %r1185, 0x0; 2026-02-21T09:18:14.9853768Z mov.u32 %r1186, 0x0; 2026-02-21T09:18:14.9853822Z mov.u32 %r1187, 0x0; 2026-02-21T09:18:14.9853875Z mov.u32 %r1188, 0x0; 2026-02-21T09:18:14.9854046Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r1185, %r1186, %r1187, %r1188 }, [ %rd115 + 0 ], %rd114; 2026-02-21T09:18:14.9854107Z // end inline asm 2026-02-21T09:18:14.9854175Z prmt.b32 %r1353, %r1185, 0, 0x8880U; 2026-02-21T09:18:14.9854233Z cvt.u16.u32 %rs401, %r1353; 2026-02-21T09:18:14.9854302Z prmt.b32 %r1354, %r1185, 0, 0x7770U; 2026-02-21T09:18:14.9854358Z cvt.u16.u32 %rs402, %r1354; 2026-02-21T09:18:14.9854419Z prmt.b32 %r1355, %r1185, 0, 0x9991U; 2026-02-21T09:18:14.9854474Z cvt.u16.u32 
%rs403, %r1355; 2026-02-21T09:18:14.9854540Z prmt.b32 %r1356, %r1185, 0, 0x7771U; 2026-02-21T09:18:14.9854594Z cvt.u16.u32 %rs404, %r1356; 2026-02-21T09:18:14.9854651Z prmt.b32 %r1357, %r1185, 0, 0xaaa2U; 2026-02-21T09:18:14.9854716Z cvt.u16.u32 %rs405, %r1357; 2026-02-21T09:18:14.9854772Z prmt.b32 %r1358, %r1185, 0, 0x7772U; 2026-02-21T09:18:14.9854828Z cvt.u16.u32 %rs406, %r1358; 2026-02-21T09:18:14.9854893Z prmt.b32 %r1359, %r1185, 0, 0xbbb3U; 2026-02-21T09:18:14.9854950Z cvt.u16.u32 %rs407, %r1359; 2026-02-21T09:18:14.9855008Z prmt.b32 %r1360, %r1185, 0, 0x7773U; 2026-02-21T09:18:14.9855062Z cvt.u16.u32 %rs408, %r1360; 2026-02-21T09:18:14.9855123Z prmt.b32 %r1361, %r1186, 0, 0x8880U; 2026-02-21T09:18:14.9855176Z cvt.u16.u32 %rs409, %r1361; 2026-02-21T09:18:14.9855236Z prmt.b32 %r1362, %r1186, 0, 0x7770U; 2026-02-21T09:18:14.9855297Z cvt.u16.u32 %rs410, %r1362; 2026-02-21T09:18:14.9855354Z prmt.b32 %r1363, %r1186, 0, 0x9991U; 2026-02-21T09:18:14.9855408Z cvt.u16.u32 %rs411, %r1363; 2026-02-21T09:18:14.9855468Z prmt.b32 %r1364, %r1186, 0, 0x7771U; 2026-02-21T09:18:14.9855527Z cvt.u16.u32 %rs412, %r1364; 2026-02-21T09:18:14.9855587Z prmt.b32 %r1365, %r1186, 0, 0xaaa2U; 2026-02-21T09:18:14.9855666Z cvt.u16.u32 %rs413, %r1365; 2026-02-21T09:18:14.9855730Z prmt.b32 %r1366, %r1186, 0, 0x7772U; 2026-02-21T09:18:14.9855785Z cvt.u16.u32 %rs414, %r1366; 2026-02-21T09:18:14.9855861Z prmt.b32 %r1367, %r1186, 0, 0xbbb3U; 2026-02-21T09:18:14.9855920Z cvt.u16.u32 %rs415, %r1367; 2026-02-21T09:18:14.9855976Z prmt.b32 %r1368, %r1186, 0, 0x7773U; 2026-02-21T09:18:14.9856049Z cvt.u16.u32 %rs416, %r1368; 2026-02-21T09:18:14.9856126Z prmt.b32 %r1369, %r1187, 0, 0x8880U; 2026-02-21T09:18:14.9856189Z cvt.u16.u32 %rs417, %r1369; 2026-02-21T09:18:14.9856246Z prmt.b32 %r1370, %r1187, 0, 0x7770U; 2026-02-21T09:18:14.9856300Z cvt.u16.u32 %rs418, %r1370; 2026-02-21T09:18:14.9856365Z prmt.b32 %r1371, %r1187, 0, 0x9991U; 2026-02-21T09:18:14.9856418Z cvt.u16.u32 %rs419, %r1371; 
2026-02-21T09:18:14.9856477Z prmt.b32 %r1372, %r1187, 0, 0x7771U; 2026-02-21T09:18:14.9856530Z cvt.u16.u32 %rs420, %r1372; 2026-02-21T09:18:14.9856591Z prmt.b32 %r1373, %r1187, 0, 0xaaa2U; 2026-02-21T09:18:14.9856651Z cvt.u16.u32 %rs421, %r1373; 2026-02-21T09:18:14.9856708Z prmt.b32 %r1374, %r1187, 0, 0x7772U; 2026-02-21T09:18:14.9856765Z cvt.u16.u32 %rs422, %r1374; 2026-02-21T09:18:14.9856821Z prmt.b32 %r1375, %r1187, 0, 0xbbb3U; 2026-02-21T09:18:14.9856879Z cvt.u16.u32 %rs423, %r1375; 2026-02-21T09:18:14.9856938Z prmt.b32 %r1376, %r1187, 0, 0x7773U; 2026-02-21T09:18:14.9857003Z cvt.u16.u32 %rs424, %r1376; 2026-02-21T09:18:14.9857061Z prmt.b32 %r1377, %r1188, 0, 0x8880U; 2026-02-21T09:18:14.9857115Z cvt.u16.u32 %rs425, %r1377; 2026-02-21T09:18:14.9857175Z prmt.b32 %r1378, %r1188, 0, 0x7770U; 2026-02-21T09:18:14.9857230Z cvt.u16.u32 %rs426, %r1378; 2026-02-21T09:18:14.9857286Z prmt.b32 %r1379, %r1188, 0, 0x9991U; 2026-02-21T09:18:14.9857344Z cvt.u16.u32 %rs427, %r1379; 2026-02-21T09:18:14.9857402Z prmt.b32 %r1380, %r1188, 0, 0x7771U; 2026-02-21T09:18:14.9857457Z cvt.u16.u32 %rs428, %r1380; 2026-02-21T09:18:14.9857514Z prmt.b32 %r1381, %r1188, 0, 0xaaa2U; 2026-02-21T09:18:14.9857579Z cvt.u16.u32 %rs429, %r1381; 2026-02-21T09:18:14.9857641Z prmt.b32 %r1382, %r1188, 0, 0x7772U; 2026-02-21T09:18:14.9857699Z cvt.u16.u32 %rs430, %r1382; 2026-02-21T09:18:14.9857765Z prmt.b32 %r1383, %r1188, 0, 0xbbb3U; 2026-02-21T09:18:14.9857825Z cvt.u16.u32 %rs431, %r1383; 2026-02-21T09:18:14.9857885Z prmt.b32 %r1384, %r1188, 0, 0x7773U; 2026-02-21T09:18:14.9857944Z cvt.u16.u32 %rs432, %r1384; 2026-02-21T09:18:14.9858114Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:14.9858172Z shl.b16 %rs433, %rs402, 12; 2026-02-21T09:18:14.9858231Z shr.s16 %rs434, %rs433, 12; 2026-02-21T09:18:14.9858296Z shl.b16 %rs435, %rs404, 12; 2026-02-21T09:18:14.9858355Z shr.s16 %rs436, %rs435, 12; 2026-02-21T09:18:14.9858412Z shl.b16 %rs437, %rs406, 12; 
2026-02-21T09:18:14.9858478Z shr.s16 %rs438, %rs437, 12; 2026-02-21T09:18:14.9858535Z shl.b16 %rs439, %rs408, 12; 2026-02-21T09:18:14.9858592Z shr.s16 %rs440, %rs439, 12; 2026-02-21T09:18:14.9858652Z shl.b16 %rs441, %rs410, 12; 2026-02-21T09:18:14.9858718Z shr.s16 %rs442, %rs441, 12; 2026-02-21T09:18:14.9858775Z shl.b16 %rs443, %rs412, 12; 2026-02-21T09:18:14.9858833Z shr.s16 %rs444, %rs443, 12; 2026-02-21T09:18:14.9858900Z shl.b16 %rs445, %rs414, 12; 2026-02-21T09:18:14.9858958Z shr.s16 %rs446, %rs445, 12; 2026-02-21T09:18:14.9859018Z shl.b16 %rs447, %rs416, 12; 2026-02-21T09:18:14.9859076Z shr.s16 %rs448, %rs447, 12; 2026-02-21T09:18:14.9859149Z shl.b16 %rs449, %rs418, 12; 2026-02-21T09:18:14.9859209Z shr.s16 %rs450, %rs449, 12; 2026-02-21T09:18:14.9859268Z shl.b16 %rs451, %rs420, 12; 2026-02-21T09:18:14.9859332Z shr.s16 %rs452, %rs451, 12; 2026-02-21T09:18:14.9859389Z shl.b16 %rs453, %rs422, 12; 2026-02-21T09:18:14.9859446Z shr.s16 %rs454, %rs453, 12; 2026-02-21T09:18:14.9859503Z shl.b16 %rs455, %rs424, 12; 2026-02-21T09:18:14.9859569Z shr.s16 %rs456, %rs455, 12; 2026-02-21T09:18:14.9859627Z shl.b16 %rs457, %rs426, 12; 2026-02-21T09:18:14.9859704Z shr.s16 %rs458, %rs457, 12; 2026-02-21T09:18:14.9859769Z shl.b16 %rs459, %rs428, 12; 2026-02-21T09:18:14.9859825Z shr.s16 %rs460, %rs459, 12; 2026-02-21T09:18:14.9859882Z shl.b16 %rs461, %rs430, 12; 2026-02-21T09:18:14.9859970Z shr.s16 %rs462, %rs461, 12; 2026-02-21T09:18:14.9860029Z shl.b16 %rs463, %rs432, 12; 2026-02-21T09:18:14.9860135Z shr.s16 %rs464, %rs463, 12; 2026-02-21T09:18:14.9860314Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:14.9860385Z shr.u16 %rs465, %rs401, 4; 2026-02-21T09:18:14.9860446Z shr.u16 %rs466, %rs403, 4; 2026-02-21T09:18:14.9860504Z shr.u16 %rs467, %rs405, 4; 2026-02-21T09:18:14.9860567Z shr.u16 %rs468, %rs407, 4; 2026-02-21T09:18:14.9860625Z shr.u16 %rs469, %rs409, 4; 2026-02-21T09:18:14.9860682Z shr.u16 %rs470, %rs411, 4; 
2026-02-21T09:18:14.9860739Z shr.u16 %rs471, %rs413, 4; 2026-02-21T09:18:14.9860805Z shr.u16 %rs472, %rs415, 4; 2026-02-21T09:18:14.9860864Z shr.u16 %rs473, %rs417, 4; 2026-02-21T09:18:14.9860924Z shr.u16 %rs474, %rs419, 4; 2026-02-21T09:18:14.9860991Z shr.u16 %rs475, %rs421, 4; 2026-02-21T09:18:14.9861048Z shr.u16 %rs476, %rs423, 4; 2026-02-21T09:18:14.9861106Z shr.u16 %rs477, %rs425, 4; 2026-02-21T09:18:14.9861165Z shr.u16 %rs478, %rs427, 4; 2026-02-21T09:18:14.9861229Z shr.u16 %rs479, %rs429, 4; 2026-02-21T09:18:14.9861289Z shr.u16 %rs480, %rs431, 4; 2026-02-21T09:18:14.9861449Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:14.9861513Z bar.sync 0; 2026-02-21T09:18:14.9861608Z st.shared.b8 [%r29], %rs434; 2026-02-21T09:18:14.9861670Z st.shared.b8 [%r30], %rs436; 2026-02-21T09:18:14.9861737Z st.shared.b8 [%r31], %rs438; 2026-02-21T09:18:14.9861797Z st.shared.b8 [%r32], %rs440; 2026-02-21T09:18:14.9861861Z st.shared.b8 [%r33+512], %rs442; 2026-02-21T09:18:14.9861921Z st.shared.b8 [%r34+512], %rs444; 2026-02-21T09:18:14.9861990Z st.shared.b8 [%r35+512], %rs446; 2026-02-21T09:18:14.9862051Z st.shared.b8 [%r36+512], %rs448; 2026-02-21T09:18:14.9862114Z st.shared.b8 [%r37+1024], %rs450; 2026-02-21T09:18:14.9862182Z st.shared.b8 [%r38+1024], %rs452; 2026-02-21T09:18:14.9862244Z st.shared.b8 [%r39+1024], %rs454; 2026-02-21T09:18:14.9862302Z st.shared.b8 [%r40+1024], %rs456; 2026-02-21T09:18:14.9862362Z st.shared.b8 [%r41+1536], %rs458; 2026-02-21T09:18:14.9862429Z st.shared.b8 [%r42+1536], %rs460; 2026-02-21T09:18:14.9862489Z st.shared.b8 [%r43+1536], %rs462; 2026-02-21T09:18:14.9862548Z st.shared.b8 [%r44+1536], %rs464; 2026-02-21T09:18:14.9862609Z bar.sync 0; 2026-02-21T09:18:14.9862671Z ld.shared.b32 %r1385, [%r45]; 2026-02-21T09:18:14.9862732Z ld.shared.b32 %r1386, [%r46]; 2026-02-21T09:18:14.9862791Z ld.shared.b32 %r1387, [%r47]; 2026-02-21T09:18:14.9862858Z ld.shared.b32 %r1388, [%r48]; 2026-02-21T09:18:14.9863017Z 
.loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:14.9863073Z bar.sync 0; 2026-02-21T09:18:14.9863140Z st.shared.b8 [%r29], %rs465; 2026-02-21T09:18:14.9863199Z st.shared.b8 [%r30], %rs466; 2026-02-21T09:18:14.9863257Z st.shared.b8 [%r31], %rs467; 2026-02-21T09:18:14.9863325Z st.shared.b8 [%r32], %rs468; 2026-02-21T09:18:14.9863384Z st.shared.b8 [%r33+512], %rs469; 2026-02-21T09:18:14.9863447Z st.shared.b8 [%r34+512], %rs470; 2026-02-21T09:18:14.9863506Z st.shared.b8 [%r35+512], %rs471; 2026-02-21T09:18:14.9863575Z st.shared.b8 [%r36+512], %rs472; 2026-02-21T09:18:14.9863634Z st.shared.b8 [%r37+1024], %rs473; 2026-02-21T09:18:14.9863693Z st.shared.b8 [%r38+1024], %rs474; 2026-02-21T09:18:14.9863756Z st.shared.b8 [%r39+1024], %rs475; 2026-02-21T09:18:14.9863816Z st.shared.b8 [%r40+1024], %rs476; 2026-02-21T09:18:14.9863876Z st.shared.b8 [%r41+1536], %rs477; 2026-02-21T09:18:14.9863934Z st.shared.b8 [%r42+1536], %rs478; 2026-02-21T09:18:14.9864000Z st.shared.b8 [%r43+1536], %rs479; 2026-02-21T09:18:14.9864060Z st.shared.b8 [%r44+1536], %rs480; 2026-02-21T09:18:14.9864141Z bar.sync 0; 2026-02-21T09:18:14.9864210Z ld.shared.b32 %r1389, [%r45]; 2026-02-21T09:18:14.9864271Z ld.shared.b32 %r1390, [%r46]; 2026-02-21T09:18:14.9864358Z ld.shared.b32 %r1391, [%r47]; 2026-02-21T09:18:14.9864424Z ld.shared.b32 %r1392, [%r48]; 2026-02-21T09:18:14.9864610Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:14.9864694Z cvt.s8.s32 %rs481, %r1386; 2026-02-21T09:18:14.9864758Z cvt.rn.f32.s16 %r1393, %rs481; 2026-02-21T09:18:14.9864827Z cvt.s8.s32 %rs482, %r1385; 2026-02-21T09:18:14.9864888Z cvt.rn.f32.s16 %r1394, %rs482; 2026-02-21T09:18:14.9864946Z cvt.s8.s32 %rs483, %r1390; 2026-02-21T09:18:14.9865015Z cvt.rn.f32.s16 %r1395, %rs483; 2026-02-21T09:18:14.9865075Z cvt.s8.s32 %rs484, %r1389; 2026-02-21T09:18:14.9865134Z cvt.rn.f32.s16 %r1396, %rs484; 2026-02-21T09:18:14.9865191Z cvt.s8.s32 %rs485, 
%r1388; 2026-02-21T09:18:14.9865258Z cvt.rn.f32.s16 %r1397, %rs485; 2026-02-21T09:18:14.9865318Z cvt.s8.s32 %rs486, %r1387; 2026-02-21T09:18:14.9865378Z cvt.rn.f32.s16 %r1398, %rs486; 2026-02-21T09:18:14.9865444Z cvt.s8.s32 %rs487, %r1392; 2026-02-21T09:18:14.9865506Z cvt.rn.f32.s16 %r1399, %rs487; 2026-02-21T09:18:14.9865565Z cvt.s8.s32 %rs488, %r1391; 2026-02-21T09:18:14.9865625Z cvt.rn.f32.s16 %r1400, %rs488; 2026-02-21T09:18:14.9865701Z prmt.b32 %r1401, %r1386, 0, 0x9991U; 2026-02-21T09:18:14.9865763Z cvt.u16.u32 %rs489, %r1401; 2026-02-21T09:18:14.9865826Z cvt.rn.f32.s16 %r1402, %rs489; 2026-02-21T09:18:14.9865898Z prmt.b32 %r1403, %r1385, 0, 0x9991U; 2026-02-21T09:18:14.9865959Z cvt.u16.u32 %rs490, %r1403; 2026-02-21T09:18:14.9866018Z cvt.rn.f32.s16 %r1404, %rs490; 2026-02-21T09:18:14.9866089Z prmt.b32 %r1405, %r1390, 0, 0x9991U; 2026-02-21T09:18:14.9866147Z cvt.u16.u32 %rs491, %r1405; 2026-02-21T09:18:14.9866206Z cvt.rn.f32.s16 %r1406, %rs491; 2026-02-21T09:18:14.9866266Z prmt.b32 %r1407, %r1389, 0, 0x9991U; 2026-02-21T09:18:14.9866332Z cvt.u16.u32 %rs492, %r1407; 2026-02-21T09:18:14.9866390Z cvt.rn.f32.s16 %r1408, %rs492; 2026-02-21T09:18:14.9866450Z prmt.b32 %r1409, %r1388, 0, 0x9991U; 2026-02-21T09:18:14.9866515Z cvt.u16.u32 %rs493, %r1409; 2026-02-21T09:18:14.9866575Z cvt.rn.f32.s16 %r1410, %rs493; 2026-02-21T09:18:14.9866636Z prmt.b32 %r1411, %r1387, 0, 0x9991U; 2026-02-21T09:18:14.9866694Z cvt.u16.u32 %rs494, %r1411; 2026-02-21T09:18:14.9866763Z cvt.rn.f32.s16 %r1412, %rs494; 2026-02-21T09:18:14.9866825Z prmt.b32 %r1413, %r1392, 0, 0x9991U; 2026-02-21T09:18:14.9866883Z cvt.u16.u32 %rs495, %r1413; 2026-02-21T09:18:14.9866949Z cvt.rn.f32.s16 %r1414, %rs495; 2026-02-21T09:18:14.9867010Z prmt.b32 %r1415, %r1391, 0, 0x9991U; 2026-02-21T09:18:14.9867068Z cvt.u16.u32 %rs496, %r1415; 2026-02-21T09:18:14.9867127Z cvt.rn.f32.s16 %r1416, %rs496; 2026-02-21T09:18:14.9867196Z prmt.b32 %r1417, %r1386, 0, 0xaaa2U; 2026-02-21T09:18:14.9867252Z cvt.u16.u32 %rs497, 
%r1417; 2026-02-21T09:18:14.9867312Z cvt.rn.f32.s16 %r1418, %rs497; 2026-02-21T09:18:14.9867382Z prmt.b32 %r1419, %r1385, 0, 0xaaa2U; 2026-02-21T09:18:14.9867441Z cvt.u16.u32 %rs498, %r1419; 2026-02-21T09:18:14.9867499Z cvt.rn.f32.s16 %r1420, %rs498; 2026-02-21T09:18:14.9867569Z prmt.b32 %r1421, %r1390, 0, 0xaaa2U; 2026-02-21T09:18:14.9867627Z cvt.u16.u32 %rs499, %r1421; 2026-02-21T09:18:14.9867687Z cvt.rn.f32.s16 %r1422, %rs499; 2026-02-21T09:18:14.9867747Z prmt.b32 %r1423, %r1389, 0, 0xaaa2U; 2026-02-21T09:18:14.9867815Z cvt.u16.u32 %rs500, %r1423; 2026-02-21T09:18:14.9867875Z cvt.rn.f32.s16 %r1424, %rs500; 2026-02-21T09:18:14.9867935Z prmt.b32 %r1425, %r1388, 0, 0xaaa2U; 2026-02-21T09:18:14.9868000Z cvt.u16.u32 %rs501, %r1425; 2026-02-21T09:18:14.9868059Z cvt.rn.f32.s16 %r1426, %rs501; 2026-02-21T09:18:14.9868120Z prmt.b32 %r1427, %r1387, 0, 0xaaa2U; 2026-02-21T09:18:14.9868178Z cvt.u16.u32 %rs502, %r1427; 2026-02-21T09:18:14.9868244Z cvt.rn.f32.s16 %r1428, %rs502; 2026-02-21T09:18:14.9868304Z prmt.b32 %r1429, %r1392, 0, 0xaaa2U; 2026-02-21T09:18:14.9868385Z cvt.u16.u32 %rs503, %r1429; 2026-02-21T09:18:14.9868451Z cvt.rn.f32.s16 %r1430, %rs503; 2026-02-21T09:18:14.9868512Z prmt.b32 %r1431, %r1391, 0, 0xaaa2U; 2026-02-21T09:18:14.9868594Z cvt.u16.u32 %rs504, %r1431; 2026-02-21T09:18:14.9868654Z cvt.rn.f32.s16 %r1432, %rs504; 2026-02-21T09:18:14.9868743Z prmt.b32 %r1433, %r1386, 0, 0xbbb3U; 2026-02-21T09:18:14.9868803Z cvt.u16.u32 %rs505, %r1433; 2026-02-21T09:18:14.9868882Z cvt.rn.f32.s16 %r1434, %rs505; 2026-02-21T09:18:14.9868949Z prmt.b32 %r1435, %r1385, 0, 0xbbb3U; 2026-02-21T09:18:14.9869007Z cvt.u16.u32 %rs506, %r1435; 2026-02-21T09:18:14.9869066Z cvt.rn.f32.s16 %r1436, %rs506; 2026-02-21T09:18:14.9869134Z prmt.b32 %r1437, %r1390, 0, 0xbbb3U; 2026-02-21T09:18:14.9869192Z cvt.u16.u32 %rs507, %r1437; 2026-02-21T09:18:14.9869251Z cvt.rn.f32.s16 %r1438, %rs507; 2026-02-21T09:18:14.9869311Z prmt.b32 %r1439, %r1389, 0, 0xbbb3U; 2026-02-21T09:18:14.9869376Z 
cvt.u16.u32 %rs508, %r1439; 2026-02-21T09:18:14.9869437Z cvt.rn.f32.s16 %r1440, %rs508; 2026-02-21T09:18:14.9869498Z prmt.b32 %r1441, %r1388, 0, 0xbbb3U; 2026-02-21T09:18:14.9869563Z cvt.u16.u32 %rs509, %r1441; 2026-02-21T09:18:14.9869624Z cvt.rn.f32.s16 %r1442, %rs509; 2026-02-21T09:18:14.9869684Z prmt.b32 %r1443, %r1387, 0, 0xbbb3U; 2026-02-21T09:18:14.9869742Z cvt.u16.u32 %rs510, %r1443; 2026-02-21T09:18:14.9869809Z cvt.rn.f32.s16 %r1444, %rs510; 2026-02-21T09:18:14.9869872Z prmt.b32 %r1445, %r1392, 0, 0xbbb3U; 2026-02-21T09:18:14.9869930Z cvt.u16.u32 %rs511, %r1445; 2026-02-21T09:18:14.9869995Z cvt.rn.f32.s16 %r1446, %rs511; 2026-02-21T09:18:14.9870056Z prmt.b32 %r1447, %r1391, 0, 0xbbb3U; 2026-02-21T09:18:14.9870116Z cvt.u16.u32 %rs512, %r1447; 2026-02-21T09:18:14.9870175Z cvt.rn.f32.s16 %r1448, %rs512; 2026-02-21T09:18:14.9870237Z bar.sync 0; 2026-02-21T09:18:14.9870334Z st.shared.v4.b32 [%r49], {%r1394, %r1396, %r1393, %r1395}; 2026-02-21T09:18:14.9870428Z st.shared.v4.b32 [%r50], {%r1398, %r1400, %r1397, %r1399}; 2026-02-21T09:18:14.9870528Z st.shared.v4.b32 [%r51], {%r1404, %r1408, %r1402, %r1406}; 2026-02-21T09:18:14.9870616Z st.shared.v4.b32 [%r52], {%r1412, %r1416, %r1410, %r1414}; 2026-02-21T09:18:14.9870706Z st.shared.v4.b32 [%r53], {%r1420, %r1424, %r1418, %r1422}; 2026-02-21T09:18:14.9870802Z st.shared.v4.b32 [%r54], {%r1428, %r1432, %r1426, %r1430}; 2026-02-21T09:18:14.9870892Z st.shared.v4.b32 [%r55], {%r1436, %r1440, %r1434, %r1438}; 2026-02-21T09:18:14.9870981Z st.shared.v4.b32 [%r56], {%r1444, %r1448, %r1442, %r1446}; 2026-02-21T09:18:14.9871034Z $L__tmp83: 2026-02-21T09:18:14.9871256Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9871314Z // begin inline asm 2026-02-21T09:18:14.9871651Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1189, %r1190, %r1191, %r1192, %r1193, %r1194, %r1195, %r1196, %r1197, %r1198, %r1199, %r1200, %r1201, %r1202, %r1203, %r1204}, [%r1205 + 0], 
64; 2026-02-21T09:18:14.9871718Z // end inline asm 2026-02-21T09:18:14.9871776Z // begin inline asm 2026-02-21T09:18:14.9872086Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1206, %r1207, %r1208, %r1209, %r1210, %r1211, %r1212, %r1213, %r1214, %r1215, %r1216, %r1217, %r1218, %r1219, %r1220, %r1221}, [%r1205 + 16], 64; 2026-02-21T09:18:14.9872151Z // end inline asm 2026-02-21T09:18:14.9872209Z // begin inline asm 2026-02-21T09:18:14.9872509Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1223, %r1224, %r1225, %r1226, %r1227, %r1228, %r1229, %r1230, %r1231, %r1232, %r1233, %r1234, %r1235, %r1236, %r1237, %r1238}, [%r1205 + 32], 64; 2026-02-21T09:18:14.9872573Z // end inline asm 2026-02-21T09:18:14.9872630Z // begin inline asm 2026-02-21T09:18:14.9872929Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1240, %r1241, %r1242, %r1243, %r1244, %r1245, %r1246, %r1247, %r1248, %r1249, %r1250, %r1251, %r1252, %r1253, %r1254, %r1255}, [%r1205 + 48], 64; 2026-02-21T09:18:14.9872991Z // end inline asm 2026-02-21T09:18:14.9873076Z // begin inline asm 2026-02-21T09:18:14.9873149Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:14.9873206Z // end inline asm 2026-02-21T09:18:14.9873283Z mov.pred %p389, -1; 2026-02-21T09:18:14.9873366Z // begin inline asm 2026-02-21T09:18:14.9873705Z @%p389 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 0], 64, {%r1189, %r1190, %r1191, %r1192, %r1193, %r1194, %r1195, %r1196, %r1197, %r1198, %r1199, %r1200, %r1201, %r1202, %r1203, %r1204}; 2026-02-21T09:18:14.9873798Z // end inline asm 2026-02-21T09:18:14.9873856Z // begin inline asm 2026-02-21T09:18:14.9874169Z @%p389 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 16], 64, {%r1206, %r1207, %r1208, %r1209, %r1210, %r1211, %r1212, %r1213, %r1214, %r1215, %r1216, %r1217, %r1218, %r1219, %r1220, %r1221}; 2026-02-21T09:18:14.9874234Z // end inline asm 2026-02-21T09:18:14.9874289Z // begin inline asm 2026-02-21T09:18:14.9874595Z @%p389 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 32], 64, {%r1223, 
%r1224, %r1225, %r1226, %r1227, %r1228, %r1229, %r1230, %r1231, %r1232, %r1233, %r1234, %r1235, %r1236, %r1237, %r1238}; 2026-02-21T09:18:14.9874659Z // end inline asm 2026-02-21T09:18:14.9874714Z // begin inline asm 2026-02-21T09:18:14.9875029Z @%p389 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 48], 64, {%r1240, %r1241, %r1242, %r1243, %r1244, %r1245, %r1246, %r1247, %r1248, %r1249, %r1250, %r1251, %r1252, %r1253, %r1254, %r1255}; 2026-02-21T09:18:14.9875092Z // end inline asm 2026-02-21T09:18:14.9875149Z // begin inline asm 2026-02-21T09:18:14.9875217Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9875272Z // end inline asm 2026-02-21T09:18:14.9875334Z bar.sync 0; 2026-02-21T09:18:14.9875390Z // begin inline asm 2026-02-21T09:18:14.9875692Z @%p389 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5846 + 0], 16, {%r1326, %r1327, %r1328, %r1329, %r1330, %r1331, %r1332, %r1333, %r1334, %r1335, %r1336, %r1337, %r1338, %r1339, %r1340, %r1341}; 2026-02-21T09:18:14.9875755Z // end inline asm 2026-02-21T09:18:14.9875811Z // begin inline asm 2026-02-21T09:18:14.9875879Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9875934Z // end inline asm 2026-02-21T09:18:14.9875995Z bar.sync 0; 2026-02-21T09:18:14.9876053Z // begin inline asm 2026-02-21T09:18:14.9876126Z fence.proxy.async.shared::cta; 2026-02-21T09:18:14.9876187Z // end inline asm 2026-02-21T09:18:14.9876242Z // begin inline asm 2026-02-21T09:18:14.9876335Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:14.9876396Z // end inline asm 2026-02-21T09:18:14.9876452Z bar.sync 0; 2026-02-21T09:18:14.9876511Z @%p20 bra $L__BB0_11; 2026-02-21T09:18:14.9876610Z // %bb.10: // in Loop: Header=BB0_3 Depth=2 2026-02-21T09:18:14.9876683Z elect.sync %r1461|%p73, -1; 2026-02-21T09:18:14.9876744Z mov.b32 %r1451, 69208336; 2026-02-21T09:18:14.9876805Z mov.pred %p72, -1; 2026-02-21T09:18:14.9876866Z // begin inline asm 2026-02-21T09:18:14.9877022Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ 
%r7543 + 0 ], %rd1, %r1451, %p72; 2026-02-21T09:18:14.9877078Z // end inline asm 2026-02-21T09:18:14.9877134Z // begin inline asm 2026-02-21T09:18:14.9877292Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 8 ], %rd2, %r1451, %p72; 2026-02-21T09:18:14.9877348Z // end inline asm 2026-02-21T09:18:14.9877403Z // begin inline asm 2026-02-21T09:18:14.9877561Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 16 ], %rd3, %r1451, %p72; 2026-02-21T09:18:14.9877616Z // end inline asm 2026-02-21T09:18:14.9877672Z // begin inline asm 2026-02-21T09:18:14.9877824Z @%p73 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 24 ], %rd4, %r1451, %p72; 2026-02-21T09:18:14.9877878Z // end inline asm 2026-02-21T09:18:14.9877939Z cvt.u64.u32 %rd122, %r5985; 2026-02-21T09:18:14.9878001Z // begin inline asm 2026-02-21T09:18:14.9878128Z @%p73 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd122]; 2026-02-21T09:18:14.9878183Z // end inline asm 2026-02-21T09:18:14.9878241Z bra.uni $L__BB0_11; 2026-02-21T09:18:14.9878324Z $L__tmp84: 2026-02-21T09:18:14.9878426Z $L__BB0_12: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:18:14.9878593Z .loc 1 31 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:31:32 2026-02-21T09:18:14.9878700Z or.b32 %r1676, %r105, %r9; 2026-02-21T09:18:14.9878781Z or.b32 %r1677, %r105, %r10; 2026-02-21T09:18:14.9878839Z or.b32 %r1678, %r105, %r11; 2026-02-21T09:18:14.9878917Z or.b32 %r1679, %r105, %r12; 2026-02-21T09:18:14.9878981Z or.b32 %r1680, %r105, %r13; 2026-02-21T09:18:14.9879038Z or.b32 %r1681, %r105, %r14; 2026-02-21T09:18:14.9879094Z or.b32 %r1682, %r105, %r15; 2026-02-21T09:18:14.9879157Z or.b32 %r1683, %r105, %r16; 2026-02-21T09:18:14.9879325Z .loc 1 33 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:33:32 2026-02-21T09:18:14.9879381Z or.b32 %r1684, %r106, %r20; 2026-02-21T09:18:14.9879554Z .loc 1 88 43 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:43 
2026-02-21T09:18:14.9879613Z shl.b32 %r1685, %r1676, 13; 2026-02-21T09:18:14.9879670Z shl.b32 %r1686, %r1677, 13; 2026-02-21T09:18:14.9879728Z shl.b32 %r1687, %r1678, 13; 2026-02-21T09:18:14.9879791Z shl.b32 %r1688, %r1679, 13; 2026-02-21T09:18:14.9879847Z shl.b32 %r1689, %r1680, 13; 2026-02-21T09:18:14.9879904Z shl.b32 %r1690, %r1681, 13; 2026-02-21T09:18:14.9879968Z shl.b32 %r1691, %r1682, 13; 2026-02-21T09:18:14.9880024Z shl.b32 %r1692, %r1683, 13; 2026-02-21T09:18:14.9880183Z .loc 1 88 50 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:50 2026-02-21T09:18:14.9880252Z add.s32 %r1693, %r1685, %r1684; 2026-02-21T09:18:14.9880313Z add.s32 %r1694, %r1686, %r1684; 2026-02-21T09:18:14.9880373Z add.s32 %r1695, %r1687, %r1684; 2026-02-21T09:18:14.9880430Z add.s32 %r1696, %r1688, %r1684; 2026-02-21T09:18:14.9880495Z add.s32 %r1697, %r1689, %r1684; 2026-02-21T09:18:14.9880553Z add.s32 %r1698, %r1690, %r1684; 2026-02-21T09:18:14.9880613Z add.s32 %r1699, %r1691, %r1684; 2026-02-21T09:18:14.9880676Z add.s32 %r1700, %r1692, %r1684; 2026-02-21T09:18:14.9880838Z .loc 1 88 22 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:22 2026-02-21T09:18:14.9880910Z mad.wide.s32 %rd123, %r1693, 2, %rd57; 2026-02-21T09:18:14.9880978Z mad.wide.s32 %rd124, %r1694, 2, %rd57; 2026-02-21T09:18:14.9881051Z mad.wide.s32 %rd125, %r1695, 2, %rd57; 2026-02-21T09:18:14.9881115Z mad.wide.s32 %rd126, %r1696, 2, %rd57; 2026-02-21T09:18:14.9881177Z mad.wide.s32 %rd127, %r1697, 2, %rd57; 2026-02-21T09:18:14.9881248Z mad.wide.s32 %rd128, %r1698, 2, %rd57; 2026-02-21T09:18:14.9881310Z mad.wide.s32 %rd129, %r1699, 2, %rd57; 2026-02-21T09:18:14.9881372Z mad.wide.s32 %rd130, %r1700, 2, %rd57; 2026-02-21T09:18:14.9881433Z $L__tmp85: 2026-02-21T09:18:14.9881677Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9881738Z // begin inline asm 2026-02-21T09:18:14.9882061Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 
{%r1468, %r1469, %r1470, %r1471, %r1472, %r1473, %r1474, %r1475, %r1476, %r1477, %r1478, %r1479, %r1480, %r1481, %r1482, %r1483}, [%r4673 + 0], 64; 2026-02-21T09:18:14.9882119Z // end inline asm 2026-02-21T09:18:14.9882178Z // begin inline asm 2026-02-21T09:18:14.9882483Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1485, %r1486, %r1487, %r1488, %r1489, %r1490, %r1491, %r1492, %r1493, %r1494, %r1495, %r1496, %r1497, %r1498, %r1499, %r1500}, [%r4673 + 16], 64; 2026-02-21T09:18:14.9882548Z // end inline asm 2026-02-21T09:18:14.9882609Z // begin inline asm 2026-02-21T09:18:14.9882916Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1502, %r1503, %r1504, %r1505, %r1506, %r1507, %r1508, %r1509, %r1510, %r1511, %r1512, %r1513, %r1514, %r1515, %r1516, %r1517}, [%r4673 + 32], 64; 2026-02-21T09:18:14.9882980Z // end inline asm 2026-02-21T09:18:14.9883038Z // begin inline asm 2026-02-21T09:18:14.9883363Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1519, %r1520, %r1521, %r1522, %r1523, %r1524, %r1525, %r1526, %r1527, %r1528, %r1529, %r1530, %r1531, %r1532, %r1533, %r1534}, [%r4673 + 48], 64; 2026-02-21T09:18:14.9883450Z // end inline asm 2026-02-21T09:18:14.9883505Z // begin inline asm 2026-02-21T09:18:14.9883577Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:14.9883668Z // end inline asm 2026-02-21T09:18:14.9883735Z cvt.u64.u32 %rd132, %r1468; 2026-02-21T09:18:14.9883820Z cvt.u64.u32 %rd133, %r1469; 2026-02-21T09:18:14.9883880Z shl.b64 %rd134, %rd133, 32; 2026-02-21T09:18:14.9883950Z or.b64 %rd135, %rd132, %rd134; 2026-02-21T09:18:14.9884003Z $L__tmp86: 2026-02-21T09:18:14.9884166Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9884237Z mov.b64 {%r1701, %r1702}, %rd135; 2026-02-21T09:18:14.9884309Z cvt.rn.bf16x2.f32 %r1703, %r1702, %r1701; 2026-02-21T09:18:14.9884363Z $L__tmp87: 2026-02-21T09:18:14.9884606Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 
2026-02-21T09:18:14.9884675Z cvt.u64.u32 %rd136, %r1470; 2026-02-21T09:18:14.9884738Z cvt.u64.u32 %rd137, %r1471; 2026-02-21T09:18:14.9884801Z shl.b64 %rd138, %rd137, 32; 2026-02-21T09:18:14.9884874Z or.b64 %rd139, %rd136, %rd138; 2026-02-21T09:18:14.9884929Z $L__tmp88: 2026-02-21T09:18:14.9885101Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9885173Z mov.b64 {%r1704, %r1705}, %rd139; 2026-02-21T09:18:14.9885248Z cvt.rn.bf16x2.f32 %r1706, %r1705, %r1704; 2026-02-21T09:18:14.9885303Z $L__tmp89: 2026-02-21T09:18:14.9885525Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9885597Z cvt.u64.u32 %rd140, %r1472; 2026-02-21T09:18:14.9885658Z cvt.u64.u32 %rd141, %r1473; 2026-02-21T09:18:14.9885719Z shl.b64 %rd142, %rd141, 32; 2026-02-21T09:18:14.9885790Z or.b64 %rd143, %rd140, %rd142; 2026-02-21T09:18:14.9885844Z $L__tmp90: 2026-02-21T09:18:14.9886013Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9886085Z mov.b64 {%r1707, %r1708}, %rd143; 2026-02-21T09:18:14.9886158Z cvt.rn.bf16x2.f32 %r1709, %r1708, %r1707; 2026-02-21T09:18:14.9886212Z $L__tmp91: 2026-02-21T09:18:14.9886428Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9886497Z cvt.u64.u32 %rd144, %r1474; 2026-02-21T09:18:14.9886556Z cvt.u64.u32 %rd145, %r1475; 2026-02-21T09:18:14.9886615Z shl.b64 %rd146, %rd145, 32; 2026-02-21T09:18:14.9886683Z or.b64 %rd147, %rd144, %rd146; 2026-02-21T09:18:14.9886737Z $L__tmp92: 2026-02-21T09:18:14.9886901Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9886973Z mov.b64 {%r1710, %r1711}, %rd147; 2026-02-21T09:18:14.9887043Z cvt.rn.bf16x2.f32 %r1712, %r1711, %r1710; 2026-02-21T09:18:14.9887098Z $L__tmp93: 2026-02-21T09:18:14.9887312Z .loc 2 291 36 // standard.py:291:36 @[ 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9887384Z cvt.u64.u32 %rd148, %r1476; 2026-02-21T09:18:14.9887444Z cvt.u64.u32 %rd149, %r1477; 2026-02-21T09:18:14.9887505Z shl.b64 %rd150, %rd149, 32; 2026-02-21T09:18:14.9887572Z or.b64 %rd151, %rd148, %rd150; 2026-02-21T09:18:14.9887626Z $L__tmp94: 2026-02-21T09:18:14.9887792Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9887854Z mov.b64 {%r1713, %r1714}, %rd151; 2026-02-21T09:18:14.9887932Z cvt.rn.bf16x2.f32 %r1715, %r1714, %r1713; 2026-02-21T09:18:14.9887986Z $L__tmp95: 2026-02-21T09:18:14.9888197Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9888290Z cvt.u64.u32 %rd152, %r1478; 2026-02-21T09:18:14.9888350Z cvt.u64.u32 %rd153, %r1479; 2026-02-21T09:18:14.9888410Z shl.b64 %rd154, %rd153, 32; 2026-02-21T09:18:14.9888502Z or.b64 %rd155, %rd152, %rd154; 2026-02-21T09:18:14.9888557Z $L__tmp96: 2026-02-21T09:18:14.9888750Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9888834Z mov.b64 {%r1716, %r1717}, %rd155; 2026-02-21T09:18:14.9888912Z cvt.rn.bf16x2.f32 %r1718, %r1717, %r1716; 2026-02-21T09:18:14.9888966Z $L__tmp97: 2026-02-21T09:18:14.9889192Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9889262Z cvt.u64.u32 %rd156, %r1480; 2026-02-21T09:18:14.9889322Z cvt.u64.u32 %rd157, %r1481; 2026-02-21T09:18:14.9889383Z shl.b64 %rd158, %rd157, 32; 2026-02-21T09:18:14.9889450Z or.b64 %rd159, %rd156, %rd158; 2026-02-21T09:18:14.9889507Z $L__tmp98: 2026-02-21T09:18:14.9889674Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9889740Z mov.b64 {%r1719, %r1720}, %rd159; 2026-02-21T09:18:14.9889819Z cvt.rn.bf16x2.f32 %r1721, %r1720, %r1719; 2026-02-21T09:18:14.9889874Z $L__tmp99: 
2026-02-21T09:18:14.9890096Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9890163Z cvt.u64.u32 %rd160, %r1482; 2026-02-21T09:18:14.9890223Z cvt.u64.u32 %rd161, %r1483; 2026-02-21T09:18:14.9890284Z shl.b64 %rd162, %rd161, 32; 2026-02-21T09:18:14.9890354Z or.b64 %rd163, %rd160, %rd162; 2026-02-21T09:18:14.9890410Z $L__tmp100: 2026-02-21T09:18:14.9890581Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9890644Z mov.b64 {%r1722, %r1723}, %rd163; 2026-02-21T09:18:14.9890723Z cvt.rn.bf16x2.f32 %r1724, %r1723, %r1722; 2026-02-21T09:18:14.9890782Z $L__tmp101: 2026-02-21T09:18:14.9891000Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9891071Z cvt.u64.u32 %rd164, %r1485; 2026-02-21T09:18:14.9891131Z cvt.u64.u32 %rd165, %r1486; 2026-02-21T09:18:14.9891192Z shl.b64 %rd166, %rd165, 32; 2026-02-21T09:18:14.9891255Z or.b64 %rd167, %rd164, %rd166; 2026-02-21T09:18:14.9891322Z $L__tmp102: 2026-02-21T09:18:14.9891497Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9891585Z mov.b64 {%r1725, %r1726}, %rd167; 2026-02-21T09:18:14.9891669Z cvt.rn.bf16x2.f32 %r1727, %r1726, %r1725; 2026-02-21T09:18:14.9891727Z $L__tmp103: 2026-02-21T09:18:14.9891950Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9892019Z cvt.u64.u32 %rd168, %r1487; 2026-02-21T09:18:14.9892081Z cvt.u64.u32 %rd169, %r1488; 2026-02-21T09:18:14.9892142Z shl.b64 %rd170, %rd169, 32; 2026-02-21T09:18:14.9892205Z or.b64 %rd171, %rd168, %rd170; 2026-02-21T09:18:14.9892269Z $L__tmp104: 2026-02-21T09:18:14.9892441Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9892505Z mov.b64 {%r1728, %r1729}, %rd171; 2026-02-21T09:18:14.9892585Z cvt.rn.bf16x2.f32 
%r1730, %r1729, %r1728; 2026-02-21T09:18:14.9892640Z $L__tmp105: 2026-02-21T09:18:14.9892855Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9892922Z cvt.u64.u32 %rd172, %r1489; 2026-02-21T09:18:14.9892981Z cvt.u64.u32 %rd173, %r1490; 2026-02-21T09:18:14.9893042Z shl.b64 %rd174, %rd173, 32; 2026-02-21T09:18:14.9893104Z or.b64 %rd175, %rd172, %rd174; 2026-02-21T09:18:14.9893168Z $L__tmp106: 2026-02-21T09:18:14.9893338Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9893426Z mov.b64 {%r1731, %r1732}, %rd175; 2026-02-21T09:18:14.9893505Z cvt.rn.bf16x2.f32 %r1733, %r1732, %r1731; 2026-02-21T09:18:14.9893593Z $L__tmp107: 2026-02-21T09:18:14.9893817Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9893905Z cvt.u64.u32 %rd176, %r1491; 2026-02-21T09:18:14.9893964Z cvt.u64.u32 %rd177, %r1492; 2026-02-21T09:18:14.9894021Z shl.b64 %rd178, %rd177, 32; 2026-02-21T09:18:14.9894080Z or.b64 %rd179, %rd176, %rd178; 2026-02-21T09:18:14.9894141Z $L__tmp108: 2026-02-21T09:18:14.9894304Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9894364Z mov.b64 {%r1734, %r1735}, %rd179; 2026-02-21T09:18:14.9894438Z cvt.rn.bf16x2.f32 %r1736, %r1735, %r1734; 2026-02-21T09:18:14.9894490Z $L__tmp109: 2026-02-21T09:18:14.9894700Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9894765Z cvt.u64.u32 %rd180, %r1493; 2026-02-21T09:18:14.9894824Z cvt.u64.u32 %rd181, %r1494; 2026-02-21T09:18:14.9894881Z shl.b64 %rd182, %rd181, 32; 2026-02-21T09:18:14.9894942Z or.b64 %rd183, %rd180, %rd182; 2026-02-21T09:18:14.9895002Z $L__tmp110: 2026-02-21T09:18:14.9895164Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9895224Z mov.b64 {%r1737, 
%r1738}, %rd183; 2026-02-21T09:18:14.9895298Z cvt.rn.bf16x2.f32 %r1739, %r1738, %r1737; 2026-02-21T09:18:14.9895349Z $L__tmp111: 2026-02-21T09:18:14.9895560Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9895626Z cvt.u64.u32 %rd184, %r1495; 2026-02-21T09:18:14.9895682Z cvt.u64.u32 %rd185, %r1496; 2026-02-21T09:18:14.9895741Z shl.b64 %rd186, %rd185, 32; 2026-02-21T09:18:14.9895800Z or.b64 %rd187, %rd184, %rd186; 2026-02-21T09:18:14.9895860Z $L__tmp112: 2026-02-21T09:18:14.9896021Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9896082Z mov.b64 {%r1740, %r1741}, %rd187; 2026-02-21T09:18:14.9896159Z cvt.rn.bf16x2.f32 %r1742, %r1741, %r1740; 2026-02-21T09:18:14.9896211Z $L__tmp113: 2026-02-21T09:18:14.9896416Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9896473Z cvt.u64.u32 %rd188, %r1497; 2026-02-21T09:18:14.9896538Z cvt.u64.u32 %rd189, %r1498; 2026-02-21T09:18:14.9896595Z shl.b64 %rd190, %rd189, 32; 2026-02-21T09:18:14.9896653Z or.b64 %rd191, %rd188, %rd190; 2026-02-21T09:18:14.9896710Z $L__tmp114: 2026-02-21T09:18:14.9896870Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9896932Z mov.b64 {%r1743, %r1744}, %rd191; 2026-02-21T09:18:14.9897006Z cvt.rn.bf16x2.f32 %r1745, %r1744, %r1743; 2026-02-21T09:18:14.9897057Z $L__tmp115: 2026-02-21T09:18:14.9897260Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9897319Z cvt.u64.u32 %rd192, %r1499; 2026-02-21T09:18:14.9897383Z cvt.u64.u32 %rd193, %r1500; 2026-02-21T09:18:14.9897442Z shl.b64 %rd194, %rd193, 32; 2026-02-21T09:18:14.9897501Z or.b64 %rd195, %rd192, %rd194; 2026-02-21T09:18:14.9897559Z $L__tmp116: 2026-02-21T09:18:14.9897721Z .loc 1 87 28 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9897780Z mov.b64 {%r1746, %r1747}, %rd195; 2026-02-21T09:18:14.9897853Z cvt.rn.bf16x2.f32 %r1748, %r1747, %r1746; 2026-02-21T09:18:14.9897906Z $L__tmp117: 2026-02-21T09:18:14.9898111Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9898204Z cvt.u64.u32 %rd196, %r1502; 2026-02-21T09:18:14.9898271Z cvt.u64.u32 %rd197, %r1503; 2026-02-21T09:18:14.9898362Z shl.b64 %rd198, %rd197, 32; 2026-02-21T09:18:14.9898422Z or.b64 %rd199, %rd196, %rd198; 2026-02-21T09:18:14.9898481Z $L__tmp118: 2026-02-21T09:18:14.9898688Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9898749Z mov.b64 {%r1749, %r1750}, %rd199; 2026-02-21T09:18:14.9898825Z cvt.rn.bf16x2.f32 %r1751, %r1750, %r1749; 2026-02-21T09:18:14.9898878Z $L__tmp119: 2026-02-21T09:18:14.9899083Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9899141Z cvt.u64.u32 %rd200, %r1504; 2026-02-21T09:18:14.9899207Z cvt.u64.u32 %rd201, %r1505; 2026-02-21T09:18:14.9899265Z shl.b64 %rd202, %rd201, 32; 2026-02-21T09:18:14.9899323Z or.b64 %rd203, %rd200, %rd202; 2026-02-21T09:18:14.9899383Z $L__tmp120: 2026-02-21T09:18:14.9899546Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9899608Z mov.b64 {%r1752, %r1753}, %rd203; 2026-02-21T09:18:14.9899676Z cvt.rn.bf16x2.f32 %r1754, %r1753, %r1752; 2026-02-21T09:18:14.9899737Z $L__tmp121: 2026-02-21T09:18:14.9899942Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9900002Z cvt.u64.u32 %rd204, %r1506; 2026-02-21T09:18:14.9900070Z cvt.u64.u32 %rd205, %r1507; 2026-02-21T09:18:14.9900129Z shl.b64 %rd206, %rd205, 32; 2026-02-21T09:18:14.9900191Z or.b64 %rd207, %rd204, %rd206; 
2026-02-21T09:18:14.9900253Z $L__tmp122: 2026-02-21T09:18:14.9900415Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9900477Z mov.b64 {%r1755, %r1756}, %rd207; 2026-02-21T09:18:14.9900547Z cvt.rn.bf16x2.f32 %r1757, %r1756, %r1755; 2026-02-21T09:18:14.9900608Z $L__tmp123: 2026-02-21T09:18:14.9900811Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9900871Z cvt.u64.u32 %rd208, %r1508; 2026-02-21T09:18:14.9900939Z cvt.u64.u32 %rd209, %r1509; 2026-02-21T09:18:14.9900995Z shl.b64 %rd210, %rd209, 32; 2026-02-21T09:18:14.9901055Z or.b64 %rd211, %rd208, %rd210; 2026-02-21T09:18:14.9901114Z $L__tmp124: 2026-02-21T09:18:14.9901273Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9901334Z mov.b64 {%r1758, %r1759}, %rd211; 2026-02-21T09:18:14.9901399Z cvt.rn.bf16x2.f32 %r1760, %r1759, %r1758; 2026-02-21T09:18:14.9901459Z $L__tmp125: 2026-02-21T09:18:14.9901695Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9901754Z cvt.u64.u32 %rd212, %r1510; 2026-02-21T09:18:14.9901819Z cvt.u64.u32 %rd213, %r1511; 2026-02-21T09:18:14.9901877Z shl.b64 %rd214, %rd213, 32; 2026-02-21T09:18:14.9901937Z or.b64 %rd215, %rd212, %rd214; 2026-02-21T09:18:14.9901998Z $L__tmp126: 2026-02-21T09:18:14.9902159Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9902219Z mov.b64 {%r1761, %r1762}, %rd215; 2026-02-21T09:18:14.9902287Z cvt.rn.bf16x2.f32 %r1763, %r1762, %r1761; 2026-02-21T09:18:14.9902348Z $L__tmp127: 2026-02-21T09:18:14.9902546Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9902605Z cvt.u64.u32 %rd216, %r1512; 2026-02-21T09:18:14.9902671Z cvt.u64.u32 %rd217, %r1513; 2026-02-21T09:18:14.9902729Z shl.b64 %rd218, %rd217, 32; 
2026-02-21T09:18:14.9902787Z or.b64 %rd219, %rd216, %rd218; 2026-02-21T09:18:14.9902838Z $L__tmp128: 2026-02-21T09:18:14.9903037Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9903097Z mov.b64 {%r1764, %r1765}, %rd219; 2026-02-21T09:18:14.9903195Z cvt.rn.bf16x2.f32 %r1766, %r1765, %r1764; 2026-02-21T09:18:14.9903256Z $L__tmp129: 2026-02-21T09:18:14.9903486Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9903569Z cvt.u64.u32 %rd220, %r1514; 2026-02-21T09:18:14.9903635Z cvt.u64.u32 %rd221, %r1515; 2026-02-21T09:18:14.9903693Z shl.b64 %rd222, %rd221, 32; 2026-02-21T09:18:14.9903752Z or.b64 %rd223, %rd220, %rd222; 2026-02-21T09:18:14.9903804Z $L__tmp130: 2026-02-21T09:18:14.9903971Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9904031Z mov.b64 {%r1767, %r1768}, %rd223; 2026-02-21T09:18:14.9904098Z cvt.rn.bf16x2.f32 %r1769, %r1768, %r1767; 2026-02-21T09:18:14.9904158Z $L__tmp131: 2026-02-21T09:18:14.9904362Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9904420Z cvt.u64.u32 %rd224, %r1516; 2026-02-21T09:18:14.9904483Z cvt.u64.u32 %rd225, %r1517; 2026-02-21T09:18:14.9904541Z shl.b64 %rd226, %rd225, 32; 2026-02-21T09:18:14.9904601Z or.b64 %rd227, %rd224, %rd226; 2026-02-21T09:18:14.9904653Z $L__tmp132: 2026-02-21T09:18:14.9904824Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9904883Z mov.b64 {%r1770, %r1771}, %rd227; 2026-02-21T09:18:14.9904948Z cvt.rn.bf16x2.f32 %r1772, %r1771, %r1770; 2026-02-21T09:18:14.9905006Z $L__tmp133: 2026-02-21T09:18:14.9905208Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9905265Z cvt.u64.u32 %rd228, %r1519; 2026-02-21T09:18:14.9905328Z cvt.u64.u32 %rd229, 
%r1520; 2026-02-21T09:18:14.9905386Z shl.b64 %rd230, %rd229, 32; 2026-02-21T09:18:14.9905445Z or.b64 %rd231, %rd228, %rd230; 2026-02-21T09:18:14.9905497Z $L__tmp134: 2026-02-21T09:18:14.9905664Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9905725Z mov.b64 {%r1773, %r1774}, %rd231; 2026-02-21T09:18:14.9905791Z cvt.rn.bf16x2.f32 %r1775, %r1774, %r1773; 2026-02-21T09:18:14.9905851Z $L__tmp135: 2026-02-21T09:18:14.9906054Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9906111Z cvt.u64.u32 %rd232, %r1521; 2026-02-21T09:18:14.9906174Z cvt.u64.u32 %rd233, %r1522; 2026-02-21T09:18:14.9906231Z shl.b64 %rd234, %rd233, 32; 2026-02-21T09:18:14.9906290Z or.b64 %rd235, %rd232, %rd234; 2026-02-21T09:18:14.9906340Z $L__tmp136: 2026-02-21T09:18:14.9906507Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9906567Z mov.b64 {%r1776, %r1777}, %rd235; 2026-02-21T09:18:14.9906635Z cvt.rn.bf16x2.f32 %r1778, %r1777, %r1776; 2026-02-21T09:18:14.9906692Z $L__tmp137: 2026-02-21T09:18:14.9906896Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9906956Z cvt.u64.u32 %rd236, %r1523; 2026-02-21T09:18:14.9907021Z cvt.u64.u32 %rd237, %r1524; 2026-02-21T09:18:14.9907078Z shl.b64 %rd238, %rd237, 32; 2026-02-21T09:18:14.9907136Z or.b64 %rd239, %rd236, %rd238; 2026-02-21T09:18:14.9907188Z $L__tmp138: 2026-02-21T09:18:14.9907354Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9907415Z mov.b64 {%r1779, %r1780}, %rd239; 2026-02-21T09:18:14.9907482Z cvt.rn.bf16x2.f32 %r1781, %r1780, %r1779; 2026-02-21T09:18:14.9907540Z $L__tmp139: 2026-02-21T09:18:14.9907745Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9907826Z cvt.u64.u32 
%rd240, %r1525; 2026-02-21T09:18:14.9907884Z cvt.u64.u32 %rd241, %r1526; 2026-02-21T09:18:14.9907971Z shl.b64 %rd242, %rd241, 32; 2026-02-21T09:18:14.9908029Z or.b64 %rd243, %rd240, %rd242; 2026-02-21T09:18:14.9908081Z $L__tmp140: 2026-02-21T09:18:14.9908287Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9908348Z mov.b64 {%r1782, %r1783}, %rd243; 2026-02-21T09:18:14.9908416Z cvt.rn.bf16x2.f32 %r1784, %r1783, %r1782; 2026-02-21T09:18:14.9908476Z $L__tmp141: 2026-02-21T09:18:14.9908684Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9908744Z cvt.u64.u32 %rd244, %r1527; 2026-02-21T09:18:14.9908803Z cvt.u64.u32 %rd245, %r1528; 2026-02-21T09:18:14.9908870Z shl.b64 %rd246, %rd245, 32; 2026-02-21T09:18:14.9908933Z or.b64 %rd247, %rd244, %rd246; 2026-02-21T09:18:14.9908986Z $L__tmp142: 2026-02-21T09:18:14.9909157Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9909220Z mov.b64 {%r1785, %r1786}, %rd247; 2026-02-21T09:18:14.9909287Z cvt.rn.bf16x2.f32 %r1787, %r1786, %r1785; 2026-02-21T09:18:14.9909348Z $L__tmp143: 2026-02-21T09:18:14.9909553Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9909611Z cvt.u64.u32 %rd248, %r1529; 2026-02-21T09:18:14.9909668Z cvt.u64.u32 %rd249, %r1530; 2026-02-21T09:18:14.9909735Z shl.b64 %rd250, %rd249, 32; 2026-02-21T09:18:14.9909793Z or.b64 %rd251, %rd248, %rd250; 2026-02-21T09:18:14.9909844Z $L__tmp144: 2026-02-21T09:18:14.9910011Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9910071Z mov.b64 {%r1788, %r1789}, %rd251; 2026-02-21T09:18:14.9910139Z cvt.rn.bf16x2.f32 %r1790, %r1789, %r1788; 2026-02-21T09:18:14.9910199Z $L__tmp145: 2026-02-21T09:18:14.9910402Z .loc 2 291 36 // standard.py:291:36 @[ 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9910462Z cvt.u64.u32 %rd252, %r1531; 2026-02-21T09:18:14.9910520Z cvt.u64.u32 %rd253, %r1532; 2026-02-21T09:18:14.9910586Z shl.b64 %rd254, %rd253, 32; 2026-02-21T09:18:14.9910647Z or.b64 %rd255, %rd252, %rd254; 2026-02-21T09:18:14.9910699Z $L__tmp146: 2026-02-21T09:18:14.9910868Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9910928Z mov.b64 {%r1791, %r1792}, %rd255; 2026-02-21T09:18:14.9910995Z cvt.rn.bf16x2.f32 %r1793, %r1792, %r1791; 2026-02-21T09:18:14.9911046Z $L__tmp147: 2026-02-21T09:18:14.9911259Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9911320Z cvt.u64.u32 %rd256, %r1533; 2026-02-21T09:18:14.9911378Z cvt.u64.u32 %rd257, %r1534; 2026-02-21T09:18:14.9911442Z shl.b64 %rd258, %rd257, 32; 2026-02-21T09:18:14.9911502Z or.b64 %rd259, %rd256, %rd258; 2026-02-21T09:18:14.9911584Z $L__tmp148: 2026-02-21T09:18:14.9911754Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:14.9911818Z mov.b64 {%r1794, %r1795}, %rd259; 2026-02-21T09:18:14.9911886Z cvt.rn.bf16x2.f32 %r1796, %r1795, %r1794; 2026-02-21T09:18:14.9912046Z .loc 1 88 81 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:81 2026-02-21T09:18:14.9912109Z bar.sync 0; 2026-02-21T09:18:14.9912207Z st.shared.v4.b32 [%r27], {%r1703, %r1715, %r1727, %r1739}; 2026-02-21T09:18:14.9912305Z st.shared.v4.b32 [%r28], {%r1751, %r1763, %r1775, %r1787}; 2026-02-21T09:18:14.9912368Z bar.sync 0; 2026-02-21T09:18:14.9912426Z // begin inline asm 2026-02-21T09:18:14.9912621Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1576, %r1580, %r1584, %r1588}, [%r1540]; 2026-02-21T09:18:14.9912687Z // end inline asm 2026-02-21T09:18:14.9912744Z // begin inline asm 2026-02-21T09:18:14.9912920Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1592, %r1596, %r1600, %r1604}, 
[%r1545]; 2026-02-21T09:18:14.9912977Z // end inline asm 2026-02-21T09:18:14.9913064Z bar.sync 0; 2026-02-21T09:18:14.9913196Z st.shared.v4.b32 [%r27], {%r1706, %r1718, %r1730, %r1742}; 2026-02-21T09:18:14.9913291Z st.shared.v4.b32 [%r28], {%r1754, %r1766, %r1778, %r1790}; 2026-02-21T09:18:14.9913353Z bar.sync 0; 2026-02-21T09:18:14.9913411Z // begin inline asm 2026-02-21T09:18:14.9913562Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1577, %r1581, %r1585, %r1589}, [%r1540]; 2026-02-21T09:18:14.9913624Z // end inline asm 2026-02-21T09:18:14.9913680Z // begin inline asm 2026-02-21T09:18:14.9913832Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1593, %r1597, %r1601, %r1605}, [%r1545]; 2026-02-21T09:18:14.9913890Z // end inline asm 2026-02-21T09:18:14.9913951Z bar.sync 0; 2026-02-21T09:18:14.9914043Z st.shared.v4.b32 [%r27], {%r1709, %r1721, %r1733, %r1745}; 2026-02-21T09:18:14.9914135Z st.shared.v4.b32 [%r28], {%r1757, %r1769, %r1781, %r1793}; 2026-02-21T09:18:14.9914197Z bar.sync 0; 2026-02-21T09:18:14.9914252Z // begin inline asm 2026-02-21T09:18:14.9914401Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1578, %r1582, %r1586, %r1590}, [%r1540]; 2026-02-21T09:18:14.9914457Z // end inline asm 2026-02-21T09:18:14.9914521Z // begin inline asm 2026-02-21T09:18:14.9914667Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1594, %r1598, %r1602, %r1606}, [%r1545]; 2026-02-21T09:18:14.9914722Z // end inline asm 2026-02-21T09:18:14.9914784Z bar.sync 0; 2026-02-21T09:18:14.9914874Z st.shared.v4.b32 [%r27], {%r1712, %r1724, %r1736, %r1748}; 2026-02-21T09:18:14.9914964Z st.shared.v4.b32 [%r28], {%r1760, %r1772, %r1784, %r1796}; 2026-02-21T09:18:14.9915026Z bar.sync 0; 2026-02-21T09:18:14.9915082Z // begin inline asm 2026-02-21T09:18:14.9915230Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1579, %r1583, %r1587, %r1591}, [%r1540]; 2026-02-21T09:18:14.9915284Z // end inline asm 2026-02-21T09:18:14.9915346Z // begin inline asm 2026-02-21T09:18:14.9915496Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 
{%r1595, %r1599, %r1603, %r1607}, [%r1545]; 2026-02-21T09:18:14.9915553Z // end inline asm 2026-02-21T09:18:14.9915616Z // begin inline asm 2026-02-21T09:18:14.9915727Z st.global.v4.b32 [ %rd123 + 0 ], { %r1576, %r1577, %r1578, %r1579 }; 2026-02-21T09:18:14.9915782Z // end inline asm 2026-02-21T09:18:14.9915837Z // begin inline asm 2026-02-21T09:18:14.9915951Z st.global.v4.b32 [ %rd124 + 0 ], { %r1580, %r1581, %r1582, %r1583 }; 2026-02-21T09:18:14.9916007Z // end inline asm 2026-02-21T09:18:14.9916063Z // begin inline asm 2026-02-21T09:18:14.9916174Z st.global.v4.b32 [ %rd125 + 0 ], { %r1584, %r1585, %r1586, %r1587 }; 2026-02-21T09:18:14.9916229Z // end inline asm 2026-02-21T09:18:14.9916284Z // begin inline asm 2026-02-21T09:18:14.9916392Z st.global.v4.b32 [ %rd126 + 0 ], { %r1588, %r1589, %r1590, %r1591 }; 2026-02-21T09:18:14.9916447Z // end inline asm 2026-02-21T09:18:14.9916503Z // begin inline asm 2026-02-21T09:18:14.9916602Z st.global.v4.b32 [ %rd127 + 0 ], { %r1592, %r1593, %r1594, %r1595 }; 2026-02-21T09:18:14.9916664Z // end inline asm 2026-02-21T09:18:14.9916722Z // begin inline asm 2026-02-21T09:18:14.9916821Z st.global.v4.b32 [ %rd128 + 0 ], { %r1596, %r1597, %r1598, %r1599 }; 2026-02-21T09:18:14.9916882Z // end inline asm 2026-02-21T09:18:14.9916937Z // begin inline asm 2026-02-21T09:18:14.9917035Z st.global.v4.b32 [ %rd129 + 0 ], { %r1600, %r1601, %r1602, %r1603 }; 2026-02-21T09:18:14.9917092Z // end inline asm 2026-02-21T09:18:14.9917156Z // begin inline asm 2026-02-21T09:18:14.9917257Z st.global.v4.b32 [ %rd130 + 0 ], { %r1604, %r1605, %r1606, %r1607 }; 2026-02-21T09:18:14.9917313Z // end inline asm 2026-02-21T09:18:14.9917495Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:14.9917581Z add.s32 %r1797, %r7842, 2368; 2026-02-21T09:18:14.9917742Z .loc 1 25 35 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:25:35 2026-02-21T09:18:14.9917837Z shr.s32 %r1798, %r1797, 31; 
2026-02-21T09:18:14.9917896Z shr.u32 %r1799, %r1798, 22; 2026-02-21T09:18:14.9917979Z add.s32 %r1800, %r1797, %r1799; 2026-02-21T09:18:14.9918057Z shr.s32 %r1801, %r1800, 10; 2026-02-21T09:18:14.9918224Z .loc 1 26 33 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:26:33 2026-02-21T09:18:14.9918284Z shl.b32 %r1802, %r1801, 4; 2026-02-21T09:18:14.9918442Z .loc 1 27 39 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:27:39 2026-02-21T09:18:14.9918506Z sub.s32 %r1803, 64, %r1802; 2026-02-21T09:18:14.9918663Z .loc 1 27 52 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:27:52 2026-02-21T09:18:14.9918720Z min.s32 %r1804, %r1803, 16; 2026-02-21T09:18:14.9918886Z .loc 1 28 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:45 2026-02-21T09:18:14.9918950Z and.b32 %r1805, %r1800, -1024; 2026-02-21T09:18:14.9919012Z sub.s32 %r1806, %r1797, %r1805; 2026-02-21T09:18:14.9919177Z .loc 1 29 51 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:29:51 2026-02-21T09:18:14.9919239Z div.s32 %r1807, %r1806, %r1804; 2026-02-21T09:18:14.9919398Z .loc 1 28 64 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:64 2026-02-21T09:18:14.9919462Z mul.lo.s32 %r1808, %r1807, %r1804; 2026-02-21T09:18:14.9919531Z sub.s32 %r1809, %r1806, %r1808; 2026-02-21T09:18:14.9919686Z .loc 1 28 30 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:30 2026-02-21T09:18:14.9919745Z add.s32 %r1810, %r1809, %r1802; 2026-02-21T09:18:14.9919909Z .loc 1 30 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:30:27 2026-02-21T09:18:14.9919971Z shl.b32 %r117, %r1810, 6; 2026-02-21T09:18:14.9920126Z .loc 1 32 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:32:27 2026-02-21T09:18:14.9920195Z shl.b32 %r118, %r1807, 7; 2026-02-21T09:18:14.9920255Z mov.pred %p84, -1; 2026-02-21T09:18:14.9920312Z mov.b32 %r1609, 0; 2026-02-21T09:18:14.9920365Z $L__tmp149: 2026-02-21T09:18:14.9920579Z .loc 2 291 36 // standard.py:291:36 @[ 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9920636Z // begin inline asm 2026-02-21T09:18:14.9920954Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 0], 64, {%r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609}; 2026-02-21T09:18:14.9921018Z // end inline asm 2026-02-21T09:18:14.9921074Z // begin inline asm 2026-02-21T09:18:14.9921388Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 16], 64, {%r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609}; 2026-02-21T09:18:14.9921453Z // end inline asm 2026-02-21T09:18:14.9921510Z // begin inline asm 2026-02-21T09:18:14.9921849Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 32], 64, {%r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609}; 2026-02-21T09:18:14.9921911Z // end inline asm 2026-02-21T09:18:14.9921967Z // begin inline asm 2026-02-21T09:18:14.9922269Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 48], 64, {%r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609, %r1609}; 2026-02-21T09:18:14.9922331Z // end inline asm 2026-02-21T09:18:14.9922385Z // begin inline asm 2026-02-21T09:18:14.9922457Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9922512Z // end inline asm 2026-02-21T09:18:14.9922600Z bar.sync 0; 2026-02-21T09:18:14.9922654Z $L__tmp150: 2026-02-21T09:18:14.9922821Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:14.9922917Z add.s32 %r7844, %r61, %r118; 2026-02-21T09:18:14.9922977Z or.b32 %r1811, %r7841, %r117; 2026-02-21T09:18:14.9923058Z shl.b32 %r1812, %r1811, 10; 2026-02-21T09:18:14.9923125Z mul.wide.s32 %rd20, %r1812, 2; 2026-02-21T09:18:14.9923216Z shl.b32 %r1813, %r1810, 16; 
2026-02-21T09:18:14.9923277Z or.b32 %r1814, %r63, %r1813; 2026-02-21T09:18:14.9923342Z mul.wide.s32 %rd21, %r1814, 2; 2026-02-21T09:18:14.9923408Z mov.pred %p390, 0; 2026-02-21T09:18:14.9923467Z mov.b64 %rd1051, -64; 2026-02-21T09:18:14.9923527Z mov.b64 %rd1050, %rd5; 2026-02-21T09:18:14.9923585Z bra.uni $L__BB0_13; 2026-02-21T09:18:14.9923696Z $L__BB0_21: // in Loop: Header=BB0_13 Depth=2 2026-02-21T09:18:14.9923751Z $L__tmp151: 2026-02-21T09:18:14.9923963Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9924029Z // begin inline asm 2026-02-21T09:18:14.9924081Z 2026-02-21T09:18:14.9924132Z { 2026-02-21T09:18:14.9924203Z .reg .pred complete; 2026-02-21T09:18:14.9924258Z waitLoop: 2026-02-21T09:18:14.9924380Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r2682; 2026-02-21T09:18:14.9924446Z @!complete bra.uni waitLoop; 2026-02-21T09:18:14.9924504Z } 2026-02-21T09:18:14.9924508Z 2026-02-21T09:18:14.9924563Z // end inline asm 2026-02-21T09:18:14.9924616Z bar.sync 0; 2026-02-21T09:18:14.9924681Z // begin inline asm 2026-02-21T09:18:14.9924769Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:14.9924824Z // end inline asm 2026-02-21T09:18:14.9924878Z $L__tmp152: 2026-02-21T09:18:14.9925050Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:14.9925111Z add.s64 %rd1051, %rd1051, 64; 2026-02-21T09:18:14.9925173Z add.s32 %r7844, %r7844, 524288; 2026-02-21T09:18:14.9925241Z add.s64 %rd1050, %rd1050, 256; 2026-02-21T09:18:14.9925308Z setp.lt.u64 %p158, %rd1051, 448; 2026-02-21T09:18:14.9925371Z @%p158 bra $L__BB0_13; 2026-02-21T09:18:14.9925436Z bra.uni $L__BB0_22; 2026-02-21T09:18:14.9925539Z $L__BB0_13: // Parent Loop BB0_2 Depth=1 2026-02-21T09:18:14.9925637Z // => This Inner Loop Header: Depth=2 2026-02-21T09:18:14.9925798Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 
2026-02-21T09:18:14.9925867Z add.s64 %rd261, %rd1050, %rd21; 2026-02-21T09:18:14.9926025Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:14.9926087Z add.s64 %rd264, %rd1050, %rd20; 2026-02-21T09:18:14.9926154Z // begin inline asm 2026-02-21T09:18:14.9926212Z mov.u64 %rd260, 0x0; 2026-02-21T09:18:14.9926325Z createpolicy.fractional.L2::evict_first.b64 %rd260, 1.0; 2026-02-21T09:18:14.9926389Z // end inline asm 2026-02-21T09:18:14.9926446Z // begin inline asm 2026-02-21T09:18:14.9926502Z mov.u32 %r1815, 0x0; 2026-02-21T09:18:14.9926561Z mov.u32 %r1816, 0x0; 2026-02-21T09:18:14.9926625Z mov.u32 %r1817, 0x0; 2026-02-21T09:18:14.9926681Z mov.u32 %r1818, 0x0; 2026-02-21T09:18:14.9926873Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r1815, %r1816, %r1817, %r1818 }, [ %rd261 + 0 ], %rd260; 2026-02-21T09:18:14.9926938Z // end inline asm 2026-02-21T09:18:14.9926995Z // begin inline asm 2026-02-21T09:18:14.9927051Z mov.u64 %rd263, 0x0; 2026-02-21T09:18:14.9927166Z createpolicy.fractional.L2::evict_first.b64 %rd263, 1.0; 2026-02-21T09:18:14.9927221Z // end inline asm 2026-02-21T09:18:14.9927279Z // begin inline asm 2026-02-21T09:18:14.9927333Z mov.u32 %r1819, 0x0; 2026-02-21T09:18:14.9927394Z mov.u32 %r1820, 0x0; 2026-02-21T09:18:14.9927448Z mov.u32 %r1821, 0x0; 2026-02-21T09:18:14.9927537Z mov.u32 %r1822, 0x0; 2026-02-21T09:18:14.9927726Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r1819, %r1820, %r1821, %r1822 }, [ %rd264 + 0 ], %rd263; 2026-02-21T09:18:14.9927784Z // end inline asm 2026-02-21T09:18:14.9927861Z $L__tmp153: 2026-02-21T09:18:14.9928096Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9928161Z bar.sync 0; 2026-02-21T09:18:14.9928281Z st.shared.v4.b32 [%r26], {%r1815, %r1816, %r1817, %r1818}; 2026-02-21T09:18:14.9928391Z st.shared.v4.b32 [%r26+512], {%r1819, %r1820, %r1821, %r1822}; 2026-02-21T09:18:14.9928456Z bar.sync 0; 
2026-02-21T09:18:14.9928556Z ld.shared.v4.b32 {%r1981, %r1982, %r1983, %r1984}, [%r27]; 2026-02-21T09:18:14.9928626Z mov.b32 {%rs513, %rs514}, %r1984; 2026-02-21T09:18:14.9928697Z mov.b32 {%rs515, %rs516}, %r1983; 2026-02-21T09:18:14.9928759Z mov.b32 {%rs517, %rs518}, %r1982; 2026-02-21T09:18:14.9928819Z mov.b32 {%rs519, %rs520}, %r1981; 2026-02-21T09:18:14.9928918Z ld.shared.v4.b32 {%r1985, %r1986, %r1987, %r1988}, [%r28]; 2026-02-21T09:18:14.9928987Z mov.b32 {%rs521, %rs522}, %r1988; 2026-02-21T09:18:14.9929049Z mov.b32 {%rs523, %rs524}, %r1987; 2026-02-21T09:18:14.9929111Z mov.b32 {%rs525, %rs526}, %r1986; 2026-02-21T09:18:14.9929181Z mov.b32 {%rs527, %rs528}, %r1985; 2026-02-21T09:18:14.9929237Z $L__tmp154: 2026-02-21T09:18:14.9929408Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:14.9929482Z cvt.f32.bf16 %r1964, %rs519; 2026-02-21T09:18:14.9929546Z cvt.f32.bf16 %r1965, %rs520; 2026-02-21T09:18:14.9929608Z cvt.f32.bf16 %r1966, %rs517; 2026-02-21T09:18:14.9929669Z cvt.f32.bf16 %r1967, %rs518; 2026-02-21T09:18:14.9929739Z cvt.f32.bf16 %r1968, %rs515; 2026-02-21T09:18:14.9929800Z cvt.f32.bf16 %r1969, %rs516; 2026-02-21T09:18:14.9929861Z cvt.f32.bf16 %r1970, %rs513; 2026-02-21T09:18:14.9929928Z cvt.f32.bf16 %r1971, %rs514; 2026-02-21T09:18:14.9929990Z cvt.f32.bf16 %r1972, %rs527; 2026-02-21T09:18:14.9930049Z cvt.f32.bf16 %r1973, %rs528; 2026-02-21T09:18:14.9930109Z cvt.f32.bf16 %r1974, %rs525; 2026-02-21T09:18:14.9930177Z cvt.f32.bf16 %r1975, %rs526; 2026-02-21T09:18:14.9930239Z cvt.f32.bf16 %r1976, %rs523; 2026-02-21T09:18:14.9930299Z cvt.f32.bf16 %r1977, %rs524; 2026-02-21T09:18:14.9930369Z cvt.f32.bf16 %r1978, %rs521; 2026-02-21T09:18:14.9930429Z cvt.f32.bf16 %r1979, %rs522; 2026-02-21T09:18:14.9930600Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:14.9930671Z cvt.s64.s32 %rd269, %r7844; 2026-02-21T09:18:14.9930736Z add.s64 %rd267, %rd56, %rd269; 
2026-02-21T09:18:14.9930901Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:14.9930961Z // begin inline asm 2026-02-21T09:18:14.9931028Z mov.u64 %rd266, 0x0; 2026-02-21T09:18:14.9931142Z createpolicy.fractional.L2::evict_first.b64 %rd266, 1.0; 2026-02-21T09:18:14.9931201Z // end inline asm 2026-02-21T09:18:14.9931268Z // begin inline asm 2026-02-21T09:18:14.9931326Z mov.u32 %r1823, 0x0; 2026-02-21T09:18:14.9931383Z mov.u32 %r1824, 0x0; 2026-02-21T09:18:14.9931441Z mov.u32 %r1825, 0x0; 2026-02-21T09:18:14.9931504Z mov.u32 %r1826, 0x0; 2026-02-21T09:18:14.9931718Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r1823, %r1824, %r1825, %r1826 }, [ %rd267 + 0 ], %rd266; 2026-02-21T09:18:14.9931779Z // end inline asm 2026-02-21T09:18:14.9931857Z prmt.b32 %r1989, %r1823, 0, 0x8880U; 2026-02-21T09:18:14.9931921Z cvt.u16.u32 %rs529, %r1989; 2026-02-21T09:18:14.9931989Z prmt.b32 %r1990, %r1823, 0, 0x7770U; 2026-02-21T09:18:14.9932058Z cvt.u16.u32 %rs530, %r1990; 2026-02-21T09:18:14.9932124Z prmt.b32 %r1991, %r1823, 0, 0x9991U; 2026-02-21T09:18:14.9932187Z cvt.u16.u32 %rs531, %r1991; 2026-02-21T09:18:14.9932251Z prmt.b32 %r1992, %r1823, 0, 0x7771U; 2026-02-21T09:18:14.9932320Z cvt.u16.u32 %rs532, %r1992; 2026-02-21T09:18:14.9932412Z prmt.b32 %r1993, %r1823, 0, 0xaaa2U; 2026-02-21T09:18:14.9932473Z cvt.u16.u32 %rs533, %r1993; 2026-02-21T09:18:14.9932544Z prmt.b32 %r1994, %r1823, 0, 0x7772U; 2026-02-21T09:18:14.9932630Z cvt.u16.u32 %rs534, %r1994; 2026-02-21T09:18:14.9932694Z prmt.b32 %r1995, %r1823, 0, 0xbbb3U; 2026-02-21T09:18:14.9932757Z cvt.u16.u32 %rs535, %r1995; 2026-02-21T09:18:14.9932872Z prmt.b32 %r1996, %r1823, 0, 0x7773U; 2026-02-21T09:18:14.9932959Z cvt.u16.u32 %rs536, %r1996; 2026-02-21T09:18:14.9933023Z prmt.b32 %r1997, %r1824, 0, 0x8880U; 2026-02-21T09:18:14.9933094Z cvt.u16.u32 %rs537, %r1997; 2026-02-21T09:18:14.9933159Z prmt.b32 %r1998, %r1824, 0, 0x7770U; 2026-02-21T09:18:14.9933220Z cvt.u16.u32 %rs538, 
%r1998; 2026-02-21T09:18:14.9933290Z prmt.b32 %r1999, %r1824, 0, 0x9991U; 2026-02-21T09:18:14.9933350Z cvt.u16.u32 %rs539, %r1999; 2026-02-21T09:18:14.9933413Z prmt.b32 %r2000, %r1824, 0, 0x7771U; 2026-02-21T09:18:14.9933474Z cvt.u16.u32 %rs540, %r2000; 2026-02-21T09:18:14.9933546Z prmt.b32 %r2001, %r1824, 0, 0xaaa2U; 2026-02-21T09:18:14.9933605Z cvt.u16.u32 %rs541, %r2001; 2026-02-21T09:18:14.9933668Z prmt.b32 %r2002, %r1824, 0, 0x7772U; 2026-02-21T09:18:14.9933736Z cvt.u16.u32 %rs542, %r2002; 2026-02-21T09:18:14.9933800Z prmt.b32 %r2003, %r1824, 0, 0xbbb3U; 2026-02-21T09:18:14.9933861Z cvt.u16.u32 %rs543, %r2003; 2026-02-21T09:18:14.9933924Z prmt.b32 %r2004, %r1824, 0, 0x7773U; 2026-02-21T09:18:14.9933996Z cvt.u16.u32 %rs544, %r2004; 2026-02-21T09:18:14.9934060Z prmt.b32 %r2005, %r1825, 0, 0x8880U; 2026-02-21T09:18:14.9934121Z cvt.u16.u32 %rs545, %r2005; 2026-02-21T09:18:14.9934196Z prmt.b32 %r2006, %r1825, 0, 0x7770U; 2026-02-21T09:18:14.9934258Z cvt.u16.u32 %rs546, %r2006; 2026-02-21T09:18:14.9934324Z prmt.b32 %r2007, %r1825, 0, 0x9991U; 2026-02-21T09:18:14.9934394Z cvt.u16.u32 %rs547, %r2007; 2026-02-21T09:18:14.9934457Z prmt.b32 %r2008, %r1825, 0, 0x7771U; 2026-02-21T09:18:14.9934518Z cvt.u16.u32 %rs548, %r2008; 2026-02-21T09:18:14.9934581Z prmt.b32 %r2009, %r1825, 0, 0xaaa2U; 2026-02-21T09:18:14.9934653Z cvt.u16.u32 %rs549, %r2009; 2026-02-21T09:18:14.9934715Z prmt.b32 %r2010, %r1825, 0, 0x7772U; 2026-02-21T09:18:14.9934776Z cvt.u16.u32 %rs550, %r2010; 2026-02-21T09:18:14.9934848Z prmt.b32 %r2011, %r1825, 0, 0xbbb3U; 2026-02-21T09:18:14.9934907Z cvt.u16.u32 %rs551, %r2011; 2026-02-21T09:18:14.9934971Z prmt.b32 %r2012, %r1825, 0, 0x7773U; 2026-02-21T09:18:14.9935030Z cvt.u16.u32 %rs552, %r2012; 2026-02-21T09:18:14.9935101Z prmt.b32 %r2013, %r1826, 0, 0x8880U; 2026-02-21T09:18:14.9935162Z cvt.u16.u32 %rs553, %r2013; 2026-02-21T09:18:14.9935225Z prmt.b32 %r2014, %r1826, 0, 0x7770U; 2026-02-21T09:18:14.9935292Z cvt.u16.u32 %rs554, %r2014; 
2026-02-21T09:18:14.9935353Z prmt.b32 %r2015, %r1826, 0, 0x9991U; 2026-02-21T09:18:14.9935413Z cvt.u16.u32 %rs555, %r2015; 2026-02-21T09:18:14.9935477Z prmt.b32 %r2016, %r1826, 0, 0x7771U; 2026-02-21T09:18:14.9935546Z cvt.u16.u32 %rs556, %r2016; 2026-02-21T09:18:14.9935608Z prmt.b32 %r2017, %r1826, 0, 0xaaa2U; 2026-02-21T09:18:14.9935671Z cvt.u16.u32 %rs557, %r2017; 2026-02-21T09:18:14.9935751Z prmt.b32 %r2018, %r1826, 0, 0x7772U; 2026-02-21T09:18:14.9935811Z cvt.u16.u32 %rs558, %r2018; 2026-02-21T09:18:14.9935873Z prmt.b32 %r2019, %r1826, 0, 0xbbb3U; 2026-02-21T09:18:14.9935940Z cvt.u16.u32 %rs559, %r2019; 2026-02-21T09:18:14.9936002Z prmt.b32 %r2020, %r1826, 0, 0x7773U; 2026-02-21T09:18:14.9936061Z cvt.u16.u32 %rs560, %r2020; 2026-02-21T09:18:14.9936227Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:14.9936295Z shl.b16 %rs561, %rs530, 12; 2026-02-21T09:18:14.9936353Z shr.s16 %rs562, %rs561, 12; 2026-02-21T09:18:14.9936410Z shl.b16 %rs563, %rs532, 12; 2026-02-21T09:18:14.9936475Z shr.s16 %rs564, %rs563, 12; 2026-02-21T09:18:14.9936531Z shl.b16 %rs565, %rs534, 12; 2026-02-21T09:18:14.9936589Z shr.s16 %rs566, %rs565, 12; 2026-02-21T09:18:14.9936647Z shl.b16 %rs567, %rs536, 12; 2026-02-21T09:18:14.9936736Z shr.s16 %rs568, %rs567, 12; 2026-02-21T09:18:14.9936794Z shl.b16 %rs569, %rs538, 12; 2026-02-21T09:18:14.9936852Z shr.s16 %rs570, %rs569, 12; 2026-02-21T09:18:14.9936918Z shl.b16 %rs571, %rs540, 12; 2026-02-21T09:18:14.9937000Z shr.s16 %rs572, %rs571, 12; 2026-02-21T09:18:14.9937056Z shl.b16 %rs573, %rs542, 12; 2026-02-21T09:18:14.9937137Z shr.s16 %rs574, %rs573, 12; 2026-02-21T09:18:14.9937197Z shl.b16 %rs575, %rs544, 12; 2026-02-21T09:18:14.9937272Z shr.s16 %rs576, %rs575, 12; 2026-02-21T09:18:14.9937329Z shl.b16 %rs577, %rs546, 12; 2026-02-21T09:18:14.9937394Z shr.s16 %rs578, %rs577, 12; 2026-02-21T09:18:14.9937450Z shl.b16 %rs579, %rs548, 12; 2026-02-21T09:18:14.9937506Z shr.s16 %rs580, %rs579, 12; 
2026-02-21T09:18:14.9937570Z shl.b16 %rs581, %rs550, 12; 2026-02-21T09:18:14.9937626Z shr.s16 %rs582, %rs581, 12; 2026-02-21T09:18:14.9937683Z shl.b16 %rs583, %rs552, 12; 2026-02-21T09:18:14.9937740Z shr.s16 %rs584, %rs583, 12; 2026-02-21T09:18:14.9937804Z shl.b16 %rs585, %rs554, 12; 2026-02-21T09:18:14.9937861Z shr.s16 %rs586, %rs585, 12; 2026-02-21T09:18:14.9937917Z shl.b16 %rs587, %rs556, 12; 2026-02-21T09:18:14.9937980Z shr.s16 %rs588, %rs587, 12; 2026-02-21T09:18:14.9938037Z shl.b16 %rs589, %rs558, 12; 2026-02-21T09:18:14.9938094Z shr.s16 %rs590, %rs589, 12; 2026-02-21T09:18:14.9938152Z shl.b16 %rs591, %rs560, 12; 2026-02-21T09:18:14.9938216Z shr.s16 %rs592, %rs591, 12; 2026-02-21T09:18:14.9938376Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:14.9938438Z shr.u16 %rs593, %rs529, 4; 2026-02-21T09:18:14.9938504Z shr.u16 %rs594, %rs531, 4; 2026-02-21T09:18:14.9938562Z shr.u16 %rs595, %rs533, 4; 2026-02-21T09:18:14.9938618Z shr.u16 %rs596, %rs535, 4; 2026-02-21T09:18:14.9938682Z shr.u16 %rs597, %rs537, 4; 2026-02-21T09:18:14.9938737Z shr.u16 %rs598, %rs539, 4; 2026-02-21T09:18:14.9938794Z shr.u16 %rs599, %rs541, 4; 2026-02-21T09:18:14.9938850Z shr.u16 %rs600, %rs543, 4; 2026-02-21T09:18:14.9938916Z shr.u16 %rs601, %rs545, 4; 2026-02-21T09:18:14.9938972Z shr.u16 %rs602, %rs547, 4; 2026-02-21T09:18:14.9939029Z shr.u16 %rs603, %rs549, 4; 2026-02-21T09:18:14.9939093Z shr.u16 %rs604, %rs551, 4; 2026-02-21T09:18:14.9939151Z shr.u16 %rs605, %rs553, 4; 2026-02-21T09:18:14.9939207Z shr.u16 %rs606, %rs555, 4; 2026-02-21T09:18:14.9939264Z shr.u16 %rs607, %rs557, 4; 2026-02-21T09:18:14.9939328Z shr.u16 %rs608, %rs559, 4; 2026-02-21T09:18:14.9939490Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:14.9939545Z bar.sync 0; 2026-02-21T09:18:14.9939616Z st.shared.b8 [%r29], %rs562; 2026-02-21T09:18:14.9939678Z st.shared.b8 [%r30], %rs564; 2026-02-21T09:18:14.9939737Z st.shared.b8 
[%r31], %rs566; 2026-02-21T09:18:14.9939796Z st.shared.b8 [%r32], %rs568; 2026-02-21T09:18:14.9939867Z st.shared.b8 [%r33+512], %rs570; 2026-02-21T09:18:14.9939931Z st.shared.b8 [%r34+512], %rs572; 2026-02-21T09:18:14.9939993Z st.shared.b8 [%r35+512], %rs574; 2026-02-21T09:18:14.9940061Z st.shared.b8 [%r36+512], %rs576; 2026-02-21T09:18:14.9940125Z st.shared.b8 [%r37+1024], %rs578; 2026-02-21T09:18:14.9940188Z st.shared.b8 [%r38+1024], %rs580; 2026-02-21T09:18:14.9940258Z st.shared.b8 [%r39+1024], %rs582; 2026-02-21T09:18:14.9940320Z st.shared.b8 [%r40+1024], %rs584; 2026-02-21T09:18:14.9940379Z st.shared.b8 [%r41+1536], %rs586; 2026-02-21T09:18:14.9940440Z st.shared.b8 [%r42+1536], %rs588; 2026-02-21T09:18:14.9940509Z st.shared.b8 [%r43+1536], %rs590; 2026-02-21T09:18:14.9940568Z st.shared.b8 [%r44+1536], %rs592; 2026-02-21T09:18:14.9940624Z bar.sync 0; 2026-02-21T09:18:14.9940694Z ld.shared.b32 %r2021, [%r45]; 2026-02-21T09:18:14.9940757Z ld.shared.b32 %r2022, [%r46]; 2026-02-21T09:18:14.9940822Z ld.shared.b32 %r2023, [%r47]; 2026-02-21T09:18:14.9940882Z ld.shared.b32 %r2024, [%r48]; 2026-02-21T09:18:14.9941052Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:14.9941128Z bar.sync 0; 2026-02-21T09:18:14.9941190Z st.shared.b8 [%r29], %rs593; 2026-02-21T09:18:14.9941257Z st.shared.b8 [%r30], %rs594; 2026-02-21T09:18:14.9941339Z st.shared.b8 [%r31], %rs595; 2026-02-21T09:18:14.9941398Z st.shared.b8 [%r32], %rs596; 2026-02-21T09:18:14.9941478Z st.shared.b8 [%r33+512], %rs597; 2026-02-21T09:18:14.9941584Z st.shared.b8 [%r34+512], %rs598; 2026-02-21T09:18:14.9941673Z st.shared.b8 [%r35+512], %rs599; 2026-02-21T09:18:14.9941734Z st.shared.b8 [%r36+512], %rs600; 2026-02-21T09:18:14.9941802Z st.shared.b8 [%r37+1024], %rs601; 2026-02-21T09:18:14.9941863Z st.shared.b8 [%r38+1024], %rs602; 2026-02-21T09:18:14.9941924Z st.shared.b8 [%r39+1024], %rs603; 2026-02-21T09:18:14.9941993Z st.shared.b8 [%r40+1024], %rs604; 
2026-02-21T09:18:14.9942055Z st.shared.b8 [%r41+1536], %rs605; 2026-02-21T09:18:14.9942113Z st.shared.b8 [%r42+1536], %rs606; 2026-02-21T09:18:14.9942173Z st.shared.b8 [%r43+1536], %rs607; 2026-02-21T09:18:14.9942242Z st.shared.b8 [%r44+1536], %rs608; 2026-02-21T09:18:14.9942296Z bar.sync 0; 2026-02-21T09:18:14.9942357Z ld.shared.b32 %r2025, [%r45]; 2026-02-21T09:18:14.9942425Z ld.shared.b32 %r2026, [%r46]; 2026-02-21T09:18:14.9942487Z ld.shared.b32 %r2027, [%r47]; 2026-02-21T09:18:14.9942548Z ld.shared.b32 %r2028, [%r48]; 2026-02-21T09:18:14.9942711Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:14.9942779Z cvt.s8.s32 %rs609, %r2022; 2026-02-21T09:18:14.9942840Z cvt.rn.f32.s16 %r2029, %rs609; 2026-02-21T09:18:14.9942901Z cvt.s8.s32 %rs610, %r2021; 2026-02-21T09:18:14.9942970Z cvt.rn.f32.s16 %r2030, %rs610; 2026-02-21T09:18:14.9943029Z cvt.s8.s32 %rs611, %r2026; 2026-02-21T09:18:14.9943091Z cvt.rn.f32.s16 %r2031, %rs611; 2026-02-21T09:18:14.9943156Z cvt.s8.s32 %rs612, %r2025; 2026-02-21T09:18:14.9943216Z cvt.rn.f32.s16 %r2032, %rs612; 2026-02-21T09:18:14.9943274Z cvt.s8.s32 %rs613, %r2024; 2026-02-21T09:18:14.9943335Z cvt.rn.f32.s16 %r2033, %rs613; 2026-02-21T09:18:14.9943401Z cvt.s8.s32 %rs614, %r2023; 2026-02-21T09:18:14.9943459Z cvt.rn.f32.s16 %r2034, %rs614; 2026-02-21T09:18:14.9943518Z cvt.s8.s32 %rs615, %r2028; 2026-02-21T09:18:14.9943584Z cvt.rn.f32.s16 %r2035, %rs615; 2026-02-21T09:18:14.9943642Z cvt.s8.s32 %rs616, %r2027; 2026-02-21T09:18:14.9943702Z cvt.rn.f32.s16 %r2036, %rs616; 2026-02-21T09:18:14.9943768Z prmt.b32 %r2037, %r2022, 0, 0x9991U; 2026-02-21T09:18:14.9943837Z cvt.u16.u32 %rs617, %r2037; 2026-02-21T09:18:14.9943897Z cvt.rn.f32.s16 %r2038, %rs617; 2026-02-21T09:18:14.9943963Z prmt.b32 %r2039, %r2021, 0, 0x9991U; 2026-02-21T09:18:14.9944029Z cvt.u16.u32 %rs618, %r2039; 2026-02-21T09:18:14.9944088Z cvt.rn.f32.s16 %r2040, %rs618; 2026-02-21T09:18:14.9944151Z prmt.b32 %r2041, %r2026, 0, 0x9991U; 
2026-02-21T09:18:14.9944210Z cvt.u16.u32 %rs619, %r2041; 2026-02-21T09:18:14.9944275Z cvt.rn.f32.s16 %r2042, %rs619; 2026-02-21T09:18:14.9944337Z prmt.b32 %r2043, %r2025, 0, 0x9991U; 2026-02-21T09:18:14.9944394Z cvt.u16.u32 %rs620, %r2043; 2026-02-21T09:18:14.9944459Z cvt.rn.f32.s16 %r2044, %rs620; 2026-02-21T09:18:14.9944518Z prmt.b32 %r2045, %r2024, 0, 0x9991U; 2026-02-21T09:18:14.9944578Z cvt.u16.u32 %rs621, %r2045; 2026-02-21T09:18:14.9944642Z cvt.rn.f32.s16 %r2046, %rs621; 2026-02-21T09:18:14.9944704Z prmt.b32 %r2047, %r2023, 0, 0x9991U; 2026-02-21T09:18:14.9944761Z cvt.u16.u32 %rs622, %r2047; 2026-02-21T09:18:14.9944820Z cvt.rn.f32.s16 %r2048, %rs622; 2026-02-21T09:18:14.9944888Z prmt.b32 %r2049, %r2028, 0, 0x9991U; 2026-02-21T09:18:14.9944945Z cvt.u16.u32 %rs623, %r2049; 2026-02-21T09:18:14.9945004Z cvt.rn.f32.s16 %r2050, %rs623; 2026-02-21T09:18:14.9945070Z prmt.b32 %r2051, %r2027, 0, 0x9991U; 2026-02-21T09:18:14.9945128Z cvt.u16.u32 %rs624, %r2051; 2026-02-21T09:18:14.9945186Z cvt.rn.f32.s16 %r2052, %rs624; 2026-02-21T09:18:14.9945246Z prmt.b32 %r2053, %r2022, 0, 0xaaa2U; 2026-02-21T09:18:14.9945311Z cvt.u16.u32 %rs625, %r2053; 2026-02-21T09:18:14.9945402Z cvt.rn.f32.s16 %r2054, %rs625; 2026-02-21T09:18:14.9945463Z prmt.b32 %r2055, %r2021, 0, 0xaaa2U; 2026-02-21T09:18:14.9945524Z cvt.u16.u32 %rs626, %r2055; 2026-02-21T09:18:14.9945609Z cvt.rn.f32.s16 %r2056, %rs626; 2026-02-21T09:18:14.9945670Z prmt.b32 %r2057, %r2026, 0, 0xaaa2U; 2026-02-21T09:18:14.9945754Z cvt.u16.u32 %rs627, %r2057; 2026-02-21T09:18:14.9945820Z cvt.rn.f32.s16 %r2058, %rs627; 2026-02-21T09:18:14.9945903Z prmt.b32 %r2059, %r2025, 0, 0xaaa2U; 2026-02-21T09:18:14.9945961Z cvt.u16.u32 %rs628, %r2059; 2026-02-21T09:18:14.9946029Z cvt.rn.f32.s16 %r2060, %rs628; 2026-02-21T09:18:14.9946091Z prmt.b32 %r2061, %r2024, 0, 0xaaa2U; 2026-02-21T09:18:14.9946150Z cvt.u16.u32 %rs629, %r2061; 2026-02-21T09:18:14.9946218Z cvt.rn.f32.s16 %r2062, %rs629; 2026-02-21T09:18:14.9946278Z prmt.b32 %r2063, 
%r2023, 0, 0xaaa2U; 2026-02-21T09:18:14.9946337Z cvt.u16.u32 %rs630, %r2063; 2026-02-21T09:18:14.9946396Z cvt.rn.f32.s16 %r2064, %rs630; 2026-02-21T09:18:14.9946468Z prmt.b32 %r2065, %r2028, 0, 0xaaa2U; 2026-02-21T09:18:14.9946524Z cvt.u16.u32 %rs631, %r2065; 2026-02-21T09:18:14.9946583Z cvt.rn.f32.s16 %r2066, %rs631; 2026-02-21T09:18:14.9946651Z prmt.b32 %r2067, %r2027, 0, 0xaaa2U; 2026-02-21T09:18:14.9946709Z cvt.u16.u32 %rs632, %r2067; 2026-02-21T09:18:14.9946774Z cvt.rn.f32.s16 %r2068, %rs632; 2026-02-21T09:18:14.9946834Z prmt.b32 %r2069, %r2022, 0, 0xbbb3U; 2026-02-21T09:18:14.9946900Z cvt.u16.u32 %rs633, %r2069; 2026-02-21T09:18:14.9946959Z cvt.rn.f32.s16 %r2070, %rs633; 2026-02-21T09:18:14.9947021Z prmt.b32 %r2071, %r2021, 0, 0xbbb3U; 2026-02-21T09:18:14.9947088Z cvt.u16.u32 %rs634, %r2071; 2026-02-21T09:18:14.9947147Z cvt.rn.f32.s16 %r2072, %rs634; 2026-02-21T09:18:14.9947208Z prmt.b32 %r2073, %r2026, 0, 0xbbb3U; 2026-02-21T09:18:14.9947268Z cvt.u16.u32 %rs635, %r2073; 2026-02-21T09:18:14.9947337Z cvt.rn.f32.s16 %r2074, %rs635; 2026-02-21T09:18:14.9947401Z prmt.b32 %r2075, %r2025, 0, 0xbbb3U; 2026-02-21T09:18:14.9947462Z cvt.u16.u32 %rs636, %r2075; 2026-02-21T09:18:14.9947531Z cvt.rn.f32.s16 %r2076, %rs636; 2026-02-21T09:18:14.9947591Z prmt.b32 %r2077, %r2024, 0, 0xbbb3U; 2026-02-21T09:18:14.9947652Z cvt.u16.u32 %rs637, %r2077; 2026-02-21T09:18:14.9947719Z cvt.rn.f32.s16 %r2078, %rs637; 2026-02-21T09:18:14.9947780Z prmt.b32 %r2079, %r2023, 0, 0xbbb3U; 2026-02-21T09:18:14.9947839Z cvt.u16.u32 %rs638, %r2079; 2026-02-21T09:18:14.9947901Z cvt.rn.f32.s16 %r2080, %rs638; 2026-02-21T09:18:14.9947969Z prmt.b32 %r2081, %r2028, 0, 0xbbb3U; 2026-02-21T09:18:14.9948026Z cvt.u16.u32 %rs639, %r2081; 2026-02-21T09:18:14.9948086Z cvt.rn.f32.s16 %r2082, %rs639; 2026-02-21T09:18:14.9948153Z prmt.b32 %r2083, %r2027, 0, 0xbbb3U; 2026-02-21T09:18:14.9948211Z cvt.u16.u32 %rs640, %r2083; 2026-02-21T09:18:14.9948270Z cvt.rn.f32.s16 %r2084, %rs640; 2026-02-21T09:18:14.9948323Z 
bar.sync 0; 2026-02-21T09:18:14.9948428Z st.shared.v4.b32 [%r49], {%r2030, %r2032, %r2029, %r2031}; 2026-02-21T09:18:14.9948523Z st.shared.v4.b32 [%r50], {%r2034, %r2036, %r2033, %r2035}; 2026-02-21T09:18:14.9948617Z st.shared.v4.b32 [%r51], {%r2040, %r2044, %r2038, %r2042}; 2026-02-21T09:18:14.9948715Z st.shared.v4.b32 [%r52], {%r2048, %r2052, %r2046, %r2050}; 2026-02-21T09:18:14.9948804Z st.shared.v4.b32 [%r53], {%r2056, %r2060, %r2054, %r2058}; 2026-02-21T09:18:14.9948893Z st.shared.v4.b32 [%r54], {%r2064, %r2068, %r2062, %r2066}; 2026-02-21T09:18:14.9948988Z st.shared.v4.b32 [%r55], {%r2072, %r2076, %r2070, %r2074}; 2026-02-21T09:18:14.9949077Z st.shared.v4.b32 [%r56], {%r2080, %r2084, %r2078, %r2082}; 2026-02-21T09:18:14.9949131Z $L__tmp155: 2026-02-21T09:18:14.9949345Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9949411Z // begin inline asm 2026-02-21T09:18:14.9949721Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1827, %r1828, %r1829, %r1830, %r1831, %r1832, %r1833, %r1834, %r1835, %r1836, %r1837, %r1838, %r1839, %r1840, %r1841, %r1842}, [%r4673 + 0], 64; 2026-02-21T09:18:14.9949800Z // end inline asm 2026-02-21T09:18:14.9949865Z // begin inline asm 2026-02-21T09:18:14.9950174Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1844, %r1845, %r1846, %r1847, %r1848, %r1849, %r1850, %r1851, %r1852, %r1853, %r1854, %r1855, %r1856, %r1857, %r1858, %r1859}, [%r4673 + 16], 64; 2026-02-21T09:18:14.9950284Z // end inline asm 2026-02-21T09:18:14.9950352Z // begin inline asm 2026-02-21T09:18:14.9950679Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1861, %r1862, %r1863, %r1864, %r1865, %r1866, %r1867, %r1868, %r1869, %r1870, %r1871, %r1872, %r1873, %r1874, %r1875, %r1876}, [%r4673 + 32], 64; 2026-02-21T09:18:14.9950736Z // end inline asm 2026-02-21T09:18:14.9950802Z // begin inline asm 2026-02-21T09:18:14.9951104Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r1878, %r1879, %r1880, %r1881, %r1882, 
%r1883, %r1884, %r1885, %r1886, %r1887, %r1888, %r1889, %r1890, %r1891, %r1892, %r1893}, [%r4673 + 48], 64; 2026-02-21T09:18:14.9951161Z // end inline asm 2026-02-21T09:18:14.9951226Z // begin inline asm 2026-02-21T09:18:14.9951296Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:14.9951351Z // end inline asm 2026-02-21T09:18:14.9951409Z // begin inline asm 2026-02-21T09:18:14.9951765Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 0], 64, {%r1827, %r1828, %r1829, %r1830, %r1831, %r1832, %r1833, %r1834, %r1835, %r1836, %r1837, %r1838, %r1839, %r1840, %r1841, %r1842}; 2026-02-21T09:18:14.9951823Z // end inline asm 2026-02-21T09:18:14.9951880Z // begin inline asm 2026-02-21T09:18:14.9952194Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 16], 64, {%r1844, %r1845, %r1846, %r1847, %r1848, %r1849, %r1850, %r1851, %r1852, %r1853, %r1854, %r1855, %r1856, %r1857, %r1858, %r1859}; 2026-02-21T09:18:14.9952249Z // end inline asm 2026-02-21T09:18:14.9952305Z // begin inline asm 2026-02-21T09:18:14.9952618Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 32], 64, {%r1861, %r1862, %r1863, %r1864, %r1865, %r1866, %r1867, %r1868, %r1869, %r1870, %r1871, %r1872, %r1873, %r1874, %r1875, %r1876}; 2026-02-21T09:18:14.9952676Z // end inline asm 2026-02-21T09:18:14.9952733Z // begin inline asm 2026-02-21T09:18:14.9953045Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 48], 64, {%r1878, %r1879, %r1880, %r1881, %r1882, %r1883, %r1884, %r1885, %r1886, %r1887, %r1888, %r1889, %r1890, %r1891, %r1892, %r1893}; 2026-02-21T09:18:14.9953100Z // end inline asm 2026-02-21T09:18:14.9953158Z // begin inline asm 2026-02-21T09:18:14.9953226Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9953287Z // end inline asm 2026-02-21T09:18:14.9953341Z bar.sync 0; 2026-02-21T09:18:14.9953398Z // begin inline asm 2026-02-21T09:18:14.9953709Z @%p84 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5556 + 0], 16, {%r1964, %r1965, %r1966, %r1967, %r1968, %r1969, %r1970, %r1971, 
%r1972, %r1973, %r1974, %r1975, %r1976, %r1977, %r1978, %r1979}; 2026-02-21T09:18:14.9953763Z // end inline asm 2026-02-21T09:18:14.9953818Z // begin inline asm 2026-02-21T09:18:14.9953892Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9953947Z // end inline asm 2026-02-21T09:18:14.9954001Z bar.sync 0; 2026-02-21T09:18:14.9954057Z // begin inline asm 2026-02-21T09:18:14.9954139Z fence.proxy.async.shared::cta; 2026-02-21T09:18:14.9954194Z // end inline asm 2026-02-21T09:18:14.9954249Z // begin inline asm 2026-02-21T09:18:14.9954348Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:14.9954403Z // end inline asm 2026-02-21T09:18:14.9954457Z bar.sync 0; 2026-02-21T09:18:14.9954517Z @%p20 bra $L__BB0_15; 2026-02-21T09:18:14.9954623Z // %bb.14: // in Loop: Header=BB0_13 Depth=2 2026-02-21T09:18:14.9954689Z elect.sync %r2098|%p97, -1; 2026-02-21T09:18:14.9954748Z mov.b32 %r2088, 69208336; 2026-02-21T09:18:14.9954811Z // begin inline asm 2026-02-21T09:18:14.9954972Z @%p97 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 0 ], %rd1, %r2088, %p390; 2026-02-21T09:18:14.9955027Z // end inline asm 2026-02-21T09:18:14.9955120Z mov.pred %p98, -1; 2026-02-21T09:18:14.9955176Z // begin inline asm 2026-02-21T09:18:14.9955329Z @%p97 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 8 ], %rd2, %r2088, %p98; 2026-02-21T09:18:14.9955410Z // end inline asm 2026-02-21T09:18:14.9955475Z // begin inline asm 2026-02-21T09:18:14.9955645Z @%p97 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 16 ], %rd3, %r2088, %p98; 2026-02-21T09:18:14.9955725Z // end inline asm 2026-02-21T09:18:14.9955793Z // begin inline asm 2026-02-21T09:18:14.9955941Z @%p97 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 24 ], %rd4, %r2088, %p98; 2026-02-21T09:18:14.9955997Z // end inline asm 2026-02-21T09:18:14.9956066Z cvt.u64.u32 %rd274, %r5985; 2026-02-21T09:18:14.9956122Z // begin inline asm 2026-02-21T09:18:14.9956250Z @%p97 
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd274]; 2026-02-21T09:18:14.9956305Z // end inline asm 2026-02-21T09:18:14.9956416Z $L__BB0_15: // in Loop: Header=BB0_13 Depth=2 2026-02-21T09:18:14.9956510Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:14.9956569Z mov.b32 %r2102, 0; 2026-02-21T09:18:14.9956800Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9956860Z // begin inline asm 2026-02-21T09:18:14.9956912Z 2026-02-21T09:18:14.9956976Z { 2026-02-21T09:18:14.9957041Z .reg .pred complete; 2026-02-21T09:18:14.9957097Z waitLoop: 2026-02-21T09:18:14.9957220Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r2102; 2026-02-21T09:18:14.9957295Z @!complete bra.uni waitLoop; 2026-02-21T09:18:14.9957345Z } 2026-02-21T09:18:14.9957349Z 2026-02-21T09:18:14.9957403Z // end inline asm 2026-02-21T09:18:14.9957464Z bar.sync 0; 2026-02-21T09:18:14.9957520Z // begin inline asm 2026-02-21T09:18:14.9957608Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:14.9957670Z // end inline asm 2026-02-21T09:18:14.9957725Z $L__tmp156: 2026-02-21T09:18:14.9957891Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:14.9957953Z add.s64 %rd276, %rd261, 64; 2026-02-21T09:18:14.9958125Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:14.9958184Z add.s64 %rd279, %rd264, 64; 2026-02-21T09:18:14.9958243Z // begin inline asm 2026-02-21T09:18:14.9958310Z mov.u64 %rd275, 0x0; 2026-02-21T09:18:14.9958419Z createpolicy.fractional.L2::evict_first.b64 %rd275, 1.0; 2026-02-21T09:18:14.9958475Z // end inline asm 2026-02-21T09:18:14.9958531Z // begin inline asm 2026-02-21T09:18:14.9958594Z mov.u32 %r2104, 0x0; 2026-02-21T09:18:14.9958650Z mov.u32 %r2105, 0x0; 2026-02-21T09:18:14.9958705Z mov.u32 %r2106, 0x0; 2026-02-21T09:18:14.9958768Z mov.u32 %r2107, 0x0; 2026-02-21T09:18:14.9958945Z 
ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2104, %r2105, %r2106, %r2107 }, [ %rd276 + 0 ], %rd275; 2026-02-21T09:18:14.9959003Z // end inline asm 2026-02-21T09:18:14.9959066Z // begin inline asm 2026-02-21T09:18:14.9959123Z mov.u64 %rd278, 0x0; 2026-02-21T09:18:14.9959231Z createpolicy.fractional.L2::evict_first.b64 %rd278, 1.0; 2026-02-21T09:18:14.9959286Z // end inline asm 2026-02-21T09:18:14.9959352Z // begin inline asm 2026-02-21T09:18:14.9959408Z mov.u32 %r2108, 0x0; 2026-02-21T09:18:14.9959465Z mov.u32 %r2109, 0x0; 2026-02-21T09:18:14.9959526Z mov.u32 %r2110, 0x0; 2026-02-21T09:18:14.9959581Z mov.u32 %r2111, 0x0; 2026-02-21T09:18:14.9959755Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2108, %r2109, %r2110, %r2111 }, [ %rd279 + 0 ], %rd278; 2026-02-21T09:18:14.9959817Z // end inline asm 2026-02-21T09:18:14.9959871Z $L__tmp157: 2026-02-21T09:18:14.9960084Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9960181Z st.shared.v4.b32 [%r26], {%r2104, %r2105, %r2106, %r2107}; 2026-02-21T09:18:14.9960329Z st.shared.v4.b32 [%r26+512], {%r2108, %r2109, %r2110, %r2111}; 2026-02-21T09:18:14.9960384Z bar.sync 0; 2026-02-21T09:18:14.9960476Z ld.shared.v4.b32 {%r2271, %r2272, %r2273, %r2274}, [%r27]; 2026-02-21T09:18:14.9960577Z mov.b32 {%rs641, %rs642}, %r2274; 2026-02-21T09:18:14.9960661Z mov.b32 {%rs643, %rs644}, %r2273; 2026-02-21T09:18:14.9960724Z mov.b32 {%rs645, %rs646}, %r2272; 2026-02-21T09:18:14.9960810Z mov.b32 {%rs647, %rs648}, %r2271; 2026-02-21T09:18:14.9960904Z ld.shared.v4.b32 {%r2275, %r2276, %r2277, %r2278}, [%r28]; 2026-02-21T09:18:14.9960963Z mov.b32 {%rs649, %rs650}, %r2278; 2026-02-21T09:18:14.9961022Z mov.b32 {%rs651, %rs652}, %r2277; 2026-02-21T09:18:14.9961088Z mov.b32 {%rs653, %rs654}, %r2276; 2026-02-21T09:18:14.9961144Z mov.b32 {%rs655, %rs656}, %r2275; 2026-02-21T09:18:14.9961197Z $L__tmp158: 2026-02-21T09:18:14.9961367Z .loc 1 52 32 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:14.9961430Z cvt.f32.bf16 %r2253, %rs647; 2026-02-21T09:18:14.9961491Z cvt.f32.bf16 %r2254, %rs648; 2026-02-21T09:18:14.9961574Z cvt.f32.bf16 %r2255, %rs645; 2026-02-21T09:18:14.9961641Z cvt.f32.bf16 %r2256, %rs646; 2026-02-21T09:18:14.9961699Z cvt.f32.bf16 %r2257, %rs643; 2026-02-21T09:18:14.9961756Z cvt.f32.bf16 %r2258, %rs644; 2026-02-21T09:18:14.9961821Z cvt.f32.bf16 %r2259, %rs641; 2026-02-21T09:18:14.9961879Z cvt.f32.bf16 %r2260, %rs642; 2026-02-21T09:18:14.9961937Z cvt.f32.bf16 %r2261, %rs655; 2026-02-21T09:18:14.9961994Z cvt.f32.bf16 %r2262, %rs656; 2026-02-21T09:18:14.9962057Z cvt.f32.bf16 %r2263, %rs653; 2026-02-21T09:18:14.9962114Z cvt.f32.bf16 %r2264, %rs654; 2026-02-21T09:18:14.9962171Z cvt.f32.bf16 %r2265, %rs651; 2026-02-21T09:18:14.9962237Z cvt.f32.bf16 %r2266, %rs652; 2026-02-21T09:18:14.9962294Z cvt.f32.bf16 %r2267, %rs649; 2026-02-21T09:18:14.9962351Z cvt.f32.bf16 %r2268, %rs650; 2026-02-21T09:18:14.9962519Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:14.9962583Z add.s32 %r2279, %r7844, 131072; 2026-02-21T09:18:14.9962643Z cvt.s64.s32 %rd284, %r2279; 2026-02-21T09:18:14.9962706Z add.s64 %rd282, %rd56, %rd284; 2026-02-21T09:18:14.9962875Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:14.9962933Z // begin inline asm 2026-02-21T09:18:14.9962990Z mov.u64 %rd281, 0x0; 2026-02-21T09:18:14.9963105Z createpolicy.fractional.L2::evict_first.b64 %rd281, 1.0; 2026-02-21T09:18:14.9963162Z // end inline asm 2026-02-21T09:18:14.9963219Z // begin inline asm 2026-02-21T09:18:14.9963282Z mov.u32 %r2112, 0x0; 2026-02-21T09:18:14.9963338Z mov.u32 %r2113, 0x0; 2026-02-21T09:18:14.9963394Z mov.u32 %r2114, 0x0; 2026-02-21T09:18:14.9963450Z mov.u32 %r2115, 0x0; 2026-02-21T09:18:14.9963626Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2112, %r2113, %r2114, %r2115 }, [ %rd282 + 0 ], %rd281; 
2026-02-21T09:18:14.9963684Z // end inline asm 2026-02-21T09:18:14.9963751Z prmt.b32 %r2280, %r2112, 0, 0x8880U; 2026-02-21T09:18:14.9963820Z cvt.u16.u32 %rs657, %r2280; 2026-02-21T09:18:14.9963886Z prmt.b32 %r2281, %r2112, 0, 0x7770U; 2026-02-21T09:18:14.9963946Z cvt.u16.u32 %rs658, %r2281; 2026-02-21T09:18:14.9964010Z prmt.b32 %r2282, %r2112, 0, 0x9991U; 2026-02-21T09:18:14.9964081Z cvt.u16.u32 %rs659, %r2282; 2026-02-21T09:18:14.9964144Z prmt.b32 %r2283, %r2112, 0, 0x7771U; 2026-02-21T09:18:14.9964204Z cvt.u16.u32 %rs660, %r2283; 2026-02-21T09:18:14.9964275Z prmt.b32 %r2284, %r2112, 0, 0xaaa2U; 2026-02-21T09:18:14.9964333Z cvt.u16.u32 %rs661, %r2284; 2026-02-21T09:18:14.9964396Z prmt.b32 %r2285, %r2112, 0, 0x7772U; 2026-02-21T09:18:14.9964466Z cvt.u16.u32 %rs662, %r2285; 2026-02-21T09:18:14.9964526Z prmt.b32 %r2286, %r2112, 0, 0xbbb3U; 2026-02-21T09:18:14.9964584Z cvt.u16.u32 %rs663, %r2286; 2026-02-21T09:18:14.9964644Z prmt.b32 %r2287, %r2112, 0, 0x7773U; 2026-02-21T09:18:14.9964742Z cvt.u16.u32 %rs664, %r2287; 2026-02-21T09:18:14.9964803Z prmt.b32 %r2288, %r2113, 0, 0x8880U; 2026-02-21T09:18:14.9964862Z cvt.u16.u32 %rs665, %r2288; 2026-02-21T09:18:14.9964928Z prmt.b32 %r2289, %r2113, 0, 0x7770U; 2026-02-21T09:18:14.9965010Z cvt.u16.u32 %rs666, %r2289; 2026-02-21T09:18:14.9965072Z prmt.b32 %r2290, %r2113, 0, 0x9991U; 2026-02-21T09:18:14.9965155Z cvt.u16.u32 %rs667, %r2290; 2026-02-21T09:18:14.9965250Z prmt.b32 %r2291, %r2113, 0, 0x7771U; 2026-02-21T09:18:14.9965308Z cvt.u16.u32 %rs668, %r2291; 2026-02-21T09:18:14.9965369Z prmt.b32 %r2292, %r2113, 0, 0xaaa2U; 2026-02-21T09:18:14.9965434Z cvt.u16.u32 %rs669, %r2292; 2026-02-21T09:18:14.9965493Z prmt.b32 %r2293, %r2113, 0, 0x7772U; 2026-02-21T09:18:14.9965551Z cvt.u16.u32 %rs670, %r2293; 2026-02-21T09:18:14.9965616Z prmt.b32 %r2294, %r2113, 0, 0xbbb3U; 2026-02-21T09:18:14.9965674Z cvt.u16.u32 %rs671, %r2294; 2026-02-21T09:18:14.9965734Z prmt.b32 %r2295, %r2113, 0, 0x7773U; 2026-02-21T09:18:14.9965792Z cvt.u16.u32 
%rs672, %r2295; 2026-02-21T09:18:14.9965860Z prmt.b32 %r2296, %r2114, 0, 0x8880U; 2026-02-21T09:18:14.9965917Z cvt.u16.u32 %rs673, %r2296; 2026-02-21T09:18:14.9965976Z prmt.b32 %r2297, %r2114, 0, 0x7770U; 2026-02-21T09:18:14.9966043Z cvt.u16.u32 %rs674, %r2297; 2026-02-21T09:18:14.9966105Z prmt.b32 %r2298, %r2114, 0, 0x9991U; 2026-02-21T09:18:14.9966164Z cvt.u16.u32 %rs675, %r2298; 2026-02-21T09:18:14.9966225Z prmt.b32 %r2299, %r2114, 0, 0x7771U; 2026-02-21T09:18:14.9966291Z cvt.u16.u32 %rs676, %r2299; 2026-02-21T09:18:14.9966351Z prmt.b32 %r2300, %r2114, 0, 0xaaa2U; 2026-02-21T09:18:14.9966408Z cvt.u16.u32 %rs677, %r2300; 2026-02-21T09:18:14.9966477Z prmt.b32 %r2301, %r2114, 0, 0x7772U; 2026-02-21T09:18:14.9966535Z cvt.u16.u32 %rs678, %r2301; 2026-02-21T09:18:14.9966596Z prmt.b32 %r2302, %r2114, 0, 0xbbb3U; 2026-02-21T09:18:14.9966654Z cvt.u16.u32 %rs679, %r2302; 2026-02-21T09:18:14.9966720Z prmt.b32 %r2303, %r2114, 0, 0x7773U; 2026-02-21T09:18:14.9966780Z cvt.u16.u32 %rs680, %r2303; 2026-02-21T09:18:14.9966839Z prmt.b32 %r2304, %r2115, 0, 0x8880U; 2026-02-21T09:18:14.9966904Z cvt.u16.u32 %rs681, %r2304; 2026-02-21T09:18:14.9966965Z prmt.b32 %r2305, %r2115, 0, 0x7770U; 2026-02-21T09:18:14.9967023Z cvt.u16.u32 %rs682, %r2305; 2026-02-21T09:18:14.9967090Z prmt.b32 %r2306, %r2115, 0, 0x9991U; 2026-02-21T09:18:14.9967149Z cvt.u16.u32 %rs683, %r2306; 2026-02-21T09:18:14.9967211Z prmt.b32 %r2307, %r2115, 0, 0x7771U; 2026-02-21T09:18:14.9967268Z cvt.u16.u32 %rs684, %r2307; 2026-02-21T09:18:14.9967336Z prmt.b32 %r2308, %r2115, 0, 0xaaa2U; 2026-02-21T09:18:14.9967392Z cvt.u16.u32 %rs685, %r2308; 2026-02-21T09:18:14.9967452Z prmt.b32 %r2309, %r2115, 0, 0x7772U; 2026-02-21T09:18:14.9967516Z cvt.u16.u32 %rs686, %r2309; 2026-02-21T09:18:14.9967576Z prmt.b32 %r2310, %r2115, 0, 0xbbb3U; 2026-02-21T09:18:14.9967634Z cvt.u16.u32 %rs687, %r2310; 2026-02-21T09:18:14.9967693Z prmt.b32 %r2311, %r2115, 0, 0x7773U; 2026-02-21T09:18:14.9967757Z cvt.u16.u32 %rs688, %r2311; 
2026-02-21T09:18:14.9967917Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:14.9967975Z shl.b16 %rs689, %rs658, 12; 2026-02-21T09:18:14.9968040Z shr.s16 %rs690, %rs689, 12; 2026-02-21T09:18:14.9968097Z shl.b16 %rs691, %rs660, 12; 2026-02-21T09:18:14.9968155Z shr.s16 %rs692, %rs691, 12; 2026-02-21T09:18:14.9968219Z shl.b16 %rs693, %rs662, 12; 2026-02-21T09:18:14.9968275Z shr.s16 %rs694, %rs693, 12; 2026-02-21T09:18:14.9968331Z shl.b16 %rs695, %rs664, 12; 2026-02-21T09:18:14.9968388Z shr.s16 %rs696, %rs695, 12; 2026-02-21T09:18:14.9968452Z shl.b16 %rs697, %rs666, 12; 2026-02-21T09:18:14.9968509Z shr.s16 %rs698, %rs697, 12; 2026-02-21T09:18:14.9968565Z shl.b16 %rs699, %rs668, 12; 2026-02-21T09:18:14.9968627Z shr.s16 %rs700, %rs699, 12; 2026-02-21T09:18:14.9968684Z shl.b16 %rs701, %rs670, 12; 2026-02-21T09:18:14.9968740Z shr.s16 %rs702, %rs701, 12; 2026-02-21T09:18:14.9968796Z shl.b16 %rs703, %rs672, 12; 2026-02-21T09:18:14.9968900Z shr.s16 %rs704, %rs703, 12; 2026-02-21T09:18:14.9968958Z shl.b16 %rs705, %rs674, 12; 2026-02-21T09:18:14.9969016Z shr.s16 %rs706, %rs705, 12; 2026-02-21T09:18:14.9969102Z shl.b16 %rs707, %rs676, 12; 2026-02-21T09:18:14.9969159Z shr.s16 %rs708, %rs707, 12; 2026-02-21T09:18:14.9969216Z shl.b16 %rs709, %rs678, 12; 2026-02-21T09:18:14.9969295Z shr.s16 %rs710, %rs709, 12; 2026-02-21T09:18:14.9969386Z shl.b16 %rs711, %rs680, 12; 2026-02-21T09:18:14.9969446Z shr.s16 %rs712, %rs711, 12; 2026-02-21T09:18:14.9969504Z shl.b16 %rs713, %rs682, 12; 2026-02-21T09:18:14.9969569Z shr.s16 %rs714, %rs713, 12; 2026-02-21T09:18:14.9969626Z shl.b16 %rs715, %rs684, 12; 2026-02-21T09:18:14.9969683Z shr.s16 %rs716, %rs715, 12; 2026-02-21T09:18:14.9969745Z shl.b16 %rs717, %rs686, 12; 2026-02-21T09:18:14.9969802Z shr.s16 %rs718, %rs717, 12; 2026-02-21T09:18:14.9969859Z shl.b16 %rs719, %rs688, 12; 2026-02-21T09:18:14.9969916Z shr.s16 %rs720, %rs719, 12; 2026-02-21T09:18:14.9970083Z .loc 1 62 28 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:14.9970144Z shr.u16 %rs721, %rs657, 4; 2026-02-21T09:18:14.9970206Z shr.u16 %rs722, %rs659, 4; 2026-02-21T09:18:14.9970270Z shr.u16 %rs723, %rs661, 4; 2026-02-21T09:18:14.9970327Z shr.u16 %rs724, %rs663, 4; 2026-02-21T09:18:14.9970387Z shr.u16 %rs725, %rs665, 4; 2026-02-21T09:18:14.9970444Z shr.u16 %rs726, %rs667, 4; 2026-02-21T09:18:14.9970512Z shr.u16 %rs727, %rs669, 4; 2026-02-21T09:18:14.9970570Z shr.u16 %rs728, %rs671, 4; 2026-02-21T09:18:14.9970629Z shr.u16 %rs729, %rs673, 4; 2026-02-21T09:18:14.9970694Z shr.u16 %rs730, %rs675, 4; 2026-02-21T09:18:14.9970753Z shr.u16 %rs731, %rs677, 4; 2026-02-21T09:18:14.9970812Z shr.u16 %rs732, %rs679, 4; 2026-02-21T09:18:14.9970870Z shr.u16 %rs733, %rs681, 4; 2026-02-21T09:18:14.9970937Z shr.u16 %rs734, %rs683, 4; 2026-02-21T09:18:14.9970995Z shr.u16 %rs735, %rs685, 4; 2026-02-21T09:18:14.9971053Z shr.u16 %rs736, %rs687, 4; 2026-02-21T09:18:14.9971217Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:14.9971270Z bar.sync 0; 2026-02-21T09:18:14.9971333Z st.shared.b8 [%r29], %rs690; 2026-02-21T09:18:14.9971402Z st.shared.b8 [%r30], %rs692; 2026-02-21T09:18:14.9971472Z st.shared.b8 [%r31], %rs694; 2026-02-21T09:18:14.9971558Z st.shared.b8 [%r32], %rs696; 2026-02-21T09:18:14.9971629Z st.shared.b8 [%r33+512], %rs698; 2026-02-21T09:18:14.9971702Z st.shared.b8 [%r34+512], %rs700; 2026-02-21T09:18:14.9971765Z st.shared.b8 [%r35+512], %rs702; 2026-02-21T09:18:14.9971829Z st.shared.b8 [%r36+512], %rs704; 2026-02-21T09:18:14.9971904Z st.shared.b8 [%r37+1024], %rs706; 2026-02-21T09:18:14.9971969Z st.shared.b8 [%r38+1024], %rs708; 2026-02-21T09:18:14.9972032Z st.shared.b8 [%r39+1024], %rs710; 2026-02-21T09:18:14.9972094Z st.shared.b8 [%r40+1024], %rs712; 2026-02-21T09:18:14.9972164Z st.shared.b8 [%r41+1536], %rs714; 2026-02-21T09:18:14.9972227Z st.shared.b8 [%r42+1536], %rs716; 2026-02-21T09:18:14.9972290Z 
st.shared.b8 [%r43+1536], %rs718; 2026-02-21T09:18:14.9972359Z st.shared.b8 [%r44+1536], %rs720; 2026-02-21T09:18:14.9972418Z bar.sync 0; 2026-02-21T09:18:14.9972484Z ld.shared.b32 %r2312, [%r45]; 2026-02-21T09:18:14.9972546Z ld.shared.b32 %r2313, [%r46]; 2026-02-21T09:18:14.9972620Z ld.shared.b32 %r2314, [%r47]; 2026-02-21T09:18:14.9972686Z ld.shared.b32 %r2315, [%r48]; 2026-02-21T09:18:14.9972857Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:14.9972922Z bar.sync 0; 2026-02-21T09:18:14.9972985Z st.shared.b8 [%r29], %rs721; 2026-02-21T09:18:14.9973047Z st.shared.b8 [%r30], %rs722; 2026-02-21T09:18:14.9973118Z st.shared.b8 [%r31], %rs723; 2026-02-21T09:18:14.9973180Z st.shared.b8 [%r32], %rs724; 2026-02-21T09:18:14.9973245Z st.shared.b8 [%r33+512], %rs725; 2026-02-21T09:18:14.9973309Z st.shared.b8 [%r34+512], %rs726; 2026-02-21T09:18:14.9973413Z st.shared.b8 [%r35+512], %rs727; 2026-02-21T09:18:14.9973478Z st.shared.b8 [%r36+512], %rs728; 2026-02-21T09:18:14.9973541Z st.shared.b8 [%r37+1024], %rs729; 2026-02-21T09:18:14.9973640Z st.shared.b8 [%r38+1024], %rs730; 2026-02-21T09:18:14.9973703Z st.shared.b8 [%r39+1024], %rs731; 2026-02-21T09:18:14.9973790Z st.shared.b8 [%r40+1024], %rs732; 2026-02-21T09:18:14.9973855Z st.shared.b8 [%r41+1536], %rs733; 2026-02-21T09:18:14.9973947Z st.shared.b8 [%r42+1536], %rs734; 2026-02-21T09:18:14.9974010Z st.shared.b8 [%r43+1536], %rs735; 2026-02-21T09:18:14.9974073Z st.shared.b8 [%r44+1536], %rs736; 2026-02-21T09:18:14.9974138Z bar.sync 0; 2026-02-21T09:18:14.9974201Z ld.shared.b32 %r2316, [%r45]; 2026-02-21T09:18:14.9974265Z ld.shared.b32 %r2317, [%r46]; 2026-02-21T09:18:14.9974329Z ld.shared.b32 %r2318, [%r47]; 2026-02-21T09:18:14.9974399Z ld.shared.b32 %r2319, [%r48]; 2026-02-21T09:18:14.9974566Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:14.9974631Z cvt.s8.s32 %rs737, %r2313; 2026-02-21T09:18:14.9974705Z cvt.rn.f32.s16 %r2320, 
%rs737; 2026-02-21T09:18:14.9974768Z cvt.s8.s32 %rs738, %r2312; 2026-02-21T09:18:14.9974836Z cvt.rn.f32.s16 %r2321, %rs738; 2026-02-21T09:18:14.9974904Z cvt.s8.s32 %rs739, %r2317; 2026-02-21T09:18:14.9974969Z cvt.rn.f32.s16 %r2322, %rs739; 2026-02-21T09:18:14.9975029Z cvt.s8.s32 %rs740, %r2316; 2026-02-21T09:18:14.9975091Z cvt.rn.f32.s16 %r2323, %rs740; 2026-02-21T09:18:14.9975158Z cvt.s8.s32 %rs741, %r2315; 2026-02-21T09:18:14.9975221Z cvt.rn.f32.s16 %r2324, %rs741; 2026-02-21T09:18:14.9975282Z cvt.s8.s32 %rs742, %r2314; 2026-02-21T09:18:14.9975350Z cvt.rn.f32.s16 %r2325, %rs742; 2026-02-21T09:18:14.9975409Z cvt.s8.s32 %rs743, %r2319; 2026-02-21T09:18:14.9975470Z cvt.rn.f32.s16 %r2326, %rs743; 2026-02-21T09:18:14.9975531Z cvt.s8.s32 %rs744, %r2318; 2026-02-21T09:18:14.9975601Z cvt.rn.f32.s16 %r2327, %rs744; 2026-02-21T09:18:14.9975672Z prmt.b32 %r2328, %r2313, 0, 0x9991U; 2026-02-21T09:18:14.9975734Z cvt.u16.u32 %rs745, %r2328; 2026-02-21T09:18:14.9975804Z cvt.rn.f32.s16 %r2329, %rs745; 2026-02-21T09:18:14.9975872Z prmt.b32 %r2330, %r2312, 0, 0x9991U; 2026-02-21T09:18:14.9975936Z cvt.u16.u32 %rs746, %r2330; 2026-02-21T09:18:14.9976004Z cvt.rn.f32.s16 %r2331, %rs746; 2026-02-21T09:18:14.9976069Z prmt.b32 %r2332, %r2317, 0, 0x9991U; 2026-02-21T09:18:14.9976131Z cvt.u16.u32 %rs747, %r2332; 2026-02-21T09:18:14.9976192Z cvt.rn.f32.s16 %r2333, %rs747; 2026-02-21T09:18:14.9976265Z prmt.b32 %r2334, %r2316, 0, 0x9991U; 2026-02-21T09:18:14.9976326Z cvt.u16.u32 %rs748, %r2334; 2026-02-21T09:18:14.9976388Z cvt.rn.f32.s16 %r2335, %rs748; 2026-02-21T09:18:14.9976459Z prmt.b32 %r2336, %r2315, 0, 0x9991U; 2026-02-21T09:18:14.9976519Z cvt.u16.u32 %rs749, %r2336; 2026-02-21T09:18:14.9976581Z cvt.rn.f32.s16 %r2337, %rs749; 2026-02-21T09:18:14.9976643Z prmt.b32 %r2338, %r2314, 0, 0x9991U; 2026-02-21T09:18:14.9976712Z cvt.u16.u32 %rs750, %r2338; 2026-02-21T09:18:14.9976773Z cvt.rn.f32.s16 %r2339, %rs750; 2026-02-21T09:18:14.9976837Z prmt.b32 %r2340, %r2319, 0, 0x9991U; 
2026-02-21T09:18:14.9976903Z cvt.u16.u32 %rs751, %r2340; 2026-02-21T09:18:14.9976966Z cvt.rn.f32.s16 %r2341, %rs751; 2026-02-21T09:18:14.9977037Z prmt.b32 %r2342, %r2318, 0, 0x9991U; 2026-02-21T09:18:14.9977100Z cvt.u16.u32 %rs752, %r2342; 2026-02-21T09:18:14.9977170Z cvt.rn.f32.s16 %r2343, %rs752; 2026-02-21T09:18:14.9977234Z prmt.b32 %r2344, %r2313, 0, 0xaaa2U; 2026-02-21T09:18:14.9977294Z cvt.u16.u32 %rs753, %r2344; 2026-02-21T09:18:14.9977365Z cvt.rn.f32.s16 %r2345, %rs753; 2026-02-21T09:18:14.9977428Z prmt.b32 %r2346, %r2312, 0, 0xaaa2U; 2026-02-21T09:18:14.9977489Z cvt.u16.u32 %rs754, %r2346; 2026-02-21T09:18:14.9977559Z cvt.rn.f32.s16 %r2347, %rs754; 2026-02-21T09:18:14.9977623Z prmt.b32 %r2348, %r2317, 0, 0xaaa2U; 2026-02-21T09:18:14.9977685Z cvt.u16.u32 %rs755, %r2348; 2026-02-21T09:18:14.9977749Z cvt.rn.f32.s16 %r2349, %rs755; 2026-02-21T09:18:14.9977845Z prmt.b32 %r2350, %r2316, 0, 0xaaa2U; 2026-02-21T09:18:14.9977907Z cvt.u16.u32 %rs756, %r2350; 2026-02-21T09:18:14.9977968Z cvt.rn.f32.s16 %r2351, %rs756; 2026-02-21T09:18:14.9978059Z prmt.b32 %r2352, %r2315, 0, 0xaaa2U; 2026-02-21T09:18:14.9978121Z cvt.u16.u32 %rs757, %r2352; 2026-02-21T09:18:14.9978202Z cvt.rn.f32.s16 %r2353, %rs757; 2026-02-21T09:18:14.9978266Z prmt.b32 %r2354, %r2314, 0, 0xaaa2U; 2026-02-21T09:18:14.9978356Z cvt.u16.u32 %rs758, %r2354; 2026-02-21T09:18:14.9978419Z cvt.rn.f32.s16 %r2355, %rs758; 2026-02-21T09:18:14.9978484Z prmt.b32 %r2356, %r2319, 0, 0xaaa2U; 2026-02-21T09:18:14.9978552Z cvt.u16.u32 %rs759, %r2356; 2026-02-21T09:18:14.9978614Z cvt.rn.f32.s16 %r2357, %rs759; 2026-02-21T09:18:14.9978676Z prmt.b32 %r2358, %r2318, 0, 0xaaa2U; 2026-02-21T09:18:14.9978737Z cvt.u16.u32 %rs760, %r2358; 2026-02-21T09:18:14.9978804Z cvt.rn.f32.s16 %r2359, %rs760; 2026-02-21T09:18:14.9978866Z prmt.b32 %r2360, %r2313, 0, 0xbbb3U; 2026-02-21T09:18:14.9978929Z cvt.u16.u32 %rs761, %r2360; 2026-02-21T09:18:14.9978999Z cvt.rn.f32.s16 %r2361, %rs761; 2026-02-21T09:18:14.9979062Z prmt.b32 %r2362, 
%r2312, 0, 0xbbb3U; 2026-02-21T09:18:14.9979124Z cvt.u16.u32 %rs762, %r2362; 2026-02-21T09:18:14.9979193Z cvt.rn.f32.s16 %r2363, %rs762; 2026-02-21T09:18:14.9979257Z prmt.b32 %r2364, %r2317, 0, 0xbbb3U; 2026-02-21T09:18:14.9979319Z cvt.u16.u32 %rs763, %r2364; 2026-02-21T09:18:14.9979381Z cvt.rn.f32.s16 %r2365, %rs763; 2026-02-21T09:18:14.9979452Z prmt.b32 %r2366, %r2316, 0, 0xbbb3U; 2026-02-21T09:18:14.9979514Z cvt.u16.u32 %rs764, %r2366; 2026-02-21T09:18:14.9979575Z cvt.rn.f32.s16 %r2367, %rs764; 2026-02-21T09:18:14.9979645Z prmt.b32 %r2368, %r2315, 0, 0xbbb3U; 2026-02-21T09:18:14.9979707Z cvt.u16.u32 %rs765, %r2368; 2026-02-21T09:18:14.9979769Z cvt.rn.f32.s16 %r2369, %rs765; 2026-02-21T09:18:14.9979831Z prmt.b32 %r2370, %r2314, 0, 0xbbb3U; 2026-02-21T09:18:14.9979900Z cvt.u16.u32 %rs766, %r2370; 2026-02-21T09:18:14.9979963Z cvt.rn.f32.s16 %r2371, %rs766; 2026-02-21T09:18:14.9980036Z prmt.b32 %r2372, %r2319, 0, 0xbbb3U; 2026-02-21T09:18:14.9980101Z cvt.u16.u32 %rs767, %r2372; 2026-02-21T09:18:14.9980159Z cvt.rn.f32.s16 %r2373, %rs767; 2026-02-21T09:18:14.9980221Z prmt.b32 %r2374, %r2318, 0, 0xbbb3U; 2026-02-21T09:18:14.9980278Z cvt.u16.u32 %rs768, %r2374; 2026-02-21T09:18:14.9980346Z cvt.rn.f32.s16 %r2375, %rs768; 2026-02-21T09:18:14.9980401Z bar.sync 0; 2026-02-21T09:18:14.9980496Z st.shared.v4.b32 [%r49], {%r2321, %r2323, %r2320, %r2322}; 2026-02-21T09:18:14.9980600Z st.shared.v4.b32 [%r50], {%r2325, %r2327, %r2324, %r2326}; 2026-02-21T09:18:14.9980691Z st.shared.v4.b32 [%r51], {%r2331, %r2335, %r2329, %r2333}; 2026-02-21T09:18:14.9980778Z st.shared.v4.b32 [%r52], {%r2339, %r2343, %r2337, %r2341}; 2026-02-21T09:18:14.9980874Z st.shared.v4.b32 [%r53], {%r2347, %r2351, %r2345, %r2349}; 2026-02-21T09:18:14.9980962Z st.shared.v4.b32 [%r54], {%r2355, %r2359, %r2353, %r2357}; 2026-02-21T09:18:14.9981052Z st.shared.v4.b32 [%r55], {%r2363, %r2367, %r2361, %r2365}; 2026-02-21T09:18:14.9981139Z st.shared.v4.b32 [%r56], {%r2371, %r2375, %r2369, %r2373}; 
2026-02-21T09:18:14.9981200Z $L__tmp159: 2026-02-21T09:18:14.9981414Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9981473Z // begin inline asm 2026-02-21T09:18:14.9981814Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2116, %r2117, %r2118, %r2119, %r2120, %r2121, %r2122, %r2123, %r2124, %r2125, %r2126, %r2127, %r2128, %r2129, %r2130, %r2131}, [%r625 + 0], 64; 2026-02-21T09:18:14.9981871Z // end inline asm 2026-02-21T09:18:14.9981930Z // begin inline asm 2026-02-21T09:18:14.9982236Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2133, %r2134, %r2135, %r2136, %r2137, %r2138, %r2139, %r2140, %r2141, %r2142, %r2143, %r2144, %r2145, %r2146, %r2147, %r2148}, [%r625 + 16], 64; 2026-02-21T09:18:14.9982292Z // end inline asm 2026-02-21T09:18:14.9982349Z // begin inline asm 2026-02-21T09:18:14.9982680Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2150, %r2151, %r2152, %r2153, %r2154, %r2155, %r2156, %r2157, %r2158, %r2159, %r2160, %r2161, %r2162, %r2163, %r2164, %r2165}, [%r625 + 32], 64; 2026-02-21T09:18:14.9982759Z // end inline asm 2026-02-21T09:18:14.9982816Z // begin inline asm 2026-02-21T09:18:14.9983177Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2167, %r2168, %r2169, %r2170, %r2171, %r2172, %r2173, %r2174, %r2175, %r2176, %r2177, %r2178, %r2179, %r2180, %r2181, %r2182}, [%r625 + 48], 64; 2026-02-21T09:18:14.9983234Z // end inline asm 2026-02-21T09:18:14.9983290Z // begin inline asm 2026-02-21T09:18:14.9983361Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:14.9983425Z // end inline asm 2026-02-21T09:18:14.9983487Z mov.pred %p113, -1; 2026-02-21T09:18:14.9983543Z // begin inline asm 2026-02-21T09:18:14.9983858Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 0], 64, {%r2116, %r2117, %r2118, %r2119, %r2120, %r2121, %r2122, %r2123, %r2124, %r2125, %r2126, %r2127, %r2128, %r2129, %r2130, %r2131}; 2026-02-21T09:18:14.9983914Z // end inline asm 2026-02-21T09:18:14.9983971Z // begin inline 
asm 2026-02-21T09:18:14.9984286Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 16], 64, {%r2133, %r2134, %r2135, %r2136, %r2137, %r2138, %r2139, %r2140, %r2141, %r2142, %r2143, %r2144, %r2145, %r2146, %r2147, %r2148}; 2026-02-21T09:18:14.9984341Z // end inline asm 2026-02-21T09:18:14.9984399Z // begin inline asm 2026-02-21T09:18:14.9984712Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 32], 64, {%r2150, %r2151, %r2152, %r2153, %r2154, %r2155, %r2156, %r2157, %r2158, %r2159, %r2160, %r2161, %r2162, %r2163, %r2164, %r2165}; 2026-02-21T09:18:14.9984767Z // end inline asm 2026-02-21T09:18:14.9984823Z // begin inline asm 2026-02-21T09:18:14.9985126Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 48], 64, {%r2167, %r2168, %r2169, %r2170, %r2171, %r2172, %r2173, %r2174, %r2175, %r2176, %r2177, %r2178, %r2179, %r2180, %r2181, %r2182}; 2026-02-21T09:18:14.9985189Z // end inline asm 2026-02-21T09:18:14.9985244Z // begin inline asm 2026-02-21T09:18:14.9985312Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9985375Z // end inline asm 2026-02-21T09:18:14.9985430Z bar.sync 0; 2026-02-21T09:18:14.9985485Z // begin inline asm 2026-02-21T09:18:14.9985804Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5846 + 0], 16, {%r2253, %r2254, %r2255, %r2256, %r2257, %r2258, %r2259, %r2260, %r2261, %r2262, %r2263, %r2264, %r2265, %r2266, %r2267, %r2268}; 2026-02-21T09:18:14.9985858Z // end inline asm 2026-02-21T09:18:14.9985914Z // begin inline asm 2026-02-21T09:18:14.9985980Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:14.9986042Z // end inline asm 2026-02-21T09:18:14.9986097Z bar.sync 0; 2026-02-21T09:18:14.9986154Z // begin inline asm 2026-02-21T09:18:14.9986233Z fence.proxy.async.shared::cta; 2026-02-21T09:18:14.9986288Z // end inline asm 2026-02-21T09:18:14.9986345Z // begin inline asm 2026-02-21T09:18:14.9986446Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:14.9986503Z // end inline asm 2026-02-21T09:18:14.9986557Z 
bar.sync 0; 2026-02-21T09:18:14.9986618Z @%p20 bra $L__BB0_17; 2026-02-21T09:18:14.9986735Z // %bb.16: // in Loop: Header=BB0_13 Depth=2 2026-02-21T09:18:14.9986801Z elect.sync %r2388|%p114, -1; 2026-02-21T09:18:14.9986862Z mov.b32 %r2378, 69208336; 2026-02-21T09:18:14.9986926Z // begin inline asm 2026-02-21T09:18:14.9987082Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 0 ], %rd1, %r2378, %p113; 2026-02-21T09:18:14.9987138Z // end inline asm 2026-02-21T09:18:14.9987193Z // begin inline asm 2026-02-21T09:18:14.9987352Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 8 ], %rd2, %r2378, %p113; 2026-02-21T09:18:14.9987408Z // end inline asm 2026-02-21T09:18:14.9987463Z // begin inline asm 2026-02-21T09:18:14.9987623Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 16 ], %rd3, %r2378, %p113; 2026-02-21T09:18:14.9987704Z // end inline asm 2026-02-21T09:18:14.9987762Z // begin inline asm 2026-02-21T09:18:14.9987918Z @%p114 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 24 ], %rd4, %r2378, %p113; 2026-02-21T09:18:14.9987998Z // end inline asm 2026-02-21T09:18:14.9988060Z cvt.u64.u32 %rd289, %r5985; 2026-02-21T09:18:14.9988149Z // begin inline asm 2026-02-21T09:18:14.9988300Z @%p114 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd289]; 2026-02-21T09:18:14.9988357Z // end inline asm 2026-02-21T09:18:14.9988458Z $L__BB0_17: // in Loop: Header=BB0_13 Depth=2 2026-02-21T09:18:14.9988523Z // begin inline asm 2026-02-21T09:18:14.9988573Z 2026-02-21T09:18:14.9988624Z { 2026-02-21T09:18:14.9988692Z .reg .pred complete; 2026-02-21T09:18:14.9988748Z waitLoop: 2026-02-21T09:18:14.9988869Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r2102; 2026-02-21T09:18:14.9988936Z @!complete bra.uni waitLoop; 2026-02-21T09:18:14.9988997Z } 2026-02-21T09:18:14.9989000Z 2026-02-21T09:18:14.9989054Z // end inline asm 2026-02-21T09:18:14.9989108Z bar.sync 0; 2026-02-21T09:18:14.9989171Z // begin 
inline asm 2026-02-21T09:18:14.9989258Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:14.9989312Z // end inline asm 2026-02-21T09:18:14.9989365Z $L__tmp160: 2026-02-21T09:18:14.9989538Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:14.9989601Z add.s64 %rd291, %rd261, 128; 2026-02-21T09:18:14.9989760Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:14.9989830Z add.s64 %rd294, %rd264, 128; 2026-02-21T09:18:14.9989887Z // begin inline asm 2026-02-21T09:18:14.9989945Z mov.u64 %rd290, 0x0; 2026-02-21T09:18:14.9990061Z createpolicy.fractional.L2::evict_first.b64 %rd290, 1.0; 2026-02-21T09:18:14.9990116Z // end inline asm 2026-02-21T09:18:14.9990171Z // begin inline asm 2026-02-21T09:18:14.9990229Z mov.u32 %r2394, 0x0; 2026-02-21T09:18:14.9990290Z mov.u32 %r2395, 0x0; 2026-02-21T09:18:14.9990345Z mov.u32 %r2396, 0x0; 2026-02-21T09:18:14.9990398Z mov.u32 %r2397, 0x0; 2026-02-21T09:18:14.9990585Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2394, %r2395, %r2396, %r2397 }, [ %rd291 + 0 ], %rd290; 2026-02-21T09:18:14.9990640Z // end inline asm 2026-02-21T09:18:14.9990696Z // begin inline asm 2026-02-21T09:18:14.9990757Z mov.u64 %rd293, 0x0; 2026-02-21T09:18:14.9990863Z createpolicy.fractional.L2::evict_first.b64 %rd293, 1.0; 2026-02-21T09:18:14.9990918Z // end inline asm 2026-02-21T09:18:14.9990974Z // begin inline asm 2026-02-21T09:18:14.9991035Z mov.u32 %r2398, 0x0; 2026-02-21T09:18:14.9991088Z mov.u32 %r2399, 0x0; 2026-02-21T09:18:14.9991142Z mov.u32 %r2400, 0x0; 2026-02-21T09:18:14.9991201Z mov.u32 %r2401, 0x0; 2026-02-21T09:18:14.9991372Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2398, %r2399, %r2400, %r2401 }, [ %rd294 + 0 ], %rd293; 2026-02-21T09:18:14.9991431Z // end inline asm 2026-02-21T09:18:14.9991490Z $L__tmp161: 2026-02-21T09:18:14.9991723Z .loc 2 291 36 // standard.py:291:36 @[ 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:14.9991819Z st.shared.v4.b32 [%r26], {%r2394, %r2395, %r2396, %r2397}; 2026-02-21T09:18:14.9991921Z st.shared.v4.b32 [%r26+512], {%r2398, %r2399, %r2400, %r2401}; 2026-02-21T09:18:14.9991981Z bar.sync 0; 2026-02-21T09:18:14.9992073Z ld.shared.v4.b32 {%r2561, %r2562, %r2563, %r2564}, [%r27]; 2026-02-21T09:18:14.9992138Z mov.b32 {%rs769, %rs770}, %r2564; 2026-02-21T09:18:14.9992209Z mov.b32 {%rs771, %rs772}, %r2563; 2026-02-21T09:18:14.9992271Z mov.b32 {%rs773, %rs774}, %r2562; 2026-02-21T09:18:14.9992330Z mov.b32 {%rs775, %rs776}, %r2561; 2026-02-21T09:18:14.9992427Z ld.shared.v4.b32 {%r2565, %r2566, %r2567, %r2568}, [%r28]; 2026-02-21T09:18:14.9992487Z mov.b32 {%rs777, %rs778}, %r2568; 2026-02-21T09:18:14.9992546Z mov.b32 {%rs779, %rs780}, %r2567; 2026-02-21T09:18:14.9992635Z mov.b32 {%rs781, %rs782}, %r2566; 2026-02-21T09:18:14.9992703Z mov.b32 {%rs783, %rs784}, %r2565; 2026-02-21T09:18:14.9992757Z $L__tmp162: 2026-02-21T09:18:14.9992947Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:14.9993017Z cvt.f32.bf16 %r2543, %rs775; 2026-02-21T09:18:14.9993104Z cvt.f32.bf16 %r2544, %rs776; 2026-02-21T09:18:14.9993188Z cvt.f32.bf16 %r2545, %rs773; 2026-02-21T09:18:14.9993249Z cvt.f32.bf16 %r2546, %rs774; 2026-02-21T09:18:14.9993315Z cvt.f32.bf16 %r2547, %rs771; 2026-02-21T09:18:14.9993372Z cvt.f32.bf16 %r2548, %rs772; 2026-02-21T09:18:14.9993430Z cvt.f32.bf16 %r2549, %rs769; 2026-02-21T09:18:14.9993494Z cvt.f32.bf16 %r2550, %rs770; 2026-02-21T09:18:14.9993552Z cvt.f32.bf16 %r2551, %rs783; 2026-02-21T09:18:14.9993609Z cvt.f32.bf16 %r2552, %rs784; 2026-02-21T09:18:14.9993665Z cvt.f32.bf16 %r2553, %rs781; 2026-02-21T09:18:14.9993731Z cvt.f32.bf16 %r2554, %rs782; 2026-02-21T09:18:14.9993790Z cvt.f32.bf16 %r2555, %rs779; 2026-02-21T09:18:14.9993848Z cvt.f32.bf16 %r2556, %rs780; 2026-02-21T09:18:14.9993911Z cvt.f32.bf16 %r2557, %rs777; 
2026-02-21T09:18:14.9993970Z cvt.f32.bf16 %r2558, %rs778; 2026-02-21T09:18:14.9994134Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:14.9994206Z add.s32 %r2569, %r7844, 262144; 2026-02-21T09:18:14.9994270Z cvt.s64.s32 %rd299, %r2569; 2026-02-21T09:18:14.9994333Z add.s64 %rd297, %rd56, %rd299; 2026-02-21T09:18:14.9994497Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:14.9994566Z // begin inline asm 2026-02-21T09:18:14.9994624Z mov.u64 %rd296, 0x0; 2026-02-21T09:18:14.9994730Z createpolicy.fractional.L2::evict_first.b64 %rd296, 1.0; 2026-02-21T09:18:14.9994794Z // end inline asm 2026-02-21T09:18:14.9994850Z // begin inline asm 2026-02-21T09:18:14.9994905Z mov.u32 %r2402, 0x0; 2026-02-21T09:18:14.9994970Z mov.u32 %r2403, 0x0; 2026-02-21T09:18:14.9995024Z mov.u32 %r2404, 0x0; 2026-02-21T09:18:14.9995078Z mov.u32 %r2405, 0x0; 2026-02-21T09:18:14.9995246Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2402, %r2403, %r2404, %r2405 }, [ %rd297 + 0 ], %rd296; 2026-02-21T09:18:14.9995311Z // end inline asm 2026-02-21T09:18:14.9995377Z prmt.b32 %r2570, %r2402, 0, 0x8880U; 2026-02-21T09:18:14.9995439Z cvt.u16.u32 %rs785, %r2570; 2026-02-21T09:18:14.9995512Z prmt.b32 %r2571, %r2402, 0, 0x7770U; 2026-02-21T09:18:14.9995571Z cvt.u16.u32 %rs786, %r2571; 2026-02-21T09:18:14.9995633Z prmt.b32 %r2572, %r2402, 0, 0x9991U; 2026-02-21T09:18:14.9995692Z cvt.u16.u32 %rs787, %r2572; 2026-02-21T09:18:14.9995760Z prmt.b32 %r2573, %r2402, 0, 0x7771U; 2026-02-21T09:18:14.9995818Z cvt.u16.u32 %rs788, %r2573; 2026-02-21T09:18:14.9995878Z prmt.b32 %r2574, %r2402, 0, 0xaaa2U; 2026-02-21T09:18:14.9995943Z cvt.u16.u32 %rs789, %r2574; 2026-02-21T09:18:14.9996004Z prmt.b32 %r2575, %r2402, 0, 0x7772U; 2026-02-21T09:18:14.9996064Z cvt.u16.u32 %rs790, %r2575; 2026-02-21T09:18:14.9996131Z prmt.b32 %r2576, %r2402, 0, 0xbbb3U; 2026-02-21T09:18:14.9996189Z cvt.u16.u32 %rs791, %r2576; 
2026-02-21T09:18:14.9996252Z prmt.b32 %r2577, %r2402, 0, 0x7773U; 2026-02-21T09:18:14.9996309Z cvt.u16.u32 %rs792, %r2577; 2026-02-21T09:18:14.9996378Z prmt.b32 %r2578, %r2403, 0, 0x8880U; 2026-02-21T09:18:14.9996436Z cvt.u16.u32 %rs793, %r2578; 2026-02-21T09:18:14.9996497Z prmt.b32 %r2579, %r2403, 0, 0x7770U; 2026-02-21T09:18:14.9996561Z cvt.u16.u32 %rs794, %r2579; 2026-02-21T09:18:14.9996621Z prmt.b32 %r2580, %r2403, 0, 0x9991U; 2026-02-21T09:18:14.9996679Z cvt.u16.u32 %rs795, %r2580; 2026-02-21T09:18:14.9996738Z prmt.b32 %r2581, %r2403, 0, 0x7771U; 2026-02-21T09:18:14.9996802Z cvt.u16.u32 %rs796, %r2581; 2026-02-21T09:18:14.9996861Z prmt.b32 %r2582, %r2403, 0, 0xaaa2U; 2026-02-21T09:18:14.9996920Z cvt.u16.u32 %rs797, %r2582; 2026-02-21T09:18:14.9996988Z prmt.b32 %r2583, %r2403, 0, 0x7772U; 2026-02-21T09:18:14.9997066Z cvt.u16.u32 %rs798, %r2583; 2026-02-21T09:18:14.9997127Z prmt.b32 %r2584, %r2403, 0, 0xbbb3U; 2026-02-21T09:18:14.9997191Z cvt.u16.u32 %rs799, %r2584; 2026-02-21T09:18:14.9997275Z prmt.b32 %r2585, %r2403, 0, 0x7773U; 2026-02-21T09:18:14.9997333Z cvt.u16.u32 %rs800, %r2585; 2026-02-21T09:18:14.9997412Z prmt.b32 %r2586, %r2404, 0, 0x8880U; 2026-02-21T09:18:14.9997478Z cvt.u16.u32 %rs801, %r2586; 2026-02-21T09:18:14.9997557Z prmt.b32 %r2587, %r2404, 0, 0x7770U; 2026-02-21T09:18:14.9997616Z cvt.u16.u32 %rs802, %r2587; 2026-02-21T09:18:14.9997682Z prmt.b32 %r2588, %r2404, 0, 0x9991U; 2026-02-21T09:18:14.9997738Z cvt.u16.u32 %rs803, %r2588; 2026-02-21T09:18:14.9997798Z prmt.b32 %r2589, %r2404, 0, 0x7771U; 2026-02-21T09:18:14.9997855Z cvt.u16.u32 %rs804, %r2589; 2026-02-21T09:18:14.9997921Z prmt.b32 %r2590, %r2404, 0, 0xaaa2U; 2026-02-21T09:18:14.9997978Z cvt.u16.u32 %rs805, %r2590; 2026-02-21T09:18:14.9998038Z prmt.b32 %r2591, %r2404, 0, 0x7772U; 2026-02-21T09:18:14.9998103Z cvt.u16.u32 %rs806, %r2591; 2026-02-21T09:18:14.9998162Z prmt.b32 %r2592, %r2404, 0, 0xbbb3U; 2026-02-21T09:18:14.9998219Z cvt.u16.u32 %rs807, %r2592; 2026-02-21T09:18:14.9998280Z 
prmt.b32 %r2593, %r2404, 0, 0x7773U; 2026-02-21T09:18:14.9998344Z cvt.u16.u32 %rs808, %r2593; 2026-02-21T09:18:14.9998407Z prmt.b32 %r2594, %r2405, 0, 0x8880U; 2026-02-21T09:18:14.9998464Z cvt.u16.u32 %rs809, %r2594; 2026-02-21T09:18:14.9998530Z prmt.b32 %r2595, %r2405, 0, 0x7770U; 2026-02-21T09:18:14.9998587Z cvt.u16.u32 %rs810, %r2595; 2026-02-21T09:18:14.9998647Z prmt.b32 %r2596, %r2405, 0, 0x9991U; 2026-02-21T09:18:14.9998712Z cvt.u16.u32 %rs811, %r2596; 2026-02-21T09:18:14.9998771Z prmt.b32 %r2597, %r2405, 0, 0x7771U; 2026-02-21T09:18:14.9998827Z cvt.u16.u32 %rs812, %r2597; 2026-02-21T09:18:14.9998887Z prmt.b32 %r2598, %r2405, 0, 0xaaa2U; 2026-02-21T09:18:14.9998953Z cvt.u16.u32 %rs813, %r2598; 2026-02-21T09:18:14.9999013Z prmt.b32 %r2599, %r2405, 0, 0x7772U; 2026-02-21T09:18:14.9999072Z cvt.u16.u32 %rs814, %r2599; 2026-02-21T09:18:14.9999139Z prmt.b32 %r2600, %r2405, 0, 0xbbb3U; 2026-02-21T09:18:14.9999196Z cvt.u16.u32 %rs815, %r2600; 2026-02-21T09:18:14.9999258Z prmt.b32 %r2601, %r2405, 0, 0x7773U; 2026-02-21T09:18:14.9999315Z cvt.u16.u32 %rs816, %r2601; 2026-02-21T09:18:14.9999480Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:14.9999540Z shl.b16 %rs817, %rs786, 12; 2026-02-21T09:18:14.9999598Z shr.s16 %rs818, %rs817, 12; 2026-02-21T09:18:14.9999663Z shl.b16 %rs819, %rs788, 12; 2026-02-21T09:18:14.9999721Z shr.s16 %rs820, %rs819, 12; 2026-02-21T09:18:14.9999777Z shl.b16 %rs821, %rs790, 12; 2026-02-21T09:18:14.9999840Z shr.s16 %rs822, %rs821, 12; 2026-02-21T09:18:14.9999898Z shl.b16 %rs823, %rs792, 12; 2026-02-21T09:18:14.9999955Z shr.s16 %rs824, %rs823, 12; 2026-02-21T09:18:15.0000013Z shl.b16 %rs825, %rs794, 12; 2026-02-21T09:18:15.0000079Z shr.s16 %rs826, %rs825, 12; 2026-02-21T09:18:15.0000137Z shl.b16 %rs827, %rs796, 12; 2026-02-21T09:18:15.0000194Z shr.s16 %rs828, %rs827, 12; 2026-02-21T09:18:15.0000256Z shl.b16 %rs829, %rs798, 12; 2026-02-21T09:18:15.0000315Z shr.s16 %rs830, %rs829, 12; 
2026-02-21T09:18:15.0000373Z shl.b16 %rs831, %rs800, 12; 2026-02-21T09:18:15.0000429Z shr.s16 %rs832, %rs831, 12; 2026-02-21T09:18:15.0000495Z shl.b16 %rs833, %rs802, 12; 2026-02-21T09:18:15.0000555Z shr.s16 %rs834, %rs833, 12; 2026-02-21T09:18:15.0000613Z shl.b16 %rs835, %rs804, 12; 2026-02-21T09:18:15.0000678Z shr.s16 %rs836, %rs835, 12; 2026-02-21T09:18:15.0000737Z shl.b16 %rs837, %rs806, 12; 2026-02-21T09:18:15.0000795Z shr.s16 %rs838, %rs837, 12; 2026-02-21T09:18:15.0000853Z shl.b16 %rs839, %rs808, 12; 2026-02-21T09:18:15.0000921Z shr.s16 %rs840, %rs839, 12; 2026-02-21T09:18:15.0000978Z shl.b16 %rs841, %rs810, 12; 2026-02-21T09:18:15.0001036Z shr.s16 %rs842, %rs841, 12; 2026-02-21T09:18:15.0001100Z shl.b16 %rs843, %rs812, 12; 2026-02-21T09:18:15.0001181Z shr.s16 %rs844, %rs843, 12; 2026-02-21T09:18:15.0001238Z shl.b16 %rs845, %rs814, 12; 2026-02-21T09:18:15.0001302Z shr.s16 %rs846, %rs845, 12; 2026-02-21T09:18:15.0001358Z shl.b16 %rs847, %rs816, 12; 2026-02-21T09:18:15.0001436Z shr.s16 %rs848, %rs847, 12; 2026-02-21T09:18:15.0001668Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0001768Z shr.u16 %rs849, %rs785, 4; 2026-02-21T09:18:15.0001831Z shr.u16 %rs850, %rs787, 4; 2026-02-21T09:18:15.0001891Z shr.u16 %rs851, %rs789, 4; 2026-02-21T09:18:15.0001956Z shr.u16 %rs852, %rs791, 4; 2026-02-21T09:18:15.0002014Z shr.u16 %rs853, %rs793, 4; 2026-02-21T09:18:15.0002071Z shr.u16 %rs854, %rs795, 4; 2026-02-21T09:18:15.0002128Z shr.u16 %rs855, %rs797, 4; 2026-02-21T09:18:15.0002192Z shr.u16 %rs856, %rs799, 4; 2026-02-21T09:18:15.0002251Z shr.u16 %rs857, %rs801, 4; 2026-02-21T09:18:15.0002307Z shr.u16 %rs858, %rs803, 4; 2026-02-21T09:18:15.0002375Z shr.u16 %rs859, %rs805, 4; 2026-02-21T09:18:15.0002433Z shr.u16 %rs860, %rs807, 4; 2026-02-21T09:18:15.0002490Z shr.u16 %rs861, %rs809, 4; 2026-02-21T09:18:15.0002549Z shr.u16 %rs862, %rs811, 4; 2026-02-21T09:18:15.0002618Z shr.u16 %rs863, %rs813, 4; 
2026-02-21T09:18:15.0002676Z shr.u16 %rs864, %rs815, 4; 2026-02-21T09:18:15.0002836Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0002899Z bar.sync 0; 2026-02-21T09:18:15.0002961Z st.shared.b8 [%r29], %rs818; 2026-02-21T09:18:15.0003023Z st.shared.b8 [%r30], %rs820; 2026-02-21T09:18:15.0003091Z st.shared.b8 [%r31], %rs822; 2026-02-21T09:18:15.0003151Z st.shared.b8 [%r32], %rs824; 2026-02-21T09:18:15.0003215Z st.shared.b8 [%r33+512], %rs826; 2026-02-21T09:18:15.0003279Z st.shared.b8 [%r34+512], %rs828; 2026-02-21T09:18:15.0003349Z st.shared.b8 [%r35+512], %rs830; 2026-02-21T09:18:15.0003410Z st.shared.b8 [%r36+512], %rs832; 2026-02-21T09:18:15.0003475Z st.shared.b8 [%r37+1024], %rs834; 2026-02-21T09:18:15.0003544Z st.shared.b8 [%r38+1024], %rs836; 2026-02-21T09:18:15.0003605Z st.shared.b8 [%r39+1024], %rs838; 2026-02-21T09:18:15.0003665Z st.shared.b8 [%r40+1024], %rs840; 2026-02-21T09:18:15.0003726Z st.shared.b8 [%r41+1536], %rs842; 2026-02-21T09:18:15.0003794Z st.shared.b8 [%r42+1536], %rs844; 2026-02-21T09:18:15.0003855Z st.shared.b8 [%r43+1536], %rs846; 2026-02-21T09:18:15.0003916Z st.shared.b8 [%r44+1536], %rs848; 2026-02-21T09:18:15.0003976Z bar.sync 0; 2026-02-21T09:18:15.0004039Z ld.shared.b32 %r2602, [%r45]; 2026-02-21T09:18:15.0004100Z ld.shared.b32 %r2603, [%r46]; 2026-02-21T09:18:15.0004160Z ld.shared.b32 %r2604, [%r47]; 2026-02-21T09:18:15.0004226Z ld.shared.b32 %r2605, [%r48]; 2026-02-21T09:18:15.0004387Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0004440Z bar.sync 0; 2026-02-21T09:18:15.0004506Z st.shared.b8 [%r29], %rs849; 2026-02-21T09:18:15.0004567Z st.shared.b8 [%r30], %rs850; 2026-02-21T09:18:15.0004625Z st.shared.b8 [%r31], %rs851; 2026-02-21T09:18:15.0004691Z st.shared.b8 [%r32], %rs852; 2026-02-21T09:18:15.0004751Z st.shared.b8 [%r33+512], %rs853; 2026-02-21T09:18:15.0004813Z st.shared.b8 [%r34+512], %rs854; 2026-02-21T09:18:15.0004873Z 
st.shared.b8 [%r35+512], %rs855; 2026-02-21T09:18:15.0004942Z st.shared.b8 [%r36+512], %rs856; 2026-02-21T09:18:15.0005003Z st.shared.b8 [%r37+1024], %rs857; 2026-02-21T09:18:15.0005063Z st.shared.b8 [%r38+1024], %rs858; 2026-02-21T09:18:15.0005129Z st.shared.b8 [%r39+1024], %rs859; 2026-02-21T09:18:15.0005187Z st.shared.b8 [%r40+1024], %rs860; 2026-02-21T09:18:15.0005246Z st.shared.b8 [%r41+1536], %rs861; 2026-02-21T09:18:15.0005305Z st.shared.b8 [%r42+1536], %rs862; 2026-02-21T09:18:15.0005371Z st.shared.b8 [%r43+1536], %rs863; 2026-02-21T09:18:15.0005430Z st.shared.b8 [%r44+1536], %rs864; 2026-02-21T09:18:15.0005483Z bar.sync 0; 2026-02-21T09:18:15.0005551Z ld.shared.b32 %r2606, [%r45]; 2026-02-21T09:18:15.0005638Z ld.shared.b32 %r2607, [%r46]; 2026-02-21T09:18:15.0005698Z ld.shared.b32 %r2608, [%r47]; 2026-02-21T09:18:15.0005756Z ld.shared.b32 %r2609, [%r48]; 2026-02-21T09:18:15.0005947Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0006028Z cvt.s8.s32 %rs865, %r2603; 2026-02-21T09:18:15.0006091Z cvt.rn.f32.s16 %r2610, %rs865; 2026-02-21T09:18:15.0006178Z cvt.s8.s32 %rs866, %r2602; 2026-02-21T09:18:15.0006242Z cvt.rn.f32.s16 %r2611, %rs866; 2026-02-21T09:18:15.0006301Z cvt.s8.s32 %rs867, %r2607; 2026-02-21T09:18:15.0006369Z cvt.rn.f32.s16 %r2612, %rs867; 2026-02-21T09:18:15.0006427Z cvt.s8.s32 %rs868, %r2606; 2026-02-21T09:18:15.0006487Z cvt.rn.f32.s16 %r2613, %rs868; 2026-02-21T09:18:15.0006544Z cvt.s8.s32 %rs869, %r2605; 2026-02-21T09:18:15.0006611Z cvt.rn.f32.s16 %r2614, %rs869; 2026-02-21T09:18:15.0006669Z cvt.s8.s32 %rs870, %r2604; 2026-02-21T09:18:15.0006730Z cvt.rn.f32.s16 %r2615, %rs870; 2026-02-21T09:18:15.0006795Z cvt.s8.s32 %rs871, %r2609; 2026-02-21T09:18:15.0006853Z cvt.rn.f32.s16 %r2616, %rs871; 2026-02-21T09:18:15.0006911Z cvt.s8.s32 %rs872, %r2608; 2026-02-21T09:18:15.0006973Z cvt.rn.f32.s16 %r2617, %rs872; 2026-02-21T09:18:15.0007046Z prmt.b32 %r2618, %r2603, 0, 0x9991U; 
2026-02-21T09:18:15.0007106Z cvt.u16.u32 %rs873, %r2618; 2026-02-21T09:18:15.0007165Z cvt.rn.f32.s16 %r2619, %rs873; 2026-02-21T09:18:15.0007237Z prmt.b32 %r2620, %r2602, 0, 0x9991U; 2026-02-21T09:18:15.0007297Z cvt.u16.u32 %rs874, %r2620; 2026-02-21T09:18:15.0007356Z cvt.rn.f32.s16 %r2621, %rs874; 2026-02-21T09:18:15.0007420Z prmt.b32 %r2622, %r2607, 0, 0x9991U; 2026-02-21T09:18:15.0007489Z cvt.u16.u32 %rs875, %r2622; 2026-02-21T09:18:15.0007551Z cvt.rn.f32.s16 %r2623, %rs875; 2026-02-21T09:18:15.0007615Z prmt.b32 %r2624, %r2606, 0, 0x9991U; 2026-02-21T09:18:15.0007684Z cvt.u16.u32 %rs876, %r2624; 2026-02-21T09:18:15.0007744Z cvt.rn.f32.s16 %r2625, %rs876; 2026-02-21T09:18:15.0007806Z prmt.b32 %r2626, %r2605, 0, 0x9991U; 2026-02-21T09:18:15.0007871Z cvt.u16.u32 %rs877, %r2626; 2026-02-21T09:18:15.0007929Z cvt.rn.f32.s16 %r2627, %rs877; 2026-02-21T09:18:15.0007991Z prmt.b32 %r2628, %r2604, 0, 0x9991U; 2026-02-21T09:18:15.0008049Z cvt.u16.u32 %rs878, %r2628; 2026-02-21T09:18:15.0008116Z cvt.rn.f32.s16 %r2629, %rs878; 2026-02-21T09:18:15.0008176Z prmt.b32 %r2630, %r2609, 0, 0x9991U; 2026-02-21T09:18:15.0008236Z cvt.u16.u32 %rs879, %r2630; 2026-02-21T09:18:15.0008302Z cvt.rn.f32.s16 %r2631, %rs879; 2026-02-21T09:18:15.0008362Z prmt.b32 %r2632, %r2608, 0, 0x9991U; 2026-02-21T09:18:15.0008419Z cvt.u16.u32 %rs880, %r2632; 2026-02-21T09:18:15.0008478Z cvt.rn.f32.s16 %r2633, %rs880; 2026-02-21T09:18:15.0008545Z prmt.b32 %r2634, %r2603, 0, 0xaaa2U; 2026-02-21T09:18:15.0008603Z cvt.u16.u32 %rs881, %r2634; 2026-02-21T09:18:15.0008661Z cvt.rn.f32.s16 %r2635, %rs881; 2026-02-21T09:18:15.0008728Z prmt.b32 %r2636, %r2602, 0, 0xaaa2U; 2026-02-21T09:18:15.0008788Z cvt.u16.u32 %rs882, %r2636; 2026-02-21T09:18:15.0008847Z cvt.rn.f32.s16 %r2637, %rs882; 2026-02-21T09:18:15.0008913Z prmt.b32 %r2638, %r2607, 0, 0xaaa2U; 2026-02-21T09:18:15.0008971Z cvt.u16.u32 %rs883, %r2638; 2026-02-21T09:18:15.0009030Z cvt.rn.f32.s16 %r2639, %rs883; 2026-02-21T09:18:15.0009093Z prmt.b32 %r2640, 
%r2606, 0, 0xaaa2U; 2026-02-21T09:18:15.0009158Z cvt.u16.u32 %rs884, %r2640; 2026-02-21T09:18:15.0009218Z cvt.rn.f32.s16 %r2641, %rs884; 2026-02-21T09:18:15.0009279Z prmt.b32 %r2642, %r2605, 0, 0xaaa2U; 2026-02-21T09:18:15.0009344Z cvt.u16.u32 %rs885, %r2642; 2026-02-21T09:18:15.0009403Z cvt.rn.f32.s16 %r2643, %rs885; 2026-02-21T09:18:15.0009462Z prmt.b32 %r2644, %r2604, 0, 0xaaa2U; 2026-02-21T09:18:15.0009521Z cvt.u16.u32 %rs886, %r2644; 2026-02-21T09:18:15.0009586Z cvt.rn.f32.s16 %r2645, %rs886; 2026-02-21T09:18:15.0009646Z prmt.b32 %r2646, %r2609, 0, 0xaaa2U; 2026-02-21T09:18:15.0009705Z cvt.u16.u32 %rs887, %r2646; 2026-02-21T09:18:15.0009793Z cvt.rn.f32.s16 %r2647, %rs887; 2026-02-21T09:18:15.0009853Z prmt.b32 %r2648, %r2608, 0, 0xaaa2U; 2026-02-21T09:18:15.0009911Z cvt.u16.u32 %rs888, %r2648; 2026-02-21T09:18:15.0009970Z cvt.rn.f32.s16 %r2649, %rs888; 2026-02-21T09:18:15.0010063Z prmt.b32 %r2650, %r2603, 0, 0xbbb3U; 2026-02-21T09:18:15.0010121Z cvt.u16.u32 %rs889, %r2650; 2026-02-21T09:18:15.0010203Z cvt.rn.f32.s16 %r2651, %rs889; 2026-02-21T09:18:15.0010289Z prmt.b32 %r2652, %r2602, 0, 0xbbb3U; 2026-02-21T09:18:15.0010348Z cvt.u16.u32 %rs890, %r2652; 2026-02-21T09:18:15.0010408Z cvt.rn.f32.s16 %r2653, %rs890; 2026-02-21T09:18:15.0010475Z prmt.b32 %r2654, %r2607, 0, 0xbbb3U; 2026-02-21T09:18:15.0010533Z cvt.u16.u32 %rs891, %r2654; 2026-02-21T09:18:15.0010592Z cvt.rn.f32.s16 %r2655, %rs891; 2026-02-21T09:18:15.0010653Z prmt.b32 %r2656, %r2606, 0, 0xbbb3U; 2026-02-21T09:18:15.0010719Z cvt.u16.u32 %rs892, %r2656; 2026-02-21T09:18:15.0010777Z cvt.rn.f32.s16 %r2657, %rs892; 2026-02-21T09:18:15.0010838Z prmt.b32 %r2658, %r2605, 0, 0xbbb3U; 2026-02-21T09:18:15.0010902Z cvt.u16.u32 %rs893, %r2658; 2026-02-21T09:18:15.0010961Z cvt.rn.f32.s16 %r2659, %rs893; 2026-02-21T09:18:15.0011021Z prmt.b32 %r2660, %r2604, 0, 0xbbb3U; 2026-02-21T09:18:15.0011081Z cvt.u16.u32 %rs894, %r2660; 2026-02-21T09:18:15.0011147Z cvt.rn.f32.s16 %r2661, %rs894; 2026-02-21T09:18:15.0011206Z 
prmt.b32 %r2662, %r2609, 0, 0xbbb3U; 2026-02-21T09:18:15.0011264Z cvt.u16.u32 %rs895, %r2662; 2026-02-21T09:18:15.0011329Z cvt.rn.f32.s16 %r2663, %rs895; 2026-02-21T09:18:15.0011388Z prmt.b32 %r2664, %r2608, 0, 0xbbb3U; 2026-02-21T09:18:15.0011444Z cvt.u16.u32 %rs896, %r2664; 2026-02-21T09:18:15.0011503Z cvt.rn.f32.s16 %r2665, %rs896; 2026-02-21T09:18:15.0011588Z bar.sync 0; 2026-02-21T09:18:15.0011686Z st.shared.v4.b32 [%r49], {%r2611, %r2613, %r2610, %r2612}; 2026-02-21T09:18:15.0011779Z st.shared.v4.b32 [%r50], {%r2615, %r2617, %r2614, %r2616}; 2026-02-21T09:18:15.0011877Z st.shared.v4.b32 [%r51], {%r2621, %r2625, %r2619, %r2623}; 2026-02-21T09:18:15.0011964Z st.shared.v4.b32 [%r52], {%r2629, %r2633, %r2627, %r2631}; 2026-02-21T09:18:15.0012051Z st.shared.v4.b32 [%r53], {%r2637, %r2641, %r2635, %r2639}; 2026-02-21T09:18:15.0012146Z st.shared.v4.b32 [%r54], {%r2645, %r2649, %r2643, %r2647}; 2026-02-21T09:18:15.0012232Z st.shared.v4.b32 [%r55], {%r2653, %r2657, %r2651, %r2655}; 2026-02-21T09:18:15.0012318Z st.shared.v4.b32 [%r56], {%r2661, %r2665, %r2659, %r2663}; 2026-02-21T09:18:15.0012373Z $L__tmp163: 2026-02-21T09:18:15.0012593Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0012652Z // begin inline asm 2026-02-21T09:18:15.0012994Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2406, %r2407, %r2408, %r2409, %r2410, %r2411, %r2412, %r2413, %r2414, %r2415, %r2416, %r2417, %r2418, %r2419, %r2420, %r2421}, [%r915 + 0], 64; 2026-02-21T09:18:15.0013057Z // end inline asm 2026-02-21T09:18:15.0013114Z // begin inline asm 2026-02-21T09:18:15.0013416Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2423, %r2424, %r2425, %r2426, %r2427, %r2428, %r2429, %r2430, %r2431, %r2432, %r2433, %r2434, %r2435, %r2436, %r2437, %r2438}, [%r915 + 16], 64; 2026-02-21T09:18:15.0013482Z // end inline asm 2026-02-21T09:18:15.0013552Z // begin inline asm 2026-02-21T09:18:15.0013865Z 
tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2440, %r2441, %r2442, %r2443, %r2444, %r2445, %r2446, %r2447, %r2448, %r2449, %r2450, %r2451, %r2452, %r2453, %r2454, %r2455}, [%r915 + 32], 64; 2026-02-21T09:18:15.0013930Z // end inline asm 2026-02-21T09:18:15.0013989Z // begin inline asm 2026-02-21T09:18:15.0014292Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2457, %r2458, %r2459, %r2460, %r2461, %r2462, %r2463, %r2464, %r2465, %r2466, %r2467, %r2468, %r2469, %r2470, %r2471, %r2472}, [%r915 + 48], 64; 2026-02-21T09:18:15.0014359Z // end inline asm 2026-02-21T09:18:15.0014420Z // begin inline asm 2026-02-21T09:18:15.0014493Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0014582Z // end inline asm 2026-02-21T09:18:15.0014649Z // begin inline asm 2026-02-21T09:18:15.0014973Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 0], 64, {%r2406, %r2407, %r2408, %r2409, %r2410, %r2411, %r2412, %r2413, %r2414, %r2415, %r2416, %r2417, %r2418, %r2419, %r2420, %r2421}; 2026-02-21T09:18:15.0015057Z // end inline asm 2026-02-21T09:18:15.0015151Z // begin inline asm 2026-02-21T09:18:15.0015521Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 16], 64, {%r2423, %r2424, %r2425, %r2426, %r2427, %r2428, %r2429, %r2430, %r2431, %r2432, %r2433, %r2434, %r2435, %r2436, %r2437, %r2438}; 2026-02-21T09:18:15.0015581Z // end inline asm 2026-02-21T09:18:15.0015649Z // begin inline asm 2026-02-21T09:18:15.0015968Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 32], 64, {%r2440, %r2441, %r2442, %r2443, %r2444, %r2445, %r2446, %r2447, %r2448, %r2449, %r2450, %r2451, %r2452, %r2453, %r2454, %r2455}; 2026-02-21T09:18:15.0016028Z // end inline asm 2026-02-21T09:18:15.0016101Z // begin inline asm 2026-02-21T09:18:15.0016417Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 48], 64, {%r2457, %r2458, %r2459, %r2460, %r2461, %r2462, %r2463, %r2464, %r2465, %r2466, %r2467, %r2468, %r2469, %r2470, %r2471, %r2472}; 2026-02-21T09:18:15.0016476Z // end inline asm 
2026-02-21T09:18:15.0016535Z // begin inline asm 2026-02-21T09:18:15.0016614Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0016671Z // end inline asm 2026-02-21T09:18:15.0016729Z bar.sync 0; 2026-02-21T09:18:15.0016797Z // begin inline asm 2026-02-21T09:18:15.0017112Z @%p113 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5556 + 0], 16, {%r2543, %r2544, %r2545, %r2546, %r2547, %r2548, %r2549, %r2550, %r2551, %r2552, %r2553, %r2554, %r2555, %r2556, %r2557, %r2558}; 2026-02-21T09:18:15.0017169Z // end inline asm 2026-02-21T09:18:15.0017233Z // begin inline asm 2026-02-21T09:18:15.0017302Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0017360Z // end inline asm 2026-02-21T09:18:15.0017416Z bar.sync 0; 2026-02-21T09:18:15.0017483Z // begin inline asm 2026-02-21T09:18:15.0017559Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0017615Z // end inline asm 2026-02-21T09:18:15.0017680Z // begin inline asm 2026-02-21T09:18:15.0017776Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:15.0017833Z // end inline asm 2026-02-21T09:18:15.0017898Z bar.sync 0; 2026-02-21T09:18:15.0017960Z @%p20 bra $L__BB0_19; 2026-02-21T09:18:15.0018062Z // %bb.18: // in Loop: Header=BB0_13 Depth=2 2026-02-21T09:18:15.0018131Z elect.sync %r2678|%p131, -1; 2026-02-21T09:18:15.0018201Z mov.b32 %r2668, 69208336; 2026-02-21T09:18:15.0018266Z mov.pred %p130, -1; 2026-02-21T09:18:15.0018324Z // begin inline asm 2026-02-21T09:18:15.0018497Z @%p131 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 0 ], %rd1, %r2668, %p130; 2026-02-21T09:18:15.0018554Z // end inline asm 2026-02-21T09:18:15.0018612Z // begin inline asm 2026-02-21T09:18:15.0018769Z @%p131 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 8 ], %rd2, %r2668, %p130; 2026-02-21T09:18:15.0018833Z // end inline asm 2026-02-21T09:18:15.0018891Z // begin inline asm 2026-02-21T09:18:15.0019050Z @%p131 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 16 ], %rd3, %r2668, %p130; 
2026-02-21T09:18:15.0019116Z // end inline asm 2026-02-21T09:18:15.0019175Z // begin inline asm 2026-02-21T09:18:15.0019332Z @%p131 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 24 ], %rd4, %r2668, %p130; 2026-02-21T09:18:15.0019397Z // end inline asm 2026-02-21T09:18:15.0019460Z cvt.u64.u32 %rd304, %r5985; 2026-02-21T09:18:15.0019518Z // begin inline asm 2026-02-21T09:18:15.0019658Z @%p131 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd304]; 2026-02-21T09:18:15.0019715Z // end inline asm 2026-02-21T09:18:15.0019819Z $L__BB0_19: // in Loop: Header=BB0_13 Depth=2 2026-02-21T09:18:15.0019912Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:15.0020006Z mov.b32 %r2682, 0; 2026-02-21T09:18:15.0020228Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0020327Z // begin inline asm 2026-02-21T09:18:15.0020387Z 2026-02-21T09:18:15.0020440Z { 2026-02-21T09:18:15.0020530Z .reg .pred complete; 2026-02-21T09:18:15.0020590Z waitLoop: 2026-02-21T09:18:15.0020743Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r2682; 2026-02-21T09:18:15.0020813Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0020867Z } 2026-02-21T09:18:15.0020871Z 2026-02-21T09:18:15.0020936Z // end inline asm 2026-02-21T09:18:15.0020992Z bar.sync 0; 2026-02-21T09:18:15.0021050Z // begin inline asm 2026-02-21T09:18:15.0021144Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:15.0021201Z // end inline asm 2026-02-21T09:18:15.0021258Z $L__tmp164: 2026-02-21T09:18:15.0021430Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0021505Z add.s64 %rd306, %rd261, 192; 2026-02-21T09:18:15.0021700Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0021768Z add.s64 %rd309, %rd264, 192; 2026-02-21T09:18:15.0021842Z // begin inline asm 2026-02-21T09:18:15.0021902Z mov.u64 %rd305, 0x0; 
2026-02-21T09:18:15.0022012Z createpolicy.fractional.L2::evict_first.b64 %rd305, 1.0; 2026-02-21T09:18:15.0022073Z // end inline asm 2026-02-21T09:18:15.0022128Z // begin inline asm 2026-02-21T09:18:15.0022185Z mov.u32 %r2684, 0x0; 2026-02-21T09:18:15.0022240Z mov.u32 %r2685, 0x0; 2026-02-21T09:18:15.0022303Z mov.u32 %r2686, 0x0; 2026-02-21T09:18:15.0022358Z mov.u32 %r2687, 0x0; 2026-02-21T09:18:15.0022535Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2684, %r2685, %r2686, %r2687 }, [ %rd306 + 0 ], %rd305; 2026-02-21T09:18:15.0022596Z // end inline asm 2026-02-21T09:18:15.0022651Z // begin inline asm 2026-02-21T09:18:15.0022708Z mov.u64 %rd308, 0x0; 2026-02-21T09:18:15.0022815Z createpolicy.fractional.L2::evict_first.b64 %rd308, 1.0; 2026-02-21T09:18:15.0022877Z // end inline asm 2026-02-21T09:18:15.0022935Z // begin inline asm 2026-02-21T09:18:15.0022990Z mov.u32 %r2688, 0x0; 2026-02-21T09:18:15.0023051Z mov.u32 %r2689, 0x0; 2026-02-21T09:18:15.0023107Z mov.u32 %r2690, 0x0; 2026-02-21T09:18:15.0023161Z mov.u32 %r2691, 0x0; 2026-02-21T09:18:15.0023338Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2688, %r2689, %r2690, %r2691 }, [ %rd309 + 0 ], %rd308; 2026-02-21T09:18:15.0023394Z // end inline asm 2026-02-21T09:18:15.0023448Z $L__tmp165: 2026-02-21T09:18:15.0023659Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0023762Z st.shared.v4.b32 [%r26], {%r2684, %r2685, %r2686, %r2687}; 2026-02-21T09:18:15.0023862Z st.shared.v4.b32 [%r26+512], {%r2688, %r2689, %r2690, %r2691}; 2026-02-21T09:18:15.0023919Z bar.sync 0; 2026-02-21T09:18:15.0024021Z ld.shared.v4.b32 {%r2851, %r2852, %r2853, %r2854}, [%r27]; 2026-02-21T09:18:15.0024087Z mov.b32 {%rs897, %rs898}, %r2854; 2026-02-21T09:18:15.0024155Z mov.b32 {%rs899, %rs900}, %r2853; 2026-02-21T09:18:15.0024225Z mov.b32 {%rs901, %rs902}, %r2852; 2026-02-21T09:18:15.0024284Z mov.b32 {%rs903, %rs904}, %r2851; 2026-02-21T09:18:15.0024378Z 
ld.shared.v4.b32 {%r2855, %r2856, %r2857, %r2858}, [%r28]; 2026-02-21T09:18:15.0024438Z mov.b32 {%rs905, %rs906}, %r2858; 2026-02-21T09:18:15.0024508Z mov.b32 {%rs907, %rs908}, %r2857; 2026-02-21T09:18:15.0024567Z mov.b32 {%rs909, %rs910}, %r2856; 2026-02-21T09:18:15.0024624Z mov.b32 {%rs911, %rs912}, %r2855; 2026-02-21T09:18:15.0024685Z $L__tmp166: 2026-02-21T09:18:15.0024850Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0024911Z cvt.f32.bf16 %r2833, %rs903; 2026-02-21T09:18:15.0024978Z cvt.f32.bf16 %r2834, %rs904; 2026-02-21T09:18:15.0025062Z cvt.f32.bf16 %r2835, %rs901; 2026-02-21T09:18:15.0025120Z cvt.f32.bf16 %r2836, %rs902; 2026-02-21T09:18:15.0025177Z cvt.f32.bf16 %r2837, %rs899; 2026-02-21T09:18:15.0025267Z cvt.f32.bf16 %r2838, %rs900; 2026-02-21T09:18:15.0025324Z cvt.f32.bf16 %r2839, %rs897; 2026-02-21T09:18:15.0025383Z cvt.f32.bf16 %r2840, %rs898; 2026-02-21T09:18:15.0025473Z cvt.f32.bf16 %r2841, %rs911; 2026-02-21T09:18:15.0025555Z cvt.f32.bf16 %r2842, %rs912; 2026-02-21T09:18:15.0025613Z cvt.f32.bf16 %r2843, %rs909; 2026-02-21T09:18:15.0025670Z cvt.f32.bf16 %r2844, %rs910; 2026-02-21T09:18:15.0025737Z cvt.f32.bf16 %r2845, %rs907; 2026-02-21T09:18:15.0025793Z cvt.f32.bf16 %r2846, %rs908; 2026-02-21T09:18:15.0025851Z cvt.f32.bf16 %r2847, %rs905; 2026-02-21T09:18:15.0025917Z cvt.f32.bf16 %r2848, %rs906; 2026-02-21T09:18:15.0026078Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:15.0026140Z add.s32 %r2859, %r7844, 393216; 2026-02-21T09:18:15.0026201Z cvt.s64.s32 %rd314, %r2859; 2026-02-21T09:18:15.0026270Z add.s64 %rd312, %rd56, %rd314; 2026-02-21T09:18:15.0026428Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0026487Z // begin inline asm 2026-02-21T09:18:15.0026551Z mov.u64 %rd311, 0x0; 2026-02-21T09:18:15.0026659Z createpolicy.fractional.L2::evict_first.b64 %rd311, 1.0; 2026-02-21T09:18:15.0026716Z // end 
inline asm 2026-02-21T09:18:15.0026782Z // begin inline asm 2026-02-21T09:18:15.0026839Z mov.u32 %r2692, 0x0; 2026-02-21T09:18:15.0026895Z mov.u32 %r2693, 0x0; 2026-02-21T09:18:15.0026950Z mov.u32 %r2694, 0x0; 2026-02-21T09:18:15.0027012Z mov.u32 %r2695, 0x0; 2026-02-21T09:18:15.0027182Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r2692, %r2693, %r2694, %r2695 }, [ %rd312 + 0 ], %rd311; 2026-02-21T09:18:15.0027237Z // end inline asm 2026-02-21T09:18:15.0027310Z prmt.b32 %r2860, %r2692, 0, 0x8880U; 2026-02-21T09:18:15.0027371Z cvt.u16.u32 %rs913, %r2860; 2026-02-21T09:18:15.0027436Z prmt.b32 %r2861, %r2692, 0, 0x7770U; 2026-02-21T09:18:15.0027502Z cvt.u16.u32 %rs914, %r2861; 2026-02-21T09:18:15.0027565Z prmt.b32 %r2862, %r2692, 0, 0x9991U; 2026-02-21T09:18:15.0027624Z cvt.u16.u32 %rs915, %r2862; 2026-02-21T09:18:15.0027684Z prmt.b32 %r2863, %r2692, 0, 0x7771U; 2026-02-21T09:18:15.0027750Z cvt.u16.u32 %rs916, %r2863; 2026-02-21T09:18:15.0027812Z prmt.b32 %r2864, %r2692, 0, 0xaaa2U; 2026-02-21T09:18:15.0027869Z cvt.u16.u32 %rs917, %r2864; 2026-02-21T09:18:15.0027936Z prmt.b32 %r2865, %r2692, 0, 0x7772U; 2026-02-21T09:18:15.0027994Z cvt.u16.u32 %rs918, %r2865; 2026-02-21T09:18:15.0028054Z prmt.b32 %r2866, %r2692, 0, 0xbbb3U; 2026-02-21T09:18:15.0028112Z cvt.u16.u32 %rs919, %r2866; 2026-02-21T09:18:15.0028179Z prmt.b32 %r2867, %r2692, 0, 0x7773U; 2026-02-21T09:18:15.0028237Z cvt.u16.u32 %rs920, %r2867; 2026-02-21T09:18:15.0028299Z prmt.b32 %r2868, %r2693, 0, 0x8880U; 2026-02-21T09:18:15.0028364Z cvt.u16.u32 %rs921, %r2868; 2026-02-21T09:18:15.0028423Z prmt.b32 %r2869, %r2693, 0, 0x7770U; 2026-02-21T09:18:15.0028482Z cvt.u16.u32 %rs922, %r2869; 2026-02-21T09:18:15.0028547Z prmt.b32 %r2870, %r2693, 0, 0x9991U; 2026-02-21T09:18:15.0028605Z cvt.u16.u32 %rs923, %r2870; 2026-02-21T09:18:15.0028664Z prmt.b32 %r2871, %r2693, 0, 0x7771U; 2026-02-21T09:18:15.0028722Z cvt.u16.u32 %rs924, %r2871; 2026-02-21T09:18:15.0028789Z prmt.b32 %r2872, %r2693, 0, 0xaaa2U; 
2026-02-21T09:18:15.0028845Z cvt.u16.u32 %rs925, %r2872; 2026-02-21T09:18:15.0028906Z prmt.b32 %r2873, %r2693, 0, 0x7772U; 2026-02-21T09:18:15.0028970Z cvt.u16.u32 %rs926, %r2873; 2026-02-21T09:18:15.0029029Z prmt.b32 %r2874, %r2693, 0, 0xbbb3U; 2026-02-21T09:18:15.0029086Z cvt.u16.u32 %rs927, %r2874; 2026-02-21T09:18:15.0029146Z prmt.b32 %r2875, %r2693, 0, 0x7773U; 2026-02-21T09:18:15.0029212Z cvt.u16.u32 %rs928, %r2875; 2026-02-21T09:18:15.0029271Z prmt.b32 %r2876, %r2694, 0, 0x8880U; 2026-02-21T09:18:15.0029354Z cvt.u16.u32 %rs929, %r2876; 2026-02-21T09:18:15.0029421Z prmt.b32 %r2877, %r2694, 0, 0x7770U; 2026-02-21T09:18:15.0029477Z cvt.u16.u32 %rs930, %r2877; 2026-02-21T09:18:15.0029537Z prmt.b32 %r2878, %r2694, 0, 0x9991U; 2026-02-21T09:18:15.0029615Z cvt.u16.u32 %rs931, %r2878; 2026-02-21T09:18:15.0029683Z prmt.b32 %r2879, %r2694, 0, 0x7771U; 2026-02-21T09:18:15.0029764Z cvt.u16.u32 %rs932, %r2879; 2026-02-21T09:18:15.0029841Z prmt.b32 %r2880, %r2694, 0, 0xaaa2U; 2026-02-21T09:18:15.0029907Z cvt.u16.u32 %rs933, %r2880; 2026-02-21T09:18:15.0029968Z prmt.b32 %r2881, %r2694, 0, 0x7772U; 2026-02-21T09:18:15.0030025Z cvt.u16.u32 %rs934, %r2881; 2026-02-21T09:18:15.0030093Z prmt.b32 %r2882, %r2694, 0, 0xbbb3U; 2026-02-21T09:18:15.0030151Z cvt.u16.u32 %rs935, %r2882; 2026-02-21T09:18:15.0030211Z prmt.b32 %r2883, %r2694, 0, 0x7773U; 2026-02-21T09:18:15.0030269Z cvt.u16.u32 %rs936, %r2883; 2026-02-21T09:18:15.0030336Z prmt.b32 %r2884, %r2695, 0, 0x8880U; 2026-02-21T09:18:15.0030394Z cvt.u16.u32 %rs937, %r2884; 2026-02-21T09:18:15.0030457Z prmt.b32 %r2885, %r2695, 0, 0x7770U; 2026-02-21T09:18:15.0030521Z cvt.u16.u32 %rs938, %r2885; 2026-02-21T09:18:15.0030582Z prmt.b32 %r2886, %r2695, 0, 0x9991U; 2026-02-21T09:18:15.0030642Z cvt.u16.u32 %rs939, %r2886; 2026-02-21T09:18:15.0030704Z prmt.b32 %r2887, %r2695, 0, 0x7771U; 2026-02-21T09:18:15.0030771Z cvt.u16.u32 %rs940, %r2887; 2026-02-21T09:18:15.0030832Z prmt.b32 %r2888, %r2695, 0, 0xaaa2U; 2026-02-21T09:18:15.0030892Z 
cvt.u16.u32 %rs941, %r2888; 2026-02-21T09:18:15.0030963Z prmt.b32 %r2889, %r2695, 0, 0x7772U; 2026-02-21T09:18:15.0031021Z cvt.u16.u32 %rs942, %r2889; 2026-02-21T09:18:15.0031083Z prmt.b32 %r2890, %r2695, 0, 0xbbb3U; 2026-02-21T09:18:15.0031151Z cvt.u16.u32 %rs943, %r2890; 2026-02-21T09:18:15.0031214Z prmt.b32 %r2891, %r2695, 0, 0x7773U; 2026-02-21T09:18:15.0031274Z cvt.u16.u32 %rs944, %r2891; 2026-02-21T09:18:15.0031430Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0031497Z shl.b16 %rs945, %rs914, 12; 2026-02-21T09:18:15.0031581Z shr.s16 %rs946, %rs945, 12; 2026-02-21T09:18:15.0031641Z shl.b16 %rs947, %rs916, 12; 2026-02-21T09:18:15.0031707Z shr.s16 %rs948, %rs947, 12; 2026-02-21T09:18:15.0031766Z shl.b16 %rs949, %rs918, 12; 2026-02-21T09:18:15.0031822Z shr.s16 %rs950, %rs949, 12; 2026-02-21T09:18:15.0031880Z shl.b16 %rs951, %rs920, 12; 2026-02-21T09:18:15.0031945Z shr.s16 %rs952, %rs951, 12; 2026-02-21T09:18:15.0032004Z shl.b16 %rs953, %rs922, 12; 2026-02-21T09:18:15.0032062Z shr.s16 %rs954, %rs953, 12; 2026-02-21T09:18:15.0032128Z shl.b16 %rs955, %rs924, 12; 2026-02-21T09:18:15.0032186Z shr.s16 %rs956, %rs955, 12; 2026-02-21T09:18:15.0032242Z shl.b16 %rs957, %rs926, 12; 2026-02-21T09:18:15.0032299Z shr.s16 %rs958, %rs957, 12; 2026-02-21T09:18:15.0032362Z shl.b16 %rs959, %rs928, 12; 2026-02-21T09:18:15.0032419Z shr.s16 %rs960, %rs959, 12; 2026-02-21T09:18:15.0032476Z shl.b16 %rs961, %rs930, 12; 2026-02-21T09:18:15.0032544Z shr.s16 %rs962, %rs961, 12; 2026-02-21T09:18:15.0032602Z shl.b16 %rs963, %rs932, 12; 2026-02-21T09:18:15.0032660Z shr.s16 %rs964, %rs963, 12; 2026-02-21T09:18:15.0032728Z shl.b16 %rs965, %rs934, 12; 2026-02-21T09:18:15.0032784Z shr.s16 %rs966, %rs965, 12; 2026-02-21T09:18:15.0032842Z shl.b16 %rs967, %rs936, 12; 2026-02-21T09:18:15.0032901Z shr.s16 %rs968, %rs967, 12; 2026-02-21T09:18:15.0032967Z shl.b16 %rs969, %rs938, 12; 2026-02-21T09:18:15.0033026Z shr.s16 %rs970, %rs969, 12; 
2026-02-21T09:18:15.0033084Z shl.b16 %rs971, %rs940, 12; 2026-02-21T09:18:15.0033150Z shr.s16 %rs972, %rs971, 12; 2026-02-21T09:18:15.0033208Z shl.b16 %rs973, %rs942, 12; 2026-02-21T09:18:15.0033265Z shr.s16 %rs974, %rs973, 12; 2026-02-21T09:18:15.0033323Z shl.b16 %rs975, %rs944, 12; 2026-02-21T09:18:15.0033387Z shr.s16 %rs976, %rs975, 12; 2026-02-21T09:18:15.0033543Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0033631Z shr.u16 %rs977, %rs913, 4; 2026-02-21T09:18:15.0033699Z shr.u16 %rs978, %rs915, 4; 2026-02-21T09:18:15.0033757Z shr.u16 %rs979, %rs917, 4; 2026-02-21T09:18:15.0033815Z shr.u16 %rs980, %rs919, 4; 2026-02-21T09:18:15.0033899Z shr.u16 %rs981, %rs921, 4; 2026-02-21T09:18:15.0033963Z shr.u16 %rs982, %rs923, 4; 2026-02-21T09:18:15.0034043Z shr.u16 %rs983, %rs925, 4; 2026-02-21T09:18:15.0034101Z shr.u16 %rs984, %rs927, 4; 2026-02-21T09:18:15.0034188Z shr.u16 %rs985, %rs929, 4; 2026-02-21T09:18:15.0034245Z shr.u16 %rs986, %rs931, 4; 2026-02-21T09:18:15.0034302Z shr.u16 %rs987, %rs933, 4; 2026-02-21T09:18:15.0034365Z shr.u16 %rs988, %rs935, 4; 2026-02-21T09:18:15.0034422Z shr.u16 %rs989, %rs937, 4; 2026-02-21T09:18:15.0034478Z shr.u16 %rs990, %rs939, 4; 2026-02-21T09:18:15.0034535Z shr.u16 %rs991, %rs941, 4; 2026-02-21T09:18:15.0034600Z shr.u16 %rs992, %rs943, 4; 2026-02-21T09:18:15.0034758Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0034815Z bar.sync 0; 2026-02-21T09:18:15.0034882Z st.shared.b8 [%r29], %rs946; 2026-02-21T09:18:15.0034943Z st.shared.b8 [%r30], %rs948; 2026-02-21T09:18:15.0035004Z st.shared.b8 [%r31], %rs950; 2026-02-21T09:18:15.0035063Z st.shared.b8 [%r32], %rs952; 2026-02-21T09:18:15.0035132Z st.shared.b8 [%r33+512], %rs954; 2026-02-21T09:18:15.0035195Z st.shared.b8 [%r34+512], %rs956; 2026-02-21T09:18:15.0035256Z st.shared.b8 [%r35+512], %rs958; 2026-02-21T09:18:15.0035325Z st.shared.b8 [%r36+512], %rs960; 
2026-02-21T09:18:15.0035388Z st.shared.b8 [%r37+1024], %rs962; 2026-02-21T09:18:15.0035449Z st.shared.b8 [%r38+1024], %rs964; 2026-02-21T09:18:15.0035509Z st.shared.b8 [%r39+1024], %rs966; 2026-02-21T09:18:15.0035576Z st.shared.b8 [%r40+1024], %rs968; 2026-02-21T09:18:15.0035634Z st.shared.b8 [%r41+1536], %rs970; 2026-02-21T09:18:15.0035693Z st.shared.b8 [%r42+1536], %rs972; 2026-02-21T09:18:15.0035761Z st.shared.b8 [%r43+1536], %rs974; 2026-02-21T09:18:15.0035822Z st.shared.b8 [%r44+1536], %rs976; 2026-02-21T09:18:15.0035876Z bar.sync 0; 2026-02-21T09:18:15.0035946Z ld.shared.b32 %r2892, [%r45]; 2026-02-21T09:18:15.0036008Z ld.shared.b32 %r2893, [%r46]; 2026-02-21T09:18:15.0036070Z ld.shared.b32 %r2894, [%r47]; 2026-02-21T09:18:15.0036130Z ld.shared.b32 %r2895, [%r48]; 2026-02-21T09:18:15.0036301Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0036356Z bar.sync 0; 2026-02-21T09:18:15.0036416Z st.shared.b8 [%r29], %rs977; 2026-02-21T09:18:15.0036480Z st.shared.b8 [%r30], %rs978; 2026-02-21T09:18:15.0036539Z st.shared.b8 [%r31], %rs979; 2026-02-21T09:18:15.0036598Z st.shared.b8 [%r32], %rs980; 2026-02-21T09:18:15.0036660Z st.shared.b8 [%r33+512], %rs981; 2026-02-21T09:18:15.0036728Z st.shared.b8 [%r34+512], %rs982; 2026-02-21T09:18:15.0036788Z st.shared.b8 [%r35+512], %rs983; 2026-02-21T09:18:15.0036848Z st.shared.b8 [%r36+512], %rs984; 2026-02-21T09:18:15.0036917Z st.shared.b8 [%r37+1024], %rs985; 2026-02-21T09:18:15.0036977Z st.shared.b8 [%r38+1024], %rs986; 2026-02-21T09:18:15.0037037Z st.shared.b8 [%r39+1024], %rs987; 2026-02-21T09:18:15.0037104Z st.shared.b8 [%r40+1024], %rs988; 2026-02-21T09:18:15.0037163Z st.shared.b8 [%r41+1536], %rs989; 2026-02-21T09:18:15.0037224Z st.shared.b8 [%r42+1536], %rs990; 2026-02-21T09:18:15.0037284Z st.shared.b8 [%r43+1536], %rs991; 2026-02-21T09:18:15.0037352Z st.shared.b8 [%r44+1536], %rs992; 2026-02-21T09:18:15.0037407Z bar.sync 0; 2026-02-21T09:18:15.0037469Z ld.shared.b32 
%r2896, [%r45]; 2026-02-21T09:18:15.0037539Z ld.shared.b32 %r2897, [%r46]; 2026-02-21T09:18:15.0037601Z ld.shared.b32 %r2898, [%r47]; 2026-02-21T09:18:15.0037663Z ld.shared.b32 %r2899, [%r48]; 2026-02-21T09:18:15.0037825Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0037898Z cvt.s8.s32 %rs993, %r2893; 2026-02-21T09:18:15.0037998Z cvt.rn.f32.s16 %r2900, %rs993; 2026-02-21T09:18:15.0038059Z cvt.s8.s32 %rs994, %r2892; 2026-02-21T09:18:15.0038129Z cvt.rn.f32.s16 %r2901, %rs994; 2026-02-21T09:18:15.0038188Z cvt.s8.s32 %rs995, %r2897; 2026-02-21T09:18:15.0038270Z cvt.rn.f32.s16 %r2902, %rs995; 2026-02-21T09:18:15.0038327Z cvt.s8.s32 %rs996, %r2896; 2026-02-21T09:18:15.0038414Z cvt.rn.f32.s16 %r2903, %rs996; 2026-02-21T09:18:15.0038473Z cvt.s8.s32 %rs997, %r2895; 2026-02-21T09:18:15.0038553Z cvt.rn.f32.s16 %r2904, %rs997; 2026-02-21T09:18:15.0038622Z cvt.s8.s32 %rs998, %r2894; 2026-02-21T09:18:15.0038681Z cvt.rn.f32.s16 %r2905, %rs998; 2026-02-21T09:18:15.0038738Z cvt.s8.s32 %rs999, %r2899; 2026-02-21T09:18:15.0038806Z cvt.rn.f32.s16 %r2906, %rs999; 2026-02-21T09:18:15.0038866Z cvt.s8.s32 %rs1000, %r2898; 2026-02-21T09:18:15.0038929Z cvt.rn.f32.s16 %r2907, %rs1000; 2026-02-21T09:18:15.0038994Z prmt.b32 %r2908, %r2893, 0, 0x9991U; 2026-02-21T09:18:15.0039061Z cvt.u16.u32 %rs1001, %r2908; 2026-02-21T09:18:15.0039124Z cvt.rn.f32.s16 %r2909, %rs1001; 2026-02-21T09:18:15.0039188Z prmt.b32 %r2910, %r2892, 0, 0x9991U; 2026-02-21T09:18:15.0039254Z cvt.u16.u32 %rs1002, %r2910; 2026-02-21T09:18:15.0039315Z cvt.rn.f32.s16 %r2911, %rs1002; 2026-02-21T09:18:15.0039379Z prmt.b32 %r2912, %r2897, 0, 0x9991U; 2026-02-21T09:18:15.0039437Z cvt.u16.u32 %rs1003, %r2912; 2026-02-21T09:18:15.0039506Z cvt.rn.f32.s16 %r2913, %rs1003; 2026-02-21T09:18:15.0039568Z prmt.b32 %r2914, %r2896, 0, 0x9991U; 2026-02-21T09:18:15.0039629Z cvt.u16.u32 %rs1004, %r2914; 2026-02-21T09:18:15.0039697Z cvt.rn.f32.s16 %r2915, %rs1004; 2026-02-21T09:18:15.0039757Z 
prmt.b32 %r2916, %r2895, 0, 0x9991U; 2026-02-21T09:18:15.0039817Z cvt.u16.u32 %rs1005, %r2916; 2026-02-21T09:18:15.0039876Z cvt.rn.f32.s16 %r2917, %rs1005; 2026-02-21T09:18:15.0039945Z prmt.b32 %r2918, %r2894, 0, 0x9991U; 2026-02-21T09:18:15.0040004Z cvt.u16.u32 %rs1006, %r2918; 2026-02-21T09:18:15.0040063Z cvt.rn.f32.s16 %r2919, %rs1006; 2026-02-21T09:18:15.0040132Z prmt.b32 %r2920, %r2899, 0, 0x9991U; 2026-02-21T09:18:15.0040191Z cvt.u16.u32 %rs1007, %r2920; 2026-02-21T09:18:15.0040251Z cvt.rn.f32.s16 %r2921, %rs1007; 2026-02-21T09:18:15.0040318Z prmt.b32 %r2922, %r2898, 0, 0x9991U; 2026-02-21T09:18:15.0040377Z cvt.u16.u32 %rs1008, %r2922; 2026-02-21T09:18:15.0040436Z cvt.rn.f32.s16 %r2923, %rs1008; 2026-02-21T09:18:15.0040498Z prmt.b32 %r2924, %r2893, 0, 0xaaa2U; 2026-02-21T09:18:15.0040564Z cvt.u16.u32 %rs1009, %r2924; 2026-02-21T09:18:15.0040624Z cvt.rn.f32.s16 %r2925, %rs1009; 2026-02-21T09:18:15.0040685Z prmt.b32 %r2926, %r2892, 0, 0xaaa2U; 2026-02-21T09:18:15.0040751Z cvt.u16.u32 %rs1010, %r2926; 2026-02-21T09:18:15.0040809Z cvt.rn.f32.s16 %r2927, %rs1010; 2026-02-21T09:18:15.0040869Z prmt.b32 %r2928, %r2897, 0, 0xaaa2U; 2026-02-21T09:18:15.0040927Z cvt.u16.u32 %rs1011, %r2928; 2026-02-21T09:18:15.0040993Z cvt.rn.f32.s16 %r2929, %rs1011; 2026-02-21T09:18:15.0041054Z prmt.b32 %r2930, %r2896, 0, 0xaaa2U; 2026-02-21T09:18:15.0041113Z cvt.u16.u32 %rs1012, %r2930; 2026-02-21T09:18:15.0041179Z cvt.rn.f32.s16 %r2931, %rs1012; 2026-02-21T09:18:15.0041239Z prmt.b32 %r2932, %r2895, 0, 0xaaa2U; 2026-02-21T09:18:15.0041297Z cvt.u16.u32 %rs1013, %r2932; 2026-02-21T09:18:15.0041365Z cvt.rn.f32.s16 %r2933, %rs1013; 2026-02-21T09:18:15.0041425Z prmt.b32 %r2934, %r2894, 0, 0xaaa2U; 2026-02-21T09:18:15.0041485Z cvt.u16.u32 %rs1014, %r2934; 2026-02-21T09:18:15.0041595Z cvt.rn.f32.s16 %r2935, %rs1014; 2026-02-21T09:18:15.0041665Z prmt.b32 %r2936, %r2899, 0, 0xaaa2U; 2026-02-21T09:18:15.0041726Z cvt.u16.u32 %rs1015, %r2936; 2026-02-21T09:18:15.0041786Z cvt.rn.f32.s16 %r2937, 
%rs1015; 2026-02-21T09:18:15.0041853Z prmt.b32 %r2938, %r2898, 0, 0xaaa2U; 2026-02-21T09:18:15.0041911Z cvt.u16.u32 %rs1016, %r2938; 2026-02-21T09:18:15.0041970Z cvt.rn.f32.s16 %r2939, %rs1016; 2026-02-21T09:18:15.0042030Z prmt.b32 %r2940, %r2893, 0, 0xbbb3U; 2026-02-21T09:18:15.0042098Z cvt.u16.u32 %rs1017, %r2940; 2026-02-21T09:18:15.0042158Z cvt.rn.f32.s16 %r2941, %rs1017; 2026-02-21T09:18:15.0042254Z prmt.b32 %r2942, %r2892, 0, 0xbbb3U; 2026-02-21T09:18:15.0042321Z cvt.u16.u32 %rs1018, %r2942; 2026-02-21T09:18:15.0042382Z cvt.rn.f32.s16 %r2943, %rs1018; 2026-02-21T09:18:15.0042476Z prmt.b32 %r2944, %r2897, 0, 0xbbb3U; 2026-02-21T09:18:15.0042535Z cvt.u16.u32 %rs1019, %r2944; 2026-02-21T09:18:15.0042626Z cvt.rn.f32.s16 %r2945, %rs1019; 2026-02-21T09:18:15.0042719Z prmt.b32 %r2946, %r2896, 0, 0xbbb3U; 2026-02-21T09:18:15.0042779Z cvt.u16.u32 %rs1020, %r2946; 2026-02-21T09:18:15.0042848Z cvt.rn.f32.s16 %r2947, %rs1020; 2026-02-21T09:18:15.0042909Z prmt.b32 %r2948, %r2895, 0, 0xbbb3U; 2026-02-21T09:18:15.0042968Z cvt.u16.u32 %rs1021, %r2948; 2026-02-21T09:18:15.0043036Z cvt.rn.f32.s16 %r2949, %rs1021; 2026-02-21T09:18:15.0043096Z prmt.b32 %r2950, %r2894, 0, 0xbbb3U; 2026-02-21T09:18:15.0043154Z cvt.u16.u32 %rs1022, %r2950; 2026-02-21T09:18:15.0043212Z cvt.rn.f32.s16 %r2951, %rs1022; 2026-02-21T09:18:15.0043281Z prmt.b32 %r2952, %r2899, 0, 0xbbb3U; 2026-02-21T09:18:15.0043342Z cvt.u16.u32 %rs1023, %r2952; 2026-02-21T09:18:15.0043402Z cvt.rn.f32.s16 %r2953, %rs1023; 2026-02-21T09:18:15.0043470Z prmt.b32 %r2954, %r2898, 0, 0xbbb3U; 2026-02-21T09:18:15.0043530Z cvt.u16.u32 %rs1024, %r2954; 2026-02-21T09:18:15.0043590Z cvt.rn.f32.s16 %r2955, %rs1024; 2026-02-21T09:18:15.0043647Z bar.sync 0; 2026-02-21T09:18:15.0043753Z st.shared.v4.b32 [%r49], {%r2901, %r2903, %r2900, %r2902}; 2026-02-21T09:18:15.0043851Z st.shared.v4.b32 [%r50], {%r2905, %r2907, %r2904, %r2906}; 2026-02-21T09:18:15.0043944Z st.shared.v4.b32 [%r51], {%r2911, %r2915, %r2909, %r2913}; 
2026-02-21T09:18:15.0044041Z st.shared.v4.b32 [%r52], {%r2919, %r2923, %r2917, %r2921}; 2026-02-21T09:18:15.0044130Z st.shared.v4.b32 [%r53], {%r2927, %r2931, %r2925, %r2929}; 2026-02-21T09:18:15.0044217Z st.shared.v4.b32 [%r54], {%r2935, %r2939, %r2933, %r2937}; 2026-02-21T09:18:15.0044317Z st.shared.v4.b32 [%r55], {%r2943, %r2947, %r2941, %r2945}; 2026-02-21T09:18:15.0044407Z st.shared.v4.b32 [%r56], {%r2951, %r2955, %r2949, %r2953}; 2026-02-21T09:18:15.0044463Z $L__tmp167: 2026-02-21T09:18:15.0044679Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0044753Z // begin inline asm 2026-02-21T09:18:15.0045065Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2696, %r2697, %r2698, %r2699, %r2700, %r2701, %r2702, %r2703, %r2704, %r2705, %r2706, %r2707, %r2708, %r2709, %r2710, %r2711}, [%r1205 + 0], 64; 2026-02-21T09:18:15.0045122Z // end inline asm 2026-02-21T09:18:15.0045187Z // begin inline asm 2026-02-21T09:18:15.0045495Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2713, %r2714, %r2715, %r2716, %r2717, %r2718, %r2719, %r2720, %r2721, %r2722, %r2723, %r2724, %r2725, %r2726, %r2727, %r2728}, [%r1205 + 16], 64; 2026-02-21T09:18:15.0045550Z // end inline asm 2026-02-21T09:18:15.0045614Z // begin inline asm 2026-02-21T09:18:15.0045911Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2730, %r2731, %r2732, %r2733, %r2734, %r2735, %r2736, %r2737, %r2738, %r2739, %r2740, %r2741, %r2742, %r2743, %r2744, %r2745}, [%r1205 + 32], 64; 2026-02-21T09:18:15.0045968Z // end inline asm 2026-02-21T09:18:15.0046033Z // begin inline asm 2026-02-21T09:18:15.0046330Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2747, %r2748, %r2749, %r2750, %r2751, %r2752, %r2753, %r2754, %r2755, %r2756, %r2757, %r2758, %r2759, %r2760, %r2761, %r2762}, [%r1205 + 48], 64; 2026-02-21T09:18:15.0046386Z // end inline asm 2026-02-21T09:18:15.0046448Z // begin inline asm 2026-02-21T09:18:15.0046518Z tcgen05.wait::ld.sync.aligned; 
2026-02-21T09:18:15.0046573Z // end inline asm 2026-02-21T09:18:15.0046635Z mov.pred %p390, -1; 2026-02-21T09:18:15.0046699Z // begin inline asm 2026-02-21T09:18:15.0047010Z @%p390 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 0], 64, {%r2696, %r2697, %r2698, %r2699, %r2700, %r2701, %r2702, %r2703, %r2704, %r2705, %r2706, %r2707, %r2708, %r2709, %r2710, %r2711}; 2026-02-21T09:18:15.0047092Z // end inline asm 2026-02-21T09:18:15.0047155Z // begin inline asm 2026-02-21T09:18:15.0047463Z @%p390 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 16], 64, {%r2713, %r2714, %r2715, %r2716, %r2717, %r2718, %r2719, %r2720, %r2721, %r2722, %r2723, %r2724, %r2725, %r2726, %r2727, %r2728}; 2026-02-21T09:18:15.0047544Z // end inline asm 2026-02-21T09:18:15.0047641Z // begin inline asm 2026-02-21T09:18:15.0047970Z @%p390 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 32], 64, {%r2730, %r2731, %r2732, %r2733, %r2734, %r2735, %r2736, %r2737, %r2738, %r2739, %r2740, %r2741, %r2742, %r2743, %r2744, %r2745}; 2026-02-21T09:18:15.0048027Z // end inline asm 2026-02-21T09:18:15.0048090Z // begin inline asm 2026-02-21T09:18:15.0048390Z @%p390 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 48], 64, {%r2747, %r2748, %r2749, %r2750, %r2751, %r2752, %r2753, %r2754, %r2755, %r2756, %r2757, %r2758, %r2759, %r2760, %r2761, %r2762}; 2026-02-21T09:18:15.0048446Z // end inline asm 2026-02-21T09:18:15.0048503Z // begin inline asm 2026-02-21T09:18:15.0048577Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0048632Z // end inline asm 2026-02-21T09:18:15.0048687Z bar.sync 0; 2026-02-21T09:18:15.0048752Z // begin inline asm 2026-02-21T09:18:15.0049050Z @%p390 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5846 + 0], 16, {%r2833, %r2834, %r2835, %r2836, %r2837, %r2838, %r2839, %r2840, %r2841, %r2842, %r2843, %r2844, %r2845, %r2846, %r2847, %r2848}; 2026-02-21T09:18:15.0049107Z // end inline asm 2026-02-21T09:18:15.0049169Z // begin inline asm 2026-02-21T09:18:15.0049235Z 
tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0049289Z // end inline asm 2026-02-21T09:18:15.0049343Z bar.sync 0; 2026-02-21T09:18:15.0049406Z // begin inline asm 2026-02-21T09:18:15.0049476Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0049529Z // end inline asm 2026-02-21T09:18:15.0049591Z // begin inline asm 2026-02-21T09:18:15.0049679Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:15.0049735Z // end inline asm 2026-02-21T09:18:15.0049787Z bar.sync 0; 2026-02-21T09:18:15.0049852Z @%p20 bra $L__BB0_21; 2026-02-21T09:18:15.0049950Z // %bb.20: // in Loop: Header=BB0_13 Depth=2 2026-02-21T09:18:15.0050017Z elect.sync %r2968|%p148, -1; 2026-02-21T09:18:15.0050082Z mov.b32 %r2958, 69208336; 2026-02-21T09:18:15.0050142Z mov.pred %p147, -1; 2026-02-21T09:18:15.0050197Z // begin inline asm 2026-02-21T09:18:15.0050356Z @%p148 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 0 ], %rd1, %r2958, %p147; 2026-02-21T09:18:15.0050411Z // end inline asm 2026-02-21T09:18:15.0050467Z // begin inline asm 2026-02-21T09:18:15.0050616Z @%p148 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 8 ], %rd2, %r2958, %p147; 2026-02-21T09:18:15.0050676Z // end inline asm 2026-02-21T09:18:15.0050731Z // begin inline asm 2026-02-21T09:18:15.0050880Z @%p148 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 16 ], %rd3, %r2958, %p147; 2026-02-21T09:18:15.0050942Z // end inline asm 2026-02-21T09:18:15.0050997Z // begin inline asm 2026-02-21T09:18:15.0051142Z @%p148 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 24 ], %rd4, %r2958, %p147; 2026-02-21T09:18:15.0051204Z // end inline asm 2026-02-21T09:18:15.0051265Z cvt.u64.u32 %rd319, %r5985; 2026-02-21T09:18:15.0051320Z // begin inline asm 2026-02-21T09:18:15.0051456Z @%p148 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd319]; 2026-02-21T09:18:15.0051510Z // end inline asm 2026-02-21T09:18:15.0051591Z bra.uni $L__BB0_21; 2026-02-21T09:18:15.0051646Z 
$L__tmp168: 2026-02-21T09:18:15.0051754Z $L__BB0_22: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:18:15.0051920Z .loc 1 31 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:31:32 2026-02-21T09:18:15.0051980Z or.b32 %r3183, %r117, %r9; 2026-02-21T09:18:15.0052046Z or.b32 %r3184, %r117, %r10; 2026-02-21T09:18:15.0052105Z or.b32 %r3185, %r117, %r11; 2026-02-21T09:18:15.0052197Z or.b32 %r3186, %r117, %r12; 2026-02-21T09:18:15.0052254Z or.b32 %r3187, %r117, %r13; 2026-02-21T09:18:15.0052317Z or.b32 %r3188, %r117, %r14; 2026-02-21T09:18:15.0052400Z or.b32 %r3189, %r117, %r15; 2026-02-21T09:18:15.0052456Z or.b32 %r3190, %r117, %r16; 2026-02-21T09:18:15.0052653Z .loc 1 33 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:33:32 2026-02-21T09:18:15.0052746Z or.b32 %r3191, %r118, %r20; 2026-02-21T09:18:15.0052906Z .loc 1 88 43 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:43 2026-02-21T09:18:15.0052971Z shl.b32 %r3192, %r3183, 13; 2026-02-21T09:18:15.0053028Z shl.b32 %r3193, %r3184, 13; 2026-02-21T09:18:15.0053086Z shl.b32 %r3194, %r3185, 13; 2026-02-21T09:18:15.0053141Z shl.b32 %r3195, %r3186, 13; 2026-02-21T09:18:15.0053206Z shl.b32 %r3196, %r3187, 13; 2026-02-21T09:18:15.0053263Z shl.b32 %r3197, %r3188, 13; 2026-02-21T09:18:15.0053319Z shl.b32 %r3198, %r3189, 13; 2026-02-21T09:18:15.0053385Z shl.b32 %r3199, %r3190, 13; 2026-02-21T09:18:15.0053544Z .loc 1 88 50 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:50 2026-02-21T09:18:15.0053608Z add.s32 %r3200, %r3192, %r3191; 2026-02-21T09:18:15.0053679Z add.s32 %r3201, %r3193, %r3191; 2026-02-21T09:18:15.0053741Z add.s32 %r3202, %r3194, %r3191; 2026-02-21T09:18:15.0053800Z add.s32 %r3203, %r3195, %r3191; 2026-02-21T09:18:15.0053861Z add.s32 %r3204, %r3196, %r3191; 2026-02-21T09:18:15.0053928Z add.s32 %r3205, %r3197, %r3191; 2026-02-21T09:18:15.0053988Z add.s32 %r3206, %r3198, %r3191; 2026-02-21T09:18:15.0054047Z add.s32 %r3207, %r3199, %r3191; 2026-02-21T09:18:15.0054213Z 
.loc 1 88 22 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:22 2026-02-21T09:18:15.0054284Z mad.wide.s32 %rd320, %r3200, 2, %rd57; 2026-02-21T09:18:15.0054354Z mad.wide.s32 %rd321, %r3201, 2, %rd57; 2026-02-21T09:18:15.0054418Z mad.wide.s32 %rd322, %r3202, 2, %rd57; 2026-02-21T09:18:15.0054491Z mad.wide.s32 %rd323, %r3203, 2, %rd57; 2026-02-21T09:18:15.0054555Z mad.wide.s32 %rd324, %r3204, 2, %rd57; 2026-02-21T09:18:15.0054617Z mad.wide.s32 %rd325, %r3205, 2, %rd57; 2026-02-21T09:18:15.0054689Z mad.wide.s32 %rd326, %r3206, 2, %rd57; 2026-02-21T09:18:15.0054751Z mad.wide.s32 %rd327, %r3207, 2, %rd57; 2026-02-21T09:18:15.0054805Z $L__tmp169: 2026-02-21T09:18:15.0055029Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0055088Z // begin inline asm 2026-02-21T09:18:15.0055401Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2975, %r2976, %r2977, %r2978, %r2979, %r2980, %r2981, %r2982, %r2983, %r2984, %r2985, %r2986, %r2987, %r2988, %r2989, %r2990}, [%r4673 + 0], 64; 2026-02-21T09:18:15.0055464Z // end inline asm 2026-02-21T09:18:15.0055521Z // begin inline asm 2026-02-21T09:18:15.0055825Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r2992, %r2993, %r2994, %r2995, %r2996, %r2997, %r2998, %r2999, %r3000, %r3001, %r3002, %r3003, %r3004, %r3005, %r3006, %r3007}, [%r4673 + 16], 64; 2026-02-21T09:18:15.0055882Z // end inline asm 2026-02-21T09:18:15.0055947Z // begin inline asm 2026-02-21T09:18:15.0056248Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3009, %r3010, %r3011, %r3012, %r3013, %r3014, %r3015, %r3016, %r3017, %r3018, %r3019, %r3020, %r3021, %r3022, %r3023, %r3024}, [%r4673 + 32], 64; 2026-02-21T09:18:15.0056305Z // end inline asm 2026-02-21T09:18:15.0056370Z // begin inline asm 2026-02-21T09:18:15.0056670Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3026, %r3027, %r3028, %r3029, %r3030, %r3031, %r3032, %r3033, %r3034, %r3035, %r3036, %r3037, %r3038, %r3039, %r3040, %r3041}, [%r4673 + 
48], 64; 2026-02-21T09:18:15.0056725Z // end inline asm 2026-02-21T09:18:15.0056813Z // begin inline asm 2026-02-21T09:18:15.0056882Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0056936Z // end inline asm 2026-02-21T09:18:15.0056998Z cvt.u64.u32 %rd329, %r2975; 2026-02-21T09:18:15.0057086Z cvt.u64.u32 %rd330, %r2976; 2026-02-21T09:18:15.0057146Z shl.b64 %rd331, %rd330, 32; 2026-02-21T09:18:15.0057208Z or.b64 %rd332, %rd329, %rd331; 2026-02-21T09:18:15.0057291Z $L__tmp170: 2026-02-21T09:18:15.0057473Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0057565Z mov.b64 {%r3208, %r3209}, %rd332; 2026-02-21T09:18:15.0057671Z cvt.rn.bf16x2.f32 %r3210, %r3209, %r3208; 2026-02-21T09:18:15.0057726Z $L__tmp171: 2026-02-21T09:18:15.0057948Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0058010Z cvt.u64.u32 %rd333, %r2977; 2026-02-21T09:18:15.0058078Z cvt.u64.u32 %rd334, %r2978; 2026-02-21T09:18:15.0058140Z shl.b64 %rd335, %rd334, 32; 2026-02-21T09:18:15.0058203Z or.b64 %rd336, %rd333, %rd335; 2026-02-21T09:18:15.0058265Z $L__tmp172: 2026-02-21T09:18:15.0058430Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0058497Z mov.b64 {%r3211, %r3212}, %rd336; 2026-02-21T09:18:15.0058579Z cvt.rn.bf16x2.f32 %r3213, %r3212, %r3211; 2026-02-21T09:18:15.0058635Z $L__tmp173: 2026-02-21T09:18:15.0058851Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0058914Z cvt.u64.u32 %rd337, %r2979; 2026-02-21T09:18:15.0058981Z cvt.u64.u32 %rd338, %r2980; 2026-02-21T09:18:15.0059040Z shl.b64 %rd339, %rd338, 32; 2026-02-21T09:18:15.0059103Z or.b64 %rd340, %rd337, %rd339; 2026-02-21T09:18:15.0059164Z $L__tmp174: 2026-02-21T09:18:15.0059328Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0059390Z mov.b64 
{%r3214, %r3215}, %rd340; 2026-02-21T09:18:15.0059472Z cvt.rn.bf16x2.f32 %r3216, %r3215, %r3214; 2026-02-21T09:18:15.0059526Z $L__tmp175: 2026-02-21T09:18:15.0059739Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0059798Z cvt.u64.u32 %rd341, %r2981; 2026-02-21T09:18:15.0059869Z cvt.u64.u32 %rd342, %r2982; 2026-02-21T09:18:15.0059929Z shl.b64 %rd343, %rd342, 32; 2026-02-21T09:18:15.0059993Z or.b64 %rd344, %rd341, %rd343; 2026-02-21T09:18:15.0060054Z $L__tmp176: 2026-02-21T09:18:15.0060224Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0060287Z mov.b64 {%r3217, %r3218}, %rd344; 2026-02-21T09:18:15.0060357Z cvt.rn.bf16x2.f32 %r3219, %r3218, %r3217; 2026-02-21T09:18:15.0060418Z $L__tmp177: 2026-02-21T09:18:15.0060629Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0060690Z cvt.u64.u32 %rd345, %r2983; 2026-02-21T09:18:15.0060759Z cvt.u64.u32 %rd346, %r2984; 2026-02-21T09:18:15.0060821Z shl.b64 %rd347, %rd346, 32; 2026-02-21T09:18:15.0060883Z or.b64 %rd348, %rd345, %rd347; 2026-02-21T09:18:15.0060944Z $L__tmp178: 2026-02-21T09:18:15.0061109Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0061175Z mov.b64 {%r3220, %r3221}, %rd348; 2026-02-21T09:18:15.0061247Z cvt.rn.bf16x2.f32 %r3222, %r3221, %r3220; 2026-02-21T09:18:15.0061308Z $L__tmp179: 2026-02-21T09:18:15.0061520Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0061607Z cvt.u64.u32 %rd349, %r2985; 2026-02-21T09:18:15.0061677Z cvt.u64.u32 %rd350, %r2986; 2026-02-21T09:18:15.0061739Z shl.b64 %rd351, %rd350, 32; 2026-02-21T09:18:15.0061801Z or.b64 %rd352, %rd349, %rd351; 2026-02-21T09:18:15.0061862Z $L__tmp180: 2026-02-21T09:18:15.0062027Z .loc 1 87 28 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0062118Z mov.b64 {%r3223, %r3224}, %rd352; 2026-02-21T09:18:15.0062190Z cvt.rn.bf16x2.f32 %r3225, %r3224, %r3223; 2026-02-21T09:18:15.0062253Z $L__tmp181: 2026-02-21T09:18:15.0062492Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0062579Z cvt.u64.u32 %rd353, %r2987; 2026-02-21T09:18:15.0062648Z cvt.u64.u32 %rd354, %r2988; 2026-02-21T09:18:15.0062733Z shl.b64 %rd355, %rd354, 32; 2026-02-21T09:18:15.0062797Z or.b64 %rd356, %rd353, %rd355; 2026-02-21T09:18:15.0062859Z $L__tmp182: 2026-02-21T09:18:15.0063029Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0063092Z mov.b64 {%r3226, %r3227}, %rd356; 2026-02-21T09:18:15.0063164Z cvt.rn.bf16x2.f32 %r3228, %r3227, %r3226; 2026-02-21T09:18:15.0063230Z $L__tmp183: 2026-02-21T09:18:15.0063445Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0063509Z cvt.u64.u32 %rd357, %r2989; 2026-02-21T09:18:15.0063582Z cvt.u64.u32 %rd358, %r2990; 2026-02-21T09:18:15.0063644Z shl.b64 %rd359, %rd358, 32; 2026-02-21T09:18:15.0063707Z or.b64 %rd360, %rd357, %rd359; 2026-02-21T09:18:15.0063763Z $L__tmp184: 2026-02-21T09:18:15.0063939Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0064003Z mov.b64 {%r3229, %r3230}, %rd360; 2026-02-21T09:18:15.0064076Z cvt.rn.bf16x2.f32 %r3231, %r3230, %r3229; 2026-02-21T09:18:15.0064138Z $L__tmp185: 2026-02-21T09:18:15.0064350Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0064413Z cvt.u64.u32 %rd361, %r2992; 2026-02-21T09:18:15.0064480Z cvt.u64.u32 %rd362, %r2993; 2026-02-21T09:18:15.0064541Z shl.b64 %rd363, %rd362, 32; 2026-02-21T09:18:15.0064602Z or.b64 %rd364, %rd361, %rd363; 
2026-02-21T09:18:15.0064657Z $L__tmp186: 2026-02-21T09:18:15.0064829Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0064894Z mov.b64 {%r3232, %r3233}, %rd364; 2026-02-21T09:18:15.0064973Z cvt.rn.bf16x2.f32 %r3234, %r3233, %r3232; 2026-02-21T09:18:15.0065034Z $L__tmp187: 2026-02-21T09:18:15.0065238Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0065297Z cvt.u64.u32 %rd365, %r2994; 2026-02-21T09:18:15.0065360Z cvt.u64.u32 %rd366, %r2995; 2026-02-21T09:18:15.0065417Z shl.b64 %rd367, %rd366, 32; 2026-02-21T09:18:15.0065476Z or.b64 %rd368, %rd365, %rd367; 2026-02-21T09:18:15.0065527Z $L__tmp188: 2026-02-21T09:18:15.0065694Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0065754Z mov.b64 {%r3235, %r3236}, %rd368; 2026-02-21T09:18:15.0065822Z cvt.rn.bf16x2.f32 %r3237, %r3236, %r3235; 2026-02-21T09:18:15.0065883Z $L__tmp189: 2026-02-21T09:18:15.0066085Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0066144Z cvt.u64.u32 %rd369, %r2996; 2026-02-21T09:18:15.0066207Z cvt.u64.u32 %rd370, %r2997; 2026-02-21T09:18:15.0066267Z shl.b64 %rd371, %rd370, 32; 2026-02-21T09:18:15.0066328Z or.b64 %rd372, %rd369, %rd371; 2026-02-21T09:18:15.0066379Z $L__tmp190: 2026-02-21T09:18:15.0066546Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0066605Z mov.b64 {%r3238, %r3239}, %rd372; 2026-02-21T09:18:15.0066672Z cvt.rn.bf16x2.f32 %r3240, %r3239, %r3238; 2026-02-21T09:18:15.0066732Z $L__tmp191: 2026-02-21T09:18:15.0066933Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0067012Z cvt.u64.u32 %rd373, %r2998; 2026-02-21T09:18:15.0067076Z cvt.u64.u32 %rd374, %r2999; 2026-02-21T09:18:15.0067133Z shl.b64 %rd375, %rd374, 32; 
2026-02-21T09:18:15.0067193Z or.b64 %rd376, %rd373, %rd375; 2026-02-21T09:18:15.0067268Z $L__tmp192: 2026-02-21T09:18:15.0067457Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0067518Z mov.b64 {%r3241, %r3242}, %rd376; 2026-02-21T09:18:15.0067605Z cvt.rn.bf16x2.f32 %r3243, %r3242, %r3241; 2026-02-21T09:18:15.0067664Z $L__tmp193: 2026-02-21T09:18:15.0067864Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0067921Z cvt.u64.u32 %rd377, %r3000; 2026-02-21T09:18:15.0067985Z cvt.u64.u32 %rd378, %r3001; 2026-02-21T09:18:15.0068043Z shl.b64 %rd379, %rd378, 32; 2026-02-21T09:18:15.0068103Z or.b64 %rd380, %rd377, %rd379; 2026-02-21T09:18:15.0068154Z $L__tmp194: 2026-02-21T09:18:15.0068321Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0068380Z mov.b64 {%r3244, %r3245}, %rd380; 2026-02-21T09:18:15.0068450Z cvt.rn.bf16x2.f32 %r3246, %r3245, %r3244; 2026-02-21T09:18:15.0068508Z $L__tmp195: 2026-02-21T09:18:15.0068709Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0068767Z cvt.u64.u32 %rd381, %r3002; 2026-02-21T09:18:15.0068824Z cvt.u64.u32 %rd382, %r3003; 2026-02-21T09:18:15.0068887Z shl.b64 %rd383, %rd382, 32; 2026-02-21T09:18:15.0068947Z or.b64 %rd384, %rd381, %rd383; 2026-02-21T09:18:15.0068998Z $L__tmp196: 2026-02-21T09:18:15.0069161Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0069220Z mov.b64 {%r3247, %r3248}, %rd384; 2026-02-21T09:18:15.0069287Z cvt.rn.bf16x2.f32 %r3249, %r3248, %r3247; 2026-02-21T09:18:15.0069347Z $L__tmp197: 2026-02-21T09:18:15.0069546Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0069603Z cvt.u64.u32 %rd385, %r3004; 2026-02-21T09:18:15.0069663Z cvt.u64.u32 %rd386, 
%r3005; 2026-02-21T09:18:15.0069726Z shl.b64 %rd387, %rd386, 32; 2026-02-21T09:18:15.0069788Z or.b64 %rd388, %rd385, %rd387; 2026-02-21T09:18:15.0069840Z $L__tmp198: 2026-02-21T09:18:15.0070004Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0070064Z mov.b64 {%r3250, %r3251}, %rd388; 2026-02-21T09:18:15.0070131Z cvt.rn.bf16x2.f32 %r3252, %r3251, %r3250; 2026-02-21T09:18:15.0070191Z $L__tmp199: 2026-02-21T09:18:15.0070390Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0070448Z cvt.u64.u32 %rd389, %r3006; 2026-02-21T09:18:15.0070505Z cvt.u64.u32 %rd390, %r3007; 2026-02-21T09:18:15.0070571Z shl.b64 %rd391, %rd390, 32; 2026-02-21T09:18:15.0070630Z or.b64 %rd392, %rd389, %rd391; 2026-02-21T09:18:15.0070683Z $L__tmp200: 2026-02-21T09:18:15.0070848Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0070910Z mov.b64 {%r3253, %r3254}, %rd392; 2026-02-21T09:18:15.0070980Z cvt.rn.bf16x2.f32 %r3255, %r3254, %r3253; 2026-02-21T09:18:15.0071040Z $L__tmp201: 2026-02-21T09:18:15.0071238Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0071296Z cvt.u64.u32 %rd393, %r3009; 2026-02-21T09:18:15.0071353Z cvt.u64.u32 %rd394, %r3010; 2026-02-21T09:18:15.0071420Z shl.b64 %rd395, %rd394, 32; 2026-02-21T09:18:15.0071480Z or.b64 %rd396, %rd393, %rd395; 2026-02-21T09:18:15.0071562Z $L__tmp202: 2026-02-21T09:18:15.0071734Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0071827Z mov.b64 {%r3256, %r3257}, %rd396; 2026-02-21T09:18:15.0071900Z cvt.rn.bf16x2.f32 %r3258, %r3257, %r3256; 2026-02-21T09:18:15.0071954Z $L__tmp203: 2026-02-21T09:18:15.0072195Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0072293Z cvt.u64.u32 
%rd397, %r3011; 2026-02-21T09:18:15.0072375Z cvt.u64.u32 %rd398, %r3012; 2026-02-21T09:18:15.0072442Z shl.b64 %rd399, %rd398, 32; 2026-02-21T09:18:15.0072503Z or.b64 %rd400, %rd397, %rd399; 2026-02-21T09:18:15.0072557Z $L__tmp204: 2026-02-21T09:18:15.0072723Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0072783Z mov.b64 {%r3259, %r3260}, %rd400; 2026-02-21T09:18:15.0072852Z cvt.rn.bf16x2.f32 %r3261, %r3260, %r3259; 2026-02-21T09:18:15.0072906Z $L__tmp205: 2026-02-21T09:18:15.0073117Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0073177Z cvt.u64.u32 %rd401, %r3013; 2026-02-21T09:18:15.0073235Z cvt.u64.u32 %rd402, %r3014; 2026-02-21T09:18:15.0073303Z shl.b64 %rd403, %rd402, 32; 2026-02-21T09:18:15.0073362Z or.b64 %rd404, %rd401, %rd403; 2026-02-21T09:18:15.0073414Z $L__tmp206: 2026-02-21T09:18:15.0073582Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0073644Z mov.b64 {%r3262, %r3263}, %rd404; 2026-02-21T09:18:15.0073714Z cvt.rn.bf16x2.f32 %r3264, %r3263, %r3262; 2026-02-21T09:18:15.0073767Z $L__tmp207: 2026-02-21T09:18:15.0073974Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0074033Z cvt.u64.u32 %rd405, %r3015; 2026-02-21T09:18:15.0074090Z cvt.u64.u32 %rd406, %r3016; 2026-02-21T09:18:15.0074154Z shl.b64 %rd407, %rd406, 32; 2026-02-21T09:18:15.0074215Z or.b64 %rd408, %rd405, %rd407; 2026-02-21T09:18:15.0074267Z $L__tmp208: 2026-02-21T09:18:15.0074435Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0074496Z mov.b64 {%r3265, %r3266}, %rd408; 2026-02-21T09:18:15.0074564Z cvt.rn.bf16x2.f32 %r3267, %r3266, %r3265; 2026-02-21T09:18:15.0074618Z $L__tmp209: 2026-02-21T09:18:15.0074831Z .loc 2 291 36 // standard.py:291:36 @[ 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0074889Z cvt.u64.u32 %rd409, %r3017; 2026-02-21T09:18:15.0074946Z cvt.u64.u32 %rd410, %r3018; 2026-02-21T09:18:15.0075013Z shl.b64 %rd411, %rd410, 32; 2026-02-21T09:18:15.0075070Z or.b64 %rd412, %rd409, %rd411; 2026-02-21T09:18:15.0075122Z $L__tmp210: 2026-02-21T09:18:15.0075292Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0075352Z mov.b64 {%r3268, %r3269}, %rd412; 2026-02-21T09:18:15.0075422Z cvt.rn.bf16x2.f32 %r3270, %r3269, %r3268; 2026-02-21T09:18:15.0075473Z $L__tmp211: 2026-02-21T09:18:15.0075685Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0075745Z cvt.u64.u32 %rd413, %r3019; 2026-02-21T09:18:15.0075803Z cvt.u64.u32 %rd414, %r3020; 2026-02-21T09:18:15.0075868Z shl.b64 %rd415, %rd414, 32; 2026-02-21T09:18:15.0075928Z or.b64 %rd416, %rd413, %rd415; 2026-02-21T09:18:15.0075980Z $L__tmp212: 2026-02-21T09:18:15.0076138Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0076204Z mov.b64 {%r3271, %r3272}, %rd416; 2026-02-21T09:18:15.0076272Z cvt.rn.bf16x2.f32 %r3273, %r3272, %r3271; 2026-02-21T09:18:15.0076323Z $L__tmp213: 2026-02-21T09:18:15.0076531Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0076611Z cvt.u64.u32 %rd417, %r3021; 2026-02-21T09:18:15.0076669Z cvt.u64.u32 %rd418, %r3022; 2026-02-21T09:18:15.0076731Z shl.b64 %rd419, %rd418, 32; 2026-02-21T09:18:15.0076812Z or.b64 %rd420, %rd417, %rd419; 2026-02-21T09:18:15.0076864Z $L__tmp214: 2026-02-21T09:18:15.0077046Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0077133Z mov.b64 {%r3274, %r3275}, %rd420; 2026-02-21T09:18:15.0077201Z cvt.rn.bf16x2.f32 %r3276, %r3275, %r3274; 2026-02-21T09:18:15.0077253Z $L__tmp215: 
2026-02-21T09:18:15.0077460Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0077517Z cvt.u64.u32 %rd421, %r3023; 2026-02-21T09:18:15.0077575Z cvt.u64.u32 %rd422, %r3024; 2026-02-21T09:18:15.0077638Z shl.b64 %rd423, %rd422, 32; 2026-02-21T09:18:15.0077696Z or.b64 %rd424, %rd421, %rd423; 2026-02-21T09:18:15.0077749Z $L__tmp216: 2026-02-21T09:18:15.0077909Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0077978Z mov.b64 {%r3277, %r3278}, %rd424; 2026-02-21T09:18:15.0078048Z cvt.rn.bf16x2.f32 %r3279, %r3278, %r3277; 2026-02-21T09:18:15.0078100Z $L__tmp217: 2026-02-21T09:18:15.0078309Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0078369Z cvt.u64.u32 %rd425, %r3026; 2026-02-21T09:18:15.0078426Z cvt.u64.u32 %rd426, %r3027; 2026-02-21T09:18:15.0078492Z shl.b64 %rd427, %rd426, 32; 2026-02-21T09:18:15.0078551Z or.b64 %rd428, %rd425, %rd427; 2026-02-21T09:18:15.0078603Z $L__tmp218: 2026-02-21T09:18:15.0078759Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0078826Z mov.b64 {%r3280, %r3281}, %rd428; 2026-02-21T09:18:15.0078892Z cvt.rn.bf16x2.f32 %r3282, %r3281, %r3280; 2026-02-21T09:18:15.0078945Z $L__tmp219: 2026-02-21T09:18:15.0079156Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0079216Z cvt.u64.u32 %rd429, %r3028; 2026-02-21T09:18:15.0079272Z cvt.u64.u32 %rd430, %r3029; 2026-02-21T09:18:15.0079337Z shl.b64 %rd431, %rd430, 32; 2026-02-21T09:18:15.0079398Z or.b64 %rd432, %rd429, %rd431; 2026-02-21T09:18:15.0079451Z $L__tmp220: 2026-02-21T09:18:15.0079612Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0079681Z mov.b64 {%r3283, %r3284}, %rd432; 2026-02-21T09:18:15.0079748Z cvt.rn.bf16x2.f32 
%r3285, %r3284, %r3283; 2026-02-21T09:18:15.0079801Z $L__tmp221: 2026-02-21T09:18:15.0080007Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0080065Z cvt.u64.u32 %rd433, %r3030; 2026-02-21T09:18:15.0080124Z cvt.u64.u32 %rd434, %r3031; 2026-02-21T09:18:15.0080181Z shl.b64 %rd435, %rd434, 32; 2026-02-21T09:18:15.0080249Z or.b64 %rd436, %rd433, %rd435; 2026-02-21T09:18:15.0080304Z $L__tmp222: 2026-02-21T09:18:15.0080463Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0080536Z mov.b64 {%r3286, %r3287}, %rd436; 2026-02-21T09:18:15.0080610Z cvt.rn.bf16x2.f32 %r3288, %r3287, %r3286; 2026-02-21T09:18:15.0080664Z $L__tmp223: 2026-02-21T09:18:15.0080872Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0080930Z cvt.u64.u32 %rd437, %r3032; 2026-02-21T09:18:15.0080987Z cvt.u64.u32 %rd438, %r3033; 2026-02-21T09:18:15.0081044Z shl.b64 %rd439, %rd438, 32; 2026-02-21T09:18:15.0081111Z or.b64 %rd440, %rd437, %rd439; 2026-02-21T09:18:15.0081163Z $L__tmp224: 2026-02-21T09:18:15.0081321Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0081410Z mov.b64 {%r3289, %r3290}, %rd440; 2026-02-21T09:18:15.0081477Z cvt.rn.bf16x2.f32 %r3291, %r3290, %r3289; 2026-02-21T09:18:15.0081573Z $L__tmp225: 2026-02-21T09:18:15.0081808Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0081868Z cvt.u64.u32 %rd441, %r3034; 2026-02-21T09:18:15.0081951Z cvt.u64.u32 %rd442, %r3035; 2026-02-21T09:18:15.0082009Z shl.b64 %rd443, %rd442, 32; 2026-02-21T09:18:15.0082078Z or.b64 %rd444, %rd441, %rd443; 2026-02-21T09:18:15.0082129Z $L__tmp226: 2026-02-21T09:18:15.0082288Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0082355Z mov.b64 {%r3292, 
%r3293}, %rd444; 2026-02-21T09:18:15.0082423Z cvt.rn.bf16x2.f32 %r3294, %r3293, %r3292; 2026-02-21T09:18:15.0082476Z $L__tmp227: 2026-02-21T09:18:15.0082685Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0082745Z cvt.u64.u32 %rd445, %r3036; 2026-02-21T09:18:15.0082803Z cvt.u64.u32 %rd446, %r3037; 2026-02-21T09:18:15.0082863Z shl.b64 %rd447, %rd446, 32; 2026-02-21T09:18:15.0082929Z or.b64 %rd448, %rd445, %rd447; 2026-02-21T09:18:15.0082983Z $L__tmp228: 2026-02-21T09:18:15.0083146Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0083213Z mov.b64 {%r3295, %r3296}, %rd448; 2026-02-21T09:18:15.0083281Z cvt.rn.bf16x2.f32 %r3297, %r3296, %r3295; 2026-02-21T09:18:15.0083333Z $L__tmp229: 2026-02-21T09:18:15.0083543Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0083602Z cvt.u64.u32 %rd449, %r3038; 2026-02-21T09:18:15.0083659Z cvt.u64.u32 %rd450, %r3039; 2026-02-21T09:18:15.0083718Z shl.b64 %rd451, %rd450, 32; 2026-02-21T09:18:15.0083788Z or.b64 %rd452, %rd449, %rd451; 2026-02-21T09:18:15.0083841Z $L__tmp230: 2026-02-21T09:18:15.0083998Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0084066Z mov.b64 {%r3298, %r3299}, %rd452; 2026-02-21T09:18:15.0084134Z cvt.rn.bf16x2.f32 %r3300, %r3299, %r3298; 2026-02-21T09:18:15.0084186Z $L__tmp231: 2026-02-21T09:18:15.0084391Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0084456Z cvt.u64.u32 %rd453, %r3040; 2026-02-21T09:18:15.0084514Z cvt.u64.u32 %rd454, %r3041; 2026-02-21T09:18:15.0084572Z shl.b64 %rd455, %rd454, 32; 2026-02-21T09:18:15.0084639Z or.b64 %rd456, %rd453, %rd455; 2026-02-21T09:18:15.0084690Z $L__tmp232: 2026-02-21T09:18:15.0084847Z .loc 1 87 28 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0084913Z mov.b64 {%r3301, %r3302}, %rd456; 2026-02-21T09:18:15.0084983Z cvt.rn.bf16x2.f32 %r3303, %r3302, %r3301; 2026-02-21T09:18:15.0085138Z .loc 1 88 81 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:81 2026-02-21T09:18:15.0085194Z bar.sync 0; 2026-02-21T09:18:15.0085300Z st.shared.v4.b32 [%r27], {%r3210, %r3222, %r3234, %r3246}; 2026-02-21T09:18:15.0085397Z st.shared.v4.b32 [%r28], {%r3258, %r3270, %r3282, %r3294}; 2026-02-21T09:18:15.0085452Z bar.sync 0; 2026-02-21T09:18:15.0085517Z // begin inline asm 2026-02-21T09:18:15.0085677Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3083, %r3087, %r3091, %r3095}, [%r1540]; 2026-02-21T09:18:15.0085735Z // end inline asm 2026-02-21T09:18:15.0085798Z // begin inline asm 2026-02-21T09:18:15.0085952Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3099, %r3103, %r3107, %r3111}, [%r1545]; 2026-02-21T09:18:15.0086006Z // end inline asm 2026-02-21T09:18:15.0086060Z bar.sync 0; 2026-02-21T09:18:15.0086160Z st.shared.v4.b32 [%r27], {%r3213, %r3225, %r3237, %r3249}; 2026-02-21T09:18:15.0086276Z st.shared.v4.b32 [%r28], {%r3261, %r3273, %r3285, %r3297}; 2026-02-21T09:18:15.0086330Z bar.sync 0; 2026-02-21T09:18:15.0086394Z // begin inline asm 2026-02-21T09:18:15.0086569Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3084, %r3088, %r3092, %r3096}, [%r1540]; 2026-02-21T09:18:15.0086645Z // end inline asm 2026-02-21T09:18:15.0086702Z // begin inline asm 2026-02-21T09:18:15.0086877Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3100, %r3104, %r3108, %r3112}, [%r1545]; 2026-02-21T09:18:15.0086933Z // end inline asm 2026-02-21T09:18:15.0086987Z bar.sync 0; 2026-02-21T09:18:15.0087086Z st.shared.v4.b32 [%r27], {%r3216, %r3228, %r3240, %r3252}; 2026-02-21T09:18:15.0087174Z st.shared.v4.b32 [%r28], {%r3264, %r3276, %r3288, %r3300}; 2026-02-21T09:18:15.0087228Z bar.sync 0; 2026-02-21T09:18:15.0087291Z // begin inline asm 2026-02-21T09:18:15.0087436Z 
ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3085, %r3089, %r3093, %r3097}, [%r1540]; 2026-02-21T09:18:15.0087493Z // end inline asm 2026-02-21T09:18:15.0087549Z // begin inline asm 2026-02-21T09:18:15.0087701Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3101, %r3105, %r3109, %r3113}, [%r1545]; 2026-02-21T09:18:15.0087757Z // end inline asm 2026-02-21T09:18:15.0087812Z bar.sync 0; 2026-02-21T09:18:15.0087912Z st.shared.v4.b32 [%r27], {%r3219, %r3231, %r3243, %r3255}; 2026-02-21T09:18:15.0088003Z st.shared.v4.b32 [%r28], {%r3267, %r3279, %r3291, %r3303}; 2026-02-21T09:18:15.0088057Z bar.sync 0; 2026-02-21T09:18:15.0088119Z // begin inline asm 2026-02-21T09:18:15.0088263Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3086, %r3090, %r3094, %r3098}, [%r1540]; 2026-02-21T09:18:15.0088317Z // end inline asm 2026-02-21T09:18:15.0088375Z // begin inline asm 2026-02-21T09:18:15.0088527Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r3102, %r3106, %r3110, %r3114}, [%r1545]; 2026-02-21T09:18:15.0088582Z // end inline asm 2026-02-21T09:18:15.0088638Z // begin inline asm 2026-02-21T09:18:15.0088761Z st.global.v4.b32 [ %rd320 + 0 ], { %r3083, %r3084, %r3085, %r3086 }; 2026-02-21T09:18:15.0088815Z // end inline asm 2026-02-21T09:18:15.0088872Z // begin inline asm 2026-02-21T09:18:15.0088981Z st.global.v4.b32 [ %rd321 + 0 ], { %r3087, %r3088, %r3089, %r3090 }; 2026-02-21T09:18:15.0089045Z // end inline asm 2026-02-21T09:18:15.0089105Z // begin inline asm 2026-02-21T09:18:15.0089210Z st.global.v4.b32 [ %rd322 + 0 ], { %r3091, %r3092, %r3093, %r3094 }; 2026-02-21T09:18:15.0089273Z // end inline asm 2026-02-21T09:18:15.0089328Z // begin inline asm 2026-02-21T09:18:15.0089428Z st.global.v4.b32 [ %rd323 + 0 ], { %r3095, %r3096, %r3097, %r3098 }; 2026-02-21T09:18:15.0089490Z // end inline asm 2026-02-21T09:18:15.0089546Z // begin inline asm 2026-02-21T09:18:15.0089645Z st.global.v4.b32 [ %rd324 + 0 ], { %r3099, %r3100, %r3101, %r3102 }; 2026-02-21T09:18:15.0089698Z // end inline asm 
2026-02-21T09:18:15.0089761Z // begin inline asm 2026-02-21T09:18:15.0089860Z st.global.v4.b32 [ %rd325 + 0 ], { %r3103, %r3104, %r3105, %r3106 }; 2026-02-21T09:18:15.0089917Z // end inline asm 2026-02-21T09:18:15.0089979Z // begin inline asm 2026-02-21T09:18:15.0090077Z st.global.v4.b32 [ %rd326 + 0 ], { %r3107, %r3108, %r3109, %r3110 }; 2026-02-21T09:18:15.0090133Z // end inline asm 2026-02-21T09:18:15.0090188Z // begin inline asm 2026-02-21T09:18:15.0090295Z st.global.v4.b32 [ %rd327 + 0 ], { %r3111, %r3112, %r3113, %r3114 }; 2026-02-21T09:18:15.0090352Z // end inline asm 2026-02-21T09:18:15.0090525Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:15.0090595Z add.s32 %r3304, %r7842, 4736; 2026-02-21T09:18:15.0090752Z .loc 1 25 35 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:25:35 2026-02-21T09:18:15.0090811Z shr.s32 %r3305, %r3304, 31; 2026-02-21T09:18:15.0090876Z shr.u32 %r3306, %r3305, 22; 2026-02-21T09:18:15.0090939Z add.s32 %r3307, %r3304, %r3306; 2026-02-21T09:18:15.0091024Z shr.s32 %r3308, %r3307, 10; 2026-02-21T09:18:15.0091188Z .loc 1 26 33 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:26:33 2026-02-21T09:18:15.0091254Z shl.b32 %r3309, %r3308, 4; 2026-02-21T09:18:15.0091452Z .loc 1 27 39 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:27:39 2026-02-21T09:18:15.0091532Z sub.s32 %r3310, 64, %r3309; 2026-02-21T09:18:15.0091753Z .loc 1 27 52 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:27:52 2026-02-21T09:18:15.0091811Z min.s32 %r3311, %r3310, 16; 2026-02-21T09:18:15.0091970Z .loc 1 28 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:45 2026-02-21T09:18:15.0092041Z and.b32 %r3312, %r3307, -1024; 2026-02-21T09:18:15.0092103Z sub.s32 %r3313, %r3304, %r3312; 2026-02-21T09:18:15.0092262Z .loc 1 29 51 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:29:51 2026-02-21T09:18:15.0092329Z div.s32 %r3314, %r3313, %r3311; 
2026-02-21T09:18:15.0092487Z .loc 1 28 64 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:64 2026-02-21T09:18:15.0092553Z mul.lo.s32 %r3315, %r3314, %r3311; 2026-02-21T09:18:15.0092613Z sub.s32 %r3316, %r3313, %r3315; 2026-02-21T09:18:15.0092781Z .loc 1 28 30 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:30 2026-02-21T09:18:15.0092841Z add.s32 %r3317, %r3316, %r3309; 2026-02-21T09:18:15.0092998Z .loc 1 30 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:30:27 2026-02-21T09:18:15.0093065Z shl.b32 %r122, %r3317, 6; 2026-02-21T09:18:15.0093221Z .loc 1 32 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:32:27 2026-02-21T09:18:15.0093282Z shl.b32 %r123, %r3314, 7; 2026-02-21T09:18:15.0093350Z mov.pred %p159, -1; 2026-02-21T09:18:15.0093406Z mov.b32 %r3116, 0; 2026-02-21T09:18:15.0093459Z $L__tmp233: 2026-02-21T09:18:15.0093665Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0093728Z // begin inline asm 2026-02-21T09:18:15.0094048Z @%p159 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 0], 64, {%r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116}; 2026-02-21T09:18:15.0094103Z // end inline asm 2026-02-21T09:18:15.0094167Z // begin inline asm 2026-02-21T09:18:15.0094473Z @%p159 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 16], 64, {%r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116}; 2026-02-21T09:18:15.0094528Z // end inline asm 2026-02-21T09:18:15.0094591Z // begin inline asm 2026-02-21T09:18:15.0094892Z @%p159 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 32], 64, {%r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116}; 2026-02-21T09:18:15.0094948Z // end inline asm 2026-02-21T09:18:15.0095011Z // begin 
inline asm 2026-02-21T09:18:15.0095318Z @%p159 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 48], 64, {%r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116, %r3116}; 2026-02-21T09:18:15.0095375Z // end inline asm 2026-02-21T09:18:15.0095438Z // begin inline asm 2026-02-21T09:18:15.0095510Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0095564Z // end inline asm 2026-02-21T09:18:15.0095618Z bar.sync 0; 2026-02-21T09:18:15.0095678Z $L__tmp234: 2026-02-21T09:18:15.0095843Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:15.0095905Z add.s32 %r7845, %r61, %r123; 2026-02-21T09:18:15.0095972Z or.b32 %r3318, %r7841, %r122; 2026-02-21T09:18:15.0096029Z shl.b32 %r3319, %r3318, 10; 2026-02-21T09:18:15.0096091Z mul.wide.s32 %rd28, %r3319, 2; 2026-02-21T09:18:15.0096173Z shl.b32 %r3320, %r3317, 16; 2026-02-21T09:18:15.0096241Z or.b32 %r3321, %r63, %r3320; 2026-02-21T09:18:15.0096303Z mul.wide.s32 %rd29, %r3321, 2; 2026-02-21T09:18:15.0097443Z [682s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:18:15.0098439Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], maxnreg=32, num_sm_multiplier=16, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:18:15.0098567Z Error: PTXASError: PTXAS error: Internal Triton PTX codegen error 2026-02-21T09:18:15.0098631Z `ptxas` stderr: 2026-02-21T09:18:15.0098976Z ptxas fatal : (C7602) Insufficient registers (32) to compile instruction at line 251 in function _helion_matmul_bf16_int4. 
Try to compile with register target of 34 or higher. 2026-02-21T09:18:15.0099070Z ptxas fatal : Ptx assembly aborted due to errors 2026-02-21T09:18:15.0099076Z 2026-02-21T09:18:15.0099467Z Repro command: /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmpxtuvssla.ptx -o /tmp/tmpxtuvssla.ptx.o 2026-02-21T09:18:15.0099472Z 2026-02-21T09:18:15.0099596Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:18:15.0099657Z mov.pred %p391, 0; 2026-02-21T09:18:15.0099725Z mov.b64 %rd1053, -64; 2026-02-21T09:18:15.0099785Z mov.b64 %rd1052, %rd5; 2026-02-21T09:18:15.0099844Z bra.uni $L__BB0_23; 2026-02-21T09:18:15.0099947Z $L__BB0_31: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:18:15.0100010Z $L__tmp235: 2026-02-21T09:18:15.0100224Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0100286Z // begin inline asm 2026-02-21T09:18:15.0100349Z 2026-02-21T09:18:15.0100400Z { 2026-02-21T09:18:15.0100465Z .reg .pred complete; 2026-02-21T09:18:15.0100522Z waitLoop: 2026-02-21T09:18:15.0100655Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r4189; 2026-02-21T09:18:15.0100722Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0100773Z } 2026-02-21T09:18:15.0100778Z 2026-02-21T09:18:15.0100841Z // end inline asm 2026-02-21T09:18:15.0100895Z bar.sync 0; 2026-02-21T09:18:15.0100965Z // begin inline asm 2026-02-21T09:18:15.0101065Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:15.0101123Z // end inline asm 2026-02-21T09:18:15.0101177Z $L__tmp236: 2026-02-21T09:18:15.0101356Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:15.0101430Z add.s64 %rd1053, %rd1053, 64; 2026-02-21T09:18:15.0101496Z add.s32 %r7845, %r7845, 524288; 2026-02-21T09:18:15.0101595Z add.s64 %rd1052, %rd1052, 256; 2026-02-21T09:18:15.0101675Z setp.lt.u64 %p233, 
%rd1053, 448; 2026-02-21T09:18:15.0101739Z @%p233 bra $L__BB0_23; 2026-02-21T09:18:15.0101800Z bra.uni $L__BB0_32; 2026-02-21T09:18:15.0101906Z $L__BB0_23: // Parent Loop BB0_2 Depth=1 2026-02-21T09:18:15.0102015Z // => This Inner Loop Header: Depth=2 2026-02-21T09:18:15.0102187Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0102253Z add.s64 %rd458, %rd1052, %rd29; 2026-02-21T09:18:15.0102431Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0102495Z add.s64 %rd461, %rd1052, %rd28; 2026-02-21T09:18:15.0102556Z // begin inline asm 2026-02-21T09:18:15.0102624Z mov.u64 %rd457, 0x0; 2026-02-21T09:18:15.0102742Z createpolicy.fractional.L2::evict_first.b64 %rd457, 1.0; 2026-02-21T09:18:15.0102829Z // end inline asm 2026-02-21T09:18:15.0102890Z // begin inline asm 2026-02-21T09:18:15.0102958Z mov.u32 %r3322, 0x0; 2026-02-21T09:18:15.0103042Z mov.u32 %r3323, 0x0; 2026-02-21T09:18:15.0103099Z mov.u32 %r3324, 0x0; 2026-02-21T09:18:15.0103164Z mov.u32 %r3325, 0x0; 2026-02-21T09:18:15.0103378Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3322, %r3323, %r3324, %r3325 }, [ %rd458 + 0 ], %rd457; 2026-02-21T09:18:15.0103458Z // end inline asm 2026-02-21T09:18:15.0103526Z // begin inline asm 2026-02-21T09:18:15.0103585Z mov.u64 %rd460, 0x0; 2026-02-21T09:18:15.0103701Z createpolicy.fractional.L2::evict_first.b64 %rd460, 1.0; 2026-02-21T09:18:15.0103760Z // end inline asm 2026-02-21T09:18:15.0103826Z // begin inline asm 2026-02-21T09:18:15.0103883Z mov.u32 %r3326, 0x0; 2026-02-21T09:18:15.0103939Z mov.u32 %r3327, 0x0; 2026-02-21T09:18:15.0104003Z mov.u32 %r3328, 0x0; 2026-02-21T09:18:15.0104058Z mov.u32 %r3329, 0x0; 2026-02-21T09:18:15.0104240Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3326, %r3327, %r3328, %r3329 }, [ %rd461 + 0 ], %rd460; 2026-02-21T09:18:15.0104304Z // end inline asm 2026-02-21T09:18:15.0104359Z $L__tmp237: 2026-02-21T09:18:15.0104576Z .loc 2 291 
36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0104636Z bar.sync 0; 2026-02-21T09:18:15.0104744Z st.shared.v4.b32 [%r26], {%r3322, %r3323, %r3324, %r3325}; 2026-02-21T09:18:15.0104851Z st.shared.v4.b32 [%r26+512], {%r3326, %r3327, %r3328, %r3329}; 2026-02-21T09:18:15.0104909Z bar.sync 0; 2026-02-21T09:18:15.0105012Z ld.shared.v4.b32 {%r3488, %r3489, %r3490, %r3491}, [%r27]; 2026-02-21T09:18:15.0105081Z mov.b32 {%rs1025, %rs1026}, %r3491; 2026-02-21T09:18:15.0105147Z mov.b32 {%rs1027, %rs1028}, %r3490; 2026-02-21T09:18:15.0105215Z mov.b32 {%rs1029, %rs1030}, %r3489; 2026-02-21T09:18:15.0105279Z mov.b32 {%rs1031, %rs1032}, %r3488; 2026-02-21T09:18:15.0105376Z ld.shared.v4.b32 {%r3492, %r3493, %r3494, %r3495}, [%r28]; 2026-02-21T09:18:15.0105438Z mov.b32 {%rs1033, %rs1034}, %r3495; 2026-02-21T09:18:15.0105508Z mov.b32 {%rs1035, %rs1036}, %r3494; 2026-02-21T09:18:15.0105572Z mov.b32 {%rs1037, %rs1038}, %r3493; 2026-02-21T09:18:15.0105634Z mov.b32 {%rs1039, %rs1040}, %r3492; 2026-02-21T09:18:15.0105695Z $L__tmp238: 2026-02-21T09:18:15.0105865Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0105932Z cvt.f32.bf16 %r3471, %rs1031; 2026-02-21T09:18:15.0105997Z cvt.f32.bf16 %r3472, %rs1032; 2026-02-21T09:18:15.0106065Z cvt.f32.bf16 %r3473, %rs1029; 2026-02-21T09:18:15.0106125Z cvt.f32.bf16 %r3474, %rs1030; 2026-02-21T09:18:15.0106186Z cvt.f32.bf16 %r3475, %rs1027; 2026-02-21T09:18:15.0106252Z cvt.f32.bf16 %r3476, %rs1028; 2026-02-21T09:18:15.0106311Z cvt.f32.bf16 %r3477, %rs1025; 2026-02-21T09:18:15.0106372Z cvt.f32.bf16 %r3478, %rs1026; 2026-02-21T09:18:15.0106437Z cvt.f32.bf16 %r3479, %rs1039; 2026-02-21T09:18:15.0106499Z cvt.f32.bf16 %r3480, %rs1040; 2026-02-21T09:18:15.0106559Z cvt.f32.bf16 %r3481, %rs1037; 2026-02-21T09:18:15.0106619Z cvt.f32.bf16 %r3482, %rs1038; 2026-02-21T09:18:15.0106690Z cvt.f32.bf16 %r3483, %rs1035; 2026-02-21T09:18:15.0106749Z 
cvt.f32.bf16 %r3484, %rs1036; 2026-02-21T09:18:15.0106811Z cvt.f32.bf16 %r3485, %rs1033; 2026-02-21T09:18:15.0106877Z cvt.f32.bf16 %r3486, %rs1034; 2026-02-21T09:18:15.0107047Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:15.0107113Z cvt.s64.s32 %rd466, %r7845; 2026-02-21T09:18:15.0107177Z add.s64 %rd464, %rd56, %rd466; 2026-02-21T09:18:15.0107350Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0107411Z // begin inline asm 2026-02-21T09:18:15.0107469Z mov.u64 %rd463, 0x0; 2026-02-21T09:18:15.0107591Z createpolicy.fractional.L2::evict_first.b64 %rd463, 1.0; 2026-02-21T09:18:15.0107674Z // end inline asm 2026-02-21T09:18:15.0107734Z // begin inline asm 2026-02-21T09:18:15.0107799Z mov.u32 %r3330, 0x0; 2026-02-21T09:18:15.0107857Z mov.u32 %r3331, 0x0; 2026-02-21T09:18:15.0107935Z mov.u32 %r3332, 0x0; 2026-02-21T09:18:15.0107993Z mov.u32 %r3333, 0x0; 2026-02-21T09:18:15.0108204Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3330, %r3331, %r3332, %r3333 }, [ %rd464 + 0 ], %rd463; 2026-02-21T09:18:15.0108286Z // end inline asm 2026-02-21T09:18:15.0108358Z prmt.b32 %r3496, %r3330, 0, 0x8880U; 2026-02-21T09:18:15.0108432Z cvt.u16.u32 %rs1041, %r3496; 2026-02-21T09:18:15.0108502Z prmt.b32 %r3497, %r3330, 0, 0x7770U; 2026-02-21T09:18:15.0108566Z cvt.u16.u32 %rs1042, %r3497; 2026-02-21T09:18:15.0108641Z prmt.b32 %r3498, %r3330, 0, 0x9991U; 2026-02-21T09:18:15.0108706Z cvt.u16.u32 %rs1043, %r3498; 2026-02-21T09:18:15.0108770Z prmt.b32 %r3499, %r3330, 0, 0x7771U; 2026-02-21T09:18:15.0108832Z cvt.u16.u32 %rs1044, %r3499; 2026-02-21T09:18:15.0108905Z prmt.b32 %r3500, %r3330, 0, 0xaaa2U; 2026-02-21T09:18:15.0108965Z cvt.u16.u32 %rs1045, %r3500; 2026-02-21T09:18:15.0109027Z prmt.b32 %r3501, %r3330, 0, 0x7772U; 2026-02-21T09:18:15.0109096Z cvt.u16.u32 %rs1046, %r3501; 2026-02-21T09:18:15.0109159Z prmt.b32 %r3502, %r3330, 0, 0xbbb3U; 2026-02-21T09:18:15.0109220Z cvt.u16.u32 
%rs1047, %r3502; 2026-02-21T09:18:15.0109285Z prmt.b32 %r3503, %r3330, 0, 0x7773U; 2026-02-21T09:18:15.0109355Z cvt.u16.u32 %rs1048, %r3503; 2026-02-21T09:18:15.0109429Z prmt.b32 %r3504, %r3331, 0, 0x8880U; 2026-02-21T09:18:15.0109487Z cvt.u16.u32 %rs1049, %r3504; 2026-02-21T09:18:15.0109554Z prmt.b32 %r3505, %r3331, 0, 0x7770U; 2026-02-21T09:18:15.0109610Z cvt.u16.u32 %rs1050, %r3505; 2026-02-21T09:18:15.0109670Z prmt.b32 %r3506, %r3331, 0, 0x9991U; 2026-02-21T09:18:15.0109728Z cvt.u16.u32 %rs1051, %r3506; 2026-02-21T09:18:15.0109796Z prmt.b32 %r3507, %r3331, 0, 0x7771U; 2026-02-21T09:18:15.0109854Z cvt.u16.u32 %rs1052, %r3507; 2026-02-21T09:18:15.0109916Z prmt.b32 %r3508, %r3331, 0, 0xaaa2U; 2026-02-21T09:18:15.0109982Z cvt.u16.u32 %rs1053, %r3508; 2026-02-21T09:18:15.0110041Z prmt.b32 %r3509, %r3331, 0, 0x7772U; 2026-02-21T09:18:15.0110101Z cvt.u16.u32 %rs1054, %r3509; 2026-02-21T09:18:15.0110169Z prmt.b32 %r3510, %r3331, 0, 0xbbb3U; 2026-02-21T09:18:15.0110227Z cvt.u16.u32 %rs1055, %r3510; 2026-02-21T09:18:15.0110288Z prmt.b32 %r3511, %r3331, 0, 0x7773U; 2026-02-21T09:18:15.0110346Z cvt.u16.u32 %rs1056, %r3511; 2026-02-21T09:18:15.0110415Z prmt.b32 %r3512, %r3332, 0, 0x8880U; 2026-02-21T09:18:15.0110472Z cvt.u16.u32 %rs1057, %r3512; 2026-02-21T09:18:15.0110532Z prmt.b32 %r3513, %r3332, 0, 0x7770U; 2026-02-21T09:18:15.0110598Z cvt.u16.u32 %rs1058, %r3513; 2026-02-21T09:18:15.0110657Z prmt.b32 %r3514, %r3332, 0, 0x9991U; 2026-02-21T09:18:15.0110716Z cvt.u16.u32 %rs1059, %r3514; 2026-02-21T09:18:15.0110779Z prmt.b32 %r3515, %r3332, 0, 0x7771U; 2026-02-21T09:18:15.0110846Z cvt.u16.u32 %rs1060, %r3515; 2026-02-21T09:18:15.0110906Z prmt.b32 %r3516, %r3332, 0, 0xaaa2U; 2026-02-21T09:18:15.0110963Z cvt.u16.u32 %rs1061, %r3516; 2026-02-21T09:18:15.0111029Z prmt.b32 %r3517, %r3332, 0, 0x7772U; 2026-02-21T09:18:15.0111089Z cvt.u16.u32 %rs1062, %r3517; 2026-02-21T09:18:15.0111150Z prmt.b32 %r3518, %r3332, 0, 0xbbb3U; 2026-02-21T09:18:15.0111215Z cvt.u16.u32 %rs1063, 
%r3518; 2026-02-21T09:18:15.0111274Z prmt.b32 %r3519, %r3332, 0, 0x7773U; 2026-02-21T09:18:15.0111333Z cvt.u16.u32 %rs1064, %r3519; 2026-02-21T09:18:15.0111393Z prmt.b32 %r3520, %r3333, 0, 0x8880U; 2026-02-21T09:18:15.0111458Z cvt.u16.u32 %rs1065, %r3520; 2026-02-21T09:18:15.0111517Z prmt.b32 %r3521, %r3333, 0, 0x7770U; 2026-02-21T09:18:15.0111600Z cvt.u16.u32 %rs1066, %r3521; 2026-02-21T09:18:15.0111668Z prmt.b32 %r3522, %r3333, 0, 0x9991U; 2026-02-21T09:18:15.0111726Z cvt.u16.u32 %rs1067, %r3522; 2026-02-21T09:18:15.0111785Z prmt.b32 %r3523, %r3333, 0, 0x7771U; 2026-02-21T09:18:15.0111842Z cvt.u16.u32 %rs1068, %r3523; 2026-02-21T09:18:15.0111947Z prmt.b32 %r3524, %r3333, 0, 0xaaa2U; 2026-02-21T09:18:15.0112005Z cvt.u16.u32 %rs1069, %r3524; 2026-02-21T09:18:15.0112064Z prmt.b32 %r3525, %r3333, 0, 0x7772U; 2026-02-21T09:18:15.0112157Z cvt.u16.u32 %rs1070, %r3525; 2026-02-21T09:18:15.0112216Z prmt.b32 %r3526, %r3333, 0, 0xbbb3U; 2026-02-21T09:18:15.0112272Z cvt.u16.u32 %rs1071, %r3526; 2026-02-21T09:18:15.0112354Z prmt.b32 %r3527, %r3333, 0, 0x7773U; 2026-02-21T09:18:15.0112439Z cvt.u16.u32 %rs1072, %r3527; 2026-02-21T09:18:15.0112603Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0112665Z shl.b16 %rs1073, %rs1042, 12; 2026-02-21T09:18:15.0112735Z shr.s16 %rs1074, %rs1073, 12; 2026-02-21T09:18:15.0112793Z shl.b16 %rs1075, %rs1044, 12; 2026-02-21T09:18:15.0112853Z shr.s16 %rs1076, %rs1075, 12; 2026-02-21T09:18:15.0112917Z shl.b16 %rs1077, %rs1046, 12; 2026-02-21T09:18:15.0112974Z shr.s16 %rs1078, %rs1077, 12; 2026-02-21T09:18:15.0113030Z shl.b16 %rs1079, %rs1048, 12; 2026-02-21T09:18:15.0113088Z shr.s16 %rs1080, %rs1079, 12; 2026-02-21T09:18:15.0113151Z shl.b16 %rs1081, %rs1050, 12; 2026-02-21T09:18:15.0113208Z shr.s16 %rs1082, %rs1081, 12; 2026-02-21T09:18:15.0113267Z shl.b16 %rs1083, %rs1052, 12; 2026-02-21T09:18:15.0113332Z shr.s16 %rs1084, %rs1083, 12; 2026-02-21T09:18:15.0113389Z shl.b16 %rs1085, %rs1054, 12; 
2026-02-21T09:18:15.0113446Z shr.s16 %rs1086, %rs1085, 12; 2026-02-21T09:18:15.0113504Z shl.b16 %rs1087, %rs1056, 12; 2026-02-21T09:18:15.0113571Z shr.s16 %rs1088, %rs1087, 12; 2026-02-21T09:18:15.0113628Z shl.b16 %rs1089, %rs1058, 12; 2026-02-21T09:18:15.0113685Z shr.s16 %rs1090, %rs1089, 12; 2026-02-21T09:18:15.0113750Z shl.b16 %rs1091, %rs1060, 12; 2026-02-21T09:18:15.0113808Z shr.s16 %rs1092, %rs1091, 12; 2026-02-21T09:18:15.0113865Z shl.b16 %rs1093, %rs1062, 12; 2026-02-21T09:18:15.0113923Z shr.s16 %rs1094, %rs1093, 12; 2026-02-21T09:18:15.0113987Z shl.b16 %rs1095, %rs1064, 12; 2026-02-21T09:18:15.0114044Z shr.s16 %rs1096, %rs1095, 12; 2026-02-21T09:18:15.0114102Z shl.b16 %rs1097, %rs1066, 12; 2026-02-21T09:18:15.0114167Z shr.s16 %rs1098, %rs1097, 12; 2026-02-21T09:18:15.0114224Z shl.b16 %rs1099, %rs1068, 12; 2026-02-21T09:18:15.0114283Z shr.s16 %rs1100, %rs1099, 12; 2026-02-21T09:18:15.0114346Z shl.b16 %rs1101, %rs1070, 12; 2026-02-21T09:18:15.0114405Z shr.s16 %rs1102, %rs1101, 12; 2026-02-21T09:18:15.0114463Z shl.b16 %rs1103, %rs1072, 12; 2026-02-21T09:18:15.0114529Z shr.s16 %rs1104, %rs1103, 12; 2026-02-21T09:18:15.0114697Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0114757Z shr.u16 %rs1105, %rs1041, 4; 2026-02-21T09:18:15.0114815Z shr.u16 %rs1106, %rs1043, 4; 2026-02-21T09:18:15.0114879Z shr.u16 %rs1107, %rs1045, 4; 2026-02-21T09:18:15.0114936Z shr.u16 %rs1108, %rs1047, 4; 2026-02-21T09:18:15.0114993Z shr.u16 %rs1109, %rs1049, 4; 2026-02-21T09:18:15.0115050Z shr.u16 %rs1110, %rs1051, 4; 2026-02-21T09:18:15.0115115Z shr.u16 %rs1111, %rs1053, 4; 2026-02-21T09:18:15.0115172Z shr.u16 %rs1112, %rs1055, 4; 2026-02-21T09:18:15.0115230Z shr.u16 %rs1113, %rs1057, 4; 2026-02-21T09:18:15.0115295Z shr.u16 %rs1114, %rs1059, 4; 2026-02-21T09:18:15.0115352Z shr.u16 %rs1115, %rs1061, 4; 2026-02-21T09:18:15.0115410Z shr.u16 %rs1116, %rs1063, 4; 2026-02-21T09:18:15.0115468Z shr.u16 %rs1117, %rs1065, 4; 
2026-02-21T09:18:15.0115535Z shr.u16 %rs1118, %rs1067, 4; 2026-02-21T09:18:15.0115592Z shr.u16 %rs1119, %rs1069, 4; 2026-02-21T09:18:15.0115649Z shr.u16 %rs1120, %rs1071, 4; 2026-02-21T09:18:15.0115815Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0115869Z bar.sync 0; 2026-02-21T09:18:15.0115931Z st.shared.b8 [%r29], %rs1074; 2026-02-21T09:18:15.0115998Z st.shared.b8 [%r30], %rs1076; 2026-02-21T09:18:15.0116058Z st.shared.b8 [%r31], %rs1078; 2026-02-21T09:18:15.0116116Z st.shared.b8 [%r32], %rs1080; 2026-02-21T09:18:15.0116204Z st.shared.b8 [%r33+512], %rs1082; 2026-02-21T09:18:15.0116274Z st.shared.b8 [%r34+512], %rs1084; 2026-02-21T09:18:15.0116336Z st.shared.b8 [%r35+512], %rs1086; 2026-02-21T09:18:15.0116414Z st.shared.b8 [%r36+512], %rs1088; 2026-02-21T09:18:15.0116486Z st.shared.b8 [%r37+1024], %rs1090; 2026-02-21T09:18:15.0116572Z st.shared.b8 [%r38+1024], %rs1092; 2026-02-21T09:18:15.0116633Z st.shared.b8 [%r39+1024], %rs1094; 2026-02-21T09:18:15.0116720Z st.shared.b8 [%r40+1024], %rs1096; 2026-02-21T09:18:15.0116789Z st.shared.b8 [%r41+1536], %rs1098; 2026-02-21T09:18:15.0116849Z st.shared.b8 [%r42+1536], %rs1100; 2026-02-21T09:18:15.0116908Z st.shared.b8 [%r43+1536], %rs1102; 2026-02-21T09:18:15.0116975Z st.shared.b8 [%r44+1536], %rs1104; 2026-02-21T09:18:15.0117027Z bar.sync 0; 2026-02-21T09:18:15.0117089Z ld.shared.b32 %r3528, [%r45]; 2026-02-21T09:18:15.0117149Z ld.shared.b32 %r3529, [%r46]; 2026-02-21T09:18:15.0117216Z ld.shared.b32 %r3530, [%r47]; 2026-02-21T09:18:15.0117277Z ld.shared.b32 %r3531, [%r48]; 2026-02-21T09:18:15.0117434Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0117495Z bar.sync 0; 2026-02-21T09:18:15.0117555Z st.shared.b8 [%r29], %rs1105; 2026-02-21T09:18:15.0117613Z st.shared.b8 [%r30], %rs1106; 2026-02-21T09:18:15.0117679Z st.shared.b8 [%r31], %rs1107; 2026-02-21T09:18:15.0117737Z st.shared.b8 [%r32], %rs1108; 
2026-02-21T09:18:15.0117796Z st.shared.b8 [%r33+512], %rs1109; 2026-02-21T09:18:15.0117855Z st.shared.b8 [%r34+512], %rs1110; 2026-02-21T09:18:15.0117922Z st.shared.b8 [%r35+512], %rs1111; 2026-02-21T09:18:15.0117981Z st.shared.b8 [%r36+512], %rs1112; 2026-02-21T09:18:15.0118041Z st.shared.b8 [%r37+1024], %rs1113; 2026-02-21T09:18:15.0118105Z st.shared.b8 [%r38+1024], %rs1114; 2026-02-21T09:18:15.0118165Z st.shared.b8 [%r39+1024], %rs1115; 2026-02-21T09:18:15.0118223Z st.shared.b8 [%r40+1024], %rs1116; 2026-02-21T09:18:15.0118282Z st.shared.b8 [%r41+1536], %rs1117; 2026-02-21T09:18:15.0118348Z st.shared.b8 [%r42+1536], %rs1118; 2026-02-21T09:18:15.0118406Z st.shared.b8 [%r43+1536], %rs1119; 2026-02-21T09:18:15.0118464Z st.shared.b8 [%r44+1536], %rs1120; 2026-02-21T09:18:15.0118526Z bar.sync 0; 2026-02-21T09:18:15.0118585Z ld.shared.b32 %r3532, [%r45]; 2026-02-21T09:18:15.0118646Z ld.shared.b32 %r3533, [%r46]; 2026-02-21T09:18:15.0118711Z ld.shared.b32 %r3534, [%r47]; 2026-02-21T09:18:15.0118771Z ld.shared.b32 %r3535, [%r48]; 2026-02-21T09:18:15.0118930Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0118991Z cvt.s8.s32 %rs1121, %r3529; 2026-02-21T09:18:15.0119061Z cvt.rn.f32.s16 %r3536, %rs1121; 2026-02-21T09:18:15.0119121Z cvt.s8.s32 %rs1122, %r3528; 2026-02-21T09:18:15.0119182Z cvt.rn.f32.s16 %r3537, %rs1122; 2026-02-21T09:18:15.0119246Z cvt.s8.s32 %rs1123, %r3533; 2026-02-21T09:18:15.0119307Z cvt.rn.f32.s16 %r3538, %rs1123; 2026-02-21T09:18:15.0119366Z cvt.s8.s32 %rs1124, %r3532; 2026-02-21T09:18:15.0119424Z cvt.rn.f32.s16 %r3539, %rs1124; 2026-02-21T09:18:15.0119489Z cvt.s8.s32 %rs1125, %r3531; 2026-02-21T09:18:15.0119549Z cvt.rn.f32.s16 %r3540, %rs1125; 2026-02-21T09:18:15.0119607Z cvt.s8.s32 %rs1126, %r3530; 2026-02-21T09:18:15.0119673Z cvt.rn.f32.s16 %r3541, %rs1126; 2026-02-21T09:18:15.0119732Z cvt.s8.s32 %rs1127, %r3535; 2026-02-21T09:18:15.0119791Z cvt.rn.f32.s16 %r3542, %rs1127; 
2026-02-21T09:18:15.0119856Z cvt.s8.s32 %rs1128, %r3534; 2026-02-21T09:18:15.0119914Z cvt.rn.f32.s16 %r3543, %rs1128; 2026-02-21T09:18:15.0119978Z prmt.b32 %r3544, %r3529, 0, 0x9991U; 2026-02-21T09:18:15.0120037Z cvt.u16.u32 %rs1129, %r3544; 2026-02-21T09:18:15.0120104Z cvt.rn.f32.s16 %r3545, %rs1129; 2026-02-21T09:18:15.0120167Z prmt.b32 %r3546, %r3528, 0, 0x9991U; 2026-02-21T09:18:15.0120226Z cvt.u16.u32 %rs1130, %r3546; 2026-02-21T09:18:15.0120292Z cvt.rn.f32.s16 %r3547, %rs1130; 2026-02-21T09:18:15.0120353Z prmt.b32 %r3548, %r3533, 0, 0x9991U; 2026-02-21T09:18:15.0120433Z cvt.u16.u32 %rs1131, %r3548; 2026-02-21T09:18:15.0120493Z cvt.rn.f32.s16 %r3549, %rs1131; 2026-02-21T09:18:15.0120563Z prmt.b32 %r3550, %r3532, 0, 0x9991U; 2026-02-21T09:18:15.0120643Z cvt.u16.u32 %rs1132, %r3550; 2026-02-21T09:18:15.0120703Z cvt.rn.f32.s16 %r3551, %rs1132; 2026-02-21T09:18:15.0120792Z prmt.b32 %r3552, %r3531, 0, 0x9991U; 2026-02-21T09:18:15.0120853Z cvt.u16.u32 %rs1133, %r3552; 2026-02-21T09:18:15.0120933Z cvt.rn.f32.s16 %r3553, %rs1133; 2026-02-21T09:18:15.0120997Z prmt.b32 %r3554, %r3530, 0, 0x9991U; 2026-02-21T09:18:15.0121065Z cvt.u16.u32 %rs1134, %r3554; 2026-02-21T09:18:15.0121126Z cvt.rn.f32.s16 %r3555, %rs1134; 2026-02-21T09:18:15.0121188Z prmt.b32 %r3556, %r3535, 0, 0x9991U; 2026-02-21T09:18:15.0121254Z cvt.u16.u32 %rs1135, %r3556; 2026-02-21T09:18:15.0121313Z cvt.rn.f32.s16 %r3557, %rs1135; 2026-02-21T09:18:15.0121374Z prmt.b32 %r3558, %r3534, 0, 0x9991U; 2026-02-21T09:18:15.0121438Z cvt.u16.u32 %rs1136, %r3558; 2026-02-21T09:18:15.0121498Z cvt.rn.f32.s16 %r3559, %rs1136; 2026-02-21T09:18:15.0121586Z prmt.b32 %r3560, %r3529, 0, 0xaaa2U; 2026-02-21T09:18:15.0121646Z cvt.u16.u32 %rs1137, %r3560; 2026-02-21T09:18:15.0121714Z cvt.rn.f32.s16 %r3561, %rs1137; 2026-02-21T09:18:15.0121777Z prmt.b32 %r3562, %r3528, 0, 0xaaa2U; 2026-02-21T09:18:15.0121837Z cvt.u16.u32 %rs1138, %r3562; 2026-02-21T09:18:15.0121907Z cvt.rn.f32.s16 %r3563, %rs1138; 2026-02-21T09:18:15.0121969Z 
prmt.b32 %r3564, %r3533, 0, 0xaaa2U; 2026-02-21T09:18:15.0122028Z cvt.u16.u32 %rs1139, %r3564; 2026-02-21T09:18:15.0122088Z cvt.rn.f32.s16 %r3565, %rs1139; 2026-02-21T09:18:15.0122155Z prmt.b32 %r3566, %r3532, 0, 0xaaa2U; 2026-02-21T09:18:15.0122214Z cvt.u16.u32 %rs1140, %r3566; 2026-02-21T09:18:15.0122273Z cvt.rn.f32.s16 %r3567, %rs1140; 2026-02-21T09:18:15.0122342Z prmt.b32 %r3568, %r3531, 0, 0xaaa2U; 2026-02-21T09:18:15.0122400Z cvt.u16.u32 %rs1141, %r3568; 2026-02-21T09:18:15.0122459Z cvt.rn.f32.s16 %r3569, %rs1141; 2026-02-21T09:18:15.0122530Z prmt.b32 %r3570, %r3530, 0, 0xaaa2U; 2026-02-21T09:18:15.0122588Z cvt.u16.u32 %rs1142, %r3570; 2026-02-21T09:18:15.0122647Z cvt.rn.f32.s16 %r3571, %rs1142; 2026-02-21T09:18:15.0122710Z prmt.b32 %r3572, %r3535, 0, 0xaaa2U; 2026-02-21T09:18:15.0122777Z cvt.u16.u32 %rs1143, %r3572; 2026-02-21T09:18:15.0122839Z cvt.rn.f32.s16 %r3573, %rs1143; 2026-02-21T09:18:15.0122901Z prmt.b32 %r3574, %r3534, 0, 0xaaa2U; 2026-02-21T09:18:15.0122970Z cvt.u16.u32 %rs1144, %r3574; 2026-02-21T09:18:15.0123030Z cvt.rn.f32.s16 %r3575, %rs1144; 2026-02-21T09:18:15.0123092Z prmt.b32 %r3576, %r3529, 0, 0xbbb3U; 2026-02-21T09:18:15.0123151Z cvt.u16.u32 %rs1145, %r3576; 2026-02-21T09:18:15.0123219Z cvt.rn.f32.s16 %r3577, %rs1145; 2026-02-21T09:18:15.0123280Z prmt.b32 %r3578, %r3528, 0, 0xbbb3U; 2026-02-21T09:18:15.0123339Z cvt.u16.u32 %rs1146, %r3578; 2026-02-21T09:18:15.0123407Z cvt.rn.f32.s16 %r3579, %rs1146; 2026-02-21T09:18:15.0123469Z prmt.b32 %r3580, %r3533, 0, 0xbbb3U; 2026-02-21T09:18:15.0123528Z cvt.u16.u32 %rs1147, %r3580; 2026-02-21T09:18:15.0123587Z cvt.rn.f32.s16 %r3581, %rs1147; 2026-02-21T09:18:15.0123654Z prmt.b32 %r3582, %r3532, 0, 0xbbb3U; 2026-02-21T09:18:15.0123714Z cvt.u16.u32 %rs1148, %r3582; 2026-02-21T09:18:15.0123772Z cvt.rn.f32.s16 %r3583, %rs1148; 2026-02-21T09:18:15.0123842Z prmt.b32 %r3584, %r3531, 0, 0xbbb3U; 2026-02-21T09:18:15.0123901Z cvt.u16.u32 %rs1149, %r3584; 2026-02-21T09:18:15.0123961Z cvt.rn.f32.s16 %r3585, 
%rs1149; 2026-02-21T09:18:15.0124028Z prmt.b32 %r3586, %r3530, 0, 0xbbb3U; 2026-02-21T09:18:15.0124086Z cvt.u16.u32 %rs1150, %r3586; 2026-02-21T09:18:15.0124145Z cvt.rn.f32.s16 %r3587, %rs1150; 2026-02-21T09:18:15.0124204Z prmt.b32 %r3588, %r3535, 0, 0xbbb3U; 2026-02-21T09:18:15.0124269Z cvt.u16.u32 %rs1151, %r3588; 2026-02-21T09:18:15.0124329Z cvt.rn.f32.s16 %r3589, %rs1151; 2026-02-21T09:18:15.0124389Z prmt.b32 %r3590, %r3534, 0, 0xbbb3U; 2026-02-21T09:18:15.0124455Z cvt.u16.u32 %rs1152, %r3590; 2026-02-21T09:18:15.0124549Z cvt.rn.f32.s16 %r3591, %rs1152; 2026-02-21T09:18:15.0124603Z bar.sync 0; 2026-02-21T09:18:15.0124703Z st.shared.v4.b32 [%r49], {%r3537, %r3539, %r3536, %r3538}; 2026-02-21T09:18:15.0124830Z st.shared.v4.b32 [%r50], {%r3541, %r3543, %r3540, %r3542}; 2026-02-21T09:18:15.0124922Z st.shared.v4.b32 [%r51], {%r3547, %r3551, %r3545, %r3549}; 2026-02-21T09:18:15.0125040Z st.shared.v4.b32 [%r52], {%r3555, %r3559, %r3553, %r3557}; 2026-02-21T09:18:15.0125174Z st.shared.v4.b32 [%r53], {%r3563, %r3567, %r3561, %r3565}; 2026-02-21T09:18:15.0125263Z st.shared.v4.b32 [%r54], {%r3571, %r3575, %r3569, %r3573}; 2026-02-21T09:18:15.0125351Z st.shared.v4.b32 [%r55], {%r3579, %r3583, %r3577, %r3581}; 2026-02-21T09:18:15.0125447Z st.shared.v4.b32 [%r56], {%r3587, %r3591, %r3585, %r3589}; 2026-02-21T09:18:15.0125500Z $L__tmp239: 2026-02-21T09:18:15.0125711Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0125771Z // begin inline asm 2026-02-21T09:18:15.0126093Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3334, %r3335, %r3336, %r3337, %r3338, %r3339, %r3340, %r3341, %r3342, %r3343, %r3344, %r3345, %r3346, %r3347, %r3348, %r3349}, [%r4673 + 0], 64; 2026-02-21T09:18:15.0126151Z // end inline asm 2026-02-21T09:18:15.0126210Z // begin inline asm 2026-02-21T09:18:15.0126528Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3351, %r3352, %r3353, %r3354, %r3355, %r3356, %r3357, %r3358, %r3359, %r3360, 
%r3361, %r3362, %r3363, %r3364, %r3365, %r3366}, [%r4673 + 16], 64; 2026-02-21T09:18:15.0126583Z // end inline asm 2026-02-21T09:18:15.0126640Z // begin inline asm 2026-02-21T09:18:15.0126949Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3368, %r3369, %r3370, %r3371, %r3372, %r3373, %r3374, %r3375, %r3376, %r3377, %r3378, %r3379, %r3380, %r3381, %r3382, %r3383}, [%r4673 + 32], 64; 2026-02-21T09:18:15.0127004Z // end inline asm 2026-02-21T09:18:15.0127060Z // begin inline asm 2026-02-21T09:18:15.0127365Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3385, %r3386, %r3387, %r3388, %r3389, %r3390, %r3391, %r3392, %r3393, %r3394, %r3395, %r3396, %r3397, %r3398, %r3399, %r3400}, [%r4673 + 48], 64; 2026-02-21T09:18:15.0127421Z // end inline asm 2026-02-21T09:18:15.0127479Z // begin inline asm 2026-02-21T09:18:15.0127556Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0127611Z // end inline asm 2026-02-21T09:18:15.0127667Z // begin inline asm 2026-02-21T09:18:15.0127984Z @%p159 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 0], 64, {%r3334, %r3335, %r3336, %r3337, %r3338, %r3339, %r3340, %r3341, %r3342, %r3343, %r3344, %r3345, %r3346, %r3347, %r3348, %r3349}; 2026-02-21T09:18:15.0128047Z // end inline asm 2026-02-21T09:18:15.0128103Z // begin inline asm 2026-02-21T09:18:15.0128415Z @%p159 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 16], 64, {%r3351, %r3352, %r3353, %r3354, %r3355, %r3356, %r3357, %r3358, %r3359, %r3360, %r3361, %r3362, %r3363, %r3364, %r3365, %r3366}; 2026-02-21T09:18:15.0128480Z // end inline asm 2026-02-21T09:18:15.0128537Z // begin inline asm 2026-02-21T09:18:15.0128843Z @%p159 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 32], 64, {%r3368, %r3369, %r3370, %r3371, %r3372, %r3373, %r3374, %r3375, %r3376, %r3377, %r3378, %r3379, %r3380, %r3381, %r3382, %r3383}; 2026-02-21T09:18:15.0128906Z // end inline asm 2026-02-21T09:18:15.0128963Z // begin inline asm 2026-02-21T09:18:15.0129266Z @%p159 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 
48], 64, {%r3385, %r3386, %r3387, %r3388, %r3389, %r3390, %r3391, %r3392, %r3393, %r3394, %r3395, %r3396, %r3397, %r3398, %r3399, %r3400}; 2026-02-21T09:18:15.0129329Z // end inline asm 2026-02-21T09:18:15.0129386Z // begin inline asm 2026-02-21T09:18:15.0129455Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0129510Z // end inline asm 2026-02-21T09:18:15.0129573Z bar.sync 0; 2026-02-21T09:18:15.0129631Z // begin inline asm 2026-02-21T09:18:15.0129938Z @%p159 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5556 + 0], 16, {%r3471, %r3472, %r3473, %r3474, %r3475, %r3476, %r3477, %r3478, %r3479, %r3480, %r3481, %r3482, %r3483, %r3484, %r3485, %r3486}; 2026-02-21T09:18:15.0130028Z // end inline asm 2026-02-21T09:18:15.0130083Z // begin inline asm 2026-02-21T09:18:15.0130171Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0130232Z // end inline asm 2026-02-21T09:18:15.0130306Z bar.sync 0; 2026-02-21T09:18:15.0130362Z // begin inline asm 2026-02-21T09:18:15.0130454Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0130517Z // end inline asm 2026-02-21T09:18:15.0130571Z // begin inline asm 2026-02-21T09:18:15.0130663Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:15.0130724Z // end inline asm 2026-02-21T09:18:15.0130778Z bar.sync 0; 2026-02-21T09:18:15.0130838Z @%p20 bra $L__BB0_25; 2026-02-21T09:18:15.0130936Z // %bb.24: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:18:15.0131010Z elect.sync %r3605|%p172, -1; 2026-02-21T09:18:15.0131071Z mov.b32 %r3595, 69208336; 2026-02-21T09:18:15.0131129Z // begin inline asm 2026-02-21T09:18:15.0131298Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 0 ], %rd1, %r3595, %p391; 2026-02-21T09:18:15.0131355Z // end inline asm 2026-02-21T09:18:15.0131417Z mov.pred %p173, -1; 2026-02-21T09:18:15.0131482Z // begin inline asm 2026-02-21T09:18:15.0131661Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 8 ], %rd2, %r3595, %p173; 2026-02-21T09:18:15.0131720Z // end inline 
asm 2026-02-21T09:18:15.0131776Z // begin inline asm 2026-02-21T09:18:15.0131934Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 16 ], %rd3, %r3595, %p173; 2026-02-21T09:18:15.0131989Z // end inline asm 2026-02-21T09:18:15.0132044Z // begin inline asm 2026-02-21T09:18:15.0132198Z @%p172 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 24 ], %rd4, %r3595, %p173; 2026-02-21T09:18:15.0132253Z // end inline asm 2026-02-21T09:18:15.0132316Z cvt.u64.u32 %rd471, %r5985; 2026-02-21T09:18:15.0132381Z // begin inline asm 2026-02-21T09:18:15.0132509Z @%p172 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd471]; 2026-02-21T09:18:15.0132564Z // end inline asm 2026-02-21T09:18:15.0132671Z $L__BB0_25: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:18:15.0132762Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:15.0132817Z mov.b32 %r3609, 0; 2026-02-21T09:18:15.0133034Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0133100Z // begin inline asm 2026-02-21T09:18:15.0133152Z 2026-02-21T09:18:15.0133203Z { 2026-02-21T09:18:15.0133272Z .reg .pred complete; 2026-02-21T09:18:15.0133327Z waitLoop: 2026-02-21T09:18:15.0133448Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r3609; 2026-02-21T09:18:15.0133515Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0133573Z } 2026-02-21T09:18:15.0133577Z 2026-02-21T09:18:15.0133635Z // end inline asm 2026-02-21T09:18:15.0133690Z bar.sync 0; 2026-02-21T09:18:15.0133755Z // begin inline asm 2026-02-21T09:18:15.0133841Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:15.0133897Z // end inline asm 2026-02-21T09:18:15.0133955Z $L__tmp240: 2026-02-21T09:18:15.0134120Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0134182Z add.s64 %rd473, %rd458, 64; 2026-02-21T09:18:15.0134341Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 
2026-02-21T09:18:15.0134407Z add.s64 %rd476, %rd461, 64; 2026-02-21T09:18:15.0134463Z // begin inline asm 2026-02-21T09:18:15.0134520Z mov.u64 %rd472, 0x0; 2026-02-21T09:18:15.0134637Z createpolicy.fractional.L2::evict_first.b64 %rd472, 1.0; 2026-02-21T09:18:15.0134690Z // end inline asm 2026-02-21T09:18:15.0134746Z // begin inline asm 2026-02-21T09:18:15.0134801Z mov.u32 %r3611, 0x0; 2026-02-21T09:18:15.0134891Z mov.u32 %r3612, 0x0; 2026-02-21T09:18:15.0134946Z mov.u32 %r3613, 0x0; 2026-02-21T09:18:15.0135001Z mov.u32 %r3614, 0x0; 2026-02-21T09:18:15.0135186Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3611, %r3612, %r3613, %r3614 }, [ %rd473 + 0 ], %rd472; 2026-02-21T09:18:15.0135271Z // end inline asm 2026-02-21T09:18:15.0135326Z // begin inline asm 2026-02-21T09:18:15.0135414Z mov.u64 %rd475, 0x0; 2026-02-21T09:18:15.0135554Z createpolicy.fractional.L2::evict_first.b64 %rd475, 1.0; 2026-02-21T09:18:15.0135611Z // end inline asm 2026-02-21T09:18:15.0135666Z // begin inline asm 2026-02-21T09:18:15.0135728Z mov.u32 %r3615, 0x0; 2026-02-21T09:18:15.0135782Z mov.u32 %r3616, 0x0; 2026-02-21T09:18:15.0135837Z mov.u32 %r3617, 0x0; 2026-02-21T09:18:15.0135899Z mov.u32 %r3618, 0x0; 2026-02-21T09:18:15.0136072Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3615, %r3616, %r3617, %r3618 }, [ %rd476 + 0 ], %rd475; 2026-02-21T09:18:15.0136128Z // end inline asm 2026-02-21T09:18:15.0136188Z $L__tmp241: 2026-02-21T09:18:15.0136395Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0136490Z st.shared.v4.b32 [%r26], {%r3611, %r3612, %r3613, %r3614}; 2026-02-21T09:18:15.0136593Z st.shared.v4.b32 [%r26+512], {%r3615, %r3616, %r3617, %r3618}; 2026-02-21T09:18:15.0136657Z bar.sync 0; 2026-02-21T09:18:15.0136750Z ld.shared.v4.b32 {%r3778, %r3779, %r3780, %r3781}, [%r27]; 2026-02-21T09:18:15.0136816Z mov.b32 {%rs1153, %rs1154}, %r3781; 2026-02-21T09:18:15.0136887Z mov.b32 {%rs1155, %rs1156}, %r3780; 
2026-02-21T09:18:15.0136948Z mov.b32 {%rs1157, %rs1158}, %r3779; 2026-02-21T09:18:15.0137008Z mov.b32 {%rs1159, %rs1160}, %r3778; 2026-02-21T09:18:15.0137104Z ld.shared.v4.b32 {%r3782, %r3783, %r3784, %r3785}, [%r28]; 2026-02-21T09:18:15.0137165Z mov.b32 {%rs1161, %rs1162}, %r3785; 2026-02-21T09:18:15.0137226Z mov.b32 {%rs1163, %rs1164}, %r3784; 2026-02-21T09:18:15.0137285Z mov.b32 {%rs1165, %rs1166}, %r3783; 2026-02-21T09:18:15.0137353Z mov.b32 {%rs1167, %rs1168}, %r3782; 2026-02-21T09:18:15.0137407Z $L__tmp242: 2026-02-21T09:18:15.0137567Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0137640Z cvt.f32.bf16 %r3760, %rs1159; 2026-02-21T09:18:15.0137703Z cvt.f32.bf16 %r3761, %rs1160; 2026-02-21T09:18:15.0137764Z cvt.f32.bf16 %r3762, %rs1157; 2026-02-21T09:18:15.0137827Z cvt.f32.bf16 %r3763, %rs1158; 2026-02-21T09:18:15.0137896Z cvt.f32.bf16 %r3764, %rs1155; 2026-02-21T09:18:15.0137954Z cvt.f32.bf16 %r3765, %rs1156; 2026-02-21T09:18:15.0138011Z cvt.f32.bf16 %r3766, %rs1153; 2026-02-21T09:18:15.0138077Z cvt.f32.bf16 %r3767, %rs1154; 2026-02-21T09:18:15.0138134Z cvt.f32.bf16 %r3768, %rs1167; 2026-02-21T09:18:15.0138191Z cvt.f32.bf16 %r3769, %rs1168; 2026-02-21T09:18:15.0138256Z cvt.f32.bf16 %r3770, %rs1165; 2026-02-21T09:18:15.0138313Z cvt.f32.bf16 %r3771, %rs1166; 2026-02-21T09:18:15.0138371Z cvt.f32.bf16 %r3772, %rs1163; 2026-02-21T09:18:15.0138428Z cvt.f32.bf16 %r3773, %rs1164; 2026-02-21T09:18:15.0138491Z cvt.f32.bf16 %r3774, %rs1161; 2026-02-21T09:18:15.0138547Z cvt.f32.bf16 %r3775, %rs1162; 2026-02-21T09:18:15.0138708Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:15.0138778Z add.s32 %r3786, %r7845, 131072; 2026-02-21T09:18:15.0138839Z cvt.s64.s32 %rd481, %r3786; 2026-02-21T09:18:15.0138903Z add.s64 %rd479, %rd56, %rd481; 2026-02-21T09:18:15.0139058Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0139125Z // begin 
inline asm 2026-02-21T09:18:15.0139180Z mov.u64 %rd478, 0x0; 2026-02-21T09:18:15.0139287Z createpolicy.fractional.L2::evict_first.b64 %rd478, 1.0; 2026-02-21T09:18:15.0139350Z // end inline asm 2026-02-21T09:18:15.0139407Z // begin inline asm 2026-02-21T09:18:15.0139462Z mov.u32 %r3619, 0x0; 2026-02-21T09:18:15.0139550Z mov.u32 %r3620, 0x0; 2026-02-21T09:18:15.0139605Z mov.u32 %r3621, 0x0; 2026-02-21T09:18:15.0139659Z mov.u32 %r3622, 0x0; 2026-02-21T09:18:15.0139828Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3619, %r3620, %r3621, %r3622 }, [ %rd479 + 0 ], %rd478; 2026-02-21T09:18:15.0139912Z // end inline asm 2026-02-21T09:18:15.0140000Z prmt.b32 %r3787, %r3619, 0, 0x8880U; 2026-02-21T09:18:15.0140064Z cvt.u16.u32 %rs1169, %r3787; 2026-02-21T09:18:15.0140155Z prmt.b32 %r3788, %r3619, 0, 0x7770U; 2026-02-21T09:18:15.0140216Z cvt.u16.u32 %rs1170, %r3788; 2026-02-21T09:18:15.0140279Z prmt.b32 %r3789, %r3619, 0, 0x9991U; 2026-02-21T09:18:15.0140346Z cvt.u16.u32 %rs1171, %r3789; 2026-02-21T09:18:15.0140406Z prmt.b32 %r3790, %r3619, 0, 0x7771U; 2026-02-21T09:18:15.0140465Z cvt.u16.u32 %rs1172, %r3790; 2026-02-21T09:18:15.0140526Z prmt.b32 %r3791, %r3619, 0, 0xaaa2U; 2026-02-21T09:18:15.0140593Z cvt.u16.u32 %rs1173, %r3791; 2026-02-21T09:18:15.0140655Z prmt.b32 %r3792, %r3619, 0, 0x7772U; 2026-02-21T09:18:15.0140717Z cvt.u16.u32 %rs1174, %r3792; 2026-02-21T09:18:15.0140783Z prmt.b32 %r3793, %r3619, 0, 0xbbb3U; 2026-02-21T09:18:15.0140841Z cvt.u16.u32 %rs1175, %r3793; 2026-02-21T09:18:15.0140904Z prmt.b32 %r3794, %r3619, 0, 0x7773U; 2026-02-21T09:18:15.0140961Z cvt.u16.u32 %rs1176, %r3794; 2026-02-21T09:18:15.0141029Z prmt.b32 %r3795, %r3620, 0, 0x8880U; 2026-02-21T09:18:15.0141087Z cvt.u16.u32 %rs1177, %r3795; 2026-02-21T09:18:15.0141148Z prmt.b32 %r3796, %r3620, 0, 0x7770U; 2026-02-21T09:18:15.0141214Z cvt.u16.u32 %rs1178, %r3796; 2026-02-21T09:18:15.0141273Z prmt.b32 %r3797, %r3620, 0, 0x9991U; 2026-02-21T09:18:15.0141331Z cvt.u16.u32 %rs1179, %r3797; 
2026-02-21T09:18:15.0141391Z prmt.b32 %r3798, %r3620, 0, 0x7771U; 2026-02-21T09:18:15.0141456Z cvt.u16.u32 %rs1180, %r3798; 2026-02-21T09:18:15.0141516Z prmt.b32 %r3799, %r3620, 0, 0xaaa2U; 2026-02-21T09:18:15.0141595Z cvt.u16.u32 %rs1181, %r3799; 2026-02-21T09:18:15.0141666Z prmt.b32 %r3800, %r3620, 0, 0x7772U; 2026-02-21T09:18:15.0141725Z cvt.u16.u32 %rs1182, %r3800; 2026-02-21T09:18:15.0141785Z prmt.b32 %r3801, %r3620, 0, 0xbbb3U; 2026-02-21T09:18:15.0141851Z cvt.u16.u32 %rs1183, %r3801; 2026-02-21T09:18:15.0141914Z prmt.b32 %r3802, %r3620, 0, 0x7773U; 2026-02-21T09:18:15.0141972Z cvt.u16.u32 %rs1184, %r3802; 2026-02-21T09:18:15.0142032Z prmt.b32 %r3803, %r3621, 0, 0x8880U; 2026-02-21T09:18:15.0142097Z cvt.u16.u32 %rs1185, %r3803; 2026-02-21T09:18:15.0142159Z prmt.b32 %r3804, %r3621, 0, 0x7770U; 2026-02-21T09:18:15.0142216Z cvt.u16.u32 %rs1186, %r3804; 2026-02-21T09:18:15.0142281Z prmt.b32 %r3805, %r3621, 0, 0x9991U; 2026-02-21T09:18:15.0142338Z cvt.u16.u32 %rs1187, %r3805; 2026-02-21T09:18:15.0142397Z prmt.b32 %r3806, %r3621, 0, 0x7771U; 2026-02-21T09:18:15.0142455Z cvt.u16.u32 %rs1188, %r3806; 2026-02-21T09:18:15.0142521Z prmt.b32 %r3807, %r3621, 0, 0xaaa2U; 2026-02-21T09:18:15.0142578Z cvt.u16.u32 %rs1189, %r3807; 2026-02-21T09:18:15.0142638Z prmt.b32 %r3808, %r3621, 0, 0x7772U; 2026-02-21T09:18:15.0142705Z cvt.u16.u32 %rs1190, %r3808; 2026-02-21T09:18:15.0142764Z prmt.b32 %r3809, %r3621, 0, 0xbbb3U; 2026-02-21T09:18:15.0142821Z cvt.u16.u32 %rs1191, %r3809; 2026-02-21T09:18:15.0142890Z prmt.b32 %r3810, %r3621, 0, 0x7773U; 2026-02-21T09:18:15.0142949Z cvt.u16.u32 %rs1192, %r3810; 2026-02-21T09:18:15.0143010Z prmt.b32 %r3811, %r3622, 0, 0x8880U; 2026-02-21T09:18:15.0143069Z cvt.u16.u32 %rs1193, %r3811; 2026-02-21T09:18:15.0143136Z prmt.b32 %r3812, %r3622, 0, 0x7770U; 2026-02-21T09:18:15.0143195Z cvt.u16.u32 %rs1194, %r3812; 2026-02-21T09:18:15.0143254Z prmt.b32 %r3813, %r3622, 0, 0x9991U; 2026-02-21T09:18:15.0143319Z cvt.u16.u32 %rs1195, %r3813; 
2026-02-21T09:18:15.0143380Z prmt.b32 %r3814, %r3622, 0, 0x7771U; 2026-02-21T09:18:15.0143437Z cvt.u16.u32 %rs1196, %r3814; 2026-02-21T09:18:15.0143499Z prmt.b32 %r3815, %r3622, 0, 0xaaa2U; 2026-02-21T09:18:15.0143565Z cvt.u16.u32 %rs1197, %r3815; 2026-02-21T09:18:15.0143624Z prmt.b32 %r3816, %r3622, 0, 0x7772U; 2026-02-21T09:18:15.0143712Z cvt.u16.u32 %rs1198, %r3816; 2026-02-21T09:18:15.0143780Z prmt.b32 %r3817, %r3622, 0, 0xbbb3U; 2026-02-21T09:18:15.0143837Z cvt.u16.u32 %rs1199, %r3817; 2026-02-21T09:18:15.0143924Z prmt.b32 %r3818, %r3622, 0, 0x7773U; 2026-02-21T09:18:15.0143982Z cvt.u16.u32 %rs1200, %r3818; 2026-02-21T09:18:15.0144190Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0144279Z shl.b16 %rs1201, %rs1170, 12; 2026-02-21T09:18:15.0144341Z shr.s16 %rs1202, %rs1201, 12; 2026-02-21T09:18:15.0144410Z shl.b16 %rs1203, %rs1172, 12; 2026-02-21T09:18:15.0144468Z shr.s16 %rs1204, %rs1203, 12; 2026-02-21T09:18:15.0144527Z shl.b16 %rs1205, %rs1174, 12; 2026-02-21T09:18:15.0144601Z shr.s16 %rs1206, %rs1205, 12; 2026-02-21T09:18:15.0144659Z shl.b16 %rs1207, %rs1176, 12; 2026-02-21T09:18:15.0144717Z shr.s16 %rs1208, %rs1207, 12; 2026-02-21T09:18:15.0144774Z shl.b16 %rs1209, %rs1178, 12; 2026-02-21T09:18:15.0144841Z shr.s16 %rs1210, %rs1209, 12; 2026-02-21T09:18:15.0144897Z shl.b16 %rs1211, %rs1180, 12; 2026-02-21T09:18:15.0144953Z shr.s16 %rs1212, %rs1211, 12; 2026-02-21T09:18:15.0145017Z shl.b16 %rs1213, %rs1182, 12; 2026-02-21T09:18:15.0145076Z shr.s16 %rs1214, %rs1213, 12; 2026-02-21T09:18:15.0145134Z shl.b16 %rs1215, %rs1184, 12; 2026-02-21T09:18:15.0145195Z shr.s16 %rs1216, %rs1215, 12; 2026-02-21T09:18:15.0145260Z shl.b16 %rs1217, %rs1186, 12; 2026-02-21T09:18:15.0145318Z shr.s16 %rs1218, %rs1217, 12; 2026-02-21T09:18:15.0145375Z shl.b16 %rs1219, %rs1188, 12; 2026-02-21T09:18:15.0145439Z shr.s16 %rs1220, %rs1219, 12; 2026-02-21T09:18:15.0145496Z shl.b16 %rs1221, %rs1190, 12; 2026-02-21T09:18:15.0145574Z 
shr.s16 %rs1222, %rs1221, 12; 2026-02-21T09:18:15.0145642Z shl.b16 %rs1223, %rs1192, 12; 2026-02-21T09:18:15.0145702Z shr.s16 %rs1224, %rs1223, 12; 2026-02-21T09:18:15.0145762Z shl.b16 %rs1225, %rs1194, 12; 2026-02-21T09:18:15.0145822Z shr.s16 %rs1226, %rs1225, 12; 2026-02-21T09:18:15.0145891Z shl.b16 %rs1227, %rs1196, 12; 2026-02-21T09:18:15.0145949Z shr.s16 %rs1228, %rs1227, 12; 2026-02-21T09:18:15.0146008Z shl.b16 %rs1229, %rs1198, 12; 2026-02-21T09:18:15.0146075Z shr.s16 %rs1230, %rs1229, 12; 2026-02-21T09:18:15.0146137Z shl.b16 %rs1231, %rs1200, 12; 2026-02-21T09:18:15.0146195Z shr.s16 %rs1232, %rs1231, 12; 2026-02-21T09:18:15.0146363Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0146434Z shr.u16 %rs1233, %rs1169, 4; 2026-02-21T09:18:15.0146494Z shr.u16 %rs1234, %rs1171, 4; 2026-02-21T09:18:15.0146555Z shr.u16 %rs1235, %rs1173, 4; 2026-02-21T09:18:15.0146624Z shr.u16 %rs1236, %rs1175, 4; 2026-02-21T09:18:15.0146685Z shr.u16 %rs1237, %rs1177, 4; 2026-02-21T09:18:15.0146746Z shr.u16 %rs1238, %rs1179, 4; 2026-02-21T09:18:15.0146806Z shr.u16 %rs1239, %rs1181, 4; 2026-02-21T09:18:15.0146874Z shr.u16 %rs1240, %rs1183, 4; 2026-02-21T09:18:15.0146934Z shr.u16 %rs1241, %rs1185, 4; 2026-02-21T09:18:15.0146996Z shr.u16 %rs1242, %rs1187, 4; 2026-02-21T09:18:15.0147064Z shr.u16 %rs1243, %rs1189, 4; 2026-02-21T09:18:15.0147124Z shr.u16 %rs1244, %rs1191, 4; 2026-02-21T09:18:15.0147185Z shr.u16 %rs1245, %rs1193, 4; 2026-02-21T09:18:15.0147252Z shr.u16 %rs1246, %rs1195, 4; 2026-02-21T09:18:15.0147311Z shr.u16 %rs1247, %rs1197, 4; 2026-02-21T09:18:15.0147373Z shr.u16 %rs1248, %rs1199, 4; 2026-02-21T09:18:15.0147539Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0147603Z bar.sync 0; 2026-02-21T09:18:15.0147668Z st.shared.b8 [%r29], %rs1202; 2026-02-21T09:18:15.0147730Z st.shared.b8 [%r30], %rs1204; 2026-02-21T09:18:15.0147797Z st.shared.b8 [%r31], %rs1206; 
2026-02-21T09:18:15.0147858Z st.shared.b8 [%r32], %rs1208; 2026-02-21T09:18:15.0147924Z st.shared.b8 [%r33+512], %rs1210; 2026-02-21T09:18:15.0147988Z st.shared.b8 [%r34+512], %rs1212; 2026-02-21T09:18:15.0148056Z st.shared.b8 [%r35+512], %rs1214; 2026-02-21T09:18:15.0148141Z st.shared.b8 [%r36+512], %rs1216; 2026-02-21T09:18:15.0148207Z st.shared.b8 [%r37+1024], %rs1218; 2026-02-21T09:18:15.0148279Z st.shared.b8 [%r38+1024], %rs1220; 2026-02-21T09:18:15.0148363Z st.shared.b8 [%r39+1024], %rs1222; 2026-02-21T09:18:15.0148425Z st.shared.b8 [%r40+1024], %rs1224; 2026-02-21T09:18:15.0148514Z st.shared.b8 [%r41+1536], %rs1226; 2026-02-21T09:18:15.0148578Z st.shared.b8 [%r42+1536], %rs1228; 2026-02-21T09:18:15.0148660Z st.shared.b8 [%r43+1536], %rs1230; 2026-02-21T09:18:15.0148721Z st.shared.b8 [%r44+1536], %rs1232; 2026-02-21T09:18:15.0148787Z bar.sync 0; 2026-02-21T09:18:15.0148851Z ld.shared.b32 %r3819, [%r45]; 2026-02-21T09:18:15.0148913Z ld.shared.b32 %r3820, [%r46]; 2026-02-21T09:18:15.0148981Z ld.shared.b32 %r3821, [%r47]; 2026-02-21T09:18:15.0149043Z ld.shared.b32 %r3822, [%r48]; 2026-02-21T09:18:15.0149208Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0149266Z bar.sync 0; 2026-02-21T09:18:15.0149336Z st.shared.b8 [%r29], %rs1233; 2026-02-21T09:18:15.0149396Z st.shared.b8 [%r30], %rs1234; 2026-02-21T09:18:15.0149456Z st.shared.b8 [%r31], %rs1235; 2026-02-21T09:18:15.0149524Z st.shared.b8 [%r32], %rs1236; 2026-02-21T09:18:15.0149587Z st.shared.b8 [%r33+512], %rs1237; 2026-02-21T09:18:15.0149651Z st.shared.b8 [%r34+512], %rs1238; 2026-02-21T09:18:15.0149715Z st.shared.b8 [%r35+512], %rs1239; 2026-02-21T09:18:15.0149785Z st.shared.b8 [%r36+512], %rs1240; 2026-02-21T09:18:15.0149848Z st.shared.b8 [%r37+1024], %rs1241; 2026-02-21T09:18:15.0149910Z st.shared.b8 [%r38+1024], %rs1242; 2026-02-21T09:18:15.0149980Z st.shared.b8 [%r39+1024], %rs1243; 2026-02-21T09:18:15.0150042Z st.shared.b8 [%r40+1024], %rs1244; 
2026-02-21T09:18:15.0150102Z st.shared.b8 [%r41+1536], %rs1245; 2026-02-21T09:18:15.0150171Z st.shared.b8 [%r42+1536], %rs1246; 2026-02-21T09:18:15.0150233Z st.shared.b8 [%r43+1536], %rs1247; 2026-02-21T09:18:15.0150297Z st.shared.b8 [%r44+1536], %rs1248; 2026-02-21T09:18:15.0150352Z bar.sync 0; 2026-02-21T09:18:15.0150422Z ld.shared.b32 %r3823, [%r45]; 2026-02-21T09:18:15.0150484Z ld.shared.b32 %r3824, [%r46]; 2026-02-21T09:18:15.0150547Z ld.shared.b32 %r3825, [%r47]; 2026-02-21T09:18:15.0150616Z ld.shared.b32 %r3826, [%r48]; 2026-02-21T09:18:15.0150786Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0150853Z cvt.s8.s32 %rs1249, %r3820; 2026-02-21T09:18:15.0150920Z cvt.rn.f32.s16 %r3827, %rs1249; 2026-02-21T09:18:15.0150992Z cvt.s8.s32 %rs1250, %r3819; 2026-02-21T09:18:15.0151058Z cvt.rn.f32.s16 %r3828, %rs1250; 2026-02-21T09:18:15.0151120Z cvt.s8.s32 %rs1251, %r3824; 2026-02-21T09:18:15.0151193Z cvt.rn.f32.s16 %r3829, %rs1251; 2026-02-21T09:18:15.0151258Z cvt.s8.s32 %rs1252, %r3823; 2026-02-21T09:18:15.0151321Z cvt.rn.f32.s16 %r3830, %rs1252; 2026-02-21T09:18:15.0151393Z cvt.s8.s32 %rs1253, %r3822; 2026-02-21T09:18:15.0151457Z cvt.rn.f32.s16 %r3831, %rs1253; 2026-02-21T09:18:15.0151518Z cvt.s8.s32 %rs1254, %r3821; 2026-02-21T09:18:15.0151614Z cvt.rn.f32.s16 %r3832, %rs1254; 2026-02-21T09:18:15.0151686Z cvt.s8.s32 %rs1255, %r3826; 2026-02-21T09:18:15.0151748Z cvt.rn.f32.s16 %r3833, %rs1255; 2026-02-21T09:18:15.0151810Z cvt.s8.s32 %rs1256, %r3825; 2026-02-21T09:18:15.0151881Z cvt.rn.f32.s16 %r3834, %rs1256; 2026-02-21T09:18:15.0151948Z prmt.b32 %r3835, %r3820, 0, 0x9991U; 2026-02-21T09:18:15.0152009Z cvt.u16.u32 %rs1257, %r3835; 2026-02-21T09:18:15.0152070Z cvt.rn.f32.s16 %r3836, %rs1257; 2026-02-21T09:18:15.0152144Z prmt.b32 %r3837, %r3819, 0, 0x9991U; 2026-02-21T09:18:15.0152206Z cvt.u16.u32 %rs1258, %r3837; 2026-02-21T09:18:15.0152267Z cvt.rn.f32.s16 %r3838, %rs1258; 2026-02-21T09:18:15.0152340Z prmt.b32 %r3839, 
%r3824, 0, 0x9991U; 2026-02-21T09:18:15.0152402Z cvt.u16.u32 %rs1259, %r3839; 2026-02-21T09:18:15.0152463Z cvt.rn.f32.s16 %r3840, %rs1259; 2026-02-21T09:18:15.0152557Z prmt.b32 %r3841, %r3823, 0, 0x9991U; 2026-02-21T09:18:15.0152626Z cvt.u16.u32 %rs1260, %r3841; 2026-02-21T09:18:15.0152688Z cvt.rn.f32.s16 %r3842, %rs1260; 2026-02-21T09:18:15.0152753Z prmt.b32 %r3843, %r3822, 0, 0x9991U; 2026-02-21T09:18:15.0152848Z cvt.u16.u32 %rs1261, %r3843; 2026-02-21T09:18:15.0152910Z cvt.rn.f32.s16 %r3844, %rs1261; 2026-02-21T09:18:15.0153001Z prmt.b32 %r3845, %r3821, 0, 0x9991U; 2026-02-21T09:18:15.0153092Z cvt.u16.u32 %rs1262, %r3845; 2026-02-21T09:18:15.0153155Z cvt.rn.f32.s16 %r3846, %rs1262; 2026-02-21T09:18:15.0153219Z prmt.b32 %r3847, %r3826, 0, 0x9991U; 2026-02-21T09:18:15.0153279Z cvt.u16.u32 %rs1263, %r3847; 2026-02-21T09:18:15.0153349Z cvt.rn.f32.s16 %r3848, %rs1263; 2026-02-21T09:18:15.0153412Z prmt.b32 %r3849, %r3825, 0, 0x9991U; 2026-02-21T09:18:15.0153472Z cvt.u16.u32 %rs1264, %r3849; 2026-02-21T09:18:15.0153548Z cvt.rn.f32.s16 %r3850, %rs1264; 2026-02-21T09:18:15.0153609Z prmt.b32 %r3851, %r3820, 0, 0xaaa2U; 2026-02-21T09:18:15.0153669Z cvt.u16.u32 %rs1265, %r3851; 2026-02-21T09:18:15.0153728Z cvt.rn.f32.s16 %r3852, %rs1265; 2026-02-21T09:18:15.0153795Z prmt.b32 %r3853, %r3819, 0, 0xaaa2U; 2026-02-21T09:18:15.0153853Z cvt.u16.u32 %rs1266, %r3853; 2026-02-21T09:18:15.0153915Z cvt.rn.f32.s16 %r3854, %rs1266; 2026-02-21T09:18:15.0153983Z prmt.b32 %r3855, %r3824, 0, 0xaaa2U; 2026-02-21T09:18:15.0154044Z cvt.u16.u32 %rs1267, %r3855; 2026-02-21T09:18:15.0154104Z cvt.rn.f32.s16 %r3856, %rs1267; 2026-02-21T09:18:15.0154171Z prmt.b32 %r3857, %r3823, 0, 0xaaa2U; 2026-02-21T09:18:15.0154228Z cvt.u16.u32 %rs1268, %r3857; 2026-02-21T09:18:15.0154287Z cvt.rn.f32.s16 %r3858, %rs1268; 2026-02-21T09:18:15.0154348Z prmt.b32 %r3859, %r3822, 0, 0xaaa2U; 2026-02-21T09:18:15.0154414Z cvt.u16.u32 %rs1269, %r3859; 2026-02-21T09:18:15.0154474Z cvt.rn.f32.s16 %r3860, %rs1269; 
2026-02-21T09:18:15.0154534Z prmt.b32 %r3861, %r3821, 0, 0xaaa2U; 2026-02-21T09:18:15.0154598Z cvt.u16.u32 %rs1270, %r3861; 2026-02-21T09:18:15.0154656Z cvt.rn.f32.s16 %r3862, %rs1270; 2026-02-21T09:18:15.0154718Z prmt.b32 %r3863, %r3826, 0, 0xaaa2U; 2026-02-21T09:18:15.0154775Z cvt.u16.u32 %rs1271, %r3863; 2026-02-21T09:18:15.0154842Z cvt.rn.f32.s16 %r3864, %rs1271; 2026-02-21T09:18:15.0154904Z prmt.b32 %r3865, %r3825, 0, 0xaaa2U; 2026-02-21T09:18:15.0154963Z cvt.u16.u32 %rs1272, %r3865; 2026-02-21T09:18:15.0155027Z cvt.rn.f32.s16 %r3866, %rs1272; 2026-02-21T09:18:15.0155087Z prmt.b32 %r3867, %r3820, 0, 0xbbb3U; 2026-02-21T09:18:15.0155145Z cvt.u16.u32 %rs1273, %r3867; 2026-02-21T09:18:15.0155205Z cvt.rn.f32.s16 %r3868, %rs1273; 2026-02-21T09:18:15.0155272Z prmt.b32 %r3869, %r3819, 0, 0xbbb3U; 2026-02-21T09:18:15.0155330Z cvt.u16.u32 %rs1274, %r3869; 2026-02-21T09:18:15.0155389Z cvt.rn.f32.s16 %r3870, %rs1274; 2026-02-21T09:18:15.0155457Z prmt.b32 %r3871, %r3824, 0, 0xbbb3U; 2026-02-21T09:18:15.0155513Z cvt.u16.u32 %rs1275, %r3871; 2026-02-21T09:18:15.0155571Z cvt.rn.f32.s16 %r3872, %rs1275; 2026-02-21T09:18:15.0155639Z prmt.b32 %r3873, %r3823, 0, 0xbbb3U; 2026-02-21T09:18:15.0155698Z cvt.u16.u32 %rs1276, %r3873; 2026-02-21T09:18:15.0155757Z cvt.rn.f32.s16 %r3874, %rs1276; 2026-02-21T09:18:15.0155817Z prmt.b32 %r3875, %r3822, 0, 0xbbb3U; 2026-02-21T09:18:15.0155884Z cvt.u16.u32 %rs1277, %r3875; 2026-02-21T09:18:15.0155943Z cvt.rn.f32.s16 %r3876, %rs1277; 2026-02-21T09:18:15.0156004Z prmt.b32 %r3877, %r3821, 0, 0xbbb3U; 2026-02-21T09:18:15.0156068Z cvt.u16.u32 %rs1278, %r3877; 2026-02-21T09:18:15.0156128Z cvt.rn.f32.s16 %r3878, %rs1278; 2026-02-21T09:18:15.0156188Z prmt.b32 %r3879, %r3826, 0, 0xbbb3U; 2026-02-21T09:18:15.0156247Z cvt.u16.u32 %rs1279, %r3879; 2026-02-21T09:18:15.0156312Z cvt.rn.f32.s16 %r3880, %rs1279; 2026-02-21T09:18:15.0156372Z prmt.b32 %r3881, %r3825, 0, 0xbbb3U; 2026-02-21T09:18:15.0156430Z cvt.u16.u32 %rs1280, %r3881; 
2026-02-21T09:18:15.0156496Z cvt.rn.f32.s16 %r3882, %rs1280; 2026-02-21T09:18:15.0156550Z bar.sync 0; 2026-02-21T09:18:15.0156646Z st.shared.v4.b32 [%r49], {%r3828, %r3830, %r3827, %r3829}; 2026-02-21T09:18:15.0156767Z st.shared.v4.b32 [%r50], {%r3832, %r3834, %r3831, %r3833}; 2026-02-21T09:18:15.0156858Z st.shared.v4.b32 [%r51], {%r3838, %r3842, %r3836, %r3840}; 2026-02-21T09:18:15.0156966Z st.shared.v4.b32 [%r52], {%r3846, %r3850, %r3844, %r3848}; 2026-02-21T09:18:15.0157075Z st.shared.v4.b32 [%r53], {%r3854, %r3858, %r3852, %r3856}; 2026-02-21T09:18:15.0157172Z st.shared.v4.b32 [%r54], {%r3862, %r3866, %r3860, %r3864}; 2026-02-21T09:18:15.0157279Z st.shared.v4.b32 [%r55], {%r3870, %r3874, %r3868, %r3872}; 2026-02-21T09:18:15.0157369Z st.shared.v4.b32 [%r56], {%r3878, %r3882, %r3876, %r3880}; 2026-02-21T09:18:15.0157431Z $L__tmp243: 2026-02-21T09:18:15.0157644Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0157704Z // begin inline asm 2026-02-21T09:18:15.0158020Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3623, %r3624, %r3625, %r3626, %r3627, %r3628, %r3629, %r3630, %r3631, %r3632, %r3633, %r3634, %r3635, %r3636, %r3637, %r3638}, [%r625 + 0], 64; 2026-02-21T09:18:15.0158080Z // end inline asm 2026-02-21T09:18:15.0158140Z // begin inline asm 2026-02-21T09:18:15.0158451Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3640, %r3641, %r3642, %r3643, %r3644, %r3645, %r3646, %r3647, %r3648, %r3649, %r3650, %r3651, %r3652, %r3653, %r3654, %r3655}, [%r625 + 16], 64; 2026-02-21T09:18:15.0158509Z // end inline asm 2026-02-21T09:18:15.0158567Z // begin inline asm 2026-02-21T09:18:15.0158864Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3657, %r3658, %r3659, %r3660, %r3661, %r3662, %r3663, %r3664, %r3665, %r3666, %r3667, %r3668, %r3669, %r3670, %r3671, %r3672}, [%r625 + 32], 64; 2026-02-21T09:18:15.0158933Z // end inline asm 2026-02-21T09:18:15.0158989Z // begin inline asm 2026-02-21T09:18:15.0159280Z 
tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3674, %r3675, %r3676, %r3677, %r3678, %r3679, %r3680, %r3681, %r3682, %r3683, %r3684, %r3685, %r3686, %r3687, %r3688, %r3689}, [%r625 + 48], 64; 2026-02-21T09:18:15.0159347Z // end inline asm 2026-02-21T09:18:15.0159403Z // begin inline asm 2026-02-21T09:18:15.0159472Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0159535Z // end inline asm 2026-02-21T09:18:15.0159599Z mov.pred %p188, -1; 2026-02-21T09:18:15.0159655Z // begin inline asm 2026-02-21T09:18:15.0159965Z @%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 0], 64, {%r3623, %r3624, %r3625, %r3626, %r3627, %r3628, %r3629, %r3630, %r3631, %r3632, %r3633, %r3634, %r3635, %r3636, %r3637, %r3638}; 2026-02-21T09:18:15.0160028Z // end inline asm 2026-02-21T09:18:15.0160083Z // begin inline asm 2026-02-21T09:18:15.0160384Z @%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 16], 64, {%r3640, %r3641, %r3642, %r3643, %r3644, %r3645, %r3646, %r3647, %r3648, %r3649, %r3650, %r3651, %r3652, %r3653, %r3654, %r3655}; 2026-02-21T09:18:15.0160447Z // end inline asm 2026-02-21T09:18:15.0160502Z // begin inline asm 2026-02-21T09:18:15.0160802Z @%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 32], 64, {%r3657, %r3658, %r3659, %r3660, %r3661, %r3662, %r3663, %r3664, %r3665, %r3666, %r3667, %r3668, %r3669, %r3670, %r3671, %r3672}; 2026-02-21T09:18:15.0160865Z // end inline asm 2026-02-21T09:18:15.0160923Z // begin inline asm 2026-02-21T09:18:15.0161223Z @%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 48], 64, {%r3674, %r3675, %r3676, %r3677, %r3678, %r3679, %r3680, %r3681, %r3682, %r3683, %r3684, %r3685, %r3686, %r3687, %r3688, %r3689}; 2026-02-21T09:18:15.0161285Z // end inline asm 2026-02-21T09:18:15.0161341Z // begin inline asm 2026-02-21T09:18:15.0161409Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0161470Z // end inline asm 2026-02-21T09:18:15.0161524Z bar.sync 0; 2026-02-21T09:18:15.0161605Z // begin inline asm 2026-02-21T09:18:15.0161906Z 
@%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5846 + 0], 16, {%r3760, %r3761, %r3762, %r3763, %r3764, %r3765, %r3766, %r3767, %r3768, %r3769, %r3770, %r3771, %r3772, %r3773, %r3774, %r3775}; 2026-02-21T09:18:15.0161995Z // end inline asm 2026-02-21T09:18:15.0162052Z // begin inline asm 2026-02-21T09:18:15.0162120Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0162181Z // end inline asm 2026-02-21T09:18:15.0162274Z bar.sync 0; 2026-02-21T09:18:15.0162332Z // begin inline asm 2026-02-21T09:18:15.0162403Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0162487Z // end inline asm 2026-02-21T09:18:15.0162543Z // begin inline asm 2026-02-21T09:18:15.0162657Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:15.0162721Z // end inline asm 2026-02-21T09:18:15.0162775Z bar.sync 0; 2026-02-21T09:18:15.0162836Z @%p20 bra $L__BB0_27; 2026-02-21T09:18:15.0162941Z // %bb.26: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:18:15.0163005Z elect.sync %r3895|%p189, -1; 2026-02-21T09:18:15.0163065Z mov.b32 %r3885, 69208336; 2026-02-21T09:18:15.0163121Z // begin inline asm 2026-02-21T09:18:15.0163288Z @%p189 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 0 ], %rd1, %r3885, %p188; 2026-02-21T09:18:15.0163344Z // end inline asm 2026-02-21T09:18:15.0163399Z // begin inline asm 2026-02-21T09:18:15.0163560Z @%p189 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 8 ], %rd2, %r3885, %p188; 2026-02-21T09:18:15.0163615Z // end inline asm 2026-02-21T09:18:15.0163672Z // begin inline asm 2026-02-21T09:18:15.0163832Z @%p189 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 16 ], %rd3, %r3885, %p188; 2026-02-21T09:18:15.0163887Z // end inline asm 2026-02-21T09:18:15.0163944Z // begin inline asm 2026-02-21T09:18:15.0164094Z @%p189 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 24 ], %rd4, %r3885, %p188; 2026-02-21T09:18:15.0164154Z // end inline asm 2026-02-21T09:18:15.0164217Z cvt.u64.u32 %rd486, %r5985; 
2026-02-21T09:18:15.0164272Z // begin inline asm 2026-02-21T09:18:15.0164406Z @%p189 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd486]; 2026-02-21T09:18:15.0164461Z // end inline asm 2026-02-21T09:18:15.0164563Z $L__BB0_27: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:18:15.0164625Z // begin inline asm 2026-02-21T09:18:15.0164675Z 2026-02-21T09:18:15.0164727Z { 2026-02-21T09:18:15.0164789Z .reg .pred complete; 2026-02-21T09:18:15.0164852Z waitLoop: 2026-02-21T09:18:15.0164974Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r3609; 2026-02-21T09:18:15.0165041Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0165099Z } 2026-02-21T09:18:15.0165103Z 2026-02-21T09:18:15.0165156Z // end inline asm 2026-02-21T09:18:15.0165211Z bar.sync 0; 2026-02-21T09:18:15.0165266Z // begin inline asm 2026-02-21T09:18:15.0165359Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:15.0165413Z // end inline asm 2026-02-21T09:18:15.0165465Z $L__tmp244: 2026-02-21T09:18:15.0165637Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0165699Z add.s64 %rd488, %rd458, 128; 2026-02-21T09:18:15.0165861Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0165928Z add.s64 %rd491, %rd461, 128; 2026-02-21T09:18:15.0165986Z // begin inline asm 2026-02-21T09:18:15.0166043Z mov.u64 %rd487, 0x0; 2026-02-21T09:18:15.0166153Z createpolicy.fractional.L2::evict_first.b64 %rd487, 1.0; 2026-02-21T09:18:15.0166215Z // end inline asm 2026-02-21T09:18:15.0166271Z // begin inline asm 2026-02-21T09:18:15.0166328Z mov.u32 %r3901, 0x0; 2026-02-21T09:18:15.0166392Z mov.u32 %r3902, 0x0; 2026-02-21T09:18:15.0166448Z mov.u32 %r3903, 0x0; 2026-02-21T09:18:15.0166503Z mov.u32 %r3904, 0x0; 2026-02-21T09:18:15.0166687Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3901, %r3902, %r3903, %r3904 }, [ %rd488 + 0 ], %rd487; 2026-02-21T09:18:15.0166742Z // end inline asm 
2026-02-21T09:18:15.0166799Z // begin inline asm 2026-02-21T09:18:15.0166856Z mov.u64 %rd490, 0x0; 2026-02-21T09:18:15.0166972Z createpolicy.fractional.L2::evict_first.b64 %rd490, 1.0; 2026-02-21T09:18:15.0167053Z // end inline asm 2026-02-21T09:18:15.0167110Z // begin inline asm 2026-02-21T09:18:15.0167174Z mov.u32 %r3905, 0x0; 2026-02-21T09:18:15.0167251Z mov.u32 %r3906, 0x0; 2026-02-21T09:18:15.0167307Z mov.u32 %r3907, 0x0; 2026-02-21T09:18:15.0167363Z mov.u32 %r3908, 0x0; 2026-02-21T09:18:15.0167589Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3905, %r3906, %r3907, %r3908 }, [ %rd491 + 0 ], %rd490; 2026-02-21T09:18:15.0167651Z // end inline asm 2026-02-21T09:18:15.0167707Z $L__tmp245: 2026-02-21T09:18:15.0167931Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0168026Z st.shared.v4.b32 [%r26], {%r3901, %r3902, %r3903, %r3904}; 2026-02-21T09:18:15.0168126Z st.shared.v4.b32 [%r26+512], {%r3905, %r3906, %r3907, %r3908}; 2026-02-21T09:18:15.0168188Z bar.sync 0; 2026-02-21T09:18:15.0168278Z ld.shared.v4.b32 {%r4068, %r4069, %r4070, %r4071}, [%r27]; 2026-02-21T09:18:15.0168345Z mov.b32 {%rs1281, %rs1282}, %r4071; 2026-02-21T09:18:15.0168409Z mov.b32 {%rs1283, %rs1284}, %r4070; 2026-02-21T09:18:15.0168477Z mov.b32 {%rs1285, %rs1286}, %r4069; 2026-02-21T09:18:15.0168538Z mov.b32 {%rs1287, %rs1288}, %r4068; 2026-02-21T09:18:15.0168629Z ld.shared.v4.b32 {%r4072, %r4073, %r4074, %r4075}, [%r28]; 2026-02-21T09:18:15.0168696Z mov.b32 {%rs1289, %rs1290}, %r4075; 2026-02-21T09:18:15.0168757Z mov.b32 {%rs1291, %rs1292}, %r4074; 2026-02-21T09:18:15.0168816Z mov.b32 {%rs1293, %rs1294}, %r4073; 2026-02-21T09:18:15.0168882Z mov.b32 {%rs1295, %rs1296}, %r4072; 2026-02-21T09:18:15.0168935Z $L__tmp246: 2026-02-21T09:18:15.0169099Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0169162Z cvt.f32.bf16 %r4050, %rs1287; 2026-02-21T09:18:15.0169231Z 
cvt.f32.bf16 %r4051, %rs1288; 2026-02-21T09:18:15.0169290Z cvt.f32.bf16 %r4052, %rs1285; 2026-02-21T09:18:15.0169350Z cvt.f32.bf16 %r4053, %rs1286; 2026-02-21T09:18:15.0169415Z cvt.f32.bf16 %r4054, %rs1283; 2026-02-21T09:18:15.0169472Z cvt.f32.bf16 %r4055, %rs1284; 2026-02-21T09:18:15.0169530Z cvt.f32.bf16 %r4056, %rs1281; 2026-02-21T09:18:15.0169588Z cvt.f32.bf16 %r4057, %rs1282; 2026-02-21T09:18:15.0169653Z cvt.f32.bf16 %r4058, %rs1295; 2026-02-21T09:18:15.0169710Z cvt.f32.bf16 %r4059, %rs1296; 2026-02-21T09:18:15.0169768Z cvt.f32.bf16 %r4060, %rs1293; 2026-02-21T09:18:15.0169835Z cvt.f32.bf16 %r4061, %rs1294; 2026-02-21T09:18:15.0169893Z cvt.f32.bf16 %r4062, %rs1291; 2026-02-21T09:18:15.0169951Z cvt.f32.bf16 %r4063, %rs1292; 2026-02-21T09:18:15.0170019Z cvt.f32.bf16 %r4064, %rs1289; 2026-02-21T09:18:15.0170076Z cvt.f32.bf16 %r4065, %rs1290; 2026-02-21T09:18:15.0170237Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:15.0170299Z add.s32 %r4076, %r7845, 262144; 2026-02-21T09:18:15.0170368Z cvt.s64.s32 %rd496, %r4076; 2026-02-21T09:18:15.0170432Z add.s64 %rd494, %rd56, %rd496; 2026-02-21T09:18:15.0170595Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0170664Z // begin inline asm 2026-02-21T09:18:15.0170721Z mov.u64 %rd493, 0x0; 2026-02-21T09:18:15.0170831Z createpolicy.fractional.L2::evict_first.b64 %rd493, 1.0; 2026-02-21T09:18:15.0170893Z // end inline asm 2026-02-21T09:18:15.0170953Z // begin inline asm 2026-02-21T09:18:15.0171010Z mov.u32 %r3909, 0x0; 2026-02-21T09:18:15.0171066Z mov.u32 %r3910, 0x0; 2026-02-21T09:18:15.0171128Z mov.u32 %r3911, 0x0; 2026-02-21T09:18:15.0171181Z mov.u32 %r3912, 0x0; 2026-02-21T09:18:15.0171352Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r3909, %r3910, %r3911, %r3912 }, [ %rd494 + 0 ], %rd493; 2026-02-21T09:18:15.0171414Z // end inline asm 2026-02-21T09:18:15.0171479Z prmt.b32 %r4077, %r3909, 0, 0x8880U; 
2026-02-21T09:18:15.0171566Z cvt.u16.u32 %rs1297, %r4077; 2026-02-21T09:18:15.0171658Z prmt.b32 %r4078, %r3909, 0, 0x7770U; 2026-02-21T09:18:15.0171726Z cvt.u16.u32 %rs1298, %r4078; 2026-02-21T09:18:15.0171787Z prmt.b32 %r4079, %r3909, 0, 0x9991U; 2026-02-21T09:18:15.0171845Z cvt.u16.u32 %rs1299, %r4079; 2026-02-21T09:18:15.0171935Z prmt.b32 %r4080, %r3909, 0, 0x7771U; 2026-02-21T09:18:15.0171995Z cvt.u16.u32 %rs1300, %r4080; 2026-02-21T09:18:15.0172080Z prmt.b32 %r4081, %r3909, 0, 0xaaa2U; 2026-02-21T09:18:15.0172170Z cvt.u16.u32 %rs1301, %r4081; 2026-02-21T09:18:15.0172232Z prmt.b32 %r4082, %r3909, 0, 0x7772U; 2026-02-21T09:18:15.0172291Z cvt.u16.u32 %rs1302, %r4082; 2026-02-21T09:18:15.0172351Z prmt.b32 %r4083, %r3909, 0, 0xbbb3U; 2026-02-21T09:18:15.0172415Z cvt.u16.u32 %rs1303, %r4083; 2026-02-21T09:18:15.0172475Z prmt.b32 %r4084, %r3909, 0, 0x7773U; 2026-02-21T09:18:15.0172533Z cvt.u16.u32 %rs1304, %r4084; 2026-02-21T09:18:15.0172599Z prmt.b32 %r4085, %r3910, 0, 0x8880U; 2026-02-21T09:18:15.0172656Z cvt.u16.u32 %rs1305, %r4085; 2026-02-21T09:18:15.0172717Z prmt.b32 %r4086, %r3910, 0, 0x7770U; 2026-02-21T09:18:15.0172774Z cvt.u16.u32 %rs1306, %r4086; 2026-02-21T09:18:15.0172842Z prmt.b32 %r4087, %r3910, 0, 0x9991U; 2026-02-21T09:18:15.0172899Z cvt.u16.u32 %rs1307, %r4087; 2026-02-21T09:18:15.0172960Z prmt.b32 %r4088, %r3910, 0, 0x7771U; 2026-02-21T09:18:15.0173024Z cvt.u16.u32 %rs1308, %r4088; 2026-02-21T09:18:15.0173087Z prmt.b32 %r4089, %r3910, 0, 0xaaa2U; 2026-02-21T09:18:15.0173146Z cvt.u16.u32 %rs1309, %r4089; 2026-02-21T09:18:15.0173212Z prmt.b32 %r4090, %r3910, 0, 0x7772U; 2026-02-21T09:18:15.0173270Z cvt.u16.u32 %rs1310, %r4090; 2026-02-21T09:18:15.0173330Z prmt.b32 %r4091, %r3910, 0, 0xbbb3U; 2026-02-21T09:18:15.0173388Z cvt.u16.u32 %rs1311, %r4091; 2026-02-21T09:18:15.0173456Z prmt.b32 %r4092, %r3910, 0, 0x7773U; 2026-02-21T09:18:15.0173515Z cvt.u16.u32 %rs1312, %r4092; 2026-02-21T09:18:15.0173575Z prmt.b32 %r4093, %r3911, 0, 0x8880U; 
2026-02-21T09:18:15.0173640Z cvt.u16.u32 %rs1313, %r4093; 2026-02-21T09:18:15.0173701Z prmt.b32 %r4094, %r3911, 0, 0x7770U; 2026-02-21T09:18:15.0173759Z cvt.u16.u32 %rs1314, %r4094; 2026-02-21T09:18:15.0173820Z prmt.b32 %r4095, %r3911, 0, 0x9991U; 2026-02-21T09:18:15.0173886Z cvt.u16.u32 %rs1315, %r4095; 2026-02-21T09:18:15.0173950Z prmt.b32 %r4096, %r3911, 0, 0x7771U; 2026-02-21T09:18:15.0174007Z cvt.u16.u32 %rs1316, %r4096; 2026-02-21T09:18:15.0174074Z prmt.b32 %r4097, %r3911, 0, 0xaaa2U; 2026-02-21T09:18:15.0174133Z cvt.u16.u32 %rs1317, %r4097; 2026-02-21T09:18:15.0174194Z prmt.b32 %r4098, %r3911, 0, 0x7772U; 2026-02-21T09:18:15.0174252Z cvt.u16.u32 %rs1318, %r4098; 2026-02-21T09:18:15.0174321Z prmt.b32 %r4099, %r3911, 0, 0xbbb3U; 2026-02-21T09:18:15.0174379Z cvt.u16.u32 %rs1319, %r4099; 2026-02-21T09:18:15.0174443Z prmt.b32 %r4100, %r3911, 0, 0x7773U; 2026-02-21T09:18:15.0174511Z cvt.u16.u32 %rs1320, %r4100; 2026-02-21T09:18:15.0174577Z prmt.b32 %r4101, %r3912, 0, 0x8880U; 2026-02-21T09:18:15.0174635Z cvt.u16.u32 %rs1321, %r4101; 2026-02-21T09:18:15.0174704Z prmt.b32 %r4102, %r3912, 0, 0x7770U; 2026-02-21T09:18:15.0174762Z cvt.u16.u32 %rs1322, %r4102; 2026-02-21T09:18:15.0174821Z prmt.b32 %r4103, %r3912, 0, 0x9991U; 2026-02-21T09:18:15.0174879Z cvt.u16.u32 %rs1323, %r4103; 2026-02-21T09:18:15.0174948Z prmt.b32 %r4104, %r3912, 0, 0x7771U; 2026-02-21T09:18:15.0175004Z cvt.u16.u32 %rs1324, %r4104; 2026-02-21T09:18:15.0175067Z prmt.b32 %r4105, %r3912, 0, 0xaaa2U; 2026-02-21T09:18:15.0175132Z cvt.u16.u32 %rs1325, %r4105; 2026-02-21T09:18:15.0175191Z prmt.b32 %r4106, %r3912, 0, 0x7772U; 2026-02-21T09:18:15.0175250Z cvt.u16.u32 %rs1326, %r4106; 2026-02-21T09:18:15.0175310Z prmt.b32 %r4107, %r3912, 0, 0xbbb3U; 2026-02-21T09:18:15.0175375Z cvt.u16.u32 %rs1327, %r4107; 2026-02-21T09:18:15.0175435Z prmt.b32 %r4108, %r3912, 0, 0x7773U; 2026-02-21T09:18:15.0175493Z cvt.u16.u32 %rs1328, %r4108; 2026-02-21T09:18:15.0175664Z .loc 1 59 25 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0175753Z shl.b16 %rs1329, %rs1298, 12; 2026-02-21T09:18:15.0175814Z shr.s16 %rs1330, %rs1329, 12; 2026-02-21T09:18:15.0175879Z shl.b16 %rs1331, %rs1300, 12; 2026-02-21T09:18:15.0175936Z shr.s16 %rs1332, %rs1331, 12; 2026-02-21T09:18:15.0176014Z shl.b16 %rs1333, %rs1302, 12; 2026-02-21T09:18:15.0176071Z shr.s16 %rs1334, %rs1333, 12; 2026-02-21T09:18:15.0176155Z shl.b16 %rs1335, %rs1304, 12; 2026-02-21T09:18:15.0176214Z shr.s16 %rs1336, %rs1335, 12; 2026-02-21T09:18:15.0176306Z shl.b16 %rs1337, %rs1306, 12; 2026-02-21T09:18:15.0176372Z shr.s16 %rs1338, %rs1337, 12; 2026-02-21T09:18:15.0176429Z shl.b16 %rs1339, %rs1308, 12; 2026-02-21T09:18:15.0176487Z shr.s16 %rs1340, %rs1339, 12; 2026-02-21T09:18:15.0176544Z shl.b16 %rs1341, %rs1310, 12; 2026-02-21T09:18:15.0176608Z shr.s16 %rs1342, %rs1341, 12; 2026-02-21T09:18:15.0176665Z shl.b16 %rs1343, %rs1312, 12; 2026-02-21T09:18:15.0176723Z shr.s16 %rs1344, %rs1343, 12; 2026-02-21T09:18:15.0176787Z shl.b16 %rs1345, %rs1314, 12; 2026-02-21T09:18:15.0176845Z shr.s16 %rs1346, %rs1345, 12; 2026-02-21T09:18:15.0176902Z shl.b16 %rs1347, %rs1316, 12; 2026-02-21T09:18:15.0176960Z shr.s16 %rs1348, %rs1347, 12; 2026-02-21T09:18:15.0177025Z shl.b16 %rs1349, %rs1318, 12; 2026-02-21T09:18:15.0177083Z shr.s16 %rs1350, %rs1349, 12; 2026-02-21T09:18:15.0177141Z shl.b16 %rs1351, %rs1320, 12; 2026-02-21T09:18:15.0177206Z shr.s16 %rs1352, %rs1351, 12; 2026-02-21T09:18:15.0177264Z shl.b16 %rs1353, %rs1322, 12; 2026-02-21T09:18:15.0177322Z shr.s16 %rs1354, %rs1353, 12; 2026-02-21T09:18:15.0177385Z shl.b16 %rs1355, %rs1324, 12; 2026-02-21T09:18:15.0177441Z shr.s16 %rs1356, %rs1355, 12; 2026-02-21T09:18:15.0177498Z shl.b16 %rs1357, %rs1326, 12; 2026-02-21T09:18:15.0177555Z shr.s16 %rs1358, %rs1357, 12; 2026-02-21T09:18:15.0177620Z shl.b16 %rs1359, %rs1328, 12; 2026-02-21T09:18:15.0177679Z shr.s16 %rs1360, %rs1359, 12; 2026-02-21T09:18:15.0177838Z .loc 1 62 28 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0177906Z shr.u16 %rs1361, %rs1297, 4; 2026-02-21T09:18:15.0177965Z shr.u16 %rs1362, %rs1299, 4; 2026-02-21T09:18:15.0178022Z shr.u16 %rs1363, %rs1301, 4; 2026-02-21T09:18:15.0178081Z shr.u16 %rs1364, %rs1303, 4; 2026-02-21T09:18:15.0178144Z shr.u16 %rs1365, %rs1305, 4; 2026-02-21T09:18:15.0178202Z shr.u16 %rs1366, %rs1307, 4; 2026-02-21T09:18:15.0178260Z shr.u16 %rs1367, %rs1309, 4; 2026-02-21T09:18:15.0178325Z shr.u16 %rs1368, %rs1311, 4; 2026-02-21T09:18:15.0178382Z shr.u16 %rs1369, %rs1313, 4; 2026-02-21T09:18:15.0178438Z shr.u16 %rs1370, %rs1315, 4; 2026-02-21T09:18:15.0178493Z shr.u16 %rs1371, %rs1317, 4; 2026-02-21T09:18:15.0178557Z shr.u16 %rs1372, %rs1319, 4; 2026-02-21T09:18:15.0178613Z shr.u16 %rs1373, %rs1321, 4; 2026-02-21T09:18:15.0178670Z shr.u16 %rs1374, %rs1323, 4; 2026-02-21T09:18:15.0178733Z shr.u16 %rs1375, %rs1325, 4; 2026-02-21T09:18:15.0178789Z shr.u16 %rs1376, %rs1327, 4; 2026-02-21T09:18:15.0178944Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0179005Z bar.sync 0; 2026-02-21T09:18:15.0179064Z st.shared.b8 [%r29], %rs1330; 2026-02-21T09:18:15.0179126Z st.shared.b8 [%r30], %rs1332; 2026-02-21T09:18:15.0179185Z st.shared.b8 [%r31], %rs1334; 2026-02-21T09:18:15.0179249Z st.shared.b8 [%r32], %rs1336; 2026-02-21T09:18:15.0179314Z st.shared.b8 [%r33+512], %rs1338; 2026-02-21T09:18:15.0179378Z st.shared.b8 [%r34+512], %rs1340; 2026-02-21T09:18:15.0179446Z st.shared.b8 [%r35+512], %rs1342; 2026-02-21T09:18:15.0179505Z st.shared.b8 [%r36+512], %rs1344; 2026-02-21T09:18:15.0179567Z st.shared.b8 [%r37+1024], %rs1346; 2026-02-21T09:18:15.0179631Z st.shared.b8 [%r38+1024], %rs1348; 2026-02-21T09:18:15.0179699Z st.shared.b8 [%r39+1024], %rs1350; 2026-02-21T09:18:15.0179759Z st.shared.b8 [%r40+1024], %rs1352; 2026-02-21T09:18:15.0179818Z st.shared.b8 [%r41+1536], %rs1354; 2026-02-21T09:18:15.0179885Z st.shared.b8 
[%r42+1536], %rs1356; 2026-02-21T09:18:15.0179990Z st.shared.b8 [%r43+1536], %rs1358; 2026-02-21T09:18:15.0180051Z st.shared.b8 [%r44+1536], %rs1360; 2026-02-21T09:18:15.0180112Z bar.sync 0; 2026-02-21T09:18:15.0180199Z ld.shared.b32 %r4109, [%r45]; 2026-02-21T09:18:15.0180260Z ld.shared.b32 %r4110, [%r46]; 2026-02-21T09:18:15.0180319Z ld.shared.b32 %r4111, [%r47]; 2026-02-21T09:18:15.0180409Z ld.shared.b32 %r4112, [%r48]; 2026-02-21T09:18:15.0180590Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0180646Z bar.sync 0; 2026-02-21T09:18:15.0180712Z st.shared.b8 [%r29], %rs1361; 2026-02-21T09:18:15.0180771Z st.shared.b8 [%r30], %rs1362; 2026-02-21T09:18:15.0180830Z st.shared.b8 [%r31], %rs1363; 2026-02-21T09:18:15.0180889Z st.shared.b8 [%r32], %rs1364; 2026-02-21T09:18:15.0180959Z st.shared.b8 [%r33+512], %rs1365; 2026-02-21T09:18:15.0181021Z st.shared.b8 [%r34+512], %rs1366; 2026-02-21T09:18:15.0181082Z st.shared.b8 [%r35+512], %rs1367; 2026-02-21T09:18:15.0181155Z st.shared.b8 [%r36+512], %rs1368; 2026-02-21T09:18:15.0181216Z st.shared.b8 [%r37+1024], %rs1369; 2026-02-21T09:18:15.0181280Z st.shared.b8 [%r38+1024], %rs1370; 2026-02-21T09:18:15.0181345Z st.shared.b8 [%r39+1024], %rs1371; 2026-02-21T09:18:15.0181417Z st.shared.b8 [%r40+1024], %rs1372; 2026-02-21T09:18:15.0181481Z st.shared.b8 [%r41+1536], %rs1373; 2026-02-21T09:18:15.0181568Z st.shared.b8 [%r42+1536], %rs1374; 2026-02-21T09:18:15.0181637Z st.shared.b8 [%r43+1536], %rs1375; 2026-02-21T09:18:15.0181696Z st.shared.b8 [%r44+1536], %rs1376; 2026-02-21T09:18:15.0181751Z bar.sync 0; 2026-02-21T09:18:15.0181817Z ld.shared.b32 %r4113, [%r45]; 2026-02-21T09:18:15.0181877Z ld.shared.b32 %r4114, [%r46]; 2026-02-21T09:18:15.0181937Z ld.shared.b32 %r4115, [%r47]; 2026-02-21T09:18:15.0181996Z ld.shared.b32 %r4116, [%r48]; 2026-02-21T09:18:15.0182163Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0182226Z cvt.s8.s32 
%rs1377, %r4110; 2026-02-21T09:18:15.0182287Z cvt.rn.f32.s16 %r4117, %rs1377; 2026-02-21T09:18:15.0182355Z cvt.s8.s32 %rs1378, %r4109; 2026-02-21T09:18:15.0182418Z cvt.rn.f32.s16 %r4118, %rs1378; 2026-02-21T09:18:15.0182477Z cvt.s8.s32 %rs1379, %r4114; 2026-02-21T09:18:15.0182538Z cvt.rn.f32.s16 %r4119, %rs1379; 2026-02-21T09:18:15.0182606Z cvt.s8.s32 %rs1380, %r4113; 2026-02-21T09:18:15.0182668Z cvt.rn.f32.s16 %r4120, %rs1380; 2026-02-21T09:18:15.0182727Z cvt.s8.s32 %rs1381, %r4112; 2026-02-21T09:18:15.0182794Z cvt.rn.f32.s16 %r4121, %rs1381; 2026-02-21T09:18:15.0182853Z cvt.s8.s32 %rs1382, %r4111; 2026-02-21T09:18:15.0182912Z cvt.rn.f32.s16 %r4122, %rs1382; 2026-02-21T09:18:15.0182979Z cvt.s8.s32 %rs1383, %r4116; 2026-02-21T09:18:15.0183040Z cvt.rn.f32.s16 %r4123, %rs1383; 2026-02-21T09:18:15.0183099Z cvt.s8.s32 %rs1384, %r4115; 2026-02-21T09:18:15.0183158Z cvt.rn.f32.s16 %r4124, %rs1384; 2026-02-21T09:18:15.0183230Z prmt.b32 %r4125, %r4110, 0, 0x9991U; 2026-02-21T09:18:15.0183290Z cvt.u16.u32 %rs1385, %r4125; 2026-02-21T09:18:15.0183348Z cvt.rn.f32.s16 %r4126, %rs1385; 2026-02-21T09:18:15.0183418Z prmt.b32 %r4127, %r4109, 0, 0x9991U; 2026-02-21T09:18:15.0183478Z cvt.u16.u32 %rs1386, %r4127; 2026-02-21T09:18:15.0183537Z cvt.rn.f32.s16 %r4128, %rs1386; 2026-02-21T09:18:15.0183602Z prmt.b32 %r4129, %r4114, 0, 0x9991U; 2026-02-21T09:18:15.0183668Z cvt.u16.u32 %rs1387, %r4129; 2026-02-21T09:18:15.0183727Z cvt.rn.f32.s16 %r4130, %rs1387; 2026-02-21T09:18:15.0183790Z prmt.b32 %r4131, %r4113, 0, 0x9991U; 2026-02-21T09:18:15.0183857Z cvt.u16.u32 %rs1388, %r4131; 2026-02-21T09:18:15.0183915Z cvt.rn.f32.s16 %r4132, %rs1388; 2026-02-21T09:18:15.0183976Z prmt.b32 %r4133, %r4112, 0, 0x9991U; 2026-02-21T09:18:15.0184035Z cvt.u16.u32 %rs1389, %r4133; 2026-02-21T09:18:15.0184102Z cvt.rn.f32.s16 %r4134, %rs1389; 2026-02-21T09:18:15.0184161Z prmt.b32 %r4135, %r4111, 0, 0x9991U; 2026-02-21T09:18:15.0184219Z cvt.u16.u32 %rs1390, %r4135; 2026-02-21T09:18:15.0184317Z cvt.rn.f32.s16 
%r4136, %rs1390; 2026-02-21T09:18:15.0184379Z prmt.b32 %r4137, %r4116, 0, 0x9991U; 2026-02-21T09:18:15.0184438Z cvt.u16.u32 %rs1391, %r4137; 2026-02-21T09:18:15.0184526Z cvt.rn.f32.s16 %r4138, %rs1391; 2026-02-21T09:18:15.0184586Z prmt.b32 %r4139, %r4115, 0, 0x9991U; 2026-02-21T09:18:15.0184664Z cvt.u16.u32 %rs1392, %r4139; 2026-02-21T09:18:15.0184724Z cvt.rn.f32.s16 %r4140, %rs1392; 2026-02-21T09:18:15.0184816Z prmt.b32 %r4141, %r4110, 0, 0xaaa2U; 2026-02-21T09:18:15.0184874Z cvt.u16.u32 %rs1393, %r4141; 2026-02-21T09:18:15.0184932Z cvt.rn.f32.s16 %r4142, %rs1393; 2026-02-21T09:18:15.0184999Z prmt.b32 %r4143, %r4109, 0, 0xaaa2U; 2026-02-21T09:18:15.0185057Z cvt.u16.u32 %rs1394, %r4143; 2026-02-21T09:18:15.0185116Z cvt.rn.f32.s16 %r4144, %rs1394; 2026-02-21T09:18:15.0185176Z prmt.b32 %r4145, %r4114, 0, 0xaaa2U; 2026-02-21T09:18:15.0185240Z cvt.u16.u32 %rs1395, %r4145; 2026-02-21T09:18:15.0185299Z cvt.rn.f32.s16 %r4146, %rs1395; 2026-02-21T09:18:15.0185360Z prmt.b32 %r4147, %r4113, 0, 0xaaa2U; 2026-02-21T09:18:15.0185425Z cvt.u16.u32 %rs1396, %r4147; 2026-02-21T09:18:15.0185483Z cvt.rn.f32.s16 %r4148, %rs1396; 2026-02-21T09:18:15.0185546Z prmt.b32 %r4149, %r4112, 0, 0xaaa2U; 2026-02-21T09:18:15.0185611Z cvt.u16.u32 %rs1397, %r4149; 2026-02-21T09:18:15.0185671Z cvt.rn.f32.s16 %r4150, %rs1397; 2026-02-21T09:18:15.0185732Z prmt.b32 %r4151, %r4111, 0, 0xaaa2U; 2026-02-21T09:18:15.0185790Z cvt.u16.u32 %rs1398, %r4151; 2026-02-21T09:18:15.0185858Z cvt.rn.f32.s16 %r4152, %rs1398; 2026-02-21T09:18:15.0185918Z prmt.b32 %r4153, %r4116, 0, 0xaaa2U; 2026-02-21T09:18:15.0185975Z cvt.u16.u32 %rs1399, %r4153; 2026-02-21T09:18:15.0186040Z cvt.rn.f32.s16 %r4154, %rs1399; 2026-02-21T09:18:15.0186100Z prmt.b32 %r4155, %r4115, 0, 0xaaa2U; 2026-02-21T09:18:15.0186156Z cvt.u16.u32 %rs1400, %r4155; 2026-02-21T09:18:15.0186215Z cvt.rn.f32.s16 %r4156, %rs1400; 2026-02-21T09:18:15.0186283Z prmt.b32 %r4157, %r4110, 0, 0xbbb3U; 2026-02-21T09:18:15.0186341Z cvt.u16.u32 %rs1401, %r4157; 
2026-02-21T09:18:15.0186399Z cvt.rn.f32.s16 %r4158, %rs1401; 2026-02-21T09:18:15.0186467Z prmt.b32 %r4159, %r4109, 0, 0xbbb3U; 2026-02-21T09:18:15.0186526Z cvt.u16.u32 %rs1402, %r4159; 2026-02-21T09:18:15.0186584Z cvt.rn.f32.s16 %r4160, %rs1402; 2026-02-21T09:18:15.0186645Z prmt.b32 %r4161, %r4114, 0, 0xbbb3U; 2026-02-21T09:18:15.0186711Z cvt.u16.u32 %rs1403, %r4161; 2026-02-21T09:18:15.0186771Z cvt.rn.f32.s16 %r4162, %rs1403; 2026-02-21T09:18:15.0186832Z prmt.b32 %r4163, %r4113, 0, 0xbbb3U; 2026-02-21T09:18:15.0186899Z cvt.u16.u32 %rs1404, %r4163; 2026-02-21T09:18:15.0186959Z cvt.rn.f32.s16 %r4164, %rs1404; 2026-02-21T09:18:15.0187019Z prmt.b32 %r4165, %r4112, 0, 0xbbb3U; 2026-02-21T09:18:15.0187084Z cvt.u16.u32 %rs1405, %r4165; 2026-02-21T09:18:15.0187144Z cvt.rn.f32.s16 %r4166, %rs1405; 2026-02-21T09:18:15.0187205Z prmt.b32 %r4167, %r4111, 0, 0xbbb3U; 2026-02-21T09:18:15.0187267Z cvt.u16.u32 %rs1406, %r4167; 2026-02-21T09:18:15.0187338Z cvt.rn.f32.s16 %r4168, %rs1406; 2026-02-21T09:18:15.0187400Z prmt.b32 %r4169, %r4116, 0, 0xbbb3U; 2026-02-21T09:18:15.0187459Z cvt.u16.u32 %rs1407, %r4169; 2026-02-21T09:18:15.0187533Z cvt.rn.f32.s16 %r4170, %rs1407; 2026-02-21T09:18:15.0187593Z prmt.b32 %r4171, %r4115, 0, 0xbbb3U; 2026-02-21T09:18:15.0187651Z cvt.u16.u32 %rs1408, %r4171; 2026-02-21T09:18:15.0187711Z cvt.rn.f32.s16 %r4172, %rs1408; 2026-02-21T09:18:15.0187775Z bar.sync 0; 2026-02-21T09:18:15.0187870Z st.shared.v4.b32 [%r49], {%r4118, %r4120, %r4117, %r4119}; 2026-02-21T09:18:15.0187964Z st.shared.v4.b32 [%r50], {%r4122, %r4124, %r4121, %r4123}; 2026-02-21T09:18:15.0188061Z st.shared.v4.b32 [%r51], {%r4128, %r4132, %r4126, %r4130}; 2026-02-21T09:18:15.0188148Z st.shared.v4.b32 [%r52], {%r4136, %r4140, %r4134, %r4138}; 2026-02-21T09:18:15.0188234Z st.shared.v4.b32 [%r53], {%r4144, %r4148, %r4142, %r4146}; 2026-02-21T09:18:15.0188329Z st.shared.v4.b32 [%r54], {%r4152, %r4156, %r4150, %r4154}; 2026-02-21T09:18:15.0188439Z st.shared.v4.b32 [%r55], {%r4160, %r4164, 
%r4158, %r4162}; 2026-02-21T09:18:15.0188528Z st.shared.v4.b32 [%r56], {%r4168, %r4172, %r4166, %r4170}; 2026-02-21T09:18:15.0188607Z $L__tmp247: 2026-02-21T09:18:15.0188826Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0188904Z // begin inline asm 2026-02-21T09:18:15.0189240Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3913, %r3914, %r3915, %r3916, %r3917, %r3918, %r3919, %r3920, %r3921, %r3922, %r3923, %r3924, %r3925, %r3926, %r3927, %r3928}, [%r915 + 0], 64; 2026-02-21T09:18:15.0189320Z // end inline asm 2026-02-21T09:18:15.0189381Z // begin inline asm 2026-02-21T09:18:15.0189700Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3930, %r3931, %r3932, %r3933, %r3934, %r3935, %r3936, %r3937, %r3938, %r3939, %r3940, %r3941, %r3942, %r3943, %r3944, %r3945}, [%r915 + 16], 64; 2026-02-21T09:18:15.0189767Z // end inline asm 2026-02-21T09:18:15.0189829Z // begin inline asm 2026-02-21T09:18:15.0190142Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3947, %r3948, %r3949, %r3950, %r3951, %r3952, %r3953, %r3954, %r3955, %r3956, %r3957, %r3958, %r3959, %r3960, %r3961, %r3962}, [%r915 + 32], 64; 2026-02-21T09:18:15.0190210Z // end inline asm 2026-02-21T09:18:15.0190270Z // begin inline asm 2026-02-21T09:18:15.0190586Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r3964, %r3965, %r3966, %r3967, %r3968, %r3969, %r3970, %r3971, %r3972, %r3973, %r3974, %r3975, %r3976, %r3977, %r3978, %r3979}, [%r915 + 48], 64; 2026-02-21T09:18:15.0190650Z // end inline asm 2026-02-21T09:18:15.0190709Z // begin inline asm 2026-02-21T09:18:15.0190784Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0190841Z // end inline asm 2026-02-21T09:18:15.0190907Z // begin inline asm 2026-02-21T09:18:15.0191236Z @%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 0], 64, {%r3913, %r3914, %r3915, %r3916, %r3917, %r3918, %r3919, %r3920, %r3921, %r3922, %r3923, %r3924, %r3925, %r3926, %r3927, %r3928}; 2026-02-21T09:18:15.0191295Z // end 
inline asm 2026-02-21T09:18:15.0191362Z // begin inline asm 2026-02-21T09:18:15.0191719Z @%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 16], 64, {%r3930, %r3931, %r3932, %r3933, %r3934, %r3935, %r3936, %r3937, %r3938, %r3939, %r3940, %r3941, %r3942, %r3943, %r3944, %r3945}; 2026-02-21T09:18:15.0191780Z // end inline asm 2026-02-21T09:18:15.0191846Z // begin inline asm 2026-02-21T09:18:15.0192166Z @%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 32], 64, {%r3947, %r3948, %r3949, %r3950, %r3951, %r3952, %r3953, %r3954, %r3955, %r3956, %r3957, %r3958, %r3959, %r3960, %r3961, %r3962}; 2026-02-21T09:18:15.0192223Z // end inline asm 2026-02-21T09:18:15.0192290Z // begin inline asm 2026-02-21T09:18:15.0192608Z @%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 48], 64, {%r3964, %r3965, %r3966, %r3967, %r3968, %r3969, %r3970, %r3971, %r3972, %r3973, %r3974, %r3975, %r3976, %r3977, %r3978, %r3979}; 2026-02-21T09:18:15.0192667Z // end inline asm 2026-02-21T09:18:15.0192725Z // begin inline asm 2026-02-21T09:18:15.0192803Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0192859Z // end inline asm 2026-02-21T09:18:15.0192917Z bar.sync 0; 2026-02-21T09:18:15.0192983Z // begin inline asm 2026-02-21T09:18:15.0193300Z @%p188 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5556 + 0], 16, {%r4050, %r4051, %r4052, %r4053, %r4054, %r4055, %r4056, %r4057, %r4058, %r4059, %r4060, %r4061, %r4062, %r4063, %r4064, %r4065}; 2026-02-21T09:18:15.0193358Z // end inline asm 2026-02-21T09:18:15.0193422Z // begin inline asm 2026-02-21T09:18:15.0193490Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0193546Z // end inline asm 2026-02-21T09:18:15.0193600Z bar.sync 0; 2026-02-21T09:18:15.0193666Z // begin inline asm 2026-02-21T09:18:15.0193740Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0193795Z // end inline asm 2026-02-21T09:18:15.0193861Z // begin inline asm 2026-02-21T09:18:15.0193956Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 
2026-02-21T09:18:15.0194041Z // end inline asm 2026-02-21T09:18:15.0194097Z bar.sync 0; 2026-02-21T09:18:15.0194168Z @%p20 bra $L__BB0_29; 2026-02-21T09:18:15.0194271Z // %bb.28: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:18:15.0194364Z elect.sync %r4185|%p206, -1; 2026-02-21T09:18:15.0194460Z mov.b32 %r4175, 69208336; 2026-02-21T09:18:15.0194525Z mov.pred %p205, -1; 2026-02-21T09:18:15.0194619Z // begin inline asm 2026-02-21T09:18:15.0194795Z @%p206 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 0 ], %rd1, %r4175, %p205; 2026-02-21T09:18:15.0194855Z // end inline asm 2026-02-21T09:18:15.0194915Z // begin inline asm 2026-02-21T09:18:15.0195074Z @%p206 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 8 ], %rd2, %r4175, %p205; 2026-02-21T09:18:15.0195138Z // end inline asm 2026-02-21T09:18:15.0195197Z // begin inline asm 2026-02-21T09:18:15.0195353Z @%p206 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 16 ], %rd3, %r4175, %p205; 2026-02-21T09:18:15.0195420Z // end inline asm 2026-02-21T09:18:15.0195479Z // begin inline asm 2026-02-21T09:18:15.0195636Z @%p206 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 24 ], %rd4, %r4175, %p205; 2026-02-21T09:18:15.0195702Z // end inline asm 2026-02-21T09:18:15.0195767Z cvt.u64.u32 %rd501, %r5985; 2026-02-21T09:18:15.0195827Z // begin inline asm 2026-02-21T09:18:15.0195968Z @%p206 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd501]; 2026-02-21T09:18:15.0196027Z // end inline asm 2026-02-21T09:18:15.0196134Z $L__BB0_29: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:18:15.0196226Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:15.0196294Z mov.b32 %r4189, 0; 2026-02-21T09:18:15.0196524Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0196585Z // begin inline asm 2026-02-21T09:18:15.0196657Z 2026-02-21T09:18:15.0196708Z { 2026-02-21T09:18:15.0196771Z .reg .pred complete; 
2026-02-21T09:18:15.0196827Z waitLoop: 2026-02-21T09:18:15.0196957Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r4189; 2026-02-21T09:18:15.0197025Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0197076Z } 2026-02-21T09:18:15.0197080Z 2026-02-21T09:18:15.0197146Z // end inline asm 2026-02-21T09:18:15.0197201Z bar.sync 0; 2026-02-21T09:18:15.0197261Z // begin inline asm 2026-02-21T09:18:15.0197358Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:15.0197414Z // end inline asm 2026-02-21T09:18:15.0197470Z $L__tmp248: 2026-02-21T09:18:15.0197640Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0197713Z add.s64 %rd503, %rd458, 192; 2026-02-21T09:18:15.0197871Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0197935Z add.s64 %rd506, %rd461, 192; 2026-02-21T09:18:15.0197999Z // begin inline asm 2026-02-21T09:18:15.0198057Z mov.u64 %rd502, 0x0; 2026-02-21T09:18:15.0198166Z createpolicy.fractional.L2::evict_first.b64 %rd502, 1.0; 2026-02-21T09:18:15.0198229Z // end inline asm 2026-02-21T09:18:15.0198286Z // begin inline asm 2026-02-21T09:18:15.0198345Z mov.u32 %r4191, 0x0; 2026-02-21T09:18:15.0198401Z mov.u32 %r4192, 0x0; 2026-02-21T09:18:15.0198464Z mov.u32 %r4193, 0x0; 2026-02-21T09:18:15.0198520Z mov.u32 %r4194, 0x0; 2026-02-21T09:18:15.0198695Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r4191, %r4192, %r4193, %r4194 }, [ %rd503 + 0 ], %rd502; 2026-02-21T09:18:15.0198756Z // end inline asm 2026-02-21T09:18:15.0198813Z // begin inline asm 2026-02-21T09:18:15.0198869Z mov.u64 %rd505, 0x0; 2026-02-21T09:18:15.0198974Z createpolicy.fractional.L2::evict_first.b64 %rd505, 1.0; 2026-02-21T09:18:15.0199035Z // end inline asm 2026-02-21T09:18:15.0199093Z // begin inline asm 2026-02-21T09:18:15.0199149Z mov.u32 %r4195, 0x0; 2026-02-21T09:18:15.0199231Z mov.u32 %r4196, 0x0; 2026-02-21T09:18:15.0199286Z mov.u32 %r4197, 0x0; 
2026-02-21T09:18:15.0199340Z mov.u32 %r4198, 0x0; 2026-02-21T09:18:15.0199523Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r4195, %r4196, %r4197, %r4198 }, [ %rd506 + 0 ], %rd505; 2026-02-21T09:18:15.0199603Z // end inline asm 2026-02-21T09:18:15.0199677Z $L__tmp249: 2026-02-21T09:18:15.0199902Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0200006Z st.shared.v4.b32 [%r26], {%r4191, %r4192, %r4193, %r4194}; 2026-02-21T09:18:15.0200105Z st.shared.v4.b32 [%r26+512], {%r4195, %r4196, %r4197, %r4198}; 2026-02-21T09:18:15.0200160Z bar.sync 0; 2026-02-21T09:18:15.0200261Z ld.shared.v4.b32 {%r4358, %r4359, %r4360, %r4361}, [%r27]; 2026-02-21T09:18:15.0200327Z mov.b32 {%rs1409, %rs1410}, %r4361; 2026-02-21T09:18:15.0200391Z mov.b32 {%rs1411, %rs1412}, %r4360; 2026-02-21T09:18:15.0200461Z mov.b32 {%rs1413, %rs1414}, %r4359; 2026-02-21T09:18:15.0200522Z mov.b32 {%rs1415, %rs1416}, %r4358; 2026-02-21T09:18:15.0200613Z ld.shared.v4.b32 {%r4362, %r4363, %r4364, %r4365}, [%r28]; 2026-02-21T09:18:15.0200674Z mov.b32 {%rs1417, %rs1418}, %r4365; 2026-02-21T09:18:15.0200741Z mov.b32 {%rs1419, %rs1420}, %r4364; 2026-02-21T09:18:15.0200802Z mov.b32 {%rs1421, %rs1422}, %r4363; 2026-02-21T09:18:15.0200860Z mov.b32 {%rs1423, %rs1424}, %r4362; 2026-02-21T09:18:15.0200921Z $L__tmp250: 2026-02-21T09:18:15.0201086Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0201148Z cvt.f32.bf16 %r4340, %rs1415; 2026-02-21T09:18:15.0201216Z cvt.f32.bf16 %r4341, %rs1416; 2026-02-21T09:18:15.0201277Z cvt.f32.bf16 %r4342, %rs1413; 2026-02-21T09:18:15.0201336Z cvt.f32.bf16 %r4343, %rs1414; 2026-02-21T09:18:15.0201392Z cvt.f32.bf16 %r4344, %rs1411; 2026-02-21T09:18:15.0201457Z cvt.f32.bf16 %r4345, %rs1412; 2026-02-21T09:18:15.0201516Z cvt.f32.bf16 %r4346, %rs1409; 2026-02-21T09:18:15.0201600Z cvt.f32.bf16 %r4347, %rs1410; 2026-02-21T09:18:15.0201665Z cvt.f32.bf16 %r4348, 
%rs1423; 2026-02-21T09:18:15.0201723Z cvt.f32.bf16 %r4349, %rs1424; 2026-02-21T09:18:15.0201780Z cvt.f32.bf16 %r4350, %rs1421; 2026-02-21T09:18:15.0201836Z cvt.f32.bf16 %r4351, %rs1422; 2026-02-21T09:18:15.0201904Z cvt.f32.bf16 %r4352, %rs1419; 2026-02-21T09:18:15.0201963Z cvt.f32.bf16 %r4353, %rs1420; 2026-02-21T09:18:15.0202019Z cvt.f32.bf16 %r4354, %rs1417; 2026-02-21T09:18:15.0202081Z cvt.f32.bf16 %r4355, %rs1418; 2026-02-21T09:18:15.0202243Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:15.0202304Z add.s32 %r4366, %r7845, 393216; 2026-02-21T09:18:15.0202365Z cvt.s64.s32 %rd511, %r4366; 2026-02-21T09:18:15.0202433Z add.s64 %rd509, %rd56, %rd511; 2026-02-21T09:18:15.0202591Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0202650Z // begin inline asm 2026-02-21T09:18:15.0202714Z mov.u64 %rd508, 0x0; 2026-02-21T09:18:15.0202819Z createpolicy.fractional.L2::evict_first.b64 %rd508, 1.0; 2026-02-21T09:18:15.0202876Z // end inline asm 2026-02-21T09:18:15.0202939Z // begin inline asm 2026-02-21T09:18:15.0202996Z mov.u32 %r4199, 0x0; 2026-02-21T09:18:15.0203052Z mov.u32 %r4200, 0x0; 2026-02-21T09:18:15.0203107Z mov.u32 %r4201, 0x0; 2026-02-21T09:18:15.0203171Z mov.u32 %r4202, 0x0; 2026-02-21T09:18:15.0203341Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r4199, %r4200, %r4201, %r4202 }, [ %rd509 + 0 ], %rd508; 2026-02-21T09:18:15.0203396Z // end inline asm 2026-02-21T09:18:15.0203468Z prmt.b32 %r4367, %r4199, 0, 0x8880U; 2026-02-21T09:18:15.0203529Z cvt.u16.u32 %rs1425, %r4367; 2026-02-21T09:18:15.0203593Z prmt.b32 %r4368, %r4199, 0, 0x7770U; 2026-02-21T09:18:15.0203659Z cvt.u16.u32 %rs1426, %r4368; 2026-02-21T09:18:15.0203719Z prmt.b32 %r4369, %r4199, 0, 0x9991U; 2026-02-21T09:18:15.0203806Z cvt.u16.u32 %rs1427, %r4369; 2026-02-21T09:18:15.0203868Z prmt.b32 %r4370, %r4199, 0, 0x7771U; 2026-02-21T09:18:15.0203934Z cvt.u16.u32 %rs1428, %r4370; 
2026-02-21T09:18:15.0204020Z prmt.b32 %r4371, %r4199, 0, 0xaaa2U; 2026-02-21T09:18:15.0204079Z cvt.u16.u32 %rs1429, %r4371; 2026-02-21T09:18:15.0204171Z prmt.b32 %r4372, %r4199, 0, 0x7772U; 2026-02-21T09:18:15.0204231Z cvt.u16.u32 %rs1430, %r4372; 2026-02-21T09:18:15.0204316Z prmt.b32 %r4373, %r4199, 0, 0xbbb3U; 2026-02-21T09:18:15.0204376Z cvt.u16.u32 %rs1431, %r4373; 2026-02-21T09:18:15.0204445Z prmt.b32 %r4374, %r4199, 0, 0x7773U; 2026-02-21T09:18:15.0204503Z cvt.u16.u32 %rs1432, %r4374; 2026-02-21T09:18:15.0204564Z prmt.b32 %r4375, %r4200, 0, 0x8880U; 2026-02-21T09:18:15.0204631Z cvt.u16.u32 %rs1433, %r4375; 2026-02-21T09:18:15.0204692Z prmt.b32 %r4376, %r4200, 0, 0x7770U; 2026-02-21T09:18:15.0204750Z cvt.u16.u32 %rs1434, %r4376; 2026-02-21T09:18:15.0204819Z prmt.b32 %r4377, %r4200, 0, 0x9991U; 2026-02-21T09:18:15.0204879Z cvt.u16.u32 %rs1435, %r4377; 2026-02-21T09:18:15.0204942Z prmt.b32 %r4378, %r4200, 0, 0x7771U; 2026-02-21T09:18:15.0205005Z cvt.u16.u32 %rs1436, %r4378; 2026-02-21T09:18:15.0205076Z prmt.b32 %r4379, %r4200, 0, 0xaaa2U; 2026-02-21T09:18:15.0205134Z cvt.u16.u32 %rs1437, %r4379; 2026-02-21T09:18:15.0205196Z prmt.b32 %r4380, %r4200, 0, 0x7772U; 2026-02-21T09:18:15.0205260Z cvt.u16.u32 %rs1438, %r4380; 2026-02-21T09:18:15.0205321Z prmt.b32 %r4381, %r4200, 0, 0xbbb3U; 2026-02-21T09:18:15.0205378Z cvt.u16.u32 %rs1439, %r4381; 2026-02-21T09:18:15.0205436Z prmt.b32 %r4382, %r4200, 0, 0x7773U; 2026-02-21T09:18:15.0205501Z cvt.u16.u32 %rs1440, %r4382; 2026-02-21T09:18:15.0205561Z prmt.b32 %r4383, %r4201, 0, 0x8880U; 2026-02-21T09:18:15.0205617Z cvt.u16.u32 %rs1441, %r4383; 2026-02-21T09:18:15.0205686Z prmt.b32 %r4384, %r4201, 0, 0x7770U; 2026-02-21T09:18:15.0205744Z cvt.u16.u32 %rs1442, %r4384; 2026-02-21T09:18:15.0205804Z prmt.b32 %r4385, %r4201, 0, 0x9991U; 2026-02-21T09:18:15.0205863Z cvt.u16.u32 %rs1443, %r4385; 2026-02-21T09:18:15.0205932Z prmt.b32 %r4386, %r4201, 0, 0x7771U; 2026-02-21T09:18:15.0205990Z cvt.u16.u32 %rs1444, %r4386; 
2026-02-21T09:18:15.0206051Z prmt.b32 %r4387, %r4201, 0, 0xaaa2U; 2026-02-21T09:18:15.0206116Z cvt.u16.u32 %rs1445, %r4387; 2026-02-21T09:18:15.0206177Z prmt.b32 %r4388, %r4201, 0, 0x7772U; 2026-02-21T09:18:15.0206233Z cvt.u16.u32 %rs1446, %r4388; 2026-02-21T09:18:15.0206303Z prmt.b32 %r4389, %r4201, 0, 0xbbb3U; 2026-02-21T09:18:15.0206361Z cvt.u16.u32 %rs1447, %r4389; 2026-02-21T09:18:15.0206425Z prmt.b32 %r4390, %r4201, 0, 0x7773U; 2026-02-21T09:18:15.0206482Z cvt.u16.u32 %rs1448, %r4390; 2026-02-21T09:18:15.0206551Z prmt.b32 %r4391, %r4202, 0, 0x8880U; 2026-02-21T09:18:15.0206609Z cvt.u16.u32 %rs1449, %r4391; 2026-02-21T09:18:15.0206670Z prmt.b32 %r4392, %r4202, 0, 0x7770U; 2026-02-21T09:18:15.0206735Z cvt.u16.u32 %rs1450, %r4392; 2026-02-21T09:18:15.0206795Z prmt.b32 %r4393, %r4202, 0, 0x9991U; 2026-02-21T09:18:15.0206855Z cvt.u16.u32 %rs1451, %r4393; 2026-02-21T09:18:15.0206915Z prmt.b32 %r4394, %r4202, 0, 0x7771U; 2026-02-21T09:18:15.0206978Z cvt.u16.u32 %rs1452, %r4394; 2026-02-21T09:18:15.0207041Z prmt.b32 %r4395, %r4202, 0, 0xaaa2U; 2026-02-21T09:18:15.0207099Z cvt.u16.u32 %rs1453, %r4395; 2026-02-21T09:18:15.0207168Z prmt.b32 %r4396, %r4202, 0, 0x7772U; 2026-02-21T09:18:15.0207225Z cvt.u16.u32 %rs1454, %r4396; 2026-02-21T09:18:15.0207287Z prmt.b32 %r4397, %r4202, 0, 0xbbb3U; 2026-02-21T09:18:15.0207352Z cvt.u16.u32 %rs1455, %r4397; 2026-02-21T09:18:15.0207411Z prmt.b32 %r4398, %r4202, 0, 0x7773U; 2026-02-21T09:18:15.0207470Z cvt.u16.u32 %rs1456, %r4398; 2026-02-21T09:18:15.0207632Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0207702Z shl.b16 %rs1457, %rs1426, 12; 2026-02-21T09:18:15.0207761Z shr.s16 %rs1458, %rs1457, 12; 2026-02-21T09:18:15.0207820Z shl.b16 %rs1459, %rs1428, 12; 2026-02-21T09:18:15.0207906Z shr.s16 %rs1460, %rs1459, 12; 2026-02-21T09:18:15.0207962Z shl.b16 %rs1461, %rs1430, 12; 2026-02-21T09:18:15.0208019Z shr.s16 %rs1462, %rs1461, 12; 2026-02-21T09:18:15.0208075Z shl.b16 %rs1463, 
%rs1432, 12; 2026-02-21T09:18:15.0208160Z shr.s16 %rs1464, %rs1463, 12; 2026-02-21T09:18:15.0208217Z shl.b16 %rs1465, %rs1434, 12; 2026-02-21T09:18:15.0208297Z shr.s16 %rs1466, %rs1465, 12; 2026-02-21T09:18:15.0208389Z shl.b16 %rs1467, %rs1436, 12; 2026-02-21T09:18:15.0208447Z shr.s16 %rs1468, %rs1467, 12; 2026-02-21T09:18:15.0208504Z shl.b16 %rs1469, %rs1438, 12; 2026-02-21T09:18:15.0208560Z shr.s16 %rs1470, %rs1469, 12; 2026-02-21T09:18:15.0208622Z shl.b16 %rs1471, %rs1440, 12; 2026-02-21T09:18:15.0208679Z shr.s16 %rs1472, %rs1471, 12; 2026-02-21T09:18:15.0208735Z shl.b16 %rs1473, %rs1442, 12; 2026-02-21T09:18:15.0208798Z shr.s16 %rs1474, %rs1473, 12; 2026-02-21T09:18:15.0208854Z shl.b16 %rs1475, %rs1444, 12; 2026-02-21T09:18:15.0208912Z shr.s16 %rs1476, %rs1475, 12; 2026-02-21T09:18:15.0208976Z shl.b16 %rs1477, %rs1446, 12; 2026-02-21T09:18:15.0209033Z shr.s16 %rs1478, %rs1477, 12; 2026-02-21T09:18:15.0209091Z shl.b16 %rs1479, %rs1448, 12; 2026-02-21T09:18:15.0209149Z shr.s16 %rs1480, %rs1479, 12; 2026-02-21T09:18:15.0209216Z shl.b16 %rs1481, %rs1450, 12; 2026-02-21T09:18:15.0209273Z shr.s16 %rs1482, %rs1481, 12; 2026-02-21T09:18:15.0209331Z shl.b16 %rs1483, %rs1452, 12; 2026-02-21T09:18:15.0209395Z shr.s16 %rs1484, %rs1483, 12; 2026-02-21T09:18:15.0209452Z shl.b16 %rs1485, %rs1454, 12; 2026-02-21T09:18:15.0209508Z shr.s16 %rs1486, %rs1485, 12; 2026-02-21T09:18:15.0209564Z shl.b16 %rs1487, %rs1456, 12; 2026-02-21T09:18:15.0209628Z shr.s16 %rs1488, %rs1487, 12; 2026-02-21T09:18:15.0209790Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0209849Z shr.u16 %rs1489, %rs1425, 4; 2026-02-21T09:18:15.0209914Z shr.u16 %rs1490, %rs1427, 4; 2026-02-21T09:18:15.0209972Z shr.u16 %rs1491, %rs1429, 4; 2026-02-21T09:18:15.0210030Z shr.u16 %rs1492, %rs1431, 4; 2026-02-21T09:18:15.0210095Z shr.u16 %rs1493, %rs1433, 4; 2026-02-21T09:18:15.0210152Z shr.u16 %rs1494, %rs1435, 4; 2026-02-21T09:18:15.0210209Z shr.u16 %rs1495, %rs1437, 
4; 2026-02-21T09:18:15.0210267Z shr.u16 %rs1496, %rs1439, 4; 2026-02-21T09:18:15.0210333Z shr.u16 %rs1497, %rs1441, 4; 2026-02-21T09:18:15.0210390Z shr.u16 %rs1498, %rs1443, 4; 2026-02-21T09:18:15.0210448Z shr.u16 %rs1499, %rs1445, 4; 2026-02-21T09:18:15.0210513Z shr.u16 %rs1500, %rs1447, 4; 2026-02-21T09:18:15.0210571Z shr.u16 %rs1501, %rs1449, 4; 2026-02-21T09:18:15.0210628Z shr.u16 %rs1502, %rs1451, 4; 2026-02-21T09:18:15.0210686Z shr.u16 %rs1503, %rs1453, 4; 2026-02-21T09:18:15.0210752Z shr.u16 %rs1504, %rs1455, 4; 2026-02-21T09:18:15.0210914Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0210969Z bar.sync 0; 2026-02-21T09:18:15.0211040Z st.shared.b8 [%r29], %rs1458; 2026-02-21T09:18:15.0211103Z st.shared.b8 [%r30], %rs1460; 2026-02-21T09:18:15.0211163Z st.shared.b8 [%r31], %rs1462; 2026-02-21T09:18:15.0211223Z st.shared.b8 [%r32], %rs1464; 2026-02-21T09:18:15.0211299Z st.shared.b8 [%r33+512], %rs1466; 2026-02-21T09:18:15.0211366Z st.shared.b8 [%r34+512], %rs1468; 2026-02-21T09:18:15.0211430Z st.shared.b8 [%r35+512], %rs1470; 2026-02-21T09:18:15.0211498Z st.shared.b8 [%r36+512], %rs1472; 2026-02-21T09:18:15.0211587Z st.shared.b8 [%r37+1024], %rs1474; 2026-02-21T09:18:15.0211652Z st.shared.b8 [%r38+1024], %rs1476; 2026-02-21T09:18:15.0211720Z st.shared.b8 [%r39+1024], %rs1478; 2026-02-21T09:18:15.0211780Z st.shared.b8 [%r40+1024], %rs1480; 2026-02-21T09:18:15.0211839Z st.shared.b8 [%r41+1536], %rs1482; 2026-02-21T09:18:15.0211899Z st.shared.b8 [%r42+1536], %rs1484; 2026-02-21T09:18:15.0211968Z st.shared.b8 [%r43+1536], %rs1486; 2026-02-21T09:18:15.0212028Z st.shared.b8 [%r44+1536], %rs1488; 2026-02-21T09:18:15.0212082Z bar.sync 0; 2026-02-21T09:18:15.0212176Z ld.shared.b32 %r4399, [%r45]; 2026-02-21T09:18:15.0212237Z ld.shared.b32 %r4400, [%r46]; 2026-02-21T09:18:15.0212297Z ld.shared.b32 %r4401, [%r47]; 2026-02-21T09:18:15.0212382Z ld.shared.b32 %r4402, [%r48]; 2026-02-21T09:18:15.0212596Z .loc 1 67 45 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0212655Z bar.sync 0; 2026-02-21T09:18:15.0212739Z st.shared.b8 [%r29], %rs1489; 2026-02-21T09:18:15.0212808Z st.shared.b8 [%r30], %rs1490; 2026-02-21T09:18:15.0212868Z st.shared.b8 [%r31], %rs1491; 2026-02-21T09:18:15.0212927Z st.shared.b8 [%r32], %rs1492; 2026-02-21T09:18:15.0212996Z st.shared.b8 [%r33+512], %rs1493; 2026-02-21T09:18:15.0213057Z st.shared.b8 [%r34+512], %rs1494; 2026-02-21T09:18:15.0213117Z st.shared.b8 [%r35+512], %rs1495; 2026-02-21T09:18:15.0213179Z st.shared.b8 [%r36+512], %rs1496; 2026-02-21T09:18:15.0213249Z st.shared.b8 [%r37+1024], %rs1497; 2026-02-21T09:18:15.0213310Z st.shared.b8 [%r38+1024], %rs1498; 2026-02-21T09:18:15.0213369Z st.shared.b8 [%r39+1024], %rs1499; 2026-02-21T09:18:15.0213436Z st.shared.b8 [%r40+1024], %rs1500; 2026-02-21T09:18:15.0213495Z st.shared.b8 [%r41+1536], %rs1501; 2026-02-21T09:18:15.0213555Z st.shared.b8 [%r42+1536], %rs1502; 2026-02-21T09:18:15.0213614Z st.shared.b8 [%r43+1536], %rs1503; 2026-02-21T09:18:15.0213683Z st.shared.b8 [%r44+1536], %rs1504; 2026-02-21T09:18:15.0213738Z bar.sync 0; 2026-02-21T09:18:15.0213798Z ld.shared.b32 %r4403, [%r45]; 2026-02-21T09:18:15.0213864Z ld.shared.b32 %r4404, [%r46]; 2026-02-21T09:18:15.0213924Z ld.shared.b32 %r4405, [%r47]; 2026-02-21T09:18:15.0213984Z ld.shared.b32 %r4406, [%r48]; 2026-02-21T09:18:15.0214147Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0214217Z cvt.s8.s32 %rs1505, %r4400; 2026-02-21T09:18:15.0214279Z cvt.rn.f32.s16 %r4407, %rs1505; 2026-02-21T09:18:15.0214341Z cvt.s8.s32 %rs1506, %r4399; 2026-02-21T09:18:15.0214411Z cvt.rn.f32.s16 %r4408, %rs1506; 2026-02-21T09:18:15.0214469Z cvt.s8.s32 %rs1507, %r4404; 2026-02-21T09:18:15.0214531Z cvt.rn.f32.s16 %r4409, %rs1507; 2026-02-21T09:18:15.0214597Z cvt.s8.s32 %rs1508, %r4403; 2026-02-21T09:18:15.0214657Z cvt.rn.f32.s16 %r4410, %rs1508; 2026-02-21T09:18:15.0214714Z cvt.s8.s32 
%rs1509, %r4402; 2026-02-21T09:18:15.0214774Z cvt.rn.f32.s16 %r4411, %rs1509; 2026-02-21T09:18:15.0214840Z cvt.s8.s32 %rs1510, %r4401; 2026-02-21T09:18:15.0214900Z cvt.rn.f32.s16 %r4412, %rs1510; 2026-02-21T09:18:15.0214959Z cvt.s8.s32 %rs1511, %r4406; 2026-02-21T09:18:15.0215024Z cvt.rn.f32.s16 %r4413, %rs1511; 2026-02-21T09:18:15.0215081Z cvt.s8.s32 %rs1512, %r4405; 2026-02-21T09:18:15.0215140Z cvt.rn.f32.s16 %r4414, %rs1512; 2026-02-21T09:18:15.0215203Z prmt.b32 %r4415, %r4400, 0, 0x9991U; 2026-02-21T09:18:15.0215269Z cvt.u16.u32 %rs1513, %r4415; 2026-02-21T09:18:15.0215327Z cvt.rn.f32.s16 %r4416, %rs1513; 2026-02-21T09:18:15.0215390Z prmt.b32 %r4417, %r4399, 0, 0x9991U; 2026-02-21T09:18:15.0215456Z cvt.u16.u32 %rs1514, %r4417; 2026-02-21T09:18:15.0215514Z cvt.rn.f32.s16 %r4418, %rs1514; 2026-02-21T09:18:15.0215575Z prmt.b32 %r4419, %r4404, 0, 0x9991U; 2026-02-21T09:18:15.0215643Z cvt.u16.u32 %rs1515, %r4419; 2026-02-21T09:18:15.0215703Z cvt.rn.f32.s16 %r4420, %rs1515; 2026-02-21T09:18:15.0215765Z prmt.b32 %r4421, %r4403, 0, 0x9991U; 2026-02-21T09:18:15.0215823Z cvt.u16.u32 %rs1516, %r4421; 2026-02-21T09:18:15.0215891Z cvt.rn.f32.s16 %r4422, %rs1516; 2026-02-21T09:18:15.0215952Z prmt.b32 %r4423, %r4402, 0, 0x9991U; 2026-02-21T09:18:15.0216009Z cvt.u16.u32 %rs1517, %r4423; 2026-02-21T09:18:15.0216076Z cvt.rn.f32.s16 %r4424, %rs1517; 2026-02-21T09:18:15.0216137Z prmt.b32 %r4425, %r4401, 0, 0x9991U; 2026-02-21T09:18:15.0216193Z cvt.u16.u32 %rs1518, %r4425; 2026-02-21T09:18:15.0216251Z cvt.rn.f32.s16 %r4426, %rs1518; 2026-02-21T09:18:15.0216318Z prmt.b32 %r4427, %r4406, 0, 0x9991U; 2026-02-21T09:18:15.0216375Z cvt.u16.u32 %rs1519, %r4427; 2026-02-21T09:18:15.0216456Z cvt.rn.f32.s16 %r4428, %rs1519; 2026-02-21T09:18:15.0216523Z prmt.b32 %r4429, %r4405, 0, 0x9991U; 2026-02-21T09:18:15.0216581Z cvt.u16.u32 %rs1520, %r4429; 2026-02-21T09:18:15.0216667Z cvt.rn.f32.s16 %r4430, %rs1520; 2026-02-21T09:18:15.0216729Z prmt.b32 %r4431, %r4400, 0, 0xaaa2U; 
2026-02-21T09:18:15.0216819Z cvt.u16.u32 %rs1521, %r4431; 2026-02-21T09:18:15.0216879Z cvt.rn.f32.s16 %r4432, %rs1521; 2026-02-21T09:18:15.0216966Z prmt.b32 %r4433, %r4399, 0, 0xaaa2U; 2026-02-21T09:18:15.0217034Z cvt.u16.u32 %rs1522, %r4433; 2026-02-21T09:18:15.0217094Z cvt.rn.f32.s16 %r4434, %rs1522; 2026-02-21T09:18:15.0217156Z prmt.b32 %r4435, %r4404, 0, 0xaaa2U; 2026-02-21T09:18:15.0217222Z cvt.u16.u32 %rs1523, %r4435; 2026-02-21T09:18:15.0217282Z cvt.rn.f32.s16 %r4436, %rs1523; 2026-02-21T09:18:15.0217342Z prmt.b32 %r4437, %r4403, 0, 0xaaa2U; 2026-02-21T09:18:15.0217401Z cvt.u16.u32 %rs1524, %r4437; 2026-02-21T09:18:15.0217471Z cvt.rn.f32.s16 %r4438, %rs1524; 2026-02-21T09:18:15.0217535Z prmt.b32 %r4439, %r4402, 0, 0xaaa2U; 2026-02-21T09:18:15.0217595Z cvt.u16.u32 %rs1525, %r4439; 2026-02-21T09:18:15.0217668Z cvt.rn.f32.s16 %r4440, %rs1525; 2026-02-21T09:18:15.0217733Z prmt.b32 %r4441, %r4401, 0, 0xaaa2U; 2026-02-21T09:18:15.0217793Z cvt.u16.u32 %rs1526, %r4441; 2026-02-21T09:18:15.0217855Z cvt.rn.f32.s16 %r4442, %rs1526; 2026-02-21T09:18:15.0217928Z prmt.b32 %r4443, %r4406, 0, 0xaaa2U; 2026-02-21T09:18:15.0217990Z cvt.u16.u32 %rs1527, %r4443; 2026-02-21T09:18:15.0218051Z cvt.rn.f32.s16 %r4444, %rs1527; 2026-02-21T09:18:15.0218121Z prmt.b32 %r4445, %r4405, 0, 0xaaa2U; 2026-02-21T09:18:15.0218178Z cvt.u16.u32 %rs1528, %r4445; 2026-02-21T09:18:15.0218237Z cvt.rn.f32.s16 %r4446, %rs1528; 2026-02-21T09:18:15.0218305Z prmt.b32 %r4447, %r4400, 0, 0xbbb3U; 2026-02-21T09:18:15.0218364Z cvt.u16.u32 %rs1529, %r4447; 2026-02-21T09:18:15.0218422Z cvt.rn.f32.s16 %r4448, %rs1529; 2026-02-21T09:18:15.0218484Z prmt.b32 %r4449, %r4399, 0, 0xbbb3U; 2026-02-21T09:18:15.0218552Z cvt.u16.u32 %rs1530, %r4449; 2026-02-21T09:18:15.0218611Z cvt.rn.f32.s16 %r4450, %rs1530; 2026-02-21T09:18:15.0218672Z prmt.b32 %r4451, %r4404, 0, 0xbbb3U; 2026-02-21T09:18:15.0218736Z cvt.u16.u32 %rs1531, %r4451; 2026-02-21T09:18:15.0218797Z cvt.rn.f32.s16 %r4452, %rs1531; 2026-02-21T09:18:15.0218858Z 
prmt.b32 %r4453, %r4403, 0, 0xbbb3U; 2026-02-21T09:18:15.0218916Z cvt.u16.u32 %rs1532, %r4453; 2026-02-21T09:18:15.0218985Z cvt.rn.f32.s16 %r4454, %rs1532; 2026-02-21T09:18:15.0219047Z prmt.b32 %r4455, %r4402, 0, 0xbbb3U; 2026-02-21T09:18:15.0219105Z cvt.u16.u32 %rs1533, %r4455; 2026-02-21T09:18:15.0219172Z cvt.rn.f32.s16 %r4456, %rs1533; 2026-02-21T09:18:15.0219233Z prmt.b32 %r4457, %r4401, 0, 0xbbb3U; 2026-02-21T09:18:15.0219291Z cvt.u16.u32 %rs1534, %r4457; 2026-02-21T09:18:15.0219351Z cvt.rn.f32.s16 %r4458, %rs1534; 2026-02-21T09:18:15.0219419Z prmt.b32 %r4459, %r4406, 0, 0xbbb3U; 2026-02-21T09:18:15.0219477Z cvt.u16.u32 %rs1535, %r4459; 2026-02-21T09:18:15.0219538Z cvt.rn.f32.s16 %r4460, %rs1535; 2026-02-21T09:18:15.0219605Z prmt.b32 %r4461, %r4405, 0, 0xbbb3U; 2026-02-21T09:18:15.0219664Z cvt.u16.u32 %rs1536, %r4461; 2026-02-21T09:18:15.0219724Z cvt.rn.f32.s16 %r4462, %rs1536; 2026-02-21T09:18:15.0219784Z bar.sync 0; 2026-02-21T09:18:15.0219885Z st.shared.v4.b32 [%r49], {%r4408, %r4410, %r4407, %r4409}; 2026-02-21T09:18:15.0219981Z st.shared.v4.b32 [%r50], {%r4412, %r4414, %r4411, %r4413}; 2026-02-21T09:18:15.0220073Z st.shared.v4.b32 [%r51], {%r4418, %r4422, %r4416, %r4420}; 2026-02-21T09:18:15.0220172Z st.shared.v4.b32 [%r52], {%r4426, %r4430, %r4424, %r4428}; 2026-02-21T09:18:15.0220262Z st.shared.v4.b32 [%r53], {%r4434, %r4438, %r4432, %r4436}; 2026-02-21T09:18:15.0220350Z st.shared.v4.b32 [%r54], {%r4442, %r4446, %r4440, %r4444}; 2026-02-21T09:18:15.0220448Z st.shared.v4.b32 [%r55], {%r4450, %r4454, %r4448, %r4452}; 2026-02-21T09:18:15.0220535Z st.shared.v4.b32 [%r56], {%r4458, %r4462, %r4456, %r4460}; 2026-02-21T09:18:15.0220611Z $L__tmp251: 2026-02-21T09:18:15.0220832Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0220913Z // begin inline asm 2026-02-21T09:18:15.0221239Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4203, %r4204, %r4205, %r4206, %r4207, %r4208, %r4209, %r4210, 
%r4211, %r4212, %r4213, %r4214, %r4215, %r4216, %r4217, %r4218}, [%r1205 + 0], 64; 2026-02-21T09:18:15.0221330Z // end inline asm 2026-02-21T09:18:15.0221391Z // begin inline asm 2026-02-21T09:18:15.0221726Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4220, %r4221, %r4222, %r4223, %r4224, %r4225, %r4226, %r4227, %r4228, %r4229, %r4230, %r4231, %r4232, %r4233, %r4234, %r4235}, [%r1205 + 16], 64; 2026-02-21T09:18:15.0221782Z // end inline asm 2026-02-21T09:18:15.0221847Z // begin inline asm 2026-02-21T09:18:15.0222147Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4237, %r4238, %r4239, %r4240, %r4241, %r4242, %r4243, %r4244, %r4245, %r4246, %r4247, %r4248, %r4249, %r4250, %r4251, %r4252}, [%r1205 + 32], 64; 2026-02-21T09:18:15.0222202Z // end inline asm 2026-02-21T09:18:15.0222265Z // begin inline asm 2026-02-21T09:18:15.0222559Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4254, %r4255, %r4256, %r4257, %r4258, %r4259, %r4260, %r4261, %r4262, %r4263, %r4264, %r4265, %r4266, %r4267, %r4268, %r4269}, [%r1205 + 48], 64; 2026-02-21T09:18:15.0222614Z // end inline asm 2026-02-21T09:18:15.0222677Z // begin inline asm 2026-02-21T09:18:15.0222747Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0222801Z // end inline asm 2026-02-21T09:18:15.0222863Z mov.pred %p391, -1; 2026-02-21T09:18:15.0222926Z // begin inline asm 2026-02-21T09:18:15.0223234Z @%p391 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 0], 64, {%r4203, %r4204, %r4205, %r4206, %r4207, %r4208, %r4209, %r4210, %r4211, %r4212, %r4213, %r4214, %r4215, %r4216, %r4217, %r4218}; 2026-02-21T09:18:15.0223289Z // end inline asm 2026-02-21T09:18:15.0223353Z // begin inline asm 2026-02-21T09:18:15.0223661Z @%p391 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 16], 64, {%r4220, %r4221, %r4222, %r4223, %r4224, %r4225, %r4226, %r4227, %r4228, %r4229, %r4230, %r4231, %r4232, %r4233, %r4234, %r4235}; 2026-02-21T09:18:15.0223720Z // end inline asm 2026-02-21T09:18:15.0223786Z // begin inline asm 
2026-02-21T09:18:15.0224089Z @%p391 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 32], 64, {%r4237, %r4238, %r4239, %r4240, %r4241, %r4242, %r4243, %r4244, %r4245, %r4246, %r4247, %r4248, %r4249, %r4250, %r4251, %r4252}; 2026-02-21T09:18:15.0224144Z // end inline asm 2026-02-21T09:18:15.0224209Z // begin inline asm 2026-02-21T09:18:15.0224509Z @%p391 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 48], 64, {%r4254, %r4255, %r4256, %r4257, %r4258, %r4259, %r4260, %r4261, %r4262, %r4263, %r4264, %r4265, %r4266, %r4267, %r4268, %r4269}; 2026-02-21T09:18:15.0224565Z // end inline asm 2026-02-21T09:18:15.0224628Z // begin inline asm 2026-02-21T09:18:15.0224696Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0224752Z // end inline asm 2026-02-21T09:18:15.0224807Z bar.sync 0; 2026-02-21T09:18:15.0224872Z // begin inline asm 2026-02-21T09:18:15.0225172Z @%p391 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5846 + 0], 16, {%r4340, %r4341, %r4342, %r4343, %r4344, %r4345, %r4346, %r4347, %r4348, %r4349, %r4350, %r4351, %r4352, %r4353, %r4354, %r4355}; 2026-02-21T09:18:15.0225229Z // end inline asm 2026-02-21T09:18:15.0225294Z // begin inline asm 2026-02-21T09:18:15.0225361Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0225415Z // end inline asm 2026-02-21T09:18:15.0225475Z bar.sync 0; 2026-02-21T09:18:15.0225531Z // begin inline asm 2026-02-21T09:18:15.0225602Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0225657Z // end inline asm 2026-02-21T09:18:15.0225721Z // begin inline asm 2026-02-21T09:18:15.0225809Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:15.0225863Z // end inline asm 2026-02-21T09:18:15.0225922Z bar.sync 0; 2026-02-21T09:18:15.0226017Z @%p20 bra $L__BB0_31; 2026-02-21T09:18:15.0226116Z // %bb.30: // in Loop: Header=BB0_23 Depth=2 2026-02-21T09:18:15.0226183Z elect.sync %r4475|%p223, -1; 2026-02-21T09:18:15.0226290Z mov.b32 %r4465, 69208336; 2026-02-21T09:18:15.0226352Z mov.pred %p222, -1; 2026-02-21T09:18:15.0226409Z 
// begin inline asm 2026-02-21T09:18:15.0226627Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 0 ], %rd1, %r4465, %p222; 2026-02-21T09:18:15.0226688Z // end inline asm 2026-02-21T09:18:15.0226746Z // begin inline asm 2026-02-21T09:18:15.0226908Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 8 ], %rd2, %r4465, %p222; 2026-02-21T09:18:15.0226964Z // end inline asm 2026-02-21T09:18:15.0227021Z // begin inline asm 2026-02-21T09:18:15.0227174Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 16 ], %rd3, %r4465, %p222; 2026-02-21T09:18:15.0227243Z // end inline asm 2026-02-21T09:18:15.0227305Z // begin inline asm 2026-02-21T09:18:15.0227456Z @%p223 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 24 ], %rd4, %r4465, %p222; 2026-02-21T09:18:15.0227524Z // end inline asm 2026-02-21T09:18:15.0227589Z cvt.u64.u32 %rd516, %r5985; 2026-02-21T09:18:15.0227645Z // begin inline asm 2026-02-21T09:18:15.0227781Z @%p223 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd516]; 2026-02-21T09:18:15.0227837Z // end inline asm 2026-02-21T09:18:15.0227897Z bra.uni $L__BB0_31; 2026-02-21T09:18:15.0227951Z $L__tmp252: 2026-02-21T09:18:15.0228060Z $L__BB0_32: // in Loop: Header=BB0_2 Depth=1 2026-02-21T09:18:15.0228226Z .loc 1 31 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:31:32 2026-02-21T09:18:15.0228288Z or.b32 %r4690, %r122, %r9; 2026-02-21T09:18:15.0228356Z or.b32 %r4691, %r122, %r10; 2026-02-21T09:18:15.0228415Z or.b32 %r4692, %r122, %r11; 2026-02-21T09:18:15.0228472Z or.b32 %r4693, %r122, %r12; 2026-02-21T09:18:15.0228537Z or.b32 %r4694, %r122, %r13; 2026-02-21T09:18:15.0228592Z or.b32 %r4695, %r122, %r14; 2026-02-21T09:18:15.0228648Z or.b32 %r4696, %r122, %r15; 2026-02-21T09:18:15.0228705Z or.b32 %r4697, %r122, %r16; 2026-02-21T09:18:15.0228877Z .loc 1 33 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:33:32 2026-02-21T09:18:15.0228935Z or.b32 %r4698, %r123, %r20; 
2026-02-21T09:18:15.0229094Z .loc 1 88 43 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:43 2026-02-21T09:18:15.0229159Z shl.b32 %r4699, %r4690, 13; 2026-02-21T09:18:15.0229216Z shl.b32 %r4700, %r4691, 13; 2026-02-21T09:18:15.0229273Z shl.b32 %r4701, %r4692, 13; 2026-02-21T09:18:15.0229336Z shl.b32 %r4702, %r4693, 13; 2026-02-21T09:18:15.0229392Z shl.b32 %r4703, %r4694, 13; 2026-02-21T09:18:15.0229449Z shl.b32 %r4704, %r4695, 13; 2026-02-21T09:18:15.0229504Z shl.b32 %r4705, %r4696, 13; 2026-02-21T09:18:15.0229569Z shl.b32 %r4706, %r4697, 13; 2026-02-21T09:18:15.0229726Z .loc 1 88 50 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:50 2026-02-21T09:18:15.0229787Z add.s32 %r4707, %r4699, %r4698; 2026-02-21T09:18:15.0229855Z add.s32 %r4708, %r4700, %r4698; 2026-02-21T09:18:15.0229914Z add.s32 %r4709, %r4701, %r4698; 2026-02-21T09:18:15.0229972Z add.s32 %r4710, %r4702, %r4698; 2026-02-21T09:18:15.0230030Z add.s32 %r4711, %r4703, %r4698; 2026-02-21T09:18:15.0230097Z add.s32 %r4712, %r4704, %r4698; 2026-02-21T09:18:15.0230155Z add.s32 %r4713, %r4705, %r4698; 2026-02-21T09:18:15.0230213Z add.s32 %r4714, %r4706, %r4698; 2026-02-21T09:18:15.0230375Z .loc 1 88 22 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:22 2026-02-21T09:18:15.0230444Z mad.wide.s32 %rd517, %r4707, 2, %rd57; 2026-02-21T09:18:15.0230512Z mad.wide.s32 %rd518, %r4708, 2, %rd57; 2026-02-21T09:18:15.0230583Z mad.wide.s32 %rd519, %r4709, 2, %rd57; 2026-02-21T09:18:15.0230646Z mad.wide.s32 %rd520, %r4710, 2, %rd57; 2026-02-21T09:18:15.0230732Z mad.wide.s32 %rd521, %r4711, 2, %rd57; 2026-02-21T09:18:15.0230796Z mad.wide.s32 %rd522, %r4712, 2, %rd57; 2026-02-21T09:18:15.0230866Z mad.wide.s32 %rd523, %r4713, 2, %rd57; 2026-02-21T09:18:15.0230964Z mad.wide.s32 %rd524, %r4714, 2, %rd57; 2026-02-21T09:18:15.0231018Z $L__tmp253: 2026-02-21T09:18:15.0231276Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 
2026-02-21T09:18:15.0231334Z // begin inline asm 2026-02-21T09:18:15.0231668Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4482, %r4483, %r4484, %r4485, %r4486, %r4487, %r4488, %r4489, %r4490, %r4491, %r4492, %r4493, %r4494, %r4495, %r4496, %r4497}, [%r4673 + 0], 64; 2026-02-21T09:18:15.0231730Z // end inline asm 2026-02-21T09:18:15.0231787Z // begin inline asm 2026-02-21T09:18:15.0232092Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4499, %r4500, %r4501, %r4502, %r4503, %r4504, %r4505, %r4506, %r4507, %r4508, %r4509, %r4510, %r4511, %r4512, %r4513, %r4514}, [%r4673 + 16], 64; 2026-02-21T09:18:15.0232157Z // end inline asm 2026-02-21T09:18:15.0232212Z // begin inline asm 2026-02-21T09:18:15.0232514Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4516, %r4517, %r4518, %r4519, %r4520, %r4521, %r4522, %r4523, %r4524, %r4525, %r4526, %r4527, %r4528, %r4529, %r4530, %r4531}, [%r4673 + 32], 64; 2026-02-21T09:18:15.0232570Z // end inline asm 2026-02-21T09:18:15.0232635Z // begin inline asm 2026-02-21T09:18:15.0232941Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4533, %r4534, %r4535, %r4536, %r4537, %r4538, %r4539, %r4540, %r4541, %r4542, %r4543, %r4544, %r4545, %r4546, %r4547, %r4548}, [%r4673 + 48], 64; 2026-02-21T09:18:15.0232996Z // end inline asm 2026-02-21T09:18:15.0233061Z // begin inline asm 2026-02-21T09:18:15.0233130Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0233185Z // end inline asm 2026-02-21T09:18:15.0233251Z cvt.u64.u32 %rd526, %r4482; 2026-02-21T09:18:15.0233311Z cvt.u64.u32 %rd527, %r4483; 2026-02-21T09:18:15.0233371Z shl.b64 %rd528, %rd527, 32; 2026-02-21T09:18:15.0233445Z or.b64 %rd529, %rd526, %rd528; 2026-02-21T09:18:15.0233509Z $L__tmp254: 2026-02-21T09:18:15.0233676Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0233744Z mov.b64 {%r4715, %r4716}, %rd529; 2026-02-21T09:18:15.0233830Z cvt.rn.bf16x2.f32 %r4717, %r4716, %r4715; 2026-02-21T09:18:15.0233886Z $L__tmp255: 2026-02-21T09:18:15.0234106Z 
.loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0234177Z cvt.u64.u32 %rd530, %r4484; 2026-02-21T09:18:15.0234238Z cvt.u64.u32 %rd531, %r4485; 2026-02-21T09:18:15.0234300Z shl.b64 %rd532, %rd531, 32; 2026-02-21T09:18:15.0234364Z or.b64 %rd533, %rd530, %rd532; 2026-02-21T09:18:15.0234427Z $L__tmp256: 2026-02-21T09:18:15.0234590Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0234658Z mov.b64 {%r4718, %r4719}, %rd533; 2026-02-21T09:18:15.0234740Z cvt.rn.bf16x2.f32 %r4720, %r4719, %r4718; 2026-02-21T09:18:15.0234795Z $L__tmp257: 2026-02-21T09:18:15.0235006Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0235077Z cvt.u64.u32 %rd534, %r4486; 2026-02-21T09:18:15.0235138Z cvt.u64.u32 %rd535, %r4487; 2026-02-21T09:18:15.0235202Z shl.b64 %rd536, %rd535, 32; 2026-02-21T09:18:15.0235265Z or.b64 %rd537, %rd534, %rd536; 2026-02-21T09:18:15.0235328Z $L__tmp258: 2026-02-21T09:18:15.0235494Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0235558Z mov.b64 {%r4721, %r4722}, %rd537; 2026-02-21T09:18:15.0235641Z cvt.rn.bf16x2.f32 %r4723, %r4722, %r4721; 2026-02-21T09:18:15.0235698Z $L__tmp259: 2026-02-21T09:18:15.0235913Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0236004Z cvt.u64.u32 %rd538, %r4488; 2026-02-21T09:18:15.0236079Z cvt.u64.u32 %rd539, %r4489; 2026-02-21T09:18:15.0236168Z shl.b64 %rd540, %rd539, 32; 2026-02-21T09:18:15.0236233Z or.b64 %rd541, %rd538, %rd540; 2026-02-21T09:18:15.0236296Z $L__tmp260: 2026-02-21T09:18:15.0236516Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0236582Z mov.b64 {%r4724, %r4725}, %rd541; 2026-02-21T09:18:15.0236662Z cvt.rn.bf16x2.f32 %r4726, %r4725, %r4724; 
2026-02-21T09:18:15.0236717Z $L__tmp261: 2026-02-21T09:18:15.0236932Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0236992Z cvt.u64.u32 %rd542, %r4490; 2026-02-21T09:18:15.0237059Z cvt.u64.u32 %rd543, %r4491; 2026-02-21T09:18:15.0237120Z shl.b64 %rd544, %rd543, 32; 2026-02-21T09:18:15.0237185Z or.b64 %rd545, %rd542, %rd544; 2026-02-21T09:18:15.0237245Z $L__tmp262: 2026-02-21T09:18:15.0237409Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0237475Z mov.b64 {%r4727, %r4728}, %rd545; 2026-02-21T09:18:15.0237554Z cvt.rn.bf16x2.f32 %r4729, %r4728, %r4727; 2026-02-21T09:18:15.0237610Z $L__tmp263: 2026-02-21T09:18:15.0237823Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0237885Z cvt.u64.u32 %rd546, %r4492; 2026-02-21T09:18:15.0237952Z cvt.u64.u32 %rd547, %r4493; 2026-02-21T09:18:15.0238012Z shl.b64 %rd548, %rd547, 32; 2026-02-21T09:18:15.0238075Z or.b64 %rd549, %rd546, %rd548; 2026-02-21T09:18:15.0238137Z $L__tmp264: 2026-02-21T09:18:15.0238305Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0238369Z mov.b64 {%r4730, %r4731}, %rd549; 2026-02-21T09:18:15.0238448Z cvt.rn.bf16x2.f32 %r4732, %r4731, %r4730; 2026-02-21T09:18:15.0238502Z $L__tmp265: 2026-02-21T09:18:15.0238714Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0238777Z cvt.u64.u32 %rd550, %r4494; 2026-02-21T09:18:15.0238847Z cvt.u64.u32 %rd551, %r4495; 2026-02-21T09:18:15.0238908Z shl.b64 %rd552, %rd551, 32; 2026-02-21T09:18:15.0238973Z or.b64 %rd553, %rd550, %rd552; 2026-02-21T09:18:15.0239034Z $L__tmp266: 2026-02-21T09:18:15.0239203Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0239265Z mov.b64 {%r4733, %r4734}, %rd553; 
2026-02-21T09:18:15.0239335Z cvt.rn.bf16x2.f32 %r4735, %r4734, %r4733; 2026-02-21T09:18:15.0239399Z $L__tmp267: 2026-02-21T09:18:15.0239612Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0239675Z cvt.u64.u32 %rd554, %r4496; 2026-02-21T09:18:15.0239744Z cvt.u64.u32 %rd555, %r4497; 2026-02-21T09:18:15.0239804Z shl.b64 %rd556, %rd555, 32; 2026-02-21T09:18:15.0239868Z or.b64 %rd557, %rd554, %rd556; 2026-02-21T09:18:15.0239930Z $L__tmp268: 2026-02-21T09:18:15.0240099Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0240161Z mov.b64 {%r4736, %r4737}, %rd557; 2026-02-21T09:18:15.0240233Z cvt.rn.bf16x2.f32 %r4738, %r4737, %r4736; 2026-02-21T09:18:15.0240296Z $L__tmp269: 2026-02-21T09:18:15.0240505Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0240566Z cvt.u64.u32 %rd558, %r4499; 2026-02-21T09:18:15.0240634Z cvt.u64.u32 %rd559, %r4500; 2026-02-21T09:18:15.0240694Z shl.b64 %rd560, %rd559, 32; 2026-02-21T09:18:15.0240756Z or.b64 %rd561, %rd558, %rd560; 2026-02-21T09:18:15.0240816Z $L__tmp270: 2026-02-21T09:18:15.0241004Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0241067Z mov.b64 {%r4739, %r4740}, %rd561; 2026-02-21T09:18:15.0241157Z cvt.rn.bf16x2.f32 %r4741, %r4740, %r4739; 2026-02-21T09:18:15.0241219Z $L__tmp271: 2026-02-21T09:18:15.0241456Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0241567Z cvt.u64.u32 %rd562, %r4501; 2026-02-21T09:18:15.0241633Z cvt.u64.u32 %rd563, %r4502; 2026-02-21T09:18:15.0241690Z shl.b64 %rd564, %rd563, 32; 2026-02-21T09:18:15.0241750Z or.b64 %rd565, %rd562, %rd564; 2026-02-21T09:18:15.0241809Z $L__tmp272: 2026-02-21T09:18:15.0241968Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 
2026-02-21T09:18:15.0242027Z mov.b64 {%r4742, %r4743}, %rd565; 2026-02-21T09:18:15.0242095Z cvt.rn.bf16x2.f32 %r4744, %r4743, %r4742; 2026-02-21T09:18:15.0242156Z $L__tmp273: 2026-02-21T09:18:15.0242355Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0242414Z cvt.u64.u32 %rd566, %r4503; 2026-02-21T09:18:15.0242479Z cvt.u64.u32 %rd567, %r4504; 2026-02-21T09:18:15.0242536Z shl.b64 %rd568, %rd567, 32; 2026-02-21T09:18:15.0242597Z or.b64 %rd569, %rd566, %rd568; 2026-02-21T09:18:15.0242649Z $L__tmp274: 2026-02-21T09:18:15.0242816Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0242876Z mov.b64 {%r4745, %r4746}, %rd569; 2026-02-21T09:18:15.0242944Z cvt.rn.bf16x2.f32 %r4747, %r4746, %r4745; 2026-02-21T09:18:15.0243005Z $L__tmp275: 2026-02-21T09:18:15.0243205Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0243263Z cvt.u64.u32 %rd570, %r4505; 2026-02-21T09:18:15.0243329Z cvt.u64.u32 %rd571, %r4506; 2026-02-21T09:18:15.0243390Z shl.b64 %rd572, %rd571, 32; 2026-02-21T09:18:15.0243451Z or.b64 %rd573, %rd570, %rd572; 2026-02-21T09:18:15.0243504Z $L__tmp276: 2026-02-21T09:18:15.0243673Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0243734Z mov.b64 {%r4748, %r4749}, %rd573; 2026-02-21T09:18:15.0243804Z cvt.rn.bf16x2.f32 %r4750, %r4749, %r4748; 2026-02-21T09:18:15.0243866Z $L__tmp277: 2026-02-21T09:18:15.0244068Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0244125Z cvt.u64.u32 %rd574, %r4507; 2026-02-21T09:18:15.0244189Z cvt.u64.u32 %rd575, %r4508; 2026-02-21T09:18:15.0244245Z shl.b64 %rd576, %rd575, 32; 2026-02-21T09:18:15.0244306Z or.b64 %rd577, %rd574, %rd576; 2026-02-21T09:18:15.0244359Z $L__tmp278: 2026-02-21T09:18:15.0244528Z .loc 1 87 28 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0244590Z mov.b64 {%r4751, %r4752}, %rd577; 2026-02-21T09:18:15.0244658Z cvt.rn.bf16x2.f32 %r4753, %r4752, %r4751; 2026-02-21T09:18:15.0244721Z $L__tmp279: 2026-02-21T09:18:15.0244923Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0244983Z cvt.u64.u32 %rd578, %r4509; 2026-02-21T09:18:15.0245050Z cvt.u64.u32 %rd579, %r4510; 2026-02-21T09:18:15.0245108Z shl.b64 %rd580, %rd579, 32; 2026-02-21T09:18:15.0245168Z or.b64 %rd581, %rd578, %rd580; 2026-02-21T09:18:15.0245220Z $L__tmp280: 2026-02-21T09:18:15.0245383Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0245444Z mov.b64 {%r4754, %r4755}, %rd581; 2026-02-21T09:18:15.0245512Z cvt.rn.bf16x2.f32 %r4756, %r4755, %r4754; 2026-02-21T09:18:15.0245571Z $L__tmp281: 2026-02-21T09:18:15.0245771Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0245856Z cvt.u64.u32 %rd582, %r4511; 2026-02-21T09:18:15.0245920Z cvt.u64.u32 %rd583, %r4512; 2026-02-21T09:18:15.0246001Z shl.b64 %rd584, %rd583, 32; 2026-02-21T09:18:15.0246061Z or.b64 %rd585, %rd582, %rd584; 2026-02-21T09:18:15.0246139Z $L__tmp282: 2026-02-21T09:18:15.0246331Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0246392Z mov.b64 {%r4757, %r4758}, %rd585; 2026-02-21T09:18:15.0246459Z cvt.rn.bf16x2.f32 %r4759, %r4758, %r4757; 2026-02-21T09:18:15.0246518Z $L__tmp283: 2026-02-21T09:18:15.0246718Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0246776Z cvt.u64.u32 %rd586, %r4513; 2026-02-21T09:18:15.0246842Z cvt.u64.u32 %rd587, %r4514; 2026-02-21T09:18:15.0246899Z shl.b64 %rd588, %rd587, 32; 2026-02-21T09:18:15.0246960Z or.b64 %rd589, %rd586, %rd588; 
2026-02-21T09:18:15.0247012Z $L__tmp284: 2026-02-21T09:18:15.0247181Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0247242Z mov.b64 {%r4760, %r4761}, %rd589; 2026-02-21T09:18:15.0247310Z cvt.rn.bf16x2.f32 %r4762, %r4761, %r4760; 2026-02-21T09:18:15.0247371Z $L__tmp285: 2026-02-21T09:18:15.0247574Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0247633Z cvt.u64.u32 %rd590, %r4516; 2026-02-21T09:18:15.0247692Z cvt.u64.u32 %rd591, %r4517; 2026-02-21T09:18:15.0247757Z shl.b64 %rd592, %rd591, 32; 2026-02-21T09:18:15.0247816Z or.b64 %rd593, %rd590, %rd592; 2026-02-21T09:18:15.0247867Z $L__tmp286: 2026-02-21T09:18:15.0248034Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0248092Z mov.b64 {%r4763, %r4764}, %rd593; 2026-02-21T09:18:15.0248159Z cvt.rn.bf16x2.f32 %r4765, %r4764, %r4763; 2026-02-21T09:18:15.0248219Z $L__tmp287: 2026-02-21T09:18:15.0248419Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0248478Z cvt.u64.u32 %rd594, %r4518; 2026-02-21T09:18:15.0248536Z cvt.u64.u32 %rd595, %r4519; 2026-02-21T09:18:15.0248601Z shl.b64 %rd596, %rd595, 32; 2026-02-21T09:18:15.0248662Z or.b64 %rd597, %rd594, %rd596; 2026-02-21T09:18:15.0248713Z $L__tmp288: 2026-02-21T09:18:15.0248879Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0248937Z mov.b64 {%r4766, %r4767}, %rd597; 2026-02-21T09:18:15.0249003Z cvt.rn.bf16x2.f32 %r4768, %r4767, %r4766; 2026-02-21T09:18:15.0249060Z $L__tmp289: 2026-02-21T09:18:15.0249260Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0249319Z cvt.u64.u32 %rd598, %r4520; 2026-02-21T09:18:15.0249375Z cvt.u64.u32 %rd599, %r4521; 2026-02-21T09:18:15.0249439Z shl.b64 %rd600, %rd599, 32; 
2026-02-21T09:18:15.0249500Z or.b64 %rd601, %rd598, %rd600; 2026-02-21T09:18:15.0249551Z $L__tmp290: 2026-02-21T09:18:15.0249715Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0249775Z mov.b64 {%r4769, %r4770}, %rd601; 2026-02-21T09:18:15.0249842Z cvt.rn.bf16x2.f32 %r4771, %r4770, %r4769; 2026-02-21T09:18:15.0249901Z $L__tmp291: 2026-02-21T09:18:15.0250099Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0250157Z cvt.u64.u32 %rd602, %r4522; 2026-02-21T09:18:15.0250216Z cvt.u64.u32 %rd603, %r4523; 2026-02-21T09:18:15.0250280Z shl.b64 %rd604, %rd603, 32; 2026-02-21T09:18:15.0250338Z or.b64 %rd605, %rd602, %rd604; 2026-02-21T09:18:15.0250424Z $L__tmp292: 2026-02-21T09:18:15.0250585Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0250644Z mov.b64 {%r4772, %r4773}, %rd605; 2026-02-21T09:18:15.0250734Z cvt.rn.bf16x2.f32 %r4774, %r4773, %r4772; 2026-02-21T09:18:15.0250786Z $L__tmp293: 2026-02-21T09:18:15.0251033Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0251095Z cvt.u64.u32 %rd606, %r4524; 2026-02-21T09:18:15.0251152Z cvt.u64.u32 %rd607, %r4525; 2026-02-21T09:18:15.0251218Z shl.b64 %rd608, %rd607, 32; 2026-02-21T09:18:15.0251278Z or.b64 %rd609, %rd606, %rd608; 2026-02-21T09:18:15.0251330Z $L__tmp294: 2026-02-21T09:18:15.0251496Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0251577Z mov.b64 {%r4775, %r4776}, %rd609; 2026-02-21T09:18:15.0251645Z cvt.rn.bf16x2.f32 %r4777, %r4776, %r4775; 2026-02-21T09:18:15.0251699Z $L__tmp295: 2026-02-21T09:18:15.0251914Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0251975Z cvt.u64.u32 %rd610, %r4526; 2026-02-21T09:18:15.0252034Z cvt.u64.u32 %rd611, 
%r4527; 2026-02-21T09:18:15.0252102Z shl.b64 %rd612, %rd611, 32; 2026-02-21T09:18:15.0252163Z or.b64 %rd613, %rd610, %rd612; 2026-02-21T09:18:15.0252217Z $L__tmp296: 2026-02-21T09:18:15.0252384Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0252443Z mov.b64 {%r4778, %r4779}, %rd613; 2026-02-21T09:18:15.0252511Z cvt.rn.bf16x2.f32 %r4780, %r4779, %r4778; 2026-02-21T09:18:15.0252563Z $L__tmp297: 2026-02-21T09:18:15.0252775Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0252834Z cvt.u64.u32 %rd614, %r4528; 2026-02-21T09:18:15.0252893Z cvt.u64.u32 %rd615, %r4529; 2026-02-21T09:18:15.0252960Z shl.b64 %rd616, %rd615, 32; 2026-02-21T09:18:15.0253022Z or.b64 %rd617, %rd614, %rd616; 2026-02-21T09:18:15.0253075Z $L__tmp298: 2026-02-21T09:18:15.0253247Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0253310Z mov.b64 {%r4781, %r4782}, %rd617; 2026-02-21T09:18:15.0253382Z cvt.rn.bf16x2.f32 %r4783, %r4782, %r4781; 2026-02-21T09:18:15.0253439Z $L__tmp299: 2026-02-21T09:18:15.0253654Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0253713Z cvt.u64.u32 %rd618, %r4530; 2026-02-21T09:18:15.0253773Z cvt.u64.u32 %rd619, %r4531; 2026-02-21T09:18:15.0253840Z shl.b64 %rd620, %rd619, 32; 2026-02-21T09:18:15.0253900Z or.b64 %rd621, %rd618, %rd620; 2026-02-21T09:18:15.0253953Z $L__tmp300: 2026-02-21T09:18:15.0254110Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0254178Z mov.b64 {%r4784, %r4785}, %rd621; 2026-02-21T09:18:15.0254246Z cvt.rn.bf16x2.f32 %r4786, %r4785, %r4784; 2026-02-21T09:18:15.0254299Z $L__tmp301: 2026-02-21T09:18:15.0254511Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0254570Z cvt.u64.u32 
%rd622, %r4533; 2026-02-21T09:18:15.0254630Z cvt.u64.u32 %rd623, %r4534; 2026-02-21T09:18:15.0254694Z shl.b64 %rd624, %rd623, 32; 2026-02-21T09:18:15.0254753Z or.b64 %rd625, %rd622, %rd624; 2026-02-21T09:18:15.0254807Z $L__tmp302: 2026-02-21T09:18:15.0254970Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0255037Z mov.b64 {%r4787, %r4788}, %rd625; 2026-02-21T09:18:15.0255105Z cvt.rn.bf16x2.f32 %r4789, %r4788, %r4787; 2026-02-21T09:18:15.0255158Z $L__tmp303: 2026-02-21T09:18:15.0255368Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0255458Z cvt.u64.u32 %rd626, %r4535; 2026-02-21T09:18:15.0255539Z cvt.u64.u32 %rd627, %r4536; 2026-02-21T09:18:15.0255603Z shl.b64 %rd628, %rd627, 32; 2026-02-21T09:18:15.0255663Z or.b64 %rd629, %rd626, %rd628; 2026-02-21T09:18:15.0255739Z $L__tmp304: 2026-02-21T09:18:15.0255922Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0255990Z mov.b64 {%r4790, %r4791}, %rd629; 2026-02-21T09:18:15.0256058Z cvt.rn.bf16x2.f32 %r4792, %r4791, %r4790; 2026-02-21T09:18:15.0256110Z $L__tmp305: 2026-02-21T09:18:15.0256321Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0256378Z cvt.u64.u32 %rd630, %r4537; 2026-02-21T09:18:15.0256436Z cvt.u64.u32 %rd631, %r4538; 2026-02-21T09:18:15.0256501Z shl.b64 %rd632, %rd631, 32; 2026-02-21T09:18:15.0256563Z or.b64 %rd633, %rd630, %rd632; 2026-02-21T09:18:15.0256615Z $L__tmp306: 2026-02-21T09:18:15.0256777Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0256846Z mov.b64 {%r4793, %r4794}, %rd633; 2026-02-21T09:18:15.0256915Z cvt.rn.bf16x2.f32 %r4795, %r4794, %r4793; 2026-02-21T09:18:15.0256967Z $L__tmp307: 2026-02-21T09:18:15.0257176Z .loc 2 291 36 // standard.py:291:36 @[ 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0257234Z cvt.u64.u32 %rd634, %r4539; 2026-02-21T09:18:15.0257291Z cvt.u64.u32 %rd635, %r4540; 2026-02-21T09:18:15.0257357Z shl.b64 %rd636, %rd635, 32; 2026-02-21T09:18:15.0257417Z or.b64 %rd637, %rd634, %rd636; 2026-02-21T09:18:15.0257469Z $L__tmp308: 2026-02-21T09:18:15.0257628Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0257696Z mov.b64 {%r4796, %r4797}, %rd637; 2026-02-21T09:18:15.0257763Z cvt.rn.bf16x2.f32 %r4798, %r4797, %r4796; 2026-02-21T09:18:15.0257815Z $L__tmp309: 2026-02-21T09:18:15.0258023Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0258084Z cvt.u64.u32 %rd638, %r4541; 2026-02-21T09:18:15.0258142Z cvt.u64.u32 %rd639, %r4542; 2026-02-21T09:18:15.0258208Z shl.b64 %rd640, %rd639, 32; 2026-02-21T09:18:15.0258267Z or.b64 %rd641, %rd638, %rd640; 2026-02-21T09:18:15.0258319Z $L__tmp310: 2026-02-21T09:18:15.0258477Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0258542Z mov.b64 {%r4799, %r4800}, %rd641; 2026-02-21T09:18:15.0258609Z cvt.rn.bf16x2.f32 %r4801, %r4800, %r4799; 2026-02-21T09:18:15.0258661Z $L__tmp311: 2026-02-21T09:18:15.0258869Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0258927Z cvt.u64.u32 %rd642, %r4543; 2026-02-21T09:18:15.0258983Z cvt.u64.u32 %rd643, %r4544; 2026-02-21T09:18:15.0259040Z shl.b64 %rd644, %rd643, 32; 2026-02-21T09:18:15.0259109Z or.b64 %rd645, %rd642, %rd644; 2026-02-21T09:18:15.0259160Z $L__tmp312: 2026-02-21T09:18:15.0259320Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0259389Z mov.b64 {%r4802, %r4803}, %rd645; 2026-02-21T09:18:15.0259457Z cvt.rn.bf16x2.f32 %r4804, %r4803, %r4802; 2026-02-21T09:18:15.0259508Z $L__tmp313: 
2026-02-21T09:18:15.0259717Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0259775Z cvt.u64.u32 %rd646, %r4545; 2026-02-21T09:18:15.0259832Z cvt.u64.u32 %rd647, %r4546; 2026-02-21T09:18:15.0259889Z shl.b64 %rd648, %rd647, 32; 2026-02-21T09:18:15.0259954Z or.b64 %rd649, %rd646, %rd648; 2026-02-21T09:18:15.0260036Z $L__tmp314: 2026-02-21T09:18:15.0260195Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0260262Z mov.b64 {%r4805, %r4806}, %rd649; 2026-02-21T09:18:15.0260353Z cvt.rn.bf16x2.f32 %r4807, %r4806, %r4805; 2026-02-21T09:18:15.0260406Z $L__tmp315: 2026-02-21T09:18:15.0260658Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0260718Z cvt.u64.u32 %rd650, %r4547; 2026-02-21T09:18:15.0260776Z cvt.u64.u32 %rd651, %r4548; 2026-02-21T09:18:15.0260833Z shl.b64 %rd652, %rd651, 32; 2026-02-21T09:18:15.0260900Z or.b64 %rd653, %rd650, %rd652; 2026-02-21T09:18:15.0260952Z $L__tmp316: 2026-02-21T09:18:15.0261112Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0261179Z mov.b64 {%r4808, %r4809}, %rd653; 2026-02-21T09:18:15.0261245Z cvt.rn.bf16x2.f32 %r4810, %r4809, %r4808; 2026-02-21T09:18:15.0261402Z .loc 1 88 81 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:81 2026-02-21T09:18:15.0261466Z bar.sync 0; 2026-02-21T09:18:15.0261593Z st.shared.v4.b32 [%r27], {%r4717, %r4729, %r4741, %r4753}; 2026-02-21T09:18:15.0261693Z st.shared.v4.b32 [%r28], {%r4765, %r4777, %r4789, %r4801}; 2026-02-21T09:18:15.0261750Z bar.sync 0; 2026-02-21T09:18:15.0261819Z // begin inline asm 2026-02-21T09:18:15.0261980Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r4590, %r4594, %r4598, %r4602}, [%r1540]; 2026-02-21T09:18:15.0262038Z // end inline asm 2026-02-21T09:18:15.0262106Z // begin inline asm 2026-02-21T09:18:15.0262263Z 
ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r4606, %r4610, %r4614, %r4618}, [%r1545]; 2026-02-21T09:18:15.0262319Z // end inline asm 2026-02-21T09:18:15.0262372Z bar.sync 0; 2026-02-21T09:18:15.0262474Z st.shared.v4.b32 [%r27], {%r4720, %r4732, %r4744, %r4756}; 2026-02-21T09:18:15.0262567Z st.shared.v4.b32 [%r28], {%r4768, %r4780, %r4792, %r4804}; 2026-02-21T09:18:15.0262621Z bar.sync 0; 2026-02-21T09:18:15.0262687Z // begin inline asm 2026-02-21T09:18:15.0262839Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r4591, %r4595, %r4599, %r4603}, [%r1540]; 2026-02-21T09:18:15.0262896Z // end inline asm 2026-02-21T09:18:15.0262958Z // begin inline asm 2026-02-21T09:18:15.0263111Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r4607, %r4611, %r4615, %r4619}, [%r1545]; 2026-02-21T09:18:15.0263165Z // end inline asm 2026-02-21T09:18:15.0263218Z bar.sync 0; 2026-02-21T09:18:15.0263317Z st.shared.v4.b32 [%r27], {%r4723, %r4735, %r4747, %r4759}; 2026-02-21T09:18:15.0263406Z st.shared.v4.b32 [%r28], {%r4771, %r4783, %r4795, %r4807}; 2026-02-21T09:18:15.0263459Z bar.sync 0; 2026-02-21T09:18:15.0263523Z // begin inline asm 2026-02-21T09:18:15.0263669Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r4592, %r4596, %r4600, %r4604}, [%r1540]; 2026-02-21T09:18:15.0263725Z // end inline asm 2026-02-21T09:18:15.0263790Z // begin inline asm 2026-02-21T09:18:15.0263936Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r4608, %r4612, %r4616, %r4620}, [%r1545]; 2026-02-21T09:18:15.0263990Z // end inline asm 2026-02-21T09:18:15.0264046Z bar.sync 0; 2026-02-21T09:18:15.0264145Z st.shared.v4.b32 [%r27], {%r4726, %r4738, %r4750, %r4762}; 2026-02-21T09:18:15.0264236Z st.shared.v4.b32 [%r28], {%r4774, %r4786, %r4798, %r4810}; 2026-02-21T09:18:15.0264290Z bar.sync 0; 2026-02-21T09:18:15.0264356Z // begin inline asm 2026-02-21T09:18:15.0264503Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r4593, %r4597, %r4601, %r4605}, [%r1540]; 2026-02-21T09:18:15.0264558Z // end inline asm 2026-02-21T09:18:15.0264615Z // begin 
inline asm 2026-02-21T09:18:15.0264770Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r4609, %r4613, %r4617, %r4621}, [%r1545]; 2026-02-21T09:18:15.0264825Z // end inline asm 2026-02-21T09:18:15.0264880Z // begin inline asm 2026-02-21T09:18:15.0264996Z st.global.v4.b32 [ %rd517 + 0 ], { %r4590, %r4591, %r4592, %r4593 }; 2026-02-21T09:18:15.0265085Z // end inline asm 2026-02-21T09:18:15.0265141Z // begin inline asm 2026-02-21T09:18:15.0265255Z st.global.v4.b32 [ %rd518 + 0 ], { %r4594, %r4595, %r4596, %r4597 }; 2026-02-21T09:18:15.0265341Z // end inline asm 2026-02-21T09:18:15.0265397Z // begin inline asm 2026-02-21T09:18:15.0265527Z st.global.v4.b32 [ %rd519 + 0 ], { %r4598, %r4599, %r4600, %r4601 }; 2026-02-21T09:18:15.0265592Z // end inline asm 2026-02-21T09:18:15.0265685Z // begin inline asm 2026-02-21T09:18:15.0265787Z st.global.v4.b32 [ %rd520 + 0 ], { %r4602, %r4603, %r4604, %r4605 }; 2026-02-21T09:18:15.0265849Z // end inline asm 2026-02-21T09:18:15.0265905Z // begin inline asm 2026-02-21T09:18:15.0266003Z st.global.v4.b32 [ %rd521 + 0 ], { %r4606, %r4607, %r4608, %r4609 }; 2026-02-21T09:18:15.0266057Z // end inline asm 2026-02-21T09:18:15.0266123Z // begin inline asm 2026-02-21T09:18:15.0266222Z st.global.v4.b32 [ %rd522 + 0 ], { %r4610, %r4611, %r4612, %r4613 }; 2026-02-21T09:18:15.0266278Z // end inline asm 2026-02-21T09:18:15.0266340Z // begin inline asm 2026-02-21T09:18:15.0266438Z st.global.v4.b32 [ %rd523 + 0 ], { %r4614, %r4615, %r4616, %r4617 }; 2026-02-21T09:18:15.0266493Z // end inline asm 2026-02-21T09:18:15.0266558Z // begin inline asm 2026-02-21T09:18:15.0266658Z st.global.v4.b32 [ %rd524 + 0 ], { %r4618, %r4619, %r4620, %r4621 }; 2026-02-21T09:18:15.0266714Z // end inline asm 2026-02-21T09:18:15.0266887Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:15.0266959Z add.s32 %r4811, %r7842, 7104; 2026-02-21T09:18:15.0267121Z .loc 1 25 35 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:25:35 
2026-02-21T09:18:15.0267181Z shr.s32 %r4812, %r4811, 31; 2026-02-21T09:18:15.0267247Z shr.u32 %r4813, %r4812, 22; 2026-02-21T09:18:15.0267311Z add.s32 %r4814, %r4811, %r4813; 2026-02-21T09:18:15.0267369Z shr.s32 %r4815, %r4814, 10; 2026-02-21T09:18:15.0267537Z .loc 1 26 33 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:26:33 2026-02-21T09:18:15.0267601Z shl.b32 %r4816, %r4815, 4; 2026-02-21T09:18:15.0267761Z .loc 1 27 39 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:27:39 2026-02-21T09:18:15.0267820Z sub.s32 %r4817, 64, %r4816; 2026-02-21T09:18:15.0267988Z .loc 1 27 52 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:27:52 2026-02-21T09:18:15.0268047Z min.s32 %r4818, %r4817, 16; 2026-02-21T09:18:15.0268206Z .loc 1 28 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:45 2026-02-21T09:18:15.0268278Z and.b32 %r4819, %r4814, -1024; 2026-02-21T09:18:15.0268342Z sub.s32 %r4820, %r4811, %r4819; 2026-02-21T09:18:15.0268500Z .loc 1 29 51 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:29:51 2026-02-21T09:18:15.0268569Z div.s32 %r4821, %r4820, %r4818; 2026-02-21T09:18:15.0268730Z .loc 1 28 64 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:64 2026-02-21T09:18:15.0268796Z mul.lo.s32 %r4822, %r4821, %r4818; 2026-02-21T09:18:15.0268857Z sub.s32 %r4823, %r4820, %r4822; 2026-02-21T09:18:15.0269024Z .loc 1 28 30 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:30 2026-02-21T09:18:15.0269085Z add.s32 %r4824, %r4823, %r4816; 2026-02-21T09:18:15.0269245Z .loc 1 30 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:30:27 2026-02-21T09:18:15.0269315Z shl.b32 %r127, %r4824, 6; 2026-02-21T09:18:15.0269473Z .loc 1 32 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:32:27 2026-02-21T09:18:15.0269533Z shl.b32 %r128, %r4821, 7; 2026-02-21T09:18:15.0269603Z mov.pred %p234, -1; 2026-02-21T09:18:15.0269660Z mov.b32 %r4623, 0; 2026-02-21T09:18:15.0269713Z $L__tmp317: 
2026-02-21T09:18:15.0269935Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0270019Z // begin inline asm 2026-02-21T09:18:15.0270337Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 0], 64, {%r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623}; 2026-02-21T09:18:15.0270437Z // end inline asm 2026-02-21T09:18:15.0270503Z // begin inline asm 2026-02-21T09:18:15.0270836Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 16], 64, {%r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623}; 2026-02-21T09:18:15.0270892Z // end inline asm 2026-02-21T09:18:15.0270959Z // begin inline asm 2026-02-21T09:18:15.0271264Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 32], 64, {%r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623}; 2026-02-21T09:18:15.0271320Z // end inline asm 2026-02-21T09:18:15.0271389Z // begin inline asm 2026-02-21T09:18:15.0271732Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 48], 64, {%r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623, %r4623}; 2026-02-21T09:18:15.0271791Z // end inline asm 2026-02-21T09:18:15.0271856Z // begin inline asm 2026-02-21T09:18:15.0271928Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0271984Z // end inline asm 2026-02-21T09:18:15.0272039Z bar.sync 0; 2026-02-21T09:18:15.0272101Z $L__tmp318: 2026-02-21T09:18:15.0272267Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:15.0272329Z add.s32 %r7846, %r61, %r128; 2026-02-21T09:18:15.0272398Z or.b32 %r4825, %r7841, %r127; 2026-02-21T09:18:15.0272456Z shl.b32 %r4826, %r4825, 10; 2026-02-21T09:18:15.0272519Z mul.wide.s32 %rd36, %r4826, 2; 
2026-02-21T09:18:15.0272584Z shl.b32 %r4827, %r4824, 16; 2026-02-21T09:18:15.0272646Z or.b32 %r4828, %r63, %r4827; 2026-02-21T09:18:15.0272707Z mul.wide.s32 %rd37, %r4828, 2; 2026-02-21T09:18:15.0272767Z mov.pred %p392, 0; 2026-02-21T09:18:15.0272833Z mov.b64 %rd1055, -64; 2026-02-21T09:18:15.0272893Z mov.b64 %rd1054, %rd5; 2026-02-21T09:18:15.0272952Z bra.uni $L__BB0_33; 2026-02-21T09:18:15.0273064Z $L__BB0_41: // in Loop: Header=BB0_33 Depth=2 2026-02-21T09:18:15.0273118Z $L__tmp319: 2026-02-21T09:18:15.0273328Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0273384Z // begin inline asm 2026-02-21T09:18:15.0273443Z 2026-02-21T09:18:15.0273495Z { 2026-02-21T09:18:15.0273558Z .reg .pred complete; 2026-02-21T09:18:15.0273619Z waitLoop: 2026-02-21T09:18:15.0273739Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r5696; 2026-02-21T09:18:15.0273805Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0273855Z } 2026-02-21T09:18:15.0273868Z 2026-02-21T09:18:15.0273923Z // end inline asm 2026-02-21T09:18:15.0273977Z bar.sync 0; 2026-02-21T09:18:15.0274033Z // begin inline asm 2026-02-21T09:18:15.0274131Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:15.0274187Z // end inline asm 2026-02-21T09:18:15.0274239Z $L__tmp320: 2026-02-21T09:18:15.0274412Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:15.0274473Z add.s64 %rd1055, %rd1055, 64; 2026-02-21T09:18:15.0274533Z add.s32 %r7846, %r7846, 524288; 2026-02-21T09:18:15.0274592Z add.s64 %rd1054, %rd1054, 256; 2026-02-21T09:18:15.0274667Z setp.lt.u64 %p308, %rd1055, 448; 2026-02-21T09:18:15.0274729Z @%p308 bra $L__BB0_33; 2026-02-21T09:18:15.0274786Z bra.uni $L__BB0_42; 2026-02-21T09:18:15.0274893Z $L__BB0_33: // Parent Loop BB0_2 Depth=1 2026-02-21T09:18:15.0274991Z // => This Inner Loop Header: Depth=2 2026-02-21T09:18:15.0275179Z .loc 1 48 32 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0275248Z add.s64 %rd655, %rd1054, %rd37; 2026-02-21T09:18:15.0275431Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0275517Z add.s64 %rd658, %rd1054, %rd36; 2026-02-21T09:18:15.0275574Z // begin inline asm 2026-02-21T09:18:15.0275662Z mov.u64 %rd654, 0x0; 2026-02-21T09:18:15.0275775Z createpolicy.fractional.L2::evict_first.b64 %rd654, 1.0; 2026-02-21T09:18:15.0275831Z // end inline asm 2026-02-21T09:18:15.0275894Z // begin inline asm 2026-02-21T09:18:15.0275950Z mov.u32 %r4829, 0x0; 2026-02-21T09:18:15.0276005Z mov.u32 %r4830, 0x0; 2026-02-21T09:18:15.0276060Z mov.u32 %r4831, 0x0; 2026-02-21T09:18:15.0276121Z mov.u32 %r4832, 0x0; 2026-02-21T09:18:15.0276299Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r4829, %r4830, %r4831, %r4832 }, [ %rd655 + 0 ], %rd654; 2026-02-21T09:18:15.0276355Z // end inline asm 2026-02-21T09:18:15.0276418Z // begin inline asm 2026-02-21T09:18:15.0276474Z mov.u64 %rd657, 0x0; 2026-02-21T09:18:15.0276583Z createpolicy.fractional.L2::evict_first.b64 %rd657, 1.0; 2026-02-21T09:18:15.0276647Z // end inline asm 2026-02-21T09:18:15.0276704Z // begin inline asm 2026-02-21T09:18:15.0276760Z mov.u32 %r4833, 0x0; 2026-02-21T09:18:15.0276814Z mov.u32 %r4834, 0x0; 2026-02-21T09:18:15.0276877Z mov.u32 %r4835, 0x0; 2026-02-21T09:18:15.0276932Z mov.u32 %r4836, 0x0; 2026-02-21T09:18:15.0277130Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r4833, %r4834, %r4835, %r4836 }, [ %rd658 + 0 ], %rd657; 2026-02-21T09:18:15.0277196Z // end inline asm 2026-02-21T09:18:15.0277252Z $L__tmp321: 2026-02-21T09:18:15.0277472Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0277535Z bar.sync 0; 2026-02-21T09:18:15.0277637Z st.shared.v4.b32 [%r26], {%r4829, %r4830, %r4831, %r4832}; 2026-02-21T09:18:15.0277747Z st.shared.v4.b32 [%r26+512], {%r4833, %r4834, 
%r4835, %r4836}; 2026-02-21T09:18:15.0277804Z bar.sync 0; 2026-02-21T09:18:15.0277912Z ld.shared.v4.b32 {%r4995, %r4996, %r4997, %r4998}, [%r27]; 2026-02-21T09:18:15.0277983Z mov.b32 {%rs1537, %rs1538}, %r4998; 2026-02-21T09:18:15.0278052Z mov.b32 {%rs1539, %rs1540}, %r4997; 2026-02-21T09:18:15.0278125Z mov.b32 {%rs1541, %rs1542}, %r4996; 2026-02-21T09:18:15.0278190Z mov.b32 {%rs1543, %rs1544}, %r4995; 2026-02-21T09:18:15.0278288Z ld.shared.v4.b32 {%r4999, %r5000, %r5001, %r5002}, [%r28]; 2026-02-21T09:18:15.0278359Z mov.b32 {%rs1545, %rs1546}, %r5002; 2026-02-21T09:18:15.0278422Z mov.b32 {%rs1547, %rs1548}, %r5001; 2026-02-21T09:18:15.0278485Z mov.b32 {%rs1549, %rs1550}, %r5000; 2026-02-21T09:18:15.0278547Z mov.b32 {%rs1551, %rs1552}, %r4999; 2026-02-21T09:18:15.0278610Z $L__tmp322: 2026-02-21T09:18:15.0278783Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0278852Z cvt.f32.bf16 %r4978, %rs1543; 2026-02-21T09:18:15.0278923Z cvt.f32.bf16 %r4979, %rs1544; 2026-02-21T09:18:15.0278986Z cvt.f32.bf16 %r4980, %rs1541; 2026-02-21T09:18:15.0279047Z cvt.f32.bf16 %r4981, %rs1542; 2026-02-21T09:18:15.0279110Z cvt.f32.bf16 %r4982, %rs1539; 2026-02-21T09:18:15.0279182Z cvt.f32.bf16 %r4983, %rs1540; 2026-02-21T09:18:15.0279244Z cvt.f32.bf16 %r4984, %rs1537; 2026-02-21T09:18:15.0279308Z cvt.f32.bf16 %r4985, %rs1538; 2026-02-21T09:18:15.0279378Z cvt.f32.bf16 %r4986, %rs1551; 2026-02-21T09:18:15.0279441Z cvt.f32.bf16 %r4987, %rs1552; 2026-02-21T09:18:15.0279502Z cvt.f32.bf16 %r4988, %rs1549; 2026-02-21T09:18:15.0279569Z cvt.f32.bf16 %r4989, %rs1550; 2026-02-21T09:18:15.0279630Z cvt.f32.bf16 %r4990, %rs1547; 2026-02-21T09:18:15.0279690Z cvt.f32.bf16 %r4991, %rs1548; 2026-02-21T09:18:15.0279749Z cvt.f32.bf16 %r4992, %rs1545; 2026-02-21T09:18:15.0279818Z cvt.f32.bf16 %r4993, %rs1546; 2026-02-21T09:18:15.0280012Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:15.0280077Z cvt.s64.s32 
%rd663, %r7846; 2026-02-21T09:18:15.0280148Z add.s64 %rd661, %rd56, %rd663; 2026-02-21T09:18:15.0280337Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0280420Z // begin inline asm 2026-02-21T09:18:15.0280480Z mov.u64 %rd660, 0x0; 2026-02-21T09:18:15.0280619Z createpolicy.fractional.L2::evict_first.b64 %rd660, 1.0; 2026-02-21T09:18:15.0280679Z // end inline asm 2026-02-21T09:18:15.0280740Z // begin inline asm 2026-02-21T09:18:15.0280807Z mov.u32 %r4837, 0x0; 2026-02-21T09:18:15.0280864Z mov.u32 %r4838, 0x0; 2026-02-21T09:18:15.0280922Z mov.u32 %r4839, 0x0; 2026-02-21T09:18:15.0280985Z mov.u32 %r4840, 0x0; 2026-02-21T09:18:15.0281164Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r4837, %r4838, %r4839, %r4840 }, [ %rd661 + 0 ], %rd660; 2026-02-21T09:18:15.0281223Z // end inline asm 2026-02-21T09:18:15.0281293Z prmt.b32 %r5003, %r4837, 0, 0x8880U; 2026-02-21T09:18:15.0281368Z cvt.u16.u32 %rs1553, %r5003; 2026-02-21T09:18:15.0281436Z prmt.b32 %r5004, %r4837, 0, 0x7770U; 2026-02-21T09:18:15.0281501Z cvt.u16.u32 %rs1554, %r5004; 2026-02-21T09:18:15.0281600Z prmt.b32 %r5005, %r4837, 0, 0x9991U; 2026-02-21T09:18:15.0281666Z cvt.u16.u32 %rs1555, %r5005; 2026-02-21T09:18:15.0281730Z prmt.b32 %r5006, %r4837, 0, 0x7771U; 2026-02-21T09:18:15.0281802Z cvt.u16.u32 %rs1556, %r5006; 2026-02-21T09:18:15.0281865Z prmt.b32 %r5007, %r4837, 0, 0xaaa2U; 2026-02-21T09:18:15.0281927Z cvt.u16.u32 %rs1557, %r5007; 2026-02-21T09:18:15.0281991Z prmt.b32 %r5008, %r4837, 0, 0x7772U; 2026-02-21T09:18:15.0282061Z cvt.u16.u32 %rs1558, %r5008; 2026-02-21T09:18:15.0282125Z prmt.b32 %r5009, %r4837, 0, 0xbbb3U; 2026-02-21T09:18:15.0282186Z cvt.u16.u32 %rs1559, %r5009; 2026-02-21T09:18:15.0282255Z prmt.b32 %r5010, %r4837, 0, 0x7773U; 2026-02-21T09:18:15.0282316Z cvt.u16.u32 %rs1560, %r5010; 2026-02-21T09:18:15.0282381Z prmt.b32 %r5011, %r4838, 0, 0x8880U; 2026-02-21T09:18:15.0282442Z cvt.u16.u32 %rs1561, %r5011; 2026-02-21T09:18:15.0282514Z 
prmt.b32 %r5012, %r4838, 0, 0x7770U; 2026-02-21T09:18:15.0282576Z cvt.u16.u32 %rs1562, %r5012; 2026-02-21T09:18:15.0282640Z prmt.b32 %r5013, %r4838, 0, 0x9991U; 2026-02-21T09:18:15.0282709Z cvt.u16.u32 %rs1563, %r5013; 2026-02-21T09:18:15.0282773Z prmt.b32 %r5014, %r4838, 0, 0x7771U; 2026-02-21T09:18:15.0282835Z cvt.u16.u32 %rs1564, %r5014; 2026-02-21T09:18:15.0282899Z prmt.b32 %r5015, %r4838, 0, 0xaaa2U; 2026-02-21T09:18:15.0282965Z cvt.u16.u32 %rs1565, %r5015; 2026-02-21T09:18:15.0283027Z prmt.b32 %r5016, %r4838, 0, 0x7772U; 2026-02-21T09:18:15.0283088Z cvt.u16.u32 %rs1566, %r5016; 2026-02-21T09:18:15.0283158Z prmt.b32 %r5017, %r4838, 0, 0xbbb3U; 2026-02-21T09:18:15.0283219Z cvt.u16.u32 %rs1567, %r5017; 2026-02-21T09:18:15.0283281Z prmt.b32 %r5018, %r4838, 0, 0x7773U; 2026-02-21T09:18:15.0283348Z cvt.u16.u32 %rs1568, %r5018; 2026-02-21T09:18:15.0283411Z prmt.b32 %r5019, %r4839, 0, 0x8880U; 2026-02-21T09:18:15.0283470Z cvt.u16.u32 %rs1569, %r5019; 2026-02-21T09:18:15.0283533Z prmt.b32 %r5020, %r4839, 0, 0x7770U; 2026-02-21T09:18:15.0283603Z cvt.u16.u32 %rs1570, %r5020; 2026-02-21T09:18:15.0283666Z prmt.b32 %r5021, %r4839, 0, 0x9991U; 2026-02-21T09:18:15.0283728Z cvt.u16.u32 %rs1571, %r5021; 2026-02-21T09:18:15.0283798Z prmt.b32 %r5022, %r4839, 0, 0x7771U; 2026-02-21T09:18:15.0283860Z cvt.u16.u32 %rs1572, %r5022; 2026-02-21T09:18:15.0283923Z prmt.b32 %r5023, %r4839, 0, 0xaaa2U; 2026-02-21T09:18:15.0283983Z cvt.u16.u32 %rs1573, %r5023; 2026-02-21T09:18:15.0284054Z prmt.b32 %r5024, %r4839, 0, 0x7772U; 2026-02-21T09:18:15.0284115Z cvt.u16.u32 %rs1574, %r5024; 2026-02-21T09:18:15.0284176Z prmt.b32 %r5025, %r4839, 0, 0xbbb3U; 2026-02-21T09:18:15.0284243Z cvt.u16.u32 %rs1575, %r5025; 2026-02-21T09:18:15.0284306Z prmt.b32 %r5026, %r4839, 0, 0x7773U; 2026-02-21T09:18:15.0284367Z cvt.u16.u32 %rs1576, %r5026; 2026-02-21T09:18:15.0284465Z prmt.b32 %r5027, %r4840, 0, 0x8880U; 2026-02-21T09:18:15.0284526Z cvt.u16.u32 %rs1577, %r5027; 2026-02-21T09:18:15.0284590Z prmt.b32 
%r5028, %r4840, 0, 0x7770U; 2026-02-21T09:18:15.0284681Z cvt.u16.u32 %rs1578, %r5028; 2026-02-21T09:18:15.0284752Z prmt.b32 %r5029, %r4840, 0, 0x9991U; 2026-02-21T09:18:15.0284849Z cvt.u16.u32 %rs1579, %r5029; 2026-02-21T09:18:15.0284916Z prmt.b32 %r5030, %r4840, 0, 0x7771U; 2026-02-21T09:18:15.0285010Z cvt.u16.u32 %rs1580, %r5030; 2026-02-21T09:18:15.0285084Z prmt.b32 %r5031, %r4840, 0, 0xaaa2U; 2026-02-21T09:18:15.0285143Z cvt.u16.u32 %rs1581, %r5031; 2026-02-21T09:18:15.0285204Z prmt.b32 %r5032, %r4840, 0, 0x7772U; 2026-02-21T09:18:15.0285269Z cvt.u16.u32 %rs1582, %r5032; 2026-02-21T09:18:15.0285329Z prmt.b32 %r5033, %r4840, 0, 0xbbb3U; 2026-02-21T09:18:15.0285387Z cvt.u16.u32 %rs1583, %r5033; 2026-02-21T09:18:15.0285453Z prmt.b32 %r5034, %r4840, 0, 0x7773U; 2026-02-21T09:18:15.0285512Z cvt.u16.u32 %rs1584, %r5034; 2026-02-21T09:18:15.0285677Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0285746Z shl.b16 %rs1585, %rs1554, 12; 2026-02-21T09:18:15.0285807Z shr.s16 %rs1586, %rs1585, 12; 2026-02-21T09:18:15.0285866Z shl.b16 %rs1587, %rs1556, 12; 2026-02-21T09:18:15.0285925Z shr.s16 %rs1588, %rs1587, 12; 2026-02-21T09:18:15.0285991Z shl.b16 %rs1589, %rs1558, 12; 2026-02-21T09:18:15.0286049Z shr.s16 %rs1590, %rs1589, 12; 2026-02-21T09:18:15.0286106Z shl.b16 %rs1591, %rs1560, 12; 2026-02-21T09:18:15.0286173Z shr.s16 %rs1592, %rs1591, 12; 2026-02-21T09:18:15.0286232Z shl.b16 %rs1593, %rs1562, 12; 2026-02-21T09:18:15.0286289Z shr.s16 %rs1594, %rs1593, 12; 2026-02-21T09:18:15.0286348Z shl.b16 %rs1595, %rs1564, 12; 2026-02-21T09:18:15.0286416Z shr.s16 %rs1596, %rs1595, 12; 2026-02-21T09:18:15.0286475Z shl.b16 %rs1597, %rs1566, 12; 2026-02-21T09:18:15.0286533Z shr.s16 %rs1598, %rs1597, 12; 2026-02-21T09:18:15.0286599Z shl.b16 %rs1599, %rs1568, 12; 2026-02-21T09:18:15.0286656Z shr.s16 %rs1600, %rs1599, 12; 2026-02-21T09:18:15.0286713Z shl.b16 %rs1601, %rs1570, 12; 2026-02-21T09:18:15.0286770Z shr.s16 %rs1602, %rs1601, 12; 
2026-02-21T09:18:15.0286835Z shl.b16 %rs1603, %rs1572, 12; 2026-02-21T09:18:15.0286890Z shr.s16 %rs1604, %rs1603, 12; 2026-02-21T09:18:15.0286948Z shl.b16 %rs1605, %rs1574, 12; 2026-02-21T09:18:15.0287013Z shr.s16 %rs1606, %rs1605, 12; 2026-02-21T09:18:15.0287071Z shl.b16 %rs1607, %rs1576, 12; 2026-02-21T09:18:15.0287127Z shr.s16 %rs1608, %rs1607, 12; 2026-02-21T09:18:15.0287184Z shl.b16 %rs1609, %rs1578, 12; 2026-02-21T09:18:15.0287248Z shr.s16 %rs1610, %rs1609, 12; 2026-02-21T09:18:15.0287304Z shl.b16 %rs1611, %rs1580, 12; 2026-02-21T09:18:15.0287360Z shr.s16 %rs1612, %rs1611, 12; 2026-02-21T09:18:15.0287424Z shl.b16 %rs1613, %rs1582, 12; 2026-02-21T09:18:15.0287480Z shr.s16 %rs1614, %rs1613, 12; 2026-02-21T09:18:15.0287539Z shl.b16 %rs1615, %rs1584, 12; 2026-02-21T09:18:15.0287605Z shr.s16 %rs1616, %rs1615, 12; 2026-02-21T09:18:15.0287764Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0287823Z shr.u16 %rs1617, %rs1553, 4; 2026-02-21T09:18:15.0287882Z shr.u16 %rs1618, %rs1555, 4; 2026-02-21T09:18:15.0287948Z shr.u16 %rs1619, %rs1557, 4; 2026-02-21T09:18:15.0288007Z shr.u16 %rs1620, %rs1559, 4; 2026-02-21T09:18:15.0288066Z shr.u16 %rs1621, %rs1561, 4; 2026-02-21T09:18:15.0288129Z shr.u16 %rs1622, %rs1563, 4; 2026-02-21T09:18:15.0288186Z shr.u16 %rs1623, %rs1565, 4; 2026-02-21T09:18:15.0288241Z shr.u16 %rs1624, %rs1567, 4; 2026-02-21T09:18:15.0288299Z shr.u16 %rs1625, %rs1569, 4; 2026-02-21T09:18:15.0288364Z shr.u16 %rs1626, %rs1571, 4; 2026-02-21T09:18:15.0288421Z shr.u16 %rs1627, %rs1573, 4; 2026-02-21T09:18:15.0288478Z shr.u16 %rs1628, %rs1575, 4; 2026-02-21T09:18:15.0288543Z shr.u16 %rs1629, %rs1577, 4; 2026-02-21T09:18:15.0288599Z shr.u16 %rs1630, %rs1579, 4; 2026-02-21T09:18:15.0288680Z shr.u16 %rs1631, %rs1581, 4; 2026-02-21T09:18:15.0288744Z shr.u16 %rs1632, %rs1583, 4; 2026-02-21T09:18:15.0288902Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0288975Z 
bar.sync 0; 2026-02-21T09:18:15.0289036Z st.shared.b8 [%r29], %rs1586; 2026-02-21T09:18:15.0289125Z st.shared.b8 [%r30], %rs1588; 2026-02-21T09:18:15.0289208Z st.shared.b8 [%r31], %rs1590; 2026-02-21T09:18:15.0289269Z st.shared.b8 [%r32], %rs1592; 2026-02-21T09:18:15.0289338Z st.shared.b8 [%r33+512], %rs1594; 2026-02-21T09:18:15.0289400Z st.shared.b8 [%r34+512], %rs1596; 2026-02-21T09:18:15.0289462Z st.shared.b8 [%r35+512], %rs1598; 2026-02-21T09:18:15.0289522Z st.shared.b8 [%r36+512], %rs1600; 2026-02-21T09:18:15.0289592Z st.shared.b8 [%r37+1024], %rs1602; 2026-02-21T09:18:15.0289655Z st.shared.b8 [%r38+1024], %rs1604; 2026-02-21T09:18:15.0289716Z st.shared.b8 [%r39+1024], %rs1606; 2026-02-21T09:18:15.0289786Z st.shared.b8 [%r40+1024], %rs1608; 2026-02-21T09:18:15.0289847Z st.shared.b8 [%r41+1536], %rs1610; 2026-02-21T09:18:15.0289905Z st.shared.b8 [%r42+1536], %rs1612; 2026-02-21T09:18:15.0289965Z st.shared.b8 [%r43+1536], %rs1614; 2026-02-21T09:18:15.0290033Z st.shared.b8 [%r44+1536], %rs1616; 2026-02-21T09:18:15.0290087Z bar.sync 0; 2026-02-21T09:18:15.0290149Z ld.shared.b32 %r5035, [%r45]; 2026-02-21T09:18:15.0290214Z ld.shared.b32 %r5036, [%r46]; 2026-02-21T09:18:15.0290274Z ld.shared.b32 %r5037, [%r47]; 2026-02-21T09:18:15.0290335Z ld.shared.b32 %r5038, [%r48]; 2026-02-21T09:18:15.0290502Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0290554Z bar.sync 0; 2026-02-21T09:18:15.0290612Z st.shared.b8 [%r29], %rs1617; 2026-02-21T09:18:15.0290671Z st.shared.b8 [%r30], %rs1618; 2026-02-21T09:18:15.0290735Z st.shared.b8 [%r31], %rs1619; 2026-02-21T09:18:15.0290793Z st.shared.b8 [%r32], %rs1620; 2026-02-21T09:18:15.0290855Z st.shared.b8 [%r33+512], %rs1621; 2026-02-21T09:18:15.0290922Z st.shared.b8 [%r34+512], %rs1622; 2026-02-21T09:18:15.0290982Z st.shared.b8 [%r35+512], %rs1623; 2026-02-21T09:18:15.0291042Z st.shared.b8 [%r36+512], %rs1624; 2026-02-21T09:18:15.0291103Z st.shared.b8 [%r37+1024], %rs1625; 
2026-02-21T09:18:15.0291170Z st.shared.b8 [%r38+1024], %rs1626; 2026-02-21T09:18:15.0291231Z st.shared.b8 [%r39+1024], %rs1627; 2026-02-21T09:18:15.0291291Z st.shared.b8 [%r40+1024], %rs1628; 2026-02-21T09:18:15.0291361Z st.shared.b8 [%r41+1536], %rs1629; 2026-02-21T09:18:15.0291420Z st.shared.b8 [%r42+1536], %rs1630; 2026-02-21T09:18:15.0291479Z st.shared.b8 [%r43+1536], %rs1631; 2026-02-21T09:18:15.0291580Z st.shared.b8 [%r44+1536], %rs1632; 2026-02-21T09:18:15.0291635Z bar.sync 0; 2026-02-21T09:18:15.0291694Z ld.shared.b32 %r5039, [%r45]; 2026-02-21T09:18:15.0291754Z ld.shared.b32 %r5040, [%r46]; 2026-02-21T09:18:15.0291822Z ld.shared.b32 %r5041, [%r47]; 2026-02-21T09:18:15.0291882Z ld.shared.b32 %r5042, [%r48]; 2026-02-21T09:18:15.0292040Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0292109Z cvt.s8.s32 %rs1633, %r5036; 2026-02-21T09:18:15.0292173Z cvt.rn.f32.s16 %r5043, %rs1633; 2026-02-21T09:18:15.0292234Z cvt.s8.s32 %rs1634, %r5035; 2026-02-21T09:18:15.0292297Z cvt.rn.f32.s16 %r5044, %rs1634; 2026-02-21T09:18:15.0292363Z cvt.s8.s32 %rs1635, %r5040; 2026-02-21T09:18:15.0292425Z cvt.rn.f32.s16 %r5045, %rs1635; 2026-02-21T09:18:15.0292484Z cvt.s8.s32 %rs1636, %r5039; 2026-02-21T09:18:15.0292552Z cvt.rn.f32.s16 %r5046, %rs1636; 2026-02-21T09:18:15.0292609Z cvt.s8.s32 %rs1637, %r5038; 2026-02-21T09:18:15.0292668Z cvt.rn.f32.s16 %r5047, %rs1637; 2026-02-21T09:18:15.0292727Z cvt.s8.s32 %rs1638, %r5037; 2026-02-21T09:18:15.0292796Z cvt.rn.f32.s16 %r5048, %rs1638; 2026-02-21T09:18:15.0292854Z cvt.s8.s32 %rs1639, %r5042; 2026-02-21T09:18:15.0292916Z cvt.rn.f32.s16 %r5049, %rs1639; 2026-02-21T09:18:15.0293015Z cvt.s8.s32 %rs1640, %r5041; 2026-02-21T09:18:15.0293077Z cvt.rn.f32.s16 %r5050, %rs1640; 2026-02-21T09:18:15.0293143Z prmt.b32 %r5051, %r5036, 0, 0x9991U; 2026-02-21T09:18:15.0293234Z cvt.u16.u32 %rs1641, %r5051; 2026-02-21T09:18:15.0293295Z cvt.rn.f32.s16 %r5052, %rs1641; 2026-02-21T09:18:15.0293382Z prmt.b32 
%r5053, %r5035, 0, 0x9991U; 2026-02-21T09:18:15.0293442Z cvt.u16.u32 %rs1642, %r5053; 2026-02-21T09:18:15.0293530Z cvt.rn.f32.s16 %r5054, %rs1642; 2026-02-21T09:18:15.0293594Z prmt.b32 %r5055, %r5040, 0, 0x9991U; 2026-02-21T09:18:15.0293653Z cvt.u16.u32 %rs1643, %r5055; 2026-02-21T09:18:15.0293719Z cvt.rn.f32.s16 %r5056, %rs1643; 2026-02-21T09:18:15.0293780Z prmt.b32 %r5057, %r5039, 0, 0x9991U; 2026-02-21T09:18:15.0293839Z cvt.u16.u32 %rs1644, %r5057; 2026-02-21T09:18:15.0293897Z cvt.rn.f32.s16 %r5058, %rs1644; 2026-02-21T09:18:15.0293965Z prmt.b32 %r5059, %r5038, 0, 0x9991U; 2026-02-21T09:18:15.0294023Z cvt.u16.u32 %rs1645, %r5059; 2026-02-21T09:18:15.0294085Z cvt.rn.f32.s16 %r5060, %rs1645; 2026-02-21T09:18:15.0294153Z prmt.b32 %r5061, %r5037, 0, 0x9991U; 2026-02-21T09:18:15.0294211Z cvt.u16.u32 %rs1646, %r5061; 2026-02-21T09:18:15.0294272Z cvt.rn.f32.s16 %r5062, %rs1646; 2026-02-21T09:18:15.0294340Z prmt.b32 %r5063, %r5042, 0, 0x9991U; 2026-02-21T09:18:15.0294398Z cvt.u16.u32 %rs1647, %r5063; 2026-02-21T09:18:15.0294459Z cvt.rn.f32.s16 %r5064, %rs1647; 2026-02-21T09:18:15.0294521Z prmt.b32 %r5065, %r5041, 0, 0x9991U; 2026-02-21T09:18:15.0294587Z cvt.u16.u32 %rs1648, %r5065; 2026-02-21T09:18:15.0294647Z cvt.rn.f32.s16 %r5066, %rs1648; 2026-02-21T09:18:15.0294708Z prmt.b32 %r5067, %r5036, 0, 0xaaa2U; 2026-02-21T09:18:15.0294774Z cvt.u16.u32 %rs1649, %r5067; 2026-02-21T09:18:15.0294832Z cvt.rn.f32.s16 %r5068, %rs1649; 2026-02-21T09:18:15.0294892Z prmt.b32 %r5069, %r5035, 0, 0xaaa2U; 2026-02-21T09:18:15.0294951Z cvt.u16.u32 %rs1650, %r5069; 2026-02-21T09:18:15.0295017Z cvt.rn.f32.s16 %r5070, %rs1650; 2026-02-21T09:18:15.0295079Z prmt.b32 %r5071, %r5040, 0, 0xaaa2U; 2026-02-21T09:18:15.0295137Z cvt.u16.u32 %rs1651, %r5071; 2026-02-21T09:18:15.0295204Z cvt.rn.f32.s16 %r5072, %rs1651; 2026-02-21T09:18:15.0295263Z prmt.b32 %r5073, %r5039, 0, 0xaaa2U; 2026-02-21T09:18:15.0295323Z cvt.u16.u32 %rs1652, %r5073; 2026-02-21T09:18:15.0295382Z cvt.rn.f32.s16 %r5074, %rs1652; 
2026-02-21T09:18:15.0295451Z prmt.b32 %r5075, %r5038, 0, 0xaaa2U; 2026-02-21T09:18:15.0295510Z cvt.u16.u32 %rs1653, %r5075; 2026-02-21T09:18:15.0295570Z cvt.rn.f32.s16 %r5076, %rs1653; 2026-02-21T09:18:15.0295638Z prmt.b32 %r5077, %r5037, 0, 0xaaa2U; 2026-02-21T09:18:15.0295696Z cvt.u16.u32 %rs1654, %r5077; 2026-02-21T09:18:15.0295755Z cvt.rn.f32.s16 %r5078, %rs1654; 2026-02-21T09:18:15.0295822Z prmt.b32 %r5079, %r5042, 0, 0xaaa2U; 2026-02-21T09:18:15.0295880Z cvt.u16.u32 %rs1655, %r5079; 2026-02-21T09:18:15.0295939Z cvt.rn.f32.s16 %r5080, %rs1655; 2026-02-21T09:18:15.0296000Z prmt.b32 %r5081, %r5041, 0, 0xaaa2U; 2026-02-21T09:18:15.0296065Z cvt.u16.u32 %rs1656, %r5081; 2026-02-21T09:18:15.0296124Z cvt.rn.f32.s16 %r5082, %rs1656; 2026-02-21T09:18:15.0296196Z prmt.b32 %r5083, %r5036, 0, 0xbbb3U; 2026-02-21T09:18:15.0296260Z cvt.u16.u32 %rs1657, %r5083; 2026-02-21T09:18:15.0296322Z cvt.rn.f32.s16 %r5084, %rs1657; 2026-02-21T09:18:15.0296381Z prmt.b32 %r5085, %r5035, 0, 0xbbb3U; 2026-02-21T09:18:15.0296440Z cvt.u16.u32 %rs1658, %r5085; 2026-02-21T09:18:15.0296506Z cvt.rn.f32.s16 %r5086, %rs1658; 2026-02-21T09:18:15.0296567Z prmt.b32 %r5087, %r5040, 0, 0xbbb3U; 2026-02-21T09:18:15.0296625Z cvt.u16.u32 %rs1659, %r5087; 2026-02-21T09:18:15.0296690Z cvt.rn.f32.s16 %r5088, %rs1659; 2026-02-21T09:18:15.0296749Z prmt.b32 %r5089, %r5039, 0, 0xbbb3U; 2026-02-21T09:18:15.0296805Z cvt.u16.u32 %rs1660, %r5089; 2026-02-21T09:18:15.0296871Z cvt.rn.f32.s16 %r5090, %rs1660; 2026-02-21T09:18:15.0296931Z prmt.b32 %r5091, %r5038, 0, 0xbbb3U; 2026-02-21T09:18:15.0296988Z cvt.u16.u32 %rs1661, %r5091; 2026-02-21T09:18:15.0297068Z cvt.rn.f32.s16 %r5092, %rs1661; 2026-02-21T09:18:15.0297135Z prmt.b32 %r5093, %r5037, 0, 0xbbb3U; 2026-02-21T09:18:15.0297193Z cvt.u16.u32 %rs1662, %r5093; 2026-02-21T09:18:15.0297253Z cvt.rn.f32.s16 %r5094, %rs1662; 2026-02-21T09:18:15.0297340Z prmt.b32 %r5095, %r5042, 0, 0xbbb3U; 2026-02-21T09:18:15.0297398Z cvt.u16.u32 %rs1663, %r5095; 
2026-02-21T09:18:15.0297478Z cvt.rn.f32.s16 %r5096, %rs1663; 2026-02-21T09:18:15.0297558Z prmt.b32 %r5097, %r5041, 0, 0xbbb3U; 2026-02-21T09:18:15.0297625Z cvt.u16.u32 %rs1664, %r5097; 2026-02-21T09:18:15.0297683Z cvt.rn.f32.s16 %r5098, %rs1664; 2026-02-21T09:18:15.0297737Z bar.sync 0; 2026-02-21T09:18:15.0297842Z st.shared.v4.b32 [%r49], {%r5044, %r5046, %r5043, %r5045}; 2026-02-21T09:18:15.0297938Z st.shared.v4.b32 [%r50], {%r5048, %r5050, %r5047, %r5049}; 2026-02-21T09:18:15.0298030Z st.shared.v4.b32 [%r51], {%r5054, %r5058, %r5052, %r5056}; 2026-02-21T09:18:15.0298126Z st.shared.v4.b32 [%r52], {%r5062, %r5066, %r5060, %r5064}; 2026-02-21T09:18:15.0298216Z st.shared.v4.b32 [%r53], {%r5070, %r5074, %r5068, %r5072}; 2026-02-21T09:18:15.0298304Z st.shared.v4.b32 [%r54], {%r5078, %r5082, %r5076, %r5080}; 2026-02-21T09:18:15.0298390Z st.shared.v4.b32 [%r55], {%r5086, %r5090, %r5084, %r5088}; 2026-02-21T09:18:15.0298486Z st.shared.v4.b32 [%r56], {%r5094, %r5098, %r5092, %r5096}; 2026-02-21T09:18:15.0298541Z $L__tmp323: 2026-02-21T09:18:15.0298760Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0298828Z // begin inline asm 2026-02-21T09:18:15.0299137Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4841, %r4842, %r4843, %r4844, %r4845, %r4846, %r4847, %r4848, %r4849, %r4850, %r4851, %r4852, %r4853, %r4854, %r4855, %r4856}, [%r4673 + 0], 64; 2026-02-21T09:18:15.0299195Z // end inline asm 2026-02-21T09:18:15.0299260Z // begin inline asm 2026-02-21T09:18:15.0299561Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4858, %r4859, %r4860, %r4861, %r4862, %r4863, %r4864, %r4865, %r4866, %r4867, %r4868, %r4869, %r4870, %r4871, %r4872, %r4873}, [%r4673 + 16], 64; 2026-02-21T09:18:15.0299619Z // end inline asm 2026-02-21T09:18:15.0299684Z // begin inline asm 2026-02-21T09:18:15.0299982Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4875, %r4876, %r4877, %r4878, %r4879, %r4880, %r4881, %r4882, %r4883, %r4884, 
%r4885, %r4886, %r4887, %r4888, %r4889, %r4890}, [%r4673 + 32], 64; 2026-02-21T09:18:15.0300037Z // end inline asm 2026-02-21T09:18:15.0300094Z // begin inline asm 2026-02-21T09:18:15.0300401Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r4892, %r4893, %r4894, %r4895, %r4896, %r4897, %r4898, %r4899, %r4900, %r4901, %r4902, %r4903, %r4904, %r4905, %r4906, %r4907}, [%r4673 + 48], 64; 2026-02-21T09:18:15.0300458Z // end inline asm 2026-02-21T09:18:15.0300516Z // begin inline asm 2026-02-21T09:18:15.0300597Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0300662Z // end inline asm 2026-02-21T09:18:15.0300718Z // begin inline asm 2026-02-21T09:18:15.0301032Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 0], 64, {%r4841, %r4842, %r4843, %r4844, %r4845, %r4846, %r4847, %r4848, %r4849, %r4850, %r4851, %r4852, %r4853, %r4854, %r4855, %r4856}; 2026-02-21T09:18:15.0301089Z // end inline asm 2026-02-21T09:18:15.0301144Z // begin inline asm 2026-02-21T09:18:15.0301456Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 16], 64, {%r4858, %r4859, %r4860, %r4861, %r4862, %r4863, %r4864, %r4865, %r4866, %r4867, %r4868, %r4869, %r4870, %r4871, %r4872, %r4873}; 2026-02-21T09:18:15.0301511Z // end inline asm 2026-02-21T09:18:15.0301588Z // begin inline asm 2026-02-21T09:18:15.0301892Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 32], 64, {%r4875, %r4876, %r4877, %r4878, %r4879, %r4880, %r4881, %r4882, %r4883, %r4884, %r4885, %r4886, %r4887, %r4888, %r4889, %r4890}; 2026-02-21T09:18:15.0301955Z // end inline asm 2026-02-21T09:18:15.0302011Z // begin inline asm 2026-02-21T09:18:15.0302310Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r625 + 48], 64, {%r4892, %r4893, %r4894, %r4895, %r4896, %r4897, %r4898, %r4899, %r4900, %r4901, %r4902, %r4903, %r4904, %r4905, %r4906, %r4907}; 2026-02-21T09:18:15.0302398Z // end inline asm 2026-02-21T09:18:15.0302491Z // begin inline asm 2026-02-21T09:18:15.0302559Z tcgen05.wait::st.sync.aligned; 
2026-02-21T09:18:15.0302622Z // end inline asm 2026-02-21T09:18:15.0302700Z bar.sync 0; 2026-02-21T09:18:15.0302759Z // begin inline asm 2026-02-21T09:18:15.0303097Z @%p234 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5556 + 0], 16, {%r4978, %r4979, %r4980, %r4981, %r4982, %r4983, %r4984, %r4985, %r4986, %r4987, %r4988, %r4989, %r4990, %r4991, %r4992, %r4993}; 2026-02-21T09:18:15.0303153Z // end inline asm 2026-02-21T09:18:15.0303209Z // begin inline asm 2026-02-21T09:18:15.0303275Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0303338Z // end inline asm 2026-02-21T09:18:15.0303392Z bar.sync 0; 2026-02-21T09:18:15.0303448Z // begin inline asm 2026-02-21T09:18:15.0303527Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0303583Z // end inline asm 2026-02-21T09:18:15.0303640Z // begin inline asm 2026-02-21T09:18:15.0303729Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:15.0303793Z // end inline asm 2026-02-21T09:18:15.0303848Z bar.sync 0; 2026-02-21T09:18:15.0303908Z @%p20 bra $L__BB0_35; 2026-02-21T09:18:15.0304017Z // %bb.34: // in Loop: Header=BB0_33 Depth=2 2026-02-21T09:18:15.0304084Z elect.sync %r5112|%p247, -1; 2026-02-21T09:18:15.0304144Z mov.b32 %r5102, 69208336; 2026-02-21T09:18:15.0304207Z // begin inline asm 2026-02-21T09:18:15.0304367Z @%p247 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 0 ], %rd1, %r5102, %p392; 2026-02-21T09:18:15.0304424Z // end inline asm 2026-02-21T09:18:15.0304487Z mov.pred %p248, -1; 2026-02-21T09:18:15.0304551Z // begin inline asm 2026-02-21T09:18:15.0304702Z @%p247 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 8 ], %rd2, %r5102, %p248; 2026-02-21T09:18:15.0304758Z // end inline asm 2026-02-21T09:18:15.0304823Z // begin inline asm 2026-02-21T09:18:15.0304974Z @%p247 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 16 ], %rd3, %r5102, %p248; 2026-02-21T09:18:15.0305031Z // end inline asm 2026-02-21T09:18:15.0305094Z // begin inline asm 
2026-02-21T09:18:15.0305243Z @%p247 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 24 ], %rd4, %r5102, %p248; 2026-02-21T09:18:15.0305300Z // end inline asm 2026-02-21T09:18:15.0305362Z cvt.u64.u32 %rd668, %r5985; 2026-02-21T09:18:15.0305424Z // begin inline asm 2026-02-21T09:18:15.0305551Z @%p247 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd668]; 2026-02-21T09:18:15.0305605Z // end inline asm 2026-02-21T09:18:15.0305712Z $L__BB0_35: // in Loop: Header=BB0_33 Depth=2 2026-02-21T09:18:15.0305801Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:15.0305855Z mov.b32 %r5116, 0; 2026-02-21T09:18:15.0306078Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0306134Z // begin inline asm 2026-02-21T09:18:15.0306184Z 2026-02-21T09:18:15.0306235Z { 2026-02-21T09:18:15.0306303Z .reg .pred complete; 2026-02-21T09:18:15.0306360Z waitLoop: 2026-02-21T09:18:15.0306483Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r5116; 2026-02-21T09:18:15.0306558Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0306608Z } 2026-02-21T09:18:15.0306612Z 2026-02-21T09:18:15.0306666Z // end inline asm 2026-02-21T09:18:15.0306728Z bar.sync 0; 2026-02-21T09:18:15.0306783Z // begin inline asm 2026-02-21T09:18:15.0306869Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:15.0306924Z // end inline asm 2026-02-21T09:18:15.0306985Z $L__tmp324: 2026-02-21T09:18:15.0307153Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0307215Z add.s64 %rd670, %rd655, 64; 2026-02-21T09:18:15.0307412Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0307473Z add.s64 %rd673, %rd658, 64; 2026-02-21T09:18:15.0307555Z // begin inline asm 2026-02-21T09:18:15.0307614Z mov.u64 %rd669, 0x0; 2026-02-21T09:18:15.0307751Z createpolicy.fractional.L2::evict_first.b64 %rd669, 1.0; 2026-02-21T09:18:15.0307807Z // 
end inline asm 2026-02-21T09:18:15.0307885Z // begin inline asm 2026-02-21T09:18:15.0307950Z mov.u32 %r5118, 0x0; 2026-02-21T09:18:15.0308007Z mov.u32 %r5119, 0x0; 2026-02-21T09:18:15.0308062Z mov.u32 %r5120, 0x0; 2026-02-21T09:18:15.0308125Z mov.u32 %r5121, 0x0; 2026-02-21T09:18:15.0308299Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r5118, %r5119, %r5120, %r5121 }, [ %rd670 + 0 ], %rd669; 2026-02-21T09:18:15.0308355Z // end inline asm 2026-02-21T09:18:15.0308411Z // begin inline asm 2026-02-21T09:18:15.0308475Z mov.u64 %rd672, 0x0; 2026-02-21T09:18:15.0308580Z createpolicy.fractional.L2::evict_first.b64 %rd672, 1.0; 2026-02-21T09:18:15.0308637Z // end inline asm 2026-02-21T09:18:15.0308701Z // begin inline asm 2026-02-21T09:18:15.0308756Z mov.u32 %r5122, 0x0; 2026-02-21T09:18:15.0308813Z mov.u32 %r5123, 0x0; 2026-02-21T09:18:15.0308867Z mov.u32 %r5124, 0x0; 2026-02-21T09:18:15.0308933Z mov.u32 %r5125, 0x0; 2026-02-21T09:18:15.0309108Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r5122, %r5123, %r5124, %r5125 }, [ %rd673 + 0 ], %rd672; 2026-02-21T09:18:15.0309166Z // end inline asm 2026-02-21T09:18:15.0309233Z $L__tmp325: 2026-02-21T09:18:15.0309441Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0309543Z st.shared.v4.b32 [%r26], {%r5118, %r5119, %r5120, %r5121}; 2026-02-21T09:18:15.0309653Z st.shared.v4.b32 [%r26+512], {%r5122, %r5123, %r5124, %r5125}; 2026-02-21T09:18:15.0309708Z bar.sync 0; 2026-02-21T09:18:15.0309803Z ld.shared.v4.b32 {%r5285, %r5286, %r5287, %r5288}, [%r27]; 2026-02-21T09:18:15.0309876Z mov.b32 {%rs1665, %rs1666}, %r5288; 2026-02-21T09:18:15.0309939Z mov.b32 {%rs1667, %rs1668}, %r5287; 2026-02-21T09:18:15.0310000Z mov.b32 {%rs1669, %rs1670}, %r5286; 2026-02-21T09:18:15.0310063Z mov.b32 {%rs1671, %rs1672}, %r5285; 2026-02-21T09:18:15.0310161Z ld.shared.v4.b32 {%r5289, %r5290, %r5291, %r5292}, [%r28]; 2026-02-21T09:18:15.0310223Z mov.b32 {%rs1673, %rs1674}, 
%r5292; 2026-02-21T09:18:15.0310283Z mov.b32 {%rs1675, %rs1676}, %r5291; 2026-02-21T09:18:15.0310350Z mov.b32 {%rs1677, %rs1678}, %r5290; 2026-02-21T09:18:15.0310409Z mov.b32 {%rs1679, %rs1680}, %r5289; 2026-02-21T09:18:15.0310461Z $L__tmp326: 2026-02-21T09:18:15.0310623Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0310695Z cvt.f32.bf16 %r5267, %rs1671; 2026-02-21T09:18:15.0310756Z cvt.f32.bf16 %r5268, %rs1672; 2026-02-21T09:18:15.0310815Z cvt.f32.bf16 %r5269, %rs1669; 2026-02-21T09:18:15.0310884Z cvt.f32.bf16 %r5270, %rs1670; 2026-02-21T09:18:15.0310942Z cvt.f32.bf16 %r5271, %rs1667; 2026-02-21T09:18:15.0311000Z cvt.f32.bf16 %r5272, %rs1668; 2026-02-21T09:18:15.0311064Z cvt.f32.bf16 %r5273, %rs1665; 2026-02-21T09:18:15.0311123Z cvt.f32.bf16 %r5274, %rs1666; 2026-02-21T09:18:15.0311180Z cvt.f32.bf16 %r5275, %rs1679; 2026-02-21T09:18:15.0311238Z cvt.f32.bf16 %r5276, %rs1680; 2026-02-21T09:18:15.0311304Z cvt.f32.bf16 %r5277, %rs1677; 2026-02-21T09:18:15.0311362Z cvt.f32.bf16 %r5278, %rs1678; 2026-02-21T09:18:15.0311421Z cvt.f32.bf16 %r5279, %rs1675; 2026-02-21T09:18:15.0311486Z cvt.f32.bf16 %r5280, %rs1676; 2026-02-21T09:18:15.0311565Z cvt.f32.bf16 %r5281, %rs1673; 2026-02-21T09:18:15.0311623Z cvt.f32.bf16 %r5282, %rs1674; 2026-02-21T09:18:15.0311782Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:15.0311850Z add.s32 %r5293, %r7846, 131072; 2026-02-21T09:18:15.0311911Z cvt.s64.s32 %rd678, %r5293; 2026-02-21T09:18:15.0312002Z add.s64 %rd676, %rd56, %rd678; 2026-02-21T09:18:15.0312168Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0312252Z // begin inline asm 2026-02-21T09:18:15.0312309Z mov.u64 %rd675, 0x0; 2026-02-21T09:18:15.0312421Z createpolicy.fractional.L2::evict_first.b64 %rd675, 1.0; 2026-02-21T09:18:15.0312501Z // end inline asm 2026-02-21T09:18:15.0312560Z // begin inline asm 2026-02-21T09:18:15.0312638Z 
mov.u32 %r5126, 0x0; 2026-02-21T09:18:15.0312703Z mov.u32 %r5127, 0x0; 2026-02-21T09:18:15.0312758Z mov.u32 %r5128, 0x0; 2026-02-21T09:18:15.0312813Z mov.u32 %r5129, 0x0; 2026-02-21T09:18:15.0312990Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r5126, %r5127, %r5128, %r5129 }, [ %rd676 + 0 ], %rd675; 2026-02-21T09:18:15.0313045Z // end inline asm 2026-02-21T09:18:15.0313111Z prmt.b32 %r5294, %r5126, 0, 0x8880U; 2026-02-21T09:18:15.0313179Z cvt.u16.u32 %rs1681, %r5294; 2026-02-21T09:18:15.0313243Z prmt.b32 %r5295, %r5126, 0, 0x7770U; 2026-02-21T09:18:15.0313305Z cvt.u16.u32 %rs1682, %r5295; 2026-02-21T09:18:15.0313365Z prmt.b32 %r5296, %r5126, 0, 0x9991U; 2026-02-21T09:18:15.0313432Z cvt.u16.u32 %rs1683, %r5296; 2026-02-21T09:18:15.0313494Z prmt.b32 %r5297, %r5126, 0, 0x7771U; 2026-02-21T09:18:15.0313552Z cvt.u16.u32 %rs1684, %r5297; 2026-02-21T09:18:15.0313622Z prmt.b32 %r5298, %r5126, 0, 0xaaa2U; 2026-02-21T09:18:15.0313682Z cvt.u16.u32 %rs1685, %r5298; 2026-02-21T09:18:15.0313744Z prmt.b32 %r5299, %r5126, 0, 0x7772U; 2026-02-21T09:18:15.0313804Z cvt.u16.u32 %rs1686, %r5299; 2026-02-21T09:18:15.0313872Z prmt.b32 %r5300, %r5126, 0, 0xbbb3U; 2026-02-21T09:18:15.0313931Z cvt.u16.u32 %rs1687, %r5300; 2026-02-21T09:18:15.0313991Z prmt.b32 %r5301, %r5126, 0, 0x7773U; 2026-02-21T09:18:15.0314055Z cvt.u16.u32 %rs1688, %r5301; 2026-02-21T09:18:15.0314116Z prmt.b32 %r5302, %r5127, 0, 0x8880U; 2026-02-21T09:18:15.0314173Z cvt.u16.u32 %rs1689, %r5302; 2026-02-21T09:18:15.0314234Z prmt.b32 %r5303, %r5127, 0, 0x7770U; 2026-02-21T09:18:15.0314299Z cvt.u16.u32 %rs1690, %r5303; 2026-02-21T09:18:15.0314360Z prmt.b32 %r5304, %r5127, 0, 0x9991U; 2026-02-21T09:18:15.0314418Z cvt.u16.u32 %rs1691, %r5304; 2026-02-21T09:18:15.0314489Z prmt.b32 %r5305, %r5127, 0, 0x7771U; 2026-02-21T09:18:15.0314547Z cvt.u16.u32 %rs1692, %r5305; 2026-02-21T09:18:15.0314610Z prmt.b32 %r5306, %r5127, 0, 0xaaa2U; 2026-02-21T09:18:15.0314676Z cvt.u16.u32 %rs1693, %r5306; 2026-02-21T09:18:15.0314736Z 
prmt.b32 %r5307, %r5127, 0, 0x7772U; 2026-02-21T09:18:15.0314795Z cvt.u16.u32 %rs1694, %r5307; 2026-02-21T09:18:15.0314854Z prmt.b32 %r5308, %r5127, 0, 0xbbb3U; 2026-02-21T09:18:15.0314919Z cvt.u16.u32 %rs1695, %r5308; 2026-02-21T09:18:15.0314979Z prmt.b32 %r5309, %r5127, 0, 0x7773U; 2026-02-21T09:18:15.0315037Z cvt.u16.u32 %rs1696, %r5309; 2026-02-21T09:18:15.0315104Z prmt.b32 %r5310, %r5128, 0, 0x8880U; 2026-02-21T09:18:15.0315163Z cvt.u16.u32 %rs1697, %r5310; 2026-02-21T09:18:15.0315223Z prmt.b32 %r5311, %r5128, 0, 0x7770U; 2026-02-21T09:18:15.0315282Z cvt.u16.u32 %rs1698, %r5311; 2026-02-21T09:18:15.0315350Z prmt.b32 %r5312, %r5128, 0, 0x9991U; 2026-02-21T09:18:15.0315409Z cvt.u16.u32 %rs1699, %r5312; 2026-02-21T09:18:15.0315471Z prmt.b32 %r5313, %r5128, 0, 0x7771U; 2026-02-21T09:18:15.0315537Z cvt.u16.u32 %rs1700, %r5313; 2026-02-21T09:18:15.0315599Z prmt.b32 %r5314, %r5128, 0, 0xaaa2U; 2026-02-21T09:18:15.0315658Z cvt.u16.u32 %rs1701, %r5314; 2026-02-21T09:18:15.0315725Z prmt.b32 %r5315, %r5128, 0, 0x7772U; 2026-02-21T09:18:15.0315783Z cvt.u16.u32 %rs1702, %r5315; 2026-02-21T09:18:15.0315844Z prmt.b32 %r5316, %r5128, 0, 0xbbb3U; 2026-02-21T09:18:15.0315903Z cvt.u16.u32 %rs1703, %r5316; 2026-02-21T09:18:15.0315972Z prmt.b32 %r5317, %r5128, 0, 0x7773U; 2026-02-21T09:18:15.0316032Z cvt.u16.u32 %rs1704, %r5317; 2026-02-21T09:18:15.0316093Z prmt.b32 %r5318, %r5129, 0, 0x8880U; 2026-02-21T09:18:15.0316160Z cvt.u16.u32 %rs1705, %r5318; 2026-02-21T09:18:15.0316220Z prmt.b32 %r5319, %r5129, 0, 0x7770U; 2026-02-21T09:18:15.0316303Z cvt.u16.u32 %rs1706, %r5319; 2026-02-21T09:18:15.0316366Z prmt.b32 %r5320, %r5129, 0, 0x9991U; 2026-02-21T09:18:15.0316432Z cvt.u16.u32 %rs1707, %r5320; 2026-02-21T09:18:15.0316513Z prmt.b32 %r5321, %r5129, 0, 0x7771U; 2026-02-21T09:18:15.0316571Z cvt.u16.u32 %rs1708, %r5321; 2026-02-21T09:18:15.0316658Z prmt.b32 %r5322, %r5129, 0, 0xaaa2U; 2026-02-21T09:18:15.0316739Z cvt.u16.u32 %rs1709, %r5322; 2026-02-21T09:18:15.0316800Z prmt.b32 
%r5323, %r5129, 0, 0x7772U; 2026-02-21T09:18:15.0316859Z cvt.u16.u32 %rs1710, %r5323; 2026-02-21T09:18:15.0316925Z prmt.b32 %r5324, %r5129, 0, 0xbbb3U; 2026-02-21T09:18:15.0316983Z cvt.u16.u32 %rs1711, %r5324; 2026-02-21T09:18:15.0317043Z prmt.b32 %r5325, %r5129, 0, 0x7773U; 2026-02-21T09:18:15.0317107Z cvt.u16.u32 %rs1712, %r5325; 2026-02-21T09:18:15.0317272Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0317334Z shl.b16 %rs1713, %rs1682, 12; 2026-02-21T09:18:15.0317402Z shr.s16 %rs1714, %rs1713, 12; 2026-02-21T09:18:15.0317460Z shl.b16 %rs1715, %rs1684, 12; 2026-02-21T09:18:15.0317518Z shr.s16 %rs1716, %rs1715, 12; 2026-02-21T09:18:15.0317577Z shl.b16 %rs1717, %rs1686, 12; 2026-02-21T09:18:15.0317644Z shr.s16 %rs1718, %rs1717, 12; 2026-02-21T09:18:15.0317702Z shl.b16 %rs1719, %rs1688, 12; 2026-02-21T09:18:15.0317759Z shr.s16 %rs1720, %rs1719, 12; 2026-02-21T09:18:15.0317824Z shl.b16 %rs1721, %rs1690, 12; 2026-02-21T09:18:15.0317881Z shr.s16 %rs1722, %rs1721, 12; 2026-02-21T09:18:15.0317939Z shl.b16 %rs1723, %rs1692, 12; 2026-02-21T09:18:15.0317998Z shr.s16 %rs1724, %rs1723, 12; 2026-02-21T09:18:15.0318062Z shl.b16 %rs1725, %rs1694, 12; 2026-02-21T09:18:15.0318120Z shr.s16 %rs1726, %rs1725, 12; 2026-02-21T09:18:15.0318176Z shl.b16 %rs1727, %rs1696, 12; 2026-02-21T09:18:15.0318242Z shr.s16 %rs1728, %rs1727, 12; 2026-02-21T09:18:15.0318299Z shl.b16 %rs1729, %rs1698, 12; 2026-02-21T09:18:15.0318357Z shr.s16 %rs1730, %rs1729, 12; 2026-02-21T09:18:15.0318422Z shl.b16 %rs1731, %rs1700, 12; 2026-02-21T09:18:15.0318479Z shr.s16 %rs1732, %rs1731, 12; 2026-02-21T09:18:15.0318538Z shl.b16 %rs1733, %rs1702, 12; 2026-02-21T09:18:15.0318597Z shr.s16 %rs1734, %rs1733, 12; 2026-02-21T09:18:15.0318661Z shl.b16 %rs1735, %rs1704, 12; 2026-02-21T09:18:15.0318721Z shr.s16 %rs1736, %rs1735, 12; 2026-02-21T09:18:15.0318778Z shl.b16 %rs1737, %rs1706, 12; 2026-02-21T09:18:15.0318843Z shr.s16 %rs1738, %rs1737, 12; 
2026-02-21T09:18:15.0318901Z shl.b16 %rs1739, %rs1708, 12; 2026-02-21T09:18:15.0318957Z shr.s16 %rs1740, %rs1739, 12; 2026-02-21T09:18:15.0319013Z shl.b16 %rs1741, %rs1710, 12; 2026-02-21T09:18:15.0319078Z shr.s16 %rs1742, %rs1741, 12; 2026-02-21T09:18:15.0319135Z shl.b16 %rs1743, %rs1712, 12; 2026-02-21T09:18:15.0319192Z shr.s16 %rs1744, %rs1743, 12; 2026-02-21T09:18:15.0319357Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0319418Z shr.u16 %rs1745, %rs1681, 4; 2026-02-21T09:18:15.0319476Z shr.u16 %rs1746, %rs1683, 4; 2026-02-21T09:18:15.0319534Z shr.u16 %rs1747, %rs1685, 4; 2026-02-21T09:18:15.0319600Z shr.u16 %rs1748, %rs1687, 4; 2026-02-21T09:18:15.0319656Z shr.u16 %rs1749, %rs1689, 4; 2026-02-21T09:18:15.0319713Z shr.u16 %rs1750, %rs1691, 4; 2026-02-21T09:18:15.0319777Z shr.u16 %rs1751, %rs1693, 4; 2026-02-21T09:18:15.0319835Z shr.u16 %rs1752, %rs1695, 4; 2026-02-21T09:18:15.0319891Z shr.u16 %rs1753, %rs1697, 4; 2026-02-21T09:18:15.0319955Z shr.u16 %rs1754, %rs1699, 4; 2026-02-21T09:18:15.0320012Z shr.u16 %rs1755, %rs1701, 4; 2026-02-21T09:18:15.0320068Z shr.u16 %rs1756, %rs1703, 4; 2026-02-21T09:18:15.0320123Z shr.u16 %rs1757, %rs1705, 4; 2026-02-21T09:18:15.0320188Z shr.u16 %rs1758, %rs1707, 4; 2026-02-21T09:18:15.0320244Z shr.u16 %rs1759, %rs1709, 4; 2026-02-21T09:18:15.0320300Z shr.u16 %rs1760, %rs1711, 4; 2026-02-21T09:18:15.0320466Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0320558Z bar.sync 0; 2026-02-21T09:18:15.0320620Z st.shared.b8 [%r29], %rs1714; 2026-02-21T09:18:15.0320703Z st.shared.b8 [%r30], %rs1716; 2026-02-21T09:18:15.0320770Z st.shared.b8 [%r31], %rs1718; 2026-02-21T09:18:15.0320828Z st.shared.b8 [%r32], %rs1720; 2026-02-21T09:18:15.0320911Z st.shared.b8 [%r33+512], %rs1722; 2026-02-21T09:18:15.0321003Z st.shared.b8 [%r34+512], %rs1724; 2026-02-21T09:18:15.0321066Z st.shared.b8 [%r35+512], %rs1726; 2026-02-21T09:18:15.0321135Z 
st.shared.b8 [%r36+512], %rs1728; 2026-02-21T09:18:15.0321208Z st.shared.b8 [%r37+1024], %rs1730; 2026-02-21T09:18:15.0321272Z st.shared.b8 [%r38+1024], %rs1732; 2026-02-21T09:18:15.0321335Z st.shared.b8 [%r39+1024], %rs1734; 2026-02-21T09:18:15.0321400Z st.shared.b8 [%r40+1024], %rs1736; 2026-02-21T09:18:15.0321470Z st.shared.b8 [%r41+1536], %rs1738; 2026-02-21T09:18:15.0321559Z st.shared.b8 [%r42+1536], %rs1740; 2026-02-21T09:18:15.0321626Z st.shared.b8 [%r43+1536], %rs1742; 2026-02-21T09:18:15.0321695Z st.shared.b8 [%r44+1536], %rs1744; 2026-02-21T09:18:15.0321752Z bar.sync 0; 2026-02-21T09:18:15.0321815Z ld.shared.b32 %r5326, [%r45]; 2026-02-21T09:18:15.0321880Z ld.shared.b32 %r5327, [%r46]; 2026-02-21T09:18:15.0321952Z ld.shared.b32 %r5328, [%r47]; 2026-02-21T09:18:15.0322016Z ld.shared.b32 %r5329, [%r48]; 2026-02-21T09:18:15.0322187Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0322253Z bar.sync 0; 2026-02-21T09:18:15.0322315Z st.shared.b8 [%r29], %rs1745; 2026-02-21T09:18:15.0322376Z st.shared.b8 [%r30], %rs1746; 2026-02-21T09:18:15.0322439Z st.shared.b8 [%r31], %rs1747; 2026-02-21T09:18:15.0322508Z st.shared.b8 [%r32], %rs1748; 2026-02-21T09:18:15.0322572Z st.shared.b8 [%r33+512], %rs1749; 2026-02-21T09:18:15.0322635Z st.shared.b8 [%r34+512], %rs1750; 2026-02-21T09:18:15.0322707Z st.shared.b8 [%r35+512], %rs1751; 2026-02-21T09:18:15.0322773Z st.shared.b8 [%r36+512], %rs1752; 2026-02-21T09:18:15.0322837Z st.shared.b8 [%r37+1024], %rs1753; 2026-02-21T09:18:15.0322912Z st.shared.b8 [%r38+1024], %rs1754; 2026-02-21T09:18:15.0322977Z st.shared.b8 [%r39+1024], %rs1755; 2026-02-21T09:18:15.0323041Z st.shared.b8 [%r40+1024], %rs1756; 2026-02-21T09:18:15.0323107Z st.shared.b8 [%r41+1536], %rs1757; 2026-02-21T09:18:15.0323180Z st.shared.b8 [%r42+1536], %rs1758; 2026-02-21T09:18:15.0323244Z st.shared.b8 [%r43+1536], %rs1759; 2026-02-21T09:18:15.0323307Z st.shared.b8 [%r44+1536], %rs1760; 2026-02-21T09:18:15.0323371Z 
bar.sync 0; 2026-02-21T09:18:15.0323435Z ld.shared.b32 %r5330, [%r45]; 2026-02-21T09:18:15.0323498Z ld.shared.b32 %r5331, [%r46]; 2026-02-21T09:18:15.0323561Z ld.shared.b32 %r5332, [%r47]; 2026-02-21T09:18:15.0323632Z ld.shared.b32 %r5333, [%r48]; 2026-02-21T09:18:15.0323801Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0323868Z cvt.s8.s32 %rs1761, %r5327; 2026-02-21T09:18:15.0323943Z cvt.rn.f32.s16 %r5334, %rs1761; 2026-02-21T09:18:15.0324006Z cvt.s8.s32 %rs1762, %r5326; 2026-02-21T09:18:15.0324072Z cvt.rn.f32.s16 %r5335, %rs1762; 2026-02-21T09:18:15.0324144Z cvt.s8.s32 %rs1763, %r5331; 2026-02-21T09:18:15.0324207Z cvt.rn.f32.s16 %r5336, %rs1763; 2026-02-21T09:18:15.0324270Z cvt.s8.s32 %rs1764, %r5330; 2026-02-21T09:18:15.0324334Z cvt.rn.f32.s16 %r5337, %rs1764; 2026-02-21T09:18:15.0324406Z cvt.s8.s32 %rs1765, %r5329; 2026-02-21T09:18:15.0324470Z cvt.rn.f32.s16 %r5338, %rs1765; 2026-02-21T09:18:15.0324531Z cvt.s8.s32 %rs1766, %r5328; 2026-02-21T09:18:15.0324604Z cvt.rn.f32.s16 %r5339, %rs1766; 2026-02-21T09:18:15.0324665Z cvt.s8.s32 %rs1767, %r5333; 2026-02-21T09:18:15.0324728Z cvt.rn.f32.s16 %r5340, %rs1767; 2026-02-21T09:18:15.0324789Z cvt.s8.s32 %rs1768, %r5332; 2026-02-21T09:18:15.0324860Z cvt.rn.f32.s16 %r5341, %rs1768; 2026-02-21T09:18:15.0324927Z prmt.b32 %r5342, %r5327, 0, 0x9991U; 2026-02-21T09:18:15.0325021Z cvt.u16.u32 %rs1769, %r5342; 2026-02-21T09:18:15.0325091Z cvt.rn.f32.s16 %r5343, %rs1769; 2026-02-21T09:18:15.0325157Z prmt.b32 %r5344, %r5326, 0, 0x9991U; 2026-02-21T09:18:15.0325244Z cvt.u16.u32 %rs1770, %r5344; 2026-02-21T09:18:15.0325305Z cvt.rn.f32.s16 %r5345, %rs1770; 2026-02-21T09:18:15.0325404Z prmt.b32 %r5346, %r5331, 0, 0x9991U; 2026-02-21T09:18:15.0325467Z cvt.u16.u32 %rs1771, %r5346; 2026-02-21T09:18:15.0325554Z cvt.rn.f32.s16 %r5347, %rs1771; 2026-02-21T09:18:15.0325626Z prmt.b32 %r5348, %r5330, 0, 0x9991U; 2026-02-21T09:18:15.0325687Z cvt.u16.u32 %rs1772, %r5348; 
2026-02-21T09:18:15.0325750Z cvt.rn.f32.s16 %r5349, %rs1772; 2026-02-21T09:18:15.0325819Z prmt.b32 %r5350, %r5329, 0, 0x9991U; 2026-02-21T09:18:15.0325880Z cvt.u16.u32 %rs1773, %r5350; 2026-02-21T09:18:15.0325943Z cvt.rn.f32.s16 %r5351, %rs1773; 2026-02-21T09:18:15.0326005Z prmt.b32 %r5352, %r5328, 0, 0x9991U; 2026-02-21T09:18:15.0326073Z cvt.u16.u32 %rs1774, %r5352; 2026-02-21T09:18:15.0326137Z cvt.rn.f32.s16 %r5353, %rs1774; 2026-02-21T09:18:15.0326201Z prmt.b32 %r5354, %r5333, 0, 0x9991U; 2026-02-21T09:18:15.0326270Z cvt.u16.u32 %rs1775, %r5354; 2026-02-21T09:18:15.0326333Z cvt.rn.f32.s16 %r5355, %rs1775; 2026-02-21T09:18:15.0326396Z prmt.b32 %r5356, %r5332, 0, 0x9991U; 2026-02-21T09:18:15.0326458Z cvt.u16.u32 %rs1776, %r5356; 2026-02-21T09:18:15.0326528Z cvt.rn.f32.s16 %r5357, %rs1776; 2026-02-21T09:18:15.0326592Z prmt.b32 %r5358, %r5327, 0, 0xaaa2U; 2026-02-21T09:18:15.0326652Z cvt.u16.u32 %rs1777, %r5358; 2026-02-21T09:18:15.0326721Z cvt.rn.f32.s16 %r5359, %rs1777; 2026-02-21T09:18:15.0326784Z prmt.b32 %r5360, %r5326, 0, 0xaaa2U; 2026-02-21T09:18:15.0326843Z cvt.u16.u32 %rs1778, %r5360; 2026-02-21T09:18:15.0326906Z cvt.rn.f32.s16 %r5361, %rs1778; 2026-02-21T09:18:15.0326976Z prmt.b32 %r5362, %r5331, 0, 0xaaa2U; 2026-02-21T09:18:15.0327036Z cvt.u16.u32 %rs1779, %r5362; 2026-02-21T09:18:15.0327097Z cvt.rn.f32.s16 %r5363, %rs1779; 2026-02-21T09:18:15.0327168Z prmt.b32 %r5364, %r5330, 0, 0xaaa2U; 2026-02-21T09:18:15.0327228Z cvt.u16.u32 %rs1780, %r5364; 2026-02-21T09:18:15.0327289Z cvt.rn.f32.s16 %r5365, %rs1780; 2026-02-21T09:18:15.0327359Z prmt.b32 %r5366, %r5329, 0, 0xaaa2U; 2026-02-21T09:18:15.0327418Z cvt.u16.u32 %rs1781, %r5366; 2026-02-21T09:18:15.0327482Z cvt.rn.f32.s16 %r5367, %rs1781; 2026-02-21T09:18:15.0327545Z prmt.b32 %r5368, %r5328, 0, 0xaaa2U; 2026-02-21T09:18:15.0327614Z cvt.u16.u32 %rs1782, %r5368; 2026-02-21T09:18:15.0327676Z cvt.rn.f32.s16 %r5369, %rs1782; 2026-02-21T09:18:15.0327739Z prmt.b32 %r5370, %r5333, 0, 0xaaa2U; 
2026-02-21T09:18:15.0327806Z cvt.u16.u32 %rs1783, %r5370; 2026-02-21T09:18:15.0327867Z cvt.rn.f32.s16 %r5371, %rs1783; 2026-02-21T09:18:15.0327930Z prmt.b32 %r5372, %r5332, 0, 0xaaa2U; 2026-02-21T09:18:15.0327989Z cvt.u16.u32 %rs1784, %r5372; 2026-02-21T09:18:15.0328057Z cvt.rn.f32.s16 %r5373, %rs1784; 2026-02-21T09:18:15.0328119Z prmt.b32 %r5374, %r5327, 0, 0xbbb3U; 2026-02-21T09:18:15.0328181Z cvt.u16.u32 %rs1785, %r5374; 2026-02-21T09:18:15.0328249Z cvt.rn.f32.s16 %r5375, %rs1785; 2026-02-21T09:18:15.0328311Z prmt.b32 %r5376, %r5326, 0, 0xbbb3U; 2026-02-21T09:18:15.0328374Z cvt.u16.u32 %rs1786, %r5376; 2026-02-21T09:18:15.0328443Z cvt.rn.f32.s16 %r5377, %rs1786; 2026-02-21T09:18:15.0328506Z prmt.b32 %r5378, %r5331, 0, 0xbbb3U; 2026-02-21T09:18:15.0328567Z cvt.u16.u32 %rs1787, %r5378; 2026-02-21T09:18:15.0328629Z cvt.rn.f32.s16 %r5379, %rs1787; 2026-02-21T09:18:15.0328700Z prmt.b32 %r5380, %r5330, 0, 0xbbb3U; 2026-02-21T09:18:15.0328772Z cvt.u16.u32 %rs1788, %r5380; 2026-02-21T09:18:15.0328830Z cvt.rn.f32.s16 %r5381, %rs1788; 2026-02-21T09:18:15.0328896Z prmt.b32 %r5382, %r5329, 0, 0xbbb3U; 2026-02-21T09:18:15.0328953Z cvt.u16.u32 %rs1789, %r5382; 2026-02-21T09:18:15.0329010Z cvt.rn.f32.s16 %r5383, %rs1789; 2026-02-21T09:18:15.0329071Z prmt.b32 %r5384, %r5328, 0, 0xbbb3U; 2026-02-21T09:18:15.0329137Z cvt.u16.u32 %rs1790, %r5384; 2026-02-21T09:18:15.0329222Z cvt.rn.f32.s16 %r5385, %rs1790; 2026-02-21T09:18:15.0329283Z prmt.b32 %r5386, %r5333, 0, 0xbbb3U; 2026-02-21T09:18:15.0329351Z cvt.u16.u32 %rs1791, %r5386; 2026-02-21T09:18:15.0329434Z cvt.rn.f32.s16 %r5387, %rs1791; 2026-02-21T09:18:15.0329496Z prmt.b32 %r5388, %r5332, 0, 0xbbb3U; 2026-02-21T09:18:15.0329554Z cvt.u16.u32 %rs1792, %r5388; 2026-02-21T09:18:15.0329645Z cvt.rn.f32.s16 %r5389, %rs1792; 2026-02-21T09:18:15.0341146Z bar.sync 0; 2026-02-21T09:18:15.0341422Z st.shared.v4.b32 [%r49], {%r5335, %r5337, %r5334, %r5336}; 2026-02-21T09:18:15.0341591Z st.shared.v4.b32 [%r50], {%r5339, %r5341, %r5338, 
%r5340}; 2026-02-21T09:18:15.0341693Z st.shared.v4.b32 [%r51], {%r5345, %r5349, %r5343, %r5347}; 2026-02-21T09:18:15.0341789Z st.shared.v4.b32 [%r52], {%r5353, %r5357, %r5351, %r5355}; 2026-02-21T09:18:15.0341886Z st.shared.v4.b32 [%r53], {%r5361, %r5365, %r5359, %r5363}; 2026-02-21T09:18:15.0341992Z st.shared.v4.b32 [%r54], {%r5369, %r5373, %r5367, %r5371}; 2026-02-21T09:18:15.0342094Z st.shared.v4.b32 [%r55], {%r5377, %r5381, %r5375, %r5379}; 2026-02-21T09:18:15.0342187Z st.shared.v4.b32 [%r56], {%r5385, %r5389, %r5383, %r5387}; 2026-02-21T09:18:15.0342259Z $L__tmp327: 2026-02-21T09:18:15.0342503Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0342572Z // begin inline asm 2026-02-21T09:18:15.0342903Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5130, %r5131, %r5132, %r5133, %r5134, %r5135, %r5136, %r5137, %r5138, %r5139, %r5140, %r5141, %r5142, %r5143, %r5144, %r5145}, [%r625 + 0], 64; 2026-02-21T09:18:15.0342967Z // end inline asm 2026-02-21T09:18:15.0343025Z // begin inline asm 2026-02-21T09:18:15.0343338Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5147, %r5148, %r5149, %r5150, %r5151, %r5152, %r5153, %r5154, %r5155, %r5156, %r5157, %r5158, %r5159, %r5160, %r5161, %r5162}, [%r625 + 16], 64; 2026-02-21T09:18:15.0343395Z // end inline asm 2026-02-21T09:18:15.0343454Z // begin inline asm 2026-02-21T09:18:15.0343752Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5164, %r5165, %r5166, %r5167, %r5168, %r5169, %r5170, %r5171, %r5172, %r5173, %r5174, %r5175, %r5176, %r5177, %r5178, %r5179}, [%r625 + 32], 64; 2026-02-21T09:18:15.0343817Z // end inline asm 2026-02-21T09:18:15.0343873Z // begin inline asm 2026-02-21T09:18:15.0344174Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5181, %r5182, %r5183, %r5184, %r5185, %r5186, %r5187, %r5188, %r5189, %r5190, %r5191, %r5192, %r5193, %r5194, %r5195, %r5196}, [%r625 + 48], 64; 2026-02-21T09:18:15.0344239Z // end inline asm 2026-02-21T09:18:15.0344295Z // 
begin inline asm 2026-02-21T09:18:15.0344373Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0344435Z // end inline asm 2026-02-21T09:18:15.0344501Z mov.pred %p263, -1; 2026-02-21T09:18:15.0344557Z // begin inline asm 2026-02-21T09:18:15.0344873Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 0], 64, {%r5130, %r5131, %r5132, %r5133, %r5134, %r5135, %r5136, %r5137, %r5138, %r5139, %r5140, %r5141, %r5142, %r5143, %r5144, %r5145}; 2026-02-21T09:18:15.0344938Z // end inline asm 2026-02-21T09:18:15.0344995Z // begin inline asm 2026-02-21T09:18:15.0345305Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 16], 64, {%r5147, %r5148, %r5149, %r5150, %r5151, %r5152, %r5153, %r5154, %r5155, %r5156, %r5157, %r5158, %r5159, %r5160, %r5161, %r5162}; 2026-02-21T09:18:15.0345371Z // end inline asm 2026-02-21T09:18:15.0345430Z // begin inline asm 2026-02-21T09:18:15.0345734Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 32], 64, {%r5164, %r5165, %r5166, %r5167, %r5168, %r5169, %r5170, %r5171, %r5172, %r5173, %r5174, %r5175, %r5176, %r5177, %r5178, %r5179}; 2026-02-21T09:18:15.0345797Z // end inline asm 2026-02-21T09:18:15.0345853Z // begin inline asm 2026-02-21T09:18:15.0346156Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r915 + 48], 64, {%r5181, %r5182, %r5183, %r5184, %r5185, %r5186, %r5187, %r5188, %r5189, %r5190, %r5191, %r5192, %r5193, %r5194, %r5195, %r5196}; 2026-02-21T09:18:15.0346267Z // end inline asm 2026-02-21T09:18:15.0346323Z // begin inline asm 2026-02-21T09:18:15.0346395Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0346491Z // end inline asm 2026-02-21T09:18:15.0346549Z bar.sync 0; 2026-02-21T09:18:15.0346605Z // begin inline asm 2026-02-21T09:18:15.0346967Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5846 + 0], 16, {%r5267, %r5268, %r5269, %r5270, %r5271, %r5272, %r5273, %r5274, %r5275, %r5276, %r5277, %r5278, %r5279, %r5280, %r5281, %r5282}; 2026-02-21T09:18:15.0347033Z // end inline asm 
2026-02-21T09:18:15.0347089Z // begin inline asm 2026-02-21T09:18:15.0347156Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0347217Z // end inline asm 2026-02-21T09:18:15.0347271Z bar.sync 0; 2026-02-21T09:18:15.0347327Z // begin inline asm 2026-02-21T09:18:15.0347399Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0347462Z // end inline asm 2026-02-21T09:18:15.0347519Z // begin inline asm 2026-02-21T09:18:15.0347614Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:15.0347678Z // end inline asm 2026-02-21T09:18:15.0347733Z bar.sync 0; 2026-02-21T09:18:15.0347793Z @%p20 bra $L__BB0_37; 2026-02-21T09:18:15.0347904Z // %bb.36: // in Loop: Header=BB0_33 Depth=2 2026-02-21T09:18:15.0347974Z elect.sync %r5402|%p264, -1; 2026-02-21T09:18:15.0348037Z mov.b32 %r5392, 69208336; 2026-02-21T09:18:15.0348095Z // begin inline asm 2026-02-21T09:18:15.0348268Z @%p264 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 0 ], %rd1, %r5392, %p263; 2026-02-21T09:18:15.0348323Z // end inline asm 2026-02-21T09:18:15.0348379Z // begin inline asm 2026-02-21T09:18:15.0348538Z @%p264 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 8 ], %rd2, %r5392, %p263; 2026-02-21T09:18:15.0348592Z // end inline asm 2026-02-21T09:18:15.0348648Z // begin inline asm 2026-02-21T09:18:15.0348808Z @%p264 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 16 ], %rd3, %r5392, %p263; 2026-02-21T09:18:15.0348863Z // end inline asm 2026-02-21T09:18:15.0348919Z // begin inline asm 2026-02-21T09:18:15.0349070Z @%p264 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 24 ], %rd4, %r5392, %p263; 2026-02-21T09:18:15.0349133Z // end inline asm 2026-02-21T09:18:15.0349198Z cvt.u64.u32 %rd683, %r5985; 2026-02-21T09:18:15.0349256Z // begin inline asm 2026-02-21T09:18:15.0349394Z @%p264 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd683]; 2026-02-21T09:18:15.0349450Z // end inline asm 2026-02-21T09:18:15.0349554Z $L__BB0_37: // in Loop: 
Header=BB0_33 Depth=2 2026-02-21T09:18:15.0349618Z // begin inline asm 2026-02-21T09:18:15.0349668Z 2026-02-21T09:18:15.0349719Z { 2026-02-21T09:18:15.0349782Z .reg .pred complete; 2026-02-21T09:18:15.0349843Z waitLoop: 2026-02-21T09:18:15.0349967Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r5116; 2026-02-21T09:18:15.0350036Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0350094Z } 2026-02-21T09:18:15.0350098Z 2026-02-21T09:18:15.0350153Z // end inline asm 2026-02-21T09:18:15.0350208Z bar.sync 0; 2026-02-21T09:18:15.0350264Z // begin inline asm 2026-02-21T09:18:15.0350359Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:15.0350413Z // end inline asm 2026-02-21T09:18:15.0350466Z $L__tmp328: 2026-02-21T09:18:15.0350651Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0350718Z add.s64 %rd685, %rd655, 128; 2026-02-21T09:18:15.0350886Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0350959Z add.s64 %rd688, %rd658, 128; 2026-02-21T09:18:15.0351015Z // begin inline asm 2026-02-21T09:18:15.0351075Z mov.u64 %rd684, 0x0; 2026-02-21T09:18:15.0351192Z createpolicy.fractional.L2::evict_first.b64 %rd684, 1.0; 2026-02-21T09:18:15.0351257Z // end inline asm 2026-02-21T09:18:15.0351313Z // begin inline asm 2026-02-21T09:18:15.0351395Z mov.u32 %r5408, 0x0; 2026-02-21T09:18:15.0351459Z mov.u32 %r5409, 0x0; 2026-02-21T09:18:15.0351514Z mov.u32 %r5410, 0x0; 2026-02-21T09:18:15.0351603Z mov.u32 %r5411, 0x0; 2026-02-21T09:18:15.0351818Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r5408, %r5409, %r5410, %r5411 }, [ %rd685 + 0 ], %rd684; 2026-02-21T09:18:15.0351897Z // end inline asm 2026-02-21T09:18:15.0351955Z // begin inline asm 2026-02-21T09:18:15.0352033Z mov.u64 %rd687, 0x0; 2026-02-21T09:18:15.0352154Z createpolicy.fractional.L2::evict_first.b64 %rd687, 1.0; 2026-02-21T09:18:15.0352208Z // end inline asm 2026-02-21T09:18:15.0352264Z // begin 
inline asm 2026-02-21T09:18:15.0352326Z mov.u32 %r5412, 0x0; 2026-02-21T09:18:15.0352380Z mov.u32 %r5413, 0x0; 2026-02-21T09:18:15.0352434Z mov.u32 %r5414, 0x0; 2026-02-21T09:18:15.0352488Z mov.u32 %r5415, 0x0; 2026-02-21T09:18:15.0352671Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r5412, %r5413, %r5414, %r5415 }, [ %rd688 + 0 ], %rd687; 2026-02-21T09:18:15.0352727Z // end inline asm 2026-02-21T09:18:15.0352781Z $L__tmp329: 2026-02-21T09:18:15.0353006Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0353106Z st.shared.v4.b32 [%r26], {%r5408, %r5409, %r5410, %r5411}; 2026-02-21T09:18:15.0353209Z st.shared.v4.b32 [%r26+512], {%r5412, %r5413, %r5414, %r5415}; 2026-02-21T09:18:15.0353275Z bar.sync 0; 2026-02-21T09:18:15.0353368Z ld.shared.v4.b32 {%r5575, %r5576, %r5577, %r5578}, [%r27]; 2026-02-21T09:18:15.0353436Z mov.b32 {%rs1793, %rs1794}, %r5578; 2026-02-21T09:18:15.0353499Z mov.b32 {%rs1795, %rs1796}, %r5577; 2026-02-21T09:18:15.0353567Z mov.b32 {%rs1797, %rs1798}, %r5576; 2026-02-21T09:18:15.0353626Z mov.b32 {%rs1799, %rs1800}, %r5575; 2026-02-21T09:18:15.0353718Z ld.shared.v4.b32 {%r5579, %r5580, %r5581, %r5582}, [%r28]; 2026-02-21T09:18:15.0353788Z mov.b32 {%rs1801, %rs1802}, %r5582; 2026-02-21T09:18:15.0353849Z mov.b32 {%rs1803, %rs1804}, %r5581; 2026-02-21T09:18:15.0353910Z mov.b32 {%rs1805, %rs1806}, %r5580; 2026-02-21T09:18:15.0353976Z mov.b32 {%rs1807, %rs1808}, %r5579; 2026-02-21T09:18:15.0354027Z $L__tmp330: 2026-02-21T09:18:15.0354197Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0354261Z cvt.f32.bf16 %r5557, %rs1799; 2026-02-21T09:18:15.0354330Z cvt.f32.bf16 %r5558, %rs1800; 2026-02-21T09:18:15.0354390Z cvt.f32.bf16 %r5559, %rs1797; 2026-02-21T09:18:15.0354448Z cvt.f32.bf16 %r5560, %rs1798; 2026-02-21T09:18:15.0354512Z cvt.f32.bf16 %r5561, %rs1795; 2026-02-21T09:18:15.0354568Z cvt.f32.bf16 %r5562, %rs1796; 
2026-02-21T09:18:15.0354625Z cvt.f32.bf16 %r5563, %rs1793; 2026-02-21T09:18:15.0354681Z cvt.f32.bf16 %r5564, %rs1794; 2026-02-21T09:18:15.0354744Z cvt.f32.bf16 %r5565, %rs1807; 2026-02-21T09:18:15.0354801Z cvt.f32.bf16 %r5566, %rs1808; 2026-02-21T09:18:15.0354858Z cvt.f32.bf16 %r5567, %rs1805; 2026-02-21T09:18:15.0354924Z cvt.f32.bf16 %r5568, %rs1806; 2026-02-21T09:18:15.0354981Z cvt.f32.bf16 %r5569, %rs1803; 2026-02-21T09:18:15.0355039Z cvt.f32.bf16 %r5570, %rs1804; 2026-02-21T09:18:15.0355104Z cvt.f32.bf16 %r5571, %rs1801; 2026-02-21T09:18:15.0355162Z cvt.f32.bf16 %r5572, %rs1802; 2026-02-21T09:18:15.0355325Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:15.0355389Z add.s32 %r5583, %r7846, 262144; 2026-02-21T09:18:15.0355458Z cvt.s64.s32 %rd693, %r5583; 2026-02-21T09:18:15.0355521Z add.s64 %rd691, %rd56, %rd693; 2026-02-21T09:18:15.0355680Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0355744Z // begin inline asm 2026-02-21T09:18:15.0355801Z mov.u64 %rd690, 0x0; 2026-02-21T09:18:15.0355907Z createpolicy.fractional.L2::evict_first.b64 %rd690, 1.0; 2026-02-21T09:18:15.0355968Z // end inline asm 2026-02-21T09:18:15.0356024Z // begin inline asm 2026-02-21T09:18:15.0356106Z mov.u32 %r5416, 0x0; 2026-02-21T09:18:15.0356162Z mov.u32 %r5417, 0x0; 2026-02-21T09:18:15.0356224Z mov.u32 %r5418, 0x0; 2026-02-21T09:18:15.0356278Z mov.u32 %r5419, 0x0; 2026-02-21T09:18:15.0356488Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r5416, %r5417, %r5418, %r5419 }, [ %rd691 + 0 ], %rd690; 2026-02-21T09:18:15.0356573Z // end inline asm 2026-02-21T09:18:15.0356641Z prmt.b32 %r5584, %r5416, 0, 0x8880U; 2026-02-21T09:18:15.0356724Z cvt.u16.u32 %rs1809, %r5584; 2026-02-21T09:18:15.0356790Z prmt.b32 %r5585, %r5416, 0, 0x7770U; 2026-02-21T09:18:15.0356858Z cvt.u16.u32 %rs1810, %r5585; 2026-02-21T09:18:15.0356920Z prmt.b32 %r5586, %r5416, 0, 0x9991U; 2026-02-21T09:18:15.0356979Z 
cvt.u16.u32 %rs1811, %r5586; 2026-02-21T09:18:15.0357047Z prmt.b32 %r5587, %r5416, 0, 0x7771U; 2026-02-21T09:18:15.0357105Z cvt.u16.u32 %rs1812, %r5587; 2026-02-21T09:18:15.0357166Z prmt.b32 %r5588, %r5416, 0, 0xaaa2U; 2026-02-21T09:18:15.0357230Z cvt.u16.u32 %rs1813, %r5588; 2026-02-21T09:18:15.0357292Z prmt.b32 %r5589, %r5416, 0, 0x7772U; 2026-02-21T09:18:15.0357351Z cvt.u16.u32 %rs1814, %r5589; 2026-02-21T09:18:15.0357410Z prmt.b32 %r5590, %r5416, 0, 0xbbb3U; 2026-02-21T09:18:15.0357477Z cvt.u16.u32 %rs1815, %r5590; 2026-02-21T09:18:15.0357537Z prmt.b32 %r5591, %r5416, 0, 0x7773U; 2026-02-21T09:18:15.0357597Z cvt.u16.u32 %rs1816, %r5591; 2026-02-21T09:18:15.0357664Z prmt.b32 %r5592, %r5417, 0, 0x8880U; 2026-02-21T09:18:15.0357723Z cvt.u16.u32 %rs1817, %r5592; 2026-02-21T09:18:15.0357783Z prmt.b32 %r5593, %r5417, 0, 0x7770U; 2026-02-21T09:18:15.0357839Z cvt.u16.u32 %rs1818, %r5593; 2026-02-21T09:18:15.0357909Z prmt.b32 %r5594, %r5417, 0, 0x9991U; 2026-02-21T09:18:15.0357964Z cvt.u16.u32 %rs1819, %r5594; 2026-02-21T09:18:15.0358024Z prmt.b32 %r5595, %r5417, 0, 0x7771U; 2026-02-21T09:18:15.0358089Z cvt.u16.u32 %rs1820, %r5595; 2026-02-21T09:18:15.0358148Z prmt.b32 %r5596, %r5417, 0, 0xaaa2U; 2026-02-21T09:18:15.0358207Z cvt.u16.u32 %rs1821, %r5596; 2026-02-21T09:18:15.0358270Z prmt.b32 %r5597, %r5417, 0, 0x7772U; 2026-02-21T09:18:15.0358337Z cvt.u16.u32 %rs1822, %r5597; 2026-02-21T09:18:15.0358398Z prmt.b32 %r5598, %r5417, 0, 0xbbb3U; 2026-02-21T09:18:15.0358458Z cvt.u16.u32 %rs1823, %r5598; 2026-02-21T09:18:15.0358528Z prmt.b32 %r5599, %r5417, 0, 0x7773U; 2026-02-21T09:18:15.0358586Z cvt.u16.u32 %rs1824, %r5599; 2026-02-21T09:18:15.0358645Z prmt.b32 %r5600, %r5418, 0, 0x8880U; 2026-02-21T09:18:15.0358710Z cvt.u16.u32 %rs1825, %r5600; 2026-02-21T09:18:15.0358769Z prmt.b32 %r5601, %r5418, 0, 0x7770U; 2026-02-21T09:18:15.0358827Z cvt.u16.u32 %rs1826, %r5601; 2026-02-21T09:18:15.0358885Z prmt.b32 %r5602, %r5418, 0, 0x9991U; 2026-02-21T09:18:15.0358948Z cvt.u16.u32 
%rs1827, %r5602; 2026-02-21T09:18:15.0359007Z prmt.b32 %r5603, %r5418, 0, 0x7771U; 2026-02-21T09:18:15.0359063Z cvt.u16.u32 %rs1828, %r5603; 2026-02-21T09:18:15.0359129Z prmt.b32 %r5604, %r5418, 0, 0xaaa2U; 2026-02-21T09:18:15.0359186Z cvt.u16.u32 %rs1829, %r5604; 2026-02-21T09:18:15.0359246Z prmt.b32 %r5605, %r5418, 0, 0x7772U; 2026-02-21T09:18:15.0359302Z cvt.u16.u32 %rs1830, %r5605; 2026-02-21T09:18:15.0359366Z prmt.b32 %r5606, %r5418, 0, 0xbbb3U; 2026-02-21T09:18:15.0359424Z cvt.u16.u32 %rs1831, %r5606; 2026-02-21T09:18:15.0359483Z prmt.b32 %r5607, %r5418, 0, 0x7773U; 2026-02-21T09:18:15.0359550Z cvt.u16.u32 %rs1832, %r5607; 2026-02-21T09:18:15.0359610Z prmt.b32 %r5608, %r5419, 0, 0x8880U; 2026-02-21T09:18:15.0359670Z cvt.u16.u32 %rs1833, %r5608; 2026-02-21T09:18:15.0359737Z prmt.b32 %r5609, %r5419, 0, 0x7770U; 2026-02-21T09:18:15.0359793Z cvt.u16.u32 %rs1834, %r5609; 2026-02-21T09:18:15.0359852Z prmt.b32 %r5610, %r5419, 0, 0x9991U; 2026-02-21T09:18:15.0359910Z cvt.u16.u32 %rs1835, %r5610; 2026-02-21T09:18:15.0359976Z prmt.b32 %r5611, %r5419, 0, 0x7771U; 2026-02-21T09:18:15.0360030Z cvt.u16.u32 %rs1836, %r5611; 2026-02-21T09:18:15.0360090Z prmt.b32 %r5612, %r5419, 0, 0xaaa2U; 2026-02-21T09:18:15.0360154Z cvt.u16.u32 %rs1837, %r5612; 2026-02-21T09:18:15.0360235Z prmt.b32 %r5613, %r5419, 0, 0x7772U; 2026-02-21T09:18:15.0360292Z cvt.u16.u32 %rs1838, %r5613; 2026-02-21T09:18:15.0360350Z prmt.b32 %r5614, %r5419, 0, 0xbbb3U; 2026-02-21T09:18:15.0360444Z cvt.u16.u32 %rs1839, %r5614; 2026-02-21T09:18:15.0360505Z prmt.b32 %r5615, %r5419, 0, 0x7773U; 2026-02-21T09:18:15.0360601Z cvt.u16.u32 %rs1840, %r5615; 2026-02-21T09:18:15.0360795Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0360858Z shl.b16 %rs1841, %rs1810, 12; 2026-02-21T09:18:15.0360918Z shr.s16 %rs1842, %rs1841, 12; 2026-02-21T09:18:15.0360982Z shl.b16 %rs1843, %rs1812, 12; 2026-02-21T09:18:15.0361038Z shr.s16 %rs1844, %rs1843, 12; 
2026-02-21T09:18:15.0361095Z shl.b16 %rs1845, %rs1814, 12; 2026-02-21T09:18:15.0361150Z shr.s16 %rs1846, %rs1845, 12; 2026-02-21T09:18:15.0361213Z shl.b16 %rs1847, %rs1816, 12; 2026-02-21T09:18:15.0361269Z shr.s16 %rs1848, %rs1847, 12; 2026-02-21T09:18:15.0361327Z shl.b16 %rs1849, %rs1818, 12; 2026-02-21T09:18:15.0361389Z shr.s16 %rs1850, %rs1849, 12; 2026-02-21T09:18:15.0361445Z shl.b16 %rs1851, %rs1820, 12; 2026-02-21T09:18:15.0361500Z shr.s16 %rs1852, %rs1851, 12; 2026-02-21T09:18:15.0361591Z shl.b16 %rs1853, %rs1822, 12; 2026-02-21T09:18:15.0361656Z shr.s16 %rs1854, %rs1853, 12; 2026-02-21T09:18:15.0361714Z shl.b16 %rs1855, %rs1824, 12; 2026-02-21T09:18:15.0361773Z shr.s16 %rs1856, %rs1855, 12; 2026-02-21T09:18:15.0361835Z shl.b16 %rs1857, %rs1826, 12; 2026-02-21T09:18:15.0361890Z shr.s16 %rs1858, %rs1857, 12; 2026-02-21T09:18:15.0361946Z shl.b16 %rs1859, %rs1828, 12; 2026-02-21T09:18:15.0362002Z shr.s16 %rs1860, %rs1859, 12; 2026-02-21T09:18:15.0362065Z shl.b16 %rs1861, %rs1830, 12; 2026-02-21T09:18:15.0362120Z shr.s16 %rs1862, %rs1861, 12; 2026-02-21T09:18:15.0362176Z shl.b16 %rs1863, %rs1832, 12; 2026-02-21T09:18:15.0362239Z shr.s16 %rs1864, %rs1863, 12; 2026-02-21T09:18:15.0362294Z shl.b16 %rs1865, %rs1834, 12; 2026-02-21T09:18:15.0362351Z shr.s16 %rs1866, %rs1865, 12; 2026-02-21T09:18:15.0362407Z shl.b16 %rs1867, %rs1836, 12; 2026-02-21T09:18:15.0362469Z shr.s16 %rs1868, %rs1867, 12; 2026-02-21T09:18:15.0362525Z shl.b16 %rs1869, %rs1838, 12; 2026-02-21T09:18:15.0362582Z shr.s16 %rs1870, %rs1869, 12; 2026-02-21T09:18:15.0362643Z shl.b16 %rs1871, %rs1840, 12; 2026-02-21T09:18:15.0362700Z shr.s16 %rs1872, %rs1871, 12; 2026-02-21T09:18:15.0362862Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0362928Z shr.u16 %rs1873, %rs1809, 4; 2026-02-21T09:18:15.0362985Z shr.u16 %rs1874, %rs1811, 4; 2026-02-21T09:18:15.0363041Z shr.u16 %rs1875, %rs1813, 4; 2026-02-21T09:18:15.0363099Z shr.u16 %rs1876, %rs1815, 4; 
2026-02-21T09:18:15.0363161Z shr.u16 %rs1877, %rs1817, 4; 2026-02-21T09:18:15.0363216Z shr.u16 %rs1878, %rs1819, 4; 2026-02-21T09:18:15.0363272Z shr.u16 %rs1879, %rs1821, 4; 2026-02-21T09:18:15.0363334Z shr.u16 %rs1880, %rs1823, 4; 2026-02-21T09:18:15.0363391Z shr.u16 %rs1881, %rs1825, 4; 2026-02-21T09:18:15.0363446Z shr.u16 %rs1882, %rs1827, 4; 2026-02-21T09:18:15.0363502Z shr.u16 %rs1883, %rs1829, 4; 2026-02-21T09:18:15.0363565Z shr.u16 %rs1884, %rs1831, 4; 2026-02-21T09:18:15.0363621Z shr.u16 %rs1885, %rs1833, 4; 2026-02-21T09:18:15.0363680Z shr.u16 %rs1886, %rs1835, 4; 2026-02-21T09:18:15.0363744Z shr.u16 %rs1887, %rs1837, 4; 2026-02-21T09:18:15.0363800Z shr.u16 %rs1888, %rs1839, 4; 2026-02-21T09:18:15.0363961Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0364021Z bar.sync 0; 2026-02-21T09:18:15.0364080Z st.shared.b8 [%r29], %rs1842; 2026-02-21T09:18:15.0364139Z st.shared.b8 [%r30], %rs1844; 2026-02-21T09:18:15.0364197Z st.shared.b8 [%r31], %rs1846; 2026-02-21T09:18:15.0364262Z st.shared.b8 [%r32], %rs1848; 2026-02-21T09:18:15.0364324Z st.shared.b8 [%r33+512], %rs1850; 2026-02-21T09:18:15.0364387Z st.shared.b8 [%r34+512], %rs1852; 2026-02-21T09:18:15.0364481Z st.shared.b8 [%r35+512], %rs1854; 2026-02-21T09:18:15.0364540Z st.shared.b8 [%r36+512], %rs1856; 2026-02-21T09:18:15.0364603Z st.shared.b8 [%r37+1024], %rs1858; 2026-02-21T09:18:15.0364692Z st.shared.b8 [%r38+1024], %rs1860; 2026-02-21T09:18:15.0364763Z st.shared.b8 [%r39+1024], %rs1862; 2026-02-21T09:18:15.0364862Z st.shared.b8 [%r40+1024], %rs1864; 2026-02-21T09:18:15.0364955Z st.shared.b8 [%r41+1536], %rs1866; 2026-02-21T09:18:15.0365029Z st.shared.b8 [%r42+1536], %rs1868; 2026-02-21T09:18:15.0365091Z st.shared.b8 [%r43+1536], %rs1870; 2026-02-21T09:18:15.0365151Z st.shared.b8 [%r44+1536], %rs1872; 2026-02-21T09:18:15.0365207Z bar.sync 0; 2026-02-21T09:18:15.0365279Z ld.shared.b32 %r5616, [%r45]; 2026-02-21T09:18:15.0365340Z ld.shared.b32 %r5617, 
[%r46]; 2026-02-21T09:18:15.0365401Z ld.shared.b32 %r5618, [%r47]; 2026-02-21T09:18:15.0365469Z ld.shared.b32 %r5619, [%r48]; 2026-02-21T09:18:15.0365639Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0365695Z bar.sync 0; 2026-02-21T09:18:15.0365760Z st.shared.b8 [%r29], %rs1873; 2026-02-21T09:18:15.0365821Z st.shared.b8 [%r30], %rs1874; 2026-02-21T09:18:15.0365879Z st.shared.b8 [%r31], %rs1875; 2026-02-21T09:18:15.0365939Z st.shared.b8 [%r32], %rs1876; 2026-02-21T09:18:15.0366008Z st.shared.b8 [%r33+512], %rs1877; 2026-02-21T09:18:15.0366071Z st.shared.b8 [%r34+512], %rs1878; 2026-02-21T09:18:15.0366133Z st.shared.b8 [%r35+512], %rs1879; 2026-02-21T09:18:15.0366199Z st.shared.b8 [%r36+512], %rs1880; 2026-02-21T09:18:15.0366259Z st.shared.b8 [%r37+1024], %rs1881; 2026-02-21T09:18:15.0366321Z st.shared.b8 [%r38+1024], %rs1882; 2026-02-21T09:18:15.0366383Z st.shared.b8 [%r39+1024], %rs1883; 2026-02-21T09:18:15.0366450Z st.shared.b8 [%r40+1024], %rs1884; 2026-02-21T09:18:15.0366511Z st.shared.b8 [%r41+1536], %rs1885; 2026-02-21T09:18:15.0366570Z st.shared.b8 [%r42+1536], %rs1886; 2026-02-21T09:18:15.0366639Z st.shared.b8 [%r43+1536], %rs1887; 2026-02-21T09:18:15.0366700Z st.shared.b8 [%r44+1536], %rs1888; 2026-02-21T09:18:15.0366754Z bar.sync 0; 2026-02-21T09:18:15.0366824Z ld.shared.b32 %r5620, [%r45]; 2026-02-21T09:18:15.0366886Z ld.shared.b32 %r5621, [%r46]; 2026-02-21T09:18:15.0366947Z ld.shared.b32 %r5622, [%r47]; 2026-02-21T09:18:15.0367008Z ld.shared.b32 %r5623, [%r48]; 2026-02-21T09:18:15.0367183Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0367245Z cvt.s8.s32 %rs1889, %r5617; 2026-02-21T09:18:15.0367311Z cvt.rn.f32.s16 %r5624, %rs1889; 2026-02-21T09:18:15.0367378Z cvt.s8.s32 %rs1890, %r5616; 2026-02-21T09:18:15.0367442Z cvt.rn.f32.s16 %r5625, %rs1890; 2026-02-21T09:18:15.0367502Z cvt.s8.s32 %rs1891, %r5621; 2026-02-21T09:18:15.0367564Z cvt.rn.f32.s16 
%r5626, %rs1891; 2026-02-21T09:18:15.0367629Z cvt.s8.s32 %rs1892, %r5620; 2026-02-21T09:18:15.0367691Z cvt.rn.f32.s16 %r5627, %rs1892; 2026-02-21T09:18:15.0367752Z cvt.s8.s32 %rs1893, %r5619; 2026-02-21T09:18:15.0367820Z cvt.rn.f32.s16 %r5628, %rs1893; 2026-02-21T09:18:15.0367878Z cvt.s8.s32 %rs1894, %r5618; 2026-02-21T09:18:15.0367939Z cvt.rn.f32.s16 %r5629, %rs1894; 2026-02-21T09:18:15.0368003Z cvt.s8.s32 %rs1895, %r5623; 2026-02-21T09:18:15.0368061Z cvt.rn.f32.s16 %r5630, %rs1895; 2026-02-21T09:18:15.0368122Z cvt.s8.s32 %rs1896, %r5622; 2026-02-21T09:18:15.0368184Z cvt.rn.f32.s16 %r5631, %rs1896; 2026-02-21T09:18:15.0368257Z prmt.b32 %r5632, %r5617, 0, 0x9991U; 2026-02-21T09:18:15.0368318Z cvt.u16.u32 %rs1897, %r5632; 2026-02-21T09:18:15.0368379Z cvt.rn.f32.s16 %r5633, %rs1897; 2026-02-21T09:18:15.0368446Z prmt.b32 %r5634, %r5616, 0, 0x9991U; 2026-02-21T09:18:15.0368506Z cvt.u16.u32 %rs1898, %r5634; 2026-02-21T09:18:15.0368566Z cvt.rn.f32.s16 %r5635, %rs1898; 2026-02-21T09:18:15.0368629Z prmt.b32 %r5636, %r5621, 0, 0x9991U; 2026-02-21T09:18:15.0368694Z cvt.u16.u32 %rs1899, %r5636; 2026-02-21T09:18:15.0368780Z cvt.rn.f32.s16 %r5637, %rs1899; 2026-02-21T09:18:15.0368843Z prmt.b32 %r5638, %r5620, 0, 0x9991U; 2026-02-21T09:18:15.0368908Z cvt.u16.u32 %rs1900, %r5638; 2026-02-21T09:18:15.0368967Z cvt.rn.f32.s16 %r5639, %rs1900; 2026-02-21T09:18:15.0369050Z prmt.b32 %r5640, %r5619, 0, 0x9991U; 2026-02-21T09:18:15.0369108Z cvt.u16.u32 %rs1901, %r5640; 2026-02-21T09:18:15.0369196Z cvt.rn.f32.s16 %r5641, %rs1901; 2026-02-21T09:18:15.0369280Z prmt.b32 %r5642, %r5618, 0, 0x9991U; 2026-02-21T09:18:15.0369340Z cvt.u16.u32 %rs1902, %r5642; 2026-02-21T09:18:15.0369404Z cvt.rn.f32.s16 %r5643, %rs1902; 2026-02-21T09:18:15.0369465Z prmt.b32 %r5644, %r5623, 0, 0x9991U; 2026-02-21T09:18:15.0369523Z cvt.u16.u32 %rs1903, %r5644; 2026-02-21T09:18:15.0369589Z cvt.rn.f32.s16 %r5645, %rs1903; 2026-02-21T09:18:15.0369651Z prmt.b32 %r5646, %r5622, 0, 0x9991U; 2026-02-21T09:18:15.0369711Z 
cvt.u16.u32 %rs1904, %r5646; 2026-02-21T09:18:15.0369769Z cvt.rn.f32.s16 %r5647, %rs1904; 2026-02-21T09:18:15.0369838Z prmt.b32 %r5648, %r5617, 0, 0xaaa2U; 2026-02-21T09:18:15.0369896Z cvt.u16.u32 %rs1905, %r5648; 2026-02-21T09:18:15.0369957Z cvt.rn.f32.s16 %r5649, %rs1905; 2026-02-21T09:18:15.0370023Z prmt.b32 %r5650, %r5616, 0, 0xaaa2U; 2026-02-21T09:18:15.0370083Z cvt.u16.u32 %rs1906, %r5650; 2026-02-21T09:18:15.0370142Z cvt.rn.f32.s16 %r5651, %rs1906; 2026-02-21T09:18:15.0370205Z prmt.b32 %r5652, %r5621, 0, 0xaaa2U; 2026-02-21T09:18:15.0370271Z cvt.u16.u32 %rs1907, %r5652; 2026-02-21T09:18:15.0370331Z cvt.rn.f32.s16 %r5653, %rs1907; 2026-02-21T09:18:15.0370393Z prmt.b32 %r5654, %r5620, 0, 0xaaa2U; 2026-02-21T09:18:15.0370457Z cvt.u16.u32 %rs1908, %r5654; 2026-02-21T09:18:15.0370517Z cvt.rn.f32.s16 %r5655, %rs1908; 2026-02-21T09:18:15.0370579Z prmt.b32 %r5656, %r5619, 0, 0xaaa2U; 2026-02-21T09:18:15.0370642Z cvt.u16.u32 %rs1909, %r5656; 2026-02-21T09:18:15.0370702Z cvt.rn.f32.s16 %r5657, %rs1909; 2026-02-21T09:18:15.0370763Z prmt.b32 %r5658, %r5618, 0, 0xaaa2U; 2026-02-21T09:18:15.0370824Z cvt.u16.u32 %rs1910, %r5658; 2026-02-21T09:18:15.0370890Z cvt.rn.f32.s16 %r5659, %rs1910; 2026-02-21T09:18:15.0370951Z prmt.b32 %r5660, %r5623, 0, 0xaaa2U; 2026-02-21T09:18:15.0371010Z cvt.u16.u32 %rs1911, %r5660; 2026-02-21T09:18:15.0371074Z cvt.rn.f32.s16 %r5661, %rs1911; 2026-02-21T09:18:15.0371136Z prmt.b32 %r5662, %r5622, 0, 0xaaa2U; 2026-02-21T09:18:15.0371196Z cvt.u16.u32 %rs1912, %r5662; 2026-02-21T09:18:15.0371257Z cvt.rn.f32.s16 %r5663, %rs1912; 2026-02-21T09:18:15.0371326Z prmt.b32 %r5664, %r5617, 0, 0xbbb3U; 2026-02-21T09:18:15.0371387Z cvt.u16.u32 %rs1913, %r5664; 2026-02-21T09:18:15.0371458Z cvt.rn.f32.s16 %r5665, %rs1913; 2026-02-21T09:18:15.0371529Z prmt.b32 %r5666, %r5616, 0, 0xbbb3U; 2026-02-21T09:18:15.0371619Z cvt.u16.u32 %rs1914, %r5666; 2026-02-21T09:18:15.0371681Z cvt.rn.f32.s16 %r5667, %rs1914; 2026-02-21T09:18:15.0371745Z prmt.b32 %r5668, %r5621, 0, 
0xbbb3U; 2026-02-21T09:18:15.0371814Z cvt.u16.u32 %rs1915, %r5668; 2026-02-21T09:18:15.0371876Z cvt.rn.f32.s16 %r5669, %rs1915; 2026-02-21T09:18:15.0371940Z prmt.b32 %r5670, %r5620, 0, 0xbbb3U; 2026-02-21T09:18:15.0372007Z cvt.u16.u32 %rs1916, %r5670; 2026-02-21T09:18:15.0372069Z cvt.rn.f32.s16 %r5671, %rs1916; 2026-02-21T09:18:15.0372135Z prmt.b32 %r5672, %r5619, 0, 0xbbb3U; 2026-02-21T09:18:15.0372213Z cvt.u16.u32 %rs1917, %r5672; 2026-02-21T09:18:15.0372272Z cvt.rn.f32.s16 %r5673, %rs1917; 2026-02-21T09:18:15.0372334Z prmt.b32 %r5674, %r5618, 0, 0xbbb3U; 2026-02-21T09:18:15.0372392Z cvt.u16.u32 %rs1918, %r5674; 2026-02-21T09:18:15.0372458Z cvt.rn.f32.s16 %r5675, %rs1918; 2026-02-21T09:18:15.0372518Z prmt.b32 %r5676, %r5623, 0, 0xbbb3U; 2026-02-21T09:18:15.0372575Z cvt.u16.u32 %rs1919, %r5676; 2026-02-21T09:18:15.0372641Z cvt.rn.f32.s16 %r5677, %rs1919; 2026-02-21T09:18:15.0372701Z prmt.b32 %r5678, %r5622, 0, 0xbbb3U; 2026-02-21T09:18:15.0372759Z cvt.u16.u32 %rs1920, %r5678; 2026-02-21T09:18:15.0372817Z cvt.rn.f32.s16 %r5679, %rs1920; 2026-02-21T09:18:15.0372880Z bar.sync 0; 2026-02-21T09:18:15.0373017Z st.shared.v4.b32 [%r49], {%r5625, %r5627, %r5624, %r5626}; 2026-02-21T09:18:15.0373112Z st.shared.v4.b32 [%r50], {%r5629, %r5631, %r5628, %r5630}; 2026-02-21T09:18:15.0373235Z st.shared.v4.b32 [%r51], {%r5635, %r5639, %r5633, %r5637}; 2026-02-21T09:18:15.0373323Z st.shared.v4.b32 [%r52], {%r5643, %r5647, %r5641, %r5645}; 2026-02-21T09:18:15.0373435Z st.shared.v4.b32 [%r53], {%r5651, %r5655, %r5649, %r5653}; 2026-02-21T09:18:15.0373556Z st.shared.v4.b32 [%r54], {%r5659, %r5663, %r5657, %r5661}; 2026-02-21T09:18:15.0373645Z st.shared.v4.b32 [%r55], {%r5667, %r5671, %r5665, %r5669}; 2026-02-21T09:18:15.0373733Z st.shared.v4.b32 [%r56], {%r5675, %r5679, %r5673, %r5677}; 2026-02-21T09:18:15.0373786Z $L__tmp331: 2026-02-21T09:18:15.0374006Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0374065Z 
// begin inline asm 2026-02-21T09:18:15.0374374Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5420, %r5421, %r5422, %r5423, %r5424, %r5425, %r5426, %r5427, %r5428, %r5429, %r5430, %r5431, %r5432, %r5433, %r5434, %r5435}, [%r915 + 0], 64; 2026-02-21T09:18:15.0374439Z // end inline asm 2026-02-21T09:18:15.0374499Z // begin inline asm 2026-02-21T09:18:15.0374804Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5437, %r5438, %r5439, %r5440, %r5441, %r5442, %r5443, %r5444, %r5445, %r5446, %r5447, %r5448, %r5449, %r5450, %r5451, %r5452}, [%r915 + 16], 64; 2026-02-21T09:18:15.0374865Z // end inline asm 2026-02-21T09:18:15.0374922Z // begin inline asm 2026-02-21T09:18:15.0375217Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5454, %r5455, %r5456, %r5457, %r5458, %r5459, %r5460, %r5461, %r5462, %r5463, %r5464, %r5465, %r5466, %r5467, %r5468, %r5469}, [%r915 + 32], 64; 2026-02-21T09:18:15.0375277Z // end inline asm 2026-02-21T09:18:15.0375333Z // begin inline asm 2026-02-21T09:18:15.0375634Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5471, %r5472, %r5473, %r5474, %r5475, %r5476, %r5477, %r5478, %r5479, %r5480, %r5481, %r5482, %r5483, %r5484, %r5485, %r5486}, [%r915 + 48], 64; 2026-02-21T09:18:15.0375698Z // end inline asm 2026-02-21T09:18:15.0375755Z // begin inline asm 2026-02-21T09:18:15.0375825Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0375879Z // end inline asm 2026-02-21T09:18:15.0375940Z // begin inline asm 2026-02-21T09:18:15.0376251Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 0], 64, {%r5420, %r5421, %r5422, %r5423, %r5424, %r5425, %r5426, %r5427, %r5428, %r5429, %r5430, %r5431, %r5432, %r5433, %r5434, %r5435}; 2026-02-21T09:18:15.0376305Z // end inline asm 2026-02-21T09:18:15.0376367Z // begin inline asm 2026-02-21T09:18:15.0376671Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 16], 64, {%r5437, %r5438, %r5439, %r5440, %r5441, %r5442, %r5443, %r5444, %r5445, %r5446, %r5447, %r5448, %r5449, %r5450, %r5451, %r5452}; 
2026-02-21T09:18:15.0376726Z // end inline asm 2026-02-21T09:18:15.0376788Z // begin inline asm 2026-02-21T09:18:15.0377092Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 32], 64, {%r5454, %r5455, %r5456, %r5457, %r5458, %r5459, %r5460, %r5461, %r5462, %r5463, %r5464, %r5465, %r5466, %r5467, %r5468, %r5469}; 2026-02-21T09:18:15.0377148Z // end inline asm 2026-02-21T09:18:15.0377210Z // begin inline asm 2026-02-21T09:18:15.0377512Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r1205 + 48], 64, {%r5471, %r5472, %r5473, %r5474, %r5475, %r5476, %r5477, %r5478, %r5479, %r5480, %r5481, %r5482, %r5483, %r5484, %r5485, %r5486}; 2026-02-21T09:18:15.0377566Z // end inline asm 2026-02-21T09:18:15.0377620Z // begin inline asm 2026-02-21T09:18:15.0377695Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0377749Z // end inline asm 2026-02-21T09:18:15.0377802Z bar.sync 0; 2026-02-21T09:18:15.0377864Z // begin inline asm 2026-02-21T09:18:15.0378171Z @%p263 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5556 + 0], 16, {%r5557, %r5558, %r5559, %r5560, %r5561, %r5562, %r5563, %r5564, %r5565, %r5566, %r5567, %r5568, %r5569, %r5570, %r5571, %r5572}; 2026-02-21T09:18:15.0378248Z // end inline asm 2026-02-21T09:18:15.0378311Z // begin inline asm 2026-02-21T09:18:15.0378376Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0378450Z // end inline asm 2026-02-21T09:18:15.0378505Z bar.sync 0; 2026-02-21T09:18:15.0378569Z // begin inline asm 2026-02-21T09:18:15.0378658Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0378715Z // end inline asm 2026-02-21T09:18:15.0378796Z // begin inline asm 2026-02-21T09:18:15.0378888Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:15.0378942Z // end inline asm 2026-02-21T09:18:15.0378995Z bar.sync 0; 2026-02-21T09:18:15.0379063Z @%p20 bra $L__BB0_39; 2026-02-21T09:18:15.0379162Z // %bb.38: // in Loop: Header=BB0_33 Depth=2 2026-02-21T09:18:15.0379230Z elect.sync %r5692|%p281, -1; 2026-02-21T09:18:15.0379297Z mov.b32 
%r5682, 69208336; 2026-02-21T09:18:15.0379358Z mov.pred %p280, -1; 2026-02-21T09:18:15.0379413Z // begin inline asm 2026-02-21T09:18:15.0379580Z @%p281 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 0 ], %rd1, %r5682, %p280; 2026-02-21T09:18:15.0379635Z // end inline asm 2026-02-21T09:18:15.0379693Z // begin inline asm 2026-02-21T09:18:15.0379844Z @%p281 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 8 ], %rd2, %r5682, %p280; 2026-02-21T09:18:15.0379908Z // end inline asm 2026-02-21T09:18:15.0379964Z // begin inline asm 2026-02-21T09:18:15.0380117Z @%p281 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 16 ], %rd3, %r5682, %p280; 2026-02-21T09:18:15.0380180Z // end inline asm 2026-02-21T09:18:15.0380237Z // begin inline asm 2026-02-21T09:18:15.0380386Z @%p281 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 24 ], %rd4, %r5682, %p280; 2026-02-21T09:18:15.0380451Z // end inline asm 2026-02-21T09:18:15.0380516Z cvt.u64.u32 %rd698, %r5985; 2026-02-21T09:18:15.0380571Z // begin inline asm 2026-02-21T09:18:15.0380711Z @%p281 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd698]; 2026-02-21T09:18:15.0380768Z // end inline asm 2026-02-21T09:18:15.0380870Z $L__BB0_39: // in Loop: Header=BB0_33 Depth=2 2026-02-21T09:18:15.0380963Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:15.0381028Z mov.b32 %r5696, 0; 2026-02-21T09:18:15.0381245Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0381300Z // begin inline asm 2026-02-21T09:18:15.0381358Z 2026-02-21T09:18:15.0381407Z { 2026-02-21T09:18:15.0381469Z .reg .pred complete; 2026-02-21T09:18:15.0381523Z waitLoop: 2026-02-21T09:18:15.0381681Z mbarrier.try_wait.parity.shared.b64 complete, [%r5985], %r5696; 2026-02-21T09:18:15.0381750Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0381801Z } 2026-02-21T09:18:15.0381806Z 2026-02-21T09:18:15.0381869Z // end inline asm 
2026-02-21T09:18:15.0381923Z bar.sync 0; 2026-02-21T09:18:15.0381981Z // begin inline asm 2026-02-21T09:18:15.0382073Z @%p366 mbarrier.inval.shared::cta.b64 [%r5985]; 2026-02-21T09:18:15.0382128Z // end inline asm 2026-02-21T09:18:15.0382181Z $L__tmp332: 2026-02-21T09:18:15.0382346Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0382416Z add.s64 %rd700, %rd655, 192; 2026-02-21T09:18:15.0382576Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0382636Z add.s64 %rd703, %rd658, 192; 2026-02-21T09:18:15.0382700Z // begin inline asm 2026-02-21T09:18:15.0382757Z mov.u64 %rd699, 0x0; 2026-02-21T09:18:15.0382865Z createpolicy.fractional.L2::evict_first.b64 %rd699, 1.0; 2026-02-21T09:18:15.0382927Z // end inline asm 2026-02-21T09:18:15.0382983Z // begin inline asm 2026-02-21T09:18:15.0383039Z mov.u32 %r5698, 0x0; 2026-02-21T09:18:15.0383094Z mov.u32 %r5699, 0x0; 2026-02-21T09:18:15.0383157Z mov.u32 %r5700, 0x0; 2026-02-21T09:18:15.0383241Z mov.u32 %r5701, 0x0; 2026-02-21T09:18:15.0383423Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r5698, %r5699, %r5700, %r5701 }, [ %rd700 + 0 ], %rd699; 2026-02-21T09:18:15.0383511Z // end inline asm 2026-02-21T09:18:15.0383567Z // begin inline asm 2026-02-21T09:18:15.0383622Z mov.u64 %rd702, 0x0; 2026-02-21T09:18:15.0383753Z createpolicy.fractional.L2::evict_first.b64 %rd702, 1.0; 2026-02-21T09:18:15.0383817Z // end inline asm 2026-02-21T09:18:15.0383896Z // begin inline asm 2026-02-21T09:18:15.0383951Z mov.u32 %r5702, 0x0; 2026-02-21T09:18:15.0384013Z mov.u32 %r5703, 0x0; 2026-02-21T09:18:15.0384068Z mov.u32 %r5704, 0x0; 2026-02-21T09:18:15.0384123Z mov.u32 %r5705, 0x0; 2026-02-21T09:18:15.0384298Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r5702, %r5703, %r5704, %r5705 }, [ %rd703 + 0 ], %rd702; 2026-02-21T09:18:15.0384352Z // end inline asm 2026-02-21T09:18:15.0384404Z $L__tmp333: 2026-02-21T09:18:15.0384616Z .loc 2 291 36 
// standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0384722Z st.shared.v4.b32 [%r26], {%r5698, %r5699, %r5700, %r5701}; 2026-02-21T09:18:15.0384822Z st.shared.v4.b32 [%r26+512], {%r5702, %r5703, %r5704, %r5705}; 2026-02-21T09:18:15.0384878Z bar.sync 0; 2026-02-21T09:18:15.0384978Z ld.shared.v4.b32 {%r5865, %r5866, %r5867, %r5868}, [%r27]; 2026-02-21T09:18:15.0385044Z mov.b32 {%rs1921, %rs1922}, %r5868; 2026-02-21T09:18:15.0385109Z mov.b32 {%rs1923, %rs1924}, %r5867; 2026-02-21T09:18:15.0385174Z mov.b32 {%rs1925, %rs1926}, %r5866; 2026-02-21T09:18:15.0385234Z mov.b32 {%rs1927, %rs1928}, %r5865; 2026-02-21T09:18:15.0385325Z ld.shared.v4.b32 {%r5869, %r5870, %r5871, %r5872}, [%r28]; 2026-02-21T09:18:15.0385384Z mov.b32 {%rs1929, %rs1930}, %r5872; 2026-02-21T09:18:15.0385449Z mov.b32 {%rs1931, %rs1932}, %r5871; 2026-02-21T09:18:15.0385508Z mov.b32 {%rs1933, %rs1934}, %r5870; 2026-02-21T09:18:15.0385567Z mov.b32 {%rs1935, %rs1936}, %r5869; 2026-02-21T09:18:15.0385628Z $L__tmp334: 2026-02-21T09:18:15.0385789Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0385852Z cvt.f32.bf16 %r5847, %rs1927; 2026-02-21T09:18:15.0385919Z cvt.f32.bf16 %r5848, %rs1928; 2026-02-21T09:18:15.0385976Z cvt.f32.bf16 %r5849, %rs1925; 2026-02-21T09:18:15.0386037Z cvt.f32.bf16 %r5850, %rs1926; 2026-02-21T09:18:15.0386093Z cvt.f32.bf16 %r5851, %rs1923; 2026-02-21T09:18:15.0386158Z cvt.f32.bf16 %r5852, %rs1924; 2026-02-21T09:18:15.0386214Z cvt.f32.bf16 %r5853, %rs1921; 2026-02-21T09:18:15.0386271Z cvt.f32.bf16 %r5854, %rs1922; 2026-02-21T09:18:15.0386334Z cvt.f32.bf16 %r5855, %rs1935; 2026-02-21T09:18:15.0386389Z cvt.f32.bf16 %r5856, %rs1936; 2026-02-21T09:18:15.0386445Z cvt.f32.bf16 %r5857, %rs1933; 2026-02-21T09:18:15.0386501Z cvt.f32.bf16 %r5858, %rs1934; 2026-02-21T09:18:15.0386566Z cvt.f32.bf16 %r5859, %rs1931; 2026-02-21T09:18:15.0386622Z cvt.f32.bf16 %r5860, %rs1932; 
2026-02-21T09:18:15.0386681Z cvt.f32.bf16 %r5861, %rs1929; 2026-02-21T09:18:15.0386744Z cvt.f32.bf16 %r5862, %rs1930; 2026-02-21T09:18:15.0386905Z .loc 1 54 34 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:34 2026-02-21T09:18:15.0386967Z add.s32 %r5873, %r7846, 393216; 2026-02-21T09:18:15.0387029Z cvt.s64.s32 %rd708, %r5873; 2026-02-21T09:18:15.0387101Z add.s64 %rd706, %rd56, %rd708; 2026-02-21T09:18:15.0387260Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0387318Z // begin inline asm 2026-02-21T09:18:15.0387384Z mov.u64 %rd705, 0x0; 2026-02-21T09:18:15.0387490Z createpolicy.fractional.L2::evict_first.b64 %rd705, 1.0; 2026-02-21T09:18:15.0387544Z // end inline asm 2026-02-21T09:18:15.0387606Z // begin inline asm 2026-02-21T09:18:15.0387661Z mov.u32 %r5706, 0x0; 2026-02-21T09:18:15.0387716Z mov.u32 %r5707, 0x0; 2026-02-21T09:18:15.0387770Z mov.u32 %r5708, 0x0; 2026-02-21T09:18:15.0387855Z mov.u32 %r5709, 0x0; 2026-02-21T09:18:15.0388023Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r5706, %r5707, %r5708, %r5709 }, [ %rd706 + 0 ], %rd705; 2026-02-21T09:18:15.0388079Z // end inline asm 2026-02-21T09:18:15.0388179Z prmt.b32 %r5874, %r5706, 0, 0x8880U; 2026-02-21T09:18:15.0388244Z cvt.u16.u32 %rs1937, %r5874; 2026-02-21T09:18:15.0388328Z prmt.b32 %r5875, %r5706, 0, 0x7770U; 2026-02-21T09:18:15.0388429Z cvt.u16.u32 %rs1938, %r5875; 2026-02-21T09:18:15.0388495Z prmt.b32 %r5876, %r5706, 0, 0x9991U; 2026-02-21T09:18:15.0388553Z cvt.u16.u32 %rs1939, %r5876; 2026-02-21T09:18:15.0388615Z prmt.b32 %r5877, %r5706, 0, 0x7771U; 2026-02-21T09:18:15.0388680Z cvt.u16.u32 %rs1940, %r5877; 2026-02-21T09:18:15.0388739Z prmt.b32 %r5878, %r5706, 0, 0xaaa2U; 2026-02-21T09:18:15.0388797Z cvt.u16.u32 %rs1941, %r5878; 2026-02-21T09:18:15.0388862Z prmt.b32 %r5879, %r5706, 0, 0x7772U; 2026-02-21T09:18:15.0388919Z cvt.u16.u32 %rs1942, %r5879; 2026-02-21T09:18:15.0388978Z prmt.b32 %r5880, %r5706, 0, 0xbbb3U; 
2026-02-21T09:18:15.0389036Z cvt.u16.u32 %rs1943, %r5880; 2026-02-21T09:18:15.0389103Z prmt.b32 %r5881, %r5706, 0, 0x7773U; 2026-02-21T09:18:15.0389160Z cvt.u16.u32 %rs1944, %r5881; 2026-02-21T09:18:15.0389222Z prmt.b32 %r5882, %r5707, 0, 0x8880U; 2026-02-21T09:18:15.0389287Z cvt.u16.u32 %rs1945, %r5882; 2026-02-21T09:18:15.0389349Z prmt.b32 %r5883, %r5707, 0, 0x7770U; 2026-02-21T09:18:15.0389407Z cvt.u16.u32 %rs1946, %r5883; 2026-02-21T09:18:15.0389466Z prmt.b32 %r5884, %r5707, 0, 0x9991U; 2026-02-21T09:18:15.0389532Z cvt.u16.u32 %rs1947, %r5884; 2026-02-21T09:18:15.0389591Z prmt.b32 %r5885, %r5707, 0, 0x7771U; 2026-02-21T09:18:15.0389649Z cvt.u16.u32 %rs1948, %r5885; 2026-02-21T09:18:15.0389716Z prmt.b32 %r5886, %r5707, 0, 0xaaa2U; 2026-02-21T09:18:15.0389773Z cvt.u16.u32 %rs1949, %r5886; 2026-02-21T09:18:15.0389833Z prmt.b32 %r5887, %r5707, 0, 0x7772U; 2026-02-21T09:18:15.0389898Z cvt.u16.u32 %rs1950, %r5887; 2026-02-21T09:18:15.0389957Z prmt.b32 %r5888, %r5707, 0, 0xbbb3U; 2026-02-21T09:18:15.0390016Z cvt.u16.u32 %rs1951, %r5888; 2026-02-21T09:18:15.0390075Z prmt.b32 %r5889, %r5707, 0, 0x7773U; 2026-02-21T09:18:15.0390141Z cvt.u16.u32 %rs1952, %r5889; 2026-02-21T09:18:15.0390202Z prmt.b32 %r5890, %r5708, 0, 0x8880U; 2026-02-21T09:18:15.0390259Z cvt.u16.u32 %rs1953, %r5890; 2026-02-21T09:18:15.0390327Z prmt.b32 %r5891, %r5708, 0, 0x7770U; 2026-02-21T09:18:15.0390386Z cvt.u16.u32 %rs1954, %r5891; 2026-02-21T09:18:15.0390446Z prmt.b32 %r5892, %r5708, 0, 0x9991U; 2026-02-21T09:18:15.0390505Z cvt.u16.u32 %rs1955, %r5892; 2026-02-21T09:18:15.0390572Z prmt.b32 %r5893, %r5708, 0, 0x7771U; 2026-02-21T09:18:15.0390630Z cvt.u16.u32 %rs1956, %r5893; 2026-02-21T09:18:15.0390690Z prmt.b32 %r5894, %r5708, 0, 0xaaa2U; 2026-02-21T09:18:15.0390755Z cvt.u16.u32 %rs1957, %r5894; 2026-02-21T09:18:15.0390813Z prmt.b32 %r5895, %r5708, 0, 0x7772U; 2026-02-21T09:18:15.0390871Z cvt.u16.u32 %rs1958, %r5895; 2026-02-21T09:18:15.0390935Z prmt.b32 %r5896, %r5708, 0, 0xbbb3U; 
2026-02-21T09:18:15.0390995Z cvt.u16.u32 %rs1959, %r5896; 2026-02-21T09:18:15.0391054Z prmt.b32 %r5897, %r5708, 0, 0x7773U; 2026-02-21T09:18:15.0391112Z cvt.u16.u32 %rs1960, %r5897; 2026-02-21T09:18:15.0391182Z prmt.b32 %r5898, %r5709, 0, 0x8880U; 2026-02-21T09:18:15.0391239Z cvt.u16.u32 %rs1961, %r5898; 2026-02-21T09:18:15.0391299Z prmt.b32 %r5899, %r5709, 0, 0x7770U; 2026-02-21T09:18:15.0391364Z cvt.u16.u32 %rs1962, %r5899; 2026-02-21T09:18:15.0391422Z prmt.b32 %r5900, %r5709, 0, 0x9991U; 2026-02-21T09:18:15.0391479Z cvt.u16.u32 %rs1963, %r5900; 2026-02-21T09:18:15.0391568Z prmt.b32 %r5901, %r5709, 0, 0x7771U; 2026-02-21T09:18:15.0391633Z cvt.u16.u32 %rs1964, %r5901; 2026-02-21T09:18:15.0391692Z prmt.b32 %r5902, %r5709, 0, 0xaaa2U; 2026-02-21T09:18:15.0391748Z cvt.u16.u32 %rs1965, %r5902; 2026-02-21T09:18:15.0391815Z prmt.b32 %r5903, %r5709, 0, 0x7772U; 2026-02-21T09:18:15.0391871Z cvt.u16.u32 %rs1966, %r5903; 2026-02-21T09:18:15.0391930Z prmt.b32 %r5904, %r5709, 0, 0xbbb3U; 2026-02-21T09:18:15.0392014Z cvt.u16.u32 %rs1967, %r5904; 2026-02-21T09:18:15.0392081Z prmt.b32 %r5905, %r5709, 0, 0x7773U; 2026-02-21T09:18:15.0392137Z cvt.u16.u32 %rs1968, %r5905; 2026-02-21T09:18:15.0392319Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0392409Z shl.b16 %rs1969, %rs1938, 12; 2026-02-21T09:18:15.0392468Z shr.s16 %rs1970, %rs1969, 12; 2026-02-21T09:18:15.0392549Z shl.b16 %rs1971, %rs1940, 12; 2026-02-21T09:18:15.0392614Z shr.s16 %rs1972, %rs1971, 12; 2026-02-21T09:18:15.0392672Z shl.b16 %rs1973, %rs1942, 12; 2026-02-21T09:18:15.0392731Z shr.s16 %rs1974, %rs1973, 12; 2026-02-21T09:18:15.0392789Z shl.b16 %rs1975, %rs1944, 12; 2026-02-21T09:18:15.0392853Z shr.s16 %rs1976, %rs1975, 12; 2026-02-21T09:18:15.0392910Z shl.b16 %rs1977, %rs1946, 12; 2026-02-21T09:18:15.0392967Z shr.s16 %rs1978, %rs1977, 12; 2026-02-21T09:18:15.0393029Z shl.b16 %rs1979, %rs1948, 12; 2026-02-21T09:18:15.0393087Z shr.s16 %rs1980, %rs1979, 12; 
2026-02-21T09:18:15.0393143Z shl.b16 %rs1981, %rs1950, 12; 2026-02-21T09:18:15.0393199Z shr.s16 %rs1982, %rs1981, 12; 2026-02-21T09:18:15.0393263Z shl.b16 %rs1983, %rs1952, 12; 2026-02-21T09:18:15.0393318Z shr.s16 %rs1984, %rs1983, 12; 2026-02-21T09:18:15.0393374Z shl.b16 %rs1985, %rs1954, 12; 2026-02-21T09:18:15.0393440Z shr.s16 %rs1986, %rs1985, 12; 2026-02-21T09:18:15.0393498Z shl.b16 %rs1987, %rs1956, 12; 2026-02-21T09:18:15.0393554Z shr.s16 %rs1988, %rs1987, 12; 2026-02-21T09:18:15.0393618Z shl.b16 %rs1989, %rs1958, 12; 2026-02-21T09:18:15.0393674Z shr.s16 %rs1990, %rs1989, 12; 2026-02-21T09:18:15.0393732Z shl.b16 %rs1991, %rs1960, 12; 2026-02-21T09:18:15.0393787Z shr.s16 %rs1992, %rs1991, 12; 2026-02-21T09:18:15.0393851Z shl.b16 %rs1993, %rs1962, 12; 2026-02-21T09:18:15.0393909Z shr.s16 %rs1994, %rs1993, 12; 2026-02-21T09:18:15.0393965Z shl.b16 %rs1995, %rs1964, 12; 2026-02-21T09:18:15.0394028Z shr.s16 %rs1996, %rs1995, 12; 2026-02-21T09:18:15.0394085Z shl.b16 %rs1997, %rs1966, 12; 2026-02-21T09:18:15.0394142Z shr.s16 %rs1998, %rs1997, 12; 2026-02-21T09:18:15.0394198Z shl.b16 %rs1999, %rs1968, 12; 2026-02-21T09:18:15.0394265Z shr.s16 %rs2000, %rs1999, 12; 2026-02-21T09:18:15.0394427Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0394487Z shr.u16 %rs2001, %rs1937, 4; 2026-02-21T09:18:15.0394556Z shr.u16 %rs2002, %rs1939, 4; 2026-02-21T09:18:15.0394615Z shr.u16 %rs2003, %rs1941, 4; 2026-02-21T09:18:15.0394674Z shr.u16 %rs2004, %rs1943, 4; 2026-02-21T09:18:15.0394733Z shr.u16 %rs2005, %rs1945, 4; 2026-02-21T09:18:15.0394802Z shr.u16 %rs2006, %rs1947, 4; 2026-02-21T09:18:15.0394860Z shr.u16 %rs2007, %rs1949, 4; 2026-02-21T09:18:15.0394918Z shr.u16 %rs2008, %rs1951, 4; 2026-02-21T09:18:15.0394979Z shr.u16 %rs2009, %rs1953, 4; 2026-02-21T09:18:15.0395035Z shr.u16 %rs2010, %rs1955, 4; 2026-02-21T09:18:15.0395093Z shr.u16 %rs2011, %rs1957, 4; 2026-02-21T09:18:15.0395156Z shr.u16 %rs2012, %rs1959, 4; 
2026-02-21T09:18:15.0395212Z shr.u16 %rs2013, %rs1961, 4; 2026-02-21T09:18:15.0395268Z shr.u16 %rs2014, %rs1963, 4; 2026-02-21T09:18:15.0395326Z shr.u16 %rs2015, %rs1965, 4; 2026-02-21T09:18:15.0395389Z shr.u16 %rs2016, %rs1967, 4; 2026-02-21T09:18:15.0395546Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0395601Z bar.sync 0; 2026-02-21T09:18:15.0395668Z st.shared.b8 [%r29], %rs1970; 2026-02-21T09:18:15.0395727Z st.shared.b8 [%r30], %rs1972; 2026-02-21T09:18:15.0395785Z st.shared.b8 [%r31], %rs1974; 2026-02-21T09:18:15.0395842Z st.shared.b8 [%r32], %rs1976; 2026-02-21T09:18:15.0395912Z st.shared.b8 [%r33+512], %rs1978; 2026-02-21T09:18:15.0395973Z st.shared.b8 [%r34+512], %rs1980; 2026-02-21T09:18:15.0396033Z st.shared.b8 [%r35+512], %rs1982; 2026-02-21T09:18:15.0396100Z st.shared.b8 [%r36+512], %rs1984; 2026-02-21T09:18:15.0396185Z st.shared.b8 [%r37+1024], %rs1986; 2026-02-21T09:18:15.0396247Z st.shared.b8 [%r38+1024], %rs1988; 2026-02-21T09:18:15.0396313Z st.shared.b8 [%r39+1024], %rs1990; 2026-02-21T09:18:15.0396394Z st.shared.b8 [%r40+1024], %rs1992; 2026-02-21T09:18:15.0396454Z st.shared.b8 [%r41+1536], %rs1994; 2026-02-21T09:18:15.0396513Z st.shared.b8 [%r42+1536], %rs1996; 2026-02-21T09:18:15.0396601Z st.shared.b8 [%r43+1536], %rs1998; 2026-02-21T09:18:15.0396684Z st.shared.b8 [%r44+1536], %rs2000; 2026-02-21T09:18:15.0396738Z bar.sync 0; 2026-02-21T09:18:15.0396807Z ld.shared.b32 %r5906, [%r45]; 2026-02-21T09:18:15.0396866Z ld.shared.b32 %r5907, [%r46]; 2026-02-21T09:18:15.0396924Z ld.shared.b32 %r5908, [%r47]; 2026-02-21T09:18:15.0396983Z ld.shared.b32 %r5909, [%r48]; 2026-02-21T09:18:15.0397147Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0397200Z bar.sync 0; 2026-02-21T09:18:15.0397260Z st.shared.b8 [%r29], %rs2001; 2026-02-21T09:18:15.0397327Z st.shared.b8 [%r30], %rs2002; 2026-02-21T09:18:15.0397384Z st.shared.b8 [%r31], %rs2003; 
2026-02-21T09:18:15.0397440Z st.shared.b8 [%r32], %rs2004; 2026-02-21T09:18:15.0397502Z st.shared.b8 [%r33+512], %rs2005; 2026-02-21T09:18:15.0397569Z st.shared.b8 [%r34+512], %rs2006; 2026-02-21T09:18:15.0397629Z st.shared.b8 [%r35+512], %rs2007; 2026-02-21T09:18:15.0397686Z st.shared.b8 [%r36+512], %rs2008; 2026-02-21T09:18:15.0397751Z st.shared.b8 [%r37+1024], %rs2009; 2026-02-21T09:18:15.0397810Z st.shared.b8 [%r38+1024], %rs2010; 2026-02-21T09:18:15.0397869Z st.shared.b8 [%r39+1024], %rs2011; 2026-02-21T09:18:15.0397934Z st.shared.b8 [%r40+1024], %rs2012; 2026-02-21T09:18:15.0397992Z st.shared.b8 [%r41+1536], %rs2013; 2026-02-21T09:18:15.0398051Z st.shared.b8 [%r42+1536], %rs2014; 2026-02-21T09:18:15.0398108Z st.shared.b8 [%r43+1536], %rs2015; 2026-02-21T09:18:15.0398174Z st.shared.b8 [%r44+1536], %rs2016; 2026-02-21T09:18:15.0398227Z bar.sync 0; 2026-02-21T09:18:15.0398287Z ld.shared.b32 %r5910, [%r45]; 2026-02-21T09:18:15.0398350Z ld.shared.b32 %r5911, [%r46]; 2026-02-21T09:18:15.0398407Z ld.shared.b32 %r5912, [%r47]; 2026-02-21T09:18:15.0398467Z ld.shared.b32 %r5913, [%r48]; 2026-02-21T09:18:15.0398620Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0398687Z cvt.s8.s32 %rs2017, %r5907; 2026-02-21T09:18:15.0398750Z cvt.rn.f32.s16 %r5914, %rs2017; 2026-02-21T09:18:15.0398810Z cvt.s8.s32 %rs2018, %r5906; 2026-02-21T09:18:15.0398878Z cvt.rn.f32.s16 %r5915, %rs2018; 2026-02-21T09:18:15.0398935Z cvt.s8.s32 %rs2019, %r5911; 2026-02-21T09:18:15.0398994Z cvt.rn.f32.s16 %r5916, %rs2019; 2026-02-21T09:18:15.0399058Z cvt.s8.s32 %rs2020, %r5910; 2026-02-21T09:18:15.0399117Z cvt.rn.f32.s16 %r5917, %rs2020; 2026-02-21T09:18:15.0399174Z cvt.s8.s32 %rs2021, %r5909; 2026-02-21T09:18:15.0399232Z cvt.rn.f32.s16 %r5918, %rs2021; 2026-02-21T09:18:15.0399299Z cvt.s8.s32 %rs2022, %r5908; 2026-02-21T09:18:15.0399360Z cvt.rn.f32.s16 %r5919, %rs2022; 2026-02-21T09:18:15.0399417Z cvt.s8.s32 %rs2023, %r5913; 
2026-02-21T09:18:15.0399482Z cvt.rn.f32.s16 %r5920, %rs2023; 2026-02-21T09:18:15.0399540Z cvt.s8.s32 %rs2024, %r5912; 2026-02-21T09:18:15.0399598Z cvt.rn.f32.s16 %r5921, %rs2024; 2026-02-21T09:18:15.0399663Z prmt.b32 %r5922, %r5907, 0, 0x9991U; 2026-02-21T09:18:15.0399728Z cvt.u16.u32 %rs2025, %r5922; 2026-02-21T09:18:15.0399788Z cvt.rn.f32.s16 %r5923, %rs2025; 2026-02-21T09:18:15.0399851Z prmt.b32 %r5924, %r5906, 0, 0x9991U; 2026-02-21T09:18:15.0399914Z cvt.u16.u32 %rs2026, %r5924; 2026-02-21T09:18:15.0399972Z cvt.rn.f32.s16 %r5925, %rs2026; 2026-02-21T09:18:15.0400034Z prmt.b32 %r5926, %r5911, 0, 0x9991U; 2026-02-21T09:18:15.0400092Z cvt.u16.u32 %rs2027, %r5926; 2026-02-21T09:18:15.0400157Z cvt.rn.f32.s16 %r5927, %rs2027; 2026-02-21T09:18:15.0400217Z prmt.b32 %r5928, %r5910, 0, 0x9991U; 2026-02-21T09:18:15.0400274Z cvt.u16.u32 %rs2028, %r5928; 2026-02-21T09:18:15.0400364Z cvt.rn.f32.s16 %r5929, %rs2028; 2026-02-21T09:18:15.0400426Z prmt.b32 %r5930, %r5909, 0, 0x9991U; 2026-02-21T09:18:15.0400484Z cvt.u16.u32 %rs2029, %r5930; 2026-02-21T09:18:15.0400570Z cvt.rn.f32.s16 %r5931, %rs2029; 2026-02-21T09:18:15.0400632Z prmt.b32 %r5932, %r5908, 0, 0x9991U; 2026-02-21T09:18:15.0400709Z cvt.u16.u32 %rs2030, %r5932; 2026-02-21T09:18:15.0400768Z cvt.rn.f32.s16 %r5933, %rs2030; 2026-02-21T09:18:15.0400875Z prmt.b32 %r5934, %r5913, 0, 0x9991U; 2026-02-21T09:18:15.0400934Z cvt.u16.u32 %rs2031, %r5934; 2026-02-21T09:18:15.0400993Z cvt.rn.f32.s16 %r5935, %rs2031; 2026-02-21T09:18:15.0401064Z prmt.b32 %r5936, %r5912, 0, 0x9991U; 2026-02-21T09:18:15.0401123Z cvt.u16.u32 %rs2032, %r5936; 2026-02-21T09:18:15.0401182Z cvt.rn.f32.s16 %r5937, %rs2032; 2026-02-21T09:18:15.0401244Z prmt.b32 %r5938, %r5907, 0, 0xaaa2U; 2026-02-21T09:18:15.0401315Z cvt.u16.u32 %rs2033, %r5938; 2026-02-21T09:18:15.0401374Z cvt.rn.f32.s16 %r5939, %rs2033; 2026-02-21T09:18:15.0401435Z prmt.b32 %r5940, %r5906, 0, 0xaaa2U; 2026-02-21T09:18:15.0401500Z cvt.u16.u32 %rs2034, %r5940; 2026-02-21T09:18:15.0401584Z 
cvt.rn.f32.s16 %r5941, %rs2034; 2026-02-21T09:18:15.0401648Z prmt.b32 %r5942, %r5911, 0, 0xaaa2U; 2026-02-21T09:18:15.0401713Z cvt.u16.u32 %rs2035, %r5942; 2026-02-21T09:18:15.0401772Z cvt.rn.f32.s16 %r5943, %rs2035; 2026-02-21T09:18:15.0401832Z prmt.b32 %r5944, %r5910, 0, 0xaaa2U; 2026-02-21T09:18:15.0401893Z cvt.u16.u32 %rs2036, %r5944; 2026-02-21T09:18:15.0401958Z cvt.rn.f32.s16 %r5945, %rs2036; 2026-02-21T09:18:15.0402019Z prmt.b32 %r5946, %r5909, 0, 0xaaa2U; 2026-02-21T09:18:15.0402076Z cvt.u16.u32 %rs2037, %r5946; 2026-02-21T09:18:15.0402142Z cvt.rn.f32.s16 %r5947, %rs2037; 2026-02-21T09:18:15.0402202Z prmt.b32 %r5948, %r5908, 0, 0xaaa2U; 2026-02-21T09:18:15.0402260Z cvt.u16.u32 %rs2038, %r5948; 2026-02-21T09:18:15.0402317Z cvt.rn.f32.s16 %r5949, %rs2038; 2026-02-21T09:18:15.0402384Z prmt.b32 %r5950, %r5913, 0, 0xaaa2U; 2026-02-21T09:18:15.0402443Z cvt.u16.u32 %rs2039, %r5950; 2026-02-21T09:18:15.0402502Z cvt.rn.f32.s16 %r5951, %rs2039; 2026-02-21T09:18:15.0402568Z prmt.b32 %r5952, %r5912, 0, 0xaaa2U; 2026-02-21T09:18:15.0402629Z cvt.u16.u32 %rs2040, %r5952; 2026-02-21T09:18:15.0402687Z cvt.rn.f32.s16 %r5953, %rs2040; 2026-02-21T09:18:15.0402747Z prmt.b32 %r5954, %r5907, 0, 0xbbb3U; 2026-02-21T09:18:15.0402813Z cvt.u16.u32 %rs2041, %r5954; 2026-02-21T09:18:15.0402873Z cvt.rn.f32.s16 %r5955, %rs2041; 2026-02-21T09:18:15.0402935Z prmt.b32 %r5956, %r5906, 0, 0xbbb3U; 2026-02-21T09:18:15.0403002Z cvt.u16.u32 %rs2042, %r5956; 2026-02-21T09:18:15.0403061Z cvt.rn.f32.s16 %r5957, %rs2042; 2026-02-21T09:18:15.0403121Z prmt.b32 %r5958, %r5911, 0, 0xbbb3U; 2026-02-21T09:18:15.0403186Z cvt.u16.u32 %rs2043, %r5958; 2026-02-21T09:18:15.0403245Z cvt.rn.f32.s16 %r5959, %rs2043; 2026-02-21T09:18:15.0403305Z prmt.b32 %r5960, %r5910, 0, 0xbbb3U; 2026-02-21T09:18:15.0403364Z cvt.u16.u32 %rs2044, %r5960; 2026-02-21T09:18:15.0403432Z cvt.rn.f32.s16 %r5961, %rs2044; 2026-02-21T09:18:15.0403492Z prmt.b32 %r5962, %r5909, 0, 0xbbb3U; 2026-02-21T09:18:15.0403549Z cvt.u16.u32 %rs2045, 
%r5962; 2026-02-21T09:18:15.0403616Z cvt.rn.f32.s16 %r5963, %rs2045; 2026-02-21T09:18:15.0403676Z prmt.b32 %r5964, %r5908, 0, 0xbbb3U; 2026-02-21T09:18:15.0403733Z cvt.u16.u32 %rs2046, %r5964; 2026-02-21T09:18:15.0403793Z cvt.rn.f32.s16 %r5965, %rs2046; 2026-02-21T09:18:15.0403863Z prmt.b32 %r5966, %r5913, 0, 0xbbb3U; 2026-02-21T09:18:15.0403921Z cvt.u16.u32 %rs2047, %r5966; 2026-02-21T09:18:15.0403979Z cvt.rn.f32.s16 %r5967, %rs2047; 2026-02-21T09:18:15.0404045Z prmt.b32 %r5968, %r5912, 0, 0xbbb3U; 2026-02-21T09:18:15.0404102Z cvt.u16.u32 %rs2048, %r5968; 2026-02-21T09:18:15.0404159Z cvt.rn.f32.s16 %r5969, %rs2048; 2026-02-21T09:18:15.0404218Z bar.sync 0; 2026-02-21T09:18:15.0404315Z st.shared.v4.b32 [%r49], {%r5915, %r5917, %r5914, %r5916}; 2026-02-21T09:18:15.0404410Z st.shared.v4.b32 [%r50], {%r5919, %r5921, %r5918, %r5920}; 2026-02-21T09:18:15.0404541Z st.shared.v4.b32 [%r51], {%r5925, %r5929, %r5923, %r5927}; 2026-02-21T09:18:15.0404637Z st.shared.v4.b32 [%r52], {%r5933, %r5937, %r5931, %r5935}; 2026-02-21T09:18:15.0404754Z st.shared.v4.b32 [%r53], {%r5941, %r5945, %r5939, %r5943}; 2026-02-21T09:18:15.0404842Z st.shared.v4.b32 [%r54], {%r5949, %r5953, %r5947, %r5951}; 2026-02-21T09:18:15.0404980Z st.shared.v4.b32 [%r55], {%r5957, %r5961, %r5955, %r5959}; 2026-02-21T09:18:15.0405091Z st.shared.v4.b32 [%r56], {%r5965, %r5969, %r5963, %r5967}; 2026-02-21T09:18:15.0405144Z $L__tmp335: 2026-02-21T09:18:15.0405360Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0405419Z // begin inline asm 2026-02-21T09:18:15.0405729Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5710, %r5711, %r5712, %r5713, %r5714, %r5715, %r5716, %r5717, %r5718, %r5719, %r5720, %r5721, %r5722, %r5723, %r5724, %r5725}, [%r1205 + 0], 64; 2026-02-21T09:18:15.0405786Z // end inline asm 2026-02-21T09:18:15.0405852Z // begin inline asm 2026-02-21T09:18:15.0406151Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5727, %r5728, %r5729, 
%r5730, %r5731, %r5732, %r5733, %r5734, %r5735, %r5736, %r5737, %r5738, %r5739, %r5740, %r5741, %r5742}, [%r1205 + 16], 64; 2026-02-21T09:18:15.0406207Z // end inline asm 2026-02-21T09:18:15.0406272Z // begin inline asm 2026-02-21T09:18:15.0406571Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5744, %r5745, %r5746, %r5747, %r5748, %r5749, %r5750, %r5751, %r5752, %r5753, %r5754, %r5755, %r5756, %r5757, %r5758, %r5759}, [%r1205 + 32], 64; 2026-02-21T09:18:15.0406626Z // end inline asm 2026-02-21T09:18:15.0406688Z // begin inline asm 2026-02-21T09:18:15.0406982Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r5761, %r5762, %r5763, %r5764, %r5765, %r5766, %r5767, %r5768, %r5769, %r5770, %r5771, %r5772, %r5773, %r5774, %r5775, %r5776}, [%r1205 + 48], 64; 2026-02-21T09:18:15.0407036Z // end inline asm 2026-02-21T09:18:15.0407101Z // begin inline asm 2026-02-21T09:18:15.0407171Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0407225Z // end inline asm 2026-02-21T09:18:15.0407289Z mov.pred %p392, -1; 2026-02-21T09:18:15.0407353Z // begin inline asm 2026-02-21T09:18:15.0407663Z @%p392 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 0], 64, {%r5710, %r5711, %r5712, %r5713, %r5714, %r5715, %r5716, %r5717, %r5718, %r5719, %r5720, %r5721, %r5722, %r5723, %r5724, %r5725}; 2026-02-21T09:18:15.0407719Z // end inline asm 2026-02-21T09:18:15.0407781Z // begin inline asm 2026-02-21T09:18:15.0408083Z @%p392 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 16], 64, {%r5727, %r5728, %r5729, %r5730, %r5731, %r5732, %r5733, %r5734, %r5735, %r5736, %r5737, %r5738, %r5739, %r5740, %r5741, %r5742}; 2026-02-21T09:18:15.0408150Z // end inline asm 2026-02-21T09:18:15.0408217Z // begin inline asm 2026-02-21T09:18:15.0408533Z @%p392 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 32], 64, {%r5744, %r5745, %r5746, %r5747, %r5748, %r5749, %r5750, %r5751, %r5752, %r5753, %r5754, %r5755, %r5756, %r5757, %r5758, %r5759}; 2026-02-21T09:18:15.0408591Z // end inline asm 2026-02-21T09:18:15.0408657Z 
// begin inline asm 2026-02-21T09:18:15.0408981Z @%p392 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r4673 + 48], 64, {%r5761, %r5762, %r5763, %r5764, %r5765, %r5766, %r5767, %r5768, %r5769, %r5770, %r5771, %r5772, %r5773, %r5774, %r5775, %r5776}; 2026-02-21T09:18:15.0409039Z // end inline asm 2026-02-21T09:18:15.0409106Z // begin inline asm 2026-02-21T09:18:15.0409176Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0409233Z // end inline asm 2026-02-21T09:18:15.0409289Z bar.sync 0; 2026-02-21T09:18:15.0409356Z // begin inline asm 2026-02-21T09:18:15.0409672Z @%p392 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r5846 + 0], 16, {%r5847, %r5848, %r5849, %r5850, %r5851, %r5852, %r5853, %r5854, %r5855, %r5856, %r5857, %r5858, %r5859, %r5860, %r5861, %r5862}; 2026-02-21T09:18:15.0409729Z // end inline asm 2026-02-21T09:18:15.0409799Z // begin inline asm 2026-02-21T09:18:15.0409895Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0409954Z // end inline asm 2026-02-21T09:18:15.0410010Z bar.sync 0; 2026-02-21T09:18:15.0410081Z // begin inline asm 2026-02-21T09:18:15.0410182Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0410241Z // end inline asm 2026-02-21T09:18:15.0410310Z // begin inline asm 2026-02-21T09:18:15.0410425Z @%p366 mbarrier.init.shared::cta.b64 [%r5985], 1; 2026-02-21T09:18:15.0410504Z // end inline asm 2026-02-21T09:18:15.0410568Z bar.sync 0; 2026-02-21T09:18:15.0410632Z @%p20 bra $L__BB0_41; 2026-02-21T09:18:15.0410736Z // %bb.40: // in Loop: Header=BB0_33 Depth=2 2026-02-21T09:18:15.0410804Z elect.sync %r5982|%p298, -1; 2026-02-21T09:18:15.0410875Z mov.b32 %r5972, 69208336; 2026-02-21T09:18:15.0410939Z mov.pred %p297, -1; 2026-02-21T09:18:15.0410997Z // begin inline asm 2026-02-21T09:18:15.0411170Z @%p298 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 0 ], %rd1, %r5972, %p297; 2026-02-21T09:18:15.0411228Z // end inline asm 2026-02-21T09:18:15.0411287Z // begin inline asm 2026-02-21T09:18:15.0411452Z @%p298 
tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 8 ], %rd2, %r5972, %p297; 2026-02-21T09:18:15.0411512Z // end inline asm 2026-02-21T09:18:15.0411595Z // begin inline asm 2026-02-21T09:18:15.0411757Z @%p298 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 16 ], %rd3, %r5972, %p297; 2026-02-21T09:18:15.0411825Z // end inline asm 2026-02-21T09:18:15.0411883Z // begin inline asm 2026-02-21T09:18:15.0412039Z @%p298 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 24 ], %rd4, %r5972, %p297; 2026-02-21T09:18:15.0412105Z // end inline asm 2026-02-21T09:18:15.0412170Z cvt.u64.u32 %rd713, %r5985; 2026-02-21T09:18:15.0412228Z // begin inline asm 2026-02-21T09:18:15.0412368Z @%p298 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd713]; 2026-02-21T09:18:15.0412426Z // end inline asm 2026-02-21T09:18:15.0412489Z bra.uni $L__BB0_41; 2026-02-21T09:18:15.0412545Z $L__tmp336: 2026-02-21T09:18:15.0412641Z $L__BB0_43: // %.preheader 2026-02-21T09:18:15.0412822Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:15.0412897Z setp.gt.s32 %p310, %r7847, 4095; 2026-02-21T09:18:15.0412969Z @%p310 bra $L__BB0_56; 2026-02-21T09:18:15.0413055Z // %bb.44: // %.lr.ph760 2026-02-21T09:18:15.0413230Z .loc 1 0 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:0:112 2026-02-21T09:18:15.0413300Z shl.b32 %r6251, %r7826, 7; 2026-02-21T09:18:15.0413366Z or.b32 %r6254, %r6251, %r7827; 2026-02-21T09:18:15.0413430Z xor.b32 %r6255, %r6254, %r7828; 2026-02-21T09:18:15.0413493Z add.s32 %r67, %r144, %r6255; 2026-02-21T09:18:15.0413562Z and.b32 %r6258, %r7829, 3072; 2026-02-21T09:18:15.0413623Z shl.b32 %r6261, %r7831, 3; 2026-02-21T09:18:15.0413685Z and.b32 %r6263, %r7832, 64; 2026-02-21T09:18:15.0413757Z or.b32 %r6264, %r7830, %r6261; 2026-02-21T09:18:15.0413819Z xor.b32 %r6265, %r6264, %r6263; 2026-02-21T09:18:15.0413879Z or.b32 %r6266, %r6265, %r6258; 2026-02-21T09:18:15.0413942Z add.s32 %r68, 
%r144, %r6266; 2026-02-21T09:18:15.0414010Z xor.b32 %r6267, %r6266, 32; 2026-02-21T09:18:15.0414072Z add.s32 %r69, %r144, %r6267; 2026-02-21T09:18:15.0414134Z shr.u32 %r6269, %r7826, 1; 2026-02-21T09:18:15.0414203Z or.b32 %r6271, %r6269, %r7834; 2026-02-21T09:18:15.0414264Z or.b32 %r6272, %r6271, %r7833; 2026-02-21T09:18:15.0414324Z add.s32 %r70, %r144, %r6272; 2026-02-21T09:18:15.0414383Z xor.b32 %r6273, %r6272, 4; 2026-02-21T09:18:15.0414449Z add.s32 %r71, %r144, %r6273; 2026-02-21T09:18:15.0414508Z xor.b32 %r6274, %r6272, 8; 2026-02-21T09:18:15.0414568Z add.s32 %r72, %r144, %r6274; 2026-02-21T09:18:15.0414636Z xor.b32 %r6275, %r6272, 12; 2026-02-21T09:18:15.0414695Z add.s32 %r73, %r144, %r6275; 2026-02-21T09:18:15.0414755Z xor.b32 %r6276, %r6272, 32; 2026-02-21T09:18:15.0414855Z add.s32 %r74, %r144, %r6276; 2026-02-21T09:18:15.0414914Z xor.b32 %r6277, %r6272, 36; 2026-02-21T09:18:15.0414973Z add.s32 %r75, %r144, %r6277; 2026-02-21T09:18:15.0415063Z xor.b32 %r6278, %r6272, 40; 2026-02-21T09:18:15.0415131Z add.s32 %r76, %r144, %r6278; 2026-02-21T09:18:15.0415191Z xor.b32 %r6279, %r6272, 44; 2026-02-21T09:18:15.0415274Z add.s32 %r77, %r144, %r6279; 2026-02-21T09:18:15.0415370Z xor.b32 %r6280, %r6272, 64; 2026-02-21T09:18:15.0415430Z add.s32 %r78, %r144, %r6280; 2026-02-21T09:18:15.0415489Z xor.b32 %r6281, %r6272, 68; 2026-02-21T09:18:15.0415549Z add.s32 %r79, %r144, %r6281; 2026-02-21T09:18:15.0415615Z xor.b32 %r6282, %r6272, 72; 2026-02-21T09:18:15.0415673Z add.s32 %r80, %r144, %r6282; 2026-02-21T09:18:15.0415732Z xor.b32 %r6283, %r6272, 76; 2026-02-21T09:18:15.0415798Z add.s32 %r81, %r144, %r6283; 2026-02-21T09:18:15.0415857Z xor.b32 %r6284, %r6272, 96; 2026-02-21T09:18:15.0415916Z add.s32 %r82, %r144, %r6284; 2026-02-21T09:18:15.0415979Z xor.b32 %r6285, %r6272, 100; 2026-02-21T09:18:15.0416054Z add.s32 %r83, %r144, %r6285; 2026-02-21T09:18:15.0416111Z xor.b32 %r6286, %r6272, 104; 2026-02-21T09:18:15.0416170Z add.s32 %r84, %r144, %r6286; 
2026-02-21T09:18:15.0416232Z xor.b32 %r6287, %r6272, 108; 2026-02-21T09:18:15.0416289Z add.s32 %r85, %r144, %r6287; 2026-02-21T09:18:15.0416352Z mul.lo.s32 %r6290, %r7835, 136; 2026-02-21T09:18:15.0416418Z or.b32 %r6291, %r7836, %r8; 2026-02-21T09:18:15.0416475Z xor.b32 %r6292, %r6290, %r6291; 2026-02-21T09:18:15.0416531Z add.s32 %r86, %r144, %r6292; 2026-02-21T09:18:15.0416588Z xor.b32 %r6293, %r6292, 132; 2026-02-21T09:18:15.0416650Z add.s32 %r87, %r144, %r6293; 2026-02-21T09:18:15.0416705Z xor.b32 %r6294, %r6292, 264; 2026-02-21T09:18:15.0416762Z add.s32 %r88, %r144, %r6294; 2026-02-21T09:18:15.0416824Z xor.b32 %r6295, %r6292, 396; 2026-02-21T09:18:15.0416879Z add.s32 %r89, %r144, %r6295; 2026-02-21T09:18:15.0416939Z and.b32 %r6297, %r7837, 16256; 2026-02-21T09:18:15.0416998Z or.b32 %r6298, %r6297, %r18; 2026-02-21T09:18:15.0417063Z add.s32 %r90, %r144, %r6298; 2026-02-21T09:18:15.0417121Z xor.b32 %r6299, %r6298, 16; 2026-02-21T09:18:15.0417180Z add.s32 %r91, %r144, %r6299; 2026-02-21T09:18:15.0417245Z xor.b32 %r6300, %r6298, 32; 2026-02-21T09:18:15.0417304Z add.s32 %r92, %r144, %r6300; 2026-02-21T09:18:15.0417361Z xor.b32 %r6301, %r6298, 48; 2026-02-21T09:18:15.0417419Z add.s32 %r93, %r144, %r6301; 2026-02-21T09:18:15.0417484Z xor.b32 %r6302, %r6298, 64; 2026-02-21T09:18:15.0417540Z add.s32 %r94, %r144, %r6302; 2026-02-21T09:18:15.0417597Z xor.b32 %r6303, %r6298, 80; 2026-02-21T09:18:15.0417660Z add.s32 %r95, %r144, %r6303; 2026-02-21T09:18:15.0417716Z xor.b32 %r6304, %r6298, 96; 2026-02-21T09:18:15.0417772Z add.s32 %r96, %r144, %r6304; 2026-02-21T09:18:15.0417827Z xor.b32 %r6305, %r6298, 112; 2026-02-21T09:18:15.0417889Z add.s32 %r97, %r144, %r6305; 2026-02-21T09:18:15.0417948Z bfe.u32 %r6306, %r144, 4, 14; 2026-02-21T09:18:15.0418009Z cvt.u64.u32 %rd850, %r6306; 2026-02-21T09:18:15.0418085Z or.b64 %rd6, %rd850, 4611686293372403712; 2026-02-21T09:18:15.0418143Z add.s32 %r6307, %r144, 32; 2026-02-21T09:18:15.0418201Z bfe.u32 %r6308, %r6307, 4, 14; 
2026-02-21T09:18:15.0418269Z cvt.u64.u32 %rd851, %r6308; 2026-02-21T09:18:15.0418338Z or.b64 %rd7, %rd851, 4611686293372403712; 2026-02-21T09:18:15.0418396Z add.s32 %r6309, %r144, 64; 2026-02-21T09:18:15.0418456Z bfe.u32 %r6310, %r6309, 4, 14; 2026-02-21T09:18:15.0418521Z cvt.u64.u32 %rd852, %r6310; 2026-02-21T09:18:15.0418588Z or.b64 %rd8, %rd852, 4611686293372403712; 2026-02-21T09:18:15.0418645Z add.s32 %r6311, %r144, 96; 2026-02-21T09:18:15.0418709Z bfe.u32 %r6312, %r6311, 4, 14; 2026-02-21T09:18:15.0418767Z cvt.u64.u32 %rd853, %r6312; 2026-02-21T09:18:15.0418833Z or.b64 %rd9, %rd853, 4611686293372403712; 2026-02-21T09:18:15.0418893Z and.b32 %r6314, %r7840, 3168; 2026-02-21T09:18:15.0418958Z and.b32 %r6315, %r17, 384; 2026-02-21T09:18:15.0419040Z and.b32 %r6316, %r7832, 16; 2026-02-21T09:18:15.0419099Z or.b32 %r6317, %r6314, %r6315; 2026-02-21T09:18:15.0419165Z xor.b32 %r6318, %r6317, %r7831; 2026-02-21T09:18:15.0419224Z add.s32 %r6319, %r144, %r6316; 2026-02-21T09:18:15.0419304Z add.s32 %r7633, %r6319, %r6318; 2026-02-21T09:18:15.0419361Z add.s32 %r7638, %r7633, 512; 2026-02-21T09:18:15.0419580Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:15.0419650Z mad.wide.u32 %rd10, %r22, 16, %rd55; 2026-02-21T09:18:15.0419718Z mad.wide.u32 %rd11, %r21, 8192, %rd56; 2026-02-21T09:18:15.0419783Z bra.uni $L__BB0_45; 2026-02-21T09:18:15.0419885Z $L__BB0_55: // in Loop: Header=BB0_45 Depth=1 2026-02-21T09:18:15.0420046Z .loc 1 31 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:31:32 2026-02-21T09:18:15.0420112Z or.b32 %r7701, %r134, %r9; 2026-02-21T09:18:15.0420170Z or.b32 %r7702, %r134, %r10; 2026-02-21T09:18:15.0420228Z or.b32 %r7703, %r134, %r11; 2026-02-21T09:18:15.0420284Z or.b32 %r7704, %r134, %r12; 2026-02-21T09:18:15.0420348Z or.b32 %r7705, %r134, %r13; 2026-02-21T09:18:15.0420404Z or.b32 %r7706, %r134, %r14; 2026-02-21T09:18:15.0420459Z or.b32 %r7707, %r134, %r15; 2026-02-21T09:18:15.0420522Z or.b32 
%r7708, %r134, %r16; 2026-02-21T09:18:15.0420685Z .loc 1 33 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:33:32 2026-02-21T09:18:15.0420740Z or.b32 %r7709, %r135, %r20; 2026-02-21T09:18:15.0420905Z .loc 1 88 43 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:43 2026-02-21T09:18:15.0420961Z shl.b32 %r7710, %r7701, 13; 2026-02-21T09:18:15.0421016Z shl.b32 %r7711, %r7702, 13; 2026-02-21T09:18:15.0421072Z shl.b32 %r7712, %r7703, 13; 2026-02-21T09:18:15.0421135Z shl.b32 %r7713, %r7704, 13; 2026-02-21T09:18:15.0421189Z shl.b32 %r7714, %r7705, 13; 2026-02-21T09:18:15.0421244Z shl.b32 %r7715, %r7706, 13; 2026-02-21T09:18:15.0421307Z shl.b32 %r7716, %r7707, 13; 2026-02-21T09:18:15.0421363Z shl.b32 %r7717, %r7708, 13; 2026-02-21T09:18:15.0421522Z .loc 1 88 50 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:50 2026-02-21T09:18:15.0421643Z add.s32 %r7718, %r7710, %r7709; 2026-02-21T09:18:15.0421704Z add.s32 %r7719, %r7711, %r7709; 2026-02-21T09:18:15.0421761Z add.s32 %r7720, %r7712, %r7709; 2026-02-21T09:18:15.0421818Z add.s32 %r7721, %r7713, %r7709; 2026-02-21T09:18:15.0421882Z add.s32 %r7722, %r7714, %r7709; 2026-02-21T09:18:15.0421937Z add.s32 %r7723, %r7715, %r7709; 2026-02-21T09:18:15.0421994Z add.s32 %r7724, %r7716, %r7709; 2026-02-21T09:18:15.0422057Z add.s32 %r7725, %r7717, %r7709; 2026-02-21T09:18:15.0422215Z .loc 1 88 22 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:22 2026-02-21T09:18:15.0422284Z mad.wide.s32 %rd912, %r7718, 2, %rd57; 2026-02-21T09:18:15.0422351Z mad.wide.s32 %rd913, %r7719, 2, %rd57; 2026-02-21T09:18:15.0422423Z mad.wide.s32 %rd914, %r7720, 2, %rd57; 2026-02-21T09:18:15.0422486Z mad.wide.s32 %rd915, %r7721, 2, %rd57; 2026-02-21T09:18:15.0422549Z mad.wide.s32 %rd916, %r7722, 2, %rd57; 2026-02-21T09:18:15.0422621Z mad.wide.s32 %rd917, %r7723, 2, %rd57; 2026-02-21T09:18:15.0422685Z mad.wide.s32 %rd918, %r7724, 2, %rd57; 2026-02-21T09:18:15.0422750Z mad.wide.s32 %rd919, %r7725, 2, %rd57; 
2026-02-21T09:18:15.0422811Z $L__tmp337: 2026-02-21T09:18:15.0423021Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0423081Z // begin inline asm 2026-02-21T09:18:15.0423390Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7561, %r7562, %r7563, %r7564, %r7565, %r7566, %r7567, %r7568, %r7569, %r7570, %r7571, %r7572, %r7573, %r7574, %r7575, %r7576}, [%r7628 + 0], 64; 2026-02-21T09:18:15.0423454Z // end inline asm 2026-02-21T09:18:15.0423513Z // begin inline asm 2026-02-21T09:18:15.0423814Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7578, %r7579, %r7580, %r7581, %r7582, %r7583, %r7584, %r7585, %r7586, %r7587, %r7588, %r7589, %r7590, %r7591, %r7592, %r7593}, [%r7628 + 16], 64; 2026-02-21T09:18:15.0423945Z // end inline asm 2026-02-21T09:18:15.0424003Z // begin inline asm 2026-02-21T09:18:15.0424348Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7595, %r7596, %r7597, %r7598, %r7599, %r7600, %r7601, %r7602, %r7603, %r7604, %r7605, %r7606, %r7607, %r7608, %r7609, %r7610}, [%r7628 + 32], 64; 2026-02-21T09:18:15.0424411Z // end inline asm 2026-02-21T09:18:15.0424467Z // begin inline asm 2026-02-21T09:18:15.0424763Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7612, %r7613, %r7614, %r7615, %r7616, %r7617, %r7618, %r7619, %r7620, %r7621, %r7622, %r7623, %r7624, %r7625, %r7626, %r7627}, [%r7628 + 48], 64; 2026-02-21T09:18:15.0424825Z // end inline asm 2026-02-21T09:18:15.0424882Z // begin inline asm 2026-02-21T09:18:15.0424951Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0425015Z // end inline asm 2026-02-21T09:18:15.0425077Z cvt.u64.u32 %rd920, %r7561; 2026-02-21T09:18:15.0425137Z cvt.u64.u32 %rd921, %r7562; 2026-02-21T09:18:15.0425196Z shl.b64 %rd922, %rd921, 32; 2026-02-21T09:18:15.0425268Z or.b64 %rd923, %rd920, %rd922; 2026-02-21T09:18:15.0425322Z $L__tmp338: 2026-02-21T09:18:15.0425491Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0425563Z 
mov.b64 {%r7726, %r7727}, %rd923; 2026-02-21T09:18:15.0425637Z cvt.rn.bf16x2.f32 %r7728, %r7727, %r7726; 2026-02-21T09:18:15.0425692Z $L__tmp339: 2026-02-21T09:18:15.0425904Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0425976Z cvt.u64.u32 %rd924, %r7563; 2026-02-21T09:18:15.0426037Z cvt.u64.u32 %rd925, %r7564; 2026-02-21T09:18:15.0426097Z shl.b64 %rd926, %rd925, 32; 2026-02-21T09:18:15.0426171Z or.b64 %rd927, %rd924, %rd926; 2026-02-21T09:18:15.0426223Z $L__tmp340: 2026-02-21T09:18:15.0426386Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0426455Z mov.b64 {%r7729, %r7730}, %rd927; 2026-02-21T09:18:15.0426527Z cvt.rn.bf16x2.f32 %r7731, %r7730, %r7729; 2026-02-21T09:18:15.0426580Z $L__tmp341: 2026-02-21T09:18:15.0426787Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0426855Z cvt.u64.u32 %rd928, %r7565; 2026-02-21T09:18:15.0426912Z cvt.u64.u32 %rd929, %r7566; 2026-02-21T09:18:15.0426969Z shl.b64 %rd930, %rd929, 32; 2026-02-21T09:18:15.0427035Z or.b64 %rd931, %rd928, %rd930; 2026-02-21T09:18:15.0427086Z $L__tmp342: 2026-02-21T09:18:15.0427243Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0427312Z mov.b64 {%r7732, %r7733}, %rd931; 2026-02-21T09:18:15.0427383Z cvt.rn.bf16x2.f32 %r7734, %r7733, %r7732; 2026-02-21T09:18:15.0427436Z $L__tmp343: 2026-02-21T09:18:15.0427637Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0427704Z cvt.u64.u32 %rd932, %r7567; 2026-02-21T09:18:15.0427762Z cvt.u64.u32 %rd933, %r7568; 2026-02-21T09:18:15.0427819Z shl.b64 %rd934, %rd933, 32; 2026-02-21T09:18:15.0427886Z or.b64 %rd935, %rd932, %rd934; 2026-02-21T09:18:15.0427940Z $L__tmp344: 2026-02-21T09:18:15.0428101Z .loc 1 87 28 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0428162Z mov.b64 {%r7735, %r7736}, %rd935; 2026-02-21T09:18:15.0428238Z cvt.rn.bf16x2.f32 %r7737, %r7736, %r7735; 2026-02-21T09:18:15.0428289Z $L__tmp345: 2026-02-21T09:18:15.0428490Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0428556Z cvt.u64.u32 %rd936, %r7569; 2026-02-21T09:18:15.0428637Z cvt.u64.u32 %rd937, %r7570; 2026-02-21T09:18:15.0428695Z shl.b64 %rd938, %rd937, 32; 2026-02-21T09:18:15.0428761Z or.b64 %rd939, %rd936, %rd938; 2026-02-21T09:18:15.0428814Z $L__tmp346: 2026-02-21T09:18:15.0428994Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0429078Z mov.b64 {%r7738, %r7739}, %rd939; 2026-02-21T09:18:15.0429155Z cvt.rn.bf16x2.f32 %r7740, %r7739, %r7738; 2026-02-21T09:18:15.0429226Z $L__tmp347: 2026-02-21T09:18:15.0429428Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0429493Z cvt.u64.u32 %rd940, %r7571; 2026-02-21T09:18:15.0429548Z cvt.u64.u32 %rd941, %r7572; 2026-02-21T09:18:15.0429605Z shl.b64 %rd942, %rd941, 32; 2026-02-21T09:18:15.0429669Z or.b64 %rd943, %rd940, %rd942; 2026-02-21T09:18:15.0429720Z $L__tmp348: 2026-02-21T09:18:15.0429882Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0429944Z mov.b64 {%r7741, %r7742}, %rd943; 2026-02-21T09:18:15.0430018Z cvt.rn.bf16x2.f32 %r7743, %r7742, %r7741; 2026-02-21T09:18:15.0430071Z $L__tmp349: 2026-02-21T09:18:15.0430270Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0430335Z cvt.u64.u32 %rd944, %r7573; 2026-02-21T09:18:15.0430392Z cvt.u64.u32 %rd945, %r7574; 2026-02-21T09:18:15.0430449Z shl.b64 %rd946, %rd945, 32; 2026-02-21T09:18:15.0430513Z or.b64 %rd947, %rd944, %rd946; 
2026-02-21T09:18:15.0430564Z $L__tmp350: 2026-02-21T09:18:15.0430722Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0430782Z mov.b64 {%r7744, %r7745}, %rd947; 2026-02-21T09:18:15.0430856Z cvt.rn.bf16x2.f32 %r7746, %r7745, %r7744; 2026-02-21T09:18:15.0430907Z $L__tmp351: 2026-02-21T09:18:15.0431105Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0431172Z cvt.u64.u32 %rd948, %r7575; 2026-02-21T09:18:15.0431229Z cvt.u64.u32 %rd949, %r7576; 2026-02-21T09:18:15.0431287Z shl.b64 %rd950, %rd949, 32; 2026-02-21T09:18:15.0431352Z or.b64 %rd951, %rd948, %rd950; 2026-02-21T09:18:15.0431404Z $L__tmp352: 2026-02-21T09:18:15.0431586Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0431647Z mov.b64 {%r7747, %r7748}, %rd951; 2026-02-21T09:18:15.0431721Z cvt.rn.bf16x2.f32 %r7749, %r7748, %r7747; 2026-02-21T09:18:15.0431773Z $L__tmp353: 2026-02-21T09:18:15.0431974Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0432040Z cvt.u64.u32 %rd952, %r7578; 2026-02-21T09:18:15.0432097Z cvt.u64.u32 %rd953, %r7579; 2026-02-21T09:18:15.0432155Z shl.b64 %rd954, %rd953, 32; 2026-02-21T09:18:15.0432213Z or.b64 %rd955, %rd952, %rd954; 2026-02-21T09:18:15.0432274Z $L__tmp354: 2026-02-21T09:18:15.0432431Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0432491Z mov.b64 {%r7750, %r7751}, %rd955; 2026-02-21T09:18:15.0432570Z cvt.rn.bf16x2.f32 %r7752, %r7751, %r7750; 2026-02-21T09:18:15.0432622Z $L__tmp355: 2026-02-21T09:18:15.0432826Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0432890Z cvt.u64.u32 %rd956, %r7580; 2026-02-21T09:18:15.0432946Z cvt.u64.u32 %rd957, %r7581; 2026-02-21T09:18:15.0433005Z shl.b64 %rd958, %rd957, 32; 
2026-02-21T09:18:15.0433064Z or.b64 %rd959, %rd956, %rd958; 2026-02-21T09:18:15.0433124Z $L__tmp356: 2026-02-21T09:18:15.0433283Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0433373Z mov.b64 {%r7753, %r7754}, %rd959; 2026-02-21T09:18:15.0433450Z cvt.rn.bf16x2.f32 %r7755, %r7754, %r7753; 2026-02-21T09:18:15.0433503Z $L__tmp357: 2026-02-21T09:18:15.0433705Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0433796Z cvt.u64.u32 %rd960, %r7582; 2026-02-21T09:18:15.0433879Z cvt.u64.u32 %rd961, %r7583; 2026-02-21T09:18:15.0433965Z shl.b64 %rd962, %rd961, 32; 2026-02-21T09:18:15.0434032Z or.b64 %rd963, %rd960, %rd962; 2026-02-21T09:18:15.0434092Z $L__tmp358: 2026-02-21T09:18:15.0434246Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0434306Z mov.b64 {%r7756, %r7757}, %rd963; 2026-02-21T09:18:15.0434380Z cvt.rn.bf16x2.f32 %r7758, %r7757, %r7756; 2026-02-21T09:18:15.0434431Z $L__tmp359: 2026-02-21T09:18:15.0434629Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0434694Z cvt.u64.u32 %rd964, %r7584; 2026-02-21T09:18:15.0434749Z cvt.u64.u32 %rd965, %r7585; 2026-02-21T09:18:15.0434807Z shl.b64 %rd966, %rd965, 32; 2026-02-21T09:18:15.0434867Z or.b64 %rd967, %rd964, %rd966; 2026-02-21T09:18:15.0434926Z $L__tmp360: 2026-02-21T09:18:15.0435083Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0435143Z mov.b64 {%r7759, %r7760}, %rd967; 2026-02-21T09:18:15.0435219Z cvt.rn.bf16x2.f32 %r7761, %r7760, %r7759; 2026-02-21T09:18:15.0435270Z $L__tmp361: 2026-02-21T09:18:15.0435467Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0435532Z cvt.u64.u32 %rd968, %r7586; 2026-02-21T09:18:15.0435589Z cvt.u64.u32 %rd969, 
%r7587; 2026-02-21T09:18:15.0435646Z shl.b64 %rd970, %rd969, 32; 2026-02-21T09:18:15.0435707Z or.b64 %rd971, %rd968, %rd970; 2026-02-21T09:18:15.0435767Z $L__tmp362: 2026-02-21T09:18:15.0435921Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0435980Z mov.b64 {%r7762, %r7763}, %rd971; 2026-02-21T09:18:15.0436056Z cvt.rn.bf16x2.f32 %r7764, %r7763, %r7762; 2026-02-21T09:18:15.0436108Z $L__tmp363: 2026-02-21T09:18:15.0436305Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0436371Z cvt.u64.u32 %rd972, %r7588; 2026-02-21T09:18:15.0436428Z cvt.u64.u32 %rd973, %r7589; 2026-02-21T09:18:15.0436485Z shl.b64 %rd974, %rd973, 32; 2026-02-21T09:18:15.0436543Z or.b64 %rd975, %rd972, %rd974; 2026-02-21T09:18:15.0436602Z $L__tmp364: 2026-02-21T09:18:15.0436756Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0436815Z mov.b64 {%r7765, %r7766}, %rd975; 2026-02-21T09:18:15.0436889Z cvt.rn.bf16x2.f32 %r7767, %r7766, %r7765; 2026-02-21T09:18:15.0436942Z $L__tmp365: 2026-02-21T09:18:15.0437136Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0437196Z cvt.u64.u32 %rd976, %r7590; 2026-02-21T09:18:15.0437261Z cvt.u64.u32 %rd977, %r7591; 2026-02-21T09:18:15.0437319Z shl.b64 %rd978, %rd977, 32; 2026-02-21T09:18:15.0437378Z or.b64 %rd979, %rd976, %rd978; 2026-02-21T09:18:15.0437437Z $L__tmp366: 2026-02-21T09:18:15.0437593Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0437652Z mov.b64 {%r7768, %r7769}, %rd979; 2026-02-21T09:18:15.0437725Z cvt.rn.bf16x2.f32 %r7770, %r7769, %r7768; 2026-02-21T09:18:15.0437776Z $L__tmp367: 2026-02-21T09:18:15.0437971Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0438030Z cvt.u64.u32 
%rd980, %r7592; 2026-02-21T09:18:15.0438119Z cvt.u64.u32 %rd981, %r7593; 2026-02-21T09:18:15.0438176Z shl.b64 %rd982, %rd981, 32; 2026-02-21T09:18:15.0438235Z or.b64 %rd983, %rd980, %rd982; 2026-02-21T09:18:15.0438313Z $L__tmp368: 2026-02-21T09:18:15.0438472Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0438550Z mov.b64 {%r7771, %r7772}, %rd983; 2026-02-21T09:18:15.0438649Z cvt.rn.bf16x2.f32 %r7773, %r7772, %r7771; 2026-02-21T09:18:15.0438701Z $L__tmp369: 2026-02-21T09:18:15.0438899Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0438957Z cvt.u64.u32 %rd984, %r7595; 2026-02-21T09:18:15.0439022Z cvt.u64.u32 %rd985, %r7596; 2026-02-21T09:18:15.0439080Z shl.b64 %rd986, %rd985, 32; 2026-02-21T09:18:15.0439139Z or.b64 %rd987, %rd984, %rd986; 2026-02-21T09:18:15.0439196Z $L__tmp370: 2026-02-21T09:18:15.0439355Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0439416Z mov.b64 {%r7774, %r7775}, %rd987; 2026-02-21T09:18:15.0439488Z cvt.rn.bf16x2.f32 %r7776, %r7775, %r7774; 2026-02-21T09:18:15.0439542Z $L__tmp371: 2026-02-21T09:18:15.0439747Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0439805Z cvt.u64.u32 %rd988, %r7597; 2026-02-21T09:18:15.0439870Z cvt.u64.u32 %rd989, %r7598; 2026-02-21T09:18:15.0439928Z shl.b64 %rd990, %rd989, 32; 2026-02-21T09:18:15.0439987Z or.b64 %rd991, %rd988, %rd990; 2026-02-21T09:18:15.0440046Z $L__tmp372: 2026-02-21T09:18:15.0440200Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0440259Z mov.b64 {%r7777, %r7778}, %rd991; 2026-02-21T09:18:15.0440325Z cvt.rn.bf16x2.f32 %r7779, %r7778, %r7777; 2026-02-21T09:18:15.0440383Z $L__tmp373: 2026-02-21T09:18:15.0440588Z .loc 2 291 36 // standard.py:291:36 @[ 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0440646Z cvt.u64.u32 %rd992, %r7599; 2026-02-21T09:18:15.0440712Z cvt.u64.u32 %rd993, %r7600; 2026-02-21T09:18:15.0440768Z shl.b64 %rd994, %rd993, 32; 2026-02-21T09:18:15.0440829Z or.b64 %rd995, %rd992, %rd994; 2026-02-21T09:18:15.0440888Z $L__tmp374: 2026-02-21T09:18:15.0441047Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0441106Z mov.b64 {%r7780, %r7781}, %rd995; 2026-02-21T09:18:15.0441174Z cvt.rn.bf16x2.f32 %r7782, %r7781, %r7780; 2026-02-21T09:18:15.0441234Z $L__tmp375: 2026-02-21T09:18:15.0441432Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0441490Z cvt.u64.u32 %rd996, %r7601; 2026-02-21T09:18:15.0441580Z cvt.u64.u32 %rd997, %r7602; 2026-02-21T09:18:15.0441640Z shl.b64 %rd998, %rd997, 32; 2026-02-21T09:18:15.0441700Z or.b64 %rd999, %rd996, %rd998; 2026-02-21T09:18:15.0441758Z $L__tmp376: 2026-02-21T09:18:15.0441917Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0441977Z mov.b64 {%r7783, %r7784}, %rd999; 2026-02-21T09:18:15.0442046Z cvt.rn.bf16x2.f32 %r7785, %r7784, %r7783; 2026-02-21T09:18:15.0442109Z $L__tmp377: 2026-02-21T09:18:15.0442312Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0442375Z cvt.u64.u32 %rd1000, %r7603; 2026-02-21T09:18:15.0442445Z cvt.u64.u32 %rd1001, %r7604; 2026-02-21T09:18:15.0442506Z shl.b64 %rd1002, %rd1001, 32; 2026-02-21T09:18:15.0442566Z or.b64 %rd1003, %rd1000, %rd1002; 2026-02-21T09:18:15.0442626Z $L__tmp378: 2026-02-21T09:18:15.0442783Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0442894Z mov.b64 {%r7786, %r7787}, %rd1003; 2026-02-21T09:18:15.0442966Z cvt.rn.bf16x2.f32 %r7788, %r7787, %r7786; 2026-02-21T09:18:15.0443035Z $L__tmp379: 
2026-02-21T09:18:15.0443263Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0443352Z cvt.u64.u32 %rd1004, %r7605; 2026-02-21T09:18:15.0443420Z cvt.u64.u32 %rd1005, %r7606; 2026-02-21T09:18:15.0443504Z shl.b64 %rd1006, %rd1005, 32; 2026-02-21T09:18:15.0443565Z or.b64 %rd1007, %rd1004, %rd1006; 2026-02-21T09:18:15.0443616Z $L__tmp380: 2026-02-21T09:18:15.0443782Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0443843Z mov.b64 {%r7789, %r7790}, %rd1007; 2026-02-21T09:18:15.0443911Z cvt.rn.bf16x2.f32 %r7791, %r7790, %r7789; 2026-02-21T09:18:15.0443969Z $L__tmp381: 2026-02-21T09:18:15.0444170Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0444230Z cvt.u64.u32 %rd1008, %r7607; 2026-02-21T09:18:15.0444295Z cvt.u64.u32 %rd1009, %r7608; 2026-02-21T09:18:15.0444356Z shl.b64 %rd1010, %rd1009, 32; 2026-02-21T09:18:15.0444414Z or.b64 %rd1011, %rd1008, %rd1010; 2026-02-21T09:18:15.0444465Z $L__tmp382: 2026-02-21T09:18:15.0444635Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0444697Z mov.b64 {%r7792, %r7793}, %rd1011; 2026-02-21T09:18:15.0444764Z cvt.rn.bf16x2.f32 %r7794, %r7793, %r7792; 2026-02-21T09:18:15.0444825Z $L__tmp383: 2026-02-21T09:18:15.0445025Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0445084Z cvt.u64.u32 %rd1012, %r7609; 2026-02-21T09:18:15.0445148Z cvt.u64.u32 %rd1013, %r7610; 2026-02-21T09:18:15.0445207Z shl.b64 %rd1014, %rd1013, 32; 2026-02-21T09:18:15.0445267Z or.b64 %rd1015, %rd1012, %rd1014; 2026-02-21T09:18:15.0445319Z $L__tmp384: 2026-02-21T09:18:15.0445486Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0445547Z mov.b64 {%r7795, %r7796}, %rd1015; 2026-02-21T09:18:15.0445616Z 
cvt.rn.bf16x2.f32 %r7797, %r7796, %r7795; 2026-02-21T09:18:15.0445677Z $L__tmp385: 2026-02-21T09:18:15.0445882Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0445941Z cvt.u64.u32 %rd1016, %r7612; 2026-02-21T09:18:15.0446006Z cvt.u64.u32 %rd1017, %r7613; 2026-02-21T09:18:15.0446066Z shl.b64 %rd1018, %rd1017, 32; 2026-02-21T09:18:15.0446124Z or.b64 %rd1019, %rd1016, %rd1018; 2026-02-21T09:18:15.0446174Z $L__tmp386: 2026-02-21T09:18:15.0446339Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0446399Z mov.b64 {%r7798, %r7799}, %rd1019; 2026-02-21T09:18:15.0446469Z cvt.rn.bf16x2.f32 %r7800, %r7799, %r7798; 2026-02-21T09:18:15.0446526Z $L__tmp387: 2026-02-21T09:18:15.0446728Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0446788Z cvt.u64.u32 %rd1020, %r7614; 2026-02-21T09:18:15.0446852Z cvt.u64.u32 %rd1021, %r7615; 2026-02-21T09:18:15.0446910Z shl.b64 %rd1022, %rd1021, 32; 2026-02-21T09:18:15.0446968Z or.b64 %rd1023, %rd1020, %rd1022; 2026-02-21T09:18:15.0447020Z $L__tmp388: 2026-02-21T09:18:15.0447189Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0447247Z mov.b64 {%r7801, %r7802}, %rd1023; 2026-02-21T09:18:15.0447314Z cvt.rn.bf16x2.f32 %r7803, %r7802, %r7801; 2026-02-21T09:18:15.0447370Z $L__tmp389: 2026-02-21T09:18:15.0447571Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0447650Z cvt.u64.u32 %rd1024, %r7616; 2026-02-21T09:18:15.0447714Z cvt.u64.u32 %rd1025, %r7617; 2026-02-21T09:18:15.0447772Z shl.b64 %rd1026, %rd1025, 32; 2026-02-21T09:18:15.0447853Z or.b64 %rd1027, %rd1024, %rd1026; 2026-02-21T09:18:15.0447905Z $L__tmp390: 2026-02-21T09:18:15.0448093Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 
2026-02-21T09:18:15.0448175Z mov.b64 {%r7804, %r7805}, %rd1027; 2026-02-21T09:18:15.0448243Z cvt.rn.bf16x2.f32 %r7806, %r7805, %r7804; 2026-02-21T09:18:15.0448300Z $L__tmp391: 2026-02-21T09:18:15.0448500Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0448559Z cvt.u64.u32 %rd1028, %r7618; 2026-02-21T09:18:15.0448615Z cvt.u64.u32 %rd1029, %r7619; 2026-02-21T09:18:15.0448681Z shl.b64 %rd1030, %rd1029, 32; 2026-02-21T09:18:15.0448741Z or.b64 %rd1031, %rd1028, %rd1030; 2026-02-21T09:18:15.0448793Z $L__tmp392: 2026-02-21T09:18:15.0448958Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0449017Z mov.b64 {%r7807, %r7808}, %rd1031; 2026-02-21T09:18:15.0449088Z cvt.rn.bf16x2.f32 %r7809, %r7808, %r7807; 2026-02-21T09:18:15.0449146Z $L__tmp393: 2026-02-21T09:18:15.0449349Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0449407Z cvt.u64.u32 %rd1032, %r7620; 2026-02-21T09:18:15.0449464Z cvt.u64.u32 %rd1033, %r7621; 2026-02-21T09:18:15.0449531Z shl.b64 %rd1034, %rd1033, 32; 2026-02-21T09:18:15.0449590Z or.b64 %rd1035, %rd1032, %rd1034; 2026-02-21T09:18:15.0449642Z $L__tmp394: 2026-02-21T09:18:15.0449810Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0449870Z mov.b64 {%r7810, %r7811}, %rd1035; 2026-02-21T09:18:15.0449939Z cvt.rn.bf16x2.f32 %r7812, %r7811, %r7810; 2026-02-21T09:18:15.0449997Z $L__tmp395: 2026-02-21T09:18:15.0450203Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0450262Z cvt.u64.u32 %rd1036, %r7622; 2026-02-21T09:18:15.0450321Z cvt.u64.u32 %rd1037, %r7623; 2026-02-21T09:18:15.0450389Z shl.b64 %rd1038, %rd1037, 32; 2026-02-21T09:18:15.0450450Z or.b64 %rd1039, %rd1036, %rd1038; 2026-02-21T09:18:15.0450503Z $L__tmp396: 
2026-02-21T09:18:15.0450670Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0450730Z mov.b64 {%r7813, %r7814}, %rd1039; 2026-02-21T09:18:15.0450799Z cvt.rn.bf16x2.f32 %r7815, %r7814, %r7813; 2026-02-21T09:18:15.0450860Z $L__tmp397: 2026-02-21T09:18:15.0451068Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0451128Z cvt.u64.u32 %rd1040, %r7624; 2026-02-21T09:18:15.0451188Z cvt.u64.u32 %rd1041, %r7625; 2026-02-21T09:18:15.0451255Z shl.b64 %rd1042, %rd1041, 32; 2026-02-21T09:18:15.0451315Z or.b64 %rd1043, %rd1040, %rd1042; 2026-02-21T09:18:15.0451368Z $L__tmp398: 2026-02-21T09:18:15.0451560Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0451620Z mov.b64 {%r7816, %r7817}, %rd1043; 2026-02-21T09:18:15.0451688Z cvt.rn.bf16x2.f32 %r7818, %r7817, %r7816; 2026-02-21T09:18:15.0451746Z $L__tmp399: 2026-02-21T09:18:15.0451947Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0452006Z cvt.u64.u32 %rd1044, %r7626; 2026-02-21T09:18:15.0452064Z cvt.u64.u32 %rd1045, %r7627; 2026-02-21T09:18:15.0452130Z shl.b64 %rd1046, %rd1045, 32; 2026-02-21T09:18:15.0452202Z or.b64 %rd1047, %rd1044, %rd1046; 2026-02-21T09:18:15.0452256Z $L__tmp400: 2026-02-21T09:18:15.0452457Z .loc 1 87 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:87:28 2026-02-21T09:18:15.0452519Z mov.b64 {%r7819, %r7820}, %rd1047; 2026-02-21T09:18:15.0452617Z cvt.rn.bf16x2.f32 %r7821, %r7820, %r7819; 2026-02-21T09:18:15.0452811Z .loc 1 88 81 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:88:81 2026-02-21T09:18:15.0452870Z bar.sync 0; 2026-02-21T09:18:15.0453004Z st.shared.v4.b32 [%r68], {%r7728, %r7740, %r7752, %r7764}; 2026-02-21T09:18:15.0453106Z st.shared.v4.b32 [%r69], {%r7776, %r7788, %r7800, %r7812}; 2026-02-21T09:18:15.0453171Z bar.sync 0; 
2026-02-21T09:18:15.0453233Z // begin inline asm 2026-02-21T09:18:15.0453400Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r7629, %r7630, %r7631, %r7632}, [%r7633]; 2026-02-21T09:18:15.0453465Z // end inline asm 2026-02-21T09:18:15.0453525Z // begin inline asm 2026-02-21T09:18:15.0453687Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r7634, %r7635, %r7636, %r7637}, [%r7638]; 2026-02-21T09:18:15.0453746Z // end inline asm 2026-02-21T09:18:15.0453811Z bar.sync 0; 2026-02-21T09:18:15.0453909Z st.shared.v4.b32 [%r68], {%r7731, %r7743, %r7755, %r7767}; 2026-02-21T09:18:15.0454006Z st.shared.v4.b32 [%r69], {%r7779, %r7791, %r7803, %r7815}; 2026-02-21T09:18:15.0454070Z bar.sync 0; 2026-02-21T09:18:15.0454130Z // begin inline asm 2026-02-21T09:18:15.0454288Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r7639, %r7640, %r7641, %r7642}, [%r7633]; 2026-02-21T09:18:15.0454353Z // end inline asm 2026-02-21T09:18:15.0454413Z // begin inline asm 2026-02-21T09:18:15.0454569Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r7644, %r7645, %r7646, %r7647}, [%r7638]; 2026-02-21T09:18:15.0454626Z // end inline asm 2026-02-21T09:18:15.0454689Z bar.sync 0; 2026-02-21T09:18:15.0454783Z st.shared.v4.b32 [%r68], {%r7734, %r7746, %r7758, %r7770}; 2026-02-21T09:18:15.0454877Z st.shared.v4.b32 [%r69], {%r7782, %r7794, %r7806, %r7818}; 2026-02-21T09:18:15.0454941Z bar.sync 0; 2026-02-21T09:18:15.0455000Z // begin inline asm 2026-02-21T09:18:15.0455153Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r7649, %r7650, %r7651, %r7652}, [%r7633]; 2026-02-21T09:18:15.0455210Z // end inline asm 2026-02-21T09:18:15.0455275Z // begin inline asm 2026-02-21T09:18:15.0455427Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r7654, %r7655, %r7656, %r7657}, [%r7638]; 2026-02-21T09:18:15.0455485Z // end inline asm 2026-02-21T09:18:15.0455546Z bar.sync 0; 2026-02-21T09:18:15.0455640Z st.shared.v4.b32 [%r68], {%r7737, %r7749, %r7761, %r7773}; 2026-02-21T09:18:15.0455734Z st.shared.v4.b32 [%r69], {%r7785, %r7797, %r7809, %r7821}; 
2026-02-21T09:18:15.0455796Z bar.sync 0; 2026-02-21T09:18:15.0455853Z // begin inline asm 2026-02-21T09:18:15.0456005Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r7659, %r7660, %r7661, %r7662}, [%r7633]; 2026-02-21T09:18:15.0456061Z // end inline asm 2026-02-21T09:18:15.0456124Z // begin inline asm 2026-02-21T09:18:15.0456275Z ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r7664, %r7665, %r7666, %r7667}, [%r7638]; 2026-02-21T09:18:15.0456334Z // end inline asm 2026-02-21T09:18:15.0456399Z // begin inline asm 2026-02-21T09:18:15.0456511Z st.global.v4.b32 [ %rd912 + 0 ], { %r7629, %r7639, %r7649, %r7659 }; 2026-02-21T09:18:15.0456569Z // end inline asm 2026-02-21T09:18:15.0456627Z // begin inline asm 2026-02-21T09:18:15.0456745Z st.global.v4.b32 [ %rd913 + 0 ], { %r7630, %r7640, %r7650, %r7660 }; 2026-02-21T09:18:15.0456802Z // end inline asm 2026-02-21T09:18:15.0456859Z // begin inline asm 2026-02-21T09:18:15.0456972Z st.global.v4.b32 [ %rd914 + 0 ], { %r7631, %r7641, %r7651, %r7661 }; 2026-02-21T09:18:15.0457028Z // end inline asm 2026-02-21T09:18:15.0457085Z // begin inline asm 2026-02-21T09:18:15.0457196Z st.global.v4.b32 [ %rd915 + 0 ], { %r7632, %r7642, %r7652, %r7662 }; 2026-02-21T09:18:15.0457251Z // end inline asm 2026-02-21T09:18:15.0457309Z // begin inline asm 2026-02-21T09:18:15.0457412Z st.global.v4.b32 [ %rd916 + 0 ], { %r7634, %r7644, %r7654, %r7664 }; 2026-02-21T09:18:15.0457501Z // end inline asm 2026-02-21T09:18:15.0457561Z // begin inline asm 2026-02-21T09:18:15.0457661Z st.global.v4.b32 [ %rd917 + 0 ], { %r7635, %r7645, %r7655, %r7665 }; 2026-02-21T09:18:15.0457748Z // end inline asm 2026-02-21T09:18:15.0457806Z // begin inline asm 2026-02-21T09:18:15.0457927Z st.global.v4.b32 [ %rd918 + 0 ], { %r7636, %r7646, %r7656, %r7666 }; 2026-02-21T09:18:15.0457983Z // end inline asm 2026-02-21T09:18:15.0458084Z // begin inline asm 2026-02-21T09:18:15.0458188Z st.global.v4.b32 [ %rd919 + 0 ], { %r7637, %r7647, %r7657, %r7667 }; 2026-02-21T09:18:15.0458245Z // 
end inline asm 2026-02-21T09:18:15.0458436Z .loc 1 19 112 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:112 2026-02-21T09:18:15.0458502Z add.s32 %r143, %r7847, 2368; 2026-02-21T09:18:15.0458573Z setp.lt.s32 %p386, %r7847, 1728; 2026-02-21T09:18:15.0458642Z mov.b32 %r7847, %r143; 2026-02-21T09:18:15.0458705Z @%p386 bra $L__BB0_45; 2026-02-21T09:18:15.0458767Z bra.uni $L__BB0_56; 2026-02-21T09:18:15.0458871Z $L__BB0_45: // =>This Loop Header: Depth=1 2026-02-21T09:18:15.0458973Z // Child Loop BB0_46 Depth 2 2026-02-21T09:18:15.0459147Z .loc 1 25 35 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:25:35 2026-02-21T09:18:15.0459213Z shr.s32 %r6389, %r7847, 31; 2026-02-21T09:18:15.0459287Z shr.u32 %r6390, %r6389, 22; 2026-02-21T09:18:15.0459353Z add.s32 %r6391, %r7847, %r6390; 2026-02-21T09:18:15.0459522Z .loc 1 28 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:45 2026-02-21T09:18:15.0459597Z and.b32 %r6392, %r6391, -1024; 2026-02-21T09:18:15.0459661Z sub.s32 %r6393, %r7847, %r6392; 2026-02-21T09:18:15.0459829Z .loc 1 28 64 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:64 2026-02-21T09:18:15.0459893Z cvt.u16.u32 %rs2049, %r6393; 2026-02-21T09:18:15.0460076Z .loc 1 29 51 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:29:51 2026-02-21T09:18:15.0460138Z shr.s16 %rs2050, %rs2049, 15; 2026-02-21T09:18:15.0460196Z shr.u16 %rs2051, %rs2050, 12; 2026-02-21T09:18:15.0460270Z add.s16 %rs2052, %rs2049, %rs2051; 2026-02-21T09:18:15.0460330Z shr.s16 %rs2053, %rs2052, 4; 2026-02-21T09:18:15.0460487Z .loc 1 28 64 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:28:64 2026-02-21T09:18:15.0460554Z and.b16 %rs2054, %rs2052, -16; 2026-02-21T09:18:15.0460614Z sub.s16 %rs2055, %rs2049, %rs2054; 2026-02-21T09:18:15.0460771Z .loc 1 30 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:30:27 2026-02-21T09:18:15.0460833Z mul.wide.s16 %r6394, %rs2055, 64; 2026-02-21T09:18:15.0460900Z add.s32 %r134, 
%r6394, %r6392; 2026-02-21T09:18:15.0461056Z .loc 1 32 27 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:32:27 2026-02-21T09:18:15.0461121Z mul.wide.s16 %r135, %rs2053, 128; 2026-02-21T09:18:15.0461180Z $L__tmp401: 2026-02-21T09:18:15.0461389Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0461468Z shfl.sync.idx.b32 %r136, %r5, 0, 31, -1; 2026-02-21T09:18:15.0461601Z shl.b32 %r6395, %r136, 21; 2026-02-21T09:18:15.0461667Z and.b32 %r6396, %r6395, 6291456; 2026-02-21T09:18:15.0461729Z add.s32 %r7628, %r6396, %r7822; 2026-02-21T09:18:15.0461792Z mov.pred %p311, -1; 2026-02-21T09:18:15.0461856Z mov.b32 %r6322, 0; 2026-02-21T09:18:15.0461913Z // begin inline asm 2026-02-21T09:18:15.0462248Z @%p311 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7628 + 0], 64, {%r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322}; 2026-02-21T09:18:15.0462314Z // end inline asm 2026-02-21T09:18:15.0462371Z // begin inline asm 2026-02-21T09:18:15.0462688Z @%p311 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7628 + 16], 64, {%r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322}; 2026-02-21T09:18:15.0462781Z // end inline asm 2026-02-21T09:18:15.0462868Z // begin inline asm 2026-02-21T09:18:15.0463225Z @%p311 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7628 + 32], 64, {%r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322}; 2026-02-21T09:18:15.0463290Z // end inline asm 2026-02-21T09:18:15.0463348Z // begin inline asm 2026-02-21T09:18:15.0463651Z @%p311 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7628 + 48], 64, {%r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322, %r6322}; 2026-02-21T09:18:15.0463716Z // end inline asm 
2026-02-21T09:18:15.0463773Z // begin inline asm 2026-02-21T09:18:15.0463844Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0463899Z // end inline asm 2026-02-21T09:18:15.0463962Z bar.sync 0; 2026-02-21T09:18:15.0464022Z add.s32 %r6721, %r6396, %r6675; 2026-02-21T09:18:15.0464081Z add.s32 %r7130, %r6396, %r7254; 2026-02-21T09:18:15.0464145Z add.s32 %r7010, %r6396, %r6964; 2026-02-21T09:18:15.0464204Z add.s32 %r7419, %r6396, %r7543; 2026-02-21T09:18:15.0464260Z add.s32 %r7299, %r6396, %r7253; 2026-02-21T09:18:15.0464314Z $L__tmp402: 2026-02-21T09:18:15.0464492Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:15.0464551Z or.b32 %r6397, %r7841, %r6392; 2026-02-21T09:18:15.0464609Z add.s32 %r6398, %r6397, %r6394; 2026-02-21T09:18:15.0464674Z shl.b32 %r6399, %r6398, 10; 2026-02-21T09:18:15.0464736Z mul.wide.s32 %rd44, %r6399, 2; 2026-02-21T09:18:15.0464793Z or.b32 %r6400, %r7, %r6392; 2026-02-21T09:18:15.0464857Z add.s32 %r6401, %r6400, %r6394; 2026-02-21T09:18:15.0464914Z shl.b32 %r6402, %r6401, 10; 2026-02-21T09:18:15.0464974Z mul.wide.s32 %rd45, %r6402, 2; 2026-02-21T09:18:15.0465032Z or.b32 %r6403, %r18, %r135; 2026-02-21T09:18:15.0465098Z cvt.s64.s32 %rd855, %r6403; 2026-02-21T09:18:15.0465158Z add.s64 %rd1056, %rd11, %rd855; 2026-02-21T09:18:15.0465220Z mov.pred %p393, 0; 2026-02-21T09:18:15.0465284Z mov.b64 %rd1058, -64; 2026-02-21T09:18:15.0465341Z mov.b64 %rd1057, %rd10; 2026-02-21T09:18:15.0465399Z bra.uni $L__BB0_46; 2026-02-21T09:18:15.0465503Z $L__BB0_54: // in Loop: Header=BB0_46 Depth=2 2026-02-21T09:18:15.0465562Z $L__tmp403: 2026-02-21T09:18:15.0465771Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0465827Z // begin inline asm 2026-02-21T09:18:15.0465884Z 2026-02-21T09:18:15.0465935Z { 2026-02-21T09:18:15.0465997Z .reg .pred complete; 2026-02-21T09:18:15.0466052Z waitLoop: 2026-02-21T09:18:15.0466181Z 
mbarrier.try_wait.parity.shared.b64 complete, [%r7557], %r7269; 2026-02-21T09:18:15.0466247Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0466297Z } 2026-02-21T09:18:15.0466301Z 2026-02-21T09:18:15.0466364Z // end inline asm 2026-02-21T09:18:15.0466418Z bar.sync 0; 2026-02-21T09:18:15.0466475Z // begin inline asm 2026-02-21T09:18:15.0466568Z @%p366 mbarrier.inval.shared::cta.b64 [%r7557]; 2026-02-21T09:18:15.0466624Z // end inline asm 2026-02-21T09:18:15.0466678Z $L__tmp404: 2026-02-21T09:18:15.0466847Z .loc 1 40 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:40:124 2026-02-21T09:18:15.0466916Z add.s64 %rd1058, %rd1058, 64; 2026-02-21T09:18:15.0466975Z add.s64 %rd1057, %rd1057, 256; 2026-02-21T09:18:15.0467035Z add.s64 %rd1056, %rd1056, 524288; 2026-02-21T09:18:15.0467107Z setp.lt.u64 %p385, %rd1058, 448; 2026-02-21T09:18:15.0467166Z @%p385 bra $L__BB0_46; 2026-02-21T09:18:15.0467222Z bra.uni $L__BB0_55; 2026-02-21T09:18:15.0467320Z $L__BB0_46: // Parent Loop BB0_45 Depth=1 2026-02-21T09:18:15.0467423Z // => This Inner Loop Header: Depth=2 2026-02-21T09:18:15.0467610Z .loc 1 0 124 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:0:124 2026-02-21T09:18:15.0467696Z setp.ne.b32 %p322, %r136, 0; 2026-02-21T09:18:15.0467890Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0467953Z add.s64 %rd857, %rd1057, %rd45; 2026-02-21T09:18:15.0468129Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0468199Z add.s64 %rd860, %rd1057, %rd44; 2026-02-21T09:18:15.0468256Z // begin inline asm 2026-02-21T09:18:15.0468313Z mov.u64 %rd856, 0x0; 2026-02-21T09:18:15.0468437Z createpolicy.fractional.L2::evict_first.b64 %rd856, 1.0; 2026-02-21T09:18:15.0468495Z // end inline asm 2026-02-21T09:18:15.0468552Z // begin inline asm 2026-02-21T09:18:15.0468614Z mov.u32 %r6404, 0x0; 2026-02-21T09:18:15.0468678Z mov.u32 %r6405, 0x0; 2026-02-21T09:18:15.0468734Z mov.u32 
%r6406, 0x0; 2026-02-21T09:18:15.0468790Z mov.u32 %r6407, 0x0; 2026-02-21T09:18:15.0468977Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r6404, %r6405, %r6406, %r6407 }, [ %rd857 + 0 ], %rd856; 2026-02-21T09:18:15.0469034Z // end inline asm 2026-02-21T09:18:15.0469092Z // begin inline asm 2026-02-21T09:18:15.0469149Z mov.u64 %rd859, 0x0; 2026-02-21T09:18:15.0469264Z createpolicy.fractional.L2::evict_first.b64 %rd859, 1.0; 2026-02-21T09:18:15.0469319Z // end inline asm 2026-02-21T09:18:15.0469374Z // begin inline asm 2026-02-21T09:18:15.0469437Z mov.u32 %r6408, 0x0; 2026-02-21T09:18:15.0469490Z mov.u32 %r6409, 0x0; 2026-02-21T09:18:15.0469543Z mov.u32 %r6410, 0x0; 2026-02-21T09:18:15.0469603Z mov.u32 %r6411, 0x0; 2026-02-21T09:18:15.0469777Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r6408, %r6409, %r6410, %r6411 }, [ %rd860 + 0 ], %rd859; 2026-02-21T09:18:15.0469832Z // end inline asm 2026-02-21T09:18:15.0469884Z $L__tmp405: 2026-02-21T09:18:15.0470104Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0470159Z bar.sync 0; 2026-02-21T09:18:15.0470258Z st.shared.v4.b32 [%r67], {%r6404, %r6405, %r6406, %r6407}; 2026-02-21T09:18:15.0470368Z st.shared.v4.b32 [%r67+512], {%r6408, %r6409, %r6410, %r6411}; 2026-02-21T09:18:15.0470424Z bar.sync 0; 2026-02-21T09:18:15.0470518Z ld.shared.v4.b32 {%r6570, %r6571, %r6572, %r6573}, [%r68]; 2026-02-21T09:18:15.0470592Z mov.b32 {%rs2056, %rs2057}, %r6573; 2026-02-21T09:18:15.0470657Z mov.b32 {%rs2058, %rs2059}, %r6572; 2026-02-21T09:18:15.0470718Z mov.b32 {%rs2060, %rs2061}, %r6571; 2026-02-21T09:18:15.0470778Z mov.b32 {%rs2062, %rs2063}, %r6570; 2026-02-21T09:18:15.0470878Z ld.shared.v4.b32 {%r6574, %r6575, %r6576, %r6577}, [%r69]; 2026-02-21T09:18:15.0470938Z mov.b32 {%rs2064, %rs2065}, %r6577; 2026-02-21T09:18:15.0470998Z mov.b32 {%rs2066, %rs2067}, %r6576; 2026-02-21T09:18:15.0471067Z mov.b32 {%rs2068, %rs2069}, %r6575; 
2026-02-21T09:18:15.0471126Z mov.b32 {%rs2070, %rs2071}, %r6574; 2026-02-21T09:18:15.0471179Z $L__tmp406: 2026-02-21T09:18:15.0471347Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0471412Z cvt.f32.bf16 %r6553, %rs2062; 2026-02-21T09:18:15.0471474Z cvt.f32.bf16 %r6554, %rs2063; 2026-02-21T09:18:15.0471562Z cvt.f32.bf16 %r6555, %rs2060; 2026-02-21T09:18:15.0471632Z cvt.f32.bf16 %r6556, %rs2061; 2026-02-21T09:18:15.0471689Z cvt.f32.bf16 %r6557, %rs2058; 2026-02-21T09:18:15.0471746Z cvt.f32.bf16 %r6558, %rs2059; 2026-02-21T09:18:15.0471808Z cvt.f32.bf16 %r6559, %rs2056; 2026-02-21T09:18:15.0471865Z cvt.f32.bf16 %r6560, %rs2057; 2026-02-21T09:18:15.0471923Z cvt.f32.bf16 %r6561, %rs2070; 2026-02-21T09:18:15.0471981Z cvt.f32.bf16 %r6562, %rs2071; 2026-02-21T09:18:15.0472043Z cvt.f32.bf16 %r6563, %rs2068; 2026-02-21T09:18:15.0472101Z cvt.f32.bf16 %r6564, %rs2069; 2026-02-21T09:18:15.0472191Z cvt.f32.bf16 %r6565, %rs2066; 2026-02-21T09:18:15.0472255Z cvt.f32.bf16 %r6566, %rs2067; 2026-02-21T09:18:15.0472311Z cvt.f32.bf16 %r6567, %rs2064; 2026-02-21T09:18:15.0472392Z cvt.f32.bf16 %r6568, %rs2065; 2026-02-21T09:18:15.0472553Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0472641Z // begin inline asm 2026-02-21T09:18:15.0472700Z mov.u64 %rd862, 0x0; 2026-02-21T09:18:15.0472833Z createpolicy.fractional.L2::evict_first.b64 %rd862, 1.0; 2026-02-21T09:18:15.0472897Z // end inline asm 2026-02-21T09:18:15.0472954Z // begin inline asm 2026-02-21T09:18:15.0473009Z mov.u32 %r6412, 0x0; 2026-02-21T09:18:15.0473070Z mov.u32 %r6413, 0x0; 2026-02-21T09:18:15.0473124Z mov.u32 %r6414, 0x0; 2026-02-21T09:18:15.0473178Z mov.u32 %r6415, 0x0; 2026-02-21T09:18:15.0473354Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r6412, %r6413, %r6414, %r6415 }, [ %rd1056 + 0 ], %rd862; 2026-02-21T09:18:15.0473418Z // end inline asm 2026-02-21T09:18:15.0473484Z prmt.b32 %r6578, %r6412, 0, 0x8880U; 
2026-02-21T09:18:15.0473545Z cvt.u16.u32 %rs2072, %r6578; 2026-02-21T09:18:15.0473617Z prmt.b32 %r6579, %r6412, 0, 0x7770U; 2026-02-21T09:18:15.0473677Z cvt.u16.u32 %rs2073, %r6579; 2026-02-21T09:18:15.0473740Z prmt.b32 %r6580, %r6412, 0, 0x9991U; 2026-02-21T09:18:15.0473804Z cvt.u16.u32 %rs2074, %r6580; 2026-02-21T09:18:15.0473866Z prmt.b32 %r6581, %r6412, 0, 0x7771U; 2026-02-21T09:18:15.0473924Z cvt.u16.u32 %rs2075, %r6581; 2026-02-21T09:18:15.0473984Z prmt.b32 %r6582, %r6412, 0, 0xaaa2U; 2026-02-21T09:18:15.0474048Z cvt.u16.u32 %rs2076, %r6582; 2026-02-21T09:18:15.0474108Z prmt.b32 %r6583, %r6412, 0, 0x7772U; 2026-02-21T09:18:15.0474167Z cvt.u16.u32 %rs2077, %r6583; 2026-02-21T09:18:15.0474235Z prmt.b32 %r6584, %r6412, 0, 0xbbb3U; 2026-02-21T09:18:15.0474293Z cvt.u16.u32 %rs2078, %r6584; 2026-02-21T09:18:15.0474353Z prmt.b32 %r6585, %r6412, 0, 0x7773U; 2026-02-21T09:18:15.0474412Z cvt.u16.u32 %rs2079, %r6585; 2026-02-21T09:18:15.0474484Z prmt.b32 %r6586, %r6413, 0, 0x8880U; 2026-02-21T09:18:15.0474541Z cvt.u16.u32 %rs2080, %r6586; 2026-02-21T09:18:15.0474602Z prmt.b32 %r6587, %r6413, 0, 0x7770U; 2026-02-21T09:18:15.0474669Z cvt.u16.u32 %rs2081, %r6587; 2026-02-21T09:18:15.0474728Z prmt.b32 %r6588, %r6413, 0, 0x9991U; 2026-02-21T09:18:15.0474787Z cvt.u16.u32 %rs2082, %r6588; 2026-02-21T09:18:15.0474848Z prmt.b32 %r6589, %r6413, 0, 0x7771U; 2026-02-21T09:18:15.0474913Z cvt.u16.u32 %rs2083, %r6589; 2026-02-21T09:18:15.0474973Z prmt.b32 %r6590, %r6413, 0, 0xaaa2U; 2026-02-21T09:18:15.0475030Z cvt.u16.u32 %rs2084, %r6590; 2026-02-21T09:18:15.0475099Z prmt.b32 %r6591, %r6413, 0, 0x7772U; 2026-02-21T09:18:15.0475157Z cvt.u16.u32 %rs2085, %r6591; 2026-02-21T09:18:15.0475218Z prmt.b32 %r6592, %r6413, 0, 0xbbb3U; 2026-02-21T09:18:15.0475283Z cvt.u16.u32 %rs2086, %r6592; 2026-02-21T09:18:15.0475345Z prmt.b32 %r6593, %r6413, 0, 0x7773U; 2026-02-21T09:18:15.0475402Z cvt.u16.u32 %rs2087, %r6593; 2026-02-21T09:18:15.0475466Z prmt.b32 %r6594, %r6414, 0, 0x8880U; 
2026-02-21T09:18:15.0475532Z cvt.u16.u32 %rs2088, %r6594; 2026-02-21T09:18:15.0475594Z prmt.b32 %r6595, %r6414, 0, 0x7770U; 2026-02-21T09:18:15.0475656Z cvt.u16.u32 %rs2089, %r6595; 2026-02-21T09:18:15.0475725Z prmt.b32 %r6596, %r6414, 0, 0x9991U; 2026-02-21T09:18:15.0475783Z cvt.u16.u32 %rs2090, %r6596; 2026-02-21T09:18:15.0475845Z prmt.b32 %r6597, %r6414, 0, 0x7771U; 2026-02-21T09:18:15.0475902Z cvt.u16.u32 %rs2091, %r6597; 2026-02-21T09:18:15.0475969Z prmt.b32 %r6598, %r6414, 0, 0xaaa2U; 2026-02-21T09:18:15.0476027Z cvt.u16.u32 %rs2092, %r6598; 2026-02-21T09:18:15.0476087Z prmt.b32 %r6599, %r6414, 0, 0x7772U; 2026-02-21T09:18:15.0476153Z cvt.u16.u32 %rs2093, %r6599; 2026-02-21T09:18:15.0476213Z prmt.b32 %r6600, %r6414, 0, 0xbbb3U; 2026-02-21T09:18:15.0476271Z cvt.u16.u32 %rs2094, %r6600; 2026-02-21T09:18:15.0476338Z prmt.b32 %r6601, %r6414, 0, 0x7773U; 2026-02-21T09:18:15.0476396Z cvt.u16.u32 %rs2095, %r6601; 2026-02-21T09:18:15.0476478Z prmt.b32 %r6602, %r6415, 0, 0x8880U; 2026-02-21T09:18:15.0476535Z cvt.u16.u32 %rs2096, %r6602; 2026-02-21T09:18:15.0476602Z prmt.b32 %r6603, %r6415, 0, 0x7770U; 2026-02-21T09:18:15.0476681Z cvt.u16.u32 %rs2097, %r6603; 2026-02-21T09:18:15.0476741Z prmt.b32 %r6604, %r6415, 0, 0x9991U; 2026-02-21T09:18:15.0476838Z cvt.u16.u32 %rs2098, %r6604; 2026-02-21T09:18:15.0476919Z prmt.b32 %r6605, %r6415, 0, 0x7771U; 2026-02-21T09:18:15.0476976Z cvt.u16.u32 %rs2099, %r6605; 2026-02-21T09:18:15.0477036Z prmt.b32 %r6606, %r6415, 0, 0xaaa2U; 2026-02-21T09:18:15.0477102Z cvt.u16.u32 %rs2100, %r6606; 2026-02-21T09:18:15.0477163Z prmt.b32 %r6607, %r6415, 0, 0x7772U; 2026-02-21T09:18:15.0477220Z cvt.u16.u32 %rs2101, %r6607; 2026-02-21T09:18:15.0477286Z prmt.b32 %r6608, %r6415, 0, 0xbbb3U; 2026-02-21T09:18:15.0477343Z cvt.u16.u32 %rs2102, %r6608; 2026-02-21T09:18:15.0477402Z prmt.b32 %r6609, %r6415, 0, 0x7773U; 2026-02-21T09:18:15.0477462Z cvt.u16.u32 %rs2103, %r6609; 2026-02-21T09:18:15.0477628Z .loc 1 59 25 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0477689Z shl.b16 %rs2104, %rs2073, 12; 2026-02-21T09:18:15.0477749Z shr.s16 %rs2105, %rs2104, 12; 2026-02-21T09:18:15.0477814Z shl.b16 %rs2106, %rs2075, 12; 2026-02-21T09:18:15.0477873Z shr.s16 %rs2107, %rs2106, 12; 2026-02-21T09:18:15.0477929Z shl.b16 %rs2108, %rs2077, 12; 2026-02-21T09:18:15.0477995Z shr.s16 %rs2109, %rs2108, 12; 2026-02-21T09:18:15.0478051Z shl.b16 %rs2110, %rs2079, 12; 2026-02-21T09:18:15.0478107Z shr.s16 %rs2111, %rs2110, 12; 2026-02-21T09:18:15.0478163Z shl.b16 %rs2112, %rs2081, 12; 2026-02-21T09:18:15.0478228Z shr.s16 %rs2113, %rs2112, 12; 2026-02-21T09:18:15.0478284Z shl.b16 %rs2114, %rs2083, 12; 2026-02-21T09:18:15.0478341Z shr.s16 %rs2115, %rs2114, 12; 2026-02-21T09:18:15.0478403Z shl.b16 %rs2116, %rs2085, 12; 2026-02-21T09:18:15.0478460Z shr.s16 %rs2117, %rs2116, 12; 2026-02-21T09:18:15.0478520Z shl.b16 %rs2118, %rs2087, 12; 2026-02-21T09:18:15.0478577Z shr.s16 %rs2119, %rs2118, 12; 2026-02-21T09:18:15.0478641Z shl.b16 %rs2120, %rs2089, 12; 2026-02-21T09:18:15.0478697Z shr.s16 %rs2121, %rs2120, 12; 2026-02-21T09:18:15.0478754Z shl.b16 %rs2122, %rs2091, 12; 2026-02-21T09:18:15.0478817Z shr.s16 %rs2123, %rs2122, 12; 2026-02-21T09:18:15.0478874Z shl.b16 %rs2124, %rs2093, 12; 2026-02-21T09:18:15.0478931Z shr.s16 %rs2125, %rs2124, 12; 2026-02-21T09:18:15.0478988Z shl.b16 %rs2126, %rs2095, 12; 2026-02-21T09:18:15.0479051Z shr.s16 %rs2127, %rs2126, 12; 2026-02-21T09:18:15.0479106Z shl.b16 %rs2128, %rs2097, 12; 2026-02-21T09:18:15.0479161Z shr.s16 %rs2129, %rs2128, 12; 2026-02-21T09:18:15.0479224Z shl.b16 %rs2130, %rs2099, 12; 2026-02-21T09:18:15.0479279Z shr.s16 %rs2131, %rs2130, 12; 2026-02-21T09:18:15.0479334Z shl.b16 %rs2132, %rs2101, 12; 2026-02-21T09:18:15.0479395Z shr.s16 %rs2133, %rs2132, 12; 2026-02-21T09:18:15.0479451Z shl.b16 %rs2134, %rs2103, 12; 2026-02-21T09:18:15.0479509Z shr.s16 %rs2135, %rs2134, 12; 2026-02-21T09:18:15.0479667Z .loc 1 62 28 // 
c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0479734Z shr.u16 %rs2136, %rs2072, 4; 2026-02-21T09:18:15.0479791Z shr.u16 %rs2137, %rs2074, 4; 2026-02-21T09:18:15.0479848Z shr.u16 %rs2138, %rs2076, 4; 2026-02-21T09:18:15.0479912Z shr.u16 %rs2139, %rs2078, 4; 2026-02-21T09:18:15.0479969Z shr.u16 %rs2140, %rs2080, 4; 2026-02-21T09:18:15.0480027Z shr.u16 %rs2141, %rs2082, 4; 2026-02-21T09:18:15.0480082Z shr.u16 %rs2142, %rs2084, 4; 2026-02-21T09:18:15.0480145Z shr.u16 %rs2143, %rs2086, 4; 2026-02-21T09:18:15.0480200Z shr.u16 %rs2144, %rs2088, 4; 2026-02-21T09:18:15.0480255Z shr.u16 %rs2145, %rs2090, 4; 2026-02-21T09:18:15.0480319Z shr.u16 %rs2146, %rs2092, 4; 2026-02-21T09:18:15.0480375Z shr.u16 %rs2147, %rs2094, 4; 2026-02-21T09:18:15.0480432Z shr.u16 %rs2148, %rs2096, 4; 2026-02-21T09:18:15.0480495Z shr.u16 %rs2149, %rs2098, 4; 2026-02-21T09:18:15.0480570Z shr.u16 %rs2150, %rs2100, 4; 2026-02-21T09:18:15.0480627Z shr.u16 %rs2151, %rs2102, 4; 2026-02-21T09:18:15.0480786Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0480870Z bar.sync 0; 2026-02-21T09:18:15.0480930Z st.shared.b8 [%r70], %rs2105; 2026-02-21T09:18:15.0481019Z st.shared.b8 [%r71], %rs2107; 2026-02-21T09:18:15.0481106Z st.shared.b8 [%r72], %rs2109; 2026-02-21T09:18:15.0481165Z st.shared.b8 [%r73], %rs2111; 2026-02-21T09:18:15.0481227Z st.shared.b8 [%r74+512], %rs2113; 2026-02-21T09:18:15.0481289Z st.shared.b8 [%r75+512], %rs2115; 2026-02-21T09:18:15.0481357Z st.shared.b8 [%r76+512], %rs2117; 2026-02-21T09:18:15.0481416Z st.shared.b8 [%r77+512], %rs2119; 2026-02-21T09:18:15.0481479Z st.shared.b8 [%r78+1024], %rs2121; 2026-02-21T09:18:15.0481569Z st.shared.b8 [%r79+1024], %rs2123; 2026-02-21T09:18:15.0481631Z st.shared.b8 [%r80+1024], %rs2125; 2026-02-21T09:18:15.0481692Z st.shared.b8 [%r81+1024], %rs2127; 2026-02-21T09:18:15.0481753Z st.shared.b8 [%r82+1536], %rs2129; 2026-02-21T09:18:15.0481821Z st.shared.b8 
[%r83+1536], %rs2131; 2026-02-21T09:18:15.0481883Z st.shared.b8 [%r84+1536], %rs2133; 2026-02-21T09:18:15.0481942Z st.shared.b8 [%r85+1536], %rs2135; 2026-02-21T09:18:15.0482005Z bar.sync 0; 2026-02-21T09:18:15.0482072Z ld.shared.b32 %r6610, [%r86]; 2026-02-21T09:18:15.0482135Z ld.shared.b32 %r6611, [%r87]; 2026-02-21T09:18:15.0482204Z ld.shared.b32 %r6612, [%r88]; 2026-02-21T09:18:15.0482262Z ld.shared.b32 %r6613, [%r89]; 2026-02-21T09:18:15.0482421Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0482474Z bar.sync 0; 2026-02-21T09:18:15.0482541Z st.shared.b8 [%r70], %rs2136; 2026-02-21T09:18:15.0482599Z st.shared.b8 [%r71], %rs2137; 2026-02-21T09:18:15.0482656Z st.shared.b8 [%r72], %rs2138; 2026-02-21T09:18:15.0482721Z st.shared.b8 [%r73], %rs2139; 2026-02-21T09:18:15.0482783Z st.shared.b8 [%r74+512], %rs2140; 2026-02-21T09:18:15.0482842Z st.shared.b8 [%r75+512], %rs2141; 2026-02-21T09:18:15.0482902Z st.shared.b8 [%r76+512], %rs2142; 2026-02-21T09:18:15.0482969Z st.shared.b8 [%r77+512], %rs2143; 2026-02-21T09:18:15.0483030Z st.shared.b8 [%r78+1024], %rs2144; 2026-02-21T09:18:15.0483090Z st.shared.b8 [%r79+1024], %rs2145; 2026-02-21T09:18:15.0483155Z st.shared.b8 [%r80+1024], %rs2146; 2026-02-21T09:18:15.0483215Z st.shared.b8 [%r81+1024], %rs2147; 2026-02-21T09:18:15.0483275Z st.shared.b8 [%r82+1536], %rs2148; 2026-02-21T09:18:15.0483342Z st.shared.b8 [%r83+1536], %rs2149; 2026-02-21T09:18:15.0483402Z st.shared.b8 [%r84+1536], %rs2150; 2026-02-21T09:18:15.0483462Z st.shared.b8 [%r85+1536], %rs2151; 2026-02-21T09:18:15.0483514Z bar.sync 0; 2026-02-21T09:18:15.0483583Z ld.shared.b32 %r6614, [%r86]; 2026-02-21T09:18:15.0483643Z ld.shared.b32 %r6615, [%r87]; 2026-02-21T09:18:15.0483703Z ld.shared.b32 %r6616, [%r88]; 2026-02-21T09:18:15.0483769Z ld.shared.b32 %r6617, [%r89]; 2026-02-21T09:18:15.0483929Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0483991Z cvt.s8.s32 
%rs2152, %r6611; 2026-02-21T09:18:15.0484054Z cvt.rn.f32.s16 %r6618, %rs2152; 2026-02-21T09:18:15.0484122Z cvt.s8.s32 %rs2153, %r6610; 2026-02-21T09:18:15.0484185Z cvt.rn.f32.s16 %r6619, %rs2153; 2026-02-21T09:18:15.0484243Z cvt.s8.s32 %rs2154, %r6615; 2026-02-21T09:18:15.0484312Z cvt.rn.f32.s16 %r6620, %rs2154; 2026-02-21T09:18:15.0484371Z cvt.s8.s32 %rs2155, %r6614; 2026-02-21T09:18:15.0484430Z cvt.rn.f32.s16 %r6621, %rs2155; 2026-02-21T09:18:15.0484487Z cvt.s8.s32 %rs2156, %r6613; 2026-02-21T09:18:15.0484554Z cvt.rn.f32.s16 %r6622, %rs2156; 2026-02-21T09:18:15.0484612Z cvt.s8.s32 %rs2157, %r6612; 2026-02-21T09:18:15.0484670Z cvt.rn.f32.s16 %r6623, %rs2157; 2026-02-21T09:18:15.0484736Z cvt.s8.s32 %rs2158, %r6617; 2026-02-21T09:18:15.0484794Z cvt.rn.f32.s16 %r6624, %rs2158; 2026-02-21T09:18:15.0484881Z cvt.s8.s32 %rs2159, %r6616; 2026-02-21T09:18:15.0484946Z cvt.rn.f32.s16 %r6625, %rs2159; 2026-02-21T09:18:15.0485009Z prmt.b32 %r6626, %r6611, 0, 0x9991U; 2026-02-21T09:18:15.0485093Z cvt.u16.u32 %rs2160, %r6626; 2026-02-21T09:18:15.0485152Z cvt.rn.f32.s16 %r6627, %rs2160; 2026-02-21T09:18:15.0485248Z prmt.b32 %r6628, %r6610, 0, 0x9991U; 2026-02-21T09:18:15.0485309Z cvt.u16.u32 %rs2161, %r6628; 2026-02-21T09:18:15.0485393Z cvt.rn.f32.s16 %r6629, %rs2161; 2026-02-21T09:18:15.0485462Z prmt.b32 %r6630, %r6615, 0, 0x9991U; 2026-02-21T09:18:15.0485520Z cvt.u16.u32 %rs2162, %r6630; 2026-02-21T09:18:15.0485578Z cvt.rn.f32.s16 %r6631, %rs2162; 2026-02-21T09:18:15.0485638Z prmt.b32 %r6632, %r6614, 0, 0x9991U; 2026-02-21T09:18:15.0485702Z cvt.u16.u32 %rs2163, %r6632; 2026-02-21T09:18:15.0485760Z cvt.rn.f32.s16 %r6633, %rs2163; 2026-02-21T09:18:15.0485820Z prmt.b32 %r6634, %r6613, 0, 0x9991U; 2026-02-21T09:18:15.0485884Z cvt.u16.u32 %rs2164, %r6634; 2026-02-21T09:18:15.0485943Z cvt.rn.f32.s16 %r6635, %rs2164; 2026-02-21T09:18:15.0486003Z prmt.b32 %r6636, %r6612, 0, 0x9991U; 2026-02-21T09:18:15.0486065Z cvt.u16.u32 %rs2165, %r6636; 2026-02-21T09:18:15.0486124Z cvt.rn.f32.s16 
%r6637, %rs2165; 2026-02-21T09:18:15.0486184Z prmt.b32 %r6638, %r6617, 0, 0x9991U; 2026-02-21T09:18:15.0486241Z cvt.u16.u32 %rs2166, %r6638; 2026-02-21T09:18:15.0486308Z cvt.rn.f32.s16 %r6639, %rs2166; 2026-02-21T09:18:15.0486369Z prmt.b32 %r6640, %r6616, 0, 0x9991U; 2026-02-21T09:18:15.0486426Z cvt.u16.u32 %rs2167, %r6640; 2026-02-21T09:18:15.0486490Z cvt.rn.f32.s16 %r6641, %rs2167; 2026-02-21T09:18:15.0486551Z prmt.b32 %r6642, %r6611, 0, 0xaaa2U; 2026-02-21T09:18:15.0486608Z cvt.u16.u32 %rs2168, %r6642; 2026-02-21T09:18:15.0486666Z cvt.rn.f32.s16 %r6643, %rs2168; 2026-02-21T09:18:15.0486735Z prmt.b32 %r6644, %r6610, 0, 0xaaa2U; 2026-02-21T09:18:15.0486791Z cvt.u16.u32 %rs2169, %r6644; 2026-02-21T09:18:15.0486849Z cvt.rn.f32.s16 %r6645, %rs2169; 2026-02-21T09:18:15.0486918Z prmt.b32 %r6646, %r6615, 0, 0xaaa2U; 2026-02-21T09:18:15.0486975Z cvt.u16.u32 %rs2170, %r6646; 2026-02-21T09:18:15.0487033Z cvt.rn.f32.s16 %r6647, %rs2170; 2026-02-21T09:18:15.0487094Z prmt.b32 %r6648, %r6614, 0, 0xaaa2U; 2026-02-21T09:18:15.0487159Z cvt.u16.u32 %rs2171, %r6648; 2026-02-21T09:18:15.0487216Z cvt.rn.f32.s16 %r6649, %rs2171; 2026-02-21T09:18:15.0487278Z prmt.b32 %r6650, %r6613, 0, 0xaaa2U; 2026-02-21T09:18:15.0487344Z cvt.u16.u32 %rs2172, %r6650; 2026-02-21T09:18:15.0487403Z cvt.rn.f32.s16 %r6651, %rs2172; 2026-02-21T09:18:15.0487464Z prmt.b32 %r6652, %r6612, 0, 0xaaa2U; 2026-02-21T09:18:15.0487528Z cvt.u16.u32 %rs2173, %r6652; 2026-02-21T09:18:15.0487585Z cvt.rn.f32.s16 %r6653, %rs2173; 2026-02-21T09:18:15.0487645Z prmt.b32 %r6654, %r6617, 0, 0xaaa2U; 2026-02-21T09:18:15.0487702Z cvt.u16.u32 %rs2174, %r6654; 2026-02-21T09:18:15.0487769Z cvt.rn.f32.s16 %r6655, %rs2174; 2026-02-21T09:18:15.0487830Z prmt.b32 %r6656, %r6616, 0, 0xaaa2U; 2026-02-21T09:18:15.0487890Z cvt.u16.u32 %rs2175, %r6656; 2026-02-21T09:18:15.0487955Z cvt.rn.f32.s16 %r6657, %rs2175; 2026-02-21T09:18:15.0488016Z prmt.b32 %r6658, %r6611, 0, 0xbbb3U; 2026-02-21T09:18:15.0488077Z cvt.u16.u32 %rs2176, %r6658; 
2026-02-21T09:18:15.0488138Z cvt.rn.f32.s16 %r6659, %rs2176; 2026-02-21T09:18:15.0488207Z prmt.b32 %r6660, %r6610, 0, 0xbbb3U; 2026-02-21T09:18:15.0488267Z cvt.u16.u32 %rs2177, %r6660; 2026-02-21T09:18:15.0488327Z cvt.rn.f32.s16 %r6661, %rs2177; 2026-02-21T09:18:15.0488397Z prmt.b32 %r6662, %r6615, 0, 0xbbb3U; 2026-02-21T09:18:15.0488456Z cvt.u16.u32 %rs2178, %r6662; 2026-02-21T09:18:15.0488515Z cvt.rn.f32.s16 %r6663, %rs2178; 2026-02-21T09:18:15.0488583Z prmt.b32 %r6664, %r6614, 0, 0xbbb3U; 2026-02-21T09:18:15.0488641Z cvt.u16.u32 %rs2179, %r6664; 2026-02-21T09:18:15.0488701Z cvt.rn.f32.s16 %r6665, %rs2179; 2026-02-21T09:18:15.0488761Z prmt.b32 %r6666, %r6613, 0, 0xbbb3U; 2026-02-21T09:18:15.0488827Z cvt.u16.u32 %rs2180, %r6666; 2026-02-21T09:18:15.0488908Z cvt.rn.f32.s16 %r6667, %rs2180; 2026-02-21T09:18:15.0488969Z prmt.b32 %r6668, %r6612, 0, 0xbbb3U; 2026-02-21T09:18:15.0489034Z cvt.u16.u32 %rs2181, %r6668; 2026-02-21T09:18:15.0489093Z cvt.rn.f32.s16 %r6669, %rs2181; 2026-02-21T09:18:15.0489177Z prmt.b32 %r6670, %r6617, 0, 0xbbb3U; 2026-02-21T09:18:15.0489235Z cvt.u16.u32 %rs2182, %r6670; 2026-02-21T09:18:15.0489324Z cvt.rn.f32.s16 %r6671, %rs2182; 2026-02-21T09:18:15.0489408Z prmt.b32 %r6672, %r6616, 0, 0xbbb3U; 2026-02-21T09:18:15.0489467Z cvt.u16.u32 %rs2183, %r6672; 2026-02-21T09:18:15.0489533Z cvt.rn.f32.s16 %r6673, %rs2183; 2026-02-21T09:18:15.0489588Z bar.sync 0; 2026-02-21T09:18:15.0489687Z st.shared.v4.b32 [%r90], {%r6619, %r6621, %r6618, %r6620}; 2026-02-21T09:18:15.0489782Z st.shared.v4.b32 [%r91], {%r6623, %r6625, %r6622, %r6624}; 2026-02-21T09:18:15.0489880Z st.shared.v4.b32 [%r92], {%r6629, %r6633, %r6627, %r6631}; 2026-02-21T09:18:15.0489968Z st.shared.v4.b32 [%r93], {%r6637, %r6641, %r6635, %r6639}; 2026-02-21T09:18:15.0490059Z st.shared.v4.b32 [%r94], {%r6645, %r6649, %r6643, %r6647}; 2026-02-21T09:18:15.0490155Z st.shared.v4.b32 [%r95], {%r6653, %r6657, %r6651, %r6655}; 2026-02-21T09:18:15.0490244Z st.shared.v4.b32 [%r96], {%r6661, %r6665, 
%r6659, %r6663}; 2026-02-21T09:18:15.0490333Z st.shared.v4.b32 [%r97], {%r6669, %r6673, %r6667, %r6671}; 2026-02-21T09:18:15.0490394Z $L__tmp407: 2026-02-21T09:18:15.0490606Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0490664Z // begin inline asm 2026-02-21T09:18:15.0490979Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6416, %r6417, %r6418, %r6419, %r6420, %r6421, %r6422, %r6423, %r6424, %r6425, %r6426, %r6427, %r6428, %r6429, %r6430, %r6431}, [%r7628 + 0], 64; 2026-02-21T09:18:15.0491037Z // end inline asm 2026-02-21T09:18:15.0491095Z // begin inline asm 2026-02-21T09:18:15.0491404Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6433, %r6434, %r6435, %r6436, %r6437, %r6438, %r6439, %r6440, %r6441, %r6442, %r6443, %r6444, %r6445, %r6446, %r6447, %r6448}, [%r7628 + 16], 64; 2026-02-21T09:18:15.0491461Z // end inline asm 2026-02-21T09:18:15.0491519Z // begin inline asm 2026-02-21T09:18:15.0491854Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6450, %r6451, %r6452, %r6453, %r6454, %r6455, %r6456, %r6457, %r6458, %r6459, %r6460, %r6461, %r6462, %r6463, %r6464, %r6465}, [%r7628 + 32], 64; 2026-02-21T09:18:15.0491919Z // end inline asm 2026-02-21T09:18:15.0491976Z // begin inline asm 2026-02-21T09:18:15.0492273Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6467, %r6468, %r6469, %r6470, %r6471, %r6472, %r6473, %r6474, %r6475, %r6476, %r6477, %r6478, %r6479, %r6480, %r6481, %r6482}, [%r7628 + 48], 64; 2026-02-21T09:18:15.0492336Z // end inline asm 2026-02-21T09:18:15.0492392Z // begin inline asm 2026-02-21T09:18:15.0492460Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0492520Z // end inline asm 2026-02-21T09:18:15.0492576Z // begin inline asm 2026-02-21T09:18:15.0492884Z @%p311 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r6721 + 0], 64, {%r6416, %r6417, %r6418, %r6419, %r6420, %r6421, %r6422, %r6423, %r6424, %r6425, %r6426, %r6427, %r6428, %r6429, %r6430, %r6431}; 2026-02-21T09:18:15.0492948Z // 
end inline asm 2026-02-21T09:18:15.0493004Z // begin inline asm 2026-02-21T09:18:15.0493315Z @%p311 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r6721 + 16], 64, {%r6433, %r6434, %r6435, %r6436, %r6437, %r6438, %r6439, %r6440, %r6441, %r6442, %r6443, %r6444, %r6445, %r6446, %r6447, %r6448}; 2026-02-21T09:18:15.0493370Z // end inline asm 2026-02-21T09:18:15.0493433Z // begin inline asm 2026-02-21T09:18:15.0493735Z @%p311 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r6721 + 32], 64, {%r6450, %r6451, %r6452, %r6453, %r6454, %r6455, %r6456, %r6457, %r6458, %r6459, %r6460, %r6461, %r6462, %r6463, %r6464, %r6465}; 2026-02-21T09:18:15.0493791Z // end inline asm 2026-02-21T09:18:15.0493855Z // begin inline asm 2026-02-21T09:18:15.0494215Z @%p311 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r6721 + 48], 64, {%r6467, %r6468, %r6469, %r6470, %r6471, %r6472, %r6473, %r6474, %r6475, %r6476, %r6477, %r6478, %r6479, %r6480, %r6481, %r6482}; 2026-02-21T09:18:15.0494297Z // end inline asm 2026-02-21T09:18:15.0494400Z // begin inline asm 2026-02-21T09:18:15.0494468Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0494522Z // end inline asm 2026-02-21T09:18:15.0494605Z bar.sync 0; 2026-02-21T09:18:15.0494662Z // begin inline asm 2026-02-21T09:18:15.0495028Z @%p311 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7130 + 0], 16, {%r6553, %r6554, %r6555, %r6556, %r6557, %r6558, %r6559, %r6560, %r6561, %r6562, %r6563, %r6564, %r6565, %r6566, %r6567, %r6568}; 2026-02-21T09:18:15.0495088Z // end inline asm 2026-02-21T09:18:15.0495153Z // begin inline asm 2026-02-21T09:18:15.0495221Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0495277Z // end inline asm 2026-02-21T09:18:15.0495340Z bar.sync 0; 2026-02-21T09:18:15.0495399Z // begin inline asm 2026-02-21T09:18:15.0495472Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0495530Z // end inline asm 2026-02-21T09:18:15.0495599Z add.s32 %r7557, %r144, 16384; 2026-02-21T09:18:15.0495658Z // begin inline asm 2026-02-21T09:18:15.0495753Z @%p366 
mbarrier.init.shared::cta.b64 [%r7557], 1; 2026-02-21T09:18:15.0495819Z // end inline asm 2026-02-21T09:18:15.0495875Z bar.sync 0; 2026-02-21T09:18:15.0495940Z @%p322 bra $L__BB0_48; 2026-02-21T09:18:15.0496052Z // %bb.47: // in Loop: Header=BB0_46 Depth=2 2026-02-21T09:18:15.0496121Z elect.sync %r6687|%p324, -1; 2026-02-21T09:18:15.0496185Z mov.b32 %r6677, 69208336; 2026-02-21T09:18:15.0496244Z // begin inline asm 2026-02-21T09:18:15.0496421Z @%p324 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 0 ], %rd6, %r6677, %p393; 2026-02-21T09:18:15.0496479Z // end inline asm 2026-02-21T09:18:15.0496543Z mov.pred %p325, -1; 2026-02-21T09:18:15.0496608Z // begin inline asm 2026-02-21T09:18:15.0496772Z @%p324 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 8 ], %rd7, %r6677, %p325; 2026-02-21T09:18:15.0496831Z // end inline asm 2026-02-21T09:18:15.0496896Z // begin inline asm 2026-02-21T09:18:15.0497055Z @%p324 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 16 ], %rd8, %r6677, %p325; 2026-02-21T09:18:15.0497114Z // end inline asm 2026-02-21T09:18:15.0497175Z // begin inline asm 2026-02-21T09:18:15.0497343Z @%p324 tcgen05.mma.cta_group::1.kind::tf32 [ %r6675 + 0 ], [ %r7254 + 24 ], %rd9, %r6677, %p325; 2026-02-21T09:18:15.0497400Z // end inline asm 2026-02-21T09:18:15.0497466Z cvt.u64.u32 %rd869, %r7557; 2026-02-21T09:18:15.0497533Z // begin inline asm 2026-02-21T09:18:15.0497666Z @%p324 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd869]; 2026-02-21T09:18:15.0497724Z // end inline asm 2026-02-21T09:18:15.0497839Z $L__BB0_48: // in Loop: Header=BB0_46 Depth=2 2026-02-21T09:18:15.0497934Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:15.0497994Z mov.b32 %r6691, 0; 2026-02-21T09:18:15.0498223Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0498296Z // begin inline asm 2026-02-21T09:18:15.0498351Z 2026-02-21T09:18:15.0498404Z { 
2026-02-21T09:18:15.0498476Z .reg .pred complete; 2026-02-21T09:18:15.0498534Z waitLoop: 2026-02-21T09:18:15.0498663Z mbarrier.try_wait.parity.shared.b64 complete, [%r7557], %r6691; 2026-02-21T09:18:15.0498739Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0498792Z } 2026-02-21T09:18:15.0498796Z 2026-02-21T09:18:15.0498852Z // end inline asm 2026-02-21T09:18:15.0498908Z bar.sync 0; 2026-02-21T09:18:15.0498974Z // begin inline asm 2026-02-21T09:18:15.0499063Z @%p366 mbarrier.inval.shared::cta.b64 [%r7557]; 2026-02-21T09:18:15.0499120Z // end inline asm 2026-02-21T09:18:15.0499183Z $L__tmp408: 2026-02-21T09:18:15.0499352Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0499440Z add.s64 %rd871, %rd857, 64; 2026-02-21T09:18:15.0499610Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0499703Z add.s64 %rd874, %rd860, 64; 2026-02-21T09:18:15.0499761Z // begin inline asm 2026-02-21T09:18:15.0499821Z mov.u64 %rd870, 0x0; 2026-02-21T09:18:15.0499963Z createpolicy.fractional.L2::evict_first.b64 %rd870, 1.0; 2026-02-21T09:18:15.0500043Z // end inline asm 2026-02-21T09:18:15.0500103Z // begin inline asm 2026-02-21T09:18:15.0500170Z mov.u32 %r6693, 0x0; 2026-02-21T09:18:15.0500230Z mov.u32 %r6694, 0x0; 2026-02-21T09:18:15.0500287Z mov.u32 %r6695, 0x0; 2026-02-21T09:18:15.0500344Z mov.u32 %r6696, 0x0; 2026-02-21T09:18:15.0500540Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r6693, %r6694, %r6695, %r6696 }, [ %rd871 + 0 ], %rd870; 2026-02-21T09:18:15.0500597Z // end inline asm 2026-02-21T09:18:15.0500655Z // begin inline asm 2026-02-21T09:18:15.0500723Z mov.u64 %rd873, 0x0; 2026-02-21T09:18:15.0500836Z createpolicy.fractional.L2::evict_first.b64 %rd873, 1.0; 2026-02-21T09:18:15.0500893Z // end inline asm 2026-02-21T09:18:15.0500960Z // begin inline asm 2026-02-21T09:18:15.0501020Z mov.u32 %r6697, 0x0; 2026-02-21T09:18:15.0501077Z mov.u32 %r6698, 0x0; 
2026-02-21T09:18:15.0501134Z mov.u32 %r6699, 0x0; 2026-02-21T09:18:15.0501200Z mov.u32 %r6700, 0x0; 2026-02-21T09:18:15.0501384Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r6697, %r6698, %r6699, %r6700 }, [ %rd874 + 0 ], %rd873; 2026-02-21T09:18:15.0501441Z // end inline asm 2026-02-21T09:18:15.0501503Z $L__tmp409: 2026-02-21T09:18:15.0501751Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0501852Z st.shared.v4.b32 [%r67], {%r6693, %r6694, %r6695, %r6696}; 2026-02-21T09:18:15.0501965Z st.shared.v4.b32 [%r67+512], {%r6697, %r6698, %r6699, %r6700}; 2026-02-21T09:18:15.0502022Z bar.sync 0; 2026-02-21T09:18:15.0502119Z ld.shared.v4.b32 {%r6860, %r6861, %r6862, %r6863}, [%r68]; 2026-02-21T09:18:15.0502187Z mov.b32 {%rs2184, %rs2185}, %r6863; 2026-02-21T09:18:15.0502261Z mov.b32 {%rs2186, %rs2187}, %r6862; 2026-02-21T09:18:15.0502330Z mov.b32 {%rs2188, %rs2189}, %r6861; 2026-02-21T09:18:15.0502391Z mov.b32 {%rs2190, %rs2191}, %r6860; 2026-02-21T09:18:15.0502495Z ld.shared.v4.b32 {%r6864, %r6865, %r6866, %r6867}, [%r69]; 2026-02-21T09:18:15.0502559Z mov.b32 {%rs2192, %rs2193}, %r6867; 2026-02-21T09:18:15.0502623Z mov.b32 {%rs2194, %rs2195}, %r6866; 2026-02-21T09:18:15.0502686Z mov.b32 {%rs2196, %rs2197}, %r6865; 2026-02-21T09:18:15.0502753Z mov.b32 {%rs2198, %rs2199}, %r6864; 2026-02-21T09:18:15.0502808Z $L__tmp410: 2026-02-21T09:18:15.0502983Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0503065Z cvt.f32.bf16 %r6842, %rs2190; 2026-02-21T09:18:15.0503126Z cvt.f32.bf16 %r6843, %rs2191; 2026-02-21T09:18:15.0503186Z cvt.f32.bf16 %r6844, %rs2188; 2026-02-21T09:18:15.0503249Z cvt.f32.bf16 %r6845, %rs2189; 2026-02-21T09:18:15.0503306Z cvt.f32.bf16 %r6846, %rs2186; 2026-02-21T09:18:15.0503365Z cvt.f32.bf16 %r6847, %rs2187; 2026-02-21T09:18:15.0503424Z cvt.f32.bf16 %r6848, %rs2184; 2026-02-21T09:18:15.0503491Z cvt.f32.bf16 %r6849, %rs2185; 
2026-02-21T09:18:15.0503548Z cvt.f32.bf16 %r6850, %rs2198; 2026-02-21T09:18:15.0503605Z cvt.f32.bf16 %r6851, %rs2199; 2026-02-21T09:18:15.0503669Z cvt.f32.bf16 %r6852, %rs2196; 2026-02-21T09:18:15.0503726Z cvt.f32.bf16 %r6853, %rs2197; 2026-02-21T09:18:15.0503782Z cvt.f32.bf16 %r6854, %rs2194; 2026-02-21T09:18:15.0503839Z cvt.f32.bf16 %r6855, %rs2195; 2026-02-21T09:18:15.0503904Z cvt.f32.bf16 %r6856, %rs2192; 2026-02-21T09:18:15.0503961Z cvt.f32.bf16 %r6857, %rs2193; 2026-02-21T09:18:15.0504125Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0504198Z add.s64 %rd877, %rd1056, 131072; 2026-02-21T09:18:15.0504282Z // begin inline asm 2026-02-21T09:18:15.0504338Z mov.u64 %rd876, 0x0; 2026-02-21T09:18:15.0504451Z createpolicy.fractional.L2::evict_first.b64 %rd876, 1.0; 2026-02-21T09:18:15.0504532Z // end inline asm 2026-02-21T09:18:15.0504589Z // begin inline asm 2026-02-21T09:18:15.0504644Z mov.u32 %r6701, 0x0; 2026-02-21T09:18:15.0504727Z mov.u32 %r6702, 0x0; 2026-02-21T09:18:15.0504783Z mov.u32 %r6703, 0x0; 2026-02-21T09:18:15.0504867Z mov.u32 %r6704, 0x0; 2026-02-21T09:18:15.0505048Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r6701, %r6702, %r6703, %r6704 }, [ %rd877 + 0 ], %rd876; 2026-02-21T09:18:15.0505105Z // end inline asm 2026-02-21T09:18:15.0505170Z prmt.b32 %r6868, %r6701, 0, 0x8880U; 2026-02-21T09:18:15.0505238Z cvt.u16.u32 %rs2200, %r6868; 2026-02-21T09:18:15.0505303Z prmt.b32 %r6869, %r6701, 0, 0x7770U; 2026-02-21T09:18:15.0505364Z cvt.u16.u32 %rs2201, %r6869; 2026-02-21T09:18:15.0505426Z prmt.b32 %r6870, %r6701, 0, 0x9991U; 2026-02-21T09:18:15.0505497Z cvt.u16.u32 %rs2202, %r6870; 2026-02-21T09:18:15.0505558Z prmt.b32 %r6871, %r6701, 0, 0x7771U; 2026-02-21T09:18:15.0505619Z cvt.u16.u32 %rs2203, %r6871; 2026-02-21T09:18:15.0505690Z prmt.b32 %r6872, %r6701, 0, 0xaaa2U; 2026-02-21T09:18:15.0505749Z cvt.u16.u32 %rs2204, %r6872; 2026-02-21T09:18:15.0505812Z prmt.b32 %r6873, %r6701, 0, 0x7772U; 
2026-02-21T09:18:15.0505873Z cvt.u16.u32 %rs2205, %r6873; 2026-02-21T09:18:15.0505944Z prmt.b32 %r6874, %r6701, 0, 0xbbb3U; 2026-02-21T09:18:15.0506001Z cvt.u16.u32 %rs2206, %r6874; 2026-02-21T09:18:15.0506062Z prmt.b32 %r6875, %r6701, 0, 0x7773U; 2026-02-21T09:18:15.0506126Z cvt.u16.u32 %rs2207, %r6875; 2026-02-21T09:18:15.0506186Z prmt.b32 %r6876, %r6702, 0, 0x8880U; 2026-02-21T09:18:15.0506243Z cvt.u16.u32 %rs2208, %r6876; 2026-02-21T09:18:15.0506303Z prmt.b32 %r6877, %r6702, 0, 0x7770U; 2026-02-21T09:18:15.0506367Z cvt.u16.u32 %rs2209, %r6877; 2026-02-21T09:18:15.0506428Z prmt.b32 %r6878, %r6702, 0, 0x9991U; 2026-02-21T09:18:15.0506486Z cvt.u16.u32 %rs2210, %r6878; 2026-02-21T09:18:15.0506553Z prmt.b32 %r6879, %r6702, 0, 0x7771U; 2026-02-21T09:18:15.0506611Z cvt.u16.u32 %rs2211, %r6879; 2026-02-21T09:18:15.0506674Z prmt.b32 %r6880, %r6702, 0, 0xaaa2U; 2026-02-21T09:18:15.0506738Z cvt.u16.u32 %rs2212, %r6880; 2026-02-21T09:18:15.0506798Z prmt.b32 %r6881, %r6702, 0, 0x7772U; 2026-02-21T09:18:15.0506856Z cvt.u16.u32 %rs2213, %r6881; 2026-02-21T09:18:15.0506917Z prmt.b32 %r6882, %r6702, 0, 0xbbb3U; 2026-02-21T09:18:15.0506982Z cvt.u16.u32 %rs2214, %r6882; 2026-02-21T09:18:15.0507041Z prmt.b32 %r6883, %r6702, 0, 0x7773U; 2026-02-21T09:18:15.0507098Z cvt.u16.u32 %rs2215, %r6883; 2026-02-21T09:18:15.0507164Z prmt.b32 %r6884, %r6703, 0, 0x8880U; 2026-02-21T09:18:15.0507223Z cvt.u16.u32 %rs2216, %r6884; 2026-02-21T09:18:15.0507282Z prmt.b32 %r6885, %r6703, 0, 0x7770U; 2026-02-21T09:18:15.0507340Z cvt.u16.u32 %rs2217, %r6885; 2026-02-21T09:18:15.0507408Z prmt.b32 %r6886, %r6703, 0, 0x9991U; 2026-02-21T09:18:15.0507467Z cvt.u16.u32 %rs2218, %r6886; 2026-02-21T09:18:15.0507527Z prmt.b32 %r6887, %r6703, 0, 0x7771U; 2026-02-21T09:18:15.0507594Z cvt.u16.u32 %rs2219, %r6887; 2026-02-21T09:18:15.0507656Z prmt.b32 %r6888, %r6703, 0, 0xaaa2U; 2026-02-21T09:18:15.0507713Z cvt.u16.u32 %rs2220, %r6888; 2026-02-21T09:18:15.0507779Z prmt.b32 %r6889, %r6703, 0, 0x7772U; 
2026-02-21T09:18:15.0507839Z cvt.u16.u32 %rs2221, %r6889; 2026-02-21T09:18:15.0507900Z prmt.b32 %r6890, %r6703, 0, 0xbbb3U; 2026-02-21T09:18:15.0507959Z cvt.u16.u32 %rs2222, %r6890; 2026-02-21T09:18:15.0508026Z prmt.b32 %r6891, %r6703, 0, 0x7773U; 2026-02-21T09:18:15.0508084Z cvt.u16.u32 %rs2223, %r6891; 2026-02-21T09:18:15.0508144Z prmt.b32 %r6892, %r6704, 0, 0x8880U; 2026-02-21T09:18:15.0508208Z cvt.u16.u32 %rs2224, %r6892; 2026-02-21T09:18:15.0508267Z prmt.b32 %r6893, %r6704, 0, 0x7770U; 2026-02-21T09:18:15.0508325Z cvt.u16.u32 %rs2225, %r6893; 2026-02-21T09:18:15.0508385Z prmt.b32 %r6894, %r6704, 0, 0x9991U; 2026-02-21T09:18:15.0508471Z cvt.u16.u32 %rs2226, %r6894; 2026-02-21T09:18:15.0508532Z prmt.b32 %r6895, %r6704, 0, 0x7771U; 2026-02-21T09:18:15.0508589Z cvt.u16.u32 %rs2227, %r6895; 2026-02-21T09:18:15.0508675Z prmt.b32 %r6896, %r6704, 0, 0xaaa2U; 2026-02-21T09:18:15.0508732Z cvt.u16.u32 %rs2228, %r6896; 2026-02-21T09:18:15.0508792Z prmt.b32 %r6897, %r6704, 0, 0x7772U; 2026-02-21T09:18:15.0508868Z cvt.u16.u32 %rs2229, %r6897; 2026-02-21T09:18:15.0508953Z prmt.b32 %r6898, %r6704, 0, 0xbbb3U; 2026-02-21T09:18:15.0509012Z cvt.u16.u32 %rs2230, %r6898; 2026-02-21T09:18:15.0509072Z prmt.b32 %r6899, %r6704, 0, 0x7773U; 2026-02-21T09:18:15.0509135Z cvt.u16.u32 %rs2231, %r6899; 2026-02-21T09:18:15.0509298Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0509358Z shl.b16 %rs2232, %rs2201, 12; 2026-02-21T09:18:15.0509422Z shr.s16 %rs2233, %rs2232, 12; 2026-02-21T09:18:15.0509480Z shl.b16 %rs2234, %rs2203, 12; 2026-02-21T09:18:15.0509538Z shr.s16 %rs2235, %rs2234, 12; 2026-02-21T09:18:15.0509593Z shl.b16 %rs2236, %rs2205, 12; 2026-02-21T09:18:15.0509656Z shr.s16 %rs2237, %rs2236, 12; 2026-02-21T09:18:15.0509712Z shl.b16 %rs2238, %rs2207, 12; 2026-02-21T09:18:15.0509769Z shr.s16 %rs2239, %rs2238, 12; 2026-02-21T09:18:15.0509831Z shl.b16 %rs2240, %rs2209, 12; 2026-02-21T09:18:15.0509889Z shr.s16 %rs2241, %rs2240, 12; 
2026-02-21T09:18:15.0509945Z shl.b16 %rs2242, %rs2211, 12; 2026-02-21T09:18:15.0510001Z shr.s16 %rs2243, %rs2242, 12; 2026-02-21T09:18:15.0510065Z shl.b16 %rs2244, %rs2213, 12; 2026-02-21T09:18:15.0510120Z shr.s16 %rs2245, %rs2244, 12; 2026-02-21T09:18:15.0510175Z shl.b16 %rs2246, %rs2215, 12; 2026-02-21T09:18:15.0510239Z shr.s16 %rs2247, %rs2246, 12; 2026-02-21T09:18:15.0510295Z shl.b16 %rs2248, %rs2217, 12; 2026-02-21T09:18:15.0510350Z shr.s16 %rs2249, %rs2248, 12; 2026-02-21T09:18:15.0510414Z shl.b16 %rs2250, %rs2219, 12; 2026-02-21T09:18:15.0510470Z shr.s16 %rs2251, %rs2250, 12; 2026-02-21T09:18:15.0510527Z shl.b16 %rs2252, %rs2221, 12; 2026-02-21T09:18:15.0510583Z shr.s16 %rs2253, %rs2252, 12; 2026-02-21T09:18:15.0510646Z shl.b16 %rs2254, %rs2223, 12; 2026-02-21T09:18:15.0510702Z shr.s16 %rs2255, %rs2254, 12; 2026-02-21T09:18:15.0510758Z shl.b16 %rs2256, %rs2225, 12; 2026-02-21T09:18:15.0510821Z shr.s16 %rs2257, %rs2256, 12; 2026-02-21T09:18:15.0510878Z shl.b16 %rs2258, %rs2227, 12; 2026-02-21T09:18:15.0510935Z shr.s16 %rs2259, %rs2258, 12; 2026-02-21T09:18:15.0510991Z shl.b16 %rs2260, %rs2229, 12; 2026-02-21T09:18:15.0511055Z shr.s16 %rs2261, %rs2260, 12; 2026-02-21T09:18:15.0511112Z shl.b16 %rs2262, %rs2231, 12; 2026-02-21T09:18:15.0511168Z shr.s16 %rs2263, %rs2262, 12; 2026-02-21T09:18:15.0511334Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0511392Z shr.u16 %rs2264, %rs2200, 4; 2026-02-21T09:18:15.0511448Z shr.u16 %rs2265, %rs2202, 4; 2026-02-21T09:18:15.0511506Z shr.u16 %rs2266, %rs2204, 4; 2026-02-21T09:18:15.0511601Z shr.u16 %rs2267, %rs2206, 4; 2026-02-21T09:18:15.0511660Z shr.u16 %rs2268, %rs2208, 4; 2026-02-21T09:18:15.0511716Z shr.u16 %rs2269, %rs2210, 4; 2026-02-21T09:18:15.0511784Z shr.u16 %rs2270, %rs2212, 4; 2026-02-21T09:18:15.0511843Z shr.u16 %rs2271, %rs2214, 4; 2026-02-21T09:18:15.0511900Z shr.u16 %rs2272, %rs2216, 4; 2026-02-21T09:18:15.0511968Z shr.u16 %rs2273, %rs2218, 4; 
2026-02-21T09:18:15.0512029Z shr.u16 %rs2274, %rs2220, 4; 2026-02-21T09:18:15.0512088Z shr.u16 %rs2275, %rs2222, 4; 2026-02-21T09:18:15.0512147Z shr.u16 %rs2276, %rs2224, 4; 2026-02-21T09:18:15.0512213Z shr.u16 %rs2277, %rs2226, 4; 2026-02-21T09:18:15.0512269Z shr.u16 %rs2278, %rs2228, 4; 2026-02-21T09:18:15.0512326Z shr.u16 %rs2279, %rs2230, 4; 2026-02-21T09:18:15.0512490Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0512543Z bar.sync 0; 2026-02-21T09:18:15.0512605Z st.shared.b8 [%r70], %rs2233; 2026-02-21T09:18:15.0512709Z st.shared.b8 [%r71], %rs2235; 2026-02-21T09:18:15.0512782Z st.shared.b8 [%r72], %rs2237; 2026-02-21T09:18:15.0512840Z st.shared.b8 [%r73], %rs2239; 2026-02-21T09:18:15.0512930Z st.shared.b8 [%r74+512], %rs2241; 2026-02-21T09:18:15.0512999Z st.shared.b8 [%r75+512], %rs2243; 2026-02-21T09:18:15.0513083Z st.shared.b8 [%r76+512], %rs2245; 2026-02-21T09:18:15.0513146Z st.shared.b8 [%r77+512], %rs2247; 2026-02-21T09:18:15.0513229Z st.shared.b8 [%r78+1024], %rs2249; 2026-02-21T09:18:15.0513301Z st.shared.b8 [%r79+1024], %rs2251; 2026-02-21T09:18:15.0513360Z st.shared.b8 [%r80+1024], %rs2253; 2026-02-21T09:18:15.0513420Z st.shared.b8 [%r81+1024], %rs2255; 2026-02-21T09:18:15.0513487Z st.shared.b8 [%r82+1536], %rs2257; 2026-02-21T09:18:15.0513546Z st.shared.b8 [%r83+1536], %rs2259; 2026-02-21T09:18:15.0513605Z st.shared.b8 [%r84+1536], %rs2261; 2026-02-21T09:18:15.0513672Z st.shared.b8 [%r85+1536], %rs2263; 2026-02-21T09:18:15.0513727Z bar.sync 0; 2026-02-21T09:18:15.0513790Z ld.shared.b32 %r6900, [%r86]; 2026-02-21T09:18:15.0513850Z ld.shared.b32 %r6901, [%r87]; 2026-02-21T09:18:15.0513915Z ld.shared.b32 %r6902, [%r88]; 2026-02-21T09:18:15.0513975Z ld.shared.b32 %r6903, [%r89]; 2026-02-21T09:18:15.0514133Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0514194Z bar.sync 0; 2026-02-21T09:18:15.0514253Z st.shared.b8 [%r70], %rs2264; 
2026-02-21T09:18:15.0514313Z st.shared.b8 [%r71], %rs2265; 2026-02-21T09:18:15.0514371Z st.shared.b8 [%r72], %rs2266; 2026-02-21T09:18:15.0514436Z st.shared.b8 [%r73], %rs2267; 2026-02-21T09:18:15.0514495Z st.shared.b8 [%r74+512], %rs2268; 2026-02-21T09:18:15.0514555Z st.shared.b8 [%r75+512], %rs2269; 2026-02-21T09:18:15.0514622Z st.shared.b8 [%r76+512], %rs2270; 2026-02-21T09:18:15.0514680Z st.shared.b8 [%r77+512], %rs2271; 2026-02-21T09:18:15.0514740Z st.shared.b8 [%r78+1024], %rs2272; 2026-02-21T09:18:15.0514806Z st.shared.b8 [%r79+1024], %rs2273; 2026-02-21T09:18:15.0514866Z st.shared.b8 [%r80+1024], %rs2274; 2026-02-21T09:18:15.0514925Z st.shared.b8 [%r81+1024], %rs2275; 2026-02-21T09:18:15.0514984Z st.shared.b8 [%r82+1536], %rs2276; 2026-02-21T09:18:15.0515051Z st.shared.b8 [%r83+1536], %rs2277; 2026-02-21T09:18:15.0515110Z st.shared.b8 [%r84+1536], %rs2278; 2026-02-21T09:18:15.0515170Z st.shared.b8 [%r85+1536], %rs2279; 2026-02-21T09:18:15.0515229Z bar.sync 0; 2026-02-21T09:18:15.0515288Z ld.shared.b32 %r6904, [%r86]; 2026-02-21T09:18:15.0515346Z ld.shared.b32 %r6905, [%r87]; 2026-02-21T09:18:15.0515405Z ld.shared.b32 %r6906, [%r88]; 2026-02-21T09:18:15.0515469Z ld.shared.b32 %r6907, [%r89]; 2026-02-21T09:18:15.0515625Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0515687Z cvt.s8.s32 %rs2280, %r6901; 2026-02-21T09:18:15.0515756Z cvt.rn.f32.s16 %r6908, %rs2280; 2026-02-21T09:18:15.0515814Z cvt.s8.s32 %rs2281, %r6900; 2026-02-21T09:18:15.0515876Z cvt.rn.f32.s16 %r6909, %rs2281; 2026-02-21T09:18:15.0515941Z cvt.s8.s32 %rs2282, %r6905; 2026-02-21T09:18:15.0516001Z cvt.rn.f32.s16 %r6910, %rs2282; 2026-02-21T09:18:15.0516060Z cvt.s8.s32 %rs2283, %r6904; 2026-02-21T09:18:15.0516118Z cvt.rn.f32.s16 %r6911, %rs2283; 2026-02-21T09:18:15.0516184Z cvt.s8.s32 %rs2284, %r6903; 2026-02-21T09:18:15.0516244Z cvt.rn.f32.s16 %r6912, %rs2284; 2026-02-21T09:18:15.0516302Z cvt.s8.s32 %rs2285, %r6902; 
2026-02-21T09:18:15.0516367Z cvt.rn.f32.s16 %r6913, %rs2285; 2026-02-21T09:18:15.0516424Z cvt.s8.s32 %rs2286, %r6907; 2026-02-21T09:18:15.0516482Z cvt.rn.f32.s16 %r6914, %rs2286; 2026-02-21T09:18:15.0516540Z cvt.s8.s32 %rs2287, %r6906; 2026-02-21T09:18:15.0516604Z cvt.rn.f32.s16 %r6915, %rs2287; 2026-02-21T09:18:15.0516666Z prmt.b32 %r6916, %r6901, 0, 0x9991U; 2026-02-21T09:18:15.0516725Z cvt.u16.u32 %rs2288, %r6916; 2026-02-21T09:18:15.0516789Z cvt.rn.f32.s16 %r6917, %rs2288; 2026-02-21T09:18:15.0516851Z prmt.b32 %r6918, %r6900, 0, 0x9991U; 2026-02-21T09:18:15.0516931Z cvt.u16.u32 %rs2289, %r6918; 2026-02-21T09:18:15.0516990Z cvt.rn.f32.s16 %r6919, %rs2289; 2026-02-21T09:18:15.0517059Z prmt.b32 %r6920, %r6905, 0, 0x9991U; 2026-02-21T09:18:15.0517139Z cvt.u16.u32 %rs2290, %r6920; 2026-02-21T09:18:15.0517198Z cvt.rn.f32.s16 %r6921, %rs2290; 2026-02-21T09:18:15.0517287Z prmt.b32 %r6922, %r6904, 0, 0x9991U; 2026-02-21T09:18:15.0517346Z cvt.u16.u32 %rs2291, %r6922; 2026-02-21T09:18:15.0517422Z cvt.rn.f32.s16 %r6923, %rs2291; 2026-02-21T09:18:15.0517491Z prmt.b32 %r6924, %r6903, 0, 0x9991U; 2026-02-21T09:18:15.0517550Z cvt.u16.u32 %rs2292, %r6924; 2026-02-21T09:18:15.0517609Z cvt.rn.f32.s16 %r6925, %rs2292; 2026-02-21T09:18:15.0517670Z prmt.b32 %r6926, %r6902, 0, 0x9991U; 2026-02-21T09:18:15.0517735Z cvt.u16.u32 %rs2293, %r6926; 2026-02-21T09:18:15.0517794Z cvt.rn.f32.s16 %r6927, %rs2293; 2026-02-21T09:18:15.0517854Z prmt.b32 %r6928, %r6907, 0, 0x9991U; 2026-02-21T09:18:15.0517921Z cvt.u16.u32 %rs2294, %r6928; 2026-02-21T09:18:15.0517981Z cvt.rn.f32.s16 %r6929, %rs2294; 2026-02-21T09:18:15.0518041Z prmt.b32 %r6930, %r6906, 0, 0x9991U; 2026-02-21T09:18:15.0518099Z cvt.u16.u32 %rs2295, %r6930; 2026-02-21T09:18:15.0518168Z cvt.rn.f32.s16 %r6931, %rs2295; 2026-02-21T09:18:15.0518229Z prmt.b32 %r6932, %r6901, 0, 0xaaa2U; 2026-02-21T09:18:15.0518288Z cvt.u16.u32 %rs2296, %r6932; 2026-02-21T09:18:15.0518356Z cvt.rn.f32.s16 %r6933, %rs2296; 2026-02-21T09:18:15.0518419Z prmt.b32 
%r6934, %r6900, 0, 0xaaa2U; 2026-02-21T09:18:15.0518478Z cvt.u16.u32 %rs2297, %r6934; 2026-02-21T09:18:15.0518537Z cvt.rn.f32.s16 %r6935, %rs2297; 2026-02-21T09:18:15.0518607Z prmt.b32 %r6936, %r6905, 0, 0xaaa2U; 2026-02-21T09:18:15.0518666Z cvt.u16.u32 %rs2298, %r6936; 2026-02-21T09:18:15.0518726Z cvt.rn.f32.s16 %r6937, %rs2298; 2026-02-21T09:18:15.0518794Z prmt.b32 %r6938, %r6904, 0, 0xaaa2U; 2026-02-21T09:18:15.0518851Z cvt.u16.u32 %rs2299, %r6938; 2026-02-21T09:18:15.0518909Z cvt.rn.f32.s16 %r6939, %rs2299; 2026-02-21T09:18:15.0518980Z prmt.b32 %r6940, %r6903, 0, 0xaaa2U; 2026-02-21T09:18:15.0519037Z cvt.u16.u32 %rs2300, %r6940; 2026-02-21T09:18:15.0519096Z cvt.rn.f32.s16 %r6941, %rs2300; 2026-02-21T09:18:15.0519158Z prmt.b32 %r6942, %r6902, 0, 0xaaa2U; 2026-02-21T09:18:15.0519223Z cvt.u16.u32 %rs2301, %r6942; 2026-02-21T09:18:15.0519281Z cvt.rn.f32.s16 %r6943, %rs2301; 2026-02-21T09:18:15.0519341Z prmt.b32 %r6944, %r6907, 0, 0xaaa2U; 2026-02-21T09:18:15.0519404Z cvt.u16.u32 %rs2302, %r6944; 2026-02-21T09:18:15.0519463Z cvt.rn.f32.s16 %r6945, %rs2302; 2026-02-21T09:18:15.0519523Z prmt.b32 %r6946, %r6906, 0, 0xaaa2U; 2026-02-21T09:18:15.0519580Z cvt.u16.u32 %rs2303, %r6946; 2026-02-21T09:18:15.0519646Z cvt.rn.f32.s16 %r6947, %rs2303; 2026-02-21T09:18:15.0519707Z prmt.b32 %r6948, %r6901, 0, 0xbbb3U; 2026-02-21T09:18:15.0519763Z cvt.u16.u32 %rs2304, %r6948; 2026-02-21T09:18:15.0519830Z cvt.rn.f32.s16 %r6949, %rs2304; 2026-02-21T09:18:15.0519891Z prmt.b32 %r6950, %r6900, 0, 0xbbb3U; 2026-02-21T09:18:15.0519949Z cvt.u16.u32 %rs2305, %r6950; 2026-02-21T09:18:15.0520014Z cvt.rn.f32.s16 %r6951, %rs2305; 2026-02-21T09:18:15.0520075Z prmt.b32 %r6952, %r6905, 0, 0xbbb3U; 2026-02-21T09:18:15.0520135Z cvt.u16.u32 %rs2306, %r6952; 2026-02-21T09:18:15.0520192Z cvt.rn.f32.s16 %r6953, %rs2306; 2026-02-21T09:18:15.0520262Z prmt.b32 %r6954, %r6904, 0, 0xbbb3U; 2026-02-21T09:18:15.0520321Z cvt.u16.u32 %rs2307, %r6954; 2026-02-21T09:18:15.0520380Z cvt.rn.f32.s16 %r6955, %rs2307; 
2026-02-21T09:18:15.0520449Z prmt.b32 %r6956, %r6903, 0, 0xbbb3U; 2026-02-21T09:18:15.0520506Z cvt.u16.u32 %rs2308, %r6956; 2026-02-21T09:18:15.0520564Z cvt.rn.f32.s16 %r6957, %rs2308; 2026-02-21T09:18:15.0520624Z prmt.b32 %r6958, %r6902, 0, 0xbbb3U; 2026-02-21T09:18:15.0520687Z cvt.u16.u32 %rs2309, %r6958; 2026-02-21T09:18:15.0520746Z cvt.rn.f32.s16 %r6959, %rs2309; 2026-02-21T09:18:15.0520807Z prmt.b32 %r6960, %r6907, 0, 0xbbb3U; 2026-02-21T09:18:15.0520874Z cvt.u16.u32 %rs2310, %r6960; 2026-02-21T09:18:15.0520955Z cvt.rn.f32.s16 %r6961, %rs2310; 2026-02-21T09:18:15.0521018Z prmt.b32 %r6962, %r6906, 0, 0xbbb3U; 2026-02-21T09:18:15.0521075Z cvt.u16.u32 %rs2311, %r6962; 2026-02-21T09:18:15.0521162Z cvt.rn.f32.s16 %r6963, %rs2311; 2026-02-21T09:18:15.0521214Z bar.sync 0; 2026-02-21T09:18:15.0521332Z st.shared.v4.b32 [%r90], {%r6909, %r6911, %r6908, %r6910}; 2026-02-21T09:18:15.0521479Z st.shared.v4.b32 [%r91], {%r6913, %r6915, %r6912, %r6914}; 2026-02-21T09:18:15.0521702Z st.shared.v4.b32 [%r92], {%r6919, %r6923, %r6917, %r6921}; 2026-02-21T09:18:15.0521794Z st.shared.v4.b32 [%r93], {%r6927, %r6931, %r6925, %r6929}; 2026-02-21T09:18:15.0521891Z st.shared.v4.b32 [%r94], {%r6935, %r6939, %r6933, %r6937}; 2026-02-21T09:18:15.0521981Z st.shared.v4.b32 [%r95], {%r6943, %r6947, %r6941, %r6945}; 2026-02-21T09:18:15.0522069Z st.shared.v4.b32 [%r96], {%r6951, %r6955, %r6949, %r6953}; 2026-02-21T09:18:15.0522156Z st.shared.v4.b32 [%r97], {%r6959, %r6963, %r6957, %r6961}; 2026-02-21T09:18:15.0522221Z $L__tmp411: 2026-02-21T09:18:15.0522430Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0522491Z // begin inline asm 2026-02-21T09:18:15.0522804Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6705, %r6706, %r6707, %r6708, %r6709, %r6710, %r6711, %r6712, %r6713, %r6714, %r6715, %r6716, %r6717, %r6718, %r6719, %r6720}, [%r6721 + 0], 64; 2026-02-21T09:18:15.0522862Z // end inline asm 
2026-02-21T09:18:15.0522921Z // begin inline asm 2026-02-21T09:18:15.0523228Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6722, %r6723, %r6724, %r6725, %r6726, %r6727, %r6728, %r6729, %r6730, %r6731, %r6732, %r6733, %r6734, %r6735, %r6736, %r6737}, [%r6721 + 16], 64; 2026-02-21T09:18:15.0523284Z // end inline asm 2026-02-21T09:18:15.0523341Z // begin inline asm 2026-02-21T09:18:15.0523648Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6739, %r6740, %r6741, %r6742, %r6743, %r6744, %r6745, %r6746, %r6747, %r6748, %r6749, %r6750, %r6751, %r6752, %r6753, %r6754}, [%r6721 + 32], 64; 2026-02-21T09:18:15.0523705Z // end inline asm 2026-02-21T09:18:15.0523763Z // begin inline asm 2026-02-21T09:18:15.0524067Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6756, %r6757, %r6758, %r6759, %r6760, %r6761, %r6762, %r6763, %r6764, %r6765, %r6766, %r6767, %r6768, %r6769, %r6770, %r6771}, [%r6721 + 48], 64; 2026-02-21T09:18:15.0524123Z // end inline asm 2026-02-21T09:18:15.0524182Z // begin inline asm 2026-02-21T09:18:15.0524251Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0524314Z // end inline asm 2026-02-21T09:18:15.0524376Z mov.pred %p340, -1; 2026-02-21T09:18:15.0524433Z // begin inline asm 2026-02-21T09:18:15.0524747Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7010 + 0], 64, {%r6705, %r6706, %r6707, %r6708, %r6709, %r6710, %r6711, %r6712, %r6713, %r6714, %r6715, %r6716, %r6717, %r6718, %r6719, %r6720}; 2026-02-21T09:18:15.0524802Z // end inline asm 2026-02-21T09:18:15.0524859Z // begin inline asm 2026-02-21T09:18:15.0525172Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7010 + 16], 64, {%r6722, %r6723, %r6724, %r6725, %r6726, %r6727, %r6728, %r6729, %r6730, %r6731, %r6732, %r6733, %r6734, %r6735, %r6736, %r6737}; 2026-02-21T09:18:15.0525229Z // end inline asm 2026-02-21T09:18:15.0525284Z // begin inline asm 2026-02-21T09:18:15.0525602Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7010 + 32], 64, {%r6739, %r6740, %r6741, %r6742, %r6743, %r6744, %r6745, 
%r6746, %r6747, %r6748, %r6749, %r6750, %r6751, %r6752, %r6753, %r6754}; 2026-02-21T09:18:15.0525656Z // end inline asm 2026-02-21T09:18:15.0525712Z // begin inline asm 2026-02-21T09:18:15.0526013Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7010 + 48], 64, {%r6756, %r6757, %r6758, %r6759, %r6760, %r6761, %r6762, %r6763, %r6764, %r6765, %r6766, %r6767, %r6768, %r6769, %r6770, %r6771}; 2026-02-21T09:18:15.0526075Z // end inline asm 2026-02-21T09:18:15.0526130Z // begin inline asm 2026-02-21T09:18:15.0526198Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0526296Z // end inline asm 2026-02-21T09:18:15.0526351Z bar.sync 0; 2026-02-21T09:18:15.0526408Z // begin inline asm 2026-02-21T09:18:15.0526745Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7419 + 0], 16, {%r6842, %r6843, %r6844, %r6845, %r6846, %r6847, %r6848, %r6849, %r6850, %r6851, %r6852, %r6853, %r6854, %r6855, %r6856, %r6857}; 2026-02-21T09:18:15.0526830Z // end inline asm 2026-02-21T09:18:15.0526925Z // begin inline asm 2026-02-21T09:18:15.0526991Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0527055Z // end inline asm 2026-02-21T09:18:15.0527111Z bar.sync 0; 2026-02-21T09:18:15.0527168Z // begin inline asm 2026-02-21T09:18:15.0527248Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0527303Z // end inline asm 2026-02-21T09:18:15.0527362Z // begin inline asm 2026-02-21T09:18:15.0527464Z @%p366 mbarrier.init.shared::cta.b64 [%r7557], 1; 2026-02-21T09:18:15.0527522Z // end inline asm 2026-02-21T09:18:15.0527577Z bar.sync 0; 2026-02-21T09:18:15.0527640Z @%p322 bra $L__BB0_50; 2026-02-21T09:18:15.0527753Z // %bb.49: // in Loop: Header=BB0_46 Depth=2 2026-02-21T09:18:15.0527817Z elect.sync %r6976|%p341, -1; 2026-02-21T09:18:15.0527879Z mov.b32 %r6966, 69208336; 2026-02-21T09:18:15.0527943Z // begin inline asm 2026-02-21T09:18:15.0528100Z @%p341 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 0 ], %rd6, %r6966, %p340; 2026-02-21T09:18:15.0528155Z // end inline asm 
2026-02-21T09:18:15.0528212Z // begin inline asm 2026-02-21T09:18:15.0528371Z @%p341 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 8 ], %rd7, %r6966, %p340; 2026-02-21T09:18:15.0528426Z // end inline asm 2026-02-21T09:18:15.0528481Z // begin inline asm 2026-02-21T09:18:15.0528640Z @%p341 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 16 ], %rd8, %r6966, %p340; 2026-02-21T09:18:15.0528696Z // end inline asm 2026-02-21T09:18:15.0528753Z // begin inline asm 2026-02-21T09:18:15.0528909Z @%p341 tcgen05.mma.cta_group::1.kind::tf32 [ %r6964 + 0 ], [ %r7543 + 24 ], %rd9, %r6966, %p340; 2026-02-21T09:18:15.0528966Z // end inline asm 2026-02-21T09:18:15.0529027Z cvt.u64.u32 %rd883, %r7557; 2026-02-21T09:18:15.0529091Z // begin inline asm 2026-02-21T09:18:15.0529217Z @%p341 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd883]; 2026-02-21T09:18:15.0529274Z // end inline asm 2026-02-21T09:18:15.0529374Z $L__BB0_50: // in Loop: Header=BB0_46 Depth=2 2026-02-21T09:18:15.0529437Z // begin inline asm 2026-02-21T09:18:15.0529487Z 2026-02-21T09:18:15.0529536Z { 2026-02-21T09:18:15.0529603Z .reg .pred complete; 2026-02-21T09:18:15.0529658Z waitLoop: 2026-02-21T09:18:15.0529778Z mbarrier.try_wait.parity.shared.b64 complete, [%r7557], %r6691; 2026-02-21T09:18:15.0529843Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0529901Z } 2026-02-21T09:18:15.0529905Z 2026-02-21T09:18:15.0529960Z // end inline asm 2026-02-21T09:18:15.0530015Z bar.sync 0; 2026-02-21T09:18:15.0530080Z // begin inline asm 2026-02-21T09:18:15.0530164Z @%p366 mbarrier.inval.shared::cta.b64 [%r7557]; 2026-02-21T09:18:15.0530220Z // end inline asm 2026-02-21T09:18:15.0530280Z $L__tmp412: 2026-02-21T09:18:15.0530448Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0530512Z add.s64 %rd885, %rd857, 128; 2026-02-21T09:18:15.0530677Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 
2026-02-21T09:18:15.0530746Z add.s64 %rd888, %rd860, 128; 2026-02-21T09:18:15.0530802Z // begin inline asm 2026-02-21T09:18:15.0530859Z mov.u64 %rd884, 0x0; 2026-02-21T09:18:15.0530975Z createpolicy.fractional.L2::evict_first.b64 %rd884, 1.0; 2026-02-21T09:18:15.0531029Z // end inline asm 2026-02-21T09:18:15.0531084Z // begin inline asm 2026-02-21T09:18:15.0531140Z mov.u32 %r6982, 0x0; 2026-02-21T09:18:15.0531202Z mov.u32 %r6983, 0x0; 2026-02-21T09:18:15.0531256Z mov.u32 %r6984, 0x0; 2026-02-21T09:18:15.0531334Z mov.u32 %r6985, 0x0; 2026-02-21T09:18:15.0531518Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r6982, %r6983, %r6984, %r6985 }, [ %rd885 + 0 ], %rd884; 2026-02-21T09:18:15.0531632Z // end inline asm 2026-02-21T09:18:15.0531688Z // begin inline asm 2026-02-21T09:18:15.0531750Z mov.u64 %rd887, 0x0; 2026-02-21T09:18:15.0531882Z createpolicy.fractional.L2::evict_first.b64 %rd887, 1.0; 2026-02-21T09:18:15.0531938Z // end inline asm 2026-02-21T09:18:15.0532016Z // begin inline asm 2026-02-21T09:18:15.0532080Z mov.u32 %r6986, 0x0; 2026-02-21T09:18:15.0532134Z mov.u32 %r6987, 0x0; 2026-02-21T09:18:15.0532190Z mov.u32 %r6988, 0x0; 2026-02-21T09:18:15.0532250Z mov.u32 %r6989, 0x0; 2026-02-21T09:18:15.0532421Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r6986, %r6987, %r6988, %r6989 }, [ %rd888 + 0 ], %rd887; 2026-02-21T09:18:15.0532475Z // end inline asm 2026-02-21T09:18:15.0532533Z $L__tmp413: 2026-02-21T09:18:15.0532745Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0532841Z st.shared.v4.b32 [%r67], {%r6982, %r6983, %r6984, %r6985}; 2026-02-21T09:18:15.0532942Z st.shared.v4.b32 [%r67+512], {%r6986, %r6987, %r6988, %r6989}; 2026-02-21T09:18:15.0533004Z bar.sync 0; 2026-02-21T09:18:15.0533098Z ld.shared.v4.b32 {%r7149, %r7150, %r7151, %r7152}, [%r68]; 2026-02-21T09:18:15.0533164Z mov.b32 {%rs2312, %rs2313}, %r7152; 2026-02-21T09:18:15.0533237Z mov.b32 {%rs2314, %rs2315}, %r7151; 
2026-02-21T09:18:15.0533300Z mov.b32 {%rs2316, %rs2317}, %r7150; 2026-02-21T09:18:15.0533361Z mov.b32 {%rs2318, %rs2319}, %r7149; 2026-02-21T09:18:15.0533462Z ld.shared.v4.b32 {%r7153, %r7154, %r7155, %r7156}, [%r69]; 2026-02-21T09:18:15.0533523Z mov.b32 {%rs2320, %rs2321}, %r7156; 2026-02-21T09:18:15.0533583Z mov.b32 {%rs2322, %rs2323}, %r7155; 2026-02-21T09:18:15.0533642Z mov.b32 {%rs2324, %rs2325}, %r7154; 2026-02-21T09:18:15.0533710Z mov.b32 {%rs2326, %rs2327}, %r7153; 2026-02-21T09:18:15.0533764Z $L__tmp414: 2026-02-21T09:18:15.0533927Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0533998Z cvt.f32.bf16 %r7131, %rs2318; 2026-02-21T09:18:15.0534061Z cvt.f32.bf16 %r7132, %rs2319; 2026-02-21T09:18:15.0534120Z cvt.f32.bf16 %r7133, %rs2316; 2026-02-21T09:18:15.0534178Z cvt.f32.bf16 %r7134, %rs2317; 2026-02-21T09:18:15.0534245Z cvt.f32.bf16 %r7135, %rs2314; 2026-02-21T09:18:15.0534304Z cvt.f32.bf16 %r7136, %rs2315; 2026-02-21T09:18:15.0534361Z cvt.f32.bf16 %r7137, %rs2312; 2026-02-21T09:18:15.0534426Z cvt.f32.bf16 %r7138, %rs2313; 2026-02-21T09:18:15.0534483Z cvt.f32.bf16 %r7139, %rs2326; 2026-02-21T09:18:15.0534540Z cvt.f32.bf16 %r7140, %rs2327; 2026-02-21T09:18:15.0534605Z cvt.f32.bf16 %r7141, %rs2324; 2026-02-21T09:18:15.0534662Z cvt.f32.bf16 %r7142, %rs2325; 2026-02-21T09:18:15.0534719Z cvt.f32.bf16 %r7143, %rs2322; 2026-02-21T09:18:15.0534776Z cvt.f32.bf16 %r7144, %rs2323; 2026-02-21T09:18:15.0534841Z cvt.f32.bf16 %r7145, %rs2320; 2026-02-21T09:18:15.0534898Z cvt.f32.bf16 %r7146, %rs2321; 2026-02-21T09:18:15.0535060Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0535137Z add.s64 %rd891, %rd1056, 262144; 2026-02-21T09:18:15.0535195Z // begin inline asm 2026-02-21T09:18:15.0535255Z mov.u64 %rd890, 0x0; 2026-02-21T09:18:15.0535366Z createpolicy.fractional.L2::evict_first.b64 %rd890, 1.0; 2026-02-21T09:18:15.0535432Z // end inline asm 2026-02-21T09:18:15.0535488Z 
// begin inline asm 2026-02-21T09:18:15.0535544Z mov.u32 %r6990, 0x0; 2026-02-21T09:18:15.0535606Z mov.u32 %r6991, 0x0; 2026-02-21T09:18:15.0535660Z mov.u32 %r6992, 0x0; 2026-02-21T09:18:15.0535715Z mov.u32 %r6993, 0x0; 2026-02-21T09:18:15.0535892Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r6990, %r6991, %r6992, %r6993 }, [ %rd891 + 0 ], %rd890; 2026-02-21T09:18:15.0535947Z // end inline asm 2026-02-21T09:18:15.0536038Z prmt.b32 %r7157, %r6990, 0, 0x8880U; 2026-02-21T09:18:15.0536099Z cvt.u16.u32 %rs2328, %r7157; 2026-02-21T09:18:15.0536170Z prmt.b32 %r7158, %r6990, 0, 0x7770U; 2026-02-21T09:18:15.0536229Z cvt.u16.u32 %rs2329, %r7158; 2026-02-21T09:18:15.0536316Z prmt.b32 %r7159, %r6990, 0, 0x9991U; 2026-02-21T09:18:15.0536382Z cvt.u16.u32 %rs2330, %r7159; 2026-02-21T09:18:15.0536462Z prmt.b32 %r7160, %r6990, 0, 0x7771U; 2026-02-21T09:18:15.0536539Z cvt.u16.u32 %rs2331, %r7160; 2026-02-21T09:18:15.0536600Z prmt.b32 %r7161, %r6990, 0, 0xaaa2U; 2026-02-21T09:18:15.0536665Z cvt.u16.u32 %rs2332, %r7161; 2026-02-21T09:18:15.0536725Z prmt.b32 %r7162, %r6990, 0, 0x7772U; 2026-02-21T09:18:15.0536807Z cvt.u16.u32 %rs2333, %r7162; 2026-02-21T09:18:15.0536876Z prmt.b32 %r7163, %r6990, 0, 0xbbb3U; 2026-02-21T09:18:15.0536937Z cvt.u16.u32 %rs2334, %r7163; 2026-02-21T09:18:15.0537001Z prmt.b32 %r7164, %r6990, 0, 0x7773U; 2026-02-21T09:18:15.0537068Z cvt.u16.u32 %rs2335, %r7164; 2026-02-21T09:18:15.0537132Z prmt.b32 %r7165, %r6991, 0, 0x8880U; 2026-02-21T09:18:15.0537192Z cvt.u16.u32 %rs2336, %r7165; 2026-02-21T09:18:15.0537255Z prmt.b32 %r7166, %r6991, 0, 0x7770U; 2026-02-21T09:18:15.0537324Z cvt.u16.u32 %rs2337, %r7166; 2026-02-21T09:18:15.0537386Z prmt.b32 %r7167, %r6991, 0, 0x9991U; 2026-02-21T09:18:15.0537448Z cvt.u16.u32 %rs2338, %r7167; 2026-02-21T09:18:15.0537519Z prmt.b32 %r7168, %r6991, 0, 0x7771U; 2026-02-21T09:18:15.0537580Z cvt.u16.u32 %rs2339, %r7168; 2026-02-21T09:18:15.0537642Z prmt.b32 %r7169, %r6991, 0, 0xaaa2U; 2026-02-21T09:18:15.0537703Z cvt.u16.u32 
%rs2340, %r7169; 2026-02-21T09:18:15.0537772Z prmt.b32 %r7170, %r6991, 0, 0x7772U; 2026-02-21T09:18:15.0537832Z cvt.u16.u32 %rs2341, %r7170; 2026-02-21T09:18:15.0537895Z prmt.b32 %r7171, %r6991, 0, 0xbbb3U; 2026-02-21T09:18:15.0537963Z cvt.u16.u32 %rs2342, %r7171; 2026-02-21T09:18:15.0538024Z prmt.b32 %r7172, %r6991, 0, 0x7773U; 2026-02-21T09:18:15.0538085Z cvt.u16.u32 %rs2343, %r7172; 2026-02-21T09:18:15.0538155Z prmt.b32 %r7173, %r6992, 0, 0x8880U; 2026-02-21T09:18:15.0538215Z cvt.u16.u32 %rs2344, %r7173; 2026-02-21T09:18:15.0538278Z prmt.b32 %r7174, %r6992, 0, 0x7770U; 2026-02-21T09:18:15.0538338Z cvt.u16.u32 %rs2345, %r7174; 2026-02-21T09:18:15.0538409Z prmt.b32 %r7175, %r6992, 0, 0x9991U; 2026-02-21T09:18:15.0538467Z cvt.u16.u32 %rs2346, %r7175; 2026-02-21T09:18:15.0538532Z prmt.b32 %r7176, %r6992, 0, 0x7771U; 2026-02-21T09:18:15.0538600Z cvt.u16.u32 %rs2347, %r7176; 2026-02-21T09:18:15.0538663Z prmt.b32 %r7177, %r6992, 0, 0xaaa2U; 2026-02-21T09:18:15.0538723Z cvt.u16.u32 %rs2348, %r7177; 2026-02-21T09:18:15.0538785Z prmt.b32 %r7178, %r6992, 0, 0x7772U; 2026-02-21T09:18:15.0538851Z cvt.u16.u32 %rs2349, %r7178; 2026-02-21T09:18:15.0538912Z prmt.b32 %r7179, %r6992, 0, 0xbbb3U; 2026-02-21T09:18:15.0538972Z cvt.u16.u32 %rs2350, %r7179; 2026-02-21T09:18:15.0539040Z prmt.b32 %r7180, %r6992, 0, 0x7773U; 2026-02-21T09:18:15.0539100Z cvt.u16.u32 %rs2351, %r7180; 2026-02-21T09:18:15.0539163Z prmt.b32 %r7181, %r6993, 0, 0x8880U; 2026-02-21T09:18:15.0539223Z cvt.u16.u32 %rs2352, %r7181; 2026-02-21T09:18:15.0539291Z prmt.b32 %r7182, %r6993, 0, 0x7770U; 2026-02-21T09:18:15.0539351Z cvt.u16.u32 %rs2353, %r7182; 2026-02-21T09:18:15.0539415Z prmt.b32 %r7183, %r6993, 0, 0x9991U; 2026-02-21T09:18:15.0539482Z cvt.u16.u32 %rs2354, %r7183; 2026-02-21T09:18:15.0539546Z prmt.b32 %r7184, %r6993, 0, 0x7771U; 2026-02-21T09:18:15.0539608Z cvt.u16.u32 %rs2355, %r7184; 2026-02-21T09:18:15.0539678Z prmt.b32 %r7185, %r6993, 0, 0xaaa2U; 2026-02-21T09:18:15.0539737Z cvt.u16.u32 %rs2356, 
%r7185; 2026-02-21T09:18:15.0539799Z prmt.b32 %r7186, %r6993, 0, 0x7772U; 2026-02-21T09:18:15.0539858Z cvt.u16.u32 %rs2357, %r7186; 2026-02-21T09:18:15.0539927Z prmt.b32 %r7187, %r6993, 0, 0xbbb3U; 2026-02-21T09:18:15.0539987Z cvt.u16.u32 %rs2358, %r7187; 2026-02-21T09:18:15.0540051Z prmt.b32 %r7188, %r6993, 0, 0x7773U; 2026-02-21T09:18:15.0540117Z cvt.u16.u32 %rs2359, %r7188; 2026-02-21T09:18:15.0540310Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0540373Z shl.b16 %rs2360, %rs2329, 12; 2026-02-21T09:18:15.0540435Z shr.s16 %rs2361, %rs2360, 12; 2026-02-21T09:18:15.0540521Z shl.b16 %rs2362, %rs2331, 12; 2026-02-21T09:18:15.0540580Z shr.s16 %rs2363, %rs2362, 12; 2026-02-21T09:18:15.0540662Z shl.b16 %rs2364, %rs2333, 12; 2026-02-21T09:18:15.0540734Z shr.s16 %rs2365, %rs2364, 12; 2026-02-21T09:18:15.0540814Z shl.b16 %rs2366, %rs2335, 12; 2026-02-21T09:18:15.0540876Z shr.s16 %rs2367, %rs2366, 12; 2026-02-21T09:18:15.0540942Z shl.b16 %rs2368, %rs2337, 12; 2026-02-21T09:18:15.0541002Z shr.s16 %rs2369, %rs2368, 12; 2026-02-21T09:18:15.0541062Z shl.b16 %rs2370, %rs2339, 12; 2026-02-21T09:18:15.0541121Z shr.s16 %rs2371, %rs2370, 12; 2026-02-21T09:18:15.0541188Z shl.b16 %rs2372, %rs2341, 12; 2026-02-21T09:18:15.0541248Z shr.s16 %rs2373, %rs2372, 12; 2026-02-21T09:18:15.0541307Z shl.b16 %rs2374, %rs2343, 12; 2026-02-21T09:18:15.0541375Z shr.s16 %rs2375, %rs2374, 12; 2026-02-21T09:18:15.0541434Z shl.b16 %rs2376, %rs2345, 12; 2026-02-21T09:18:15.0541494Z shr.s16 %rs2377, %rs2376, 12; 2026-02-21T09:18:15.0541579Z shl.b16 %rs2378, %rs2347, 12; 2026-02-21T09:18:15.0541647Z shr.s16 %rs2379, %rs2378, 12; 2026-02-21T09:18:15.0541709Z shl.b16 %rs2380, %rs2349, 12; 2026-02-21T09:18:15.0541770Z shr.s16 %rs2381, %rs2380, 12; 2026-02-21T09:18:15.0541841Z shl.b16 %rs2382, %rs2351, 12; 2026-02-21T09:18:15.0541900Z shr.s16 %rs2383, %rs2382, 12; 2026-02-21T09:18:15.0541961Z shl.b16 %rs2384, %rs2353, 12; 2026-02-21T09:18:15.0542021Z shr.s16 
%rs2385, %rs2384, 12; 2026-02-21T09:18:15.0542092Z shl.b16 %rs2386, %rs2355, 12; 2026-02-21T09:18:15.0542150Z shr.s16 %rs2387, %rs2386, 12; 2026-02-21T09:18:15.0542210Z shl.b16 %rs2388, %rs2357, 12; 2026-02-21T09:18:15.0542277Z shr.s16 %rs2389, %rs2388, 12; 2026-02-21T09:18:15.0542335Z shl.b16 %rs2390, %rs2359, 12; 2026-02-21T09:18:15.0542394Z shr.s16 %rs2391, %rs2390, 12; 2026-02-21T09:18:15.0542571Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0542633Z shr.u16 %rs2392, %rs2328, 4; 2026-02-21T09:18:15.0542694Z shr.u16 %rs2393, %rs2330, 4; 2026-02-21T09:18:15.0542755Z shr.u16 %rs2394, %rs2332, 4; 2026-02-21T09:18:15.0542830Z shr.u16 %rs2395, %rs2334, 4; 2026-02-21T09:18:15.0542892Z shr.u16 %rs2396, %rs2336, 4; 2026-02-21T09:18:15.0542957Z shr.u16 %rs2397, %rs2338, 4; 2026-02-21T09:18:15.0543026Z shr.u16 %rs2398, %rs2340, 4; 2026-02-21T09:18:15.0543086Z shr.u16 %rs2399, %rs2342, 4; 2026-02-21T09:18:15.0543145Z shr.u16 %rs2400, %rs2344, 4; 2026-02-21T09:18:15.0543205Z shr.u16 %rs2401, %rs2346, 4; 2026-02-21T09:18:15.0543273Z shr.u16 %rs2402, %rs2348, 4; 2026-02-21T09:18:15.0543333Z shr.u16 %rs2403, %rs2350, 4; 2026-02-21T09:18:15.0543393Z shr.u16 %rs2404, %rs2352, 4; 2026-02-21T09:18:15.0543461Z shr.u16 %rs2405, %rs2354, 4; 2026-02-21T09:18:15.0543520Z shr.u16 %rs2406, %rs2356, 4; 2026-02-21T09:18:15.0543581Z shr.u16 %rs2407, %rs2358, 4; 2026-02-21T09:18:15.0543755Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0543815Z bar.sync 0; 2026-02-21T09:18:15.0543879Z st.shared.b8 [%r70], %rs2361; 2026-02-21T09:18:15.0543940Z st.shared.b8 [%r71], %rs2363; 2026-02-21T09:18:15.0544011Z st.shared.b8 [%r72], %rs2365; 2026-02-21T09:18:15.0544073Z st.shared.b8 [%r73], %rs2367; 2026-02-21T09:18:15.0544137Z st.shared.b8 [%r74+512], %rs2369; 2026-02-21T09:18:15.0544210Z st.shared.b8 [%r75+512], %rs2371; 2026-02-21T09:18:15.0544273Z st.shared.b8 [%r76+512], %rs2373; 
2026-02-21T09:18:15.0544334Z st.shared.b8 [%r77+512], %rs2375; 2026-02-21T09:18:15.0544399Z st.shared.b8 [%r78+1024], %rs2377; 2026-02-21T09:18:15.0544471Z st.shared.b8 [%r79+1024], %rs2379; 2026-02-21T09:18:15.0544535Z st.shared.b8 [%r80+1024], %rs2381; 2026-02-21T09:18:15.0544597Z st.shared.b8 [%r81+1024], %rs2383; 2026-02-21T09:18:15.0544697Z st.shared.b8 [%r82+1536], %rs2385; 2026-02-21T09:18:15.0544761Z st.shared.b8 [%r83+1536], %rs2387; 2026-02-21T09:18:15.0544823Z st.shared.b8 [%r84+1536], %rs2389; 2026-02-21T09:18:15.0544911Z st.shared.b8 [%r85+1536], %rs2391; 2026-02-21T09:18:15.0544987Z bar.sync 0; 2026-02-21T09:18:15.0545049Z ld.shared.b32 %r7189, [%r86]; 2026-02-21T09:18:15.0545149Z ld.shared.b32 %r7190, [%r87]; 2026-02-21T09:18:15.0545243Z ld.shared.b32 %r7191, [%r88]; 2026-02-21T09:18:15.0545304Z ld.shared.b32 %r7192, [%r89]; 2026-02-21T09:18:15.0545465Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0545524Z bar.sync 0; 2026-02-21T09:18:15.0545583Z st.shared.b8 [%r70], %rs2392; 2026-02-21T09:18:15.0545639Z st.shared.b8 [%r71], %rs2393; 2026-02-21T09:18:15.0545697Z st.shared.b8 [%r72], %rs2394; 2026-02-21T09:18:15.0545762Z st.shared.b8 [%r73], %rs2395; 2026-02-21T09:18:15.0545822Z st.shared.b8 [%r74+512], %rs2396; 2026-02-21T09:18:15.0545883Z st.shared.b8 [%r75+512], %rs2397; 2026-02-21T09:18:15.0545947Z st.shared.b8 [%r76+512], %rs2398; 2026-02-21T09:18:15.0546006Z st.shared.b8 [%r77+512], %rs2399; 2026-02-21T09:18:15.0546066Z st.shared.b8 [%r78+1024], %rs2400; 2026-02-21T09:18:15.0546125Z st.shared.b8 [%r79+1024], %rs2401; 2026-02-21T09:18:15.0546193Z st.shared.b8 [%r80+1024], %rs2402; 2026-02-21T09:18:15.0546252Z st.shared.b8 [%r81+1024], %rs2403; 2026-02-21T09:18:15.0546312Z st.shared.b8 [%r82+1536], %rs2404; 2026-02-21T09:18:15.0546377Z st.shared.b8 [%r83+1536], %rs2405; 2026-02-21T09:18:15.0546436Z st.shared.b8 [%r84+1536], %rs2406; 2026-02-21T09:18:15.0546495Z st.shared.b8 [%r85+1536], %rs2407; 
2026-02-21T09:18:15.0546546Z bar.sync 0; 2026-02-21T09:18:15.0546612Z ld.shared.b32 %r7193, [%r86]; 2026-02-21T09:18:15.0546669Z ld.shared.b32 %r7194, [%r87]; 2026-02-21T09:18:15.0546727Z ld.shared.b32 %r7195, [%r88]; 2026-02-21T09:18:15.0546791Z ld.shared.b32 %r7196, [%r89]; 2026-02-21T09:18:15.0546951Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0547011Z cvt.s8.s32 %rs2408, %r7190; 2026-02-21T09:18:15.0547081Z cvt.rn.f32.s16 %r7197, %rs2408; 2026-02-21T09:18:15.0547140Z cvt.s8.s32 %rs2409, %r7189; 2026-02-21T09:18:15.0547200Z cvt.rn.f32.s16 %r7198, %rs2409; 2026-02-21T09:18:15.0547259Z cvt.s8.s32 %rs2410, %r7194; 2026-02-21T09:18:15.0547328Z cvt.rn.f32.s16 %r7199, %rs2410; 2026-02-21T09:18:15.0547386Z cvt.s8.s32 %rs2411, %r7193; 2026-02-21T09:18:15.0547445Z cvt.rn.f32.s16 %r7200, %rs2411; 2026-02-21T09:18:15.0547510Z cvt.s8.s32 %rs2412, %r7192; 2026-02-21T09:18:15.0547569Z cvt.rn.f32.s16 %r7201, %rs2412; 2026-02-21T09:18:15.0547626Z cvt.s8.s32 %rs2413, %r7191; 2026-02-21T09:18:15.0547683Z cvt.rn.f32.s16 %r7202, %rs2413; 2026-02-21T09:18:15.0547748Z cvt.s8.s32 %rs2414, %r7196; 2026-02-21T09:18:15.0547806Z cvt.rn.f32.s16 %r7203, %rs2414; 2026-02-21T09:18:15.0547863Z cvt.s8.s32 %rs2415, %r7195; 2026-02-21T09:18:15.0547930Z cvt.rn.f32.s16 %r7204, %rs2415; 2026-02-21T09:18:15.0547993Z prmt.b32 %r7205, %r7190, 0, 0x9991U; 2026-02-21T09:18:15.0548053Z cvt.u16.u32 %rs2416, %r7205; 2026-02-21T09:18:15.0548119Z cvt.rn.f32.s16 %r7206, %rs2416; 2026-02-21T09:18:15.0548182Z prmt.b32 %r7207, %r7189, 0, 0x9991U; 2026-02-21T09:18:15.0548242Z cvt.u16.u32 %rs2417, %r7207; 2026-02-21T09:18:15.0548300Z cvt.rn.f32.s16 %r7208, %rs2417; 2026-02-21T09:18:15.0548371Z prmt.b32 %r7209, %r7194, 0, 0x9991U; 2026-02-21T09:18:15.0548430Z cvt.u16.u32 %rs2418, %r7209; 2026-02-21T09:18:15.0548489Z cvt.rn.f32.s16 %r7210, %rs2418; 2026-02-21T09:18:15.0548561Z prmt.b32 %r7211, %r7193, 0, 0x9991U; 2026-02-21T09:18:15.0548620Z cvt.u16.u32 %rs2419, 
%r7211; 2026-02-21T09:18:15.0548680Z cvt.rn.f32.s16 %r7212, %rs2419; 2026-02-21T09:18:15.0548742Z prmt.b32 %r7213, %r7192, 0, 0x9991U; 2026-02-21T09:18:15.0548812Z cvt.u16.u32 %rs2420, %r7213; 2026-02-21T09:18:15.0548870Z cvt.rn.f32.s16 %r7214, %rs2420; 2026-02-21T09:18:15.0548956Z prmt.b32 %r7215, %r7191, 0, 0x9991U; 2026-02-21T09:18:15.0549021Z cvt.u16.u32 %rs2421, %r7215; 2026-02-21T09:18:15.0549080Z cvt.rn.f32.s16 %r7216, %rs2421; 2026-02-21T09:18:15.0549162Z prmt.b32 %r7217, %r7196, 0, 0x9991U; 2026-02-21T09:18:15.0549228Z cvt.u16.u32 %rs2422, %r7217; 2026-02-21T09:18:15.0549305Z cvt.rn.f32.s16 %r7218, %rs2422; 2026-02-21T09:18:15.0549367Z prmt.b32 %r7219, %r7195, 0, 0x9991U; 2026-02-21T09:18:15.0549444Z cvt.u16.u32 %rs2423, %r7219; 2026-02-21T09:18:15.0549514Z cvt.rn.f32.s16 %r7220, %rs2423; 2026-02-21T09:18:15.0549574Z prmt.b32 %r7221, %r7190, 0, 0xaaa2U; 2026-02-21T09:18:15.0549632Z cvt.u16.u32 %rs2424, %r7221; 2026-02-21T09:18:15.0549696Z cvt.rn.f32.s16 %r7222, %rs2424; 2026-02-21T09:18:15.0549758Z prmt.b32 %r7223, %r7189, 0, 0xaaa2U; 2026-02-21T09:18:15.0549815Z cvt.u16.u32 %rs2425, %r7223; 2026-02-21T09:18:15.0549873Z cvt.rn.f32.s16 %r7224, %rs2425; 2026-02-21T09:18:15.0549941Z prmt.b32 %r7225, %r7194, 0, 0xaaa2U; 2026-02-21T09:18:15.0550002Z cvt.u16.u32 %rs2426, %r7225; 2026-02-21T09:18:15.0550060Z cvt.rn.f32.s16 %r7226, %rs2426; 2026-02-21T09:18:15.0550127Z prmt.b32 %r7227, %r7193, 0, 0xaaa2U; 2026-02-21T09:18:15.0550187Z cvt.u16.u32 %rs2427, %r7227; 2026-02-21T09:18:15.0550246Z cvt.rn.f32.s16 %r7228, %rs2427; 2026-02-21T09:18:15.0550307Z prmt.b32 %r7229, %r7192, 0, 0xaaa2U; 2026-02-21T09:18:15.0550373Z cvt.u16.u32 %rs2428, %r7229; 2026-02-21T09:18:15.0550434Z cvt.rn.f32.s16 %r7230, %rs2428; 2026-02-21T09:18:15.0550496Z prmt.b32 %r7231, %r7191, 0, 0xaaa2U; 2026-02-21T09:18:15.0550561Z cvt.u16.u32 %rs2429, %r7231; 2026-02-21T09:18:15.0550619Z cvt.rn.f32.s16 %r7232, %rs2429; 2026-02-21T09:18:15.0550679Z prmt.b32 %r7233, %r7196, 0, 0xaaa2U; 
2026-02-21T09:18:15.0550745Z cvt.u16.u32 %rs2430, %r7233; 2026-02-21T09:18:15.0550805Z cvt.rn.f32.s16 %r7234, %rs2430; 2026-02-21T09:18:15.0550866Z prmt.b32 %r7235, %r7195, 0, 0xaaa2U; 2026-02-21T09:18:15.0550924Z cvt.u16.u32 %rs2431, %r7235; 2026-02-21T09:18:15.0550991Z cvt.rn.f32.s16 %r7236, %rs2431; 2026-02-21T09:18:15.0551052Z prmt.b32 %r7237, %r7190, 0, 0xbbb3U; 2026-02-21T09:18:15.0551110Z cvt.u16.u32 %rs2432, %r7237; 2026-02-21T09:18:15.0551176Z cvt.rn.f32.s16 %r7238, %rs2432; 2026-02-21T09:18:15.0551235Z prmt.b32 %r7239, %r7189, 0, 0xbbb3U; 2026-02-21T09:18:15.0551295Z cvt.u16.u32 %rs2433, %r7239; 2026-02-21T09:18:15.0551353Z cvt.rn.f32.s16 %r7240, %rs2433; 2026-02-21T09:18:15.0551422Z prmt.b32 %r7241, %r7194, 0, 0xbbb3U; 2026-02-21T09:18:15.0551481Z cvt.u16.u32 %rs2434, %r7241; 2026-02-21T09:18:15.0551566Z cvt.rn.f32.s16 %r7242, %rs2434; 2026-02-21T09:18:15.0551633Z prmt.b32 %r7243, %r7193, 0, 0xbbb3U; 2026-02-21T09:18:15.0551691Z cvt.u16.u32 %rs2435, %r7243; 2026-02-21T09:18:15.0551750Z cvt.rn.f32.s16 %r7244, %rs2435; 2026-02-21T09:18:15.0551817Z prmt.b32 %r7245, %r7192, 0, 0xbbb3U; 2026-02-21T09:18:15.0551875Z cvt.u16.u32 %rs2436, %r7245; 2026-02-21T09:18:15.0551933Z cvt.rn.f32.s16 %r7246, %rs2436; 2026-02-21T09:18:15.0551994Z prmt.b32 %r7247, %r7191, 0, 0xbbb3U; 2026-02-21T09:18:15.0552059Z cvt.u16.u32 %rs2437, %r7247; 2026-02-21T09:18:15.0552118Z cvt.rn.f32.s16 %r7248, %rs2437; 2026-02-21T09:18:15.0552178Z prmt.b32 %r7249, %r7196, 0, 0xbbb3U; 2026-02-21T09:18:15.0552243Z cvt.u16.u32 %rs2438, %r7249; 2026-02-21T09:18:15.0552301Z cvt.rn.f32.s16 %r7250, %rs2438; 2026-02-21T09:18:15.0552362Z prmt.b32 %r7251, %r7195, 0, 0xbbb3U; 2026-02-21T09:18:15.0552421Z cvt.u16.u32 %rs2439, %r7251; 2026-02-21T09:18:15.0552486Z cvt.rn.f32.s16 %r7252, %rs2439; 2026-02-21T09:18:15.0552539Z bar.sync 0; 2026-02-21T09:18:15.0552636Z st.shared.v4.b32 [%r90], {%r7198, %r7200, %r7197, %r7199}; 2026-02-21T09:18:15.0552738Z st.shared.v4.b32 [%r91], {%r7202, %r7204, %r7201, 
%r7203}; 2026-02-21T09:18:15.0552830Z st.shared.v4.b32 [%r92], {%r7208, %r7212, %r7206, %r7210}; 2026-02-21T09:18:15.0552918Z st.shared.v4.b32 [%r93], {%r7216, %r7220, %r7214, %r7218}; 2026-02-21T09:18:15.0553013Z st.shared.v4.b32 [%r94], {%r7224, %r7228, %r7222, %r7226}; 2026-02-21T09:18:15.0553129Z st.shared.v4.b32 [%r95], {%r7232, %r7236, %r7230, %r7234}; 2026-02-21T09:18:15.0553217Z st.shared.v4.b32 [%r96], {%r7240, %r7244, %r7238, %r7242}; 2026-02-21T09:18:15.0553329Z st.shared.v4.b32 [%r97], {%r7248, %r7252, %r7246, %r7250}; 2026-02-21T09:18:15.0553389Z $L__tmp415: 2026-02-21T09:18:15.0553648Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0553708Z // begin inline asm 2026-02-21T09:18:15.0554025Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r6994, %r6995, %r6996, %r6997, %r6998, %r6999, %r7000, %r7001, %r7002, %r7003, %r7004, %r7005, %r7006, %r7007, %r7008, %r7009}, [%r7010 + 0], 64; 2026-02-21T09:18:15.0554080Z // end inline asm 2026-02-21T09:18:15.0554138Z // begin inline asm 2026-02-21T09:18:15.0554451Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7011, %r7012, %r7013, %r7014, %r7015, %r7016, %r7017, %r7018, %r7019, %r7020, %r7021, %r7022, %r7023, %r7024, %r7025, %r7026}, [%r7010 + 16], 64; 2026-02-21T09:18:15.0554508Z // end inline asm 2026-02-21T09:18:15.0554564Z // begin inline asm 2026-02-21T09:18:15.0554871Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7028, %r7029, %r7030, %r7031, %r7032, %r7033, %r7034, %r7035, %r7036, %r7037, %r7038, %r7039, %r7040, %r7041, %r7042, %r7043}, [%r7010 + 32], 64; 2026-02-21T09:18:15.0554928Z // end inline asm 2026-02-21T09:18:15.0554984Z // begin inline asm 2026-02-21T09:18:15.0555287Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7045, %r7046, %r7047, %r7048, %r7049, %r7050, %r7051, %r7052, %r7053, %r7054, %r7055, %r7056, %r7057, %r7058, %r7059, %r7060}, [%r7010 + 48], 64; 2026-02-21T09:18:15.0555349Z // end inline asm 2026-02-21T09:18:15.0555405Z 
// begin inline asm 2026-02-21T09:18:15.0555473Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0555536Z // end inline asm 2026-02-21T09:18:15.0555592Z // begin inline asm 2026-02-21T09:18:15.0555906Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7299 + 0], 64, {%r6994, %r6995, %r6996, %r6997, %r6998, %r6999, %r7000, %r7001, %r7002, %r7003, %r7004, %r7005, %r7006, %r7007, %r7008, %r7009}; 2026-02-21T09:18:15.0555970Z // end inline asm 2026-02-21T09:18:15.0556027Z // begin inline asm 2026-02-21T09:18:15.0556346Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7299 + 16], 64, {%r7011, %r7012, %r7013, %r7014, %r7015, %r7016, %r7017, %r7018, %r7019, %r7020, %r7021, %r7022, %r7023, %r7024, %r7025, %r7026}; 2026-02-21T09:18:15.0556407Z // end inline asm 2026-02-21T09:18:15.0556463Z // begin inline asm 2026-02-21T09:18:15.0556771Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7299 + 32], 64, {%r7028, %r7029, %r7030, %r7031, %r7032, %r7033, %r7034, %r7035, %r7036, %r7037, %r7038, %r7039, %r7040, %r7041, %r7042, %r7043}; 2026-02-21T09:18:15.0556834Z // end inline asm 2026-02-21T09:18:15.0556891Z // begin inline asm 2026-02-21T09:18:15.0557201Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7299 + 48], 64, {%r7045, %r7046, %r7047, %r7048, %r7049, %r7050, %r7051, %r7052, %r7053, %r7054, %r7055, %r7056, %r7057, %r7058, %r7059, %r7060}; 2026-02-21T09:18:15.0557258Z // end inline asm 2026-02-21T09:18:15.0557329Z // begin inline asm 2026-02-21T09:18:15.0557398Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0557452Z // end inline asm 2026-02-21T09:18:15.0557514Z bar.sync 0; 2026-02-21T09:18:15.0557570Z // begin inline asm 2026-02-21T09:18:15.0557876Z @%p340 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7130 + 0], 16, {%r7131, %r7132, %r7133, %r7134, %r7135, %r7136, %r7137, %r7138, %r7139, %r7140, %r7141, %r7142, %r7143, %r7144, %r7145, %r7146}; 2026-02-21T09:18:15.0557938Z // end inline asm 2026-02-21T09:18:15.0557993Z // begin inline asm 
2026-02-21T09:18:15.0558057Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0558112Z // end inline asm 2026-02-21T09:18:15.0558172Z bar.sync 0; 2026-02-21T09:18:15.0558228Z // begin inline asm 2026-02-21T09:18:15.0558299Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0558383Z // end inline asm 2026-02-21T09:18:15.0558439Z // begin inline asm 2026-02-21T09:18:15.0558527Z @%p366 mbarrier.init.shared::cta.b64 [%r7557], 1; 2026-02-21T09:18:15.0558584Z // end inline asm 2026-02-21T09:18:15.0558667Z bar.sync 0; 2026-02-21T09:18:15.0558727Z @%p322 bra $L__BB0_52; 2026-02-21T09:18:15.0558844Z // %bb.51: // in Loop: Header=BB0_46 Depth=2 2026-02-21T09:18:15.0558917Z elect.sync %r7265|%p358, -1; 2026-02-21T09:18:15.0558996Z mov.b32 %r7255, 69208336; 2026-02-21T09:18:15.0559059Z mov.pred %p357, -1; 2026-02-21T09:18:15.0559115Z // begin inline asm 2026-02-21T09:18:15.0559279Z @%p358 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 0 ], %rd6, %r7255, %p357; 2026-02-21T09:18:15.0559334Z // end inline asm 2026-02-21T09:18:15.0559392Z // begin inline asm 2026-02-21T09:18:15.0559551Z @%p358 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 8 ], %rd7, %r7255, %p357; 2026-02-21T09:18:15.0559606Z // end inline asm 2026-02-21T09:18:15.0559663Z // begin inline asm 2026-02-21T09:18:15.0559822Z @%p358 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 16 ], %rd8, %r7255, %p357; 2026-02-21T09:18:15.0559876Z // end inline asm 2026-02-21T09:18:15.0559934Z // begin inline asm 2026-02-21T09:18:15.0560089Z @%p358 tcgen05.mma.cta_group::1.kind::tf32 [ %r7253 + 0 ], [ %r7254 + 24 ], %rd9, %r7255, %p357; 2026-02-21T09:18:15.0560144Z // end inline asm 2026-02-21T09:18:15.0560206Z cvt.u64.u32 %rd897, %r7557; 2026-02-21T09:18:15.0560262Z // begin inline asm 2026-02-21T09:18:15.0560396Z @%p358 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd897]; 2026-02-21T09:18:15.0560450Z // end inline asm 2026-02-21T09:18:15.0560548Z $L__BB0_52: // in Loop: 
Header=BB0_46 Depth=2 2026-02-21T09:18:15.0560645Z .loc 2 0 36 // standard.py:0:36 2026-02-21T09:18:15.0560700Z mov.b32 %r7269, 0; 2026-02-21T09:18:15.0560911Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0560973Z // begin inline asm 2026-02-21T09:18:15.0561024Z 2026-02-21T09:18:15.0561074Z { 2026-02-21T09:18:15.0561137Z .reg .pred complete; 2026-02-21T09:18:15.0561198Z waitLoop: 2026-02-21T09:18:15.0561318Z mbarrier.try_wait.parity.shared.b64 complete, [%r7557], %r7269; 2026-02-21T09:18:15.0561385Z @!complete bra.uni waitLoop; 2026-02-21T09:18:15.0561441Z } 2026-02-21T09:18:15.0561446Z 2026-02-21T09:18:15.0561500Z // end inline asm 2026-02-21T09:18:15.0561575Z bar.sync 0; 2026-02-21T09:18:15.0561631Z // begin inline asm 2026-02-21T09:18:15.0561723Z @%p366 mbarrier.inval.shared::cta.b64 [%r7557]; 2026-02-21T09:18:15.0561777Z // end inline asm 2026-02-21T09:18:15.0561829Z $L__tmp416: 2026-02-21T09:18:15.0562001Z .loc 1 48 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:32 2026-02-21T09:18:15.0562061Z add.s64 %rd899, %rd857, 192; 2026-02-21T09:18:15.0562225Z .loc 1 48 80 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:48:80 2026-02-21T09:18:15.0562293Z add.s64 %rd902, %rd860, 192; 2026-02-21T09:18:15.0562348Z // begin inline asm 2026-02-21T09:18:15.0562407Z mov.u64 %rd898, 0x0; 2026-02-21T09:18:15.0562514Z createpolicy.fractional.L2::evict_first.b64 %rd898, 1.0; 2026-02-21T09:18:15.0562576Z // end inline asm 2026-02-21T09:18:15.0562630Z // begin inline asm 2026-02-21T09:18:15.0562687Z mov.u32 %r7271, 0x0; 2026-02-21T09:18:15.0562750Z mov.u32 %r7272, 0x0; 2026-02-21T09:18:15.0562804Z mov.u32 %r7273, 0x0; 2026-02-21T09:18:15.0562858Z mov.u32 %r7274, 0x0; 2026-02-21T09:18:15.0563041Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r7271, %r7272, %r7273, %r7274 }, [ %rd899 + 0 ], %rd898; 2026-02-21T09:18:15.0563096Z // end inline asm 2026-02-21T09:18:15.0563151Z 
// begin inline asm 2026-02-21T09:18:15.0563208Z mov.u64 %rd901, 0x0; 2026-02-21T09:18:15.0563320Z createpolicy.fractional.L2::evict_first.b64 %rd901, 1.0; 2026-02-21T09:18:15.0563409Z // end inline asm 2026-02-21T09:18:15.0563466Z // begin inline asm 2026-02-21T09:18:15.0563527Z mov.u32 %r7275, 0x0; 2026-02-21T09:18:15.0563581Z mov.u32 %r7276, 0x0; 2026-02-21T09:18:15.0563681Z mov.u32 %r7277, 0x0; 2026-02-21T09:18:15.0563736Z mov.u32 %r7278, 0x0; 2026-02-21T09:18:15.0563945Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r7275, %r7276, %r7277, %r7278 }, [ %rd902 + 0 ], %rd901; 2026-02-21T09:18:15.0564027Z // end inline asm 2026-02-21T09:18:15.0564082Z $L__tmp417: 2026-02-21T09:18:15.0564296Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 2026-02-21T09:18:15.0564392Z st.shared.v4.b32 [%r67], {%r7271, %r7272, %r7273, %r7274}; 2026-02-21T09:18:15.0564493Z st.shared.v4.b32 [%r67+512], {%r7275, %r7276, %r7277, %r7278}; 2026-02-21T09:18:15.0564556Z bar.sync 0; 2026-02-21T09:18:15.0564650Z ld.shared.v4.b32 {%r7438, %r7439, %r7440, %r7441}, [%r68]; 2026-02-21T09:18:15.0564719Z mov.b32 {%rs2440, %rs2441}, %r7441; 2026-02-21T09:18:15.0564791Z mov.b32 {%rs2442, %rs2443}, %r7440; 2026-02-21T09:18:15.0564854Z mov.b32 {%rs2444, %rs2445}, %r7439; 2026-02-21T09:18:15.0564917Z mov.b32 {%rs2446, %rs2447}, %r7438; 2026-02-21T09:18:15.0565008Z ld.shared.v4.b32 {%r7442, %r7443, %r7444, %r7445}, [%r69]; 2026-02-21T09:18:15.0565082Z mov.b32 {%rs2448, %rs2449}, %r7445; 2026-02-21T09:18:15.0565143Z mov.b32 {%rs2450, %rs2451}, %r7444; 2026-02-21T09:18:15.0565206Z mov.b32 {%rs2452, %rs2453}, %r7443; 2026-02-21T09:18:15.0565275Z mov.b32 {%rs2454, %rs2455}, %r7442; 2026-02-21T09:18:15.0565332Z $L__tmp418: 2026-02-21T09:18:15.0565493Z .loc 1 52 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:52:32 2026-02-21T09:18:15.0565555Z cvt.f32.bf16 %r7420, %rs2446; 2026-02-21T09:18:15.0565624Z cvt.f32.bf16 %r7421, %rs2447; 
2026-02-21T09:18:15.0565683Z cvt.f32.bf16 %r7422, %rs2444; 2026-02-21T09:18:15.0565741Z cvt.f32.bf16 %r7423, %rs2445; 2026-02-21T09:18:15.0565810Z cvt.f32.bf16 %r7424, %rs2442; 2026-02-21T09:18:15.0565867Z cvt.f32.bf16 %r7425, %rs2443; 2026-02-21T09:18:15.0565925Z cvt.f32.bf16 %r7426, %rs2440; 2026-02-21T09:18:15.0565991Z cvt.f32.bf16 %r7427, %rs2441; 2026-02-21T09:18:15.0566049Z cvt.f32.bf16 %r7428, %rs2454; 2026-02-21T09:18:15.0566108Z cvt.f32.bf16 %r7429, %rs2455; 2026-02-21T09:18:15.0566165Z cvt.f32.bf16 %r7430, %rs2452; 2026-02-21T09:18:15.0566231Z cvt.f32.bf16 %r7431, %rs2453; 2026-02-21T09:18:15.0566289Z cvt.f32.bf16 %r7432, %rs2450; 2026-02-21T09:18:15.0566346Z cvt.f32.bf16 %r7433, %rs2451; 2026-02-21T09:18:15.0566410Z cvt.f32.bf16 %r7434, %rs2448; 2026-02-21T09:18:15.0566466Z cvt.f32.bf16 %r7435, %rs2449; 2026-02-21T09:18:15.0566624Z .loc 1 54 87 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:54:87 2026-02-21T09:18:15.0566690Z add.s64 %rd905, %rd1056, 393216; 2026-02-21T09:18:15.0566756Z // begin inline asm 2026-02-21T09:18:15.0566813Z mov.u64 %rd904, 0x0; 2026-02-21T09:18:15.0566920Z createpolicy.fractional.L2::evict_first.b64 %rd904, 1.0; 2026-02-21T09:18:15.0566983Z // end inline asm 2026-02-21T09:18:15.0567042Z // begin inline asm 2026-02-21T09:18:15.0567099Z mov.u32 %r7279, 0x0; 2026-02-21T09:18:15.0567162Z mov.u32 %r7280, 0x0; 2026-02-21T09:18:15.0567219Z mov.u32 %r7281, 0x0; 2026-02-21T09:18:15.0567275Z mov.u32 %r7282, 0x0; 2026-02-21T09:18:15.0567449Z ld.global.L1::evict_first.L2::cache_hint.v4.b32 { %r7279, %r7280, %r7281, %r7282 }, [ %rd905 + 0 ], %rd904; 2026-02-21T09:18:15.0567513Z // end inline asm 2026-02-21T09:18:15.0567578Z prmt.b32 %r7446, %r7279, 0, 0x8880U; 2026-02-21T09:18:15.0567640Z cvt.u16.u32 %rs2456, %r7446; 2026-02-21T09:18:15.0567711Z prmt.b32 %r7447, %r7279, 0, 0x7770U; 2026-02-21T09:18:15.0567773Z cvt.u16.u32 %rs2457, %r7447; 2026-02-21T09:18:15.0567835Z prmt.b32 %r7448, %r7279, 0, 0x9991U; 
2026-02-21T09:18:15.0567894Z cvt.u16.u32 %rs2458, %r7448; 2026-02-21T09:18:15.0567963Z prmt.b32 %r7449, %r7279, 0, 0x7771U; 2026-02-21T09:18:15.0568050Z cvt.u16.u32 %rs2459, %r7449; 2026-02-21T09:18:15.0568110Z prmt.b32 %r7450, %r7279, 0, 0xaaa2U; 2026-02-21T09:18:15.0568175Z cvt.u16.u32 %rs2460, %r7450; 2026-02-21T09:18:15.0568263Z prmt.b32 %r7451, %r7279, 0, 0x7772U; 2026-02-21T09:18:15.0568320Z cvt.u16.u32 %rs2461, %r7451; 2026-02-21T09:18:15.0568404Z prmt.b32 %r7452, %r7279, 0, 0xbbb3U; 2026-02-21T09:18:15.0568462Z cvt.u16.u32 %rs2462, %r7452; 2026-02-21T09:18:15.0568543Z prmt.b32 %r7453, %r7279, 0, 0x7773U; 2026-02-21T09:18:15.0568601Z cvt.u16.u32 %rs2463, %r7453; 2026-02-21T09:18:15.0568668Z prmt.b32 %r7454, %r7280, 0, 0x8880U; 2026-02-21T09:18:15.0568725Z cvt.u16.u32 %rs2464, %r7454; 2026-02-21T09:18:15.0568784Z prmt.b32 %r7455, %r7280, 0, 0x7770U; 2026-02-21T09:18:15.0568847Z cvt.u16.u32 %rs2465, %r7455; 2026-02-21T09:18:15.0568907Z prmt.b32 %r7456, %r7280, 0, 0x9991U; 2026-02-21T09:18:15.0568964Z cvt.u16.u32 %rs2466, %r7456; 2026-02-21T09:18:15.0569025Z prmt.b32 %r7457, %r7280, 0, 0x7771U; 2026-02-21T09:18:15.0569091Z cvt.u16.u32 %rs2467, %r7457; 2026-02-21T09:18:15.0569152Z prmt.b32 %r7458, %r7280, 0, 0xaaa2U; 2026-02-21T09:18:15.0569209Z cvt.u16.u32 %rs2468, %r7458; 2026-02-21T09:18:15.0569279Z prmt.b32 %r7459, %r7280, 0, 0x7772U; 2026-02-21T09:18:15.0569338Z cvt.u16.u32 %rs2469, %r7459; 2026-02-21T09:18:15.0569399Z prmt.b32 %r7460, %r7280, 0, 0xbbb3U; 2026-02-21T09:18:15.0569457Z cvt.u16.u32 %rs2470, %r7460; 2026-02-21T09:18:15.0569524Z prmt.b32 %r7461, %r7280, 0, 0x7773U; 2026-02-21T09:18:15.0569581Z cvt.u16.u32 %rs2471, %r7461; 2026-02-21T09:18:15.0569639Z prmt.b32 %r7462, %r7281, 0, 0x8880U; 2026-02-21T09:18:15.0569704Z cvt.u16.u32 %rs2472, %r7462; 2026-02-21T09:18:15.0569763Z prmt.b32 %r7463, %r7281, 0, 0x7770U; 2026-02-21T09:18:15.0569819Z cvt.u16.u32 %rs2473, %r7463; 2026-02-21T09:18:15.0569884Z prmt.b32 %r7464, %r7281, 0, 0x9991U; 
2026-02-21T09:18:15.0569941Z cvt.u16.u32 %rs2474, %r7464; 2026-02-21T09:18:15.0570001Z prmt.b32 %r7465, %r7281, 0, 0x7771U; 2026-02-21T09:18:15.0570060Z cvt.u16.u32 %rs2475, %r7465; 2026-02-21T09:18:15.0570127Z prmt.b32 %r7466, %r7281, 0, 0xaaa2U; 2026-02-21T09:18:15.0570183Z cvt.u16.u32 %rs2476, %r7466; 2026-02-21T09:18:15.0570244Z prmt.b32 %r7467, %r7281, 0, 0x7772U; 2026-02-21T09:18:15.0570306Z cvt.u16.u32 %rs2477, %r7467; 2026-02-21T09:18:15.0570367Z prmt.b32 %r7468, %r7281, 0, 0xbbb3U; 2026-02-21T09:18:15.0570423Z cvt.u16.u32 %rs2478, %r7468; 2026-02-21T09:18:15.0570483Z prmt.b32 %r7469, %r7281, 0, 0x7773U; 2026-02-21T09:18:15.0570548Z cvt.u16.u32 %rs2479, %r7469; 2026-02-21T09:18:15.0570607Z prmt.b32 %r7470, %r7282, 0, 0x8880U; 2026-02-21T09:18:15.0570663Z cvt.u16.u32 %rs2480, %r7470; 2026-02-21T09:18:15.0570723Z prmt.b32 %r7471, %r7282, 0, 0x7770U; 2026-02-21T09:18:15.0570777Z cvt.u16.u32 %rs2481, %r7471; 2026-02-21T09:18:15.0570834Z prmt.b32 %r7472, %r7282, 0, 0x9991U; 2026-02-21T09:18:15.0570892Z cvt.u16.u32 %rs2482, %r7472; 2026-02-21T09:18:15.0570951Z prmt.b32 %r7473, %r7282, 0, 0x7771U; 2026-02-21T09:18:15.0571009Z cvt.u16.u32 %rs2483, %r7473; 2026-02-21T09:18:15.0571069Z prmt.b32 %r7474, %r7282, 0, 0xaaa2U; 2026-02-21T09:18:15.0571134Z cvt.u16.u32 %rs2484, %r7474; 2026-02-21T09:18:15.0571195Z prmt.b32 %r7475, %r7282, 0, 0x7772U; 2026-02-21T09:18:15.0571253Z cvt.u16.u32 %rs2485, %r7475; 2026-02-21T09:18:15.0571320Z prmt.b32 %r7476, %r7282, 0, 0xbbb3U; 2026-02-21T09:18:15.0571377Z cvt.u16.u32 %rs2486, %r7476; 2026-02-21T09:18:15.0571439Z prmt.b32 %r7477, %r7282, 0, 0x7773U; 2026-02-21T09:18:15.0571498Z cvt.u16.u32 %rs2487, %r7477; 2026-02-21T09:18:15.0571711Z .loc 1 59 25 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:59:25 2026-02-21T09:18:15.0571773Z shl.b16 %rs2488, %rs2457, 12; 2026-02-21T09:18:15.0571833Z shr.s16 %rs2489, %rs2488, 12; 2026-02-21T09:18:15.0571900Z shl.b16 %rs2490, %rs2459, 12; 2026-02-21T09:18:15.0571960Z shr.s16 %rs2491, 
%rs2490, 12; 2026-02-21T09:18:15.0572019Z shl.b16 %rs2492, %rs2461, 12; 2026-02-21T09:18:15.0572113Z shr.s16 %rs2493, %rs2492, 12; 2026-02-21T09:18:15.0572170Z shl.b16 %rs2494, %rs2463, 12; 2026-02-21T09:18:15.0572226Z shr.s16 %rs2495, %rs2494, 12; 2026-02-21T09:18:15.0572282Z shl.b16 %rs2496, %rs2465, 12; 2026-02-21T09:18:15.0572375Z shr.s16 %rs2497, %rs2496, 12; 2026-02-21T09:18:15.0572435Z shl.b16 %rs2498, %rs2467, 12; 2026-02-21T09:18:15.0572520Z shr.s16 %rs2499, %rs2498, 12; 2026-02-21T09:18:15.0572614Z shl.b16 %rs2500, %rs2469, 12; 2026-02-21T09:18:15.0572675Z shr.s16 %rs2501, %rs2500, 12; 2026-02-21T09:18:15.0572735Z shl.b16 %rs2502, %rs2471, 12; 2026-02-21T09:18:15.0572795Z shr.s16 %rs2503, %rs2502, 12; 2026-02-21T09:18:15.0572865Z shl.b16 %rs2504, %rs2473, 12; 2026-02-21T09:18:15.0572923Z shr.s16 %rs2505, %rs2504, 12; 2026-02-21T09:18:15.0572979Z shl.b16 %rs2506, %rs2475, 12; 2026-02-21T09:18:15.0573042Z shr.s16 %rs2507, %rs2506, 12; 2026-02-21T09:18:15.0573097Z shl.b16 %rs2508, %rs2477, 12; 2026-02-21T09:18:15.0573154Z shr.s16 %rs2509, %rs2508, 12; 2026-02-21T09:18:15.0573211Z shl.b16 %rs2510, %rs2479, 12; 2026-02-21T09:18:15.0573273Z shr.s16 %rs2511, %rs2510, 12; 2026-02-21T09:18:15.0573331Z shl.b16 %rs2512, %rs2481, 12; 2026-02-21T09:18:15.0573389Z shr.s16 %rs2513, %rs2512, 12; 2026-02-21T09:18:15.0573451Z shl.b16 %rs2514, %rs2483, 12; 2026-02-21T09:18:15.0573508Z shr.s16 %rs2515, %rs2514, 12; 2026-02-21T09:18:15.0573566Z shl.b16 %rs2516, %rs2485, 12; 2026-02-21T09:18:15.0573624Z shr.s16 %rs2517, %rs2516, 12; 2026-02-21T09:18:15.0573687Z shl.b16 %rs2518, %rs2487, 12; 2026-02-21T09:18:15.0573744Z shr.s16 %rs2519, %rs2518, 12; 2026-02-21T09:18:15.0573906Z .loc 1 62 28 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:62:28 2026-02-21T09:18:15.0573971Z shr.u16 %rs2520, %rs2456, 4; 2026-02-21T09:18:15.0574029Z shr.u16 %rs2521, %rs2458, 4; 2026-02-21T09:18:15.0574085Z shr.u16 %rs2522, %rs2460, 4; 2026-02-21T09:18:15.0574148Z shr.u16 %rs2523, 
%rs2462, 4; 2026-02-21T09:18:15.0574206Z shr.u16 %rs2524, %rs2464, 4; 2026-02-21T09:18:15.0574262Z shr.u16 %rs2525, %rs2466, 4; 2026-02-21T09:18:15.0574318Z shr.u16 %rs2526, %rs2468, 4; 2026-02-21T09:18:15.0574383Z shr.u16 %rs2527, %rs2470, 4; 2026-02-21T09:18:15.0574441Z shr.u16 %rs2528, %rs2472, 4; 2026-02-21T09:18:15.0574497Z shr.u16 %rs2529, %rs2474, 4; 2026-02-21T09:18:15.0574562Z shr.u16 %rs2530, %rs2476, 4; 2026-02-21T09:18:15.0574618Z shr.u16 %rs2531, %rs2478, 4; 2026-02-21T09:18:15.0574676Z shr.u16 %rs2532, %rs2480, 4; 2026-02-21T09:18:15.0574733Z shr.u16 %rs2533, %rs2482, 4; 2026-02-21T09:18:15.0574795Z shr.u16 %rs2534, %rs2484, 4; 2026-02-21T09:18:15.0574851Z shr.u16 %rs2535, %rs2486, 4; 2026-02-21T09:18:15.0575009Z .loc 1 66 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:66:45 2026-02-21T09:18:15.0575070Z bar.sync 0; 2026-02-21T09:18:15.0575129Z st.shared.b8 [%r70], %rs2489; 2026-02-21T09:18:15.0575189Z st.shared.b8 [%r71], %rs2491; 2026-02-21T09:18:15.0575254Z st.shared.b8 [%r72], %rs2493; 2026-02-21T09:18:15.0575313Z st.shared.b8 [%r73], %rs2495; 2026-02-21T09:18:15.0575375Z st.shared.b8 [%r74+512], %rs2497; 2026-02-21T09:18:15.0575437Z st.shared.b8 [%r75+512], %rs2499; 2026-02-21T09:18:15.0575505Z st.shared.b8 [%r76+512], %rs2501; 2026-02-21T09:18:15.0575564Z st.shared.b8 [%r77+512], %rs2503; 2026-02-21T09:18:15.0575628Z st.shared.b8 [%r78+1024], %rs2505; 2026-02-21T09:18:15.0575693Z st.shared.b8 [%r79+1024], %rs2507; 2026-02-21T09:18:15.0575753Z st.shared.b8 [%r80+1024], %rs2509; 2026-02-21T09:18:15.0575812Z st.shared.b8 [%r81+1024], %rs2511; 2026-02-21T09:18:15.0575871Z st.shared.b8 [%r82+1536], %rs2513; 2026-02-21T09:18:15.0575936Z st.shared.b8 [%r83+1536], %rs2515; 2026-02-21T09:18:15.0575993Z st.shared.b8 [%r84+1536], %rs2517; 2026-02-21T09:18:15.0576050Z st.shared.b8 [%r85+1536], %rs2519; 2026-02-21T09:18:15.0576108Z bar.sync 0; 2026-02-21T09:18:15.0576167Z ld.shared.b32 %r7478, [%r86]; 2026-02-21T09:18:15.0576226Z ld.shared.b32 
%r7479, [%r87]; 2026-02-21T09:18:15.0576310Z ld.shared.b32 %r7480, [%r88]; 2026-02-21T09:18:15.0576376Z ld.shared.b32 %r7481, [%r89]; 2026-02-21T09:18:15.0576535Z .loc 1 67 45 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:67:45 2026-02-21T09:18:15.0576613Z bar.sync 0; 2026-02-21T09:18:15.0576681Z st.shared.b8 [%r70], %rs2520; 2026-02-21T09:18:15.0576758Z st.shared.b8 [%r71], %rs2521; 2026-02-21T09:18:15.0576833Z st.shared.b8 [%r72], %rs2522; 2026-02-21T09:18:15.0576899Z st.shared.b8 [%r73], %rs2523; 2026-02-21T09:18:15.0576960Z st.shared.b8 [%r74+512], %rs2524; 2026-02-21T09:18:15.0577019Z st.shared.b8 [%r75+512], %rs2525; 2026-02-21T09:18:15.0577079Z st.shared.b8 [%r76+512], %rs2526; 2026-02-21T09:18:15.0577146Z st.shared.b8 [%r77+512], %rs2527; 2026-02-21T09:18:15.0577206Z st.shared.b8 [%r78+1024], %rs2528; 2026-02-21T09:18:15.0577266Z st.shared.b8 [%r79+1024], %rs2529; 2026-02-21T09:18:15.0577333Z st.shared.b8 [%r80+1024], %rs2530; 2026-02-21T09:18:15.0577393Z st.shared.b8 [%r81+1024], %rs2531; 2026-02-21T09:18:15.0577453Z st.shared.b8 [%r82+1536], %rs2532; 2026-02-21T09:18:15.0577513Z st.shared.b8 [%r83+1536], %rs2533; 2026-02-21T09:18:15.0577582Z st.shared.b8 [%r84+1536], %rs2534; 2026-02-21T09:18:15.0577643Z st.shared.b8 [%r85+1536], %rs2535; 2026-02-21T09:18:15.0577695Z bar.sync 0; 2026-02-21T09:18:15.0577763Z ld.shared.b32 %r7482, [%r86]; 2026-02-21T09:18:15.0577824Z ld.shared.b32 %r7483, [%r87]; 2026-02-21T09:18:15.0577884Z ld.shared.b32 %r7484, [%r88]; 2026-02-21T09:18:15.0577951Z ld.shared.b32 %r7485, [%r89]; 2026-02-21T09:18:15.0578114Z .loc 1 77 32 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:77:32 2026-02-21T09:18:15.0578175Z cvt.s8.s32 %rs2536, %r7479; 2026-02-21T09:18:15.0578238Z cvt.rn.f32.s16 %r7486, %rs2536; 2026-02-21T09:18:15.0578308Z cvt.s8.s32 %rs2537, %r7478; 2026-02-21T09:18:15.0578372Z cvt.rn.f32.s16 %r7487, %rs2537; 2026-02-21T09:18:15.0578433Z cvt.s8.s32 %rs2538, %r7483; 2026-02-21T09:18:15.0578503Z 
cvt.rn.f32.s16 %r7488, %rs2538; 2026-02-21T09:18:15.0578566Z cvt.s8.s32 %rs2539, %r7482; 2026-02-21T09:18:15.0578626Z cvt.rn.f32.s16 %r7489, %rs2539; 2026-02-21T09:18:15.0578687Z cvt.s8.s32 %rs2540, %r7481; 2026-02-21T09:18:15.0578754Z cvt.rn.f32.s16 %r7490, %rs2540; 2026-02-21T09:18:15.0578812Z cvt.s8.s32 %rs2541, %r7480; 2026-02-21T09:18:15.0578871Z cvt.rn.f32.s16 %r7491, %rs2541; 2026-02-21T09:18:15.0578938Z cvt.s8.s32 %rs2542, %r7485; 2026-02-21T09:18:15.0578997Z cvt.rn.f32.s16 %r7492, %rs2542; 2026-02-21T09:18:15.0579055Z cvt.s8.s32 %rs2543, %r7484; 2026-02-21T09:18:15.0579121Z cvt.rn.f32.s16 %r7493, %rs2543; 2026-02-21T09:18:15.0579185Z prmt.b32 %r7494, %r7479, 0, 0x9991U; 2026-02-21T09:18:15.0579244Z cvt.u16.u32 %rs2544, %r7494; 2026-02-21T09:18:15.0579303Z cvt.rn.f32.s16 %r7495, %rs2544; 2026-02-21T09:18:15.0579374Z prmt.b32 %r7496, %r7478, 0, 0x9991U; 2026-02-21T09:18:15.0579433Z cvt.u16.u32 %rs2545, %r7496; 2026-02-21T09:18:15.0579494Z cvt.rn.f32.s16 %r7497, %rs2545; 2026-02-21T09:18:15.0579563Z prmt.b32 %r7498, %r7483, 0, 0x9991U; 2026-02-21T09:18:15.0579622Z cvt.u16.u32 %rs2546, %r7498; 2026-02-21T09:18:15.0579681Z cvt.rn.f32.s16 %r7499, %rs2546; 2026-02-21T09:18:15.0579743Z prmt.b32 %r7500, %r7482, 0, 0x9991U; 2026-02-21T09:18:15.0579808Z cvt.u16.u32 %rs2547, %r7500; 2026-02-21T09:18:15.0579870Z cvt.rn.f32.s16 %r7501, %rs2547; 2026-02-21T09:18:15.0579932Z prmt.b32 %r7502, %r7481, 0, 0x9991U; 2026-02-21T09:18:15.0579998Z cvt.u16.u32 %rs2548, %r7502; 2026-02-21T09:18:15.0580056Z cvt.rn.f32.s16 %r7503, %rs2548; 2026-02-21T09:18:15.0580117Z prmt.b32 %r7504, %r7480, 0, 0x9991U; 2026-02-21T09:18:15.0580175Z cvt.u16.u32 %rs2549, %r7504; 2026-02-21T09:18:15.0580242Z cvt.rn.f32.s16 %r7505, %rs2549; 2026-02-21T09:18:15.0580302Z prmt.b32 %r7506, %r7485, 0, 0x9991U; 2026-02-21T09:18:15.0580361Z cvt.u16.u32 %rs2550, %r7506; 2026-02-21T09:18:15.0580427Z cvt.rn.f32.s16 %r7507, %rs2550; 2026-02-21T09:18:15.0580520Z prmt.b32 %r7508, %r7484, 0, 0x9991U; 
2026-02-21T09:18:15.0580578Z cvt.u16.u32 %rs2551, %r7508; 2026-02-21T09:18:15.0580644Z cvt.rn.f32.s16 %r7509, %rs2551; 2026-02-21T09:18:15.0580705Z prmt.b32 %r7510, %r7479, 0, 0xaaa2U; 2026-02-21T09:18:15.0580783Z cvt.u16.u32 %rs2552, %r7510; 2026-02-21T09:18:15.0580841Z cvt.rn.f32.s16 %r7511, %rs2552; 2026-02-21T09:18:15.0580930Z prmt.b32 %r7512, %r7478, 0, 0xaaa2U; 2026-02-21T09:18:15.0581010Z cvt.u16.u32 %rs2553, %r7512; 2026-02-21T09:18:15.0581070Z cvt.rn.f32.s16 %r7513, %rs2553; 2026-02-21T09:18:15.0581150Z prmt.b32 %r7514, %r7483, 0, 0xaaa2U; 2026-02-21T09:18:15.0581222Z cvt.u16.u32 %rs2554, %r7514; 2026-02-21T09:18:15.0581283Z cvt.rn.f32.s16 %r7515, %rs2554; 2026-02-21T09:18:15.0581345Z prmt.b32 %r7516, %r7482, 0, 0xaaa2U; 2026-02-21T09:18:15.0581412Z cvt.u16.u32 %rs2555, %r7516; 2026-02-21T09:18:15.0581472Z cvt.rn.f32.s16 %r7517, %rs2555; 2026-02-21T09:18:15.0581559Z prmt.b32 %r7518, %r7481, 0, 0xaaa2U; 2026-02-21T09:18:15.0581631Z cvt.u16.u32 %rs2556, %r7518; 2026-02-21T09:18:15.0581693Z cvt.rn.f32.s16 %r7519, %rs2556; 2026-02-21T09:18:15.0581755Z prmt.b32 %r7520, %r7480, 0, 0xaaa2U; 2026-02-21T09:18:15.0581821Z cvt.u16.u32 %rs2557, %r7520; 2026-02-21T09:18:15.0581884Z cvt.rn.f32.s16 %r7521, %rs2557; 2026-02-21T09:18:15.0581947Z prmt.b32 %r7522, %r7485, 0, 0xaaa2U; 2026-02-21T09:18:15.0582008Z cvt.u16.u32 %rs2558, %r7522; 2026-02-21T09:18:15.0582078Z cvt.rn.f32.s16 %r7523, %rs2558; 2026-02-21T09:18:15.0582141Z prmt.b32 %r7524, %r7484, 0, 0xaaa2U; 2026-02-21T09:18:15.0582203Z cvt.u16.u32 %rs2559, %r7524; 2026-02-21T09:18:15.0582269Z cvt.rn.f32.s16 %r7525, %rs2559; 2026-02-21T09:18:15.0582332Z prmt.b32 %r7526, %r7479, 0, 0xbbb3U; 2026-02-21T09:18:15.0582392Z cvt.u16.u32 %rs2560, %r7526; 2026-02-21T09:18:15.0582453Z cvt.rn.f32.s16 %r7527, %rs2560; 2026-02-21T09:18:15.0582524Z prmt.b32 %r7528, %r7478, 0, 0xbbb3U; 2026-02-21T09:18:15.0582584Z cvt.u16.u32 %rs2561, %r7528; 2026-02-21T09:18:15.0582644Z cvt.rn.f32.s16 %r7529, %rs2561; 2026-02-21T09:18:15.0582713Z 
prmt.b32 %r7530, %r7483, 0, 0xbbb3U; 2026-02-21T09:18:15.0582773Z cvt.u16.u32 %rs2562, %r7530; 2026-02-21T09:18:15.0582834Z cvt.rn.f32.s16 %r7531, %rs2562; 2026-02-21T09:18:15.0582899Z prmt.b32 %r7532, %r7482, 0, 0xbbb3U; 2026-02-21T09:18:15.0582966Z cvt.u16.u32 %rs2563, %r7532; 2026-02-21T09:18:15.0583029Z cvt.rn.f32.s16 %r7533, %rs2563; 2026-02-21T09:18:15.0583093Z prmt.b32 %r7534, %r7481, 0, 0xbbb3U; 2026-02-21T09:18:15.0583164Z cvt.u16.u32 %rs2564, %r7534; 2026-02-21T09:18:15.0583225Z cvt.rn.f32.s16 %r7535, %rs2564; 2026-02-21T09:18:15.0583289Z prmt.b32 %r7536, %r7480, 0, 0xbbb3U; 2026-02-21T09:18:15.0583356Z cvt.u16.u32 %rs2565, %r7536; 2026-02-21T09:18:15.0583416Z cvt.rn.f32.s16 %r7537, %rs2565; 2026-02-21T09:18:15.0583480Z prmt.b32 %r7538, %r7485, 0, 0xbbb3U; 2026-02-21T09:18:15.0583540Z cvt.u16.u32 %rs2566, %r7538; 2026-02-21T09:18:15.0583608Z cvt.rn.f32.s16 %r7539, %rs2566; 2026-02-21T09:18:15.0583672Z prmt.b32 %r7540, %r7484, 0, 0xbbb3U; 2026-02-21T09:18:15.0583735Z cvt.u16.u32 %rs2567, %r7540; 2026-02-21T09:18:15.0583804Z cvt.rn.f32.s16 %r7541, %rs2567; 2026-02-21T09:18:15.0583861Z bar.sync 0; 2026-02-21T09:18:15.0583966Z st.shared.v4.b32 [%r90], {%r7487, %r7489, %r7486, %r7488}; 2026-02-21T09:18:15.0584066Z st.shared.v4.b32 [%r91], {%r7491, %r7493, %r7490, %r7492}; 2026-02-21T09:18:15.0584174Z st.shared.v4.b32 [%r92], {%r7497, %r7501, %r7495, %r7499}; 2026-02-21T09:18:15.0584269Z st.shared.v4.b32 [%r93], {%r7505, %r7509, %r7503, %r7507}; 2026-02-21T09:18:15.0584363Z st.shared.v4.b32 [%r94], {%r7513, %r7517, %r7511, %r7515}; 2026-02-21T09:18:15.0584463Z st.shared.v4.b32 [%r95], {%r7521, %r7525, %r7519, %r7523}; 2026-02-21T09:18:15.0584556Z st.shared.v4.b32 [%r96], {%r7529, %r7533, %r7527, %r7531}; 2026-02-21T09:18:15.0584648Z st.shared.v4.b32 [%r97], {%r7537, %r7541, %r7535, %r7539}; 2026-02-21T09:18:15.0584711Z $L__tmp419: 2026-02-21T09:18:15.0584935Z .loc 2 291 36 // standard.py:291:36 @[ c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:84:40 ] 
2026-02-21T09:18:15.0585028Z // begin inline asm 2026-02-21T09:18:15.0585359Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7283, %r7284, %r7285, %r7286, %r7287, %r7288, %r7289, %r7290, %r7291, %r7292, %r7293, %r7294, %r7295, %r7296, %r7297, %r7298}, [%r7299 + 0], 64; 2026-02-21T09:18:15.0585473Z // end inline asm 2026-02-21T09:18:15.0585537Z // begin inline asm 2026-02-21T09:18:15.0585881Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7300, %r7301, %r7302, %r7303, %r7304, %r7305, %r7306, %r7307, %r7308, %r7309, %r7310, %r7311, %r7312, %r7313, %r7314, %r7315}, [%r7299 + 16], 64; 2026-02-21T09:18:15.0585957Z // end inline asm 2026-02-21T09:18:15.0586019Z // begin inline asm 2026-02-21T09:18:15.0586328Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7317, %r7318, %r7319, %r7320, %r7321, %r7322, %r7323, %r7324, %r7325, %r7326, %r7327, %r7328, %r7329, %r7330, %r7331, %r7332}, [%r7299 + 32], 64; 2026-02-21T09:18:15.0586395Z // end inline asm 2026-02-21T09:18:15.0586455Z // begin inline asm 2026-02-21T09:18:15.0586766Z tcgen05.ld.sync.aligned.16x32bx2.x16.b32 {%r7334, %r7335, %r7336, %r7337, %r7338, %r7339, %r7340, %r7341, %r7342, %r7343, %r7344, %r7345, %r7346, %r7347, %r7348, %r7349}, [%r7299 + 48], 64; 2026-02-21T09:18:15.0586834Z // end inline asm 2026-02-21T09:18:15.0586895Z // begin inline asm 2026-02-21T09:18:15.0586968Z tcgen05.wait::ld.sync.aligned; 2026-02-21T09:18:15.0587033Z // end inline asm 2026-02-21T09:18:15.0587098Z mov.pred %p393, -1; 2026-02-21T09:18:15.0587157Z // begin inline asm 2026-02-21T09:18:15.0587477Z @%p393 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7628 + 0], 64, {%r7283, %r7284, %r7285, %r7286, %r7287, %r7288, %r7289, %r7290, %r7291, %r7292, %r7293, %r7294, %r7295, %r7296, %r7297, %r7298}; 2026-02-21T09:18:15.0587542Z // end inline asm 2026-02-21T09:18:15.0587599Z // begin inline asm 2026-02-21T09:18:15.0587922Z @%p393 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7628 + 16], 64, {%r7300, %r7301, %r7302, %r7303, %r7304, %r7305, %r7306, %r7307, 
%r7308, %r7309, %r7310, %r7311, %r7312, %r7313, %r7314, %r7315}; 2026-02-21T09:18:15.0587989Z // end inline asm 2026-02-21T09:18:15.0588048Z // begin inline asm 2026-02-21T09:18:15.0588366Z @%p393 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7628 + 32], 64, {%r7317, %r7318, %r7319, %r7320, %r7321, %r7322, %r7323, %r7324, %r7325, %r7326, %r7327, %r7328, %r7329, %r7330, %r7331, %r7332}; 2026-02-21T09:18:15.0588432Z // end inline asm 2026-02-21T09:18:15.0588490Z // begin inline asm 2026-02-21T09:18:15.0588814Z @%p393 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7628 + 48], 64, {%r7334, %r7335, %r7336, %r7337, %r7338, %r7339, %r7340, %r7341, %r7342, %r7343, %r7344, %r7345, %r7346, %r7347, %r7348, %r7349}; 2026-02-21T09:18:15.0588877Z // end inline asm 2026-02-21T09:18:15.0588944Z // begin inline asm 2026-02-21T09:18:15.0589012Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0589066Z // end inline asm 2026-02-21T09:18:15.0589128Z bar.sync 0; 2026-02-21T09:18:15.0589185Z // begin inline asm 2026-02-21T09:18:15.0589487Z @%p393 tcgen05.st.sync.aligned.16x32bx2.x16.b32 [%r7419 + 0], 16, {%r7420, %r7421, %r7422, %r7423, %r7424, %r7425, %r7426, %r7427, %r7428, %r7429, %r7430, %r7431, %r7432, %r7433, %r7434, %r7435}; 2026-02-21T09:18:15.0589552Z // end inline asm 2026-02-21T09:18:15.0589607Z // begin inline asm 2026-02-21T09:18:15.0589673Z tcgen05.wait::st.sync.aligned; 2026-02-21T09:18:15.0589737Z // end inline asm 2026-02-21T09:18:15.0589792Z bar.sync 0; 2026-02-21T09:18:15.0589848Z // begin inline asm 2026-02-21T09:18:15.0589919Z fence.proxy.async.shared::cta; 2026-02-21T09:18:15.0589980Z // end inline asm 2026-02-21T09:18:15.0590035Z // begin inline asm 2026-02-21T09:18:15.0590126Z @%p366 mbarrier.init.shared::cta.b64 [%r7557], 1; 2026-02-21T09:18:15.0590187Z // end inline asm 2026-02-21T09:18:15.0590240Z bar.sync 0; 2026-02-21T09:18:15.0590300Z @%p322 bra $L__BB0_54; 2026-02-21T09:18:15.0590399Z // %bb.53: // in Loop: Header=BB0_46 Depth=2 2026-02-21T09:18:15.0590492Z 
elect.sync %r7554|%p375, -1; 2026-02-21T09:18:15.0590552Z mov.b32 %r7544, 69208336; 2026-02-21T09:18:15.0590612Z mov.pred %p374, -1; 2026-02-21T09:18:15.0590694Z // begin inline asm 2026-02-21T09:18:15.0590873Z @%p375 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 0 ], %rd6, %r7544, %p374; 2026-02-21T09:18:15.0590930Z // end inline asm 2026-02-21T09:18:15.0591010Z // begin inline asm 2026-02-21T09:18:15.0591162Z @%p375 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 8 ], %rd7, %r7544, %p374; 2026-02-21T09:18:15.0591216Z // end inline asm 2026-02-21T09:18:15.0591271Z // begin inline asm 2026-02-21T09:18:15.0591426Z @%p375 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 16 ], %rd8, %r7544, %p374; 2026-02-21T09:18:15.0591481Z // end inline asm 2026-02-21T09:18:15.0591582Z // begin inline asm 2026-02-21T09:18:15.0591737Z @%p375 tcgen05.mma.cta_group::1.kind::tf32 [ %r7822 + 0 ], [ %r7543 + 24 ], %rd9, %r7544, %p374; 2026-02-21T09:18:15.0591794Z // end inline asm 2026-02-21T09:18:15.0591855Z cvt.u64.u32 %rd911, %r7557; 2026-02-21T09:18:15.0591918Z // begin inline asm 2026-02-21T09:18:15.0592046Z @%p375 tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [%rd911]; 2026-02-21T09:18:15.0592099Z // end inline asm 2026-02-21T09:18:15.0592163Z bra.uni $L__BB0_54; 2026-02-21T09:18:15.0592216Z $L__tmp420: 2026-02-21T09:18:15.0592299Z $L__BB0_56: // %._crit_edge 2026-02-21T09:18:15.0592468Z .loc 1 19 4 // c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py:19:4 2026-02-21T09:18:15.0592530Z bar.sync 0; 2026-02-21T09:18:15.0592587Z // begin inline asm 2026-02-21T09:18:15.0592703Z @%p6 tcgen05.dealloc.cta_group::1.sync.aligned.b32 %r7822, 512; 2026-02-21T09:18:15.0592762Z // end inline asm 2026-02-21T09:18:15.0592815Z ret; 2026-02-21T09:18:15.0592868Z $L__tmp421: 2026-02-21T09:18:15.0592923Z $L__func_end0: 2026-02-21T09:18:15.0593017Z // -- End function 2026-02-21T09:18:15.0593069Z } 2026-02-21T09:18:15.0593270Z .file 1 
"/tmp/torchinductor_root/35/c35ys7dbesayx7c3pxluriejp2u3fjg2k246rq4y2otu3qxkozdv.py" 2026-02-21T09:18:15.0593452Z .file 2 "/__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py" 2026-02-21T09:18:15.0593516Z .section .debug_abbrev 2026-02-21T09:18:15.0593567Z { 2026-02-21T09:18:15.0593660Z .b8 1 // Abbreviation Code 2026-02-21T09:18:15.0593749Z .b8 17 // DW_TAG_compile_unit 2026-02-21T09:18:15.0593829Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:18:15.0593906Z .b8 37 // DW_AT_producer 2026-02-21T09:18:15.0593991Z .b8 8 // DW_FORM_string 2026-02-21T09:18:15.0594063Z .b8 19 // DW_AT_language 2026-02-21T09:18:15.0594139Z .b8 5 // DW_FORM_data2 2026-02-21T09:18:15.0594224Z .b8 3 // DW_AT_name 2026-02-21T09:18:15.0594297Z .b8 8 // DW_FORM_string 2026-02-21T09:18:15.0594375Z .b8 16 // DW_AT_stmt_list 2026-02-21T09:18:15.0594458Z .b8 6 // DW_FORM_data4 2026-02-21T09:18:15.0594533Z .b8 27 // DW_AT_comp_dir 2026-02-21T09:18:15.0594607Z .b8 8 // DW_FORM_string 2026-02-21T09:18:15.0594678Z .b8 0 // EOM(1) 2026-02-21T09:18:15.0594757Z .b8 0 // EOM(2) 2026-02-21T09:18:15.0594838Z .b8 2 // Abbreviation Code 2026-02-21T09:18:15.0594920Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:18:15.0595005Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:18:15.0595078Z .b8 3 // DW_AT_name 2026-02-21T09:18:15.0595179Z .b8 8 // DW_FORM_string 2026-02-21T09:18:15.0595265Z .b8 32 // DW_AT_inline 2026-02-21T09:18:15.0595364Z .b8 11 // DW_FORM_data1 2026-02-21T09:18:15.0595432Z .b8 0 // EOM(1) 2026-02-21T09:18:15.0595522Z .b8 0 // EOM(2) 2026-02-21T09:18:15.0595646Z .b8 3 // Abbreviation Code 2026-02-21T09:18:15.0595726Z .b8 46 // DW_TAG_subprogram 2026-02-21T09:18:15.0595803Z .b8 1 // DW_CHILDREN_yes 2026-02-21T09:18:15.0595886Z .b8 17 // DW_AT_low_pc 2026-02-21T09:18:15.0595958Z .b8 1 // DW_FORM_addr 2026-02-21T09:18:15.0596035Z .b8 18 // DW_AT_high_pc 2026-02-21T09:18:15.0596114Z .b8 1 // DW_FORM_addr 2026-02-21T09:18:15.0596199Z .b8 49 // DW_AT_abstract_origin 
2026-02-21T09:18:15.0596271Z .b8 19 // DW_FORM_ref4 2026-02-21T09:18:15.0596345Z .b8 0 // EOM(1) 2026-02-21T09:18:15.0596412Z .b8 0 // EOM(2) 2026-02-21T09:18:15.0596492Z .b8 4 // Abbreviation Code 2026-02-21T09:18:15.0596584Z .b8 29 // DW_TAG_inlined_subroutine 2026-02-21T09:18:15.0596667Z .b8 0 // DW_CHILDREN_no 2026-02-21T09:18:15.0596750Z .b8 49 // DW_AT_abstract_origin 2026-02-21T09:18:15.0596823Z .b8 19 // DW_FORM_ref4 2026-02-21T09:18:15.0596902Z .b8 17 // DW_AT_low_pc 2026-02-21T09:18:15.0596974Z .b8 1 // DW_FORM_addr 2026-02-21T09:18:15.0597051Z .b8 18 // DW_AT_high_pc 2026-02-21T09:18:15.0597131Z .b8 1 // DW_FORM_addr 2026-02-21T09:18:15.0597208Z .b8 88 // DW_AT_call_file 2026-02-21T09:18:15.0597283Z .b8 11 // DW_FORM_data1 2026-02-21T09:18:15.0597360Z .b8 89 // DW_AT_call_line 2026-02-21T09:18:15.0597440Z .b8 11 // DW_FORM_data1 2026-02-21T09:18:15.0597520Z .b8 87 // DW_AT_call_column 2026-02-21T09:18:15.0597593Z .b8 11 // DW_FORM_data1 2026-02-21T09:18:15.0597667Z .b8 0 // EOM(1) 2026-02-21T09:18:15.0597734Z .b8 0 // EOM(2) 2026-02-21T09:18:15.0597800Z .b8 0 // EOM(3) 2026-02-21T09:18:15.0597858Z } 2026-02-21T09:18:15.0597919Z .section .debug_info 2026-02-21T09:18:15.0597971Z { 2026-02-21T09:18:15.0598055Z .b32 178 // Length of Unit 2026-02-21T09:18:15.0598146Z .b8 2 // DWARF version number 2026-02-21T09:18:15.0598199Z .b8 0 2026-02-21T09:18:15.0598314Z .b32 .debug_abbrev // Offset Into Abbrev. 
Section 2026-02-21T09:18:15.0598408Z .b8 8 // Address Size (in bytes) 2026-02-21T09:18:15.0598510Z .b8 1 // Abbrev [1] 0xb:0xab DW_TAG_compile_unit 2026-02-21T09:18:15.0598592Z .b8 116 // DW_AT_producer 2026-02-21T09:18:15.0598653Z .b8 114 2026-02-21T09:18:15.0598705Z .b8 105 2026-02-21T09:18:15.0598756Z .b8 116 2026-02-21T09:18:15.0598807Z .b8 111 2026-02-21T09:18:15.0598865Z .b8 110 2026-02-21T09:18:15.0598915Z .b8 0 2026-02-21T09:18:15.0598988Z .b8 2 // DW_AT_language 2026-02-21T09:18:15.0599045Z .b8 0 2026-02-21T09:18:15.0599118Z .b8 99 // DW_AT_name 2026-02-21T09:18:15.0599192Z .b8 51 2026-02-21T09:18:15.0599244Z .b8 53 2026-02-21T09:18:15.0599301Z .b8 121 2026-02-21T09:18:15.0599352Z .b8 115 2026-02-21T09:18:15.0599402Z .b8 55 2026-02-21T09:18:15.0599458Z .b8 100 2026-02-21T09:18:15.0599536Z .b8 98 2026-02-21T09:18:15.0599586Z .b8 101 2026-02-21T09:18:15.0599635Z .b8 115 2026-02-21T09:18:15.0599691Z .b8 97 2026-02-21T09:18:15.0599767Z .b8 121 2026-02-21T09:18:15.0599819Z .b8 120 2026-02-21T09:18:15.0599869Z .b8 55 2026-02-21T09:18:15.0599946Z .b8 99 2026-02-21T09:18:15.0599998Z .b8 51 2026-02-21T09:18:15.0600049Z .b8 112 2026-02-21T09:18:15.0600104Z .b8 120 2026-02-21T09:18:15.0600154Z .b8 108 2026-02-21T09:18:15.0600205Z .b8 117 2026-02-21T09:18:15.0600253Z .b8 114 2026-02-21T09:18:15.0600310Z .b8 105 2026-02-21T09:18:15.0600361Z .b8 101 2026-02-21T09:18:15.0600410Z .b8 106 2026-02-21T09:18:15.0600467Z .b8 112 2026-02-21T09:18:15.0600516Z .b8 50 2026-02-21T09:18:15.0600567Z .b8 117 2026-02-21T09:18:15.0600616Z .b8 51 2026-02-21T09:18:15.0600673Z .b8 102 2026-02-21T09:18:15.0600724Z .b8 106 2026-02-21T09:18:15.0600774Z .b8 103 2026-02-21T09:18:15.0600823Z .b8 50 2026-02-21T09:18:15.0600880Z .b8 107 2026-02-21T09:18:15.0600930Z .b8 50 2026-02-21T09:18:15.0600980Z .b8 52 2026-02-21T09:18:15.0601038Z .b8 54 2026-02-21T09:18:15.0601087Z .b8 114 2026-02-21T09:18:15.0601137Z .b8 113 2026-02-21T09:18:15.0601186Z .b8 52 2026-02-21T09:18:15.0601246Z .b8 121 
2026-02-21T09:18:15.0601296Z .b8 50 2026-02-21T09:18:15.0601346Z .b8 111 2026-02-21T09:18:15.0601405Z .b8 116 2026-02-21T09:18:15.0601455Z .b8 117 2026-02-21T09:18:15.0601507Z .b8 51 2026-02-21T09:18:15.0601586Z .b8 113 2026-02-21T09:18:15.0601647Z .b8 120 2026-02-21T09:18:15.0601701Z .b8 107 2026-02-21T09:18:15.0601753Z .b8 111 2026-02-21T09:18:15.0601803Z .b8 122 2026-02-21T09:18:15.0601860Z .b8 100 2026-02-21T09:18:15.0601910Z .b8 118 2026-02-21T09:18:15.0601961Z .b8 46 2026-02-21T09:18:15.0602018Z .b8 112 2026-02-21T09:18:15.0602069Z .b8 121 2026-02-21T09:18:15.0602119Z .b8 0 2026-02-21T09:18:15.0602209Z .b32 .debug_line // DW_AT_stmt_list 2026-02-21T09:18:15.0602290Z .b8 47 // DW_AT_comp_dir 2026-02-21T09:18:15.0602340Z .b8 116 2026-02-21T09:18:15.0602390Z .b8 109 2026-02-21T09:18:15.0602448Z .b8 112 2026-02-21T09:18:15.0602498Z .b8 47 2026-02-21T09:18:15.0602547Z .b8 116 2026-02-21T09:18:15.0602597Z .b8 111 2026-02-21T09:18:15.0602656Z .b8 114 2026-02-21T09:18:15.0602706Z .b8 99 2026-02-21T09:18:15.0602756Z .b8 104 2026-02-21T09:18:15.0602815Z .b8 105 2026-02-21T09:18:15.0602866Z .b8 110 2026-02-21T09:18:15.0602917Z .b8 100 2026-02-21T09:18:15.0602968Z .b8 117 2026-02-21T09:18:15.0603025Z .b8 99 2026-02-21T09:18:15.0603075Z .b8 116 2026-02-21T09:18:15.0603126Z .b8 111 2026-02-21T09:18:15.0603182Z .b8 114 2026-02-21T09:18:15.0603233Z .b8 95 2026-02-21T09:18:15.0603284Z .b8 114 2026-02-21T09:18:15.0603333Z .b8 111 2026-02-21T09:18:15.0603392Z .b8 111 2026-02-21T09:18:15.0603442Z .b8 116 2026-02-21T09:18:15.0603491Z .b8 47 2026-02-21T09:18:15.0603541Z .b8 51 2026-02-21T09:18:15.0603600Z .b8 53 2026-02-21T09:18:15.0603650Z .b8 0 2026-02-21T09:18:15.0603750Z .b8 2 // Abbrev [2] 0x6c:0x1b DW_TAG_subprogram 2026-02-21T09:18:15.0603829Z .b8 95 // DW_AT_name 2026-02-21T09:18:15.0603882Z .b8 104 2026-02-21T09:18:15.0603932Z .b8 101 2026-02-21T09:18:15.0603983Z .b8 108 2026-02-21T09:18:15.0604043Z .b8 105 2026-02-21T09:18:15.0604093Z .b8 111 2026-02-21T09:18:15.0604144Z 
.b8 110 2026-02-21T09:18:15.0604201Z .b8 95 2026-02-21T09:18:15.0604250Z .b8 109 2026-02-21T09:18:15.0604301Z .b8 97 2026-02-21T09:18:15.0604350Z .b8 116 2026-02-21T09:18:15.0604407Z .b8 109 2026-02-21T09:18:15.0604455Z .b8 117 2026-02-21T09:18:15.0604506Z .b8 108 2026-02-21T09:18:15.0604563Z .b8 95 2026-02-21T09:18:15.0604613Z .b8 98 2026-02-21T09:18:15.0604661Z .b8 102 2026-02-21T09:18:15.0604711Z .b8 49 2026-02-21T09:18:15.0604767Z .b8 54 2026-02-21T09:18:15.0604816Z .b8 95 2026-02-21T09:18:15.0604864Z .b8 105 2026-02-21T09:18:15.0604920Z .b8 110 2026-02-21T09:18:15.0604999Z .b8 116 2026-02-21T09:18:15.0605048Z .b8 52 2026-02-21T09:18:15.0605098Z .b8 0 2026-02-21T09:18:15.0605178Z .b8 1 // DW_AT_inline 2026-02-21T09:18:15.0605299Z .b8 3 // Abbrev [3] 0x87:0x2e DW_TAG_subprogram 2026-02-21T09:18:15.0605410Z .b64 $L__func_begin0 // DW_AT_low_pc 2026-02-21T09:18:15.0605502Z .b64 $L__func_end0 // DW_AT_high_pc 2026-02-21T09:18:15.0605619Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:18:15.0605733Z .b8 4 // Abbrev [4] 0x9c:0x18 DW_TAG_inlined_subroutine 2026-02-21T09:18:15.0605820Z .b32 108 // DW_AT_abstract_origin 2026-02-21T09:18:15.0605907Z .b64 $L__tmp1 // DW_AT_low_pc 2026-02-21T09:18:15.0605992Z .b64 $L__tmp420 // DW_AT_high_pc 2026-02-21T09:18:15.0606067Z .b8 1 // DW_AT_call_file 2026-02-21T09:18:15.0606151Z .b8 84 // DW_AT_call_line 2026-02-21T09:18:15.0606230Z .b8 40 // DW_AT_call_column 2026-02-21T09:18:15.0606314Z .b8 0 // End Of Children Mark 2026-02-21T09:18:15.0606403Z .b8 0 // End Of Children Mark 2026-02-21T09:18:15.0606455Z } 2026-02-21T09:18:15.0606522Z .section .debug_macinfo { } 2026-02-21T09:18:15.0606527Z 2026-02-21T09:18:15.0606609Z ================================================================ 2026-02-21T09:18:15.0606713Z please share the reproducer above with Triton project. 
2026-02-21T09:18:15.2303131Z [682s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', 'first'], loop_orders=[[1, 0]], num_stages=5, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T09:18:15.2303604Z Tensor-likes are not close! 2026-02-21T09:18:15.2303619Z 2026-02-21T09:18:15.2303714Z Mismatched elements: 33436622 / 33554432 (99.6%) 2026-02-21T09:18:15.2303868Z Greatest absolute difference: 1328.0 at index (4076, 5908) (up to 0.01 allowed) 2026-02-21T09:18:15.2304008Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:18:15.2304129Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:18:15.2304134Z 2026-02-21T09:18:15.7552397Z [682s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T09:18:15.7553404Z Tensor-likes are not close! 2026-02-21T09:18:15.7553527Z 2026-02-21T09:18:15.7553617Z Mismatched elements: 33444466 / 33554432 (99.7%) 2026-02-21T09:18:15.7553905Z Greatest absolute difference: 1408.0 at index (289, 7045) (up to 0.01 allowed) 2026-02-21T09:18:15.7554239Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:18:15.7554546Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:18:15.7554710Z 2026-02-21T09:18:15.9554764Z 2026-02-21T09:18:15.9559385Z [683s] Generation 15 complete: 2026-02-21T09:18:15.9559899Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 5.1 configs/s 2026-02-21T09:18:15.9560240Z error=4 2026-02-21T09:18:15.9560385Z ok=14 2026-02-21T09:18:15.9560881Z min=0.2386 2026-02-21T09:18:15.9561080Z mid=1.8591 2026-02-21T09:18:15.9561209Z max=11.8313 2026-02-21T09:18:15.9561367Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:18:15.9562000Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:18:15.9562252Z 'l2_groupings': [2], 2026-02-21T09:18:15.9562436Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:18:15.9562674Z 'loop_orders': [[0, 1]], 2026-02-21T09:18:15.9562838Z 'maxnreg': 128, 2026-02-21T09:18:15.9562997Z 'num_sm_multiplier': 2, 2026-02-21T09:18:15.9563186Z 'num_stages': 3, 2026-02-21T09:18:15.9563334Z 'num_warps': 4, 2026-02-21T09:18:15.9563519Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:18:15.9563716Z 'range_flattens': [None, None], 2026-02-21T09:18:15.9563891Z 'range_multi_buffers': [True, False], 2026-02-21T09:18:15.9564079Z 'range_num_stages': [3, 4], 2026-02-21T09:18:15.9564245Z 'range_unroll_factors': [0, 0], 2026-02-21T09:18:15.9564433Z 'range_warp_specializes': [True, None]} 2026-02-21T09:18:15.9595349Z [683s] Fitting surrogate: 1142 points, 1142 targets 2026-02-21T09:18:16.4376978Z [683s] Generation 16 starting: 17 neighbors, 1 active search path(s) 2026-02-21T09:18:47.1549658Z [714s] Timeout after 30s compiling Config(block_sizes=[16, 256, 8], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=16, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[None, 
None]) 2026-02-21T09:18:47.1570526Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 0.3 configs/s 2026-02-21T09:18:47.2632896Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:18:47.2636699Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:18:47.2641277Z ^ 2026-02-21T09:18:47.2645419Z /tmp/torchinductor_root/fb/cfb3mob5t67awyqzcjgglbdsn4z6cgn7sksqp7cc4e7ievt73vig.py:86:36: note: called from 2026-02-21T09:18:47.2646566Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:18:47.2646794Z ^ 2026-02-21T09:18:47.2647219Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:18:47.2647724Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:18:47.2647980Z ^ 2026-02-21T09:18:47.2648194Z module { 2026-02-21T09:18:47.2650253Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:18:47.2650885Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:18:47.2651124Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:18:47.2651322Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:18:47.2651644Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:18:47.2651891Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:18:47.2652129Z %cst_2 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:18:47.2652375Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32> 2026-02-21T09:18:47.2652622Z %cst_4 = arith.constant dense<1024> : tensor<64x1xi32> 2026-02-21T09:18:47.2652858Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:18:47.2653086Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x32xf32> 
2026-02-21T09:18:47.2653342Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:18:47.2653534Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:18:47.2653725Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:18:47.2653932Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:18:47.2654122Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:18:47.2654324Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T09:18:47.2654794Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:18:47.2655129Z %0 = tt.make_tensor_descriptor %arg2, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T09:18:47.2655514Z %1 = tt.get_program_id x : i32 2026-02-21T09:18:47.2655702Z %2 = arith.divsi %1, %c4096_i32 : i32 2026-02-21T09:18:47.2655954Z %3 = arith.muli %2, %c16_i32 : i32 2026-02-21T09:18:47.2656185Z %4 = arith.subi %c64_i32, %3 : i32 2026-02-21T09:18:47.2656388Z %5 = arith.minsi %4, %c16_i32 : i32 2026-02-21T09:18:47.2656566Z %6 = arith.remsi %1, %c4096_i32 : i32 2026-02-21T09:18:47.2656757Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:18:47.2656930Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:18:47.2657125Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:18:47.2657299Z %10 = arith.muli %8, %c64_i32 : i32 2026-02-21T09:18:47.2657541Z %11 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T09:18:47.2658122Z %12 = tt.splat %10 : i32 -> tensor<64xi32> 2026-02-21T09:18:47.2658328Z %13 = arith.addi %12, %11 : tensor<64xi32> 2026-02-21T09:18:47.2658546Z %14 = arith.muli %9, %c32_i32 : i32 2026-02-21T09:18:47.2658782Z %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:18:47.2659042Z %16 = tt.splat %14 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.2659251Z %17 = arith.addi %16, %15 : tensor<32xi32> 2026-02-21T09:18:47.2659483Z %c64_i32_6 = arith.constant 64 : i32 2026-02-21T09:18:47.2659806Z %18 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c64_i32_6 iter_args(%arg4 = %cst_5) -> (tensor<64x32xf32>) : i32 { 2026-02-21T09:18:47.2660180Z 
%20 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:18:47.2660432Z %21 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T09:18:47.2660649Z %22 = arith.addi %21, %20 : tensor<16xi32> 2026-02-21T09:18:47.2660846Z %23 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:18:47.2661048Z %24 = tt.splat %23 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.2661249Z %25 = arith.addi %24, %15 : tensor<32xi32> 2026-02-21T09:18:47.2661504Z %26 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:18:47.2661821Z %27 = arith.muli %26, %cst_4 : tensor<64x1xi32> 2026-02-21T09:18:47.2662076Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.2662371Z %29 = tt.broadcast %27 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:18:47.2662629Z %30 = tt.broadcast %28 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:18:47.2662868Z %31 = arith.addi %29, %30 : tensor<64x32xi32> 2026-02-21T09:18:47.2663118Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:18:47.2663399Z %33 = tt.addptr %32, %31 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:18:47.2663718Z %34 = tt.load %33 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:18:47.2664018Z %35 = arith.extf %34 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:18:47.2664309Z %36 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:18:47.2664579Z %37 = arith.muli %36, %cst_3 : tensor<16x1xi32> 2026-02-21T09:18:47.2664831Z %38 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.2665116Z %39 = tt.broadcast %37 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.2665365Z %40 = tt.broadcast %38 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.2665600Z %41 = arith.addi %39, %40 : tensor<16x32xi32> 2026-02-21T09:18:47.2665832Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 
2026-02-21T09:18:47.2666107Z %43 = tt.addptr %42, %41 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:18:47.2666358Z %44 = tt.load %43 : tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.2666620Z %45 = arith.shli %44, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2666845Z %46 = arith.shrsi %45, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2667063Z %47 = arith.shrsi %44, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2667344Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:18:47.2667666Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:18:47.2668017Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:18:47.2668340Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.2668651Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.2668935Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:18:47.2669178Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.2669449Z %55 = tt.broadcast %51 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.2669739Z %56 = arith.select %54, %55, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.2670005Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:18:47.2670267Z %58 = tt.broadcast %52 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.2670535Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.2670822Z %60 = arith.select %59, %58, %56 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.2671091Z %61 = tt.reshape %60 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:18:47.2671350Z %62 = arith.sitofp %61 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:18:47.2671745Z %63 = tt.dot %35, %62, %arg4, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> 
tensor<64x32xf32> 2026-02-21T09:18:47.2672062Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:18:47.2672262Z %64 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:18:47.2672455Z %65 = arith.addi %arg3, %64 : i32 2026-02-21T09:18:47.2672689Z %66 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:18:47.2672934Z %67 = tt.splat %65 : i32 -> tensor<16xi32> 2026-02-21T09:18:47.2673141Z %68 = arith.addi %67, %66 : tensor<16xi32> 2026-02-21T09:18:47.2673338Z %69 = arith.muli %65, %c2_i32 : i32 2026-02-21T09:18:47.2673525Z %70 = tt.splat %69 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.2673726Z %71 = arith.addi %70, %15 : tensor<32xi32> 2026-02-21T09:18:47.2673975Z %72 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:18:47.2674244Z %73 = arith.muli %72, %cst_4 : tensor<64x1xi32> 2026-02-21T09:18:47.2674496Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.2674784Z %75 = tt.broadcast %73 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:18:47.2675044Z %76 = tt.broadcast %74 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:18:47.2675272Z %77 = arith.addi %75, %76 : tensor<64x32xi32> 2026-02-21T09:18:47.2675518Z %78 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:18:47.2675803Z %79 = tt.addptr %78, %77 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:18:47.2676112Z %80 = tt.load %79 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:18:47.2676400Z %81 = arith.extf %80 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:18:47.2676680Z %82 = tt.expand_dims %68 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:18:47.2676944Z %83 = arith.muli %82, %cst_3 : tensor<16x1xi32> 2026-02-21T09:18:47.2677193Z %84 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.2677475Z %85 = tt.broadcast %83 : tensor<16x1xi32> -> tensor<16x32xi32> 
2026-02-21T09:18:47.2677723Z %86 = tt.broadcast %84 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.2677991Z %87 = arith.addi %85, %86 : tensor<16x32xi32> 2026-02-21T09:18:47.2678225Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.2678529Z %89 = tt.addptr %88, %87 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:18:47.2678805Z %90 = tt.load %89 : tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.2679035Z %91 = arith.shli %90, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2679253Z %92 = arith.shrsi %91, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2679466Z %93 = arith.shrsi %90, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2679717Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:18:47.2680004Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:18:47.2680316Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:18:47.2680642Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.2680962Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.2681247Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:18:47.2681502Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.2681813Z %101 = tt.broadcast %97 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.2682108Z %102 = arith.select %100, %101, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.2682383Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:18:47.2682644Z %104 = tt.broadcast %98 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.2682918Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.2683213Z %106 = arith.select %105, %104, %102 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.2683504Z %107 = tt.reshape 
%106 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:18:47.2683771Z %108 = arith.sitofp %107 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:18:47.2684126Z %109 = tt.dot %81, %108, %63, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:18:47.2684437Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:18:47.2684638Z %110 = arith.muli %c16_i32, %c2_i32_7 : i32 2026-02-21T09:18:47.2684831Z %111 = arith.addi %arg3, %110 : i32 2026-02-21T09:18:47.2685069Z %112 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:18:47.2685315Z %113 = tt.splat %111 : i32 -> tensor<16xi32> 2026-02-21T09:18:47.2685526Z %114 = arith.addi %113, %112 : tensor<16xi32> 2026-02-21T09:18:47.2685725Z %115 = arith.muli %111, %c2_i32 : i32 2026-02-21T09:18:47.2685920Z %116 = tt.splat %115 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.2686141Z %117 = arith.addi %116, %15 : tensor<32xi32> 2026-02-21T09:18:47.2686403Z %118 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:18:47.2686697Z %119 = arith.muli %118, %cst_4 : tensor<64x1xi32> 2026-02-21T09:18:47.2686973Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.2687289Z %121 = tt.broadcast %119 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:18:47.2687576Z %122 = tt.broadcast %120 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:18:47.2687826Z %123 = arith.addi %121, %122 : tensor<64x32xi32> 2026-02-21T09:18:47.2688088Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:18:47.2688387Z %125 = tt.addptr %124, %123 : tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:18:47.2688722Z %126 = tt.load %125 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:18:47.2689039Z %127 = arith.extf %126 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:18:47.2689370Z %128 = tt.expand_dims %114 {axis = 1 : i32} : tensor<16xi32> -> 
tensor<16x1xi32> 2026-02-21T09:18:47.2689657Z %129 = arith.muli %128, %cst_3 : tensor<16x1xi32> 2026-02-21T09:18:47.2689949Z %130 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.2690302Z %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.2690601Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.2690856Z %133 = arith.addi %131, %132 : tensor<16x32xi32> 2026-02-21T09:18:47.2691131Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.2691418Z %135 = tt.addptr %134, %133 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:18:47.2691716Z %136 = tt.load %135 : tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.2691961Z %137 = arith.shli %136, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2692199Z %138 = arith.shrsi %137, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2692444Z %139 = arith.shrsi %136, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2692701Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:18:47.2693015Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:18:47.2693336Z %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:18:47.2693681Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.2694025Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.2694312Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:18:47.2694568Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.2694842Z %147 = tt.broadcast %143 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.2695140Z %148 = arith.select %146, %147, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.2695417Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 
2026-02-21T09:18:47.2695674Z %150 = tt.broadcast %144 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.2695955Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.2696233Z %152 = arith.select %151, %150, %148 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.2696517Z %153 = tt.reshape %152 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:18:47.2696780Z %154 = arith.sitofp %153 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:18:47.2697139Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:18:47.2697465Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:18:47.2697655Z %156 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:18:47.2697856Z %157 = arith.addi %arg3, %156 : i32 2026-02-21T09:18:47.2698079Z %158 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:18:47.2698334Z %159 = tt.splat %157 : i32 -> tensor<16xi32> 2026-02-21T09:18:47.2698535Z %160 = arith.addi %159, %158 : tensor<16xi32> 2026-02-21T09:18:47.2698738Z %161 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:18:47.2698936Z %162 = tt.splat %161 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.2699135Z %163 = arith.addi %162, %15 : tensor<32xi32> 2026-02-21T09:18:47.2699390Z %164 = tt.expand_dims %13 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:18:47.2699655Z %165 = arith.muli %164, %cst_4 : tensor<64x1xi32> 2026-02-21T09:18:47.2699917Z %166 = tt.expand_dims %163 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.2700203Z %167 = tt.broadcast %165 : tensor<64x1xi32> -> tensor<64x32xi32> 2026-02-21T09:18:47.2700505Z %168 = tt.broadcast %166 : tensor<1x32xi32> -> tensor<64x32xi32> 2026-02-21T09:18:47.2700749Z %169 = arith.addi %167, %168 : tensor<64x32xi32> 2026-02-21T09:18:47.2700993Z %170 = tt.splat %arg0 : !tt.ptr -> tensor<64x32x!tt.ptr> 2026-02-21T09:18:47.2701339Z %171 = tt.addptr %170, %169 
: tensor<64x32x!tt.ptr>, tensor<64x32xi32> 2026-02-21T09:18:47.2701699Z %172 = tt.load %171 evictionPolicy = evict_first : tensor<64x32x!tt.ptr> 2026-02-21T09:18:47.2702005Z %173 = arith.extf %172 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T09:18:47.2702300Z %174 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:18:47.2702569Z %175 = arith.muli %174, %cst_3 : tensor<16x1xi32> 2026-02-21T09:18:47.2702837Z %176 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.2703124Z %177 = tt.broadcast %175 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.2703389Z %178 = tt.broadcast %176 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.2703626Z %179 = arith.addi %177, %178 : tensor<16x32xi32> 2026-02-21T09:18:47.2703872Z %180 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.2704157Z %181 = tt.addptr %180, %179 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:18:47.2704415Z %182 = tt.load %181 : tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.2704647Z %183 = arith.shli %182, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2704878Z %184 = arith.shrsi %183, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2705109Z %185 = arith.shrsi %182, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.2705354Z %186 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:18:47.2705654Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:18:47.2705972Z %188 = tt.expand_dims %187 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:18:47.2706291Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.2706619Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.2706902Z %191 = arith.cmpi eq, %188, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:18:47.2707162Z %192 = tt.broadcast %191 : 
tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.2707438Z %193 = tt.broadcast %189 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.2707726Z %194 = arith.select %192, %193, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.2708006Z %195 = arith.cmpi eq, %188, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:18:47.2708258Z %196 = tt.broadcast %190 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.2708532Z %197 = tt.broadcast %195 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.2708809Z %198 = arith.select %197, %196, %194 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.2709115Z %199 = tt.reshape %198 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:18:47.2709385Z %200 = arith.sitofp %199 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:18:47.2709738Z %201 = tt.dot %173, %200, %155, inputPrecision = tf32 : tensor<64x32xf32> * tensor<32x32xf32> -> tensor<64x32xf32> 2026-02-21T09:18:47.2710063Z scf.yield %201 : tensor<64x32xf32> 2026-02-21T09:18:47.2710263Z } {tt.disallow_acc_multi_buffer, tt.flatten} 2026-02-21T09:18:47.2710509Z %19 = arith.truncf %18 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T09:18:47.2710835Z tt.descriptor_store %0[%10, %14], %19 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T09:18:47.2711108Z tt.return 2026-02-21T09:18:47.2711246Z } 2026-02-21T09:18:47.2711368Z } 2026-02-21T09:18:47.2711445Z 2026-02-21T09:18:47.2711499Z {-# 2026-02-21T09:18:47.2711660Z external_resources: { 2026-02-21T09:18:47.2711855Z mlir_reproducer: { 2026-02-21T09:18:47.2716297Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:18:47.2720791Z disable_threading: false, 2026-02-21T09:18:47.2720971Z verify_each: true 2026-02-21T09:18:47.2721118Z } 2026-02-21T09:18:47.2721248Z } 2026-02-21T09:18:47.2721367Z #-} 2026-02-21T09:18:47.2721819Z /tmp/torchinductor_root/fb/cfb3mob5t67awyqzcjgglbdsn4z6cgn7sksqp7cc4e7ievt73vig.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:18:47.2722856Z /tmp/torchinductor_root/fb/cfb3mob5t67awyqzcjgglbdsn4z6cgn7sksqp7cc4e7ievt73vig.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, 
please share the reproducer above with Triton project.` 2026-02-21T09:18:47.2723670Z [714s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:18:47.2724737Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=5, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:18:47.2725699Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:18:47.2725956Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:18:47.4476933Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:18:47.4481395Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:18:47.4485794Z ^ 2026-02-21T09:18:47.4487469Z /tmp/torchinductor_root/c3/cc3zj7zz3a5ushie43j4d3l7zlc6vaf5gm55otashy7ejcildg24.py:86:36: note: called from 2026-02-21T09:18:47.4487884Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:18:47.4488100Z ^ 2026-02-21T09:18:47.4488499Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:18:47.4489157Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:18:47.4489399Z ^ 2026-02-21T09:18:47.4489904Z module { 2026-02-21T09:18:47.4494944Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:18:47.4495748Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 
2026-02-21T09:18:47.4496000Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:18:47.4496237Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:18:47.4496500Z %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:18:47.4496754Z %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:18:47.4497033Z %cst_2 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:18:47.4497303Z %cst_3 = arith.constant dense<8192> : tensor<16x1xi32> 2026-02-21T09:18:47.4497592Z %cst_4 = arith.constant dense<1024> : tensor<128x1xi32> 2026-02-21T09:18:47.4497833Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:18:47.4498090Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<128x32xf32> 2026-02-21T09:18:47.4498387Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:18:47.4498602Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:18:47.4498814Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:18:47.4499008Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:18:47.4499215Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:18:47.4499395Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T09:18:47.4499611Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:18:47.4499937Z %0 = tt.make_tensor_descriptor %arg2, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T09:18:47.4500270Z %1 = tt.get_program_id x : i32 2026-02-21T09:18:47.4500449Z %2 = arith.divsi %1, %c4096_i32 : i32 2026-02-21T09:18:47.4500643Z %3 = arith.muli %2, %c16_i32 : i32 2026-02-21T09:18:47.4500815Z %4 = arith.subi %c32_i32, %3 : i32 2026-02-21T09:18:47.4500996Z %5 = arith.minsi %4, %c16_i32 : i32 2026-02-21T09:18:47.4501175Z %6 = arith.remsi %1, %c4096_i32 : i32 2026-02-21T09:18:47.4501359Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:18:47.4501530Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:18:47.4501781Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:18:47.4501962Z %10 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:18:47.4502200Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : 
tensor<128xi32> 2026-02-21T09:18:47.4502466Z %12 = tt.splat %10 : i32 -> tensor<128xi32> 2026-02-21T09:18:47.4502666Z %13 = arith.addi %12, %11 : tensor<128xi32> 2026-02-21T09:18:47.4502860Z %14 = arith.muli %9, %c32_i32 : i32 2026-02-21T09:18:47.4503083Z %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:18:47.4503335Z %16 = tt.splat %14 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.4503538Z %17 = arith.addi %16, %15 : tensor<32xi32> 2026-02-21T09:18:47.4503725Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:18:47.4504046Z %18 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c64_i32 iter_args(%arg4 = %cst_5) -> (tensor<128x32xf32>) : i32 { 2026-02-21T09:18:47.4504410Z %20 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:18:47.4504669Z %21 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T09:18:47.4504873Z %22 = arith.addi %21, %20 : tensor<16xi32> 2026-02-21T09:18:47.4505073Z %23 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:18:47.4505270Z %24 = tt.splat %23 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.4505466Z %25 = arith.addi %24, %15 : tensor<32xi32> 2026-02-21T09:18:47.4505723Z %26 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:18:47.4505994Z %27 = arith.muli %26, %cst_4 : tensor<128x1xi32> 2026-02-21T09:18:47.4506315Z %28 = tt.expand_dims %25 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.4506605Z %29 = tt.broadcast %27 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:18:47.4506913Z %30 = tt.broadcast %28 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:18:47.4507193Z %31 = arith.addi %29, %30 : tensor<128x32xi32> 2026-02-21T09:18:47.4507471Z %32 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:18:47.4507771Z %33 = tt.addptr %32, %31 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:18:47.4508082Z %34 = tt.load %33 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 
2026-02-21T09:18:47.4508388Z %35 = arith.extf %34 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:18:47.4508672Z %36 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:18:47.4508945Z %37 = arith.muli %36, %cst_3 : tensor<16x1xi32> 2026-02-21T09:18:47.4509201Z %38 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.4509481Z %39 = tt.broadcast %37 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.4509745Z %40 = tt.broadcast %38 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.4509976Z %41 = arith.addi %39, %40 : tensor<16x32xi32> 2026-02-21T09:18:47.4510220Z %42 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.4510490Z %43 = tt.addptr %42, %41 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:18:47.4510746Z %44 = tt.load %43 : tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.4510960Z %45 = arith.shli %44, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4511172Z %46 = arith.shrsi %45, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4511395Z %47 = arith.shrsi %44, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4511678Z %48 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:18:47.4511975Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:18:47.4512276Z %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:18:47.4512598Z %51 = tt.expand_dims %46 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.4512926Z %52 = tt.expand_dims %47 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.4513201Z %53 = arith.cmpi eq, %50, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:18:47.4513453Z %54 = tt.broadcast %53 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.4513716Z %55 = tt.broadcast %51 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.4514001Z %56 = arith.select %54, %55, 
%cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.4514275Z %57 = arith.cmpi eq, %50, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:18:47.4514522Z %58 = tt.broadcast %52 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.4514794Z %59 = tt.broadcast %57 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.4515061Z %60 = arith.select %59, %58, %56 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.4515337Z %61 = tt.reshape %60 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:18:47.4515589Z %62 = arith.sitofp %61 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:18:47.4515948Z %63 = tt.dot %35, %62, %arg4, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x32xf32> -> tensor<128x32xf32> 2026-02-21T09:18:47.4516278Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:18:47.4516466Z %64 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:18:47.4516679Z %65 = arith.addi %arg3, %64 : i32 2026-02-21T09:18:47.4516906Z %66 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:18:47.4517158Z %67 = tt.splat %65 : i32 -> tensor<16xi32> 2026-02-21T09:18:47.4517354Z %68 = arith.addi %67, %66 : tensor<16xi32> 2026-02-21T09:18:47.4517584Z %69 = arith.muli %65, %c2_i32 : i32 2026-02-21T09:18:47.4517769Z %70 = tt.splat %69 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.4517968Z %71 = arith.addi %70, %15 : tensor<32xi32> 2026-02-21T09:18:47.4518248Z %72 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:18:47.4518541Z %73 = arith.muli %72, %cst_4 : tensor<128x1xi32> 2026-02-21T09:18:47.4518825Z %74 = tt.expand_dims %71 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.4519109Z %75 = tt.broadcast %73 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:18:47.4519372Z %76 = tt.broadcast %74 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:18:47.4519603Z %77 = arith.addi %75, %76 : tensor<128x32xi32> 2026-02-21T09:18:47.4519852Z %78 = tt.splat 
%arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:18:47.4520141Z %79 = tt.addptr %78, %77 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:18:47.4520443Z %80 = tt.load %79 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:18:47.4520744Z %81 = arith.extf %80 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:18:47.4521019Z %82 = tt.expand_dims %68 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:18:47.4521281Z %83 = arith.muli %82, %cst_3 : tensor<16x1xi32> 2026-02-21T09:18:47.4521531Z %84 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.4521848Z %85 = tt.broadcast %83 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.4522109Z %86 = tt.broadcast %84 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.4522338Z %87 = arith.addi %85, %86 : tensor<16x32xi32> 2026-02-21T09:18:47.4522572Z %88 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.4522839Z %89 = tt.addptr %88, %87 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:18:47.4523090Z %90 = tt.load %89 : tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.4523303Z %91 = arith.shli %90, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4523512Z %92 = arith.shrsi %91, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4523726Z %93 = arith.shrsi %90, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4523966Z %94 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:18:47.4524259Z %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:18:47.4524559Z %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:18:47.4524875Z %97 = tt.expand_dims %92 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.4525194Z %98 = tt.expand_dims %93 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.4525465Z %99 = arith.cmpi eq, %96, %cst_1 : tensor<1x2x1xi32> 
2026-02-21T09:18:47.4525719Z %100 = tt.broadcast %99 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.4525985Z %101 = tt.broadcast %97 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.4526280Z %102 = arith.select %100, %101, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.4526553Z %103 = arith.cmpi eq, %96, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:18:47.4526818Z %104 = tt.broadcast %98 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.4527092Z %105 = tt.broadcast %103 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.4527368Z %106 = arith.select %105, %104, %102 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.4527654Z %107 = tt.reshape %106 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:18:47.4527916Z %108 = arith.sitofp %107 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:18:47.4528280Z %109 = tt.dot %81, %108, %63, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x32xf32> -> tensor<128x32xf32> 2026-02-21T09:18:47.4528636Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:18:47.4528829Z %110 = arith.muli %c16_i32, %c2_i32_6 : i32 2026-02-21T09:18:47.4529062Z %111 = arith.addi %arg3, %110 : i32 2026-02-21T09:18:47.4529297Z %112 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:18:47.4529595Z %113 = tt.splat %111 : i32 -> tensor<16xi32> 2026-02-21T09:18:47.4529834Z %114 = arith.addi %113, %112 : tensor<16xi32> 2026-02-21T09:18:47.4530037Z %115 = arith.muli %111, %c2_i32 : i32 2026-02-21T09:18:47.4530230Z %116 = tt.splat %115 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.4530432Z %117 = arith.addi %116, %15 : tensor<32xi32> 2026-02-21T09:18:47.4530692Z %118 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:18:47.4530968Z %119 = arith.muli %118, %cst_4 : tensor<128x1xi32> 2026-02-21T09:18:47.4531238Z %120 = tt.expand_dims %117 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 
2026-02-21T09:18:47.4531559Z %121 = tt.broadcast %119 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:18:47.4531837Z %122 = tt.broadcast %120 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:18:47.4532085Z %123 = arith.addi %121, %122 : tensor<128x32xi32> 2026-02-21T09:18:47.4532329Z %124 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:18:47.4532634Z %125 = tt.addptr %124, %123 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:18:47.4532955Z %126 = tt.load %125 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:18:47.4533277Z %127 = arith.extf %126 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:18:47.4533575Z %128 = tt.expand_dims %114 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:18:47.4533860Z %129 = arith.muli %128, %cst_3 : tensor<16x1xi32> 2026-02-21T09:18:47.4534133Z %130 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.4534433Z %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.4534708Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.4534954Z %133 = arith.addi %131, %132 : tensor<16x32xi32> 2026-02-21T09:18:47.4535206Z %134 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.4535493Z %135 = tt.addptr %134, %133 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:18:47.4535770Z %136 = tt.load %135 : tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.4535997Z %137 = arith.shli %136, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4536222Z %138 = arith.shrsi %137, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4536455Z %139 = arith.shrsi %136, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4536706Z %140 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:18:47.4537014Z %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:18:47.4537339Z %142 = tt.expand_dims %141 {axis = 2 : i32} 
: tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:18:47.4537675Z %143 = tt.expand_dims %138 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.4538020Z %144 = tt.expand_dims %139 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.4538321Z %145 = arith.cmpi eq, %142, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:18:47.4538590Z %146 = tt.broadcast %145 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.4538871Z %147 = tt.broadcast %143 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.4539177Z %148 = arith.select %146, %147, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.4539471Z %149 = arith.cmpi eq, %142, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:18:47.4539731Z %150 = tt.broadcast %144 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.4540072Z %151 = tt.broadcast %149 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.4540360Z %152 = arith.select %151, %150, %148 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.4540675Z %153 = tt.reshape %152 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:18:47.4540962Z %154 = arith.sitofp %153 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:18:47.4541357Z %155 = tt.dot %127, %154, %109, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x32xf32> -> tensor<128x32xf32> 2026-02-21T09:18:47.4541714Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:18:47.4541904Z %156 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:18:47.4542105Z %157 = arith.addi %arg3, %156 : i32 2026-02-21T09:18:47.4542334Z %158 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:18:47.4542592Z %159 = tt.splat %157 : i32 -> tensor<16xi32> 2026-02-21T09:18:47.4542802Z %160 = arith.addi %159, %158 : tensor<16xi32> 2026-02-21T09:18:47.4543009Z %161 = arith.muli %157, %c2_i32 : i32 2026-02-21T09:18:47.4543216Z %162 = tt.splat %161 : i32 -> tensor<32xi32> 2026-02-21T09:18:47.4543425Z %163 = 
arith.addi %162, %15 : tensor<32xi32> 2026-02-21T09:18:47.4543697Z %164 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:18:47.4543977Z %165 = arith.muli %164, %cst_4 : tensor<128x1xi32> 2026-02-21T09:18:47.4544248Z %166 = tt.expand_dims %163 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.4544544Z %167 = tt.broadcast %165 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:18:47.4544820Z %168 = tt.broadcast %166 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:18:47.4545066Z %169 = arith.addi %167, %168 : tensor<128x32xi32> 2026-02-21T09:18:47.4545311Z %170 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:18:47.4545617Z %171 = tt.addptr %170, %169 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:18:47.4545933Z %172 = tt.load %171 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:18:47.4546246Z %173 = arith.extf %172 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:18:47.4546546Z %174 = tt.expand_dims %160 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:18:47.4546814Z %175 = arith.muli %174, %cst_3 : tensor<16x1xi32> 2026-02-21T09:18:47.4547078Z %176 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:18:47.4547366Z %177 = tt.broadcast %175 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.4547637Z %178 = tt.broadcast %176 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:18:47.4547875Z %179 = arith.addi %177, %178 : tensor<16x32xi32> 2026-02-21T09:18:47.4548117Z %180 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.4548402Z %181 = tt.addptr %180, %179 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:18:47.4548657Z %182 = tt.load %181 : tensor<16x32x!tt.ptr> 2026-02-21T09:18:47.4548877Z %183 = arith.shli %182, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4549093Z %184 = arith.shrsi %183, %cst_2 : tensor<16x32xi8> 
2026-02-21T09:18:47.4549319Z %185 = arith.shrsi %182, %cst_2 : tensor<16x32xi8> 2026-02-21T09:18:47.4549565Z %186 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:18:47.4549886Z %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:18:47.4550203Z %188 = tt.expand_dims %187 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:18:47.4550527Z %189 = tt.expand_dims %184 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.4550858Z %190 = tt.expand_dims %185 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:18:47.4551188Z %191 = arith.cmpi eq, %188, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:18:47.4551441Z %192 = tt.broadcast %191 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.4551746Z %193 = tt.broadcast %189 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.4552085Z %194 = arith.select %192, %193, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.4552394Z %195 = arith.cmpi eq, %188, %cst_0 : tensor<1x2x1xi32> 2026-02-21T09:18:47.4552647Z %196 = tt.broadcast %190 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:18:47.4552925Z %197 = tt.broadcast %195 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:18:47.4553211Z %198 = arith.select %197, %196, %194 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:18:47.4553495Z %199 = tt.reshape %198 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:18:47.4553766Z %200 = arith.sitofp %199 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:18:47.4554124Z %201 = tt.dot %173, %200, %155, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x32xf32> -> tensor<128x32xf32> 2026-02-21T09:18:47.4554453Z scf.yield %201 : tensor<128x32xf32> 2026-02-21T09:18:47.4554661Z } {tt.disallow_acc_multi_buffer, tt.flatten} 2026-02-21T09:18:47.4554904Z %19 = arith.truncf %18 : tensor<128x32xf32> to tensor<128x32xbf16> 2026-02-21T09:18:47.4555235Z 
tt.descriptor_store %0[%10, %14], %19 : !tt.tensordesc>, tensor<128x32xbf16> 2026-02-21T09:18:47.4555507Z tt.return 2026-02-21T09:18:47.4555645Z } 2026-02-21T09:18:47.4555765Z } 2026-02-21T09:18:47.4555843Z 2026-02-21T09:18:47.4555896Z {-# 2026-02-21T09:18:47.4556030Z external_resources: { 2026-02-21T09:18:47.4556200Z mlir_reproducer: { 2026-02-21T09:18:47.4560580Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, 
triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:18:47.4564971Z disable_threading: false, 2026-02-21T09:18:47.4565148Z verify_each: true 2026-02-21T09:18:47.4565291Z } 2026-02-21T09:18:47.4565418Z } 2026-02-21T09:18:47.4565534Z #-} 2026-02-21T09:18:47.4565963Z /tmp/torchinductor_root/c3/cc3zj7zz3a5ushie43j4d3l7zlc6vaf5gm55otashy7ejcildg24.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:18:47.4566992Z /tmp/torchinductor_root/c3/cc3zj7zz3a5ushie43j4d3l7zlc6vaf5gm55otashy7ejcildg24.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:18:47.4567856Z [714s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:18:47.4568970Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 32], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=5, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:18:47.4569925Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:18:47.4570178Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:18:48.4204226Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 12.3 configs/s
2026-02-21T09:18:48.4208585Z [715s] Generation 16 complete:
2026-02-21T09:18:48.4213571Z error=8
2026-02-21T09:18:48.4214991Z timeout=1
2026-02-21T09:18:48.4215165Z ok=10
2026-02-21T09:18:48.4215307Z min=0.2386
2026-02-21T09:18:48.4215436Z mid=1.2109
2026-02-21T09:18:48.4215567Z max=21.7431
2026-02-21T09:18:48.4215719Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:18:48.4215991Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:18:48.4216234Z 'l2_groupings': [2],
2026-02-21T09:18:48.4216419Z 'load_eviction_policies': ['first', 'first'],
2026-02-21T09:18:48.4216623Z 'loop_orders': [[0, 1]],
2026-02-21T09:18:48.4216778Z 'maxnreg': 128,
2026-02-21T09:18:48.4216930Z 'num_sm_multiplier': 2,
2026-02-21T09:18:48.4217085Z 'num_stages': 3,
2026-02-21T09:18:48.4217229Z 'num_warps': 4,
2026-02-21T09:18:48.4217387Z 'pid_type': 'persistent_interleaved',
2026-02-21T09:18:48.4217589Z 'range_flattens': [None, None],
2026-02-21T09:18:48.4217772Z 'range_multi_buffers': [True, False],
2026-02-21T09:18:48.4217960Z 'range_num_stages': [3, 4],
2026-02-21T09:18:48.4218126Z 'range_unroll_factors': [0, 0],
2026-02-21T09:18:48.4218312Z 'range_warp_specializes': [True, None]}
2026-02-21T09:18:48.4243523Z [715s] Fitting surrogate: 1161 points, 1161 targets
2026-02-21T09:18:48.9026762Z [716s] Generation 17 starting: 17 neighbors, 1 active search path(s)
2026-02-21T09:18:58.5172774Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 1.1 configs/s
2026-02-21T09:18:59.6591958Z [726s] Generation 17 complete:
2026-02-21T09:18:59.6592484Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 15.4 configs/s
2026-02-21T09:18:59.6592885Z error=5
2026-02-21T09:18:59.6596565Z ok=14
2026-02-21T09:18:59.6596812Z min=0.2386
2026-02-21T09:18:59.6596979Z mid=1.3077
2026-02-21T09:18:59.6597150Z max=19.6818
2026-02-21T09:18:59.6597336Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:18:59.6597642Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:18:59.6597972Z 'l2_groupings': [2],
2026-02-21T09:18:59.6598183Z 'load_eviction_policies': ['first', 'first'],
2026-02-21T09:18:59.6598419Z 'loop_orders': [[0, 1]],
2026-02-21T09:18:59.6598589Z 'maxnreg': 128,
2026-02-21T09:18:59.6598738Z 'num_sm_multiplier': 2,
2026-02-21T09:18:59.6598915Z 'num_stages': 3,
2026-02-21T09:18:59.6599059Z 'num_warps': 4,
2026-02-21T09:18:59.6599236Z 'pid_type': 'persistent_interleaved',
2026-02-21T09:18:59.6599443Z 'range_flattens': [None, None],
2026-02-21T09:18:59.6599632Z 'range_multi_buffers': [True, False],
2026-02-21T09:18:59.6599818Z 'range_num_stages': [3, 4],
2026-02-21T09:18:59.6599997Z 'range_unroll_factors': [0, 0],
2026-02-21T09:18:59.6600181Z 'range_warp_specializes': [True, None]}
2026-02-21T09:18:59.6642016Z [726s] Fitting surrogate: 1180 points, 1180 targets
2026-02-21T09:19:00.1726317Z [727s] Generation 18 starting: 15 neighbors, 1 active search path(s)
2026-02-21T09:19:10.4818498Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 1.3 configs/s
2026-02-21T09:19:12.2576781Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 9.0 configs/s
2026-02-21T09:19:12.2582288Z [739s] Generation 18 complete:
2026-02-21T09:19:12.2587521Z ok=17
2026-02-21T09:19:12.2588968Z min=0.2386
2026-02-21T09:19:12.2589139Z mid=2.2017
2026-02-21T09:19:12.2589491Z max=59.5456
2026-02-21T09:19:12.2589724Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:19:12.2589996Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:19:12.2590242Z 'l2_groupings': [2],
2026-02-21T09:19:12.2590432Z 'load_eviction_policies': ['first', 'first'],
2026-02-21T09:19:12.2590650Z 'loop_orders': [[0, 1]],
2026-02-21T09:19:12.2590809Z 'maxnreg': 128,
2026-02-21T09:19:12.2590968Z 'num_sm_multiplier': 2,
2026-02-21T09:19:12.2591129Z 'num_stages': 3,
2026-02-21T09:19:12.2591283Z 'num_warps': 4, 2026-02-21T09:19:12.2591441Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:19:12.2591885Z 'range_flattens': [None, None], 2026-02-21T09:19:12.2592075Z 'range_multi_buffers': [True, False], 2026-02-21T09:19:12.2592290Z 'range_num_stages': [3, 4], 2026-02-21T09:19:12.2592470Z 'range_unroll_factors': [0, 0], 2026-02-21T09:19:12.2592687Z 'range_warp_specializes': [True, None]} 2026-02-21T09:19:12.2616710Z [739s] Fitting surrogate: 1197 points, 1197 targets 2026-02-21T09:19:12.7417394Z [739s] Generation 19 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:19:19.8158594Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 1.7 configs/s 2026-02-21T09:19:21.4604554Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:19:21.4609029Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:19:21.4609329Z ^ 2026-02-21T09:19:21.4609692Z /tmp/torchinductor_root/sj/csj3wzsqt77qbpdzmruupjdridyeqfd4cjxyftuaqj4eu34ppasa.py:78:36: note: called from 2026-02-21T09:19:21.4610110Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:19:21.4610311Z ^ 2026-02-21T09:19:21.4610721Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:19:21.4611194Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:19:21.4611434Z ^ 2026-02-21T09:19:21.4611806Z module { 2026-02-21T09:19:21.4612323Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:19:21.4612912Z %cst = arith.constant dense<0> : tensor<16x2x32xi8> 2026-02-21T09:19:21.4613153Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:19:21.4613339Z 
%c0_i32 = arith.constant 0 : i32 2026-02-21T09:19:21.4613779Z %cst_0 = arith.constant dense<8192> : tensor<256x1xi32> 2026-02-21T09:19:21.4614033Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:19:21.4614339Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:19:21.4614586Z %cst_3 = arith.constant dense<4> : tensor<16x32xi8> 2026-02-21T09:19:21.4614831Z %cst_4 = arith.constant dense<8192> : tensor<16x1xi32> 2026-02-21T09:19:21.4615089Z %cst_5 = arith.constant dense<1024> : tensor<256x1xi32> 2026-02-21T09:19:21.4615312Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:19:21.4615546Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<256x32xf32> 2026-02-21T09:19:21.4615787Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:19:21.4615989Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:19:21.4616187Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:19:21.4616376Z %0 = tt.get_program_id x : i32 2026-02-21T09:19:21.4616571Z %1 = arith.divsi %0, %c256_i32 : i32 2026-02-21T09:19:21.4616759Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T09:19:21.4616949Z %3 = arith.subi %c256_i32, %2 : i32 2026-02-21T09:19:21.4617132Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T09:19:21.4617323Z %5 = arith.remsi %0, %c256_i32 : i32 2026-02-21T09:19:21.4617509Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:19:21.4617750Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:19:21.4617970Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:19:21.4618155Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T09:19:21.4618401Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:19:21.4618664Z %11 = tt.splat %9 : i32 -> tensor<32xi32> 2026-02-21T09:19:21.4618880Z %12 = arith.addi %11, %10 : tensor<32xi32> 2026-02-21T09:19:21.4619076Z %13 = arith.muli %8, %c256_i32 : i32 2026-02-21T09:19:21.4619328Z %14 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T09:19:21.4619587Z %15 = tt.splat %13 : i32 -> tensor<256xi32> 
2026-02-21T09:19:21.4619803Z %16 = arith.addi %15, %14 : tensor<256xi32> 2026-02-21T09:19:21.4620003Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:19:21.4620330Z %17 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c64_i32 iter_args(%arg4 = %cst_6) -> (tensor<256x32xf32>) : i32 { 2026-02-21T09:19:21.4620721Z %27 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:19:21.4620984Z %28 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T09:19:21.4621205Z %29 = arith.addi %28, %27 : tensor<16xi32> 2026-02-21T09:19:21.4621406Z %30 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:19:21.4621673Z %31 = tt.splat %30 : i32 -> tensor<32xi32> 2026-02-21T09:19:21.4621877Z %32 = arith.addi %31, %10 : tensor<32xi32> 2026-02-21T09:19:21.4622127Z %33 = tt.expand_dims %16 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:19:21.4622402Z %34 = arith.muli %33, %cst_5 : tensor<256x1xi32> 2026-02-21T09:19:21.4622659Z %35 = tt.expand_dims %32 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:21.4622954Z %36 = tt.broadcast %34 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:19:21.4623218Z %37 = tt.broadcast %35 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:19:21.4623463Z %38 = arith.addi %36, %37 : tensor<256x32xi32> 2026-02-21T09:19:21.4623712Z %39 = tt.splat %arg0 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4623993Z %40 = tt.addptr %39, %38 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:19:21.4624304Z %41 = tt.load %40 evictionPolicy = evict_first : tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4624595Z %42 = arith.extf %41 : tensor<256x32xbf16> to tensor<256x32xf32> 2026-02-21T09:19:21.4624880Z %43 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:19:21.4625144Z %44 = arith.muli %43, %cst_4 : tensor<16x1xi32> 2026-02-21T09:19:21.4625434Z %45 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:21.4625718Z 
%46 = tt.broadcast %44 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:19:21.4626003Z %47 = tt.broadcast %45 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:19:21.4626242Z %48 = arith.addi %46, %47 : tensor<16x32xi32> 2026-02-21T09:19:21.4626476Z %49 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:19:21.4626750Z %50 = tt.addptr %49, %48 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:19:21.4627004Z %51 = tt.load %50 : tensor<16x32x!tt.ptr> 2026-02-21T09:19:21.4627213Z %52 = arith.shli %51, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4627435Z %53 = arith.shrsi %52, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4627644Z %54 = arith.shrsi %51, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4627887Z %55 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:21.4628171Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:21.4628478Z %57 = tt.expand_dims %56 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:21.4628800Z %58 = tt.expand_dims %53 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:19:21.4629138Z %59 = tt.expand_dims %54 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:19:21.4629453Z %60 = arith.cmpi eq, %57, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:21.4629697Z %61 = tt.broadcast %60 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:19:21.4629965Z %62 = tt.broadcast %58 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:19:21.4630242Z %63 = arith.select %61, %62, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:19:21.4630509Z %64 = arith.cmpi eq, %57, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:21.4630758Z %65 = tt.broadcast %59 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:19:21.4631017Z %66 = tt.broadcast %64 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:19:21.4631290Z %67 = arith.select %66, %65, %63 : tensor<16x2x32xi1>, 
tensor<16x2x32xi8> 2026-02-21T09:19:21.4631602Z %68 = tt.reshape %67 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:19:21.4631867Z %69 = arith.sitofp %68 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:19:21.4632229Z %70 = tt.dot %42, %69, %arg4, inputPrecision = tf32 : tensor<256x32xf32> * tensor<32x32xf32> -> tensor<256x32xf32> 2026-02-21T09:19:21.4632550Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:19:21.4632748Z %71 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:19:21.4632981Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T09:19:21.4633205Z %73 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:19:21.4633454Z %74 = tt.splat %72 : i32 -> tensor<16xi32> 2026-02-21T09:19:21.4633659Z %75 = arith.addi %74, %73 : tensor<16xi32> 2026-02-21T09:19:21.4633849Z %76 = arith.muli %72, %c2_i32 : i32 2026-02-21T09:19:21.4634038Z %77 = tt.splat %76 : i32 -> tensor<32xi32> 2026-02-21T09:19:21.4634228Z %78 = arith.addi %77, %10 : tensor<32xi32> 2026-02-21T09:19:21.4634478Z %79 = tt.expand_dims %16 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:19:21.4634740Z %80 = arith.muli %79, %cst_5 : tensor<256x1xi32> 2026-02-21T09:19:21.4635001Z %81 = tt.expand_dims %78 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:21.4635283Z %82 = tt.broadcast %80 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:19:21.4635548Z %83 = tt.broadcast %81 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:19:21.4635787Z %84 = arith.addi %82, %83 : tensor<256x32xi32> 2026-02-21T09:19:21.4636023Z %85 = tt.splat %arg0 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4636310Z %86 = tt.addptr %85, %84 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:19:21.4636643Z %87 = tt.load %86 evictionPolicy = evict_first : tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4636942Z %88 = arith.extf %87 : tensor<256x32xbf16> to tensor<256x32xf32> 2026-02-21T09:19:21.4637283Z %89 = tt.expand_dims %75 {axis = 1 : 
i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:19:21.4637545Z %90 = arith.muli %89, %cst_4 : tensor<16x1xi32> 2026-02-21T09:19:21.4637805Z %91 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:21.4638086Z %92 = tt.broadcast %90 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:19:21.4638347Z %93 = tt.broadcast %91 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:19:21.4638578Z %94 = arith.addi %92, %93 : tensor<16x32xi32> 2026-02-21T09:19:21.4638817Z %95 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:19:21.4639097Z %96 = tt.addptr %95, %94 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:19:21.4639350Z %97 = tt.load %96 : tensor<16x32x!tt.ptr> 2026-02-21T09:19:21.4639571Z %98 = arith.shli %97, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4639785Z %99 = arith.shrsi %98, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4640008Z %100 = arith.shrsi %97, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4640286Z %101 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:21.4640613Z %102 = tt.expand_dims %101 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:21.4640942Z %103 = tt.expand_dims %102 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:21.4641269Z %104 = tt.expand_dims %99 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:19:21.4641643Z %105 = tt.expand_dims %100 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:19:21.4641927Z %106 = arith.cmpi eq, %103, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:21.4642187Z %107 = tt.broadcast %106 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:19:21.4642470Z %108 = tt.broadcast %104 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:19:21.4642755Z %109 = arith.select %107, %108, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:19:21.4643039Z %110 = arith.cmpi eq, %103, %cst_1 : tensor<1x2x1xi32> 
2026-02-21T09:19:21.4643287Z %111 = tt.broadcast %105 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:19:21.4643559Z %112 = tt.broadcast %110 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:19:21.4643862Z %113 = arith.select %112, %111, %109 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:19:21.4644155Z %114 = tt.reshape %113 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:19:21.4644421Z %115 = arith.sitofp %114 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:19:21.4644778Z %116 = tt.dot %88, %115, %70, inputPrecision = tf32 : tensor<256x32xf32> * tensor<32x32xf32> -> tensor<256x32xf32> 2026-02-21T09:19:21.4645109Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:19:21.4645302Z %117 = arith.muli %c16_i32, %c2_i32_7 : i32 2026-02-21T09:19:21.4645507Z %118 = arith.addi %arg3, %117 : i32 2026-02-21T09:19:21.4645738Z %119 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:19:21.4645994Z %120 = tt.splat %118 : i32 -> tensor<16xi32> 2026-02-21T09:19:21.4646208Z %121 = arith.addi %120, %119 : tensor<16xi32> 2026-02-21T09:19:21.4646401Z %122 = arith.muli %118, %c2_i32 : i32 2026-02-21T09:19:21.4646599Z %123 = tt.splat %122 : i32 -> tensor<32xi32> 2026-02-21T09:19:21.4646795Z %124 = arith.addi %123, %10 : tensor<32xi32> 2026-02-21T09:19:21.4647052Z %125 = tt.expand_dims %16 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:19:21.4647326Z %126 = arith.muli %125, %cst_5 : tensor<256x1xi32> 2026-02-21T09:19:21.4647593Z %127 = tt.expand_dims %124 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:21.4647922Z %128 = tt.broadcast %126 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:19:21.4648190Z %129 = tt.broadcast %127 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:19:21.4648463Z %130 = arith.addi %128, %129 : tensor<256x32xi32> 2026-02-21T09:19:21.4648706Z %131 = tt.splat %arg0 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4649005Z %132 = 
tt.addptr %131, %130 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:19:21.4649316Z %133 = tt.load %132 evictionPolicy = evict_first : tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4649622Z %134 = arith.extf %133 : tensor<256x32xbf16> to tensor<256x32xf32> 2026-02-21T09:19:21.4649917Z %135 = tt.expand_dims %121 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:19:21.4650182Z %136 = arith.muli %135, %cst_4 : tensor<16x1xi32> 2026-02-21T09:19:21.4650441Z %137 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:21.4650723Z %138 = tt.broadcast %136 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:19:21.4650992Z %139 = tt.broadcast %137 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:19:21.4651235Z %140 = arith.addi %138, %139 : tensor<16x32xi32> 2026-02-21T09:19:21.4651516Z %141 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:19:21.4651937Z %142 = tt.addptr %141, %140 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:19:21.4652206Z %143 = tt.load %142 : tensor<16x32x!tt.ptr> 2026-02-21T09:19:21.4652442Z %144 = arith.shli %143, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4652659Z %145 = arith.shrsi %144, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4652887Z %146 = arith.shrsi %143, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4653136Z %147 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:21.4653430Z %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:21.4653750Z %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:21.4654078Z %150 = tt.expand_dims %145 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:19:21.4654414Z %151 = tt.expand_dims %146 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:19:21.4654700Z %152 = arith.cmpi eq, %149, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:21.4654959Z %153 = 
tt.broadcast %152 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:19:21.4655236Z %154 = tt.broadcast %150 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:19:21.4655521Z %155 = arith.select %153, %154, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:19:21.4655797Z %156 = arith.cmpi eq, %149, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:21.4656045Z %157 = tt.broadcast %151 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:19:21.4656326Z %158 = tt.broadcast %156 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:19:21.4656608Z %159 = arith.select %158, %157, %155 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:19:21.4656886Z %160 = tt.reshape %159 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:19:21.4657156Z %161 = arith.sitofp %160 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:19:21.4657511Z %162 = tt.dot %134, %161, %116, inputPrecision = tf32 : tensor<256x32xf32> * tensor<32x32xf32> -> tensor<256x32xf32> 2026-02-21T09:19:21.4657836Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:19:21.4658023Z %163 = arith.muli %c16_i32, %c3_i32 : i32 2026-02-21T09:19:21.4658224Z %164 = arith.addi %arg3, %163 : i32 2026-02-21T09:19:21.4658479Z %165 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:19:21.4658740Z %166 = tt.splat %164 : i32 -> tensor<16xi32> 2026-02-21T09:19:21.4658993Z %167 = arith.addi %166, %165 : tensor<16xi32> 2026-02-21T09:19:21.4659191Z %168 = arith.muli %164, %c2_i32 : i32 2026-02-21T09:19:21.4659393Z %169 = tt.splat %168 : i32 -> tensor<32xi32> 2026-02-21T09:19:21.4659628Z %170 = arith.addi %169, %10 : tensor<32xi32> 2026-02-21T09:19:21.4659897Z %171 = tt.expand_dims %16 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:19:21.4660189Z %172 = arith.muli %171, %cst_5 : tensor<256x1xi32> 2026-02-21T09:19:21.4660461Z %173 = tt.expand_dims %170 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:21.4660772Z %174 = tt.broadcast %172 : 
tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:19:21.4661053Z %175 = tt.broadcast %173 : tensor<1x32xi32> -> tensor<256x32xi32> 2026-02-21T09:19:21.4661311Z %176 = arith.addi %174, %175 : tensor<256x32xi32> 2026-02-21T09:19:21.4661602Z %177 = tt.splat %arg0 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4661916Z %178 = tt.addptr %177, %176 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:19:21.4662249Z %179 = tt.load %178 evictionPolicy = evict_first : tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4662560Z %180 = arith.extf %179 : tensor<256x32xbf16> to tensor<256x32xf32> 2026-02-21T09:19:21.4662898Z %181 = tt.expand_dims %167 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:19:21.4663202Z %182 = arith.muli %181, %cst_4 : tensor<16x1xi32> 2026-02-21T09:19:21.4663477Z %183 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:21.4663783Z %184 = tt.broadcast %182 : tensor<16x1xi32> -> tensor<16x32xi32> 2026-02-21T09:19:21.4664059Z %185 = tt.broadcast %183 : tensor<1x32xi32> -> tensor<16x32xi32> 2026-02-21T09:19:21.4664318Z %186 = arith.addi %184, %185 : tensor<16x32xi32> 2026-02-21T09:19:21.4664561Z %187 = tt.splat %arg1 : !tt.ptr -> tensor<16x32x!tt.ptr> 2026-02-21T09:19:21.4664856Z %188 = tt.addptr %187, %186 : tensor<16x32x!tt.ptr>, tensor<16x32xi32> 2026-02-21T09:19:21.4665123Z %189 = tt.load %188 : tensor<16x32x!tt.ptr> 2026-02-21T09:19:21.4665350Z %190 = arith.shli %189, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4665581Z %191 = arith.shrsi %190, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4665808Z %192 = arith.shrsi %189, %cst_3 : tensor<16x32xi8> 2026-02-21T09:19:21.4666068Z %193 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:21.4666379Z %194 = tt.expand_dims %193 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:21.4666692Z %195 = tt.expand_dims %194 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 
2026-02-21T09:19:21.4667012Z %196 = tt.expand_dims %191 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:19:21.4667342Z %197 = tt.expand_dims %192 {axis = 1 : i32} : tensor<16x32xi8> -> tensor<16x1x32xi8> 2026-02-21T09:19:21.4667634Z %198 = arith.cmpi eq, %195, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:21.4667886Z %199 = tt.broadcast %198 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:19:21.4668164Z %200 = tt.broadcast %196 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:19:21.4668449Z %201 = arith.select %199, %200, %cst : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:19:21.4668729Z %202 = arith.cmpi eq, %195, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:21.4668988Z %203 = tt.broadcast %197 : tensor<16x1x32xi8> -> tensor<16x2x32xi8> 2026-02-21T09:19:21.4669254Z %204 = tt.broadcast %202 : tensor<1x2x1xi1> -> tensor<16x2x32xi1> 2026-02-21T09:19:21.4669535Z %205 = arith.select %204, %203, %201 : tensor<16x2x32xi1>, tensor<16x2x32xi8> 2026-02-21T09:19:21.4669812Z %206 = tt.reshape %205 : tensor<16x2x32xi8> -> tensor<32x32xi8> 2026-02-21T09:19:21.4670076Z %207 = arith.sitofp %206 : tensor<32x32xi8> to tensor<32x32xf32> 2026-02-21T09:19:21.4670457Z %208 = tt.dot %180, %207, %162, inputPrecision = tf32 : tensor<256x32xf32> * tensor<32x32xf32> -> tensor<256x32xf32> 2026-02-21T09:19:21.4670783Z scf.yield %208 : tensor<256x32xf32> 2026-02-21T09:19:21.4670991Z } {tt.flatten} 2026-02-21T09:19:21.4671182Z %18 = arith.truncf %17 : tensor<256x32xf32> to tensor<256x32xbf16> 2026-02-21T09:19:21.4671471Z %19 = tt.expand_dims %16 {axis = 1 : i32} : tensor<256xi32> -> tensor<256x1xi32> 2026-02-21T09:19:21.4671776Z %20 = arith.muli %19, %cst_0 : tensor<256x1xi32> 2026-02-21T09:19:21.4672034Z %21 = tt.expand_dims %12 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:21.4672317Z %22 = tt.broadcast %20 : tensor<256x1xi32> -> tensor<256x32xi32> 2026-02-21T09:19:21.4672580Z %23 = tt.broadcast %21 : tensor<1x32xi32> -> 
tensor<256x32xi32> 2026-02-21T09:19:21.4672814Z %24 = arith.addi %22, %23 : tensor<256x32xi32> 2026-02-21T09:19:21.4673055Z %25 = tt.splat %arg2 : !tt.ptr -> tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4673341Z %26 = tt.addptr %25, %24 : tensor<256x32x!tt.ptr>, tensor<256x32xi32> 2026-02-21T09:19:21.4673594Z tt.store %26, %18 : tensor<256x32x!tt.ptr> 2026-02-21T09:19:21.4673792Z tt.return 2026-02-21T09:19:21.4673920Z } 2026-02-21T09:19:21.4674049Z } 2026-02-21T09:19:21.4674119Z 2026-02-21T09:19:21.4674201Z {-# 2026-02-21T09:19:21.4674332Z external_resources: { 2026-02-21T09:19:21.4674520Z mlir_reproducer: { 2026-02-21T09:19:21.4678934Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, 
tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:19:21.4683446Z disable_threading: false, 2026-02-21T09:19:21.4683624Z verify_each: true 2026-02-21T09:19:21.4683769Z } 2026-02-21T09:19:21.4683893Z } 2026-02-21T09:19:21.4684007Z #-} 2026-02-21T09:19:21.4684434Z /tmp/torchinductor_root/sj/csj3wzsqt77qbpdzmruupjdridyeqfd4cjxyftuaqj4eu34ppasa.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:19:21.4685466Z /tmp/torchinductor_root/sj/csj3wzsqt77qbpdzmruupjdridyeqfd4cjxyftuaqj4eu34ppasa.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:19:21.4686281Z [748s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:19:21.4687366Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 256, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], num_stages=5, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:19:21.4688310Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:19:21.4688563Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:19:23.2997587Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:19:23.2999413Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:19:23.2999702Z ^ 2026-02-21T09:19:23.3000080Z /tmp/torchinductor_root/7k/c7kikpyiy5tvjknzgf3cge7bmffjxkjmrmvc7nvahiw5epg4cnhj.py:78:36: note: called from 2026-02-21T09:19:23.3000478Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:19:23.3000678Z ^ 2026-02-21T09:19:23.3001247Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:19:23.3001850Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:19:23.3002096Z ^ 2026-02-21T09:19:23.3002276Z module { 2026-02-21T09:19:23.3004328Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:19:23.3004919Z %cst = arith.constant dense<0> : tensor<16x2x128xi8> 2026-02-21T09:19:23.3005154Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:19:23.3005362Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:19:23.3005549Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:19:23.3005775Z %cst_0 = arith.constant dense<8192> : tensor<128x1xi32> 2026-02-21T09:19:23.3006017Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:19:23.3006252Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:19:23.3006473Z %cst_3 = arith.constant dense<4> : tensor<16x128xi8> 2026-02-21T09:19:23.3006709Z %cst_4 = arith.constant dense<8192> : tensor<16x1xi32> 2026-02-21T09:19:23.3006944Z %cst_5 = arith.constant dense<1024> : tensor<128x1xi32> 2026-02-21T09:19:23.3007161Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:19:23.3007389Z %cst_6 = arith.constant 
dense<0.000000e+00> : tensor<128x128xf32> 2026-02-21T09:19:23.3007625Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:19:23.3007817Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:19:23.3007998Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:19:23.3008187Z %0 = tt.get_program_id x : i32 2026-02-21T09:19:23.3008375Z %1 = arith.divsi %0, %c1024_i32 : i32 2026-02-21T09:19:23.3008573Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T09:19:23.3008752Z %3 = arith.subi %c32_i32, %2 : i32 2026-02-21T09:19:23.3008927Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T09:19:23.3009115Z %5 = arith.remsi %0, %c1024_i32 : i32 2026-02-21T09:19:23.3009295Z %6 = arith.remsi %5, %4 : i32 2026-02-21T09:19:23.3009472Z %7 = arith.addi %2, %6 : i32 2026-02-21T09:19:23.3009639Z %8 = arith.divsi %5, %4 : i32 2026-02-21T09:19:23.3009814Z %9 = arith.muli %7, %c128_i32 : i32 2026-02-21T09:19:23.3010040Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:19:23.3010301Z %11 = tt.splat %9 : i32 -> tensor<128xi32> 2026-02-21T09:19:23.3010508Z %12 = arith.addi %11, %10 : tensor<128xi32> 2026-02-21T09:19:23.3010698Z %13 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:19:23.3011091Z %14 = tt.splat %13 : i32 -> tensor<128xi32> 2026-02-21T09:19:23.3011287Z %15 = arith.addi %14, %10 : tensor<128xi32> 2026-02-21T09:19:23.3011484Z %c480_i32 = arith.constant 480 : i32 2026-02-21T09:19:23.3011893Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:19:23.3012221Z %16 = scf.for %arg3 = %c0_i32 to %c480_i32 step %c48_i32 iter_args(%arg4 = %cst_6) -> (tensor<128x128xf32>) : i32 { 2026-02-21T09:19:23.3012593Z %27 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:19:23.3012848Z %28 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T09:19:23.3013098Z %29 = arith.addi %28, %27 : tensor<16xi32> 2026-02-21T09:19:23.3013299Z %30 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:19:23.3013524Z %31 = tt.make_range {end = 32 : i32, start = 0 : i32} : 
tensor<32xi32> 2026-02-21T09:19:23.3013771Z %32 = tt.splat %30 : i32 -> tensor<32xi32> 2026-02-21T09:19:23.3013962Z %33 = arith.addi %32, %31 : tensor<32xi32> 2026-02-21T09:19:23.3014220Z %34 = tt.expand_dims %12 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:23.3014496Z %35 = arith.muli %34, %cst_5 : tensor<128x1xi32> 2026-02-21T09:19:23.3014751Z %36 = tt.expand_dims %33 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:23.3015085Z %37 = tt.broadcast %35 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:19:23.3015381Z %38 = tt.broadcast %36 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:19:23.3015628Z %39 = arith.addi %37, %38 : tensor<128x32xi32> 2026-02-21T09:19:23.3015898Z %40 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:19:23.3016185Z %41 = tt.addptr %40, %39 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:19:23.3016502Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:19:23.3016818Z %43 = arith.extf %42 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:19:23.3017110Z %44 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:19:23.3017386Z %45 = arith.muli %44, %cst_4 : tensor<16x1xi32> 2026-02-21T09:19:23.3017641Z %46 = tt.expand_dims %15 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:19:23.3017935Z %47 = tt.broadcast %45 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:19:23.3018197Z %48 = tt.broadcast %46 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:19:23.3018439Z %49 = arith.addi %47, %48 : tensor<16x128xi32> 2026-02-21T09:19:23.3018678Z %50 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:19:23.3018951Z %51 = tt.addptr %50, %49 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:19:23.3019209Z %52 = tt.load %51 : tensor<16x128x!tt.ptr> 2026-02-21T09:19:23.3019422Z %53 = arith.shli %52, %cst_3 : tensor<16x128xi8> 
2026-02-21T09:19:23.3019646Z %54 = arith.shrsi %53, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3019863Z %55 = arith.shrsi %52, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3020113Z %56 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:23.3020409Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:23.3020708Z %58 = tt.expand_dims %57 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:23.3021032Z %59 = tt.expand_dims %54 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:23.3021354Z %60 = tt.expand_dims %55 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:23.3021677Z %61 = arith.cmpi eq, %58, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:23.3021929Z %62 = tt.broadcast %61 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:23.3022196Z %63 = tt.broadcast %59 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:23.3022526Z %64 = arith.select %62, %63, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:23.3022789Z %65 = arith.cmpi eq, %58, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:23.3023075Z %66 = tt.broadcast %60 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:23.3023346Z %67 = tt.broadcast %65 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:23.3023625Z %68 = arith.select %67, %66, %64 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:23.3023905Z %69 = tt.reshape %68 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:19:23.3024163Z %70 = arith.sitofp %69 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:19:23.3024526Z %71 = tt.dot %43, %70, %arg4, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> tensor<128x128xf32> 2026-02-21T09:19:23.3024847Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:19:23.3025043Z %72 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:19:23.3025242Z %73 = arith.addi %arg3, %72 : i32 
2026-02-21T09:19:23.3025467Z %74 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:19:23.3025715Z %75 = tt.splat %73 : i32 -> tensor<16xi32> 2026-02-21T09:19:23.3025913Z %76 = arith.addi %75, %74 : tensor<16xi32> 2026-02-21T09:19:23.3026133Z %77 = arith.muli %73, %c2_i32 : i32 2026-02-21T09:19:23.3026384Z %78 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:19:23.3026628Z %79 = tt.splat %77 : i32 -> tensor<32xi32> 2026-02-21T09:19:23.3026820Z %80 = arith.addi %79, %78 : tensor<32xi32> 2026-02-21T09:19:23.3027073Z %81 = tt.expand_dims %12 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:23.3027346Z %82 = arith.muli %81, %cst_5 : tensor<128x1xi32> 2026-02-21T09:19:23.3027597Z %83 = tt.expand_dims %80 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:23.3027885Z %84 = tt.broadcast %82 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:19:23.3028145Z %85 = tt.broadcast %83 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:19:23.3028388Z %86 = arith.addi %84, %85 : tensor<128x32xi32> 2026-02-21T09:19:23.3028637Z %87 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:19:23.3028922Z %88 = tt.addptr %87, %86 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:19:23.3029253Z %89 = tt.load %88 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:19:23.3029556Z %90 = arith.extf %89 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:19:23.3029854Z %91 = tt.expand_dims %76 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:19:23.3030124Z %92 = arith.muli %91, %cst_4 : tensor<16x1xi32> 2026-02-21T09:19:23.3030392Z %93 = tt.expand_dims %15 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:19:23.3030695Z %94 = tt.broadcast %92 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:19:23.3030963Z %95 = tt.broadcast %93 : tensor<1x128xi32> -> tensor<16x128xi32> 
2026-02-21T09:19:23.3031208Z %96 = arith.addi %94, %95 : tensor<16x128xi32> 2026-02-21T09:19:23.3031448Z %97 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:19:23.3031768Z %98 = tt.addptr %97, %96 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:19:23.3032032Z %99 = tt.load %98 : tensor<16x128x!tt.ptr> 2026-02-21T09:19:23.3032267Z %100 = arith.shli %99, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3032509Z %101 = arith.shrsi %100, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3032749Z %102 = arith.shrsi %99, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3033019Z %103 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:23.3033328Z %104 = tt.expand_dims %103 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:23.3033670Z %105 = tt.expand_dims %104 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:23.3034049Z %106 = tt.expand_dims %101 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:23.3034426Z %107 = tt.expand_dims %102 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:23.3034734Z %108 = arith.cmpi eq, %105, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:23.3035002Z %109 = tt.broadcast %108 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:23.3035304Z %110 = tt.broadcast %106 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:23.3035614Z %111 = arith.select %109, %110, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:23.3035913Z %112 = arith.cmpi eq, %105, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:23.3036181Z %113 = tt.broadcast %107 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:23.3036467Z %114 = tt.broadcast %112 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:23.3036772Z %115 = arith.select %114, %113, %111 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:23.3037073Z %116 = tt.reshape %115 : tensor<16x2x128xi8> -> tensor<32x128xi8> 
2026-02-21T09:19:23.3037358Z %117 = arith.sitofp %116 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:19:23.3037758Z %118 = tt.dot %90, %117, %71, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> tensor<128x128xf32> 2026-02-21T09:19:23.3038134Z %c2_i32_7 = arith.constant 2 : i32 2026-02-21T09:19:23.3038345Z %119 = arith.muli %c16_i32, %c2_i32_7 : i32 2026-02-21T09:19:23.3038547Z %120 = arith.addi %arg3, %119 : i32 2026-02-21T09:19:23.3038794Z %121 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:19:23.3039053Z %122 = tt.splat %120 : i32 -> tensor<16xi32> 2026-02-21T09:19:23.3039275Z %123 = arith.addi %122, %121 : tensor<16xi32> 2026-02-21T09:19:23.3039480Z %124 = arith.muli %120, %c2_i32 : i32 2026-02-21T09:19:23.3039727Z %125 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:19:23.3039990Z %126 = tt.splat %124 : i32 -> tensor<32xi32> 2026-02-21T09:19:23.3040209Z %127 = arith.addi %126, %125 : tensor<32xi32> 2026-02-21T09:19:23.3040473Z %128 = tt.expand_dims %12 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:23.3040746Z %129 = arith.muli %128, %cst_5 : tensor<128x1xi32> 2026-02-21T09:19:23.3041014Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:23.3041304Z %131 = tt.broadcast %129 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:19:23.3041615Z %132 = tt.broadcast %130 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:19:23.3041867Z %133 = arith.addi %131, %132 : tensor<128x32xi32> 2026-02-21T09:19:23.3042115Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:19:23.3042421Z %135 = tt.addptr %134, %133 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:19:23.3042740Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:19:23.3043051Z %137 = arith.extf %136 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:19:23.3043349Z 
%138 = tt.expand_dims %123 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:19:23.3043614Z %139 = arith.muli %138, %cst_4 : tensor<16x1xi32> 2026-02-21T09:19:23.3043875Z %140 = tt.expand_dims %15 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:19:23.3044166Z %141 = tt.broadcast %139 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:19:23.3044448Z %142 = tt.broadcast %140 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:19:23.3044693Z %143 = arith.addi %141, %142 : tensor<16x128xi32> 2026-02-21T09:19:23.3044939Z %144 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:19:23.3045255Z %145 = tt.addptr %144, %143 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:19:23.3045513Z %146 = tt.load %145 : tensor<16x128x!tt.ptr> 2026-02-21T09:19:23.3045736Z %147 = arith.shli %146, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3045989Z %148 = arith.shrsi %147, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3046225Z %149 = arith.shrsi %146, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3046480Z %150 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:23.3046783Z %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:23.3047105Z %152 = tt.expand_dims %151 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:23.3047438Z %153 = tt.expand_dims %148 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:23.3047780Z %154 = tt.expand_dims %149 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:23.3048071Z %155 = arith.cmpi eq, %152, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:23.3048336Z %156 = tt.broadcast %155 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:23.3048624Z %157 = tt.broadcast %153 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:23.3048971Z %158 = arith.select %156, %157, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 
2026-02-21T09:19:23.3049281Z %159 = arith.cmpi eq, %152, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:23.3049536Z %160 = tt.broadcast %154 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:23.3049817Z %161 = tt.broadcast %159 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:23.3050103Z %162 = arith.select %161, %160, %158 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:23.3050392Z %163 = tt.reshape %162 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:19:23.3050661Z %164 = arith.sitofp %163 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:19:23.3051050Z %165 = tt.dot %137, %164, %118, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> tensor<128x128xf32> 2026-02-21T09:19:23.3051377Z scf.yield %165 : tensor<128x128xf32> 2026-02-21T09:19:23.3051584Z } {tt.flatten} 2026-02-21T09:19:23.3051864Z %17 = scf.for %arg3 = %c480_i32 to %c512_i32 step %c16_i32 iter_args(%arg4 = %16) -> (tensor<128x128xf32>) : i32 { 2026-02-21T09:19:23.3052232Z %27 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:19:23.3052476Z %28 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T09:19:23.3052688Z %29 = arith.addi %28, %27 : tensor<16xi32> 2026-02-21T09:19:23.3052891Z %30 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:19:23.3053117Z %31 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:19:23.3053365Z %32 = tt.splat %30 : i32 -> tensor<32xi32> 2026-02-21T09:19:23.3053561Z %33 = arith.addi %32, %31 : tensor<32xi32> 2026-02-21T09:19:23.3053810Z %34 = tt.expand_dims %12 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:23.3054072Z %35 = arith.muli %34, %cst_5 : tensor<128x1xi32> 2026-02-21T09:19:23.3054333Z %36 = tt.expand_dims %33 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:23.3054631Z %37 = tt.broadcast %35 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:19:23.3054897Z %38 = tt.broadcast %36 : tensor<1x32xi32> 
-> tensor<128x32xi32> 2026-02-21T09:19:23.3055138Z %39 = arith.addi %37, %38 : tensor<128x32xi32> 2026-02-21T09:19:23.3055380Z %40 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:19:23.3055668Z %41 = tt.addptr %40, %39 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:19:23.3055973Z %42 = tt.load %41 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:19:23.3056275Z %43 = arith.extf %42 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:19:23.3056596Z %44 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:19:23.3056857Z %45 = arith.muli %44, %cst_4 : tensor<16x1xi32> 2026-02-21T09:19:23.3057138Z %46 = tt.expand_dims %15 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:19:23.3057421Z %47 = tt.broadcast %45 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:19:23.3057686Z %48 = tt.broadcast %46 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:19:23.3057912Z %49 = arith.addi %47, %48 : tensor<16x128xi32> 2026-02-21T09:19:23.3058150Z %50 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:19:23.3058426Z %51 = tt.addptr %50, %49 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:19:23.3058673Z %52 = tt.load %51 : tensor<16x128x!tt.ptr> 2026-02-21T09:19:23.3058890Z %53 = arith.shli %52, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3059105Z %54 = arith.shrsi %53, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3059329Z %55 = arith.shrsi %52, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:23.3059568Z %56 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:23.3059859Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:23.3060192Z %58 = tt.expand_dims %57 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:23.3060531Z %59 = tt.expand_dims %54 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:23.3060857Z %60 = 
tt.expand_dims %55 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:23.3061132Z %61 = arith.cmpi eq, %58, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:23.3061381Z %62 = tt.broadcast %61 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:23.3061678Z %63 = tt.broadcast %59 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:23.3061960Z %64 = arith.select %62, %63, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:23.3062234Z %65 = arith.cmpi eq, %58, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:23.3062479Z %66 = tt.broadcast %60 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:23.3062755Z %67 = tt.broadcast %65 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:23.3063027Z %68 = arith.select %67, %66, %64 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:23.3063309Z %69 = tt.reshape %68 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:19:23.3063569Z %70 = arith.sitofp %69 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:19:23.3063926Z %71 = tt.dot %43, %70, %arg4, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> tensor<128x128xf32> 2026-02-21T09:19:23.3064248Z scf.yield %71 : tensor<128x128xf32> 2026-02-21T09:19:23.3064441Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:19:23.3064678Z %18 = arith.truncf %17 : tensor<128x128xf32> to tensor<128x128xbf16> 2026-02-21T09:19:23.3064973Z %19 = tt.expand_dims %12 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:23.3065236Z %20 = arith.muli %19, %cst_0 : tensor<128x1xi32> 2026-02-21T09:19:23.3065495Z %21 = tt.expand_dims %15 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:19:23.3065779Z %22 = tt.broadcast %20 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T09:19:23.3066048Z %23 = tt.broadcast %21 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T09:19:23.3066281Z %24 = arith.addi %22, %23 : tensor<128x128xi32> 2026-02-21T09:19:23.3066527Z %25 
= tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T09:19:23.3066820Z %26 = tt.addptr %25, %24 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T09:19:23.3067083Z tt.store %26, %18 : tensor<128x128x!tt.ptr> 2026-02-21T09:19:23.3067287Z tt.return 2026-02-21T09:19:23.3067417Z } 2026-02-21T09:19:23.3067546Z } 2026-02-21T09:19:23.3067648Z 2026-02-21T09:19:23.3067700Z {-# 2026-02-21T09:19:23.3067837Z external_resources: { 2026-02-21T09:19:23.3067995Z mlir_reproducer: { 2026-02-21T09:19:23.3072511Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, 
tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:19:23.3077177Z disable_threading: false, 2026-02-21T09:19:23.3077378Z verify_each: true 2026-02-21T09:19:23.3077529Z } 2026-02-21T09:19:23.3077663Z } 2026-02-21T09:19:23.3077781Z #-} 2026-02-21T09:19:23.3078238Z /tmp/torchinductor_root/7k/c7kikpyiy5tvjknzgf3cge7bmffjxkjmrmvc7nvahiw5epg4cnhj.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:19:23.3079330Z /tmp/torchinductor_root/7k/c7kikpyiy5tvjknzgf3cge7bmffjxkjmrmvc7nvahiw5epg4cnhj.py:13:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:19:23.3080200Z [750s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:19:23.3081307Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:19:23.3082303Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:19:23.3082558Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:19:23.7989241Z [750s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 128, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['last', ''], loop_orders=[[0, 1]], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T09:19:23.7990215Z Tensor-likes are not close! 2026-02-21T09:19:23.7990334Z 2026-02-21T09:19:23.7990420Z Mismatched elements: 33448764 / 33554432 (99.7%) 2026-02-21T09:19:23.7990706Z Greatest absolute difference: 1408.0 at index (289, 7045) (up to 0.01 allowed) 2026-02-21T09:19:23.7991233Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:19:23.7991614Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:19:23.7991867Z 2026-02-21T09:19:24.0486236Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 4.6 configs/s 2026-02-21T09:19:24.0494902Z [751s] Generation 19 complete: 2026-02-21T09:19:24.0496233Z error=4 2026-02-21T09:19:24.0496384Z ok=17 2026-02-21T09:19:24.0496519Z min=0.2386 2026-02-21T09:19:24.0496646Z mid=2.3910 2026-02-21T09:19:24.0496777Z max=168.3354 2026-02-21T09:19:24.0496930Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:19:24.0497188Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:19:24.0497435Z 'l2_groupings': [2], 2026-02-21T09:19:24.0497611Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:19:24.0497817Z 'loop_orders': [[0, 1]], 2026-02-21T09:19:24.0497979Z 'maxnreg': 128, 2026-02-21T09:19:24.0498133Z 'num_sm_multiplier': 2, 2026-02-21T09:19:24.0498285Z 'num_stages': 3, 2026-02-21T09:19:24.0498428Z 'num_warps': 4, 2026-02-21T09:19:24.0498585Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:19:24.0498793Z 'range_flattens': [None, None], 2026-02-21T09:19:24.0498974Z 
'range_multi_buffers': [True, False], 2026-02-21T09:19:24.0499342Z 'range_num_stages': [3, 4], 2026-02-21T09:19:24.0499570Z 'range_unroll_factors': [0, 0], 2026-02-21T09:19:24.0499757Z 'range_warp_specializes': [True, None]} 2026-02-21T09:19:24.0536562Z [751s] Fitting surrogate: 1218 points, 1218 targets 2026-02-21T09:19:24.5103987Z [751s] Generation 20 starting: 17 neighbors, 1 active search path(s) 2026-02-21T09:19:27.3563821Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 9.8 configs/s 2026-02-21T09:19:27.4518138Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: error: operand #2 does not dominate this use 2026-02-21T09:19:27.4519875Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:19:27.4520154Z ^ 2026-02-21T09:19:27.4520529Z /tmp/torchinductor_root/6t/c6tjvl2a2g5szigx5uukpmkmn5inwxbel45jrjhnm777gotbwraf.py:85:36: note: called from 2026-02-21T09:19:27.4520939Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:19:27.4521150Z ^ 2026-02-21T09:19:27.4521756Z /__w/helion/helion/.venv/lib/python3.12/site-packages/triton/language/standard.py:291:36: note: operand defined here (op in the same block) 2026-02-21T09:19:27.4522229Z return core.reduce(input, axis, _sum_combine, keep_dims=keep_dims) 2026-02-21T09:19:27.4522474Z ^ 2026-02-21T09:19:27.4522637Z module { 2026-02-21T09:19:27.4523127Z tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:19:27.4523700Z %cst = arith.constant dense<0> : tensor<16x2x128xi8> 2026-02-21T09:19:27.4523924Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:19:27.4524124Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:19:27.4524307Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:19:27.4524556Z %cst_0 = arith.constant dense<8192> : tensor<128x1xi32> 
2026-02-21T09:19:27.4524811Z %cst_1 = arith.constant dense<1> : tensor<1x2x1xi32> 2026-02-21T09:19:27.4525041Z %cst_2 = arith.constant dense<0> : tensor<1x2x1xi32> 2026-02-21T09:19:27.4525271Z %cst_3 = arith.constant dense<4> : tensor<16x128xi8> 2026-02-21T09:19:27.4525501Z %cst_4 = arith.constant dense<1024> : tensor<128x1xi32> 2026-02-21T09:19:27.4525715Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:19:27.4525945Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<128x128xf32> 2026-02-21T09:19:27.4526178Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:19:27.4526611Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:19:27.4526791Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:19:27.4526981Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:19:27.4527227Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T09:19:27.4527417Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:19:27.4527739Z %0 = tt.make_tensor_descriptor %arg1, [%c512_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T09:19:27.4528060Z %1 = tt.get_program_id x : i32 2026-02-21T09:19:27.4528250Z %2 = arith.divsi %1, %c1024_i32 : i32 2026-02-21T09:19:27.4528432Z %3 = arith.muli %2, %c16_i32 : i32 2026-02-21T09:19:27.4528614Z %4 = arith.subi %c32_i32, %3 : i32 2026-02-21T09:19:27.4528788Z %5 = arith.minsi %4, %c16_i32 : i32 2026-02-21T09:19:27.4528974Z %6 = arith.remsi %1, %c1024_i32 : i32 2026-02-21T09:19:27.4529155Z %7 = arith.remsi %6, %5 : i32 2026-02-21T09:19:27.4529337Z %8 = arith.addi %3, %7 : i32 2026-02-21T09:19:27.4529514Z %9 = arith.divsi %6, %5 : i32 2026-02-21T09:19:27.4529686Z %10 = arith.muli %8, %c128_i32 : i32 2026-02-21T09:19:27.4529931Z %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:19:27.4530190Z %12 = tt.splat %10 : i32 -> tensor<128xi32> 2026-02-21T09:19:27.4530404Z %13 = arith.addi %12, %11 : tensor<128xi32> 2026-02-21T09:19:27.4530646Z %14 = arith.muli %9, %c128_i32 : i32 2026-02-21T09:19:27.4530884Z %15 = tt.splat %14 : i32 
-> tensor<128xi32> 2026-02-21T09:19:27.4531075Z %16 = arith.addi %15, %11 : tensor<128xi32> 2026-02-21T09:19:27.4531270Z %c480_i32 = arith.constant 480 : i32 2026-02-21T09:19:27.4531456Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:19:27.4531836Z %17 = scf.for %arg3 = %c0_i32 to %c480_i32 step %c48_i32 iter_args(%arg4 = %cst_5) -> (tensor<128x128xf32>) : i32 { 2026-02-21T09:19:27.4532166Z %28 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:19:27.4532399Z %29 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:19:27.4532664Z %30 = tt.splat %28 : i32 -> tensor<32xi32> 2026-02-21T09:19:27.4532862Z %31 = arith.addi %30, %29 : tensor<32xi32> 2026-02-21T09:19:27.4533128Z %32 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:27.4533410Z %33 = arith.muli %32, %cst_4 : tensor<128x1xi32> 2026-02-21T09:19:27.4533669Z %34 = tt.expand_dims %31 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:27.4533967Z %35 = tt.broadcast %33 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:19:27.4534230Z %36 = tt.broadcast %34 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:19:27.4534477Z %37 = arith.addi %35, %36 : tensor<128x32xi32> 2026-02-21T09:19:27.4534721Z %38 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:19:27.4535015Z %39 = tt.addptr %38, %37 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:19:27.4535335Z %40 = tt.load %39 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:19:27.4535625Z %41 = arith.extf %40 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:19:27.4535959Z %42 = tt.descriptor_load %0[%arg3, %14] : !tt.tensordesc> -> tensor<16x128xi8> 2026-02-21T09:19:27.4536268Z %43 = arith.shli %42, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4536496Z %44 = arith.shrsi %43, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4536722Z %45 = arith.shrsi %42, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4536965Z %46 = 
tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:27.4537264Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:27.4537565Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:27.4537888Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:27.4538249Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:27.4538533Z %51 = arith.cmpi eq, %48, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:27.4538810Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:27.4539083Z %53 = tt.broadcast %49 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:27.4539381Z %54 = arith.select %52, %53, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:27.4539663Z %55 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:27.4539925Z %56 = tt.broadcast %50 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:27.4540206Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:27.4540483Z %58 = arith.select %57, %56, %54 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:27.4540775Z %59 = tt.reshape %58 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:19:27.4541046Z %60 = arith.sitofp %59 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:19:27.4541423Z %61 = tt.dot %41, %60, %arg4, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> tensor<128x128xf32> 2026-02-21T09:19:27.4541791Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:19:27.4542026Z %62 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:19:27.4542229Z %63 = arith.addi %arg3, %62 : i32 2026-02-21T09:19:27.4542441Z %64 = arith.muli %63, %c2_i32 : i32 2026-02-21T09:19:27.4542686Z %65 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:19:27.4542935Z %66 = 
tt.splat %64 : i32 -> tensor<32xi32> 2026-02-21T09:19:27.4543147Z %67 = arith.addi %66, %65 : tensor<32xi32> 2026-02-21T09:19:27.4543402Z %68 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:27.4543693Z %69 = arith.muli %68, %cst_4 : tensor<128x1xi32> 2026-02-21T09:19:27.4543972Z %70 = tt.expand_dims %67 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:27.4544283Z %71 = tt.broadcast %69 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:19:27.4544552Z %72 = tt.broadcast %70 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:19:27.4544788Z %73 = arith.addi %71, %72 : tensor<128x32xi32> 2026-02-21T09:19:27.4545039Z %74 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:19:27.4545323Z %75 = tt.addptr %74, %73 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:19:27.4545634Z %76 = tt.load %75 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:19:27.4545935Z %77 = arith.extf %76 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:19:27.4546250Z %78 = tt.descriptor_load %0[%63, %14] : !tt.tensordesc> -> tensor<16x128xi8> 2026-02-21T09:19:27.4546575Z %79 = arith.shli %78, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4546800Z %80 = arith.shrsi %79, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4547015Z %81 = arith.shrsi %78, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4547263Z %82 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:27.4547550Z %83 = tt.expand_dims %82 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:27.4547859Z %84 = tt.expand_dims %83 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:27.4548177Z %85 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:27.4548490Z %86 = tt.expand_dims %81 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:27.4548775Z %87 = arith.cmpi eq, %84, %cst_2 : 
tensor<1x2x1xi32> 2026-02-21T09:19:27.4549019Z %88 = tt.broadcast %87 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:27.4549294Z %89 = tt.broadcast %85 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:27.4549605Z %90 = arith.select %88, %89, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:27.4549878Z %91 = arith.cmpi eq, %84, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:27.4550152Z %92 = tt.broadcast %86 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:27.4550417Z %93 = tt.broadcast %91 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:27.4550693Z %94 = arith.select %93, %92, %90 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:27.4550961Z %95 = tt.reshape %94 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:19:27.4551221Z %96 = arith.sitofp %95 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:19:27.4551605Z %97 = tt.dot %77, %96, %61, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> tensor<128x128xf32> 2026-02-21T09:19:27.4551968Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T09:19:27.4552167Z %98 = arith.muli %c16_i32, %c2_i32_6 : i32 2026-02-21T09:19:27.4552358Z %99 = arith.addi %arg3, %98 : i32 2026-02-21T09:19:27.4552547Z %100 = arith.muli %99, %c2_i32 : i32 2026-02-21T09:19:27.4552779Z %101 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:19:27.4553040Z %102 = tt.splat %100 : i32 -> tensor<32xi32> 2026-02-21T09:19:27.4553284Z %103 = arith.addi %102, %101 : tensor<32xi32> 2026-02-21T09:19:27.4553572Z %104 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:27.4553853Z %105 = arith.muli %104, %cst_4 : tensor<128x1xi32> 2026-02-21T09:19:27.4554112Z %106 = tt.expand_dims %103 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:27.4554415Z %107 = tt.broadcast %105 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:19:27.4554682Z %108 = tt.broadcast %106 : 
tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:19:27.4554933Z %109 = arith.addi %107, %108 : tensor<128x32xi32> 2026-02-21T09:19:27.4555183Z %110 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:19:27.4555469Z %111 = tt.addptr %110, %109 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:19:27.4555790Z %112 = tt.load %111 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:19:27.4556090Z %113 = arith.extf %112 : tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:19:27.4556416Z %114 = tt.descriptor_load %0[%99, %14] : !tt.tensordesc> -> tensor<16x128xi8> 2026-02-21T09:19:27.4556712Z %115 = arith.shli %114, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4556939Z %116 = arith.shrsi %115, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4557165Z %117 = arith.shrsi %114, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4557408Z %118 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:27.4557701Z %119 = tt.expand_dims %118 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:27.4558036Z %120 = tt.expand_dims %119 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:27.4558355Z %121 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:27.4558689Z %122 = tt.expand_dims %117 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:27.4558977Z %123 = arith.cmpi eq, %120, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:27.4559225Z %124 = tt.broadcast %123 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:27.4559506Z %125 = tt.broadcast %121 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:27.4559797Z %126 = arith.select %124, %125, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:27.4560079Z %127 = arith.cmpi eq, %120, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:27.4560334Z %128 = tt.broadcast %122 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 
2026-02-21T09:19:27.4560603Z %129 = tt.broadcast %127 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:27.4560922Z %130 = arith.select %129, %128, %126 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:27.4561208Z %131 = tt.reshape %130 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:19:27.4561518Z %132 = arith.sitofp %131 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:19:27.4561947Z %133 = tt.dot %113, %132, %97, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> tensor<128x128xf32> 2026-02-21T09:19:27.4562287Z scf.yield %133 : tensor<128x128xf32> 2026-02-21T09:19:27.4562471Z } 2026-02-21T09:19:27.4562742Z %18 = scf.for %arg3 = %c480_i32 to %c512_i32 step %c16_i32 iter_args(%arg4 = %17) -> (tensor<128x128xf32>) : i32 { 2026-02-21T09:19:27.4563079Z %28 = arith.muli %arg3, %c2_i32 : i32 2026-02-21T09:19:27.4563320Z %29 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T09:19:27.4563580Z %30 = tt.splat %28 : i32 -> tensor<32xi32> 2026-02-21T09:19:27.4563787Z %31 = arith.addi %30, %29 : tensor<32xi32> 2026-02-21T09:19:27.4564054Z %32 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:27.4564339Z %33 = arith.muli %32, %cst_4 : tensor<128x1xi32> 2026-02-21T09:19:27.4564643Z %34 = tt.expand_dims %31 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T09:19:27.4564978Z %35 = tt.broadcast %33 : tensor<128x1xi32> -> tensor<128x32xi32> 2026-02-21T09:19:27.4565252Z %36 = tt.broadcast %34 : tensor<1x32xi32> -> tensor<128x32xi32> 2026-02-21T09:19:27.4565505Z %37 = arith.addi %35, %36 : tensor<128x32xi32> 2026-02-21T09:19:27.4565765Z %38 = tt.splat %arg0 : !tt.ptr -> tensor<128x32x!tt.ptr> 2026-02-21T09:19:27.4566062Z %39 = tt.addptr %38, %37 : tensor<128x32x!tt.ptr>, tensor<128x32xi32> 2026-02-21T09:19:27.4566391Z %40 = tt.load %39 evictionPolicy = evict_first : tensor<128x32x!tt.ptr> 2026-02-21T09:19:27.4566697Z %41 = arith.extf %40 : 
tensor<128x32xbf16> to tensor<128x32xf32> 2026-02-21T09:19:27.4567043Z %42 = tt.descriptor_load %0[%arg3, %14] : !tt.tensordesc> -> tensor<16x128xi8> 2026-02-21T09:19:27.4567357Z %43 = arith.shli %42, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4567591Z %44 = arith.shrsi %43, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4567824Z %45 = arith.shrsi %42, %cst_3 : tensor<16x128xi8> 2026-02-21T09:19:27.4568076Z %46 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> 2026-02-21T09:19:27.4568383Z %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> 2026-02-21T09:19:27.4568696Z %48 = tt.expand_dims %47 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> 2026-02-21T09:19:27.4569030Z %49 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:27.4569359Z %50 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16x128xi8> -> tensor<16x1x128xi8> 2026-02-21T09:19:27.4569653Z %51 = arith.cmpi eq, %48, %cst_2 : tensor<1x2x1xi32> 2026-02-21T09:19:27.4569910Z %52 = tt.broadcast %51 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:27.4570188Z %53 = tt.broadcast %49 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:27.4570488Z %54 = arith.select %52, %53, %cst : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:27.4570769Z %55 = arith.cmpi eq, %48, %cst_1 : tensor<1x2x1xi32> 2026-02-21T09:19:27.4571020Z %56 = tt.broadcast %50 : tensor<16x1x128xi8> -> tensor<16x2x128xi8> 2026-02-21T09:19:27.4571291Z %57 = tt.broadcast %55 : tensor<1x2x1xi1> -> tensor<16x2x128xi1> 2026-02-21T09:19:27.4571601Z %58 = arith.select %57, %56, %54 : tensor<16x2x128xi1>, tensor<16x2x128xi8> 2026-02-21T09:19:27.4571882Z %59 = tt.reshape %58 : tensor<16x2x128xi8> -> tensor<32x128xi8> 2026-02-21T09:19:27.4572141Z %60 = arith.sitofp %59 : tensor<32x128xi8> to tensor<32x128xf32> 2026-02-21T09:19:27.4572538Z %61 = tt.dot %41, %60, %arg4, inputPrecision = tf32 : tensor<128x32xf32> * tensor<32x128xf32> -> 
tensor<128x128xf32> 2026-02-21T09:19:27.4572862Z scf.yield %61 : tensor<128x128xf32> 2026-02-21T09:19:27.4573103Z } {tt.num_stages = 1 : i32} 2026-02-21T09:19:27.4573330Z %19 = arith.truncf %18 : tensor<128x128xf32> to tensor<128x128xbf16> 2026-02-21T09:19:27.4573624Z %20 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T09:19:27.4573895Z %21 = arith.muli %20, %cst_0 : tensor<128x1xi32> 2026-02-21T09:19:27.4574145Z %22 = tt.expand_dims %16 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:19:27.4574435Z %23 = tt.broadcast %21 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T09:19:27.4574695Z %24 = tt.broadcast %22 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T09:19:27.4574937Z %25 = arith.addi %23, %24 : tensor<128x128xi32> 2026-02-21T09:19:27.4575185Z %26 = tt.splat %arg2 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T09:19:27.4575469Z %27 = tt.addptr %26, %25 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T09:19:27.4575738Z tt.store %27, %19 : tensor<128x128x!tt.ptr> 2026-02-21T09:19:27.4575930Z tt.return 2026-02-21T09:19:27.4576066Z } 2026-02-21T09:19:27.4576185Z } 2026-02-21T09:19:27.4576289Z 2026-02-21T09:19:27.4576342Z {-# 2026-02-21T09:19:27.4576502Z external_resources: { 2026-02-21T09:19:27.4576662Z mlir_reproducer: { 2026-02-21T09:19:27.4581078Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, 
tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:19:27.4585606Z disable_threading: false, 2026-02-21T09:19:27.4585780Z verify_each: true 2026-02-21T09:19:27.4585919Z } 2026-02-21T09:19:27.4586043Z } 2026-02-21T09:19:27.4586155Z #-} 2026-02-21T09:19:27.4586578Z /tmp/torchinductor_root/6t/c6tjvl2a2g5szigx5uukpmkmn5inwxbel45jrjhnm777gotbwraf.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:19:27.4587597Z /tmp/torchinductor_root/6t/c6tjvl2a2g5szigx5uukpmkmn5inwxbel45jrjhnm777gotbwraf.py:19:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:19:27.4588436Z [754s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:19:27.4589488Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=5, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:19:27.4590473Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:19:27.4590725Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:19:27.8921417Z [755s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 128, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', ''], loop_orders=[[0, 1]], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]) 2026-02-21T09:19:27.8922627Z Tensor-likes are not close! 2026-02-21T09:19:27.8922758Z 2026-02-21T09:19:27.8922844Z Mismatched elements: 33448764 / 33554432 (99.7%) 2026-02-21T09:19:27.8923321Z Greatest absolute difference: 1408.0 at index (289, 7045) (up to 0.01 allowed) 2026-02-21T09:19:27.8923699Z Greatest relative difference: inf at index (3847, 4168) (up to 0.01 allowed) 2026-02-21T09:19:27.8924014Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:19:27.8924179Z 2026-02-21T09:19:29.2573198Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 18/18 9.5 configs/s 2026-02-21T09:19:29.2576258Z [756s] Generation 20 complete: 2026-02-21T09:19:29.2578367Z error=5 2026-02-21T09:19:29.2578561Z ok=14 2026-02-21T09:19:29.2578700Z min=0.2386 2026-02-21T09:19:29.2578851Z mid=1.1010 2026-02-21T09:19:29.2578993Z max=83.6117 2026-02-21T09:19:29.2579152Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:19:29.2579426Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:19:29.2579689Z 'l2_groupings': [2], 2026-02-21T09:19:29.2579873Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:19:29.2580092Z 'loop_orders': [[0, 1]], 2026-02-21T09:19:29.2580262Z 'maxnreg': 128, 2026-02-21T09:19:29.2580411Z 'num_sm_multiplier': 2, 2026-02-21T09:19:29.2580578Z 'num_stages': 3, 2026-02-21T09:19:29.2580730Z 'num_warps': 4, 2026-02-21T09:19:29.2580892Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:19:29.2581082Z 'range_flattens': [None, None], 2026-02-21T09:19:29.2581265Z 'range_multi_buffers': [True, False], 2026-02-21T09:19:29.2581444Z 'range_num_stages': [3, 4], 2026-02-21T09:19:29.2581850Z 'range_unroll_factors': [0, 0], 2026-02-21T09:19:29.2582042Z 'range_warp_specializes': [True, None]} 2026-02-21T09:19:29.2617137Z [756s] Fitting surrogate: 1237 points, 1237 targets 2026-02-21T09:19:29.5670033Z [756s] Autotuning complete in 756.7s after searching 1169 configs. 
2026-02-21T09:19:29.5670454Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:19:29.5671717Z @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=2, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:19:29.5672702Z 2026-02-21T09:19:29.5672957Z [756s] Code of selected kernel: /tmp/torchinductor_root/hx/chxj4blxlrj4zun7tf7ezqbwlanarwmeiltelxnidd4awe5qkvn5.py 2026-02-21T09:19:29.5895577Z from __future__ import annotations 2026-02-21T09:19:29.5897417Z 2026-02-21T09:19:29.5897566Z import torch 2026-02-21T09:19:29.5897996Z import helion 2026-02-21T09:19:29.5898129Z import triton 2026-02-21T09:19:29.5898283Z import triton.language as tl 2026-02-21T09:19:29.5898518Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:19:29.5898747Z 2026-02-21T09:19:29.5898821Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T09:19:29.5899004Z _BLOCK_SIZE_2 = tl.constexpr(128) 2026-02-21T09:19:29.5899179Z _BLOCK_SIZE_0 = tl.constexpr(16) 2026-02-21T09:19:29.5899437Z # src[int4_gemm.py:34]: def matmul_bf16_int4(A: Tensor, B: Tensor) -> Tensor: 2026-02-21T09:19:29.5899690Z # src[int4_gemm.py:35]: """ 2026-02-21T09:19:29.5899942Z # src[int4_gemm.py:36]: BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:19:29.5900197Z # src[int4_gemm.py:34-93]: ... 
2026-02-21T09:19:29.5900393Z helion.runtime.set_triton_allocator() 2026-02-21T09:19:29.5900520Z 2026-02-21T09:19:29.5900583Z @triton.jit 2026-02-21T09:19:29.5900854Z def _helion_matmul_bf16_int4(A, B, C, _NUM_SM: tl.constexpr, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr): 2026-02-21T09:19:29.5901274Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T09:19:29.5901732Z B_desc = tl.make_tensor_descriptor(B, [512, 8192], [8192, 1], [_BLOCK_SIZE_0, _BLOCK_SIZE_2]) 2026-02-21T09:19:29.5902117Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:19:29.5902497Z C_desc = tl.make_tensor_descriptor(C, [4096, 8192], [8192, 1], [_BLOCK_SIZE_1, _BLOCK_SIZE_2]) 2026-02-21T09:19:29.5902824Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:19:29.5903130Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:19:29.5903391Z # src[int4_gemm.py:57-91]: ... 2026-02-21T09:19:29.5903670Z total_pids = tl.cdiv(4096, _BLOCK_SIZE_1) * tl.cdiv(8192, _BLOCK_SIZE_2) 2026-02-21T09:19:29.5904127Z for virtual_pid in tl.range(tl.program_id(0), total_pids, _NUM_SM * 2, warp_specialize=True, num_stages=3, disallow_acc_multi_buffer=False): 2026-02-21T09:19:29.5904554Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:19:29.5904803Z num_pid_m = tl.cdiv(4096, _BLOCK_SIZE_1) 2026-02-21T09:19:29.5905007Z num_pid_n = tl.cdiv(8192, _BLOCK_SIZE_2) 2026-02-21T09:19:29.5905202Z inner_2d_pid = virtual_pid 2026-02-21T09:19:29.5905389Z num_pid_in_group = 2 * num_pid_n 2026-02-21T09:19:29.5905591Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T09:19:29.5905800Z first_pid_m = group_id * 2 2026-02-21T09:19:29.5905999Z group_size_m = min(num_pid_m - first_pid_m, 2) 2026-02-21T09:19:29.5906266Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T09:19:29.5906547Z pid_1 = inner_2d_pid % num_pid_in_group // 
group_size_m 2026-02-21T09:19:29.5906778Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T09:19:29.5907016Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T09:19:29.5907252Z offset_2 = pid_1 * _BLOCK_SIZE_2 2026-02-21T09:19:29.5907509Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:19:29.5907805Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T09:19:29.5908130Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:19:29.5908532Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:19:29.5908928Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:19:29.5909208Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:19:29.5909485Z for offset_3 in tl.range(0, 512, _BLOCK_SIZE_0, num_stages=4, disallow_acc_multi_buffer=True): 2026-02-21T09:19:29.5909774Z acc_copy = acc 2026-02-21T09:19:29.5909933Z acc_copy_0 = acc_copy 2026-02-21T09:19:29.5910195Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T09:19:29.5910422Z mul = 2 * offset_3 2026-02-21T09:19:29.5910680Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:19:29.5910989Z iota = mul + tl.arange(0, mul_1) 2026-02-21T09:19:29.5911292Z load = tl.load(A + (indices_1[:, None] * 1024 + iota[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T09:19:29.5911727Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:19:29.5912017Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T09:19:29.5912263Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T09:19:29.5912494Z v_0 = tl.cast(load, tl.float32) 2026-02-21T09:19:29.5912774Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 
2026-02-21T09:19:29.5913080Z b_tile = B_desc.load([offset_3, offset_2]) 2026-02-21T09:19:29.5913369Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:19:29.5913654Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:19:29.5913832Z v_2 = b_tile << v_1 2026-02-21T09:19:29.5914033Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:19:29.5914243Z v_4 = v_2 >> v_3 2026-02-21T09:19:29.5914487Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:19:29.5914760Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:19:29.5914933Z v_6 = b_tile >> v_5 2026-02-21T09:19:29.5915162Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 2026-02-21T09:19:29.5915404Z stack_idx = tl.arange(0, 2) 2026-02-21T09:19:29.5915613Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:19:29.5915830Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:19:29.5916042Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:19:29.5916277Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:19:29.5916479Z mask_0 = broadcast_idx == 0 2026-02-21T09:19:29.5916716Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T09:19:29.5916951Z mask_1 = broadcast_idx == 1 2026-02-21T09:19:29.5917181Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T09:19:29.5917452Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:19:29.5917741Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:19:29.5918021Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:19:29.5918270Z view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2]) 2026-02-21T09:19:29.5918523Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:19:29.5918788Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:19:29.5919075Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:19:29.5919296Z # src[int4_gemm.py:88]: 
b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:19:29.5919536Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:19:29.5919830Z # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:19:29.5920120Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:19:29.5920326Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:19:29.5920524Z acc = acc_copy_0 + sum_1 2026-02-21T09:19:29.5920755Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:19:29.5920989Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:19:29.5921190Z C_desc.store([offset_1, offset_2], v_10) 2026-02-21T09:19:29.5921355Z 2026-02-21T09:19:29.5921490Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T09:19:29.5921780Z """ 2026-02-21T09:19:29.5921957Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:19:29.5922143Z 2026-02-21T09:19:29.5922238Z This kernel performs matrix multiplication where: 2026-02-21T09:19:29.5922466Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:19:29.5922714Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:19:29.5922978Z (two 4-bit values packed into each int8) 2026-02-21T09:19:29.5923107Z 2026-02-21T09:19:29.5923170Z Args: 2026-02-21T09:19:29.5923349Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:19:29.5923629Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T09:19:29.5923797Z 2026-02-21T09:19:29.5923854Z Returns: 2026-02-21T09:19:29.5924045Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 
2026-02-21T09:19:29.5924260Z """ 2026-02-21T09:19:29.5924406Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:19:29.5924595Z M, K = A.shape 2026-02-21T09:19:29.5924752Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:19:29.5924939Z _, N = B.shape 2026-02-21T09:19:29.5925199Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:19:29.5925559Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:19:29.5925835Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:19:29.5926097Z _NUM_SM = helion.runtime.get_num_sm(A.device) 2026-02-21T09:19:29.5926415Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:19:29.5926835Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:19:29.5927243Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:19:29.5927529Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:19:29.5927715Z _BLOCK_SIZE_0 = 16 2026-02-21T09:19:29.5927917Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:19:29.5928215Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:19:29.5928500Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:19:29.5928707Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:19:29.5928942Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:19:29.5929244Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:19:29.5929517Z # src[int4_gemm.py:57-91]: ... 
2026-02-21T09:19:29.5929735Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:19:29.5930164Z _launcher(_helion_matmul_bf16_int4, (_NUM_SM * 2,), A, B, C, _NUM_SM, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=4, num_stages=3, maxnreg=128) 2026-02-21T09:19:29.5930561Z # src[int4_gemm.py:93]: return C 2026-02-21T09:19:29.5930737Z return C 2026-02-21T09:19:30.6549263Z WARNING:tritonbench.utils.triton_op:Completed input ID 17: 2026-02-21T09:19:30.6550649Z x_val 2026-02-21T09:19:30.6550902Z --------------------- 2026-02-21T09:19:30.6555001Z (1, 4096, 8192, 1024) 2026-02-21T09:19:30.6559281Z 2026-02-21T09:19:30.6573362Z 60%|██████ | 6/10 [35:30<29:44, 446.21s/it]WARNING:tritonbench.utils.triton_op:Running input ID 21: 2026-02-21T09:19:30.6577016Z x_val 2026-02-21T09:19:30.6580898Z --------------------- 2026-02-21T09:19:30.6584613Z (4, 4096, 8192, 1024) 2026-02-21T09:19:30.6588846Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:19:32.0011205Z INFO:tritonbench.utils.triton_op:Took 2.63ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:19:33.6881135Z INFO:tritonbench.utils.triton_op:Took 0.11ms to get benchmark function for preprocessed_triton_int4_gemm 2026-02-21T09:19:34.8250838Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:19:34.8254646Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:19:34.8259104Z 'dtype': 'torch.bfloat16', 2026-02-21T09:19:34.8262957Z 'shape': (4, 4096, 1024), 2026-02-21T09:19:34.8264837Z 'stride': (4194304, 1024, 1)}, 2026-02-21T09:19:34.8265143Z { 'device': 'cuda:0', 2026-02-21T09:19:34.8269622Z 'dtype': 'torch.int32', 2026-02-21T09:19:34.8271077Z 'shape': (1024, 8192), 2026-02-21T09:19:34.8271379Z 'stride': (8192, 1)}), 2026-02-21T09:19:34.8271658Z 'kwargs': {}} 2026-02-21T09:19:34.8323835Z INFO:tritonbench.utils.triton_op:Took 7.65ms to get benchmark function for helion_int4_gemm_tritonbench 
2026-02-21T09:19:35.0951943Z [0s] Autotune random seed: 2136913670 2026-02-21T09:19:35.1376275Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:19:48.4569401Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━ 100/100 11.9 configs/s 2026-02-21T09:19:49.1034226Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━━━ 100/100 - configs/s 2026-02-21T09:19:49.1040263Z [13s] Adaptive compile timeout: 30s (90% percentile=9.7s, bounds=[30.0s, 30s]) 2026-02-21T09:19:49.1044986Z [13s] Initial random population of 100, 1 starting points: 2026-02-21T09:19:49.1046159Z error=99 2026-02-21T09:19:49.1046577Z ok=1 2026-02-21T09:19:49.1046717Z min=15.7348 2026-02-21T09:19:49.1046858Z mid=15.7348 2026-02-21T09:19:49.1046987Z max=15.7348 2026-02-21T09:19:49.1047141Z best={'block_sizes': [16, 16, 16], 2026-02-21T09:19:49.1047371Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:19:49.1047576Z 'l2_groupings': [1], 2026-02-21T09:19:49.1047747Z 'load_eviction_policies': ['', ''], 2026-02-21T09:19:49.1047932Z 'loop_orders': [[0, 1]], 2026-02-21T09:19:49.1048097Z 'num_stages': 1, 2026-02-21T09:19:49.1048239Z 'num_warps': 4, 2026-02-21T09:19:49.1048410Z 'pid_type': 'flat', 2026-02-21T09:19:49.1048563Z 'range_flattens': [None, None], 2026-02-21T09:19:49.1048743Z 'range_multi_buffers': [None, None], 2026-02-21T09:19:49.1048926Z 'range_num_stages': [0, 0], 2026-02-21T09:19:49.1049096Z 'range_unroll_factors': [0, 0], 2026-02-21T09:19:49.1049284Z 'range_warp_specializes': [None, None]} 2026-02-21T09:19:49.1059008Z [13s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:19:49.4585551Z [14s] Generation 1 starting: 18 neighbors, 1 active search path(s) 2026-02-21T09:19:51.2088998Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 26.7 configs/s 2026-02-21T09:20:06.8967859Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 18/18 1.1 configs/s 
2026-02-21T09:20:06.9372416Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 29/29 - configs/s 2026-02-21T09:20:07.8132688Z [32s] Generation 1 complete: 2026-02-21T09:20:07.8140011Z ok=19 2026-02-21T09:20:07.8145363Z min=6.8578 2026-02-21T09:20:07.8149340Z mid=38.7523 2026-02-21T09:20:07.8153356Z max=320.8038 2026-02-21T09:20:07.8158079Z best={'block_sizes': [16, 32, 32], 2026-02-21T09:20:07.8162082Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:20:07.8165957Z 'l2_groupings': [1], 2026-02-21T09:20:07.8170613Z 'load_eviction_policies': ['', ''], 2026-02-21T09:20:07.8170946Z 'loop_orders': [[0, 1]], 2026-02-21T09:20:07.8171142Z 'num_stages': 1, 2026-02-21T09:20:07.8177816Z 'num_warps': 4, 2026-02-21T09:20:07.8179320Z 'pid_type': 'flat', 2026-02-21T09:20:07.8179533Z 'range_flattens': [None, None], 2026-02-21T09:20:07.8179728Z 'range_multi_buffers': [None, None], 2026-02-21T09:20:07.8179926Z 'range_num_stages': [0, 0], 2026-02-21T09:20:07.8180097Z 'range_unroll_factors': [0, 0], 2026-02-21T09:20:07.8180289Z 'range_warp_specializes': [None, None]} 2026-02-21T09:20:07.8180585Z [32s] Fitting surrogate: 119 points, 119 targets 2026-02-21T09:20:08.2049952Z [33s] Generation 2 starting: 18 neighbors, 1 active search path(s) 2026-02-21T09:20:09.9814740Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 20.3 configs/s 2026-02-21T09:20:10.0817792Z [34s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 16, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:20:10.0824304Z Tensor-likes are not close! 
2026-02-21T09:20:10.0827850Z 2026-02-21T09:20:10.0830065Z Mismatched elements: 133574863 / 134217728 (99.5%) 2026-02-21T09:20:10.0830383Z Greatest absolute difference: 1488.0 at index (13540, 3539) (up to 0.01 allowed) 2026-02-21T09:20:10.0830737Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T09:20:10.0831052Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:20:10.0831234Z 2026-02-21T09:20:11.5337351Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 18/18 11.1 configs/s 2026-02-21T09:20:12.5083201Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 82/82 50.5 configs/s 2026-02-21T09:20:13.1887414Z [38s] Generation 2 complete: 2026-02-21T09:20:13.1891954Z error=1 2026-02-21T09:20:13.1896365Z ok=18 2026-02-21T09:20:13.1898670Z min=2.4145 2026-02-21T09:20:13.1898871Z mid=4.3110 2026-02-21T09:20:13.1899068Z max=8.3845 2026-02-21T09:20:13.1899228Z best={'block_sizes': [16, 128, 32], 2026-02-21T09:20:13.1899457Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:20:13.1899678Z 'l2_groupings': [1], 2026-02-21T09:20:13.1899847Z 'load_eviction_policies': ['', ''], 2026-02-21T09:20:13.1900048Z 'loop_orders': [[0, 1]], 2026-02-21T09:20:13.1900212Z 'num_stages': 1, 2026-02-21T09:20:13.1900372Z 'num_warps': 4, 2026-02-21T09:20:13.1900524Z 'pid_type': 'flat', 2026-02-21T09:20:13.1900698Z 'range_flattens': [None, None], 2026-02-21T09:20:13.1900898Z 'range_multi_buffers': [None, None], 2026-02-21T09:20:13.1901085Z 'range_num_stages': [0, 0], 2026-02-21T09:20:13.1901262Z 'range_unroll_factors': [0, 0], 2026-02-21T09:20:13.1901447Z 'range_warp_specializes': [None, None]} 2026-02-21T09:20:13.1901750Z [38s] Fitting surrogate: 138 points, 138 targets 2026-02-21T09:20:13.5732307Z [38s] Generation 3 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:20:15.5561329Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 17.1 configs/s 2026-02-21T09:20:17.1494251Z 
Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 12.9 configs/s 2026-02-21T09:20:17.3682743Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━ 190/190 413.0 configs/s 2026-02-21T09:20:17.6187190Z [42s] Generation 3 complete: 2026-02-21T09:20:17.6189064Z ok=20 2026-02-21T09:20:17.6189274Z min=1.0476 2026-02-21T09:20:17.6189470Z mid=2.8538 2026-02-21T09:20:17.6189641Z max=15.5249 2026-02-21T09:20:17.6189824Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:20:17.6190114Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:20:17.6190337Z 'l2_groupings': [1], 2026-02-21T09:20:17.6190522Z 'load_eviction_policies': ['', ''], 2026-02-21T09:20:17.6190715Z 'loop_orders': [[0, 1]], 2026-02-21T09:20:17.6190882Z 'num_stages': 2, 2026-02-21T09:20:17.6191025Z 'num_warps': 4, 2026-02-21T09:20:17.6191197Z 'pid_type': 'flat', 2026-02-21T09:20:17.6191360Z 'range_flattens': [None, None], 2026-02-21T09:20:17.6191622Z 'range_multi_buffers': [None, None], 2026-02-21T09:20:17.6191813Z 'range_num_stages': [0, 0], 2026-02-21T09:20:17.6191991Z 'range_unroll_factors': [0, 0], 2026-02-21T09:20:17.6192182Z 'range_warp_specializes': [None, None]} 2026-02-21T09:20:17.6196895Z [42s] Fitting surrogate: 158 points, 158 targets 2026-02-21T09:20:17.9870624Z [42s] Generation 4 starting: 18 neighbors, 1 active search path(s) 2026-02-21T09:20:20.0430039Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 15.4 configs/s 2026-02-21T09:20:21.6386085Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 18/18 10.7 configs/s 2026-02-21T09:20:22.5165807Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━ 190/190 173.0 configs/s 2026-02-21T09:20:22.7580155Z [47s] Generation 4 complete: 2026-02-21T09:20:22.7584034Z error=3 2026-02-21T09:20:22.7588702Z ok=16 2026-02-21T09:20:22.7588961Z min=1.0465 2026-02-21T09:20:22.7589344Z mid=1.9323 2026-02-21T09:20:22.7589499Z max=52.5650 2026-02-21T09:20:22.7589704Z best={'block_sizes': [16, 128, 128], 
2026-02-21T09:20:22.7589988Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:20:22.7590243Z 'l2_groupings': [1], 2026-02-21T09:20:22.7590428Z 'load_eviction_policies': ['', ''], 2026-02-21T09:20:22.7590631Z 'loop_orders': [[0, 1]], 2026-02-21T09:20:22.7590787Z 'num_stages': 2, 2026-02-21T09:20:22.7590935Z 'num_warps': 4, 2026-02-21T09:20:22.7591076Z 'pid_type': 'flat', 2026-02-21T09:20:22.7591238Z 'range_flattens': [None, None], 2026-02-21T09:20:22.7591418Z 'range_multi_buffers': [None, None], 2026-02-21T09:20:22.7591878Z 'range_num_stages': [0, 0], 2026-02-21T09:20:22.7592047Z 'range_unroll_factors': [0, 0], 2026-02-21T09:20:22.7592239Z 'range_warp_specializes': [None, None]} 2026-02-21T09:20:22.7592461Z [47s] Fitting surrogate: 177 points, 177 targets 2026-02-21T09:20:23.0831753Z [47s] Generation 5 starting: 12 neighbors, 1 active search path(s) 2026-02-21T09:20:25.8086366Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 2.9 configs/s 2026-02-21T09:20:26.2363715Z [51s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 128, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:20:26.2367977Z Tensor-likes are not close! 2026-02-21T09:20:26.2372801Z 2026-02-21T09:20:26.2377311Z Mismatched elements: 133747204 / 134217728 (99.6%) 2026-02-21T09:20:26.2381721Z Greatest absolute difference: 1464.0 at index (14943, 1411) (up to 0.01 allowed) 2026-02-21T09:20:26.2386197Z Greatest relative difference: inf at index (6066, 7373) (up to 0.01 allowed) 2026-02-21T09:20:26.2390544Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:20:26.2394111Z 2026-02-21T09:20:26.2787784Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 27.5 configs/s 2026-02-21T09:20:26.7276133Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━ 191/191 277.0 configs/s 2026-02-21T09:20:26.9756941Z [51s] Generation 5 complete: 2026-02-21T09:20:26.9758361Z error=8 2026-02-21T09:20:26.9758515Z ok=5 2026-02-21T09:20:26.9758647Z min=1.0476 2026-02-21T09:20:26.9758776Z mid=1.3200 2026-02-21T09:20:26.9758909Z max=12.1436 2026-02-21T09:20:26.9759059Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:20:26.9759283Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:20:26.9759510Z 'l2_groupings': [1], 2026-02-21T09:20:26.9759669Z 'load_eviction_policies': ['', ''], 2026-02-21T09:20:26.9759856Z 'loop_orders': [[0, 1]], 2026-02-21T09:20:26.9760011Z 'num_stages': 2, 2026-02-21T09:20:26.9760161Z 'num_warps': 4, 2026-02-21T09:20:26.9760300Z 'pid_type': 'flat', 2026-02-21T09:20:26.9760462Z 'range_flattens': [None, None], 2026-02-21T09:20:26.9760641Z 'range_multi_buffers': [None, None], 2026-02-21T09:20:26.9760834Z 'range_num_stages': [0, 0], 2026-02-21T09:20:26.9761004Z 'range_unroll_factors': [0, 0], 2026-02-21T09:20:26.9761181Z 'range_warp_specializes': [None, None]} 2026-02-21T09:20:26.9770105Z [51s] Fitting surrogate: 190 points, 190 targets 2026-02-21T09:20:27.1568769Z [52s] Autotuning complete in 52.0s after searching 86 configs. 
2026-02-21T09:20:27.1569168Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:20:27.1570180Z @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:20:27.1571416Z 2026-02-21T09:20:27.1572003Z [52s] Code of selected kernel: /tmp/torchinductor_root/yn/cyn5soig3id7gubbnjapnin4fcuyrt3inoax5bk2mjrqtoyw6ma6.py 2026-02-21T09:20:27.1768067Z from __future__ import annotations 2026-02-21T09:20:27.1768321Z 2026-02-21T09:20:27.1768450Z import torch 2026-02-21T09:20:27.1768654Z import triton 2026-02-21T09:20:27.1768824Z import triton.language as tl 2026-02-21T09:20:27.1769118Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:20:27.1769321Z 2026-02-21T09:20:27.1769397Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T09:20:27.1769582Z _BLOCK_SIZE_2 = tl.constexpr(128) 2026-02-21T09:20:27.1775670Z _BLOCK_SIZE_0 = tl.constexpr(16) 2026-02-21T09:20:27.1775905Z 2026-02-21T09:20:27.1775980Z @triton.jit 2026-02-21T09:20:27.1776233Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr): 2026-02-21T09:20:27.1776594Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:20:27.1776854Z num_blocks_0 = tl.cdiv(16384, _BLOCK_SIZE_1) 2026-02-21T09:20:27.1777082Z pid_0 = tl.program_id(0) % num_blocks_0 2026-02-21T09:20:27.1777296Z pid_1 = tl.program_id(0) // num_blocks_0 2026-02-21T09:20:27.1777497Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T09:20:27.1777737Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T09:20:27.1777971Z offset_2 = pid_1 * _BLOCK_SIZE_2 2026-02-21T09:20:27.1778199Z indices_2 = (offset_2 + 
tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32) 2026-02-21T09:20:27.1778495Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:20:27.1778802Z acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32) 2026-02-21T09:20:27.1779130Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:20:27.1779558Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:20:27.1779958Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:20:27.1780233Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:20:27.1780442Z for offset_3 in tl.range(0, 512, _BLOCK_SIZE_0): 2026-02-21T09:20:27.1780696Z indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T09:20:27.1780935Z acc_copy = acc 2026-02-21T09:20:27.1781135Z acc_copy_0 = acc_copy 2026-02-21T09:20:27.1781353Z # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2 2026-02-21T09:20:27.1781683Z mul = 2 * offset_3 2026-02-21T09:20:27.1781936Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:20:27.1782227Z iota = mul + tl.arange(0, mul_1) 2026-02-21T09:20:27.1782477Z load = tl.load(A + (indices_1[:, None] * 1024 + iota[None, :] * 1), None) 2026-02-21T09:20:27.1782806Z # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to( 2026-02-21T09:20:27.1783109Z # src[int4_gemm.py:66]: torch.float32 2026-02-21T09:20:27.1783342Z # src[int4_gemm.py:67]: ) # [BLOCK_SIZE_M, BLOCK_SIZE_K] 2026-02-21T09:20:27.1783574Z v_0 = tl.cast(load, tl.float32) 2026-02-21T09:20:27.1783849Z # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n] # [BLOCK_SIZE_K//2, BLOCK_SIZE_N] 2026-02-21T09:20:27.1784212Z b_tile = tl.load(B + (indices_3[:, None] * 8192 + indices_2[None, :] * 1), None) 2026-02-21T09:20:27.1784562Z # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) 
>> 4).to(torch.int8) # Sign-extend low 4 bits 2026-02-21T09:20:27.1784832Z v_1 = tl.full([], 4, tl.int8) 2026-02-21T09:20:27.1785203Z v_2 = b_tile << v_1 2026-02-21T09:20:27.1785368Z v_3 = tl.full([], 4, tl.int8) 2026-02-21T09:20:27.1785549Z v_4 = v_2 >> v_3 2026-02-21T09:20:27.1785787Z # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8) # Sign-extend high 4 bits 2026-02-21T09:20:27.1786110Z v_5 = tl.full([], 4, tl.int8) 2026-02-21T09:20:27.1786326Z v_6 = b_tile >> v_5 2026-02-21T09:20:27.1786587Z # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1) 2026-02-21T09:20:27.1786837Z stack_idx = tl.arange(0, 2) 2026-02-21T09:20:27.1787031Z broadcast_idx = stack_idx[None, :, None] 2026-02-21T09:20:27.1787240Z expanded_0 = tl.expand_dims(v_4, 1) 2026-02-21T09:20:27.1787430Z expanded_1 = tl.expand_dims(v_6, 1) 2026-02-21T09:20:27.1787642Z stacked_result = tl.zeros_like(expanded_0) 2026-02-21T09:20:27.1787841Z mask_0 = broadcast_idx == 0 2026-02-21T09:20:27.1788074Z stacked_result = tl.where(mask_0, expanded_0, stacked_result) 2026-02-21T09:20:27.1788315Z mask_1 = broadcast_idx == 1 2026-02-21T09:20:27.1788537Z stacked_result = tl.where(mask_1, expanded_1, stacked_result) 2026-02-21T09:20:27.1788819Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:20:27.1789104Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:20:27.1789382Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:20:27.1789635Z view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2]) 2026-02-21T09:20:27.1789890Z v_7 = tl.cast(view, tl.float32) 2026-02-21T09:20:27.1790167Z # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2) # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1] 2026-02-21T09:20:27.1790439Z a_tile_1 = v_0[:, :, None] 2026-02-21T09:20:27.1790662Z # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0) 2026-02-21T09:20:27.1790891Z b_unpacked_1 = v_7[None, :, :] 2026-02-21T09:20:27.1791187Z # src[int4_gemm.py:89]: acc = 
acc + (a_tile * b_unpacked).sum(dim=1) # [BLOCK_SIZE_M, BLOCK_SIZE_N] 2026-02-21T09:20:27.1791475Z v_8 = a_tile_1 * b_unpacked_1 2026-02-21T09:20:27.1791713Z sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32) 2026-02-21T09:20:27.1791918Z acc = acc_copy_0 + sum_1 2026-02-21T09:20:27.1792140Z # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16) 2026-02-21T09:20:27.1792384Z v_10 = tl.cast(acc, tl.bfloat16) 2026-02-21T09:20:27.1792624Z tl.store(C + (indices_1[:, None] * 8192 + indices_2[None, :] * 1), v_10, None) 2026-02-21T09:20:27.1792821Z 2026-02-21T09:20:27.1792954Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher): 2026-02-21T09:20:27.1793196Z """ 2026-02-21T09:20:27.1793376Z BFloat16 x INT4 General Matrix Multiplication (GEMM). 2026-02-21T09:20:27.1793532Z 2026-02-21T09:20:27.1793641Z This kernel performs matrix multiplication where: 2026-02-21T09:20:27.1793869Z - A is a bfloat16 matrix of shape [M, K] 2026-02-21T09:20:27.1794130Z - B is an int8 matrix of shape [K//2, N] containing packed int4 values 2026-02-21T09:20:27.1794387Z (two 4-bit values packed into each int8) 2026-02-21T09:20:27.1794530Z 2026-02-21T09:20:27.1794587Z Args: 2026-02-21T09:20:27.1794765Z A (Tensor): Input tensor of shape [M, K] in bfloat16 format. 2026-02-21T09:20:27.1795048Z B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format. 2026-02-21T09:20:27.1795216Z 2026-02-21T09:20:27.1795278Z Returns: 2026-02-21T09:20:27.1795459Z Tensor: Output tensor of shape [M, N] in bfloat16 format. 
2026-02-21T09:20:27.1795675Z """ 2026-02-21T09:20:27.1795812Z # src[int4_gemm.py:50]: M, K = A.shape 2026-02-21T09:20:27.1795994Z M, K = A.shape 2026-02-21T09:20:27.1796144Z # src[int4_gemm.py:51]: _, N = B.shape 2026-02-21T09:20:27.1796320Z _, N = B.shape 2026-02-21T09:20:27.1796545Z # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:20:27.1796895Z C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device) 2026-02-21T09:20:27.1797166Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:20:27.1797414Z _BLOCK_SIZE_1 = 128 2026-02-21T09:20:27.1797573Z _BLOCK_SIZE_2 = 128 2026-02-21T09:20:27.1797855Z # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed): 2026-02-21T09:20:27.1798291Z # src[int4_gemm.py:61]: # Load corresponding tiles from A (need to load twice the packed tile size) 2026-02-21T09:20:27.1798683Z # src[int4_gemm.py:62]: # We need to map tile_k_packed to the corresponding range in A 2026-02-21T09:20:27.1798949Z # src[int4_gemm.py:60-89]: ... 2026-02-21T09:20:27.1799124Z _BLOCK_SIZE_0 = 16 2026-02-21T09:20:27.1799311Z # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape( 2026-02-21T09:20:27.1799592Z # src[int4_gemm.py:84]: tile_k_packed.block_size * 2, tile_n.block_size 2026-02-21T09:20:27.1799848Z # src[int4_gemm.py:85]: ).to(torch.float32) 2026-02-21T09:20:27.1800047Z _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0 2026-02-21T09:20:27.1800273Z # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]): 2026-02-21T09:20:27.1800559Z # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T09:20:27.1800820Z # src[int4_gemm.py:57-91]: ... 
2026-02-21T09:20:27.1801032Z _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0) 2026-02-21T09:20:27.1801517Z _launcher(_helion_matmul_bf16_int4, (triton.cdiv(16384, _BLOCK_SIZE_1) * triton.cdiv(8192, _BLOCK_SIZE_2),), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=4, num_stages=2) 2026-02-21T09:20:27.1802011Z # src[int4_gemm.py:93]: return C 2026-02-21T09:20:27.1802193Z return C 2026-02-21T09:20:28.2269785Z WARNING:tritonbench.utils.triton_op:Completed input ID 21: 2026-02-21T09:20:28.2271226Z x_val 2026-02-21T09:20:28.2271449Z --------------------- 2026-02-21T09:20:28.2271823Z (4, 4096, 8192, 1024) 2026-02-21T09:20:28.2272034Z 2026-02-21T09:20:28.2286641Z 70%|███████ | 7/10 [36:28<15:57, 319.15s/it]WARNING:tritonbench.utils.triton_op:Running input ID 24: 2026-02-21T09:20:28.2290731Z x_val 2026-02-21T09:20:28.2294392Z ---------------------- 2026-02-21T09:20:28.2295292Z (16, 4096, 1280, 8192) 2026-02-21T09:20:28.2295793Z INFO:tritonbench.utils.triton_op:Took 0.19ms to get benchmark function for preprocessed_eager_int4_gemm 2026-02-21T09:20:29.4615423Z INFO:tritonbench.utils.triton_op:Took 2.38ms to get benchmark function for preprocessed_torch_compile_int4_gemm 2026-02-21T09:20:31.1983161Z Autotune Choices Stats: 2026-02-21T09:20:31.1985278Z {"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 1.0802559852600098, "best_triton_pos": 1, "best_triton_time": 1.4008640050888062, "best_triton_kernel": "triton_mm_17", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4"} 2026-02-21T09:20:31.2000757Z AUTOTUNE mm(65536x8192, 8192x1280) 2026-02-21T09:20:31.2005744Z strides: [8192, 1], [1280, 1] 2026-02-21T09:20:31.2007515Z dtypes: torch.bfloat16, torch.bfloat16 2026-02-21T09:20:31.2007874Z mm 1.0803 ms 100.0% 2026-02-21T09:20:31.2008448Z triton_mm_17 1.4009 ms 77.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T09:20:31.2009200Z triton_mm_16 1.5625 ms 69.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T09:20:31.2009992Z triton_mm_18 1.5893 ms 68.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2026-02-21T09:20:31.2010757Z triton_mm_11 1.8238 ms 59.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T09:20:31.2011910Z triton_mm_10 1.9660 ms 54.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2026-02-21T09:20:31.2012827Z triton_mm_13 2.0254 ms 53.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T09:20:31.2013601Z triton_mm_14 2.1033 ms 51.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2026-02-21T09:20:31.2014316Z triton_mm_9 2.1729 ms 49.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2026-02-21T09:20:31.2015052Z triton_mm_7 2.2373 ms 48.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2026-02-21T09:20:31.2015726Z SingleProcess AUTOTUNE benchmarking takes 1.1117 seconds and 0.2904 seconds precompiling for 20 choices 2026-02-21T09:20:33.8632817Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark 
function for preprocessed_triton_int4_gemm 2026-02-21T09:20:35.1584534Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:20:35.1588576Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:20:35.1589020Z 'dtype': 'torch.bfloat16', 2026-02-21T09:20:35.1589406Z 'shape': (16, 4096, 8192), 2026-02-21T09:20:35.1589688Z 'stride': (33554432, 8192, 1)}, 2026-02-21T09:20:35.1590058Z { 'device': 'cuda:0', 2026-02-21T09:20:35.1590296Z 'dtype': 'torch.int32', 2026-02-21T09:20:35.1590554Z 'shape': (8192, 1280), 2026-02-21T09:20:35.1590844Z 'stride': (1280, 1)}), 2026-02-21T09:20:35.1591071Z 'kwargs': {}} 2026-02-21T09:20:35.1607747Z INFO:tritonbench.utils.triton_op:Took 2.59ms to get benchmark function for helion_int4_gemm_tritonbench 2026-02-21T09:20:35.4147454Z [0s] Autotune random seed: 2136913670 2026-02-21T09:20:35.6123840Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:21:10.7897239Z [35s] Timeout after 30s compiling Config(block_sizes=[64, 1024, 2], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[64], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T09:21:10.7910999Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s 2026-02-21T09:25:47.3451090Z [311s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 16, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['last', 'last'], loop_orders=[[0, 1]], num_sm_multiplier=128, num_stages=6, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[4, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]) 
2026-02-21T09:25:47.3452767Z Tensor-likes are not close! 2026-02-21T09:25:47.3456161Z 2026-02-21T09:25:47.3460986Z Mismatched elements: 83487266 / 83886080 (99.5%) 2026-02-21T09:25:47.3462554Z Greatest absolute difference: 3936.0 at index (8609, 105) (up to 0.01 allowed) 2026-02-21T09:25:47.3462967Z Greatest relative difference: 46137344.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:25:47.3463397Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:25:47.3463606Z 2026-02-21T09:26:01.4698496Z [325s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 512, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['last', 'first'], loop_orders=[[1, 0]], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[4, 1], range_unroll_factors=[4, 1], range_warp_specializes=[False, None]) 2026-02-21T09:26:01.4699768Z Tensor-likes are not close! 2026-02-21T09:26:01.4699959Z 2026-02-21T09:26:01.4700163Z Mismatched elements: 83762618 / 83886080 (99.9%) 2026-02-21T09:26:01.4700558Z Greatest absolute difference: 8960.0 at index (62503, 918) (up to 0.01 allowed) 2026-02-21T09:26:01.4700988Z Greatest relative difference: 88080384.0 at index (16838, 491) (up to 0.01 allowed) 2026-02-21T09:26:01.4701369Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:26:01.4701766Z 2026-02-21T09:26:36.4354091Z [360s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 512, 16], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], maxnreg=64, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, False], range_num_stages=[2, 1], range_unroll_factors=[2, 0], range_warp_specializes=[False, None]) 2026-02-21T09:26:36.4355331Z Tensor-likes are not close! 2026-02-21T09:26:36.4359772Z 2026-02-21T09:26:36.4361864Z Mismatched elements: 83710904 / 83886080 (99.8%) 2026-02-21T09:26:36.4362256Z Greatest absolute difference: 6784.0 at index (54458, 1133) (up to 0.01 allowed) 2026-02-21T09:26:36.4362672Z Greatest relative difference: 88604672.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:26:36.4363046Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:26:36.4363283Z 2026-02-21T09:29:59.6189519Z [564s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[32], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=4, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[1, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]) 2026-02-21T09:29:59.6190798Z Tensor-likes are not close! 
2026-02-21T09:29:59.6195099Z 2026-02-21T09:29:59.6199159Z Mismatched elements: 83821368 / 83886080 (99.9%) 2026-02-21T09:29:59.6203329Z Greatest absolute difference: 9408.0 at index (3424, 363) (up to 0.01 allowed) 2026-02-21T09:29:59.6204823Z Greatest relative difference: 110100480.0 at index (39175, 14) (up to 0.01 allowed) 2026-02-21T09:29:59.6205201Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:29:59.6205387Z 2026-02-21T09:32:39.1078216Z [723s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 128, 256], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['first', 'first'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=128, num_stages=8, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[1, 1], range_warp_specializes=[None, True]) 2026-02-21T09:32:39.1085518Z Tensor-likes are not close! 2026-02-21T09:32:39.1089061Z 2026-02-21T09:32:39.1090791Z Mismatched elements: 83715461 / 83886080 (99.8%) 2026-02-21T09:32:39.1091115Z Greatest absolute difference: 7424.0 at index (50049, 142) (up to 0.01 allowed) 2026-02-21T09:32:39.1091517Z Greatest relative difference: 63700992.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:32:39.1092062Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:32:39.1092247Z 2026-02-21T09:38:21.5641118Z [1065s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', 'last'], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=1, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[False, False]) 2026-02-21T09:38:21.5642759Z Tensor-likes are not close! 2026-02-21T09:38:21.5642905Z 2026-02-21T09:38:21.5643081Z Mismatched elements: 83486643 / 83886080 (99.5%) 2026-02-21T09:38:21.5643435Z Greatest absolute difference: 3920.0 at index (8609, 105) (up to 0.01 allowed) 2026-02-21T09:38:21.5643854Z Greatest relative difference: 44302336.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:38:21.5644248Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:38:21.5644431Z 2026-02-21T09:41:07.5538894Z [1231s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 128, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', 'first'], loop_orders=[[0, 1]], maxnreg=128, num_sm_multiplier=2, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[0, 2], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]) 2026-02-21T09:41:07.5540141Z Tensor-likes are not close! 2026-02-21T09:41:07.5544147Z 2026-02-21T09:41:07.5549434Z Mismatched elements: 83774238 / 83886080 (99.9%) 2026-02-21T09:41:07.5553978Z Greatest absolute difference: 9600.0 at index (24908, 1055) (up to 0.01 allowed) 2026-02-21T09:41:07.5554967Z Greatest relative difference: 168820736.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:41:07.5555439Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:41:07.5555629Z 2026-02-21T09:44:14.2875461Z [1418s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 512, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], l2_groupings=[2], load_eviction_policies=['first', 'last'], loop_orders=[[0, 1]], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]) 2026-02-21T09:44:14.2879645Z Tensor-likes are not close! 2026-02-21T09:44:14.2883372Z 2026-02-21T09:44:14.2885883Z Mismatched elements: 83716178 / 83886080 (99.8%) 2026-02-21T09:44:14.2886441Z Greatest absolute difference: 6816.0 at index (59561, 889) (up to 0.01 allowed) 2026-02-21T09:44:14.2886970Z Greatest relative difference: 84410368.0 at index (18996, 134) (up to 0.01 allowed) 2026-02-21T09:44:14.2887466Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:44:14.2887658Z 2026-02-21T09:46:11.9328752Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.2 configs/s 2026-02-21T09:46:11.9340559Z [1536s] Adaptive compile timeout: 30s (90% percentile=7.8s, bounds=[30.0s, 30s]) 2026-02-21T09:46:12.0722174Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 - configs/s 2026-02-21T09:46:12.6085567Z [1536s] Initial random population of 100, 5 starting points: 2026-02-21T09:46:12.6086927Z error=16 2026-02-21T09:46:12.6087381Z timeout=1 2026-02-21T09:46:12.6087579Z ok=83 2026-02-21T09:46:12.6087751Z min=40.3590 2026-02-21T09:46:12.6087956Z mid=1249.7255 2026-02-21T09:46:12.6088150Z max=17425.0020 2026-02-21T09:46:12.6088382Z best={'block_sizes': [16, 64, 64], 2026-02-21T09:46:12.6088656Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:46:12.6088949Z 'l2_groupings': [32], 2026-02-21T09:46:12.6089192Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T09:46:12.6089434Z 'loop_orders': [[0, 1]], 
2026-02-21T09:46:12.6089664Z 'num_stages': 5, 2026-02-21T09:46:12.6089863Z 'num_warps': 16, 2026-02-21T09:46:12.6093685Z 'pid_type': 'flat', 2026-02-21T09:46:12.6098117Z 'range_flattens': [None, False], 2026-02-21T09:46:12.6101276Z 'range_multi_buffers': [None, False], 2026-02-21T09:46:12.6105124Z 'range_num_stages': [0, 1], 2026-02-21T09:46:12.6106458Z 'range_unroll_factors': [0, 0], 2026-02-21T09:46:12.6106771Z 'range_warp_specializes': [None, None]} 2026-02-21T09:46:12.6107129Z [1536s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:46:13.9262056Z [1538s] Generation 1 starting: 96 neighbors, 5 active search path(s) 2026-02-21T09:46:27.1718017Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 2.1 configs/s 2026-02-21T09:46:32.5785657Z [1556s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 16, 64], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[2], load_eviction_policies=['', 'last'], loop_orders=[[1, 0]], maxnreg=64, num_sm_multiplier=128, num_stages=8, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[4, 3], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:46:32.5786962Z Tensor-likes are not close! 2026-02-21T09:46:32.5790384Z 2026-02-21T09:46:32.5794921Z Mismatched elements: 83692326 / 83886080 (99.8%) 2026-02-21T09:46:32.5798858Z Greatest absolute difference: 8096.0 at index (60657, 377) (up to 0.01 allowed) 2026-02-21T09:46:32.5800335Z Greatest relative difference: 32243712.0 at index (16838, 491) (up to 0.01 allowed) 2026-02-21T09:46:32.5800805Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:46:32.5802332Z 2026-02-21T09:47:11.1778702Z [1595s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]) 2026-02-21T09:47:11.1784545Z Tensor-likes are not close! 2026-02-21T09:47:11.1787970Z 2026-02-21T09:47:11.1791489Z Mismatched elements: 83348939 / 83886080 (99.4%) 2026-02-21T09:47:11.1795374Z Greatest absolute difference: 2416.0 at index (12986, 297) (up to 0.01 allowed) 2026-02-21T09:47:11.1799353Z Greatest relative difference: 10223616.0 at index (9502, 532) (up to 0.01 allowed) 2026-02-21T09:47:11.1803401Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:47:11.1804420Z 2026-02-21T09:48:16.5131799Z [1660s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=128, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]) 2026-02-21T09:48:16.5136553Z Tensor-likes are not close! 2026-02-21T09:48:16.5140606Z 2026-02-21T09:48:16.5144898Z Mismatched elements: 83671559 / 83886080 (99.7%) 2026-02-21T09:48:16.5149558Z Greatest absolute difference: 5088.0 at index (19505, 46) (up to 0.01 allowed) 2026-02-21T09:48:16.5151081Z Greatest relative difference: 20447232.0 at index (9502, 532) (up to 0.01 allowed) 2026-02-21T09:48:16.5151777Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:48:16.5152006Z 2026-02-21T09:48:31.4805689Z [1675s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=128, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:48:31.4806894Z Tensor-likes are not close! 2026-02-21T09:48:31.4807037Z 2026-02-21T09:48:31.4807173Z Mismatched elements: 83348939 / 83886080 (99.4%) 2026-02-21T09:48:31.4807501Z Greatest absolute difference: 2416.0 at index (12986, 297) (up to 0.01 allowed) 2026-02-21T09:48:31.4807914Z Greatest relative difference: 10223616.0 at index (9502, 532) (up to 0.01 allowed) 2026-02-21T09:48:31.4808273Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:48:31.4808478Z 2026-02-21T09:50:29.9663881Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 99/99 0.5 configs/s 2026-02-21T09:50:29.9680338Z [1794s] Generation 1 complete: 2026-02-21T09:50:29.9681354Z error=4 2026-02-21T09:50:29.9681604Z ok=98 2026-02-21T09:50:29.9681781Z min=7.3205 2026-02-21T09:50:29.9681993Z mid=47.5310 2026-02-21T09:50:29.9682171Z max=3204.0173 2026-02-21T09:50:29.9682396Z best={'block_sizes': [64, 64, 128], 2026-02-21T09:50:29.9682667Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:50:29.9682959Z 'l2_groupings': [32], 2026-02-21T09:50:29.9683177Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T09:50:29.9683458Z 'loop_orders': [[0, 1]], 2026-02-21T09:50:29.9683655Z 'num_stages': 5, 2026-02-21T09:50:29.9683864Z 'num_warps': 4, 2026-02-21T09:50:29.9684073Z 'pid_type': 'flat', 2026-02-21T09:50:29.9684304Z 'range_flattens': [None, False], 2026-02-21T09:50:29.9684556Z 'range_multi_buffers': [None, False], 2026-02-21T09:50:29.9684784Z 'range_num_stages': [0, 1], 2026-02-21T09:50:29.9685024Z 'range_unroll_factors': [0, 0], 2026-02-21T09:50:29.9685254Z 'range_warp_specializes': [None, None]} 2026-02-21T09:50:29.9702582Z [1794s] Fitting surrogate: 202 points, 202 targets 2026-02-21T09:50:31.8282062Z [1796s] Generation 2 starting: 96 neighbors, 5 active search path(s) 2026-02-21T09:50:47.8421016Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 1.8 configs/s 2026-02-21T09:50:53.9805616Z [1818s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 64], indexing=['pointer', 'tensor_descriptor', 'pointer'], l2_groupings=[2], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:50:53.9806786Z Tensor-likes are not close! 
2026-02-21T09:50:53.9811992Z 2026-02-21T09:50:53.9813799Z Mismatched elements: 83621361 / 83886080 (99.7%) 2026-02-21T09:50:53.9814219Z Greatest absolute difference: 3888.0 at index (35844, 277) (up to 0.01 allowed) 2026-02-21T09:50:53.9814622Z Greatest relative difference: 164626432.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:50:53.9815026Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:50:53.9815220Z 2026-02-21T09:51:00.6557460Z [1825s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=128, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:51:00.6559116Z Tensor-likes are not close! 2026-02-21T09:51:00.6559261Z 2026-02-21T09:51:00.6559403Z Mismatched elements: 83704465 / 83886080 (99.8%) 2026-02-21T09:51:00.6559739Z Greatest absolute difference: 7232.0 at index (19358, 1191) (up to 0.01 allowed) 2026-02-21T09:51:00.6560191Z Greatest relative difference: 283115520.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:51:00.6560592Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:51:00.6560812Z 2026-02-21T09:51:01.1064874Z [1825s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:51:01.1066213Z Tensor-likes are not close! 2026-02-21T09:51:01.1068179Z 2026-02-21T09:51:01.1068473Z Mismatched elements: 83692326 / 83886080 (99.8%) 2026-02-21T09:51:01.1069157Z Greatest absolute difference: 8096.0 at index (60657, 377) (up to 0.01 allowed) 2026-02-21T09:51:01.1073128Z Greatest relative difference: 32243712.0 at index (16838, 491) (up to 0.01 allowed) 2026-02-21T09:51:01.1077396Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:51:01.1080938Z 2026-02-21T09:51:01.3314932Z [1825s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]) 2026-02-21T09:51:01.3316342Z Tensor-likes are not close! 
2026-02-21T09:51:01.3320729Z 2026-02-21T09:51:01.3322512Z Mismatched elements: 83486939 / 83886080 (99.5%) 2026-02-21T09:51:01.3322932Z Greatest absolute difference: 4704.0 at index (60657, 377) (up to 0.01 allowed) 2026-02-21T09:51:01.3323376Z Greatest relative difference: 16121856.0 at index (16838, 491) (up to 0.01 allowed) 2026-02-21T09:51:01.3323755Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:51:01.3323955Z 2026-02-21T09:51:02.0570075Z [1826s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:51:02.0571310Z Tensor-likes are not close! 2026-02-21T09:51:02.0576213Z 2026-02-21T09:51:02.0578312Z Mismatched elements: 83692326 / 83886080 (99.8%) 2026-02-21T09:51:02.0578696Z Greatest absolute difference: 8096.0 at index (60657, 377) (up to 0.01 allowed) 2026-02-21T09:51:02.0579126Z Greatest relative difference: 32243712.0 at index (16838, 491) (up to 0.01 allowed) 2026-02-21T09:51:02.0579495Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:51:02.0579715Z 2026-02-21T09:51:02.0877573Z [1826s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], maxnreg=128, num_sm_multiplier=128, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:51:02.0879016Z Tensor-likes are not close! 2026-02-21T09:51:02.0882879Z 2026-02-21T09:51:02.0887786Z Mismatched elements: 83559273 / 83886080 (99.6%) 2026-02-21T09:51:02.0891873Z Greatest absolute difference: 4000.0 at index (19358, 1191) (up to 0.01 allowed) 2026-02-21T09:51:02.0893465Z Greatest relative difference: 141557760.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:51:02.0893909Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:51:02.0894097Z 2026-02-21T09:51:02.4950190Z [1826s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 32, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], l2_groupings=[16], load_eviction_policies=['first', 'last'], loop_orders=[[1, 0]], num_sm_multiplier=128, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:51:02.4951425Z Tensor-likes are not close! 
2026-02-21T09:51:02.4951766Z 2026-02-21T09:51:02.4951947Z Mismatched elements: 83704465 / 83886080 (99.8%) 2026-02-21T09:51:02.4952542Z Greatest absolute difference: 7232.0 at index (19358, 1191) (up to 0.01 allowed) 2026-02-21T09:51:02.4953058Z Greatest relative difference: 283115520.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:51:02.4953435Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T09:51:02.4953645Z 2026-02-21T09:51:11.2814565Z [1835s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 64, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['first', ''], loop_orders=[[1, 0]], num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:51:11.2815610Z Tensor-likes are not close! 2026-02-21T09:51:11.2815754Z 2026-02-21T09:51:11.2815866Z Mismatched elements: 83595214 / 83886080 (99.7%) 2026-02-21T09:51:11.2816237Z Greatest absolute difference: 3744.0 at index (37631, 506) (up to 0.01 allowed) 2026-02-21T09:51:11.2816639Z Greatest relative difference: 164626432.0 at index (50268, 905) (up to 0.01 allowed) 2026-02-21T09:51:11.2817038Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:51:11.2817222Z 2026-02-21T09:51:26.9198793Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 100/100 2.2 configs/s 2026-02-21T09:51:27.3505977Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 27/27 12.2 configs/s 2026-02-21T09:51:29.2278016Z [1853s] Generation 2 complete: 2026-02-21T09:51:29.2281151Z error=11 2026-02-21T09:51:29.2282769Z ok=91 2026-02-21T09:51:29.2282974Z min=7.3145 2026-02-21T09:51:29.2283151Z mid=19.0842 2026-02-21T09:51:29.2283381Z max=574.0668 2026-02-21T09:51:29.2283577Z best={'block_sizes': [64, 64, 128], 2026-02-21T09:51:29.2283886Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:51:29.2284152Z 'l2_groupings': [32], 2026-02-21T09:51:29.2284415Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T09:51:29.2284723Z 'loop_orders': [[0, 1]], 2026-02-21T09:51:29.2284933Z 'num_stages': 5, 2026-02-21T09:51:29.2285148Z 'num_warps': 4, 2026-02-21T09:51:29.2285351Z 'pid_type': 'flat', 2026-02-21T09:51:29.2285575Z 'range_flattens': [None, False], 2026-02-21T09:51:29.2285798Z 'range_multi_buffers': [None, False], 2026-02-21T09:51:29.2286049Z 'range_num_stages': [0, 1], 2026-02-21T09:51:29.2286256Z 'range_unroll_factors': [0, 0], 2026-02-21T09:51:29.2286507Z 'range_warp_specializes': [None, None]} 2026-02-21T09:51:29.2315373Z [1853s] Fitting surrogate: 304 points, 304 targets 2026-02-21T09:51:30.4913111Z [1854s] Generation 3 starting: 92 neighbors, 5 active search path(s) 2026-02-21T09:51:39.2746906Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 12.3 configs/s 2026-02-21T09:51:48.3667546Z Generation 3: exploring neighbors 40% ━━━━━━━━ 39/97 4.2 configs/s 2026-02-21T09:51:48.3749760Z 2026-02-21T09:51:48.3755509Z 70%|███████ | 7/10 [1:07:48<29:03, 581.22s/it] 2026-02-21T09:51:48.3760117Z WARNING:tritonbench.utils.triton_op:Caught exception on backend helion_int4_gemm_tritonbench, terminating early with partial results 2026-02-21T09:51:48.3764394Z Traceback (most recent call 
last): 2026-02-21T09:51:48.3769216Z File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1199, in run 2026-02-21T09:51:48.3770050Z y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce( 2026-02-21T09:51:48.3770386Z ^^^^^^^^^^^^^^^^^ 2026-02-21T09:51:48.3775421Z File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1188, in _reduce_benchmarks 2026-02-21T09:51:48.3775908Z torch.accelerator.synchronize() 2026-02-21T09:51:48.3776374Z File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/accelerator/__init__.py", line 235, in synchronize 2026-02-21T09:51:48.3776848Z torch._C._accelerator_synchronizeDevice(device_index) 2026-02-21T09:51:48.3777150Z torch.AcceleratorError: CUDA error: misaligned address 2026-02-21T09:51:48.3777909Z Search for `cudaErrorMisalignedAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. 2026-02-21T09:51:48.3778519Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. 2026-02-21T09:51:48.3778972Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1 2026-02-21T09:51:48.3779315Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
2026-02-21T09:51:48.3779515Z
2026-02-21T09:51:48.3779756Z WARNING:tritonbench.utils.triton_op:Failing input: --input-id 24 --num-inputs 1 --input-sample-mode first-k
2026-02-21T09:51:48.3780270Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmp3tl88rvh.csv
2026-02-21T09:51:49.3946554Z x_val                    preprocessed_torch_compile_int4_gemm-speedup    preprocessed_torch_compile_int4_gemm-accuracy    preprocessed_triton_int4_gemm-speedup    preprocessed_triton_int4_gemm-accuracy    helion_int4_gemm_tritonbench-speedup    helion_int4_gemm_tritonbench-accuracy
2026-02-21T09:51:49.3947871Z ---------------------  ----------------------------------------------  -----------------------------------------------  ---------------------------------------  ----------------------------------------  --------------------------------------  ---------------------------------------
2026-02-21T09:51:49.3948677Z (1, 1, 1280, 8192)                                            8.43464                                                1                                  0.77063                                         1                                   8.441                                        1
2026-02-21T09:51:49.3949349Z (1, 1, 8192, 3584)                                            8.32296                                                1                                  4.01927                                         1                                 14.7849                                        1
2026-02-21T09:51:49.3949992Z (4, 1, 8192, 3584)                                            10.1842                                                1                                  4.25233                                         1                                 7.30309                                        1
2026-02-21T09:51:49.3950648Z (16, 1, 7168, 8192)                                           8.51634                                                1                                  3.72168                                         1                                 5.88809                                        1
2026-02-21T09:51:49.3951309Z (64, 1, 7168, 8192)                                           8.13663                                                1                                  2.61398                                         1                                 4.39176                                        1
2026-02-21T09:51:49.3952451Z (1, 4096, 8192, 1024)                                         2.16307                                                1                                  0.11966                                         1                                0.505307                                        1
2026-02-21T09:51:49.3957335Z (4, 4096, 8192, 1024)                                         1.31568                                                1                                0.0761045                                         1                                0.279168                                        1
2026-02-21T09:51:49.3962424Z average                                                       6.72479                                                1                                  2.22481                                         1                                  5.9419                                        1
2026-02-21T09:51:52.4605030Z ✅ Completed benchmark for kernel: int4_gemm
2026-02-21T09:51:52.4614436Z [
2026-02-21T09:51:52.4616532Z   {
2026-02-21T09:51:52.4616832Z     "benchmark": {
2026-02-21T09:51:52.4619427Z       "name": "Helion Benchmark",
2026-02-21T09:51:52.4619800Z       "extra_info": {
2026-02-21T09:51:52.4625051Z         "device": "NVIDIA B200"
2026-02-21T09:51:52.4629373Z       }
2026-02-21T09:51:52.4631702Z     },
2026-02-21T09:51:52.4632004Z     "model": {
2026-02-21T09:51:52.4637420Z       "name": "int4_gemm"
2026-02-21T09:51:52.4639131Z }, 2026-02-21T09:51:52.4639376Z "metric": { 2026-02-21T09:51:52.4639636Z "name": "torch_compile_speedup", 2026-02-21T09:51:52.4644260Z "benchmark_values": [ 2026-02-21T09:51:52.4646536Z 8.434643079847929, 2026-02-21T09:51:52.4646810Z 8.322957306075319, 2026-02-21T09:51:52.4647031Z 10.184205867178628, 2026-02-21T09:51:52.4647266Z 8.516340227715336, 2026-02-21T09:51:52.4647514Z 8.136625139866736, 2026-02-21T09:51:52.4647733Z 2.163072258190891, 2026-02-21T09:51:52.4651143Z 1.315677929682589 2026-02-21T09:51:52.4651469Z ] 2026-02-21T09:51:52.4651723Z }, 2026-02-21T09:51:52.4651938Z "shape": [ 2026-02-21T09:51:52.4652127Z "(1, 1, 1280, 8192)", 2026-02-21T09:51:52.4652358Z "(1, 1, 8192, 3584)", 2026-02-21T09:51:52.4652550Z "(4, 1, 8192, 3584)", 2026-02-21T09:51:52.4652768Z "(16, 1, 7168, 8192)", 2026-02-21T09:51:52.4652966Z "(64, 1, 7168, 8192)", 2026-02-21T09:51:52.4653187Z "(1, 4096, 8192, 1024)", 2026-02-21T09:51:52.4653385Z "(4, 4096, 8192, 1024)" 2026-02-21T09:51:52.4653605Z ] 2026-02-21T09:51:52.4653764Z }, 2026-02-21T09:51:52.4653947Z { 2026-02-21T09:51:52.4654119Z "benchmark": { 2026-02-21T09:51:52.4654346Z "name": "Helion Benchmark", 2026-02-21T09:51:52.4654594Z "extra_info": { 2026-02-21T09:51:52.4654786Z "device": "NVIDIA B200" 2026-02-21T09:51:52.4655011Z } 2026-02-21T09:51:52.4655174Z }, 2026-02-21T09:51:52.4655372Z "model": { 2026-02-21T09:51:52.4655558Z "name": "int4_gemm" 2026-02-21T09:51:52.4655770Z }, 2026-02-21T09:51:52.4655931Z "metric": { 2026-02-21T09:51:52.4656160Z "name": "torch_compile_accuracy", 2026-02-21T09:51:52.4656387Z "benchmark_values": [ 2026-02-21T09:51:52.4656621Z 1.0, 2026-02-21T09:51:52.4656782Z 1.0, 2026-02-21T09:51:52.4656968Z 1.0, 2026-02-21T09:51:52.4657151Z 1.0, 2026-02-21T09:51:52.4657314Z 1.0, 2026-02-21T09:51:52.4657512Z 1.0, 2026-02-21T09:51:52.4657676Z 1.0 2026-02-21T09:51:52.4657867Z ] 2026-02-21T09:51:52.4658028Z }, 2026-02-21T09:51:52.4658213Z "shape": [ 2026-02-21T09:51:52.4658391Z "(1, 1, 
1280, 8192)", 2026-02-21T09:51:52.4658663Z "(1, 1, 8192, 3584)", 2026-02-21T09:51:52.4659109Z "(4, 1, 8192, 3584)", 2026-02-21T09:51:52.4659301Z "(16, 1, 7168, 8192)", 2026-02-21T09:51:52.4659578Z "(64, 1, 7168, 8192)", 2026-02-21T09:51:52.4659780Z "(1, 4096, 8192, 1024)", 2026-02-21T09:51:52.4660061Z "(4, 4096, 8192, 1024)" 2026-02-21T09:51:52.4660255Z ] 2026-02-21T09:51:52.4660449Z }, 2026-02-21T09:51:52.4660609Z { 2026-02-21T09:51:52.4660814Z "benchmark": { 2026-02-21T09:51:52.4661036Z "name": "Helion Benchmark", 2026-02-21T09:51:52.4661252Z "extra_info": { 2026-02-21T09:51:52.4661477Z "device": "NVIDIA B200" 2026-02-21T09:51:52.4661731Z } 2026-02-21T09:51:52.4661922Z }, 2026-02-21T09:51:52.4662087Z "model": { 2026-02-21T09:51:52.4662296Z "name": "int4_gemm" 2026-02-21T09:51:52.4662502Z }, 2026-02-21T09:51:52.4662695Z "metric": { 2026-02-21T09:51:52.4662882Z "name": "triton_speedup", 2026-02-21T09:51:52.4663122Z "benchmark_values": [ 2026-02-21T09:51:52.4663322Z 0.7706297438415864, 2026-02-21T09:51:52.4663537Z 4.019273046971646, 2026-02-21T09:51:52.4663769Z 4.2523330839098765, 2026-02-21T09:51:52.4663964Z 3.7216777732154296, 2026-02-21T09:51:52.4664185Z 2.6139823225839285, 2026-02-21T09:51:52.4664381Z 0.11966007450752324, 2026-02-21T09:51:52.4664614Z 0.07610451036552543 2026-02-21T09:51:52.4664801Z ] 2026-02-21T09:51:52.4665035Z }, 2026-02-21T09:51:52.4665207Z "shape": [ 2026-02-21T09:51:52.4665450Z "(1, 1, 1280, 8192)", 2026-02-21T09:51:52.4665647Z "(1, 1, 8192, 3584)", 2026-02-21T09:51:52.4665871Z "(4, 1, 8192, 3584)", 2026-02-21T09:51:52.4666089Z "(16, 1, 7168, 8192)", 2026-02-21T09:51:52.4666284Z "(64, 1, 7168, 8192)", 2026-02-21T09:51:52.4666518Z "(1, 4096, 8192, 1024)", 2026-02-21T09:51:52.4666724Z "(4, 4096, 8192, 1024)" 2026-02-21T09:51:52.4666952Z ] 2026-02-21T09:51:52.4667117Z }, 2026-02-21T09:51:52.4667306Z { 2026-02-21T09:51:52.4667475Z "benchmark": { 2026-02-21T09:51:52.4667703Z "name": "Helion Benchmark", 2026-02-21T09:51:52.4667922Z "extra_info": 
{ 2026-02-21T09:51:52.4668150Z "device": "NVIDIA B200" 2026-02-21T09:51:52.4668351Z } 2026-02-21T09:51:52.4668546Z }, 2026-02-21T09:51:52.4668740Z "model": { 2026-02-21T09:51:52.4668924Z "name": "int4_gemm" 2026-02-21T09:51:52.4669144Z }, 2026-02-21T09:51:52.4669312Z "metric": { 2026-02-21T09:51:52.4669530Z "name": "triton_accuracy", 2026-02-21T09:51:52.4669748Z "benchmark_values": [ 2026-02-21T09:51:52.4669976Z 1.0, 2026-02-21T09:51:52.4670148Z 1.0, 2026-02-21T09:51:52.4670351Z 1.0, 2026-02-21T09:51:52.4670520Z 1.0, 2026-02-21T09:51:52.4670715Z 1.0, 2026-02-21T09:51:52.4670885Z 1.0, 2026-02-21T09:51:52.4671082Z 1.0 2026-02-21T09:51:52.4671281Z ] 2026-02-21T09:51:52.4671446Z }, 2026-02-21T09:51:52.4671690Z "shape": [ 2026-02-21T09:51:52.4671863Z "(1, 1, 1280, 8192)", 2026-02-21T09:51:52.4672083Z "(1, 1, 8192, 3584)", 2026-02-21T09:51:52.4672271Z "(4, 1, 8192, 3584)", 2026-02-21T09:51:52.4672490Z "(16, 1, 7168, 8192)", 2026-02-21T09:51:52.4672681Z "(64, 1, 7168, 8192)", 2026-02-21T09:51:52.4672904Z "(1, 4096, 8192, 1024)", 2026-02-21T09:51:52.4673094Z "(4, 4096, 8192, 1024)" 2026-02-21T09:51:52.4673315Z ] 2026-02-21T09:51:52.4673473Z }, 2026-02-21T09:51:52.4673653Z { 2026-02-21T09:51:52.4673840Z "benchmark": { 2026-02-21T09:51:52.4674027Z "name": "Helion Benchmark", 2026-02-21T09:51:52.4674258Z "extra_info": { 2026-02-21T09:51:52.4674450Z "device": "NVIDIA B200" 2026-02-21T09:51:52.4674673Z } 2026-02-21T09:51:52.4674832Z }, 2026-02-21T09:51:52.4675019Z "model": { 2026-02-21T09:51:52.4675194Z "name": "int4_gemm" 2026-02-21T09:51:52.4675407Z }, 2026-02-21T09:51:52.4675570Z "metric": { 2026-02-21T09:51:52.4675784Z "name": "helion_speedup", 2026-02-21T09:51:52.4676012Z "benchmark_values": [ 2026-02-21T09:51:52.4676254Z 8.440995810200166, 2026-02-21T09:51:52.4676496Z 14.784918172550455, 2026-02-21T09:51:52.4676691Z 7.303089175162985, 2026-02-21T09:51:52.4676911Z 5.888088706919985, 2026-02-21T09:51:52.4677137Z 4.3917612299750175, 2026-02-21T09:51:52.4677351Z 
0.5053071078736215, 2026-02-21T09:51:52.4677535Z 0.27916795971724073 2026-02-21T09:51:52.4677743Z ] 2026-02-21T09:51:52.4677901Z }, 2026-02-21T09:51:52.4678092Z "shape": [ 2026-02-21T09:51:52.4678263Z "(1, 1, 1280, 8192)", 2026-02-21T09:51:52.4678475Z "(1, 1, 8192, 3584)", 2026-02-21T09:51:52.4678689Z "(4, 1, 8192, 3584)", 2026-02-21T09:51:52.4678877Z "(16, 1, 7168, 8192)", 2026-02-21T09:51:52.4679096Z "(64, 1, 7168, 8192)", 2026-02-21T09:51:52.4679288Z "(1, 4096, 8192, 1024)", 2026-02-21T09:51:52.4679513Z "(4, 4096, 8192, 1024)" 2026-02-21T09:51:52.4679704Z ] 2026-02-21T09:51:52.4679887Z }, 2026-02-21T09:51:52.4680041Z { 2026-02-21T09:51:52.4680239Z "benchmark": { 2026-02-21T09:51:52.4680427Z "name": "Helion Benchmark", 2026-02-21T09:51:52.4680666Z "extra_info": { 2026-02-21T09:51:52.4680885Z "device": "NVIDIA B200" 2026-02-21T09:51:52.4681086Z } 2026-02-21T09:51:52.4681277Z }, 2026-02-21T09:51:52.4681440Z "model": { 2026-02-21T09:51:52.4681725Z "name": "int4_gemm" 2026-02-21T09:51:52.4681916Z }, 2026-02-21T09:51:52.4682106Z "metric": { 2026-02-21T09:51:52.4682321Z "name": "helion_accuracy", 2026-02-21T09:51:52.4682563Z "benchmark_values": [ 2026-02-21T09:51:52.4682755Z 1.0, 2026-02-21T09:51:52.4682951Z 1.0, 2026-02-21T09:51:52.4683115Z 1.0, 2026-02-21T09:51:52.4683309Z 1.0, 2026-02-21T09:51:52.4683498Z 1.0, 2026-02-21T09:51:52.4683657Z 1.0, 2026-02-21T09:51:52.4683844Z 1.0 2026-02-21T09:51:52.4684006Z ] 2026-02-21T09:51:52.4684192Z }, 2026-02-21T09:51:52.4684354Z "shape": [ 2026-02-21T09:51:52.4684555Z "(1, 1, 1280, 8192)", 2026-02-21T09:51:52.4684743Z "(1, 1, 8192, 3584)", 2026-02-21T09:51:52.4684963Z "(4, 1, 8192, 3584)", 2026-02-21T09:51:52.4685151Z "(16, 1, 7168, 8192)", 2026-02-21T09:51:52.4685374Z "(64, 1, 7168, 8192)", 2026-02-21T09:51:52.4685564Z "(1, 4096, 8192, 1024)", 2026-02-21T09:51:52.4685778Z "(4, 4096, 8192, 1024)" 2026-02-21T09:51:52.4685989Z ] 2026-02-21T09:51:52.4686142Z } 2026-02-21T09:51:52.4686343Z ] 2026-02-21T09:51:52.4753420Z ##[group]Run 
pytorch/test-infra/.github/actions/gather-benchmark-metadata@main 2026-02-21T09:51:52.4753766Z with: 2026-02-21T09:51:52.4754231Z github-token: *** 2026-02-21T09:51:52.4754444Z venv: .venv/bin/activate 2026-02-21T09:51:52.4754694Z schema-version: v3 2026-02-21T09:51:52.4754898Z env: 2026-02-21T09:51:52.4755122Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:52.4755403Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.4755712Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:52.4756051Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.4756328Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.4756625Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.4757057Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:52.4757607Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:52.4757874Z ##[endgroup] 2026-02-21T09:51:52.4816131Z ##[group]Run set -eux 2026-02-21T09:51:52.4816367Z set -eux 2026-02-21T09:51:52.4816583Z  2026-02-21T09:51:52.4816816Z if [[ -z "${GITHUB_TOKEN}" ]]; then 2026-02-21T09:51:52.4817090Z  echo "Missing github-token input" 2026-02-21T09:51:52.4817360Z  exit 1 2026-02-21T09:51:52.4817548Z fi 2026-02-21T09:51:52.4818690Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T09:51:52.4819035Z env: 2026-02-21T09:51:52.4819254Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:52.4819508Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.4819831Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:52.4820192Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.4820454Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.4820741Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.4821149Z LD_LIBRARY_PATH: 
/__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:52.4821654Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:52.4822078Z GITHUB_TOKEN: *** 2026-02-21T09:51:52.4822266Z ##[endgroup] 2026-02-21T09:51:52.5618119Z + [[ -z *** ]] 2026-02-21T09:51:52.5679944Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main 2026-02-21T09:51:52.5680297Z with: 2026-02-21T09:51:52.5680663Z github-token: *** 2026-02-21T09:51:52.5680916Z env: 2026-02-21T09:51:52.5681125Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:52.5681444Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.5681830Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:52.5682197Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.5682495Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.5682852Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.5683322Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:52.5683793Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:52.5684106Z ##[endgroup] 2026-02-21T09:51:52.5695633Z ##[group]Run set -eux 2026-02-21T09:51:52.5695874Z set -eux 2026-02-21T09:51:52.5696110Z  2026-02-21T09:51:52.5696526Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2026-02-21T09:51:52.5697057Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T09:51:52.5697366Z env: 2026-02-21T09:51:52.5697576Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:52.5697891Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.5698220Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:52.5698725Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.5699054Z Python2_ROOT_DIR: 
/__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.5699351Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:52.5699838Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:52.5700318Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:52.5700764Z GITHUB_TOKEN: *** 2026-02-21T09:51:52.5701004Z ##[endgroup] 2026-02-21T09:51:52.6093783Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 22253280836 dgxb200-03-1008 2026-02-21T09:51:54.0943276Z setting job-id=64380329747 2026-02-21T09:51:54.0945582Z setting job-name=run-b200 (int4_gemm) / benchmark-cu130-int4_gemm-py3.12-b200 2026-02-21T09:51:54.1090604Z ##[group]Run set -eux 2026-02-21T09:51:54.1090863Z set -eux 2026-02-21T09:51:54.1091052Z  2026-02-21T09:51:54.1091301Z if [[ -n ".venv/bin/activate" ]]; then 2026-02-21T09:51:54.1091642Z  source ".venv/bin/activate" 2026-02-21T09:51:54.1091873Z fi 2026-02-21T09:51:54.1092079Z  2026-02-21T09:51:54.1092373Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \ 2026-02-21T09:51:54.1092771Z  --schema-version "${SCHEMA_VERSION}" \ 2026-02-21T09:51:54.1093038Z  --repo "${REPO}" \ 2026-02-21T09:51:54.1093304Z  --head-branch "${HEAD_BRANCH}" \ 2026-02-21T09:51:54.1093654Z  --head-sha "${HEAD_SHA}" \ 2026-02-21T09:51:54.1093915Z  --workflow-id "${WORKFLOW_RUN_ID}" \ 2026-02-21T09:51:54.1094217Z  --run-attempt "${RUN_ATTEMPT}" \ 2026-02-21T09:51:54.1094506Z  --job-id "${JOB_ID}" \ 2026-02-21T09:51:54.1094763Z  --job-name "${JOB_NAME}" 2026-02-21T09:51:54.1095070Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T09:51:54.1095334Z env: 2026-02-21T09:51:54.1095549Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:54.1095792Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.1096113Z PKG_CONFIG_PATH: 
/__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:54.1096396Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.1096679Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.1096928Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.1097350Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:54.1097798Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:54.1098059Z SCHEMA_VERSION: v3 2026-02-21T09:51:54.1098280Z REPO: pytorch/helion 2026-02-21T09:51:54.1098479Z HEAD_BRANCH: refs/heads/main 2026-02-21T09:51:54.1098745Z HEAD_SHA: 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T09:51:54.1098990Z WORKFLOW_RUN_ID: 22253280836 2026-02-21T09:51:54.1099227Z RUN_ATTEMPT: 1 2026-02-21T09:51:54.1099413Z JOB_ID: 64380329747 2026-02-21T09:51:54.1099705Z JOB_NAME: run-b200 (int4_gemm) / benchmark-cu130-int4_gemm-py3.12-b200 2026-02-21T09:51:54.1100016Z ##[endgroup] 2026-02-21T09:51:54.1925050Z + [[ -n .venv/bin/activate ]] 2026-02-21T09:51:54.1925460Z + source .venv/bin/activate 2026-02-21T09:51:54.1925740Z ++ '[' -z '' ']' 2026-02-21T09:51:54.1925948Z ++ '[' -n x ']' 2026-02-21T09:51:54.1926217Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T09:51:54.1926544Z ++ '[' .venv/bin/activate = /__w/_temp/1977a82a-f8fe-401e-a95f-293e53772ba5.sh ']' 2026-02-21T09:51:54.1926917Z ++ deactivate nondestructive 2026-02-21T09:51:54.1927126Z ++ unset -f pydoc 2026-02-21T09:51:54.1927341Z ++ '[' -z '' ']' 2026-02-21T09:51:54.1927549Z ++ '[' -z '' ']' 2026-02-21T09:51:54.1927727Z ++ hash -r 2026-02-21T09:51:54.1927921Z ++ '[' -z '' ']' 2026-02-21T09:51:54.1928102Z ++ unset VIRTUAL_ENV 2026-02-21T09:51:54.1928333Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T09:51:54.1928833Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T09:51:54.1929123Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T09:51:54.1930043Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T09:51:54.1930289Z ++ '[' linux-gnu = msys ']' 2026-02-21T09:51:54.1930505Z ++ export VIRTUAL_ENV 2026-02-21T09:51:54.1930731Z ++ '[' -z '' ']' 2026-02-21T09:51:54.1930946Z ++ unset SCRIPT_PATH 2026-02-21T09:51:54.1931702Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T09:51:54.1932987Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T09:51:54.1933736Z ++ export PATH 2026-02-21T09:51:54.1933934Z ++ '[' xhelion '!=' x ']' 2026-02-21T09:51:54.1934185Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T09:51:54.1934410Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T09:51:54.1934644Z ++ '[' -z '' ']' 2026-02-21T09:51:54.1934825Z ++ '[' -z '' ']' 2026-02-21T09:51:54.1935035Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T09:51:54.1935229Z ++ PS1='(helion) ' 2026-02-21T09:51:54.1935446Z ++ export PS1 2026-02-21T09:51:54.1935628Z ++ alias pydoc 2026-02-21T09:51:54.1935838Z ++ true 2026-02-21T09:51:54.1936036Z ++ hash -r 2026-02-21T09:51:54.1937207Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-benchmark-metadata/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/helion --head-branch refs/heads/main --head-sha 874a7d0cadab18218a84ad3579d329dc95c51820 --workflow-id 22253280836 --run-attempt 1 --job-id 64380329747 --job-name 'run-b200 (int4_gemm) / benchmark-cu130-int4_gemm-py3.12-b200' 
2026-02-21T09:51:54.2288041Z ##[group]Run pytorch/test-infra/.github/actions/gather-runners-info@main 2026-02-21T09:51:54.2288381Z with: 2026-02-21T09:51:54.2288571Z venv: .venv/bin/activate 2026-02-21T09:51:54.2288810Z env: 2026-02-21T09:51:54.2288994Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:54.2289283Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.2289577Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:54.2289890Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.2290176Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.2290438Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.2290881Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:54.2291325Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:54.2291648Z ##[endgroup] 2026-02-21T09:51:54.2302984Z ##[group]Run set -eux 2026-02-21T09:51:54.2303212Z set -eux 2026-02-21T09:51:54.2303422Z  2026-02-21T09:51:54.2303614Z if command -v nvidia-smi; then 2026-02-21T09:51:54.2303884Z  DEVICE_NAME=cuda 2026-02-21T09:51:54.2304128Z  nvidia-smi 2026-02-21T09:51:54.2304340Z elif command -v rocm-smi; then 2026-02-21T09:51:54.2304598Z  DEVICE_NAME=rocm 2026-02-21T09:51:54.2304798Z  rocm-smi 2026-02-21T09:51:54.2305028Z elif command -v hl-smi; then 2026-02-21T09:51:54.2305258Z  DEVICE_NAME=hpu 2026-02-21T09:51:54.2305491Z  hl-smi 2026-02-21T09:51:54.2305675Z else 2026-02-21T09:51:54.2305890Z  arch=$(uname -m) 2026-02-21T09:51:54.2306117Z  2026-02-21T09:51:54.2306293Z  case "$arch" in 2026-02-21T09:51:54.2306533Z  aarch64|arm64) 2026-02-21T09:51:54.2306750Z  DEVICE_NAME=arm64-cpu 2026-02-21T09:51:54.2306999Z  ;; 2026-02-21T09:51:54.2307177Z  *) 2026-02-21T09:51:54.2307395Z  DEVICE_NAME=cpu 2026-02-21T09:51:54.2307598Z  ;; 2026-02-21T09:51:54.2307803Z  esac 2026-02-21T09:51:54.2307981Z  lscpu 
2026-02-21T09:51:54.2308182Z fi 2026-02-21T09:51:54.2308423Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV 2026-02-21T09:51:54.2308761Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T09:51:54.2309039Z env: 2026-02-21T09:51:54.2309221Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:54.2309494Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.2309789Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:54.2310103Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.2310385Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.2310642Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.2311078Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:54.2311492Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:54.2311811Z ##[endgroup] 2026-02-21T09:51:54.3515273Z + command -v nvidia-smi 2026-02-21T09:51:54.3515527Z /usr/bin/nvidia-smi 2026-02-21T09:51:54.3515836Z + DEVICE_NAME=cuda 2026-02-21T09:51:54.3516058Z + nvidia-smi 2026-02-21T09:51:54.3671388Z Sat Feb 21 09:51:54 2026 2026-02-21T09:51:54.3671857Z +-----------------------------------------------------------------------------------------+ 2026-02-21T09:51:54.3672352Z | NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 | 2026-02-21T09:51:54.3672955Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T09:51:54.3673376Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2026-02-21T09:51:54.3673953Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2026-02-21T09:51:54.3674291Z | | | MIG M. 
| 2026-02-21T09:51:54.3674612Z |=========================================+========================+======================| 2026-02-21T09:51:54.3757676Z | 0 NVIDIA B200 Off | 00000000:C3:00.0 Off | 0 | 2026-02-21T09:51:54.3758093Z | N/A 32C P0 173W / 750W | 0MiB / 183359MiB | 0% Default | 2026-02-21T09:51:54.3758451Z | | | Disabled | 2026-02-21T09:51:54.3758825Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T09:51:54.3759045Z 2026-02-21T09:51:54.3759199Z +-----------------------------------------------------------------------------------------+ 2026-02-21T09:51:54.3759566Z | Processes: | 2026-02-21T09:51:54.3759903Z | GPU GI CI PID Type Process name GPU Memory | 2026-02-21T09:51:54.3760251Z | ID ID Usage | 2026-02-21T09:51:54.3760567Z |=========================================================================================| 2026-02-21T09:51:54.3760887Z | No running processes found | 2026-02-21T09:51:54.3761272Z +-----------------------------------------------------------------------------------------+ 2026-02-21T09:51:54.4162541Z + echo DEVICE_NAME=cuda 2026-02-21T09:51:54.4198593Z ##[group]Run set -eux 2026-02-21T09:51:54.4198862Z set -eux 2026-02-21T09:51:54.4199038Z  2026-02-21T09:51:54.4199276Z if [[ "${DEVICE_NAME}" == "cuda" ]]; then 2026-02-21T09:51:54.4199573Z  # Return the same device name as PyTorch 2026-02-21T09:51:54.4199911Z  DEVICE_TYPE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader) 2026-02-21T09:51:54.4200273Z elif [[ "${DEVICE_NAME}" == "rocm" ]]; then 2026-02-21T09:51:54.4200629Z  DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs) 2026-02-21T09:51:54.4201016Z elif [[ "${DEVICE_NAME}" == "hpu" ]]; then 2026-02-21T09:51:54.4201443Z  DEVICE_TYPE="Intel Gaudi3 "$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//') 2026-02-21T09:51:54.4201871Z elif [[ "${DEVICE_NAME}" == "cpu" ]]; then 
2026-02-21T09:51:54.4202614Z  DEVICE_TYPE="$(lscpu | grep "Model name" | sed -E 's/.*Model name:[[:space:]]*//; s/Intel\(R\)//g; s/\(R\)//g; s/\(TM\)//g; s/CPU//g; s/Processor//g; s/[[:space:]]+/ /g; s/^ //; s/ $//; s/ /_/g')_$(awk -F: '/Core\(s\) per socket/ {c=$2} /Socket\(s\)/ {s=$2} END {gsub(/ /,"",c); gsub(/ /,"",s); printf "%sc", c*s}' < <(lscpu))" 2026-02-21T09:51:54.4203355Z elif [[ "${DEVICE_NAME}" == "arm64-cpu" ]]; then 2026-02-21T09:51:54.4203713Z  DEVICE_TYPE=$(lscpu | grep 'Vendor ID' | cut -f 2 -d ":" | awk '{$1=$1}1' | cut -f 2 -d " ") 2026-02-21T09:51:54.4204069Z fi 2026-02-21T09:51:54.4204279Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV 2026-02-21T09:51:54.4204640Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T09:51:54.4204901Z env: 2026-02-21T09:51:54.4205085Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:54.4205359Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.4205646Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:54.4206010Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.4206268Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.4206586Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.4207079Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:54.4207507Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:54.4207794Z DEVICE_NAME: cuda 2026-02-21T09:51:54.4207981Z ##[endgroup] 2026-02-21T09:51:54.5809437Z + [[ cuda == \c\u\d\a ]] 2026-02-21T09:51:54.5809774Z ++ nvidia-smi -i 0 --query-gpu=name --format=csv,noheader 2026-02-21T09:51:54.5982038Z + DEVICE_TYPE='NVIDIA B200' 2026-02-21T09:51:54.5982382Z + echo 'DEVICE_TYPE=NVIDIA B200' 2026-02-21T09:51:54.6024019Z ##[group]Run set -eux 2026-02-21T09:51:54.6024262Z set -eux 2026-02-21T09:51:54.6024450Z  2026-02-21T09:51:54.6024680Z if [[ -n ".venv/bin/activate" 
]]; then 2026-02-21T09:51:54.6024927Z  source ".venv/bin/activate" 2026-02-21T09:51:54.6025173Z fi 2026-02-21T09:51:54.6025341Z  2026-02-21T09:51:54.6025605Z python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T09:51:54.6026012Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py" 2026-02-21T09:51:54.6026446Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T09:51:54.6026720Z env: 2026-02-21T09:51:54.6026904Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:54.6027174Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.6027461Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:54.6027772Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.6028056Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.6028314Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:54.6028738Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:54.6029161Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:54.6029447Z DEVICE_NAME: cuda 2026-02-21T09:51:54.6029638Z DEVICE_TYPE: NVIDIA B200 2026-02-21T09:51:54.6029864Z ##[endgroup] 2026-02-21T09:51:54.6595333Z + [[ -n .venv/bin/activate ]] 2026-02-21T09:51:54.6595661Z + source .venv/bin/activate 2026-02-21T09:51:54.6595888Z ++ '[' -z '' ']' 2026-02-21T09:51:54.6596110Z ++ '[' -n x ']' 2026-02-21T09:51:54.6596344Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T09:51:54.6596663Z ++ '[' .venv/bin/activate = /__w/_temp/10568874-09be-4781-a130-1a7986c670e3.sh ']' 2026-02-21T09:51:54.6597031Z ++ deactivate nondestructive 2026-02-21T09:51:54.6597247Z ++ unset -f pydoc 2026-02-21T09:51:54.6597461Z ++ '[' -z '' ']' 2026-02-21T09:51:54.6597638Z ++ '[' -z '' ']' 2026-02-21T09:51:54.6597841Z ++ hash -r 2026-02-21T09:51:54.6598007Z ++ '[' -z '' ']' 2026-02-21T09:51:54.6598237Z ++ unset VIRTUAL_ENV 
2026-02-21T09:51:54.6598435Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T09:51:54.6598690Z ++ '[' '!' nondestructive = nondestructive ']' 2026-02-21T09:51:54.6598972Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T09:51:54.6599203Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T09:51:54.6599441Z ++ '[' linux-gnu = msys ']' 2026-02-21T09:51:54.6599661Z ++ export VIRTUAL_ENV 2026-02-21T09:51:54.6599873Z ++ '[' -z '' ']' 2026-02-21T09:51:54.6600053Z ++ unset SCRIPT_PATH 2026-02-21T09:51:54.6600743Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T09:51:54.6601958Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T09:51:54.6602848Z ++ export PATH 2026-02-21T09:51:54.6603066Z ++ '[' xhelion '!=' x ']' 2026-02-21T09:51:54.6603311Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T09:51:54.6603551Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T09:51:54.6603774Z ++ '[' -z '' ']' 2026-02-21T09:51:54.6604023Z ++ '[' -z '' ']' 2026-02-21T09:51:54.6604225Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T09:51:54.6604415Z ++ PS1='(helion) ' 2026-02-21T09:51:54.6604633Z ++ export PS1 2026-02-21T09:51:54.6604814Z ++ alias pydoc 2026-02-21T09:51:54.6605025Z ++ true 2026-02-21T09:51:54.6605191Z ++ hash -r 2026-02-21T09:51:54.6605453Z + python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T09:51:55.3203096Z Collecting psutil==7.0.0 2026-02-21T09:51:55.3835066Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB) 
2026-02-21T09:51:55.4056582Z Collecting nvidia-ml-py==13.580.82 2026-02-21T09:51:55.4127051Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl.metadata (9.6 kB) 2026-02-21T09:51:55.4235154Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB) 2026-02-21T09:51:55.4490308Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl (49 kB) 2026-02-21T09:51:55.5317586Z Installing collected packages: nvidia-ml-py, psutil 2026-02-21T09:51:55.5321516Z Attempting uninstall: nvidia-ml-py 2026-02-21T09:51:55.5343739Z Found existing installation: nvidia-ml-py 13.590.48 2026-02-21T09:51:55.5356142Z Uninstalling nvidia-ml-py-13.590.48: 2026-02-21T09:51:55.6010398Z Successfully uninstalled nvidia-ml-py-13.590.48 2026-02-21T09:51:55.6449718Z Attempting uninstall: psutil 2026-02-21T09:51:55.6479096Z Found existing installation: psutil 7.2.2 2026-02-21T09:51:55.6493915Z Uninstalling psutil-7.2.2: 2026-02-21T09:51:55.6497843Z Successfully uninstalled psutil-7.2.2 2026-02-21T09:51:55.7615918Z 2026-02-21T09:51:55.7651031Z Successfully installed nvidia-ml-py-13.580.82 psutil-7.0.0 2026-02-21T09:51:55.8918273Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-runners-info/../../scripts/benchmarks/gather_runners_info.py 2026-02-21T09:51:57.6981291Z ##[group]Run pytorch/test-infra/.github/actions/gather-dependencies@main 2026-02-21T09:51:57.6981674Z with: 2026-02-21T09:51:57.6981888Z venv: .venv/bin/activate 2026-02-21T09:51:57.6982089Z env: 2026-02-21T09:51:57.6982303Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:57.6982554Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.6982875Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:57.6983164Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.6983457Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.6983721Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 
2026-02-21T09:51:57.6984168Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:57.6984625Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:57.6984894Z DEVICE_NAME: cuda 2026-02-21T09:51:57.6985182Z DEVICE_TYPE: NVIDIA B200 2026-02-21T09:51:57.6985380Z ##[endgroup] 2026-02-21T09:51:57.6995801Z ##[group]Run set -eux 2026-02-21T09:51:57.6996063Z set -eux 2026-02-21T09:51:57.6996249Z  2026-02-21T09:51:57.6996475Z # TODO (huydhn): Implement this part 2026-02-21T09:51:57.6996752Z echo "dependencies={}" >> "${GITHUB_OUTPUT}" 2026-02-21T09:51:57.6997113Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T09:51:57.6997354Z env: 2026-02-21T09:51:57.6997564Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:57.6997812Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.6998129Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:57.6998512Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.6998774Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.6999065Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.6999511Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:57.6999962Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:57.7000252Z DEVICE_NAME: cuda 2026-02-21T09:51:57.7000446Z DEVICE_TYPE: NVIDIA B200 2026-02-21T09:51:57.7000678Z ##[endgroup] 2026-02-21T09:51:57.7524052Z + echo 'dependencies={}' 2026-02-21T09:51:57.7575421Z ##[group]Run actions/upload-artifact@v6 2026-02-21T09:51:57.7575703Z with: 2026-02-21T09:51:57.7575934Z name: benchmark-results-b200-int4_gemm 2026-02-21T09:51:57.7576183Z path: test/test-reports 2026-02-21T09:51:57.7576425Z if-no-files-found: warn 2026-02-21T09:51:57.7576636Z compression-level: 6 2026-02-21T09:51:57.7576876Z 
overwrite: false 2026-02-21T09:51:57.7577076Z include-hidden-files: false 2026-02-21T09:51:57.7577313Z env: 2026-02-21T09:51:57.7577497Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T09:51:57.7577781Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.7578085Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T09:51:57.7578405Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.7578695Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.7578954Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T09:51:57.7579458Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T09:51:57.7579895Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T09:51:57.7580180Z DEVICE_NAME: cuda 2026-02-21T09:51:57.7580369Z DEVICE_TYPE: NVIDIA B200 2026-02-21T09:51:57.7580597Z ##[endgroup] 2026-02-21T09:51:57.7583591Z ##[command]/usr/bin/docker exec 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c sh -c "cat /etc/*release | grep ^ID" 2026-02-21T09:51:57.9856504Z With the provided path, there will be 1 file uploaded 2026-02-21T09:51:57.9858733Z Artifact name is valid! 2026-02-21T09:51:57.9862973Z Root directory input is valid! 2026-02-21T09:51:58.2301255Z Beginning upload of artifact content to blob storage 2026-02-21T09:51:58.5742146Z Uploaded bytes 675 2026-02-21T09:51:58.6634678Z Finished uploading artifact content to blob storage! 2026-02-21T09:51:58.6638827Z SHA256 digest of uploaded artifact zip is 052d5c8037ca6981a10d6ebb86846467be0a5d9dfa0b1be6ba8bab0ef6fc5c79 2026-02-21T09:51:58.6640716Z Finalizing artifact upload 2026-02-21T09:51:58.9363720Z Artifact benchmark-results-b200-int4_gemm.zip successfully finalized. Artifact ID 5600733086 2026-02-21T09:51:58.9367726Z Artifact benchmark-results-b200-int4_gemm has been successfully uploaded! Final size is 675 bytes. 
Artifact ID is 5600733086 2026-02-21T09:51:58.9368726Z Artifact download URL: https://github.com/pytorch/helion/actions/runs/22253280836/artifacts/5600733086 2026-02-21T09:51:58.9498756Z Post job cleanup. 2026-02-21T09:51:58.9503642Z ##[command]/usr/bin/docker exec 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c sh -c "cat /etc/*release | grep ^ID" 2026-02-21T09:51:59.1406675Z UV_PYTHON_INSTALL_DIR is already set to /github/home/.local/share/uv/python 2026-02-21T09:51:59.1408336Z (node:161869) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T09:51:59.1408905Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T09:51:59.1515492Z Post job cleanup. 2026-02-21T09:51:59.1518737Z ##[command]/usr/bin/docker exec 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c sh -c "cat /etc/*release | grep ^ID" 2026-02-21T09:51:59.3425966Z Post job cleanup. 2026-02-21T09:51:59.3430113Z ##[command]/usr/bin/docker exec 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c sh -c "cat /etc/*release | grep ^ID" 2026-02-21T09:51:59.4981393Z [command]/usr/bin/git version 2026-02-21T09:51:59.5009522Z git version 2.43.0 2026-02-21T09:51:59.5036695Z Temporarily overriding HOME='/__w/_temp/a06d6775-1041-4162-89cb-fdbb30c079f9' before making global git config changes 2026-02-21T09:51:59.5038667Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T09:51:59.5039164Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T09:51:59.5079090Z Removing SSH command configuration 2026-02-21T09:51:59.5083569Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T09:51:59.5103143Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' 
|| :" 2026-02-21T09:51:59.5337737Z Removing HTTP extra header 2026-02-21T09:51:59.5340288Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T09:51:59.5366694Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T09:51:59.5585663Z Removing includeIf entries pointing to credentials config files 2026-02-21T09:51:59.5590693Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T09:51:59.5607177Z includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T09:51:59.5607648Z includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T09:51:59.5608060Z includeif.gitdir:/github/workspace/.git.path 2026-02-21T09:51:59.5608386Z includeif.gitdir:/github/workspace/.git/worktrees/*.path 2026-02-21T09:51:59.5614542Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T09:51:59.5633126Z /__w/_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T09:51:59.5637791Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T09:51:59.5668017Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T09:51:59.5686984Z /__w/_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T09:51:59.5691973Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T09:51:59.5720635Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git.path 2026-02-21T09:51:59.5735870Z 
/github/runner_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T09:51:59.5741401Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T09:51:59.5770217Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git/worktrees/*.path 2026-02-21T09:51:59.5786470Z /github/runner_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T09:51:59.5790478Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config 2026-02-21T09:51:59.5824391Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T09:51:59.6053816Z Removing credentials config '/__w/_temp/git-credentials-509bd599-b18b-40b2-a4ad-c8996394d818.config' 2026-02-21T09:51:59.6167546Z Stop and remove container: 34d6440bddee4f2886b9a62f2ba8293b_nvidiacuda1301develubuntu2404_e0c1dd 2026-02-21T09:51:59.6171327Z ##[command]/usr/bin/docker rm --force 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c 2026-02-21T09:52:02.2650569Z 21dde89906eb49d3591f972ceb16e996f96f3f0eab153590d83e675e1629ab7c 2026-02-21T09:52:02.2683667Z Remove container network: github_network_4610037d786e4ad48d234196e1acb7ad 2026-02-21T09:52:02.2686468Z ##[command]/usr/bin/docker network rm github_network_4610037d786e4ad48d234196e1acb7ad 2026-02-21T09:52:02.3851022Z github_network_4610037d786e4ad48d234196e1acb7ad 2026-02-21T09:52:02.3907000Z Evaluate and set job outputs 2026-02-21T09:52:02.3912241Z Set output 'benchmark-metadata' 2026-02-21T09:52:02.3913750Z Set output 'runners-info' 2026-02-21T09:52:02.3914368Z Set output 'dependencies' 2026-02-21T09:52:02.3914849Z Cleaning up orphan processes